- User-Level Checkpointing
- Building User-Level Checkpointable Jobs
- Re-Linking User-Level Applications
- Troubleshooting User-Level Re-Linking
- Resolving Re-Linking Errors
- Re-Linking C++ Applications
User-Level Checkpointing
LSF provides a method, called user-level checkpointing, to checkpoint jobs on systems that do not support kernel-level checkpointing. To implement user-level checkpointing, you must have access to your application's object files (.o files), and they must be re-linked with a set of libraries provided by LSF. This approach is transparent to your application: its code does not have to be changed, and the application does not know that a checkpoint and restart has occurred.
By default, the checkpoint libraries are installed in LSF_LIBDIR, and echkpnt and erestart are installed in LSF_SERVERDIR.
Optionally, third-party checkpoint and restart implementations can be used with LSF. You must use the echkpnt and erestart supplied with the implementations. To avoid overwriting the echkpnt and erestart supplied by LSF, install any third-party implementations in a separate directory by defining LSB_ECHKPNT_METHOD and LSB_ECHKPNT_METHOD_DIR as environment variables or in lsf.conf.
The current implementation of the checkpoint library for user-level checkpointing has the following restrictions:
- The checkpointed process can only be restarted on hosts of the same architecture and with the same operating system as the host on which the checkpoint was created.
- Only single process jobs can be checkpointed.
- Processes with open pipes and sockets can be checkpointed but may not properly restart as the pipes and sockets are not re-opened on restart.
- If a process has stdin, stdout, or stderr as open pipes, all data in the pipes is lost on restart.
- The checkpointed process cannot be operating on a private stack when the checkpoint happens.
- The checkpointed process cannot use internal timers.
- The checkpointed program must be statically linked.
SIGHUP is used internally to implement checkpointing. Do not use this signal in programs to be checkpointed.
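The first restriction lends itself to a simple check before a restart is attempted on a different host. The following is a minimal sketch and not part of LSF: it assumes you save the output of uname -s and uname -m next to the checkpoint, and the function names platform_id and same_platform are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers, not supplied by LSF.

# Print "<OS name> <hardware type>" for the current host.
platform_id() {
    printf '%s %s\n' "$(uname -s)" "$(uname -m)"
}

# Succeed only if two saved platform strings are identical.
same_platform() {
    [ "$1" = "$2" ]
}
```

For example, if platform_id was saved to a file when the job was checkpointed, a restart wrapper could compare that saved string with the output of platform_id on the candidate host and refuse to continue when same_platform fails. The sketch compares only the operating system name and hardware type; it does not compare operating system releases.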
Building User-Level Checkpointable Jobs
Building a user-level checkpointable job involves re-linking your application object files (.o files) with the LSF checkpoint startup routine and library. LSF also provides a set of replacement linkers that call the standard linkers on your platform with the correct options to build a checkpointable application. LSF provides:
- libckpt.a, the checkpoint library
- ckpt_crt0.o, the checkpoint startup routine
- ckpt_ld, the checkpoint linker for C language applications
- ckpt_ld_f, the checkpoint linker for Fortran applications
Library
The checkpoint library replaces low-level system calls such as open(), close(), and dup(), and contains signal handlers and routines to internally implement checkpointing.
Startup routine
The startup routine replaces the language-level module that calls main(), sets the checkpoint signal handler, and initializes internal data structures used to record job information.
Linkers
The checkpoint linkers are used to re-link your application with the checkpoint library and startup routine. They are shell scripts that call the standard linkers on your operating system with the correct options. The scripts are designed to use the native compilers on most platforms. Use ckpt_ld for C language applications and ckpt_ld_f for Fortran applications. The following compilers are supported by the ckpt_ld replacement linker:
Re-Linking User-Level Applications
To re-link your application, you must have access to the object files (.o files) for your application. If you are using third-party applications, the vendor must supply you with the object files. If you are building your own applications, you must first compile them without linking. C++ applications must be modified as described in Re-Linking C++ Applications before re-linking.
C Language applications
To compile a C language application without linking, run the compiler with the -c option instead of the -o option. For example, to compile an object file for my_job:
% cc -c my_job.c
To re-link a C language object file, use the supplied LSF replacement linker ckpt_ld. For example, to re-link an object file for an application called my_job:
% ckpt_ld -o my_job my_job.o
If you get an error while re-linking, see Troubleshooting User-Level Re-Linking.
Fortran applications
To compile a Fortran application without linking, run the compiler with the -c option instead of the -o option. For example, to compile an object file for my_job:
% f77 -c my_job.f
To re-link a Fortran object file, use the supplied LSF replacement linker ckpt_ld_f. For example, to re-link an object file for an application called my_job:
% ckpt_ld_f -o my_job my_job.o
If you get an error while re-linking, see Troubleshooting User-Level Re-Linking.
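The compile and re-link steps for both languages follow the same pattern, which can be sketched as a small dry-run helper. This is an illustration under stated assumptions, not an LSF tool: relink_commands is a hypothetical name, cc and f77 are simply the compiler names used in the examples above, and the source file is assumed to be in the current directory. The function only prints the commands; it does not run them.

```shell
#!/bin/sh
# Dry-run sketch: print the compile and re-link commands for one source file.
# Substitute the compilers available on your platform for cc and f77.
relink_commands() {
    src=$1
    case $src in
        *.c) compiler=cc;  linker=ckpt_ld;   base=${src%.c} ;;
        *.f) compiler=f77; linker=ckpt_ld_f; base=${src%.f} ;;
        *)   echo "unsupported source file: $src" >&2; return 1 ;;
    esac
    echo "$compiler -c $src"            # compile without linking
    echo "$linker -o $base $base.o"     # re-link with the LSF replacement linker
}

relink_commands my_job.c
```

Reviewing the printed commands before running them makes it easy to confirm that the Fortran linker is used only for Fortran objects.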
Troubleshooting User-Level Re-Linking
If an error is reported when using ckpt_ld to link your application with the checkpoint libraries, follow the steps outlined in Resolving Re-Linking Errors to help isolate the problem. If you cannot resolve your errors, call Platform Customer Support.
The ckpt_ld replacement linker is designed for C language applications. If your application was created using C++, you need to modify your files as described in Re-Linking C++ Applications before re-linking.
What the replacement linkers do
The replacement linkers are shell scripts designed to use the standard compilers on your OS with the correct options to build a checkpointable executable. The linkers do the following:
- Include the startup routine by replacing the module that calls main() with ckpt_crt0.o
- Include the checkpoint library by adding libckpt.a
- Force as much static linking as possible
Resolving Re-Linking Errors
To resolve linking errors, you need to step through the linking process performed by the linker. To do this, perform the following procedures:
- View the linking script
- Include the startup library
- Include the checkpoint library
- Force static linking
View the linking script
View the low-level linking script by running your linker in verbose mode. This will display the libraries called by your linker. Use this information to help determine which files need to be replaced.
Refer to the man page supplied with your compiler to determine the verbose mode switch. The following table lists the verbose mode switch for some operating systems.
Operating System    Verbose Mode Switch
SUNOS/Solaris       -#
AIX                 -v
IRIX                -show -non_shared
HP-UX               -v
OSF1                -v -non_shared
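The table above can be expressed as a lookup keyed on the operating system name. The sketch below is not part of LSF; verbose_switch is a hypothetical name, and the keys assume the strings that uname -s commonly prints on these systems (for example SunOS for Solaris, and IRIX64 on 64-bit IRIX).

```shell
#!/bin/sh
# Map an operating system name, as printed by `uname -s`, to the verbose
# mode switch listed in the table. The key spellings are assumptions.
verbose_switch() {
    case $1 in
        SunOS)        echo "-#" ;;
        AIX|HP-UX)    echo "-v" ;;
        IRIX|IRIX64)  echo "-show -non_shared" ;;
        OSF1)         echo "-v -non_shared" ;;
        *)            echo "unknown operating system: $1" >&2; return 1 ;;
    esac
}
```

A typical use would be to place the result before the other options, for example cc $(verbose_switch "$(uname -s)") -o my_job my_job.o. Check the compiler man page if the switch is rejected, since the table lists only some operating systems.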
For example, running the Sparc C Compiler 3.0 with the verbose switch, -#, for my_job.o:
% cc -# -o my_job my_job.o
/usr/ccs/bin/ld /opt/SUNWspro/SC3.0/lib/crti.o /opt/SUNWspro/SC3.0/lib/crt1.o /opt/SUNWspro/SC3.0/lib/__fstd.o /opt/SUNWspro/SC3.0/lib/values-xt.o -o my_job my_job.o -Y P,/opt/SUNWspro/SC3.0/lib:/usr/ccs/lib:/usr/lib -Qy -lc /opt/SUNWspro/SC3.0/lib/crtn.o
Include the startup library
Add the startup library by replacing the library that calls main() with ckpt_crt0.o. To determine which library calls main(), run nm for all libraries listed in the low-level linking script. For example:
% nm /opt/SUNWspro/SC3.0/lib/crt1.o | grep -i main
Replace /opt/SUNWspro/SC3.0/lib/crt1.o with /usr/share/lsf/lib/ckpt_crt0.o:
/usr/ccs/bin/ld /opt/SUNWspro/SC3.0/lib/crti.o /usr/share/lsf/lib/ckpt_crt0.o /opt/SUNWspro/SC3.0/lib/__fstd.o /opt/SUNWspro/SC3.0/lib/values-xt.o -o my_job my_job.o -Y P,/opt/SUNWspro/SC3.0/lib:/usr/ccs/lib:/usr/lib -Qy -lc /opt/SUNWspro/SC3.0/lib/crtn.o
Include the checkpoint library
Add the checkpoint library by adding libckpt.a after language-specific libraries and before system-specific libraries. For example:
/usr/ccs/bin/ld /opt/SUNWspro/SC3.0/lib/crti.o /usr/share/lsf/lib/ckpt_crt0.o /opt/SUNWspro/SC3.0/lib/__fstd.o /opt/SUNWspro/SC3.0/lib/values-xt.o -o my_job my_job.o /usr/share/lsf/lib/libckpt.a -Y P,/opt/SUNWspro/SC3.0/lib:/usr/ccs/lib:/usr/lib -Qy -lc /opt/SUNWspro/SC3.0/lib/crtn.o
Force static linking
Force your application to link statically to as many libraries as possible. Refer to the documentation supplied with your compiler for more information about static linking. For example, on Solaris the -Bstatic and -Bdynamic compiler switches are used to force modules to statically link wherever possible:
/usr/ccs/bin/ld -Bstatic /opt/SUNWspro/SC3.0/lib/crti.o /usr/share/lsf/lib/ckpt_crt0.o /opt/SUNWspro/SC3.0/lib/__fstd.o /opt/SUNWspro/SC3.0/lib/values-xt.o -o my_job my_job.o /usr/share/lsf/lib/libckpt.a -Y P,/opt/SUNWspro/SC3.0/lib:/usr/ccs/lib:/usr/lib -Qy -lc -Bdynamic -ldl -Bstatic /opt/SUNWspro/SC3.0/lib/crtn.o
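The manual steps above amount to a mechanical rewrite of the low-level link line, which can be sketched in shell. This is a simplified illustration, not the logic of ckpt_ld: make_ckpt_link_line and LSF_LIB are hypothetical names, the startup module is assumed to be named crt1.o, and libckpt.a is placed before the first -l option as a heuristic, so its position can differ from the hand-edited example while still preceding -lc.

```shell
#!/bin/sh
# Sketch only. LSF_LIB is an assumed location; use your own LSF_LIBDIR.
LSF_LIB=${LSF_LIB:-/usr/share/lsf/lib}

make_ckpt_link_line() {
    set -f                        # keep patterns such as *.o unexpanded
    out=""
    first=1
    added_lib=0
    for tok in $1; do
        case $tok in
            */crt1.o)             # step 1: swap the module that calls main()
                tok=$LSF_LIB/ckpt_crt0.o ;;
            -l*)                  # step 2: checkpoint library before the first -l
                if [ "$added_lib" -eq 0 ]; then
                    out="$out $LSF_LIB/libckpt.a"
                    added_lib=1
                fi ;;
        esac
        if [ "$first" -eq 1 ]; then
            out="$tok -Bstatic"   # step 3: request static linking after the linker name
            first=0
        else
            out="$out $tok"
        fi
    done
    set +f
    printf '%s\n' "$out"
}
```

The function prints the rewritten line rather than running it, so the result can be reviewed first. It does not add the -Bdynamic -ldl -Bstatic sequence shown in the Solaris example; add that by hand where a library must remain dynamically linked.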
Re-Linking C++ Applications
To use the replacement linker on C++ applications, the module that calls main() must be extracted from its library file and included in the linking script. For example, the following Verilog application is written in C++ and being re-linked on Solaris. It reports an undefined symbol main in libckpt.a:
/usr/ccs/bin/ld /opt/SUNWspro/SC3.0.1/lib/crti.o /opt/SUNWspro/SC3.0.1/lib/crt1.o /opt/SUNWspro/SC3.0.1/lib/cg89/__fstd.o /opt/SUNWspro/SC3.0.1/lib/values-xt.o -Y P,lxx/lib:opt/SUNWspro/SC3.0.1/lib:/usr/ccs/lib:/usr/lib -o verilog verilog.o verilog/lib/*.o lib/libcman.a -L/usr/openwin/lib -lXt -X11 lib/libvoids.a -lm -lgen lxx/lib/_main.o -lC -lC_mtstubs -lsocket -lnsl -lintl -w -c -ldl /opt/SUNWspro/lib/crtn.o
To determine which library contains main(), run nm for all libraries listed in the low-level linking script. For example:
% nm lib/libvoids.a | grep main
Extract this module with ar:
% ar x lib/libvoids.a main.o
The main.o object file must be included in the re-linking script to generate a checkpointable executable:
/usr/ccs/bin/ld /opt/SUNWspro/SC3.0.1/lib/crti.o /opt/SUNWspro/SC3.0.1/lib/crt1.o /opt/SUNWspro/SC3.0.1/lib/cg89/__fstd.o /opt/SUNWspro/SC3.0.1/lib/values-xt.o -Y P,lxx/lib:opt/SUNWspro/SC3.0.1/lib:/usr/ccs/lib:/usr/lib -o verilog main.o verilog.o verilog/lib/*.o lib/libcman.a -L/usr/openwin/lib -lXt -X11 lib/libvoids.a -lm -lgen lxx/lib/_main.o -lC -lC_mtstubs -lsocket -lnsl -lintl -w -c -ldl /opt/SUNWspro/lib/crtn.o
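Running nm by hand over every file in a long link line is tedious, so the search can be sketched as a loop. The helpers below are hypothetical and not part of LSF: link_line_files lists the object files and archives named on a link line, and find_main prints those whose nm output mentions main. As with the manual command, the match includes references to main as well as its definition, so inspect the nm output of each reported file.

```shell
#!/bin/sh
# Print the object files and archives named on a link line, one per line.
# Wildcard patterns such as verilog/lib/*.o are passed through unexpanded.
link_line_files() {
    set -f
    for tok in $1; do
        case $tok in
            -*)       ;;                        # skip options such as -lm or -o
            *.o|*.a)  printf '%s\n' "$tok" ;;
        esac
    done
    set +f
}

# Print each existing file from the link line whose symbols mention main.
find_main() {
    link_line_files "$1" | while read -r f; do
        [ -f "$f" ] || continue                 # ignore missing files and patterns
        if nm "$f" 2>/dev/null | grep main >/dev/null; then
            printf '%s\n' "$f"
        fi
    done
}
```

Once the archive that defines main has been identified, extract the module with ar x as described above and add the extracted object file to the link line.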
Date Modified: March 13, 2009
Platform Computing: www.platform.com
Platform Support: support@platform.com
Platform Information Development: doc@platform.com
Copyright © 1994-2009 Platform Computing Corporation. All rights reserved.