User-Level Checkpointing



On systems that do not support kernel-level checkpointing, LSF provides an alternative method called user-level checkpointing. To implement user-level checkpointing, you must have access to your application's object files (.o files), and they must be re-linked with a set of libraries provided by LSF. This approach is transparent to your application: its code does not have to be changed, and the application does not know that a checkpoint and restart has occurred.

By default, the checkpoint libraries are installed in LSF_LIBDIR, and echkpnt and erestart are installed in LSF_SERVERDIR.

Optionally, third-party checkpoint and restart implementations can be used with LSF. You must use the echkpnt and erestart supplied with those implementations. To avoid overwriting the echkpnt and erestart supplied by LSF, install any third-party implementation in a separate directory, and define LSB_ECHKPNT_METHOD and LSB_ECHKPNT_METHOD_DIR as environment variables or in lsf.conf.
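
As a minimal sketch of the environment-variable route, the following sets both variables for the current shell session. The method name my_method and the directory /opt/third_party/ckpt are hypothetical placeholders, not values shipped with LSF; substitute whatever your third-party implementation documents.

```sh
# Hypothetical values: replace them with the method name and install
# directory documented by your third-party implementation.
LSB_ECHKPNT_METHOD="my_method"
LSB_ECHKPNT_METHOD_DIR="/opt/third_party/ckpt"
export LSB_ECHKPNT_METHOD LSB_ECHKPNT_METHOD_DIR

# Warn early if the directory is missing, so that a mistyped path is
# noticed before a job is submitted.
if [ ! -d "$LSB_ECHKPNT_METHOD_DIR" ]; then
    echo "warning: $LSB_ECHKPNT_METHOD_DIR does not exist" >&2
fi
```

Setting the variables in the environment affects only jobs submitted from that session; defining the same two names in lsf.conf is the alternative mentioned above.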

Limitations

There are restrictions to the use of the current implementation of the checkpoint library for user-level checkpointing. These are:



Building User-Level Checkpointable Jobs

Building a user-level checkpointable job involves re-linking your application object files (.o files) with the LSF checkpoint startup routine and library. LSF also provides a set of replacement linkers that call the standard linkers on your platform with the correct options to build a checkpointable application. LSF provides:

Library

The checkpoint library replaces low-level system calls such as open(), close(), and dup(), and contains signal handlers and routines to internally implement checkpointing.

Startup routine

The startup routine replaces the language-level module that calls main(), sets the checkpoint signal handler, and initializes internal data structures used to record job information.

Linkers

The checkpoint linkers are used to re-link your application with the checkpoint library and startup routine. They are shell scripts that call the standard linkers on your operating system with the correct options. The scripts are designed to use the native compilers on most platforms. Use ckpt_ld for C language applications and ckpt_ld_f for Fortran applications. The following compilers are supported by the ckpt_ld replacement linker:

Operating System    Compiler
AIX                 cc
HP-UX               c89
IRIX 6.2            cc (with the options described below)
OSF1                cc
Solaris             cc (SUN C compiler) and gcc
SunOS               gcc

For IRIX 6.2 you need to use cc with the -non_shared -mips2 -32 compiler options, and ckpt_ld with the -mips2 -32 linker options. For example, to compile and link my_job.c:

% cc -c my_job.c -non_shared -mips2 -32
% ckpt_ld -o my_job my_job.o -mips2 -32



Re-Linking User-Level Applications

To re-link your application, you must have access to the object files (.o files) for your application. If you are using third-party applications, the vendor must supply you with the object files. If you are building your own applications, you must first compile them without linking. C++ applications need to be modified as described in Re-Linking C++ Applications before re-linking.

C Language applications

Compile without linking

To compile a C language application without linking, run the compiler with the -c option, which stops after producing the object file (.o file) instead of going on to link an executable. For example, to compile an object file for my_job:

% cc -c my_job.c

Re-linking

To re-link a C language object file, use the supplied LSF replacement linker ckpt_ld. For example, to re-link an object file for an application called my_job:

% ckpt_ld -o my_job my_job.o

If you get an error while re-linking, see Troubleshooting User-Level Re-Linking.

Fortran applications

Compile without linking

To compile a Fortran application without linking, run the compiler with the -c option, which stops after producing the object file (.o file) instead of going on to link an executable. For example, to compile an object file for my_job:

% f77 -c my_job.f

Re-linking

To re-link a Fortran object file, use the supplied LSF replacement linker ckpt_ld_f. For example, to re-link an object file for an application called my_job:

% ckpt_ld_f -o my_job my_job.o

If you get an error while re-linking, see Troubleshooting User-Level Re-Linking.
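
The compile and re-link steps for both languages can be combined in a small helper. The following is only a sketch: the function names run and build_checkpointable and the DRY_RUN variable are hypothetical, the compiler names cc and f77 are taken from the examples in this section, and the source file is assumed to be in the current directory. By default the helper only prints the commands it would run.

```sh
# Print a command (DRY_RUN=1, the default) or execute it (DRY_RUN=0).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Compile one source file without linking, then re-link the object file
# with the matching LSF replacement linker.
build_checkpointable() {
    src=$1
    case "$src" in
        *.c) base=${src%.c}; compiler=cc;  linker=ckpt_ld ;;
        *.f) base=${src%.f}; compiler=f77; linker=ckpt_ld_f ;;
        *)   echo "unsupported source file: $src" >&2; return 1 ;;
    esac
    run "$compiler" -c "$src" || return 1
    run "$linker" -o "$base" "$base.o"
}
```

Calling build_checkpointable my_job.c in dry-run mode prints the same two commands shown under C Language applications. Set DRY_RUN=0 to execute them once the printed commands look right for your platform.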



Troubleshooting User-Level Re-Linking

If an error is reported when using ckpt_ld to link your application with the checkpoint libraries, follow the steps outlined in Resolving Re-Linking Errors to help isolate the problem. If you cannot resolve your errors, call Platform Customer Support.

The ckpt_ld replacement linker is designed for C language applications. If your application was created using C++, you need to modify your files as described in Re-Linking C++ Applications before re-linking.

What the replacement linkers do

The replacement linkers are shell scripts designed to use the standard compilers on your OS with the correct options to build a checkpointable executable. The linkers do the following:



Resolving Re-Linking Errors

To resolve linking errors, you need to step through the linking process performed by the linker. To do this, perform the following procedures:

  1. View the linking script
  2. Include the startup library
  3. Include the checkpoint library
  4. Force static linking

View the linking script

View the low-level linking script by running your linker in verbose mode. This will display the libraries called by your linker. Use this information to help determine which files need to be replaced.

Verbose mode switches

Refer to the man page supplied with your compiler to determine the verbose mode switch. The following table lists the verbose mode switch for some operating systems.

Operating System    Verbose Mode Switch
SunOS/Solaris       -#
AIX                 -v
IRIX                -show -non_shared
HP-UX               -v
OSF1                -v -non_shared
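
The table can be expressed as a small lookup function. This is a sketch only: the function name verbose_switch is hypothetical, and the mapping from operating system names to uname -s strings (for example, SunOS for both SunOS and Solaris, and IRIX64 on 64-bit IRIX) is an assumption you should confirm on your hosts.

```sh
# Print the verbose-mode switch for an operating system name, using the
# values from the table above. The argument is expected to be the output
# of `uname -s`.
verbose_switch() {
    case "$1" in
        SunOS)      printf '%s\n' '-#' ;;
        AIX|HP-UX)  printf '%s\n' '-v' ;;
        IRIX*)      printf '%s\n' '-show -non_shared' ;;
        OSF1)       printf '%s\n' '-v -non_shared' ;;
        *)          echo "no known verbose switch for '$1'; see your compiler man page" >&2
                    return 1 ;;
    esac
}
```

A typical call is verbose_switch "$(uname -s)". On an operating system that is not in the table, the function prints a message to standard error and returns a non-zero status.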

For example, running the Sparc C Compiler 3.0 with the verbose switch, -#, for my_job.o:

% cc -# -o my_job my_job.o
/usr/ccs/bin/ld /opt/SUNWspro/SC3.0/lib/crti.o /opt/SUNWspro/SC3.0/lib/crt1.o 
/opt/SUNWspro/SC3.0/lib/__fstd.o /opt/SUNWspro/SC3.0/lib/values-xt.o -o my_job 
my_job.o -Y P,/opt/SUNWspro/SC3.0/lib:/usr/ccs/lib:/usr/lib -Qy -lc 
/opt/SUNWspro/SC3.0/lib/crtn.o

Include the startup library

Add the startup library by replacing the library that calls main() with ckpt_crt0.o. To determine which library calls main(), run nm on each library listed in the low-level linking script. For example:

% nm /opt/SUNWspro/SC3.0/lib/crt1.o | grep -i main

Replace /opt/SUNWspro/SC3.0/lib/crt1.o with /usr/share/lsf/lib/ckpt_crt0.o:

/usr/ccs/bin/ld /opt/SUNWspro/SC3.0/lib/crti.o /usr/share/lsf/lib/ckpt_crt0.o 
/opt/SUNWspro/SC3.0/lib/__fstd.o /opt/SUNWspro/SC3.0/lib/values-xt.o -o my_job 
my_job.o -Y P,/opt/SUNWspro/SC3.0/lib:/usr/ccs/lib:/usr/lib -Qy -lc 
/opt/SUNWspro/SC3.0/lib/crtn.o
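
If you have saved the low-level link command to a file, the substitution can be scripted rather than edited by hand. The following sketch uses a hypothetical function named replace_startup; it replaces only words that match the old path exactly, and it collapses runs of spaces on any line it changes.

```sh
# Replace one whole word (the startup object) in a link command read
# from standard input.
#   $1 = path of the object that calls main()
#   $2 = path of the replacement startup object
replace_startup() {
    awk -v old="$1" -v new="$2" '{
        for (i = 1; i <= NF; i++)
            if ($i == old) $i = new
        print
    }'
}
```

For example, piping the command shown by the verbose switch through replace_startup /opt/SUNWspro/SC3.0/lib/crt1.o /usr/share/lsf/lib/ckpt_crt0.o yields the modified command above. Review the result before running it.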

Include the checkpoint library

Add the checkpoint library by adding libckpt.a after language-specific libraries and before system-specific libraries. For example:

/usr/ccs/bin/ld /opt/SUNWspro/SC3.0/lib/crti.o /usr/share/lsf/lib/ckpt_crt0.o 
/opt/SUNWspro/SC3.0/lib/__fstd.o /opt/SUNWspro/SC3.0/lib/values-xt.o -o my_job 
my_job.o /usr/share/lsf/lib/libckpt.a -Y 
P,/opt/SUNWspro/SC3.0/lib:/usr/ccs/lib:/usr/lib -Qy -lc 
/opt/SUNWspro/SC3.0/lib/crtn.o
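
A quick ordering check can catch a misplaced library before you run the link. The sketch below uses a hypothetical function named ckpt_lib_before_libc and treats -lc as the first system-specific library, which matches the example above but is an assumption for other link commands.

```sh
# Succeed (exit status 0) only if a libckpt.a path appears in the link
# command on standard input and comes before the first -lc.
ckpt_lib_before_libc() {
    awk '{
        for (i = 1; i <= NF; i++) {
            n++    # word position across all input lines
            if (!ckpt && $i ~ /(^|\/)libckpt\.a$/) ckpt = n
            if (!libc && $i == "-lc") libc = n
        }
    }
    END {
        if (ckpt && libc && ckpt < libc) exit 0
        exit 1
    }'
}
```

The function prints nothing; test its exit status, for example with ckpt_lib_before_libc < link_command.txt && echo "order looks right", where link_command.txt is a hypothetical file holding the link command.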

Force static linking

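Force your application to link statically to as many libraries as possible. Refer to the documentation supplied with your compiler for more information about static linking. For example, on Solaris the -Bstatic and -Bdynamic linker switches control whether the libraries that follow them on the command line are linked statically or dynamically. Use -Bstatic to link statically wherever possible, and switch to -Bdynamic only for libraries that must be linked dynamically: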

/usr/ccs/bin/ld -Bstatic /opt/SUNWspro/SC3.0/lib/crti.o 
/usr/share/lsf/lib/ckpt_crt0.o /opt/SUNWspro/SC3.0/lib/__fstd.o 
/opt/SUNWspro/SC3.0/lib/values-xt.o -o my_job my_job.o 
/usr/share/lsf/lib/libckpt.a -Y P,/opt/SUNWspro/SC3.0/lib:/usr/ccs/lib:/usr/lib 
-Qy -lc -Bdynamic -ldl -Bstatic /opt/SUNWspro/SC3.0/lib/crtn.o



Re-Linking C++ Applications

To use the replacement linker on C++ applications, the module that calls main() must be extracted from its library file and included in the linking script. For example, the following Verilog application is written in C++ and being re-linked on Solaris. It reports an undefined symbol main in libckpt.a:

/usr/ccs/bin/ld /opt/SUNWspro/SC3.0.1/lib/crti.o 
/opt/SUNWspro/SC3.0.1/lib/crt1.o /opt/SUNWspro/SC3.0.1/lib/cg89/__fstd.o 
/opt/SUNWspro/SC3.0.1/lib/values-xt.o -Y 
P,lxx/lib:/opt/SUNWspro/SC3.0.1/lib:/usr/ccs/lib:/usr/lib -o verilog verilog.o 
verilog/lib/*.o lib/libcman.a -L/usr/openwin/lib -lXt -lX11 lib/libvoids.a -lm 
-lgen lxx/lib/_main.o -lC -lC_mtstubs -lsocket -lnsl -lintl -w -c -ldl 
/opt/SUNWspro/lib/crtn.o

To determine which library contains main(), run nm on each library listed in the low-level linking script. For example:

% nm lib/libvoids.a | grep main

This module must be extracted using:

% ar x lib/libvoids.a main.o
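
When an archive has many members, the nm output can be filtered to name the member that defines main(). The following sketch assumes that your nm supports the POSIX -A and -P options and prints lines in the form archive[member]: name type value size; the default output of the native nm on some systems is laid out differently, and some platforms prefix C symbols with an underscore. The function name members_defining_main is hypothetical.

```sh
# Read `nm -A -P` output on standard input and print "archive member"
# for each archive member that defines main as a global text symbol
# (type T). Assumed line layout: archive[member]: name type value size
members_defining_main() {
    awk '$2 == "main" && $3 == "T" {
        tag = $1
        sub(/:$/, "", tag)              # drop the trailing colon
        lb = index(tag, "[")            # position of the opening bracket
        if (lb == 0 || substr(tag, length(tag)) != "]") next
        archive = substr(tag, 1, lb - 1)
        member  = substr(tag, lb + 1, length(tag) - lb - 1)
        print archive, member
    }'
}
```

Under those assumptions, nm -A -P lib/libvoids.a | members_defining_main prints the archive and member name to pass to ar x.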

The main.o object file must be included in the re-linking script to generate a checkpointable executable:

/usr/ccs/bin/ld /opt/SUNWspro/SC3.0.1/lib/crti.o 
/opt/SUNWspro/SC3.0.1/lib/crt1.o /opt/SUNWspro/SC3.0.1/lib/cg89/__fstd.o 
/opt/SUNWspro/SC3.0.1/lib/values-xt.o -Y 
P,lxx/lib:/opt/SUNWspro/SC3.0.1/lib:/usr/ccs/lib:/usr/lib -o verilog main.o 
verilog.o verilog/lib/*.o lib/libcman.a -L/usr/openwin/lib -lXt -lX11 
lib/libvoids.a -lm -lgen lxx/lib/_main.o -lC -lC_mtstubs -lsocket -lnsl -lintl 
-w -c -ldl /opt/SUNWspro/lib/crtn.o



      Date Modified: March 13, 2009
Platform Computing: www.platform.com

Platform Support: support@platform.com
Platform Information Development: doc@platform.com

Copyright © 1994-2009 Platform Computing Corporation. All rights reserved.