The job checkpoint and restart feature enables you to stop jobs and then restart them from the point at which they stopped, which optimizes resource usage. LSF can periodically capture the state of a running job and the data required to restart it. This feature provides fault tolerance and allows LSF administrators and users to migrate jobs from one host to another to achieve load balancing.
Checkpointing enables LSF users to restart a job on the same execution host or to migrate a job to a different execution host. LSF controls checkpointing and restart through interfaces named echkpnt and erestart. By default, when a user specifies a checkpoint directory with bsub -k or bmod -k, or submits a job to a queue that has a checkpoint directory specified, echkpnt sends checkpoint instructions to an executable named echkpnt.default.
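For example, the following commands make a job checkpointable with the default method; the directory my_dir, the 240-minute checkpoint period, the job command my_job, and the job ID 123 are placeholders:

bsub -k "my_dir 240" my_job
bmod -k "my_dir 240" 123

The first value inside the quotes is the checkpoint directory and the second is the checkpoint period in minutes; bmod -k applies the same option string to a job that has already been submitted.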
When LSF checkpoints a job, the echkpnt interface creates a checkpoint file in the directory checkpoint_dir/job_ID, and then checkpoints and resumes the job. The job continues to run, even if checkpointing fails.
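As a sketch of the commands involved (the job ID 123 and the directory my_dir are placeholders), a user can checkpoint a running job on demand and later restart it from the saved checkpoint files:

bchkpnt 123            # checkpoint job 123; the job continues to run
bchkpnt -k 123         # checkpoint job 123 and then kill it
brestart my_dir 123    # restart the job from the checkpoint files under my_dir/123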
The operating system provides checkpoint and restart functionality that is transparent to your applications and enabled by default. To implement job checkpoint and restart at the kernel level, the LSF echkpnt and erestart executables invoke operating system-specific calls.
LSF uses the default executables echkpnt.default and erestart.default for kernel-level checkpoint and restart.
For systems that do not support kernel-level checkpoint and restart, LSF provides a job checkpoint and restart implementation that is transparent to your applications and does not require you to rewrite code. User-level job checkpoint and restart is enabled by linking your application files to the LSF checkpoint libraries in LSF_LIBDIR. LSF uses the default executables echkpnt.default and erestart.default for user-level checkpoint and restart.
Different applications have different checkpointing implementations that require the use of customized external executables (echkpnt.application and erestart.application). Application-level checkpoint and restart enables you to configure LSF to use specific echkpnt.application and erestart.application executables for a job, queue, or cluster. You can write customized checkpoint and restart executables for each application that you use.
LSF uses a combination of corresponding checkpoint and restart executables. For example, if you use echkpnt.fluent to checkpoint a particular job, LSF will use erestart.fluent to restart the checkpointed job. You cannot override this behavior or configure LSF to use a specific restart executable.
Kernel-level checkpoint and restart is enabled by default. LSF users make a job checkpointable either by submitting it with bsub -k and specifying a checkpoint directory, or by submitting it to a queue that defines a checkpoint directory with the CHKPNT parameter.
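For example, a queue can make every job submitted to it checkpointable by defining CHKPNT in lsb.queues; the queue name, the directory my_dir, and the 240-minute period are placeholders:

Begin Queue
QUEUE_NAME  = checkpoint
CHKPNT      = my_dir 240
DESCRIPTION = jobs in this queue are checkpointed to my_dir every 240 minutes
End Queue

A job submitted with bsub -q checkpoint my_job is then checkpointable without the user specifying bsub -k.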
To enable user-level checkpoint and restart, you must link your application object files to the LSF checkpoint libraries provided in LSF_LIBDIR. You do not have to change any code within your application. For instructions on how to link application files, see the Platform LSF Programmer’s Guide.
For application-level checkpoint and restart, once the LSF_SERVERDIR contains one or more checkpoint and restart executables, users can specify the external checkpoint executable associated with each checkpointable job they submit. At restart, LSF invokes the corresponding external restart executable.
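For example, if LSF_SERVERDIR contains echkpnt.fluent and erestart.fluent, a user can associate that method with a job at submission time; my_dir, the 60-minute period, and my_fluent_job are placeholders:

bsub -k "my_dir 60 method=fluent" my_fluent_job

When LSF checkpoints this job it runs echkpnt.fluent, and when the job is restarted LSF runs erestart.fluent.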
The directory/name combinations must be unique within the cluster. For example, you can write two different checkpoint executables with the name echkpnt.fluent and save them as LSF_SERVERDIR/echkpnt.fluent and my_execs/echkpnt.fluent. To run checkpoint and restart executables from a directory other than LSF_SERVERDIR, you must configure the parameter LSB_ECHKPNT_METHOD_DIR in lsf.conf.
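For example, to run the executables saved under a my_execs directory instead of LSF_SERVERDIR, add a line such as the following to lsf.conf; the absolute path is a placeholder:

LSB_ECHKPNT_METHOD_DIR=/home/admin/my_execs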
An echkpnt.application must return a value of 0 when checkpointing succeeds and a non-zero value when checkpointing fails.
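The following is a minimal sketch of such an executable. The myapp method name, the myapp_checkpoint command, and the pass-through argument handling are illustrative assumptions; only the exit-status contract comes from the requirement above.

#!/bin/sh
# echkpnt.myapp - application-level checkpoint wrapper (sketch)
# Run the application's own checkpoint command (hypothetical).
if myapp_checkpoint "$@"
then
    exit 0    # report to LSF that checkpointing succeeded
else
    exit 1    # any non-zero value reports that checkpointing failed
fi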
All checkpoint and restart executables run under the user account of the user who submits the job.
LSF identifies checkpoint files by the checkpoint directory and job ID. For example, if a user submits a job with bsub -k my_dir and LSF assigns it job ID 123, LSF writes the checkpoint file to my_dir/123.
LSF maintains all of the checkpoint files for a single job in one location. When a job restarts, LSF creates both a new subdirectory based on the new job ID and a symbolic link from the old to the new directory. For example, when job 123 restarts on a new host as job 456, LSF creates my_dir/456 and a symbolic link from my_dir/123 to my_dir/456.
The file path of the checkpoint directory can contain up to 4000 characters for UNIX and Linux, or up to 255 characters for Windows, including the directory and file name.
Checkpoint directory and checkpoint period—values specified at the job level override values for the queue. Values specified in an application profile override queue-level configuration.
Checkpoint and restart executables—the checkpoint_method value specified at the job level overrides both the application-level CHKPNT_METHOD setting and the cluster-level LSB_ECHKPNT_METHOD value specified in lsf.conf or as an environment variable (see the example after this list).
Configuration parameters and environment variables—values specified as environment variables override the values specified in lsf.conf.
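For example, given the hypothetical settings below, the method named at the job level wins over both the application profile and the cluster-wide setting, so this job is checkpointed with echkpnt.myapp; fluent_app, myapp, my_dir, and my_job are placeholders:

# lsf.conf (cluster level)
LSB_ECHKPNT_METHOD=default

# lsb.applications (application profile fluent_app)
CHKPNT_METHOD=fluent

# job level
bsub -app fluent_app -k "my_dir method=myapp" my_job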
You can also configure LSF to modify checkpoint and restart behavior in the following ways; a sample lsf.conf for the first three items follows this list.
Specifying mandatory application-level checkpoint and restart executables that apply to all checkpointable batch jobs in the cluster
Specifying the directory that contains customized application-level checkpoint and restart executables
Saving standard output and standard error to files in the checkpoint directory
Automatically checkpointing jobs before suspending or terminating them
For Cray systems only, copying all open job files to the checkpoint directory
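Assuming the standard lsf.conf parameters for these behaviors, a minimal sketch covering the first three items in this list might look like the following; the myapp method name and the directory path are placeholders:

LSB_ECHKPNT_METHOD=myapp                       # use echkpnt.myapp and erestart.myapp for all checkpointable jobs
LSB_ECHKPNT_METHOD_DIR=/home/admin/my_execs    # directory that contains the customized executables
LSB_ECHKPNT_KEEP_OUTPUT=y                      # save echkpnt and erestart standard output and error in the checkpoint directory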