Resizable Jobs

Enabling resizable jobs allows LSF to run a job that requests a minimum and a maximum number of slots, and to let that job dynamically use the number of slots available at any given time.

By default, if a job specifies a minimum and maximum slot request (bsub -n min,max), LSF makes a one-time allocation and schedules the job. When resizable jobs are configured, LSF dispatches a job as long as its minimum slot request is satisfied. After the job starts successfully, LSF continues to schedule and allocate additional resources to satisfy the maximum slot request for the job. For example, a job requests -n 4,32 slots. The job starts to run with 20 slots at time t0. After that, LSF continues to allocate resources to the job: 4 more slots at time t1, and then another 8 slots at time t2, which finally satisfies the 32-slot maximum (20 + 4 + 8 = 32).

About resizable jobs

Resizable Job

A job whose slot allocation can grow and shrink during its run time. An allocation change request can be triggered automatically or by the bresize command. For example, after the job starts, you can explicitly cancel resize allocation requests or have the job release idle resources back to LSF.

Autoresizable job

A resizable job with a minimum and maximum slot request. LSF automatically schedules and allocates additional resources to satisfy the job's maximum request as the job runs.

For autoresizable jobs, LSF automatically calculates the pending allocation requests. The maximum pending allocation request is the maximum number of requested slots minus the number of allocated slots. The minimum pending allocation request is always 1: because the job is running and its original minimum request is already satisfied, LSF can allocate any number of additional slots to the running job. For instance, if a job requests -n 4,32 and LSF initially allocates 20 slots, the active pending allocation request is 1 to 12, where 1 is the minimum slot request and 12 is the maximum slot request. After LSF assigns another 4 slots, the pending allocation request is 1 to 8.
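The calculation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not LSF code: the function name pending_allocation_request is hypothetical, and returning None once the maximum is reached is an assumption made for the sketch.

```python
from typing import Optional, Tuple


def pending_allocation_request(max_slots: int, allocated_slots: int) -> Optional[Tuple[int, int]]:
    """Return the (minimum, maximum) pending allocation request for a running job.

    Hypothetical helper: the minimum is always 1 and the maximum is the
    requested maximum minus the slots already allocated.
    """
    if allocated_slots > max_slots:
        raise ValueError("allocated slots cannot exceed the requested maximum")
    remaining = max_slots - allocated_slots
    if remaining == 0:
        # Assumption for this sketch: nothing is pending once the maximum is met.
        return None
    return (1, remaining)


# Walk through the -n 4,32 example from the text.
allocated = 20
first = pending_allocation_request(32, allocated)    # request is 1 to 12
allocated += 4
second = pending_allocation_request(32, allocated)   # request is 1 to 8
allocated += 8
third = pending_allocation_request(32, allocated)    # maximum reached
print(first, second, third)
```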

Pending allocation request

An additional resource request attached to a resizable job. Only running jobs can have pending allocation requests. At any given time, the job only has one allocation request.

LSF creates a new pending allocation request and schedules it after the job physically starts on the remote host (after LSF receives the JOB_EXECUTE event from sbatchd) or after a resize notification completes successfully.

Notification command

A notification command is an executable that is invoked on the first execution host of a job in response to an allocation (grow or shrink) event. It can be used to inform the running application of the allocation change. Because applications are implemented in different ways, each resizable application may have its own notification command provided by the application developer.

The notification command runs under the same user ID, environment, home directory, and working directory as the actual job. The standard input, output, and error of the program are redirected to the NULL device. If the notification command is not in the user's normal execution path (the $PATH variable), the full path name of the command must be specified.

A notification command exits with one of the following values:

LSB_RESIZE_NOTIFY_OK=0 

LSB_RESIZE_NOTIFY_FAIL=1

LSF sets these environment variables in the notification command environment. LSB_RESIZE_NOTIFY_OK indicates that the notification succeeded. For both allocation "grow" and "shrink" events, LSF updates the job allocation to reflect the new allocation.

LSB_RESIZE_NOTIFY_FAIL indicates that the notification failed. For an allocation "grow" event, LSF reschedules the pending allocation request. For an allocation "shrink" event, LSF fails the allocation release request.
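As an illustration, the core of a notification command might look like the following Python sketch. It reads the two exit values from the environment, falling back to 0 and 1 as listed above. The variable name LSB_RESIZE_EVENT (carrying "grow" or "shrink") is an assumption that is not defined in this section, and notify_application is a hypothetical placeholder for whatever mechanism the application provides.

```python
import os
import sys
from typing import Mapping


def notify_application(event: str) -> bool:
    """Hypothetical hook: tell the running application about the allocation change.

    A real command would signal or message the application here. This
    placeholder only accepts the two event types named in the text.
    """
    return event in ("grow", "shrink")


def resize_exit_code(env: Mapping[str, str]) -> int:
    """Choose the exit value that the notification command should return."""
    ok = int(env.get("LSB_RESIZE_NOTIFY_OK", "0"))
    fail = int(env.get("LSB_RESIZE_NOTIFY_FAIL", "1"))
    # Assumed variable name for the event type; check your LSF documentation.
    event = env.get("LSB_RESIZE_EVENT", "")
    try:
        succeeded = notify_application(event)
    except Exception:
        succeeded = False
    return ok if succeeded else fail


def main() -> None:
    # An installed notification command would call main() as its last step.
    sys.exit(resize_exit_code(os.environ))
```

Keeping the decision in resize_exit_code, separate from the process exit, makes the logic easy to check without running under LSF.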

Configuration to enable resizable jobs

The resizable jobs feature is enabled by defining an application profile using the RESIZABLE_JOBS parameter in lsb.applications.


Configuration file: lsb.applications

Parameter and syntax: RESIZABLE_JOBS=Y|N|auto

Behavior:

  • When RESIZABLE_JOBS=Y, jobs submitted to the application profile are resizable.

  • When RESIZABLE_JOBS=auto, jobs submitted to the application profile are automatically resizable.

  • To enable cluster-wide resizable behavior by default, define RESIZABLE_JOBS=Y in the default application profile.

Parameter and syntax: RESIZE_NOTIFY_CMD=notify_cmd

Behavior:

RESIZE_NOTIFY_CMD specifies an application-level resize notification command. The resize notification command is invoked on the first execution host of a running resizable job when a resize event occurs, whether the event releases resources or adds resources.

LSF sets the appropriate environment variables to indicate the event type before running the notification command.


Configuration to modify resizable job behavior

There is no configuration to modify resizable job behavior.

Resizable job commands

Commands for submission


Command

Description

bsub -app application_profile_name

Submits the job to the specified application profile configured for resizable jobs.

bsub -app application_profile_name -rnc resize_notification_command

Submits the job to the specified application profile configured for resizable jobs, with the specified resize notification command. The job-level resize notification command overrides the application-level RESIZE_NOTIFY_CMD setting.

bsub -ar -app application_profile_name

Submits the job to the specified application profile configured for resizable jobs, as an autoresizable job. The job-level -ar option overrides the application-level RESIZABLE_JOBS setting. For example, if the application profile is not autoresizable, the job-level bsub -ar option makes the job autoresizable.
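The submission options above can be combined on one command line. The following Python sketch assembles such a command as an argument list without running it. The helper name build_bsub_command, the profile name resizable_app, the notification path, and the job command are all hypothetical; only the -n, -app, -ar, and -rnc options come from the text.

```python
import shlex
from typing import List, Optional


def build_bsub_command(app_profile: str, min_slots: int, max_slots: int,
                       job_command: str, notify_cmd: Optional[str] = None,
                       autoresizable: bool = False) -> List[str]:
    """Assemble a bsub argument list for a resizable job (illustrative only)."""
    if not 1 <= min_slots <= max_slots:
        raise ValueError("expected 1 <= min_slots <= max_slots")
    args = ["bsub", "-n", f"{min_slots},{max_slots}", "-app", app_profile]
    if autoresizable:
        args.append("-ar")             # job-level autoresizable attribute
    if notify_cmd:
        args += ["-rnc", notify_cmd]   # job-level resize notification command
    args += shlex.split(job_command)   # the job itself goes last
    return args


# Hypothetical names throughout; nothing is submitted here.
cmd = build_bsub_command("resizable_app", 4, 32, "./my_solver",
                         notify_cmd="/path/to/notify_resize", autoresizable=True)
print(shlex.join(cmd))
```

Building a list instead of a single string avoids quoting problems if the list is later passed to subprocess.run.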


Commands to monitor


Command

Description

bacct

  • Displays resize notification command.

  • Displays resize allocation changes.

bhist

  • Displays resize notification command.

  • Displays resize allocation changes.

  • Displays the job-level autoresizable attribute.

bjobs -l

  • Displays resize notification command.

  • Displays resize allocation changes.

  • Displays the job-level autoresizable attribute.

  • Displays pending resize allocation requests.


Commands to control


Command

Description

bmod -ar | -arn

Add or remove the job-level autoresizable attribute. bmod only updates the autoresizable attribute for pending jobs.

bmod -rnc resize_notification_cmd | -rncn

Modify or remove the resize notification command for a submitted job.

bresize release

Release allocated resources from a running resizable job.

  • Release all slots except one slot from the first execution node.

  • Release all hosts except the first execution node.

  • Release an explicit list of hosts, with a different number of slots for each host.

  • Specify a resize notification command to be invoked on the first execution host of the job.

To release resources from a running job, the job must be submitted to an application profile configured as resizable.

  • By default, only cluster administrators, queue administrators, root and the job owner are allowed to run bresize to change job allocations.

  • User group administrators are allowed to run bresize to change the allocation of jobs within their user groups.

bresize cancel

Cancel a pending allocation request. The active pending allocation request comes from an autoresize request that LSF generates automatically. If the job does not have an active pending request, the command fails with an error message.

bresize release -rnc resize_notification_cmd

Specify or remove a resize notification command. The resize notification command is invoked on the first execution host of the job. It applies only to the release request and overrides the corresponding resize notification settings defined at both the application level (RESIZE_NOTIFY_CMD in lsb.applications) and the job level (bsub -rnc notify_cmd).

If the resize notification command completes successfully, LSF considers the allocation release done and updates the job allocation. If the resize notification command fails, LSF does not update the job allocation.

The resize_notification_cmd specifies the name of the executable to be invoked on the first execution host when the job's allocation has been modified.

The resize notification command runs under the user account of the job.

-rncn removes the resize notification command at both the job level and the application level.

bresize release -c

By default, if the job has an active pending allocation request, LSF does not allow users to release resources. Use the bresize release -c command to cancel the active pending resource request when releasing slots from the existing allocation. By default, the command only releases slots.

If a job still has an active pending allocation request, but you do not want to allocate more resources to the job, use the bresize cancel command to cancel the allocation request.

Only the job owner, cluster administrators, queue administrators, user group administrators, and root are allowed to cancel pending resource allocation requests.
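The release rules described for bresize release can be summarized as a small decision model. The sketch below is a simplified illustration, not LSF behavior verbatim: JobModel and release are hypothetical names, the job is reduced to a slot count and a pending-request flag, and it assumes that -c cancels the pending request even if the notification command later fails.

```python
from dataclasses import dataclass


@dataclass
class JobModel:
    """Minimal stand-in for a running resizable job (hypothetical)."""
    allocated: int   # slots currently allocated to the job
    pending: bool    # True while an active pending allocation request exists


def release(job: JobModel, slots: int, cancel_pending: bool = False,
            notify_ok: bool = True) -> str:
    """Apply the release rules from the text and return a status string."""
    if slots < 1 or slots >= job.allocated:
        # At least one slot on the first execution host always remains.
        raise ValueError("can release between 1 and allocated - 1 slots")
    if job.pending and not cancel_pending:
        return "rejected: active pending allocation request"
    if cancel_pending:
        job.pending = False   # assumption: cancelled regardless of what follows
    if not notify_ok:
        return "failed: notification command failed, allocation unchanged"
    job.allocated -= slots
    return "released"


job = JobModel(allocated=24, pending=True)
print(release(job, 4))                       # refused without -c
print(release(job, 4, cancel_pending=True))  # like bresize release -c
print(release(job, 4, notify_ok=False))      # notification failure
print(job)
```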


Commands to display configuration


Command

Description

bapp

Displays the value of parameters defined in lsb.applications.