Chunk Job Dispatch
About Job Chunking
LSF supports job chunking, where jobs with similar resource requirements submitted by the same user are grouped together for dispatch. The CHUNK_JOB_SIZE parameter in lsb.queues and lsb.applications specifies the maximum number of jobs allowed to be dispatched together in a chunk job.
Job chunking can have the following advantages:
- Reduces communication between sbatchd and mbatchd, and scheduling overhead in mbatchd
- Increases job throughput in mbatchd and gives more balanced CPU utilization on the execution hosts
All of the jobs in the chunk are dispatched as a unit rather than individually. Job execution is sequential, but chunk job members are not necessarily executed in the order in which they were submitted.
Restriction: You cannot auto-migrate a suspended chunk job member.
Chunk job candidates
Jobs with the following characteristics are typical candidates for job chunking:
- Take between 1 and 2 minutes to run
- All require the same resource (for example, a software license or a specific amount of memory)
- Do not specify a beginning time (bsub -b) or termination time (bsub -t)
Running jobs with these characteristics without chunking can underutilize resources because LSF spends more time scheduling and dispatching the jobs than actually running them.
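The effect of dispatch overhead on short jobs can be illustrated with some simple arithmetic. The sketch below is a simplified model, not a description of how LSF measures anything: it assumes one fixed scheduling and dispatch overhead per dispatch, shared by all members of a chunk. The function name and the numbers (90-second jobs, 120 seconds of overhead) are hypothetical and chosen only for illustration.

```python
def running_fraction(run_seconds: float, overhead_seconds: float, chunk_size: int = 1) -> float:
    """Fraction of total time spent running jobs rather than dispatching them.

    Hypothetical model: each dispatch costs overhead_seconds once, and the
    chunk_size jobs dispatched together then run one after another.
    """
    busy = run_seconds * chunk_size
    return busy / (busy + overhead_seconds)


# Hypothetical figures: 90-second jobs and 120 seconds of overhead per dispatch.
print(round(running_fraction(90, 120), 2))                # 0.43
print(round(running_fraction(90, 120, chunk_size=4), 2))  # 0.75
```

Under these assumed figures, dispatching four jobs per chunk raises the share of time spent running jobs from under half to three quarters. Actual overhead depends on the cluster and should be measured there.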
Configuring a special high-priority queue for short jobs is not desirable because users may be tempted to send all of their jobs to this queue, knowing that it has high priority.
Configure Chunk Job Dispatch
CHUNK_JOB_SIZE (lsb.queues)
By default, CHUNK_JOB_SIZE is not enabled.
- To configure a queue to dispatch chunk jobs, specify the CHUNK_JOB_SIZE parameter in the queue definition in lsb.queues.
For example, the following configures a queue named chunk, which dispatches up to 4 jobs in a chunk:
    Begin Queue
    QUEUE_NAME     = chunk
    PRIORITY       = 50
    CHUNK_JOB_SIZE = 4
    End Queue
Postrequisites: After adding CHUNK_JOB_SIZE to lsb.queues, use badmin reconfig to reconfigure your cluster.
Chunk jobs and job throughput
Throughput can deteriorate if the chunk job size is too big. Performance may decrease on queues with CHUNK_JOB_SIZE greater than 30. You should evaluate the chunk job size on your own systems for best performance.
CHUNK_JOB_SIZE (lsb.applications)
By default, CHUNK_JOB_SIZE is not enabled. Enabling application-level job chunking overrides queue-level job chunking.
- To configure an application profile to chunk jobs together, specify the CHUNK_JOB_SIZE parameter in the application profile definition in lsb.applications.
Specify CHUNK_JOB_SIZE=1 to disable job chunking for the application. This value overrides chunk job dispatch configured in the queue.
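The precedence between the queue-level and application-level settings can be sketched as follows. This is an illustrative model of the rules stated above, not LSF code. The function name is hypothetical, and the fallback to the queue value when the application profile does not set CHUNK_JOB_SIZE is an assumption.

```python
from typing import Optional


def effective_chunk_size(queue_size: Optional[int], app_size: Optional[int]) -> int:
    """Return the chunk size that applies to a job; 1 means no chunking.

    None stands for "CHUNK_JOB_SIZE is not specified" at that level.
    Assumption: with no application-level value, the queue value applies.
    """
    if app_size is not None:
        # The application-level value overrides the queue; 1 disables chunking.
        return max(app_size, 1)
    if queue_size is not None:
        return max(queue_size, 1)
    # CHUNK_JOB_SIZE is not enabled by default.
    return 1


print(effective_chunk_size(queue_size=4, app_size=None))  # 4
print(effective_chunk_size(queue_size=4, app_size=1))     # 1
print(effective_chunk_size(queue_size=4, app_size=10))    # 10
```

The second call shows the case described above: CHUNK_JOB_SIZE=1 in the application profile turns off chunking even though the queue is configured for chunks of 4.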
Postrequisites: After adding CHUNK_JOB_SIZE to lsb.applications, use badmin reconfig to reconfigure your cluster.
CHUNK_JOB_DURATION (lsb.params)
If CHUNK_JOB_DURATION is defined in the file lsb.params, a job submitted to a chunk job queue is chunked under the following conditions:
- A job-level CPU limit or run time limit is specified (bsub -c or -W), or
- An application-level CPU limit, run time limit, or run time estimate is specified (CPULIMIT, RUNLIMIT, or RUNTIME in lsb.applications), or
- A queue-level CPU limit or run time limit is specified (CPULIMIT or RUNLIMIT in lsb.queues),
and the values of the CPU limit, run time limit, and run time estimate are all less than or equal to the CHUNK_JOB_DURATION.
Jobs are not chunked if:
- The CPU limit, run time limit, or run time estimate is greater than the value of CHUNK_JOB_DURATION, or
- No CPU limit, no run time limit, and no run time estimate are specified.
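The conditions above amount to a simple decision rule, sketched here for illustration. The function is hypothetical and is not part of LSF. It assumes that the limits that apply to the job, whether set at job, application, or queue level, have already been collected and are all expressed in the same unit as CHUNK_JOB_DURATION (minutes are assumed here).

```python
from typing import Optional, Sequence


def is_chunked(chunk_job_duration: int, limits: Sequence[Optional[int]]) -> bool:
    """Decide whether a job is chunked when CHUNK_JOB_DURATION is defined.

    limits holds the CPU limit, run time limit, and run time estimate that
    apply to the job; None marks a value that is not specified.
    """
    specified = [value for value in limits if value is not None]
    if not specified:
        # No CPU limit, no run time limit, and no run time estimate.
        return False
    # Every specified value must be <= CHUNK_JOB_DURATION.
    return all(value <= chunk_job_duration for value in specified)


print(is_chunked(20, [10, None, None]))    # True
print(is_chunked(20, [10, 30, None]))      # False
print(is_chunked(20, [None, None, None]))  # False
```

The third call is the case that most often surprises users: once CHUNK_JOB_DURATION is defined, a job submitted without any limit or estimate is no longer chunked.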
The value of CHUNK_JOB_DURATION is displayed by bparams -l.
- After adding CHUNK_JOB_DURATION to lsb.params, use badmin reconfig to reconfigure your cluster.
By default, CHUNK_JOB_DURATION is not enabled.
Restrictions on chunk jobs
CHUNK_JOB_SIZE is ignored and jobs are not chunked under the following conditions:
- Interactive queues (INTERACTIVE = ONLY parameter)
- CPU limit greater than 30 minutes (CPULIMIT parameter in lsb.queues or lsb.applications). If CHUNK_JOB_DURATION is set in lsb.params, the job is chunked only if it is submitted with a CPU limit that is less than or equal to the value of CHUNK_JOB_DURATION (bsub -c)
- Run limit greater than 30 minutes (RUNLIMIT parameter in lsb.queues or lsb.applications). If CHUNK_JOB_DURATION is set in lsb.params, the job is chunked only if it is submitted with a run limit that is less than or equal to the value of CHUNK_JOB_DURATION (bsub -W)
- Run time estimate greater than 30 minutes (RUNTIME parameter in lsb.applications)
Jobs submitted with the following bsub options are not chunked; they are dispatched individually:
- -I (interactive jobs)
- -c (jobs with CPU limit greater than 30 minutes)
- -W (jobs with run limit greater than 30 minutes)
- -app (jobs associated with an application profile that specifies a run time estimate or run time limit greater than 30 minutes, or a CPU limit greater than 30 minutes). CHUNK_JOB_SIZE is either not specified in the application, or CHUNK_JOB_SIZE=1, which disables chunk job dispatch configured in the queue.
- -R "cu[]" (jobs with a compute unit resource requirement).
Submitting and Controlling Chunk Jobs
When a job is submitted to a queue or application profile configured with the CHUNK_JOB_SIZE parameter, LSF attempts to place the job in an existing chunk. A job is added to an existing chunk if it has the same characteristics as the first job in the chunk:
- Submitting user
- Resource requirements
- Host requirements
- Queue or application profile
- Job priority
If a suitable host is found to run the job, but there is no chunk available with the same characteristics, LSF creates a new chunk.
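The placement rule described above can be sketched as a grouping by shared characteristics. This is a simplified illustration, not the LSF scheduler. The Job class, its field names, and build_chunks are hypothetical, and the sketch assumes that a suitable host is always available and that a full chunk is closed to further jobs.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Job:
    job_id: int
    user: str
    resource_req: str
    host_req: str
    queue_or_app: str
    priority: int

    def chunk_key(self) -> Tuple[str, str, str, str, int]:
        # The characteristics a job must share with the first job in a chunk.
        return (self.user, self.resource_req, self.host_req,
                self.queue_or_app, self.priority)


def build_chunks(jobs: List[Job], chunk_job_size: int) -> List[List[Job]]:
    """Place each job in an open chunk with matching characteristics,
    or create a new chunk when none exists or the existing one is full."""
    chunks: List[List[Job]] = []
    open_chunk: Dict[Tuple[str, str, str, str, int], List[Job]] = {}
    for job in jobs:
        key = job.chunk_key()
        chunk = open_chunk.get(key)
        if chunk is None or len(chunk) >= chunk_job_size:
            chunk = []
            chunks.append(chunk)
            open_chunk[key] = chunk
        chunk.append(job)
    return chunks


# Five jobs from one user and one from another, all otherwise identical.
users = ["alice", "alice", "bob", "alice", "alice", "alice"]
jobs = [Job(i + 1, u, "mem>100", "any", "chunk", 50) for i, u in enumerate(users)]
print([[j.job_id for j in c] for c in build_chunks(jobs, 4)])
# [[1, 2, 4, 5], [3], [6]]
```

With a chunk size of 4, the fifth job from the first user starts a new chunk, and the job from the second user is never mixed into another user's chunk.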
Resources reserved for any member of the chunk are reserved at the time the chunk is dispatched and held until the whole chunk finishes running. Other jobs requiring the same resources are not dispatched until the chunk job is done.
For example, if all jobs in the chunk require a software license, the license is checked out and each chunk job member uses it in turn. The license is not released until the last chunk job member is finished running.
WAIT status
When sbatchd receives a chunk job, it does not start all member jobs at once. A chunk job occupies a single job slot. Even if other slots are available, the chunk job members must run one at a time in the job slot they occupy. The remaining jobs in the chunk that are waiting to run are displayed as WAIT by bjobs. Any jobs in WAIT status are included in the count of pending jobs by bqueues and busers. The bhosts command shows the single job slot occupied by the entire chunk job in the number of jobs shown in the NJOBS column.
The bhist -l command shows jobs in WAIT status as Waiting ...
The bjobs -l command does not display a WAIT reason in the list of pending jobs.
Controlling chunk jobs
Job controls affect the state of the members of a chunk job. You can perform the following actions on jobs in a chunk job:
Migrating jobs with bmig changes the dispatch sequence of the chunk job members. They are not redispatched in the order they were originally submitted.
Rerunnable chunk jobs
If the execution host becomes unavailable, rerunnable chunk job members are removed from the queue and dispatched to a different execution host.
See Chapter 30, "Job Requeue and Job Rerun" for more information about rerunnable jobs.
Checkpointing chunk jobs
Only running chunk jobs can be checkpointed. If bchkpnt -k is used, the job is also killed after the checkpoint file has been created. If a chunk job in WAIT state is checkpointed, mbatchd rejects the checkpoint request.
See Chapter 31, "Job Checkpoint, Restart, and Migration" for more information about checkpointing jobs.
Fairshare policies and chunk jobs
Fairshare queues can use job chunking. Jobs are accumulated in the chunk job so that priority is assigned to jobs correctly according to the fairshare policy that applies to each user. Jobs belonging to other users are dispatched in other chunks.
TERMINATE_WHEN job control action
If the TERMINATE_WHEN job control action is applied to a chunk job, sbatchd kills the chunk job element that is running and puts the rest of the waiting elements into pending state to be rescheduled later.
Enforce resource usage limits on chunk jobs
By default, resource usage limits are not enforced for chunk jobs because chunk jobs are typically too short to allow LSF to collect resource usage.
- To enforce resource limits for chunk jobs, define LSB_CHUNK_RUSAGE=Y in lsf.conf. Limits may not be enforced for chunk jobs that take less than a minute to run.
Platform Computing Inc.
www.platform.com |