Goal-Oriented SLA-Driven Scheduling
Contents
- Using Goal-Oriented SLA Scheduling
- Configuring Service Classes for SLA Scheduling
- View Information about SLAs and Service Classes
- Understanding Service Class Behavior
Using Goal-Oriented SLA Scheduling
Goal-oriented SLA scheduling policies help you configure your workload so that jobs complete on time, which reduces the risk of missed deadlines. They let you focus on the "what and when" of your projects rather than the low-level details of how resources must be allocated to satisfy various workloads.
Service-level agreements in LSF
A service-level agreement (SLA) defines how a service is delivered and the parameters for the delivery of a service. It specifies what a service provider and a service recipient agree to, defining the relationship between the provider and recipient with respect to a number of issues, among them:
- Services to be delivered
- Performance
- Tracking and reporting
- Problem management
An SLA in LSF is a "just-in-time" scheduling policy that defines an agreement between LSF administrators and LSF users. The SLA scheduling policy defines how many jobs should be run from each SLA to meet the configured goals.
restriction:
LSF MultiCluster does not support SLAs.
Service classes
SLA definitions consist of service-level goals that are expressed in individual service classes. A service class is the actual configured policy that sets the service-level goals for the LSF system. The SLA defines the workload (jobs or other services) and users that need the work done, while the service class that addresses the SLA defines individual goals, and a time window when the service class is active.
Service-level goals
You configure the following kinds of goals:
Deadline goals
A specified number of jobs should be completed within a specified time window. For example, run all jobs submitted over a weekend.
Velocity goals
Expressed as concurrently running jobs. For example: maintain 10 running jobs between 9:00 a.m. and 5:00 p.m. Velocity goals are well suited for short jobs (run time less than one hour). Such jobs leave the system quickly, and configuring a velocity goal ensures a steady flow of jobs through the system.
Throughput goals
Expressed as number of finished jobs per hour. For example: finish 15 jobs per hour between the hours of 6:00 p.m. and 7:00 a.m. Throughput goals are suitable for medium to long running jobs. These jobs stay longer in the system, so you typically want to control their rate of completion rather than their flow.
Combining different types of goals
You might want to set velocity goals to maximize quick work during the day, and set deadline and throughput goals to manage longer running work on nights and over weekends.
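A day and night combination like this depends on knowing which goal's time window is open at a given moment, including windows that run past midnight. The following is a minimal sketch of such a check, not LSF's implementation. The two windows, the names `in_window` and `active_goals`, and the choice to treat a window as including its start time but not its end time are all assumptions made for illustration.

```python
from datetime import time

def in_window(now: time, start: time, end: time) -> bool:
    """Return True if `now` lies in [start, end); the window may wrap past midnight."""
    if start <= end:
        # Same-day window, for example 9:00-17:30.
        return start <= now < end
    # Overnight window, for example 17:30-9:00.
    return now >= start or now < end

# Hypothetical goal list: a daytime velocity goal and an overnight deadline goal.
GOALS = [
    ("VELOCITY", time(9, 0), time(17, 30)),
    ("DEADLINE", time(17, 30), time(9, 0)),
]

def active_goals(now: time) -> list:
    """Return the goal types whose window is open at `now`."""
    return [kind for kind, start, end in GOALS if in_window(now, start, end)]

print(active_goals(time(10, 0)))   # daytime
print(active_goals(time(23, 0)))   # overnight, before midnight
print(active_goals(time(3, 0)))    # overnight, after midnight
```

The only subtle case is the overnight window, where the start time is later than the end time, so the check becomes "after the start or before the end" rather than "between the two".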
How service classes perform goal-oriented scheduling
Goal-oriented scheduling makes use of other, lower level LSF policies like queues and host partitions to satisfy the service-level goal that the service class expresses. The decisions of a service class are considered first before any queue or host partition decisions. Limits are still enforced with respect to lower level scheduling objects like queues, hosts, and users.
Optimum number of running jobs
As jobs are submitted, LSF determines the optimum number of job slots (or concurrently running jobs) needed for the service class to meet its service-level goals. LSF schedules a number of jobs at least equal to the optimum number of slots calculated for the service class.
LSF attempts to meet SLA goals in the most efficient way, using the optimum number of job slots so that other service classes or other types of work in the cluster can still progress. For example, in a service class that defines a deadline goal, LSF spreads the work over the entire time window for the goal. It does not allocate as many slots as possible at the beginning in order to finish earlier than the deadline, which avoids blocking other work.
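The idea of spreading work across the window can be illustrated with a back-of-the-envelope estimate: divide the total remaining run time by the time left in the window. This is a simplified sketch, not the calculation LSF actually performs. The function name `estimate_slots` and the assumption that every job has the same known run time are hypothetical.

```python
import math

def estimate_slots(pending_jobs: int, avg_runtime_min: float, minutes_left: float) -> int:
    """Rough number of concurrently running jobs needed to finish before the deadline.

    Assumes all jobs take about `avg_runtime_min` minutes and can start immediately.
    """
    if minutes_left <= 0:
        raise ValueError("the deadline has already passed")
    if pending_jobs <= 0:
        return 0
    total_work = pending_jobs * avg_runtime_min      # job-minutes still to run
    return max(1, math.ceil(total_work / minutes_left))

# Six 5-minute jobs in a 30-minute window fit on a single slot.
print(estimate_slots(6, 5, 30))      # 1
# A hypothetical heavier load: 120 ten-minute jobs with four hours left.
print(estimate_slots(120, 10, 240))  # 5
```

Under this estimate, a service class needs more slots only when the remaining work no longer fits in the remaining time, which is why a deadline goal does not need to claim every free slot at the start of its window.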
Submit jobs to a service class
You submit jobs to a service class as you would to a queue, except that a service class is a higher level scheduling policy that makes use of other, lower level LSF policies like queues and host partitions to satisfy the service-level goal that the service class expresses.
The service class name where the job is to run is configured in lsb.serviceclasses. If the SLA does not exist or the user is not a member of the service class, the job is rejected.
Outside of the configured time windows, the SLA is not active, and LSF schedules jobs without enforcing any service-level goals. Jobs will flow through queues following queue priorities even if they are submitted with -sla.
- Run bsub -sla service_class_name to submit a job to a service class for SLA-driven scheduling.
  bsub -W 15 -sla Kyuquot sleep 100
  submits the UNIX command sleep together with its argument 100 as a job to the service class named Kyuquot.
Submitting with a run limit
You should submit your jobs with a run time limit at the job level (-W option), the application level (RUNLIMIT parameter in the application definition in lsb.applications), or the queue level (RUNLIMIT parameter in the queue definition in lsb.queues). You can also submit the job with a run time estimate defined at the application level (RUNTIME parameter in lsb.applications) instead of or in conjunction with the run time limit.
The following table describes how LSF uses the values that you provide for SLA-driven scheduling.
Modify SLA jobs (bmod)
- Run bmod -sla to modify the service class a job is attached to, or to attach a submitted job to a service class. Run bmod -slan to detach a job from a service class:
  bmod -sla Kyuquot 2307
  Attaches job 2307 to the service class Kyuquot.
  bmod -slan 2307
  Detaches job 2307 from the service class Kyuquot.
You cannot:
- Use -sla with other bmod options
- Move job array elements from one service class to another, only entire job arrays
- Modify the service class of jobs already attached to a job group
If a default SLA is configured in lsb.params, bmod -slan moves the job to the default SLA. If the job is already attached to the default SLA, bmod -slan has no effect on that job.
Configuring Service Classes for SLA Scheduling
Configure service classes in LSB_CONFDIR/cluster_name/configdir/lsb.serviceclasses. Each service class is defined in a ServiceClass section.
Each service class section begins with the line Begin ServiceClass and ends with the line End ServiceClass. You must specify:
- A service class name
- At least one goal (deadline, throughput, or velocity) and a time window when the goal is active
- A service class priority
All other parameters are optional. You can configure as many service class sections as you need.
important:
The name you use for your service classes cannot be the same as an existing host partition or user group name.
User groups for service classes
You can control access to the SLA by configuring a user group for the service class. If LSF user groups are specified in lsb.users, each user in the group can submit jobs to this service class. If a group contains a subgroup, the service class policy applies to each member in the subgroup recursively. The group can define fairshare among its members, and the SLA defined by the service class enforces the fairshare policy among the users in the user group configured for the SLA.
By default, all users in the cluster can submit jobs to the service class.
Service class priority
A higher value indicates a higher priority, relative to other service classes. Similar to queue priority, service classes access the cluster resources in priority order.
LSF schedules jobs from one service class at a time, starting with the highest-priority service class. If multiple service classes have the same priority, LSF runs the jobs from these service classes in the order the service classes are configured in lsb.serviceclasses.
Service class priority in LSF is completely independent of the UNIX scheduler's priority system for time-sharing processes. In LSF, the NICE parameter is used to set the UNIX time-sharing priority for batch jobs.
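The ordering rule described above (highest priority first, ties broken by configuration order) can be modelled with a stable sort. This is only an illustration of the rule, not LSF code. The `ServiceClass` dataclass, the `scheduling_order` function, and the assumption that the three classes appear in one file in this order are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ServiceClass:
    name: str
    priority: int

def scheduling_order(configured):
    """Order service classes by descending priority.

    sorted() is stable, so classes with equal priority keep the order in which
    they were configured, even with reverse=True.
    """
    return sorted(configured, key=lambda sc: sc.priority, reverse=True)

# Hypothetical configuration order.
configured = [
    ServiceClass("Uclulet", 20),
    ServiceClass("Kyuquot", 23),
    ServiceClass("Tofino", 20),
]

for sc in scheduling_order(configured):
    print(sc.name, sc.priority)
```

Here Kyuquot is considered first because it has the highest priority, and Uclulet precedes Tofino only because it is configured earlier.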
Service class configuration examples
- The service class Uclulet defines one deadline goal that is active during working hours between 8:30 AM and 4:00 PM. All jobs in the service class should complete by the end of the specified time window. Outside of this time window, the SLA is inactive and jobs are scheduled without any goal being enforced:
  Begin ServiceClass
  NAME = Uclulet
  PRIORITY = 20
  GOALS = [DEADLINE timeWindow (8:30-16:00)]
  DESCRIPTION = "working hours"
  End ServiceClass
- The service class Nanaimo defines a deadline goal that is active during the weekends and at nights.
  Begin ServiceClass
  NAME = Nanaimo
  PRIORITY = 20
  GOALS = [DEADLINE timeWindow (5:18:00-1:8:30 20:00-8:30)]
  DESCRIPTION = "weekend nighttime regression tests"
  End ServiceClass
- The service class Inuvik defines a throughput goal of 6 jobs per hour that is always active:
  Begin ServiceClass
  NAME = Inuvik
  PRIORITY = 20
  GOALS = [THROUGHPUT 6 timeWindow ()]
  DESCRIPTION = "constant throughput"
  End ServiceClass
  tip:
  To configure a time window that is always open, use the timeWindow keyword with empty parentheses.
- The service class Tofino defines two velocity goals in a 24 hour period. The first goal is to have a maximum of 10 concurrently running jobs during business hours (9:00 a.m. to 5:00 p.m.). The second goal is a maximum of 30 concurrently running jobs during off-hours (5:30 p.m. to 8:30 a.m.).
  Begin ServiceClass
  NAME = Tofino
  PRIORITY = 20
  GOALS = [VELOCITY 10 timeWindow (9:00-17:00)] \
          [VELOCITY 30 timeWindow (17:30-8:30)]
  DESCRIPTION = "day and night velocity"
  End ServiceClass
- The service class Kyuquot defines a velocity goal that is active during working hours (9:00 a.m. to 5:30 p.m.) and a deadline goal that is active during off-hours (5:30 p.m. to 9:00 a.m.). Only users user1 and user2 can submit jobs to this service class.
  Begin ServiceClass
  NAME = Kyuquot
  PRIORITY = 23
  USER_GROUP = user1 user2
  GOALS = [VELOCITY 8 timeWindow (9:00-17:30)] \
          [DEADLINE timeWindow (17:30-9:00)]
  DESCRIPTION = "Daytime/Nighttime SLA"
  End ServiceClass
- The service class Tevere defines a combination similar to Kyuquot, but with a deadline goal that takes effect overnight and on weekends. During working hours on weekdays, the velocity goal favors a mix of short and medium jobs.
  Begin ServiceClass
  NAME = Tevere
  PRIORITY = 20
  GOALS = [VELOCITY 100 timeWindow (9:00-17:00)] \
          [DEADLINE timeWindow (17:30-8:30 5:17:30-1:8:30)]
  DESCRIPTION = "nine to five"
  End ServiceClass
View Information about SLAs and Service Classes
Monitor the progress of an SLA (bsla)
- Run bsla to display the properties of service classes configured in lsb.serviceclasses and dynamic information about the state of each configured service class.
Examples
- One velocity goal of service class Tofino is active and on time. The other configured velocity goal is inactive.
  bsla
  SERVICE CLASS NAME: Tofino
   -- day and night velocity
  PRIORITY: 20

  GOAL: VELOCITY 30
  ACTIVE WINDOW: (17:30-8:30)
  STATUS: Inactive
  SLA THROUGHPUT: 0.00 JOBS/CLEAN_PERIOD

  GOAL: VELOCITY 10
  ACTIVE WINDOW: (9:00-17:00)
  STATUS: Active:On time
  SLA THROUGHPUT: 10.00 JOBS/CLEAN_PERIOD

  NJOBS   PEND    RUN   SSUSP   USUSP   FINISH
    300    280     10       0       0       10
- The deadline goal of service class Uclulet is not being met, and bsla displays status Active:Delayed:
  bsla
  SERVICE CLASS NAME: Uclulet
   -- working hours
  PRIORITY: 20

  GOAL: DEADLINE
  ACTIVE WINDOW: (8:30-19:00)
  STATUS: Active:Delayed
  SLA THROUGHPUT: 0.00 JOBS/CLEAN_PERIOD
  ESTIMATED FINISH TIME: (Tue Oct 28 06:17)
  OPTIMUM NUMBER OF RUNNING JOBS: 6

  NJOBS   PEND    RUN   SSUSP   USUSP   FINISH
     40     39      1       0       0        0
- The configured velocity goal of the service class Kyuquot is active and on time. The configured deadline goal of the service class is inactive.
  bsla Kyuquot
  SERVICE CLASS NAME: Kyuquot
   -- Daytime/Nighttime SLA
  PRIORITY: 23
  USER_GROUP: user1 user2

  GOAL: VELOCITY 8
  ACTIVE WINDOW: (9:00-17:30)
  STATUS: Active:On time
  SLA THROUGHPUT: 0.00 JOBS/CLEAN_PERIOD

  GOAL: DEADLINE
  ACTIVE WINDOW: (17:30-9:00)
  STATUS: Inactive
  SLA THROUGHPUT: 0.00 JOBS/CLEAN_PERIOD

  NJOBS   PEND    RUN   SSUSP   USUSP   FINISH
      0      0      0       0       0        0
- The throughput goal of service class Inuvik is always active. bsla displays:
  - Status as active and on time
  - An optimum number of 5 running jobs to meet the goal
  - Actual throughput of 10 jobs per hour based on the last CLEAN_PERIOD
  bsla Inuvik
  SERVICE CLASS NAME: Inuvik
   -- constant throughput
  PRIORITY: 20

  GOAL: THROUGHPUT 6
  ACTIVE WINDOW: Always Open
  STATUS: Active:On time
  SLA THROUGHPUT: 10.00 JOBs/CLEAN_PERIOD
  OPTIMUM NUMBER OF RUNNING JOBS: 5

  NJOBS   PEND    RUN   SSUSP   USUSP   FINISH
    110     95      5       0       0       10
View jobs running in an SLA (bjobs)
- Run bjobs -sla to display jobs running in a service class:
  bjobs -sla Inuvik
  JOBID   USER    STAT  QUEUE    FROM_HOST   EXEC_HOST   JOB_NAME    SUBMIT_TIME
  136     user1   RUN   normal   hostA       hostA       sleep 100   Sep 28 13:24
  137     user1   RUN   normal   hostA       hostB       sleep 100   Sep 28 13:25
Use -sla with -g to display job groups attached to a service class. Once a job group is attached to a service class, all jobs submitted to that group are subject to the SLA.
Track historical behavior of an SLA (bacct)
- Run bacct to display historical performance of a service class. For example, service classes Inuvik and Tuktoyaktuk configure throughput goals.
  bsla
  SERVICE CLASS NAME: Inuvik
   -- throughput 6
  PRIORITY: 20

  GOAL: THROUGHPUT 6
  ACTIVE WINDOW: Always Open
  STATUS: Active:On time
  SLA THROUGHPUT: 10.00 JOBs/CLEAN_PERIOD
  OPTIMUM NUMBER OF RUNNING JOBS: 5

  NJOBS   PEND    RUN   SSUSP   USUSP   FINISH
    111     94      5       0       0       12
  --------------------------------------------------------------
  SERVICE CLASS NAME: Tuktoyaktuk
   -- throughput 3
  PRIORITY: 15

  GOAL: THROUGHPUT 3
  ACTIVE WINDOW: Always Open
  STATUS: Active:On time
  SLA THROUGHPUT: 4.00 JOBs/CLEAN_PERIOD
  OPTIMUM NUMBER OF RUNNING JOBS: 4

  NJOBS   PEND    RUN   SSUSP   USUSP   FINISH
    104     96      4       0       0        4
These two service classes have the following historical performance. For SLA Inuvik, bacct shows a total throughput of 8.94 jobs per hour over a period of 20.58 hours:
  bacct -sla Inuvik
  Accounting information about jobs that are:
    - submitted by users user1,
    - accounted on all projects.
    - completed normally or exited
    - executed on all hosts.
    - submitted to all queues.
    - accounted on service classes Inuvik,
  ------------------------------------------------------------------------------
  SUMMARY:      ( time unit: second )
   Total number of done jobs:     183      Total number of exited jobs:     1
   Total CPU time consumed:      40.0      Average CPU time consumed:     0.2
   Maximum CPU time of a job:     0.3      Minimum CPU time of a job:     0.1
   Total wait time in queues: 1947454.0
   Average wait time in queue:10584.0
   Maximum wait time in queue:18912.0      Minimum wait time in queue:    7.0
   Average turnaround time:     12268 (seconds/job)
   Maximum turnaround time:     22079      Minimum turnaround time:      1713
   Average hog factor of a job:  0.00 ( cpu time / turnaround time )
   Maximum hog factor of a job:  0.00      Minimum hog factor of a job:  0.00
   Total throughput:             8.94 (jobs/hour)  during   20.58 hours
   Beginning time:       Oct 11 20:23      Ending time:          Oct 12 16:58
For SLA Tuktoyaktuk, bacct shows a total throughput of 4.36 jobs per hour over a period of 19.95 hours:
  bacct -sla Tuktoyaktuk
  Accounting information about jobs that are:
    - submitted by users user1,
    - accounted on all projects.
    - completed normally or exited
    - executed on all hosts.
    - submitted to all queues.
    - accounted on service classes Tuktoyaktuk,
  ------------------------------------------------------------------------------
  SUMMARY:      ( time unit: second )
   Total number of done jobs:      87      Total number of exited jobs:     0
   Total CPU time consumed:      18.0      Average CPU time consumed:     0.2
   Maximum CPU time of a job:     0.3      Minimum CPU time of a job:     0.1
   Total wait time in queues: 2371955.0
   Average wait time in queue:27263.8
   Maximum wait time in queue:39125.0      Minimum wait time in queue:    7.0
   Average turnaround time:     30596 (seconds/job)
   Maximum turnaround time:     44778      Minimum turnaround time:      3355
   Average hog factor of a job:  0.00 ( cpu time / turnaround time )
   Maximum hog factor of a job:  0.00      Minimum hog factor of a job:  0.00
   Total throughput:             4.36 (jobs/hour)  during   19.95 hours
   Beginning time:       Oct 11 20:50      Ending time:          Oct 12 16:47
Because the run times are not uniform, both service classes actually achieve higher throughput than configured.
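The throughput figures in the two bacct summaries can be reproduced from the other fields. The reported numbers are consistent with dividing the count of done plus exited jobs by the hours between the beginning and ending times. The sketch below checks that arithmetic. The function name `throughput` is hypothetical, and the year is an arbitrary assumption because the output does not show one.

```python
from datetime import datetime

YEAR = 2003  # assumption: the bacct output above omits the year

def throughput(done: int, exited: int, begin: datetime, end: datetime):
    """Return (jobs per hour, elapsed hours) for a bacct-style summary."""
    hours = (end - begin).total_seconds() / 3600
    return (done + exited) / hours, hours

inuvik_rate, inuvik_hours = throughput(
    183, 1, datetime(YEAR, 10, 11, 20, 23), datetime(YEAR, 10, 12, 16, 58))
tuk_rate, tuk_hours = throughput(
    87, 0, datetime(YEAR, 10, 11, 20, 50), datetime(YEAR, 10, 12, 16, 47))

print(f"Inuvik:      {inuvik_rate:.2f} jobs/hour during {inuvik_hours:.2f} hours")
print(f"Tuktoyaktuk: {tuk_rate:.2f} jobs/hour during {tuk_hours:.2f} hours")
```

Both results round to the values bacct reports (8.94 over 20.58 hours and 4.36 over 19.95 hours), and both exceed the configured goals of 6 and 3 jobs per hour.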
Understanding Service Class Behavior
A simple deadline goal
The following service class configures an SLA with a simple deadline goal with a half hour time window.
Begin ServiceClass
NAME = Quadra
PRIORITY = 20
GOALS = [DEADLINE timeWindow (16:15-16:45)]
DESCRIPTION = short window
End ServiceClass
Six jobs submitted with a run time of 5 minutes each will use 1 slot for the half hour time window. bsla shows that the deadline can be met:
bsla Quadra
SERVICE CLASS NAME: Quadra
 -- short window
PRIORITY: 20

GOAL: DEADLINE
ACTIVE WINDOW: (16:15-16:45)
STATUS: Active:On time
ESTIMATED FINISH TIME: (Wed Jul 2 16:38)
OPTIMUM NUMBER OF RUNNING JOBS: 1

NJOBS   PEND    RUN   SSUSP   USUSP   FINISH
    6      5      1       0       0        0
The following illustrates the progress of the SLA to the deadline. The optimum number of running jobs in the service class (nrun) is maintained at a steady rate of 1 job at a time until near the completion of the SLA.
When the finished job curve (nfinished) meets the total number of jobs curve (njobs) the deadline is met. All jobs are finished well ahead of the actual configured deadline, and the goal of the SLA was met.
An overnight run with two service classes
bsla shows the configuration and status of two service classes Qualicum and Comox:
Qualicum has a deadline goal with a time window which is active overnight:
bsla Qualicum
SERVICE CLASS NAME: Qualicum
PRIORITY: 23

GOAL: VELOCITY 8
ACTIVE WINDOW: (8:00-18:00)
STATUS: Inactive
SLA THROUGHPUT: 0.00 JOBS/CLEAN_PERIOD

GOAL: DEADLINE
ACTIVE WINDOW: (18:00-8:00)
STATUS: Active:On time
ESTIMATED FINISH TIME: (Thu Jul 10 07:53)
OPTIMUM NUMBER OF RUNNING JOBS: 2

NJOBS   PEND    RUN   SSUSP   USUSP   FINISH
  280    278      2       0       0        0
The following illustrates the progress of the deadline SLA Qualicum running 280 jobs overnight with random runtimes until the morning deadline. As with the simple deadline goal example, when the finished job curve (nfinished) meets the total number of jobs curve (njobs) the deadline is met with all jobs completed ahead of the configured deadline.
Comox has a velocity goal of 2 concurrently running jobs that is always active:
bsla Comox
SERVICE CLASS NAME: Comox
PRIORITY: 20

GOAL: VELOCITY 2
ACTIVE WINDOW: Always Open
STATUS: Active:On time
SLA THROUGHPUT: 2.00 JOBS/CLEAN_PERIOD

NJOBS   PEND    RUN   SSUSP   USUSP   FINISH
  100     98      2       0       0        0
The following illustrates the progress of the velocity SLA Comox running 100 jobs with random runtimes over a 14 hour period.
When an SLA is missing its goal
- Use the CONTROL_ACTION parameter in your service class to configure an action to be run if the SLA goal is delayed for a specified number of minutes.
CONTROL_ACTION (lsb.serviceclasses)
CONTROL_ACTION=VIOLATION_PERIOD[minutes] CMD [action]
If the SLA goal is delayed for longer than VIOLATION_PERIOD, the action specified by CMD is invoked. The violation period is reset and the action runs again if the SLA is still active when the violation period expires again. If the SLA has multiple active goals that are in violation, the action is run for each of them.
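The reset-and-repeat behavior can be pictured with a small simulation for a single goal. This is a sketch under stated assumptions, not documented LSF behavior. It samples the goal once per minute, fires as soon as the continuous delay reaches the violation period, and restarts the timer when the goal recovers. The function name `action_minutes` is hypothetical.

```python
def action_minutes(delayed_by_minute, violation_period):
    """Return the minutes (1-based) at which the control action would run.

    delayed_by_minute: one boolean per minute, True while the goal is delayed.
    violation_period:  number of continuously delayed minutes before the action runs.
    """
    fired = []
    streak = 0  # consecutive delayed minutes since the last reset
    for minute, delayed in enumerate(delayed_by_minute, start=1):
        if not delayed:
            streak = 0          # assumption: recovery restarts the timer
            continue
        streak += 1
        if streak >= violation_period:
            fired.append(minute)
            streak = 0          # the violation period is reset after the action runs
    return fired

# Delayed for 25 minutes straight with a 10-minute violation period.
print(action_minutes([True] * 25, 10))                      # [10, 20]
# Two 9-minute delays separated by a recovery never reach the period.
print(action_minutes([True] * 9 + [False] + [True] * 9, 10))  # []
```

In this model a long-running violation triggers the action repeatedly, once per violation period, which matches the description of the period being reset while the SLA remains active.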
Example
CONTROL_ACTION=VIOLATION_PERIOD[10] CMD [echo `date`: SLA is in violation >> ! /tmp/sla_violation.log]
Preemption and SLA policies
SLA jobs cannot be preempted. You should avoid running jobs belonging to an SLA in low priority queues.
Chunk jobs and SLA policies
SLA jobs will not get chunked. You should avoid submitting SLA jobs to a chunk job queue.
SLA statistics files
Each active SLA goal generates a statistics file for monitoring and analyzing the system. When the goal becomes inactive the file is no longer updated. The files are created in the LSB_SHAREDIR/cluster_name/logdir/SLA directory. Each file name consists of the name of the service class and the goal type.
For example, the file named Quadra.deadline is created for the deadline goal of the service class named Quadra. The following file named Tofino.velocity refers to a velocity goal of the service class named Tofino:
cat Tofino.velocity
# service class Tofino velocity, NJOBS, NPEND (NRUN + NSSUSP + NUSUSP), (NDONE + NEXIT)
17/9 15:7:34 1063782454 2 0 0 0 0
17/9 15:8:34 1063782514 2 0 0 0 0
17/9 15:9:34 1063782574 2 0 0 0 0
# service class Tofino velocity, NJOBS, NPEND (NRUN + NSSUSP + NUSUSP), (NDONE + NEXIT)
17/9 15:10:10 1063782610 2 0 0 0 0
Resizable jobs and SLA scheduling
For resizable job allocation requests, since the job itself has already started to run, LSF bypasses dispatch rate checking and continues scheduling the allocation request.
Job groups and SLA scheduling
Job groups provide a method for assigning arbitrary labels to groups of jobs. Typically, job groups represent a project hierarchy. You can use -g with -sla at job submission to attach all jobs in a job group to a service class and have them scheduled as SLA jobs and subject to the scheduling policy of the SLA. Within the job group, resources are allocated to jobs on a fairshare basis.
All jobs submitted to a group under an SLA automatically belong to the SLA itself. You cannot modify a job group of a job that is attached to an SLA.
A job group hierarchy can belong to only one SLA.
A job group cannot contain a mix of jobs that belong to the service class and jobs that do not. Multiple job groups can be created under the same SLA. You can submit additional jobs to the job group without specifying the service class name again.
If the specified job group does not exist, it is created and attached to the SLA.
You can also use -sla to specify a service class when you create a job group with bgadd.
View job groups attached to an SLA (bjgroup)
- Run bjgroup to display job groups attached to a service class:
  bjgroup
  GROUP_NAME   NJOBS   PEND   RUN   SSUSP   USUSP   FINISH   SLA       JLIMIT   OWNER
  /fund1_grp       5      4     0       1       0        0   Venezia   1/5      user1
  /fund2_grp      11      2     5       0       0        4   Venezia   5/5      user1
  /bond_grp        2      2     0       0       0        0   Venezia   0/-      user2
  /risk_grp        2      1     1       0       0        0   ()        1/-      user2
  /admi_grp        4      4     0       0       0        0   ()        0/-      user2
  bjgroup displays the name of the service class that the job group is attached to with bgadd -sla service_class_name. If the job group is not attached to any service class, empty parentheses () are displayed in the SLA name column.
Platform Computing Inc.
www.platform.com