
Fairshare Scheduling

To configure any kind of fairshare scheduling, you should understand the following concepts: user share assignments, dynamic user priority, and how fairshare affects job dispatch order. These concepts are described in the sections that follow.

You can configure fairshare at either host level or queue level. If you require more control, you can implement hierarchical fairshare. You can also set some additional restrictions when you submit a job.

To get ideas about how to use fairshare scheduling to do different things, see Ways to Configure Fairshare.

Understanding Fairshare Scheduling

By default, LSF considers jobs for dispatch in the same order as they appear in the queue (which is not necessarily the order in which they are submitted to the queue). This is called first-come, first-served (FCFS) scheduling.

Fairshare scheduling divides the processing power of the LSF cluster among users and queues to provide fair access to resources, so that no user or queue can monopolize the resources of the cluster and no queue will be starved.

If your cluster has many users competing for limited resources, the FCFS policy might not be enough. For example, one user could submit many long jobs at once and monopolize the cluster's resources for a long time, while other users submit urgent jobs that must wait in queues until all the first user's jobs are done. To prevent this, use fairshare scheduling to control how resources should be shared by competing users.

Fairshare is not necessarily equal share: you can assign a higher priority to the most important users. If two users compete for resources, you can give all the resources to the more important user, give that user a larger share of the resources, or share the resources so that both users have equal importance.

Queue-level vs. host partition fairshare

You can configure fairshare at either the queue level or the host level. However, these types of fairshare scheduling are mutually exclusive. You cannot configure queue-level fairshare and host partition fairshare in the same cluster.

If you want a user's priority in one queue to depend on their activity in another queue, you must use cross-queue fairshare or host-level fairshare.

Fairshare policies

A fairshare policy defines the order in which LSF attempts to place jobs that are in a queue or a host partition. You can have multiple fairshare policies in a cluster, one for every different queue or host partition. You can also configure some queues or host partitions with fairshare scheduling, and leave the rest using FCFS scheduling.

How fairshare scheduling works

Each fairshare policy assigns a fixed number of shares to each user or group. These shares represent a fraction of the resources that are available in the cluster. The most important users or groups are the ones with the most shares. Users who have no shares cannot run jobs in the queue or host partition.

A user's dynamic priority depends on their share assignment, the dynamic priority formula, and the resources their jobs have already consumed.

The order of jobs in the queue is secondary. The most important thing is the dynamic priority of the user who submitted the job. When fairshare scheduling is used, LSF tries to place the first job in the queue that belongs to the user with the highest dynamic priority.

User Share Assignments

Both queue-level and host partition fairshare use the following syntax to define how shares are assigned to users or user groups.

Syntax

[user, number_shares]

Enclose each user share assignment in square brackets, as shown. Separate multiple share assignments with a space between each set of square brackets.

user

Specify users of the queue or host partition. You can assign the shares to a single user (specify user_name), to users in a group individually (specify group_name@) or collectively (specify group_name), or to users not included in any other share assignment, individually (specify the keyword default) or collectively (specify the keyword others).

By default, when resources are assigned collectively to a group, the group members compete for the resources according to FCFS scheduling. You can use hierarchical fairshare to further divide the shares among the group members.

When resources are assigned to members of a group individually, the share assignment is recursive. Members of the group and of all subgroups always compete for the resources according to FCFS scheduling, regardless of hierarchical fairshare policies.

number_shares

Specify a positive integer representing the number of shares of cluster resources assigned to the user.

The number of shares assigned to each user is only meaningful when you compare it to the shares assigned to other users, or to the total number of shares. The total number of shares is just the sum of all the shares assigned in each share assignment.

Examples

[User1, 1] [GroupB, 1] 

Assigns 2 shares: 1 to User1, and 1 to be shared by the users in GroupB. Each user in GroupB has equal importance. User1 is as important as all the users in GroupB put together. In this example, it does not matter if the number of shares is 1, 6 or 600. As long as User1 and GroupB are both assigned the same number of shares, the relationship stays the same.

[User1, 10] [GroupB@, 1] 

If GroupB contains 10 users, assigns 20 shares in total: 10 to User1, and 1 to each user in GroupB. Each user in GroupB has equal importance. User1 is ten times as important as any user in GroupB.

[User1, 10] [User2, 9] [others, 8] 

Assigns 27 shares: 10 to User1, 9 to User2, and 8 to the remaining users, as a group. User1 is slightly more important than User2. Each of the remaining users has equal importance.

[User1, 10] [User2, 6] [default, 4] 

Assigns 10 shares to User1, 6 to User2, and 4 to each user who is not otherwise named in the share assignment. The relative percentage of shares held by a user changes depending on the number of users who are granted shares by default.

Dynamic User Priority

LSF calculates a dynamic user priority for individual users or for a group, depending on how the shares are assigned. The priority is dynamic because it changes as soon as any variable in the formula changes. By default, a user's dynamic priority gradually decreases after a job starts, and the dynamic priority immediately increases when the job finishes.

How LSF calculates dynamic priority

By default, LSF calculates the dynamic priority for each user based on the number of shares assigned to the user and the resources used by jobs belonging to the user: the number of job slots reserved and in use, the run time of running jobs, and the cumulative actual CPU time (not normalized), adjusted so that recently used CPU time is weighted more heavily than CPU time used in the distant past.

If you enable additional functionality, the formula can also involve additional resources used by jobs belonging to the user, such as the historical run time of finished jobs, the committed run time of running jobs (see Using Historical and Committed Run Time), and the adjustment calculated by the fairshare adjustment plugin.

How LSF measures fairshare resource usage

LSF measures resource usage differently, depending on the type of fairshare. For queue-level fairshare, LSF measures the resource consumption of all the user's jobs in the queue, so a user's dynamic priority can be different in every queue. For host partition fairshare, LSF measures the resource consumption of all the user's jobs that run on hosts in the host partition, regardless of which queue they were submitted to, so a user's dynamic priority is the same in every queue that uses the host partition.

Default dynamic priority formula

By default, LSF calculates dynamic priority according to the following formula:

dynamic priority = number_shares / (cpu_time * CPU_TIME_FACTOR + run_time * RUN_TIME_FACTOR + (1 + job_slots) * RUN_JOB_FACTOR + fairshare_adjustment * FAIRSHARE_ADJUSTMENT_FACTOR)

note:  
The maximum value of dynamic user priority is 100 times the number of user shares (if the denominator in the calculation is less than 0.01, LSF rounds up to 0.01).

For cpu_time, run_time, and job_slots, LSF uses the total resource consumption of all the jobs in the queue or host partition that belong to the user or group.

number_shares

The number of shares assigned to the user.

cpu_time

The cumulative CPU time used by the user (measured in hours). LSF calculates the cumulative CPU time using the actual (not normalized) CPU time and a decay factor such that 1 hour of recently-used CPU time decays to 0.1 hours after an interval of time specified by HIST_HOURS in lsb.params (5 hours by default).

run_time

The total run time of running jobs (measured in hours).

job_slots

The number of job slots reserved and in use.

fairshare_adjustment

The adjustment calculated by the fairshare adjustment plugin (libfairshareadjust.*).

Configuring the default dynamic priority

You can give additional weight to the various factors in the priority calculation by setting the following parameters in lsb.params.

If you modify the parameters used in the dynamic priority formula, it affects every fairshare policy in the cluster.

CPU_TIME_FACTOR

The CPU time weighting factor.

Default: 0.7

RUN_TIME_FACTOR

The run time weighting factor.

Default: 0.7

RUN_JOB_FACTOR

The job slots weighting factor.

Default: 3

FAIRSHARE_ADJUSTMENT_FACTOR

The fairshare plugin (libfairshareadjust.*) weighting factor.

Default: 0

HIST_HOURS

The decay interval (in hours) for accumulated resource consumption history (CPU time and run time).

Default: 5
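
For example, with the default factor values above and hypothetical usage numbers, a user with 100 shares whose jobs have accumulated 2 hours of decayed CPU time and 1 hour of run time, and currently occupy 4 job slots, has the following dynamic priority:

dynamic priority = 100 / (2 * 0.7 + 1 * 0.7 + (1 + 4) * 3 + 0)
                 = 100 / 17.1
                 = 5.85 (approximately)

A second user with the same resource consumption but only 10 shares has a dynamic priority of about 0.58, so pending jobs belonging to the first user are considered first.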

Customizing the dynamic priority

In some cases the dynamic priority equation may require adjustments beyond the run time, cpu time, and job slot dependencies provided by default. The fairshare adjustment plugin is open source and can be customized once you identify specific requirements for dynamic priority.

All information used by the default priority equation (except the user shares) is passed to the fairshare plugin. In addition, the fairshare plugin is provided with current memory use over the entire cluster and the average memory allocated to a slot in the cluster.

note:  
If you modify the parameters used in the dynamic priority formula, it affects every fairshare policy in the cluster. The fairshare adjustment plugin (libfairshareadjust.*) is not queue-specific.
Example

Jobs assigned to a single slot on a host can consume host memory to the point that other slots on the host are left unusable. The default dynamic priority calculation considers job slots used, but does not account for unused job slots that are effectively blocked by another job.

The fairshare adjustment plugin example code provided by Platform LSF is found in the examples directory of your installation, and implements a memory-based dynamic priority adjustment as follows:

fairshare_adjustment = (1 + slots) * ((total_memory / slots) / (slot_memory * THRESHOLD))

slots

The number of job slots in use by started jobs.

total_memory

The total memory in use by started jobs.

slot_memory

The average memory allocated per slot.

THRESHOLD

The memory threshold set in the fairshare adjustment plugin.
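
For example, with hypothetical numbers, suppose a user's started jobs occupy 4 slots and use a total of 16 GB of memory, the average memory allocated per slot in the cluster is 2 GB, and THRESHOLD is set to 1 in the plugin:

fairshare_adjustment = (1 + 4) * ((16 / 4) / (2 * 1))
                     = 5 * 2
                     = 10

Because the adjustment appears in the denominator of the dynamic priority formula, the more memory a user's jobs consume per occupied slot relative to the average slot memory, the larger the adjustment and the lower that user's dynamic priority (when FAIRSHARE_ADJUSTMENT_FACTOR is greater than 0).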

How Fairshare Affects Job Dispatch Order

Within a queue, jobs are dispatched according to the queue's scheduling policy.

A user's priority gets higher when they use less than their fair share of the cluster's resources. When a user has the highest priority, LSF considers one of their jobs first, even if other users are ahead of them in the queue.

If only one user has jobs pending, and you do not use hierarchical fairshare, there is no resource contention between users, so the fairshare policies have no effect and jobs are dispatched as usual.

Job dispatch order among queues of equivalent priority

The order of dispatch depends on the order of the queues in the queue configuration file. The first queue in the list is the first to be scheduled.

Jobs in a fairshare queue are always considered as a group, so the scheduler attempts to place all jobs in the queue before beginning to schedule the next queue.

Jobs in an FCFS queue are always scheduled along with jobs from other FCFS queues of the same priority (as if all the jobs belonged to the same queue).

Example

In a cluster, queues A, B, and C are configured in that order and have equal queue priority.

Jobs with equal job priority are submitted to each queue in this order: C B A B A.

If all three queues are FCFS queues, they are treated as a single queue and the jobs are considered in the order they were submitted: C B A B A. If all three queues are fairshare queues, each queue's jobs are considered as a group, starting with the first queue in the configuration file: A A B B C. If A is a fairshare queue and B and C are FCFS queues, A's jobs are considered first as a group, and the jobs in B and C are then considered together in submission order: A A C B B.

Host Partition User-based Fairshare

User-based fairshare policies configured at the host level handle resource contention across multiple queues.

You can define a different fairshare policy for every host partition. If multiple queues use the host partition, a user has the same priority across multiple queues.

To run a job on a host that has fairshare, users must have a share assignment (USER_SHARES in the HostPartition section of lsb.hosts). Even cluster administrators cannot submit jobs to a fairshare host if they do not have a share assignment.

View host partition information

  1. Use bhpart to view the host partitions configured in your cluster and, for each user or group in a partition, the number of shares, dynamic share priority, number of started and reserved jobs, CPU time, and run time.

Configure host partition fairshare scheduling

  1. To configure host partition fairshare, define a host partition in lsb.hosts.
  2. Use the following format.

    Begin HostPartition
    HPART_NAME = Partition1
    HOSTS = hostA hostB ~hostC
    USER_SHARES = [groupA@, 3] [groupB, 7] [default, 1]
    End HostPartition 
    

Queue-level User-based Fairshare

User-based fairshare policies configured at the queue level handle resource contention among users in the same queue. You can define a different fairshare policy for every queue, even if they share the same hosts. A user's priority is calculated separately for each queue.

To submit jobs to a fairshare queue, users must be allowed to use the queue (USERS in lsb.queues) and must have a share assignment (FAIRSHARE in lsb.queues). Even cluster and queue administrators cannot submit jobs to a fairshare queue if they do not have a share assignment.

View queue-level fairshare information

  1. To find out if a queue is a fairshare queue, run bqueues -l. If you see "USER_SHARES" in the output, then a fairshare policy is configured for the queue.

Configure queue-level fairshare

  1. To configure a fairshare queue, define FAIRSHARE in lsb.queues and specify a share assignment for all users of the queue:
  2. FAIRSHARE = USER_SHARES[[user, number_shares]...] 
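
For example, a minimal fairshare queue definition in lsb.queues might look like the following sketch (the queue name, priority, and share values are illustrative):

Begin Queue
QUEUE_NAME = fair_queue
PRIORITY   = 30
FAIRSHARE  = USER_SHARES[[user1, 100] [default, 1]]
End Queue

Every user who should be able to run jobs in the queue needs a share assignment, either by name or through the others or default keywords.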
    

Cross-queue User-based Fairshare

User-based fairshare policies configured at the queue level handle resource contention across multiple queues.

Applying the same fairshare policy to several queues

With cross-queue fairshare, the same user-based fairshare policy can apply to several queues at the same time. You define the fairshare policy in a master queue and list the slave queues to which the same fairshare policy applies; the slave queues inherit the same fairshare policy as the master queue. For job scheduling purposes, this is equivalent to having one queue with one fairshare tree.

In this way, if a user submits jobs to different queues, user priority is calculated by taking into account all the jobs the user has submitted across the defined queues.

To submit jobs to a fairshare queue, users must be allowed to use the queue (USERS in lsb.queues) and must have a share assignment (FAIRSHARE in lsb.queues). Even cluster and queue administrators cannot submit jobs to a fairshare queue if they do not have a share assignment.

User and queue priority

By default, a user has the same priority across the master and slave queues. If the same user submits several jobs to these queues, user priority is calculated by taking into account all the jobs the user has submitted across the master-slave set.

If DISPATCH_ORDER=QUEUE is set in the master queue, jobs are dispatched according to queue priorities first, then user priority. This avoids having users with higher fairshare priority getting jobs dispatched from low-priority queues.

Jobs from users with lower fairshare priorities who have pending jobs in higher priority queues are dispatched before jobs in lower priority queues. Jobs in queues having the same priority are dispatched according to user priority.

Queues that are not part of the ordered cross-queue fairshare can have any priority. Their priority can fall within the priority range of cross-queue fairshare queues and they can be inserted between two queues using the same fairshare tree.

View cross-queue fairshare information

  1. Run bqueues -l to find out whether a queue is part of a cross-queue fairshare set.
  2. The FAIRSHARE_QUEUES parameter indicates cross-queue fairshare. The first queue listed in the FAIRSHARE_QUEUES parameter is the master queue (the queue in which fairshare is configured); all other queues listed inherit the fairshare policy from the master queue.

    All queues that participate in the same cross-queue fairshare display the same fairshare information (SCHEDULING POLICIES, FAIRSHARE_QUEUES, USER_SHARES, SHARE_INFO_FOR) when bqueues -l is used. Fairshare information applies to all the jobs running in all the queues in the master-slave set.

    bqueues -l also displays DISPATCH_ORDER in the master queue if it is defined.

    bqueues
    QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP 
    normal           30   Open:Active      -    -    -    -     1     1     0     0
    short            40   Open:Active      -    4    2    -     1     0     1     0
    license          50   Open:Active      10   1    1    -     1     0     1     0 
    bqueues -l normal
    QUEUE: normal
    -- For normal low priority jobs, running only if hosts are lightly loaded.  This is 
    the default queue. 
    PARAMETERS/STATISTICS
    PRIO NICE STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN SSUSP USUSP  RSV 
    30    20  Open:Inact_Win      -    -    -    -     1     1        0     0     0    0 
    SCHEDULING PARAMETERS
    r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
    loadSched   -     -     -     -       -     -    -     -     -      -      - 
    loadStop    -     -     -     -       -     -    -     -     -      -      - 
    
                 cpuspeed    bandwidth 
    loadSched          -            - 
    loadStop           -            -  
    SCHEDULING POLICIES:  FAIRSHARE
    FAIRSHARE_QUEUES:  normal short license
    USER_SHARES:  [user1, 100] [default, 1]  
    SHARE_INFO_FOR: normal/ 
    USER/GROUP   SHARES  PRIORITY  STARTED  RESERVED  CPU_TIME  RUN_TIME ADJUST
    user1            100      9.645      2        0         0.2     7034    0.000 
    USERS:  all users 
    HOSTS:  all  
    ... 
    bqueues -l short
    QUEUE: short
    PARAMETERS/STATISTICS
    PRIO NICE STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN SSUSP USUSP  RSV 
    40    20  Open:Inact_Win   -    4    2    -  1     0       1     0     0    0 
    SCHEDULING PARAMETERS
    r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
    loadSched   -     -     -     -       -     -    -     -     -      -      - 
    loadStop    -     -     -     -       -     -    -     -     -      -      - 
    
                 cpuspeed    bandwidth 
    loadSched          -            - 
    loadStop           -            -  
    SCHEDULING POLICIES:  FAIRSHARE
    FAIRSHARE_QUEUES:  normal short license
    USER_SHARES:  [user1, 100] [default, 1]  
    SHARE_INFO_FOR: short/ 
    USER/GROUP   SHARES  PRIORITY  STARTED  RESERVED  CPU_TIME  RUN_TIME
    user1         100      9.645      2        0         0.2     7034 
    USERS:  all users 
    HOSTS:  all  
    ... 
    

Configuring cross-queue fairshare

Considerations
Configure cross-queue fairshare
  1. Decide to which queues in your cluster cross-queue fairshare will apply.
  2. For example, in your cluster you may have the queues normal, priority, short, and license and you want cross-queue fairshare to apply only to normal, license, and short.

  3. Define fairshare policies in your master queue.
  4. In the queue you want to be the master, for example normal, define FAIRSHARE in lsb.queues with a share assignment for all users of the queue, and define FAIRSHARE_QUEUES to list the slave queues. A sketch of a master queue definition appears after this procedure.

  5. In all the slave queues listed in FAIRSHARE_QUEUES, define all queue values as desired.
  6. For example:

    Begin Queue
    QUEUE_NAME    = queue2
    PRIORITY      = 40
    NICE          = 20
    UJOB_LIMIT    = 4
    PJOB_LIMIT    = 2
    End Queue 
    Begin Queue
    QUEUE_NAME    = queue3
    PRIORITY      = 50
    NICE          = 10
    PREEMPTION = PREEMPTIVE
    QJOB_LIMIT    = 10
    UJOB_LIMIT    = 1
    PJOB_LIMIT    = 1
    End Queue 
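
For reference, the master queue from steps 3 and 4 might be defined as in the following sketch. The queue name, priority, and share values are illustrative; FAIRSHARE_QUEUES lists the slave queues configured above (queue2 and queue3):

Begin Queue
QUEUE_NAME       = queue1
PRIORITY         = 30
NICE             = 20
FAIRSHARE        = USER_SHARES[[user1, 100] [default, 1]]
FAIRSHARE_QUEUES = queue2 queue3
End Queue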
    

Controlling job dispatch order in cross-queue fairshare

DISPATCH_ORDER parameter (lsb.queues)

Use DISPATCH_ORDER=QUEUE in the master queue to define an ordered cross-queue fairshare set. DISPATCH_ORDER indicates that jobs are dispatched according to the order of queue priorities, not user fairshare priority.

Priority range in cross-queue fairshare

By default, the range of priority defined for queues in cross-queue fairshare cannot be used with any other queues. The priority of queues that are not part of the cross-queue fairshare cannot fall between the priority range of cross-queue fairshare queues.

For example, you have 4 queues: queue1, queue2, queue3, and queue4. You configure cross-queue fairshare for queue1, queue2, and queue3, and assign priorities of 30, 40, 50 respectively. The priority of queue4 (which is not part of the cross-queue fairshare) cannot fall between 30 and 50, but it can be any number up to 29 or higher than 50. It does not matter if queue4 is a fairshare queue or FCFS queue.

If DISPATCH_ORDER=QUEUE is set in the master queue, queues that are not part of the ordered cross-queue fairshare can have any priority. Their priority can fall within the priority range of cross-queue fairshare queues and they can be inserted between two queues using the same fairshare tree. In the example above, queue4 can have any priority, including a priority falling between the priority range of the cross-queue fairshare queues (30-50).

Jobs from equal priority queues

Hierarchical User-based Fairshare

For both queue and host partitions, hierarchical user-based fairshare lets you allocate resources to users in a hierarchical manner.

By default, when shares are assigned to a group, group members compete for resources according to FCFS policy. If you use hierarchical fairshare, you control the way shares that are assigned collectively are divided among group members.

If groups have subgroups, you can configure additional levels of share assignments, resulting in a multi-level share tree that becomes part of the fairshare policy.

How hierarchical fairshare affects dynamic share priority

When you use hierarchical fairshare, the dynamic share priority formula does not change, but LSF measures the resource consumption for all levels of the share tree. To calculate the dynamic priority of a group, LSF uses the resource consumption of all the jobs in the queue or host partition that belong to users in the group and all its subgroups, recursively.

How hierarchical fairshare affects job dispatch order

LSF uses the dynamic share priority of a user or group to find out which user's job to run next. If you use hierarchical fairshare, LSF works through the share tree from the top level down, and compares the dynamic priority of users and groups at each level, until the user with the highest dynamic priority is a single user, or a group that has no subgroups.

View hierarchical share information for a group

  1. Use bugroup -l to find out if you belong to a group, and what the share distribution is.
  2. bugroup -l
    GROUP_NAME: group1 
    USERS: group2/ group3/
    SHARES:  [group2,20] [group3,10] 
    GROUP_NAME: group2
    USERS: user1 user2 user3 
    SHARES: [others,10] [user3,4] 
    GROUP_NAME: group3
    USERS: all
    SHARES: [user2,10] [default,5] 
     

    This command displays all the share trees that are configured, even if they are not used in any fairshare policy.

View hierarchical share information for a host partition

By default, bhpart displays only the top level share accounts associated with the partition.

  1. Use bhpart -r to display the group information recursively.
  2. The output lists all the groups in the share tree, starting from the top level, and displays the number of shares, dynamic share priority, number of started and reserved jobs, CPU time, and run time for each user or group.

Configuring hierarchical fairshare

To define a hierarchical fairshare policy, configure the top-level share assignment in lsb.queues or lsb.hosts, as usual. Then, for any group of users affected by the fairshare policy, configure a share tree in the UserGroup section of lsb.users. This specifies how shares assigned to the group, collectively, are distributed among the individual users or subgroups.

If shares are assigned to members of any group individually, using @, there can be no further hierarchical fairshare within that group. The shares are assigned recursively to all members of all subgroups, regardless of further share distributions defined in lsb.users. The group members and members of all subgroups compete for resources according to FCFS policy.

You can choose to define a hierarchical share tree for some groups but not others. If you do not define a share tree for any group or subgroup, members compete for resources according to FCFS policy.

Configure a share tree

  1. Group membership is already defined in the UserGroup section of lsb.users. To configure a share tree, use the USER_SHARES column to describe how the shares are distributed in a hierarchical manner. Use the following format.
  2. Begin UserGroup
    GROUP_NAME    GROUP_MEMBER          USER_SHARES
    GroupB       (User1 User2)          ()
    GroupC       (User3 User4)          ([User3, 3] [User4, 4])
    GroupA       (GroupB GroupC User5)  ([User5, 1] [default, 10])
    End UserGroup 
    

An Engineering queue or host partition organizes users hierarchically, and divides the shares as shown. It does not matter what the actual number of shares assigned at each level is.

The Development group gets the largest share (50%) of the resources in the event of contention. Shares assigned to the Development group can be further divided among the Systems, Application, and Test groups, which receive 15%, 35%, and 50%, respectively. At the lowest level, individual users compete for these shares as usual.

One way to measure a user's importance is to multiply their percentage of the resources at every level of the share tree. For example, User1 is entitled to 10% of the available resources (.50 x .80 x .25 = .10) and User3 is entitled to 4% (.80 x .20 x .25 = .04). However, if Research has the highest dynamic share priority among the 3 groups at the top level, and ChipY has a higher dynamic priority than ChipX, the next comparison is between User3 and User4, so the importance of User1 is not relevant. The dynamic priority of User1 is not even calculated at this point.

Queue-based Fairshare

When a priority is set in a queue configuration, a high priority queue tries to dispatch as many jobs as it can before allowing lower priority queues to dispatch any job. Lower priority queues are blocked until the higher priority queue cannot dispatch any more jobs. However, it may be desirable to give some preference to lower priority queues and regulate the flow of jobs from the queue.

Queue-based fairshare allows flexible slot allocation per queue as an alternative to absolute queue priorities by enforcing a soft job slot limit on a queue. This allows you to organize the priorities of your work and tune the number of jobs dispatched from a queue so that no single queue monopolizes cluster resources, leaving other queues waiting to dispatch jobs.

You can balance the distribution of job slots among queues by configuring a ratio of jobs waiting to be dispatched from each queue. LSF then attempts to dispatch a certain percentage of jobs from each queue, and does not attempt to drain the highest priority queue entirely first.

When queues compete, the allocated slots per queue are kept within the limits of the configured share. If only one queue in the pool has jobs, that queue can use all the available resources and can span its usage across all hosts it could potentially run jobs on.

Managing pools of queues

You can configure your queues into a pool, which is a named group of queues using the same set of hosts. A pool is entitled to a slice of the available job slots. You can configure as many pools as you need, but all the queues in a pool must use the same set of hosts. There can be queues in the cluster that do not belong to any pool yet share some hosts used by a pool.

How LSF allocates slots for a pool of queues

During job scheduling, LSF orders the queues within each pool based on the shares the queues are entitled to. The number of running jobs (or job slots in use) is maintained at the percentage level specified for the queue. When a queue has no pending jobs, leftover slots are redistributed to other queues in the pool with jobs pending.

The total number of slots in each pool is constant; it is equal to the number of slots in use plus the number of free slots, up to the maximum job slot limit configured either in lsb.hosts (MXJ) or in lsb.resources for a host or host group. The accumulated slots in use by each queue are used in ordering the queues for dispatch.

Job limits and host limits are enforced by the scheduler. For example, if LSF determines that a queue is eligible to run 50 jobs, but the queue has a job limit of 40 jobs, no more than 40 jobs will run. The remaining 10 job slots are redistributed among other queues belonging to the same pool, or made available to other queues that are configured to use them.

Accumulated slots in use

As queues run the jobs allocated to them, LSF accumulates the slots each queue has used and decays this value over time, so that each queue is not allocated more slots than it deserves, and other queues in the pool have a chance to run their share of jobs.

Interaction with other scheduling policies

Examples

Three queues use two hosts, each with a maximum job slot limit of 6, for a total of 12 slots to be allocated.

Four queues use two hosts, each with a maximum job slot limit of 6, for a total of 12 slots; queue4 does not belong to any pool.

In a larger example, queue1, queue2, and queue3 belong to one pool, queue6, queue7, and queue8 belong to another pool, and queue4 and queue5 do not belong to any pool.

LSF orders the queues in the two pools from higher-priority queue to lower-priority queue (queue1 is highest and queue8 is lowest):

queue1 -> queue2 -> queue3 -> queue6 -> queue7 -> queue8 

If the queue belongs to a pool, jobs are dispatched from the highest priority queue first. Queues that do not belong to any pool (queue4 and queue5) are merged into this ordered list according to their priority, but LSF dispatches as many jobs from the non-pool queues as it can:

queue1 -> queue2 -> queue3 -> queue4 -> queue5 -> queue6 -> queue7 -> queue8 

Configuring Slot Allocation per Queue

Configure as many pools as you need in lsb.queues.

SLOT_SHARE parameter

The SLOT_SHARE parameter represents the percentage of running jobs (job slots) in use from the queue. SLOT_SHARE must be greater than zero (0) and less than or equal to 100.

The sum of SLOT_SHARE for all queues in the pool does not need to be 100%. It can be more or less, depending on your needs.

SLOT_POOL parameter

The SLOT_POOL parameter is the name of the pool of job slots the queue belongs to. A queue can only belong to one pool. All queues in the pool must share the same set of hosts.

Host job slot limit

The hosts used by the pool must have a maximum job slot limit, configured either in lsb.hosts (MXJ) or lsb.resources (HOSTS and SLOTS).
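
For example, a minimal lsb.hosts fragment that sets MXJ for the hosts used by a pool might look like the following sketch (host names and limits are illustrative; the other optional columns of the Host section are omitted):

Begin Host
HOST_NAME   MXJ
hosta        5
hostb        5
hostc        5
End Host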

Configure slot allocation per queue

  1. For each queue that uses queue-based fairshare, define the following in lsb.queues:
    1. SLOT_SHARE
    2. SLOT_POOL
  2. Optional: Define the following in lsb.queues for each queue that uses queue-based fairshare:
    1. HOSTS to list the hosts that can receive jobs from the queue
    2. If no hosts are defined for the queue, the default is all hosts.

      tip:  
      Hosts for queue-based fairshare cannot be in a host partition.
    3. PRIORITY to indicate the priority of the queue.
  3. For each host used by the pool, define a maximum job slot limit, either in lsb.hosts (MXJ) or lsb.resources (HOSTS and SLOTS).
Configure two pools

The following example configures pool A with three queues, with different shares, using the hosts in host group groupA:

Begin Queue
QUEUE_NAME = queue1
PRIORITY   = 50
SLOT_POOL  = poolA
SLOT_SHARE = 50
HOSTS      = groupA
...
End Queue 
Begin Queue
QUEUE_NAME = queue2
PRIORITY   = 48
SLOT_POOL  = poolA
SLOT_SHARE = 30
HOSTS      = groupA
...
End Queue 
Begin Queue
QUEUE_NAME = queue3
PRIORITY   = 46
SLOT_POOL  = poolA
SLOT_SHARE = 20
HOSTS      = groupA
...
End Queue 

The following configures a pool named poolB, with three queues with equal shares, using the hosts in host group groupB:

Begin Queue
QUEUE_NAME = queue4
PRIORITY   = 44
SLOT_POOL  = poolB
SLOT_SHARE = 30
HOSTS      = groupB
...
End Queue 
Begin Queue
QUEUE_NAME = queue5
PRIORITY   = 43
SLOT_POOL  = poolB
SLOT_SHARE = 30
HOSTS      = groupB
...
End Queue 
Begin Queue
QUEUE_NAME = queue6
PRIORITY   = 42
SLOT_POOL  = poolB
SLOT_SHARE = 30
HOSTS      = groupB
...
End Queue 

View Queue-based Fairshare Allocations

View configured job slot share

  1. Use bqueues -l to show the job slot share (SLOT_SHARE) and the hosts participating in the share pool (SLOT_POOL):
  2. QUEUE: queue1
    
    PARAMETERS/STATISTICS
    PRIO NICE STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN SSUSP USUSP  RSV 
     50   20  Open:Active       -    -    -    -     0     0     0     0     0    0
    Interval for a host to accept two jobs is 0 seconds
    
     STACKLIMIT MEMLIMIT
       2048 K     5000 K
    
    SCHEDULING PARAMETERS
               r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
     loadSched   -     -     -     -       -     -    -     -     -      -      -  
     loadStop    -     -     -     -       -     -    -     -     -      -      -  
    
                 cpuspeed    bandwidth 
    loadSched          -            - 
    loadStop           -            - 
    
    USERS:  all users
    HOSTS:  groupA/ 
    SLOT_SHARE: 50%
    SLOT_POOL: poolA 
    

View slot allocation of running jobs

  1. Use bhosts, bmgroup, and bqueues to verify how LSF maintains the configured percentage of running jobs in each queue.
  2. The queue configurations above use the following host groups:

    bmgroup -r
    GROUP_NAME   HOSTS
    groupA       hosta hostb hostc
    groupB       hostd hoste hostf 
     

    Each host has a maximum job slot limit of 5, for a total of 15 slots available to be allocated in each group:

    bhosts
    HOST_NAME   STATUS   JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
    hosta       ok        -     5     5     5     0      0     0
    hostb       ok        -     5     5     5     0      0     0
    hostc       ok        -     5     5     5     0      0     0
    hostd       ok        -     5     5     5     0      0     0
    hoste       ok        -     5     5     5     0      0     0
    hostf       ok        -     5     5     5     0      0     0

    The pool named poolA contains queue1, queue2, and queue3. poolB contains queue4, queue5, and queue6. The bqueues command shows the number of running jobs in each queue:

    bqueues
    QUEUE_NAME   PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
    queue1        50   Open:Active      -    -    -    -   492   484     8     0
    queue2        48   Open:Active      -    -    -    -   500   495     5     0
    queue3        46   Open:Active      -    -    -    -   498   496     2     0
    queue4        44   Open:Active      -    -    -    -   985   980     5     0
    queue5        43   Open:Active      -    -    -    -   985   980     5     0
    queue6        42   Open:Active      -    -    -    -   985   980     5     0

    As a result: queue1 has a 50% share and can run 8 jobs; queue2 has a 30% share and can run 5 jobs; queue3 has a 20% share and is entitled to 3 slots, but because the total number of slots available must be 15, it can run only 2 jobs; queue4, queue5, and queue6 each have a 30% share, so 5 jobs run in each of those queues.

Typical Slot Allocation Scenarios

3 queues with SLOT_SHARE 50%, 30%, 20%, with 15 job slots

This scenario has three phases:

  1. All three queues have jobs running, and LSF assigns the number of slots to queues as expected: 8, 5, 2. Though queue Genova deserves 3 slots, the total slot assignment must be 15, so Genova is allocated only 2 slots:
  2. bqueues
    QUEUE_NAME    PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP 
    Roma           50   Open:Active      -    -    -    -  1000   992     8     0
    Verona         48   Open:Active      -    -    -    -   995   990     5     0
    Genova         48   Open:Active      -    -    -    -   996   994     2     0 
    
  3. When queue Verona has done its work, queues Roma and Genova get their respective shares of 8 and 3. This leaves 4 slots to be redistributed to queues according to their shares: 50% (2 slots) to Roma, 20% (1 slot) to Genova. The one remaining slot is assigned to queue Roma again:
  4. bqueues
    QUEUE_NAME  PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP 
    Roma         50   Open:Active      -    -    -    -   231   221    11     0
    Verona       48   Open:Active      -    -    -    -     0     0     0     0
    Genova       48   Open:Active      -    -    -    -   496   491     4     0 
    
  5. When queues Roma and Verona have no more work to do, Genova can use all the available slots in the cluster:
  6. bqueues
    QUEUE_NAME   PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP 
    Roma          50   Open:Active      -    -    -    -     0     0     0     0
    Verona        48   Open:Active      -    -    -    -     0     0     0     0
    Genova        48   Open:Active      -    -    -    -   475   460    15     0 
    

The following figure illustrates phases 1, 2, and 3:

2 pools, 30 job slots, and 2 queues out of any pool

The queues Milano and Parma run very short jobs that get submitted periodically in bursts. When no jobs are running in them, the distribution of jobs looks like this:

QUEUE_NAME  PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP 
Roma         50   Open:Active      -    -    -    -  1000   992     8     0
Verona       48   Open:Active      -    -    -    -  1000   995     5     0
Genova       48   Open:Active      -    -    -    -  1000   998     2     0
Pisa         44   Open:Active      -    -    -    -  1000   995     5     0
Milano       43   Open:Active      -    -    -    -     2     2     0     0
Parma        43   Open:Active      -    -    -    -     2     2     0     0
Venezia      43   Open:Active      -    -    -    -  1000   995     5     0
Bologna      43   Open:Active      -    -    -    -  1000   995     5     0 

When Milano and Parma have jobs, their higher priority reduces the share of slots free and in use by Venezia and Bologna:

QUEUE_NAME   PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP 
Roma          50   Open:Active      -    -    -    -   992   984     8     0
Verona        48   Open:Active      -    -    -    -   993   990     3     0
Genova        48   Open:Active      -    -    -    -   996   994     2     0
Pisa          44   Open:Active      -    -    -    -   995   990     5     0
Milano        43   Open:Active      -    -    -    -    10     7     3     0
Parma         43   Open:Active      -    -    -    -    11     8     3     0
Venezia       43   Open:Active      -    -    -    -   995   995     2     0
Bologna       43   Open:Active      -    -    -    -   995   995     2     0 

Round-robin slot distribution - 13 queues and 2 pools

The initial slot distribution looks like this:

bqueues
QUEUE_NAME   PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP 
Roma          50   Open:Active      -    -    -    -    15     6    11     0
Verona        48   Open:Active      -    -    -    -    25    18     7     0
Genova        47   Open:Active      -    -    -    -   460   455     3     0
Pisa          44   Open:Active      -    -    -    -   264   261     3     0
Milano        43   Open:Active      -    -    -    -   262   259     3     0
Parma         42   Open:Active      -    -    -    -   260   257     3     0
Bologna       40   Open:Active      -    -    -    -   260   257     3     0
Sora          40   Open:Active      -    -    -    -   261   258     3     0
Ferrara       40   Open:Active      -    -    -    -   258   255     3     0
Napoli        40   Open:Active      -    -    -    -   259   256     3     0
Livorno       40   Open:Active      -    -    -    -   258   258     0     0
Palermo       40   Open:Active      -    -    -    -   256   256     0     0
Venezia        4   Open:Active      -    -    -    -   255   255     0     0 

Initially, queues Livorno, Palermo, and Venezia in poolB are not assigned any slots because the first 7 higher priority queues have used all 21 slots available for allocation.

As jobs run and each queue accumulates used slots, LSF favors queues that have not yet run jobs. As jobs finish in the first 7 queues of poolB, slots are redistributed to the queues that originally had no jobs (Livorno, Palermo, and Venezia). The total slot count across all the queues in poolB remains 21.

bqueues
QUEUE_NAME    PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP 
Roma           50   Open:Active      -    -    -    -    15     6     9     0
Verona         48   Open:Active      -    -    -    -    25    18     7     0
Genova         47   Open:Active      -    -    -    -   460   455     5     0
Pisa           44   Open:Active      -    -    -    -   263   261     2     0
Milano         43   Open:Active      -    -    -    -   261   259     2     0
Parma          42   Open:Active      -    -    -    -   259   257     2     0
Bologna        40   Open:Active      -    -    -    -   259   257     2     0
Sora           40   Open:Active      -    -    -    -   260   258     2     0
Ferrara        40   Open:Active      -    -    -    -   257   255     2     0
Napoli         40   Open:Active      -    -    -    -   258   256     2     0
Livorno        40   Open:Active      -    -    -    -   258   256     2     0
Palermo        40   Open:Active      -    -    -    -   256   253     3     0
Venezia         4   Open:Active      -    -    -    -   255   253     2     0 

The following figure illustrates the round-robin distribution of slot allocations between queues Livorno and Palermo:

How LSF rebalances slot usage

In the following examples, job runtime is not equal, but varies randomly over time.

3 queues in one pool with 50%, 30%, 20% shares

A pool configures 3 queues: queue1 with a 50% share, queue2 with a 30% share, and queue3 with a 20% share.

As queue1 and queue2 finish their jobs, the number of jobs in queue3 expands, and as queue1 and queue2 get more work, LSF rebalances the usage.

10 queues sharing 10% each of 50 slots

In this example, queue1 (the curve with the highest peaks) has the longest-running jobs and so accumulates fewer slots in use over time. LSF accordingly rebalances the load when all queues compete for jobs, to maintain the configured 10% usage share for each queue.

Using Historical and Committed Run Time

By default, as a job is running, the dynamic priority decreases gradually until the job has finished running, then increases immediately when the job finishes.

In some cases this can interfere with fairshare scheduling if two users who have the same priority and the same number of shares submit jobs at the same time.

To avoid these problems, you can modify the dynamic priority calculation by using either or both of the following weighting factors: the historical run time decay and the committed run time weighting factor.

Historical run time decay

By default, historical run time does not affect the dynamic priority. You can configure LSF so that the user's dynamic priority increases gradually after a job finishes. After a job is finished, its run time is saved as the historical run time of the job and the value can be used in calculating the dynamic priority, the same way LSF considers historical CPU time in calculating priority. LSF applies a decaying algorithm to the historical run time to gradually increase the dynamic priority over time after a job finishes.

Configure historical run time
  1. Specify ENABLE_HIST_RUN_TIME=Y in lsb.params.
  2. Historical run time is added to the calculation of the dynamic priority so that the formula becomes the following:

    dynamic priority = number_shares / (cpu_time * CPU_TIME_FACTOR + 
    (historical_run_time + run_time) * RUN_TIME_FACTOR + 
    (1 + job_slots) * RUN_JOB_FACTOR + 
    fairshare_adjustment * FAIRSHARE_ADJUSTMENT_FACTOR)
     

    historical_run_time - The run time (measured in hours) of finished jobs accumulated in the user's share account. LSF calculates the historical run time using the actual run time of finished jobs and a decay factor such that 1 hour of recently-used run time decays to 0.1 hours after an interval of time specified by HIST_HOURS in lsb.params (5 hours by default).

How mbatchd reconfiguration and restart affects historical run time

After restarting or reconfiguring mbatchd, the historical run time of finished jobs might be different, since it includes jobs that may have been cleaned from mbatchd before the restart. mbatchd restart only reads recently finished jobs from lsb.events, according to the value of CLEAN_PERIOD in lsb.params. Any jobs cleaned before restart are lost and are not included in the new calculation of the dynamic priority.

Example

The following fairshare parameters are configured in lsb.params:

CPU_TIME_FACTOR = 0
RUN_JOB_FACTOR  = 0
RUN_TIME_FACTOR = 1
FAIRSHARE_ADJUSTMENT_FACTOR = 0 

Note that in this configuration, only run time is considered in the calculation of dynamic priority. This simplifies the formula to the following:

dynamic priority = number_shares / (run_time * RUN_TIME_FACTOR)

Without the historical run time, the dynamic priority increases suddenly as soon as the job finishes running because the run time becomes zero, which gives no chance for jobs pending for other users to start.

When historical run time is included in the priority calculation, the formula becomes:

dynamic priority = number_shares / ((historical_run_time + run_time) * RUN_TIME_FACTOR)

Now the dynamic priority increases gradually as the historical run time decays over time.

Committed run time weighting factor

Committed run time is the run time requested at job submission with the -W option of bsub, or in the queue configuration with the RUNLIMIT parameter. By default, committed run time does not affect the dynamic priority.

While the job is running, the actual run time is subtracted from the committed run time. The user's dynamic priority decreases immediately to its lowest expected value, and is maintained at that value until the job finishes. Job run time is accumulated as usual, and historical run time, if any, is decayed.

When the job finishes, the committed run time is set to zero and the actual run time is added to the historical run time for future use. The dynamic priority increases gradually until it reaches its maximum value.

Providing a weighting factor in the run time portion of the dynamic priority calculation prevents a "job dispatching burst" where one user monopolizes job slots because of the latency in computing run time.
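
As a worked illustration with hypothetical values, suppose RUN_TIME_FACTOR and COMMITTED_RUN_TIME_FACTOR are both 1, the other factors are 0, and a user with 10 shares runs a single job submitted with a committed run time of 1 hour (bsub -W 60). Using the combined formula shown in the procedure below, the run time portion of the denominator stays constant while the job runs:

at dispatch:       (0 + 0.0) * 1 + (1.0 - 0.0) * 1 = 1.0    dynamic priority = 10 / 1.0 = 10
after 30 minutes:  (0 + 0.5) * 1 + (1.0 - 0.5) * 1 = 1.0    dynamic priority = 10 / 1.0 = 10

The user's dynamic priority therefore drops to its lowest expected value as soon as the job is dispatched and stays there until the job finishes, when the actual run time is added to the historical run time and the priority recovers gradually as that value decays.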

Configure committed run time
  1. Set a value for the COMMITTED_RUN_TIME_FACTOR parameter in lsb.params. You should also specify a RUN_TIME_FACTOR, to prevent the user's dynamic priority from increasing as the run time increases.
  2. If you have also enabled the use of historical run time, the dynamic priority is calculated according to the following formula:

    dynamic priority = number_shares / (cpu_time * CPU_TIME_FACTOR + (historical_run_time + run_time) * RUN_TIME_FACTOR + (committed_run_time - run_time) * COMMITTED_RUN_TIME_FACTOR + (1 + job_slots) * RUN_JOB_FACTOR + fairshare_adjustment * FAIRSHARE_ADJUSTMENT_FACTOR)

    committed_run_time - The run time requested at job submission with the -W option of bsub, or in the queue configuration with the RUNLIMIT parameter. The committed run time is measured in hours.

    In the calculation of a user's dynamic priority, COMMITTED_RUN_TIME_FACTOR determines the relative importance of the committed run time in the calculation. If the -W option of bsub is not specified at job submission and a RUNLIMIT has not been set for the queue, the committed run time is not considered.

COMMITTED_RUN_TIME_FACTOR can be any positive value between 0.0 and 1.0. The default value is 0.0. As the value of COMMITTED_RUN_TIME_FACTOR approaches 1.0, more weight is given to the committed run time in the calculation of the dynamic priority.
Limitation

If you use queue-level fairshare, and a running job has a committed run time, you should not switch that job to or from a fairshare queue (using bswitch). The fairshare calculations will not be correct.

Run time displayed by bqueues and bhpart

The run time displayed by bqueues and bhpart is the sum of the actual, accumulated run time and the historical run time, but does not include the committed run time.

Example

The following fairshare parameters are configured in lsb.params:

CPU_TIME_FACTOR = 0
RUN_JOB_FACTOR = 0
RUN_TIME_FACTOR = 1
FAIRSHARE_ADJUSTMENT_FACTOR = 0
COMMITTED_RUN_TIME_FACTOR = 1 

Without a committed run time factor, dynamic priority for the job owner drops gradually while a job is running:

When a committed run time factor is included in the priority calculation, the dynamic priority drops as soon as the job is dispatched, rather than gradually dropping as the job runs:

Users Affected by Multiple Fairshare Policies

If you belong to multiple user groups, which are controlled by different fairshare policies, each group probably has a different dynamic share priority at any given time. By default, if any one of these groups becomes the highest priority user, you could be the highest priority user in that group, and LSF would attempt to place your job.

To restrict the number of fairshare policies that will affect your job, submit your job and specify a single user group that your job will belong to, for the purposes of fairshare scheduling. LSF will not attempt to dispatch this job unless the group you specified is the highest priority user. If you become the highest priority user because of some other share assignment, another one of your jobs might be dispatched, but not this one.

Submit a job and specify a user group

  1. To associate a job with a user group for the purposes of fairshare scheduling, use bsub -G and specify a group that you belong to. If you use hierarchical fairshare, you must specify a group that does not contain any subgroups.
Example

User1 shares resources with groupA and groupB. User1 is also a member of groupA, but not any other groups.

User1 submits a job:

bsub sleep 100 

By default, the job could be considered for dispatch if either User1 or GroupA has highest dynamic share priority.

User1 submits a job and associates the job with GroupA:

bsub -G groupA sleep 100 

The job is considered for dispatch only when GroupA has the highest dynamic share priority. If User1 is the highest priority user, this job will not be considered.

Example with hierarchical fairshare

In the share tree, User1 shares resources with GroupA at the top level. GroupA has 2 subgroups, B and C. GroupC has 1 subgroup, GroupD. User1 also belongs to GroupB and GroupC.

User1 submits a job:

bsub sleep 100 

By default, the job could be considered for dispatch if either User1, GroupB, or GroupC has highest dynamic share priority.

User1 submits a job and associates the job with GroupB:

bsub -G groupB sleep 100 

The job is considered for dispatch only when GroupB has the highest dynamic share priority. If User1 or GroupC is the highest priority user, this job will not be considered.

Ways to Configure Fairshare

Global fairshare

Global fairshare balances resource usage across the entire cluster according to one single fairshare policy. Resources used in one queue affect job dispatch order in another queue.

If two users compete for resources, their dynamic share priority is the same in every queue.

Configure global fairshare
  1. To configure global fairshare, you must use host partition fairshare. Use the keyword all to configure a single partition that includes all the hosts in the cluster.
  2. Begin HostPartition
    HPART_NAME = GlobalPartition
    HOSTS = all
    USER_SHARES = [groupA@, 3] [groupB, 7] [default, 1]
    End HostPartition 
    

Chargeback fairshare

Chargeback fairshare lets competing users share the same hardware resources according to a fixed ratio. Each user is entitled to a specified portion of the available resources.

If two users compete for resources, the most important user is entitled to more resources.

Configure chargeback fairshare
  1. To configure chargeback fairshare, put competing users in separate user groups and assign a fair number of shares to each group.
Example

Suppose two departments contributed to the purchase of a large system. The engineering department contributed 70 percent of the cost, and the accounting department 30 percent. Each department wants to get their money's worth from the system.

  1. Define 2 user groups in lsb.users, one listing all the engineers, and one listing all the accountants.
  2. Begin UserGroup
    Group_Name   Group_Member
    eng_users    (user6 user4)
    acct_users   (user2 user5)
    End UserGroup 
    
  3. Configure a host partition for the host, and assign the shares appropriately.
  4. Begin HostPartition
    HPART_NAME = big_servers
    HOSTS = hostH
    USER_SHARES = [eng_users, 7] [acct_users, 3]
    End HostPartition 
    

Equal Share

Equal share balances resource usage equally between users.

Configure equal share
  1. To configure equal share, use the keyword default to define an equal share for every user.
  2. Begin HostPartition
    HPART_NAME = equal_share_partition
    HOSTS = all
    USER_SHARES = [default, 1]
    End HostPartition 
    

Priority user and static priority fairshare

There are two ways to configure fairshare so that a more important user's job always overrides the job of a less important user, regardless of resource use.

Configure priority user fairshare

A queue is shared by key users and other users.

Priority user fairshare gives priority to important users, so their jobs override the jobs of other users. You can still use fairshare policies to balance resources among each group of users.

If two users compete for resources, and one of them is a priority user, the priority user's job always runs first.

  1. Define a user group for priority users in lsb.users, naming it accordingly.
  2. For example, key_users.

  3. Configure fairshare and assign the overwhelming majority of shares to the key users:
  4. Begin Queue
    QUEUE_NAME = production 
    FAIRSHARE = USER_SHARES[[key_users@, 2000] [others, 1]]
    ...
    End Queue 
     

    In the above example, key users have 2000 shares each, while other users together have only 1 share. This makes it virtually impossible for other users' jobs to get dispatched unless none of the users in the key_users group has jobs waiting to run.

    If you want the same fairshare policy to apply to jobs from all queues, configure host partition fairshare in a similar way.

Configure static priority fairshare

Static priority fairshare assigns resources to the user with the most shares. Resource usage is ignored.

  1. To implement static priority fairshare, edit lsb.params and set all the weighting factors used in the dynamic priority formula to 0 (zero).

If two users compete for resources, the most important user's job always runs first.
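
For example, the relevant lsb.params settings would look like the following sketch (only the weighting factors from the default formula are shown):

CPU_TIME_FACTOR = 0
RUN_TIME_FACTOR = 0
RUN_JOB_FACTOR  = 0
FAIRSHARE_ADJUSTMENT_FACTOR = 0

With every factor set to zero, the denominator of the dynamic priority formula is always rounded up to 0.01, so each user's dynamic priority is simply 100 times the number of shares and resource usage no longer influences the dispatch order.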

Resizable jobs and fairshare

Resizable jobs submitted to fairshare queues or host partitions are subject to fairshare scheduling policies. The dynamic priority of the user who submitted the job is the most important criterion. LSF treats a pending resize allocation request like a regular pending job and enforces the fairshare user priority policy to schedule it.

The dynamic priority of users depends on their share assignment, the dynamic priority formula, and the resources their jobs have already consumed.

Resizable job allocation changes affect the user priority calculation if RUN_JOB_FACTOR or FAIRSHARE_ADJUSTMENT_FACTOR is greater than zero (0). Resize add requests increase the number of slots in use and decrease the user's priority. Resize release requests decrease the number of slots in use and increase the user's priority. The faster a resizable job grows, the lower the user's priority becomes and the less likely a pending allocation request is to get more slots.

note:  
The effect of resizable job allocation changes when FAIRSHARE_ADJUSTMENT_FACTOR is greater than 0 depends on the user-defined fairshare adjustment plugin (libfairshareadjust.*).

After a job allocation change, bqueues and bhpart display the updated user priority.

