


Using Platform LSF HPC with SGI Cpusets


LSF HPC makes use of SGI cpusets to enforce processor limits for LSF jobs. When a job is submitted, LSF creates a cpuset and attaches it to the job before the job starts running. After the job finishes, LSF deallocates the cpuset. If no host meets the CPU requirements, the job remains pending until processors become available to allocate the cpuset.



About SGI cpusets

An SGI cpuset is a named set of CPUs. The processes attached to a cpuset can only run on the CPUs belonging to that cpuset.

Dynamic cpusets

Jobs are attached to a cpuset dynamically created by LSF HPC. The cpuset is deleted when the job finishes or exits. If not specified, the default cpuset type is dynamic.

Static cpusets

Jobs are attached to a static cpuset specified by users at job submission. This cpuset is not deleted when the job finishes or exits. Specifying a cpuset name at job submission implies that the cpuset type is static. If the static cpuset does not exist, the job will remain pending until LSF HPC detects a static cpuset with the specified name.

System architecture

How LSF HPC uses cpusets

CPU containment and reservation

On systems running IRIX 6.5.24 and up or SGI Altix or AMD64 (x86-64) ProPack 3.0 and up, cpusets can be created and deallocated dynamically out of available machine resources. Not only does the cpuset provide containment, so that a job requiring a specific number of CPUs will only run on those CPUs, but also reservation, so that the required number of CPUs are guaranteed to be available only for the job they are allocated to.

Cpuset creation and deallocation

LSF can be configured to make use of SGI cpusets to enforce processor limits for LSF jobs. When a job is submitted, LSF creates a cpuset and attaches it to the job when the job is scheduled. After the job finishes, LSF deallocates the cpuset. If no host meets the CPU requirements, the job remains pending until processors become available to allocate the cpuset.

Assumptions and limitations

Backfill and slot reservation

Since backfill and slot reservation are based on an entire host, they may not work correctly if your cluster contains hosts that use both static and dynamic cpusets or multiple static cpusets.

Chunk jobs

Jobs submitted to a chunk job queue are not chunked together, but run as individual LSF jobs inside a dynamic cpuset.

Preemption

Pre-execution and post-execution

Job pre-execution programs run within the job cpuset, since they are part of the job. By default, post-execution programs run outside of the job cpuset.

If JOB_INCLUDE_POSTPROC=Y is specified in lsb.applications, post-execution processing is not attached to the job cpuset, and Platform LSF does not release the cpuset until post-execution processing has finished.
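
For illustration, a minimal lsb.applications entry that enables this behavior might look like the following sketch (the application profile name is hypothetical):

Begin Application
NAME                 = cpuset_app
DESCRIPTION          = hold the cpuset until post-execution processing finishes
JOB_INCLUDE_POSTPROC = Y
End Application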

Suspended jobs

Jobs suspended (for example, with bstop) will release their cpusets.

Cpuset memory options

Static cpusets

PAM jobs on IRIX

PAM on IRIX cannot launch parallel processes within cpusets.

Array services authentication (Altix only)

For PAM jobs on Altix, the SGI Array Services daemon arrayd must be running and AUTHENTICATION must be set to NONE in the SGI array services authentication file /usr/lib/array/arrayd.auth (comment out the AUTHENTICATION NOREMOTE method and uncomment the AUTHENTICATION NONE method).
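
For example, after editing, the relevant lines of /usr/lib/array/arrayd.auth would look like this:

# AUTHENTICATION NOREMOTE
AUTHENTICATION NONE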

To run multihost MPI applications, you must also enable rsh without a password prompt between hosts.

For more information about SGI Array Services, see SGI Job Container and Process Aggregate Support.

For more information about PAM jobs, see SGI Vendor MPI Support.

Forcing a cpuset job to run

The administrator must use brun -c to force a cpuset job to run. If the job is forced to run on non-cpuset hosts, or if any host in the host list specified with -m is not a cpuset host, the -extsched cpuset options are ignored and the job runs with no cpusets allocated.

If the job is forced to run on a cpuset host:

Resizable jobs

Jobs running in a cpuset cannot be resized.



Configuring LSF HPC with SGI Cpusets

Automatic configuration at installation and upgrade

lsb.modules

During installation and upgrade, lsfinstall adds the schmod_cpuset external scheduler plugin module name to the PluginModule section of lsb.modules:

Begin PluginModule
SCH_PLUGIN              RB_PLUGIN           SCH_DISABLE_PHASES 
schmod_default               ()                      () 
schmod_cpuset                ()                      () 
End PluginModule


The schmod_cpuset plugin name must be configured after the standard LSF plugin names in the PluginModule list.

For upgrade, lsfinstall comments out the schmod_topology external scheduler plugin name in the PluginModule section of lsb.modules.

lsf.conf

During installation and upgrade, lsfinstall sets the following parameters in lsf.conf:

For upgrade, lsfinstall comments out the following obsolete parameters in lsf.conf, and sets the corresponding RLA configuration:

lsf.shared

During installation and upgrade, lsfinstall defines the cpuset Boolean resource in lsf.shared:

Begin Resource
RESOURCENAME   TYPE      INTERVAL  INCREASING   DESCRIPTION
...
cpuset         Boolean   ()        ()           (cpuset host)
...
End Resource


You should add the cpuset resource name under the RESOURCES column of the Host section of lsf.cluster.cluster_name. Hosts without the cpuset resource specified are not considered for scheduling cpuset jobs.

lsf.cluster.cluster_name

For each cpuset host, hostsetup adds the cpuset Boolean resource to the HOST section of lsf.cluster.cluster_name.

For more information

See the Platform LSF Configuration Reference for information about the lsb.modules, lsf.conf, lsf.shared, and lsf.cluster.cluster_name files.

Optional configuration

lsb.queues

lsf.conf

Increase file descriptor limit for MPI jobs (Altix only)

By default, Linux sets the maximum file descriptor limit to 1024. This value is too small for jobs using more than 200 processes. To avoid MPI job failure, specify a larger file descriptor limit. For example:

# /etc/init.d/lsf stop
# ulimit -n 16384
# /etc/init.d/lsf start

Any host with more than 200 CPUs should start the LSF HPC daemons with the larger file descriptor limit. SGI Altix already starts the arrayd daemon with the same ulimit specifier, so that MPI jobs running without LSF HPC can start as well.

For more information

See the Platform LSF Configuration Reference for information about the lsb.queues and lsf.conf files.

Resources for dynamic and static cpusets

If your environment uses both static and dynamic cpusets, or you have more than one static cpuset configured, you must configure decreasing numeric resources to represent the cpuset count, and use -R "rusage" in job submission. This allows preemption, and also lets you control the number of jobs running on static and dynamic cpusets or on each static cpuset.

Configuring cpuset resources

  1. Edit lsf.shared and configure resources for static cpusets and non-static cpusets. For example:
    Begin Resource
    RESOURCENAME  TYPE    INTERVAL INCREASING  DESCRIPTION  # Keywords
       ...
       dcpus       Numeric ()       N          
       scpus       Numeric ()       N          
    End Resource
    

    Where:

    • dcpus is the number of CPUs outside static cpusets (that is, the total number of CPUs minus the number of CPUs in static cpusets).
    • scpus is the number of CPUs in static cpusets. For static cpusets, configure a separate resource for each static cpuset. You should use the cpuset name as the resource name.


      The names dcpus and scpus are examples only; you can use any resource names you like.

  2. Edit lsf.cluster.cluster_name to map the resources to hosts. For example:
    Begin ResourceMap
    RESOURCENAME        LOCATION
    dcpus               (4@[hosta]) # total cpus - cpus in static cpusets
    scpus               (8@[hostc]) # static cpusets
    End ResourceMap
    
    • For dynamic cpuset resources, the value of the resource should be the number of free CPUs on the host; that is, the number of CPUs outside of any static cpusets on the host.
    • For static cpuset resources, the value of the resource should be the number of CPUs in the static cpuset.
  3. Edit lsb.params and configure your cpuset resources as preemptable. For example:
    Begin Parameters
    ...
    PREEMPTABLE_RESOURCES = scpus dcpus
    End Parameters
    
  4. Edit lsb.hosts and set MXJ greater than or equal to the total number of CPUs in static and dynamic cpusets you have configured resources for.
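
    For example, a minimal lsb.hosts Host section matching the hosts used above might look like the following sketch (MXJ values are illustrative):

    Begin Host
    HOST_NAME     MXJ   r1m   pg   ls   tmp  DISPATCH_WINDOW  # Keywords
    hosta         4     ()    ()   ()   ()   ()
    hostc         8     ()    ()   ()   ()   ()
    End Host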

Viewing your cpuset resources

Use the following commands to verify your configuration:

bhosts -s
RESOURCE                 TOTAL       RESERVED       LOCATION
dcpus                      4.0            0.0       hosta
scpus                      8.0            0.0       hosta
lshosts -s
RESOURCE                                VALUE       LOCATION
dcpus                                       4       hosta
scpus                                       8       hosta
bhosts
HOST_NAME        STATUS     JL/U    MAX  NJOBS  RUN  SSUSP  USUSP RSV
hosta            ok            -      -      1    1      0      0   0

Using preemption

To use preemption on systems running IRIX or TRIX versions earlier than 6.5.24, use cpusetscript as the job suspend action in lsb.queues:

Begin Queue
...
JOB_CONTROLS = SUSPEND[cpusetscript]
...
End Queue

To enable checkpointing before the job is migrated by the cpusetscript, specify the CHKPNT=chkpnt_dir parameter in the configuration of the preemptable queue.
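
For example, a preemptable queue combining the suspend action with a checkpoint directory might look like the following sketch (the queue name and directory are placeholders):

Begin Queue
QUEUE_NAME   = preemptable
CHKPNT       = /share/lsf/chkpnt
JOB_CONTROLS = SUSPEND[cpusetscript]
DESCRIPTION  = preemptable cpuset queue with checkpointing
End Queue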

Submitting jobs

You must use -R "rusage" in job submission. This allows preemption, and also lets you control the number of jobs running on static and dynamic cpusets or on each static cpuset.
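
For example, using the dcpus and scpus resources configured above (the values are illustrative, and the second command assumes the static cpuset is named with the CPUSET_NAME option):

bsub -n 4 -R "rusage[dcpus=4]" -ext "CPUSET[CPUSET_TYPE=dynamic]" myjob
bsub -n 4 -R "rusage[scpus=4]" -ext "CPUSET[CPUSET_NAME=mystatic]" myjob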

Configuring default and mandatory cpuset options

Use the DEFAULT_EXTSCHED and MANDATORY_EXTSCHED queue parameters in lsb.queues to configure default and mandatory cpuset options.


Use keywords SGI_CPUSET[] or CPUSET[] to identify the external scheduler parameters. The keyword SGI_CPUSET[] is deprecated. The keyword CPUSET[] is preferred.

DEFAULT_EXTSCHED=[SGI_]CPUSET[cpuset_options]

Specifies default cpuset external scheduling options for the queue.

-extsched options on the bsub command are merged with DEFAULT_EXTSCHED options, and -extsched options override any conflicting queue-level options set by DEFAULT_EXTSCHED.

For example, if the queue specifies:

DEFAULT_EXTSCHED=CPUSET[CPUSET_OPTIONS=CPUSET_CPU_EXCLUSIVE]

and a job is submitted with:

-extsched "CPUSET[CPUSET_TYPE=dynamic;CPU_LIST=1,5,7-12;
CPUSET_OPTIONS=CPUSET_MEMORY_LOCAL]"

LSF HPC uses the resulting external scheduler options for scheduling:

CPUSET[CPUSET_TYPE=dynamic;CPU_LIST=1, 5, 7-12; 
CPUSET_OPTIONS=CPUSET_CPU_EXCLUSIVE CPUSET_MEMORY_LOCAL]

DEFAULT_EXTSCHED can be used in combination with MANDATORY_EXTSCHED in the same queue. For example, if the job specifies:

-extsched "CPUSET[CPU_LIST=1,5,7-12;MAX_CPU_PER_NODE=4]"

and the queue specifies:

Begin Queue
...
DEFAULT_EXTSCHED=CPUSET[CPUSET_OPTIONS=CPUSET_CPU_EXCLUSIVE]
MANDATORY_EXTSCHED=CPUSET[CPUSET_TYPE=dynamic;MAX_CPU_PER_NODE=2]
...
End Queue

LSF HPC uses the resulting external scheduler options for scheduling:

CPUSET[CPUSET_TYPE=dynamic;MAX_CPU_PER_NODE=2;CPU_LIST=1, 5, 
7-12;CPUSET_OPTIONS=CPUSET_CPU_EXCLUSIVE]

If cpuset options are set in DEFAULT_EXTSCHED, and you do not want to specify values for these options, use the keyword with no value in the -extsched option of bsub. For example, if DEFAULT_EXTSCHED=CPUSET[MAX_RADIUS=2], and you do not want to specify any radius option at all, use -extsched "CPUSET[MAX_RADIUS=]".

See Specifying cpuset properties for jobs for more information about external scheduling options.

MANDATORY_EXTSCHED=[SGI_]CPUSET[cpuset_options]

Specifies mandatory cpuset external scheduling options for the queue.

-extsched options on the bsub command are merged with MANDATORY_EXTSCHED options, and MANDATORY_EXTSCHED options override any conflicting job-level options set by -extsched.

For example, if the queue specifies:

MANDATORY_EXTSCHED=CPUSET[CPUSET_TYPE=dynamic;MAX_CPU_PER_NODE=2]

and a job is submitted with:

-extsched "CPUSET[MAX_CPU_PER_NODE=4;CPU_LIST=1,5,7-12;]"

LSF HPC uses the resulting external scheduler options for scheduling:

CPUSET[CPUSET_TYPE=dynamic;MAX_CPU_PER_NODE=2;CPU_LIST=1, 5, 7-12]

MANDATORY_EXTSCHED can be used in combination with DEFAULT_EXTSCHED in the same queue. For example, if the job specifies:

-extsched "CPUSET[CPU_LIST=1,5,7-12;MAX_CPU_PER_NODE=4]"

and the queue specifies:

Begin Queue
...
DEFAULT_EXTSCHED=CPUSET[CPUSET_OPTIONS=CPUSET_CPU_EXCLUSIVE]
MANDATORY_EXTSCHED=CPUSET[CPUSET_TYPE=dynamic;MAX_CPU_PER_NODE=2]
...
End Queue

LSF HPC uses the resulting external scheduler options for scheduling:

CPUSET[CPUSET_TYPE=dynamic;MAX_CPU_PER_NODE=2;CPU_LIST=1, 5, 
7-12;CPUSET_OPTIONS=CPUSET_CPU_EXCLUSIVE]

If you want to prevent users from setting certain cpuset options in the -extsched option of bsub, use the keyword with no value. For example, if the job is submitted with -extsched "CPUSET[MAX_RADIUS=2]", use MANDATORY_EXTSCHED=CPUSET[MAX_RADIUS=] to override this setting.

See Specifying cpuset properties for jobs for more information about external scheduling options.

Priority of topology scheduling options

The options set by -extsched can be combined with the queue-level MANDATORY_EXTSCHED or DEFAULT_EXTSCHED parameters. If -extsched and MANDATORY_EXTSCHED set the same option, the MANDATORY_EXTSCHED setting is used. If -extsched and DEFAULT_EXTSCHED set the same options, the -extsched setting is used.

Topology scheduling options are applied in the following order of priority, from highest to lowest:

  1. Queue-level MANDATORY_EXTSCHED options override ...
  2. Job-level -ext options, which override ...
  3. Queue-level DEFAULT_EXTSCHED options

For example, if the queue specifies:

DEFAULT_EXTSCHED=CPUSET[MAX_CPU_PER_NODE=2]

and the job is submitted with:

bsub -n 4 -ext "CPUSET[MAX_CPU_PER_NODE=1]" myjob

The cpuset option in the job submission overrides the DEFAULT_EXTSCHED, so the job will run in a cpuset allocated with a maximum of 1 CPU per node, honoring the job-level MAX_CPU_PER_NODE option.

If the queue specifies:

MANDATORY_EXTSCHED=CPUSET[MAX_CPU_PER_NODE=2]

and the job is submitted with:

bsub -n 4 -ext "CPUSET[MAX_CPU_PER_NODE=1]" myjob

The job will run in a cpuset allocated with a maximum of 2 CPUs per node, honoring the MAX_CPU_PER_NODE option in the queue.



Using LSF HPC with SGI Cpusets

Specifying cpuset properties for jobs

To specify cpuset properties for LSF jobs, use the -ext[sched] option of bsub, or configure the DEFAULT_EXTSCHED or MANDATORY_EXTSCHED parameters in your queue definitions in lsb.queues.

If a job is submitted with the -extsched option, LSF HPC submits the job on hold, then resumes it before dispatching it, to give LSF HPC time to attach the -extsched options. The job starts on the first execution host.

For more information about job operations, see Administering Platform LSF.

For more information about bsub, see the Platform LSF Command Reference.

Syntax

-ext[sched] "[SGI_]CPUSET[cpuset_options]"

Specifies a list of CPUs and cpuset attributes used by LSF to allocate a cpuset for the job.


You can abbreviate the -extsched option to -ext. Use keywords SGI_CPUSET[] or CPUSET[] to identify the external scheduler parameters. The keyword SGI_CPUSET[] is deprecated. The keyword CPUSET[] is preferred.

where cpuset_options are:

Options valid only for dynamic cpusets

When a job is submitted using -extsched, LSF creates a cpuset with the specified CPUs and cpuset attributes and attaches it to the processes of the job. The job is then scheduled and dispatched.

Running jobs on specific CPUs

The CPUs available for your jobs may have specific features you need to take advantage of (for example, some CPUs may have more memory, and others have a faster processor). You can partition your machines to use specific CPUs for your jobs, but the cpusets for your jobs cannot cross hosts, and you must run multiple operating systems.

You can create static cpusets with the particular CPUs your jobs need, but you cannot control the specific CPUs in the cpuset that the job actually uses.

A better solution is to use the CPU_LIST external scheduler option to request specific CPUs for your jobs. LSF can choose the best set of CPUs from the CPU list to create a cpuset for the job. The best cpuset is the one with the smallest CPU radius that meets the CPU requirements of the job. CPU radius is determined by the processor topology of the system and is expressed in terms of the number of router hops between CPUs.

CPU_LIST requirements


To make job submission easier, you should define queues with the specific CPU_LIST requirements. Set CPU_LIST in MANDATORY_EXTSCHED or DEFAULT_EXTSCHED option in your queue definitions in lsb.queues.
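
For example, a queue that always requests a particular CPU list might look like the following sketch (the queue name and CPU numbers are illustrative):

Begin Queue
QUEUE_NAME         = fast_cpus
MANDATORY_EXTSCHED = CPUSET[CPU_LIST=8-15]
DESCRIPTION        = jobs run in cpusets allocated from CPUs 8-15
End Queue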

span[ptile] resource requirement

CPU_LIST is interpreted as a list of possible CPU selections, not a strict requirement. For example, if you submit a job with the -R "span[ptile]" option:

bsub -R "span[ptile=1]" -ext "CPUSET[CPU_LIST=1,3]" -n2 ...

the following combinations of CPUs are possible:

CPUs on host 1    CPUs on host 2
1                 1
1                 3
3                 1
3                 3

Cpuset attributes

The following cpuset attributes are supported in the list of cpuset options specified by CPUSET_OPTIONS:

See the SGI resource administration documentation and the man pages for the cpuset command for information about these cpuset attributes.

SGI Altix

Restrictions on CPUSET_MEMORY_MANDATORY

Restrictions on CPUSET_CPU_EXCLUSIVE

The scheduler will not use CPU 0 when determining an allocation on IRIX or TRIX. You must not include CPU 0 in the list of CPUs specified by CPU_LIST.

MPI_DSM_MUSTRUN environment variable

You should not use the MPI_DSM_MUSTRUN=ON environment variable. If a job is suspended through preemption, LSF can ensure that cpusets are recreated with the same CPUs, but it cannot ensure that a certain task will run on a specific CPU. Jobs running with MPI_DSM_MUSTRUN cannot migrate to a different part of the machine. MPI_DSM_MUSTRUN also interferes with job checkpointing.

Including memory nodes in the allocation (Altix ProPack 4 and ProPack 5)

When you specify a list of memory node IDs with the cpuset external scheduler option MEM_LIST, LSF creates a cpuset for the job that includes the memory nodes specified by MEM_LIST in addition to the local memory attached to the CPUs allocated for the cpuset. For example, if "CPUSET[MEM_LIST=30-40]", and a 2-CPU parallel job is scheduled to run on CPU 0-1 (physically located on node 0), the job is able to use memory on node 0 and nodes 30-40.

Unavailable memory nodes listed in MEM_LIST are ignored when LSF allocates the cpuset. For example, a 4-CPU job across two hosts (hostA and hostB) that specifies MEM_LIST=1 allocates 2 CPUs on each host. The job is scheduled as follows:

If hostB only has 2 CPUs, only node 0 is available, and the job will only use the memory on node 0.

MEM_LIST is only available for dynamic cpuset jobs at both the queue level and the command level.
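
For example, the scenario described above could be requested at the command level with:

bsub -n 2 -ext "CPUSET[MEM_LIST=30-40]" myjob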

CPUSET_MEMORY_LOCAL

When MEM_LIST and CPUSET_OPTIONS=CPUSET_MEMORY_LOCAL are both specified for the job, the root cpuset nodes are included as the memory nodes for the cpuset. MEM_LIST is ignored, and CPUSET_MEMORY_LOCAL overrides MEM_LIST.

CPU radius and processor topology

If LSB_CPUSET_BESTCPUS is set in lsf.conf, LSF can choose the best set of CPUs that can create a cpuset. The best cpuset is the one with the smallest CPU radius that meets the CPU requirements of the job. CPU radius is determined by the processor topology of the system and is expressed in terms of the number of router hops between CPUs.

For better performance, CPUs connected by metarouters are given relatively high weights so that they are the last to be allocated.

Best-fit and first-fit CPU list

By default, LSB_CPUSET_BESTCPUS=Y is set in lsf.conf. LSF applies a best-fit algorithm to select the best CPUs available for the cpuset.

Example

For example, the following command creates an exclusive cpuset with the 8 best CPUs if available:

bsub -n 8 -extsched "CPUSET[CPUSET_OPTIONS=CPUSET_CPU_EXCLUSIVE]" myjob

If LSB_CPUSET_BESTCPUS is not set in lsf.conf, LSF builds a CPU list on a first-fit basis; in this example, the first 8 available CPUs are used.

Maximum radius for dynamic cpusets

Use the MAX_RADIUS cpuset external scheduler option to specify the maximum radius for dynamic cpuset allocation. If LSF HPC cannot allocate a cpuset with radius less than or equal to MAX_RADIUS, the job remains pending.

MAX_RADIUS implies that the job cannot span multiple hosts. Platform LSF HPC puts each cpuset host into its own group to enforce this when MAX_RADIUS is specified.
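
For example, the following submission keeps the job pending unless a dynamic cpuset with radius no greater than 1 can be allocated on a single host:

bsub -n 4 -ext "CPUSET[MAX_RADIUS=1]" myjob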

How the best CPUs are selected

CPU_LIST        MAX_RADIUS                  LSB_CPUSET_BESTCPUS  Algorithm used  Applied to
specified       specified or not specified  N                    first fit       cpus in CPU_LIST
not specified   specified or not specified  N                    first fit       all cpus in system
specified       specified                   Y                    max radius      cpus in CPU_LIST
not specified   specified                   Y                    max radius      all cpus in system
specified       not specified               Y                    best fit        cpus in CPU_LIST
not specified   not specified               Y                    best fit        all cpus in system

Allocating cpusets on multiple hosts (Altix only)

On SGI Altix systems, if a single host cannot satisfy the cpuset requirements for the job, LSF HPC will try to allocate cpusets on multiple hosts, and the parallel job will be launched within the cpuset.

If you define the external scheduler option CPUSET[CPUSET_TYPE=none], no cpusets are allocated and the job is dispatched and run outside of any cpuset.
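
For example:

bsub -n 4 -ext "CPUSET[CPUSET_TYPE=none]" myjob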


Spanning multiple hosts is not supported on IRIX or TRIX. Platform LSF HPC creates cpusets on a single host (or on the first host in the allocation).

LSB_HOST_CPUSETS environment variable

After dynamic cpusets are allocated and before the job starts running, LSF HPC sets the LSB_HOST_CPUSETS environment variable. LSB_HOST_CPUSETS has the following format:

number_hosts host1_name cpuset1_name host2_name 
cpuset2_name ...

For example, if hostA and hostB have 2 CPUs, and hostC has 4 CPUs, cpuset 1-0 is created on hostA, hostB, and hostC, and LSB_HOST_CPUSETS is set to:

3 hostA 1-0 hostB 1-0 hostC 1-0

LSB_HOST_CPUSETS is only set for jobs that allocate dynamic cpusets.
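
As a sketch, a job script could read the allocation from LSB_HOST_CPUSETS using the format shown above (illustrative only):

#!/bin/sh
# LSB_HOST_CPUSETS format: <number_hosts> <host1> <cpuset1> <host2> <cpuset2> ...
[ -z "$LSB_HOST_CPUSETS" ] && exit 0   # only set for jobs that allocate dynamic cpusets
set -- $LSB_HOST_CPUSETS
nhosts=$1
shift
while [ "$nhosts" -gt 0 ]; do
    host=$1
    cpuset=$2
    shift 2
    echo "cpuset $cpuset allocated on $host"
    nhosts=`expr $nhosts - 1`
done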

LSB_CPUSET_DEDICATED environment variable

When a static or dynamic cpuset is allocated, LSF HPC sets the LSB_CPUSET_DEDICATED environment variable. For CPUSET_TYPE=none, LSB_CPUSET_DEDICATED is not set.

The LSB_CPUSET_DEDICATED variable is set by LSF as follows:

How cpuset jobs are suspended and resumed

When a cpuset job is suspended (for example, with bstop), its processes are moved out of the cpuset and the job cpuset is destroyed. Platform LSF HPC keeps track of which processes belong to the cpuset and, when the job is resumed, attempts to recreate the job cpuset and bind the job processes to it.

When a job is resumed, regardless of how it was suspended, the RESUME_OPTION is honored. If RESUME_OPTION=ORIG_CPUS then LSF HPC first tries to get the original CPUs from the same nodes as the original cpuset in order to use the same memory. If this does not get enough CPUs to resume the job, LSF HPC tries to get any CPUs in an effort to get the job resumed.


SGI Altix Linux ProPack 5 supports memory migration and does not require additional configuration to enable this feature. If you submit and then suspend a job using a dynamic cpuset, LSF HPC will create a new dynamic cpuset when the job resumes. The memory pages for the job are migrated to the new cpuset as required.

Example

Assume a host with 2 nodes and 2 CPUs per node (4 CPUs in total):

Node    CPUs
0       0, 1
1       2, 3

When a job running within a cpuset that contains cpu 1 is suspended:

  1. The job processes are detached from the cpuset and suspended
  2. The cpuset is destroyed

When the job is resumed:

  1. A cpuset with the same name is recreated
  2. The processes are resumed and attached to the cpuset

The RESUME_OPTION parameter determines which CPUs are used to recreate the cpuset:

If the job originally had a cpuset containing cpu 1, the possibilities when the job is resumed are:

RESUME_OPTION      Eligible CPUs
ORIG_CPUS          0, 1
not ORIG_CPUS      0, 1, 2, 3

Viewing cpuset information for your jobs

bacct, bjobs, bhist

The bacct -l, bjobs -l, and bhist -l commands display the following information for jobs:

bjobs -l 221

Job <221>, User <user1>, Project <default>, Status <DONE>, Queue <normal>, Com
                     mand <myjob>
Thu Dec 15 14:19:54: Submitted from host <hostA>, CWD <$HOME
                     >, 2 Processors Requested; 
Thu Dec 15 14:19:57: Started on 2 Hosts/Processors <2*hostA>
                     , Execution Home </home/user1>, Execution CWD 
                     </home/user1>;
Thu Dec 15 14:19:57: CPUSET_TYPE=dynamic;NHOSTS=1;HOST=hostA;CPUSET_NAME=
                     /reg62@221;NCPUS=2; 
Thu Dec 15 14:20:03: Done successfully. The CPU time used is 0.0 seconds.

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp   mem
 loadSched   -     -     -     -       -     -    -     -     -      -     -
 loadStop    -     -     -     -       -     -    -     -     -      -     -

 EXTERNAL MESSAGES:
 MSG_ID FROM       POST_TIME      MESSAGE              ATTACHMENT
 0        -           -              -                     -
 1        -           -              -                     -
 2      root       Dec 15 14:19   JID=0x118f; ASH=0x0      N

bhist -l 221
Job <221>, User <user1>, Project <default>, Command <myjob> 
Thu Dec 15 14:19:54: Submitted from host <hostA>, to Queue <
                     normal>, CWD <$HOME>, 2 Processors Requested; 
Thu Dec 15 14:19:57: Dispatched to 2 Hosts/Processors <2*hostA>;
Thu Dec 15 14:19:57: CPUSET_TYPE=dynamic;NHOSTS=1;HOST=hostA
                     ;CPUSET_NAME=/reg62@221;NCPUS=2; 
Thu Dec 15 14:19:57: Starting (Pid 4495); 
Thu Dec 15 14:19:57: External Message "JID=0x118f; ASH=0x0" was posted from "ro
                     ot" to message box 2; 
Thu Dec 15 14:20:01: Running with execution home </home/user1>, Execution CWD
                     </home/user1>, Execution Pid <4495>; 
Thu Dec 15 14:20:01: Done successfully. The CPU time used is 0.0 seconds; 
Thu Dec 15 14:20:03: Post job process done successfully;

Summary of time in seconds spent in various states by  Thu Dec 15 14:20:03
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  3        0        4        0        0        0        7

bacct -l 221
Accounting information about jobs that are:
  - submitted by all users.
  - accounted on all projects.
  - completed normally or exited
  - executed on all hosts.
  - submitted to all queues.
  - accounted on all service classes.
------------------------------------------------------------------------------

Job <221>, User <user1>, Project <default>, Status <DONE>, Queue <normal>, Com
                     mand <myjob>
Thu Dec 15 14:19:54: Submitted from host <hostA>, CWD <$HOME>;
Thu Dec 15 14:19:57: Dispatched to 2 Hosts/Processors <2*hostA>;
Thu Dec 15 14:19:57: CPUSET_TYPE=dynamic;NHOSTS=1;HOST=hostA;CPUSET_NAME=
                     /reg62@221;NCPUS=2; 
Thu Dec 15 14:20:01: Completed <done>.

Accounting information about this job:
     CPU_T     WAIT     TURNAROUND   STATUS     HOG_FACTOR    MEM    SWAP
      0.03        3              7     done         0.0042     0K      0K
------------------------------------------------------------------------------

SUMMARY:      ( time unit: second )
 Total number of done jobs:       1      Total number of exited jobs:     0
 Total CPU time consumed:       0.0      Average CPU time consumed:     0.0
 Maximum CPU time of a job:     0.0      Minimum CPU time of a job:     0.0
 Total wait time in queues:     3.0
 Average wait time in queue:    3.0
 Maximum wait time in queue:    3.0      Minimum wait time in queue:    3.0
 Average turnaround time:         7 (seconds/job)
 Maximum turnaround time:         7      Minimum turnaround time:         7
 Average hog factor of a job:  0.00 ( cpu time / turnaround time )
 Maximum hog factor of a job:  0.00      Minimum hog factor of a job:  0.00

brlainfo

Use brlainfo to display topology information for a cpuset host. It displays information such as the cpuset operating system, the numbers of CPUs, free CPUs, nodes, and CPUs per node, the static cpusets on the host, the free CPU list, and the CPU radius. For example:

brlainfo
HOSTNAME          CPUSET_OS  NCPUS  NFREECPUS NNODES  NCPU/NODE NSTATIC_CPUSETS
hostA             SGI_IRIX   2      2         1       2         0
hostB             PROPACK_4  4      4         2       2         0
hostC             PROPACK_4  4      3         2       2         0
brlainfo -l
HOST: hostC
CPUSET_OS   NCPUS  NFREECPUS NNODES  NCPU/NODE NSTATIC_CPUSETS
PROPACK_4   4      3         2       2         0
FREE CPU LIST: 0-2
NFREECPUS ON EACH NODE: 2/0,1/1
STATIC CPUSETS: NO STATIC CPUSETS
CPU_RADIUS: 2,3,3,3,3,3,3,3

Examples

Using preemption



Using SGI Comprehensive System Accounting facility (CSA)

The SGI Comprehensive System Accounting facility (CSA) provides data for collecting per-process resource usage, monitoring disk usage, and charging back to specific login accounts. If CSA is enabled on your system, LSF HPC writes records for LSF jobs to CSA.

SGI CSA writes an accounting record for each process in the pacct file, which is usually located in the /var/adm/acct/day directory. SGI system administrators then use the csabuild command to organize and present the records on a job-by-job basis.

For each job running on the SGI system, LSF HPC writes an accounting record to CSA when the job starts and when the job finishes. LSF daemon accounting in CSA starts and stops with the LSF daemon.

See the SGI resource administration documentation for information about CSA.

Setting up SGI CSA

  1. Set the following parameters in /etc/csa.conf to on:
    • CSA_START
    • WKMG_START
  2. Run the csaswitch command to turn on the configuration changes in /etc/csa.conf.

See the SGI resource administration documentation for information about the csaswitch command.

Information written to the pacct file

LSF writes the following records to the pacct file when a job starts and when it exits:

Viewing LSF job information recorded in CSA

Use the SGI csaedit command to see the ASCII content of the pacct file. For example:

# csaedit -P /var/csa/day/pacct -A

For each LSF job, you should see two lines similar to the following:

-------------------------------------------------------------------------------
---------
37   Raw-Workld-Mgmt  user1    0x19ac91ee000064f2 0x0000000000000000        0  
REQID=1771  ARRAYID=0  PROV=LSF  START=Jun  4 15:52:01  ENTER=Jun  4 15:51:49  
TYPE=INIT  SUBTYPE=START  MACH=hostA  REQ=myjob  QUE=normal
...
39   Raw-Workld-Mgmt  user1    0x19ac91ee000064f2 0x0000000000000000        0  
REQID=1771  ARRAYID=0  PROV=LSF  START=Jun  4 16:09:14  TYPE=TERM  SUBTYPE=EXIT  
MACH=hostA  REQ=myjob  QUE=normal--
-------------------------------------------------------------------------------
---------

The REQID is the LSF job ID (1771).

See the SGI resource administration documentation for information about the csaedit command.



Using SGI User Limits Database (ULDB--IRIX only)

The SGI user limits database (ULDB) allows user-specific limits for jobs. If no ULDB is defined, job limits are the same for all jobs. If you use ULDB, you can configure LSF so that jobs submitted to a host with the SGI job limits package installed are subject to the job limits configured in the ULDB.

Set LSF_ULDB_DOMAIN=domain_name in lsf.conf to specify the name of the LSF domain in the ULDB domain directive. A domain definition of name domain_name must be configured in the jlimit.in input file.

The ULDB contains job limit information that system administrators use to control access to a host on a per user basis. The job limits in the ULDB override the system default values for both job limits and process limits. When a ULDB domain is configured, the limits will be enforced as SGI job limits.

If the ULDB domain specified in LSF_ULDB_DOMAIN is not valid or does not exist, LSF uses the limits defined in the domain named batch. If the batch domain does not exist, then the system default limits are set.

When an LSF job is submitted, an SGI job is created, and the job limits in the ULDB are applied.

Next, LSF resource usage limits are enforced for the SGI job under which the LSF job is running. LSF limits override the corresponding SGI job limits. The ULDB limits are used for any LSF limits that are not defined. If the job reaches the SGI job limits, the action defined in the SGI system is used.

SGI job limits in the ULDB apply only to batch jobs.

You can also define resource limits (rlimits) in the ULDB domain. One advantage to defining rlimits in ULDB as opposed to in LSF is that rlimits can be defined per user and per domain in ULDB, whereas in LSF, limits are enforced per queue or per job.

See the SGI resource administration documentation for information about configuring ULDB domains in the jlimit.in file.

SGI Altix


SGI ULDB is not supported on Altix systems, so no process aggregate (PAGG) job-level resource limits are enforced for jobs running on Altix. Other operating system and LSF resource usage limits are still enforced.

LSF resource usage limits controlled by ULDB job limits

Increasing the default MEMLIMIT for ULDB

In some pre-defined LSF queues, such as normal, the default MEMLIMIT is set to 5000 (5 MB). However, if ULDB is enabled (LSF_ULDB_DOMAIN is defined), MEMLIMIT should be set greater than 8000 in lsb.queues.
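
For example, a queue definition in lsb.queues might set (the value is a placeholder above the 8000 KB threshold):

Begin Queue
QUEUE_NAME = normal
MEMLIMIT   = 10000
...
End Queue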

Example ULDB domain configuration

The following steps enable the ULDB domain LSF for user user1:

  1. Define the LSF_ULDB_DOMAIN parameter in lsf.conf:
    ...
    LSF_ULDB_DOMAIN=LSF
    ...
    

Note


You can set the LSF_ULDB_DOMAIN to include more than one domain. For example:
LSF_ULDB_DOMAIN="lsf:batch:system"

  2. Configure the domain directive LSF in the jlimit.in file:
    domain <LSF> {                           # domain for LSF 
            jlimit_numproc_cur = unlimited
            jlimit_numproc_max = unlimited   # JLIMIT_NUMPROC 
            jlimit_nofile_cur = unlimited
            jlimit_nofile_max = unlimited    # JLIMIT_NOFILE 
            jlimit_rss_cur = unlimited
            jlimit_rss_max = unlimited       # JLIMIT_RSS 
            jlimit_vmem_cur = 128M
            jlimit_vmem_max = 256M           # JLIMIT_VMEM 
            jlimit_data_cur = unlimited
            jlimit_data_max =unlimited       # JLIMIT_DATA 
            jlimit_cpu_cur = 80
            jlimit_cpu_max = 160             # JLIMIT_CPU 
    } 
    
  3. Configure the user limit directive for user1 in the jlimit.in file:
    user user1 { 
            LSF { 
               jlimit_data_cur = 128M 
               jlimit_data_max = 256M 
             } 
    } 
    
  4. Use the IRIX genlimits command to create the user limits database:
    genlimits -l -v
    



SGI Job Container and Process Aggregate Support

An SGI job contains all processes created in a login session, including array sessions and session leaders. Job limits set in ULDB are applied to SGI jobs either at creation time or through the lifetime of the job. Job limits can also be reset on a job during its lifetime.

SGI IRIX job containers

If SGI Job Limits is installed, LSF HPC creates a job container when starting a job, uses the job container to signal all processes in the job, and uses the SGI job ID to collect job resource usage for a job.

If LSF_ULDB_DOMAIN is defined in lsf.conf, ULDB job limits are applied to the job.

The SGI job ID is also used for kernel-level checkpointing.

SGI Altix Process Aggregates (PAGG)

Similar to an SGI job container, a process aggregate (PAGG) is a collection of processes. A child process in a PAGG inherits membership, or attachment, to the same process aggregate containers as the parent process. When a process inherits membership, the process aggregate containers are updated for the new process member. When a process exits, it leaves the set of process members and the aggregate containers are updated again.

SGI Altix


Since SGI ULDB is not supported on Altix systems, no PAGG job-level resource limits are enforced for jobs running on Altix. Other operating system level and LSF resource limits are still enforced.

Viewing SGI job ID and Array Session Handle (ASH)

Use bjobs and bhist to display SGI job ID and Array Session Handle.

SGI Altix


On Altix systems, the array session handle is not available. It is displayed as ASH=0x0.

bjobs -l 640
Job <640>, User <user1>, Project <default>, Status <RUN>, Queue <normal>, 
                     Command <pam -mpi -auto_place myjob>
Tue Jan 20 12:37:18: Submitted from host <hostA>, CWD <$HOME>, 2 Processors Re
                     quested;
Tue Jan 20 12:37:29: Started on 2 Hosts/Processors <2*hostA>,
                     Execution Home </home/user1>, Execution CWD </home/user1>;
Tue Jan 20 12:37:29: CPUSET_TYPE=dynamic;NHOSTS=1;ALLOCINFO=hostA 640-0;
Tue Jan 20 12:38:22: Resource usage collected.
                     MEM: 1 Mbytes;  SWAP: 5 Mbytes;  NTHREAD: 1
                     PGID: 5020232;  PIDs: 5020232


 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

 EXTERNAL MESSAGES:
 MSG_ID FROM       POST_TIME      MESSAGE                        ATTACHMENT

 0          -             -                        -                      -
 1          -             -                        -                      -
 2      root       Jan 20 12:41   JID=0x2bc0000000001f7a; ASH=0x2bc0f     N

bhist -l 640
Job <640>, User <user1>, Project <default>, Command 
                     <pam -mpi -auto_place myjob>
Sat Oct 19 14:52:14: Submitted from host <hostA>, to Queue <normal>, CWD
                     <$HOME>, Requested Resources <unclas>;
Sat Oct 19 14:52:22: Dispatched to <hostA>;
Sat Oct 19 14:52:22: CPUSET_TYPE=none;NHOSTS=1;ALLOCINFO=hostA;
Sat Oct 19 14:52:23: Starting (Pid 5020232);
Sat Oct 19 14:52:23: Running with execution home </home/user1>, Execution CWD
                     </home/user1>, Execution Pid <5020232>;
Sat Oct 19 14:53:22: External Message "JID=0x2bc0000000001f7a; ASH=0x2bc0f" was
                     posted from "root" to message box 2;

Summary of time in seconds spent in various states by  Sat Oct 19 14:54:00
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  8        0        98       0        0        0        106 


