Platform Computing Corp.

Working with Your Cluster

Viewing cluster information

LSF provides commands for users to access information about the cluster. Cluster information includes the cluster master host, cluster name, cluster resource definitions, cluster administrator, and so on.

    To view the ...             Run ...
    Version of LSF              lsid
    Cluster name                lsid
    Current master host         lsid
    Cluster administrators      lsclusters
    Configuration parameters    bparams

View LSF version, cluster name, and current master host

  1. Run lsid to display the version of LSF, the name of your cluster, and the current master host:

    lsid
    Platform LSF 7 Update 5 March 3 2009
    Copyright 1992-2009 Platform Computing Corporation
    
    My cluster name is cluster1
    My master name is hostA 
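
    If you need these values in a script, the output can be parsed. The following is a minimal Python sketch, assuming the output keeps the "My cluster name is" and "My master name is" wording shown above. The function name parse_lsid is hypothetical, and the sample text is copied from the example rather than produced by a live cluster.

```python
import re

# Sample text copied from the lsid example above (not live output).
SAMPLE_LSID_OUTPUT = """\
Platform LSF 7 Update 5 March 3 2009
Copyright 1992-2009 Platform Computing Corporation

My cluster name is cluster1
My master name is hostA
"""


def parse_lsid(text):
    """Return the version line, cluster name, and master host from lsid output."""
    # Assumption: the first non-empty line is the version banner.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    cluster = re.search(r"^My cluster name is (\S+)", text, re.MULTILINE)
    master = re.search(r"^My master name is (\S+)", text, re.MULTILINE)
    return {
        "version": lines[0] if lines else None,
        "cluster": cluster.group(1) if cluster else None,
        "master": master.group(1) if master else None,
    }


info = parse_lsid(SAMPLE_LSID_OUTPUT)
print(info["cluster"], info["master"])
```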
    

View cluster administrators

  1. Run lsclusters to find out who your cluster administrator is and see a summary of your cluster:

    lsclusters
    CLUSTER_NAME   STATUS   MASTER_HOST    ADMIN        HOSTS     SERVERS
    cluster1       ok       hostA          lsfadmin     6         6
     

    If you are using the LSF MultiCluster product, the output of lsclusters contains one line for each cluster that your local cluster is connected to.
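
    The lsclusters summary is a whitespace-separated table, so a script can turn each cluster line into a record. The following Python sketch assumes that no field contains spaces. The function name parse_lsclusters is hypothetical, and the sample text is copied from the example above.

```python
# Sample text copied from the lsclusters example above (not live output).
SAMPLE_LSCLUSTERS_OUTPUT = """\
CLUSTER_NAME   STATUS   MASTER_HOST    ADMIN        HOSTS     SERVERS
cluster1       ok       hostA          lsfadmin     6      6
"""


def parse_lsclusters(text):
    """Return one dict per cluster line, keyed by lower-cased column name."""
    # Assumption: columns are separated by whitespace and contain no spaces.
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        return []
    header = [name.lower() for name in rows[0]]
    return [dict(zip(header, row)) for row in rows[1:]]


clusters = parse_lsclusters(SAMPLE_LSCLUSTERS_OUTPUT)
for cluster in clusters:
    print(f"{cluster['cluster_name']}: admin={cluster['admin']} master={cluster['master_host']}")
```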

View configuration parameters

  1. Run bparams to display the generic configuration parameters of LSF. These include default queues, job dispatch interval, job checking interval, and job accepting interval.

    bparams
    Default Queues:  normal idle
    Job Dispatch Interval:  20 seconds
    Job Checking Interval:  15 seconds
    Job Accepting Interval:  20 seconds 
    
  2. Run bparams -l to display the information in long format, which gives a brief description of each parameter and the name of the parameter as it appears in lsb.params.

    bparams -l
    
    System default queues for automatic queue selection:
        DEFAULT_QUEUE = normal idle
    
    The interval for dispatching jobs by master batch daemon:
        MBD_SLEEP_TIME = 20 (seconds)
    
    The interval for checking jobs by slave batch daemon:
        SBD_SLEEP_TIME = 15 (seconds)
    
    The interval for a host to accept two batch jobs subsequently:
        JOB_ACCEPT_INTERVAL = 1 (* MBD_SLEEP_TIME)
    
    The idle time of a host for resuming pg suspended jobs:
        PG_SUSP_IT = 180 (seconds)
    
    The amount of time during which finished jobs are kept in core:
        CLEAN_PERIOD = 3600 (seconds)
    
    The maximum number of finished jobs that are logged in current event file:
        MAX_JOB_NUM = 2000
    
    The maximum number of retries for reaching a slave batch daemon:
        MAX_SBD_FAIL = 3
    
    The number of hours of resource consumption history:
        HIST_HOURS = 5
    
    The default project assigned to jobs.
        DEFAULT_PROJECT = default
    
    Sync up host status with master LIM is enabled:
    LSB_SYNC_HOST_STAT_LIM = Y
    
    MBD child query processes will only run on the following CPUs:
    MBD_QUERY_CPUS=1 2 3 
    
  3. Run bparams -a to display all configuration parameters and their values in lsb.params.

    For example:

    bparams -a 
    lsb.params configuration at Fri Jun 8 10:27:52 CST 2007 
         MBD_SLEEP_TIME = 20 
         SBD_SLEEP_TIME = 15 
         JOB_ACCEPT_INTERVAL = 1 
         SUB_TRY_INTERVAL = 60 
         LSB_SYNC_HOST_STAT_LIM =  N 
         MAX_JOBINFO_QUERY_PERIOD = 2147483647 
         PEND_REASON_UPDATE_INTERVAL = 30 
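
    The NAME = value lines of bparams -a are easy to read into a dictionary. The following Python sketch does that with the sample output above, then derives the job accepting interval in seconds, relying on the bparams -l output, which shows JOB_ACCEPT_INTERVAL as a multiple of MBD_SLEEP_TIME. The function name parse_bparams is hypothetical.

```python
# Sample text copied from the bparams -a example above (not live output).
SAMPLE_BPARAMS_OUTPUT = """\
lsb.params configuration at Fri Jun 8 10:27:52 CST 2007
     MBD_SLEEP_TIME = 20
     SBD_SLEEP_TIME = 15
     JOB_ACCEPT_INTERVAL = 1
     SUB_TRY_INTERVAL = 60
     LSB_SYNC_HOST_STAT_LIM =  N
     MAX_JOBINFO_QUERY_PERIOD = 2147483647
     PEND_REASON_UPDATE_INTERVAL = 30
"""


def parse_bparams(text):
    """Return a dict of parameter name to value (as strings)."""
    params = {}
    for line in text.splitlines():
        name, separator, value = line.partition("=")
        if separator:  # skip lines without '=', such as the timestamp header
            params[name.strip()] = value.strip()
    return params


params = parse_bparams(SAMPLE_BPARAMS_OUTPUT)
# JOB_ACCEPT_INTERVAL is expressed in units of MBD_SLEEP_TIME.
accept_seconds = int(params["JOB_ACCEPT_INTERVAL"]) * int(params["MBD_SLEEP_TIME"])
print(accept_seconds)
```

    With these sample values the result is 20 seconds, which matches the Job Accepting Interval reported by plain bparams.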
    

Viewing daemon parameter configuration

  1. Display all configuration settings for running LSF daemons.
  2. Display mbatchd and root sbatchd configuration.
  3. Display LIM configuration.

    Use lsadmin showconf lim to display the parameters configured in lsf.conf or ego.conf that apply to root LIM.

    By default, lsadmin displays the local LIM parameters. You can specify the host to display the LIM parameters.

Examples

Example directory structures

UNIX and Linux

The following figures show typical directory structures for a new UNIX or Linux installation with lsfinstall. Depending on which products you have installed and platforms you have selected, your directory structure may vary.

Microsoft Windows

The following diagram shows an example directory structure for a Windows installation.

Cluster administrators

Primary cluster administrator

Required. The first cluster administrator, specified during installation. The primary LSF administrator account owns the configuration and log files. The primary LSF administrator has permission to perform clusterwide operations, change configuration files, reconfigure the cluster, and control jobs submitted by all users.

Other cluster administrators

Optional. May be configured during or after installation.

Cluster administrators can perform administrative operations on all jobs and queues in the cluster. Cluster administrators have the same cluster-wide operational privileges as the primary LSF administrator except that they do not have permission to change LSF configuration files.

Add cluster administrators

  1. In the ClusterAdmins section of lsf.cluster.cluster_name, specify the list of cluster administrators following ADMINISTRATORS, separated by spaces.

    You can specify user names and group names.

    The first administrator in the list is the primary LSF administrator. All others are cluster administrators.

    For example:

    Begin ClusterAdmins
    ADMINISTRATORS = lsfadmin admin1 admin2
    End ClusterAdmins

  2. Save your changes.
  3. Run lsadmin reconfig to reconfigure LIM.
  4. Run badmin mbdrestart to restart mbatchd.
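
The rule that the first name listed is the primary LSF administrator can be checked before you reconfigure. The following Python sketch reads a ClusterAdmins section and separates the primary administrator from the others. It assumes the section looks exactly like the example above (one ADMINISTRATORS line between the Begin and End markers). The function name parse_cluster_admins is hypothetical.

```python
# Sample section copied from the ClusterAdmins example above.
SAMPLE_CLUSTER_ADMINS = """\
Begin ClusterAdmins
ADMINISTRATORS = lsfadmin admin1 admin2
End ClusterAdmins
"""


def parse_cluster_admins(text):
    """Return (primary_admin, other_admins) from a ClusterAdmins section."""
    in_section = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line == "Begin ClusterAdmins":
            in_section = True
        elif line == "End ClusterAdmins":
            in_section = False
        elif in_section and line.startswith("ADMINISTRATORS"):
            _, _, value = line.partition("=")
            names = value.split()  # names are separated by spaces
            if names:
                # The first name is the primary LSF administrator.
                return names[0], names[1:]
    return None, []


primary, others = parse_cluster_admins(SAMPLE_CLUSTER_ADMINS)
print(primary, others)
```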

Controlling daemons

Permissions required

To control all daemons in the cluster, you must be root or a user listed in lsf.sudoers.

Daemon commands

The following is an overview of commands you use to control LSF daemons.

Daemon          Action                  Command                                   Permissions
All in cluster  Start                   lsfstartup                                Must be root or a user listed in lsf.sudoers for all these commands
                Shut down               lsfshutdown
sbatchd         Start                   badmin hstartup [host_name ...|all]       Must be root or a user listed in lsf.sudoers for the startup command
                Restart                 badmin hrestart [host_name ...|all]       Must be root or the LSF administrator for other commands
                Shut down               badmin hshutdown [host_name ...|all]
mbatchd,        Restart                 badmin mbdrestart                         Must be root or the LSF administrator for these commands
mbschd          Shut down               1. badmin hshutdown
                                        2. badmin mbdrestart
                Reconfigure             badmin reconfig
RES             Start                   lsadmin resstartup [host_name ...|all]    Must be root or a user listed in lsf.sudoers for the startup command
                Shut down               lsadmin resshutdown [host_name ...|all]   Must be the LSF administrator for other commands
                Restart                 lsadmin resrestart [host_name ...|all]
LIM             Start                   lsadmin limstartup [host_name ...|all]    Must be root or a user listed in lsf.sudoers for the startup command
                Shut down               lsadmin limshutdown [host_name ...|all]   Must be the LSF administrator for other commands
                Restart                 lsadmin limrestart [host_name ...|all]
                Restart all in cluster  lsadmin reconfig

sbatchd

Restarting sbatchd on a host does not affect jobs that are running on that host.

If sbatchd is shut down, the host is not available to run new jobs. Existing jobs running on that host continue, but the results are not sent to the user until sbatchd is restarted.

LIM and RES

Jobs running on the host are not affected by restarting the daemons.

If a daemon is not responding to network connections, lsadmin displays an error message with the host name. In this case you must kill and restart the daemon manually.

If the LIM and the other daemons on the current master host shut down, another host automatically takes over as master.

If the RES is shut down while remote interactive tasks are running on the host, the running tasks continue but no new tasks are accepted.

Controlling mbatchd

You use the badmin command to control mbatchd.

Reconfigure mbatchd

If you add a host to a host group, a host to a queue, or change resource configuration in the Hosts section of lsf.cluster.cluster_name, the change is not recognized by jobs that were submitted before you reconfigured. If you want the new host to be recognized, you must restart mbatchd.

  1. Run badmin reconfig.

When you reconfigure the cluster, mbatchd is not restarted. Only configuration files are reloaded.

Restart mbatchd

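  1. Run badmin mbdrestart.

    LSF checks configuration files for errors and prints the results to stderr. If no errors are found, configuration files are reloaded, mbatchd is restarted, and events in lsb.events are replayed to recover the running state of the cluster.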

tip:  
Whenever mbatchd is restarted, it is unavailable to service requests. In large clusters where there are many events in lsb.events, restarting mbatchd can take some time. To avoid replaying events in lsb.events, use the command badmin reconfig.

Log a comment when restarting mbatchd

  1. Use the -C option of badmin mbdrestart to log an administrator comment in lsb.events.

    For example:

    badmin mbdrestart -C "Configuration change"

    The comment text Configuration change is recorded in lsb.events.

  2. Run badmin hist or badmin mbdhist to display administrator comments for mbatchd restart.
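
When a script issues the restart, the comment must reach badmin as a single argument even though it contains spaces. The following Python sketch only builds the argument list and prints a shell-quoted form; it does not run the command. The function name mbdrestart_command is hypothetical. The sketch does not handle the confirmation prompt that badmin mbdrestart displays.

```python
import shlex


def mbdrestart_command(comment=None):
    """Build the argument list for badmin mbdrestart, optionally with -C."""
    argv = ["badmin", "mbdrestart"]
    if comment:
        # Keeping the comment as one list element preserves its spaces.
        argv += ["-C", comment]
    return argv


argv = mbdrestart_command("Configuration change")
print(shlex.join(argv))
```

Passing the list form to a process launcher avoids shell quoting problems; the printed string is only for logging or review.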

Shut down mbatchd

  1. Run badmin hshutdown to shut down sbatchd on the master host.

    For example:

    badmin hshutdown hostD
    Shut down slave batch daemon on <hostD> .... done

  2. Run badmin mbdrestart:

    badmin mbdrestart
    Checking configuration files ...
    No errors found. 
     

    This causes mbatchd and mbschd to exit. mbatchd cannot be restarted, because sbatchd is shut down. All LSF services are temporarily unavailable, but existing jobs are not affected. When mbatchd is later started by sbatchd, its previous status is restored from the event log file and job scheduling continues.

Customize batch command messages

LSF displays error messages when a batch command cannot communicate with mbatchd. Users see these messages when the batch command retries the connection to mbatchd.

You can customize three of these messages to provide LSF users with more detailed information and instructions.

  1. In the file lsf.conf, identify the parameter for the message that you want to customize.

    The following lists the parameters you can use to customize messages when a batch command does not receive a response from mbatchd.

    Reason for no response from mbatchd: mbatchd is too busy to accept new connections or respond to client requests
    Default message: LSF is processing your request. Please wait...
    Parameter used to customize the message: LSB_MBD_BUSY_MSG

    Reason for no response from mbatchd: internal system connections to mbatchd fail
    Default message: Cannot connect to LSF. Please wait...
    Parameter used to customize the message: LSB_MBD_CONNECT_FAIL_MSG

    Reason for no response from mbatchd: mbatchd is down or there is no process listening at either the LSB_MBD_PORT or the LSB_QUERY_PORT
    Default message: LSF is down. Please wait...
    Parameter used to customize the message: LSB_MBD_DOWN_MSG

  2. Specify a message string, or specify an empty string.
  3. Save and close the lsf.conf file.

Reconfiguring your cluster

After changing LSF configuration files, you must tell LSF to reread the files to update the configuration. Use the lsadmin reconfig, badmin reconfig, and badmin mbdrestart commands to reconfigure a cluster.

The reconfiguration commands you use depend on which files you change in LSF. The following table is a quick reference.

After making changes to ...   Use ...                                                        Which ...
hosts                         badmin reconfig                                                reloads configuration files
license.dat                   lsadmin reconfig AND badmin mbdrestart                         restarts LIM, reloads configuration files, and restarts mbatchd
lsb.applications              badmin reconfig                                                reloads configuration files. Pending jobs use new application profile definition. Running jobs are not affected.
lsb.hosts                     badmin reconfig                                                reloads configuration files
lsb.modules                   badmin reconfig                                                reloads configuration files
lsb.nqsmaps                   badmin reconfig                                                reloads configuration files
lsb.params                    badmin reconfig                                                reloads configuration files
lsb.queues                    badmin reconfig                                                reloads configuration files
lsb.resources                 badmin reconfig                                                reloads configuration files
lsb.serviceclasses            badmin reconfig                                                reloads configuration files
lsb.users                     badmin reconfig                                                reloads configuration files
lsf.cluster.cluster_name      lsadmin reconfig AND badmin mbdrestart                         restarts LIM, reloads configuration files, and restarts mbatchd
lsf.conf                      lsadmin reconfig AND badmin mbdrestart                         reconfigures LIM, reloads configuration files, and restarts mbatchd
lsf.licensescheduler          bladmin reconfig AND lsadmin reconfig AND badmin mbdrestart    reconfigures bld, reconfigures LIM, reloads configuration files, and restarts mbatchd
lsf.shared                    lsadmin reconfig AND badmin mbdrestart                         restarts LIM, reloads configuration files, and restarts mbatchd
lsf.sudoers                   badmin reconfig                                                reloads configuration files
lsf.task                      lsadmin reconfig AND badmin reconfig                           restarts LIM and reloads configuration files
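
When you change several files at once, the quick reference can be applied mechanically. The following Python sketch transcribes the table into a dictionary and returns the combined list of commands for a set of changed files. The names RECONFIG_COMMANDS, COMMAND_ORDER, and commands_for are hypothetical, and the ordering of the combined list is an assumption based on the order in which the table lists commands. The key lsf.cluster.cluster_name is the placeholder file name from the table, not a real cluster name.

```python
# File name -> commands, transcribed from the quick-reference table above.
RECONFIG_COMMANDS = {
    "hosts": ["badmin reconfig"],
    "license.dat": ["lsadmin reconfig", "badmin mbdrestart"],
    "lsb.applications": ["badmin reconfig"],
    "lsb.hosts": ["badmin reconfig"],
    "lsb.modules": ["badmin reconfig"],
    "lsb.nqsmaps": ["badmin reconfig"],
    "lsb.params": ["badmin reconfig"],
    "lsb.queues": ["badmin reconfig"],
    "lsb.resources": ["badmin reconfig"],
    "lsb.serviceclasses": ["badmin reconfig"],
    "lsb.users": ["badmin reconfig"],
    "lsf.cluster.cluster_name": ["lsadmin reconfig", "badmin mbdrestart"],
    "lsf.conf": ["lsadmin reconfig", "badmin mbdrestart"],
    "lsf.licensescheduler": ["bladmin reconfig", "lsadmin reconfig", "badmin mbdrestart"],
    "lsf.shared": ["lsadmin reconfig", "badmin mbdrestart"],
    "lsf.sudoers": ["badmin reconfig"],
    "lsf.task": ["lsadmin reconfig", "badmin reconfig"],
}

# Assumption: run commands in the order the table lists them within a row.
COMMAND_ORDER = ["bladmin reconfig", "lsadmin reconfig", "badmin reconfig", "badmin mbdrestart"]


def commands_for(changed_files):
    """Return the de-duplicated, ordered commands for the changed files."""
    needed = set()
    for name in changed_files:
        needed.update(RECONFIG_COMMANDS[name])  # KeyError for unknown files
    return [command for command in COMMAND_ORDER if command in needed]


print(commands_for(["lsb.queues", "lsf.shared"]))
```

The sketch simply takes the union of the listed commands; it does not try to decide whether one command makes another redundant.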

Reconfigure the cluster with lsadmin and badmin

To make a configuration change take effect, use this method to reconfigure the cluster.

  1. Log on to the host as root or the LSF administrator.
  2. Run lsadmin reconfig to reconfigure LIM:

    lsadmin reconfig

    The lsadmin reconfig command checks for configuration errors.

    If no errors are found, you are prompted to either restart LIM on master host candidates only, or to confirm that you want to restart LIM on all hosts. If fatal errors are found, reconfiguration is aborted.

  3. Run badmin reconfig to reconfigure mbatchd:

    badmin reconfig

    The badmin reconfig command checks for configuration errors.

    If fatal errors are found, reconfiguration is aborted.

Reconfigure the cluster by restarting mbatchd

To replay and recover the running state of the cluster, use this method to reconfigure the cluster.

  1. Run badmin mbdrestart to restart mbatchd:

    badmin mbdrestart

    The badmin mbdrestart command checks for configuration errors.

    If no fatal errors are found, you are asked to confirm mbatchd restart. If fatal errors are found, the command exits without taking any action.

    tip:  
    If the lsb.events file is large, or many jobs are running, restarting mbatchd can take some time. In addition, mbatchd is not available to service requests while it is restarted.

View configuration errors

  1. Run lsadmin ckconfig -v.
  2. Run badmin ckconfig -v.

This reports all errors to your terminal.

How reconfiguring the cluster affects licenses

If the license server goes down, LSF can continue to operate for a period of time until it attempts to renew licenses.

Reconfiguring causes LSF to renew licenses. If no license server is available, LSF does not reconfigure the system because the system would lose all its licenses and stop working.

If you have multiple license servers, reconfiguration proceeds provided LSF can contact at least one license server. In this case, LSF still loses the licenses on servers that are down, so LSF may have fewer licenses available after reconfiguration.


Platform Computing Inc.
www.platform.com