Job Related Functions
Deleting a job
To delete a job, either send a KILL signal to the job with lsb_signaljob(), or call lsb_deletejob().

    int lsb_deletejob(jobId, times, options)
    LS_LONG_INT jobId;
    int times;
    int options;         /* Set to 0 */

lsb_deletejob() deletes the job after a specific number of runs. The variable times represents the number of runs.
Viewing job output
The output from an LSF job is normally not available until the job is finished. However, LSBLIB provides lsb_peekjob() to retrieve the name of a job file for the job specified by jobId. To get the job output and job error files, append .out or .err to the end of the base job file name from lsb_peekjob(). Only the job owner can use lsb_peekjob() to see job output.

    char *lsb_peekjob(jobId)
    LS_LONG_INT jobId;   /* Job ID */

On success, lsb_peekjob() returns the job file name. On failure, it returns NULL and sets lsberrno to indicate the error. The next call reuses the storage for the file name.
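The suffix rule above can be sketched as a small helper. This is only an illustration: job_file_path() is a hypothetical function, not part of LSBLIB, and it assumes the caller passes in the base name returned by lsb_peekjob(). Because the next call reuses that storage, the helper writes the result into a buffer owned by the caller.

```c
#include <stdio.h>

/*
 * Build "<base><suffix>" into buf, where suffix is ".out" or ".err".
 * Returns 0 on success, or -1 if an argument is NULL or the result
 * would not fit in buf.
 * Hypothetical helper, not part of LSBLIB.
 */
int job_file_path(const char *base, const char *suffix,
                  char *buf, size_t bufsize)
{
    int n;

    if (base == NULL || suffix == NULL || buf == NULL || bufsize == 0)
        return -1;

    /* snprintf reports the length it needed, so truncation is detectable. */
    n = snprintf(buf, bufsize, "%s%s", base, suffix);
    if (n < 0 || (size_t)n >= bufsize)
        return -1;
    return 0;
}
```

A caller would typically invoke the helper twice on the same base name, once with ".out" and once with ".err", before making any further lsb_peekjob() call.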
Moving jobs from one host to another
Use lsb_mig() to migrate a job from one host to another.

    int lsb_mig(mig, badHostIdx)
    struct submig *mig;  /* Job to be migrated */
    int *badHostIdx;

If the call fails, (**askedHosts)[*badHostIdx] is not a host known to the LSF system.
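One way to read the failure convention above is that *badHostIdx is an index into the askedHosts array of the request. The sketch below looks up the rejected name under that reading. rejected_host() is a hypothetical helper, not part of LSBLIB, and the range check is a defensive assumption for failures that are not caused by an unknown host.

```c
#include <stddef.h>

/*
 * Return the host name that a failed migration request rejected, or NULL
 * if badHostIdx does not point inside the askedHosts array.
 * Hypothetical helper, not part of LSBLIB.
 */
const char *rejected_host(char **askedHosts, int numAskedHosts,
                          int badHostIdx)
{
    if (askedHosts == NULL || badHostIdx < 0 || badHostIdx >= numAskedHosts)
        return NULL;
    return askedHosts[badHostIdx];
}
```

After a failed lsb_mig() call, the arguments would come from the askedHosts and numAskedHosts fields of the struct submig that was passed in, together with the value stored through badHostIdx.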
lsbatch.h defines the struct submig to hold the details of the job to be migrated. It has the following fields:

    struct submig {
        LS_LONG_INT jobId;     /* Job ID to be migrated */
        int options;
        int numAskedHosts;     /* Number of hosts supplied for migration */
        char **askedHosts;     /* Array of pointers to the hosts */
    };

For the values of options, see the options field of struct submit used in the lsb_submit() function call.
On success, lsb_mig() returns 0. On failure, it returns -1 and sets lsberrno to indicate the error.
External job message and data exchange
lsb_postjobmsg() sends an external message/status to a job. It can also transfer an attached data file through a TCP connection. The posted messages and attached data files can be read from mbatchd by invoking lsb_readjobmsg().

    int lsb_postjobmsg(jobExternalMsgReq, fileName)
    struct jobExternalMsgReq *jobExternalMsgReq;
    char *fileName;      /* Data file to be attached */

    int lsb_readjobmsg(jobExternalMsgReq, jobExternalMsgReply)
    struct jobExternalMsgReq *jobExternalMsgReq;
    struct jobExternalMsgReply *jobExternalMsgReply;

Use struct jobExternalMsgReq as a parameter in both lsb_postjobmsg() and lsb_readjobmsg(). It contains all the details on the external message or status to be read or posted.

    struct jobExternalMsgReq {
        int options;             /* Indicates which operation to perform */
    #define EXT_MSG_POST    0x01 /* Post external message */
    #define EXT_ATTA_POST   0x02 /* Post external data file */
    #define EXT_MSG_READ    0x04 /* Read external message */
    #define EXT_ATTA_READ   0x08 /* Read external data file */
    #define EXT_MSG_REPLAY  0x10 /* Replay external message */
        LS_LONG_INT jobId;       /* Job whose message is posted or read */
        char *jobName;           /* Name of the job if jobId is undefined (<=0) */
        int msgIdx;              /* Index in the list */
        char *desc;              /* Text description of the message */
        int userId;              /* Author of the message */
        long dataSize;           /* Size of the data file */
        time_t postTime;         /* Message sending time */
    };

The struct jobExternalMsgReply holds information on the external message/status requested by the user. It is defined in lsbatch.h as follows:

    struct jobExternalMsgReply {
        LS_LONG_INT jobId;       /* Job whose message is read */
        int msgIdx;              /* Index in the message list */
        char *desc;              /* Text description of the message */
        int userId;              /* Author of the message */
        long dataSize;           /* Size of the data file */
        time_t postTime;         /* Message sending time */
        int dataStatus;          /* Status of the attached data */
    #define EXT_DATA_UNKNOWN  0  /* Data transfer for the message is in progress */
    #define EXT_DATA_NOEXIST  1  /* Message without data attached */
    #define EXT_DATA_AVAIL    2  /* Data of the message is available */
    #define EXT_DATA_UNAVAIL  3  /* Data of the message is corrupt */
    };
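As an illustration of how a caller might interpret the dataStatus field of a reply, the sketch below maps each status value to a short description and decides whether attached data is worth fetching. Both helpers are hypothetical, not part of LSBLIB. The status values are copied from the listing above so the sketch compiles without lsbatch.h; when building against LSF, include the real header and remove the local definitions.

```c
/* Status values copied from the struct jobExternalMsgReply listing.
 * Assumption: they match the definitions in the installed lsbatch.h. */
#define EXT_DATA_UNKNOWN  0
#define EXT_DATA_NOEXIST  1
#define EXT_DATA_AVAIL    2
#define EXT_DATA_UNAVAIL  3

/* Hypothetical helper: short text for a dataStatus value. */
const char *ext_data_status_text(int dataStatus)
{
    switch (dataStatus) {
    case EXT_DATA_UNKNOWN: return "transfer in progress";
    case EXT_DATA_NOEXIST: return "no data attached";
    case EXT_DATA_AVAIL:   return "data available";
    case EXT_DATA_UNAVAIL: return "data corrupt";
    default:               return "unrecognized status";
    }
}

/* Hypothetical helper: nonzero only when the attached data can be read. */
int ext_data_ready(int dataStatus)
{
    return dataStatus == EXT_DATA_AVAIL;
}
```

A reader would typically call lsb_readjobmsg() first and request the attachment only if ext_data_ready() accepts the dataStatus in the reply.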
User and Host Related Functions
User information
Use lsb.users to:
- Configure user groups, hierarchical fairshare for users and user groups, and job slot limits for users and user groups.
- Configure account mappings in a MultiCluster environment.
LSBLIB provides the function lsb_userinfo() for getting information on LSF users and user groups.

    struct userInfoEnt *lsb_userinfo(users, numUsers)
    char **users;        /* User names */
    int *numUsers;       /* Number of user names */

To get information about all users, set *numUsers = 0; *numUsers is updated to the actual number of users when lsb_userinfo() returns. To get information on the invoker, set users = NULL and *numUsers = 1.
The function returns an array of userInfoEnt structures containing user information. The structure is defined in lsbatch.h as follows:

    struct userInfoEnt {
        char *user;          /* Name of the user or user group */
        float procJobLimit;  /* Max number of started jobs on each processor */
        int maxJobs;         /* Max number of started or running jobs allowed */
        int numStartJobs;    /* Number of started jobs of the user/group */
        int numJobs;         /* Number of jobs the user/group submitted */
        int numPEND;         /* Number of pending jobs of the user/group */
        int numRUN;          /* Number of running jobs of the user/group */
        int numSSUSP;        /* Number of system-suspended jobs */
        int numUSUSP;        /* Number of user-suspended jobs */
        int numRESERVE;      /* Number of job slots reserved for pending jobs */
    };

lsb_userinfo() gets:
- The maximum number of job slots that a user can use simultaneously on any host
- The maximum number of job slots that a user can use simultaneously in the whole local LSF cluster
- The current number of job slots used by running and suspended jobs
- The current number of job slots reserved for pending jobs
The maximum numbers of job slots are defined in the lsb.users LSF configuration file. The reserved user name default, also defined in lsb.users, matches users not already listed in lsb.users who have no jobs started in the system.
On success, lsb_userinfo() returns an array of userInfoEnt structures and sets *numUsers to the number of userInfoEnt structures returned. The next call writes over the returned array.
On failure, lsb_userinfo() returns NULL and sets lsberrno to indicate the error. If lsberrno is LSBE_BAD_USER, (*users)[*numUsers] is not a user known to the LSF system. Otherwise, if *numUsers is less than its original value, *numUsers is the actual number of users found.
Getting information in host group or user group
lsb_hostgrpinfo() and lsb_usergrpinfo() get membership of LSF host or user groups.

    struct groupInfoEnt *lsb_hostgrpinfo(groups, numGroups, options)
    struct groupInfoEnt *lsb_usergrpinfo(groups, numGroups, options)
    char **groups;       /* Array of group names */
    int *numGroups;      /* Number of group names */
    int options;

    struct groupInfoEnt {
        char *group;                    /* Group name */
        char *memberList;               /* ASCII list of member names */
        int numUserShares;              /* Number of users with shares */
        struct userShares *userShares;  /* User shares representation */
    };

    struct userShares {
        char *user;          /* User name */
        int shares;          /* Number of shares assigned to the user */
    };

options is the bitwise inclusive OR of some of the following flags:
- Get the information of the user group.
- Get the information of the host.
- Expand the group membership recursively. That is, if a member of a group is itself a group, give the names of its members recursively, rather than its name, which is the default.
- Get membership of all groups.
- Display the information in the long format.
lsb_hostgrpinfo() gets LSF host group membership; lsb_usergrpinfo() gets LSF user group membership. lsb.users(5) and lsb.hosts(5) define LSF user and host groups, respectively.
On success, lsb_hostgrpinfo() and lsb_usergrpinfo() return an array of groupInfoEnt structures which hold the group name and the list of names of its members. If a member of a group is itself a group (i.e., a subgroup), then a '/' is appended to the name to indicate this. *numGroups is the number of groupInfoEnt structures returned.
On failure, lsb_hostgrpinfo() and lsb_usergrpinfo() return NULL and set lsberrno to indicate the error. If lsberrno is LSBE_BAD_GROUP, (*groups)[*numGroups] is not a group known to the LSF system. Otherwise, if *numGroups is less than its original value, *numGroups is the actual number of groups found.
Host partition in fairshare scheduling
To configure host partition fairshare, define a host partition in lsb.hosts. Use lsb_hostpartinfo() to get information on the defined host partitions.

    struct hostPartInfoEnt *lsb_hostpartinfo(hostParts, numHostParts)
    char **hostParts;    /* Host partition names */
    int *numHostParts;   /* Number of host partition names */

To get information on all host partitions, set hostParts to NULL; *numHostParts is the actual number of host partitions when lsb_hostpartinfo() returns. The next call reuses the storage for the array of hostPartInfoEnt structures.
lsb_hostpartinfo() returns a struct hostPartInfoEnt describing the host partitions:

    struct hostPartInfoEnt {
        char hostPart[MAX_LSB_NAME_LEN];  /* Name of the host partition */
        char *hostList;                   /* Names of hosts in the partition */
        int numUsers;                     /* Number of users sharing the partition */
        struct hostPartUserInfo *users;   /* Description of user in the partition */
    };

The string variable hostList contains the names of the hosts in the partition, and each name has a forward slash character (/) appended. (See lsb_groupinfo(3).)
The struct hostPartUserInfo holds information on a specific user in the host partition.

    struct hostPartUserInfo {
        char user[MAX_LSB_NAME_LEN];  /* User name */
        int shares;                   /* Number of shares assigned to the user */
        float priority;               /* Priority of user to use the host partition */
        int numStartJobs;             /* Number of started jobs on host partition */
        float histCpuTime;            /* Normalized CPU time of finished jobs */
        int numReserveJobs;           /* Number of reserved job slots for pending jobs */
        int runTime;                  /* Time unfinished jobs spend in RUN state */
    };

For priority, bigger values represent higher priorities. Jobs belonging to the user or user group with the highest priority are considered first for dispatch when resources in the host partition are being contended for. In general, a user or user group with more shares, fewer numStartJobs and less histCpuTime has higher priority.
On success, lsb_hostpartinfo() returns an array of hostPartInfoEnt structures which hold information on the host partitions, and sets *numHostParts to the number of hostPartInfoEnt structures.
On failure, lsb_hostpartinfo() returns NULL and sets lsberrno to indicate the error. If lsberrno is LSBE_BAD_HPART, (*hostParts)[*numHostParts] is not a host partition known to the LSF system. Otherwise, if *numHostParts is less than its original value, *numHostParts is the actual number of host partitions found.
Controlling hosts and daemons
The user can control the hosts and daemons through lsb_hostcontrol() and lsb_reconfig().
lsb_hostcontrol() opens or closes a host and restarts or shuts down the slave batch daemon.

    int lsb_hostcontrol(struct hostCtrlReq *);

    struct hostCtrlReq {
        char *host;          /* Host to be controlled */
        int opCode;          /* Option for host control */
        char *message;       /* Message attached by the admin */
    };

If host is NULL, the local host is assumed. lsbatch.h defines the opCode parameter containing the following control selection flags:
- Closes the host so that no jobs can be dispatched to it.
- Opens the host to accept jobs.
- Restarts the sbatchd on the host. The sbatchd will receive a request from the mbatchd and re-execute itself. This permits the sbatchd binary to be updated. This operation will fail if no sbatchd is running on the specified host.
- The sbatchd on the host will exit.
- HOST_CLOSE_REMOTE: (MultiCluster) Closes a leased host on the submission cluster.
To use updated LSF batch configuration files, the user can call lsb_reconfig() to restart the master batch daemon, mbatchd.

    int lsb_reconfig(struct mbdCtrlReq *);

    struct mbdCtrlReq {
        int opCode;          /* Options for configuration */
        char *name;          /* Reserved for future use */
        char *message;       /* Message attached by the admin */
    };

The parameter opCode is defined in lsbatch.h and should be one of the following:
- Restart a new mbatchd
- Reread the configuration files
- Check validity of the mbatchd configuration files
Use lsb_reconfig() to:
- Dynamically reconfigure an LSF batch system to pick up new configuration parameters
- Apply changes made to the job queue setup since system startup or the last reconfiguration
- Restart a new master batch daemon
- Check the validity of the configuration files
On success, both lsb_hostcontrol() and lsb_reconfig() return 0. On failure, they return -1 and set lsberrno to indicate the error.
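The two request structures can be filled in as sketched below. This is an illustration only: the structs are local mirrors of the listings above so that the code compiles without lsbatch.h, init_host_ctrl() and init_mbd_ctrl() are hypothetical helpers, and the opCode value is supplied by the caller from the flags in lsbatch.h. Leaving name as NULL is an assumption based on the field being reserved for future use.

```c
#include <stddef.h>

/* Local mirrors of the request structures listed above.
 * When building against LSF, include lsbatch.h instead. */
struct hostCtrlReq {
    char *host;
    int opCode;
    char *message;
};

struct mbdCtrlReq {
    int opCode;
    char *name;
    char *message;
};

/* Fill a host control request. A NULL host selects the local host. */
void init_host_ctrl(struct hostCtrlReq *req, char *host,
                    int opCode, char *message)
{
    req->host = host;
    req->opCode = opCode;
    req->message = message;
}

/* Fill a reconfiguration request. The name field is reserved, so this
 * sketch leaves it NULL (an assumption, not documented behavior). */
void init_mbd_ctrl(struct mbdCtrlReq *req, int opCode, char *message)
{
    req->opCode = opCode;
    req->name = NULL;
    req->message = message;
}
```

The filled structure would then be passed to lsb_hostcontrol() or lsb_reconfig(), and a return value of -1 would be followed by a check of lsberrno.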
Date Modified: March 13, 2009
Platform Computing: www.platform.com
Platform Support: support@platform.com
Platform Information Development: doc@platform.com
Copyright © 1994-2009 Platform Computing Corporation. All rights reserved.