A frequent problem is non-accessible files due to a non-uniform file space. If a task is run on a remote host where a file it requires cannot be accessed using the same name, an error results. Almost all interactive LSF commands fail if the user’s current working directory cannot be found on the remote host.
If you are running NFS, rearranging the NFS mount table may solve the problem. If your system is running the automount server, LSF tries to map the filenames, and in most cases it succeeds. If shared mounts are used, the mapping may break for those files. In such cases, specific measures need to be taken to get around it.
The automount maps must be managed through NIS. When LSF tries to map filenames, it assumes that automounted file systems are mounted under the /tmp_mnt directory.
To share files among Windows machines, set up a share on the server and access it from the client. You can access files on the share either by specifying a UNC path (\\server\share\path) or connecting the share to a local drive name and using a drive:\path syntax. Using UNC is recommended because drive mappings may be different across machines, while UNC allows you to unambiguously refer to a file on the network.
This section lists some common problems with LSF jobs. Most problems are due to incorrect installation or configuration. Check the mbatchd and sbatchd error log files; often the log message points directly to the problem.
The section also includes some common problems with the LIM, the RES and interactive applications.
This displays most configuration errors. If this does not report any errors, check in the LIM error log.
Sometimes the LIM is up, but executing the lsload command prints the following error message:
If the LIM has just been started, this is normal, because the LIM needs time to get initialized by reading configuration files and contacting other LIMs. If the LIM does not become available within one or two minutes, check the LIM error log for the host you are working on.
To prevent communication timeouts when starting or restarting the local LIM, define the parameter LSF_SERVER_HOSTS in the lsf.conf file. The client will contact the LIM on one of the LSF_SERVER_HOSTS and execute the command, provided that at least one of the hosts defined in the list has a LIM that is up and running.
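As a rough illustration of the setting described above, the following Python sketch reads lsf.conf-style text and lists the hosts named in LSF_SERVER_HOSTS. The helper name parse_server_hosts, the sample host names, and the sample path are hypothetical; the sketch assumes the usual NAME=value layout with an optional double-quoted, space-separated value.

```python
def parse_server_hosts(conf_text):
    """Return the host names listed in LSF_SERVER_HOSTS, or [] if it is not defined."""
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if sep and name.strip() == "LSF_SERVER_HOSTS":
            # Drop optional surrounding double quotes, then split on whitespace.
            return value.strip().strip('"').split()
    return []


# Hypothetical lsf.conf fragment; the host names and path are placeholders.
sample = '''
LSF_CONFDIR=/usr/share/lsf/conf
LSF_SERVER_HOSTS="hostA hostB hostC"
'''
print(parse_server_hosts(sample))
```

An empty result means that commands on this host can only contact the local LIM.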
When the local LIM is running but there is no master LIM in the cluster, LSF applications display the following message:
Cannot locate master LIM now, try later.
Check the LIM error logs on the first few hosts listed in the Host section of the lsf.cluster.cluster_name file. If LSF_MASTER_LIST is defined in lsf.conf, check the LIM error logs on the hosts listed in this parameter instead.
Sometimes the master LIM is up, but executing the lsload or lshosts command prints the following error message:
If the /etc/hosts file on the host where the master LIM is running assigns the host name to the loopback IP address (127.0.0.1), LSF client LIMs cannot contact the master LIM. When the master LIM starts up, it sets its official host name and IP address to the loopback address. Any client that asks for the master LIM address is then given 127.0.0.1, so it tries to connect to itself instead of the master host. The following /etc/hosts entry causes this problem:
127.0.0.1 localhost myhostname
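One way to spot this situation is to scan the hosts file for names bound to a loopback address. The Python sketch below is illustrative only: the function names are hypothetical, and it works on the text of a hosts file rather than opening /etc/hosts itself.

```python
import ipaddress


def loopback_names(hosts_text):
    """Return every name that a hosts file maps to a loopback address."""
    names = set()
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # not a plain IP address; ignore the line
        if addr.is_loopback:
            names.update(fields[1:])
    return names


def hostname_on_loopback(hosts_text, hostname):
    """True if hostname appears on a loopback line of the hosts file."""
    return hostname in loopback_names(hosts_text)


bad = "127.0.0.1 localhost myhostname\n"
good = "127.0.0.1 localhost\n192.0.2.10 myhostname\n"  # 192.0.2.10 is a placeholder
print(hostname_on_loopback(bad, "myhostname"), hostname_on_loopback(good, "myhostname"))
```

If the check returns True for the master host's own name, move that name to a line with the host's real network address.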
On UNIX, if the RES is unable to read the lsf.conf file and does not know where to write error messages, it logs errors into syslog.
On Windows, if the RES is unable to read the lsf.conf file and does not know where to write error messages, it logs errors into C:\temp.
If remote execution fails with the following error message, the remote host could not securely determine the user ID of the user requesting remote execution.
Check the RES error log on the remote host; this usually contains a more detailed error message.
If you are not using an identification daemon (LSF_AUTH is not defined in the lsf.conf file), then all applications that do remote executions must be owned by root with the setuid bit set. This can be done as follows.
If the binaries are on an NFS-mounted file system, make sure that the file system is not mounted with the nosuid flag.
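To verify the ownership and setuid requirement described above, a short check like the following Python sketch may help. The function names are hypothetical, and the sketch only inspects file metadata; it does not detect a nosuid mount, which still has to be checked in the mount table.

```python
import os
import stat


def is_setuid_root(mode, uid):
    """True if the file is owned by root (uid 0) and has the setuid bit set."""
    return uid == 0 and bool(mode & stat.S_ISUID)


def check_binary(path):
    """Apply is_setuid_root to the metadata of the file at path."""
    st = os.stat(path)
    return is_setuid_root(st.st_mode, st.st_uid)
```

For example, check_binary could be called on the path of each LSF binary that performs remote execution; which binaries those are depends on the installation.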
If you are using an identification daemon (defined in the lsf.conf file by LSF_AUTH), inetd must be configured to run the daemon. The identification daemon must not be run directly.
If LSF_USE_HOSTEQUIV=Y is defined in the lsf.conf file, check whether /etc/hosts.equiv or $HOME/.rhosts on the destination host contains the client host name. Host names in a name server that are inconsistent with /etc/hosts and /etc/hosts.equiv can also cause this problem.
On SGI hosts running a name server, you can try the following command to tell the host name lookup code to search the /etc/hosts file before calling the name server.
setenv HOSTRESORDER "local,nis,bind"
For Windows hosts, users must register and update their Windows passwords using the lspasswd command. Passwords must be between 3 and 31 characters long.
For Windows password authentication in a non-shared file system environment, you must define the parameter LSF_MASTER_LIST in lsf.conf so that jobs will run with correct permissions. If you do not define this parameter, LSF assumes that the cluster uses a shared file system environment.
A command may fail with the following error message due to a non-uniform file name space.
chdir(...) failed: no such file or directory
You are trying to execute a command remotely, where either your current working directory does not exist on the remote host, or your current working directory is mapped to a different name on the remote host.
If your current working directory does not exist on a remote host, you should not execute commands remotely on that host.
On UNIX, if the directory exists, but is mapped to a different name on the remote host, you have to create symbolic links to make them consistent.
LSF can resolve most, but not all, problems using automount. The automount maps must be managed through NIS. Follow the instructions in your Release Notes for obtaining technical support if you are running automount and LSF is not able to locate directories on remote hosts.
First, check the sbatchd and mbatchd error logs. Try running the following command to check the configuration.
This reports most errors. You should also check if there is any email from LSF in the LSF administrator’s mailbox. If the mbatchd is running but the sbatchd dies on some hosts, it may be because mbatchd has not been configured to use those hosts.
Check whether LIM is running. You can test this by running the lsid command. If LIM is not running properly, follow the suggestions in this chapter to fix the LIM first. You should make sure that all hosts use the same lsf.conf file. Note that it is possible that mbatchd is temporarily unavailable because the master LIM is temporarily unknown, causing the following error message.
Check whether services are registered properly. See Administering Platform LSF for information about registering LSF services.
If you configure a list of server hosts in the Host section of the lsb.hosts file, mbatchd allows sbatchd to run only on the hosts listed. If you try to configure an unknown host as a HOSTS definition for a queue in the lsb.queues file, mbatchd logs the following message.
mbatchd on host: LSB_CONFDIR/cluster/configdir/file(line #): Host hostname is not used by lsbatch; ignored
If you try to configure an unknown host in the HostGroup or HostPartition sections of the lsb.hosts file, you see the same message.
If you start sbatchd on a host that is not known by mbatchd, mbatchd rejects the sbatchd. The sbatchd logs the following message and exits.
This host is not used by lsbatch system.
Both of these errors are most often caused by not running the following commands, in order, after adding a host to the configuration.
lsadmin reconfig
badmin reconfig
You must run both of these before starting the daemons on the new host.
On AIX, if the XPG_SUS_ENV=ON environment variable is set in the user's environment before the process is executed and a process attempts to set the limit lower than current usage, the operation fails with errno set to EINVAL. If the XPG_SUS_ENV environment variable is not set, the operation fails with errno set to EFAULT.
The messages listed in this section may be generated by any LSF daemon.
The daemon could not open the named file for the reason given by error. This error is usually caused by incorrect file permissions or missing files. All directories in the path to the configuration files must have execute (x) permission for the LSF administrator, and the actual files must have read (r) permission. Missing files could be caused by incorrect path names in the lsf.conf file, running LSF daemons on a host where the configuration files have not been installed, or having a symbolic link pointing to a nonexistent file or directory.
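The conditions listed above can be checked with a small script. The following Python sketch is one possible approach; the function name path_problems is hypothetical. Because os.access tests the permissions of the user running the script, it should be run as the LSF administrator to be meaningful.

```python
import os


def path_problems(path):
    """List reasons why the current user might fail to open path for reading."""
    problems = []
    path = os.path.abspath(path)

    # Collect every ancestor directory, from the parent up to the root.
    ancestors = []
    parent = os.path.dirname(path)
    while True:
        ancestors.append(parent)
        up = os.path.dirname(parent)
        if up == parent:
            break
        parent = up

    # Every directory on the way down needs execute (search) permission.
    for d in reversed(ancestors):
        if not os.path.isdir(d):
            problems.append(f"missing directory: {d}")
            break
        if not os.access(d, os.X_OK):
            problems.append(f"no execute permission: {d}")

    # The file itself must exist and be readable.
    if os.path.islink(path) and not os.path.exists(path):
        problems.append(f"dangling symbolic link: {path}")
    elif not os.path.exists(path):
        problems.append(f"missing file: {path}")
    elif not os.access(path, os.R_OK):
        problems.append(f"no read permission: {path}")
    return problems
```

An empty list means none of the listed causes was detected for that path.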
Memory allocation failed. Either the host does not have enough available memory or swap space, or there is an internal error in the daemon. Check the program load and available swap space on the host; if the swap space is full, you must add more swap space or run fewer (or smaller) programs on that host.
auth_user: getservbyname(ident/tcp) failed: error; ident must be registered in services
LSF_AUTH=ident is defined in the lsf.conf file, but the ident/tcp service is not defined in the services database. Add ident/tcp to the services database, or remove LSF_AUTH from the lsf.conf file and setuid root those LSF binaries that require authentication.
auth_user: operation(<host>/<port>) failed: error
LSF_AUTH=ident is defined in the lsf.conf file, but the LSF daemon failed to contact the identd daemon on host. Check that identd is defined in inetd.conf and the identd daemon is running on host.
auth_user: Authentication data format error (rbuf=<data>) from <host>/<port>
auth_user: Authentication port mismatch (...) from <host>/<port>
LSF_AUTH=ident is defined in the lsf.conf file, but there is a protocol error between LSF and the ident daemon on host. Make sure the ident daemon on the host is configured correctly.
LSF_AUTH is not defined, and the LSF daemon received a request that originates from a non-privileged port. The request is not serviced.
Set the LSF binaries (for example, lsrun) to be owned by root with the setuid bit set, or define LSF_AUTH=ident and set up an ident server on all hosts in the cluster. If the binaries are on an NFS-mounted file system, make sure that the file system is not mounted with the nosuid flag.
userok: Forged username suspected from <host>/<port>: <claimed_user>/<actual_user>
The service request claimed to come from user claimed_user but ident authentication returned that the user was actually actual_user. The request was not serviced.
userok: ruserok(<host>,<uid>) failed
LSF_USE_HOSTEQUIV=Y is defined in the lsf.conf file, but host has not been set up as an equivalent host (see /etc/hosts.equiv), and user uid has not set up a .rhosts file.
init_AcceptSock: RES service(res) not registered, exiting
init_AcceptSock: res/tcp: unknown service, exiting
initSock: LIM service not registered.
initSock: Service lim/udp is unknown. Read LSF Guide for help
get_ports: <serv> service not registered
The LSF services are not registered. See Administering Platform LSF for information about configuring LSF services.
init_AcceptSock: Can’t bind daemon socket to port <port>: error, exiting
init_ServSock: Could not bind socket to port <port>: error
These error messages can occur if you try to start a second LSF daemon (for example, RES is already running, and you execute RES again). If this is the case, and you want to start the new daemon, kill the running daemon or use the lsadmin or badmin commands to shut down or restart the daemon.
The messages listed in this section are caused by problems in the LSF configuration files. General errors are listed first, and then errors from specific files.
file(line): Section name expected after Begin; ignoring section
file(line): Invalid section name name; ignoring section
The keyword begin at the specified line is not followed by a section name, or is followed by an unrecognized section name.
file(line): section section: Premature EOF
The end of file was reached before reading the end section line for the named section.
file(line): keyword line format error for section section; Ignore this section
The first line of the section should contain a list of keywords. This error is printed when the keyword line is incorrect or contains an unrecognized keyword.
file(line): values do not match keys for section section; Ignoring line
The number of fields on a line in a configuration section does not match the number of keywords. This may be caused by not putting () in a column to represent the default value.
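The field-count rule can be illustrated with a simplified checker. The Python sketch below is not how LSF parses its files; it assumes sections delimited by Begin and End lines, a keyword line directly after Begin, and values that contain no embedded whitespace. The function name and the sample section contents are hypothetical.

```python
def mismatched_lines(text):
    """Return (line_number, fields_found, keywords_expected) for each bad line."""
    bad = []
    keywords = None
    in_section = False
    for lineno, raw in enumerate(text.splitlines(), start=1):
        fields = raw.split("#", 1)[0].split()  # drop comments, split on whitespace
        if not fields:
            continue
        head = fields[0].lower()
        if head == "begin":
            in_section, keywords = True, None
        elif head == "end":
            in_section = False
        elif in_section and keywords is None:
            keywords = fields  # first line of the section lists the keywords
        elif in_section and len(fields) != len(keywords):
            bad.append((lineno, len(fields), len(keywords)))
    return bad


# Hypothetical section: hostB is missing a value, hostA uses () for a default.
sample = """Begin Host
HOST_NAME MXJ r1m
hostA     ()  3.5
hostB     2
End Host
"""
print(mismatched_lines(sample))
```

In the sample, the hostB line would be reported because it has two fields for three keywords; adding () in the missing column makes the counts match.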
file: HostModel section missing or invalid
file: Resource section missing or invalid
file: HostType section missing or invalid
The HostModel, Resource, or HostType section in the lsf.shared file is either missing or contains an unrecoverable error.
file(line): Name name reserved or previously defined. Ignoring index
The name assigned to an external load index must not be the same as any built-in or previously defined resource or load index.
file(line): Duplicate clustername name in section cluster. Ignoring current line
A cluster name is defined twice in the same lsf.shared file. The second definition is ignored.
file(line): Bad cpuFactor for host model model. Ignoring line
The CPU factor declared for the named host model in the lsf.shared file is not a valid number.
file(line): Too many host models, ignoring model name
You can declare a maximum of 127 host models in the lsf.shared file.
file(line): Resource name name too long in section resource. Should be less than 40 characters. Ignoring line
The maximum length of a resource name is 39 characters. Choose a shorter name for the resource.
file(line): Resource name name reserved or previously defined. Ignoring line.
You have attempted to define a resource name that is reserved by LSF or already defined in the lsf.shared file. Choose another name for the resource.
file(line): illegal character in resource name: name, section resource. Line ignored.
Resource names must begin with a letter in the set [a-zA-Z], followed by letters, digits or underscores [a-zA-Z0-9_].
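The length and character rules for resource names can be expressed compactly. The following Python sketch is illustrative; the function name is hypothetical, and it does not check whether a name is reserved or already defined.

```python
import re

MAX_RESOURCE_NAME_LEN = 39  # names must be shorter than 40 characters
_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def resource_name_problems(name):
    """Return a list of rule violations for a proposed resource name."""
    problems = []
    if len(name) > MAX_RESOURCE_NAME_LEN:
        problems.append("too long")
    if not _NAME_RE.fullmatch(name):
        problems.append("illegal character")
    return problems


print(resource_name_problems("bigmem_2"))
print(resource_name_problems("2fast"))
```

A name passes when the returned list is empty.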
main: LIM cannot run without licenses, exiting
The LSF software license key is not found or has expired. Check that FLEXlm is set up correctly, or contact Platform support at support@platform.com.
main: Received request from unlicensed host <host>/<port>
LIM refuses to service requests from hosts that do not have licenses. Either your LSF license has expired, or you have configured LSF on more hosts than your license key allows.
initLicense: Trying to get license for LIM from source <LSF_CONFDIR/license.dat>
getLicense: Can’t get software license for LIM from license file <LSF_CONFDIR/license.dat>: feature not yet available.
Your LSF license is not yet valid. Check whether the system clock is correct.
findHostbyAddr/<proc>: Host <host>/<port> is unknown by <myhostname>
function: Gethostbyaddr_(<host>/<port>) failed: error
main: Request from unknown host <host>/<port>: error
function: Received request from non-LSF host <host>/<port>
The daemon does not recognize host as a Platform LSF host. The request is not serviced. These messages can occur if host was added to the configuration files, but not all the daemons have been reconfigured to read the new information. If the problem still occurs after reconfiguring all the daemons, check whether the host is a multi-addressed host. See Administering Platform LSF for information about working with multi-addressed hosts.
rcvLoadVector: Sender (<host>/<port>) may have different config?
MasterRegister: Sender (host) may have different config?
LIM detected inconsistent configuration information with the sending LIM. Run the following command so that all the LIMs have the same configuration information.
Note any hosts that failed to be contacted.
rcvLoadVector: Got load from client-only host <host>/<port>. Kill LIM on <host>/<port>
A LIM is running on a Platform LSF client host. Run the following command, or go to the client host and kill the LIM daemon.
saveIndx: Unknown index name <name> from ELIM
LIM received an external load index name that is not defined in the lsf.shared file. If name is defined in lsf.shared, reconfigure the LIM. Otherwise, add name to the lsf.shared file and reconfigure all the LIMs.
saveIndx: ELIM over-riding value of index <name>
This is a warning message. The ELIM sent a value for one of the built-in index names. LIM uses the value from ELIM in place of the value obtained from the kernel.
getusr: Protocol error numIndx not read (cc=num): error
getusr: Protocol error on index number (cc=num): error
Protocol error between ELIM and LIM. See Administering Platform LSF for a description of the ELIM and LIM protocols.
These messages are logged by the RES.
doacceptconn: getpwnam(<username>@<host>/<port>) failed: error
doacceptconn: User <username> has uid <uid1> on client host <host>/<port>, uid <uid2> on RES host; assume bad user
authRequest: username/uid <userName>/<uid>@<host>/<port> does not exist
authRequest: Submitter’s name <clname>@<clhost> is different from name <lname> on this host
RES assumes that a user has the same user ID and user name on all the LSF hosts. These messages occur if this assumption is violated. If the user is allowed to use LSF for interactive remote execution, make sure the user’s account has the same user ID and user name on all LSF hosts.
doacceptconn: root remote execution permission denied
authRequest: root job submission rejected
Root tried to execute or submit a job but LSF_ROOT_REX is not defined in the lsf.conf file.
resControl: operation permission denied, uid = <uid>
The user with user ID uid is not allowed to make RES control requests. Only the LSF administrator, or root if LSF_ROOT_REX is defined in lsf.conf, can make RES control requests.
resControl: access(respath, X_OK): error
The RES received a reboot request, but failed to find the file respath to re-execute itself. Make sure respath contains the RES binary, and it has execution permission.
The following messages are logged by the mbatchd and sbatchd daemons:
renewJob: Job <jobId>: rename(<from>,<to>) failed: error
mbatchd failed to resubmit a rerunnable job. Check that the file from exists and that the LSF administrator can rename the file. If from is in an AFS directory, check that the LSF administrator’s token processing is properly set up.
See Administering Platform LSF for information about installing on AFS.
logJobInfo_: fopen(<logdir/info/jobfile>) failed: error
logJobInfo_: write <logdir/info/jobfile> <data> failed: error
logJobInfo_: seek <logdir/info/jobfile> failed: error
logJobInfo_: write <logdir/info/jobfile> xdrpos <pos> failed: error
logJobInfo_: write <logdir/info/jobfile> xdr buf len <len> failed: error
logJobInfo_: close(<logdir/info/jobfile>) failed: error
rmLogJobInfo: Job <jobId>: can’t unlink(<logdir/info/jobfile>): error
rmLogJobInfo_: Job <jobId>: can’t stat(<logdir/info/jobfile>): error
readLogJobInfo: Job <jobId> can’t open(<logdir/info/jobfile>): error
start_job: Job <jobId>: readLogJobInfo failed: error
readLogJobInfo: Job <jobId>: can’t read(<logdir/info/jobfile>) size size: error
initLog: mkdir(<logdir/info>) failed: error
<fname>: fopen(<logdir/file> failed: error
getElogLock: Can’t open existing lock file <logdir/file>: error
getElogLock: Error in opening lock file <logdir/file>: error
releaseElogLock: unlink(<logdir/lockfile>) failed: error
touchElogLock: Failed to open lock file <logdir/file>: error
touchElogLock: close <logdir/file> failed: error
mbatchd failed to create, remove, read, or write the log directory or a file in the log directory, for the reason given in error. Check that LSF administrator has read, write, and execute permissions on the logdir directory.
If logdir is on AFS, check that the instructions in Administering Platform LSF have been followed. Use the fs ls command to verify that the LSF administrator owns logdir and that the directory has the correct ACL.
replay_newjob: File <logfile> at line <line>: Queue <queue> not found, saving to queue <lost_and_found>
replay_switchjob: File <logfile> at line <line>: Destination queue <queue> not found, switching to queue <lost_and_found>
When mbatchd was reconfigured, jobs were found in queue but that queue is no longer in the configuration.
replay_startjob: JobId <jobId>: exec host <host> not found, saving to host <lost_and_found>
When mbatchd was reconfigured, the event log contained jobs dispatched to host, but that host is no longer configured to be used by LSF.
do_restartReq: Failed to get hData of host <host_name>/<host_addr>
mbatchd received a request from sbatchd on host host_name, but that host is not known to mbatchd. Either the configuration file has been changed but mbatchd has not been reconfigured to pick up the new configuration, or host_name is a client host but the sbatchd daemon is running on that host. Run the following command to reconfigure the mbatchd or kill the sbatchd daemon on host_name.
LSF daemon (LIM) not responding ... still trying
During LIM restart, LSF commands fail and display this error message. User programs linked to the LIM API fail for the same reason. The message is displayed when a LIM on a host in the master host list or server host list is restarted after a configuration change, such as adding new resources or upgrading binaries.
Use LSF_LIM_API_NTRIES in lsf.conf or as an environment variable to define how many times LSF commands retry communication with the LIM API while LIM is not available. LSF_LIM_API_NTRIES is ignored by LSF and EGO daemons and all EGO commands.
When LSB_API_VERBOSE=Y in lsf.conf, LSF batch commands will display the not responding retry error message to stderr when LIM is not available.
When LSB_API_VERBOSE=N in lsf.conf, LSF batch commands will not display the retry error message when LIM is not available.
When mbatchd and batch commands fail reading lsb.events, LSF logs and displays details about which event field was reached when parsing of the event file failed.
event time_stamp offset[byte:field]: field_name [field_name ...]
Dec 28 14:25:30 2008 9861 3 7.02 init_log: Reading event file </home/user1/LSF7/work/LSF7/logdir/lsb.events>: Bad event format at line 15: JOB_NEW 1198869866 offset[28:3]: First 10 fields: jobId userId options numProcessors subTime beginTime termTime sigValue chkpntPeriod restartPid
Dec 28 14:25:30 2008 9861 3 7.02 init_log: Reading event file </home/user1/LSF7/work/LSF7/logdir/lsb.events>: Bad event format at line 16: bad eventVersion
Dec 28 14:25:30 2008 9861 3 7.02 switch_log(): reading event file </home/user1/LSF7/work/LSF7/logdir/lsb.events>: Bad event format at line 15: JOB_NEW 1198869866 offset[28:2]: First 10 fields: jobId userId options numProcessors subTime beginTime termTime sigValue chkpntPeriod restartPid
Dec 28 14:25:30 2008 9861 3 7.02 switch_log(): reading event file </home/user1/LSF7/work/LSF7/logdir/lsb.events>: Bad event format at line 16: bad eventVersion
bhist -l 309
Dec 28 16:04:53 2008 8146 3 7.02 File /home/user1/LSF7/work/LSF7/logdir/lsb.events: Bad event format at line 20: JOB_EXECUTE 1198888660 offset[48:1]: execCwd
badmin mbdhist
File /home/user1/LSF7/work/LSF7/logdir/lsb.events: Bad event format at line 19: JOB_EXECUTE 1198888660 offset[48:1]: execCwd
File /home/user1/LSF7/work/LSF7/logdir/lsb.events: Bad event format at line 20: bad eventVersion
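When many such messages have to be reviewed, it can help to extract the event name, time stamp, and offset automatically. The Python sketch below is based only on the message layout shown in the examples above; the function name and the regular expression are assumptions, not part of LSF.

```python
import re

# Matches "... at line N: EVENT TIMESTAMP offset[BYTE:FIELD]: remainder".
_EVENT_RE = re.compile(
    r"at line (\d+): (\w+) (\d+)\s*offset\[(\d+):(\d+)\]:\s*(.*)"
)


def parse_bad_event(message):
    """Return the parts of a 'Bad event format' message, or None if no offset is present."""
    m = _EVENT_RE.search(message)
    if m is None:
        return None
    line, event, stamp, byte, field, rest = m.groups()
    return {
        "line": int(line),          # line number in lsb.events
        "event": event,             # event type, for example JOB_EXECUTE
        "time_stamp": int(stamp),   # event time stamp
        "byte": int(byte),          # byte part of offset[byte:field]
        "field": int(field),        # field part of offset[byte:field]
        "fields": rest,             # field name text that follows the offset
    }


msg = ("File /home/user1/LSF7/work/LSF7/logdir/lsb.events: Bad event format "
       "at line 20: JOB_EXECUTE 1198888660 offset[48:1]: execCwd")
print(parse_bad_event(msg))
```

Messages without an offset, such as the bad eventVersion lines, return None and can be handled separately.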