- About Platform LSF HPC and MPICH2
- Configuring LSF HPC to Work with MPICH2
- Building Parallel Jobs
- Submitting MPICH2 Jobs
About Platform LSF HPC and MPICH2
MPICH is a freely available, portable implementation of the MPI standard for message-passing libraries, developed jointly by Argonne National Laboratory and Mississippi State University. MPICH is designed to provide a high-performance, portable, and convenient programming environment. MPICH2 implements both MPI-1 and MPI-2.
The mpiexec command of MPICH2 spawns all tasks, while LSF HPC retains full control over the tasks spawned. Specifically, LSF HPC collects rusage information, performs job control (signal), and cleans up after the job is finished. Jobs run within the LSF allocation and are controlled by LSF HPC.
Requirements
Assumptions and limitations
- MPICH2 is installed and configured correctly
- The user's current working directory is part of a shared file system reachable by all hosts
- Currently, mpiexec -file filename (XML job description) is not supported.
Glossary
MPI: (Message Passing Interface) A message passing standard. It defines a message passing API useful for parallel and distributed applications.
MPICH: A portable implementation of the MPI standard.
MPICH2: An MPI implementation that implements both MPI-1 and MPI-2.
PAM: (Parallel Application Manager) The supervisor of any parallel job.
PJL: (Parallel Job Launcher) Any executable script or binary capable of starting parallel tasks on all hosts assigned for a parallel job.
RES: (Remote Execution Server) An LSF daemon residing on each host. It monitors and manages all LSF tasks on the host.
TS: (TaskStarter) An executable responsible for starting a task on the local host and reporting the process ID and host name to the PAM.
For more information
See the Mathematics and Computer Science Division (MCS) of Argonne National Laboratory (ANL) MPICH Web page at www-unix.mcs.anl.gov/mpi/mpich/ for more information about MPICH and MPICH2.
Files installed by lsfinstall
During installation, lsfinstall copies these files to the following directories:

These files...     Are installed to...
TaskStarter        LSF_BINDIR
pam                LSF_BINDIR
esub.mpich2        LSF_SERVERDIR
mpich2_wrapper     LSF_BINDIR
mpirun.lsf         LSF_BINDIR
pjllib.sh          LSF_BINDIR
Resources and parameters configured by lsfinstall
- External resources in lsf.shared:

  Begin Resource
  RESOURCE_NAME   TYPE      INTERVAL   INCREASING   DESCRIPTION
  ...
  mpich2          Boolean   ()         ()           (MPICH2 MPI)
  ...
  End Resource

  The mpich2 Boolean resource is used for mapping hosts with MPICH2 available.

  You should add the mpich2 resource name under the RESOURCES column of the Host section of lsf.cluster.cluster_name, as sketched after this list.
- Parameter in lsf.conf:

  LSB_SUB_COMMANDNAME=y
Configuring LSF HPC to Work with MPICH2
- Make sure MPICH2 commands are in the PATH environment variable. MPICH2 commands include mpiexec, mpd, mpdboot, mpdtrace, and mpdexit. (A sketch of a typical environment setup appears after these steps.)

  For example:

  [174]- which mpiexec
  /pcc/app/mpich2/kernel2.4-glibc2.3-x86/bin/mpiexec

- Define the mpich2 resource in lsf.shared and add it to the RESOURCES column of the Host section of lsf.cluster.cluster_name.

  For example:

  ...
  hmmer     Boolean   ()   ()   (hmmer availability)
  lammpi    Boolean   ()   ()   (lam-mpi available host)
  mpich2    Boolean   ()   ()   (mpich2 available host)   <====
  End Resource

  Begin Host
  HOSTNAME      model   type   server   r1m   mem   swp   RESOURCES   #Keywords
  qat20         !       !      1        3.5   ()    ()    (mpich2)
  qat21         !       !      1        3.5   ()    ()    (mpich2)
  qat22         !       !      1        3.5   ()    ()    (mpich2)
  End Host
- Run lsadmin reconfig and badmin mbdrestart as root.

- Run lshosts to confirm that the mpich2 resource is configured on all hosts on which you want to run MPICH2 parallel jobs.

  For example:

  [173]- lshosts
  HOST_NAME    type     model     cpuf  ncpus  maxmem  maxswp  server  RESOURCES
  qat20        LINUX86  PC1133    23.1  1      310M    -       Yes     (mpich2)
  qat21.lsf.p  LINUX86  PC1133    23.1  1      311M    635M    Yes     (mpich2)
  qat22.lsf.p  UNKNOWN  UNKNOWN_  1.0   -      -       -       Yes     (mpich2)

- To use a root MPD ring, start the MPD ring as root on all hosts and add MPD_USE_ROOT_MPD=Y to each user's $HOME/.mpd.conf file.

  For example:

  [root@qat20 test]# mpdtrace -l
  qat20_37272
  qat21_52535

  [61]- cat .mpd.conf
  MPD_USE_ROOT_MPD=Y   <==========
  secretword=123579a
- Make sure $HOME/.mpd.conf has a permission mode of 600 after you finish the modification.
- Set LSF_START_MPD_RING=N in your job script or in the environment for all users.
- If you want to start an MPD ring on all hosts, follow the steps described in the MPICH2 documentation to start an MPD ring across all LSF hosts for each user. The user MPD ring must be running all the time, and you must set LSF_START_MPD_RING=N in your job script or in the environment for all users. Do not run mpdallexit or mpdcleanup to terminate the MPD ring.
- Make sure LSF uses the official system host names (/etc/hosts); this prevents problems when you run the application.

  For example:

  172.25.238.91   scali    scali.lsf.platform.com
  172.25.238.96   scali1   scali1.lsf.platform.com

- Edit the $LSF_BINDIR/mpich2_wrapper script to make sure MPI_TOPDIR= points to the MPICH2 installation directory.
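The exact commands depend on how MPICH2 is installed at your site; the following is a minimal, illustrative shell sketch of the manual steps above. The install prefix, the mpd.hosts file name, and the ring size of 3 are assumptions, not values required by LSF:

  # Assumed MPICH2 install prefix; substitute your own path
  # (and use the same path for MPI_TOPDIR= in mpich2_wrapper).
  MPICH2_TOP=/pcc/app/mpich2/kernel2.4-glibc2.3-x86

  # Put the MPICH2 commands (mpiexec, mpd, mpdboot, mpdtrace, mpdexit) on the PATH.
  export PATH=$MPICH2_TOP/bin:$PATH

  # Protect the per-user MPD configuration file.
  chmod 600 $HOME/.mpd.conf

  # If you maintain a per-user MPD ring yourself, start it across the LSF hosts
  # listed in mpd.hosts (one host per line), verify it, and leave it running.
  mpdboot -n 3 -f mpd.hosts
  mpdtrace -l

  # Tell LSF not to start its own MPD ring for each job.
  export LSF_START_MPD_RING=N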
Building Parallel Jobs
- Use mpicc -o to compile your source code.

  For example:

  [178]- which mpicc
  /pcc/app/mpich2/kernel2.4-glibc2.3-x86/bin/mpicc
  [179]- mpicc -o hw.mpich2 hw.c
- Make sure the compiled binary can run under the root MPD ring outside Platform LSF HPC.

  For example:

  [180]- mpiexec -np 2 hw.mpich2
  Process 0 is printing on qat21 (pid =16160): Greetings from process 1 from qat20 pid 24787!
Submitting MPICH2 Jobs
bsub command
Use the bsub command to submit MPICH2 jobs:

bsub <bsub_options> -n <number_of_CPUs> -a mpich2 mpirun.lsf <mpiexec_options> job <job_options>
Note that the -np option of mpiexec is ignored.
For example:

bsub -I -n 8 -R "span[ptile=4]" -a mpich2 -W 2 mpirun.lsf -np 3 ./hw.mpich2

Or use a job script:

#!/bin/sh
#BSUB -n 8
#BSUB -a mpich2

mpirun.lsf ./hw.mpich2

The mpich2_wrapper script supports almost all of the original mpiexec options except those that affect job scheduling decisions, for example -np (-n). The mpiexec -n syntax is still accepted; if you use the -n option, you must either request enough CPUs when the job is submitted or set the environment variable LSB_PJL_TASK_GEOMETRY. See Running Jobs with Task Geometry for detailed usage of LSB_PJL_TASK_GEOMETRY.

Task geometry with MPICH2 jobs
MPICH2 mpirun requires either that the first task runs on the local node or that all tasks run on remote nodes (-nolocal). If the LSB_PJL_TASK_GEOMETRY environment variable is set, mpirun.lsf makes sure the task group that contains task 0 in LSB_PJL_TASK_GEOMETRY runs on the first node.

The LSB_PJL_TASK_GEOMETRY environment variable is checked for all parallel jobs. If LSB_PJL_TASK_GEOMETRY is set and users submit a parallel job (a job that requests more than one slot), LSF attempts to shape LSB_MCPU_HOSTS accordingly.
Date Modified: March 13, 2009
Platform Computing: www.platform.com
Platform Support: support@platform.com
Platform Information Development: doc@platform.com
Copyright © 1994-2009 Platform Computing Corporation. All rights reserved.