When used with the ch_p4 device, the mpirun program uses a file called a machines file to list the machines or nodes that are available for running mpich programs.
The easiest way to create a machines file is to edit the file
mpich/util/machines/machines.xxxx
to contain names of machines of architecture xxxx. The xxxx
matches the arch given when mpich was configured. Then whenever
mpirun is executed, the required number of hosts will be selected from
this file for the run. (There is no fancy scheduling; the hosts are selected
starting from the top.) To run all your MPI processes on a single
workstation, just make all the lines in the file the same.
A sample machines.solaris file might look like:
mercury
venus
earth
mars
earth
mars
The names should be provided in the same format as is output by the hostname command. For example, if the result of hostname on earth is earth.my.edu (and similarly for the other names), then the machines file should be
mercury.my.edu
venus.my.edu
earth.my.edu
mars.my.edu
earth.my.edu
mars.my.edu
For nodes that contain multiple processors, indicate the number of processors by following the name with a colon and the number of processors. For example, if mars in the previous example had two processors, then the machines file should be
mercury
venus
earth
mars:2
earth
mars:2
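
To make the file format concrete, here is a minimal Python sketch that reads machines-file text and picks hosts from the top, as described above. It is not mpirun's own code. The function names are hypothetical, and three behaviours are assumptions of the sketch rather than documented facts: that blank lines and lines starting with # are ignored, that a name:n entry offers n consecutive slots, and that asking for more processes than there are slots is treated as an error.

```python
def parse_machines(text):
    """Return a list of (hostname, processors) pairs from machines-file text."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and '#' comments are skipped
        name, sep, count = line.partition(":")
        entries.append((name, int(count) if sep else 1))
    return entries


def select_hosts(entries, nprocs):
    """Pick hosts for nprocs processes, starting from the top of the file."""
    # Assumption: a host listed as name:n offers n consecutive slots.
    slots = [name for name, count in entries for _ in range(count)]
    if nprocs > len(slots):
        # Choice made by this sketch only; mpirun may behave differently.
        raise ValueError("not enough entries in the machines file")
    return slots[:nprocs]


sample = """mercury
venus
earth
mars:2
earth
mars:2
"""

if __name__ == "__main__":
    print(select_hosts(parse_machines(sample), 5))
```

With the sample file and five processes, the sketch selects mercury, venus, earth, and then mars twice.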
When using the ch_p4 device, it is possible to speed up the process of starting jobs by using the secure server. The secure server is a program that runs on the machines listed in the machines.xxxx file (where xxxx is the name of the machine architecture) and that allows programs to start faster. There are two ways to install this program: so that only one user may use it, or so that all users may use it. No special privileges are required to install the secure server for a single user's use. However, the secure server does use the Unix ruserok routine, which usually requires a .rhosts or hosts.equiv file. See the documentation on your system for ruserok to make sure that your environment will support the ruserok routine.
To use the secure server, follow these steps:
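1. Choose a port for the secure server. The rpcinfo command lists the ports registered with the portmapper on a host; for example, for host mysun, try
rpcinfo -p mysun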
2. Start the secure server.
The script sbin/chp4_servs can be used to start the secure servers:
sbin/chp4_servs -port=n -arch=$ARCH
This makes use of the remote shell command (rsh, remsh, or ssh) to start the servers; if you cannot use the remote shell command, you will need to log into each system on which you want to start the secure server and start the server manually. The command to start an individual server using port 2345 is
serv_p4 -o -p 2345 &
For example, if you had chosen a port number of 2345 and were using Solaris, then you would give the command
sbin/chp4_servs -port=2345 -arch=solaris
The server will keep a log of its activities in a file named Secure_Server.Log.xxxx in the current directory, where xxxx is the process id of the process that started the server (note that the server may be running as a child of that initial process).
3. To make use of the secure servers using the ch_p4 device, you
must inform mpirun of the
port number. You can do this in two ways. The first is to give the
-p4ssport n option to mpirun. For example, if the port is
2345 and you want to run cpi on four processors, use
mpirun -np 4 -p4ssport 2345 cpi
The other way to inform mpirun of the secure server is to use the environment variables MPI_USEP4SSPORT and MPI_P4SSPORT. In the C-shell, you can set these with
setenv MPI_USEP4SSPORT yes
setenv MPI_P4SSPORT 2345
The value of MPI_P4SSPORT must be the port with which you started the secure servers. When these environment variables are set, no extra options are needed with mpirun.
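
The steps above can be sketched as a few Python helpers, for instance when mpirun is launched from a script. This is only an illustration: the function names are hypothetical, the command lines simply mirror the forms shown in the text, and the port check is an assumption about what "not in use" means (it tests only whether a TCP socket can be bound on the local machine, so it says nothing about the other hosts).

```python
import os
import socket


def port_is_free(port, host=""):
    """Return True if a TCP socket can be bound to port on this machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def server_command(port, arch):
    """Command line for step 2, in the form shown in the text."""
    return ["sbin/chp4_servs", f"-port={port}", f"-arch={arch}"]


def mpirun_command(nprocs, program, port):
    """Command line for step 3 using the -p4ssport option."""
    return ["mpirun", "-np", str(nprocs), "-p4ssport", str(port), program]


def mpirun_environment(port, base=None):
    """Copy of the environment with the two secure-server variables set."""
    env = dict(os.environ if base is None else base)
    env["MPI_USEP4SSPORT"] = "yes"
    env["MPI_P4SSPORT"] = str(port)
    return env
```

Either `subprocess.run(mpirun_command(4, "cpi", 2345))` or `subprocess.run(["mpirun", "-np", "4", "cpi"], env=mpirun_environment(2345))` would then correspond to the two ways of informing mpirun of the port.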
To stop the servers, their processes must be killed. This is easily done with
the Scalable Unix Tools [(ref Gropp:1994:SUT)] with the command
ptfps -all -tn serv_p4 -and -o $LOGNAME -kill INT
Alternatively, you can log into each system, find the server processes, and kill them. To find them, execute something like
ps auxww | egrep "$LOGNAME.*serv_p4"
if using a BSD-style ps, or
ps -flu $LOGNAME | egrep 'serv_p4'
if using a System V-style ps. The System V style will work only if the command name is short; the System V ps only gives you the first 80 characters of the command name, and if it was started with a long (but valid) directory path, the name of the command may have been lost.
An alternative approach is discussed in the section Managing the servers.
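
The manual ps-based procedure can also be scripted. The following Python sketch filters BSD-style `ps auxww` output for one user's serv_p4 processes and, optionally, sends them the same INT signal used in the ptfps example. The function names and the sample output are made up for illustration, and the parser assumes the common eleven-column layout (USER, PID, ..., COMMAND), which may differ on your system.

```python
import os
import signal


def find_server_pids(ps_output, user, command="serv_p4"):
    """Return PIDs from BSD-style `ps auxww` output that belong to user
    and whose program name (last path component) equals command."""
    pids = []
    for line in ps_output.splitlines():
        # Assumed columns: USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[0] != user:
            continue
        program = fields[10].split()[0]
        # Comparing the program name keeps the egrep process itself out.
        if os.path.basename(program) == command:
            pids.append(int(fields[1]))
    return pids


def stop_servers(pids):
    """Send SIGINT to each PID on the local machine."""
    for pid in pids:
        os.kill(pid, signal.SIGINT)


# Made-up sample output, for illustration only.
sample = """USER   PID %CPU %MEM  VSZ  RSS TT STAT STARTED  TIME COMMAND
alice  412  0.0  0.1 1234  456 ?  S    09:00    0:00 /home/alice/mpich/bin/serv_p4 -o -p 2345
bob    500  0.0  0.1 1234  456 ?  S    09:01    0:00 /home/bob/mpich/bin/serv_p4 -o -p 2345
alice  777  0.0  0.1 1234  456 ?  S    09:02    0:00 egrep alice.*serv_p4
"""
```

Because the whole command is kept in one field, this approach has the same limitation as the System V case if ps truncates a long path.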
An experimental perl5 program is provided to help you manage the p4 secure servers. This program is chkserv, and is installed in the sbin directory. You can use this program to check that your servers are running, start up new servers, or stop servers that are running.
Before using this script, you may need to edit it; check that it has appropriate values for serv_p4, portnum, and machinelist; you may also need to set the first line to your version of perl5.
To check on the status of your servers, use
chkserv -port 2345
To restart any servers that have stopped, use
chkserv -port 2345 -restart
This does not restart servers that are already running; you can use this as a cron job every morning to make sure that your servers are running. Note that this uses the same remote shell command that configure found; if you can't use that remote shell command to start the process on the remote systems, you'll need to restart the servers by hand. In that case, you can use the output from chkserv -port 2345 to see which servers need to be restarted.
To stop the servers, use
chkserv -port 2345 -kill
This contacts all running servers and tells them to exit. It does not use rsh, and can be used on any system.
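
If chkserv cannot be used on your system, a rough substitute for the status check is to test whether anything accepts TCP connections on the chosen port. The Python sketch below does only that; it does not reproduce chkserv, it cannot tell a secure server from any other program listening on the port, and the server may record the probe in its log. The function names are hypothetical.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_hosts(hosts, port, timeout=2.0):
    """Return the hosts on which nothing accepted a connection on port."""
    return [h for h in hosts if not is_listening(h, port, timeout)]
```

For example, `unreachable_hosts(["mercury", "venus", "earth", "mars"], 2345)` would list the machines on which a server probably needs to be restarted by hand.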
This software is experimental. If you have comments or suggestions, please send them to [email protected].