Setting up mpirun for the ch_p4 device



When used with the ch_p4 device, the mpirun program uses a file called a machines file to list the machines or nodes that are available for running mpich programs.





The Machines File



The easiest way to create a machines file is to edit the file mpich/util/machines/machines.xxxx so that it contains the names of the machines of architecture xxxx, where xxxx matches the arch given when mpich was configured. Whenever mpirun is executed, the required number of hosts is selected from this file for the run. (There is no fancy scheduling; the hosts are selected starting from the top.) To run all your MPI processes on a single workstation, just make all the lines in the file the same. A sample machines.solaris file might look like:

    mercury 
    venus 
    earth 
    mars 
    earth 
    mars 
The names should be provided in the same format as is output by the hostname command. For example, if the result of hostname on earth is earth.my.edu (and similarly for the other names), then the machines file should be
    mercury.my.edu 
    venus.my.edu 
    earth.my.edu 
    mars.my.edu 
    earth.my.edu 
    mars.my.edu 
For nodes that contain multiple processors, indicate the number of processors by following the name with a colon and the number of processors. For example, if mars in the previous example had two processors, then the machines file should be
    mercury 
    venus 
    earth 
    mars:2 
    earth 
    mars:2 
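
Before choosing a value for -np, it can help to know how many processor slots a machines file describes. The following is a minimal sketch and is not part of mpich; the function name count_slots is hypothetical, and it assumes that a line without a colon counts as one processor and that blank lines should be ignored.

```shell
# Sum the processor slots listed in a machines file.
# A line "host" counts as 1 slot; a line "host:n" counts as n slots.
# Blank or whitespace-only lines are skipped (an assumption).
count_slots() {
    awk -F: '$1 !~ /^[ \t]*$/ { total += ($2 == "" ? 1 : $2 + 0) }
             END { print total + 0 }' "$1"
}
```

Running count_slots on the last example above (with mars:2 listed twice) prints 8.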





Faster job startup for the ch_p4 device



When using the ch_p4 device, it is possible to speed up the process of starting jobs by using the secure server. The secure server is a program that runs on the machines listed in the machines.xxxx file (where xxxx is the name of the machine architecture) and that allows programs to start faster. There are two ways to install this program: so that only one user may use it, or so that all users may use it. No special privileges are required to install the secure server for a single user's use. However, the secure server does use the Unix ruserok routine, which usually requires a .rhosts or hosts.equiv file. See the documentation on your system for ruserok to make sure that your environment will support the ruserok routine.

To use the secure server, follow these steps:

    1. Choose a port. This is a number that you will use to identify the secure server (different port numbers may be used to allow multiple secure servers to operate). A good choice is a number over 1000. If you pick a number that is already being used, the server will exit, and you'll have to pick another number. On many systems, you can use the rpcinfo command to find out which ports are in use (or reserved). For example, to find the ports in use on host mysun, try
    rpcinfo -p mysun 
    


    2. Start the secure server. The script sbin/chp4_servs can be used to start the secure servers:

        sbin/chp4_servs -port=n -arch=$ARCH 
    
    For example, if you had chosen a port number of 2345 and were using Solaris, then you would give the command
        sbin/chp4_servs -port=2345 -arch=solaris 
    
    This script makes use of the remote shell command (rsh, remsh, or ssh) to start the servers; if you cannot use the remote shell command, you will need to log into each system on which you want to start the secure server and start the server manually. The command to start an individual server using port 2345 is
        serv_p4 -o -p 2345 & 
    
    The server will keep a log of its activities in a file named Secure_Server.Log.xxxx in the current directory, where xxxx is the process id of the process that started the server (note that the server may be running as a child of that initial process).


    3. To make use of the secure servers using the ch_p4 device, you must inform mpirun of the port number. You can do this in two ways. The first is to give the -p4ssport n option to mpirun. For example, if the port is 2345 and you want to run cpi on four processors, use

      mpirun -np 4 -p4ssport 2345 cpi 
    
    The other way to inform mpirun of the secure server is to use the environment variables MPI_USEP4SSPORT and MPI_P4SSPORT. In the C-shell, you can set these with
        setenv MPI_USEP4SSPORT yes 
        setenv MPI_P4SSPORT 2345 
    
    The value of MPI_P4SSPORT must be the port with which you started the secure servers. When these environment variables are set, no extra options are needed with mpirun.
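
If your login shell is a Bourne-style shell (sh, ksh, or bash) rather than the C-shell, the setenv lines in step 3 do not apply directly. A sketch of the equivalent settings follows; 2345 is only the example port used in the text.

```shell
# Bourne-shell equivalent of the C-shell setenv lines in step 3.
# 2345 is the example port; use the port your servers were started with.
MPI_USEP4SSPORT=yes
MPI_P4SSPORT=2345
export MPI_USEP4SSPORT MPI_P4SSPORT

# With both variables exported, the run from step 3 needs no -p4ssport option:
#   mpirun -np 4 cpi
```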

Note that when mpich is installed, the secure server and the startup commands are copied into the bin directory so that users may start their own copies of the server. This is discussed in the Users Guide.
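
Anyone starting a personal copy of the server has to pick a free port as in step 1. The check against rpcinfo can be scripted; the sketch below is not part of mpich, the function name port_listed is hypothetical, and it assumes the usual rpcinfo -p column layout (program, vers, proto, port, service). Because rpcinfo reports only ports registered with the portmapper, a port that is absent from the listing may still be in use; the server exiting at startup remains the definitive sign.

```shell
# Succeed (exit status 0) if port $1 appears in "rpcinfo -p" output on stdin.
# Assumes the column layout: program vers proto port service,
# with a one-line header that is skipped.
port_listed() {
    awk -v port="$1" 'NR > 1 && $4 == port { found = 1 }
                      END { exit !found }'
}

# Hypothetical usage for the host mysun from step 1:
#   if rpcinfo -p mysun | port_listed 2345; then
#       echo "port 2345 is already registered; pick another"
#   fi
```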





Stopping the P4 servers



To stop the servers, their processes must be killed. This is easily done with the Scalable Unix Tools [(ref Gropp:1994:SUT)] with the command

    ptfps -all -tn serv_p4 -and -o $LOGNAME -kill INT 
Alternatively, you can log into each system and execute something like
    ps auxww | egrep "$LOGNAME.*serv_p4" 
if using a BSD-style ps, or
    ps -flu $LOGNAME | egrep 'serv_p4' 
if using a System V-style ps. The System V style will work only if the command name is short; the System V ps only gives you the first 80 characters of the command name, and if it was started with a long (but valid) directory path, the name of the command may have been lost.

An alternative approach is discussed in the section Managing the servers.
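
When you do log into each system by hand, a list of the distinct hosts in the machines file is a convenient starting point. The following is a sketch only; machine_hosts is a hypothetical helper, not an mpich command, and the loop in the comment assumes that rsh works to every host.

```shell
# Print each distinct host named in a machines file, one per line,
# dropping any ":n" processor count and surrounding whitespace.
machine_hosts() {
    awk -F: '{ gsub(/[ \t]/, "", $1) }
             $1 != "" && !seen[$1]++ { print $1 }' "$1"
}

# Hypothetical usage (System V-style ps, as in the text above):
#   for h in $(machine_hosts machines.solaris); do
#       echo "== $h"
#       rsh "$h" "ps -flu $LOGNAME" | egrep 'serv_p4'
#   done
```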





Managing the servers



An experimental perl5 program is provided to help you manage the p4 secure servers. This program is chkserv, and is installed in the sbin directory. You can use this program to check that your servers are running, start up new servers, or stop servers that are running.

Before using this script, you may need to edit it; check that it has appropriate values for serv_p4, portnum, and machinelist; you may also need to set the first line to your version of perl5.

To check on the status of your servers, use

     chkserv -port 2345  
To restart any servers that have stopped, use
     chkserv -port 2345 -restart 
This does not restart servers that are already running; you can run this as a cron job every morning to make sure that your servers are running. Note that this uses the same remote shell command that configure found; if you can't use that remote shell command to start the process on the remote systems, you'll need to restart the servers by hand. In that case, you can use the output from chkserv -port 2345 to see which servers need to be restarted.
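
The morning cron job mentioned above could be set up along the following lines. This is a sketch under assumptions: the function name chkserv_cron_line, the 06:00 schedule, and the directory /usr/local/mpich/sbin are all placeholders, and reading a crontab from standard input with "crontab -" is supported by many, but not all, cron implementations.

```shell
# Emit a crontab line that runs "chkserv -port <port> -restart" daily at 06:00.
# $1 = port number, $2 = directory holding chkserv (placeholder path).
chkserv_cron_line() {
    printf '0 6 * * * %s/chkserv -port %s -restart\n' "$2" "$1"
}

# Hypothetical usage: append the line to your existing crontab.
#   ( crontab -l; chkserv_cron_line 2345 /usr/local/mpich/sbin ) | crontab -
```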


To stop all running servers, use
     chkserv -port 2345 -kill  
This contacts all running servers and tells them to exit. It does not use rsh, and can be used on any system.

This software is experimental. If you have comments or suggestions, please send them to [email protected].


