2.3. Using the High-Performance MPD Job Launcher

MPD is a new high-performance job launcher developed by Argonne National Laboratory, the makers of MPICH. It serves as a drop-in replacement for mpirun and can launch both MPI and non-MPI parallel applications.

Advantages of MPD:

Drawbacks of MPD:

2.3.1. Using MPD for MPI applications

To launch interactive and batch MPI applications, you must compile your program with the MPD version of the MPICH library. This library is identical to regular MPICH and supports the same interface. It is located in:

/opt/mpich-mpd/gnu/lib

Once your executable has been compiled with the MPD libraries, use the mpirun from:

/opt/mpich-mpd/gnu/bin/mpirun

This MPD version of mpirun operates very similarly to older versions. See the manpage or use --help for full details. You will be able to control all nodes running your job from the console node (usually the frontend), including sending signals to your parallel job.
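As a rough illustration of the compile-and-launch steps above, the sketch below wraps them in one shell function. It makes several assumptions: that an mpicc compiler wrapper is installed next to mpirun in /opt/mpich-mpd/gnu/bin, that this mpirun accepts the customary -np process-count option, and that hello.c is a hypothetical source file of your own. Check the manpage or --help before relying on any of it.

```shell
# Prefix taken from the library and mpirun paths given above.
MPD_HOME=/opt/mpich-mpd/gnu

# build_and_run SOURCE.c NPROCS
# Assumption: mpicc lives in $MPD_HOME/bin alongside mpirun.
# Assumption: mpirun takes the customary -np option.
build_and_run() {
    src=$1
    nprocs=$2
    exe=${src%.c}                      # hello.c -> hello

    # Compile against the MPD version of the MPICH library.
    "$MPD_HOME/bin/mpicc" -o "$exe" "$src" || return 1

    # Launch with the MPD version of mpirun, not an older one on the PATH.
    "$MPD_HOME/bin/mpirun" -np "$nprocs" "./$exe"
}

# Example (hypothetical program and process count):
#   build_and_run hello.c 4
```

Calling the tools by full path avoids accidentally picking up a non-MPD mpirun that appears earlier in your PATH.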

See earlier sections in this chapter to find instructions for using mpirun in batch and interactive settings.

2.3.2. The MPD Ring and Troubleshooting

For the MPD job launcher to work correctly, a ring of daemons must be threaded through every node in the cluster. The ring is defined as a set of connected mpd daemons, such that each daemon has an open TCP connection to each of its two nearest neighbors. The order of nodes in the ring does not matter, and a new node can enter the ring at any place. Rocks uses a distributed agreement protocol to construct and maintain this ring through node failures and additions. This ring maintainer is called KAgreement-mpd, after a well-known problem in distributed systems.

If one daemon in the ring dies, the remaining MPD nodes will reknit the ring around it. The KAgreement-mpd protocol will restart the dead daemon as soon as possible, enabling it to rejoin the ring.
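The ring and its repair can be pictured with a toy model. The functions below are not MPD or KAgreement-mpd code; they only treat the ring as an ordered list of host names in which each node's neighbors are the entries before and after it, wrapping around at the ends. The host names are hypothetical.

```shell
# Toy model only: the ring is the list of arguments after the first one.

# ring_neighbors NODE RING...
# Prints "LEFT RIGHT" for NODE, or returns 1 if NODE is not in the ring.
ring_neighbors() {
    target=$1; shift
    first=$1; last=; prev=; found=; left=; right=
    for n in "$@"; do last=$n; done
    for n in "$@"; do
        # The entry right after the target is its right-hand neighbor.
        if [ -n "$found" ] && [ -z "$right" ]; then right=$n; fi
        if [ "$n" = "$target" ]; then
            found=1
            left=${prev:-$last}        # wrap to the end of the list
        fi
        prev=$n
    done
    [ -n "$found" ] || return 1
    echo "$left ${right:-$first}"      # wrap to the start of the list
}

# ring_remove DEAD RING...
# Prints the ring with DEAD dropped; its old neighbors become adjacent.
ring_remove() {
    dead=$1; shift
    out=
    for n in "$@"; do
        if [ "$n" != "$dead" ]; then out="$out $n"; fi
    done
    echo $out
}

# Example with hypothetical hosts:
#   ring_neighbors compute-0-1 compute-0-0 compute-0-1 compute-0-2
#   ring_remove    compute-0-1 compute-0-0 compute-0-1 compute-0-2
```

In this model, dropping a node leaves its former neighbors next to each other, which is the "reknit" step described above. Restarting the dead daemon corresponds to adding the name back at any position.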

Once the ring of MPD daemons is in place, parallel jobs can be quickly started. The original ring is composed of mpd daemons running as root. Each has a local pipe that serves as a "console" on the node, through which commands are sent for starting and controlling jobs. Only one console pipe is allowed per job.

When the root ring receives a job-start command, it creates a child ring by forking "mpd" daemons running as the user. This ring lives only to service the job, and will be destroyed when execution completes. The job ring inherits the console pipe from the appropriate root daemon. This ring sets up a tree-based connection structure to handle the standard input, output, and error streams, as well as any MPI messages that may be sent. All output is collected in one place and sent back through the console pipe.

Without a complete root ring, no jobs can be started. The task of keeping the ring healthy depends on correct and reliable MPD daemons and KAgreement-mpd. Although stable in practice, KAgreement-mpd has not been extensively tested in production. In addition, the MPD daemons themselves are relatively new and untested.

The KAgreement-mpd protocol uses Ganglia-style multicast messages to coordinate every node in the cluster. These messages enable new nodes to join the ring automatically and gracefully support temporary network partitions. You can inspect these messages with:

$ telnet localhost 8649 | grep mpd

You should see two lines of output for every node in the cluster. KAgreement-mpd requires that the greceptor daemon (which publishes and listens for user-defined Ganglia metrics) be running on every node in the cluster.
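The two-lines-per-node rule can be checked mechanically. The helper below is a small sketch: it reads the Ganglia dump on standard input, counts the lines that contain "mpd", and compares the count with twice the node count you pass in. The function name check_mpd_lines and the node count in the example are made up for illustration, and the helper does not interpret the metric contents.

```shell
# check_mpd_lines NODES
# Reads a Ganglia dump on stdin and expects 2 * NODES lines mentioning mpd.
check_mpd_lines() {
    nodes=$1
    expected=$((nodes * 2))

    # grep -c prints 0 and exits non-zero when nothing matches; ignore that.
    actual=$(grep -c mpd || true)

    if [ "$actual" -eq "$expected" ]; then
        echo "ok: $actual mpd lines for $nodes nodes"
    else
        echo "mismatch: $actual mpd lines, expected $expected"
        return 1
    fi
}

# Example for a hypothetical 16-node cluster:
#   telnet localhost 8649 | check_mpd_lines 16
```

A mismatch suggests that some node is not publishing its mpd metrics, for instance because greceptor is not running there.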

Another way to troubleshoot the MPD ring is to use the MPD utilities, such as mpdtrace, mpddump, mpdshutdown, and mpdjobinfo. Check the contents of /opt/mpich-mpd/gnu/bin for the full list of MPD utilities.
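To see which utilities are installed, a small helper like the following lists the executables whose names start with "mpd". The function name list_mpd_tools is hypothetical. The default directory is the one named above, and the argument lets you point the helper elsewhere.

```shell
# list_mpd_tools [DIR]
# Prints the names of executable files in DIR that begin with "mpd".
list_mpd_tools() {
    dir=${1:-/opt/mpich-mpd/gnu/bin}
    for f in "$dir"/mpd*; do
        # If nothing matches, the pattern stays literal and is skipped here.
        if [ -f "$f" ] && [ -x "$f" ]; then
            basename "$f"
        fi
    done
}

# Example:
#   list_mpd_tools
```

Consult each utility's manpage for its options before running it against a live ring.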

2.3.3. Conclusion

The cluster-fork command can also use the MPD system to run any standard UNIX command for speedy startup and efficient job control. See the Cluster Fork page for more details.

In conclusion, we have tried to provide a fully functioning MPD ring that requires no intervention or effort from the user. Please be patient with any bugs you find, as the KAgreement-mpd protocol is not yet mature. For more information on MPD and its utilities, see the manuals and associated documentation on your frontend machine in "/opt/mpich-mpd/{www,doc,man}".