Debugging parallel programs is notoriously difficult. Parallel programs are subject not only to the usual kinds of bugs but also to new kinds having to do with timing and synchronization errors. Often, the program ``hangs,'' for example when a process is waiting for a message to arrive that is never sent or is sent with the wrong tag. Parallel bugs often disappear precisely when you add code to try to identify the bug, which is particularly frustrating. In this section we discuss several approaches to parallel debugging.
Just as in sequential debugging, you often wish to trace interesting events in the program by printing trace messages. Usually you wish to identify a message by the rank of the process emitting it. This can be done explicitly by putting the rank in the trace message.
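As a minimal sketch of the explicit approach, the helper below prefixes a trace message with the rank of the process. The names format_trace and trace are hypothetical and are not part of MPI or mpich; the rank argument is assumed to be the value the program obtained earlier from MPI_Comm_rank.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper (not part of MPI or mpich): format one trace line
 * as "[rank] message" into buf. The rank is assumed to be the value the
 * caller obtained from MPI_Comm_rank.
 * Returns the length of the formatted line, or -1 if it did not fit. */
int format_trace(char *buf, size_t size, int rank, const char *fmt, ...)
{
    va_list args;
    int n, m;

    /* Write the rank prefix first. */
    n = snprintf(buf, size, "[%d] ", rank);
    if (n < 0 || (size_t)n >= size)
        return -1;

    /* Append the caller's message after the prefix. */
    va_start(args, fmt);
    m = vsnprintf(buf + n, size - (size_t)n, fmt, args);
    va_end(args);
    if (m < 0 || (size_t)m >= size - (size_t)n)
        return -1;

    return n + m;
}

/* Hypothetical convenience wrapper: print the trace line on stderr and
 * flush it, so that the line is not left sitting in a buffer if the
 * program later hangs. */
void trace(int rank, const char *msg)
{
    char line[256];

    if (format_trace(line, sizeof line, rank, "%s", msg) >= 0) {
        fprintf(stderr, "%s\n", line);
        fflush(stderr);
    }
}
```

A process of rank 3 that calls trace(3, "waiting for message with tag 7") would emit the line "[3] waiting for message with tag 7". Building the whole line before printing it is a design choice intended to reduce the chance that output from different processes is interleaved within a single line.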
The MPI Standard specifies a mechanism for installing one's own error handler,
and specifies the behavior of two predefined ones, MPI_ERRORS_RETURN
and MPI_ERRORS_ARE_FATAL.
As part of the MPE library, we include two other error handlers to
facilitate the use of command-line debuggers such as dbx in
debugging MPI programs.
    MPE_Errors_call_dbx_in_xterm
    MPE_Signals_call_debugger
These error handlers are located in the MPE directory. A configure option (-mpedbg) includes these error handlers in the regular MPI libraries, and allows the command-line argument -mpedbg to make MPE_Errors_call_dbx_in_xterm the default error handler (instead of MPI_ERRORS_ARE_FATAL).
The -dbg=<name of debugger> option to mpirun causes processes
to be run under the control of the chosen debugger.
For example,
    mpirun -dbg=gdb or mpirun -dbg=gdb a.out
invokes the mpirun_dbg.gdb script located in the mpich/bin directory. This script captures the correct arguments, invokes the gdb debugger, and starts the first process under gdb where possible. Debugger scripts are provided for ddd, gdb, xxgdb, and totalview. These may need to be edited depending on your system. There is another debugger script for dbx, but this one will always need to be edited, as the debugger commands for dbx vary between versions. You can also use this option to call another debugger; for example, -dbg=mydebug. All you need to do is write a script file, mpirun_dbg.mydebug, which follows the format of the included debugger scripts, and place it in the mpich/bin directory. More information on using the TotalView debugger with mpich can be found in Section Debugging MPI programs with TotalView.
It is often convenient to have a debugger start when a program detects an
error. If mpich was configured with the option --enable-mpedbg, then
adding the command-line option -mpedbg to the program will cause
mpich to attempt to start a debugger (usually dbx or gdb) when
an error that generates a signal (such as SIGSEGV) occurs. For
example,
    mpirun -np 4 a.out -mpedbg
If you are not sure whether your mpich provides this service, you can use -mpiversion to see if mpich was built with the --enable-mpedbg option. This feature may not be available with all devices.
On workstation clusters, you can often attach a debugger to a running
process.
For example, the debugger dbx often accepts a process id (pid)
which you can get by using the ps command. The form may be either
    dbx a.out 1234
or
    dbx -pid 1234 a.out
where 1234 is the process id.
To do this with gdb, start gdb and at its prompt do
    file a.out
    attach 1234
One can also attach the TotalView debugger to a running program (see Section Debugging MPI programs with TotalView below).
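Attaching is easier when the process announces its process id and waits for you. The following is a minimal sketch of that idea, not something mpich provides: the function name wait_for_debugger, the flag it polls, and the timeout are all assumptions made for illustration.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not part of MPI or mpich): print this process's
 * id, then wait until a debugger sets *attached to a nonzero value or
 * until max_seconds have elapsed.
 * Returns 1 if the flag was set, 0 if the wait timed out. */
int wait_for_debugger(volatile int *attached, int max_seconds)
{
    int waited = 0;

    /* Announce the pid so that it does not have to be looked up with ps. */
    fprintf(stderr, "pid %ld is waiting for a debugger\n", (long)getpid());
    fflush(stderr);

    /* The flag is volatile so that each loop iteration rereads it from
     * memory and sees a value written by the debugger. */
    while (!*attached && waited < max_seconds) {
        sleep(1);
        waited++;
    }
    return *attached != 0;
}
```

A program might declare a volatile int flag initialized to zero and call wait_for_debugger(&flag, 60) early in main. After attaching with gdb as described above, you would select the frame in which the flag is visible, set it to 1 with gdb's set var command, and then continue.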
If one specifies -p4norem on the command line, mpirun will not actually
start the processes. The master process prints a message suggesting how the
user can do it. The point of this option is to enable the user to start the
remote processes under a debugger of their choice. The option only
makes sense when processes are being started remotely, such as on a
workstation network. Note that this is an argument to the program, not
to mpirun. For example, to run myprog this way, use
    mpirun -np 4 myprog -p4norem
For example, to run cpi with 2 processes, where the second process is run under the debugger, the session would look something like
    mpirun -np 2 cpi -p4norem
    waiting for process on host shakey.mcs.anl.gov:
    /home/me/cpi sys2.foo.edu 38357 -p4amslave
on the first machine and
    % gdb cpi
    GNU gdb 5.0
    Copyright 2000 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB. Type "show warranty" for details.
    This GDB was configured as "i586-mandrake-linux"...
    (gdb) run sys2.foo.edu 38357 -p4amslave
    Starting program: /home/me/cpi sys2.foo.edu 38357 -p4amslave
on the second machine.
TotalView(c) [18] is a powerful, commercial-grade, portable debugger for parallel and multithreaded programs, available from Etnus (http://www.etnus.com/). TotalView understands multiple MPI implementations, including mpich. Here ``understands'' means that if you have TotalView installed on your system, you can easily start your mpich program under the control of TotalView, even if you are running on multiple machines; manage your processes both collectively and individually through TotalView's GUI; and even examine internal mpich data structures to look at message queues [3]. The general operation model of TotalView will be familiar to users of command-line-based debuggers such as gdb or dbx.
Starting an mpich program under TotalView control.
To start a parallel program under TotalView control, simply add `-dbg=totalview' to your mpirun arguments:
    mpirun -dbg=totalview -np 4 cpi
TotalView will come up and you can start the program by typing `G'. A window will come up asking whether you want to stop processes as they execute MPI_Init. You may find it more convenient to say ``no'' and instead to set your own breakpoint after MPI_Init (see Section Debugging MPI programs with TotalView). This way, when the process stops it will be on a line in your program instead of somewhere inside MPI_Init.
Attaching to a running program.
TotalView can attach to a running MPI program, which is particularly useful if you suspect that your code has deadlocked. To do this, start TotalView with no arguments, and then press `N' in the root window. This will bring up a list of the processes that you can attach to. When you dive through the initial mpich process in this window, TotalView will also acquire all of the other mpich processes (even if they are not local). See the TotalView manual for more details of this process.
Debugging with TotalView.
You can set breakpoints by clicking in the left margin on a line number. Most of the TotalView GUI is self-explanatory. You select things with the left mouse button, bring up an action menu with the middle button, and ``dive'' into functions, variables, structures, processes, etc., with the right button. Pressing Ctrl-? in any TotalView window brings up help relevant to that window; in the initial TotalView window it brings up general help. The full documentation (The TotalView User's Guide) is available from the Etnus web site.
You switch from viewing one process to the next with the arrow buttons at the top-right corner of the main window, or by explicitly selecting (left button) a process in the root window to re-focus an existing window onto that process, or by diving (right button) through a process in the root window to open a new window for the selected process. All the keyboard shortcuts for commands are listed in the menu that is attached to the middle button. The commands are mostly the familiar ones. The special one for MPI is the `m' command, which displays message queues associated with the process.
Note also that if you use the MPI-2 function MPI_Comm_set_name on a communicator, TotalView will display this name whenever showing information about the communicator, making it easier to understand which communicator is which.