Debugging

Debugging parallel programs is notoriously difficult. Parallel programs are subject not only to the usual kinds of bugs but also to new kinds involving timing and synchronization errors. Often the program "hangs," for example when a process is waiting for a message that is never sent or is sent with the wrong tag. Parallel bugs often disappear precisely when you add code to try to identify them, which is particularly frustrating. In this section we discuss several approaches to parallel debugging.

The printf Approach

Just as in sequential debugging, you often wish to trace interesting events in the program by printing trace messages. Usually you wish to identify a message by the rank of the process emitting it. This can be done explicitly by putting the rank in the trace message.
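
One way to put the rank in every trace message is a small helper such as the sketch below. The names format_trace and trace are hypothetical and not part of MPI or mpich. The rank is passed in as an ordinary integer so that the sketch compiles without MPI; in an MPI program it would typically be obtained once with MPI_Comm_rank(MPI_COMM_WORLD, &rank).

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Build "[rank] message" in buf. Returns the length written, or -1 if
   formatting failed or the line did not fit in the buffer. */
int vformat_trace(char *buf, size_t size, int rank,
                  const char *fmt, va_list ap)
{
    assert(buf != NULL && fmt != NULL);
    int prefix = snprintf(buf, size, "[%d] ", rank);
    if (prefix < 0 || (size_t)prefix >= size)
        return -1;
    int body = vsnprintf(buf + prefix, size - (size_t)prefix, fmt, ap);
    if (body < 0 || (size_t)body >= size - (size_t)prefix)
        return -1;
    return prefix + body;
}

/* printf-style wrapper around vformat_trace (hypothetical helper). */
int format_trace(char *buf, size_t size, int rank, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    int n = vformat_trace(buf, size, rank, fmt, ap);
    va_end(ap);
    return n;
}

/* Print one rank-tagged line to stdout and flush it immediately, so the
   message is not left in a buffer if the program hangs afterwards. */
void trace(int rank, const char *fmt, ...)
{
    char line[256];
    va_list ap;
    va_start(ap, fmt);
    int n = vformat_trace(line, sizeof line, rank, fmt, ap);
    va_end(ap);
    if (n >= 0) {
        puts(line);
        fflush(stdout);
    }
}
```

A call such as trace(rank, "waiting for tag %d", tag) then prints a line that begins with the rank in brackets. Flushing after each message is a design choice: it costs some speed but makes it more likely that the last message before a hang actually appears.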

Error handlers

The MPI Standard specifies a mechanism for installing one's own error handler, and specifies the behavior of two predefined ones, MPI_ERRORS_RETURN and MPI_ERRORS_ARE_FATAL. As part of the MPE library, we include two other error handlers to facilitate the use of command-line debuggers such as dbx in debugging MPI programs.

    MPE_Errors_call_dbx_in_xterm 
    MPE_Signals_call_debugger

These error handlers are located in the MPE directory. A configure option (-mpedbg) builds these error handlers into the regular MPI libraries and enables the command-line argument -mpedbg, which makes MPE_Errors_call_dbx_in_xterm the default error handler (instead of MPI_ERRORS_ARE_FATAL).

Starting jobs with a debugger

The -dbg=<name of debugger> option to mpirun causes processes to be run under the control of the chosen debugger. For example, either of

    mpirun -dbg=gdb
    mpirun -dbg=gdb a.out

invokes the mpirun_dbg.gdb script located in the mpich/bin directory. This script captures the correct arguments, invokes the gdb debugger, and starts the first process under gdb where possible. There are four debugger scripts: gdb, xxgdb, ddd, and totalview. These may need to be edited depending on your system. There is another debugger script for dbx, but this one will always need to be edited because the debugger commands for dbx vary between versions. You can also use this option to call another debugger; for example, -dbg=mydebug. All you need to do is write a script file, mpirun_dbg.mydebug, that follows the format of the included debugger scripts, and place it in the mpich/bin directory. More information on using the TotalView debugger with mpich can be found in Section Debugging MPI programs with TotalView.

Starting the debugger when an error occurs

It is often convenient to have a debugger start when a program detects an error. If mpich was configured with the option --enable-mpedbg, then adding the command-line option -mpedbg to the program will cause mpich to attempt to start a debugger (usually dbx or gdb) when an error that generates a signal (such as SIGSEGV) occurs. For example,

    mpirun -np 4 a.out -mpedbg

If you are not sure whether your mpich provides this service, you can use -mpiversion to see whether mpich was built with the --enable-mpedbg option. This feature may not be available with all devices.

Attaching a debugger to a running program

On workstation clusters, you can often attach a debugger to a running process. For example, the debugger dbx often accepts a process id (pid) which you can get by using the ps command. The form may be either

    dbx a.out 1234

    dbx -pid 1234 a.out

where 1234 is the process id.

To do this with gdb, start gdb and at its prompt do

    file a.out 
    attach 1234

You can also attach the TotalView debugger to a running program (see Section Debugging MPI programs with TotalView below).
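
Attaching is easier when the process announces its pid and then waits for you. The following is a minimal sketch of that common technique, not something provided by mpich: the function name wait_for_debugger and its parameters are hypothetical, and it assumes a POSIX system (getpid, gethostname, sleep).

```c
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <unistd.h>

/* Print this process's pid and host name, then poll until *released
   becomes nonzero. The flag is meant to be set from the debugger after
   attaching. Returns the number of polling intervals that elapsed. */
int wait_for_debugger(volatile int *released, unsigned poll_seconds)
{
    char host[256];
    if (gethostname(host, sizeof host) != 0)
        snprintf(host, sizeof host, "unknown-host");
    host[sizeof host - 1] = '\0';  /* gethostname may not terminate on truncation */

    printf("pid %ld on %s waiting for debugger attach\n",
           (long)getpid(), host);
    fflush(stdout);

    int polls = 0;
    while (!*released) {
        sleep(poll_seconds);
        polls++;
    }
    return polls;
}
```

A program might declare volatile int released = 0; and call wait_for_debugger(&released, 5) early in main, for example just after MPI_Init. After attaching with gdb and selecting the wait_for_debugger frame, the command set var *released = 1 followed by continue lets the process proceed.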

Debugging MPI programs with TotalView

TotalView [18] is a powerful, commercial-grade, portable debugger for parallel and multithreaded programs, available from Etnus (http://www.etnus.com/). TotalView understands multiple MPI implementations, including mpich. Here "understands" means that, if TotalView is installed on your system, you can easily start your mpich program under TotalView's control, even when running on multiple machines; manage your processes both collectively and individually through TotalView's GUI; and examine internal mpich data structures to look at message queues [3]. The general operation model of TotalView will be familiar to users of command-line debuggers such as gdb or dbx.

Starting an mpich program under TotalView control.

To start a parallel program under TotalView control, simply add -dbg=totalview to your mpirun arguments:

    mpirun -dbg=totalview -np 4 cpi

TotalView will come up, and you can start the program by typing G. A window will then ask whether you want to stop processes as they execute MPI_Init. You may find it more convenient to say "no" and instead set your own breakpoint after MPI_Init (see Section Debugging MPI programs with TotalView). That way, when the process stops, it will be on a line in your program instead of somewhere inside MPI_Init.

Attaching to a running program.

TotalView can attach to a running MPI program, which is particularly useful if you suspect that your code has deadlocked. To do this, start TotalView with no arguments and then press N in the root window. This brings up a list of the processes that you can attach to. When you dive through the initial mpich process in this window, TotalView will also acquire all of the other mpich processes (even if they are not local). See the TotalView manual for more details of this process.

Debugging with TotalView.

You can set breakpoints by clicking in the left margin on a line number. Most of the TotalView GUI is self-explanatory. You select things with the left mouse button, bring up an action menu with the middle button, and "dive" into functions, variables, structures, processes, etc., with the right button. Pressing Ctrl-? in any TotalView window brings up help relevant to that window; in the initial TotalView window it brings up general help. The full documentation (The TotalView User's Guide) is available from the Etnus web site.

You can switch from viewing one process to the next in three ways: with the arrow buttons at the top-right corner of the main window; by explicitly selecting (left button) a process in the root window, which re-focuses an existing window onto that process; or by diving (right button) through a process in the root window, which opens a new window for the selected process. All the keyboard shortcuts for commands are listed in the menu attached to the middle button. The commands are mostly the familiar ones. The special one for MPI is the m command, which displays the message queues associated with the process.

Note also that if you use the MPI-2 function MPI_Comm_set_name on a communicator, TotalView will display this name whenever showing information about the communicator, making it easier to understand which communicator is which.