Debugging



Debugging parallel programs is notoriously difficult. Parallel programs are subject not only to the usual kinds of bugs but also to new kinds having to do with timing and synchronization errors. Often, the program ``hangs,'' for example when a process is waiting for a message to arrive that is never sent or is sent with the wrong tag. Parallel bugs often disappear precisely when you add code to try to identify the bug, which is particularly frustrating. In this section we discuss several approaches to parallel debugging.





The printf Approach



Just as in sequential debugging, you often wish to trace interesting events in the program by printing trace messages. Usually you wish to identify each message by the rank of the process emitting it. This can be done explicitly by putting the rank in the trace message. As noted above, using the ``line labels'' option (-l) with mpirun in the ch_p4mpd device in mpich adds the rank automatically.
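
One way to put the rank in each trace message explicitly is a small helper like the following. This is only a sketch: the function name trace and the ``[rank]'' prefix format are choices made for this illustration and are not part of mpich. The helper flushes after every line, so that the last message before a hang is not left sitting in a stdio buffer.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper: write "[rank] message\n" to out and flush at once.
 * Returns the number of characters written, or a negative value on error. */
int trace(FILE *out, int rank, const char *fmt, ...)
{
    va_list ap;
    int prefix_len, body_len;

    assert(out != NULL && fmt != NULL);

    /* Label the line with the rank of the emitting process. */
    prefix_len = fprintf(out, "[%d] ", rank);
    if (prefix_len < 0)
        return prefix_len;

    /* Append the caller's printf-style message. */
    va_start(ap, fmt);
    body_len = vfprintf(out, fmt, ap);
    va_end(ap);
    if (body_len < 0)
        return body_len;

    /* Terminate the line and push it out before anything else can go wrong. */
    if (fputc('\n', out) == EOF || fflush(out) == EOF)
        return -1;

    return prefix_len + body_len + 1;
}
```

In an MPI program you would typically obtain the rank once with MPI_Comm_rank(MPI_COMM_WORLD, &rank) after MPI_Init, and then call, for example, trace(stderr, rank, "n = %d", n).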





Error handlers



The MPI Standard specifies a mechanism for installing one's own error handler, and specifies the behavior of two predefined ones, MPI_ERRORS_RETURN and MPI_ERRORS_ARE_FATAL. As part of the MPE library, we include two other error handlers to facilitate the use of command-line debuggers such as dbx in debugging MPI programs.

    MPE_Errors_call_dbx_in_xterm 
    MPE_Signals_call_debugger 
These error handlers are located in the MPE directory. A configure option (-mpedbg) builds these error handlers into the regular MPI libraries, and allows the command-line argument -mpedbg to make MPE_Errors_call_dbx_in_xterm the default error handler (instead of MPI_ERRORS_ARE_FATAL).





Starting jobs with a debugger



The -dbg=<name of debugger> option to mpirun causes processes to be run under the control of the chosen debugger. For example,

    mpirun -dbg=gdb 
or
    mpirun -dbg=gdb a.out 
invokes the mpirun_dbg.gdb script located in the mpich/bin directory. This script captures the correct arguments, invokes the gdb debugger, and starts the first process under gdb where possible. Debugger scripts are provided for ddd, gdb, xxgdb, and totalview; these may need to be edited depending on your system. There is also a debugger script for dbx, but it will always need to be edited because the debugger commands for dbx vary between versions. You can also use this option to call another debugger; for example, -dbg=mydebug. All you need to do is write a script file, mpirun_dbg.mydebug, that follows the format of the included debugger scripts, and place it in the mpich/bin directory. More information on using the TotalView debugger with mpich can be found in Section Debugging MPI programs with TotalView.





Starting the debugger when an error occurs



It is often convenient to have a debugger start when a program detects an error. If mpich was configured with the option --enable-mpedbg, then adding the command-line option -mpedbg to the program will cause mpich to attempt to start a debugger (usually dbx or gdb) when an error that generates a signal (such as SIGSEGV) occurs. For example,

    mpirun -np 4 a.out -mpedbg  
If you are not sure whether your mpich provides this service, you can use -mpiversion to see whether mpich was built with the --enable-mpedbg option. This feature may not be available with all devices.





Attaching a debugger to a running program



On workstation clusters, you can often attach a debugger to a running process. For example, the debugger dbx often accepts a process id (pid), which you can get by using the ps command. The form may be either

    dbx a.out 1234 
or
    dbx -pid 1234 a.out 
where 1234 is the process id.

To do this with gdb, start gdb and at its prompt enter

    file a.out 
    attach 1234 
You can also attach the TotalView debugger to a running program (see Section Debugging MPI programs with TotalView below).
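
Attaching is easier when the process reports its own process id and then waits for you. The following sketch shows one common way to arrange this. The function name wait_for_debugger, the variable debugger_hold, and the use of an environment variable to enable the wait are all assumptions made for this example; none of them is provided by mpich.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical flag. It is global so that it can be changed from the
 * debugger no matter which stack frame the process is stopped in. */
volatile int debugger_hold = 0;

/* If the environment variable named by env_name is set, print this
 * process's id and wait until a debugger clears debugger_hold.
 * Returns 1 if the process waited, 0 if it went straight on. */
int wait_for_debugger(const char *env_name)
{
    assert(env_name != NULL);

    /* Normal runs are unaffected unless the variable is set. */
    if (getenv(env_name) == NULL)
        return 0;

    /* Report the pid so that it need not be looked up with ps. */
    fprintf(stderr, "process %ld is waiting for a debugger to attach\n",
            (long)getpid());
    fflush(stderr);

    /* Sleep in short steps until the debugger sets the flag to 0. */
    debugger_hold = 1;
    while (debugger_hold)
        sleep(1);

    return 1;
}
```

You might call wait_for_debugger("WAIT_FOR_DEBUGGER") near the start of main. After attaching with gdb, entering set var debugger_hold = 0 followed by continue lets the process proceed.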





Debugging MPI programs with TotalView



TotalView(c) [18] is a powerful, commercial-grade, portable debugger for parallel and multithreaded programs, available from Etnus (http://www.etnus.com/). TotalView understands multiple MPI implementations, including mpich. Here ``understands'' means that if you have TotalView installed on your system, you can easily start your mpich program under the control of TotalView, even if you are running on multiple machines; manage your processes both collectively and individually through TotalView's GUI; and even examine internal mpich data structures to look at message queues [3]. The general operation model of TotalView will be familiar to users of command-line-based debuggers such as gdb or dbx.

Starting an mpich program under TotalView control.

To start a parallel program under TotalView control, use

    totalview mpirun 
and select Processes/Startup Parameters from the TotalView menu system. Then you can enter the mpirun arguments in the window that pops up. For example, you would enter
    -np 4 cpi 

TotalView will come up and you can start the program by typing `G'. A window will come up asking whether you want to stop processes as they execute MPI_Init. You may find it more convenient to say ``no'' and instead set your own breakpoint after MPI_Init (see Section Debugging MPI programs with TotalView). This way, when the process stops it will be on a line in your program instead of somewhere inside MPI_Init.

Attaching to a running program.

TotalView can attach to a running MPI program, which is particularly useful if you suspect that your code has deadlocked. To do this, start TotalView with no arguments, and then press `N' in the root window. This will bring up a list of the processes that you can attach to. When you dive through the initial mpich process in this window, TotalView will also acquire all of the other mpich processes (even if they are not local). See the TotalView manual for more details of this process.

Debugging with TotalView.

You can set breakpoints by clicking in the left margin on a line number. Most of the TotalView GUI is self-explanatory. You select things with the left mouse button, bring up an action menu with the middle button, and ``dive'' into functions, variables, structures, processes, etc., with the right button. Pressing ctrl-? in any TotalView window brings up help relevant to that window. In the initial TotalView window it brings up general help. The full documentation (The TotalView User's Guide) is available from the Etnus web site.

You switch from viewing one process to the next with the arrow buttons at the top-right corner of the main window, or by explicitly selecting (left button) a process in the root window to re-focus an existing window onto that process, or by diving (right button) through a process in the root window to open a new window for the selected process. All the keyboard shortcuts for commands are listed in the menu that is attached to the middle button. The commands are mostly the familiar ones. The special one for MPI is the `m' command, which displays message queues associated with the process.

Note also that if you use the MPI-2 function MPI_Comm_set_name on a communicator, TotalView will display this name whenever showing information about the communicator, making it easier to understand which communicator is which.





Using mpigdb



The ch_p4mpd device version of MPICH features a ``parallel debugger'' that consists simply of multiple copies of the gdb debugger, together with a mechanism for redirecting stdin. The mpigdb command is a version of mpirun that runs each user process under the control of gdb and also takes control of stdin for gdb. The `z' command allows you to direct terminal input to any specified process or to broadcast it to all processes. We demonstrate this by running the cpi example under this simple debugger.

    donner% mpigdb -np 5 cpi                # default is stdin bcast 
    (mpigdb) b 29                           # set breakpoint for all 
    0-4: Breakpoint 1 at 0x8049e93: file cpi.c, line 29. 
    (mpigdb) r                              # run all 
    0-4: Starting program: /home/lusk/mpich/examples/basic/cpi  
    0: Breakpoint 1, main (argc=1, argv=0xbffffa84) at cpi.c:29 
    1-4: Breakpoint 1, main (argc=1, argv=0xbffffa74) at cpi.c:29 
    0-4: 29     n = 0;                      # all reach breakpoint 
    (mpigdb) n                              # single step all 
    0: 38               if (n==0) n=100; else n=0; 
    1-4: 42         MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); 
    (mpigdb) z 0                            # limit stdin to rank 0 
    (mpigdb) n                              # single step process 0 
    0: 40               startwtime = MPI_Wtime(); 
    (mpigdb) n                              # until  caught up 
    0: 42           MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); 
    (mpigdb) z                              # go back to bcast stdin 
    (mpigdb) n                              # single step all 
                    ...                     # until interesting spot 
    (mpigdb) n 
    0-4: 52                 x = h * ((double)i - 0.5); 
    (mpigdb) p x                            # bcast print command 
    0: $1 = 0.0050000000000000001           # 0's value of x 
    1: $1 = 0.014999999999999999            # 1's value of x 
    2: $1 = 0.025000000000000001            # 2's value of x 
    3: $1 = 0.035000000000000003            # 3's value of x 
    4: $1 = 0.044999999999999998            # 4's value of x 
    (mpigdb) c                              # continue all 
    0: pi is approximately 3.141600986923, Error is 0.000008333333 
    0-4: Program exited normally. 
    (mpigdb) q                              # quit 
    donner% 
If the debugging process hangs (no mpigdb prompt) because the current process is waiting for action by another process, Control-C will bring up a menu that allows you to switch processes. mpigdb is not nearly as advanced as TotalView, but it is often useful, and it is freely distributed with MPICH.


