Programs fail after starting



General



    1. Q: I use MPI_Allreduce, and I get different answers depending on the number of processes I'm using.


    A: The MPI collective routines may make use of associativity to achieve better parallelism. For example, an

    MPI_Allreduce( &in, &out, 1, MPI_DOUBLE, MPI_SUM, comm ); 
    
    might compute

    (((((((a+b)+c)+d)+e)+f)+g)+h)

    or it might compute

    ((a+b) + (c+d)) + ((e+f) + (g+h)),

    where a,b,... are the values of in on each of eight processes. These expressions are equivalent for integers, reals, and other familiar objects from mathematics, but they are not equivalent for the floating-point datatypes used in computers, whose arithmetic is not associative. The association that MPI uses depends on the number of processes; thus, you may not get exactly the same result when you use different numbers of processes. Note that you are not getting a wrong result, just a different one; most programs implicitly assume that the arithmetic operations are associative. All processes do get the same result from a single call to MPI_Allreduce.
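    The effect is easy to reproduce in plain C, without MPI. The sketch below (the values are chosen purely to expose rounding) sums the same four doubles left to right and in the pairwise tree order a parallel reduction might use:

    ```c
    #include <stdio.h>

    /* Sum four values left to right: ((a+b)+c)+d */
    double sum_ltr(double a, double b, double c, double d) {
        return ((a + b) + c) + d;
    }

    /* Sum four values pairwise: (a+b) + (c+d), the tree order a
       parallel reduction over four processes might use. */
    double sum_pairwise(double a, double b, double c, double d) {
        return (a + b) + (c + d);
    }

    int main(void) {
        /* 1.0 added to 1.0e16 is lost to rounding, so the grouping matters. */
        double a = 1.0, b = 1.0e16, c = -1.0e16, d = 1.0;
        printf("left-to-right: %.1f\n", sum_ltr(a, b, c, d));      /* 1.0 */
        printf("pairwise:      %.1f\n", sum_pairwise(a, b, c, d)); /* 0.0 */
        return 0;
    }
    ```

    Both orderings are mathematically equal to 2, and both computed answers differ from it and from each other; neither is "wrong" under IEEE double rounding.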


    2. Q: My Fortran program fails with a BUS error.


    A: The C compiler that mpich was built with and the Fortran compiler that you are using have different alignment rules for variables such as DOUBLE PRECISION. For example, the GNU C compiler gcc may assume that all doubles are aligned on eight-byte boundaries, but the Fortran language requires only that DOUBLE PRECISION align with INTEGERs, which may be four-byte aligned.

    There is no good fix. Consider rebuilding mpich with a C compiler that supports weaker data alignment rules. Some Fortran compilers will allow you to force eight-byte alignment for DOUBLE PRECISION (for example, -dalign or -f on some Sun Fortran compilers); note though that this may break some correct Fortran programs that exploit Fortran's storage association rules.

    Some versions of gcc may support -munaligned-doubles; mpich should be rebuilt with this option if you are using gcc, version 2.7 or later. Mpich attempts to detect and use this option where available.
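    The mismatch can be illustrated in C alone. The sketch below uses gcc's packed attribute (a gcc/clang extension, used here only to mimic Fortran-style four-byte placement of a double after an integer) and compares the offsets against C's natural layout:

    ```c
    #include <stdio.h>
    #include <stddef.h>

    /* Layout like a Fortran common block holding an INTEGER followed by a
       DOUBLE PRECISION: storage association lets the double sit at offset 4.
       __attribute__((packed)) is a gcc/clang extension used to mimic this. */
    struct fortran_like {
        int    n;
        double d;
    } __attribute__((packed));

    /* The same members under C's natural alignment rules: the compiler may
       pad so that d lands on an eight-byte boundary. */
    struct c_natural {
        int    n;
        double d;
    };

    int main(void) {
        printf("Fortran-like offset of d: %zu\n", offsetof(struct fortran_like, d));
        printf("natural C offset of d:    %zu\n", offsetof(struct c_natural, d));
        printf("alignment of double:      %zu\n", (size_t)_Alignof(double));
        return 0;
    }
    ```

    On a platform where doubles are naturally eight-byte aligned, code compiled to assume the second layout will fault (a BUS error) when handed data in the first.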


    3. Q: I'm using fork to create a new process, or I'm creating a new thread, and my code fails.


    A: The mpich implementation is not thread safe and does not support either fork or the creation of new processes. Note that the MPI specification is designed to be thread safe, but implementations are not required to be thread safe. At this writing, few implementations are thread safe, primarily because thread safety reduces the performance of the MPI implementation: at the very least, the implementation must check whether a thread lock is needed, and actually acquiring and releasing the lock is even more expensive.

    The mpich implementation supports the MPI_Init_thread call, new in MPI-2; with this call, you can find out what level of thread support the MPI implementation provides. As of version 1.2.0 of mpich, only MPI_THREAD_SINGLE is officially supported. We believe that version 1.2.0 and later also work at the MPI_THREAD_FUNNELED level, and some users have used mpich in this mode (particularly with OpenMP), but we have not rigorously tested mpich in this mode. Future versions of mpich will support MPI_THREAD_MULTIPLE.
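    A minimal sketch of querying the thread level (this requires an MPI environment to compile and run; the behavior noted in the comment is what the text above leads us to expect from mpich 1.2.x):

    ```c
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int provided;

        /* Ask for the highest level and see what the implementation grants.
           The levels are ordered: SINGLE < FUNNELED < SERIALIZED < MULTIPLE. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        if (provided < MPI_THREAD_FUNNELED) {
            /* With mpich 1.2.x, expect to land here (MPI_THREAD_SINGLE). */
            printf("Only single-threaded use is supported.\n");
        }

        MPI_Finalize();
        return 0;
    }
    ```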

    4. Q: C++ programs execute global destructors (or constructors) more times than expected. For example:

        #include <iostream> 
        #include <mpi.h> 
        using namespace std; 
     
        class Z { 
        public: 
          Z()  { cerr << "*Z" << endl; } 
          ~Z() { cerr << "+Z" << endl; } 
        }; 
     
        Z z; 
     
        int main(int argc, char **argv) { 
          MPI_Init(&argc, &argv); 
          MPI_Finalize(); 
          return 0; 
        } 
    
    when running with the ch_p4 device on two processes, this program executes the destructor twice for each process.


    A: The number of processes running before MPI_Init or after MPI_Finalize is not defined by the MPI standard; you cannot rely on any specific behavior. In the ch_p4 case, a new process is forked to handle connection requests; it terminates at the end of the program.

    You can use the threaded listener with the ch_p4 device, or use the ch_p4mpd device instead. Note, however, that this code is not portable because it relies on behavior that the MPI standard does not specify.





HPUX



    1. Q: My Fortran programs seem to fail with SIGSEGV when running on HP workstations.
    A: Try compiling and linking the Fortran programs with the option +T. This may be necessary to make the Fortran environment correctly handle interrupts used by mpich to create connections to other processes.





LINUX



    1. Q: Processes fail with messages like


    p0_1835:  p4_error: Found a dead connection while looking for messages: 1 
    

    A: What is happening is that the TCP implementation on this platform is deciding that the connection has "failed" when it really hasn't. The current mpich implementation assumes that the TCP implementation will not close connections and has no code to reanimate failed connections. Future versions of mpich will work around this problem.

    In addition, some users have found that the single processor Linux kernel is more stable than the SMP kernel.





Workstation Networks



    1. Q: My job runs to completion but exits with the message


    Timeout in waiting for processes to exit.  This may be due to a defective 
    rsh program (Some versions of Kerberos rsh have been observed to have this 
    problem). 
    This is not a problem with P4 or mpich but a problem with the operating 
    environment.  For many applications, this problem will only slow down 
    process termination. 
    
    What does this mean?


    A: If anything causes the rundown in MPI_Finalize to take more than about 5 minutes, mpich becomes suspicious of the rsh implementation. The rsh used with some Kerberos installations assumed that sizeof(FD_SET) == sizeof(int); that is, the rsh program assumed that the largest FD value was 31. When a program uses fork to create processes that launch rsh, while maintaining the stdin, stdout, and stderr of the forked process, this assumption no longer holds, since the FD that rsh creates for the socket may be greater than 31 if enough processes are running. With such a broken implementation of rsh, the symptom is that jobs never terminate, because the rsh jobs are waiting (with select) for the socket to close.

    The ch_p4mpd device eliminates this problem.
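    Why the 31-descriptor assumption is wrong is easy to see on any POSIX system; a real fd_set is much larger than an int, so select can legitimately be handed descriptors far above 31:

    ```c
    #include <stdio.h>
    #include <sys/select.h>

    int main(void) {
        /* An rsh that assumes sizeof(fd_set) == sizeof(int) can track only
           about 32 descriptors; the system's fd_set is far larger. */
        printf("bits in an int: %zu\n", sizeof(int) * 8);
        printf("FD_SETSIZE:     %d\n",  FD_SETSIZE);
        printf("sizeof(fd_set): %zu bytes\n", sizeof(fd_set));
        return 0;
    }
    ```

    On a typical Linux system FD_SETSIZE is 1024, so a socket descriptor above 31 is entirely normal once many processes hold stdin, stdout, and stderr open.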


