Problems starting programs




General



    1. Q: When trying to start a program with
         mpirun -np 2 cpi 
    
    either I get an error message or the program hangs.


    A: On some systems such as IBM SPs, there are many mutually exclusive ways to run parallel programs; each site can pick the approach(es) that it allows. The script mpirun tries one of the more common methods, but may make the wrong choice. Use the -v or -t option to mpirun to see how it is trying to run the program, and then compare this with the site-specific instructions for using your system. You may need to adapt the code in mpirun to meet your needs. See also the next question.


    2. Q: When trying to run a program with, e.g., mpirun -np 4 cpi, I get

        usage : mpirun [options] <executable> [<dstnodes>] [-- <args>] 
    
    or
        mpirun [options] <schema> 
    

    A: You have a command named mpirun for a different implementation of MPI in your path ahead of the mpich version. Execute the command
        which mpirun 
    
    to see which command named mpirun was actually found. The fix is to either change the order of directories in your path to put the mpich version of mpirun first, or to define an alias for mpirun that uses an absolute path. For example, in the csh shell, you might do
        alias mpirun /usr/local/mpich/bin/mpirun  
    
    to set mpirun to the mpich version.
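    If several MPI implementations are installed, it can help to see every mpirun on your path, not just the first. The following is a minimal sketch in Bourne-style (POSIX) shell rather than csh; the function name list_in_path is made up for this illustration.

```shell
# List every executable called "$1" on PATH, in search order.
# The first line printed is the one the shell would run.
list_in_path() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        # An empty PATH entry means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
    IFS=$old_ifs
}

list_in_path mpirun
```

    If the mpich version is not on the first line of output, reorder your path or define the alias shown above.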


    3. Q: When trying to start a large number of processes on a workstation network, I get the message

         p4_error: latest msg from perror: Too many open files 
    

    A: There is a limitation on the number of open file descriptors. On some systems you can increase this limit yourself; on others you must have help from your system administrator. You could experiment with the secure server, but it is not a complete solution.
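    On systems where you are allowed to raise the limit yourself, the check might look like the sketch below. It assumes a Bourne-style shell whose ulimit builtin accepts -n, -S and -H (common, but not universal; csh users have the analogous limit command). A limit raised in your local shell does not necessarily apply to processes started on remote machines.

```shell
# Report the per-process limit on open file descriptors.
soft=$(ulimit -n)      # current (soft) limit
hard=$(ulimit -H -n)   # ceiling up to which the soft limit may be raised
echo "soft limit: $soft, hard limit: $hard"

# Try to raise the soft limit to the hard limit for this shell and
# for the programs started from it.
if [ "$soft" != "$hard" ]; then
    ulimit -S -n "$hard" 2>/dev/null || echo "could not raise the limit"
fi
```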


    4. Q: When I try to run a program on more than one processor, I get an error message


         mpirun -np 2 cpi 
         /home/me/cpi: error in loading shared libraries: libcxa.so.1:  
         cannot open shared object file: No such file or directory 
    
    There is no trouble running on one processor.


    A: This means that some shared library used by your program cannot be found on the remote processors. There are two possibilities:

      1. The shared libraries are not installed on the remote processor. To fix this, have your system administrators install the libraries.
      2. The shared libraries are not in the default path. This can happen if the path is in an environment variable that is set in your current shell but that is not part of your default or remote shell environment. The fix in this case is harder, because you must communicate the location of the shared library to the executable (a major deficiency in most Unix shared-library designs is that executables do not, by default, remember where a shared library was found when linking). The simplest fix may be to have your system administrators place the necessary shared libraries into one of the directories that are searched by default. If this is not possible, then you will need to help the compiler and linker out.

      Many linkers provide a way to specify the search path for shared libraries. The trick is to (a) pass this command to the linker program and (b) specify all of the libraries that are needed.

      For example, on Linux systems, the linker command to specify the shared library search path is -rpath path, e.g., -rpath /usr/lib:/usr/local/lib. To pass this command to the linker through the Intel C compiler icc, the command -Qoption,link,-rpath,path is used. By default, the Linux linker looks in /usr/lib and the directories specified by the environment variable LD_LIBRARY_PATH. Thus, to force the linker to include the path to the shared library, you can use

         mpicc -o cpi cpi.c -Qoption,link,-rpath,$LD_LIBRARY_PATH:/usr/lib 
      
      If this works, then consider editing the value of LDFLAGS in the compiler scripts (e.g., mpicc) to include this option.

      Unfortunately, each compiler has a different way of passing these arguments to the linker, and each linker has a different set of arguments for specifying the shared library search path. You will need to check the documentation for your system to find these options.
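      To find out which libraries are affected, one option on systems that provide the ldd command is to filter its output for unresolved entries. The sketch below assumes the "name => not found" format that ldd prints on Linux; the function name missing_libs and the sample input are illustrative only.

```shell
# Print the names of shared libraries that the dynamic loader cannot
# find, given ldd output on standard input.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Example with made-up ldd output; on a real system you would run
#   ldd ./cpi | missing_libs
# locally and, through your remote shell, on each remote machine.
missing_libs <<'EOF'
	libcxa.so.1 => not found
	libc.so.6 => /lib/libc.so.6 (0x40030000)
EOF
# prints: libcxa.so.1
```

      An empty result on the local machine and a non-empty one on a remote machine points to one of the two cases described above.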


    5. Q: When attempting to run cpilog I get the following message:
        ld.so.1: cpilog: fatal: libX11.so.4: can't open file: errno 2 
    

    A: The X11 version that configure found isn't properly installed. This is a common problem with Sun/Solaris systems. One possibility is that your Solaris machines are running slightly different versions. You can try forcing static linking (-Bstatic on Solaris).

    Alternately, consider adding these lines to your .login (assuming C shell):

        setenv OPENWINHOME /usr/openwin 
        setenv LD_LIBRARY_PATH /opt/SUNWspro/lib:/usr/openwin/lib 
    
    (you may want to check with your system administrator first to make sure that the paths are correct for your system). Make sure that you add them before any line like
        if ($?USER == 0 || $?prompt == 0) exit  
    

    6. Q: My program fails when it tries to write to a file.


    A: If you opened the file before calling MPI_INIT, the behavior of MPI (not just the mpich implementation of MPI) is undefined. In the ch_p4 device, only process zero (in MPI_COMM_WORLD) will have the file open; the other processes will not have opened the file. Move the operations that open files and interact with the outside world to after MPI_INIT (and before MPI_FINALIZE).


    7. Q: Programs seem to take forever to start.


    A: This can be caused by any of several problems. On systems with dynamically-linked executables, this can be caused by problems with the file system suddenly getting requests from many processors for the dynamically-linked parts of the executable (this has been measured as a problem with some DFS implementations). You can try statically linking your application.

    On workstation networks, long startup times can be due to the time used to start remote processes; see the discussion of the secure server in the section "Using the Secure Server" for the ch_p4 device, or consider using the ch_p4mpd device.





Workstation Networks



    1. Q: When I use mpirun, I get the message Permission denied.
    A: See the section "The Most Common Problems".


    2. Q: When I use mpirun, I get the message Try again.


    A: If you see something like this

        % mpirun -np 2 cpi  
        Try again. 
    
    it means that you were unable to start a remote job with the remote shell command on some machine, even though you would normally be able to. This may mean that the destination machine is very busy, out of memory, or out of processes. The man page for rshd may give you more information.

    The only fix for this is to have your system administrator look into the machine that is generating this message.


    3. Q: When running the ch_p4 device, I get error messages of the form

        stty: TCGETS: Operation not supported on socket 
    
    or
        stty: tcgetattr: Permission denied 
    
    or
        stty: Can't assign requested address 
    

    A: This means that one of your login startup scripts (i.e., .login and .cshrc or .profile) contains an unguarded use of the stty or tset program. For C shell users, one typical fix is to check that the variable TERM or prompt is set before running such commands. For example,
        if ($?TERM) then 
            eval `tset -s -e^\? -k^U -Q -I $TERM` 
        endif 
    
    Another solution is to see if it is appropriate to add
        if ($?USER == 0 || $?prompt == 0) exit  
    
    near the top of your .cshrc file (but after any code that sets up the runtime environment, such as library paths (e.g., LD_LIBRARY_PATH)).
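    For users whose startup file is .profile, a comparable guard can be written in Bourne-style shell by testing whether standard input is a terminal. This is a sketch only: the function name setup_terminal is made up, and the erase and kill characters simply mirror the tset example above.

```shell
# Run terminal setup only when standard input is a terminal.
# Shells started by the remote shell command to launch MPI processes
# have no terminal, so they skip the stty call.
setup_terminal() {
    if [ -t 0 ]; then
        stty erase '^?' kill '^U'
    fi
}

setup_terminal
```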


    4. Q: When running the ch_p4 device and running either the tstmachines script to check the machines file or the mpich tests, I get messages about unexpected output or differences from the expected output. I also get extra output when I run programs. MPI programs do seem to work, however.


    A: This means that one of your login startup scripts (i.e., .login and .cshrc or .profile or .bashrc) contains an unguarded use of some program that generates output, such as fortune or even echo. For C shell users, one typical fix is to check that the variable TERM or prompt is set before running such programs. For example,

        if ($?TERM) then 
            fortune 
        endif 
    
    Another solution is to see if it is appropriate to add
        if ($?USER == 0 || $?prompt == 0) exit  
    
    near the top of your .cshrc file (but after any code that sets up the runtime environment, such as library paths (e.g., LD_LIBRARY_PATH)).
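    One way to confirm that your startup scripts are now silent is to run a command that prints nothing through the remote shell and check that nothing comes back. The helper below is a sketch in Bourne-style shell; the name is_quiet and the host name w2 are hypothetical.

```shell
# Succeed only if the given command prints no text on stdout or stderr.
# (Output consisting solely of newlines is not detected.)
is_quiet() {
    output=$("$@" 2>&1 </dev/null)
    [ -z "$output" ]
}

# Local demonstration; for a real check you would use something like
#   is_quiet rsh w2 true
if is_quiet true; then
    echo "no extra output"
else
    echo "extra output detected"
fi
# prints: no extra output
```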


    5. Q: Occasionally, programs fail with the message

    poll: protocol failure during circuit creation 
    

    A: You may see this message if you attempt to run too many MPI programs in a short period of time. For example, on Linux and when using the ch_p4 device (without the secure server or ssh), mpich may use rsh to start the MPI processes. Depending on the particular Linux distribution and version, there may be a limit of as few as 40 processes per minute. When running the mpich test suite or starting short parallel jobs from a script, it is possible to exceed this limit.

    To fix this, you can do one of the following:

      1. Wait a few seconds between running parallel jobs. You may need to wait up to a minute.


      2. Modify /etc/inetd.conf to allow more processes per minute for rsh. For example, change

      shell stream tcp nowait root /etc/tcpd2 in.rshd  
      
      to
      shell stream tcp nowait.200 root /etc/tcpd2 in.rshd  
      

      3. Use the ch_p4mpd device or the secure server option of the ch_p4 device instead. Neither of these relies on inetd.
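      The edit in option 2 can be sketched as a filter that rewrites only the rsh ("shell") entry. The function name raise_rsh_limit is made up for this example, and the value 200 is the one used above. Note that awk rejoins the fields of the changed line with single spaces.

```shell
# Read inetd.conf on stdin and write it to stdout with the rsh
# ("shell") service changed from nowait to nowait.200.
raise_rsh_limit() {
    awk '$1 == "shell" && $4 == "nowait" { $4 = "nowait.200" } { print }'
}

# Demonstration on the line from the text above.
printf 'shell stream tcp nowait root /etc/tcpd2 in.rshd\n' | raise_rsh_limit
# prints: shell stream tcp nowait.200 root /etc/tcpd2 in.rshd
```

      In practice you would run the filter on a copy of /etc/inetd.conf, review the result, and have your system administrator install it and tell inetd to reread its configuration.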


    6. Q: When using mpirun I get strange output like

     arch: No such file or directory 
    

    A: This is usually a problem in your .cshrc file. Try the shell command
        which hostname 
    
    If you see the same strange output, then your problem is in your .cshrc file. You may have some code in your .cshrc file that assumes that your shell is connected to a terminal.


    7. Q: When I try to run my program, I get

    p0_4652:  p4_error: open error on procgroup file (procgroup): 0 
    

    A: This indicates that the mpirun program did not create the expected input file needed to run the program. The most likely reason is that the mpirun command is trying to run a program built with device ch_p4 as a shared memory (ch_shmem) or other device.

    Try the following:

    Run the program using mpirun and the -t argument:

        mpirun -t -np 1 foo 
    
    This should show what mpirun would do (-t is for testing). Or you can use the -echo argument to see exactly what mpirun is doing:
        mpirun -echo -np 1 foo 
    
    Depending on the choice made by the installer of mpich, you should select the device-specific version of mpirun over a "generic" version. We recommend that the installation prefix include the device name, for example, /usr/local/mpich/solaris/ch_p4.


    8. Q: When trying to run a program I get this message:

        icy%  mpirun -np 2 cpi -mpiversion 
        icy: icy: No such file or directory 
    

    A: Your problem is that /usr/lib/rsh is not the remote shell program. Try the following:
     which rsh 
     ls /usr/*/rsh 
    
    You probably have /usr/lib in your path ahead of /usr/ucb or /usr/bin. This picks the 'restricted' shell instead of the 'remote' shell. The easiest fix is to just remove /usr/lib from your path (few people need it); alternately, you can move it to after the directory that contains the 'remote' shell rsh.

    Another choice would be to add a link in a directory earlier in the search path to the remote shell. For example, I have /home/gropp/bin/solaris early in my search path; I could use

         cd /home/gropp/bin/solaris 
         ln -s /usr/bin/rsh  rsh 
    
    there (assuming that /usr/bin/rsh is the remote shell).


    9. Q: When trying to run a program I get this message:

    trying normal rsh 
    

    A: You are using a version of the remote shell program that does not support the -l argument. Reconfigure mpich with -rshnol and rebuild mpich. You may suffer some loss of functionality if you try to run on systems where you have different user names. You might also try using ssh.


    10. Q: When I run my program, I get messages like


    | ld.so: warning: /usr/lib/libc.so.1.8 has older revision than expected 9 
    

    A: You are trying to run on another machine with an out-dated version of the basic C library. Unfortunately, some manufacturers do not make the shared libraries compatible between minor (or even maintenance) releases of their software. You need to have your system administrator bring the machines to the same software level.

    One temporary fix that you can use is to add the link-time option to force static linking instead of dynamic linking for system libraries. For some Sun workstations, the option is -Bstatic.


    11. Q: Programs never get started. Even tstmachines hangs.


    A: Check first that rsh works at all. For example, if you have workstations w1 and w2, and you are running on w1, try

       rsh w2 true 
    
    This should complete quickly. If it does not, try
       rsh w1 true 
    
    (that is, use rsh to run true on the system that you are running on). If you get Permission denied, see question 1 in this section. If you get
    krcmd: No ticket file (tf_util) 
    rsh: warning, using standard rsh: can't provide Kerberos auth data. 
    
    then your system has a faulty installation of rsh. Some FreeBSD systems have been observed with this problem. Have your system administrator correct the problem (often one of an inconsistent set of rsh/rshd programs).
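    With more than a couple of workstations, the rsh test above can be repeated for every entry in a machines file. The following is a sketch in Bourne-style shell. The function name check_hosts and the variable RSHCOMMAND are names chosen for this example, and the file is assumed to hold one host per line, optionally followed by ":n", with # starting a comment.

```shell
# Remote shell command to test; rsh is what the text above uses.
RSHCOMMAND=${RSHCOMMAND:-rsh}

# Run "true" on every host named in a machines file and report failures.
check_hosts() {
    status=0
    while read -r entry rest; do
        case $entry in
            ''|'#'*) continue ;;      # skip blank lines and comments
        esac
        host=${entry%%:*}             # drop an optional ":n" suffix
        # stdin comes from /dev/null so the remote shell cannot
        # swallow the rest of the machines file.
        if $RSHCOMMAND "$host" true </dev/null >/dev/null 2>&1; then
            echo "ok      $host"
        else
            echo "FAILED  $host"
            status=1
        fi
    done < "$1"
    return $status
}

# Usage (hypothetical file name):
#   check_hosts my_machines
```

    The loop has no timeout, so if it stalls, the host it stopped on is the one to investigate.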


    12. Q: When running ch_p4 device, I get error messages of the form

        more slaves than message queues 
    

    A: This means that you are trying to run mpich in one mode when it was configured for another. In particular, you are specifying in your p4 procgroup file that several processes are to share memory on a particular machine by either putting a number greater than 0 on the first line (where it signifies number of local processes besides the original one), or a number greater than 1 on any of the succeeding lines (where it indicates the total number of processes sharing memory on that machine). You should either change your procgroup file to specify only one process on each line, or reconfigure mpich with
        configure --with-device=ch_p4 -comm=shared 
    
    which will reconfigure the p4 device so that multiple processes can share memory on each host. The reason this is not the default is that with this configuration you will see busy waiting on each workstation, as the device goes back and forth between selecting on a socket and checking the internal shared-memory queue.


    13. Q: My programs seem to hang in MPI_Init.


    A: There are a number of ways that this can happen:

      1. One of the workstations you selected to run on is dead (try tstmachines if you are using the ch_p4 device).
      2. You linked with the FSU pthreads package; this has been reported to cause problems, particularly with the system select call that is part of Unix and is used by mpich.

      Another cause is linking with the library -ldxml (extended math library) on Compaq Alpha systems. This has been observed to cause MPI_Init to hang. No workaround is known at this time; contact Compaq for a fix if you need to use MPI and -ldxml together.

      The root of this problem is that the ch_p4 device uses the signal SIGUSR1, and so any library that also uses this signal can interfere with the operation of mpich if it is using ch_p4. You can rebuild mpich to use a different signal by using the configure argument --with-device=ch_p4:-listener_sig=SIGNAL_NAME and remaking mpich.


    14. Q: My program (using device ch_p4) fails with
        p0_2005:  p4_error: fork_p4: fork failed: -1 
                  p4_error: latest msg from perror: Error 0 
    

    A: The executable size of your program may be too large. When a ch_p4 or ch_tcp device program starts, it may create a copy of itself to handle certain communication tasks. Because of the way in which the code is organized, this (at least temporarily) is a full copy of your original program and occupies the same amount of space. Thus, if your program is over half as large as the maximum space available, you will get this error. On SGI systems, you can use the command size to get the size of the executable and swap -l to get the available space. Note that size gives you the size in bytes and swap -l gives you the size in 512-byte blocks. Other systems may offer similar commands.

    A similar problem can happen on IBM SPs using the ch_mpl device; the cause is the same but it originates within the IBM MPL library.
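    The comparison described for the ch_p4 case mixes two units, so it is easy to get wrong by hand. The sketch below does the arithmetic in Bourne-style shell; the function name fits_twice and the numbers are illustrative, and the result is only a rough guide because a running program may use more memory than size reports.

```shell
# Succeed if two copies of an executable fit in the available swap.
#   $1: executable size in bytes (as reported by size)
#   $2: free swap in 512-byte blocks (as reported by swap -l)
fits_twice() {
    needed=$(( 2 * $1 ))
    available=$(( 512 * $2 ))
    [ "$needed" -le "$available" ]
}

# Illustrative numbers: a 300 MB program and 1,000,000 free blocks.
if fits_twice 314572800 1000000; then
    echo "enough space for the second copy"
else
    echo "too large: fork is likely to fail"
fi
# prints: too large: fork is likely to fail
```

    Here two copies need 629,145,600 bytes, while 1,000,000 blocks provide only 512,000,000 bytes.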


    15. Q: Sometimes, I get the error

        Exec format error. Wrong Architecture. 
    

    A: You are probably using NFS (Network File System). NFS can fail to keep files updated in a timely way; this problem can be caused by creating an executable on one machine and then attempting to use it from another. Usually, NFS catches up with the existence of the new file within a few minutes. You can also try using the sync command. mpirun in fact tries to run the sync command, but on many systems, sync is only advisory and will not guarantee that the file system has been made consistent.


    16. Q: There seem to be two copies of my program running on each node. This doubles the memory requirement of my application. Is this normal?


    A: Yes, this is normal. In the ch_p4 implementation, the second process is used to dynamically establish connections to other processes. With Version 1.1.1 of mpich, this functionality can be placed in a separate thread on many architectures, and this second process will not be seen. To enable this, use the option -p4_opts=-threaded_listener on the configure command line for mpich.


    17. Q: MPI_Abort sometimes doesn't work in the ch_p4 device. Why?


    A: Currently (Version 1.2.5) a process detects that another process has aborted only when it tries to send or receive a message, and the aborting process is one that it has communicated with in the past. Thus it is possible for a process busy with computation not to notice that one of its peers has issued an MPI_Abort, although for many common communication patterns this does not present a problem. This will be fixed in a future release.





IBM RS6000



    1. Q: When trying to run on an IBM RS6000 with the ch_p4 device, I got
    % mpirun -np 2 cpi 
    Could not load program /home/me/mpich/examples/basic/cpi  
    Could not load library libC.a[shr.o] 
    Error was: No such file or directory 
    

    A: This means that mpich was built with the xlC compiler but that some of the machines in your util/machines/machines.rs6000 file do not have xlC installed. Either install xlC or rebuild mpich to use another compiler (either xlc or gcc; gcc has the advantage of never having any licensing restrictions).





IBM SP



    1. Q: When starting my program on an IBM SP, I get this:
    $ mpirun -np 2 hello 
    ERROR: 0031-124  Couldn't allocate nodes for parallel execution.  Exiting ... 
    ERROR: 0031-603  Resource Manager allocation for task: 0, node: me1.myuniv.edu, rc = JM_PARTIONCREATIONFAILURE 
    ERROR: 0031-635  Non-zero status -1 returned from pm_mgr_init 
     
    

    A: This means that either mpirun is trying to start jobs on your SP in a way different than your installation supports or that there has been a failure in the IBM software that manages the parallel jobs (all of these error messages are from the IBM poe command that mpirun uses to start the MPI job). Contact your system administrator for help in fixing this situation. Your system administrator can use
    dsh -av "ps aux | egrep -i 'poe|pmd|jmd'" 
    
    from the control workstation to search for stray IBM POE jobs that can cause this behavior. The files /tmp/jmd_err on the individual nodes may also contain useful diagnostic information.


    2. Q: When trying to run on an IBM SP, I get the message from mpirun:

      ERROR: 0031-214  pmd: chdir </a/user/gamma/home/mpich/examples/basic> 
      ERROR: 0031-214  pmd: chdir </a/user/gamma/home/mpich/examples/basic> 
    

    A: These are messages from the IBM system, not from mpirun. They may be caused by an incompatibility between POE, the automounter (especially the AMD automounter) and the shell, especially if you are using a shell other than ksh. There is no good solution; IBM often recommends changing your shell to ksh!


    3. Q: When trying to run on an IBM SP, I get this message:

    ERROR: 0031-124  Less than 2 nodes available from pool 0 
    

    A: This means that the IBM POE/MPL system could not allocate the requested nodes when you tried to run your program; most likely, someone else was using the system. You can try to use the environment variables MP_RETRY and MP_RETRYCOUNT to cause the job to wait until the nodes become available. Use man poe to get more information.
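    As a sketch, the two variables could be set in a Bourne-style shell as shown below. The values are illustrative only, and the meanings given in the comments (seconds between attempts, number of attempts) are my understanding of the POE settings; confirm them with man poe on your system.

```shell
# Ask POE to retry node allocation instead of failing immediately.
# The values are illustrative, not recommendations.
MP_RETRY=60          # assumed: seconds to wait between attempts
MP_RETRYCOUNT=10     # assumed: number of attempts before giving up
export MP_RETRY MP_RETRYCOUNT

# Then start the job as usual, for example:
#   mpirun -np 2 hello
```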


    4. Q: When running on an IBM SP, my job generates the message

      Message number 0031-254 not found in Message Catalog. 
    
    and then dies.


    A: If your user name is eight characters long, you may be experiencing a bug in the IBM POE environment. The only fix at the time this was written was to use an account whose user name was seven characters or less. Ask your IBM representative about PMR 4017X (poe with userids of length eight fails) and the associated APAR IX56566.


