Problems starting programs

General

     mpirun -np 2 cpi

A: On some systems such as IBM SPs, there are many mutually exclusive ways to run parallel programs; each site can pick the approach(es) that it allows. The script mpirun tries one of the more common methods, but may make the wrong choice. Use the -v or -t option to mpirun to see how it is trying to run the program, and then compare this with the site-specific instructions for using your system. You may need to adapt the code in mpirun to meet your needs. See also the next question.

2. Q: When trying to run a program with, e.g., mpirun -np 4 cpi, I get

    usage : mpirun [options] <executable> [<dstnodes>] [-- <args>]

    mpirun [options] <schema>

mpirun

mpich

    which mpirun

mpirun

mpich

mpirun

    alias mpirun /usr/local/mpich/bin/mpirun

mpirun

mpich

3. Q: When trying to start a large number of processes on a workstation network, I get the message

     p4_error: latest msg from perror: Too many open files

4. Q: When I try to run a program on more than one processor, I get an error message

     mpirun -np 2 cpi 
     /home/me/cpi: error in loading shared libraries: libcxa.so.1:  
     cannot open shared object file: No such file or directory

A: This means that some shared library used by the system cannot be found on remote processors. There are two possibilities:

Many linkers provide a way to specify the search path for shared libraries. The trick is to (a) pass this command to the linker program and (b) specify all of the libraries that are needed.

For example, on Linux systems, the linker command to specify the shared library search path is -rpath path, e.g., -rpath /usr/lib:/usr/local/lib. To pass this command to the linker through the Intel C compiler icc, the command -Qoption,link,-rpath,path is used. By default, the Linux linker looks in /usr/lib and the directories specified by the environment variable LD_LIBRARY_PATH. Thus, to force the linker to include the path to the shared library, you can use

   mpicc -o cpi cpi.c -Qoption,link,-rpath,$LD_LIBRARY_PATH:/usr/lib

LDFLAGS

mpicc

Unfortunately, each compiler has a different way of passing these arguments to the linker, and each linker has a different set of arguments for specifying the shared library search path. You will need to check the documentation for your system to find these options.
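
Before adjusting linker flags, it can help to confirm exactly which shared libraries the executable cannot resolve on a given machine. The following is a minimal sketch, assuming a Linux system where the ldd command is available; the helper name missing_libs is hypothetical and not part of mpich.

```shell
# List shared libraries that the dynamic loader cannot resolve.
# Reads ldd-style output on standard input and prints the name of each
# library reported as "not found" (missing_libs is a hypothetical helper).
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Typical use on the machine where the program fails to start:
#   ldd ./cpi | missing_libs
```

Running the same check on one of the remote machines shows whether the library is missing only there, which is the situation described above.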

cpilog

    ld.so.1: cpilog: fatal: libX11.so.4: can't open file: errno 2

configure

Alternately, consider adding these lines to your .login (assuming C shell):

    setenv OPENWINHOME /usr/openwin 
    setenv LD_LIBRARY_PATH /opt/SUNWspro/lib:/usr/openwin/lib

before

    if ($?USER == 0 || $?prompt == 0) exit
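
For users whose login shell is a Bourne-style shell rather than C shell, an equivalent sketch follows. It assumes the same Sun directories as the C shell lines above and would go in .profile, ahead of any test that ends the script early for non-interactive logins.

```shell
# Bourne-shell equivalent of the C shell setenv lines above. The paths
# are the Sun examples from the text; adjust them for your installation.
OPENWINHOME=/usr/openwin
LD_LIBRARY_PATH=/opt/SUNWspro/lib:/usr/openwin/lib
export OPENWINHOME LD_LIBRARY_PATH
```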

A: If you opened the file before calling MPI_INIT, the behavior of MPI (not just the mpich implementation of MPI) is undefined. In the ch_p4 device, only process zero (in MPI_COMM_WORLD) will have the file open; the other processes will not have opened the file. Move the operations that open files and interact with the outside world to after MPI_INIT (and before MPI_FINALIZE).

7. Q: Programs seem to take forever to start.

A: Several problems can cause this. On systems with dynamically-linked executables, the file system may be overwhelmed when many processors simultaneously request the dynamically-linked parts of the executable (this has been measured as a problem with some DFS implementations). You can try statically linking your application.

On workstation networks, long startup times can be due to the time used to start remote processes; for the ch_p4 device, see the discussion of the secure server in the section "Using the Secure Server", or consider using the ch_p4mpd device.

Workstation Networks

mpirun

Permission denied

The Most Common Problems

2. Q: When I use mpirun, I get the message Try again.

A: If you see something like this

    % mpirun -np 2 cpi  
    Try again.

rshd

The only fix for this is to have your system administrator look into the machine that is generating this message.

3. Q: When running the ch_p4 device, I get error messages of the form

    stty: TCGETS: Operation not supported on socket

    stty: tcgetattr: Permission denied

    stty: Can't assign requested address

.login

.cshrc

.profile

stty

tset

TERM

PROMPT

    if ($?TERM) then 
        eval `tset -s -e^\? -k^U -Q -I $TERM` 
    endif

    if ($?USER == 0 || $?prompt == 0) exit

.cshrc

after

LD_LIBRARY_PATH
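
For Bourne-style startup files such as .profile, a comparable guard is to run terminal commands only when standard input is a terminal. This is a minimal sketch; the function name setup_terminal and the particular stty settings are illustrative assumptions, not taken from mpich.

```shell
# Configure the terminal only when standard input is a terminal.
# When the shell is started remotely, standard input is not a terminal,
# so the stty call is skipped and prints no error message.
setup_terminal() {
    [ -t 0 ] || return 0            # not a terminal: nothing to do
    stty erase '^?' kill '^U'       # illustrative settings only
}

setup_terminal
```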

4. Q: When running the ch_p4 device and running either the tstmachines script to check the machines file or the mpich tests, I get messages about unexpected output or differences from the expected output. I also get extra output when I run programs. MPI programs do seem to work, however.

A: This means that one of your login startup scripts (i.e., .login and .cshrc or .profile or .bashrc) contains an unguarded use of some program that generates output, such as fortune or even echo. For C shell users, one typical fix is to check for the variables TERM or PROMPT to be initialized. For example,

    if ($?TERM) then 
        fortune 
    endif

    if ($?USER == 0 || $?prompt == 0) exit

.cshrc

after

LD_LIBRARY_PATH
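
Bourne-style shells (.profile, .bashrc) have no $?prompt test. One common equivalent is to inspect the shell's option flags in $-, which contain the letter i for interactive shells. A minimal sketch, in which the helper name is_interactive and the banner text are hypothetical:

```shell
# Succeed when the option string passed in (normally "$-") contains "i",
# which POSIX shells set for interactive sessions.
is_interactive() {
    case "$1" in
        *i*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Produce output only for interactive logins, so that commands started
# remotely by mpirun see no extra text.
if is_interactive "$-"; then
    echo "login banner goes here"
fi
```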

5. Q: Occasionally, programs fail with the message

poll: protocol failure during circuit creation

You may see this message if you attempt to run too many MPI programs in a short period of time. For example, on Linux when using the ch_p4 device (without the secure server or ssh), mpich may use rsh to start the MPI processes. Depending on the particular Linux distribution and version, there may be a limit of as few as 40 processes per minute. When running the mpich test suite or starting short parallel jobs from a script, it is possible to exceed this limit.

To fix this, you can do one of the following:

2. Modify /etc/inetd.conf to allow more processes per minute for rsh. For example, change

    shell stream tcp nowait root /etc/tcpd2 in.rshd

to

    shell stream tcp nowait.200 root /etc/tcpd2 in.rshd

ch_p4mpd

ch_p4

inetd
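
The change to the shell entry in /etc/inetd.conf described above can be sketched as a small filter. This is only an illustration: the function name raise_rsh_limit is hypothetical, the limit of 200 is the value from the example, and the sketch works on standard input and output rather than editing the live file.

```shell
# Rewrite the rsh ("shell") service entry so that inetd accepts up to 200
# connections per minute. Reads inetd.conf text on standard input and
# writes the result to standard output; all other lines pass through.
raise_rsh_limit() {
    sed -E 's/^(shell[[:space:]]+stream[[:space:]]+tcp[[:space:]]+)nowait([[:space:]])/\1nowait.200\2/'
}

# Demonstration on a sample line rather than the live file:
printf 'shell stream tcp nowait root /etc/tcpd2 in.rshd\n' | raise_rsh_limit
```

To apply it, one would run the filter over a copy of /etc/inetd.conf, review the result, and install it as root; inetd typically has to be told to reread its configuration afterwards.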

6. Q: When using mpirun I get strange output like

 arch: No such file or directory

.cshrc

    which hostname

.cshrc

7. Q: When I try to run my program, I get

p0_4652:  p4_error: open error on procgroup file (procgroup): 0

mpirun

ch_p4

ch_shmem

Try the following:

Run the program using mpirun and the -t argument:

    mpirun -t -np 1 foo

-t

-echo

mpirun -echo -np 1 foo

mpich

/usr/local/mpich/solaris/ch_p4

8. Q: When trying to run a program I get this message:

    icy%  mpirun -np 2 cpi -mpiversion 
    icy: icy: No such file or directory

/usr/lib/rsh

 which rsh 
 ls /usr/*/rsh

/usr/lib

/usr/ucb

/usr/bin

/usr/lib

Another choice would be to add a link in a directory earlier in the search path to the remote shell. For example, I have /home/gropp/bin/solaris early in my search path; I could use

     cd /home/gropp/bin/solaris 
     ln -s /usr/bin/rsh  rsh

/usr/bin/rsh

9. Q: When trying to run a program I get this message:

trying normal rsh

-l

mpich

-rshnol

mpich

ssh

10. Q: When I run my program, I get messages like

| ld.so: warning: /usr/lib/libc.so.1.8 has older revision than expected 9

One temporary fix that you can use is to add the link-time option to force static linking instead of dynamic linking for system libraries. For some Sun workstations, the option is -Bstatic.

11. Q: Programs never get started. Even tstmachines hangs.

A: Check first that rsh works at all. For example, if you have workstations w1 and w2, and you are running on w1, try

   rsh w2 true

   rsh w1 true

rsh

true

Permission denied

krcmd: No ticket file (tf_util) 
rsh: warning, using standard rsh: can't provide Kerberos auth data.

rsh

rsh/rshd
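
The manual rsh check above can be repeated over several machines with a short loop. This is a minimal sketch: the function name check_hosts and the RSH override variable are hypothetical conveniences, not mpich features.

```shell
# Try "rsh HOST true" for each host named on the command line and report
# the result per host. RSH is a hypothetical override so that ssh or a
# site-specific remote shell can be substituted for rsh.
check_hosts() {
    rsh_cmd=${RSH:-rsh}
    for host in "$@"; do
        if "$rsh_cmd" "$host" true </dev/null >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
        fi
    done
}

# Example: check_hosts w1 w2
```

Because results are printed one host at a time, the last line printed shows how far the check got if one of the machines hangs.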

12. Q: When running ch_p4 device, I get error messages of the form

    more slaves than message queues

mpich

    configure --with-device=ch_p4 -comm=shared

13. Q: My programs seem to hang in MPI_Init.

A: There are a number of ways that this can happen:

tstmachines

ch_p4

select

mpich

Another is if you use the library -ldxml (extended math library) on Compaq Alpha systems. This has been observed to cause MPI_Init to hang. No workaround is known at this time; contact Compaq for a fix if you need to use MPI and -ldxml together.

The root of this problem is that the ch_p4 device uses the signal SIGUSR1, and so any library that also uses this signal can interfere with the operation of mpich if it is using ch_p4. You can rebuild mpich to use a different signal by using the configure argument --with-device=ch_p4:-listener_sig=SIGNAL_NAME and remaking mpich.

ch_p4

    p0_2005:  p4_error: fork_p4: fork failed: -1 
              p4_error: latest msg from perror: Error 0

ch_p4

ch_tcp

size

swap -l

size

swap -l

A similar problem can happen on IBM SPs using the ch_mpl device; the cause is the same but it originates within the IBM MPL library.

15. Q: Sometimes, I get the error

    Exec format error. Wrong Architecture.

sync

mpirun

sync

16. Q: There seem to be two copies of my program running on each node. This doubles the memory requirement of my application. Is this normal?

A: Yes, this is normal. In the ch_p4 implementation, the second process is used to dynamically establish connections to other processes. With Version 1.1.1 of mpich, this functionality can be placed in a separate thread on many architectures, and this second process will not be seen. To enable this, use the option -p4_opts=-threaded_listener on the configure command line for mpich.

17. Q: MPI_Abort sometimes doesn't work in the ch_p4 device. Why?

A: Currently (Version 1.2.5) a process detects that another process has aborted only when it tries to send or receive a message, and the aborting process is one that it has communicated with in the past. Thus it is possible for a process busy with computation not to notice that one of its peers has issued an MPI_Abort, although for many common communication patterns this does not present a problem. This will be fixed in a future release.

IBM RS6000

ch_p4

% mpirun -np 2 cpi 
Could not load program /home/me/mpich/examples/basic/cpi  
Could not load library libC.a[shr.o] 
Error was: No such file or directory

mpich

xlC

util/machines/machines.rs6000

xlC

mpich

xlc

gcc

IBM SP

$ mpirun -np 2 hello 
ERROR: 0031-124  Couldn't allocate nodes for parallel execution.  Exiting ... 
ERROR: 0031-603  Resource Manager allocation for task: 0, node: me1.myuniv.edu, rc = JM_PARTIONCREATIONFAILURE 
ERROR: 0031-635  Non-zero status -1 returned from pm_mgr_init

mpirun

poe

mpirun

dsh -av "ps aux | egrep -i 'poe|pmd|jmd'"

/tmp/jmd_err

2. Q: When trying to run on an IBM SP, I get the message from mpirun:

  ERROR: 0031-214  pmd: chdir </a/user/gamma/home/mpich/examples/basic> 
  ERROR: 0031-214  pmd: chdir </a/user/gamma/home/mpich/examples/basic>

mpirun

ksh

3. Q: When trying to run on an IBM SP, I get this message:

ERROR: 0031-124  Less than 2 nodes available from pool 0

MP_RETRY

MP_RETRYCOUNT

man poe

4. Q: When running on an IBM SP, my job generates the message

  Message number 0031-254 not found in Message Catalog.

A: If your user name is eight characters long, you may be experiencing a bug in the IBM POE environment. The only fix at the time this was written was to use an account whose user name was seven characters or fewer. Ask your IBM representative about PMR 4017X (poe with userids of length eight fails) and the associated APAR IX56566.
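
A quick way to see whether an account might be affected is to check the length of its user name. The sketch below is only an illustration: the helper name is hypothetical, and treating every name longer than seven characters as suspect is an assumption, since the text confirms only the eight-character case.

```shell
# Succeed when the given user name is longer than seven characters.
# (Only the eight-character case is confirmed by the text; flagging
# longer names as well is an assumption.)
name_longer_than_seven() {
    [ "${#1}" -gt 7 ]
}

user=$(id -un)
if name_longer_than_seven "$user"; then
    echo "user name '$user' has ${#user} characters; see PMR 4017X"
else
    echo "user name '$user' is seven characters or fewer"
fi
```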
