mpirun -np 2 cpi
either I get an error message or the program hangs.
A:
On some systems such as IBM SPs, there are many mutually exclusive ways
to run parallel programs; each site can pick the approach(es) that it allows.
The script mpirun tries one of the more common methods, but may make
the wrong choice. Use the -v or -t option to mpirun to
see how it is trying to run the program, and then compare this with the
site-specific instructions for using your system. You may need to adapt the
code in mpirun to meet your needs. See also the next question.
2. Q:
When trying to run a program with, e.g., mpirun -np 4 cpi, I get
usage : mpirun [options] <executable> [<dstnodes>] [-- <args>]
or
mpirun [options] <schema>
A:
Use the command
which mpirun
to see which command named mpirun was actually found. The fix is to either change the order of directories in your path to put the mpich version of mpirun first, or to define an alias for mpirun that uses an absolute path. For example, in the csh shell, you might do
alias mpirun /usr/local/mpich/bin/mpirun
to set mpirun to the mpich version.
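When several directories on the search path contain a command named mpirun, it can help to list all of them in search order rather than only the first one found. The following is a minimal sketch for a Bourne-style shell (not the C shell); the function name find_all_in_path is hypothetical and is not part of mpich.

```shell
#!/bin/sh
# Hypothetical helper (not part of mpich): print every executable named $1
# found on the colon-separated search path $2 (default: $PATH), in search
# order. The first line printed is the command the shell would actually run.
find_all_in_path() {
    name=$1
    search_path=${2:-$PATH}
    old_ifs=$IFS
    IFS=:
    for dir in $search_path; do
        # An empty path element means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
    IFS=$old_ifs
}

# Example use:
#   find_all_in_path mpirun
```

If the mpich version is not on the first line of the output, reorder the path or define the alias as described above.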
3. Q:
When trying to start a large number of processes on a workstation network,
I get the message
p4_error: latest msg from perror: Too many open files
4. Q:
When I try to run a program on more than one processor, I get an error message
mpirun -np 2 cpi
/home/me/cpi: error in loading shared libraries: libcxa.so.1: cannot open shared object file: No such file or directory
There is no trouble running on one processor.
A:
This means that some shared library used by the system cannot be found on
remote processors. There are two possibilities:
Many linkers provide a way to specify the search path for shared libraries. The trick is to (a) pass this command to the linker program and (b) specify all of the libraries that are needed.
For example, on Linux systems, the linker command to specify the shared
library search path is -rpath path, e.g., -rpath
/usr/lib:/usr/local/lib. To pass this command to the linker
through the Intel C compiler icc, the command
-Qoption,link,-rpath,path is used.
By default, the Linux linker looks in /usr/lib and the directories
specified by the environment variable LD_LIBRARY_PATH.
Thus, to force the linker to include the path to the shared library, you can
use
mpicc -o cpi cpi.c -Qoption,link,-rpath,$LD_LIBRARY_PATH:/usr/lib
If this works, then consider editing the value of LDFLAGS in the compiler scripts (e.g., mpicc) to include this option.
Unfortunately, each compiler has a different way of passing these arguments to the linker, and each linker has a different set of arguments for specifying the shared library search path. You will need to check the documentation for your system to find these options.
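Before changing link options, it can be useful to confirm which shared libraries are actually unresolved on a remote machine. The following is a minimal sketch, assuming a Linux system where ldd is available and marks unresolved libraries with the text "not found"; the function name report_missing is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read ldd output on standard input and print the name
# of each shared library that ldd reports as "not found".
report_missing() {
    awk '/=> not found/ { print $1 }'
}

# Example use, locally and then on a remote workstation w2 (assumes rsh
# access and that /home/me/cpi is visible at the same path on w2):
#   ldd /home/me/cpi | report_missing
#   rsh w2 ldd /home/me/cpi | report_missing
```

A library that is listed only in the remote run is one whose directory needs to be added to the search path used at link time.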
ld.so.1: cpilog: fatal: libX11.so.4: can't open file: errno 2
Alternately, consider adding these lines to your .login (assuming C
shell):
setenv OPENWINHOME /usr/openwin
setenv LD_LIBRARY_PATH /opt/SUNWspro/lib:/usr/openwin/lib
(you may want to check with your system administrator first to make sure that the paths are correct for your system). Make sure that you add them before any line like
if ($?USER == 0 || $?prompt == 0) exit
A:
If you opened the file before calling MPI_INIT, the behavior
of MPI (not just the mpich implementation of MPI) is undefined. In the
ch_p4 device, only
process zero (in MPI_COMM_WORLD) will have the file open; the other
processes will not have opened the file. Move the operations that open files
and interact with the outside world to after MPI_INIT (and before
MPI_FINALIZE).
7. Q:
Programs seem to take forever to start.
A:
This can be caused by any of several problems. On systems with
dynamically-linked executables, this can be caused by problems with the
file system suddenly getting requests from many processors for the
dynamically-linked parts of the executable (this has been measured as a
problem with some DFS implementations). You can try statically linking
your application.
On workstation networks, long startup times can be due to the time used to start remote processes; see the discussion of the secure server in the section ``Using the Secure Server'' for the ch_p4 device, or consider using the ch_p4mpd device.
2. Q:
When I use mpirun, I get the message Try again.
A:
If you see something like this
% mpirun -np 2 cpi
Try again.
it means that you were unable to start a remote job with the remote shell command on some machine, even though you would normally be able to. This may mean that the destination machine is very busy, out of memory, or out of processes. The man page for rshd may give you more information.
The only fix for this is to have your system administrator look into the machine that is generating this message.
3. Q:
When running the ch_p4 device, I get
error messages of the form
stty: TCGETS: Operation not supported on socket
or
stty: tcgetattr: Permission denied
or
stty: Can't assign requested address
A:
These messages usually mean that one of your login startup scripts (e.g., .login or .cshrc) runs stty or tset without first checking that the shell is attached to a terminal. For C shell users, one typical fix is to guard such commands. For example,
if ($?TERM) then
    eval `tset -s -e^\? -k^U -Q -I $TERM`
endif
Another solution is to see if it is appropriate to add
if ($?USER == 0 || $?prompt == 0) exit
near the top of your .cshrc file (but after any code that sets up the runtime environment, such as library paths (e.g., LD_LIBRARY_PATH)).
4. Q:
When using the ch_p4 device and running either
the tstmachines script to check the machines file or the mpich tests, I
get messages about unexpected output or differences from the expected output.
I also get extra output when I run programs. MPI programs do seem to work,
however.
A:
This means that one of your login startup scripts (i.e., .login and
.cshrc or .profile or .bashrc) contains an unguarded use
of some program that generates output, such as fortune or even echo.
For C shell users, one typical fix is to check that the variable TERM
or prompt is set. For example,
if ($?TERM) then
    fortune
endif
Another solution is to see if it is appropriate to add
if ($?USER == 0 || $?prompt == 0) exit
near the top of your .cshrc file (but after any code that sets up the runtime environment, such as library paths (e.g., LD_LIBRARY_PATH)).
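The examples above are for the C shell. For users whose startup file is .profile or .bashrc, a comparable guard can be sketched as follows. This sketch is not taken from the mpich distribution: it assumes a Bourne-style shell, treats the shell as interactive when its option flags ($-) contain the letter i, and uses a hypothetical function name, is_interactive.

```shell
# Hypothetical guard for ~/.profile or ~/.bashrc (Bourne-style shells).
# A shell's option flags ($-) contain "i" when the shell is interactive.
# The optional argument lets the check be tried with an explicit flag string.
is_interactive() {
    case ${1:-$-} in
        *i*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Only produce output when a person is at the terminal; a shell started by
# rsh to launch an MPI process is not interactive and stays silent.
if is_interactive; then
    echo "Welcome back, ${USER:-user}"
fi
```

As with the C shell version, keep any settings that the remote processes need (such as LD_LIBRARY_PATH) outside the guarded part.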
5. Q:
Occasionally, programs fail with the message
poll: protocol failure during circuit creation
A:
You may see this message if you attempt to run too many MPI programs in a short period of time. For example, on Linux, when using the ch_p4 device (without the secure server or ssh), mpich may use rsh to start the MPI processes. Depending on the particular Linux distribution and version, there may be a limit of as few as 40 processes per minute. When running the mpich test suite or starting short parallel jobs from a script, it is possible to exceed this limit.
To fix this, you can do one of the following:
2. Modify /etc/inetd.conf to allow more processes per minute for
rsh. For example, change
shell stream tcp nowait root /etc/tcpd2 in.rshd
to
shell stream tcp nowait.200 root /etc/tcpd2 in.rshd
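When jobs are launched from a script, a pause between launches is another way to stay under such a limit. The sketch below computes a conservative pause. It assumes at most one rsh connection per MPI process and uses the figure of 40 per minute mentioned above only as an example; the function name min_gap_seconds is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: smallest whole number of seconds to wait between
# successive launches so that jobs of $2 processes each stay within a limit
# of $1 rsh connections per minute (assumes at most one connection per process).
min_gap_seconds() {
    limit=$1
    nprocs=$2
    # Ceiling of 60 * nprocs / limit using integer arithmetic.
    echo $(( (60 * nprocs + limit - 1) / limit ))
}

# Example use in a script that runs several short jobs (program names are
# placeholders):
#   gap=$(min_gap_seconds 40 4)
#   for prog in test1 test2 test3; do
#       mpirun -np 4 "$prog"
#       sleep "$gap"
#   done
```

The pause ignores the time each job spends running, so it errs on the side of waiting longer than strictly necessary.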
6. Q:
When using mpirun I get strange output like
arch: No such file or directory
which hostname
If you see the same strange output, then your problem is in your .cshrc file. You may have some code in your .cshrc file that assumes that your shell is connected to a terminal.
7. Q:
When I try to run my program, I get
p0_4652: p4_error: open error on procgroup file (procgroup): 0
A:
Try the following:
Run the program using mpirun and the -t argument:
mpirun -t -np 1 foo
This should show what mpirun would do (-t is for testing). Or you can use the -echo argument to see exactly what mpirun is doing:
mpirun -echo -np 1 foo
Depending on the choice made by the installer of mpich, you should select the device-specific version of mpirun over a ``generic'' version. We recommend that the installation prefix include the device name, for example, /usr/local/mpich/solaris/ch_p4.
8. Q:
When trying to run a program I get this message:
icy% mpirun -np 2 cpi -mpiversion
icy: icy: No such file or directory
which rsh
ls /usr/*/rsh
You probably have /usr/lib in your path ahead of /usr/ucb or /usr/bin. This picks the `restricted' shell instead of the `remote' shell. The easiest fix is to just remove /usr/lib from your path (few people need it); alternately, you can move it to after the directory that contains the `remote' shell rsh.
Another choice would be to add a link in a directory earlier in the search
path to the remote shell. For example, I have /home/gropp/bin/solaris
early in my search path; I could use
cd /home/gropp/bin/solaris
ln -s /usr/bin/rsh rsh
there (assuming that /usr/bin/rsh is the remote shell).
9. Q:
When trying to run a program I get this message:
trying normal rsh
10. Q:
When I run my program, I get messages like
ld.so: warning: /usr/lib/libc.so.1.8 has older revision than expected 9
One temporary fix that you can use is to add the link-time option to force static linking instead of dynamic linking for system libraries. For some Sun workstations, the option is -Bstatic.
11. Q:
Programs never get started. Even tstmachines hangs.
A:
Check first that rsh works at all. For example, if
you have workstations w1 and w2, and you are running on
w1, try
rsh w2 true
This should complete quickly. If it does not, try
rsh w1 true
(that is, use rsh to run true on the system that you are running on). If you get Permission denied, see the help on that. If you get
krcmd: No ticket file (tf_util)
rsh: warning, using standard rsh: can't provide Kerberos auth data.
then your system has a faulty installation of rsh. Some FreeBSD systems have been observed with this problem. Have your system administrator correct the problem (often one of an inconsistent set of rsh/rshd programs).
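With more than a few workstations, the same check can be repeated for every host in the machines file. The following is a minimal sketch for a Bourne-style shell. It assumes one host name per line, with "#" starting a comment and an optional ":N" processor count after the name; the function name check_rsh_hosts and the RSH override variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: try "rsh HOST true" for every host listed in the
# machines file $1 and report which hosts respond. The remote shell command
# can be overridden through the RSH variable (default: rsh).
check_rsh_hosts() {
    machines_file=$1
    rsh_cmd=${RSH:-rsh}
    # Drop comments; assume an optional ":N" processor count after the
    # host name and strip it.
    sed -e 's/#.*//' -e 's/:.*//' "$machines_file" |
    while read -r host _rest; do
        # Skip blank lines.
        [ -n "$host" ] || continue
        # Redirect stdin so the remote shell cannot consume the host list.
        if "$rsh_cmd" "$host" true </dev/null >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
        fi
    done
}

# Example use (the file name is an assumption; use your own machines file):
#   check_rsh_hosts machines.LINUX
```

The sketch has no timeout, so if rsh hangs rather than fails, the loop stops at that host, which itself identifies the machine to investigate.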
12. Q:
When running ch_p4 device, I get error messages of the form
more slaves than message queues
configure --with-device=ch_p4 -comm=shared
which will reconfigure the p4 device so that multiple processes can share memory on each host. The reason this is not the default is that with this configuration you will see busy waiting on each workstation, as the device goes back and forth between selecting on a socket and checking the internal shared-memory queue.
13. Q:
My programs seem to hang in MPI_Init.
A:
There are a number of ways that this can happen:
Another is if you use the library -ldxml (extended math library) on Compaq Alpha systems. This has been observed to cause MPI_Init to hang. No workaround is known at this time; contact Compaq for a fix if you need to use MPI and -ldxml together.
The root of this problem is that the ch_p4 device uses SIGUSR1, and so any library that also uses this signal can interfere with the operation of mpich if it is using ch_p4. You can rebuild mpich to use a different signal by using the configure argument --with-device=ch_p4:-listener_sig=SIGNAL_NAME and remaking mpich.
p0_2005: p4_error: fork_p4: fork failed: -1
p4_error: latest msg from perror: Error 0
A similar problem can happen on IBM SPs using the ch_mpl device; the cause is the same but it originates within the IBM MPL library.
15. Q:
Sometimes, I get the error
Exec format error. Wrong Architecture.
16. Q:
There seem to be two copies of my program running on each node. This doubles
the memory requirement of my application. Is this normal?
A:
Yes, this is normal. In the ch_p4 implementation, the second process
is used to dynamically establish connections to other processes. With Version
1.1.1 of mpich, this functionality can be placed in a separate thread on many
architectures, and this second process will not be seen. To enable this, use
the option
-p4_opts=-threaded_listener on the configure command line for
mpich.
17. Q:
MPI_Abort sometimes doesn't work in the ch_p4 device. Why?
A:
Currently (Version 1.2.5) a process detects that another
process has aborted
only when it tries to send or receive a message, and the aborting
process is one that
it has communicated with in the past. Thus it is possible for a process busy
with computation not to notice that one of its peers has issued an
MPI_Abort, although for many common communication patterns this does
not present a problem. This will be fixed in a future release.
% mpirun -np 2 cpi
Could not load program /home/me/mpich/examples/basic/cpi
Could not load library libC.a[shr.o]
Error was: No such file or directory
$ mpirun -np 2 hello
ERROR: 0031-124 Couldn't allocate nodes for parallel execution. Exiting ...
ERROR: 0031-603 Resource Manager allocation for task: 0, node: me1.myuniv.edu, rc = JM_PARTIONCREATIONFAILURE
ERROR: 0031-635 Non-zero status -1 returned from pm_mgr_init
dsh -av "ps aux | egrep -i 'poe|pmd|jmd'"
from the control workstation to search for stray IBM POE jobs that can cause this behavior. The files /tmp/jmd_err on the individual nodes may also contain useful diagnostic information.
2. Q:
When trying to run on an IBM SP, I get the message from
mpirun:
ERROR: 0031-214 pmd: chdir </a/user/gamma/home/mpich/examples/basic>
ERROR: 0031-214 pmd: chdir </a/user/gamma/home/mpich/examples/basic>
3. Q:
When trying to run on an IBM SP, I get this message:
ERROR: 0031-124 Less than 2 nodes available from pool 0
4. Q:
When running on an IBM SP, my job generates the message
Message number 0031-254 not found in Message Catalog.
and then dies.
A:
If your user name is eight characters long, you may be experiencing a bug in
the IBM POE environment.
The only fix at the time this was written was to use
an account whose user name was seven characters or less.
Ask your IBM representative about PMR 4017X (poe with userids of length eight
fails) and the associated APAR IX56566.
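Whether an account is affected can be checked quickly from the shell. The following is a minimal sketch for a Bourne-style shell; the function name name_may_trigger_poe_bug is hypothetical. The text above describes the eight-character case, and treating longer names the same way is an assumption of this sketch.

```shell
#!/bin/sh
# Hypothetical check: succeed (exit status 0) when the given user name has
# eight or more characters. With no argument, the current user name as
# reported by "id -un" is used.
name_may_trigger_poe_bug() {
    name=${1:-$(id -un)}
    [ "${#name}" -ge 8 ]
}

# Example use:
#   if name_may_trigger_poe_bug; then
#       echo "User name is eight or more characters long"
#   fi
```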