mpirun -np 2 cpi
either I get an error message or the program hangs.
A:
On some systems such as IBM SPs, there are many mutually exclusive ways
to run parallel programs; each site can pick the approach(es) that it allows.
The script mpirun tries one of the more common methods, but may make
the wrong choice. Use the -v or -t option to mpirun to
see how it is trying to run the program, and then compare this with the
site-specific instructions for using your system. You may need to adapt the
code in mpirun to meet your needs. See also the next question.
2. Q:
When trying to run a program with, e.g., mpirun -np 4 cpi, I get
usage : mpirun [options] <executable> [<dstnodes>] [-- <args>]
or
mpirun [options] <schema>
A:
A different command named mpirun is being found in your path ahead of the mpich version. Run
which mpirun
to see which command named mpirun was actually found. The fix is to either change the order of directories in your path to put the mpich version of mpirun first, or to define an alias for mpirun that uses an absolute path. For example, in the csh shell, you might do
alias mpirun /usr/local/mpich/bin/mpirun
to set mpirun to the mpich version.
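When more than one mpirun is installed, it can help to see every copy on the search path rather than only the first. The following is a minimal Bourne-shell sketch; the function name list_in_path is made up for this illustration and is not part of mpich. It prints each executable with the given name in search order, so the first line is the one the shell would run.

```shell
#!/bin/sh
# list_in_path NAME [SEARCH_PATH]   (hypothetical helper, not an mpich tool)
# Print every executable file called NAME found in SEARCH_PATH
# (default: $PATH), in search order. The body runs in a subshell so the
# change to IFS does not leak into the caller.
list_in_path() (
    name=$1
    IFS=:
    for dir in ${2:-$PATH}; do
        # An empty entry in the search path means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
)

# Show all copies of mpirun on the current PATH (prints nothing if none).
list_in_path mpirun
```

If the mpich version is not on the first line of the output, reorder your path or define the alias described above.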
3. Q:
When I try to run a program on more than one processor, I get an error message
mpirun -np 2 cpi
/home/me/cpi: error in loading shared libraries: libcxa.so.1: cannot open shared object file: No such file or directory
There is no trouble running on one processor.
A:
This means that some shared library used by the system cannot be found on
remote processors. There are two possibilities:
Many linkers provide a way to specify the search path for shared libraries. The trick is to (a) pass this command to the linker program and (b) specify all of the libraries that are needed.
For example, on Linux systems, the linker command to specify the shared
library search path is -rpath path, e.g., -rpath
/usr/lib:/usr/local/lib. To pass this command to the linker
through the Intel C compiler icc, the command
-Qoption,link,-rpath,path is used.
By default, the Linux linker looks in /usr/lib and the directories
specified by the environment variable LD_LIBRARY_PATH.
Thus, to force the linker to include the path to the shared library, you can
use
mpicc -o cpi cpi -Qoption,link,-rpath,$LD_LIBRARY_PATH:/usr/lib
If this works, then consider editing the value of LDFLAGS in the compiler scripts (e.g., mpicc) to include this option.
Unfortunately, each compiler has a different way of passing these arguments to the linker, and each linker has a different set of arguments for specifying the shared library search path. You will need to check the documentation for your system to find these options.
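Before changing linker options, it can be useful to confirm which shared libraries are actually unresolved. On Linux, ldd lists the libraries a program needs and marks the ones it cannot locate with "not found". The sketch below assumes a Linux system with ldd and awk available; the helper names filter_missing and missing_libs are invented for this example.

```shell
#!/bin/sh
# filter_missing: read ldd output on standard input and print the names
# of the libraries that ldd reported as "not found".
filter_missing() {
    awk '/=> not found/ { print $1 }'
}

# missing_libs PROGRAM: list the unresolved shared libraries of PROGRAM.
missing_libs() {
    ldd "$1" | filter_missing
}

# Demo on a program present on most systems; it prints nothing when
# every library resolves. Skipped if ldd is not installed.
if command -v ldd >/dev/null 2>&1; then
    missing_libs /bin/sh
fi
```

Since the failure in this question shows up only on remote processors, run the check there as well, for example by logging in to one of the remote machines and running missing_libs on the same executable path.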
ld.so.1: cpilog: fatal: libX11.so.4: can't open file: errno 2
Alternately, consider adding these lines to your .login (assuming C
shell):
setenv OPENWINHOME /usr/openwin
setenv LD_LIBRARY_PATH /opt/SUNWspro/lib:/usr/openwin/lib
(you may want to check with your system administrator first to make sure that the paths are correct for your system). Make sure that you add them before any line like
if ($?USER == 0 || $?prompt == 0) exit
A:
If you opened the file before calling MPI_INIT, the behavior
of MPI (not just the mpich implementation of MPI) is undefined. In the
ch_p4 device, only
process zero (in MPI_COMM_WORLD) will have the file open; the other
processes will not have opened the file. Move the operations that open files
and interact with the outside world to after MPI_INIT (and before
MPI_FINALIZE).
6. Q:
Programs seem to take forever to start.
A:
This can be caused by any of several problems. On systems with
dynamically-linked executables, this can be caused by problems with the
file system suddenly getting requests from many processors for the
dynamically-linked parts of the executable (this has been measured as a
problem with some DFS implementations). You can try statically linking
your application.
A:
There are a number of ways that this can happen:
Another is if you use the library -ldxml (extended math library) on Compaq Alpha systems. This has been observed to cause MPI_Init to hang. No workaround is known at this time; contact Compaq for a fix if you need to use MPI and -ldxml together.
The root of this problem is that the ch_p4 device uses the signal SIGUSR1, and so any library that also uses this signal can interfere with the operation of mpich if it is using ch_p4. You can rebuild mpich to use a different signal by using the configure argument --with-device=ch_p4:-listener_sig=SIGNAL_NAME and remaking mpich.
Exec format error. Wrong Architecture.
A:
Yes, this is normal. In the ch_p4 implementation, the second process
is used to dynamically establish connections to other processes. With Version
1.1.1 of mpich, this functionality can be placed in a separate thread on many
architectures, and this second process will not be seen. To enable this, use
the option
-p4_opts=-threaded_listener on the configure command line for
mpich.
A:
Currently (Version 1.2.5) a process detects that another
process has aborted
only when it tries to send or receive a message, and the aborting
process is one that
it has communicated with in the past. Thus it is possible for a process busy
with computation not to notice that one of its peers has issued an
MPI_Abort, although for many common communication patterns this does
not present a problem. This will be fixed in a future release.
% mpirun -np 2 cpi
Could not load program /home/me/mpich/examples/basic/cpi
Could not load library libC.a[shr.o]
Error was: No such file or directory
$ mpirun -np 2 hello
ERROR: 0031-124 Couldn't allocate nodes for parallel execution. Exiting ...
ERROR: 0031-603 Resource Manager allocation for task: 0, node: me1.myuniv.edu, rc = JM_PARTIONCREATIONFAILURE
ERROR: 0031-635 Non-zero status -1 returned from pm_mgr_init
dsh -av "ps aux | egrep -i 'poe|pmd|jmd'"
from the control workstation to search for stray IBM POE jobs that can cause this behavior. The files /tmp/jmd_err on the individual nodes may also contain useful diagnostic information.
2. Q:
When trying to run on an IBM SP, I get the message from
mpirun:
ERROR: 0031-214 pmd: chdir </a/user/gamma/home/mpich/examples/basic>
ERROR: 0031-214 pmd: chdir </a/user/gamma/home/mpich/examples/basic>
3. Q:
When trying to run on an IBM SP, I get this message:
ERROR: 0031-124 Less than 2 nodes available from pool 0
4. Q:
When running on an IBM SP, my job generates the message
Message number 0031-254 not found in Message Catalog.
and then dies.
A:
If your user name is eight characters long, you may be experiencing a bug in
the IBM POE environment.
The only fix at the time this was written was to use
an account whose user name was seven characters or less.
Ask your IBM representative about PMR 4017X (poe with userids of length eight
fails) and the associated APAR IX56566.
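A quick way to check whether an account falls under this restriction is to measure the length of the user name. The following Bourne-shell sketch uses a made-up helper, poe_username_ok, that succeeds for names of seven characters or fewer. The answer above only reports failures for names of exactly eight characters; treating anything longer than seven as suspect is an assumption of this example.

```shell
#!/bin/sh
# poe_username_ok NAME   (hypothetical helper for illustration)
# Succeeds (exit status 0) when NAME has seven characters or fewer,
# the length that the answer above describes as the working fix.
poe_username_ok() {
    [ "${#1}" -le 7 ]
}

# Check the name of the user running this script.
user=$(id -un)
if poe_username_ok "$user"; then
    echo "user name '$user' (${#user} characters) is within the seven-character limit"
else
    echo "user name '$user' (${#user} characters) is longer than seven characters"
fi
```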