Programs fail at startup




General



    1. Q: On some systems, the program fails at startup with a message such as
        /lib/dld.sl: Bind-on-reference call failed
        /lib/dld.sl: Invalid argument

    (this example is from HP-UX), or
        ld.so: libc.so.2: not found

    (this example is from SunOS 4.1; similar things happen on other systems).


    A: Your program uses shared libraries, and those libraries are not available on some of the machines that you are running on. To fix this, relink your program without the shared libraries by adding the appropriate command-line option to the link step. For example, for the HP system that produced the errors above, the fix is to add -Wl,-Bimmediate to the link step. For Solaris, the appropriate option is -Bstatic.





Workstation Networks



    1. Q: I can run programs using a small number of processes, but once I ask for more than 4 to 8 processes, I do not get output from all of my processes, and the programs never finish.


    A: We have seen this problem with installations using AFS. The remote shell program, rsh, supplied with some AFS systems limits the number of jobs that can use standard output. This seems to prevent some of the processes from exiting as well, causing the job to hang. There are four possible fixes:

      1. Use a different rsh command. You can probably do this by putting the directory containing the non-AFS version first in your PATH. This option may not be available to you, depending on your system. At one site, the non-AFS version was in /bin/rsh.


      2. Use the secure server (serv_p4). See the discussion in the Users Guide.


      3. Redirect all standard output to a file. The MPE routine MPE_IO_Stdout_to_file may be used to do this.


      4. Get a fixed rsh command. The likely source of the problem is incorrect use of the select system call in the rsh command. If the code is doing something like

          int mask = 0;
          mask |= 1 << fd;      /* a 32-bit int cannot hold descriptors beyond 31 */
          select( fd+1, &mask, ... );

      instead of
          fd_set mask;
          FD_ZERO(&mask);       /* an fd_set must be cleared before use */
          FD_SET(fd,&mask);
          select( fd+1, &mask, ... );

      then the code is incorrect. The select call changed many years ago to allow more than 32 file descriptors, and the rsh program (or programmer!) has not changed with the times.

    For the fourth fix, what is needed is an AFS version of rsh that corrects this bug. As we are not running AFS ourselves, we do not know whether such a fix is available.
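
The select usage discussed in the fourth fix can be illustrated with a small, self-contained sketch. This is not code from rsh or from mpich; the function names wait_readable and demo_high_descriptor are hypothetical and exist only for this illustration. The sketch assumes a POSIX system where the per-process descriptor limit is above 40. It waits on a pipe whose read end has been moved to a descriptor numbered 40 or higher, which is exactly the case that an int bit mask cannot express.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fd becomes readable within
   timeout_sec seconds, 0 on timeout, and -1 on error. */
int wait_readable(int fd, int timeout_sec)
{
    fd_set readfds;
    struct timeval tv;
    int rc;

    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;                 /* an fd_set cannot represent this fd */

    FD_ZERO(&readfds);             /* clear the set before adding to it */
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;

    rc = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (rc < 0)
        return -1;
    return FD_ISSET(fd, &readfds) ? 1 : 0;
}

/* Hypothetical self-check: duplicate the read end of a pipe onto a
   descriptor >= 40, write one byte, and wait for it with select. */
int demo_high_descriptor(void)
{
    int p[2], high, ready;

    if (pipe(p) != 0)
        return -1;
    high = fcntl(p[0], F_DUPFD, 40);   /* lowest free descriptor >= 40 */
    if (high < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (write(p[1], "x", 1) != 1)
        ready = -1;
    else
        ready = wait_readable(high, 1);

    close(high);
    close(p[0]);
    close(p[1]);
    return ready;
}
```

Calling demo_high_descriptor() from a test program should report the descriptor as readable. The same wait written with an int mask has no bit available for descriptor 40, which is consistent with the hang described above once enough processes are started.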


    2. Q: Not all processes start.


    A: This can happen when using the ch_p4 device on a system that has extremely small limits on the number of remote shells you can have. Some systems using Kerberos (a network security package) allow only three or four remote shells; on these systems, the size of MPI_COMM_WORLD will be limited to the same number (plus one if you are using the local host).

    The only way around this is to try the secure server; this is documented in the mpich installation guide. Note that you will have to start the servers "by hand", since the chp4_servs script uses the remote shell to start the servers.


