Notes on getting MPICH running under Linux



Introduction

The purpose of this document is to describe the steps necessary to allow MPICH processes to be started and to communicate with one another. The installation we focus on is a RedHat 7.2 installation with medium security (the default). Other distributions will certainly differ, but this installation is a good example of the sorts of problems one might run across.

Three methods are typically used to start MPICH processes on clusters today: rsh, ssh, and mpd.

We first describe getting the rsh service working. We include rlogin in this process because it is helpful for testing. Next we describe getting ssh working and enabling ssh-agent so that logins do not require typing a password. Finally we discuss issues related to MPICH process communication and firewalls.

Enabling rsh

By default the rsh server is not installed, but it is required if MPICH processes are to be started with rsh. The rsh server, in.rshd, is part of the rsh-server RPM, which is located on the first disc of the RedHat 7.2 distribution. The rlogin server, in.rlogind, is included in the same package.

The xinetd server controls the availability of the rsh and rlogin services. The xinetd server itself is installed by default, but the rsh and rlogin services are disabled. To enable them, edit the files /etc/xinetd.d/rsh and /etc/xinetd.d/rlogin. Here is the rsh file as it looks by default:


# default: on 
# description: The rshd server is the server for the rcmd(3) routine and, \ 
#       consequently, for the rsh(1) program.  The server provides \ 
#       remote execution facilities with authentication based on \ 
#       privileged port numbers from trusted hosts. 
service shell 
{ 
        socket_type             = stream 
        wait                    = no 
        user                    = root 
        log_on_success          += USERID 
        log_on_failure          += USERID 
        server                  = /usr/sbin/in.rshd 
        disable                 = yes 
} 
Enable the service by changing "disable = yes" to "disable = no". Make the same change in /etc/xinetd.d/rlogin to enable that service.
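On a cluster with many nodes, this one-word edit is easy to script. The following Python sketch is one way to do it and is not part of RedHat's tooling. The function name enable_xinetd_service is hypothetical. The sketch assumes you read the service file yourself (for example /etc/xinetd.d/rsh, as root) and write the returned text back. xinetd still has to be restarted afterwards.

```python
import re

# Hypothetical helper: matches an xinetd attribute line such as
# "        disable                 = yes" and keeps everything before "yes"
# so that the file's original alignment is preserved.
_DISABLE_YES = re.compile(r"^(\s*disable\s*=\s*)yes\s*$")


def enable_xinetd_service(config_text: str) -> str:
    """Return config_text with each 'disable = yes' line changed to 'disable = no'."""
    out = []
    for line in config_text.splitlines():
        match = _DISABLE_YES.match(line)
        out.append(match.group(1) + "no" if match else line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = (
        "service shell\n"
        "{\n"
        "        server                  = /usr/sbin/in.rshd\n"
        "        disable                 = yes\n"
        "}\n"
    )
    print(enable_xinetd_service(sample), end="")
```

Comment lines such as "# default: on" are left alone, because the pattern only matches lines that begin with the disable attribute.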

At this point the xinetd daemon must be restarted to register these changes:


/etc/rc.d/init.d/xinetd restart 
At this point, a command such as "rsh localhost hostname" should reach the server but fail with "Permission denied." (as a non-root user, or as root for that matter), because no hosts are trusted yet.

To allow users to rsh without passwords, edit /etc/hosts.equiv, the system-wide trusted-host file for rsh and rlogin. This file should list the hostnames of the machines from which you want users to be able to start MPICH processes. For example, simply adding the line:


localhost.localdomain 
should allow users to run "rsh localhost hostname" successfully. Likewise, adding other hostnames (one per line) allows users on those hosts to rsh to this host.
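When many nodes must trust one another, the entries can be added programmatically. The Python sketch below is simplified and assumes one hostname per line. It compares only the first field of each line and does not interpret the "+"/"-" forms, netgroups, or user-name fields that hosts.equiv also supports. The function name add_trusted_host and the host name node2 are illustrative.

```python
def add_trusted_host(equiv_text: str, hostname: str) -> str:
    """Return hosts.equiv content with hostname appended, unless it is already listed.

    Simplified: compares only the first field of each line, case-insensitively.
    """
    listed = {line.split()[0].lower() for line in equiv_text.splitlines() if line.split()}
    if hostname.lower() in listed:
        return equiv_text
    if equiv_text and not equiv_text.endswith("\n"):
        equiv_text += "\n"
    return equiv_text + hostname + "\n"


if __name__ == "__main__":
    text = add_trusted_host("", "localhost.localdomain")
    text = add_trusted_host(text, "node2")  # "node2" is a placeholder hostname
    print(text, end="")
```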

However, there is another catch! By default (with medium security), packet filtering is also enabled, and it prevents users on remote hosts from connecting to this machine with rsh or rlogin. This packet filter, or firewall, is administered with the ipchains package, which is installed by default.

The firewall configuration is written by a program called lokkit, presumably at installation time. It is stored in /etc/sysconfig/ipchains and by default looks like this:


# Firewall configuration written by lokkit 
# Manual customization of this file is not recommended. 
# Note: ifup-post will punch the current nameservers through the 
#       firewall; such entries will *not* be listed here. 
:input ACCEPT 
:forward ACCEPT 
:output ACCEPT 
-A input -s 0/0 67:68 -d 0/0 67:68 -p udp -i eth0 -j ACCEPT 
-A input -s 0/0 67:68 -d 0/0 67:68 -p udp -i eth1 -j ACCEPT 
-A input -s 0/0 -d 0/0 -i lo -j ACCEPT 
-A input -p tcp -s 0/0 -d 0/0 0:1023 -y -j REJECT 
-A input -p tcp -s 0/0 -d 0/0 2049 -y -j REJECT 
-A input -p udp -s 0/0 -d 0/0 0:1023 -j REJECT 
-A input -p udp -s 0/0 -d 0/0 2049 -j REJECT 
-A input -p tcp -s 0/0 -d 0/0 6000:6009 -y -j REJECT 
-A input -p tcp -s 0/0 -d 0/0 7100 -y -j REJECT 
While an in-depth discussion of ipchains rules is outside the scope of this document, it is worth explaining briefly how they work. Rules are applied in order from the top of the list to the bottom, and the first rule that matches a packet decides what happens to it. The argument to -j says what to do with a matching packet. In this file it is either ACCEPT (let the packet in) or REJECT (discard it and send an error back to the sender). If a packet makes it through the entire list without matching, the default policy is applied. Here the default policy for incoming packets, set by the line ":input ACCEPT", is ACCEPT.
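The first-match behaviour can be made concrete with a toy model. This Python sketch is not how ipchains is implemented. It parses only -p, -i, -y, -j and the destination port after -d. It ignores addresses, source ports, and -b. It treats every packet as arriving on eth0 unless told otherwise. The names Rule, parse_rule, decide and DEFAULT_RULES are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    proto: Optional[str]    # "tcp"/"udp", or None for any protocol
    ports: Optional[range]  # destination ports, or None for any port
    iface: Optional[str]    # value of -i, or None for any interface
    syn_only: bool          # -y: match only TCP packets that open a connection
    target: str             # value of -j, e.g. "ACCEPT" or "REJECT"


def parse_rule(line: str) -> Rule:
    """Parse a simplified '-A input ...' line from /etc/sysconfig/ipchains."""
    tok = line.split()
    proto = iface = None
    ports = None
    syn_only = False
    target = ""
    i = 0
    while i < len(tok):
        flag = tok[i]
        if flag in ("-p", "-i", "-j"):
            value = tok[i + 1]
            if flag == "-p":
                proto = value
            elif flag == "-i":
                iface = value
            else:
                target = value
            i += 2
        elif flag in ("-s", "-d"):
            i += 2  # skip the address (always 0/0 in this file)
            if i < len(tok) and not tok[i].startswith("-"):
                if flag == "-d":  # optional port or port range "lo:hi"
                    lo, _, hi = tok[i].partition(":")
                    ports = range(int(lo), int(hi or lo) + 1)
                i += 1  # a source port after -s is skipped
        elif flag == "-y":
            syn_only = True
            i += 1
        else:
            i += 1  # "-A", "input", and flags this sketch ignores (e.g. -b)
    return Rule(proto, ports, iface, syn_only, target)


def decide(rules: List[Rule], policy: str, proto: str, dport: int,
           iface: str = "eth0", syn: bool = True) -> str:
    """Return the target of the first matching rule, else the default policy."""
    for r in rules:
        if r.proto is not None and r.proto != proto:
            continue
        if r.ports is not None and dport not in r.ports:
            continue
        if r.iface is not None and r.iface != iface:
            continue
        if r.syn_only and not (proto == "tcp" and syn):
            continue
        return r.target
    return policy


DEFAULT_RULES = [parse_rule(line) for line in """\
-A input -s 0/0 67:68 -d 0/0 67:68 -p udp -i eth0 -j ACCEPT
-A input -s 0/0 67:68 -d 0/0 67:68 -p udp -i eth1 -j ACCEPT
-A input -s 0/0 -d 0/0 -i lo -j ACCEPT
-A input -p tcp -s 0/0 -d 0/0 0:1023 -y -j REJECT
-A input -p tcp -s 0/0 -d 0/0 2049 -y -j REJECT
-A input -p udp -s 0/0 -d 0/0 0:1023 -j REJECT
-A input -p udp -s 0/0 -d 0/0 2049 -j REJECT
-A input -p tcp -s 0/0 -d 0/0 6000:6009 -y -j REJECT
-A input -p tcp -s 0/0 -d 0/0 7100 -y -j REJECT
""".splitlines()]

if __name__ == "__main__":
    print(decide(DEFAULT_RULES, "ACCEPT", "tcp", 514))              # new rsh connection from eth0
    print(decide(DEFAULT_RULES, "ACCEPT", "tcp", 514, iface="lo"))  # same port via loopback
```

Passing syn=False models a packet that belongs to an already established connection. The -y rules let such packets through.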

The following line tells the packet filter to allow all localhost (-i lo) traffic to pass unmolested:


-A input -s 0/0 -d 0/0 -i lo -j ACCEPT 
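This line rejects new TCP connections to ports 0-1023 (the -y flag matches only the packets that open a connection). This is the privileged range where most standard services listen, including rlogin (port 513) and rsh (port 514):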


-A input -p tcp -s 0/0 -d 0/0 0:1023 -y -j REJECT 
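We're going to modify this file to allow rsh and rlogin traffic. Because the first matching rule wins, the new rules must go before the 0:1023 REJECT rule. The -b flag makes a rule bidirectional, so each new rule matches traffic both to and from the given port: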


# Firewall configuration written by lokkit 
# Manual customization of this file is not recommended. 
# Note: ifup-post will punch the current nameservers through the 
#       firewall; such entries will *not* be listed here. 
:input ACCEPT 
:forward ACCEPT 
:output ACCEPT 
-A input -s 0/0 67:68 -d 0/0 67:68 -p udp -i eth0 -j ACCEPT 
-A input -s 0/0 67:68 -d 0/0 67:68 -p udp -i eth1 -j ACCEPT 
-A input -s 0/0 -d 0/0 -i lo -j ACCEPT 
# 
# New rules for rlogin/rsh traffic, incoming or outgoing 
# 
-A input -p tcp -s 0/0 -d 0/0 513 -b -j ACCEPT 
-A input -p tcp -s 0/0 -d 0/0 514 -b -j ACCEPT 
# 
# End of new rules 
# 
-A input -p tcp -s 0/0 -d 0/0 0:1023 -y -j REJECT 
-A input -p tcp -s 0/0 -d 0/0 2049 -y -j REJECT 
-A input -p udp -s 0/0 -d 0/0 0:1023 -j REJECT 
-A input -p udp -s 0/0 -d 0/0 2049 -j REJECT 
-A input -p tcp -s 0/0 -d 0/0 6000:6009 -y -j REJECT 
-A input -p tcp -s 0/0 -d 0/0 7100 -y -j REJECT 
Once the new rules are loaded (for example with "/etc/rc.d/init.d/ipchains restart", or by rebooting), users on remote systems who have accounts on this system should be able to rsh/rlogin to this machine without a password.
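Because the first matching rule wins, the position of the new lines matters. If they were placed after the 0:1023 REJECT rule, new connections would be rejected before ever reaching them. The Python sketch below inserts rules at the right place and works on the file's lines. The function name insert_before_first_reject is hypothetical, and the edited file takes effect only once the rules are reloaded.

```python
from typing import List


def insert_before_first_reject(config_lines: List[str], new_rules: List[str]) -> List[str]:
    """Return a copy of config_lines with new_rules placed immediately before the
    first '-A input ... -j REJECT' line, so that they are matched first."""
    for idx, line in enumerate(config_lines):
        tok = line.split()
        if tok[:2] == ["-A", "input"] and tok[-2:] == ["-j", "REJECT"]:
            return config_lines[:idx] + new_rules + config_lines[idx:]
    return config_lines + new_rules  # no REJECT rule found: append at the end


if __name__ == "__main__":
    config = [
        ":input ACCEPT",
        "-A input -s 0/0 -d 0/0 -i lo -j ACCEPT",
        "-A input -p tcp -s 0/0 -d 0/0 0:1023 -y -j REJECT",
    ]
    rsh_rules = [
        "-A input -p tcp -s 0/0 -d 0/0 513 -b -j ACCEPT",
        "-A input -p tcp -s 0/0 -d 0/0 514 -b -j ACCEPT",
    ]
    print("\n".join(insert_before_first_reject(config, rsh_rules)))
```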

Enabling ssh

Enabling ssh is somewhat easier.

First, the ssh server, sshd, must be installed. It is part of the openssh-server RPM, which is located on the first disc of the RedHat 7.2 distribution.

Once the server is installed, it must be started:


/etc/rc.d/init.d/sshd start 
The service will be automatically started on reboot.

At this point, ssh to localhost should work, although a password will still be required. However, the firewall rules still prevent connections from other machines.

We again modify /etc/sysconfig/ipchains, this time to allow ssh traffic (TCP port 22) in and out. See the rsh section above for a discussion of what we are doing here.


# Firewall configuration written by lokkit 
# Manual customization of this file is not recommended. 
# Note: ifup-post will punch the current nameservers through the 
#       firewall; such entries will *not* be listed here. 
:input ACCEPT 
:forward ACCEPT 
:output ACCEPT 
-A input -s 0/0 67:68 -d 0/0 67:68 -p udp -i eth0 -j ACCEPT 
-A input -s 0/0 67:68 -d 0/0 67:68 -p udp -i eth1 -j ACCEPT 
-A input -s 0/0 -d 0/0 -i lo -j ACCEPT 
# 
# New rules for ssh traffic, incoming or outgoing 
#  
-A input -p tcp -s 0/0 -d 0/0 22 -b -j ACCEPT 
# 
# End of new rules 
# 
-A input -p tcp -s 0/0 -d 0/0 0:1023 -y -j REJECT 
-A input -p tcp -s 0/0 -d 0/0 2049 -y -j REJECT 
-A input -p udp -s 0/0 -d 0/0 0:1023 -j REJECT 
-A input -p udp -s 0/0 -d 0/0 2049 -j REJECT 
-A input -p tcp -s 0/0 -d 0/0 6000:6009 -y -j REJECT 
-A input -p tcp -s 0/0 -d 0/0 7100 -y -j REJECT 
Once the modified rules are loaded, users on remote systems should be able to ssh into the machine, but they will still need a password.
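A quick way to check the firewall change from another machine is to try opening a TCP connection to port 22. This Python sketch only tests that the port accepts connections. It does not log in or check keys. The function name probe_tcp and the host name node1 in the usage line are hypothetical.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try to open a TCP connection; return 'open', 'refused', 'timeout' or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"  # e.g. unreachable host or network, unknown host name


if __name__ == "__main__":
    print(probe_tcp("node1", 22))  # "node1" is a placeholder for the sshd host
```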

Users should set up a private/public key pair so that ssh can operate without passwords. This process is documented in the installation guide, but a summary of the steps for RedHat 7.2 is included here.

First run "ssh-keygen -t rsa" to create the private/public key pair. By default this creates the files ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub. Enter a passphrase when prompted.

Next, place the public key (~/.ssh/id_rsa.pub) in the file ~/.ssh/authorized_keys. If more than one machine is going to be used, this key must be added to the ~/.ssh/authorized_keys file on each machine. Set the permissions on the ~/.ssh directory to 700; otherwise sshd may refuse to accept the keys.
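For a single machine, the append-and-chmod steps can be sketched in Python as follows. The function name install_public_key is hypothetical. Setting authorized_keys itself to 600 is a common convention that this sketch adds as an assumption; the text above only requires 700 on ~/.ssh. For other machines, copy id_rsa.pub over first and run the same steps there. The demo works in a temporary directory so that it does not touch your real ~/.ssh.

```python
import os
import tempfile
from pathlib import Path


def install_public_key(home: Path, pubkey_line: str) -> Path:
    """Add pubkey_line to home/.ssh/authorized_keys once and tighten permissions."""
    ssh_dir = home / ".ssh"
    ssh_dir.mkdir(exist_ok=True)
    os.chmod(ssh_dir, 0o700)  # 700, as recommended above; sshd may otherwise refuse the keys
    auth = ssh_dir / "authorized_keys"
    text = auth.read_text() if auth.exists() else ""
    key = pubkey_line.strip()
    if key not in (line.strip() for line in text.splitlines()):
        if text and not text.endswith("\n"):
            text += "\n"
        auth.write_text(text + key + "\n")
    os.chmod(auth, 0o600)  # assumption: common convention, not required by the text
    return auth


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Real use would pass Path.home() and the contents of ~/.ssh/id_rsa.pub.
        fake_key = "ssh-rsa AAAAexamplekey user@example"  # placeholder, not a real key
        path = install_public_key(Path(tmp), fake_key)
        print(path.read_text(), end="")
```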

Placing the key in authorized_keys allows you to connect using RSA keys rather than ordinary UNIX passwords. The next step is to start an SSH agent so that you do not need to type your passphrase repeatedly.

The agent is started with "ssh-agent <cmd>". Typically <cmd> is $SHELL, so that your default shell is started. The agent then handles authentication on your behalf whenever you use ssh from this shell. To give the agent your key, run "ssh-add", which prompts for the passphrase that protects your RSA key.

Once you have completed this, you will be able to ssh to other systems on which your key is authorized without typing a password.
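ssh-agent tells the programs it starts where to find it through the SSH_AUTH_SOCK environment variable, which ssh and ssh-add consult. A small Python sketch, using a hypothetical helper named agent_socket, can confirm that a shell is running under an agent. It does not show whether any keys have been added; "ssh-add -l" lists the keys the agent holds.

```python
import os
import stat
from typing import Mapping, Optional


def agent_socket(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the agent socket named by SSH_AUTH_SOCK if it exists, else None."""
    env = os.environ if env is None else env
    path = env.get("SSH_AUTH_SOCK")
    if not path:
        return None
    try:
        mode = os.stat(path).st_mode
    except OSError:
        return None  # stale variable: the agent has exited or the path is wrong
    return path if stat.S_ISSOCK(mode) else None


if __name__ == "__main__":
    sock = agent_socket()
    print("agent socket:", sock if sock else "none (start one with: ssh-agent $SHELL)")
```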

Interprocess communication

MPICH processes use the standard UNIX mechanisms for allocating ports for communication with one another. With this mechanism, processes are given ports in the range 1024-65535.

Unfortunately, the default firewall configuration blocks some port ranges that MPICH processes might be given for communication. As a result, MPICH applications occasionally fail to communicate whenever a process happens to be assigned a blocked port.
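Whether a particular run is affected depends on which port a process is handed. The Python sketch below asks the kernel for a free TCP port by binding to port 0. This is a common way of obtaining a port and is used here as an assumption; this document does not say exactly how MPICH obtains its ports. The sketch then checks the port against the TCP ranges that the default /etc/sysconfig/ipchains rejects for new connections. The 2049 rule is included because it also appears in the default file. On Linux, the range the kernel picks from is set in /proc/sys/net/ipv4/ip_local_port_range. All names in the sketch are hypothetical.

```python
import socket
from typing import List, Tuple

# TCP destination ports rejected for new connections by the default
# /etc/sysconfig/ipchains (inclusive ranges).
DEFAULT_BLOCKED_TCP: List[Tuple[int, int]] = [(0, 1023), (2049, 2049), (6000, 6009), (7100, 7100)]


def is_blocked(port: int, blocked: List[Tuple[int, int]] = DEFAULT_BLOCKED_TCP) -> bool:
    """True if a new TCP connection to `port` from another host would hit a REJECT rule."""
    return any(lo <= port <= hi for lo, hi in blocked)


def kernel_assigned_port() -> int:
    """Bind a TCP socket to port 0 so the kernel picks a free port; return that port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    port = kernel_assigned_port()
    print(port, "blocked" if is_blocked(port) else "reachable")
```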

We're going to modify the ipchains configuration file to remove the lines that block ranges of ports our processes might use for communication.

The two default rules of interest are the following:


-A input -p tcp -s 0/0 -d 0/0 6000:6009 -y -j REJECT 
-A input -p tcp -s 0/0 -d 0/0 7100 -y -j REJECT 
The first rejects incoming TCP connections to ports 6000-6009 (used by the X Window System for displays :0 through :9). The second rejects incoming TCP connections to port 7100 (often used by the X font server).

We simply comment these rules out:


# Firewall configuration written by lokkit 
# Manual customization of this file is not recommended. 
# Note: ifup-post will punch the current nameservers through the 
#       firewall; such entries will *not* be listed here. 
:input ACCEPT 
:forward ACCEPT 
:output ACCEPT 
-A input -s 0/0 67:68 -d 0/0 67:68 -p udp -i eth0 -j ACCEPT 
-A input -s 0/0 67:68 -d 0/0 67:68 -p udp -i eth1 -j ACCEPT 
-A input -s 0/0 -d 0/0 -i lo -j ACCEPT 
-A input -p tcp -s 0/0 -d 0/0 0:1023 -y -j REJECT 
-A input -p tcp -s 0/0 -d 0/0 2049 -y -j REJECT 
-A input -p udp -s 0/0 -d 0/0 0:1023 -j REJECT 
-A input -p udp -s 0/0 -d 0/0 2049 -j REJECT 
# 
# Removed these rules to eliminate chance of MPICH comm. failure 
# 
# -A input -p tcp -s 0/0 -d 0/0 6000:6009 -y -j REJECT 
# -A input -p tcp -s 0/0 -d 0/0 7100 -y -j REJECT 
# 
# End of removed rules 
# 
This listing shows only the interprocess communication change. Combined with the rsh or ssh rules from the earlier sections, which allow processes to be started, it should prepare your system for MPICH jobs.
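If the same change has to be made on every node, commenting the lines out can be scripted. The Python sketch below uses a hypothetical helper named comment_out_rules. It compares rules after collapsing whitespace, so the trailing blanks in the file do not matter. As with the earlier edits, the rewritten file has no effect until the rules are reloaded.

```python
from typing import Iterable, List


def comment_out_rules(config_lines: List[str], unwanted: Iterable[str]) -> List[str]:
    """Return config_lines with every line equal to a rule in `unwanted`
    (ignoring differences in whitespace) prefixed with '# '."""
    targets = {" ".join(rule.split()) for rule in unwanted}
    return ["# " + line if " ".join(line.split()) in targets else line
            for line in config_lines]


if __name__ == "__main__":
    config = [
        "-A input -p udp -s 0/0 -d 0/0 2049 -j REJECT ",
        "-A input -p tcp -s 0/0 -d 0/0 6000:6009 -y -j REJECT ",
        "-A input -p tcp -s 0/0 -d 0/0 7100 -y -j REJECT ",
    ]
    x_rules = [
        "-A input -p tcp -s 0/0 -d 0/0 6000:6009 -y -j REJECT",
        "-A input -p tcp -s 0/0 -d 0/0 7100 -y -j REJECT",
    ]
    print("\n".join(comment_out_rules(config, x_rules)))
```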


