Building a diskless Linux Cluster for high performance computations from a standard Linux distribution

Stefan Böhringer
Institut für Humangenetik
Universitätsklinikum Essen


Date: April 7th, 2003

Abstract:

This paper describes the steps involved in building a Linux cluster usable for high performance computing. A diskless design is highly beneficial for Linux clusters on account of reduced administration costs, and therefore the implementation of a diskless cluster is described here. Since Linux development is still in flux, the focus is on an abstract treatment; the actual steps involved at the time of writing are given in highlighted text sections. The setting considered here involves Preboot eXecution Environment (PXE) capable hardware.

Keywords: Linux, cluster, diskless boot, PXE, tftp, NFS boot, load balancing, high performance computations

Introduction

Linux is a highly stable operating system (OS) that is used for many high availability tasks, such as web and database servers. Linux clusters have also been built to serve as high performance computing facilities [1]. These clusters are commonly referred to as Beowulf clusters [2], although this term is not well defined. In general it denotes any collection of Linux boxes in proximity (where the metric defining proximity is again subject to variation).

A web search turns up numerous guides on building a Linux cluster, but none seemed to match the setting at hand. This document therefore focuses on the specific setting considered here; a different setting might be served better elsewhere.

The following aims are pursued:

- OS is Linux
- Diskless boot
- No hardware changes to the nodes (no EPROMs)
- Homogeneous nodes

This guide assumes some UNIX proficiency. It is recommended to use a web search if any terms seem ambiguous or are unknown to the reader [3]; any computer topic is treated in depth on the web. If problems arise while building a Linux cluster it is highly recommended to use Google Groups [4], since numerous problems have been discussed before and the discussions can be retrieved from that service.

The document treats the assembly of the cluster in three sections.

- Choice of hardware
- Configuring the boot process (server configuration)
- Configuring the shared file system (node configuration)

In the following, a practical example based on a RedHat distribution is reported. The cluster built is referred to as the Genecruncher I cluster ([5]). The steps involved in building that particular cluster are shown in non-proportional font.

Choice of Hardware

According to the aims outlined above, there are some constraints with respect to the hardware usable for a Linux cluster. You need a mainboard that supports network boot via PXE, which is implemented through a Managed Boot Agent (MBA). This should not be a problem for most mainboards these days. At the time of selecting the hardware (autumn 2002) there was no cheap mainboard available with an onboard network interface card (NIC) supporting PXE, so a PCI NIC with PXE support had to be added. 3Com and Intel cards should be usable. The Genecruncher I uses the following hardware (specific examples are shown in non-proportional font).

Genecruncher I hardware configuration
-------------------------------------
Mainboard: Elitegroup K7S5A
CPU: AMD Duron 1.2 GHz
Memory: 256 MB
NIC: 3Com 3C905C-TX-M PCI
Graphics: ATI Xpert 2000/32MB

Additions in the server node
CDRom
80 GB Harddisk

Netgear 16 port 100Mbit switch

Server configuration

The server is to be installed with a Linux operating system (OS). After installation of the OS, the following services have to be installed and configured.

- DHCP
- tftp
- PXE
- NFS

A standard Redhat 7.3 distribution, which was downloaded from
the internet was installed on the Genecruncher I server.
During installation it was ensured that DHCP and TFTP software
packages were installed.

Network configuration of the server node


Dynamic Host Configuration Protocol (DHCP)

When the boot process of a client node starts, that node is bereft of any information. The first bits of information a node needs are an IP address and a host name. This configuration deals with assigning constant IP addresses to the nodes, which is a requirement for netbooting the clients. It requires determining the Ethernet hardware addresses of the client node NICs, which are reported on any attempted netboot of a client. The dhcpd daemon is configured via the dhcpd.conf file, usually located at /etc/dhcpd.conf, and documented in the dhcpd.conf manual page.

On a RedHat 7.3 box the dhcpd daemon is managed via SysV startup scripts.
To activate the service at boot time issue the command:
    /sbin/chkconfig --level 345 dhcpd on

Here is an excerpt of the Genecruncher I dhcpd.conf:

option domain-name-servers 132.252.3.10, 132.252.1.7;
option routers 192.168.1.1;

subnet 192.168.0.0 netmask 255.255.255.0 {
#       range 192.168.0.90 192.168.0.90;
}

subnet 192.168.1.0 netmask 255.255.255.0 {
        range 192.168.1.100 192.168.1.254;
}

group {
        filename "pxelinux.0";
        use-host-decl-names on;

        host cn1 {
                fixed-address 192.168.1.100;
                hardware ethernet 00:04:75:9d:32:43;
                option root-path "/tftpboot/192.168.1.100";
        }
# ... repeat for all diskless nodes
}
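
After changing dhcpd.conf the daemon has to be restarted, and its log output tells you whether client requests arrive. The following is a minimal sketch, assuming the standard RedHat SysV service script and syslog going to /var/log/messages:

# restart the DHCP daemon so it re-reads dhcpd.conf
/sbin/service dhcpd restart
# watch the log while a client attempts a netboot; dhcpd logs a
# DHCPDISCOVER/DHCPOFFER pair for every request it answers
tail -f /var/log/messages | grep dhcpd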

Punching holes into the firewall

It is perhaps easiest to turn off firewalling altogether in the first place. Otherwise, access to the ports of all required services has to be permitted.

On RedHat 7.3 the firewall rules are stored in the file
/etc/sysconfig/ipchains
The relevant ports are 67/68 for DHCP and 69 for TFTP. However, TFTP
moves the actual data transfer to additional, dynamically chosen ports.
The following configuration is very liberal and also lets through the
NFS ports (amongst others).

-------------- /etc/sysconfig/ipchains --------------
:input ACCEPT
:forward ACCEPT
:output ACCEPT
-A input -s 0/0 0:65535 -d 0/0 0:65535 -p udp -i eth0 -j ACCEPT
-A input -s 0/0 0:65535 -d 0/0 0:65535 -p tcp -i eth0 -j ACCEPT
-A input -s 0/0 0:65535 -d 0/0 0:65535 -p udp -i eth1 -j ACCEPT
-A input -s 0/0 0:65535 -d 0/0 0:65535 -p tcp -i eth1 -j ACCEPT
# allow for dhcp requests
-A input -s 0/0 67:68 -d 0/0 67:68 -p udp -i eth1 -j ACCEPT
# allow for tftp
-A input -s 192.168.0.0/16 60: -d 0/0 0: -p udp -i eth1 -j ACCEPT
-A input -s 0/0 -d 0/0 -i lo -j ACCEPT
-A input -p tcp -s 0/0 -d 0/0 0:1023 -y -j REJECT
-A input -p tcp -s 0/0 -d 0/0 2049 -y -j REJECT
-A input -p udp -s 0/0 -d 0/0 0:1023 -j REJECT
-A input -p udp -s 0/0 -d 0/0 2049 -j REJECT
-A input -p tcp -s 0/0 -d 0/0 6000:6009 -y -j REJECT
-A input -p tcp -s 0/0 -d 0/0 7100 -y -j REJECT
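
Whichever rule set is chosen, it has to be reloaded and should be inspected afterwards. A minimal sketch, assuming the stock ipchains SysV service of RedHat 7.3:

# reload the rules from /etc/sysconfig/ipchains
/sbin/service ipchains restart
# list the active rules with numeric addresses to check that the
# DHCP, TFTP and NFS ports are reachable from the cluster network
/sbin/ipchains -L -n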

Trivial File Transfer Protocol (TFTP)

TFTP allows the diskless nodes to load boot code over the NIC; this is discussed in section 3.4. The boot code will then load a Linux kernel, which in turn will use NFS to perform any further file accesses over the network.

On RedHat 7.3 the TFTP daemon is controlled by the xinetd meta daemon. You
can enable the service by changing the disable option in the
/etc/xinetd.d/tftp file. The '-s' option specifies which directory will
be served; in the following /tftpboot is assumed.

-------------- /etc/xinetd.d/tftp --------------
service tftp
{
    socket_type		= dgram
    protocol		= udp
    wait			= yes
    user			= root
    server			= /usr/sbin/in.tftpd
    server_args		= -s /tftpboot
    disable			= no
    per_source		= 11
    cps				= 100 2
    only_from		= 0.0.0.0/0
}
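
After restarting xinetd the tftp service can be tested locally, before any client is booted. A minimal sketch using the interactive tftp client (it assumes pxelinux.0 has already been copied to /tftpboot as described in the next section):

# make xinetd pick up the changed tftp file
/sbin/service xinetd restart
# fetch a file through tftp; a successful transfer into /tmp proves
# that the '-s /tftpboot' setup works
cd /tmp
tftp localhost
tftp> get pxelinux.0
tftp> quit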


Preboot eXecution Environment (PXE)

To make use of the PXE booting process you need to download PXELINUX ([6]), which is part of the SYSLINUX project ([7]). PXE is a standard for network cards to load boot code over the network. This is done via TFTP, and you therefore have to populate the tftp directory with the appropriate files. pxelinux.0 is the binary from the PXELINUX distribution which allows x86 machines to boot Linux. A directory named pxelinux.cfg is to be created in the /tftpboot directory. Within that directory a file named default is to be created; it holds the name of the kernel and the kernel options.

label linux
    kernel	bzImage
    append	root=/dev/nfs ip=dhcp init=/sbin/init
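
Putting the pieces together, the tftp directory could be populated as follows. This is only a sketch, assuming pxelinux.0 (from the SYSLINUX distribution) and the client kernel bzImage (built in the client configuration below) sit in the current directory:

# populate the directory served by tftp
mkdir -p /tftpboot/pxelinux.cfg
cp pxelinux.0 bzImage /tftpboot
# write the default boot configuration shown above
cat > /tftpboot/pxelinux.cfg/default <<'EOF'
label linux
    kernel bzImage
    append root=/dev/nfs ip=dhcp init=/sbin/init
EOF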

Network File System (NFS)

The server has to expose parts of its file system to be mountable by the clients and used by them as their root file system. The kernel is instructed to use the /dev/nfs device to mount its file system (sec. 3.4). This device has to be created in the /dev directory and serves as a placeholder.

mknod /dev/nfs c 0 255

The kernel can be told to use that device as root file system by the use of the rdev command.

rdev bzImage /dev/nfs

To allow for file access via NFS later in the boot process you have to configure the NFS server via the /etc/exports file. You have to add a line like

/tftpboot	192.168.1.0/24(rw,no_root_squash)
where 192.168.1.0 is the network of the client nodes.

In the Genecruncher implementation, both the rdev and the PXE options were used.

To enable nfs on RedHat 7.3 issue the commands:
chkconfig --level 345 nfs on
chkconfig --level 345 nfslock on
Again, the firewall rules must let through NFS network accesses.
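
After editing /etc/exports the export list has to be re-read; it can then be verified from the server itself. A minimal sketch:

# re-export everything listed in /etc/exports
/usr/sbin/exportfs -ra
# /tftpboot should now be listed for the client network 192.168.1.0/24
/usr/sbin/showmount -e localhost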

This completes the server configuration. Move on to the clients.

Client configuration

Apart from the BIOS configuration of the diskless nodes, the client configuration takes place entirely on the server, since the clients lack local disks.

Hardware

Since the client nodes are diskless there is not much to configure. First you have to enable netboot in the BIOS; there may be several netboot options in the BIOS, of which the MBA boot option is the correct one. Second you have to determine the hardware Ethernet address of the network card in the client node. To obtain it, plug in the network cable and a monitor and start the boot process. At some point the NIC tries to obtain an IP address via DHCP and displays its hardware address. Since the NIC is waiting, you should have enough time to write it down. It turns out to be useful to tag each node with a label carrying its hardware address, so that you will not have to plug a monitor into the node again. The hardware address has to be supplied to the dhcp daemon running on the server as described above (section 3.2).
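
Instead of copying the address from the screen, it can also be harvested from the server log, since dhcpd reports every request it cannot answer yet. A minimal sketch, assuming dhcpd logs to /var/log/messages as on a stock RedHat 7.3:

# boot the unconfigured node once, then search the server log; the
# DHCPDISCOVER lines contain the hardware address of the node's NIC
grep DHCPDISCOVER /var/log/messages | tail -5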

Software

Building a client kernel

Building a new kernel requires you to install the kernel sources. Once these are installed you can enter the top level source directory.

Under RedHat 7.3 the sources are supplied on the CD distribution.
If you did not install them in the first place you can install the
kernel-source-2.4.18-3 RPM. The source directory is placed at:
/usr/src/linux-2.4.18-3

Now issue 'make menuconfig' to configure a new kernel capable of netbooting. This kernel must support the following features, which have to be compiled statically into the kernel.

- Support for the NIC installed for netbooting
- Ability to configure itself via DHCP
- Ability to NFS mount its file system

The NIC you have installed in the clients has to be chosen from the list of supported network cards under the "Network device support" / "Ethernet (10 or 100Mbit)" menu entries.

If you are not sure about your NIC, plug it into another Linux
box and let that box autoconfigure the card. Then you can inspect the
file /etc/modules.conf (on RedHat) to learn which driver is used.

You have to press 'Y' and an '*' should appear before the NIC driver, indicating that static support for that particular NIC is included in the kernel.

Next, choose "Networking options" from the main menu. Here you have to enable "IP: kernel level autoconfiguration" (in the "TCP/IP networking" section). This enables the "DHCP" and "BOOTP" options, both of which have to be included statically. By some obscure quantum effect this makes another option appear under "File systems" -> "Network File Systems" -> "NFS file system support". Both "NFS file system support" and "Root file system on NFS" are to be compiled statically into the kernel. Now you are done. Save the configuration and issue 'make dep ; make bzImage'. This will produce a 'bzImage' file.
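
Whether the required options really ended up compiled in statically can be checked directly in the generated .config file. A minimal sketch (CONFIG_VORTEX is the option belonging to the 3c59x driver used by the 3Com 3C905C; substitute the option of your own NIC):

# run inside the kernel source directory after 'make menuconfig';
# all of these must read '=y' (built in), not '=m' (module)
grep -E 'CONFIG_(VORTEX|IP_PNP|IP_PNP_DHCP|IP_PNP_BOOTP|NFS_FS|ROOT_NFS)=' .config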

On RedHat 7.3 the resulting bzImage file is located at:
/usr/src/linux-2.4.18-3/arch/i386/boot/bzImage

The bzImage file is then to be copied to the tftpboot directory (section 3.2). If problems arise, do a Google search (e.g. for the 'Linux Kernel HOWTO').
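
Putting the previous steps together, installing the client kernel could look like this (a sketch assuming the paths used above):

# install the freshly built client kernel into the tftp directory
cp /usr/src/linux-2.4.18-3/arch/i386/boot/bzImage /tftpboot/bzImage
# point the kernel at the NFS root placeholder created with mknod earlier
rdev /tftpboot/bzImage /dev/nfs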

Putting together the client filesystem

By convention a root tree for each client is placed in /tftpboot/IP, where IP is the IP address of the node as specified in the dhcpd.conf file (e.g. /tftpboot/192.168.1.100). The following script was used to create a template root tree from which the actual client root trees are created (createClientTemplate.sh):

#!/bin/sh

CUSTOM_FILES=/home/pingu/clusterFiles
CLIENT_TEMPLATE=/tftpboot/client
DIRS_FROM_SERVER="bin lib usr boot etc sbin"

echo "Cleaning out destination dir..."
rm -rf $CLIENT_TEMPLATE
mkdir $CLIENT_TEMPLATE
echo "Copying directories: $DIRS_FROM_SERVER ..."
( cd / ; cp -r $DIRS_FROM_SERVER $CLIENT_TEMPLATE )

echo Creating devices...
./MAKEDEV -d $CLIENT_TEMPLATE/dev generic
./MAKEDEV -d $CLIENT_TEMPLATE/dev console
./MAKEDEV -d $CLIENT_TEMPLATE/dev loop0
losetup $CLIENT_TEMPLATE/dev/loop0 $CLIENT_TEMPLATE/var/swap

echo Copying customized files...
cp $CUSTOM_FILES/rc.sysinit $CLIENT_TEMPLATE/etc/rc.d
cp $CUSTOM_FILES/fstab $CLIENT_TEMPLATE/etc
cp $CUSTOM_FILES/inittab $CLIENT_TEMPLATE/etc

This script first copies essential directories (DIRS_FROM_SERVER) as they are on the server machine. Second, the devices are created using the MAKEDEV tool; essential is the console device used for kernel output, and the loop0 device is used later to establish a swap file. Third, a few files are replaced to account for netbooting.

The rc.sysinit replacement is necessary to allow for the setup of
a proper swap file on RedHat 7.3. It is not listed here on account
of its size but can be downloaded from the supplementary website (see below).
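
The essential addition in that replacement boils down to attaching the swap file to the loop device before swap is activated. The following is only a sketch of such a fragment, not a verbatim excerpt of the downloadable file:

# attach the per-node swap file to the loopback device and activate it;
# /dev/loop0 and /var/swap are prepared by the scripts below
/sbin/losetup /dev/loop0 /var/swap
/sbin/swapon /dev/loop0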

The fstab lists the NFS mounts of the client nodes. All these directories are
shared between nodes. The IP address 192.168.1.1 is the address of the server
and must be replaced with the correct address. Excerpt:
192.168.1.1:/tftpboot/IP / nfs rw,bg,soft,intr 0 0
192.168.1.1:/tftpboot/client/usr/bin /usr/bin nfs rw,bg,soft,intr 0 0
192.168.1.1:/tftpboot/client/usr/etc /usr/etc nfs rw,bg,soft,intr 0 0
...
none				/dev/pts	devpts	gid=5,mode=620  0 0
none				/proc		proc    defaults        0 0
/dev/loop0			swap		swap	defaults		0 0

The inittab configures the runlevels of the nodes:
...
# Default runlevel. The runlevels used by RHS are:
#   0 - halt (Do NOT set initdefault to this)
#   1 - Single user mode
#   2 - Multiuser, without NFS (The same as 3, if you do not have networking)
#   3 - Full multiuser mode
#   4 - unused
#   5 - X11
#   6 - reboot (Do NOT set initdefault to this)
# 
id:3:initdefault:

# System initialization.
si::sysinit:/etc/rc.d/rc.sysinit
...

For each node to be added for diskless boot a unique file system has to be created. Here is a script to build such a tree from the template file structure. The script has to be supplied with the IP address of the node to be added.

#!/bin/bash

BASE=/tftpboot
CLIENT_TEMPLATE=$BASE/client
NODE=$BASE/$1

if [ "$1" = '' ]; then
	echo USAGE: $0 node-ip
fi
echo Creating filesystem for node $1
mkdir $NODE

# the following dirs are on their own on each node:
# boot etc dev [duplicated]
# tmp root home [empty]
# var [skeleton]
# all others are mounted via nfs
# bin lib sbin usr

echo Creating empty dirs and skeletons...
for d in tmp root home proc usr ; do
	mkdir $NODE/$d
done

echo Duplicating needed dirs...
cd $CLIENT_TEMPLATE
for d in dev boot etc sbin bin lib usr/sbin ; do
	echo Duplicating $d...
	tar cpf - $d | tar xpf - -C $NODE
done


echo Creating var skeleton...
mkdir $NODE/var
for d in local log spool spool/anacron spool/at opt db lib lib/nfs lib/nfs/statd tmp lock named nis run lock/subsys ; do
	mkdir $NODE/var/$d
done
touch $NODE/var/lib/nfs/rmtab
touch $NODE/var/lib/nfs/xtab

echo Creating usr skeleton other than sbin...
for d in bin dict etc games GNUstep include info kerberos lib libexec local man  share src tmp X11R6 ; do
	mkdir $NODE/usr/$d
done

echo Creating swap file...
dd if=/dev/zero of=$NODE/var/swap bs=1k count=50k
/sbin/mkswap $NODE/var/swap

The script does the following. First it creates a new root directory for the node. Second it creates empty, private directories for the node. Third it copies private directories from the template file tree. Fourth it creates empty directories for the directories to be mounted via NFS and shared amongst the nodes. Fifth, a swap file is created for the node (50 MB). The /dev/loop0 device created earlier serves as a loopback device, which allows the file to be used as a swap device.
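
Taken together, a new node is added roughly as follows. This is a sketch; the second script is referred to here as createNode.sh (any name will do), and the host entry in dhcpd.conf described above still has to be added by hand:

# build the template tree once, then one root tree per node
./createClientTemplate.sh
./createNode.sh 192.168.1.100
# make the DHCP server aware of the new host entry
/sbin/service dhcpd restart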

Files can be downloaded from [5].

TROUBLESHOOTING
On RedHat 7.3 the boot process used to hang during the
experimental phase of the configuration. It turned out that
the /dev/console file had become corrupted. Recreating /dev/pts and
/dev/console remedied the problem. Issue:
mkdir $CLIENT_TEMPLATE/dev/pts; ./MAKEDEV -d $CLIENT_TEMPLATE/dev console

Finishing off

Well, there should not be any issues left. However, you should be prepared to invest some time to get your own cluster up and running. This guide may contain errors, and newer RedHat versions may differ from 7.3 in many respects. The client nodes were established by first making a single node bootable; it turned out to be useful to create the necessary users and do the application and service setup on that machine, and later to pull off copies for all nodes.

Remaining topics

You are likely to install load balancing software on your cluster. This topic is not covered here; two references point to the available options ([8,9]). You are encouraged to give feedback on this document, be it error reports or comments.

References

Bibliography

1
http://top500.org.

2
http://www.beowulf.org.

3
http://google.com.

4
http://www.google.com/grphp.

5
http://www.s-boehringer.de/genecruncher.

6
http://syslinux.zytor.com/pxe.php.

7
http://syslinux.zytor.com.

8
http://lcic.net.

9
http://zillion.sf.net.


Stefan Böhringer 2003-04-07