On your frontend, execute:
# insert-ethers --remove="[your compute node name]" |
For example, if the compute node's name is compute-0-1, you'd execute:
# insert-ethers --remove="compute-0-1" |
The compute node has been removed from the cluster.
Before you can run startx, you need to configure XFree86 for your video card. This is done just as on standard Red Hat machines, using the system-config-display program. If you do not know anything about your video card, just select "4MB" of video RAM and 16-bit color at 800x600. This video mode should work on any modern VGA card.
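For example, a minimal session on the frontend looks like the following (assuming system-config-display is on your path, as it is on a stock Red Hat install; the dialog choices depend on your hardware):

# system-config-display
# startx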
7.2.3. I can't install compute nodes and I have a Dell Powerconnect 5224 network switch, what can I do?
Here's how to configure your Dell Powerconnect 5224:
You need to set the edge port flag for all ports (on some Dell switches this is labeled fast link).
First, you'll need to set up an IP address on the switch:
Plug in the serial cable that came with the switch.
Connect to the switch over the serial cable.
The username/password is: admin/admin.
Assign the switch an IP address:
# config
# interface vlan 1
# ip address 10.1.2.3 255.0.0.0
Now you should be able to access the switch over ethernet.
Plug an ethernet cable into the switch and into your laptop.
Configure the IP address on your laptop to be:
IP: 10.20.30.40
netmask: 255.0.0.0
Point your web browser on your laptop to 10.1.2.3.
Username/password is: admin/admin.
Set the edge port flag for all ports. This is found under the menu item: System->Spanning Tree->Port Settings.
Save the configuration.
This is accomplished by going to System->Switch->Configuration and typing 'rocks.cfg' in the last field 'Copy Running Config to File'. In the field above it, you should see 'rocks.cfg' as the 'File Name' in the 'Start-Up Configuration File'.
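If you prefer to stay on the serial console, many PowerConnect firmware revisions also accept a Cisco-style copy command; this is an assumption, and the exact syntax varies by firmware revision:

# copy running-config startup-config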
We use High-Performance Linpack (HPL), the program used to rank computers on the Top500 Supercomputer lists, to debug Myrinet. HPL is installed on all compute nodes by default.
To run HPL on the compute nodes, see Interactive Mode.
Then it is just a matter of methodically testing the compute nodes: start with compute-0-0 and compute-0-1 and make sure they are functioning, then move to compute-0-2 and compute-0-3, and so on.
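As a rough illustration, a driver loop for this sweep might look like the following sketch. Here run_hpl is a hypothetical wrapper around the HPL invocation described in Interactive Mode, and the sweep assumes 16 nodes in cabinet 0:

#!/bin/sh
# Test cabinet 0 two nodes at a time; run_hpl is a hypothetical
# wrapper around your HPL run (see Interactive Mode).
i=0
while [ $i -lt 16 ]; do
    a="compute-0-$i"
    b="compute-0-$((i + 1))"
    echo "=== testing $a and $b ==="
    run_hpl "$a" "$b" || echo "suspect pair: $a $b"
    i=$((i + 2))
done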
When you find a suspected malfunctioning compute node, the first thing to do is verify the Myrinet map (this contains the routes from this compute node to all the other Myrinet-connected compute nodes).
Examine the map by logging into the compute node and executing:
$ /usr/sbin/gm_board_info
This will display something like:
GM build ID is "1.5_Linux @compute-0-1 Fri Apr  5 21:08:29 GMT 2002."

Board number 0:
  lanai_clockval    = 0x082082a0
  lanai_cpu_version = 0x0900 (LANai9.0)
  lanai_board_id    = 00:60:dd:7f:9b:1d
  lanai_sram_size   = 0x00200000 (2048K bytes)
  max_lanai_speed   = 134 MHz
  product_code      = 88
  serial_number     = 66692
    (should be labeled: "M3S-PCI64B-2-66692")
LANai time is 0x1de6ae70147 ticks, or about 15309 minutes since reset.

This is node 86 (compute-0-1) node_type=0
Board has room for 8 ports, 3000 nodes/routes, 32768 cache entries
  Port token cnt: send=29, recv=248
  Port: Status PID
     0:   BUSY 12160 (this process [gm_board_info])
     2:   BUSY 12552
     4:   BUSY 12552
     5:   BUSY 12552
     6:   BUSY 12552
     7:   BUSY 12552

Route table for this node follows:
The mapper 48-bit ID was: 00:60:dd:7f:96:1b
gmID MAC Address       gmName                           Route
---- ----------------- -------------------------------- ---------------------
   1 00:60:dd:7f:9a:d4 compute-0-10                     b7 b9 89
   2 00:60:dd:7f:9a:d1 compute-1-15                     b7 bf 86
   3 00:60:dd:7f:9b:15 compute-0-16                     b7 81 84
   4 00:60:dd:7f:80:ea compute-1-16                     b7 b5 88
   5 00:60:dd:7f:9a:ec compute-0-9                      b7 b9 84
   6 00:60:dd:7f:96:79 compute-2-13                     b7 b8 83
   8 00:60:dd:7f:80:d4 compute-1-1                      b7 be 83
   9 00:60:dd:7f:9b:0c compute-1-0                      b7 be 84
Now, login to a known good compute node and execute /usr/sbin/gm_board_info on it. If the gmIDs and gmNames are not the same on both, then there is probably a bad Myrinet component.
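One quick way to compare the two maps is to diff the gmID/gmName columns of the route tables. This is only a sketch: it assumes password-less ssh between nodes (which Rocks sets up by default), and the node names are placeholders for your suspect and known-good nodes:

$ ssh compute-0-1 /usr/sbin/gm_board_info | awk '/^ *[0-9]+ /{print $1, $3}' > /tmp/suspect.map
$ ssh compute-0-5 /usr/sbin/gm_board_info | awk '/^ *[0-9]+ /{print $1, $3}' > /tmp/good.map
$ diff /tmp/suspect.map /tmp/good.map && echo "maps agree"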
Start replacing components to see if you can clear the problem. Try each procedure in the list below.
Replace the cable
Move the cable to a different port on the switch
Replace the Myrinet card in the compute node
Contact Myricom
After each procedure, make sure to rerun the mapper on the compute node and then verify the map (with /usr/sbin/gm_board_info). To rerun the mapper, execute:
# /etc/rc.d/init.d/gm-mapper start
The mapper will run for a few seconds, then exit. Wait for the mapper to complete before you run gm_board_info (that is, run ps auwx | grep mapper and make sure the mapper has completed).
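For example, the following loop starts the mapper and polls until it exits before querying the board (a sketch; the [m]apper pattern keeps grep from matching its own process):

# /etc/rc.d/init.d/gm-mapper start
# while ps auwx | grep '[m]apper' > /dev/null; do sleep 1; done
# /usr/sbin/gm_board_info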
This is only an issue for machines that support network booting (also called PXE booting). In this case, the boot order should be: cdrom, floppy, hard disk, network. This means that on bare hardware, the first boot will network boot, as no OS is installed on the hard disk. This PXE boot will load the Red Hat installation kernel and install the node just as if the node were booted with the Rocks Boot CD. If you set the boot order to place PXE before the hard disk, the node will repeatedly reinstall itself.
7.2.6. How do I export a new directory from the frontend to all the compute nodes that is accessible under /home?
Execute this procedure:
Add the directory you want to export to the file /etc/exports.
For example, if you want to export the directory /export/disk1, add the following to /etc/exports:
/export/disk1 10.0.0.0/255.0.0.0(rw)
This exports the directory only to nodes that are on the internal network (in the above example, the internal network is configured to be 10.0.0.0).
Restart NFS:
# /etc/rc.d/init.d/nfs restart
Add an entry to /etc/auto.home.
For example, say you want /export/disk1 on the frontend machine (named frontend-0) to be mounted as /home/scratch on each compute node.
Add the following entry to /etc/auto.home:
scratch frontend-0:/export/disk1
Inform 411 of the change:
# make -C /var/411
Now when you login to any compute node and change your directory to /home/scratch, it will be automounted.
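To verify the export end-to-end, you can check it from both sides (a sketch; compute-0-0 is just an example node name):

# /usr/sbin/showmount -e localhost
# ssh compute-0-0 'cd /home/scratch && df .'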
When compute nodes experience a hard reboot (e.g., when the compute node is reset by pushing the power button or after a power failure), they will reformat the root file system and reinstall their base operating environment.
To disable this feature:
Login to the frontend
Create a file that will override the default:
# cd /home/install
# cp rocks-dist/lan/arch/build/nodes/auto-kickstart.xml \
  site-profiles/4.2/nodes/replace-auto-kickstart.xml
Where arch is "i386", "x86_64" or "ia64".
Edit the file site-profiles/4.2/nodes/replace-auto-kickstart.xml
Remove the line:
<package>rocks-boot-auto</package>
Rebuild the distribution:
# cd /home/install
# rocks-dist dist
Reinstall all your compute nodes
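If you would rather trigger the reinstall from the frontend than power-cycle every node, something like the following should work; this assumes the standard Rocks tools, where cluster-fork runs a command on each compute node and /boot/kickstart/cluster-kickstart forces a reinstall:

# cluster-fork /boot/kickstart/cluster-kickstart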
An alternative to reinstalling all your compute nodes is to login to each compute node and execute:
# /etc/rc.d/init.d/rocks-grub stop
# /sbin/chkconfig --del rocks-grub