Package Building Procedures

16 Procedures for dealing with disk failures

When a machine has a disk failure (e.g. it panics due to read errors), do the following:

Note the time and failure mode (e.g. paste in the relevant console output) in /var/portbuild/${arch}/reboots
For i386 gohan clients, scrub the disk by touching /SCRUB in the nfsroot (e.g. /a/nfs/8.dir1/SCRUB) and rebooting. This runs dd if=/dev/zero of=/dev/ad0, which forces the drive to remap any bad sectors it finds, provided it has enough spare sectors left. This is a temporary measure to extend the lifetime of a drive that is on the way out.

Note: On the i386 blade systems, another sign of a failing disk seems to be that the blade hangs completely and does not respond to a console break or even an NMI.

For other build systems that don't newfs their disk at boot (e.g. amd64 systems), skip this step.
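The first two steps can be sketched as a pair of small shell helpers. This is only an illustration: the function names, the PORTBUILD_ROOT variable, and the format of the log entry are assumptions, and it further assumes that reboots is a plain text file that entries are appended to.

```shell
#!/bin/sh
# Hypothetical helpers; names and the log entry format are illustrative only.
PORTBUILD_ROOT="${PORTBUILD_ROOT:-/var/portbuild}"

# Append a timestamped note to the per-architecture reboots file.
# usage: log_failure <arch> <host> <description...>
log_failure() {
    arch="$1"; host="$2"; shift 2
    printf '%s %s: %s\n' "$(date '+%Y-%m-%d %H:%M')" "$host" "$*" \
        >> "${PORTBUILD_ROOT}/${arch}/reboots"
}

# Request a scrub on the next boot by creating SCRUB in the client's nfsroot.
# usage: request_scrub <nfsroot>
request_scrub() {
    touch "$1/SCRUB"
}
```

For example, log_failure i386 gohan51 "panic: read errors on ad0" followed by request_scrub /a/nfs/8.dir1 would record the failure and mark the client for scrubbing; the machine still has to be rebooted by hand.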

If the problem recurs, then the disk is probably toast. Take the machine out of mlist and (for ATA disks) run a long SMART self-test on the drive with smartctl:

smartctl -t long /dev/ad0

The test takes about half an hour:

gohan51# smartctl -t long /dev/ad0
smartctl version 5.38 [i386-portbld-freebsd8.0] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 31 minutes for test to complete.
Test will complete after Fri Jul  4 03:59:56 2008

Use smartctl -X to abort test.
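Since smartctl prints the expected duration, a script can pick it up instead of hard-coding half an hour. The helper below is a minimal sketch; its name is made up, and it assumes the "Please wait N minutes" line has exactly the wording shown in the output above.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -t long` output on stdin and print the
# number of minutes from the "Please wait N minutes" line, if present.
selftest_wait_minutes() {
    sed -n 's/^Please wait \([0-9][0-9]*\) minutes for test to complete\.$/\1/p'
}
```

A caller could then do something like mins=$(smartctl -t long /dev/ad0 | selftest_wait_minutes), sleep for that many minutes, and only then look at the results.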

Then smartctl -a /dev/ad0 shows the status after it finishes:

# SMART Self-test log structure revision number 1
# Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
#   1  Extended offline    Completed: read failure       80%     15252    319286

It will also display other data, including a log of previous drive errors. A drive can show previous DMA errors and still pass the self-test, because the affected sectors may have been remapped.
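Checking the self-test log can be scripted as well. The following is a rough sketch, not an established tool: the function name is invented, and it assumes that the most recent test is the entry numbered 1 and that a failed test has the word "failure" in its status, as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -a` output on stdin and return 0 if
# self-test log entry 1 (assumed to be the most recent test) reports a failure.
selftest_failed() {
    awk '/^# *1 / && /failure/ { bad = 1 }
         END { exit !bad }'
}
```

Usage would look like smartctl -a /dev/ad0 | selftest_failed && echo "ad0 failed its last self-test". A drive that passes this check may still have earlier errors in its log, so it does not replace reading the full output.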

When a disk has failed, please inform the cluster administrators so we can try to get it replaced.
