Package Building Procedures
When a machine has a disk failure (e.g. it panics due to read errors), take the following steps:
Note the time and failure mode (e.g. paste in the relevant console output) in /var/portbuild/${arch}/reboots
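As a minimal sketch of the logging step above, the helper below formats one entry for the reboots file. The function name `reboot_entry` and the entry layout (timestamp, host, description) are assumptions made for illustration; the procedure does not prescribe a format.

```shell
# Hypothetical helper: format a single entry for the reboots file.
# The layout "<time> <host>: <description>" is an assumption, not a
# format required by the build cluster.
reboot_entry() {
    host=$1
    desc=$2
    # Use the caller-supplied time if given, otherwise the current UTC time.
    when=${3:-$(date -u '+%Y-%m-%d %H:%M:%S UTC')}
    printf '%s %s: %s\n' "$when" "$host" "$desc"
}
```

It could be used, for example, as `reboot_entry gohan51 "panic: read errors on ad0" >> /var/portbuild/i386/reboots`, followed by pasting in the relevant console output.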
For i386 gohan clients, scrub the disk by touching /SCRUB in the nfsroot (e.g. /a/nfs/8.dir1/SCRUB) and rebooting. On reboot the client will dd if=/dev/zero of=/dev/ad0, which forces the drive to remap any bad sectors it finds, provided it has enough spares left. This is a temporary measure to extend the lifetime of a drive that is on the way out.
Note: On the i386 blade systems, another sign of a failing disk seems to be that the blade hangs completely and does not respond to a console break or even an NMI.
Skip this step on other build systems that do not newfs their disk at boot (e.g. amd64 systems).
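For the gohan case, the scrub request can be sketched as a small shell function run on the machine that holds the nfsroot. The function name `request_scrub` is hypothetical; it only creates the SCRUB flag file, and the client still has to be rebooted separately.

```shell
# Sketch: flag a gohan client for a disk scrub on its next boot.
# $1 is the client's nfsroot directory, e.g. /a/nfs/8.dir1.
request_scrub() {
    nfsroot=$1
    if [ ! -d "$nfsroot" ]; then
        echo "no such nfsroot: $nfsroot" >&2
        return 1
    fi
    # The presence of the SCRUB file is what triggers the scrub at boot.
    touch "$nfsroot/SCRUB"
}
```

For example, `request_scrub /a/nfs/8.dir1` followed by a reboot of the affected client.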
If the problem recurs, the disk is probably toast. Take the machine out of mlist and (for ATA disks) run a long self-test with smartctl:
smartctl -t long /dev/ad0
The test takes about half an hour:
gohan51# smartctl -t long /dev/ad0
smartctl version 5.38 [i386-portbld-freebsd8.0] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 31 minutes for test to complete.
Test will complete after Fri Jul 4 03:59:56 2008
Use smartctl -X to abort test.
Once the test finishes, smartctl -a /dev/ad0 shows the result:
# SMART Self-test log structure revision number 1
# Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1    Extended offline    Completed: read failure  80%        15252            319286
The output also includes other data, such as a log of previous drive errors. Because of sector remapping, a drive can show earlier DMA errors and still pass the self-test.
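To check the self-test result without reading the whole report, the output can be filtered for the most recent self-test entry. The sketch below is an illustration only: the function name `selftest_status` and its result words are hypothetical, and it assumes the newest entry is the line numbered "# 1" in the self-test log, as in the sample output above.

```shell
# Read `smartctl -a` output on stdin and classify the most recent
# self-test entry (the line numbered "# 1" in the self-test log).
# Prints one of: passed, failed, unknown, no-selftest.
selftest_status() {
    awk '
        /^# *1 / {
            found = 1
            if ($0 ~ /Completed without error/) print "passed"
            else if ($0 ~ /failure/) print "failed"
            else print "unknown"    # e.g. aborted or still in progress
            exit
        }
        END { if (!found) print "no-selftest" }
    '
}
```

For example, `smartctl -a /dev/ad0 | selftest_status` would print `failed` for the read failure shown above. A `passed` result does not rule out earlier errors in the drive's error log.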
When a disk has failed, please inform the cluster administrators so we can try to get it replaced.