Once a diagnosis has been made, the diagnosis is output in the form of a list.suspect event. A list.suspect event is an event comprised of one or more possible fault or defect events. Sometimes the diagnosis cannot narrow the cause of errors to a single fault or defect. For example, the underlying problem might be a broken wire connecting controllers to the main system bus. The problem might be with a component on the bus or with the bus itself. In this specific case, the list.suspect event will contain multiple fault events: one for each controller attached to the bus, and one for the bus itself.
In addition to describing the fault that was diagnosed, a fault event also contains four payload members for which the diagnosis is applicable.
The resource is the component that was diagnosed as faulty. The fmdump(1M) command shows this payload member as “Problem in.”
The Automated System Recovery Unit (ASRU) is the hardware or software component that must be disabled to prevent further error symptoms from occurring. The fmdump(1M) command shows this payload member as “Affects.”
The Field Replaceable Unit (FRU) is the component that must be replaced or repaired to fix the underlying problem.
The Label payload is a string that gives the location of the FRU in the same form as it is printed on the chassis or motherboard, for example next to a DIMM slot or PCI card slot. The fmdumpcommand shows this payload member as “Location.”
For example, after receiving a certain number of ECC correctable errors in a given amount of time for a particular memory location, the CPU and memory diagnosis engine issues a diagnosis (list.suspect event) for a faulty DIMM.
# fmdump -v -u 38bd6f1b-a4de-4c21-db4e-ccd26fa8573c TIME UUID SUNW-MSG-ID Oct 31 13:40:18.1864 38bd6f1b-a4de-4c21-db4e-ccd26fa8573c AMD-8000-8L 100% fault.cpu.amd.icachetag Problem in: hc:///motherboard=0/chip=0/cpu=0 Affects: cpu:///cpuid=0 FRU: hc:///motherboard=0/chip=0 Location: SLOT 2 |
In this example, fmd(1M) has identified a problem in a resource, specifically a CPU (hc:///motherboard=0/chip=0/cpu=0). To suppress further error symptoms and to prevent an uncorrectable error from occurring, an ASRU, (cpu:///cpuid=0), is identified for retirement. The component that needs to be replaced is the FRU (hc:///motherboard=0/chip=0).