Chapter 9. ZFS Troubleshooting and Data Recovery

Table of Contents

ZFS Failure Modes
Missing Devices in a ZFS Storage Pool
Damaged Devices in a ZFS Storage Pool
Corrupted ZFS Data
Checking ZFS Data Integrity
Data Repair
Data Validation
Controlling ZFS Data Scrubbing
Identifying Problems in ZFS
Determining if Problems Exist in a ZFS Storage Pool
Understanding zpool status Output
System Reporting of ZFS Error Messages
Repairing a Damaged ZFS Configuration
Repairing a Missing Device
Physically Reattaching the Device
Notifying ZFS of Device Availability
Repairing a Damaged Device
Determining the Type of Device Failure
Clearing Transient Errors
Replacing a Device in a ZFS Storage Pool
Repairing Damaged Data
Identifying the Type of Data Corruption
Repairing a Corrupted File or Directory
Repairing ZFS Storage Pool-Wide Damage
Repairing an Unbootable System

ZFS Failure Modes

As a combined file system and volume manager, ZFS can exhibit many different failure modes. This chapter begins by outlining the various failure modes, then discusses how to identify them on a running system. This chapter concludes by discussing how to repair the problems. ZFS can encounter three basic types of errors:

  • Missing devices

  • Damaged devices

  • Corrupted data

Note that a single pool can experience all three errors, so a complete repair procedure involves finding and correcting one error, proceeding to the next error, and so on.

Missing Devices in a ZFS Storage Pool

If a device is completely removed from the system, ZFS detects that the device cannot be opened and places it in the FAULTED state. Depending on the data replication level of the pool, this might or might not result in the entire pool becoming unavailable. If one disk in a mirrored or RAID-Z device is removed, the pool continues to be accessible. If all components of a mirror are removed, if more than one device in a RAID-Z device is removed, or if a single-disk, top-level device is removed, the pool becomes FAULTED. No data is accessible until the device is reattached.
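
The availability rules above can be sketched as a small model. This is an illustrative Python sketch under stated assumptions, not ZFS code: the names Vdev, vdev_available, and pool_available are hypothetical, RAID-Z is treated as single-parity as described in the text, and the pool is assumed to be accessible only when every top-level device is accessible.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Vdev:
    """Hypothetical description of one top-level virtual device."""
    kind: str     # "disk", "mirror", or "raidz" (single parity assumed)
    total: int    # number of component devices
    missing: int  # how many components cannot be opened


def vdev_available(v: Vdev) -> bool:
    """Return True if the top-level device can still serve data."""
    if v.kind == "disk":
        # A single-disk, top-level device has no redundancy.
        return v.missing == 0
    if v.kind == "mirror":
        # A mirror survives as long as at least one component remains.
        return v.missing < v.total
    if v.kind == "raidz":
        # Single-parity RAID-Z tolerates at most one missing device.
        return v.missing <= 1
    raise ValueError(f"unknown vdev kind: {v.kind!r}")


def pool_available(vdevs: Iterable[Vdev]) -> bool:
    """The pool is accessible only if every top-level device is."""
    return all(vdev_available(v) for v in vdevs)
```

For example, pool_available([Vdev("mirror", 2, 1)]) is True, while a pool that contains Vdev("raidz", 5, 2) or Vdev("disk", 1, 1) is reported as unavailable by this model.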

Damaged Devices in a ZFS Storage Pool

The term “damaged” covers a wide variety of possible errors. Examples include the following errors:

  • Transient I/O errors due to a bad disk or controller

  • On-disk data corruption due to cosmic rays

  • Driver bugs resulting in data being transferred to or from the wrong location

  • Another user accidentally overwriting portions of the physical device

In some cases, these errors are transient, such as a random I/O error while the controller is having problems. In other cases, the damage is permanent, such as on-disk corruption. Even so, permanent damage does not necessarily indicate that the error is likely to occur again. For example, if an administrator accidentally overwrites part of a disk, no hardware failure has occurred, and the device need not be replaced. Identifying exactly what went wrong with a device is not an easy task and is covered in more detail in a later section.

Corrupted ZFS Data

Data corruption occurs when one or more device errors (indicating missing or damaged devices) affect a top-level virtual device. Errors confined to a single redundant component do not corrupt data by themselves. For example, one half of a mirror can experience thousands of device errors without ever causing data corruption. If an error is encountered on the other side of the mirror in the exact same location, the result is corrupted data.
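
The mirror example can be illustrated with a toy model. In this Python sketch, each side of the mirror is represented only by the set of block locations where it returned an error; the function name corrupted_blocks and the idea of plain integer block locations are assumptions made for illustration and do not reflect how ZFS tracks errors internally.

```python
from typing import Iterable, Set


def corrupted_blocks(*sides: Iterable[int]) -> Set[int]:
    """Return the locations that are bad on every side of a mirror.

    A block is lost only if no side holds a good copy, so the result
    is the intersection of the per-side error locations.
    """
    if not sides:
        return set()
    return set.intersection(*(set(side) for side in sides))


# One side has thousands of errors, the other has only two.
side_a = range(0, 5000)
side_b = {7, 9000}

# Only location 7 is bad on both sides, so only that block is corrupted.
lost = corrupted_blocks(side_a, side_b)
```

In this model, thousands of errors on one side leave the data intact as long as the other side is clean; data is lost only where the error locations coincide.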

Data corruption is always permanent and requires special consideration during repair. Even if the underlying devices are repaired or replaced, the original data is lost forever. Most often this scenario requires restoring data from backups. Data errors are recorded as they are encountered, and can be controlled through regular disk scrubbing as explained in the following section. When a corrupted block is removed, the next scrubbing pass recognizes that the corruption is no longer present and removes any trace of the error from the system.