ZFS is a fundamentally different file system because it is more than just a file system. ZFS combines the roles of file system and volume manager, enabling additional storage devices to be added to a live system and making the new space available on all of the existing file systems in that pool immediately. By combining the traditionally separate roles, ZFS is able to overcome previous limitations that prevented RAID groups from being able to grow. Each top level device in a zpool is called a vdev, which can be a simple disk or a RAID transformation such as a mirror or RAID-Z array. ZFS file systems (called datasets) each have access to the combined free space of the entire pool. As blocks are allocated from the pool, the space available to each file system decreases. This approach avoids the common pitfall with extensive partitioning, where free space becomes fragmented across the partitions.
zpool | A storage pool is the most
basic building block of ZFS. A pool
is made up of one or more vdevs, the underlying devices
that store the data. A pool is then used to create one
or more file systems (datasets) or block devices
(volumes). These datasets and volumes share the pool of
remaining free space. Each pool is uniquely identified
by a name and a GUID. The features
available are determined by the ZFS
version number on the pool.
Note: FreeBSD 9.0 and 9.1 include support for ZFS version 28. Later versions use ZFS version 5000 with feature flags. The new feature flags system allows greater cross-compatibility with other implementations of ZFS. |
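For example, the ZFS version of an existing pool (the name mypool is only an example) and the versions and feature flags supported by the running system can be listed with:
# zpool get version mypool
# zpool upgrade -v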
vdev Types | A pool is made up of one or more vdevs, which
themselves can be a single disk or a group of disks, in
the case of a RAID transform. When
multiple vdevs are used, ZFS spreads
data across the vdevs to increase performance and
maximize usable space.
|
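As a brief sketch (the pool and disk names are examples), a pool can be created from a single mirror vdev and later grown by adding a second mirror vdev, after which ZFS stripes new data across both:
# zpool create mypool mirror ada1 ada2
# zpool add mypool mirror ada3 ada4
# zpool status mypool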
Transaction Group (TXG) | Transaction Groups are the way changed blocks are
grouped together and eventually written to the pool.
Transaction groups are the atomic unit that
ZFS uses to assert consistency. Each
transaction group is assigned a unique 64-bit
consecutive identifier. There can be up to three active
transaction groups at a time, one in each of these three
states: open (the group accepts new writes), quiescing
(pending operations are allowed to finish while no new
writes are accepted), and syncing (all data in the group
is written to stable storage). All administrative
functions, such as snapshot,
are written as part of the transaction group. When a
synctask is created, it is added to the currently open
transaction group, and that group is advanced as quickly
as possible to the syncing state to reduce the
latency of administrative commands. |
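The maximum time before an open transaction group is advanced to the syncing state is controlled by a sysctl(8) tunable, which can be inspected or adjusted (the value is in seconds and only an example):
# sysctl vfs.zfs.txg.timeout
# sysctl vfs.zfs.txg.timeout=5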
Adaptive Replacement Cache (ARC) | ZFS uses an Adaptive Replacement Cache (ARC), rather than a more traditional Least Recently Used (LRU) cache. An LRU cache is a simple list of items in the cache, sorted by when each object was most recently used. New items are added to the top of the list. When the cache is full, items from the bottom of the list are evicted to make room for more active objects. An ARC consists of four lists: the Most Recently Used (MRU) and Most Frequently Used (MFU) objects, plus a ghost list for each. These ghost lists track recently evicted objects to prevent them from being added back to the cache. This increases the cache hit ratio by avoiding objects that have a history of only being used occasionally. Another advantage of using both an MRU and an MFU is that scanning an entire file system would normally evict all data from an MRU or LRU cache in favor of this freshly accessed content. With ZFS, there is also an MFU that only tracks the most frequently used objects, and the cache of the most commonly accessed blocks remains. |
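On FreeBSD, the current ARC size and its configured upper limit can be inspected with sysctl(8):
# sysctl kstat.zfs.misc.arcstats.size
# sysctl vfs.zfs.arc_max
The limit can be lowered persistently by adding a line such as vfs.zfs.arc_max="4294967296" (4 GB, an example value) to /boot/loader.conf.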
L2ARC | L2ARC is the second level
of the ZFS caching system. The
primary ARC is stored in
RAM. Since the amount of
available RAM is often limited,
ZFS can also use
cache vdevs.
Solid State Disks (SSDs) are often
used as these cache devices due to their higher speed
and lower latency compared to traditional spinning
disks. L2ARC is entirely optional,
but having one will significantly increase read speeds
for files that are cached on the SSD
instead of having to be read from the regular disks.
L2ARC can also speed up deduplication
because a DDT that does not fit in
RAM but does fit in the
L2ARC will be much faster than a
DDT that must be read from disk. The
rate at which data is added to the cache devices is
limited to prevent prematurely wearing out
SSDs with too many writes. Until the
cache is full (the first block has been evicted to make
room), writing to the L2ARC is
limited to the sum of the write limit and the boost
limit, and afterwards limited to the write limit. A
pair of sysctl(8) values control these rate limits.
vfs.zfs.l2arc_write_max
controls how many bytes are written to the cache per
second, while vfs.zfs.l2arc_write_boost
adds to this limit during the
“Turbo Warmup Phase” (Write Boost). |
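A cache vdev is added to an existing pool with zpool add (the device name is an example), and the two write limits can be inspected with sysctl(8):
# zpool add mypool cache ada3
# sysctl vfs.zfs.l2arc_write_max
# sysctl vfs.zfs.l2arc_write_boost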
ZIL | ZIL accelerates synchronous transactions by using storage devices like SSDs that are faster than those used in the main storage pool. When an application requests a synchronous write (a guarantee that the data has been safely stored to disk rather than merely cached to be written later), the data is written to the faster ZIL storage, then later flushed out to the regular disks. This greatly reduces latency and improves performance. Only synchronous workloads like databases will benefit from a ZIL. Regular asynchronous writes such as copying files will not use the ZIL at all. |
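A dedicated log device is added in the same way as a cache device; using a mirrored pair protects the log itself against the failure of a single device (names are examples):
# zpool add mypool log ada4
or, mirrored:
# zpool add mypool log mirror ada4 ada5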
Copy-On-Write | Unlike a traditional file system, when data is overwritten on ZFS, the new data is written to a different block rather than overwriting the old data in place. Only when this write is complete is the metadata then updated to point to the new location. In the event of a shorn write (a system crash or power loss in the middle of writing a file), the entire original contents of the file are still available and the incomplete write is discarded. This also means that ZFS does not require a fsck(8) after an unexpected shutdown. |
Dataset | Dataset is the generic term
for a ZFS file system, volume,
snapshot or clone. Each dataset has a unique name in
the format
poolname/path@snapshot .
The root of the pool is technically a dataset as well.
Child datasets are named hierarchically like
directories. For example,
mypool/home , the home
dataset, is a child of mypool
and inherits properties from it. This can be expanded
further by creating
mypool/home/user . This
grandchild dataset will inherit properties from the
parent and grandparent. Properties on a child can be
set to override the defaults inherited from the parents
and grandparents. Administration of datasets and their
children can be
delegated. |
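As a short example of hierarchical naming and property inheritance (using the dataset names above):
# zfs create mypool/home
# zfs create mypool/home/user
# zfs set atime=off mypool/home
# zfs get atime mypool/home/user
The last command reports the atime value as inherited from mypool/home.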
File system | A ZFS dataset is most often used as a file system. Like most other file systems, a ZFS file system is mounted somewhere in the system's directory hierarchy and contains files and directories of its own with permissions, flags, and other metadata. |
Volume | In addition to regular file system datasets, ZFS can also create volumes, which are block devices. Volumes have many of the same features, including copy-on-write, snapshots, clones, and checksumming. Volumes can be useful for running other file system formats on top of ZFS, such as UFS virtualization, or exporting iSCSI extents. |
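For example, a 4 GB volume can be created, formatted with UFS, and mounted (the names and size are examples):
# zfs create -V 4G mypool/vol0
# newfs /dev/zvol/mypool/vol0
# mount /dev/zvol/mypool/vol0 /mnt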
Snapshot | The
copy-on-write
(COW) design of
ZFS allows for nearly instantaneous,
consistent snapshots with arbitrary names. After taking
a snapshot of a dataset, or a recursive snapshot of a
parent dataset that will include all child datasets, new
data is written to new blocks, but the old blocks are
not reclaimed as free space. The snapshot contains
the original version of the file system, and the live
file system contains any changes made since the snapshot
was taken. No additional space is used. As new data is
written to the live file system, new blocks are
allocated to store this data. The apparent size of the
snapshot will grow as the blocks are no longer used in
the live file system, but only in the snapshot. These
snapshots can be mounted read only to allow for the
recovery of previous versions of files. It is also
possible to
rollback a live
file system to a specific snapshot, undoing any changes
that took place after the snapshot was taken. Each
block in the pool has a reference counter which keeps
track of how many snapshots, clones, datasets, or
volumes make use of that block. As files and snapshots
are deleted, the reference count is decremented. When a
block is no longer referenced, it is reclaimed as free
space. Snapshots can also be marked with a
hold. When a
snapshot is held, any attempt to destroy it will return
an EBUSY error. Each snapshot can
have multiple holds, each with a unique name. The
release command
removes the hold so the snapshot can be deleted. Snapshots
can be taken on volumes, but they can only be cloned or
rolled back, not mounted independently. |
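A brief sketch of the snapshot operations described above, using example names:
# zfs snapshot mypool/home/user@friday
# zfs snapshot -r mypool@backup
# zfs rollback mypool/home/user@friday
# zfs hold keepme mypool/home/user@friday
# zfs release keepme mypool/home/user@friday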
Clone | Snapshots can also be cloned. A clone is a writable version of a snapshot, allowing the file system to be forked as a new dataset. As with a snapshot, a clone initially consumes no additional space. As new data is written to a clone and new blocks are allocated, the apparent size of the clone grows. When blocks are overwritten in the cloned file system or volume, the reference count on the previous block is decremented. The snapshot upon which a clone is based cannot be deleted because the clone depends on it. The snapshot is the parent, and the clone is the child. Clones can be promoted, reversing this dependency and making the clone the parent and the previous parent the child. This operation requires no additional space. Because the amount of space used by the parent and child is reversed, existing quotas and reservations might be affected. |
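For example (names are illustrative), a snapshot can be cloned into a new writable dataset and the clone later promoted:
# zfs clone mypool/home/user@friday mypool/home/user-copy
# zfs promote mypool/home/user-copy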
Checksum | Every block that is allocated is also checksummed.
The checksum algorithm used is a per-dataset property,
see set .
The checksum of each block is transparently validated as
it is read, allowing ZFS to detect
silent corruption. If the data that is read does not
match the expected checksum, ZFS will
attempt to recover the data from any available
redundancy, like mirrors or RAID-Z.
Validation of all checksums can be triggered with scrub.
Checksum algorithms include fletcher2,
fletcher4, and sha256. The
fletcher algorithms are faster,
but sha256 is a strong cryptographic
hash and has a much lower chance of collisions at the
cost of some performance. Checksums can be disabled,
but it is not recommended. |
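The checksum algorithm is set per dataset, for example:
# zfs set checksum=sha256 mypool/home
# zfs get checksum mypool/home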
Compression | Each dataset has a compression property, which
defaults to off. This property can be set to one of a
number of compression algorithms. This will cause all
new data that is written to the dataset to be
compressed. Beyond a reduction in space used, read and
write throughput often increases because fewer blocks
are read or written.
|
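For example, enabling compression on a dataset and checking the achieved ratio (lz4 assumes a pool with the lz4_compress feature; gzip or on can be used instead):
# zfs set compression=lz4 mypool/home
# zfs get compression,compressratio mypool/home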
Copies | When set to a value greater than 1, the
copies property instructs
ZFS to maintain multiple copies of
each block in the
File System
or
Volume. Setting
this property on important datasets provides additional
redundancy from which to recover a block that does not
match its checksum. In pools without redundancy, the
copies feature is the only form of redundancy. The
copies feature can recover from a single bad sector or
other forms of minor corruption, but it does not protect
the pool from the loss of an entire disk. |
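For example (names are illustrative):
# zfs set copies=2 mypool/home/important
Only data written after the property is set is stored multiple times; existing blocks are not rewritten.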
Deduplication | Checksums make it possible to detect duplicate
blocks of data as they are written. With deduplication,
the reference count of an existing, identical block is
increased, saving storage space. To detect duplicate
blocks, a deduplication table (DDT)
is kept in memory. The table contains a list of unique
checksums, the location of those blocks, and a reference
count. When new data is written, the checksum is
calculated and compared to the list. If a match is
found, the existing block is used. The
SHA256 checksum algorithm is used
with deduplication to provide a secure cryptographic
hash. Deduplication is tunable. If
dedup is on , then
a matching checksum is assumed to mean that the data is
identical. If dedup is set to
verify , then the data in the two
blocks will be checked byte-for-byte to ensure it is
actually identical. If the data is not identical, the
hash collision will be noted and the two blocks will be
stored separately. Because DDT must
store the hash of each unique block, it consumes a very
large amount of memory. A general rule of thumb is
5-6 GB of RAM per 1 TB of deduplicated data.
In situations where it is not practical to have enough
RAM to keep the entire
DDT in memory, performance will
suffer greatly as the DDT must be
read from disk before each new block is written.
Deduplication can use L2ARC to store
the DDT, providing a middle ground
between fast system memory and slower disks. Consider
using compression instead, which often provides nearly
as much space savings without the additional memory
requirement. |
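For example, enabling deduplication with byte-for-byte verification on a dataset and checking the pool-wide deduplication ratio (names are examples):
# zfs set dedup=verify mypool/data
# zpool list mypool
The DEDUP column of zpool list shows the ratio achieved.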
Scrub | Instead of a consistency check like fsck(8),
ZFS has scrub .
scrub reads all data blocks stored on
the pool and verifies their checksums against the known
good checksums stored in the metadata. A periodic check
of all the data stored on the pool ensures the recovery
of any corrupted blocks before they are needed. A scrub
is not required after an unclean shutdown, but is
recommended at least once every three months. The
checksum of each block is verified as blocks are read
during normal use, but a scrub makes certain that even
infrequently used blocks are checked for silent
corruption. Data security is improved, especially in
archival storage situations. The relative priority of
scrub can be adjusted with vfs.zfs.scrub_delay
to prevent the scrub from degrading the performance of
other workloads on the pool. |
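A scrub is started manually and its progress monitored with (the pool name is an example):
# zpool scrub mypool
# zpool status mypool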
Dataset Quota | ZFS provides very fast and
accurate dataset, user, and group space accounting in
addition to quotas and space reservations. This gives
the administrator fine grained control over how space is
allocated and allows space to be reserved for critical
file systems.
ZFS supports different types of quotas: the dataset quota, the reference quota (refquota), the user quota, and the group quota. Quotas limit the amount of space that a dataset and all of its descendants, including snapshots of the dataset, child datasets, and the snapshots of those datasets, can consume. Note: Quotas cannot be set on volumes, as the volsize property acts as an implicit quota.
|
Reference Quota | A reference quota limits the amount of space a dataset can consume by enforcing a hard limit. However, this hard limit includes only space that the dataset references and does not include space used by descendants, such as file systems or snapshots. |
User Quota | User quotas are useful to limit the amount of space that can be used by the specified user. |
Group Quota | The group quota limits the amount of space that a specified group can consume. |
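Brief sketches of the four quota types, using example names and sizes:
# zfs set quota=10G mypool/home
# zfs set refquota=10G mypool/home/user
# zfs set userquota@alice=5G mypool/home
# zfs set groupquota@staff=20G mypool/home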
Dataset Reservation | The reservation property makes
it possible to guarantee a minimum amount of space for a
specific dataset and its descendants. If a 10 GB
reservation is set on
storage/home/bob , and another
dataset tries to use all of the free space, at least
10 GB of space is reserved for this dataset. If a
snapshot is taken of
storage/home/bob , the space used by
that snapshot is counted against the reservation. The
refreservation
property works in a similar way, but it
excludes descendants like
snapshots.
Reservations of any sort are useful in many situations, such as planning and testing the suitability of disk space allocation in a new system, or ensuring that enough space is available on file systems for audio logs or system recovery procedures and files. |
Reference Reservation | The refreservation property
makes it possible to guarantee a minimum amount of
space for the use of a specific dataset
excluding its descendants. This
means that if a 10 GB reservation is set on
storage/home/bob , and another
dataset tries to use all of the free space, at least
10 GB of space is reserved for this dataset. In
contrast to a regular
reservation,
space used by snapshots and descendant datasets is not
counted against the reservation. For example, if a
snapshot is taken of
storage/home/bob , enough disk space
must exist outside of the
refreservation amount for the
operation to succeed. Descendants of the main dataset
are not counted in the refreservation
amount and so do not encroach on the space set. |
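For example, using the dataset named above:
# zfs set reservation=10G storage/home/bob
# zfs set refreservation=10G storage/home/bob
# zfs get reservation,refreservation storage/home/bob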
Resilver | When a disk fails and is replaced, the new disk must be filled with the data that was lost. The process of using the parity information distributed across the remaining drives to calculate and write the missing data to the new drive is called resilvering. |
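Replacing a failed disk starts a resilver automatically; progress is shown by zpool status (device names are examples):
# zpool replace mypool ada1 ada4
# zpool status mypool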
Online | A pool or vdev in the Online
state has all of its member devices connected and fully
operational. Individual devices in the
Online state are functioning
normally. |
Offline | Individual devices can be put in an
Offline state by the administrator if
there is sufficient redundancy to avoid putting the pool
or vdev into a
Faulted state.
An administrator may choose to offline a disk in
preparation for replacing it, or to make it easier to
identify. |
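For example, taking a disk offline before physically replacing it, then bringing it back online (names are examples):
# zpool offline mypool ada1
# zpool online mypool ada1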
Degraded | A pool or vdev in the Degraded
state has one or more disks that have been disconnected
or have failed. The pool is still usable, but if
additional devices fail, the pool could become
unrecoverable. Reconnecting the missing devices or
replacing the failed disks will return the pool to an
Online state
after the reconnected or new device has completed the
Resilver
process. |
Faulted | A pool or vdev in the Faulted
state is no longer operational. The data on it can no
longer be accessed. A pool or vdev enters the
Faulted state when the number of
missing or failed devices exceeds the level of
redundancy in the vdev. If missing devices can be
reconnected, the pool will return to an
Online state. If
there is insufficient redundancy to compensate for the
number of failed disks, then the contents of the pool
are lost and must be restored from backups. |