NOTE: the last update of this page is dated 2013, before the Firefly release. The details of the implementation may be different.
The purpose of the PG Backend interface is to abstract over the differences between replication and erasure coding as failure recovery mechanisms.
Much of the existing PG logic, particularly that for dealing with peering, will be common to each. With both schemes, a log of recent operations will be used to direct recovery in the event that an OSD is down or disconnected for a brief period of time. Similarly, in both cases it will be necessary to scan a recovered copy of the PG in order to recover an empty OSD. The PGBackend abstraction must be sufficiently expressive for Replicated and ErasureCoded backends to be treated uniformly in these areas.
However, there are also crucial differences between using replication and erasure coding which PGBackend must abstract over:
The current PG implementation performs a write by applying it locally while concurrently directing the replicas to perform the same operation. Once the local write and all replica writes are durable, the operation as a whole is considered durable. Because these writes may be destructive overwrites, during peering a log entry on a replica (or on the primary) may be found to be divergent if that replica remembers a log event which the authoritative log does not contain. This can happen if, say, only 1 out of 3 replicas persisted an operation and that replica was not available in the next interval to provide an authoritative log. With replication, we can repair a divergent object as long as at least 1 replica has a current copy of it. With erasure coding, however, it might be the case that neither the new version of the object nor the old version has enough available chunks to be reconstructed. This problem becomes much simpler if we arrange for all supported operations to be locally roll-back-able.
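To illustrate, here is a minimal sketch of locally roll-back-able operations (the names here, such as RollbackStore, are hypothetical and not actual Ceph interfaces): each append or delete records just enough local state to undo it, so a divergent log tail can be rolled back during peering rather than re-replicated.

```python
class RollbackStore:
    """Toy object store where objects support only append and delete,
    so every operation records enough local state to be undone."""

    def __init__(self):
        self.objects = {}  # name -> bytes
        self.undo = []     # stack of (version, undo record)

    def append(self, version, name, data):
        existed = name in self.objects
        old_len = len(self.objects.get(name, b""))
        self.objects[name] = self.objects.get(name, b"") + data
        # Rolling back an append just trims back to the old length.
        self.undo.append((version, ("trim", name, existed, old_len)))

    def delete(self, version, name):
        # Keep the old contents locally so the delete can be undone.
        old = self.objects.pop(name, None)
        self.undo.append((version, ("restore", name, old)))

    def rollback_to(self, version):
        """Undo every operation newer than `version` (a divergent tail)."""
        while self.undo and self.undo[-1][0] > version:
            _, record = self.undo.pop()
            if record[0] == "trim":
                _, name, existed, old_len = record
                if existed:
                    self.objects[name] = self.objects[name][:old_len]
                else:
                    del self.objects[name]
            else:  # "restore"
                _, name, old = record
                if old is not None:
                    self.objects[name] = old
```

A divergent replica can then be brought back to the chosen authoritative last_update with a single rollback_to() call, instead of copying whole objects.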
Core Changes:
PGBackend Interfaces:
Currently, we select the log with the newest last_update and the longest tail as the authoritative log. This is fine because we are not generally able to roll operations on the other replicas forward or backward; instead, we rely on our ability to re-replicate divergent objects. With the write approach discussed in the previous section, however, the erasure coded backend will rely on being able to roll back divergent operations, since we may not be able to re-replicate divergent objects. Thus, we must choose the oldest last_update from the last interval which went active in order to minimize the number of divergent objects.
The difficulty is that the current code assumes that as long as it has an info from at least 1 OSD from the prior interval, it can complete peering. In order to ensure that we do not end up with an unrecoverably divergent object, a K+M erasure coded PG must hear from at least K of the replicas of the last interval to serve writes. This ensures that we will select a last_update old enough to roll back at least K replicas. If a replica with an older last_update comes along later, we will be able to provide at least K chunks of any divergent object.
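The peering rule above can be sketched as follows (a hypothetical illustration; the function name is not an actual Ceph interface): with fewer than K infos from the prior interval we cannot safely serve writes, and otherwise the rollback point is the oldest last_update heard, so every survivor rolls back rather than forward.

```python
def choose_rollback_point(last_updates, k):
    """last_updates: the last_update reported by each surviving replica
    of the last interval that went active; k: data chunks per object.

    A K+M erasure coded PG must hear from at least K replicas of that
    interval; the authoritative point is the oldest last_update among
    them, so at least K chunks of any divergent object remain available."""
    if len(last_updates) < k:
        return None  # cannot complete peering / serve writes yet
    return min(last_updates)
```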
Core Changes:
PGBackend interfaces:
Currently, an OSD is able to request a temp acting set mapping so that an up-to-date OSD can serve requests while a new primary is backfilled (and for other reasons). An erasure coded PG needs to be able to designate a primary for these reasons without placing it in the first position of the acting set, and it also needs to be able to leave holes in the requested acting set.
Core Changes:
Reads with the replicated strategy can always be satisfied synchronously out of the primary OSD. With an erasure coded strategy, the primary will need to request data from some number of replicas in order to satisfy a read. The perform_read() interface for PGBackend therefore will be async.
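A sketch of what an asynchronous read path might look like (asyncio stands in for the OSD messaging layer, and the function names are illustrative, not the actual PGBackend signatures): the primary issues K chunk reads concurrently and reassembles the object once all of them complete.

```python
import asyncio

async def read_chunk(shards, shard_id):
    # Stand-in for a network round trip to the OSD holding this shard.
    await asyncio.sleep(0)
    return shards[shard_id]

async def perform_read(shards, k):
    """Gather k data chunks concurrently, then reassemble the object.
    Assumes a systematic code where shards 0..k-1 hold the raw stripes."""
    chunks = await asyncio.gather(*(read_chunk(shards, s) for s in range(k)))
    return b"".join(chunks)
```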
PGBackend interfaces:
With the replicated strategy, all replicas of a PG are interchangeable. With erasure coding, different positions in the acting set hold different pieces of the erasure coding scheme and are not interchangeable. Worse, CRUSH might map chunk 2 to an OSD which already happens to contain an (old) copy of chunk 4. The OSD and PG messages therefore need to work in terms of a type like pair<shard_t, pg_t> in order to distinguish different PG chunks on a single OSD.
Because the mapping of object name to object in the filestore must be 1-to-1, we must ensure that the objects in chunk 2 and the objects in chunk 4 have different names. To that end, the filestore must include the chunk id in the object key.
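The shard-qualified identifiers can be sketched as follows (a hypothetical illustration; the exact key layout in the filestore is not specified here): PG chunks are addressed as (shard, pg) pairs, and object keys fold in the chunk id so that two chunks of the same object can coexist on one OSD.

```python
def shard_pg_id(pg, shard):
    """Identify a PG chunk as a (shard, pg) pair, analogous to
    pair<shard_t, pg_t>, rather than a bare pg id."""
    return (shard, pg)

def shard_object_key(pg, shard, name):
    """Filestore key that includes the chunk id, so an old copy of
    chunk 4 and a new copy of chunk 2 get distinct names on one OSD."""
    return f"{pg}.s{shard}/{name}"
```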
Core changes:
We probably won’t support object classes at first on Erasure coded backends.
We currently have two scrub modes with different default frequencies: a regular (shallow) scrub, which compares object metadata across replicas, and a deep scrub, which additionally reads and checksums object contents.
The primary requests a scrubmap from each replica for a particular range of objects. The replica fills out this scrubmap for the range of objects including, if the scrub is deep, a crc32 of the contents of each object. The primary gathers these scrubmaps from each replica and performs a comparison identifying inconsistent objects.
Most of this can work essentially unchanged with erasure coded PGs, with the caveats that the PGBackend implementation must be in charge of actually doing the scan, and that it should be able to attach arbitrary information so that the PGBackend on the primary can scrub PGBackend-specific metadata.
The main catch for erasure coded PGs, however, is that sending a crc32 of the stored chunk on a replica isn't particularly helpful, since the chunks on different replicas store different data. Because we don't support overwrites except via DELETE, we have the option of maintaining a crc32 on each chunk through each append. Each replica then simply computes a crc32 of its own stored chunk, compares it with the locally stored checksum, and reports to the primary whether the two match.
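The append-maintained checksum can be sketched like this (a hypothetical illustration using zlib's incremental crc32; in the real backend the checksum would be persisted alongside the chunk): each append folds the new bytes into the running checksum, so deep scrub only needs to re-checksum the local chunk and compare.

```python
import zlib

class Chunk:
    """Append-only chunk carrying a running crc32 of its contents."""

    def __init__(self):
        self.data = b""
        self.crc = 0

    def append(self, piece):
        self.data += piece
        # crc32 is incremental: fold the new bytes into the stored crc.
        self.crc = zlib.crc32(piece, self.crc)

    def deep_scrub_ok(self):
        # Each replica reports match/mismatch to the primary, not the crc,
        # since chunks on different replicas hold different data anyway.
        return zlib.crc32(self.data) == self.crc
```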
PGBackend interfaces:
If CRUSH is unable to generate a replacement for a down member of an acting set, the acting set should have a hole at that position rather than shifting the other elements of the acting set out of position.
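A small sketch of that rule (the function name is illustrative): surviving members keep their original positions, and an unfillable slot is left as a hole rather than letting later shards shift left into the wrong position.

```python
def acting_with_holes(mapped, size):
    """mapped: {position: osd} for the positions CRUSH could fill.
    A position with no live OSD stays None (a hole), preserving the
    shard-to-position correspondence for the erasure code."""
    return [mapped.get(pos) for pos in range(size)]
```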
Core changes:
The logic for recovering an object depends on the backend. With the current replicated strategy, we first pull the object replica to the primary and then concurrently push it out to the replicas. With the erasure coded strategy, we probably want to read the minimum number of replica chunks required to reconstruct the object and push out the replacement chunks concurrently.
Another difference is that an object in an erasure coded PG may be unrecoverable without being unfound: we may know the location of every surviving chunk yet still hold fewer than K of them. The "unfound" concept should therefore probably be renamed to "unrecoverable". The PGBackend implementation will also have to be able to direct the search for PG replicas with chunks of unrecoverable objects, and to determine whether a particular object is recoverable.
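A toy 2+1 XOR code makes both points concrete (this is a hypothetical illustration, not Ceph's erasure coding plugin): recovery reads only the minimum number of chunks needed, and an object with fewer than K surviving chunks is unrecoverable even when every survivor's location is known.

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(d0, d1):
    # Two data shards plus one XOR parity shard (k=2, m=1).
    return {0: d0, 1: d1, 2: xor(d0, d1)}

def is_recoverable(shards, k=2):
    """The generalization of 'unfound': fewer than k surviving chunks
    means the object cannot be reconstructed at all."""
    return len(shards) >= k

def recover(shards, k=2):
    """Rebuild the object from any k of the surviving chunks."""
    if not is_recoverable(shards, k):
        return None
    if 0 in shards and 1 in shards:
        return shards[0] + shards[1]
    if 0 in shards:
        return shards[0] + xor(shards[0], shards[2])  # d1 = d0 ^ parity
    return xor(shards[1], shards[2]) + shards[1]      # d0 = d1 ^ parity
```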
Core changes:
PGBackend interfaces:
For the most part, backfill itself should behave similarly between replicated and erasure coded pools, with two exceptions: (1) an erasure coded pool may need to backfill several OSDs within a single interval, and (2) the backfill peer should not need to occupy a position in the acting set. For 2, we don't really need to place the backfill peer in the acting set for replicated PGs anyway. For 1, PGBackend::choose_backfill() should determine which OSDs are backfilled in a particular interval.
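A Python stand-in for what PGBackend::choose_backfill() might decide (the logic here is a hypothetical sketch, not the actual interface): compare the up and acting sets position by position and pick up to a configured number of stale positions to backfill concurrently.

```python
def choose_backfill(up, acting, max_backfills):
    """up/acting: per-position OSD ids of equal length, None marking a
    hole. Returns the (position, osd) pairs to backfill this interval;
    an erasure coded pool may backfill several positions at once."""
    stale = [(pos, osd) for pos, osd in enumerate(up)
             if osd is not None and acting[pos] != osd]
    return stale[:max_backfills]
```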
Core changes:
PGBackend interfaces: