The ring builder process includes these high-level steps:
The utility calculates the number of partitions to assign to each device based on the weight of the device. For example, with a partition power of 20, the ring has 1,048,576 (2^20) partitions. One thousand devices of equal weight each want 1,048.576 partitions. The devices are sorted by the number of partitions they desire and kept in that order throughout the initialization process.
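The arithmetic in this step can be sketched as follows. This is a minimal illustration, not the Swift implementation: the function name `desired_partitions` and the device names are hypothetical, and the sketch follows the example above by ignoring the replica count.

```python
def desired_partitions(weights, part_power):
    """Return {device: desired partition count}, proportional to weight.

    `weights` maps a (hypothetical) device name to its weight.
    """
    total_parts = 2 ** part_power
    total_weight = sum(weights.values())
    return {dev: total_parts * w / total_weight for dev, w in weights.items()}


# One thousand devices of equal weight, partition power of 20.
weights = {f"d{i}": 100.0 for i in range(1000)}
wanted = desired_partitions(weights, part_power=20)

# Sort devices by the number of partitions they desire, largest first.
order = sorted(wanted, key=wanted.get, reverse=True)
```

Each device in this example ends up wanting about 1,048.576 partitions, which matches the figure given above.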
Note Each device is also assigned a random tiebreaker value that is used when two devices desire the same number of partitions. This tiebreaker is not stored on disk anywhere, so two different rings created with the same parameters will have different partition assignments. For repeatable partition assignments, RingBuilder.rebalance() takes an optional seed value that seeds the Python pseudo-random number generator.
The ring builder assigns each partition replica to the device that desires the most partitions at that point while keeping it as far away as possible from other replicas. The ring builder prefers to assign a replica to a device in a region that does not already have a replica. If no such region is available, the ring builder searches for a device in a different zone, or on a different server. If it does not find one, it looks for a device with no replicas. Finally, if all options are exhausted, the ring builder assigns the replica to the device that has the fewest replicas already assigned.
Note The ring builder assigns multiple replicas to one device only if the ring has fewer devices than it has replicas.
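The placement preference described above can be sketched as a sequence of progressively weaker filters. This is an illustrative simplification under stated assumptions, not the Swift code: `pick_device` and the dictionary keys (`id`, `region`, `zone`, `server`, `parts_wanted`) are hypothetical names chosen for the example.

```python
from collections import Counter


def pick_device(devices, placed):
    """Choose a device for the next replica of one partition.

    devices: list of dicts with hypothetical keys
             id, region, zone, server, parts_wanted.
    placed:  devices that already hold a replica of this partition.
    """
    used_regions = {d["region"] for d in placed}
    used_zones = {(d["region"], d["zone"]) for d in placed}
    used_servers = {d["server"] for d in placed}
    held = Counter(d["id"] for d in placed)  # replicas of this partition per device

    # Tiers from most to least dispersed: new region, new zone,
    # new server, then any device with no replica of this partition.
    tiers = [
        lambda d: d["region"] not in used_regions,
        lambda d: (d["region"], d["zone"]) not in used_zones,
        lambda d: d["server"] not in used_servers,
        lambda d: held[d["id"]] == 0,
    ]
    for accept in tiers:
        candidates = [d for d in devices if accept(d)]
        if candidates:
            # Within a tier, prefer the device that wants the most partitions.
            return max(candidates, key=lambda d: d["parts_wanted"])

    # All options exhausted: fall back to the device with the fewest replicas.
    return min(devices, key=lambda d: held[d["id"]])
```

A caller would typically decrement `parts_wanted` on the chosen device after each assignment so that later choices reflect the updated demand.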
When building a new ring from an old ring, the ring builder recalculates the desired number of partitions that each device wants.
The ring builder unassigns partitions and gathers these partitions for reassignment, as follows:
The ring builder unassigns any assigned partitions from any removed devices and adds these partitions to the gathered list.
The ring builder unassigns any partition replicas that can be spread out for better durability and adds these partitions to the gathered list.
The ring builder unassigns random partitions from any devices that have more partitions than they need and adds these partitions to the gathered list.
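The gathering pass can be sketched as below. This is a simplified illustration that covers only the removed-device and overweight-device cases; the dispersion-driven case is omitted because it depends on the topology rules. The function name `gather`, the data layout, and the use of whole-number `wanted` counts are assumptions made for the example.

```python
import random


def gather(assignment, removed, wanted, seed=None):
    """Unassign replicas and return them as (partition, replica_index) pairs.

    assignment: {partition: [device id per replica]}; unassigned slots become None.
    removed:    set of device ids that were removed from the ring.
    wanted:     {device id: whole number of partitions the device should hold}.
    """
    rng = random.Random(seed)  # a seed makes the random choice repeatable
    gathered = []

    # Unassign every replica that sits on a removed device.
    for part, devs in assignment.items():
        for idx, dev in enumerate(devs):
            if dev in removed:
                devs[idx] = None
                gathered.append((part, idx))

    # Index the replicas still held by each remaining device.
    held = {}
    for part, devs in assignment.items():
        for idx, dev in enumerate(devs):
            if dev is not None:
                held.setdefault(dev, []).append((part, idx))

    # Unassign random replicas from devices that hold more than they want.
    for dev, slots in held.items():
        excess = len(slots) - wanted.get(dev, 0)
        if excess > 0:
            for part, idx in rng.sample(slots, excess):
                assignment[part][idx] = None
                gathered.append((part, idx))

    return gathered
```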
The ring builder reassigns the gathered partitions to devices by using a similar method to the one described previously.
Whenever the ring builder reassigns a partition replica, it records the time of the reassignment. The ring builder uses this value when it gathers partitions for reassignment so that no partition is moved twice within a configurable amount of time. The RingBuilder class knows this configurable amount of time as min_part_hours. The ring builder ignores this restriction for replicas of partitions on removed devices because removal of a device happens on device failure only, and reassignment is the only choice.
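The min_part_hours restriction amounts to a simple eligibility check. The sketch below assumes that the last move is stored as a Unix timestamp in seconds; the function `can_move` and its parameters are hypothetical and do not reflect how the RingBuilder class stores this information internally.

```python
import time


def can_move(last_moved, min_part_hours, on_removed_device=False, now=None):
    """Return True if a partition replica may be gathered for reassignment.

    last_moved: Unix timestamp (seconds) of the partition's last reassignment.
    now:        current time in seconds; defaults to time.time().
    """
    if on_removed_device:
        # Replicas on removed devices must move regardless of the time limit.
        return True
    now = time.time() if now is None else now
    return now - last_moved >= min_part_hours * 3600
```

For example, with `min_part_hours=1`, a partition moved 30 minutes ago stays in place unless its device was removed.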
These steps do not always perfectly rebalance a ring because partitions are gathered for reassignment at random. To help reach a more balanced ring, the rebalance process is repeated until the ring is nearly perfectly balanced (less than 1 percent off) or until the balance does not improve by at least 1 percent (indicating that perfect balance is probably unattainable because of wildly imbalanced zones or too many recently moved partitions).
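The stopping rule for the repeated rebalance can be sketched as a small loop. This sketch makes several assumptions: `rebalance_once` is a hypothetical callable that performs one gather-and-reassign pass and returns the resulting balance in percent, "improve by at least 1 percent" is read as one percentage point, and `max_rounds` is an added safety limit that the text above does not mention.

```python
def rebalance_until_stable(rebalance_once, max_rounds=100):
    """Repeat rebalance passes until the ring is close to balanced.

    Stops when the balance is below 1 percent, or when a pass improves
    the balance by less than 1 percentage point.
    Returns (final balance, number of passes run).
    """
    last = rebalance_once()
    rounds = 1
    while last >= 1.0 and rounds < max_rounds:
        current = rebalance_once()
        rounds += 1
        improvement = last - current
        last = current
        if improvement < 1.0:
            break  # further passes are unlikely to help
    return last, rounds
```

With a sequence of balances 10, 5, 4.5, the loop stops after the third pass because the last pass improved the balance by only 0.5.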