Ok, you’ve made it this far. I’m assuming you more or less understand what CouchDB is and how the application API works. Maybe you’ve deployed an application or two, and now you’re dealing with enough traffic that you need to think about scaling. "Scaling" is an imprecise word; in this chapter we’ll be dealing with one aspect of it: putting together a partitioned, or sharded, cluster that will have to keep growing, at an increasing rate, from day one.
In this chapter we’ll look at request and response dispatch in a CouchDB cluster with stable nodes. Then we’ll cover how to add redundant hot-failover twin nodes, so you don’t have to worry about losing machines. In a large cluster you can count on 5-10% of your machines experiencing some sort of failure or reduced performance, so cluster design must prevent node failures from impacting reliability. Finally we’ll look at adjusting cluster layout dynamically by splitting or merging nodes using replication.
CouchDB’s storage model uses unique ids to save and retrieve documents.
CouchDB-Lounge allocates a portion of the keyspace to each node, so that it can address as many nodes as you have, distributing document requests (GETs and PUTs) across the CouchDB nodes. By allocating the keyspace with a good consistent hash function, we ensure that nodes see roughly equal load. A good hash function takes an arbitrary docid from an HTTP request URL, e.g. /db/docid, and maps it to an effectively random, but deterministic, node. The same function is run when proxying GETs, so each read is routed to the partition that holds the requested document.
This is commonly illustrated as a ring, with the hash keyspace wrapped around the outside. Each tick mark designates a boundary in the keyspace between two partitions. The hash function maps document ids to positions on the ring. The ring is continuous, so you can always add more nodes by splitting a single partition into pieces.
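To make the mapping concrete, here is a minimal sketch in Python. It assumes MD5 as the hash function, which is illustrative rather than the Lounge’s actual implementation; any hash that spreads docids evenly around the ring would do.

[source,python]
----
# A sketch of the hash step: MD5 here is illustrative, not necessarily what
# the Lounge uses. The top hex digit of the 128-bit hash picks one of 16
# equal arcs on the ring.
import hashlib

def ring_position(docid):
    """Deterministically map a docid to a point on the 128-bit ring."""
    return int(hashlib.md5(docid.encode("utf-8")).hexdigest(), 16)

def partition(docid, num_partitions=16):
    """Divide the ring into equal arcs and return the arc this docid falls in."""
    return (ring_position(docid) * num_partitions) >> 128

# The same docid always lands in the same partition:
print(partition("docid"))
----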
With 4 physical servers you allocate the keyspace into 16 independent partitions by distributing them across the servers like so:
A: 0,1,2,3 B: 4,5,6,7 C: 8,9,a,b D: c,d,e,f
The proxy itself is built on nginx; its configuration file contains a lookup table that the hash function indexes into, redirecting each /db/docid request to the node responsible for that portion of the keyspace.
Because the cluster speaks HTTP, the proxy can partition documents according to the request URL alone, without inspecting the body. In practice this is accomplished by running the hash function over the docid and comparing the result to the portions of the keyspace allocated to the active nodes. The proxy looks up the partition responsible for the hash value in a configuration table and forwards the HTTP request to it.
The configuration table is just a mapping from hash ranges to partition database URLs. For example, the hash key "71db329b58378c8fa8876f0ec04c72e5" maps to node B, database 7 in the table above, which could be represented by the database URL http://B.couches.local/db-7/.
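Here is a sketch of that lookup in Python, using the four-node allocation from the table above. The hostnames follow the http://B.couches.local/db-7/ example, and MD5 again stands in for the real hash function.

[source,python]
----
# A sketch of the proxy's lookup step, using the allocation from the table
# above. Hostnames follow the example in the text; MD5 stands in for the
# real hash function.
import hashlib

# one hex digit of keyspace -> the node that hosts that partition
ALLOCATION = {}
for node, digits in [("A", "0123"), ("B", "4567"), ("C", "89ab"), ("D", "cdef")]:
    for digit in digits:
        ALLOCATION[digit] = node

def partition_url(docid):
    """Return the URL of the partition database that owns this docid."""
    hash_key = hashlib.md5(docid.encode("utf-8")).hexdigest()
    digit = hash_key[0]                  # which of the 16 partitions
    node = ALLOCATION[digit]             # which physical server hosts it
    return "http://%s.couches.local/db-%s/" % (node, digit)

# A docid whose hash is "71db329b58378c8fa8876f0ec04c72e5" is served by
# node B, database 7: http://B.couches.local/db-7/
----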
Consistent hashing is a simple way to ensure that you can always find the documents you saved, while balancing storage load evenly across partitions.
Consistent hashing solves the problem of how to break a single logical database up evenly across a set of partitions, which can then be distributed across multiple servers. It does not address the problem of how to ensure that the data you’ve stored is safe from loss due to hardware or software failure. If you are serious about your data, you can’t consider it saved until you have at least two copies of it, preferably in different geographical locations.
CouchDB replication makes maintaining hot-failover redundant slaves or load-balanced multi-master databases relatively painless. The specifics of how to manage replication are covered in the Replication chapter. What is important in this context is to understand that maintaining redundant copies is orthogonal to the harder task of ensuring that the cluster consistently chooses the same partition for a particular document id.
For data safety you’ll want to have at least two or three copies of everything. However, if you encapsulate redundancy, the higher layers of the cluster can treat each partition as a single unit and let the logical partitions manage redundancy and failover themselves.
Just as we can’t accept the possibility of hardware failure leading to data loss, we’ll need to run multiple instances of the proxy nodes to avoid the chance that a proxy crash could leave portions of the cluster unavailable. By running redundant proxy instances and load balancing across them, we can increase both cluster throughput and reliability.
Consistent hashing places documents on the proper nodes, but documents can still emit() any key. The point of incremental map reduce is to bring the function to the data, so we shouldn’t redistribute the emitted keys; instead, we send view queries to each CouchDB node via the HTTP proxy and merge the results using the Twisted Python Smartproxy.
Smartproxy sends each view request to every node, so it needs to merge the responses before returning them to the client. Thankfully this operation is not resource intensive, as the merge can be done in constant memory space no matter how many rows are returned. Smartproxy receives the first row from each cluster node and compares them, sorting the nodes by their current row key according to CouchDB’s collation rules. It then pops the top row from the first sorted node and returns it to the client.
This process repeats as long as the nodes continue to send rows, but if the client imposes a limit, Smartproxy must end the response early, discarding any extra rows sent by the nodes.
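A minimal sketch of that merge, assuming each node’s rows arrive already sorted and using plain Python ordering as a stand-in for CouchDB’s collation rules (Smartproxy itself is written in Twisted Python):

[source,python]
----
# A constant-memory merge of sorted view rows, in the spirit of Smartproxy.
# Plain Python ordering stands in for CouchDB's collation rules.
import heapq

def merge_view_rows(per_node_rows, limit=None):
    """Lazily merge sorted row streams; only one row per node is held in
    memory at any time."""
    merged = heapq.merge(*per_node_rows, key=lambda row: row["key"])
    for count, row in enumerate(merged):
        if limit is not None and count >= limit:
            break                # client asked for fewer rows; drop the rest
        yield row

node_a = iter([{"key": "apple", "value": 1}, {"key": "pear", "value": 3}])
node_b = iter([{"key": "banana", "value": 2}])
for row in merge_view_rows([node_a, node_b], limit=2):
    print(row)
----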
This layout is simple and loosely coupled. It has the advantage that it is easy to reason about, which helps in understanding topology and diagnosing failures. There is work underway to move the behavior to Erlang, which ought to make managing dynamic clusters possible, as well as let us integrate cluster control into the CouchDB runtime.
== Growing the Cluster
Using CouchDB at web scale likely requires CouchDB clusters that can be scaled dynamically. Growing sites must continuously add storage capacity, so we need a strategy for increasing the size of the cluster without taking it down. Some workloads can result in temporary growth in data size, in which case we’ll also need a process for shrinking the cluster without an interruption in service. In this section we’ll see how we can use CouchDB’s replication filters to split one database into several partitions, and how to use that technique to grow the cluster without downtime. There are also simple steps you can take to avoid partitioning databases while growing the cluster.
Oversharding is a technique where you partition the cluster so that there are multiple shards on each physical machine. Moving a partition from one machine to another is simpler than splitting it into smaller partitions, as the cluster’s configuration map used by the proxy only needs to be changed to point to the shards at their new homes, rather than adding new logical shards. It’s also less resource intensive to move a partition than to split it into many.
One question we need to answer is "how much should we overshard?" The answer depends on your application and deployment, but there are some forces that push us in one direction over another. If we get the number of shards right, we’ll end up with a cluster that can grow optimally.
In the earlier section on view merging we discussed how merges can be accomplished in constant space, no matter the number of rows returned. However, the memory and network resources required to merge views, as well as to map document ids to partitions, do grow linearly with the number of partitions under a given proxy. For this reason we’ll want to limit the number of partitions each proxy manages. At the same time, we can’t accept an upper limit on cluster size. The solution is to use a tree of proxies, where the root proxy dispatches to some number of intermediate proxies, which in turn proxy to the database nodes, as sketched below.
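A sketch of what such a tree lookup might look like; the proxy names, URLs, and two-level tables here are hypothetical:

[source,python]
----
# A hypothetical two-level proxy tree: the root proxy's table maps a hash
# prefix to an intermediate proxy, whose table maps a longer prefix to the
# actual partition database. All names and URLs are illustrative.
ROOT_TABLE = {
    "0": "http://proxy-0.couches.local/",
    "1": "http://proxy-1.couches.local/",
    # ... one entry per top-level partition, bounded by the per-proxy limit
}

INTERMEDIATE_TABLES = {
    "http://proxy-0.couches.local/": {
        "00": "http://A.couches.local/db-00/",
        "01": "http://B.couches.local/db-01/",
        # ...
    },
    # ... one table per intermediate proxy
}

def route(hash_key):
    """Walk the tree: the root picks an intermediate proxy, which picks the
    partition database. A depth-3 tree would simply repeat the step."""
    proxy = ROOT_TABLE[hash_key[0]]
    return INTERMEDIATE_TABLES[proxy][hash_key[:2]]

print(route("01db329b58378c8fa8876f0ec04c72e5"))
----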
The factors that come into play when deciding how many partitions each proxy should manage are: the storage available to each individual server node, the projected growth rate of the data, the network and memory resources available to proxies, and the acceptable latency for requests against the cluster.
Assuming a conservative 64 shards per proxy, and 1 TB of data storage per node (including room for compaction, these nodes will need roughly 2 TB of drive space), we can see that with a single proxy in front of CouchDB data nodes we’ll be able to store at most 64 TB of data (on 128 or perhaps 192 server nodes, depending on the level of redundancy required by the system) before we have to increase the number of partitions.
By replacing database nodes with another layer of proxies, and repartitioning each of the 64 partitions into another 64 partitions, we end up with 4,096 partitions and a tree depth of two. Just as the initial system can hold 64 partitions on just a few nodes, we can transition to the two-layer tree without needing thousands of machines. If we assume each proxy must run on its own node, and that at first database nodes can hold 16 partitions, we’ll need 65 proxies and 256 database machines (not including redundancy factors, which will typically multiply the cluster size by two or three). To get started with a cluster that can grow smoothly from 64 terabytes to 4 petabytes, we can begin with roughly 600 to 1,000 server nodes, adding new ones as the data grows and we move partitions to other machines.
We’ve seen that even a cluster with a depth of two can hold a vast amount of data. Basic arithmetic shows that by applying the same process to create a cluster with three layers of proxies, we can manage 262 petabytes on thousands of machines. A conservative estimate of the latency introduced by each proxy layer is about 100 ms, so even without performance tuning we should see overall response times of around 300 ms with a tree of depth three, and we should be able to manage queries over exabyte datasets in under a second.
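The arithmetic behind those figures, restated using the assumptions already given in the text (64 shards per proxy, roughly 1 TB per fully grown partition, and about 100 ms of latency per proxy layer):

[source,python]
----
# The capacity arithmetic from the text: 64 shards per proxy, roughly 1 TB
# of data per fully grown partition, and about 100 ms of latency per layer.
SHARDS_PER_PROXY = 64
TB_PER_PARTITION = 1
LATENCY_PER_LAYER_MS = 100

for depth in (1, 2, 3):
    partitions = SHARDS_PER_PROXY ** depth
    print("depth %d: %7d partitions, %7d TB, ~%d ms" %
          (depth, partitions, partitions * TB_PER_PARTITION,
           depth * LATENCY_PER_LAYER_MS))

# depth 1:      64 partitions,      64 TB, ~100 ms
# depth 2:    4096 partitions,    4096 TB, ~200 ms  (about 4 PB)
# depth 3:  262144 partitions,  262144 TB, ~300 ms  (about 262 PB)
----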
By using oversharding and iteratively replacing full shards (database nodes which host only one partition) with proxy nodes that point to another set of oversharded partitions, we can grow the cluster to very large sizes while incurring a minimum of latency.
Now we need to look at the mechanics of the two processes that allow the cluster to grow: moving a partition from an overcrowded node to an empty node, and splitting a large partition into many sub-partitions. Moving partitions is simpler, which is why it makes sense to use it when possible, running the more resource intensive repartition process only when partitions get large enough that only one or two can fit on each database server.
As we mentioned earlier, each partition is made up of N redundant CouchDB databases, each stored on different physical servers. To keep things easy to reason about, any operations should be applied to all redundant copies automatically. For the sake of discussion we’ll just talk about the abstract partition, but be aware that the redundant nodes will all be the same size, and so should require the same operations during cluster growth.
The simplest way to move a partition from one node to another is to create an empty database on the target node and use CouchDB replication to fill the new node with data from the old one. When the new copy of the partition is up to date with the original, the proxy node can be reconfigured to point to the new machine. Once the proxy points to the new partition location, one final round of replication will bring it up to date, and the old partition can be retired, freeing space on the original machine.
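A sketch of that sequence using CouchDB’s _replicate API; the hostnames, the :5984 port, and database names are hypothetical, and authentication and error handling are omitted:

[source,python]
----
# Moving a partition with CouchDB's _replicate API. Hostnames, the :5984
# port, and database names are hypothetical; authentication and error
# handling are omitted.
import json
import urllib.request

def replicate(node_url, body):
    """POST a replication request to /_replicate on the given node."""
    request = urllib.request.Request(
        node_url + "/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return json.load(urllib.request.urlopen(request))

old = "http://B.couches.local:5984/db-7"
new = "http://E.couches.local:5984/db-7"

# 1. Copy the partition to the new node, creating the target database.
replicate("http://E.couches.local:5984",
          {"source": old, "target": new, "create_target": True})
# 2. Repoint the proxy's lookup table at the new node (not shown).
# 3. One final replication catches any writes that arrived at the old node
#    before the switch; after that, the old copy can be retired.
replicate("http://E.couches.local:5984", {"source": old, "target": new})
----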
Another method for moving partition databases is to rsync the files on disk from the old node to the new one. Depending on how recently the partition was compacted, this should result in efficient, low-CPU initialization of the new node. Replication can then be used to bring the rsynced files up to date. See more about rsync and replication in the Replication chapter.
The last major thing we need to run a CouchDB cluster is the capability to split an oversized partition into smaller pieces. In the Replication chapter we discussed how to do filtered replication. Splitting partitions is accomplished by creating the target partitions and configuring them with the range of hash keys they are interested in. They then apply filtered replication to the source partition database, requesting only documents that meet their hash criteria. The result is multiple partial copies of the source database, so that each new partition has an equal share of the data; taken together, they hold a complete copy of the original data. Once the replication is complete, and the new partitions have also brought their redundant backups up to date, a proxy for the new set of partitions is brought online and the top-level proxy is pointed at it instead of the old partition. Just as with moving a partition, we should do one final round of replication after the old partition is no longer reachable by the cluster, so that any last-second updates are not lost. Once that is done, we can retire the old partition so that its hardware can be reused elsewhere in the cluster.
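A sketch of kicking off such a split with filtered replication. The filter name ("lounge/in_range") and its query parameters are hypothetical; the real filter, installed in a design document on the source database, would apply the cluster’s hash function and accept only documents whose hash falls in the target sub-range:

[source,python]
----
# Splitting a partition with filtered replication. The filter name
# ("lounge/in_range") and its query parameters are hypothetical; the real
# filter, installed in a design document on the source database, would
# apply the cluster's hash function and keep only documents whose hash
# falls in the target sub-range.
import json
import urllib.request

def replicate(node_url, body):
    """POST a replication request to /_replicate on the given node."""
    request = urllib.request.Request(
        node_url + "/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return json.load(urllib.request.urlopen(request))

source = "http://B.couches.local:5984/db-7"

# Partition 7 becomes four narrower partitions, each owning a quarter of
# its hash range (hostnames and ranges are illustrative).
targets = [
    ("http://E.couches.local:5984/db-70", "70", "73"),
    ("http://F.couches.local:5984/db-74", "74", "77"),
    ("http://G.couches.local:5984/db-78", "78", "7b"),
    ("http://H.couches.local:5984/db-7c", "7c", "7f"),
]

for target, low, high in targets:
    node = target.rsplit("/", 1)[0]   # run the job on the node hosting the copy
    replicate(node, {
        "source": source,
        "target": target,
        "create_target": True,
        "filter": "lounge/in_range",              # hypothetical filter
        "query_params": {"low": low, "high": high},
    })
# After the new partitions (and their redundant copies) are caught up, the
# proxy is repointed and one last filtered replication picks up stragglers
# before db-7 is retired.
----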