Getting cluster topology

Your SDK will be responsible for storing keys on particular nodes; therefore your SDK needs to be able to retrieve current cluster topology. The way that Couchbase Server stores all addresses for existing keys in a cluster is by providing a vBucket map. Your SDK will need to request a vBucket map from Couchbase Server and maintain an open connection for streaming updates from the server. Couchbase Server will provide vBucket maps and updates as JSON. To create and maintain such a connection, you can do a REST request from your SDK, and Couchbase Server will send an initial vBucket Map and stream updates as needed.

You should provide the appropriate REST endpoints to your SDK as some initial configuration parameter specified in a developer’s application. The client application should bootstrap the REST/JSON information by building URLs discovered from a standard base URL. After following the bootstrapping sequence and retrieving the URL for vBucket maps, your client library will have a REST/JSON URL that appears as follows:

http://HOST:PORT/pools/default/bucketsStreaming/BUCKET_NAME

For example:

http://couchbase1:8091/pools/default/bucketsStreaming/default

The following is an example response from that URL, in JSON:

{
    "name" : "default",
    "bucketType" : "couchbase",
...
    "vBucketServerMap" : {
        "hashAlgorithm" : "CRC",
        "numReplicas" : 1,
        "serverList" : ["10.1.2.14:11210"],
        "vBucketMap" : [[0,-1],[0,-1],[0,-1],[0,-1],[0,-1] : ]
   }
}

The REST/JSON URLs might be under HTTP Basic Auth authentication control, so the client application may also have to provide (optional) user/password information to the your client library so that the proper HTTP/REST request can be made.

The REST/JSON URLs are ‘streaming’, in that the Couchbase Server does not close the HTTP REST connection after responding with one vBucket map. Instead, Couchbase Server keeps the connection open and continues to stream vBucket maps to your client library when there are cluster changes, for instance new server nodes are added, removed, or when vBuckets are reassigned to different servers. In the Couchbase Server streaming response, new vBucket-to-server map JSON messages are delimited by four newlines (“\n\n\n\n”) characters.

The above section describes what we call named-bucket REST endpoints. That is, each named bucket on a specified port has a streaming REST endpoint in the form:

http://HOST:PORT/pools/default/bucketsStreaming/BUCKET_NAME

There is another kind of REST endpoint which describes all SASL-authenticated buckets. This SASL-authenticated endpoint has the form of:

http://HOST:PORT/pools/default/saslBucketsStreaming

Sample output:

{"buckets":[
    {"name":"default",
            "nodeLocator":"vbucket",
            "saslPassword":"",
            "nodes":[
             {"clusterMembership":"active","status":"healthy","hostname":"10.1.4.11:8091",
             "version":"1.6.1rc1","os":"x86_64-unknown-linux-gnu",
             "ports":{"proxy":11211,"direct":11210}},
             {"clusterMembership":"active","status":"healthy","hostname":"10.1.4.12:8091",
             "version":"1.6.1pre_21_g5aa2027","os":"x86_64-unknown-linux-gnu",
             "ports":{"proxy":11211,"direct":11210}}],
            "vBucketServerMap":{
            "hashAlgorithm":"CRC","numReplicas":1,
        "serverList":["10.1.4.11:11210","10.1.4.12:11210"],
        "vBucketMap":[[0,-1],[1,-1],...,[0,-1],[0,-1]]}}

 ]
        }

One main difference between the SASL-based bucket response versus the per-bucket response is that the SASL-based response can describe more than one bucket in a cluster. In the SASL REST/JSON response, these multiple buckets would be found in the JSON response under the “buckets” array.

Parsing the JSON

Once your client library has received a complete vBucket-to-server map message, it should use its favorite JSON parser to process the map into more useful data structures. An implementation of this kind of JSON parsing in C exists as a helper library in libvbucket, or for Java, jvbucket.

The libvbucket and jvbucket helper libraries don’t do any connection creation, socket management, protocol serialization, etc. That’s the job of your higher-level library. These helper libraries instead just know how to parse a JSON vBucket-to-server map and provide an API to access the map information.

Handling vBucketMap information

The vBucketMap value within the returned JSON describes the vBucket organization. For example:

"serverList":["10.1.4.11:11210","10.1.4.12:11210"], "vBucketMap":[[0,1],[1,0],[1,0],[1,0],:,[0,1],[0,1]]

The vBucketMap is zero-based indexed by vBucketId. So, if you have a vBucket whose vBucketId is 4, you’d look up vBucketMap[4]. The entries in the vBucketMap are arrays of integers, where each integer is a zero-based index into the serverList array. The 0th entry in this array describes the primary server for a vBucket. Here’s how you read this stuff, based on the above config:

The vBucket with vBucketId of 0 has a configuration of vBucketMap[0], or [0, 1]. So vBucket 0’s primary server is at serverList[0], or 10.1.4.11:11210.

While vBucket 0’s first replica server is at serverList[1], which is 10.1.4.12:11210.

The vBucket with vBucketId of 1 has a configuration of vBucketMap[1], or [1, 0]. So vBucket 1’s primary server is at serverList[1], or 10.1.4.12:11210. And vBucket 1’s first replica is at serverList[0], or 10.1.4.11:11210.

This structure and information repeats for every configured vBucket.

If you see a –1 value, it means that there is no server yet at that position. That is, you might see:

"vBucketMap":[[0,-1],[0,-1],[0,-1],[0,-1],:]

Sometimes early before the system has been completely configured, you might see variations of:

"serverList":[], "vBucketMap":[]

Encoding the vBucketId

As the user’s application makes item data API invocations on your client library (mc.get(“some_key”), mc.delete(“some_key”), your client library will hash the key (“some_key”) into a vBucketId. Your client library must also encode a binary request message (following memcached binary protocol), but also also needs to include the vBucketId as part of that binary request message.

Note: Python-aware readers might look at this implementation for an example.

Each Couchbase Server will double-check the vBucketId as it processes requests, and would return NOT_MY_VBUCKET error responses if your client library provided the wrong vBucketId to the wrong couchbase server. This mismatch is expected in the normal course of the lifetime of a cluster – especially when the cluster is changing configuration, such as during a Rebalance.

Handling rebalances in your client library

A major operation in a cluster of Couchbase servers is rebalancing. A Couchbase system administrator may choose to initiate a rebalance because new servers might have been added, old servers need to be decommissioned and need to be removed, etc. An underlying part of rebalancing is the controlled migration of vBuckets (and the items in those migrating vBuckets) from one Couchbase server to another.

There is a certain amount of time, given the distributed nature of couchbase servers and clients, where vBuckets ownership may have changed and migrated from one server to another server, but your client library has not been informed. So, your client library could be trying to talk to the ‘wrong’ or outdated server for a given item, since your client library is operating with an out-of-date vBucket-to-server map.

Below is a walk-through of this situation in more detail and how to handle this case:

Before the Rebalance starts, any existing, connected clients should be operating with the cluster’s pre-rebalance vBucket-to-server map.

As soon as the rebalance starts, Couchbase will “broadcast” (via the streaming REST/JSON channels) a slightly updated vBucket-to-server map message. The assignment of vBuckets to servers does not change at this point at the start of the rebalance, but the serverList of all the servers in the Couchbase cluster does change. That is, vBuckets have not yet moved (or are just starting to move), but now your client library knows the addresses of any new couchbase servers that are now part of the cluster. Knowing all the servers in the cluster (including all the newly added servers) is important, as you will soon see.

At this point, the Couchbase cluster will be busy migrating vBuckets from one server to another.

Concurrently, your client library will be trying to do item data operations (Get/Set/Delete’s) using its pre-Rebalance vBucket-to-server map. However, some vBuckets might have been migrated to a new server already. In this case, the server your client library was trying to use will return a NOT_MY_VBUCKET error response.

Your client library should handle that NOT_MY_VBUCKET error response by retrying the request against another server in the cluster. The retry, of course, might fail with another NOT_MY_VBUCKET error response, in which your client library should keep probing against another server in the cluster.

Eventually, one server will respond with success, and your client library has then independently discovered the new, correct owner of that vBucketId. Your client library should record that knowledge in its vBucket-server-map(s) for use in future operations time.

An implementation of this can be seen in the libvBucket API vbucket_found_incorrect_master().

The following shows a swim-lane diagram of how moxi interacts with libvBucket during NOT_MY_VBUCKET errors libvbucket_notmyvbucket.pdf.

At the end of the Rebalance, the couchbase cluster will notify streaming REST/JSON clients, finally, with a new vBucket-to-server map. This can be handled by your client library like any other vBucket-to-server map update message. However, in the meantime, your client library didn’t require granular map updates during the Rebalancing, but found the correct vBucket owners on its own.

Fast forward map

A planned, forthcoming improvement to the above NOT_MY_VBUCKET handling approach is that Couchbase will soon send an optional second map during the start of the Rebalance. This second map, called a “fast forward map”, provides the final vBucket-to-server map that would represent the cluster at the end of the Rebalance. A client library can use the optional fast forward map during NOT_MY_VBUCKET errors to avoid linear probing of all servers and can instead just jump straight to talking with the eventual vBucket owner.

Please see the implementation in libvBucket that handles a fast-forward-map.

The linear probing, however, should be retained by client library implementations as a good fallback, just-in-case error handling codepath.

Redundancy and availability

Client library authors should enable their user applications to specify multiple URLs into the Couchbase cluster for redundancy. Ideally, the user application would specify an odd number of URLs, and the client library should compare responses from every REST/JSON URL until is sees a majority of equivalent cluster configuration responses. With an even number of URLs which provide conflicting cluster configurations (such as when there’s only two couchbase servers in the cluster and there’s a split-brain issue), the client library should provide an error to the user application rather than attempting to access items from wrong nodes (nodes that have been Failover’ed out of the cluster).

The libvBucket C library has an API for comparing two configurations to support these kinds of comparisons. See the vbucket_compare() function.

As an advanced option, the client library should keep multiple REST/JSON streams open and do continual “majority vote” comparisons between streamed configurations when there are re-configuration events.

As an advanced option, the client library should “learn” about multiple cluster nodes from its REST/JSON responses. For example, the user may have specified just one URL into a multi-node cluster. The REST/JSON response from that one node will list all other nodes, which the client library can optionally, separately contact. This allows the client library to proceed even if the first URL/node fails (as long as the client library continues running).