Failover considerations

Failing over a node means that Couchbase Server removes the node from a cluster and makes replicated data at other nodes available for client requests.

Because Couchbase Server provides data replication within a cluster, the cluster can handle the failure of one or more nodes without affecting your ability to access the stored data. In the event of a node failure, you can manually fail over the node in the Couchbase Web Console and then resolve the underlying issues.

Alternatively, you can configure Couchbase Server to automatically remove a failed node from the cluster and have the cluster operate in a degraded mode. If you choose this automatic option, the workload for the functioning nodes that remain in the cluster increases. You still need to address the node failure, return a functioning node to the cluster, and then rebalance the cluster in order for the cluster to function as it did prior to the node failure.

Whether you manually fail over a node or have Couchbase Server perform automatic failover, you must determine the underlying cause of the failure. Then set up functioning nodes, add the nodes, and rebalance the cluster. Keep in mind the following guidelines for replacing or adding nodes in node failure and failover scenarios:

  • If the node failed due to a hardware or system failure, you must add a new replacement node to the cluster and rebalance.

  • If the node failed because of capacity problems in your cluster, you must replace the node but also add additional nodes to meet the capacity needs.

  • If the node failure was transient in nature and the failed node functions once again, you can add the node back to the cluster.

Be aware that failover is a distinct operation compared to removing or rebalancing a node. Typically, you remove a functioning node from a cluster for maintenance or other reasons; in contrast, you perform a failover for a node that does not function.

To remove a functioning node from a cluster, use the Couchbase Web Console to indicate the node that is to be removed, then rebalance the cluster so that data requests for the node can be handled by other nodes. Since the node you want to remove still functions, it is able to handle data requests until the rebalance completes; at that point, other nodes in the cluster handle data requests. Therefore, removing a node and then rebalancing the cluster causes no disruption in data service and no data loss. If you need to remove a functioning node for administrative purposes, you must use the remove and rebalance functionality, not failover.
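For example, assuming the node to remove is at the placeholder address 192.168.0.72:8091, removal and rebalance can be performed in one step with couchbase-cli:

```
> couchbase-cli rebalance --cluster=localhost:8091 \
    -u cluster-username -p cluster-password \
    --server-remove=192.168.0.72:8091
```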

If you try to fail over a functioning node, you can lose data, because failover immediately removes the node from the cluster. Any data on that node which has not yet been replicated to other nodes or persisted to disk can be permanently lost.

For more information about performing failover see the following resources:

  • Automated failover. Couchbase Server automatically marks a node as failed over if the node is identified as unresponsive or unavailable. There are some deliberate limitations to the automated failover feature.
  • Initiating a failover. Whether you use automatic or manual failover, you need to perform additional steps to bring a cluster into a fully functioning state.
  • Adding nodes after failover. After you resolve the issue with the failed over node, you can add the node back to the cluster.

Choosing a failover solution

Because node failover has the potential to reduce the performance of your cluster, you should consider how best to handle a failover situation. Using automated failover means that a cluster can fail over a node without user intervention and without identification of the issue that caused the node failure. It still requires you to initiate a rebalance in order to return the cluster to a healthy state.

If you choose manual failover to manage your cluster, you need to monitor the cluster and identify when an issue occurs. If an issue does occur, you then trigger a manual failover and rebalance operation. This approach requires more monitoring and manual intervention, and there is still a possibility that your cluster and data access may degrade before you initiate failover and rebalance.

The following sections describe the two alternatives and their issues in more detail.

Automated failover considerations

Automatically failing over components in any distributed system can cause problems. If you cannot identify the cause of failure, and you do not understand the load that will be placed on the remaining system, then automated failover can cause more problems than it is designed to solve. Some of the situations that might lead to problems include:

  • Avoiding failover chain-reactions (Thundering herd)

Imagine a scenario in which a five-node Couchbase Server cluster is operating at 80–90% aggregate capacity in terms of network load. Everything is running well, but at the limit of cluster capacity. Now imagine that a node fails and the software decides to automatically fail over that node. It is unlikely that the remaining four nodes will be able to successfully handle the additional load.

The result is that the increased load could lead to another node failing and being automatically failed over. These failures can cascade and lead to the eventual loss of the entire cluster. Clearly, having 1/5th of the requests not serviced due to a single node failure is more desirable than none of the requests being serviced due to an entire cluster failure.

The solution in this case is to continue cluster operations with the single node failure, add a new server to the cluster to handle the missing capacity, mark the failed node for removal and then rebalance. This way there is a brief partial outage rather than an entire cluster being disabled.
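As a sketch of that sequence with couchbase-cli (all addresses and credentials here are placeholders), you can add the replacement server and then eject the failed node in a single rebalance:

```
> couchbase-cli server-add --cluster=localhost:8091 \
    -u cluster-username -p cluster-password \
    --server-add=192.168.0.73:8091 \
    --server-add-username=cluster-username \
    --server-add-password=cluster-password
> couchbase-cli rebalance --cluster=localhost:8091 \
    -u cluster-username -p cluster-password \
    --server-remove=192.168.0.72:8091
```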

One alternative preventative solution is to ensure there is excess capacity to handle unexpected node failures and allow replicas to take over.

  • Handling failovers with network partitions

In the case of a network partition (split-brain), where the failure of a network device causes the network to be split, Couchbase Server implements automatic failover with the following restrictions:

  • Automatic failover requires a minimum of three (3) nodes per cluster. This prevents a 2-node cluster from having both nodes fail each other over in the face of a network partition and protects the data integrity and consistency.
  • Automatic failover occurs only if exactly one (1) node is down. This prevents a network partition from causing two or more parts of a cluster to fail each other over and protects data integrity and consistency.
  • Automatic failover occurs only once before requiring administrative action. This prevents cascading failovers and subsequent performance and stability degradation. In many cases, it is better to not have access to a small part of the dataset rather than having a cluster continuously degrade itself to the point of being non-functional.
  • Automatic failover waits 30 seconds after a node failure before it fails the node over. This prevents transient network issues or slowness from causing a node to be failed over when it shouldn’t be.

If a network partition occurs, automatic failover occurs if and only if it is allowed by these restrictions. For example, if a single node is partitioned off from a cluster of five (5) nodes, it is automatically failed over. If more than one (1) node is partitioned off, autofailover does not occur. After an automatic failover, administrative action is required for a reset. If another node fails before the automatic failover is reset, no automatic failover occurs.

  • Handling misbehaving nodes

There are cases where one node loses connectivity to the cluster, or functions as if it has lost connectivity to the cluster. If that node is able to automatically fail over the rest of the cluster, it can create a cluster-of-one. The result for your cluster is a partition situation similar to the one described previously.

In this case you should make sure there is spare node capacity in your cluster and fail over the node with network issues. If you determine there is not enough capacity, add a node to handle the capacity after you fail over the node with issues.

Manual or monitored failover

Performing manual failover through monitoring can take two forms: human monitoring, or monitoring by a system external to the Couchbase Server cluster. An external monitoring system can monitor both the cluster and the node environment and make a more information-driven decision. Although automated failover has potential issues, manual or monitored failover is not without potential problems of its own.

  • Human intervention

One option is to have a human operator respond to alerts and make a decision on what to do. Humans are uniquely capable of considering a wide range of data, observations, and experiences to best resolve a situation. Many organizations disallow automated failover without human consideration of the implications. The drawback of human intervention is that it responds more slowly than a computer-based monitoring system.

  • External monitoring

Another option is to have a system monitor the cluster via the Couchbase REST API. Such an external system is in a good position to fail over nodes because it can take into account system components that are outside the scope of Couchbase Server.

For example, monitoring software can observe that a network switch is failing and that the Couchbase cluster depends on that switch. The system can determine that failing over Couchbase Server nodes will not help the situation and will therefore not fail over the node.

The monitoring system can also determine that components around Couchbase Server are functioning and that various nodes in the cluster are healthy. If the monitoring system determines that the problem is limited to a single node and the remaining nodes in the cluster can support the aggregate traffic, then the system may fail over the node using the REST API or command-line tools.
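As a rough sketch of such a decision, the following script (which assumes the jq utility is available and uses placeholder credentials) polls the cluster and fails over a node via the REST API only when exactly one node is unhealthy:

```
#!/bin/sh
# List the otpNode names of any nodes the cluster reports as not healthy.
UNHEALTHY=$(curl -s -u cluster-username:cluster-password \
    http://localhost:8091/pools/default \
  | jq -r '.nodes[] | select(.status != "healthy") | .otpNode')

# Fail over only when exactly one node is down; more than one suggests
# a network partition that a failover would make worse.
if [ "$(echo "$UNHEALTHY" | grep -c .)" -eq 1 ]; then
  curl -s -u cluster-username:cluster-password \
    -d "otpNode=$UNHEALTHY" \
    http://localhost:8091/controller/failOver
fi
```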

Using automatic failover

Couchbase Server places a number of restrictions on automatic failover to help prevent issues that can occur when you use it.

  • Disabled by Default Automatic failover is disabled by default. This prevents Couchbase Server from using automatic failover without you explicitly enabling it.

  • Minimum Nodes Automatic failover is only available on clusters of at least three nodes.

If two or more nodes go down at the same time within a specified delay period, the automatic failover system will not fail over any nodes.

  • Required Intervention Automatic failover will only fail over one node before requiring human intervention. This is to prevent a chain reaction failure of all nodes in the cluster.

  • Failover Delay There is a minimum 30 second delay before a node is failed over. This time can be raised, but the software is hard-coded to perform multiple pings of a node that may be down. This prevents failover of a functioning but slow node and prevents network connection issues from triggering failover.
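For example, you can enable automatic failover and raise the delay through the REST API (the 60 second timeout shown here is an illustrative value):

```
> curl -i -u cluster-username:cluster-password \
    -d 'enabled=true&timeout=60' \
    http://localhost:8091/settings/autoFailover
```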

You can use the REST API to configure an email notification that Couchbase Server sends if any node failure occurs and a node is automatically failed over.
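As a sketch of configuring email alerts through the REST API (the sender, recipient, and mail server values are placeholders, and the available alert codes may vary by release):

```
> curl -i -u cluster-username:cluster-password \
    -d 'enabled=true' \
    -d 'sender=couchbase@example.com' \
    -d 'recipients=admin@example.com' \
    -d 'emailHost=smtp.example.com' \
    -d 'emailPort=25' \
    -d 'alerts=auto_failover_node,auto_failover_maximum_reached' \
    http://localhost:8091/settings/alerts
```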

Once an automatic failover has occurred, the Couchbase cluster relies on other nodes to serve replicated data. You should initiate a rebalance to return your cluster to a fully functioning state.
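For example, using couchbase-cli:

```
> couchbase-cli rebalance --cluster=localhost:8091 \
    -u cluster-username -p cluster-password
```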

Resetting the automatic failover counter

After a node has been automatically failed over, Couchbase Server increments an internal counter that indicates a node has been failed over. While this counter is set, the server will not automatically fail over additional nodes in the cluster; this prevents further automatic failovers until you identify the issue that caused the failover and resolve it. You re-enable automatic failover by resetting this counter.
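You can check the current state of this counter, along with the other automatic failover settings, with a GET request (the response includes a count field, which is non-zero after an automatic failover):

```
> curl -u cluster-username:cluster-password \
    http://localhost:8091/settings/autoFailover
```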

Important

Reset the automatic failover only after the node issue is resolved, rebalance occurs, and the cluster is restored to a fully functioning state.

You can reset the counter using the REST API:

> curl -i -X POST -u cluster-username:cluster-password \
    http://localhost:8091/settings/autoFailover/resetCount

Initiating a node failover

If you need to remove a node from the cluster due to hardware or system failure, you need to indicate the failover status for that node. This causes Couchbase Server to use replicated data from other functioning nodes in the cluster.

Important: Do not use failover to remove a functioning node from the cluster for administration or upgrade operations. Initiating a failover for a node activates replicated data on other nodes, which reduces the overall capacity of the cluster, and any data from the failed over node that has not yet been replicated to other nodes or persisted on disk will be lost.

You can provide the failover status for a node with two different methods:

  • Using the Web Console

Go to the Management -> Server Nodes section of the Web Console. Find the node that you want to fail over, and click the Fail Over button. You can only fail over nodes that the cluster has identified as being Down.

The Web Console displays a warning message.

Click Fail Over to confirm the operation. You can also choose to Cancel.

  • Using the Command-line

You can fail over one or more nodes using the failover command in couchbase-cli. To fail over a node, you must specify its IP address and, if it is not using the standard port, its port. For example:

```
> couchbase-cli failover --cluster=localhost:8091\
    -u cluster-username -p cluster-password\
    --server-failover=192.168.0.72:8091
```

If successful, the command indicates that the node has been failed over.
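The same operation is also available directly over the REST API; this sketch assumes the failed node's internal name is ns_1@192.168.0.72 (a placeholder):

```
> curl -u cluster-username:cluster-password \
    -d 'otpNode=ns_1@192.168.0.72' \
    http://localhost:8091/controller/failOver
```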

After you fail over a node, you should address the cause of the failure and return your cluster to a fully functional state.

Handling a failover situation

Any time you automatically or manually fail over a node, the cluster capacity is reduced. Once a node is failed over:

  • The number of available nodes for each data bucket in your cluster will be reduced by one.

  • Replicated data handled by the failed over node will be activated on other nodes in the cluster.

  • Remaining nodes will have to handle all incoming requests for data.

After a node has been failed over, you should perform a rebalance operation. The rebalance operation will:

  • Redistribute stored data across the remaining nodes within the cluster.

  • Recreate replicated data for all buckets at remaining nodes.

  • Return your cluster to the configured operational state.

You may decide to add one or more new nodes to the cluster after a failover to return it to a fully functional state. Better yet, you may choose to replace the failed node and add additional nodes to provide more capacity than before.

Adding back a failed over node

You can add a failed over node back to the cluster if you identify and fix the issue that caused the node failure. After Couchbase Server marks a node as failed over, the data on disk at that node remains. However, a failed over node is no longer synchronized with the rest of the cluster; this means the node no longer handles data requests or receives replicated data.

When you add a failed over node back into a cluster, the cluster treats it as a new node. This means that you should rebalance after you add the node to the cluster. It also means that any data stored on disk at that node will be destroyed when you perform this rebalance.

Copy or Delete Data Files before Rejoining Cluster

Before you add a failed over node back to the cluster, it is best practice to move or delete its persisted data files. If you want to keep the files, you can copy or move them to another location such as another disk or an EBS volume. When you add the node back into the cluster and then rebalance, the data files will be deleted, recreated, and repopulated.
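As a sketch of that sequence, assuming the Linux default data path /opt/couchbase/var/lib/couchbase/data (paths vary by platform) and placeholder addresses:

```
# On the failed over node, with the couchbase-server service stopped,
# preserve the old data files rather than deleting them.
> mv /opt/couchbase/var/lib/couchbase/data /tmp/couchbase-data-backup

# From any functioning node: re-add the failed over node and rebalance.
> couchbase-cli server-readd --cluster=localhost:8091 \
    -u cluster-username -p cluster-password \
    --server-add=192.168.0.72:8091
> couchbase-cli rebalance --cluster=localhost:8091 \
    -u cluster-username -p cluster-password
```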