
Jill wakes up at 4:30 am, looking dazed at her mobile phone. She receives text message after text message, one every minute. Finally, Joe calls. Joe is furious, and Jill has trouble understanding what he is saying. In fact, Jill has a hard time remembering why Joe would call her in the middle of the night. Then she remembers: Joe is running an online shop selling sports gear on one of her servers, and he is furious because the server went down and his customers in New Zealand are angry because they can’t get to the shop.

This is a typical scenario, and you have probably seen many variations of it, in the role of Jill, of Joe, or of both. If you are Jill, you want to sleep at night; if you are Joe, you want your customers to buy from you whenever it pleases them.

Having a Backup

The problem persists: computers fail, and there are a lot of ways they can fail. There are hardware problems, power outages, and bugs in the operating system or application software. Only CouchDB doesn’t have any bugs. Wait, that is of course not true: there can be problems in CouchDB, too. No piece of software is free of bugs (except maybe Donald Knuth’s TeX system).

Whatever the cause is, you want to make sure that the service you are providing (in this case the database for an online store) is resilient against failure. The road to resilience is a road of finding and removing single points of failure. A server’s power supply can fail. To avoid the server turning off in such an event, most come with at least two power supplies. To take this further, you could get a server where every component exists twice or more, but that would be a highly specialized (and expensive) piece of hardware. It is much cheaper to get two similar servers, where one can take over when the other has a problem. You need to make sure both servers have the same set of data in order to switch between them without a user noticing.
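The takeover logic can be sketched in a few lines: try the primary server first, and fall back to the standby when the primary is unreachable. This is a minimal illustration, not CouchDB’s mechanism; the server names and the health-check function are hypothetical placeholders.

```python
def pick_server(servers, is_healthy):
    """Return the first healthy server in priority order, or None if all are down."""
    for server in servers:
        if is_healthy(server):
            return server
    return None

# Hypothetical example: the primary is down, so the standby takes over.
servers = ["primary.example.com", "standby.example.com"]
healthy = {"standby.example.com"}

print(pick_server(servers, lambda s: s in healthy))
```

Because both servers hold the same data, the client neither knows nor cares which one answered.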

Removing all single points of failure will give you a highly available, or fault-tolerant, system. The degree of tolerance is restrained only by your budget: if you can’t afford to lose a customer’s shopping cart under any circumstances, you need to store it on at least two servers in at least two geographically distant locations.

Note

Amazon, for example, does this for its amazon.com website. If one data center falls victim to an earthquake, users can still shop.

It is likely, though, that Amazon’s problems are not your problems, and that you would face a whole new set of problems if your data center went away. But you still want to be able to survive a server failure.

Before we dive into setting up a highly available CouchDB system, let’s look at another situation:

Joe calls Jill during regular business hours and relays his customers’ complaints that loading the online shop takes “forever”. Jill takes a quick look at the server and concludes that this is a lucky problem to have, leaving Joe puzzled. Jill explains that Joe’s shop is suddenly attracting a lot more users who buy things. Joe chimes in: “I got this great review on that blog; that’s where they must be coming from.” A quick referrer check reveals that, indeed, many of the new customers are coming from a single site. The blog post already includes comments from unhappy customers voicing their frustration with the slow site. Joe wants to make his customers happy and asks Jill what to do. Jill advises setting up a second server that can take half of the load off the current one, making sure all requests get answered in a reasonable amount of time. Joe agrees, and Jill sets out to set things up.

The solution to this problem looks a lot like the one for a fault-tolerant setup: install a second server and synchronize all data. The difference is that with fault tolerance, the second server just sits there and waits for the first one to fail. In the second case, the second server helps to answer all incoming requests. The second case is not fault tolerant: if one server crashes, the other gets all the requests and is likely to break down, or at least to provide very slow service, neither of which is acceptable. Keep in mind that while the solutions look similar, high availability and fault tolerance are not the same. We’ll get back to the second scenario in a bit, but first we will have a look at how to set up a fault-tolerant CouchDB system.
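Spreading requests over both servers is usually done with a simple round-robin scheme: each incoming request goes to the next server in turn, so each answers roughly half of the load. A minimal sketch, with hypothetical server names:

```python
from itertools import cycle

# Round-robin over two synchronized servers: each request is handed
# to the next server in turn, alternating between the two.
servers = cycle(["server-a", "server-b"])

# Four incoming requests are spread evenly across both servers.
handled = [next(servers) for _ in range(4)]
print(handled)
```

In practice this dispatching is done by a load balancer in front of the servers rather than by application code, but the principle is the same.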

We already gave it away in the previous chapters: the solution to synchronizing servers is replication.
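CouchDB exposes replication through its HTTP API: a POST to the `_replicate` endpoint with a source and a target database. A sketch, assuming CouchDB is listening on its default port 5984; the database name `shop` and the host `backup.example.com` are placeholders:

```shell
# Replicate the local "shop" database to a second server.
curl -X POST http://localhost:5984/_replicate \
     -H "Content-Type: application/json" \
     -d '{"source": "shop", "target": "http://backup.example.com:5984/shop"}'
```

How replication behaves in detail, and how to keep two servers continuously in sync, is the subject of what follows.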