ZooKeeper Production Deployment

Kafka uses ZooKeeper to store persistent cluster metadata and is a critical component of the Confluent Platform deployment. For example, if you lost the Kafka data in ZooKeeper, the mapping of replicas to Brokers and topic configurations would be lost as well, making your Kafka cluster no longer functional and potentially resulting in total data loss.

This document provides the key considerations before making your ZooKeeper cluster live, but is not a complete guide for running ZooKeeper in production. For more detailed information, see the ZooKeeper Administrator’s Guide.

These areas are covered:

  • Logistical considerations, such as hardware recommendations and deployment strategies
  • Configuration changes for a production environment
  • Post-deployment considerations, such as restarting and backing up your cluster

Stable version

Confluent Platform ships a stable version of ZooKeeper. You can use the ENVI “four letter word” to find the current version of a running server with nc. For example:

echo envi | nc localhost 2181

This will display all of the environment information for the ZooKeeper server, including the version.

Note

Note that the ZooKeeper start script and functionality of ZooKeeper is tested only with this version of ZooKeeper.

Hardware

A production ZooKeeper cluster can cover a wide variety of use cases. The recommendations are intended to be general guidelines for choosing proper hardware for a cluster of ZooKeeper servers. Not all use cases can be covered in this document, so consider your use case and business requirements when making a final decision.

Memory

In general, ZooKeeper is not a memory intensive application when handling only data stored by Kafka. The physical memory needs of a ZooKeeper server scale with the size of the znodes stored by the ensemble. This is because each ZooKeeper holds all znode contents in memory at any given time. For Kafka, the dominant driver of znode creation is the number of partitions in the cluster. In a typical production use case, a minimum of 8 GB of RAM should be dedicated for ZooKeeper use. Note that ZooKeeper is sensitive to swapping and any host running a ZooKeeper server should avoid swapping.

CPU

In general, ZooKeeper as a Kafka metadata store does not heavily consume CPU resources. However, if ZooKeeper is shared, or the host on which the ZooKeeper server is running is shared, CPU should be considered. ZooKeeper provides a latency sensitive function, so if it must compete for CPU with other processes, or if the ZooKeeper ensemble is serving multiple purposes, you should consider providing a dedicated CPU core to ensure context switching is not an issue.

Disks

Disk performance is vital to maintaining a healthy ZooKeeper cluster. Solid state drives (SSD) are highly recommended as ZooKeeper must have low latency disk writes in order to perform optimally. Each request to ZooKeeper must be committed to to disk on each server in the quorum before the result is available for read. A dedicated SSD of at least 64 GB in size on each ZooKeeper server is recommended for a production deployment. You can use autopurge.purgeInterval and autopurge.snapRetainCount to automatically cleanup ZooKeeper data and lower maintenance overhead.

JVM

ZooKeeper runs as a JVM. It is not notably heap intensive when running for the Kafka use case. A heap size of 1 GB is recommended for most use cases and monitoring heap usage to ensure no delays are caused by garbage collection.

Important Configuration Options

ZooKeeper does not require configuration tuning for most deployments. Below are a few important parameters to consider. A complete list of configurations can be found in the ZooKeeper project page.

clientPort

This is the port where ZooKeeper clients will listen on. This is where the Brokers will connect to ZooKeeper. Typically this is set to 2181.

  • Type: int
  • Importance: required
dataDir

The directory where ZooKeeper data will be stored. This location should be a dedicated disk that is ideally an SSD.

  • Type: string
  • Importance: required
tickTime

The unit of time for ZooKeeper translated to milliseconds. This governs all ZooKeeper time dependent operations. It is used for heartbeats and timeouts especially. Note that the minimum session timeout will be two ticks.

  • Type: int
  • Default: 2000
  • Importance: high
maxClientCnxns

The maximum allowed number of client connections for a ZooKeeper server. To avoid running out of allowed connections set this to 0 (unlimited).

  • Type: int
  • Default: 60
  • Importance: high
autopurge.snapRetainCount

When enabled, ZooKeeper auto purge feature retains the autopurge.snapRetainCount most recent snapshots and the corresponding transaction logs in the dataDir and dataLogDir respectively and deletes the rest.

  • Type: int
  • Default: 3
  • Importance: high
autopurge.purgeInterval

The time interval in hours for which the purge task has to be triggered. Set to a positive integer (1 and above) to enable the auto purging.

  • Type: int
  • Default: 0
  • Importance: high

Monitoring

ZooKeeper servers should be monitored to ensure they are functioning properly and proactively identify issues. In this section, a set of common monitoring best practices is discussed.

Operating System

The underlying OS metrics can help predict when the ZooKeeper service will start to struggle. In particular, you should monitor:

  • Number of open file handles – this should be done system wide and for the user running the ZooKeeper process. Values should be considered with respect to the maximum allowed number of open file handles. ZooKeeper opens and closes connections often, and needs an available pool of file handles to choose from.
  • Network bandwidth usage – because ZooKeeper keeps track of state, it is sensitive to timeouts caused by network latency. If the network bandwidth is saturated, you may experience hard to explain timeouts with client sessions that will make your Kafka cluster less reliable.

“Four Letter Words”

ZooKeeper responds to a set of commands, each four letters in length. The documentation gives a description of each one and what version each became available. To run the commands you must send a message (via netcat or telnet) to the ZooKeeper client port. For example echo stat | nc localhost 2181 would return the output of the STAT command to stdout.

You should monitor the STAT and MNTR four letter words. Each environment will be somewhat different, but you will be monitoring for any rapid increase or decrease in any metric reported here. The metrics reported by these commands should be consistent except for number of packets sent and received which should increase slowly over time.

JMX Monitoring

Confluent Control Center monitors the Broker to ZooKeeper connection as show here. The ZooKeeper server also provides a number of JMX metrics that are described in the project documentation here. Here are a few JMX metrics that are important to monitor:

  • NumAliveConnections - make sure you are not close to maximum as set with maxClientCnxns
  • OutstandingRequests - should be below 10 in general
  • AvgRequestLatency - target below 10 ms
  • MaxRequestLatency - target below 20 ms
  • HeapMemoryUsage (Java built-in) - should be relatively flat and well below max heap size

In addition to JMX metrics ZooKeeper provides, Kafka tracks a number of relevant ZooKeeper events via the SessionExpireListener that should be monitored to ensure the health of ZooKeeper-Kafka interactions:

  • ZooKeeperAuthFailuresPerSec (secure environments only)
  • ZooKeeperDisconnectsPerSec
  • ZooKeeperExpiresPerSec
  • ZooKeeperReadOnlyConnectsPerSec
  • ZooKeeperSaslAuthenticationsPerSec (secure environments only)
  • ZooKeeperSyncConnectsPerSec

Multi-node Setup

In a real production environment, the ZooKeeper servers will be deployed on multiple nodes. This is called an ensemble. An ensemble is a set of 2n + 1 ZooKeeper servers where n is any number greater than 0. The odd number of servers allows ZooKeeper to perform majority elections for leadership. At any given time, there can be up to n failed servers in an ensemble and the ZooKeeper cluster will keep quorum. If at any time, quorum is lost, the ZooKeeper cluster will go down. A few general considerations for multi-node ZooKeeper ensembles:

  • Start with a small ensemble of 3 or 5 servers and only scale as truly necessary (or as required for fault tolerance). Each write must be propagated to a quorum of servers in the ensemble. This means that if you choose to run an ensemble of 3 servers, you are tolerant to 1 server being lost, and writes must propagate to 2 servers before they are committed. If you have 3 servers, you are tolerant to 2 servers being lost, but writes must propagate to 3 servers before they are committed.
  • Redundancy in the physical/hardware/network layout: try not to put them all in the same rack, decent (but don’t go nuts) hardware, try to keep redundant power and network paths, etc.
  • Virtualization: place servers in different availability zones to avoid correlated crashes and to make sure that the storage system available can accommodate the requirements of the transaction logging and snapshotting of ZooKeeper.
  • When it doubt, keep it simple. ZooKeeper holds important data, so prefer stability and durability over trying a new deployment model in production.

A multi-node setup does require a few additional configurations. There is a comprehensive overview of these in the project documentation. Below is a concrete example configuration to help get you started.

tickTime=2000
dataDir=/var/lib/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
autopurge.snapRetainCount=3
autopurge.purgeInterval=24

Use this configuration in a file passed to the ZooKeeper start script like this:

bin/zookeeper-server-start etc/kafka/zookeeper.properties

This example is for a 3 node ensemble. Note that this configuration file is expected to be identical across all members of the ensemble. Breaking this down, you can see tickTime, dataDir, and clientPort are all set to typical single server values. The initLimit and syncLimit are used to govern how long following ZooKeeper servers can take to initialize with the current leader and how long they can be out of sync with the leader. In this configuration, a follower can take 10000ms to initialize and may be out of sync for up to 4000ms based on the tickTime being set to 2000ms.

The server.* properties set the ensemble membership. The format is server.<myid>=<hostname>:<leaderport>:<electionport>. Some explanation:

  • myid is the server identification number. In this example, there are three servers, so each one will have a different myid with values 1, 2, and 3 respectively. The myid is set by creating a file named myid in the dataDir that contains a single integer in human readable ASCII text. This value must match one of the myid values from the configuration file. If another ensemble member has already been started with a conflicting myid value, an error will be thrown upon startup.
  • leaderport is used by followers to connect to the active leader. This port should be open between all ZooKeeper ensemble members.
  • electionport is used to perform leader elections between ensemble members. This port should be open between all ZooKeeper ensemble members.

The autopurge.snapRetainCount and autopurge.purgeInterval have been set to purge all but three snapshots every 24 hours.

Post Deployment

After you deploy your ZooKeeper cluster, ZooKeeper largely runs without much maintenance. The project documentation contains many helpful tips on operations such as managing cleanup of the dataDir, logging, and troubleshooting.