Monitoring

Kafka uses Yammer Metrics for metrics reporting in both the server and the client. It can be configured with pluggable stats reporters that feed these stats into your monitoring system.

Server Metrics

These are the important metrics to alert on for a Kafka broker:

kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
Number of under-replicated partitions (|ISR| < |all replicas|). Alert if value is greater than 0.
kafka.controller:type=KafkaController,name=OfflinePartitionsCount
Number of partitions that don’t have an active leader and are hence not writable or readable. Alert if value is greater than 0.
kafka.controller:type=KafkaController,name=ActiveControllerCount
Number of active controllers in the cluster. Alert if value is anything other than 1.
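
The three alert conditions above can be sketched as a small check. This is only an illustration: it assumes the gauge values have already been read over JMX by your own tooling and placed in a plain dictionary keyed by the name= part of each MBean. The function name broker_alerts and the dictionary layout are hypothetical and not part of Kafka.

```python
def broker_alerts(values):
    """Return alert messages for the three broker gauges listed above.

    values: hypothetical dict mapping the name= part of each MBean to its
    current gauge value, collected by whatever JMX tooling you use.
    """
    alerts = []
    under_replicated = values["UnderReplicatedPartitions"]
    offline = values["OfflinePartitionsCount"]
    controllers = values["ActiveControllerCount"]

    # Any under-replicated partition is worth an alert.
    if under_replicated > 0:
        alerts.append("UnderReplicatedPartitions is %d, expected 0" % under_replicated)
    # Any partition without an active leader is worth an alert.
    if offline > 0:
        alerts.append("OfflinePartitionsCount is %d, expected 0" % offline)
    # Exactly one active controller is expected in the cluster.
    if controllers != 1:
        alerts.append("ActiveControllerCount is %d, expected 1" % controllers)
    return alerts
```

An empty list means none of the three conditions fired; each string in a non-empty list names the gauge that is out of range.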

Here is the list of metrics to observe on a Kafka broker:

kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
Aggregate incoming message rate.
kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
Aggregate incoming byte rate.
kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec
Aggregate outgoing byte rate.
kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower}
Request rate.
kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs
Log flush rate and time.
kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs
Leader election rate and latency.
kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec
Unclean leader election rate.
kafka.server:type=ReplicaManager,name=PartitionCount
Number of partitions on this broker. This should be mostly even across all brokers.
kafka.server:type=ReplicaManager,name=LeaderCount
Number of leaders on this broker. This should be mostly even across all brokers. If not, set auto.leader.rebalance.enable to true on all brokers in the cluster.
kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
If a broker goes down, the ISR for some of the partitions will shrink. When that broker is up again, the ISR will expand once the replicas are fully caught up. Other than that, the expected value for both the ISR shrink rate and the ISR expansion rate is 0.
kafka.server:type=ReplicaManager,name=IsrExpandsPerSec
When a broker is brought up after a failure, it starts catching up by reading from the leader. Once it is caught up, it gets added back to the ISR.
kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica
Maximum lag in messages between the follower and leader replicas. This is controlled by the replica.lag.max.messages config.
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)
Lag in number of messages per follower replica. This is useful to know if the replica is slow or has stopped replicating from the leader.
kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce|FetchConsumer|FetchFollower}
Total time in ms to serve the specified request.
kafka.server:type=ProducerRequestPurgatory,name=PurgatorySize
Number of requests waiting in the producer purgatory. This should be non-zero if acks=-1 is used on the producer.
kafka.server:type=FetchRequestPurgatory,name=PurgatorySize
Number of requests waiting in the fetch purgatory. This is high if consumers use a large value for fetch.wait.max.ms.
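
Several of the MBean names above are written as patterns with per-client, per-topic and per-partition fields. As a rough sketch of how such a name could be split into its fields once it has been listed over JMX, the FetcherLagMetrics pattern can be applied directly as a regular expression. The function name parse_lag_mbean is hypothetical, and the sample name in the usage note is made up for illustration.

```python
import re

# Pattern copied from the FetcherLagMetrics MBean name listed above;
# the literal dots in the domain are escaped so they match only dots.
LAG_MBEAN = re.compile(
    r"kafka\.server:type=FetcherLagMetrics,name=ConsumerLag,"
    r"clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)"
)


def parse_lag_mbean(name):
    """Split a FetcherLagMetrics MBean name into (clientId, topic, partition).

    Returns None when the name does not match the pattern.
    """
    match = LAG_MBEAN.fullmatch(name)
    if match is None:
        return None
    client_id, topic, partition = match.groups()
    return client_id, topic, int(partition)
```

For example, a name ending in clientId=fetcher-1,topic=my.topic,partition=3 would yield the tuple ("fetcher-1", "my.topic", 3), which makes it straightforward to group lag values by topic or partition.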

Producer Metrics

Starting with 0.8.2, the new producer exposes the following metrics:

Global Request Metrics

MBean: kafka.producer:type=producer-metrics,client-id=([-.\w]+)

request-latency-avg
The average request latency in ms.
request-latency-max
The maximum request latency in ms.
request-rate
The average number of requests sent per second.
response-rate
The average number of responses received per second.
incoming-byte-rate
Bytes/second read off all sockets.
outgoing-byte-rate
The average number of outgoing bytes sent per second to all servers.

Global Connection Metrics

MBean: kafka.producer:type=producer-metrics,client-id=([-.\w]+)

connection-count
The current number of active connections.
io-ratio
The fraction of time the I/O thread spent doing I/O.
io-time-ns-avg
The average length of time for I/O per select call in nanoseconds.
io-wait-ratio
The fraction of time the I/O thread spent waiting.
select-rate
Number of times the I/O layer checked for new I/O to perform per second.
io-wait-time-ns-avg
The average length of time the I/O thread spent waiting for a socket ready for reads or writes in nanoseconds.

Per-Broker Metrics

MBean: kafka.producer:type=producer-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)

Besides the Global Request Metrics, the following metrics are also available per broker:

request-size-max
The maximum size of any request sent in the window for a node.
request-size-avg
The average size of all requests in the window for a node.

Per-Topic Metrics

MBean: kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)

Besides the Global Request Metrics, the following metrics are also available per topic:

byte-rate
The average number of bytes sent per second for a topic.
record-send-rate
The average number of records sent per second for a topic.
compression-rate
The average compression rate of record batches for a topic.
record-retry-rate
The average per-second number of retried record sends for a topic.
record-error-rate
The average per-second number of record sends that resulted in errors for a topic.
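
The per-topic rates above are often easier to act on when combined. The sketch below divides record-error-rate by record-send-rate to get a rough error ratio for a topic. Treating this quotient as an error ratio is an operator's heuristic rather than something Kafka defines, and it assumes both rates were sampled over the same window. The function name record_error_ratio is hypothetical.

```python
def record_error_ratio(record_send_rate, record_error_rate):
    """Rough fraction of record sends that ended in an error for one topic.

    Both arguments are per-second rates read from the per-topic producer
    MBean. Returns 0.0 when nothing is being sent, to avoid dividing by zero.
    """
    if record_send_rate <= 0:
        return 0.0
    return record_error_rate / record_send_rate
```

A topic sending 200 records per second with 1 error per second would give a ratio of 0.005; what counts as too high is a local decision.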

Consumer Metrics

kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+)
Number of messages the consumer lags behind the producer by.
kafka.consumer:type=ConsumerFetcherManager,name=MinFetchRate,clientId=([-.\w]+)
The minimum rate at which the consumer sends fetch requests to the broker. If a consumer is dead, this value drops to roughly 0.
kafka.consumer:type=ConsumerTopicMetrics,name=MessagesPerSec,clientId=([-.\w]+)
The throughput in messages consumed per second.
kafka.consumer:type=ConsumerTopicMetrics,name=BytesPerSec,clientId=([-.\w]+)
The throughput in bytes consumed per second.
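
MaxLag and MinFetchRate together give a quick picture of consumer health: a fetch rate near 0 suggests the consumer is dead, while a large lag suggests it is alive but falling behind. A minimal sketch of that reasoning follows. The function name consumer_status and both default thresholds are assumptions chosen for illustration; suitable values depend on your workload.

```python
def consumer_status(max_lag, min_fetch_rate,
                    lag_threshold=10000, fetch_rate_floor=0.01):
    """Classify a consumer from its MaxLag and MinFetchRate values.

    lag_threshold and fetch_rate_floor are illustrative defaults, not
    values recommended by Kafka.
    """
    # A fetch rate of roughly 0 indicates the consumer has stopped fetching.
    if min_fetch_rate <= fetch_rate_floor:
        return "stalled"
    # The consumer is fetching but is far behind the producer.
    if max_lag > lag_threshold:
        return "lagging"
    return "ok"
```

The stalled check comes first because a dead consumer usually also shows growing lag, and the stalled state is the more urgent of the two.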

The following metrics are available only on the high-level consumer:

kafka.consumer:type=ZookeeperConsumerConnector,name=KafkaCommitsPerSec,clientId=([-.\w]+)
The rate at which this consumer commits offsets to Kafka. This is only relevant if offsets.storage=kafka.
kafka.consumer:type=ZookeeperConsumerConnector,name=ZooKeeperCommitsPerSec,clientId=([-.\w]+)
The rate at which this consumer commits offsets to ZooKeeper. This is only relevant if offsets.storage=zookeeper. Monitor this value if your ZooKeeper cluster is underperforming due to high write load.
kafka.consumer:type=ZookeeperConsumerConnector,name=RebalanceRateAndTime,clientId=([-.\w]+)
The rate and latency of the rebalance operation on this consumer.
kafka.consumer:type=ZookeeperConsumerConnector,name=OwnedPartitionsCount,clientId=([-.\w]+),groupId=([-.\w]+)
The number of partitions owned by this consumer.