Monitoring¶
Kafka uses Yammer Metrics for metrics reporting in both the server and the client. It can be configured with pluggable stats reporters that feed these stats into your monitoring system.
Server Metrics¶
These are the important metrics to alert on for a Kafka broker:
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
- Number of under-replicated partitions (|ISR| < |all replicas|). Alert if value is greater than 0.
kafka.controller:type=KafkaController,name=OfflinePartitionsCount
- Number of partitions that don’t have an active leader and are hence not writable or readable. Alert if value is greater than 0.
kafka.controller:type=KafkaController,name=ActiveControllerCount
- Number of active controllers in the cluster. Alert if value is anything other than 1.
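The three alert conditions above can be sketched as a small rule table. This is a minimal illustration, not part of Kafka: the helper name `firing_alerts` is hypothetical, the metric values are assumed to have been collected already (for example over JMX), and the `ActiveControllerCount` value passed in is assumed to be the cluster-wide count.

```python
# MBean names taken from the list above.
URP = "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"
OFFLINE = "kafka.controller:type=KafkaController,name=OfflinePartitionsCount"
ACTIVE = "kafka.controller:type=KafkaController,name=ActiveControllerCount"

# Each rule returns True when the value should trigger an alert.
ALERT_RULES = {
    URP: lambda value: value > 0,      # any under-replicated partition
    OFFLINE: lambda value: value > 0,  # any partition without a leader
    ACTIVE: lambda value: value != 1,  # anything other than exactly one controller
}


def firing_alerts(values):
    """Return the MBean names whose current value violates its rule.

    `values` maps an MBean name to its latest numeric value. Names that
    are missing from `values` are skipped rather than treated as alerts.
    """
    return [
        name
        for name, violated in ALERT_RULES.items()
        if name in values and violated(values[name])
    ]
```

Keeping the rules in one table makes it straightforward to feed the same conditions into whichever alerting system is in use.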
Here is the list of metrics to observe on a Kafka broker:
kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
- Aggregate incoming message rate.
kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
- Aggregate incoming byte rate.
kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec
- Aggregate outgoing byte rate.
kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower}
- Request rate.
kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs
- Log flush rate and time.
kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs
- Leader election rate and latency.
kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec
- Unclean leader election rate.
kafka.server:type=ReplicaManager,name=PartitionCount
- Number of partitions on this broker. This should be mostly even across all brokers.
kafka.server:type=ReplicaManager,name=LeaderCount
- Number of leaders on this broker. This should be mostly even across all brokers. If not, set auto.leader.rebalance.enable to true on all brokers in the cluster.
kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
- If a broker goes down, ISR for some of the partitions will shrink. When that broker is up again, ISR will be expanded once the replicas are fully caught up. Other than that, the expected value for both ISR shrink rate and expansion rate is 0.
kafka.server:type=ReplicaManager,name=IsrExpandsPerSec
- When a broker is brought up after a failure, it starts catching up by reading from the leader. Once it is caught up, it gets added back to the ISR.
kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica
- Maximum lag in messages between the follower and leader replicas. This is controlled by the replica.lag.max.messages config.
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)
- Lag in number of messages per follower replica. This is useful to know if the replica is slow or has stopped replicating from the leader.
kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce|FetchConsumer|FetchFollower}
- Total time in ms to serve the specified request.
kafka.server:type=ProducerRequestPurgatory,name=PurgatorySize
- Number of requests waiting in the producer purgatory. This should be non-zero if acks=-1 is used on the producer.
kafka.server:type=FetchRequestPurgatory,name=PurgatorySize
- Number of requests waiting in the fetch purgatory. This is high if consumers use a large value for fetch.wait.max.ms.
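The PartitionCount and LeaderCount entries above say the values should be mostly even across brokers. One way to check that is sketched below, assuming the per-broker values have already been collected. The function name `overloaded_brokers` and the `tolerance` parameter are hypothetical; Kafka does not define a threshold for "mostly even".

```python
def overloaded_brokers(counts, tolerance=0.2):
    """Return the broker ids whose count exceeds the cluster mean by more than `tolerance`.

    `counts` maps a broker id to its PartitionCount or LeaderCount value.
    `tolerance` is an illustrative threshold (0.2 means 20% above the mean),
    not a Kafka setting.
    """
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    limit = mean * (1 + tolerance)
    return sorted(broker for broker, count in counts.items() if count > limit)
```

For example, `overloaded_brokers({1: 180, 2: 60, 3: 60})` flags broker 1, whose count is well above the mean of 100. A persistently flagged broker in the LeaderCount check is the situation in which the text suggests enabling auto.leader.rebalance.enable.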
Producer Metrics¶
Starting with 0.8.2, the new producer exposes the following metrics:
Global Request Metrics¶
MBean: kafka.producer:type=producer-metrics,client-id=([-.\w]+)
request-latency-avg
- The average request latency in ms.
request-latency-max
- The maximum request latency in ms.
request-rate
- The average number of requests sent per second.
response-rate
- The average number of responses received per second.
incoming-byte-rate
- Bytes/second read off all sockets.
outgoing-byte-rate
- The average number of outgoing bytes sent per second to all servers.
Global Connection Metrics¶
MBean: kafka.producer:type=producer-metrics,client-id=([-.\w]+)
connection-count
- The current number of active connections.
io-ratio
- The fraction of time the I/O thread spent doing I/O.
io-time-ns-avg
- The average length of time for I/O per select call in nanoseconds.
io-wait-ratio
- The fraction of time the I/O thread spent waiting.
select-rate
- Number of times the I/O layer checked for new I/O to perform per second.
io-wait-time-ns-avg
- The average length of time the I/O thread spent waiting for a socket ready for reads or writes in nanoseconds.
Per broker metrics¶
MBean: kafka.producer:type=producer-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)
Besides the Global Request Metrics, the following metrics are also available per broker:
request-size-max
- The maximum size of any request sent in the window for a node.
request-size-avg
- The average size of all requests in the window for a node.
Per topic metrics¶
MBean: kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)
Besides the Global Request Metrics, the following metrics are also available per topic:
byte-rate
- The average number of bytes sent per second for a topic.
record-send-rate
- The average number of records sent per second for a topic.
compression-rate
- The average compression rate of record batches for a topic.
record-retry-rate
- The average per-second number of retried record sends for a topic.
record-error-rate
- The average per-second number of record sends that resulted in errors for a topic.
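The producer MBean name patterns in this section can be used to sort collected MBean names into global, per-broker, and per-topic groups. The sketch below reads the character class in those patterns as `[-.\w]` and assumes the key properties appear in the order shown in this section. The names `PRODUCER_PATTERNS` and `classify_producer_mbean`, and the named groups, are hypothetical additions for readability.

```python
import re

# Patterns follow the MBean names listed in this section, with the literal
# dot in the domain escaped and named groups added.
PRODUCER_PATTERNS = {
    "global": re.compile(
        r"kafka\.producer:type=producer-metrics,client-id=(?P<client_id>[-.\w]+)"
    ),
    "node": re.compile(
        r"kafka\.producer:type=producer-node-metrics,"
        r"client-id=(?P<client_id>[-.\w]+),node-id=(?P<node_id>[0-9]+)"
    ),
    "topic": re.compile(
        r"kafka\.producer:type=producer-topic-metrics,"
        r"client-id=(?P<client_id>[-.\w]+),topic=(?P<topic>[-.\w]+)"
    ),
}


def classify_producer_mbean(name):
    """Return (kind, fields) for a producer MBean name, or None if nothing matches."""
    for kind, pattern in PRODUCER_PATTERNS.items():
        match = pattern.fullmatch(name)  # the whole name must match the pattern
        if match:
            return kind, match.groupdict()
    return None
```

Matching against the full name avoids confusing `producer-metrics` with the longer `producer-node-metrics` and `producer-topic-metrics` types.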
Consumer Metrics¶
kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+)
- Number of messages the consumer lags behind the producer by.
kafka.consumer:type=ConsumerFetcherManager,name=MinFetchRate,clientId=([-.\w]+)
- The minimum rate at which the consumer sends fetch requests to the broker. If a consumer is dead, this value drops to roughly 0.
kafka.consumer:type=ConsumerTopicMetrics,name=MessagesPerSec,clientId=([-.\w]+)
- The throughput in messages consumed per second.
kafka.consumer:type=ConsumerTopicMetrics,name=BytesPerSec,clientId=([-.\w]+)
- The throughput in bytes consumed per second.
The following metrics are available only on the high-level consumer:
kafka.consumer:type=ZookeeperConsumerConnector,name=KafkaCommitsPerSec,clientId=([-.\w]+)
- The rate at which this consumer commits offsets to Kafka. This is only relevant if offsets.storage=kafka.
kafka.consumer:type=ZookeeperConsumerConnector,name=ZooKeeperCommitsPerSec,clientId=([-.\w]+)
- The rate at which this consumer commits offsets to ZooKeeper. This is only relevant if offsets.storage=zookeeper. Monitor this value if your ZooKeeper cluster is underperforming due to high write load.
kafka.consumer:type=ZookeeperConsumerConnector,name=RebalanceRateAndTime,clientId=([-.\w]+)
- The rate and latency of the rebalance operation on this consumer.
kafka.consumer:type=ZookeeperConsumerConnector,name=OwnedPartitionsCount,clientId=([-.\w]+),groupId=([-.\w]+)
- The number of partitions owned by this consumer.
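The MaxLag and MinFetchRate descriptions above suggest a simple health classification for a consumer. The following is a rough sketch under stated assumptions: the function name `consumer_status` and both thresholds (`lag_limit`, `rate_floor`) are hypothetical and would need tuning for a real workload.

```python
def consumer_status(max_lag, min_fetch_rate, lag_limit=10000, rate_floor=0.01):
    """Classify a consumer from its MaxLag and MinFetchRate values.

    `lag_limit` and `rate_floor` are illustrative thresholds, not Kafka
    settings. A MinFetchRate at or below `rate_floor` is treated as
    "roughly 0", which the text above associates with a dead consumer.
    """
    if min_fetch_rate <= rate_floor:
        return "stalled"
    if max_lag > lag_limit:
        return "lagging"
    return "ok"
```

The fetch rate is checked first because a consumer that has stopped fetching will also accumulate lag, and "stalled" is the more specific diagnosis.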