.. _controlcenter_userguide_systemhealth:

System Health
=============

System Health provides insight into the well-being of the cluster from both a broker and topic-centric perspective.
The two sections share many common features with each other, which will be outlined below.

Navigation
----------

System Health is accessible by selecting the item labeled "System Health" from the main navigation.

The main ways to adjust the data you're seeing are as follows:

1. Using the sub navigation menu to jump between broker and topic views:

.. figure:: images/systemhealth/main-nav.png
   :scale: 50%
   :align: center

2. Clicking the KPI menu items on the left hand side of the page to see more detailed information:

.. figure:: images/systemhealth/topics-menu-open.gif
   :scale: 50%
   :align: center

3. Using the table view selector to choose between raw, trend and bullet chart views:

.. figure:: images/systemhealth/table-view-selector.gif
   :scale: 50%
   :align: center

4. Selecting *"View details"* from the "..." contextual menu on each row of the topics table which navigates to Topic Management:

.. figure:: images/systemhealth/topics-table-menu-open.gif
   :scale: 50%
   :align: center

5. Selecting a percentile from the request latency percentile dropdown (broker view only):

.. figure:: images/systemhealth/brokers-produced-fetched-charts-percentile.gif
   :scale: 50%
   :align: center

.. note::
   Request latency percentiles are only available in the brokers section of System Health.


UI Commonalities
----------------

Chart Tooltips
  Each chart displays a similarly styled tooltip when hovering. These tooltips can display multiple metrics at the same time which are each paired with an icon. The icon will either be a (good) check mark or a (bad) X symbol at each point.

Table Metric Validations
  Similar to the chart tooltips, many of the table metrics will visually change to indiciate potential issues via a red underline. Hovering the mouse over text with a red underline will display an explanatory tooltip.

Produced and Fetched Charts
---------------------------

.. figure:: images/systemhealth/brokers-produced-fetched-charts.png
   :scale: 50%
   :align: center

Produced (left hand side)
   - **Bytes** – total number of bytes per second produced to this cluster
   - **Requests** – total number of successful / failed produce requests to this cluster
   - **Latency** – produce request latency across all brokers, at the median, 95th, 99th, or 99.9th percentile

Fetched (right hand side):
   - **Bytes** – total number of bytes per second fetched from this cluster
   - **Requests** – total number of successful / failed fetch requests to this cluster
   - **Latency** – fetch request latency across all brokers, at the median, 95th, 99th, or 99.9th percentile

Per broker or topic breakdown
    Hovering the mouse cursor over an individual row of the broker or topic table will overlay the request statistics for that individual broker or topic in the chart.

Request latency (broker view only)
    In the broker section, the request latency for each broker will appear as its own line in the bottom-most chart.

    Hovering the mouse cursor over a specific line will highlight the corresponding row for that particular broker in the table.

Request lifecycle (broker view only)
    Clicking a line inside the latency chart will display the breakdown of produce or fetch latency throughout the entire request lifecycle.

    The request latency profile can shown at the median, 95th, 99th, or 99.9th percentile by selecting the corresponding header.


Broker Aggregate Metrics
------------------------

Broker count
    Number of brokers currently online
Zookeeper
    ZooKeeper status - *Up* or *Down*

    **Expires**

    .. include:: fragments/brokerClusterZooKeeperExpires.rst
    **Leader elections**

    .. include:: fragments/brokerClusterLeaderElection.rst
Active controller
  .. include:: fragments/brokerClusterActiveController.rst
Unclean elections
  .. include:: fragments/brokerClusterUncleanCount.rst
Network pool usage
    Average network pool capacity usage across all brokers
Request pool usage
    Average request handler capacity usage across all brokers (i.e. the fraction of time request handler threads are not sitting idle)
Disk usage
    Disk usage distribution - indicates whether disk usage distribution is even or skewed across all brokers in a cluster

    Disk usage is determined to be skewed if the relative mean absolute difference of all broker sizes exceeds 10%.
Online topic partitions
    Total number of online topic partitions
Under replicated topic partitions
  .. include:: fragments/brokerClusterUnderReplicated.rst
Offline topic partitions
  .. include:: fragments/brokerClusterOfflineTopicPartitions.rst

Broker Metrics Table
--------------------

Id
    Id for this broker
Throughput
    **Bytes In** / **Bytes Out** – Number of bytes per second produced to, or fetched from this broker (including from other brokers as part of replication)
Latency (produce)
  .. include:: fragments/brokerProductionRequestLatency.rst
Latency (fetched)
  .. include:: fragments/brokerFetchRequestLatency.rst
Partition replicas
    Total number of partition replicas served by this broker
Segment
    Total size in bytes of the log segments served by this broker (excluding index size)
Rack
    Rack Id for this broker

Topic Aggregate Metrics
-----------------------

Topic count
    Total number of topics
In sync replicas
  .. include:: fragments/topicInSyncReplica.rst
Out of sync replicas
    Total number of partition replicas that are out of sync

Topic Metrics Table
-------------------

Name
    Topic name
Throughput
    **Bytes In** / **Bytes Out** – Number of bytes per second produced to, or fetched from this topic (including from replicas)
Partition replicas
    **Total** – Total number of partition replicas for this topic

    **In Sync** – Total number of partition replicas that are in sync

    **Out of Sync** – Total number of partition replicas that are in sync
Partitions
    **Total** – Number of partitions for this topic

    **Under replicated** – Number of partitions that are under replicated (i.e. partitions with in-sync replicas < replication factor)
Segment
    **Count** – Number of log segments for this topic across all partition leaders.

    **Size** – Size in bytes of the log for this topic (does not include replicas)
Offset
    **Start** – Minimum offset across all partitions for this topic

    **End** – Maximum offset across all partitions for this topic