Troubleshooting Control Center

Common issues

Installation and setup

If you encounter issues during installation and setup, you can try these solutions.

Bad security configuration

  • Check the security configuration for all brokers, the metrics reporter, client interceptors, and Control Center (see Check configurations). For example, is the security protocol SASL_SSL, SASL_PLAINTEXT, or SSL?

  • Possible errors include:

    ERROR SASL authentication failed using login context 'Client'. (org.apache.zookeeper.client.ZooKeeperSaslClient)
    
    Caused by: org.apache.kafka.common.KafkaException: java.lang.IllegalArgumentException: No serviceName defined in either JAAS or Kafka configuration
    
    org.apache.kafka.common.errors.IllegalSaslStateException: Unexpected handshake request with client mechanism GSSAPI, enabled mechanisms are [GSSAPI]
    
  • Verify that the correct Java Authentication and Authorization Service (JAAS) configuration was detected.

  • If ACLs are enabled, check them.

  • To verify that you can communicate with the cluster, try to produce and consume with the console producer and consumer (kafka-console-producer and kafka-console-consumer), using the same security settings.
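
One way to reuse the same security settings for the console tools is sketched below. It assumes that Control Center passes client security settings through properties that start with confluent.controlcenter.streams., so check your own control-center.properties before relying on it. The helper name strip_streams_prefix and the client.properties file name are hypothetical.

```shell
# Hypothetical helper: print the client settings that Control Center passes
# through, with the assumed "confluent.controlcenter.streams." prefix removed.
strip_streams_prefix() {
  sed -n 's/^confluent\.controlcenter\.streams\.//p' "$1"
}

# Example usage (not run here; requires a reachable cluster):
#   strip_streams_prefix control-center.properties > client.properties
#   kafka-console-producer --broker-list <host:port> --topic <test_topic> \
#     --producer.config client.properties
#   kafka-console-consumer --bootstrap-server <host:port> --topic <test_topic> \
#     --from-beginning --consumer.config client.properties
```

If both commands work with the extracted settings, the problem is more likely in how a specific component is configured than in the cluster security setup itself. Newer console producer releases also accept --bootstrap-server in place of --broker-list.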

InvalidStateStoreException

  • This error usually indicates that data is corrupted in the configured confluent.controlcenter.data.dir. For example, this can be caused by an unclean shutdown. To fix it, give Control Center a new ID by changing confluent.controlcenter.id, and then restart Control Center.
  • Verify that the user that runs Control Center has read and write permission for the configured confluent.controlcenter.data.dir.

Not enough brokers

Check the logs for the related "not enough brokers" error. Verify that the topic replication factors are set correctly and that enough brokers are available.
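
As a rough sketch of that comparison, the helper below lists every property whose name contains "replication" and whose value is larger than the number of brokers you pass in. The helper name replication_over is hypothetical, and matching on the word "replication" is an assumption about how the replication settings are named in your properties file.

```shell
# Hypothetical helper: list replication settings that exceed the broker count.
# $1 = number of available brokers, $2 = properties file to inspect.
replication_over() {
  awk -F= -v n="$1" '
    # Skip commented-out lines, then compare the value with the broker count.
    $1 !~ /^ *#/ && $1 ~ /replication/ && $2 + 0 > n + 0 { print }
  ' "$2"
}

# Example usage: replication_over 1 control-center.properties
```

Any line it prints is a setting that cannot be satisfied with the brokers currently available.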

Local store permissions

Check the local permissions on the Control Center state directory. The directory is defined by confluent.controlcenter.data.dir in the control-center.properties file. Verify that the user ID that was used to start Control Center can access that directory.
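
A minimal sketch of that permission check follows. Run it as the same user that starts Control Center. The helper name check_data_dir is hypothetical, and the path you pass in should be whatever confluent.controlcenter.data.dir is set to in your environment.

```shell
# Hypothetical helper: report whether the current user can read, write, and
# traverse the given state directory.
check_data_dir() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing: $dir"
    return 1
  fi
  if [ -r "$dir" ] && [ -w "$dir" ] && [ -x "$dir" ]; then
    echo "ok: $dir"
  else
    echo "no access: $dir"
    return 1
  fi
}

# Example usage: check_data_dir <data.dir>
```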

Multiple Control Center instances with the same ID

You must use unique IDs for each Control Center instance, including instances in Docker. Duplicate IDs are not supported and will cause problems.
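
If you keep the properties files for several instances in one place, a quick duplicate check might look like the sketch below. The helper name find_duplicate_ids is hypothetical, and it only sees IDs set through confluent.controlcenter.id in the files you pass in (for example, not IDs injected through Docker environment variables).

```shell
# Hypothetical helper: print every confluent.controlcenter.id value that
# appears in more than one of the given properties files.
find_duplicate_ids() {
  grep -h '^confluent\.controlcenter\.id=' "$@" | cut -d= -f2 | sort | uniq -d
}

# Example usage: find_duplicate_ids instance-a.properties instance-b.properties
```

No output means that the IDs in the listed files are unique.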

License expired

If you see a message similar to this:

[2017-08-21 14:12:33,812] WARN checking license failure. please contact [email protected] for a license key: Unable to process JOSE object (cause: org.jose4j.lang.JoseException: Invalid JOSE….

You should verify that the user has a valid license, as specified in confluent.license=<your key>. Prior to Confluent Platform 3.3.1, the value for key (<your key>) must be the actual key value. With Confluent Platform 3.3.1 and later, this can be either the key or a path to a license file. For more information, see the Control Center configuration documentation.

System health

Web interface that is blank or stuck loading

If you experience a web interface that is blank or stuck loading, you can select the cluster in the drop-down and use the information below to troubleshoot.

  • Are there errors or warnings in the logs? For more information on how to find logs, see the documentation.

  • What are you monitoring? Are you under-provisioned?

  • Is Control Center lagging, especially on the MetricsAggregateStore partitions?

  • Use browser debugging tools to check REST calls to find out if the requests have been made successfully and with a valid response, specifically these requests:

    [Image: c3-troubleshoot.png]

    Tip: You can view these calls by using common web browser tools (e.g., Chrome Developer Tools).

  • The /2.0/metrics/<cluster-id>/maxtime endpoint should return the latest timestamp that Control Center has for metrics data.

  • If no data is returned from the backend, verify that you’re getting data on the input topic and review the logs for issues.

I see a rocket ship

If you see a rocket ship graphic in the web interface, you can use the information below to troubleshoot.

[Image: rocketship.png]
  • Is the correct cluster selected in the drop-down?
  • Usually this means that Kafka doesn’t have any metrics data, but this image could also indicate a 500 Internal Server Error has occurred.
  • Use browser debugging tools to check the response. An empty response ({ }) from the /2.0/metrics/<cluster-id>/maxtime endpoint means that Kafka hasn’t received any metrics data.
  • Verify that the metrics reporter is set up correctly. Dump the _confluent-metrics input topic to see if there are any messages produced.
  • If you get a 500 error, check the Control Center logs for errors.
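
The endpoint check above can also be done from a terminal. The sketch below classifies a response body that you fetch with curl. The helper name maxtime_status is hypothetical, and the host and port in the usage comment are placeholders for your own Control Center address.

```shell
# Hypothetical helper: classify the body returned by the maxtime endpoint.
maxtime_status() {
  # Remove all whitespace so that "{ }" and "{}" compare equal.
  body=$(printf '%s' "$1" | tr -d '[:space:]')
  if [ -z "$body" ]; then
    echo "no response"
  elif [ "$body" = "{}" ]; then
    echo "no metrics data"
  else
    echo "metrics data present"
  fi
}

# Example usage (not run here):
#   maxtime_status "$(curl -s "http://<host:port>/2.0/metrics/<cluster-id>/maxtime")"
```

A result of "no metrics data" points at the metrics reporter or the _confluent-metrics topic, while "no response" suggests checking the Control Center logs first.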

Nothing is produced on the Metrics (_confluent-metrics) topic

Control Center is lagging behind Kafka

If Control Center is not reporting the latest data and the charts are falling behind, you can use this information to troubleshoot.

  • This can happen if Control Center is underpowered or is working through a large backlog.
  • Check the offset lag (see Consumer offset lag). If the lag is large and increasing over time, Control Center may not be able to handle the monitoring load. For additional checks, see Size of clusters and System check.
  • With Confluent Platform 3.3.x and later, you can set a short amount of time for the skip backlog monitoring settings: confluent.monitoring.interceptor.topic.skip.backlog.minutes and confluent.metrics.topic.skip.backlog.minutes. For example, you can set this to 0 if you want to process from the latest offsets. Control Center will ignore everything on the input topics older than a specified amount of time. This is useful when you need Control Center to catch up faster. For more information, see the Control Center configuration documentation.
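
As an illustration of the last point, the sketch below sets both skip backlog properties to 0 in a properties file. The helper name set_prop is hypothetical. It assumes plain key=value lines, and it treats the dots in the key as regular expression wildcards, which is loose but adequate for these keys. Back up the file before editing it, and restart Control Center afterwards.

```shell
# Hypothetical helper: set key=value in a properties file, replacing an
# existing line for the key or appending a new one.
set_prop() {
  file="$1"; key="$2"; value="$3"
  if grep -q "^${key}=" "$file"; then
    # Rewrite through a temp file, then copy back to keep file permissions.
    tmp=$(mktemp)
    sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp"
    cat "$tmp" > "$file"
    rm -f "$tmp"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Example usage:
#   set_prop control-center.properties confluent.metrics.topic.skip.backlog.minutes 0
#   set_prop control-center.properties confluent.monitoring.interceptor.topic.skip.backlog.minutes 0
```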

RecordTooLargeException

If you receive this error in the broker logs, you can use this information to troubleshoot.

  • Set confluent.metrics.reporter.max.request.size=10485760 in the broker server.properties file. This is the default in 3.3.x and later.
  • Change the topic configuration for _confluent-metrics to accept large messages. This is the default in 3.3.x and later. For more information, see the Metrics Reporter message size documentation.

    $ bin/kafka-topics.sh --zookeeper <host:port> --alter --config max.message.bytes=10485760 --topic _confluent-metrics

Parts of the broker or topic table have blank values

This is a known issue that should be transient until Control Center is caught up. It can be caused by:

  • Different streams topologies that are processing at different rates during restore.
  • Control Center is lagging or having trouble keeping up due to lack of resources.

Streams Monitoring

Blank charts

If you are experiencing blank charts, you can use this information to troubleshoot.

  • Are you getting any data on the _confluent-monitoring topic?
  • Control Center doesn’t show unconsumed messages because Confluent doesn’t know the expected consumption. Future versions will have dedicated produce charts.
  • Verify that the interceptors are set up correctly, including security, and that the messages make it to _confluent-monitoring for the time range selected. For example:
    • {"clientType":"PRODUCER","clientId":"rock-client-producer-4","group...
    • {"clientType":"CONSUMER","clientId":"rock-client-consumer-2","group...

Unexpected herringbone pattern

If you are experiencing an unexpected herringbone pattern, you can use this information to troubleshoot.

  • Verify whether the clients are properly shut down.

  • Look for these errors in client logs:

    • Failed to shutdown metrics reporting thread...
    • Failed to publish all cached metrics on termination for...
    • ERROR Terminating publishing and collecting monitoring metrics for
    • Failed to close monitoring interceptor for…

Missing consumers or consumer groups

If you are missing consumers or consumer groups, you can use this information to troubleshoot.

  • Look for errors or warnings in the missing client’s log.
  • Verify whether the input topic is receiving interceptor data for the missing client.
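
To see which clients are actually reporting, you can count the interceptor messages per client ID. The sketch below assumes that the consumer prints one JSON message per line with a "clientId" field, as in the sample messages shown under Blank charts. The helper name count_by_client is hypothetical.

```shell
# Hypothetical helper: read monitoring messages on stdin and print
# "<clientId> <message count>" for each client, sorted by client ID.
count_by_client() {
  grep -o '"clientId":"[^"]*"' | cut -d'"' -f4 | sort | uniq -c | awk '{print $2, $1}'
}

# Example usage (not run here; stop the consumer with Ctrl-C after a while):
#   bin/control-center-console-consumer config/control-center.properties \
#     --topic _confluent-monitoring | count_by_client
```

A client that never appears in the output is not delivering interceptor data, so its interceptor configuration and logs are the next place to look.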

Connect

I see a rocket ship

If you see a rocket ship graphic in the web interface, you can use the information below to troubleshoot.

[Image: rocketship.png]
  • Is the Connect cluster that is defined in confluent.controlcenter.connect.cluster available?
  • Can you reach the Connect endpoints directly by running a cURL command (e.g., curl www.example.com)?
  • Check the Connect logs for any errors. Control Center is a proxy to Connect.

Debugging

Check logs

These are the Control Center log types.

  • c3.log - Control Center and HTTP logs that are not related to Streams
  • c3-streams.log - Streams
  • c3-kafka.log - Client, ZooKeeper, and Kafka

Here are things to look for in the logs:

  • ERROR
  • shutdown
  • Exceptions - verify that the brokers can be reached
  • WARN
  • Healthcheck errors and warnings
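
A first pass over a log file for the items above could be as simple as the sketch below. The helper name scan_log is hypothetical, and the case-insensitive match is deliberately broad, so expect some noise.

```shell
# Hypothetical helper: print numbered log lines that mention an error, a
# warning, a shutdown, an exception, or a healthcheck.
scan_log() {
  grep -n -i -E 'ERROR|WARN|shutdown|Exception|healthcheck' "$1"
}

# Example usage: scan_log c3.log
```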

If nothing is obvious, turn DEBUG logging on and restart Control Center.

Enable debug and trace logging

  1. Open the <path-to-confluent>/etc/confluent-control-center/log4j-rolling.properties file. This file is referenced by the CONTROL_CENTER_LOG4J_OPTS environment variable.

  2. Change the log level (log4j.appender.streams.filter.1.level) to TRACE and comment out log4j.appender.streams.filter.1.rate. For example:

    log4j.rootLogger=INFO, stdout
    
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdout.layout.ConversionPattern=[%d] %p %m (%c)%n
    
    log4j.appender.streams=org.apache.log4j.ConsoleAppender
    log4j.appender.streams.layout=org.apache.log4j.PatternLayout
    log4j.appender.streams.layout.ConversionPattern=[%d] %p %m (%c)%n
    log4j.appender.streams.filter.1=io.confluent.Log4jRateFilter
    # will allow everything that is >=level
    log4j.appender.streams.filter.1.level=TRACE
    # will only allow rate/second logs at <level
    #log4j.appender.streams.filter.1.rate=25
    
    log4j.logger.kafka=ERROR, stdout
    log4j.logger.org.apache.kafka.streams=INFO, streams
    log4j.additivity.org.apache.kafka.streams=false
    log4j.logger.io.confluent.controlcenter.streams=INFO, streams
    log4j.additivity.io.confluent.controlcenter.streams=false
    log4j.logger.org.apache.zookeeper=ERROR, stdout
    log4j.logger.org.apache.kafka=ERROR, stdout
    log4j.logger.org.I0Itec.zkclient=ERROR, stdout
    
  3. Restart Control Center.

    $ ./bin/control-center-stop
    $ ./bin/control-center-start ./etc/confluent-control-center/control-center.properties
    

Check configurations

  • Is security enabled? Check the security configuration settings on the broker, clients, and Control Center.

  • Verify that the prefixes are correct.

  • Are the metrics reporter and interceptors installed and configured correctly?

  • Verify the topic configurations for all Control Center topics: replication factor, timestamp type, min ISR, and retention. Verify that the correct configurations are picked up by each process. You can use this command, where <host:port> is the ZooKeeper host and port:

    $ ./bin/kafka-topics --zookeeper <host:port> --describe
    

Review input topics

  • _confluent-monitoring and _confluent-metrics are the entry points for Control Center data

  • Verify that the input topics are created, where <host:port> is the ZooKeeper host and port and <input_topic> is the topic to check:

    $ bin/kafka-topics.sh --zookeeper <host:port> --describe --topic <input_topic>
    
  • Verify that data is being produced in the input topics. The security settings must be properly configured in the consumer for this to work. This is accomplished by specifying the properties file that was used to start Control Center (e.g., control-center.properties) in the following command, and setting <input_topic> to the topic you wish to read.

    $ bin/control-center-console-consumer config/control-center.properties --topic <input_topic>
    

Size of clusters

For examples on how to size your environment, review the Control Center example deployments.

System check

Check the system-level metrics on the host where Control Center is running, including CPU, memory, disk, and JVM settings. Are they within the recommended values?

Frontend and REST API

  • Using browser debugging tools, view the network activity to verify that the request and response show the correct data.

    Tip: You can right-click on the row to copy content as HAR or cURL.

    [Image: save-as-curl.png]
  • The backend REST calls are logged in c3.log.

Consumer offset lag

Verify that the offset lags for Control Center topics are not increasing over time. Review the MetricsAggregateStore and aggregate-rekey topics because they are often the bottleneck. Run this command multiple times to observe the trend, where <version> is the Control Center version and <control-center-id> is the Control Center ID.

$ ./bin/kafka-consumer-groups --bootstrap-server <host:port> --describe --group _confluent-controlcenter-<version>-<control-center-id>
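
To make the trend easier to compare between runs, you can reduce each run to a single total. The sketch below sums the LAG column of the describe output. The helper name sum_lag is hypothetical, and it assumes that the output has a header row containing a column named LAG; the exact column layout varies between Kafka versions.

```shell
# Hypothetical helper: read "kafka-consumer-groups --describe" output on
# stdin and print the total lag across all partitions.
sum_lag() {
  awk '
    # Until the header row is found, look for the position of the LAG column.
    col == 0 { for (i = 1; i <= NF; i++) if ($i == "LAG") col = i; next }
    # Add numeric lag values; rows that show "-" are skipped.
    $col ~ /^[0-9]+$/ { total += $col }
    END { print total + 0 }
  '
}

# Example usage (not run here):
#   ./bin/kafka-consumer-groups --bootstrap-server <host:port> --describe \
#     --group _confluent-controlcenter-<version>-<control-center-id> | sum_lag
```

A total that keeps growing across several runs a few minutes apart suggests that Control Center is not keeping up with the monitoring load.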

Enable GC logging

To enable GC logs, export the following setting and then restart Control Center from the same shell, where <dir> is the directory for the log file:

$ export CONTROL_CENTER_JVM_PERFORMANCE_OPTS="-server -verbose:gc -Xloggc:<dir>/gc.log -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+CMSScavengeBeforeRemark -XX:+DisableExplicitGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCCause -Djava.awt.headless=true"

Thread dump

Run this command for a thread dump:

$ jstack -l $(jcmd | grep -i 'controlcenter\.ControlCenter' | awk '{print $1}') > jstack.out

Data directory

The Control Center local state is stored in confluent.controlcenter.data.dir.

You can use this command to determine the size of your data directory (<data.dir>).

du -h <data.dir>