Diagnostics Tool | Cluster Administration

Overview

The oc adm diagnostics command runs a series of checks for error conditions in the host or cluster. Specifically, it:

Verifies that the default registry and router are running and correctly configured.
Checks ClusterRoleBindings and ClusterRoles for consistency with base policy.
Checks that all of the client configuration contexts are valid and can be connected to.
Checks that SkyDNS is working properly and the pods have SDN connectivity.
Validates master and node configuration on the host.
Checks that nodes are running and available.
Analyzes host logs for known errors.
Checks that systemd units are configured as expected for the host.

Using the Diagnostics Tool

OpenShift Origin can be deployed in many ways: built from source, included in a VM image, in a container image, or as enterprise RPMs. Each method implies a different configuration and environment. To minimize environment assumptions, the diagnostics were added to the openshift binary so that wherever there is an OpenShift Origin server or client, the diagnostics can run in the exact same environment.

To use the diagnostics tool, preferably on a master host and as cluster administrator, run:

$ oc adm diagnostics

This runs all available diagnostics, skipping any that do not apply.

You can run one or multiple specific diagnostics by name, or run specific diagnostics by name as you work to address issues. For example:

$ oc adm diagnostics <name1> <name2>

The options mostly require working configuration files. For example, the NodeConfigCheck does not run unless a node configuration is available.

Diagnostics look for configuration files in standard locations:

Client:
- As indicated by the $KUBECONFIG environment variable variable
- ~/.kube/config file
Master:
- /etc/origin/master/master-config.yaml
Node:
- /etc/origin/node/node-config.yaml

Non-standard locations can be specified with flags (respectively, --config, --master-config, and --node-config). If a configuration file is not found or specified, related diagnostics are skipped.

Available diagnostics include:

Diagnostic Name Purpose

Diagnostic Name	Purpose
`AggregatedLogging`	Check the aggregated logging integration for proper configuration and operation.
`AnalyzeLogs`	Check systemd service logs for problems. Does not require a configuration file to check against.
`ClusterRegistry`	Check that the cluster has a working Docker registry for builds and image streams.
`ClusterRoleBindings`	Check that the default cluster role bindings are present and contain the expected subjects according to base policy.
`ClusterRoles`	Check that cluster roles are present and contain the expected permissions according to base policy.
`ClusterRouter`	Check for a working default router in the cluster.
`ConfigContexts`	Check that each context in the client configuration is complete and has connectivity to its API server.
`DiagnosticPod`	Creates a pod that runs diagnostics from an application standpoint, which checks that DNS within the pod is working as expected and the credentials for the default service account authenticate correctly to the master API.
`EtcdWriteVolume`	Check the volume of writes against etcd for a time period and classify them by operation and key. This diagnostic only runs if specifically requested, because it does not run as quickly as other diagnostics and can increase load on etcd.
`MasterConfigCheck`	Check this host’s master configuration file for problems.
`MasterNode`	Check that the master running on this host is also running a node to verify that it is a member of the cluster SDN.
`MetricsApiProxy`	Check that the integrated Heapster metrics can be reached via the cluster API proxy.
`NetworkCheck`	Create diagnostic pods on multiple nodes to diagnose common network issues from an application standpoint. For example, this checks that pods can connect to services, other pods, and the external network. If there are any errors, this diagnostic stores results and retrieved files in a local directory (*/tmp/openshift/*, by default) for further analysis. The directory can be specified with the `--network-logdir` flag.
`NodeConfigCheck`	Checks this host’s node configuration file for problems.
`NodeDefinitions`	Check that the nodes defined in the master API are ready and can schedule pods.
`RouteCertificateValidation`	Check all route certificates for those that might be rejected by extended validation.
`ServiceExternalIPs`	Check for existing services that specify external IPs, which are disallowed according to master configuration.
`UnitStatus`	Check systemd status for units on this host related to OpenShift Origin. Does not require a configuration file to check against.

AggregatedLogging

Check the aggregated logging integration for proper configuration and operation.

AnalyzeLogs

Check systemd service logs for problems. Does not require a configuration file to check against.

ClusterRegistry

Check that the cluster has a working Docker registry for builds and image streams.

ClusterRoleBindings

Check that the default cluster role bindings are present and contain the expected subjects according to base policy.

ClusterRoles

Check that cluster roles are present and contain the expected permissions according to base policy.

ClusterRouter

Check for a working default router in the cluster.

ConfigContexts

Check that each context in the client configuration is complete and has connectivity to its API server.

DiagnosticPod

Creates a pod that runs diagnostics from an application standpoint, which checks that DNS within the pod is working as expected and the credentials for the default service account authenticate correctly to the master API.

EtcdWriteVolume

Check the volume of writes against etcd for a time period and classify them by operation and key. This diagnostic only runs if specifically requested, because it does not run as quickly as other diagnostics and can increase load on etcd.

MasterConfigCheck

Check this host’s master configuration file for problems.

MasterNode

Check that the master running on this host is also running a node to verify that it is a member of the cluster SDN.

MetricsApiProxy

Check that the integrated Heapster metrics can be reached via the cluster API proxy.

NetworkCheck

Create diagnostic pods on multiple nodes to diagnose common network issues from an application standpoint. For example, this checks that pods can connect to services, other pods, and the external network.

If there are any errors, this diagnostic stores results and retrieved files in a local directory (/tmp/openshift/, by default) for further analysis. The directory can be specified with the --network-logdir flag.

NodeConfigCheck

Checks this host’s node configuration file for problems.

NodeDefinitions

Check that the nodes defined in the master API are ready and can schedule pods.

RouteCertificateValidation

Check all route certificates for those that might be rejected by extended validation.

ServiceExternalIPs

Check for existing services that specify external IPs, which are disallowed according to master configuration.

UnitStatus

Check systemd status for units on this host related to OpenShift Origin. Does not require a configuration file to check against.

Running Diagnostics in a Server Environment

Master and node diagnostics are most useful in an Ansible-deployed cluster. This provides some diagnostic benefits:

Master and node configuration is based on a configuration file in a standard location.
Systemd units are configured to manage the server(s).
All components log to journald.

Having configuration files where Ansible places them means that you will generally not need to specify where to find them. Running oc adm diagnostics without flags will look for master and node configurations in the standard locations and use them if found; this should make the Ansible-installed use case as simple as possible. Also, it is easy to specify configuration files that are not in the expected locations:

$ oc adm diagnostics --master-config=<file_path> --node-config=<file_path>

Systemd units and logs entries in journald are necessary for the current log diagnostic logic. For other deployment types, logs may be going into files, to stdout, or may combine node and master. At this time, for these situations, log diagnostics are not able to work properly and will be skipped.

Running Diagnostics in a Client Environment

You may have access as an ordinary user, and/or as a cluster-admin user, and/or may be running on a host where OpenShift Origin master or node servers are operating. The diagnostics attempt to use as much access as the user has available.

A client with ordinary access should be able to diagnose its connection to the master and run a diagnostic pod. If multiple users or masters are configured, connections will be tested for all, but the diagnostic pod only runs against the current user, server, or project.

A client with cluster-admin access available (for any user, but only the current master) should be able to diagnose the status of infrastructure such as nodes, registry, and router. In each case, running oc adm diagnostics looks for the client configuration in its standard location and uses it if available.

Ansible-based Health Checks

Additional diagnostic health checks are available through the Ansible-based tooling used to install and manage OpenShift Origin clusters. They can report common deployment problems for the current OpenShift Origin installation.

These checks can be run either using the ansible-playbook command (the same method used during Advanced Installation) or as a containerized version of openshift-ansible. For the ansible-playbook method, the checks are provided by the openshift-ansible Git repository. For the containerized method, the openshift/origin-ansible container image is distributed via Docker Hub. Example usage for each method are provided in subsequent sections.

The following health checks are a set of diagnostic tasks that are meant to be run against the Ansible inventory file for a deployed OpenShift Origin cluster using the provided health.yml playbook.

Due to potential changes the health check playbooks could make to hosts, they should only be used on clusters that have been deployed using Ansible and using the same inventory file with which it was deployed. Changes mostly involve installing dependencies so that the checks can gather required information, but it is possible for certain system components (for example, docker or networking) to be altered if their current state differs from the configuration in the inventory file. Only run these health checks if you would not expect your inventory file to make any changes to your current cluster configuration.

Table 1. Diagnostic Health Checks
Check Name	Purpose
`etcd_imagedata_size`	This check measures the total size of OpenShift Origin image data in an etcd cluster. The check fails if the calculated size exceeds a user-defined limit. If no limit is specified, this check will fail if the size of image data amounts to 50% or more of the currently used space in the etcd cluster. A failure from this check indicates that a significant amount of space in etcd is being taken up by OpenShift Origin image data, which can eventually result in your etcd cluster crashing. A user-defined limit may be set by passing the `etcd_max_image_data_size_bytes` variable. For example, setting `etcd_max_image_data_size_bytes=40000000000` will cause the check to fail if the total size of image data stored in etcd exceeds 40 GB.
`etcd_traffic`	This check detects higher-than-normal traffic on an etcd host. It fails if a `journalctl` log entry with an etcd sync duration warning is found. For further information on improving etcd performance, see Recommended Practices for OpenShift Origin etcd Hosts and the Red Hat Knowledgebase.
`etcd_volume`	This check ensures that the volume usage for an etcd cluster is below a maximum user-specified threshold. If no maximum threshold value is specified, it is defaulted to `90%` of the total volume size. A user-defined limit may be set by passing the `etcd_device_usage_threshold_percent` variable.
`docker_storage`	Only runs on hosts that depend on the docker daemon (nodes and containerized installations). Checks that docker's total usage does not exceed a user-defined limit. If no user-defined limit is set, docker's maximum usage threshold defaults to 90% of the total size available. The threshold limit for total percent usage can be set with a variable in your inventory file, for example `max_thinpool_data_usage_percent=90`. This also checks that docker's storage is using a supported configuration.
`curator`, `elasticsearch`, `fluentd`, `kibana`	This set of checks verifies that Curator, Kibana, Elasticsearch, and Fluentd pods have been deployed and are in a `running` state, and that a connection can be established between the control host and the exposed Kibana URL. These checks will only run if the `openshift_logging_install_logging` inventory variable is set to `true`, to ensure that they are executed in a deployment where cluster logging has been enabled.
`logging_index_time`	This check detects higher than normal time delays between log creation and log aggregation by Elasticsearch in a logging stack deployment. It fails if a new log entry cannot be queried through Elasticsearch within a timeout (by default, 30 seconds). The check only runs if logging is enabled. A user-defined timeout may be set by passing the `openshift_check_logging_index_timeout_seconds` variable. For example, setting `openshift_check_logging_index_timeout_seconds=45` will cause the check to fail if a newly-created log entry is not able to be queried via Elasticsearch after 45 seconds.

A similar set of checks meant to run as part of the installation process can be found in Configuring Cluster Pre-install Checks. Another set of checks for checking certificate expiration can be found in Redeploying Certificates.

Running Health Checks via ansible-playbook

To run the openshift-ansible health checks using the ansible-playbook command, specify your cluster’s inventory file and run the health.yml playbook:

# ansible-playbook -i <inventory_file> \
    ~/openshift-ansible/playbooks/openshift-checks/health.yml

To set variables in the command line, include the -e flag with any desired variables in key=value format. For example:

# ansible-playbook -i <inventory_file> \
    ~/openshift-ansible/playbooks/openshift-checks/health.yml
    -e openshift_check_logging_index_timeout_seconds=45
    -e etcd_max_image_data_size_bytes=40000000000

To disable specific checks, include the variable openshift_disable_check with a comma-delimited list of check names in your inventory file before running the playbook. For example:

openshift_disable_check=etcd_traffic,etcd_volume

Alternatively, set any checks you want to disable as variables with -e openshift_disable_check=<check1>,<check2> when running the ansible-playbook command.

Running Health Checks via Docker CLI

It is possible to run the openshift-ansible playbooks in a Docker container, avoiding the need for installing and configuring Ansible, on any host that can run the origin-ansible image via the Docker CLI.

To do so, specify your cluster’s inventory file and the health.yml playbook when running the following docker run command as a non-root user that has privileges to run containers:

# docker run -u `id -u` \ (1)
    -v $HOME/.ssh/id_rsa:/opt/app-root/src/.ssh/id_rsa:Z,ro \ (2)
    -v /etc/ansible/hosts:/tmp/inventory:ro \ (3)
    -e INVENTORY_FILE=/tmp/inventory \
    -e PLAYBOOK_FILE=playbooks/openshift-checks/health.yml \ (4)
    -e OPTS="-v -e openshift_check_logging_index_timeout_seconds=45 -e etcd_max_image_data_size_bytes=40000000000" \ (5)
    openshift/origin-ansible

1	These options make the container run with the same UID as the current user, which is required for permissions so that the SSH key can be read inside the container (SSH private keys are expected to be readable only by their owner).
2	Mount SSH keys as a volume under */opt/app-root/src/.ssh* under normal usage when running the container as a non-root user.
3	Change */etc/ansible/hosts* to the location of your cluster’s inventory file, if different. This file will be bind-mounted to */tmp/inventory*, which is used according to the `INVENTORY_FILE` environment variable in the container.
4	The `PLAYBOOK_FILE` environment variable is set to the location of the *health.yml* playbook relative to */usr/share/ansible/openshift-ansible* inside the container.
5	Set any variables desired for a single run with the `-e key=value` format.

In the above command, the SSH key is mounted with the :Z flag so that the container can read the SSH key from its restricted SELinux context; this means that your original SSH key file will be relabeled to something like system_u:object_r:container_file_t:s0:c113,c247. For more details about :Z, see the docker-run(1) man page.

Keep this in mind for these volume mount specifications because it could have unexpected consequences. For example, if you mount (and therefore relabel) your $HOME/.ssh directory, sshd will become unable to access your public keys to allow remote login. To avoid altering the original file labels, mounting a copy of the SSH key (or directory) is recommended.

You might want to mount an entire .ssh directory for various reasons. For example, this would allow you to use an SSH configuration to match keys with hosts or modify other connection parameters. It would also allow you to provide a known_hosts file and have SSH validate host keys, which is disabled by the default configuration and can be re-enabled with an environment variable by adding -e ANSIBLE_HOST_KEY_CHECKING=True to the docker command line.