Chapter 7. Writing A Zenoss Performance Collector

Zenoss is designed to be an extensible platform for integrating new performance collectors. In principle, writing one is a simple matter of getting the list of devices and sending and receiving data over the network to collect new values; this is, at its core, what every collector does.

In practice, however, there are many smaller steps that need to be done to integrate well with Zenoss.

Let's start with a minimal collector. The collector must accept the same command-line options used by the other Zenoss collectors. It should support:

$ collector start

This should daemonize the collector, running it in the background until stopped.

$ collector stop

This should find the running collector and stop it with a graceful shutdown.

$ collector run

The collector should run for one cycle (if it has a cycle), should not daemonize, and should log to stderr.

$ collector run -d someDevice -v 10

The -d option restricts the collector to a single device. The -v option sets the verbosity of the logging: 10 means Debug, 20 Info, 30 Warning, 40 Error, and 50 Critical. That is, higher numbers reduce the amount of logging.
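These numbers are the standard Python logging levels, so a collector can hand the -v value straight to the logging module. A minimal sketch; the function name and log format here are illustrative, not part of the Zenoss code:

```python
import logging

def set_log_level(verbosity):
    """Apply a numeric -v value (10=Debug ... 50=Critical) to the root logger."""
    # Clamp to the standard logging range so bad input cannot disable logging.
    level = min(max(int(verbosity), logging.DEBUG), logging.CRITICAL)
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s")
    return level
```

With `-v 10` every debug message appears; with the default of 20 only Info and above are shown.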

These options come with very little implementation effort if the collector is started from a shell script like the other collectors and is derived from PBDaemon or ZCmdClass.

Writing collectors in other languages will require this same infrastructure.

Each collector should post a periodic event, called the heartbeat. If a collector's heartbeat event is not refreshed in time, the Zenoss GUI will indicate a problem with the collector. Ideally, the collector only sends a heartbeat event after each successful collection cycle. It is not acceptable to post heartbeats from a separate thread or timer, unless that thread also performs some minimal testing of internal status and health.
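The pattern above can be sketched as follows. The heartbeat field names follow common Zenoss event conventions (event class /Heartbeat, device set to the monitor, component set to the daemon name), but `send_event` is an assumed callback standing in for the daemon's real event-posting method:

```python
def make_heartbeat(monitor, daemon, timeout):
    """Build a heartbeat event dict using conventional Zenoss field names."""
    return {"eventClass": "/Heartbeat",
            "device": monitor,
            "component": daemon,
            "timeout": timeout}

def cycle_and_heartbeat(collect, send_event, monitor="localhost",
                        daemon="zencustom", cycle_seconds=300):
    """Run one collection cycle; post a heartbeat only when it succeeds,
    so a wedged collector stops heartbeating on its own."""
    ok = collect()
    if ok:
        # A timeout of three cycles gives the console slack for one slow cycle.
        send_event(make_heartbeat(monitor, daemon, 3 * cycle_seconds))
    return ok
```

Because the heartbeat is tied to a successful cycle, a collector that hangs or fails repeatedly falls silent and is flagged in the GUI, which is exactly the behavior the heartbeat exists to provide.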

Each collector should post an event when it is shut down, so that the console is kept informed of intentional shutdowns. These shutdown events should be cleared by matching start events. Start and shutdown events should only be sent when the collector is daemonized.

Each collector should post values to RRD files and execute thresholds against those updates. There are Python classes to do this, but if the RRD values are posted back to ZenHub, ZenHub takes care of this for the collector.

Data collection needs to work in a wide variety of networking infrastructures, so it must perform acceptably over high-latency wide-area networks. Collectors should intentionally interleave requests to multiple devices to reduce the overall time needed to walk the device list, but should not overload any single device by sending it multiple outstanding requests.
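Zenoss collectors implement this with Twisted, but the idea can be sketched more briefly with asyncio: launch every request at once so the event loop interleaves them across devices, while a per-device semaphore caps the number of outstanding requests to any one device. All names here are illustrative, not Zenoss APIs:

```python
import asyncio

async def collect_all(devices, fetch, max_outstanding_per_device=1):
    """Interleave requests across all devices, keeping at most
    max_outstanding_per_device requests in flight to any single device.

    devices maps a device name to the list of data points to request;
    fetch(device, point) is an awaitable that returns one raw value.
    """
    results = {}
    sems = {d: asyncio.Semaphore(max_outstanding_per_device) for d in devices}

    async def one(device, point):
        async with sems[device]:           # serialize per device
            results.setdefault(device, {})[point] = await fetch(device, point)

    # Launch every (device, point) request at once; the event loop interleaves
    # them so one slow, high-latency device does not serialize the whole walk.
    await asyncio.gather(*(one(d, p) for d in devices for p in devices[d]))
    return results
```

The per-device semaphore is the key design point: total concurrency stays high across the device list, while any individual device sees at most one request at a time.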

To make collection debuggable, the collector should be capable of logging detailed debugging output at each step of collection, as well as posting events about collection failures. In particular, logging raw values and errors from devices helps find errors in post-processing. Summary performance information, such as the total number of devices collected or the total collection time, should be logged at the informational level (above debug).
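For example, raw per-device values and errors can go to the debug log while a single cycle summary goes out at info. The logger name and function are hypothetical, though the "zen." logger-name prefix follows the convention used by the Zenoss daemons:

```python
import logging
import time

log = logging.getLogger("zen.mycollector")   # hypothetical collector name

def record_cycle(raw_values, errors, started):
    """Log raw values and errors at debug, plus a one-line summary at info."""
    for device, value in raw_values.items():
        log.debug("raw value from %s: %r", device, value)
    for device, err in errors.items():
        log.debug("error from %s: %s", device, err)
    summary = "collected %d devices (%d errors) in %.1fs" % (
        len(raw_values), len(errors), time.time() - started)
    log.info(summary)
    return summary
```

Run with -v 10, this shows every raw value before post-processing; at the default level only the summary line appears.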

Since collectors generally run long-term, cached values and other stored or pre-computed values should be periodically purged, both to synchronize the collector's state with the real world and to eliminate possible memory leaks.
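One common shape for this is a cache whose entries expire after a fixed age; a periodic purge both resynchronizes state and bounds memory. This is a generic sketch, not Zenoss code:

```python
import time

class ExpiringCache:
    """Cache whose entries are dropped after max_age seconds, so a
    long-running collector periodically resynchronizes with the real world."""

    def __init__(self, max_age, clock=time.monotonic):
        self.max_age = max_age
        self.clock = clock       # injectable for testing
        self._data = {}          # key -> (timestamp, value)

    def put(self, key, value):
        self._data[key] = (self.clock(), value)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        stamp, value = entry
        if self.clock() - stamp > self.max_age:
            del self._data[key]  # stale: force a fresh lookup
            return default
        return value

    def purge(self):
        """Drop every expired entry; call this once per collection cycle."""
        now = self.clock()
        for key in [k for k, (s, _) in self._data.items()
                    if now - s > self.max_age]:
            del self._data[key]
```

Calling `purge()` once per cycle keeps the cache from growing without bound even for keys that are never read again.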

If the collector monitors device components as well as whole devices, it may be necessary to load the device configuration information incrementally. If gathering the configuration information takes 30 minutes, the collector is simply too slow and unresponsive. Instead, it should load its configuration incrementally, performing collection against the devices it already knows about. It can cache the configuration information persistently to provide a larger "initial set" of configuration at start-up.

Many collectors benefit from "pre-failing" their devices: they get the list of devices currently marked down by the ping tester and skip those devices during collection. This eliminates unnecessarily long delays caused by running collection against devices that are simply unreachable.
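Pre-failing reduces to filtering the device list against the set of devices currently marked down; a hypothetical helper:

```python
def prefail(devices, down_devices):
    """Partition the device list: collect only devices the ping tester
    currently considers up, and skip (pre-fail) the rest."""
    down = set(down_devices)
    to_collect = [d for d in devices if d not in down]
    skipped = [d for d in devices if d in down]
    return to_collect, skipped
```

The skipped list can still be logged or reported, so the console shows that the devices were deliberately bypassed rather than silently missed.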

Lower-Level Network Code

Zenoss Client/Client Wrapper

Plugin Package

DataCollector Integration