Zenoss is designed to be an extensible platform for integrating new performance collectors. Writing a collector should be a simple matter of getting the list of devices and sending and receiving data over the network to collect new values. This is what every collector does.
Each collector should post values to RRD files and execute thresholds against those updates. The Python class RRDUtil supports writing values to RRD files. The Python class Thresholds will simplify the execution of thresholds on each RRD update.
Data collection needs to work in a wide variety of networking infrastructures, so it must perform acceptably over high-latency wide-area networks. Collectors should intentionally interleave requests to multiple devices to reduce the overall time needed to walk the list of devices. Collectors should not overload a single device by sending multiple outstanding requests to that device.
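The interleaving rule above can be sketched as follows. This is a minimal illustration, not the Zenoss implementation: it uses the standard-library asyncio module only to stay self-contained, and the names collect_device, collect_all, fake_fetch, and max_devices are hypothetical. Each device is queried sequentially, so it never has more than one outstanding request, while many devices are in flight at once.

```python
import asyncio


async def collect_device(device, oids, fetch):
    """Query one device sequentially: at most one request is outstanding."""
    values = {}
    for oid in oids:
        # Awaiting each request before sending the next keeps the load
        # on this single device to one outstanding request.
        values[oid] = await fetch(device, oid)
    return device, values


async def collect_all(devices, oids, fetch, max_devices=50):
    """Interleave collection across devices, capped at max_devices in flight."""
    limit = asyncio.Semaphore(max_devices)

    async def guarded(device):
        async with limit:
            return await collect_device(device, oids, fetch)

    # gather() runs the per-device coroutines concurrently, so the latency
    # of one device overlaps with the latency of the others.
    results = await asyncio.gather(*(guarded(d) for d in devices))
    return dict(results)


async def fake_fetch(device, oid):
    """Hypothetical stand-in for a network request; sleeps to mimic latency."""
    await asyncio.sleep(0.01)
    return f"{device}:{oid}"


if __name__ == "__main__":
    print(asyncio.run(collect_all(["router1", "router2"], ["in", "out"], fake_fetch)))
```

With this shape, the total walk time is driven by the slowest device rather than by the sum of all round trips, which is the point of interleaving on a high-latency network.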
To make collection debuggable, the collector should be capable of logging detailed debugging output at each step of collection, as well as posting events about collection failures. In particular, logging raw values and errors from devices helps find errors in post-processing. Performance information, such as the total number of devices collected or the total collection time, should be logged at the informational level (above debug).
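The logging levels described above might be used as in the sketch below. It relies only on the standard logging module. The logger name, the collect_cycle function, the send_event callback, and the event fields are all hypothetical placeholders, not Zenoss APIs.

```python
import logging
import time

# Hypothetical logger name, chosen for illustration only.
log = logging.getLogger("zen.examplecollector")


def collect_cycle(devices, fetch, send_event):
    """Collect from each device, logging raw data at debug and totals at info."""
    start = time.monotonic()
    collected = 0
    for device in devices:
        try:
            raw = fetch(device)
            # Raw values go to debug so post-processing errors can be traced.
            log.debug("raw value from %s: %r", device, raw)
            collected += 1
        except Exception as exc:
            log.debug("error from %s: %r", device, exc)
            # Post an event about the failure (event fields are assumptions).
            send_event({
                "device": device,
                "severity": "error",
                "summary": f"collection failed: {exc}",
            })
    elapsed = time.monotonic() - start
    # Cycle-level performance figures are logged at info, above debug.
    log.info("collected %d of %d devices in %.2f seconds",
             collected, len(devices), elapsed)
    return collected
```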
Since collectors generally run for long periods, cached values and other stored or pre-computed values should be purged periodically, both to synchronize the collectors' state with the real world and to eliminate possible memory leaks.
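One way to implement periodic purging is an age-limited cache, sketched below. The ExpiringCache class and its method names are hypothetical. A real collector would call purge() from whatever periodic scheduling mechanism it already uses.

```python
import time


class ExpiringCache:
    """A dictionary-like cache whose entries are dropped after max_age seconds."""

    def __init__(self, max_age, clock=time.monotonic):
        self.max_age = max_age
        self.clock = clock      # injectable so the cache can be tested
        self._items = {}        # key -> (time stored, value)

    def put(self, key, value):
        self._items[key] = (self.clock(), value)

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        stored_at, value = entry
        if self.clock() - stored_at > self.max_age:
            # Stale entry: drop it so the caller re-reads the real state.
            del self._items[key]
            return default
        return value

    def purge(self):
        """Remove every stale entry and return how many were removed."""
        now = self.clock()
        stale = [k for k, (t, _) in self._items.items() if now - t > self.max_age]
        for key in stale:
            del self._items[key]
        return len(stale)
```

Purging on a timer, rather than only on lookup, also releases entries for devices that are never queried again, which is where slow memory growth tends to come from.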
If the collector monitors device components as well as whole devices, it may be necessary to load the device configuration information incrementally. A collector that spends 30 minutes gathering configuration information before it collects anything is simply too slow and unresponsive. Instead, the collector should load its configuration information incrementally, performing collection against the devices it already knows about. It can cache the configuration information persistently to provide a larger "initial set" of configuration upon start-up.
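A rough sketch of incremental loading with a persistent cache follows. The file format (JSON), the function names, and the batch_size parameter are assumptions made for illustration. Zenoss does not prescribe them.

```python
import json
import os


def load_cached_configs(path):
    """Return the configs saved by a previous run, or an empty dict."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def save_cached_configs(path, configs):
    """Write configs to a temporary file, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(configs, f)
    os.replace(tmp, path)


def load_incrementally(device_names, fetch_config, configs, batch_size=10):
    """Fetch configs in batches, yielding a snapshot after each batch.

    The caller can start collecting against each snapshot instead of
    waiting for the full configuration to arrive.
    """
    for i in range(0, len(device_names), batch_size):
        for name in device_names[i:i + batch_size]:
            configs[name] = fetch_config(name)
        yield dict(configs)
```

On start-up, a collector using this approach would seed configs from load_cached_configs(), begin collecting immediately, and refresh entries as each batch arrives.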
Many collectors benefit from "pre-failing" their devices. They get the list of devices presently marked down by the ping tester, and they skip those devices during collection. This avoids the long delays that collectors would otherwise incur waiting on devices that are simply unreachable.
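Pre-failing amounts to filtering the device list before a collection cycle starts, as in this minimal sketch. The function name and the way the down list is obtained are hypothetical. How a collector actually queries the ping status is outside the scope of the example.

```python
def devices_to_collect(all_devices, down_devices):
    """Return the devices to collect from, skipping those marked down."""
    down = set(down_devices)  # set lookup keeps the filter fast for long lists
    return [d for d in all_devices if d not in down]


if __name__ == "__main__":
    # "switch2" is assumed to have been reported down by the ping tester.
    print(devices_to_collect(["router1", "switch2", "server3"], ["switch2"]))
```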