Ganglia Gmetric

5.1. Ganglia Gmetric

Rocks uses the Ganglia system to monitor the cluster. We have defined several extension structures to Ganglia that allow you to easily add metrics and alarms to your cluster monitoring system.

A single value to be monitored is called a metric in the Ganglia system. Metrics are measured on nodes of the cluster, and Ganglia communicates them to the frontend. This section describes how to write your own metric monitors.

5.1.1. Greceptor Metric

To define a new metric to monitor, we write a Metric module for the Rocks greceptor daemon. Throughout the monitoring chapter, we use "fan speed" as an example. This metric could read and monitor the speed of a CPU or chassis fan in a node. When the fan rpm is low, it could signal a malfunction that needs your attention.

  1. To monitor a value, you must be able to read it programmatically. Write a script that returns the fan speed as a number. It can be written in any language, but should have the following behavior:

    # read-fan1-speed
    3000
    #

    The output should contain no units or any ancillary information, as they will be added later.

  2. The greceptor metric module that calls your script must be written in Python. The listing below gives an example:

    import os
    import gmon.events
    
    class FanSpeed(gmon.events.Metric):
    
         def getFrequency(self):
             """How often to read fan speed (in sec)."""
             return 30
    
         def name(self):
             return 'fan1-speed'
    
         def units(self):
             return 'rpm'
    
         def value(self):
             speed = os.popen('read-fan1-speed').readline()
             return int(speed)
    
    
    def initEvents():
         return FanSpeed

    The class is derived from the Rocks gmon.events.Metric class. This module calls your 'read-fan1-speed' script to get the actual value. You may obtain your metric value directly using Python code if you like, just return its value at the end of the value() method.

    Greceptor will call the name(), value() functions every 30 seconds or so. The actual frequency is uniformly random with an expectation of 30 seconds.

  3. Make an RPM that places this file in:

    /opt/ganglia/lib/python/gmon/metrics/fanspeed.py

    Any name ending in *.py will work. Add the RPM to your roll and reinstall your compute nodes. See the hpc-ganglia package in the HPC roll for an example.

  4. The metric value should be visible in the ganglia gmond output on the frontend, which can be inspected with:

    $ telnet localhost 8649

    If the metric value is an integer or float, a graph of its value over time is visible on the Ganglia web-frontend pages, in the Cluster or Host views.

5.1.2. Greceptor Listener

You may also define a greceptor module that listens for a particular Gmetric name on the Ganglia multicast channel. This is called a listener. Rocks uses listeners to implement 411, and setup the MPD job launcher ring.

We will use our fan-speed example. This process installs a listener for the metric we just defined.

  1. Derive the listener from the Rocks class gmon.events.Listener. The name() method defines what metric name to listen for, it must match exactly. In this case we specify the 'fan1-speed' name from the previous example.

    import gmon.events
    
    class FanSpeed(gmon.events.Listener):
    	"""Does something when we hear a fan-speed metric."""
    
    	fanthresh = 500
    
    	def name(self):
    		"""The metric name we're interested in."""
    
    		return "fan1-speed"
    
    
    	def run(self, metric):
    		"""Called every time a metric with our name passes on the 
    		Ganglia multicast network in the cluster."""
    
    		fanspeed = float(metric["VAL"])
    
    		if fanspeed < self.fanthresh:
    			self.getWorried()
    
    
    	def getWorried(self):
    		"""Does something appropriate when the fan speed is too low."""
    
    		# Some action
    		pass
    
    
    def initEvents():
    	return FanSpeed

    Greceptor will call the run() method whenever a 'fan1-speed' metric passes on the Ganglia multicast channel. The metric argument to run() is a Python dictionary keyed with the attribute names from a <METRIC> Ganglia metric. The source of the metric is given by metric["IP"].

  2. Make an RPM that places this file in:

    /opt/ganglia/lib/python/gmon/listeners/fanspeed.py

    Any name ending in *.py will work. Add the RPM to your roll and reinstall your compute nodes. See the hpc-ganglia package in the HPC roll for an example.

In the next section, we show how to make an alarm for a metric value using the Ganglia News construct.