2.4. What to Monitor?

As stated earlier, the resources present in every system are CPU power, bandwidth, memory, and storage. At first glance, it would seem that monitoring need only consist of examining these four things.

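Unfortunately, it is not that simple. For example, consider a disk drive. What things might you want to know about its performance?

  • How much free space is available?

  • How many I/O operations does it perform each second?

  • How many of those I/O operations are reads? How many are writes?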

There are more ways of studying disk drive performance; these points have only scratched the surface. The main concept to keep in mind is that there are many different types of data for each resource.

The following sections explore the types of utilization information that would be helpful for each of the major resource types.

2.4.1. Monitoring CPU Power

In its most basic form, monitoring CPU power can be no more difficult than determining if CPU utilization ever reaches 100%. If CPU utilization stays below 100%, no matter what the system is doing, there is additional processing power available for more work.

However, it is a rare system that does not reach 100% CPU utilization at least some of the time. At that point it is important to examine more detailed CPU utilization data. By doing so, it becomes possible to start determining where the majority of your processing power is being consumed. Here are some of the more popular CPU utilization statistics:

User Versus System

The percentage of time spent performing user-level processing versus system-level processing can point out whether a system's load is primarily due to running applications or due to operating system overhead. High user-level percentages tend to be good (assuming users are not experiencing unsatisfactory performance), while high system-level percentages tend to point toward problems that will require further investigation.
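The user/system split described above can be sketched in a few lines. This is a minimal illustration that assumes a Linux system, where the first line of /proc/stat holds cumulative tick counters (user, nice, system, idle, iowait, irq, softirq, steal); the function names are hypothetical and not part of any standard tool.

```python
def cpu_ticks(stat_line):
    """Return the first eight tick counters from the aggregate 'cpu' line.

    Order: user, nice, system, idle, iowait, irq, softirq, steal.
    Older kernels report fewer fields, so missing ones count as zero.
    """
    fields = stat_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    ticks = [int(value) for value in fields[1:9]]
    return ticks + [0] * (8 - len(ticks))


def user_system_split(before_line, after_line):
    """Return (user %, system %) of the CPU time elapsed between two samples."""
    before = cpu_ticks(before_line)
    after = cpu_ticks(after_line)
    delta = [b - a for a, b in zip(before, after)]
    total = sum(delta)
    if total <= 0:
        return 0.0, 0.0
    user = delta[0] + delta[1]               # user + nice
    system = delta[2] + delta[5] + delta[6]  # system + irq + softirq
    return 100.0 * user / total, 100.0 * system / total
```

To use it, read the first line of /proc/stat twice, a few seconds apart, and pass both lines to user_system_split(). Counting interrupt handling as system time is a choice made for this sketch; tools may report it separately.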

Context Switches

A context switch happens when the CPU stops running one process and starts running another. Because each context switch requires the operating system to take control of the CPU, excessive context switches and high levels of system-level CPU consumption tend to go together.

Interrupts

As the name implies, interrupts are situations where the processing being performed by the CPU is abruptly changed. Interrupts generally occur due to hardware activity (such as an I/O device completing an I/O operation) or due to software (such as software interrupts that control application processing). Because interrupts must be serviced at a system level, high interrupt rates lead to higher system-level CPU consumption.
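Context switch and interrupt rates can be derived in the same way. The sketch below assumes a Linux system, where /proc/stat contains a "ctxt" line (total context switches since boot) and an "intr" line whose first number is the total interrupt count; the helper names are hypothetical.

```python
def read_counters(stat_text):
    """Extract the cumulative 'ctxt' and 'intr' totals from /proc/stat text."""
    counters = {}
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] in ("ctxt", "intr"):
            # For 'intr', the first number is the total across all sources.
            counters[fields[0]] = int(fields[1])
    return counters


def per_second(before_text, after_text, interval_seconds):
    """Return events per second for each counter present in both samples."""
    before = read_counters(before_text)
    after = read_counters(after_text)
    return {
        name: (after[name] - before[name]) / interval_seconds
        for name in after
        if name in before
    }
```

Because both counters only ever increase, a single reading says little; it is the difference between two readings, divided by the time between them, that shows the current rate.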

Runnable Processes

A process may be in different states. For example, it may be:

  • Waiting for an I/O operation to complete

  • Waiting for the memory management subsystem to handle a page fault

In these cases, the process has no need for the CPU.

However, eventually the process state changes, and the process becomes runnable. As the name implies, a runnable process is one that is capable of getting work done as soon as it is scheduled to receive CPU time. However, if more than one process is runnable at any given time, all but one[1] of the runnable processes must wait for their turn at the CPU. By monitoring the number of runnable processes, it is possible to determine how CPU-bound your system is.
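As a rough illustration of the idea above, the following sketch compares the number of runnable processes with the number of processors. It assumes a Linux system, where /proc/stat includes a "procs_running" line; the function names are hypothetical.

```python
def runnable_processes(stat_text):
    """Return the value of the 'procs_running' line from /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "procs_running":
            return int(fields[1])
    raise ValueError("no procs_running line found")


def waiting_for_cpu(runnable, cpu_count):
    """Return how many runnable processes cannot run right now.

    Each processor can run one process at a time, so anything beyond
    cpu_count is waiting for its turn.
    """
    return max(0, runnable - cpu_count)
```

On a single-processor system with five runnable processes, waiting_for_cpu(5, 1) reports that four of them are waiting. A value that stays well above zero over many samples suggests a CPU-bound system.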

Other performance metrics that reflect an impact on CPU utilization tend to include different services the operating system provides to processes. They may include statistics on memory management, I/O processing, and so on. These statistics also reveal that, when system performance is monitored, there are no boundaries between the different statistics. In other words, CPU utilization statistics may end up pointing to a problem in the I/O subsystem, or memory utilization statistics may reveal an application design flaw.

Therefore, when monitoring system performance, it is not possible to examine any one statistic in complete isolation; only by examining the overall picture is it possible to extract meaningful information from any performance statistics you gather.

2.4.2. Monitoring Bandwidth

Monitoring bandwidth is more difficult than monitoring the other resources described here. This is because performance statistics tend to be device-based, while most of the places where bandwidth is important tend to be the buses that connect devices. In those instances where more than one device shares a common bus, you might see reasonable statistics for each device, but the aggregate load those devices place on the bus would be much greater.
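The point about shared buses comes down to simple arithmetic, sketched below. The device names, transfer rates, and bus capacity are all hypothetical numbers chosen for illustration, not measurements of any real hardware.

```python
def bus_utilization(device_rates_mb_s, bus_capacity_mb_s):
    """Return (aggregate load in MB/s, fraction of bus capacity in use)."""
    total = sum(device_rates_mb_s.values())
    return total, total / bus_capacity_mb_s


# Hypothetical per-device throughput; each figure looks modest on its own.
devices = {"disk0": 40.0, "disk1": 35.0, "tape0": 30.0}

# Hypothetical bus able to carry 100 MB/s in total.
total, fraction = bus_utilization(devices, 100.0)
```

Here no single device comes close to 100 MB/s, yet together they ask for 105 MB/s, which is more than the bus in this example can deliver.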

Another challenge to monitoring bandwidth is that there can be circumstances where statistics for the devices themselves may not be available. This is particularly true for system expansion buses and datapaths[2]. However, even though 100% accurate bandwidth-related statistics may not always be available, there is often enough information to make some level of analysis possible, particularly when related statistics are taken into account.

Some of the more common bandwidth-related statistics are:

Bytes Received/Sent

Network interface statistics provide an indication of the bandwidth utilization of one of the more visible buses — the network.

Interface Counts and Rates

These network-related statistics can give indications of excessive collisions, transmit and receive errors, and more. Through the use of these statistics (particularly if the statistics are available for more than one system on your network), it is possible to perform a modicum of network troubleshooting even before the more common network diagnostic tools are used.
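The network statistics described above (bytes received and sent, errors, collisions) can be gathered with a short script. This sketch assumes a Linux system, where /proc/net/dev lists sixteen cumulative counters per interface (eight for receive, then eight for transmit); the function names are hypothetical.

```python
# Positions of the counters of interest within each interface's 16 values.
COUNTER_INDEX = {
    "rx_bytes": 0,
    "rx_errs": 2,
    "tx_bytes": 8,
    "tx_errs": 10,
    "collisions": 13,
}


def parse_net_dev(text):
    """Return {interface: {counter: value}} from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines():
        name, separator, counters = line.rpartition(":")
        if not separator:
            continue  # the two header lines contain no colon
        values = counters.split()
        if len(values) < 16:
            continue
        stats[name.strip()] = {
            key: int(values[index]) for key, index in COUNTER_INDEX.items()
        }
    return stats


def interface_rates(before_text, after_text, interval_seconds):
    """Return per-second rates for every interface present in both samples."""
    before = parse_net_dev(before_text)
    after = parse_net_dev(after_text)
    rates = {}
    for name, counters in after.items():
        if name in before:
            rates[name] = {
                key: (counters[key] - before[name][key]) / interval_seconds
                for key in counters
            }
    return rates
```

A steadily rising error or collision rate on one interface, compared against the same figures from other systems on the network, can help narrow down where a problem lies.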

Transfers per Second

Normally collected for block I/O devices, such as disk and high-performance tape drives, this statistic is a good way of determining whether a particular device's bandwidth limit is being reached. Due to their electromechanical nature, disk and tape drives can only perform so many I/O operations every second; their performance degrades rapidly as this limit is reached.
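Transfers per second can be estimated from two samples of a device's cumulative I/O counters. The sketch below assumes a Linux system, where each line of /proc/diskstats gives the device name in the third field, reads completed in the fourth, and writes completed in the eighth; the function names are hypothetical.

```python
def completed_transfers(diskstats_text):
    """Return {device: reads completed + writes completed} from diskstats text."""
    totals = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # not a full statistics line
        totals[fields[2]] = int(fields[3]) + int(fields[7])
    return totals


def transfers_per_second(before_text, after_text, interval_seconds):
    """Return {device: transfers per second} between two samples."""
    before = completed_transfers(before_text)
    after = completed_transfers(after_text)
    return {
        name: (after[name] - before[name]) / interval_seconds
        for name in after
        if name in before
    }
```

Comparing the result against the rate a given drive is known to sustain shows how close the device is to its limit.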

2.4.3. Monitoring Memory

If there is one area where a wealth of performance statistics can be found, it is memory utilization. Due to the inherent complexity of today's demand-paged virtual memory operating systems, memory utilization statistics are many and varied. It is here that the majority of a system administrator's work with resource management takes place.

The following statistics represent a cursory overview of commonly-found memory management statistics:

Page Ins/Page Outs

These statistics make it possible to gauge the flow of pages between system memory and attached mass storage devices (usually disk drives). High rates for both of these statistics can mean that the system is short of physical memory and is thrashing, that is, spending more system resources on moving pages into and out of memory than on actually running applications.

Active/Inactive Pages

These statistics show how heavily memory-resident pages are used. A lack of inactive pages can point toward a shortage of physical memory.

Free, Shared, Buffered, and Cached Pages

These statistics provide additional detail over the more simplistic active/inactive page statistics. By using these statistics, it is possible to determine the overall mix of memory utilization.
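The overall mix of memory utilization can be summarized as percentages of total memory. This sketch assumes a Linux system, where /proc/meminfo reports values such as MemTotal, MemFree, Buffers, Cached, Active, and Inactive in kilobytes rather than pages; the function names are hypothetical, and not every kernel provides every field.

```python
FIELDS = ("MemFree", "Shmem", "Buffers", "Cached", "Active", "Inactive")


def parse_meminfo(text):
    """Return {field name: value} from /proc/meminfo text (mostly in kB)."""
    values = {}
    for line in text.splitlines():
        key, separator, rest = line.partition(":")
        parts = rest.split()
        if separator and parts:
            values[key.strip()] = int(parts[0])
    return values


def memory_mix(meminfo, fields=FIELDS):
    """Return each available field as a percentage of MemTotal."""
    total = meminfo["MemTotal"]
    return {
        name: 100.0 * meminfo[name] / total
        for name in fields
        if name in meminfo
    }
```

The categories overlap (a cached page can also be active, for example), so the percentages are not expected to add up to 100.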

Swap Ins/Swap Outs

These statistics show the system's overall swapping behavior. Excessive rates here can point to physical memory shortages.
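Paging and swapping rates can be sampled together. The sketch below assumes a Linux system, where /proc/vmstat exposes the cumulative counters pgpgin, pgpgout, pswpin, and pswpout; the function names are hypothetical. As far as the author of this sketch is aware, the kernel reports the first two in kilobytes and the last two in pages, so the rates are best compared over time rather than against each other.

```python
COUNTERS = ("pgpgin", "pgpgout", "pswpin", "pswpout")


def parse_vmstat(text):
    """Return the paging and swapping counters found in /proc/vmstat text."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in COUNTERS:
            values[fields[0]] = int(fields[1])
    return values


def paging_rates(before_text, after_text, interval_seconds):
    """Return per-second rates for each counter present in both samples."""
    before = parse_vmstat(before_text)
    after = parse_vmstat(after_text)
    return {
        name: (after[name] - before[name]) / interval_seconds
        for name in after
        if name in before
    }
```

Sustained non-zero swap rates alongside high page-in and page-out rates are the pattern to watch for when a shortage of physical memory is suspected.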

Successfully monitoring memory utilization requires a good understanding of how demand-paged virtual memory operating systems work. While such a subject alone could take up an entire book, the basic concepts are discussed in Chapter 4, Physical and Virtual Memory. That chapter, along with time spent actually monitoring a system, gives you the necessary building blocks to learn more about this subject.

2.4.4. Monitoring Storage

Monitoring storage normally takes place at two different levels:

  • Monitoring for sufficient disk space

  • Monitoring for storage-related performance problems

The reason for this is that it is possible to have dire problems in one area and no problems whatsoever in the other. For example, it is possible to cause a disk drive to run out of disk space without once causing any kind of performance-related problems. Likewise, it is possible to have a disk drive that has 99% free space, yet is being pushed past its limits in terms of performance.

However, it is more likely that the average system experiences varying degrees of resource shortages in both areas. Because of this, it is also likely that, to some extent, problems in one area impact the other. Most often this interaction takes the form of steadily worsening I/O performance as a disk drive nears 0% free space. In cases of extreme I/O loads, however, it might also be possible to slow I/O throughput to such a level that applications no longer run properly.

In any case, the following statistics are useful for monitoring storage:

Free Space

Free space is probably the one resource all system administrators watch closely; it would be a rare administrator that never checks on free space (or has some automated way of doing so).
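An automated free space check can be very small. The sketch below uses shutil.disk_usage() from the Python standard library; the function names and the 10% threshold are arbitrary choices made for illustration.

```python
import shutil


def free_space_percent(path):
    """Return the free space on the file system holding path, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.free / usage.total


def low_on_space(path, threshold_percent=10.0):
    """Return True when free space has fallen below the given threshold."""
    return free_space_percent(path) < threshold_percent
```

Run periodically (from cron, for example), a check such as low_on_space("/home") could trigger a warning before the file system fills completely.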

File System-Related Statistics

These statistics (such as number of files/directories, average file size, etc.) provide additional detail over a single free space percentage. As such, these statistics make it possible for system administrators to configure the system to give the best performance, as the I/O load imposed by a file system full of many small files is not the same as that imposed by a file system filled with a single massive file.
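File counts and average file size can be collected by walking a directory tree. This is a minimal sketch using only the Python standard library; the function name is hypothetical, and on a large file system the walk itself generates a noticeable I/O load.

```python
import os


def file_statistics(root):
    """Return file count, directory count, and average file size under root."""
    file_count = 0
    dir_count = 0
    total_bytes = 0
    for dirpath, dirnames, filenames in os.walk(root):
        dir_count += len(dirnames)
        for name in filenames:
            try:
                # lstat() so that symbolic links are not followed.
                info = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # the file vanished or is unreadable; skip it
            file_count += 1
            total_bytes += info.st_size
    average = total_bytes / file_count if file_count else 0.0
    return {
        "files": file_count,
        "directories": dir_count,
        "average_file_size": average,
    }
```

A tree with a very high file count and a small average size points toward the many-small-files load described above, while a low count and a large average points toward the opposite case.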

Transfers per Second

This statistic is a good way of determining whether a particular device's bandwidth limitations are being reached.

Reads/Writes per Second

A slightly more detailed breakdown of transfers per second, these statistics allow the system administrator to more fully understand the nature of the I/O loads a storage device is experiencing. This can be critical, as some storage technologies have widely different performance characteristics for read versus write operations.
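The read/write breakdown can be sketched as follows. As with the earlier statistics, this assumes a Linux system whose /proc/diskstats lines carry reads completed in the fourth field and writes completed in the eighth; the function names are hypothetical.

```python
def read_write_counts(diskstats_text):
    """Return {device: (reads completed, writes completed)} from diskstats text."""
    counts = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # not a full statistics line
        counts[fields[2]] = (int(fields[3]), int(fields[7]))
    return counts


def read_write_rates(before_text, after_text, interval_seconds):
    """Return {device: (reads per second, writes per second)}."""
    before = read_write_counts(before_text)
    after = read_write_counts(after_text)
    rates = {}
    for name, (reads, writes) in after.items():
        if name in before:
            old_reads, old_writes = before[name]
            rates[name] = (
                (reads - old_reads) / interval_seconds,
                (writes - old_writes) / interval_seconds,
            )
    return rates
```

Keeping the two rates separate makes it easier to judge a device whose read and write performance differ widely.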

Notes

[1]

Assuming a single-processor computer system.

[2]

More information on buses, datapaths, and bandwidth is available in Chapter 3 Bandwidth and Processing Power.