SNMP Monitoring with Nagios

Using Nagios, you can monitor Dell servers with SNMP via Dell's server administration tools.

Nagios has been around since 2002 and is considered stable software. It is in use by the likes of American Public Media, JP Morgan Chase and Yahoo, just to name a few. It is an enterprise-level network and systems-monitoring platform. Nagios performs checks of services and hosts using external programs called Nagios plugins.

SNMP (Simple Network Management Protocol) is a network protocol designed for monitoring network-attached devices. Each item that can be monitored is identified by an OID (Object IDentifier), and the OIDs a device supports are described in MIBs (Management Information Bases). The design is extensible, so vendors can define their own items to be monitored.
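To make the tree structure of OIDs concrete, here is a small Python sketch (not part of the original plugin) that splits a dotted OID into its integer arcs and checks whether it falls under a given subtree. The helper names parse_oid and is_under are hypothetical; 1.3.6.1.4.1 is the standard private enterprises subtree, and 674 is the enterprise number under which Dell publishes its OIDs.

```python
def parse_oid(oid: str) -> tuple[int, ...]:
    """Split a dotted OID string into a tuple of integer arcs."""
    return tuple(int(part) for part in oid.strip(".").split("."))


def is_under(oid: str, prefix: str) -> bool:
    """Return True if `oid` lies within the subtree rooted at `prefix`."""
    arcs, root = parse_oid(oid), parse_oid(prefix)
    return arcs[:len(root)] == root


ENTERPRISES = "1.3.6.1.4.1"      # iso.org.dod.internet.private.enterprises
DELL = ENTERPRISES + ".674"      # vendor subtree used by Dell's OIDs
```

With these definitions, is_under(".1.3.6.1.4.1.674.10892.1.200.10.1.9.1", DELL) evaluates to True, while a standard MIB-2 OID such as 1.3.6.1.2.1.1.1.0 falls outside the vendor subtree.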

OpenManage is provided with Dell servers and is an extremely well-documented system (see Resources) that provides extensive server administration capabilities. OpenManage works with both Linux and Windows. The OpenManage “SNMP Reference Guide” (see Resources) is a 732-page document that is “intended for system administrators, network administrators and anyone who wants to write SNMP MIB applications to monitor systems”. The “SNMP Reference Guide” documents the SNMP OIDs/MIBs for monitoring Dell's servers.

The system described here was implemented for a local utility company when it upgraded to Dell PowerEdge servers. As often is the case, out of the box, Nagios didn't do exactly what the company needed, but being an open-source project, it easily was extended to accomplish the goal. All we needed was a Nagios plugin to monitor the new servers.

Don't Re-invent the Wheel

The first thing I set out to do was find an existing Nagios plugin that offered functionality similar to what we needed. Quite a number of existing plugins are available. In less than one hour, I found check_snmp_temperature.pl by William Leibzon. This is a plugin module that monitors the temperature of various devices remotely via SNMP. Although monitoring temperatures was not our goal, retrieving information via SNMP and reporting it to Nagios was. The module is written in Perl, and after reading it over, I found it to be very well written.

Chapter 4 of Dell's “SNMP Reference Guide” is the “System State Group”. It states:

The Management Information Base (MIB) variables presented in this section enable you to track various attributes that describe the state of the critical components supported by your system. Components monitored under the System State Group include power supplies, AC power cords, AC power switches, and cooling devices, as well as temperature, fan, amperage, and voltage probes.

The associated OIDs provide the overall state of all the critical subsystems that we were interested in. OIDs exist that provide much greater detail, but in this situation, the requirement was to be alerted only if a server had a problem and to indicate the particular subsystem that had the problem. One subsystem was not addressed in the “System State Group” chapter—the RAID subsystem. There is, however, an OID for monitoring it. This OID is described in Chapter 23, the “Storage Management Group”.

As stated earlier, these OIDs identify particular MIB objects that can be queried via SNMP. On the Dell server, there is an SNMP server (usually called an agent) running. The SNMP server answers queries that are in the form of a long string of numbers (the OID). This string of numbers is understood by the SNMP server to be a specific question. For instance, if you want to ask the SNMP server “How are your power supplies?”, you would send it the OID .1.3.6.1.4.1.674.10892.1.200.10.1.9.1 (Figure 1). The SNMP server will respond with 3 if the power supplies are okay.

Figure 1. Sample SNMP Query
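As a rough illustration of such a query, the following Python sketch shells out to the Net-SNMP snmpget command-line tool. It assumes that Net-SNMP is installed on the monitoring host and that the Dell server answers SNMP v2c requests; the function names, and any host name or community string you pass in, are hypothetical and do not come from the original plugin.

```python
import subprocess


def build_snmpget_command(host: str, community: str, oid: str) -> list[str]:
    """Build an snmpget command line that prints only the numeric value."""
    # -v 2c : use SNMP version 2c (an assumption about the agent's setup)
    # -c    : community string
    # -Oqve : quick print, value only, enumerations shown as numbers
    return ["snmpget", "-v", "2c", "-c", community, "-Oqve", host, oid]


def query_status(host: str, community: str, oid: str) -> int:
    """Run snmpget and return the integer the SNMP server answers with."""
    result = subprocess.run(
        build_snmpget_command(host, community, oid),
        capture_output=True,
        text=True,
        check=True,    # raise CalledProcessError if snmpget fails
        timeout=10,    # do not hang forever on an unreachable host
    )
    return int(result.stdout.strip())
```

For example, query_status("dell-server.example.com", "public", ".1.3.6.1.4.1.674.10892.1.200.10.1.9.1") would ask the (hypothetical) host about its power supplies; per the text above, a healthy server should answer 3.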

Table 1 shows the OIDs we are interested in.

Table 1. OIDs

Name:         systemStateChassisStatus
Object ID:    1.3.6.1.4.1.674.10892.1.200.10.1.4
Description:  Defines the system status of this chassis.

Name:         systemStatePowerSupplyStatusCombined
Object ID:    1.3.6.1.4.1.674.10892.1.200.10.1.9
Description:  Defines the status of all power supplies in this chassis.

Name:         systemStateVoltageStatusCombined
Object ID:    1.3.6.1.4.1.674.10892.1.200.10.1.12
Description:  Defines the status of all voltage probes in this chassis.

Name:         systemStateCoolingDeviceStatusCombined
Object ID:    1.3.6.1.4.1.674.10892.1.200.10.1.21
Description:  Defines the cooling device status of all cooling devices in this chassis. The result is returned as a combined status value. The value has the same definition type as DellStatus.

Name:         systemStateTemperatureStatusCombined
Object ID:    1.3.6.1.4.1.674.10892.1.200.10.1.24
Description:  Defines the status of all temperature probes in this chassis. The result is returned as a combined status value. The value has the same definition type as DellStatus.

Name:         systemStateMemoryDeviceStatusCombined
Object ID:    1.3.6.1.4.1.674.10892.1.200.10.1.27
Description:  Defines the status of all memory devices in this chassis.

Name:         systemStateChassisIntrusionStatusCombined
Object ID:    1.3.6.1.4.1.674.10892.1.200.10.1.30
Description:  Defines the intrusion status of all intrusion-detection devices in this chassis. The result is returned as a combined status value. The value has the same definition type as DellStatus.

Name:         systemStateEventLogStatus
Object ID:    1.3.6.1.4.1.674.10892.1.200.10.1.41
Description:  Defines the overall status of this chassis (ESM) event log.

Name:         agentGlobalSystemStatus
Object ID:    1.3.6.1.4.1.674.10893.1.20.110.13
Description:  Global health information for the subsystem managed by the Storage Management software. This global status should be used by applications other than HP OpenView. HP OpenView should refer to the globalStatus in the root level object group. This is a rollup for the entire agent including any monitored devices. The status is intended to give initiative to an SNMP monitor to get further data when this status is abnormal.

One of the benefits of choosing these particular OIDs turned out to be that they all respond in the same format. Dell refers to this format as DellStatus, and it maps integers to subsystem states:

Variable Name:           DellStatus
Data Type:               Integer
Possible Data Values     Meaning of Data Value:
  other(1)               The object's status is not
                            one of the following:
  unknown(2)             The object's status is unknown.
  ok(3)                  The object's status is OK.
  nonCritical(4)         The object's status is warning, noncritical.
  critical(5)            The object's status is critical (failure).
  nonRecoverable(6)      The object's status is nonrecoverable (dead).
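A Nagios plugin reports its result through its exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. The Python sketch below shows one plausible way to translate a DellStatus value into such a code. The mapping is an assumption made for illustration (in particular, treating other and unknown as UNKNOWN) and is not necessarily the one the actual plugin uses.

```python
from enum import IntEnum


class DellStatus(IntEnum):
    """Integer values defined by the DellStatus type listed above."""
    OTHER = 1
    UNKNOWN = 2
    OK = 3
    NON_CRITICAL = 4
    CRITICAL = 5
    NON_RECOVERABLE = 6


# Standard Nagios plugin exit codes.
NAGIOS_OK, NAGIOS_WARNING, NAGIOS_CRITICAL, NAGIOS_UNKNOWN = 0, 1, 2, 3

# Assumed mapping, for illustration only.
DELL_TO_NAGIOS = {
    DellStatus.OTHER: NAGIOS_UNKNOWN,
    DellStatus.UNKNOWN: NAGIOS_UNKNOWN,
    DellStatus.OK: NAGIOS_OK,
    DellStatus.NON_CRITICAL: NAGIOS_WARNING,
    DellStatus.CRITICAL: NAGIOS_CRITICAL,
    DellStatus.NON_RECOVERABLE: NAGIOS_CRITICAL,
}


def nagios_state(value: int) -> int:
    """Translate a raw DellStatus integer into a Nagios exit code."""
    try:
        return DELL_TO_NAGIOS[DellStatus(value)]
    except ValueError:
        # A value outside 1..6 is not a valid DellStatus.
        return NAGIOS_UNKNOWN
```

Because every OID in Table 1 answers in this one format, a single translation function like this can serve all of the subsystems.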

Now that we knew what we wanted to monitor, it was time to modify check_snmp_temperature.pl to do what was needed. The result, check_dell_openmanager.0.7-test.pl, is too long to print here, but it is available on the Linux Journal FTP site (see Resources).
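The Perl plugin itself is not reproduced here, but the core idea (alert only when something is wrong, and name the subsystem at fault) can be sketched in a few lines of Python. This is an illustrative sketch, not the author's code: the function name summarize, the subsystem labels, the DellStatus-to-Nagios mapping and the severity ordering are all assumptions.

```python
# Nagios exit codes and their conventional labels.
LABELS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

# Assumed DellStatus -> Nagios exit code mapping (illustration only).
STATE = {1: 3, 2: 3, 3: 0, 4: 1, 5: 2, 6: 2}

# Assumed ranking used to pick the worst state:
# OK < UNKNOWN < WARNING < CRITICAL.
SEVERITY = {0: 0, 3: 1, 1: 2, 2: 3}


def summarize(readings: dict[str, int]) -> tuple[int, str]:
    """Reduce per-subsystem DellStatus values to one Nagios result.

    `readings` maps a subsystem label to the integer returned for it.
    Returns (exit_code, one-line message naming any problem subsystems).
    """
    if not readings:
        return 3, "UNKNOWN - no readings"
    # Unrecognized values are treated as UNKNOWN.
    states = {name: STATE.get(value, 3) for name, value in readings.items()}
    worst = max(states.values(), key=SEVERITY.__getitem__)
    problems = [f"{name}={LABELS[s]}" for name, s in states.items() if s != 0]
    detail = ", ".join(problems) if problems else "all subsystems ok"
    return worst, f"{LABELS[worst]} - {detail}"
```

Calling summarize({"PowerSupply": 3, "Temperature": 4, "Memory": 3}) returns (1, "WARNING - Temperature=WARNING"). A real plugin would print the message and exit with the returned code so that Nagios can pick up both.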

______________________

Comments

why reinvent the wheel :-)

natxo asenjo

http://folk.uio.no/trondham/software/check_openmanage.html

check_openmanage does all you need and is (at this point) a better solution for Nagios.

Wheel wasn't reinvented

Trond H. Amundsen

I feel the need to comment on this. I am the author of the check_openmanage plugin. I know for a fact that Jason's plugin existed long before check_openmanage, so to be precise it was I who reinvented the wheel. Also, our two plugins are different in their focus, and I believe that both are needed. Users can choose whichever plugin they want, among these two and many others. Isn't open source great :)

Besides that, I really enjoyed Jason's article. It explains in a detailed and concise manner how one goes about to monitor something with SNMP, and how to integrate this with Nagios. This is universally useful to many out there.

Thanks for a great article, Jason!

-trond

MonitorSNMP is a free

Anonymous

MonitorSNMP is a free monitoring service, basic but provides notification based on rules. Easy to setup and use. Take a look
