SNMP Monitoring with Nagios
Nagios has been around since 2002 and is considered stable software. It is in use by the likes of American Public Media, JP Morgan Chase and Yahoo, just to name a few. It is an enterprise-level network and systems-monitoring platform. Nagios performs checks of services and hosts using external programs called Nagios plugins.
SNMP (Simple Network Management Protocol) is a network protocol designed for monitoring network-attached devices. Each item that can be monitored is identified by an OID (Object IDentifier), and the set of OIDs a device supports is defined in MIBs (Management Information Bases). The design is extensible, so vendors can define their own items to be monitored.
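Vendor extensions live under the "enterprises" branch of the OID tree, 1.3.6.1.4.1, where each vendor is assigned its own number (Dell's is 674). The short Python sketch below is only an illustration and is not part of the plugin described in this article; the function names are made up for the example. It shows how an OID can be split into its numeric parts and checked for a vendor number.

```python
# Prefix shared by all vendor-defined OIDs:
# iso.org.dod.internet.private.enterprises
ENTERPRISES = (1, 3, 6, 1, 4, 1)


def parse_oid(text):
    """Turn a dotted OID string (with or without a leading dot) into a tuple."""
    return tuple(int(part) for part in text.strip(".").split("."))


def enterprise_number(oid):
    """Return the vendor number if oid is under the enterprises branch, else None."""
    prefix_len = len(ENTERPRISES)
    if len(oid) > prefix_len and oid[:prefix_len] == ENTERPRISES:
        return oid[prefix_len]
    return None
```

For a Dell OID such as .1.3.6.1.4.1.674.10892.1, `enterprise_number(parse_oid(...))` returns 674, while a standard OID such as 1.3.6.1.2.1.1.1.0 (sysDescr) returns None.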
OpenManage is provided with Dell servers and is an extremely well-documented system (see Resources) that provides extensive server administration capabilities. OpenManage works with both Linux and Windows. The OpenManage “SNMP Reference Guide” (see Resources) is a 732-page document that is “intended for system administrators, network administrators and anyone who wants to write SNMP MIB applications to monitor systems”. The “SNMP Reference Guide” documents the SNMP OIDs/MIBs for monitoring Dell's servers.
The system described here was implemented for a local utility company when it upgraded to Dell PowerEdge servers. As is often the case, Nagios didn't do exactly what the company needed out of the box, but being an open-source project, it was easily extended to accomplish the goal. All we needed was a Nagios plugin to monitor the new servers.
The first thing I set out to do was find an existing Nagios plugin that offered functionality similar to what we needed. Quite a number of existing plugins are available. In less than one hour, I found check_snmp_temperature.pl by William Leibzon. This is a plugin module that monitors the temperature of various devices remotely via SNMP. Although monitoring temperatures was not our goal, retrieving information via SNMP and reporting it to Nagios was. The module is written in Perl, and after reading it over, I found it very well written.
Chapter 4 of Dell's “SNMP Reference Guide” is the “System State Group”. It states:
The Management Information Base (MIB) variables presented in this section enable you to track various attributes that describe the state of the critical components supported by your system. Components monitored under the System State Group include power supplies, AC power cords, AC power switches, and cooling devices, as well as temperature, fan, amperage, and voltage probes.
The associated OIDs provide the overall state of all the critical subsystems that we were interested in. OIDs exist that provide much greater detail, but in this situation, the requirement was to be alerted only if a server had a problem and to indicate the particular subsystem that had the problem. One subsystem was not addressed in the “System State Group” chapter—the RAID subsystem. There is, however, an OID for monitoring it. This OID is described in Chapter 23, the “Storage Management Group”.
As stated earlier, each OID identifies a particular MIB object that can be queried via SNMP. On the Dell server, there is an SNMP server running. The SNMP server answers queries that are in the form of a long string of numbers (the OID). This string of numbers is understood by the SNMP server to be a specific question. For instance, if you want to ask the SNMP server “How are your power supplies?”, you would send it the OID .1.3.6.1.4.1.674.10892.1.200.10.1.9.1 (Figure 1). The SNMP server will respond with 3 if the power supplies are okay.
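A query like this can be tried from any machine that has the Net-SNMP command-line tools installed. The Python sketch below is an illustration rather than code from the plugin; the community string `public`, SNMP version 2c and the helper names are all assumptions made for the example. It builds an `snmpget` command for the power-supply OID (written here as .1.3.6.1.4.1.674.10892.1.200.10.1.9.1) and parses the integer reply.

```python
import subprocess

# Power-supply status OID; the final .1 is the table index (the first chassis).
POWER_SUPPLY_OID = ".1.3.6.1.4.1.674.10892.1.200.10.1.9.1"


def build_snmpget_command(host, oid, community="public"):
    """Return the argument list for a Net-SNMP snmpget call.

    SNMP version 2c and the community string are assumptions for this sketch.
    -Oqve asks for the bare value, with enumerations printed as numbers.
    """
    return ["snmpget", "-v", "2c", "-c", community, "-Oqve", host, oid]


def parse_status(output):
    """Convert snmpget's one-line reply (for example "3") to an int."""
    return int(output.strip())


def query_status(host, oid, community="public", timeout=10):
    """Run snmpget and return the integer status (requires Net-SNMP)."""
    result = subprocess.run(
        build_snmpget_command(host, oid, community),
        capture_output=True, text=True, check=True, timeout=timeout,
    )
    return parse_status(result.stdout)
```

Calling `query_status("dellserver.example.com", POWER_SUPPLY_OID)`, where the host name is a placeholder, would be expected to return 3 on a server whose power supplies are okay.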
Table 1 shows the OIDs we are interested in.
Table 1. OIDs
| Name | OID | Description |
| systemStateChassisStatus | 1.3.6.1.4.1.674.10892.1.200.10.1.4 | Defines the system status of this chassis. |
| systemStatePowerSupplyStatusCombined | 1.3.6.1.4.1.674.10892.1.200.10.1.9 | Defines the status of all power supplies in this chassis. |
| systemStateVoltageStatusCombined | 1.3.6.1.4.1.674.10892.1.200.10.1.12 | Defines the status of all voltage probes in this chassis. |
| systemStateCoolingDeviceStatusCombined | 1.3.6.1.4.1.674.10892.1.200.10.1.21 | Defines the cooling device status of all cooling devices in this chassis. The result is returned as a combined status value. The value has the same definition type as DellStatus. |
| systemStateTemperatureStatusCombined | 1.3.6.1.4.1.674.10892.1.200.10.1.24 | Defines the status of all temperature probes in this chassis. The result is returned as a combined status value. The value has the same definition type as DellStatus. |
| systemStateMemoryDeviceStatusCombined | 1.3.6.1.4.1.674.10892.1.200.10.1.27 | Defines the status of all memory devices in this chassis. |
| systemStateChassisIntrusionStatusCombined | 1.3.6.1.4.1.674.10892.1.200.10.1.30 | Defines the intrusion status of all intrusion-detection devices in this chassis. The result is returned as a combined status value. The value has the same definition type as DellStatus. |
| systemStateEventLogStatus | 1.3.6.1.4.1.674.10892.1.200.10.1.41 | Defines the overall status of this chassis (ESM) event log. |
| agentGlobalSystemStatus | 1.3.6.1.4.1.674.10893.1.20.110.13 | Global health information for the subsystem managed by the Storage Management software. This global status should be used by applications other than HP OpenView. HP OpenView should refer to the globalStatus in the root level object group. This is a rollup for the entire agent including any monitored devices. The status is intended to give initiative to an SNMP monitor to get further data when this status is abnormal. |
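In a plugin, a table like this usually ends up as a lookup structure. The Python sketch below is one possible layout and not the one used in the plugin; the names `SYSTEM_STATE_BASE`, `SYSTEM_STATE_COLUMNS` and `instance_oid` are hypothetical. It assumes the system state OIDs share the prefix 1.3.6.1.4.1.674.10892.1.200.10.1 and that a chassis index (1 for the first chassis) is appended to form the OID that is actually queried.

```python
# Shared prefix of the system state OIDs listed in Table 1.
SYSTEM_STATE_BASE = "1.3.6.1.4.1.674.10892.1.200.10.1"

# Final number of each system state OID from Table 1.
SYSTEM_STATE_COLUMNS = {
    "systemStateChassisStatus": 4,
    "systemStatePowerSupplyStatusCombined": 9,
    "systemStateVoltageStatusCombined": 12,
    "systemStateCoolingDeviceStatusCombined": 21,
    "systemStateTemperatureStatusCombined": 24,
    "systemStateMemoryDeviceStatusCombined": 27,
    "systemStateChassisIntrusionStatusCombined": 30,
    "systemStateEventLogStatus": 41,
}

# Storage Management rollup status. Table 1 gives no instance suffix for it;
# SNMP scalar objects are conventionally queried with .0 appended.
AGENT_GLOBAL_SYSTEM_STATUS = "1.3.6.1.4.1.674.10893.1.20.110.13"


def instance_oid(name, chassis_index=1):
    """Build the full OID to query for one system state item."""
    return f"{SYSTEM_STATE_BASE}.{SYSTEM_STATE_COLUMNS[name]}.{chassis_index}"
```

With this layout, `instance_oid("systemStatePowerSupplyStatusCombined")` yields 1.3.6.1.4.1.674.10892.1.200.10.1.9.1, the power-supply query used as the example earlier.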
One of the benefits to choosing these particular OIDs turned out to be that they all respond in the same format. Dell refers to this format as DellStatus, and it maps integers to subsystem states:
Variable Name: DellStatus
Data Type: Integer
Possible Data Values and Meaning of Data Value:
- other(1): The object's status is not one of the following:
- unknown(2): The object's status is unknown.
- ok(3): The object's status is OK.
- nonCritical(4): The object's status is warning, noncritical.
- critical(5): The object's status is critical (failure).
- nonRecoverable(6): The object's status is nonrecoverable (dead).
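Because every OID answers with a DellStatus value, one small mapping can turn any of them into a Nagios result. Nagios plugins report their state through the exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) and a one-line message. The Python sketch below shows one plausible mapping and is not taken from the plugin: treating other(1) and unknown(2) as UNKNOWN, and ranking UNKNOWN below WARNING when combining results, are assumptions, as are all the names used.

```python
# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# DellStatus integer values and their names.
DELL_STATUS_NAMES = {
    1: "other", 2: "unknown", 3: "ok",
    4: "nonCritical", 5: "critical", 6: "nonRecoverable",
}

# Assumed mapping from DellStatus to Nagios state (not from the plugin).
DELL_TO_NAGIOS = {1: UNKNOWN, 2: UNKNOWN, 3: OK, 4: WARNING, 5: CRITICAL, 6: CRITICAL}

# Ranking used to pick the worst state; exit codes alone are not ordered by severity.
SEVERITY = {OK: 0, UNKNOWN: 1, WARNING: 2, CRITICAL: 3}


def summarize(readings):
    """Combine {subsystem: DellStatus} readings into (exit_code, message)."""
    worst = OK
    problems = []
    for subsystem, value in sorted(readings.items()):
        state = DELL_TO_NAGIOS.get(value, UNKNOWN)  # unexpected values count as UNKNOWN
        if state != OK:
            problems.append(f"{subsystem}={DELL_STATUS_NAMES.get(value, value)}")
        if SEVERITY[state] > SEVERITY[worst]:
            worst = state
    detail = ", ".join(problems) if problems else "all subsystems ok"
    return worst, f"{STATE_NAMES[worst]} - {detail}"
```

A plugin built this way would print the message and exit with the returned code, which meets the requirement of alerting only when a server has a problem and naming the subsystem involved.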
Now that we knew what we wanted to monitor, it was time to modify check_snmp_temperature.pl to do what was needed. The result, check_dell_openmanager.0.7-test.pl, is too long to print here, but it is available on the Linux Journal FTP site (see Resources).