SNMP Monitoring with Nagios
Nagios has been around since 2002 and is considered stable software. It is in use by the likes of American Public Media, JP Morgan Chase and Yahoo, just to name a few. It is an enterprise-level network and systems-monitoring platform. Nagios performs checks of services and hosts using external programs called Nagios plugins.
SNMP (Simple Network Management Protocol) is a network protocol designed for monitoring network-attached devices. It uses OIDs (Object Identifiers) to identify the individual items that can be monitored, and those items are defined in MIBs (Management Information Bases). The design is extensible, so vendors can define their own items to be monitored.
OpenManage is provided with Dell servers and is an extremely well-documented system (see Resources) that provides extensive server administration capabilities. OpenManage works with both Linux and Windows. The OpenManage “SNMP Reference Guide” (see Resources) is a 732-page document that is “intended for system administrators, network administrators and anyone who wants to write SNMP MIB applications to monitor systems”. The “SNMP Reference Guide” documents the SNMP OIDs/MIBs for monitoring Dell's servers.
The system described here was implemented for a local utility company when it upgraded to Dell PowerEdge servers. As often is the case, out of the box, Nagios didn't do exactly what the company needed, but being an open-source project, it easily was extended to accomplish the goal. All we needed was a Nagios plugin to monitor the new servers.
The first thing I set out to do was find an existing Nagios plugin that offered similar functionality to what we needed. Quite a number of existing plugins are available. In less than one hour, I found check_snmp_temperature.pl by William Leibzon. This is a plugin module that monitors the temperature of various devices remotely via SNMP. Although monitoring temperatures was not our goal, retrieving information via SNMP and reporting it to Nagios was. The module is written in Perl, and after reading it over, I found it to be very well written.
Chapter 4 of Dell's “SNMP Reference Guide” is the “System State Group”. It states:
The Management Information Base (MIB) variables presented in this section enable you to track various attributes that describe the state of the critical components supported by your system. Components monitored under the System State Group include power supplies, AC power cords, AC power switches, and cooling devices, as well as temperature, fan, amperage, and voltage probes.
The associated OIDs provide the overall state of all the critical subsystems that we were interested in. OIDs exist that provide much greater detail, but in this situation, the requirement was to be alerted only if a server had a problem and to indicate the particular subsystem that had the problem. One subsystem was not addressed in the “System State Group” chapter—the RAID subsystem. There is, however, an OID for monitoring it. This OID is described in Chapter 23, the “Storage Management Group”.
As stated earlier, each of these OIDs identifies a particular item, defined in a MIB, that can be queried via SNMP. On the Dell server, there is an SNMP server running. The SNMP server answers queries that are in the form of a long string of numbers (the OID). This string of numbers is understood by the SNMP server to be a specific question. For instance, if you want to ask the SNMP server “How are your power supplies?”, you would send it the OID .1.3.6.1.4.1.674.10892.1.200.10.1.9.1 (Figure 1). The SNMP server will respond with 3 if the power supplies are okay.
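As a rough illustration of that exchange, the sketch below builds a query for the Net-SNMP snmpget command-line tool and parses the reply. It is not taken from the plugin discussed in this article. The host name, the community string "public", the use of SNMP v2c, the helper names and the reading of the power-supply OID as .1.3.6.1.4.1.674.10892.1.200.10.1.9.1 are all assumptions made for the example.

```python
import re
import subprocess

# Power-supply status OID as read from the text (an assumption of this sketch).
POWER_SUPPLY_OID = ".1.3.6.1.4.1.674.10892.1.200.10.1.9.1"


def build_snmpget_command(host, oid, community="public"):
    """Return the argument list for a Net-SNMP snmpget query over SNMP v2c."""
    return ["snmpget", "-v", "2c", "-c", community, host, oid]


def parse_integer_reply(line):
    """Extract the integer from a reply such as '... = INTEGER: 3'.

    Also accepts the enumerated form '... = INTEGER: ok(3)' that snmpget
    prints when the matching MIB file is loaded.
    """
    match = re.search(r"=\s*INTEGER:\s*(?:[\w-]+\()?(-?\d+)", line)
    if match is None:
        raise ValueError(f"unexpected SNMP reply: {line!r}")
    return int(match.group(1))


def query_status(host, oid, community="public"):
    """Run snmpget against the host and return the integer it reports."""
    result = subprocess.run(
        build_snmpget_command(host, oid, community),
        capture_output=True, text=True, check=True, timeout=10,
    )
    return parse_integer_reply(result.stdout)
```

With a reachable server, query_status("dell-server", POWER_SUPPLY_OID) would return 3 when the power supplies are okay; "dell-server" is a placeholder host name.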
Table 1 shows the OIDs we are interested in.
Table 1. OIDs
|Name||OID||Description|
|systemStateChassisStatus||1.3.6.1.4.1.674.10892.1.200.10.1.4||Defines the system status of this chassis.|
|systemStatePowerSupplyStatusCombined||1.3.6.1.4.1.674.10892.1.200.10.1.9||Defines the status of all power supplies in this chassis.|
|systemStateVoltageStatusCombined||1.3.6.1.4.1.674.10892.1.200.10.1.12||Defines the status of all voltage probes in this chassis.|
|systemStateCoolingDeviceStatusCombined||1.3.6.1.4.1.674.10892.1.200.10.1.21||Defines the cooling device status of all cooling devices in this chassis. The result is returned as a combined status value. The value has the same definition type as DellStatus.|
|systemStateTemperatureStatusCombined||1.3.6.1.4.1.674.10892.1.200.10.1.24||Defines the status of all temperature probes in this chassis. The result is returned as a combined status value. The value has the same definition type as DellStatus.|
|systemStateMemoryDeviceStatusCombined||1.3.6.1.4.1.674.10892.1.200.10.1.27||Defines the status of all memory devices in this chassis.|
|systemStateChassisIntrusionStatusCombined||1.3.6.1.4.1.674.10892.1.200.10.1.30||Defines the intrusion status of all intrusion-detection devices in this chassis. The result is returned as a combined status value. The value has the same definition type as DellStatus.|
|systemStateEventLogStatus||1.3.6.1.4.1.674.10892.1.200.10.1.41||Defines the overall status of this chassis (ESM) event log.|
|agentGlobalSystemStatus||1.3.6.1.4.1.674.10893.1.20.110.13||Global health information for the subsystem managed by the Storage Management software. This global status should be used by applications other than HP OpenView. HP OpenView should refer to the globalStatus in the root level object group. This is a rollup for the entire agent including any monitored devices. The status is intended to give initiative to an SNMP monitor to get further data when this status is abnormal.|
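Note that the power-supply OID used in the earlier query example ends in one more number (.1) than the corresponding entry in Table 1. A minimal sketch of keeping the System State Group entries as data and deriving the OID to query might look like the following. It assumes those entries share the prefix .1.3.6.1.4.1.674.10892.1.200.10.1 and that the final number is a chassis index; the names SYSTEM_STATE_BASE, SYSTEM_STATE_COLUMNS and instance_oid are made up for the example. agentGlobalSystemStatus is left out because the text does not show how it is queried.

```python
# Assumed common prefix of the System State Group OIDs in Table 1.
SYSTEM_STATE_BASE = ".1.3.6.1.4.1.674.10892.1.200.10.1"

# Last number of each System State Group OID in Table 1, keyed by name.
SYSTEM_STATE_COLUMNS = {
    "systemStateChassisStatus": 4,
    "systemStatePowerSupplyStatusCombined": 9,
    "systemStateVoltageStatusCombined": 12,
    "systemStateCoolingDeviceStatusCombined": 21,
    "systemStateTemperatureStatusCombined": 24,
    "systemStateMemoryDeviceStatusCombined": 27,
    "systemStateChassisIntrusionStatusCombined": 30,
    "systemStateEventLogStatus": 41,
}


def instance_oid(name, chassis=1):
    """Return the OID to query for one Table 1 entry.

    The trailing number is assumed to select the chassis, as in the
    power-supply example, which ends in .9.1.
    """
    return f"{SYSTEM_STATE_BASE}.{SYSTEM_STATE_COLUMNS[name]}.{chassis}"


def all_instance_oids(chassis=1):
    """Map every subsystem name to the OID that would be queried for it."""
    return {name: instance_oid(name, chassis) for name in SYSTEM_STATE_COLUMNS}
```

Keeping the names alongside the numbers means a monitoring script can report the subsystem by name rather than by a string of digits.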
One of the benefits to choosing these particular OIDs turned out to be that they all respond in the same format. Dell refers to this format as DellStatus, and it maps integers to subsystem states:
Variable Name: DellStatus
Data Type: Integer
Possible Data Values and Meaning of Data Value:
other(1): The object's status is not one of the following.
unknown(2): The object's status is unknown.
ok(3): The object's status is OK.
nonCritical(4): The object's status is warning, noncritical.
critical(5): The object's status is critical (failure).
nonRecoverable(6): The object's status is nonrecoverable (dead).
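Because every OID answers in the DellStatus format, one mapping can translate any of them into a Nagios result. The sketch below shows one way this could be done; it is not the logic of the plugin described in this article. The exit codes 0 to 3 are the standard Nagios plugin return codes, while the mapping from DellStatus values to Nagios states, the severity ordering and the function name summarize are assumptions.

```python
from enum import IntEnum


class DellStatus(IntEnum):
    """Integer states from the DellStatus definition."""
    OTHER = 1
    UNKNOWN = 2
    OK = 3
    NON_CRITICAL = 4
    CRITICAL = 5
    NON_RECOVERABLE = 6


# Standard Nagios plugin exit codes.
NAGIOS_OK, NAGIOS_WARNING, NAGIOS_CRITICAL, NAGIOS_UNKNOWN = 0, 1, 2, 3
NAGIOS_LABELS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

# Assumed mapping; a real plugin may classify these states differently.
DELL_TO_NAGIOS = {
    DellStatus.OTHER: NAGIOS_UNKNOWN,
    DellStatus.UNKNOWN: NAGIOS_UNKNOWN,
    DellStatus.OK: NAGIOS_OK,
    DellStatus.NON_CRITICAL: NAGIOS_WARNING,
    DellStatus.CRITICAL: NAGIOS_CRITICAL,
    DellStatus.NON_RECOVERABLE: NAGIOS_CRITICAL,
}

# Assumed ordering used to pick the overall result: CRITICAL outranks all.
_SEVERITY = {NAGIOS_OK: 0, NAGIOS_UNKNOWN: 1, NAGIOS_WARNING: 2, NAGIOS_CRITICAL: 3}


def summarize(readings):
    """Turn {subsystem name: raw DellStatus integer} into (exit_code, message)."""
    worst = NAGIOS_OK
    problems = []
    for name, raw in sorted(readings.items()):
        try:
            status = DellStatus(raw)
            state, label = DELL_TO_NAGIOS[status], status.name
        except ValueError:
            # Value outside 1..6: report it rather than guess.
            state, label = NAGIOS_UNKNOWN, f"unexpected value {raw}"
        if state != NAGIOS_OK:
            problems.append(f"{name}: {label}")
        if _SEVERITY[state] > _SEVERITY[worst]:
            worst = state
    detail = "; ".join(problems) if problems else "all subsystems ok"
    return worst, f"{NAGIOS_LABELS[worst]} - {detail}"
```

A plugin built this way would print the message and exit with the returned code, which satisfies the requirement of alerting only on a problem and naming the subsystem responsible.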
Now that we knew what we wanted to monitor, it was time to modify check_snmp_temperature.pl to do what was needed. The result, check_dell_openmanager.0.7-test.pl, is too long to print here, but it is available on the Linux Journal FTP site (see Resources).