SNMP Monitoring with Nagios
Nagios has been around since 2002 and is considered stable software. It is in use by the likes of American Public Media, JP Morgan Chase and Yahoo, just to name a few. It is an enterprise-level network and systems-monitoring platform. Nagios performs checks of services and hosts using external programs called Nagios plugins.
SNMP (Simple Network Management Protocol) is a network protocol designed for monitoring network-attached devices. It uses OIDs (Object IDentifiers) for defining the information, known as MIBs (Management Information Base), that can be monitored. The design is extensible, so vendors can define their own items to be monitored.
OpenManage is provided with Dell servers and is an extremely well-documented system (see Resources) that provides extensive server administration capabilities. OpenManage works with both Linux and Windows. The OpenManage “SNMP Reference Guide” (see Resources) is a 732-page document that is “intended for system administrators, network administrators and anyone who wants to write SNMP MIB applications to monitor systems”. The “SNMP Reference Guide” documents the SNMP OIDs/MIBs for monitoring Dell's servers.
The system described here was implemented for a local utility company when it upgraded to Dell Power Edge servers. As often is the case, out of the box, Nagios didn't do exactly what the company needed, but being an open-source project, it easily was extended to accomplish the goal. All we needed was a Nagios plugin to monitor the new servers.
The first thing I set out to do was find an existing Nagios plugin that offered similar functionality to what we needed. Quite a number of existing plugins are available. In less than one hour, I found check_snmp_temperature.pl by William Leibzon. This is a plugin module that monitors the temperature of various devices remotely via SNMP. Although monitoring temperatures was not our goal, retrieving information via SNMP and reporting it to Nagios was. The module is written in Perl and after reading it over, it looked very well written.
Chapter 4 of the Dell's “SNMP Reference Guide” is the “System State Group”. It states:
The Management Information Base (MIB) variables presented in this section enable you to track various attributes that describe the state of the critical components supported by your system. Components monitored under the System State Group include power supplies, AC power cords, AC power switches, and cooling devices, as well as temperature, fan, amperage, and voltage probes.
The associated OIDs provide the overall state of all the critical subsystems that we were interested in. OIDs exist that provide much greater detail, but in this situation, the requirement was to be alerted only if a server had a problem and to indicate the particular subsystem that had the problem. One subsystem was not addressed in the “System State Group” chapter—the RAID subsystem. There is, however, an OID for monitoring it. This OID is described in Chapter 23, the “Storage Management Group”.
As stated earlier, these OIDs are used to define particular MIBs that can be queried via SNMP. On the Dell server, there is an SNMP server running. The SNMP server answers queries that are in the form of a long string of numbers (the OID). This string of numbers is understood by the SNMP server to be a specific question. For instance, if you want to ask the SNMP server “How are your power supplies?”, you would send it the OID .184.108.40.206.4.1.674.108220.127.116.11.1.9.1 (Figure 1). The SNMP server will respond with 3 if the power supplies are okay.
Table 1 shows the OIDs we are interested in.
Table 1. OIDs
|systemStateChassisStatus||18.104.22.168.4.1.674.10822.214.171.124.1.4||Defines the system status of this chassis.|
|systemStatePowerSupplyStatusCombined||126.96.36.199.4.1.674.108188.8.131.52.1.9||Defines the status of all power supplies in this chassis.|
|systemStateVoltageStatusCombined||184.108.40.206.4.1.674.108220.127.116.11.1.12||Defines the status of all voltage probes in this chassis.|
|systemStateCoolingDeviceStatusCombined||18.104.22.168.4.1.674.10822.214.171.124.1.21||Defines the cooling device status of all cooling devices in this chassis. The result is returned as a combined status value. The value has the same definition type as DellStatus.|
|systemStateTemperatureStatusCombined||126.96.36.199.4.1.674.108188.8.131.52.1.24||Defines the status of all temperature probes in this chassis. The result is returned as a combined status value. The value has the same definition type as DellStatus.|
|systemStateMemoryDeviceStatusCombined||184.108.40.206.4.1.674.108220.127.116.11.1.27||Defines the status of all memory devices in this chassis.|
|systemStateChassisIntrusionStatusCombined||18.104.22.168.4.1.674.10822.214.171.124.1.30||Defines the intrusion status of all intrusion-detection devices in this chassis. The result is returned as a combined status value. The value has the same definition type as DellStatus.|
|systemStateEventLogStatus||126.96.36.199.4.1.674.108188.8.131.52.1.41||Defines the overall status of this chassis (ESM) event log.|
|agentGlobalSystemStatus||184.108.40.206.4.1.674.108220.127.116.11.13||Global health information for the subsystem managed by the Storage Management software. This global status should be used by applications other than HP OpenView. HP OpenView should refer to the globalStatus in the root level object group. This is a rollup for the entire agent including any monitored devices. The status is intended to give initiative to an SNMP monitor to get further data when this status is abnormal.|
One of the benefits to choosing these particular OIDs turned out to be that they all respond in the same format. Dell refers to this format as DellStatus, and it maps integers to subsystem states:
Variable Name: DellStatus Data Type: Integer Possible Data Values Meaning of Data Value: other(1) The object's status is not one of the following: unknown(2) The object's status is unknown. ok(3) The object's status is OK. nonCritical(4) The object's status is warning, noncritical. critical(5) The object's status is critical (failure). nonRecoverable(6) The object's status is nonrecoverable (dead).
Now that we knew what we wanted to monitor, it was time to modify check_snmp_temperature.pl to do what was needed. The result, check_dell_openmanager.0.7-test.pl, is too long to print here, but it is available on the Linux Journal FTP site (see Resources).
Fast/Flexible Linux OS Recovery
On Demand Now
In this live one-hour webinar, learn how to enhance your existing backup strategies for complete disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible full-system recovery solution for UNIX and Linux systems.
Join Linux Journal's Shawn Powers and David Huffman, President/CEO, Storix, Inc.
Free to Linux Journal readers.Register Now!
- Google's Abacus Project: It's All about Trust
- Download "Linux Management with Red Hat Satellite: Measuring Business Impact and ROI"
- Back to Backups
- Secure Desktops with Qubes: Introduction
- Working with Command Arguments
- Secure Desktops with Qubes: Installation
- Fancy Tricks for Changing Numeric Base
- Linux Mint 18
- Seeing Red and Getting Sleep
- The Italian Army Switches to LibreOffice
Until recently, IBM’s Power Platform was looked upon as being the system that hosted IBM’s flavor of UNIX and proprietary operating system called IBM i. These servers often are found in medium-size businesses running ERP, CRM and financials for on-premise customers. By enabling the Power platform to run the Linux OS, IBM now has positioned Power to be the platform of choice for those already running Linux that are facing scalability issues, especially customers looking at analytics, big data or cloud computing.
￼Running Linux on IBM’s Power hardware offers some obvious benefits, including improved processing speed and memory bandwidth, inherent security, and simpler deployment and management. But if you look beyond the impressive architecture, you’ll also find an open ecosystem that has given rise to a strong, innovative community, as well as an inventory of system and network management applications that really help leverage the benefits offered by running Linux on Power.Get the Guide