Monitoring E-Mail with Nagios

Have you ever felt like you were being ignored? Have you ever felt like you were talking but no one was listening? Well, that's how it feels when your e-mail system is broken and you don't know it.

During the past week, I've had a couple of system problems that prevented people from receiving e-mail messages that my wife or I sent. The sad part was that we didn't know the messages weren't being delivered. We'd receive a message asking a question, and we'd reply to the sender thinking nothing of it. A few days later, we'd get a phone call from the person asking whether we ever were going to respond.

In our case, two situations were conspiring against us: a change in Comcast's firewall policy and a change in Yahoo's mail delivery policy.

It all began when my wife started complaining that something was wrong with the e-mail system because she'd not heard back from a friend to whom she had sent a message the previous day. I sent a quick e-mail to a friend of mine, got a response, informed my wife that “it worked for me,” and chalked it up to her friend not being responsive.

Then, just to demonstrate to her that the mail server was healthy, I asked the server to print out its mail queue. Crap! There were 55 messages in the queue waiting to be delivered. Of course, by this time, even I had noticed that the volume of incoming spam had gone down to none. So, Houston, we had a problem.

After several years of running my own mail server on my home machine connected to the Internet via Comcast, Comcast decided to implement a new firewall policy and started blocking incoming SMTP (tcp/25) connections on its residential users' networks. Of course, I wasn't informed of the change, because I don't use Comcast's e-mail system! Previously, we would send e-mail from our workstations, and our mail server would forward the message through Comcast's smarthost; incoming messages came directly to our server. This configuration had worked for years. But, with the new firewall policy, something broke. Some of our messages were being delivered, and some weren't. I'm speculating that the ones not delivered were going through servers that did sender address verification, and as they couldn't connect back to my mail server to validate my e-mail address, they refused delivery.

So, I decided to take the inexpensive way out. I could have spent an extra $20 a month and gotten a business account with Comcast, which I eventually did, but I didn't at first. I created a VPN tunnel from my home machine to one of my servers on the open Internet. Then, I moved my DNS pointers to point to that machine and had it forward incoming messages through the VPN. I configured my home server to use that machine as its smarthost rather than Comcast's server. Aside from the blatant violation of Comcast's Acceptable Use Policy, this seemed like it would work pretty well.

Then, the other shoe dropped.

My wife and I quickly realized that this was working much better, but it still wasn't quite right. People my wife e-mailed on a daily basis weren't receiving her messages. The common denominator was that all of these people were using Yahoo e-mail accounts. So, I manually forced delivery of one e-mail message and saw that Yahoo was deferring delivery due to questionable traffic patterns. And, that made sense; I was trying to deliver 55 deferred messages, probably all at once.

It's important to note that I monitor my e-mail server, and the Exim daemon never sent an alarm, so merely monitoring a service isn't enough. Instead of monitoring the service itself, it's better to monitor the server's function, which is what the rest of this article is about.

I was hesitant to write another article on Nagios, but e-mail is becoming more and more critical, and when it does break, it breaks in strange ways.

Of course, I monitor my Exim daemon as well as my server's route to the Internet. I use a Nagios service check for SMTP, like this:

define service {
        use generic-service
        name                    smtp
        host_name               host.example.com
        notification_options    w,c,r
        service_description     E-Mail SMTP Server
        check_command           check_smtp
}
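
For context, the check_command above has to resolve to a command definition. The following is only a sketch of what that definition commonly looks like; it assumes the stock check_smtp plugin is installed in the directory that $USER1$ points to, and the definition on your system may differ:

```
# Assumed definition, not taken from the setup described here
define command{
        command_name    check_smtp
        command_line    $USER1$/check_smtp -H $HOSTADDRESS$
}
```

A check like this confirms only that something answers on the SMTP port of the host. It says nothing about whether mail actually leaves the queue.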

I use a similar check to monitor my Internet gateway. But, as bad as the e-mail situation became, neither of these alarms would have indicated a problem. So, rather than monitoring to see whether a process is running, I set out to begin monitoring the server's critical functions, e-mail transport and delivery.

The first problem I wanted to address was being informed when messages were stuck in Exim's mail queue. I actually thought I'd have to write a custom program to check for this situation. While researching the situation further, I came across a posting from someone with a similar problem. It turns out that Nagios already has a command that performs this check, and I never knew it. Nagios's check commands are in /usr/nagios/libexec/, and let me tell you, there is a lot of gold in that directory.

So, I created an entry in Nagios's checkcommands.cfg file, like this:

define command{
        command_name    check_mailq
        command_line    $USER1$/check_mailq -w 3 -c 5 -v 9
}

Then, I created an entry in the services.cfg file that looked like this:

define service {
        use generic-service
        name                    mailq
        host_name               dominion
        notification_options    w,c,r
        service_description     SMTP Mail Queue
        check_command           check_mailq
}

Finally, I restarted Nagios and tested the new configuration by shutting down my server's outside network interface and attempting to send an e-mail message. Obviously, the mail transport operation failed and I got my alarm.
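
To make the -w 3 -c 5 thresholds concrete, here is a minimal Perl sketch of how a queue length can be mapped to a Nagios state. It assumes the thresholds are compared with "greater than or equal"; queue_status is a hypothetical helper written for illustration, not code taken from the check_mailq plugin:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map a queue length to a Nagios exit code and label.
# Nagios exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL.
sub queue_status {
    my ($count, $warn, $crit) = @_;
    return (2, "CRITICAL") if $count >= $crit;
    return (1, "WARNING")  if $count >= $warn;
    return (0, "OK");
}

# The 55 stuck messages from earlier, with the thresholds used above.
my $depth = 55;
my ($code, $label) = queue_status($depth, 3, 5);
print "$label: mailq depth is $depth (exit code $code)\n";
```

With thresholds this low, a queue of 55 messages lands well inside the critical range. A busier server would probably need higher values to avoid false alarms.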

So at this point, I am pretty sure that if I have another problem with my e-mail system, at least I'll know it in a timely fashion. But, I thought it would be good to put in one more check.

It would be nice to know if my server ever finds itself on a Real-time Blackhole List (RBL). Once again, Nagios has a command to check for this situation, but it comes in C source, which I couldn't get to compile. Anyway, I think I like my solution better.

My program looks up the server's IP address at http://www.anti-abuse.org, which, in turn, checks the IP address against several other RBLs at once. I'm probably going to configure Nagios to perform this check a few times a day, at most.

Here's the program:

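#!/usr/bin/perl

# Fetch the RBL results page. The host must be the server's public IP;
# a private (RFC 1918) address such as this one will never be listed.
open CMD, "wget -q 'http://www.anti-abuse.org/rblresults.php?host=192.168.1.1' -O - |";

my $error = 0;
while (<CMD>) {
        if (!/listed in /) { next; }            # skip lines that aren't results
        if (!/NOT listed in /) { $error++; }    # count each RBL that lists us
}
close CMD;

if (!$error) {
        print "OK\n";
        exit 0;
} else {
        print "CRITICAL: $error\n";
        exit 2;         # Nagios reads the exit code, not the text
}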

As you can see, it's not that complex. It simply sends a query to Anti-abuse.org and reads the results. I hard-coded my machine's IP address in this case, but it would be trivial to use one of Nagios's variables and send the IP address as a command-line parameter to this program. The program then makes sure that each of the results indicates that my machine is not listed on an RBL. Each result that shows a listing increments a counter, which the program checks at the end. Finally, I created a checkcommands.cfg and services.cfg entry just as I did above.
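
Those two entries might look something like the sketch below. Everything in it is an assumption made for illustration: the script name check_rbl.pl, the service description, and a version of the program that reads the IP address from its first command-line argument instead of hard-coding it. The interval assumes Nagios's default interval_length of 60 seconds, so 360 means one check every six hours:

```
# Hypothetical entries; adjust names and paths to your installation
define command{
        command_name    check_rbl
        command_line    $USER1$/check_rbl.pl $HOSTADDRESS$
}

define service {
        use generic-service
        host_name               dominion
        notification_options    w,c,r
        service_description     RBL Listing
        check_command           check_rbl
        normal_check_interval   360
}
```

Note that $HOSTADDRESS$ is whatever address Nagios has on file for the host. For this check to mean anything, that has to be the server's public IP address, not a private one.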

Now I find myself in the awkward predicament of having written a program that I can't test. In order to test this program fully, I'd have to get my server listed on an RBL, which I'm not about to do. Even so, I believe this program will work.
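
One way to test at least the matching logic without getting listed is to pull it into a function and feed it canned lines. This is only a sketch: count_listings is a hypothetical helper, and the sample lines are invented to match the program's two regular expressions, not copied from real Anti-abuse.org output:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count lines that report a listing, skipping "NOT listed in" lines.
# The line format is assumed from the regexes in the program above.
sub count_listings {
    my @lines = @_;
    my $count = 0;
    for my $line (@lines) {
        next unless $line =~ /listed in /;
        $count++ unless $line =~ /NOT listed in /;
    }
    return $count;
}

# Canned sample input (hypothetical RBL names and wording).
my @sample = (
    "192.0.2.10 is NOT listed in rbl-one.example\n",
    "192.0.2.10 is listed in rbl-two.example\n",
    "Some unrelated page text\n",
);

my $hits = count_listings(@sample);
print $hits ? "CRITICAL: $hits\n" : "OK\n";
```

This exercises only the parsing. Whether the live page still uses the same wording is a separate question that a canned test cannot answer.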

I don't know about you, but I live by e-mail, so my e-mail system simply has to work. The problems I had recently demonstrated that my monitoring policy wasn't sufficient. I believe that the new policy would have alerted me to the situation in a timely fashion. But, as is always the case, you can't test for everything, so I'm sure I'm missing something.

______________________

Mike Diehl is a freelance Computer Nerd specializing in Linux administration, programming, and VoIP. Mike lives in Albuquerque, NM, with his wife and 3 sons. He can be reached at mdiehl@diehlnet.com.

Comments

Blocked wget

Anonymous's picture

Here is what I had to change to get this to work. It seems anti-abuse is blocking wget, so I had to use the --user-agent option. Also, I only check for "is listed in" and return exit code 2 for Nagios critical. Replace x.x.x.x with a known good and a known bad IP for testing.

#!/usr/bin/perl -w
open (CMD, "wget -q --user-agent=\".\" 'http://www.anti-abuse.org/rblresults.php?host=x.x.x.x' -O - |");

my $error = 0;  # initialized so the OK message prints 0 without a warning
while (<CMD>) {
        if (m/is listed in /) { $error++; }
}

if ($error) {
        print "CRITICAL: $error\n";
        exit 2;
} else {
        print "OK: $error\n";
        exit 0;
}

*duh*

Anonymous's picture

Your IP will never be blocked because you're checking a non-routable address.... Read RFC1918.

Mail loop test

ludvigm's picture

Checking that the daemon is running and checking the queue size is nice, but there's more to a correctly running mail server. My Nagios is configured to check a "mail loop"; i.e., it sends an e-mail with a timestamp to its mail server and checks later via IMAP/SSL that it arrived in a mailbox. Such a mail loop exercises the whole mail system, including correctly configured DNS, working SMTP, local delivery (including LDAP in my case) and IMAP. Google for "mail loop nagios" for a load of scripts that can be used for this task. And yes, I get alerted via SMS if the loop breaks, using clickatell.com's HTTP interface for SMS alerts.

Umm, this subject is duplicated in Monitoring SMTP damn

Anonymous's picture

Damn this magazine has gone to shit; why don't you just post the Nagios Manual as an article? Damn, sad. Sad shit.

Very interesting

Anna (anonymously)'s picture

Why don't you just send a cc to yourself?

Very interesting, though.

Plugin sources & error in your script

Reto's picture

First, a pointer:

http://nagiosplugins.org/

Second, there's a problem with your RBL check script. You don't set the exit code for the CRIT case so perl will exit with "0" in either case. Nagios actually uses the exit code, not the text, so you'll not get an alert.

http://nagios.sourceforge.net/docs/3_0/pluginapi.html

Also ... while()

Bah. HTML ate my code (which

Reto's picture

Bah. HTML ate my code (which it probably did for the author as well):


while(<CMD>), not while()

but how?

Josh's picture

My only concern is: How do you send the alert out if your email server is down? SMS? And if so, through what conduit?

SMS

Thăng Phạm Duy's picture

You can send the alert via an SMS gateway. You can use Kannel software as the gateway and a Siemens M20 or Nokia Premicell device as the SMSC.

Sure you can test it - pick

Anonymous's picture

Sure you can test it - pick a banned ip and hard code that in!
