Monitoring E-Mail with Nagios

Have you ever felt like you were being ignored? Have you ever felt like you were talking but no one was listening? Well, that's how it feels when your e-mail system is broken and you don't know it.

During the past week, I've had a couple system problems that prevented people from receiving e-mail messages that my wife or I sent. The sad part was that we didn't know the messages weren't being delivered. We'd receive a message asking a question, and we'd reply to the sender thinking nothing of it. A few days later, we'd get a phone call from the person asking whether we ever were going to respond.

In our case, two situations were conspiring against us: a change in Comcast's firewall policy and a change in Yahoo's mail delivery policy.

It all began when my wife started complaining that something was wrong with the e-mail system because she'd not heard back from a friend whom she had sent a message the previous day. I sent a quick e-mail to a friend of mine, got a response, and informed my wife that “it worked for me,” and chalked it up to her friend not being responsive.

Then, just to demonstrate to her that the mail server was healthy, I asked the server to print out its mail queue. Crap! There were 55 messages in the queue waiting to be delivered. Of course, by this time, even I had noticed that the volume of incoming spam had gone down to none. So, Houston, we had a problem.

After several years of running my own mail server on my home machine connected to the Internet via Comcast, Comcast decided to implement a new firewall policy and started blocking incoming SMTP (tcp/25) connections on its residential users' networks. Of course, I wasn't informed of the change, because I don't use Comcast's e-mail system! Previously, we would send e-mail from our workstations, and our mail server would forward the message through Comcast's smarthost; incoming messages came directly to our server. This configuration had worked for years. But, with the new firewall policy, something broke. Some of our messages were being delivered, and some weren't. I'm speculating that the ones not delivered were going through servers that did sending address verification, and as they couldn't connect back to my mail server to validate my e-mail address, they refused delivery.

So, I decided to take the inexpensive way out. I could have spent an extra $20 a month and gotten a business account with Comcast, which I eventually did, but I didn't at first. I created a VPN tunnel from my home machine to one of my servers on the open Internet. Then, I moved my DNS pointers to point to that machine and had it forward incoming messages through the VPN. I configured my home server to use that machine as its smarthost rather than Comcast's server. Aside from the blatant violation of Comcast's Acceptable Use Policy, this seemed like it would work pretty well.

Then, the other shoe dropped.

My wife and I quickly realized that this was working much better, but it still wasn't quite right. People my wife emailed on a daily basis weren't receiving her messages. The common denominator was that all of these people were using Yahoo e-mail accounts. So, I manually forced delivery of one e-mail messages and saw that Yahoo was deferring delivery due to questionable traffic patterns. And, that made sense; I was trying to deliver 55 deferred messages, probably all at once.

It's important to note that I monitor my e-mail server, and the Exim daemon never sent an alarm, so merely monitoring a service isn't enough. Instead of monitoring the service itself, it's better to monitor the server's function, which is what the rest of this article is about.

I was hesitant to write another article on Nagios, but e-mail is becoming more and more critical, and when it does break, it breaks in strange ways.

Of course, I monitor my Exim daemon as well as my server's route to the Internet. I use a Nagios service check for SMTP, like this:

define service {
        use generic-service
        name                    smtp
        host_name               host.example.com
        notification_options    w,c,r
        service_description     E-Mail SMTP Server
        check_command           check_smtp
}

I use a similar check to monitor my Internet gateway. But, as bad as the e-mail situation became, neither of these alarms would have indicated a problem. So, rather than monitoring to see whether a process is running, I set out to begin monitoring the server's critical functions, e-mail transport and delivery.

The first problem I wanted to address was being informed when messages were stuck in Exim's mail queue. I actually thought I'd have to write a custom program to check for this situation. While researching the situation further, I came across a posting from someone with a similar problem. It turns out that Nagios already has a command that performs this check, and I never knew it. Nagios's check commands are in /usr/nagios/libexec/, and let me tell you, there is a lot of gold in that directory.

So, I created an entry in Nagios's checkcommands.cfg file, like this:

define command{
        command_name    check_mailq
        command_line    $USER1$/check_mailq -w 3 -c 5 -v 9
}

Then, I created an entry in the services.cfg file that looked like this:

define service {
        use generic-service
        name                    mailq
        host_name               dominion
        notification_options    w,c,r
        service_description     SMTP Mail Queue
        check_command           check_mailq
}

Finally, I restarted Nagios and tested the new configuration by shutting down my server's outside network interface and attempting to send an e-mail message. Obviously, the mail transport operation failed and I got my alarm.

So at this point, I am pretty sure that if I have another problem with my e-mail system, at least I'll know it in a timely fashion. But, I thought it would be good to put in one more check.

It would be nice to know if my server ever finds itself on a Real-time Blocking List (RBL). Once again, Nagios has a command to check for this situation, but it comes in C source, which I couldn't get to compile. Anyway, I think I like my solution better.

My program looks up the server's IP address at http://www.anti-abuse.org, which, in turn, checks the IP address against several other RBLs at once. I'm probably going to configure Nagios to perform this check a few times a day, at most.

Here's the program:

#!/usr/bin/perl

open CMD, "wget -q http://www.anti-abuse.org/rblresults.php?host=192.168.1.1 -O - |";

while () {
        if (!/listed in /) { next; }
        if (!/NOT listed in /) { $error++; }
}

if (!$error) {
        print "OK\n";
        exit 0;
} else {
        print "CRITICAL: $error\n";
}

As you can see, it's not that complex. It simply sends a query to Anti-abuse.org and looks for the results. I hard-coded my machine's IP address in this case, but it would be trivial to use one of Nagios' variables and send the IP address as a command-line parameter to this program. Then, the program makes sure that each of the results indicates that my machine is not listed on an RBL. If this check fails, we set a flag for later use. Finally, I created a checkcommand.cfg and services.cfg entry just as I did above.

Now I find myself in the awkward predicament of having written a program that I can't test. In order to test this program fully, I'd have to get my server on an RBL list, which I'm not about to do. Even so, I believe this program will work.

I don't know about you, but I live by e-mail, so my e-mail system simply has to work. The problems I had recently demonstrated that my monitoring policy wasn't sufficient. I believe that the new policy would have alerted me to the situation in a timely fashion. But, as is always the case, you can't test for everything, so I'm sure I'm missing something.

Load Disqus comments