Monitoring E-Mail with Nagios

April 30th, 2009 by Mike Diehl

Have you ever felt like you were being ignored? Have you ever felt like you were talking but no one was listening? Well, that's how it feels when your e-mail system is broken and you don't know it.

During the past week, I've had a couple of system problems that prevented people from receiving e-mail messages that my wife or I sent. The sad part was that we didn't know the messages weren't being delivered. We'd receive a message asking a question, and we'd reply to the sender thinking nothing of it. A few days later, we'd get a phone call from the person asking whether we ever were going to respond.

In our case, two situations were conspiring against us: a change in Comcast's firewall policy and a change in Yahoo's mail delivery policy.

It all began when my wife started complaining that something was wrong with the e-mail system because she'd not heard back from a friend to whom she had sent a message the previous day. I sent a quick e-mail to a friend of mine, got a response, informed my wife that “it worked for me,” and chalked it up to her friend not being responsive.

Then, just to demonstrate to her that the mail server was healthy, I asked the server to print out its mail queue. Crap! There were 55 messages in the queue waiting to be delivered. Of course, by this time, even I had noticed that the volume of incoming spam had gone down to none. So, Houston, we had a problem.

I had been running my own mail server for several years on a home machine connected to the Internet via Comcast when Comcast implemented a new firewall policy and started blocking incoming SMTP (tcp/25) connections on its residential users' networks. Of course, I wasn't informed of the change, because I don't use Comcast's e-mail system! Previously, we would send e-mail from our workstations, and our mail server would forward the message through Comcast's smarthost; incoming messages came directly to our server. This configuration had worked for years. But, with the new firewall policy, something broke. Some of our messages were being delivered, and some weren't. I'm speculating that the ones not delivered were going through servers that did sender address verification, and as they couldn't connect back to my mail server to validate my e-mail address, they refused delivery.

So, I decided to take the inexpensive way out. I could have spent an extra $20 a month and gotten a business account with Comcast, which I eventually did, but I didn't at first. I created a VPN tunnel from my home machine to one of my servers on the open Internet. Then, I moved my DNS pointers to point to that machine and had it forward incoming messages through the VPN. I configured my home server to use that machine as its smarthost rather than Comcast's server. Aside from the blatant violation of Comcast's Acceptable Use Policy, this seemed like it would work pretty well.

Then, the other shoe dropped.

My wife and I quickly realized that this was working much better, but it still wasn't quite right. People my wife e-mailed on a daily basis weren't receiving her messages. The common denominator was that all of these people were using Yahoo e-mail accounts. So, I manually forced delivery of one e-mail message and saw that Yahoo was deferring delivery due to questionable traffic patterns. And, that made sense; I was trying to deliver 55 deferred messages, probably all at once.

It's important to note that I monitor my e-mail server, and the Exim daemon never sent an alarm, so merely monitoring a service isn't enough. Instead of monitoring the service itself, it's better to monitor the server's function, which is what the rest of this article is about.

I was hesitant to write another article on Nagios, but e-mail is becoming more and more critical, and when it does break, it breaks in strange ways.

Of course, I monitor my Exim daemon as well as my server's route to the Internet. I use a Nagios service check for SMTP, like this:

define service {
        use                     generic-service
        name                    smtp
        host_name               host.example.com
        notification_options    w,c,r
        service_description     E-Mail SMTP Server
        check_command           check_smtp
}

I use a similar check to monitor my Internet gateway. But, as bad as the e-mail situation became, neither of these alarms would have indicated a problem. So, rather than monitoring to see whether a process is running, I set out to begin monitoring the server's critical functions, e-mail transport and delivery.
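To see why a check like this stays green while mail piles up, here is a rough Python sketch (not the article's Perl, and not the actual check_smtp source) of what a basic SMTP probe amounts to: connect to port 25 and look at the greeting. The function names are hypothetical, and the real plugin does more than this.

```python
import socket

# Nagios plugin exit codes used here.
OK, CRITICAL = 0, 2


def banner_status(banner: str) -> int:
    """Return OK if the greeting starts with SMTP reply code 220, else CRITICAL."""
    return OK if banner.startswith("220") else CRITICAL


def probe_smtp(host: str, port: int = 25, timeout: float = 10.0) -> int:
    """Connect to host:port and judge the server by its greeting banner alone."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            banner = conn.recv(1024).decode("ascii", errors="replace")
    except OSError:
        return CRITICAL  # refused, unreachable, or timed out
    return banner_status(banner)
```

A probe of this kind only shows that the daemon answers. It says nothing about whether queued messages ever leave the machine, which is the gap the rest of the article addresses.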

The first problem I wanted to address was being informed when messages were stuck in Exim's mail queue. I actually thought I'd have to write a custom program to check for this situation. While researching the situation further, I came across a posting from someone with a similar problem. It turns out that Nagios already has a command that performs this check, and I never knew it. Nagios's check commands are in /usr/nagios/libexec/, and let me tell you, there is a lot of gold in that directory.

So, I created an entry in Nagios's checkcommands.cfg file, like this:

# -w 3: warn once 3 messages are queued; -c 5: go critical at 5
define command{
        command_name    check_mailq
        command_line    $USER1$/check_mailq -w 3 -c 5 -v 9
}

Then, I created an entry in the services.cfg file that looked like this:

define service {
        use                     generic-service
        name                    mailq
        host_name               dominion
        notification_options    w,c,r
        service_description     SMTP Mail Queue
        check_command           check_mailq
}

Finally, I restarted Nagios and tested the new configuration by shutting down my server's outside network interface and attempting to send an e-mail message. Obviously, the mail transport operation failed and I got my alarm.
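The threshold logic behind a queue check is simple enough to sketch. The Python below is an illustration only, not the check_mailq source: it assumes the thresholds are inclusive, that the Exim binary is on the PATH under the name exim (some distributions call it exim4), and that the caller is allowed to list the queue. The function names are made up for this example.

```python
import subprocess

# Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def queue_status(count: int, warn: int = 3, crit: int = 5):
    """Map a queue length to a Nagios exit code and a one-line status text."""
    if count >= crit:
        return CRITICAL, f"CRITICAL: {count} messages queued"
    if count >= warn:
        return WARNING, f"WARNING: {count} messages queued"
    return OK, f"OK: {count} messages queued"


def exim_queue_length() -> int:
    """Ask Exim for its queue length; `exim -bpc` prints a single number."""
    result = subprocess.run(
        ["exim", "-bpc"], capture_output=True, text=True, check=True
    )
    return int(result.stdout.strip())
```

With these thresholds, the 55 messages stuck in my queue would have produced a CRITICAL result long before anyone picked up the phone.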

So at this point, I am pretty sure that if I have another problem with my e-mail system, at least I'll know it in a timely fashion. But, I thought it would be good to put in one more check.

It would be nice to know if my server ever finds itself on a Real-time Blackhole List (RBL). Once again, Nagios has a command to check for this situation, but it comes in C source, which I couldn't get to compile. Anyway, I think I like my solution better.

My program looks up the server's IP address at http://www.anti-abuse.org, which, in turn, checks the IP address against several other RBLs at once. I'm probably going to configure Nagios to perform this check a few times a day, at most.

Here's the program:

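#!/usr/bin/perl
use strict;
use warnings;

# Replace with the server's public IP address. A private (RFC 1918)
# address such as this placeholder will never appear on an RBL.
my $host  = "192.168.1.1";
my $error = 0;

# Fetch the results page; the URL is quoted so the shell leaves "?" alone.
open(CMD, "wget -q 'http://www.anti-abuse.org/rblresults.php?host=$host' -O - |")
        or do { print "UNKNOWN: cannot run wget\n"; exit 3; };

while (<CMD>) {
        if (!/listed in /) { next; }            # skip lines that carry no result
        if (!/NOT listed in /) { $error++; }    # count each RBL that lists the host
}
close CMD;

if (!$error) {
        print "OK\n";
        exit 0;
} else {
        print "CRITICAL: $error\n";
        exit 2;         # Nagios reads the exit code, not the text
}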

As you can see, it's not that complex. It simply sends a query to Anti-abuse.org and reads the results page line by line. I hard-coded my machine's IP address in this case, but it would be trivial to use one of Nagios's variables and send the IP address as a command-line parameter to this program. The program then makes sure that each result line indicates that my machine is not listed on an RBL. Every time a line says otherwise, the program increments a counter, which it uses at the end to decide which status to report. Finally, I created a checkcommands.cfg and services.cfg entry just as I did above.
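For comparison, here is a minimal Python sketch of the same check that takes the IP address as a command-line parameter. It mirrors the matching logic of the Perl program and assumes the results page still contains "listed in" and "NOT listed in" lines as described above. The script name check_rbl.py and the function names are hypothetical.

```python
import sys
import urllib.parse
import urllib.request

# Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def count_listings(page: str) -> int:
    """Count result lines that say 'listed in' without 'NOT listed in'."""
    listed = 0
    for line in page.splitlines():
        if "listed in " not in line:
            continue  # not a result line
        if "NOT listed in " not in line:
            listed += 1
    return listed


def main(argv) -> int:
    """Look up argv[1] and return the Nagios exit code for the result."""
    if len(argv) != 2:
        print("UNKNOWN: usage: check_rbl.py <ip-address>")
        return UNKNOWN
    url = ("http://www.anti-abuse.org/rblresults.php?"
           + urllib.parse.urlencode({"host": argv[1]}))
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            page = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:
        print(f"UNKNOWN: lookup failed: {exc}")
        return UNKNOWN
    listed = count_listings(page)
    if listed:
        print(f"CRITICAL: listed on {listed} RBL(s)")
        return CRITICAL
    print("OK: not listed")
    return OK
```

Ending the file with sys.exit(main(sys.argv)) would turn this into a plugin. The command definition could then pass Nagios's $HOSTADDRESS$ macro as the argument, provided the address Nagios knows for the host is its public one.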

Now I find myself in the awkward predicament of having written a program that I can't test. In order to test this program fully, I'd have to get my server listed on an RBL, which I'm not about to do. Even so, I believe this program will work.
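One possible way around the testing problem is a different technique from the Anti-abuse.org lookup used here: querying a DNS-based blocklist directly. By convention (RFC 5782), such lists carry a test entry for 127.0.0.2, so a check can be exercised against an address that is known to be listed. The sketch below is in Python, the zone name dnsbl.example.org is a placeholder, and the function names are hypothetical.

```python
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the name a DNSBL expects: the IPv4 octets reversed, then the zone."""
    octets = ip.split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError(f"not an IPv4 address: {ip!r}")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the zone answers with an address record for this IP."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False  # no such name (or a failed lookup) is treated as "not listed"
    return True
```

Against a real list, is_listed("127.0.0.2", zone) should come back true, which would exercise the CRITICAL path without putting the server's own address at risk.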

I don't know about you, but I live by e-mail, so my e-mail system simply has to work. My recent problems demonstrated that my monitoring policy wasn't sufficient. I believe that the new policy would have alerted me to the situation in a timely fashion. But, as is always the case, you can't test for everything, so I'm sure I'm missing something.

__________________________
Mike Diehl is a freelance Computer Nerd specializing in Linux administration, programming, and VoIP. Mike lives in Albuquerque, NM, with his wife and three sons. He can be reached at mdiehl@diehlnet.com.


Comments

*duh*

On June 1st, 2009 Anonymous (not verified) says:

Your IP will never be blocked because you're checking a non-routable address.... Read RFC1918.

Mail loop test

On May 15th, 2009 ludvigm says:

Checking that the daemon is running and checking the queue size is nice, but there's more to a correctly running mail server. My Nagios is configured to check a "mail loop", i.e., it sends an e-mail with a timestamp to its mail server and checks later on via IMAP/SSL that it arrived in a mailbox. Such a mail loop checks the mail system in its full complexity, including correctly configured DNS, working SMTP, local delivery (including LDAP in my case) and IMAP. Google for "mail loop nagios" for a load of scripts that can be used for this task. And yes, I get alerted via SMS if the loop breaks, using clickatell.com's HTTP interface for SMS alerts.
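The mail loop described in this comment could be sketched roughly as follows in Python. This is not ludvigm's script: the host names, mailbox credentials, and function names are placeholders, and it assumes the probe lands in INBOX and that the SMTP server accepts the message without authentication.

```python
import imaplib
import smtplib
import time
from email.message import EmailMessage


def make_probe(sender: str, recipient: str, now: float):
    """Build a probe message whose subject carries a timestamp token."""
    token = f"mail-loop-{int(now)}"
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = token
    msg.set_content("Nagios mail loop probe")
    return token, msg


def send_probe(smtp_host: str, sender: str, recipient: str) -> str:
    """Send the probe over SMTP and return the token to look for later."""
    token, msg = make_probe(sender, recipient, time.time())
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
    return token


def probe_arrived(imap_host: str, user: str, password: str, token: str) -> bool:
    """Return True if a message with the token in its subject is in INBOX."""
    with imaplib.IMAP4_SSL(imap_host) as imap:
        imap.login(user, password)
        imap.select("INBOX", readonly=True)
        status, data = imap.search(None, "SUBJECT", f'"{token}"')
        return status == "OK" and bool(data[0].split())
```

A check built on this would send the probe on one run and, a few minutes later, report CRITICAL if probe_arrived still returns false for that token.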

Umm, this subject is duplicated in Monitoring SMTP damn

On May 12th, 2009 Anonymous (not verified) says:

Damn this magazine has gone to shit; why don't you just post the Nagios Manual as an article? Damn, sad. Sad shit.

Very interesting

On May 8th, 2009 Anna (anonymously) (not verified) says:

Why don't you just send a cc to yourself?

Very interesting, though.

Thanks. Nice article

On May 5th, 2009 billigflieger (not verified) says:

Thanks. Nice article

Plugin sources & error in your script

On May 1st, 2009 Reto (not verified) says:

First, a pointer:

http://nagiosplugins.org/

Second, there's a problem with your RBL check script. You don't set the exit code for the CRIT case so perl will exit with "0" in either case. Nagios actually uses the exit code, not the text, so you'll not get an alert.

http://nagios.sourceforge.net/docs/3_0/pluginapi.html

Also ... while()

Bah. HTML ate my code (which

On May 1st, 2009 Reto (not verified) says:

Bah. HTML ate my code (which it probably did for the author as well:


while(<CMD>), not while()

but how?

On April 30th, 2009 Josh (not verified) says:

My only concern is: How do you send the alert out if your email server is down? SMS? And if so, through what conduit?

SMS

On May 4th, 2009 Thăng Phạm Duy (not verified) says:

You can send the alert via an SMS gateway. You can use Kannel software as a gateway and a Siemens M20 or Nokia Premicell device as an SMSC.

Sure you can test it - pick

On April 30th, 2009 Anonymous (not verified) says:

Sure you can test it - pick a banned ip and hard code that in!
