Fight Spam with SpamProbe

by Steve Hastings

I get a lot of spam e-mail. These days, however, most of it doesn't go to my e-mail Inbox, because I'm filtering my e-mail with SpamProbe. SpamProbe is a spam detector; you train it to recognize what you consider to be spam. It builds databases of keywords from your e-mail messages and then uses the keyword databases to decide whether incoming e-mail messages are spam.

In this article I explain how to set up SpamProbe to intercept spam e-mails and file them into a folder named Spam. If you prefer, you also may set it up to delete these messages. The setup I describe enables spam checking on a per-user basis, and users control which of their messages are considered to be spam. The setup is completely server-based and thus works with any e-mail client. Users need to understand only how to move messages from one mail folder to another.

Because it handles spam completely on the server, SpamProbe is great for users who must access their mail over a slow link, such as a modem. Client-based filters must download all the mail, spam and non-spam alike, while a server-based filter can keep all the spam on the server.

The setup described in this article works with any trainable spam filter, not only SpamProbe.

Why SpamProbe?

Why use SpamProbe instead of another spam filter? I argue you should use it because it is a Bayesian filter with some advanced features. Bayesian spam filters work by building two databases: a database of keywords from spam e-mails and a database of keywords from nonspam e-mails. They then analyze each new e-mail message, comparing its keywords against the two databases and estimating the probability that the message is spam. You train a Bayesian spam filter by feeding it spam messages, so it can build its spam keyword database, and nonspam messages, so it can build its nonspam keyword database. Whoever controls the training of the filter thus controls what that filter considers spam.
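To make the two-database idea concrete, here is a minimal sketch of a Bayesian filter in Python. It is an illustration only, not SpamProbe's actual algorithm: the class name TinyBayesFilter, the tokenizer and the clamping values 0.01 and 0.99 are all assumptions chosen for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class TinyBayesFilter:
    """Toy two-database filter (hypothetical; not SpamProbe's real code)."""

    def __init__(self):
        self.spam_counts = Counter()   # token -> number of spam messages containing it
        self.good_counts = Counter()   # token -> number of nonspam messages containing it
        self.spam_messages = 0
        self.good_messages = 0

    def train(self, text, is_spam):
        """Add one message to the spam or the nonspam database."""
        tokens = set(tokenize(text))
        if is_spam:
            self.spam_counts.update(tokens)
            self.spam_messages += 1
        else:
            self.good_counts.update(tokens)
            self.good_messages += 1

    def token_probability(self, token):
        """Estimate how strongly one token suggests spam, or None if unseen."""
        spam_freq = self.spam_counts[token] / max(self.spam_messages, 1)
        good_freq = self.good_counts[token] / max(self.good_messages, 1)
        if spam_freq + good_freq == 0:
            return None
        p = spam_freq / (spam_freq + good_freq)
        # Clamp so that a single token can never be treated as absolute proof.
        return min(max(p, 0.01), 0.99)

    def score(self, text):
        """Combine the token estimates into one spam probability (0.0 to 1.0)."""
        spam_product = 1.0
        good_product = 1.0
        for token in set(tokenize(text)):
            p = self.token_probability(token)
            if p is None:
                continue  # tokens never seen in training carry no evidence
            spam_product *= p
            good_product *= 1.0 - p
        return spam_product / (spam_product + good_product)
```

After training on a few spam and nonspam messages, score() returns a value near 1.0 for text that resembles the spam examples, near 0.0 for text that resembles the nonspam examples, and exactly 0.5 when none of the words have been seen before.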

As the filter processes incoming e-mail messages, it continues to update its keyword databases. Each message it flags as spam also is used to update the spam keywords database. As users feed corrections back into the system, the filter becomes better and better at detecting spam.

Bayesian spam filters are efficient: they don't load down a server too much, and they don't depend on a connection to an external server to access a spam database. Once they are trained, they can block almost all spam messages, with few or no false positives.

SpamProbe builds its database using not only single keywords but pairs of keywords too. The word money, by itself, might not indicate spam reliably; the phrase "make money" is a much better indicator. An ideal spam filter might use even longer chains of words, but that would be quite expensive computationally.
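As a rough illustration of what "single keywords plus pairs" means, the following Python sketch turns a piece of text into both kinds of terms. The function name words_and_pairs and the whitespace-based splitting are assumptions for this example; SpamProbe's real tokenizer is more sophisticated.

```python
def words_and_pairs(text):
    """Return every single word followed by every pair of adjacent words."""
    words = text.lower().split()
    # zip(words, words[1:]) walks the list two neighbours at a time.
    pairs = [f"{first} {second}" for first, second in zip(words, words[1:])]
    return words + pairs
```

A filter that counts the pair "make money" separately from "make" and "money" can learn that the phrase is suspicious even when each word alone is common in legitimate mail.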

SpamProbe also correctly handles e-mails and attachments in BASE64 or quoted-printable encoding, and it has a feature for handling Asian character sets. SpamProbe is released under the QPL, so it is free for use by anyone.

No False Positives

My biggest fear with a spam checker is "false positives": e-mail judged as spam when it really isn't. This hasn't been a problem with SpamProbe, because it gives more weight to the nonspam keyword database than to the spam keyword database. In other words, it errs on the side of letting spam slip past, to reduce the chance of accidentally flagging a good e-mail as spam. To be completely safe, though, I have my SpamProbe set up to deliver spam e-mails into a special folder, Spam. This setup proves to be more convenient than receiving spam in my Inbox. From time to time I check the Spam folder, quickly running my eye down the list to make sure it all looks like spam. Then I press Ctrl+A to select all messages and Ctrl+D to delete them all. These keystrokes work in Evolution; the keystrokes in your e-mail client may differ.

Setting Up a User for SpamProbe

Below, I describe how I set up my own e-mail server. Because your server probably won't be set up exactly like my server, you have to adapt these instructions. My e-mail server has SpamProbe and procmail and is set up to use Maildir format. Users on my server access their mail with IMAP; I explain later why that's important.

Maildir format means each individual e-mail is saved in its own file. By default, mail is delivered into a directory called ~/Maildir, with folders being implemented as subdirectories under ~/Maildir. Inside each folder directory are three subdirectories: new, tmp and cur. Newly delivered messages land in new, and cur is where messages are stored once a mail client has seen them. On my system, the Inbox folder is ~/Maildir/cur, and the Spam folder is ~/Maildir/.Spam/cur. Subfolders are not nested directories; if a folder named nonspam is a subfolder of Spam, the directory is ~/Maildir/.Spam.nonspam/cur.
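The folder naming rule above can be captured in a few lines of Python. This is a sketch of the layout as described for my server; the function name maildir_folder and the use of "/" to separate folder levels in its argument are assumptions made for the example.

```python
from pathlib import Path


def maildir_folder(maildir, folder=""):
    """Return the cur directory for a folder such as "Spam" or "Spam/nonspam".

    An empty folder name means the Inbox, which lives directly in the Maildir.
    """
    root = Path(maildir)
    if not folder:
        return root / "cur"
    # Subfolders are flattened: "Spam/nonspam" becomes ".Spam.nonspam".
    return root / ("." + folder.replace("/", ".")) / "cur"
```

For example, maildir_folder(Path.home() / "Maildir", "Spam/nonspam") points at ~/Maildir/.Spam.nonspam/cur.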

Each incoming e-mail is fed through SpamProbe with the receive option. Here, from the man page for SpamProbe, are two examples of the output from SpamProbe receive:

SPAM 0.99 595f0150587edd7b395691964069d7af
GOOD 0.02 595f0150587edd7b395691964069d7af

First, an e-mail is flagged as either SPAM or GOOD. Next comes a numeric score reflecting how confident SpamProbe is. (On my server, it appears these scores are always 1.0 or 0.0, reflecting total confidence.) Last is a message digest string.
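If you want to use the score or the digest in your own scripts, the three fields are easy to pull apart. The following Python sketch parses one line in the format shown above; the names Verdict and parse_verdict are hypothetical and not part of SpamProbe.

```python
from typing import NamedTuple


class Verdict(NamedTuple):
    is_spam: bool   # True for SPAM, False for GOOD
    score: float    # the numeric confidence score
    digest: str     # the message digest string


def parse_verdict(line):
    """Parse one line of spamprobe receive output into a Verdict."""
    label, score, digest = line.split()
    if label not in ("SPAM", "GOOD"):
        raise ValueError(f"unexpected label: {label!r}")
    return Verdict(label == "SPAM", float(score), digest)
```

A script could then, say, treat only messages whose score is above some threshold of its own choosing as spam.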

To enable spam checking for a user on my server, I set up a .procmailrc file, as shown below, in the user's home directory:

# feed all through spam filter
:0
SCORE=| /usr/local/bin/spamprobe receive
# insert spam filter header line
:0 wf
| formail -I "X-SpamProbe: $SCORE"
# test: did spam filter declare it spam?
:0 a
*^X-SpamProbe: SPAM
        $HOME/Maildir/.Spam/
# anything left over after above, deliver into main Inbox
:0
        $HOME/Maildir/

When the e-mail server software tries to deliver an e-mail, it runs procmail. This procmail script causes the spam checker to operate.

SpamProbe receive scores the e-mail and then formail -I inserts a new line into the header of the e-mail, the X-SpamProbe: line. Next, procmail matches an e-mail header line that starts with X-SpamProbe: SPAM. (This procmail script doesn't use the numeric score or message digest from spamprobe receive.) E-mails that match are delivered to the Spam folder. The final / on the folder name indicates to procmail that the e-mail should be delivered in Maildir format.

If you prefer simply to delete spam e-mail messages, you can, of course, change the .procmailrc file to deliver spam to /dev/null instead of to a Spam folder.

Make SpamProbe Easy to Train

SpamProbe recognizes spam based on the contents of its databases. If a spam e-mail slips past SpamProbe, you want to tell SpamProbe about it. And, if SpamProbe ever flags an e-mail as spam when it isn't, you definitely want to tell SpamProbe about it. You could save up a collection of spam e-mails and manually feed them through SpamProbe to flag them as spam; you'd then do the same with false positives. But it's easy to automate the training process.

A user moves any false positives from the Spam folder to a subfolder of Spam called nonspam. Likewise, a user moves any spam that eluded the filter to a subfolder of Spam called missedspam. This is why IMAP is important: with IMAP, a user can simply drag a message into the appropriate folder (nonspam or missedspam).

Listing 1, spamprobe_update, handles e-mails in the Spam subfolders.

Listing 1. spamprobe_update

After a few sanity checks, the find commands locate e-mails and feed them to SpamProbe, and SpamProbe updates its databases appropriately. Then spam messages simply are deleted with rm, while nonspam messages are moved to the Inbox folder.
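For readers who want to see the shape of such a script, here is a rough Python sketch of the same steps. It is not Listing 1: it is a hypothetical re-creation that assumes the Maildir layout described earlier, the /usr/local/bin/spamprobe path from the .procmailrc file, and SpamProbe's spam and good commands for reclassifying a message file. Check the SpamProbe man page for your version before relying on it.

```python
import shutil
import subprocess
from pathlib import Path

SPAMPROBE = "/usr/local/bin/spamprobe"  # assumed install path


def messages_in(folder):
    """Yield message files from a Maildir folder's cur and new subdirectories."""
    for sub in ("cur", "new"):
        directory = folder / sub
        if directory.is_dir():
            yield from sorted(p for p in directory.iterdir() if p.is_file())


def update(maildir, run=subprocess.run):
    """Train SpamProbe from the missedspam and nonspam folders.

    The run parameter exists only so the external command can be replaced
    when trying the script out without SpamProbe installed.
    """
    maildir = Path(maildir)
    inbox = maildir / "cur"
    if not inbox.is_dir():  # sanity check before touching anything
        raise SystemExit(f"{maildir} does not look like a Maildir")

    # Spam that slipped past the filter: record it as spam, then delete it.
    for message in messages_in(maildir / ".Spam.missedspam"):
        run([SPAMPROBE, "spam", str(message)], check=True)
        message.unlink()

    # False positives: record them as good, then move them to the Inbox.
    for message in messages_in(maildir / ".Spam.nonspam"):
        run([SPAMPROBE, "good", str(message)], check=True)
        shutil.move(str(message), str(inbox / message.name))
```

Calling update(Path.home() / "Maildir") processes the current user's mail. Because check=True is passed, a failing spamprobe command raises an error before the message is deleted or moved.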

This spamprobe_update script is a little bit sleazy: without consulting with the mail server software, it's shuffling e-mail messages around or deleting them. On my system, this works perfectly fine; my mail server software notices the messages have moved around, but there's no problem.

A less sleazy solution might involve a command-line tool that uses IMAP to delete or move the e-mail messages. A Python script probably would be an easy way to do this. One potential security risk: the user's password needs to be stored in a file somewhere in a format the command-line IMAP tool is able to read. Directly manipulating the files may be sleazy, but it doesn't require storing a password.
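As a sketch of what the IMAP approach might look like, the following Python function moves every message from one folder to another using the standard imaplib module. It covers only the moving step, not the SpamProbe training. The function name move_all is hypothetical, and the folder names your server expects (for example, a prefix such as INBOX.) are an assumption you need to check.

```python
import imaplib


def move_all(conn, source, destination):
    """Move every message in source to destination; return how many moved.

    conn is an already logged-in imaplib connection, for example one
    created with imaplib.IMAP4_SSL(host) followed by conn.login(user, password).
    """
    status, _ = conn.select(source)
    if status != "OK":
        raise RuntimeError(f"cannot open folder {source!r}")
    status, data = conn.search(None, "ALL")
    if status != "OK":
        raise RuntimeError("search failed")
    numbers = data[0].split()
    for num in numbers:
        status, _ = conn.copy(num, destination)
        if status != "OK":
            raise RuntimeError(f"cannot copy message {num!r}")
        # Classic IMAP has no move command: copy, then mark the original deleted.
        conn.store(num, "+FLAGS", "\\Deleted")
    conn.expunge()  # actually remove the messages marked as deleted
    return len(numbers)
```

The password passed to login() is exactly the piece that would have to be stored somewhere readable by the script, which is the security trade-off described above.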

For total automation, set up spamprobe_update as a cron job. The user can set up his own job with crontab -e to edit his personal cron table; or the sysadmin can set it up with crontab -u <username> -e. On my server, I like to have spamprobe_update run once per hour, on the hour, so I added these lines to the cron table:

#m   h dom mon dow command
 0   *   *   *   * /usr/local/bin/spamprobe_update

The first line is a comment. The second line sets up spamprobe_update to run at the 0 minute of every hour, every day of the month, every month and every day of the week. Insert these lines or simply the non-comment one in the cron table for any user who will be using SpamProbe.

Because spamprobe_update is intended to run as a cron job, it produces no output unless there is an error. Any output from a cron process is packaged up as an e-mail and sent to the user.

For a server with many users, you might want to use a random number, 0-59, for the minute to avoid having spamprobe_update run simultaneously for every user on the system. For my e-mail server with only a handful of users, I don't bother.
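One way to spread the jobs out is sketched below in Python. Instead of a truly random number, it derives the minute from a checksum of the username, so each user gets a minute that stays the same every time the line is generated. The function name cron_line and the choice of CRC-32 are assumptions made for this example.

```python
import zlib


def cron_line(username, command="/usr/local/bin/spamprobe_update"):
    """Build an hourly cron table line with a per-user minute (0-59)."""
    # CRC-32 is used only as a cheap, stable way to scatter users across the hour.
    minute = zlib.crc32(username.encode("utf-8")) % 60
    return f"{minute:2d}   *   *   *   * {command}"
```

Printing cron_line(username) for each user gives lines in the same format as the cron table entry shown earlier, differing only in the minute field.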

What about POP3?

With POP3, by the time the user sees a message, it already has been downloaded to the user's computer and deleted from the server, so there is nothing left on the server for the user to move into a training folder. If your users insist on POP3, however, you always could set up some alternative system. Perhaps the users could collect some messages and then move them into a special directory on a file server. Or, the users could forward the e-mails to special spam-handling e-mail accounts where custom scripts would run on receipt of mail.

Conclusion

My ideal e-mail client would have spam management tools built in: not only a delete command, but also a "delete and flag this message as spam" option. But the system I described here is almost as convenient, and it works with any e-mail client.

Because users control the training of SpamProbe, they have control over what messages SpamProbe traps. Malicious users cannot affect other users' definitions of spam.

Once it is trained, SpamProbe does a great job of stripping out the spam while leaving alone the messages you want. Give it a try.

Resources

An article by Paul Graham, called "A Plan for Spam", was the inspiration for SpamProbe and other Bayesian spam filters.

The SpamProbe main page.

The procmail main page.

An article about Bayesian theory and how it applies to spam filters. The author, Gary Robinson, also had an article on spam checker theory published in the March 2003 issue of Linux Journal.

Steve Hastings first used UNIX on actual paper teletypes. He enjoys bicycling, music, petting his cat and making his Linux computers do new things.
