Fight Spam with SpamProbe
I get a lot of spam e-mail. These days, however, most of it doesn't go to my e-mail Inbox, because I'm filtering my e-mail with SpamProbe. SpamProbe is a spam detector; you train it to recognize what you consider to be spam. It builds databases of keywords from your e-mail messages and then uses the keyword databases to decide whether incoming e-mail messages are spam.
In this article I explain how to set up SpamProbe to intercept spam e-mails and file them into a folder named Spam. If you prefer, you also may set it up to delete these messages. The setup I describe enables spam checking on a per-user basis, and users control which of their messages are considered to be spam. The setup is completely server-based and thus works with any e-mail client. Users need to understand only how to move messages from one mail folder to another.
Because it handles spam completely on the server, SpamProbe is great for users who must access their mail over a slow link, such as a modem. Client-based filters must download all the mail, spam and non-spam alike, while a server-based filter can keep all the spam on the server.
The setup described in this article works with any trainable spam filter, not only SpamProbe.
Why use SpamProbe instead of another spam filter? I argue you should use it because it is a Bayesian filter with some advanced features. Bayesian spam filters work by building two databases: a database of keywords from spam e-mails and a database of keywords from nonspam e-mails. They then analyze each new e-mail message, comparing its keywords against the two databases and estimating the probability that the message is spam. You train a Bayesian spam filter by feeding spam messages to it so it can build a spam keywords database, and by feeding it nonspam messages so it can build a nonspam keywords database. Whoever controls the training of the filter thus controls what that filter considers spam.
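To make the two-database idea concrete, here is a minimal Python sketch of a naive Bayes classifier. It illustrates the general technique only and is not SpamProbe's actual algorithm; the names tokenize and ToyBayesFilter, and the add-one smoothing, are my own assumptions for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class ToyBayesFilter:
    """Two keyword databases plus a naive Bayes score (illustration only)."""

    def __init__(self):
        self.spam_words = Counter()  # keyword counts from spam messages
        self.good_words = Counter()  # keyword counts from nonspam messages
        self.spam_msgs = 0
        self.good_msgs = 0

    def train(self, text, is_spam):
        words = set(tokenize(text))  # count each keyword once per message
        if is_spam:
            self.spam_words.update(words)
            self.spam_msgs += 1
        else:
            self.good_words.update(words)
            self.good_msgs += 1

    def spam_probability(self, text):
        """Estimate the probability that text is spam, working in log space."""
        total = self.spam_msgs + self.good_msgs
        log_spam = math.log((self.spam_msgs + 1) / (total + 2))
        log_good = math.log((self.good_msgs + 1) / (total + 2))
        for word in set(tokenize(text)):
            # Add-one smoothing keeps unseen keywords from zeroing the score.
            log_spam += math.log((self.spam_words[word] + 1) / (self.spam_msgs + 2))
            log_good += math.log((self.good_words[word] + 1) / (self.good_msgs + 2))
        # Turn the two log scores back into a probability between 0 and 1.
        return 1.0 / (1.0 + math.exp(log_good - log_spam))
```

Training is a matter of calling train() with is_spam set to True or False for each example message; spam_probability() then returns a value close to 1 for text that resembles the spam examples and close to 0 for text that resembles the nonspam ones.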
As the filter processes incoming e-mail messages, it continues to update its keyword databases. Each message it flags as spam also is used to update the spam keywords database. As users feed corrections back into the system, the filter becomes better and better at detecting spam.
Bayesian spam filters are efficient: they don't load down a server too much, and they don't depend on a connection to an external server to access a spam database. Once they are trained, they can block almost all spam messages, with few or no false positives.
SpamProbe builds its database using not only single keywords but pairs of keywords too. The word "money", by itself, might not indicate spam reliably; the phrase "make money" is a much better indicator. An ideal spam filter might use even longer chains of words, but that would be quite expensive computationally.
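The idea of indexing word pairs alongside single words can be sketched in a few lines of Python. This is only an illustration of the concept; the function name single_and_pair_tokens is hypothetical, and SpamProbe's real tokenizer is more elaborate.

```python
import re


def single_and_pair_tokens(text):
    """Return the single words in text followed by each adjacent word pair."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    pairs = [f"{first} {second}" for first, second in zip(words, words[1:])]
    return words + pairs


print(single_and_pair_tokens("Make money fast"))
# ['make', 'money', 'fast', 'make money', 'money fast']
```

A message of n words yields n single tokens and n - 1 pairs, so the database roughly doubles in entries per message. Longer chains would grow it further, which is the computational cost mentioned above.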
SpamProbe also correctly handles e-mails and attachments in BASE64 or quoted-printable encoding, and it has a feature for handling Asian character sets. SpamProbe is released under the QPL, so it is free for use by anyone.
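To see why decoding matters, consider a message whose body is BASE64 encoded: the keywords are invisible until the body is decoded. The sketch below uses Python's standard email package to show the effect. It says nothing about how SpamProbe implements decoding internally, and the sample message and the decoded_body helper are made up for the example.

```python
import base64
from email import message_from_string

# A made-up message whose body is BASE64 encoded.
ENCODED_BODY = base64.b64encode(b"Make money fast!").decode("ascii")
RAW = (
    "Subject: hello\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n" + ENCODED_BODY + "\n"
)


def decoded_body(raw_message):
    """Return the body text with BASE64 or quoted-printable encoding undone."""
    msg = message_from_string(raw_message)
    payload = msg.get_payload(decode=True)  # bytes, transfer encoding removed
    charset = msg.get_content_charset() or "utf-8"
    return payload.decode(charset, errors="replace")


print("money" in RAW)      # False: the keyword is hidden by the encoding
print(decoded_body(RAW))   # Make money fast!
```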
My biggest fear with a spam checker is "false positives": e-mail judged as spam when it really isn't. This hasn't been a problem with SpamProbe, because it gives more weight to the nonspam keyword database than the spam keyword database. In other words, it errs on the side of letting spam slip past, to reduce the chance of accidentally flagging a good e-mail as spam. To be completely safe, though, I have my SpamProbe set up to deliver spam e-mails into a special folder, Spam. This setup proves to be more convenient than receiving spam in my Inbox. From time to time I check the Spam folder, quickly running my eye down the list to make sure it all looks like spam. Then I press Ctrl+A to select all messages and Ctrl+D to delete them all. These keystrokes work in Evolution; your e-mail client may vary.
Below, I describe how I set up my own e-mail server. Because your server probably won't be set up exactly like my server, you have to adapt these instructions. My e-mail server has SpamProbe and procmail and is set up to use Maildir format. Users on my server access their mail with IMAP; I explain later why that's important.
Maildir format means each individual e-mail is saved in its own file. By default, mail is delivered into a directory called ~/Maildir, with folders being implemented as subdirectories under ~/Maildir. Inside the folder directory are three subdirectories: new, tmp and cur. cur is where messages are stored. On my system, the Inbox folder is ~/Maildir/cur, and the Spam folder is ~/Maildir/.Spam/cur. Subfolders are not nested directories; if a folder named nonspam is a subfolder of Spam, the directory is ~/Maildir/.Spam.nonspam/cur.
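The folder naming rule described above can be expressed as a small Python helper. It is a sketch under my own assumptions: the function name maildir_folder_path is hypothetical, and using "/" to separate a folder from its subfolder in the input is just a convention chosen for the example.

```python
from pathlib import PurePosixPath


def maildir_folder_path(maildir, folder=None):
    """Map a folder name such as "Spam/nonspam" to its Maildir cur directory."""
    base = PurePosixPath(maildir)
    if not folder:
        # The Inbox is the top-level Maildir directory itself.
        return base / "cur"
    # Subfolders are flattened: "Spam/nonspam" becomes ".Spam.nonspam".
    return base / ("." + folder.replace("/", ".")) / "cur"


print(maildir_folder_path("~/Maildir"))                  # ~/Maildir/cur
print(maildir_folder_path("~/Maildir", "Spam"))          # ~/Maildir/.Spam/cur
print(maildir_folder_path("~/Maildir", "Spam/nonspam"))  # ~/Maildir/.Spam.nonspam/cur
```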
Each incoming e-mail is fed through SpamProbe with the receive option. Here, from the man page for SpamProbe, are two examples of the output from SpamProbe receive:
SPAM 0.99 595f0150587edd7b395691964069d7af
GOOD 0.02 595f0150587edd7b395691964069d7af
First, an e-mail is flagged as either SPAM or GOOD. Next comes a numeric score reflecting how confident SpamProbe is. (On my server, it appears these scores are always 1.0 or 0.0, reflecting total confidence.) Last is a message digest string.
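If you want to use the score or the digest in a script of your own, the three fields are easy to pull apart. Here is a minimal Python sketch that parses one such output line; the Verdict type and the parse_receive_line function are hypothetical names, and the sketch assumes the output always has exactly the three whitespace-separated fields described above.

```python
from typing import NamedTuple


class Verdict(NamedTuple):
    is_spam: bool
    score: float
    digest: str


def parse_receive_line(line):
    """Split a line such as 'SPAM 0.99 595f...' into its three fields."""
    label, score, digest = line.split()
    if label not in ("SPAM", "GOOD"):
        raise ValueError(f"unexpected label: {label!r}")
    return Verdict(label == "SPAM", float(score), digest)


verdict = parse_receive_line("SPAM 0.99 595f0150587edd7b395691964069d7af")
print(verdict.is_spam, verdict.score)  # True 0.99
```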
To enable spam checking for a user on my server, I set up a .procmailrc file, as shown below, in the user's home directory:
# feed all through spam filter
:0
SCORE=| /usr/local/bin/spamprobe receive

# insert spam filter header line
:0 wf
| formail -I "X-SpamProbe: $SCORE"

# test: did spam filter declare it spam?
:0 a
*^X-SpamProbe: SPAM
$HOME/Maildir/.Spam/

# anything left over after above, deliver into main Inbox
:0
$HOME/Maildir/
When the e-mail server software tries to deliver an e-mail, it runs procmail. This procmail script causes the spam checker to operate.
SpamProbe receive scores the e-mail and then formail -I inserts a new line into the header of the e-mail, the X-SpamProbe: line. Next, procmail matches an e-mail header line that starts with X-SpamProbe: SPAM. (This procmail script doesn't use the numeric score or message digest from spamprobe receive.) E-mails that match are delivered to the Spam folder. The final / on the folder name indicates to procmail that the e-mail should be delivered in Maildir format.
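For readers who find procmail recipes hard to follow, the same tag-then-route logic can be modeled in Python. This is only a model of what the recipes do, not something to install on the server. The function name tag_and_route and the default home directory are made up for the example, and the score line is passed in rather than obtained by running spamprobe.

```python
from email import message_from_string


def tag_and_route(raw_message, score_line, home="/home/user"):
    """Add the X-SpamProbe header and pick the Maildir folder for delivery."""
    msg = message_from_string(raw_message)
    # Like formail -I: remove any existing X-SpamProbe header, then add ours.
    del msg["X-SpamProbe"]
    msg["X-SpamProbe"] = score_line.strip()
    if msg["X-SpamProbe"].startswith("SPAM"):
        destination = f"{home}/Maildir/.Spam/"
    else:
        destination = f"{home}/Maildir/"
    return msg.as_string(), destination


tagged, destination = tag_and_route(
    "Subject: hi\n\nbody\n",
    "SPAM 0.99 595f0150587edd7b395691964069d7af",
)
print(destination)  # /home/user/Maildir/.Spam/
```

Removing any existing X-SpamProbe header first matters because a spammer could otherwise send mail that already carries a forged "GOOD" header.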
If you prefer simply to delete spam e-mail messages, you can, of course, change the .procmailrc file to deliver spam to /dev/null instead of to a Spam folder.