Fight Spam with SpamProbe

How to set up this trainable e-mail filter to eliminate false positives, work with IMAP and run as a cron job.

I get a lot of spam e-mail. These days, however, most of it doesn't go to my e-mail Inbox, because I'm filtering my e-mail with SpamProbe. SpamProbe is a spam detector; you train it to recognize what you consider to be spam. It builds databases of keywords from your e-mail messages and then uses the keyword databases to decide whether incoming e-mail messages are spam.

In this article I explain how to set up SpamProbe to intercept spam e-mails and file them into a folder named Spam. If you prefer, you also may set it up to delete these messages. The setup I describe enables spam checking on a per-user basis, and users control which of their messages are considered to be spam. The setup is completely server-based and thus works with any e-mail client. Users need to understand only how to move messages from one mail folder to another.

Because it handles spam completely on the server, SpamProbe is great for users who must access their mail over a slow link, such as a modem. Client-based filters must download all the mail, spam and non-spam alike, while a server-based filter can keep all the spam on the server.

The setup described in this article works with any trainable spam filter, not only SpamProbe.

Why SpamProbe?

Why use SpamProbe instead of another spam filter? I argue you should use it because it is a Bayesian filter with some advanced features. Bayesian spam filters work by building two databases: a database of keywords from spam e-mails and a database of keywords from nonspam e-mails. They then analyze each new e-mail message, comparing its keywords against the two databases and estimating the probability that the message is spam. You train a Bayesian spam filter by feeding it spam messages, from which it builds the spam keywords database, and nonspam messages, from which it builds the nonspam keywords database. Whoever controls the training of the filter thus controls what that filter considers spam.
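To make the two-database idea concrete, here is a minimal sketch in Python. It is not SpamProbe's actual algorithm: the function names, the add-one smoothing and the equal prior for spam and nonspam are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter

def train(db: Counter, message: str) -> None:
    """Record each distinct word of one message in a keyword database."""
    db.update(set(message.lower().split()))

def spam_probability(message: str, spam_db: Counter, good_db: Counter,
                     n_spam: int, n_good: int) -> float:
    """Estimate the probability that a message is spam.

    n_spam and n_good are the numbers of messages used to train each
    database. Add-one smoothing (an assumption of this sketch) keeps an
    unseen word from forcing the estimate to exactly 0 or 1.
    """
    log_odds = 0.0
    for word in set(message.lower().split()):
        p_word_given_spam = (spam_db[word] + 1) / (n_spam + 2)
        p_word_given_good = (good_db[word] + 1) / (n_good + 2)
        log_odds += math.log(p_word_given_spam / p_word_given_good)
    # Clamp so math.exp cannot overflow on very long messages.
    log_odds = max(-50.0, min(50.0, log_odds))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Tiny made-up training set, for illustration only.
spam_db, good_db = Counter(), Counter()
for text in ("make money fast", "cheap money now"):
    train(spam_db, text)
for text in ("meeting notes attached", "lunch tomorrow"):
    train(good_db, text)

print(spam_probability("make money now", spam_db, good_db, 2, 2))
print(spam_probability("meeting tomorrow", spam_db, good_db, 2, 2))
```

With this toy data the first message scores well above 0.5 and the second well below it. A real filter trains on thousands of messages and tokenizes headers and bodies far more carefully.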

As the filter processes incoming e-mail messages, it continues to update its keyword databases. Each message it flags as spam also is used to update the spam keywords database. As users feed corrections back into the system, the filter becomes better and better at detecting spam.

Bayesian spam filters are efficient: they don't load down a server too much, and they don't depend on a connection to an external server to access a spam database. Once they are trained, they can block almost all spam messages, with few or no false positives.

SpamProbe builds its database using not only single keywords but pairs of keywords too. The word money, by itself, might not indicate spam reliably; the phrase "make money" is a much better indicator. An ideal spam filter might use even longer chains of words, but that would be quite expensive computationally.
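As a rough illustration of what "pairs of keywords" means, the following Python sketch produces both single words and adjacent word pairs from a piece of text. The function name and the whitespace-only tokenizing are assumptions; SpamProbe's real tokenizer is more sophisticated.

```python
from typing import List

def tokens_with_pairs(text: str) -> List[str]:
    """Return single words followed by adjacent two-word phrases."""
    words = text.lower().split()
    pairs = [f"{first} {second}" for first, second in zip(words, words[1:])]
    return words + pairs

print(tokens_with_pairs("Make money fast"))
```

Each pair becomes its own database entry, so "make money" can carry a strong spam score even when "make" and "money" individually do not. The cost is a larger database, which is why longer chains quickly become expensive.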

SpamProbe also correctly handles e-mails and attachments in BASE64 or quoted-printable encoding, and it has a feature for handling Asian character sets. SpamProbe is released under the QPL, so it is free for use by anyone.

No False Positives

My biggest fear with a spam checker is "false positives": e-mail judged as spam when it really isn't. This hasn't been a problem with SpamProbe, because it gives more weight to the nonspam keyword database than the spam keyword database. In other words, it errs on the side of letting spam slip past, to reduce the chance of accidentally flagging a good e-mail as spam. To be completely safe, though, I have my SpamProbe set up to deliver spam e-mails into a special folder, Spam. This setup proves to be more convenient than receiving spam in my Inbox. From time to time I check the Spam folder, quickly running my eye down the list to make sure it all looks like spam. Then I press Ctrl+A to select all messages and Ctrl+D to delete them all. These keystrokes work in Evolution; your e-mail client may vary.
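One way a filter can lean toward nonspam is to count evidence from the nonspam database more heavily when it computes a per-word score. The Python sketch below illustrates the idea only. The function name and the weight of 2.0 are assumptions, similar in spirit to the doubling of good counts in Paul Graham's essay "A Plan for Spam", and SpamProbe's actual weighting may differ.

```python
def word_spam_probability(spam_count: int, good_count: int,
                          n_spam: int, n_good: int,
                          good_weight: float = 2.0) -> float:
    """Per-word spam probability with nonspam evidence counted more heavily.

    good_weight=2.0 is an assumed value, not SpamProbe's documented setting.
    """
    spam_ratio = min(1.0, spam_count / max(1, n_spam))
    good_ratio = min(1.0, good_weight * good_count / max(1, n_good))
    if spam_ratio + good_ratio == 0:
        return 0.5  # word never seen before: no evidence either way
    return spam_ratio / (spam_ratio + good_ratio)

# A word seen equally often in spam and nonspam leans toward nonspam.
print(word_spam_probability(5, 5, 10, 10))
# With no extra weight the same word is neutral.
print(word_spam_probability(5, 5, 10, 10, good_weight=1.0))
```

The effect is that a word must appear noticeably more often in spam than in good mail before it pushes a message toward the Spam folder.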

Setting Up a User for SpamProbe

Below, I describe how I set up my own e-mail server. Because your server probably won't be set up exactly like my server, you have to adapt these instructions. My e-mail server has SpamProbe and procmail and is set up to use Maildir format. Users on my server access their mail with IMAP; I explain later why that's important.

Maildir format means each individual e-mail is saved in its own file. By default, mail is delivered into a directory called ~/Maildir, with folders being implemented as subdirectories under ~/Maildir. Inside the folder directory are three subdirectories: new, tmp and cur. cur is where messages are stored. On my system, the Inbox folder is ~/Maildir/cur, and the Spam folder is ~/Maildir/.Spam/cur. Subfolders are not nested directories; if a folder named nonspam is a subfolder of Spam, the directory is ~/Maildir/.Spam.nonspam/cur.
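The folder-to-directory mapping described above can be sketched in a few lines of Python. The function name and the use of "/" to separate a folder from its subfolder are assumptions for illustration; the paths match the examples in this article.

```python
from pathlib import PurePosixPath

def maildir_folder_path(folder: str = "", root: str = "~/Maildir") -> str:
    """Map a folder name such as 'Spam/nonspam' to its cur directory.

    An empty folder name means the Inbox. Subfolders are flattened into
    one dot-separated directory name instead of being nested.
    """
    base = PurePosixPath(root)
    if folder:
        base = base / ("." + folder.replace("/", "."))
    return str(base / "cur")

print(maildir_folder_path())                # ~/Maildir/cur
print(maildir_folder_path("Spam"))          # ~/Maildir/.Spam/cur
print(maildir_folder_path("Spam/nonspam"))  # ~/Maildir/.Spam.nonspam/cur
```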

Each incoming e-mail is fed through SpamProbe with the receive option. Here, from the man page for SpamProbe, are two examples of the output from SpamProbe receive:

SPAM 0.99 595f0150587edd7b395691964069d7af
GOOD 0.02 595f0150587edd7b395691964069d7af

First, an e-mail is flagged as either SPAM or GOOD. Next comes a numeric score reflecting how confident SpamProbe is. (On my server, it appears these scores are always 1.0 or 0.0, reflecting total confidence.) Last is a message digest string.
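If you want to use the score or the digest in your own scripts, the three fields are easy to pull apart. This Python sketch assumes the output is always one line of three whitespace-separated fields, as in the man page examples above; the names Verdict and parse_receive_output are made up for this example.

```python
from typing import NamedTuple

class Verdict(NamedTuple):
    is_spam: bool
    score: float
    digest: str

def parse_receive_output(line: str) -> Verdict:
    """Split one line of spamprobe receive output into its three fields."""
    label, score, digest = line.split()
    if label not in ("SPAM", "GOOD"):
        raise ValueError(f"unexpected label: {label!r}")
    return Verdict(label == "SPAM", float(score), digest)

print(parse_receive_output("SPAM 0.99 595f0150587edd7b395691964069d7af"))
print(parse_receive_output("GOOD 0.02 595f0150587edd7b395691964069d7af"))
```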

To enable spam checking for a user on my server, I set up a .procmailrc file, as shown below, in the user's home directory:

# feed all through spam filter
:0
SCORE=| /usr/local/bin/spamprobe receive
# insert spam filter header line
:0 wf
| formail -I "X-SpamProbe: $SCORE"
# test: did spam filter declare it spam?
:0 a
*^X-SpamProbe: SPAM
        $HOME/Maildir/.Spam/
# anything left over after above, deliver into main Inbox
:0
        $HOME/Maildir/

When the e-mail server software tries to deliver an e-mail, it runs procmail. This procmail script causes the spam checker to operate.

SpamProbe receive scores the e-mail and then formail -I inserts a new line into the header of the e-mail, the X-SpamProbe: line. Next, procmail matches an e-mail header line that starts with X-SpamProbe: SPAM. (This procmail script doesn't use the numeric score or message digest from spamprobe receive.) E-mails that match are delivered to the Spam folder. The final / on the folder name indicates to procmail that the e-mail should be delivered in Maildir format.
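For readers who find procmail syntax opaque, here is the same tag-then-route logic as a Python sketch. It is only an illustration of what the recipes do, not a replacement for them. The function name is hypothetical, and the folder strings simply mirror the ones in the .procmailrc file above.

```python
from email import message_from_string
from typing import Tuple

def tag_and_route(raw_message: str, score_line: str) -> Tuple[str, str]:
    """Insert the X-SpamProbe header and pick a delivery folder.

    Like formail -I, this removes any X-SpamProbe field already present
    before adding the new one, so a sender cannot pre-set the verdict.
    """
    msg = message_from_string(raw_message)
    del msg["X-SpamProbe"]  # does nothing if the field is absent
    msg["X-SpamProbe"] = score_line
    if msg["X-SpamProbe"].startswith("SPAM"):
        folder = "Maildir/.Spam/"
    else:
        folder = "Maildir/"
    return msg.as_string(), folder

raw = "From: someone@example.com\nSubject: hello\n\nMessage body\n"
tagged, folder = tag_and_route(raw, "GOOD 0.02 595f0150587edd7b395691964069d7af")
print(folder)
print(tagged)
```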

If you prefer simply to delete spam e-mail messages, you can, of course, change the .procmailrc file to deliver spam to /dev/null instead of to a Spam folder.

______________________

Comments

Re: Fight Spam with SpamProbe

Anonymous

That sounds exactly like SpamAssassin, which works the same way and uses a Bayesian filter to differentiate between spam and nonspam e-mails. However, to extend spam control to POP3 users, we usually create a separate spam admin e-mail ID that acts as a watchdog: it receives a copy of any mail flowing into the server, and the admin can then decide whether a message is spam. This is an advantage in that it gives better control over which messages are reported as spam. We also run a feedback system in which messages are sent as attachments to the spam admin e-mail ID.

regards
Rakesh
CompTIA Linux+ Professional

Re: Fight Spam with SpamProbe

Anonymous

How do you train the filter when the user marks the mail as spam?

Re: Fight Spam with SpamProbe

Anonymous

Using Red Hat's standard IMAP server, each mail folder is a file in the Linux filesystem. If you just sleazily delete the spam mail folder file, you will get an error when trying to move subsequent spam into the now nonexistent file:

ERROR : Could not complete request.
Query: COPY 11763 "mail/spam"
Reason Given: [TRYCREATE] UID COPY failed: No such destination mailbox

Anyone for a better solution? Any good IMAP management tools?

Re: Fight Spam with SpamProbe

Anonymous

I have the same question: what would you do with Cyrus-imapd? If anyone has successfully implemented a sendmail->procmail->spamprobe->cyrus setup, I would love to hear the details, especially if you used Red Hat 8-9.

Re: Fight Spam with SpamProbe

Anonymous

What if you are not using Maildir but rather Cyrus's var/spool/imap/user/ directories?

Not sure what to put in the /etc/procmailrc file.

Re: Fight Spam with SpamProbe

Anonymous

What is the best mail filter available?

thanks

Talha

Re: Fight Spam with SpamProbe

Anonymous

This is all very good, however many corporations dictate using MS Exchange Server for their e-mail needs. I would have found the article more useful for my actual job if it described how to use this product in concert with MS Exchange Server.

Is it possible to integrate SpamProbe with Exchange Server to filter out SPAM before it reaches it?

Re: Fight Spam with SpamProbe

steveha

Exchange Server runs on a Windows server platform. Both SpamProbe and Bogofilter run on a *NIX platform such as Linux. Unless you were to port one of these filters to Windows yourself, I don't see how you can use them in a similar way to what I described in my article.

It should be possible to set up a mail gateway server, running Linux, and have all the mail pass through this server on the way to the Exchange server. You could then run SpamProbe or Bogofilter on the gateway server. This does not permit the per-user spam training; whoever is in charge of the gateway server would have to train the spam filter.

If you can afford to run a slower spam filter on your mail (i.e. if you can dedicate a fast server as your gateway) you might want to look into SpamAssassin. This is a rules-based spam filter, and will filter spam without any training. My problem with SpamAssassin is that it will flag some spam-like emails that I actually want to receive. For example, if I sometimes shop at a discount computer web site, and they send me a notice of a sale, I want to get that notice; but SpamAssassin might very well flag that notice as a spam. So I suggest that if you do try out SpamAssassin, you don't just have it delete all emails it thinks are spam.

It is possible to run SpamAssassin and SpamProbe or Bogofilter on the same server; and if they both agree that an email is spam, you just delete it. Also, newer versions of SpamAssassin have trainable Bayesian spam filter functionality built-in, so you could train SpamAssassin to fine-tune its spam recognition.

http://spamassassin.org/index.html

steveha

Re: Fight Spam with SpamProbe

Anonymous

What about spam that's sent as a UUEncoded attachment? How do any filters work against that?

Re: Fight Spam with SpamProbe

steveha

Yes, SpamProbe will handle this, and Bogofilter versions from 0.10.0 onward will handle this. They handle BASE64, UUENCODE, and quoted-printable.

Bogofilter (was: Re: Fight Spam with SpamProbe)

Anonymous

Bogofilter works in just the same way, and it has been around much longer. Also, it implements three different algorithms for the Bayesian decision.

Bogofilter (was: Re: Fight Spam with SpamProbe)

Anonymous

I'm sure that bogofilter is a nice tool but it has not "been around much longer." In fact both were written at the same time, in the days immediately following Paul's article.

Bogofilter (was: Re: Fight Spam with SpamProbe)

steveha

I tried Bogofilter before I tried SpamProbe. The problem was that Bogofilter didn't do the right thing with spam messages that were BASE64 encoded. SpamProbe decodes the BASE64 and detects that spam.

Since you brought it up, I went and checked: Bogofilter version 0.10.0 and newer have this feature! So Bogofilter is another solid choice as a spam filter. Since Bogofilter has the improved algorithm, it may actually be better than SpamProbe.

The setup I described in the article, with a few changes, should also work well with Bogofilter.

steveha

Re: Fight Spam with SpamProbe

Anonymous

See SpamOracle (http://pauillac.inria.fr/~xleroy/software.html), which does something very similar. It is also in Debian: 'apt-get install spamoracle'.

--Brock
