An Introduction to the Spambayes Project
You now need to configure your e-mail client to collect mail from the proxy rather than from your POP3 server. Where you currently have pop3.example.com, port 110, set up as your POP3 server, you need to set it to localhost, port 1110. If you're running the proxy on a different machine from your e-mail client, use machinename, port 1110.
Classifying your mail is now as easy as clicking “Get new mail”. The proxy adds an X-Spambayes-Classification header to each message, and you can set up a filter in your mail program to file away suspected spam in its own folder. Until you do some training, however, all your messages are classified as unsure.
Once you're up and running, you should check your suspected spam folder periodically to see whether any real messages slip through, so-called false positives. As you train the system, this will happen less and less often.
Initial training isn't an absolute requirement, but you'll get better results from the outset if you do it. You can use the upload a message or mbox file form to train via the web interface, either on individual messages or UNIX mbox files.
Once you're up and running, you can use the web interface to train the system on the messages the POP3 proxy has seen. The Review messages page lists your messages, classified according to whether the software thought they were spam, ham or unsure. You can correct any mistakes by checking the boxes and then clicking Train. After a couple of days (depending on how much e-mail you get), there'll be very few mistakes.
Spambayes does an excellent job of classifying your mail, but it's only as good as the data on which you train it. Here are some tips to help you get the best results:
Don't train on old mail. The characteristics of your e-mail change over time—sometimes subtly, sometimes dramatically—so it's best to use recent mail.
Take care when training. If you mistakenly train a spam message as ham, or vice versa, it will throw off the classifier.
Try to train on roughly as much spam as ham. This isn't critical, but you'll get better results with a fair balance.
The Spambayes software is in constant development. Many people are involved, and we have many ideas about what to do next. Here's a taste of where the project might go:
Improving the tokenizer and classifier as new research reveals more accurate ways to classify spam.
Intelligent autotraining: once the system is up and running, it should be possible for it to keep itself up-to-date by training itself, with users correcting only the odd mistake. We're already doing something along these lines with the Procmail system, but we're looking at ways of making it more automated and compatible with all platforms.
SMTP proxy: to train the system from any e-mail client on any platform, you could send a message to a special ham or spam address. This could be a simple way to correct classification mistakes, and it would combine well with intelligent auto-training techniques.
Database reduction: the more you train the system, the larger its database gets. We're looking at ways to keep the database size down.
Integration with spam-reporting tools: the web interface and the e-mail plugins could let you report spams to systems like Vipul's Razor and Pyzor.
More e-mail client integration: we already have the Outlook plugin, and we'd like to integrate with more e-mail clients. The POP3 proxy and the web interface work well with any e-mail client, but having a Delete as Spam button right there in your e-mail client is much more convenient than switching to your web browser.
Better documentation: we aim to publish documentation on how to set up Spambayes on all the popular platforms and e-mail clients.
By the time this article is in print, some of these things already may be happening; see my Update page at www.entrian.com/spambayes for details.