# A Statistical Approach to the Spam Problem

Most people are spending significant time daily on the task of distinguishing spam from useful e-mail. I don't think I'm alone in feeling that we have better things to do. Spam-filtering software can help.

This article discusses one of many possible mathematical foundations for a key aspect of spam filtering—generating an indicator of “spamminess” from a collection of tokens representing the content of an e-mail.

The approach described here truly has been a distributed effort in the best open-source tradition. Paul Graham, an author of books on Lisp, suggested an approach to filtering spam in his on-line article, “A Plan for Spam”. I took his approach for generating probabilities associated with words, altered it slightly and proposed a Bayesian calculation for dealing with words that hadn't appeared very often. Then I suggested an approach based on the chi-square distribution for combining the individual word probabilities into a combined probability (actually a pair of probabilities—see below) representing an e-mail. Finally, Tim Peters of the Spambayes Project proposed a way of generating a particularly useful spamminess indicator based on the combined probabilities. All along the way the work was guided by ongoing testing of embodiments written in Python by Tim Peters for Spambayes and in C by Greg Louis of the Bogofilter Project. The testing was done by a number of people involved with those projects.

We will assume the existence of a body of e-mails (the corpus) for training, together with software capable of parsing each e-mail into its constituent words. We will further assume that each training e-mail has been classified manually as either “ham” (the e-mail you want to read) or “spam” (the e-mail you don't). We will use this data and software to train our system by generating a probability for each word that represents its spamminess.

For each word that appears in the corpus, we calculate:

*b*(*w*) = (the number of spam e-mails containing the word*w*) / (the total number of spam e-mails).*g*(*w*) = (the number of ham e-mails containing the word*w*) / (the total number of ham e-mails).*p*(*w*) =*b*(*w*) / (*b*(*w*) +*g*(*w*))

*p*(*w*) can be roughly
interpreted as the probability that a randomly chosen e-mail
containing word *w* will be a spam.
Spam-filtering programs can compute
*p*(*w*) for every word in an
e-mail and use that information as the basis for further
calculations to determine whether the e-mail is ham or spam.

However, there is one notable wrinkle. In the real world, a particular person's inbox may contain 10% spam or 90% spam. Obviously, this will have an impact on the probability that, for that individual, an e-mail containing a given word will be spam or ham.

That impact is ignored in the calculation above. Rather, that
calculation, in effect, approximates the probabilitily that a
randomly chosen e-mail containing *w* would be
spam in a world where half the e-mails were spams and half were
hams. The merit of that approach is based on the assumption that we
don't want it to be harder or easier for an e-mail to be classified
as spam just because of the relative proportion of spams and hams
we happen to receive. Rather, we want e-mails to be judged purely
based on their own characteristics. In practice, this assumption
has worked out quite well.

There is a problem with probabilities calculated as above
when words are very rare. For instance, if a word appears in
exactly one e-mail, and it is a spam, the calculated
*p*(*w*) is 1.0. But clearly
it is not absolutely certain that all future e-mail containing that
word will be spam; in fact, we simply don't have enough data to
know the real probability.

Bayesian statistics gives us a powerful technique for dealing with such cases. This branch of statistics is based on the idea that we are interested in a person's degree of belief in a particular event; that is the Bayesian probability of the event.

When exactly one e-mail contains a particular word and that e-mail is spam, our degree of belief that the next time we see that word it will be in a spam is not 100%. That's because we also have our own background information that guides us. We know from experience that virtually any word can appear in either a spam or non-spam context, and that one or a handful of data points is not enough to be completely certain we know the real probability. The Bayesian approach lets us combine our general background information with the data we have collected for a word in such a way that both aspects are given their proper importance. In this way, we determine an appropriate degree of belief about whether, when we see the word again, it will be in a spam.

We calculate this degree of belief,
*f*(*w*), as follows:

where:

*s*is the strength we want to give to our background information.*x*is our assumed probability, based on our general backround information, that a word we don't have any other experience of will first appear in a spam.*n*is the number of e-mails we have received that contain word*w*.

This gives us the convenient use of *x* to
represent our assumed probability from background information and
*s* as the strength we will give that
assumption. In practice, the values for *s* and
*x* are found through testing to optimize
performance. Reasonable starting points are 1 and .5 for
*s* and *x*,
respectively.

We will use *f*(*w*)
rather than *p*(*w*) in our
calculations so that we are working with reasonable probabilities
rather than the unrealistically extreme values that can often occur
when we don't have enough data. This formula gives us a simple way
of handling the case of no data at all; in that case,
*f*(*w*) is exactly our
assumed probability based on background information.

Those who are interested in the derivation of the above formula, read on; others may want to skip down to the next section.

If you are already familiar with the principles of Bayesian statistics, you will probably have no trouble understanding the derivation. Otherwise, I suggest that you read sections 1 and 2 of David Heckerman's “A Tutorial on Learning with Bayesian Networks” (see Resources) before continuing.

The formula above is based on assuming that spam/ham
classification for e-mails containing word *w*
is a binomial random variable with a beta distribution prior. We
calculate the posterior expectation after incorporating the
observed data. We are going to do a test that is the equivalent of
the “experiment” of tossing a coin multiple times to see whether
it appears to be biased. There are *n* trials.
If we were tossing a coin, each coin toss would be a trial, and
we'd be counting the number of heads. But in our case, the trial is
looking at the next e-mail in the training corpus that contains the
word “porn” and seeing whether the e-mail it contains is a spam.
If it is, the experiment is considered to have been successful.
Clearly, it's a binomial experiment: there are two values, yes or
no. And they are independent: the fact that one e-mail contains
“porn” isn't correlated to the question of whether the next one
will. Now, it happens that if you have a binomial experiment and
assume a beta distribution for the prior, you can express the
expectation that the *n*th + 1 trial will be
successful as shown in Equation 2.

*q*is the number of successes.*n*is the number of trials.*u*and*v*are the parameters of the beta distribution.

We want to show that Equation 1 is equivalent to Equation 2. First perform the following substitutions:

*s* = *u* +
*v* *s* *
*x* = *u*

The next step is to replace *q* with
*n* *
*p*(*w*). We have already
discussed the fact that
*p*(*w*) is an approximation
of the probability that a randomly chosen e-mail containing
*w* is spam in an imaginary world where there
are the same number of spams as hams. So *n* *
*p*(*w*) approximates the
count of spams containing *w* in that world.
This approximates the count of successes in our experiment and is
therefore the equivalent of *q*. This completes
our demonstration of the equivalence of Equations 1 and 2.

In testing, replacing
*p*(*w*) with
*f*(*w*) in all calculations
where *p*(*w*) would
otherwise be used, has uniformly resulted in more reliable spam/ham
classification.

As Linux continues to play an ever increasing role in corporate data centers and institutions, ensuring the integrity and protection of these systems must be a priority. With 60% of the world's websites and an increasing share of organization's mission-critical workloads running on Linux, failing to stop malware and other advanced threats on Linux can increasingly impact an organization's reputation and bottom line.

Sponsored by Bit9

Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.

In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.

Sponsored by Storix

## Trending Topics

March 2015 Issue of Linux Journal: System Administration | Mar 02, 2015 |

March 2015 Video Preview | Mar 02, 2015 |

High-Availability Storage with HA-LVM | Feb 26, 2015 |

DNSMasq, the Pint-Sized Super Dæmon! | Feb 24, 2015 |

Localhost DNS Cache | Feb 23, 2015 |

Days Between Dates: the Counting | Feb 19, 2015 |

- High-Availability Storage with HA-LVM
- DNSMasq, the Pint-Sized Super Dæmon!
- March 2015 Issue of Linux Journal: System Administration
- Localhost DNS Cache
- Real-Time Rogue Wireless Access Point Detection with the Raspberry Pi
- Days Between Dates: the Counting
- The Usability of GNOME
- PostgreSQL, the NoSQL Database
- Linux for Astronomers
- You're the Boss with UBOS

## Comments

## i dont understand it "Serve

i dont understand it

"Serve from the cache if it is younger than $cachetime"

whats it

## this entry

this is nice entry thanks for it

## Combining the probabilities

Can we use the Fisher's method for combining the probabilities of different parameters in Fraud Domain also.

Regards

sumit

## Here are some scientific

Here are some scientific approaches to filter out the spam in the e-mails. The probability of some particular words appears repeatedly in spam mails are used to identify whether the mail is a spam or not. Bayesian spam filtering method is the most discussed and used in the complex process of spam filtering. This is method is widely adopted by the commercial spam filters available today. But now day’s spammers are using other techniques like Bayesian poisoning to reduce the effectiveness of this method. This subject needs a wide discussion to find out a perfect technique in spam filtering. order fulfillment

## spam code

To create this caching you would put some code like the following on the top of your PHP page.

$cachefile = 'caching_folder/cachedpage.html';

$cachetime = 30;

// Serve from the cache if it is younger than $cachetime

if (file_exists($cachefile) && time() - $cachetime < filemtime($cachefile)) {

include($cachefile);

exit;

}

ob_start(); // Start the output buffer

## Great

This is really great info on Spam. I was hunting for this. This is a one the best service provider. Fine information, many thanks to the author. It is puzzling to me now, but in general, the usefulness and importance is overwhelming. Very much thanks again and good luck! regards fast weight loss

## Anti-spam solution

I forgot about spam problem when I started using Gafana.com -it is 100% effective, no false positives, no spam.. Not really expensive, extremely helpful. So, spam is not a problem for me now.

## Anti-spam solution

I forgot about spam problem when I started using Gafana.com -it is 100% effective, no false positives, no spam.. Not really expensive, extremely helpful. So, spam is not a problem for me now.

## Hypothesis - f(w)s NOT in a uniform distribution??

I guess the hypothesis should state ``The f(w)s are accurate, and the present e-mail is a random collection of words, each independent of the others, such that the f(w)s ARE in a uniform distribution.''

Is it right?

## If we CAN show that the data

If we CAN show that the data ARE a random distribution of noise, then the null hypothesis stands and our test hypothesis fails. So the name of the game becomes trying to prove that the null hypothesis is correct. If we fail to prove the data is random, then we are supporting the hypothesis that the data is uniformly distributed (in turn, deducing a way to classify the data).

## Spam Keywords

I've read all of the book Ending Spam as well as Mr Graham's APlan for spam but i have a problem and i was wondering if anyone can point me to the correct direction. I'm currently doing my senior project and i'm desighing a spam filter but since the corpus of spam and ham e-mails that i have is not big enough i cannot create a keyword dictionary where each word is carrying a weight of how spam it is or not using this mathematical theories. My question is if you know where i can find a ready keyword list where each word is cqrrying a weight?

## The closest thing I've found

The closest thing I've found is a database of known spam messages which have been forwarded to site by the general public.

You can download the raw message files via ftp by going to:

www.spamarchive.org

I don't think you'll find any pre-weighted word lists available for download (not publicly anyhow).

Hope this helps.

:)

## What does L stand for in the Fisher-Robinson Inverse Chi-Square

What does L stand for in the Fisher-Robinson Inverse Chi-Square?

In the text above it says "First, calculate -2ln p1 * p2 * ... * pn.", but what is LN? Does it stand for Lexicon Number? Or does the letter L have a greater significance? E.g multiply N by L. I am almost there at getting this understood, any suggestions welcome.

## 'ln' means...

'ln' is for natural logarithm. If you are using the Python code from this article, you would do something like,

import math

def product(values):

....return reduce(lambda x, y: x*y, values)

def chiCombined(probs):

....prod = product(probs)

....return chi2P(-2*math.log(prod) , 2*len(probs))

print chiCombined([0.9, .2, .21, .89, .2, .78])

=>0.572203878688

print chiCombined([0.2, .2, .01, .79, .2, .58])

=>0.0594128323345

print chiCombined([0.7, .89, .71, .79, .972, .68])

=>0.996012078132

/Chris

## Thanks + 'underflowing to zero' tip

Thanks for your reply Chris. I did a good few searches on Google but could not find any numeric examples of this on the web. So, its a real help to see some numbers to test my own code against.

Whilst trying to find out more about logs, i discovered a good web page for the 'mathematically challenged' programer: http://www.gigamonkeys.com/book/practical-a-spam-filter.html . In that article, the author suggests getting the log of each individual probability first, then multiplying them together. This, apparently, can prevent the result from underflowing to zero.

I'm writing my spam filter in PHP but most examples on the topic seem to be in either LISP or Python (which have quite a similar syntax to PHP in many ways). So, when I'm confident that i've done it right, I'll put a PHP version online.

Thanks, once again, for all those who have shared their knowledge to rid the world of spam; Chris Steinbach, Gary Robinson, Paul Graham, Brian Burton, Jonathan Zdziarski, Bill Yerazunis, Peter Seibel amongst many others.

## sum the logs not multiply

Ooops. I'm really showing my ignorance of maths and messing up this beautiful webpage in the process! Sorry folks. To correct my previous comment, the article suggests to sum the logs of each probability (I mistakenly said multiply them) rather than multiplying all the probabilities and then taking the log.