A Statistical Approach to the Spam Problem

Using Bayesian statistics to detect an e-mail's spamminess.

Combining the Probabilities

At this point, we can compute a probability, f(w), for each word that may appear in a new e-mail. This probability reflects our degree of belief, based on our background knowledge and on data in our training corpus, that an e-mail chosen randomly from those that contain w will be a spam.

So each e-mail is represented by a set of probabilities. We want to combine these individual probabilities into a single combined indicator of spamminess or hamminess for the e-mail as a whole.

In the field of statistics known as meta-analysis, probably the most common way of combining probabilities is due to R. A. Fisher. If we have a set of probabilities, p1, p2, ..., pn, we can do the following. First, calculate -2 ln(p1 * p2 * ... * pn). Then, consider the result to have a chi-square distribution with 2n degrees of freedom, and use a chi-square table to compute the probability of getting a result as extreme, or more extreme, than the one calculated. This “combined” probability meaningfully summarizes all the individual probabilities.
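As a concrete illustration (a minimal sketch, not code from this article), the calculation takes only a few lines of Python; this version assumes SciPy is available to supply the chi-square tail probability:

import math
from scipy.stats import chi2  # assumption: SciPy is installed

def fisher_combine(probs):
    # Fisher's method: -2 ln(p1 * p2 * ... * pn), computed as a sum of logs,
    # follows a chi-square distribution with 2n degrees of freedom under the null.
    statistic = -2.0 * sum(math.log(p) for p in probs)
    # Probability of a result as extreme, or more extreme, than the one observed
    return chi2.sf(statistic, 2 * len(probs))

print(fisher_combine([0.01, 0.02, 0.5]))  # mostly low p-values give a small combined probability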

The initial set of probabilities are considered to be with respect to a null hypothesis. For instance, if we make the null hypothesis assumption that a coin is unbiased, and then flip it ten times and get ten heads in a row, there would be a resultant probability of (1/2)^10 = 1/1024. It would be an unlikely event with respect to the null hypothesis, but of course, if the coin actually were to be biased, it wouldn't necessarily be quite so unlikely to have an extreme outcome. Therefore, we might reject the null hypothesis and instead choose to accept the alternate hypothesis that the coin is biased.

We can use the same calculation to combine our f(w)s. The f(w)s are not real physical probabilities. Rather, they can be thought of as our best guess about those probabilities. But in traditional meta-analytical uses of the Fisher calculation, they are not known to be the real probabilities either. Instead, they are assumed to be the real probabilities, as part of the null hypothesis.

Let our null hypothesis be “The f(w)s are accurate, and the present e-mail is a random collection of words, each independent of the others, such that the f(w)s are in a uniform distribution.” Now suppose we have a word, “Python”, for which f(w) is .01. We believe an e-mail containing it has only a 1% chance of being a spam. Then to the extent that our belief is correct, an unlikely event occurred; one with a probability of .01. Similarly, every word in the e-mail has a probability associated with it. Very spammy words like “porn” might have probabilities of .99 or greater.

Then we use the Fisher calculation to compute an overall probability for the whole set of words. If the e-mail is a ham, it is likely that it will have a number of very low probabilities and relatively few very high probabilities to balance them, with the result that the Fisher calculation will give a very low combined probability. This will allow us to reject the null hypothesis and assume instead the alternative hypothesis that the e-mail is a ham.

Let us call this combined probability H:

H = C⁻¹( -2 ln(f(w1) * f(w2) * ... * f(wn)), 2n )          (Equation 3)

where C⁻¹() is the inverse chi-square function, used to derive a p-value from a chi-square-distributed random variable, and n is the number of words in the e-mail.
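For readers who want to see the whole calculation end to end, here is a minimal self-contained sketch (the names are illustrative, not taken from the article's own code; the comments section below refers to the article's chi2P() function, which serves the same purpose). It relies on the closed-form tail probability of a chi-square variable with an even number of degrees of freedom:

import math

def inverse_chi_square(chi, df):
    # Tail probability P(X >= chi) for a chi-square variable with an even
    # number of degrees of freedom df; this plays the role of C⁻¹() above.
    assert df % 2 == 0, "this closed form holds only for even degrees of freedom"
    m = chi / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combined_probability(fws):
    # H for one e-mail, given the f(w) values of its words. Summing the logs
    # keeps the product of many small f(w)s from underflowing to zero.
    chi = -2.0 * sum(math.log(f) for f in fws)
    return inverse_chi_square(chi, 2 * len(fws))

print(combined_probability([0.01, 0.02, 0.03, 0.6, 0.4]))  # hammy words give an H near 0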

Of course, we know from the outset the null hypothesis is false. Virtually no e-mail is actually a random collection of words unbiased with regard to hamminess or spamminess; an e-mail usually has a telling number of words of one type or the other. And certainly words are not independent. E-mails containing “sex” are more likely to contain “porn”, and e-mails containing the name of a friend who programs Python are more likely also to contain “Python”. And, the f(w)s are not in a uniform distribution. But for purposes of spam detection, those departures from reality usually work in our favor. They cause the probabilities to have a nonrandom tendency to be either high or low in a given e-mail, giving us a strong statistical basis to reject the null hypothesis in favor of one alternative hypothesis or the other. Either the e-mail is a ham or it is a spam.

So the null hypothesis is a kind of straw dog. We set it up only in order for it to be knocked down in one direction or the other.

It may be worthwhile to mention here a key difference between this approach and many other combining approaches that assume the probabilities are independent. This difference applies, for example, to such approaches as the “Bayesian chain rule” and “Naïve Bayesian classification”. Those approaches have been tested in head-to-head combat against the approach described here using large numbers of human-classified e-mails and didn't fare as well (i.e., didn't agree as consistently with human judgment).

The calculations for those approaches are technically invalid due to requiring an independence of the data points that is not actually present. That problem does not occur when using the Fisher technique because the validity of the calculations doesn't depend on the data being independent. The Fisher calculation is structured such that we are rejecting a null hypothesis that includes independence in favor of one of the two alternative hypotheses that are of interest to us. These alternative hypotheses each involve non-independence, such as the correlation between “sex” and “porn”, as part of the phenomenon that creates the extreme combined probabilities.

These combined probabilities are only probabilities under the assumption that the null hypothesis—known to be almost certainly false—is true. In the real world, the Fisher combined probability is not a probability at all, but rather an abstract indicator of how much (or little) we should be inclined to accept the null hypothesis. It is not meant to be a true probability when the null hypothesis is false, so the fact that it isn't doesn't cause computational problems for us. The naïve Bayesian classifier is known to be resistant to distortions due to a lack of independence, but it doesn't avoid them entirely.

Whether these differing responses to a lack of independence in the data are responsible for the superior performance of Fisher in spam/ham classification testing to date is not known, but it is something to consider as one possible factor.

The individual f(w)s are only approximations to real probabilities (i.e., when there is very little data about a word, our best guess about its spam probability as given by f(w) may not reflect its actual reality). But if you consider how f(w) is calculated, you will see that this uncertainty diminishes asymptotically as f(w) approaches 0 or 1, because such extreme values can be achieved only by words that have occurred quite frequently in the training data and either almost always appear in spams or almost always appear in hams. And, conveniently, it's the numbers near 0 that have by far the greatest impact on the calculations. To see this, consider the influence on the product .01 * .5 if the first term is changed to .001, vs. the influence on the product .51 * .5 if the first term is changed to .501, and recall that the Fisher technique is based on multiplying the probabilities. So our null hypothesis is violated by the fact that the f(w)s are not completely reliable, but in a way that matters vanishingly little for the words of the most interest in the search for evidence of hamminess: the words with f(w) near 0.
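To put numbers on that comparison, here is a quick illustrative check of how much each change shifts the -2 ln(product) statistic that the Fisher calculation feeds into the chi-square test:

import math

# Each probability p contributes -2 * ln(p) to the Fisher statistic.
for before, after in [(0.01, 0.001), (0.51, 0.501)]:
    change = -2 * math.log(after) - (-2 * math.log(before))
    print(before, "->", after, "shifts the statistic by", round(change, 4))

# 0.01 -> 0.001 shifts the statistic by 4.6052
# 0.51 -> 0.501 shifts the statistic by 0.0356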

Alert readers will wonder at this point: “Okay, I understand that these Fisher calculations seem to make sense for hammy words with correspondingly near-0 probabilities, but what about spammy words with probabilities near 1?” Good question! Read on for the answer that will complete our discussion.

______________________

Comments

Combining the probabilities

Anonymous's picture

Can we use Fisher's method for combining the probabilities of different parameters in the fraud domain also?

Regards
sumit

Hypothesis - f(w)s NOT in a uniform distribution??

Martin Žember's picture

I guess the hypothesis should state ``The f(w)s are accurate, and the present e-mail is a random collection of words, each independent of the others, such that the f(w)s ARE in a uniform distribution.''

Is it right?

If we CAN show that the data

Anonymous's picture

If we CAN show that the data ARE a random distribution of noise, then the null hypothesis stands and our test hypothesis fails. So the name of the game becomes trying to prove that the null hypothesis is correct. If we fail to prove the data is random, then we are supporting the hypothesis that the data is uniformly distributed (in turn, deducing a way to classify the data).

Spam Keywords

polis kasi's picture

I've read all of the book Ending Spam as well as Mr. Graham's A Plan for Spam, but I have a problem and I was wondering if anyone can point me in the correct direction. I'm currently doing my senior project and I'm designing a spam filter, but since the corpus of spam and ham e-mails that I have is not big enough, I cannot create a keyword dictionary where each word carries a weight of how spammy it is or not using these mathematical theories. My question is: does anyone know where I can find a ready-made keyword list where each word is carrying a weight?

The closest thing I've found

Anonymous's picture

The closest thing I've found is a database of known spam messages which have been forwarded to the site by the general public.

You can download the raw message files via ftp by going to:
www.spamarchive.org

I don't think you'll find any pre-weighted word lists available for download (not publicly anyhow).

Hope this helps.

:)

What does L stand for in the Fisher-Robinson Inverse Chi-Square

Johnnie Walker's picture

What does L stand for in the Fisher-Robinson Inverse Chi-Square?
In the text above it says, "First, calculate -2 ln(p1 * p2 * ... * pn)", but what is LN? Does it stand for Lexicon Number? Or does the letter L have a greater significance, e.g., multiply N by L? I am almost there at getting this understood; any suggestions welcome.

'ln' means...

ChrisSteinbach's picture

'ln' stands for natural logarithm. If you are using the Python code from this article, you would do something like:


import math

def product(values):
    return reduce(lambda x, y: x * y, values)

def chiCombined(probs):
    prod = product(probs)
    # chi2P() is the inverse chi-square function defined in the article's code
    return chi2P(-2 * math.log(prod), 2 * len(probs))

print chiCombined([0.9, .2, .21, .89, .2, .78])
# => 0.572203878688
print chiCombined([0.2, .2, .01, .79, .2, .58])
# => 0.0594128323345
print chiCombined([0.7, .89, .71, .79, .972, .68])
# => 0.996012078132

/Chris

Thanks + 'underflowing to zero' tip

Johnnie Walker's picture

Thanks for your reply, Chris. I did a good few searches on Google but could not find any numeric examples of this on the web, so it's a real help to see some numbers to test my own code against.

Whilst trying to find out more about logs, I discovered a good web page for the 'mathematically challenged' programmer: http://www.gigamonkeys.com/book/practical-a-spam-filter.html . In that article, the author suggests getting the log of each individual probability first, then multiplying them together. This, apparently, can prevent the result from underflowing to zero.

I'm writing my spam filter in PHP, but most examples on the topic seem to be in either LISP or Python (which have quite a similar syntax to PHP in many ways). So, when I'm confident that I've done it right, I'll put a PHP version online.

Thanks, once again, to all those who have shared their knowledge to rid the world of spam: Chris Steinbach, Gary Robinson, Paul Graham, Brian Burton, Jonathan Zdziarski, Bill Yerazunis, Peter Seibel, amongst many others.

sum the logs not multiply

Johnnie Walker's picture

Oops. I'm really showing my ignorance of maths and messing up this beautiful web page in the process! Sorry, folks. To correct my previous comment, the article suggests summing the logs of each probability (I mistakenly said multiplying them) rather than multiplying all the probabilities and then taking the log.
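For anyone following along, a minimal sketch of that corrected approach (the function name is illustrative, not taken from the article or this thread):

import math

def fisher_statistic(probs):
    # Summing the logs is mathematically the same as taking the log of the
    # product, but it cannot underflow to zero the way the raw product can
    # when an e-mail contributes many small probabilities.
    return -2.0 * sum(math.log(p) for p in probs)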
