A Statistical Approach to the Spam Problem
March 1st, 2003 by Gary Robinson in
Most people are spending significant time daily on the task of distinguishing spam from useful e-mail. I don't think I'm alone in feeling that we have better things to do. Spam-filtering software can help.
This article discusses one of many possible mathematical foundations for a key aspect of spam filtering—generating an indicator of “spamminess” from a collection of tokens representing the content of an e-mail.
The approach described here truly has been a distributed effort in the best open-source tradition. Paul Graham, an author of books on Lisp, suggested an approach to filtering spam in his on-line article, “A Plan for Spam”. I took his approach for generating probabilities associated with words, altered it slightly and proposed a Bayesian calculation for dealing with words that hadn't appeared very often. Then I suggested an approach based on the chi-square distribution for combining the individual word probabilities into a combined probability (actually a pair of probabilities—see below) representing an e-mail. Finally, Tim Peters of the Spambayes Project proposed a way of generating a particularly useful spamminess indicator based on the combined probabilities. All along the way the work was guided by ongoing testing of embodiments written in Python by Tim Peters for Spambayes and in C by Greg Louis of the Bogofilter Project. The testing was done by a number of people involved with those projects.
We will assume the existence of a body of e-mails (the corpus) for training, together with software capable of parsing each e-mail into its constituent words. We will further assume that each training e-mail has been classified manually as either “ham” (the e-mail you want to read) or “spam” (the e-mail you don't). We will use this data and software to train our system by generating a probability for each word that represents its spamminess.
For each word that appears in the corpus, we calculate:
b(w) = (the number of spam e-mails containing the word w) / (the total number of spam e-mails).
g(w) = (the number of ham e-mails containing the word w) / (the total number of ham e-mails).
p(w) = b(w) / (b(w) + g(w))
p(w) can be roughly interpreted as the probability that a randomly chosen e-mail containing word w will be a spam. Spam-filtering programs can compute p(w) for every word in an e-mail and use that information as the basis for further calculations to determine whether the e-mail is ham or spam.
However, there is one notable wrinkle. In the real world, a particular person's inbox may contain 10% spam or 90% spam. Obviously, this will have an impact on the probability that, for that individual, an e-mail containing a given word will be spam or ham.
That impact is ignored in the calculation above. Rather, that calculation, in effect, approximates the probabilitily that a randomly chosen e-mail containing w would be spam in a world where half the e-mails were spams and half were hams. The merit of that approach is based on the assumption that we don't want it to be harder or easier for an e-mail to be classified as spam just because of the relative proportion of spams and hams we happen to receive. Rather, we want e-mails to be judged purely based on their own characteristics. In practice, this assumption has worked out quite well.
There is a problem with probabilities calculated as above when words are very rare. For instance, if a word appears in exactly one e-mail, and it is a spam, the calculated p(w) is 1.0. But clearly it is not absolutely certain that all future e-mail containing that word will be spam; in fact, we simply don't have enough data to know the real probability.
Bayesian statistics gives us a powerful technique for dealing with such cases. This branch of statistics is based on the idea that we are interested in a person's degree of belief in a particular event; that is the Bayesian probability of the event.
When exactly one e-mail contains a particular word and that e-mail is spam, our degree of belief that the next time we see that word it will be in a spam is not 100%. That's because we also have our own background information that guides us. We know from experience that virtually any word can appear in either a spam or non-spam context, and that one or a handful of data points is not enough to be completely certain we know the real probability. The Bayesian approach lets us combine our general background information with the data we have collected for a word in such a way that both aspects are given their proper importance. In this way, we determine an appropriate degree of belief about whether, when we see the word again, it will be in a spam.
We calculate this degree of belief, f(w), as follows:

Equation 1
where:
s is the strength we want to give to our background information.
x is our assumed probability, based on our general backround information, that a word we don't have any other experience of will first appear in a spam.
n is the number of e-mails we have received that contain word w.
This gives us the convenient use of x to represent our assumed probability from background information and s as the strength we will give that assumption. In practice, the values for s and x are found through testing to optimize performance. Reasonable starting points are 1 and .5 for s and x, respectively.
We will use f(w) rather than p(w) in our calculations so that we are working with reasonable probabilities rather than the unrealistically extreme values that can often occur when we don't have enough data. This formula gives us a simple way of handling the case of no data at all; in that case, f(w) is exactly our assumed probability based on background information.
Those who are interested in the derivation of the above formula, read on; others may want to skip down to the next section.
If you are already familiar with the principles of Bayesian statistics, you will probably have no trouble understanding the derivation. Otherwise, I suggest that you read sections 1 and 2 of David Heckerman's “A Tutorial on Learning with Bayesian Networks” (see Resources) before continuing.
The formula above is based on assuming that spam/ham classification for e-mails containing word w is a binomial random variable with a beta distribution prior. We calculate the posterior expectation after incorporating the observed data. We are going to do a test that is the equivalent of the “experiment” of tossing a coin multiple times to see whether it appears to be biased. There are n trials. If we were tossing a coin, each coin toss would be a trial, and we'd be counting the number of heads. But in our case, the trial is looking at the next e-mail in the training corpus that contains the word “porn” and seeing whether the e-mail it contains is a spam. If it is, the experiment is considered to have been successful. Clearly, it's a binomial experiment: there are two values, yes or no. And they are independent: the fact that one e-mail contains “porn” isn't correlated to the question of whether the next one will. Now, it happens that if you have a binomial experiment and assume a beta distribution for the prior, you can express the expectation that the nth + 1 trial will be successful as shown in Equation 2.

Equation 2
q is the number of successes.
n is the number of trials.
u and v are the parameters of the beta distribution.
We want to show that Equation 1 is equivalent to Equation 2. First perform the following substitutions:
s = u + v s * x = u
The next step is to replace q with n * p(w). We have already discussed the fact that p(w) is an approximation of the probability that a randomly chosen e-mail containing w is spam in an imaginary world where there are the same number of spams as hams. So n * p(w) approximates the count of spams containing w in that world. This approximates the count of successes in our experiment and is therefore the equivalent of q. This completes our demonstration of the equivalence of Equations 1 and 2.
In testing, replacing p(w) with f(w) in all calculations where p(w) would otherwise be used, has uniformly resulted in more reliable spam/ham classification.
At this point, we can compute a probability, f(w), for each word that may appear in a new e-mail. This probability reflects our degree of belief, based on our background knowledge and on data in our training corpus, that an e-mail chosen randomly from those that contain w will be a spam.
So each e-mail is represented by a set of of probabilities. We want to combine these individual probabilities into an overall combined indicator of spamminess or hamminess for the e-mail as a whole.
In the field of statistics known as meta-analysis, probably the most common way of combining probabilities is due to R. A. Fisher. If we have a set probabilities, p1, p2, ..., pn, we can do the following. First, calculate -2ln p1 * p2 * ... * pn. Then, consider the result to have a chi-square distribution with 2n degrees of freedom, and use a chi-square table to compute the probability of getting a result as extreme, or more extreme, than the one calculated. This “combined” probability meaningfully summarizes all the individual probabilities.
The initial set of probabilities are considered to be with respect to a null hypothesis. For instance, if we make the null hypothesis assumption that a coin is unbiased, and then flip it ten times and get ten heads in a row, there would be a resultant probability of (1/2)10 = 1/1024. It would be an unlikely event with respect to the null hypothesis, but of course, if the coin actually were to be biased, it wouldn't necessarily be quite so unlikely to have an extreme outcome. Therefore, we might reject the null hypothesis and instead choose to accept the alternate hypothesis that the coin is biased.
We can use the same calculation to combine our f(w)s. The f(w)s are not real physical probabilities. Rather, they can be thought of as our best guess about those probabilities. But in traditional meta-anaytical uses of the Fisher calculation they are not known to be the real probabilities either. Instead, they are assumed to be, as part of the null hypothesis.
Let our null hypothesis be “The f(w)s are accurate, and the present e-mail is a random collection of words, each independent of the others, such that the f(w)s are not in a uniform distribution.” Now suppose we have a word, “Python”, for which f(w) is .01. We believe it occurs in spams only 1% of the time. Then to the extent that our belief is correct, an unlikely event occurred; one with a probability of .01. Similarly, every word in the e-mail has a probability associated with it. Very spammy words like porn might have probabilities of .99 or greater.
Then we use the Fisher calculation to compute an overall probability for the whole set of words. If the e-mail is a ham, it is likely that it will have a number of very low probabilities and relatively few very high probabilities to balance them, with the result that the Fisher calculation will give a very low combined probability. This will allow us to reject the null hypothesis and assume instead the alternative hypothesis that the e-mail is a ham.
Let us call this combined probability H:

Equation 3
where C-1() is the inverse chi-square function, used to derive a p-value from a chi-square-distributed random variable.
Of course, we know from the outset the null hypothesis is false. Virtually no e-mail is actually a random collection of words unbiased with regard to hamminess or spamminess; an e-mail usually has a telling number of words of one type or the other. And certainly words are not independent. E-mails containing “sex” are more likely to contain “porn”, and e-mails containing the name of a friend who programs Python are more likely also to contain “Python”. And, the f(w)s are not in a uniform distribution. But for purposes of spam detection, those departures from reality usually work in our favor. They cause the probabilities to have a nonrandom tendency to be either high or low in a given e-mail, giving us a strong statistical basis to reject the null hypothesis in favor of one alternative hypothesis or the other. Either the e-mail is a ham or it is a spam.
So the null hypothesis is a kind of straw dog. We set it up only in order for it to be knocked down in one direction or the other.
It may be worthwhile to mention here a key difference between this approach and many other combining approaches that assume the probabilities are independent. This difference applies, for example, to such approaches as the “Bayesian chain rule” and “Naïve Bayesian classification”. Those approaches have been tested in head-to-head combat against the approach described here using large numbers of human-classified e-mails and didn't fare as well (i.e., didn't agree as consistently with human judgment).
The calculations for those approaches are technically invalid due to requiring an independence of the data points that is not actually present. That problem does not occur when using the Fisher technique because the validity of the calculations doesn't depend on the data being independent. The Fisher calculation is structured such that we are rejecting a null hypothesis that includes independence in favor of one of the two alternative hypotheses that are of interest to us. These alternative hypotheses each involve non-independence, such as the correlation between “sex” and “porn”, as part of the phenomenon that creates the extreme combined probabilities.
These combined probabilities are only probabilities under the assumption that the null hypothesis—known to be almost certainly false—is true. In the real world, the Fisher combined probability is not a probability at all, but rather an abstract indicator of how much (or little) we should be inclined to accept the null hypothesis. It is not meant to be a true probability when the null hypothesis is false, so the fact that it isn't doesn't cause computational problems for us. The naïve Bayesian classifier is known to be resistant to distortions due to a lack of independence, but it doesn't avoid them entirely.
Whether these differing responses to a lack of independence in the data are responsible for the superior performance of Fisher in spam/ham classification testing to date is not known, but it is something to consider as one possible factor.
The individual f(w)s are only approximations to real probabilities (i.e., when there is very little data about a word, our best guess about its spam probability as given by f(w) may not reflect its actual reality). But if you consider how f(w) is calculated, you will see that this uncertainty diminishes asymptotically as f(w) approaches 0 or 1, because such extreme values can be achieved only by words that have occurred quite frequently in the training data and either almost always appear in spams or almost always appear in hams. And, conveniently, it's the numbers near 0 that have by far the greatest impact on the calculations. To see this, consider the influence on the product .01 * .5 if the first term is changed to .001, vs. the influence on the product .51 *.5 if the first term is changed to .501, and recall that the Fisher technique is based on multiplying the probabilities. So our null hypothesis is violated by the fact that the f(w)s are not completely reliable, but in a way that matters vanishingly little for the words of the most interest in the search for evidence of hamminess: the words with f(w) near 0.
Alert readers will wonder at this point: “Okay, I understand that these Fisher calculations seem to make sense for hammy words with correspondingly near-0 probabilities, but what about spammy words with probabilities near 1?” Good question! Read on for the answer that will complete our discussion.
The calculation described above is sensitive to evidence of hamminess, particularly when it's in the form of words that show up in far more hams than spams. This is because probabilities near 0 have a great influence on the product of probabilities, which is at the heart of Fisher's calculation. In fact, there is a 1971 theorem that says the Fisher technique is, under certain circumstances, as powerful as any technique can possibly be for revealing underlying trends in a product of possibilities (see Resources).
However, very spam-oriented words have f(w)s near 1, and therefore have a much less significant effect on the calculations. Now, it might be assumed that this is a good thing. After all, for many people, misclassifying a good e-mail as spam seems a lot worse than misclassifying a bad e-mail as a ham, because no great harm is done if a single spam gets through but significant harm might result from a single good e-mail being wrongly classified as spam and therefore ignored by the recipient. So it may seem good to be sensitive to indications of hamminess and less sensitive to indications of spamminess.
However, there are ways to deal with this problem that in real-world testing do not add a noticeable tendency to wrongly classify good e-mail as spam, but do significantly reduce the tendency to misclassify spam as ham.
The most effective technique that has been identified in recent testing efforts follows.
First, “reverse” all the probabilities by subtracting them from 1 (that is, for each word, calculate 1 - f(w)). Because f(w) represents the probability that a randomly chosen e-mail from the set of e-mails containing w is a spam, 1 - f(w) represents the probability that such a randomly chosen e-mail will be a ham.
Now do the same Fisher calculation as before, but on the (1 - f(w))s rather than on the f(w)s. This will result in near-0 combined probabilities, in rejection of the null hypothesis, when a lot of very spammy words are present. Call this combined probability S.
Now calculate:

Equation 4
I is an indicator that is near 1 when the preponderance of the evidence is in favor of the conclusion that the e-mail is spam and near 0 when the evidence points to the conclusion that it's ham. This indicator has a couple of interesting characteristics.
Suppose an e-mail has a number of very spammy words and also a number of very hammy words. Because the Fisher technique is sensitive to values near 0 and less sensitive to values near 1, the result might be that both S and H are very near 0. For instance, S might be on the order of .00001 and H might be on the order of .000000001. In fact, those kinds of results are not as infrequent as one might assume in real-world e-mails. One example is when a friend forwards a spam to another friend as part of an e-mail conversation about spam. In such a case, there will be strong evidence in favor of both possible conclusions.
In many approaches, such as those based on the Bayesian chain rule, the fact that there may be more spammy words than hammy words in an example will tend to make the classifier absolutely certain that the e-mail is spam. But in fact, it's not so clear; for instance, the forwarded e-mail example is not spam.
So it a useful characteristic of I that it is near .5 in such cases, just as it is near .5 when there is no particular evidence in one direction or the other. When there is significant evidence in favor of both conclusions, I takes the cautious approach. In real-world testing, human examination of these mid-valued e-mails tends to support the conclusion that they really should be classified somewhere in the middle rather than being subject to the black-or-white approach of most classifiers.
The Spambayes Project, described in Richie Hindle's article on page 52, takes advantage of this by marking e-mails with I near .5 as uncertain. This allows the e-mail recipient to give a bit more attention to e-mails that can't be classified with confidence. This lessens the chance of a good e-mail being ignored due to incorrect classification.
To date, the software using this approach is based on one word per token. Other approaches are possible, such as building a hash table of phrases. It is expected that the math described here can be employed in those contexts as well, and there is reason to believe that phrase-based systems will have performance advantages, although there is controversy about that idea. Future Linux Journal articles can be expected to cover any developments in such directions. CRM114 (see Resources) is an example of a phrase-based system that has performed very well, but at the time of this writing it hasn't been directly tested against other approaches on the same corpus. (At the time of this writing, CRM114 is using the Bayesian chain rule to combine p(w)s.)
The techniques described here have been used in projects such as Spambayes and Bogofilter to improve performance of the spam-filtering task significantly. Future developments, which may include integrating these calculations with a phrase-based approach, can be expected to achieve even better performance.
Subscribe now!
Breaking News
| Charter Trades Privacy for Pocketbook | 20 hours 43 min ago |
| SSL Glitch Unlocks Debian, Ubuntu, & Others | 1 day 20 hours ago |
| MySpace Cashes in Spam to the Tune of $234 Million | 1 day 22 hours ago |
| Google Shoos the Trustbusters Away | 2 days 18 hours ago |
Featured Video
Linux Journal Gadget Guy, Shawn Powers, takes us through installing Ubuntu on a machine running Windows with the Wubi installer.
Delicious
Digg
Reddit
Newsvine
Technorati






Gaide
On May 5th, 2008 Henny (not verified) says:
外国為替証拠金取引(FX)と日経225先物・オプション取引のトレイダーズ証券。
神戸市・三宮のオリジナルTシャツ プリント、オリジナルマグカップ制作なら、プリントショップのP-ART!
アルカリイオン水が手作りできる、さんご浄水パックはフェニックス健康オンラインショップ。
オンラインFX取引サイト『ALGORITHM TRADE FX』。投資顧問、ディーリング事業を行うアストマックスのグループ企業であるアストマックスFXが御提供。
インプラント治療の最新情報や基礎知識・メデント認定の安心インプラント歯科医院情報を掲載。
FX、為替取引、外為、オンライントレードのことならJNS。
語学留学(アメリカ 留学)やワーホリとは違う効果。使える英語が身につくインターンシップならiiP!
低コストでIP-VPN並のセキュアネットワークを提供。拠点間通信を閉域網で構築したい担当者様、VPN、ルータ、拠点間通信、閉域網ならGOLです。
在庫多数!ff11 ギル リネージュ アデナ その他多数のゲーム取扱をしております。オンラインゲームの通貨専門サイト、RMTなら青い鳥。安心の法人運営。
アダルトグッズ激安通販のアダルトマーケット。
Anti-spam solution
On April 14th, 2008 Jack (not verified) says:
I forgot about spam problem when I started using Gafana.com -it is 100% effective, no false positives, no spam.. Not really expensive, extremely helpful. So, spam is not a problem for me now.
Anti-spam solution
On April 14th, 2008 Jack (not verified) says:
I forgot about spam problem when I started using Gafana.com -it is 100% effective, no false positives, no spam.. Not really expensive, extremely helpful. So, spam is not a problem for me now.
Hypothesis - f(w)s NOT in a uniform distribution??
On May 26th, 2007 Martin Žember (not verified) says:
I guess the hypothesis should state ``The f(w)s are accurate, and the present e-mail is a random collection of words, each independent of the others, such that the f(w)s ARE in a uniform distribution.''
Is it right?
Spam Keywords
On May 25th, 2006 polis kasi (not verified) says:
I've read all of the book Ending Spam as well as Mr Graham's APlan for spam but i have a problem and i was wondering if anyone can point me to the correct direction. I'm currently doing my senior project and i'm desighing a spam filter but since the corpus of spam and ham e-mails that i have is not big enough i cannot create a keyword dictionary where each word is carrying a weight of how spam it is or not using this mathematical theories. My question is if you know where i can find a ready keyword list where each word is cqrrying a weight?
The closest thing I've found
On July 4th, 2006 Anonymous (not verified) says:
The closest thing I've found is a database of known spam messages which have been forwarded to site by the general public.
You can download the raw message files via ftp by going to:
www.spamarchive.org
I don't think you'll find any pre-weighted word lists available for download (not publicly anyhow).
Hope this helps.
:)
What does L stand for in the Fisher-Robinson Inverse Chi-Square
On February 12th, 2006 Johnnie Walker (not verified) says:
What does L stand for in the Fisher-Robinson Inverse Chi-Square?
In the text above it says "First, calculate -2ln p1 * p2 * ... * pn.", but what is LN? Does it stand for Lexicon Number? Or does the letter L have a greater significance? E.g multiply N by L. I am almost there at getting this understood, any suggestions welcome.
'ln' means...
On February 13th, 2006 ChrisSteinbach (not verified) says:
'ln' is for natural logarithm. If you are using the Python code from this article, you would do something like,
import math
def product(values):
....return reduce(lambda x, y: x*y, values)
def chiCombined(probs):
....prod = product(probs)
....return chi2P(-2*math.log(prod) , 2*len(probs))
print chiCombined([0.9, .2, .21, .89, .2, .78])
=>0.572203878688
print chiCombined([0.2, .2, .01, .79, .2, .58])
=>0.0594128323345
print chiCombined([0.7, .89, .71, .79, .972, .68])
=>0.996012078132
/Chris
wewew
On April 18th, 2008 Anonymous (not verified) says:
lose weight
weight control
weight loss
Weight loss products
Diet pills
Slimming capsule
slimming
Slimming products
Weight loss products
Thanks + 'underflowing to zero' tip
On February 21st, 2006 Johnnie Walker (not verified) says:
Thanks for your reply Chris. I did a good few searches on Google but could not find any numeric examples of this on the web. So, its a real help to see some numbers to test my own code against.
Whilst trying to find out more about logs, i discovered a good web page for the 'mathematically challenged' programer: http://www.gigamonkeys.com/book/practical-a-spam-filter.html . In that article, the author suggests getting the log of each individual probability first, then multiplying them together. This, apparently, can prevent the result from underflowing to zero.
I'm writing my spam filter in PHP but most examples on the topic seem to be in either LISP or Python (which have quite a similar syntax to PHP in many ways). So, when I'm confident that i've done it right, I'll put a PHP version online.
Thanks, once again, for all those who have shared their knowledge to rid the world of spam; Chris Steinbach, Gary Robinson, Paul Graham, Brian Burton, Jonathan Zdziarski, Bill Yerazunis, Peter Seibel amongst many others.
sum the logs not multiply
On February 21st, 2006 Johnnie Walker (not verified) says:
Ooops. I'm really showing my ignorance of maths and messing up this beautiful webpage in the process! Sorry folks. To correct my previous comment, the article suggests to sum the logs of each probability (I mistakenly said multiply them) rather than multiplying all the probabilities and then taking the log.
New information
On May 1st, 2008 Deki (not verified) says:
多言語翻訳可能。実務・技術・医薬・金融・法律の翻訳の実績多数・高品質・安心価格でのサービス実現。WEB・DTPにも対応。
実務・技術・医薬・金融・法律の翻訳なら。多言語翻訳可能。実績多数・高品質・安心価格の翻訳会社、NAIway翻訳サービス。WEB・DTPにも対応。
株式会社ジョイエスは日本全国対応可能の不動産担保ローン専門業者です。地方物件・持分・借地権・住宅ローン返済中・築古マンション・住宅購入資金等、各種不動産担保ローンを取り揃えております。
不動産ローンをお探しなら日宝へ。
日宝では不動産担保ローンにおける32年の実績と信頼があります。
不動産担保ローンで32年の実績と信頼・日宝。多様化するお客様のあらゆるニーズに迅速・適確・誠実にお応え致します。
不動産担保融資においてもお客様のあらゆるニーズにお応え致します。
不動産融資では多様化するお客様のあらゆるニーズに迅速・適確・誠実にお応え致し、32年の実績と信頼があります。
保育士 求人のことなら、保育専門の派遣会社あんだんて。
防犯カメラ・監視カメラ・ネットワークカメラのことなら。遠隔監視システム、等の設置工事や機器販売専門サイトです。
Gaide
On May 5th, 2008 Anonymous (not verified) says:
外国為替証拠金取引(FX)と日経225先物・オプション取引のトレイダーズ証券。
神戸市・三宮のオリジナルTシャツ プリント、オリジナルマグカップ制作なら、プリントショップのP-ART!
アルカリイオン水が手作りできる、さんご浄水パックはフェニックス健康オンラインショップ。
オンラインFX取引サイト『ALGORITHM TRADE FX』。投資顧問、ディーリング事業を行うアストマックスのグループ企業であるアストマックスFXが御提供。
インプラント治療の最新情報や基礎知識・メデント認定の安心インプラント歯科医院情報を掲載。
パニック障害・自律
On May 13th, 2008 agshin (not verified) says:
パニック障害・自律神経失調症をたった3ヶ月で克服したマル秘テクニックを紹介します!
送料・代引き手数料無料ハナビラタケGはフェニックス健康オンラインショップ。
高級レギンス専門店のイビザストア。お買い上げ金額5000円以上で送料無料!レギンスとカラータイツの専門店です。
リネージュ2のアデナをカンタンに購入するならこちら。PlusOneのリネージュ2 RMTページ。余ってしまったアデナの買取も。
2000件を超える実績があります。楽器レンタルOne