# A Statistical Approach to the Spam Problem

The calculation described above is sensitive to evidence of hamminess, particularly when it's in the form of words that show up in far more hams than spams. This is because probabilities near 0 have a great influence on the product of probabilities, which is at the heart of Fisher's calculation. In fact, there is a 1971 theorem that says the Fisher technique is, under certain circumstances, as powerful as any technique can possibly be for revealing underlying trends in a product of possibilities (see Resources).

However, very spam-oriented words have
*f*(*w*)s near 1, and
therefore have a much less significant effect on the calculations.
Now, it might be assumed that this is a good thing. After all, for
many people, misclassifying a good e-mail as spam seems a lot worse
than misclassifying a bad e-mail as a ham, because no great harm is
done if a single spam gets through but significant harm might
result from a single good e-mail being wrongly classified as spam
and therefore ignored by the recipient. So it may seem good to be
sensitive to indications of hamminess and less sensitive to
indications of spamminess.

However, there are ways to deal with this problem that in real-world testing do not add a noticeable tendency to wrongly classify good e-mail as spam, but do significantly reduce the tendency to misclassify spam as ham.

The most effective technique that has been identified in recent testing efforts follows.

First, “reverse” all the probabilities by subtracting them
from 1 (that is, for each word, calculate 1 -
*f*(*w*)). Because
*f*(*w*) represents the
probability that a randomly chosen e-mail from the set of e-mails
containing *w* is a spam, 1 -
*f*(*w*) represents the
probability that such a randomly chosen e-mail will be a
ham.

Now do the same Fisher calculation as before, but on the (1 -
*f*(*w*))s rather than on the
*f*(*w*)s. This yields a near-0 combined probability, and
therefore rejection of the null hypothesis, when a lot of very
spammy words are present. Call this combined probability *S*.
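A minimal sketch of this step, assuming the inverse chi-square function `chi2P` takes the form shown here. The helper names `fisher_combine` and `spam_evidence` and the sample *f*(*w*) values are hypothetical and chosen only for illustration.

```python
import math

def chi2P(chi, df):
    """Probability that a chi-square variable with an even number of
    degrees of freedom df is at least chi (assumed form of the helper)."""
    assert df % 2 == 0
    m = chi / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_combine(probs):
    """Fisher's method: -2 * sum(ln p) is treated as chi-square with 2n
    degrees of freedom. Assumes every p lies strictly between 0 and 1."""
    chi = -2.0 * sum(math.log(p) for p in probs)
    return chi2P(chi, 2 * len(probs))

def spam_evidence(fws):
    """S: the Fisher combination of the reversed probabilities 1 - f(w)."""
    return fisher_combine([1.0 - f for f in fws])

# Hypothetical f(w) values for a message full of very spammy words.
spammy = [0.99, 0.98, 0.97, 0.99, 0.95]
print(spam_evidence(spammy))  # a value close to 0
```

Reversing the probabilities turns very spammy words into values near 0, which is exactly where the Fisher calculation is most sensitive.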

Now calculate:

$$I = \frac{1 + H - S}{2}$$

*I* is an indicator that is near 1 when
the preponderance of the evidence is in favor of the conclusion
that the e-mail is spam and near 0 when the evidence points to the
conclusion that it's ham. This indicator has a couple of
interesting characteristics.
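The behavior at the two extremes can be sketched as follows, assuming the indicator is computed as (1 + *H* - *S*) / 2, where *H* is the combined probability from the original Fisher calculation and *S* is the one from the reversed probabilities. The function name `indicator` and the sample inputs are hypothetical.

```python
def indicator(h, s):
    """Assumed form of the indicator: I = (1 + H - S) / 2."""
    return (1.0 + h - s) / 2.0

# Strong spam evidence only: S is near 0 and H is near 1.
print(indicator(h=0.999, s=0.0001))  # close to 1

# Strong ham evidence only: H is near 0 and S is near 1.
print(indicator(h=0.0001, s=0.999))  # close to 0
```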

Suppose an e-mail has a number of very spammy words and also
a number of very hammy words. Because the Fisher technique is
sensitive to values near 0 and less sensitive to values near 1, the
result might be that both *S* and
*H* are very near 0. For instance,
*S* might be on the order of .00001 and
*H* might be on the order of .000000001. In
fact, those kinds of results are not as infrequent as one might
assume in real-world e-mails. One example is when a friend forwards
a spam to another friend as part of an e-mail conversation about
spam. In such a case, there will be strong evidence in favor of
both possible conclusions.

In many approaches, such as those based on the Bayesian chain rule, the fact that there may be more spammy words than hammy words in an example will tend to make the classifier absolutely certain that the e-mail is spam. But in fact, it's not so clear; for instance, the forwarded e-mail example is not spam.

So it is a useful characteristic of *I* that
it is near .5 in such cases, just as it is near .5 when there is no
particular evidence in one direction or the other. When there is
significant evidence in favor of both conclusions,
*I* takes the cautious approach. In real-world
testing, human examination of these mid-valued e-mails tends to
support the conclusion that they really should be classified
somewhere in the middle rather than being subject to the
black-or-white approach of most classifiers.
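As a rough check of the near-.5 behavior, assume the indicator is computed as $I = (1 + H - S)/2$ and plug in the illustrative magnitudes from the forwarded-spam example, with *S* on the order of .00001 and *H* on the order of .000000001:

$$I = \frac{1 + 10^{-9} - 10^{-5}}{2} \approx 0.499995$$

Both combined probabilities are so small relative to 1 that neither can pull *I* noticeably away from .5, even though *H* is four orders of magnitude smaller than *S*.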

The Spambayes Project, described in Richie Hindle's article
on page 52, takes advantage of this by marking e-mails with
*I* near .5 as uncertain. This allows the e-mail
recipient to give a bit more attention to e-mails that can't be
classified with confidence. This lessens the chance of a good
e-mail being ignored due to incorrect classification.
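A three-way decision of this kind might look like the sketch below. The function name `classify` and both cutoff values are assumptions made for illustration. They are not taken from the article and should be tuned against real mail.

```python
def classify(i, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map the indicator I to one of three labels.
    The cutoffs are hypothetical defaults, not values from the article."""
    if i < ham_cutoff:
        return "ham"
    if i > spam_cutoff:
        return "spam"
    return "unsure"

for value in (0.03, 0.50, 0.97):
    print(value, classify(value))
```

Widening the gap between the two cutoffs sends more mail to the uncertain group, which trades a little extra reading for fewer confident mistakes.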

## Comments

## i dont understand it "Serve

i dont understand it

"Serve from the cache if it is younger than $cachetime"

whats it

## this entry

this is nice entry thanks for it

## Combining the probabilities

Can we use the Fisher's method for combining the probabilities of different parameters in Fraud Domain also.

Regards

sumit

## Here are some scientific

Here are some scientific approaches to filter out the spam in the e-mails. The probability that some particular words appear repeatedly in spam mails is used to identify whether the mail is a spam or not. Bayesian spam filtering method is the most discussed and used in the complex process of spam filtering. This method is widely adopted by the commercial spam filters available today. But nowadays spammers are using other techniques like Bayesian poisoning to reduce the effectiveness of this method. This subject needs a wide discussion to find out a perfect technique in spam filtering.

## spam code

To create this caching you would put some code like the following on the top of your PHP page.

    $cachefile = 'caching_folder/cachedpage.html';
    $cachetime = 30;

    // Serve from the cache if it is younger than $cachetime
    if (file_exists($cachefile) && time() - $cachetime < filemtime($cachefile)) {
        include($cachefile);
        exit;
    }

    ob_start(); // Start the output buffer

## Hypothesis - f(w)s NOT in a uniform distribution??

I guess the hypothesis should state ``The f(w)s are accurate, and the present e-mail is a random collection of words, each independent of the others, such that the f(w)s ARE in a uniform distribution.''

Is it right?

## If we CAN show that the data

If we CAN show that the data ARE a random distribution of noise, then the null hypothesis stands and our test hypothesis fails. So the name of the game becomes trying to prove that the null hypothesis is correct. If we fail to prove the data is random, then we are supporting the hypothesis that the data is uniformly distributed (in turn, deducing a way to classify the data).

## Spam Keywords

I've read all of the book Ending Spam as well as Mr Graham's A Plan for Spam, but I have a problem and I was wondering if anyone can point me in the right direction. I'm currently doing my senior project and I'm designing a spam filter, but since the corpus of spam and ham e-mails that I have is not big enough, I cannot create a keyword dictionary where each word carries a weight of how spammy it is using these mathematical theories. My question is whether you know where I can find a ready keyword list where each word carries a weight?

## The closest thing I've found

The closest thing I've found is a database of known spam messages which have been forwarded to site by the general public.

You can download the raw message files via ftp by going to:

www.spamarchive.org

I don't think you'll find any pre-weighted word lists available for download (not publicly anyhow).

Hope this helps.

:)

## What does L stand for in the Fisher-Robinson Inverse Chi-Square

What does L stand for in the Fisher-Robinson Inverse Chi-Square?

In the text above it says "First, calculate -2ln p1 * p2 * ... * pn.", but what is LN? Does it stand for Lexicon Number? Or does the letter L have a greater significance? E.g multiply N by L. I am almost there at getting this understood, any suggestions welcome.

## 'ln' means...

'ln' is for natural logarithm. If you are using the Python code from this article, you would do something like,

    import math

    def product(values):
        return reduce(lambda x, y: x*y, values)

    def chiCombined(probs):
        prod = product(probs)
        return chi2P(-2*math.log(prod) , 2*len(probs))

    print chiCombined([0.9, .2, .21, .89, .2, .78])
    # => 0.572203878688

    print chiCombined([0.2, .2, .01, .79, .2, .58])
    # => 0.0594128323345

    print chiCombined([0.7, .89, .71, .79, .972, .68])
    # => 0.996012078132

/Chris

## Thanks + 'underflowing to zero' tip

Thanks for your reply Chris. I did a good few searches on Google but could not find any numeric examples of this on the web. So, its a real help to see some numbers to test my own code against.

Whilst trying to find out more about logs, i discovered a good web page for the 'mathematically challenged' programer: http://www.gigamonkeys.com/book/practical-a-spam-filter.html . In that article, the author suggests getting the log of each individual probability first, then multiplying them together. This, apparently, can prevent the result from underflowing to zero.

I'm writing my spam filter in PHP but most examples on the topic seem to be in either LISP or Python (which have quite a similar syntax to PHP in many ways). So, when I'm confident that i've done it right, I'll put a PHP version online.

Thanks, once again, for all those who have shared their knowledge to rid the world of spam; Chris Steinbach, Gary Robinson, Paul Graham, Brian Burton, Jonathan Zdziarski, Bill Yerazunis, Peter Seibel amongst many others.

## sum the logs not multiply

Ooops. I'm really showing my ignorance of maths and messing up this beautiful webpage in the process! Sorry folks. To correct my previous comment, the article suggests to sum the logs of each probability (I mistakenly said multiply them) rather than multiplying all the probabilities and then taking the log.
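The difference between the two orderings can be sketched in Python as below. The function names and the sample list of probabilities are hypothetical. The list is chosen so that the plain product is smaller than the smallest number a double can represent.

```python
import math

def neg2_log_product_naive(probs):
    """Multiply first, then take the log. Breaks once the running
    product underflows to 0.0, because math.log(0.0) raises ValueError."""
    prod = 1.0
    for p in probs:
        prod *= p
    return -2.0 * math.log(prod)

def neg2_log_product(probs):
    """Sum the logs instead. Mathematically the same quantity,
    but the running total never underflows."""
    return -2.0 * sum(math.log(p) for p in probs)

# 200 probabilities of 0.01: the true product is 1e-400.
probs = [0.01] * 200

print(neg2_log_product(probs))
try:
    neg2_log_product_naive(probs)
except ValueError as err:
    print("naive version failed:", err)
```

For short lists the two functions agree to within rounding, so summing the logs costs nothing and only matters when many small probabilities are combined.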