A Statistical Approach to the Spam Problem
At this point, we can compute a probability, f(w), for each word that may appear in a new e-mail. This probability reflects our degree of belief, based on our background knowledge and on data in our training corpus, that an e-mail chosen randomly from those that contain w will be a spam.
So each e-mail is represented by a set of of probabilities. We want to combine these individual probabilities into an overall combined indicator of spamminess or hamminess for the e-mail as a whole.
In the field of statistics known as meta-analysis, probably the most common way of combining probabilities is due to R. A. Fisher. If we have a set probabilities, p1, p2, ..., pn, we can do the following. First, calculate -2ln p1 * p2 * ... * pn. Then, consider the result to have a chi-square distribution with 2n degrees of freedom, and use a chi-square table to compute the probability of getting a result as extreme, or more extreme, than the one calculated. This “combined” probability meaningfully summarizes all the individual probabilities.
The initial set of probabilities are considered to be with respect to a null hypothesis. For instance, if we make the null hypothesis assumption that a coin is unbiased, and then flip it ten times and get ten heads in a row, there would be a resultant probability of (1/2)10 = 1/1024. It would be an unlikely event with respect to the null hypothesis, but of course, if the coin actually were to be biased, it wouldn't necessarily be quite so unlikely to have an extreme outcome. Therefore, we might reject the null hypothesis and instead choose to accept the alternate hypothesis that the coin is biased.
We can use the same calculation to combine our f(w)s. The f(w)s are not real physical probabilities. Rather, they can be thought of as our best guess about those probabilities. But in traditional meta-anaytical uses of the Fisher calculation they are not known to be the real probabilities either. Instead, they are assumed to be, as part of the null hypothesis.
Let our null hypothesis be “The f(w)s are accurate, and the present e-mail is a random collection of words, each independent of the others, such that the f(w)s are not in a uniform distribution.” Now suppose we have a word, “Python”, for which f(w) is .01. We believe it occurs in spams only 1% of the time. Then to the extent that our belief is correct, an unlikely event occurred; one with a probability of .01. Similarly, every word in the e-mail has a probability associated with it. Very spammy words like porn might have probabilities of .99 or greater.
Then we use the Fisher calculation to compute an overall probability for the whole set of words. If the e-mail is a ham, it is likely that it will have a number of very low probabilities and relatively few very high probabilities to balance them, with the result that the Fisher calculation will give a very low combined probability. This will allow us to reject the null hypothesis and assume instead the alternative hypothesis that the e-mail is a ham.
Let us call this combined probability H:
where C-1() is the inverse chi-square function, used to derive a p-value from a chi-square-distributed random variable.
Of course, we know from the outset the null hypothesis is false. Virtually no e-mail is actually a random collection of words unbiased with regard to hamminess or spamminess; an e-mail usually has a telling number of words of one type or the other. And certainly words are not independent. E-mails containing “sex” are more likely to contain “porn”, and e-mails containing the name of a friend who programs Python are more likely also to contain “Python”. And, the f(w)s are not in a uniform distribution. But for purposes of spam detection, those departures from reality usually work in our favor. They cause the probabilities to have a nonrandom tendency to be either high or low in a given e-mail, giving us a strong statistical basis to reject the null hypothesis in favor of one alternative hypothesis or the other. Either the e-mail is a ham or it is a spam.
So the null hypothesis is a kind of straw dog. We set it up only in order for it to be knocked down in one direction or the other.
It may be worthwhile to mention here a key difference between this approach and many other combining approaches that assume the probabilities are independent. This difference applies, for example, to such approaches as the “Bayesian chain rule” and “Naïve Bayesian classification”. Those approaches have been tested in head-to-head combat against the approach described here using large numbers of human-classified e-mails and didn't fare as well (i.e., didn't agree as consistently with human judgment).
The calculations for those approaches are technically invalid due to requiring an independence of the data points that is not actually present. That problem does not occur when using the Fisher technique because the validity of the calculations doesn't depend on the data being independent. The Fisher calculation is structured such that we are rejecting a null hypothesis that includes independence in favor of one of the two alternative hypotheses that are of interest to us. These alternative hypotheses each involve non-independence, such as the correlation between “sex” and “porn”, as part of the phenomenon that creates the extreme combined probabilities.
These combined probabilities are only probabilities under the assumption that the null hypothesis—known to be almost certainly false—is true. In the real world, the Fisher combined probability is not a probability at all, but rather an abstract indicator of how much (or little) we should be inclined to accept the null hypothesis. It is not meant to be a true probability when the null hypothesis is false, so the fact that it isn't doesn't cause computational problems for us. The naïve Bayesian classifier is known to be resistant to distortions due to a lack of independence, but it doesn't avoid them entirely.
Whether these differing responses to a lack of independence in the data are responsible for the superior performance of Fisher in spam/ham classification testing to date is not known, but it is something to consider as one possible factor.
The individual f(w)s are only approximations to real probabilities (i.e., when there is very little data about a word, our best guess about its spam probability as given by f(w) may not reflect its actual reality). But if you consider how f(w) is calculated, you will see that this uncertainty diminishes asymptotically as f(w) approaches 0 or 1, because such extreme values can be achieved only by words that have occurred quite frequently in the training data and either almost always appear in spams or almost always appear in hams. And, conveniently, it's the numbers near 0 that have by far the greatest impact on the calculations. To see this, consider the influence on the product .01 * .5 if the first term is changed to .001, vs. the influence on the product .51 *.5 if the first term is changed to .501, and recall that the Fisher technique is based on multiplying the probabilities. So our null hypothesis is violated by the fact that the f(w)s are not completely reliable, but in a way that matters vanishingly little for the words of the most interest in the search for evidence of hamminess: the words with f(w) near 0.
Alert readers will wonder at this point: “Okay, I understand that these Fisher calculations seem to make sense for hammy words with correspondingly near-0 probabilities, but what about spammy words with probabilities near 1?” Good question! Read on for the answer that will complete our discussion.
- Brent Laster's Professional Git (Wrox)
- Own Your DNS Data
- Machine Learning Everywhere
- Smoothwall Express
- Bash Shell Script: Building a Better March Madness Bracket
- Simple Server Hardening
- From vs. to + for Microsoft and Linux
- Understanding OpenStack's Success
- The Weather Outside Is Frightful (Or Is It?)
- Ensono M.O.