A Statistical Approach to the Spam Problem
The calculation described above is sensitive to evidence of hamminess, particularly when it's in the form of words that show up in far more hams than spams. This is because probabilities near 0 have a great influence on the product of probabilities, which is at the heart of Fisher's calculation. In fact, there is a 1971 theorem that says the Fisher technique is, under certain circumstances, as powerful as any technique can possibly be for revealing underlying trends in a product of probabilities (see Resources).
However, very spam-oriented words have f(w)s near 1, and therefore have a much less significant effect on the calculations. Now, it might be assumed that this is a good thing. After all, for many people, misclassifying a good e-mail as spam seems a lot worse than misclassifying a bad e-mail as ham. No great harm is done if a single spam gets through, but significant harm might result from a single good e-mail being wrongly classified as spam and therefore ignored by the recipient. So it may seem good to be sensitive to indications of hamminess and less sensitive to indications of spamminess.
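The asymmetry described above can be checked numerically. The following is a minimal sketch, assuming the Fisher combination is computed as the upper-tail chi-square probability of -2 times the sum of the logs, with 2n degrees of freedom. The function names chi2q and fisher_combine and the sample probabilities are illustrative assumptions, not taken from the article.

```python
import math

def chi2q(x2, v):
    """Upper-tail probability of a chi-square variable with v degrees
    of freedom; this closed form is valid only when v is even."""
    if v % 2 != 0:
        raise ValueError("v must be even")
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Guard against rounding pushing the sum slightly above 1.
    return min(total, 1.0)

def fisher_combine(probs):
    """Combine probabilities (each strictly between 0 and 1) with
    Fisher's method."""
    x2 = -2.0 * sum(math.log(p) for p in probs)
    return chi2q(x2, 2 * len(probs))

# Hypothetical f(w) values: four neutral words plus one more word.
base = fisher_combine([0.5] * 5)
with_hammy_word = fisher_combine([0.5] * 4 + [0.01])
with_spammy_word = fisher_combine([0.5] * 4 + [0.99])

shift_from_hammy = abs(with_hammy_word - base)
shift_from_spammy = abs(with_spammy_word - base)
print(base, with_hammy_word, with_spammy_word)
```

With these sample values, the single near-0 probability moves the combined result several times further than the single near-1 probability does, which is the imbalance the rest of this section sets out to correct.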
However, there are ways to deal with this problem that in real-world testing do not add a noticeable tendency to wrongly classify good e-mail as spam, but do significantly reduce the tendency to misclassify spam as ham.
The most effective technique that has been identified in recent testing efforts follows.
First, “reverse” all the probabilities by subtracting them from 1 (that is, for each word, calculate 1 - f(w)). Because f(w) represents the probability that a randomly chosen e-mail from the set of e-mails containing w is a spam, 1 - f(w) represents the probability that such a randomly chosen e-mail will be a ham.
Now do the same Fisher calculation as before, but on the (1 - f(w))s rather than on the f(w)s. This will result in near-0 combined probabilities, in rejection of the null hypothesis, when a lot of very spammy words are present. Call this combined probability S.
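The reversed calculation might look like the sketch below. It assumes every f(w) lies strictly between 0 and 1 (so the logarithm is defined) and uses the even-degrees-of-freedom form of the inverse chi-square function. The names chi2q, fisher_combine and spam_evidence, and the sample f(w) values, are assumptions made for illustration.

```python
import math

def chi2q(x2, v):
    """Upper-tail probability of a chi-square variable with v degrees
    of freedom; this closed form is valid only when v is even."""
    if v % 2 != 0:
        raise ValueError("v must be even")
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_combine(probs):
    """Combine probabilities with Fisher's method."""
    x2 = -2.0 * sum(math.log(p) for p in probs)
    return chi2q(x2, 2 * len(probs))

def spam_evidence(f_values):
    """Return S: the Fisher combination of the reversed probabilities
    1 - f(w). S is near 0 when many very spammy words are present."""
    return fisher_combine([1.0 - f for f in f_values])

# Hypothetical f(w) values for a spammy and a hammy message.
s_spammy = spam_evidence([0.99, 0.98, 0.97, 0.5])
s_hammy = spam_evidence([0.01, 0.02, 0.03, 0.5])
print(s_spammy, s_hammy)
```

For the spammy sample, S comes out close to 0, so the null hypothesis is rejected; for the hammy sample, S stays close to 1.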
Now calculate:
I = (1 + H - S) / 2
where H is the combined probability from the earlier Fisher calculation on the f(w)s. I is an indicator that is near 1 when the preponderance of the evidence is in favor of the conclusion that the e-mail is spam and near 0 when the evidence points to the conclusion that it's ham. This indicator has a couple of interesting characteristics.
Suppose an e-mail has a number of very spammy words and also a number of very hammy words. Because the Fisher technique is sensitive to values near 0 and less sensitive to values near 1, the result might be that both S and H are very near 0. For instance, S might be on the order of .00001 and H might be on the order of .000000001. In fact, those kinds of results are not as infrequent as one might assume in real-world e-mails. One example is when a friend forwards a spam to another friend as part of an e-mail conversation about spam. In such a case, there will be strong evidence in favor of both possible conclusions.
In many approaches, such as those based on the Bayesian chain rule, the fact that there may be more spammy words than hammy words in an example will tend to make the classifier absolutely certain that the e-mail is spam. But in fact, it's not so clear; for instance, the forwarded e-mail example is not spam.
So it is a useful characteristic of I that it is near .5 in such cases, just as it is near .5 when there is no particular evidence in one direction or the other. When there is significant evidence in favor of both conclusions, I takes the cautious approach. In real-world testing, human examination of these mid-valued e-mails tends to support the conclusion that they really should be classified somewhere in the middle rather than being subject to the black-or-white approach of most classifiers.
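A short numerical check of this behavior, assuming the indicator is computed as I = (1 + H - S) / 2, with H the combined probability from the f(w)s and S the one from the (1 - f(w))s. The function name indicator is an illustrative choice; the conflicting-evidence values are the ones quoted above for the forwarded-spam case.

```python
def indicator(h, s):
    """Return I = (1 + H - S) / 2 for combined probabilities H and S,
    each expected to lie in [0, 1]."""
    return (1.0 + h - s) / 2.0

# Strong evidence in both directions: S and H both very near 0.
conflicting = indicator(h=0.000000001, s=0.00001)

# Evidence in one direction only.
clearly_spam = indicator(h=1.0, s=0.0)
clearly_ham = indicator(h=0.0, s=1.0)

print(conflicting, clearly_spam, clearly_ham)
```

Because H and S enter only through their difference, two values that are both tiny nearly cancel and leave I close to .5, no matter which of the two is smaller by orders of magnitude.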
The Spambayes Project, described in Richie Hindle's article on page 52, takes advantage of this by marking e-mails with I near .5 as uncertain. This allows the e-mail recipient to give a bit more attention to e-mails that can't be classified with confidence. This lessens the chance of a good e-mail being ignored due to incorrect classification.
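A three-way decision of this kind could be sketched as follows. The cutoff values 0.2 and 0.9 and the function name classify are hypothetical, chosen only for illustration; they are not taken from the article, and a real deployment would tune them against its own mail.

```python
def classify(i, ham_cutoff=0.2, spam_cutoff=0.9):
    """Map an indicator I in [0, 1] to 'ham', 'uncertain' or 'spam'.
    The default cutoffs are illustrative assumptions."""
    if not 0.0 <= i <= 1.0:
        raise ValueError("indicator must be between 0 and 1")
    if i < ham_cutoff:
        return "ham"
    if i > spam_cutoff:
        return "spam"
    return "uncertain"

labels = [classify(i) for i in (0.03, 0.5, 0.97)]
print(labels)
```

Placing the spam cutoff further from .5 than the ham cutoff reflects the earlier point that wrongly discarding a good e-mail is usually the costlier mistake.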