A Statistical Approach to the Spam Problem
To date, the software using this approach is based on one word per token. Other approaches are possible, such as building a hash table of phrases. It is expected that the math described here can be employed in those contexts as well, and there is reason to believe that phrase-based systems will have performance advantages, although there is controversy about that idea. Future Linux Journal articles can be expected to cover any developments in such directions. CRM114 (see Resources) is an example of a phrase-based system that has performed very well, but at the time of this writing it hasn't been directly tested against other approaches on the same corpus. (At the time of this writing, CRM114 is using the Bayesian chain rule to combine p(w)s.)
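To make the distinction between one-word tokens and phrase tokens concrete, here is a minimal sketch. It is an illustration only: the function names, the regular expression, and the choice of two-word phrases are assumptions made for this example, and it does not reproduce how Spambayes, Bogofilter, or CRM114 actually tokenize mail.

```python
import re
from collections import Counter

def word_tokens(text):
    """Split text into lowercase single-word tokens (hypothetical tokenizer)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def phrase_tokens(text, n=2):
    """Build overlapping n-word phrases from the single-word tokens."""
    words = word_tokens(text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

message = "Buy cheap pills now"

# A dict-backed Counter stands in for the hash table of tokens or phrases.
word_counts = Counter(word_tokens(message))
phrase_counts = Counter(phrase_tokens(message))

print(sorted(word_counts))    # ['buy', 'cheap', 'now', 'pills']
print(sorted(phrase_counts))  # ['buy cheap', 'cheap pills', 'pills now']
```

Either kind of token can be counted in spam and ham and given its own probability, which is why the same math is expected to carry over to phrases.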
The techniques described here have been used in projects such as Spambayes and Bogofilter to improve performance of the spam-filtering task significantly. Future developments, which may include integrating these calculations with a phrase-based approach, can be expected to achieve even better performance.
Gary Robinson is CEO of Transpose, LLC (www.transpose.com), a company specializing in internet trust and reputation solutions. He has worked in the field of collaborative filtering since 1985. His personal weblog, which frequently covers spam-related developments, is radio.weblogs.com/0101454, and he can be contacted at grobinson@transpose.com.
Comments
I don't understand it
I don't understand this:
"Serve from the cache if it is younger than $cachetime"
What is it?
this entry
This is a nice entry, thanks for it.
Combining the probabilities
Can we also use Fisher's method to combine the probabilities of different parameters in the fraud domain?
Regards
sumit
Here are some scientific
Here are some scientific approaches to filtering spam out of e-mail. The probability that particular words appear repeatedly in spam is used to decide whether a message is spam or not. Bayesian spam filtering is the most discussed and most widely used method for this complex task, and it has been widely adopted by the commercial spam filters available today. Nowadays, however, spammers use other techniques, such as Bayesian poisoning, to reduce its effectiveness. This subject needs wider discussion to find a perfect technique for spam filtering.
spam code
To create this caching you would put some code like the following on the top of your PHP page.
$cachefile = 'caching_folder/cachedpage.html';
$cachetime = 30; // maximum cache age, in seconds

// Serve from the cache if it is younger than $cachetime
if (file_exists($cachefile) && time() - $cachetime < filemtime($cachefile)) {
    include($cachefile);
    exit;
}
ob_start(); // Start the output buffer
Great
This is really great info on spam. I was hunting for this. This is one of the best service providers. Fine information, many thanks to the author. It is puzzling to me now, but in general, the usefulness and importance is overwhelming. Thanks very much again and good luck! Regards
Anti-spam solution
I forgot about spam problem when I started using Gafana.com -it is 100% effective, no false positives, no spam.. Not really expensive, extremely helpful. So, spam is not a problem for me now.
Hypothesis - f(w)s NOT in a uniform distribution??
I guess the hypothesis should state ``The f(w)s are accurate, and the present e-mail is a random collection of words, each independent of the others, such that the f(w)s ARE in a uniform distribution.''
Is it right?
If we CAN show that the data
If we CAN show that the data ARE a random distribution of noise, then the null hypothesis stands and our test hypothesis fails. So the name of the game becomes trying to prove that the null hypothesis is correct. If we fail to prove the data is random, then we are supporting the hypothesis that the data is uniformly distributed (in turn, deducing a way to classify the data).
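For readers following this thread, it may help to see the standard fact that Fisher's method rests on. The sketch below is the textbook argument for generic values p_1, ..., p_n that are assumed to be independent and uniformly distributed on (0, 1). It is not a quotation from the article, and whether the f(w)s of a given e-mail satisfy that assumption is exactly what the test probes.

```latex
% Assumption: p_1, ..., p_n are independent and uniformly distributed on (0, 1).
\begin{align*}
\Pr(-2\ln p_i > x) &= \Pr\!\left(p_i < e^{-x/2}\right) = e^{-x/2}, \quad x \ge 0,
  && \text{so } -2\ln p_i \sim \chi^2_{2}, \\
-2\ln(p_1 p_2 \cdots p_n) &= \sum_{i=1}^{n} \left(-2\ln p_i\right) \sim \chi^2_{2n}
  && \text{(a sum of $n$ independent $\chi^2_{2}$ variables).}
\end{align*}
```

Under this assumption, a product much smaller than uniform values would typically give yields a large statistic and a small tail probability, which counts as evidence against uniformity.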
Spam Keywords
I've read all of the book Ending Spam as well as Mr. Graham's "A Plan for Spam", but I have a problem and I was wondering if anyone can point me in the right direction. I'm currently doing my senior project and I'm designing a spam filter, but since the corpus of spam and ham e-mails that I have is not big enough, I cannot use these mathematical theories to create a keyword dictionary in which each word carries a weight for how spammy it is. My question is whether you know where I can find a ready-made keyword list in which each word carries a weight.
The closest thing I've found
The closest thing I've found is a database of known spam messages which have been forwarded to the site by the general public.
You can download the raw message files via ftp by going to:
www.spamarchive.org
I don't think you'll find any pre-weighted word lists available for download (not publicly anyhow).
Hope this helps.
:)
What does L stand for in the Fisher-Robinson Inverse Chi-Square
What does L stand for in the Fisher-Robinson Inverse Chi-Square?
In the text above it says "First, calculate -2ln p1 * p2 * ... * pn.", but what is LN? Does it stand for Lexicon Number? Or does the letter L have a greater significance? E.g multiply N by L. I am almost there at getting this understood, any suggestions welcome.
'ln' means...
'ln' is the natural logarithm. If you are using the Python code from this article (which defines chi2P), you would do something like this:
import math
from functools import reduce

def product(values):
    # Multiply all the values together.
    return reduce(lambda x, y: x * y, values)

def chiCombined(probs):
    # -2 ln(p1 * p2 * ... * pn), tested against 2n degrees of freedom.
    prod = product(probs)
    return chi2P(-2 * math.log(prod), 2 * len(probs))

Results are shown to 12 significant digits:
print(chiCombined([0.9, .2, .21, .89, .2, .78]))
=>0.572203878688
print(chiCombined([0.2, .2, .01, .79, .2, .58]))
=>0.0594128323345
print(chiCombined([0.7, .89, .71, .79, .972, .68]))
=>0.996012078132
/Chris
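For anyone without the article's listing to hand, here is a minimal self-contained sketch of the same calculation. The chi2P body below is an assumption: it uses the standard closed form for the chi-square tail probability with an even number of degrees of freedom, and it may not match the article's code line for line. The name chi_combined is chosen for this example.

```python
import math

def chi2P(chi, df):
    """Return P(X >= chi) for a chi-square variable X with df degrees of freedom.

    Sketch only: this closed form is valid only when df is even.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    m = chi / 2.0
    term = math.exp(-m)
    total = term
    # Add the terms exp(-m) * m**i / i! for i = 1 .. df/2 - 1.
    for i in range(1, df // 2):
        term *= m / i
        total += term
    # Rounding error can push the sum slightly above 1.
    return min(total, 1.0)

def chi_combined(probs):
    """Combine probabilities via -2 ln(p1 * p2 * ... * pn) with 2n degrees of freedom."""
    product = 1.0
    for p in probs:
        product *= p
    return chi2P(-2.0 * math.log(product), 2 * len(probs))

combined = chi_combined([0.9, 0.2, 0.21, 0.89, 0.2, 0.78])
print(round(combined, 6))  # 0.572204
```

With the three probability lists from the comment above, this sketch gives the same values as those quoted there.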
Thanks + 'underflowing to zero' tip
Thanks for your reply, Chris. I did a good few searches on Google but could not find any numeric examples of this on the web. So, it's a real help to see some numbers to test my own code against.
Whilst trying to find out more about logs, I discovered a good web page for the 'mathematically challenged' programmer: http://www.gigamonkeys.com/book/practical-a-spam-filter.html . In that article, the author suggests getting the log of each individual probability first, then multiplying them together. This, apparently, can prevent the result from underflowing to zero.
I'm writing my spam filter in PHP, but most examples on the topic seem to be in either LISP or Python (which have quite a similar syntax to PHP in many ways). So, when I'm confident that I've done it right, I'll put a PHP version online.
Thanks, once again, to all those who have shared their knowledge to rid the world of spam: Chris Steinbach, Gary Robinson, Paul Graham, Brian Burton, Jonathan Zdziarski, Bill Yerazunis, and Peter Seibel, amongst many others.
sum the logs not multiply
Ooops. I'm really showing my ignorance of maths and messing up this beautiful webpage in the process! Sorry, folks. To correct my previous comment: the article suggests summing the logs of the individual probabilities (I mistakenly said multiplying them), rather than multiplying all the probabilities together and then taking the log.
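A small sketch of why summing the logs helps. The list of 200 probabilities of 0.01 is made up purely to force the problem, and the function names are hypothetical.

```python
import math

def naive_product(probs):
    """Multiply the probabilities directly; can underflow to 0.0."""
    product = 1.0
    for p in probs:
        product *= p
    return product

def sum_of_logs(probs):
    """Compute ln(p1 * p2 * ... * pn) as ln(p1) + ln(p2) + ... + ln(pn)."""
    return sum(math.log(p) for p in probs)

# Made-up input: the true product is 1e-400, far below the smallest
# positive double (about 5e-324), so the running product becomes 0.0.
probs = [0.01] * 200

underflowed = naive_product(probs)
log_product = sum_of_logs(probs)

print(underflowed)             # 0.0
print(round(log_product, 3))   # -921.034

try:
    math.log(underflowed)      # the log of zero is undefined
except ValueError:
    print("math.log(0.0) raises ValueError")
```

The summed logs can be multiplied by -2 and passed on to the chi-square calculation without the product ever being formed.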