After struggling with spam e-mail for years the old-fashioned way-- highlight, DEL-- I finally succumbed and installed POPFile on my server. POPFile uses a Bayesian filtering technique and it is amazingly effective. Within a day I had 95% accuracy; within a week I had 97% accuracy. Two months later, I'm up to nearly 99% accuracy.
It's interesting that Bayesian filtering is so effective, yet most people had never heard of it until mid-2002. Spam has been around seemingly forever; why wasn't this technique adopted sooner? I did some digging and came up with Paul Graham's A Plan For Spam. Paul is an interesting guy with a LISP background, and although he probably wasn't the first person to think of using Bayesian techniques to fight spam, he was definitely the first person to stump for a workable algorithm:
I don't know why I avoided trying the statistical approach for so long. I think it was because I got addicted to trying to identify spam features myself, as if I were playing some kind of competitive game with the spammers. (Nonhackers don't often realize this, but most hackers are very competitive.) When I did try statistical analysis, I found immediately that it was much cleverer than I had been. It discovered, of course, that terms like "virtumundo" and "teens" were good indicators of spam. But it also discovered that "per" and "FL" and "ff0000" are good indicators of spam. In fact, "ff0000" (html for bright red) turns out to be as good an indicator of spam as any pornographic term.
But the real advantage of the Bayesian approach, of course, is that you know what you're measuring. Feature-recognizing filters like SpamAssassin assign a spam "score" to email. The Bayesian approach assigns an actual probability. The problem with a "score" is that no one knows what it means. The user doesn't know what it means, but worse still, neither does the developer of the filter. How many points should an email get for having the word "sex" in it? A probability can of course be mistaken, but there is little ambiguity about what it means, or how evidence should be combined to calculate it. Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability. And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam.
Because it is measuring probabilities, the Bayesian approach considers all the evidence in the email, both good and bad. Words that occur disproportionately rarely in spam (like "though" or "tonight" or "apparently") contribute as much to decreasing the probability as bad words like "unsubscribe" and "opt-in" do to increasing it. So an otherwise innocent email that happens to include the word "sex" is not going to get tagged as spam.
I know what you're thinking now: say I'm a spammer. How would I beat a Bayesian filter? Well, it's possible, but it's hard:
Assuming they could solve the problem of the headers, the spam of the future will probably look something like this:
Hey there. Thought you should check out the following:
http://www.27meg.com/foo
because that is about as much sales pitch as content-based filtering will leave the spammer room to make. (Indeed, it will be hard even to get this past filters, because if everything else in the email is neutral, the spam probability will hinge on the url, and it will take some effort to make that look neutral.)
Spammers range from businesses running so-called opt-in lists who don't even try to conceal their identities, to guys who hijack mail servers to send out spams promoting porn sites. If we use filtering to whittle their options down to mails like the one above, that should pretty much put the spammers on the "legitimate" end of the spectrum out of business; they feel obliged by various state laws to include boilerplate about why their spam is not spam, and how to cancel your "subscription," and that kind of text is easy to recognize.
Digging through today's email for examples-- what about messages with no text, only HTML images?
Received: from host-122-195.firstpointsecure.com ([69.42.122.195]) by server.mydomain.com with Microsoft SMTPSVC(6.0.3790.0); Sun, 19 Sep 2004 14:32:43 -0400
From: "Good News" <rodfournier@moquije.remarkablenews.com>
To: me <me@mydomain.com>
Subject: Single?
Date: Sun, 19 Sep 2004 11:32:56 -0800
MIME-Version: 1.0
Content-type: text/html; charset="ISO-8859-1"
Content-transfer-encoding: 7bit
Message-Id: <0771687B7E76766B477E707A6C346C697C7A70756C7A7A356A7674$4df803ge2@moquije.remarkablenews.com>
Return-Path: rodfournier@moquije.remarkablenews.com
<html>
</head>
<body>
<p align="center"><a href="http://quugot.deliveredsavings.com/date3/?i=iog0771687b7e76766b4v&vj=jzv77e707a6c346c697c7ig&n=ksia70756c7a7a356a7674k&pq=vtyk&winner&_m01">
<img border="0" src="http://quugot.deliveredsavings.com/date3/at.gif" width="383" height="210"></a><br>
<br>
<br>
<br>
<br>
</p>
<p align="center">
<a href="http://quugot.deliveredsavings.com/date3/rd.cgi?i=iog0771687b7e76766b4v&vj=jzv77e707a6c346c697c7ig&n=ksia70756c7a7a356a7674k&pq=vtyk&winner&_m01">
<img border="0" src="http://quugot.deliveredsavings.com/date3/5.gif" width="502" height="59"></a></p>
<p align="center"></p>
<img src="http://quugot.deliveredsavings.com/date3/logogen.img?i=iog0771687b7e76766b4v&vj=jzv77e707a6c346c697c7ig&n=ksia70756c7a7a356a7674k&pq=vtyk" border=0>
</body>
</html>
Or messages with non-spam spoof text?
<font size="2" face="Verdana">Stop this <a href="http://www.muss4267pinn.com/a.ddd">please</a>!</font><br> <br> christy passport nocturnal director cargoes corrigendum sicklewort doria polaroid <br> <br>
Interestingly, POPFile has no problem at all correctly categorizing these messages as spam. That's the value of parsing the headers and the HTML, something early researchers failed to do. Graham cites this as the primary reason why Bayesian filtering wasn't used prior to 2002. They didn't think it was effective enough!
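To make Graham's arithmetic concrete, here is a minimal Python sketch of the combining rule he describes. It is an illustration only, not POPFile's or Graham's actual code: the function name `combine` is made up, and it assumes the tokens are independent and that spam and non-spam are equally likely beforehand. The .97 and .99 figures are the ones Graham quotes for "sex" and "sexy"; the .01 values for the innocent words are invented for the example.

```python
from math import prod


def combine(probabilities):
    """Combine per-token spam probabilities into one message probability.

    Assumes independent tokens and equal prior odds of spam and non-spam
    (a naive Bayes simplification).
    """
    spam = prod(probabilities)                  # evidence that it is spam
    ham = prod(1.0 - p for p in probabilities)  # evidence that it is not
    return spam / (spam + ham)


# "sex" (.97) and "sexy" (.99) with no other evidence
print(f"{combine([0.97, 0.99]):.4f}")

# "sex" (.97) plus two hypothetical innocent words at .01 each
print(f"{combine([0.97, 0.01, 0.01]):.4f}")
```

The first line prints 0.9997, matching Graham's 99.97 percent. The second comes out far below 0.5, which is the point of the "good evidence" paragraph: a couple of strongly innocent words outweigh one bad one. This sketch does no clamping, so a token probability of exactly 0 or 1 would swamp everything else.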
98.6 percent accuracy is good, one of the best available, but it's not 100 percent. Can we do better with other spam fighting techniques? I agree with Graham's position that blacklists are both a bad idea and a losing battle, so I won't even go there. Whitelists, on the other hand, are more interesting. Take a service like SpamArrest, for example. This works like so:
Some people love whitelists. I'm not a fan. Putting the burden of verification on the sender seems kind of onerous to me. Even though it's a one-time thing, it is an additional hurdle for every person who wants to communicate with me. On the other hand, this type of anti-machine whitelist is a reasonable approach to an intractable problem. Bayes will always let some spam slip through, so it is arguably the only way to get ironclad, "100 percent" effective spam blocking.
Posted by Jeff Atwood
Here's a cool bit of code that generates a CAPTCHA in ASP.NET:
http://www.codeproject.com/aspnet/CaptchaImage.asp
Jeff Atwood on September 19, 2004 06:13 PM
The big part about any filter is false positives. I personally never want to lose email, I just want it away from my Inbox.
With the seemingly impending canning of Sender ID or the "Email Caller ID" crap there needs to be something done about spam.
Honestly I see the only course is to redo the SMTP protocol into a slightly better alternative. POP is okay but even that is a very basic and primitive protocol written seemingly for 1986, not 2006. I think they're both in need of some revamps into something new and different, not XSMTP or some other name that adds 2 lines of code. Sure you break a lot of mail servers but if the idea is structurally sound why wouldn't everyone jump to it?
I guess it'll take spam and viruses a couple of more years before the problem gets "really" serious. (As if it wasn't serious in 2000 or any day prior to now). Hell I remember spam in my AOL account back in '94 and it was a big deal then too. Blah anyways I suppose something will be done some day. Until then, Bayesian all the way.
Jeremy Brayton on September 20, 2004 02:27 PM
You should definitely try SpamArrest (or another similar human-only Whitelist) if your main concern is false positives. Mails not verified human aren't false positives, technically, they're just in limbo. With that caveat, it is 100% effective-- you will ONLY get email from human beings from that point on, guaranteed!
It's too early to tell if anything will come of SenderID and the like. We can always hope...
Jeff Atwood on September 20, 2004 06:59 PM
I know this comment is a couple of years after the post, but the Anti Spam SMTP Proxy Sourceforge project (http://assp.sourceforge.net) provides the be-all, end-all server-side actually works spam filter. Or at least the very good beginnings of one.
I set it up on my server about a month ago, and in that time it's reduced the spam I personally get from about 100 per day to about 2 per day, and those are marked as spam - I only get them because it's still in "test" mode, allowing me to train it more effectively. I've only gotten 2 false positives personally.
On that server, I also host several other domains and about a dozen users. I don't know the spam load on the other users, but I've been keeping track of the messages that are being marked as spam and I've only seen a couple of false positives (which I quickly remedy by using ASSP's mail interface).
The most powerful feature of ASSP is not its Bayesian filtering (although that's an integral part of the setup), it's the delaying feature (also known as greylisting). I won't go into a technical description here (Google is your friend), but ASSP leverages the behavior of valid MTAs in such a way that spam sent from zombies or other poorly-implemented spam sources doesn't even make it to the server, saving me bandwidth and processing power.
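For readers who have not met greylisting, the general idea can be sketched in a few lines of Python. This is a toy illustration of the technique in general, not ASSP's actual code: the class name `Greylist`, the 300-second delay, and the choice of 451 as the deferral code are all assumptions made for the example.

```python
import time


class Greylist:
    """Toy greylist keyed on (client IP, envelope sender, envelope recipient)."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay  # seconds a sender must wait before a retry counts
        self.clock = clock          # injectable so the example can fake the time
        self.first_seen = {}        # triplet -> timestamp of the first attempt

    def check(self, client_ip, sender, recipient):
        """Return an SMTP-style reply code: 451 to defer, 250 to accept."""
        triplet = (client_ip, sender, recipient)
        now = self.clock()
        first = self.first_seen.setdefault(triplet, now)
        if now - first < self.min_delay:
            return 451  # temporary failure: a well-behaved MTA queues and retries
        return 250      # retried after the delay, so let the message through


# Simulated clock so the example runs instantly.
fake_now = [0.0]
greylist = Greylist(min_delay=300, clock=lambda: fake_now[0])

first_try = greylist.check("69.42.122.195", "a@example.com", "me@mydomain.com")
fake_now[0] = 600.0  # ten minutes later, the same sender retries
second_try = greylist.check("69.42.122.195", "a@example.com", "me@mydomain.com")
print(first_try, second_try)
```

The first attempt is deferred and the retry ten minutes later is accepted. Senders that never retry, which is the behavior attributed to zombies above, never get past the first step.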
Anyway, if you're managing a small to medium-sized mail server and want to filter spam at the server level, I heartily recommend ASSP.
-Peter
Peter on December 11, 2006 01:19 PM
Content (c) 2008 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.