A recent Los Angeles Times article reveals that the 419 scam spammers have their very own anthem: a song titled I Go Chop Your Dollars by Nigerian recording artist Osofia:
"419 is just a game, you are the losers, we are the winners.
White people are greedy, I can say they are greedy
White men, I will eat your dollars, will take your money and disappear.
419 is just a game, we are the masters, you are the losers."
We may joke about the 419 scams. After all, who in their right mind actually falls for this stuff? But like all spammers, they do it because it works:
[Samuel] sent 500 e-mails a day and usually received about seven replies. Shepherd would then take over. "When you get a reply, it's 70% sure that you'll get the money," Samuel said.
Spam only became a problem for me about a year and a half ago, but clearly it's here to stay. I've used POPFile for about a year to cut down on my email spam**. Some people swear by challenge-response human verification systems such as SpamArrest, but as Scott Mitchell notes, this system has some issues:
While the challenge/response system was effective in reducing my spam intake from about 100 messages a day to around 1 or 2 messages a day, the approach, in my estimation, was not ideal. One big disadvantage was that fewer people took the time to respond to the challenge email than I had anticipated, for two reasons:
- Some people don't want to take the time to follow instructions for a challenge email. Maybe their message wasn't that important after all, maybe they're busy, or maybe they just don't like being told what to do. These people's messages, I reckoned, weren't that vital. If you can't take two seconds to respond to the challenge, then just how important is that email you're sending me?
- What worried me most, and led me to suspend my C/R anti-spam system, is that I noticed some people weren't responding to the challenge email because they never received it! This unfortunate circumstance could happen if their own spam blocking solution halted my challenge email. A couple folks informed me that Outlook 2003 categorized my challenge emails as spam. Others using a similar challenge/response anti-spam system would never get my challenge as my challenge would generate a challenge on their side.
The "I challenge your challenge!" scenario is particularly amusing. And on top of the two issues Scott highlights, there are other social problems with challenge/response spam blocking.
Although I've had great success with POPFile, which uses Bayesian filtering techniques, I had no idea that there's an even better technique: Markovian filtering. That's what the CRM114 Discriminator* uses. There's an outstanding slide deck (pdf) that explains how it all works. In a nutshell, Markovian filtering weights phrases and words, whereas Bayesian filtering only looks at individual words. How much better is it? I'll let the CRM114 author, Bill Yerazunis, pitch it:
For the month of April 2005, I received over 10,000 emails. About 60% were spam. I had ZERO classification errors. ZERO. As of Feb 1 through March 1, 2004, 8738 messages (4240 spam, 4498 nonspam), and my total error rate was ONE. That translates to better than 99.984% accuracy, which is over ten times more accurate than human accuracy.
I measured my own accuracy to be around 99.84%, by classifying the same set of about 3000 messages twice over a period of about a week, reading each message from the top until I feel "confident" of the message status, (one message per screen unless I want more than one screen to decide on a message.) and doing the classification in small batches with plenty of breaks and other office tasks to avoid fatigue. Then I diff()ed the two passes to generate a result. Assuming I never duplicate the same mistake, I, as an unassisted human, under nearly optimal conditions, am 99.84% accurate.
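The arithmetic behind those claims is easy to check. Here is a quick sketch in Python that uses only the figures quoted above (one error in 8738 messages, and 99.84% human accuracy); the variable names are just for illustration:

```python
# Figures taken from the quote above.
messages = 8738
errors = 1
human_accuracy = 0.9984

filter_accuracy = 1 - errors / messages
filter_error = errors / messages
human_error = 1 - human_accuracy

print(f"filter accuracy: {filter_accuracy:.4%}")
print(f"human error rate / filter error rate: {human_error / filter_error:.1f}")
```

One error in 8738 messages does come out above the quoted 99.984%, and the human error rate is roughly 14 times the filter's, which squares with "over ten times more accurate".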
Most Bayesian techniques top out at around 98% accuracy with a little training, but Markovian filtering can achieve a rarefied 99.5% accuracy. The most notable Windows port of CRM114 is SpamRIP.
* A reference to the movie Dr. Strangelove. In the movie, the "CRM114 Discriminator" is a fictional accessory for a radio receiver that's "designed not to receive at all", that is, unless the message is properly authenticated.
** I have since switched to K9 because it's simpler and faster-- and does the same Bayesian filtering.
SpamRIP is adware. Better stay away.
Anonymous Coward on October 21, 2005 7:00 AM
I am very close to 99.5% with my Bayesian. Take a look at the statistics:
http://www.newsforge.com.ar/images/Spamsieve3.png
There have been some negatives here and there (and false positives) but so far, I haven't read a spam in months! I completely forgot about its existence. :)
I recently ditched my long standing e-mail address, simply because of the amount of spam I received.
I did use POPFile, but it was the fact that every morning when I turned my computer on, it had to download around 1000 e-mails a day, with 99.9% being spam.
These days I've got two e-mail addresses: one super secret one just for family and friends, and the other for using in places that are likely to get it on a spam list.
Peter Bridger on October 21, 2005 8:33 AM
Once again, something new. I've heard of approaches using Bayesian with two-word and three-word phrases, but I've never heard of the Markovian approach.
Is the approach similar to Bayesian in that it uses statistical inference, but just with phrases?
Haacked on October 21, 2005 1:19 PM
I am very close to 99.5% with my Bayesian.
The Markovian filtering would deliver something like 99.99% in this case. The rule of thumb is that it cuts the amount of spam that makes it through a Bayesian filter in half.
Is the approach similar to Bayesian in that it uses statistical inference, but just with phrases?
Yes. From page 20 of the slide deck (which I still highly recommend -- it's great)
http://crm114.sourceforge.net/Plateau99.pdf
How to Turn a Bayesian into a Markovian
(1) change the feature generator from single words to spanning multiple words *
(2) Change the weighting so that longer features have more weight (ie. Longer features generate local probabilities closer to 0.0 and 1.0)
(3) The 2^2n weighting means that the weights were 1, 4, 16, 64, 256, ... for span lengths of 1, 2, 3, 4, 5, ... words
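To make the recipe above a little more concrete, here is a rough Python sketch of steps (1) and (3): generating multi-word features and attaching the listed weights. It is a simplified illustration under my own assumptions, not CRM114's actual code. It only looks at contiguous runs of words, it leaves out how the weights feed into the local probabilities in step (2), and the names `span_features` and `span_weight` are made up for this example.

```python
def span_weight(span_len):
    """Weight for a feature spanning span_len words.

    Gives 1, 4, 16, 64, 256 for spans of 1..5 words,
    matching the weights listed above.
    """
    return 4 ** (span_len - 1)


def span_features(text, max_span=5):
    """Yield (phrase, weight) for every contiguous run of 1..max_span words.

    Hypothetical helper: a single-word (Bayesian-style) feature generator
    would be the special case max_span=1.
    """
    words = text.lower().split()
    for start in range(len(words)):
        for span_len in range(1, max_span + 1):
            if start + span_len > len(words):
                break  # the phrase would run past the end of the message
            phrase = " ".join(words[start:start + span_len])
            yield phrase, span_weight(span_len)


features = list(span_features("send your bank details", max_span=3))
for phrase, weight in features:
    print(weight, phrase)
```

With `max_span=1` this degenerates to plain single-word features, which is the difference between the two approaches in miniature: the Markovian version sees "send your bank" as one heavily weighted feature instead of three unrelated words.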
Wow-- here's the music video for "I Go Chop Your Dollars":
http://nigeriamovies.net/419.htm
Pretty funny. It's a lot less gangsta rap than I was expecting, with scenes of the mark signing contracts, touring Nigeria, etc.
Jeff Atwood on March 5, 2006 3:18 AM