October 30, 2006
I recently got into a spirited discussion about Akismet. What is Akismet?
When a new comment, trackback, or pingback comes to your blog it is submitted to the Akismet web service which runs hundreds of tests on the comment and returns a thumbs up or thumbs down.
Akismet is awfully coy about the "tests" they run to distinguish between spam and everything else. I believe Akismet is essentially the same as the old mt-blacklist plugin I use to block trackback spam. But instead of manually entering blacklist terms, Akismet harnesses the collective knowledge of the intarwebs. As soon as one person blacklists something, it's blacklisted for the entire Akismet community. And it definitely works. It's so effective that some people use it as their only protection against spam comments and trackbacks. I think this is very unwise.
First of all, blacklists aren't a panacea. They have their pros and cons. Just ask Matt Mullenweg, the author of Akismet. He recently left this comment on a blog post:
Unfortunately, the DNS realtime blacklists cause an unusually high false positive rate, which is why we don't use them anymore.
Interesting. And if you're going to keep a blacklist, you might as well keep a greylist and whitelist, too:
These three lists have been around as long as spam itself:
Items on the Blacklist are never allowed through. They are either held in a moderation queue, or deleted.
Items on the Whitelist are always allowed through.
Items on the Greylist are held for human moderation.
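A minimal sketch of how the three lists might be checked against the domains a comment links to. The domain names, the check order (blacklist first), and the choice to send unknown domains to moderation are all assumptions made for illustration, not a description of how Akismet or any particular plugin works.

```python
from enum import Enum


class Verdict(Enum):
    REJECT = "reject"      # blacklisted: never allowed through
    ALLOW = "allow"        # whitelisted: always allowed through
    MODERATE = "moderate"  # greylisted: held for a human


# Hypothetical example entries; a real blog would maintain its own lists.
BLACKLIST = {"cheap-pills.example"}
GREYLIST = {"free-blog-host.example"}
WHITELIST = {"en.wikipedia.org"}


def classify(domains):
    """Classify a comment by the set of domains it links to."""
    found = {d.lower() for d in domains}
    if found & BLACKLIST:
        return Verdict.REJECT    # checked first, so it overrides the other lists
    if found & GREYLIST:
        return Verdict.MODERATE
    if found & WHITELIST:
        return Verdict.ALLOW
    return Verdict.MODERATE      # assumption: unknown domains go to moderation
```

Checking the blacklist first means a spam comment cannot buy its way through by also linking to a whitelisted site.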
Akismet also offers a moderation queue, so it has aspects of a greylist as well. Instead of spending time maintaining a blacklist, you spend time staring down a greylist moderation queue. I'm not so sure that's an improvement. If you consider Akismet a success because you ignore the moderation queue entirely, have you really succeeded?
It's also quite possible to use whitelist attacks on blacklists, where spammers use innocent and legitimate URLs in their spam. I've had a few of these myself. Even if you don't have a whitelist, attacks like this greatly reduce the effectiveness of a blacklist-- legitimate domains end up blacklisted through collateral damage.
But let's forget, for a moment, all the problems I just described with blacklists, whitelists, and greylists. The core problem is relying on a single method of defense against spam. Relying only on Akismet means:
- You've added an external dependency to your website. I hate dependencies, and I always strive to keep the number of dependencies I accept to an absolute minimum.
- If Akismet goes down, you either get inundated by spam while the floodgates are open, or nobody can comment/trackback. Neither scenario is desirable.
- I get 75 spam trackbacks per hour on this blog. Multiply that by the number of blogs on the internet, and you get an astronomically large number. Why should Akismet have to check every single one of those? Does Akismet have the capacity to scale that large? And is it reasonable to expect them to?
I can understand making the choice to use Akismet exclusively for trackbacks, where our options for combating spam are severely limited. But for comments, abandoning CAPTCHA in favor of Akismet is unforgivable. Engtech explains some of the problems with this approach in a recent comment:
[Akismet] has been pretty effective, but there's been a few interesting cases:
- compliment spam ("great post!" with website field linking to their p-rn/adsense splog site)
- only attacking blogs that appear to still have the default post as the first post -- less likely to monitor spam.
- one p-rn spammer who finds political/pop culture keywords in a post and inserts human crafted messages. Like: "Some people say Matt Damon isn't that good of an actor, I really liked him in Talented Mr. Ripley" whenever it finds a post with "Matt Damon"
The one thing it has absolutely sucked at is spammers-to-be. People who are just testing out spam generation algorithms that have no payload. So you'll get random gibberish from an IP address and it will take a few days for Akismet to learn.
Hearing this pains me greatly. All of the above could have been completely eliminated by using both methods: CAPTCHA to validate that it's a human, then Akismet to validate that it's not human-entered spam.
Akismet is a fine addition to our anti-spamming toolkit. But that doesn't mean it's a good idea to outsource your entire anti-spam effort to a single website, either. Anti-spam security starts at home. For best results, use defense in depth and combine local anti-spam measures, such as CAPTCHA, with Akismet as a backup.
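Here is a rough sketch of that layering, assuming a local CAPTCHA result is already available and that the remote service is wrapped in a callable. `remote_check` is a hypothetical stand-in, not the real Akismet API, and the three outcome strings are made up for this example.

```python
from typing import Callable


def screen_comment(captcha_passed: bool, text: str,
                   remote_check: Callable[[str], bool]) -> str:
    """Return 'reject', 'publish', or 'moderate' for one comment.

    remote_check is a hypothetical wrapper around a remote spam service;
    it returns True for spam and raises OSError when the service is down.
    """
    if not captcha_passed:
        return "reject"      # local check failed: no network round trip needed
    try:
        is_spam = remote_check(text)
    except OSError:
        return "moderate"    # service unreachable: hold the comment for a human
    return "reject" if is_spam else "publish"
```

Holding comments for moderation during an outage is one way to avoid both bad scenarios: the floodgates stay closed, and legitimate commenters are not turned away.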
Posted by Jeff Atwood
Thanks for the linkage.
Since I posted that on your captcha post, I've got a live example of "test spam" that made it through Akismet: http://engtech.wordpress.com/2006/09/20/vistaprint-business-cards-isnt-a-scam/#comment-2822
What was kind of cool is that within 24 hrs it was followed by another 80-95 that Akismet all caught.
The website payload was usually articles on design, Wikipedia references, etc. Stuff that wouldn't be considered spam by most.
I'm of a different opinion. Akismet has proved excellent for me, catching 1843 spam messages so far with zero normal comments marked as spam. There have been around five spam messages which got through, but I caught those in manual moderation.
CAPTCHAs are awful usability-wise. Heck, I have run into CAPTCHAs that I can't even read. That doesn't even get into people who are colorblind or, worse, really are blind and rely on screen readers. I would not recommend CAPTCHAs on any blog. I can deal with one spam comment sneaking through once in a while.
SPAM SPAM SPAM SPAM SPAM SPAM SPAM SPAM SPAM
I never bothered with Akismet, because it wasn't standard when I abandoned MovableType and its crummy spam-fighting tools for Wordpress. Instead I ended up with SpamKarma2 (http://unknowngenius.com/blog/wordpress/spam-karma/), which I've been using ever since.
It uses multiple rules to develop a spam score. No one metric is necessarily enough to mark a comment as spam (or not spam). It uses a blacklist, but that's just one of the criteria.
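To illustrate the general idea of score-based filtering, here is a small sketch. The rules, weights, threshold, and blacklisted term are invented for this example and are not taken from SpamKarma2.

```python
import re

BLACKLISTED_TERMS = {"casino"}  # hypothetical blacklist entry
SPAM_THRESHOLD = 4              # hypothetical cut-off


def spam_score(text: str) -> int:
    """Add up points from several independent rules."""
    score = 0
    if len(re.findall(r"https?://", text)) > 2:
        score += 2  # more than two links
    if any(term in text.lower() for term in BLACKLISTED_TERMS):
        score += 3  # the blacklist is just one criterion among several
    if len(text.split()) < 3:
        score += 1  # very short comment
    return score


def is_spam(text: str) -> bool:
    return spam_score(text) >= SPAM_THRESHOLD
```

With these made-up weights, a blacklist hit alone scores 3 and stays under the threshold; it takes a second signal to tip the comment into spam.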
It's worked amazingly well for me. I may have had one or two false positives in the 800+ spams it's caught. I've had almost no false negatives, even though spam volumes seem to have jumped up dramatically in the last few weeks.
What was kind of cool is that within 24 hrs it was followed by another 80-95 that Akismet all caught. The website payload was usually articles on design, Wikipedia references, etc. Stuff that wouldn't be considered spam by most.
Still, there's no way this stuff would make it through CAPTCHA. And using CAPTCHA (for comments, obviously) would reduce the load on Akismet substantially. It's one less HTTP round trip for data you know with 99% certainty is already bad.
Chris G, read the last post on this blog before you make such an overarching proclamation.
Does Akismet use the sbl-xbl? Currently Spamhaus is the most respected RBL, and most comment spam comes from the same sources, so it's natural to integrate with them. You don't even need Akismet for that; you can easily modify your comment submission page to do it (if such a mod isn't available already).
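For anyone curious what such a modification involves, here is a minimal sketch of a DNS blacklist lookup for an IPv4 address. The usual DNSBL convention is to reverse the octets of the address, append the list's zone, and treat any successful resolution as "listed". The zone is passed in as a parameter because the exact zone name to use is an assumption you should check against the list operator's documentation.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up: reversed IPv4 octets + blacklist zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")  # raises ValueError on bad input
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """True if the blacklist zone returns a record for this address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        # No record, or the lookup itself failed. This sketch treats both
        # as "not listed", which means a DNS outage lets everything through.
        return False
```

For example, `dnsbl_query_name("192.0.2.10", "dnsbl.example")` gives `10.2.0.192.dnsbl.example`.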
Not sure I understand why info should be on the graylist; I have my personal website as an info site.
Have you really seen lots of spam with info URLs? I sure haven't; most of them are com.
In fact, having an info domain protected me against spam for a long time. It seemed their email harvesting programs didn't understand there were a bunch of new top-level domains. However, in the last year or so that has changed.
I think it's difficult to compare fighting email and blog spam. When AOL uses a DNS blacklist to block incoming mail, you can bet that there'll be damage done. But when Joe Blogger uses the same blacklist to block possible web spammers, what's the worst thing that can happen?
I'm using a DNS blacklist (sbl-xbl by spamhaus.org) quite successfully on my wiki. Never had a false positive and blocked lots and lots of spam (see http://wiki.chongqed.org/CaughtSpam). It's no silver bullet, but it saves a lot of CPU time.
Did I mention that I hate CAPTCHA? Must admit that yours is the best I've seen so far.
Why would spammers all collect on a particular top-level domain? That would be pretty stupid, wouldn't it?
Spammers are not exactly known for their genius. I don't like blocking *.inf0 or blogsp0t.com either, but it's sadly necessary due to the volume of spam coming from there.
I agree that CAPTCHAs are terrible for usability.
Akismet has been great for me so far. Sometimes it does make Type I errors, but I have a back-end set up to fix that. I'm not going back to CAPTCHAs as long as Akismet is around.
Have you really seen lots of spam with info URLs?
Did I mention that I hate CAPTCHA?
There are a lot of things I hate, such as the security line at the airport, locks on my doors, and waiting in line at the department of motor vehicles. But they're all necessary because the alternatives are even worse.
It's no silver bullet, but it saves a lot of CPU time.
Which is crazy, because we have essentially infinite CPU time, and more being created every day. What we don't have is infinite bandwidth, or infinite mental bandwidth.
That's why CAPTCHA is such a good idea: it optimizes for people using a resource that is already plentiful and getting more plentiful every day.
Yeah, much agreement with Jeff and Chris Pirillo on .i-fo being dead. A lot of spam comes from there.
I've noticed Akismet will mark . info comments as spam even if they're valid -- the .i-fo domain is that bad.
(I can't even type . info on this site: Your comment could not be submitted due to questionable content: .i-fo matching (\.i-fo))
A good alternative (at least for now) to the CAPTCHA is to ask the user's browser to solve a math problem in JavaScript. No user interaction is involved, so there is not a usability concern. Wordpress has a plugin called 'HashCash' that does this. It keeps the amount of spam you have to moderate down to a minimum of human-entered spam:
I use this along with Akismet and it's like a Teflon wall.
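A sketch of the general proof-of-work idea, written in Python for brevity; in a real deployment the solving side would run as JavaScript in the visitor's browser. This is a generic hashcash-style scheme, not the actual algorithm the Wordpress plugin uses, and the challenge format and difficulty are assumptions.

```python
import hashlib
from itertools import count


def _leading_zero_bits(digest: bytes) -> int:
    """Number of zero bits at the start of the digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def verify(challenge: str, nonce: int, bits: int) -> bool:
    """Server side: one cheap hash to check the submitted nonce."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return _leading_zero_bits(digest) >= bits


def solve(challenge: str, bits: int) -> int:
    """Client side: brute-force the smallest nonce that satisfies the challenge."""
    for nonce in count():
        if verify(challenge, nonce, bits):
            return nonce
```

The asymmetry is the point: the client needs about 2^bits hashes on average to find a nonce, while the server verifies it with a single hash.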
That's ridiculous, targeting a particular top-level domain like that. Why would spammers all collect on a particular top-level domain? That would be pretty stupid, wouldn't it?
I don't see the rules being any different for registering under another top-level domain, so there is no reason to discriminate against info.
He goes as far as saying that the people registering one are basically idiots. Well, I registered mine the same day it was released, so there weren't any spammers under info at that point.
This is just ridiculous, and it really makes me mad. So now we are blocking top-level domains? Great, why not start blocking complete countries as well?
So you're blocking links to blogsp0t, but anything that points to google.com goes through automatically?
I have actually seen a number of trackbacks exactly like that-- but the blacklist takes precedence over the whitelist, so anything containing blogsp0t will be discarded.
I've relaxed the blogsp0t rule recently because I closed a lot of my old posts to new trackbacks.
Having recently switched from bblog (I know, I know) to Wordpress, I have so many new options for spam fighting now. I'm curious whether a bot-busting idea that other sites use is also used against comment spam.
Was the comment page loaded before the comment was posted? How long before? I know this wouldn't help with trackback spam, but it would take a big bite out of the 40,000 spam comments and trackbacks I had on my old blog.
Yes, I'm looking into the timing of comment spam and the possibility of a plugin for WP.
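One way such a timing check could work, sketched under a few assumptions: the server embeds a signed timestamp in the comment form when the page is rendered, then rejects submissions that arrive without a valid token or too quickly. The secret, the five-second minimum, and the function names are all hypothetical.

```python
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical server-side secret
MIN_SECONDS = 5        # hypothetical minimum time a human needs to write a comment


def _sign(stamp: str) -> str:
    return hmac.new(SECRET, stamp.encode(), hashlib.sha256).hexdigest()


def issue_token(now: float) -> str:
    """Embed this in a hidden form field when the comment page is served."""
    stamp = str(int(now))
    return f"{stamp}:{_sign(stamp)}"


def plausible_human(token: str, now: float) -> bool:
    """True if the page was really served by us and enough time has passed."""
    stamp, _, sig = token.partition(":")
    if not hmac.compare_digest(sig.encode(), _sign(stamp).encode()):
        return False  # missing or forged token: the form page was never loaded
    return now - int(stamp) >= MIN_SECONDS
```

A bot that posts straight to the submission URL has no token at all, and one that scrapes the form and posts instantly fails the elapsed-time check. As noted above, none of this helps with trackbacks.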
The real problem is mixed-case captcha. Many letters look the same, and since captcha often changes sizes, it's literally impossible to tell the different cases of the letters apart. At the very least, implementors should make captcha case-insensitive.
The other maddening thing is you are usually cut off after three or four tries and have to wait a day. This is insane. All they need is a timer. No robot is going to take five minutes doing captcha-errors. The slow timing as a victim struggles with case-sensitive captcha guarantees it's a human. A pissed-off human.
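Both suggestions are easy to sketch. The 30-second delay and the function names below are assumptions for illustration only.

```python
RETRY_DELAY_SECONDS = 30  # hypothetical pause between attempts


def captcha_matches(expected: str, typed: str) -> bool:
    """Compare the answer ignoring case and surrounding whitespace."""
    return expected.casefold() == typed.strip().casefold()


def may_retry(last_failure: float, now: float) -> bool:
    """Allow another attempt after a short delay instead of a day-long lockout."""
    return now - last_failure >= RETRY_DELAY_SECONDS
```

With this comparison, typing `xK9qP` for a challenge of `Xk9Qp` is accepted, and a failed attempt costs the visitor half a minute rather than a day.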
I’m having problems with Akismet. It says that it caught 5 comments, but it’s only displaying two. I want to just go and delete those other three but I can’t view them. I’m going to have to disable Akismet. Pathetically bad.
Why would you get spam in the first place? Shouldn't you have to register first under some kind of circumstance... Am I talking nonsense?
Anyway, maybe they're part hackers. You know, they get your IP address and that stuff. They install illegal software onto your PC, or something like that. Am I right in any case?