May 28, 2008
Have you ever used Craigslist? It's an almost entirely free, mostly anonymous classified advertising service which evolved from an early internet phenomenon into a service so powerful it is often accused of single-handedly destroying the newspaper business. Unfortunately, these same characteristics also make Craigslist a particularly juicy target for spammers and evildoers. Who knows; maybe it's karma.
I consider Craigslist a generally benevolent public service. Perhaps that's why I was so profoundly disturbed by John Nagle's wartime narrative of the raging battle between Craigslist and spammers.
Spam on Craigslist has been a minor nuisance for years. Not any more. This year, the spammers started winning and are taking over Craigslist. Here's how they did it. Craigslist tries to stop spamming by:
- Checking for duplicate submissions.
- Blocking excessive posts from a single IP address.
- Requiring users to register with a valid email address.
- Using a CAPTCHA to stop automated posting tools.
- Letting users flag postings they recognize as spam.
Several commercial products are now available to overcome those little obstacles to bulk posting. CL Auto Posting Tool is one such product. It not only posts to Craigslist automatically, it has built-in strategies to overcome each Craigslist anti-spam mechanism:
- Random text is added to each spam message to fool Craigslist's duplicate message detector.
- IP proxy sites are used to post from a wide range of IP addresses.
- E-mail addresses for reply are Gmail accounts conveniently created by Jiffy Gmail Creator (ed. note: this does not break Google's CAPTCHA, as you can see in this screenshot.)
- An OCR system reads the obscured text in the CAPTCHA.
- Automatic monitoring detects when a posting has been flagged as spam and reposts it.
CL Auto Poster isn't the only such tool. Other desktop software products are AdBomber and Ad Master. For spammers preferring a service-oriented approach, there's ItsYourPost. With these power tools, the defenses of Craigslist have been overrun. Some categories on Craigslist have become over 90% spam. The personals sections were the first to go, then the services categories, and more recently, the job postings.
Craigslist is fighting back. Its latest gimmick is phone verification. Posting in some categories now requires a callback phone call, with a password sent to the user either by voice or as an SMS message. Only one account is allowed per phone number. Spammers reacted by using VoIP numbers. Craigslist blocked those. Spammers tried using number-portability services like Grand Central and Tossable Digits. Craigslist blocked those. Spammers tried using their own free ringtone sites to get many users to accept the Craigslist verification call, then type in the password from the voice message. Craigslist hasn't countered that trick yet.
Much of the back and forth battle can be followed in various forums. It's not clear yet who will win.
I've used Craigslist quite a few times in the past, mostly to sell things that are too unwieldy to ship, with generally positive results. But that's the "for sale" section, and the spammers seem to be concentrating on the personals and services. I was curious about this, so I delved into the local personals section in what I guessed to be the most popular category. (Note to my wife: this is research! Research! I swear!)
Almost immediately I found a personals ad with the following "image":
It's an encoded wartime transmission from someone battling Craigslist spammers. It ends on this dire warning:
99.9% of the ads these days are fakes. Sad but true. REALLY, ALMOST ALL THE ADS ARE FAKE!
But is it true? I saw some obvious spam in the personals section -- all of which had been flagged for removal by the time I clicked on it -- but certainly nothing to corroborate this 99.9% claim. I did a few unique term searches on random personals (my favorite at the moment is "no murderers please!"), and they came up unique.
Clearly, there's a war on, and there have been casualties on both sides. Even if the spammers aren't winning, every inch they gain further undermines the community's trust in Craigslist and devalues everyone's participation.
This is a topic I am acutely interested in as we build stackoverflow.com out. Like Craigslist, stackoverflow will offer a rich experience for anonymous internet users. We will not require you to create an account or "login" to answer or ask questions. We'll even track your reputation and preferred settings for you, as long as you allow us to store a standard browser cookie. While it's true that we'll initially be a low-value target due to limited traffic and a specialized audience, that will inevitably change over time. So you can expect some of the same measures on stackoverflow that Craigslist and Wikipedia use to mitigate anonymous evil:
- Some form of CAPTCHA.
- The ability to temporarily "lock" controversial questions so only registered users can edit or add responses.
- An automatic throttle if we see rapid, bot-like actions from your IP address.
- Some basic heuristics to detect "spammy" content, such as too many URLs.
- An easy way for users with sufficient reputation to undo vandalism by reverting to an earlier version.
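To make a couple of those concrete, here's a rough Python sketch of the IP throttle and the "too many URLs" heuristic. The thresholds are placeholder guesses for illustration, not the limits we'll actually use:

```python
import re
import time
from collections import defaultdict, deque

# Placeholder limits, chosen for illustration only.
MAX_ACTIONS_PER_MINUTE = 10
MAX_URLS_PER_POST = 3

_recent_actions = defaultdict(deque)  # ip -> timestamps of recent actions

def is_throttled(ip, now=None):
    """True if this IP has acted too often in the last 60 seconds."""
    now = time.time() if now is None else now
    window = _recent_actions[ip]
    window.append(now)
    # Drop timestamps that have aged out of the one-minute window.
    while window and now - window[0] > 60:
        window.popleft()
    return len(window) > MAX_ACTIONS_PER_MINUTE

def looks_spammy(text):
    """Crude heuristic: flag posts stuffed with links."""
    urls = re.findall(r'https?://\S+', text)
    return len(urls) > MAX_URLS_PER_POST
```

A real implementation would keep its counters in shared storage behind the web tier, but the shape of the check is the same.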
The community itself can also assist. Every question and answer on stackoverflow can be rated Digg style; if a given bit of content rapidly accrues a large number of downmods, it is likely to be spam or inappropriate content, and will be automatically removed or directed into a moderation queue.
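The downmod rule is simple enough to sketch in a few lines of Python; the threshold here is an arbitrary placeholder, not a number we've committed to:

```python
DOWNMOD_THRESHOLD = 6  # placeholder value for illustration

class Post:
    def __init__(self, body):
        self.body = body
        self.downmods = 0
        self.state = "visible"

    def downmod(self):
        """Record a community downmod; hide the post once enough accrue."""
        self.downmods += 1
        if self.downmods >= DOWNMOD_THRESHOLD:
            self.state = "moderation_queue"  # hidden pending human review
```

The interesting design question is the threshold: too low and a handful of trolls can censor legitimate content, too high and spam lingers.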
Don't get me wrong. I've been humbled by the quality -- and the sheer size -- of the community that has grown up around this blog. I expect the overwhelming majority of people who participate in stackoverflow.com will be absolutely upstanding internet citizens. Wikipedia is a living testament to the fact that goodness vastly outnumbers evil. We good guys can win, if we've had the forethought to put some controls in place first.
Allowing anonymous users write permission creates a volatile situation where a dozen sufficiently motivated spammers can easily poison the well for thousands of typical users. These spammers don't give a damn about the community we're building together. All they care about is getting paid by posting their links anywhere and everywhere they can. They'll run roughshod over as many websites and pages as possible in their frantic, abusive pursuit of money. If I didn't so desperately want to choke the life out of each and every one of them, I might actually feel sorry for the poor bastards.
But here's the problem: following the rules and being a good citizen is easy. Being evil is hard; it takes more work. Sometimes a lot more work. The bad guys get paid to learn about their exploits. Are you willing to educate yourself about the complex evil that a tiny minority of powerful users are prepared to unleash upon your site? As with so many things in life, this is best illustrated by a scene from Spaceballs:
HELMET So, Lone Starr, Yogurt has taught you well.
If there is one thing I despise, it is a fair fight. But if I must, then
I must. May the best man win. Put 'er there. (offers to shake his hand)
LONE STARR goes to shake his hand. HELMET takes
the ring off LONE STARR'S hand.
HELMET The ring. I can't believe you fell for
the oldest trick in the book. What a goof. What's with you man? Come on.
You know what? No, here let me give it back to you. (offers the ring back)
LONE STARR goes up to get the ring back. HELMET
throws it into a grate. LONE STARR tries to catch
it and falls to the grate.
HELMET Oh, look. You fell for that, too. I can't
believe it man.
LONE STARR gets up and runs to a corner.
HELMET So, Lone Starr, now you see that evil will
always triumph, because good is dumb.
As the good guys, we can't afford to be ignorant of the spammers' techniques. If that means spelunking through the grimiest corners of some scummy black hat forums, then so be it. I'll tell you this: I've never nofollowed a single link on this blog until today. The most effective way to fight the evil spammers is to understand them, and the first step toward understanding evil is openly linking to their tools and methods, exposing them to as much public scrutiny as possible.
When you design your software, work under the assumption that some of your users will be evil: out to game the system, to defeat it at every turn, to cause interruption and denial of service, to attack and humiliate other users, to fill your site with the vilest, nastiest spam you can possibly imagine. If you don't do that, you'll end up with something like blog trackbacks, which are irreparably busted at this point. Trackbacks are the source of countless untold hours of institutionalized spam pain and suffering, all because the initial designers apparently did not ask themselves one simple question: what if some of our users are evil?
When good is dumb, evil will always triumph.
Websites that allow users to post content will always be vulnerable to the actions of a handful of evil, spammy users. It's not pleasant. It is a dark mirror into the ugly underbelly of human nature. But it's also an unfortunate, unavoidable fact of life: some of your users will be evil. And when you fail to design for evil, you have failed your community.
Posted by Jeff Atwood
2 thoughts on how to counter spam
1) Follow the money. The spam must have a way to lead back to whoever is paying for it (or why would they pay?).
2) Spam the spammer. Reply to the spam with as much useless information as you legally can. Give them 10M useless email addresses. If you can do it morally, DDoS them. Set up voluntary botnets (nospam@home) that make it as costly as possible to get anything useful out of spam.
"Say goodbye to your two best friends! And I don't mean your pals in the Winnebago!!" --Dark Helmet
It seems to me the only realistic way to prevent spamming is to have some sort of vigilante task force that identifies the root source of the spam and posts their contact info publicly.
If there was a known spammer who lived near me, even if they didn't affect sites I use, I would gladly do things to make their life miserable. Maybe not the most legal of solutions, but probably effective. Of course we would need support around the world, too.
The real solution is to start charging: 50 cents an ad isn't burdensome to the public, but it's devastating for spammers.
Right now spammers see themselves in a gray area. Might be breaking the law, but odds of being prosecuted are small. Few spammers would cross the line into credit card fraud.
Why aren't CAPTCHAs simple questions that would require AI far more complex than what we have right now? For example: what was the last name of the wife of the 41st President of the United States?
Or if that's too easy, something more complex, but that would still be easily solved by a google search.
@Phil re:MS Cats
I believe they are cycled in and out as the pets get adopted, since adoption is a secondary goal. Guessing the right combination would be really tough for a bot, because the test doesn't provide any feedback about which images it guessed wrong, and you get a whole new set on each failure.
I think even in a fairly small pool of 3000 that get rotated in and out, guessing is a losing game.
I really liked WesleyC's comments. There's a guy who is looking at the entire game board.
"An encrypted timestamp (combined with something unique about the user--perhaps an IP or user agent?) placed in a hidden field--if the form is older than one hour or so, or the IP/user agent doesn't match, block the submission."
"A text field or textarea with an easily-readable, spam-worthy name, such as "comment" or "post", placed off the viewable area via CSS positioning--if it's filled in, block the submission."
The comments noticing that there might need to be a higher 'cost' (CAPTCHA or some other way of testing for a human) for posts containing URLs, or being considered more spammy after Bayesian evaluation, are also on the ball.
Looking forward to your implementation of Stackoverflow, Jeff and Jarad!
Spam identification is a computationally difficult task, so use it as your CAPTCHA.
Present the user with 4 messages, 1 known spam, 1 known ham and 2 others, and have them classify all four. So long as they get the known spam and ham correct, you have a reasonable chance that they are human and that they have classified the unknowns correctly.
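A minimal sketch of this scheme in Python, assuming the site already keeps corpora of known spam and known ham on hand (the function names are made up for illustration):

```python
import random

def make_challenge(known_spam, known_ham, unknowns):
    """Build a 4-message challenge: 1 known spam, 1 known ham, 2 unknowns."""
    spam = random.choice(known_spam)
    ham = random.choice(known_ham)
    pair = random.sample(unknowns, 2)
    messages = [spam, ham] + pair
    random.shuffle(messages)
    answer_key = {spam: "spam", ham: "ham"}  # only the knowns are graded
    return messages, answer_key

def verify(labels, answer_key):
    """Pass only if both known messages were labeled correctly; the
    labels on the two unknowns can then be harvested as training data."""
    return all(labels.get(msg) == truth for msg, truth in answer_key.items())
```

One caveat: a bot that guesses uniformly still passes the two graded items about one time in four, so this would need to be combined with throttling.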
How do the black hats avoid their own message boards filling up with spam? Is it just the "good is dumb" thing and we "good guys" refuse to get down to their level and mess up their forums for swapping hints about how to mess up others?
Sometimes I think the only way to end spam would be to amass a private army of henchmen-hackers, capable of tracing down commenters' locations and napalming the place unless innocence can be proved.
Remember that Russian spammer who got murdered, and the police admitted that literally millions of people have a motive? http://www.securityfocus.com/news/11256
If only we organized our hate...
I find it amazing how many ideas were posted on this blog alone (ones I hadn't thought of), yet there is some flaw in each and every one. Although the flash one or java one seem kinda convincing. Unfortunately, the better computer AI gets, the harder humans will have to work to prove their existence.
I think one of the advantages that stackoverflow will have is that its subject matter is very limited--having a "discuss anything" area would be a bad idea, IMO, for exactly that reason.
In the Personals sections of websites, two things are problematic: it involves email, and it involves sex-related activities. Those are two things that spammers love and have.
Services also involves email, and is pretty easy to fake as certain services are often desired and you're going to get a lot of hits on those.
"For Sale" is harder to fake--each individual ad is unlikely to get many hits (with the exception of concert tickets or currently-popular electronics).
stackoverflow postings will be less likely to contain URLs or require email, and will definitely not be sex-related (let's hope)! I think the very nature of the site will make your job a lot easier than craigslist. It will also make it a lot easier for human beings to detect that something is spam.
The one thing to think about is attachments--if you allow anybody to attach anything, that allows people to attach redirects to websites and also to attach malicious JS that will operate in your domain.
@Sean: "Why aren't CAPTCHAs simple questions that would require AI far more complex than what we have right now?"
Because coming up with hard questions is even harder than answering hard questions. Click on my URL to see how a bot can answer your sample question automatically. You had to employ a human brain to *invent* that question, and yet a machine cracked it in 0.27 seconds! Now imagine trying to make a question so hard that a machine couldn't answer it... and then try to imagine *making a machine* that could make questions so hard that a machine couldn't answer them.
There's plenty of research on questions machines can't answer; Google "The balloon hit a branch and burst." for more information. There's much less (useful) research on coming up with these questions in the first place.
But this is all academic. The right solution, as others have said, is (1) stop frequenting sites where spam is indistinguishable from ham, and (2) use human moderators to mod down spam, optionally using (2) to train a filter at the same time. It works for Wikipedia, YouTube, most competent blogs...
Spam is a severe problem, but I've noticed a few occasions where otherwise secure systems had holes in their spammer protection. I used to run a forum which received about 200 posts a day, and the spam protection hadn't failed. A few months later, however, I noticed that the software's knowledge base feature (which barely anyone used or looked at) had been overrun with spam: the CAPTCHA was missing from the knowledge base's submit link. I ended up removing the section altogether, since it wasn't any use, but if it had been a more public section, such as the downloads DB, it would have caused a lot more trouble. (It was a gaming forum and hosted a good few modifications.)
I see that the problem that needs to be solved is differentiating a human from a bot/machine. I was thinking of predetermined places to click on a Flash element, using the sequence of positions on the screen (which changes on every reload for better spam protection). I believe this will give better protection, since it is also a timed way of doing it.
The thing I’ve always thought about spam is that whilst software struggles to recognise it, humans can almost always spot it immediately. So I figure your best bet is to make it as easy as possible for humans to flag spam. I speak from no experience.
I do, however, have experience of using SpamSieve on my Mac. It does the heuristic thing of learning from what I flag as spam. Very few false negatives or false positives, although admittedly my spam traffic is peanuts compared to what a popular forum might receive.
Your suggestions (and more) have either already been implemented to no effect by Craigslist, or were not applicable in the first place.
What is a developer to do when there's nothing left to fight with and/or your resources are far outmatched by the spammers?
Well Jeff, with all that knowledge, when will you change the keyword to enter? ;) It has been the same ever since my first comment to the page. I could write a SPAM tool right now. Actually I don't even need a tool for that. A simple BASH shell script with curl to post text to the page in a for-loop will do :P
Those CAPTCHAs are getting more and more useless. The better OCR software gets, the more useless they become, and one day the only way to make them unreadable for OCR software will be to make them unreadable for human beings, too. On top of that, they don't work well for people with disabilities.
The biggest problem is this: say I'm a spammer and I want to spam a forum whose CAPTCHAs no OCR software can handle. No problem. I make a simple porn page and ask visitors, on every access, to first solve a CAPTCHA. In fact it's not my CAPTCHA, but the one from the forum I want to spam. That way people are helping me solve the CAPTCHA and spam the forum. Pretty easy, isn't it? It would work on your page, too. The problem is that CAPTCHAs don't tell people where they come from (what page or service they're trying to secure). A CAPTCHA should contain the URL of the page it belongs to!
Still, CAPTCHAs are not the way to go. Anyone thinking about alternatives to these? It must be something a human being can easily solve, but that's almost impossible for a computer to solve. Not a trivial task.
I have to ask. I mean (I know it's going to sound insanely naive): I get that spam is incredibly lucrative, but how, exactly? What is the revenue model (not just Craigslist, but mail, trackbacks, comments, wherever they spam)? I don't know anyone who responds to a spam ad. Who is the customer in all this money that spammers are making? Clearly there's tons and tons of money in it for it to be such a concerted effort, but... how?
If you're going to include CAPTCHA verification on stackoverflow, you should think about using http://en.wikipedia.org/wiki/Recaptcha . It's a great way to combine the verification with getting actual work done.
re: Mecki's question...I wonder if something subject-matter specific would solve the "mechanical turk pr0n site" way to get around captchas? If you're posting in a vaguely .net related forum then "which of the following words is not a reserved word in C#" - that kind of thing. Might also help improve the signal/noise.
Spam is so insanely cheap to produce that it just takes one or two idiots per million clicking on it to make money.
re: Mecki's question...I wonder if something subject-matter specific would solve the "mechanical turk pr0n site" way to get around captchas? If you're posting in a vaguely .net related forum then "which of the following words is not a reserved word in C#" - that kind of thing. Might also help improve the signal/noise.
That's actually an excellent idea!
It also helps to determine which pr0n sites (if any) are being used to bypass the security. Not quite sure what good that'll do, but anyway...
Mecki: your requirement is not quite complete. You want a problem that is simple for humans to solve, easy for machines to verify, and hard for machines to solve. Otherwise, you could just require mathematical proofs in order to post. Machines certainly can't produce those, but you can't verify them mechanically in general, either.
Furthermore, given a problem with those characteristics, you also want a large set of possible answers, to lower the probability of just guessing the CAPTCHA right. Just asking "does this calculation equal 2?" will only stop about half the spam, because a bot can simply guess (assuming a uniform distribution of expressions that equal two and ones that don't). The larger this set is, and the closer your distribution of answers is to uniform, the more spam is stopped.
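Under that uniform-guessing assumption, the math is a one-liner: with n equally likely answers and some number of free attempts, the chance a bot slips through is

```python
def guess_success_rate(n_answers, attempts=1):
    """Chance a bot passes at least once by blind guessing, given
    n equally likely answers and a number of free attempts."""
    return 1 - (1 - 1 / n_answers) ** attempts

# A yes/no question stops only half of blind guesses:
assert guess_success_rate(2) == 0.5
```

which is why throttling retries matters as much as the answer-set size: each extra free attempt multiplies the bot's odds.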
Even then, you have to prevent these problems from being farmed out, that is, placed on some other website and solved by humans there. I think you need to include the URL in the challenge to prevent this.
Taking all of this together, I'd say it's a damn tough job to find problems like these that are still solvable by a great number of users.
On the other hand, I like the idea of having some server-side AI that tracks which posts are marked as spam by enough users and flags similar-looking posts as "possible spam". Judging from the learning rate of such agents with a single input, this could work, since it distributes the work of training the agent across a large number of users (and they don't even need to know it).
Instead of the Spaceballs bit, this might be better (from Star Trek, "The Omega Glory", Episode Number 52, Season Number 2, First Aired March 1, 1968). Dr. McCoy says:
"Spock, I've found that evil usually triumphs, unless good is very, very careful."
And, like Rahul Chandran, above, I just can't wrap my head around how anyone makes money with spam. I mean, it's OBVIOUSLY spam. Who's going to click on it?
I think the way to solve the spam problem once and for all (or at least make it very unprofitable) is to educate users. Educate every user of every computer on how to recognize spam and scams. The government should do this, pay for it, and run a blitz of advertising to teach consumers not to buy anything from spam.
It's the only approach that comes anywhere near 100% effective at making spam unprofitable.
Just a question about stackoverflow: you mention here that you're planning to use a cookie to store and evaluate user reputation. But will this work even when I access the site from different computers?
Ah, but Lone Starr wins in the end, so I'm optimistic. :-)
Anyway, I like the idea of subject-related CAPTCHAs, but I doubt they'll work in practice because the questions have to come from some sort of limited catalog. If (when) stackoverflow gets popular enough, specific attacks will spawn.
I'd rather opt to treat all anonymous posts with the most suspicion possible short of being rude. And to counterbalance this, make it easy to create an account and sign on (OpenID, anyone?). Perhaps this even can be combined. Creating simple subject-specific question/answer pairs should be very easy for most users and this could be used to constantly permute the catalog of available captchas.
What about something like this: when a user posts, send an e-mail to their account with a one-time link they need to click to activate their post?
I know you can check e-mail programmatically to find the link to click, but it at least makes certain they have a valid e-mail address.
The problem with questions is that there is only a limited supply of them. Sure, questions are a good way to secure a page. For a human being, a simple question like "Which animal can fly? A lion, a bear, a monkey or a bird?" is trivial to solve. For a computer, this is impossible unless it has a complete understanding of the English language, can grasp the meaning of the question, and can find the correct answer. If you could write such a program, you wouldn't use it for a spamming tool; you'd sell it to Microsoft for 10 billion dollars :P Then you could just tell your computer what you want it to do and it would understand you. Combine this with speech recognition and you have a Star Trek-like computer: "Computer, ... do this and that and finally ...".
The real problem is this: if your database contains 200 questions, one day someone will have collected them all, together with answers, into a database, and a tool can easily detect the right question and look up the answer. Such a scheme will only work if you update the questions at very short intervals, short enough that spammers can never keep such a database up-to-date.
It seems to me that we are going about fighting spammers completely the wrong way. I don't believe there will ever be a way to completely block one group of people from using an open forum while still allowing everyone else. No matter how ingenious the filter or protection, it will always be circumvented eventually, because there is money to be made in it. So instead, remove the cause: don't let anonymous posts include links or email addresses. Remove their ability to make money off of spamming your site.
I think one of the reasons that spam is taking over the world is the mistaken philosophy of what Fake Steve Jobs would call the "freetards". Sometimes free is evil. If every commentary site in the world, if Gmail, Craigslist and similar services, required a $1 upfront payment to register an account then the spammers would go broke.
If you are going to work with ratings ( brownie points ) etc :
* New users can only post to a firehose section; regular users must go to the firehose and mark posts as spam or real. If a post is marked real, it goes to the proper forum/board.
* Once 1 post is marked as real, that user can post 1 message per day.
* Once 2 more posts have been marked as 'not spam', that user can help out in the firehose section (mark 1 message per day as spam/not spam).
* If a 'real'-marked post is later marked as 'spam' because it was approved by a spammer, both the approver and the poster get rating 0, and all their posts go to the firehose again.
Then of course you could go further and say that admins can give special powers to known users to have unlimited 'spam/not spam' voting power.
Either way, you can't solve the spammer problem with technology alone.
This project from Microsoft Research looks promising: it uses images of cats, instead of words, to identify humans.
Isn't that too easy to guess?
Case-insensitive letters + numbers, 5 characters: (26 + 10) ^ 5 = 60,466,176 possibilities.
Sorting 12 images into 2 categories: 2 ^ 12 = 4,096 possibilities.
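Both counts are easy to verify:

```python
# A 5-character case-insensitive alphanumeric CAPTCHA versus
# labeling 12 images as cat-or-not.
text_captcha_space = (26 + 10) ** 5
image_captcha_space = 2 ** 12

assert text_captcha_space == 60466176
assert image_captcha_space == 4096
```

So a blind guess at the image test succeeds roughly 1 time in 4,096: a much smaller space than the text CAPTCHA, though still far from trivial for a bot that gets no feedback on which images it mislabeled.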
If I might make a suggestion--for a CAPTCHA you can't do much better than thephppro's text-only CAPTCHA. No website I've ever built with it has yet been broken.
Combined with a spam-fighting CodeIgniter plugin I've written, it seems to be amazingly effective. This plugin uses several techniques to fight spam:
An encrypted timestamp (combined with something unique about the user--perhaps an IP or user agent?) placed in a hidden field--if the form is older than one hour or so, or the IP/user agent doesn't match, block the submission.
A text field or textarea with an easily-readable, spam-worthy name, such as "comment" or "post", placed off the viewable area via CSS positioning--if it's filled in, block the submission.
A text or audio-based CAPTCHA--if there's no image to use OCR on, it's a little difficult to break it!
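A rough sketch of the timestamp and honeypot checks above, in Python rather than the commenter's CodeIgniter/PHP. The timestamp here is HMAC-signed rather than encrypted; either way, the point is that the client can't forge or alter it:

```python
import hashlib
import hmac
import time

SECRET = b"server-side secret"   # placeholder; keep out of source control
MAX_FORM_AGE = 3600              # one hour, as the comment suggests

def sign_form(user_agent, issued_at):
    """Token embedded in a hidden field when the form is rendered."""
    msg = ("%s|%d" % (user_agent, issued_at)).encode()
    return "%d:%s" % (issued_at, hmac.new(SECRET, msg, hashlib.sha256).hexdigest())

def accept_submission(token, user_agent, honeypot_value, now=None):
    """Reject stale or forged tokens, and any bot that filled the
    CSS-hidden "comment" honeypot field."""
    if honeypot_value:
        return False
    now = time.time() if now is None else now
    try:
        issued_str, sig = token.split(":", 1)
        issued_at = int(issued_str)
    except ValueError:
        return False
    if now - issued_at > MAX_FORM_AGE:
        return False
    msg = ("%s|%d" % (user_agent, issued_at)).encode()
    return hmac.compare_digest(sig, hmac.new(SECRET, msg, hashlib.sha256).hexdigest())
```

Tying the token to the user agent (or IP) means a harvested form can't simply be replayed from a spammer's botnet.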
MS didn't originate this idea. I had this exact scheme in action (for posting to a blog) almost a year ago, and it was dugg on Digg.com (that's how I found it in the first place). Even back then I found the weak spot: how many different cat images will your database hold? 100? If you know the MD5 checksum of each cat image, a spam bot can fetch all the images, calculate the checksums, check them against a database, and it has found the cats. So you would need at least some random data in each image that changes every time the same image is displayed. Even then, a pattern-matching algorithm would work (I know a nice tool that finds duplicate images on your hard drive, even if the dupe has a different resolution, some text written on it that isn't in the original, some colors changed, and so on; it still knows it's basically the same image, and its failure rate is below 5%). You also lock out blind people completely: how can they recognize a cat?
I pretty much agree with most of the things you wrote.
More generally spoken, the question is:
What is the real solution?
1) Avoid spam getting posted at all, via some complicated CAPTCHA-like scheme?
2) Don't worry about spam getting posted, but have a computer figure out what is spam via some super-clever application (however that might work)?
3) Don't worry about either, and hope users will mark spam posts as spam?
(3) is no good solution, IMHO. Think of a site getting 10,000 spam posts a day compared to 200 user posts a day. You expect those 200 users to do all the work of tagging the 10,000 spam posts as spam?
(1) is the problem I fail to see an ultimate solution for.
(2) would be perfect, but I fail to see how the application can really recognize at least 99% of all spam posts.
Spam is also a very subjective term. What I see as acceptable might be tagged as spam by someone else. Which is why I'd say the real solution is (4):
4) Outlaw spam all over the world, punish spammers hard and make sure this law is enforced by all means.
Laws are not always the way to go; laws can't solve all of society's problems. In some cases, though, they have already worked: a lot of people around the world have been arrested for spamming and had to pay high fines. However, since the Internet is worldwide, as long as there is at least one country that won't act against spammers, spammers will simply spam from there.
Craigslist personals are targeted because there is a large number of desperate and stupid (a very bad combination) people on it. Do you really think spam is going to be a problem on SOF?
Caveman throws rocks at another caveman.
That caveman responds by wearing a thick animal hide for protection.
First caveman invents sharp pointy stick to stab through hide.
Second caveman invents shield to protect against sharp pointy sticks.
First caveman invents club to bash through shield.
Second caveman invents armor with extra padding to protect against club.
Thousands of years later:
First caveman invents long range missiles.
Second caveman invents interceptor missiles.
First caveman invents bomber aircraft.
Second caveman invents anti-aircraft.
First caveman invents stealth aircraft.
Second caveman invents radar refined to detect stealth.
First caveman invents nuclear ICBMs.
Second caveman goes about inventing a "Star Wars" shield.
And so it goes. And so shall the spam wars go.
The question is whether today's status quo is closer to throwing rocks or firing nukes. I suspect we're still fairly young in the evolutionary process.
@Mecki: "For a human being, a simple question like "Which animal can fly? A lion, a bear, a monkey or a bird?" is trival to solve. For a computer, this is impossible"
True, but even the dumbest computer has a 1-in-4 chance of guessing it at random.
For such questions to work they have to be more open-ended, rather than selecting from a limited choice. Which just ends up frustrating genuine users.
The other approach is to go for simpler multiple choice questions with far more possible answers. This at least reduces the hit rate from guessing. (e.g. show a 20 x 20 grid of coloured squares and ask the user to click on the red one. Reduces the hit rate from 1-in-4 to 1-in-400).
Possibly objectionable to users (as it would require more effort), but could small culture-tuned rebuses be used to represent CAPTCHA phrases?
But I still don't understand how IP proxy sites can be used by bots, and this is important because many people still think IP address blocking is effective.
I used to run a dating site, and I used many techniques which I developed myself. The legality of some may be questionable. My favorite was "poisoning" an account: the user would have no clue that anything was wrong, they could post and interact as normal, but no one would see what they wrote (except for other poisoned people). Also, "discourage" mode would randomly delay a user's page loads and discard a percentage of posts: a way to get people to WANT to leave, rather than wanting to get back in with another account. I also assigned each user a hidden "risk" based on a variety of factors like IP country, which was offset by "trust" gained by being a decent member for a while. I never took the easy/pseudo way out by banning entire countries or IP blocks.
I think the best way to handle these things by far, is to not let the enemy know that you're on to them. Don't give them a reason to upgrade their weaponry. Let them waste their time and get shoddy results. However, this is deceptive and could be illegal. It might not fly in a big corporation.
I have experience in implementing anti-spam filters.
* throttling individual IPs, network blocks, unique e-mails, e-mail domains, usernames, etc., with different limits for each (per hour and per day). Trending is really powerful.
* banning of IPs (look at X-Forwarded-For too, and see XFF project).
Spammers eventually run out of open proxies and cheap VPSes.
Unfortunately, you have to keep a large whitelist and lift bans when an IP stops spamming (because of hijacked Windows machines spamming from average Joes' IPs).
* statistical (Bayesian) filtering works well if you use 2- or 3-word sequences. If you have a lot of incoming ham and spam, the occasional spammer trying to game the filter won't skew it, and it might even learn to recognize those obvious attempts.
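The 2-word-sequence suggestion in the list above is easy to prototype. A toy sketch of a bigram naive Bayes filter (nowhere near production quality; the class name and training data are mine, not from any library):

```python
import math
from collections import Counter

def bigrams(text):
    """Tokenize into 2-word sequences, the unit the comment recommends."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

class BigramNaiveBayes:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, label, text):
        self.docs[label] += 1
        self.counts[label].update(bigrams(text))

    def score(self, label, text):
        # Log-probabilities with add-one smoothing, so a single unseen
        # bigram doesn't zero out the whole message.
        total = sum(self.counts[label].values())
        vocab = len(self.counts["spam"] | self.counts["ham"]) + 1
        s = math.log(self.docs[label] / sum(self.docs.values()))
        for bg in bigrams(text):
            s += math.log((self.counts[label][bg] + 1) / (total + vocab))
        return s

    def classify(self, text):
        return max(self.counts, key=lambda label: self.score(label, text))
```

As the comment notes, the gaming resistance comes from volume: with lots of genuine ham and spam flowing in, a spammer's random-word padding barely moves the per-bigram counts.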
I have to ask why this blog seems to get so few spam comments when the CAPTCHA word is always 'orange'. :|
This is probably either a terrible idea or one that goes against some principles behind Stack Overflow. So think of it as an idea for some other site, preferably one so self-assured that it doesn't mind a) making users jump through a few hoops to sign up and b) charging a small fee for people to contribute, while still drawing people.
The idea is that to contribute you have to pay a small nominal amount of money - I'm thinking $2 to $5 - as a kind of good behaviour bond.
If you want the money back or want to stop contributing you cancel the account and 14 days later you get your money back. If you've been flagged as having done nasty stuff then you don't get your money back.
A few bonuses:
- interest on the money can be used to help run the site.
- the cost is a disincentive for people who might otherwise poke around and look for exploits
- if you need both email and a credit card to sign up (and if there's some notion of uniqueness for both), then you've got something approaching two-factor auth.
A few drawbacks:
- a hassle to sign up
- locks out people who don't / won't / can't use a credit card online
- more to be managed for the site, including more security and especially accounting headaches.
So not a serious suggestion, just something to think about.
I guess if you wanted less hassle you could use an invite only model and only allow a small number of invites to be sent from each user and prune the tree and / or penalize people who invited spammers / hoodlums. But then you need arbitration for false accusations...
Also, I think Craigslist bears most of the blame here. Either their programmers truly suck, or Craig is holding the reins too tightly.
Require user accounts, with a long, slow verification process, instead of annoying verification for every post. And for Christ's sake, add some features. The internet has image capability now, Craig. Everybody with these big sites is so afraid to change ANYTHING because their business model might explode. Have some balls.
A lot of commenters miss the point that SOF and other participation websites need to reduce the barrier to entry, not raise it. The spammers will always learn the ropes, so making an overly convoluted path to normal participation just reduces real participation, because people either aren't used to the process or can't be arsed, and the site dies. Spam drops off, but only because the spammers realize the site is a waste of their time.
I can't begin to count the number of BBs out there I've never bothered with because I don't want to sign up; I don't want an account with them. I'll never come back unless Google indexes it and I land there by mistake.
@dnm - I don't think you want the poster to be the primary moderator.
1. spammers will delete complaints that it's spam
2. posters will delete comments that their post is dog crap
3. posters will gain enough 'kudos' to allow them to spam, and then spam everywhere. (who watches the watchers)
4. You also have to guard against "ganging up" on legitimate users. I'm sure there are plenty of spammers who are decent coders and could infiltrate the site, gain high moderation status, and abuse that power. It's worth guarding against in general, too, because you get some right twits in this industry who think they're Jesus's little brother in terms of worldly importance. They are more evil than spammers by far.
In a way I don't care if you've posted 1000 times or just the once. The only thing that matters is if you have something important to say. Participation, while great, isn't everything, and just means you have heaps of free time.
I'm not giving my credit card info to some random company just so I can post. It's an idiotic suggestion. Do you trust every website you visit with your credit card details? Facebook is huge and there's no way I'd hand over that sort of info. I might think well of Jeff/Joel, but I'm not paying to make my 2 cents known.
Jeff is soliciting my response in the first place by placing the comment box there. Let's not forget Jeff gets paid through ad revenue by site traffic. I dare say this site wouldn't drive as much traffic if it was devoid of the little comments box.
Or maybe you think the free-ness of the internet should really be 2-tier. Those with a voice are those that can afford one.
- commenting should be free of login if one wishes
- if people want accounts, maybe they get some minor privilege elevation (like pseudo-moderation) and the ability to post.
I like @French Horn's poisoned accounts, the alternate reality / honeypot. While I can see ways around it, it's pretty good to allow the spammers to think they're on top of things, while in reality they're not.
Ascii captcha is also damn brilliant. Not foolproof, but nothing can be.
I wonder if there could be a variant of the old-school style of verification I remember on video games: where they make you type in the third word of the second sentence on page 42, etc. The rendered page content becomes part of the captcha, which makes anonymous mechanical turks a little less effective (the unpaid variety at least).
Spam prevention has to be easy for the user, easy on admins, and it mustn't significantly raise the bar to participation. Remember that if a person can figure it out, so can a machine, because the machine is still programmed by a human.
I'd like to see a site that used clever CSS and extra textbox honeypots to make it hard for a bot to tell which fields it should be putting data into. If you get data back in a field that human users shouldn't even see, dump the submission in the bin. That'd be tricky from an accessibility standpoint, though.
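The server-side half of that honeypot trick fits in a few lines. A sketch, where the decoy field name ("website") and the form shape are my own invention:

```python
# Server-side half of the honeypot trick. The form includes one extra
# text input (hypothetically named "website" here) that is hidden from
# humans via CSS, e.g. positioned off-screen. Humans never see it and
# leave it blank; naive bots fill in every field they find.

def is_probable_bot(form):
    """Flag any submission where the invisible decoy field has data."""
    return bool(form.get("website", "").strip())

human = {"name": "Ann", "comment": "Nice post!", "website": ""}
bot = {"name": "Ann", "comment": "cheap pills", "website": "http://spam.example"}
```

The accessibility catch mentioned above is real: the decoy also has to be hidden from screen-reader users, for instance by labeling it "leave this field blank".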
What about something like this: when a user makes a post, it sends an e-mail to their account with a one-time link they need to click to activate the post?
I know you can programmatically check e-mail to find the link to click, but it at least makes certain they have a valid e-mail address.
I have seen this, or something very similar, on Craigslist already... it has been defeated as well. This also gets into the territory where it's easier just to sign up for an account, which from what I gather does not meet Jeff's goal; he simply wants a way to avoid forcing a walled garden on people, by allowing anonymous posters.
I think the people that say it is obviously spam are out of touch with the common user.
You know those banner ads that pretend to be Windows Update notifications or something? I've watched as people clicked on them. I tried to stop them, but I was too slow.
While it may not be a problem for a site specifically designed for computer programmers, not everyone on the internet is as savvy. Most people don't expect to be fooled.
Like some of the posters here have pointed out, I believe that spam is mostly an economic problem, and it will take an economic solution to fix it. CAPTCHAs and other forum/commenting moderation systems are really band-aids, incremental advances in stemming the spam tide.
I think the easiest way to eliminate spam (or reduce it to minuscule levels) is to remove the financial incentive behind it. But instead of targeting the spammers by making each posting a monetary transaction, we should be targeting the people who buy goods from these spamming outfits.
(Rough figures.) Since it only takes one person buying something from a spam advertisement for the spammer to turn a profit, we should target that one person and fine/educate them.
Though somehow I don't see this solution being easy or simple by any means.
The other problem is that when someone does come up with an effective method for deterring spam that can't be worked around, the spammers fight back and fight back hard.
Another thought on my last thought: if people spam spammers with junk replies, the spammers will be forced to filter out our spam, and whatever they use to do it can be used to filter them out in turn.
Someone commented about really smart AIs. If an AI is smart enough that I can't tell whether it is an AI or not, I don't care. All I care about is whether what is posted is useful to me.
Selling (and maybe using) automated spamming software should be a felony with harsh penalties and it should be strictly enforced.
Wikipedia is a living testament to the fact that goodness vastly outnumbers evil.
Not everyone agrees that Wikipedia is good, these intelligent (design) folks seem to think it's evil, so have started their own wiki trunk:
"The following is a growing list of examples of liberal bias, deceit, silly gossip, and blatant errors on Wikipedia. Wikipedia has been called the National Enquirer of the Internet:"
- Conservapedia: http://www.conservapedia.com/Bias_in_Wikipedia
Couldn't resist... :) Interesting post Jeff.
It seems that the best way to beat an automated tool is with a human response. Why not disallow anonymous postings and require an account to be able to post new messages?
New users would have their posts moderated, and they could only respond to an existing thread or subject. When they post a message, only the person who originated the thread would be able to see it. That person would decide whether it's real or spam and take the appropriate action. If it's spam, the account gets tossed. If it's a real message, it gets marked as visible to the rest of the community. Messages not acted on within 48 hours get automatically purged.
New users would have some sort of threshold where they need their first 3 to 5 messages moderated before they become standard users. It does pass some extra responsibility to the person starting the thread, but you get to load-balance the message moderation across the user base. You could even open up moderation so that anyone who had previously participated in that thread could moderate the new messages from unvalidated users.
Granted, this would be an annoyance. But it's a short term annoyance. I would put up with some initial annoyance if I knew that it would keep out the widows of Nigerian Princes.
Nobody will accept my spam defeating technique.
MAKE IT LEGAL TO VIOLENTLY MURDER SPAMMERS.
If you want to stop linkspam, just disallow HTML except from trusted users. A healthy portion of existing internet content is already advertising; I don't know why people think the situation would be different with user-contributed content.
It seems like bloggers like you (I read and enjoy your blog regularly) want to have it both ways: you want the benefits of user-contributed content, but you don't want to do the manual work necessary to police it. Think about graffiti. How do people handle graffiti? I bet they handle it more with scrub brushes than with laws. Ever see those smart businesses with a wall so attractive for graffiti that their best strategy is to hire a talented graffiti artist to create a mural? You need to quit pretending that people submitting forms to you are committing some sort of crime, and start thinking about a way to turn it to your advantage.
Ascii art captcha, that is awesome!
Regarding using images of cats: let's say there are 6 images, 2-3 of which are cats. Sure, if you have only 100 cat images, a spammer could theoretically MD5 them all, but that only works if you actually serve 6 separate images.
So instead of serving them separately, "glue" the images together into one large physical image in a scriptable image-manipulation program before serving, and generate the supporting code at the same time. If a few of the sample images are procedural, along with the background for the whole image, that trashes the MD5 trick.
All CAPTCHAs can be beaten; it's just a matter of cost. So use that.
(this had better be my last post or y'all think /I'm/ a spam bot :)
Craigslist also uses rDNS as an antispam measure (when sending emails).
Spammers concentrate their efforts at major and active sites. So unless StackOverFlow becomes one, I don't believe it will be a target.
I don't see spam on this site and it's pretty active.
A Flash CAPTCHA sounds good. To make it really tough for spammers, create an animation where the image of the text is split into two parts that scroll sideways in opposite directions; when they meet at some point, they form the text. I don't see how any software could figure this out.
One of the (maybe) possible solutions: after a comment gets submitted, show it immediately, but then run a background process that sends it as mail to a Gmail account and checks whether the mail ended up in the Inbox or the spam folder. Although I'm not sure Google would be happy with such (ab)use of their service :)
Nice post Jeff, keep up the hard work.
It is apparent that you have never used Craigslist. Had you ever used it to connect with buyers or sellers, or to look for local help (landscapers, for example), you would have learned that the positive experiences overshadow the spam problem. Recognizing spam is very easy when dealing with people locally.
You should lighten up a little bit also. Read some Rants Raves on craigslist for that.
It is apparent you cannot read. Jeff clearly stated that he HAS used CL. Further, before posting, he used sections of CL he had not used before to verify John Nagle's claims.
To address a couple of issues: it has been pointed out that in the scenarios mentioned above there is a limited number of cat pictures (3,000 for MS, I believe). One way to deal with this is to continue to expand and change the pool, similar to how reCAPTCHA does it. Taking the cat images as an example, we can surely alter the images to change the MD5 sum and filename to fool machines, but we can also add a couple of random images from Flickr and/or other image services to each test. Keep track of which of these added images have been tagged as "cats," and once they have been selected [by humans] enough times, they become part of the cat pool. These images, of course, would not count for or against the determination of a human. Don't forget, we must also expire images after they have been around too long; a success rate of even 1% for a machine can add up.
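The "alter the images to change the MD5 sum" step is worth illustrating, because it defeats hash databases almost for free. A sketch, with the caveat that "decoders ignore trailing bytes" is an assumption that holds for common formats like JPEG but isn't guaranteed for every format:

```python
import hashlib
import os

def perturbed_copies(image_bytes, n):
    """Return n byte-level variants of the same image. A few random
    bytes appended after the image data give every served copy a
    fresh MD5, while typical decoders stop at the end-of-image
    marker and render all of them identically."""
    return [image_bytes + os.urandom(8) for _ in range(n)]

original = b"\xff\xd8 fake jpeg body \xff\xd9"  # stand-in image bytes
variants = perturbed_copies(original, 3)
hashes = {hashlib.md5(v).hexdigest() for v in variants}
```

A spammer can still fall back to perceptual hashing or actual image recognition, which is why the rotating, human-validated image pool described above matters more than the byte-level trick alone.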
I like the idea of a database of questions specific to the website in question (i.e. programming for a coding site, hiking for an outdoors site, etc.). The questions could even be a little difficult and require some research. But again, you would have to keep adding new questions [not from easily available public sources] and expiring old ones. Not an easy task.
It is a good thing that for most websites just a little deterrent is enough to keep spammers at bay.
What about mailing lists? That is a legal and useful use of what some would term spam software. However, now we get into the tricky area of the DMCA. Are you for the DMCA?
It has a few flaws, I know, but maybe some feedback?
I hope Craigslist survives... I found my current apartment (I live in NYC) there so easily and the thought of paying a realtor again makes me sick. Even the newspaper is a horribly stressful method here.
Personally, I feel it's a theoretically impossible task; the difference between bots and humans is the concept of humanity itself. We keep seeing that in order to overcome bots we make the validation technique more and more human, but it fails.
This is because it's like the Allies holding a secret meeting at a Nazi headquarters, in German. IMO the only way you can avoid it is by developing a medium of information that computers can't intrude on: speak in a language the Nazis can't know, or get out of there. Not something they will have trouble with, but something they CAN'T do.
Another thing is to eliminate anonymity - which is a scary 1984'ish idea..
Since adding their new developer platform (with OpenSocial), there's been an upsurge in spam from badly coded apps. They're plugging those holes pretty quickly, but some of the solutions involve limiting access by apps to the system.
I agree: you need someone on your side who understands the mind, tools and tricks of the enemy. That is, you need your own private police force.
You will need a (trained) Bayesian filter to win spam battles. Spam is your friend: it trains the filter, which in turn becomes more effective against spam. The best way to fight spam is to use it against itself.
CAPTCHA is annoying and useless, as there are scripts out there that can work around it. Nothing can work around Bayesian filters.
I'd be curious to see some statistics on how many posts each section gets in a day. This problem seems ideally suited to traditional machine learning techniques, but maybe the size of the data sets makes it infeasible for Craigslist. Assuming the data set was small enough, or Craigslist had sufficient resources to allocate, I'd try something like an unsupervised learner that clusters posts based on a series of attributes and then uses community input to label the clusters (something which is already happening with the "Mark as Spam" links on each post).
CAPTCHAs are incredibly annoying for the good guys and don't actually stop the baddies, so please give up on them. Use the amusing alternative where you get 3 random tiny photos and have to click on the kitten.
I assume the most popular protection methods are going to be targeted by the spammers first, so using your own off-the-wall solution might actually work best of all!
I have 2 cents to chime in about this topic. And it's a very philosophical 2 cents really, so bear with me or just skip over.
I notice that SPAM is almost developing into a hive mind that creeps into the regions which "deserve" it the most. Yeah, sure, there's email spam, but one Bayesian filter later I get practically no spam in my mail. Greylisting is also very powerful.
But what I notice on sites like ebay and craigslist is that whenever we as a society get lazy and try to do things "the easy way" - think of all the board room meetings parodies that we've all seen by now where a 22 year old "genius" says "We'll bring the pet store to their fingers and make profit" - whenever we do that, spam shows up.
I mean, personals are a notoriously lazy way to date, and really, even without spam, I could never trust a posting on something as anonymous as the internet. Call me untrusting or kooky, but it's simply absurd for me to try to reconcile something as intimate as dating with something as anonymous as the internet. Go clubbing/jogging/walking your dog if you want to meet random people.
All in all, the point I'm trying to make is that SPAM creeps into places where we stretch the reach of our daily experience further than it's meant to go. The reason why Craigslist or any other online site has difficulty separating SPAM from HAM might be that there's almost no difference between the two: indeed, how could you possibly tell whether a personal ad is genuine?
In that sense, I think for stackoverflow.com, as long as there's a difference between genuine content and content posted for profit, you will have no problem getting rid of spam. Patterns will be easy to detect, text will be easy to recognize, and you can even set up tests that are extremely task-specific, like programming-language riddles.
As soon as you introduce "profit-making elements," like "hire a coder" style stuff, you will face an impossible-to-detect SPAM squall.
There's a reason why only certain areas of Craigslist are spammed. When you are looking to buy cheap $5 stools from people moving out, you are unlikely to be a big spender and your tolerance is very low; you are, after all, looking for a five-buck stool. But go to the real estate for sale area, and you will find hundreds of spam postings. In the same way, spammers naturally go to places where people either have to be gullible to begin with or have to lower their paranoia threshold in order to participate (like personal ads).
My 2 long cents.
"stating" and "doing" are two very different things. Anyone who has used craigslist exctensively will know that spam is just noise. As I just pointed out in another post, if you want to put spam on a backburner, use a Bayesian filter. Captcha is too primitive for that.
What about a Flash-based CAPTCHA? Put its impenetrability to use.
Captchas (specifically pictures of words) have been broken. I spent a day researching the state of the art in breaking captchas, and it turns out there is code available out there (OCaml and Python versions were found in about 10 minutes). The supposed gold standard of captchas, Gmail's sign-up, has reportedly been broken. If you actually sit down and spend a day thinking about how to break one, and you're even remotely talented at programming, the solutions become pretty obvious. Clearly, reasonably talented programmers are doing this (what programmer do you know who knows OCaml but is also incompetent?).
Another issue is that certain types of attacks can be jump-started by human interaction. It turns out that if you are spamming a site over and over, you get a pretty good idea of the correct answers to its captchas. If you have a team of low-paid workers spend an hour entering them, that is usually enough of a seed to overcome the captcha. This can also work with the aforementioned pr0n-site redirecting.
Realistically, for a spammer to be effective they only have to get the captcha right about 25% of the time. Anything with choices that can be guessed (like 4 picture options with kittens) is an immediate fail.
The underlying problem is that there is HUGE money in this activity. An out-of-work programmer could easily support himself. I know people who own million-dollar-per-year businesses supported by this kind of activity. And those businesses serve a specific niche; I can't imagine what the general situation is like.
Go read about Asirra before posting something like "oh, but spammers can build a database of all 100 or so images and do an MD5 hash to determine which are cats and dogs."
Asirra has a database of about 3 million images and it's always growing, thanks to their relationship with petfinder.com. Imagine if all pet websites would contribute - Asirra would probably grow significantly faster than spammers can keep up.
That said, the fact that a user has to sort 12 images into 2 categories means 1 in 2^12 = 4096 random guesses will still get through. But combine this with requiring users to register for an account to post, and give users the option to flag posts as spam, and I think this could be very effective.
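The 1-in-4096 figure checks out; as a quick sanity check:

```python
# 12 independent cat-or-not choices: a blind guesser labels all of
# them correctly with probability (1/2) ** 12.
p_guess = (1 / 2) ** 12
attempts_expected = 1 / p_guess  # mean of a geometric distribution

assert p_guess == 1 / 4096
assert attempts_expected == 4096
```

And as the follow-up comment points out, the image set resets on a wrong answer, so a bot can't even brute-force one fixed challenge 4096 times.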
With as popular as this blog has become, I'm surprised your ORANGE captcha still works so well.
It may only be 4096 possibilities, but it resets when you get it wrong. It'd be pointless to blindly guess.
Pretty much every technology, from the rock forward, was invented for the purposes of Good. (Caveman Ogg smashes wheat with rock, makes flour.) Almost inevitably, someone eventually comes along and uses the technology for Evil. (Caveman Grogg steals Ogg's rock, smashes Ogg in head, pwns Ogg's rock.) This has been repeated many, many times. We always think the problem is with the technology. (Maybe if we wrapped the rock in something soft so it couldn't hurt people... or if we only issued rocks to people we trust... or invented rock-proof armor...) Maybe we should investigate fixing people and not technology? :-)
Clearly no one here knows that the Old English typeface is an impenetrable cloak
While the idea of having people prove themselves by adding non-spam content is attractive, there may be an initial hurdle: people could be disinclined to try out your site if, the first time they contribute, they are apparently ignored because they need to be moderated.
Could we address this by building a trust metric on top of OpenID? I'm thinking of something vaguely like Advogato, except you build up reputation on several participating sites and that serves as your letter of introduction to another site that trusts those sites to know who to trust...
I've always wondered if a CAPTCHA that targets "just" URLs is feasible. If a user wants to spam a product, surely the only way to get anything out of it is to add a URL to the message?
Perhaps the new method of fighting spam will be a centralized human group that moderates every URL posted, checks it personally, and verifies the message. Imagine something like Akismet for WordPress, but run only for URLs, to verify whether a message is spam. I can imagine a centralized service that manually verifies, using paid workers, every URL posted in any forum or blog software being very effective, although I'm positive it can't be that simple.
One thing worth noting is that a would-be spammer can use a captured captcha image from your site as a captcha on their site, thereby getting humans to do the OCR. One way to undermine that strategy is to include information that identifies your site in the captcha. Similarly, bits of text that are obviously irrelevant (to a human) can break OCR-based attacks.
One thing you failed to mention about Wikipedia is the vast amount of bot work that reverts vandalism over there. If humans were solely in charge of keeping Wikipedia in good shape, it would be in shambles. There is an IRC channel that receives every edit made to Wikipedia; a bot then checks the page for known bad URLs and strings and reverts if necessary. Wikipedia also has nofollow = true for all external links.
While captcha obviously isn't a perfect solution, better ones can help: http://alipr.com/captcha/
It uses a two-pronged test: image recognition, by clicking the geometric center of various superimposed images, and identification of an object in a random image.
A little excessive for many sites, but I would imagine it's a lot harder to circumvent by automated means.
"Spam is so insanely cheap to produce that it just takes one or two idiots per million clicking on it to make money." -- Rhywun on May 29, 2008 05:39 AM
Aaaaand there's your spam business model. The cost per transaction is essentially free.
Add a "cover charge" and you'd demolish that profit model. For example a smallish-but-not-micro credit card transaction (say, $5) to be able to post to the site forever.
As a happy side effect it would verify the reality (if not exactly the identity or humanity) of every prospective contributor. There would be no need to correlate user reality with site identity, thus preserving anonymity.
I was a CL true believer for years but I quit using it 2 years ago ... not just b/c of spammers but b/c the quality of all interactions was in a rapid decline. I would try to sell something at a reasonable -- no, an INSANELY CHEAP -- price (hey I just want to get rid of the thing, that's why it's not on EBay) but no matter how cheap I'd price it I'd get a flood of jerks offering literally nothing, with a healthy heaping of insults as well. And I won't even get into no-shows, abusive followups from no-shows (dude, I sold it to someone else because you NEVER SHOWED UP), and other general ass-hattery.
If CL just offered a teeny tiny cover charge the quality of interaction would skyrocket. Not because the service is "worth" the cover charge, but because of the very existence of the cover charge.
I'd recommend reading chapter 21, "Breaking the Rules," in Rules of Play by Katie Salen and Eric Zimmerman. (Then read the rest of it.) They are talking about games rather than software, but I believe the ideas apply to the type of social software you are looking at.
While Salen and Zimmerman are not specifically targeting your problem, the entire subject of gameplay lends itself to the kind of "social interaction hacking" required to avoid or mitigate these problems.
"What about a Flash based CAPTCHA"
That seems like a really good idea - or an animated GIF or something. Is there a good reason why that might fail?
You only need to solve one CAPTCHA term. If the flash or GIF CAPTCHA has 30 frames of animation, you have 30 different views of the term you need to solve (rather than just one), which would help heuristic OCR processes tremendously.