December 20, 2006
You might read a post on this blog and decide I'm full of crap. That's fine. I often am full of crap. I encourage you to leave a comment explaining why you feel this way. And, while you're at it, feel free to point out any errors or inaccuracies in anything I've written. This kind of simple, immediate, highly visible public dialog is why I believe so strongly in comments as an essential part of blogging.
But sometimes a mere comment isn't enough. Maybe you have your own blog. Depending on the depth of your feelings on the matter, you might want to write an entire post on your blog explaining, in great detail, specifically why I'm full of crap. Then you'd publish your post for the world to see. But how do you know that I, the target of your vitriol, have read your post? How do you know that I can even find your post? You could email me directly, but that feels a little too intimate. Or, you could leave a comment linking to your response, but that feels like additional work.
The answer lies in trackbacks. Trackbacks are a way of relating conversations across websites. After you publish your post, you send a trackback to my post. This is usually handled automatically by the blogging software. The trackback links our two posts together. I get notified of any trackbacks to my posts, so I can follow the trackback to read your response. Furthermore, trackbacks are public, just like comments. So any future readers can also follow our conversation thread by directly navigating from my blog to yours with a single click. Tom Coates created a little diagram which illustrates this process:
They're a great idea. Unfortunately, trackbacks are so horribly and fundamentally broken that they're effectively useless.
The original trackback specification was published by Six Apart in summer 2002. It's very basic. The trackback URL is published in the metadata embedded in every blog post:
dc:title="The Programmer's Bill of Rights"
You simply HTTP POST a bit of data to the trackback URL of the post you're commenting on, like so:
Content-Type: application/x-www-form-urlencoded; charset=utf-8
See? Simple. And it works great. In one swell foop, you've created a coherent conversation that flows across two totally different websites!
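To make the mechanics concrete, here is a hedged Python sketch of what sending such a ping involves. The trackback URL and field values are placeholders, but the four form field names (title, url, excerpt, blog_name) come from the original spec:

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_trackback_ping(title, url, excerpt, blog_name):
    """Build a TrackBack ping: an HTTP POST of four urlencoded form fields."""
    body = urlencode({
        "title": title,          # title of *your* post, the one doing the linking
        "url": url,              # permalink of your post
        "excerpt": excerpt,
        "blog_name": blog_name,
    }).encode("utf-8")
    return Request(
        "http://example.com/trackback/123",  # placeholder trackback URL
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded; charset=utf-8"},
    )

# urlopen(build_trackback_ping(...)) would actually send the ping; per the
# spec, the target replies with a small XML document containing <error>0</error>
# on success. No network call is made here.
```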
Well, it was great. Until the spammers realized two things:
- how high the pagerank is for popular blogs (7+)
- how trivially easy it is to abuse trackbacks, because the protocol has no authentication mechanism whatsoever.
CAPTCHA has completely solved my comment spam problem. But distinguishing between humans and machines is useless on trackbacks, which are all machine-entered by definition. I've fought the good fight against the rising tide of trackback spam with various blacklists over the last three years, but as this blog grows more and more popular, I'm clearly losing the war. Malicious spammers can batch register dirt-cheap domain names and write scripts to mass-POST these URLs all over the blogosphere far, far faster than I can ever hope to blacklist them. Every day starts with a depressing routine of adding 4-8 new spam URLs to my blacklist.
Yes, there are distributed blacklists like Akismet. Yes, you can put all your trackbacks into a moderation queue and spend 5 minutes every day deleting them all manually. Yes, you could retrieve the linking page and make sure it contains the promised link to your post. But these are only slightly larger band-aids over a massive, sucking chest wound. These aren't sustainable solutions. We have a much deeper problem. Trackbacks, as we currently know them, are dead, kaput, expired.
It's an absolute travesty, and I completely blame Six Apart's initial trackback specification. How could they forget the rich history of email spam we've had to deal with for the last ten years? Trackbacks, as a result of Six Apart's incredibly naive initial design, are now a total loss. That's what happens when you design social software without considering the impact of malicious users from the very beginning.
Now, hopefully you'll understand why I've disabled all trackbacks for this blog as of today.
And please, if you're designing social software, try to avoid repeating the many mistakes of our forefathers. Again. Design from day one with the assumption that a few of your users will be evil. If you don't, like Six Apart, your naiveté will make the entire community suffer sooner or later.
Posted by Jeff Atwood
Yeah, security in general should always be kept in mind. There's a line between treating your users like customers and being insecure.
This raises the question: how would you go about designing Trackbacks 2.0?
I feel that true users are usually not evil. The evil-doers are out to abuse whatever system they can for their own gain. It is highly likely that it is the same base of "evil-users" who are responsible for spam in comments, trackback abuse, pop-up ads and spy/malware, and the "blink" tag. Well, maybe not that last one. True users can be bastards, but it is usually because they are demanding and intimately in touch with the software or service. They have high standards and expect evolutionary positive change in their app/service of choice.
If only there were some way to tag the "evil-doers" themselves and differentiate them from the mass user base... but until then, it does pay to plan.
Would it be any easier to maintain a white-list of legit blogs that you regularly get trackbacks from?
You'd still have to manually approve new blogs, but the approval list would only contain a small fraction of total trackbacks.
Still taking my time to perfect the trackback, and I still got love for the streets.
Your old captcha post implied that you'd already disabled trackbacks, so I thought they'd been gone since you first instituted the captcha. Probably would have been a decent idea, since even legitimate trackbacks are largely spam, of the "great post!" variety.
Did you change the font on the blog somehow? I can't put my finger on whether it's larger or also a different face; or if it's just me.
For my trackback implementation, I do a reverse lookup -- when a trackback ping comes in, I read the putative trackback URL and look for a link to my post there. No link, no trackback. (This was Nikhil's idea, to be clear.)
It's worked great. I know it's working because I hardly ever get trackback spam (like, a total of 6 since I implemented this), but I get referrer spam all the time, so they're hitting the blog all right.
Of course, I can tweak anything I want, since I run a MeWare blog. :-)
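A minimal sketch of that reverse-lookup policy might look like the following (the function name is invented here, and the link check is a plain substring scan rather than real HTML parsing):

```python
def verify_trackback(source_url: str, my_post_url: str, fetch) -> bool:
    """Accept a trackback only if the claimed source page actually links to my post.

    `fetch(url) -> str` is injected so the policy can be tested without a
    network; a real implementation would fetch over HTTP and parse the HTML.
    """
    try:
        html = fetch(source_url)
    except OSError:
        return False          # unreachable source page: reject the ping
    return my_post_url in html
```

The key design choice is exactly what the commenter describes: the burden of proof shifts from the ping itself (trivially forgeable) to the publicly visible content of the linking page.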
I decide to comment now that you aren't taking trackbacks. "Your full of crap."
Actually, TypePad has been very good at automatically catching trackback spam. The trick they use is to examine the trackback url and verify that the url links to the post.
Because a link must be provided, an authentication mechanism is not really needed. You can also screen out links that contain certain words or whitelist links that contain .NET related keywords.
If only there were some way to tag the "evil-doers" themselves and differentiate them from the mass user base...
Right, but this implies logins and persistent identity, too.
when a trackback ping comes in, I read the putative trackback URL and look for a link to my post there. No link, no trackback. (This was Nikhil's idea, to be clear.)
This is a reasonable idea, but it doesn't scale. Furthermore, it could easily become a huge DDOS (distributed denial of service) vector. The last time I checked, I was getting 75 spam trackbacks PER HOUR-- more than one every minute! That means our server would be overloaded by the bandwidth and CPU overhead of going out and retrieving all that spammy content to look for my blog post's link.
So, if I was an evil user, I'd create a 3 megabyte HTML page, and I'd "trackback" your site every second. Or, I could have my zombie web farm send you a bunch of trackbacks, hundreds per second, pointing to garbage URLs.
Of course, these attacks are possible with other means. But making trackbacks do a reverse lookup makes DDOS attacks far easier-- they'd get our server to do all the work!
Spooky, Jeff. I blogged a response about this post and mentioned that reading the origin link could become a DOS vector. 'Course, I was thinking along the lines of a Bayesian-style content analysis, but the gist is the same.
I'm somewhat perplexed that capable people have been buying into the trackback technology for so long.
On trackback spec improvements
So the additional processing should be on the trackback posting side, right? For example, after the POST you could return an identifying code and an image (like that "orange" thing you have here), and the posting side could show the image to the user, ask him to decode it into text, then post again with the identifying code and the decoded text appended to the usual trackback parameters. That way you'd be sure someone is sitting on the other side to process the images (or whatever else you give them). Of course, this would have to be implemented in all of the blogging software, but it's not too difficult and it seems quite secure to me. Am I missing something?
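The two-round handshake described above could be modeled roughly like this; everything here (class name, token sizes) is invented for illustration, and the `answer` would in practice be rendered into a distorted CAPTCHA image rather than returned directly:

```python
import secrets

class ChallengeServer:
    """Two-phase trackback: the first POST earns a challenge, the second must solve it."""
    def __init__(self):
        self._pending = {}   # challenge_id -> expected decoded text

    def first_post(self, ping: dict):
        """Respond to the initial trackback POST with an identifying code and a puzzle.

        In practice `answer` would be rendered into a CAPTCHA image and only the
        image returned; it is exposed here so the sketch is testable.
        """
        challenge_id = secrets.token_hex(8)
        answer = secrets.token_hex(3)
        self._pending[challenge_id] = answer
        return challenge_id, answer

    def second_post(self, ping: dict, challenge_id: str, decoded_text: str) -> bool:
        """Accept the trackback only if the human-decoded text matches; one try per challenge."""
        expected = self._pending.pop(challenge_id, None)
        return expected is not None and decoded_text == expected
```

Because each challenge is popped on first use, a spammer cannot replay a solved challenge across thousands of pings.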
An absolute travesty? Really?
If you're interested in helping fix TrackBack, you're more than welcome to join the standardization effort:
From that same post:
"As many familiar with the protocol will attest, TrackBack, despite its wide market adoption, is far from perfect -- largely due to the fact that TrackBack was invented for a blogosphere that was much different in size and makeup. Today, blogging has exploded in popularity, presenting TrackBack with a whole new set of challenges to address."
Was it an absolute travesty to design a spec that was appropriate for the audience it was delivered to? Or should Ben and Mena have assumed there would be hundreds of millions of bloggers? Now granted, they're part of the reason that there *are* so many millions of bloggers today, but just as HTML 1.0 didn't do everything the modern web needs, so too did the first version of TrackBack have shortcomings.
Very little would get done if everybody asked "what if this gets as popular as SMTP?" Not to say we shouldn't take that responsibility seriously, but I think it's understandable to be naive about social abuse in the same way that the architects of email, feeds, tags, and the web itself were.
Trackbacks are nothing more than a worldwide circle-jerk.
I removed them from all the sites I'm involved with a long time ago.
It should be possible to use a CAPTCHA-based system also for trackbacks. Blog authors would only have to visit the site they refer to once, to receive a personalized security code (e.g., a GUID). The security code would then be registered in the author’s blogging engine, and be used automatically for all subsequent trackback registrations (as a trivial extension to Six Apart’s original metadata). In case of misuse, the code would simply be revoked, and all connected trackback posts could be removed automatically. It shouldn’t take more than a few hours to implement support for it in a blogging engine.
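A sketch of that scheme, with invented names: the target blog issues one security code per linking blog up front, checks it on every subsequent ping, and can revoke it wholesale on abuse:

```python
import uuid

class TrackbackTokens:
    """Issue one security code per linking blog; revoke it wholesale on abuse."""
    def __init__(self):
        self._tokens = {}   # blog_url -> issued security code

    def issue(self, blog_url: str) -> str:
        """One-time, manual step: the linking blog's author visits once and gets a code."""
        token = str(uuid.uuid4())
        self._tokens[blog_url] = token
        return token

    def is_valid(self, blog_url: str, token: str) -> bool:
        """Checked automatically on every subsequent trackback registration."""
        return self._tokens.get(blog_url) == token

    def revoke(self, blog_url: str) -> None:
        """On misuse: drop the code, then delete all pings it authorized."""
        self._tokens.pop(blog_url, None)
```

The one-time human step (visiting the target site for a code) is what buys the spam resistance; everything after that stays automatic, as the commenter intends.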
Jeff, I think trackbacks alone are really not enough. In a lot of situations a blog could point back at anything that links to it -- articles, forum posts, as well as other blog entries. In my blog I take *any* inbound link and check back on it to see if the link's there, like Mike does, and if so, link it. A timed routine that runs once a day then goes out and re-checks links over time to ensure they haven't gone dead -- if they're not there anymore, the trackback is removed. This actually works great for things that end up on home page links...
It's not perfect -- there's noise there at times -- but I haven't seen any trackback spam because it usually gets thrown out before it ever gets linked. It also helps to have an easy way to get rid of trackbacks -- I have my blog set up so that all views become editable in admin mode, so I can breeze through and remove garbage comments and postbacks very quickly without even hitting the admin interface. Like you, though, comment spam has nearly completely died since adding CAPTCHA, so most of the cleanup comes from backlinks, and it's pretty minor.
OTOH, I question how much value there really is in trackbacks these days. How often do you really follow a trackback when reading a blog, especially if a topic already has a number of comments? The trackback mechanism simply doesn't tell the target site enough to be truly informative -- enough to let the user see what they're getting suckered into...
So, if I was an evil user, I'd create a 3 megabyte HTML page, and I'd "trackback" your site every second. Or, I could have my zombie web farm send you a bunch of trackbacks, hundreds per second, pointing to garbage URLs.
Perhaps, but it might take a long time for any trackback spammers to get to that point. Not to mention that it would affect their bandwidth costs.
This would be easy to circumvent. Only grab the first X KB of data from a blog. If the link is not there, so be it.
Love the blog, Jeff, it's a daily stop for me, but, when I went to read your post today the new, larger fonts slapped me in the eyeballs! Why such a jump? Just for my two cents it makes it less readable than before. I feel like I'm in the large print section of my local library.
You know, I'm glad I ran into your blog, because before I read this post I had no idea what 'track back' was.
After blogger B spent the time and effort to write an entire post to reply to blogger A's post, he could also post a comment at A's with a two-line summary of the reply and a link to B. It's negligible extra work compared with writing the reply.
Hmm... random specification for a track-back system, eh? Well, I tend to randomly come up with specifications. I've come up with about a dozen over the years: email clients that allow threading on messages (closer in functionality to a web forum than email) and participation from other people; messaging clients that allow exchange of small widgets and effectively let people "code" together (can you imagine two people collaborating on the same piece of code in real time, looking at the same information, both able to modify the code live? I can). And even a new protocol for a file-sharing system that someone else later came up with as well, which became known as BitTorrent (his idea was way better, though, because it broke the file into chunks; mine just sent the whole file across the network, sharable from anyone -- sort of a queue-based system where you downloaded from the first available person. I got the idea after Hotline sucked so much).
Maybe a track-back protocol wouldn't be that hard to envision.
Marius, by asking the poster to wait for a response and then manually decode the image you are negating one of the advantages of track back - speed and ease of use.
Granted, it's slightly quicker to wait, then read a word. However, it doesn't take much more effort on the person's part to just copy, paste, and post a link, and then both servers have less work to do.
An absolute travesty? Really? [..] so too did the first version of TrackBack have shortcomings.
That's fine, but why is the latest version of the Trackback spec two and a half years old? You'd think a small, nimble company like Six Apart could do better than the W3C, but I guess not.
The fact that there's been zero update to the trackback spec in the last 2 1/2 years to address the ongoing epidemic of trackback spam is, indeed, an absolute travesty. Really.
There are millions of blogs now, so all they'd have to do is send you a permalink to every blog post ever created and bang, they've gotten you to DOS yourself by leveraging the small number of bytes they sent you by whatever number of bytes you decide to read from the linked post
Exactly. It makes an inverted DOS attack trivial to mount-- *your server* is doing all the work!
The problem I find with trackbacks is that when you're browsing a blog and see a trackback in the comments, it's often a quote from the blog entry you've just read. And if you follow the link it goes to someone else's blog, where there's just a link back to the blog you came from with that quote as the text -- and no extra comment.
I guess it's different if you're the blog author involved.
An evil user can use a post verification system to DOS you even if you only read 100 bytes from every post to verify. They'd just create a distributed DOS with a large number of posts to hit your trackback URL. They don't even have to be their own posts. There are millions of blogs now, so all they'd have to do is send you a permalink to every blog post ever created and bang, they've gotten you to DOS yourself by leveraging the small number of bytes they sent you by whatever number of bytes you decide to read from the linked post.
So far all I've been able to come up with is a system that is (presumably PHP) script based, where basically the ping back is received by a script that then takes the URL from the referrer and fetches the first page from the referring URL. It then parses the entire HTML document it's fed, looking for an exact duplicate of the original post's URL. If it finds it, the site is accepted. If it does not find it, the site is rejected.
The pros to this are that it forces them to actually link to you on a page. The cons are, well, they're just large. It's so easy to defeat that it's not funny.
There are additional safety features you could embed in the parsing script: while parsing for the originating URL, the script could also look for meta refreshes that would take the user away from that page, and for potentially malicious JavaScript. But the drawback there is that the script may deny a genuine blogger. I mean, where do you draw the line?
Thus far, I can't think of any method that wouldn't be cracked almost immediately. Basically, the nature of links on the web is too fluid. Anyone can link to anything, and the only way to weed out the good from the bad is by hand. There's already been a ton of proof that filters are imperfect and can only handle so much before they're bypassed. There's always some clever monkey.
I will say that the track-back service you have above reminds me a great deal of the old Web Rings that used to exist... I always knew that tech would make its way back in, it was just a matter of time...
Sometimes the Technorati site doesn't respond to queries. Refresh and try it again; it'll likely work.
Such is the cost of using external dependencies..
They're like those jerks from Nigeria who keep sending me emails about how I've won $20,000,000 and all I have to do is send them my life's savings to claim it.
They should all have their hands cut off before they're shot...
So what do you think about Pingback?
The font is Calibri. All you Office Beta testers should know that. :)
Nice change, Jeff. Very sharp. And good call on the Trackbacks.
Regarding the DDOS attack, certainly that's a worry for the large sites. But that's hardly a worry for the great majority of blogs.
Just like my Invisible CAPTCHA control. It should be trivially easy for someone to break it, but have they? No. It won't happen until it makes monetary sense for them to.
Speaking of external dependencies, why choose Technorati over Akismet? You've mentioned the problem of having to review for false positives. I say don't. If a few accidentally get caught by the filter and don't show up, so be it. That still seems more accurate than Technorati. How many blogs that reference your posts are being missed by Technorati?
But that's hardly a worry for the great majority of blogs.
I'm sure that's *exactly* what they thought about trackbacks originally.
How many blogs that reference your posts are being missed by Technorati
But there were also blogs that referenced my posts that never sent trackbacks, either. Google search would be best, but I can't scope the linking query the way I want. Plus I get weird hits on forums, and other oddball places that aren't conversational. Technorati, for all its warts, is very blog-focused, unlike a generic Google search. And Google blog search is even worse..
The potential for spam was already well known when trackback was specified, but there wasn't yet the critical mass necessary for spammers to exploit it. Ironically, trackback might have died an early death if bloggers had proactively spammed it themselves before it was widely deployed.
Or it would have spawned protective features like link condoms before trackback spam got out of hand.
when a trackback ping comes in, I read the putative trackback URL and look for a link to my post there. No link, no trackback. (This was Nikhil's idea, to be clear.)

This is a reasonable idea, but it doesn't scale. Furthermore, it could easily become a huge DDOS (distributed denial of service) vector. The last time I checked, I was getting 75 spam trackbacks
I submit that, rather than the original commenter's idea not scaling, your mental implementation doesn't scale.
The implementation you're imagining is as naive as the original trackback specification. A validation algorithm would need some degree of intelligence in terms of remembering and rating URLs (to prevent flooding the same URL) and paying attention to page size.
Defensive coding would make this a workable solution.
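One way such defensive coding might look (thresholds and names are invented for illustration): remember recently seen source URLs, refuse to re-fetch any of them within a cooldown window, and cap the bytes read from any page:

```python
import time

COOLDOWN_SECONDS = 3600   # hypothetical: don't re-fetch the same source URL within an hour
MAX_BYTES = 64 * 1024     # hypothetical: never read more than 64 KB of a linking page

class DefensiveValidator:
    """Rate-limited, size-capped version of the reverse-lookup trackback check."""
    def __init__(self, fetch, now=time.time):
        self._fetch = fetch      # fetch(url, limit) -> str, injected so the policy is testable
        self._now = now
        self._last_seen = {}     # source_url -> timestamp of the last fetch attempt

    def validate(self, source_url: str, my_post_url: str) -> bool:
        now = self._now()
        last = self._last_seen.get(source_url)
        if last is not None and now - last < COOLDOWN_SECONDS:
            return False         # same URL flooding us: drop the ping without fetching
        self._last_seen[source_url] = now
        try:
            page = self._fetch(source_url, MAX_BYTES)
        except OSError:
            return False
        return my_post_url in page   # reverse-lookup check, on at most MAX_BYTES of content
```

The cooldown answers the "trackback every second" attack, and the byte cap answers the "3 megabyte HTML page" attack: the spammer's maximum cost imposed per source URL becomes bounded and small.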
That's a great idea to use the Technorati link! I wish I'd thought of it... and told you about it... ;-)
Very nice explanation of the problem.
You may enjoy this -- back in 2005, in response to the increasingly obvious broken-ness of TrackBack, I wrote a short (and hopefully amusing) play titled "TrackBack: A Tragedy in Three Acts".
Don't worry, the acts are very short :)
I'm sorry, but I have to disagree that trackbacks are completely dead. Your last paragraph is more accurate: the concept is still good and they do still work, although the spammers have brought the value down.
Still, the many non-commercial bloggers can use them in the manner in which they were originally intended: to get the message out to as many people as possible about their posts.
Great post on trackback links. I hope all the work I've been doing with them until now has not been wasted!
I see how trackbacks could be a useful thing for social blogging purposes, but what I would not look forward to is editing all of the comments. It's why I've never bothered with them. In fact, I don't even bother much with Typepad anymore. Some people, I know, love Typepad, but I just never got the hang of it.