October 30, 2008
HCI Remixed
I like to take one or two books with me when I travel, and one of the books I chose for this trip is HCI Remixed.
Sometimes the books I choose are a bust. Fortunately that didn't happen this time.
HCI Remixed covers all the major milestones in the field of human-computer interaction. And when I say major, I mean it: things like Douglas Engelbart's famous demonstration, now referred to as The Mother of All Demos:
On December 9, 1968, Douglas C. Engelbart and the group of 17 researchers working with him in the Augmentation Research Center at Stanford Research Institute in Menlo Park, CA, presented a 90-minute live public demonstration of the online system, NLS, they had been working on since 1962. The public presentation was a session in the Fall Joint Computer Conference held at the Convention Center in San Francisco, and it was attended by about 1,000 computer professionals. This was the public debut of the computer mouse. But the mouse was only one of many innovations demonstrated that day, including hypertext, object addressing and dynamic file linking, as well as shared-screen collaboration involving two persons at different sites communicating over a network with audio and video interface.
So, all those trappings of modern computing that we take for granted today? Engelbart demonstrated them all two years before I was born. It just took a while for the rest of the world to catch up to his vision.
That's the lesson of many of the groundbreaking HCI discoveries presented in this book. Some people see further. Engelbart was so far ahead of his time in 1968 that his demonstration wasn't taken seriously -- it seemed absurd and impractical. It really makes you wonder which of today's HCI researchers we're ignoring but shouldn't be.
The book also takes an interesting approach: it doesn't summarize the papers. Instead, it presents the reflections of current working HCI professionals on the papers. It's a little bit meta. You're hearing the impact of these HCI discoveries -- some big, some small -- as related by young researchers who were heavily influenced by them.
As a primer and overview of the field of human-computer interaction, it's tough to beat. Reading this reminds me how far we've come, and yet how far we have to go.
October 29, 2008
The Problem With URLs
URLs are simple things. Or so you'd think. Let's say you wanted to detect a URL in a block of text and convert it into a bona fide hyperlink. No problem, right?
Visit my website at http://www.example.com, it's awesome!
To locate the URL in the above text, a simple regular expression should suffice -- we'll look for a string at a word boundary beginning with http:// , followed by one or more non-space characters:
\bhttp://[^\s]+
Piece of cake. This seems to work. There's plenty of forum and discussion software out there which auto-links using exactly this approach. Although it mostly works, it's far from perfect. What if the text block looked like this?
My website (http://www.example.com) is awesome.
This URL will be incorrectly matched with the final paren included. This, by the way, is an extremely common way for average everyday users to include URLs in their text.
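To see the failure concretely, here is a minimal Python sketch of the naive pattern run against both sample sentences. The function name `find_urls_naive` is made up for illustration; only the regular expression comes from the text.

```python
import re

# The naive pattern from the text: "http://" at a word boundary,
# followed by one or more non-space characters.
NAIVE_URL = re.compile(r"\bhttp://[^\s]+")

def find_urls_naive(text):
    """Return every substring the naive pattern considers a URL."""
    return NAIVE_URL.findall(text)

if __name__ == "__main__":
    print(find_urls_naive("Visit my website at http://www.example.com, it's awesome!"))
    print(find_urls_naive("My website (http://www.example.com) is awesome."))
```

Because the pattern accepts anything that isn't whitespace, the closing paren in the second sentence rides along with the URL, and so does the comma in the first.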
What's truly aggravating is that parens in URLs are perfectly legal. They're part of the spec and everything:
only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.
Certain sites, most notably Wikipedia and MSDN, love to generate URLs with parens. The sites are lousy with the damn things:
http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)
http://msdn.microsoft.com/en-us/library/aa752574(VS.85).aspx
URLs with actual parens in them means we can't take the easy way out and ignore the final paren. You could force users to escape the parens, but that's sort of draconian, and it's a little unreasonable to expect your users to know how to escape characters in the URL.
http://en.wikipedia.org/wiki/PC_Tools_%28Central_Point_Software%29
http://msdn.microsoft.com/en-us/library/aa752574%28VS.85%29.aspx
To detect URLs correctly in most cases, you have to come up with something more sophisticated. Granted, this isn't the toughest problem in computer science, but it's one that many coders get wrong. Even coders with years of experience, like, say, Paul Graham.
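For reference, the escaping itself is mechanical: each paren becomes its percent-encoded form. A small Python sketch, with `escape_parens` as a hypothetical helper name:

```python
def escape_parens(url):
    """Percent-encode only the parentheses in a URL, leaving the rest untouched."""
    return url.replace("(", "%28").replace(")", "%29")

if __name__ == "__main__":
    print(escape_parens("http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)"))
    print(escape_parens("http://msdn.microsoft.com/en-us/library/aa752574(VS.85).aspx"))
```

Trivial for a program, but not something you can reasonably ask of someone typing a forum post.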
If we're more clever in constructing the regular expression, we can do a better job.
\(?\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]
- The primary improvement here is that we're only accepting a whitelist of known good URL characters. Allowing arbitrary random characters in URLs is setting yourself up for XSS exploits, and I can tell you that from personal experience. Don't do it!
- We only allow certain characters to "end" the URL. Ending a URL in common punctuation marks like period, exclamation point, semicolon, etc means those characters will be considered end-of-hyperlink characters and not included in the URL.
- Parens, if present, are allowed in the URL -- and we absorb the leading paren, if it is there, too.
I couldn't come up with a way for the regex alone to distinguish between URLs that legitimately end in parens (à la Wikipedia), and URLs that the user has enclosed in parens. Thus, there has to be a bit of post-processing code to detect and discard the user-enclosed parens from the matched URLs:
// s is the text matched by the regex; strip parens the user wrapped around the URL
if (s.StartsWith("(") && s.EndsWith(")"))
{
    return s.Substring(1, s.Length - 2);
}
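Putting the two pieces together, here is a rough Python sketch of the improved pattern plus the paren-stripping step. The regular expression is copied from the text; the function name `find_urls` and the sample sentences are assumptions for illustration, and this is a sketch rather than production auto-linking code.

```python
import re

# Whitelist-based pattern from the text: optional leading paren, then
# "http://", then allowed URL characters, ending on a character that is
# unlikely to be sentence punctuation.
URL_PATTERN = re.compile(
    r"\(?\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]"
)

def find_urls(text):
    """Return matched URLs, discarding parens the user wrapped around them."""
    urls = []
    for match in URL_PATTERN.findall(text):
        # Same check as the C# snippet: only strip when both parens are present.
        if match.startswith("(") and match.endswith(")"):
            match = match[1:-1]
        urls.append(match)
    return urls

if __name__ == "__main__":
    print(find_urls("My website (http://www.example.com) is awesome."))
    print(find_urls("See http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software) for details."))
```

The user-enclosed URL comes back without its parens, while the Wikipedia URL keeps the parens that are actually part of it.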
That's a whole lot of extra work, just because the URL spec allows parens. We can't fix Wikipedia or MSDN and we certainly can't change the URL spec. But we can ensure that our websites avoid becoming part of the problem. Avoid using parens (or any unusual characters, for that matter) in URLs you create. They're annoying to use, and rarely handled correctly by auto-linking code.
October 25, 2008
The Web Browser is the New Laptop
I've been reading a lot of good things about the emerging "netbook" category of subnotebooks:
The term netbook refers to a category of small to medium sized, light-weight, low-cost, energy-efficient, Internet-centric laptops, generally optimized for Web surfing and e-mailing.
Like any self-respecting nerd, I already own a laptop, of course, but my wife has taken to surfing the internet at night and doing her Java-based New York Times crosswords in bed. Plus there's the whole pregnancy thing, so it'd be nice for her to have her own "space" laptop-wise. So I pulled the trigger on an Acer Aspire One netbook.
The specs are indeed modest, but not bad at all for the $369 sticker price:
- Intel Atom 1.6 GHz CPU
- 802.11 b/g wireless
- 1 GB RAM
- 120 GB hard drive
- 8.9" 1024x600 display
- Windows XP Home
- webcam, mic, 3 USB ports, Ethernet, VGA out
I didn't expect much from this cheap, diminutive laptop; it's mostly for web surfing, light email, maybe a tiny bit of miscellaneous office work. And in case the color choice didn't make it clear, it's not even for me. That's my story, and I'm sticking to it!
As I sat down to configure this machine, I belatedly realized that for most of what I do with a computer, this cute little netbook is perfectly adequate. Sure, the keyboard is a bit cramped, it's no performance powerhouse, and the screen size, at 1024 x 600, is definitely the minimum necessary for it to be practical. It took some adaptation, but it wasn't frustrating or disappointing to use. It delivered (almost) the same web experience I'd get on my desktop or laptop, with no serious compromises. It just... worked.
As stupid as it sounds, I had fallen in love with this silly little netbook.
But even that's not the whole story -- after spending some time with a netbook, I realized that calling them "small laptops" is a mistake. Netbooks are an entirely different breed of animal. They are cheap, portable web browsers.
The most popular application in the world is the web browser. By far. Number two isn't even close. Just check out the front page of Wakoopa's most used apps:
By my reckoning, six of the top 10 "apps" here are actually web browsers or websites running in web browsers. It's certainly consistent with how my wife and I are increasingly using our computers. Every day, more and more of what we need to do is delivered through a browser, with fewer and fewer compromises. I spend ridiculous, unhealthy amounts of time browsing the web, and this netbook does that with aplomb.
At this point, who cares what operating system you run? Choice of web browser will have a far more profound impact on most people's daily lives. As the prices for netbooks inevitably collapse, they are poised to transform the entire computer market, threatening both Apple and Microsoft.
- Apple laptops are beautiful, but I can't imagine the average user who spends all their time in the web browser paying 3 to 4 times the price of a netbook for a Mac laptop. Macs are brilliantly designed, it's true, but that's a hell of a tax to run Safari.
- Speaking of taxes, what about the Microsoft Tax? I'm already heavily infatuated with the current iteration of netbooks as represented by the Aspire. And they can only get better and cheaper over time. Imagine a machine with the same specs as the Aspire One but at $299, $199, maybe even $99. It's going to happen. It's inevitable. This is a huge opening for Linux; it's the ideal way to deliver a complete, modern web browser at nearly zero marginal cost to both the vendor and consumer.
- The booming growth of netbooks will keep Windows XP alive much longer than expected. As much as I like Vista as a solid (if not stellar) upgrade from XP, the prehistoric 2001 era system requirements for XP still make it a better choice for these kinds of devices. 1 GB of memory is roomy; a measly 16 GB of disk space plenty. Can't say that for Vista. No sir. It's also an opportunity for Microsoft to play games with the Linux market by reducing the price of XP to crazy low, fire sale, everything-must-go levels. But only for "select" and "preferred" OEM vendors, of course, not for the common folks on the street.
I won't lie. One of the attractions of this particular model is that it runs Windows XP, an operating system I, and every other software vendor on the planet, know by heart. It'll run whatever without me having to think too much about it. But I could easily see myself leaving some of that potential flexibility on the table if the price dropped to $199 or so. If it runs Firefox 3, or Chrome, or Opera, that's about all I need.
I'm quite happy with our Acer Aspire One netbook for now, but I'll probably be picking up one of the next generation of netbooks for myself.
I agree with Omar that Netbooks are poised to transform computing. They still have a way to go, of course, but the $299 or $199 no-compromises, go-anywhere, zero-monthly-contract-fees web browser in the palm of your hand -- with the requisite 9" or larger screen -- is almost upon us. I guess I hadn't been paying enough attention, because that's a shocker to me.
Pitching the web browser as a bona fide operating system always seemed stupid to me. Or at least it did, until I sat down with my first netbook. If I were Apple or Microsoft, I'd be watching this category of devices very, very closely.
October 23, 2008
You're Reading The World's Most Dangerous Programming Blog
Have you ever noticed that blogs are full of misinformation and lies? In particular, I'm referring to this blog. The one you're reading right now. For example, yesterday's post was so bad that it is conclusive proof that I've jumped the shark.
Again.
Apparently, according to one Reddit commenter, the information presented here is downright dangerous:
Jeff Atwood has always held the distinction of having the most dangerous programming blog, in that some young or aspiring developers may actually listen to some of his "advice", but now he's somehow managed to snag the achievement of having the most inane programming blog as well. To put it in more frank terms Jeff: What you've just written is one of the most insanely idiotic things I have ever heard. At no point in your rambling, incoherent response were you even close to anything that could be considered a rational thought. Everyone in this room is now dumber for having read this post. I award you no points, and may God have mercy on your soul.
I enjoyed the Billy Madison quote, but I'm not sure my blog has earned that particular distinction yet. If this blog is the most dangerous content that young, inexperienced developers have ever read then, well, I'd have to seriously question whether or not they've ever actually used this thing we call the "world wide web".
Allow me to illustrate with an example.
Today I happened across this blog entry from Mads Kristensen. In it, Mads explains that Deflate is faster than GZip.
First I tested the GZipStream and then the DeflateStream. I expected a minor difference because the two compression methods are different, but the result astonished me. I measured the DeflateStream to be 41% faster than GZip. That's a very big difference. With this knowledge, I'll have to change the HTTP compression module to choose Deflate over GZip.
This was a surprising result to me, because the two compression algorithms are very closely related. On the other hand, we use GZip extensively and heavily to cache HTML fragment output strings on the Stack Overflow server, as Scott Hanselman explains. If Deflate really is that much faster, we need to switch to it!
But, like any veteran internet user, I never take what I read on a blog -- or any other site on the internet, for that matter -- as fact. Rather, it's a germ of an intriguing idea, a call to action. I fired up my IDE and built a small test harness to test for myself: is Deflate faster than GZip?
using System;
using System.Diagnostics;
using System.IO;

public static class StopwatchExtensions
{
    // Runs the action the given number of times and returns the total
    // elapsed time in milliseconds.
    public static long Time(this Stopwatch sw, Action action, int iterations)
    {
        sw.Reset();
        sw.Start();
        for (int i = 0; i < iterations; i++) { action(); }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}

class Program
{
    // CompressGzip, DecompressGzip, CompressDeflate and DecompressDeflate
    // are helper methods that are not shown here.
    static void Main(string[] args)
    {
        string s = File.ReadAllText(@"c:\test.html");
        byte[] b;
        var sw = new Stopwatch();

        // GZip: compressed size, then compress and decompress timings
        b = CompressGzip(s);
        Console.WriteLine("gzip size: " + b.Length);
        Console.WriteLine(sw.Time(() => CompressGzip(s), 1000));
        Console.WriteLine(sw.Time(() => DecompressGzip(b), 1000));

        // Deflate: same three measurements
        b = CompressDeflate(s);
        Console.WriteLine("deflate size: " + b.Length);
        Console.WriteLine(sw.Time(() => CompressDeflate(s), 1000));
        Console.WriteLine(sw.Time(() => DecompressDeflate(b), 1000));
    }
}
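If you'd rather try the experiment without the .NET helpers, roughly the same harness can be sketched in Python with the standard `zlib` module. Everything here is an assumption made for illustration: the names `time_ms`, `compress`, `decompress` and `benchmark`, the choice of compression level 9, and the sample input. Timings from this sketch say nothing about how GZipStream and DeflateStream behave.

```python
import time
import zlib

def time_ms(action, iterations):
    """Call action() `iterations` times and return total elapsed milliseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        action()
    return (time.perf_counter() - start) * 1000.0

def compress(data, wbits):
    # wbits=31 asks zlib for a gzip wrapper; wbits=-15 asks for raw Deflate.
    compressor = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return compressor.compress(data) + compressor.flush()

def decompress(blob, wbits):
    return zlib.decompress(blob, wbits)

def benchmark(data, iterations=1000):
    """Measure size plus compress/decompress time for gzip and raw Deflate."""
    results = {}
    for name, wbits in (("gzip", 31), ("deflate", -15)):
        blob = compress(data, wbits)
        results[name] = {
            "size": len(blob),
            "compress_ms": time_ms(lambda: compress(data, wbits), iterations),
            "decompress_ms": time_ms(lambda: decompress(blob, wbits), iterations),
        }
    return results

if __name__ == "__main__":
    sample = b"<html><body><p>Hello, world!</p></body></html>" * 200
    for name, stats in benchmark(sample).items():
        print(name, stats)
```

As in the C# version, the timed work is passed in as a lambda, so the measurement loop lives in one place.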
The results were surprising: on my box, GZip is just as fast as Deflate. For giant strings, for medium strings, for small strings. In every possible testing combination I can think of, Deflate is nowhere near 40% faster.
gzip size: 3125
242
171
deflate size: 3107
225
149
That's not exactly what Mads' blog entry tells me should happen. Do I think Mads is an idiot for posting this? Well, no. I don't.
- The original blog entry was posted in late 2006; since then new versions of the .NET framework have shipped and hardware has gotten faster. Perhaps there was some significant change in either that produces this different outcome.
- My test is a bit different than Mads' testing. I use a random HTML file as the compression target; I can't tell exactly what he's compressing in his benchmark. I also tried with small, medium, and large strings. The tests are similar, but they're not the same.
Is this the type of dangerous misinformation that blogs are vilified for? Should I be angry at Mads for posting this? Not at all. I learned a bit more about Deflate and GZip. It provided an opportunity for me to refactor my compression code some. I even learned how to benchmark using lambda syntax. If I hadn't read this post, if it hadn't provided that impetus of an idea for me to ponder, I wouldn't have bothered.
I am a better programmer for having read that blog post. Even though, near as I can tell, it's offering inaccurate advice.
Update: I got a bit more curious about this, so I ran some more tests on different machines. Here are the results, in milliseconds, for a thousand runs each using the Google homepage HTML as the target (it's about 7 KB):
How much faster is Deflate than GZip?
| | Core 2 Duo 3.5 GHz | Core 2 Quad 1.86 GHz | Athlon X2 2.1 GHz |
| Compress | 8% faster | 8% faster | 50% faster |
| Decompress | 15% faster | 17% faster | 37% faster |
There's the 40% Mads was talking about. That is a little shocking when you consider that GZip is simply Deflate plus a checksum and header/footer! (You can download the source code for this test and try it yourself.)
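The "Deflate plus a checksum and header/footer" relationship is easy to check for yourself. Here is a small Python sketch using `zlib`. The helper names are hypothetical, and `split_gzip` assumes the minimal 10-byte gzip header with no optional fields, which is what zlib writes by default.

```python
import struct
import zlib

def deflate_raw(data, level=9):
    """Compress to a raw Deflate stream (no header, no checksum)."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()

def gzip_wrap(data, level=9):
    """Compress to a gzip member using zlib's built-in gzip wrapper."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, 31)
    return compressor.compress(data) + compressor.flush()

def split_gzip(blob):
    """Split a minimal gzip member into (header, deflate body, crc32, size).

    Assumes a 10-byte header with no optional fields and the standard
    8-byte trailer: CRC-32 and uncompressed length, both little-endian.
    """
    header, body, trailer = blob[:10], blob[10:-8], blob[-8:]
    crc, size = struct.unpack("<II", trailer)
    return header, body, crc, size

if __name__ == "__main__":
    data = b"<html><body>hello</body></html>" * 100
    header, body, crc, size = split_gzip(gzip_wrap(data))
    print(len(header), body == deflate_raw(data), crc == zlib.crc32(data), size == len(data))
```

That accounts for the 18-byte gap between the two sizes in the output above (3125 versus 3107): 10 bytes of header and 8 bytes of trailer around the same compressed body.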
So my point -- and I do have one -- is this: when you say that the information presented on a blog is "dangerous", you're implying the audience is too dumb or inept to read critically.
I, for one, have too much respect for my audience to ever do that. I am continually humbled by the quality of the comments and discussion on the blog entries I post. In fact, I'd say that has been the single most surprising thing I've learned in my four plus years of blogging: the best content always begins where the blog post ends. My audience is far, far smarter than I will ever be.
On second thought, maybe what I promote on this blog is dangerous: thinking for yourself.
But I'm pretty confident you can handle that.
October 22, 2008
The One Thing Every Software Engineer Should Know
I'm a huge Steve Yegge fan, so it was a great honor to have him on a recent Stack Overflow podcast. One thing I couldn't have predicted, however, was one particular theme of Steve's experience at Google and Amazon that kept coming up time and time again:
If there was one thing I could teach every engineer, it would be how to market.
Not how to type, not how to write, not how to design a programming language, but marketing.
This is painful for developers to hear, because we love code. But all that brilliant code is totally irrelevant until:
- people understand what you're doing
- people become interested in what you're doing
- people get excited about what you're doing
That, in a nutshell, is marketing. Just because you're a marketer doesn't necessarily mean you're a marketing weasel. Sure, the two things are highly correlated -- but at its core, marketing is little more than an intermediate level course on fundamental human communication. Not something us programmers have historically been so great at.
That's why even the hardest of hard-core programmers should be paying attention to people like Seth Godin. Steve was referring to marketing in the broader, more timeless sense of getting other people interested in your ideas.
After hearing Steve mention this several times on our podcast -- and having seen his related talk How to Ignore Marketing and Become Irrelevant in Two Easy Steps -- I suddenly realized why I was so fascinated with two particular books I recently discovered. Books I kept referring to, over and over, during the development of Stack Overflow.
| Whatever You Think, Think the Opposite | It's Not How Good You Are, It's How Good You Want to Be |
I couldn't put down these two small-format books from the late Paul Arden. Guess what Mr. Arden did for a living? That's right, he was an executive creative director for Saatchi & Saatchi -- an advertising firm.
I had been reading dirty books. Marketing books. By choice, even. I'm a bit embarrassed to admit this, because these are exactly the kinds of pithy little business books I usually make fun of other people for reading. But in reading these books, I realized that so much of what we do on Stack Overflow has nothing to do with how awesome our code is -- and everything to do with marketing.
We're all software developers here, so let me put this in terms programmers understand: Dungeons & Dragons character statistics. You know, the classics.
If you're a programmer, and you want to get better at your job every year, you might think that the most important character stat to build is coding. Let's call this INT. So at the end of many years of toil, you'll end up something like this:
| str | 6 |
| dex | 9 |
| con | 12 |
| int | 51 |
| wis | 13 |
| chr | 4 |
OK, you're a genius programmer who can code circles around everyone else. But you may never ship any of your code for reasons that you don't control. That's an illusion. You can control when, how, and where your code ships. You probably spent too much time building your code and not enough time as an advocate of your code. Did you explain to people what your code does, why it's cool and important? Did you offer reasons why your code is going to make their lives better, at least in some small way? Did you make it easy for people to find and use your code?
I believe most programmers will be better served in their professional career if they shoot for character development more along these lines:
| str | 16 |
| dex | 14 |
| con | 15 |
| int | 18 |
| wis | 16 |
| chr | 17 |
Sometimes, you become a better programmer by choosing not to program. I agree with Steve: if I could teach my fellow software engineers one thing, it would be how to market themselves, their code, and their project.
October 20, 2008
Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea?
I'm not a huge fan of The Daily WTF for reasons I've previously outlined. There is, however, the occasional gem -- such as this one posted by ezrec:
Browsing through a web archive of some old computer club conversations, I ran across this sentence: "Apple made the clbuttic mistake of forcing out their visionary - I mean, look at what NeXT has been up to!"
Hmm. "clbuttic".
Google "clbuttic" - thousands of hits!
There's a someone who call his car 'clbuttic'.
There are "Clbuttic Steam Engine" message boards.
Webster's dictionary - no help.
Hmm. What can this be?
As programmers, this isn't much of a mystery to us; it seems every day a brand new software developer is born and immediately begins repeating all the same mistakes we made years ago. I can't resist linking to Language Log again on this topic, where a commenter disputes whether or not this is an actual real world problem:
The "clbuttics" story may be a little exaggerated if not actually a web-legend. Sure, Google returns 4,000 hits–but by the time one reaches page 2 (in search of a page that isn't reporting on the silliness, or reporting on the reports, etc.) we're down to 200 hits. Almost all of those 200 seem to have a "clbuttic mistake" by Apple at their core. Google's redundancy-compacting routines are only invoked when requested, it seems, and even then, the variety of information in 200 hits may be small.
In short, it's an echo chamber. 200 or 4,000 or however many hits today aren't as impressive as the same number last year, etc. All the more so as web sites of all kinds put randomly chosen (even Googled!) words out there just to game Google.
While I agree this particular manifestation of the mistake is probably over-reported (because, haha, butts are funny) and fairly rare on the open web, I still get this shiner on page one of my search results:
Is the song Dueling Banjos considered blue grbutt?
Poor Bluegrass World. I'm pretty sure that site is legitimate, though I have no idea how they'd post an article in that state. Obligatory link to dueling banjos scene from Deliverance. I'm inclined to believe this is, in fact, still a problem. There are many, many examples besides "clbuttic" out there. Perhaps you've heard of the United States Consbreastution?
Of course, what we have here is failed obscenity filters implemented by (extremely) newbie developers with regular expressions. I could explain, but as they say, a picture is worth a thousand words, particularly when it's a picture of my very bestest friend, RegexBuddy:
Oh, great, an inexperienced developer had a problem, and thought they would use regular expressions. Now they have two problems. Well, technically through Google they now have many thousands of problems, but who's counting.
I'm not sure regular expressions are to blame here. The replacement is so mind-bendingly naive that it might as well have been a simple Replace operation. We, being extra-smart-gets-things-done developers, would write a superior regular expression using the \b word boundary qualifier around the replacement, and use some capturing parens to handle both the singular and plural cases.
How about those Great Tits, eh?
Proving, yet again, that bad ideas are just plain bad ideas, regardless of language or implementation choices. Obscenity filters are like blacklists; using one is tantamount to admitting failure before you've even started.
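Both versions of the mistake fit in a few lines. The Python sketch below uses a two-word replacement table inferred from the examples above ("clbuttic", "Consbreastution"); the function names and the table itself are assumptions for illustration, not anyone's actual filter.

```python
import re

# Replacement table inferred from the examples in the text.
REPLACEMENTS = {"ass": "butt", "tit": "breast"}

def naive_filter(text):
    """Blind substring replacement: the 'clbuttic' mistake."""
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    return text

# The "smarter" version: whole words only, optional plural "s" captured.
_BOUNDED = re.compile(r"\b(" + "|".join(REPLACEMENTS) + r")(s?)\b", re.IGNORECASE)

def bounded_filter(text):
    """Word-boundary replacement that preserves a trailing plural 's'."""
    return _BOUNDED.sub(
        lambda m: REPLACEMENTS[m.group(1).lower()] + m.group(2), text
    )

if __name__ == "__main__":
    print(naive_filter("Apple made the classic mistake"))
    print(naive_filter("the United States Constitution"))
    print(bounded_filter("Apple made the classic mistake"))
    print(bounded_filter("How about those Great Tits, eh?"))
```

The word-boundary version leaves "classic" and "Constitution" alone, and still mangles a perfectly innocent bird.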
But it still happens all the time. One of the most famous incidents was when the Yahoo! email developers created the accidental non-word Medireview. They weren't trying to filter obscenities, but JavaScript webmail exploits.
In 2001 Yahoo! silently introduced an email filter which changed some strings in HTML emails considered to be dangerous. While it was intended to stop spreading JavaScript viruses, no attempts were made to limit these string replacements to script sections and attributes, out of fear this would leave some loophole open. Additionally, word boundaries were not respected in the replacement.
The list of replacements:
Javascript → java-script
Jscript → j-script
Vbscript → vb-script
Livescript → live-script
Eval → review
Mocha → espresso
Expression → statement
Some side-effects of this implementation:
| medieval | → | medireview |
| evaluation | → | reviewuation |
| expressionist | → | statementist |
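Those side effects fall straight out of the replacement list. Here is a minimal Python sketch that applies the list above without word boundaries. The case-insensitive matching and the name `yahoo_style_filter` are assumptions, since the quoted description doesn't say exactly how Yahoo matched the strings.

```python
import re

# The replacement pairs quoted above, lowercased for matching.
YAHOO_REPLACEMENTS = [
    ("javascript", "java-script"),
    ("jscript", "j-script"),
    ("vbscript", "vb-script"),
    ("livescript", "live-script"),
    ("eval", "review"),
    ("mocha", "espresso"),
    ("expression", "statement"),
]

def yahoo_style_filter(text):
    """Apply each replacement anywhere in the text, ignoring word boundaries."""
    for needle, replacement in YAHOO_REPLACEMENTS:
        text = re.sub(re.escape(needle), replacement, text, flags=re.IGNORECASE)
    return text

if __name__ == "__main__":
    for word in ("medieval", "evaluation", "expressionist"):
        print(word, "->", yahoo_style_filter(word))
```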
medireview.com is currently occupied by domain squatters. Perhaps that's a fitting end for this "company", though I perversely almost want the company to exist, as wholly formed from our imaginations, sort of like Jamcracker.
I can't help wondering just how freaked out the brass at Yahoo must have been about then-new JavaScript browser exploits to actually deploy such a brain-damaged "solution". To be fair, it was seven years ago, but still -- did it not occur to anyone that such broad replacement criteria might have some serious side-effects? Or that replacing one thing with another, when it comes to human beings and written language, is an activity that is fraught with peril even in the best possible circumstances?
Obscenity filtering is an enduring, maybe even timeless problem. I'm doubtful it will ever be possible to solve this particular problem through code alone. But it seems some companies and developers can't stop tilting at that windmill. Which means you might want to think twice before you move to Scunthorpe.
October 16, 2008
Programming Is Hard, Let's Go Shopping!
A few months ago, Dare Obasanjo noticed a brief exchange my friend Jon Galloway and I had on Twitter. Unfortunately, Twitter makes it unusually difficult to follow conversations, but Dare outlines the gist of it in Developers, Using Libraries is not a Sign of Weakness:
The problem Jeff was trying to solve is how to allow a subset of HTML tags while stripping out all other HTML so as to prevent cross site scripting (XSS) attacks. The problem with Jeff's approach which was pointed out in the comments by many people including Simon Willison is that using regexes to filter HTML input in this way assumes that you will get fairly well-formed HTML. The problem with that approach which many developers have found out the hard way is that you also have to worry about malformed HTML due to the liberal HTML parsing policies of many modern Web browsers. Thus to use this approach you have to pretty much reverse engineer every HTML parsing quirk of common browsers if you don't want to end up storing HTML which looks safe but actually contains an exploit. Thus to utilize this approach Jeff really should have been looking at using a full fledged HTML parser such as SgmlReader or Beautiful Soup instead of regular expressions. The sad thing is that Jeff Atwood isn't the first nor will he be the last programmer to think to himself "It's just HTML sanitization, how hard can it be?". There are many lists of Top HTML Validation Bloopers that show how tricky it is to get the right solution to this seemingly trivial problem. Additionally, it is sad to note that despite his recent experience, Jeff Atwood still argues that he'd rather make his own mistakes than blindly inherit the mistakes of others as justification for continuing to reinvent the wheel in the future. That is unfortunate given that is a bad attitude for a professional software developer to have.
My response?
Bad attitude? I think that's a matter of perspective.
(The phrase "programming is hard, let's go shopping!" is a snowclone. As usual, Language Log has us covered. Ironically, we later had a brief run-in with Consultant Barbie "herself" on Stack Overflow -- who you may know from reddit. There's no trace of her left on SO, but as griefing goes, it was fairly benign and even arguably on-topic.)
In the development of Stack Overflow, I determined early on that we'd be using Markdown for entering questions and answers in the system. Unfortunately, Markdown allows users to intermix HTML into the markup. It's part of the spec and everything. I sort of wish it wasn't, actually -- one of the great attractions of pseudo-markup languages like BBCode is that they have nothing in common with HTML and thus sanitizing the input becomes trivial. Users have two choices:
- Enter approved pseudo-markup.
- Trick question. There is no other choice!
With BBCode, if the user enters HTML you blow it away with extreme prejudice -- it's encoded, without exceptions. Easy. No thinking and barely any code required.
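To show just how little code that is, here is a rough Python sketch of the BBCode approach: encode everything first, then translate a handful of approved pseudo-markup tags. The tag list and the name `render_bbcode` are made up for illustration, and real BBCode implementations support far more than this.

```python
import html
import re

# Hypothetical set of approved pseudo-markup tags.
BBCODE_TAGS = ("b", "i", "u")

def render_bbcode(text):
    """Encode all HTML, then convert approved [tag]...[/tag] pairs to HTML."""
    # Step 1: every <, > and & the user typed becomes an entity. No exceptions.
    rendered = html.escape(text)
    # Step 2: only the approved pseudo-markup is turned into real tags.
    for tag in BBCODE_TAGS:
        rendered = re.sub(
            r"\[%s\](.*?)\[/%s\]" % (tag, tag),
            r"<%s>\1</%s>" % (tag, tag),
            rendered,
            flags=re.DOTALL,
        )
    return rendered

if __name__ == "__main__":
    print(render_bbcode("[b]bold[/b] <script>alert(1)</script>"))
```

Since the user's input never contributes a literal angle bracket to the output, there is no "good HTML versus evil HTML" decision to make.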
Since we use Markdown, we don't have that luxury. Like it or not, we are now in the nasty, brutish business of distinguishing "good" HTML markup from "evil" HTML markup. That's hard. Really hard. Dare and Jon are right to question the competency and maybe even the sanity of any developer who willingly decided to bite off that particular problem.
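To give a feel for the problem, here is about the most conservative whitelist I can sketch in Python: encode everything, then restore only exact, attribute-free tags from a short approved list. This is not the routine Stack Overflow uses. The tag list and the name `sanitize` are assumptions for illustration, and the sketch makes no attempt to balance tags or to support attributes such as links.

```python
import html
import re

# Hypothetical whitelist of harmless, attribute-free tags.
ALLOWED_TAGS = ("b", "i", "em", "strong", "code", "pre")

# Matches an already-encoded tag such as "&lt;b&gt;" or "&lt;/b&gt;",
# and nothing with attributes, whitespace, or different casing.
_ALLOWED = re.compile(r"&lt;(/?)(%s)&gt;" % "|".join(ALLOWED_TAGS))

def sanitize(text):
    """Encode all markup, then un-encode only exact whitelisted tags."""
    escaped = html.escape(text)
    return _ALLOWED.sub(r"<\1\2>", escaped)

if __name__ == "__main__":
    print(sanitize("<b>ok</b> <script>alert(1)</script>"))
    print(sanitize('<b onclick="evil()">x</b>'))
```

Anything that isn't an exact match stays encoded, so a tag carrying an attribute never makes it through. The price is that the output can end up with an unbalanced closing tag, and anything useful like a hyperlink is off the table. Closing those gaps safely is where the real work starts.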
But here's the thing: deeply understanding HTML sanitization is a critical part of my business. Users entering markdown isn't just some little tickbox in a feature matrix for me, it is quite literally the entire foundation that our website is built on.
Here's a pop quiz from way back in 2001. See how you do.
- Code Reuse is:
  - Good
  - Bad
- Reinventing the Wheel is:
  - Good
  - Bad
- The Not-Invented-Here Syndrome is:
  - Good
  - Bad
I'm sure most developers are practically climbing over each other in their eagerness to answer at this point. Of course code reuse is good. Of course reinventing the wheel is bad. Of course the not-invented-here syndrome is bad.
Except when it isn't.
If it's a core business function -- do it yourself, no matter what. Pick your core business competencies and goals, and do those in house. If you're a software company, writing excellent code is how you're going to succeed. Go ahead and outsource the company cafeteria and the CD-ROM duplication. If you're a pharmaceutical company, write software for drug research, but don't write your own accounting package. If you're a web accounting service, write your own accounting package, but don't try to create your own magazine ads. If you have customers, never outsource customer service.
Being a "professional" developer, if there really is such a thing -- I still have my doubts -- doesn't mean choosing third-party libraries for every possible programming task you encounter. Nor does it mean blindly writing everything yourself out of a misguided sense of duty or the perception that's what gonzo, hardcore programming types do. Rather, experienced developers learn what their core business functions are and write whatever software they deem necessary to perform those functions extraordinarily well.
Do I regret spending a solid week building a set of HTML sanitization functions for Stack Overflow? Not even a little. There are plenty of sanitization solutions outside the .NET ecosystem, but precious few for C# or VB.NET. I've contributed the core code back to the community, so future .NET adventurers can use our code as a guidepost (or warning sign, depending on your perspective) on their own journey. They can learn from the simple, proven routine we wrote and continue to use on Stack Overflow every day.
Honestly, I'm not that great of a developer. I'm not so much talented as competent and loud. Start writing and talking and you can be loud, too. But I'll tell you this: in choosing to fight that HTML sanitizer battle, I've earned the scars of experience. I don't have to take anybody's word for it -- I don't have to trust "libraries". I can look at the code, examine the input and output, and predict exactly what kinds of problems might arise. I have a deep and profound understanding of the risks, pitfalls, and tradeoffs of HTML sanitization.. and cross-site scripting vulnerabilities.
As Richard Feynman so famously wrote on his last blackboard, what I cannot create, I do not understand.
This is exactly the kind of programming experience I need to keep watch over Stack Overflow, and I wouldn't trade it for anything.
You may not be building a website that depends on users entering markup, so you might make a different decision than I did. But surely there's something, some core business competency, so important that you feel compelled to build it yourself, even if it means making your own mistakes.
Programming is hard. But that doesn't mean you should always go shopping for third party libraries instead of writing code. If it's a core business function, write that code yourself, no matter what. If other programmers don't understand why it's so critically important that you sit down and write that bit of code -- well, that's their problem.
They're probably too busy shopping to understand.
October 14, 2008
Preventing CSRF and XSRF Attacks
In Cross-Site Request Forgeries and You I urged developers to take a close look at possible CSRF / XSRF vulnerabilities on their own websites. They're the worst kind of vulnerability -- very easy to exploit by attackers, yet not so intuitively easy to understand for software developers, at least until you've been bitten by one.
On the Freedom to Tinker blog, Bill Zeller offers one of the best, most concise explanations of XSRF that I've read to date:
CSRF vulnerabilities occur when a website allows an authenticated user to perform a sensitive action but does not verify that the user herself is invoking that action. The key to understanding CSRF attacks is to recognize that websites typically don't verify that a request came from an authorized user. Instead they verify only that the request came from the browser of an authorized user. Because browsers run code sent by multiple sites, there is a danger that one site will (unbeknownst to the user) send a request to a second site, and the second site will mistakenly think that the user authorized the request.
That's the key element to understanding XSRF. Attackers are gambling that users have a validated login cookie for your website already stored in their browser. All they need to do is get that browser to make a request to your website on their behalf. If they can either:
- Convince your users to click through to an HTML page they've constructed
- Insert arbitrary HTML in a target website that your users visit
The XSRF game is afoot. Not too difficult, is it?
Bill Zeller and Ed Felten also identified new XSRF vulnerabilities in four major websites less than two weeks ago:
- ING Direct
We discovered CSRF vulnerabilities in ING's site that allowed an attacker to open additional accounts on behalf of a user and transfer funds from a user's account to the attacker's account.
- YouTube
We discovered CSRF vulnerabilities in nearly every action a user can perform on YouTube.
- MetaFilter
We discovered a CSRF vulnerability in MetaFilter that allowed an attacker to take control of a user's account.
- The New York Times
We discovered a CSRF vulnerability in NYTimes.com that makes user email addresses available to an attacker. If you are a NYTimes.com member, arbitrary sites can use this attack to determine your email address and use it to send spam or to identify you.
If major public facing websites are falling prey to these serious XSRF exploits, how confident do you feel that you haven't made the same mistakes? Consider carefully. I'm saying this as a developer who has already made these same mistakes on his own website. I'm just as guilty as anyone.
It's our job to make sure future developers don't repeat the same stupid mistakes we made -- at least not without a fight. The Felten and Zeller paper (pdf) recommends the "double-submitted cookie" method to prevent XSRF:
When a user visits a site, the site should generate a (cryptographically strong) pseudorandom value and set it as a cookie on the user's machine. The site should require every form submission to include this pseudorandom value as a form value and also as a cookie value. When a POST request is sent to the site, the request should only be considered valid if the form value and the cookie value are the same. When an attacker submits a form on behalf of a user, he can only modify the values of the form. An attacker cannot read any data sent from the server or modify cookie values, per the same-origin policy. This means that while an attacker can send any value he wants with the form, he will be unable to modify or read the value stored in the cookie. Since the cookie value and the form value must be the same, the attacker will be unable to successfully submit a form unless he is able to guess the pseudorandom value.
The advantage of this approach is that it requires no server state; you simply set the cookie value once, then every HTTP POST checks to ensure that one of the submitted <input> values contains the exact same cookie value. Any difference between the two means a possible XSRF attack.
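To make the double-submitted cookie check concrete, here's a minimal sketch in Python. The function names are my own invention for illustration, and I'm assuming your web framework hands you the cookie value and the form field as plain strings; none of this comes from the Felten and Zeller paper itself.

```python
import hmac
import secrets


def new_csrf_token() -> str:
    """Return a cryptographically strong random value to set as the cookie."""
    return secrets.token_hex(16)  # 16 random bytes, hex encoded


def is_valid_post(cookie_value, form_value) -> bool:
    """Accept a POST only when the cookie and the form field carry the same value."""
    if not cookie_value or not form_value:
        # A missing cookie or a missing form field is treated as a failed check.
        return False
    # compare_digest avoids leaking, through timing, where the strings differ.
    return hmac.compare_digest(cookie_value.encode("utf-8"),
                               form_value.encode("utf-8"))
```

You'd call new_csrf_token() once when the visitor first arrives, write the result into both the cookie and a hidden input on every form, and run is_valid_post() on every POST before doing anything else.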
An even stronger, albeit more complex, prevention method is to leverage server state -- to generate (and track, with timeout) a unique random key for every single HTML FORM you send down to the client. We use a variant of this method on Stack Overflow with great success. That's why with every <form> you'll see the following:
<input id="fkey" name="fkey" type="hidden" value="df8652852f139" />
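The server-state idea might look roughly like the sketch below. To be clear, this is not the Stack Overflow code: the FormKeyStore class, the one hour timeout, and the in-memory dictionary are all assumptions I've made to keep the example short. A real site would want shared storage and periodic cleanup of expired keys.

```python
import secrets
import time
from typing import Optional


class FormKeyStore:
    """Hypothetical in-memory store of single-use form keys with a timeout."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl_seconds = ttl_seconds
        self._issued = {}  # key -> (session id, expiry timestamp)

    def issue(self, session_id: str, now: Optional[float] = None) -> str:
        """Generate a random key for one form and remember who it was sent to."""
        now = time.time() if now is None else now
        key = secrets.token_hex(16)
        self._issued[key] = (session_id, now + self.ttl_seconds)
        return key

    def redeem(self, session_id: str, key: str, now: Optional[float] = None) -> bool:
        """Return True only for a known, unexpired key owned by this session."""
        now = time.time() if now is None else now
        entry = self._issued.pop(key, None)  # pop makes every key single use
        if entry is None:
            return False
        owner, expires_at = entry
        return owner == session_id and now <= expires_at
```

The value returned by issue() goes into the hidden input when the form is rendered, and the POST handler refuses to proceed unless redeem() returns True.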
If you want to audit a website for XSRF vulnerabilities, start by asking this simple question about every single HTML form you can find: "where's the XSRF value?"
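If you'd rather have a script ask that question for you, here's a rough sketch using Python's built-in HTML parser. The FormAuditor class is hypothetical, and it assumes the token lives in a hidden input with a known name ("fkey" here, borrowed from the example above). It flags every form without one, so expect some noise from harmless GET forms such as search boxes.

```python
from html.parser import HTMLParser


class FormAuditor(HTMLParser):
    """Collect the action of every <form> that lacks a hidden token field."""

    def __init__(self, token_name="fkey"):
        super().__init__()
        self.token_name = token_name
        self.unprotected = []  # actions of forms with no token input
        self._in_form = False
        self._action = ""
        self._has_token = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._in_form = True
            self._action = attrs.get("action", "")
            self._has_token = False
        elif tag == "input" and self._in_form:
            if attrs.get("type") == "hidden" and attrs.get("name") == self.token_name:
                self._has_token = True

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            if not self._has_token:
                self.unprotected.append(self._action)
            self._in_form = False
```

Feed it the HTML of a page with auditor.feed(html) and then look at auditor.unprotected to see which forms deserve a closer look.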
October 12, 2008
The Importance of Sitemaps
So I've been busy with this Stack Overflow thing over the last two weeks. By way of apology, I'll share a little statistic you might find interesting: the percentage of traffic from search engines at stackoverflow.com.
| Sept 16th (one day after public launch) | 10% |
| October 11th (less than one month after public launch) | 50% |
I try to be politically correct in discussing web search, avoiding the g-word whenever possible, desperately attempting to preserve the illusion that web search is actually a competitive market. But it's becoming a transparent and cruel joke at this point. When we say "web search" we mean one thing, and one thing only: Google. Rich Skrenta explains:
I'm not a professional analyst, and my approach here is pretty back-of-the-napkin. Still, it confirms what those of us in the search industry have known for a long time. The New York Times, for instance, gets nearly 6 times as much traffic from Google as it does from Yahoo. Tripadvisor gets 8 times as much traffic from Google vs. Yahoo.
Even Yahoo's own sites are no different. While it receives a greater fraction of Yahoo search traffic than average, Yahoo's own flickr service gets 2.4 times as much traffic from Google as it does from Yahoo.
My favorite example: According to Hitwise, [ex] Yahoo blogger Jeremy Zawodny gets 92% of his inbound search traffic from Google, and only 2.7% from Yahoo.
That was written almost two years ago. Guess which way those numbers have gone since then?
Google generally does a great job, so they deserve their success wholeheartedly, but I have to tell you: Google's current position as the start page for the internet kind of scares the crap out of me, in a way that Microsoft's dominance over the desktop PC never did. I mean, monopoly power over a desktop PC is one thing -- but the internet is the whole of human knowledge, or something rapidly approaching that. Do we really trust one company to be a benevolent monopoly over.. well, everything?
But I digress. Our public website isn't even a month old, and Google is already half our traffic. I'm perfectly happy to feed Google the kind of quality posts (well, mostly) fellow programmers are creating on Stack Overflow. The traffic graph provided by Analytics is amusingly predictable, as well.
Giant peak of initial interest, followed by the inevitable trough of disillusionment, and then the growing weekly humpback pattern of a site that actually (shock and horror) appears to be useful to some people. Go figure. Guess they call it crackoverflow for a reason.
We knew from the outset that Google would be a big part of our traffic, and I wanted us to rank highly in Google for one very selfish reason -- writing search code is hard. It's far easier to outsource the burden of search to Google and their legions of server farms than it is for our tiny development team to do it on our one itty-bitty server. At least not well.
I'm constantly looking up my own stuff via Google searches, and I guess I've gotten spoiled. I expect to type in a few relatively unique words from the title and have whatever web page I know is there appear instantly in front of me. For the first two weeks, this was definitely not happening reliably for Stack Overflow questions. I'd type in the exact title of a question and get nothing. Sometimes I'd even get copies of our content from evil RSS scraper sites that plug in their own ads of questionable provenance, which was downright depressing. Other times, I'd enter a question title and get a perfect match. Why was old reliable Google letting me down? Our site is simple, designed from the outset to be easy for search engines to crawl. What gives?
What I didn't understand was the importance of a little file called sitemap.xml.
On a Q&A site like Stack Overflow, only the most recent questions are visible on the homepage. The URLs to get to the entire list of questions look like this:
http://stackoverflow.com/questions
http://stackoverflow.com/questions?page=2
http://stackoverflow.com/questions?page=3
..
http://stackoverflow.com/questions?page=931
Not particularly complicated. I naively thought Google would have no problem crawling all the questions in this format. But after two weeks, it wasn't happening. My teammate, Geoff, clued me in to Google's webmaster help page on sitemaps:
Sitemaps are particularly helpful if:
- Your site has dynamic content.
- Your site has pages that aren't easily discovered by Googlebot during the crawl process - for example, pages featuring rich AJAX or Flash.
- Your site is new and has few links to it. (Googlebot crawls the web by following links from one page to another, so if your site isn't well linked, it may be hard for us to discover it.)
- Your site has a large archive of content pages that are not well linked to each other, or are not linked at all.
I guess I was spoiled by my previous experience with blogs, which are almost incestuously hyperlinked, where everything ever posted has a permanent and static hyperlink attached to it, with simple monthly and yearly archive pages. With more dynamic websites, this isn't necessarily the case. The pagination links on Stack Overflow were apparently enough to prevent full indexing.
Enter sitemap.xml. The file itself is really quite simple; it's basically a non-spammy, non-shady way to have a "page" full of links that you feed to search engines. A way that is officially supported and endorsed by all the major web search engines. An individual record looks something like this:
<url>
  <loc>http://stackoverflow.com/questions/24109/c-ide-for-linux</loc>
  <lastmod>2008-10-11</lastmod>
  <changefreq>daily</changefreq>
  <priority>0.6</priority>
</url>
The above element is repeated for each one of the ~27,000 questions on Stack Overflow at the moment. Most search engines assume the file is at the root of your site, but you can inform them of an alternate location through robots.txt:
User-Agent: *
Allow: /
Sitemap: /sitemap.xml
There are also limits on size. A sitemap.xml file cannot exceed 10 megabytes in size, with no more than 50,000 URLs per file. But you can have multiple sitemaps in a sitemap index file, too. If you have millions of URLs, you can see where this starts to get hairy fast.
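Here's a sketch of how generating these files could go, using nothing but Python's standard library. It is not the code we run on Stack Overflow: the function names and the example.com URLs are made up, and the changefreq and priority values are simply copied from the record shown earlier. It only enforces the 50,000 URL limit, so the byte size limit would need its own check.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000


def build_sitemap(entries):
    """Turn (loc, lastmod) pairs into the XML bytes of one sitemap file."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
        ET.SubElement(url, "changefreq").text = "daily"
        ET.SubElement(url, "priority").text = "0.6"
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


def chunk(items, size=MAX_URLS_PER_FILE):
    """Split a list into slices small enough for one sitemap file each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_sitemap_index(sitemap_urls):
    """Build the index file that points at each individual sitemap file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)
```

With ~27,000 questions a single file is plenty. Once you cross 50,000 URLs, you'd write one file per chunk() slice and list them all in the index.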
I'm a little aggravated that we have to set up this special file for the Googlebot to do its job properly; it seems to me that web crawlers should be able to spider down our simple paging URL scheme without me giving them an explicit assist.
The good news is that since we set up our sitemap.xml, every question on Stack Overflow is eminently findable. But when 50% of your traffic comes from one source, perhaps it's best not to ask these kinds of questions.
Just smile and nod and follow the rules like everyone else. I, for one, welcome our pixelated Google overlords!
