Coding Horror
programming and human factors
by Jeff Atwood

October 16, 2008

Programming Is Hard, Let's Go Shopping!

A few months ago, Dare Obasanjo noticed a brief exchange my friend Jon Galloway and I had on Twitter. Unfortunately, Twitter makes it unusually difficult to follow conversations, but Dare outlines the gist of it in Developers, Using Libraries is not a Sign of Weakness:

The problem Jeff was trying to solve is how to allow a subset of HTML tags while stripping out all other HTML so as to prevent cross site scripting (XSS) attacks. The problem with Jeff's approach which was pointed out in the comments by many people including Simon Willison is that using regexes to filter HTML input in this way assumes that you will get fairly well-formed HTML. The problem with that approach which many developers have found out the hard way is that you also have to worry about malformed HTML due to the liberal HTML parsing policies of many modern Web browsers. Thus to use this approach you have to pretty much reverse engineer every HTML parsing quirk of common browsers if you don't want to end up storing HTML which looks safe but actually contains an exploit. Thus to utilize this approach Jeff really should have been looking at using a full fledged HTML parser such as SgmlReader or Beautiful Soup instead of regular expressions.

The sad thing is that Jeff Atwood isn't the first nor will he be the last programmer to think to himself "It's just HTML sanitization, how hard can it be?". There are many lists of Top HTML Validation Bloopers that show how tricky it is to get the right solution to this seemingly trivial problem. Additionally, it is sad to note that despite his recent experience, Jeff Atwood still argues that he'd rather make his own mistakes than blindly inherit the mistakes of others as justification for continuing to reinvent the wheel in the future. That is unfortunate given that is a bad attitude for a professional software developer to have.

My response?

twitter message: Programming is hard, let's go shopping!

Bad attitude? I think that's a matter of perspective.

(The phrase "programming is hard, let's go shopping!" is a snowclone. As usual, Language Log has us covered. Ironically, we later had a brief run-in with Consultant Barbie "herself" on Stack Overflow -- who you may know from reddit. There's no trace of her left on SO, but as griefing goes, it was fairly benign and even arguably on-topic.)

In the development of Stack Overflow, I determined early on that we'd be using Markdown for entering questions and answers in the system. Unfortunately, Markdown allows users to intermix HTML into the markup. It's part of the spec and everything. I sort of wish it wasn't, actually -- one of the great attractions of pseudo-markup languages like BBCode is that they have nothing in common with HTML and thus sanitizing the input becomes trivial. Users have two choices:

  1. Enter approved pseudo-markup.
  2. Trick question. There is no other choice!

With BBCode, if the user enters HTML you blow it away with extreme prejudice -- it's encoded, without exceptions. Easy. No thinking and barely any code required.
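To illustrate how little code the "encode everything" approach needs, here is a minimal Python sketch. It is not the Stack Overflow code; the `BBCODE_TAGS` table and the `render_bbcode` name are hypothetical, and a real BBCode renderer would also have to handle nesting, links, and unbalanced tags.

```python
import html
import re

# Hypothetical table of approved pseudo-markup: BBCode tag -> HTML element.
BBCODE_TAGS = {"b": "b", "i": "i"}

def render_bbcode(user_input: str) -> str:
    # Step 1: encode everything, so any HTML the user typed becomes inert text.
    text = html.escape(user_input, quote=True)
    # Step 2: translate only the approved pseudo-markup into real tags.
    for bb, tag in BBCODE_TAGS.items():
        pattern = re.compile(r"\[%s\](.*?)\[/%s\]" % (bb, bb), re.DOTALL)
        text = pattern.sub(r"<%s>\1</%s>" % (tag, tag), text)
    return text

print(render_bbcode("[b]hi[/b] <script>x</script>"))
# prints: <b>hi</b> &lt;script&gt;x&lt;/script&gt;
```

Because the escaping happens before any tag is generated, the only HTML in the output is HTML the renderer itself produced.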

Since we use Markdown, we don't have that luxury. Like it or not, we are now in the nasty, brutish business of distinguishing "good" HTML markup from "evil" HTML markup. That's hard. Really hard. Dare and Jon are right to question the competency and maybe even the sanity of any developer who willingly decided to bite off that particular problem.

But here's the thing: deeply understanding HTML sanitization is a critical part of my business. Users entering markdown isn't just some little tickbox in a feature matrix for me, it is quite literally the entire foundation that our website is built on.

Here's a pop quiz from way back in 2001. See how you do.

  1. Code Reuse is:
    1. Good
    2. Bad
  2. Reinventing the Wheel is:
    1. Good
    2. Bad
  3. The Not-Invented-Here Syndrome is:
    1. Good
    2. Bad

I'm sure most developers are practically climbing over each other in their eagerness to answer at this point. Of course code reuse is good. Of course reinventing the wheel is bad. Of course the not-invented-here syndrome is bad.

Except when it isn't.

Joel Spolsky explains:

If it's a core business function -- do it yourself, no matter what.

Pick your core business competencies and goals, and do those in house. If you're a software company, writing excellent code is how you're going to succeed. Go ahead and outsource the company cafeteria and the CD-ROM duplication. If you're a pharmaceutical company, write software for drug research, but don't write your own accounting package. If you're a web accounting service, write your own accounting package, but don't try to create your own magazine ads. If you have customers, never outsource customer service.

Being a "professional" developer, if there really is such a thing -- I still have my doubts -- doesn't mean choosing third-party libraries for every possible programming task you encounter. Nor does it mean blindly writing everything yourself out of a misguided sense of duty or the perception that's what gonzo, hardcore programming types do. Rather, experienced developers learn what their core business functions are and write whatever software they deem necessary to perform those functions extraordinarily well.

Do I regret spending a solid week building a set of HTML sanitization functions for Stack Overflow? Not even a little. There are plenty of sanitization solutions outside the .NET ecosystem, but precious few for C# or VB.NET. I've contributed the core code back to the community, so future .NET adventurers can use our code as a guidepost (or warning sign, depending on your perspective) on their own journey. They can learn from the simple, proven routine we wrote and continue to use on Stack Overflow every day.

Honestly, I'm not that great of a developer. I'm not so much talented as competent and loud. Start writing and talking and you can be loud, too. But I'll tell you this: in choosing to fight that HTML sanitizer battle, I've earned the scars of experience. I don't have to take anybody's word for it -- I don't have to trust "libraries". I can look at the code, examine the input and output, and predict exactly what kinds of problems might arise. I have a deep and profound understanding of the risks, pitfalls, and tradeoffs of HTML sanitization.. and cross-site scripting vulnerabilities.

What I cannot create, I do not understand.

As Richard Feynman so famously wrote on his last blackboard, what I cannot create, I do not understand.

This is exactly the kind of programming experience I need to keep watch over Stack Overflow, and I wouldn't trade it for anything.

You may not be building a website that depends on users entering markup, so you might make a different decision than I did. But surely there's something, some core business competency, so important that you feel compelled to build it yourself, even if it means making your own mistakes.

Programming is hard. But that doesn't mean you should always go shopping for third party libraries instead of writing code. If it's a core business function, write that code yourself, no matter what. If other programmers don't understand why it's so critically important that you sit down and write that bit of code -- well, that's their problem.

They're probably too busy shopping to understand.


Posted by Jeff Atwood

 


 

Comments

> If it's a core business function, write that code yourself, no matter what.

I would have to agree with that... if you have problems/bugs with some component that is critical to your app, you want to not only be able to go in and fix/debug the issue but also to fully understand the piece if you expect to fix it right. I never feel good when I have to rely on someone outside of my business when things affecting my business break.

Kris on October 17, 2008 06:54 AM

"Not invented here" attitudes are a problem, so is the opposite, the attitude that everything your colleagues create is crap and that anything created by a third party must be better. These folks develop an exotic bug collection, different patterns of bugs from every source they copied and pasted from.

John on October 17, 2008 06:59 AM

>> a solid week building a set of HTML sanitization functions

It sounds like the time may have been better spent unit-testing an existing library, and if it was found to fail a lot of the tests, write your own against those unit tests. Best case; you spend a day writing tests and the rest of the week to work on other things. Worst case: you have some tests and expectations to write your own library against, and you'd be doing that anyway. Wouldn't you?

I can now see how SOF slipped by several months.

Most of the rest of your post amounts to nothing but snobbery!

Douglas F Shearer on October 17, 2008 07:07 AM

This is one time I have to disagree with you. Attitudes like this are why the quality of software doesn't advance at a better pace than it does. We spend so much time re-inventing the wheel and not adding new value to an application.

You can use an open source library like beautiful soup and get all the ability to fix problems on your own, retain the ability to read and understand what the code is doing, and at the same time benefit from the work from a lot of other people keeping an eye out for things you may not have thought of. You might actually learn something new. Beautiful Soup is written in Python, but you could quite easily wrap it/run it in IronPython to match your .NET requirement.

Bypassing options like this to roll your own is just a combination of vanity (I can do everything better than anyone else can) and/or laziness (it is harder to read/understand someone else's code than to roll your own).

With the variety of open source solutions out there now, you can avoid reinventing the wheel while at the same time having complete understanding of the code being run and retaining the ability to modify it as you need to.

Spencer on October 17, 2008 07:11 AM

This was an interesting article, and I'd love to hear you talk about when you WOULD want to use an existing library. My reading of this post was that you were arguing that sometimes reinventing the wheel can be good in specific situations, and that writing your own HTML sanitizer for Stack Overflow was one of those situations for a variety of reasons.

So my question is, under what circumstances would you have used a library for HTML sanitization rather than rolling your own? You mentioned that there are few sanitizers for C# and VB.NET; if there had been one or more really mature and widely used ones then would you have not written your own?

I'd love to see a blog entry on the specific criteria you use to decide when to use an existing library with examples of when you did and when you didn't. Especially focusing on things for which external libraries exist and you had to think long and hard about whether to go with one of them, since that would give us insight into the criteria you use to make those decisions.

Eli Courtwright on October 17, 2008 07:14 AM

Is re-inventing the wheel really a bad thing if you make a better wheel?

Dan on October 17, 2008 07:15 AM

I find reusing existing libraries is often in conflict with the principle of YAGNI. Libraries grow over time to satisfy the needs of multiple usage scenarios. If you only need 50% of the functionality of that library for your scenario then the library brings along a heap of baggage that has security, performance and complexity implications.

Darrel Miller on October 17, 2008 07:15 AM

> the time may have been better spent unit-testing an existing library

I agree with you; Jon had a set of unit tests he worked up that I wanted to build on. However, many of the XSS exploits would require actual browser code to execute against -- different browsers interpret "sketchy" markup differently. So a *complete* and *accurate* XSS test suite would have to fire up browser .exes and somehow detect JS execution and other conditions in the browser.

> You can use an open source library like beautiful soup

In .NET? How?

This is like telling me I should use rainbows and cotton candy. Well, obviously.

There are almost no options in .NET, which is one of the *reasons* I wanted to go this route. So others would have more solutions!

Jeff Atwood on October 17, 2008 07:17 AM

I'm definitely in agreement here, controlling dependencies is a key way to control complexity. Sometimes the fast and easy solution ends up eating more resources over the long haul:

http://theprogrammersparadox.blogspot.com/2008/09/dependency-too-far.html

Paul.

Paul W. Homer on October 17, 2008 07:32 AM

Maybe I'm misunderstanding "core competency" here, but do users seriously go to Stack Overflow so that they can write in Markdown? I thought they went there to find and answer programming questions. I thought *that* was your core competency.

I understand there weren't any libraries doing what you wanted, so that much I'm with you on, but the core competency explanation falls a little limp.

Dan Hulton on October 17, 2008 07:35 AM

"In .NET? How?"

Did you not read his whole comment? He mentioned running it using IronPython.

N on October 17, 2008 07:37 AM

w.r.t the sanitising html problem, what about
a) translate into BBCode (say for <i>i</i> tags and <b>b</b> tags)
b) get rid of any markup which remains

Sean on October 17, 2008 07:40 AM

@Dan Hulton:

Jeff's talking about his own core competency - he's the one running the site, so he has to be proficient in handling everything people would do with the site. You're confusing his focus on SO with that of the users.

Adam V on October 17, 2008 07:41 AM

Thank goodness SO uses markdown! I mean it is the foundation of SO! All hail markdown...the greatest thing on earth!

are u serious????

Joe Beam on October 17, 2008 07:46 AM

Touché

Antonio on October 17, 2008 07:47 AM

Excellent article, I couldn't agree more. Aside from the obvious benefits to the business it's great for developers to have the opportunity to solve Hard Problems. Going out and buying libraries to solve Hard Problems tells your developers that you don't trust them to get it right.

Jim on October 17, 2008 07:50 AM

I think it should go without saying that there's an addendum to that quote on "If it's a core business function -- do it yourself, no matter what." It should probably be something like "if that thing is hard and you can't trust the quality of an off the shelf implementation". The classic example is John Carmack explaining why they wrote a 3d engine - it was a big deal and the major seller. Nothing else would do.

I don't think that applies in this case. There would almost certainly have been some sanitisation code which you could have reused as others have pointed out. Some things are just building blocks. If your core competency is web sites, you don't necessarily have to start by writing a web server.

Andrew on October 17, 2008 07:54 AM

So... Stack Overflow's "Core Business Function" is sanitizing HTML?~

If there were no decent HTML sanitizers for .NET (seriously? I come from Python, where I can think of 3 excellent sanitizers off the top of my head - BeautifulSoup, lxml, and feedparser's built-in one - not to mention that we're a much smaller community) then I can understand building one from scratch. But even then, I would have gone with a port of another language/platform's sanitizer, because I *guarantee* there are domain problems they've dealt with that you couldn't possibly have thought up on your first iteration.

That would actually be the best of both worlds: not having to make your own mistakes, but still intricately understanding how the code works.

But kudos for getting the code out there.

Adam Gomaa on October 17, 2008 07:56 AM

You can talk about efficiency, delivery dates, and the "right" ways to do things in your project, but after all as a developer you understand that it is your project, and you started it because you wanted to, because you like to code, and coding stackoverflow was fun. That is the only and the major reason you decided to DIY. Digging through other people's messy code and trying to pervert it to work with your framework is NOT FUN. Coding your own solution for an interesting problem IS FUN. And there is nothing else to add.
Keep up reinventing the wheel as long as it is fun!

Maggus on October 17, 2008 08:03 AM

> So... Stack Overflow's "Core Business Function" is sanitizing HTML?~

nice looking, easy to understand user-generated content is our core business. And guess how that content is generated?

> If there were no decent HTML sanitizers for .NET (seriously?

You'd be surprised. In many ways .NET is kind of a backwater. Compare how many blogging engines there are in PHP, for example, to how many there are in .NET.

http://www.codinghorror.com/blog/archives/000320.html

(from mid-2005, but the relative stats have not changed since then, may even be worse as PHP has exploded)

Jeff Atwood on October 17, 2008 08:05 AM

Hey Jeff,

There's probably a good reason for not being able to do this but my first thought after reading your post is why don't you write a sanitizer that converts valid html tags to something like BBCode?
You can convert valid tags to BBCode and escape the rest, that way users don't have to learn BBCode to format their comments and you get to throw away anything that's invalid. You can still allow people to use whatever pseudo markup language you're converting to but it makes it so much easier for people who already know html.

Kevin T on October 17, 2008 08:09 AM

>In .NET? How?

>This is like telling me I should use rainbows and cotton candy. Well,
>obviously.

>There are almost no options in .NET, which is one of the *reasons* I
>wanted to go this route. So others would have more solutions!

Perhaps this should have been examined before choosing .NET. The group of developers around .NET are very different than the community around something like Python or PHP. They are more likely to understand the value of shared code, especially where that code is not the differentiation between them and their competition.

One thing, though, on third-party libraries. They are great, as long as you have the source. If not, you do not even have all of the source for your own application, so how can you be expected to maintain it?

Perhaps this whole experience has made you a little wiser about the value of source code. Hopefully others can learn from your pain.

Grant on October 17, 2008 08:11 AM

The main point of Dare's post, which you've totally failed to address, wasn't that you shouldn't ever write your own HTML sanitisation code, but that trying to do so with *regular expressions* is a huge source of problems.

You could just as well have rolled your own validation by requiring valid XHTML input and using a SAX parser, which makes it quite easy to whitelist tags and attributes, or even validate that the input is well-formed in other ways (e.g. that inline elements don't contain block elements).

Jonathan Buchanan on October 17, 2008 08:16 AM
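The SAX-based whitelist described in the comment above can be sketched in Python with the standard `xml.sax` module. This is only an illustration under stated assumptions: the `ALLOWED` table, the `SAFE_SCHEMES` tuple, and the function and class names are hypothetical, and the sketch rejects bad input outright rather than trying to repair it.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr

# Hypothetical whitelist: element name -> permitted attribute names.
ALLOWED = {"p": set(), "b": set(), "i": set(), "a": {"href"}}
# Hypothetical list of URL prefixes accepted in href attributes.
SAFE_SCHEMES = ("http://", "https://")

class WhitelistHandler(xml.sax.ContentHandler):
    """Re-emits only whitelisted elements and attributes; raises otherwise."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.depth = 0

    def startElement(self, name, attrs):
        self.depth += 1
        if self.depth == 1:
            return  # synthetic wrapper element added by validate_xhtml_fragment
        if name not in ALLOWED:
            raise ValueError("element not allowed: %s" % name)
        parts = [name]
        for attr, value in attrs.items():
            if attr not in ALLOWED[name]:
                raise ValueError("attribute not allowed: %s" % attr)
            if attr == "href" and not value.lower().startswith(SAFE_SCHEMES):
                raise ValueError("URL scheme not allowed: %s" % value)
            parts.append("%s=%s" % (attr, quoteattr(value)))
        self.out.append("<%s>" % " ".join(parts))

    def endElement(self, name):
        self.depth -= 1
        if self.depth == 0:
            return  # closing the synthetic wrapper
        self.out.append("</%s>" % name)

    def characters(self, content):
        # The parser hands over decoded text, so it is re-escaped on output.
        self.out.append(escape(content))

def validate_xhtml_fragment(fragment: str) -> str:
    """Return the re-serialized fragment, or raise if it is malformed or unsafe."""
    handler = WhitelistHandler()
    # Wrapping in a single root element lets a fragment parse as a document.
    xml.sax.parseString(("<root>%s</root>" % fragment).encode("utf-8"), handler)
    return "".join(handler.out)
```

Malformed input surfaces as `xml.sax.SAXParseException` and disallowed markup as `ValueError`, so the caller decides how to report either one. The trade-off is that users must type well-formed XHTML, which is a stricter requirement than browsers impose.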

>nice looking, easy to understand user-generated content is our core business

A big part of what makes SO nice-looking and easy to understand is the delectably responsive UI. But you use a third party library to support that. Is it distinctively less of a core biz function than Markdown support?

alexis.kennedy on October 17, 2008 08:17 AM

Good post Jeff and thanks for sharing the codez :-)

o.s. on October 17, 2008 08:22 AM

I always enjoyed the phrase. "Don't reinvent the wheel, unless you plan on learning more about wheels"

Raisins on October 17, 2008 08:28 AM

Honestly, I disagree entirely with Joel's comment that's being referenced here. Yes, for certain "core functions" you should write things yourself, but using a framework or library to help get it done quicker is a good thing, not a bad thing.

For example, if I was writing a storefront for an e-commerce site, I would prefer to write my own store and fulfillment system to fully encapsulate the business needs, but I would gladly use an existing storefront framework out there (for example, Satchmo if I was using Django) that takes care of the payment gateways, even if I end up redoing everything else from scratch.

I guess it depends on the context of the application. I would not trust a "drop it in" e-commerce package for anything except the most basic of online stores, but I would gladly borrow the payment and generic CRUD modules (e.g. adding new customers) from one to shorten development time.

Wayne on October 17, 2008 08:29 AM

Oddly, many businesses blithely Go Shopping. The proliferation of BOM/MRP/ERP software systems is the prime example. And SAP is the prime of the prime. How you make your widgets is your core competence. But many still buy such software. Maybe that's why the USofA is going down the tubes.

BuggyFunBunny on October 17, 2008 08:32 AM

I agree that, for instance, a pharmaceutical company should write their own drug research software, but writing your own software and writing your own software from scratch are two completely different cups of tea. Especially when it comes to security - if history has taught us anything, it's that you should *never* write your own custom security-related routines, if at all possible.
However, the fact that you released your HTML-sanitizer to the public and posted it on your blog is certainly a plus, as I'm sure that now it will be picked apart and scrutinized by everyone in the community, especially those trying to prove that you have no idea what you're talking about :)

BlueRaja on October 17, 2008 08:44 AM

The comment thread on this one is freakin' hilarious! But, yeah, I agree with Jonathan Buchanan's sentiments... Dare's post seemed to attack the usage of regular expressions to accomplish your goal -- not just the HTML sanitisation.

-- Kevin Fairchild

Kevin Fairchild on October 17, 2008 08:50 AM

By Jeff's logic writing his own web server would be acceptable as well. Serving web pages is clearly part of his core business.

How many questions/answers actually contain HTML? Would it really have been that great an inconvenience to disallow HTML markup? Jeff even alludes to this when talking about how much easier things would have been with BBCode.

Why not just encode all HTML before Markdown sees it? Why not consider a different markup language?

Absconditus on October 17, 2008 08:57 AM

You don't need to understand everything to run StackOverflow. Computer science has this wonderful philosophical device: abstraction. Black boxes make the composition of systems from smaller functional units wonderfully tractable. What you are complaining about is the lack of a suitable black box, so you wrote your own. No problem there. But by the 'Feynman metric' from the blackboard, I doubt you understand the entire operation of StackOverflow. Did you write your own database (and could you, from scratch)? Is the network stack custom rolled?

I doubt it, and rightly so as rewriting them would be crazy. Feynman wanted to understand the entire universe stack, from top to bottom. You needed something that didn't exist, so you made it. That's the great luxury of software development.

Imagine, however, that a suitable sanitization engine had existed. Then you would have been crazy, from a production point of view, to roll your own if the extant engine had decent documentation, and the time of integration was small enough. You trust black boxes to give certain guarantees at every level of operation; another one here wouldn't have been a problem.

From a 'do I understand the universe' point of view, you could have written your own HTML sanitizer to scratch that particular curious itch, but it's a weird one to start out with when there are far more interesting problems to be able to solve.

Henry on October 17, 2008 09:03 AM

*If* you are *able* and *willing* to write significantly better code than what exists, or code doesn't exist, or the code that exists can't be easily adapted to what you want to do - then you pretty much have to write code. Otherwise don't waste time and get on with your job.

So much support code that is written is just reinventing the wheel, and very poorly at that. Most of the time that devs reinvent the wheel they are neither willing nor able to write better code - they just want to write the code. They also usually don't have the benefit of a lot of eyes looking at and testing their code, so rarely does it even begin to approach the quality of code that is already out there and used by other people.

Developer Dude on October 17, 2008 09:13 AM

These days, HTML sanitization is primarily about security (preventing XSS attacks). When it comes to security, you want to use a proven, standardized solution. Would you roll your own version of SSL, or a cryptographic hash?

You say that your solution is proven, and can now be reused. Call me in 5 years when that's actually true; right now it has gone through precious little battle-testing.

I disagree that this is core business functionality for stackoverflow. Your core is how you facilitate collaboration, not the content format.

Chase Seibert on October 17, 2008 09:29 AM

When it's a week of work, easy call.

When it's six months to a year, involving a not insignificant investment, what do you do? The choice is not easy then. And no matter which way you go, you will always wonder if the other way was better.

cthrall on October 17, 2008 09:49 AM

> Would you roll your own version of SSL, or a cryptographic hash?

Well, first I'd design my own CPU, RAM, and motherboard. From scratch, naturally. Then an OS to run everything. Maybe an IDE, debugger, things like that. But after that I'll be all over SSL and hashes like fleas on a dog!

If you are a security vendor, you might want to build SSL or hashes.

If your website allows arbitrary user-generated HTML in markup for *every single page*, you might.. just.. consider.. writing your own HTML sanitizer.

But what the hell do I know.

Jeff Atwood on October 17, 2008 09:50 AM

So, by the same rationale, does that mean you should learn C? If you can't create (given, like a thousand man-years) the .Net framework, how can you understand it? How can you defend your use of it?

:)

I mean this only half jokingly.

I'll await a response while building my webserver driven by telegraph latches, based on what I've learned in Charles Petzold's "Code" ;).

doug t on October 17, 2008 09:52 AM

Coding Horror is turning into the DailyWTF with all the submissions coming from Jeff himself. Talk about over-complication! I see the problem as being this:

"Markdown allows users to intermix HTML into the markup"

And the solution is to change your markdown interpreter so that intermixing HTML is not allowed. Problem solved. No need to spend a week (or more) creating some HTML sanitizer that frankly isn't needed at all. Markdown includes more than enough formatting options without having to drop into HTML.

Wayne on October 17, 2008 09:53 AM

From http://daringfireball.net/projects/markdown/license:
> Markdown is free software, available under the terms of a BSD-style open source license.

If HTML is the problem, then strip it out of your 3rd party library. If you want to foster the markdown community, offer the patch to other developers. I don't believe you absolutely have to write your core functionality yourself. I do believe however you have to modify it to suit your needs.

Bill on October 17, 2008 10:14 AM

This whole talk of core competencies gave me an idea. Microsoft is a software company. Apple is a hardware company. Windows was developed completely in-house. MacOS X is built on open source Unix foundations. Which one should have come out the better? Which one did?

Inventing your own wheel is sometimes necessary. I think it was in this case. But it should always be the exception, not the rule.

Felix Pleşoianu on October 17, 2008 10:18 AM

I sure hope that Jeff packages his HTML sanitizer as an open source library and posts it on SourceForge. .NET will forever be backwater unless developers start publishing their hand-rolled libraries.

Wayne on October 17, 2008 10:29 AM

I don't think writing a simple sanitizer is all that hard. At least, I have done it myself, taking the conservative approach of running through the input character-by-character with a finite-state machine scrawled on a piece of paper that says whether < or & is allowed at any point and whether to recover from errors by inserting > or escaping the faulty code. You then have a table that says which element names are permitted and which attributes those elements can have. The result is well-formed XHTML fragment that will display safely.

The point of the above is that (a) you only let through known-good HTML, rather than trying to spot and fix known-bad HTML (since making a list of good things is easier than making a list of all bad things), and (b) approach the problem systematically rather than trying a quick regex-based bodge and then a few more bodges on top of that.

I don't see how translating to BBCode and back again could possibly be simpler as it presumably involves interpreting the HTML to generate the BBCode... And forbidding HTML in Markdown is not much simpler because after sanitizing the Markdown (in order to forbid HTML), you then convert it to HTML and have to hope there is no way to fool the Markdown formatter to make it produce bad HTML. Safer to write a bullet-proof HTML sanitizer and apply it at the very end of the pipeline.

Damian Cugley on October 17, 2008 10:41 AM
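The table-driven, whitelist-only approach described in the comment above can be sketched roughly as follows. This is an illustration, not Damian Cugley's code: it scans the input one position at a time, but recognises a candidate tag with a strict anchored pattern instead of a hand-drawn state table. The `ALLOWED` table, the helper names, and the choice to reject `&` inside attribute values are all assumptions made to keep the sketch short.

```python
import re

# Hypothetical table: element name -> permitted attribute names.
ALLOWED = {"b": set(), "i": set(), "code": set(), "a": {"href"}}

# Strict shapes only: lowercase names, double-quoted values without " < > &.
TAG = re.compile(r'<(/?)([a-z][a-z0-9]*)((?:\s+[a-z]+="[^"<>&]*")*)\s*>')
ATTR = re.compile(r'([a-z]+)="([^"<>&]*)"')
ENTITY = re.compile(r'&(?:amp|lt|gt|quot|#[0-9]+);')
SAFE_URL = re.compile(r'https?://', re.IGNORECASE)

def _accept_tag(m, open_tags):
    """Return normalized tag text if the match is known-good, else None."""
    closing, name, attr_text = m.group(1), m.group(2), m.group(3)
    if name not in ALLOWED:
        return None
    if closing:
        # Accept a close tag only if it matches the innermost open element.
        if attr_text or not open_tags or open_tags[-1] != name:
            return None
        open_tags.pop()
        return "</%s>" % name
    attrs = ATTR.findall(attr_text)
    if len({a for a, _ in attrs}) != len(attrs):
        return None  # duplicate attribute names
    for attr, value in attrs:
        if attr not in ALLOWED[name]:
            return None
        if attr == "href" and not SAFE_URL.match(value):
            return None
    open_tags.append(name)
    return "<%s%s>" % (name, "".join(' %s="%s"' % (a, v) for a, v in attrs))

def sanitize(text: str) -> str:
    out, open_tags, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "<":
            m = TAG.match(text, i)
            tag = _accept_tag(m, open_tags) if m else None
            if tag is not None:
                out.append(tag)
                i = m.end()
                continue
            out.append("&lt;")  # not known-good: escape instead of guessing
        elif ch == "&":
            m = ENTITY.match(text, i)
            if m:
                out.append(m.group(0))
                i = m.end()
                continue
            out.append("&amp;")
        elif ch == ">":
            out.append("&gt;")
        else:
            out.append(ch)
        i += 1
    # Recover from unclosed elements by closing them in reverse order.
    while open_tags:
        out.append("</%s>" % open_tags.pop())
    return "".join(out)

print(sanitize("<b>bold</b> <script>alert(1)</script>"))
# prints: <b>bold</b> &lt;script&gt;alert(1)&lt;/script&gt;
```

Anything that does not exactly match a known-good shape is escaped rather than repaired, so the sketch never has to enumerate bad inputs. The cost is that some legitimate markup, such as links whose URLs contain `&`, is shown as literal text.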

So, what's the difference between rolling your own HTML sanitizer and rolling your own jQuery?

Scott on October 17, 2008 10:57 AM

"But what the hell do I know."

That is what I've been trying to figure out for years.

pwnguin on October 17, 2008 11:15 AM

Wow, so many comments about how there are libraries in .Net to sanitize HTML and no one mentions what these libraries might be. Which is Jeff's primary issue.

Stephen on October 17, 2008 11:32 AM

I agree to a point, though I have to say it really depends on what you consider core to your business.

If for example your business depends on an API allowing other developers to create an ecosystem around your services - should you roll your own XML parsers? In this example, parsing XML is as core to business aims as parsing posted content is to you, and it's why I think you're off the mark a little in this instance.

You seem to have done the research and identified that there is a gap for the HTML sanitisation which you needed to fill with in house code. That much, I totally agree with. I'm not sure I'm 100% with your analysis that this represents a core business function though.

Andrew on October 17, 2008 11:39 AM

"In .NET? How?

This is like telling me I should use rainbows and cotton candy. Well, obviously."

The stress has obviously gotten to the man. Why didn't you even read the whole post? Or couldn't you yourself come up with a way to use Python from .NET?
And no response of course to these questions...

Usually the content here is quite good but this is a severe case of NIH-syndrome and denial wrapped together.
Sanitizing html is your core business now? Nice.

WTF on October 17, 2008 11:48 AM

I have to massively disagree with you on this one, particularly in the games industry.

Middleware in our industry is quite common, hence Midway paying Epic 3.X Billion dollars for a 10 year deal for their rendering engine. Particularly, the product in our industry is about providing content, not code. In which it makes sense that if it's cheaper to buy the code to help produce that content (and content production could be faster as a result), then that's the proper path to go.

I do agree, that as a programmer dealing with those systems, you need to understand it enough to be able to reproduce it in some variety. But at the core, if you can get a contract to purchase that software to do the job better/faster/easier/less expensive, then you should do it. Otherwise, you'll end up overbudget/slower, which is not productive to a video game.

~Main

MainRoach on October 17, 2008 11:50 AM

If it were only a time vs. materials world: reuse, reuse, reuse.

But it's not. Computer Science curriculum would agree with that. That's why in a data structures class, any good professor will make you write a stack/queue/linked-list before directing you to the STL. That's why in my Web Programming class this semester, we were directed to write a web server in C. The time it takes to write these things is worth it (at least to me) if for nothing else than the personal growth I experience when I learn something.

When Jeff's sanitizer breaks, he'll know how to fix it a lot better than if he just had this big abstract entity of an HTML sanitizer to search and prod through.

Charles Callebs on October 17, 2008 12:01 PM

"Thats why in my Web Programming class this semester, we were directed to write a web server in C. "

You've inversely proved my point with your own ;)

A web server in C is not a 2.4million line codebase that 20 people have all contributed to over the past 4 years. Video games are.

If it only took a weekend to write a code sanitizer, then yea, do it yourself. But writing a robust multi threaded physics library that works cross platform and cross project? I'll leave that to another company who's entirely dedicated to providing that software in the marketplace as their livelihood.

Good programmers program, Great programmers reuse.

~Main

MainRoach on October 17, 2008 12:22 PM

Hey Jeff,

I'm a PHP developer, and there are loads of libraries, frameworks, CMSs etc out there. I've just (hopefully) won the battle with my project managers to let me write my own stuff instead of having to use 3rd party code.

Why?

So much of the 'popular' code (CMSs especially) is really really badly written. The time taken to evaluate, test, debug, tweak existing code is often way longer than doing it yourself.

I ALWAYS re-invent the wheel. How else would the wheel get better? And who is anybody to say I'm not good enough to add incremental value to the wheel?

That's not to say I don't use built in stuff and code that I know is good. Sometimes I'll take a snippet from the web and RE-WRITE it so I understand what it does and how it works. This is not a black and white thing.

As you pointed out, use 3rd party code where it makes sense, but don't rely on it to build your software. Telling a client their site is down because of an exploit in 3rd party code is hardly going to enhance your reputation.

If there is a bug in any of my applications, it's my fault, and my job to fix it. The buck stops with me, every time. That's what being a professional programmer is all about.

Rant over. :-)

Trevor on October 17, 2008 12:51 PM

> so many comments about how there are libraries in .Net to sanitize HTML and no one mentions what these libraries might be. Which is Jeff's primary issue.

It's because there really aren't any. There's the HTML Agility Pack -- which isn't really designed for sanitizing without writing a bunch of (error-prone) code to make it work -- and that's about it.

As others have said, the idea that sanitizing is this super-hard impossible problem is also not really true. Certainly nowhere near as hard as the physics library example @MainRoach proposed, etc. And like @Damian said, you can write a decent sanitizer in a few days.

Testing it thoroughly is another matter..

Jeff Atwood on October 17, 2008 12:52 PM

Okay, interesting article (and I've read that one of Joel's in the past), but I think that you are both talking about something totally valid but making the wrong point.

The reason that you had to write your own HTML sanitizer, and the reason that Microsoft's Excel team had to write their own compiler (Joel's article, as I recall) is that libraries or external programs to do what was needed *didn't exist*.

When that Excel compiler was written, there was no open-source community to speak of, and they couldn't very well modify a commercial compiler for their needs. Google *can't* use external libraries, because nothing scales to their level...yet. And as you pointed out in comments, .NET is a backwater and doesn't have the kind of libraries yet that older platforms do, so the thing you needed didn't exist.

The moral of the story isn't "We should write important stuff in-house." The moral of the story is, "If it doesn't exist or you need something way beyond what exists, you probably will have to write it yourself." That's a fact of life, not a lesson in good software design.

-Max

Max Kanat-Alexander on October 17, 2008 12:52 PM

Jeff, while I often enjoy your insights, you're just not being rational here.

"Core business function" is synonymous with "competitive advantage." If HTML sanitizing were core, you'd have written your business plan around how much better you are at it than others. It would be up there with how you attract and keep smart programmers on your site. The rest is plumbing.

If .NET doesn't have its own sanitizer, perhaps it wasn't the right choice for a platform. Personally, I've always wondered why you chose .NET. I know it's the one that's most familiar to you. That's a plus if you want to complete a small new project fast, but using the same language and tools all the time limits your growth potential as a developer. Considering how many readers you have who don't use .NET (myself among them), you really don't want to end up as one of those curmudgeonly single-language programmers.

David Leppik on October 17, 2008 12:57 PM

I don't know about the whole "html sanitizer being a core business function" thing (quite frankly, I could live with textile or bbcode).

But I do think it's a good investment for a developer to reinvent something. You can't talk about scalable comet architecture if you've never written a server. You can't talk about javascript compilation optimization if you've never written a javascript engine.

If you want to be an expert at something, you've gotta experience its ins and outs, the full development cycle, the bugs, the caveats, the holes and the limitations.

And if you are a programmer (read: not a content-entry monkey) and you have bills to pay, you'll probably want to be good at *some* programming-related task.

Leo Horie on October 17, 2008 12:57 PM

The easy solution is to allow only well formed image tags, and zap everything else, no?

Rob on October 17, 2008 01:14 PM

"If your website allows arbitrary user-generated HTML in markup for *every single page*, you might.. just.. consider.. writing your own HTML sanitizer."

And your website makes several database hits on *every single page*, so you might... just.. consider.. writing your own database server or at the very least your own data access library.

You would probably want to write your own cookie and session state handlers too, as those are used on every page.

Seriously, I really don't see how the frequency of use has anything to do with your core competency as a business. Our employees drink coffee and tea every day, sometimes several times a day - does that mean we open up our own coffee bar in the warehouse? Do we start our own telco because of all the phone calls that get made? Perhaps we should also start making our own chairs, since people sit in them ALL the time.

If you couldn't find anything off-the-shelf to do exactly what you needed it to do, and you weren't willing to compromise, that's okay. That's a common problem - it's the source of many of the best tools out there. But you then proceeded to come up with a vastly inferior solution using regular expressions instead of an actual parser. Pfeh. That's like complaining about a lack of quality encryption tools in your domain and proceeding to build your own based on ROT13.

Aaron G on October 17, 2008 01:42 PM

I don't see a license attached to your sanitizer. Doesn't that make it unusable for anyone else?

(Also, does C# have first-class functions, or is that just for clarity? If so - neat!)

Bernard on October 17, 2008 01:58 PM

I'm a fan of reuse, but, seriously, give it a rest people. You act like all of the open source code out there is of equal quality and documented. It isn't. Forgive me if I'm hesitant to put NightOwl201978's homegrown HTML sanitizer in there that he's used on his blog, which receives 3 visitors a day.

I see this as a problem with the open source community. You go to write an application, and 99% of the time, people say, "oh, don't write that from scratch! Go work on *decidedly-mediocre-project-that-prompted-you-to-develop-this-in-the-first-place* instead!" Oftentimes these projects have SEVERE issues (symptoms like memory leaks or unmanageable complexity) that are NOT simple fixes to make, they're often architectural, or, worse, cultural. (Such as inappropriate use of low level languages, failure to abstract properly, etc.) The very thing that project needs the most is someone to come along and outdo it, who isn't afraid to say that the code quality is unacceptable.

Perhaps that is the overall problem with open source: when all code is free, we wrongly assume it is good code.

Matt Green on October 17, 2008 02:04 PM

Jeff still hasn't provided a convincing case as to why he needs to allow HTML at all. He more or less admits that this isn't necessary when discussing BBCode. Can he even show us a question/answer where user-entered HTML was necessary/desirable?

Absconditus on October 17, 2008 02:04 PM

Aaron, the big difference there is that the only thing user-generated in that entire list is...

Bingo, the HTML.

Stack Overflow has a targeted audience of programmers. This isn't some random forum on the internet about knitting, it's a group of professionals, a percentage of which probably have the ability to break something one has written.

How many HTML sanitizers are written with this kind of audience in mind?

That's not a rhetorical question, I'd sincerely like to know.

Bottom line is, it's one of the most important features of Stack Overflow and requires a lot of attention to detail.

Charles Callebs on October 17, 2008 02:06 PM

@Absconditus - Now, I'd agree with that :P

Charles Callebs on October 17, 2008 02:06 PM

BBcode won't save you from anything, and the assumption that, by using it, you're *saving* anything at all is naive at best and dangerous at worst. For an example take a look at this: http://ha.ckers.org/blog/20060619/cross-site-scripting-strikes-apnaspacecom-hi5com-aboutcom-and-b3tacom/
Not to mention that it's clumsy and unintuitive, and every site that supports bbcode-style markup does it slightly differently.

anon on October 17, 2008 02:33 PM

@Charles: Python's feedparser has a strong html-sanitizer.
http://www.feedparser.org/docs/html-sanitization.html

anon on October 17, 2008 02:35 PM

Jeff,

I can see your more general point about re-use. Having written my own HTML sanitizer, I can understand why you wouldn't want to use some code that you really don't understand very well in a core function of your product.

Also, for what it's worth, I love stack overflow's input idiom. It combines the ease and familiarity of entering plain text with the immediate feedback of a GUI editor.

But you really ought to understand how difficult the problem domain is, and have humility about your solution. You have to assume your code will be wrong. Concentrate on making sure that, when it fails, it fails well.

Your use of a whitelist rather than a blacklist is a step forward, but the use of regular expressions is two steps back. You *can't* parse HTML with regular expressions. There are a few thousand screeds on this topic, but here's a good one: http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html

What you need to do to sanitize HTML - or, for that matter, *any* untrusted network input, is:

1. Parse it: load it into a data structure. An actual structure, with actual rules. During this process, your parser may fail. That's great. *Give up*. If you see input you can't handle, assume it's an attack. On a site like StackOverflow, where users get real-time feedback, this doesn't even create a usability problem! You immediately say "I'm sorry, I couldn't understand that."
2. Emit it. During this phase, if somebody tricked your parser, it shouldn't matter: your emitter should be smart. Let's say your parser has a bug where it mistakenly thinks the "bar" in "<foo bar baz='boz'>" is actually the start of text. If someone tries to attack that, your emitter can neuter the errant ">" by quoting it as "<foo>bar baz='boz'&gt;</foo>". It will look ugly, but it will fail in a way which is at least *safe*. Importantly, the emitter should be as distant and disconnected from the parser as possible, so that they do not share bugs.

If you are concerned about memory overhead (but you shouldn't be, because you've loaded the whole thing as a string anyway) you can always use an event-driven parser/emitter pair (think SAX) rather than loading everything into a single structure first.

If you follow this structure, you can still even use regexps for the parsing phase, because your parser can screw up horribly and your emitter will still produce valid, if unpleasant, output.
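
A minimal sketch of the parse-then-emit structure described above, assuming Python and its standard-library `html.parser`. The whitelist and the names `StrictParser`, `RejectedInput`, `parse`, `emit` and `sanitize` are hypothetical, chosen only for illustration; this is not anyone's production sanitizer. The parser gives up on anything it does not recognize, and the emitter re-checks the whitelist and escapes all text on its own, so it does not depend on the parser having been right.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; a real site would pick its own tags.
ALLOWED_TAGS = {"b", "i", "em", "strong", "code", "pre", "p"}


class RejectedInput(ValueError):
    """The parser saw something it does not fully understand."""


class StrictParser(HTMLParser):
    """Phase 1: load the input into a tree, or give up."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = (None, [])      # node = (tag, children); text is a plain str
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # No attributes at all in this sketch: anything else is rejected.
        if tag not in ALLOWED_TAGS or attrs:
            raise RejectedInput(f"unsupported tag or attributes: <{tag}>")
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) < 2 or self.stack[-1][0] != tag:
            raise RejectedInput(f"mismatched closing tag: </{tag}>")
        self.stack.pop()

    def handle_data(self, data):
        self.stack[-1][1].append(data)

    def _reject(self, data):
        raise RejectedInput("comments, doctypes and the like are not supported")

    handle_comment = handle_decl = handle_pi = unknown_decl = _reject


def parse(text):
    parser = StrictParser()
    parser.feed(text)
    parser.close()
    if len(parser.stack) != 1:
        raise RejectedInput("unclosed tag")
    return parser.root


def emit(node):
    """Phase 2: serialize the tree without trusting the parser."""
    if isinstance(node, str):
        return escape(node, quote=True)
    tag, children = node
    inner = "".join(emit(child) for child in children)
    if tag is None:                 # the synthetic root node
        return inner
    if tag not in ALLOWED_TAGS:     # parser bug? neuter the tag instead
        return escape(f"<{tag}>") + inner + escape(f"</{tag}>")
    return f"<{tag}>{inner}</{tag}>"


def sanitize(text):
    """Return safe HTML, or None if the input should be refused."""
    try:
        return emit(parse(text))
    except RejectedInput:
        return None
```

When `sanitize` returns `None`, the caller can show the "I'm sorry, I couldn't understand that" message. Calling `emit` directly on a tree that contains a non-whitelisted tag shows the second line of defense: the tag comes out escaped rather than live.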

Glyph Lefkowitz on October 17, 2008 02:47 PM

Jeff,

Even though SO is awesome, I'm glad to see you're again writing more regularly.

I don't know if input sanitation is core to StackOverflow. It seems to me the SO experience could still be great even if you hadn't written some sanitation code.

Having said that, I'm glad you did - just 3 weeks ago we used some of the code you posted on refactormycode.com.

Keep up the good work! And thanks for contributing to the .NET community.

-Esteban

Esteban on October 17, 2008 03:17 PM

Hey Jeff, interesting article you have there. I spent a great deal of time writing an open source project http://www.codeplex.com/gsb where I developed things that I knew already existed. It took me 1 1/2 years of spare time to write it and still not done.

I refactored it continuously to use a number of 3rd party components that were "really" better than mine, but that did not stop me from learning, which is why I did it in the first place - to learn. Now, through my learnings, I was able to pick and choose the components that more closely matched my requirements and were less likely to give me grief in the future.

Now if I spent the time writing those components to match the functionality of what my app now has, well it would have taken me 5 years instead of 1 1/2.

Perspective is one thing, but programming is damn hard :-)!

Mitch Barnett on October 17, 2008 03:22 PM

Good read! I also liked the article about how I "can be loud too"

Yes, it's very hard to detect evil from good, but I don't blame you for writing an html sanitizer yourself. I'm definitely an idealist who dreams of taking third party libraries and designing/putting the pieces together and (effortlessly) building an app. And I try to do that as much as I can. But there are some things that you just have to do yourself (yes... like "If it's a core business function, write that code yourself" is a reasonable standard)...
And in the back of my mind I wonder, "how trusting should I really be toward all the third party stuff I'm using?". I mean, considering all the unknowns about it, efficiency, security, etc... even with open source, do you (or anyone) go in and review/inspect all the source code?

John on October 17, 2008 03:58 PM

Wow, dare i chime in with yet another comment? I guess i do.

Your core business is community Jeff, that and the people in it. A drupal install or whatever standard CMS with some code parsing would have been fine. I guess we do like you care so much about the plumbing, so perhaps that's why those people come back in the first place. Perhaps you did work on your core business after all but never even noticed.

Tijs Teulings on October 17, 2008 04:05 PM

Code re-use is over-rated, because most code (especially in-house) is not of very good quality, and cannot accurately anticipate unknown, future needs. Creating reusable code requires better developers, who are rare and expensive, and rewards mystery projects from the future at the expense of the resources the client has available now. Trying to bend old code to new uses increases the risk of revealing bugs, and ties up developers who spend a lot of time trying to gain an understanding of the existing code, which generally ends up full of hacks to work around the grand visions of the first developers.

This in turn ends up as a maintenance nightmare, as different teams or departments fork the 'reusable code' or, god forbid, try to keep it in sync. It would be crazy for any manager to allow some other team or department who doesn't fully understand their project to contribute code or design direction to some underlying code they both rely on. The other team cannot possibly be familiar with all the other projects that re-use the code, and the assumptions the developers on those projects have made.

The main exception is library re-use, such as HTML sanitizers, which is an excellent idea. If the library itself is full of un-reusable code - no problem. It is far more important that it is maintainable and well-tested than reusable.

Reusability deserves to go on the scrapheap of moribund trends, like making everything object-oriented. A nice idea, so long as you can stay out of the real world.

Anonymous on October 17, 2008 04:39 PM

I think you're all missing the fact that a huge point in this is that he's also trying to contribute to the .NET web 'world' as it were.

I, just recently, went in search of a CMS to fill my projects - and I would have loved nothing more than to find a .NET solution, it would be a perfect excuse for me to pick up ASP.NET a bit more and leave behind (finally!) my PHP roots. And yet, precious few solutions are to be had and almost all of them are a PAIN to install compared to Drupal, Wordpress, and the like.

.NET is in danger, yet again, of being considered only a "big kids toy" for office and intranet, and nobody has done anything yet to prove that wrong - myself included, as I just installed Drupal with a bunch of fancy modules that will make my life easier. I can't fault Jeff for not taking that easy way out, it's not like I gave the man money to do this project, it is his, he had no commitment or reason to produce it other than he wanted to, he can write whatever he pleases on whatever timeline he wants.

Vassi on October 17, 2008 04:51 PM

Wow, so many sanitizer-writer experts!

Steve on October 17, 2008 05:56 PM

I'm with Jeff on this one, even if that's the unpopular opinion. I work at a large, very visible organization. We have been burned on 3rd party libraries and tools...BADLY. One very well-supported (and necessary) tool recently closed up shop after the parent organization was purchased by another organization. There is no longer even any mention of it on their website, when it used to be a front-and-center app. The licenses just stopped working one morning, which means....wait for it...all the code that relied on that tool that we paid for immediately failed. Guess which IT department had to scramble (and is currently scrambling) to craft a home-grown solution now when we could have spent that time better at an earlier date?

The worst thing is that the performance of this tool was hiding some serious design and performance flaws in the underlying code we inherited. There are just layers and layers of WTF-ery in this...I'm all for not reinventing the wheel, but Joel's quote is dead on. The developers should NOT have relied on this third party tool to be the keystone of a mission critical app. The performance of this app is a core business issue, and anything related to it should have been hand-rolled, even if that was the harder road to travel.

Alan on October 17, 2008 06:19 PM

What a busy thread! I'm inclined to think that StackOverflow's core competency is storing/indexing developer questions/answers. The schema, SPROCS, and the Lucene.NET index sum up a lot of the value that I see in StackOverflow (which is a great site btw).

If there's truly nothing else out there that does a decent job for .NET (and I'm not completely sold on this) then I would agree that you're painted into a corner.

A lot of developers on this thread, (myself included) may be a little over sensitive to re-developing code that already exists when something like that costs so much (coding, testing, maintenance). The best line of code is the one you don't write, don't own and still delivers value to you.

Tyler on October 17, 2008 07:09 PM

I just finished reading the October issue of MSDN magazine, and would have to say that at the rate the .NET framework is growing, we'll soon be writing nothing but business logic.

Take the "Coding Tools" article on page 86; even if it's your core business to write applications that are processor intensive (I'm thinking applying filters to images, etc.), you'd be crazy not to use the new support for parallelism the new version of the framework will offer.

I'm impressed with the future of the framework; however, I realize that not having a thorough understanding of parallelism (even though I may never have to write "boiler plate" parallelism code) is probably dangerous.

Esteban on October 17, 2008 10:18 PM

Tyler: I was just about to post the same thing. HTML sanitisation is a tangential issue to the primary functionality provided by stackoverflow: a programming community that doesn't suck.

Jeff: You totally missed the point of Joel's original post.

You also could have saved yourself a week of hacking by not being so stubborn about allowing html markup. We're hackers. We can quickly pick up whatever small markup is required to make a post look nice.

Daniel on October 17, 2008 10:32 PM

> a programming community that doesn't suck.

And a big part of the reason it doesn't suck is that people can format their posts almost as much as I formatted this blog post. Either by a) taking the time to learn Markdown or b) relying on the tried-and-true HTML that almost every developer now knows by heart.

The average posts on Stack Overflow just plain look *better* than other forums and sites.

I agree allowing HTML was painful, but in a good way. I've grown to enjoy the flexibility of using either markdown or HTML interchangeably. It means when programmers first encounter Stack Overflow, their first instincts in editing a post -- "hmm, how about if I enter a hyperlink here" -- work exactly the way they expect them to.

Choices about markup code are as critical to us as they are to, say, Wikipedia. It's all about the content, and making it easy for users to do the right thing when entering content.

Jeff Atwood on October 17, 2008 11:22 PM

How do you determine that there are no existing libraries? Do you just google or do you have a list of sites (sourceforge, cpan, ...)? Every time I hear developers say this type of thing with certainty, I suspect that I'm faking it as a dev, since I never feel sure of what's actually out there, even after spending a lot of time researching...

Josh on October 18, 2008 01:10 AM

Markdown sucks.

Markdown interprets all text between two underscores as italic. This would be fine if nobody needed to use underscores. In other words, this:

"Popular Apache modules include mod_php and mod_rewrite"

shows up like this:

"Popular Apache modules include modphp and modrewrite"

You can escape underscores, but this defeats Markdown's stated purpose of appearing like natural text.

Jonathan Drain, Dungeons & Dragons Blogger on October 18, 2008 03:48 AM

I should add: my preferred alternative is Textile, or Mediawiki formatting, or no formatting at all. Do users really need HTML to comment on a blog post?

Jonathan Drain, Dungeons & Dragons Blogger on October 18, 2008 03:50 AM

> Markdown interprets all text between two underscores as italic.

Agree. We changed this so intra-word underscores are not allowed in our Markdown server-side parser.

More here:
http://blog.stackoverflow.com/2008/06/three-markdown-gotcha/
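
To make the underscore problem concrete, here is a rough sketch in Python, using only the standard `re` module. Both patterns are assumptions written for illustration: the first mimics the "everything between two underscores is italic" behavior complained about above, and the second only treats an underscore as emphasis when it is not inside a word. Neither is taken from any real Markdown implementation, and the actual Stack Overflow parser may well do this differently.

```python
import re

# Naive rule: anything between two underscores becomes emphasis.
NAIVE_EM = re.compile(r"_(.+?)_")

# Stricter rule (hypothetical): the opening underscore must not follow a
# word character, and the closing underscore must not precede one, so
# underscores inside identifiers such as mod_php are left alone.
WORD_AWARE_EM = re.compile(r"(?<!\w)_(?!\s)(.+?)(?<!\s)_(?!\w)")


def naive_emphasis(text):
    return NAIVE_EM.sub(r"<em>\1</em>", text)


def word_aware_emphasis(text):
    return WORD_AWARE_EM.sub(r"<em>\1</em>", text)


if __name__ == "__main__":
    sample = "Popular Apache modules include mod_php and mod_rewrite"
    print(naive_emphasis(sample))       # underscores swallowed, text italicized
    print(word_aware_emphasis(sample))  # left untouched
    print(word_aware_emphasis("this is _important_ text"))
```

This only illustrates the emphasis rule; it does no HTML escaping, so it is not a sanitizer by itself.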

Jeff Atwood on October 18, 2008 04:13 AM

well, it's good you can still enjoy coding.

But I will be more than happy if you explain some tips to manage your time for both programming and writing blogs (with such entries) :D


Mediocre-Ninja.blogSpot.com on October 18, 2008 04:50 AM

Trevor on October 17, 2008 12:51 PM took the words out of my mouth.

Reinvent the wheel (not the concept, but the instance) to become a better programmer.
Learn to code by writing code. You will not understand all the risks and pitfalls of a HTML sanitizer if you have never written one.

I do not say you should never reuse code. But rolling your own can definitely be the best option.

And I think that 'there is no suitable code available' definitely is a good reason to do what programmers (hopefully) do best: write code.

Jacco on October 18, 2008 07:27 AM

I've posted this like a hundred times, but I'll mention it again just for fun. HTMLEncode your string, and then replace "&lt;b&gt;" with <b>, etc. So what if people can see that someone put "<script>" in their post.
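
A minimal sketch of that encode-first idea, assuming Python's standard `html.escape` in place of .NET's HTMLEncode. The tag list and the function name `encode_then_restore` are hypothetical. Everything is escaped first, and then only the exact escaped forms of a few attribute-free tags are turned back into real tags.

```python
from html import escape

# Hypothetical whitelist of attribute-free tags to restore.
ALLOWED_TAGS = ("b", "i", "em", "strong", "code", "pre")


def encode_then_restore(text):
    # Step 1: escape everything, so no user-supplied markup survives.
    safe = escape(text, quote=True)
    # Step 2: restore only exact matches such as "&lt;b&gt;" and "&lt;/b&gt;".
    # A tag carrying attributes never matches, so it stays escaped.
    for tag in ALLOWED_TAGS:
        safe = safe.replace(f"&lt;{tag}&gt;", f"<{tag}>")
        safe = safe.replace(f"&lt;/{tag}&gt;", f"</{tag}>")
    return safe


if __name__ == "__main__":
    print(encode_then_restore("<b>hi</b> <script>alert(1)</script>"))
```

One limitation of this sketch: it does not check that restored tags are balanced, so an unclosed `<b>` would still bleed into whatever follows it on the page.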

Tim on October 18, 2008 09:16 AM

Might be a bit OT but to me this is a huge WTF:

>> "due to the liberal HTML parsing policies of many modern Web browsers"


I mean seriously, how did the browser-developers think when they explicitly added support for lazy html in the first place.

"Hmm, I pretty often write foeach by mistake instead of foreach...let's make foeach do the same thing as foreach and I'll save 2sec per day! I mean, all developers must have this problem so I'm doing the world a huge favor!"


Today i understand that every browser has to add support for all crap that all other browsers already added, but why add support for lazy html in the first place?!

Crazy Ivan on October 18, 2008 09:33 AM

"Being a "professional" developer, if there really is such a thing? "
No such thing as yet. People are being paid to write code etc, but that's not an acceptable definition of being a 'professional'.

Your article demonstrates the conflict between acting and thinking as a code monkey and as a business person. A 'professional' software engineer knows 'IT' but is also business aware with respect to 'IT'; in the scenario you have outlined there are at least the issues of cost-effectiveness, risk and resource constraints to consider.

Simon Parmenter on October 18, 2008 10:23 AM

Seriously Jeff,

Do you really think posting some code on a website qualifies some code as having been contributed back to the community?

- Donal

Donal on October 18, 2008 10:25 AM

Coding is hard? Nah! Any twit can write code.

"what I cannot create, I do not understand". You can infer this to mean that the code you write you understand. Hands up all developers that did infer this.

Just the twits.

Simon Parmenter on October 18, 2008 10:33 AM

I understand Google is rolling their own browser because they wanted to make sure that their applications will run smoothly. For some of their business core.

Kenneth on October 18, 2008 11:51 AM

Same goes for outsourcing, keep the core competencies that you rely on in-house.

stjohnroe on October 18, 2008 02:48 PM

Donal, your tone seems to imply that one should care about whether they "gave back" to the community. This is an erroneous assumption.

Matt Green on October 18, 2008 09:14 PM

Looks like you've managed to get every programming expert in the country to come post a comment. They all know the best way and each of them is smarter than the rest. This is good stuff. :)

T.J. on October 19, 2008 01:47 AM

I have to say that I definitely agree with Jeff on this one. While I think claiming that HTML sanitizing is the core competency may be a stretch, I think it's core enough for the purposes of this topic. Even if there are third party libraries available for such a thing, unless there is one that is truly complete, time tested and professional, it makes sense to write your own. You can look at the others for ideas and to learn the things that they've already learned, but it's something that's better off being written in a way that is more easily understood and maintained.

Comparing that to writing your own web server is pretty ridiculous. There is no such thing as a third party HTML sanitizer that is on the order of reliability as Apache or IIS, in any language. Using a library that is written in Python and forcing it into a C# .NET package via IronPython would be madness, unless you happened to be a seasoned Python expert or have one in-house that is able to make changes and corrections in a timely manner.

Gerald on October 19, 2008 04:08 AM

Didn't Dare quit the internets for ever? IMO, he hasn't had much good to say - apart from causing drama with Arrington.

http://www.25hoursaday.com/weblog/2008/03/05/IndefiniteHiatus.aspx

James on October 19, 2008 05:48 AM

There are a lot of square wheels out there.

3rd-party libraries are definitely a problem - the OpenSSL debacle should make everyone think twice. Security software is especially troublesome. Experts (like Schneier) will tell you if you create your own encryption algorithm, you're almost certainly a fool. If you're not a fool, you rely on a commercial product or something like OpenSSL. On the other hand, unless you're an expert on the subject, you'll have to assume the product or library you selected is secure - that assumption will only be based on the assertions of the vendor. OpenSSL was not secure for two years, which ought to tell you something about the level of testing done. Are the commercial products any better? How can you tell?

Lepto Spirosis on October 19, 2008 04:55 PM

Stop it now...my head hurts

Tinuviel on October 19, 2008 05:26 PM

The "let's go shopping" thing is cute, but doesn't actually mean anything. I don't use libraries and frameworks to avoid writing code, I do it to improve the quality and features my software includes in the given time.

I ended up solving my HTML sanitization problem by breaking it into two steps: HTML parsing and HTML sanitization. I used the open source HTML Agility Pack to parse HTML into XML, which handled mismatched and malformed tags. Then I put my effort into the logic of the HTML sanitization. I felt like that gave me the best of both worlds - I used HTML parsing code which has been proven effective by tens of thousands of users in a huge variety of edge cases over five years, and was able to put my efforts into sanitization logic which was unique to my site.

Oh, and posting your code on refactormycode isn't really releasing it to the community. What license is the code under? Do you accept contributions? Where do I submit bugs? You should at least post it to CodeProject, which has support for code licenses, comments, etc.

Jon Galloway on October 19, 2008 09:14 PM

I also don't agree with using regular expression for writing an HTML sanitizer.

Grom on October 19, 2008 11:13 PM

I'm sorry to inform you, but if the spec requires HTML as input, the spec is wrong. If your core business is to somehow display user generated content, you simply don't allow HTML, period. And if you want to allow *some* HTML, just go with the BB code way.

Bucket on October 20, 2008 03:01 AM

Ooops, just clicked on the hear it spoken link...

Wow, so many gurus on here, who really know their stuff! Well done all, you're a bunch of heroes that all have wonderfully usable sites that I visit every day. Amazing how you all picked exactly the correct technology (which you do every time - don't you) giving you time to come on here and share your much valued wisdom. Yes, jolly well done you!

Anyway Jeff, I think SO looks great and I'm amazed you managed to make such a funky looking site in .NET. Also, you may may be louder than some through blogging, but it's also because you talk a lot of sense, making you IMO talented developer too. It's not just about writing beautiful code.....

bloop on October 20, 2008 03:11 AM

"I'm sorry to inform you, but if the spec requires HTML as input, the spec is wrong."

Since more people know HTML than any other form of marking up and generating rich user content, I fail to see how you can make that assertion. Forcing users to adapt to something unfamiliar is the bad spec, not letting them use something they already know.

Gerald on October 20, 2008 03:47 AM

>>"If you are a security vendor, you might want to build SSL or hashes."

WTF?

>>"deeply understanding HTML sanitization is a critical part of my business"

I thought you were running a 'people' site. Even if you did not support HTML, SO will work fine.

Jeff, Your core business is not sanitizing html.
There are communities whose core business is sanitizing html. You should have borrowed code from them.

Niyaz PK on October 20, 2008 04:16 AM

"And the solution is to change your markdown interpreter so that intermixing HTML is not allowed. Problem solved. No need to spend a week (or more) creating some HTML sanitizer that frankly isn't needed at all. Markdown includes more than enough formatting options without having to drop into HTML."
You can't just kill functionality until a product is safe. Denying any user the right to enter any word longer than 6 characters would solve some problems. So would disconnecting SO from the Internet.
It's a balancing act.

Tom on October 20, 2008 05:21 AM

"I thought you were running a 'people' site. Even if you did not support HTML, SO will work fine."
Surely allowing the "people", many of whom are web programmers, to write in a markup they already know is better than forcing them to learn YASWM (Yet Another S***ing Web Markup).

Tom on October 20, 2008 05:35 AM

I agree with you. If you don't write your own stuff, you either have to rely on an outside programmer to fix it, or set aside a week and a half to figure out their code and change it yourself. That can take a lot longer than writing it yourself and spending an hour to debug.

arkangyl on October 20, 2008 06:03 AM

Whatever the arguments over code-reuse versus NIH, posting the code to http://refactormycode.com/codes/333-sanitize-html certainly doesn't count as "[contributing] the core code back to the community".

bobby on October 20, 2008 06:09 AM

Of course, all third-party and open source libraries are completely perfect and you shouldn't dare question it.

Right...

SO looks great - glad you're taking the time to build something for the community that will save us all time and effort in the future.


HB on October 20, 2008 06:32 AM

> And a big part of the reason it doesn't suck is that people can format their posts
> almost as much as I formatted this blog post.

Sounds great. How long until we get some of that non-suckage in the commenting software here?

T.E.D. on October 20, 2008 07:02 AM

I think writing your own HTML sanitizer to better understand XSS attacks is a good idea. Web application security is terrible because the hackers are collaborating more than developers. As a developer, your knowledge of security exploits is often nothing more than recommended practices and the report from a security scanner. You usually don't have any idea of how the exploit really works. For example, form bots have always plagued my web sites and I really need to create my own to understand how to defeat them and to put my web forms to the test.

Robert S. Robbins on October 20, 2008 07:33 AM

What?!? This entire post is based on a false dilemma that's simply nonsense. You don't HAVE to parse HTML just because you're using Markdown. You can use Markdown, and strip out all the HTML, and the world won't explode. Your users will be limited to Markdown's syntax for formatting, which arguably may be a good thing. The Python version of Markdown even implements strip-all-HTML as a parser option: "safe_mode," and I use it every time I use Markdown in a public-facing area.

Silly post.

Carl Meyer on October 20, 2008 09:19 AM

You could use third party code to sanitize HTML even if you use markdown. Just unencode the markdown elements after it has been sanitized.

ogem on October 20, 2008 09:37 AM

"You don't HAVE to parse HTML just because you're using Markdown. You can use Markdown, and strip out all the HTML, and the world won't explode."

The optimal user experience calls for HTML, so the only way he can do that is by compromising on the user experience. The purpose of software is to serve the user, not deny the user options because of security concerns on his end that can be solved. The world exploding is irrelevant.

Gerald on October 20, 2008 10:47 AM

I have heard Jeff say several times on his Stack Overflow podcasts that he considers programmers to be especially intelligent. My initial reaction to push-back comments here was that he was mistaken about his assessment. However, it isn't that, it is that there is a difference between programmers. There are innovators and there are those who embrace and advance innovations.

Innovators identify a problem and an initial solution but then they go deeper and learn everything they can learn about the system where the problem exists. A solution to the problem emerges as an end result of the trial and error cycles of the learning process. Why wouldn't he use his own solution at that point?

I understand the arguments on both sides but I think both sides should also understand that to advance we require both the leaps of innovators and the steady advancements of improvers.

Erin on October 20, 2008 10:50 AM

Nice post...

Rem on October 20, 2008 11:02 AM

I'm sorry about my previous, quite blunt comment. Let me clarify a bit. Let's compare it with... drunk driving.

You may think yourself that you're very good at drunk driving. You might have a long past of drunk driving and never have an accident. However, drunk driving is inherently unsafe, and the fact that you think you've mastered the art and never had any accidents until now, doesn't mean it *is* safe. It's still risky business, and in the future something is bound to happen. Besides no one but crazy people would ever advise you to drink and drive... heck I would bet even yourself wouldn't recommend it to anyone. And that's not even mentioning the actual possible victims involved, when you do have an accident (aww, the poor kids, how are they going to grow up without a mother).

In my eyes, if you want to allow HTML, you should sanitize *all* input first, and from there selectively allow what's needed. In a simple example, first replace all special chars with HTML entities (< becomes &lt;, > becomes &gt;, etc) and after that apply rules (change &lt;i&gt; into <i>, etc). This way you are absolutely sure that nothing comes through.

Also, think about this. If someone puts a <marquee> on the page, and you see (or more probably, someone else sees it and reports it to you, much later) the whole page scrolling around, it's like *oops, I forgot to strip marquee, hehe*. But that's an obvious example... what if someone finds a clever way to do some XSS nastiness (say, as a simple example, steal some cookies) and manages to hide it very well. You won't notice it, and your users won't notice it. At least... not until *after* the damage has been done.

So please... think of the kids...

bucket on October 20, 2008 11:04 PM
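
A rough sketch of the escape-first approach bucket describes above, assuming Python's standard `html.escape` and a small, hypothetical allow-list of attribute-free tags. The names `ALLOWED_TAGS` and `escape_then_allow` are made up for illustration.

```python
import re
from html import escape

# Hypothetical allow-list of tags that are re-enabled only when written without attributes.
ALLOWED_TAGS = ("b", "i", "em", "strong", "code")

# Matches an escaped opening or closing tag with no attributes,
# e.g. "&lt;i&gt;" or "&lt;/i&gt;".
_ESCAPED_TAG = re.compile(
    r"&lt;(/?)(%s)&gt;" % "|".join(ALLOWED_TAGS), re.IGNORECASE
)


def escape_then_allow(text):
    # Step 1: neutralize everything, including quotes and ampersands.
    escaped = escape(text, quote=True)
    # Step 2: turn back only the exact escaped forms of allowed tags.
    return _ESCAPED_TAG.sub(
        lambda m: "<%s%s>" % (m.group(1), m.group(2).lower()), escaped
    )
```

Anything that does not exactly match an allowed, attribute-free tag stays escaped, so `<i onclick="...">` is shown as literal text instead of being executed. This sketch does not balance tags, so an unclosed `<i>` would still leak formatting into the rest of the page.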

What a coincidence, just had an exam in systems engineering where one question was: "You are faced with the decision to buy a complete solution or create your own, what will determine your decision."

We also had to point out advantages/drawbacks for both options but the final, "correct" answer, according to our professor, was something like this:

Lots of money, little time => Buy complete system
Lots of time, little money => Create your own

Blub on October 21, 2008 02:57 AM

"Unfortunately, Markdown allows users to intermix HTML into the markup."

Even if Markdown would not allow HTML, you still need to take care what ends up in your output. ![](javascript: ...) variants can wreak havoc on a site.

![](javascript:{
var e = document.getElementById('element-to-replace');
var content = e.innerHTML;
e.innerHTML = content + '&#60;&#111;&#98;&#106;&..
})

Ergomane on October 21, 2008 04:59 AM
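
One way to guard against the `javascript:` targets Ergomane mentions above is to check every link or image URL against a scheme allow-list before it reaches the output. This is a sketch only, assuming Python's standard `urllib.parse`; the `ALLOWED_SCHEMES` set and the `is_safe_url` name are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; anything not named here is rejected.
ALLOWED_SCHEMES = {"http", "https", "mailto"}


def is_safe_url(url):
    """Return True only for URLs whose scheme is explicitly allowed."""
    try:
        scheme = urlsplit(url.strip()).scheme.lower()
    except ValueError:
        # Malformed URLs (for example a broken IPv6 host) are rejected outright.
        return False
    # Deliberately strict: URLs with no scheme at all are rejected too.
    return scheme in ALLOWED_SCHEMES
```

Rejecting by default means that `javascript:`, `data:`, `vbscript:` and any scheme nobody thought of are all refused without being listed. The cost is that relative URLs are refused as well, which a real site might want to relax.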

> You can use Markdown, and strip out all the HTML, and the world won't explode.

Yes, but a certain surprisingly large percentage of the audience knows HTML and can't be bothered to learn Markdown. This means a large percentage of our question/answer posts would look bad by default.

Having content that LOOKS GOOD *is* a competitive advantage. Take this challenge: compare any random Stack Overflow question with any other Q&A or discussion forum page. I bet you that 9 times out of 10 we look noticeably better.

That's because users know HTML or Markdown, and the content gets entered decently, and edited until it is whipped into shape with our easy, low-friction collaborative editing tools.

> How long until we get some of that non-suckage in the commenting software here?

I hear you. Ironically, I used an off-the-shelf solution written in a language I barely know (PERL) so it's a bit of a challenge. Though I really should consider updating from my mid-2004 version of Movable Type one of these days..

Jeff Atwood on October 21, 2008 05:07 AM

I disagree with commenters who presume that "the optimal user experience involves HTML." Markdown is simple, leverages very common plain-text-formatting conventions (ie _emphasis_), and requires far fewer keystrokes for common formatting needs.

From the server perspective, of course, it's no contest: Markdown has the advantage that a simple user typo (forgetting to close a tag, or whatever other issues you forget to explicitly account for) doesn't have the potential to break the entire page layout. And lastly, of course, there are the security problems with HTML, which you will _never fully account for_, no matter how much time you put into your custom HTML sanitizer.

These aren't just "programmer convenience" issues, they're user experience issues: there's a huge opportunity cost in terms of time you _could_ spend working on other, much cooler features, that you instead spend allowing your users to type <strong> instead of *.

You chose to idealistically force OpenID on all of your users, many of whom had to go out and learn it for the first time, because you believe it is simply a better solution for login, even though "everyone already knows" how to create a username and password. Absent the "everyone already knows it" argument, Markdown is hands-down a better solution than HTML for accepting untrusted formatted input. And learning Markdown (with the WMD buttons right there to show you) is a far lower barrier than figuring out and registering an OpenID. So why the difference in approach?

"Take this challenge: compare any random Stack Overflow question with any other Q&A or discussion forum page. I bet you that 9 times out of 10 we look noticeably better."

That's almost entirely because you have a much cleaner page layout in the first place than most Q&A or discussion forums. In my informal survey, at least 9 out of 10 SO questions/answers contain no special formatting at all, Markdown or HTML. You haven't presented any convincing evidence that allowing HTML has made a significant difference in the overall visual quality of SO questions/answers; certainly not enough to justify the opportunity cost of all the time you put into allowing it.

Carl

Carl Meyer on October 21, 2008 07:54 AM

What happened to "the best code is no code at all"?


> written in a language I barely know (PERL)

"Perl" isn't an acronym. Please don't write it in all-caps; it's a little cringe-inducing.

Eevee on October 21, 2008 12:51 PM

Re: comment system:

Great example of a need that is being filled by a third party system that is so awesome (disqus) that it isn't really worth building your own commenting system for a blog anymore.

Silas Snider on October 21, 2008 06:45 PM

Core business? What kind of crack do they have in California, Jeff? Sanitising HTML isn't your core business, keeping your web site up with the advertising revenue is.

Side note: You are not the exception to every rule. Drop the arrogant attitude and learn some humility in the face of the people you believe to be your peers.

Rob on October 21, 2008 06:59 PM

It would be interesting to see the stats on number of questions/answers that use the various formats:
* No formatting at all besides paragraphs.
* Markdown formatting
* HTML formatting
* Mixture of Markdown/HTML

Grom on October 21, 2008 10:50 PM

When you want to reinvent the wheel and write yet another HTML sanitizer, it'd be great to have a good chunk of malicious HTML to test it against.

Does anybody know of any place to find some? I haven't had any luck finding information about it.

Gaizka on October 22, 2008 04:00 AM

people are sort of missing the point here, everyone's talking about using existing code without pointing out that at one time this existing code was also written by someone because they felt there was a need for it. But often you'll read seemingly smart people forget the egg when talking about chickens.

And plus, html sanitizing can be as easy or as hard as you want it to be, you don't have to check for every instance of malicious code, all you need to do is limit the allowed stuff to stuff that _cannot_ be malicious. Yes, this needs a bit of knowledge, but that's part of the job of programming.

patrick on October 22, 2008 10:18 AM

PERL Practical Extraction and Report Language

PERL Pathologically Eclectic Rubbish Lister :-)

Wierd how people post like they know something but really don't ;-)

patrick on October 22, 2008 10:36 AM

Disturbing to see the mediocrity and repetition in these comments.

Pardeep on October 22, 2008 02:36 PM

@Carl, OpenID is a small one time learning investment, and one that probably didn't take most people much more time than it would have taken to fill out your average registration form. After that it's a one-click login. There's nothing to remember. Huge difference.

Many (maybe even most) people simply wouldn't bother learning Markdown, and their questions and answers would look bad by default. I know I wouldn't bother learning it. To this day I have never once used BBCode or Markdown even though I use several sites that support them. Why would I want to learn a new markup language in order to use Stack Overflow? More to the point, why should I HAVE to learn it, when I already know HTML, and many more websites support HTML than Markdown?

The amount of time Jeff spent implementing good HTML support is very small compared to the inconvenience it would have caused to the users, and the effects it would have had on the quality of the content.

Gerald on October 22, 2008 02:51 PM

I'm glad you defended yourself on this point.

I agreed with Dare that in general circumstances you have to go with a well tested, open-source solution when one's available. But I figured that the particular circumstances of your case would need to be known.

Spencer, above, says:

>Beautiful Soup is written in python, but you could quite
>easily wrap it/run it in IronPython to match your .NET
>requirement.

I think this is misguided. Calling out to IronPython from your C# code is nasty. (The reverse is far simpler -- calling C# from IronPython). You have to embed a python engine and just yuck.

Again, it's a case of reserving judgment until the actual facts are known.

lb

secretGeek on October 22, 2008 05:02 PM

patrick:

> Wierd how people post like they know something but really don't ;-)

Yes. Weird, isn't it.

"Perl" is just a deliberate misspelling of "Pearl" because the latter was already the name of an existing language. Any and all clever expansions are backronyms. The name was not originally intended to stand for anything at all. (Regardless, I don't see people write SCUBA or LASER or Visual BASIC, either, so I'm not sure why there's such an insistence amongst people unfamiliar with the language to write it in all caps whether it's an acronym or not.)

Even the documentation explicitly mentions this: http://perldoc.perl.org/perlfaq1.html#What%27s-the-difference-between-%22perl%22-and-%22Perl%22%3f

"But never write "PERL", because perl is not an acronym, apocryphal folklore and post-facto expansions notwithstanding."

Eevee on October 22, 2008 07:58 PM

If you used PHP you wouldn't have these problems.

(Haha, just joking!) :)

Although the limitations of .Net can be frustrating sometimes.

Practicality on October 24, 2008 08:27 AM

This is just generic correctness masquerading as insight.

gjvc on October 26, 2008 07:52 AM

I once compared HTML sanitization to x86 virtualization. Simply put, HTML sanitization is virtualizing HTML.

Yuhong Bao on October 26, 2008 09:03 PM

"Unfortunately, Markdown allows users to intermix HTML into the markup. It's part of the spec and everything. I sort of wish it wasn't, actually -- one of the great attractions of pseudo-markup languages like BBCode is that they have nothing in common with HTML and thus sanitizing the input becomes trivial."
I once compared HTML sanitization to virtualization. Using a pseudo-markup language is like binary translation, while sanitizing HTML is like virtualization.

Yuhong Bao on October 26, 2008 09:11 PM

Here is a non-regex approach written in PHP: http://refactormycode.com/codes/557-html-filter

Grom on October 27, 2008 03:52 PM

Google does input sanitizing in-house.
http://www.internetnews.com/security/article.php/3689566

Dave on November 2, 2008 01:22 AM

For the record - I've not found much other than Jeff's code as a starting point on the asp/vb side for html sanitization to prevent xss. Jeff - thanks for starting it, and I'm glad you learned a lot from it. Certainly a tough project, which is why I don't want to write it from scratch.

jens on November 19, 2008 12:44 PM







Content (c) 2008 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.