A few months ago, Dare Obasanjo noticed a brief exchange my friend Jon Galloway and I had on Twitter. Unfortunately, Twitter makes it unusually difficult to follow conversations, but Dare outlines the gist of it in Developers, Using Libraries is not a Sign of Weakness:
The problem Jeff was trying to solve is how to allow a subset of HTML tags while stripping out all other HTML so as to prevent cross site scripting (XSS) attacks. The problem with Jeff's approach which was pointed out in the comments by many people including Simon Willison is that using regexes to filter HTML input in this way assumes that you will get fairly well-formed HTML. The problem with that approach which many developers have found out the hard way is that you also have to worry about malformed HTML due to the liberal HTML parsing policies of many modern Web browsers. Thus to use this approach you have to pretty much reverse engineer every HTML parsing quirk of common browsers if you don't want to end up storing HTML which looks safe but actually contains an exploit. Thus to utilize this approach Jeff really should have been looking at using a full fledged HTML parser such as SgmlReader or Beautiful Soup instead of regular expressions.
The sad thing is that Jeff Atwood isn't the first nor will he be the last programmer to think to himself "It's just HTML sanitization, how hard can it be?". There are many lists of Top HTML Validation Bloopers that show how tricky it is to get the right solution to this seemingly trivial problem. Additionally, it is sad to note that despite his recent experience, Jeff Atwood still argues that he'd rather make his own mistakes than blindly inherit the mistakes of others as justification for continuing to reinvent the wheel in the future. That is unfortunate given that is a bad attitude for a professional software developer to have.
My response?
Bad attitude? I think that's a matter of perspective.
(The phrase "programming is hard, let's go shopping!" is a snowclone. As usual, Language Log has us covered. Ironically, we later had a brief run-in with Consultant Barbie "herself" on Stack Overflow -- who you may know from reddit. There's no trace of her left on SO, but as griefing goes, it was fairly benign and even arguably on-topic.)
In the development of Stack Overflow, I determined early on that we'd be using Markdown for entering questions and answers in the system. Unfortunately, Markdown allows users to intermix HTML into the markup. It's part of the spec and everything. I sort of wish it wasn't, actually -- one of the great attractions of pseudo-markup languages like BBCode is that they have nothing in common with HTML and thus sanitizing the input becomes trivial. Users have two choices:
With BBCode, if the user enters HTML you blow it away with extreme prejudice -- it's encoded, without exceptions. Easy. No thinking and barely any code required.
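To make the "barely any code" claim concrete, here is a minimal sketch in Python (not the C# that Stack Overflow actually runs). The function name `render_untrusted` is made up for illustration; `html.escape` is the standard-library call doing the real work.

```python
import html

def render_untrusted(text: str) -> str:
    """Encode every HTML-significant character so the browser
    treats the whole input as text, never as markup."""
    # quote=True also encodes " and ', which matters if the result
    # ever ends up inside an attribute value.
    return html.escape(text, quote=True)

if __name__ == "__main__":
    print(render_untrusted('<script>alert("xss")</script>'))
```

There is no whitelist and no parsing here, which is exactly why this route is easy: nothing the user types can ever become a tag.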
Since we use Markdown, we don't have that luxury. Like it or not, we are now in the nasty, brutish business of distinguishing "good" HTML markup from "evil" HTML markup. That's hard. Really hard. Dare and Jon are right to question the competency and maybe even the sanity of any developer who willingly decided to bite off that particular problem.
But here's the thing: deeply understanding HTML sanitization is a critical part of my business. Users entering markdown isn't just some little tickbox in a feature matrix for me, it is quite literally the entire foundation that our website is built on.
Here's a pop quiz from way back in 2001. See how you do.
I'm sure most developers are practically climbing over each other in their eagerness to answer at this point. Of course code reuse is good. Of course reinventing the wheel is bad. Of course the not-invented-here syndrome is bad.
Except when it isn't.
If it's a core business function -- do it yourself, no matter what.
Pick your core business competencies and goals, and do those in house. If you're a software company, writing excellent code is how you're going to succeed. Go ahead and outsource the company cafeteria and the CD-ROM duplication. If you're a pharmaceutical company, write software for drug research, but don't write your own accounting package. If you're a web accounting service, write your own accounting package, but don't try to create your own magazine ads. If you have customers, never outsource customer service.
Being a "professional" developer, if there really is such a thing -- I still have my doubts -- doesn't mean choosing third-party libraries for every possible programming task you encounter. Nor does it mean blindly writing everything yourself out of a misguided sense of duty or the perception that's what gonzo, hardcore programming types do. Rather, experienced developers learn what their core business functions are and write whatever software they deem necessary to perform those functions extraordinarily well.
Do I regret spending a solid week building a set of HTML sanitization functions for Stack Overflow? Not even a little. There are plenty of sanitization solutions outside the .NET ecosystem, but precious few for C# or VB.NET. I've contributed the core code back to the community, so future .NET adventurers can use our code as a guidepost (or warning sign, depending on your perspective) on their own journey. They can learn from the simple, proven routine we wrote and continue to use on Stack Overflow every day.
Honestly, I'm not that great of a developer. I'm not so much talented as competent and loud. Start writing and talking and you can be loud, too. But I'll tell you this: in choosing to fight that HTML sanitizer battle, I've earned the scars of experience. I don't have to take anybody's word for it -- I don't have to trust "libraries". I can look at the code, examine the input and output, and predict exactly what kinds of problems might arise. I have a deep and profound understanding of the risks, pitfalls, and tradeoffs of HTML sanitization ... and cross-site scripting vulnerabilities.
As Richard Feynman so famously wrote on his last blackboard, what I cannot create, I do not understand.
This is exactly the kind of programming experience I need to keep watch over Stack Overflow, and I wouldn't trade it for anything.
You may not be building a website that depends on users entering markup, so you might make a different decision than I did. But surely there's something, some core business competency, so important that you feel compelled to build it yourself, even if it means making your own mistakes.
Programming is hard. But that doesn't mean you should always go shopping for third party libraries instead of writing code. If it's a core business function, write that code yourself, no matter what. If other programmers don't understand why it's so critically important that you sit down and write that bit of code -- well, that's their problem.
They're probably too busy shopping to understand.
The easy solution is to allow only well formed image tags, and zap everything else, no?
Rob on October 17, 2008 2:14 AM
I don't see a license attached to your sanitizer. Doesn't that make it unusable for anyone else?
(Also, does C# have first-class functions, or is that just for clarity? If so - neat!)
Bernard on October 17, 2008 2:58 AM
I'm a fan of reuse, but, seriously, give it a rest people. You act like all of the open source code out there is of equal quality and documented. It isn't. Forgive me I'm hesitant to put NightOwl201978's homegrown HTML sanitizer in there that he's used on his blog, which receives 3 visitors a day.
I see this as a problem with the open source community. You go to write an application, and 99% of the time, people say, oh, don't write that from scratch! Go work on *decidedly-mediocre-project-that-prompted-you-to-develop-this-in-the-first-place* instead! Oftentimes these projects have SEVERE issues (symptoms like memory leaks or unmanageable complexity) that are NOT simple fixes to make, they're often architectural, or, worse, cultural. (Such as inappropriate use of low level languages, failure to abstract properly, etc.) The very thing that project needs the most is someone to come along and outdo it, who isn't afraid to say that the code quality is unacceptable.
Perhaps that is the overall problem with open source: when all code is free, we wrongly assume it is good code.
Matt Green on October 17, 2008 3:04 AM
Jeff still hasn't provided a convincing case as to why he needs to allow HTML at all. He more or less admits that this isn't necessary when discussing BBCode. Can he even show us a question/answer where user-entered HTML was necessary/desirable?
Absconditus on October 17, 2008 3:04 AM
Aaron, the big difference there is that the only thing user-generated in that entire list is...
Bingo, the HTML.
Stack Overflow has a targeted audience of programmers. This isn't some random forum on the internet about knitting, it's a group of professionals, a percentage of which probably have the ability to break something one has written.
How many HTML sanitizers are written with this kind of audience in mind?
That's not a rhetorical question, I'd sincerely like to know.
Bottom line is, it's one of the most important features of Stack Overflow and requires a lot of attention to detail.
Charles Callebs on October 17, 2008 3:06 AM
@Absconditus - Now, I'd agree with that :P
Charles Callebs on October 17, 2008 3:06 AM
Jeff,
I can see your more general point about re-use. Having written my own HTML sanitizer, I can understand why you wouldn't want to use some code that you really don't understand very well in a core function of your product.
Also, for what it's worth, I love stack overflow's input idiom. It combines the ease and familiarity of entering plain text with the immediate feedback of a GUI editor.
But you really ought to understand how difficult the problem domain is, and have humility about your solution. You have to assume your code will be wrong. Concentrate on making sure that, when it fails, it fails well.
Your use of a whitelist rather than a blacklist is a step forward, but the use of regular expressions is two steps back. You *can't* parse HTML with regular expressions. There are a few thousand screeds on this topic, but here's a good one: http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html
What you need to do to sanitize HTML - or, for that matter, *any* untrusted network input, is:
1. Parse it: load it into a data structure. An actual structure, with actual rules. During this process, your parser may fail. That's great. *Give up*. If you see input you can't handle, assume it's an attack. On a site like StackOverflow, where users get real-time feedback, this doesn't even create a usability problem! You immediately say "I'm sorry, I couldn't understand that."
2. Emit it. During this phase, if somebody tricked your parser, it shouldn't matter: your emitter should be smart. Let's say your parser has a bug where it mistakenly thinks the <bar in <foo><bar baz='boz'></foo> is actually the start of text. If someone tries to attack that, your emitter can neuter the errant < by quoting it as <foo>&lt;bar baz='boz'&gt;</foo>. It will look ugly, but it will fail in a way which is at least *safe*. Importantly, the emitter should be as distant and disconnected from the parser as possible, so that they do not share bugs.
If you are concerned about memory overhead (but you shouldn't be, because you've loaded the whole thing as a string anyway) you can always use an event-driven parser/emitter pair (think SAX) rather than loading everything into a single structure first.
If you follow this structure, you can still even use regexps for the parsing phase, because your parser can screw up horribly and your emitter will still produce valid, if unpleasant, output.
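The parse-then-emit split described in this comment can be sketched roughly as follows, in Python using only the standard library. Everything named here (`Tokenizer`, `emit`, `sanitize`, the `ALLOWED_TAGS` set) is an assumption made for illustration, not code from Stack Overflow or from the commenter. One deliberate difference from the description above: `html.parser` is lenient and does not "give up" on input it dislikes, so all of the safety in this sketch comes from the emitter.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; a real site would choose its own.
ALLOWED_TAGS = {"b", "i", "em", "strong", "code", "pre", "blockquote", "p"}

class Tokenizer(HTMLParser):
    """Parse phase: reduce the input to a flat list of (kind, value) events."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped on purpose; none are whitelisted here.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))

def emit(events):
    """Emit phase: shares no code with the parser. It escapes all text
    and writes only whitelisted, properly nested tags."""
    out, open_tags = [], []
    for kind, value in events:
        if kind == "text":
            out.append(escape(value))
        elif kind == "start" and value in ALLOWED_TAGS:
            out.append(f"<{value}>")
            open_tags.append(value)
        elif kind == "end" and open_tags and open_tags[-1] == value:
            out.append(f"</{value}>")
            open_tags.pop()
        # Anything else (unknown tags, stray end tags) is silently dropped.
    while open_tags:  # close whatever the user left open
        out.append(f"</{open_tags.pop()}>")
    return "".join(out)

def sanitize(markup: str) -> str:
    parser = Tokenizer()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return emit(parser.events)

if __name__ == "__main__":
    print(sanitize('<b>bold</b><script>alert(1)</script>'))
```

Because the emitter only ever writes tag names taken from its own whitelist and escapes every text event, a parser bug can make the output ugly, but it should not be able to make it executable.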
Glyph Lefkowitz on October 17, 2008 3:47 AM
Jeff,
Even though SO is awesome, I'm glad to see you're again writing more regularly.
I don't know if input sanitation is core to StackOverflow. It seems to me the SO experience could still be great even if you hadn't written some sanitation code.
Having said that, I'm glad you did - just 3 weeks ago we used some of the code you posted on refactormycode.com.
Keep up the good work! And thanks for contributing to the .NET community.
-Esteban
Esteban on October 17, 2008 4:17 AM
Good read! I also liked the article about how I can be loud too
Yes, it's very hard to detect evil from good, but I don't blame you for writing an html sanitizer yourself. I'm definitely an idealist who dreams of taking third party libraries and designing/putting the pieces together and (effortlessly) building an app. And I try to do that as much as I can. But there are some things that you just have to do yourself (yes... like "If it's a core business function, write that code yourself" is a reasonable standard)...
And in the back of my mind I wonder, how trusting should I really be toward all the third party stuff I'm using? I mean, considering all the unknowns about it, efficiency, security, etc... even with open source, do you (or anyone) go in and review/inspect all the source code?
Wow, dare i chime in with yet another comment? I guess i do.
Your core business is community Jeff, that and the people in it. A drupal install or whatever standard CMS with some code parsing would have been fine. I guess we do like you care so much about the plumbing, so perhaps that's why those people come back in the first place. Perhaps you did work on your core business after all but never even noticed.
Tijs Teulings on October 17, 2008 5:05 AM
Code re-use is over-rated, because most code (especially in-house) is not of very good quality, and cannot accurately anticipate unknown, future needs. Creating reusable code requires better developers, who are rare and expensive, and rewards mystery projects from the future at the expense of the resources the client has available now. Trying to bend old code to new uses increases the risk of revealing bugs, and ties up developers who spend a lot of time trying to gain an understanding of the existing code, which generally ends up full of hacks to work around the grand visions of the first developers.
This in turn ends up as a maintenance nightmare, as different teams or departments fork the 'reusable code' or, god forbid, try to keep it in sync. It would be crazy for any manager to allow some other team or department who doesn't fully understand their project to contribute code or design direction to some underlying code they both rely on. The other team cannot possibly be familiar with all the other projects that re-use the code, and the assumptions the developers on those projects have made.
The main exception is library re-use, such as HTML sanitizers, which is an excellent idea. If the library itself is full of un-reusable code - no problem. It is far more important that it is maintainable and well-tested than reusable.
Reusability deserves to go on the scrapheap of moribund trends, like making everything object-oriented. A nice idea, so long as you can stay out of the real world.
Anonymous on October 17, 2008 5:39 AM
I think you're all missing the fact that a huge point in this is that he's also trying to contribute to the .NET web 'world' as it were.
I, just recently, went in search of a CMS to fill my projects - and I would have loved nothing more than to find a .NET solution, it would be a perfect excuse for me to pick up ASP.NET a bit more and leave behind (finally!) my PHP roots. And yet, precious few solutions are to be had and almost all of them are a PAIN to install compared to Drupal, Wordpress, and the like.
.NET is in danger, yet again, of being considered only a big kids toy for office and intranet, and nobody has done anything yet to prove that wrong - myself included, as I just installed Drupal with a bunch of fancy modules that will make my life easier. I can't fault Jeff for not taking that easy way out, it's not like I gave the man money to do this project, it is his, he had no commitment or reason to produce it other than he wanted to, he can write whatever he pleases on whatever timeline he wants.
Wow, so many sanitizer-writer experts!
Steve on October 17, 2008 6:56 AM
I'm with Jeff on this one, even if that's the unpopular opinion. I work at a large, very visible organization. We have been burned on 3rd party libraries and tools...BADLY. One very well-supported (and necessary) tool recently closed up shop after the parent organization was purchased by another organization. There is no longer even any mention of it on their website, when it used to be a front-and-center app. The licenses just stopped working one morning, which means....wait for it...all the code that relied on that tool that we paid for immediately failed. Guess which IT department had to scramble (and is currently scrambling) to craft a home-grown solution now when we could have spent that time better at an earlier date?
The worst thing is that the performance of this tool was hiding some serious design and performance flaws in the underlying code we inherited. There are just layers and layers of WTF-ery in this...I'm all for not reinventing the wheel, but Joel's quote is dead on. The developers should NOT have relied on this third party tool to be the keystone of a mission critical app. The performance of this app is a core business issue, and anything related to it should have been hand-rolled, even if that was the harder road to travel.
Alan on October 17, 2008 7:19 AM
If it's a core business function, write that code yourself, no matter what.
I would have to agree with that... if you have problems/bugs with some component that is critical to your app, you want to not only be able to go in and fix/debug the issue but also to fully understand the piece if you expect to fix it right. I never feel good when I have to rely on someone outside of my business when things affecting my business break.
Kris on October 17, 2008 7:54 AM
Not invented here attitudes are a problem, so is the opposite, the attitude that everything your colleagues create is crap and that anything created by a third party must be better. These folks develop an exotic bug collection, different patterns of bugs from every source they copied and pasted from.
John on October 17, 2008 7:59 AM
a solid week building a set of HTML sanitization functions
It sounds like the time may have been better spent unit-testing an existing library, and if it was found to fail a lot of the tests, write your own against those unit tests. Best case; you spend a day writing tests and the rest of the week to work on other things. Worst case: you have some tests and expectations to write your own library against, and you'd be doing that anyway. Wouldn't you?
I can now see how SOF slipped by several months.
Most of the rest of your post amounts to nothing but snobbery!
Douglas F Shearer on October 17, 2008 8:07 AM
What a busy thread! I'm inclined to think that StackOverflow's core competency is storing/indexing developer questions/answers. The schema, SPROCS, and the Lucene.NET index sum up a lot of the value that I see in StackOverflow (which is a great site btw).
If there's truly nothing else out there that does a decent job for .NET (and I'm not completely sold on this) then I would agree that you're painted into a corner.
A lot of developers on this thread, (myself included) may be a little over sensitive to re-developing code that already exists when something like that costs so much (coding, testing, maintenance). The best line of code is the one you don't write, don't own and still delivers value to you.
Tyler on October 17, 2008 8:09 AM
This is one time I have to disagree with you. Attitudes like this are why the quality of software doesn't advance at a better pace than it does. We spend so much time re-inventing the wheel and not adding new value to an application.
You can use an open source library like Beautiful Soup and get all the ability to fix problems on your own, retain the ability to read and understand what the code is doing, and at the same time benefit from the work of a lot of other people keeping an eye out for things you may not have thought of. You might actually learn something new. Beautiful Soup is written in Python, but you could quite easily wrap it/run it in IronPython to match your .NET requirement.
Bypassing options like this to roll your own is just a combination of vanity (I can do everything better than anyone else can) and/or laziness (it is harder to read/understand someone else's code than to roll your own).
With the variety of open source solutions out there now, you can avoid reinventing the wheel while at the same time having complete understanding of the code being run and retaining the ability to modify it as you need to.
Spencer on October 17, 2008 8:11 AM
This was an interesting article, and I'd love to hear you talk about when you WOULD want to use an existing library. My reading of this post was that you were arguing that sometimes reinventing the wheel can be good in specific situations, and that writing your own HTML sanitizer for Stack Overflow was one of those situations for a variety of reasons.
So my question is, under what circumstances would you have used a library for HTML sanitization rather than rolling your own? You mentioned that there are few sanitizers for C# and VB.NET; if there had been one or more really mature and widely used ones then would you have not written your own?
I'd love to see a blog entry on the specific criteria you use to decide when to use an existing library with examples of when you did and when you didn't. Especially focusing on things for which external libraries exist and you had to think long and hard about whether to go with one of them, since that would give us insight into the criteria you use to make those decisions.
Eli Courtwright on October 17, 2008 8:14 AM
Is re-inventing the wheel really a bad thing if you make a better wheel?
Dan on October 17, 2008 8:15 AM
I find reusing existing libraries is often in conflict with the principle of YAGNI. Libraries grow over time to satisfy the needs of multiple usage scenarios. If you only need 50% of the functionality of that library for your scenario then the library brings along a heap of baggage that has security, performance and complexity implications.
Darrel Miller on October 17, 2008 8:15 AM
the time may have been better spent unit-testing an existing library
I agree with you; Jon had a set of unit tests he worked up that I wanted to build on. However, many of the XSS exploits would require actual browser code to execute against -- different browsers interpret sketchy markup differently. So a *complete* and *accurate* XSS test suite would have to fire up browser .exes and somehow detect JS execution and other conditions in the browser.
You can use an open source library like beautiful soup
In .NET? How?
This is like telling me I should use rainbows and cotton candy. Well, obviously.
There are almost no options in .NET, which is one of the *reasons* I wanted to go this route. So others would have more solutions!
Jeff Atwood on October 17, 2008 8:17 AM
I'm definitely in agreement here, controlling dependencies is a key way to control complexity. Sometimes the fast and easy solution ends up eating more resources over the long haul:
http://theprogrammersparadox.blogspot.com/2008/09/dependency-too-far.html
Paul.
Paul W. Homer on October 17, 2008 8:32 AM
Maybe I'm misunderstanding core competency here, but do users seriously go to Stack Overflow so that they can write in Markdown? I thought they went there to find and answer programming questions. I thought *that* was your core competency.
I understand there weren't any libraries doing what you wanted, so that much I'm with you on, but the core competency explanation falls a little limp.
Dan Hulton on October 17, 2008 8:35 AM
In .NET? How?
Did you not read his whole comment? He mentioned running it using IronPython.
N on October 17, 2008 8:37 AM
w.r.t the sanitising html problem, what about
a) translate into BBCode (say for <i> tags and <b> tags)
b) get rid of any markup which remains
@Dan Hulton:
Jeff's talking about his own core competency - he's the one running the site, so he has to be proficient in handling everything people would do with the site. You're confusing his focus on SO with that of the users.
Adam V on October 17, 2008 8:41 AM
Thank goodness SO uses markdown! I mean it is the foundation of SO! All hail markdown...the greatest thing on earth!
are u serious????
Joe Beam on October 17, 2008 8:46 AM
Touché
Antonio on October 17, 2008 8:47 AM
Excellent article, I couldn't agree more. Aside from the obvious benefits to the business it's great for developers to have the opportunity to solve Hard Problems. Going out and buying libraries to solve Hard Problems tells your developers that you don't trust them to get it right.
Jim on October 17, 2008 8:50 AM
I think it should go without saying that there's an addendum to that quote on "If it's a core business function -- do it yourself, no matter what." It should probably be something like "if that thing is hard and you can't trust the quality of an off the shelf implementation." The classic example is John Carmack explaining why they wrote a 3d engine - it was a big deal and the major seller. Nothing else would do.
I don't think that applies in this case. There would almost certainly have been some sanitisation code which you could have reused as others have pointed out. Some things are just building blocks. If your core competency is web sites, you don't necessarily have to start by writing a web server.
Andrew on October 17, 2008 8:54 AM
So... Stack Overflow's Core Business Function is sanitizing HTML?~
If there were no decent HTML sanitizers for .NET (seriously? I come from Python, where I can think of 3 excellent sanitizers off the top of my head - BeautifulSoup, lxml, and feedparser's built-in one - not to mention that we're a much smaller community) then I can understand building one from scratch. But even then, I would have gone with a port of another language/platform's sanitizer, because I *guarantee* there are domain problems they've dealt with that you couldn't possibly have thought up on your first iteration.
That would actually be the best of both worlds: not having to make your own mistakes, but still intricately understanding how the code works.
But kudos for getting the code out there.
Adam Gomaa on October 17, 2008 8:56 AM
You can talk about efficiency, delivery dates, and the right ways to do things in your project, but after all as a developer you understand that it is your project, and you started it because you wanted to, because you like to code, and coding stackoverflow was fun. That is the only and the major reason you decided to DIY. Digging through other people's messy code and trying to pervert it to work with your framework is NOT FUN. Coding your own solution for an interesting problem IS FUN. And there is nothing else to add.
Keep up reinventing the wheel as long as it is fun!
So... Stack Overflow's Core Business Function is sanitizing HTML?~
nice looking, easy to understand user-generated content is our core business. And guess how that content is generated?
If there were no decent HTML sanitizers for .NET (seriously?
You'd be surprised. In many ways .NET is kind of a backwater. Compare how many blogging engines there are in PHP, for example, to how many there are in .NET.
http://www.codinghorror.com/blog/archives/000320.html
(from mid-2005, but the relative stats have not changed since then, may even be worse as PHP has exploded)
Jeff Atwood on October 17, 2008 9:05 AM
Hey Jeff,
There's probably a good reason for not being able to do this but my first thought after reading your post is why don't you write a sanitizer that converts valid html tags to something like BBCode?
You can convert valid tags to BBCode and escape the rest, that way users don't have to learn BBCode to format their comments and you get to throw away anything that's invalid. You can still allow people to use whatever pseudo markup language you're converting to but it makes it so much easier for people who already know html.
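The "convert valid tags to BBCode and escape the rest" idea suggested here can be sketched in a few lines of Python. This is one possible reading of the suggestion, not anything Stack Overflow does; the `TAG_TO_BBCODE` mapping and the `html_to_bbcode` name are invented for the example. The trick is to escape everything first, then swap back only the exact, attribute-free tags on the list.

```python
import html
import re

# Hypothetical mapping from simple HTML tags to BBCode tags.
TAG_TO_BBCODE = {"b": "b", "strong": "b", "i": "i", "em": "i"}

# Matches an *already escaped* bare tag such as &lt;b&gt; or &lt;/em&gt;.
# Tags carrying attributes contain spaces and therefore never match.
_ESCAPED_TAG = re.compile(r"&lt;(/?)(\w+)&gt;")

def html_to_bbcode(text: str) -> str:
    # Step 1: neutralise everything, as a BBCode-only site would.
    escaped = html.escape(text, quote=True)

    # Step 2: convert only the known-good tags; leave the rest escaped.
    def swap(match):
        slash, name = match.group(1), match.group(2).lower()
        bb = TAG_TO_BBCODE.get(name)
        return f"[{slash}{bb}]" if bb else match.group(0)

    return _ESCAPED_TAG.sub(swap, escaped)

if __name__ == "__main__":
    print(html_to_bbcode("<b>hi</b> <script>x</script>"))
```

The output never contains a raw `<`, so whatever renders the BBCode afterwards is the only code that gets to produce markup.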
In .NET? How?
This is like telling me I should use rainbows and cotton candy. Well, obviously.
There are almost no options in .NET, which is one of the *reasons* I wanted to go this route. So others would have more solutions!
Perhaps this should have been examined before choosing .NET. The group of developers around .NET are very different than the community around something like Python or PHP. They are more likely to understand the value of shared code, especially where that code is not the differentiation between them and their competition.
One thing, though, on third-party libraries. They are great, as long as you have the source. If not, you do not even have all of the source for your own application, so how can you be expected to maintain it?
Perhaps this whole experience has made you a little wiser about the value of source code. Hopefully others can learn from your pain.
Grant on October 17, 2008 9:11 AM
The main point of Dare's post, which you've totally failed to address, wasn't that you shouldn't ever write your own HTML sanitisation code, but that trying to do so with *regular expressions* is a huge source of problems.
You could just as well have rolled your own validation by requiring valid XHTML input and using a SAX parser, which makes it quite easy to whitelist tags and attributes, or even validate that the input is well-formed in other ways (e.g. that inline elements don't contain block elements).
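For readers who want to see what "require valid XHTML and whitelist with a SAX parser" might look like, here is a rough Python sketch using the standard `xml.sax` module. The `ALLOWED` table, the `root` wrapper element, and the function names are assumptions made for the example; the attribute check on `href` is there because whitelisting an attribute name alone would still let `javascript:` URLs through.

```python
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import escape, quoteattr

# Hypothetical whitelist: tag name -> set of permitted attribute names.
ALLOWED = {"p": set(), "b": set(), "i": set(), "code": set(), "a": {"href"}}

class WhitelistHandler(ContentHandler):
    def __init__(self):
        super().__init__()
        self.out = []
        self.depth = 0

    def startElement(self, name, attrs):
        self.depth += 1
        if self.depth == 1:
            return  # the synthetic wrapper element, not user input
        if name not in ALLOWED:
            raise ValueError(f"tag not allowed: {name}")
        parts = [name]
        for key in attrs.getNames():
            value = attrs.getValue(key)
            if key not in ALLOWED[name]:
                raise ValueError(f"attribute not allowed: {key}")
            if key == "href" and not value.startswith(("http://", "https://")):
                raise ValueError("only http(s) links are allowed")
            parts.append(f"{key}={quoteattr(value)}")
        self.out.append("<" + " ".join(parts) + ">")

    def endElement(self, name):
        if self.depth > 1:
            self.out.append(f"</{name}>")
        self.depth -= 1

    def characters(self, content):
        self.out.append(escape(content))

def sanitize_xhtml(fragment: str):
    """Return cleaned markup, or None if the input is ill-formed
    or uses anything outside the whitelist."""
    handler = WhitelistHandler()
    # XML needs a single root, so the fragment is wrapped. A side effect:
    # any DOCTYPE inside the fragment becomes a syntax error.
    document = f"<root>{fragment}</root>".encode("utf-8")
    try:
        xml.sax.parseString(document, handler)
    except (xml.sax.SAXParseException, ValueError):
        return None
    return "".join(handler.out)

if __name__ == "__main__":
    print(sanitize_xhtml("<p>Hi <b>there</b></p>"))
    print(sanitize_xhtml("<p>unclosed"))
```

The cost of this approach is strictness: a stray `<` or an unclosed tag rejects the whole post, so it only makes sense where the user gets immediate feedback and can fix the input.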
Jonathan Buchanan on October 17, 2008 9:16 AM
nice looking, easy to understand user-generated content is our core business
A big part of what makes SO nice-looking and easy to understand is the delectably responsive UI. But you use a third party library to support that. Is it distinctively less of a core biz function than Markdown support?
alexis.kennedy on October 17, 2008 9:17 AM
Good post Jeff and thanks for sharing the codez :-)
o.s. on October 17, 2008 9:22 AM
I always enjoyed the phrase. Don't reinvent the wheel, unless you plan on learning more about wheels
Raisins on October 17, 2008 9:28 AM
Honestly, I disagree entirely with Joel's comment that's being referenced here. Yes, for certain core functions you should write things yourself, but using a framework or library to help get it done quicker is a good thing, not a bad thing.
For example, if I was writing a storefront for an e-commerce site, I would prefer to write my own store and fulfillment system to fully encapsulate the business needs, but I would gladly use an existing storefront framework out there (for example, Satchmo if I was using Django) that takes care of the payment gateways, even if I end up redoing everything else from scratch.
I guess it depends on the context of the application. I would not trust a drop it in e-commerce package for anything except the most basic of online stores, but I would gladly borrow the payment and generic CRUD modules (e.g. adding new customers) from one to shorten development time.
Wayne on October 17, 2008 9:29 AM
Oddly, many businesses blithely Go Shopping. The proliferation of BOM/MRP/ERP software systems is the prime example. And SAP is the prime of the prime. How you make your widgets is your core competence. But many still buy such software. Maybe that's why the USofA is going down the tubes.
BuggyFunBunny on October 17, 2008 9:32 AM
I agree that, for instance, a pharmaceutical company should write their own drug research software, but writing your own software and writing your own software from scratch are two completely different cups of tea. Especially when it comes to security - if history has taught us anything, it's that you should *never* write your own custom security-related routines, if at all possible.
However, the fact that you released your HTML-sanitizer to the public and posted it on your blog is certainly a plus, as I'm sure that now it will be picked apart and scrutinized by everyone in the community, especially those trying to prove that you have no idea what you're talking about :)
The comment thread on this one is freakin' hilarious! But, yeah, I agree with Jonathan Buchanan's sentiments... Dare's post seemed to attack the usage of regular expressions to accomplish your goal -- not just the HTML sanitisation.
-- Kevin Fairchild
Kevin Fairchild on October 17, 2008 9:50 AMBy Jeff's logic writing his own web server would be acceptable as well. Serving web pages is clearly part of his core business.
How many questions/answers actually contain HTML? Would it really have been that great an inconvenience to disallow HTML markup? Jeff even alludes to this when talking about how much easier things would have been with BBCode.
Why not just encode all HTML before Markdown sees it? Why not consider a different markup language?
Absconditus on October 17, 2008 9:57 AMYou don't need to understand everything to run StackOverflow. Computer science has this wonderful philosophical device: abstraction. Black boxes make the composition of systems from smaller functional units wonderfully tractable. What you are complaining about is the lack of a suitable black box, so you wrote your own. No problem there. But by the 'Feynman metric' from the blackboard, I doubt you understand the entire operation of StackOverflow. Did you write your own database (and could you, from scratch)? Is the network stack custom rolled?
I doubt it, and rightly so as rewriting them would be crazy. Feynman wanted to understand the entire universe stack, from top to bottom. You needed something that didn't exist, so you made it. That's the great luxury of software development.
Imagine, however, that a suitable sanitization engine had existed. Then you would have been crazy, from a production point of view, to roll your own if the extant engine had decent documentation, and the time of integration was small enough. You trust black boxes to give certain guarantees at every level of operation; another one here wouldn't have been a problem.
From a 'do I understand the universe' point of view, you could have written your own HTML sanitizer to scratch that particular curious itch, but it's a weird one to start out with when there are far more interesting problems to be able to solve.
Henry on October 17, 2008 10:03 AM*If* you are *able* and *willing* to write significantly better code than what exists, or code doesn't exist, or the code that exists can't be easily adapted to what you want to do - then you pretty much have to write code. Otherwise don't waste time and get on with your job.
So much support code that is written is just reinventing the wheel, and very poorly at that. Most of the time that devs reinvent the wheel they are neither willing nor able to write better code - they just want to write the code. They also usually don't have the benefit of a lot of eyes looking at and testing their code, so rarely does it even begin to approach the quality of code that is already out there and used by other people.
Developer Dude on October 17, 2008 10:13 AMThese days, HTML sanitization is primarily about security (preventing XSS attacks). When it comes to security, you want to use a proven, standardized solution. Would you roll your own version of SSL, or a cryptographic hash?
You say that your solution is proven, and can now be reused. Call me in 5 years when that's actually true; right now it has gone through precious little battle-testing.
I disagree that this is core business functionality for stackoverflow. Your core is how you facilitate collaboration, not the content format.
Chase Seibert on October 17, 2008 10:29 AMWhen it's a week of work, easy call.
When it's six months to a year, involving a not insignificant investment, what do you do? The choice is not easy then. And no matter which way you go, you will always wonder if the other way was better.
cthrall on October 17, 2008 10:49 AMWould you roll your own version of SSL, or a cryptographic hash?
Well, first I'd design my own CPU, RAM, and motherboard. From scratch, naturally. Then an OS to run everything. Maybe an IDE, debugger, things like that. But after that I'll be all over SSL and hashes like fleas on a dog!
If you are a security vendor, you might want to build SSL or hashes.
If your website allows arbitrary user-generated HTML in markup for *every single page*, you might.. just.. consider.. writing your own HTML sanitizer.
But what the hell do I know.
Jeff Atwood on October 17, 2008 10:50 AMSo, by the same rationale, does that mean you should learn C? If you can't create (given, like a thousand man-years) the .Net framework, how can you understand it? How can you defend your use of it?
:)
I mean this only half jokingly.
I'll await a response while building my webserver driven by telegraph latches, based on what I've learned in Charles Petzold's Code ;).
doug t on October 17, 2008 10:52 AMCoding Horror is turning into the DailyWTF with all the submissions coming from Jeff himself. Talk about over-complication! I see the problem as being this:
Markdown allows users to intermix HTML into the markup
And the solution is to change your markdown interpreter so that intermixing HTML is not allowed. Problem solved. No need to spend a week (or more) creating some HTML sanitizer that frankly isn't needed at all. Markdown includes more than enough formatting options without having to drop into HTML.
From http://daringfireball.net/projects/markdown/license:
Markdown is free software, available under the terms of a BSD-style open source license.
If HTML is the problem, then strip it out of your 3rd party library. If you want to foster the markdown community, offer the patch to other developers. I don't believe you absolutely have to write your core functionality yourself. I do believe however you have to modify it to suit your needs.
Bill on October 17, 2008 11:14 AMThis whole talk of core competencies gave me an idea. Microsoft is a software company. Apple is a hardware company. Windows was developed completely in-house. MacOS X is built on open source Unix foundations. Which one should have come out the better? Which one did?
Inventing your own wheel is sometimes necessary. I think it was in this case. But it should always be the exception, not the rule.
Felix Pleoianu on October 17, 2008 11:18 AMI just finished reading the October issue of MSDN magazine, and would have to say that at the rate the .NET framework is growing, we'll soon be writing nothing but business logic.
Take the Coding Tools article on page 86; even if it's your core business to write applications that are processor intensive (I'm thinking applying filters to images, etc.), you'd be crazy not to use the new support for parallelism the new version of the framework will offer.
I'm impressed with the future of the framework; however, I realize that not having a thorough understanding of parallelism (even though I may never have to write boilerplate parallelism code) is probably dangerous.
Esteban on October 17, 2008 11:18 AMI sure hope that Jeff packages his HTML sanitizer as an open source library and posts it on SourceForge. .NET will forever be backwater unless developers start publishing their hand-rolled libraries.
Tyler: I was just about to post the same thing. HTML sanitisation is a tangential issue to the primary functionality provided by stackoverflow: a programming community that doesn't suck.
Jeff: You totally missed the point of Joel's original post.
You also could have saved yourself a week of hacking by not being so stubborn about allowing html markup. We're hackers. We can quickly pick up whatever small markup is required to make a post look nice.
Daniel on October 17, 2008 11:32 AMI don't think writing a simple sanitizer is all that hard. At least, I have done it myself, taking the conservative approach of running through the input character-by-character with a finite-state machine scrawled on a piece of paper that says whether < or > is allowed at any point and whether to recover from errors by inserting or escaping the faulty code. You then have a table that says which element names are permitted and which attributes those elements can have. The result is a well-formed XHTML fragment that will display safely.
The point of the above is that (a) you only let through known-good HTML, rather than trying to spot and fix known-bad HTML (since making a list of good things is easier than making a list of all bad things), and (b) approach the problem systematically rather than trying a quick regex-based bodge and then a few more bodges on top of that.
I don't see how translating to BBCode and back again could possibly be simpler as it presumably involves interpreting the HTML to generate the BBCode... And forbidding HTML in Markdown is not much simpler because after sanitizing the Markdown (in order to forbid HTML), you then convert it to HTML and have to hope there is no way to fool the Markdown formatter to make it produce bad HTML. Safer to write a bullet-proof HTML sanitizer and apply it at the very end of the pipeline.
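For readers who want to see the allow-list idea in code, here is a minimal sketch. It is not the commenter's code and not the Stack Overflow sanitizer. It swaps the hand-written character-by-character state machine for the tokenizer in Python's standard `html.parser` module, and the `ALLOWED` table, the `SAFE_URL_PREFIXES` tuple, and the `sanitize` function name are all hypothetical choices made for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-list: element name -> attribute names permitted on it.
ALLOWED = {
    "b": set(),
    "i": set(),
    "em": set(),
    "strong": set(),
    "code": set(),
    "p": set(),
    "a": {"href"},
}
# Hypothetical rule: only plain web links survive in href.
SAFE_URL_PREFIXES = ("http://", "https://")


class AllowListSanitizer(HTMLParser):
    """Re-emits only known-good tags and attributes; escapes all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the sanitized output
        self.open_tags = []  # stack of allowed tags currently open

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # unknown element: drop the tag, keep its text (escaped)
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue
            value = value.strip()
            if name == "href" and not value.lower().startswith(SAFE_URL_PREFIXES):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray or disallowed close tag: drop it
        # Close inner tags first so the output stays properly nested.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def finish(self):
        self.close()  # flush any text the parser is still buffering
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)


def sanitize(fragment):
    parser = AllowListSanitizer()
    parser.feed(fragment)
    return parser.finish()
```

With this sketch, `sanitize('<b onclick="x()">hi</b>')` gives `<b>hi</b>`, and an unclosed `<i>open` comes back as `<i>open</i>`. The point is the shape of the approach: nothing reaches the output unless it is rebuilt from the allow-list, so there is no list of bad things to keep up to date. Thorough testing against real browser quirks is still a separate job.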
Damian Cugley on October 17, 2008 11:41 AMSo, what's the difference between rolling your own HTML sanitizer and rolling your own jQuery?
Scott on October 17, 2008 11:57 AMBut what the hell do I know.
That is what I've been trying to figure out for years.
pwnguin on October 17, 2008 12:15 PMa programming community that doesn't suck.
And a big part of the reason it doesn't suck is that people can format their posts almost as much as I formatted this blog post. Either by a) taking the time to learn Markdown or b) relying on the tried-and-true HTML that almost every developer now knows by heart.
The average posts on Stack Overflow just plain look *better* than other forums and sites.
I agree allowing HTML was painful, but in a good way. I've grown to enjoy the flexibility of using either markdown or HTML interchangeably. It means when programmers first encounter Stack Overflow, their first instincts in editing a post -- hmm, how about if I enter a hyperlink here -- work exactly the way they expect them to.
Choices about markup code are as critical to us as they are to, say, Wikipedia. It's all about the content, and making it easy for users to do the right thing when entering content.
Jeff Atwood on October 17, 2008 12:22 PMWow, so many comments about how there are libraries in .Net to sanitize HTML and no one mentions what these libraries might be. Which is Jeff's primary issue.
Stephen on October 17, 2008 12:32 PMI agree to a point, though I have to say it really depends on what you consider core to your business.
If for example your business depends on an API allowing other developers to create an ecosystem around your services - should you roll your own XML parsers? In this example parsing XML is as core to business aims as parsing posted content is to you, and it's why I think you're off the mark a little in this instance.
You seem to have done the research and identified that there is a gap for the HTML sanitisation which you needed to fill with in house code. That much, I totally agree with. I'm not sure I'm 100% with your analysis that this represents a core business function though.
Andrew on October 17, 2008 12:39 PMIn .NET? How?
This is like telling me I should use rainbows and cotton candy. Well, obviously.
The stress has obviously gotten the man. Why didn't you even read the whole post? Or couldn't you yourself come up with a way to use python from .net?
And no response of course to these questions...
Usually the content here is quite good but this is a severe case of NIH-syndrome and denial wrapped together.
Sanitizing html is your core business now? Nice.
I have to massively disagree with you on this one, particularly in the games industry.
Middleware in our industry is quite common, hence Midway paying Epic 3.X Billion dollars for a 10 year deal for their rendering engine. Particularly, the product in our industry is about providing content, not code. In which it makes sense that if it's cheaper to buy the code to help produce that content (and content production could be faster as a result), then that's the proper path to go.
I do agree, that as a programmer dealing with those systems, you need to understand it enough to be able to reproduce it in some variety. But at the core, if you can get a contract to purchase that software to do the job better/faster/easier/less expensive, then you should do it. Otherwise, you'll end up overbudget/slower, which is not productive to a video game.
~Main
MainRoach on October 17, 2008 12:50 PMIf it were only a time vs. materials world: reuse, reuse, reuse.
But it's not. Any Computer Science curriculum would agree with that. That's why in a data structures class, any good professor will make you write a stack/queue/linked-list before directing you to the STL. That's why in my Web Programming class this semester, we were directed to write a web server in C. The time it takes to write these things is worth it (at least to me) if for nothing else than the personal growth I experience when I learn something.
When Jeff's sanitizer breaks, he'll know how to fix it a lot better than if he just had this big abstract entity of an HTML sanitizer to search and prod through.
Charles Callebs on October 17, 2008 1:01 PMThat's why in my Web Programming class this semester, we were directed to write a web server in C.
You've inversely proved my point with your own ;)
A web server in C is not a 2.4million line codebase that 20 people have all contributed to over the past 4 years. Video games are.
If it only took a weekend to write a code sanitizer, then yea, do it yourself. But writing a robust multithreaded physics library that works cross platform and cross project? I'll leave that to another company who's entirely dedicated to providing that software in the marketplace as their livelihood.
Good programmers program, Great programmers reuse.
~Main
MainRoach on October 17, 2008 1:22 PMHey Jeff,
I'm a PHP developer, and there are loads of libraries, frameworks, CMSs etc out there. I've just (hopefully) won the battle with my project managers to let me write my own stuff instead of having to use 3rd party code.
Why?
So much of the 'popular' code (CMSs especially) is really really badly written. The time taken to evaluate, test, debug, tweak existing code is often way longer than doing it yourself.
I ALWAYS re-invent the wheel. How else would the wheel get better? And who is anybody to say I'm not good enough to add incremental value to the wheel?
That's not to say I don't use built in stuff and code that I know is good. Sometimes I'll take a snippet from the web and RE-WRITE it so I understand what it does and how it works. This is not a black and white thing.
As you pointed out, use 3rd party code where it makes sense, but don't rely on it to build your software. Telling a client their site is down because of an exploit in 3rd party code is hardly going to enhance your reputation.
If there is a bug in any of my applications, it's my fault, and my job to fix it. The buck stops with me, every time. That's what being a professional programmer is all about.
Rant over. :-)
Trevor on October 17, 2008 1:51 PMso many comments about how there are libraries in .Net to sanitize HTML and no one mentions what these libraries might be. Which is Jeff's primary issue.
It's because there really aren't any. There's the HTML Agility Pack -- which isn't really designed for sanitizing without writing a bunch of (error-prone) code to make it work -- and that's about it.
As others have said, the idea that sanitizing is this super-hard impossible problem is also not really true. Certainly nowhere near as hard as the physics library example @MainRoach proposed, etc. And like @Damian said, you can write a decent sanitizer in a few days.
Testing it thoroughly is another matter..
Jeff Atwood on October 17, 2008 1:52 PMOkay, interesting article (and I've read that one of Joel's in the past), but I think that you are both talking about something totally valid but making the wrong point.
The reason that you had to write your own HTML sanitizer, and the reason that Microsoft's Excel team had to write their own compiler (Joel's article, as I recall) is that libraries or external programs to do what was needed *didn't exist*.
When that Excel compiler was written, there was no open-source community to speak of, and they couldn't very well modify a commercial compiler for their needs. Google *can't* use external libraries, because nothing scales to their level...yet. And as you pointed out in comments, .NET is a backwater and doesn't have the kind of libraries yet that older platforms do, so the thing you needed didn't exist.
The moral of the story isn't "We should write important stuff in-house." The moral of the story is, "If it doesn't exist or you need something way beyond what exists, you probably will have to write it yourself." That's a fact of life, not a lesson in good software design.
-Max
Max Kanat-Alexander on October 17, 2008 1:52 PMJeff, while I often enjoy your insights, you're just not being rational here.
Core business function is synonymous with competitive advantage. If HTML sanitizing were core, you'd have written your business plan around how much better you are at it than others. It would be up there with how you attract and keep smart programmers on your site. The rest is plumbing.
If .NET doesn't have its own sanitizer, perhaps it wasn't the right choice for a platform. Personally, I've always wondered why you chose .NET. I know it's the one that's most familiar to you. That's a plus if you want to complete a small new project fast, but using the same language and tools all the time limits your growth potential as a developer. Considering how many readers you have who don't use .NET (myself among them), you really don't want to end up as one of those curmudgeonly single-language programmers.
David Leppik on October 17, 2008 1:57 PMI don't know about the whole html sanitizer being a core business function thing (quite frankly, I could live with textile or bbcode).
But I do think it's a good investment for a developer to reinvent something. You can't talk about scalable comet architecture if you've never written a server. You can't talk about javascript compilation optimization if you've never written a javascript engine.
If you want to be an expert at something, you've gotta experience its ins and outs, the full development cycle, the bugs, the caveats, the holes and the limitations.
And if you are a programmer (read: not a content-entry monkey) and you have bills to pay, you'll probably want to be good at *some* programming-related task.
Leo Horie on October 17, 2008 1:57 PMHow do you determine that there are no existing libraries? Do you just google or do you have a list of sites (sourceforge, cpan, ...)? Every time I hear developers say this type of thing with certainty, I suspect that I'm faking it as a dev, since I never feel sure of what's actually out there, even after spending a lot of time researching...
Josh on October 18, 2008 2:10 AMSame goes for outsourcing, keep the core competencies that you rely on in-house.
stjohnroe on October 18, 2008 3:48 AMMarkdown sucks.
Markdown interprets all text between two underscores as italic. This would be fine if nobody needed to use underscores. In other words, this:
Popular Apache modules include mod_php and mod_rewrite
shows up like this:
Popular Apache modules include modphp and modrewrite
You can escape underscores, but this defeats Markdown's stated purpose of appearing like natural text.
Jonathan Drain, Dungeons Dragons Blogger on October 18, 2008 4:48 AMI should add: my preferred alternative is Textile, or Mediawiki formatting, or no formatting at all. Do users really need HTML to comment on a blog post?
Jonathan Drain, Dungeons Dragons Blogger on October 18, 2008 4:50 AMMarkdown interprets all text between two underscores as italic.
Agree. We changed this so intra-word underscores are not allowed in our Markdown server-side parser.
More here:
http://blog.stackoverflow.com/2008/06/three-markdown-gotcha/
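To make the underscore complaint concrete, here is a rough sketch of the difference between a naive emphasis rule and one that ignores intra-word underscores. Neither regex is taken from any real Markdown implementation or from the Stack Overflow parser. Both patterns and both function names are assumptions for illustration only.

```python
import re

# Naive rule (hypothetical): any text between two underscores becomes emphasis.
NAIVE_EM = re.compile(r"_(.+?)_")

# Word-aware rule (hypothetical): an underscore only opens or closes emphasis
# when it is not glued to a letter or digit on its outer side.
WORD_AWARE_EM = re.compile(
    r"(?<![A-Za-z0-9])_(?=\S)(.+?)(?<=\S)_(?![A-Za-z0-9])"
)


def naive_emphasis(text):
    return NAIVE_EM.sub(r"<em>\1</em>", text)


def word_aware_emphasis(text):
    return WORD_AWARE_EM.sub(r"<em>\1</em>", text)


sample = "Popular Apache modules include mod_php and mod_rewrite"
print(naive_emphasis(sample))
print(word_aware_emphasis(sample))
print(word_aware_emphasis("this is _really_ important"))
```

The naive rule turns the sample into `mod<em>php and mod</em>rewrite`, which is the breakage described above. The word-aware rule leaves the module names alone and still renders `_really_` as emphasis.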
well, it's good you can still enjoy coding.
But I will be more than happy if you explain some tips to manage your time for both programming and writing blogs (with such entries) :D
Trevor on October 17, 2008 12:51 PM took the words out of my mouth.
Reinvent the wheel (not the concept, but the instance) to become a better programmer.
Learn to code by writing code. You will not understand all the risks and pitfalls of a HTML sanitizer if you have never written one.
I do not say you should never reuse code. But rolling your own can definitely be the best option.
And I think that 'there is no suitable code available' definitely is a good reason to do what programmers (hopefully) do best: write code.
Jacco on October 18, 2008 8:27 AMDonal, your tone seems to imply that one should care about whether they gave back to the community. This is an erroneous assumption.
Matt Green on October 18, 2008 10:14 AMI've posted this like a hundred times, but I'll mention it again just for fun. HTMLEncode your string, and then replace &lt;b&gt; with <b>, etc. So what if people can see that someone put script in their post.
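A quick sketch of that encode-first idea, assuming Python's `html.escape` stands in for HTMLEncode. The `SIMPLE_TAGS` tuple and the `encode_then_restore` name are made up for this example and are not from any library.

```python
import html
import re

# Hypothetical list of attribute-free tags we are willing to restore.
SIMPLE_TAGS = ("b", "i", "em", "strong", "code", "pre")

# Matches an encoded open or close tag such as &lt;b&gt; or &lt;/b&gt;.
_ENCODED_TAG = re.compile(
    r"&lt;(/?)(%s)&gt;" % "|".join(SIMPLE_TAGS), re.IGNORECASE
)


def encode_then_restore(text):
    # Step 1: encode everything, so no user-supplied markup survives.
    encoded = html.escape(text, quote=True)
    # Step 2: turn back only the exact, attribute-free tags on the list.
    return _ENCODED_TAG.sub(
        lambda m: "<%s%s>" % (m.group(1), m.group(2).lower()), encoded
    )


print(encode_then_restore("<b>hi</b><script>x</script>"))
# prints: <b>hi</b>&lt;script&gt;x&lt;/script&gt;
```

Anything with an attribute, such as `<b onclick="...">`, stays encoded because it never matches the pattern. The trade-offs are that this sketch does not balance tags and has no way to allow links.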
Tim on October 18, 2008 10:16 AMMight be a bit OT but to me this is a huge WTF:
due to the liberal HTML parsing policies of many modern Web browsers
I mean seriously, what were the browser developers thinking when they explicitly added support for lazy html in the first place?
Hmm, I pretty often write foeach by mistake instead of foreach...let's make foeach do the same thing as foreach and I'll save 2sec per day! I mean, all developers must have this problem so I'm doing the world a huge favor!
Today I understand that every browser has to add support for all the crap that all other browsers already added, but why add support for lazy html in the first place?!
Seriously Jeff,
Do you really think posting some code on a website qualifies some code as having been contributed back to the community?
- Donal
Donal on October 18, 2008 11:25 AMI understand Google is rolling their own browser because they wanted to make sure that their applications will run smoothly. For some of their business core.
Kenneth on October 18, 2008 12:51 PMLooks like you've managed to get every programming expert in the country to come post a comment. They all know the best way and each of them is smarter than the rest. This is good stuff. :)
T.J. on October 19, 2008 2:47 AMI have to say that I definitely agree with Jeff on this one. While I think claiming that HTML sanitizing is the core competency may be a stretch, I think it's core enough for the purposes of this topic. Even if there are third party libraries available for such a thing, unless there is one that is truly complete, time tested and professional, it makes sense to write your own. You can look at the others for ideas and to learn the things that they've already learned, but it's something that's better off being written in a way that is more easily understood and maintained.
Comparing that to writing your own web server is pretty ridiculous. There is no such thing as a third party HTML sanitizer that is on the order of reliability as Apache or IIS, in any language. Using a library that is written in Python and forcing it into a C# .NET package via IronPython would be madness, unless you happened to be a seasoned Python expert or have one in-house that is able to make changes and corrections in a timely manner.
There are a lot of square wheels out there.
3rd-party libraries are definitely a problem - the OpenSSL debacle should make everyone think twice. Security software is especially troublesome. Experts (like Schneier) will tell you if you create your own encryption algorithm, you're almost certainly a fool. If you're not a fool, you rely on a commercial product or something like OpenSSL. On the other hand, unless you're an expert on the subject, you'll have to assume the product or library you selected is secure - that assumption will only be based on the assertions of the vendor. OpenSSL was not secure for two years, which ought to tell you something about the level of testing done. Are the commercial products any better? How can you tell?
Stop it now...my head hurts
Tinuviel on October 19, 2008 6:26 AMDidn't Dare quit the internets for ever? IMO, he hasn't had much good to say - apart from causing drama with Arrington.
http://www.25hoursaday.com/weblog/2008/03/05/IndefiniteHiatus.aspx
James on October 19, 2008 6:48 AMI also don't agree with using regular expression for writing an HTML sanitizer.
Grom on October 19, 2008 12:13 PMI'm sorry to inform you, but if the spec requires HTML as input, the spec is wrong. If your core business is to somehow display user generated content, you simply don't allow HTML, period. And if you want to allow *some* HTML, just go with the BB code way.
Bucket on October 20, 2008 4:01 AMOoops, just clicked on the hear it spoken link...
Wow, so many gurus on here, who really know their stuff! Well done all, you're a bunch of heroes that all have wonderfully usable sites that I visit every day. Amazing how you all picked exactly the correct technology (which you do every time -don't you) giving you time to come on here and share your much valued wisdom. Yes, jolly well done you!
Anyway Jeff, I think SO looks great and I'm amazed you managed to make such a funky looking site in .NET. Also, you may be louder than some through blogging, but it's also because you talk a lot of sense, making you IMO a talented developer too. It's not just about writing beautiful code.....
bloop on October 20, 2008 4:11 AMI'm sorry to inform you, but if the spec requires HTML as input, the spec is wrong.
Since more people know HTML than any other form of marking up and generating rich user content, I fail to see how you can make that assertion. Forcing users to adapt to something unfamiliar is a bad spec, not allowing them to use something that is.
If you are a security vendor, you might want to build SSL or hashes.
WTF?
deeply understanding HTML sanitization is a critical part of my business
I thought you were running a 'people' site. Even if you did not support HTML, SO will work fine.
Jeff, Your core business is not sanitizing html.
There are communities whose core business is sanitizing html. You should have borrowed code from them.
And the solution is to change your markdown interpreter so that intermixing HTML is not allowed. Problem solved. No need to spend a week (or more) creating some HTML sanitizer that frankly isn't needed at all. Markdown includes more than enough formatting options without having to drop into HTML.
You can't just kill functionality until a product is safe. Denying any user the right to enter any word longer than 6 characters would solve some problems. So would disconnecting SO from the Internet.
It's a balancing act.
I thought you were running a 'people' site. Even if you did not support HTML, SO will work fine.
Surely it's better to allow the people, many of whom are web programmers, to program with a known markup, rather than forcing them to learn YASWM (Yet Another S***ing Web Markup).
I agree with you. If you don't write your own stuff, you either have to rely on an outside programmer to fix it, or set aside a week and a half to figure out their code and change it yourself. That can take a lot longer than writing it yourself and spending an hour to debug.
arkangyl on October 20, 2008 7:03 AMWhatever the arguments over code-reuse versus NIH, posting the code to http://refactormycode.com/codes/333-sanitize-html certainly doesn't count as [contributing] the core code back to the community.
bobby on October 20, 2008 7:09 AM
| Content (c) 2009 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved. |