Coding Horror
programming and human factors
by Jeff Atwood

Nov 15, 2009

Parsing Html The Cthulhu Way

Among programmers of any experience, it is generally regarded as A Bad Idea™ to attempt to parse HTML with regular expressions. How bad of an idea? It apparently drove one Stack Overflow user to the brink of madness:

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.

Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions.

Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes.

That's right, if you attempt to parse HTML with regular expressions, you're succumbing to the temptations of the dark god Cthulhu's … er … code.

kraken-cthulhu.jpg

This is all good fun, but the warning here is only partially tongue in cheek, and it is born of a very real frustration.

I have heard this argument before. Usually, I hear it as justification for seeing something like the following code:

 # pull out data between <td> tags
($table_data) = $html =~ /<td>(.*?)<\/td>/gis;

"But, it works!" they say.
"It's easy!"
"It's quick!"
"It will do the job just fine!"

I berate them for not being lazy. You need to be lazy as a programmer. Parsing HTML is a solved problem. You do not need to solve it. You just need to be lazy. Be lazy, use CPAN and use HTML::Sanitizer. It will make your coding easier. It will leave your code more maintainable. You won't have to sit there hand-coding regular expressions. Your code will be more robust. You won't have to bug fix every time the HTML breaks your crappy regex.
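
For comparison, here is a minimal sketch of the lazy route in Python rather than Perl, using only the standard library's html.parser module. The CellCollector class and the table_cells function are names invented for this illustration, and the sample markup is made up.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell (hypothetical helper for this sketch)."""

    def __init__(self):
        super().__init__()
        self.cells = []     # finished cell texts
        self._depth = 0     # how many <td> elements we are currently inside
        self._buffer = []   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, attributes are already split off.
        if tag == "td":
            if self._depth == 0:
                self._buffer = []
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.cells.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def table_cells(html):
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


sample = '<table><tr><TD class="x">one</TD><td>two <b>bold</b></td></tr></table>'
cells = table_cells(sample)
print(cells)
```

The first sample cell carries a class attribute, which is exactly the kind of cell a literal <td>(.*?)</td> pattern skips. In this sketch, text from a nested table is lumped into the outer cell; a real scraper would have to decide what it wants there.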

For many novice programmers, there's something unusually seductive about parsing HTML the Cthulhu way instead of, y'know, using a library like a sane person. Which means this discussion gets reopened almost every single day on Stack Overflow. The above post from five years ago could be a discussion from yesterday. I think we can forgive a momentary lapse of reason under the circumstances.

Like I said, this is a well understood phenomenon in most programming circles. However, I was surprised to see a few experienced programmers in MetaFilter comments actually defend the use of regular expressions to parse HTML. I mean, they've heeded the Call of Cthulhu … and liked it.

Many programs will neither need to, nor should, anticipate the entire universe of HTML when parsing. In fact, designing a program to do so may well be a completely wrong-headed approach, if it changes a program from a few-line script to a bullet-proof commercial-grade program which takes orders of magnitude more time to properly code and support. Resource expenditure should always (oops, make that very frequently, I about overgeneralized, too) be considered when creating a programmatic solution.

In addition, hard boundaries need not always be an HTML-oriented limitation. They can be as simple as "work with these sets of web pages", "work with this data from these web pages", "work for 98% users 98% of the time", or even "OMG, we have to make this work in the next hour, do the best you can".

We live in a world full of newbie PHP developers doing the first thing that pops into their collective heads, with more born every day. What we have here is an ongoing education problem. The real enemy isn't regular expressions (or, for that matter, goto), but ignorance. The only crime being perpetrated is not knowing what the alternatives are.

So, while I may attempt to parse HTML using regular expressions in certain situations, I go in knowing that:

  • It's generally a bad idea.
  • Unless you have discipline and put very strict conditions on what you're doing, matching HTML with regular expressions rapidly devolves into madness, just how Cthulhu likes it.
  • I had what I thought to be good, rational, (semi) defensible reasons for choosing regular expressions in this specific scenario.
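
One way to read "very strict conditions" is sketched below in Python: the pattern targets a single known element on a single known page layout, and the code refuses to guess when the page stops looking the way it expects. The pattern, the id="price" markup, and the extract_price name are all assumptions made up for this illustration.

```python
import re

# Hypothetical pattern: one known page layout, one known element.
PRICE_RE = re.compile(r'<span id="price">\$(\d+\.\d{2})</span>')


def extract_price(page):
    """Return the price as a string, or raise if the page no longer looks as expected."""
    matches = PRICE_RE.findall(page)
    if len(matches) != 1:
        # Fail loudly instead of silently returning the wrong thing.
        raise ValueError(f"expected exactly one price, found {len(matches)}")
    return matches[0]


page = '<div><span id="price">$19.99</span></div>'
print(extract_price(page))
```

The guard is the point of the sketch. The regex is still brittle, but when the markup changes the script breaks visibly rather than producing quiet garbage.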

It's considered good form to demand that regular expressions be considered verboten, totally off limits for processing HTML, but I think that's just as wrongheaded as demanding every trivial HTML processing task be handled by a full-blown parsing engine. It's more important to understand the tools, and their strengths and weaknesses, than it is to knuckle under to knee-jerk dogmatism.

So, yes, generally speaking, it is a bad idea to use regular expressions when parsing HTML. We should be teaching neophyte developers that, absolutely, even though it's an apparently never-ending job. But we should also be teaching them the very real difference between parsing HTML and the simple expedience of processing a few strings, and how to tell which is the right approach for the task at hand.

Whatever method you choose -- just don't leave the <cthulhu> tag open, for humanity's sake.

Comments

Why does someone use a home-grown parser? Because it works in 99% of cases. Why not use a full-blown parser engine? Because it does not work in 100% of cases. It does not work in terms of cost and performance. It saves 1%, but it loses 99%. This is why.

vtolkov on November 16, 2009 1:41 AM

Is this considered yet another awesome comment?
http://www.codinghorror.com/blog/archives/001130.html

Alex Vincent on November 16, 2009 1:51 AM

Jeff... ummm ahhhh well didn't you build an HTML sanitizer that uses regular expressions? http://refactormycode.com/codes/333-sanitize-html

Marty on November 16, 2009 2:04 AM

I think this is more of a reference to House of Leaves than to Lovecraft. It's a good thing Mark Z. Danielewski did not know about the wonders of Unicode though...

Deadprogrammer on November 16, 2009 2:05 AM

Hey, thanks for outing me, you ass.

Cthulhu on November 16, 2009 2:11 AM

I am reminded that every time you try to solve a problem with regular expressions, you now have two problems: the original problem and the regular expressions used to solve it.

Mikej on November 16, 2009 2:26 AM

I think it should be mentioned that if you can create a fully valid html parser you just created the core of a web browser.

Not a small project...

Practicality on November 16, 2009 2:27 AM

Jeff, are you high? Tell me you are not still trying to justify using regexes on HTML. On ANY HTML...

*sigh*

Look, even though libraries have a lot of code to them, so does the implementation of regexes. In other words, the code paths are similar in terms of complexity and execution cycles. You are just choosing the wrong method, the one that is going to be incomplete, buggy and difficult to maintain because it is what you know.

It's just wrong. Think of the children.

Travis on November 16, 2009 2:35 AM

The ignorance problem is that many, many developers don't know or believe that HTML is not a regular language.

And when you do accept that you're going for a 98% solution and using regular expressions on HTML, you have to be very aware of potentially creating cross-site scripting vulnerabilities.

Adrian on November 16, 2009 3:05 AM
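
The cross-site scripting risk is easy to demonstrate. The sketch below, in Python, assumes a naive single-pass filter that deletes anything resembling a script tag; the naive_strip function and the payload are made up for this illustration and are not taken from any real sanitizer.

```python
import re
from html import escape

SCRIPT_TAG_RE = re.compile(r"</?script[^>]*>", re.IGNORECASE)


def naive_strip(text):
    """Remove anything that looks like a <script> tag, in a single pass."""
    return SCRIPT_TAG_RE.sub("", text)


# Each real tag is hidden inside a broken one; deleting the inner tags
# splices the outer fragments back together into a working tag.
payload = "<scr<script>ipt>alert(1)</scr</script>ipt>"

print(naive_strip(payload))   # the filter reassembles the tag it meant to remove
print(escape(payload))        # escaping leaves no markup at all
```

Escaping everything only suits input where no markup should survive. For rich text, a whitelist-based sanitizer library is the usual route, and this sketch does not attempt to be one.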

The saddest thing is that parsing html with a real parser is easier than using regexp in ruby using hpricot, but that doesn't stop some people from writing

article_contents = string.scan /(.*?)/

instead of

require 'hpricot'
(Hpricot(string)/'div.article').inner_html

And then get all confused when blahblah breaks everything.

Daniel on November 16, 2009 3:20 AM

Bah, if you're going to declare no html on your comments, you could at least escape the html for people.

Daniel on November 16, 2009 3:22 AM

Jeff is happy to talk about how nasty some practice is, as long as he still gets to justify when he did it himself. Jeff does not admit mistakes easily.

Breton on November 16, 2009 3:25 AM

Simple things like finding all the href attributes in a document are easily accomplished with a regex. But once you get into trying to match opening and closing tags, yeah, it becomes hopeless.

Austin on November 16, 2009 4:50 AM
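
Even the href case has edges. The Python sketch below compares a plausible one-line pattern with the standard library's html.parser on a made-up sample; HREF_RE, HrefCollector, and hrefs_by_parser are hypothetical names for this illustration.

```python
import re
from html.parser import HTMLParser

# A typical quick pattern: double-quoted href values only.
HREF_RE = re.compile(r'href="([^"]*)"', re.IGNORECASE)


class HrefCollector(HTMLParser):
    """Collect href values from <a> start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def hrefs_by_parser(html):
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


sample = (
    '<a href="one.html">1</a>'
    "<a href='two.html'>2</a>"
    "<a href=three.html>3</a>"
    '<!-- <a href="old.html">gone</a> -->'
)

print(HREF_RE.findall(sample))   # misses two links, picks up the commented-out one
print(hrefs_by_parser(sample))   # the three live links
```

A more elaborate pattern can cover single quotes and unquoted values, but each such fix is a step toward reimplementing the tokenizer the parser already has.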

XPath > RE

The captcha is ridiculous!

JJM on November 16, 2009 5:17 AM

Seriously, the demonoid sounds from hell when I click the audio help are more blatantly evil and stupid than suggesting HTML should be parsed with a regex!

JJM on November 16, 2009 5:18 AM

"You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.

Neither does this parser."

http://www.crummy.com/software/BeautifulSoup/

Jacques Beaurain on November 16, 2009 5:22 AM

Kevin Peterson shares my concern: Where are the parsers that don't barf on bad HTML?

Kyle Lahnakoski on November 16, 2009 5:23 AM

Matching nested parentheses (from Mastering Regular Expressions, 2nd edition, pages 330-331)


my $LevelN; # This must be predeclared because it's used in its own definition
$LevelN = qr/ \(( [^()] | (??{ $LevelN }) )+ \) /x;

This matches arbitrarily nested parenthesized text...


So I think, given that construct, it should be possible to generalize this to parse arbitrary nested HTML tags, including arbitrary JavaScript &c.


Also, anyone thinking about letting users put up HTML content from their Rich Text Editor what Generates HTML Output, check out rsnake's XSS page first (http://ha.ckers.org/xss.html). It becomes apparent that the problem is all the weird quirks of all the different versions of all the browsers out there. And remember, the hackers aren't going to be using your Rich Text Editor, they're just going to be submitting Evil HTML of their Own Construction directly, probably using curl or something. So you'll be trying to sanitize arbitrary HTML snippets such that they can't cause problems on any browsers, most of which are not installed on your system right now, and until you went to that page you probably didn't even know about all those possible ways of getting scripts to run. And that list can only get longer, not shorter. So save yourself some headache and use some other kind of markup that you translate to HTML *very carefully*.

Phil Brass on November 16, 2009 5:33 AM

To add another HTML parser to the list, there's also libxml2's HTMLParser. It's probably the best open source HTML parser in C.

http://www.xmlsoft.org/html/libxml-HTMLparser.html

Laurent on November 16, 2009 6:12 AM

The only crime committed by most novice developers is not knowing what the alternatives are. This post, on the other hand, does nothing to help promote what those alternatives are. No cherry picked recommendations for a few common web languages? PHP, RoR, etc?

Hpricot and Rubyful Soup help me a good bit in RoR.

phreakre on November 16, 2009 6:12 AM

Now, I have that song in my head! Go Metallica!

Wayne on November 16, 2009 8:31 AM

"Even Jon Skeet cannot parse HTML using regular expressions."

Them's fightin' words.

Chris F. on November 16, 2009 8:32 AM

ESR disagrees :)
http://www.jgc.org/blog/2009/11/parsing-html-in-python-with.html

Pádraig Brady on November 16, 2009 8:34 AM

Is it just me or does this blog post try to argue both sides of the same issue?

R. Bemrose on November 16, 2009 8:43 AM

"Even Jon Skeet cannot parse HTML using regular expressions."

I lol'd, what a great way to put this in perspective.

Patrick on November 16, 2009 8:46 AM

But the link to CPAN HTML::Sanitizer is broken.

Robert Claypool on November 16, 2009 8:47 AM

Link behind "HTML::Sanitizer" is dead.

Dennis Gorelik on November 16, 2009 8:49 AM

So what is the preferred method for dealing with XSS (Cross Site Scripting) issues then, particularly if you're using a Rich Text Editor that saves formatting as HTML?

Dominic Pettifer on November 16, 2009 8:54 AM

The last time I went for an HTML library to parse some HTML, the HTML was so broken I had to resort to regex.

The regex broke afterwards, after the generated HTML was slightly changed. It was trivially fixed.

So, while I agree that HTML (and, particularly, XML) should be parsed appropriately, YMMV. I get the feeling a lot of people who complain about regex have never bothered to LEARN it as the complex language it is.

Daniel Sobral on November 16, 2009 8:58 AM

What about HTMLTidy? http://tidy.sourceforge.net/ Convert stuff to proper XHTML and then use your XML processing mechanism of choice to parse the data.

Arethuza on November 16, 2009 9:08 AM

This would be a lot more helpful if some specific libs besides the Perl solution were posted. I had a non-trivial time trying to find ready-to-use stable libraries on various platforms (e.g. PHP). Any suggestions?

Joe on November 16, 2009 9:11 AM

Can someone get me one of those T-shirts with "I parse HTML with RE" on front?

Goran on November 16, 2009 9:14 AM

> I think that's just as wrongheaded as demanding every trivial HTML processing task be handled by a full-blown parsing engine.

Maybe HTML processing isn't trivial, Jeff.

Anonymous on November 16, 2009 9:14 AM

http://search.cpan.org/~nesting/HTML-Sanitizer-0.04/Sanitizer.pm tells me "not found", btw. Pretty surprising, as we can still find the code in Nesting's archives, and it is still referred to at e.g. http://search.cpan.org/~podmaster/HTML-Scrubber-0.08/Scrubber.pm

sylvainulg on November 16, 2009 9:15 AM

I'm torn here. I mean there's jwz's famous quote about

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

On the other hand, CPAN isn't particularly useful if you aren't using Perl. And "Hell is other people's Perl".

Rev Matt on November 16, 2009 9:22 AM

I have no joke here, I just like saying http://discordianquotes.com/quote/8008

chaos on November 16, 2009 9:28 AM

"The only crime being perpetrated is not knowing what the alternatives are."

Yup - that just about covers the whole darn thing! Who hasn't worked with developers who rolled their own XYZ, when that program is already out there and supported by some other community.

Jarrett Meyer on November 16, 2009 9:32 AM

This is the most useful thing you've said in a long time. Glad to have you back, Jeff.

Linus Torvalds on November 16, 2009 9:32 AM

I have a daily job that scrapes a page. Unfortunately, it looks like the author of that page is emitting with Word or something similarly horrible. Tables within tables within tables without sanity(or regularity). The faint whisper of "ia, ia" curdles around the mind when contemplating the source.

A DOM parser would gibber insanely to itself, quietly screaming at the brokenness of the form.

Since I didn't particularly want to do that to a parse, I regexed away and was able to do it. Only minor loss of sanity points....

Paul N on November 16, 2009 9:36 AM

Find me a good html parsing engine and I will gladly use it. Tidy is the best I have found so far, and you still have to do quite a few rounds of additional cleaning using regex's after Tidy is done.

Practicality on November 16, 2009 9:38 AM

I've done a lot of HTML parsing with regular expressions and rarely had a problem. That's because I'm usually working on one file or a small set and I'm doing the regular expressions in my text editor, so I get immediate feedback when it doesn't do what I want.

I tend to think there should be an inverse correlation between code elegance and the number of times the code gets run. If you're only going to run your code once, feel free to throw in the most Lovecraftian regex you can concoct. Just make sure to comment your tome^H^H^H^Hfile "This code was not meant for mere mortals to understand. If you value your sanity, make sure you can roll well on your Refactoring skill check."

Trevor Stone on November 16, 2009 9:49 AM

I have this feeling about Finite State Automata vs. Pushdown Automata...

Salvatore on November 16, 2009 9:52 AM

@Joe:

You might want to take a look at HTMLPurifier for PHP. It's a whitelist-based approach to HTML filtering. Their comparison page also lists a few other libraries, although, as you can imagine, they are in favor of their own approach.

http://htmlpurifier.org/

Arthur on November 16, 2009 9:54 AM

I worked on a large complex scraping system that scraped tens of thousands of pages each day to extract structured data from them. 90% of the extraction was regex, and it has been working fine for many years. In some places we would use html parsing libraries and sanitizers, but often regexs worked great and were simpler to code. As a side note we would also often run into invalid HTML that broke the parsers we tried.

ch on November 16, 2009 9:56 AM

Using regex to parse HTML is a temptation of evil.

Agustin on November 16, 2009 9:57 AM

Though I have no quarrel with the statement that html should not be parsed with RE's in general, there are cases where it makes sense. An internal tool I wrote for my company needs to parse an html page. The format of that page never changes, and it does not need to parse any other pages. A couple of RE's made quick work of the parsing, and work well. They got the job done. Of course, if that page ever changes the RE's will necessarily change along with it, but that's not really a big deal.

In sum, the solution isn't robust, but it works, it was easy, it will be easy to modify / fix, and it saves my coworkers and I untold frustration and time every week.

Not sure how there could be anything wrong with that.

John on November 16, 2009 10:00 AM

Now, that was actually funny.
Thank god I still don't get asked to parse HTML to pull stuff out of it.

It's somehow ironic that this post doesn't allow HTML (no HTML in red) either!

MaxDZ8 on November 16, 2009 10:05 AM

Again with the PHP Sass, it seems like no matter how many times you talk about "it's the programmer not the language" you just can't forgive PHP for having such a low barrier of entry.

To contribute, though: when scraping an 80k+ file (they exist, trust me [[shudder]]), the Regex were significantly less awful than loading up the whole DOM parser and praying that there isn't an "unrecoverable error" in there somewhere.

Then again, had the task been too much more complex I may have had to start eating 5 babies a week (only up to 3 right now). This post does strike me as arguing both sides of the coin, but at least it hits them in the right way: If you're going in with Regex in your hand, know that you carry your sanity in those same hands.

And please, Jeff, lay off PHP; it's not funny or clever anymore, and just as many horrible, horrible things can be said of VB developers as PHP, it's just that a much higher percentage of the VB developers are not "hobbyists" ;)

Dereleased on November 16, 2009 10:08 AM

I think to contend that HTML parsing is a solved problem there should be a few more examples. I'd like to caveat this by saying that the approach of converting HTML to well formed XML (or XHTML) does not work for everyone, and I would think that an HTML parser that qualifies as an established solution would be robust enough to handle the flexibility that HTML allows.

While I agree that regex is not the right approach I disagree that this is a solved problem.

bdaniels on November 16, 2009 10:09 AM

Okay, sure, you can't parse HTML with regex, and you shouldn't try. There is a problem not served by any of the available libraries, and that's parsing the garbage littered all over the web that sort of looks like HTML if you don't look too deeply. For these, using a regular expression to look for what you need will work better and more reliably than trying to figure out how to get your parser to not blow up when it discovers that what it's parsing doesn't actually validate.

Kevin Peterson on November 16, 2009 10:14 AM

You know, there was actually a question just about that on SO a couple of days ago.

laura on November 16, 2009 10:18 AM

A small snippet of code was sighted in this post.

Ah well, we can still hope.

AC on November 16, 2009 10:22 AM

I think the biggest problems arise when the HTML (or XML) is not well formed. But then you get in trouble with most of the libraries and pre-built parsers I know, too.

fred on November 16, 2009 10:23 AM

For the brave of heart: Write a regular language to recognize all strings of balanced parens.

Actually, don't, because this is provably impossible.

Why? Because regular expressions recognize *regular languages*, a specific, well-defined class of languages. HTML, like the balanced parens problem above, doesn't conform to this pattern.

Every once in a while, I'm reminded of why studying bona fide computer science in college was the right idea. It won't necessarily make you a better programmer, but it has saved me from doing really stupid things from time to time, like trying to parse html with a regex.

David R. Albrecht on November 16, 2009 10:23 AM
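
The balanced-parens point can be made concrete with a short Python sketch. The is_balanced function is a made-up name for this illustration; it needs a counter that can grow without bound, which is exactly the memory a finite automaton (and so a true regular expression) does not have.

```python
def is_balanced(text):
    """Return True if every '(' in text has a matching ')' in the right order."""
    depth = 0                    # unbounded counter: the part a finite automaton lacks
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # a ')' appeared before its '('
                return False
    return depth == 0            # anything left open means unbalanced


for candidate in ["(()())", "(()", ")(", ""]:
    print(repr(candidate), is_balanced(candidate))
```

With more than one kind of bracket, or with named tags as in HTML, the counter becomes a stack of open elements. That stack is the essential difference between a parser and a pattern matcher.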

Write a regex to identify all balanced parenthetical strings. I dare you.

I didn't study computer science because it was easy, rather, because it has some nice ass-saving properties that prevent you from doing stupid things. Like parsing HTML with a regex.

David R. Albrecht on November 16, 2009 10:25 AM

Jeff, sorry abt the double post above, something is amiss with your website. I tried submitting the first one 3 times. The first time, I got an error about a temp file, the second time I got a CAPTCHA error, and the third time (I hit refresh each time), the comment somehow went through. Weird.

Also, your captchas are kind of hard. Just saying.

David R. Albrecht on November 16, 2009 10:27 AM

Thank you VERY much! Now we can link to this article when explaining to SO posters why this is a bad idea.

BTW, .NET users can use http://htmlagilitypack.codeplex.com/

- TrueWill

Bill Sorensen on November 16, 2009 10:41 AM

I worked at one company in the 1990's (before the days of CMS's) where I maintained web pages for a knowledgebase about the product I supported. The official website team at this company periodically changed the design of the website, and then they had a huge task editing hundreds of pages one by one, to match the new design.

Of course, to update the pages I was responsible for, I wrote a Perl script as a crude form of HTML templates, and my pages were done in five minutes. I offered my script to them to help them get their work done. They refused, saying, "we don't have time to learn new tools, we have hundreds of pages to edit!"

I was appalled at the time, but I've learned something since then: There are all sorts of people working with data, with HTML, and with code. To some people, it doesn't make a task easier to learn a new library -- it makes the task HARDER. To them, using a tool they know how to use already is a huge win, even if that tool solves the task inefficiently.

Eventually, a person trying to manipulate HTML with a regular expression hits a wall, where their tool simply can't solve the task. Some people will simply not be able to do some things. That's why they need to hire someone who has more tools.

Bill Karwin on November 16, 2009 10:43 AM

That is why I proposed a feature on meta.stackoverflow a long time ago to support question templates (like google code does), that would avoid such common cases

Domen Kozar on November 16, 2009 10:46 AM

The HTML::Sanitizer 0.04 module is available on BackPan at http://backpan.perl.org/authors/id/N/NE/NESTING/. However, it does not appear to pass its own test suite (2 of 4 tests fail in t/03security.t) using Perl 5.10.1 on MacOS X 10.5.8. Sadly, that makes it of limited relevance.

Jonathan Leffler on November 16, 2009 10:48 AM

Many programmers have a RegEx hammer and don't want to learn a DOM/XPath based screwdriver and ratchet set.

Sadly, (X)HTML is mostly nuts, bolts and screws. Yeah, you can hammer it together, but it will fall back apart soon enough.

John Lopez on November 16, 2009 10:52 AM

Personally, I always use a HTML parser whenever possible.

As a beginner in regular expressions, it's a huge pain in the arse to write a regular expression - let alone one to parse HTML.

Tangr on November 16, 2009 11:01 AM

Good points, but I think you left one important piece of advice out: don't do it at all. Both a library and a regex approach are broken solutions if your source HTML isn't up to the standard. Therefore, it is much better to tap into a structured data source, like XML, RSS, JSON, or an RDBMS. The HTML has to come from somewhere, right?

Of course, there are scenarios where you do not have that kind of access to the original data source, like when you write your own search engine :)

Ferdy on November 16, 2009 11:20 AM

Ie! Ie! Microsoft Fhtagn!

Søren on November 16, 2009 11:37 AM

What a timely post. You've just convinced me to abandon my RegEx parsing hack and try to find a more 'stable' approach.

Found Html Agility Pack on codeplex - http://htmlagilitypack.codeplex.com/ Had working code in 10 minutes. Hmm, maybe there's a lesson to be learned here...

Adam Lacey on November 16, 2009 11:42 AM

lol. See a much better, more sophisticated treatment over at esr's blog.

Andrew on November 16, 2009 11:55 AM

You say becoming a follower of Cthulhu like it's a bad thing?

mgb on November 16, 2009 12:33 PM

I really enjoyed this article today. You really nailed being a good developer.

Gabe on November 16, 2009 12:33 PM

I almost always use regular expressions to sanitize scraped content (add missing quotes, remove attributes that my parser of choice chokes on etc) and then run it through the parser. So far, so good.

Gustaf Sjöberg on November 16, 2009 12:40 PM

I don't waste time debating how to parse HTML since finding BeautifulSoup

jojo on November 16, 2009 12:43 PM

I scrape HTML that is purposefully malformed to muck up the scraping process, using Regex. Had been using the DOM structure, but that has its own problems.

If it works...

Steve on November 16, 2009 12:47 PM

There are no definitives really to this. The thing is, most people parsing HTML are doing it for a specific set of pages, usually in the same format. No, RegEx cannot perfectly parse HTML, but it can parse it when you know the exact form of the HTML.

I started a project intending to use a library to parse the HTML but it became more trouble than it was worth. I knew the sections of information I wanted to pull out and I knew the WYSIWYG editor only allowed a small set of HTML for formatting and links e.g. strong, italic, underline, a link, bullets, numbers... In the end it was not using anything more than a simple bit of code to pull out the same content in plain text.

pete on November 17, 2009 1:15 AM

@craigybear

The problem is that (x)html is not a markup language, it's an adhoc hacked together AST notation, and malformed html in particular is difficult because the rules for properly resolving html into its requisite tree structure are complicated and obtuse, and involve painful reverse engineering of multiple browsers. (it works in IE, so my markup must be correct!)

And so, if all you wanted to do was build a simple markup language, and a simple stylesheet language for sending your technical manual to the printers, yes, that's drop dead simple for any slightly "competant" programmer. But if you're Donald Knuth (You've heard of him, right?!), it takes about 10-20 years.

However, then using that markup language to extract useful information is an entirely different task for which a markup language is not really designed for. html was hacked into doing that task in the form of xml, but malformed tag soup, the sort of html you'd find out in the wild--- well let's just look at the facts: It takes a team of hundreds of developers several years to make a tolerably compatible html parser/renderer. And you're just gonna hack one up in a day, are you?

Breton on November 17, 2009 1:19 AM

So the conclusion that can be drawn here is that dogmatism is always bad? And yes, I realize the irony in that statement.

Julian on November 17, 2009 1:41 AM

You forgot to close the tag! Luckily, I think I got in there before all hell was unleashed.

Skizz on November 17, 2009 2:15 AM

What an awesome painting of Cthulhu.

Nick Wiggill on November 17, 2009 2:17 AM


I bet Chuck Norris can parse HTML using RegEx.

ClutchControl on November 17, 2009 3:22 AM

*ElderSign*
I bet Chuck Norris can parse HTML using RegEx.
*/ElderSign*

ClutchControl on November 17, 2009 3:24 AM

The code of Cthulhu....

Am I the only one here wondering just WHAT EXACTLY THE HELL would Cthulhu actually be developing?

Rob Uttley on November 17, 2009 3:37 AM

Well, ASP.Net uses regex to parse HTML, and it works quite well. In fact, open up your Reflector and point it at the "System.Web.UI.BaseParser" class and the "ParseStringInternal" method of the "System.Web.UI.TemplateParser" class. You will see that it can work, when used properly.

Ricardo Nolde on November 17, 2009 3:37 AM

Arrrrggghhhhhhh.....I was too late. The open cthulhu tag has gained enough power to swallow attempts to close it. Run. Run for your lives. Chaos is coming.

Skizz on November 17, 2009 3:53 AM

Regular expressions are based on math (formal language theory, actually:
http://en.wikipedia.org/wiki/Regular_expressions#Formal_language_theory
). (X)HTML is based on a tree structure which is a data structure. Those two fields are not related, that is why it's awkward to use regex to parse HTML.

Hoffmann on November 17, 2009 4:09 AM

What I do is HTML Tidy the user input/document and then use XSLT to whitelist acceptable parts. No scripts, styles, or proprietary shit make it through.

I've even added a third step before running it through the XSLT that adds cool features similar to markdown or textile.

You can take my code and run, if you want, I wrote this as an extension for Symphony CMS: http://github.com/rowan-lewis/htmlformatter/

Nobody on November 17, 2009 4:24 AM

Ok, I'm a fucking retard, the code above no longer uses the XSLT whitelist, but what the hell, you get the idea right?

Nobody on November 17, 2009 4:27 AM

The pingback 1.0 specification actually uses regexp for parsing HTML to autodiscover the pingback URL.

However, in this case I think it's not a bad case because it greatly simplifies handling and code.

Gasper Zejn on November 17, 2009 5:20 AM

Who else thinks there should be a Cthulhu badge for StackOverflow?

Joseph Cooney on November 17, 2009 5:37 AM

am i the only one that noticed that cthulhu is not a god, but a great old one

dystopia on November 17, 2009 5:46 AM

*cough* BeautifulSoup *cough*

geekboxjockey on November 17, 2009 6:16 AM

Are you all insane? HTML is easy to parse using any language that has good string handling/matching (even VB works, although it gets slow).

How do you think a browser manages this? Typesetting programmes have been doing the same for decades, with the same type of tags (think SGML), long before HTML. Evaluating a bunch of tags is a trivial first-principle task to any competent programmer.

Naturally, using a library is the quickest and most reliable way of doing this and there are lots of 'em.

What is the big deal?

Craiggybear on November 17, 2009 6:27 AM
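
As a rough illustration of the "use a library" route, here is a minimal sketch using Python's standard-library `html.parser` module (swapped in for BeautifulSoup so the example has no third-party dependency). The class name `LinkCollector` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, however the attributes are written."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names and strips the quotes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Hypothetical, deliberately sloppy markup: mixed case, single quotes,
# unquoted attributes, and an anchor that is never closed.
sample = "<p><A HREF='one.html'>one</a> <a class=x href=two.html>two</a><a name=top>"
collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

The variations in case, quoting, and attribute order that would each need their own regex branch are handled by the parser before `handle_starttag` is ever called.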

I remember years ago writing a web app and needing a back end piece to parse some HTML. It started out so simply and naively, and then a month later I had built this monstrous library of Perl regex to parse the HTML. It was a tar pit that held no escape.

Joseph Crotty on November 17, 2009 7:17 AM

@Rob
"Am I the only one here wondering just WHAT EXACTLY THE HELL would Cthulhu actually be developing?"

See Charles Stross's "The Atrocity Archives" (http://www.amazon.com/Atrocity-Archives-Charles-Stross/dp/0441016685/) and "The Jennifer Morgue" to see what software might interest Cthulhu and his kin.

Dave C. on November 17, 2009 7:36 AM

There should be a filter on Stack Overflow that automatically deletes, reports, or marks as dangerous any post that includes "html parsing regular expressions", and redirects the submitter to this post.

Youri on November 17, 2009 7:41 AM

Nice to see a respectable minority familiar with the mythos. Tekeli-li, motherfuckers!

"Only minor loss of sanity points...."
--Paul N on November 16, 2009 9:36 AM

But Paul N goes the extra step, exposing his familiarity with the beloved RPG; see you at GenCon! Now roll 3d20 and ignore the result.

*Cthulhu for president 2012 : Why settle for the lesser of two evils?*

Azathoth on November 17, 2009 8:30 AM

@Phil Brass: Regular expressions don't permit recursion. You've smuggled it in by using the semantics of the language in which you're defining the regex. You have to go to type-2 grammars in order to parse nested constructs (like HTML, or parentheses).

A Turing-complete language sits in a stronger computational class than a type-2 grammar (which, IIRC, corresponds to a pushdown automaton; regexes correspond to finite state machines), so it's not really surprising that you can parse HTML with regular expressions plus glue code in Perl or whatever, but it's still not really a good idea compared to writing a proper recursive-descent parser.

JamesP on November 17, 2009 9:05 AM
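
JamesP's distinction can be sketched in Python. This is a minimal sketch over a toy tag language (bare `<name>` and `</name>` tags plus text, no attributes), not a real HTML parser, and the function names `tokenize`, `parse`, and `parse_children` are hypothetical. A regex does the tokenizing, which is a regular problem; the recursion in `parse_children` is what tracks the nesting.

```python
import re

# Tokenizing is a regular problem, so a regex is fine for this step.
TOKEN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)>|([^<]+)")


def tokenize(text):
    """Yield ('open', name), ('close', name) or ('text', data) tuples."""
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        if m.group(3) is not None:
            yield ("text", m.group(3))
        elif m.group(1):
            yield ("close", m.group(2))
        else:
            yield ("open", m.group(2))
        pos = m.end()


def parse(text):
    """Parse the toy tag language into nested (name, children) tuples."""
    children, _ = parse_children(list(tokenize(text)), 0, None)
    return children


def parse_children(tokens, pos, open_name):
    """Recursive step: read nodes until the tag named open_name is closed."""
    children = []
    while pos < len(tokens):
        kind, value = tokens[pos]
        if kind == "text":
            children.append(value)
            pos += 1
        elif kind == "open":
            # One level of recursion per level of nesting
            # (bounded in practice by Python's recursion limit).
            inner, pos = parse_children(tokens, pos + 1, value)
            children.append((value, inner))
        else:  # closing tag
            if value != open_name:
                raise ValueError(f"unexpected closing tag </{value}>")
            return children, pos + 1
    if open_name is not None:
        raise ValueError(f"<{open_name}> was never closed")
    return children, pos


print(parse("<b><i>hi</i></b>"))
```

Feeding it `<b><i>hi</b></i>` raises a `ValueError`, because each recursive call remembers which tag it is waiting to see closed. That memory is exactly what a plain regular expression lacks.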

You can do whatever you like, even if it seems stupid, but only if you do it well.

Kapusta on November 17, 2009 9:25 AM

There is a big difference between parsing and simply extracting. Sure, you can't parse HTML with regex, but if you simply want to extract a bit of data, it works better. Sanitizing HTML, parsing it into a DOM, traversing it to extract the data, and then crossing your fingers that it will work for all the broken HTML out there seems like a big mistake when you just want to get a specific data value from a page. Not to mention that you end up with bloatware that may not even work. This is the problem with having a CS degree: you tend to think that theory trumps practice. I remember a co-worker who once implemented a whole PostScript parser to get at a single data value on page X of a document.

Chris S on November 17, 2009 9:53 AM
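
The extraction case Chris S describes might look like the following minimal sketch. The markup, the `price` class name, and the function `extract_price` are all invented for illustration; the point is only that the pattern targets one known fragment and ignores the rest of the (possibly broken) page.

```python
import re
from typing import Optional

# Assumes the value always appears as <span class="price">...</span> on the
# page being scraped; if the site changes its markup, this returns None.
PRICE = re.compile(r'<span class="price">\s*\$?([0-9]+(?:\.[0-9]{2})?)\s*</span>')


def extract_price(html: str) -> Optional[float]:
    """Pull one known value out of a page without building a DOM."""
    match = PRICE.search(html)
    return float(match.group(1)) if match else None


# Hypothetical page with unclosed tags elsewhere; the pattern does not care.
page = '<div><p>Widget<br><span class="price"> $19.99 </span><table><tr><td>'
print(extract_price(page))
```

This is extraction, not parsing: it makes no claim about the structure of the document, and it fails quietly as soon as the fragment is written differently (single quotes, an extra attribute, a nested tag).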

Content (c) 2009 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.