Parsing Html The Cthulhu Way

November 15, 2009

Among programmers of any experience, it is generally regarded as A Bad Idea™ to attempt to parse HTML with regular expressions. How bad of an idea? It apparently drove one Stack Overflow user to the brink of madness:

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.

Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions.

Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes.

That's right, if you attempt to parse HTML with regular expressions, you're succumbing to the temptations of the dark god Cthulhu's … er … code.

[Image: kraken-cthulhu.jpg]

This is all good fun, but the warning here is only partially tongue in cheek, and it is born of a very real frustration.

I have heard this argument before. Usually, I hear it as justification for seeing something like the following code:

 # pull out data between <td> tags
($table_data) = $html =~ /<td>(.*?)<\/td>/gis;

"But, it works!" they say.
"It's easy!"
"It's quick!"
"It will do the job just fine!"

I berate them for not being lazy. You need to be lazy as a programmer. Parsing HTML is a solved problem. You do not need to solve it. You just need to be lazy. Be lazy, use CPAN and use HTML::Sanitizer. It will make your coding easier. It will leave your code more maintainable. You won't have to sit there hand-coding regular expressions. Your code will be more robust. You won't have to bug fix every time the HTML breaks your crappy regex.

For many novice programmers, there's something unusually seductive about parsing HTML the Cthulhu way instead of, y'know, using a library like a sane person. Which means this discussion gets reopened almost every single day on Stack Overflow. The above post from five years ago could be a discussion from yesterday. I think we can forgive a momentary lapse of reason under the circumstances.
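To make the "use a library" advice concrete, here is a minimal sketch of the `<td>` extraction quoted above, written in Python with only the standard library's `html.parser` rather than the Perl and CPAN modules the quote mentions. The class name `CellCollector`, the helper `table_cells`, and the sample markup are hypothetical, chosen purely for illustration.

```python
import re
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text content of every top-level <td> element."""

    def __init__(self):
        super().__init__()
        self.cells = []     # finished cell texts
        self._depth = 0     # greater than 0 while inside a <td>
        self._buffer = []   # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <TD> and <td> look the same here.
        if tag == "td":
            self._depth += 1
            if self._depth == 1:
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == "td" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.cells.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def table_cells(html):
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


# Hypothetical sample: one cell has an attribute, the other odd casing and spacing.
sample = '<table><tr><td class="name">Cthulhu</td><TD >R\'lyeh</TD></tr></table>'

print(table_cells(sample))
# The quoted pattern, transliterated to Python, for comparison:
print(re.findall(r"<td>(.*?)</td>", sample, re.I | re.S))
```

On this sample the parser-based version returns both cells, while the transliterated regex returns nothing, because neither opening tag is literally `<td>`. The regex could be patched for these two cases, which is exactly the recurring bug fixing the quote complains about.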

Like I said, this is a well understood phenomenon in most programming circles. However, I was surprised to see a few experienced programmers in MetaFilter comments actually defend the use of regular expressions to parse HTML. I mean, they've heeded the Call of Cthulhu … and liked it.

Many programs will neither need to, nor should, anticipate the entire universe of HTML when parsing. In fact, designing a program to do so may well be a completely wrong-headed approach, if it changes a program from a few-line script to a bullet-proof commercial-grade program which takes orders of magnitude more time to properly code and support. Resource expenditure should always (oops, make that very frequently, I about overgeneralized, too) be considered when creating a programmatic solution.

In addition, hard boundaries need not always be an HTML-oriented limitation. They can be as simple as "work with these sets of web pages", "work with this data from these web pages", "work for 98% users 98% of the time", or even "OMG, we have to make this work in the next hour, do the best you can".

We live in a world full of newbie PHP developers doing the first thing that pops into their collective heads, with more born every day. What we have here is an ongoing education problem. The real enemy isn't regular expressions (or, for that matter, goto), but ignorance. The only crime being perpetrated is not knowing what the alternatives are.

So, while I may attempt to parse HTML using regular expressions in certain situations, I go in knowing that:

  • It's generally a bad idea.
  • Unless you have discipline and put very strict conditions on what you're doing, matching HTML with regular expressions rapidly devolves into madness, just how Cthulhu likes it.
  • I had what I thought to be good, rational, (semi) defensible reasons for choosing regular expressions in this specific scenario.

It's considered good form to demand that regular expressions be considered verboten, totally off limits for processing HTML, but I think that's just as wrongheaded as demanding every trivial HTML processing task be handled by a full-blown parsing engine. It's more important to understand the tools, and their strengths and weaknesses, than it is to knuckle under to knee-jerk dogmatism.
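One weakness worth understanding is nesting. The sketch below, in Python with made-up sample strings, shows a lazy pattern stopping too early on nested elements and a greedy one overshooting across siblings, next to a small `html.parser` subclass (the name `DivDepth` is hypothetical) that simply counts depth. It is an illustration of the trade-off, not a verdict on every regex use.

```python
import re
from html.parser import HTMLParser

nested = "<div>outer <div>inner</div> tail</div>"
siblings = "<div>one</div><div>two</div>"

# Lazy matching stops at the first closing tag, cutting the outer element short.
lazy = re.search(r"<div>(.*?)</div>", nested, re.S).group(1)

# Greedy matching runs to the last closing tag, swallowing both siblings.
greedy = re.search(r"<div>(.*)</div>", siblings, re.S).group(1)


class DivDepth(HTMLParser):
    """Record text by <div> nesting depth, something a plain pattern cannot count."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0
        self.text_by_depth = {}

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        self.text_by_depth.setdefault(self.depth, []).append(data)


tracker = DivDepth()
tracker.feed(nested)
tracker.close()

print(repr(lazy))             # 'outer <div>inner'
print(repr(greedy))           # 'one</div><div>two'
print(tracker.max_depth)      # 2
print(tracker.text_by_depth)  # {1: ['outer ', ' tail'], 2: ['inner']}
```

If the input is known never to nest the element you care about, the lazy pattern is fine. The point is to know that condition exists before relying on it.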

So, yes, generally speaking, it is a bad idea to use regular expressions when parsing HTML. We should be teaching neophyte developers that, absolutely. Even though it's an apparently neverending job. But we should also be teaching them the very real difference between parsing HTML and the simple expedience of processing a few strings. And how to tell which is the right approach for the task at hand.

Whatever method you choose -- just don't leave the <cthulhu> tag open, for humanity's sake.

Posted by Jeff Atwood
145 Comments

Depends on what you want to do with HTML. If you want to extract text, for example, regexes not only work, they work really well.

DMB on November 17, 2009 11:21 AM

>> *cough* BeautifulSoup *cough*

*cough* Extremely slow *cough*

DMB on November 17, 2009 11:23 AM

If you want a perl based HTML parser specifically designed to remove XSS type attacks, check out HTML::Defang (http://search.cpan.org/~kurianja/HTML-Defang-1.02/)

HTML::Defang uses a custom html tag parser. The parser has been designed and tested to work with nasty real world html and to try and emulate as close as possible what browsers actually do with strange looking constructs. The test suite has been built based on examples from a range of sources such as http://ha.ckers.org/xss.html and http://imfo.ru/csstest/css_hacks/import.php to ensure that as many as possible XSS attack scenarios have been dealt with.

Rob Mueller on November 17, 2009 12:04 PM

We only allow XHTML to be saved, and it is validated before it is saved. All problems are solved this way. The WYSIWYG editor only allows valid XHTML to be created.
Standard parsers for X(HT)ML are available en masse and are rock-stable. Of course, wandering through the DOM is not as easy as it seems at first glance, but once you understand the subtleties, the knowledge is useful for many tasks. This approach leads to performant, safe, and correct apps.

toettoe on November 17, 2009 12:14 PM

101'st

If you use xhtml it's pretty straightforward.

Anyways, everything's just a heap of div tags these days anyways.

Punky on November 17, 2009 12:43 PM

Per request.

Santi on November 18, 2009 2:32 AM

</cthulhu>

Santi on November 18, 2009 2:34 AM

What HTML are you parsing? Your own, someone you knows, or a whole engine for spidering? The first two could get away with regex, the last two I'm not so sure.

A different Chris S on November 18, 2009 5:13 AM

My last comment "Someone you knows" wasn't a West Country accent but a typo

A different Chris S on November 18, 2009 5:15 AM

Hello Coding Horror,

My name is Robert Sullivan and I am the advertising director for Dark Recesses Enterprise (www.darkrecesses.com). Dark Recesses is an on-line horror fiction periodical, published by Boyd E. Harris and edited by Bailey Hunter, among others.

Dark Recesses Enterprises wants to expand the contemporary definition of horror, to push the boundaries beyond the commercial marketplace definitions, by providing quality horror industry news and articles, and by publishing the best in short fiction, by today’s up and coming writers.

I am sending you this message just to make contact, to establish a line of communication. I do want to sell advertising space on our website and in our periodicals, but at this point I am taking a low pressure approach.

Please contact me at your earliest convenience.

Sincerely,

Robert Sullivan
Dark Recesses Enterprise
(Home) 256-747-8683
(Cell) 334-220-4117

Robert Sullivan on November 18, 2009 6:58 AM

Webrat?

http://github.com/brynary/webrat

Victor on November 18, 2009 7:36 AM

^

spam at its best

wow on November 18, 2009 8:33 AM

"The only crime being perpetrated is not knowing what the alternatives are."

I commit this crime regularly. In some cases, there's just so many options for everything you could think of doing... sometimes picking the 'right one' for the job takes longer than picking the first and hacking up a solution.

Steve-O on November 18, 2009 10:20 AM

Haven't you ever heard of CGIProxy, Glype, or PHProxy? These do exceptionally well at mirroring websites, especially Glype and CGIProxy, by modifying the HTML with regexes.

keldorn on November 19, 2009 12:53 PM

In the beginning I parsed XML with regex, then I learned XSLT, and there was much rejoicing!

Scott on November 20, 2009 5:15 AM

Jeff, I believe you dropped this:
</cthulhu>

CJH / esper on November 21, 2009 1:52 AM

You guys are just quitters... I parse HTML with Regular Expressions all the time. The trick is to do it in two passes. The first pass extracts the tag, the second pass processes the tag. I use this approach in PHP to import external web pages into a CMS together with all their referenced stylesheets, images, media, and javascript files. It also recursively parses the stylesheet external file references.

Actually, now that I think about it, this is kind of a compromise, since I use individual regexes for each tag. In other words, it's a kind of halfway house between a pure hand-crafted heuristic and a more orthogonal approach... This is probably the way the HTML parsing libraries do it anyway.

Ratty on November 21, 2009 6:07 AM
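The two-pass idea Ratty describes can be sketched roughly as follows. This is a Python illustration, not Ratty's PHP, and the patterns, the `WANTED` table, and the function name `referenced_files` are all assumptions. Pass one pulls out each opening tag, pass two picks attributes out of that tag's text.

```python
import re

# Pass 1: an opening tag is "<", a name, then everything up to the next ">".
# Known limitation: a ">" inside a quoted attribute value ends the match early.
TAG_RE = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b([^>]*)>")

# Pass 2: name=value pairs with double-quoted, single-quoted, or bare values.
ATTR_RE = re.compile(
    r"""([a-zA-Z_:][-\w:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))"""
)

# Hypothetical table of (tag, attribute) pairs that point at external files.
WANTED = {("img", "src"), ("script", "src"), ("link", "href")}


def referenced_files(html):
    refs = []
    for tag_match in TAG_RE.finditer(html):          # pass 1: find the tag
        name = tag_match.group(1).lower()
        attr_text = tag_match.group(2)
        for attr in ATTR_RE.finditer(attr_text):     # pass 2: process the tag
            key = attr.group(1).lower()
            value = attr.group(2) or attr.group(3) or attr.group(4) or ""
            if (name, key) in WANTED:
                refs.append((name, value))
    return refs


page = (
    '<link rel="stylesheet" href="site.css">'
    "<img src='logo.png'>"
    "<script src=app.js></script>"
)
print(referenced_files(page))
```

Note that this sketch will also match tag-like text inside comments and script bodies, so it only suits pages whose markup you can vouch for.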

@Ratty: And what's the advantage of not using an existing parser??? Too much spare time?

Kahl on November 22, 2009 10:47 AM

Can you please stop using PHP in a derogatory manner? After all, you're the one actually advocating writing enterprise apps on Windows.

John on November 22, 2009 12:02 PM

Parsing HTML with regular expressions = bad idea, granted, but locating something specific within HTML with regex = good idea. Want to find A tags: use regex. Want to locate images: use regex. Want to apply XSLT to HTML for the purpose of converting it to an RSS feed: use a dedicated parser. Html Agility Pack, Tidy, System.Html: all fine parsers, all easy to use, giving 99% of the result with 1% of the effort.

"A good artist copies. A great artist steals". Leverage an API!

BobSmall on December 1, 2009 7:27 AM

It's not really about using regex vs. some other parsing method; it's really just about the cohesion between the search mechanism and the rest of the software.

Regex has its place as a simple search mechanism. It's easy to implement and generally gets the job done in a productive fashion. If the searching is complex, then a different mechanism should be used.

The only thing that would irk me is if the searching function call was located deep within a 1000 line module. I wouldn't care at all if I had to replace a single search class.

If the project had unit tests, that makes replacing the algorithm even easier.

I'm posting the Cthulhu picture on my wall at work anyway.

Great post.

Bryan on December 1, 2009 7:44 AM

"Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn."

Mark Rogers on December 2, 2009 6:49 AM

Nice post. Good to see other developers considering the wrath of the Old Ones whilst they are working; having Shub-Niggurath show up due to faulty error trapping is in no one's interest.

christian on December 4, 2009 3:28 AM

It's not Cthulhu. It's ZA̡͊͠͝LGΌ.

Jon on February 6, 2010 11:22 PM

Instead of HTML::Sanitizer, just point to http://search.cpan.org/search?query=html+parser&mode=all and let people pick one.

Rob Kinyon on February 6, 2010 11:22 PM

Meh... use the right tool for the job. And sometimes, that means using regexes - if you're dealing with a consistently formed XML or HTML file, a simple regex may be a lot less effort than using a dedicated parser...

Simon on February 6, 2010 11:22 PM

Didn't we argue about this a year ago, and you dismissed me with "programming is hard, let's go shopping"?

Yes, that's right, you did: http://www.codinghorror.com/blog/archives/001172.html

Instead of putting your time into improving a working, open source HTML parser (which just recently added a selector engine), you wrote a bunch of hacky regex. Now you have 2 problems wasted valuable development hours, and you deserve the pain taunting my warnings.

Jon Galloway on February 6, 2010 11:22 PM

Also, your wack-ass busted old moveeeablee typee cobol blogg enginne hath wacked my comment formatting. Bah.

Jon Galloway on February 6, 2010 11:22 PM

I got downvoted on StackOverflow for saying that Regex is not the right solution for parsing HTML. It was offset by 11 upvotes, but some people will just never get it. It's one thing to use a regex to tokenize HTML, but another thing entirely to use them as if HTML were a regular grammar.

Jason Truesdell on February 6, 2010 11:22 PM

Jeff, didn't you spend a considerable amount of time in one of the StackOverflow podcasts trying to convince Joel that it was OK for you to try and parse Markup with a bunch of regular expressions, despite the fact that it's not a regular language and runs into a bunch of the same types of problems?

Mike McNertney on February 12, 2010 4:17 PM

Whoops heh, that's what I get for not looking at the date of the post... for some reason this just popped up in my rss reader again.

Mike McNertney on February 12, 2010 4:18 PM

Back in the day I wrote my own C HTML parser, back before it was a solved problem. I even had my own version of xpath for it.

James Rogers on February 23, 2010 10:48 AM

I was seduced by the RegExHtmlMonster. I woke up screaming and decided it was time to parse the nightmares away.

Michael Baun on March 5, 2010 11:54 AM

This guy slaps Cthulhu across the face and laughs heartily!
http://jmrware.com/articles/2009/uri_regexp/URI_regex.html

Dan Beam on September 29, 2010 8:47 PM

If the ideology of this post and most of the comments is to be believed as gospel, then the following book will certainly make the baby Jesus cry...

Mikeschinkel on October 22, 2010 6:45 PM

LOL.. now I understand bobince's persistence in MY post about regex vs HTML: http://stackoverflow.com/questions/3951485/regex-extracting-only-the-visible-page-text-from-a-html-source-document

(..and maybe some of you would be amused by my own persistence, too :)

However, as I stated numerous times in my comments, I wasn't out to parse the HTML per se, but "merely" interested in a much coarser extraction. And for my purposes, the regex approach works - it's a tradeoff between efficiency and total robustness. But the outcome is surprisingly solid. The final implementation can be found here: http://www.martinwardener.com/regex/

Mind you, regarding the "secondary" issue (extracting all links/URLs from an HTML document), it is of no concern that this implementation is over-eager (by design, btw) and picks out a few invalid URLs (mostly pertaining to script blocks) - those will be filtered out during the subsequent URL validation anyway.

D7samurai on October 25, 2010 3:34 AM
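The "over-eager extraction, then validation" trade-off described above might look something like this in Python. It is a sketch under assumptions: the pattern, the function names `candidate_urls` and `valid_urls`, and the rule that only absolute http(s) URLs survive are all made up for illustration and are not taken from the linked implementation.

```python
import re
from urllib.parse import urlparse

# Deliberately over-eager: any quoted href/src value anywhere in the text,
# including ones that sit inside script blocks.
URL_ATTR_RE = re.compile(r"""(?:href|src)\s*=\s*(["'])(.*?)\1""", re.I | re.S)


def candidate_urls(html):
    return [match.group(2) for match in URL_ATTR_RE.finditer(html)]


def valid_urls(candidates):
    """Keep only absolute http(s) URLs; everything else is filtered out."""
    kept = []
    for raw in candidates:
        try:
            parts = urlparse(raw.strip())
        except ValueError:
            continue  # urlparse rejects a few malformed inputs
        if parts.scheme in ("http", "https") and parts.netloc:
            kept.append(raw.strip())
    return kept


page = """<a href="http://example.com/about">About</a>
<script>var tpl = "<a href='{url}'>";</script>
<img src="https://example.com/logo.png">
<a href="javascript:void(0)">noop</a>"""

found = candidate_urls(page)
print(found)              # includes '{url}' and 'javascript:void(0)'
print(valid_urls(found))  # only the two absolute URLs remain
```

The extraction step is allowed to be sloppy precisely because the validation step exists. Without that second step, the junk from the script block would leak through.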

I was recently working on a Java project to retrieve all the unique words found in the content of a specified HTML page, and print them alphabetically along with their frequency on that page.

My program, instead of using regular expressions, reads the file line by line. Any text that is within the ending '>' and beginning '<' HTML brackets is read into a new variable. This new variable then contains all of the words found (visible, not alt tags) on that web page, separated by spaces.

Using this method, the only text that is really left out is image alt text, meta descriptions, and keywords. Three regular expressions, since you love them so much, could get those before or after the fact.

My program then built a Binary Search Tree based on the words found in that HTML file, along with their frequency. Being a web developer, I have found this a neat tool to have to evaluate keywords of a website, as it works quite well. Not saying it's the 'perfect parser', but it works with HTML, BROKEN HTML, PHP, ASP, or most any kind of web page out there.

Adam Richards on November 6, 2010 1:27 AM
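For readers who want to try the word-frequency idea above, here is a rough Python sketch. It swaps in the standard library's `html.parser` and `collections.Counter` in place of the line-by-line bracket scanning and binary search tree the comment describes, so it is a different implementation of the same goal. The names `VisibleText` and `word_frequencies` are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Gather text that appears between tags, ignoring script and style bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def word_frequencies(html):
    """Return (word, count) pairs sorted alphabetically."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z']+", " ".join(parser.chunks).lower())
    return sorted(Counter(words).items())


# Deliberately sloppy markup: the second <p> is never closed.
page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>The old ones wait.</p><p>The Old Ones dream</body>"
)
for word, count in word_frequencies(page):
    print(word, count)
```

As in the comment, alt text and meta tags are not counted. Those live in attributes, which `handle_starttag` receives and could collect if needed.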

I am a beginning Python coder. I wanted to be able to go to www.fictionpress.com (a website containing stories people write) and turn raw HTML into the story I am trying to extract. Is there any better method than using regular expressions? Using the RE module allowed me to not only parse the HTML, but also remove headers and footers I didn't want to see.

What should I be using instead of regular expressions for HTML?

Irregularme.wordpress.com on January 31, 2011 8:56 PM

If the HTML changes then almost any scraper or parser will fail. Anyone that says otherwise is bullshitting in order to justify doing the 'right thing'.

It is the right thing to use a DOM or other similar method because it can be easier to read the code, and there are often other useful functions hanging about that can make any future development easier. However this 'robust' BS needs to stop.

If the HTML changes then any parser breaks, unless it's your own HTML you're parsing and you design it in a careful way, i.e. using lots of unique 'id' attributes in tags so changing the structure of the HTML doesn't break anything.

I agree with Chris S from November 17, 2009. His comment still stands now.

xandrani on July 3, 2011 1:52 PM

Hi, Jeff.

You say "It's considered good form to demand that regular expressions be considered verboten, totally off limits for processing HTML, but I think that's just as wrongheaded as demanding every trivial HTML processing task be handled by a full-blown parsing engine." Well, let me tell you my story.

Once, I had a rep of 1486 on Stack Overflow. I was so excited because finally, FINALLY, I could create my own tags. This was the objective of my life. I had earned 616 rep points in one month. I deleted my Twitter and Google+ accounts so as not to lose a second. I needed a mere fourteen more points! My question at http://stackoverflow.com/q/6873945 would finally have a "mozmill" tag; http://stackoverflow.com/q/6797631 and http://stackoverflow.com/q/6797779 would have the "rhinounit" tag; I could solve problems such as http://meta.stackoverflow.com/q/98584 by myself whenever I found them. I rejoiced in anticipation.

Then I found a quite innocent question about extracting some data from HTML. It seemed to be a pretty stably structured document, so I answered with a regex that could solve the problem: http://stackoverflow.com/q/6878032#6878203 Note that I emphasized that the solution was quick'n'dirty and that an unstable document would require a more sophisticated tool.

And I got a downvote. I could see my dreamt-of tags going away. I had just taken two steps back; my journey would be longer. What if more people found my answer and downvoted it too? What if I lost hundreds of rep points?! My tags! MY TAGS! I panicked. I barely managed to restrain my mourning long enough to give, between hiccups, my testimony here.

There is a clear lesson here: do not parse HTML with regular expressions in any way. It can destroy your dreams, your soul, your life. If you do it, you'll end up smoking crack. I learned the lesson and am trying to rebuild my life, maybe - MAYBE - with the ability to create tags on SO. Do not make my mistake. It is not worth it.

Suspensaodedescrenca.wordpress.com on July 29, 2011 1:54 PM

Jeff, I really enjoyed your article. I posted an answer to the question on SO you referred to in this article here http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/7564061#7564061. Seeing as there are so many answers, it may never be read, but what do you think about Balancing Group Definitions? I just find it interesting b/c it allows a regex engine to have state and act as a PDA.

Holler if you find my response interesting.

Samuel Smith on September 26, 2011 9:08 PM
