Parsing: Beyond Regex

April 1, 2005

I've blogged ad nauseam about how much I love Regular Expressions, but even the mighty regular expression has limits. As noted in Daniel Cazzulini's blog:

A full-blown programming language cannot be parsed with regular expressions. But given the limited number of programming languages (successful ones, let's say), how big do you think is the niche for getting proficient with those tools/techniques?

There is an inmensely bigger amount of common problems and small parsing needs that are very cost-effectively solved with regular expressions. For example, you don't need much more than that to parse XML, XPath, XPointer, DataBinder.Eval-like .NET expressions, templates, MSBuild property references, Postbuild commands, etc etc etc. So becoming proficient with regexes is much more important and relevant to solve day to day problems than mastering BNF, lex/yacc, or any other full-blown parsing tecniques/tools, IMO.

Indeed, when colorizing code with simple(ish) regular expressions, you'll run into annoying edge conditions like this one:

  Dim s as String
  s = "This is a string with ""quotes"""

Or, let's say we wanted to parse out HTML tags with this naive regular expression:

<tag[^>]*>(.*?)</tag>

Seems solid enough, right? Well, there are problems. What about nested tags? What about a tag that contains improperly escaped characters, like this one (from a real blog, by the way):

<input type="submit" name="previewcomment" value="preview >>">

While you can hack around these problems with more and more regular expression cleverness, you eventually paint yourself into a corner with complexity. Regular expressions don't truly understand the code that they are colorizing-- but parsers do. Depending on the problem you're attacking, at some point you have to bite the bullet and utilize a full-blown parser.

Drazen Dotlic recently published a an interesting CodeProject article discussing how to use the CoCo-R parser to parse and colorize C#:

Why don't existing tools provide better formatting and color coding? Because parsing is hard. I thought I knew well most of the C# language constructs before I started working on the Colorizer. Boy, was I wrong! Throughout the course of this project, I have run into several constructs I have never seen before. I have also learned to appreciate more the work of the guys building the C# compiler. If it takes a team of people in Microsoft to properly deal with this issue, how could I have done it alone and in my spare time?

Sir Isaac Newton said "If I have seen farther than others, it is because I was standing on the shoulders of giants", and in this case, I was standing on the shoulders of Coco-R.

Coco-R is a compiler compiler (I guess that's where coco comes from). You have probably heard of tools like lex and YACC - Coco-R is a modern version of these tools. What's really great about it is that there are ports to several languages including C#. This is very convenient because additional processing you may want to do during parsing can be written in the same language tool itself is written in - C#.

Drazen's article offers a HTTP handler and a web control that harness the CoCo-R parser to colorize and reformat code with a level of fidelity you'll never get from regular expressions. Good stuff.

Posted by Jeff Atwood
5 Comments

Full scale parsing is not cheap, but I admit I have not tested it to see how expensive it really is.
There is no VB.NET parser for the moment :) and yes, I did know this was coming ;) On the other hand, I was "challenged" by few guys in the comment section of the CodeProject article to write more about how to write a parser for a language, so VB.NET might be a good language to describe basics ;) Actually, considering that attributed grammar for C# could be to a certain extent reused for VB.NET, in the end it might be relatively easy to generate a parser - all you need to do is to express EBNF syntax of VB.NET in Coco-R, reuse some of hand-written code and Coco-R will generate parser for you. Then you need to add the same code I added for C# parser (colorizing and formatting) but that's the easy part - all of the heavy lifting was already done by Coco-R guys...

Drazen Dotlic on April 1, 2005 3:30 AM

Thanks for the kind words.

Drazen Dotlic on April 1, 2005 3:39 AM

Drazen, can you comment on the performance profile of this technique, and how you're using it? Any FAQs or other stuff we should know?

And-- you knew this was coming-- is there a VB.NET parser?

Jeff Atwood on April 1, 2005 11:05 AM

COCO/R for VB.NET

http://www.ssw.uni-linz.ac.at/coco/

Alejandro Varela ( rosario, Argentina) on February 1, 2008 5:59 AM

HTML is not a regular language. (See: http://en.wikipedia.org/wiki/Regular_language for information) Using regular expressions to 'parse' it is a really bad idea.

http://htmlparsing.icenine.ca/ for even more information

Just use a parser

Vaclav Verio on January 10, 2009 6:47 AM

The comments to this entry are closed.