I love regular expressions. No, I'm not sure you understand: I really love regular expressions.
You may find it a little odd that a hack who grew up using a language with the ain't keyword would fall so head over heels in love with something as obtuse and arcane as regular expressions. I'm not sure how that works. But it does. Regular expressions rock. They should absolutely be a key part of every modern coder's toolkit.
If you've ever talked about regular expressions with another programmer, you've invariably heard this 1997 chestnut:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
The quote is from Jamie Zawinski, a world class hacker who I admire greatly. If he's telling us not to use regular expressions, should we even bother? Maybe, if you live and die by soundbites. But there's a bit more to the story than that, as evidenced by Jeffrey Friedl's exhaustive research on the Zawinski quote. Zawinski himself commented on it. Analyzing the full text of Jamie's posts in the original 1997 thread, we find the following:
Perl's nature encourages the use of regular expressions almost to the exclusion of all other techniques; they are far and away the most "obvious" (at least, to people who don't know any better) way to get from point A to point B.
The first quote is too glib to be taken seriously. But this, I completely agree with. Here's the point Jamie was trying to make: not that regular expressions are evil, per se, but that overuse of regular expressions is evil.
I couldn't agree more. Regular expressions are like a particularly spicy hot sauce -- to be used in moderation and with restraint, only when appropriate. Should you try to solve every problem you encounter with a regular expression? Well, no. Then you'd be writing Perl, and I'm not sure you need that kind of headache. If you drench your plate in hot sauce, you're going to be very, very sorry later.
In the same way that I can't imagine food without a dash of hot sauce now and then, I can't imagine programming without an occasional regular expression. It'd be a bland, unsatisfying experience.
But wait! Let me guess! The last time you had to read a regular expression and figure it out, your head nearly exploded! Why, it wasn't even code; it was just a bunch of unintelligible Q*Bert line noise!
Calm down. Take a deep breath. Relax.
Let me be very clear on this point: If you read an incredibly complex, impossible to decipher regular expression in your codebase, whoever wrote it did it wrong. If you write regular expressions that are difficult to read into your codebase, you are doing it wrong.
Look. Writing so that people can understand you is hard. I don't care if it's code, English, regular expressions, or Klingon. Whatever it is, I can show you an example of someone who has written something that is pretty much indistinguishable from gibberish in it. I can also show you something written in the very same medium that is so beautiful it will make your eyes water. So the argument that regular expressions are somehow fundamentally impossible to write or read, to me, holds no water. Like everything else, it just takes a modicum of skill.
Buck up, soldier. Even Ruby code is hard to read until you learn the symbols and keywords that make up the language. If you can learn to read code in whatever your language of choice is, you can absolutely handle reading a few regular expressions. It's just not that difficult. I won't bore you with a complete explanation of the dozen or so basic elements of regular expressions; Mike already covered this ground better than I can:
I'd like to illustrate with an actual example, a regular expression I recently wrote to strip out dangerous HTML from input. This is extracted from the SanitizeHtml routine I posted on RefactorMyCode.
var whitelist =
  @"</?p>|<br\s?/?>|</?b>|</?strong>|</?i>|</?em>|
    </?s>|</?strike>|</?blockquote>|</?sub>|</?super>|
    </?h(1|2|3)>|</?pre>|<hr\s?/?>|</?code>|</?ul>|
    </?ol>|</?li>|</a>|<a[^>]+>|<img[^>]+/?>";
What do you see here? The variable name whitelist is a strong hint. One thing I like about regular expressions is that they generally look like what they're matching. You see a list of HTML tags, right? Maybe with and without their closing tags?
Honestly, is this so hard to understand? To me it's perfectly readable. But we can do better. In most modern regex dialects, you can flip on a mode where whitespace is no longer significant. This frees you up to use whitespace and comments in your regular expression, like so.
var whitelist =
  @"</?p>|
    <br\s?/?>|        (?# allow space at end)
    </?b>|
    </?strong>|
    </?i>|
    </?em>|
    </?s>|
    </?strike>|
    </?blockquote>|
    </?sub>|
    </?super>|
    </?h(1|2|3)>|     (?# h1,h2,h3)
    </?pre>|
    <hr\s?/?>|        (?# allow space at end)
    </?code>|
    </?ul>|
    </?ol>|
    </?li>|
    </a>|
    <a[^>]+>|         (?# allow attribs)
    <img[^>]+/?>      (?# allow attribs)
  ";
Do you understand it now? All I did was add a smattering of comments and a lot of whitespace. The same exact technique I would use on any code, really.
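For readers outside .NET, here is a rough Python analogue of the commented whitelist. It's a sketch, not the SanitizeHtml code: I'm assuming Python's `re.VERBOSE` flag as the stand-in for IgnorePatternWhitespace, and the `is_allowed` helper is a name made up for illustration.

```python
import re

# Python analogue of the whitelist above. re.VERBOSE is assumed here as the
# counterpart of .NET's IgnorePatternWhitespace: unescaped whitespace in the
# pattern is ignored and "#" starts a comment that runs to the end of the line.
WHITELIST = re.compile(r"""
    </?p> |
    <br\s?/?> |          # allow space at end
    </?b> |
    </?strong> |
    </?i> |
    </?em> |
    </?s> |
    </?strike> |
    </?blockquote> |
    </?sub> |
    </?super> |
    </?h(1|2|3)> |       # h1, h2, h3
    </?pre> |
    <hr\s?/?> |          # allow space at end
    </?code> |
    </?ul> |
    </?ol> |
    </?li> |
    </a> |
    <a[^>]+> |           # allow attribs
    <img[^>]+/?>         # allow attribs
""", re.VERBOSE)


def is_allowed(tag: str) -> bool:
    """Hypothetical helper: True when the whole string is a whitelisted tag."""
    return WHITELIST.fullmatch(tag) is not None
```

Calling `is_allowed("<br />")` accepts the tag, while `is_allowed("<script>")` rejects it. The pattern reads the same as before; only the layout changed.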
But how did I cook up this regular expression? How do I know it does what I think it does? How do I test it? Well, again, I do that the same way I do with all my other code: I use a tool. My tool of choice is RegexBuddy.
Now we get syntax highlighting and, more importantly, real time display of matches there at the bottom in our test data as we type. This is huge. If you're wondering why your IDE doesn't automatically do this for you with any regex strings it detects in your code, tell me about it. I've been wondering that very same thing for years.
RegexBuddy is far and away the best regex tool on the market in my estimation. Nothing else even comes close. But it does cost money. If you don't use software that costs money, there are plenty of alternatives out there. You wouldn't read or write code in notepad, right? Then why in the world would you attempt to read or write regular expressions that way? Before you complain how hard regular expressions are to deal with, get the right tools!
This trouble is worth it, because regular expressions are incredibly powerful and succinct. How powerful? I was able to write a no-nonsense, special purpose HTML sanitizer in about 25 lines of code and four regular expressions. Compare that with a general purpose HTML sanitizer which would take hundreds if not thousands of lines of procedural code to do the same thing.
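To give a feel for how little code a special purpose sanitizer needs, here is a minimal Python sketch of the general idea: find anything that looks like a tag, and keep it only if it matches a whitelist. This is not the SanitizeHtml routine itself; the shortened `ALLOWED` list, the `ANY_TAG` pattern, and the `sanitize` function are all assumptions made for illustration, and nothing here balances tags or inspects attributes.

```python
import re

# Hypothetical, deliberately short whitelist (not the article's full list).
ALLOWED = re.compile(r"</?(p|b|i|em|strong|pre|code)>|<br\s?/?>")

# Anything that looks like a tag, including an unterminated one at the end.
ANY_TAG = re.compile(r"<[^>]*(>|$)")


def sanitize(html: str) -> str:
    """Drop every tag-like run that is not an exact whitelist match."""
    def keep_or_drop(match: re.Match) -> str:
        tag = match.group(0)
        return tag if ALLOWED.fullmatch(tag) else ""
    return ANY_TAG.sub(keep_or_drop, html)


print(sanitize("<p>hi <script>alert(1)</script></p>"))
```

Note that only the tags are removed; the text between a rejected pair of tags stays in the output as plain text.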
I do have some tips for keeping your sanity while dealing with regular expressions, however:
Enable the IgnorePatternWhitespace option, then use that whitespace to make your regex easier for us human beings to parse and understand. Comment liberally.
If you're afraid of regular expressions, don't be. Start small. Used responsibly and with the right tooling they are big, powerful -- dare I say, spicy -- wins. If you make regular expressions a part of your toolkit, you'll be able to write less code that does more. It'll just... taste better.
You might enjoy them so much, in fact, that you completely forget about that "second problem".
Posted by Jeff Atwood
why
do
programmers
think
adding
whitespace
makes
things
easier
to
read?
It
doesn't.
Stop
doing
it.
Sean on June 27, 2008 12:16 PM
I always wondered how people got by without regexes. Then I started asking that Steve Yegge question in interviews (the one about replacing all phone numbers in a huge site with one email address). Now I know. And I'm sadder for the knowledge.
Tom Clancy on June 27, 2008 12:19 PM
I hope there's quite a bit more to your HTML sanitiser than just a few regular expressions. Sanitisation is an extremely hard problem, which can only really be solved using a proper parser. For example, using just regular expressions it's very hard to ensure that your users have properly closed their tags - an unclosed pre tag could seriously affect the layout of the rest of your page.
Even worse is that you appear to be allowing all attributes on image and link elements. This allows people to inject XSS vulnerabilities using onclick, onmouseover and other attributes. Even the href attribute needs to be sanitised, or someone could use a javascript: (or vbscript: or v\rb\rs\rc\ri\rp\rt:) protocol to XSS your application.
Remember, when you're sanitising HTML you aren't just defending against known attacks based on the HTML spec, you're also defending against weird undocumented glitches in the various different browsers.
There's a good discussion thread (with interesting links) about this problem on my blog, here: http://simonwillison.net/2007/Mar/12/xss/
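The attribute concern is easy to check. Below is a small Python sketch that runs the article's link pattern against three made-up anchor tags. `STRICT_LINK` is a hypothetical tighter pattern added purely for contrast; it is not from the article and is not claimed to be a sufficient defence on its own.

```python
import re

# The permissive link pattern from the article's whitelist.
LINK = re.compile(r"<a[^>]+>")

# Made-up sample tags: one harmless, two carrying script.
samples = [
    '<a href="https://example.com">',
    '<a href="javascript:alert(1)">',
    '<a href="#" onclick="alert(1)">',
]

# Every sample passes, because [^>]+ accepts any attribute text at all.
accepted = [s for s in samples if LINK.fullmatch(s)]

# Hypothetical stricter pattern: exactly one href, http or https only.
STRICT_LINK = re.compile(r'<a\shref="https?://[^"<>\s]+">')
strict_accepted = [s for s in samples if STRICT_LINK.fullmatch(s)]

print(len(accepted), len(strict_accepted))
```

The permissive pattern lets all three tags through; the stricter one keeps only the plain `https` link.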
Simon Willison on June 27, 2008 12:25 PM
Sean,howareyousosurewhitespacedoesn'tmakethingseasiertoread?
IknowIgetsomebenefitoutoftheoccasionalspace.
Obviouslycopiousamountsofcarriagereturnswon'tdoanybodyanygood,
butyoudon'talwayshavetotakethingstoexcess.
Sean:
Becau seBadWhite Spaci
Can Tota
l y Messyou up
I strongly disagree with Sean. White space and comments are huge aids in reading code.
Dan on June 27, 2008 12:26 PM
I love regular expressions too, but your example of choice here is poor for defending them against detractors. Using regular expressions to parse any form of markup, especially one as wildly variant as HTML, is almost never a good idea. In the example you gave, the tags that allow attributes will fail on attributes with closing tags in their values. Additionally, unless you have some sort of extra layer of filtering on that, you're opening yourself to an XSS attack via malicious on* event handler attributes in images and links. Now, you /could/ fix it up by making the pattern matching more specific. Create a list of possible attributes and allow it to appear an unlimited number of times. And within that subgroup, make specific matching rules for string start and termination. But now you have to handle all the possible cases of HTML. You need to check for double or single quotes, and you can't make sure the quote type matches without backreferences, which are part of the 'hacks' that have been bolted on to modern matching algorithms. And that's just a start on the potential problems. A syntax as inexact as HTML (or almost any markup if we factor in the potentially recursive nature) is something that demands a real parser if you want to do it right instead of providing more fodder for people to complain about.
Cthulhon on June 27, 2008 12:29 PM
Regarding Regex Buddy.
No trial version. No purchase. Period.
Randy Magruder on June 27, 2008 12:31 PM
Nice post, Jeff. I'd have to say the biggest problem I see is with programmers who don't think it's worthwhile or "fun" to learn regex. Almost all the ones I've worked with avoid it like the plague. But then once they start using the 15th language in a row that supports regex they start realizing that it's probably not such a bad idea.
I can't tell you how many stop-gap translation applications I haven't had to build just because of regex support in Java, XML, PHP....
Raymond on June 27, 2008 12:32 PM
I never knew about the IgnoreWhiteSpace option. That makes life much better. Thanks!
Mike on June 27, 2008 12:35 PM
ORANGE!
I'm working on a regex tool built in WPF. It's primarily a learning thing for me, but I'm looking to make it a slimmed down, bloat free regex go-to tool for those I-just-need-to-test-this-for-a-sec regex moments.
http://statestreetgang.net/post/2008/05/Regex-and-WPF.aspx
I've already made several improvements to that code, and I think I'll be entering it in this upcoming "community coding contest".
That'll give me some motivation to finish it.
Will Sullivan on June 27, 2008 12:44 PM
(not ^that^ Sean)
Other Sean: Adding whitespace makes things easier to read. If it doesn't work for you, then you must be some mental-parsing genius. Great. You're better than the rest of us. Go on with your life now.
Jeff: I agree with your RegEx assessment.
Sean on June 27, 2008 12:45 PM
My favourite I've been using for years is Regex Coach - http://www.weitz.de/regex-coach/
It's free, with donations encouraged.
Works on both Linux & Windows.
Mark on June 27, 2008 12:46 PM
Damn you Jeff! Since I read your post I've been thinking a lot about eating a hamburger with lots of Tabasco. I was subliminally attacked by your free publicity... Great blog, read it every day.
Waldemar on June 27, 2008 12:49 PM
Awesome, "Regular expression for testing prime numbers" :
http://mail.pm.org/pipermail/athens-pm/2003-January/000033.html
print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/
This is by Abigail, who is something of a legend in the Perl community.
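The one-liner above can be unpacked in Python. This is a sketch that assumes the same trick carries over unchanged: write n as a string of n ones, then let the regex look for a repeated block that tiles the whole string. The names `NOT_PRIME` and `is_prime` are made up for illustration.

```python
import re

# Matches unary strings whose length is NOT prime:
#   ^1?$         -> length 0 or 1
#   ^(11+?)\1+$  -> some block of two or more ones, repeated at least twice,
#                   covers the whole string, so the length has a divisor
NOT_PRIME = re.compile(r"^1?$|^(11+?)\1+$")


def is_prime(n: int) -> bool:
    """Primality test via backtracking; fun, but far too slow for real use."""
    return NOT_PRIME.match("1" * n) is None


print([n for n in range(20) if is_prime(n)])
```

The backreference `\1` is what makes this work, and it is also why this pattern is not a regular expression in the formal sense.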
TiTi on June 27, 2008 12:52 PM
I'll admit I have made a regex or two with poor whitespace, however it is nice once you get to the point where you can read regex like your native language. That said, recursive regular expressions can still be kind of confusing.
Will on June 27, 2008 12:56 PM
Don't get me wrong. I understood what you meant.
But I find it kinda fun that you love regexps and don't like xml. :)
GoA on June 27, 2008 12:57 PM
Ditto on strongly disagreeing with the first comment. White space turns ordinary obfuscated perl into something a bit more, uh, pythonic.
Steven Klassen on June 27, 2008 12:57 PM
Rail against XML yet embrace Regex!?! Me no understand so good.
Of course, I will admit to using regular expressions very rarely, whereas I utilize XML rather extensively.
Kenneth on June 27, 2008 01:02 PM
In the same vein as RegexBuddy, IntelliJ has a fantastic regex plugin:
Ryan Breidenbach on June 27, 2008 01:10 PM
Oops, wrong link:
http://plugins.intellij.net/plugin/?id=19
Ryan Breidenbach on June 27, 2008 01:10 PM
@Randy - there's an old trial version (2.04) of RegexBuddy floating around various shareware sites. I couldn't recommend it more - it really is a fantastic tool. Even their website is valuable as a reference for writing regular expressions.
Richard Dingwall on June 27, 2008 01:10 PM
A coworker and I cracked up laughing reading this post. We have someone on our team who overuses regular expressions and hot sauce.
Tim B on June 27, 2008 01:15 PM
Brilliant, well said. I struggle with RE but I do value them. JWZ is eminently quotable but usually quoted wildly out of context. Another of his great lines is "Linux is only free if your time has no value", often used by MS zealots to bash Linux zealots.
Rev Matt on June 27, 2008 01:18 PM
First, a great cheat sheet for writing these things...
http://www.ilovejackdaniels.com/cheat-sheets/regular-expressions-cheat-sheet/
I find them incredibly useful, mostly I use them for validating input... basic stuff like making sure the user entered a valid email address or phone number or date. Maybe I haven't used them enough, but I feel like bashing my head on the desk whenever I'm trying to write them.
I had a class in college where we were given regular expressions and we had to walk through them with specific terms too. That wasn't much better.
Kris on June 27, 2008 01:21 PM
Eclipse users can get QuickREx for free. It's fantastic.
http://www.bastian-bergerhoff.com/eclipse/features/web/QuickREx/toc.html
Bossy Joe on June 27, 2008 01:21 PM
There is one called Expresso that is pretty slick. It's "free" as far as I'm aware (just a nag screen)
HB on June 27, 2008 01:24 PM
I agree with others who said that regular expressions are not a good way to sanitize HTML. Somewhat appropriately, IIRC, the JWZ quote was about someone using RegEx for parsing HTML. JWZ wasn't complaining in general about overuse of RegEx'es (though he has), but the context of that quote I believe was specifically for using them in HTML where they're not the right tool. He suggested more of a true parser for HTML. I'm sure javascript and XSS exploits have just made that statement even more true.
Grain of salt warning - my memory isn't 100% sure this was on HTML, though it was either HTML or SGML/XML. The original post was on usenet, and a google search only shows the quote, not the full context. If it was HTML, it's somewhat ironic that the example is HTML yet pulls in the JWZ quote recommending against it.
Rich on June 27, 2008 01:25 PM
DON'T use regular expressions to parse markup (HTML/XML/whatever).
http://htmlparsing.icenine.ca/
http://wiki.hypexr.org/wikka.php?wakka=/RegexFAQ
Am I the only one still curious what the official verdict on that Mensa page is?
Jeff, are you going to post your thoughts on it?
PaoloB on June 27, 2008 01:37 PM
As with other languages you mention (e.g. Ruby), you have to use regular expressions often enough that you internalize some of the less, um, user-friendly aspects of the expressions. Classic cognitive problem in regular expressions: reserved characters that mean different things in different contexts. ("Well, unless it's inside square brackets. Then it means ....") An interesting dilemma for a) regex novices working on b) straightforward problems is that it can be just as fast, by the time you take into account all the debugging time for your regular expression, to just write a parsing function in Your Language Of Choice. Not as elegant, of course, but for one-offs, it can be awful tempting to forego the headaches of "line-noise" syntax ...
mike on June 27, 2008 01:39 PM
I overuse hot sauce, but I'm Mexican so it's OK for me. And please, please, no more Tabasco: that's not real "salsa", it's not even hot. If you want something hot and delicious, try "salsa de chile habanero"
http://www.salsasetc.com/graphics/H-175A%20large.jpg, it will make your tushi burn like never before.
Juan Zamudio
I never, ever want to hear from anyone, ever again, that programming in assembly language is hard or useless.
Rob on June 27, 2008 01:50 PM
Of course then you always get the guy asking you for help making a Regular Expression to match strings of balanced parentheses.
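The joke lands because a regular expression in the classical sense cannot count arbitrarily deep nesting, though some modern engines add recursion extensions that can. The usual answer is a few lines of ordinary code. Here is a minimal Python sketch; `is_balanced` is a name made up for illustration.

```python
def is_balanced(text: str) -> bool:
    """Return True when every "(" has a matching ")" in the right order."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # a ")" appeared before its "("
                return False
    return depth == 0            # nothing may be left open


print(is_balanced("(()())"), is_balanced("(()"))
```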
Steve Steiner on June 27, 2008 01:52 PM
This is the reg expression to rule all reg expressions: %s\n !:P
Just kidding! But do take a look at this small post on how to read quoted strings with scanf: http://narg.eu/?p=6 - in one simple reg exp.
I ~love~ regular expressions. I do know that they're slow compared to strstr, strpos, or whatever your language's equivalents are, because they are yet a different coding language that gets PARSED and COMPILED. At their basic level, they represent a finite state machine (this is not so much true with modern regex, but the basic commands in like POSIX regex are).
Therefore - long regular expressions are going to be SLOWER than shorter ones. Personally... I'd have taken the article's long 'or' string and broke it up into a loop over a list of allowed elements (easier to add onto later too).
Something to note is that every regex engine is different - some optimize things differently (ie, if your parser is naive about building the FSM, the long or statement above will result in a huge structure, one for every OR), some have wholly different functionality (though, in general, 'keyword' characters are consistent), and some have different 'shortcuts', especially for character classes. The most widely used is probably PCRE (Perl Compatible Regular Expressions), which obviously works just like Perl, but it's a C library that is used in a number of different places, but its syntax is just a little different than say Java or Javascript's syntax, which is very different than BRE (basic regex), etc.
It's super powerful though, however you cut it. It's one of those tools that programmers need to know, because oftentimes it's the best tool for the job - especially in the text-based web world.
Justin on June 27, 2008 01:57 PM
Jeff, any thoughts on introducing BNF? It is the perfect complement to fill in the gaps of regular expressions. Simplest possible way to get balanced matching and parsing. Arguably easier to use than regular expressions. It is supported in all major programming languages. Whenever turning text into a data structure, reach for a BNF first.
Besides, the complexity of regular expressions seems to grow at length^2. BNFs feel more like log(length).
http://en.wikipedia.org/wiki/Backus-Naur_form
Kyle on June 27, 2008 02:16 PM
Here is a really cool tutorial (video) and cheatsheet for learning regular expressions:
http://e-texteditor.com/blog/2007/regular_expressions_tutorial
This was a tremendous help for me when I started out trying to learn how to use regexes (for use when programming in ruby), and since then I have found more and more use for them.
The editor e ( http://e-texteditor.com ), that is used in the tutorial, is probably what has helped me most keeping my regex ability up-to-date and in use daily. It has a cool way of live highlighting what your regex matches in the text as you type it, that really helps you out when you try to use regexes while editing (kind of like a realtime regexbuddy that is always there when you need it).
Aaron on June 27, 2008 02:16 PM
> 1. DON'T use regular expressions to parse markup (HTML/XML/whatever).
> 2. I agree with others who said that regular expressions are not a good way to sanitize HTML.
> 3. Sanitisation is an extremely hard problem, which can only really be solved using a proper parser.
You *can* solve it in regex if you define the solution very, very strictly as we have. It's really a special case. There are a few regexes I use to accomplish this. See the actual code here:
http://refactormycode.com/codes/333-sanitize-html#refactor_11455
Comment there if you test the code and find it doesn't work. I think you'll be pleasantly surprised.
Actual tag balance has to be achieved in another, unrelated routine. Perfectly "safe" HTML can have unbalanced tags.
Jeff Atwood on June 27, 2008 02:16 PM
As a seasoned Perlmonger, I regularly deal with complicated regexes that do some very tricky stuff. Fortunately, Perl is excellent at providing you with nice syntax to make regexes both readable and scalable.
Here's how I would have implemented your example:
http://pastebin.com/f467492d4
Note that Perl makes it extremely easy to build a regex from sections, defining each part separately, with full commenting. Much of the body of the regex can be easily factored out into arrays, which are considerably easier to modify!
Perl also provides a natural syntax for including comments within your regexes. Both valuable techniques for building large, but usable regexes.
Admittedly, your example is probably a bit too simplistic for the slightly verbose treatment I've given it. But imagine a more complicated regex..
The way I see it is, if you think you need something like RegexBuddy, you probably need to refactor your regex into easily-understandable (and easily-testable) component parts instead. I can see how it might be useful if you're trying to reverse-engineer someone's badly-written opaque regex, or if you're trying to match a very complicated pattern. But in general I would say if you need it, you're doing it wrong.
(What were you thinking? talking about regexes and taking a poke at Perl in the same sentence? you really brought it on yourself! :))
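The pastebin link holds the Perl version. As a rough illustration of the same build-it-from-parts idea, here is a Python sketch that assembles the article's whitelist from plain lists. The list names and the grouping into "paired" and "void" tags are assumptions made here, not a translation of the linked code.

```python
import re

# Hypothetical split of the article's tag list into editable pieces.
PAIRED_TAGS = ["p", "b", "strong", "i", "em", "s", "strike", "blockquote",
               "sub", "super", "pre", "code", "ul", "ol", "li"]
VOID_TAGS = ["br", "hr"]

paired = "|".join(re.escape(tag) for tag in PAIRED_TAGS)
void = "|".join(re.escape(tag) for tag in VOID_TAGS)

WHITELIST = re.compile(
    rf"</?(?:{paired}|h[123])>"    # tags allowed with or without the slash
    rf"|<(?:{void})\s?/?>"         # br and hr, optional space and slash
    r"|</a>|<a[^>]+>"              # links, attributes allowed
    r"|<img[^>]+/?>"               # images, attributes allowed
)

print(bool(WHITELIST.fullmatch("<hr />")), bool(WHITELIST.fullmatch("<h4>")))
```

Adding a tag now means appending one string to a list rather than editing the pattern by hand.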
Dan on June 27, 2008 02:25 PM
HATE regular expressions.
HATE HATE HATE.
It drove me nuts when I ran across them and couldn't figure them out, so I learned how to use them very well for about a year. I wrote some moderately complex ones, some simple, and then I just stopped using them.
My problem wasn't so much not being able to understand "what" they did, but whether it was correct or not.
It is very easy to write a regex that looks like it should work but misses on a few things.
Just go to regexlib.com and search for currency, you'll find 30+ distinct different ways to parse or format US currency.
How easily can you tell the difference between these two?
^\d*\.\d{2}$
^\d+(?:\.\d{0,2})?$
What about these two?
^\$( )*\d*(.\d{1,2})?$
([^,0-9]\D*)([0-9]*|\d*\,\d*)$
Or God forbid these two?
^\$?\-?([1-9]{1}[0-9]{0,2}(\,\d{3})*(\.\d{0,2})?|[1-9]{1}\d{0,}(\.\d{0,2})?|0(\.\d{0,2})?|(\.\d{1,2}))$|^\-?\$?([1-9]{1}\d{0,2}(\,\d{3})*(\.\d{0,2})?|[1-9]{1}\d{0,}(\.\d{0,2})?|0(\.\d{0,2})?|(\.\d{1,2}))$|^\(\$?([1-9]{1}\d{0,2}(\,\d{3})*(\.\d{0,2})?|[1-9]{1}\d{0,}(\.\d{0,2})?|0(\.\d{0,2})?|(\.\d{1,2}))\)$
^\$([0]|([1-9]\d{1,2})|([1-9]\d{0,1},\d{3,3})|([1-9]\d{2,2},\d{3,3})|([1-9],\d{3,3},\d{3,3}))([.]\d{1,2})?$|^\(\$([0]|([1-9]\d{1,2})|([1-9]\d{0,1},\d{3,3})|([1-9]\d{2,2},\d{3,3})|([1-9],\d{3,3},\d{3,3}))([.]\d{1,2})?\)$|^(\$)?(-)?([0]|([1-9]\d{0,6}))([.]\d{1,2})?$
I'd much rather bank on writing a "ParseCurrency" function to parse or format the data using standard string manipulation.
That's way easier to look at in 3 months or 3 years.
There is nothing that can be done with a regex that can't be done with a function call. The function call may be 10 more lines than a single regex, but will always be 100 times easier to read and debug.
I feel that it goes along with Code Complete's Self Documenting Code idea. If your code or regex can't be understood without several lines of comments or a separate tool to parse it then there must be a better way.
Another good regex tool is Expresso:
http://www.ultrapico.com/Expresso.htm
It has really made some tricky regex easy to understand.
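One way to answer the "how easily can you tell the difference" question for the first pair of currency patterns is to stop squinting and run both against a handful of inputs. The sample strings below are made up for illustration, and the comparison is in Python rather than .NET.

```python
import re

# The first pair of currency patterns quoted in the comment above.
PATTERN_A = re.compile(r"^\d*\.\d{2}$")
PATTERN_B = re.compile(r"^\d+(?:\.\d{0,2})?$")

# Made-up inputs chosen to probe the edges of each pattern.
samples = ["12.34", ".99", "12", "12.", "12.3", "12.345"]

# Map each sample to (matches A, matches B).
results = {
    s: (bool(PATTERN_A.match(s)), bool(PATTERN_B.match(s)))
    for s in samples
}

for sample, (a, b) in results.items():
    print(f"{sample!r:10} A={a!s:5} B={b}")
```

The two disagree on most of these: the first insists on exactly two decimals and tolerates a missing whole part, while the second insists on a whole part and tolerates zero to two decimals, including a bare trailing dot.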
Michael Silver on June 27, 2008 02:40 PM
Intelligently adding whitespace helps, because before we read something we subconsciously observe the shape of its layout. This gives us an important clue to the underlying data hierarchy; it provides a means of navigating the text.
Without whitespace, we have to read the text in its entirety before seeing the forest for the trees.
As an aside, that's also why USING ALL CAPITAL LETTERS makes things more difficult to read -- it removes the shape of words, so we don't get those free visual hints.
Vance Vagell on June 27, 2008 02:44 PM
@Sean
What's your language of choice? Whitespace? (http://en.wikipedia.org/wiki/Whitespace_(programming_language) ?)
Dave on June 27, 2008 02:45 PM
I especially agree with your recommendation to break things up. It's very tempting to try to solve the entire problem in one go, but even if it works you end up with a lot of logic encoded in a string. That's just a maintenance nightmare waiting to happen.
You didn't mention tools like Regulator or ReguLazy. Any thoughts on those?
Jon Galloway on June 27, 2008 02:58 PM
Just thought I'd mention that if you want to get really good, get a copy of Mastering Regular Expressions by Jeffrey Friedl. Everything mentioned in Mike's blog posts above is covered pretty exhaustively in the first 3 chapters, and chapters 4-6 will take you well beyond that, into understanding the underlying regex engines, and working with them to optimise your regex - important if the regex is going to be used over and over again, as would be the case in the above example. Then there are chapters on 4 implementations (perl, java, .NET and pcre as used in PHP).
A relevant example of efficiency optimisation - if the regex engine is aware that all tags start with '<' then it will not even bother to start trying to match except where there is a '<' character. In many cases this optimisation means the regex is never applied, for the cost of a quick indexof() call.
To make it easy for the regex to spot that all matches start with the same character, take the first character out of the alternatives bit. This would give
var whitelist =
  @"<               (?# opening angle bracket - here so that regex engine can spot it)
    (               (?# start alternative)
      br\s?/? |     (?# allow space at end)
      /?p |
      /?b |
      /?strong |
      /?i |
      /?em |
      /?s |
      /?strike |
      /?blockquote |
      /?sub |
      /?super |
      /?h(1|2|3) |  (?# h1,h2,h3)
      /?pre |
      hr\s?/? |     (?# allow space at end)
      /?code |
      /?ul |
      /?ol |
      /?li |
      /a |
      a[^>]+ |      (?# allow attribs)
      img[^>]+/?    (?# allow attribs)
    )
    >               (?# closing angle bracket)
  ";
(Hope the formatting survives ...)
You could go a little further and factor out the '/?' that starts most of the lines. It will be repeatedly tested in the current format, and factoring it out would mean it was only tested once, though you will lose a little readability by doing that. A little benchmarking with two alternatives would let you know how much difference that change would make ...
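Before benchmarking a factored pattern, it's worth confirming that it still accepts and rejects the same tags as the flat one. Here is a small Python sketch of that check. It factors out the angle brackets and the leading `/?` as suggested above; the sample tags are made up, and whether the factored form is actually faster depends on the engine, which this sketch does not measure.

```python
import re

# The article's original flat alternation.
FLAT = re.compile(
    r"</?p>|<br\s?/?>|</?b>|</?strong>|</?i>|</?em>|</?s>|</?strike>|"
    r"</?blockquote>|</?sub>|</?super>|</?h(1|2|3)>|</?pre>|<hr\s?/?>|"
    r"</?code>|</?ul>|</?ol>|</?li>|</a>|<a[^>]+>|<img[^>]+/?>"
)

# Same language with "<", ">" and the leading "/?" factored out.
FACTORED = re.compile(r"""
    <                                   # shared opening angle bracket
    (?:
        /?(?:p|b|strong|i|em|s|strike|blockquote|sub|super|
             h[123]|pre|code|ul|ol|li) |
        (?:br|hr)\s?/? |                # allow space at end
        /a |
        a[^>]+ |                        # allow attribs
        img[^>]+/?                      # allow attribs
    )
    >                                   # shared closing angle bracket
""", re.VERBOSE)

# Made-up sample tags, a mix of allowed and disallowed.
samples = ["<p>", "</pre>", "<br />", "<hr/>", "<h3>", '<a href="x">',
           "</a>", '<img src="x" />', "<script>", "<h4>", "<a>", "text"]

mismatches = [s for s in samples
              if bool(FLAT.fullmatch(s)) != bool(FACTORED.fullmatch(s))]
print(mismatches)
```

An empty list means the two patterns agree on every sample, so any speed difference can then be measured on equal footing.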
mish on June 27, 2008 03:16 PM
Great post.
fxp on June 27, 2008 03:27 PM
Another free regex tool: http://www.gskinner.com/RegExr/
It also has an offline version.
I know Perl is the traditional soft target when it comes to observations about the folly of overusing regular expressions - and based on past atrocities this reputation may have been deserved a few years ago.
But these days well written Perl (no kids, that's not an oxymoron) tends not to rely too heavily on them. I just grabbed some of my code at random. I seem to average about 0 to 5 regular expressions per 1,000 lines of code - although of course it depends what I'm doing.
And Perl's regular expressions (which are actually not regular expressions in the formal sense - they're more general than that) are now pretty highly evolved; features like named captures, expanded syntax (which as a previous commenter notes allows patterns to be laid out quite readably) and support for matching recursive syntaxes make them safely and expressively useful - at least in the hands of someone capable of restraint :)
You know Jeff, sometime when we're on the same continent I'd like to sit down and show you 'modern' Perl (again, not an oxymoron). Based on your general approach to problem solving and apparent philosophy of coding I think you might actually like it...
Anyway, enough with the sales pitch. Keep up the good work.
Andy Armstrong on June 27, 2008 03:57 PM
Oh for goodness sake stop slagging off perl. Perl is like English, a bit hard to learn, but very expressive and very very useful. Also the CPAN module Regexp::Assemble (http://search.cpan.org/~dland/Regexp-Assemble/) is insanely useful, and fast.
kd on June 27, 2008 04:01 PM
@Rob Assembly is easy, I could do that at 15, but I still have trouble understanding many regular expressions.
Even though I don't fully understand them, they are very cool for stuff like this:
http://wincue.cvs.sourceforge.net/wincue/wincue/src/filename_formats.txt?revision=1.2&view=markup
The file is used for guessing album, artist, track number and track titles from file names. The older version was a hand-written parser which a friend of mine reimplemented with regular expressions, making it *much* easier to maintain and customize.
I don't think regular expressions are necessary, unless you've got a nightmare of a parsing task ahead of you. It's just one more syntax to learn, and I sure as hell don't need that. I'd rather hand-write it. Sure it's a little more code, but more is less.
Josh Stodola on June 27, 2008 04:09 PM
I would recommend (if you have .NET 2) getting this FREE tool; it also generates a DLL with the regex once you've developed it.
http://tools.osherove.com/CoolTools/Regulator/tabid/185/Default.aspx
Hey, Regex Buddy is built with Delphi!
Nick Hodges on June 27, 2008 04:32 PM
I couldn't agree with you more. When I first met regex I thought either I was too stupid to understand it or the guy that wrote it was a genius.
Once I found the right tool and toyed with it a little, I realized what a powerful weapon it can be.
The tool I use is pretty simple and offers no major light effects, but it's usable inside eclipse, so for this convenience, that's what I chose. http://regex-util.sourceforge.net/update/
Raphael on June 27, 2008 04:44 PM
Well, I think Jeff is spot on with this regular expression business.
I ran into the same problem a while back, and did the exact same thing, using the same tool and all.
Good to know I'm doing SOMETHING good.
André Medeiros on June 27, 2008 04:49 PM
I always wished my college had a course in regexes. I've used them a few times, but it's always been such a pain. I think I just need to make a project that really emphasizes them, so that they get ingrained into me.
Asmor on June 27, 2008 04:50 PM
>> If you drench your plate in hot sauce, you're going to be very, very sorry later.
I beg to differ. I love hot sauce. I put it on almost everything, in the amounts that would kill normal people or at least cause a major permanent injury. I eat raw habaneros, too.
Whoa There on June 27, 2008 04:51 PMAlthough I absolutely cherish regular expressions (Viva la PCRE!) as one of the most lethal tools in my batman belt of programming tricks (I'm the regular expression go-to guy in my office), I do completely agree that it is extraordinarily easy to overuse them.
@Jeff: I think you may have done your less regex savvy readers a slightly better service by noting their alternatives when you mention not regexing themselves to death. I think a good follow-up post would be to point those folks in the direction of their languages' built in string manipulation functionality. While regex can, in some scenarios save you hours of pointless string twiddling, I think it's important to note that, with great power comes great responsibility. For simple- and even sometimes medium-level tasks, smart use of string manipulation will scream past regex performance-wise. Otherwise, great stuff, as usual!
Chris on June 27, 2008 05:19 PMThanks Jeff.. As a novice PHP coder, I've found myself in need of, and intimidated by, regexes time and time again. After reading this blog, I think I now have the courage to wade in at full steam and make use of this useful and misunderstood tool..
CroW on June 27, 2008 05:29 PMI agree they are quite handy. I would have written a state machine driven parser for HTML sanitizing though. Good topic though.
jminadeo on June 27, 2008 06:20 PMThere's also a multi-lingual regex builder here: http://regex.larsolavtorvik.com/
Joseph LeBlanc on June 27, 2008 06:41 PMThis is probably a very limited portion of your actual validation methods, but I hope you're also planning on killing JavaScript and the like.
Actually, going back to your previous posts about the horrors of BBCode or whatever that was, there are probably two good reasons for BBCode as opposed to HTML/other standard:
- the brackets don't require a shift modifier on the standard keyboard layout. Don't think it makes a difference? Hey, you're a programmer. [It's a heck of a lot easier to use a bracket in the middle of typing than a less-than/greater-than thing.]
- it's also a whole lot easier to take something which may or may not be safe and transform it into something you know is than it is to try and clean the original so that it's safe. [lit: transformer vs. converter]
and so, -it's more accessible to the average user, and -it's more reasonable for the average developer.
Ryan H. on June 27, 2008 06:43 PMtaste batter? ;)
sbohr on June 27, 2008 06:48 PMwow this is sad
maybe in a decade theyll be saying regex vs bnf is like goto vs functions
bn fan on June 27, 2008 06:54 PM"not that regular expressions are evil, per se, but that overuse of regular expressions is evil"
Is it odd that I've always interpreted the expression this way?
The reason "now you've got two problems" comes up so often is because it so easily comes to mind and forces you to consider if regular expressions are really appropriate for the task at hand.
Why use regular expressions to extract an extension or file name from a path, when System.IO.Path does the same thing in a more readable manner?
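To make that concrete, here is a rough sketch in Python rather than .NET, where `os.path` plays a role similar to `System.IO.Path`. The helper names and sample paths are made up for the example; the regex is just one plausible hand-rolled attempt, not a recommended pattern.

```python
import os.path
import re

def ext_with_regex(path):
    # Hand-rolled pattern: a dot followed by non-dot, non-slash characters at the end.
    match = re.search(r"(\.[^./\\]+)$", path)
    return match.group(1) if match else ""

def ext_with_library(path):
    # The standard library already encodes the rules for splitting off an extension.
    return os.path.splitext(path)[1]

sample = "/music/artist/01 - track.mp3"   # hypothetical path
print(ext_with_regex(sample), ext_with_library(sample))
```

The two agree on the easy case, but they part ways on a dotfile such as `/home/u/.bashrc`: the regex reports `.bashrc` as the extension, while `os.path.splitext` reports none. That kind of edge case is where the library call tends to earn its keep.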
Actually, you might as well replace "regular expression" with XML, or databases, or any number of other solutions people generally rush toward without thinking.
Why sanitize the HTML? I just convert all the left angle brackets into their HTML entities to 'reveal' what the naughty person was trying to do.
Jon on June 27, 2008 07:08 PMGreat post - lots of value both here and from pointers to other links. Thank you!
Patrick on June 27, 2008 07:32 PM@Randy Magruder
> No trial version. No purchase. Period.
While there may not be a trial version per se, there is a three month unconditional money back guarantee (http://www.regexbuddy.com/guarantee.html). So you can in effect try it for three months.
I've been using RegexBuddy since version 1.0. It's worth every penny.
Mark on June 27, 2008 07:46 PM"Why sanitize the HTML? I just convert all the left angle brackets into their HTML entities to 'reveal' what the naughty person was trying to do."
Er, because sometimes you want to allow some HTML? You might, even, be anticipating it? Like from a richtext editor?
Trevor on June 27, 2008 08:08 PM@Jeff Atwood:
I think you posted this rant before.
Trevor on June 27, 2008 08:12 PMRegexBuddy 1.21 Demo Download: http://www.brothersoft.com/regexbuddy-29621.html
Moritz on June 27, 2008 08:13 PMHow is this different from your writings on XML?
How many posts can you stretch out "X is good for some things, just don't use it for too many things." tip?
Not that it isn't a good tip.
Calvin Spealman on June 27, 2008 09:14 PMCalvin,
The overall tip is good as you said, but I don't think Jeff is overusing the theme. He's talking about good stuff. There are some things that are great tools for the right purpose. There are also some things that are terrible for everything. There is no panacea.
C# is great for lots of stuff, but not if I need to write something to do the same things I can do with, say, 15 lines of vbscript. It's a pain in the butt to have to "figure out" how to get something done in C# when a few-line script can do the trick.
vbscript is great for lots of stuff, except when I really need something more versatile or broader in scope.
Regex is a great pattern matching tool. That's what it is for, pattern matching. Use it for pattern matching and there is absolutely no better tool in my opinion, and in the opinion of many apparently.
What supports regex? Most unix text tools, like sed, awk, perl, and literally MANY MANY others.
And in Windows C# supports regex, vbscript doesn't and that sucks. It should. I don't know about Visual Basic, I am still new to the windows development world. I probably won't bother learning Visual Basic, but who knows.
Anyway, the point is that Jeff is pointing out some great tools. I think he's a little understated at times, such as now. Regex is perfect for ANY pattern matching need. Regex is just about the only utility available for really complex pattern matching. If anything, regex is better suited for complex pattern matching because it CAN do it where nothing else can without writing your own pattern matching routines. I mean really...
Scot McPherson on June 27, 2008 09:44 PMfunny that you bring up regular expressions today because i just saw an insane one that nearly made me fall out of my chair. i'm in c# most of the time, and i don't think i'm alone in saying that c# developers don't throw around regexes too often. i was messing around with a javascript calendar picker and found this gem (and yes, it was all on one line):
System.Text.RegularExpressions.Regex DateRegEx = new System.Text.RegularExpressions.Regex(@"^((0?[13578]|10|12)(-|\/)(([1-9])|(0[1-9])|([12])([0-9]?)|(3[01]?))(-|\/)((19)([2-9])(\d{1})|(20)([01])(\d{1})|([8901])(\d{1}))|(0?[2469]|11)(-|\/)(([1-9])|(0[1-9])|([12])([0-9]?)|(3[0]?))(-|\/)((19)([2-9])(\d{1})|(20)([01])(\d{1})|([8901])(\d{1})))$");
my basic reaction was to close that file and never look at it again. sure i could have deciphered it, added nice comments, etc. but i have other bugs to fix and the allocated hours for this project are dwindling...
cowgod on June 27, 2008 09:49 PMThe author of RegExBuddy also wrote a great book about regular expressions, available from LuLu publishing. Good price, nice product, great read, indispensable reference.
http://www.lulu.com/content/229786
If you want to use Regular expressions or do already, you really need this book.
Xepol on June 27, 2008 10:21 PMHi comment smart enough!
The app isn't free. Is there any open source or free app for regexps?
The question is for readers too!
Xepol, Jan is also working on a Regex book with another very talented regex pro, Steven Levithan.
http://www.regex-guru.info/2008/05/writing-offline/
I bet it's gonna be REALLY good. Consider this preordered.
Jeff Atwood on June 27, 2008 11:02 PMI once heard some good advice that seems to work for me and my code:
Try to use a lot of vertical space and
very little horizontal space!
This applies to regular expressions as well. Readable code is all about whitespace, comments and proper naming.
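As a small sketch of what "vertical" can look like for a regex, here is a Python pattern written with `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments. The date format and group names are assumptions chosen for the example.

```python
import re

# A hypothetical pattern for dates like 27/06/2008, one piece per line.
DATE = re.compile(r"""
    ^
    (?P<day>   0?[1-9] | [12][0-9] | 3[01] )   # day of month, 1 to 31
    /
    (?P<month> 0?[1-9] | 1[0-2] )              # month, 1 to 12
    /
    (?P<year>  (?:19|20)[0-9]{2} )             # four-digit year, 1900 to 2099
    $
""", re.VERBOSE)

m = DATE.match("27/06/2008")
print(m.group("day"), m.group("month"), m.group("year"))
```

Note that this only checks the shape of the date; it does not know that June has 30 days.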
Florian Potschka on June 28, 2008 12:52 AM+1 for Expresso (http://www.ultrapico.com/Expresso.htm)
A simpler one is Regulazy (http://tools.osherove.com/CoolTools/Regulazy/tabid/182/Default.aspx)
Florian Potschka on June 28, 2008 12:54 AMWhy are you writing your own html sanitizer? It has already been written enough times. Are you also writing your own webserver and C library? And why are you using regular expressions to do it? Do you _want_ your service to be vulnerable to html/js injection?
James on June 28, 2008 01:15 AMFor some reason I believe there must be a sound correlation between liking regular expressions and disliking XML. I suspect people either do both or neither :)
Mikhail Edoshin on June 28, 2008 02:42 AMDean said "I strongly disagree with Sean. White space and comments are huge aids in reading code."
I strongly disagree with that. At least half of it.
Comments are evil. A necessary evil, sometimes, but nonetheless they are evil. We should aim for 'self-documenting' code. 99% of the time when the code is not self-documenting, it's because the developer didn't do as good a job as (s)he should have (maybe because they were not given the opportunity, but we're not debating causes).
That being said, I'm not an expert at regexes and *that* is why the comments in Jeff's original post would help me understand his regex. But that's because of *my* shortcoming. If we take that approach, then we should have a comment on each line of code explaining what it does, just in case someone that doesn't know the programming language we picked happens to read the source... impractical.
F.O.R.
PS: Am I the only one that discovered RefactorMyCode thanks to this post?
@N
"But that's because of *my* shortcoming."
Disagree. What if you're not planning to write a regexp, but you want to use an existing one, where there's something complex within it? Knowing roughly what it does helps. This is yet another case of overuse = failure, non-use = failure.
Also, no, I knew not of RMC before this. Thanks, Jeff!
Tom on June 28, 2008 05:58 AM1. If you use it all the time, regex is great.
2. If you use it once in a while, avoid it. Seems like I have to re-learn it each time I hit a case where I need it. Even with the tools.
3. If other people who don't use regex all the time will be supporting the code, don't use it.
But this is a chicken/egg story. If you use it lots, you know it. If you don't... avoid.
Regex always seems like going back to assembler. That's why you have all those utilities. But gee... we are already working in a compiled environment. Why go back for regex with unobvious shift-numeric syntax (!@#$%^&*)? I'd prefer to use something with regex's power, but with a more obvious syntax - a regex compiler - but with the "original" code part of the "real" code. It probably exists.
mihondo on June 28, 2008 07:29 AM
Remember, if you seldom use regex but have a case for one, you can often find the expression on the internet. I couldn't be bothered to work out how to parse a date in the exact format dd/mm/yy, so I just looked it up on the internet, pasted it in, and checked that it works. Great: it even stops you going past the max days of that month and past 12 months, and handles the leap day.
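For comparison, here is a minimal sketch of the same dd/mm/yy check done without a regex, in Python, by letting the standard library do the calendar arithmetic. The function name is made up, and this is a different technique from the pasted expression described above, not a reconstruction of it.

```python
from datetime import datetime

def is_valid_date(text):
    """Return True if text is a real calendar date in dd/mm/yy form."""
    try:
        # strptime rejects impossible days, months past 12, and bad leap days.
        datetime.strptime(text, "%d/%m/%y")
        return True
    except ValueError:
        return False

print(is_valid_date("29/02/08"), is_valid_date("29/02/07"))
```

Whichever route you take, a handful of test dates like these is a cheap way to confirm that a borrowed expression really covers month lengths and leap years.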
pete on June 28, 2008 08:22 AMorange - it is all the time
Can somebody tell me(from personal experience) the scenarios
- where to use regex
- where to avoid
I only find it useful in validating email, searching & to strip out dangerous HTML from input.
Anand on June 28, 2008 08:28 AMCoding language does seem important to level of regex use. For some thoughts on using regexes in Perl vs Python, see
http://www.fluidinfo.com/terry/2007/06/13/resorting-to-regular-expressions/
Terry
While I don't use regex much, I do see its beauty for some problems - plus I've never really understood the quote about having two problems, but given the context that it's saying regex isn't a solution for everything.. obviously I agree.
Here's the thing though, I just don't see a html sanitizer as being a good example for regex.. sure regex in this scenario can bring results quickly.. and it does the base things perfectly.. but sanitizing html is more than matching patterns.. and while I'm sure you could build more regex to progressively pull everything apart and back together.. it kind of makes me wonder if you are then almost trying to parse with regex..
Naively speaking, because I've never actually written a complicated sanitizer.. I would say that a traditional programming approach - although much slower to see results at first.. would be more flexible.. and given the recursive patterns that exist.. once you start to hit a point.. you'll see results, and get past problems where regex would become more troublesome.. faster.
I'd be hugely surprised if there wasn't a .NET lib out there for doing this already.. if not, then I think a codeplex/sf project is called for, obviously from your posting on refactormycode - and on here.. there's a lot of hugely knowledgeable people in regards to how sanitization should work.. and it would be really interesting to see a product from it.
I'd do it myself, but I know anything I put up would be ripped apart instantly - but hey, if it triggers people to do something in a (gah, give it here, this is how you do it!) kinda way.. then maybe I should :P
Stephen on June 28, 2008 09:39 AMmihondo: You said regex feels like assembly, and you want something higher level. Guess what the following does:
number :: '0'..'9'*
phoneNumber :: [ '(' number ')' ] number '-' number
This is a BNF (Backus-Naur Form) grammar to match phone numbers. BNF is a high-level notation for designing grammars. You can do just about anything a regex can do, though BNFs tend to be self-documenting. The typical use-case is for turning a program's source code into an abstract syntax tree, but it fits really well for simple stuff. The Wikipedia page has a simple example for matching any US postal address. The nice thing about BNF is that it turns the text into a data structure. The _really_ nice thing about BNF is it is designed to deal with things like nested tags/parens, the weakest part of regexes.
Common extended-BNF parsers include Yacc/bison (for the unixes) and ANTLR (for Java). My personal favorite is PyParsing, as it has some tasty syntactic sugar.
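To illustrate the nesting point without pulling in any of those tools, here is a small hand-written Python sketch for a made-up grammar, roughly `expr :: (char | '(' expr ')')*`. It is a stand-in written by hand with a stack, not output from Yacc, ANTLR, or PyParsing, and the function name is hypothetical.

```python
def parse_nested(text):
    """Turn '(a(b)c)' style text into nested lists; raise ValueError if unbalanced."""
    root = []
    stack = [root]                  # stack[-1] is the list currently being filled
    for ch in text:
        if ch == "(":
            child = []
            stack[-1].append(child) # open a new nesting level
            stack.append(child)
        elif ch == ")":
            if len(stack) == 1:
                raise ValueError("unmatched ')'")
            stack.pop()             # close the current level
        else:
            stack[-1].append(ch)
    if len(stack) != 1:
        raise ValueError("unmatched '('")
    return root

print(parse_nested("a(b(c)d)e"))
```

The result is a data structure that mirrors the nesting, which is the part a classic regular expression cannot express for arbitrary depth.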
Kyle on June 28, 2008 09:40 AM"I can also show you something written in the very same medium that is so beautiful it will make your eyes water"
Ok, I'm calling you out on the Klingon. Let's see that beautiful eye-watering Klingon.
Rick! on June 28, 2008 12:01 PMOrange
I've yet to get a handle on Regex but I do appreciate expressions that others have published and just work for me. Really, really appreciate it. Checking email addresses, post (zip) codes, phone numbers, etc. All this validation of text allows my code to remain concise and also allows me to get on with my job. The language independence is a big bonus in this respect.
I enjoy writing the unit tests against them to make sure all is good and this allows me to know exactly what is and is not covered in each expression.
Having said that, readability is atrocious even with whitespace and this adds further importance to the unit testing as this now doubles up as documentation.
Joe on June 28, 2008 12:52 PMThe abuse of regexes as parsers isn't unknown to me. Actually I've created a function that parses a ?:-like language:
public static string ParseTemplateString(string str, Func<string, object> getVars)
{
    // Regex
    System.Text.RegularExpressions.Regex rx = new System.Text.RegularExpressions.Regex(
        string.Format(@"
            (?<mod>\?!?)? # Match the type of the expression
            (?<v1>\$[A-Za-z_0-9]+) # Match the variable or the complex condition
            (?(mod)
                (
                    {0} # Match first opening delimiter
                    (?<inner>
                        (?>
                            {0} (?<LEVEL>) # On opening delimiter push level
                            |
                            {1} (?<-LEVEL>) # On closing delimiter pop level
                            |
                            (?! {0} | {1} ) . # Match any char unless the opening
                        )+ # or closing delimiters are in the lookahead string
                        (?(LEVEL)(?!)) # If level exists then fail
                    )
                    {1} # Match last closing delimiter
                ){{1,2}} # Match one or two subexpressions
                |
                :(?<v2>\$[A-Za-z_0-9]+) # Match the simple condition
            )?
        ", "\\{", "\\}"),
        System.Text.RegularExpressions.RegexOptions.Compiled
        | System.Text.RegularExpressions.RegexOptions.IgnorePatternWhitespace);
    // $Var, $Var:$Condition, ?(!)$Condition{...}{...}
    System.Text.RegularExpressions.MatchCollection mc = rx.Matches(str);
    foreach (System.Text.RegularExpressions.Match m in mc)
    {
        if (m.Groups["mod"].Length > 0)
        {
            bool cond = Convert.ToBoolean(getVars(m.Groups["v1"].Value.Substring(1)));
            if (m.Groups["mod"].Value == "?!")
                cond = !cond;
            if (!cond)
            {
                if (m.Groups["inner"].Captures.Count == 2)
                    str = str.Replace(m.Value, ParseTemplateString(
                        m.Groups["inner"].Captures[1].Value, getVars));
                else
                    str = str.Replace(m.Value, "");
            }
            else
                str = str.Replace(m.Value, ParseTemplateString(
                    m.Groups["inner"].Captures[0].Value, getVars));
        }
        else if (m.Groups["v2"].Length > 0)
        {
            bool cond = Convert.ToBoolean(getVars(m.Groups["v2"].Value.Substring(1)));
            if (!cond)
                str = str.Replace(m.Value, "");
            else
            {
                object val = getVars(m.Groups["v1"].Value.Substring(1));
                str = str.Replace(m.Value, val.ToString());
            }
        }
        else
        {
            str = str.Replace(m.Value, getVars(m.Groups["v1"].Value.Substring(1)).ToString());
        }
    }
    return str;
}
Hey Jeff, here's a regular expression you might enjoy:
s/who I admire/whom I admire/
:)
Mark on June 28, 2008 03:50 PM"Regular expressions rock.They should absolutely be a key part of every modern coder's toolkit."
I always find myself disagreeing with you whenever you say something should apply to every programmer, regardless of their area of expertise. As just one of many examples, what if the coder is in the video game industry? Sometimes it seems like people forget that there's more to programming than processing text files and validating inputs.
Mike on June 28, 2008 08:41 PM@Josh Stodola: One of the nice things is that the underlying engines will also have been optimized for speed, so you don't have to. Sure you can hand-write a simple parser, but can you hand-write it to make it fast?
Matijs van Zuijlen on June 28, 2008 08:47 PMThe best regex advice I've heard is to not try to write your own parser...FWIW
David Smith on June 29, 2008 06:35 AM"Should you try to solve every problem you encounter with a regular expression? Well, no. Then you'd be writing Perl"
Shame on you for perpetuating this tired old piece of nonsense.
Earle Martin on June 29, 2008 07:48 AMquote:
>>Regarding Regex Buddy.
>>
>>No trial version. No purchase. Period.
Agreed 100%. If you can't be bothered to make a version I can try-before-I-buy, I don't bother. My boss signs off only so many purchases, and I end up paying the rest - I don't use these credits on random software.
It's one of the reasons Delphi stopped being my language of choice when the Personal version was removed (and I still don't want it anywhere near my desk, even though Turbo has been added - fingers still burning from the first snub).
> Are you also writing your own webserver and C library?
I couldn't find any good c# HTML sanitizing code that wasn't a huge, dumb dependency. Now I can, because I wrote it!
> Try to use a lot of vertical space and very little horizontal space!
Agree, see "flattening arrow code"
http://www.codinghorror.com/blog/archives/000486.html
> I'd prefer to use something with regex's power, but with a more obvious syntax
Maybe fluent interface? But I disagree.
http://www.codinghorror.com/blog/archives/000989.html
Saw some replies asking about open source regex editor:
KDE regular expression editor manual:
http://docs.kde.org/kde3/en/kdeutils/KRegExpEditor/index.html
Redet:
http://billposer.org/Software/redet.html
Simple version:
http://www.arachnoid.com/regex_lab/
@Anand: where to avoid regex?
- validating email
- strip out dangerous HTML from input
Why?
- validating email
Almost all regular expressions will be invalid for a subset of cases.
Validation via regular expressions is approaching the problem from the wrong angle. The format of the address isn't important. You need to know whether the address will accept mail and this problem can be solved in better ways than checking if the email address pattern is valid.
And always give the user a "No, my email address is valid" checkbox when reporting an "invalid" email address error. This always leaves the user in control.
- stripping out dangerous HTML from input
It's a solved problem. There is a library in your chosen language that already does this. This library will be faster and less buggy.
Where to use regex?
When it is not a solved problem and then only for simple patterns. The most common complex problems are already solved.
Hi Jeff... I have similar feelings about the overuse of ajax as you do about the overuse of regular expressions
http://blog.pnbconsulting.com.au/?p=134
lomaxx on June 29, 2008 04:52 PMOne of the best books to learn how to use regex : http://oreilly.com/catalog/9780596528126/
Before reading it, I thought I knew regular expression. It made me change my mind.
Regular Expressions are a very powerful tool that all developers should know, but sometimes you can fall into deep subtle pits of despair if you don't know PERFECTLY what you are doing.
The most important things I discovered one month ago are:
[1] NOT ALL REGEXPR ENGINES USE THE SAME SYNTAX AND/OR MATCHING ALGORITHM
[2] SOMETIMES, REGEXPR ENGINES CHEAT!
For [1], just check the RegExp section in Xml Schema Specification at W3C (http://www.w3.org/TR/xmlschema11-2/#regexs). They decided that, since most people would want a full match on a RegExp, their parser would automatically anchor it (WORST. IDEA. EVER).
So, if you decide (like I did, fool me, fool me) to define a RegExp in a Schema for validation, and then also use it in another part of your application, you will have lots of trouble.
Basically, in XSD you get the full Perl RegExp syntax, without ^ $ (which will be treated as NORMAL CHARACTERS) and \A \Z (which will BREAK your RegExp), and you will get an automatic anchor instead...
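The effect of that implicit anchoring can be mimicked in Python, as a loose analogy only: `re.search` behaves like the unanchored matching most APIs default to, while `re.fullmatch` behaves roughly like a schema pattern that must cover the whole value. The pattern and inputs below are invented for the example.

```python
import re

pattern = r"[0-9]{3}"   # hypothetical pattern shared between a schema and code

# Unanchored: succeeds if three digits appear anywhere in the text.
found_anywhere = bool(re.search(pattern, "abc1234"))

# Whole-value match: the text must be exactly three digits.
whole_value = bool(re.fullmatch(pattern, "abc1234"))
exact = bool(re.fullmatch(pattern, "123"))

print(found_anywhere, whole_value, exact)
```

The same pattern text gives different answers depending on which matching rule the environment applies, which is why reusing one expression in two engines needs testing in both.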
For [2], some engines (e.g. the .NET Regex engine) cheat on some expressions, to make things work "almost any time". Basically, I had 2 expressions that should have returned different matches (by Perl syntax), but they returned the same matches (in .NET Match). I'm sorry I can't remember the exact expressions right now, but I remember shouting the loudest WTF ever when I checked this... and I will not tell you about the differences between the .NET parser and the various Java parsers :-)
So, I would add this advice to the list of this post:
- Always check (double-triple-check) your Expressions IN THE ENVIRONMENT they will be executed (or with the right options in your tool of choice).
Filini on June 30, 2008 02:33 AMI hate regexes. Don't get me wrong on this, I have used them very often, I know how to write very, very advanced regexes. I love the idea behind them... but I hate their syntax. The syntax simply sucks. Also there are about 20 different flavors and each time I use them in a different language or with a different tool, there are different pitfalls I have to fix.
I would like them a lot more if they had a better, cleaner syntax and if there was one, and exactly one standard that defines them and all tools and languages either stick to this standard or should not offer them in the first place.
Also note that regexes are nothing more than "shortcuts". Whatever you do in a regex could be done with /normal/ code as well, you would just need a lot more code to do it. E.g. every regex can be written as a simple state machine parser. Sometimes such a parser can be much more powerful, easier to extend and ... this is an important thing to consider ... it can also be much faster. Depending on how good the regex compiler is, it might create better or much worse code.
E.g. your sanitize regex is nothing more than (pseudo code):
for (i = 0; i < lengthOfString(string); i++) {
    if (charAt(string, i) == '<') {
        // Handle HTML tag
    }
}
Inside the if, you skip the first character if it is /, then you can if-elsif-else about all known tags, check if they allow whitespace or other characters up to the > character and then loop till you see the > character. It might be harder to read at first, but that way you can see how the string is really processed, what is going on where, you can optimize the process (e.g. instead of if's you can switch, place the tags in a sorted table, look it up with binary search, have a number assigned to every tag, jump to the right code with switch-case).
Using regexes is like SQL. You only say what you want the computer to do and the computer magically presents the result. You have zero influence on how it gets there, how fast it gets there, etc.
Mecki on June 30, 2008 03:11 AMMecki, you have no idea how many times I've refactored *out* for-loops with long, tortuous string processing inside them in favour of a simple regular expression.
You are right though - sometimes the compilation of the regex can make it slower than just coding - but in my experience, that's a pretty rare state, and normally it's 'cos I've crafted a very slow regex.
I would strongly recommend Friedl's book 'Mastering Regular Expressions'.
I do agree with the other posters who mention using already existing solutions where possible, for things like HTML validation.
Andy Burns on June 30, 2008 03:31 AMRegExs are like XML: good when used appropriately, bad when used inappropriately ....
Don't use in place of a parser .. it is not a full parser and should not be used as such
Don't use to do simple string manipulation, your language should have better/faster tools to do this
Do use to do simple pattern matching, it's what it was designed for
XML great for storing/sending structured data between programs, terrible to read, terrible to write, use an interface!
"They should absolutely be a key part of every modern coder's toolkit."
If you avoid web and data chores, then find and replace is just as good. I've never really found much use for regexes when coding my win32 c++ apps... every now and again I will want to change something a bit complicated... but it's pretty rare that find/replace won't cut it. The last example I can think of was changing a function pointer type... I had to regex the various static function definitions... but only because I had used inconsistent variable naming. :)
Regexes are great for manipulating data rather than code imo, and HTML is more typical of data than code...
Allowing parts of HTML that are safe and blocking unsafe parts is a classic problem. Best to block it all if at all possible, by just escaping every special char. It's a lot easier, and is extremely secure too. :)
Jheriko on June 30, 2008 05:10 AMActually RegExps come from pure computer science.
Every "Deterministic Finite State Machine" is equivalent
to a RegExp. (I know them under the acronym DFA, which
stands for "Deterministic Finite Automaton").
Actually they are even equivalent to the more powerful "Nondeterministic
Finite State Machines".
( But then DFA's and NDFA's are actually the same thing -
don't bother, that is real computer science ...
"nondeterministic" means that a state may have several possible
moves for the same input, including "nasty" moves
that are allowed not to "eat" any input - RegExps have
these "nasty" operations in the form of the asterisk, the question
mark, ...)
So while using RegExps for String Matching Operations is a valid
use case there is much more to RegExps than this.
You can build any DFA/NDFA using a RegExp. Any !
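A tiny Python sketch of that equivalence, for one made-up example: a hand-written transition table that accepts the same strings as the regex `a(b|c)*d`. The state names, table, and sample strings are all assumptions for illustration.

```python
import re

# Transition table for a hypothetical DFA equivalent to a(b|c)*d
TRANSITIONS = {
    ("start", "a"): "middle",
    ("middle", "b"): "middle",
    ("middle", "c"): "middle",
    ("middle", "d"): "done",
}
ACCEPTING = {"done"}

def dfa_accepts(text):
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:       # no transition defined for this input: reject
            return False
    return state in ACCEPTING

REGEX = re.compile(r"a(b|c)*d")

for sample in ["ad", "abcbd", "abd x", "bd", "add"]:
    print(sample, dfa_accepts(sample), bool(REGEX.fullmatch(sample)))
```

For each sample the table-driven machine and the regex give the same verdict, which is the whole point of the equivalence.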
Regexes are fine in code and all that, but for the past few years, I've been using a regex tool I built to make writing the code *itself* easier. ( http://www.hova.org/regexhelper )
One of the key features (probably the only one) is the idea of a "match" mode. Given a large input, most of which is garbage, the regular expression match/replace is run on only the matches. The rest is discarded. What does this allow you to do? Well, it lets you perform manipulation of things you're interested in, while ignoring the rest.
A good example is using this new method to strip all the element ID's from a large HTML document so that you can then convert it to server-side code.
hova on June 30, 2008 06:46 AM"To me it's perfectly readable."
Ha ha ha ha ha
Joker on June 30, 2008 08:41 AMI once had to write a WML sanitizer- It essentially took user-created HTML (the whole thing was to create a mobile view of pages entered in my employer's proprietary CMS) and checked the "entries" for what would amount to invalid WML, and fixed it. I can honestly tell you that at least as far as *ml sanitizing, regexes aren't good enough by themselves, but used in conjunction with parsing, they're much easier than parsing alone.
Alex on June 30, 2008 09:39 AMAll this complaining about Orange... you ought to put the link to that post next to the Orange "captcha" so people know and will quit littering the comments about the "broken" captcha.
HB on June 30, 2008 09:48 AM
"You may find it a little odd that a hack who grew up using a language with the ain't keyword would fall so head over heels in love with something as obtuse and arcane as regular expressions."
Ummm, well, no. That seems totally consistent :-)
"They should absolutely be a key part of every modern coder's toolkit."
Well, ish. You should know what they are, and when to use them. Whether you ever need to is a completely different thing. It's so long since I used one I can't remember how long it's been. So calling them a "key part" of my toolkit, well, I dunno.
Jim Cooper on June 30, 2008 09:50 AM@Alex
"regexes aren't good enough by themselves, but used in conjunction with parsing, they're much easier than parsing alone."
Lots of (most?) parsers use regexes somewhere, often for extracting tokens (eg numbers, identifiers, etc)
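As a rough sketch of that division of labour, here is a small regex-driven tokenizer in Python. The token classes (numbers, identifiers, a few operators) and the function name are assumptions for the example; a real parser would consume these tokens in a separate step.

```python
import re

# One alternative per token class; leading whitespace is skipped.
TOKEN = re.compile(r"\s*(?:(?P<number>\d+)|(?P<ident>[A-Za-z_]\w*)|(?P<op>[-+*/()=]))")

def tokenize(text):
    text = text.rstrip()
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError("unexpected character at position %d" % pos)
        # lastgroup is the name of the alternative that matched.
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

print(tokenize("x = 3 + y1"))
```

The regex only decides where one token ends and the next begins; anything about nesting or precedence is left to the code that reads the token list.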
Jim Cooper on June 30, 2008 09:53 AMregextester.com is the only RegEx tool I've ever bothered to use.
It's free, it's online, it's good.
Dave on June 30, 2008 10:29 AM@kenneth, @GoA:
Do you guys mind revealing this connection you see between XML and Regular Expressions?
XML, a data description format and Regular Expressions, a pattern recognition engine .... I fail to see the connection that you two see.
Does anyone else see why it is odd that one should like both or none at all?
Anvar on June 30, 2008 11:55 AMOn the subject of good regular expression tools, I would like to recommend a free online one that is designed for .NET programmers:
http://www.lastdomainnameonearth.com.
I have often seen problems arise because people do not understand the theory behind regular expressions. So, they try to use it to do things that you really need a context-free grammar or better to do. Perhaps it is easier to let the regex just handle the tokenization and let the syntax and semantics be defined in a language more suitable to it.
Carleton on June 30, 2008 03:28 PMIf you are a Python developer, here are two regex tools that are quite useful:
http://www.pythonregex.com (online)
http://kodos.sourceforge.net/ (offline)
Disclaimer, I wrote the pythonregex website while playing around with the Google App Engine.
Dave on June 30, 2008 08:10 PMThere are so many people complaining about how RegEx is too complicated, etc. But the point of this was to keep them short, to a focused point.
Yeah, you can write functions to validate or clean your input, but really, is an expression like [^a-z0-9\s] *really* that hard to use to clean up some input? (yes, there is a shorter version than that, but just as an example)
A hand-written function could effectively do the same thing, but even the shortest and simplest Regular Expressions can save a lot of development time.
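As a rough illustration of that trade-off, here is a Python sketch using the pattern quoted above next to the same filter written as a loop. The function names and the choice to lowercase the input first are assumptions made for the example.

```python
import re

# The pattern from the comment: anything that is not a lowercase letter,
# a digit, or whitespace.
UNWANTED = re.compile(r"[^a-z0-9\s]")


def clean(text):
    """Lowercase the text (an assumption), then drop every unwanted character."""
    return UNWANTED.sub("", text.lower())


def clean_by_hand(text):
    """The same filter written as an explicit loop, for comparison."""
    kept = []
    for ch in text.lower():
        if "a" <= ch <= "z" or "0" <= ch <= "9" or ch.isspace():
            kept.append(ch)
    return "".join(kept)


print(clean("Hello, World! 42"))  # hello world 42
```

Both functions give the same result on ordinary ASCII input. The regex version is one line of logic, and the character class states the intent directly.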
HB on June 30, 2008 08:29 PM
> Before you complain how hard regular expressions are to deal with, get the right tools!
Like most such programming needs, emacs ships with a built-in mode for this. :-)
See http://www.emacswiki.org/cgi-bin/wiki/ReBuilder for more info
T.E.D. on July 1, 2008 06:37 AM
I posted this on refactormycode but figured I'd post it here too.
"<SCRIPT SRC=http://myscript/xss.js?<B>" is vulnerable to XSS.
As I posted on the refactor site (under a different name, oops):
HTMLEncode your string and then replace the entities. Doing it the way you are trying to do it is REALLY, REALLY hard. And as others have said, you have to know about every browser quirk. I can't emphasize enough how dangerous this is.
Oh, and REs rule! :)
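One way to read the "encode first, then replace the entities" advice is sketched below in Python. This is a sketch under assumptions, not the code from the refactoring thread: the whitelist and the `sanitize` name are hypothetical, and `html.escape` from the standard library stands in for HTMLEncode. Everything is escaped first, and only bare whitelisted tags with no attributes are turned back into markup.

```python
import html
import re

# Hypothetical whitelist: only these bare tags (no attributes) are restored.
ALLOWED_TAGS = ("b", "i", "em", "strong")

# Matches an escaped opening or closing tag such as &lt;b&gt; or &lt;/b&gt;.
RESTORE = re.compile(r"&lt;(/?)(%s)&gt;" % "|".join(ALLOWED_TAGS), re.IGNORECASE)


def sanitize(text):
    """Escape everything, then un-escape only the whitelisted bare tags."""
    escaped = html.escape(text, quote=True)
    return RESTORE.sub(r"<\1\2>", escaped)


print(sanitize("<SCRIPT SRC=http://myscript/xss.js?<B>"))
# &lt;SCRIPT SRC=http://myscript/xss.js?<B>
```

The regex only ever runs over text that is already inert, so a pattern that misses something leaves an escaped tag on the page rather than a live one. The sketch does not check that restored tags are balanced.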
Tim C on July 1, 2008 12:23 PM
Just a tip to the people who somehow think writing their own loops to manipulate a string is faster than a PROPER regex .. in 99% of the cases writing your own will not give any performance benefits and will probably be slower. Sloppy regular expressions can be slow, but if you make sure your regex is short and precise, it is usually very fast.
In Perl, and probably in other languages, all static regular expressions in a program are compiled once, making them very fast, especially in cases where you need to use the same regex to match against multiple strings. The occurrences where regular expressions can really cause a hit are when they are built dynamically, like as part of an eval statement, or contain a variable, in which case there is no way to compile them ahead of time.
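The comment is about Perl, but the same distinction can be sketched in Python, where compilation is explicit through `re.compile`. The pattern and function names below are illustrative assumptions. A static pattern is compiled once and reused, while a pattern containing a variable has to be built at run time.

```python
import re

# Static pattern: compiled once at import time and reused for every line.
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def find_dates(lines):
    """Return (year, month, day) string tuples for every date-like match."""
    found = []
    for line in lines:
        found.extend(DATE_RE.findall(line))
    return found


def find_word(lines, word):
    """Dynamic pattern: it contains a variable, so it is built at run time."""
    # Compile once per call, outside the loop, and escape the variable part.
    pattern = re.compile(r"\b%s\b" % re.escape(word))
    return [line for line in lines if pattern.search(line)]
```

Even in the dynamic case, compiling once before the loop keeps the cost to a single compilation per call rather than one per string. Python's `re` module also keeps an internal cache of recently compiled patterns, which softens the cost of repeated dynamic patterns.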
Arguing that you shouldn't use them because you don't know them is no argument at all. Everyone starts off not knowing any programming languages, yet they learn them because it is a good idea to know the most effective tools for the job.
mocker on July 1, 2008 02:08 PM
@F.O.R.: Self-documenting code is only self-documenting to persons already versed in the language. And while that is going to be your typical audience, new team members might come from other backgrounds. Many people say Ruby is very readable, but I haven't spent the time to review its syntax, so it's about as readable as Perl, to me.
If there's a comment that says this regex/function/whatever parses currency (thanks @Dave for the great example), I can be reasonably sure that it at least *tries* to parse currency, according to some definition.
Any single line of code is likely to be readable without a comment. But reading larger blocks of code (even if it's just 20 LOC), can take some time to work through, especially for those unfamiliar with the language.
@Dave: One reason to love, or at least tolerate, regular expressions is that they are compiled (behind the scenes, often) as a state machine, and execute many times faster than "manual" string search/parse routines. At least if the text being searched is large, or the pattern complex. This can be very noticeable if the work is being done in a [insert browser here] browser.
As with some others, I heartily recommend Friedl's Mastering Regular Expressions.
Daniel 'Dang' Griffith on July 2, 2008 08:48 AM
Count me in as another one who uses Expresso http://www.ultrapico.com/Expresso.htm.
Also, the site www.regular-expressions.info is very helpful as a tutorial. It's from the author of RegexBuddy.
PRMan on July 2, 2008 04:32 PM
Jeff, just to be clear, this stuff about the "ain't" keyword -- it is a joke, right?
Maybe I'm being naive, but I've managed to steer clear of any variant of VB for many years now, and I wouldn't put anything past the designers of that language.
oryx3 on July 3, 2008 09:47 AM
@Sean, I bet your code is totally unreadable, and I bet it is just easier to re-write it (and fire your ass in the process).
Sean on July 3, 2008 02:58 PM
Looks like the regex to turn URLs into anchors is missing the ( in Wikipedia links...
alex on July 7, 2008 11:30 PM
For the .NET folks out there, you can use:
http://www.nregex.com/nregex/default.aspx
It uses the actual .NET regex engine, so you know it will work in your code just like it does in the tester.
cubanx on July 9, 2008 08:32 AM
'Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.'
Great quote. It was for a very similar reason that I created http://txt2re.com/ .
MarkE on July 10, 2008 04:45 AM
Another one here who uses Expresso, best RegEx tool I've seen so far - and it's free to register.
FrankS on July 11, 2008 05:46 AM
Hi!
I have just signaled your fantastic post!
Very good observations!
Content (c) 2008 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.