I love regular expressions. No, I'm not sure you understand: I really love regular expressions.
You may find it a little odd that a hack who grew up using a language with the ain't keyword would fall so head over heels in love with something as obtuse and arcane as regular expressions. I'm not sure how that works. But it does. Regular expressions rock. They should absolutely be a key part of every modern coder's toolkit.
If you've ever talked about regular expressions with another programmer, you've invariably heard this 1997 chestnut:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
The quote is from Jamie Zawinski, a world class hacker who I admire greatly. If he's telling us not to use regular expressions, should we even bother? Maybe, if you live and die by soundbites. But there's a bit more to the story than that, as evidenced by Jeffrey Friedl's exhaustive research on the Zawinski quote. Zawinski himself commented on it. Analyzing the full text of Jamie's posts in the original 1997 thread, we find the following:
Perl's nature encourages the use of regular expressions almost to the exclusion of all other techniques; they are far and away the most "obvious" (at least, to people who don't know any better) way to get from point A to point B.
The first quote is too glib to be taken seriously. But this, I completely agree with. Here's the point Jamie was trying to make: not that regular expressions are evil, per se, but that overuse of regular expressions is evil.
I couldn't agree more. Regular expressions are like a particularly spicy hot sauce – to be used in moderation and with restraint only when appropriate. Should you try to solve every problem you encounter with a regular expression? Well, no. Then you'd be writing Perl, and I'm not sure you need that kind of headache. If you drench your plate in hot sauce, you're going to be very, very sorry later.
In the same way that I can't imagine food without a dash of hot sauce now and then, I can't imagine programming without an occasional regular expression. It'd be a bland, unsatisfying experience.
But wait! Let me guess! The last time you had to read a regular expression and figure it out, your head nearly exploded! Why, it wasn't even code; it was just a bunch of unintelligible Q*Bert line noise!
Calm down. Take a deep breath. Relax.
Let me be very clear on this point: If you read an incredibly complex, impossible to decipher regular expression in your codebase, they did it wrong. If you write regular expressions that are difficult to read into your codebase, you are doing it wrong.
Look. Writing so that people can understand you is hard. I don't care if it's code, English, regular expressions, or Klingon. Whatever it is, I can show you an example of someone who has written something that is pretty much indistinguishable from gibberish in it. I can also show you something written in the very same medium that is so beautiful it will make your eyes water. So the argument that regular expressions are somehow fundamentally impossible to write or read, to me, holds no water. Like everything else, it just takes a modicum of skill.
Buck up, soldier. Even Ruby code is hard to read until you learn the symbols and keywords that make up the language. If you can learn to read code in whatever your language of choice is, you can absolutely handle reading a few regular expressions. It's just not that difficult. I won't bore you with a complete explanation of the dozen or so basic elements of regular expressions; Mike already covered this ground better than I can:
I'd like to illustrate with an actual example, a regular expression I recently wrote to strip out dangerous HTML from input. This is extracted from the SanitizeHtml routine I posted on RefactorMyCode.
var whitelist = @"</?p>|<br\s?/?>|</?b>|</?strong>|</?i>|</?em>|</?s>|</?strike>|</?blockquote>|</?sub>|</?super>|</?h(1|2|3)>|</?pre>|<hr\s?/?>|</?code>|</?ul>|</?ol>|</?li>|</a>|<a[^>]+>|<img[^>]+/?>";
What do you see here? The variable name whitelist is a strong hint. One thing I like about regular expressions is that they generally look like what they're matching. You see a list of HTML tags, right? Maybe with and without their closing tags?
Honestly, is this so hard to understand? To me it's perfectly readable. But we can do better. In most modern regex dialects, you can flip on a mode where whitespace is no longer significant. This frees you up to use whitespace and comments in your regular expression, like so.
var whitelist =
  @"</?p>|
  <br\s?/?>|          (?# allow space at end)
  </?b>|
  </?strong>|
  </?i>|
  </?em>|
  </?s>|
  </?strike>|
  </?blockquote>|
  </?sub>|
  </?super>|
  </?h(1|2|3)>|       (?# h1,h2,h3)
  </?pre>|
  <hr\s?/?>|          (?# allow space at end)
  </?code>|
  </?ul>|
  </?ol>|
  </?li>|
  </a>|
  <a[^>]+>|           (?# allow attribs)
  <img[^>]+/?>        (?# allow attribs)
  ";
Do you understand it now? All I did was add a smattering of comments and a lot of whitespace. The same exact technique I would use on any code, really.
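For readers outside .NET, the same idea can be sketched in Python, where the equivalent switch is `re.VERBOSE`. This is only an illustration and not the SanitizeHtml code: the `WHITELIST` and `is_whitelisted` names are made up for the example, the tag list is shortened, and the comments use Python's `#` syntax rather than `(?# ...)`.

```python
import re

# Illustrative port of the whitelist idea. With re.VERBOSE, unescaped
# whitespace in the pattern is ignored and '#' starts a comment.
WHITELIST = re.compile(r"""
    </?p>        |
    <br\s?/?>    |   # allow space at end
    </?b>        |
    </?strong>   |
    </?i>        |
    </?em>       |
    </?h(1|2|3)> |   # h1, h2, h3
    <hr\s?/?>    |   # allow space at end
    </a>         |
    <a[^>]+>     |   # allow attribs
    <img[^>]+/?>     # allow attribs
    """, re.VERBOSE)


def is_whitelisted(tag):
    """Return True if the whole tag matches one of the allowed forms."""
    return WHITELIST.fullmatch(tag) is not None
```

The pattern is the same one regardless of layout; only the reader's experience changes.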
But how did I cook up this regular expression? How do I know it does what I think it does? How do I test it? Well, again, I do that the same way I do with all my other code: I use a tool. My tool of choice is RegexBuddy.
Now we get syntax highlighting and, more importantly, real time display of matches there at the bottom in our test data as we type. This is huge. If you're wondering why your IDE doesn't automatically do this for you with any regex strings it detects in your code, tell me about it. I've been wondering that very same thing for years.
RegexBuddy is far and away the best regex tool on the market in my estimation. Nothing else even comes close. But it does cost money. If you don't use software that costs money, there are plenty of alternatives out there. You wouldn't read or write code in notepad, right? Then why in the world would you attempt to read or write regular expressions that way? Before you complain how hard regular expressions are to deal with, get the right tools!
This trouble is worth it, because regular expressions are incredibly powerful and succinct. How powerful? I was able to write a no-nonsense, special purpose HTML sanitizer in about 25 lines of code and four regular expressions. Compare that with a general purpose HTML sanitizer which would take hundreds if not thousands of lines of procedural code to do the same thing.
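To give a feel for how little code a special purpose sanitizer needs, here is a rough Python sketch of the general approach: find everything that looks like a tag, keep it only if it is an exact whitelist match. It is not the routine posted on RefactorMyCode; the names `TAG`, `ALLOWED` and `sanitize` are hypothetical, the whitelist is deliberately tiny, and it should not be treated as production-safe.

```python
import re

# Anything that looks like a tag: '<', then everything up to the next '>'.
TAG = re.compile(r"<[^>]*>")

# A deliberately tiny whitelist, for illustration only.
ALLOWED = re.compile(
    r"</?(p|b|strong|i|em|code|pre|ul|ol|li)>|<br\s?/?>",
    re.IGNORECASE,
)


def sanitize(html):
    """Drop every tag that is not an exact whitelist match; keep the text."""
    def keep_or_drop(match):
        tag = match.group(0)
        return tag if ALLOWED.fullmatch(tag) else ""
    return TAG.sub(keep_or_drop, html)


print(sanitize("<p>Hi <script>alert(1)</script><b>there</b></p>"))
```

Because the whitelist only accepts bare tags, something like `<b onclick="x()">` is dropped outright rather than repaired.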
I do have some tips for keeping your sanity while dealing with regular expressions, however:
IgnorePatternWhitespace option, then use that whitespace to make your regex easier for us human beings to parse and understand. Comment liberally.
If you're afraid of regular expressions, don't be. Start small. Used responsibly and with the right tooling they are big, powerful – dare I say, spicy – wins. If you make regular expressions a part of your toolkit, you'll be able to write less code that does more. It'll just.. taste batter.
You might enjoy them so much, in fact, that you completely forget about that "second problem".
Rail against XML yet embrace Regex!?! Me no understand so good.
Of course, I will admit to using regular expressions very rarely, whereas I utilize XML rather extensively.
Kenneth on June 27, 2008 2:02 AM
In the same vein as RegexBuddy, IntelliJ has a fantastic regex plugin:
Ryan Breidenbach on June 27, 2008 2:10 AM
Oops, wrong link:
http://plugins.intellij.net/plugin/?id=19
Ryan Breidenbach on June 27, 2008 2:10 AM
@Randy - there's an old trial version (2.04) of RegexBuddy floating around various shareware sites. I couldn't recommend it more - it really is a fantastic tool. Even their website is valuable as a reference for writing regular expressions.
Richard Dingwall on June 27, 2008 2:10 AM
A coworker and I cracked up laughing reading this post. We have someone on our team who overuses regular expressions and hot sauce.
Tim B on June 27, 2008 2:15 AM
Brilliant, well said. I struggle with RE but I do value them. JWZ is eminently quotable, but he is usually quoted wildly out of context. Another of his great lines is "Linux is only free if your time has no value", often used by MS zealots to bash Linux zealots.
Rev Matt on June 27, 2008 2:18 AM
First, a great cheat sheet for writing these things...
http://www.ilovejackdaniels.com/cheat-sheets/regular-expressions-cheat-sheet/
I find them incredibly useful; mostly I use them for validating input... basic stuff like making sure the user entered a valid email address or phone number or date. Maybe I haven't used them enough, but I feel like bashing my head on the desk whenever I'm trying to write them.
I had a class in college where we were given regular expressions and we had to walk through them with specific terms too. That wasn't much better.
Kris on June 27, 2008 2:21 AM
Eclipse users can get QuickREx for free. It's fantastic.
http://www.bastian-bergerhoff.com/eclipse/features/web/QuickREx/toc.html
Bossy Joe on June 27, 2008 2:21 AM
There is one called Expresso that is pretty slick. It's free as far as I'm aware (just a nag screen)
HB on June 27, 2008 2:24 AM
I agree with others who said that regular expressions are not a good way to sanitize HTML. Somewhat appropriately, IIRC, the JWZ quote was about someone using RegEx for parsing HTML. JWZ wasn't complaining in general about overuse of RegEx'es (though he has), but the context of that quote I believe was specifically for using them in HTML where they're not the right tool. He suggested more of a true parser for HTML. I'm sure javascript and XSS exploits have just made that statement even more true.
Grain of salt warning - my memory isn't 100% sure this was on HTML, though it was either HTML or SGML/XML. The original post was on usenet, and a google search only shows the quote, not the full context. If it was HTML, it's somewhat ironic that the example is HTML yet pulls in the JWZ quote recommending against it.
Rich on June 27, 2008 2:25 AM
DON'T use regular expressions to parse markup (HTML/XML/whatever).
http://htmlparsing.icenine.ca/
http://wiki.hypexr.org/wikka.php?wakka=/RegexFAQ
Am I the only one still curious what the official verdict on that Mensa page is?
Jeff, are you going to post your thoughts on it?
PaoloB on June 27, 2008 2:37 AM
As with other languages you mention (e.g. Ruby), you have to use regular expressions often enough that you internalize some of the less, um, user-friendly aspects of the expressions. Classic cognitive problem in regular expressions: reserved characters that mean different things in different contexts. (Well, unless it's inside square brackets. Then it means ....) An interesting dilemma for a) regex novices working on b) straightforward problems is that it can be just as fast, by the time you take into account all the debugging time for your regular expression, to just write a parsing function in Your Language Of Choice. Not as elegant, of course, but for one-offs, it can be awful tempting to forego the headaches of line-noise syntax ...
mike on June 27, 2008 2:39 AM
I abuse hot sauce, but I'm Mexican so it's OK for me. And please, please, no more Tabasco: that's not real salsa, it's not even hot. If you want something hot and delicious, try salsa de chile habanero
http://www.salsasetc.com/graphics/H-175A%20large.jpg; it will burn your tushi like never before.
Juan Zamudio
I never, ever want to hear from anyone, ever again, that programming in assembly language is hard or useless.
Rob on June 27, 2008 2:50 AM
Of course then you always get the guy asking you for help making a Regular Expression to match strings of balanced parentheses.
Steve Steiner on June 27, 2008 2:52 AM
This is the reg expression to rule all reg expressions: %s\n !:P
Just kidding! But do take a look at this small post on how to read quoted strings with scanf: http://narg.eu/?p=6 - in one simple reg exp.
I ~love~ regular expressions. I do know that they're slow compared to strstr, strpos, or whatever your language's equivalents are, because they are yet a different coding language that gets PARSED and COMPILED. At their basic level, they represent a finite state machine (this is not so much true with modern regex, but the basic commands in like POSIX regex are).
Therefore - long regular expressions are going to be SLOWER than shorter ones. Personally... I'd have taken the article's long 'or' string and broke it up into a loop over a list of allowed elements (easier to add onto later too).
Something to note is that every regex engine is different - some optimize things differently (ie, if your parser is naive about building the FSM, the long or statement above will result in a huge structure, one for every OR), some have wholly different functionality (though, in general, 'keyword' characters are consistent), and some have different 'shortcuts', especially for character classes. The most widely used is probably PCRE (Perl Compatible Regular Expressions), which obviously works just like Perl, but it's a C library that is used in a number of different places, but its syntax is just a little different than say Java or Javascript's syntax, which is very different than BRE (basic regex), etc.
It's super powerful though, however you cut it. It's one of those tools that programmers need to know, because oftentimes it's the best tool for the job - especially in the text-based web world.
Justin on June 27, 2008 2:57 AM
Jeff, any thoughts on introducing BNF? It is the perfect complement to fill in the gaps of regular expressions. Simplest possible way to get balanced matching and parsing. Arguably easier to use than regular expressions. It is supported in all major programming languages. Whenever turning text into a data structure, reach for a BNF first.
Besides, the complexity of regular expressions seems to grow at length^2. BNFs feel more like log(length).
http://en.wikipedia.org/wiki/Backus-Naur_form
Kyle on June 27, 2008 3:16 AM
1. DON'T use regular expressions to parse markup (HTML/XML/whatever).
2. I agree with others who said that regular expressions are not a good way to sanitize HTML.
3. Sanitisation is an extremely hard problem, which can only really be solved using a proper parser.
You *can* solve it in regex if you define the solution very, very strictly as we have. It's really a special case. There are a few regexes I use to accomplish this. See the actual code here:
http://refactormycode.com/codes/333-sanitize-html#refactor_11455
Comment there if you test the code and find it doesn't work. I think you'll be pleasantly surprised.
Actual tag balance has to be achieved in another, unrelated routine. Perfectly safe HTML can have unbalanced tags.
Jeff Atwood on June 27, 2008 3:16 AM
As a seasoned Perlmonger, I regularly deal with complicated regexes that do some very tricky stuff. Fortunately, Perl is excellent at providing you with nice syntax to make regexes both readable and scalable.
Here's how I would have implemented your example:
http://pastebin.com/f467492d4
Note that Perl makes it extremely easy to build a regex from sections, defining each part separately, with full commenting. Much of the body of the regex can be easily factored out into arrays, which are considerably easier to modify!
Perl also provides a natural syntax for including comments within your regexes. Both valuable techniques for building large, but usable regexes.
Admittedly, your example is probably a bit too simplistic for the slightly verbose treatment I've given it. But imagine a more complicated regex..
The way I see it is, if you think you need something like RegexBuddy, you probably need to refactor your regex into easily-understandable (and easily-testable) component parts instead. I can see how it might be useful if you're trying to reverse-engineer someone's badly-written opaque regex, or if you're trying to match a very complicated pattern. But in general I would say if you need it, you're doing it wrong.
(What were you thinking? talking about regexes and taking a poke at Perl in the same sentence? you really brought it on yourself! :))
Dan on June 27, 2008 3:25 AM
HATE regular expressions.
HATE HATE HATE.
It drove me nuts when I ran across them and couldn't figure them out, so I learned how to use them very well for about a year. I wrote some moderately complex ones, some simple, and then I just stopped using them.
My problem wasn't so much not being able to understand what they did, but whether it was correct or not.
It is very easy to write a regex that looks like it should work but misses on a few things.
Just go to regexlib.com and search for currency, you'll find 30+ distinct different ways to parse or format US currency.
How easily can you tell the difference between these two?
^\d*\.\d{2}$
^\d+(?:\.\d{0,2})?$
What about these two?
^\$( )*\d*(.\d{1,2})?$
([^,0-9]\D*)([0-9]*|\d*\,\d*)$
Or God forbid these two?
^\$?\-?([1-9]{1}[0-9]{0,2}(\,\d{3})*(\.\d{0,2})?|[1-9]{1}\d{0,}(\.\d{0,2})?|0(\.\d{0,2})?|(\.\d{1,2}))$|^\-?\$?([1-9]{1}\d{0,2}(\,\d{3})*(\.\d{0,2})?|[1-9]{1}\d{0,}(\.\d{0,2})?|0(\.\d{0,2})?|(\.\d{1,2}))$|^\(\$?([1-9]{1}\d{0,2}(\,\d{3})*(\.\d{0,2})?|[1-9]{1}\d{0,}(\.\d{0,2})?|0(\.\d{0,2})?|(\.\d{1,2}))\)$
^\$([0]|([1-9]\d{1,2})|([1-9]\d{0,1},\d{3,3})|([1-9]\d{2,2},\d{3,3})|([1-9],\d{3,3},\d{3,3}))([.]\d{1,2})?$|^\(\$([0]|([1-9]\d{1,2})|([1-9]\d{0,1},\d{3,3})|([1-9]\d{2,2},\d{3,3})|([1-9],\d{3,3},\d{3,3}))([.]\d{1,2})?\)$|^(\$)?(-)?([0]|([1-9]\d{0,6}))([.]\d{1,2})?$
I'd much rather bank on writing a ParseCurrency function to parse or format the data using standard string manipulation.
That's way easier to look at in 3 months or 3 years.
There is nothing that can be done with a regex that can't be done with a function call. The function call may be 10 more lines than a single regex, but will always be 100 times easier to read and debug.
I feel that it goes along with Code Complete's Self Documenting Code idea. If your code or regex can't be understood without several lines of comments or a separate tool to parse it then there must be a better way.
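One way to answer the "can you tell the difference" question for the first pair of currency patterns is to stop squinting and run both over a handful of inputs. A minimal Python sketch; the sample strings and the `compare` helper are invented for illustration.

```python
import re

FIRST = re.compile(r"^\d*\.\d{2}$")
SECOND = re.compile(r"^\d+(?:\.\d{0,2})?$")

# Invented inputs that probe the edges of both patterns.
SAMPLES = ["12.34", ".50", "12", "12.", "12.3", "12.345"]


def compare(samples):
    """Return (sample, first_matches, second_matches) for each sample."""
    return [(s, bool(FIRST.match(s)), bool(SECOND.match(s))) for s in samples]


for sample, first, second in compare(SAMPLES):
    print(f"{sample!r:10} first={first!s:5} second={second}")
```

The first pattern insists on exactly two decimals and tolerates a missing integer part; the second insists on an integer part and tolerates zero to two decimals. A small table like this also works as a regression test when the pattern changes.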
Another good regex tool is Expresso:
http://www.ultrapico.com/Expresso.htm
It has really made some tricky regex easy to understand.
Michael Silver on June 27, 2008 3:40 AM
Intelligently adding whitespace helps, because before we read something we subconsciously observe the shape of its layout. This gives us an important clue to the underlying data hierarchy; it provides a means of navigating the text.
Without whitespace, we have to read the text in its entirety before seeing the forest for the trees.
As an aside, that's also why USING ALL CAPITAL LETTERS makes things more difficult to read -- it removes the shape of words, so we don't get those free visual hints.
Vance Vagell on June 27, 2008 3:44 AM
@Sean
What's your language of choice? Whitespace? (http://en.wikipedia.org/wiki/Whitespace_(programming_language) ?)
Dave on June 27, 2008 3:45 AM
Just thought I'd mention that if you want to get really good, get a copy of Mastering Regular Expressions by Jeffrey Friedl. Everything mentioned in Mike's blog posts above is covered pretty exhaustively in the first 3 chapters, and chapters 4-6 will take you well beyond that, into understanding the underlying regex engines, and working with them to optimise your regex - important if the regex is going to be used over and over again, as would be the case in the above example. Then there are chapters on 4 implementations (perl, java, .NET and pcre as used in PHP).
A relevant example of efficiency optimisation - if the regex engine is aware that all tags start with '<' then it will not even bother to start trying to match except where there is a '<' character. In many cases this optimisation means the regex is never applied, for the cost of a quick indexof() call.
To make it easy for the regex to spot that all matches start with the same character, take the first character out of the alternatives bit. This would give
var whitelist =
  @"< (?# opening angle bracket - here so that regex engine can spot it)
  ( (?# start alternative)
    br\s?/? | (?# allow space at end)
    /?p |
    /?b |
    /?strong |
    /?i |
    /?em |
    /?s |
    /?strike |
    /?blockquote |
    /?sub |
    /?super |
    /?h(1|2|3) | (?# h1,h2,h3)
    /?pre |
    hr\s?/? | (?# allow space at end)
    /?code |
    /?ul |
    /?ol |
    /?li |
    /a |
    a[^>]+ | (?# allow attribs)
    img[^>]+/? (?# allow attribs)
  )
  > (?# closing angle bracket)
  ";
(Hope the formatting survives ...)
You could go a little further and factor out the '/?' that starts most of the lines. It will be repeatedly tested in the current format, and factoring it out would mean it was only tested once, though you will lose a little readability by doing that. A little benchmarking with two alternatives would let you know how much difference that change would make ...
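Before benchmarking, it is worth confirming that a factored pattern still accepts and rejects the same inputs as the flat one. The Python sketch below does that for a shortened tag list; `FLAT`, `FACTORED`, `SAMPLES` and `agree` are names invented here, and whether the factored form is actually faster depends on the engine, so it makes no performance claim.

```python
import re

# Flat form: every alternative repeats the angle brackets.
FLAT = re.compile(r"</?p>|<br\s?/?>|</?b>|</?em>|</a>|<a[^>]+>")

# Factored form: the brackets are written once around the alternation.
FACTORED = re.compile(r"""
    <                 # opening angle bracket, stated once
    (?:
        /?p      |
        br\s?/?  |    # allow space at end
        /?b      |
        /?em     |
        /a       |
        a[^>]+        # allow attribs
    )
    >                 # closing angle bracket
    """, re.VERBOSE)

# Invented samples: some allowed tags, some that must be rejected.
SAMPLES = ["<p>", "</p>", "<br />", "<b>", "</em>",
           '<a href="x">', "</a>", "<script>", "<bold>"]


def agree(samples):
    """True if both patterns accept and reject exactly the same samples."""
    return all(
        bool(FLAT.fullmatch(s)) == bool(FACTORED.fullmatch(s))
        for s in samples
    )


print(agree(SAMPLES))
```

With equivalence checked, the standard library's `timeit` module is one way to do the benchmarking suggested above.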
mish on June 27, 2008 4:16 AM
Great post.
fxp on June 27, 2008 4:27 AM
Another free regex tool: http://www.gskinner.com/RegExr/
It also has an offline version.
I know Perl is the traditional soft target when it comes to observations about the folly of overusing regular expressions - and based on past atrocities this reputation may have been deserved a few years ago.
But these days well written Perl (no kids, that's not an oxymoron) tends not to rely too heavily on them. I just grabbed some of my code at random. I seem to average about 0 to 5 regular expressions per 1,000 lines of code - although of course it depends what I'm doing.
And Perl's regular expressions (which are actually not regular expressions in the formal sense - they're more general than that) are now pretty highly evolved; features like named captures, expanded syntax (which as a previous commenter notes allows patterns to be laid out quite readably) and support for matching recursive syntaxes make them safely and expressively useful - at least in the hands of someone capable of restraint :)
You know Jeff, sometime when we're on the same continent I'd like to sit down and show you 'modern' Perl (again, not an oxymoron). Based on your general approach to problem solving and apparent philosophy of coding I think you might actually like it...
Anyway, enough with the sales pitch. Keep up the good work.
Andy Armstrong on June 27, 2008 4:57 AM
Oh for goodness sake stop slagging off perl. Perl is like English, a bit hard to learn, but very expressive and very very useful. Also the CPAN module Regexp::Assemble (http://search.cpan.org/~dland/Regexp-Assemble/) is insanely useful, and fast.
kd on June 27, 2008 5:01 AM
@Rob Assembly is easy, I could do that at 15, but I still have trouble understanding many regular expressions.
Even though I don't fully understand them, they are very cool for stuff like this:
http://wincue.cvs.sourceforge.net/wincue/wincue/src/filename_formats.txt?revision=1.2&view=markup
The file is used for guessing album, artist, track number and track titles from file names. The older version was a hand-written parser which a friend of mine reimplemented with regular expressions, making it *much* easier to maintain and customize.
I don't think regular expressions are necessary, unless you've got a nightmare of a parsing task ahead of you. It's just one more syntax to learn, and I sure as hell don't need that. I'd rather hand-write it. Sure it's a little more code, but more is less.
Josh Stodola on June 27, 2008 5:09 AM
I would recommend (if you have .NET 2) getting this FREE tool; it also generates a DLL with the regex once you've developed it.
http://tools.osherove.com/CoolTools/Regulator/tabid/185/Default.aspx
Hey, Regex Buddy is built with Delphi!
Nick Hodges on June 27, 2008 5:32 AM
I couldn't agree with you more. When I first met regex I thought either I was too stupid to understand it or the guy that wrote it was a genius.
Once I found the right tool and toyed with it a little, I realized what a powerful weapon it can be.
The tool I use is pretty simple and offers no major light effects, but it's usable inside eclipse, so for this convenience, that's what I chose. http://regex-util.sourceforge.net/update/
Raphael on June 27, 2008 5:44 AM
Well, I think Jeff is spot on with this regular expression business.
I ran into the same problem a while back, and did the exact same thing, using the same tool and all.
Good to know I'm doing SOMETHING good.
Andr Medeiros on June 27, 2008 5:49 AM
I always wished my college had a course in regexes. I've used them a few times, but it's always been such a pain. I think I just need to make a project that really emphasizes them, so that they get ingrained into me.
Asmor on June 27, 2008 5:50 AM
"If you drench your plate in hot sauce, you're going to be very, very sorry later."
I beg to differ. I love hot sauce. I put it on almost everything, in the amounts that would kill normal people or at least cause a major permanent injury. I eat raw habaneros, too.
Whoa There on June 27, 2008 5:51 AM
Although I absolutely cherish regular expressions (Viva la PCRE!) as one of the most lethal tools in my batman belt of programming tricks (I'm the regular expression go-to guy in my office), I do completely agree that it is extraordinarily easy to overuse them.
@Jeff: I think you may have done your less regex savvy readers a slightly better service by noting their alternatives when you mention not regexing themselves to death. I think a good follow-up post would be to point those folks in the direction of their languages' built in string manipulation functionality. While regex can, in some scenarios save you hours of pointless string twiddling, I think it's important to note that, with great power comes great responsibility. For simple- and even sometimes medium-level tasks, smart use of string manipulation will scream past regex performance-wise. Otherwise, great stuff, as usual!
Chris on June 27, 2008 6:19 AM
Thanks Jeff.. As a novice PHP coder, I've found myself in need of, and intimidated by, regexes time and time again. After reading this blog, I think I now have the courage to wade in at full steam and make use of this useful and misunderstood tool..
CroW on June 27, 2008 6:29 AM
I agree they are quite handy. I would have written a state machine driven parser for HTML sanitizing, though. Good topic though.
jminadeo on June 27, 2008 7:20 AM
There's also a multi-lingual regex builder here: http://regex.larsolavtorvik.com/
Joseph LeBlanc on June 27, 2008 7:41 AM
This is probably a very limited portion of your actual validation methods, but I hope you're also planning on killing javascript and the like.
Actually, going back to your previous posts about the horrors of BBCode or whatever that was, there are probably two good reasons for BBCode as opposed to HTML/other standard:
- the brackets don't require a shift modifier on the standard keyboard layout. Don't think it makes a difference? Hey, you're a programmer. [It's a heck of a lot easier to use a bracket in the middle of typing than a less-than/greater-than thing.]
- it's also a whole lot easier to take something which may or may not be safe and transform it into something you know is than it is to try and clean the original so that it's safe. [lit: transformer vs. converter]
and so, -it's more accessible to the average user, and -it's more reasonable for the average developer.
Ryan H. on June 27, 2008 7:43 AM
taste batter? ;)
sbohr on June 27, 2008 7:48 AM
wow this is sad
maybe in a decade they'll be saying regex vs bnf is like goto vs functions
bn fan on June 27, 2008 7:54 AM
"not that regular expressions are evil, per se, but that overuse of regular expressions is evil"
Is it odd that I've always interpreted the expression this way?
The reason "now you've got two problems" comes up so often is because it so easily comes to mind and forces you to consider if regular expressions are really appropriate for the task at hand.
Why use regular expressions to extract an extension or file name from a path, when System.IO.Path does the same thing in a more readable manner?
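The same contrast can be sketched in Python, where `os.path` plays roughly the role that System.IO.Path plays in .NET. The sample path is invented, and the regex is just one plausible way someone might write it.

```python
import os.path
import re

path = "/var/log/app/server.log.gz"  # invented sample path

# Regex version: works, but the reader has to decode the pattern.
ext_by_regex = re.search(r"\.[^./\\]+$", path).group(0)

# Library version: states the intent directly.
ext_by_library = os.path.splitext(path)[1]
name_by_library = os.path.basename(path)

print(ext_by_regex, ext_by_library, name_by_library)
```

Both approaches give the same extension here; the difference is how long it takes a reader to be sure of that.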
Actually, you might as well replace regular expression with XML, or databases, or any number of other solutions people generally rush toward without thinking.
Great post - lots of value both here and from pointers to other links. Thank you!
Patrick on June 27, 2008 8:32 AM
@Randy Magruder
No trial version. No purchase. Period.
While there may not be a trial version per se, there is a three month unconditional money back guarantee (http://www.regexbuddy.com/guarantee.html). So you can in effect try it for three months.
I've been using RegexBuddy since version 1.0. It's worth every penny.
Mark on June 27, 2008 8:46 AM
Why sanitize the HTML? I just convert all the left angle brackets into their HTML entities to 'reveal' what the naughty person was trying to do.
Er, because sometimes you want to allow some HTML? You might, even, be anticipating it? Like from a richtext editor?
Trevor on June 27, 2008 9:08 AM
@Jeff Atwood:
I think you posted this rant before.
Trevor on June 27, 2008 9:12 AM
RegexBuddy 1.21 Demo Download: http://www.brothersoft.com/regexbuddy-29621.html
Moritz on June 27, 2008 9:13 AM
For kicks, try this:
s/regular expressions/macros/
Ben Karel on June 27, 2008 9:42 AM
How is this different from your writings on XML?
How many posts can you stretch out of the "X is good for some things, just don't use it for too many things" tip?
Not that it isn't a good tip.
Calvin Spealman on June 27, 2008 10:14 AM
funny that you bring up regular expressions today because i just saw an insane one that nearly made me fall out of my chair. i'm in c# most of the time, and i don't think i'm alone in saying that c# developers don't throw around regexes too often. i was messing around with a javascript calendar picker and found this gem (and yes, it was all on one line):
System.Text.RegularExpressions.Regex DateRegEx = new System.Text.RegularExpressions.Regex(@"^((0?[13578]|10|12)(-|\/)(([1-9])|(0[1-9])|([12])([0-9]?)|(3[01]?))(-|\/)((19)([2-9])(\d{1})|(20)([01])(\d{1})|([8901])(\d{1}))|(0?[2469]|11)(-|\/)(([1-9])|(0[1-9])|([12])([0-9]?)|(3[0]?))(-|\/)((19)([2-9])(\d{1})|(20)([01])(\d{1})|([8901])(\d{1})))$");
my basic reaction was to close that file and never look at it again. sure i could have deciphered it, added nice comments, etc. but i have other bugs to fix and the allocated hours for this project are dwindling...
cowgod on June 27, 2008 10:49 AM
The author of RegExBuddy also wrote a great book about regular expressions, available from LuLu publishing. Good price, nice product, great read, indispensable reference.
http://www.lulu.com/content/229786
If you want to use Regular expressions or do already, you really need this book.
Xepol on June 27, 2008 11:21 AM
Hi comment smart enough!
The app isn't free. Is there any open source or free app for regexps?
The question is for readers also!
Xepol, Jan is also working on a Regex book with another very talented regex pro, Steven Levithan.
http://www.regex-guru.info/2008/05/writing-offline/
I bet it's gonna be REALLY good. Consider this preordered.
Jeff Atwood on June 27, 2008 12:02 PM
why
do
programmers
think
adding
whitespace
makes
things
easier
to
read?
It
doesn't.
Stop
doing
it.
Sean on June 27, 2008 1:16 PM
I always wondered how people got by without regexs. Then I started asking that Steve Yegge question in interviews (the one about replacing all phone numbers in a huge site with one email address). Now I know. And I'm sadder for the knowledge.
Tom Clancy on June 27, 2008 1:19 PM
Sean,howareyousosurewhitespacedoesn'tmakethingseasiertoread?
IknowIgetsomebenefitoutoftheoccasionalspace.
Obviouslycopiousamountsofcarriagereturnswon'tdoanybodyanygood,
butyoudon'talwayshavetotakethingstoexcesss.
Sean:
Becau seBadWhite Spaci
Can Tota
l y Messyou up
I strongly disagree with Sean. White space and comments are huge aids in reading code.
Dan on June 27, 2008 1:26 PM
Regarding Regex Buddy.
No trial version. No purchase. Period.
Randy Magruder on June 27, 2008 1:31 PM
Nice post, Jeff. I'd have to say the biggest problem I see is with programmers who don't think it's worthwhile or fun to learn regex. Almost all the ones I've worked with avoid it like the plague. But then once they start using the 15th language in a row that supports regex they start realizing that it's probably not such a bad idea.
I can't tell you how many stop-gap translation applications I haven't had to build just because of regex support in Java, XML, PHP....
Raymond on June 27, 2008 1:32 PM
I never knew about the IgnoreWhiteSpace option. That makes life much better. Thanks!
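For readers outside .NET, here is a rough Python analogue of that option: the `re.VERBOSE` flag also lets a pattern carry whitespace and `#` comments. This is only a sketch; the phone-number pattern and the `parse_phone` helper are made up for illustration and do not come from the post.

```python
import re

# Hypothetical pattern for illustration: a simple US-style phone number.
# With re.VERBOSE, whitespace outside character classes is ignored and
# everything after '#' on a line is treated as a comment.
PHONE = re.compile(
    r"""
    \(? (\d{3}) \)?   # area code, optionally wrapped in parentheses
    [\s.-]?           # optional separator
    (\d{3})           # exchange
    [\s.-]?           # optional separator
    (\d{4})           # line number
    """,
    re.VERBOSE,
)


def parse_phone(text):
    """Return (area, exchange, line) for the first match in text, or None."""
    m = PHONE.search(text)
    return m.groups() if m else None
```

Calling `parse_phone("call (555) 123-4567 now")` should give back the three digit groups, while a string with no phone number gives `None`.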
Mike on June 27, 2008 1:35 PM
ORANGE!
I'm working on a regex tool built in WPF. It's primarily a learning thing for me, but I'm looking to make it a slimmed down, bloat-free regex go-to tool for those I-just-need-to-test-this-for-a-sec regex moments.
http://statestreetgang.net/post/2008/05/Regex-and-WPF.aspx
I've already made several improvements to that code, and I think I'll be entering it in this upcoming community coding contest.
That'll give me some motivation to finish it.
Will Sullivan on June 27, 2008 1:44 PM
(not ^that^ Sean)
Other Sean: Adding whitespace makes things easier to read. If it doesn't work for you, then you must be some mental-parsing genius. Great. You're better than the rest of us. Go on with your life now.
Jeff: I agree with your RegEx assessment.
Sean on June 27, 2008 1:45 PM
My favourite I've been using for years is Regex Coach - http://www.weitz.de/regex-coach/
It's free, with donations encouraged.
Works on both Linux and Windows.
Mark on June 27, 2008 1:46 PM
Damn you Jeff! Since I read your post I've been thinking a lot about eating a hamburger with lots of Tabasco. I was subliminally attacked by your free publicity... Great blog, read it every day.
Waldemar on June 27, 2008 1:49 PM
Awesome, a regular expression for testing prime numbers:
http://mail.pm.org/pipermail/athens-pm/2003-January/000033.html
print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/
This is by Abigail, who is something of a legend in the Perl community.
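For anyone puzzling over how that works: `1 x shift` builds a string of N ones, and the pattern matches only when N is not prime. The first alternative catches 0 and 1; the second succeeds when the string can be cut into two or more equal chunks of at least two ones each, which means N has a divisor. Below is a sketch of the same idea in Python, with the pattern copied from the one-liner; the `is_prime` wrapper is just an illustrative name.

```python
import re

# Same pattern as the Perl one-liner: it matches unary strings whose
# length is NOT prime (0, 1, or any composite number).
NOT_PRIME = re.compile(r"^1?$|^(11+?)\1+$")


def is_prime(n):
    """Test primality by matching a string of n ones (slow; for fun only)."""
    return NOT_PRIME.match("1" * n) is None


print([n for n in range(20) if is_prime(n)])
# prints [2, 3, 5, 7, 11, 13, 17, 19]
```

The backtracking makes this far slower than ordinary trial division, so treat it as a curiosity rather than something to ship.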
TiTi on June 27, 2008 1:52 PM
I'll admit I have made a regex or two with poor whitespace, however it is nice once you get to the point where you can read regex like your native language. That said, recursive regular expressions can still be kind of confusing.
Will on June 27, 2008 1:56 PM
Don't get me wrong. I understood what you meant.
But I find it kinda fun that you love regexps and don't like xml. :)
GoA on June 27, 2008 1:57 PM
Ditto on strongly disagreeing with the first comment. White space turns ordinary obfuscated Perl into something a bit more, uh, pythonic.
Steven Klassen on June 27, 2008 1:57 PM
Why are you writing your own html sanitizer? It has already been written enough times. Are you also writing your own webserver and C library? And why are you using regular expressions to do it? Do you _want_ your service to be vulnerable to html/js injection?
James on June 28, 2008 2:15 AM
The abuse of regexes as parsers isn't unknown to me. Actually I've created a function that parses a ?:-like language:
public static string ParseTemplateString(string str, Func<string, object> getVars)
{
    // Regex
    System.Text.RegularExpressions.Regex rx = new System.Text.RegularExpressions.Regex(
        string.Format(@"
            (?<mod>\?!?)? # Match the type of the expression
            (?<v1>\$[A-Za-z_0-9]+) # Match the variable or the complex condition
            (?(mod)
                (
                    {0} # Match first opening delimiter
                    (?<inner>
                        (?>
                            {0} (?<LEVEL>) # On opening delimiter push level
                            |
                            {1} (?<-LEVEL>) # On closing delimiter pop level
                            |
                            (?! {0} | {1} ) . # Match any char unless the opening
                        )+ # or closing delimiters are in the lookahead string
                        (?(LEVEL)(?!)) # If level exists then fail
                    )
                    {1} # Match last closing delimiter
                ){{1,2}} # Match one or two subexpressions
                |
                :(?<v2>\$[A-Za-z_0-9]+) # Match the simple condition
            )?
        ", "\\{", "\\}"),
        System.Text.RegularExpressions.RegexOptions.Compiled
        | System.Text.RegularExpressions.RegexOptions.IgnorePatternWhitespace);
    // $Var, $Var:$Condition, ?(!)$Condition{...}{...}
    System.Text.RegularExpressions.MatchCollection mc = rx.Matches(str);
    foreach (System.Text.RegularExpressions.Match m in mc)
    {
        if (m.Groups["mod"].Length > 0)
        {
            bool cond = Convert.ToBoolean(getVars(m.Groups["v1"].Value.Substring(1)));
            if (m.Groups["mod"].Value == "?!")
                cond = !cond;
            if (!cond)
            {
                if (m.Groups["inner"].Captures.Count == 2)
                    str = str.Replace(m.Value, ParseTemplateString(
                        m.Groups["inner"].Captures[1].Value, getVars));
                else
                    str = str.Replace(m.Value, "");
            }
            else
                str = str.Replace(m.Value, ParseTemplateString(
                    m.Groups["inner"].Captures[0].Value, getVars));
        }
        else if (m.Groups["v2"].Length > 0)
        {
            bool cond = Convert.ToBoolean(getVars(m.Groups["v2"].Value.Substring(1)));
            if (!cond)
                str = str.Replace(m.Value, "");
            else
            {
                object val = getVars(m.Groups["v1"].Value.Substring(1));
                str = str.Replace(m.Value, val.ToString());
            }
        }
        else
        {
            str = str.Replace(m.Value, getVars(m.Groups["v1"].Value.Substring(1)).ToString());
        }
    }
    return str;
}
For some reason I believe there must be a sound correlation between liking regular expressions and disliking XML. I suspect people either do both or neither :)
Mikhail Edoshin on June 28, 2008 3:42 AM
Hey Jeff, here's a regular expression you might enjoy:
s/who I admire/whom I admire/
:)
Mark on June 28, 2008 4:50 AM
Dean said: "I strongly disagree with Sean. White space and comments are huge aids in reading code."
I strongly disagree with that. At least half of it.
Comments are evil. A necessary evil, sometimes, but nonetheless they are evil. We should aim for 'self-documenting' code. 99% of the time when the code is not self-documenting, it's because the developer didn't do as good a job as (s)he should have (maybe because they were not given the opportunity, but we're not debating causes).
That being said, I'm not an expert at regexes and *that* is why the comments in Jeff's original post would help me understand his regex. But that's because of *my* shortcoming. If we take that approach, then we should have a comment on each line of code explaining what it does, just in case someone that doesn't know the programming language we picked happens to read the source... impractical.
F.O.R.
PS: Am I the only one that discovered RefactorMyCode thanks to this post?
@N
But that's because of *my* shortcoming.
Disagree. What if you're not planning to write a regexp, but you want to use an existing one, where there's something complex within it? Knowing roughly what it does helps. This is yet another case of overuse = failure, non-use = failure.
Also, no, I knew not of RMC before this. Thanks, Jeff!
Tom on June 28, 2008 6:58 AM
1. If you use it all the time, regex is great.
2. If you use it once in a while, avoid it. Seems like I have to re-learn it each time I hit a case where I need it. Even with the tools.
3. If other people who don't use regex all the time will be supporting the code, don't use it.
But this is a chicken/egg story. If you use it lots, you know it. If you don't... avoid.
Regex always seems like going back to assembler. That's why you have all those utilities. But gee... we are already working in a compiled environment. Why go back for regex with unobvious shift-numeric syntax (!@#$%^*)? I'd prefer to use something with regex's power, but with a more obvious syntax - a regex compiler - but with the original code part of the real code. It probably exists.
mihondo on June 28, 2008 8:29 AM
Remember, if you seldom use regex but have a case for one, you can often find the expression on the internet. I couldn't be bothered to work out how to parse a date in the exact format dd/mm/yy, so I just looked it up on the internet, pasted it in, and checked that it works. Great: it even stops you going past the max days of that month and past 12 months, including the leap day.
pete on June 28, 2008 9:22 AM
orange - it is all the time
Can somebody tell me (from personal experience) the scenarios
- where to use regex
- where to avoid
I only find it useful in validating email, searching to strip out dangerous HTML from input.
Anand on June 28, 2008 9:28 AM
Regular expressions rock. They should absolutely be a key part of every modern coder's toolkit.
I always find myself disagreeing with you whenever you say something should apply to every programmer, regardless of their area of expertise. As just one of many examples, what if the coder is in the video game industry? Sometimes it seems like people forget that there's more to programming than processing text files and validating inputs.
Mike on June 28, 2008 9:41 AM
@Josh Stodola: One of the nice things is that the underlying engines will also have been optimized for speed, so you don't have to. Sure you can hand-write a simple parser, but can you hand-write it to make it fast?
Matijs van Zuijlen on June 28, 2008 9:47 AM
Coding language does seem important to level of regex use. For some thoughts on using regexes in Perl vs Python, see
http://www.fluidinfo.com/terry/2007/06/13/resorting-to-regular-expressions/
Terry
While I don't use regex much, I do see its beauty for some problems - plus I've never really understood the quote about having two problems, but given the context that it's saying regex isn't a solution for everything.. obviously I agree.
Here's the thing though, I just don't see a html sanitizer as being a good example for regex.. sure regex in this scenario can bring results quickly.. and it does the base things perfectly.. but sanitizing html is more than matching patterns.. and while I'm sure you could build more regex to progressively pull everything apart and back together.. it kind of makes me wonder if you are then almost trying to parse with regex..
Naively speaking, because I've never actually written a complicated sanitizer.. I would say that a traditional programming approach - although much slower to see results at first.. would be more flexible.. and given the recursive patterns that exist.. once you start to hit a point.. you'll see results, and get past problems that regex would make more troublesome.. faster.
I'd be hugely surprised if there wasn't a .NET lib out there for doing this already.. if not, then I think a codeplex/sf project is called for, obviously from your posting on refactormycode - and on here.. there's a lot of hugely knowledgeable people in regards to how sanitization should work.. and it would be really interesting to see a product from it.
I'd do it myself, but I know anything I put up would be ripped apart instantly - but hey, if it triggers people to do something in a (gah, give it here, this is how you do it!) kinda way.. then maybe I should :P
Stephen on June 28, 2008 10:39 AM
mihondo: You said regex feels like assembly, and you want something higher level. Guess what the following does:
number :: '0'..'9'*
phoneNumber :: [ '(' number ')' ] number '-' number
This is a BNF (Backus-Naur Form) to match phone numbers. BNFs are a high-level way to design grammars. You can do just about anything a regex can do, though BNFs tend to be self-documenting. The typical use-case is for turning a program's source code into an abstract syntax tree, but it fits really well for simple stuff. The wikipedia page has a simple example for matching any US postal address. The nice thing about BNF is that it turns the text into a data structure. The _really_ nice thing about BNF is it is designed to deal with things like nested tags/parens, the weakest part of regexes.
Common extended-BNF parsers include Yacc/bison (for the unixes) and ANTLR (for Java). My personal favorite is PyParsing, as it has some tasty syntactic sugar.
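To make the "turns the text into a data structure" point concrete, here is a hand-rolled sketch of the two grammar rules above in plain Python, with one function per rule. It does not use any of the parser generators just mentioned. The function names and dictionary keys are invented for illustration, and it assumes each number needs at least one digit, which is slightly stricter than the `*` in the grammar.

```python
DIGITS = "0123456789"


def parse_number(text, pos):
    """number: a run of digits starting at pos.

    Returns (digits, new_pos), or None if there is no digit at pos.
    """
    end = pos
    while end < len(text) and text[end] in DIGITS:
        end += 1
    return (text[pos:end], end) if end > pos else None


def parse_phone_number(text):
    """phoneNumber: [ '(' number ')' ] number '-' number

    Returns a dict describing the parts, or None if text does not match.
    """
    pos = 0
    area = None
    if text.startswith("("):            # optional '(' number ')'
        parsed = parse_number(text, 1)
        if parsed is None:
            return None
        area, pos = parsed
        if text[pos:pos + 1] != ")":
            return None
        pos += 1
    parsed = parse_number(text, pos)    # number
    if parsed is None:
        return None
    prefix, pos = parsed
    if text[pos:pos + 1] != "-":        # '-'
        return None
    parsed = parse_number(text, pos + 1)  # number
    if parsed is None:
        return None
    line, pos = parsed
    if pos != len(text):                # reject trailing junk
        return None
    return {"area": area, "prefix": prefix, "line": line}
```

With this sketch, `parse_phone_number("(555)123-4567")` yields a dict with all three parts filled in, and `parse_phone_number("123-4567")` yields the same shape with `area` set to `None`.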
Kyle on June 28, 2008 10:40 AM
I can also show you something written in the very same medium that is so beautiful it will make your eyes water
Ok, I'm calling you out on the Klingon. Let's see that beautiful eye-watering Klingon.
Rick! on June 28, 2008 1:01 PM
I once heard some good advice that seems to work for me and my code:
Try to use a lot of vertical space and
very little horizontal space!
This applies to regular expressions as well. Readable code is all about whitespace, comments and proper naming.
Florian Potschka on June 28, 2008 1:52 PM
Orange
I've yet to get a handle on Regex but I do appreciate expressions that others have published and just work for me. Really, really appreciate it. Checking email addresses, post (zip) codes, phone numbers, etc. All this validation of text allows my code to remain concise and also allows me to get on with my job. The language independence is a big bonus in this respect.
I enjoy writing the unit tests against them to make sure all is good and this allows me to know exactly what is and is not covered in each expression.
Having said that, readability is atrocious even with whitespace and this adds further importance to the unit testing as this now doubles up as documentation.
Joe on June 28, 2008 1:52 PM
+1 for Expresso (http://www.ultrapico.com/Expresso.htm)
A simpler one is Regulazy (http://tools.osherove.com/CoolTools/Regulazy/tabid/182/Default.aspx)
Florian Potschka on June 28, 2008 1:54 PM
Hi Jeff... I have similar feelings about the overuse of ajax as you do about the overuse of regular expressions
http://blog.pnbconsulting.com.au/?p=134
lomaxx on June 29, 2008 5:52 AM
The best regex advice I've heard is to not try to write your own parser...FWIW
David Smith on June 29, 2008 7:35 AM
Should you try to solve every problem you encounter with a regular expression? Well, no. Then you'd be writing Perl
Shame on you for perpetuating this tired old piece of nonsense.
Earle Martin on June 29, 2008 8:48 AM
Are you also writing your own webserver and C library?
I couldn't find any good C# HTML sanitizing code that wasn't a huge, dumb dependency. Now I can, because I wrote it!
Try to use a lot of vertical space and very little horizontal space!
Agree, see flattening arrow code
http://www.codinghorror.com/blog/archives/000486.html
I'd prefer to use something with regex's power, but with a more obvious syntax
Maybe fluent interface? But I disagree.
http://www.codinghorror.com/blog/archives/000989.html
Saw some replies asking about open source regex editor:
KDE regular expression editor manual:
http://docs.kde.org/kde3/en/kdeutils/KRegExpEditor/index.html
Redet:
http://billposer.org/Software/redet.html
Simple version:
http://www.arachnoid.com/regex_lab/
One of the best books to learn how to use regex : http://oreilly.com/catalog/9780596528126/
Before reading it, I thought I knew regular expressions. It made me change my mind.
On the subject of good regular expression tools, I would like to recommend a free online one that is designed for .NET programmers:
http://www.lastdomainnameonearth.com.
Regular Expressions are a very powerful tool that all developers should know, but sometimes you can fall into deep subtle pits of despair if you don't know PERFECTLY what you are doing.
The most important things I discovered one month ago are:
[1] NOT ALL REGEXPR ENGINES USE THE SAME SYNTAX AND/OR MATCHING ALGORITHM
[2] SOMETIMES, REGEXPR ENGINES CHEAT!
For [1], just check the RegExp section in Xml Schema Specification at W3C (http://www.w3.org/TR/xmlschema11-2/#regexs). They decided that, since most people would want a full match on a RegExp, their parser would automatically anchor it (WORST. IDEA. EVER).
So, if you decide (like I did, fool me, fool me) to define a RegExp in a Schema for Validation, and then use it also in another part of your application, you will have lots of trouble.
Basically, in XSD you get the full Perl RegExp syntax, without ^ $ (which will be treated as NORMAL CHARACTERS) and \A \Z (which will BREAK your RegExp), and you will get an automatic anchor instead...
For [2], some engines (e.g. the .NET Regex engine) cheat on some expressions, to make things work almost any time. Basically, I had 2 expressions that should have returned different matches (by Perl Syntax), but they returned the same matches (in .NET Match). I'm sorry I can't remember the exact expressions right now, but I remember shouting the loudest WTF ever, when I checked this... and I will not tell you about the differences between .NET Parser and the various Java Parsers :-)
So, I would add this advice to the list of this post:
- Always check (double-triple-check) your Expressions IN THE ENVIRONMENT they will be executed (or with the right options in your tool of choice).
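The anchoring trap in [1] is easy to reproduce. The sketch below does not run an XSD validator; it only imitates the two behaviours with Python's `re.search` (match anywhere) and `re.fullmatch` (whole string must match). The three-digit pattern and the function names are made up for illustration.

```python
import re

# Hypothetical pattern shared between a schema and application code.
PATTERN = r"\d{3}"


def matches_anywhere(value):
    """Unanchored search: true if the pattern occurs anywhere in value."""
    return re.search(PATTERN, value) is not None


def matches_whole(value):
    """Whole-string match, roughly what an implicitly anchored engine does."""
    return re.fullmatch(PATTERN, value) is not None
```

The same pattern accepts `"abc123def"` under `matches_anywhere` but rejects it under `matches_whole`, which is exactly the kind of disagreement you only notice by testing in the environment where the expression will really run.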
Filini on June 30, 2008 3:33 AM
The comments to this entry are closed.