I was intrigued by a recent comment from a Microsoft Hotmail developer on the ptifalls they've run into while upgrading Hotmail to .NET 2.0:
Regular Expressions can be very expensive. Certain (unintended and intended) strings may cause RegExes to exhibit exponential behavior. We've taken several hotfixes for this. RegExes are so handy, but devs really need to understand how they work; we've gotten bitten by them.
I'm definitely guilty of this. When I throw a regex together, I never worry about performance; I know the target strings will generally be far too small to ever cause a problem. But you may not need a large string to cause a major performance bottleneck-- it's entirely possible to formulate regular expression patterns that consume exponentially more CPU time for each additional input character, as noted in this python mailing list posting:
I'm acutely aware of that one because it burns people regularly. These aren't cases of hostile input, they're cases of innocently "erroneous" input. After maybe a year of experience, people using a backtracking regexp engine usually figure out how to write a regexp that doesn't go resource-crazy when parsing strings that *do* match. Those are the inputs the program expects. But all inputs can suffers errors, and a regexp that works well when the input matches can still go nuts trying to match a non-matching string, consuming an exponential amount of time trying an exponential number of futile backtracking possibilities.
The example provided in that email is a pattern of
(x+x+)+y
Which means, in regex-ese:
I tried this pattern in RegexBuddy's debugger first, with a simple 20 character test string:
xxxxxxxxxxxxxxxxxxxx
This is no problem in RegexBuddy, because it has an advanced regular expression engine that throttles regex expressions before they hang your machine. Now let's try it in .NET and see what happens:
Dim pattern As String = "(x+x+)+y"
Dim sw As New Stopwatch
sw.Start()
Regex.Match("xxxxxxxxxxxxxy", pattern)
sw.Stop()
Console.Write("Valid match took ")
Console.Write(sw.ElapsedMilliseconds)
Console.WriteLine(" ms")
Dim s As String = "xx"
For n As Integer = 1 To 24
sw.Start()
Regex.Match(s, pattern)
sw.Stop()
Console.Write("Invalid match of ")
Console.Write(s.Length)
Console.Write(" chars took ")
Console.Write(sw.ElapsedMilliseconds)
Console.WriteLine(" ms")
s = s + "x"
Next
Here's the output from this console app:
Valid match took 0 ms Invalid match of 2 chars took 0 ms Invalid match of 3 chars took 0 ms Invalid match of 4 chars took 0 ms Invalid match of 5 chars took 0 ms Invalid match of 6 chars took 0 ms Invalid match of 7 chars took 0 ms Invalid match of 8 chars took 0 ms Invalid match of 9 chars took 0 ms Invalid match of 10 chars took 1 ms Invalid match of 11 chars took 2 ms Invalid match of 12 chars took 3 ms Invalid match of 13 chars took 6 ms Invalid match of 14 chars took 13 ms Invalid match of 15 chars took 25 ms Invalid match of 16 chars took 50 ms Invalid match of 17 chars took 100 ms Invalid match of 18 chars took 202 ms Invalid match of 19 chars took 403 ms Invalid match of 20 chars took 808 ms Invalid match of 21 chars took 1624 ms Invalid match of 22 chars took 3257 ms Invalid match of 23 chars took 6432 ms Invalid match of 24 chars took 12857 ms Invalid match of 25 chars took 25657 ms
Looks like n-squared behavior to me.
I can see where the Hotmail devs would be freaking out about this, considering matching against only 20 characters eats almost a full second of server CPU time! It's too bad the .NET regex engine doesn't implement some kind of throttling or exception behavior when the cost of the regex grows too large.
The author of RegexBuddy calls this catastrophic backtracking, and he has a page describing how to avoid catastrophic backtracking in your regular expressions:
The solution is simple. When nesting repetition operators, make absolutely sure that there is only one way to match the same match. If repeating the inner loop 4 times and the outer loop 7 times results in the same overall match as repeating the inner loop 6 times and the outer loop 2 times, you can be sure that the regex engine will try all those combinations.
The essential part of the customer example he gives is converting a dot (any character) match to a [^,\r\n] (not comma, not carriage return, not newline) match. In other words-- be as specific as possible!
Nice post.
I've always been highly suspicious of regular expressions and hence have never used them. When I considered what it would take to write the exact same code to perform the required parsing it took my breath away. Surely all of this complex parsing logic must come at a pretty hefty price. Apparently it does...
matt on January 14, 2006 9:29 PMI'd be hesitant to toss out regular expressions because of a few catastrophic examples. They are quite rare in the wild. For example, the "(x+x+)+y" is egregiously contrived and could be replaced with "xx+y".
So yes, it's important to watch out for those rare exponential cases, but let's not toss out the baby with the bathwater and rubber ducky.
Haacked on January 14, 2006 11:55 PM> Surely all of this complex parsing logic must come at a pretty hefty price.
Well, no, I wouldn't say that.
This post is about specific backtracking scenarios that can cause *exponential* behavior. It's not very common. But if it happens the results can be quite painful.
Also the hotmail devs are writing code that probably gets 10,000 times more pressure under use than anything we'll ever write. So keep that in mind, too.
Jeff Atwood on January 15, 2006 4:41 AM> Surely all of this complex parsing logic must come at a pretty hefty price.
Yup. That's why I stopped using the for statement. I mean, I could do something like this:
for(int i = 0; i < 1; i = i + 0)
That would just kill my performance.
You've never ever used regexes before? Yikes. You do know, at the very least, there are lots of places where you can get pre-formed regexes that work just fine? (Hint: use Google :)
foobar on January 15, 2006 5:43 AMThe trouble is that people regard regexes as some kind of magic instead of simply a kind of programming language that provides text matching features.
I'm with foobar on this. If a novice programmer codes up for(;;); and it spins for ever, no-one would blame the programming language.
Programming *does* have to be learned. Regexes are no exception.
Dominic Cronin on January 15, 2006 7:41 AMNice post.
Nitpicking: This is not n-squared behavior. It is exponential -- 2-to-the-nth. Every number in the series is almost exactly double its predecessor.
And an important note: For regexes to be really useful for parsing, we need to match groups; and there lies the crux of the problem. If you remove the requirement to match groups, it is not very hard to write a non-backtracking regex engine, in which the complexity of parsing can be exponential in the pattern size, but it is linear in the matched text size.
Shai on January 15, 2006 2:40 PMI do not know how works the Regex stuff. Does it generate some code JIT compiled, the same way as Serializer ? Anyone remembers Lex and Yacc tools (code generator based on lexical and grammatical rules) that belongs to UX word ? I think that code generation may help to get rid with tricky performances issues. What is your opinion about it Jeff ?
Laurent B. on January 16, 2006 3:47 AMAnd then the requisite bad use of regex when a couple small string operations would have been much smarter.
Here's an example of how not to validate whether or not the first 2 chars in a string are alpha's:
bob on January 19, 2006 12:56 PMFor a great treatment in regex in general, and for avoiding catastrophic backtracking in particular, I highly recommend Jeffrey Friedl's book Mastering Regular Expressions (2nd Ed.): http://www.oreilly.com/catalog/regex2/
Some of this is covered on http://www.regular-expressions.info/atomic.html , but the book goes into more topics, more depth, and more analysis.
The two biggest innovations that made dealing with these problems so far have been non-greedy quantifiers, and to a greater extent, possesive quantifiers. These innovations, like most things in the modern regex world, came out of perl*. They have also influenced the planning for perl 6's regex engine. I'm not holding my breath on perl 6's release, but the plans make interesting reading (if you use regex a lot):
http://dev.perl.org/perl6/doc/design/syn/S05.html
http://dev.perl.org/perl6/doc/design/apo/A05.html
http://dev.perl.org/perl6/doc/design/exe/E05.html
*: Friedl's book actually had some influence here as well. He pointed out that if you could tell the regex engine "backtracking is useless here", you could get rid of a lot of pathological situations. Hence, possesive quantifiers. (Friedl's examples are in perl-ese, but different regex engines (and types of engines) are covered in the book.)
Jacob J on June 6, 2006 11:03 AMJeff Atwood: "Looks like n-squared behavior to me."
matt: "For example, the '(x+x+)+y' is egregiously contrived and could be replaced with 'xx+y'."
Dare I propose that the Programming Crisis bloggers are always worried about might be due simply to the fact that blogger-programmers couldn't pass Algorithms 101 with a laxative?
"2, 4, 8, 16, 32, 64,..." is 2^n, not n^2; and "(x+x+)+y" is equivalent to "(xx)+y", not "xx+y".
Anonymous Coward: Dare I propose that the Programming Crisis bloggers are always worried about might be due simply to the fact that blogger-programmers couldn't pass Algorithms 101 with a laxative? "2, 4, 8, 16, 32, 64,..." is 2^n, not n^2; and "(x+x+)+y" is equivalent to "(xx)+y", not "xx+y".
Actually, "(x+x+)+y" matches two or more x's, as does "xx+y". "(xx)+y" matches an even number of x's. Good luck with those laxatives Anonymous Coward.
Dave on July 15, 2006 10:35 PMjust a rectification, because you're being so anal it hurts:
(x+x+)+y IS a sintact equivalent in matching xx+y
for a semantic equivalent version, you could use [xx]+y
and if you need grouping, just go for (x+)y
then, someone could tell me please WHY everyone need to fix an example?
is a TRIVIAL example for this behaviour, for the twelve gods!
did you have a prograsm in fixing this???
Anonymous Coward on January 3, 2008 9:03 AMTo the Anonymous Coward from 2008-01-03 who criticized people for correcting examples while doing so himself, (x+x+)+y is not equivalent to [xx]+y (no need to repeat characters in character classes) or x+y (with or without the parens). Rather, it is equivalent to xx+y (as you noted correctly) or x{2,}y
The pertinent resources for information on how to avoid pathological regex matching cases with a backtracking engine have already been mentioned (JFriedl's book and regular-expressions.info are among the most thorough yet accessible guides). A regex expert should not really have to worry about these cases though, because they are generally quite obvious if one truly understands how backtracking works.
@Jacob J, possessive quantifiers were neither invented by Perl (v5.8 doesn't even support them) nor JFriedl (who recommends in his book that regex engine implementers optimize by automatically possessifying quantifiers when this can be determined to not change potential matches).
A recent, related regex innovation is backtracking control verbs included in Perl 5.10 and recent versions of PCRE, which afford greater control over backtracking than ever.
Steve on January 9, 2008 4:32 AM| Content (c) 2009 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved. |