March 3, 2005
I am currently in the middle of a way-overdue refactoring of MhtBuilder, which uses regular expressions extensively. I noticed that I had sort of mindlessly added RegexOptions.Compiled all over the place. It says "compiled" so it must be faster, right? Well, like so many other things, that depends:
In [the case of RegexOptions.Compiled], we first do the work to parse into opcodes. Then we also do more work to turn those opcodes into actual IL using Reflection.Emit. As you can imagine, this mode trades increased startup time for quicker runtime: in practice, compilation takes about an order of magnitude longer to start up, but yields 30% better runtime performance. There are even more costs for compilation that should be mentioned, however. Emitting IL with Reflection.Emit loads a lot of code and uses a lot of memory, and that's not memory that you'll ever get back. In addition, in v1.0 and v1.1, we couldn't ever free the IL we generated, meaning you leaked memory by using this mode. We've fixed that problem in Whidbey. But the bottom line is that you should only use this mode for a finite set of expressions which you know will be used repeatedly.
In other words, this is something you don't want to do casually, as I was. And 30% faster isn't a very compelling performance gain to balance against those serious tradeoffs. Unless you're in a giant loop, or processing humongous strings, it's almost never worth it. The MSDN documentation also has this interesting tidbit:
To improve performance, the regular expression engine caches all regular expressions in memory. This avoids the need to reparse an expression into high-level byte code each time it is used.
The second time you build your non-compiled regex, no additional interpreting overhead is incurred. And you get that for free. Even though it sounds faster and all, you probably don't want to use RegexOptions.Compiled. But what about
This avoids the pitfalls associated with dynamic compilation by turning your regular expressions into a compiled DLL. There aren't many articles describing how to do this, but Kent Tegels dug up a few Regex articles with sample code showing how to take advantage of
It seems ideal-- all the advantages of compilation with none of the disadvantages-- but it adds one disadvantage of its own: your regular expressions are now written in stone. You can't change them at runtime, and you have to know what you're going to do entirely up front. This might be a worthwhile tradeoff at the end of a large project that uses regular expressions extensively, but still... only 30% faster? I'd want some actual benchmark numbers from my application before I could justify the loss of flexibility and the additional file dependency.
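One rough way to get those numbers is a small Stopwatch harness along the lines below. This is only a sketch under stated assumptions: the `RegexTiming` class, its href pattern, and the synthetic input are made-up placeholders rather than anything from MhtBuilder, and a trustworthy benchmark would also need warm-up runs, repeated trials, and your real data.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class RegexTiming
{
    // Hypothetical pattern for illustration: captures href values in HTML.
    public const string Pattern = "href\\s*=\\s*\"([^\"]*)\"";

    // Builds a synthetic HTML-ish string containing the given number of links.
    public static string BuildSample(int links)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < links; i++)
        {
            sb.Append("<p>filler text <a href=\"page").Append(i).Append(".html\">link</a></p>\n");
        }
        return sb.ToString();
    }

    // Times construction and matching separately for one set of options.
    // Returns the total match count so callers can check that both modes agree.
    public static int Measure(RegexOptions options, string input, int iterations,
                              out TimeSpan construct, out TimeSpan match)
    {
        var sw = Stopwatch.StartNew();
        var regex = new Regex(Pattern, options);
        construct = sw.Elapsed;

        int count = 0;
        sw.Restart();
        for (int i = 0; i < iterations; i++)
        {
            count += regex.Matches(input).Count;
        }
        match = sw.Elapsed;
        return count;
    }

    // Prints one line per mode; call this from a console app's Main.
    public static void Run()
    {
        string input = BuildSample(2000);
        foreach (var options in new[] { RegexOptions.None, RegexOptions.Compiled })
        {
            int count = Measure(options, input, 50, out TimeSpan construct, out TimeSpan match);
            Console.WriteLine($"{options}: construct {construct.TotalMilliseconds:F2} ms, " +
                              $"match {match.TotalMilliseconds:F2} ms, {count} matches");
        }
    }
}
```

Timing construction and matching separately matters here, because the whole tradeoff is a slower startup in exchange for faster matching. Whether the matching win ever pays back the startup cost depends on how many times, and against how much text, each expression actually runs.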
Posted by Jeff Atwood
I have seen a factor of 3 improvement in performance for compiled regular expressions over uncompiled ones. I think the performance is greatest when the text you are matching against is very long compared to the pattern.
Well, 3x is definitely the kind of performance increase that would seriously tip the scales in favor of compilation! As always, measurement is critical.
For this particular app, I doubt it matters; I may be running regex against 100kb of HTML, but that's utterly dwarfed by the time it will take me to HTTP GET all the files referenced in that HTML (to build the MHT), so it's negligible in terms of overall runtime. If I was writing a straight parser or code colorizer, I'd look harder at compilation.
I think you pretty much got it right. .Compiled reminds me a lot of Perl's study(). In an interpreted language, that might be very helpful... in a compiled one, I'm not so sure.
In my own perf testing of an app that uses regex extensively, I got a 2X improvement in performance by compiling to assembly. If you're only calling the regex a few times, then it's probably not worth it. But if you are processing a LOT of data with regex, then it probably is. In my case this change took 30 seconds off a 2 minute process.
Correct me if I'm wrong. Is having compiled Regexs in static fields a little optimization?
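For what it's worth, the static-field approach would look something like the sketch below. The `LinkPatterns` class and its pattern are hypothetical, made up purely for illustration. The point is that the Regex is constructed (and, with RegexOptions.Compiled, compiled) once when the type is first used, so later calls don't pay that cost again.

```csharp
using System.Text.RegularExpressions;

public static class LinkPatterns
{
    // Hypothetical pattern; built once, when the type is first used.
    private static readonly Regex Href =
        new Regex("href\\s*=\\s*\"([^\"]*)\"", RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Every call reuses the same Regex instance instead of constructing a new one.
    public static int CountLinks(string html)
    {
        return Href.Matches(html).Count;
    }
}
```

Whether that actually shows up as a measurable gain depends on how often the method is called; if it only runs a handful of times, the one-time compilation cost may well outweigh it.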
Thanks a lot for this information...
I was trying to parse 4 gig text files as streams a meg at a time and got a 6x improvement... but initially didn't want to even try a compiled regex without some kind of concrete data about whether it would perform better compiled or not.
Just to add my (small) experience in regards to speed of compiled regex, I found that if you have large source strings, it's the only way to go.
By large, I mean anything starting from a few hundred KB of text.
I am doing a file preparation app and had a regex hanging up. Worked with small files, but on larger files, it looked like it wasn't doing anything for 10 minutes or so.
Before tossing the code in the garbage (I need to handle large files, so that was a real deal breaker), I used RegexOptions.Compiled.
Using the same files, file processing was pretty much instant.
So in my particular case, it really made a difference.
Well, this post is old news, but I just wanted to add my two cents. The process of creating compiled regular expressions is painstaking at best. If you have the need to compile your regular expressions to an assembly that is, say, CLS compliant, strong named, etc, you might like to take a look at this tool that I put together. It makes the management of compilation a snap.
It's already saved me a ton of time. I have full source and compiled EXE for download on the site. There are also more features forthcoming.