I <3 Steve McConnell*
Coding Horror
programming and human factors
by Jeff Atwood

August 28, 2004

Java vs. .NET RegEx performance

I was intrigued when I saw a cryptic reference to " the lackluster RegEx performance in .NET 1.1" on Don Park's blog. Don referred me to this page, which displays some really crazy benchmark results from a Java regex test class-- calling C#'s regex support "20 times slower than [Java]." First of all, them's fightin' words, and second, those results are just too crazy to really make sense. Why would C# be over an order of magnitude slower than Java at a classic computer science problem like a regular expression parser? I don't believe it.

So I downloaded the Java JDK, a freeware Java development environment, and I ran that benchmark class on my own machine:

Regular expression library: java.util.regex.Pattern

RE: ^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)
  MS    MAX     AVG     MIN     DEV     INPUT
  46    16      0.0046  0       0       'http://www.linux.com/'
  61    16      0.0061  0       0       'http://www.thelinuxshow.com/main.php3'
  61    16      0.0061  0       0       'usd 1234.00'
  172   16      0.0172  0       0       'he said she said he said no'
RE: usd [+-]?[0-9]+.[0-9][0-9]
  MS    MAX     AVG     MIN     DEV     INPUT
  0     0       0.0     0       0       'http://www.linux.com/'
  15    15      0.0015  0       0       'http://www.thelinuxshow.com/main.php3'
  15    15      0.0015  0       0       'usd 1234.00'
  15    15      0.0015  0       0       'he said she said he said no'
RE: \b(\w+)(\s+\1)+\b
  MS    MAX     AVG     MIN     DEV     INPUT
  0     0       0.0     0       0       'http://www.linux.com/'
  31    16      0.0031  0       0       'http://www.thelinuxshow.com/main.php3'
  31    16      0.0031  0       0       'usd 1234.00'
  47    16      0.0047  0       0       'he said she said he said no'
Total time taken: 266

Note that I only ran this for the "built in" Java regex library java.util.regex.Pattern; the benchmark has template code for dozens of alternative regex parsers. I snipped that code out for simplicity. The standard Java regex class performs very well according to the results shown on the referring page.

I then converted the sample code to VB.NET, and got these results:

Regular expression library: System.Text.RegularExpressions

RE: ^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)
  MS    MAX     AVG     MIN     DEV     INPUT
  32    3.033   0.0032  0.0025  0.0325  'http://www.linux.com/'
  63    3.04    0.0063  0.0053  0.0325  'http://www.thelinuxshow.com/main.php3'
  122   3.053   0.0122  0.0109  0.0327  'usd 1234.00'
  234   3.067   0.0234  0.0212  0.0328  'he said she said he said no'
RE: usd [+-]?[0-9]+.[0-9][0-9]
  MS    MAX     AVG     MIN     DEV     INPUT
  20    0.729   0.002   0.0017  0.0073  'http://www.linux.com/'
  40    0.732   0.004   0.0036  0.0073  'http://www.thelinuxshow.com/main.php3'
  63    1.748   0.0063  0.0056  0.0175  'usd 1234.00'
  82    1.751   0.0082  0.0075  0.0175  'he said she said he said no'
RE: \\b(\\w+)(\\s+\\1)+\\b
  MS    MAX     AVG     MIN     DEV     INPUT
  19    0.25    0.0019  0.0017  0.0037  'http://www.linux.com/'
  38    0.252   0.0038  0.0034  0.0038  'http://www.thelinuxshow.com/main.php3'
  62    4.961   0.0062  0.0053  0.0497  'usd 1234.00'
  81    4.963   0.0081  0.007   0.0497  'he said she said he said no'
Total time taken: 170

So.. yeah. I don't know what kind of crack the guys at manageability.org are smoking, but I can't seem to find a local vendor.

You may be interested in my VS.NET 2003 console solution which includes both the stripped down java class and the VB.NET equivalent, so you can run this test on your own machine. A few notes on the test:

  • My PC is an Athlon FX-53 (3800+), although the relative scores are all that matter. Just for fun, I'll try both versions on a few other boxes I have here, and post the results in the comments.
  • No optimizations were enabled for either the Java or .NET solutions.
  • Console apps were executed directly without a debugger. Having a debugger running will double your runtime. I check for this in the .NET version and display a warning if you have the debugger attached.
  • The timing code is a little different between the Java and .NET versions. The .NET conversion uses the QueryPerformanceCounter windows API call to get accurate sub-millisecond timing. One side effect of this is that I have to make two passes to get all the benchmark results: the first pass times each of the 120,000 regex calls individually into an array, and the second pass times just the total execution time. You'll notice that the first pass takes twice as long; that's due to the overhead of calling QueryPerformanceCounter 120,000 times. The upside is that I provide far more accurate timing results, as you can see in the table above. I think the built in Java timer System.currentTimeMillis is kind of like the .NET DateTime.Now.Ticks, eg, limited to a resolution of about 10ms. This was OK back in early 2002 when the Java benchmark was originally constructed, but on today's PCs, it's kind of tough to measure a single regex call with a granularity of 10ms..
  • I think there's a small bug in the original source. Notice that the innermost timing loop takes a start time before it enters the loop, then calculates the elapsed time inside the loop against that start time. This seems wrong to me, because each loop will reflect not only its time but the total time since the loop was entered. Anyway, I've preserved this "feature" in my VB.NET source so the timings will be comparable.

Even though .NET appears to perform almost 40 percent faster than Java in this test, it's still interesting that in only 6 months since that Java benchmark was run (813ms), I can produce Java code that runs over three times as fast (266ms)! So before we put our language war hats on, consider that perhaps the real winner here is the hardware. Doh!

Posted by Jeff Atwood    View blog reactions

 

« Net.WebClient and GZip The Incredible LinkTron 5000(tm)! »

 

Comments

As promised here are some benchmarks from a few machines around the house. I ran each benchmark 4-5 times and took the best reported score.

Athlon 64 3800+ (2.4ghz)
.NET 170ms
Java 265ms

P4 3.2 (3.2ghz), w/hyperthreading
.NET 185ms
Java 330ms

Athlon XP 2400+ (1.7 ghz), dual proc
.NET 250ms
Java -

Pentium M (centrino) 1.2ghz
.NET 320ms
Java 541ms

* I didn't want to install the JDK on my server

Jeff Atwood on August 28, 2004 07:33 PM

Jeff,

The benchmarks I did were done back in Feb. of 2003, so are relevant in that period. So accusing me of smoking crack is a little below the belt.

Now maybe you need to provide the .NET source afterall you're jumping through some peculiar hoops to get your timing "right". How exactly are you timing both versions? Don't time individual calls, time the aggregate result. Finally, let us all see the code!

Carlos E. Perez on August 29, 2004 07:44 AM

It's in the article:

"You may be interested in my VS.NET 2003 console solution which includes both the stripped down java class and the VB.NET equivalent, so you can run this test on your own machine."

http://www.codinghorror.com/files/code/regexbenchmark.zip

Also your article is dated December 2003!

http://www.manageability.org/blog/archive/20030210%23p_i_found_a_a/document_view

Jeff Atwood on August 29, 2004 12:51 PM

Jeff,

Take a very good look at your own results and you'll realize that Java is indeed faster at Regex than .NET.

Look at the averages and the max timings and you'll see the Java bests .NET in almost all the cases. Now for some reason the aggregate times show the opposite. This of course can only be attributed to the way you measure time. That is you use a different routine and this should account for the disparity.

So, the two conclusion that you can make with your results are:

(1) Java still is faster with Regex than .NET
(2) Using QueryPerformanceCounter can gives you more precision and faster timing than DateTime.Now.Ticks.

Finally, why don't you try using 1,000,000 iterations instead of the measly 10,000. The result will surely put you in complete shock!

Carlos

Carlos E. Perez on August 29, 2004 08:12 PM

Carlos,

1) There is a bug in the original code's innermost timing loop, as I pointed out in my blog entry, and in my commented port of the source to VB.NET. Each iteration of that loop is calculating elapsed time using a start time from OUTSIDE the loop rather than at the start of each iteration. See my comments in the source. This will *definitely* cause the aggregate total times (which are added up from each regex time) to be off. What can I say, I didn't write the code.

2) The java System.currentTimeMillis timings seem to have the same precision as DateTime.Now.Ticks, eg, about 10ms. Notice how the "minimum" execution time in the Java code is 0ms? That's because it was too fast to be measured with that granularity (or, maybe it executed in zero time? lol). The average per-regex time on today's fast machines is going to be highly unreliable with a timing granularity of 10ms. My .NET port uses the QueryPerformanceCounter call which harnesses low-level processor support for nanosecond type timing accuracy-- those timings are granular to a few nanoseconds and definitely valid (minus the bug in the original source). Notice my minimum times are actual numbers.. Basically both System.currentTimeMillis and DateTime.Now.Ticks are *worthless* for timing these kinds of single-regex operations because they happen way too fast. And there's a bug on top of that!

3) the total loop count is 3 (regexes) * 4 (strings) * 10,000 (iterations) = 120,000. I tried this using 100,000 iterations for a total of 1,200,000 calls. Results (on the Athlon FX-53):

Java: 2531ms
.NET: 1772ms

Jeff Atwood on August 29, 2004 09:07 PM

1) Precisely, with the timing code in the inner loop you are in essence benchmarking your time code! There's an enormous difference between counting seconds and figuring out exactly what time it is. You are measuring the former versus the latter. Get rid of the inner loop timing and just get the aggregate time.

2) Your result show Java being faster.

4) Do 1,000,000 iterations * 3 (regex) * 4 (string) and just get the aggregate timed result. There's absolutely no need for high resolution timing when you are recording aggregate results that may take a couple of seconds.

Carlos

Carlos E. Perez on August 29, 2004 09:46 PM

Do you actually read anything I post? As I said in the original blog entry that you are commenting on, I do TWO PASSES.

Pass #1 - 120,000 regexes
Low-level timing of each regex call via QueryPerformanceCounter. No total time is recorded.

Pass #2 - 120,000 regexes
No timing of the regex calls. Only total execution time is recorded.

All the total times you have seen are recorded in pass #2. Trust me, these numbers are correct. Don't take my word for it-- download the source and try it yourself. The output shows the results from both passes. Detailed timing from pass #1 (with the source bug, which is why the totals don't add up), and total time from pass #2.

You're right that the overhead of calling the timing code 120,000 times intereferes with total execution time. I pointed out the same thing in the original blog post:

"One side effect of this is that I have to make two passes to get all the benchmark results: the first pass times each of the 120,000 regex calls individually into an array, and the second pass times just the total execution time. You'll notice that the first pass takes twice as long; that's due to the overhead of calling QueryPerformanceCounter 120,000 times. The upside is that I provide far more accurate timing results, as you can see in the table above."

Jeff Atwood on August 29, 2004 11:23 PM

Also I did run 1,000,000 iterations. The results were predictable. Multiply the result by 10..

10,000 iterations : ~170 ms
100,000 iterations : ~1.7 seconds
1,000,000 iterations : ~17 seconds

Same for Java. Just multiply time by 10-- still 40% slower.

Jeff Atwood on August 29, 2004 11:25 PM

Jeff,

There you go, that the reason you VB code is faster, you don't check for time in the inner loop when you get the aggregate time.

That is why when you measure each call Java consistently is faster.

That is why when you measure the aggregate call, Java is longer, that's because you have the timing code inside the Java version and in the VB version you DO NOT have it.

If you want to compare apples to apples then you should compare indentically functioning code. For example here is one you should run on your machines:

http://www.manageability.org/blog/archive/20030520%23p_the_problem_with_cameron/view


Carlos

Carlos E. Perez on August 30, 2004 07:06 AM

Jeff,

I double checked your Java code and removed timing information to see what the difference would be. The timing code slows Java down code by about 12%.

The original benchmarks (Feb 2003) see: http://www.jroller.com/page/ceperez/20030210#p_i_found_a_a/document_view
you pointed to a version that was archived in Dec 2003. To the best of my knowledge was correct at the time. Since then Microsoft may have made improvements to the regex class.

However, the orginal benchmarks do not stress regex enough to be informative. The benchmark link in my previous message should do a better job at that.

Carlos

Carlos E. Perez on August 30, 2004 09:04 AM

It's true that I didn't run a version of the Java code without per-regex timings. So I commented out the individual regex times:

// long iterStarTime = System.currentTimeMillis();
// timeTaken[regnum][itter][strnum] = ( System.currentTimeMillis() - iterStarTime ) ;

Total time is still 265ms with or without these lines. I tried it both ways, 4-5 times.

Bear in mind that QueryPerformanceCounter is a windows API call so it has a lot of overhead associated with it.. DateTime.Now.Ticks is much, much faster, and it looks like System.currentTimeMillis() is also too fast to make a difference.

But these calls both lack timing precision..

Jeff Atwood on August 30, 2004 09:27 PM

In case your interested I ran Jeff's benchmarks using the beta JDK 1.5 and VS 2005 beta using 100000 iterations, and measured the following total time ( I disabled the inner time measurements).

Java 1.5 JDK: 3282 secs.
VS 2005: 2637 secs.

I'm seeing results simular to what jeff is seeing.

Carlos, in you work do you typicall find Java to be signficantly faster then .NET? The reason I ask is that I do a lot of performance oriented work, and I generally find .Net to be 5 to 20 faster in most cases. There may be a library here or there where the java implementation is better (for example most of the collection classes), and more performant, but for the most part .net has a slight performance advantaged. I will say the 1.5 JDK java is get closer.

Also the your link to more comprehensive regex bench mark showing that java is faster is a bit miss leading. This particular benchmark creates 1000's random of regex, which the .net implementation attempts to cache. Due to the randomness of the regex's this caching causes signfincant overhead. In other more real world senarious .nets caching may be a benifit.

terry

Terry on September 16, 2004 02:02 PM

Sorry for the confusion.
The previous post sould have said:
"I generally find .Net to be 5 to 20 percent faster in most cases."

Just to calrify, I do like java, and I think it is very well done. Same goes for .net. So any time I see something like x is 5x faster then y. I get suspious. I hope I didn't add anything to the hysteria.

Terry on September 16, 2004 02:12 PM

Terry,

Interesting results, thanks for posting them.

What are your PC specs just out of curiosity? Did you notice that Java REALLY likes the Athlon 64-- it is disproportionately faster on the A64 compared to the Pentium 4.

Jeff Atwood on September 16, 2004 10:31 PM

Did you do any testing of compiling regular expressions vs running the regular expressions? ie, is the regular expression compiled each time you use it, or are you just compiling each once and running the tests on the resulting state machine?

Java Utilities on March 5, 2005 08:42 PM

That's a function of the regex engine. I know how .NET does it (you have to pass a special flag to get compiled regexes), but I can't speak for Java.

I did a direct 1-1 port of the existing code. I suggest downloading the sample solution linked in the post, which contains both the .NET and Java classes used.

Jeff Atwood on March 5, 2005 11:23 PM

jeff,
i am interested in your VS.NET 2003 console solution which includes both the stripped down java class and the VB.NET equivalent, so that i can run this test on my own machine, but the link u provided is having a corrupt zip file, can u plz provide me the link

amrit on July 28, 2005 01:43 AM

It was a problem with the server config, ZIP files were being compressed with GZIP compression. I think Windows Update overwrote part of my metabase.xml IIS config file..

Jeff Atwood on August 10, 2005 06:55 PM

For what it's worth, I get nearly identical timing results in VS.NET 2005 (final RTM) as I do in VS.NET 2003.

Jeff Atwood on November 2, 2005 02:33 PM

Carlos made a huge mistake in his article. He added RegexOptions.Compiled. This causes IL to be emitted at runtime, causing a huge slowdown. Take this out and .NET wins easily.

Rik Hemsley on August 8, 2007 03:44 AM

Hey, would somebody save me a lot of reading time and summarize the results?
thanks.

Elroy on December 27, 2007 08:49 AM

I know this is an old bllog, but you don't tell us whether you were using -server or -client JVM

Run the test with -server flag, which is much faster than client.

Ak on May 22, 2008 08:47 AM

> Athlon 64 3800+ (2.4ghz)
> .NET 170ms
>
> P4 3.2 (3.2ghz), w/hyperthreading
> .NET 185ms
>
> Athlon XP 2400+ (1.7 ghz), dual proc
> .NET 250ms
>
> Pentium M (centrino) 1.2ghz
> .NET 320ms

Four years later:

> Core 2 Duo E8500 @ 3.5 Ghz
> .NET 100 ms

Jeff Atwood on May 22, 2008 10:36 AM

Well, your four years later numbers are missing java result? Also, were you using -client or -server JVM?

Ak on May 23, 2008 04:18 AM

Also, 100 ms is very short time. Java (especially server JVM) has a slower startup time. The benchmark should be at least 5 sec or larger.

Ak on May 23, 2008 04:21 AM

I don't get the same result. Java regex is twice faster for me.

http://kingrazi.blogspot.com/2008/05/shootout-c-net-vs-java-benchmarks.html

Razi on May 30, 2008 09:04 PM

Razi,
You said in your own post the code was not properly ported, benchmarks need to be based off of relatively identical source code.

Ak,
All of the measurements for this article are self generated by the code. So start up time is not a factor. And as Jeff said, he tried orders of magnitude higher and they scaled as expected.

On the post before that: I think rather then showing the performance between Java vs .NET, he was showing the performance of the CPU, specifically 10% more Frequency speed but nearly 100% better performance compared to the P4.

Joshua on June 17, 2008 10:53 AM







(hear it spoken)


(no HTML)




Content (c) 2008 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.