I was intrigued when I saw a cryptic reference to " the lackluster RegEx performance in .NET 1.1" on Don Park's blog. Don referred me to this page, which displays some really crazy benchmark results from a Java regex test class-- calling C#'s regex support "20 times slower than [Java]." First of all, them's fightin' words, and second, those results are just too crazy to really make sense. Why would C# be over an order of magnitude slower than Java at a classic computer science problem like a regular expression parser? I don't believe it.
So I downloaded the Java JDK, a freeware Java development environment, and I ran that benchmark class on my own machine:
Regular expression library: java.util.regex.Pattern
RE: ^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)
MS MAX AVG MIN DEV INPUT
46 16 0.0046 0 0 'http://www.linux.com/'
61 16 0.0061 0 0 'http://www.thelinuxshow.com/main.php3'
61 16 0.0061 0 0 'usd 1234.00'
172 16 0.0172 0 0 'he said she said he said no'
RE: usd [+-]?[0-9]+.[0-9][0-9]
MS MAX AVG MIN DEV INPUT
0 0 0.0 0 0 'http://www.linux.com/'
15 15 0.0015 0 0 'http://www.thelinuxshow.com/main.php3'
15 15 0.0015 0 0 'usd 1234.00'
15 15 0.0015 0 0 'he said she said he said no'
RE: \b(\w+)(\s+\1)+\b
MS MAX AVG MIN DEV INPUT
0 0 0.0 0 0 'http://www.linux.com/'
31 16 0.0031 0 0 'http://www.thelinuxshow.com/main.php3'
31 16 0.0031 0 0 'usd 1234.00'
47 16 0.0047 0 0 'he said she said he said no'
Total time taken: 266
Note that I only ran this for the "built in" Java regex library java.util.regex.Pattern; the benchmark has template code for dozens of alternative regex parsers. I snipped that code out for simplicity. The standard Java regex class performs very well according to the results shown on the referring page.
I then converted the sample code to VB.NET, and got these results:
Regular expression library: System.Text.RegularExpressions
RE: ^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)
MS MAX AVG MIN DEV INPUT
32 3.033 0.0032 0.0025 0.0325 'http://www.linux.com/'
63 3.04 0.0063 0.0053 0.0325 'http://www.thelinuxshow.com/main.php3'
122 3.053 0.0122 0.0109 0.0327 'usd 1234.00'
234 3.067 0.0234 0.0212 0.0328 'he said she said he said no'
RE: usd [+-]?[0-9]+.[0-9][0-9]
MS MAX AVG MIN DEV INPUT
20 0.729 0.002 0.0017 0.0073 'http://www.linux.com/'
40 0.732 0.004 0.0036 0.0073 'http://www.thelinuxshow.com/main.php3'
63 1.748 0.0063 0.0056 0.0175 'usd 1234.00'
82 1.751 0.0082 0.0075 0.0175 'he said she said he said no'
RE: \\b(\\w+)(\\s+\\1)+\\b
MS MAX AVG MIN DEV INPUT
19 0.25 0.0019 0.0017 0.0037 'http://www.linux.com/'
38 0.252 0.0038 0.0034 0.0038 'http://www.thelinuxshow.com/main.php3'
62 4.961 0.0062 0.0053 0.0497 'usd 1234.00'
81 4.963 0.0081 0.007 0.0497 'he said she said he said no'
Total time taken: 170
So.. yeah. I don't know what kind of crack the guys at manageability.org are smoking, but I can't seem to find a local vendor.
You may be interested in my VS.NET 2003 console solution which includes both the stripped down java class and the VB.NET equivalent, so you can run this test on your own machine. A few notes on the test:
Even though .NET appears to perform almost 40 percent faster than Java in this test, it's still interesting that in only 6 months since that Java benchmark was run (813ms), I can produce Java code that runs over three times as fast (266ms)! So before we put our language war hats on, consider that perhaps the real winner here is the hardware. Doh!
Posted by Jeff Atwood View blog reactions
« Net.WebClient and GZip The Incredible LinkTron 5000(tm)! »
As promised here are some benchmarks from a few machines around the house. I ran each benchmark 4-5 times and took the best reported score.
Athlon 64 3800+ (2.4ghz)
.NET 170ms
Java 265ms
P4 3.2 (3.2ghz), w/hyperthreading
.NET 185ms
Java 330ms
Athlon XP 2400+ (1.7 ghz), dual proc
.NET 250ms
Java -
Pentium M (centrino) 1.2ghz
.NET 320ms
Java 541ms
* I didn't want to install the JDK on my server
Jeff Atwood on August 28, 2004 07:33 PMJeff,
The benchmarks I did were done back in Feb. of 2003, so are relevant in that period. So accusing me of smoking crack is a little below the belt.
Now maybe you need to provide the .NET source afterall you're jumping through some peculiar hoops to get your timing "right". How exactly are you timing both versions? Don't time individual calls, time the aggregate result. Finally, let us all see the code!
Carlos E. Perez on August 29, 2004 07:44 AMIt's in the article:
"You may be interested in my VS.NET 2003 console solution which includes both the stripped down java class and the VB.NET equivalent, so you can run this test on your own machine."
http://www.codinghorror.com/files/code/regexbenchmark.zip
Also your article is dated December 2003!
http://www.manageability.org/blog/archive/20030210%23p_i_found_a_a/document_view
Jeff Atwood on August 29, 2004 12:51 PMJeff,
Take a very good look at your own results and you'll realize that Java is indeed faster at Regex than .NET.
Look at the averages and the max timings and you'll see the Java bests .NET in almost all the cases. Now for some reason the aggregate times show the opposite. This of course can only be attributed to the way you measure time. That is you use a different routine and this should account for the disparity.
So, the two conclusion that you can make with your results are:
(1) Java still is faster with Regex than .NET
(2) Using QueryPerformanceCounter can gives you more precision and faster timing than DateTime.Now.Ticks.
Finally, why don't you try using 1,000,000 iterations instead of the measly 10,000. The result will surely put you in complete shock!
Carlos
Carlos E. Perez on August 29, 2004 08:12 PMCarlos,
1) There is a bug in the original code's innermost timing loop, as I pointed out in my blog entry, and in my commented port of the source to VB.NET. Each iteration of that loop is calculating elapsed time using a start time from OUTSIDE the loop rather than at the start of each iteration. See my comments in the source. This will *definitely* cause the aggregate total times (which are added up from each regex time) to be off. What can I say, I didn't write the code.
2) The java System.currentTimeMillis timings seem to have the same precision as DateTime.Now.Ticks, eg, about 10ms. Notice how the "minimum" execution time in the Java code is 0ms? That's because it was too fast to be measured with that granularity (or, maybe it executed in zero time? lol). The average per-regex time on today's fast machines is going to be highly unreliable with a timing granularity of 10ms. My .NET port uses the QueryPerformanceCounter call which harnesses low-level processor support for nanosecond type timing accuracy-- those timings are granular to a few nanoseconds and definitely valid (minus the bug in the original source). Notice my minimum times are actual numbers.. Basically both System.currentTimeMillis and DateTime.Now.Ticks are *worthless* for timing these kinds of single-regex operations because they happen way too fast. And there's a bug on top of that!
3) the total loop count is 3 (regexes) * 4 (strings) * 10,000 (iterations) = 120,000. I tried this using 100,000 iterations for a total of 1,200,000 calls. Results (on the Athlon FX-53):
Java: 2531ms
.NET: 1772ms
1) Precisely, with the timing code in the inner loop you are in essence benchmarking your time code! There's an enormous difference between counting seconds and figuring out exactly what time it is. You are measuring the former versus the latter. Get rid of the inner loop timing and just get the aggregate time.
2) Your result show Java being faster.
4) Do 1,000,000 iterations * 3 (regex) * 4 (string) and just get the aggregate timed result. There's absolutely no need for high resolution timing when you are recording aggregate results that may take a couple of seconds.
Carlos
Carlos E. Perez on August 29, 2004 09:46 PMDo you actually read anything I post? As I said in the original blog entry that you are commenting on, I do TWO PASSES.
Pass #1 - 120,000 regexes
Low-level timing of each regex call via QueryPerformanceCounter. No total time is recorded.
Pass #2 - 120,000 regexes
No timing of the regex calls. Only total execution time is recorded.
All the total times you have seen are recorded in pass #2. Trust me, these numbers are correct. Don't take my word for it-- download the source and try it yourself. The output shows the results from both passes. Detailed timing from pass #1 (with the source bug, which is why the totals don't add up), and total time from pass #2.
You're right that the overhead of calling the timing code 120,000 times intereferes with total execution time. I pointed out the same thing in the original blog post:
"One side effect of this is that I have to make two passes to get all the benchmark results: the first pass times each of the 120,000 regex calls individually into an array, and the second pass times just the total execution time. You'll notice that the first pass takes twice as long; that's due to the overhead of calling QueryPerformanceCounter 120,000 times. The upside is that I provide far more accurate timing results, as you can see in the table above."
Jeff Atwood on August 29, 2004 11:23 PMAlso I did run 1,000,000 iterations. The results were predictable. Multiply the result by 10..
10,000 iterations : ~170 ms
100,000 iterations : ~1.7 seconds
1,000,000 iterations : ~17 seconds
Same for Java. Just multiply time by 10-- still 40% slower.
Jeff Atwood on August 29, 2004 11:25 PMJeff,
There you go, that the reason you VB code is faster, you don't check for time in the inner loop when you get the aggregate time.
That is why when you measure each call Java consistently is faster.
That is why when you measure the aggregate call, Java is longer, that's because you have the timing code inside the Java version and in the VB version you DO NOT have it.
If you want to compare apples to apples then you should compare indentically functioning code. For example here is one you should run on your machines:
http://www.manageability.org/blog/archive/20030520%23p_the_problem_with_cameron/view
Carlos
Jeff,
I double checked your Java code and removed timing information to see what the difference would be. The timing code slows Java down code by about 12%.
The original benchmarks (Feb 2003) see: http://www.jroller.com/page/ceperez/20030210#p_i_found_a_a/document_view
you pointed to a version that was archived in Dec 2003. To the best of my knowledge was correct at the time. Since then Microsoft may have made improvements to the regex class.
However, the orginal benchmarks do not stress regex enough to be informative. The benchmark link in my previous message should do a better job at that.
Carlos
Carlos E. Perez on August 30, 2004 09:04 AMIt's true that I didn't run a version of the Java code without per-regex timings. So I commented out the individual regex times:
// long iterStarTime = System.currentTimeMillis();
// timeTaken[regnum][itter][strnum] = ( System.currentTimeMillis() - iterStarTime ) ;
Total time is still 265ms with or without these lines. I tried it both ways, 4-5 times.
Bear in mind that QueryPerformanceCounter is a windows API call so it has a lot of overhead associated with it.. DateTime.Now.Ticks is much, much faster, and it looks like System.currentTimeMillis() is also too fast to make a difference.
But these calls both lack timing precision..
In case your interested I ran Jeff's benchmarks using the beta JDK 1.5 and VS 2005 beta using 100000 iterations, and measured the following total time ( I disabled the inner time measurements).
Java 1.5 JDK: 3282 secs.
VS 2005: 2637 secs.
I'm seeing results simular to what jeff is seeing.
Carlos, in you work do you typicall find Java to be signficantly faster then .NET? The reason I ask is that I do a lot of performance oriented work, and I generally find .Net to be 5 to 20 faster in most cases. There may be a library here or there where the java implementation is better (for example most of the collection classes), and more performant, but for the most part .net has a slight performance advantaged. I will say the 1.5 JDK java is get closer.
Also the your link to more comprehensive regex bench mark showing that java is faster is a bit miss leading. This particular benchmark creates 1000's random of regex, which the .net implementation attempts to cache. Due to the randomness of the regex's this caching causes signfincant overhead. In other more real world senarious .nets caching may be a benifit.
terry
Terry on September 16, 2004 02:02 PMSorry for the confusion.
The previous post sould have said:
"I generally find .Net to be 5 to 20 percent faster in most cases."
Just to calrify, I do like java, and I think it is very well done. Same goes for .net. So any time I see something like x is 5x faster then y. I get suspious. I hope I didn't add anything to the hysteria.
Terry on September 16, 2004 02:12 PMTerry,
Interesting results, thanks for posting them.
What are your PC specs just out of curiosity? Did you notice that Java REALLY likes the Athlon 64-- it is disproportionately faster on the A64 compared to the Pentium 4.
Jeff Atwood on September 16, 2004 10:31 PMDid you do any testing of compiling regular expressions vs running the regular expressions? ie, is the regular expression compiled each time you use it, or are you just compiling each once and running the tests on the resulting state machine?
Java Utilities on March 5, 2005 08:42 PMThat's a function of the regex engine. I know how .NET does it (you have to pass a special flag to get compiled regexes), but I can't speak for Java.
I did a direct 1-1 port of the existing code. I suggest downloading the sample solution linked in the post, which contains both the .NET and Java classes used.
Jeff Atwood on March 5, 2005 11:23 PMjeff,
i am interested in your VS.NET 2003 console solution which includes both the stripped down java class and the VB.NET equivalent, so that i can run this test on my own machine, but the link u provided is having a corrupt zip file, can u plz provide me the link
It was a problem with the server config, ZIP files were being compressed with GZIP compression. I think Windows Update overwrote part of my metabase.xml IIS config file..
Jeff Atwood on August 10, 2005 06:55 PMFor what it's worth, I get nearly identical timing results in VS.NET 2005 (final RTM) as I do in VS.NET 2003.
Jeff Atwood on November 2, 2005 02:33 PMCarlos made a huge mistake in his article. He added RegexOptions.Compiled. This causes IL to be emitted at runtime, causing a huge slowdown. Take this out and .NET wins easily.
Hey, would somebody save me a lot of reading time and summarize the results?
thanks.
I know this is an old bllog, but you don't tell us whether you were using -server or -client JVM
Run the test with -server flag, which is much faster than client.
Ak on May 22, 2008 08:47 AM> Athlon 64 3800+ (2.4ghz)
> .NET 170ms
>
> P4 3.2 (3.2ghz), w/hyperthreading
> .NET 185ms
>
> Athlon XP 2400+ (1.7 ghz), dual proc
> .NET 250ms
>
> Pentium M (centrino) 1.2ghz
> .NET 320ms
Four years later:
> Core 2 Duo E8500 @ 3.5 Ghz
> .NET 100 ms
Well, your four years later numbers are missing java result? Also, were you using -client or -server JVM?
Ak on May 23, 2008 04:18 AMAlso, 100 ms is very short time. Java (especially server JVM) has a slower startup time. The benchmark should be at least 5 sec or larger.
Ak on May 23, 2008 04:21 AMI don't get the same result. Java regex is twice faster for me.
http://kingrazi.blogspot.com/2008/05/shootout-c-net-vs-java-benchmarks.html
Razi on May 30, 2008 09:04 PMRazi,
You said in your own post the code was not properly ported, benchmarks need to be based off of relatively identical source code.
Ak,
All of the measurements for this article are self generated by the code. So start up time is not a factor. And as Jeff said, he tried orders of magnitude higher and they scaled as expected.
On the post before that: I think rather then showing the performance between Java vs .NET, he was showing the performance of the CPU, specifically 10% more Frequency speed but nearly 100% better performance compared to the P4.
Joshua on June 17, 2008 10:53 AM| Content (c) 2008 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved. |