I <3 Steve McConnell*
Coding Horror
programming and human factors
by Jeff Atwood

September 27, 2005

Mastering GUIDs with Occam's Razor

Do you remember the scene from the movie Full Metal Jacket where the marines recite the USMC creed (mp3)?

full-metal-jacket.jpg

It's a little known fact, but programmers have a similar creed:

This is my GUID. There are many like it but this one is mine. My GUID is my best friend. It is my life. I must master it as I must master my life. Without me, my GUID is useless. Without my GUID I am useless.

In fact, GUIDs are so near and dear to our hearts that we recently had a spirited discussion about them at work. Let's say you had a string and needed to determine whether it was a valid GUID. The easy way is a .Parse() style Try-Catch code block:

guid g;
try
{
    g = new Guid("x");
}
catch 
{
}

This is the correct answer.. most of the time. But you know programmers. They never met an edge condition they didn't enjoy discussing ad nauseam. And I was one of the first to chime in:

This is definitely a good way to validate a data type, however, just be aware of the exception performance penalty. Throwing exceptions on failure to cast is expensive, so if this is something that
  • will be invalid often
  • appears in a loop
  • occurs with high frequency

then you'd want to go with a non-exception based check. However most of the time none of these things are true, so the performance is irrelevant.

Then someone suggested trying a regular expression. Oh great, now we have two problems:

Regex r = new Regex(
  "^((?-i:0x)?[A-Fa-f0-9]{32}|
  [A-Fa-f0-9]{8}-[A-Fa-f0-9]{4}-[A-Fa-f0-9]{4}-[A-Fa-f0-9]{4}-[A-Fa-f0-9]{12}|
  \{[A-Fa-f0-9]{8}-[A-Fa-f0-9]{4}-[A-Fa-f0-9]{4}-[A-Fa-f0-9]{4}-[A-Fa-f0-9]{12}\})$”); 

It's valid, but I couldn't resist tweaking this regex for simplicity's sake. The official GUID spec only defines one format for GUID strings, the familiar 8-4-4-4-12 format:

Regex r = new Regex(
  @"^({|\()?[A-Fa-f0-9]{8}-([A-Fa-f0-9]{4}-){3}[A-Fa-f0-9]{12}(}|\))?$"); 

This is my post, so I'll skip the part where others poked holes in my regex. Just when we thought it was over, a fellow developer whipped out a code snippet that benchmarks how long it takes to validate GUIDs via each method:

static void Main(string[] args)
{
      Guid g = Guid.NewGuid();
      string s = g.ToString();

      DateTime before = DateTime.Now;
      for (int i = 0; i < 10000; i++)
      {
        bool retVal = IsGuid(s);
      }
      Console.WriteLine(DateTime.Now.Subtract(before));

      before = DateTime.Now;
      for (int i = 0; i < 10000; i++)
      {
        bool retVal = IsGuid2(s);
      }
      Console.WriteLine(DateTime.Now.Subtract(before));
      Console.ReadLine();
}

public static bool IsGuid(string guidString)
{
      try
      {
            Guid guid = new Guid(guidString);
            return true;
      }
      catch
      {
            return false;
      }
}

public static bool IsGuid2(string guidString)
{
      Regex r;
      r = new Regex(
        @"^({|\()?[A-Fa-f0-9]{8}-([A-Fa-f0-9]{4}-){3}[A-Fa-f0-9]{12}(}|\))?$");
      Match m = r.Match(guidString);
      if (m.Success)
            return true;
      else
            return false;
}

According to this, constructor validation is 3 to 4 times faster than the regex.. or is it? I immediately noticed a few problems that made this a rather questionable benchmark. And, as before, I couldn't resist investigating:

If I increase the iterations to 100,000:
00.1874856
00.7968138
You typically wouldn't want to create a new regex inside the loop, because it's too expensive. If I move the regex creation outside the loop:
00.2031094
00.5780806
If I set RegexOptions.Compiled on the regex:
00.1874856
00.3437236
If I run the above with CTRL+F5 (sans debugger):
00.1718673
00.1874916

It was definitely a fun discussion. I certainly learned a few things about GUIDs I didn't know. Heck, discussions like this are why I joined a software development company in the first place. But it's also a pointless discussion.

Performance was a complete non-issue in this particular scenario. That's why we should always program with Occam's Razor in mind:

Given two similar code paths, choose the simpler one.

Edge conditions and fancy techniques are interesting, but they're not necessarily a worthwhile use of time. Sometimes the simple and stupid solution is all you need.

Posted by Jeff Atwood    View blog reactions

 

« Microsoft naming: who stole the soul? Keyboarding: Microsoft Natural Ergonomic 4000 »

 

Comments

Speanking of Full Metal Jacket.. if you're having trouble meeting your development milestones, have you considered the R. Lee Ermey motivational figure?

http://www.rleeermey.com/motivationalfigure.php

Personally, I like to welcome all new developers to our team with a slightly modified but highly inspirational R. Lee Ermey speech:

"If you ladies survive recruit training, you will be a [DEVELOPER]. You will be a minister of death praying for [MORE DEVELOPMENT TASKS]. But until that day you are pukes. You are the lowest form of life on Earth. You are not even human fucking beings. You are nothing but unorganized grabastic pieces of amphibian shit. Because I am hard you will not like me. But the more you hate me the more you will learn. I am hard but I am fair. There is no racial bigotry here. I do not look down on niggers, kikes, wops or greasers. Here you are all equally worthless. And my orders are to weed out all non-hackers who do not pack the gear to serve on my beloved [DEVELOPMENT TEAM]. Do you maggots understand that?"

I have no idea why there's so much turnover.

Jeff Atwood on September 29, 2005 01:02 AM

So you're saying you liked the .Parse solution better than the Regex solution in this situation?

Haacked on September 29, 2005 01:59 AM

Yes.

I realized I missed another optimization, too! IsGuid2 should be a one-liner:

return r.IsMatch(guidString);

That shaves another 18% off:

0.5468
0.4531

And it means the regex, in the final compiled non-debugger run, is now consistently faster (!) than using the Guid constructor.

And that's with zero exceptions, which means if there were exceptions, the perf gap would widen considerably.

Not that it matters. But I'm just saying.

Jeff Atwood on September 29, 2005 03:46 AM

I think most sane people would prefer the non-exception-catching dependent way. However, I would step into the guid constructor source to see what they are doing. Using a regex to match a guid pattern seems like enormous overkill.

Ron on September 29, 2005 01:01 PM

Well, per the documentation, the GUID constructor supports the following:

---
32 contiguous digits:

dddddddddddddddddddddddddddddddd

-or-

Groups of 8, 4, 4, 4, and 12 digits with hyphens between the groups. The entire GUID can optionally be enclosed in matching braces or parentheses:

dddddddd-dddd-dddd-dddd-dddddddddddd

-or-

{dddddddd-dddd-dddd-dddd-dddddddddddd}

-or-

(dddddddd-dddd-dddd-dddd-dddddddddddd)

-or-

Groups of 8, 4, and 4 digits, and a subset of eight groups of 2 digits, with each group prefixed by "0x" or "0X", and separated by commas. The entire GUID, as well as the subset, is enclosed in matching braces:

{0xdddddddd, 0xdddd, 0xdddd,{0xdd,0xdd,0xdd,0xdd,0xdd,0xdd,0xdd,0xdd}}

All braces, commas, and "0x" prefixes are required. All embedded spaces are ignored. All leading zeroes in a group are ignored.
---

For every case but the last one, this improved regex should work:

"^[A-Fa-f0-9]{32}$|^({|\()?[A-Fa-f0-9]{8}-([A-Fa-f0-9]{4}-){3}[A-Fa-f0-9]{12}(}|\))?$"

And with the final optimization (IsGuid2 becomes a one-liner .IsMatch) this is still slightly faster than the GUID constructor in my testing.

Jeff Atwood on September 29, 2005 01:20 PM

Faster, shmaster. What about the elegance? Couldn't we use a design pattern of some sort to increase the elegance? I'm all about the elegance.

Also I'm concerned about heap contention and fragmentation. Which of these makes the fewest number of memory allocations? Bzzzzt! Wrong answer, we're doing it my way now....

Damien Katz on September 29, 2005 03:01 PM

Sorry, for a second there I channeled the ghost of a guy I used to work with in discussion eerily similar to this.

Damien Katz on September 29, 2005 04:24 PM

^[A-Fa-f0-9]{32}$|({|\()?[A-Fa-f0-9]{8}-([A-Fa-f0-9]{4}-){3}[A-Fa-f0-9]{12}(}|\))?$|^({)?[0xA-Fa-f0-9]{3,10}(, {0,1}[0xA-Fa-f0-9]{3,6}){2}, {0,1}({)([0xA-Fa-f0-9]{3,4}, {0,1}){7}[0xA-Fa-f0-9]{3,4}(}})$

Tested against the guid values documented here:

<a href="http://msdn2.microsoft.com/en-us/library/96ff78dc.aspx">http://msdn2.microsoft.com/en-us/library/96ff78dc.aspx</a>;

Cheers,
Colin

Colin Bowern on January 18, 2006 05:53 PM

Hey,

If anyone follows C# 3.0, if MS forgets it, one can actualy add a .TryParse() method to Guid if I understand correctly with Extension methods. I opt that we nag them into including it in the next release!

Rick

Richard Penwell on February 21, 2006 02:39 AM

I am getting the reverse results. The RegEx method never ends up being faster. The try/catch method seems to be 2x faster than the RegEx method.

Are you running on 32 or 64 bit?

LaptopHeaven on October 24, 2006 03:06 PM

> The RegEx method never ends up being faster. The try/catch method seems to be 2x faster than the RegEx method.

The code sample has to be modified. Follow the instructions given above:

1. Increase iterations to 100,000 to get a better measurement.

2. Hoist the creation of the regex out of the loop. It's insane to create 100,000 regex objects.

3. make IsGuid2 a one-liner: return r.IsMatch(guidString)

4.run with CTRL+F5 (sans debugger)

After implementing those changes, on my current machine I end up with 0.073 for the first method and 0.017 for the second method (regex).

Switching to System.Diagnostics.Stopwatch (new to .NET 2.0) produces the same results.

Jeff Atwood on October 24, 2006 04:37 PM

One thing you're doing on IsGuid() is creating 100,000 Guid objects in the first version, but not the second. Those objects will need to get collected by the garbage collector, which might be running periodically. That might be screwing up your metrics. Is there a way to do the cast to an existing object? Or maybe you can inhibit garbage collection.


Brendan Dowling on October 24, 2006 06:16 PM

What the F!!!

There is a bigger reason this benchmark is bogus - it never tries an invalid GUID !!

Should I submit this blog post to the Daily WTF?

You need a benchmark that fails at least some of the time - then your algorithm will kick the ass off the try/catch one.

Geez

El Guapo on October 24, 2006 06:43 PM

Well, I ran the code with Jeff's changes using good and bad GUIDs and got the following results:

Catching the exception took 0.1902736 seconds for 100,000 *good* GUIDs, but 1.8226208 seconds for 100,000 *bad* GUIDs.

Using a regular expression took 0.250360 for 100,000 *good* GUIDs, but only 0.0600864 seconds for 100,000 *bad* GUIDs.

Code shown below.

using System;
using System.Text.RegularExpressions;

namespace GuidTest
{
public class GuidTest
{
// ====================================================================
// ====================================================================
const int NumCycles= 100000; // 100,000 trials per test type

// ====================================================================
// ====================================================================
static string Format=
@"^({|\()?[A-Fa-f0-9]{8}-([A-Fa-f0-9]{4}-){3}[A-Fa-f0-9]{12}(}|\))?$";

// ====================================================================
// ====================================================================
static Regex GuidRegex= new Regex(Format);

// ====================================================================
// ====================================================================
static void Main(string[] args)
{
Guid g = Guid.NewGuid();
string s = g.ToString();


TestExceptionCatchingMethod(s);
TestExceptionCatchingMethod("12");

TestRegularExpressionMethod(s);
TestRegularExpressionMethod("12");


}

// ====================================================================
// ====================================================================
static void TestExceptionCatchingMethod(string s)
{
Console.WriteLine("RUnning TestExceptionCatchingMethod({0})", s);

DateTime before = DateTime.Now;

for (int i = 0; i < NumCycles; i++)
{
bool retVal = IsGuid(s);
}

Console.WriteLine(DateTime.Now.Subtract(before));
}

// ====================================================================
// ====================================================================
static void TestRegularExpressionMethod(string s)
{
Console.WriteLine("RUnning TestRegularExpressionMethod({0})", s);

DateTime before = DateTime.Now;

for (int i = 0; i < NumCycles; i++)
{
bool retVal = IsGuid2(s);
}

Console.WriteLine(DateTime.Now.Subtract(before));
}

// ====================================================================
// ====================================================================
public static bool IsGuid(string guidString)
{
bool guidStringIsGood= false;

try

{
Guid guid = new Guid(guidString);
guidStringIsGood= true;
}

catch

{
guidStringIsGood= false;
}

return guidStringIsGood;
}

// ====================================================================
// ====================================================================
public static bool IsGuid2(string guidString)

{
return GuidRegex.IsMatch(guidString);
}
}
}

Jeff May on October 25, 2006 08:19 AM

The original code never constructs a Guid object in the Regex case. Presumably, if you are going to test the validity of a Guid, you're likely going to want to do something with it requiring constructing a Guid object anyways. Granted a Guid.TryParse would be nicer.

And then there's the issue of the all those Guids that gave their lives for these tests. The horror! :)

Mike on October 25, 2006 09:28 AM

> you're likely going to want to do something with it requiring constructing a Guid object anyways

Well, maybe, but not really. The string representation of a GUID is usually all you need.

Jeff Atwood on October 25, 2006 10:03 AM

And the RegEx problem is even worse in the pre-Whidbey releases of the CLR before LCG (lightweight code generation) existed. Temporary assemblies are generated during construction and those couldn't be GC'd prior to Whidbey. This same problem applies to XmlSerializer and XSLTransform as Tess Ferrandez discusses on her blog (which I think is second only to yours and Rico Mariani's for sheer geektasticity):

[snip]

If we take a look at the XmlSerializer constructor with Reflector we find that it calls

this.tempAssembly = XmlSerializer.GenerateTempAssembly(this.mapping, type, defaultNamespace, location, evidence);

which generates a temp (dynamic) assembly. So every time this code runs (i.e. every time the page is hit) it will generate a new assembly.

The reason it generates an assembly is that it needs to generate functions for serializing and deserializing and these need to reside somewhere.

Ok, fine... it creates an assembly, so what? When we’re done with it, it should just disappear right?

Well... an assembly is not an object on the GC Heap, the GC is really unaware of assemblies, so it won’t get garbage collected. The only way to get rid of assemblies in 1.0 and 1.1 is to unload the app domain in which it resides.

And therein lies the problem Dr Watson.

Temporary assemblies are also used for regular expressions, as well as script blocks in XSL Transforms. In the case of the script blocks I would suggest using XSLT extension objects or cache the transform.

[snip]

Here's the original post for anyone who is interested:

http://blogs.msdn.com/tess/archive/2006/02/15/532804.aspx

Brian Hartung on October 25, 2006 10:30 AM

Clarification...I didn't mean to imply assemblies can be gc'd starting with Whidbey (obviously they are not allocated on the GC heap but rather in internal CLR memory so it's apples and oranges). The point is assemblies can't be unloaded without tearing down the app domain, but with LCG, they're no longer generated in the first place.

Brian Hartung on October 25, 2006 12:17 PM

I should have just included this link to start with:

http://msdn.microsoft.com/msdnmag/issues/06/01/CLRInsideOut/#S7

Brian Hartung on October 25, 2006 12:20 PM

> Well, maybe, but not really. The string representation of a GUID is usually all you need.

From a type safety perspective, it's better to pass around Guid objects. If you're going to consume the Guid immediately, then yeah the string will work.

Mike on October 26, 2006 07:11 AM

Interestingly everyone has fallen into the trap of discussing this non-topic. Yes, the regex version is faster, and if you need to validate 100,000 guids in your normal program operation (as opposed to just a one off) then use the regex.

For every other scenario i'll stick with the one which is easier to type and remember. Of course, if they stick a regex version into the TryParse function of the GUID object in .net 3.5 then i'll happily use it every time.

Simon on November 13, 2007 02:31 AM

What a cool thread! I returned to college in my 40's and graduated with a BS Computer Science in 2002. Now I am a semi-senior C# programmer, but discussions like this make me feel like an infant. Thanks, gentlemen, for the stimulating discussion, and the information.

Jim Bechtold on March 10, 2008 03:45 PM







(hear it spoken)


(no HTML)




Content (c) 2008 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.