I <3 Steve McConnell*
Coding Horror
programming and human factors
by Jeff Atwood

Mar 19, 2007

Primary Keys: IDs versus GUIDs

Long-time readers of this blog know that I have an inordinate fondness for GUIDs. Each globally unique ID is like a beautiful snowflake: every one a unique item waiting to be born.

Perhaps that's why I read with great interest recent accounts of people switching their database tables from traditional integer primary keys ...

ID  Value
--  -----
 1  Apple
 2  Orange
 3  Pear
 4  Mango

.. to GUID keys.

ID                                    Value
------------------------------------  -----
C87FC84A-EE47-47EE-842C-29E969AC5131  Apple
2A734AE4-E0EF-4D77-9F84-51A8365AC5A0  Orange
70E2E8DE-500E-4630-B3CB-166131D35C21  Pear
15ED815C-921C-4011-8667-7158982951EA  Mango

I know what you're thinking. Using sixteen bytes instead of four bytes for a primary key? Have you lost your mind? Those additional 12 bytes do come at a cost. But that cost may not be as great as you think:

Using a GUID as a row identity value feels more natural-- and certainly more truly unique-- than a 32-bit integer. Database guru Joe Celko seems to agree. GUID primary keys are a natural fit for many development scenarios, such as replication, or when you need to generate primary keys outside the database. But it's still a question of balancing the tradeoffs between traditional 4-byte integer IDs and 16-byte GUIDs:

GUID Pros

  • Unique across every table, every database, every server
  • Allows easy merging of records from different databases
  • Allows easy distribution of databases across multiple servers
  • You can generate IDs anywhere, instead of having to roundtrip to the database
  • Most replication scenarios require GUID columns anyway
GUID Cons

  • It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
  • Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
  • The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes

I'm not proposing that every database switch to GUID primary keys, but I do think it's important to know the option is out there. If you're still on the fence, what should I choose for my primary key? has excellent advice and a solid analysis of the tradeoffs.

Posted by Jeff Atwood    View blog reactions
« Creating User Friendly 404 Pages
Code Access Security and Bitfrost »
Comments

"JM on March 21, 2007 05:47 AM" has quite a long discussion of natural keys, which misses a couple of critical points.

The first one being people. Using just the name, you end up with a huge problem when the second "John Smith" is supposed to be inserted into your database, but is blocked because it violates the primary key constraint.

So, you have to add another column. Maybe street address? Again, you run into a problem if John Smith of 100 1st Ave in Los Angeles is already in the system and you want to add John Smith of 100 1st Ave in Seattle. So, you add the city, and the state, and then you run into a problem if your database ever has to go international and "state" is part of your primary key but the country you are adding doesn't have "states".

So, you end up using something like the social security number, which is by definition not a "natural key" (it's an artificial key created by the Social Security Administration to differentiate different John Smiths - using some else's surrogate key as your natural key doesn't actually make it any more natural). Again, your whole system has to be rebuilt from the ground up the moment you have a customer who doesn't have a social security number (outside the US) or who has the same social as someone else (the SSA doesn't guarantee these are unique, and as recent identity theft cases show, hundreds of people can actually use the same SS#).

Why not just assign a surrogate key, a single column (either identity or GUID or whatever you come up with that's easy to use that way), that make every John Smith different from every other John Smith?

Oh, and on another point of using address or phone number as part of the key (something I've actually seen), what happens when someone gets a second phone number (cell phone maybe) or second address (winter home for example)? Do you violate Normal Form and add a second address field to the customer table? Do you include it in your key? Or do you do the right thing and have addresses in a sub-table with a foreign key to the customer table primary key?

But, let's stick to his example of a "product name". You have to assume, in that case, that you will never, ever have to deal with some people calling it "a pen" and others calling it "la pluma", so I guess you better stick to designing databases that will never be used anywhere in they don't speak the exact same language you do. Heck, even within a single language, it can be a "ruler" to some people and a "measuring stick" to others, and still others might call it a "yard stick". "Car" vs "auto" vs "automobile" vs "horseless carriage", or "truck" vs "lorry", would be other examples. The list could go on for a very, very long space (check out the nearest Roget's, also called a thesaurus, for an example of how large the list is).

Thus, the basic assumption that "natural keys" should be used because people will somehow automatically use them correctly is patently false. Since the basic assumption is false, the rest of the thesis is useless.

Gus on May 18, 2007 12:06 PM

I agree with those advocating natural keys in URLs.

They're "nicer" looking, they're more memorable, and they're search engine friendly.

I run a video games website and we recently replaced all of our "review.php?id=123" links with "reviews/123/name_of_the_game.html" and got noticable improvements in search engine results with zero performance hit. In fairness the 123 was kept, to ensure backward compatibility with legacy links, but if writing from scratch, that would be gone.

Which is better?

www.myblogsite.com?userID=0AFDC426-C52E-46F0-8752-5F9C71159FD9
www.myblogsite.com?userID=7439
www.myblogsite.com/JeffSmith

Yeah, I think so too. Is there a performance hit to the latter? Yeah. But it's minimal, and well worth it. All your internal joins, etc, stay on surrogate int keys.

As for security, userID=7439 CAN be changed to userID=7440. So? The URL string is the least secure thing on your site. Users can TYPE directly into it. If you're trusting that, your choice of primary key types is the least of your problems.

Matt Burgess on May 21, 2007 10:15 AM

The above post makes some valid points. But is flawed.

Natural keys should be used, but not to make "typeable" addresses. They should be used to make neat links.

People shouldn't be typing in /pen/, they should be clicking on the "Pens" link. Url by guess is not adequate navigation. :)

Matt Burgess on May 21, 2007 10:25 AM

I was speaking with a colleague about this exact question earlier today, and he pointed something out that hasn't been mentioned in this thread as of yet. So, for completeness I'll toss it in.

If you can reasonably assume that each row will be unique across an entire schema, and gracefully handle when that's != true, then Guids also buy you some "polymorphism" for your tables of sorts. For example, if you have an Image table, and images can be associated with multiple entities in your business logic layer, then instead of requiring relations tables 'FooImage', 'BarImage' etc., you can just create an 'EntityImage' table to reference the guid for any object, and the image guid.

This doesn't really add to the debate over performance or integrity (and perhaps you don't like the idea of such free association of images to, now, any other entity, or replacing manifold relations tables for one larger table which, you'll argue, will perform poorly even with an unclustered guid PK), of course, but it might appeal to the more OO-minded developers.

Daniel Crenna on June 14, 2007 4:01 AM

I wrote a lengthy article on this very topic:
http://cookingwithsql.com/index.php?option=com_contenttask=viewid=52Itemid=58

In a nutshell, GUIDS suck for performance. If you are creating rinky dink apps, go for it. If you are building an application with a monster database, then stick to integers.

Datagod on July 21, 2007 5:36 AM

@Datagod:

Thank you for nicely highlighting the "percentages" calculated off those tiny, tiny example tables. My tables happen to have many more columns with way larger data. Do the percentages hold then?

I must say I especially like your char(10) columns, which must be the newest breakthrough in enterprise development.

If you want to be taken seriously, change that script-kiddie name of yours too.

Ishmaeel on July 21, 2007 9:49 AM

Ishmaeel, the example was simplified so as to make it easier to explain.

There is no call for being so rude.

When is the last time you performance tuned a terabyte database? I thought so.

Get bent.

Datagod on July 28, 2007 12:17 PM

Datagod:

How do the % look if you use the sequential GUIDs?

Bobby on August 13, 2007 12:08 PM

Guys, how do you handle merge replication when using autogenerated keys? In my setup there will be lots of PCs (at different locations)with their own SQL Express database. Tables have autogenerated primary and foreign keys. Now all this data need to replicated back to a central location on a daily basis. We can't go with separate identity ranges as that makes managing this for the number of PCs would be hard.

I am looking for some pointers as to how to handle replications problems which will happen when 2 or more records have same primary or forign key. Is there a simpler way besides going row by row on each machine and inserting the records into the central server using the new keys generated by the central server?

Bobby on August 13, 2007 12:19 PM

1) It's amazing that seemingly intelligent people are concerned that there might be a collision with GUIDs and how they're not "guaranteed" to be unique. They might want to brush up on their math skills. A 128 bit number isn't easily duplicated. So what if you get a dup key on insert? It gets kicked back, you handle the error, you regen and retry. What's the big deal? It should only happen once every few thousand years anyway. Your system won't be in production that long and/or the problem won't happen enough to be an issue.

2) The extra 12 bytes isn't a big deal in most cases. Notice I say most cases. If you have a million rows that's only around 12MB right? If you have a super large db then of course you should think about it and maybe use a different strategy but for someone to say "I'm never going to use a GUID PK under any circumstance" is a little short sited IMHO. In computer science the answer is always "it depends".

3) Anyone that uses a clustered index on a guid (without a really good reason) probably doesn't know what they're doing in the first place. So this is user error, not a flaw in the datatype or db engine. If you didn't know that the clustered index doesn't have to be the same as the PK then you should study up and make sure you understand the tools and technology you're using.

Groover on September 12, 2007 12:07 PM

Wow, thanks to RK for the link. Maybe I should brush up on my math as well :)

"1 trillion UUIDs would have to be created every nanosecond for 10 billion years to exhaust the number of UUIDs"

http://en.wikipedia.org/wiki/Universally_Unique_Identifier

Groover on September 12, 2007 12:30 PM

"1 trillion UUIDs would have to be created every nanosecond for 10 billion years to exhaust the number of UUIDs"

The only problem with this logic is that nothing is keeping track of the UUIDs being generated. You have just as much chance of getting a duplicate on the second try as you do no the 100 Quintillionth.

Groover made an excellent point.... HANDLE THE ERROR, RETRY!

Datagod on September 25, 2007 5:18 AM

Hi,

Excellent blog!!!

I am a developer (not a db admin) working designing a database - confirming to best
practices. I am not sure at this time, but at a later time in the app's life, we may have to deal with replication, merging etc.

I read various pros and cons and concerns raised in this blog with auto-gen keys in this blog and the potential performance issues with GUID keys.

I have not read anyone commenting on the statement from Celko 'sarticle...

"There are other ways of getting a unique identifier for a table. The most portable method for getting a new identifier number which is not in the set is something like this:
INSERT INTO Foobar (keycol, a, b, c...)
VALUES(COALESCE((SELECT MAX(keycol) FROM Foobar) +1, 0),
aa, bb, cc, ...); "....

Looks like performance wise, this method may take a 'one-time hit during insertion' because of the sub-query. Can anyone with past experience comment on issues with this approach when it comes to replication/merging and any other db management issues?

Thank you
- Athadu

Athadu on November 23, 2007 12:34 PM

There is nothing wrong with exposing parameters into URL, just verify it! Anything can be entered into the URL including SQL so it should always be verified (against database, cookie and/or session), using a GUID to help here is just lazy and the wrong tool for the job. Remember the Web is stateless, each page needs to check who is logged in and that they have access to the resource that the URL (or cookie) is requesting.

Mark on March 20, 2008 3:57 AM

Using GUIDs helps with security and obfuscation as well. When data must be selected and is databound in some control, such as a listbox, the user can easily figure out certain things about your database.

This is especially true for web pages where a user might be able to figure out how to gain access to certain resources they might not otherwise have access to (ie: view source - scroll to listbox - "OOH! I have access to IDs 1, 2 and 4, but not 3." - "Let's plug 3 into this hidden field and see what happens"). Or SQL injection. Or query strings. You get the point.

Using GUIDs will prevent this. BOING!

Nathan on April 14, 2008 1:43 PM

In weighing up the options, many people presume 'replication' and 'merging' is a rare occurance.

However, when you are dealing with disconnected datasets in the client (which is now more common than not), every transaction from client to server is effectively a database replication/merge scenario.

Paul Coddington on July 14, 2008 10:24 AM

GUIDs are unique identifiers, nothing more. Numeric IDs can give you many useful feature 'for free'. I disagree with GUIDs feeling natural - we're talking about a database which is not natural - numbers are natural and mean something to humans plus you get chronological ordering built in. The performance on a numeric ID is so much better.

Regarding replication, the best way (not my opinion, but in every real world case) is to have a numeric primary key and a GUID column for replication.

Unfortunately the academic best practices which are thought up by the gurus do not translate into the best method for the real applications!

You have to think about how the database is going to be used as that is the absolute goal of a database. Doing things in the disputed 'correct' way is of no use if the solution does not give the project the best end result.

Sam on July 18, 2008 5:31 AM

Consider this:

(employee) What is your customer ID?
(customer) 2A734AE4-E0EF-4D77-9F84-51A8365AC5A0

And you believe this will happen? Never never.

So does anyone has an practical solution?

I am pro-guid...

Willem Luijk on February 21, 2009 3:15 AM

So you are making my points really!

Use both - they have complimentary and valid uses - that's why I've still got a job!

I just cannot understand why people think that GUIDs are the new id - no they are there for a purpose - I don't think sequential numbers will ever have their day.

As I say - think about it, don't just say mine is better than yours - you won't even get job!

Sam on April 9, 2009 5:35 AM

I will not argue for using either GUID, IDENTITY columns or even natural keys. I think it all depends on your needs, if you are building a datawarehouse you'll definitely NOT stick with IDENTITY, you would probably create the Id on-the-fly, SMALLINT, INT, BIGINT or whatever suit your needs.

IMHO using GUIDs when developing Web-applications is nice and easy to deal with, for instance creating new customer, new user etc. notice the usage!! Small tables, but wide access to functionality, much easier to generate the GUID in the application and commit, anyway that's just my opinion. I saw another poster commented the cons of natural keys, I agree :)
(GUID are also un-avoidable when dealing with merge-replication, and merge-replication should normally exist in a normal sized environment where users updates/ inserts to the frontend-SQL and the data needs to be replicated to the backend)

I have read through all the posts (I believe), some misconceptions exists here and there, but the one from :

Aaron G on March 21, 2007 8:00 AM
"If you use IDENTITY as a key, the values tend to cluster on physical
data pages because they are sequential. The result is that if the most
recent rows are the most likely to be accessed, there will be locking
contention for control of those physical data pages. What you really
wanted in a key is some spread of the rows over physical storage to
avoid having every user trying to get to the same page at the same
time."
Aaron's comment
"And I almost missed this gem: "If you use IDENTITY as a key, the values tend to cluster on physical data pages because they are sequential." No Joe, that's what a clustered index is for. Identity keys don't have to be the clustered index and mine usually aren't."

- Really hit my eyes... Aaron doesn't know how a clustered index works :-) Joe does! Aaron missed the point regarding pages and Joe is completely right, it would be very nice to create a clustered index on different filegroups.. however this can be accomplished at a physical level, but that is another story.

Willem Luijk on February 21, 2009 3:15 AM:
You are mixing customer id and surrogate values, don't do that! The GUID is created for relationship and not some, in your case, customerId.
I have always drawn a sharp line between Id's and numbers; IMHO as a rule of thumb: Id's are arbitrary values and shouldn't be used as anything else, whereas numbers are describing something.
So in your case, CustomerId is in fact a GUID, but you need to model a CustomerNo as well :)
And the CustomerNo can be in a friendly easy reading format, i.e. Customer: Joe Montego, could have a CustomerNo like : Montego01 , easy to understand and Mr. Montego is very happy, he doesn't need to look at some stupid insurance/ invoice or other crappy system generated, lazy db's/ developer/ -solution where only the nerds thinks it's cool to have a customernumber like 4533-49900.900-88 *LOL* right? (and where he usually ends up with.. yada yada... I only have my phonenumber and the service employee ends up with looking their own nerdy-invented stupid customerId up using that phonenumber *ROTFL*)

Anonymous on July 23, 2009 3:48 AM

LOL - Forgot to enter my name... Anonymous on July 23, 2009 2:48 PM


Janus007 on July 23, 2009 3:50 AM

Like Andrew, I think URLs should be in the CONS list. Having hackable URLs make it much easier to debug and content admins who use my apps are able to grok a number in the query string (/Page.aspx?PageId=14).

Jason on February 6, 2010 10:02 PM

The problem with these pros/cons, is that none of these pros make a whole ton of sense.

"Unique across every table, every database, every server" - The primary key needs only be unique within it's own table. Everything else here is overkill.

"Allows easy merging of records from different databases" - so does a properly tuned insert..select statement.

"You can generate IDs anywhere, instead of having to roundtrip to the database" - but, the primary key belongs to the database. you should only need to create one when you're inserting into the DB anyway, and already incurring the roundtrip. Generating PK IDs outside of the DB feels like a serious design flaw, leaving your app and DB tied together in a way that seems to fly in the face of design considerations like orthogonality.

There may be specific considerations where this is a fit, but I hate to see this presented as a great idea for everyone when in most cases it's unnecessary slow down for questionable gain.

Dug Wilson on February 6, 2010 10:02 PM

My strategy is to use identity keys for anything that's a sequence (such as a very large stream of chronological data - and no, the timestamp is not a valid option). For tables containing records that are only loosely related or totally unrelated to each other, I use a GUID. It's convenient because, as mentioned in the original post, it can be generated right away without a round-trip to the server.

If a table's going to have billions or trillions of records (we have several of those), it's rarely a good candidate for a GUID key because the potential for collisions does exist. It's also better to use an int (or small/tinyint) for anything representing a "type" because it can be mapped to an enum or a set of constants at the application level.

Replication is a moot point. If you have multiple publishers sending to a single subscriber, just give each server a static key (one key for the whole server) and make the primary key on the subscriber a combination of the identity and the server ID. Perfect uniqueness, almost-perfect indexing.

Aaron G on February 6, 2010 10:02 PM

Oh, just a P.S.: I read Joe Celko's post and was kind of furrowing my eyebrow until he finally laid all his cards on the table with this statement:

"If you have designed your tables correctly, they will have a
meaningful primary key derived from the nature of the entity they
model."

Sorry Joe - what you're sellin', I'm not buyin'.

And I almost missed this gem: "If you use IDENTITY as a key, the values tend to cluster on physical data pages because they are sequential." No Joe, that's what a clustered index is for. Identity keys don't have to be the clustered index and mine usually aren't.

I'm not going to pretend that I know everything about databases, but the only thing I read in Joe's post that was actually correct was his statement that it's a religious war. There's definitely religion in there, fighting tooth and nail for frivolous causes based on no evidence or erroneous/outdated evidence.

Aaron G on February 6, 2010 10:02 PM

How about neither.

One should strive for natural keys in a database as opposed to generated ones.

I've seen both used in databases and for the most part they could have been replaced by natural keys. It is very rare that a table shouldn't have a natrual key or a natural composite key. Most developers choose generated keys as a way of maintaining primary keys on tables, but really, you can avoid it if you try with a good data model.

Jon Raynor on February 6, 2010 10:02 PM

1. The id=4 URL problem:
Not checking access to data/pages is criminal and one should think about changing his job if not doing it.

2. Every number may change. A nice customer ID 12345 may change to M12345 or 200012345-2009. Maybe not tomorrow, not in a month. But at the next merger or the next product manager, CEO or legal issue.
A GUID will never change.
So keep em both if you need a second one, a GUID forever, and a nice ID (as TemporalObject) for showing up.

3. I've almost never seen a customer or whatever business relevant ID of 1. Either they are 0001, 10001 or 2009001 (e.g. for invoices).

4. Performance:
Distinct between reading (reporting) and writing. Mostly you read x times more then write. And most of the time you don't need operations on the GUID. If you use a GUID and a nice and readable ID do your db operations on the latter. Consider having special tables and views jsut for reading concerns, perhaps already prepared for the UI and use cases, avoiding any joins.

5. 100% thumbs up for Groover.

Don on February 6, 2010 10:02 PM

awesome article thanks planning jobs

Planningjobs on August 29, 2010 9:52 AM

«Back

Verify your Comment

Previewing your Comment

This is only a preview. Your comment has not yet been posted.

Working...
Your comment could not be posted. Error type:
Your comment has been posted. Post another comment

The letters and numbers you entered did not match the image. Please try again.

As a final step before posting your comment, enter the letters and numbers you see in the image below. This prevents automated programs from posting comments.

Having trouble reading this image? View an alternate.

Working...

Post a comment

Content (c) 2009 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.