Primary Keys: IDs versus GUIDs

March 19, 2007

Long-time readers of this blog know that I have an inordinate fondness for GUIDs. Each globally unique ID is like a beautiful snowflake: every one a unique item waiting to be born.

Perhaps that's why I read with great interest recent accounts of people switching their database tables from traditional integer primary keys ...

ID  Value
--  -----
 1  Apple
 2  Orange
 3  Pear
 4  Mango

.. to GUID keys.

ID                                    Value
------------------------------------  -----
C87FC84A-EE47-47EE-842C-29E969AC5131  Apple
2A734AE4-E0EF-4D77-9F84-51A8365AC5A0  Orange
70E2E8DE-500E-4630-B3CB-166131D35C21  Pear
15ED815C-921C-4011-8667-7158982951EA  Mango

I know what you're thinking. Using sixteen bytes instead of four bytes for a primary key? Have you lost your mind? Those additional 12 bytes do come at a cost. But that cost may not be as great as you think:

Using a GUID as a row identity value feels more natural-- and certainly more truly unique-- than a 32-bit integer. Database guru Joe Celko seems to agree. GUID primary keys are a natural fit for many development scenarios, such as replication, or when you need to generate primary keys outside the database. But it's still a question of balancing the tradeoffs between traditional 4-byte integer IDs and 16-byte GUIDs:

GUID Pros

  • Unique across every table, every database, every server
  • Allows easy merging of records from different databases
  • Allows easy distribution of databases across multiple servers
  • You can generate IDs anywhere, instead of having to roundtrip to the database
  • Most replication scenarios require GUID columns anyway
GUID Cons

  • It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
  • Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
  • The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes

I'm not proposing that every database switch to GUID primary keys, but I do think it's important to know the option is out there. If you're still on the fence, what should I choose for my primary key? has excellent advice and a solid analysis of the tradeoffs.

Posted by Jeff Atwood
127 Comments

I agree for the most part. Another advantage of integers however is that they produce friendlier URLs. For web apps, I favor the HiLo table key pattern.

Andrew Peters on March 20, 2007 5:26 AM

One downside of integer ids on websites is that they are predictable, so anyone who wants to scrape data from your site can do so more easily.

Laurie on March 20, 2007 6:06 AM

(is it just me or is the word always "orange")
I like using guids in a section of an unsecured part of a website. I am working on a contract management system that has multiple companies logging into one system, and to keep certain things from being able to just guess (like system messages to users) i use a guid. But for everything that is admin level, i just us Identities.

Jeremy on March 20, 2007 6:42 AM

"(is it just me or is the word always "orange")"

It is, he's using a "retarded" capchga as he calls it. There was a post a couple monthes back explaining why.

Cullen

mac_man25 on March 20, 2007 6:53 AM

@Backwarlink: Not only does Cons point 3 conflict with Pros point 4, but it also conflicts with Pros points 1 and 2.

On the bright side, though, Cons point 3 is a con of existing database software, not a con of GUIDs per se. If your hashtable performs better on sequential keys, then you're not using the right hash function. (And if you're not using a hashtable, why are you writing database software?)

Anonymous Cowherd on March 20, 2007 6:57 AM

Doesn't Pros point 4 conflict with Cons point 3?

Partially, yes. Using completely random GUIDs can lead to insertion problems (and precludes physical clustering of the index).

Here's a great little comparison between standard GUIDs and sequential GUIDs in SQL Server 2005:

http://www.sqljunkies.com/Article/4067A1B1-C31C-4EAF-86C3-80513451FC03.scuk

Jeff Atwood on March 20, 2007 7:00 AM

Yes, Jeff uses orange: http://www.codinghorror.com/blog/archives/000712.html

More to the point of the original post (grin /), I got to say that, for some odd reason, my instinct is to go with GUIDs. I think the drawbacks are marginal.

1~ size: disk space is not free, but it's cheaper by the day
2~ readability: I think GUIDs are unreadable when presented 'raw'; I always consider developing tools to help administrators *and developers* in their daily tasks, which would resolve the GUIDs to something more readable for them;
3~ generating randomness: for the time being, I have never hit a case where the cost of generating a random GUID was the performance bottleneck; same applies to the index clustering aspect, although I'm sure that if I was developing database systems I'd be much more concerned with this one.

The only problem I can see with GUIDs is that they are not naturally ordered (so I have to create a mechanism to keep track of order, where needed), and that there's always that sinking suspicion that one day you will get a duplicate GUID (since, I believe, most GUID generating libraries I've seen don't guarantee 100.00% uniqueness, but something like 99.9999..%). Call it Murphy's Curse ;P
F.O.R.

Frank Rizzi on March 20, 2007 7:01 AM

I don't like using GUIDs as primary keys because when you deal with high volumes, you pay a big price for the extra bytes. You pay it in the join operations, in the index seeks, and in the number of records in the cache.
I found that a good way is to combine the two approaches. That is - use an integer as the primary key and add a separate GUID as a replication id. Do all joins on the primary key and use the GUID only for replication. This ensures high performance reading and writing the table, but also allows to uniquely identify each record. The downside is that the same record on different servers may have a different primary key, but in my opinion this is a minor issue.

Rami on March 20, 2007 7:13 AM

I think it's possible to have a sequentially generated GUID on the middle tier (eg, without contacting the database server to generate the GUID). You could do this by factoring date and time into the GUID calculation. Each new GUID generated is guaranteed to be "higher" in sequence than ones generated earlier in time.

Here's one such approach for SQL Server 2000:

http://codebetter.com/blogs/scott.bellware/archive/2006/12/27/156671.aspx

Of course, this assumes the date and times are in sync on said machines..

Jeff Atwood on March 20, 2007 7:24 AM

Sequentially assigned GUID values. Seems pointless.

I don't see any actual benefit to this at all. Is it all about geek factor?

l belk on March 20, 2007 7:28 AM

I do hope no-one is suggesting that one build web site URLs off GUIDs.

http://www.codinghorror.com/blog/archives/C87FC84A-EE47-47EE-842C-29E969AC5131.html

is a bit manky, and not at all user friendly.

David Barrett on March 20, 2007 7:29 AM

I love GUIDs.

We actually just started using them in one of our applications at work and they've been pretty neat to work with.

BTW, been reading your blog for a while, first time commenting. Just gotta say though, keep it up!

Calvin on March 20, 2007 7:48 AM

I've been using GUIDs for a long time. The real problem is that they don't index well in large databases. Basically when you use integers the newest row (at the bottom of the table) is the biggest number. That alone saves a lot on index fragmentation. However, I do like being about to create a primary key without talking to the database. I've been using CombID for a while. Basically it is a guid, but it is not completely random, part of the guid contains a date/time stamp. With the date/time stuck in there (in the right place), the guids now sort correctly in the DB! The newest row is the highest guid number, yea!

http://www.informit.com/articles/printerfriendly.asp?p=25862rl=1

Jonathan Miller on March 20, 2007 7:57 AM

IMHO, GUIDs should not be "in" a URL but definitely are better than raw ids in a query string... such as /Products.aspx?Id=C87FC84A-EE47-47EE-842C-29E969AC5131 is better than /Products.aspx?Id=4. Why? Well continuing with the "product" idea, users could figure out the URLs for products created around the item they are viewing. (Id=3, Id=5, Id=6) What if it's been deleted? What if it's been deactivated? Hopefully you've handled this but if you haven't it sucks to have someone purchase 300 of a product discontinued five months ago. It also exposes things like how many products you have, how often you have them, and worst comes to worst, someone could write a script to "crawl" your whole database, suck it down, leading to your whole database (including parts you don't want others to see) replicated or perhaps a DDoS, etc... It all depends on your own circumstances but for some using auto_increment as a primary key and then showing that to the world can be very bad.

Brett on March 20, 2007 7:58 AM

One of the reasons I sort of hate blog entries like this are that it tends to let the less-experienced/educator keep believing that most/every table in a database needs a integer OR a GUID-based key... The third, and better (as even Celko will tell you, as much as I hate to give him as an example -- he's really a troll who doesn't even use ANSI 92/99-standard join syntax) is to pick a true candidate key, and not a synthesized one. Jeff, I've even seen people create a "Year" table with, you guessed it, a YearID (1,2,3...) next to a "YearName" column. It made me cry. The only thing worse would be to see a GUID next to "YearName". Maybe I just need to take a vacation.

Don't get me wrong here. I'm not poo-pooing GUIDs, but I do think they're overkill for a lot of (especially smaller) applications. BTW: They're awesome in handheld applications where you're totally disconnected from your server.

Dave Markle on March 20, 2007 9:24 AM

Clustered SQL Server indices are used to physically order the pages.

Also, any non-clustered indices imbed the clustered index value, so "fatter" primary keys propogate the fatness in the non-clustered indices.

Thumbs down on GUIDs as PKs.

steve on March 20, 2007 10:17 AM

I rarely use GUID's, let alone as PK's. One instance recently that I did use them was for linking to AD User Objects within a SQL database, I wanted to replicate a users name and email from AD into a sql database so I could quickly lookup up common information in an app that I just created.

Their name and e-mail would be looked up very frequently, so instead of burdening AD with this task, I put the information in SQL using the ObjectGuid attribute from AD and making that my FK in SQL.

This is one of the few times that I have actually used a GUID for a valid reason other than "the geek factor".

Just looking at Microsoft's Membership/Roles Provider creating guid's for everything (AppId, RoleId, UserId), sent shivers up my spine. Very difficult to work with and not needed in that instance.

OTOH, there is very valid reasons for using GUID's, I just don't think as often as people think.

Sean Chambers on March 20, 2007 10:27 AM

@David Barrett - "I do hope no-one is suggesting that one build web site URLs off GUIDs."

There is at least one fairly-large software company out there that already has:
http://www.microsoft.com/downloads/details.aspx?FamilyID=B533619A-0008-4DD6-9ED1-47D482683C78displaylang=en

How does that "proof by counter-example" thing go again?

J Marlowe on March 20, 2007 10:43 AM

One drawback to incrementing integers as keys is that users notice the incrementing nature and then assume things about the data that is not reliable, such as order of creation, date of creation, data that was created near each other in time. And I've had tremendous difficulty trying to break them of this thinking, even after adding a column to record this seemingly necessary information.

Andy on March 20, 2007 10:53 AM

Personally, I don't think GUIDs *NOR* auto-increment integer IDs should be in the URL at all. Two reasons:

1. http://codinghorror.com/blog/archives/primary-key-ids-vs-guids.html is way more readable than the two alternatives.

2. Ideally, the primary key shouln't be exposed for something as permanent and public as an URL. Keep the primary key private or transitory if exposed (such as an admin interface in which the url to edit a post isn't likely to be shared).

As you can tell, I'm a fan of surrogate keys (http://en.wikipedia.org/wiki/Surrogate_key)

One real world scenario that causes problems with auto-increment primary keys being exposed in the URL is when users import data from one blog to another using BlogML (or something similar). Having the same PK in the new system is unlikely.

Haacked on March 20, 2007 11:30 AM

If you're exposing database keys of *any* kind in your URLs, that's a whole different problem.

http://www.codinghorror.com/blog/archives/000093.html

The traditional "auto incrementing 32-bit integer" identity column is plenty fast, but it has some problems. I highly recommend reading Joe Celko's post on this.

http://groups.google.com/group/microsoft.public.sqlserver.programming/msg/6d61dbf80d6f0fb6?hl=enrnum=14

Jeff Atwood on March 20, 2007 11:59 AM

I too have made the mistake of building a database with GUIDs in the clustered index. Imagine my delight later, during performance tuning, when I realized that I had a table structure designed to _perfectly fragment_. Oops.

c on March 20, 2007 12:01 PM

I agree with Dave Markle. Natural or candidate keys are by far the best way if you can get away with it. I think most people jump to using auto ids or guid without actually considering them. Also a lot of the time you can get away with using composite keys instead.

http://en.wikipedia.org/wiki/Relational_database#Keys

Mladen Mihajlovic on March 21, 2007 2:01 AM

Yes, but I'm quoting the 3rd edition that is copyrighted in 2005. The Joe Celko article you linked was written in 2001. It doesn't really matter, I'm sure he still feels the same way about both today.

The point is that Celko is not suggesting a GUID is a better alternative to an identity column. Rather, he says it's worse. But, you used his post to imply his opinion is the other way around.

tieTYT on March 21, 2007 2:10 AM

Wow. This is very... wrong. GUIDs don't work like that, and don't solve the problems you think they solve.

Let's start at the beginning. Which flavor of UUID are we talking about? A proper GUID includes a machine-specific fixed portion, which is great for distributed systems, but vastly increases the probability of collisions in data from that node. A UUID consisting of almost exclusively random data may have a lovely distribution, but the probabilites in a high-throughput system can quickly reach, say, the chance of winning a state lottery or getting struck by lightning -- small, but not negligible.

Therefore, any system that needs to merge multiple relations with unique identifiers needs to have robust methods of resolving collisions, however unlikely they may be. In short, if your code won't work with plain integers as primary key identifiers, it's *still* broken if you replace integers with UUIDs. You're just postponing the problem.

Anonymous Cowherd on March 21, 2007 2:22 AM

Point of interest: Microsoft CRM 3 uses GUIDs for primary keys. Not sure about previous versions.

I've been developing for MSCRM for a little while now, and at first I was like 'Eugh!' but now I'm all like, 'Ooooh.'

Andrew on March 21, 2007 2:36 AM

Gotta agree with Anonymous Cowherd. GUIDs are not _actually_ "globally unique", you know. Depending on how they're generated, the chance of collision is small but it exists. That's negligible for low-volume throwaway stuff like URLs or COM interfaces etc., but it's absolutely unacceptable for large databases. Terrible idea.

Chris Nahr on March 21, 2007 2:47 AM

Wow, so sorely mistaken. Good luck with all the storage arrays you'll need with such setups.

Also plainly missing is anything about Oracle or PostgreSQL.

Once again CAPTCHA: orange
Strange setup....

opensoresfreak on March 21, 2007 2:50 AM

GUIDs may be much worse than autoincrement numeric IDs as primary keys for specific database backends like InnoDB. A nice discussion of this topic specific to MySQL can be found here http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/

ludo on March 21, 2007 3:03 AM

My suggestion:

Step 1) Figure out (perhaps by analyzing data flow) where data from various sources gets merged.

Step 2) Write and test code that detects spurious matches and handles them robustly. In general, this code may be dog slow and nasty, but it needs to work, not merely with unit tests (anecdotal evidence) but with stronger guarantees (a mathematical proof would be nice).

Step 3) Now, switch from integers to GUIDs. Your performance-sapping merging code will now be called once in a blue moon.

Starting out with GUIDs is premature optimization, and will obscure failures.

Anonymous Cowherd on March 21, 2007 3:03 AM

The Application I work on, switched to GUIDs as DataBase keys 3 years ago, and the only cons that we're experiencing is the "debug hell" (Point 2).

As for Web URLs, having a GUID key in the DataBase does not force you to also write that key in the URL; there are many techniques to create user-friendly URLs even if you have GUIDs (URL-rewriting, static page generation, Key/Title caching) depending on the size of your Website.

On the other hand, if you're going to use a key in the URL, ID or GUID are not so different (in user-friendliness) when your ID is more than 100000; eventually you'll be copying/pasting the URL anyway...

And, is user-friendly URLs really a must? For blogs, maybe; for small catalogs, maybe; for Microsoft Downloads, definetly not.
-

Filini on March 21, 2007 4:22 AM

Jeff:

Of course I agree that the world isn't the same as it was back then. But I would suggest that the tables that most need GUIDs as PKs are the ones that would be hurt most in the area of performance, simply because they'd be so big -- "Transaction" or "Order" tables come to mind. These are the tables that are going to get hurt the most with large key sizes.

And if you're using GUIDs for *other* tables (state description tables, account tables, time-dimension tables, etc...), well then you're going to be in for real trouble.

For me it comes down to replication and distributed transactions. If you aren't doing that sort of stuff, you generally don't need GUIDs. I prefer the ease of communication of a 32-bit integer when debugging or troubleshooting over a GUID any day.

And I wouldn't be so quick to put Celko up on such a high pedestal -- he sometimes has worthwhile things to say, but he really isn't all that. It's not just his tone that I take exception to. Ask him how we writes his joins, and what he does when he has to mix INNER JOINs and OUTER JOINs in the same query. You'll vomit for sure.

Dave Markle on March 21, 2007 4:45 AM

GUID's are probably OK for many online applications. But if you ever try to migrate a terabyte of legacy data and into a modern system, resolving data quality issues along the way, you will find there is a lot of benefit in a creating you keys to be more deterministic so you can track a record in the target back to the source easily. So long as you design you keys properly you can switch to a different key generation algorithm for new records when the target system comes online without any serious performance implications.

Adrian on March 21, 2007 4:47 AM

I have a right to exist!

0AFDC426-C52E-46F0-8752-5F9C71159FD9

0AFDC426-C52E-46F0-8752-5F9C71159FD9 on March 21, 2007 4:50 AM

I hate UIDs (or misuse and abuse of them). They are uninformative and easily hide bad design decisions, and as mentioned above, are rarely truly unique. Two object share an identity, when they are identical, i.e. equal by some standard. Giving objects distinct ID-numbers implies that they are not identical and that identity cannot be inferred from other available data; this is often not the case. It is preferable to use forms of identification that are descriptive, without resorting to artificial identification.

Flaky on March 21, 2007 4:56 AM

"rarely" unique ?? Well, I hope you have some math to back that up.

Also, someone said the chance of a conflict is unacceptable for large databases ?? I don't know how you design your databases but that's easily solved in any database I have worked with.

RK on March 21, 2007 5:10 AM

If it's worth identifying, one day it may be worth exposing on the web. Give the thing a cool URI, something which won't be broken if you change your DB setup. A uniform id string, which is backed by a system which enables global disambiguation. Better still give your relations a URI too, one day you may like to expose them to the Semantic Web.

Danny on March 21, 2007 5:44 AM

Where I work, we only use GUIDs for FKeys. All PKeys are based off of other unique data.

All clustering discussions aside(a whole different beast than primary key concepts), let's go through an example.

Let's say you have a magazine subscription system. A potential subscriber wants to subscribe to your magazine. Well, his wife already has subscribed him and even provided the exact same data for this person, including contact and billing info.

Using a GUID as a primary key for the subscriber data, this person is now getting two subscriptions of the same magazine. Not good.

Basing it on such data as last name, first name, email, and possibly birth date or some other unique and distinguishing info, you TOTALLY eliminate this, ONLY if the unique primary key constraint isn't circumvented by substituting some other piece of data for one element of the key. In that case, oh well, call customer service and have them cancel one subscription.

Now, let's say the subscriber wants to subscribe to another magazine that you provide. You use GUIDs to identify the subscriber, also providing a user-friendly subscriber ID. In the subscriptions table you use the GUID to identify the subscriber's subscriptions, non-uniquely, but indexed on the guid. In the magazines table you use the GUIDs to identify the magazine, unique indexed and as primary key. NOW you COULD use the pair of GUIDs in the subscriptions table as a primary key.

As look ups, GUIDs work great, for a primary key on parent data? I wouldn't use it. I'm probably reading the blog entry all wrong, but it's kind of a no brainer when it comes to primary key concepts.

Could we go to pineapple someday? I'm getting tired of oranges and need a change of taste. ;)

John Baughman on March 21, 2007 5:53 AM

What I meant by rarely truly unique was that UID schemes don't usually guarantee unique IDs, i.e. "truly unique", excuse for the bad wording.

Flaky on March 21, 2007 5:55 AM

Now, with the parent subscriber table having it's non-GUID primary key, the index will actually be smaller and easier on those hashtables. Thus providing a performance boost when it comes to a transactional database. GUIDs as look ups aren't that horrible IF they are either combined with other identifying data including another GUID. It provides a more rather unique way of identifying the row. And by providing a repeatable character set in you key combo, fragmentation should be reduced. Based on what I know from Oracle, fragmentation comes from not having enough repeatable character strings. I would assume SQL Server would be similar in that problem.

Disk space any more is a rather moot point with a new drive being so cheap. Performance is where all the good and bad happen. If you can't find a performant index you are going to see the GUID performance issue and think it is solely based on this idea. Another option, is to not actually use the UNIQUEIDENTIFIER data type and save the data as a NVARCHAR. The performance of a NVARCHAR is going to be far superior to UNIQUEIDENTIFIER anyday.

(Putting on fire retardant suit now... ;)

John Baughman on March 21, 2007 6:08 AM

What's wrong with using a true candidate key?, why make work and move away from the actual relationships within the data. What would Codd et al make of it all....:)

Ja on March 21, 2007 6:17 AM

I like the fact that once you're through using a GUID, you can sell it on eBay. (Yes, I've seen it).

gregor-e on March 21, 2007 6:22 AM

I have to say that I don't think of a database where every table has it's ID field as a GUID is very practical.

I also have to say that I'm not a big fan of IDENTITY columns either. If I were a perfect programmer it would be different. I prefer to keep lookup/supporting tables constraint and IDENTITY free which allows me to fix mistakes (like accidentily deleting a lookup record or worse). Those tables are often small enough to not hinder performance too much (enough to care) when foreign keys are not linked from main data tables.

Most databases have a composition of 90% supporting tables and 10% business data tables. Using a plain integer for the supporting tables and a GUID for the business data is a more practical approach.

Just my humble opinion.

Roger on March 21, 2007 6:29 AM

GUID's have there place but it isn't as the PK of the table. Being able to see the sequence inserts of data far out ways everything else. SQL Server has a nice datatype of rowversion which is a timestamp but also a glorified GUID. It is unique across the database but has the benefit of changing when the row is updated.

I think it is a mistake to use a GUID for replication purposes only. Lets not mix apples with 'orange''s here ;)

Since space is almost a non-factor... IMHO you should always include a pk, and rowversion (GUID) on every table. If you care about the rowversion changing on update then also add an additional column of a static guid to the table.

Ben Kruger on March 21, 2007 6:46 AM

"Database guru Joe Celko seems to agree."

You say that. I think Mr. Celko does not mean what you think he means. (Although I dislike his smarminess) he's right - IDENTITY columns are horrible - but

GUID's are far worse.

You *should* be using a 'real' key - ie. something derived from your data model - not a 'surrogate' key, which is what integer keys in all their variants

including IDENTITY and GUID are. A GUID is not different from IDENTITY in most situations and is worse in many.

A 'real' key is often called a candidate key and is derived from real attributes of the entity you're modelling. There are very good reasons for this. Say

you're a manufacturer with a range of products that you identify somehow - let's assume "name". Now "name" is probably used within your business from the

shop floor to the sales brochure. That's a primary key and it has many advantages:-

- it will be unique (you can't have your people referring to two different products by the same name, or there'd be chaos)

- everyone in your business will know the names of the products they deal with

- if your data is presented outside the database antwhere from a brochure to a data dump, everyone will be able to see what it relates to on sight

- if you can't find a unique key like this there is something wrong with your data model which *will* lead to anomolies and bugs sooner rather than later

The only reason why IDENTITY (really 'surrogate') keys were ever used was that candidate keys can get pretty big. As a result many people use IDENTITY as an

"alternative key" to improve performance but make them invisible to the users. Only nongs use them as the actual primary key.

Ironically, the performance argument is nonsence. Let's say you've got your PRODUCT_NAME but decide to introduce PRODUCT_ID as an IDENTITY field for the

primary key. This is what will happen:-

- you now need two indexes, one on PRODUCT_ID for "performance" and another on PRODUCT_NAME so your users can do searches using the attribute they know.

Inserts now take longer because two indexes need to be updated - whoops

- if that extra index on PRODUCT_NAME isn't unique your users will be able to enter duplicate records and you'll have data corruption until the end of time.

- your code will become sprinkled with constructs like "SELECT DISTINCT ..." because the parameters to the query (as entered by the users) will probably

have to use the now non-unique natural key for "usability"

- if you ever need to recover or fix data corrupted by a buggy application, it is going to take your support staff a lot longer if they have to trace

relationships through a set of arbitrary numbers. Heaven knows how they'd cope with GUID's (You don't have buggy applications?). Also if you ever need to

migrate/replicate data to another application with a different schema the analysis is going to be a lot easier. That's a common requirement in the world

these days with widescale enterprise integration. All those ETL (extract/transform/load) tool vendors are making money meeting a real need.

So you've now got a redundant index that is going to give you a lot of trouble, why not get rid of one? How about the arbitrary one that no user can

understand in the real world? ie. PRODUCT_ID. If you do that, your performance comes back, and you eliminate a whole lot of scope for error and corruption.

Now, let's take the purported advantages of GUID/IDENTITY/integer:


Pro Con
# Unique across every table, every database, How do you detect duplicates?
every server

# Allows easy merging of records How do you detect duplicates? Where do you put the business
from different databases rules relating one record to another? In the ETL script
right? And then you have to keep the results in a mapping
table for ever afterwards because GUID/IDENTITY value is
*not* predictable. Congratulations, you've just got a
a multi-database, 3 way join. Perhaps not so easy after all.

# Allows easy distribution of databases I'm not sure if you mean just duplication of read-only data for
across multiple servers performance, or multi-server capture of input. The first is
easy with flat out replication. The second requires more
work to prevent duplicate entry but I can't see how GUID's
help at all - you still need to check your input for the
candidate key to detect duplicates. This is not too hard
with industrial grade RDBMs (not easy, not impossible), but
you'd still need to do it with GUID's *and* you need to do
it against the real natural-world key - a cross-database query.

# You can generate IDs anywhere, instead of having Strawman. In the case of IDENTITY, detection of duplicates always
to roundtrip to the database requires the RDBMS. I'll concede GUID's can be generated
on the front end, but you're not winning as you still need
to check for duplicates.

# Most replication scenarios require GUID columns anyway True enough, but this is due to the technical deficiencies of
current products. Implementation details should not drive
design, only influence pragmatism in implementation. When
I use replication I stick the replication id on as an extra
field the users can't see. If the real primary key isn't
enough for error detection, I add a trigger or two.

Lastly, getting back to the PRODUCT example. Let's say you really have a need for distributed generation of keys. You couldn't go wrong if you took a leaf

out of the real world and use a _generated_but_determinate_ identifier. Let me take three examples that illustrate:

1. Computer vendors

Their product "names" are a form of structured key which is constructed according to an arbitrary convention, like:

product line(3 chars)-model number-revision number

These numbers appear in their catalogs and all internal documents. If you've ever ordered equipment you'll know you've had to specify this stuff.
This allows different units in the company to generate guaranteed unique id's that have meaning

2. MAC addresses

As every one knows MAC addresses are not random, rather the leading edge is assigned to a manufactor who uses the remaining bits to structure an id.

3. SNMP object ids

These identify data objects inside network devices. Each is globally unique (across the world), and describe a hierarchy of objects inside each
network device. Each manufactor is assigned a node on the tree and is free to assign any hierarchiachial structure they need underneath that node.


These are good id's as they don't overlap, can be generated without central control and are readily understood by people working with them. Neither IDENTITY

nor GUID are understood at all outside the database which contains them.

JM on March 21, 2007 6:47 AM

GUIDs make the job of bringing data together so much easier and it comes in very VERY handy if you have an application that is multi-user based or has multiple globally distributed databases.

I would argue that performance and disk space is not a problem unless you have millions of rows and hundreds of tables. A few hundred thousand rows will never be a problem!...even spazzy MS Access database can handle that.

Neil on March 21, 2007 7:04 AM

It really depends on the needs for your applicaiton. I deal with distributed systems and disconnected systems constantly so a sequential increment of a int won't cut it. You have to be able to spin up a surrogate key on a disconnectd client.
I know there is a performance / storage trade off. I also know if you are arguing over this, you are either missing largely impact issues elsewhere or building one of the top 10 largest sites in the world.
For the rest of us dealing with disconnected, remote applications (e.g. - real apps of any size) - you use a useful tool.

Bart on March 21, 2007 8:03 AM

I don't know anything about the mathematical probability of GUIDs being unique, but I do have some practical experience in the subject.

I worked on a team years ago where we had to generate our own GUIDS for COM type-libraries. Using Microsoft's most modern UUID generator at the time, we ran into conflicts no less than every week or two. Often times, our conflicts were with MS Office, or other Microsoft GUIDs.

I've also seen inserts fail due to collisions in test databases where GUIDs were used. Even our simple load tests on a later application (at a different company) proved that GUIDs were not anywhere near "globally unique"...

Chris Jaynes on March 21, 2007 8:05 AM

To the people who keep blathering on about natural keys: You are obviously not maintenance programmers in commercial environments.

When the business changes, so does the meaning of your data, and so does the meaning of your natural keys. By using a surrogate key and not exposing it to the end-user, you can fiddle with the other constraints on your database tables without breaking foreign key relationships. While natural keys would be great in a perfect world, it's far too easy to choose natural keys poorly at design time. For example, a lot of people seem to think that SSNs are a good natural key.

\begin{sarcasm}Because not everyone is Joe Celko, and we can sidestep our grievous inability to design databases by using surrogate keys.\end{sarcasm}

As for GUID vs IDENTITY, I think it's as simple as asking yourself whether or not you really need a *GLOBALLY* unique identifier. Can data exist in multiple databases (Service Oriented Architecture blah blah)? Use a GUID. Data warehousing or replication? Use a GUID. Otherwise, default to IDENTITY and enjoy IDENTITY's transparency.

Just my two cents. Thanks for listening!

Nicholas Piasecki on March 21, 2007 8:06 AM

i like guids so much i changed my name!

{CF76B38A-888E-4BEA-B08D-13AFBAAD8745} on March 21, 2007 8:48 AM

The traditional "auto incrementing 32-bit integer" identity column is plenty fast, but it has some problems. I highly recommend reading Joe Celko's post on this.

Most of the "technical/hardware" issues Joe mentions are incorrect. You can read all about it in other posts in the same newsgroup (http://tinyurl.com/2jdo9y) by other individuals that are more knowledgeable about the inner workings of Oracle/SQL Server database engines than Joe Celko.

Also, as far as I know Joe Celko doesn't "approve" of GUIDs. He preaches the use of natural keys.

In your example above, if the table is "Fruits", why not use the fruit name as the key?

By the way, great blog.

xAvailx on March 21, 2007 9:04 AM

Damn, this is worse than FizzBuzz. Honestly, I would have never expected programmers to be so clueless. Not truly unique? What is this, the argument between the Sun blowing up 5 billion years from now or "only" 4.5 billion years? Who cares? There are less than 10^8 seconds in an year; if you generate a billion GUIDs per second, you will need a thousand years before the chance of a collision is 1/2. How many of you generate a billion GUIDs per second? And if you do, what exactly do you propose that would do the job better?

As for (not) using surrogate keys... sorry; less than a month ago there was an article on the daily WTF about someone having to declare one of his twins' birthdate in a different month than the other, because the database of the insurance company was using natural keys, and they couldn't handle two people with the same names, addresses, and birthdate. Codd was a genius, but I simply can't find a serious reason for not using surrogate keys; natural keys have too many problems.

Marcel Popescu on March 21, 2007 9:06 AM

a 64-bit GUID can be stored in an 8-byte integer (which are supported by most modern dbs) so the size concerns over a 4-byte integer are not as bad as with Jeff's 16-byte CHAR representation decision.

Old Timer on March 21, 2007 9:12 AM

Oracle is really a different beast in this respect.

First, it doesn't have built-in support for auto-incrementing INTEGER keys. You have to set up a SEQUENCE object, then explicitly query that object's NEXTVAL property to get a new value for your key. At best, you can automate this query step using a trigger, but we don't tend to do this in practice where I work. Though that is mostly a result of our DBA team not having a good system for managing triggers.

Second, Oracle has no built-in UUID datatype. It does have a SYS_GUID function that returns a UUID, but it is stored as a 16-byte RAW value. RAW values are kind of a pain because you have to convert them to a numeric or text datatype before you can really do any logic on them.

So, performance penalties all around with Oracle. I might just have to do some tests and see what we get.

WaterBreath on March 21, 2007 9:51 AM

Using a GUID as a primary key for the subscriber data, this person is now getting two subscriptions of the same magazine. Not good.

You're oversimplifying the problem, and the solution...

The tradeoff of getting the benefits of a surrogate key is that you may need to include an additional unique index on your table. Which I would think would be very obvious in this case. Even so, your natural key or unique index won't protect you from this problem, as anyone with experience managing customer data knows. You also need to have a de-duping mechanism for information such as names and addresses, to deal with things like typos and whitespace inconsistencies.

The performance of a NVARCHAR is going to be far superior to UNIQUEIDENTIFIER anyday.

I don't see how this could be. Barring conversions and any other implementational differences, a VARCHAR storing a text representation of a GUID is going to be at least twice as large as a GUID, isn't it? I'd think there would be at least a small performance impact, just as there is with GUIDs over integers.

WaterBreath on March 21, 2007 9:52 AM

Guids suck, they take up more room in the DB, they take more DB horsepower to keep indexed, and they aren't readible.

Natural keys stink for all the reasons mentioned.

I see nothing wrong with the good old, tried and true, auto-incrementing int ID as a key.

The ONLY argument for the guid is if you are going to use the id in a URL as it obfuscates things better. That’s it, and there are other ways around that now aren’t there?

Roy on March 21, 2007 9:57 AM

RAW values are kind of a pain because you have to convert them to a numeric or text datatype before you can really do any logic on them

I should have checked that before I submitted. Apparently comparison and equality logic works fine on a RAW field. And you can make a RAW field a primary key in Oracle with no problems.

So, my guess is that the only drawback beyond INTEGER keys is the size.

WaterBreath on March 21, 2007 9:58 AM

Marcel, on a purely theoretical basis you would be correct. You must also take into account the algorithm and random number generator used to generate the GUID. In cases where conflicting GUIDs are being generated often I would suspect the GUID generators are at fault.

Kuerwen on March 21, 2007 10:19 AM

What is the impact on using GUIDs for joins? Isn't faster to join on an integer?

Dan on March 21, 2007 10:22 AM

Does anyone realize that you can set specific GUID values in your table? I've had more than one occasion where I've had to put test data into our server, and using the GUID FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF has helped me tremendously. You can easily delete the sucker when you're done, and have the system generate a new one on the fly.

Besides, GUIDs are required if you're performing replication, as many have stated. If you know this is going to be a requirement, you might as well use them from the get-go.

Granted, I've seen some piss-poor design schemes where GUIDs could be reproduced (since there is no truly random number in any computer system, no matter how advanced), but this comes back to weighing the ease of using an auto-generated key vs. using your entire table as a key, and having one field being the difference.

James on March 21, 2007 10:27 AM

The issue I see with using guids is their lack of perdictability. For example we have integration setup between two system. System A is a public facing web-site, System B is our billing system. We acquired a bunch of new buildings and we needed to put them on our public facing site, but keep them hidden. Rather then adding the records and then getting the GUIDs and putting them in our exlusions table, we simply were able to get the last ID value that SQL Server had generated (NOT THE MAX ID) and then insert the ID values into the exclusion table before the records were generated. This prevented anyone from getting even a momentary glance at data that is not ready for public viewing.

Predictablility has it's place. How do you know if someone delete a record when you are using GUIDS?

If you are using sequential GUIDs, what's the difference between the IDENTITY column? They are both predictable.

I will leave the GUIDs to heterogeneous items like software packages that need to have some uniqueness to them so they don't overlap.

The only response I truly have seen that is a drawback to the predictable sequential IDs is the possibility of data scrapping. But if they got to one piece of information through some web page they can also get to the rest, they just have to write it differently.

Tim on March 21, 2007 10:37 AM

Just to let everyone know, SQl Server and MySQl are not the only DB servers out there. Real database servers do not have any requirements on the data to do proper replication!

Glenn on March 21, 2007 10:37 AM

I have always used a primary key of an identity, with the GUID as the secured URL id, so my tables would look like this if they were update-able on the web page.

Table
=====
id
...
created
createdBy
guid

else I just use the PK and that is IT.

Jeremy on March 21, 2007 10:51 AM

I started to write something but then I read Dug Wilson's entry and it summarizes exactly what I was thinking. So I'll try to go a different direction.

This pro-con list is nice, in an abstract kind of fashion. When I think of real-life database and application design scenarios, there are very few times a GUID would be totally necessary. Almost every single one involves web-related apps with "exposed" data (like in the QueryString or Cookie).

Int counters exposes chronology for pages and users (like with MySpace userId's) which you might not want for your given website, or if you fear about content scraping (where someone could program to hit your site with a QueryString that counts on its own).

The only other time I like GUIDs (or generated uniqueidentifier in SQL Server) is when you have multiple tables where the data from both might be used by the application in the same object. For instance I have seen apps where the Employees class could pull from separate tables with very similar data and a column identifier like "EmployeeType" was not feasible in this situation; they had to be separate tables. In that case a simple row counter could cause confusion.

ian on March 21, 2007 10:55 AM

We use a short string that serves as a "GUID" without actually being of the standard GUID format. We have other data which can be combined to generate a completely unique value across the enterprise (location-transation number). The value is slightly longer than a standard GUID but it is visually recognizable and sortable. We feel that this gives us the best of both worlds.

Matt on March 21, 2007 10:58 AM

When implementing, people who consider using Natural keys obviously haven't had to do anything with them (other than write pretty diagrams) I hate to throw a reality wrench in the works, but Natural / Candidate keys are stink unless they're 'just numbers' and that'll never change (yeah right) - you're setting yourself up for failure.

On top of that, you're really asking for trouble if you're candidate key is alphanumeric and/or if your candidate key is composed of more than 1 column.

All for the sake of conceptual purity (wank). If you need to enforce a relationship between tuples, use a unique index on them instead and stop crying about needing to have more than 1 index on a table. Do you have 1 million inserts a second? If it does matter, you'll have other issues - natural versus surrogate keys are the least of your concerns.

Primary Keys are best when meaningless (i.e. surrogate keys). UUID's are meaningless too, so that's fine. But UUID's have a performance cost feel like overkill most of the time. I don't really see the gain unless replication is necessary (and even then, figure out if it's necessary first). UUID's just feel kludgy. Don't get me wrong, I'm no hater, but they're best used sparingly.

1. URL's shouldn't use the primary key be it an int or a GUID. Figure out a more user-friendly way. People always try to hack URL's.
2. Using GUID's doesn't stop screen-scrapers or robots (except maybe the real simple ones). If your site will/did link to them, they will find them.

If you need to generate keys without having to go to the database every time it's simple - just find some schmucks crappy broken code on the net or spend beer money and read Fowlers Enterprise Patterns of Architecture.

(PS By all means use candidate keys for lookups, searches, and external identifiers, but for the love of god don't use them as the primary or foreign key)

I'm guilty in the past of using candidate keys (just after graduating uni). Through trial-by-fire I have seen the light.

AnonCowardNumber3 on March 21, 2007 10:59 AM

There is DB theory, and then there are the physical limitations of the DBMS being used, which you can either make work for you or against you.

With SQL Server, the smaller the key used in a clustered index, especially if you have non-clustered indices, the better. Hence, putting a PK (which is a clustered index) on a customer name of varchar(500) is bad in this situation. Putting it on their federal tax id is a little better. Using an autoinc field to id the customer makes no sense.

Using auto-incs for a State table, where the abbreviation is char(2), makes no sense, put it on the State code.

How did the (bad) concept of putting key info in URLs creep into this conversation?

If you don't know the difference between a clustered and non-clustered index in SQL Server, then you should pause a read before building any more databases.

steve on March 21, 2007 11:09 AM

So what do you do if you have the GUID 7897F8AE-456B-FBC7-78900A0EEBAB and you want to know what it is?

I prefer to have a single sequence per database, and have my primary keys globally unique across the database, but not globally unique across the universe.

When surfacing these keys to the user (be it in GET parameters, or in text boxes, or on mailing labels), I like to format it in some special way that is somewhat unique to the application.

This way, at first glance I can see US-10257134, and know that it goes to one system while S347142 goes to another system, 238-73-2175 is a social security number, and XU721-8WF7L-172AB-3275A-A9FFX is a windows registration key. Maybe I don't recognize the key format... but I'm still no better off than the GUID case.

philihp on March 21, 2007 12:39 PM

Identities only make sense on really top-level things like Individuals and concrete instances. How does a GUID help a lineitem in an order?

Customer has Orders
Order has OrderItems

OrderItems have no context outside of an Order, so why would you attach a GUID to it? What really does it buy you?

Also, there's no penalty for having both a simple ID for tables and queries as well as having a GUID for identity (if you need it) as well as any other logical id's the thing might need in it's lifetime.

The nice thing about autogen id's is that you never update them (assuming you could, but there's no need). You may want to update the External Id's because yes, things like product identifiers change from inception to marketing.

(Person)
PersonId
PersonName
PersonGuid

(Product)
ProductId
ProductName
ProductGuid
ProductInternalId
ProductExternalId
CreatedDateTime
UpdatedDateTime
RowVer

And if you were that complicated, create a Keys table (sorry I couldn't resist fizzbuzzing it.)

(Key)
KeyId - yet another (surrogate) autogen key that has no meaning
ProductId - references ProductId in Product table
KeyTypeId - (e.g. external, internal, guid)
KeyValue - {23-AD-23} or 'VWTT240L99' or whatever.

Also, another poster said that gaps in keys can indicate deleted records. In Sql Server (at least) that's a lie your computer told you. Failed inserts or rolled back transactions will do the same. The key, once allocated, is gone regardless of the transaction outcome.

Heck in reality we don't even delete rows on major entities (like Person or Order, although you might with OrderItem) - instead we only logically delete them (a whole other lot of problems).

The good thing about opinions .. everyone's got one.

AnonCowardNumber3 on March 21, 2007 12:49 PM

Ixnay on blindly using natural keys for PKs. It leads to the dark side.

Case in point: I have seen at least three insurance systems that used "policy effective date" as a component of a complex natural PKey. Sure, it's what uniquely identifies each Policy record. But guess what - effective dates can be changed through various business processes. Now all of the sudden a process which should be as simple as updating a single field in one row has turned into cascading a complex natural key value down to every other table that is unfortunate enough to reference it via a foreign key relationship.

Using GUIDS tends to promote use of surrogate keys, which in core entities is a Good Thing.

I say go with GUIDs for primary entities until you have a PROVEN need to reduce storage requirements on a specific table - which is a lot more rare than most people think. Also, if there is ANY chance of a future merge of two databases (you sell your system to two companies that later merge), then seriously consider going GUID to prevent a future nightmare when the time comes to integrate the two sources.

For domain entities (lookup tables), I say go with a natural key ONLY if it is guaranteed to be immutable. The choice between GUID and INT for other lookup tables should be on a case by case basis - depending on real usage patterns and the need for security (lots of good points brought up by others, no need to regurgitate).

Thats my 2 cents and a pickle.

keith on March 21, 2007 12:54 PM

I designed my first CRM database containing GUIDS as keys app. 10 years ago now (time flies). During the develoment cycle I've been cursing at myself for doing so, cause yes, debugging is hell ("what was that key again...").

But when one of my cusomers asked me to migrate two of their branch offices into one, database migration was a breeze. The whole process took less than a day. I'd say that compensated for my debuging problems.

Gabri on March 21, 2007 12:56 PM

Joe Celko does not agree with your topic nor does the article you linked. His post merely states that the identity column is bad. He doesn't say that a GUID is good. In fact, here is a quote from his book "SQL For Smarties":

The algorithm used for generating new GUIDs has been widely criticized...
Besides the usual problems with exposed physical locators, each GUID requires 16 bytes of storage, while a simple INTEGER needs only 4 bytes on most machines...
Indexes and PRIMARY KEYs built on GUIDs will have worse performance than shorter key columns...
The real problem is that GUIDs are difficult to interpret, so it becomes difficult to work with them directly and trace them back to their source for validation...
The GUID cannot participate in queries involving aggregate functions; you would first have to cast it as a CHAR(36) and use the string value...
Other features of this data type are very proprietary and will not port out of a Microsoft environment...

Saying that Joe Celko agrees with you was very misleading. He only agrees with you that the identity column is bad.

tieTYT on March 21, 2007 12:57 PM

Recently used GUID for a major product, the reason for it was that data was to be collect seperatly then combined into the main database. Using GUID it was was easy to handle this, just a copy from the other databases to the main, did not have to worry about keeping keys in sync or code to make sure that each copy of the collection software used a different key ID range.
Also because of the sources of data different programs had to be written, in different languages and tools, to get all the data.
That was a fun job, non-sarcastic.

Based on that here are some additional cons

The format for GUIDs is different depending on the library and the tool being used. This is even the case the in the microsoft family of software. It became to standardize on a format then write my own procedure to handle the languages/tools which did not generate that format using thier own commands.

Using the old int ID you had an easy way to see if something got deleted or if a range of data is missing. sequential GUID can help alot with this but are still not easy to catch with the eyeball.

For alot of things in debugging and coding integers are alot easier to work with then strings.

If users are pulling data directly from the database into things such as excel or word it is harder to explain that huge string vs a short number. Minor training issue, and not a problem in most cases.

You no longer have a short unique ID in every table that can be printed on reports/display so that users can identifiy the record when talking with each other. You have to add some more code to handle this.

will dieterich on March 21, 2007 1:16 PM

His post merely states that the identity column is bad.

Which is, more or less, what I'm saying: mapping row identity to a traditional 4-byte incrementing integer can lead to some pathological behavior on the part of the database, users, and developers. Not that it is wrong, per se, but it's less than ideal.

I'm interested in learning about alternatives to the traditional choices.

here is a quote from his book "SQL For Smarties":

Which was originally published in 1996, because I have an original copy. The world of 1996 is not the world of 2007. I'll agree that a larger PK field has significant performance ramifications, but given today's hardware, you'd need to have a very large table before performance costs become prohibitive.

Jeff Atwood on March 21, 2007 1:32 PM

Remember that storage is never just disk space. It means memory cache and new I/O bottlenecks (disk, memory, and network bandwidth, as rare as network-constrained databases are) as well.

For basic testing the solution is simple, just MD5 some identifier and call it a GUID. =p As a collary, GUID columns are significantly faster for storing MD5 values if you ever have to search against them, for whatever reason. Too bad they don't extend to larger SHAs.

Foxyshadis on March 21, 2007 1:51 PM

There is one thing that I dont think I understand properly in the article that introduces "COMBs".

The performance is very low when NEWID() is used, and it is greater when he introduces a new stored procedure that calls NEWID() and performs a lot of other operations too (like casting from one type to another, getting the current timestamp, etc).

How can operation_A be slower than operation_B, when operation_B is a complex set of steps, one of which is operation_A?

Alex on March 22, 2007 2:49 AM

"Honestly, I would have never expected programmers to be so clueless. Not truly unique? What is this, the argument between the Sun blowing up 5 billion years from now or "only" 4.5 billion years? Who cares? There are less than 10^8 seconds in an year; if you generate a billion GUIDs per second, you will need a thousand years before the chance of a collision is 1/2. How many of you generate a billion GUIDs per second?"

The clueless fool is you. Please go back to school (in case you've already finished) and learn about basic probability theory.

You seem to believe that a low probability means you're _guaranteed_ a thousand years of GUID generation before you get a collision. Of course that's rubbish; assuming perfectly equal distribution you could still get an identical pair tomorrow!

But the distribution of real-world GUID generators is not perfect. Chris Jaynes said he got duplicates every week, using the Microsoft generator. That's not so great, eh?

And you also ignore that the "Pro" argument wasn't just uniqueness in a single DB but across all servers in the world. Wanna bet that your "billion" GUIDs that make a collision likely would get created really fast if all DBs around the world started using them as keys?

Bottom line: Using GUIDs as DB keys because you expect them to be "globally unique", without any further checking, is stupid. They simply aren't.

Chris Nahr on March 22, 2007 4:03 AM

You are more likely to be hit by lightening before you get two of the same, a fool saying he got it every week is doing something else wrong, that should be fairly obvious to anybody. Do the math for gods sake.

What are you really arguing about ? Guids do not need to be unique across all databases in the world, why the h-l would they need to ? Nobody expects that except the people like you trying naively and wrongly to use it as an argument against them.

Are autoincremented int's unique in ANY way at all ?? What are you going on about ?? My good, there's a lot of people here that shouldn't even be in front of a computer.

RK on March 22, 2007 6:34 AM

I'm surprised that no one has mentioned the birthday paradox! In n-bits we have only 2^n/2 GUIDs that can be generated without collision. The probability that at least two of 50000 objects drawn from [1, 2^32] are the same is approximately 1 - e^(-(50000 * (50000 - 1))/(2^32)), or 44.1%. Autoincrement is guaranted to be a unique key in your database, but with GUID you're pretty much assuring, on a high-use system, that you will eventually collide.

Elliott Back on March 22, 2007 8:08 AM

You will not collide using guids if you design your database properly, ever.

RK on March 22, 2007 9:20 AM

Your math is wrong, where did you get 32 bits from ? A guid is 128 bit.

For those that jsut can't seem to get the math correct, this is the probabilitites:

http://en.wikipedia.org/wiki/Universally_Unique_Identifier

RK on March 22, 2007 9:31 AM

Thanks, RK. I loved the "shouldn't even be in front of a computer" reply :) And yes, I did take into account the birthday paradox in my calculation.

Bottom line, anyone who thinks collisions between GUIDs is a realistic possibility has... well, problems grin. I once saw a picture with a cow which fell down through a roof and destroyed a computer (this was in an advertisement for a backup program); anyone having nightmares about this happening?

(Somewhat related on the topic of programmers being unable to realistically estimate stuff like this, I have a friend who insisted that a 32-bit auto-incrementing field was too little - what if he reached the limit? He was creating a database for tracking customers.)

Marcel Popescu on March 22, 2007 10:16 AM

You stated: "You can generate IDs anywhere, instead of having to roundtrip to the database"

I assume you mean getting a new ID and then using it in other tables, databases?

If so can't you solve this with transactions?

Matthew R. Miller on March 22, 2007 10:50 AM

I would say their use is pretty obvious, changing worflow ? Not sure what you mean by that, always used guids because they make a h of a lot more sense than using ints or some made up varchar key (can you belive it?) that some people here suggested.

Seem to be some religious war against guids by some people, the math is there to support the claim they are GLOBALLY unique, for all intents and purposes they are globally unique, you are more likely to be hit by a meteorite than get a collission and even if you did, what's the big deal ? You handle it gracefully and generate a new guid.

Either it's a religious thing or you just lack the intelligence to understand guids. Something weird is certainly going on with this blatant bs about guids.

RK on March 22, 2007 12:25 PM

It seems obvious to me that GUIDs in databases are a very bad idea for many of the reasons stated in this comment stream as well as the fact that a lot of the supposed advantages seem like disadvantages to me. But let's assume for the sake of discussion that they are better. Are they so much better that it's worth changing minds and workflows? Are they twice as good? I doubt it. Usually a new solution has to be ten times better for it to be worth fighting to overcome the inertia of the old solution.

Pete on March 22, 2007 1:39 PM

By the way, I had to stop reading the article about GUIDs in databases when it completely undermined its own credibility by claiming SSN is a natural key. There are people with two SSN. There are multiple people with the same SSN. SSN is a huge bureaucratic fustercluck and anybody who uses it as a natural key WILL eventually run into trouble if your database is the huge success you hope it will be. But don't trust me. Google it for yourself.

Pete on March 22, 2007 1:46 PM

@Anonymous Cowherd:

Actually when implemented correctly, using the machine's MAC address (when the MAC address is taken from a serious vendor's NIC) and a timestamp is the only way to guarantee GUIDs are actually unique.
The person having conflicting GUIDs complained they where conflicting with MS office. It is highly unlikely that the same MAC address has been used in the past, on a computer that had a clock way into the future, by microsoft and this person, incidentally corresponding to the moment you generate a new one. In case the conflicts arise regularly, probably type 4 GUIDs are generated by a tool that doesn't use enough entropy to generate the random numbers (for instance seeds the standard random generator with the number of seconds since midnight).

Martijn on March 23, 2007 2:12 AM

@RK:

People get hit by lightning. People win state lotteries.

SQL is not relational, but has the features needed to implement relations. Making a SQL database a relational database -- that is, ensuring that your SQL tables are relations and not tabular data -- requires some effort. Spurious PK matches invalidate the chain of assumptions that make a SQL table act enough like a relation to maintain data integrity and permit joins to produce meaningful results.

GUIDs are a bit of a mess. There are various ways of generating them that have vastly different amounts of entropy. The guy who had weekly collisions was probably using an algorithm that included the machine's MAC address and a timestamp. Under those circumstances, it is quite likely that collisions will take place.

Recognizing spurious PK matches is the cost of using surrogate keys. This cost must be paid, no matter what form the surrogate key takes. Clever usage of a good GUID algorithm can then be used to amortize this cost to nearly zero; this allows easy-to-understand and easy-to-test methods to be used to resolve conflicts, even if the runtime costs are relatively high.

GUIDs aren't a loss, quite the opposite; but pretending that they make the underlying problems go away is naive.

Anonymous Cowherd on March 23, 2007 3:56 AM

no one has mentioned threading/locking yet, and this stuff has big consequences on scalability.

you can generate any number of guids at once, in parallel, but you can only generate one 'autonumber' at once (they have to be queued up). Hence, in a highly concurrent application, guids avoid one major bottleneck.

i think the article about MySQL 5 goes into this in detail.

That's a pretty cool concept in my dumb old opinion. And not the sort of thing we usually think of since we grew up on boring old single-core computers.

So use guids, cause your gandkids kids are going to thank you.

lb

lb on March 23, 2007 6:35 AM

I think all this discussion shows is that there is no "right" way of doing something. People have their own experiences, their own needs, their own scenarios, that change the impact of using one key type versus another.

Although this discussion was quite good!

Brian Meents on March 23, 2007 7:13 AM

I am addicted to GUIDs. I love them. I work with data on many different databases and have a few apps at work that move data from oracle to sql and back and its no problem with GUIDs.

ryan on March 24, 2007 3:55 AM

The cool thing about using Guids, in this fashion is that they reduce the amount of collision detection required when importing foreign data.

Our product has a plug-in architecture that relies on this detail to allow people to drop Dlls into a folder on our CMS and have new functionality without jumping through hoops to ensure that the references the plugin has to itself don't point to something else in the database...

Shame you can't use a Guid as an Attribute though (well you can use a string and a property that returns a guid but yeah ...)

Russ C. on March 28, 2007 3:28 AM

Remember that there are two keys on a table - the natural key and the surrogate jey, celko proposes only the natural key and he gets really confused over the use of IDENTITY and GUID for the surrogate - read Codd and Date on this rather than celko.

So long as you use a sequential guid rather than NEWID() for instance then you are ok from the locking and fragmentation stand point - consider the randomness of the NEWID(), say you are inserting 100 rows into your table, the liklihood on a table with even just 20 - 30K rows is taht each one of the 100 rows will be inserted on a seperate page, require index nodes to be locked and possibly page splits - the locking story here is horrendous; whereas IDENTITY or NEWSEQUENTIALID will at least cause significantly less of a locking burden.

When the GUID is being created outside the database, perhaps the middle tier or the client then you are going to have the locking horror, but a way round this is to prefix your guid with some sort of node number so that at least the rows being inserted by that node are around the same pages in the tables etc... There are a number of ways of doing that, and when it boils down to it you probably wouldn't use GUID's in that situation anyway.

Using your sample, the natural key would be 'value', but thats a bad name - it should be more representative, say 'Colour', then the 'ID' column is what's termed the surrogate key which is then used in any foriegn key references in other tables (this is the mistake celko makes copying the natural key everywhere causing a locking disaster if the natural key needs to change - inconsistency between the database and application).

Some good links there !

Tony.

Tony Rogerson on March 29, 2007 1:27 PM

Thats a fantastic post, that helped me making a decision that I can have workflow Instance ID (Windows workflow foundation) as Primary Key http://aliwriteshere.wordpress.com/2007/04/02/worflow-instance-id-as-primary-key/.

Ali on April 2, 2007 5:06 AM

Hey {CF76B38A-888E-4BEA-B08D-13AFBAAD8745}, That's my name!

{CF76B38A-888E-4BEA-B08D-13AFBAAD8745} on April 3, 2007 11:07 AM

I know this blog entry is getting stale, but I've been trying to apply the concept of using GUID's in a distributed environment, but the problem I'm running into is that the environment (MS FoxPro) I'm using does not support a native 16 byte implementation of the GUID (a la SQL's Uniqueidentifier), I have no trouble generating a string representation of the GUID (by accessing the Windows API), and I can even come up with the binary 16 byte representation of the GUID programmatically.

So my question is this: What is the best way to represent a GUID in a non-MSSQL table? A 38 char field, or a 16-byte binary field? I know that the 38 char record is much larger than the 16 byte field, but the 16 byte hex field is extremely human-unfriendly... Which is the best way to go?

Kit Roed on April 6, 2007 12:38 PM

I supposed this article was targeted to outline the physical cons and pros of both.
I'm not going to repeat what one or two already said, but I'd like to underline: always look for a real primary key for your table, then and only then, if you didn't found one, evaluate the best use of either integer identity or Guid columns.

Cesar on April 11, 2007 4:26 AM

Can anyone tell me if it's necessary to use a GUID for your primary key in a database that is to be hosted in a SQL Server 2005 clustered environment?

I've been told if you were to use an Identity Column that there is a chance that concurrent requests may result in one or more servers in a cluster attempting to insert a row with the same primary key, resulting in a failure.

Is this really true?

Do developers need to concern themselves with the hosting environment when doing database design?

Do you have to use a GUID or some sort of composite primary key combined of a column with a serverid and another with an identity column?

Michael on April 23, 2007 1:16 PM

Keep in mind that index size can adversly affect perf - smaller index size is always better - what happens when the index goes from fitting into cache to not fitting into cache - I guess I am just lucky in that I get to work with vldb data. in a vldb you should never use a guid - but if you need to send something to a cookie such as keeping track of a user then you should send a guid - and then change it on each visit - this makes it impossible to steal a users session.

Two seperate ideas here - guids are bigger - the web contains hazards that should not affect your db design performance wise.

Mikey on May 8, 2007 12:48 PM

More comments»

The comments to this entry are closed.