Maybe Normalizing Isn't Normal

July 14, 2008

One of the items we're struggling with now on Stack Overflow is how to maintain near-instantaneous performance levels in a relational database as the amount of data increases. More specifically, how to scale our tagging system. Traditional database design principles tell you that well-designed databases are always normalized, but I'm not so sure.

Dare Obasanjo had an excellent post When Not to Normalize your SQL Database wherein he helpfully provides a sample database schema for a generic social networking site. Here's what it would look like if we designed it in the accepted normalized fashion:

social network database example, normalized

Normalization certainly delivers in terms of limiting duplication. Every entity is represented once, and only once -- so there's almost no risk of inconsistencies in the data. But this design also requires a whopping six joins to retrieve a single user's information.

select * from Users u
inner join UserPhoneNumbers upn
on u.user_id = upn.user_id
inner join UserScreenNames usn
on u.user_id = usn.user_id
inner join UserAffiliations ua
on u.user_id = ua.user_id
inner join Affiliations a
on a.affiliation_id = ua.affiliation_id
inner join UserWorkHistory uwh
on u.user_id = uwh.user_id
inner join Affiliations wa
on uwh.affiliation_id = wa.affiliation_id

(Update: this isn't intended as a real query; it's only here to visually illustrate the fact that you need six joins -- or six individual queries, if that's your cup of tea -- to get all the information back about the user.)

Those six joins aren't doing anything to help your system's performance, either. Full-blown normalization isn't merely difficult to understand and hard to work with -- it can also be quite slow.

As Dare points out, the obvious solution is to denormalize -- to collapse a lot of the data into a single Users table.

Social database example, denormalized

This works -- queries are now blindingly simple (select * from users), and probably blindingly fast, as well. But you'll have a bunch of gaping blank holes in your data, along with a slew of awkwardly named field arrays. And all those pesky data integrity problems the database used to enforce for you? Those are all your job now. Congratulations on your demotion!

Both solutions have their pros and cons. So let me put the question to you: which is better -- a normalized database, or a denormalized database?

Trick question! The answer is that it doesn't matter! Until you have millions and millions of rows of data, that is. Everything is fast for small n. Even a modest PC by today's standards -- let's say a dual-core box with 4 gigabytes of memory -- will give you near-identical performance in either case for anything but the very largest of databases. Assuming your team can write reasonably well-tuned queries, of course.

There's no shortage of fascinating database war stories from companies that made it big. I do worry that these war stories carry an implied tone of "I lost 200 pounds and so could you!"; please assume the tiny-asterisk disclaimer results may not be typical is in full effect while reading them. Here's a series that Tim O'Reilly compiled:

There's also the High Scalability blog, which has its own set of database war stories:

First, a reality check. It's partially an act of hubris to imagine your app as the next Flickr, YouTube, or Twitter. As Ted Dziuba so aptly said, scalability is not your problem, getting people to give a shit is. So when it comes to database design, do measure performance, but try to err heavily on the side of sane, simple design. Pick whatever database schema you feel is easiest to understand and work with on a daily basis. It doesn't have to be all or nothing as I've pictured above; you can partially denormalize where it makes sense to do so, and stay fully normalized in other areas where it doesn't.

Despite copious evidence that normalization rarely scales, I find that many software engineers will zealously hold on to total database normalization on principle alone, long after it has ceased to make sense.

When growing Cofax at Knight Ridder, we hit a nasty bump in the road after adding our 17th newspaper to the system. Performance wasn't what it used to be and there were times when services were unresponsive.

A project was started to resolve the issue, to look for 'the smoking gun'. The thought being that the database, being as well designed as it was, could not be of issue, even with our classic symptom being rapidly growing numbers of db connections right before a crash. So we concentrated on optimizing the application stack.

I disagreed and waged a number of arguments that it was our database that needed attention. We first needed to tune queries and indexes, and be willing to, if required, pre-calculate data upon writes and avoid joins by developing a set of denormalized tables. It was a hard pill for me to swallow since I was the original database designer. Turned out it was harder for everyone else! Consultants were called in. They declared the db design to be just right - that the problem must have been the application.

After two months of the team pushing numerous releases thought to resolve the issue, to no avail, we came back to my original arguments.

Pat Helland notes that people normalize because their professors told them to. I'm a bit more pragmatic; I think you should normalize when the data tells you to:

  1. Normalization makes sense to your team.
  2. Normalization provides better performance. (You're automatically measuring all the queries that flow through your software, right?)
  3. Normalization prevents an onerous amount of duplication or avoids risk of synchronization problems that your problem domain or users are particularly sensitive to.
  4. Normalization allows you to write simpler queries and code.

Never, never should you normalize a database out of some vague sense of duty to the ghosts of Boyce-Codd. Normalization is not magical fairy dust you sprinkle over your database to cure all ills; it often creates as many problems as it solves. Fear not the specter of denormalization. Duplicated data and synchronization problems are often overstated and relatively easy to work around with cron jobs. Disks and memory are cheap and getting cheaper every nanosecond. Measure performance on your system and decide for yourself what works, free of predispositions and bias.

As the old adage goes, normalize until it hurts, denormalize until it works.

Posted by Jeff Atwood
301 Comments

Just to add that materialized view will involves caching of the data.

Odd on July 15, 2008 2:26 AM

View data isn't cacbed, however the execution plans are.

Personally I thnk anyone who encourages denormalization to the extent that this article does hasn't worked with databases enough. Denormanlzation should always be an optimization, and not done by design, and a denormanlized database is *not* the same as database that was never normalized to begin with.

In many case I think materialized / indexed views will suffice in cases when you think denormalization is necessary.

Rick on July 15, 2008 2:33 AM

You changed the rules.

The normalized model allows for n number of user phone numbers, screen names, affiliation and work history. It also does not care about the order of those items.

The de-normalized model limits the number of the items listed above. It also force an order.

Only one represent the business. Which is it?

rogueDBA on July 15, 2008 2:37 AM

I didn't realise that denormalisation was controversial. I was taught to do it in university.

Phil on July 15, 2008 2:42 AM

If you are just mostly doing dirty reads and do not particularly care if the data is 100% consistent all the time then you gain little by doing a lot of normalization. In fact, you should denormalize as much as it makes sense to do so for read-heavy loads.

Most web applications are of this type and do not even require a regular database necessarily. We are just conditioned to throw that database there every time without even questioning it. That is pretty high in the brainless zombie score rating.

Tero on July 15, 2008 2:44 AM

great post. simple is beautiful....

Steve on July 15, 2008 2:46 AM

Denormalized data makes a mess of the application code: you end up having to do all kinds of funny work to make sure that their affiliation is added to the right spot (affiliation_1? affiliation_2?), and if you remove it, then you end up backtracking and rectifying how your assignments work. And that's assuming that you don't accidentally run into a case where someone has 4 affiliations.

This whole concept is physically painful for me to even consider.

Before you start denormalizing your database, you should talk to a DBA and see what solutions they might have. And step #2 would be to be smarter about pulling back data than slurping back the entire database via your ORM.

Between those two options, you should be able to get the DB performance you're looking for without hemorrhaging functionality and extensibility.

Robert Fischer on July 15, 2008 2:55 AM

I just spent a year having normalization hammered into my meager little mind as part of my database module and now you're telling me that it mightn't be necessary...........

love your blog Jeff but i hate it when you do this to me. :(

Reader 101 on July 15, 2008 3:18 AM

How about using materialized query tables or materialized views?

Andrew Stuart on July 15, 2008 3:21 AM

@Raymond Berg: I agree but I don't think Google can resolve a link in the form of http://www.codinghorror.com/mtype/mt-comments-renamed.cgi?__mode=redid=xxxx to an actual site.

Juan Zamudio on July 15, 2008 3:23 AM

My approach is to design it all normalized at first, as it makes the design cleaner and usually easier to grok, and then denormalize where absolutely needed. As long as the system is performing well, I would prefer to keep my data normalized and then create denormalized views so I can get those huge joins out of my way.

When it comes to teaching, normalized tables is the way to go. People must master the rules before they break them (and at that point nobody will have to teach them that denormalization is OK, they will have figured that out already)

SomeGuy on July 15, 2008 3:25 AM

Quite a horrible job naming the tables. Why the underscores in field names and not in table names. Why user_id and not user_sex? Why table nimes in singular? Of course this is beside the point, but lol.. :D

nub on July 15, 2008 3:33 AM

Jeff, IMO, more important than denormalization is how you design your indexes, and what type of hardware you're running on. IO is the big killer, not so much normalization. Denormalizing a table would be about the last thing I would try to boost performance. SQl server should be able to shred those queries, even with 20 million plus records in each table.

Rob on July 15, 2008 3:33 AM

Normalization is about design, denormalization is about optimization. They should be applied as such, i.e. anything should be normalized until denormalization is needed. As Knuth (or was it Hoare?) said: Premature optimization is the root of all evil.

And from the short glimpse I did at the article about performance of tag systems it seems they tested only on MySQL. Not exactly a database one connects with good performance on complex queries -- it's extremely fast on flat data, but to my (not utterly up-to-date) knowledge they never managed to catch up with Postgres or the commercial RDBMS' on anything reasonably complex. Not sure if that applies to single joins already, but I wouldn't be too surprised.

Peter Becker on July 15, 2008 3:36 AM

Normalization might indeed not be the fastest design, but can usually made fast enough without denormalization, as Dare Obasanjo states in his original article, too:

Database denormalization is the kind of performance optimization that should be carried out as a last resort after trying things like creating database indexes, using SQL views and implementing application specific in-memory caching.

Jens Gring on July 15, 2008 3:43 AM

In my __LONG__ experience of both DB and application design, it is a bad (or rushed) application design that usually forces denormalization on a well designed database.

Several small queries over normalized data also usually outstrip one big one over denormalized data.

But, denormalization still has a home in large data warehouses.

WeAreJimbo on July 15, 2008 3:45 AM

Hope smart tools from DatabaseGear can help.

Lex Y. Li on July 15, 2008 3:48 AM

Lets unroll our loops next.

Christof on July 15, 2008 3:51 AM

Just a nitpick, I'd recommend using the term Gender over Sex. Someone might ally themselves with a different gender to their sex and that's the one they're going to want to tell you.
(Obviously it's sample data, but just for the future :)

Id did annoy me slightly always having normalization drummed into me without mention that it's not always best, but I guess SomeGuy is right.

[ICR] on July 15, 2008 3:54 AM

Another thing you could try in the case of StackOverFlow is to reduce the data required from the user.
Some questions you can ask:
1) Do our web application really need those phone numbers? Are we going to call our users?
2) Is address needed?
3) Does gender matter?
4) If an email id and password (and a display name if necessary)is what really matters, bother with all those unnecessary data?

You can add the bells and whistles later. First, Ship.

Niyaz PK on July 15, 2008 3:56 AM

Jeff,

if you want to avoid the usage of big and complicated SQL-Statements, why not implementing Views instead of denormalization. A good RDBMS should offer this mechanism, especially for performance and usage optimizations.

You could have a View called users, could still say SELECT * FROM users, but what is exactly within users can be changed at any time, without changing the interface to the application; The View users in this example.

Frank on July 15, 2008 3:59 AM

One thing you're not taking into account is a system like memcached that avoids hitting the database at all. When using such a system, it might be better to actually be storing normalized data so that your hits on the database for cache misses are small and involve as little table locking as possible.

-Max

Max Kanat-Alexander on July 15, 2008 4:00 AM

Before normalizing denormalizing think about what you need to to with the database and where you are going to do it. One example: We have a quite big database with test results. Most of the work is gathering data for reports. Normalizing Denormalizing is an issue and you only know what is faster if you know how you access it.

offler on July 15, 2008 4:01 AM

It's always amazed me how we spend hours normalizing databases just to spend more hours creating non-normalized views to run reports on.

I've always tried to normalize as much as was sensible rather than slavishly following the supposed good practice of BC Normal Form.

Excellent thought-provoking article.

Ali M on July 15, 2008 4:01 AM

by the way: why do you want to create a list of users and adresses? Wanna sell?

offler on July 15, 2008 4:02 AM

I'm not sure I agree with the total denormalisation in the example, Jeff. I'm (relatively) recently out of University, so I've still got the normalisation-is-everything rules in my mind :-) Being pragmatic, though, and as a couple of other people have noted, denormalisation is an optimisation to a fully-normalised design. I'd argue that denormalisation should be done in stages, in just the same way that normalisation is.

For example, I don't think I've ever taken a production database past 3NF. Certainly the only time I've take a design to 5NF is for my University database assignment. And aren't stored procedures supposed to improve performance for this sort of queries. Yes, you're always going to take a performance hit for a query utilising six joins (there are some crazily-sized cartesian products going on in there), but wouldn't a multi-layered approach aid this? I.e., for common or computation-intensive operations (such as viewing a user's profile in the example above), using multiple stored procedures to extract sections of data and then another stored procedure to put them together. It could also be accomplished using the user_id primary key to extract the relevant record from each table and then combining the records at the application level.

Having said all that, I haven't done any database programming in a good couple of years now, so I could be blowing a lot of hot air :-)

Alastair Smith on July 15, 2008 4:02 AM

Databases are one of those things that you just can't take a cavalier approach to. If you screw that design up, and worse still, launch it, you're in for a world of hurt when the problems with the data come.

Normalisation means the only problem you're going to get is scalability: your data is going to be good (presuming you were also clever/paranoid enough to get all the integrity checks in there too).

As the quote says, you can always denormalise later for speed.

Lewisham on July 15, 2008 4:02 AM

If you create indexed views over the joins, then the optimizer should use those anyway and there won't be a perfomance hit from being denormalized.

Harry M on July 15, 2008 4:04 AM

Addendum: And as others have said, there are much better solutions than denormalising your database. Having Materialized Views for reads, or memcached.

Lewisham on July 15, 2008 4:07 AM

Quite a horrible job naming the tables. Why the underscores in field names and not in table names. Why user_id and not user_sex? Why table nimes in singular? Of course this is beside the point, but lol.. :D

nub on July 15, 2008 02:33 AM

because when you're selecting multiple id fields you want to be able to distinguish between them, example: user_id and affiliation_id

instead of
select user.id user_id, affiliation.id affiliation_id from ...
you can just have
select user_id, affiliation_id from ...
taht way you don't always have to prefix the fields with the table name and you also you don't have to alias the fields everytime;

and table names are singular because each row in the table designates one specific person; one row = one user

james on July 15, 2008 4:10 AM

That 6 join query is wrong, and the rest of your article follows from that incorrect join.

Why would you ever want a query that returned the user data multiple times, once for every combination of screen_name and phone_number?!? No real site would ever want to do that.

If you need phone number, you get phone numbers, if you need screen names, you get those - but why would you need every possible combination of phone number and screen_name?

You're simply wrong. Speeding up joins is never a reason to denormalize for the simple reason that it doesn't! Test it - you'll see I'm right. But, do me a favor and write real joins - not ones that give you tremendous numbers of duplicated rows.

And if you don't even understand why your query is wrong, you have no business designing databases.

A valid reason to denormalize is to precalculate data, which you touch on very briefly. But you always write that twice: once normalized, and that's the primary data, and then again, the cached/precalculated version. You should always be able to regenerate that from the normalized data.

OK, one final reason to denormalize which you didn't even write: if you need to do a where clause from one table, but the order by from a different table, you need to denormalize because you can not create a combined index from both tables. (Databases with function indexes might be able to but that's pretty complicated.)

I'm sorry to bash you so much, but you shouldn't write about what you have no experience with.

Ariel on July 15, 2008 4:15 AM

The approach we've taken at work is to have data normalised (well all the new tables, we have some truly hideous legacy tables that we haven't tidied up yet!) and then either use memory cache or search tables to speed up access.

In fact our search table is one honking great big denormalised table. There can't be any joins at all. This is generated by an off line task constantly.

I'm always wary about denormalising the 'cannonical' tables as this leads you into a maintenance nightmare, especially when it comes to adding or removing columns.

Glenn on July 15, 2008 4:16 AM

(You're automatically measuring all the queries that flow through your software, right?)

Got an article about this? Would be a good topic

Colm Larkin on July 15, 2008 4:22 AM

Two keywords (already mentioned) that I want to underline:

- views
- views aka materialized views
- star scheme for your billions of tuples (that your memory won't hold)
(a href=http://en.wikipedia.org/wiki/Star_schemahttp://en.wikipedia.org/wiki/Star_schema/a)">http://en.wikipedia.org/wiki/Star_schema/a)">http://en.wikipedia.org/wiki/Star_schemahttp://en.wikipedia.org/wiki/Star_schema/a)

Indexing and caching aside.

Krischan on July 15, 2008 4:23 AM

Can somebody tell me how 'views' can increase the performance of a very large system?

Niyaz PK on July 15, 2008 4:25 AM

and table names are singular because each row in the table designates one specific person; one row = one user

james: are you serious? Tables don't have 1 row in them they have many. So you make them plural (1 row singular, multiple rows plural). Rule: table names are plural, column names are singular.

Also, you do not want to prefix every single column with the name of the table, what does that save you? Just type tablename.columnname instead of tablename_columnname - no need for aliasing. Sometimes you violate that rule where it might be confusing, but for that most part don't double prefix stuff.

Ariel on July 15, 2008 4:28 AM

Your technique of describing a denormalized database model is one of the basic principals of dimensional modelling primarily used for data warehousing.

It's pretty much the opposite of normalizing your model (it contains redundant data, utilizes lots of space etc). Especially if end users need to use this data to query for e.g. reporting purposes then dimensional modelling is absolutely the way to go.

However, if you’re designing a normalized database, I would suggest to keep it normalized and don’t denormalize at all unless you’re designing a dimensional database structure from the beginning. Keep it separated and don’t mix the two modelling techniques even though it might be more efficient in some cases. Making things unnecessary complex is not something you want to do very often.

Sander on July 15, 2008 4:28 AM

DB nuts come out in forces. Views as a performance improvement LOL.

Even on a very small DB those 6 joins will cost you. Oracle uses huge amounts of CPU for joins, whether or not you have lots of data. With any kind of load at all it will cost you.

Just forget about using a relational DB and use Berkeley.

Asd on July 15, 2008 4:35 AM

I'm starting to study object-relational database, and I wonder if it wouldn't be a nicer solution for you, since is wipes away the necessity to write long complicated joins.

My studies haven't yet gone too far on that topic, but it seems ORDBs are easier to deal with than relational ones.

***

Which software did you use do create those diagrams?

Mrio Marinato, from Brazil on July 15, 2008 4:41 AM

@Niyaz: Materialized views (or indexed views, depending on your rdbms) are in fact tables, but created and maintained by your rdbms. So a materialized view will act exactly like a de-normalized table would - except you keep you original normalized structure and any change to original data will propagate to the view automatically.

Goran on July 15, 2008 4:44 AM

sorta on topic: What'd you use to make the diagrams? I dig

X on July 15, 2008 4:47 AM

table names are plural, column names are singular.

According to Codd and Date table names should be singular, but what did they know.

Pablo on July 15, 2008 4:48 AM

I'm a little lost here.

I'd say the example is not even normalized. I mean User Phone Number as well as User Screen Name should me merged into the User table and they will still be in BCNF. Then you keep the rest of tables as they are...

Why do you take Phone Number and Screen Name out of the picture? Is not as if you may have several Phone Numbers for one user with that design... and if the user does not have any phone number then, you know, NULL values may help.

Jorge on July 15, 2008 4:51 AM

and at that point nobody will have to teach them that denormalization is OK, they will have figured that out already
Erm... Isn't the entire point of this post that some people haven't?

Tom on July 15, 2008 4:57 AM

Unless you are doing *a lot* more reading then writing

Aren't most -- perhaps in fact *all* -- websites in this class of application? Millions of reads, handful of writes.

you should denormalize as much as it makes sense to do so for read-heavy loads. Most web applications are of this type and do not even require a regular database necessarily

My point exactly!

View data isn't cacbed, however the execution plans are.

That buys you virtually nothing. Caching of execution plans is pretty much automatic for all modern SQL databases these days -- even for raw SQL. No stored procs or even parameterization of the query is required!

http://msdn.microsoft.com/en-us/library/ms181055.aspx

When any SQL statement is executed in SQL Server 2005, the relational engine first looks through the procedure cache to verify that an existing execution plan for the same SQL statement exists. SQL Server 2005 reuses any existing plan it finds, saving the overhead of recompiling the SQL statement. If no existing execution plan exists, SQL Server 2005 generates a new execution plan for the query.

Jeff Atwood on July 15, 2008 4:57 AM

I always learnt the last rule of normalization is to de-normalize.

Also I believe large organizations use a de-normalized for for selects and a normalized view for updates and inserts using replicated servers.

Have a look at the Scaling Databases section of http://msdn.microsoft.com/en-us/magazine/cc500561.aspx

John on July 15, 2008 5:08 AM

@Ariel
Why would you ever want a query that returned the user data multiple times, once for every combination of screen_name and phone_number?!? No real site would ever want to do that.
The syntax may be incorrect (I'm not sure you're right, though), but I assume the idea is that you might want a complete list of all users and their related data. Can you give the correct answer rather than just picking holes?
At the same time, the fact that the syntax for joins is so tricky to get right and interpret on reading is another major problem with Normalised Tables.

If you need phone number, you get phone numbers, if you need screen names, you get those - but why would you need every possible combination of phone number and screen_name?
What if you needed a list of everybody's screen name and their related phone numbers?

You're simply wrong. Speeding up joins is never a reason to denormalize for the simple reason that it doesn't! Test it - you'll see I'm right.
Well, as the articlesays, there'll be no measurable difference until you reach billions of data rows. Do you even read the article?

But, do me a favor and write real joins - not ones that give you tremendous numbers of duplicated rows.
Which you have singularly failed to provide an example of.

Tom on July 15, 2008 5:10 AM

very easy answer:
normalize till its good, denormalize till it works!
That's the rule I learn in college.

nleite on July 15, 2008 5:12 AM

Hi Jeff,

All I can say is: Thanks!

This is the most enjoyable and valuable post in the blogoshphere for monts!

Daniel Fisher on July 15, 2008 5:12 AM

Why do you need to use a relational db for something like querying tags in the first place? Another approach would be simply to treat the db as a persistence layer and maintain in memory indexes for your tags.

Feeding a term vector of tags into something like Lucene is going to be faster than any DB query and a hell of a lot easier to scale.

Ryan Roberts on July 15, 2008 5:13 AM

An observation : if you use LINQ (Linq2SQL,Entity Framework,...) you don't have to write all those inner joins. You can use properties to access related entities. I love SQL but LINQ is allows to work with your data on a higher abstraction. If needed we, the SQL oldies, can still impress the young with blazing fast queries which will become a lost art.

Tom Pester on July 15, 2008 5:16 AM

'Golden Copy' of data should also be taken into account when designing databases, to often I have seen database that don't apply any normalisation have multiple copies of critical fields spread across multiple tables...

Ollie Riches on July 15, 2008 5:19 AM

select * whatever can also be a performance drain worse than the 16 inner joins you described. select what you need and nothing more.

Brian on July 15, 2008 5:23 AM

@Ariel
Looks like you'd get some duplication where you had multiple screen names and multiple phone numbers, but that's a very fine edge condition. It could also be helpful because you'd be able to detect users with multiple profiles or who have provided multiple contact details.

The fact that you didn't provide suggests that the normalised table is quite tricky to extract the complete data from cleanly.

Tom on July 15, 2008 5:24 AM

In your example you've turned several one to many or many to many relationships into 1-2 and 1-3 relationships. These are obviously not the same thing and there are many real life requirements for these standard relationships (1-N, and N-N).

The problems really start happening when you aren't normalizing your tables because your professor told you to, but when you need to model these one-to-many or a many-to-many relationships between data.

And I disagree that it takes millions of rows to make this stuff relevant, I worked on a project which had a search query with 4 or 5 joins and even specialised indexing for the major queries that were being run still caused any FULLTEXT searching on the data (3000 rows) to take 9 seconds. One summary table later and it was down to 0.09. Just by removing the joins!

So yeah, I guess my point is that the important thing here is the cost of a JOIN, and you need to know how to deal with these (summary tables, proper indexing, query caching, etc.) in situations where they ARE necessary.

David McLaughlin on July 15, 2008 5:25 AM

Even a modest PC by today's standards -- let's say a dual-core box with 4 gigabytes of memory...

Holy carp, is that considered modest? Somebody please tell my company so they'll upgrade our Pentium 4 512mb machines. That billion dollars we make each year has to be going somewhere.

Chris on July 15, 2008 5:28 AM

As I tried to clarify in my blog post, denormalization is something that should only be attempted as an optimization when EVERYTHING else has failed.

Denormalization brings with it it's own set of problems. You have to deal with the increased set of writes to the system (which increases your I/O costs), you have to make changes in multiple places when data changes (which means either taking giant locks - ugh or accepting that there might be temporary or permanent data integrity issues) and so on.

Reading various database stories may make it sound like all the cool kids are doing it but all the cool kids are also rolling their own file systems and implementing their own alternatives to relational databases as well. They did that because they hit scaling problems and felt they had no alternatives but they gave up something along the way.

More on this at http://www.25hoursaday.com/weblog/2007/10/10/WhenDatabasesLieConsistencyVsAvailabilityInDistributedSystems.aspx

Dare Obasanjo on July 15, 2008 5:28 AM

Goran,
So my question is: where is the performance enhancement in using views?
When something is updated in the tables, all these join statements will be needed to make the view. Right?
So for a very large application, views does not give performance improvements. Tell me if I am wrong.

Niyaz PK on July 15, 2008 5:30 AM

There's a whole lot of denormalisation going on with even the most normalised databases. Indexes are denormalisations so is anything you cache.

I think the approach to take here is to denormalise for speed but to ensure that denormalised data sits on top of normalised data can be rebuilt automatically. This is what happens with indexes, and it's a tried tested methodology.

As with indexes, anything denormalised should be hidden from the main programming effort. In fact, I'd go as far as to say that it should be a function of the database. How this can be done without an over-reliance on views is difficult to imagine though.

Sean Bamforth on July 15, 2008 5:30 AM

Another case where you would almost never want to denormalize would be a situation where you'd turn fixed-length rows into variable-length rows. If your table is all ints, you wouldn't want to replace them with varchars.

However, denormalizing to create fixed-length rows instead of variable-length rows (or overnormalizing to do it) could sometimes be a performance benefit, if your application is able to reference things by their integer ids instead of their string values.

-Max

Max Kanat-Alexander on July 15, 2008 5:31 AM

Okay, from the very beginning it has been known that normalisation can cause problems with performance. I think every trained DB designer know this.

But what happens, is that people see Normalisation = Slow, that makes them assume that normalisation isn't needed. My data retrival needs to be fast, therefore I am not going to normalise!

So they end up with a database they call denormalised for performance, but really it isn't, as the database was never normalised in the first place.

Tubs on July 15, 2008 5:32 AM

The only time I throw normalisation out the window is when I have a list of infrequently changing fixed-sized items which I can separate by commas and store in a single field instead of another table.

However, whilst it saves time on the querying side, the issue I find with doing this approach is it completely slows down the processing side, as my application must separate the list into an array or another structure before I can work with the data.

Personally, if the trade-off of each approach is near the same, I'd prefer a more logically structured database without null fields than a denormalized database which is hard to maintain.

Lachlan on July 15, 2008 5:35 AM

Ariel and George,
Phone numbers don't go into the user table because users can have multiple phone numbers (work, phone, cell, fax, etc). A properly normalized database extracts them out to their own table especially since the majority of users will have lots of nulls in these fields.This is assuming that you are normalizing your DB to 3NF.

In addition, who says you wouldn't want all the user data including phone numbers at once? Have you ever seen a profile page on FB? That's exactly what it does.

Dare Obasanjo on July 15, 2008 5:37 AM

always wondered what database normalisation meant...

can't imagine that joins are exactly a heavy performance drain though... the data is indexed after all. I don't actually have the experience of a database that big (biggest db i worked on only held info on about 1.5 million people - linked to various details in much smaller tables), but I would imagine you could have trillions of records with no problem given some good indexing implementation...

in my experience slowness comes from things like calculating a sum, where the database doesn't update the sum on each record change, but recalculates the whole damned thing every time its needed. or poor application design, making 100s of server requests, when actually one would do just fine.

just taking a naive implementation, given numeric primary keys you can reduce the amount of records you actually have to search through by a factor of 10 for each digit you decide to take advantage of...

I wonder what modern database implementations actually do? generic keys aren't so useful but similar logic can still be applied... at worst there is always the choice of a binary tree...

Jheriko on July 15, 2008 5:38 AM

I am pretty sure your original normalized database in the diagram is poorly designed. After all you have two tables that appear to hold a user_id + affiliation_id relationship (UserAffiliation and UserWorkHistory). That, my friend, is *wrong*. And no wonder you have six joins, you have a database design that does not make sense.

You don't want to write a query like that-- ever-- in real life. It doesn't make sense. If you have three users who each have two phone numbers and two screen names, your join gets you a result set with twelve rows in whatever order the database deems appropriate.

In my experience, the odds of a developer properly designing a database are extremely small. You have just proven that to me yet again. And I can guarantee you that bad designs are way more likely to cause performance problems than properly normalizing data. Not only that, bad database design is likely to cause other problems like inconsistent data, incomprehensible code, and inscrutable bugs.

Michael on July 15, 2008 5:44 AM

@Niyaz: When something is updated in the tables, all these join statements will be needed to make the view. Right?

Yes - whenever you update / insert data it will in fact update / insert twice. But that's likes akin to saying indexes reduce performance as you need to store the same data multiple times :). My favorite saying on this is You can read fast and store slow or you can store fast and read slow.

@Jheriko: The biggest performance killer is so called physical read. Finding and accessing data on disk is the slowest operation. Unless child table is clustered indexed and you're using the cluster index in the join you will be making lots of small random access reads on the disk to find and access the child table data. This will be slow.

Goran on July 15, 2008 5:48 AM

In some cases you may be right, but what will you do if you need more entries in affiliation, work_history and screen_name? Do you really want to have so many empty fields in each entry?

Jonathan Rass on July 15, 2008 5:52 AM

@Jeff,

Your example of is wrong. This type of normalization will make designing the user interface, source code and the rest of the sql more complicated. For example to get all users connected to an affiliation will need 3 times the queries now then before. And what if you want to extend WorkHistory to 4 (or 10) in a later release?

I also cannot imagine you need all this information in 1 call over a large number of users. To get this data for a single user it doesn't matter much if you do a few more calls to the database. I assume that large queries probably go over user only, or user and affiliation.

Also the denormalized table isn't complete, it's missing 3 work_history_affiliations.

But some denormalization can be done on your sample without becoming denormalized:

Removing the UserPhoneNumber table is good, that table is kind of useless, users don't want to submit hundreds of phone numbers and it isn't likely there are is an index on phone_number.

UserAffiliation and UserWorkHistory can be merged. Maybe add a type field if needed, otherwise an empty company_name should tell enough..


So you woud end up with:

User
------------ UserScreenName
------------ UserWorkHistoryAndAffiliation -------- Affiliation

BillGoates on July 15, 2008 5:57 AM

Why not cache denormalized sets of user data in a second database? That allows us the benefits of both.

Jonathan Drain's Dungeons Dragons Blog on July 15, 2008 6:02 AM

The only companies running on denormalized databases are companies providing websites that have loosened constraints when it comes to data loss. Your credit card company is not going to denormalize their data in their transactional system anytime soon.

Atwood is wrong. There is a better way to do it, and that's to normalize your data. This keeps your data integrity and DRY, and helps protect you from bugs and errors in your database. Correctness should always come first. Denormalization is up there with hack and kludge and should be treated as such, not as a reasonable alternative to normalization. (Albeit a sometimes necessary one, though probably employed more than it needs to be due to posts like this one, and also employed due to limitations of the shitty database these sites seem to run on top of, MySQL.)

Greg on July 15, 2008 6:03 AM

The biggest scalability problems I face are with human processes, not computer processes. Aside from number crunching applications, I've seldom run into the need to restructure code for performance. But I'm always running into mental scalability problems, ways of doing work that are going to drive people crazy in the long run.

John on July 15, 2008 6:04 AM

Another piece of savvy writing. I thank so much these writings from the experience and humble truth. Whatever it means.

xmariachi on July 15, 2008 6:08 AM

Premature optimisation! Design the database fully normalised, measure performance, optimise, rinse and repeat until fast *enough*.

Don't forget that the fastest database query is the one that doesn't happen, i.e. caching is your friend.

Chris on July 15, 2008 6:09 AM

If you want to have a look at the spec of fully engineered and normalised database, check out the NHS Data Dictionary

This diagram shows the person property :

http://www.datadictionary.nhs.uk/data_dictionary/diagrams/person__amp__person_property_imsp.asp?shownav=1

Tubs on July 15, 2008 6:12 AM

Wanted to repeat what Brian said, because it's one of the more valuable things in the comments: don't use SELECT * and you'll have cut your query count almost in half.

The database diagrams appear to have been made in a recent copy of Visual Studio.

While I agree with the general point Jeff, I do think there are other ways of achieving the effects of denormalization without squishing tons of fields into a table. Views, reporting databases, etc. If The DailyWTF is to be believed, just adding some (well-thought) indicies should be everyone's first start. Then learning how to use tuners optimizers. Jumping straight to denormalizing a database is the same class of mistake as thinking a properly normalized database can't be a problem: you're assuming facts not in evidence based on what someone told you about their experience.

Tom Clancy on July 15, 2008 6:13 AM

This post (and the comments) raise some very good points (kudos for that), but manages to fall short a little in leading people down the right path in some respects.

There are a number of other ways that the example schema could be partially denormalized (something I don't think was made clear). While I don't think you could have done much more in a single post, perhaps this is something that could be covered in a future one. If you're up to it, that is, it sounds like the expertise revealed in some comments may reveal just how big a question you are trying to tackle. :)

Ben Garreros on July 15, 2008 6:19 AM

I've read the odd article on this site now and then, but still not come decided whether Attwood is writing ironically, or if he is actually incompetent.

Firstly, the sample schema is not normalised. The member_count field in the Affiliation table is a derived value (SELECT COUNT(*) FROM UserAffiliation WHERE affiliation_id = ?). The religious_views and political_views columns in the User table smell badly of multi-values. The purpose of the UserWorkHistory table is unclear, but looks unnormalised.

Secondly, the SELECT example given should user outer joins.

Thirdly, using SELECT * is a no-no. What happens if the order of the columns change? If you're not using integer indices, then performance suffers. If you are using integer indices, then column order is significant. What if someone adds more columns that your code doesn't need? You'll still be pulling that data back.

Oh well, I'm going to add your example schema to our pre-interview exam (commonly known as the idiot filter) - and if anyone suggests your denormalised table as an improvement I'll know to export them out the building uninterviewed.

Chris on July 15, 2008 6:22 AM

You mean you aren't using NHibernate, Jeff? Or even Linq to Sql?

In your example, if you need to get all of the data for a given user, your sql would look something like:

select * from Users where user_id = 10;
select * UserPhoneNumbers where user_id = 10;
select * UserScreenNames where user_id = 10;
select * UserAffiliations where user_id = 10;
select * Affiliations where user_id = 10;
select * UserWorkHistory where user_id = 10;
select * Affiliations where user_id = 10;

and your ORM would handle populating your POCOs.

Even if you aren't using an ORM of some kind, ADO.Net handles the simultaneous retrieval of multiple sets of records just fine.

Like Peter Becker said, Normalization is about design, denormalization is about optimization.


Julio on July 15, 2008 6:25 AM

From experience: Normalize tables in databases (up to the third normal but probably not more) when these will be mainly used to collect information (lots of inserts, updates).

Denormalize tables when these are going to be used to select type qyeries (to feed reports, etc) or better still offload these to a data warehouse cubes which are optimized for analysis and reporting.

cyclo on July 15, 2008 6:32 AM

codinghorror.com 127.0.0.1

hosts on July 15, 2008 6:34 AM

Your normalized database could actually have a couple more tables. Phone Type, IM Service, Company name and maybe even Title could be in lookup tables so you can find all the people for X Company, or using Y IM service etc.
You can't allow free text there or some will enter ATT and others ATT. You could control it on the front end, but you'd still need a table to get the list.

Rich on July 15, 2008 6:34 AM

Ask anyone who works with the family of databases called “multivalued” (such as Revelation’s OpenInsight or IBM’s U2) and they’ll tell you that you normalise when data HAS to be treated atomically – or as so many others have said, design to third normal then justify denormalising. Removing external joins is one easy way of reducing web latency.

Andrew McAuley on July 15, 2008 6:35 AM

You make a good point about denormalization. I am developing a social networking app and have been fairly strict about normalizing data, and have created indexes to help performance. It is not yet a problem, but it has occurred to me that occasionally tortuous joins may eventually exact a performance hit. I'm fully prepared to abandon purity for performance when/if this happens. Thought-provoking article.

decaf on July 15, 2008 6:36 AM

An obvious (?) solution which works for even very large sites is to store the master copy of your table in normalized form, and then use batch jobs to denormalize the data as appropriate. This means you can have multiple denormalized forms suitable for different purposes, without losing any data integrity.

Of course this precludes the use of ORM, but once you start talking about performance and scalability you've already done that.

Greg Beech on July 15, 2008 6:43 AM

People normalize because their professors told them to.

Never had one of those. I normalize because it's elegant and easy(er) for me to understand. I don't do much work on huge systems, except for our major systems which were already designed for us and are quite denormalized already. Of course now I'm doing more data warehouse work, where all the rules get thrown out the window.

Rhywun on July 15, 2008 6:49 AM

Using NHibernate lazy loading would allow you to just compose your object tree with the data you need at that point in time.

Say for example you needed to display the users name. No need to populate the whole aggregate (Affiliation, Work History...) just the root (User).

That should help cut down some of the joins.

In the end of course 'it depends'.

Stuart Cam on July 15, 2008 6:51 AM

You're just another knucklehead. Think of it this way: would you take seriously a guy who's spent 20 years on the line at Tyson dismembering chickens announcing that he was *really* a neurosurgeon? And that he's got the perfect way to remove that tumor that's been giving you massive headeaches? The folks who've spent decades developing RDBMSs and systems from same, really do know more about it than you do.

That said; RDBMS is about shared *transactional data*. If you really don't care about keeping the data right all the time, then how you store it doesn't matter. Read Codd's papers or Date's book. If you haven't bothered to do either (you haven't, right?), then why should your opinion matter?

There are reasons Codd drew his conclusions. He didn't just make it up.

Some thoughts I've collected over the years.


Not only am I not aware of any resource that would advocate logical modeling of data in an unnormalized fashion, I would be against anyone reading it if it existed. The logical data model should be normalized. Normalization is the process of identifying the one best place each fact belongs.
-- Craig Mullins/2007


There's a tendency, especially among managers of smaller shops, to assume that any developer knows how to set up a database. Frankly, this perplexes me. You wouldn't assume that any given developer knows how to code in C# or set up a Web Service, so why is it that we're all supposed to be database pros? The end result is that too many databases are designed by people who have never even heard the term normalization, never mind developed any understanding of the various normal forms.
-- Mike Gunderloy/2006


This is a great book on building databases the correct way. I highly recommend this book[Database Modeling and Design by Teorey] and if you cannot understand it you shouldn't be building a database anyway.
-- Robert C/1999 [the current 4th edition/2006 is even better]


Many of the existing formatted data systems provide users with tree-structured files or slightly more general network models of the data. Application programs developed to work with these systems tend to be logically impaired...
-- E. F. Codd, A Relational Model of Data for Large Shared Data Banks/1970


[Dr.] Codd had a bunch of ...fairly complicated queries, and since I'd been studying CODASYL (the language used to query navigational databases), I could imagine how those queries would have been represented in CODASYL by programs that were five pages long that would navigate through this labyrinth of pointers and stuff [XML anyone?]. Codd would sort of write them down as one-liners. ...[T]hey weren't complicated at all. I said, 'Wow.' This was kind of a conversion experience for me. I understood what the relational thing was about after that.
-- Don Chamberlin/1995


Proper data management is the key to great architecture. Ignoring this and abstracting data access and data management away just to have a convenient programming model is … problematic. ... Many of the proponents of O/R mapping that I run into (and that is a generalization and I am not trying to offend anyone – just an observation) are folks who don't know SQL and RDBMS technology in any reasonable depth and/or often have no interest in doing so.
-- Clemens Vasters/2006


The big myth perpetrated by architects who don't really understand relational database architecture (me included early in my career) is that the more tables there are, the more complex the design will be. So, conversely, shouldn't condensing multiple tables into a single catch-all table simplify the design? It does sound like a good idea, but at one time giving Pauly Shore the lead in a movie sounded like a good idea too.
-- Louis Davidson/2007


Technologies like OODBMS sacrifice sound, application technology agnostic data management for short-term convenience (convenience for one single application, written using one particular programming language). Relational technology essentially completely replaced network or hierarchical database technology, and there were excellent reasons why that happened. We should most certainly not be reviving either of those discredited approaches by slapping on the latest buzzwords (OO, XML, etc) as window dressing. ...We don't see any future for JDO.
-- Gavin King/2004 [he wrote Hibernate]


...the logic exercised by the database to maintain referential integrity is not free, but if the rdbms does not do it, it gets done in the application, where it costs the same if not more - just now the application becomes slower versus the rdbms. Whether the logic of triggers or RI is performed in the application or the database, it's going to have similar pathlength - the only question is whether the rdbms or application can be more efficient. As more applications are written in languages that emphasize portability over performance (Java) or ease of learning over performance (Visualbasic), the benefits of putting logic that centres around data further down into the database engine become stronger.
-- Blair Adamache/2003


In CS we don't have a lot of formal models to guide us, as in engineering or other science. Much of CS is entirely ad-hoc. However we do have a sound and complete model for data storage (relational model) and hardly anyone uses it. It boggles my mind. Do people not *want* their programs to work predictably? [corollary: if you *must* interact with an Application to modify data, the data specification isn't relational]
-- Anonymous Coward/2005


The people using SQL to store data often don't grasp that there's a difference between expressing various facts as a set of relations and stuffing data into a database like you stuff material into a file.
-- Christopher Browne/2005


Programmers will always gravitate towards viewing the data in their databases as their private bailiwick, and insist that users of the data access it through their own API. Learning SQL is certainly better than learning a hundred programmer's different APIs.
-- David Cressey/2005


It is very obvious that a lot of people have absolutely no idea what a business model is or for that matter a database. The number of times I have seen business rule code that should be imbedded in the database definition and not in the BLL (Business Logic Layer) just makes me want to scream. As soon as data rules (the backbone of a business model) are moved away from a relational style declarative constraint, you have reduced the effectiveness, meaning and integrity of your business model/database.
-- DavidM/2004


Normalizing is an attempt to optimize generally. Denormalization is an attempt to normalize for a particular application, to introduce a bias.
-- Gulutzan Pelzer/2003


If you store all your business logic within the database, then it doesn't matter what flaky applications people write you can guarantee the integrity of the data.
-- Pretsel/2004

BuggyFunBunny on July 15, 2008 6:54 AM

Lets unroll our loops next. (Christog)

I seen that and I must say was blown away how removing loops can make your application run faster. Of course, you shouldn't jump right away and erase all your loops and ifs, but sometimes it can help, just like de-normalization.
You can read about a guy optimizing his XML parser in the book called Beautiful Code. Sorry, I don't remember who wrote it.

Sergej Andrejev on July 15, 2008 6:56 AM

I run an e-commerce website with about 400,000 page views a month. I've been a complete purist about normalizing the database, as well as using business objects to access all data (rather than using ad hoc SQL queries). We have some pretty complicated business object hierarchies and I've long been waiting for there to be a performance issue (especially as we use a shared SQL Server), but the website continues to be blazingly fast. I realize that you are going to have a lot more page views than our site, but it looks like your schema is going to be simpler and you have a dedicated SQL Server, so I think you can afford to be a purist for now and not worry about denormalizing until it becomes an issue.

Ben Mills on July 15, 2008 6:58 AM

Good post Jeff.

The point many of the jihadist posters are missing is that the right thing may change dramatically as an application scales up.

Your high scalability point is key. As you said, Everything is fast for small n.

The ideals of normalized data, and even of using a RDBMS at all, fall apart at some scale, as Google and Amazon will attest.

John Pirie on July 15, 2008 7:00 AM

jeff, are you awake?

wiggles on July 15, 2008 7:02 AM

It's funny how many people just ignore your last conclusion because it didn't suit their argumentation purpose; “normalize until it hurts, denormalize until it works” sums it up very nicely.

That being said, the above example would be no reason to change the normalized DB schema but instead to change the query (obviosuly). In any case, such a query would be very rarely necessary (someone already mentioned data warehousing, I think).

Konrad on July 15, 2008 7:03 AM

Ahhhhh! Don't tell people that! Denormalization may be all well and good, when you need the performance and your system is STABLE enough to support it. Doing this in a business environment is a recipe for disaster, ask anyone who has spent weeks digging through thousands of lines of legacy code, making sure to support the new and absolutely required affiliation_4. Then do the whole thing over again 3 months later when some crazy customer has five affiliations.

Sean on July 15, 2008 7:04 AM

Traditional database design principles tell you that well-designed databases are always normalized, but I'm not so sure.

Whose tradition database design principles are you referring to? It's been some time since I took any database courses at college, but I do remember that both the book and the instructor told us that going all the way to the 5th level was extreme and not a good idea, because it added complexity to the queries used to get data which slowed it down. We were told that in most cases, the 3rd level of normalization will provide the best results.

The purpose of normalization is reduce size and eliminate the number of duplicate records, but the trade off is that the queries used to reconstruct the data become more complex and thus take more time.

Delmania on July 15, 2008 7:05 AM

(disclaimer: I'm talking completely about Microsoft SQL Server 2005 here.)

I'm calling shenanigans on this post, unless you put up a definitive example of where your app is running into problems with a normalized database design.

Like others have said, to get the user data you could use an indexed view. But also as others have said, if you have proper indexes and foreign keys setup, the SQL engine should have no problem figuring it out and generating the same query plan as if you had just queried those tables directly in the first place.

I'd also like to say that the example given is a poor one, seeing as you're joining 1-to-1 tables to 1-to-n (phone numbers, affiliations, etc.)

Not to mention that de-normalization leads to it's own problems..

Nicholas on July 15, 2008 7:07 AM

PS. I name my fields with ORM in mind, even if I don't use it. Thus, Id, not UserId or god forbid user_id...

Rhywun on July 15, 2008 7:08 AM

Keep it normalized. How often do you actually need EVERYTHING about a user at once? Probably only when you are editing that user.

As others have said, use proper indexes to speed it up. De-normalized tables will haunt you for years whenever there are issues. As in an earlier post you are being cruel to the maintenance programmer (even if that is yourself).

Practicality on July 15, 2008 7:10 AM

all of those with a data access layer step forward... not so fast jeff

structure on July 15, 2008 7:11 AM

I run an e-commerce website with about 400,000 page views a month

That's very low. I've got a site doing twice that traffic every day and has about 30 tables, with the heaviest (and most common query) working on a table of 200k records requiring a sequential scan each time since the sort criteria is based on an average of a field in the resultset that is different for each query which are finally joined against 12 tables.

All running on a 2ghz opteron with 1gb ram, and cpu usage hovers at 40% at peak times.

God bless PostgreSQL.

nick on July 15, 2008 7:13 AM

Sorry to contradict Greg !

Use a multi-value (a.k.a. Pick) database and you don't have to worry about normalization ever again. They are fast too.

Mike on July 15, 2008 7:14 AM

More comments»

The comments to this entry are closed.