One of the items we're struggling with now on Stack Overflow is how to maintain near-instantaneous performance levels in a relational database as the amount of data increases. More specifically, how to scale our tagging system. Traditional database design principles tell you that well-designed databases are always normalized, but I'm not so sure.
Dare Obasanjo had an excellent post When Not to Normalize your SQL Database wherein he helpfully provides a sample database schema for a generic social networking site. Here's what it would look like if we designed it in the accepted normalized fashion:
Normalization certainly delivers in terms of limiting duplication. Every entity is represented once, and only once -- so there's almost no risk of inconsistencies in the data. But this design also requires a whopping six joins to retrieve a single user's information.
select * from Users u inner join UserPhoneNumbers upn on u.user_id = upn.user_id inner join UserScreenNames usn on u.user_id = usn.user_id inner join UserAffiliations ua on u.user_id = ua.user_id inner join Affiliations a on a.affiliation_id = ua.affiliation_id inner join UserWorkHistory uwh on u.user_id = uwh.user_id inner join Affiliations wa on uwh.affiliation_id = wa.affiliation_id
(Update: this isn't intended as a real query; it's only here to visually illustrate the fact that you need six joins -- or six individual queries, if that's your cup of tea -- to get all the information back about the user.)
Those six joins aren't doing anything to help your system's performance, either. Full-blown normalization isn't merely difficult to understand and hard to work with -- it can also be quite slow.
As Dare points out, the obvious solution is to denormalize -- to collapse a lot of the data into a single Users table.
This works -- queries are now blindingly simple (select * from users), and probably blindingly fast, as well. But you'll have a bunch of gaping blank holes in your data, along with a slew of awkwardly named field arrays. And all those pesky data integrity problems the database used to enforce for you? Those are all your job now. Congratulations on your demotion!
Both solutions have their pros and cons. So let me put the question to you: which is better -- a normalized database, or a denormalized database?
Trick question! The answer is that it doesn't matter! Until you have millions and millions of rows of data, that is. Everything is fast for small n. Even a modest PC by today's standards -- let's say a dual-core box with 4 gigabytes of memory -- will give you near-identical performance in either case for anything but the very largest of databases. Assuming your team can write reasonably well-tuned queries, of course.
There's no shortage of fascinating database war stories from companies that made it big. I do worry that these war stories carry an implied tone of "I lost 200 pounds and so could you!"; please assume the tiny-asterisk disclaimer results may not be typical is in full effect while reading them. Here's a series that Tim O'Reilly compiled:
There's also the High Scalability blog, which has its own set of database war stories:
First, a reality check. It's partially an act of hubris to imagine your app as the next Flickr, YouTube, or Twitter. As Ted Dziuba so aptly said, scalability is not your problem, getting people to give a shit is. So when it comes to database design, do measure performance, but try to err heavily on the side of sane, simple design. Pick whatever database schema you feel is easiest to understand and work with on a daily basis. It doesn't have to be all or nothing as I've pictured above; you can partially denormalize where it makes sense to do so, and stay fully normalized in other areas where it doesn't.
Despite copious evidence that normalization rarely scales, I find that many software engineers will zealously hold on to total database normalization on principle alone, long after it has ceased to make sense.
When growing Cofax at Knight Ridder, we hit a nasty bump in the road after adding our 17th newspaper to the system. Performance wasn't what it used to be and there were times when services were unresponsive.A project was started to resolve the issue, to look for 'the smoking gun'. The thought being that the database, being as well designed as it was, could not be of issue, even with our classic symptom being rapidly growing numbers of db connections right before a crash. So we concentrated on optimizing the application stack.
I disagreed and waged a number of arguments that it was our database that needed attention. We first needed to tune queries and indexes, and be willing to, if required, pre-calculate data upon writes and avoid joins by developing a set of denormalized tables. It was a hard pill for me to swallow since I was the original database designer. Turned out it was harder for everyone else! Consultants were called in. They declared the db design to be just right - that the problem must have been the application.
After two months of the team pushing numerous releases thought to resolve the issue, to no avail, we came back to my original arguments.
Pat Helland notes that people normalize because their professors told them to. I'm a bit more pragmatic; I think you should normalize when the data tells you to:
Never, never should you normalize a database out of some vague sense of duty to the ghosts of Boyce-Codd. Normalization is not magical fairy dust you sprinkle over your database to cure all ills; it often creates as many problems as it solves. Fear not the specter of denormalization. Duplicated data and synchronization problems are often overstated and relatively easy to work around with cron jobs. Disks and memory are cheap and getting cheaper every nanosecond. Measure performance on your system and decide for yourself what works, free of predispositions and bias.
As the old adage goes, normalize until it hurts, denormalize until it works.
| [advertisement] Peer code review without meetings, paperwork, or stopwatches? No wonder Code Collaborator won the Jolt Award. |
Posted by Jeff Atwood View blog reactions
« Monkeypatching For Humans The Ultimate Software Gold Plating »
I just spent a year having normalization hammered into my meager little mind as part of my database module and now you're telling me that it mightn't be necessary...........
love your blog Jeff but i hate it when you do this to me. :(
Reader 101 on July 15, 2008 02:18 AMHow about using materialized query tables or materialized views?
Andrew Stuart on July 15, 2008 02:21 AMMy approach is to design it all normalized at first, as it makes the design cleaner and usually easier to grok, and then denormalize where absolutely needed. As long as the system is performing well, I would prefer to keep my data normalized and then create denormalized views so I can get those huge joins out of my way.
When it comes to teaching, normalized tables is the way to go. People must master the rules before they break them (and at that point nobody will have to teach them that denormalization is OK, they will have figured that out already)
SomeGuy on July 15, 2008 02:25 AMQuite a horrible job naming the tables. Why the underscores in field names and not in table names. Why user_id and not user_sex? Why table nimes in singular? Of course this is beside the point, but lol.. :D
nub on July 15, 2008 02:33 AMNormalization is about design, denormalization is about optimization. They should be applied as such, i.e. anything should be normalized until denormalization is needed. As Knuth (or was it Hoare?) said: "Premature optimization is the root of all evil".
And from the short glimpse I did at the article about performance of tag systems it seems they tested only on MySQL. Not exactly a database one connects with good performance on complex queries -- it's extremely fast on flat data, but to my (not utterly up-to-date) knowledge they never managed to catch up with Postgres or the commercial RDBMS' on anything reasonably complex. Not sure if that applies to single joins already, but I wouldn't be too surprised.
Peter Becker on July 15, 2008 02:36 AMNormalization might indeed not be the fastest design, but can usually made fast enough without denormalization, as Dare Obasanjo states in his original article, too:
"Database denormalization is the kind of performance optimization that should be carried out as a last resort after trying things like creating database indexes, using SQL views and implementing application specific in-memory caching."
Jens Göring on July 15, 2008 02:43 AMIn my __LONG__ experience of both DB and application design, it is a bad (or rushed) application design that usually forces denormalization on a well designed database.
Several small queries over normalized data also usually outstrip one big one over denormalized data.
But, denormalization still has a home in large data warehouses.
Hope smart tools from DatabaseGear can help.
Lex Y. Li on July 15, 2008 02:48 AMLets unroll our loops next.
Christof on July 15, 2008 02:51 AMJust a nitpick, I'd recommend using the term "Gender" over "Sex". Someone might ally themselves with a different gender to their sex and that's the one they're going to want to tell you.
(Obviously it's sample data, but just for the future :)
Id did annoy me slightly always having normalization drummed into me without mention that it's not always best, but I guess SomeGuy is right.
[ICR] on July 15, 2008 02:54 AMAnother thing you could try in the case of StackOverFlow is to reduce the data required from the user.
Some questions you can ask:
1) Do our web application really need those phone numbers? Are we going to call our users?
2) Is address needed?
3) Does gender matter?
4) If an email id and password (and a display name if necessary)is what really matters, bother with all those unnecessary data?
You can add the bells and whistles later. First, Ship.
Niyaz PK on July 15, 2008 02:56 AMJeff,
if you want to avoid the usage of big and complicated SQL-Statements, why not implementing Views instead of denormalization. A good RDBMS should offer this mechanism, especially for performance and usage optimizations.
You could have a View called "users", could still say "SELECT * FROM users", but what is exactly within "users" can be changed at any time, without changing the interface to the application; The View "users" in this example.
Frank on July 15, 2008 02:59 AMOne thing you're not taking into account is a system like memcached that avoids hitting the database at all. When using such a system, it might be better to actually be storing normalized data so that your hits on the database for cache misses are small and involve as little table locking as possible.
-Max
Max Kanat-Alexander on July 15, 2008 03:00 AMBefore normalizing denormalizing think about what you need to to with the database and where you are going to do it. One example: We have a quite big database with test results. Most of the work is gathering data for reports. Normalizing <> Denormalizing is an issue and you only know what is faster if you know how you access it.
offler on July 15, 2008 03:01 AMIt's always amazed me how we spend hours normalizing databases just to spend more hours creating non-normalized views to run reports on.
I've always tried to normalize as much as was sensible rather than slavishly following the supposed good practice of BC Normal Form.
Excellent thought-provoking article.
Ali M on July 15, 2008 03:01 AMby the way: why do you want to create a list of users and adresses? Wanna sell?
offler on July 15, 2008 03:02 AMI'm not sure I agree with the total denormalisation in the example, Jeff. I'm (relatively) recently out of University, so I've still got the normalisation-is-everything rules in my mind :-) Being pragmatic, though, and as a couple of other people have noted, denormalisation is an optimisation to a fully-normalised design. I'd argue that denormalisation should be done in stages, in just the same way that normalisation is.
For example, I don't think I've ever taken a production database past 3NF. Certainly the only time I've take a design to 5NF is for my University database assignment. And aren't stored procedures supposed to improve performance for this sort of queries. Yes, you're always going to take a performance hit for a query utilising six joins (there are some crazily-sized cartesian products going on in there), but wouldn't a multi-layered approach aid this? I.e., for common or computation-intensive operations (such as viewing a user's profile in the example above), using multiple stored procedures to extract sections of data and then another stored procedure to put them together. It could also be accomplished using the user_id primary key to extract the relevant record from each table and then combining the records at the application level.
Having said all that, I haven't done any database programming in a good couple of years now, so I could be blowing a lot of hot air :-)
Alastair Smith on July 15, 2008 03:02 AMDatabases are one of those things that you just can't take a cavalier approach to. If you screw that design up, and worse still, launch it, you're in for a world of hurt when the problems with the data come.
Normalisation means the only problem you're going to get is scalability: your data is going to be good (presuming you were also clever/paranoid enough to get all the integrity checks in there too).
As the quote says, you can always denormalise later for speed.
Lewisham on July 15, 2008 03:02 AMIf you create indexed views over the joins, then the optimizer should use those anyway and there won't be a perfomance hit from being denormalized.
Harry M on July 15, 2008 03:04 AMAddendum: And as others have said, there are much better solutions than denormalising your database. Having Materialized Views for reads, or memcached.
Lewisham on July 15, 2008 03:07 AM"Quite a horrible job naming the tables. Why the underscores in field names and not in table names. Why user_id and not user_sex? Why table nimes in singular? Of course this is beside the point, but lol.. :D
nub on July 15, 2008 02:33 AM "
because when you're selecting multiple id fields you want to be able to distinguish between them, example: user_id and affiliation_id
instead of
select user.id user_id, affiliation.id affiliation_id from ...
you can just have
select user_id, affiliation_id from ...
taht way you don't always have to prefix the fields with the table name and you also you don't have to alias the fields everytime;
and table names are singular because each row in the table designates one specific person; one row = one user
james on July 15, 2008 03:10 AMThat 6 join query is wrong, and the rest of your article follows from that incorrect join.
Why would you ever want a query that returned the user data multiple times, once for every combination of screen_name and phone_number?!? No real site would ever want to do that.
If you need phone number, you get phone numbers, if you need screen names, you get those - but why would you need every possible combination of phone number and screen_name?
You're simply wrong. Speeding up joins is never a reason to denormalize for the simple reason that it doesn't! Test it - you'll see I'm right. But, do me a favor and write real joins - not ones that give you tremendous numbers of duplicated rows.
And if you don't even understand why your query is wrong, you have no business designing databases.
A valid reason to denormalize is to precalculate data, which you touch on very briefly. But you always write that twice: once normalized, and that's the primary data, and then again, the "cached/precalculated" version. You should always be able to regenerate that from the normalized data.
OK, one final reason to denormalize which you didn't even write: if you need to do a where clause from one table, but the order by from a different table, you need to denormalize because you can not create a combined index from both tables. (Databases with function indexes might be able to but that's pretty complicated.)
I'm sorry to bash you so much, but you shouldn't write about what you have no experience with.
Ariel on July 15, 2008 03:15 AMThe approach we've taken at work is to have data normalised (well all the new tables, we have some truly hideous legacy tables that we haven't tidied up yet!) and then either use memory cache or search tables to speed up access.
In fact our search table is one honking great big denormalised table. There can't be any joins at all. This is generated by an off line task constantly.
I'm always wary about denormalising the 'cannonical' tables as this leads you into a maintenance nightmare, especially when it comes to adding or removing columns.
Glenn on July 15, 2008 03:16 AM> (You're automatically measuring all the queries that flow through your software, right?)
Got an article about this? Would be a good topic
Colm Larkin on July 15, 2008 03:22 AMTwo keywords (already mentioned) that I want to underline:
- views
- views aka materialized views
- star scheme for your billions of tuples (that your memory won't hold)
(<a href="http://en.wikipedia.org/wiki/Star_schema">http://en.wikipedia.org/wiki/Star_schema</a>)
Indexing and caching aside.
Krischan on July 15, 2008 03:23 AMCan somebody tell me how 'views' can increase the performance of a very large system?
Niyaz PK on July 15, 2008 03:25 AM>and table names are singular because each row in the table designates one specific person; one row = one user
james: are you serious? Tables don't have 1 row in them they have many. So you make them plural (1 row singular, multiple rows plural). Rule: table names are plural, column names are singular.
Also, you do not want to prefix every single column with the name of the table, what does that save you? Just type tablename.columnname instead of tablename_columnname - no need for aliasing. Sometimes you violate that rule where it might be confusing, but for that most part don't double prefix stuff.
Ariel on July 15, 2008 03:28 AMYour technique of describing a denormalized database model is one of the basic principals of dimensional modelling primarily used for data warehousing.
It's pretty much the opposite of normalizing your model (it contains redundant data, utilizes lots of space etc). Especially if end users need to use this data to query for e.g. reporting purposes then dimensional modelling is absolutely the way to go.
However, if you’re designing a normalized database, I would suggest to keep it normalized and don’t denormalize at all unless you’re designing a dimensional database structure from the beginning. Keep it separated and don’t mix the two modelling techniques even though it might be more efficient in some cases. Making things unnecessary complex is not something you want to do very often.
Sander on July 15, 2008 03:28 AMDB nuts come out in forces. Views as a performance improvement LOL.
Even on a very small DB those 6 joins will cost you. Oracle uses huge amounts of CPU for joins, whether or not you have lots of data. With any kind of load at all it will cost you.
Just forget about using a relational DB and use Berkeley.
Asd on July 15, 2008 03:35 AMI'm starting to study object-relational database, and I wonder if it wouldn't be a nicer solution for you, since is wipes away the necessity to write long complicated joins.
My studies haven't yet gone too far on that topic, but it seems ORDBs are easier to deal with than relational ones.
***
Which software did you use do create those diagrams?
Mário Marinato, from Brazil on July 15, 2008 03:41 AM@Niyaz: Materialized views (or indexed views, depending on your rdbms) are in fact tables, but created and maintained by your rdbms. So a materialized view will act exactly like a de-normalized table would - except you keep you original normalized structure and any change to original data will propagate to the view automatically.
Goran on July 15, 2008 03:44 AMsorta on topic: What'd you use to make the diagrams? I dig
X on July 15, 2008 03:47 AM>table names are plural, column names are singular.
According to Codd and Date table names should be singular, but what did they know.
Pablo on July 15, 2008 03:48 AMI will have to nod on the use of views, especially indexed views, on this. Views help in performance because it tells the database which queries with complex joins will be accessed more often, and allow the database to pre-store its execution plan (as opposed to adhoc T-SQL) Indexing against the view will further optimize it.
If you use NHibernate, views have the additional advantage of being treated by NH as "just another table", so you could actually design a view against an object in a much simpler manner.
Jon Limjap on July 15, 2008 03:50 AMI'm a little lost here.
I'd say the example is not even normalized. I mean User Phone Number as well as User Screen Name should me merged into the User table and they will still be in BCNF. Then you keep the rest of tables as they are...
Why do you take Phone Number and Screen Name out of the picture? Is not as if you may have several Phone Numbers for one user with that design... and if the user does not have any phone number then, you know, NULL values may help.
Jorge on July 15, 2008 03:51 AM"and at that point nobody will have to teach them that denormalization is OK, they will have figured that out already"
Erm... Isn't the entire point of this post that some people haven't?
I always learnt the last rule of normalization is to de-normalize.
Also I believe large organizations use a de-normalized for for selects and a normalized view for updates and inserts using replicated servers.
Have a look at the Scaling Databases section of http://msdn.microsoft.com/en-us/magazine/cc500561.aspx
John on July 15, 2008 04:08 AM@Ariel
"Why would you ever want a query that returned the user data multiple times, once for every combination of screen_name and phone_number?!? No real site would ever want to do that."
The syntax may be incorrect (I'm not sure you're right, though), but I assume the idea is that you might want a complete list of all users and their related data. Can you give the correct answer rather than just picking holes?
At the same time, the fact that the syntax for joins is so tricky to get right and interpret on reading is another major problem with Normalised Tables.
"If you need phone number, you get phone numbers, if you need screen names, you get those - but why would you need every possible combination of phone number and screen_name?"
What if you needed a list of everybody's screen name and their related phone numbers?
"You're simply wrong. Speeding up joins is never a reason to denormalize for the simple reason that it doesn't! Test it - you'll see I'm right."
Well, as the articlesays, there'll be no measurable difference until you reach billions of data rows. Do you even read the article?
"But, do me a favor and write real joins - not ones that give you tremendous numbers of duplicated rows."
Which you have singularly failed to provide an example of.
very easy answer:
"normalize till its good, denormalize till it works!"
That's the rule I learn in college.
Hi Jeff,
All I can say is: Thanks!
This is the most enjoyable and valuable post in the blogoshphere for monts!
Daniel Fisher on July 15, 2008 04:12 AMWhy do you need to use a relational db for something like querying tags in the first place? Another approach would be simply to treat the db as a persistence layer and maintain in memory indexes for your tags.
Feeding a term vector of tags into something like Lucene is going to be faster than any DB query and a hell of a lot easier to scale.
Ryan Roberts on July 15, 2008 04:13 AMAn observation : if you use LINQ (Linq2SQL,Entity Framework,...) you don't have to write all those inner joins. You can use properties to access related entities. I love SQL but LINQ is allows to work with your data on a higher abstraction. If needed we, the SQL oldies, can still impress the young with blazing fast queries which will become a lost art.
Tom Pester on July 15, 2008 04:16 AM'Golden Copy' of data should also be taken into account when designing databases, to often I have seen database that don't apply any normalisation have multiple copies of critical fields spread across multiple tables...
Ollie Riches on July 15, 2008 04:19 AMselect * whatever can also be a performance drain worse than the 16 inner joins you described. select what you need and nothing more.
Brian on July 15, 2008 04:23 AM@Ariel
Looks like you'd get some duplication where you had multiple screen names and multiple phone numbers, but that's a very fine edge condition. It could also be helpful because you'd be able to detect users with multiple profiles or who have provided multiple contact details.
The fact that you didn't provide suggests that the normalised table is quite tricky to extract the complete data from cleanly.
Tom on July 15, 2008 04:24 AMIn your example you've turned several one to many or many to many relationships into 1-2 and 1-3 relationships. These are obviously not the same thing and there are many real life requirements for these standard relationships (1-N, and N-N).
The problems really start happening when you aren't normalizing your tables because your professor told you to, but when you need to model these one-to-many or a many-to-many relationships between data.
And I disagree that it takes millions of rows to make this stuff relevant, I worked on a project which had a search query with 4 or 5 joins and even specialised indexing for the major queries that were being run still caused any FULLTEXT searching on the data (3000 rows) to take 9 seconds. One summary table later and it was down to 0.09. Just by removing the joins!
So yeah, I guess my point is that the important thing here is the cost of a JOIN, and you need to know how to deal with these (summary tables, proper indexing, query caching, etc.) in situations where they ARE necessary.
David McLaughlin on July 15, 2008 04:25 AM"Even a modest PC by today's standards -- let's say a dual-core box with 4 gigabytes of memory..."
Holy carp, is that considered modest? Somebody please tell my company so they'll upgrade our Pentium 4 512mb machines. That billion dollars we make each year has to be going somewhere.
Chris on July 15, 2008 04:28 AMAs I tried to clarify in my blog post, denormalization is something that should only be attempted as an optimization when EVERYTHING else has failed.
Denormalization brings with it it's own set of problems. You have to deal with the increased set of writes to the system (which increases your I/O costs), you have to make changes in multiple places when data changes (which means either taking giant locks - ugh or accepting that there might be temporary or permanent data integrity issues) and so on.
Reading various database stories may make it sound like "all the cool kids are doing it" but all the cool kids are also rolling their own file systems and implementing their own alternatives to relational databases as well. They did that because they hit scaling problems and felt they had no alternatives but they gave up something along the way.
More on this at http://www.25hoursaday.com/weblog/2007/10/10/WhenDatabasesLieConsistencyVsAvailabilityInDistributedSystems.aspx
Dare Obasanjo on July 15, 2008 04:28 AMGoran,
So my question is: where is the performance enhancement in using views?
When something is updated in the tables, all these join statements will be needed to make the view. Right?
So for a very large application, views does not give performance improvements. Tell me if I am wrong.
There's a whole lot of denormalisation going on with even the most normalised databases. Indexes are denormalisations & so is anything you cache.
I think the approach to take here is to denormalise for speed but to ensure that denormalised data sits on top of normalised data & can be rebuilt automatically. This is what happens with indexes, and it's a tried & tested methodology.
As with indexes, anything denormalised should be hidden from the main programming effort. In fact, I'd go as far as to say that it should be a function of the database. How this can be done without an over-reliance on views is difficult to imagine though.
Sean Bamforth on July 15, 2008 04:30 AMOkay, from the very beginning it has been known that normalisation can cause problems with performance. I think every "trained" DB designer know this.
But what happens, is that people see "Normalisation = Slow", that makes them assume that normalisation isn't needed. "My data retrival needs to be fast, therefore I am not going to normalise!"
So they end up with a database they call "denormalised for performance", but really it isn't, as the database was never normalised in the first place.
Tubs on July 15, 2008 04:32 AMThe only time I throw normalisation out the window is when I have a list of infrequently changing fixed-sized items which I can separate by commas and store in a single field instead of another table.
However, whilst it saves time on the querying side, the issue I find with doing this approach is it completely slows down the processing side, as my application must separate the list into an array or another structure before I can work with the data.
Personally, if the trade-off of each approach is near the same, I'd prefer a more logically structured database without null fields than a denormalized database which is hard to maintain.
Lachlan on July 15, 2008 04:35 AMAriel and George,
Phone numbers don't go into the user table because users can have multiple phone numbers (work, phone, cell, fax, etc). A properly normalized database extracts them out to their own table especially since the majority of users will have lots of nulls in these fields.This is assuming that you are normalizing your DB to 3NF.
In addition, who says you wouldn't want all the user data including phone numbers at once? Have you ever seen a profile page on FB? That's exactly what it does.
Dare Obasanjo on July 15, 2008 04:37 AM
always wondered what database normalisation meant...
can't imagine that joins are exactly a heavy performance drain though... the data is indexed after all. I don't actually have the experience of a database that big (biggest db i worked on only held info on about 1.5 million people - linked to various details in much smaller tables), but I would imagine you could have trillions of records with no problem given some good indexing implementation...
in my experience slowness comes from things like calculating a sum, where the database doesn't update the sum on each record change, but recalculates the whole damned thing every time its needed. or poor application design, making 100s of server requests, when actually one would do just fine.
just taking a naive implementation, given numeric primary keys you can reduce the amount of records you actually have to search through by a factor of 10 for each digit you decide to take advantage of...
I wonder what modern database implementations actually do? generic keys aren't so useful but similar logic can still be applied... at worst there is always the choice of a binary tree...
Jheriko on July 15, 2008 04:38 AMI am pretty sure your original "normalized" database in the diagram is poorly designed. After all you have two tables that appear to hold a user_id + affiliation_id relationship (UserAffiliation and UserWorkHistory). That, my friend, is *wrong*. And no wonder you have six joins, you have a database design that does not make sense.
You don't want to write a query like that-- ever-- in real life. It doesn't make sense. If you have three users who each have two phone numbers and two screen names, your join gets you a result set with twelve rows in whatever order the database deems appropriate.
In my experience, the odds of a developer properly designing a database are extremely small. You have just proven that to me yet again. And I can guarantee you that bad designs are way more likely to cause performance problems than properly normalizing data. Not only that, bad database design is likely to cause other problems like inconsistent data, incomprehensible code, and inscrutable bugs.
Michael on July 15, 2008 04:44 AM@Niyaz: "When something is updated in the tables, all these join statements will be needed to make the view. Right?"
Yes - whenever you update / insert data it will in fact update / insert twice. But that's likes akin to saying indexes reduce performance as you need to store the same data multiple times :). My favorite saying on this is "You can read fast and store slow or you can store fast and read slow".
@Jheriko: The biggest performance killer is so called physical read. Finding and accessing data on disk is the slowest operation. Unless child table is clustered indexed and you're using the cluster index in the join you will be making lots of small random access reads on the disk to find and access the child table data. This will be slow.
In some cases you may be right, but what will you do if you need more entries in "affiliation", "work_history" and "screen_name"? Do you really want to have so many empty fields in each entry?
Jonathan Rass on July 15, 2008 04:52 AMThe only companies running on denormalized databases are companies providing websites that have loosened constraints when it comes to data loss. Your credit card company is not going to denormalize their data in their transactional system anytime soon.
Atwood is wrong. There is a "better" way to do it, and that's to normalize your data. This keeps your data integrity and DRY, and helps protect you from bugs and errors in your database. Correctness should always come first. Denormalization is up there with "hack" and "kludge" and should be treated as such, not as a "reasonable alternative" to normalization. (Albeit a sometimes necessary one, though probably employed more than it needs to be due to posts like this one, and also employed due to limitations of the shitty database these sites seem to run on top of, MySQL.)
The biggest scalability problems I face are with human processes, not computer processes. Aside from number crunching applications, I've seldom run into the need to restructure code for performance. But I'm always running into mental scalability problems, ways of doing work that are going to drive people crazy in the long run.
John on July 15, 2008 05:04 AMJeff,
It's common in programmer circles to think there's not a best way to do things, but in this case, that's not true. The best way to store data in a relational database is to normalize it; that is, to take full advantage of its relational nature.
In your example, you do not need six joins to get all the user's data: you choose what the pertinent data is for a situation. If you do need all that data, and performance is an issue, materialized views, or some other caching system, is the way to go.
I appreciate your post, but it's very wrong, and harmful advice to an inexperienced programmer.
Clinton R. Nixon on July 15, 2008 05:07 AMAnother piece of savvy writing. I thank so much these writings from the experience and humble truth. Whatever it means.
xmariachi on July 15, 2008 05:08 AMPremature optimisation! Design the database fully normalised, measure performance, optimise, rinse and repeat until fast *enough*.
Don't forget that the fastest database query is the one that doesn't happen, i.e. caching is your friend.
Chris on July 15, 2008 05:09 AMIf you want to have a look at the spec of fully engineered and normalised database, check out the NHS Data Dictionary
This diagram shows the "person" property :
Tubs on July 15, 2008 05:12 AMWanted to repeat what Brian said, because it's one of the more valuable things in the comments: don't use "SELECT *" and you'll have cut your query count almost in half.
The database diagrams appear to have been made in a recent copy of Visual Studio.
While I agree with the general point Jeff, I do think there are other ways of achieving the effects of denormalization without squishing tons of fields into a table. Views, reporting databases, etc. If The DailyWTF is to be believed, just adding some (well-thought) indicies should be everyone's first start. Then learning how to use tuners & optimizers. Jumping straight to denormalizing a database is the same class of mistake as thinking a properly normalized database can't be a problem: you're assuming facts not in evidence based on what someone told you about their experience.
Tom Clancy on July 15, 2008 05:13 AMI've read the odd article on this site now and then, but still not come decided whether Attwood is writing ironically, or if he is actually incompetent.
Firstly, the sample schema is not normalised. The member_count field in the Affiliation table is a derived value (SELECT COUNT(*) FROM UserAffiliation WHERE affiliation_id = ?). The religious_views and political_views columns in the User table smell badly of multi-values. The purpose of the UserWorkHistory table is unclear, but looks unnormalised.
Secondly, the SELECT example given should user outer joins.
Thirdly, using "SELECT *" is a no-no. What happens if the order of the columns change? If you're not using integer indices, then performance suffers. If you are using integer indices, then column order is significant. What if someone adds more columns that your code doesn't need? You'll still be pulling that data back.
Oh well, I'm going to add your example schema to our pre-interview exam (commonly known as the "idiot filter") - and if anyone suggests your denormalised table as an improvement I'll know to export them out the building uninterviewed.
You mean you aren't using NHibernate, Jeff? Or even Linq to Sql?
In your example, if you need to get all of the data for a given user, your sql would look something like:
select * from Users where user_id = 10;
select * UserPhoneNumbers where user_id = 10;
select * UserScreenNames where user_id = 10;
select * UserAffiliations where user_id = 10;
select * Affiliations where user_id = 10;
select * UserWorkHistory where user_id = 10;
select * Affiliations where user_id = 10;
and your ORM would handle populating your POCOs.
Even if you aren't using an ORM of some kind, ADO.Net handles the simultaneous retrieval of multiple sets of records just fine.
Like Peter Becker said, "Normalization is about design, denormalization is about optimization."
From experience: Normalize tables in databases (up to the third normal but probably not more) when these will be mainly used to collect information (lots of inserts, updates).
Denormalize tables when these are going to be used to select type qyeries (to feed reports, etc) or better still offload these to a data warehouse cubes which are optimized for analysis and reporting.
cyclo on July 15, 2008 05:32 AMcodinghorror.com 127.0.0.1
hosts on July 15, 2008 05:34 AMYour normalized database could actually have a couple more tables. Phone Type, IM Service, Company name and maybe even Title could be in lookup tables so you can find all the people for X Company, or using Y IM service etc.
You can't allow free text there or some will enter ATT and others AT&T. You could control it on the front end, but you'd still need a table to get the list.
Ask anyone who works with the family of databases called “multivalued” (such as Revelation’s OpenInsight or IBM’s U2) and they’ll tell you that you normalise when data HAS to be treated atomically – or as so many others have said, design to third normal then justify denormalising. Removing external joins is one easy way of reducing web latency.
Andrew McAuley on July 15, 2008 05:35 AMYou make a good point about denormalization. I am developing a social networking app and have been fairly strict about normalizing data, and have created indexes to help performance. It is not yet a problem, but it has occurred to me that occasionally tortuous joins may eventually exact a performance hit. I'm fully prepared to abandon purity for performance when/if this happens. Thought-provoking article.
decaf on July 15, 2008 05:36 AMAn obvious (?) solution which works for even very large sites is to store the master copy of your table in normalized form, and then use batch jobs to denormalize the data as appropriate. This means you can have multiple denormalized forms suitable for different purposes, without losing any data integrity.
Of course this precludes the use of ORM, but once you start talking about performance and scalability you've already done that.
Greg Beech on July 15, 2008 05:43 AM> People normalize because their professors told them to.
Never had one of those. I normalize because it's elegant and easy(er) for me to understand. I don't do much work on huge systems, except for our major systems which were already designed for us and are quite denormalized already. Of course now I'm doing more data warehouse work, where all the rules get thrown out the window.
Rhywun on July 15, 2008 05:49 AMYou're just another knucklehead. Think of it this way: would you take seriously a guy who's spent 20 years on the line at Tyson dismembering chickens announcing that he was *really* a neurosurgeon? And that he's got the perfect way to remove that tumor that's been giving you massive headeaches? The folks who've spent decades developing RDBMSs and systems from same, really do know more about it than you do.
That said; RDBMS is about shared *transactional data*. If you really don't care about keeping the data right all the time, then how you store it doesn't matter. Read Codd's papers or Date's book. If you haven't bothered to do either (you haven't, right?), then why should your opinion matter?
There are reasons Codd drew his conclusions. He didn't just make it up.
Some thoughts I've collected over the years.
Not only am I not aware of any resource that would advocate logical modeling of data in an unnormalized fashion, I would be against anyone reading it if it existed. The logical data model should be normalized. Normalization is the process of identifying the one best place each fact belongs.
-- Craig Mullins/2007
There's a tendency, especially among managers of smaller shops, to assume that any developer knows how to set up a database. Frankly, this perplexes me. You wouldn't assume that any given developer knows how to code in C# or set up a Web Service, so why is it that we're all supposed to be database pros? The end result is that too many databases are designed by people who have never even heard the term normalization, never mind developed any understanding of the various normal forms.
-- Mike Gunderloy/2006
This is a great book on building databases the correct way. I highly recommend this book["Database Modeling and Design" by Teorey] and if you cannot understand it you shouldn't be building a database anyway.
-- Robert C/1999 [the current 4th edition/2006 is even better]
Many of the existing formatted data systems provide users with tree-structured files or slightly more general network models of the data. Application programs developed to work with these systems tend to be logically impaired...
-- E. F. Codd, "A Relational Model of Data for Large Shared Data Banks"/1970
[Dr.] Codd had a bunch of ...fairly complicated queries, and since I'd been studying CODASYL (the language used to query navigational databases), I could imagine how those queries would have been represented in CODASYL by programs that were five pages long that would navigate through this labyrinth of pointers and stuff [XML anyone?]. Codd would sort of write them down as one-liners. ...[T]hey weren't complicated at all. I said, 'Wow.' This was kind of a conversion experience for me. I understood what the relational thing was about after that.
-- Don Chamberlin/1995
Proper data management is the key to great architecture. Ignoring this and abstracting data access and data management away just to have a convenient programming model is … problematic. ... Many of the proponents of O/R mapping that I run into (and that is a generalization and I am not trying to offend anyone – just an observation) are folks who don't know SQL and RDBMS technology in any reasonable depth and/or often have no interest in doing so.
-- Clemens Vasters/2006
The big myth perpetrated by architects who don't really understand relational database architecture (me included early in my career) is that the more tables there are, the more complex the design will be. So, conversely, shouldn't condensing multiple tables into a single "catch-all" table simplify the design? It does sound like a good idea, but at one time giving Pauly Shore the lead in a movie sounded like a good idea too.
-- Louis Davidson/2007
Technologies like OODBMS sacrifice sound, application technology agnostic data management for short-term convenience (convenience for one single application, written using one particular programming language). Relational technology essentially completely replaced network or hierarchical database technology, and there were excellent reasons why that happened. We should most certainly not be reviving either of those discredited approaches by slapping on the latest buzzwords (OO, XML, etc) as window dressing. ...We don't see any future for JDO.
-- Gavin King/2004 [he wrote Hibernate]
...the logic exercised by the database to maintain referential integrity is not free, but if the rdbms does not do it, it gets done in the application, where it costs the same if not more - just now the application becomes "slower" versus the rdbms. Whether the logic of triggers or RI is performed in the application or the database, it's going to have similar pathlength - the only question is whether the rdbms or application can be more efficient. As more applications are written in languages that emphasize portability over performance (Java) or ease of learning over performance (Visualbasic), the benefits of putting logic that centres around data further down into the database engine become stronger.
-- Blair Adamache/2003
In CS we don't have a lot of formal models to guide us, as in engineering or other science. Much of CS is entirely ad-hoc. However we do have a sound and complete model for data storage (relational model) and hardly anyone uses it. It boggles my mind. Do people not *want* their programs to work predictably? [corollary: if you *must* interact with an Application to modify data, the data specification isn't relational]
-- Anonymous Coward/2005
The people using SQL to store data often don't grasp that there's a difference between expressing various facts as a set of relations and stuffing data into a database like you stuff material into a file.
-- Christopher Browne/2005
Programmers will always gravitate towards viewing the data in "their databases" as their private bailiwick, and insist that users of the data access it through their own API. Learning SQL is certainly better than learning a hundred programmer's different APIs.
-- David Cressey/2005
It is very obvious that a lot of people have absolutely no idea what a business model is or for that matter a database. The number of times I have seen business rule code that should be imbedded in the database definition and not in the BLL (Business Logic Layer) just makes me want to scream. As soon as data rules (the backbone of a business model) are moved away from a relational style declarative constraint, you have reduced the effectiveness, meaning and integrity of your business model/database.
-- DavidM/2004
Normalizing is an attempt to optimize generally. Denormalization is an attempt to normalize for a particular application, to introduce a bias.
-- Gulutzan & Pelzer/2003
If you store all your business logic within the database, then it doesn't matter what flaky applications people write you can guarantee the integrity of the data.
-- Pretsel/2004
"""Lets unroll our loops next.""" (Christog)
I seen that and I must say was blown away how removing loops can make your application run faster. Of course, you shouldn't jump right away and erase all your loops and ifs, but sometimes it can help, just like de-normalization.
You can read about a guy optimizing his XML parser in the book called "Beautiful Code". Sorry, I don't remember who wrote it.
I run an e-commerce website with about 400,000 page views a month. I've been a complete purist about normalizing the database, as well as using business objects to access all data (rather than using ad hoc SQL queries). We have some pretty complicated business object hierarchies and I've long been waiting for there to be a performance issue (especially as we use a shared SQL Server), but the website continues to be blazingly fast. I realize that you are going to have a lot more page views than our site, but it looks like your schema is going to be simpler and you have a dedicated SQL Server, so I think you can afford to be a purist for now and not worry about denormalizing until it becomes an issue.
Ben Mills on July 15, 2008 05:58 AMGood post Jeff.
The point many of the jihadist posters are missing is that "the right thing" may change dramatically as an application scales up.
Your high scalability point is key. As you said, "Everything is fast for small n."
The ideals of normalized data, and even of using a RDBMS at all, fall apart at some scale, as Google and Amazon will attest.
John Pirie on July 15, 2008 06:00 AMhey all you young guys - logical modelling (data model) is normalised (for understanding), physical model (database) is denormalised (for performance) - duh!
non american spelling on July 15, 2008 06:00 AMjeff, are you awake?
wiggles on July 15, 2008 06:02 AMIt's funny how many people just ignore your last conclusion because it didn't suit their argumentation purpose; “normalize until it hurts, denormalize until it works” sums it up very nicely.
That being said, the above example would be no reason to change the normalized DB schema but instead to change the query (obviosuly). In any case, such a query would be very rarely necessary (someone already mentioned data warehousing, I think).
Konrad on July 15, 2008 06:03 AMAhhhhh! Don't tell people that! Denormalization may be all well and good, when you need the performance and your system is STABLE enough to support it. Doing this in a business environment is a recipe for disaster, ask anyone who has spent weeks digging through thousands of lines of legacy code, making sure to support the new and absolutely required affiliation_4. Then do the whole thing over again 3 months later when some crazy customer has five affiliations.
Sean on July 15, 2008 06:04 AM"Traditional database design principles tell you that well-designed databases are always normalized, but I'm not so sure."
Whose "tradition database design principles" are you referring to? It's been some time since I took any database courses at college, but I do remember that both the book and the instructor told us that going all the way to the 5th level was extreme and not a good idea, because it added complexity to the queries used to get data which slowed it down. We were told that in most cases, the 3rd level of normalization will provide the best results.
The purpose of normalization is reduce size and eliminate the number of duplicate records, but the trade off is that the queries used to reconstruct the data become more complex and thus take more time.
Delmania on July 15, 2008 06:05 AM(disclaimer: I'm talking completely about Microsoft SQL Server 2005 here.)
I'm calling shenanigans on this post, unless you put up a definitive example of where your app is running into problems with a normalized database design.
Like others have said, to get the user data you could use an indexed view. But also as others have said, if you have proper indexes and foreign keys setup, the SQL engine should have no problem "figuring it out" and generating the same query plan as if you had just queried those tables directly in the first place.
I'd also like to say that the example given is a poor one, seeing as you're joining 1-to-1 tables to 1-to-n (phone numbers, affiliations, etc.)
Not to mention that de-normalization leads to it's own problems..
Nicholas on July 15, 2008 06:07 AMPS. I name my fields with ORM in mind, even if I don't use it. Thus, Id, not UserId or god forbid user_id...
Rhywun on July 15, 2008 06:08 AMorange
orange on July 15, 2008 06:10 AMKeep it normalized. How often do you actually need EVERYTHING about a user at once? Probably only when you are editing that user.
As others have said, use proper indexes to speed it up. De-normalized tables will haunt you for years whenever there are issues. As in an earlier post you are being cruel to the maintenance programmer (even if that is yourself).
Practicality on July 15, 2008 06:10 AMall of those with a data access layer step forward... not so fast jeff
structure on July 15, 2008 06:11 AM> I run an e-commerce website with about 400,000 page views a month
That's very low. I've got a site doing twice that traffic every day and has about 30 tables, with the heaviest (and most common query) working on a table of 200k records requiring a sequential scan each time since the sort criteria is based on an average of a field in the resultset that is different for each query which are finally joined against 12 tables.
All running on a 2ghz opteron with 1gb ram, and cpu usage hovers at 40% at peak times.
God bless PostgreSQL.
nick on July 15, 2008 06:13 AMSorry to contradict Greg !
Use a multi-value (a.k.a. Pick) database and you don't have to worry about normalization ever again. They are fast too.
Mike on July 15, 2008 06:14 AM@[ICR]
> Just a nitpick, I'd recommend using the term "Gender" over "Sex"
Do you sex a cat, or do you gender it?
Don't use an incorrect term just because the politically correct crowd want to make the very mentioning of words that might also refer to a natural human activity. Fight the pussification of language!
If you want to know if someone is physically male or female, the question is "sex", not "gender".
<a href="http://en.wikipedia.org/wiki/Gender">http://en.wikipedia.org/wiki/Gender</a>
Simon Wright on July 15, 2008 06:16 AM^^
I don't know what happened with the wikipedia URL, but I didn't write the HTML tags. I just pasted the URL.
http://en.wikipedia.org/wiki/Gender
http://en.wikipedia.org/wiki/Gender
Simon Wright on July 15, 2008 06:17 AMI have had to fix a few huge databases over the years and often the problem is a bad, but normalized, database design.
Al on July 15, 2008 06:20 AM"I just spent a year having normalization hammered into my meager little mind as part of my database module and now you're telling me that it mightn't be necessary..."
This is why this article is wrong, Jeff. This is why you're an idiot, in case the first statement wasn't clear enough. You just gave an excuse to be lazy to someone who doesn't have a clue.
You don't normalize because of ghosts; you normalize because it doesn't make fucking sense otherwise. A tuple is a logical statement; denormalizing means you take the risk of having contradictory or false statements in your database in order to gain performance. (That's why it's recommended for read-only databases, for example: the risk of having incorrect statements is much smaller.) If premature optimization is the root of all evil, denormalization is Satan's ugly mother. It's right there with "the compiler is wrong".
You just need 1 table with 2 fields:
- id (primary key)
- data as Xml
lol !
Patrice on July 15, 2008 06:22 AMI've never been a database guy, but this whole issue seems like the classic problem of optimization to me. There's almost always that trade-off between elegant readable design and optimal execution speed.
> I disagreed and waged a number of arguments that it was our database that needed attention.
> We first needed to tune queries and indexes, ...
> Consultants were called in. They declared the db design to be just
> right - that the problem must have been the application.
This is precisely why you never optimize until *after* you profile (find objectively where the bottlenecks are). One of my pet peeves is folks who "optimize" their code with absolutely no evidence that there's an execution speed problem there, or even anywhere.
Ah, and you should read the comments to Dare Obasanjo's article: there's no six-way join; there's a three-way join at the worst.
Marcel Popescu on July 15, 2008 06:24 AMGreat point. At my previous employer, they ran into scalability issues, and denormalizing their data never entered their mind (as far as I know). Another great way to speed things up is to do more processing outside of the database. Instead of doing 6 joins in the database, have a good caching plan and do some simple joining in your application.
My article explains it better:
http://www.ytechie.com/2008/05/stored-procedure-reporting-scalability.html
I've been working with RDBMS's since about 1986, and currently support a very large complex (normalized) database.
I have to disagree that normalized databases can't perform. Or that you should assume they cannot. This is just plain not true and a bad idea. You normalize, then you test, and then you refactor those places that need to perform faster. Never waste any time trying to design around performance problems that 'may' happen, only those that you can actually measure. Anything else is a waste of everyone's time.
Object-Relational database technology doesn't solve anything, the relationships between objects still have to be built into a persistence structure. All it does is hide the complexity some (which can be a good thing). Just don't assume that because the object hides the complexity that the actual physical storage is simpler as well.
Lastly - No one seems to have mentioned that a decently normalized db greatly reduces application refactoring time. As new requirements come along, you don't have have to keep pulling stuff apart and putting back in new configurations of the db, update synchronization code etc. You can concentrate on the presentation and persistence APIs, and gradually refactor a much smaller set of db changes than you would if you had denormalized everything.
I totally agree with Greg, correctness comes first.
Ed Delaney on July 15, 2008 06:24 AMI think file system performance is often overlooked when examining performance problems in medium/small db's. Denormalization will help when you have an i/o bottleneck - to a point. However, I personally find it faster and more cost effective to upgrade my i/o subsystem than to refactor my db and app for a denormalized schema.
Jeff, be careful when giving advice like this... De-normalizing is something that should only be done by people who ALREADY UNDERSTAND when to use it and how. I fear you're giving it the air of being trendy and thus confusing the less experienced developers out there. :-P
Bruce Johnston on July 15, 2008 06:25 AMI've been working with RDBMS's since about 1986, and currently support a very large complex (normalized) database.
I have to disagree that normalized databases can't perform. Or that you should assume they cannot. This is just plain not true and a bad idea. You normalize, then you test, and then you refactor those places that need to perform faster. Never waste any time trying to design around performance problems that 'may' happen, only those that you can actually measure. Anything else is a waste of everyone's time.
Object-Relational database technology doesn't solve anything, the relationships between objects still have to be built into a persistence structure. All it does is hide the complexity some (which can be a good thing). Just don't assume that because the object hides the complexity that the actual physical storage is simpler as well.
Lastly - No one seems to have mentioned that a decently normalized db greatly reduces application refactoring time. As new requirements come along, you don't have have to keep pulling stuff apart and putting back in new configurations of the db, update synchronization code etc. You can concentrate on the presentation and persistence APIs, and gradually refactor a much smaller set of db changes than you would if you had denormalized everything.
I totally agree with Greg, correctness comes first.
Ed Delaney on July 15, 2008 06:27 AMTo everyone bashing Jeff on his database design skills - he didn't come up with that design. He clearly pointed out that the design was from Dare Obasanjo's blog post on this topic.
Allan N on July 15, 2008 06:31 AMSome things are too slow in a normalized database. Facts are Facts. You just rout around this:
* Keep a de-normalized replica for expensive operations (e.g. reports)
* Cache Results for repeat queries (Memcached)
* Partition the database for scalability (vertical or horizontal)
> This is why this article is wrong, Jeff. This is why you're an idiot...
In your opinion.
In my work building and maintaining content and asset management systems, I've seen far more problems arise from over-normalisation than under-normalisation. Far too often I've seen database designs that fall over because they're so intricately normalised. They don't fail technically, but they become a nightmare to write code against.
On the other hand, some of our most long-lasting table structures have been criminally flat. They have their limitations. They suffer some redundancy. But they're rarely misunderstood, and always a joy to attach to your new systems, because their broad usage has spawned many valuable tools and code that work with them -- and that's far more valuable than having an unlimited number of phone numbers for each contact.
Simon Wright on July 15, 2008 06:34 AM@Julio: Hibernate, at least the Java version, actually has a concept called join fetching where it will join a related collection instead of doing a second select. So, just for the record, an ORM wouldn't necessarily use all separate selects.
a on July 15, 2008 06:34 AMSpeaking from long experience, if you don't normalize, you will have duplicates. If you don't have data constraints, you will have invalid data. If you don't have database relational integrity, you will have orphan "child" records, etc. Everybody says "we rely on the application to maintain that", and it never, never does.
The question is, what's the cost of the data integrity problems? If it's something you can live with for performance, fine. If it's a disaster waiting to happen, you better build your database design as tight as possible.
A. Lloyd Flanagan on July 15, 2008 06:34 AMSeveral people already hit on what I was going to say, but here goes anyway...
What is the purpose of the database? Is it to process transactions or report findings? I always take the route that I design the data model using the best practices approach and considering my requirements.
Now, I double check requirements to make sure the database will behave correctly. If 99% of what I am doing will be reporting off the tables, I will either create Materialized Views (Oracle) or Materialized Query Tables (DB2) that basically denormalized the data for quick reporting. This can be a problem if you need up to the minute dynamic reports since the MV/MQT's need to be refreshed periodically. Currently, we are on a 2 week refresh, so this approach works well for us. The MV/MQT's carry some resources, but memory and storage is cheap!
Now, if it is a quick app for logging or something, I will just create a flat/wide table and put everything in it. After all, this log is for me and my team... no reason to confuse the issue with 5 tables to hold each piece of the log information... Plus, this data gets cleaned out every so often and I am not concerned about instant access. Just index the table by date, and order by in the sql...
If I am going to be constantly hitting the data with updates/inserts/deletes (like in order processing or something) the normalized structure works very well... and I don't carry the overhead of all the duplicate data associated with MV/MQT...
So, I don't think you can make any blanket statements on normal vs. non-normal form. Like everything else in programming, it all depends on requirements and intended goals.
my $0.02
Wayne on July 15, 2008 06:39 AMSergej, great comment.
Jeff, I remember from one of the early podcasts when you were describing the box you were deploying on that it was just a single machine, running the database and the web server, is that correct? That immediately struck me as the first thing that will have to change when you go live. It is astounding the performance difference you get by putting the database on its own box and clicking the little box that basically gives the machine to Sql Server.
There is a lot of idealism in the comments above. Lots of alternative databases, OODBMS's, LINQ. Software developers in general have a very poor understanding of how databases work, how to tune them, how to design them. I wonder how many billions of dollars have been spent on servers in the last ten years that could have been saved by a simple index.
The design of any database that expects millions of rows in any table should be handled by an experienced DBA/Programmer. All access should be controlled via stored procedures. Never, ever, ever let a software tool or a (non-dba) software developer write your queries for you. db_reader and db_writer should *not* be privileges that your applications enjoy.
If you look at a query that has 30 joins and immediately think it is horrible, please analyze and tune the query first. Only the joins that require a big scan matter. Look at the plan and if you see 29 joins at 0% and 1 join at 100% of execution time, then only one of the joins is bad. Go ahead and add ten more if you need to. De-normalization is for reporting, not for OLTP.
Here's an excellent book for really digging into how Sql Server processes queries: http://www.amazon.com/Inside-Microsoft-SQL-Server-2005/dp/0735623139 (T-Sql Querying by Ben-Gan).
Jeff, one of your comments struck me as worth a little more thought: "Disks and memory are cheap and getting cheaper every nanosecond". Not everyone can afford a SAN, and if you're running everything on one box, it's not about how cheap the disks are, it's about how many you can cram into the machine. You want the fastest disks you can get, which means you are limited in size, and you want RAID, which eats up more space. In a 1U machine, you get maybe 2x136Gb 15k drives, double that in a 2U. And even that leaves everything running on the same set of spindles. So you move up to NAS or DAS, but you are still limited, because if you just fill it up with slow 1Tb drives the IO will kill you. Data size matters.
Eric on July 15, 2008 06:39 AMYour six-way join is only needed because you used surrogate instead of natural keys. When you use natural keys, you will discover you need much fewer joins because often data you need is already present as foreign keys.
Leandro on July 15, 2008 06:40 AM@T.E.D:
"I've never been a database guy, but this whole issue seems like the classic problem of optimization to me. There's almost always that trade-off between elegant readable design and optimal execution speed."
Absolutely, although you've got another element in the trade-off: data correctness. A database design that allows inconsistent data will have inconsistent data.
I have to agree with other posters: you can denormalize if you have to, and you know what you're doing. If you don't you may have to assume that the cost of incorrect data trumps other considerations.
A. Lloyd Flanagan on July 15, 2008 06:40 AMThe solution it 2 tier storage. Websites tend to be read many, write few. The first tier is normalized database, the second tier is de-normalized data specifically constructed for use by the application. In an app that I am building, the 2nd tier is a combination of memcached and plain old disk using binary serialization.
The expense now is that you have two writes, one to the real db and one to the de-normalized tier.
So now you've separated the read requirements/performance from the write and relational data storage.
Damian on July 15, 2008 06:41 AM@Leandro:
"Your six-way join is only needed because you used surrogate instead of natural keys. When you use natural keys, you will discover you need much fewer joins because often data you need is already present as foreign keys."
The whole question of surrogate vs. natural keys is of course a topic of enormous debate. Generally, I'm with you: natural keys are easier, safer, and often don't hurt performance the way people assume they will.
A. Lloyd Flanagan on July 15, 2008 06:44 AMI think the general advantage of normalization is the same as the DRY principle in coding. When you denormalize stuff you can wind up maintaining copies of data and doing the same thing twice.
Another advantage is it is easier to go from normalized to denormalized when you need it than the other way round.
CT2 on July 15, 2008 06:44 AMJust as well Dare is a social networker and not a database designer!
Look at some of the arbitrary limits in his design:
* A fixed number of telephone numbers per person
* A fixed number of job histories per person
* A fixed number of IM accounts per person
Now, what happens when the big DOH! hand hits the forehead, and our short-sighted denormaliser suddenly realises that out there, in the REAL world, the numbers of things in sets change on a regular and frequent basis ... that's right, we have to change the schema!
It gets worse. The proponents of denormalised schemas are the most likely to write "select * ..." in their queries. That means reviewing and changing all our code as well.
A properly designed database schema should survive most business changes. Denormalised (abnormalised?) schemas don't.
And as others have pointed out in this thread, the avoidance of surrogate keys can do wonders for your performance.
rond on July 15, 2008 06:46 AMIn my experience, beginning developers get into a lot of trouble because they're afraid of joins. The common symptoms are apps that are:
(i) excessively complex,
(ii) never quite work right, and
(iii) are expensive to change
It definitely is true that we're moving into an era of "web scale" systems that are much greater in size and power than traditional "enterprise" systems. Many people are pushing the performance limits of the traditional RDBMS, and we're seeing a move towards distributed main-memory databases and other new technologies.
Workaday business apps benefit greatly from normalization, so I think this post has a dangerous message for many developers. In the case above, I think it still makes sense to use a normalized database to keep the authoritative copy of the data, but it makes sense to cache the data in denormalized forms. Rather than creating a table with the notorious phone_number_1, phone_number_2, ... phone_number_33 fields that I'm always cleaning up in other people's applications, you can just serialize the data with the serialization system that comes with your language or JSON or something like that.
Social sites generally have many more reads than writes, so this can improve performance 100 fold or more, but still keep the advantages of a relational back end. If the cache gets messed up, you can always do a partial or complete rebuild based on data you trust.
Paul Houle on July 15, 2008 06:51 AMI've had programmers tell me that they don't care about the size of data or normalization. Then they cry when they have 200,000 fat rows and performance crawls.
David B on July 15, 2008 06:53 AMFor those advocating the use of Views: the database does not store the view data. The data is recreated, using the view's SELECT statement, every time a user queries a view.
Furthermore, see this article on View performance: http://sql-server-performance.com/tips/views_general_p1.aspx
Steven Bey on July 15, 2008 06:54 AMWhat I think is funny is the number of people who think that because they use LINQ or Hibernate they aren't affected by these issues... In the end you are storing your data in a relational database, right? Unless you are plugging LINQ or Hibernate into something other than Oracle, SQL Server, MySQL or Postgres (or any other RDBMS) this is still an issue for you...
Sure, *you* might not be coding the six inner joins, but your OO mapper sure is.
Scary.
Sam on July 15, 2008 06:54 AM@Dare
"In addition, who says you wouldn't want all the user data including phone numbers at once? Have you ever seen a profile page on FB? That's exactly what it does."
I think what Ariel is saying is that if you performed the 6-join query your return data would be duplicated. For example if a user had two phone numbers and three screen names the return data would look like:
user_id phone_number screen_name
1 111-1111 name1
1 111-1111 name2
1 111-1111 name3
1 222-2222 name1
1 222-2222 name2
1 222-2222 name3
Using this format of the data it would take some monkeying around to create the display text:
User: 1
Phone Numbers: 111-1111, 222-2222
Screen Names: name1, name2, name3
It would be easier to select by phone number and then by screen name instead of trying to get all of the data at once.
Eric on July 15, 2008 06:55 AM@[ICR]: Kudos to you.
Vance Vagell on July 15, 2008 06:55 AM@[ICR]: Kudos to you.
Vance Vagell on July 15, 2008 06:55 AMPlease think! Is this post even about normalization? The 2 database designs you used to illustrate your point are actually solutions to 2 different problems. They are not the same database and they cannot represent the same data. You can argue that the second one is faster but only if it can actually be used to represent the facts that are meaningful to your problem otherwise this is just hot air. This kind of lack of thinking is what usually makes life with databases a hellish proposition because almost invariably someone will ask for data that is not stored because it was easier to say "I need performance so I'll just ignore the requirements (or not think them thru) and de-normalize!"
Terrier on July 15, 2008 07:09 AMJeff,
You make a good point but there really is a upper limit on how far things can be stretched with denormalization. When stored procedures start to have anywhere from 100 to 200 arguments and sql server 2000 warns you that your row size exceeds its 8063 byte limit then maybe...just maybe its time to normalize the tables just a bit.
How do views speed up performance?
1) They prevent bad joins like the one demonstrated here.
2) You can use a materialized view which is the data denormalized for performance, but backed by normalized data for integrity.
The question is what is important: Your data or your performance?
If your limiting where clause is only one one table, you should not see a bad hit, because you will be hitting only the few select rows of each other table. You will take a hit for disk seek if you do a poor job of laying out your physical storage, but that is a different issue.
If you care about data integrity, you need to be normalized, and you need things like foreign keys and constraints. Normalizing data can hurt performance, but so few applications are of the size to cause this issue that usually data integrity is the bigger concern.
On a P-200 with 128M of RAM and a 10G database (point of sale with automated reordering and the lot) fully normalized, the 100MB network card was the bottleneck. If you need more than that, you have the money to buy hardware that can support it.
Grant Johnson on July 15, 2008 07:23 AM> But this design also requires a whopping six
> joins to retrieve a single user's information.
Yes, if you need all of it or if you need all of it in a single table.
Very often neither is the case. Very often you don't need all of the information at once and sometimes you first need some of it and later on some more, so you don't need them in a single table and as joined query, but as separate queries istead.
i hope the people at twitter listen to this. :P
they got people to care, now it's time to move fwd :P
Jeff,
You miss the point of normalization entirely. Normalization is about optimizing large numbers of small CrUD operations, and retrieving small sets of data in order to support those crud operations. Think about people modifying their profiles, recording cash registers operations, recording bank deposits and withdrawals. Denormalization is about optimizing retrieval of large sets of data. Choosing an efficient database design is about understanding which of those operations is more important. A system that will primarily be serving content may be a good choice for denormalization. That doesn't mean that normalization doesn't scale, it just doesn't scale for one particular problem.
RevMike on July 15, 2008 07:36 AMMecki said:
>Yes, if you need all of it or if you need all of it in a single table.
>Very often neither is the case.
This is spot on.
Let's say you wanted to run a query to find all the people that worked for a specific company:
select first_name,last_name from User u,UserWorkHistory uwh where
u.user_id == uwh.user_id AND
uwh.company_name = 'Coding Horror';
<b><i>(BTW I haven't written SQL in ages so this may be grossly incorrect; I think you get what I'm trying to say, though)</b></i>
Granted, I could have just as easily done:
select first_name,last_name from User where
User.work_history1 = 'Coding Horror' OR
User.work_history2 = 'Coding Horror' OR
User.work_history3 = 'Coding Horror';
But which makes more sense?
I don't think there's any harm in starting off normalized and scaling at some point later. Premature optimization is evil, and for all you know stackoverflow may scale just fine under the circumstances.
Michael Acobas on July 15, 2008 07:41 AMI recently created a denormalized table with over two hundred columns because it was just needed to generate a PDF from a long online application form. There was no need to query the data for any other reports. The table is only meant to serve one purpose. However, those 200 plus columns did exceed the limit for the number of columns a SSIS export will support so now I need to exclude that table from my back ups. There is also a limit to the number of tables SSIS will export so you could run into problems with a fully normalized design as well.
Robert S. Robbins on July 15, 2008 07:43 AMDenormalized data sounds great until you try to extend your data model, at which point you will wish to high heaven for a time machine. The purpose of a relational database is NOT to give you fast performance. If you want that, use flat files and do your own heavy lifting. The purpose of a relational database is to help maintain data integrity. If your data doesn't need integrity all that bad, you don't need or want a relational database. If your data does need integrity, you can't afford to make the mistake of modeling it poorly, and that means semi-obsessive normalization. Those 6 joins are a feature, not a bug.
Noah Yetter on July 15, 2008 07:49 AMThree quick observations:
First: You only need the six joins if you want every single attribute of information stored about the user. My guess is that about 99% of the time, you REALLY only need maybe two tables' worth.
Second: normalization is optimizing for data entry and maintenance. If your application is presenting the data primarily rather than CRUD, of course you need a denormalized view. But I'd push from a normalized db to an optimized copy, or maybe even a cube.
Third: MEMCACHED, MEMCACHED, MEMCACHED.
Chris B. Behrens on July 15, 2008 07:51 AMOver the years I've developed a theory that the one area where developers reveal a consistent lack of knowledge and understanding is in DB/query design. This article and the subsequent comments reinforce this theory (and to be candid, leave me a bit apoplectic).
1) Views: terrible idea unless it's a SQL 00/05 indexed view. Even then, you probably need to revisit your schema and queries before resorting to this measure. If you're on MySQL, forget that views exist and used stored procedures.
2) Surrogate keys: that's another rotten idea, especially if you're concerned with performance. Also a horrible idea if you ever need to migrate the database.
3) The query in question: dude, Dare, really? And your defense of this query is that you may want to show all the numbers a given user has, for example, Facebook? Respectfully, please, both you and Jeff need to learn more about databases or hire someone that does, or you are headed for pain and misery as you build Stack Overflow.
For example: say my User.User_Id is 666 (I also am the son of Satan, but that's a different subject). There are two entries in the phone number table that have the User_Id foreign key of 666: one is my office number, 402.261.0498. Then you have my mobile, 402.525.5376. So you run your query, using my User_Id in the WHERE and you get back two rows, one for each phone number, with all of my data from the User table repeated for each. What are you going to do with this result, say on my profile page? Are you only going to use the data from the first row, except when dealing with phone number data, in which case you'll loop through each row of the result and render each phone number?
You're going to have to do as Ariel suggests: query the distinct user data, then query the phone numbers and other information. Multiple queries will hurt performance much less than the multi-join monstrosity above that will return indistinct and useless data.
Chris Rikli on July 15, 2008 07:52 AMCache the generated view pages first. Then cache the data.
http://www.pragprog.com/titles/memcd/using-memcached
You have to think about your content- very infrequently will anyone be updating it, it's all inserts. So you don't have to worry about normalization too much.
matt m on July 15, 2008 07:52 AMI wonder if one factor at play here is that it's very easy to write queries for de-normalized data, but it's relatively hard to write a query that scales well to a large data set.
I've seen numerous queries written by "professionals" that work well for small tables (As jeff says, everything's fast for small values of 'n'), but completely fail when scaled up to larger tables.
In my opinion, writing good SQL queries is dark voodo - not least since results can vary hugely depending on the underlying DBMS.
Thomi on July 15, 2008 07:59 AMI think that you should think about how it would be easier to implement the user interface, in the normalized, or in the denormalized form.
In the blog post (article), the user interface can be implemented much easier in case of the denormalized form, so I would prefer that one. Of course, if you can afford to make the application perfect (3 times more time), then the normalized form would be better.
"The biggest performance killer is so called physical read."
Thats a bit scary... I'd always assumed that reads from hard drives were slow... but not so slow as to create the "waiting for several seconds" issue when looking up a record...
I suppose in these big databases with loads of tables all "joined" together the sheer number of reads creates the problem.
You learn something new everyday, huh?
Jheriko on July 15, 2008 08:04 AMJeff,
I know you probably would rather not change the site too much. But I would love to be able to rate responses based on merit. Then we could hide the long list of comments by people who didn't read the post or who troll around looking to bolster their web traffic with insults.
I agree wholeheartedly, by the way, to this idea. I unfortunately was taught by someone who knew what normalization was supposed to mean but no more, and one math prof. who was thrust into a Software role because of his 1 year of experience doing consulting in 1994. He instructs all students that database must be 43NF and all logic outside of storage must happen in the application layer. No stored procedures, no denormalization, and views are sacrilegious.
Thanks again,
Raymond
You used "OR" ? No index for you. Next!
David B on July 15, 2008 08:21 AMWell, after reading all these comments it appears that most programmers still fail miserably at reading comprehension (or just didn't read the article completely). How long until StackOverflow.com is ready, Jeff?
Jason on July 15, 2008 08:38 AMWhat about an approach where the data is stored in and WRITTEN into normalized tables but have a DEnormalized table for READING from? You can have triggers update the READ table when changes are made to the normalized tables, or batch process it nightly or mroe frequently, checking DB activity first to make sure its low.
Mike on July 15, 2008 08:39 AMno matter what, the database is the slowest part of your application. period. fin. done. caput. The way to get near instant execution is to have the objects in memory and then just use the database for storage (ie read = from memory, write = from memory + write to DB). Then if your app restarts, it'll only be slow while pulling things from disk, but after that super fast.
Darren Kopp on July 15, 2008 08:41 AMThe obvious tradeoff in denormalizing your data is that your data structure has lost meaning and flexibility. I've worked on projects which required normalizing unstructured or poorly structured data, and it's often difficult or impossible to move in that direction. I'm surprised you're at that point already - I'd normally think you'd look at optimization once you've had a site running for a while and analyzed actual usage data. Regardless of the direction you're moving, materialized views are very helpful - a materialized denormalized view can give you the performance benefits of denormalization without messing up your data, and a materialized normalized view over unstructured data can reduce some of the pain in dealing with User_Phone_Business_Prefix. Have you looked at the sparse column feature in SQL Server 2008? That's supposed to reduce storage size (and thus increase performance through increased page efficiency) when working with wide columns in which many columns may contain null values.
Jon Galloway on July 15, 2008 08:42 AMIt has been said well by others already, but I feel strongly enough on this issue to add in my two bits:
Denormalization is the last possible step to take, yet the first to be suggested by fools.
And, yes, I just called both Atwood and Obasanjo out as fools.
It is dangerous to make that suggestion so willy-nilly, as people take posts by prominent authors as something they can rely on. Even hinting that denormalization should be anything but a last step means that these guys simply cannot be relied upon as authorities in this space.
So, for those still feeling unsure after reading through all of the above:
Views are not about perf/scale, they are about query clarity and/or abstraction of true data source & structure.
Materialized/indexed views can help perf/scale, but have limitations in terms of the underlying data structure.
Optimization must always be based on freshly-collected metrics, should go roughly in this order (with the order slightly adjusted based on your particular applications and databases and the issues you have previously experienced), and should be done one at a time with fresh metrics collected between each:
- remove indexes no longer deemed necessary
- add indexes newly deemed necessary
- simplify views where possible
- optimize queries, especially to reduce scans
- reduce application mis-use of the database (unnecessary and repeated queries are a common culprit)
- add/change application-tier caching of database information
- materialize/index views where appropriate
- optimize stored procedures
- create generated search/reporting tables
- investigate fundamental issues with the design of the database schema (data types, relations, etc.) but not from a "hey, let's denormalize" perspective.
You'll notice that denormalizing your fundamental schema isn't even on the list. It is too tempting and dangerous to put it there. True, it may be a last resort if all of the above fail, but if all of the above fail you are almost surely having the kind of problem that everyone would love to have: your application is going gang-busters, and you can afford to throw more hardware at your current model (start with increasing spindle count first by getting transaction logs onto their own spindle(s) and then by getting indexes onto their own spindle(s)) and if that still isn't enough then you are are into Google/Amazon/etc. territory and can get some doctorate types to start designing their own persistence model for you. They'll probably cook up something decidedly non-relational, and with much denormalization. ;) But if so, then why not just up and denormalize? Simple: The other possibilities need to be exhausted first, through repeated execution of the above list of steps, until you can be certain that _violating the correctness guarantees of your relational database through denormalization is worth it_.
Interesting blog post on denormalization patterns:
http://database-programmer.blogspot.com/2008/04/denormalization-patterns.html
Tim Hollingsworth on July 15, 2008 08:48 AMOne more benefit of normalization - variable number of fields. For example, if you denormalize the phone number field, how many phone numbers do you pick? 5? 6? Picking too few is very limiting and picking too many is a waste. In 1990, who would have guessed that a person could have 10 phone numbers associated with them? Home Phone 1, Home Phone 2, Cell Phone, Fax, Cottage, Winter Home, etc. Normalization solves this problem.
Mike M on July 15, 2008 08:57 AMGod bless ya, Jeff. As with everything in software development, a bit of critical thinking about why something is done and when tends to go a long way...
Kevin Fairchild on July 15, 2008 08:58 AMThere is a simple alternative to denormalisation here -- to ensure that the values for a particular user_id are physically clustered. In Oracle you would create a hash cluster and then create the five tables that include user_id in that cluster. Performance for that join wuld be excellent. I expect that other RDBMS's have similar structures.
The morals of the tale:
i) know your RDBMS
ii) database-agnostic design sucks.
[sigh]
The whole issue is pretty simple 99% of the time - normalized databases are write optimized by nature. Since writing is slower then reading and most database are for CRUD then normalizing makes sense. Unless you are doing *a lot* more reading then writing. Then if all else fails (indexes, etc.) then create de-normalized tables (in addition to the normalized ones).
I've never considered myself a DB(A) guy ... so reading some of this is interesting.
Most of the databases I've built myself haven't had the volume of traffic that really mattered much -- so I've always done what I thought was the 'intuitive' normalization and in cases where it was faster or easier to put some of it in a denormalized fashion, did so.
Good read.
N on July 15, 2008 09:15 AM> and table names are singular because each row in the table designates one specific person; one row = one user
This is another religious war. Should table names be singular or plural? I was in a company that mandated that tables should be singular, one to match the above quote, and second, if nothing else, so that you have 1 less letter to type for every table name.
However, I am of the opinion that tables, since they normally hold more than one item, are plural, and the object that holds a single row is singular. Collection objects that hold multiple rows should be plural.
Bob on July 15, 2008 09:31 AMSean Bamforth is on the money here.
You need a normalized logical schema. This is what the DBMS should provide to the application. The DBMS should denormalize as necessary to optimize performance WITHOUT exposing those denormalizations to the application.
The problem is that DBMSes don't do this. Sure, they give you indices and caching and materialized views, but they don't go all the way.
Robert Fisher on July 15, 2008 09:48 AMYour denormalized schema has lost all of the one to many relationships that caused the original need for normilization in the first place.
Normalize first, it protects data integrity and makes maintenance a breeze. If you hit the performance wall, then denormalize.
However, your example is still bad, because the denormalized table is nothing like the normalized schema.
@Jeff - You're prematurely optimizing.
@Leandro - Natural keys! I agree, and bless you for mentioning it.
In 25 years of software development as contractor and employee in all kinds of companies and industries, every performance problem I ever encountered with a transactional normalized data base was solved by adding indexes or fixing poorly written SQL statements in the application. In a few cases the data base was poorly designed, but normalization wasn't the culprit.
Don't fear normalization. Embrace it.
Charles on July 15, 2008 09:52 AMWith any normalization there are some trade-offs.
Assuming that each user has one screen name might be an acceptable restriction. Same for affiliation.
With a separate phone table a user can have as many phone numbers of as many types as I want. Not so after normalization. This might be acceptable, or not.
So, it the perception is that "normalization" changes representation without changing functionality, that's incorrect. The functionality changes might be acceptable, or not, but that should be a conscious decision.
Wow...