One of the items we're struggling with now on Stack Overflow is how to maintain near-instantaneous performance levels in a relational database as the amount of data increases. More specifically, how to scale our tagging system. Traditional database design principles tell you that well-designed databases are always normalized, but I'm not so sure.
Dare Obasanjo had an excellent post When Not to Normalize your SQL Database wherein he helpfully provides a sample database schema for a generic social networking site. Here's what it would look like if we designed it in the accepted normalized fashion:
Normalization certainly delivers in terms of limiting duplication. Every entity is represented once, and only once -- so there's almost no risk of inconsistencies in the data. But this design also requires a whopping six joins to retrieve a single user's information.
select * from Users u inner join UserPhoneNumbers upn on u.user_id = upn.user_id inner join UserScreenNames usn on u.user_id = usn.user_id inner join UserAffiliations ua on u.user_id = ua.user_id inner join Affiliations a on a.affiliation_id = ua.affiliation_id inner join UserWorkHistory uwh on u.user_id = uwh.user_id inner join Affiliations wa on uwh.affiliation_id = wa.affiliation_id
(Update: this isn't intended as a real query; it's only here to visually illustrate the fact that you need six joins -- or six individual queries, if that's your cup of tea -- to get all the information back about the user.)
Those six joins aren't doing anything to help your system's performance, either. Full-blown normalization isn't merely difficult to understand and hard to work with -- it can also be quite slow.
As Dare points out, the obvious solution is to denormalize -- to collapse a lot of the data into a single Users table.
This works -- queries are now blindingly simple (select * from users), and probably blindingly fast, as well. But you'll have a bunch of gaping blank holes in your data, along with a slew of awkwardly named field arrays. And all those pesky data integrity problems the database used to enforce for you? Those are all your job now. Congratulations on your demotion!
Both solutions have their pros and cons. So let me put the question to you: which is better -- a normalized database, or a denormalized database?
Trick question! The answer is that it doesn't matter! Until you have millions and millions of rows of data, that is. Everything is fast for small n. Even a modest PC by today's standards -- let's say a dual-core box with 4 gigabytes of memory -- will give you near-identical performance in either case for anything but the very largest of databases. Assuming your team can write reasonably well-tuned queries, of course.
There's no shortage of fascinating database war stories from companies that made it big. I do worry that these war stories carry an implied tone of "I lost 200 pounds and so could you!"; please assume the tiny-asterisk disclaimer results may not be typical is in full effect while reading them. Here's a series that Tim O'Reilly compiled:
There's also the High Scalability blog, which has its own set of database war stories:
First, a reality check. It's partially an act of hubris to imagine your app as the next Flickr, YouTube, or Twitter. As Ted Dziuba so aptly said, scalability is not your problem, getting people to give a shit is. So when it comes to database design, do measure performance, but try to err heavily on the side of sane, simple design. Pick whatever database schema you feel is easiest to understand and work with on a daily basis. It doesn't have to be all or nothing as I've pictured above; you can partially denormalize where it makes sense to do so, and stay fully normalized in other areas where it doesn't.
Despite copious evidence that normalization rarely scales, I find that many software engineers will zealously hold on to total database normalization on principle alone, long after it has ceased to make sense.
When growing Cofax at Knight Ridder, we hit a nasty bump in the road after adding our 17th newspaper to the system. Performance wasn't what it used to be and there were times when services were unresponsive.A project was started to resolve the issue, to look for 'the smoking gun'. The thought being that the database, being as well designed as it was, could not be of issue, even with our classic symptom being rapidly growing numbers of db connections right before a crash. So we concentrated on optimizing the application stack.
I disagreed and waged a number of arguments that it was our database that needed attention. We first needed to tune queries and indexes, and be willing to, if required, pre-calculate data upon writes and avoid joins by developing a set of denormalized tables. It was a hard pill for me to swallow since I was the original database designer. Turned out it was harder for everyone else! Consultants were called in. They declared the db design to be just right - that the problem must have been the application.
After two months of the team pushing numerous releases thought to resolve the issue, to no avail, we came back to my original arguments.
Pat Helland notes that people normalize because their professors told them to. I'm a bit more pragmatic; I think you should normalize when the data tells you to:
Never, never should you normalize a database out of some vague sense of duty to the ghosts of Boyce-Codd. Normalization is not magical fairy dust you sprinkle over your database to cure all ills; it often creates as many problems as it solves. Fear not the specter of denormalization. Duplicated data and synchronization problems are often overstated and relatively easy to work around with cron jobs. Disks and memory are cheap and getting cheaper every nanosecond. Measure performance on your system and decide for yourself what works, free of predispositions and bias.
As the old adage goes, normalize until it hurts, denormalize until it works.
I just spent a year having normalization hammered into my meager little mind as part of my database module and now you're telling me that it mightn't be necessary...........
love your blog Jeff but i hate it when you do this to me. :(
Reader 101 on July 15, 2008 2:18 AMHow about using materialized query tables or materialized views?
Andrew Stuart on July 15, 2008 2:21 AMMy approach is to design it all normalized at first, as it makes the design cleaner and usually easier to grok, and then denormalize where absolutely needed. As long as the system is performing well, I would prefer to keep my data normalized and then create denormalized views so I can get those huge joins out of my way.
When it comes to teaching, normalized tables is the way to go. People must master the rules before they break them (and at that point nobody will have to teach them that denormalization is OK, they will have figured that out already)
SomeGuy on July 15, 2008 2:25 AMQuite a horrible job naming the tables. Why the underscores in field names and not in table names. Why user_id and not user_sex? Why table nimes in singular? Of course this is beside the point, but lol.. :D
nub on July 15, 2008 2:33 AMNormalization is about design, denormalization is about optimization. They should be applied as such, i.e. anything should be normalized until denormalization is needed. As Knuth (or was it Hoare?) said: "Premature optimization is the root of all evil".
And from the short glimpse I did at the article about performance of tag systems it seems they tested only on MySQL. Not exactly a database one connects with good performance on complex queries -- it's extremely fast on flat data, but to my (not utterly up-to-date) knowledge they never managed to catch up with Postgres or the commercial RDBMS' on anything reasonably complex. Not sure if that applies to single joins already, but I wouldn't be too surprised.
Peter Becker on July 15, 2008 2:36 AMNormalization might indeed not be the fastest design, but can usually made fast enough without denormalization, as Dare Obasanjo states in his original article, too:
"Database denormalization is the kind of performance optimization that should be carried out as a last resort after trying things like creating database indexes, using SQL views and implementing application specific in-memory caching."
Jens Gring on July 15, 2008 2:43 AMIn my __LONG__ experience of both DB and application design, it is a bad (or rushed) application design that usually forces denormalization on a well designed database.
Several small queries over normalized data also usually outstrip one big one over denormalized data.
But, denormalization still has a home in large data warehouses.
Hope smart tools from DatabaseGear can help.
Lex Y. Li on July 15, 2008 2:48 AMLets unroll our loops next.
Christof on July 15, 2008 2:51 AMJust a nitpick, I'd recommend using the term "Gender" over "Sex". Someone might ally themselves with a different gender to their sex and that's the one they're going to want to tell you.
(Obviously it's sample data, but just for the future :)
Id did annoy me slightly always having normalization drummed into me without mention that it's not always best, but I guess SomeGuy is right.
[ICR] on July 15, 2008 2:54 AMAnother thing you could try in the case of StackOverFlow is to reduce the data required from the user.
Some questions you can ask:
1) Do our web application really need those phone numbers? Are we going to call our users?
2) Is address needed?
3) Does gender matter?
4) If an email id and password (and a display name if necessary)is what really matters, bother with all those unnecessary data?
You can add the bells and whistles later. First, Ship.
Niyaz PK on July 15, 2008 2:56 AMJeff,
if you want to avoid the usage of big and complicated SQL-Statements, why not implementing Views instead of denormalization. A good RDBMS should offer this mechanism, especially for performance and usage optimizations.
You could have a View called "users", could still say "SELECT * FROM users", but what is exactly within "users" can be changed at any time, without changing the interface to the application; The View "users" in this example.
Frank on July 15, 2008 2:59 AMOne thing you're not taking into account is a system like memcached that avoids hitting the database at all. When using such a system, it might be better to actually be storing normalized data so that your hits on the database for cache misses are small and involve as little table locking as possible.
-Max
Max Kanat-Alexander on July 15, 2008 3:00 AMBefore normalizing denormalizing think about what you need to to with the database and where you are going to do it. One example: We have a quite big database with test results. Most of the work is gathering data for reports. Normalizing <> Denormalizing is an issue and you only know what is faster if you know how you access it.
offler on July 15, 2008 3:01 AMIt's always amazed me how we spend hours normalizing databases just to spend more hours creating non-normalized views to run reports on.
I've always tried to normalize as much as was sensible rather than slavishly following the supposed good practice of BC Normal Form.
Excellent thought-provoking article.
Ali M on July 15, 2008 3:01 AMby the way: why do you want to create a list of users and adresses? Wanna sell?
offler on July 15, 2008 3:02 AMI'm not sure I agree with the total denormalisation in the example, Jeff. I'm (relatively) recently out of University, so I've still got the normalisation-is-everything rules in my mind :-) Being pragmatic, though, and as a couple of other people have noted, denormalisation is an optimisation to a fully-normalised design. I'd argue that denormalisation should be done in stages, in just the same way that normalisation is.
For example, I don't think I've ever taken a production database past 3NF. Certainly the only time I've take a design to 5NF is for my University database assignment. And aren't stored procedures supposed to improve performance for this sort of queries. Yes, you're always going to take a performance hit for a query utilising six joins (there are some crazily-sized cartesian products going on in there), but wouldn't a multi-layered approach aid this? I.e., for common or computation-intensive operations (such as viewing a user's profile in the example above), using multiple stored procedures to extract sections of data and then another stored procedure to put them together. It could also be accomplished using the user_id primary key to extract the relevant record from each table and then combining the records at the application level.
Having said all that, I haven't done any database programming in a good couple of years now, so I could be blowing a lot of hot air :-)
Alastair Smith on July 15, 2008 3:02 AMDatabases are one of those things that you just can't take a cavalier approach to. If you screw that design up, and worse still, launch it, you're in for a world of hurt when the problems with the data come.
Normalisation means the only problem you're going to get is scalability: your data is going to be good (presuming you were also clever/paranoid enough to get all the integrity checks in there too).
As the quote says, you can always denormalise later for speed.
Lewisham on July 15, 2008 3:02 AMIf you create indexed views over the joins, then the optimizer should use those anyway and there won't be a perfomance hit from being denormalized.
Harry M on July 15, 2008 3:04 AMAddendum: And as others have said, there are much better solutions than denormalising your database. Having Materialized Views for reads, or memcached.
Lewisham on July 15, 2008 3:07 AM"Quite a horrible job naming the tables. Why the underscores in field names and not in table names. Why user_id and not user_sex? Why table nimes in singular? Of course this is beside the point, but lol.. :D
nub on July 15, 2008 02:33 AM "
because when you're selecting multiple id fields you want to be able to distinguish between them, example: user_id and affiliation_id
instead of
select user.id user_id, affiliation.id affiliation_id from ...
you can just have
select user_id, affiliation_id from ...
taht way you don't always have to prefix the fields with the table name and you also you don't have to alias the fields everytime;
and table names are singular because each row in the table designates one specific person; one row = one user
james on July 15, 2008 3:10 AMThat 6 join query is wrong, and the rest of your article follows from that incorrect join.
Why would you ever want a query that returned the user data multiple times, once for every combination of screen_name and phone_number?!? No real site would ever want to do that.
If you need phone number, you get phone numbers, if you need screen names, you get those - but why would you need every possible combination of phone number and screen_name?
You're simply wrong. Speeding up joins is never a reason to denormalize for the simple reason that it doesn't! Test it - you'll see I'm right. But, do me a favor and write real joins - not ones that give you tremendous numbers of duplicated rows.
And if you don't even understand why your query is wrong, you have no business designing databases.
A valid reason to denormalize is to precalculate data, which you touch on very briefly. But you always write that twice: once normalized, and that's the primary data, and then again, the "cached/precalculated" version. You should always be able to regenerate that from the normalized data.
OK, one final reason to denormalize which you didn't even write: if you need to do a where clause from one table, but the order by from a different table, you need to denormalize because you can not create a combined index from both tables. (Databases with function indexes might be able to but that's pretty complicated.)
I'm sorry to bash you so much, but you shouldn't write about what you have no experience with.
Ariel on July 15, 2008 3:15 AMThe approach we've taken at work is to have data normalised (well all the new tables, we have some truly hideous legacy tables that we haven't tidied up yet!) and then either use memory cache or search tables to speed up access.
In fact our search table is one honking great big denormalised table. There can't be any joins at all. This is generated by an off line task constantly.
I'm always wary about denormalising the 'cannonical' tables as this leads you into a maintenance nightmare, especially when it comes to adding or removing columns.
Glenn on July 15, 2008 3:16 AM> (You're automatically measuring all the queries that flow through your software, right?)
Got an article about this? Would be a good topic
Colm Larkin on July 15, 2008 3:22 AMTwo keywords (already mentioned) that I want to underline:
- views
- views aka materialized views
- star scheme for your billions of tuples (that your memory won't hold)
(<a href="http://en.wikipedia.org/wiki/Star_schema">http://en.wikipedia.org/wiki/Star_schema</a>)
Indexing and caching aside.
Krischan on July 15, 2008 3:23 AMCan somebody tell me how 'views' can increase the performance of a very large system?
Niyaz PK on July 15, 2008 3:25 AM>and table names are singular because each row in the table designates one specific person; one row = one user
james: are you serious? Tables don't have 1 row in them they have many. So you make them plural (1 row singular, multiple rows plural). Rule: table names are plural, column names are singular.
Also, you do not want to prefix every single column with the name of the table, what does that save you? Just type tablename.columnname instead of tablename_columnname - no need for aliasing. Sometimes you violate that rule where it might be confusing, but for that most part don't double prefix stuff.
Ariel on July 15, 2008 3:28 AMYour technique of describing a denormalized database model is one of the basic principals of dimensional modelling primarily used for data warehousing.
It's pretty much the opposite of normalizing your model (it contains redundant data, utilizes lots of space etc). Especially if end users need to use this data to query for e.g. reporting purposes then dimensional modelling is absolutely the way to go.
However, if you’re designing a normalized database, I would suggest to keep it normalized and don’t denormalize at all unless you’re designing a dimensional database structure from the beginning. Keep it separated and don’t mix the two modelling techniques even though it might be more efficient in some cases. Making things unnecessary complex is not something you want to do very often.
Sander on July 15, 2008 3:28 AMDB nuts come out in forces. Views as a performance improvement LOL.
Even on a very small DB those 6 joins will cost you. Oracle uses huge amounts of CPU for joins, whether or not you have lots of data. With any kind of load at all it will cost you.
Just forget about using a relational DB and use Berkeley.
Asd on July 15, 2008 3:35 AMI'm starting to study object-relational database, and I wonder if it wouldn't be a nicer solution for you, since is wipes away the necessity to write long complicated joins.
My studies haven't yet gone too far on that topic, but it seems ORDBs are easier to deal with than relational ones.
***
Which software did you use do create those diagrams?
Mrio Marinato, from Brazil on July 15, 2008 3:41 AM@Niyaz: Materialized views (or indexed views, depending on your rdbms) are in fact tables, but created and maintained by your rdbms. So a materialized view will act exactly like a de-normalized table would - except you keep you original normalized structure and any change to original data will propagate to the view automatically.
Goran on July 15, 2008 3:44 AMsorta on topic: What'd you use to make the diagrams? I dig
X on July 15, 2008 3:47 AM>table names are plural, column names are singular.
According to Codd and Date table names should be singular, but what did they know.
Pablo on July 15, 2008 3:48 AMI will have to nod on the use of views, especially indexed views, on this. Views help in performance because it tells the database which queries with complex joins will be accessed more often, and allow the database to pre-store its execution plan (as opposed to adhoc T-SQL) Indexing against the view will further optimize it.
If you use NHibernate, views have the additional advantage of being treated by NH as "just another table", so you could actually design a view against an object in a much simpler manner.
Jon Limjap on July 15, 2008 3:50 AMI'm a little lost here.
I'd say the example is not even normalized. I mean User Phone Number as well as User Screen Name should me merged into the User table and they will still be in BCNF. Then you keep the rest of tables as they are...
Why do you take Phone Number and Screen Name out of the picture? Is not as if you may have several Phone Numbers for one user with that design... and if the user does not have any phone number then, you know, NULL values may help.
Jorge on July 15, 2008 3:51 AM"and at that point nobody will have to teach them that denormalization is OK, they will have figured that out already"
Erm... Isn't the entire point of this post that some people haven't?
I always learnt the last rule of normalization is to de-normalize.
Also I believe large organizations use a de-normalized for for selects and a normalized view for updates and inserts using replicated servers.
Have a look at the Scaling Databases section of http://msdn.microsoft.com/en-us/magazine/cc500561.aspx
John on July 15, 2008 4:08 AM@Ariel
"Why would you ever want a query that returned the user data multiple times, once for every combination of screen_name and phone_number?!? No real site would ever want to do that."
The syntax may be incorrect (I'm not sure you're right, though), but I assume the idea is that you might want a complete list of all users and their related data. Can you give the correct answer rather than just picking holes?
At the same time, the fact that the syntax for joins is so tricky to get right and interpret on reading is another major problem with Normalised Tables.
"If you need phone number, you get phone numbers, if you need screen names, you get those - but why would you need every possible combination of phone number and screen_name?"
What if you needed a list of everybody's screen name and their related phone numbers?
"You're simply wrong. Speeding up joins is never a reason to denormalize for the simple reason that it doesn't! Test it - you'll see I'm right."
Well, as the articlesays, there'll be no measurable difference until you reach billions of data rows. Do you even read the article?
"But, do me a favor and write real joins - not ones that give you tremendous numbers of duplicated rows."
Which you have singularly failed to provide an example of.
very easy answer:
"normalize till its good, denormalize till it works!"
That's the rule I learn in college.
Hi Jeff,
All I can say is: Thanks!
This is the most enjoyable and valuable post in the blogoshphere for monts!
Daniel Fisher on July 15, 2008 4:12 AMWhy do you need to use a relational db for something like querying tags in the first place? Another approach would be simply to treat the db as a persistence layer and maintain in memory indexes for your tags.
Feeding a term vector of tags into something like Lucene is going to be faster than any DB query and a hell of a lot easier to scale.
Ryan Roberts on July 15, 2008 4:13 AMAn observation : if you use LINQ (Linq2SQL,Entity Framework,...) you don't have to write all those inner joins. You can use properties to access related entities. I love SQL but LINQ is allows to work with your data on a higher abstraction. If needed we, the SQL oldies, can still impress the young with blazing fast queries which will become a lost art.
Tom Pester on July 15, 2008 4:16 AM'Golden Copy' of data should also be taken into account when designing databases, to often I have seen database that don't apply any normalisation have multiple copies of critical fields spread across multiple tables...
Ollie Riches on July 15, 2008 4:19 AMselect * whatever can also be a performance drain worse than the 16 inner joins you described. select what you need and nothing more.
Brian on July 15, 2008 4:23 AM@Ariel
Looks like you'd get some duplication where you had multiple screen names and multiple phone numbers, but that's a very fine edge condition. It could also be helpful because you'd be able to detect users with multiple profiles or who have provided multiple contact details.
The fact that you didn't provide suggests that the normalised table is quite tricky to extract the complete data from cleanly.
Tom on July 15, 2008 4:24 AMIn your example you've turned several one to many or many to many relationships into 1-2 and 1-3 relationships. These are obviously not the same thing and there are many real life requirements for these standard relationships (1-N, and N-N).
The problems really start happening when you aren't normalizing your tables because your professor told you to, but when you need to model these one-to-many or a many-to-many relationships between data.
And I disagree that it takes millions of rows to make this stuff relevant, I worked on a project which had a search query with 4 or 5 joins and even specialised indexing for the major queries that were being run still caused any FULLTEXT searching on the data (3000 rows) to take 9 seconds. One summary table later and it was down to 0.09. Just by removing the joins!
So yeah, I guess my point is that the important thing here is the cost of a JOIN, and you need to know how to deal with these (summary tables, proper indexing, query caching, etc.) in situations where they ARE necessary.
David McLaughlin on July 15, 2008 4:25 AM"Even a modest PC by today's standards -- let's say a dual-core box with 4 gigabytes of memory..."
Holy carp, is that considered modest? Somebody please tell my company so they'll upgrade our Pentium 4 512mb machines. That billion dollars we make each year has to be going somewhere.
Chris on July 15, 2008 4:28 AMAs I tried to clarify in my blog post, denormalization is something that should only be attempted as an optimization when EVERYTHING else has failed.
Denormalization brings with it it's own set of problems. You have to deal with the increased set of writes to the system (which increases your I/O costs), you have to make changes in multiple places when data changes (which means either taking giant locks - ugh or accepting that there might be temporary or permanent data integrity issues) and so on.
Reading various database stories may make it sound like "all the cool kids are doing it" but all the cool kids are also rolling their own file systems and implementing their own alternatives to relational databases as well. They did that because they hit scaling problems and felt they had no alternatives but they gave up something along the way.
More on this at http://www.25hoursaday.com/weblog/2007/10/10/WhenDatabasesLieConsistencyVsAvailabilityInDistributedSystems.aspx
Dare Obasanjo on July 15, 2008 4:28 AMGoran,
So my question is: where is the performance enhancement in using views?
When something is updated in the tables, all these join statements will be needed to make the view. Right?
So for a very large application, views does not give performance improvements. Tell me if I am wrong.
There's a whole lot of denormalisation going on with even the most normalised databases. Indexes are denormalisations & so is anything you cache.
I think the approach to take here is to denormalise for speed but to ensure that denormalised data sits on top of normalised data & can be rebuilt automatically. This is what happens with indexes, and it's a tried & tested methodology.
As with indexes, anything denormalised should be hidden from the main programming effort. In fact, I'd go as far as to say that it should be a function of the database. How this can be done without an over-reliance on views is difficult to imagine though.
Sean Bamforth on July 15, 2008 4:30 AMOkay, from the very beginning it has been known that normalisation can cause problems with performance. I think every "trained" DB designer know this.
But what happens, is that people see "Normalisation = Slow", that makes them assume that normalisation isn't needed. "My data retrival needs to be fast, therefore I am not going to normalise!"
So they end up with a database they call "denormalised for performance", but really it isn't, as the database was never normalised in the first place.
Tubs on July 15, 2008 4:32 AMThe only time I throw normalisation out the window is when I have a list of infrequently changing fixed-sized items which I can separate by commas and store in a single field instead of another table.
However, whilst it saves time on the querying side, the issue I find with doing this approach is it completely slows down the processing side, as my application must separate the list into an array or another structure before I can work with the data.
Personally, if the trade-off of each approach is near the same, I'd prefer a more logically structured database without null fields than a denormalized database which is hard to maintain.
Lachlan on July 15, 2008 4:35 AMAriel and George,
Phone numbers don't go into the user table because users can have multiple phone numbers (work, phone, cell, fax, etc). A properly normalized database extracts them out to their own table especially since the majority of users will have lots of nulls in these fields.This is assuming that you are normalizing your DB to 3NF.
In addition, who says you wouldn't want all the user data including phone numbers at once? Have you ever seen a profile page on FB? That's exactly what it does.
Dare Obasanjo on July 15, 2008 4:37 AM
always wondered what database normalisation meant...
can't imagine that joins are exactly a heavy performance drain though... the data is indexed after all. I don't actually have the experience of a database that big (biggest db i worked on only held info on about 1.5 million people - linked to various details in much smaller tables), but I would imagine you could have trillions of records with no problem given some good indexing implementation...
in my experience slowness comes from things like calculating a sum, where the database doesn't update the sum on each record change, but recalculates the whole damned thing every time its needed. or poor application design, making 100s of server requests, when actually one would do just fine.
just taking a naive implementation, given numeric primary keys you can reduce the amount of records you actually have to search through by a factor of 10 for each digit you decide to take advantage of...
I wonder what modern database implementations actually do? generic keys aren't so useful but similar logic can still be applied... at worst there is always the choice of a binary tree...
Jheriko on July 15, 2008 4:38 AMI am pretty sure your original "normalized" database in the diagram is poorly designed. After all you have two tables that appear to hold a user_id + affiliation_id relationship (UserAffiliation and UserWorkHistory). That, my friend, is *wrong*. And no wonder you have six joins, you have a database design that does not make sense.
You don't want to write a query like that-- ever-- in real life. It doesn't make sense. If you have three users who each have two phone numbers and two screen names, your join gets you a result set with twelve rows in whatever order the database deems appropriate.
In my experience, the odds of a developer properly designing a database are extremely small. You have just proven that to me yet again. And I can guarantee you that bad designs are way more likely to cause performance problems than properly normalizing data. Not only that, bad database design is likely to cause other problems like inconsistent data, incomprehensible code, and inscrutable bugs.
Michael on July 15, 2008 4:44 AM@Niyaz: "When something is updated in the tables, all these join statements will be needed to make the view. Right?"
Yes - whenever you update / insert data it will in fact update / insert twice. But that's likes akin to saying indexes reduce performance as you need to store the same data multiple times :). My favorite saying on this is "You can read fast and store slow or you can store fast and read slow".
@Jheriko: The biggest performance killer is so called physical read. Finding and accessing data on disk is the slowest operation. Unless child table is clustered indexed and you're using the cluster index in the join you will be making lots of small random access reads on the disk to find and access the child table data. This will be slow.
In some cases you may be right, but what will you do if you need more entries in "affiliation", "work_history" and "screen_name"? Do you really want to have so many empty fields in each entry?
Jonathan Rass on July 15, 2008 4:52 AMThe only companies running on denormalized databases are companies providing websites that have loosened constraints when it comes to data loss. Your credit card company is not going to denormalize their data in their transactional system anytime soon.
Atwood is wrong. There is a "better" way to do it, and that's to normalize your data. This keeps your data integrity and DRY, and helps protect you from bugs and errors in your database. Correctness should always come first. Denormalization is up there with "hack" and "kludge" and should be treated as such, not as a "reasonable alternative" to normalization. (Albeit a sometimes necessary one, though probably employed more than it needs to be due to posts like this one, and also employed due to limitations of the shitty database these sites seem to run on top of, MySQL.)
The biggest scalability problems I face are with human processes, not computer processes. Aside from number crunching applications, I've seldom run into the need to restructure code for performance. But I'm always running into mental scalability problems, ways of doing work that are going to drive people crazy in the long run.
John on July 15, 2008 5:04 AMJeff,
It's common in programmer circles to think there's not a best way to do things, but in this case, that's not true. The best way to store data in a relational database is to normalize it; that is, to take full advantage of its relational nature.
In your example, you do not need six joins to get all the user's data: you choose what the pertinent data is for a situation. If you do need all that data, and performance is an issue, materialized views, or some other caching system, is the way to go.
I appreciate your post, but it's very wrong, and harmful advice to an inexperienced programmer.
Clinton R. Nixon on July 15, 2008 5:07 AMAnother piece of savvy writing. I thank so much these writings from the experience and humble truth. Whatever it means.
xmariachi on July 15, 2008 5:08 AMPremature optimisation! Design the database fully normalised, measure performance, optimise, rinse and repeat until fast *enough*.
Don't forget that the fastest database query is the one that doesn't happen, i.e. caching is your friend.
Chris on July 15, 2008 5:09 AMIf you want to have a look at the spec of fully engineered and normalised database, check out the NHS Data Dictionary
This diagram shows the "person" property :
Tubs on July 15, 2008 5:12 AMWanted to repeat what Brian said, because it's one of the more valuable things in the comments: don't use "SELECT *" and you'll have cut your query count almost in half.
The database diagrams appear to have been made in a recent copy of Visual Studio.
While I agree with the general point Jeff, I do think there are other ways of achieving the effects of denormalization without squishing tons of fields into a table. Views, reporting databases, etc. If The DailyWTF is to be believed, just adding some (well-thought) indicies should be everyone's first start. Then learning how to use tuners & optimizers. Jumping straight to denormalizing a database is the same class of mistake as thinking a properly normalized database can't be a problem: you're assuming facts not in evidence based on what someone told you about their experience.
Tom Clancy on July 15, 2008 5:13 AMI've read the odd article on this site now and then, but still not come decided whether Attwood is writing ironically, or if he is actually incompetent.
Firstly, the sample schema is not normalised. The member_count field in the Affiliation table is a derived value (SELECT COUNT(*) FROM UserAffiliation WHERE affiliation_id = ?). The religious_views and political_views columns in the User table smell badly of multi-values. The purpose of the UserWorkHistory table is unclear, but looks unnormalised.
Secondly, the SELECT example given should user outer joins.
Thirdly, using "SELECT *" is a no-no. What happens if the order of the columns change? If you're not using integer indices, then performance suffers. If you are using integer indices, then column order is significant. What if someone adds more columns that your code doesn't need? You'll still be pulling that data back.
Oh well, I'm going to add your example schema to our pre-interview exam (commonly known as the "idiot filter") - and if anyone suggests your denormalised table as an improvement I'll know to export them out the building uninterviewed.
You mean you aren't using NHibernate, Jeff? Or even Linq to Sql?
In your example, if you need to get all of the data for a given user, your sql would look something like:
select * from Users where user_id = 10;
select * UserPhoneNumbers where user_id = 10;
select * UserScreenNames where user_id = 10;
select * UserAffiliations where user_id = 10;
select * Affiliations where user_id = 10;
select * UserWorkHistory where user_id = 10;
select * Affiliations where user_id = 10;
and your ORM would handle populating your POCOs.
Even if you aren't using an ORM of some kind, ADO.Net handles the simultaneous retrieval of multiple sets of records just fine.
Like Peter Becker said, "Normalization is about design, denormalization is about optimization."
From experience: Normalize tables in databases (up to the third normal but probably not more) when these will be mainly used to collect information (lots of inserts, updates).
Denormalize tables when these are going to be used to select type qyeries (to feed reports, etc) or better still offload these to a data warehouse cubes which are optimized for analysis and reporting.
cyclo on July 15, 2008 5:32 AMcodinghorror.com 127.0.0.1
hosts on July 15, 2008 5:34 AMYour normalized database could actually have a couple more tables. Phone Type, IM Service, Company name and maybe even Title could be in lookup tables so you can find all the people for X Company, or using Y IM service etc.
You can't allow free text there or some will enter ATT and others AT&T. You could control it on the front end, but you'd still need a table to get the list.
Ask anyone who works with the family of databases called “multivalued” (such as Revelation’s OpenInsight or IBM’s U2) and they’ll tell you that you normalise when data HAS to be treated atomically – or as so many others have said, design to third normal then justify denormalising. Removing external joins is one easy way of reducing web latency.
Andrew McAuley on July 15, 2008 5:35 AMYou make a good point about denormalization. I am developing a social networking app and have been fairly strict about normalizing data, and have created indexes to help performance. It is not yet a problem, but it has occurred to me that occasionally tortuous joins may eventually exact a performance hit. I'm fully prepared to abandon purity for performance when/if this happens. Thought-provoking article.
decaf on July 15, 2008 5:36 AMAn obvious (?) solution which works for even very large sites is to store the master copy of your table in normalized form, and then use batch jobs to denormalize the data as appropriate. This means you can have multiple denormalized forms suitable for different purposes, without losing any data integrity.
Of course this precludes the use of ORM, but once you start talking about performance and scalability you've already done that.
Greg Beech on July 15, 2008 5:43 AM> People normalize because their professors told them to.
Never had one of those. I normalize because it's elegant and easy(er) for me to understand. I don't do much work on huge systems, except for our major systems which were already designed for us and are quite denormalized already. Of course now I'm doing more data warehouse work, where all the rules get thrown out the window.
Rhywun on July 15, 2008 5:49 AMYou're just another knucklehead. Think of it this way: would you take seriously a guy who's spent 20 years on the line at Tyson dismembering chickens announcing that he was *really* a neurosurgeon? And that he's got the perfect way to remove that tumor that's been giving you massive headeaches? The folks who've spent decades developing RDBMSs and systems from same, really do know more about it than you do.
That said; RDBMS is about shared *transactional data*. If you really don't care about keeping the data right all the time, then how you store it doesn't matter. Read Codd's papers or Date's book. If you haven't bothered to do either (you haven't, right?), then why should your opinion matter?
There are reasons Codd drew his conclusions. He didn't just make it up.
Some thoughts I've collected over the years.
Not only am I not aware of any resource that would advocate logical modeling of data in an unnormalized fashion, I would be against anyone reading it if it existed. The logical data model should be normalized. Normalization is the process of identifying the one best place each fact belongs.
-- Craig Mullins/2007
There's a tendency, especially among managers of smaller shops, to assume that any developer knows how to set up a database. Frankly, this perplexes me. You wouldn't assume that any given developer knows how to code in C# or set up a Web Service, so why is it that we're all supposed to be database pros? The end result is that too many databases are designed by people who have never even heard the term normalization, never mind developed any understanding of the various normal forms.
-- Mike Gunderloy/2006
This is a great book on building databases the correct way. I highly recommend this book["Database Modeling and Design" by Teorey] and if you cannot understand it you shouldn't be building a database anyway.
-- Robert C/1999 [the current 4th edition/2006 is even better]
Many of the existing formatted data systems provide users with tree-structured files or slightly more general network models of the data. Application programs developed to work with these systems tend to be logically impaired...
-- E. F. Codd, "A Relational Model of Data for Large Shared Data Banks"/1970
[Dr.] Codd had a bunch of ...fairly complicated queries, and since I'd been studying CODASYL (the language used to query navigational databases), I could imagine how those queries would have been represented in CODASYL by programs that were five pages long that would navigate through this labyrinth of pointers and stuff [XML anyone?]. Codd would sort of write them down as one-liners. ...[T]hey weren't complicated at all. I said, 'Wow.' This was kind of a conversion experience for me. I understood what the relational thing was about after that.
-- Don Chamberlin/1995
Proper data management is the key to great architecture. Ignoring this and abstracting data access and data management away just to have a convenient programming model is … problematic. ... Many of the proponents of O/R mapping that I run into (and that is a generalization and I am not trying to offend anyone – just an observation) are folks who don't know SQL and RDBMS technology in any reasonable depth and/or often have no interest in doing so.
-- Clemens Vasters/2006
The big myth perpetrated by architects who don't really understand relational database architecture (me included early in my career) is that the more tables there are, the more complex the design will be. So, conversely, shouldn't condensing multiple tables into a single "catch-all" table simplify the design? It does sound like a good idea, but at one time giving Pauly Shore the lead in a movie sounded like a good idea too.
-- Louis Davidson/2007
Technologies like OODBMS sacrifice sound, application technology agnostic data management for short-term convenience (convenience for one single application, written using one particular programming language). Relational technology essentially completely replaced network or hierarchical database technology, and there were excellent reasons why that happened. We should most certainly not be reviving either of those discredited approaches by slapping on the latest buzzwords (OO, XML, etc) as window dressing. ...We don't see any future for JDO.
-- Gavin King/2004 [he wrote Hibernate]
...the logic exercised by the database to maintain referential integrity is not free, but if the rdbms does not do it, it gets done in the application, where it costs the same if not more - just now the application becomes "slower" versus the rdbms. Whether the logic of triggers or RI is performed in the application or the database, it's going to have similar pathlength - the only question is whether the rdbms or application can be more efficient. As more applications are written in languages that emphasize portability over performance (Java) or ease of learning over performance (Visualbasic), the benefits of putting logic that centres around data further down into the database engine become stronger.
-- Blair Adamache/2003
In CS we don't have a lot of formal models to guide us, as in engineering or other science. Much of CS is entirely ad-hoc. However we do have a sound and complete model for data storage (relational model) and hardly anyone uses it. It boggles my mind. Do people not *want* their programs to work predictably? [corollary: if you *must* interact with an Application to modify data, the data specification isn't relational]
-- Anonymous Coward/2005
The people using SQL to store data often don't grasp that there's a difference between expressing various facts as a set of relations and stuffing data into a database like you stuff material into a file.
-- Christopher Browne/2005
Programmers will always gravitate towards viewing the data in "their databases" as their private bailiwick, and insist that users of the data access it through their own API. Learning SQL is certainly better than learning a hundred programmer's different APIs.
-- David Cressey/2005
It is very obvious that a lot of people have absolutely no idea what a business model is or for that matter a database. The number of times I have seen business rule code that should be imbedded in the database definition and not in the BLL (Business Logic Layer) just makes me want to scream. As soon as data rules (the backbone of a business model) are moved away from a relational style declarative constraint, you have reduced the effectiveness, meaning and integrity of your business model/database.
-- DavidM/2004
Normalizing is an attempt to optimize generally. Denormalization is an attempt to normalize for a particular application, to introduce a bias.
-- Gulutzan & Pelzer/2003
If you store all your business logic within the database, then it doesn't matter what flaky applications people write you can guarantee the integrity of the data.
-- Pretsel/2004
"""Lets unroll our loops next.""" (Christog)
I seen that and I must say was blown away how removing loops can make your application run faster. Of course, you shouldn't jump right away and erase all your loops and ifs, but sometimes it can help, just like de-normalization.
You can read about a guy optimizing his XML parser in the book called "Beautiful Code". Sorry, I don't remember who wrote it.
I run an e-commerce website with about 400,000 page views a month. I've been a complete purist about normalizing the database, as well as using business objects to access all data (rather than using ad hoc SQL queries). We have some pretty complicated business object hierarchies and I've long been waiting for there to be a performance issue (especially as we use a shared SQL Server), but the website continues to be blazingly fast. I realize that you are going to have a lot more page views than our site, but it looks like your schema is going to be simpler and you have a dedicated SQL Server, so I think you can afford to be a purist for now and not worry about denormalizing until it becomes an issue.
Ben Mills on July 15, 2008 5:58 AMGood post Jeff.
The point many of the jihadist posters are missing is that "the right thing" may change dramatically as an application scales up.
Your high scalability point is key. As you said, "Everything is fast for small n."
The ideals of normalized data, and even of using a RDBMS at all, fall apart at some scale, as Google and Amazon will attest.
John Pirie on July 15, 2008 6:00 AMhey all you young guys - logical modelling (data model) is normalised (for understanding), physical model (database) is denormalised (for performance) - duh!
non american spelling on July 15, 2008 6:00 AMjeff, are you awake?
wiggles on July 15, 2008 6:02 AMIt's funny how many people just ignore your last conclusion because it didn't suit their argumentation purpose; “normalize until it hurts, denormalize until it works” sums it up very nicely.
That being said, the above example would be no reason to change the normalized DB schema but instead to change the query (obviosuly). In any case, such a query would be very rarely necessary (someone already mentioned data warehousing, I think).
Konrad on July 15, 2008 6:03 AMAhhhhh! Don't tell people that! Denormalization may be all well and good, when you need the performance and your system is STABLE enough to support it. Doing this in a business environment is a recipe for disaster, ask anyone who has spent weeks digging through thousands of lines of legacy code, making sure to support the new and absolutely required affiliation_4. Then do the whole thing over again 3 months later when some crazy customer has five affiliations.
Sean on July 15, 2008 6:04 AM"Traditional database design principles tell you that well-designed databases are always normalized, but I'm not so sure."
Whose "tradition database design principles" are you referring to? It's been some time since I took any database courses at college, but I do remember that both the book and the instructor told us that going all the way to the 5th level was extreme and not a good idea, because it added complexity to the queries used to get data which slowed it down. We were told that in most cases, the 3rd level of normalization will provide the best results.
The purpose of normalization is reduce size and eliminate the number of duplicate records, but the trade off is that the queries used to reconstruct the data become more complex and thus take more time.
Delmania on July 15, 2008 6:05 AM(disclaimer: I'm talking completely about Microsoft SQL Server 2005 here.)
I'm calling shenanigans on this post, unless you put up a definitive example of where your app is running into problems with a normalized database design.
Like others have said, to get the user data you could use an indexed view. But also as others have said, if you have proper indexes and foreign keys setup, the SQL engine should have no problem "figuring it out" and generating the same query plan as if you had just queried those tables directly in the first place.
I'd also like to say that the example given is a poor one, seeing as you're joining 1-to-1 tables to 1-to-n (phone numbers, affiliations, etc.)
Not to mention that de-normalization leads to it's own problems..
Nicholas on July 15, 2008 6:07 AMPS. I name my fields with ORM in mind, even if I don't use it. Thus, Id, not UserId or god forbid user_id...
Rhywun on July 15, 2008 6:08 AMKeep it normalized. How often do you actually need EVERYTHING about a user at once? Probably only when you are editing that user.
As others have said, use proper indexes to speed it up. De-normalized tables will haunt you for years whenever there are issues. As in an earlier post you are being cruel to the maintenance programmer (even if that is yourself).
Practicality on July 15, 2008 6:10 AMall of those with a data access layer step forward... not so fast jeff
structure on July 15, 2008 6:11 AM> I run an e-commerce website with about 400,000 page views a month
That's very low. I've got a site doing twice that traffic every day and has about 30 tables, with the heaviest (and most common query) working on a table of 200k records requiring a sequential scan each time since the sort criteria is based on an average of a field in the resultset that is different for each query which are finally joined against 12 tables.
All running on a 2ghz opteron with 1gb ram, and cpu usage hovers at 40% at peak times.
God bless PostgreSQL.
nick on July 15, 2008 6:13 AMSorry to contradict Greg !
Use a multi-value (a.k.a. Pick) database and you don't have to worry about normalization ever again. They are fast too.
Mike on July 15, 2008 6:14 AM@[ICR]
> Just a nitpick, I'd recommend using the term "Gender" over "Sex"
Do you sex a cat, or do you gender it?
Don't use an incorrect term just because the politically correct crowd want to make the very mentioning of words that might also refer to a natural human activity. Fight the pussification of language!
If you want to know if someone is physically male or female, the question is "sex", not "gender".
<a href="http://en.wikipedia.org/wiki/Gender">http://en.wikipedia.org/wiki/Gender</a>
Simon Wright on July 15, 2008 6:16 AM^^
I don't know what happened with the wikipedia URL, but I didn't write the HTML tags. I just pasted the URL.
http://en.wikipedia.org/wiki/Gender
http://en.wikipedia.org/wiki/Gender
Simon Wright on July 15, 2008 6:17 AMI have had to fix a few huge databases over the years and often the problem is a bad, but normalized, database design.
Al on July 15, 2008 6:20 AM"I just spent a year having normalization hammered into my meager little mind as part of my database module and now you're telling me that it mightn't be necessary..."
This is why this article is wrong, Jeff. This is why you're an idiot, in case the first statement wasn't clear enough. You just gave an excuse to be lazy to someone who doesn't have a clue.
You don't normalize because of ghosts; you normalize because it doesn't make fucking sense otherwise. A tuple is a logical statement; denormalizing means you take the risk of having contradictory or false statements in your database in order to gain performance. (That's why it's recommended for read-only databases, for example: the risk of having incorrect statements is much smaller.) If premature optimization is the root of all evil, denormalization is Satan's ugly mother. It's right there with "the compiler is wrong".
You just need 1 table with 2 fields:
- id (primary key)
- data as Xml
lol !
Patrice on July 15, 2008 6:22 AMI've never been a database guy, but this whole issue seems like the classic problem of optimization to me. There's almost always that trade-off between elegant readable design and optimal execution speed.
> I disagreed and waged a number of arguments that it was our database that needed attention.
> We first needed to tune queries and indexes, ...
> Consultants were called in. They declared the db design to be just
> right - that the problem must have been the application.
This is precisely why you never optimize until *after* you profile (find objectively where the bottlenecks are). One of my pet peeves is folks who "optimize" their code with absolutely no evidence that there's an execution speed problem there, or even anywhere.
Ah, and you should read the comments to Dare Obasanjo's article: there's no six-way join; there's a three-way join at the worst.
Marcel Popescu on July 15, 2008 6:24 AMGreat point. At my previous employer, they ran into scalability issues, and denormalizing their data never entered their mind (as far as I know). Another great way to speed things up is to do more processing outside of the database. Instead of doing 6 joins in the database, have a good caching plan and do some simple joining in your application.
My article explains it better:
http://www.ytechie.com/2008/05/stored-procedure-reporting-scalability.html
I've been working with RDBMS's since about 1986, and currently support a very large complex (normalized) database.
I have to disagree that normalized databases can't perform. Or that you should assume they cannot. This is just plain not true and a bad idea. You normalize, then you test, and then you refactor those places that need to perform faster. Never waste any time trying to design around performance problems that 'may' happen, only those that you can actually measure. Anything else is a waste of everyone's time.
Object-Relational database technology doesn't solve anything, the relationships between objects still have to be built into a persistence structure. All it does is hide the complexity some (which can be a good thing). Just don't assume that because the object hides the complexity that the actual physical storage is simpler as well.
Lastly - No one seems to have mentioned that a decently normalized db greatly reduces application refactoring time. As new requirements come along, you don't have have to keep pulling stuff apart and putting back in new configurations of the db, update synchronization code etc. You can concentrate on the presentation and persistence APIs, and gradually refactor a much smaller set of db changes than you would if you had denormalized everything.
I totally agree with Greg, correctness comes first.
Ed Delaney on July 15, 2008 6:24 AMI think file system performance is often overlooked when examining performance problems in medium/small db's. Denormalization will help when you have an i/o bottleneck - to a point. However, I personally find it faster and more cost effective to upgrade my i/o subsystem than to refactor my db and app for a denormalized schema.
Jeff, be careful when giving advice like this... De-normalizing is something that should only be done by people who ALREADY UNDERSTAND when to use it and how. I fear you're giving it the air of being trendy and thus confusing the less experienced developers out there. :-P
Bruce Johnston on July 15, 2008 6:25 AMI've been working with RDBMS's since about 1986, and currently support a very large complex (normalized) database.
I have to disagree that normalized databases can't perform. Or that you should assume they cannot. This is just plain not true and a bad idea. You normalize, then you test, and then you refactor those places that need to perform faster. Never waste any time trying to design around performance problems that 'may' happen, only those that you can actually measure. Anything else is a waste of everyone's time.
Object-Relational database technology doesn't solve anything, the relationships between objects still have to be built into a persistence structure. All it does is hide the complexity some (which can be a good thing). Just don't assume that because the object hides the complexity that the actual physical storage is simpler as well.
Lastly - No one seems to have mentioned that a decently normalized db greatly reduces application refactoring time. As new requirements come along, you don't have have to keep pulling stuff apart and putting back in new configurations of the db, update synchronization code etc. You can concentrate on the presentation and persistence APIs, and gradually refactor a much smaller set of db changes than you would if you had denormalized everything.
I totally agree with Greg, correctness comes first.
Ed Delaney on July 15, 2008 6:27 AMTo everyone bashing Jeff on his database design skills - he didn't come up with that design. He clearly pointed out that the design was from Dare Obasanjo's blog post on this topic.
Allan N on July 15, 2008 6:31 AMSome things are too slow in a normalized database. Facts are Facts. You just rout around this:
* Keep a de-normalized replica for expensive operations (e.g. reports)
* Cache Results for repeat queries (Memcached)
* Partition the database for scalability (vertical or horizontal)
> This is why this article is wrong, Jeff. This is why you're an idiot...
In your opinion.
In my work building and maintaining content and asset management systems, I've seen far more problems arise from over-normalisation than under-normalisation. Far too often I've seen database designs that fall over because they're so intricately normalised. They don't fail technically, but they become a nightmare to write code against.
On the other hand, some of our most long-lasting table structures have been criminally flat. They have their limitations. They suffer some redundancy. But they're rarely misunderstood, and always a joy to attach to your new systems, because their broad usage has spawned many valuable tools and code that work with them -- and that's far more valuable than having an unlimited number of phone numbers for each contact.
Simon Wright on July 15, 2008 6:34 AM@Julio: Hibernate, at least the Java version, actually has a concept called join fetching where it will join a related collection instead of doing a second select. So, just for the record, an ORM wouldn't necessarily use all separate selects.
a on July 15, 2008 6:34 AMSpeaking from long experience, if you don't normalize, you will have duplicates. If you don't have data constraints, you will have invalid data. If you don't have database relational integrity, you will have orphan "child" records, etc. Everybody says "we rely on the application to maintain that", and it never, never does.
The question is, what's the cost of the data integrity problems? If it's something you can live with for performance, fine. If it's a disaster waiting to happen, you better build your database design as tight as possible.
A. Lloyd Flanagan on July 15, 2008 6:34 AMSeveral people already hit on what I was going to say, but here goes anyway...
What is the purpose of the database? Is it to process transactions or report findings? I always take the route that I design the data model using the best practices approach and considering my requirements.
Now, I double check requirements to make sure the database will behave correctly. If 99% of what I am doing will be reporting off the tables, I will either create Materialized Views (Oracle) or Materialized Query Tables (DB2) that basically denormalized the data for quick reporting. This can be a problem if you need up to the minute dynamic reports since the MV/MQT's need to be refreshed periodically. Currently, we are on a 2 week refresh, so this approach works well for us. The MV/MQT's carry some resources, but memory and storage is cheap!
Now, if it is a quick app for logging or something, I will just create a flat/wide table and put everything in it. After all, this log is for me and my team... no reason to confuse the issue with 5 tables to hold each piece of the log information... Plus, this data gets cleaned out every so often and I am not concerned about instant access. Just index the table by date, and order by in the sql...
If I am going to be constantly hitting the data with updates/inserts/deletes (like in order processing or something) the normalized structure works very well... and I don't carry the overhead of all the duplicate data associated with MV/MQT...
So, I don't think you can make any blanket statements on normal vs. non-normal form. Like everything else in programming, it all depends on requirements and intended goals.
my $0.02
Wayne on July 15, 2008 6:39 AMSergej, great comment.
Jeff, I remember from one of the early podcasts when you were describing the box you were deploying on that it was just a single machine, running the database and the web server, is that correct? That immediately struck me as the first thing that will have to change when you go live. It is astounding the performance difference you get by putting the database on its own box and clicking the little box that basically gives the machine to Sql Server.
There is a lot of idealism in the comments above. Lots of alternative databases, OODBMS's, LINQ. Software developers in general have a very poor understanding of how databases work, how to tune them, how to design them. I wonder how many billions of dollars have been spent on servers in the last ten years that could have been saved by a simple index.
The design of any database that expects millions of rows in any table should be handled by an experienced DBA/Programmer. All access should be controlled via stored procedures. Never, ever, ever let a software tool or a (non-dba) software developer write your queries for you. db_reader and db_writer should *not* be privileges that your applications enjoy.
If you look at a query that has 30 joins and immediately think it is horrible, please analyze and tune the query first. Only the joins that require a big scan matter. Look at the plan and if you see 29 joins at 0% and 1 join at 100% of execution time, then only one of the joins is bad. Go ahead and add ten more if you need to. De-normalization is for reporting, not for OLTP.
Here's an excellent book for really digging into how Sql Server processes queries: http://www.amazon.com/Inside-Microsoft-SQL-Server-2005/dp/0735623139 (T-Sql Querying by Ben-Gan).
Jeff, one of your comments struck me as worth a little more thought: "Disks and memory are cheap and getting cheaper every nanosecond". Not everyone can afford a SAN, and if you're running everything on one box, it's not about how cheap the disks are, it's about how many you can cram into the machine. You want the fastest disks you can get, which means you are limited in size, and you want RAID, which eats up more space. In a 1U machine, you get maybe 2x136Gb 15k drives, double that in a 2U. And even that leaves everything running on the same set of spindles. So you move up to NAS or DAS, but you are still limited, because if you just fill it up with slow 1Tb drives the IO will kill you. Data size matters.
Eric on July 15, 2008 6:39 AMYour six-way join is only needed because you used surrogate instead of natural keys. When you use natural keys, you will discover you need much fewer joins because often data you need is already present as foreign keys.
Leandro on July 15, 2008 6:40 AM@T.E.D:
"I've never been a database guy, but this whole issue seems like the classic problem of optimization to me. There's almost always that trade-off between elegant readable design and optimal execution speed."
Absolutely, although you've got another element in the trade-off: data correctness. A database design that allows inconsistent data will have inconsistent data.
I have to agree with other posters: you can denormalize if you have to, and you know what you're doing. If you don't you may have to assume that the cost of incorrect data trumps other considerations.
A. Lloyd Flanagan on July 15, 2008 6:40 AMThe solution it 2 tier storage. Websites tend to be read many, write few. The first tier is normalized database, the second tier is de-normalized data specifically constructed for use by the application. In an app that I am building, the 2nd tier is a combination of memcached and plain old disk using binary serialization.
The expense now is that you have two writes, one to the real db and one to the de-normalized tier.
So now you've separated the read requirements/performance from the write and relational data storage.
Damian on July 15, 2008 6:41 AM@Leandro:
"Your six-way join is only needed because you used surrogate instead of natural keys. When you use natural keys, you will discover you need much fewer joins because often data you need is already present as foreign keys."
The whole question of surrogate vs. natural keys is of course a topic of enormous debate. Generally, I'm with you: natural keys are easier, safer, and often don't hurt performance the way people assume they will.
A. Lloyd Flanagan on July 15, 2008 6:44 AMI think the general advantage of normalization is the same as the DRY principle in coding. When you denormalize stuff you can wind up maintaining copies of data and doing the same thing twice.
Another advantage is it is easier to go from normalized to denormalized when you need it than the other way round.
CT2 on July 15, 2008 6:44 AMJust as well Dare is a social networker and not a database designer!
Look at some of the arbitrary limits in his design:
* A fixed number of telephone numbers per person
* A fixed number of job histories per person
* A fixed number of IM accounts per person
Now, what happens when the big DOH! hand hits the forehead, and our short-sighted denormaliser suddenly realises that out there, in the REAL world, the numbers of things in sets change on a regular and frequent basis ... that's right, we have to change the schema!
It gets worse. The proponents of denormalised schemas are the most likely to write "select * ..." in their queries. That means reviewing and changing all our code as well.
A properly designed database schema should survive most business changes. Denormalised (abnormalised?) schemas don't.
And as others have pointed out in this thread, the avoidance of surrogate keys can do wonders for your performance.
rond on July 15, 2008 6:46 AMIn my experience, beginning developers get into a lot of trouble because they're afraid of joins. The common symptoms are apps that are:
(i) excessively complex,
(ii) never quite work right, and
(iii) are expensive to change
It definitely is true that we're moving into an era of "web scale" systems that are much greater in size and power than traditional "enterprise" systems. Many people are pushing the performance limits of the traditional RDBMS, and we're seeing a move towards distributed main-memory databases and other new technologies.
Workaday business apps benefit greatly from normalization, so I think this post has a dangerous message for many developers. In the case above, I think it still makes sense to use a normalized database to keep the authoritative copy of the data, but it makes sense to cache the data in denormalized forms. Rather than creating a table with the notorious phone_number_1, phone_number_2, ... phone_number_33 fields that I'm always cleaning up in other people's applications, you can just serialize the data with the serialization system that comes with your language or JSON or something like that.
Social sites generally have many more reads than writes, so this can improve performance 100 fold or more, but still keep the advantages of a relational back end. If the cache gets messed up, you can always do a partial or complete rebuild based on data you trust.
Paul Houle on July 15, 2008 6:51 AMI've had programmers tell me that they don't care about the size of data or normalization. Then they cry when they have 200,000 fat rows and performance crawls.
David B on July 15, 2008 6:53 AMFor those advocating the use of Views: the database does not store the view data. The data is recreated, using the view's SELECT statement, every time a user queries a view.
Furthermore, see this article on View performance: http://sql-server-performance.com/tips/views_general_p1.aspx
Steven Bey on July 15, 2008 6:54 AMWhat I think is funny is the number of people who think that because they use LINQ or Hibernate they aren't affected by these issues... In the end you are storing your data in a relational database, right? Unless you are plugging LINQ or Hibernate into something other than Oracle, SQL Server, MySQL or Postgres (or any other RDBMS) this is still an issue for you...
Sure, *you* might not be coding the six inner joins, but your OO mapper sure is.
Scary.
Sam on July 15, 2008 6:54 AM@Dare
"In addition, who says you wouldn't want all the user data including phone numbers at once? Have you ever seen a profile page on FB? That's exactly what it does."
I think what Ariel is saying is that if you performed the 6-join query your return data would be duplicated. For example if a user had two phone numbers and three screen names the return data would look like:
user_id phone_number screen_name
1 111-1111 name1
1 111-1111 name2
1 111-1111 name3
1 222-2222 name1
1 222-2222 name2
1 222-2222 name3
Using this format of the data it would take some monkeying around to create the display text:
User: 1
Phone Numbers: 111-1111, 222-2222
Screen Names: name1, name2, name3
It would be easier to select by phone number and then by screen name instead of trying to get all of the data at once.
Eric on July 15, 2008 6:55 AM@[ICR]: Kudos to you.
Vance Vagell on July 15, 2008 6:55 AM@[ICR]: Kudos to you.
Vance Vagell on July 15, 2008 6:55 AMPlease think! Is this post even about normalization? The 2 database designs you used to illustrate your point are actually solutions to 2 different problems. They are not the same database and they cannot represent the same data. You can argue that the second one is faster but only if it can actually be used to represent the facts that are meaningful to your problem otherwise this is just hot air. This kind of lack of thinking is what usually makes life with databases a hellish proposition because almost invariably someone will ask for data that is not stored because it was easier to say "I need performance so I'll just ignore the requirements (or not think them thru) and de-normalize!"
Terrier on July 15, 2008 7:09 AMJeff,
You make a good point but there really is a upper limit on how far things can be stretched with denormalization. When stored procedures start to have anywhere from 100 to 200 arguments and sql server 2000 warns you that your row size exceeds its 8063 byte limit then maybe...just maybe its time to normalize the tables just a bit.
How do views speed up performance?
1) They prevent bad joins like the one demonstrated here.
2) You can use a materialized view which is the data denormalized for performance, but backed by normalized data for integrity.
The question is what is important: Your data or your performance?
If your limiting where clause is only one one table, you should not see a bad hit, because you will be hitting only the few select rows of each other table. You will take a hit for disk seek if you do a poor job of laying out your physical storage, but that is a different issue.
If you care about data integrity, you need to be normalized, and you need things like foreign keys and constraints. Normalizing data can hurt performance, but so few applications are of the size to cause this issue that usually data integrity is the bigger concern.
On a P-200 with 128M of RAM and a 10G database (point of sale with automated reordering and the lot) fully normalized, the 100MB network card was the bottleneck. If you need more than that, you have the money to buy hardware that can support it.
Grant Johnson on July 15, 2008 7:23 AM> But this design also requires a whopping six
> joins to retrieve a single user's information.
Yes, if you need all of it or if you need all of it in a single table.
Very often neither is the case. Very often you don't need all of the information at once and sometimes you first need some of it and later on some more, so you don't need them in a single table and as joined query, but as separate queries istead.
i hope the people at twitter listen to this. :P
they got people to care, now it's time to move fwd :P
Jeff,
You miss the point of normalization entirely. Normalization is about optimizing large numbers of small CrUD operations, and retrieving small sets of data in order to support those crud operations. Think about people modifying their profiles, recording cash registers operations, recording bank deposits and withdrawals. Denormalization is about optimizing retrieval of large sets of data. Choosing an efficient database design is about understanding which of those operations is more important. A system that will primarily be serving content may be a good choice for denormalization. That doesn't mean that normalization doesn't scale, it just doesn't scale for one particular problem.
RevMike on July 15, 2008 7:36 AMMecki said:
>Yes, if you need all of it or if you need all of it in a single table.
>Very often neither is the case.
This is spot on.
Let's say you wanted to run a query to find all the people that worked for a specific company:
select first_name,last_name from User u,UserWorkHistory uwh where
u.user_id == uwh.user_id AND
uwh.company_name = 'Coding Horror';
<b><i>(BTW I haven't written SQL in ages so this may be grossly incorrect; I think you get what I'm trying to say, though)</b></i>
Granted, I could have just as easily done:
select first_name,last_name from User where
User.work_history1 = 'Coding Horror' OR
User.work_history2 = 'Coding Horror' OR
User.work_history3 = 'Coding Horror';
But which makes more sense?
I don't think there's any harm in starting off normalized and scaling at some point later. Premature optimization is evil, and for all you know stackoverflow may scale just fine under the circumstances.
Michael Acobas on July 15, 2008 7:41 AMI recently created a denormalized table with over two hundred columns because it was just needed to generate a PDF from a long online application form. There was no need to query the data for any other reports. The table is only meant to serve one purpose. However, those 200 plus columns did exceed the limit for the number of columns a SSIS export will support so now I need to exclude that table from my back ups. There is also a limit to the number of tables SSIS will export so you could run into problems with a fully normalized design as well.
Robert S. Robbins on July 15, 2008 7:43 AMDenormalized data sounds great until you try to extend your data model, at which point you will wish to high heaven for a time machine. The purpose of a relational database is NOT to give you fast performance. If you want that, use flat files and do your own heavy lifting. The purpose of a relational database is to help maintain data integrity. If your data doesn't need integrity all that bad, you don't need or want a relational database. If your data does need integrity, you can't afford to make the mistake of modeling it poorly, and that means semi-obsessive normalization. Those 6 joins are a feature, not a bug.
Noah Yetter on July 15, 2008 7:49 AMThree quick observations:
First: You only need the six joins if you want every single attribute of information stored about the user. My guess is that about 99% of the time, you REALLY only need maybe two tables' worth.
Second: normalization is optimizing for data entry and maintenance. If your application is presenting the data primarily rather than CRUD, of course you need a denormalized view. But I'd push from a normalized db to an optimized copy, or maybe even a cube.
Third: MEMCACHED, MEMCACHED, MEMCACHED.
Chris B. Behrens on July 15, 2008 7:51 AMOver the years I've developed a theory that the one area where developers reveal a consistent lack of knowledge and understanding is in DB/query design. This article and the subsequent comments reinforce this theory (and to be candid, leave me a bit apoplectic).
1) Views: terrible idea unless it's a SQL 00/05 indexed view. Even then, you probably need to revisit your schema and queries before resorting to this measure. If you're on MySQL, forget that views exist and used stored procedures.
2) Surrogate keys: that's another rotten idea, especially if you're concerned with performance. Also a horrible idea if you ever need to migrate the database.
3) The query in question: dude, Dare, really? And your defense of this query is that you may want to show all the numbers a given user has, for example, Facebook? Respectfully, please, both you and Jeff need to learn more about databases or hire someone that does, or you are headed for pain and misery as you build Stack Overflow.
For example: say my User.User_Id is 666 (I also am the son of Satan, but that's a different subject). There are two entries in the phone number table that have the User_Id foreign key of 666: one is my office number, 402.261.0498. Then you have my mobile, 402.525.5376. So you run your query, using my User_Id in the WHERE and you get back two rows, one for each phone number, with all of my data from the User table repeated for each. What are you going to do with this result, say on my profile page? Are you only going to use the data from the first row, except when dealing with phone number data, in which case you'll loop through each row of the result and render each phone number?
You're going to have to do as Ariel suggests: query the distinct user data, then query the phone numbers and other information. Multiple queries will hurt performance much less than the multi-join monstrosity above that will return indistinct and useless data.
Chris Rikli on July 15, 2008 7:52 AMCache the generated view pages first. Then cache the data.
http://www.pragprog.com/titles/memcd/using-memcached
You have to think about your content- very infrequently will anyone be updating it, it's all inserts. So you don't have to worry about normalization too much.
matt m on July 15, 2008 7:52 AMI wonder if one factor at play here is that it's very easy to write queries for de-normalized data, but it's relatively hard to write a query that scales well to a large data set.
I've seen numerous queries written by "professionals" that work well for small tables (As jeff says, everything's fast for small values of 'n'), but completely fail when scaled up to larger tables.
In my opinion, writing good SQL queries is dark voodo - not least since results can vary hugely depending on the underlying DBMS.
Thomi on July 15, 2008 7:59 AMI think that you should think about how it would be easier to implement the user interface, in the normalized, or in the denormalized form.
In the blog post (article), the user interface can be implemented much easier in case of the denormalized form, so I would prefer that one. Of course, if you can afford to make the application perfect (3 times more time), then the normalized form would be better.
"The biggest performance killer is so called physical read."
Thats a bit scary... I'd always assumed that reads from hard drives were slow... but not so slow as to create the "waiting for several seconds" issue when looking up a record...
I suppose in these big databases with loads of tables all "joined" together the sheer number of reads creates the problem.
You learn something new everyday, huh?
Jheriko on July 15, 2008 8:04 AMJeff,
I know you probably would rather not change the site too much. But I would love to be able to rate responses based on merit. Then we could hide the long list of comments by people who didn't read the post or who troll around looking to bolster their web traffic with insults.
I agree wholeheartedly, by the way, to this idea. I unfortunately was taught by someone who knew what normalization was supposed to mean but no more, and one math prof. who was thrust into a Software role because of his 1 year of experience doing consulting in 1994. He instructs all students that database must be 43NF and all logic outside of storage must happen in the application layer. No stored procedures, no denormalization, and views are sacrilegious.
Thanks again,
Raymond
You used "OR" ? No index for you. Next!
David B on July 15, 2008 8:21 AMWell, after reading all these comments it appears that most programmers still fail miserably at reading comprehension (or just didn't read the article completely). How long until StackOverflow.com is ready, Jeff?
Jason on July 15, 2008 8:38 AMWhat about an approach where the data is stored in and WRITTEN into normalized tables but have a DEnormalized table for READING from? You can have triggers update the READ table when changes are made to the normalized tables, or batch process it nightly or mroe frequently, checking DB activity first to make sure its low.
Mike on July 15, 2008 8:39 AMno matter what, the database is the slowest part of your application. period. fin. done. caput. The way to get near instant execution is to have the objects in memory and then just use the database for storage (ie read = from memory, write = from memory + write to DB). Then if your app restarts, it'll only be slow while pulling things from disk, but after that super fast.
Darren Kopp on July 15, 2008 8:41 AMThe obvious tradeoff in denormalizing your data is that your data structure has lost meaning and flexibility. I've worked on projects which required normalizing unstructured or poorly structured data, and it's often difficult or impossible to move in that direction. I'm surprised you're at that point already - I'd normally think you'd look at optimization once you've had a site running for a while and analyzed actual usage data. Regardless of the direction you're moving, materialized views are very helpful - a materialized denormalized view can give you the performance benefits of denormalization without messing up your data, and a materialized normalized view over unstructured data can reduce some of the pain in dealing with User_Phone_Business_Prefix. Have you looked at the sparse column feature in SQL Server 2008? That's supposed to reduce storage size (and thus increase performance through increased page efficiency) when working with wide columns in which many columns may contain null values.
Jon Galloway on July 15, 2008 8:42 AMIt has been said well by others already, but I feel strongly enough on this issue to add in my two bits:
Denormalization is the last possible step to take, yet the first to be suggested by fools.
And, yes, I just called both Atwood and Obasanjo out as fools.
It is dangerous to make that suggestion so willy-nilly, as people take posts by prominent authors as something they can rely on. Even hinting that denormalization should be anything but a last step means that these guys simply cannot be relied upon as authorities in this space.
So, for those still feeling unsure after reading through all of the above:
Views are not about perf/scale, they are about query clarity and/or abstraction of true data source & structure.
Materialized/indexed views can help perf/scale, but have limitations in terms of the underlying data structure.
Optimization must always be based on freshly-collected metrics, should go roughly in this order (with the order slightly adjusted based on your particular applications and databases and the issues you have previously experienced), and should be done one at a time with fresh metrics collected between each:
- remove indexes no longer deemed necessary
- add indexes newly deemed necessary
- simplify views where possible
- optimize queries, especially to reduce scans
- reduce application mis-use of the database (unnecessary and repeated queries are a common culprit)
- add/change application-tier caching of database information
- materialize/index views where appropriate
- optimize stored procedures
- create generated search/reporting tables
- investigate fundamental issues with the design of the database schema (data types, relations, etc.) but not from a "hey, let's denormalize" perspective.
You'll notice that denormalizing your fundamental schema isn't even on the list. It is too tempting and dangerous to put it there. True, it may be a last resort if all of the above fail, but if all of the above fail you are almost surely having the kind of problem that everyone would love to have: your application is going gang-busters, and you can afford to throw more hardware at your current model (start with increasing spindle count first by getting transaction logs onto their own spindle(s) and then by getting indexes onto their own spindle(s)) and if that still isn't enough then you are are into Google/Amazon/etc. territory and can get some doctorate types to start designing their own persistence model for you. They'll probably cook up something decidedly non-relational, and with much denormalization. ;) But if so, then why not just up and denormalize? Simple: The other possibilities need to be exhausted first, through repeated execution of the above list of steps, until you can be certain that _violating the correctness guarantees of your relational database through denormalization is worth it_.
Interesting blog post on denormalization patterns:
http://database-programmer.blogspot.com/2008/04/denormalization-patterns.html
Tim Hollingsworth on July 15, 2008 8:48 AMOne more benefit of normalization - variable number of fields. For example, if you denormalize the phone number field, how many phone numbers do you pick? 5? 6? Picking too few is very limiting and picking too many is a waste. In 1990, who would have guessed that a person could have 10 phone numbers associated with them? Home Phone 1, Home Phone 2, Cell Phone, Fax, Cottage, Winter Home, etc. Normalization solves this problem.
Mike M on July 15, 2008 8:57 AMGod bless ya, Jeff. As with everything in software development, a bit of critical thinking about why something is done and when tends to go a long way...
Kevin Fairchild on July 15, 2008 8:58 AMThere is a simple alternative to denormalisation here -- to ensure that the values for a particular user_id are physically clustered. In Oracle you would create a hash cluster and then create the five tables that include user_id in that cluster. Performance for that join wuld be excellent. I expect that other RDBMS's have similar structures.
The morals of the tale:
i) know your RDBMS
ii) database-agnostic design sucks.
[sigh]
The whole issue is pretty simple 99% of the time - normalized databases are write optimized by nature. Since writing is slower then reading and most database are for CRUD then normalizing makes sense. Unless you are doing *a lot* more reading then writing. Then if all else fails (indexes, etc.) then create de-normalized tables (in addition to the normalized ones).
I've never considered myself a DB(A) guy ... so reading some of this is interesting.
Most of the databases I've built myself haven't had the volume of traffic that really mattered much -- so I've always done what I thought was the 'intuitive' normalization and in cases where it was faster or easier to put some of it in a denormalized fashion, did so.
Good read.
N on July 15, 2008 9:15 AM> and table names are singular because each row in the table designates one specific person; one row = one user
This is another religious war. Should table names be singular or plural? I was in a company that mandated that tables should be singular, one to match the above quote, and second, if nothing else, so that you have 1 less letter to type for every table name.
However, I am of the opinion that tables, since they normally hold more than one item, are plural, and the object that holds a single row is singular. Collection objects that hold multiple rows should be plural.
Bob on July 15, 2008 9:31 AMSean Bamforth is on the money here.
You need a normalized logical schema. This is what the DBMS should provide to the application. The DBMS should denormalize as necessary to optimize performance WITHOUT exposing those denormalizations to the application.
The problem is that DBMSes don't do this. Sure, they give you indices and caching and materialized views, but they don't go all the way.
Robert Fisher on July 15, 2008 9:48 AMYour denormalized schema has lost all of the one to many relationships that caused the original need for normilization in the first place.
Normalize first, it protects data integrity and makes maintenance a breeze. If you hit the performance wall, then denormalize.
However, your example is still bad, because the denormalized table is nothing like the normalized schema.
@Jeff - You're prematurely optimizing.
@Leandro - Natural keys! I agree, and bless you for mentioning it.
In 25 years of software development as contractor and employee in all kinds of companies and industries, every performance problem I ever encountered with a transactional normalized data base was solved by adding indexes or fixing poorly written SQL statements in the application. In a few cases the data base was poorly designed, but normalization wasn't the culprit.
Don't fear normalization. Embrace it.
Charles on July 15, 2008 9:52 AMWith any normalization there are some trade-offs.
Assuming that each user has one screen name might be an acceptable restriction. Same for affiliation.
With a separate phone table a user can have as many phone numbers of as many types as I want. Not so after normalization. This might be acceptable, or not.
So, it the perception is that "normalization" changes representation without changing functionality, that's incorrect. The functionality changes might be acceptable, or not, but that should be a conscious decision.
Wow...
After I first read this article I wanted to comment remarking that the concept of denormalization being better was not new or all that revolutionary. People working in data warehousing and business intelligence have been practicing denormalization as an optimization it for years.
Then I read on and discovered all the loons and morons who think they know a lot more then they do about databases. People who claim that views are recomputed every time a query runs (that'd be you Steven Bey). Others bragging about their 400,000 pageviews a month websites (generously estimated, perhaps all of 12 pages a minute, woah baby Ben Mills).
Now, before I get too degrading here, let me just get to the point.
As others have already stated, databases have many different purposes. And the way you implement a small or large database is going to be radically different then a monstrously huge one. Denormalization is common in production real world systems.
Wow, I read back and am just astonished at the complete lack of understanding by the average commenter above.... Woah........
Paul Thompson on July 15, 2008 9:56 AMI don't do much (Any) db design, but I do a lot of software design.
It feels to me like there might be a compromise solution that involves creating your "Source/reference" database 100% normalized then using that to generate pre-calculated tables.
This sounds like even more duplication, but it shouldn't really be since you only modify your source tables and the optimized tables are 100% generated. This gives you the ability to regenerate the optimized tables any time, and also to change what you generate as you encounter challenges.
Is this approach at all workable?
Bill on July 15, 2008 10:07 AM>> People who claim that views are recomputed every time a query runs (that'd be you Steven Bey).
In Oracle a view is just a stored query. It is re-executed when it is queried. A _materialized_view_ in Oracle has assocated storage for the view result set. So mabye you and Steven Bey have a different understanding of the term.
Maybe the example in this posting is just a poor one, but the argument for denormalisation usualy comes down to avoiding the overhead of joins. As I posted above, in this case it's trivial to avoid that overhead with a hash cluster. Problem solved.
David Aldridge on July 15, 2008 10:21 AMFor a system like that I don't think there is much question normalization is best. You get the benefits of being able to store multiple phone no's, addresses, work history... for one user id record but let's face it you are almost always going to be finding records based on user_id which will be the primary key for most tables so it will be very quick anyway.
As you said with speed there won't be much difference either way especially when you factor in the duplication you get filling up extra 8k sql pages using non normalized. By the way LEFT JOINS jeff.
The only system I have denormalized for reporting is an electronic attendance system like below:
A register has a number of student ID's
A register has a week range between 1-53 (variable based on when it runs in academic year)
A register has academic year and register ID
By the time you get down to the student marks table it has:
Acad Period Register ID Student ID Week Mark
You end up with upto 53 mark records per student per register per year. Hundreds of thousands of records which are for the main part the key to the table. Pivoting the marks leaving nulls for the weeks the register wasn't active cut down the speed of reporting students with low attendance to about a 20th of what it was before hand.
@Jeff: You hit upon my speciality :) See, I said "speciality", not "specialty", because it sounds cooler.
Anyway, I write social applications for a living. And I've found what works best is:
1) Put all the indexable stuff into the users table, including zipcode, email_address, even mobile_phone -- hey, this is the future! :)
2) Put the rest of the info into a TEXT variable, like "extra_info", in JSON format. This could be educational history, or anything else that you would never search by.
3) If you have specific applications (think facebook), create separate tables for them and join them onto the user table whenever you need.
I actually go beyond this, and implement synchronization with other online identities, like OpenID and Facebook, but that's a bit overkill for this post.
Greg Magarshak on July 15, 2008 10:32 AMWhile this article is very good there are a number of things not addressed.
How is the data being used? Rapid inserts like Twitter? New user registration? Heavy reporting?
How one stores data vs. how one uses data vs. how one collects data vs. how timely must new data be visible to the world vs. should be put OLTP data into a OLAP cube each night? etc. are all factors that matter.
While it is lovely to normalize data, there is no one size fits all for all applications.
Enjoyed this post, a lot of these comments that treat normalization like a religion are quite the head scratcher though. I'm guessing they are from DBAs who are just barely more intelligent then a sysadmin. Learn to think a little.
The best approach usually ends up being a hybrid. You start off with a highly normalized database, work through the usual indexing and views to speed it up, and then start denormalizing a field here or there that require one too many joins.
Jim on July 15, 2008 10:34 AMSo, in essence Greg, you are throwing your hands and going "F this relational crap. I'll do it live".
Sorry, I don't buy it. Sticking serialized data in a column stinks like code smell.
Jonathan H on July 15, 2008 10:36 AMBottom line is - consider the future expansion of the site. Database design, like programming, is a trade-off. Are you going to introduce multiple X's for every Y? Then perhaps you DO need a join. What will happen when the most common operation happens? What will happen when you introduce a distributed cache, like memcache? You gotta use a cache...
Greg Magarshak on July 15, 2008 10:36 AMAh, I did see some sanity in the comments. Your LOGICAL model should *always* be fully normalized. After all, it is the engine that you derive everything else from. Your PHYSICAL model may be denormalized or use system specific tools (materialized views, cache tables, etc) to improve performance, but such things should be done *after* the application level solutions are exhausted (page caching, data caches, etc.
For very large scale applications I have found that application partitioning can go a long way to giving levels of scalability that monolithic systems fail to provide: each partition is specialized for the function it provides and can be optimized heavily, and when you need the parts to co-operate you bind the partitions together in higher level code.
John Lopez on July 15, 2008 10:45 AMIt might be possible to overdo it, but trust me, I have had 20 times the problems with denormalized data than with normalized.
With normalized data, I had a complex join with records in the millions each and had no trouble retrieving it in mere seconds from a 450 MHz server that they were still using. I just added the appropriate index (took about 45 seconds) and problem solved. That 2 minutes of work turned into a year of consulting because I fixed it so fast.
But don't get me started on denormalized data. Data is stored willy-nilly throughout the application when it should be a code table and then, inevitably, someone wants to do a search on it. Except that people misspelled it and added "The" at the beginning or ", The" at the end or put 2 spaces in it a couple times, etc., etc., etc. At that point, your ENTIRE dataset that the company has spent years gathering is WORTHLESS!
People don't care what you put into a database. They care what you can get out of it. They want freedom and flexibility to grow their business beyond "3 affiliations". (Oops, you wrote "select *" in your code, now you might have trouble adding columns.) And have you ever tried to write a query to find everyone in the same affiliation? Hope those are codes and not free text. And I hope the affiliations are indexed if you want any kind of performance on that search. But now you need 3 separate indexes instead of one. And the third one might perform poorly because it is mostly null and whatever codes are typical to people with 3 codes.
I could go on and on about the real-world problems I have faced with denormalized data, versus the lack of problems with normalized data.
Say what you want in your blog, but a lifetime of experience says that you are wrong.
And, I've worked with Dare before and he's a cool guy (so no offense Dare if you read this) but he writes tools for Microsoft programming languages and stuff. He doesn't maintain software with 10,000,000 row database tables on a daily basis.
PRMan on July 15, 2008 10:49 AMI wonder if these changes we're seeing in how important normalization is in applications isn't a reflection of the changes in the types of applications people are writing today. These days, with most web sites focusing on the infinite possible connections between people and other entities in a system, it may just be that a denormalized data model might make more sense right off the bat.
Thanks for the thoughts. I think I'll write more on this at <blatantCommercial>http://cozenedcognizance.blogspot.com/2008/07/what-kind-of-applications-are-we.html</blatantCommercial>.
Chuck on July 15, 2008 11:04 AM>> Chris Rikli: Multiple queries will hurt performance much less than the multi-join monstrosity above that will return indistinct and useless data.
Here here!!! You might look into this solution as well. I've had two "ugly" stored procedures as of late that I've refactored/optimized/whatever by simply going to the "core table" and doing the major selection criteria and then making a bunch of "small queries" to fill in the gaps.
Table Variables (SQL Server 2000 8.0 and above, sorry legacy folks (I have an app still stuck there 8^D) allow for this and are much better than temporary variables.
When dealing with this exact issue, I took an approach of using a normalized data repository supplemented with a denormalized view of the data. For data that does not change very often (how often do users change their phone number, really?), this is usually adequate, though you certainly run into problems of stale data unless carefully handled.
In my case, it was more than simply doing a multiple table join across a single database, it was doing a multiple database join across 4 completely incompatible databases: standard RDBMS, Chemical database (outside of the RDBMS), and Biological database (also outside the RDBMS)
Having the denormalized dataset allowed us to reduce queries that originally took 5-10 minutes to complete in < 10 seconds. It all depends on your particular application.
Steve on July 15, 2008 11:23 AMI always normalize. If a group of data (in your case user data) is being queried the same way every time, then either you over-normalized or you need to create a view.
Matt Ridley on July 15, 2008 11:30 AMI once wrote a CRM (digital Rolodex, like SalesForce.com) application and used it as an opportunity to use a textbook denormalized data model. It turned out to be a huge mistake because the model was wrong for the application.
For CRM, you're essentially storing business cards. A business card is fully denormalized: the basic unit is the card, and while you may search on a couple of different fields, users just want to view the card. While it makes some sense to have a separate table for phone numbers and email (since there may be multiple), there's only one mailing address.
Where I really got into trouble was in having an actual person/organization hierarchy instead of just a "department" column. Traversing arbitrary hierarchies in SQL is painful, and using a Java O/R bridge didn't help. Doing a search for "AT&T employees" becomes a recursive set of queries. What's more, 99% of the time, the department was blank or inaccurate anyway.
Which brings me to the next issue: half the time, the data wouldn't import cleanly. We'd get business card type data from our trade shows in Excel format-- essentially completely denormalized. There are a dozen different ways to spell "AT&T" (including, at the time, "Bell Labs"), and often company names were misspelled.
Eventually, we scrapped the project and switched to SalesForce.com, which uses a data format which is just denormalized enough to have a separate company table. I think it even has "Phone 1", "Phone 2" and "Phone 3". But it meant I didn't have to get pulled off my current project to do a custom data import after every trade show.
The moral of the story is that the correct (fully denormalized) data model is not the most general one, but the one that best fits the expectations of the users and/or your data source. Denormalizing any more just forces nonexistent precision. Business cards and Excel spreadsheets are essentially single tables, so using several tables to represent them can lead to problems.
David Leppik on July 15, 2008 11:43 AMI normalise, then have distinct (conceptually transient) denormalised cache tables which are hammered by the front-end. I let my model deal with the nitty-gritty of the fix-ups where appropriate (handy beginUpdate/endUpdate-type methods mean the denormalised views don't get rebuilt more than necessary).
Of course, a smart RDBMS would let you use… (drumroll) views…and optimise accordingly. I'd imagine this is the sort of thing that Oracle would do. I haven't checked.
Mo on July 15, 2008 11:53 AMI could be missing the point of the article, but I don't think he's suggesting that we skip past normalization but instead consider if it needs to be done.
What if he only plans to allow those three phone numbers for a user - Should that be moved to a separate table? What about only two? Or just one?
At what point do you stop blindly normalizing the table structures and consider if everything really needs to have a table of its own?
Doesn't seem like an unreasonable question to me...
HB on July 15, 2008 11:55 AMI don't know SQL, but your diagram looks a hell of a lot like what an ordinary Real Programmer would call a "record" with some "pointers". Obviously the "denormalized" version (one record per user) is better than the "normalized" version (many records per user) for pretty much all applications --- easier to read, easier to write, easier for the computer to deal with. So maybe this is some big religious issue to SQL clerks, but it's certainly a no-brainer to normal folks.
The real WTF, of course, is that you're apparently writing your own software to deal with tagging. Tagging is a solved problem, guys. You don't have to write code to manipulate tags anymore.
Anonymous Cowherd on July 15, 2008 11:57 AMMo: You should get out more, the two real RDBMs (SQLServer and Oracle) have views :)
Stop playing with mySQL.
Unfortunately, the SQL given in the article is not a meaningful representation of a real set of data about a single user.
In most cases, our user-retrieving SP would return multiple recordsets for the one-to-many entity-relationships, not an inner product of all the combinations.
Now if you are talking about returning data for many users for analysis, you are probably still not going to combine the data this way.
In the large data analysis situation, you can start to materialize views and temporary tables, etc., but the best thing I've seen is apply dimensional modeling to make a Kimball-style data warehouse give a very fast and easy de-normalized representation of the data. This is a read-only replica of the data in a completely different model.
Cade Roux on July 15, 2008 12:05 PMNormalization is all about reducing redundancy. Achieving great performance requires a balance; if you normalize everything "by the book", you will most certainly pay the price in performance. Leaving redundancies in key places can dramatically speed up your system because that helps the db engine optimize the indexes and run faster.
To the poster who just got out of the db module (course?) at the university: you will learn that Uni-level db is just the tip of the iceberg and that the overreaching principle in db design is that there is no overreaching principle.
BugFree on July 15, 2008 12:09 PM@Jim
That's exactly what you would say if you hadn't yet been bitten in the ass by a denormalisation decision taken years ago, that's making your life 20 times harder now.
With all of the wonderful technologies described here to mitigate the speed of large joins, I fail to see how anyone can reasonably argue that denormalisation in any form makes sense. I'm pretty sure the entirety of the RBDMS vendor community's brain cycles that have been spent thinking about this problem trumps anything from people who think sysadmins and DBAs are idiots.
Lewisham on July 15, 2008 12:24 PM@ Paul Thompson
>> Then I read on and discovered all the loons and morons who think they
>> know a lot more then they do about databases. People who claim that
>> views are recomputed every time a query runs (that'd be you Steven
>> Bey).
Many thanks for you kind words.
I don't profess to be a database expert and my initial thought had been to use a view, however, I had a question mark over my head about when the cached data (in the view) gets updated, so I did a little research, via our friend Google. The only place I found, stating that view data is cached, was on Wikipedia and I prefer not to take Wikipedia at face value. I did, however, find other articles that either explicitly stated or implied that views DON'T cache data (including the url that I posted, which is on a website called sqlserverperformance.com and written by a SQL Server MVP).
If you can provide me with a credible source, saying that view data is cached, I will happily read the article.
Once again, many thanks.
Steven Bey on July 15, 2008 12:47 PMJust to add that materialized view will involves caching of the data.
Odd on July 15, 2008 1:26 PMView data isn't cacbed, however the execution plans are.
Personally I thnk anyone who encourages denormalization to the extent that this article does hasn't worked with databases enough. Denormanlzation should always be an optimization, and not done by design, and a denormanlized database is *not* the same as database that was never normalized to begin with.
In many case I think materialized / indexed views will suffice in cases when you think denormalization is necessary.
Rick on July 15, 2008 1:33 PMYou changed the rules.
The "normalized" model allows for n number of user phone numbers, screen names, affiliation and work history. It also does not care about the order of those items.
The "de-normalized" model limits the number of the items listed above. It also force an order.
Only one represent the "business." Which is it?
rogueDBA on July 15, 2008 1:37 PMI didn't realise that denormalisation was controversial. I was taught to do it in university.
Phil on July 15, 2008 1:42 PMIf you are just mostly doing dirty reads and do not particularly care if the data is 100% consistent all the time then you gain little by doing a lot of normalization. In fact, you should denormalize as much as it makes sense to do so for read-heavy loads.
Most web applications are of this type and do not even require a regular database necessarily. We are just conditioned to throw that database there every time without even questioning it. That is pretty high in the brainless zombie score rating.
great post. simple is beautiful....
Steve on July 15, 2008 1:46 PMDenormalized data makes a mess of the application code: you end up having to do all kinds of funny work to make sure that their affiliation is added to the right spot (affiliation_1? affiliation_2?), and if you remove it, then you end up backtracking and rectifying how your assignments work. And that's assuming that you don't accidentally run into a case where someone has 4 affiliations.
This whole concept is physically painful for me to even consider.
Before you start denormalizing your database, you should talk to a DBA and see what solutions they might have. And step #2 would be to be smarter about pulling back data than slurping back the entire database via your ORM.
Between those two options, you should be able to get the DB performance you're looking for without hemorrhaging functionality and extensibility.
Robert Fischer on July 15, 2008 1:55 PM@Raymond Berg: I agree but I don't think Google can resolve a link in the form of "http://www.codinghorror.com/mtype/mt-comments-renamed.cgi?__mode=red&id=xxxx" to an actual site.
Juan Zamudio on July 15, 2008 2:23 PMJeff, IMO, more important than denormalization is how you design your indexes, and what type of hardware you're running on. IO is the big killer, not so much normalization. Denormalizing a table would be about the last thing I would try to boost performance. SQl server should be able to shred those queries, even with 20 million plus records in each table.
Rob on July 15, 2008 2:33 PM> Unless you are doing *a lot* more reading then writing
Aren't most -- perhaps in fact *all* -- websites in this class of application? Millions of reads, handful of writes.
> you should denormalize as much as it makes sense to do so for read-heavy loads. Most web applications are of this type and do not even require a regular database necessarily
My point exactly!
> View data isn't cacbed, however the execution plans are.
That buys you virtually nothing. Caching of execution plans is pretty much automatic for all modern SQL databases these days -- even for raw SQL. No stored procs or even parameterization of the query is required!
http://msdn.microsoft.com/en-us/library/ms181055.aspx
"When any SQL statement is executed in SQL Server 2005, the relational engine first looks through the procedure cache to verify that an existing execution plan for the same SQL statement exists. SQL Server 2005 reuses any existing plan it finds, saving the overhead of recompiling the SQL statement. If no existing execution plan exists, SQL Server 2005 generates a new execution plan for the query."
Jeff Atwood on July 15, 2008 3:57 PMAnother case where you would almost never want to denormalize would be a situation where you'd turn fixed-length rows into variable-length rows. If your table is all ints, you wouldn't want to replace them with varchars.
However, denormalizing to create fixed-length rows instead of variable-length rows (or "overnormalizing" to do it) could sometimes be a performance benefit, if your application is able to reference things by their integer ids instead of their string values.
-Max
Max Kanat-Alexander on July 15, 2008 4:31 PM@Jeff,
Your example of is wrong. This type of normalization will make designing the user interface, source code and the rest of the sql more complicated. For example to get all users connected to an affiliation will need 3 times the queries now then before. And what if you want to extend WorkHistory to 4 (or 10) in a later release?
I also cannot imagine you need all this information in 1 call over a large number of users. To get this data for a single user it doesn't matter much if you do a few more calls to the database. I assume that large queries probably go over user only, or user and affiliation.
Also the denormalized table isn't complete, it's missing 3 work_history_affiliations.
But some denormalization can be done on your sample without becoming denormalized:
Removing the UserPhoneNumber table is good, that table is kind of useless, users don't want to submit hundreds of phone numbers and it isn't likely there are is an index on phone_number.
UserAffiliation and UserWorkHistory can be merged. Maybe add a "type" field if needed, otherwise an empty company_name should tell enough..
So you woud end up with:
User
------------> UserScreenName
------------> UserWorkHistoryAndAffiliation --------> Affiliation
Why not cache denormalized sets of user data in a second database? That allows us the benefits of both.
Jonathan Drain's Dungeons & Dragons Blog on July 15, 2008 5:02 PMThis post (and the comments) raise some very good points (kudos for that), but manages to fall short a little in leading people down the right path in some respects.
There are a number of other ways that the example schema could be partially denormalized (something I don't think was made clear). While I don't think you could have done much more in a single post, perhaps this is something that could be covered in a future one. If you're up to it, that is, it sounds like the expertise revealed in some comments may reveal just how big a question you are trying to tackle. :)
Ben Garreros on July 15, 2008 5:19 PMUsing NHibernate lazy loading would allow you to just compose your object tree with the data you need at that point in time.
Say for example you needed to display the users name. No need to populate the whole aggregate (Affiliation, Work History...) just the root (User).
That should help cut down some of the joins.
In the end of course 'it depends'.
Stuart Cam on July 15, 2008 5:51 PMI have a feeling a lot of the commentors missed your point, which means they probably just skimmed over the post and read the bolded parts. True, the six join query was an exaggerated example, but still, the point stands.
There wasn't a call to denormalize from the beginning, people. There was a call to denormalize when <b>needed</b>.
Mix on July 15, 2008 6:20 PMThis post seems to suggest throwing normalisation away, saying that our professors are all wrong and the real way to design databases is to "just do what works", as that's how it works in the "real world".
What a horrible, cowboy attitude to DB design. I hope you don't design any real databases.
Data integrity is the most important aspect of a database. PERIOD. Performance is ALWAYS second. The CFO / CEO might be concerned if their database is running slowly, but they will have your neck if the data inside the database is garbage. Properly normalised databases are important because they enforce data integrity, rather than relying on application developers to enforce it.
You even said this yourself, though the point seems to have been lost on you when you said it. "And all those pesky data integrity problems the database used to enforce for you? Those are all your job now."
No.... just, no.
Relying on the application to enforce data integrity is the surest path to hell that I know of. That is the whole freaking point of a database. Hundreds of thousands of hours of development time and testing has gone into each database product to guarantee this level of service. There are thousands of features. And you want to replace all of that time and testing with some hand rolled garbage that has received maybe 100 hours of testing if you're really lucky? What an absurd idea. What a horrible anti-pattern.
Denormalisation is an optimisation over the normalised data. The DB should ALWAYS be normalised with denormalised components added later as scalability issues need to be addressed. The normalised part of the database is still the authoritative source of data, and the denormalised part simply aids in query execution. You can always rebuild the denormalised data from the normalised data which has had all of the data integrity rules enforced on it.
Come on. I know it's easy to hate on professors for not being "in the real world" and not having frontline experience, I hear this wanky sort of rhetoric all the time, but you might be surprised to know that they do sometimes know a thing or two.
Dave G. on July 15, 2008 6:34 PMBuggyFunBunny,
I was wondering how far down the comments it would be before we'd see a pontification from the "anti-knucklehead, database crusader".
Best thing about your comments are they provide a us with a beautiful reference point for one (extreme) end of the application design continuum. At one end we have persistence ignorant application design, and at the other "database is god".
Gee, I wonder which end you belong to? You are an extremist and as such your views need to be treated as such.
AJ on July 15, 2008 6:35 PMI keep my data normalized. I keep it all in memory. I use Prolog to do it.
screed on July 15, 2008 7:19 PMThe link for Findory and Amazon is: http://radar.oreilly.com/2006/05/database-war-stories-8-findory.html
Josh Parris on July 15, 2008 7:29 PM> Data integrity is the most important aspect of a database. PERIOD. Performance is ALWAYS second.
Always is a dangerous word.
Jeff Atwood on July 15, 2008 8:23 PMHmm. Does anyone expect a real discussion of tech at Coding Horror. It's as bad as /.
stupid db design (or at least incomplete), stupid sql.
If most smart people require whole books to describe the (wonderful) world of RDBMS's, do you really think _anyone_ here can have anything more than a soundbite contest?
I'm surprised some twat hasn't called using databases 'cargo cult programming' yet.
Stick to toys and trivia, that's what this blog is best at.
smashmouth on July 15, 2008 9:40 PMI 80% agree with that.
I was always telling my teachers that, trying to convince them they were doing it wrong.
I admit I didn't read all of it, but looking at the diagrams, I can tell you that when you have a 1 to 1 relationship, like UserPhoneNumber, then: yes, there is no point, and I would call it an error to have that table.
The only case where you would need to take it outside the [User] table, is if that [User] table grows out of proportion AND that you often need to fetch user info without his phone number. Otherwise, you are correct to bring his phone number into the main User table.
The tables left of [User] (and above and under) should have have stayed "outside" though. UNLESS you KNOW there will not be more than 3. Also, if you often need to fetch only the phone number of users, having it in a mini table will give you another advantage...
But if you don't need his phone number for example, for a query, please use SELECT [field names], and NOT "SELECT *"!! You will gain another millisecond there!
In the end, it all comes down to a case-by-case db design, there is no formula that will work everywhere with optimum performance.
Consider the number of users connected simultaneously, the expected size of the db, etc etc...!
dbadmin NNM :P
NNM on July 15, 2008 11:02 PM"...scalability is not your problem, getting people to give a shit is."
IOW, scalability is not a problem, until it is. Strip away the scatalogical reference, and all you have is a boring truism.
Case in point: Twitter. The idea of mentioning Twitter as a scalability success story is perfectly risible. People have actually stopped "giving a shit" (to steal an artless phrase) because of their performance problems.
While I agree that normalization is the way to teach, as a design principle (RE: SomeGuy), but it would be nice to see optimization issues mentioned more often when talking about "best design principles" in Computer Science.
James Pearce on July 16, 2008 12:26 AMATWOOoOD!!
(A cry of frustration from a DBA stuck in legacy de-normalised hell)
I love your blog Jeff, but I have to say that when I first read this post my first reaction was one much like the one you yourself had after reading about Wasabi. If this post leads to even one more de-normalised database in the world, is is already evil. Anyway - before I had time to comment, most of what I wanted to say (and more) had already been said (rudely, even).
There is now no real point in adding another voice to the thorch-wielding pitchfork DBA posse knocking on your door, so instead I want to make a suggestion: Eat my shorts. No, wait, sorry, that was frustration again - I want to suggest this:
In the spirit of the ideas behind stackoverflow, and in order to spread knowledge and learn, let us help!
Instead of "going dark" about this, challenge the community! Make posts with examples where you feel you will run into trouble, and allow people to comment. Let the people who throw the tomatoes here put up or shut up! *Hides tomato behind back*
"For thisandthat purpose in stackoverflow, I need to have A, B, D, (D*X-4) and H from these six tables (ill.), and I need to handle a load of about 4 executions per second. My current plan is to jam everything up into one huge table because then the execution time drops from 345 ms to 87 ms. I need at least y, preferably less. Comments?"
I hold (not even weakly) the strong opinion that normalising is not a collective madness, it is not like the religion of the angle brackets, (XML), it is The Right Way(tm). I am confident that in the lifetime of Stack Overflow you will NEVER encounter a situation where denormalising is the right thing to do.
Let us prove it to you!
OJ on July 16, 2008 1:11 AMThe basic decision on the initial design is simple just ask yourself these two questions before you begin.
1. Is my Site OLTP? If the answer is yes then Normalize.
2. Is my site OLAP? If the answer is yes then De-Normalize!
Simple as that.
If your answer is Yes to both questions then you need 2 databases.
If your answer is No to both questions then you need 0 databases.
"This post seems to suggest throwing normalisation away"
Seriously? Sometimes I wonder if I'm reading the same blog as most of these commenters. This is certainly not the first time either.
Ricky Dhatt on July 16, 2008 2:17 AMI'm with OJ - let us build it :)
And to all the folks suggesting copying data into a denormalized table via nightly batches or triggers etc...
You are describing a materialized (indexed) view. The functionality exists in all mature rdbms. Your idea has (unsurprisingly) been discovered by rdbms architects and implemented.
I wonder how many people read the links before posting.
Just WOW. Thanks for all those links, I think I'll spend some serious time reading all those entries on highscalability.com.
Oh and i've had similar experiences of denormalised awesomeness with high-throughput systems. Data was renormalised every 24 hours and just simplified everything.
Sure, it's not the first thing you should do but it is an option.
This is a dangerous article, or perhaps you just haven't seen the number of horrific "denormalised" databases I have. People use these posts as excuses to build some truly horrific crimes against sanity.
Normalise first, denormalise if you have to. If you can't normalise then don't even try and denormalise, get someone else to do it.
A six-way join on 100,000 row tables is better than a single query on a 1,000,000,000 row table. I've been there. It wasn't fun.
Denormalising correctly requires more skill than normalising correctly. Just because Flickr duplicates data doesn't mean that wholesale denormalisation is automatically a good idea.
Some sort of middle-ground: temporary staging tables, very heavy caching, etc. will gain significant performance advantages without having to corrupt your underlying data.
I don't think I've ever seen a performance problem caused by a normalised database - not when such things as caching have been used correctly.
AbGenFac on July 16, 2008 4:02 AMNice comments. I think a lot of people who understand databases better than straight coding (myself included) get annoyed with coders telling us how to write databases.
It seems to me Jeff hasn't got a lot of database experience, and is saying "don't bother with normalisation cos it's difficult to understand".
The article is also confusingly written, criticising normalisation due to concerns over scalability, and then saying that scalability doesn't matter.
If you understand normalisation - you know when to use it and when not, because you know what the costs and benefits are. If you don't understand normalisation, but you need to store a lot of data in a way that can be queried and analysed, you had better learn it.
Andy on July 16, 2008 4:10 AM>> Data integrity is the most important aspect of a database. PERIOD. Performance is ALWAYS second.
> Always is a dangerous word.
hehe. The gravitational constant is ALWAYS 32 feet/sec/sec. ALWAYS; and anyone designing an aeroplane who chooses another value fails.
This is the crux of the contention: coders who view the database as a code widget to manipulate versus database wonks who view the database as primal and the coders as Huns out to rape, pillage, and destroy the data. Coders not only want what is rightfully theirs: make the data look pretty to the user, but also to control the data in client code. Always a bad idea.
Coders refuse to accept that Codd's 12 rules are even rules, much less laws of nature. Database wonks realize that they *are* laws of nature.
Maybe this has already been said, so I apologise for not reading all 212 comments...
Be careful not to confuse a denormalised database with a non-normalised database.
The former exists because a previously normailsed database needed to be 'optimised' in some way.
The latter exists because it was 'designed' that way from scratch.
The difference is subtle, but important.
Bob Armour on July 16, 2008 5:47 AMYou don't optimize anything -- code or your database -- without first understanding exactly where your performance gains will be. Use object oriented design when you use a database. I've seen people create all sorts of objects, then put direct SQL calls in the middle of their code. Grrr.
Remember the whole MVC style of programming? THIS IS WHAT IT IS ABOUT! Write your code to be easy to maintain and with good design. Push all database context to object models. That way, when you restructure your database, you don't have to recode your entire application.
Once you've finished coding, you can start optimizing both the code and the database. A good SQL database has all sorts of tricks to make what seems like multiple table lookups to be a single operation. Deciding not to normalize your database because you think it is inefficient is a mistake.
In the mid-70s (yes, I'm THAT old) I worked on a rather popular computer called the Cado. The Cado could do things that took other computers costing 10x as much. It was the first computer that many businesses could actually afford.
The secret was the architecture: Cado made the assumption that hardware cost much more than software, so they optimized the hardware and that optimization was reflected in the OS (it had no file allocation table -- we tracked that by paper and pencil) and the software which required developers to use scant resources and write their programs in 256 byte segments. This allowed four users to use the Cado in only 48K of memory.
You heard of Cado? Nope, they went into a tailspin in the mid-80s because the primary assumption of hardware cost vs. software cost ended up being wrong. By the 1980s, hardware costs dropped and it was software that became the major cost. Overly optimizing the software meant that the programs were almost impossible to maintain. It also meant that programs were written for specific clients, so that you had 20 programs for 1000 clients. Meanwhile, software written for maintenance, scalability, and configurability could be written for millions of clients. Thus, spreading the programming costs over a much larger sales base.
By the mid-80s, the specialized hardware needed to run the Cado programs meant you couldn't take advantage of the PC mass market. A PC would cost $3,000 while the Cado hardware would cost $15,000. Cado programs would cost thousands of dollars while the equivalent PC programs could be had for $600. Hundreds of Cado VARS went belly up as their sales disappeared. No one wanted to spend $20K on a system that was slower, more buggy, and harder to maintain than an equivalent $7,000 PC based system.
So, don't do what Cado did and make stupid optimization assumptions. Write good code and then test where optimization might actually do some good. Otherwise, you'll be writing yourself in a corner.
I recently worked on a project that had a distributed database where inventory items were stored in local databases. To know which distributed table to fetch the item from, the object's item number would include the location where that item was stored. You knew the item number, you knew which database where the item was stored, you could do a single lookup to fetch that item.
What a great idea until inventory items were moved from one location to another. That meant you had to change the inventory number of the item which meant rewriting the entire inventory table, all transactions (since the inventory item number was in the transaction), history tables, customer tables, etc. It ended up that about 2 to 3 dozen items would move every month. So, every month, the entire database which was distributed in over 60 locations would have to be updated. So, that one single database optimization technique cost us several days of down time each and every month.
David W. on July 16, 2008 5:58 AM@BuggyFunBunny
>The gravitational constant is ALWAYS 32 feet/sec/sec. ALWAYS; and anyone designing an aeroplane who chooses another value fails.
From http://www.mathpages.com/home/kmath054.htm
"So far we have assumed that the acceleration of gravity was constant
over thr range of interest, but in fact there is a slight reduction
in g as we go up in altitude. This can be significant when dealing
with the energy of an object ..."
Also, Codd's rules say nothing about database design and everything about Codd's vision of RDBMS (that system that hosts the database).
Your points about Code Wonks vs Data Wonks is well taken.
As a Data Wonk, it is my view that: those Code Wonks that don't know how to View Execution Plan do not get a say in database design.
@David B
I also checked wikipedia; the differences I checked all worked out to 32/sec/sec (rounded, standard). But, OK.
I've always seen that design (normalization in this thread) can be inferred, albeit distantly, from:
rule 0 -> The system must qualify as relational, as a database, and as a management system.
rule 9 -> Changes to the logical level (tables, columns, rows, and so on) must not require a change to an application based on the structure. Logical data independence is more difficult to achieve than physical data independence.
rule 10 -> Integrity constraints must be specified separately from application programs and stored in the catalog. It must be possible to change such constraints as and when appropriate without unnecessarily affecting existing applications.
rule 12 -> If the system provides a low-level (record-at-a-time) interface, then that interface cannot be used to subvert the system, for example, bypassing a relational security or integrity constraint.
Rule 10 came to be embodied in DRI, unique constraints, check constraints and the like.
Rule 12 tells the coders to keep their grubby paws off.
BuggyFunBunny on July 16, 2008 7:13 AMYour example is probably the stupidest case for denormalization I've ever seen. You didn't even touch on the problems of one-to-many relationships.
If it's one-to-one, denormalize.
If it's one-to-many or zero-to-many, normalize.
If it's one-to-(specific number) it depends on how large the data you may be repeating is, and how scalable you want to be.
A normalized database is good for inserts, updates, and data consistency. A denormalized database is good for fast selects, and works well if the table structure is close to what you’re reporting or loading into memory. That’s why it’s reasonably common to have two databases, a transactional one that’s relational, and a read only reporting database that gets updated from the relational database either by triggers or by a daemon.
If you’re reading lots of data, and you have a decent number of joins, you might be better off reading the data a table at a time, and doing the joins in memory. I’ve worked on systems where simple selects were essentially “free”, and most of the cost was the network (100 Mbit from the client). If you have several 1-n or n-n joins, the cost of reading the extra rows over the network will dominate the cost of the joins in a normalized network. In those cases, you’re better off caching “static” data locally, and doing the join on the client.
I was intriged to read that google "gql" doesn't allow joins. I guess they figure they have hit the scalability wall - so instead of denormalizing later, they made it part of their architecture. Or - maybe the architecture they are using forced it. Who knows.
john on July 16, 2008 8:03 AMHmmm. I would break the interested_in out into another table just in case your user is interested in more than one thing.
Azathoth on July 16, 2008 9:58 AM"hehe. The gravitational constant is ALWAYS 32 feet/sec/sec. ALWAYS; and anyone designing an aeroplane who chooses another value fails."
Actually, being pedantic, that's the value of gravitational acceleration near the surface of the Earth. The gravitational constant (G) is (more or less) 6.674 x 10 ^ -11 m^3/(kg*s^2)
John,
The reason GQL doesn't support joins is that it's not a relational database..
Shane on July 16, 2008 12:29 PMIf you have to consistently join many tables to get related data, then just stick a stub key in there and voila ... faster performance, less joins.
Scott Fitchet on July 16, 2008 12:42 PMGood post about denormalization. I tend to class myself as a zealous software engineer who likes to stick to the software/database principles that I was taught in university, but through working with clients and real world databases there is definitely a place for denormalization.
I would be interested in your views of keeping the integrity of the denormalized fields. Here is a page discussing this http://www.ssw.com.au/ssw/standards/Rules/RulestoBetterSQLServerdatabases.aspx#DenormalizedFields
Eric Phan on July 16, 2008 3:53 PMTo pick up on Eric's comment that denormalisation is for reporting, reporting is just querying a database. Why then are there "accepted" design patterns for reporting, e.g. star and snowflake?
We have them because reporting has to scale to query much higher numbers of probably distant (i.e. many joins) data. So we denormalise into facts and dimensions.
The change that has come about in the last few years is that many internet scale applications (especially social ones) have reporting style operations at their very core. Look at the problem of aggregating data from a social graph. In the abstract this is a reporting style problem and therefore some level of denormalisation is the only way to get data out quickly.
Now we face the further issue that unlike traditional business reporting systems, sites like twitter and facebook cannot wait for the queries to return over a timescale of minutes, they need it in seconds or less. Which has led them to implement denormalisation on a massive scale.
After several years of companies starting with a tight normalised model and iterating to a denormalised model (see the fantastic O'Reilly series linked in the main article) it's right that people start looking at this wealth of experience and asking the question "When is it right to _start_ with a denormalised model?"
Matt on July 16, 2008 4:44 PMGreat article Jeff!
@Ariel
Re: "james: are you serious? Tables don't have 1 row in them they have many. So you make them plural (1 row singular, multiple rows plural). Rule: table names are plural, column names are singular."
Singular table names are completely acceptable and often preferred by OO developers using object-relational mappers like Hibernate. One would say the User table contains instances of class 'User'.
Class name to table name mappings are predictable and reliably generated if single names are used. Think in OO terms and you will never use a plural table name again.
Singular table names eliminate having to know anything about natural language pluralization rules (and exceptions) like Category - Categories, Criterion - Criteria, Radius - Radii, Person - People, Mouse - Mice, Index - Indices. (More exceptions than there are rules, and that's just in the English language).
There are no conventions for naming join tables when plurals are used for table names. Would it be UserAffiliations, UsersAffiation or UsersAffilliations? If it is not UsersAffilliations then why did one of the names suddenly become singular and one plural? If it's UsersAffilliations then grammatically that name is wrong too unless you add the possessive apostrophe making it Users'Affilliations. It's a whole bowl of wrong. Avoid plurals altogether and keep things simple.
Michael Hanney on July 16, 2008 5:20 PMThe bottom line here is I/O. If the normalization means doing 6 I/Os for a single person, versus one I/O in the single-table solution, it still isn't a significant difference in terms of the wall clock if you're just going after one person. Crank through a million people though, and it is a different story.
Please keep up the awesome blogging Jeff, you are always a good read!
Thanks!
rcs
"When is it right to _start_ with a denormalised model?"
Answer: never.
Yes, I know always and never are dangerous words. I use them here nonetheless.
If your application starts out with a denormalised model, it is going to be much more expensive to develop.
You have to make sure you manually keep track of all of the places that data needs to be updated, changed or extended. The likelihood of introducing bugs increases, which forces you to dedicate more time to testing and fixing these bugs. Furthermore, you make it much harder for developers new to the system to understand how it works because they don't have a simple model to follow - they have to wrap their head around a mess.
It is silly to spend all this extra time, effort and cost on something when you don't know if it will even pay off. What if the requirements change? A seemingly small request from your client could actually turn out to be a huge amount of work simply because of your implementation. Somehow, I don't think the client will accept your estimate of 2 weeks for adding a second group of contact details to the user's profile.
Knuth said that premature optimisation is the root of all evil. He was right. You should optimise only when you can identify tangible areas of an application to optimise through profiling. Otherwise, you are wasting your time because your "optimisations" may never have been needed in the first place and you're just creating additional work for yourself.
Dear god, please don't let there be a generation of database "designers" who think this is the right way to do it.
Dave G. on July 16, 2008 8:35 PMGoogle App Engine's designers say you should denormalize a Google App Engine database and should avoid relational ideas such as normalization.
For this I abandoned Google App Engine. The Greater Fools on their newsgroup continue to discuss exactly _how_ to achieve this.
tndal on July 16, 2008 9:11 PMimho, if you’re need to denormalize your database for performance you're using the wrong DBMS.
I've read once the story about the development and database design of flickr and they were forced to denomalize and replicate data because they were using a database engine lacking features they need.
So, denomalize if you need more performance but don't have money for appropriate hardware and DBMS.
i can denormalise from a normalised db in 1 (long) line of sql. i can't say that i can normalise in 1 line of sql.
but then i'm just a dark little that burns dimly next to some of you frikkin geniuses :P
TooLateNoFriends on July 17, 2008 4:20 AMShane - of course it's not relational. <s> I think they use a giant excel spreadsheet to store data from google app engine.
john on July 17, 2008 6:35 AMIt's a tough nut to crack. I prefer to normalize simply to avoid data inconsistancy and logically it makes sense. Data updates are also easier in the normalized tables instead of inside the denormalized tables. But I do agree that using SQL to extract the data is crazy mad! I'm a newb to SQL and trying to figure out the complex joins is rediculous! Perhaps there is some sort of hybrid or way to meet normalization in the middle.
Mike on July 17, 2008 6:42 AMwow - there are a lot of miss-imformed IT staff out there
TJ on July 17, 2008 7:11 AM@Michael Hanney:
[quote]Singular table names are completely acceptable and often preferred by OO developers using object-relational mappers like Hibernate. One would say the User table contains instances of class 'User'.[/quote]
I don't agree. The table USERS contains instances of class User. Hibernate is perfectly capable of mapping a classname in single, like User to a tablename in plural, like USERS.
Next thing you are telling us is that we should use camelcase in database object names..
Not the best post post Jeff, I see your sentiment, but some poorly chosen wording and a silver bullet solution have turned it into a battlefield.
As one poster said, I recommend you post some actual situations with your optimisations shown and explained.
Simon on July 17, 2008 7:39 AMI have to admit - I am a tables are singular type dude and for one reason only.
I use a self written code generator to create my classes, collections and data access layer as well as user controls and stored procs.
Having singular table names means its just a bit easier to programmatically generate all that stuff!
Mauro on July 17, 2008 8:34 AMWhere I work we use automated object persistence to store everything. On the up-side, it takes 2 minutes to create a database for the app, and it is ALWAYS correct as it is built for the classes. On the down-side, EVERTHING has a table.
A User record is linked to a user data record in the user data table. The user class has a base class, so the user record is linked to from a user base record. Ad nauseum...
If you want to manually run a SQL statement to check something (as we sometimes do during testing) you need a second brain, a gallon of caffienated drinks, and a solid hour of JOIN statements.
...and if you can't handle table names/classes/properies/variables with both plural and singular names you should go back to writing HTML.
Honestly, as long as it works and the name relates to what it is (user or users) then job done. If you are getting flustered by the grammar of your code then you are not fit to use a computer.
Use whatever you want, test as you go, and don't whinge about "bad" code when you can't understand.
stEvil on July 17, 2008 8:58 AMI've read stories of DB/2 databases that appeared normalized to users, but in the back end several tables were largely denormalized with the 'normalized' world appearing primarily as views.
Help me out: Am I remembering this correctly, or did I miss a key detail?
Michael T on July 17, 2008 12:50 PMPragmatic is just another word to thinking and honestly trying to solve a problem, not waive a flag.
I have 2 questions though:
1. For a second there I am questioning my understanding of the concept of DB normalized design. Why on earth is the userScreenName a separate table? if 2 tables have 1:1 relationship, and each row in X has a row in B, what is the point to separate them? (maybe I am missing the business context of what the table means...)
2. how did you create this diagram? doesn't look like the SQL 2005 diagram or is it?
Dali on July 17, 2008 1:15 PMWhile I lean towards
User {UserID, UserName,Active }
Affiliate {AffiliateID, AffiliateName, Active}
UserAffiliateJoin {UserID,AffiliateID} -> it's a m..n (many-many) join
!And I value surrogate keys over natural keys (mostly because natural keys are misidentified and change, and cascade update is lame)
Whatever you choose, be consistent! It allows you to write code that can infer relationships, and makes your code, queries, etc look better. There's nothing worse than databases with a hodgepodge of intermixed styles for naming (both table and column). It makes it terribly hard to read and understand for everyone involved, and you can't infer anything via code or in your head.
PS Dali, I think
UserScreenName
- badly named. It looks like a join table.
- it's meant for 0..n external system names. MSN, AIM, Yahoo, Facebook, etc?
- poor naming of fields and is misleading
- What's an 'im' anyway? if it's supposed to be external site id's, you'd think that im_service would be better replaced with another foreign key e.g. ExternalServiceTypeID.
e.g.
ExternalSystem { UserID, ExternalServiceTypeID, ExternalUserName, Password (yeah right) }
ExternalServiceType { ExternalServiceTypeID, ExternalServiceTypeName, HomePage, LoginPage, Description}
et cet er a
I'm just a coder, but I try to code databases that don't suck.
locke on July 17, 2008 8:08 PM@Michael T
In the Mainframe DB2 world (as quite distinct from Linux/windoze), most use of the database is retro-fit over (under?) 1970's era COBOL programs which are written to VSAM semantics/structure. I live in that world for the moment. Nothing is normalized.
At a lower level, DB2 on the Mainframe is a layer over VSAM, even today. The zealots will say that the DB2 VSAM files aren't supposed to be visible to casual VSAM calls. Nevertheless, the semantics of storage is VSAM.
BuggyFunBunny on July 18, 2008 6:51 AMThis may have been mentioned already as I'm late in reading this article and there are a tone of comments but I'm a bit suprised that more developers haven't worked on some Business Intelligence and Data Warehouse solutions. A schema set up for reporting and analytics is highly denormalized in terms of performance because you typically end up working with large amounts of business data. Reading Kimball and knowing how to model dimensionally should be part of the developer's tool set.
http://en.wikipedia.org/wiki/Dimensional_modeling
@Jeff
>> Unless you are doing *a lot* more reading then writing
> Aren't most -- perhaps in fact *all* -- websites in this class of application? Millions of reads, handful of writes.
>> you should denormalize as much as it makes sense to do so for read-heavy loads. Most web applications are of this type and do not even require a regular database necessarily
>My point exactly!
But you don't need to denormalize to solve the problem. Caching is (usually) the answer. There are cases where you may want to denormalize, but there are a number of options before you get to that.
Grom on July 18, 2008 9:11 AM@Jeff Atwood
@As the old adage goes, normalize until it hurts, denormalize until it works.
ha. "old adage". what a crock.
@James Pearce
@"While I agree that normalization is the way to teach, as a design principle (RE: SomeGuy), but it would be nice to see optimization issues mentioned more often when talking about "best design principles" in Computer Science."
See, what you call "optimization issues", I call good design. There is a difference between negligent design and avoiding premature optimization.
John "Z-Bo" Zabroski on July 18, 2008 9:59 AMseriously- how many of the commenters here have actually worked on a system with large amounts of data? A recent system of mine inserted around two billion rows per day into it's larger tables- indexes, partitions/sub-partitions, and summary tables are a given; but we also saved a small percent of the data into duplicate tables during the initial loading process in order to retain them longer than they are retained in the main tables- delete statements are too slow, we drop partitions by date as the data ages out; We also don't allow foreign key constraints on the large tables (though of course we list them on the logical data model) because we can't afford the performance loss during inserts. I'm always interested to hear how others have dealt with performance issues, I just hope all of you out there bashing Jeff for this post have dealt with large-enough systems to have a basis to comment...
anonymous on July 18, 2008 12:56 PMI've experienced many databases with varying degrees of normalization.
I've enountered performance issues on both sides. I've seen cases where overly denormalized databases caused so much redundancy we hit serious performance issues.
I've seen cases where normalization purists have imposed a database design on users that had a substantially negative impact on business operations because performance went in the tank compared to previous systems. (And yes, they were specific cases, and the problem was resolved by denormalizing a small piece of data.)
My general rule is to design normalized. Then, when a need comes up to stray from normalization, justify each need (with data and business requirements).
The database design presented in this article is fairly simple. I've only run into problems with normalization when the design is more complex, and the volume of data is large.
John B. on July 18, 2008 2:42 PMerr, anonymous-
That is kind of what makes Jeff and Dare's fictitious example pretty strange. It's a "simple database schema for a generic social networking site", yet it doesn't include any timestamps. Social networks tend to be designed around streams of inforamtion, not sets of user data. Isn't that what the folks at Twitter took forever to learn, especially when they said "Rails doesn't scale" when primarily it was their system eventing model that blew?
John "Z-Bo" Zabroski on July 18, 2008 3:08 PMAn entire post dedicated to an issue that is (within the circles of anyone with a clue) <i>entirely</i> without contention?
Way to pander to a clueless cloud for blog 'ratings'...
Long-time fan of both yourself and Joel for flipping through on a (very) brief skim-the-post basis, but now that I'm actually putting in time to read the podcast, etc...
You really need to learn how to recognize WHEN you need to STFU and let Joel speak. OH so often he's giving perfectly good advice based on experience and nobody can hear because you're busy desperately scrambling to restate everything you've said to fit his opinion... =(
Jay Lynch on July 18, 2008 5:09 PMLet's face it, denormalizing and normalizing are necessary evils in DB design. I look at the first tables given and unless the client requirements are to allow a user to have an unlimited number of phone numbers and/or unlimited number of usernames, you could very easily NOT normalize those two requirements into tables. Thus you limit your joins and increase performance.
Overall it seems that only people that haven't really worked on databases, that are stronger in theory than practice, do they beat a database to death with over normalization. The argument could also be made that normalization increases performance if you know your database enough to pull only what you need and not everything and kitchen sink.
An example would be the affiliation information, you wouldn't need to join all the tables in that example if you are pulling just user information, you would only need to join the user affiliation table if that user is allowed multiple affiliations. Even better if he/she isn't allowed more than one, then you can put that in the user table too! ;-) My point is very simply if the data follows the business requirements and the developer using the database understands the structure properly then a level playing field of both normalization and denormalization should come to play and you should have a pretty decent mix of both structured data and performance.
But hey, what do I know, I'm just a guy who reads blogs.
Great Post dude, love the discussion that it brought up.
Thomas on July 20, 2008 10:11 AMHard drive space is cheap, programmer time and running time is not.
If you're only normalizing to avoid blank entries in a table (like the optional phone number table you're using), there's no point in normalizing. Really. Just let the table have a blank entry in it. How many entries do you think it will take to make a difference when hard drives are measured in TB now?
Normalizing should mainly be used only to avoid duplicate data. If you are doing a social networking site and you have two duplicate entries to make two people friends, then you are probably doing it wrong. Having duplicate records makes coding and running a mess, and should be avoided.
Bill K on July 20, 2008 10:00 PMWell, I haven't touched a DB since I changed jobs last year, but this is one good reason why a DB designer should also be writing queries and optimising them. That way he has to feel the pain too.
To my mind, if your normalised table is ONLY EVER queried in a join to another table, that's a good enough reason to denormalise.
And of course, optimisation depends on usage. If you're updating a table as often as you're reading it, that's a completely different scenario to the typical write-slow, read-fast environment.
Ben on July 21, 2008 2:25 AM> You need a normalized logical schema. This is what the DBMS should provide to the application. The DBMS should denormalize as necessary to optimize performance WITHOUT exposing those denormalizations to the application.
> The problem is that DBMSes don't do this. Sure, they give you indices and caching and materialized views, but they don't go all the way.
Oracle has had a feature (since 8i IIRC) called query rewrite, where you can write your query against the normalised schema, and Oracle will figure out that the same query can be performed more efficiently against materialized views (if they exist) and hit those for the data instead.
Dave on July 21, 2008 5:41 AM1. The example was poor
2. Even if the example was good, you've provided no context for optimisation via denormalisation.
What your post *seems* to be suggesting is that optimisation (specifically denormalisation in this case) is something that takes place in design. To me, optimisation is a consideration of circumstance. What value is there in attempting to second guess those circumstances? None. Particularly in the case of denormalisation, since if - god forbid - your assumptions were wrong, reverting back from denormalised to normalised is so costly.
Basically you set no scene, gave no re-design requirements - and as a result you are sending out the wrong messages to impressionable developers.
A final point: who is this post aimed at? With all due respect; anyone responsible for optimising a database to this degree will not be looking to your for advice.
I think this post is misleading at *best*. Sorry Jeff.
Mike K on July 21, 2008 6:48 AMNever say never. <s> I've done my share of denormalizing here and there - but never from the get-go.
I think the main thing here is - there probably *is* a situation where you must architect with denormalization from the start because this is version 25 and all previous versions blew up the database. I think it's a fair guess to say that most of us will probably never work on something so huge to have to start with denormalizing in mind.
FWIW - Once we hit a scalability wall - ended up dropping a foreign key to hit the clients performance metric (they hit us with an automated perf tool) and that table has lived for 8 years without a foreign key since. It always bothers me.
Holy cow, about the flame war I was expecting.
For the normalization crew: I have never seen (and I expect I never will) a fully normalized database. Consider table User, in the example above, is not fully normalized. You have "first_name" and "last_name".
A fully normalized database would not include "first_name", except as a pointer to another table. Fully normalized tables can't contain repeat information, and you'll have several "James" people. Same thing as "last_name", multiple "Smith"s. For that matter, you can't have "phone_number"; the area codes and prefixes replicate. Zip codes are a five-digit primary and a four-digit extension. Any information which itself represents a join has to be broken up. BCNF isn't fully normalized. Do you do that? No? Okay, then you're not operating in a normalized database. Get over it already.
@ BuggyFunBunny
"The gravitational constant is ALWAYS 32 feet/sec/sec. ALWAYS; and anyone designing an aeroplane who chooses another value fails."
No it's not. The gravitational constant is only that here on our good ol' terra firma. If you're building an aeroplane that may perhaps be used in the future on some other planet, you ought to take that into account when designing the beast. The flip side to this is most likely, you're not going to ship your plane off-planet. The degree of normalization with which you feel comfortable is (or rather, should be) contextual, not absolute, as well.
From a higher viewpoint, complete data normalization is a theoretical construct. Reality gets in the way of theory. Even if you went to the trouble of building out your tables fully normalized, you're relying on the false assumption that the data itself is presented in a canonical fashion; it's not. "Jim", "James", "Jimmy" might all be "James", or they might not be. The user- or institutionally- provided data that populates your database is coming in without consistent presentation.
All that aside, the "Och, Jeff, be careful what you blog about!" commentary is well advised to be heard. Messing with your data structure is something that should be undertaken with supreme care. In practice it is several orders of magnitude easier to de-normalize a database than it is to normalize one that is a giant mess. The "undo" button here is buried under a mountain of pain. Be forewarned, here be dragons.
Pat Cahalan on July 21, 2008 5:53 PMAs long as each of your joins is using a proper index, the query optimizer will take care of the performance. Unless you plan on having millions of users, stick with the normalized design...
Chuck on July 21, 2008 7:21 PMEntry has been up for a week and still no post from --CELKO--.
Ian Boyd on July 21, 2008 9:23 PMFor those who are interested in what "real database" folk have to say about it (beware: they are Brits, and SQLServer zealots to boot):
http://www.simple-talk.com/community/blogs/tony_davis/archive/2008/07/21/63094.aspx
BuggyFunBunny on July 22, 2008 5:46 AM> http://www.simple-talk.com/community/blogs/tony_davis/archive/2008/07/21/63094.aspx
Really not much content and what there was suspect. Normalization can't produce anything that is logically correct be normalization is all about form, not semantics. Normalization produces something in the right form and the form has certain attributes. Logically the model could be gibberish because the semantics aren't addressed at all.
Abnormalized on July 22, 2008 10:11 AMThis seems like an argument between stereotype application developers and stereotype database administrators.
The stereotype system administrator would point out that social networking is store-and-forward and point-to-point. Get those parts of your design wrong, and that will be your bottleneck, folks.
For what it is worth, I would err on the side of "over-normalization", whatever that means.
John "Z-Bo" Zabroski on July 22, 2008 7:00 PMAre you joking .normilize a bit slower?
1.The main table should contain the userid,username and password only
This is to ensure fast login.
2.The main sub table should hyper link with the main table.Since like mysql quit good not auto implement foriegn key.so your query a bit longer but you don't nessary search for unnessary value.
3.Denormlization are quite dangerous
3.1. I T user can back door the data and change it without relation ship of table.
3.2 To query the data and joining again are more longer time.I'm not talking about 15 table per join with dam lot of data.Are we should query denormlize data.
"...this isn't intended as a real query..."
Good. Because it won't be taken as a real argument, either.
Brianary on July 23, 2008 7:42 AM"...queries are now blindingly simple..."
Oh, really?
select * from User where affiliation_1 = @affiliaition or where affiliation_2 = @affiliation or affiliation_3 = @affiliation
If you offer a checkbox or select list for input into a denormalized column group, you cannot loop over an inner recordset, you have to:
affiliations.items[rs["affiliation_1"]].selected= true;
affiliations.items[rs["affiliation_2"]].selected= true;
affiliations.items[rs["affiliation_3"]].selected= true;
And where are you storing the fourth affiliation? There *will* be one, BTW. There is always at least one record that exceeds your expected ceiling, now or in the future.
"...and probably blindingly fast, as well."
As long as you don't have to filter based on the denormalized column groups (which you will), or store them in additional "overflow" rows a year down the road when you realize that you need more fields in the group, but your code is committed to your denormalized "design". Two-dimensional querying ruins performance.
Brianary on July 23, 2008 7:58 AM"Despite copious evidence that normalization rarely scales..."
*SO* copious, in fact, that you weren't able to cite even one of them.
Brianary on July 23, 2008 8:01 AM"Despite copious evidence that normalization rarely scales..."
*SO* copious, in fact, that you weren't able to cite even one of them.
Your quote is a complete non-sequitor, BTW. Did indexes solve the problem, or not?
"Never, never should you normalize a database out of some vague sense of duty to the ghosts of Boyce-Codd."
No, you should do it because it makes sense for queries and maintainable code.
"Disks and memory are cheap and getting cheaper every nanosecond."
Ah. The argument of someone that thinks normalization is about space. You should read "Normalized data is for sissies", or any of the thousands of other naive blog and newsgroup posts over the years by people that also didn't know what they were talking about (not just the ones you cite), but be sure to read the comments below them.
"Measure performance on your system and decide for yourself what works, free of predispositions and bias."
And foresight of maintainability, apparently.
Brianary on July 23, 2008 8:09 AMI've said it before:
As someone who has had to work with a system for K-12 student records that stored absence dates in a *FORTY*-column-denormalized-group with overflow rows, and a financial institution, I would ask that, rather than de-normalizing, simply punch the next person who has to interface to the database as hard as you can in the stomach. Let them know the alternative, and they will politely thank you.
Brianary on July 23, 2008 8:12 AMThis article fails to show a partial normalization, which is better than total de-normalization or total normalization. Where is the (ideal) partial normalization schema example?
There are at least three things normalization does:
1) Reduce duplicate data, both in-memory and on-disk.
2) Allow many-to-many references (MxN),
where the quantities for N and M are not known.
3) Each table has a primary key ID which may be indexed.
Consider this de-normalization of the sample schema.
1) No duplicate data exists.
Suppose that the user is allowed one screen name and one phone number. Then, there is no duplication possible for either. So, merge the UserPhoneNumber and UserScreenName tables into the main User table.
2) Many-to-many references exist.
We must let users join ass many affiliates as they wish. Keep the UserAffiliation and Affiliation tables. This reduces wasted space, since there are no empty affiliate_1 to affiliate_10 fields in the User mega-table.
3) De-normalize only when the waste is marginal.
a) The waste is marginal.
UserWorkHistory can merge with the Affiliation table. The extra integer IDs and short description rows are marginal compared to a join.
b) The indexed join is effective.
Most databases can choose which table's index is optimal to start a join. Suppose the Affiliation table is very short, and we want records WHERE DESCRIPTION LIKE 'Clubs%'. The database loads the small Affiliation table into memory, and get its affiliation_id keys. Then, it reads only the shorter index-matched UserWorkHistory that fit the keys.
@David Aldridge
"In Oracle you would create a hash cluster ... I expect that other RDBMS's have similar structures."
In my experience, only Oracle and Teradata can do this. And clustering the data will absolutely fix the OP's concerns about performance.
"database-agnostic design sucks"
Well, logical models should be database-agnostic. But a physical design that ignores the unique features of the database engine will truly suck.
dbguru on July 23, 2008 9:09 AMAs a general rule, when you hear someone speak about the performance implications of normalization, you can rest assured they don't know WTF they're talking about. This particular blog post is no exception.
Oh well. At least I can count on a continued lucrative contracting career fixing the problems caused by these know-nothings and the readers that follow their advice.
rel on July 23, 2008 9:18 AMI think all databases must be at least in 3NF to avoid redundancies and anomalies. Anomalies derived from the existense of redundant data and functional data dependencies.
For example:
In the above case what would you do if your "USERS" changed their Surname would you write an update statement to change the surname as many times as the USER had affiliates? Imagine this is Facebook? How many rows would you update at a time? Could you keep on living like this? Locking? ACID ? Transaction processing comes to my mind.
I don't think is a good idea to denormalize just because it is slow. If it is slow I think is because of bad written SQL. Normalisation was not invented to make things slow either.
It is there for data integrity and consistency. The Relational Model is a sound and proven model based on Algebra's set theory. You can read C.J. Date's Introduction to Database Systems to find out more about "other" scary/bad things that will happen to denormalized databases. If you denormalize and write everything into one table, is easy, you might as well use a file system and a flat file, in such a case you don't have to use a PRIMARY KEY either!
The database should not be tought of as a hard disk. Although I think is used a lot like a hard disk nowadays.
Must admit I have read very interesting posts here!
My team spent a year and a half "fixing" a major system that was designed to be "denormalized" from the get-go, because they expected to have hundreds of millions of rows of data.
The "genius" who designed the system created a data structure that consisted mainly of a single table with just a few columns. One column contained the "type" of data contained in the row, and another column contained the actual data (a long, delimited string, containing multiple columns of data). In effect, he had entity tables within rows.
It was pretty fast, but it was a nightmare to maintain, and data integrity went out the window even faster than the "blindingly fast" queries. (They weren't really that fast, anyway.)
We replaced this "data stack" with a normalized relational database design. Some of the queries required a lot more joins, but you know what? The new system is much faster, data integrity is back, and maintenance no longer causes nightmares.
This author needs to learn a thing or two about database design...
Abby Normal on July 23, 2008 9:29 AMSeems to me, that if you want to use a "moderate pc" to store BILLIONS of records of anything, denormalizing is just going to postpone your problem. If your site is important enough to have billions of records, than it is important enough to have the proper hardware, as well as the proper design.
That being said, denormalization does have its place, and its an important place. But it shouldn't be considered until you have gone through the normalization process, at least so that you will have a better understanding of your data to build upon.
Its quite a bit easier to build a denormalized structure from a normalized one for speed, but trying to fix the problems that seem to crop up from starting with a demormalized structure aren't always so easy. But just like anything else, it depends on what it is being used for.
Steve on July 23, 2008 10:36 AM"Disks and memory are cheap and getting cheaper every nanosecond."
vs.
"Please think twice before subjecting yourself, your fellow programmers, and your users to the XML angle bracket tax." -- Jeff Atwood, CodingHorror.com, May 11, 2008
Please make up your mind!
Brianary on July 23, 2008 12:08 PM"This is exactly like the discussions that arise when someone decides normalizing data is unnecessary, and have invented some new way to add array data types to SQL.
Brianary on May 12, 2008 08:25 AM"
Hah! I was exactly right, there!
http://www.codinghorror.com/blog/archives/001114.html
Brianary on July 23, 2008 12:12 PMLoved the article. Coming from a data warehousing background I can vouch for denormalizing.
Ritesh on July 24, 2008 5:17 AMOy. And the DB fanatics come out to play.
A few issues I have with other comments in general:
1) Use a view
What many people might not realize is that when using a view, you are actually running two queries - one to build the view, and one to query against the view. You might not even need 90% of what the view is returning, so you are essentially just throwing away all that data and wasting the memory/CPU resources required to process it.
And before anyone throws out the "CPU/DISK SPACE/MEMORY/BANDWIDTH IS CHEAP" argument, think about the multiplicative effect of running those queries when you have hundreds or thousands of people connected to your database. Query optimization plays a -huge- role at this point, and a properly-normalized database will help.
2) Coders don't know Databases, and DBAs don't know Code.
Until you have to work as both. And then you understand how absurdly close they both are, and how if you mess one side up, the other side suffers just as much, if not more.
I've spent my share of time sifting through bad code that implemented poor entity relationships (y'know...the "R" in "RDBMS"). Tables that have 90 columns to store information about 30 fields, of which maybe (MAYBE) four are actually getting used. On the flip side, there are also the tables that are normalized to the point where retrieving an object's data becomes an excruciating task involving six types of joins, views being joined on views, and the like. If you're spending more time tracking down how to connect the information than you are actually GETTING the information, you have a problem.
3) [De]Normalization is the best solution
We don't use flat files for data storage very much anymore, and for good reason. It's easier to think of objects as their various components, and to model them accordingly. That's where B-C were coming from - your tables should reflect the entities that make up your system. However, situations arise where it isn't the best idea to have normalized data...reporting is the classic example, but this goes back to my previous point of doing more work just to get the information you need than retrieving it takes.
Mindless ranting aside, at the end of the day, we all just want to get data where it needs to be. Do what works the best, and everything will be fine.
James on July 24, 2008 3:04 PM@ James:
"I've spent my share of time sifting through bad code that implemented poor entity relationships (y'know...the "R" in "RDBMS"). "
No, actually, that's not what the 'R' in 'RDBMS' refers to.
rel on July 24, 2008 3:48 PMJeff, that post was funny.
Denormalization is just a tool to solve performance problems when all other solutions are defeated.
The data integrity advantages of normalization far overcome that a completely denormalized would gain. I tried in the past create a denormalized db that works: the spaghetti code it generated would generate the kind of reation the your' site logo show and data was complete rotten and untrustable .
Normalized it and it became a breeze to maintain.
Normalization is not about disk space. Is about maintainability.
Fabricio on July 24, 2008 5:28 PMWhat I don't like is when people want your data denormalized because they deem it easier. I work with a database that actually had two versions of someone's name: the all caps version and the regular version. Please for the love of everything that's holy don't let this happen to your database.
Ryan Clare on July 25, 2008 5:48 AM>when using a view, you are actually running two queries - one to build the view, and one to query against the view. You might not even need 90% of what the view is returning
Nope. The query optimizer places the view's code in a subquery within your query that uses the view. Then the optimizer optimizes the whole thing.
You can see the whole thing in action by clicking "show execution plan".
David B on July 28, 2008 9:41 AMGenerally speaking. You should be normalizing your domain models, then your business objects, then implementing a data structure that hopefully closely matches those designs. All of which should be normalized both for data and for behavior. If you do this THEN find performance issues, THEN try optimization procedures like caching, or any of the suggestions above. I think de-normalization is a last result type of thing to do. Generally speaking there's just too much risk for data corruption when you duplicate like that.
james peckham on July 28, 2008 4:22 PMWhy not just throw out the whole kit and caboodle... go the CouchDB route
it's too hot not to consider.
<a href="http://en.wikipedia.org/wiki/CouchDB">http://en.wikipedia.org/wiki/CouchDB</a>
"instead of
select user.id...
you can just have
select user_id..."
I can't help but feel this is a poor excuse for bad naming conventions. Data schema seem to be the only place left in programming where arcane and redundant Hungarian naming schemes still have a hold - and the only technical reason that anyone can give is "it saves typing"?
Sometimes I feel departmental naming schemes are there to give people who know nothing a sense of knowing something: pages and pages of abbreviation rules to memorise so you can make sure that your database never has anything over 12 characters when it can in fact handle 64. He who remembers the list of rules has an edge over others even if he doesn't understand a thing about database design. And why call a field "Passport" when you can call it "PrsnTrvlDc"?
I sometimes look at schemes and wonder if the designer really knew how to normalise or simply claims that his design is for increased efficiency. Especially when his design requires a rewrite to add an extra item of data which should have been stored as a row but was being stored as a column name (AnswerToQ1, AnswerToQ2, etc). Especially if the data is ambiguous and contradictory and he does not realise it. Especially if reporting is now limited to a few use cases and not "anything that could be asked that we haven't thought of yet" (a strength of normalisation is that it is future-proof and caters for the inevitably unanticipated).
And, how can this argument make sense when the solution is to put the table name in the field name to avoid typing the table name?
Why do people keep insisting that it is easier to type "table_id" than "table.id"? It is actually one extra keystroke over the more naturalistic scheme.
If there was a performance argument, I could accept that (although many performance arguments seem dependant on assuming one knows how the query engine internals work and hoping it will not work differently on the next service pack).
Of course, there may be good reasons I haven't encountered yet...
"The things I hate the most are ignorance and stupidity. And no matter how hard I work, I just can't seem to rid myself of them."
Paul Coddington on July 30, 2008 10:43 AMNormalize to that grad where you feel comfortable with the balance between duplication and performance.
Juergen on July 31, 2008 1:20 AMA whopping six joins?! Really?! Omg, save the whales now before it's too late.
Seriously, six joins isn't a big deal, and unless you have multiple millions of people networking in your site, your database shouldn't be dying with the model you describe. Someone f*ed up somewhere, and it has nothing to do with normalization.
You're missing the HUGE costs to de-normalized data on large dataset updates. You have to update multiple places every time data changes. For large datasets this can be disastrous, especially when you have to update objects as a single unit (or roll them all back). As these updates increase, locks increase, which then slows down everything else. You're seriously discussing running cron jobs to clean up the mess you are creating, rather than just writing clean code the first time.
If your only course to improve performance was denormalization, then you either have a pretty rare situation, or you need new architects and/or DBA's. I've worked at three companies now where developers lack of knowledge concerning databases and data relationships brought applications to a halt. It comes repeatedly from a developers centrist view of applications where their part is always the most important component and everything exists to serve their application. After than comes an over-reliance on data layer generators, which cannot optimize queries as well as a good DBA. So basically the developers make cuts to save time, misuse the data model, then blame the DBA when things go wrong.
Out of curiosity, did you find your DBA and Developers often arguing over where things should go? Every time I've seen the behavior, it's been because DBA's feel they cannot trust the developers to intelligently access data, either due to the current team, or that DBA's past experiences.
Every time I see a post like this on a blog I try to imagine it in Nuclear Engineering. "Do you find yourself putting needless safeguards in your nuclear generators just because some professor told you to? They cost too much! Remove them! Don't do things by reflex! Every engineer should just do what works for them! Don't let standards get in the way of progress and performance!"
I find it humorous, that people follow a million "rules" such as "table names should be plural" as a doctrine, then casually discard good design at the first hint of an actual challenge. To be fair, naming tables is fairly easy, and you rarely find that table names stump a developer and cause them to look for "problems".
i cant believe i have to read all this bullplop
Fancypants on August 6, 2008 8:38 PMyou'd better go at work and denormalize your databases. All of you :D
.sid on August 8, 2008 8:52 PMEven the "normalize it 'til it hurts, then denormalize", which seems the nearest to rational argument, doesn't nearly get at the point.
The point of normalization is to analyze the data model in order to determine how the data all hangs together. In effect, once you have gone through normalization, you *understand* your data model, and have a representation that allows expressing information with minimal data; it doesn't have redundancies that consume disk or cause update anomalies or generally require managing.
The next thing to do is not to denormalize, but rather to create the concrete schema based on the understood data model. If you have a decent query optimizer, *most* of the queries should turn out perfectly well when using a "pretty well normalized" concrete schema.
When exceptions arise, of queries that perform badly, you need to react to them, and "denormalizing" isn't a thing to do.
1. Easiest is to consider adding indices; if that resolves the problem cheaply, great!
2. Second easiest is to see if the expensive query may be replaced with some cheaper query. No denormalization here.
3. There may be DBMS-specific techniques such as functional indexes or partial indexes to precompute or avoid actions that are expensive. This is a "declarative" action, thus comparatively cheap, and again, doesn't involve denormalization.
4. But then there is the possibility that it may be necessary to alter the data model to *PRECOMPUTE* something so as to avoid the need to process something (e.g. - maybe a join or something) in the "totally normalized" data model.
It is in the fourth case where we diverge from a normalized model. But that wasn't termed "denormalization"; it was described as "precomputing", that is, storing a precomputed result. You don't just randomly denormalize - you look for things that are being computed that could be computed earlier and stored to save recalculating later.
Christopher Browne on August 14, 2008 6:47 AMThere's some good advice here, but I can already see people citing it as a 'we don't need normalisation' piece. And if I can recall back to my time studying normal forms (back in the early 90s) we were always taught that you would probably need to denormalise - did they stop teaching that part?
Personally, I think there are good reasons why courses drill the maxims of normalisation into students, and that's because without learning them reflexively, the majority of designers and developers make a pigs ear of data models, dumping data into the first entity they can find to hold it, or seeing the database as nothing more than a way of serialising objects.
A significant problem is that the relational model is strongly associated with SQL, which is pretty much like associating high level programming and COBOL. SQL was simply the first relational query language, but it's success unfortunately locked in some bad early design decisions. Check out Date and Darwen's 'The Third Manifesto' for proposals made in the light of experience.
Later languages, and even SQL syntax revisions, have unfortunately failed to gain much traction - for instance a lot of the code in your query can be eliminated by an RDBMS that actually understands the relationships between tables (which we currently specify in each query rather than metadata). Some RDBMS already support this in SQL itself.
Or how about a syntax where you would simply declare the user view as the user table extended by the other tables? The point is that none of these issues actually relate to the model per se.
The second major problem is that the relational model (and SQL) was supposed to free us from thinking about the underlying physical model. We should not be concerned if a query joins 5 tables of 100 columns, or 500 tables, because we're not supposed to know/care how it is implemented underneath.
The default response to that is that is typical ivory tower theory - but recall that the same objections were raised by C developers towards OO, and C++ developers towards managed code and the Java VM. RDBMS have vastly improved in performance in much the same way that VM languages have.
The third problem is with people using an RDBMS where they don't have to, or where they don't fit well, because they've gone with an off-the-shelf architecture. The equivalent view is thinking that your correct DBMS free architecture is equally applicable to all problem spaces.
Final thought - some people reading this would be scared of joining 6 tables, yet would think nothing of having a class reference 6 other classes - is there really a difference? (What if those classes each make a d/b or file system query?)
JulesLt on August 29, 2008 3:25 AMSure, dogmatic adherence to the princple of normalization no matter how much it hurts is foolish. And there are times when it does hurt. But normalization was invented for a reason. In your example, above, you give a "simplified" design with room for 2 screen names and 3 affiliations. But what if someone has 3 screen names or 4 affiliations? Where do we put the data? You've built in arbitrary limits. And sure it's true that in the unnormalized schema, getting all the data about a single user becomes a simple "select * from user". But what if we want to find all the users with a given affiliation? In the normalized schema, we write "select user_id from userAffiliation where affiliation_id='foobar'". (Yes, if we need other data about the user besides his user id, there's an extra join.) In the unnormalized schema, we have to write "select * from user where affiliation_id_1='foobar' or affiliation_id_2='foobar' or affiliation_id_3='foobar'". We have to type everything three times, and if we want that query to run efficiently, we have to have three separate indexes. I'm not sure if your intent was that the entire database consists of only one table, or if there are other tables for other categories of data. If so, and if any of these link to the affiliation, then any joins have to join against any of the three fields. If the other table was designed with a similar philosophy and also has three fields for affiliation, then we need 3 * 3 = 9 clauses in our "where" to catch every possible combination. And if somewhere down the line somebody decides that we need to up the limit to 4, then we have to search all of our queries to make them all search against this fourth slot in addition to the first three. If we miss one, we'll get very subtle bugs, where it seems to work most of the time, only failing for the tiny number of users who have a fourth affiliation.
I discuss the rules and when it's good to break them in more depth in my book, "A Sane Approach to Database Design". (Was that clever the way I snuck in the sutble plug for my book? Okay, probably not all that subtle.)
Mark on October 5, 2008 2:03 PMSubject Line: Beat Long Poll Lines with Absentee Ballots from StateDemocracy.org
Many state and local election officials are encouraging voters to use Absentee Ballots to avoid the long lines and delays expected at the polls on November 4th due to the record-breaking surge in newly registered voters.
Voters in most states still have time to obtain an Absentee Ballot by simply downloading an official application form available through www.StateDemocracy.org, a completely FREE public service from the nonprofit State Democracy Foundation.
Read More: http://us-2008-election.blogspot.com/2008/10/beat-long-poll-lines-with-absentee.html
Bit of a johnny-come-lately to this, but I would always go for full normalization at the data analysis/modelling stage, then take a pragmatic view at implementation. If, in the particular case, it is obvious that denormalization is the sane option, do it.
One comment I would make. If you are designing an information storage and retrieval system that is queried more often than it is updated, it might make sense to use a fully normalized schema for updates and a denormalized one for queries. When inserting, updating or deleting data, use triggers to refresh the query tables to reflect the changes. The SELECT command to query the denormalized table or tables would be far simpler than one with six joins in it, and data integrity would not be an issue.
Jel Mist on January 2, 2009 4:21 AMThe way i see it - the goal of normalization is creation of a logical design that gives you data integrity and expandability(in the design) and hence It does not consider query evaluation performance in its design. This is pushed to the lower layer in the design phase - the physical db design which now looks at volumes of data! And that's where the problem is? We need a design methodology that looks at - the data AND its usage before we design -- is denormalizing at that level a solution? Maybe partly
Please comment as I plan on researching this area
Fascinating Post, a couple small comments to the awesome list. IHMO: OLTP Databases require "Normalized" designs, not on principle but out of necessity. Reporting Databases on the other hand requires "De-Normalized" designs for performance/scalability. A single database should not have to solve both problems (performance and integrity). Understanding the usage of the database in question and using the design required is the right answer, there isn't a one size that fits all...
Raj Datta on June 30, 2009 5:41 PMCoding horror indeed.
Interesting, but clearly solving the wrong problem. Why are you using a relational database at all? It doesn't fit your problem domain.
Stephan Eggermont on July 2, 2009 11:35 PMI'm sorry this articles comparisons are irrelevant. Normalization and Denormalization achieve different goals, and are used as such.
Mickey Bricks on August 19, 2009 8:30 AM| Content (c) 2009 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved. |