One of the items we're struggling with now on Stack Overflow is how to maintain near-instantaneous performance levels in a relational database as the amount of data increases. More specifically, how to scale our tagging system. Traditional database design principles tell you that well-designed databases are always normalized, but I'm not so sure.
Dare Obasanjo had an excellent post When Not to Normalize your SQL Database wherein he helpfully provides a sample database schema for a generic social networking site. Here's what it would look like if we designed it in the accepted normalized fashion:
Normalization certainly delivers in terms of limiting duplication. Every entity is represented once, and only once -- so there's almost no risk of inconsistencies in the data. But this design also requires a whopping six joins to retrieve a single user's information.
select * from Users u inner join UserPhoneNumbers upn on u.user_id = upn.user_id inner join UserScreenNames usn on u.user_id = usn.user_id inner join UserAffiliations ua on u.user_id = ua.user_id inner join Affiliations a on a.affiliation_id = ua.affiliation_id inner join UserWorkHistory uwh on u.user_id = uwh.user_id inner join Affiliations wa on uwh.affiliation_id = wa.affiliation_id
(Update: this isn't intended as a real query; it's only here to visually illustrate the fact that you need six joins -- or six individual queries, if that's your cup of tea -- to get all the information back about the user.)
Those six joins aren't doing anything to help your system's performance, either. Full-blown normalization isn't merely difficult to understand and hard to work with -- it can also be quite slow.
As Dare points out, the obvious solution is to denormalize -- to collapse a lot of the data into a single Users table.
This works -- queries are now blindingly simple (select * from users), and probably blindingly fast, as well. But you'll have a bunch of gaping blank holes in your data, along with a slew of awkwardly named field arrays. And all those pesky data integrity problems the database used to enforce for you? Those are all your job now. Congratulations on your demotion!
Both solutions have their pros and cons. So let me put the question to you: which is better -- a normalized database, or a denormalized database?
Trick question! The answer is that it doesn't matter! Until you have millions and millions of rows of data, that is. Everything is fast for small n. Even a modest PC by today's standards -- let's say a dual-core box with 4 gigabytes of memory -- will give you near-identical performance in either case for anything but the very largest of databases. Assuming your team can write reasonably well-tuned queries, of course.
There's no shortage of fascinating database war stories from companies that made it big. I do worry that these war stories carry an implied tone of "I lost 200 pounds and so could you!"; please assume the tiny-asterisk disclaimer results may not be typical is in full effect while reading them. Here's a series that Tim O'Reilly compiled:
There's also the High Scalability blog, which has its own set of database war stories:
First, a reality check. It's partially an act of hubris to imagine your app as the next Flickr, YouTube, or Twitter. As Ted Dziuba so aptly said, scalability is not your problem, getting people to give a shit is. So when it comes to database design, do measure performance, but try to err heavily on the side of sane, simple design. Pick whatever database schema you feel is easiest to understand and work with on a daily basis. It doesn't have to be all or nothing as I've pictured above; you can partially denormalize where it makes sense to do so, and stay fully normalized in other areas where it doesn't.
Despite copious evidence that normalization rarely scales, I find that many software engineers will zealously hold on to total database normalization on principle alone, long after it has ceased to make sense.
When growing Cofax at Knight Ridder, we hit a nasty bump in the road after adding our 17th newspaper to the system. Performance wasn't what it used to be and there were times when services were unresponsive.A project was started to resolve the issue, to look for 'the smoking gun'. The thought being that the database, being as well designed as it was, could not be of issue, even with our classic symptom being rapidly growing numbers of db connections right before a crash. So we concentrated on optimizing the application stack.
I disagreed and waged a number of arguments that it was our database that needed attention. We first needed to tune queries and indexes, and be willing to, if required, pre-calculate data upon writes and avoid joins by developing a set of denormalized tables. It was a hard pill for me to swallow since I was the original database designer. Turned out it was harder for everyone else! Consultants were called in. They declared the db design to be just right - that the problem must have been the application.
After two months of the team pushing numerous releases thought to resolve the issue, to no avail, we came back to my original arguments.
Pat Helland notes that people normalize because their professors told them to. I'm a bit more pragmatic; I think you should normalize when the data tells you to:
Never, never should you normalize a database out of some vague sense of duty to the ghosts of Boyce-Codd. Normalization is not magical fairy dust you sprinkle over your database to cure all ills; it often creates as many problems as it solves. Fear not the specter of denormalization. Duplicated data and synchronization problems are often overstated and relatively easy to work around with cron jobs. Disks and memory are cheap and getting cheaper every nanosecond. Measure performance on your system and decide for yourself what works, free of predispositions and bias.
As the old adage goes, normalize until it hurts, denormalize until it works.
This is a dangerous article, or perhaps you just haven't seen the number of horrific denormalised databases I have. People use these posts as excuses to build some truly horrific crimes against sanity.
Normalise first, denormalise if you have to. If you can't normalise then don't even try and denormalise, get someone else to do it.
A six-way join on 100,000 row tables is better than a single query on a 1,000,000,000 row table. I've been there. It wasn't fun.
Denormalising correctly requires more skill than normalising correctly. Just because Flickr duplicates data doesn't mean that wholesale denormalisation is automatically a good idea.
Some sort of middle-ground: temporary staging tables, very heavy caching, etc. will gain significant performance advantages without having to corrupt your underlying data.
I don't think I've ever seen a performance problem caused by a normalised database - not when such things as caching have been used correctly.
AbGenFac on July 16, 2008 5:02 AMNice comments. I think a lot of people who understand databases better than straight coding (myself included) get annoyed with coders telling us how to write databases.
It seems to me Jeff hasn't got a lot of database experience, and is saying don't bother with normalisation cos it's difficult to understand.
The article is also confusingly written, criticising normalisation due to concerns over scalability, and then saying that scalability doesn't matter.
If you understand normalisation - you know when to use it and when not, because you know what the costs and benefits are. If you don't understand normalisation, but you need to store a lot of data in a way that can be queried and analysed, you had better learn it.
Andy on July 16, 2008 5:10 AMTo pick up on Eric's comment that denormalisation is for reporting, reporting is just querying a database. Why then are there accepted design patterns for reporting, e.g. star and snowflake?
We have them because reporting has to scale to query much higher numbers of probably distant (i.e. many joins) data. So we denormalise into facts and dimensions.
The change that has come about in the last few years is that many internet scale applications (especially social ones) have reporting style operations at their very core. Look at the problem of aggregating data from a social graph. In the abstract this is a reporting style problem and therefore some level of denormalisation is the only way to get data out quickly.
Now we face the further issue that unlike traditional business reporting systems, sites like twitter and facebook cannot wait for the queries to return over a timescale of minutes, they need it in seconds or less. Which has led them to implement denormalisation on a massive scale.
After several years of companies starting with a tight normalised model and iterating to a denormalised model (see the fantastic O'Reilly series linked in the main article) it's right that people start looking at this wealth of experience and asking the question When is it right to _start_ with a denormalised model?
Matt on July 16, 2008 5:44 AMData integrity is the most important aspect of a database. PERIOD. Performance is ALWAYS second.
Always is a dangerous word.
hehe. The gravitational constant is ALWAYS 32 feet/sec/sec. ALWAYS; and anyone designing an aeroplane who chooses another value fails.
This is the crux of the contention: coders who view the database as a code widget to manipulate versus database wonks who view the database as primal and the coders as Huns out to rape, pillage, and destroy the data. Coders not only want what is rightfully theirs: make the data look pretty to the user, but also to control the data in client code. Always a bad idea.
Coders refuse to accept that Codd's 12 rules are even rules, much less laws of nature. Database wonks realize that they *are* laws of nature.
Great article Jeff!
@Ariel
Re: james: are you serious? Tables don't have 1 row in them they have many. So you make them plural (1 row singular, multiple rows plural). Rule: table names are plural, column names are singular.
Singular table names are completely acceptable and often preferred by OO developers using object-relational mappers like Hibernate. One would say the User table contains instances of class 'User'.
Class name to table name mappings are predictable and reliably generated if single names are used. Think in OO terms and you will never use a plural table name again.
Singular table names eliminate having to know anything about natural language pluralization rules (and exceptions) like Category - Categories, Criterion - Criteria, Radius - Radii, Person - People, Mouse - Mice, Index - Indices. (More exceptions than there are rules, and that's just in the English language).
There are no conventions for naming join tables when plurals are used for table names. Would it be UserAffiliations, UsersAffiation or UsersAffilliations? If it is not UsersAffilliations then why did one of the names suddenly become singular and one plural? If it's UsersAffilliations then grammatically that name is wrong too unless you add the possessive apostrophe making it Users'Affilliations. It's a whole bowl of wrong. Avoid plurals altogether and keep things simple.
Michael Hanney on July 16, 2008 6:20 AMMaybe this has already been said, so I apologise for not reading all 212 comments...
Be careful not to confuse a denormalised database with a non-normalised database.
The former exists because a previously normailsed database needed to be 'optimised' in some way.
The latter exists because it was 'designed' that way from scratch.
The difference is subtle, but important.
Bob Armour on July 16, 2008 6:47 AMYou don't optimize anything -- code or your database -- without first understanding exactly where your performance gains will be. Use object oriented design when you use a database. I've seen people create all sorts of objects, then put direct SQL calls in the middle of their code. Grrr.
Remember the whole MVC style of programming? THIS IS WHAT IT IS ABOUT! Write your code to be easy to maintain and with good design. Push all database context to object models. That way, when you restructure your database, you don't have to recode your entire application.
Once you've finished coding, you can start optimizing both the code and the database. A good SQL database has all sorts of tricks to make what seems like multiple table lookups to be a single operation. Deciding not to normalize your database because you think it is inefficient is a mistake.
In the mid-70s (yes, I'm THAT old) I worked on a rather popular computer called the Cado. The Cado could do things that took other computers costing 10x as much. It was the first computer that many businesses could actually afford.
The secret was the architecture: Cado made the assumption that hardware cost much more than software, so they optimized the hardware and that optimization was reflected in the OS (it had no file allocation table -- we tracked that by paper and pencil) and the software which required developers to use scant resources and write their programs in 256 byte segments. This allowed four users to use the Cado in only 48K of memory.
You heard of Cado? Nope, they went into a tailspin in the mid-80s because the primary assumption of hardware cost vs. software cost ended up being wrong. By the 1980s, hardware costs dropped and it was software that became the major cost. Overly optimizing the software meant that the programs were almost impossible to maintain. It also meant that programs were written for specific clients, so that you had 20 programs for 1000 clients. Meanwhile, software written for maintenance, scalability, and configurability could be written for millions of clients. Thus, spreading the programming costs over a much larger sales base.
By the mid-80s, the specialized hardware needed to run the Cado programs meant you couldn't take advantage of the PC mass market. A PC would cost $3,000 while the Cado hardware would cost $15,000. Cado programs would cost thousands of dollars while the equivalent PC programs could be had for $600. Hundreds of Cado VARS went belly up as their sales disappeared. No one wanted to spend $20K on a system that was slower, more buggy, and harder to maintain than an equivalent $7,000 PC based system.
So, don't do what Cado did and make stupid optimization assumptions. Write good code and then test where optimization might actually do some good. Otherwise, you'll be writing yourself in a corner.
I recently worked on a project that had a distributed database where inventory items were stored in local databases. To know which distributed table to fetch the item from, the object's item number would include the location where that item was stored. You knew the item number, you knew which database where the item was stored, you could do a single lookup to fetch that item.
What a great idea until inventory items were moved from one location to another. That meant you had to change the inventory number of the item which meant rewriting the entire inventory table, all transactions (since the inventory item number was in the transaction), history tables, customer tables, etc. It ended up that about 2 to 3 dozen items would move every month. So, every month, the entire database which was distributed in over 60 locations would have to be updated. So, that one single database optimization technique cost us several days of down time each and every month.
David W. on July 16, 2008 6:58 AM@BuggyFunBunny
The gravitational constant is ALWAYS 32 feet/sec/sec. ALWAYS; and anyone designing an aeroplane who chooses another value fails.
From http://www.mathpages.com/home/kmath054.htm
So far we have assumed that the acceleration of gravity was constant
over thr range of interest, but in fact there is a slight reduction
in g as we go up in altitude. This can be significant when dealing
with the energy of an object ...
Also, Codd's rules say nothing about database design and everything about Codd's vision of RDBMS (that system that hosts the database).
Your points about Code Wonks vs Data Wonks is well taken.
As a Data Wonk, it is my view that: those Code Wonks that don't know how to View Execution Plan do not get a say in database design.
@David B
I also checked wikipedia; the differences I checked all worked out to 32/sec/sec (rounded, standard). But, OK.
I've always seen that design (normalization in this thread) can be inferred, albeit distantly, from:
rule 0 - The system must qualify as relational, as a database, and as a management system.
rule 9 - Changes to the logical level (tables, columns, rows, and so on) must not require a change to an application based on the structure. Logical data independence is more difficult to achieve than physical data independence.
rule 10 - Integrity constraints must be specified separately from application programs and stored in the catalog. It must be possible to change such constraints as and when appropriate without unnecessarily affecting existing applications.
rule 12 - If the system provides a low-level (record-at-a-time) interface, then that interface cannot be used to subvert the system, for example, bypassing a relational security or integrity constraint.
Rule 10 came to be embodied in DRI, unique constraints, check constraints and the like.
Rule 12 tells the coders to keep their grubby paws off.
BuggyFunBunny on July 16, 2008 8:13 AMYour example is probably the stupidest case for denormalization I've ever seen. You didn't even touch on the problems of one-to-many relationships.
If it's one-to-one, denormalize.
If it's one-to-many or zero-to-many, normalize.
If it's one-to-(specific number) it depends on how large the data you may be repeating is, and how scalable you want to be.
The bottom line here is I/O. If the normalization means doing 6 I/Os for a single person, versus one I/O in the single-table solution, it still isn't a significant difference in terms of the wall clock if you're just going after one person. Crank through a million people though, and it is a different story.
Please keep up the awesome blogging Jeff, you are always a good read!
Thanks!
rcs
A normalized database is good for inserts, updates, and data consistency. A denormalized database is good for fast selects, and works well if the table structure is close to what you’re reporting or loading into memory. That’s why it’s reasonably common to have two databases, a transactional one that’s relational, and a read only reporting database that gets updated from the relational database either by triggers or by a daemon.
If you’re reading lots of data, and you have a decent number of joins, you might be better off reading the data a table at a time, and doing the joins in memory. I’ve worked on systems where simple selects were essentially “free”, and most of the cost was the network (100 Mbit from the client). If you have several 1-n or n-n joins, the cost of reading the extra rows over the network will dominate the cost of the joins in a normalized network. In those cases, you’re better off caching “static” data locally, and doing the join on the client.
I was intriged to read that google gql doesn't allow joins. I guess they figure they have hit the scalability wall - so instead of denormalizing later, they made it part of their architecture. Or - maybe the architecture they are using forced it. Who knows.
john on July 16, 2008 9:03 AMWhen is it right to _start_ with a denormalised model?
Answer: never.
Yes, I know always and never are dangerous words. I use them here nonetheless.
If your application starts out with a denormalised model, it is going to be much more expensive to develop.
You have to make sure you manually keep track of all of the places that data needs to be updated, changed or extended. The likelihood of introducing bugs increases, which forces you to dedicate more time to testing and fixing these bugs. Furthermore, you make it much harder for developers new to the system to understand how it works because they don't have a simple model to follow - they have to wrap their head around a mess.
It is silly to spend all this extra time, effort and cost on something when you don't know if it will even pay off. What if the requirements change? A seemingly small request from your client could actually turn out to be a huge amount of work simply because of your implementation. Somehow, I don't think the client will accept your estimate of 2 weeks for adding a second group of contact details to the user's profile.
Knuth said that premature optimisation is the root of all evil. He was right. You should optimise only when you can identify tangible areas of an application to optimise through profiling. Otherwise, you are wasting your time because your optimisations may never have been needed in the first place and you're just creating additional work for yourself.
Dear god, please don't let there be a generation of database designers who think this is the right way to do it.
Dave G. on July 16, 2008 9:35 AMGoogle App Engine's designers say you should denormalize a Google App Engine database and should avoid relational ideas such as normalization.
For this I abandoned Google App Engine. The Greater Fools on their newsgroup continue to discuss exactly _how_ to achieve this.
tndal on July 16, 2008 10:11 AMHmmm. I would break the interested_in out into another table just in case your user is interested in more than one thing.
Azathoth on July 16, 2008 10:58 AMhehe. The gravitational constant is ALWAYS 32 feet/sec/sec. ALWAYS; and anyone designing an aeroplane who chooses another value fails.
Actually, being pedantic, that's the value of gravitational acceleration near the surface of the Earth. The gravitational constant (G) is (more or less) 6.674 x 10 ^ -11 m^3/(kg*s^2)
While I agree that normalization is the way to teach, as a design principle (RE: SomeGuy), but it would be nice to see optimization issues mentioned more often when talking about best design principles in Computer Science.
James Pearce on July 16, 2008 1:26 PMJohn,
The reason GQL doesn't support joins is that it's not a relational database..
Shane on July 16, 2008 1:29 PMIf you have to consistently join many tables to get related data, then just stick a stub key in there and voila ... faster performance, less joins.
Scott Fitchet on July 16, 2008 1:42 PMPragmatic is just another word to thinking and honestly trying to solve a problem, not waive a flag.
I have 2 questions though:
1. For a second there I am questioning my understanding of the concept of DB normalized design. Why on earth is the userScreenName a separate table? if 2 tables have 1:1 relationship, and each row in X has a row in B, what is the point to separate them? (maybe I am missing the business context of what the table means...)
2. how did you create this diagram? doesn't look like the SQL 2005 diagram or is it?
Dali on July 17, 2008 2:15 AMi can denormalise from a normalised db in 1 (long) line of sql. i can't say that i can normalise in 1 line of sql.
but then i'm just a dark little that burns dimly next to some of you frikkin geniuses :P
TooLateNoFriends on July 17, 2008 5:20 AMShane - of course it's not relational. s I think they use a giant excel spreadsheet to store data from google app engine.
john on July 17, 2008 7:35 AMIt's a tough nut to crack. I prefer to normalize simply to avoid data inconsistancy and logically it makes sense. Data updates are also easier in the normalized tables instead of inside the denormalized tables. But I do agree that using SQL to extract the data is crazy mad! I'm a newb to SQL and trying to figure out the complex joins is rediculous! Perhaps there is some sort of hybrid or way to meet normalization in the middle.
Mike on July 17, 2008 7:42 AMwow - there are a lot of miss-imformed IT staff out there
TJ on July 17, 2008 8:11 AM@Michael Hanney:
[quote]Singular table names are completely acceptable and often preferred by OO developers using object-relational mappers like Hibernate. One would say the User table contains instances of class 'User'.[/quote]
I don't agree. The table USERS contains instances of class User. Hibernate is perfectly capable of mapping a classname in single, like User to a tablename in plural, like USERS.
Next thing you are telling us is that we should use camelcase in database object names..
Not the best post post Jeff, I see your sentiment, but some poorly chosen wording and a silver bullet solution have turned it into a battlefield.
As one poster said, I recommend you post some actual situations with your optimisations shown and explained.
Simon on July 17, 2008 8:39 AMWhile I lean towards
User {UserID, UserName,Active }
Affiliate {AffiliateID, AffiliateName, Active}
UserAffiliateJoin {UserID,AffiliateID} - it's a m..n (many-many) join
!And I value surrogate keys over natural keys (mostly because natural keys are misidentified and change, and cascade update is lame)
Whatever you choose, be consistent! It allows you to write code that can infer relationships, and makes your code, queries, etc look better. There's nothing worse than databases with a hodgepodge of intermixed styles for naming (both table and column). It makes it terribly hard to read and understand for everyone involved, and you can't infer anything via code or in your head.
PS Dali, I think
UserScreenName
- badly named. It looks like a join table.
- it's meant for 0..n external system names. MSN, AIM, Yahoo, Facebook, etc?
- poor naming of fields and is misleading
- What's an 'im' anyway? if it's supposed to be external site id's, you'd think that im_service would be better replaced with another foreign key e.g. ExternalServiceTypeID.
e.g.
ExternalSystem { UserID, ExternalServiceTypeID, ExternalUserName, Password (yeah right) }
ExternalServiceType { ExternalServiceTypeID, ExternalServiceTypeName, HomePage, LoginPage, Description}
et cet er a
I'm just a coder, but I try to code databases that don't suck.
locke on July 17, 2008 9:08 AMI have to admit - I am a tables are singular type dude and for one reason only.
I use a self written code generator to create my classes, collections and data access layer as well as user controls and stored procs.
Having singular table names means its just a bit easier to programmatically generate all that stuff!
Mauro on July 17, 2008 9:34 AMWhere I work we use automated object persistence to store everything. On the up-side, it takes 2 minutes to create a database for the app, and it is ALWAYS correct as it is built for the classes. On the down-side, EVERTHING has a table.
A User record is linked to a user data record in the user data table. The user class has a base class, so the user record is linked to from a user base record. Ad nauseum...
If you want to manually run a SQL statement to check something (as we sometimes do during testing) you need a second brain, a gallon of caffienated drinks, and a solid hour of JOIN statements.
...and if you can't handle table names/classes/properies/variables with both plural and singular names you should go back to writing HTML.
Honestly, as long as it works and the name relates to what it is (user or users) then job done. If you are getting flustered by the grammar of your code then you are not fit to use a computer.
Use whatever you want, test as you go, and don't whinge about bad code when you can't understand.
stEvil on July 17, 2008 9:58 AMimho, if you’re need to denormalize your database for performance you're using the wrong DBMS.
I've read once the story about the development and database design of flickr and they were forced to denomalize and replicate data because they were using a database engine lacking features they need.
So, denomalize if you need more performance but don't have money for appropriate hardware and DBMS.
I've read stories of DB/2 databases that appeared normalized to users, but in the back end several tables were largely denormalized with the 'normalized' world appearing primarily as views.
Help me out: Am I remembering this correctly, or did I miss a key detail?
Michael T on July 17, 2008 1:50 PMI've experienced many databases with varying degrees of normalization.
I've enountered performance issues on both sides. I've seen cases where overly denormalized databases caused so much redundancy we hit serious performance issues.
I've seen cases where normalization purists have imposed a database design on users that had a substantially negative impact on business operations because performance went in the tank compared to previous systems. (And yes, they were specific cases, and the problem was resolved by denormalizing a small piece of data.)
My general rule is to design normalized. Then, when a need comes up to stray from normalization, justify each need (with data and business requirements).
The database design presented in this article is fairly simple. I've only run into problems with normalization when the design is more complex, and the volume of data is large.
John B. on July 18, 2008 3:42 AMerr, anonymous-
That is kind of what makes Jeff and Dare's fictitious example pretty strange. It's a simple database schema for a generic social networking site, yet it doesn't include any timestamps. Social networks tend to be designed around streams of inforamtion, not sets of user data. Isn't that what the folks at Twitter took forever to learn, especially when they said Rails doesn't scale when primarily it was their system eventing model that blew?
John Z-Bo Zabroski on July 18, 2008 4:08 AMAn entire post dedicated to an issue that is (within the circles of anyone with a clue) ientirely/i without contention?
Way to pander to a clueless cloud for blog 'ratings'...
Long-time fan of both yourself and Joel for flipping through on a (very) brief skim-the-post basis, but now that I'm actually putting in time to read the podcast, etc...
You really need to learn how to recognize WHEN you need to STFU and let Joel speak. OH so often he's giving perfectly good advice based on experience and nobody can hear because you're busy desperately scrambling to restate everything you've said to fit his opinion... =(
Jay Lynch on July 18, 2008 6:09 AM@Michael T
In the Mainframe DB2 world (as quite distinct from Linux/windoze), most use of the database is retro-fit over (under?) 1970's era COBOL programs which are written to VSAM semantics/structure. I live in that world for the moment. Nothing is normalized.
At a lower level, DB2 on the Mainframe is a layer over VSAM, even today. The zealots will say that the DB2 VSAM files aren't supposed to be visible to casual VSAM calls. Nevertheless, the semantics of storage is VSAM.
BuggyFunBunny on July 18, 2008 7:51 AMThis may have been mentioned already as I'm late in reading this article and there are a tone of comments but I'm a bit suprised that more developers haven't worked on some Business Intelligence and Data Warehouse solutions. A schema set up for reporting and analytics is highly denormalized in terms of performance because you typically end up working with large amounts of business data. Reading Kimball and knowing how to model dimensionally should be part of the developer's tool set.
http://en.wikipedia.org/wiki/Dimensional_modeling
@Jeff
Unless you are doing *a lot* more reading then writing
Aren't most -- perhaps in fact *all* -- websites in this class of application? Millions of reads, handful of writes.
you should denormalize as much as it makes sense to do so for read-heavy loads. Most web applications are of this type and do not even require a regular database necessarily
My point exactly!
But you don't need to denormalize to solve the problem. Caching is (usually) the answer. There are cases where you may want to denormalize, but there are a number of options before you get to that.
Grom on July 18, 2008 10:11 AM@Jeff Atwood
@As the old adage goes, normalize until it hurts, denormalize until it works.
ha. old adage. what a crock.
@James Pearce
@While I agree that normalization is the way to teach, as a design principle (RE: SomeGuy), but it would be nice to see optimization issues mentioned more often when talking about best design principles in Computer Science.
See, what you call optimization issues, I call good design. There is a difference between negligent design and avoiding premature optimization.
John Z-Bo Zabroski on July 18, 2008 10:59 AMseriously- how many of the commenters here have actually worked on a system with large amounts of data? A recent system of mine inserted around two billion rows per day into it's larger tables- indexes, partitions/sub-partitions, and summary tables are a given; but we also saved a small percent of the data into duplicate tables during the initial loading process in order to retain them longer than they are retained in the main tables- delete statements are too slow, we drop partitions by date as the data ages out; We also don't allow foreign key constraints on the large tables (though of course we list them on the logical data model) because we can't afford the performance loss during inserts. I'm always interested to hear how others have dealt with performance issues, I just hope all of you out there bashing Jeff for this post have dealt with large-enough systems to have a basis to comment...
anonymous on July 18, 2008 1:56 PMHard drive space is cheap, programmer time and running time is not.
If you're only normalizing to avoid blank entries in a table (like the optional phone number table you're using), there's no point in normalizing. Really. Just let the table have a blank entry in it. How many entries do you think it will take to make a difference when hard drives are measured in TB now?
Normalizing should mainly be used only to avoid duplicate data. If you are doing a social networking site and you have two duplicate entries to make two people friends, then you are probably doing it wrong. Having duplicate records makes coding and running a mess, and should be avoided.
Bill K on July 20, 2008 11:00 AMLet's face it, denormalizing and normalizing are necessary evils in DB design. I look at the first tables given and unless the client requirements are to allow a user to have an unlimited number of phone numbers and/or unlimited number of usernames, you could very easily NOT normalize those two requirements into tables. Thus you limit your joins and increase performance.
Overall it seems that only people that haven't really worked on databases, that are stronger in theory than practice, do they beat a database to death with over normalization. The argument could also be made that normalization increases performance if you know your database enough to pull only what you need and not everything and kitchen sink.
An example would be the affiliation information, you wouldn't need to join all the tables in that example if you are pulling just user information, you would only need to join the user affiliation table if that user is allowed multiple affiliations. Even better if he/she isn't allowed more than one, then you can put that in the user table too! ;-) My point is very simply if the data follows the business requirements and the developer using the database understands the structure properly then a level playing field of both normalization and denormalization should come to play and you should have a pretty decent mix of both structured data and performance.
But hey, what do I know, I'm just a guy who reads blogs.
Great Post dude, love the discussion that it brought up.
Thomas on July 20, 2008 11:11 AMWell, I haven't touched a DB since I changed jobs last year, but this is one good reason why a DB designer should also be writing queries and optimising them. That way he has to feel the pain too.
To my mind, if your normalised table is ONLY EVER queried in a join to another table, that's a good enough reason to denormalise.
And of course, optimisation depends on usage. If you're updating a table as often as you're reading it, that's a completely different scenario to the typical write-slow, read-fast environment.
Ben on July 21, 2008 3:25 AMYou need a normalized logical schema. This is what the DBMS should provide to the application. The DBMS should denormalize as necessary to optimize performance WITHOUT exposing those denormalizations to the application.
The problem is that DBMSes don't do this. Sure, they give you indices and caching and materialized views, but they don't go all the way.
Oracle has had a feature (since 8i IIRC) called query rewrite, where you can write your query against the normalised schema, and Oracle will figure out that the same query can be performed more efficiently against materialized views (if they exist) and hit those for the data instead.
Dave on July 21, 2008 6:41 AMHoly cow, about the flame war I was expecting.
For the normalization crew: I have never seen (and I expect I never will) a fully normalized database. Consider table User, in the example above, is not fully normalized. You have first_name and last_name.
A fully normalized database would not include first_name, except as a pointer to another table. Fully normalized tables can't contain repeat information, and you'll have several James people. Same thing as last_name, multiple Smiths. For that matter, you can't have phone_number; the area codes and prefixes replicate. Zip codes are a five-digit primary and a four-digit extension. Any information which itself represents a join has to be broken up. BCNF isn't fully normalized. Do you do that? No? Okay, then you're not operating in a normalized database. Get over it already.
@ BuggyFunBunny
The gravitational constant is ALWAYS 32 feet/sec/sec. ALWAYS; and anyone designing an aeroplane who chooses another value fails.
No it's not. The gravitational constant is only that here on our good ol' terra firma. If you're building an aeroplane that may perhaps be used in the future on some other planet, you ought to take that into account when designing the beast. The flip side to this is most likely, you're not going to ship your plane off-planet. The degree of normalization with which you feel comfortable is (or rather, should be) contextual, not absolute, as well.
From a higher viewpoint, complete data normalization is a theoretical construct. Reality gets in the way of theory. Even if you went to the trouble of building out your tables fully normalized, you're relying on the false assumption that the data itself is presented in a canonical fashion; it's not. Jim, James, Jimmy might all be James, or they might not be. The user- or institutionally- provided data that populates your database is coming in without consistent presentation.
All that aside, the Och, Jeff, be careful what you blog about! commentary is well advised to be heard. Messing with your data structure is something that should be undertaken with supreme care. In practice it is several orders of magnitude easier to de-normalize a database than it is to normalize one that is a giant mess. The undo button here is buried under a mountain of pain. Be forewarned, here be dragons.
Pat Cahalan on July 21, 2008 6:53 AM1. The example was poor
2. Even if the example was good, you've provided no context for optimisation via denormalisation.
What your post *seems* to be suggesting is that optimisation (specifically denormalisation in this case) is something that takes place in design. To me, optimisation is a consideration of circumstance. What value is there in attempting to second guess those circumstances? None. Particularly in the case of denormalisation, since if - god forbid - your assumptions were wrong, reverting back from denormalised to normalised is so costly.
Basically you set no scene, gave no re-design requirements - and as a result you are sending out the wrong messages to impressionable developers.
A final point: who is this post aimed at? With all due respect; anyone responsible for optimising a database to this degree will not be looking to your for advice.
I think this post is misleading at *best*. Sorry Jeff.
Mike K on July 21, 2008 7:48 AMAs long as each of your joins is using a proper index, the query optimizer will take care of the performance. Unless you plan on having millions of users, stick with the normalized design...
Chuck on July 21, 2008 8:21 AMNever say never. s I've done my share of denormalizing here and there - but never from the get-go.
I think the main thing here is - there probably *is* a situation where you must architect with denormalization from the start because this is version 25 and all previous versions blew up the database. I think it's a fair guess to say that most of us will probably never work on something so huge to have to start with denormalizing in mind.
FWIW - Once we hit a scalability wall - ended up dropping a foreign key to hit the clients performance metric (they hit us with an automated perf tool) and that table has lived for 8 years without a foreign key since. It always bothers me.
Entry has been up for a week and still no post from --CELKO--.
Ian Boyd on July 21, 2008 10:23 AMFor those who are interested in what real database folk have to say about it (beware: they are Brits, and SQLServer zealots to boot):
http://www.simple-talk.com/community/blogs/tony_davis/archive/2008/07/21/63094.aspx
BuggyFunBunny on July 22, 2008 6:46 AMThis seems like an argument between stereotype application developers and stereotype database administrators.
The stereotype system administrator would point out that social networking is store-and-forward and point-to-point. Get those parts of your design wrong, and that will be your bottleneck, folks.
For what it is worth, I would err on the side of over-normalization, whatever that means.
John Z-Bo Zabroski on July 22, 2008 8:00 AMhttp://www.simple-talk.com/community/blogs/tony_davis/archive/2008/07/21/63094.aspx
Really not much content and what there was suspect. Normalization can't produce anything that is logically correct be normalization is all about form, not semantics. Normalization produces something in the right form and the form has certain attributes. Logically the model could be gibberish because the semantics aren't addressed at all.
Abnormalized on July 22, 2008 11:11 AMAre you joking .normilize a bit slower?
1.The main table should contain the userid,username and password only
This is to ensure fast login.
2.The main sub table should hyper link with the main table.Since like mysql quit good not auto implement foriegn key.so your query a bit longer but you don't nessary search for unnessary value.
3.Denormlization are quite dangerous
3.1. I T user can back door the data and change it without relation ship of table.
3.2 To query the data and joining again are more longer time.I'm not talking about 15 table per join with dam lot of data.Are we should query denormlize data.
...this isn't intended as a real query...
Good. Because it won't be taken as a real argument, either.
Brianary on July 23, 2008 8:42 AM...queries are now blindingly simple...
Oh, really?
select * from User where affiliation_1 = @affiliaition or where affiliation_2 = @affiliation or affiliation_3 = @affiliation
If you offer a checkbox or select list for input into a denormalized column group, you cannot loop over an inner recordset, you have to:
affiliations.items[rs[affiliation_1]].selected= true;
affiliations.items[rs[affiliation_2]].selected= true;
affiliations.items[rs[affiliation_3]].selected= true;
And where are you storing the fourth affiliation? There *will* be one, BTW. There is always at least one record that exceeds your expected ceiling, now or in the future.
...and probably blindingly fast, as well.
As long as you don't have to filter based on the denormalized column groups (which you will), or store them in additional overflow rows a year down the road when you realize that you need more fields in the group, but your code is committed to your denormalized design. Two-dimensional querying ruins performance.
Brianary on July 23, 2008 8:58 AMDespite copious evidence that normalization rarely scales...
*SO* copious, in fact, that you weren't able to cite even one of them.
Brianary on July 23, 2008 9:01 AMDespite copious evidence that normalization rarely scales...
*SO* copious, in fact, that you weren't able to cite even one of them.
Your quote is a complete non-sequitor, BTW. Did indexes solve the problem, or not?
Never, never should you normalize a database out of some vague sense of duty to the ghosts of Boyce-Codd.
No, you should do it because it makes sense for queries and maintainable code.
Disks and memory are cheap and getting cheaper every nanosecond.
Ah. The argument of someone that thinks normalization is about space. You should read Normalized data is for sissies, or any of the thousands of other naive blog and newsgroup posts over the years by people that also didn't know what they were talking about (not just the ones you cite), but be sure to read the comments below them.
Measure performance on your system and decide for yourself what works, free of predispositions and bias.
And foresight of maintainability, apparently.
Brianary on July 23, 2008 9:09 AMI've said it before:
As someone who has had to work with a system for K-12 student records that stored absence dates in a *FORTY*-column-denormalized-group with overflow rows, and a financial institution, I would ask that, rather than de-normalizing, simply punch the next person who has to interface to the database as hard as you can in the stomach. Let them know the alternative, and they will politely thank you.
Brianary on July 23, 2008 9:12 AMThis article fails to show a partial normalization, which is better than total de-normalization or total normalization. Where is the (ideal) partial normalization schema example?
There are at least three things normalization does:
1) Reduce duplicate data, both in-memory and on-disk.
2) Allow many-to-many references (MxN),
where the quantities for N and M are not known.
3) Each table has a primary key ID which may be indexed.
Consider this de-normalization of the sample schema.
1) No duplicate data exists.
Suppose that the user is allowed one screen name and one phone number. Then, there is no duplication possible for either. So, merge the UserPhoneNumber and UserScreenName tables into the main User table.
2) Many-to-many references exist.
We must let users join ass many affiliates as they wish. Keep the UserAffiliation and Affiliation tables. This reduces wasted space, since there are no empty affiliate_1 to affiliate_10 fields in the User mega-table.
3) De-normalize only when the waste is marginal.
a) The waste is marginal.
UserWorkHistory can merge with the Affiliation table. The extra integer IDs and short description rows are marginal compared to a join.
b) The indexed join is effective.
Most databases can choose which table's index is optimal to start a join. Suppose the Affiliation table is very short, and we want records WHERE DESCRIPTION LIKE 'Clubs%'. The database loads the small Affiliation table into memory, and get its affiliation_id keys. Then, it reads only the shorter index-matched UserWorkHistory that fit the keys.
@David Aldridge
In Oracle you would create a hash cluster ... I expect that other RDBMS's have similar structures.
In my experience, only Oracle and Teradata can do this. And clustering the data will absolutely fix the OP's concerns about performance.
database-agnostic design sucks
Well, logical models should be database-agnostic. But a physical design that ignores the unique features of the database engine will truly suck.
dbguru on July 23, 2008 10:09 AMAs a general rule, when you hear someone speak about the performance implications of normalization, you can rest assured they don't know WTF they're talking about. This particular blog post is no exception.
Oh well. At least I can count on a continued lucrative contracting career fixing the problems caused by these know-nothings and the readers that follow their advice.
rel on July 23, 2008 10:18 AMI think all databases must be at least in 3NF to avoid redundancies and anomalies. Anomalies derived from the existense of redundant data and functional data dependencies.
For example:
In the above case what would you do if your USERS changed their Surname would you write an update statement to change the surname as many times as the USER had affiliates? Imagine this is Facebook? How many rows would you update at a time? Could you keep on living like this? Locking? ACID ? Transaction processing comes to my mind.
I don't think is a good idea to denormalize just because it is slow. If it is slow I think is because of bad written SQL. Normalisation was not invented to make things slow either.
It is there for data integrity and consistency. The Relational Model is a sound and proven model based on Algebra's set theory. You can read C.J. Date's Introduction to Database Systems to find out more about other scary/bad things that will happen to denormalized databases. If you denormalize and write everything into one table, is easy, you might as well use a file system and a flat file, in such a case you don't have to use a PRIMARY KEY either!
The database should not be tought of as a hard disk. Although I think is used a lot like a hard disk nowadays.
Must admit I have read very interesting posts here!
My team spent a year and a half fixing a major system that was designed to be denormalized from the get-go, because they expected to have hundreds of millions of rows of data.
The genius who designed the system created a data structure that consisted mainly of a single table with just a few columns. One column contained the type of data contained in the row, and another column contained the actual data (a long, delimited string, containing multiple columns of data). In effect, he had entity tables within rows.
It was pretty fast, but it was a nightmare to maintain, and data integrity went out the window even faster than the blindingly fast queries. (They weren't really that fast, anyway.)
We replaced this data stack with a normalized relational database design. Some of the queries required a lot more joins, but you know what? The new system is much faster, data integrity is back, and maintenance no longer causes nightmares.
This author needs to learn a thing or two about database design...
Abby Normal on July 23, 2008 10:29 AMSeems to me, that if you want to use a moderate pc to store BILLIONS of records of anything, denormalizing is just going to postpone your problem. If your site is important enough to have billions of records, than it is important enough to have the proper hardware, as well as the proper design.
That being said, denormalization does have its place, and its an important place. But it shouldn't be considered until you have gone through the normalization process, at least so that you will have a better understanding of your data to build upon.
Its quite a bit easier to build a denormalized structure from a normalized one for speed, but trying to fix the problems that seem to crop up from starting with a demormalized structure aren't always so easy. But just like anything else, it depends on what it is being used for.
Steve on July 23, 2008 11:36 AMDisks and memory are cheap and getting cheaper every nanosecond.
vs.
Please think twice before subjecting yourself, your fellow programmers, and your users to the XML angle bracket tax. -- Jeff Atwood, CodingHorror.com, May 11, 2008
Please make up your mind!
Brianary on July 23, 2008 1:08 PMThis is exactly like the discussions that arise when someone decides normalizing data is unnecessary, and have invented some new way to add array data types to SQL.
Brianary on May 12, 2008 08:25 AM
Hah! I was exactly right, there!
http://www.codinghorror.com/blog/archives/001114.html
Brianary on July 23, 2008 1:12 PMOy. And the DB fanatics come out to play.
A few issues I have with other comments in general:
1) Use a view
What many people might not realize is that when using a view, you are actually running two queries - one to build the view, and one to query against the view. You might not even need 90% of what the view is returning, so you are essentially just throwing away all that data and wasting the memory/CPU resources required to process it.
And before anyone throws out the CPU/DISK SPACE/MEMORY/BANDWIDTH IS CHEAP argument, think about the multiplicative effect of running those queries when you have hundreds or thousands of people connected to your database. Query optimization plays a -huge- role at this point, and a properly-normalized database will help.
2) Coders don't know Databases, and DBAs don't know Code.
Until you have to work as both. And then you understand how absurdly close they both are, and how if you mess one side up, the other side suffers just as much, if not more.
I've spent my share of time sifting through bad code that implemented poor entity relationships (y'know...the R in RDBMS). Tables that have 90 columns to store information about 30 fields, of which maybe (MAYBE) four are actually getting used. On the flip side, there are also the tables that are normalized to the point where retrieving an object's data becomes an excruciating task involving six types of joins, views being joined on views, and the like. If you're spending more time tracking down how to connect the information than you are actually GETTING the information, you have a problem.
3) [De]Normalization is the best solution
We don't use flat files for data storage very much anymore, and for good reason. It's easier to think of objects as their various components, and to model them accordingly. That's where B-C were coming from - your tables should reflect the entities that make up your system. However, situations arise where it isn't the best idea to have normalized data...reporting is the classic example, but this goes back to my previous point of doing more work just to get the information you need than retrieving it takes.
Mindless ranting aside, at the end of the day, we all just want to get data where it needs to be. Do what works the best, and everything will be fine.
James on July 24, 2008 4:04 AM@ James:
I've spent my share of time sifting through bad code that implemented poor entity relationships (y'know...the R in RDBMS).
No, actually, that's not what the 'R' in 'RDBMS' refers to.
rel on July 24, 2008 4:48 AMLoved the article. Coming from a data warehousing background I can vouch for denormalizing.
Ritesh on July 24, 2008 6:17 AMJeff, that post was funny.
Denormalization is just a tool to solve performance problems when all other solutions are defeated.
The data integrity advantages of normalization far overcome that a completely denormalized would gain. I tried in the past create a denormalized db that works: the spaghetti code it generated would generate the kind of reation the your' site logo show and data was complete rotten and untrustable .
Normalized it and it became a breeze to maintain.
Normalization is not about disk space. Is about maintainability.
Fabricio on July 24, 2008 6:28 AMWhat I don't like is when people want your data denormalized because they deem it easier. I work with a database that actually had two versions of someone's name: the all caps version and the regular version. Please for the love of everything that's holy don't let this happen to your database.
Ryan Clare on July 25, 2008 6:48 AMGenerally speaking. You should be normalizing your domain models, then your business objects, then implementing a data structure that hopefully closely matches those designs. All of which should be normalized both for data and for behavior. If you do this THEN find performance issues, THEN try optimization procedures like caching, or any of the suggestions above. I think de-normalization is a last result type of thing to do. Generally speaking there's just too much risk for data corruption when you duplicate like that.
james peckham on July 28, 2008 5:22 AMwhen using a view, you are actually running two queries - one to build the view, and one to query against the view. You might not even need 90% of what the view is returning
Nope. The query optimizer places the view's code in a subquery within your query that uses the view. Then the optimizer optimizes the whole thing.
You can see the whole thing in action by clicking show execution plan.
David B on July 28, 2008 10:41 AMWhy not just throw out the whole kit and caboodle... go the CouchDB route
it's too hot not to consider.
a href=http://en.wikipedia.org/wiki/CouchDBhttp://en.wikipedia.org/wiki/CouchDB/a">http://en.wikipedia.org/wiki/CouchDB/a">http://en.wikipedia.org/wiki/CouchDBhttp://en.wikipedia.org/wiki/CouchDB/a
instead of
select user.id...
you can just have
select user_id...
I can't help but feel this is a poor excuse for bad naming conventions. Data schema seem to be the only place left in programming where arcane and redundant Hungarian naming schemes still have a hold - and the only technical reason that anyone can give is it saves typing?
Sometimes I feel departmental naming schemes are there to give people who know nothing a sense of knowing something: pages and pages of abbreviation rules to memorise so you can make sure that your database never has anything over 12 characters when it can in fact handle 64. He who remembers the list of rules has an edge over others even if he doesn't understand a thing about database design. And why call a field Passport when you can call it PrsnTrvlDc?
I sometimes look at schemes and wonder if the designer really knew how to normalise or simply claims that his design is for increased efficiency. Especially when his design requires a rewrite to add an extra item of data which should have been stored as a row but was being stored as a column name (AnswerToQ1, AnswerToQ2, etc). Especially if the data is ambiguous and contradictory and he does not realise it. Especially if reporting is now limited to a few use cases and not anything that could be asked that we haven't thought of yet (a strength of normalisation is that it is future-proof and caters for the inevitably unanticipated).
And, how can this argument make sense when the solution is to put the table name in the field name to avoid typing the table name?
Why do people keep insisting that it is easier to type table_id than table.id? It is actually one extra keystroke over the more naturalistic scheme.
If there was a performance argument, I could accept that (although many performance arguments seem dependant on assuming one knows how the query engine internals work and hoping it will not work differently on the next service pack).
Of course, there may be good reasons I haven't encountered yet...
The things I hate the most are ignorance and stupidity. And no matter how hard I work, I just can't seem to rid myself of them.
Paul Coddington on July 30, 2008 11:43 AMNormalize to that grad where you feel comfortable with the balance between duplication and performance.
Juergen on July 31, 2008 2:20 AMA whopping six joins?! Really?! Omg, save the whales now before it's too late.
Seriously, six joins isn't a big deal, and unless you have multiple millions of people networking in your site, your database shouldn't be dying with the model you describe. Someone f*ed up somewhere, and it has nothing to do with normalization.
You're missing the HUGE costs to de-normalized data on large dataset updates. You have to update multiple places every time data changes. For large datasets this can be disastrous, especially when you have to update objects as a single unit (or roll them all back). As these updates increase, locks increase, which then slows down everything else. You're seriously discussing running cron jobs to clean up the mess you are creating, rather than just writing clean code the first time.
If your only course to improve performance was denormalization, then you either have a pretty rare situation, or you need new architects and/or DBA's. I've worked at three companies now where developers lack of knowledge concerning databases and data relationships brought applications to a halt. It comes repeatedly from a developers centrist view of applications where their part is always the most important component and everything exists to serve their application. After than comes an over-reliance on data layer generators, which cannot optimize queries as well as a good DBA. So basically the developers make cuts to save time, misuse the data model, then blame the DBA when things go wrong.
Out of curiosity, did you find your DBA and Developers often arguing over where things should go? Every time I've seen the behavior, it's been because DBA's feel they cannot trust the developers to intelligently access data, either due to the current team, or that DBA's past experiences.
Every time I see a post like this on a blog I try to imagine it in Nuclear Engineering. Do you find yourself putting needless safeguards in your nuclear generators just because some professor told you to? They cost too much! Remove them! Don't do things by reflex! Every engineer should just do what works for them! Don't let standards get in the way of progress and performance!
I find it humorous, that people follow a million rules such as table names should be plural as a doctrine, then casually discard good design at the first hint of an actual challenge. To be fair, naming tables is fairly easy, and you rarely find that table names stump a developer and cause them to look for problems.
i cant believe i have to read all this bullplop
Fancypants on August 6, 2008 9:38 AMyou'd better go at work and denormalize your databases. All of you :D
.sid on August 8, 2008 9:52 AMEven the normalize it 'til it hurts, then denormalize, which seems the nearest to rational argument, doesn't nearly get at the point.
The point of normalization is to analyze the data model in order to determine how the data all hangs together. In effect, once you have gone through normalization, you *understand* your data model, and have a representation that allows expressing information with minimal data; it doesn't have redundancies that consume disk or cause update anomalies or generally require managing.
The next thing to do is not to denormalize, but rather to create the concrete schema based on the understood data model. If you have a decent query optimizer, *most* of the queries should turn out perfectly well when using a pretty well normalized concrete schema.
When exceptions arise, of queries that perform badly, you need to react to them, and denormalizing isn't a thing to do.
1. Easiest is to consider adding indices; if that resolves the problem cheaply, great!
2. Second easiest is to see if the expensive query may be replaced with some cheaper query. No denormalization here.
3. There may be DBMS-specific techniques such as functional indexes or partial indexes to precompute or avoid actions that are expensive. This is a declarative action, thus comparatively cheap, and again, doesn't involve denormalization.
4. But then there is the possibility that it may be necessary to alter the data model to *PRECOMPUTE* something so as to avoid the need to process something (e.g. - maybe a join or something) in the totally normalized data model.
It is in the fourth case where we diverge from a normalized model. But that wasn't termed denormalization; it was described as precomputing, that is, storing a precomputed result. You don't just randomly denormalize - you look for things that are being computed that could be computed earlier and stored to save recalculating later.
Christopher Browne on August 14, 2008 7:47 AMThere's some good advice here, but I can already see people citing it as a 'we don't need normalisation' piece. And if I can recall back to my time studying normal forms (back in the early 90s) we were always taught that you would probably need to denormalise - did they stop teaching that part?
Personally, I think there are good reasons why courses drill the maxims of normalisation into students, and that's because without learning them reflexively, the majority of designers and developers make a pigs ear of data models, dumping data into the first entity they can find to hold it, or seeing the database as nothing more than a way of serialising objects.
A significant problem is that the relational model is strongly associated with SQL, which is pretty much like associating high level programming and COBOL. SQL was simply the first relational query language, but it's success unfortunately locked in some bad early design decisions. Check out Date and Darwen's 'The Third Manifesto' for proposals made in the light of experience.
Later languages, and even SQL syntax revisions, have unfortunately failed to gain much traction - for instance a lot of the code in your query can be eliminated by an RDBMS that actually understands the relationships between tables (which we currently specify in each query rather than metadata). Some RDBMS already support this in SQL itself.
Or how about a syntax where you would simply declare the user view as the user table extended by the other tables? The point is that none of these issues actually relate to the model per se.
The second major problem is that the relational model (and SQL) was supposed to free us from thinking about the underlying physical model. We should not be concerned if a query joins 5 tables of 100 columns, or 500 tables, because we're not supposed to know/care how it is implemented underneath.
The default response to that is that is typical ivory tower theory - but recall that the same objections were raised by C developers towards OO, and C++ developers towards managed code and the Java VM. RDBMS have vastly improved in performance in much the same way that VM languages have.
The third problem is with people using an RDBMS where they don't have to, or where they don't fit well, because they've gone with an off-the-shelf architecture. The equivalent view is thinking that your correct DBMS free architecture is equally applicable to all problem spaces.
Final thought - some people reading this would be scared of joining 6 tables, yet would think nothing of having a class reference 6 other classes - is there really a difference? (What if those classes each make a d/b or file system query?)
JulesLt on August 29, 2008 4:25 AMSure, dogmatic adherence to the princple of normalization no matter how much it hurts is foolish. And there are times when it does hurt. But normalization was invented for a reason. In your example, above, you give a simplified design with room for 2 screen names and 3 affiliations. But what if someone has 3 screen names or 4 affiliations? Where do we put the data? You've built in arbitrary limits. And sure it's true that in the unnormalized schema, getting all the data about a single user becomes a simple select * from user. But what if we want to find all the users with a given affiliation? In the normalized schema, we write select user_id from userAffiliation where affiliation_id='foobar'. (Yes, if we need other data about the user besides his user id, there's an extra join.) In the unnormalized schema, we have to write select * from user where affiliation_id_1='foobar' or affiliation_id_2='foobar' or affiliation_id_3='foobar'. We have to type everything three times, and if we want that query to run efficiently, we have to have three separate indexes. I'm not sure if your intent was that the entire database consists of only one table, or if there are other tables for other categories of data. If so, and if any of these link to the affiliation, then any joins have to join against any of the three fields. If the other table was designed with a similar philosophy and also has three fields for affiliation, then we need 3 * 3 = 9 clauses in our where to catch every possible combination. And if somewhere down the line somebody decides that we need to up the limit to 4, then we have to search all of our queries to make them all search against this fourth slot in addition to the first three. If we miss one, we'll get very subtle bugs, where it seems to work most of the time, only failing for the tiny number of users who have a fourth affiliation.
I discuss the rules and when it's good to break them in more depth in my book, A Sane Approach to Database Design. (Was that clever the way I snuck in the sutble plug for my book? Okay, probably not all that subtle.)
Mark on October 5, 2008 3:03 AMSubject Line: Beat Long Poll Lines with Absentee Ballots from StateDemocracy.org
Many state and local election officials are encouraging voters to use Absentee Ballots to avoid the long lines and delays expected at the polls on November 4th due to the record-breaking surge in newly registered voters.
Voters in most states still have time to obtain an Absentee Ballot by simply downloading an official application form available through www.StateDemocracy.org, a completely FREE public service from the nonprofit State Democracy Foundation.
Read More: http://us-2008-election.blogspot.com/2008/10/beat-long-poll-lines-with-absentee.html
Bit of a johnny-come-lately to this, but I would always go for full normalization at the data analysis/modelling stage, then take a pragmatic view at implementation. If, in the particular case, it is obvious that denormalization is the sane option, do it.
One comment I would make. If you are designing an information storage and retrieval system that is queried more often than it is updated, it might make sense to use a fully normalized schema for updates and a denormalized one for queries. When inserting, updating or deleting data, use triggers to refresh the query tables to reflect the changes. The SELECT command to query the denormalized table or tables would be far simpler than one with six joins in it, and data integrity would not be an issue.
Jel Mist on January 2, 2009 4:21 AMThe way i see it - the goal of normalization is creation of a logical design that gives you data integrity and expandability(in the design) and hence It does not consider query evaluation performance in its design. This is pushed to the lower layer in the design phase - the physical db design which now looks at volumes of data! And that's where the problem is? We need a design methodology that looks at - the data AND its usage before we design -- is denormalizing at that level a solution? Maybe partly
Please comment as I plan on researching this area
Fascinating Post, a couple small comments to the awesome list. IHMO: OLTP Databases require "Normalized" designs, not on principle but out of necessity. Reporting Databases on the other hand requires "De-Normalized" designs for performance/scalability. A single database should not have to solve both problems (performance and integrity). Understanding the usage of the database in question and using the design required is the right answer, there isn't a one size that fits all...
Raj Datta on June 30, 2009 6:41 AMCoding horror indeed.
Interesting, but clearly solving the wrong problem. Why are you using a relational database at all? It doesn't fit your problem domain.
Stephan Eggermont on July 2, 2009 12:35 PMI'm sorry this articles comparisons are irrelevant. Normalization and Denormalization achieve different goals, and are used as such.
Mickey Bricks on August 19, 2009 9:30 AMI will have to nod on the use of views, especially indexed views, on this. Views help in performance because it tells the database which queries with complex joins will be accessed more often, and allow the database to pre-store its execution plan (as opposed to adhoc T-SQL) Indexing against the view will further optimize it.
If you use NHibernate, views have the additional advantage of being treated by NH as just another table, so you could actually design a view against an object in a much simpler manner.
Jon Limjap on February 6, 2010 10:38 PMJeff,
It's common in programmer circles to think there's not a best way to do things, but in this case, that's not true. The best way to store data in a relational database is to normalize it; that is, to take full advantage of its relational nature.
In your example, you do not need six joins to get all the user's data: you choose what the pertinent data is for a situation. If you do need all that data, and performance is an issue, materialized views, or some other caching system, is the way to go.
I appreciate your post, but it's very wrong, and harmful advice to an inexperienced programmer.
Clinton R. Nixon on February 6, 2010 10:38 PMhey all you young guys - logical modelling (data model) is normalised (for understanding), physical model (database) is denormalised (for performance) - duh!
non american spelling on February 6, 2010 10:38 PM@[ICR]
Just a nitpick, I'd recommend using the term Gender over Sex
Do you sex a cat, or do you gender it?
Don't use an incorrect term just because the politically correct crowd want to make the very mentioning of words that might also refer to a natural human activity. Fight the pussification of language!
If you want to know if someone is physically male or female, the question is sex, not gender.
a href=http://en.wikipedia.org/wiki/Genderhttp://en.wikipedia.org/wiki/Gender/a">http://en.wikipedia.org/wiki/Gender/a">http://en.wikipedia.org/wiki/Genderhttp://en.wikipedia.org/wiki/Gender/a
Simon Wright on February 6, 2010 10:38 PM^^
I don't know what happened with the wikipedia URL, but I didn't write the HTML tags. I just pasted the URL.
http://en.wikipedia.org/wiki/Gender
http://en.wikipedia.org/wiki/Gender
Simon Wright on February 6, 2010 10:38 PMThis is why this article is wrong, Jeff. This is why you're an idiot...
In your opinion.
In my work building and maintaining content and asset management systems, I've seen far more problems arise from over-normalisation than under-normalisation. Far too often I've seen database designs that fall over because they're so intricately normalised. They don't fail technically, but they become a nightmare to write code against.
On the other hand, some of our most long-lasting table structures have been criminally flat. They have their limitations. They suffer some redundancy. But they're rarely misunderstood, and always a joy to attach to your new systems, because their broad usage has spawned many valuable tools and code that work with them -- and that's far more valuable than having an unlimited number of phone numbers for each contact.
Simon Wright on February 6, 2010 10:38 PMHow do views speed up performance?
1) They prevent bad joins like the one demonstrated here.
2) You can use a materialized view which is the data denormalized for performance, but backed by normalized data for integrity.
The question is what is important: Your data or your performance?
If your limiting where clause is only one one table, you should not see a bad hit, because you will be hitting only the few select rows of each other table. You will take a hit for disk seek if you do a poor job of laying out your physical storage, but that is a different issue.
If you care about data integrity, you need to be normalized, and you need things like foreign keys and constraints. Normalizing data can hurt performance, but so few applications are of the size to cause this issue that usually data integrity is the bigger concern.
On a P-200 with 128M of RAM and a 10G database (point of sale with automated reordering and the lot) fully normalized, the 100MB network card was the bottleneck. If you need more than that, you have the money to buy hardware that can support it.
Grant Johnson on February 6, 2010 10:38 PMThe obvious tradeoff in denormalizing your data is that your data structure has lost meaning and flexibility. I've worked on projects which required normalizing unstructured or poorly structured data, and it's often difficult or impossible to move in that direction. I'm surprised you're at that point already - I'd normally think you'd look at optimization once you've had a site running for a while and analyzed actual usage data. Regardless of the direction you're moving, materialized views are very helpful - a materialized denormalized view can give you the performance benefits of denormalization without messing up your data, and a materialized normalized view over unstructured data can reduce some of the pain in dealing with User_Phone_Business_Prefix. Have you looked at the sparse column feature in SQL Server 2008? That's supposed to reduce storage size (and thus increase performance through increased page efficiency) when working with wide columns in which many columns may contain null values.
Jon Galloway on February 6, 2010 10:38 PM[I came across this rather old post, and couldn't resist putting in my 2c-worth.]
Let me voice a dissenting opinion from a different angle...
Denormalisation is often necessary in modern SQL databases, but not because there is a fundamental problem with the concept of normalisation. It is because SQL databases don't support one of the relational model's core principles: separation of logical and physical models.
When Codd defined the relational model, he explicitly stated that it was a logical model of data, and that the means of storing relations in a data bank was orthogonal to the model itself. This meant, for example, that if a particular pair of relations was often joined, the DBMS would be free to store the pair as a single set of records, which would nonetheless continue to appear as two distinct relations to clients, with the ability to manipulate them independently of each other, but which would also be extremely fast to join (a NOOP, essentially).
Unfortunately, SQL databases almost universally ignored this, and insisted on maintaining a very close mapping between table rows and database records. There has been a recent trend towards column stores, which put paid to the fallacy that rows must be records, and provide just a tiny hint at the power that would be unleashed if DBMSs allowed DBAs and programmers to separate the logical and physical layers in whatever way they saw fit. Ironically, such databases don't even pay lip service to the relational model, and instead we hear all this talk about "post-relational" databases. What a crock!
So, yes, sometimes you just have to denormalise. But don't blame the relational model for this, and don't go looking for the post-relational pot of gold at the end of the rainbow. Blame the SQL DBMSs that ignored Codd and left us all bickering and squabbling amongst each other over a dilemma that should never have existed.
And if you want to do something positive, ask DBMS vendors to make your databases more relational, not less.
Marcelo Cantos on May 31, 2010 11:56 PMVery Nice post, quiet informative.
Database designing problems being the buzzwords these days, one of the most common designing mistakes is the “Poor Planning / Architecture” of the database or mart. Follow the link to know more…
http://www.sqllion.com/2010/08/database-design-and-modeling-i/
The comments to this entry are closed.
|
|
Traffic Stats |