One of the items we're struggling with now on Stack Overflow is how to maintain near-instantaneous performance levels in a relational database as the amount of data increases. More specifically, how to scale our tagging system. Traditional database design principles tell you that well-designed databases are always normalized, but I'm not so sure.
Dare Obasanjo had an excellent post When Not to Normalize your SQL Database wherein he helpfully provides a sample database schema for a generic social networking site. Here's what it would look like if we designed it in the accepted normalized fashion:
Normalization certainly delivers in terms of limiting duplication. Every entity is represented once, and only once -- so there's almost no risk of inconsistencies in the data. But this design also requires a whopping six joins to retrieve a single user's information.
select * from Users u inner join UserPhoneNumbers upn on u.user_id = upn.user_id inner join UserScreenNames usn on u.user_id = usn.user_id inner join UserAffiliations ua on u.user_id = ua.user_id inner join Affiliations a on a.affiliation_id = ua.affiliation_id inner join UserWorkHistory uwh on u.user_id = uwh.user_id inner join Affiliations wa on uwh.affiliation_id = wa.affiliation_id
(Update: this isn't intended as a real query; it's only here to visually illustrate the fact that you need six joins -- or six individual queries, if that's your cup of tea -- to get all the information back about the user.)
Those six joins aren't doing anything to help your system's performance, either. Full-blown normalization isn't merely difficult to understand and hard to work with -- it can also be quite slow.
As Dare points out, the obvious solution is to denormalize -- to collapse a lot of the data into a single Users table.
This works -- queries are now blindingly simple (select * from users), and probably blindingly fast, as well. But you'll have a bunch of gaping blank holes in your data, along with a slew of awkwardly named field arrays. And all those pesky data integrity problems the database used to enforce for you? Those are all your job now. Congratulations on your demotion!
Both solutions have their pros and cons. So let me put the question to you: which is better -- a normalized database, or a denormalized database?
Trick question! The answer is that it doesn't matter! Until you have millions and millions of rows of data, that is. Everything is fast for small n. Even a modest PC by today's standards -- let's say a dual-core box with 4 gigabytes of memory -- will give you near-identical performance in either case for anything but the very largest of databases. Assuming your team can write reasonably well-tuned queries, of course.
There's no shortage of fascinating database war stories from companies that made it big. I do worry that these war stories carry an implied tone of "I lost 200 pounds and so could you!"; please assume the tiny-asterisk disclaimer results may not be typical is in full effect while reading them. Here's a series that Tim O'Reilly compiled:
There's also the High Scalability blog, which has its own set of database war stories:
First, a reality check. It's partially an act of hubris to imagine your app as the next Flickr, YouTube, or Twitter. As Ted Dziuba so aptly said, scalability is not your problem, getting people to give a shit is. So when it comes to database design, do measure performance, but try to err heavily on the side of sane, simple design. Pick whatever database schema you feel is easiest to understand and work with on a daily basis. It doesn't have to be all or nothing as I've pictured above; you can partially denormalize where it makes sense to do so, and stay fully normalized in other areas where it doesn't.
Despite copious evidence that normalization rarely scales, I find that many software engineers will zealously hold on to total database normalization on principle alone, long after it has ceased to make sense.
When growing Cofax at Knight Ridder, we hit a nasty bump in the road after adding our 17th newspaper to the system. Performance wasn't what it used to be and there were times when services were unresponsive.A project was started to resolve the issue, to look for 'the smoking gun'. The thought being that the database, being as well designed as it was, could not be of issue, even with our classic symptom being rapidly growing numbers of db connections right before a crash. So we concentrated on optimizing the application stack.
I disagreed and waged a number of arguments that it was our database that needed attention. We first needed to tune queries and indexes, and be willing to, if required, pre-calculate data upon writes and avoid joins by developing a set of denormalized tables. It was a hard pill for me to swallow since I was the original database designer. Turned out it was harder for everyone else! Consultants were called in. They declared the db design to be just right - that the problem must have been the application.
After two months of the team pushing numerous releases thought to resolve the issue, to no avail, we came back to my original arguments.
Pat Helland notes that people normalize because their professors told them to. I'm a bit more pragmatic; I think you should normalize when the data tells you to:
Never, never should you normalize a database out of some vague sense of duty to the ghosts of Boyce-Codd. Normalization is not magical fairy dust you sprinkle over your database to cure all ills; it often creates as many problems as it solves. Fear not the specter of denormalization. Duplicated data and synchronization problems are often overstated and relatively easy to work around with cron jobs. Disks and memory are cheap and getting cheaper every nanosecond. Measure performance on your system and decide for yourself what works, free of predispositions and bias.
As the old adage goes, normalize until it hurts, denormalize until it works.
I have had to fix a few huge databases over the years and often the problem is a bad, but normalized, database design.
Al on July 15, 2008 7:20 AMI just spent a year having normalization hammered into my meager little mind as part of my database module and now you're telling me that it mightn't be necessary...
This is why this article is wrong, Jeff. This is why you're an idiot, in case the first statement wasn't clear enough. You just gave an excuse to be lazy to someone who doesn't have a clue.
You don't normalize because of ghosts; you normalize because it doesn't make fucking sense otherwise. A tuple is a logical statement; denormalizing means you take the risk of having contradictory or false statements in your database in order to gain performance. (That's why it's recommended for read-only databases, for example: the risk of having incorrect statements is much smaller.) If premature optimization is the root of all evil, denormalization is Satan's ugly mother. It's right there with the compiler is wrong.
I have a feeling a lot of the commentors missed your point, which means they probably just skimmed over the post and read the bolded parts. True, the six join query was an exaggerated example, but still, the point stands.
There wasn't a call to denormalize from the beginning, people. There was a call to denormalize when bneeded/b.
Mix on July 15, 2008 7:20 AMYou just need 1 table with 2 fields:
- id (primary key)
- data as Xml
lol !
Patrice on July 15, 2008 7:22 AMI've never been a database guy, but this whole issue seems like the classic problem of optimization to me. There's almost always that trade-off between elegant readable design and optimal execution speed.
I disagreed and waged a number of arguments that it was our database that needed attention.
We first needed to tune queries and indexes, ...
Consultants were called in. They declared the db design to be just
right - that the problem must have been the application.
This is precisely why you never optimize until *after* you profile (find objectively where the bottlenecks are). One of my pet peeves is folks who optimize their code with absolutely no evidence that there's an execution speed problem there, or even anywhere.
Ah, and you should read the comments to Dare Obasanjo's article: there's no six-way join; there's a three-way join at the worst.
Marcel Popescu on July 15, 2008 7:24 AMGreat point. At my previous employer, they ran into scalability issues, and denormalizing their data never entered their mind (as far as I know). Another great way to speed things up is to do more processing outside of the database. Instead of doing 6 joins in the database, have a good caching plan and do some simple joining in your application.
My article explains it better:
http://www.ytechie.com/2008/05/stored-procedure-reporting-scalability.html
I've been working with RDBMS's since about 1986, and currently support a very large complex (normalized) database.
I have to disagree that normalized databases can't perform. Or that you should assume they cannot. This is just plain not true and a bad idea. You normalize, then you test, and then you refactor those places that need to perform faster. Never waste any time trying to design around performance problems that 'may' happen, only those that you can actually measure. Anything else is a waste of everyone's time.
Object-Relational database technology doesn't solve anything, the relationships between objects still have to be built into a persistence structure. All it does is hide the complexity some (which can be a good thing). Just don't assume that because the object hides the complexity that the actual physical storage is simpler as well.
Lastly - No one seems to have mentioned that a decently normalized db greatly reduces application refactoring time. As new requirements come along, you don't have have to keep pulling stuff apart and putting back in new configurations of the db, update synchronization code etc. You can concentrate on the presentation and persistence APIs, and gradually refactor a much smaller set of db changes than you would if you had denormalized everything.
I totally agree with Greg, correctness comes first.
Ed Delaney on July 15, 2008 7:24 AMI think file system performance is often overlooked when examining performance problems in medium/small db's. Denormalization will help when you have an i/o bottleneck - to a point. However, I personally find it faster and more cost effective to upgrade my i/o subsystem than to refactor my db and app for a denormalized schema.
Jeff, be careful when giving advice like this... De-normalizing is something that should only be done by people who ALREADY UNDERSTAND when to use it and how. I fear you're giving it the air of being trendy and thus confusing the less experienced developers out there. :-P
Bruce Johnston on July 15, 2008 7:25 AMI've been working with RDBMS's since about 1986, and currently support a very large complex (normalized) database.
I have to disagree that normalized databases can't perform. Or that you should assume they cannot. This is just plain not true and a bad idea. You normalize, then you test, and then you refactor those places that need to perform faster. Never waste any time trying to design around performance problems that 'may' happen, only those that you can actually measure. Anything else is a waste of everyone's time.
Object-Relational database technology doesn't solve anything, the relationships between objects still have to be built into a persistence structure. All it does is hide the complexity some (which can be a good thing). Just don't assume that because the object hides the complexity that the actual physical storage is simpler as well.
Lastly - No one seems to have mentioned that a decently normalized db greatly reduces application refactoring time. As new requirements come along, you don't have have to keep pulling stuff apart and putting back in new configurations of the db, update synchronization code etc. You can concentrate on the presentation and persistence APIs, and gradually refactor a much smaller set of db changes than you would if you had denormalized everything.
I totally agree with Greg, correctness comes first.
Ed Delaney on July 15, 2008 7:27 AMTo everyone bashing Jeff on his database design skills - he didn't come up with that design. He clearly pointed out that the design was from Dare Obasanjo's blog post on this topic.
Allan N on July 15, 2008 7:31 AMSome things are too slow in a normalized database. Facts are Facts. You just rout around this:
* Keep a de-normalized replica for expensive operations (e.g. reports)
* Cache Results for repeat queries (Memcached)
* Partition the database for scalability (vertical or horizontal)
@Julio: Hibernate, at least the Java version, actually has a concept called join fetching where it will join a related collection instead of doing a second select. So, just for the record, an ORM wouldn't necessarily use all separate selects.
a on July 15, 2008 7:34 AMSpeaking from long experience, if you don't normalize, you will have duplicates. If you don't have data constraints, you will have invalid data. If you don't have database relational integrity, you will have orphan child records, etc. Everybody says we rely on the application to maintain that, and it never, never does.
The question is, what's the cost of the data integrity problems? If it's something you can live with for performance, fine. If it's a disaster waiting to happen, you better build your database design as tight as possible.
A. Lloyd Flanagan on July 15, 2008 7:34 AMThis post seems to suggest throwing normalisation away, saying that our professors are all wrong and the real way to design databases is to just do what works, as that's how it works in the real world.
What a horrible, cowboy attitude to DB design. I hope you don't design any real databases.
Data integrity is the most important aspect of a database. PERIOD. Performance is ALWAYS second. The CFO / CEO might be concerned if their database is running slowly, but they will have your neck if the data inside the database is garbage. Properly normalised databases are important because they enforce data integrity, rather than relying on application developers to enforce it.
You even said this yourself, though the point seems to have been lost on you when you said it. And all those pesky data integrity problems the database used to enforce for you? Those are all your job now.
No.... just, no.
Relying on the application to enforce data integrity is the surest path to hell that I know of. That is the whole freaking point of a database. Hundreds of thousands of hours of development time and testing has gone into each database product to guarantee this level of service. There are thousands of features. And you want to replace all of that time and testing with some hand rolled garbage that has received maybe 100 hours of testing if you're really lucky? What an absurd idea. What a horrible anti-pattern.
Denormalisation is an optimisation over the normalised data. The DB should ALWAYS be normalised with denormalised components added later as scalability issues need to be addressed. The normalised part of the database is still the authoritative source of data, and the denormalised part simply aids in query execution. You can always rebuild the denormalised data from the normalised data which has had all of the data integrity rules enforced on it.
Come on. I know it's easy to hate on professors for not being in the real world and not having frontline experience, I hear this wanky sort of rhetoric all the time, but you might be surprised to know that they do sometimes know a thing or two.
Dave G. on July 15, 2008 7:34 AMBuggyFunBunny,
I was wondering how far down the comments it would be before we'd see a pontification from the anti-knucklehead, database crusader.
Best thing about your comments are they provide a us with a beautiful reference point for one (extreme) end of the application design continuum. At one end we have persistence ignorant application design, and at the other database is god.
Gee, I wonder which end you belong to? You are an extremist and as such your views need to be treated as such.
AJ on July 15, 2008 7:35 AMSeveral people already hit on what I was going to say, but here goes anyway...
What is the purpose of the database? Is it to process transactions or report findings? I always take the route that I design the data model using the best practices approach and considering my requirements.
Now, I double check requirements to make sure the database will behave correctly. If 99% of what I am doing will be reporting off the tables, I will either create Materialized Views (Oracle) or Materialized Query Tables (DB2) that basically denormalized the data for quick reporting. This can be a problem if you need up to the minute dynamic reports since the MV/MQT's need to be refreshed periodically. Currently, we are on a 2 week refresh, so this approach works well for us. The MV/MQT's carry some resources, but memory and storage is cheap!
Now, if it is a quick app for logging or something, I will just create a flat/wide table and put everything in it. After all, this log is for me and my team... no reason to confuse the issue with 5 tables to hold each piece of the log information... Plus, this data gets cleaned out every so often and I am not concerned about instant access. Just index the table by date, and order by in the sql...
If I am going to be constantly hitting the data with updates/inserts/deletes (like in order processing or something) the normalized structure works very well... and I don't carry the overhead of all the duplicate data associated with MV/MQT...
So, I don't think you can make any blanket statements on normal vs. non-normal form. Like everything else in programming, it all depends on requirements and intended goals.
my $0.02
Wayne on July 15, 2008 7:39 AMSergej, great comment.
Jeff, I remember from one of the early podcasts when you were describing the box you were deploying on that it was just a single machine, running the database and the web server, is that correct? That immediately struck me as the first thing that will have to change when you go live. It is astounding the performance difference you get by putting the database on its own box and clicking the little box that basically gives the machine to Sql Server.
There is a lot of idealism in the comments above. Lots of alternative databases, OODBMS's, LINQ. Software developers in general have a very poor understanding of how databases work, how to tune them, how to design them. I wonder how many billions of dollars have been spent on servers in the last ten years that could have been saved by a simple index.
The design of any database that expects millions of rows in any table should be handled by an experienced DBA/Programmer. All access should be controlled via stored procedures. Never, ever, ever let a software tool or a (non-dba) software developer write your queries for you. db_reader and db_writer should *not* be privileges that your applications enjoy.
If you look at a query that has 30 joins and immediately think it is horrible, please analyze and tune the query first. Only the joins that require a big scan matter. Look at the plan and if you see 29 joins at 0% and 1 join at 100% of execution time, then only one of the joins is bad. Go ahead and add ten more if you need to. De-normalization is for reporting, not for OLTP.
Here's an excellent book for really digging into how Sql Server processes queries: http://www.amazon.com/Inside-Microsoft-SQL-Server-2005/dp/0735623139 (T-Sql Querying by Ben-Gan).
Jeff, one of your comments struck me as worth a little more thought: Disks and memory are cheap and getting cheaper every nanosecond. Not everyone can afford a SAN, and if you're running everything on one box, it's not about how cheap the disks are, it's about how many you can cram into the machine. You want the fastest disks you can get, which means you are limited in size, and you want RAID, which eats up more space. In a 1U machine, you get maybe 2x136Gb 15k drives, double that in a 2U. And even that leaves everything running on the same set of spindles. So you move up to NAS or DAS, but you are still limited, because if you just fill it up with slow 1Tb drives the IO will kill you. Data size matters.
Eric on July 15, 2008 7:39 AMYour six-way join is only needed because you used surrogate instead of natural keys. When you use natural keys, you will discover you need much fewer joins because often data you need is already present as foreign keys.
Leandro on July 15, 2008 7:40 AM@T.E.D:
I've never been a database guy, but this whole issue seems like the classic problem of optimization to me. There's almost always that trade-off between elegant readable design and optimal execution speed.
Absolutely, although you've got another element in the trade-off: data correctness. A database design that allows inconsistent data will have inconsistent data.
I have to agree with other posters: you can denormalize if you have to, and you know what you're doing. If you don't you may have to assume that the cost of incorrect data trumps other considerations.
A. Lloyd Flanagan on July 15, 2008 7:40 AMThe solution it 2 tier storage. Websites tend to be read many, write few. The first tier is normalized database, the second tier is de-normalized data specifically constructed for use by the application. In an app that I am building, the 2nd tier is a combination of memcached and plain old disk using binary serialization.
The expense now is that you have two writes, one to the real db and one to the de-normalized tier.
So now you've separated the read requirements/performance from the write and relational data storage.
Damian on July 15, 2008 7:41 AM@Leandro:
Your six-way join is only needed because you used surrogate instead of natural keys. When you use natural keys, you will discover you need much fewer joins because often data you need is already present as foreign keys.
The whole question of surrogate vs. natural keys is of course a topic of enormous debate. Generally, I'm with you: natural keys are easier, safer, and often don't hurt performance the way people assume they will.
A. Lloyd Flanagan on July 15, 2008 7:44 AMI think the general advantage of normalization is the same as the DRY principle in coding. When you denormalize stuff you can wind up maintaining copies of data and doing the same thing twice.
Another advantage is it is easier to go from normalized to denormalized when you need it than the other way round.
CT2 on July 15, 2008 7:44 AMJust as well Dare is a social networker and not a database designer!
Look at some of the arbitrary limits in his design:
* A fixed number of telephone numbers per person
* A fixed number of job histories per person
* A fixed number of IM accounts per person
Now, what happens when the big DOH! hand hits the forehead, and our short-sighted denormaliser suddenly realises that out there, in the REAL world, the numbers of things in sets change on a regular and frequent basis ... that's right, we have to change the schema!
It gets worse. The proponents of denormalised schemas are the most likely to write select * ... in their queries. That means reviewing and changing all our code as well.
A properly designed database schema should survive most business changes. Denormalised (abnormalised?) schemas don't.
And as others have pointed out in this thread, the avoidance of surrogate keys can do wonders for your performance.
rond on July 15, 2008 7:46 AMIn my experience, beginning developers get into a lot of trouble because they're afraid of joins. The common symptoms are apps that are:
(i) excessively complex,
(ii) never quite work right, and
(iii) are expensive to change
It definitely is true that we're moving into an era of web scale systems that are much greater in size and power than traditional enterprise systems. Many people are pushing the performance limits of the traditional RDBMS, and we're seeing a move towards distributed main-memory databases and other new technologies.
Workaday business apps benefit greatly from normalization, so I think this post has a dangerous message for many developers. In the case above, I think it still makes sense to use a normalized database to keep the authoritative copy of the data, but it makes sense to cache the data in denormalized forms. Rather than creating a table with the notorious phone_number_1, phone_number_2, ... phone_number_33 fields that I'm always cleaning up in other people's applications, you can just serialize the data with the serialization system that comes with your language or JSON or something like that.
Social sites generally have many more reads than writes, so this can improve performance 100 fold or more, but still keep the advantages of a relational back end. If the cache gets messed up, you can always do a partial or complete rebuild based on data you trust.
Paul Houle on July 15, 2008 7:51 AMI've had programmers tell me that they don't care about the size of data or normalization. Then they cry when they have 200,000 fat rows and performance crawls.
David B on July 15, 2008 7:53 AMFor those advocating the use of Views: the database does not store the view data. The data is recreated, using the view's SELECT statement, every time a user queries a view.
Furthermore, see this article on View performance: http://sql-server-performance.com/tips/views_general_p1.aspx
Steven Bey on July 15, 2008 7:54 AMWhat I think is funny is the number of people who think that because they use LINQ or Hibernate they aren't affected by these issues... In the end you are storing your data in a relational database, right? Unless you are plugging LINQ or Hibernate into something other than Oracle, SQL Server, MySQL or Postgres (or any other RDBMS) this is still an issue for you...
Sure, *you* might not be coding the six inner joins, but your OO mapper sure is.
Scary.
Sam on July 15, 2008 7:54 AM@Dare
In addition, who says you wouldn't want all the user data including phone numbers at once? Have you ever seen a profile page on FB? That's exactly what it does.
I think what Ariel is saying is that if you performed the 6-join query your return data would be duplicated. For example if a user had two phone numbers and three screen names the return data would look like:
user_id phone_number screen_name
1 111-1111 name1
1 111-1111 name2
1 111-1111 name3
1 222-2222 name1
1 222-2222 name2
1 222-2222 name3
Using this format of the data it would take some monkeying around to create the display text:
User: 1
Phone Numbers: 111-1111, 222-2222
Screen Names: name1, name2, name3
It would be easier to select by phone number and then by screen name instead of trying to get all of the data at once.
Eric on July 15, 2008 7:55 AM@[ICR]: Kudos to you.
Vance Vagell on July 15, 2008 7:55 AM@[ICR]: Kudos to you.
Vance Vagell on July 15, 2008 7:55 AMPlease think! Is this post even about normalization? The 2 database designs you used to illustrate your point are actually solutions to 2 different problems. They are not the same database and they cannot represent the same data. You can argue that the second one is faster but only if it can actually be used to represent the facts that are meaningful to your problem otherwise this is just hot air. This kind of lack of thinking is what usually makes life with databases a hellish proposition because almost invariably someone will ask for data that is not stored because it was easier to say I need performance so I'll just ignore the requirements (or not think them thru) and de-normalize!
Terrier on July 15, 2008 8:09 AMI keep my data normalized. I keep it all in memory. I use Prolog to do it.
screed on July 15, 2008 8:19 AMJeff,
You make a good point but there really is a upper limit on how far things can be stretched with denormalization. When stored procedures start to have anywhere from 100 to 200 arguments and sql server 2000 warns you that your row size exceeds its 8063 byte limit then maybe...just maybe its time to normalize the tables just a bit.
But this design also requires a whopping six
joins to retrieve a single user's information.
Yes, if you need all of it or if you need all of it in a single table.
Very often neither is the case. Very often you don't need all of the information at once and sometimes you first need some of it and later on some more, so you don't need them in a single table and as joined query, but as separate queries istead.
i hope the people at twitter listen to this. :P
they got people to care, now it's time to move fwd :P
The link for Findory and Amazon is: http://radar.oreilly.com/2006/05/database-war-stories-8-findory.html
Josh Parris on July 15, 2008 8:29 AMJeff,
You miss the point of normalization entirely. Normalization is about optimizing large numbers of small CrUD operations, and retrieving small sets of data in order to support those crud operations. Think about people modifying their profiles, recording cash registers operations, recording bank deposits and withdrawals. Denormalization is about optimizing retrieval of large sets of data. Choosing an efficient database design is about understanding which of those operations is more important. A system that will primarily be serving content may be a good choice for denormalization. That doesn't mean that normalization doesn't scale, it just doesn't scale for one particular problem.
RevMike on July 15, 2008 8:36 AMMecki said:
Yes, if you need all of it or if you need all of it in a single table.
Very often neither is the case.
This is spot on.
Let's say you wanted to run a query to find all the people that worked for a specific company:
select first_name,last_name from User u,UserWorkHistory uwh where
u.user_id == uwh.user_id AND
uwh.company_name = 'Coding Horror';
bi(BTW I haven't written SQL in ages so this may be grossly incorrect; I think you get what I'm trying to say, though)/b/i
Granted, I could have just as easily done:
select first_name,last_name from User where
User.work_history1 = 'Coding Horror' OR
User.work_history2 = 'Coding Horror' OR
User.work_history3 = 'Coding Horror';
But which makes more sense?
I don't think there's any harm in starting off normalized and scaling at some point later. Premature optimization is evil, and for all you know stackoverflow may scale just fine under the circumstances.
Michael Acobas on July 15, 2008 8:41 AMI recently created a denormalized table with over two hundred columns because it was just needed to generate a PDF from a long online application form. There was no need to query the data for any other reports. The table is only meant to serve one purpose. However, those 200 plus columns did exceed the limit for the number of columns a SSIS export will support so now I need to exclude that table from my back ups. There is also a limit to the number of tables SSIS will export so you could run into problems with a fully normalized design as well.
Robert S. Robbins on July 15, 2008 8:43 AMDenormalized data sounds great until you try to extend your data model, at which point you will wish to high heaven for a time machine. The purpose of a relational database is NOT to give you fast performance. If you want that, use flat files and do your own heavy lifting. The purpose of a relational database is to help maintain data integrity. If your data doesn't need integrity all that bad, you don't need or want a relational database. If your data does need integrity, you can't afford to make the mistake of modeling it poorly, and that means semi-obsessive normalization. Those 6 joins are a feature, not a bug.
Noah Yetter on July 15, 2008 8:49 AMThree quick observations:
First: You only need the six joins if you want every single attribute of information stored about the user. My guess is that about 99% of the time, you REALLY only need maybe two tables' worth.
Second: normalization is optimizing for data entry and maintenance. If your application is presenting the data primarily rather than CRUD, of course you need a denormalized view. But I'd push from a normalized db to an optimized copy, or maybe even a cube.
Third: MEMCACHED, MEMCACHED, MEMCACHED.
Chris B. Behrens on July 15, 2008 8:51 AMOver the years I've developed a theory that the one area where developers reveal a consistent lack of knowledge and understanding is in DB/query design. This article and the subsequent comments reinforce this theory (and to be candid, leave me a bit apoplectic).
1) Views: terrible idea unless it's a SQL 00/05 indexed view. Even then, you probably need to revisit your schema and queries before resorting to this measure. If you're on MySQL, forget that views exist and used stored procedures.
2) Surrogate keys: that's another rotten idea, especially if you're concerned with performance. Also a horrible idea if you ever need to migrate the database.
3) The query in question: dude, Dare, really? And your defense of this query is that you may want to show all the numbers a given user has, for example, Facebook? Respectfully, please, both you and Jeff need to learn more about databases or hire someone that does, or you are headed for pain and misery as you build Stack Overflow.
For example: say my User.User_Id is 666 (I also am the son of Satan, but that's a different subject). There are two entries in the phone number table that have the User_Id foreign key of 666: one is my office number, 402.261.0498. Then you have my mobile, 402.525.5376. So you run your query, using my User_Id in the WHERE and you get back two rows, one for each phone number, with all of my data from the User table repeated for each. What are you going to do with this result, say on my profile page? Are you only going to use the data from the first row, except when dealing with phone number data, in which case you'll loop through each row of the result and render each phone number?
You're going to have to do as Ariel suggests: query the distinct user data, then query the phone numbers and other information. Multiple queries will hurt performance much less than the multi-join monstrosity above that will return indistinct and useless data.
Chris Rikli on July 15, 2008 8:52 AMCache the generated view pages first. Then cache the data.
http://www.pragprog.com/titles/memcd/using-memcached
You have to think about your content- very infrequently will anyone be updating it, it's all inserts. So you don't have to worry about normalization too much.
matt m on July 15, 2008 8:52 AMI wonder if one factor at play here is that it's very easy to write queries for de-normalized data, but it's relatively hard to write a query that scales well to a large data set.
I've seen numerous queries written by professionals that work well for small tables (As jeff says, everything's fast for small values of 'n'), but completely fail when scaled up to larger tables.
In my opinion, writing good SQL queries is dark voodo - not least since results can vary hugely depending on the underlying DBMS.
Thomi on July 15, 2008 8:59 AMI think that you should think about how it would be easier to implement the user interface, in the normalized, or in the denormalized form.
In the blog post (article), the user interface can be implemented much easier in case of the denormalized form, so I would prefer that one. Of course, if you can afford to make the application perfect (3 times more time), then the normalized form would be better.
The biggest performance killer is so called physical read.
Thats a bit scary... I'd always assumed that reads from hard drives were slow... but not so slow as to create the waiting for several seconds issue when looking up a record...
I suppose in these big databases with loads of tables all joined together the sheer number of reads creates the problem.
You learn something new everyday, huh?
Jheriko on July 15, 2008 9:04 AMJeff,
I know you probably would rather not change the site too much. But I would love to be able to rate responses based on merit. Then we could hide the long list of comments by people who didn't read the post or who troll around looking to bolster their web traffic with insults.
I agree wholeheartedly, by the way, to this idea. I unfortunately was taught by someone who knew what normalization was supposed to mean but no more, and one math prof. who was thrust into a Software role because of his 1 year of experience doing consulting in 1994. He instructs all students that database must be 43NF and all logic outside of storage must happen in the application layer. No stored procedures, no denormalization, and views are sacrilegious.
Thanks again,
Raymond
You used OR ? No index for you. Next!
David B on July 15, 2008 9:21 AMData integrity is the most important aspect of a database. PERIOD. Performance is ALWAYS second.
Always is a dangerous word.
Jeff Atwood on July 15, 2008 9:23 AMWell, after reading all these comments it appears that most programmers still fail miserably at reading comprehension (or just didn't read the article completely). How long until StackOverflow.com is ready, Jeff?
Jason on July 15, 2008 9:38 AMWhat about an approach where the data is stored in and WRITTEN into normalized tables but have a DEnormalized table for READING from? You can have triggers update the READ table when changes are made to the normalized tables, or batch process it nightly or mroe frequently, checking DB activity first to make sure its low.
Mike on July 15, 2008 9:39 AMno matter what, the database is the slowest part of your application. period. fin. done. caput. The way to get near instant execution is to have the objects in memory and then just use the database for storage (ie read = from memory, write = from memory + write to DB). Then if your app restarts, it'll only be slow while pulling things from disk, but after that super fast.
Darren Kopp on July 15, 2008 9:41 AMIt has been said well by others already, but I feel strongly enough on this issue to add in my two bits:
Denormalization is the last possible step to take, yet the first to be suggested by fools.
And, yes, I just called both Atwood and Obasanjo out as fools.
It is dangerous to make that suggestion so willy-nilly, as people take posts by prominent authors as something they can rely on. Even hinting that denormalization should be anything but a last step means that these guys simply cannot be relied upon as authorities in this space.
So, for those still feeling unsure after reading through all of the above:
Views are not about perf/scale, they are about query clarity and/or abstraction of true data source structure.
Materialized/indexed views can help perf/scale, but have limitations in terms of the underlying data structure.
Optimization must always be based on freshly-collected metrics, should go roughly in this order (with the order slightly adjusted based on your particular applications and databases and the issues you have previously experienced), and should be done one at a time with fresh metrics collected between each:
- remove indexes no longer deemed necessary
- add indexes newly deemed necessary
- simplify views where possible
- optimize queries, especially to reduce scans
- reduce application mis-use of the database (unnecessary and repeated queries are a common culprit)
- add/change application-tier caching of database information
- materialize/index views where appropriate
- optimize stored procedures
- create generated search/reporting tables
- investigate fundamental issues with the design of the database schema (data types, relations, etc.) but not from a hey, let's denormalize perspective.
You'll notice that denormalizing your fundamental schema isn't even on the list. It is too tempting and dangerous to put it there. True, it may be a last resort if all of the above fail, but if all of the above fail you are almost surely having the kind of problem that everyone would love to have: your application is going gang-busters, and you can afford to throw more hardware at your current model (start with increasing spindle count first by getting transaction logs onto their own spindle(s) and then by getting indexes onto their own spindle(s)) and if that still isn't enough then you are are into Google/Amazon/etc. territory and can get some doctorate types to start designing their own persistence model for you. They'll probably cook up something decidedly non-relational, and with much denormalization. ;) But if so, then why not just up and denormalize? Simple: The other possibilities need to be exhausted first, through repeated execution of the above list of steps, until you can be certain that _violating the correctness guarantees of your relational database through denormalization is worth it_.
Interesting blog post on denormalization patterns:
http://database-programmer.blogspot.com/2008/04/denormalization-patterns.html
Tim Hollingsworth on July 15, 2008 9:48 AMOne more benefit of normalization - variable number of fields. For example, if you denormalize the phone number field, how many phone numbers do you pick? 5? 6? Picking too few is very limiting and picking too many is a waste. In 1990, who would have guessed that a person could have 10 phone numbers associated with them? Home Phone 1, Home Phone 2, Cell Phone, Fax, Cottage, Winter Home, etc. Normalization solves this problem.
Mike M on July 15, 2008 9:57 AMGod bless ya, Jeff. As with everything in software development, a bit of critical thinking about why something is done and when tends to go a long way...
Kevin Fairchild on July 15, 2008 9:58 AMThere is a simple alternative to denormalisation here -- to ensure that the values for a particular user_id are physically clustered. In Oracle you would create a hash cluster and then create the five tables that include user_id in that cluster. Performance for that join wuld be excellent. I expect that other RDBMS's have similar structures.
The morals of the tale:
i) know your RDBMS
ii) database-agnostic design sucks.
[sigh]
The whole issue is pretty simple 99% of the time - normalized databases are write optimized by nature. Since writing is slower then reading and most database are for CRUD then normalizing makes sense. Unless you are doing *a lot* more reading then writing. Then if all else fails (indexes, etc.) then create de-normalized tables (in addition to the normalized ones).
I've never considered myself a DB(A) guy ... so reading some of this is interesting.
Most of the databases I've built myself haven't had the volume of traffic that really mattered much -- so I've always done what I thought was the 'intuitive' normalization and in cases where it was faster or easier to put some of it in a denormalized fashion, did so.
Good read.
N on July 15, 2008 10:15 AMand table names are singular because each row in the table designates one specific person; one row = one user
This is another religious war. Should table names be singular or plural? I was in a company that mandated that tables should be singular, one to match the above quote, and second, if nothing else, so that you have 1 less letter to type for every table name.
However, I am of the opinion that tables, since they normally hold more than one item, are plural, and the object that holds a single row is singular. Collection objects that hold multiple rows should be plural.
Bob on July 15, 2008 10:31 AMHmm. Does anyone expect a real discussion of tech at Coding Horror. It's as bad as /.
stupid db design (or at least incomplete), stupid sql.
If most smart people require whole books to describe the (wonderful) world of RDBMS's, do you really think _anyone_ here can have anything more than a soundbite contest?
I'm surprised some twat hasn't called using databases 'cargo cult programming' yet.
Stick to toys and trivia, that's what this blog is best at.
smashmouth on July 15, 2008 10:40 AMSean Bamforth is on the money here.
You need a normalized logical schema. This is what the DBMS should provide to the application. The DBMS should denormalize as necessary to optimize performance WITHOUT exposing those denormalizations to the application.
The problem is that DBMSes don't do this. Sure, they give you indices and caching and materialized views, but they don't go all the way.
Robert Fisher on July 15, 2008 10:48 AMYour denormalized schema has lost all of the one to many relationships that caused the original need for normilization in the first place.
Normalize first, it protects data integrity and makes maintenance a breeze. If you hit the performance wall, then denormalize.
However, your example is still bad, because the denormalized table is nothing like the normalized schema.
@Jeff - You're prematurely optimizing.
@Leandro - Natural keys! I agree, and bless you for mentioning it.
In 25 years of software development as contractor and employee in all kinds of companies and industries, every performance problem I ever encountered with a transactional normalized data base was solved by adding indexes or fixing poorly written SQL statements in the application. In a few cases the data base was poorly designed, but normalization wasn't the culprit.
Don't fear normalization. Embrace it.
Charles on July 15, 2008 10:52 AMWith any normalization there are some trade-offs.
Assuming that each user has one screen name might be an acceptable restriction. Same for affiliation.
With a separate phone table a user can have as many phone numbers of as many types as I want. Not so after normalization. This might be acceptable, or not.
So, it the perception is that normalization changes representation without changing functionality, that's incorrect. The functionality changes might be acceptable, or not, but that should be a conscious decision.
Wow...
After I first read this article I wanted to comment remarking that the concept of denormalization being better was not new or all that revolutionary. People working in data warehousing and business intelligence have been practicing denormalization as an optimization it for years.
Then I read on and discovered all the loons and morons who think they know a lot more then they do about databases. People who claim that views are recomputed every time a query runs (that'd be you Steven Bey). Others bragging about their 400,000 pageviews a month websites (generously estimated, perhaps all of 12 pages a minute, woah baby Ben Mills).
Now, before I get too degrading here, let me just get to the point.
As others have already stated, databases have many different purposes. And the way you implement a small or large database is going to be radically different then a monstrously huge one. Denormalization is common in production real world systems.
Wow, I read back and am just astonished at the complete lack of understanding by the average commenter above.... Woah........
Paul Thompson on July 15, 2008 10:56 AMI don't do much (Any) db design, but I do a lot of software design.
It feels to me like there might be a compromise solution that involves creating your Source/reference database 100% normalized then using that to generate pre-calculated tables.
This sounds like even more duplication, but it shouldn't really be since you only modify your source tables and the optimized tables are 100% generated. This gives you the ability to regenerate the optimized tables any time, and also to change what you generate as you encounter challenges.
Is this approach at all workable?
Bill on July 15, 2008 11:07 AMPeople who claim that views are recomputed every time a query runs (that'd be you Steven Bey).
In Oracle a view is just a stored query. It is re-executed when it is queried. A _materialized_view_ in Oracle has assocated storage for the view result set. So mabye you and Steven Bey have a different understanding of the term.
Maybe the example in this posting is just a poor one, but the argument for denormalisation usualy comes down to avoiding the overhead of joins. As I posted above, in this case it's trivial to avoid that overhead with a hash cluster. Problem solved.
David Aldridge on July 15, 2008 11:21 AMFor a system like that I don't think there is much question normalization is best. You get the benefits of being able to store multiple phone no's, addresses, work history... for one user id record but let's face it you are almost always going to be finding records based on user_id which will be the primary key for most tables so it will be very quick anyway.
As you said with speed there won't be much difference either way especially when you factor in the duplication you get filling up extra 8k sql pages using non normalized. By the way LEFT JOINS jeff.
The only system I have denormalized for reporting is an electronic attendance system like below:
A register has a number of student ID's
A register has a week range between 1-53 (variable based on when it runs in academic year)
A register has academic year and register ID
By the time you get down to the student marks table it has:
Acad Period Register ID Student ID Week Mark
You end up with upto 53 mark records per student per register per year. Hundreds of thousands of records which are for the main part the key to the table. Pivoting the marks leaving nulls for the weeks the register wasn't active cut down the speed of reporting students with low attendance to about a 20th of what it was before hand.
@Jeff: You hit upon my speciality :) See, I said speciality, not specialty, because it sounds cooler.
Anyway, I write social applications for a living. And I've found what works best is:
1) Put all the indexable stuff into the users table, including zipcode, email_address, even mobile_phone -- hey, this is the future! :)
2) Put the rest of the info into a TEXT variable, like extra_info, in JSON format. This could be educational history, or anything else that you would never search by.
3) If you have specific applications (think facebook), create separate tables for them and join them onto the user table whenever you need.
I actually go beyond this, and implement synchronization with other online identities, like OpenID and Facebook, but that's a bit overkill for this post.
Greg Magarshak on July 15, 2008 11:32 AMWhile this article is very good there are a number of things not addressed.
How is the data being used? Rapid inserts like Twitter? New user registration? Heavy reporting?
How one stores data vs. how one uses data vs. how one collects data vs. how timely must new data be visible to the world vs. should be put OLTP data into a OLAP cube each night? etc. are all factors that matter.
While it is lovely to normalize data, there is no one size fits all for all applications.
Enjoyed this post, a lot of these comments that treat normalization like a religion are quite the head scratcher though. I'm guessing they are from DBAs who are just barely more intelligent then a sysadmin. Learn to think a little.
The best approach usually ends up being a hybrid. You start off with a highly normalized database, work through the usual indexing and views to speed it up, and then start denormalizing a field here or there that require one too many joins.
Jim on July 15, 2008 11:34 AMSo, in essence Greg, you are throwing your hands and going F this relational crap. I'll do it live.
Sorry, I don't buy it. Sticking serialized data in a column stinks like code smell.
Jonathan H on July 15, 2008 11:36 AMBottom line is - consider the future expansion of the site. Database design, like programming, is a trade-off. Are you going to introduce multiple X's for every Y? Then perhaps you DO need a join. What will happen when the most common operation happens? What will happen when you introduce a distributed cache, like memcache? You gotta use a cache...
Greg Magarshak on July 15, 2008 11:36 AMAh, I did see some sanity in the comments. Your LOGICAL model should *always* be fully normalized. After all, it is the engine that you derive everything else from. Your PHYSICAL model may be denormalized or use system specific tools (materialized views, cache tables, etc) to improve performance, but such things should be done *after* the application level solutions are exhausted (page caching, data caches, etc.
For very large scale applications I have found that application partitioning can go a long way to giving levels of scalability that monolithic systems fail to provide: each partition is specialized for the function it provides and can be optimized heavily, and when you need the parts to co-operate you bind the partitions together in higher level code.
John Lopez on July 15, 2008 11:45 AMIt might be possible to overdo it, but trust me, I have had 20 times the problems with denormalized data than with normalized.
With normalized data, I had a complex join with records in the millions each and had no trouble retrieving it in mere seconds from a 450 MHz server that they were still using. I just added the appropriate index (took about 45 seconds) and problem solved. That 2 minutes of work turned into a year of consulting because I fixed it so fast.
But don't get me started on denormalized data. Data is stored willy-nilly throughout the application when it should be a code table and then, inevitably, someone wants to do a search on it. Except that people misspelled it and added The at the beginning or , The at the end or put 2 spaces in it a couple times, etc., etc., etc. At that point, your ENTIRE dataset that the company has spent years gathering is WORTHLESS!
People don't care what you put into a database. They care what you can get out of it. They want freedom and flexibility to grow their business beyond 3 affiliations. (Oops, you wrote select * in your code, now you might have trouble adding columns.) And have you ever tried to write a query to find everyone in the same affiliation? Hope those are codes and not free text. And I hope the affiliations are indexed if you want any kind of performance on that search. But now you need 3 separate indexes instead of one. And the third one might perform poorly because it is mostly null and whatever codes are typical to people with 3 codes.
I could go on and on about the real-world problems I have faced with denormalized data, versus the lack of problems with normalized data.
Say what you want in your blog, but a lifetime of experience says that you are wrong.
And, I've worked with Dare before and he's a cool guy (so no offense Dare if you read this) but he writes tools for Microsoft programming languages and stuff. He doesn't maintain software with 10,000,000 row database tables on a daily basis.
PRMan on July 15, 2008 11:49 AMI 80% agree with that.
I was always telling my teachers that, trying to convince them they were doing it wrong.
I admit I didn't read all of it, but looking at the diagrams, I can tell you that when you have a 1 to 1 relationship, like UserPhoneNumber, then: yes, there is no point, and I would call it an error to have that table.
The only case where you would need to take it outside the [User] table, is if that [User] table grows out of proportion AND that you often need to fetch user info without his phone number. Otherwise, you are correct to bring his phone number into the main User table.
The tables left of [User] (and above and under) should have have stayed outside though. UNLESS you KNOW there will not be more than 3. Also, if you often need to fetch only the phone number of users, having it in a mini table will give you another advantage...
But if you don't need his phone number for example, for a query, please use SELECT [field names], and NOT SELECT *!! You will gain another millisecond there!
In the end, it all comes down to a case-by-case db design, there is no formula that will work everywhere with optimum performance.
Consider the number of users connected simultaneously, the expected size of the db, etc etc...!
dbadmin NNM :P
NNM on July 15, 2008 12:02 PMI wonder if these changes we're seeing in how important normalization is in applications isn't a reflection of the changes in the types of applications people are writing today. These days, with most web sites focusing on the infinite possible connections between people and other entities in a system, it may just be that a denormalized data model might make more sense right off the bat.
Thanks for the thoughts. I think I'll write more on this at blatantCommercialhttp://cozenedcognizance.blogspot.com/2008/07/what-kind-of-applications-are-we.html/blatantCommercial.
Chuck on July 15, 2008 12:04 PMChris Rikli: Multiple queries will hurt performance much less than the multi-join monstrosity above that will return indistinct and useless data.
Here here!!! You might look into this solution as well. I've had two ugly stored procedures as of late that I've refactored/optimized/whatever by simply going to the core table and doing the major selection criteria and then making a bunch of small queries to fill in the gaps.
Table Variables (SQL Server 2000 8.0 and above, sorry legacy folks (I have an app still stuck there 8^D) allow for this and are much better than temporary variables.
Yay! BuggyFunBunny's back!
(back to reading comments now...)
Shog9 on July 15, 2008 12:11 PMWhen dealing with this exact issue, I took an approach of using a normalized data repository supplemented with a denormalized view of the data. For data that does not change very often (how often do users change their phone number, really?), this is usually adequate, though you certainly run into problems of stale data unless carefully handled.
In my case, it was more than simply doing a multiple table join across a single database, it was doing a multiple database join across 4 completely incompatible databases: standard RDBMS, Chemical database (outside of the RDBMS), and Biological database (also outside the RDBMS)
Having the denormalized dataset allowed us to reduce queries that originally took 5-10 minutes to complete in 10 seconds. It all depends on your particular application.
Steve on July 15, 2008 12:23 PMI always normalize. If a group of data (in your case user data) is being queried the same way every time, then either you over-normalized or you need to create a view.
Matt Ridley on July 15, 2008 12:30 PMI once wrote a CRM (digital Rolodex, like SalesForce.com) application and used it as an opportunity to use a textbook denormalized data model. It turned out to be a huge mistake because the model was wrong for the application.
For CRM, you're essentially storing business cards. A business card is fully denormalized: the basic unit is the card, and while you may search on a couple of different fields, users just want to view the card. While it makes some sense to have a separate table for phone numbers and email (since there may be multiple), there's only one mailing address.
Where I really got into trouble was in having an actual person/organization hierarchy instead of just a department column. Traversing arbitrary hierarchies in SQL is painful, and using a Java O/R bridge didn't help. Doing a search for ATT employees becomes a recursive set of queries. What's more, 99% of the time, the department was blank or inaccurate anyway.
Which brings me to the next issue: half the time, the data wouldn't import cleanly. We'd get business card type data from our trade shows in Excel format-- essentially completely denormalized. There are a dozen different ways to spell ATT (including, at the time, Bell Labs), and often company names were misspelled.
Eventually, we scrapped the project and switched to SalesForce.com, which uses a data format which is just denormalized enough to have a separate company table. I think it even has Phone 1, Phone 2 and Phone 3. But it meant I didn't have to get pulled off my current project to do a custom data import after every trade show.
The moral of the story is that the correct (fully denormalized) data model is not the most general one, but the one that best fits the expectations of the users and/or your data source. Denormalizing any more just forces nonexistent precision. Business cards and Excel spreadsheets are essentially single tables, so using several tables to represent them can lead to problems.
David Leppik on July 15, 2008 12:43 PM...scalability is not your problem, getting people to give a shit is.
IOW, scalability is not a problem, until it is. Strip away the scatalogical reference, and all you have is a boring truism.
Case in point: Twitter. The idea of mentioning Twitter as a scalability success story is perfectly risible. People have actually stopped giving a shit (to steal an artless phrase) because of their performance problems.
I normalise, then have distinct (conceptually transient) denormalised cache tables which are hammered by the front-end. I let my model deal with the nitty-gritty of the fix-ups where appropriate (handy beginUpdate/endUpdate-type methods mean the denormalised views don't get rebuilt more than necessary).
Of course, a smart RDBMS would let you use… (drumroll) views…and optimise accordingly. I'd imagine this is the sort of thing that Oracle would do. I haven't checked.
Mo on July 15, 2008 12:53 PMI could be missing the point of the article, but I don't think he's suggesting that we skip past normalization but instead consider if it needs to be done.
What if he only plans to allow those three phone numbers for a user - Should that be moved to a separate table? What about only two? Or just one?
At what point do you stop blindly normalizing the table structures and consider if everything really needs to have a table of its own?
Doesn't seem like an unreasonable question to me...
HB on July 15, 2008 12:55 PMI don't know SQL, but your diagram looks a hell of a lot like what an ordinary Real Programmer would call a record with some pointers. Obviously the denormalized version (one record per user) is better than the normalized version (many records per user) for pretty much all applications --- easier to read, easier to write, easier for the computer to deal with. So maybe this is some big religious issue to SQL clerks, but it's certainly a no-brainer to normal folks.
The real WTF, of course, is that you're apparently writing your own software to deal with tagging. Tagging is a solved problem, guys. You don't have to write code to manipulate tags anymore.
Anonymous Cowherd on July 15, 2008 12:57 PMMo: You should get out more, the two real RDBMs (SQLServer and Oracle) have views :)
Stop playing with mySQL.
Unfortunately, the SQL given in the article is not a meaningful representation of a real set of data about a single user.
In most cases, our user-retrieving SP would return multiple recordsets for the one-to-many entity-relationships, not an inner product of all the combinations.
Now if you are talking about returning data for many users for analysis, you are probably still not going to combine the data this way.
In the large data analysis situation, you can start to materialize views and temporary tables, etc., but the best thing I've seen is apply dimensional modeling to make a Kimball-style data warehouse give a very fast and easy de-normalized representation of the data. This is a read-only replica of the data in a completely different model.
Cade Roux on July 15, 2008 1:05 PMNormalization is all about reducing redundancy. Achieving great performance requires a balance; if you normalize everything by the book, you will most certainly pay the price in performance. Leaving redundancies in key places can dramatically speed up your system because that helps the db engine optimize the indexes and run faster.
To the poster who just got out of the db module (course?) at the university: you will learn that Uni-level db is just the tip of the iceberg and that the overreaching principle in db design is that there is no overreaching principle.
BugFree on July 15, 2008 1:09 PM@Jim
That's exactly what you would say if you hadn't yet been bitten in the ass by a denormalisation decision taken years ago, that's making your life 20 times harder now.
With all of the wonderful technologies described here to mitigate the speed of large joins, I fail to see how anyone can reasonably argue that denormalisation in any form makes sense. I'm pretty sure the entirety of the RBDMS vendor community's brain cycles that have been spent thinking about this problem trumps anything from people who think sysadmins and DBAs are idiots.
Lewisham on July 15, 2008 1:24 PM@ Paul Thompson
Then I read on and discovered all the loons and morons who think they
know a lot more then they do about databases. People who claim that
views are recomputed every time a query runs (that'd be you Steven
Bey).
Many thanks for you kind words.
I don't profess to be a database expert and my initial thought had been to use a view, however, I had a question mark over my head about when the cached data (in the view) gets updated, so I did a little research, via our friend Google. The only place I found, stating that view data is cached, was on Wikipedia and I prefer not to take Wikipedia at face value. I did, however, find other articles that either explicitly stated or implied that views DON'T cache data (including the url that I posted, which is on a website called sqlserverperformance.com and written by a SQL Server MVP).
If you can provide me with a credible source, saying that view data is cached, I will happily read the article.
Once again, many thanks.
Steven Bey on July 15, 2008 1:47 PMATWOOoOD!!
(A cry of frustration from a DBA stuck in legacy de-normalised hell)
I love your blog Jeff, but I have to say that when I first read this post my first reaction was one much like the one you yourself had after reading about Wasabi. If this post leads to even one more de-normalised database in the world, is is already evil. Anyway - before I had time to comment, most of what I wanted to say (and more) had already been said (rudely, even).
There is now no real point in adding another voice to the thorch-wielding pitchfork DBA posse knocking on your door, so instead I want to make a suggestion: Eat my shorts. No, wait, sorry, that was frustration again - I want to suggest this:
In the spirit of the ideas behind stackoverflow, and in order to spread knowledge and learn, let us help!
Instead of going dark about this, challenge the community! Make posts with examples where you feel you will run into trouble, and allow people to comment. Let the people who throw the tomatoes here put up or shut up! *Hides tomato behind back*
For thisandthat purpose in stackoverflow, I need to have A, B, D, (D*X-4) and H from these six tables (ill.), and I need to handle a load of about 4 executions per second. My current plan is to jam everything up into one huge table because then the execution time drops from 345 ms to 87 ms. I need at least y, preferably less. Comments?
I hold (not even weakly) the strong opinion that normalising is not a collective madness, it is not like the religion of the angle brackets, (XML), it is The Right Way(tm). I am confident that in the lifetime of Stack Overflow you will NEVER encounter a situation where denormalising is the right thing to do.
Let us prove it to you!
OJ on July 16, 2008 2:11 AMThe basic decision on the initial design is simple just ask yourself these two questions before you begin.
1. Is my Site OLTP? If the answer is yes then Normalize.
2. Is my site OLAP? If the answer is yes then De-Normalize!
Simple as that.
If your answer is Yes to both questions then you need 2 databases.
If your answer is No to both questions then you need 0 databases.
This post seems to suggest throwing normalisation away
Seriously? Sometimes I wonder if I'm reading the same blog as most of these commenters. This is certainly not the first time either.
Ricky Dhatt on July 16, 2008 3:17 AMI'm with OJ - let us build it :)
And to all the folks suggesting copying data into a denormalized table via nightly batches or triggers etc...
You are describing a materialized (indexed) view. The functionality exists in all mature rdbms. Your idea has (unsurprisingly) been discovered by rdbms architects and implemented.
I wonder how many people read the links before posting.
Just WOW. Thanks for all those links, I think I'll spend some serious time reading all those entries on highscalability.com.
Oh and i've had similar experiences of denormalised awesomeness with high-throughput systems. Data was renormalised every 24 hours and just simplified everything.
Sure, it's not the first thing you should do but it is an option.
Good post about denormalization. I tend to class myself as a zealous software engineer who likes to stick to the software/database principles that I was taught in university, but through working with clients and real world databases there is definitely a place for denormalization.
I would be interested in your views of keeping the integrity of the denormalized fields. Here is a page discussing this http://www.ssw.com.au/ssw/standards/Rules/RulestoBetterSQLServerdatabases.aspx#DenormalizedFields
Eric Phan on July 16, 2008 4:53 AMThe comments to this entry are closed.
|
|
Traffic Stats |