July 18, 2006
Here's a thought question for today: why can't database tables index themselves?
Obviously, indexes are a central concept in databases and database performance. But horror stories still abound of naive developers who "forget" to index their tables and run into massive performance and scalability problems down the road as their tables grow. I've run into it personally, and I've read plenty of sad tales of woe from developers who have, too. I've also forgotten to build indexes on non-primary-key columns myself many times. Why aren't databases smart enough to automatically protect themselves from this?
It always struck me as absurd that I had to go in and manually mark fields in a table to be indexed. Perhaps in the bad old file-based days of FoxPro, dBase, and Access, that was a necessary evil. But in a modern client-server database, the server should be aware of all the queries flowing through the system, and how much each of those queries costs. Who better to decide what needs to be indexed than the database itself?
Why can't you enable an automatic indexing mode on your database server that follows some basic rules, such as:
- Does this query result in a table scan?
- If so, determine which field(s) could be indexed, for that particular query, to remove the need for a table scan.
- Store the potential index in a list. If the potential index already exists in the list, bump its priority.
- After (some configurable threshold), build the most commonly needed potential index on the target table.
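For concreteness, here's a sketch of the bookkeeping those rules imply, written as the kind of table the server could maintain internally. To be clear, this is hypothetical; every name in it is invented, not taken from any real product:

-- Hypothetical internal bookkeeping for the auto-indexing rules above.
CREATE TABLE potential_indexes (
    table_name      sysname       NOT NULL,
    column_list     nvarchar(400) NOT NULL,  -- candidate index columns, in order
    times_wanted    int           NOT NULL DEFAULT 1,  -- bumped on every table scan that wanted it
    total_scan_cost float         NOT NULL DEFAULT 0,  -- cumulative cost the index would have avoided
    PRIMARY KEY (table_name, column_list)
);
-- After any query that table-scanned, the server would upsert a row here;
-- once times_wanted (or total_scan_cost) crossed the configurable threshold,
-- it would issue the corresponding CREATE INDEX during an idle window.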
Of course, for database gurus who are uncomfortable with this, the feature could be disabled. And you could certainly add more rules to make it more robust. But for most database users, it should be enabled by default; an auto-indexing feature would make most database installations almost completely self-tuning with no work at all on their part.
I did some cursory web searches and I didn't see any features like this for any commercial database server. What am I missing here? Why does this seem so obvious, and yet it's not out there?
Posted by Jeff Atwood
DB2 calls it the Configuration Advisor. The notion of self-tuning databases (kind of an oxymoron, if you ask some folk) is gaining currency. But such machinations, of necessity, take a simplistic (e.g. file) view of tables.
Nice idea, but I guess one problem is the overhead of doing this dynamically. SQL Server 2005 can create an index on a table without taking an exclusive lock on it (which would stop anyone else reading the data), but prior versions can't do this. In those versions, creating an index isn't an online operation...
For a big table with millions of rows, creating an index takes quite a while, so you're much better off spending the time during development making sure that everything is indexed the way it should be, rather than having the double whammy of the system virtually grinding to a halt because it suddenly tries a query that requires a table scan due to a missing index... then the system jumps in and creates the missing index, causing more overhead and/or blocking.
I can't see this idea flying until databases are smart enough to be trusted to manage their own workload a bit more intelligently. DBAs don't like the DB server doing things they didn't expect.
"If so, determine which field(s) could be indexed, for that particular query, to remove the need for a table scan"
If there are 10 queries, each requiring 10 different "optimal" indices, and each having the same approximate hit rate, the database would end up creating 10 different indexes!
Besides, indexes play a huge role in database capacity planning; an "on the fly" index creation engine might throw a spanner in all the automated backups, etc.
Just my 2 cents.
I have more or less given up creating indexes myself. I just let SQL Server's tuning advisor decide for me.
As a programmer, I have long since come to grips with the fact that the compiler will always be better at low-level optimizations than me. Since indexes are a low-level optimization (as in, there is no functional change), I see no reason not to let the database handle that as well.
The only difference between what you want and what we have is that you have to manually hit the button from time to time. So I think self-tuning indexes are a possibility for the future.
sorry, wrong advisor (there are so many); I meant the Design Advisor. But this is from a thread on com.databases.ibm-db2:
Tuning is an art. Why would you want to use robot?
"If there are 10 queries , each requiring 10 different "optimal" indices , and each having the same approximate hit rate , the database would end up creating 10 different indexes!"
Not necessarily. Most of the time, two or three queries will run more often than the others. The indexes for frequently used queries would get a higher priority.
MySQL has this functionality.
Index creation tuning is a scenario-based task. As Buggy Fun Bunny said, it's an art. The same as network design or app design.
Analysis Services for MS SQL Server (their cube/OLAP product) does this mostly automagically. I always found it to be mediocre at best, but I hear it has gotten better since I played with it.
Similarly, I agree with Jonathan Allen above, who said that he mostly allows the SQL Server Tuning Wizard to do his indexes for him. I will occasionally (read: semi-annually) let the system run for a day with a trace running. Then I just have to come in the next morning and let the analyzer re-tune everything. I always double-check the suggestions for sanity, of course, so I'm not a total quiche-eater.
Considering that the great majority of my applications' queries are simple SELECTs or SELECTs with JOINs, database index auto-tuning would not be difficult and could save me some work. Good idea :)
The idea has been advanced recently on Reg Developer to just index everything instead of indexing nothing. (Of course, you might still have to pick and choose among multi-column indices; you'd hardly want every possible permutation of columns indexed.) As long as reading is 80%+ of requests, it should make performance sense.
Every server already logs performance data for query caches; it can't be hard to analyse it during idle or off-peak times and start building an index. It wouldn't drag performance down any more than using a RAID array while rebuilding it.
Microsoft SQL Server actually has a "statistics" feature that behind the scenes evidently acts something like an index in the creation of query plans. (I'm not enough of an expert on the innards of SQL Server to explain it better than that.) This appears to relate to your idea of auto indexes. I recently posted this to my blog, which explains how to get around a "too many indexes" error in a SQL Server table that has a ridiculous number of columns (I have had the pleasure of inheriting a couple of these):
Most people will want to skip reading the post unless you're having the specific problem I describe there; the gist of it is that on a table with a huge number of columns, SQL Server maxes out the number of allowed *indexes* on the table with its behind-the-scenes column-specific "statistics." My point in bringing it up is that it seems to me that SQL Server treats these column-specific "statistics" as a kind of index--otherwise, why would you have to delete one of the statistics (as described in my post) before creating a new explicit index on the table?
Rick O, SSAS creates read-only OLAP DBs. So it's a whole lot easier to have many indexes when you are not getting inserts and updates, not to mention that they're compressed.
There's a lot of other information available to help decide on potential indexes. For example, the statistics that Mr. Read refers to would inform the system that the spread of values in a given column isn't selective enough, so an index there wouldn't help.
However, I don't see any discussion here about the COST of maintaining an index. For systems that are heavy on data retrieval, this may be fine, but for high-volume transactional systems, the cost of INSERT/UPDATE/DELETE may be critical.
Tuning is an art. Why would you want to use robot?
Because the robot gets us 90% of the way there. E.g., it creates the obvious missing indexes that are killing perf. That's a heck of a lot farther than the 0-10% most naive DB users are at!
Why would you sacrifice the 90% for the 10%? Particularly since you can have both, it's not an either/or scenario.
MySQL has this functionality.
Jeff Lewis, can you elaborate on this... provide a URL or other reference?
So it's a whole lot easier to have many indexes when you are not getting inserts and updates
Right, I expected people to bring up the classic tradeoff of indexes: the more indexes you have, the slower your inserts and updates are.
But certainly you can envision a DB server that has certain thresholds of (table scan frequency and cost) divided by (update and insert frequency and cost) and makes a reasonable decision about what to do. If you have a table that's read 1000x a day but written to once a week, obviously you can err on the side of indexes.
The database servers have all this data; I say they should act on it using some basic, configurable rules.
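SQL Server 2005, for one, already records the raw counts you'd need for that ratio. A sketch (the view is real; the 100:1 read/write threshold is just an arbitrary example):

-- Tables read at least 100x more than they're written:
-- candidates for erring on the side of more indexes.
SELECT OBJECT_NAME(s.object_id) AS table_name,
       SUM(s.user_seeks + s.user_scans + s.user_lookups) AS reads,
       SUM(s.user_updates) AS writes
FROM sys.dm_db_index_usage_stats s
WHERE s.database_id = DB_ID()
GROUP BY s.object_id
HAVING SUM(s.user_seeks + s.user_scans + s.user_lookups)
       > 100 * SUM(s.user_updates)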
I was reading this blog post and thought: "Where's the WTF? This sounds reasonable, I don't get it."
Then I realised I wasn't reading The Daily WTF.
More coffee this way please!
Well, see, I sort of agree with the concept of self-tuning, to a degree. I think it would be more helpful for the database to tell the DBA what index to create based on use, as opposed to automatically doing it. Why? Because the creation of an index can be extremely costly if the database is actually, gasp, being used. Imagine you are taking in sales orders on your e-commerce site and all of a sudden the DB decides it needs to build a new index. Table locks... can't process orders... people get cranky because they are losing out on sales... etc. I think it would be really useful, though, for the database to submit the feedback to a DBA so that the index can be optimized in a controlled way.
just my 2 cents.
I haven't worked with it myself, but my ETL colleagues tell me Teradata and Netezza are as close to self-indexing as databases can get. Evidently everything is indexed on every column automatically. Of course, they ain't cheap.
Actually, I don't see why you couldn't have it both ways, with the advisor configured to report "recommended indexes" and "estimated time/load required," and let the DBAs set reporting thresholds, give go/no-go/don't-recommend-this-again/remind-me-in-30-days approval, *and* set a time/date to do it when system load is expected to be low.
What? Are you mad? Where would all the DBAs and consultants be in a year's time????
Lovely idea though.... if you live in the land of nod.. ;)
Oh, by the way, it's an offhand comment in that letter, but I know I've read more about it somewhere else--just can't remember where.
Daniel Read: Statistics in SQL Server are maintained to help the DB engine decide how to efficiently use the indexes it has. They contain information on how the data is distributed across the table.
If your query's WHERE clause is something like "WHERE orders.date > '2006-01-01'", then the optimiser can look at the statistics to guesstimate how many rows this will include. After it has determined roughly how many rows it needs from each table, the optimiser can make a better choice of what indexes to use to retrieve the data, what order it should join the tables in, and what type of join it should use.
99% of the time it gets it dead right, which means you don't need to use index hints etc.
I assume Oracle (bleh), DB2 etc have something similar.
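Those statistics are directly inspectable, by the way. A quick sketch in SQL Server syntax (the commands are real; the Orders table and the object names are invented for illustration):

DBCC SHOW_STATISTICS ('Orders', 'IX_Orders_OrderDate')
-- prints the histogram the optimizer uses to guesstimate how many
-- rows match a predicate like OrderDate > '2006-01-01'

CREATE STATISTICS st_orders_date ON Orders (OrderDate)  -- create explicitly
UPDATE STATISTICS Orders  -- refresh after heavy data changes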
But then what would DBAs do all day? :)
But then what would DBAs do all day? :)
Go to Start - Programs - Microsoft SQL Server - Index Tuning Wizard
What if the query that has to be fast is only run once a month? You can anticipate that, the machine cannot.
"But then what would DBAs do all day? :)"
I sometimes wonder.
I remember some of the people I've seen most vehemently against the use of SQL Server (which had self-tuning as a design goal) giving a lecture, proudly describing how their databases take an entire team to dynamically tune them 'on-demand', full-time, as they handle 'thousands' of transactions per day.
I'm not sure they realised how silly this looked side-by-side with their comments on how their favorite database product was superior.
It would have been funny, if not for the tax dollars spent and the fact that all data was 'flat file' transferred for 'cross-platform compatibility' and therefore had no dependable relationships.
If everything were automatic, you wouldn't need to hire $100-an-hour database support, and that would cut into the database provider's profit. Follow the money.
" Tuning is an art. Why would you want to use robot?"
But if it's not a science, is it real or imaginary?
Part of the reason why Microsoft made SQL Server more automated was because they believed making that 'nth' difference manually was at least 'iffy' if not impossible, given all the factors.
The implication was that to believe it was even possible was a serious underestimation of the problem, and that automated processes were more likely to get it right most of the time.
I'm not sure, but I'd guess finding out whether there is an index to use is a less expensive operation than finding out what index should be created, and by how much it would speed the query up.
I fear that adding a feature to find out what indexes should be created would significantly slow down every query to the database. But then again, if the feature is optional, you can for example turn the thing on during development (or even during end-user testing to get significant data), let it do its work, and then turn it off.
The other problem is that the more indexes you have, the more expensive the operations become. That means that you can't just analyze one query. You'd also have to take into account the amount of inserts happening on the tables you are auto-indexing.
And there are different types of indexes. Which one is best? Should the system create a unique one? And if the system provides multiple different technical types of indexes: Which type to use? All that makes implementing such a feature quite difficult I guess.
But I agree: I really think software should be smart enough by now to find a useful solution here. Too many times I've been bitten by suboptimal (or outright missing) indexes. And there's more fun stuff to do than manually analyzing query plans the database is throwing at you.
At least for me, that is.
Of course the database (in this scenario) won't suddenly decide "Whoops, I need an index, let's create one."
The power would still lie with the developer or DBA. On login, the database can notify the DBA that a potential index should be created, and let the DBA decide when to create it, if they want it.
Considering the power of current database systems, I would trust a machine that has been logging queries for a week over my preconceived ideas of what the indexes should be.
I totally agree with the
"Start | Programs | Microsoft SQL Server 2005 | Performance Tools | Database Engine Tuning Advisor" way.
But I ALWAYS put a primary key on a table. For especially critical SQL queries, I sometimes add manual indexes in the tuning process.
But how many rows should there be in a table before any secondary indexes need to be considered at all? For a table with, say, 1000 rows, whatever you do to it is instantaneous anyway, so... ?
Beware indexing every field in your table (even if you don't have bit fields). One of the things SQL Server's locking does when you're writing a recordset is lock the indexes that recordset affects. A single record might only lock one leaf in the index, but larger recordsets obviously lock larger bits; in a transactional system that can *easily* lead to blocking and deadlocking.
Tuning is an art. Why would you want to use robot?
Because robots work harder, and are less eccentric than artists.
FileMaker Pro has had this feature for years. It's not suitable for large-scale production apps (... leave it alone, crazy Filemaker zealots of the world, or you'll get schooled by this bunch ...) but I've always found it works quite well. I see no reason why their algorithm couldn't be applied to other databases. Experts may be able to do better in theory, but in practice most people would underperform Filemaker's auto-indexing.
I've been wondering about this same question for about 2 years.
I once worked on an existing application (with a 2GB SQL 2000 database) that didn't have any indexes because the devs that created it didn't know what they were. A feature like this would have solved some of the performance issues it had.
Even if the engine can determine where indexes should be placed, it should try to determine the proper partitioning for a table as well. Also, I believe that Teradata uses a hashing algorithm when placing rows in a table and has such fast access times (due to distribution) that indexing becomes a lot less meaningful. So the real question is: how can we create a system that does not need indexes at all?
But I ALWAYS put a primary key on a table.
I would certainly hope so, if just so that you satisfy the rules of normalization.
Especially with SQL Server, you will run into major performance problems if you don't.
User: Select LastName from Employees
User: Select LastName from Employees Where LastName = 'Jones'
User: Select FirstName, LastName From Employees Where LastName = 'Smith'
Clippy: I see you're searching on LastName! Would you like some help in indexing that field?
I fear that adding a feature to find out what indexes should be created would significantly slow down every query to the database.
Not really, since the database server has to parse a new query to figure out how to execute it. And that includes determining what indexes to use, etc. This is called an "execution plan". It's generally cached so the next time the same query is called, the execution plan is not regenerated.
The database server, in its normal course of operation, has access to ALL the data it needs to make an excellent decision about what tables/columns should be indexed!
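In SQL Server 2005, for instance, that accumulated execution-plan data is directly queryable. These DMVs are real; the query simply ranks cached statements by how much reading they do:

SELECT TOP 10
       qs.execution_count,
       qs.total_logical_reads,
       st.text AS query_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY qs.total_logical_reads DESC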
Having done lots of tuning of databases both big and small, the simple answer here is this: how is the database being used? You can't apply rules as simple as the ones you're stating when it comes to building indexes, because you simply cannot know how the database is being used when the criteria are as narrow as you're stating them here.
Example: you're working with a database that's 1GB in size. You run a few queries and then find that it has automatically built you an index or two to support what you've done. Yippee, you're getting data back faster now, automatically.
Example 2: you're working with a database that's 90GB in size. You run a few queries against it, and then you find that the server appears to be frozen while it automatically builds indexes to support your few queries. After you're done and everyone's productivity has stopped because of the "frozen" server, you find that the server is still slow, because those two queries you ran didn't have much to do with the other queries you're running, and then more indexes are built for you automatically to support those other queries. And then you find your drive is running out of disk space...
So I'd much rather opt for the Index Tuning Wizard in SQL 2000, or the new 2005 tool (Database Tuning Advisor or some such name), than have the server do it for me. It gives you the flexibility to provide a "workload" that's a good representation of the typical activity your database experiences, taking into account both sides of the indexing equation (i.e., queries vs. DML). And at the end, you still have to review the suggestions to make sure they make sense for your actual usage. I've found some pretty wacky suggestions from the ITW over the years. And you definitely do NOT want it building indexes for you "right now"; you want to schedule these things, as they most definitely have an impact on your experience with the database.
I'm currently working on a reasonably sized data warehouse (300GB), which has some tables that run over 80 million records, gaining 4-5 million more each year. Can you imagine what would happen if the database itself decided it might be a cool idea to add a couple of indexes to those tables?
Of course, it would make my job as performance consultant a lot more interesting, I bet I'd see a lot more projects come my way...
OK, so the straw man "autobuilding without asking kills production performance" is dead. However, that doesn't address tools that collect the performance information in real time and can present suggestions and show the impact of such changes (i.e., 40% faster response time on queries that include X, 10% slower insert speed, updates unaffected because the index field isn't updated commonly).
As many have pointed out, SQL Server has tools that give a (limited) version of this. These tools are *not* a replacement for a DBA (if you can't get far enough to bother with a primary key, no tool will save the design disaster from your incompetence) but they sure could make the DBAs job more effective by being able to back up decisions with hard data.
Which brings up the point that tools of this nature should keep performance metrics from before *and* after the changes are made, so underperforming changes can be undone.
The only thing you're missing is the idea that the database somehow knows what the data means, and will be able to determine all the things the data will mean in the future.
For example: if users only ever insert unique data into a column, does that mean you can safely optimize the column with a unique index?
What if that column was a list of employee names and the table held 'blog' entries? Each person only makes one 'blog' entry a month. For the first month the column has unique values in its index. Let's say your auto-indexer sees this as an opportunity to put a unique index on that column. Now when the next month rolls around, your users can't insert a new 'blog' entry, because that would violate the uniqueness of the column.
Seemingly unique data that isn't really unique is not the only case where having an automatic system impose indexing poses a problem. All forms of indexing presuppose a semantic constraint on the database that does not exist in its internal representation. It also presupposes a lack of concern over performance issues.
What if your program indexes a "bit" field?
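The unique-index trap described above, in miniature (table and index names invented for illustration):

CREATE TABLE blog_entries (
    entry_id  int IDENTITY(1,1) PRIMARY KEY,
    author    nvarchar(100) NOT NULL,
    posted_on datetime      NOT NULL DEFAULT GETDATE()
);
INSERT INTO blog_entries (author) VALUES ('jones');  -- month one: one entry per person
-- The auto-indexer sees only unique values and "infers" uniqueness:
CREATE UNIQUE INDEX ux_blog_entries_author ON blog_entries (author);
-- Month two: jones posts again, and the insert fails with a duplicate key
-- error. The data was coincidentally unique, never semantically unique.
INSERT INTO blog_entries (author) VALUES ('jones');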
Using an INDEX on a database table is to explicitly identify time-space trade-offs for the database engine. Indexes aren't magical; they live on disk. To hit an index is to hit disk to save CPU for searching. To say "index everything" is to say "I don't care about I/O"; to say "index nothing" is to say "I don't care about CPU load"; and to say "index this" is to say "in this instance it is better to use I/O instead of CPU." If you can write a program that reliably predicts when the trade-off is right, that is something very valuable indeed... and it is one step closer to making programmers and DBAs obsolete.
If you buy the idea that you can create a heuristic auto indexer then you aren't far from believing that computers will (eventually over the eons of time) program themselves and all programmers and DBAs will be utterly obsolete and out of jobs. And robots will rule the earth.
Query optimizers have been deciding which indexes to use for a particular query for ages. We don't have to specify a plan for each query this way. "What if" type responses to this index-optimizer idea amount to "What if the query optimizer sucks? I want to specify a plan for each query I execute."
I think if the index optimizer is smart enough and you let it work from the first day of the database's life, you'll have good enough indexes for your database.
This may not be suitable for Google's database servers, but I would definitely enjoy this type of feature if it were available.
PostgreSQL would react the same way that Danie mentions Oracle reacting.
Sometimes an index scan is more expensive than a full table scan. So these auto-generated (or auto-suggested) indexes may end up completely ignored. It can be maddeningly more expensive if your data has never been reordered (basically, remaking the table from a SELECT * FROM whatever ORDER BY index_field), because for each page of the index, several random pages of the table may need to be loaded.
Accurately determining when an index scan is more expensive depends on table statistics and cache size, and all kinds of things that developers normally never or rarely see. That's why they have the DBAs do it. :D (Wish I could afford one. D: )
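In PostgreSQL you can watch the planner make exactly that call. EXPLAIN ANALYZE is real syntax; the table and predicate are invented:

EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;
-- If the predicate matches a large slice of the table, expect a "Seq Scan"
-- node here even when an index on customer_id exists, because the planner's
-- statistics say the sequential scan is cheaper. (PostgreSQL's CLUSTER
-- command is the "reorder the table to match an index" operation mentioned
-- above.)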
I work with Oracle, and there are multiple types of indexes, i.e. B-tree, function-based (FBI), and bitmap. It takes planning and a considered decision to add an index to a table beyond both the primary and foreign keys (which should always be indexed if you do not want deadlocks).
These requirements also change over time; the idea that databases could automatically index and keep these in line with future requirements is a nirvana that will never be reached.
These are all great points. But the assumption here is that either it is done completely automatically, all the time, or it just suggests what indexes to create.
What about taking the approach of a knowledge tree? The Clippy Indexer could walk the user through a series of questions, such as: How is this database used (reports, application, transactional, etc.)? Which queries are used the most? What other queries do you wish to include in the optimization? Stuff that we would look for as DBAs and developers. Then let it do the crunching of looking for patterns and such. Then it could ask more questions, like what type of index to use, whether it should be clustered, or maybe give more than one suggestion for each index. Finally, it would ask if you wish it to create the index.
This would give control over the process to the user and it would do all the manual crunching for them.
Just following up on another point the original post made.
Does this query result in a table scan?
Full Table Scans (FTS) are not necessarily a 'bad thing'; this is a common misconception. This is true in Oracle at least; I can't comment about any of the other database vendors.
Have a look around some of the work Tom Kyte has done on FTS.
In conclusion, the algorithm (if FTS, then add index) would not work.
What is being asked for in index reports and capabilities already exists, excluding the cute user interface, and even that can be had by purchasing 3rd-party products.
All your modern databases have the ability to store and report on indexes and their use or lack thereof. For instance, I use a monthly job that gives me a report of all indexes that are used, amount of usage, space required, etc., and I also get a listing of queries against tables that did not use indexes or might have benefited from them, and how many times that occurred, in addition to the queries that are the most CPU intensive.
This type of report helps to get the ones that fall between the cracks both in creation and in no longer being needed.
As for the DBA position, for the past decade it has been more a position of dealing with end-users, developers, etc. than with the internal secrets of the database. Yes, the DBA will have to work with the internals, but time- and knowledge-wise it has become something a lot less needed.
For your average database usage, Oracle and DB2 both provide tools that a standard system admin, with a little extra training, could use to keep the database in excellent running shape, and MS SQL never provided the options, until 2005, to do much about it.
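For the "no longer needed" half of such a report, here's a sketch in SQL Server terms (the catalog views are real; treating "zero reads" as the cutoff is just one possible policy):

-- Nonclustered indexes that are maintained but never read in this database:
SELECT OBJECT_NAME(i.object_id) AS table_name,
       i.name AS index_name,
       s.user_updates AS writes_spent_maintaining_it
FROM sys.indexes i
LEFT JOIN sys.dm_db_index_usage_stats s
       ON s.object_id = i.object_id
      AND s.index_id = i.index_id
      AND s.database_id = DB_ID()
WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
  AND i.index_id > 1  -- skip heaps and clustered indexes
  AND ISNULL(s.user_seeks + s.user_scans + s.user_lookups, 0) = 0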
I believe self-indexing is a good idea. However, as has been said in the replies, dynamically creating indexes in a production environment could cause further problems.
I would let the DBA create the indexes they desire, but allow the database to run in a special logging mode where all plans used by the database are recorded and the appropriate index statements are generated.
Run the database and all related client applications over a period of time.
Turn off the logging mode, after which the DBA can create all (or the most frequently used) plan indexes which are not yet created.
The simple solution to the whole "don't do that in production" debate: A maintenance window where these types of operations are performed. Your DBA should have one of these in place already, why shouldn't your database know and follow the same practices? (Such as not performing a backup OUTSIDE of this window, unless the database is not "hot" or in use)
I don't know about you, but the limited knowledge of databases I have doesn't make me want to keep the same knowledge going forward until the day I die. Perhaps the automatic part of the discussion is the problem, but to suggest the entire thing is useless is only thinking about TODAY. If you are developing an RDBMS and not thinking about these types of issues, you might as well just stop. I don't want a "modern" database with the mentality of an ENIAC. "Hardcore" DBAs may prefer such a system, but as time progresses you'll have to justify the cash I'd be spending on such turds.
At the very least, perhaps you can turn those Access "gurus" into real DBAs in the process, or at least have them thinking about real database problems.
RE: "Tuning is an art. Why would you want to use robot?"
Funny, people used to say the same thing about writing machine language. "Use a high-level language like C? And give up all the efficiency of machine language? Are you out of your mind?"
Joshua's right, the database should self-optimize. Although there are always those people who will study a system deeply and understand its nuances and thus be able to out-perform and out-art any robot, for the vast majority of the human population the robots win.
Kasparov > Chessmaster > you.
been away, so lots o comments. the robot quote was a quote, not me. that said: those of us in the database world have found it amusing (in a macabre sort of way) that coders jump at the chance to let a robot design the datastore (which has more importance than code, of course) but rail loudly at the notion of code generation; MDA or executable XML or Firestorm or Scaffolds. Kind of funny. Guess it boils down to whose ox is to be gored.
A database is not just a bunch of files wrapped up by a SQL parser. Coders still don't get that part. Phew.
Who wrote database software which you admire?
"Some" DBA's still dont get that part.
You could miss whether the fault is really the program's; it may not be getting the most out of the DBMS.
I think it's a matter of design: take some time and verify how the data is going to be used.
This link may be of interest for SQL Server 2005 users. It describes a way of implementing automated auto-indexing:
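(The usual building blocks for that kind of automation are the missing-index DMVs introduced in SQL Server 2005. The views below are real; the "estimated benefit" formula is just one common heuristic, not anything official:)

SELECT d.statement AS table_name,
       d.equality_columns,
       d.inequality_columns,
       d.included_columns,
       s.user_seeks * s.avg_total_user_cost * (s.avg_user_impact / 100.0)
           AS estimated_benefit
FROM sys.dm_db_missing_index_details d
JOIN sys.dm_db_missing_index_groups g
     ON d.index_handle = g.index_handle
JOIN sys.dm_db_missing_index_group_stats s
     ON g.index_group_handle = s.group_handle
ORDER BY estimated_benefit DESC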
This is just stupid, lazy, and bad programming.
If you have just created an application, then you should know what table fields to index before you put the application to use. If you don't know, then why are you creating applications (and bad applications, I must add)?
This is a fundamental design process, and you as the developer should theoretically know what parts of your application's database need indexing for performance. After all, I'm sure your application won't be creating random SQL queries on the fly?!?! It will always look up the fields, tables, and joins you programmed into the app at the start. So think about how performance will be hit when these queries reach tables and columns which will have a lot of rows in the future.
This is something which you conceptually need to know to create a good application!!!
Isn't this covered in SQL 2008?
And robots will rule the earth.
SkyNet auto-indexes like da bomb!
I'm surprised nobody has mentioned GData. While the public API is rather newer than the post, it has done this for years - Google's databases have completely automatic indexes. They are also denormalized for performance, duplicating data while disallowing joins.
My first analysis with SQL Tuning Advisor happened to produce really bad advice. It turned out the largest table needed a different clustered index than the obvious one, and the tool was not smart enough to figure that out.
I agree with the general idea, but the Index Clippy would need to watch for a while to give good advice, and it would need to be run under the right operating mode. Auto Indexes generated during development or initial data load would probably be detrimental when the db was actually in production.
The biggest consideration is that indexes can considerably impact insert and update performance, so unnecessary indexes can really hurt you. Maybe Auto Indexes could be marked as such, so the Index Clippy could delete indexes it had previously created that were no longer needed.
My fear would be that fully automatic indexes would be about as good as Word HTML.
And regarding all the comments about expensive worthless DBA's - data is incredibly important and valuable to most businesses, so it makes sense to spend on it. And while I've worked with plenty of worthless, overpaid data trolls, I've also had the pleasure to work with some brilliant DBA's who more than earned their pay. A good DBA's like a plumber - when you need one, you really need one.