Let's say you're about to deploy an application. Said app has been heavily tested by your development team, who have all been infected by unit testing fever. It's also been vetted by your QA group, who spent months spelunking into every crevice of the app. You even had a beta test period with real live users, who dutifully filed bugs and put the application through its paces before finally signing off on the thing.
Your application is useful and popular. Your users love it. Your users love you. But over the next week, something curious happens. As people use the application, it gets progressively slower and slower. Soon, the complaints start filtering in. Within a few weeks, the app is well-neigh unusable due to all the insufferable delays it subjects users to-- and your users turn on you.
Raise your hand if this has ever happened to a project you've worked on. If I had a buck for every time I've personally seen this, I'd have enough for a nice lunch date. Developers test with tiny toy data sets, assume all is well, and then find out the hard way that everything is fast for small n.
I remember a client-side Javascript sort routine we implemented in a rich intranet web app circa 2002. It worked great on our small test datasets, but when we deployed it to production, we were astonished to find that sorting a measly hundred items could take upwards of 5 seconds on a user's desktop machine. JavaScript isn't known for its speed, but what the heck?
Well, guess which sort algorithm we used?
InsertSort is n2 (worst case), ShellSort is n3/2, and QuickSort is n · log n. But we could have done worse-- we could have picked Bubble Sort, which is n2 even in the best case.
Friends, do not do this. Test your applications with large data sets, at least large enough to cover your most optimistic usage projections over the next year. Otherwise, you may be sorry. And your users definitely will be sorry.
Big O notation is one of those things that's easier seen than explained. But it's a fundamental building block of computer science.
Big O notation: A theoretical measure of the execution of an algorithm, usually the time or memory needed, given the problem size n, which is usually the number of items. Informally, saying some equation f(n) = O(g(n)) means it is less than some constant multiple of g(n). The notation is read, "f of n is big oh of g of n".
Developers rely on data structures and databases that have favorable big O notation performance baked in, without thinking much about it. But if you stray from those well-worn paths, you can be in a world of performance pain-- and much sooner than you could have possibly imagined. The importance of big O notation is best illustrated in this graph from Programming Pearls:
The TRS-80 algorithm is 48n, and the DEC Alpha algorithm is n3.
When n is 10, they're within a second of each other. But when n grows to 100,000, the modern DEC Alpha takes a month to do what a prehistoric TRS-80 can accomplish in a few hours. Having a big O notation bottleneck in your app is a one-way ticket in the performance wayback machine to 1997 -- or worse. Much worse.
There are friendly names for the common big O notations; saying "n squared" is equivalent to saying "quadratic":
| notation | friendly name |
| O(1) | constant |
| O(log n) | logarithmic |
| O([log n]c) | polylogarithmic |
| O(n) | linear |
| O(n · log n) | sometimes called "linearithmic" or "supralinear" |
| O(n2) | quadratic |
| O(nc) | polynomial, sometimes "geometric" |
| O(cn) | exponential |
| O(n!) | factorial |
Tom Niemann has handy charts that compare the growth rates of common algorithms, which I've adapted here:
| n | lg n | n7/6 | n lg n | n2 | 7/6n | n! |
| 1 | 0 | 1 | 0 | 1 | 1 | 1 |
| 16 | 4 | 25 | 64 | 256 | 12 | 20.9 trillion |
| 256 | 8 | 645 | 2,048 | 65,536 | 137 quadrillion | - |
| 4,096 | 12 | 16,384 | 49,152 | 16,777,216 | - | - |
| 65,536 | 16 | 416,128 | 1,048,565 | 4,294,967,296 | - | - |
| 1,048,476 | 20 | 10,567,808 | 20,969,520 | 1.09 trillion | - | - |
| 16,775,616 | 24 | 268,405,589 | 402,614,784 | 281.4 trillion | - | - |
Here are sample execution times, assuming one unit of execution is equal to one millisecond of CPU time. That's probably far too much on today's CPUs, but you get the idea:
| n | lg n | n7/6 | n lg n | n2 | 7/6n | n! |
| 1 | <1 sec | <1 sec | <1 sec | <1 sec | <1 sec | <1 sec |
| 16 | <1 sec | <1 sec | <1 sec | <1 sec | <1 sec | 663 years |
| 256 | <1 sec | <1 sec | 2 sec | 65 sec | 4.3 mlln yrs | - |
| 4,096 | <1 sec | 16 sec | 49 sec | 4.6 hr | - | - |
| 65,536 | <1 sec | 7 min | 17 min | 50 days | - | - |
| 1,048,476 | <1 sec | 3 hr | 6 hr | 35 years | - | - |
| 16,775,616 | <1 sec | 3 days | 4.6 days | 8,923 years | - | - |
Notice how quickly we get into trouble as the number of items (n) increases. Unless you've chosen data structures that offer ideal near-logarithmic performance across the board, by the time you get to 4,096 items you're talking about some serious CPU time. Beyond that, I used dash as shorthand for "forever". That's how bad it can get.
Everything is fast for small n. Don't fall into this trap. It's an easy enough mistake to make. Modern apps are incredibly complex, with dozens of dependencies. Neglect to index one little field in your database and you're suddenly back in TRS-80 land. The only way to truly know if you've accidentally slipped an algorithmic big O bottleneck into your app somewhere is to test it with a reasonably large volume of data.
| [advertisement] TransferBigFiles.com allows you to send huge files (up to 1GB) to anyone without worrying about email attachment limits. Send via the Web site or download the DropZone utility for even more functionality. It’s fast, easy, and totally free! Transfer big files now. |
Posted by Jeff Atwood View blog reactions
« Lazyweb Calling On Exposé, Flip3D, and Switcher »
I'm happy to say that I've always had an eye for optimization.
However, we bought a 3rd party weather app a few years ago. Tested it out for a few weeks internally.. and launched it. Literally 5 minutes later the webserver went down under the load.
It never occurred to me that a vendor would sell software that would fall over so easily.
Sean on September 21, 2007 01:42 AMAmen - I've been bitten by this before - most especially by the cases where I've thought "yeah, I could index on <some parameter> and get O(1) rather than do an O(n) linear search for it...but there won't be any cases where anyone'll search on that parameter". Yeah, WRONG! Learn that 'this case will never happen' is always a false statement...
Stuart Dootson on September 21, 2007 01:53 AMYou can bump into a similar problem with SQL servers.
Your app runs great on test data, even on medium sized data, but when confronted with a large number of rows, the database query optimizer decides to change its strategy, and suddenly you are wondering why the query that runs near instant on one database suddenly takes half a minute on another.
Always fun to discover after release...
Christian Mogensen on September 21, 2007 02:19 AMOn the other hand, if you know your n will be small, because you control the creation of the data structure completely, there is not much use in implementing superduper optimization schemes ( especially when they involve caching ).
I have seen perfectly simple code, with n <= 100, being transformed into a brittle mess because of unnecessary cleverness.
But I have also been guilty of not using a suitable algo when n was growing beyond expectations.
KristofU on September 21, 2007 02:26 AMJeff,
Quicksort's worst case is O(n^2), not O(n logn). However, the expected average is O(n logn) and practice shows that the worst case is almost never reached when using Quicksort (and it can be made to disappear, but then we're no longer talking about Quicksort).
Konrad on September 21, 2007 02:42 AMIt's difficult though to draw the line between necessary and unnecessary optimisation. I always remember in these cases Knuth words, "Premature optimization is the root of all evil (or at least most of it) in programming". May it be to excuse my laziness? :)
I always thought Bubble Sort is O(n) in the best case. When the data is sorted there is no need to bubble items upwards, thus the loop ends after one cycle. Selectionsort is always O(n²) because every cycle the biggest/smaller element is determined and put in the right position.
N loops with O(n) elements each makes O(n²) altogether.
As others had said it's not simply a matter of knowing your big O. You also have to know how your algorithm works for all inputs, and what inputs it will be getting in operation.
Of course if you don't know big O, then you'll never get that far.
Factory on September 21, 2007 03:43 AMAs KristofU says, there is a flip side to this.
I've seen simple code that does a linear search looking for a value (e.g. simple for-loop through an array) being replaced with considerably more complicated dictionary-based approaches.
A good thing when n is large, but rather pointless when n is ten.
Graham Stewart on September 21, 2007 03:52 AMI find it incredible (literally, I just can't bring myself to believe) that there are people WORKING PROFESSIONALLY as programmers without knowing this extremely basic high-school level fact. Surely every programmer in the world knows that there are different kinds of sorts with different O(n) performance characteristics.
[run-on-sentence]The fact that people are deploying actual products to actual users, without thinking even for a minute about how their choice of algorithm and data structures will perform in large but fully EXPECTED loads is a very sad reflection on the state of the software engineering profession.[/run-on-sentence]
Why is it that people are deploying (or even sending to QA) solutions that they KNOW won't perform well in real loads? How come no one did an even basic stress test of the application? Does anyone ever THINK anymore, or do they just copy and paste javascript sorting code they found on some website?
Data structures and algorithms are the ABC of software engineering, of software programming. Thinking a bit ahead is the ABC of any engineering or construction practice. Imagine if the first floor ceiling could easily carry itself, but would collapse if more than 2 people stand on it.
On a different note, the tables you presented are probably easier to grok as graphs. You might even want to show one graph in a linear scale, and one graph in logarithmic scale.
When doing technical interviews I always make a point of asking a question centered around Big-O notation and algorithmic complexity.
Candidates claiming to be "senior" software engineers fail instantly if they know nothing about Big-O and related concepts....
Gordon on September 21, 2007 04:10 AM>> I'm happy to say that I've always had an eye for optimization.
>> However, we bought a 3rd party weather app a few years ago.
>> Tested it out for a few weeks internally.. and launched it.
>> Literally 5 minutes later the webserver went down under the load.
>> It never occurred to me that a vendor would sell software that
>> would fall over so easily.
Well, if you tested it for a few weeks without any problems and then it falls five minutes after launch, then your own test were not, ehh, very good either, I guess.
Hey! I thought Donny Knuth wrote almost everything I wanted to know about the subject. Your article is very nice, although I recommend the book "Sorting and Searching Algorithms" for all those curious readers that happens to be out there in the cold using whatever sort routine is provided at his/her office.
A pretty old book, for sure. But it covers the problem IN DEPTH.
I disagree that the pathological O(n^2) case of quicksort rarely happens. With generic implementations it will happen when you try to sort data that is already sorted - and that is a pretty likely occurrence.
So just because it has average nlgn performance, you don't get to say Quicksort is O(nlgn). That is making a very specific claim, and it is false. (Of course the article didn't say that!)
I think it should be noted that the TRS only overtakes the Alpha at an execution time of close to an hour - a point where the advantage is likely to be irrelevant (unless you are talking about some special application to do some serious calculation thats expected to take more than a second).
J. Stoever on September 21, 2007 05:04 AMHey Now Jeff,
Algorithms are so fun, another post I enjoyed reading & learned from.
Coding Horror Fan,
Catto
I am a big believer in returning to the basics on a regular basis. Thanks Jeff - this is a good reminder. Gordon's suggestion that he would not hire a would-be senior developer who cannot demonstrate some knowledge of Big-O notation got me thinking.
There is a popular interview question to the effect of "What's your favorite sort?", sometimes followed by a request to code it up. Some years ago I was talking with a very good, mathematically-inclined developer who said that he was tempted to answer "Ascending". He explained that almost never coded his own sort algorithms, instead preferring to use the functionality of the well-tuned libraries and controls available to us.
In my world of general business applications, knowing how to find the right sort algorithm (including a basic understanding of Big-O) is far more important than being able to code it from memory. The problem just does not come up that often, and I can easily find several sort algorithms for any given language (from dependable sources, usually with the Big-O plainly stated for me, complete with explanations of the best case and worst case scenarios).
Jim Snyder on September 21, 2007 05:08 AMDamn it! I wish someone told me this 4.3 million years ago!
Kent Boogaart on September 21, 2007 05:13 AMJ Stoever, an Alpha running the TRS-80 algorithm (48n) would have overtaken the Alpha running the Alpha algorithm (n^2) in seconds at most. The TRS-80 is only given as a dramatic illustration of how faster hardware does not help fix bad algorithms.
Leons Petrazickis on September 21, 2007 05:17 AMAfter some thinking, I think this article touches a much broader issue that it didn't quite do justice. For one thing, testing with a huge amount of data is only a subsection of what should generally be a central part of any testing: assuming the worst case scenario. Not testing with big datasets is really no different than not testing with internet latency, or not confirming that user input is in the correct format.
On the other hand, care should be taken to avoid optimizing for the worst case scenario - just like (mentioned above) quicksort isn't all that awesome if you get the (normal, in regards to computer programs) case of presorted data, there are other giant holes you can fall into if you confuse "testing for extreme cases" with "optimizing for extreme cases".
J. Stoever on September 21, 2007 05:19 AMI don't think you have to be a optimization freak. You can always replace to the slower components with faster alternatives as the number of users increase.
Rohit Arondekar on September 21, 2007 05:20 AMIncidentally, an Alpha is a superfast mid-90s processor that DEC used to put out and NT used to run on.
A TRS-80 is an early 80s hobbyist computer. It's about 10,000 times slower than an Alpha.
Leons Petrazickis on September 21, 2007 05:23 AMRe: M's little toy-tossing rant
Before posting a comment in future, you may want to just review it and measure it against the standard of, "How useful is this to anyone who might read it?"
Against that standard, your post was completely useless apart from the last paragraph, where you made a helpful and constructive suggestion.
Please don't waste my bandwidth telling the world that you have done some computer science courses. Anyone reading this blog, because of it's highly technical level, is just as smart as you - we just don't talk about it.
Mark Green on September 21, 2007 05:26 AMI think there is a corollary to the big-O notation problem, which is that it can be just as easy to hide performance problems because it is an analysis of complexity, not of runtime speed. It treats O(100n) == O(n) since it is assumed that the size of n will eventually swamp any constant terms. The problems is, I've seen too many systems where performance on some piece of the system is abysmal and the developer says "But I used quicksort!!!" or "But I used a B-tree!!!" That's all fine and good, but then I look at the code and see that the system is fetching things from the database that could be cached in memory or is using strings as comparison keys when ints could be used. Quicksort won't help you if your comparison algorithm is painfully slow. Using the best algorithm/data structure for the job is the least you can do. Next you have to actually think about your application data and how you can simply make your program do less.
Elias Holman on September 21, 2007 05:34 AMI'd sad non-scalable software is the norm, not the exception. There are lots of excellent programmers who know to avoid exponential algorithms, but they have good jobs as far away from marketing as possible. There are far more AWFUL programmers out there peddling their shoddy wares for a quick buck. It's simple: if there are more than 4 or 5 competing software packages that do the same thing, at least 75% of those are going to be absolute garbage; such is the result of today's market forces and management overload, the people doing the purchasing either don't have the eye, or the time to properly examine their options and the result is that the incompetent coders make a sale.
To most experienced developers, bottlenecks are immediately apparent during the design phase. We just know, before the requirements meeting is over, exactly where and how the app will fail and that's where we focus the bulk of our effort. If you know that the key feature of an app is a nonsensical 14-table SQL join or a write-heavy backend, no amount of brilliant code-level optimization will make the app faster... the key is to break it into manageable chunks or sometimes completely rework the functionality.
If it weren't for these true engineering problems, programming would not be a career.
Bill Lambert on September 21, 2007 05:37 AMMaybe I'm getting old, but it seems that this degree of subtlety is getting forgotten these days. Many developers are so used to calling the "sort" method on a collection that they've forgotten -- never even knew in some cases -- the characteristics of common algorithms.
I remember working with one developer who "enhanced" a program we were working on by changing a call to a heapsort (I think) to quicksort, basically because the latter had the word "quick" in it so it must be faster. Except the list was usually sorted...
But yes, even if you don't understand you should at least test!
Stephen Darlington on September 21, 2007 05:39 AMJeff,
I was just going to congratulate you on the use of your sorting animations: I've known about the Big O differences for ages, but I'd never SEEN the difference put so plainly. Then, I saw that you linked to a source for those animations and followed the link. Amazingly, it's a site advertising Cormen's "Introduction to Algorithms." That's the very book I'd been thinking of while reading your article. Heck, I even wrote a review of it on Amazon: "Academic Masterpiece, Practical White Elephant"
http://www.amazon.com/exec/obidos/ASIN/0262032937/ref=nosim/none01
Anyway, good job.
David A. Lessnau on September 21, 2007 06:09 AMI'd hope that anyone with an actual degree from a CS institute would know about Big O.
I agree wholeheartedly with testing with large datasets - but always think really hard what sort of numbers you're actually going to encounter.
Everything is fast for small N - therefore, if small N is all you will ever see, everything is fast :) I was doing work on optimizing our draw ordering sort for transparencies. We were just calling sort every frame, which is much slower than doing a proper octree traversal. After a little testing as to whether I should even be attempting this, I found that there were never more than 17 items on our largest reasonable datasets that needed to be sorted at any give time. We were taking .1 ms on the sort and 6 ms to draw them... I decided to work on other problems ;)
Mike Ford on September 21, 2007 06:12 AMQuicksort's worst-case behavior is O(n*n), not O(n*log(n))!
This is not a niggling detail!
Many apps have been brought to their knees because foolish programmers didn't understand Quicksort's behavior.
Fix your app first, then fix your incorrect blog post!
You should use Heapsort instead of Quicksort. Heapsort is O(n*log(n)) worst-case.
tndal on September 21, 2007 06:14 AMTables show 7/(6^n); I think you mean (7/6)^n.
Anon on September 21, 2007 06:21 AMEven if one doesn't generally implement or even USE a sorting algorithm, I'd expect all programmer to know the basics. You may call me an elitist or whatever, but there has to be a limit to who can call themselves programmers.
Sorting algorithms are, after all, one of the basics that are used to demonstrate programming in the first case. Quick sort is usually one of the first non-trivial recursive algorithms you are taught.
You don't really need a CS degree to know the basics of sorting and complexity. You don't have to be a genius to understand it either.
This stuff is taught in high schools. It is taught in many beginner programming books (but should be in ALL of them).
This is why it's so unbelievable. How can anyone actually manage to get a job as a programmer and not know the basic facts of sorting? It's like getting a pizza delivery job without knowing traffic laws.
M on September 21, 2007 06:31 AMTalking about testing with large data, we did a live data test in parallel with our existing legacy over the last two weeks with our new .NET web service that handles raw medical device data. Our net admin had kittens when the new service filled up his brand new server's disk array in 3 days time. We now have a 2TB disk array on order and we'll go live (maybe) when it's installed.
Frank C on September 21, 2007 06:41 AMTo those saying that it's OK to run a linear search if you "control n", be very careful. Your linear search over 20 items might end up being executed 10,000 times in a loop (I've seen this), and some batch is now taking a full minute when it should be more like 5 seconds. The difference doesn't have to be as dramatic as 1 second vs. 1 month for users to care.
It's really *not* that much harder to use a dictionary in C#. I don't understand the complaints about adding all this complexity. In actual fact, the code for a dictionary is often cleaner, *especially* if there's one specific field you usually use as an identifier.
Getting an item from a dictionary:
MyObject FindDictionaryItem(int objectID)
{
return dic[objectID]; // We're done!
}
Getting an item from a list:
MyObject FindListItem(int objectID)
{
foreach (MyObject m in list)
{
if (m.ID == objectID)
{
return m;
}
}
throw new KeyNotFoundException();
}
I think most people would say that the dictionary version is actually much LESS complex.
Aaron G on September 21, 2007 06:43 AMThis is one reason that the C++ STL explicitly describes the performance/complexity characteristics of things like insertions, searches, comparisons, including specifying the general implementation (though it of course leaves many details unspecified as well). It's worth checking that each time you are about to decide which algorithm or container template or method to use from the STL (I keep the O'Reilly STL Pocket Reference handy at all times).
Reed Hedges on September 21, 2007 06:50 AM"With generic implementations it will happen when you try to sort data that is already sorted - and that is a pretty likely occurrence."
Only if you pick the first or last item as the pivot. I doubt many library implementers do that these days (glibc uses median-of-three).
Observer on September 21, 2007 06:53 AMThis was all too mathematical for me but it has inspired me to find sample code for the various types of sort in C# and VB.NET.
I think programmers tend to be too socially isolated. We need a really good social networking site for developers so we can get the benefit of the "wisdom of the crowd". More social networking by programmers would apply peer pressure to use best practices. I'm currently working on a project created by mainframe programmers who have loaded the entire database structure into arrays (every table name, column property, etc) has to be referenced using jagged arrays with obscure names and it is driving me crazy!
Robert S. Robbins on September 21, 2007 06:57 AMthe always forgotten algorithm: "Counting sort".
It's in my opinion the fastest way to sort integers,
that are in a known range (i.e for instance between 0 & 100)
God, the number of times I've tried to explain to management why a QA database 1/20th the size of the production one isn't perfectly adequate for testing...
A. Lloyd Flanagan on September 21, 2007 07:20 AMThe simpler answer is that if you do not test on full scale data before you release your application to the users, you are an idiot.
Grant Johnson on September 21, 2007 07:21 AMAhh...the infamous Shlemiel the painter's algorithm.
(see Joel's infamous "Back to Basics" post...from away back in 2001)
http://www.joelonsoftware.com/articles/fog0000000319.html
I've seen this over and over both doing QA and as an "optimization engineer". (current gig) I just can't quite wrap my brain around the blind spot we all have for small data sets...I've been guilty as the rest.
I think I'm going to take that execution time chart, print it out on our plotter, and post it as my cube wall with the title "_Everything_ is fast for small n"
It'll become my new mantra. I'll use it in meetings. Slip it into casual conversation. My wife will love it. My kids will sagely repeat it. My dog will go on command for it.
Ok, maybe we'll just stick with the wall chart.
aaron on September 21, 2007 07:24 AMM said:
>>How can anyone actually manage to get a job as a programmer and not know the basic facts of sorting?
Because there are way more programmer jobs than there are good programmers. Worse, a lot of times, neither the manager nor the HR person knows the first thing about programming themselves. There's a widespread myth that programming is something you can just "pick up" over a few weekends.
A. Lloyd Flanagan on September 21, 2007 07:24 AM@Aaron G:
yes obviously if you are hitting a section of code very frequently and your system runs slow, then that code is a good candidate for optimisation.
My point was though that sometimes (for a small dataset) a linear search is a better choice. For example, if I used a dictionary approach then that may cause the dictionary library to be linked in, increasing the size of my program and it's memory usage (which is still a consideration for the embedded systems I work on). Other complications and overhead may come from having to wrap primitive variables to store them in the dictionary and from dealing with exceptions or error conditions.
--
To those saying "This is obvious. Every programmer should know big O.":
You might have a perfect grasp of big O and be absolutely convinced that your algorithm will scale well, but you should still test it with large data regardless. There is always a difference between theory and reality (e.g. the theory may not consider the slow down caused by paging or database cursor moving).
re: Mike Ford's analysis.
Nice work. Optimize only when you have to. The only way to know if you have to is to profile the app so you know where the hot spots are.
Perhaps a bit off topic, but... I just hope when you decided *not* to optimize that particular routine, you at least placed a comment in the code explaining why you found it to be unnecessary. That would keep other well-meaning coders from coming along later and optimizing it anyway, when the boss says, "make the app faster." That would be a time waste, and a quick comment could prevent that. If your comment has specifics, like ".1 ms on the sort and 6 ms to draw," it helps drive the point home.
In fact, the original coder(s) may have done the exact analysis that you did, then decided on the simple approach because the numbers were small. Had *they* commented their thinking, *you* wouldn't have had to spend your time going over the same ground.
JK on September 21, 2007 07:40 AMfantastic article on big O notation in layperson's terms. As a CS dropout, that was one of the things i never really 'got'
chiefchiefen on September 21, 2007 07:48 AMFirst, a broad observation - even with small test datasets, and even if you don't know your algorithm's O(), you can spot trouble coming by simply testing different sizes of data. If doubling your test set causes your runtime to double, you have evidence of O(n); better than that, and you approach scalability; worse than that, and you need a new algorithm.
As for tndal's assertion that quicksort should be ditched for heapsort because of its worst case: don't be daft. Firstly, in-place algorithms are only any use in mutable arrays - a pure functional language or an application that needs to sort sequential data really ought to be using mergesort. Secondly, quicksort's worst case can be avoided by making sure you choose median pivots, which isn't hard to do in practice - choosing the median of the first, middle and last elements of your data set will generally get close enough (and the function you use to choose that can be recycled to sort 3-element sets at the bottom of the recursion). Thirdly, quicksort is more sympathetic to modern computer architectures than heapsort (eg. better cache behaviour), and so runs a lot faster on average. Heapsort might be better if predictability of in-place sort time is more important than average speed - but generally, when predictability is critical it's predictability of latency, which points towards maintaining data in a sorted state in the first place. Lastly, dogmatic opinions have no place in an engineer's toolbox.
Relating to Graham Stewart's point: some real-world implementations of recursive algorithms actually switch to a "simple" sort when the data size is small enough that the reduced setup time is more significant than the O(n^2) runtime (in particular, there is no point in calling ANY sort algorithm on a 3-element data set). Likewise, the WATCOM compiler only implemented linear search on its symbol tables, because student programs were so small that the setup costs were never recouped by longer searches. Even in the graph in the main post, if you know your problem size will never exceed 4k or so, you might as well stick with the Alpha variant (and yes, sometimes you can say "this will never happen" - by explicitly *ensuring* it can never happen).
gwenhwyfaer on September 21, 2007 07:48 AMPart of it is that programmers after a certain year came out of school fiddling about with Access databases and what-not. My first position was with a company, using mainframes, where 1MM records was the norm. It was a good base to work from.
Steve on September 21, 2007 07:57 AMI'm surprised to see everyone talking about "Big O Notation", and algorithm complexity, but nobody used the name "Order-Of-Complexity". Isn't that what the big O stands for? Or did I just make that up? I hope not, because I use the phrase all the time!
Oh well, it's been a while since my algorithms classes. :)
Chris on September 21, 2007 08:37 AMI had a very important lesson in my first university data structures course. The first few classes were devoted to sorting. We started with the theory, covered the Big-O of bubble sort, insertion sort, quicksort and heapsort. We then drew the curves and talked about predicted behavior. Then we had to implement each sort. Then we were given a file of 20K words in random order and had to graph the actual times for each algorithm. The first thing I noted was that the predicted best sort never performed better than the others. I was convinced this couldn't be true, so I created a dictionary of 150K words and ran the tests again, and I tried feeding an already sorted file and reverse sorted file for good measure. Now the actual graph was nearer the theory graph. For a certain value of n, the theory was dead on, but when you add in the constant offset of actual code and real systems the curves intersect at different points. The most surprising bit was that for n < 15 bubble sort _always_ won (for that implementation). Years later I found confirmation of this when porting some unix code. The author of that sort function always used bubble sort for values less than 20 and one of the n log(n) sort for everything else.
Anyway, that first CS course taught me that I not only needed the theory, the data structures, but sometimes also needed to empirically prove those things and that has served me well for nearly 20 years.
In all fairness, the "Neglect 1 little index in your database and you're suddenly back in TRS-80 land" isn't quite fair, as it leads folks to believe that your DBMS is going to perform simple n^2 algorithms for unindexed joins.
A real DBMS (read: not one that you implemented for a CS assignment) will very rarely do a naive n^2 search on your data. What it WILL do is create a hash table of all of your data and use that as an index to avoid this. But that is often quite costly in terms of time and CPU itself, and is something you'd rather avoid...
David Markle on September 21, 2007 08:53 AMFirstly, to the many people seemingly surprised by Jeff's explanation of the Big-O notation - yes, most people here should know it. However, I am willing to bet a big enough portion don't. Not everyone has been and done a Comp-Sci degree. I myself, only just embarking on one this term, have been programming for several years and have only just learnt the Big-O notation properly. I know a couple of other people who read this blog for the less hard core programming aspects who would appreciate the explanation.
Secondly, sometimes the obvious needs to be stated, even for experts. We don't all deliberate over what specific algorithm to use based on all the different metrics. Sometimes we just go with our instincts, which happen to be wrong.
"On a different note, the tables you presented are probably easier to grok as graphs. You might even want to show one graph in a linear scale, and one graph in logarithmic scale."
I would disagree. The graph presented does a good job of visualising the difference between different types. It's hard to read precice values from a graph, and I think the " 16, 4, 25, 64, 256, 12, 20.9 trillion" has more impact :)
Is Microsoft aware of this concept? I'm still rather curious as to how my Outlook can lock-up when I'm simply deleted an e-mail, or how XP locks up so often when I'm trying to log off, usually in the saving settings phase.
Mattkins on September 21, 2007 09:25 AM> You might have a perfect grasp of big O and be absolutely convinced that your algorithm will scale well, but you should still test it with large data regardless. There is always a difference between theory and reality (e.g. the theory may not consider the slow down caused by paging or database cursor moving).
That's exactly my point-- test, test, test! Use large data in your testing!
Consider how many times you've used software that didn't scale. iTunes on Windows doesn't seem to scale: http://www.hanselman.com/blog/iTunes7UnspeakablySlow.aspx
Heck, the Movable Type installation that powers this blog is a perfect example of failing to test with large datasets. Why does it take 15-20 seconds to enter a comment? That sure didn't happen two years ago, before I had 27,404 comments and 946 entries in the MySql database..
Jeff Atwood on September 21, 2007 10:19 AMLeons Petrazickis: "Incidentally, an Alpha is a superfast mid-90s processor that DEC used to put out and NT used to run on.
A TRS-80 is an early 80s hobbyist computer. It's about 10,000 times slower than an Alpha."
Ummm... For some reason you think that the highly technical readers of this blog don't know that?
And FYI: Just because something is a lot faster sometimes doesn't make it always a lot faster. For example, a Ferrari is a lot faster than a Vespa, except when they're both climbing the same mountain and halfway up the Ferrari runs out of gas.
My point is that, even though the Alpha may be about 10K times as fast as the TRS-80, that doesn't mean it was faster at everything. (I think that might have been Jeff's point as well.)
KenW on September 21, 2007 10:19 AMHey Jeff,
The columns in those tables are out of order. n^(7/6) dominates n*lg(n), you just haven't printed enough rows to show it.
c
c on September 21, 2007 10:38 AMThe true lesson here: read back through *all* the posts on this blog.
Jeff posted the exact same thing about three years ago in this same blog. I'm pretty sure it's 100% identical.
Imagine all the good stuff that's back there waiting for you to find!
paul on September 21, 2007 10:55 AMI have seen this happen with badly designed SQL queries as well. Everything is great when the test database has 100 rows in each table, but when production data floods in and there are millions of rows (and one million rows is small for some settings bigger than I have worked in) then the queries grind on and the app gets slower and slower.
Bobbie The Programmer on September 21, 2007 11:03 AM>> With generic implementations it will happen when you try
>> to sort data that is already sorted - and that is a
>> pretty likely occurrence.
At least one version of Microsoft C++ exhibited that behavior with qsort().
So much for the people who say, "I don't need to know anything about sorting, I just call the sort routine in the library".
Bobbie The Programmer on September 21, 2007 11:15 AMPeople have mentioned qsort's O(N^2) worst-case performance and how it typically occurs with naive qsort implementations for already sorted data. Worth noting is insertsort is O(N) for already sorted data. This makes insertsort ideal for applications where you resort data that drifts out of order.
Another reason to test with large data sets: interface optimization.
If a program's interface is only usable with few records, then a more performant sorting algorithm isn't going to save it.
Zack on September 21, 2007 12:06 PM> Jeff posted the exact same thing about three years ago in this same blog. I'm pretty sure it's 100% identical.
Congratulations on being such a long-time fan and an astute observer. :)
I was sort of disappointed that post never got any attention, because it was always one of my favorites. I reformatted it for clarity and added some content to make it what you see today, then deleted the original post.
Jeff Atwood on September 21, 2007 12:18 PMQuicksort worst case (n^2) can be eliminated if you randomize the order of your data before the sort. That takes O(n), so together with quicksort the worst case becomes O(n*log n)
headshots on September 21, 2007 12:28 PM"God, the number of times I've tried to explain to management why a QA database 1/20th the size of the production one isn't perfectly adequate for testing...
A. Lloyd Flanagan on September 21, 2007 07:20 AM "
Or have legal tell you even faking sensitive data is forbidden. The only values we could use for a Social Security Number sort/search were 999-99-9999 and 000-00-0000. Everything else might be mistaken for the real thing, and open to lawsuit.
Tim on September 21, 2007 12:28 PMLike Bill Cook said at INFORMS this year, the speed improvement in linear programming from 1996-2004 is a factor of about 5.3 million (3600 due to algorithms and 1300 due to computing).
The improvement by using better algorithms was almost three times that reached by computing power!
Matteo on September 21, 2007 02:09 PMNo worries, Jeff. I think this time around you have many more readers anyway, since you've grown :)
By the way -- Stuart: by indexing, don't your searches (and inserts) become O(log n) rather than O(1)?
Greg
Greg Magarshak on September 21, 2007 02:21 PMIf you're treating SSNs as character strings for sort/search, you can create fake sensitive data that cannot be mistaken for the real thing by using letters A through J instead of digits, e.g. SSN = ABC-DE-FGHI. Those should perform identically to real SSNs.
If you're treating SSNs as numerics, you've got bigger problems.
Dirty Davey on September 21, 2007 03:13 PMI couldn't agree more with the post. A college of mine did recently an implementation of a tag cloud using for testing a data set of 6500 entries. The tricky part - the whole building and DOM injecting of the cloud was made with javascript which holds in the memory the array containing the test data ...
Performance was terrible - ~ 10 sec page load time on localhost. But this was considered acceptable, as long as it stays under 10 sec. Now, one may thing 6500 data entries sounds pretty good. Well, it isn't. Real live data would have circa 30.000 entries and I can only imagine what the performance would be.
The story ends with a nice shot of me rewriting the whole damn thing from scratch to use AJAX and to lazy load the data. Performance (XHR loading and injecting) is hold under 1 second, well even under 400 ms as measured by firebug.
The moral of the story? The algorithm is always fast with a small n. :)
Nikola on September 21, 2007 03:24 PM"Heck, the Movable Type installation that powers this blog is a perfect example of failing to test with large datasets. Why does it take 15-20 seconds to enter a comment?"
Because movable type regenerates thousands of static htmls with every minor change. It was pretty obvious that it was a bad way to do things back when we first looked at using it 7 years ago.
Sean on September 21, 2007 04:22 PMI agree that testing is necessary. Big O notation only works if you're comparing two algorithms with different orders of complexity.
Look at how textpattern works versus MT.
When Jeff posts a new blog entry, MT creates a new html file and updates the section indexes. This is pretty much O(n). Adding new sections increases runtime linearly.
When someone posts a new blog entry on textpattern, textpattern inserts a new row in a database and inserts new rows in the categories table and sections table. This is also pretty much O(n).
The problem is that inserting a row in a database table is not the same as regenerating an entire html page. Big O doesn't address that. O(1000n) is the *same* as O(n) in Big O notation.
Sean on September 21, 2007 04:47 PM@DirtyDavey: "If you're treating SSNs as character strings for sort/search, you can create fake sensitive data that cannot be mistaken for the real thing by using letters A through J instead of digits, e.g. SSN = ABC-DE-FGHI. Those should perform identically to real SSNs."
...except that they will never compare equal to any real SSN, won't pass input validation (which should be checking that the user enters digits, not letters), will always sort to the end of a list of real SSNs instead of somewhere in the middle, etc. The right answer is to replace or ignore the legal team.
Jeff, the "TRS-80 algorithm" is clearly "go out and buy n TRS-80s", but what is the "DEC Alpha algorithm"?
Anonymous Cowherd on September 21, 2007 05:40 PM> movable type regenerates thousands of static htmls with every minor change
I'm not sure that's the case..
I have *zero* section entries on this blog. No tags, no sections. I've deleted them all, even the templates. There's the main index, and the individual entry, and the "all posts" page. That's it.
When a comment is entered, all it needs to do is regenerate the main page (to update the recent comment count) and the actual entry page.
Jeff Atwood on September 21, 2007 07:21 PMEven if you picked a really stupid sort algorithm, I'm willing to bet your problem wasn't principally the sort algorithm choice but the fact that you were choosing one at all. JavaScript provides Array.sort(compareFunction) since NS/IE 3. If you were having to write your own sort, you must not have been using an Array but instead some DOM collection. Which means you were at least reading from the DOM over and over and possibly also reordering the DOM as you went. That's tremendously slower than reading and writing plain variables. *That's* why your script barfed with a measly 100 rows. A better strategy is to pull out the sorting value and a DOM object reference per row into an array, sort that (which will be effectively instant on 100 items), and then reorder/rewrite the whole table in one go.
Joshua Paine on September 21, 2007 11:34 PM"Worth noting is insertsort is O(N) for already sorted data. This makes insertsort ideal for applications where you resort data that drifts out of order."
What makes insertion sort even more ideal for pre-sorted data is the fact that it's a stable sort, unlike quicksort. That's another important aspect of sorting that most teach-yourself-on-the-job coders have never heard of. Maybe Jeff should do another blog entry on the subject.
Chris Nahr on September 21, 2007 11:39 PMSo what's wrong with javascript's built in Array.sort([sortcb])? Presumably the js engine guys have allready made all the hard choices for you?
You could have added the O(n) radix sort in your examples. And this is worst case :)
@headshots:
Not necessarily. The worst case in quicksort arises when you consistently manage to choose pivots which divide the data into one large set and one empty-ish set. But happily, you don't have to choose a perfect pivot to get the average case; it just has to be close enough (ie. the spilt data sets have to be within an order of magnitude of each other, on average). Your suggestion of randomising the data order solves the "nearly sorted input" case, but what if your random ordering results in a data set that was previously unsorted *becoming* nearly sorted? Right back where you started (and a fine example of cargo cult programming to boot).
That's why I suggested taking three elements from disparate places in the data set and choosing the middle one as the pivot. Most naive (or example) quicksorts select the first element as the pivot - which if the data is nearly sorted, is guaranteed to be a lousy choice. However, choosing the median of the first, central and last elements will always give you a more or less perfect pivot on sorted data, and do no worse on average than picking the first element for randomised data. And it's O(1) so long as you have random access to the data set (and as I said before, if you don't, you really need to be using mergesort).
gwenhwyfaer on September 22, 2007 09:31 AMThis is very interesting; I am seeing exactly this on my computer science course. Thanks Jeff. The most common algorithms I am seeing are n, log(n) and n . log(n) and n^c.
Hoffmann on September 22, 2007 09:58 AMI love this post, and I'm surprised by the drift of the comments.
On the other hand, it is perhaps more accurate that everything is equally slow for small n, since the constants dominate (setup, teardown, loading, yada yada).
Well, I suppose there are some good-order solutions that have implementations with really really terrible small n performance, but it is difficult for that to escape notice, I trust.
orcmid on September 22, 2007 01:51 PMIn my experience, the reality is often "everything is fast if it doesn't call the database". The optimizations to avoid database calls dwarf anything else.
smackfu on September 22, 2007 06:58 PMI'm going to have to side with Eugê here and argue that Bubblesort is O(n) in the best case, and not O(n^2). It's O(n^2) in the *worst case*. It might sound like a pedantic point to make, but the distinction is critical, particularly for an article intended to educate people on computational complexity.
If you know that your data is typically sorted then paradoxically you may be much better off using Bubblesort than an O(n log n) 'worst case' alternative. An example where this might be useful is in a graphics engine where you can exploit frame-to-frame coherence by depth-sorting the list of objects to be rendered each frame, and you expect their relative positions to change infrequently.
There are various other examples in graphics and physics engines where this kind of temporal coherence can be exploited, and undoubtedly in lots of other domains too. Of course the key lesson should be to understand the complexity of candidate algorithms, understand your data (profile!) to make the best selection of algorithm for your needs.
StrmnNrmn on September 24, 2007 02:11 AMWe were guilty of this too, but I have begun to appreciate the value of large n. We're an app that will give users the chance to create records that are searched as frequently as a user switches applications.
We actually reduced the complexity of the code by testing for large n.
We found that a simple iteration works in large-n cases (n = 5000, where you might see n = 100 from a power user). The value in large n here is in justifying design. Without testing we would not be able to prove that a simple solution worked and a complex algorithm for the sake of performance was not needed.
John C on September 24, 2007 06:24 AMI would recommend that everyone also watch Raymond Chen's speech from the 2005 PDC "Five Things Every Win32 Programmer Needs to Know" where he points out how things going on inside of Windows will effect your data structures and algorithms.
slapout on September 24, 2007 02:20 PMGreat video on sorting techniques, "Sorting Out Sorting" via Joey DeVilla:
http://globalnerdy.com/2007/09/24/sorting-out-sorting/
Jeff Atwood on September 26, 2007 01:19 AMSorting's failure to scale is a specific example of a general problem with interfaces that don't scale well. One of my least-favorite examples is the way Apple replaced it's excellent Sherlock search interface with Spotlight.
Spotlight probably works fine for users with 1 to 1,000 data files. I'm a writer, I've owned computers since 1987, and I'm a backup fanatic. At one point my Documents folder had over 300,000 files in it. Instead of getting 10 search results like Joe User, I get 100, or 1,000. So I need an interface that allows me to
1. Easily choose WHERE I will search (which folder or folders);
2. Quickly SORT the files on the basis of what I think will help me find what I want,
3. Quickly SCAN the files, and
4. Quickly RE-SORT the files by a different parameter if steps 1 and 2 don't work right away.
Spotlight may have a way to quickly delimit which folders to search, but I haven't found it. Worse, it sorts results into seperate categories, and displays only SOME of the hits. This makes it impossible to just scan the whole list of files for a match. If what I want isn't in the first few hits, I must also remember what KIND of file it is. Ack! More mental overhead!
Mental overhead does not scale! I can't just add a faster CPU or more RAM to my brain.
I suspect that Spotlight's developers came up with their cool idea, implemented it, and tested it ... on machine with a few dozen to a few thousand files. And it worked -- in that environment. I ended up turning it off and use FileBuddy for searching. Not ideal, not as good as Sherlock, but it works.
The traffic here in the Bay Area is another example of technology that doesn't scale well. Trains scale well; autos do not.
Wilma Keppel on October 1, 2007 03:08 AMJust don't fall into the "same order of magnitude" mistake people make fresh out of computer science studies:
http://dotmad.blogspot.com/2007/10/every-n-matters.html
Wilma Keppel:
You're a writer, but you don't know when to use "its" instead of "it's"? For shame....
Reading these comments, I kept wondering, why do a sort in the first place?
If you know that usually the data is already sorted, then only sort it when you know it has changed. I mean, changed enough to justify the effort of sorting. Why not have two tables, one known to be sorted, and a small table not sorted? Then you can quickly look up the large table. If the item is not found there, you can quickly look up the small table. Do the sorting only occasionally!
That is assuming that you do need to sort the data in the first place. If the purpose of doing the sort is simply to be able to use a quick method of finding an item in a sorted table, perhaps you are going about things in an inefficient way. A hash table might be a far better solution.
Sorting is expensive. Try not to do it. Ever.
mike on December 18, 2007 06:55 PMThe most common time I find myself needing a sort is when looking at some search results. Typically you will want to sort by name, price, phone number, location or any number of things that will depend on the situation. Although you can create a load of presorted lists and keep them up to date as each new record is added/deleted, it might not be worth it if you have too many of them... the cost of adding a record might become impractically high.
I've done a lot of data entry and I have to say the amount of crap software out there is enormous. One of the worst was raising an order where you add a record for each different item. By the time 20 different items are on said order it takes 20 /seconds/ to add the next record. Clearly a case of using a naive or inappropriate algorithm... although given the severe slowness.. I would like to think it is a bug. Unsuprisingly they had only tested for it up to 5 items. Still not fixed after a year... in which time I have written a /scripting language/ to automate the process (and some others)... at least I don't need to waste my time anymore. :)
Jheriko on January 18, 2008 05:19 AMImportant article Jeff.
One clarification might be to indicate that the "lg n" column is using a base of 2. I had forgotten a lot of high school mathematics and the windows calculator has a "log" key, so I was mystified for a while how for example lg(16) = 4 and not 1.204...
The windows calculator doesn't appear to provide a single button for "log b n" so I had to hunt up the expression to calculate base 2 logs in 2 steps. log base2 x = log 10 x / log 10 2.
AJ on February 13, 2008 06:44 PManother excellent collection of visual sort comparisons -- animated GIFs with no Java dependency!
http://vision.bc.edu/~dmartin/teaching/sorting/anim-html/all.html
Jeff Atwood on June 22, 2008 09:47 PM| Content (c) 2008 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved. |