There Ain't No Such Thing as the Fastest Code

February 19, 2008

I was tickled to see that James Hague chose The Zen of Assembly Language Programming as one of five memorable books about programming. I wholeheartedly agree. Even if you never plan to touch a lick of assembly code in your entire professional career, this book is a fantastic and thoroughly useful read. I was a mere Visual Basic programmer when I found this book (along with The Zen of Code Optimization), picked it up on a lark, and I could barely put it down. It's that good.

Abrash isn't just a seminal figure in the software engineering community, he's also one of the best technical writers you'll ever find. That's why he's one of my programming heroes, directly alongside Steve McConnell.

His Graphics Programming Black Book is similarly great, and covers topics so general and wide ranging that the title becomes a bit of a misnomer. Best of all, it's available online for free courtesy of Byte, so you can sample it yourself.

Michael Abrash's Graphics Programming Black Book

I know what you're thinking. "This book is about graphics. And assembly language. Plus it's from, like, 1996, which is approximately 1928 in computer years. It's of no interest to me as a programmer." Admit it. You are. But you know what you're going to do? You're going to click through anyway and read some of it. Just like in college, the class topic doesn't matter when the instructor is a brilliant teacher. And that's exactly what Abrash is.

Abrash is a world class coder and technical writer, but he's also not shy about explaining the perils and dangers of our craft, including the biggest problem of all-- the one that sits behind the keyboard. Allow me to illustrate with one of my very favorite Abrash passages, from Chapter 16 of the Graphics Programming Black Book.

Not so long ago, Terje Mathisen, who I introduced earlier in this book, wrote a very fast word-counting program, and posted it on BIX. When I say it was fast, I mean fast; this code was optimized like nobody's business. We're talking top-quality code here.

When the topic of optimizing came up in one of the BIX conferences, Terje's program was mentioned, and he posted the following message: "I challenge BIXens (and especially mabrash!) to speed it up significantly. I would consider 5 percent a good result." The clear implication was, "That code is as fast as it can possibly be."

Naturally, it wasn't; there ain't no such thing as the fastest code (TANSTATFC? I agree, it doesn't have the ring of TANSTAAFL).

[assembly language tricks and useful optimization approaches elided -- see PDF for full detail]

The biggest optimization barrier that Terje faced was that he thought he had the fastest code possible. Once he opened up the possibility that there were faster approaches, and looked beyond the specific approach that he had so carefully optimized, he was able to come up with code that was a lot faster. Consider the incongruity of Terje's willingness to consider a 5 percent speedup significant in light of his later near-doubling of performance.

In the same chapter, Mr. Abrash relates a similar anecdote based on a word counting program. It was published as a challenge in his "Pushing the Envelope" column:

That initial challenge was sparked by a column David Gerrold wrote concerning the matter of counting the number of words in a document; David turned up some pretty interesting optimization issues along the way. David did all his coding in Pascal, pointing out that while an assembly language version would probably be faster, his Pascal utility worked properly and was fast enough for him.

It wasn't, however, fast enough for me. The logical starting place for speeding up word counting would be David's original Pascal code, but I'm much more comfortable with C, [so I created] a loose approximation of David's word count program, translated to C.

Mike proceeds to do what he does best-- optimize the word count program into assembly and explain along the way in an easy going, highly articulate way. His results are as follows:

C conversion 4.6 sec
C + assembly conversion 2.4 sec
C + assembly conversion with lookup table 1.6 sec

He then posted his program as a challenge for readers of PC Techniques-- can this optimized assembly word count program, from an acclaimed industry expert on assembly optimization, be made even faster? Well, I think you can guess what happened next.

So how did the entrants in this particular challenge stack up? More than one claimed a speed-up over my assembly word-counting code of more than three times. On top of the three-times speedup over the original C code that I had already realized, we're almost up to an order of magnitude faster. You are, of course, entitled to your own opinion, but I consider an order of magnitude to be significant.

Truth to tell, I didn't expect a three-times speedup; around two times was what I had in mind. Which just goes to show that any code can be made faster than you'd expect, if you think about it long enough and from many different perspectives.

Like Mike said, there ain't no such thing as the fastest code. If you think there is, you're probably the barrier standing in the way of further performance, not the code itself.

Posted by Jeff Atwood
81 Comments

I'm glad that I'm being educated in a world where such obsessions are not as vital. But still, it's not about the craft of optimizing, it's about understending what happens behinde your average code. Understanding that probably will make your day-to-day code more efficient.

Hoffmann on February 19, 2008 6:10 AM

The geek in me loves this sort of thing, but the reality is that doing the wrong thing quickly is worse that doing the right thing slowly.

Optimizing word-counting programs is fine, but figuring out _why_ the user wants to know the number of words and making sure you are solving the problem (aka filling the need) is more important that coming up with the answer, even the right one.

Cheers

Richard C Haven on February 19, 2008 6:35 AM

I've had this book in my shelf since it was published. It was then intresting reading, and it still is.

Game of Life optimization challenge/examples are great in fact that they show optimization is much more than just fidling few asm cycles off here'n'there. Optimization often is about finding new way how to look your data.

NamelessGerbil on February 19, 2008 6:37 AM

Wow! That takes me back. I remember every time an article or book by Mr. Abrash came out, I would carefully read *every* word. I think I still have my copy of Zen. Good stuff!

Ferruccio on February 19, 2008 6:39 AM

I loved the chapters on the cycle eaters, especially the DRAM refresh. The rest of the book is absolutely win as well.

TraumaPony on February 19, 2008 6:44 AM

Now... who here downloaded that all by hand? If you did, shame on you... :)

Ray on February 19, 2008 7:06 AM

I have the fastest code for ading two numbers 3 5
'print 8'
Haha

Wow!!! I know I just told a great joke.

Niyaz PK on February 19, 2008 7:07 AM

Faster than "print 8" is "x = 8"

(the requirements described adding two numbers, nothing about emitting the result)

or even ""

(strictly the requirements didn't even say the result should be stored, so it can be optimized away)

pedant on February 19, 2008 7:45 AM

What's the saying? Debugging is twice as hard as writing the program in the first place, so if you write a program as cleverly as possible, buy definition you're not smart enough to debug it.

Ian Toltz on February 19, 2008 7:46 AM

Beating the C compiler at writing ASM code has been my main occupation for the last 10 years. The real fun began when I got away from x86 and started working on a "Man's" processor - the ARM: Lots of registers, free barrel shifts, conditional execution...Fertile ground for true geeks to go optimization crazy :)

L.B.

Larry Bank on February 19, 2008 8:17 AM


@no-fun:

"Case in point about C compilers vs hand written assembly. The best C code ran a particular benchmark in 2:30 seconds, while the equivalent assembly code (which took a week to write) runs in 15 seconds. Considering that we've been running the code on thousands of machines for the last 8 years, that week of optimising has proved to be priceless."

That is an astounding job, and completely relevant in your case. The main drawbacks there are that:

1. The code is now much harder to understand instantly
2. Algorithm-level optimizations are less likely to occur

In the case where there are no algorithm-level optimizations left to put in, it makes sense to cement your current algorithm by hand-tooling the code, moving to lower-level languages to facilitate hand-tooling, etc. However, hand-tooling a poorly-fit algorithm might result in a 10x improvement, while getting a better-fit algorithm might result in a 100x improvement.

So, again, hand-optimizing is great when you know the algorithm is solid, the implementation bug-free, and the code you are hand-tooling is a bottleneck in the application (from the user's perspective). Presumably this is the case in your example. If these three are not true, hand-tooling, thunking down to lower-level languages, and otherwise obfuscating your code is a bad move.

For a counter example: a constraints-satisfaction engine I work on saw a 120x performance improvement (22 minutes to 9 seconds) in large part due to moving *from* C to C++. Why? Because the underlying code had been exhaustively point optimized, but the algorithm it used was not optimal (and, frankly, managing the kinds of data structures and collections which was required for the more advanced algorithm would have been a maintenance nightmare in C). Perhaps we should have spent several years implementing the improved algorithm in Assembly (for each platform) instead, and then just decided to never modify it again. But since that particular improvement we have tweaked the algorithm several times, resulting in an additional 50% gain (ie, it is now down to 6 seconds), and know that we can gain another 200% by re-implementing it in Java.

The point of the story is: algorithm optimizations by far trump code optimizations and language optimizations. To the extent that lower-level optimizations make it harder for algorithm optimizations to occur (primarily by obfuscating the code), they are, as Knuth stated, the root of all evil.

At least in my line of work, algorithm advances aren't a matter of coming across something someone just published, but rather determining how a particular published algorithm could be squeezed into our problem domain without destroying its efficiencies; there's little confidence that the answer we have today is even the best answer out there for the state of the art today, much less for tomorrow, and as a result there is a definite likelihood that a new approach will pop forth which blows today's implementation away, tomorrow. It's already happened several times, and so I've learned to keep obfuscations safely tucked away so they don't trip us up when the next algorithm advance shows up.

Tom Dibble on February 20, 2008 1:06 AM

JM: Ah, but "fastest code" and "Iprovably/i fastest code" are not the same beastie.

Definitionally, there must be a "fastest code"*, given a specific task and a specific environment (because otherwise we have the awkward problem of infinitely fast code, if it can always get faster by at least one cycle).

Given the difficulty of Iestablishing/i it, and the much more important fact that we almost never know how close we are, it's not unreasonable to Iact/i like there's no such thing, though.

* Or "equally fastest codes".

Sigivald on February 20, 2008 1:12 AM

From a purely theoretical POV, this statement is quite true as well. There is no general way of proving that a given program is the fastest implementation of an algorithm. And while it's possible to prove this non-generically about individual programs, this is horrendously difficult to do for real computer architectures, and your time is definitely better spent elsewhere.

Of course, all that is assuming the algorithm is what you want in the first place. Real programs are composed of many different algorithms, possibly subdivided, so sometimes the entire algorithm is the problem, not the implementation.

JM on February 20, 2008 1:31 AM

There is no such thing as fastest, it is about whether it maintains its efficiency when n is infinitely big.

Samuel on February 20, 2008 1:36 AM

People .. don't confuse Big O notation with optimization. Big O doesn't tell the whole story; it's meant to be used to compare algorithm, not implementation, when n is sufficiently large. For small n, performance will vary wildly based on implementation (and optimization).

example:
I could write 10 different forms of an O(n) algorithm that, on the same data, take anywhere from 1 second to 1000 seconds to run. The gist is that no matter what optimizations I do, it doesn't matter if the underlying algorithm is still O(n)

I can easily write an O(n) algorithm that performs better than an O(log n), but probably only for small n. Big O notation is the simplification of a mathematical formula to express magnitude to make the overall comparison easier.

My O(n) could really be a x = 12 + n/20
My O(log n) could really be x = 9999 + 2000 log n

@ Joseph - your pipeline example fits squarely into this. The branch failure of binary search is a constant per iteration that is, in it's own way, just an implementation detail. BST's are still superior when n gets large enough. As you indicate, for low n, the linear search was far more effective.

Uh-Oh Big-Oh on February 20, 2008 1:45 AM

To be clear, the philosophy and pitfalls of optimization-- and in particular the human element-- are always the foremost topic in the Abrash books. Read some of it and you'll see what I mean.

http://www.byte.com/abrash/

Jeff Atwood on February 20, 2008 1:58 AM

I had a quick look at Abrash's chapter on optimisation. It seems that he has some great technical skills but lacks business skills.

The top 10 list that was given makes perfect sense if you consider that programmer time is the largest cost component of writing software.

He derides the people who thought about making a business out of software as "...focusing on means, and have forgotten about ends."

In contrast in my whole career there have only been one or two cases where performance needed to be improved.

-Andrew

teambob on February 20, 2008 2:04 AM

To download this to current dir on linux I did:
wget -r -nd -np -l1 -A '*.pdf' a href="http://www.byte.com/abrash/"http://www.byte.com/abrash//a

And no, I can never remember the wget parameters either:
a href="http://www.pixelbeat.org/cmdline.html#wget"http://www.pixelbeat.org/cmdline.html#wget/a

As for optimization, I would say that it's more important
to try and use existing logic (libs etc.) as they will likely already
be optimized, or if not can be optimized independently of your program.
Now, don't assume long implemented logic is well optimized.
Personally I've found lots of low hanging fruit in various
linux utilities and libraries for example. It was easier to
fix the generic logic, than incorporate the faster implementations
into my programs.

Pdraig Brady on February 20, 2008 2:05 AM

I'm always amazed at all the coders who claim that performance is old school, and that optimisation is a worthless, extinct art. A lot of posters above demonstrate this attitude. Meanwhile, the company I work for sells crap old embedded hardware with software which looks and runs amazingly fast compared to the competition on newer rigs. When I view their recruitment pages, it all makes sense. Needless to say, my employer makes a killing on margins, for a modest extra they pay for their engineers.

Case in point about C compilers vs hand written assembly. The best C code ran a particular benchmark in 2:30 seconds, while the equivalent assembly code (which took a week to write) runs in 15 seconds. Considering that we've been running the code on thousands of machines for the last 8 years, that week of optimising has proved to be priceless. That is a 10 fold speed increase in one particular benchmark alone. And the code doesn't live in isolation, I'm sure that the rest of the untouched code is faster since this routine consumes less memory, hence it consumes less CPU cache, hence it leaves more room in L1 cache for other code.

Win-win for everyone. And the more people which claim optimisation is worthless, well, that just means that I can charge more since my particular expertise is almost impossible to find. And I'll take the big $$$, thank you very much. Yeah, I agree with y'all, optimisation is crap. Dont bother learning it.

no-fun on February 20, 2008 2:11 AM

I tell ya, a lot of modern big firm game programmers should read this book;)

VG on February 20, 2008 2:13 AM

Back around 1998 when I was at Metrowerks as their junior engineer on the x86 C/C++ compiler for CodeWarrior, I read everything I could from Abrash, and used a lot of that knowledge to tweak the peephole optimizer in the compiler. I really appreciated it, and for a little while, the CodeWarrior tools did a little better than Microsoft's VC++ on quite a few benchmarks.

Ben Combee on February 20, 2008 2:23 AM

Talking about caching, Herb Sutter made a great presentation to the Northwest C++ Users Group about how the hardware of a modern PC works, the differentials in bandwidth and latency between different components, the efforts hardware makers have gone to in order to hide that latency, and how that can affect your program.

The presentation was made available on Google Video and is linked from the meeting page at http://nwcpp.org/Meetings/2007/09.html. You should probably read the slides along with the video because they're not that clear - a PDF is downloadable from the same page.

Micro-optimising your code by rewriting in assembly may well not do any good if you're memory or I/O bound. Find out if you are first.

Mike Dimmick on February 20, 2008 2:27 AM

I think there's a big point you've missed in your post. You're right that there's no such thing as "the fastest code". But there definitely is such a thing as "fast ENOUGH code".

The geek in me loves optimising code and making it faster. Then faster again. Then a little bit faster still. But at some point, making the code faster starts to trade off against time and readability.

Let's take as a simple example a solution to that prime numbers problem posted above :

int prime( int n )
{
for( int x = 2; x n; x++ ) {
if ( ! ( n % x ) ) {
return 0;
}
}
return 1;
}

int main()
{
for( int n = 2; n 100; n++ ) {
if ( prime( n ) ) printf( "%d\n", n );
}
}

Well that runs pretty fast. It does primes from 2 to 100 in no time at all. It scales to 10,000 without too much hassle, and does primes to 100,000 in about seven seconds on my PC. But it's by _no means_ the fastest code.

For starters, any number apart from 2 that's got bit 0 reset can't be prime (since it's even), so we could test for that. We could cache each prime number in a lookup file so when the program's run again, it remembers earlier results and produces faster output. This could be preloaded with some values first. We could unravel the main loop, saving some time putting things on the stack. And what's with that time consuming printf - why not write the output to a binary file. We could rewrite this in assembler and save even more time.

But you can't deny all of these things will take longer to implement. The lookup improvement - that's full of gotchas (What happens if the file's not there? What happens if it gets corrupted or deleted? Does the routine need to be thread safe? Has the running user got permission to see it) that might need to be handled.

Is it worth it? Maybe, if you're going to be calculating prime numbers to a million on a regular basis. But it'll take you longer and you won't get more readable code out of it.

Ritchie Swann on February 20, 2008 2:29 AM

I'd say there is a time and a place for optimization.

If you're doing programming C for a piece of hardware with minimal resources, it's REALLY important.

If you're trying to query a data source with a few million rows and join it with multiple tables with millions of rows, it's REALLY important.

If you're making a web site to display someone's timesheet for a company of 500 people, it's not so important. In that case the limited resource is the cost of the employee doing the development, and the faster the development the better.

It's a judgement call that has made some people rich and others poor over the years. MS Excel killed Lotus 123 with 10 programmers using C++ and concentrating on a GUI instead of Lotus with 100 programmers optimizing assembly code.

Dave on February 20, 2008 2:46 AM

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." (Knuth, Donald. Structured Programming with go to Statements, ACM Journal Computing Surveys, Vol 6, No. 4, Dec. 1974. p.268.)

CS on February 20, 2008 3:25 AM

if you reading this and then thinking it dont apply to me i program in .net think again. know what you do and how it goes to intermediate language (IL) is a big thing to learn first for speeding up, then learn where your objects are created aka unboxing and boxing then look at the other algorithim optimisations... then start to look at database layer speed... then network operations aka webservice calls asp architecture... just in case they are watching...

ohdear on February 20, 2008 3:44 AM

To no-fun:

Optimizing code using low level assembly is not as vital in most development areas, I even remember reading an article title as: "Is C the new assembly?", I do not remember where I read it but it was basically saying that everbody who programs in some high level languages (C#, Java...) when they require high performance they turned to C, much like when everbody programmed in C usted to turn to assembly for performance. So the point is, who still programs in C still might need assembly.

Hoffmann on February 20, 2008 4:10 AM

This guy needs to update his material. 20 years ago I worked with programmers who had more of a challenge trying to fit their operating system into 256K RAM. THAT was the real challenge. The days of 2-4G RAM have nurtured a generation of lazy, sloppy programmers writing inefficient, bloated code. Vista is a classic example. We should be getting better at this, not worse.

PaulG. on February 20, 2008 4:39 AM

This was one of my favorite books when I was about 19. It inspired me to learn Pentium UV pipeline optimisation. I d-loaded all the optimisation docs from intel, and spent a couple of days writing one of those old school flame effects. I was able to time the number of cycles it took to calculate each pixel, and it matched what I expected from my code. I was very pleased with myself when I managed to get it down to something like 5 cycles per pixel. I *knew* I had the fastest code.

Just to feel even more smug, I decided to write it in C to see how much faster my code was than some compiled code.

The C was faster. I never wrote any intel assembler again after that.

Rocketmagnet on February 20, 2008 4:43 AM

This is great. I recall voraciously reading Michael Abrash's work and used to snuggle up in the corner to read about 3D B-trees to simulate the world of Doom.

reno on February 20, 2008 4:48 AM

Well, many years ago, I had to write a real-time system. I was interested in the benefits of coding it in PL/M, but concerned about performance.

So, I write some code in PL/M and looked at the assembly it generated - just about identical to what I would have done!

I haven't used assembly language since - but 20 years further on, I still believe that having at least a basic understanding of how things work down at that level is key to being a good developer.

And in fact, once you'd been round the loop and coded a few basic libraries, assembly development wasn't actually as slow as its reputation would lead you to believe.

Similarly, COBOL, which is well out of fashion these days, is probably still the best batch-processing language there is. I worked on a C++ job a few years back, whick took about 4 months. I could have written that in COBOL in two weeks!

Bottom line, don't let fashion get in the way of using the right tool for the job - you can write crap in any language!

Bruce on February 20, 2008 5:10 AM

Unfortunately, it doesn't matter if you optimise your code nowadays - you're gaining next to nothing for all your efforts.

I mean, I had a discussion with a chap about singleton performance in C#, he wrote 2 little test apps that showed the differences between the 2 techniques we discussed and I ran it - one took 6 seconds on my laptop, the other 1.5 seconds. This showed a difference but not a big one considering the iterations required to show a count in seconds. However - I then ran it on the big superfast dual-proc server.. and both apps took 20 seconds to run. Stupid C# decided it knew best (even when I ran it with CPU affinity) and "un-optimised" it for the worst case on 2 cores, even though it had been designed for multi-threading anyway!

Another case in point, a colleague decided he could optimise a module in our server app to make everything go lots faster. He spent ages doing it, and it speeded up his tests on his PC. Then we ran it overnight on the test environment... and the whole application ran slower than before. The cause was threading issues - easy to make it go fast on 1 thread, but on a n-tier server farm the locks and memory usage make for a different story.

So - optimise by all means, but make sure you're optimising the right bits, in the right way, or its just a waste of time.

AndyB on February 20, 2008 5:22 AM

Great post. I enjoy most of your posts but I particularly like how this one goes back to your original "coding horror" theme.

dinah on February 20, 2008 5:31 AM

I had already read a few pages of The Graphics Programming Black Book (and enjoyed it very much) before I read the following paragraph;

"But you know what you're going to do? You're going to click through anyway and read some of it."

kevin on February 20, 2008 5:45 AM

I bought this book when I was at a bookstore around 4 years ago when I was in college. I lent it out to a friend and got it back the next year. This book is amazing.
I almost feel sad that a lot of the work today is done on higher level languages like .NET or Java (I work in a .NET place). This book speaks to the inner geek that got me first interested in programming. This book has given me lots of insight and a huge dose of humility when it comes to programming.
For people who think something like this isn't relevant, let me reintroduce you to the iPhone and its SDK that is coming out soon. People are going to want to create some amazing apps on limited hardware. It'll take performance minded people in order to pull off sophisticated applications on the limited devices we're going to see in the future.

Min on February 20, 2008 6:06 AM

Wow! I wish I had known about this book about a year ago when I was coding an implementation of the RTCP protocol in assembler. Very limited CPU and memory but high demands for speed.

Great tip, thank you! -Downloading the pdf as I'm writing this.

tobias on February 20, 2008 6:08 AM

Ha, that book has been sitting on my bookshelf ever since I found it in a clearance bin sometime early this century. You've inspired me to open it!

Owen Byrne on February 20, 2008 6:16 AM

As an example about prime speed ... just a faster prime :

#include stdio.h
#include math.h

int is_prime(int c) {
if(c % 2 == 0 || c % 3 == 0) {
return 0;
}
int i = 5;
int limit = (int) sqrt((float)c) + 1;
while(i = limit) {
if(c % i == 0) {
return 0;
}
i += 2;
}
return 1;
}

int main() {
int i;
for(i=0;i 10000000;i++) {
if(is_prime(i)) {
printf("Is prime %d \n", i);
}
}
return 0;
}

// compile with -lm
// it checks every integer (by intention, just to show that it works ....

#shell# time ./fastprime /dev/null

real 0m10.022s
user 0m9.985s
sys 0m0.000s


this is pretty fast and yet readable, just pure C and a bit of math logic

always remember that code heavyness must be in balance with the task and the goal, this code here is simple and yet finds all primes up to 10 million in 10 seconds.

php_amateur on February 20, 2008 6:25 AM

wc is all I need to count words!

Joe Beam on February 20, 2008 6:37 AM

I am so glad there are those of you that love this stuff. I love to write programs, but once they are written I'm done with them. I guess it's the ADHD, but I don't care to look back on what I did to see if I can make it better/faster/more efficient. When I finish, I am just glad that the project is over so I can start something new.

But that is why I'm thankful I have programmers that work for me to take my ideas and prototypes and improve the code.

I have a lot of turnover on my team (we are a place where new grads cut their teeth before moving on) so I am more concerned with readable/understandable code than optimization. It's a sacrifice I am willing to make since our average programmer only stays about 18-24 months. I need them to open the apps and begin working on day 1... not after 2 months of figuring out what previous programmers thought was a great optimization trick.

Wayne on February 20, 2008 6:42 AM

I'd like to echo Richard C Haven's point in the second post there. And this is an excellent example in which to do it.

See, the most common reason to want a word count (as far as I know) is for submitting some written work to an editor who wants an 800-word article or a 120,000-word novel. And, if you're doing that, what you really want is the same count that they would get if they counted the words.

Which is, as one can probably realize from a few moments of thinking about how this might have been done before computers, not an exact count of contiguous sequences of letters and apostrophes.

Nope. If you submit a book manuscript to a science-fiction editor (a genre in which I happen to know a good bit about how this works, though I think it's probably the same in most other genres), the word count that they expect is the one that would be computed by typing out the manuscript double-spaced on a standard 10-character-per-inch typewriter on paper with standard margins, counting the pages, and multiplying by a factor that assumes about six characters per word.

(And there are good reasons for wanting that, rather than the "true" word count, too. Spaces don't cost money to produce, but pages do, and they want a number that they can multiply by other factors to get about how many pages of standard hardcover or mass-market paperback this manuscript will take to print.)

So, yeah. I suspect I could write a sufficiently accurate approximation of that algorithm (for a flat text file) in about four lines of C and have it be a lot faster than the highly-optimized assembly code mentioned here. :)

Brooks Moses on February 20, 2008 6:43 AM

I had that book a number of years ago. I hope the Canadian I sold it to on eBay got his money's worth. I believe I sold it for $50US, which was, IIRC, about how much I paid for it new.

that book was a MONSTER too, like NYC phonebook big.

it's a good book.

lemur on February 20, 2008 6:52 AM

I’m a mainframe programmer. Performance on a single transaction or task can be small, but if you have millions of them, they add up.

Here’s an example. We had a CICS region bogging down. At first we thought it was a transaction that was taking longer than a second. Come to find out it was a sub-second transaction performed thousands of times a day. It was unnecessarily loading a DB2 table just to see if the customer had a particular product. We took that out, and saved ourselves hundreds of thousands of dollars a year on MIPS cost.

By all means, the program should work as required. But, after that, you can save your company millions if you perform program performance tuning.

While I’m at it. I write in both assembler and COBOL. I have found that my assembler knowledge adds to my understanding of how the computer actually works. When we get further and further away from that understanding, I feel we risk designing software code that not just runs slow but may have to be re-written.

Norm Walker on February 20, 2008 6:59 AM

Man - I guess I am a new-school junior programmer who did not get his start writing x86 assembler, but these comments are boggling the mind. I expected everybody to recoil at the idea of all this talk of optimization.

I guess bit-twiddling at this level doesn't interest me as much as the process of creation. I can tell that this is a quality book, but I'm scanning it for the more general pearls of wisdom. My eyes glaze over when I hit the program listings and the discussion of the microcode implementations of x86 instructions.

Esd on February 20, 2008 7:03 AM

To clarify: It's important to understand performance, and algorithmic complexity, and behavior of hardware and OSes if you're going to write quality code, but the subject matter of the examples, and the minute level of optimization being striven for, is killing me here. =)

Esd on February 20, 2008 7:08 AM

I have that book. It's one of the most enjoyable books I've ever read.

David on February 20, 2008 7:57 AM

It's a shame that the book seems out of print; I'd like a copy to look at for fun but $85 is a bit steep. Maybe someone will convert it as well?

Sitten Spynne on February 20, 2008 8:16 AM

Wow, I used to rip Basic code from Byte magazine back in the 80s (on Apple //c BTW). Nice to see they're still around.

I agree code optimization is important. Even in high level languages, increasing concurrency and knowing your code isn't the bottleneck is important.

CptBongue on February 20, 2008 8:20 AM

I have most of his articles printed out in a binder on my shelf. It's one of the few collections of technical material that has survived the yearly gleaning process...

The complexity of our development stacks now have technical writers covering primarily the intricacies of a particular technology. Abrash is a great writer who excels at exposing the gestalt of a problem rather than getting tripped up in the technology.

It will be nice to have a digital copy of some of the articles. Thanks for the link!

Timothy Lee Russell on February 20, 2008 8:46 AM

The Abrash book is on a topic that's definitely far and beyond what the vast majority of the software developers, dealing with dialog boxes and web dev, could dream of tackling.

A definite classic.

Chris on February 20, 2008 9:39 AM

In 2002 I've took part in a similar competion about generating all primes up to 1_000_000_000. There were tons of entries all using the Sieve of Erathostenes. The second best contestant had spent God knows how many hours for an optimized assembly version. It rocked all other programs by a wide margin.

I however spent just 5 minutes searching on google to get the best algorithm out there ... found one by Atkin Bernstein (http://citeseer.ist.psu.edu/452169.html). Ported it to windows (silly requirement they had) and submitted the program with references to the paper and Bernstein's sample implementation (http://cr.yp.to/primegen.html).

Naturally it was the fastest submission. The morale of the story is that you should search for a better algorithm first before getting into low level machine optimizations. Unfortunately most programmers prefer the machine level optimizations as they do not have the inclination to improve their theoretical background. Sad.

Dee Zsombor on February 20, 2008 9:41 AM

Everything should be done in assembly language.

Rob on February 20, 2008 10:01 AM

Ah, drool... this is one of the few areas of software work i find fascinating. Assembly, fast graphics. I started on the 8-bit 6800, 6502 and such, LEDs and hex keypads. Abrash is a hero.

I love working at that low level. But it is so hard to find work - one can't just breeze into any ol' town and find good well-paying steady assembly-level work there.

And yes, as others have explained, understanding how the computer works at that level does help with writing efficient code and with troubleshooting, even if not in a directly applicable way.

Daren W on February 20, 2008 10:02 AM

another example of how knowing the low-level software and hardware helps - where i work, we process a lot of images. These need to be calibrated. One of our key calibration programs has a 4-deep nested loop to synthesize some image-sized calibration data. Runs slow, of course. I rearranged it taking into account caching, row/column access etc. and it ran about twice as fast, at least on a good day. Now users can't so easily get up for a fresh cup of coffee while calibrating images. I guess that makes me an anti-hero ;-)

Daren W on February 20, 2008 10:11 AM

Understanding of the underlying machine language is useful even for high level programmers. It doesn't mean necessarily writing code in assembly language, but avoiding pitfalls of the architecture. An excellent example is developing code for Windows Mobile devices. The majority of WM devices have CPUs which use the ARMv5 instruction set which does not include a DIVIDE instruction. When a divide operation is compiled into the code, it gets turned into a software algorithm for divide which takes 50 times as long as a multiply.

Larry Bank on February 20, 2008 10:23 AM

Great book references.

People (coders) are limited by a couple of things:

1) you cannot code an algorithm you don't know (yet)
2) you cannot employ the power of a language technique you don't know (yet)

And, picking the right technique for the right problem !

Steve on February 20, 2008 10:53 AM

I agree with the sentiment but to be pedantic, there is such a thing as the fastest algorithm for solving a problem. Binary search trees are mathematically proven to be the fastest way of finding an element by a comparison of keys. But in most practical situations you can tweak the problem so that other solutions are possible (who says you should be limited to only using comparisons of keys? -- sometimes you can just use a hash table).

For some algorithms a lower bound has been proven ("the fastest algorithm can be no better than this: ...") but the best known algorithm is slower. Last I looked matrix multiplication was in this category.

Joseph Garvin on February 20, 2008 10:54 AM

"know what you do and how it goes to intermediate language (IL) is a big thing to learn first for speeding up"

Optimising to the IL level is largely a pointless unless you know about the JITer, which most people don't. I've seen far too many discussions about which technique is better come down to "but it comes out to 4 less IL instructions", only for that argument to be trashed in proper tests because of the JITer optimisations. Even then you can spend ages pouring over the assembly produced by the JITer, optimising it, and it will be completely different on someone else's system. Don't get me wrong, knowing how your code gets translated into IL is incredibly useful, but people rely on it too much to promote various ultimately futile micro-optimisations.

[ICR] on February 20, 2008 11:18 AM

Joseph Garvin:

Not quite true; the math involves a theoretical machine. And these days the real hardware only simulates the theoretical machine. In some cases, you can run your application on two recent processors from the same vendor and get radically different results.

Case in point: CS majors learn to mathematically prove that a binary search on an array is faster than a linear search. That assumes that all instructions take equal and constant, time. But on a pipelined processor, the cost of branching depends on whether or not a branch succeeds. If the branch prediction works, branching takes 0 time, if not, the processor needs to cancel every command in the pipeline and start over.

For a binary search, each branch is taken half the time, so a branch predictor has no pattern to work work on. So half the time it's wrong. Which means that for small arrays, binary search is much slower than linear search. I remember reading in Dr Dobbs that "small" on one particular Pentium processor was about 36 items.

To address your example, BSTs are the fastest, assuming that memory access and branching are free. Most modern computers use heavy pipelining and caching, so neither assumption is safe. In fact, an n-way search tree with linear search at each branch point is likely to be faster.

David Leppik on February 20, 2008 11:41 AM

Nothing beats really knowing the implications of the problem for making an optimized routine.

And, like David L said above, it bears testing on a "real" processor with real data before declaring it faster. I can't tell you how many times I have written "fast" functions that failed to run quickly because parallelism was faulty or memory was fragmented or other real world pot holes.

After all, the fastest way to get 10,000 prime numbers would be to fork() out 10,000 checkers and collect the results as they came in. But it would never work unless you had 10,001 processors and zero fork time. Sounds fast, but isn't.

Chris Chubb on February 20, 2008 12:15 PM

Way back, in the early 80's, I was attending a programming class in school. I was about 15 at the time. We were using Commodore Vic 20. Our teacher put us up with the task of writing a program (in Basic) that would print out the prime numbers from 1 to 100 _as fast as possible_.

I painstakingly came up with a naive version of Sieve of Erathostenes, as did most of the others. Of all those versions, mine was by far the fastest. One guy, though, came up with a program that were, literally, an order of a magnitude faster than that, and everybody was stunned.

That is, until we discovered his secret: He had precomputed the numbers, and was simply printing them out, one after the other.

I was outraged.

Morale: caching!

Jesper Vingborg on February 20, 2008 12:16 PM

This book is excellent, although very little applies today. back in the early days as a games programmer this was invaluable, (our games were 100% assembly) so much so, that the optimisation stuff wasn't really the fun part anymore - we used to optimise for size instead ;)
we ran small competitions in the company for 256 byte games - in the days when windows was still a twinkle in bill eye. I managed to dig out some of the old 256 byte relics I did here

http://runtothehills.org/rob/archives/71

this was also a post looking at your 600 lines is a luxury post ;)
Cheers,
Rob

Rob Hill on February 21, 2008 2:39 AM

One hat doesnt fits all :

Fast is a context tradeoff.

One god (J Carmack) wrote :

float Q_rsqrt( float number ){
long i;
float x2, y;
const float threehalfs = 1.5F;

x2 = number * 0.5F;
y = number;
i = * ( long * ) y; // evil floating point bit level hacking
i = 0x5f3759df - ( i 1 ); // what the fuck?
y = * ( float * ) i;
y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed

#ifndef Q3_VM
#ifdef __linux__
assert( !isnan(y) ); // bk010122 - FPE?
#endif
#endif
return y;
}

This is a fast square root.

But if I do have a CPU designed for math calculation, sqrt(float f) might be faster.

And, if I do make high precision calculus, I might want numerical recipies squareroot implementation

And if my web server is already congested, it may be faster to let database do the square root as a result of the sql request, given that IEEE floating point algorithm are often better implemented on database than in standard C library.

So my point is faster is a context issue not a coding issue, and as Steve Mc Connell pointed, aiming for code readability often gives good result in code maintanability, and performance.

I guess that is why we developpers that focus on code are the problem, because coding is not about programming, but understanding.

jul on February 21, 2008 4:34 AM

Speed will always matter, the thing is what took 10ms in a year old hardware today will take 1ms. So let's say that 30 ms is good enough speed for this imaginary task, why bother optimizing more? It will only get faster with more modern hardware. Well, this is from a user point of view, from the server/mainframe point of view, better optimized code translates directly in money saved (power, hardware, refrigeration etc).

Hoffmann on February 21, 2008 4:54 AM

This was a post of personal interest to me. I used to belong to Bix all those years ago and I remember when the challenge went out.

There were a lot of historical conversations shaping the future of PC programming on Bix that are pretty much lost for ever.

Rich Shealer

Rich Shealer on February 21, 2008 5:14 AM

Somebody fetch Scott Hanselman.

Mark Wisecarver on February 21, 2008 5:57 AM

Dan,

Well, as a nit-picker I guess I am going to have to say that there IS such a thing as fastest code. On a single-core processor, it is code that requires only 1 machine instruction. You can't improve on it (other than speeding up the clock).

Other than that... the guy is just flat-out right. Arogance is the bane of productivity and efficiency.

-Doug

Doug on February 21, 2008 7:50 AM

[joke]
...it's the compiler, not the programming language, you stupid, suit, pointy-haired, minion !!!
[/joke]

mramirez on February 21, 2008 8:13 AM

When Abrash states "From the user#8217;s perspective, performance is fundamental", I tend to disagree. IMO from the users perspective the correctness of the application is even more important - it's no use if the word count functionality is the fastest available if it's off by 5% in some cases.
Or: I would rather have to wait some seconds or minutes for the byte.com page to load than looking at the "java.lang.Error: *** Cannot replace DB pool!" message that instead comes up...

oliver on February 21, 2008 9:30 AM

The thing about "Zen" is that it could've been half the size if Abrash hadn't repeated himself so much. It seemed that every other page he was adjuring us to measure performance (instead of just guessing). Valuable the first time, but after a while, it felt a little patronizing.

That didn't stop me from having a great conversation with him at a conference, which almost lead to us (Sierra On-Line) luring him out to join us. Complete lack of interest by upper management scotched that idea, but I always regret not having a chance to work with a truly gifted person.

///ark

Mark Wilden on February 21, 2008 9:41 AM

When Abrash states "From the user’s perspective, performance is fundamental", I tend to disagree

Really? Because I distinctly remember switching to Google because it was good and *crazy fast* compared to AltaVista.

Speed still matters.
http://www.codinghorror.com/blog/archives/000722.html

Jeff Atwood on February 21, 2008 11:29 AM

i had this book too and it is classic. this introduced me to pure assembly computing when i'm still a student.

nice one. after remininse state induced by this post, an idea came up why i should not try coding in bytecodes or msil?

sham on February 21, 2008 11:54 AM

there are still developers out there that take assembly seriously. Take the Naughty Dog job listing for instance:

Assembler Programmer
Assembly programming alas is all too often considered a dying art form; however, this is definitely not the case at Naughty Dog. We take assembly programming VERY seriously and use assembly extensively in our games. We're looking for someone who really enjoys getting down to the metal and writing highly optimized assembler code. This person should have a very solid grasp on caching issues, processor pipelining, and latencies. Strong 3D math skills are a big plus, and good fundamental 3D math skills are required. Past experience writing 3D renderers is a big plus. We're not looking for the occasional down coder, we're looking for someone passionate about assembly, and only people with extensive past assembly experience will be considered.

matt g on February 23, 2008 2:14 AM

What happened to the site? I find it hard to believe that Byte has been down for days!

Loren Pechtel on February 23, 2008 7:24 AM

If you're convinced that optimization is useless today, have your imagination refreshed.

David Smith on February 24, 2008 3:42 AM

jeff - your title, does it refer to Steel Magnolias? If so, wow, I am even MORE impressed by you!

"There ain't no such thing as natural beauty"

ns on February 25, 2008 10:30 AM

OK "mere Visual Basic programmer" (which, of course, you no longer are)
I never even got beyond Basic having done my original training in machine code on an Elliot 405 valve machine in the early 60s. I don't need to do any programming these days BUT I still find the logic of REAL programming fascinating and will proceed to try to find this book in a library.

Hazel on February 26, 2008 4:21 AM

I was given a handout on Bumper-Sticker Computer Science from Jon Bentley many years ago that has several rules related to this subject, but two stories in the article have always stood out to me.

In the 1960's Vic Vyssotsky spent a week modifying a Fortran compiler making a correct routine very fast. The bug he introduced did not surface for two years because it had never been called in over 100,000 compilations. His premature optimization was worse than wasted, it made a good program bad.

Bill Wulf of Tartan Laboratories took only a brief conversation
to convince me that “if a program doesn’t work, it doesn’t matter how fast it runs” isn’t quite as true as I once thought. He used the case of a document production system that we both used. Although it was faster than its predecessor, at times it seemed excruciatingly slow: it took several hours to compile a book. Wulfs clincher went like this:
“That program, like any other large system, today has 10 known, but minor, bugs. Next month, it will have 10 different known bugs. If you could choose between removing the 10 current bugs or making
the program run 10 times faster, which would you pick?”

Wade on February 26, 2008 6:53 AM

haha

gaga on February 26, 2008 10:28 AM

@Rob on February 20, 2008 10:01 AM

"Everything should be done in assembly language."

In a perfect world where developers have all the time in the world to get the job done, I suppose you're right... But when your boss asks you to make an MS Access database to do something relatively simple for a handful of users, and asks you to get it done by the end of the week, is assembly really the best approach? More often than not, I think most people will opt for slapping it together quickly with relatively slow VB scripts. I know I do!

I think sometimes there's a valid play-off between code optimisation, and coding time optimisation.

(Ok, fair enough, I'm not a coder by trade, just an average techie who periodically gets called upon to perform minor coding tasks, but I think my point is still valid).

RWW on March 9, 2008 1:30 PM

If your work is some sort of business app or web app or something, sure, it probably doesn't need to be optimized with today's compilers/ hardware.

However, I work on solving electromagnetics simulation problems with hundreds of thousands (or millions) of unknowns, problems that take weeks to solve on even a large computer cluster.

Anyone that wants to work in an are where optimization is an absolute must should pick up a few books on advanced EM and algorithms and come to the dark side of computational electromagnetics :).

chris on January 7, 2009 2:21 AM

The comments to this entry are closed.