Coding Horror
programming and human factors
by Jeff Atwood

March 08, 2005

On Managed Code Performance

My personal turning point on the importance of managed code was in September 2001, when the NIMDA worm absolutely crushed our organization. It felt like a natural disaster without the "natural" part-- the first notable port 80 IIS buffer overrun exploit. We got literally zero work done that day, and the next day wasn't much better. After surveying the carnage firsthand, I immediately saw the benefit of languages where buffer overruns weren't even possible.

Managed code, of course, isn't free. All that bit-twiddling was there for a reason-- to squeeze every last iota of performance out of your 386 and 486. Trading some of that performance for security makes more sense in the era of 1 GHz Pentium chips-- but how much performance are we really giving up? One of the more interesting examples of managed code performance is Vertigo Software's port of Quake II to .NET:

How is the performance of the managed version of Quake II? Initially, the managed version was faster than the native version when the default processor optimization setting /G5 (Pentium) was used. Changing the optimization setting to /G7 (Pentium 4 and Above) created a native version that runs around 15% faster than the managed version. Note that assembly code was disabled for the native and managed versions, so both versions are slower than the original version of Quake 2.

David Notario, who works in Microsoft's CLR JIT compiler group, with a little demo scene coding on the side, posted this interesting message with more detail on the performance of Managed Quake II:

  • This version doesn't use any 3D hardware acceleration at all, which is good. It's interesting to see the performance of the .NET platform isolated from the performance of the graphics card. In apps/demos/games that use 3D acceleration, expect the difference between managed and unmanaged code to be even smaller, as the bottleneck of rendering is the 3D card, not the CPU.
  • With this benchmark, you are measuring the quality of the codegen. The managed version is just a recompile of the unmanaged version with the /clr option (which targets IL instead of x86). It's not taking into account GCs that happen in an app that does managed allocations, it's a pure JIT benchmark. This also means that it doesn't show some problems you may have doing realtime graphics with managed code if you're not careful, such as dropping frames due to periodic GCs.
  • On my P4, the managed Q2 timedemo runs at 63.2 fps, and the native Q2 timedemo runs at 72.8 fps, which means the managed code is performing at 85.6% the speed of native C++ code with VS.2003.
  • The original Q2 [and Quake 1] had optimized x86 assembly rasterizers. These were among the fastest of their time, and they used cunning tricks such as explicitly parallelizing x86 and x87 instructions to achieve maximum speed. For example, the division for perspective correction for the next 8-pixel span was performed in parallel with the actual rendering of the current 8-pixel span, so perspective correction was almost 'free'. The C rasterizers this version uses don't have this property. To compare apples to apples, Vertigo Software compiled their native version with the C rasterizers -- i.e., both versions are slower than the original Q2 demo shipped by Id Software. Just for kicks, I compared the managed version with the original assembly-optimized version. The original version gave me 92.5 fps, which means our codegen is generating code with about 70% of the performance of the original hand-optimized assembly. I personally think this is great-- especially considering that our codegen has quite a bit of room to improve.
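
As a quick sanity check on those percentages, here is the arithmetic as a few lines of Python. The three frame rates are the ones quoted above; the function and variable names are made up for this sketch. Dividing the quoted frame rates directly gives roughly 86.8% for managed versus native, a little higher than the 85.6% quoted, so one of the figures may be slightly off. Either way the gap is in the 13 to 15 percent range.

```python
# Frame rates quoted in the list above (timedemo results on one P4 machine).
MANAGED_FPS = 63.2       # managed Quake II, C rasterizers
NATIVE_FPS = 72.8        # native Quake II, C rasterizers
ORIGINAL_ASM_FPS = 92.5  # original Quake II with assembly rasterizers


def relative_speed(fps, baseline_fps):
    """Return fps as a percentage of baseline_fps."""
    return 100.0 * fps / baseline_fps


managed_vs_native = relative_speed(MANAGED_FPS, NATIVE_FPS)
managed_vs_asm = relative_speed(MANAGED_FPS, ORIGINAL_ASM_FPS)

print(f"managed vs native C:     {managed_vs_native:.1f}%")  # 86.8%
print(f"managed vs original asm: {managed_vs_asm:.1f}%")     # 68.3%
```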

I guess we'll see how much codegen has improved in .NET 2.0-- from what I hear, performance improvements aren't a big priority-- but I'll gladly trade 15 percent of performance to live in a world where NIMDA can't exist. That's a no-brainer.

In his woefully out of date blog, David mentions that one of his coding heroes is Mike Abrash. All this talk of Quake and performance reminded me of Mike, too. He worked at Microsoft on the graphics subsystem in NT 3.1, and wrote a number of very influential early assembly and graphics programming books. He also worked on the all-assembly graphics architecture of Quake 1, aka "the last great software rasterizer."

Mike's not only a true programming God, but an amazing, humble and approachable writer. I remember randomly browsing through his 1994 Graphics Programming Black Book as a beginning Visual Basic programmer and being totally engrossed in it, even though it was technically far* above my level. He's that great of a writer. For a taste, there's a little snippet of a 2001 article he wrote for Gamasutra in this archived news post. Or, you can relive my amazement as you browse through a complete online version of the Graphics Programming Black Book. The techniques may be obsolete, but the problem solving he describes so compellingly is truly timeless. Very, very highly recommended.

I wonder what Michael Abrash is up to these days.

* really, really, REALLY far above my level.

Posted by Jeff Atwood

 


 

Comments

Mike worked at MS on Xbox up until some time in 2001, it appears.

Well, here's one thing he has worked on somewhat recently-- RAD Game Tools Pixomatic software renderer, circa 2002, last updated 1-2005 (!)

http://www.radgametools.com/pixomain.htm

And yes, UT 2004 *DOES* use the Pixomatic renderer if you switch to software rendering. Be sure to turn the resolution way, way down before doing this, or you'll be sorry... like I was ;)

Jeff Atwood on March 9, 2005 12:52 AM

Ok, so yeah, Abrash is all about Pixomatic (fast x86 software 3D rendering) into late 2004. You have to sign up for a free account, but his 3 part DDJ series on Pixomatic is really interesting reading:


http://www.google.com/search?hl=en&q=%22Optimizing+Pixomatic+for+x86+Processors%22
---
In this three-part article, I discuss the process of optimizing Pixomatic, an x86 3D software rasterizer for Windows and Linux written by Mike Sartain and myself for RAD Game Tools (http://www.radgametools.com/). Pixomatic was perhaps the greatest performance challenge I've ever encountered, certainly right up there with Quake. When we started on Pixomatic, we weren't even sure we'd be able to get DirectX 6 (DX6) features and performance, the minimum for a viable rasterizer. (DirectX is a set of low-level Windows multimedia APIs that provide access to graphics and audio cards.) I'm pleased to report that we succeeded. On a 3-GHz Pentium 4, Pixomatic can run Unreal Tournament 2004 at 640×480, with bilinear filtering enabled. On slower processors, performance is of course lower, but by rendering at 320×240 and stretching up to 640×480, then drawing the heads-up display (HUD) at full resolution, Unreal Tournament 2004 runs adequately well, even on a 733-MHz Pentium III.
---
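
The render-small-then-stretch trick in that excerpt is easy to picture with a toy example. This is only a rough sketch: the excerpt doesn't say how Pixomatic does the stretch, so plain pixel replication (nearest neighbor) is an assumption here, and `stretch_2x` and the fake frame contents are made up for illustration. The point is the pixel count: the rasterizer only has to produce a quarter of the pixels that end up on screen.

```python
def stretch_2x(frame):
    """Upscale a frame (a list of rows of pixel values) 2x by pixel replication."""
    out = []
    for row in frame:
        wide = [p for p in row for _ in range(2)]  # double each pixel horizontally
        out.append(wide)
        out.append(list(wide))                     # double each row vertically
    return out


LOW_W, LOW_H = 320, 240

# Stand-in for an expensive software-rendered frame (hypothetical contents).
low_res = [[(x + y) % 256 for x in range(LOW_W)] for y in range(LOW_H)]
full = stretch_2x(low_res)

pixels_rendered = LOW_W * LOW_H
pixels_displayed = len(full[0]) * len(full)

print(len(full[0]), len(full))               # 640 480
print(pixels_displayed // pixels_rendered)   # 4
```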

The difference between today's low-level Pentium 4 optimizations and the older optimization techniques he used on ye olde Pentium 1 is.. uh, profound. Sort of a case study in what's possible, even if it doesn't ultimately make much sense IMO. It is amusing to try the software renderer in UT2004, though.. download the free UT2004 demo and give it a shot! ;)

Jeff Atwood on March 9, 2005 01:05 AM

From Part II:

--
I mention this in the context of the bilinear filter because that was where that lesson was driven home. You see, I came up with a way to remove a multiply from the filter code—and the filter got slower. Given that multiplication is slower than other MMX instructions, especially in a long dependency chain such as the bilinear filter, and that I had flat-out reduced the instruction count by one multiply, I was completely baffled. In desperation, I contacted Dean Macri at Intel, and he ran processor-level traces on Intel's simulator and sent them to me.

I can't show you those traces, which contain NDA information, but I wish I could because their complexity beautifully illustrates exactly how difficult it is to fully understand the performance of Pentium 4 code under the best of circumstances. Basically, the answer turned out to be that the sequence in which instructions got processed in the reduced multiply case caused a longer critical dependency path—but there's no way you could have known that without having a processor-level simulator, which you can't get unless you work at Intel. Regardless, the simulator wouldn't usually help you anyway because this level of performance is very sensitive to the exact sequence in which instructions are assigned to execution units and executed, and that's highly dependent on the initial state (including caching and memory access) in which the code is entered, which can easily be altered by preceding code and usually varies over time.

Back in the days of the Pentium, you could pretty much know exactly how your code would execute, down to the cycle. Nowadays, all you can do is try to reduce the instruction count, try to use MMX and SSE, use the cache wisely and try to minimize the effects of memory latency, then throw stuff at the wall and see what sticks.
--
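
For anyone who hasn't written one, a bilinear filter is just three linear blends over a 2x2 block of texels. Here's a minimal sketch in Python, definitely not the MMX code from the article. It also shows one textbook way a multiply can be dropped from a blend, by rewriting a*(1-t) + b*t as a + t*(b-a). The excerpt doesn't say which multiply Abrash actually removed, so treat this rearrangement as an illustration of the general idea only; all the names here are made up.

```python
def lerp_two_mul(a, b, t):
    """Blend a and b with two multiplies: a*(1-t) + b*t."""
    return a * (1.0 - t) + b * t


def lerp_one_mul(a, b, t):
    """Same blend, rearranged to use a single multiply: a + t*(b-a)."""
    return a + t * (b - a)


def bilinear(texels, u, v, lerp=lerp_one_mul):
    """Sample a 2x2 texel block at fractional position (u, v), each in [0, 1].

    texels is ((top_left, top_right), (bottom_left, bottom_right)).
    """
    (tl, tr), (bl, br) = texels
    top = lerp(tl, tr, u)        # blend along the top edge
    bottom = lerp(bl, br, u)     # blend along the bottom edge
    return lerp(top, bottom, v)  # blend between the two edges


block = ((10.0, 20.0), (30.0, 40.0))
sample_two_mul = bilinear(block, 0.25, 0.5, lerp_two_mul)
sample_one_mul = bilinear(block, 0.25, 0.5, lerp_one_mul)

print(sample_two_mul, sample_one_mul)  # 22.5 22.5
```

Both forms give the same answer with fewer multiplies in the second. Whether fewer multiplies is actually faster on a given processor is exactly the thing the excerpt says you can't assume.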

great, great stuff!

Jeff Atwood on March 9, 2005 01:12 AM











Content (c) 2008 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.