I <3 Steve McConnell*
Coding Horror
programming and human factors
by Jeff Atwood

March 8, 2005

On Managed Code Performance

My personal turning point on the importance of managed code was in September 2001, when the NIMDA worm absolutely crushed our organization. It felt like a natural disaster without the "natural" part-- the first notable port 80 IIS buffer overrun exploit. We got literally zero work done that day, and the next day wasn't much better. After surveying the carnage first hand, I immediately saw the benefit of languages where buffer overruns weren't even possible.
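As a minimal sketch of what "buffer overruns weren't even possible" means in practice, here is the idea in Python rather than .NET (chosen purely for illustration; `BUFFER_SIZE`, `copy_request` and `try_copy` are hypothetical names, not anything from IIS or the CLR). A bounds-checked runtime rejects the out-of-range write with an exception instead of silently overwriting whatever sits next to the buffer.

```python
BUFFER_SIZE = 8  # hypothetical fixed-size buffer, standing in for a C char[8]


def copy_request(data: bytes) -> bytearray:
    """Copy request bytes into a fixed-size buffer, one byte at a time.

    Deliberately written like naive C: no length check before the loop.
    """
    buffer = bytearray(BUFFER_SIZE)
    for i, byte in enumerate(data):
        # In C, writing past index 7 would corrupt neighboring memory.
        # Here the runtime checks every index and raises IndexError instead.
        buffer[i] = byte
    return buffer


def try_copy(data: bytes) -> str:
    """Report whether the copy succeeded or was stopped by the bounds check."""
    try:
        copy_request(data)
        return "ok"
    except IndexError:
        return "rejected"
```

Calling `try_copy(b"GET /")` succeeds, while `try_copy(b"A" * 100)` is stopped at the ninth byte. The oversized input still causes a failure, but a contained one rather than an exploitable one.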

Managed code, of course, isn't free. All that bit-twiddling was there for a reason-- to squeeze every last iota of performance out of your 386 and 486. Trading some of that performance for security makes more sense in the era of 1 GHz Pentium chips-- but how much performance are we really giving up? One of the more interesting examples of managed code performance is Vertigo Software's port of Quake II to .NET:

How is the performance of the managed version of Quake II? Initially, the managed version was faster than the native version when the default processor optimization setting /G5 (Pentium) was used. Changing the optimization setting to /G7 (Pentium 4 and Above) created a native version that runs around 15% faster than the managed version. Note that assembly code was disabled for the native and managed versions, so both versions are slower than the original version of Quake 2.

David Notario, who works in Microsoft's CLR JIT compiler group, with a little demo scene coding on the side, posted this interesting message with more detail on the performance of Managed Quake II:

  • This version doesn't use any 3D hardware acceleration at all, which is good. It's interesting to see the performance of the .NET platform isolated from the performance of the graphics card. In apps/demos/games that use 3D acceleration, expect the difference between managed and unmanaged code to be even smaller, as the bottleneck of rendering is the 3D card, not the CPU.
  • With this benchmark, you are measuring the quality of the codegen. The managed version is just a recompile of the unmanaged version with the /clr option (which targets IL instead of x86). It's not taking into account GCs that happen in an app that does managed allocations, it's a pure JIT benchmark. This also means that it doesn't show some problems you may have doing realtime graphics with managed code if you're not careful, such as dropping frames due to periodic GCs.
  • On my P4, the managed Q2 timedemo runs at 63.2 fps, and the native Q2 timedemo runs at 72.8 fps, which means the managed code is performing at 85.6% the speed of native C++ code with VS.2003.
  • The original Q2 [and Quake 1] had optimized x86 assembly rasterizers. These were among the fastest of their time, and they used cunning tricks such as explicitly parallelizing x86 and x87 instructions to achieve maximum speed. For example, the division for perspective correction for the next 8 pixel span was performed in parallel with the actual rendering of the current 8 pixel span, so perspective correction was almost 'free'. The C rasterizers this version uses don't have this property. To compare apples to apples, Vertigo Software compiled their native version with the C rasterizers -- i.e., both versions are slower than the original Q2 demo shipped by Id Software. Just for kicks, I compared the managed version with the original assembly optimized version. The original version gave me 92.5 fps, which means our codegen is generating code with about 70% of the performance of the original hand optimized assembly. I personally think this is great-- especially considering that our codegen has quite a bit of room to improve.
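The percentages in David's message are simple frame-rate ratios. A small sketch of the arithmetic, using only the three fps figures quoted above (the constant and function names are my own, for illustration):

```python
# Frame rates quoted in David Notario's message (Pentium 4 timedemo runs).
MANAGED_FPS = 63.2        # managed build, C rasterizers
NATIVE_C_FPS = 72.8       # native build, C rasterizers
ORIGINAL_ASM_FPS = 92.5   # original Q2 with hand-written assembly rasterizers


def relative_speed(fps: float, baseline_fps: float) -> float:
    """Return fps as a percentage of baseline_fps."""
    return 100.0 * fps / baseline_fps


managed_vs_native = relative_speed(MANAGED_FPS, NATIVE_C_FPS)
managed_vs_asm = relative_speed(MANAGED_FPS, ORIGINAL_ASM_FPS)
```

With these inputs, `managed_vs_native` works out to roughly 86.8 and `managed_vs_asm` to roughly 68.3. The first is slightly above the 85.6% quoted, so one of the quoted figures may have been rounded or mistyped. Either way the managed build lands in the mid-80s against native C and near 70% against the assembly version.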

I guess we'll see how much codegen has improved in .NET 2.0-- from what I hear, performance improvements aren't a big priority-- but I'll gladly trade 15 percent of performance to live in a world where NIMDA can't exist. That's a no-brainer.

In his woefully out of date blog, David mentions that one of his coding heroes is Mike Abrash. All this talk of Quake and performance reminded me of Mike, too. He worked at Microsoft on the graphics subsystem in NT 3.1, and wrote a number of very influential early assembly and graphics programming books. He also worked on the all-assembly graphics architecture of Quake 1, aka "the last great software rasterizer."

Mike's not only a true programming God, but an amazing, humble and approachable writer. I remember randomly browsing through his 1994 Graphics Programming Black Book as a beginning Visual Basic programmer and being totally engrossed in it, even though it was technically far* above my level. He's that great of a writer. For a taste, there's a little snippet of a 2001 article he wrote for Gamasutra in this archived news post. Or, you can relive my amazement as you browse through a complete online version of the Graphics Programming Black Book. The techniques may be obsolete, but the problem solving he describes so compellingly is truly timeless. Very, very highly recommended.

I wonder what Michael Abrash is up to these days.

* really, really, REALLY far above my level.

Posted by Jeff Atwood
Comments

Mike worked at MS on Xbox up until some time in 2001, it appears.

Well, here's one thing he has worked on somewhat recently-- RAD Game Tools' Pixomatic software renderer, circa 2002, last updated 1-2005 (!)

http://www.radgametools.com/pixomain.htm

And yes, UT 2004 *DOES* use the Pixomatic renderer if you switch to software rendering. Be sure to turn the resolution way, way down before doing this, or you'll be sorry... like I was ;)

Jeff Atwood on March 9, 2005 12:52 AM

Ok, so yeah, Abrash is all about Pixomatic (fast x86 software 3D rendering) into late 2004. You have to sign up for a free account, but his three-part DDJ series on Pixomatic is really interesting reading:


http://www.google.com/search?hl=en&q=%22Optimizing+Pixomatic+for+x86+Processors%22
---
In this three-part article, I discuss the process of optimizing Pixomatic, an x86 3D software rasterizer for Windows and Linux written by Mike Sartain and myself for RAD Game Tools (http://www.radgametools.com/). Pixomatic was perhaps the greatest performance challenge I've ever encountered, certainly right up there with Quake. When we started on Pixomatic, we weren't even sure we'd be able to get DirectX 6 (DX6) features and performance, the minimum for a viable rasterizer. (DirectX is a set of low-level Windows multimedia APIs that provide access to graphics and audio cards.) I'm pleased to report that we succeeded. On a 3-GHz Pentium 4, Pixomatic can run Unreal Tournament 2004 at 640×480, with bilinear filtering enabled. On slower processors, performance is of course lower, but by rendering at 320×240 and stretching up to 640×480, then drawing the heads-up display (HUD) at full resolution, Unreal Tournament 2004 runs adequately well, even on a 733-MHz Pentium III.
---

The difference between today's low-level Pentium 4 optimizations and the older optimization techniques he used on ye olde Pentium 1 is.. uh, profound. Sort of a case study in what's possible, even if it doesn't ultimately make much sense IMO. It is amusing to try the software renderer in UT2004, though.. download the free UT2004 demo and give it a shot! ;)

Jeff Atwood on March 9, 2005 1:05 AM

From Part II:

--
I mention this in the context of the bilinear filter because that was where that lesson was driven home. You see, I came up with a way to remove a multiply from the filter code—and the filter got slower. Given that multiplication is slower than other MMX instructions, especially in a long dependency chain such as the bilinear filter, and that I had flat-out reduced the instruction count by one multiply, I was completely baffled. In desperation, I contacted Dean Macri at Intel, and he ran processor-level traces on Intel's simulator and sent them to me.

I can't show you those traces, which contain NDA information, but I wish I could because their complexity beautifully illustrates exactly how difficult it is to fully understand the performance of Pentium 4 code under the best of circumstances. Basically, the answer turned out to be that the sequence in which instructions got processed in the reduced multiply case caused a longer critical dependency path—but there's no way you could have known that without having a processor-level simulator, which you can't get unless you work at Intel. Regardless, the simulator wouldn't usually help you anyway because this level of performance is very sensitive to the exact sequence in which instructions are assigned to execution units and executed, and that's highly dependent on the initial state (including caching and memory access) in which the code is entered, which can easily be altered by preceding code and usually varies over time.

Back in the days of the Pentium, you could pretty much know exactly how your code would execute, down to the cycle. Nowadays, all you can do is try to reduce the instruction count, try to use MMX and SSE, use the cache wisely and try to minimize the effects of memory latency, then throw stuff at the wall and see what sticks.
--

great, great stuff!

Jeff Atwood on March 9, 2005 1:12 AM
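The "longer critical dependency path" effect in the excerpt above can be sketched with a toy model. This is only an illustration under loud assumptions: the instruction names, latencies and dependency graphs below are hypothetical, not Pixomatic's bilinear filter or real Pentium 4 timings, and the model ignores execution units, caches and memory latency, which is exactly the complexity Abrash says you can't see without a simulator.

```python
from functools import lru_cache


def critical_path(latency: dict[str, int], deps: dict[str, list[str]]) -> int:
    """Length of the longest latency-weighted chain in a dependency graph."""

    @lru_cache(maxsize=None)
    def finish(instr: str) -> int:
        # An instruction can start only once all of its inputs have finished.
        start = max((finish(d) for d in deps.get(instr, [])), default=0)
        return start + latency[instr]

    return max(finish(instr) for instr in latency)


# Hypothetical "original" sequence: three multiplies that are independent
# of each other, so they can overlap.
ORIGINAL_LATENCY = {"unpack": 1, "mul_x": 3, "mul_y": 3, "mul_z": 3,
                    "add1": 1, "add2": 1}
ORIGINAL_DEPS = {"mul_x": ["unpack"], "mul_y": ["unpack"], "mul_z": ["unpack"],
                 "add1": ["mul_x", "mul_y"], "add2": ["add1", "mul_z"]}

# Hypothetical "reduced" sequence: one multiply fewer, but the second
# multiply now consumes the result of the first, so they run back to back.
REDUCED_LATENCY = {"unpack": 1, "mul_x": 3, "mul_y": 3, "add1": 1, "add2": 1}
REDUCED_DEPS = {"mul_x": ["unpack"], "mul_y": ["mul_x"],
                "add1": ["mul_y"], "add2": ["add1"]}
```

In this toy model the original sequence has six instructions and a critical path of 6 cycles, while the reduced one has five instructions and a critical path of 9. Fewer instructions, longer chain, slower code.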

"the first notable port 80 IIS buffer overrun exploit."

The problem isn't the language allowing buffer overruns, the problem is using a closed source web server. Security through transparency (http://en.wikipedia.org/wiki/Security_through_transparency) is much better than security through obscurity. Hell, if it was open source and your own developers looked at it, one of them may have fixed the bug before you were affected by it.

"given enough eyeballs, all bugs are shallow."

Chris on November 11, 2008 1:12 PM

This may or may not be a cheap way to plug your old company's quake.net project, but damn if that isn't a cool project.

Dylan on November 12, 2008 6:06 PM

From what I hear, Abrash has been working on Larrabee. In fact, he's giving a talk at GDC 2009:
https://www.cmpevents.com/GD09/a.asp?option=C&V=11&SessID=9138

Andrew Rudson on March 20, 2009 2:54 PM

To me it's not the GC itself that is the biggest issue with Microsoft's managed languages, but rather the IDisposable interface. The usage pattern of a class that implements that interface is so different (and imposes so much on code that uses such a class) from classes that do not that I find myself implementing IDisposable JUST SO I WON'T HAVE TO REDESIGN ENTIRE SECTIONS OF MY APPLICATION if I find out later that my class needs it. This tends to cause a cascade whereby most of the classes in my project implement IDisposable just to be on the safe side.

If I knew up front which classes would need to implement IDisposable this would not be such a problem. But because I develop iteratively, I do not always have this information a priori. Failure to implement or utilize IDisposable where necessary may result in bugs ranging from the minor to the major; locating and changing the use of classes that are newly IDisposable seems a perfect opportunity to do this.

Although I do not take the memory usage hit from the destructor where the use of IDisposable is unnecessary, the simple need to implement it almost everywhere is highly annoying to me. I get all of this in trade off for GC and bounds checking? I'd rather just try to write good code, honestly. I also lose the ability to execute code when leaving a lexical scope (unless I implement IDisposable, and then it depends on the good graces of the code using my class). VB.NET and C# may be fine for scripting a few components together, but for anything bigger (or that might need to grow bigger) than that, please give me a real language.

Richard on June 17, 2009 7:03 AM
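The pattern Richard describes can be sketched with Python's context-manager protocol, which plays a role comparable to IDisposable plus `using` in C#. Python is used here only for illustration, and `Texture`, `careful_caller` and `careless_caller` are hypothetical names. The sketch shows the point he raises: cleanup at scope exit runs only when the calling code opts in.

```python
class Texture:
    """Hypothetical resource holder; `released` stands in for a native handle."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.released = False

    def close(self) -> None:
        # Comparable to Dispose(): release the resource deterministically.
        self.released = True

    def __enter__(self) -> "Texture":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs when the `with` block is left, even if an exception was raised.
        self.close()


def careful_caller() -> Texture:
    with Texture("wall") as tex:
        pass  # the resource would be used here
    return tex  # already released at this point


def careless_caller() -> Texture:
    tex = Texture("floor")
    return tex  # never released: nothing forces the caller to use `with`
```

`careful_caller()` returns a released object and `careless_caller()` returns one that was never cleaned up. The class is identical in both cases, so correctness depends on every call site, which is why retrofitting the pattern onto an existing class means revisiting all of its users.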

Content (c) 2009 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.