April 4, 2009
I've been following Michael Abrash for more than 10 years now; he's one of my programming heroes. So I was fascinated to discover that Mr. Abrash wrote an article extolling the virtues of Intel's upcoming Larrabee. What's Larrabee? It's a weird little unreleased beast that sits somewhere in the vague no man's land between CPU and GPU:
[Larrabee] is first and foremost NOT a GPU. It's a CPU. A many-core CPU that is optimized for data-parallel processing. What's the difference? Well, there is very little fixed function hardware, and the hardware is targeted to run general purpose code as easily as possible. The bottom line is that Intel can make this very wide many-core CPU look like a GPU by implementing software libraries to handle DirectX and OpenGL.
We know that GPUs generally deliver one or two orders of magnitude more performance than a general purpose CPU at the things they are good at. That's what I would expect from dedicated hardware devoted to a specific and highly parallelizable task.
Michael Abrash has already attempted what most people said was impossible -- to build a full software 3D renderer that runs modern games at reasonable framerates. In other words, to make a general purpose CPU compete in a completely unfair fight against a highly specialized GPU. He's effectively accomplished that, and his company sells it as a product called Pixomatic:
In this three-part article, I discuss the process of optimizing Pixomatic, an x86 3D software rasterizer for Windows and Linux written by Mike Sartain and myself. Pixomatic was perhaps the greatest performance challenge I've ever encountered, certainly right up there with Quake. When we started on Pixomatic, we weren't even sure we'd be able to get DirectX 6 features and performance, the minimum for a viable rasterizer. I'm pleased to report that we succeeded. On a 3 GHz Pentium 4, Pixomatic can run Unreal Tournament 2004 at 640×480, with bilinear filtering enabled. On slower processors, performance is of course lower, but by rendering at 320×240 and stretching up to 640×480, Unreal Tournament 2004 runs adequately well -- even on a 733-MHz Pentium III.
Pixomatic is documented in an excellent series of Dr. Dobb's articles. It's fascinating reading; even though I know zero about assembly language, Michael's language of choice, he's a fantastic writer. That old adage about the subject not mattering when you have a great teacher has never been truer.
I remember trying out Pixomatic briefly four years ago. Those CPUs he's talking about seem awfully quaint now, and that made me curious: how fast is the Pixomatic software renderer on today's CPUs? My current box is a Core 2 Duo (Wolfdale) running at 3.8 GHz. So I downloaded the Unreal Tournament 2004 demo (still fun, by the way!), and followed the brief, easy instructions provided to enable the Pixomatic software renderer. It's not complicated:
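If memory serves, it amounts to a one-line change in the game's configuration file -- something along these lines (the section and device names here are from memory, so treat this as a sketch rather than gospel):

    [Engine.Engine]
    ; was RenderDevice=D3DDrv.D3DRenderDevice
    RenderDevice=PixoDrv.PixoRenderDevice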
One word of warning. Be sure you have an appropriate resolution set before doing this! I was playing at 1920x1200 initially, and that's what the software renderer defaulted to. And here's the shocker: it was actually playable! I couldn't believe it. It wasn't great, mind you, but it was hardly a slideshow. I tweaked the resolution down to something I felt was realistic: 1024x768. I turned on framerate display by pressing ...
... from within the game. This Pixomatic software-rendered version of the game delivered a solid 40-60 fps experience in capture the flag mode. It ran so well, in fact, that I decided to bump up the detail -- I enabled 32-bit color and bilinear filtering by editing the game's configuration file.
Once I did this, the game looked totally respectable. Eerily reminiscent in visuals and performance of the classic, early Voodoo and Voodoo 2 cards, actually.
(If you think this looks bad, check out Doom 3 running on an ancient Voodoo 2 setup. It's certainly better than that!)
The frame rate took a big hit, dropping to 30 fps, but I found it was an uncannily stable 30 fps. The only Achilles' heel of the Pixomatic software renderer is scenes with lots of alpha blending, such as when you fire a sniper rifle, obscuring the entire screen with a puff of muzzle smoke, or when you're standing near a teleportation portal.
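That bottleneck makes sense: with alpha blending, every translucent pixel becomes a read-modify-write on the frame buffer, so a screen-filling puff of smoke roughly doubles the memory traffic and none of it can be rejected early. A rough sketch of the per-pixel work involved -- illustrative only, this is not Pixomatic's code -- looks like this:

    #include <stdint.h>

    // Illustrative only -- not Pixomatic's code. Blends one translucent span
    // over the frame buffer in 32-bit XRGB, one pixel at a time.
    static void blend_span(uint32_t* dst, const uint32_t* src, int count, uint32_t alpha)
    {
        // alpha is 0..255; every pixel costs a frame-buffer read, a blend, and a write
        for (int i = 0; i < count; i++) {
            uint32_t d = dst[i];  // read back what is already on screen
            uint32_t s = src[i];
            uint32_t r = (((s >> 16) & 0xFF) * alpha + ((d >> 16) & 0xFF) * (255 - alpha)) / 255;
            uint32_t g = (((s >> 8) & 0xFF) * alpha + ((d >> 8) & 0xFF) * (255 - alpha)) / 255;
            uint32_t b = ((s & 0xFF) * alpha + (d & 0xFF) * (255 - alpha)) / 255;
            dst[i] = (r << 16) | (g << 8) | b;  // write the blended result back
        }
    }

An opaque span, by contrast, can just be written straight out, which is why everything else holds its frame rate so well.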
Pretty amazing, right? It is!
And utterly pointless.
My current video card renders Unreal Tournament 2004 at the highest possible resolution with every possible quality option set to maximum, at somewhere between 200 and 300 frames per second. Despite the miraculously efficient assembly Abrash and Sartain created to make this possible at all, it's at best a carnival oddity; even the crappiest onboard laptop 3D (assuming a laptop of recent vintage) could outperform Pixomatic without even breaking a sweat.
We know that the game is far more enjoyable to play with a real GPU, on a real video card. And we're hip deep in real GPUs on every platform; even the iPhone has one. Perhaps Pixomatic made some business sense back in 2003, but it didn't take a genius analyst back then to see that it would make no business sense at all today. At the same time, I can't help admiring the engineering effort that went into building a viable 3D software renderer, something that seemed virtually impossible, bordering on foolish.
In short, it will be possible to get major speedups from [Larrabee] without heroic programming, and that surely is A Good Thing. Of course, nothing's ever that easy; as with any new technology, only time will tell exactly how well automatic vectorization will work, and at the least it will take time for the tools to come fully up to speed. Regardless, it will equally surely be possible to get even greater speedups by getting your hands dirty with intrinsics and assembly language; besides, I happen to like heroic coding.
We'll have to wait and see if Intel's efforts to push GPU functionality into their x86 architecture make any of this heroic coding more relevant in the future. Either way, it remains impressive.
Posted by Jeff Atwood
The point is that lower-end systems have either low-performance or low-quality GPUs, and drivers of similar aptitude. On the one hand you have the daunting task of supporting N crappy GPUs with M driver revisions and P system configurations. On the other hand, you can write detection code to figure out when you're not on a nice Core 2 + NVIDIA GeForce / ATI Radeon, and default to Pixomatic. For something like World of Warcraft, I bet this would work nicely and save them a lot of money otherwise spent on getting their customer support staff drunk because they're unhappy all the time. Even the Atom has more than 1 hardware thread, and thus would get gains from Pixomatic 3's multithread optimizations.
It's been almost 6 months since you wrote about your current video card:
Isn't it time you fed your video card addiction and bought a new one? Can I have this one then?
(The third comment in that article is yours, and you tell an Argentinian guy that you usually sell old video cards on eBay. I'm Brazilian, sell this one to me!)
Personally I think Larrabee and multi-core software rendering is a pretty good idea in light of the GHz race coming to a halt.
It's almost certain that Intel is working on 8+ core chips, with each core having Hyperthreading.
It's all about scaling and simplicity. One platform, one architecture that handles all your computations is a great plus for consumers, vendors and developers. As workloads, including graphics, get more parallelizable it makes a lot of sense to stop putting money in developing specialized platforms, and concentrate all efforts on improving the CPU platform.
In the short term, you can't go wrong with today's graphics chipsets, but in 10 years time they will be obsolete.
HEROIC CODING IS WARRIOR CODING
Just a thought ... if it were not for the use of assembly language, maybe Pixomatic could be ported to a customized V8 rendering engine, in NewAge HeroicCoding?
I don't know the subject well enough, just that Google Chrome is getting rave reviews for speed of rendering and processing.
You are all missing another market for Larrabee: embedded applications, where a lot of computationally intensive work is still being done with custom chips or FPGAs. Larrabee will have the horsepower to take over these tasks and make these systems more fully programmable.
@Niniane and @Rick
I hear you, but I see zero actual *shipping games* that have 3D software rendering options. Microsoft's software renderer was a reference implementation, meaning it was meant for accuracy, not speed.
Can you point to games other than UT2004 that have a software rendering option, or that ship with Pixomatic?
For one thing, it takes a BLAZINGLY fast cutting edge CPU to do well, and guess what machines with crap GPUs tend to not have? :)
I'm not sure if this changed in Windows 7, but I do believe that the Abrash and Sartain code represents *best possible* performance. I don't think you can do better, and it's still.. not great.
We'll see if Larrabee and future CPUs change that or not.
The Sims games (and I think Spore) use Pixomatic; so did some version of Flight Simulator. Big enough titles for you? :)
Many machines with crap GPUs actually have surprisingly powerful CPUs, certainly fast enough to run a mid-level title at a low resolution. Just look at the figures Abrash quotes in his article. You don't need a blazing fast CPU; not everybody is making or playing games that have UT2004-level graphics.
As I pointed out it's not always a case of crap GPUs, the default drivers that ship with hardware are notorious for being buggy and many (probably the majority) of PC users never update them.
As I also mentioned, the software rasterizer in Windows 7 is NOT the refrast (Reference Rasterizer) that's been present since DX6, which was a) incredibly slow and b) needed to be turned on via a registry key.
The question is whether it will be more effective to emulate a GPU in x86 or emulate x86 in a GPU. It looks like AMD are betting massively on the GPU and Nvidia have made their intentions clear with CUDA.
Intel are between a rock and a hard place here (even though they look like the king of the world right now!). Every time they bump the number of cores up, they push the whole software industry towards parallelizing their code. And that brings their code more and more within the grasp of the GPU.
I wonder if Apple have chosen the losing processor design again...
Are you tired from all the bikeshed-effect comments on the previous three posts? This post isn't of general interest. :-)
It's only a matter of time in my eyes before NVidia releases a motherboard chipset that runs all its general purpose processing through the GFX card(s) you have installed. No CPU required.
Although Larrabee uses the x86 instruction set, it does have specific extra instructions for graphics processing. I think what Intel have done is hardcode into the cores the things that are slow in x86. Remember they also have an overall core to handle frame output and, I would guess, anti-aliasing.
Also we are looking at an initial version having maybe 16 cores, each running 4 threads. Since GPU processing can generally be run in parallel, this is going to be very different from running Pixomatic on a current CPU.
I'm betting on Intel; let's face it, they know what they are doing.
Hopefully some or all CPU operations can be passed off to the GPU cores without special coding like you have to with CUDA. Seems possible since it's x86.
No, not at all pointless. I'd think that anyone who is looking to implement OpenCL in a processor-agnostic fashion would want to know about all this, and in detail. There's an unending back-and-forth battle between powerful general-purpose CPUs and powerful specialized (and therefore multi-vendor and driver-dependent) processors-- and Pixomatic sets a benchmark for what's actually possible.
This LRB thing never got me excited.
As said, it's somewhere between good enough and not so bad.
We know it isn't going to be fast, but thanks to the more generic approach it won't be nearly as slow when going bad.
However, is it good?
If something is good at task X and something else is good at task Y you know where to send the workload... but what if the performance pattern is similar but sometimes better?
I don't really like heroic programming. One of the things in my todo list is to reimplement an old (I could almost call it legacy) system poking around here which, for some not well understood reason, started eating 10x the time to process (considering the CPUs have gone at least 2x in the meanwhile, this isn't a good thing). It was far below the requirements when it was deployed. I have little clue, and I'm not excited.
I wonder if it will be possible to use a mixed approach as a way to lower the bottleneck on GPUs in laptops. Today, most laptops have the bottleneck on the GPU, while there is tons of CPU power to give and waste. If part of the rendering could be done on the wasted CPU power, we could see significant FPS gain on lower end GPU cards.
From what I read about Larrabee, it was made for laptops that can't afford (price, space, energy consumption) to have a strong GPU card.
But anyway, the tendency now is for games to use a lot more of CPU power because of physics effects (anyone played GTA4 on a dual-core CPU? the FPS gain to go quad-core is absurd).
Kudos @Niniane. Excellent points.
@Jeff Atwood, as far as this:
but I do believe that the Abrash and Sartain code represents *best possible* performance
I'm glad you're not a researcher! I bet you are too, as they tend to have to prove those kinds of statements. :)
It's all software, at one level or another.
A software implementation of something, written as a proof of concept to show that the performance of a software-only implementation of some graphics kernel can hold up, is a terrific way to show -- to ourselves, as an industry -- what can be done with what toolset.
Sure, it's in assembler. (talk amongst yourselves)
But the concept is useful to us. It's like the person who has a homegrown, fast-fast-fast version of the BLAS (basic linear algebra routines) on which many math packages are built. It may not be for you, but it's important to know that it can be isolated and improved.
Is Pixomatic to DirectX what Mesa is to OpenGL?
Joe Harris, about every year someone proclaims the death of Intel, and every time Intel proves them wrong. I'm not saying that they'll be in business forever, but I think they've proved that they can change architectures if needed and then put an x86 layer on top of it.
Words on Larrabee. What does it mean for you, now? Nothing. It may be the next GPU you buy, it may not be.
But if it is AWESOME (big IF) and if people LOVE it, and it is FAST, it is going to drive the standards. What does that mean? It means that before you know it, DX/OGL will have 'reflection' 'shadowing' 'auto-transparent laying' etc etc built into their standards. That's right- a DX shader function called 'reflect' or 'sample_s' that lets you sample your own scene. Larrabee will be able to support new features with a firmware upgrade, while other GPU vendors will be struggling to make their hardware work. (some do a bit of ray tracing already, but it would still take a lot of work to get features like these... why else are they not there). Larrabee has the potential to push graphics to new levels.
With things like OpenCL -- it doesn't matter if it lets people do faster computations... they probably won't be much faster than on a GPU, anyway.
But for graphics... If they can make it AS FAST as a GPU... It's already 1000x more flexible. It's going to speed up the advancement of 3D graphics.
*Note: I was a fan of LRB when I first heard of it. Even if it fails, it will still teach us something.
With respect, you may want to do a bit more research before you declare something like Pixomatic as pointless in today's world. To be polite, it's a very uninformed statement.
Although today's machines virtually all come with some form of graphics acceleration, the cheaper laptops often lack features that traditional GPUs have had for some time. Support for more than 1.1/2.0 shaders would be one notable example. The support they DO have can often be extremely buggy - laptop OEMs are notoriously bad about ever updating their drivers, and even then many customers will never update unless the driver is pushed out via Windows Update (which it never is, because OEMs are notoriously bad about this too).
So if you're writing a Theme/Sim/casual game where you expect a large number of people to be using low-end or older machines, you have two choices:
1) Write lots of fallback code for lower-class machines, including a complete set of your effects/materials expressed using the fixed-function pipeline. Add device-id based workarounds for the various driver-bugs you encounter during QA. After shipping, field support calls and walk people through either updating their drivers or editing a config file to try and get the game stable. Issue patches that address problems that people later encounter.
2) Use a software renderer (such as Pixomatic) to add a software mode that is used either for people with GPUs below your minimum spec (shader 2.0) or as a fallback for dodgy drivers. You will need to scale down resolution/effects, but that's easy to do.
Guess which one is the most cost-effective and leads to the best user experience?
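In code terms, option 2 boils down to a capability check at startup, something like this (all the names here -- GpuCaps, HardwareRenderer, SoftwareRenderer -- are made up for illustration; the real check would read the D3D caps and a driver blacklist):

    #include <memory>

    struct GpuCaps {
        int  shader_model_major = 0;   // e.g. 2 for shader model 2.0
        bool driver_known_bad   = false;
    };

    struct Renderer         { virtual ~Renderer() = default; };
    struct HardwareRenderer : Renderer {};   // the normal D3D/OpenGL path
    struct SoftwareRenderer : Renderer {};   // e.g. backed by Pixomatic

    std::unique_ptr<Renderer> pick_renderer(const GpuCaps& caps)
    {
        // Below the minimum spec (shader 2.0) or a known-bad driver? Fall back to software.
        if (caps.shader_model_major < 2 || caps.driver_known_bad)
            return std::make_unique<SoftwareRenderer>();
        return std::make_unique<HardwareRenderer>();
    }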
You may also be interested to know that in Windows 7 Microsoft have actually ADDED a software renderer for Direct3D (this is not the refrast present in previous versions, this is a fully featured and usable implementation).
Why did they do this? For all the reasons above. With Aero/WPF Windows is now extremely reliant on the features and performance of GPUs and is running into exactly the same problems that many game developers have faced for years - you can't always rely on the hardware in someone's machine.
Mobile device rendering -- they typically don't have GPUs. Huge application here.
I am also a Larrabee fan. Shaders are too non-standard and they will be obsolete one day...
I loaded the original JS raytracer scene and rendered it in 4.1 seconds (using Opera). But I guess there are lots of improvements left in the JS engines, because Firefox and IE7 were many times slower.
As time goes on it's becoming apparent that two types of hardware are needed on the desktop: the massively parallel world, and the low/single-threaded world. Not everything is easy to run in parallel, and a mix of the two worlds is going to be essential.
At the moment Intel and AMD are pushing forward with more and more cores, but that is rapidly becoming useless for the desktop (the 8 virtual cores of the i7 are already too far). A reversal of that trend, complementing it with a massively parallel chip, of which Larrabee is the first version, is going to give us the best of both worlds.
Niniane, are you by any chance the Niniane who worked on Lively? Explanation: it's a somewhat unique name, in the context of a discussion about 3D, and as I've remarked on Twitter, female names are (unfortunately and sadly) rare enough on my blog that they're frequently a sign of spam anyway.
Rick: For something like World of Warcraft, I bet this would work nicely and save them a lot of money
In theory yes, but in practice does Blizzard do this? No. And they make gazillions of dollars on WOW. That's a pretty compelling argument, to me at least, that this complex software rendering fallback scenario just isn't necessary.
Honestly, something like pseudo-3D delivered through the browser via Flash or Silverlight is more likely. Do casual gamers even *need* 3D?
Anyway, the more I think about this, the more I think that GPUs need to be on the same die as the CPU. I'm not sure radical architectural redesigns of x86 CPUs that *can* work as de-facto GPUs will be a successful evolutionary path. MHO of course.
On a related note, per that gamedev forum thread, looks like Intel bought Pixomatic..
One of the axioms of systems engineering is that you can optimise a system to do one (or a few things) really well but most others poorly; or you can optimise the system to do many things kind of OK, but none of them really well.
General purpose CPUs are fast enough to do very specific things like graphics kind of OK, but they won't ever be able to match the potential of a limited-purpose processing unit like a GPU.
There are savings in chip production, but because you need to drive a general purpose processor much harder to match a designed-for-purpose processor, the operating costs are higher. A classic example is that my $50 DVD player plays DVDs with nothing but a 1-inch inaudible rear exhaust fan, while my $1k HTPC has a water cooling rig so I could watch a DVD without needing to set the volume to 11 to mask the noise of all of those fans...
I disagree with your conclusion.
The software game has no crystal balls; the things that will change in the future are in the tools farther down the waterfall than assembly. Pixomatic, with extremely fast, well written assembly code, has a long-term advantage... The things that will change over time are the DirectX libraries that they are trying to allow to process properly on-die.
In short, I think your 'heroic coders' are extremely capable and did extremely well for themselves. They had their company bought by Intel; and for a good reason. Intel wanted their ability to pull someone else's core business into their core business' realm of operation. Pure and simple.
AMD bought ATI; ever think about why? As Moore's Law takes individual processor power farther and farther beyond what a person could imaginably need, it's just cheaper to produce chips that have LOTS of power in relation to the features people want. You can reduce consumption by slowing things down, you can produce smaller stuff, but eventually you get to a point where a fast-responding laptop outperforms what people expect, and you need it cheaper.
Cheaper means fewer chips, fewer people involved in the manufacture, less shipping of individual parts around, less placement on boards and QA, and less packaging. How do you do all that?
CPUs enough faster than GPUs, with enough backwards compatibility, to make that $100 drop in price mean more than the performance hit.
I'm not sure if this changed in Windows 7, but I do believe that the Abrash and Sartain code represents *best possible* performance. I don't think you can do better, ...
If you ever read Abrash, you know what that means.
Have you ever actually tried running a 3D game with the crap integrated graphics hardware 3D included on most laptops? It's utter and complete shit.
The processors, by comparison, are usually pretty decent. Getting a fast CPU and no real 3D support is easy, getting a laptop with a good processor and good 3D performance is expensive. Bringing back a good software renderer for these machines makes all sorts of sense.
A couple of points:
The plan is for Windows 7 to support the full DirectX 11 standard. If the hardware doesn't support an operation, then Win7 will do it in software. And on a sufficiently fast multicore machine, it's already faster than some (all?) Intel integrated graphics devices.
Second, Intel's admitted that when Larrabee ships, it'll be slower than the current ATI/NVidia cards. It's better than what they have now, and it's kinda neat, but it's not a killer chip.
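For the curious, that software path (WARP, if I recall the name correctly) is exposed as just another driver type when you create the Direct3D 11 device -- a sketch, with error handling omitted:

    #include <d3d11.h>
    #pragma comment(lib, "d3d11.lib")

    // Sketch: create a Direct3D 11 device on the software (WARP) rasterizer
    // instead of the GPU. Error handling omitted for brevity.
    HRESULT create_warp_device(ID3D11Device** device, ID3D11DeviceContext** context)
    {
        return D3D11CreateDevice(
            nullptr,                  // default adapter
            D3D_DRIVER_TYPE_WARP,     // software rasterizer, no GPU involved
            nullptr, 0,               // no software module, no creation flags
            nullptr, 0,               // default feature levels
            D3D11_SDK_VERSION,
            device, nullptr, context);
    }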
This comment thread is already very long, but I'd like to add one thing:
In his 2008 QuakeCon keynote, John Carmack said that he knows Id Tech 5 (i.e. the Rage engine) is probably the last polygon engine that Id Software is going to develop. He said that the one guy he currently doesn't want to be is the guy who, at a big game publisher, has to make the bet on what technology is relevant in 4 years for a next-gen game, because polygons might be it, or not.
Intel is very much going in a direction where they add current-gen GPU technology to their CPUs so that the next generation of game engines can be built on Intel technology.
You can bet that nVidia is currently working on hardware that is optimized for raytracing and bezier curves, but I think that this comment thread framed the discussion in the wrong way.
That being said, there is another untapped application of these GPU/CPU-hybrids for streaming services like OnLive.
At least Larrabee may provide for fully open source graphics drivers for Linux on day one. Yeah, AMD is starting to move in that direction with their newest cards as well.
What's going to be really interesting is the nVidia Tegra, an ARM Core(s) bundled with nVidia graphics. ARM already has excellent open source support, and nVidia is better than AMD in that regard. Hopefully this can lead to a large group of consumers migrating away from Windows and actually using a system that's secure and safe, yet has the power to do email and web browsing, as well as properly display the high definition videos they'll want to consume. And of course, with lower power usage.
Pixomatic is HARDLY pointless. I hate to break it to you Jeff, but games are not the only applications that require real-time 3D graphics. For a business application with modest rendering needs, using a software renderer is an excellent way to make sure your app works predictably and reliably on every PC, at least with respect to the 3D graphics part.
Unfortunately, many years ago, I made the mistake of relying on Microsoft's software renderer that was provided as part of DirectX. This was NOT the reference rasterizer. It was a go-as-fast-as-possible, don't-get-too-fancy software renderer. It did everything I needed and then some. It was fast enough even on an old P166. It worked 100% reliably while the hardware accelerated renderer failed to start or would BSOD on all the dodgy machines/drivers out there. Once my app got out in the wild, it didn't take that long for me to forget about the idea of the software renderer as a fallback. The ONLY viable option was to use it exclusively.
So then, what did MS do with this very valuable software renderer? Of course! They killed it off, offering no replacement. Now it's impossible to run my app on 64-bit Windows without gutting the app to swap out the 3D engine or, more likely, doing a rewrite. (WOW64 doesn't work on OS components like DirectX DLLs.) How ironic it is that MS is once again providing a software renderer that's actually meant for real work.
If I had gone with a third-party software renderer like Pixomatic, my app would still be working on every modern computer running Windows. So when you call Pixomatic utterly pointless, I think you're not seeing the whole picture. It's about using the right tool for the job, and dependencies and their consequences. Decisions here can easily make or break a company.
One thought I have not yet read: *If* Intel runs with this in a large enough (read: market altering) way, the market will shift to a set of x86 extensions for which AMD does not have a license. Of course, if Intel drags its feet and only utilizes this in a few niches, that will give others the time needed to come up with a similar but possibly better extension set, as we've seen before...
You're using the business perspective to view the effort of writing the software renderer.
I'd wager he did it mostly because someone said it can't be done.
Also, there are still assembly hacks out there that will beat what the compiler outputs. Writing something in assembly, or at least reading it in my case, maintains the programmer's awareness of the translation from high-level language to assembly.
I'm a bit surprised at how Jeff completely misses this. I always enjoy hearing people say that learning assembly / C / C++ is useless while they're using tools that are written in it.
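To make that concrete, here's the flavor of thing people usually mean by beating the compiler -- a plain scalar loop next to a hand-vectorized SSE version (illustrative only; whether the intrinsics actually win over a modern auto-vectorizer depends on the compiler and flags):

    #include <xmmintrin.h>  // SSE intrinsics

    // Scalar version: one multiply per element; we trust the compiler.
    void scale_scalar(float* dst, const float* src, float k, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * k;
    }

    // Hand-vectorized version: four floats per iteration. Assumes n is a
    // multiple of 4 and both pointers are 16-byte aligned -- real code needs
    // a tail loop, which is exactly where the heroics creep in.
    void scale_sse(float* dst, const float* src, float k, int n)
    {
        const __m128 vk = _mm_set1_ps(k);
        for (int i = 0; i < n; i += 4) {
            const __m128 v = _mm_load_ps(src + i);
            _mm_store_ps(dst + i, _mm_mul_ps(v, vk));
        }
    }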
I'd benefit from that. I estimate I would use a GPU maybe 5% of the time. Far far more cost effective for me to buy dual CPUs.
It's worth noting that (IIRC) Abrash started that work because the incompatibilities between different cards' nonstandard OpenGL extensions were such a nightmare. It's telling that he found it worthwhile to give up a generation or two of performance in order not to have to write code that is half #ifdefs (I don't know if cross-card OpenGL programming has improved in the ~3 years since I did it, but this was typical then).
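For anyone who never lived through it, the nightmare looks roughly like this -- sniff the extension string at runtime and branch per card (a sketch; in practice this gets multiplied by #ifdefs for headers that may or may not declare the entry points, across dozens of extensions):

    #include <cstring>
    #include <GL/gl.h>

    // Sketch of the classic pre-GL-2.0 dance: query the extension string and
    // branch per capability. (Naive substring matching, shown for flavor only.)
    static bool has_extension(const char* name)
    {
        const char* ext = reinterpret_cast<const char*>(glGetString(GL_EXTENSIONS));
        return ext && std::strstr(ext, name) != nullptr;
    }

    void choose_texture_path()
    {
        if (has_extension("GL_ARB_multitexture")) {
            // fast path: apply multiple textures in a single pass
        } else {
            // fallback: render the geometry once per texture layer
        }
    }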
Also, as a couple of people already pointed out, there is plenty of software that isn't performance-bound but still uses state-of-the-art graphics calls for the visual effects (Spore comes to mind, as well as certain data visualization software, and lots of kids' games). Something like Pixomatic is perfect for these (back in the day, I used Mesa for the same purpose).
Fun fact: the Intel open source drivers on Linux are nerfed. They don't support OpenGL 2 due to the patent-encumbered S3TC texture compression scheme required by the spec.
Thus, most games which require S3TC freak out and crash. UT2004 detects its absence and falls back on a slower (~30 FPS) texture compression method. Performance-wise, software rendering is on par with UT's aforementioned fallback mode when judging by FPS. Software rendering is better than the fallback mode at higher resolutions, because the fallback mode will pause for a millisecond every second or two, while software mode does not.
I'm trying to grok your point, Jeff.
Are you saying that Michael Abrash is extolling the virtues of Larrabee because it plays to the strategy behind Pixomatic? And if so, are you saying it doesn't matter -- that it is a failed strategy at the top because video cards do it way better?
Jeff, I don't understand how Pixomatic or Larrabee is utterly pointless.
Abrash's articles imply that the Pixomatic effort produced deep insight into how to parallelize and super-optimize a pure software 3D rasterizer. By working with the Pixomatic team, Intel was able to distill these results into an instruction set and architecture that remains pretty general purpose but powerful enough to match GPUs.
Why is this pointless? It will only take the next hot game to include a feature that's not feasible on a GPU or typical CPU and suddenly the Intel team will be vindicated.
Hmm... maybe I missed your point.
I think the best way to introduce a machine of this power has got to be through video games - specifically written for the system at hand.
There are some commenters above who don't quite realise that running 'normal' software on it won't just work. Parallel processing is a style of coding in its own right; using a HAL is all very well, but to take best advantage of a system, code needs to be written specifically for it. One can't just take code written for an x86 and expect it to run super quick on a parallel processor machine.
I do suspect that if Nvidia worked with the open source community they could come up with a new console that would make today's game consoles look like ZX Spectrums.