CPU vs. GPU

November 23, 2006

Intel's latest quad-core CPU, the Core 2 Extreme QX6700, consists of 582 million transistors. That's a lot. But it pales in comparison to the 680 million transistors of nVidia's latest video card, the 8800 GTX. Here's a small chart of transistor counts for recent CPUs and GPUs:

Processor              Type   Transistors
AMD Athlon 64 X2       CPU    154 m
Intel Core 2 Duo       CPU    291 m
Intel Pentium D 900    CPU    376 m
ATI X1950 XTX          GPU    384 m
Intel Core 2 Quad      CPU    582 m
NVIDIA G8800 GTX       GPU    680 m

ATI won't release a new video card until next year. But their current X1950 XTX isn't exactly chopped liver: 384 million transistors is more than any current dual-core CPU.

Of course, comparing GPUs to CPUs isn't an apples-to-apples comparison. The clock rates are lower, the architectures are radically different, and the problems they're trying to solve are almost completely unrelated. But GPUs now exceed the complexity of modern CPUs in terms of absolute transistor count. And like CPUs, they're becoming programmable-- it's possible to harness all that graphics power to do something other than graphics.

There's a nice overview on AnandTech which provides some background on this architectural sea change in video cards:

So far, the only types of programs that have effectively tapped GPU power-- other than the obvious applications and games requiring 3D rendering-- have also been video related: video decoders, encoders, video effect processors, and so forth. But there are many non-video tasks that are floating-point intensive, and these programs have been unable to harness the power of the GPU.

Meanwhile, the academic world has designed and utilized custom-built floating-point research hardware for years. These devices are known as stream processors. Stream processors are extremely powerful floating-point processors able to process whole blocks of data at once, whereas CPUs carry out only a handful of numerical operations at a time. We've seen CPUs implement some stream processing with instruction sets like SSE and 3DNow!, but these efforts pale in comparison to what custom hardware has been able to do.

3D rendering is also a streaming task. Modern GPUs have evolved into stream processors, sharing much in common with the customized hardware of researchers. GPU designers have cut corners where they don't need certain functionality for 3D rendering, but they have ultimately developed extremely fast and flexible stream processors. Modern GPUs are just as fast as custom hardware, but due to economies of scale are many, many times cheaper than custom hardware.

Dedicated, task-specific hardware is orders of magnitude faster than what you can achieve with a general purpose CPU. If you need proof of this, just look at the chess benchmarks. IBM's Deep Blue was capable of evaluating 200 million chess moves per second in 1997. Ten years later, the fastest quad-core desktop system can only evaluate 8 million chess moves per second. Ten year old custom hardware is still 25 times faster than the best general purpose CPUs. Amazing.
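
The "25 times" figure follows directly from the two throughput numbers quoted above. A minimal sketch of that arithmetic; the constant and function names are mine, chosen only for illustration:

```python
# Figures quoted in the paragraph above (chess moves evaluated per second).
DEEP_BLUE_1997 = 200_000_000
QUAD_CORE_DESKTOP = 8_000_000


def speedup(custom: int, general: int) -> float:
    """Return how many times faster the custom hardware is than the CPU."""
    return custom / general


ratio = speedup(DEEP_BLUE_1997, QUAD_CORE_DESKTOP)
print(f"Deep Blue evaluates {ratio:.0f}x as many moves per second")
```

This compares raw evaluation counts only; it says nothing about how good each evaluation is.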

The most high profile application for all this GPU power at the moment is Stanford's Folding@Home. There's no shortage of exciting PR on this topic:

The processing power of just 5,000 ATI processors is also enough to rival that of the existing 200,000 computers currently involved in the Folding@home project; and it is estimated that if a mere 10,000 computers were to each use an ATI processor to conduct folding research, that the Folding@home program would effectively perform faster than the fastest supercomputer in existence today, surpassing the 1 petaFLOP level.

Stanford recently introduced a high performance folding client which runs on ATI's X1800 and X1900 series video cards. TechReport tested the new high performance folding client and came away a little disappointed:

Over five days, our Radeon X1900 XTX crunched eight work units for a total of 2,640 points. During the same period, our single Opteron 180 core chewed its way through six smaller work units for a score of 899 -- just about one third the point production of the Radeon. However, had we been running the CPU client on both of our system's cores, the point output should have been closer to 1800, putting the Radeon ahead by less than 50%.

The GPU may be doing 20 to 40 times more work, but the scores are calibrated to a baseline system, not the absolute amount of work that's done. It's a little anticlimactic.
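
To make TechReport's numbers concrete, here is a rough sketch of the arithmetic behind "one third" and "less than 50%". The variable names are mine, and the dual-core figure assumes the second core would score exactly as much as the first, which is TechReport's estimate rather than a measurement:

```python
gpu_points = 2640          # Radeon X1900 XTX over five days
single_core_points = 899   # one Opteron 180 core over the same period

# Assumption: perfect scaling when the CPU client runs on both cores.
dual_core_points = 2 * single_core_points

single_ratio = single_core_points / gpu_points   # CPU core relative to GPU
gpu_lead = gpu_points / dual_core_points - 1     # GPU lead over both cores

print(f"One core scores {single_ratio:.0%} of the GPU's points")
print(f"Both cores would score about {dual_core_points} points")
print(f"The GPU's lead over both cores would be {gpu_lead:.0%}")
```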

Stanford's advanced folding client exploits the Brook Language, an extension to ANSI C that allows them to compile C-like code that runs on the GPU. It leverages ATI's Stream API to communicate with the GPU. NVIDIA offers something similar to Brook in their CUDA technology:

GPU computing with CUDA technology is an innovative combination of computing features in next generation NVIDIA GPUs that are accessed through a standard C language. Where previous generation GPUs were based on "streaming shader programs", CUDA programmers use C to create programs called threads that are similar to multi-threading programs on traditional CPUs. In contrast to multi-core CPUs, where only a few threads execute at the same time, NVIDIA GPUs featuring CUDA technology process thousands of threads simultaneously enabling a higher capacity of information flow.

Of course, CUDA only works on the latest G80 series of cards, just like the ATI's Stream technology is really only useful on their latest X1900 series. All this potential programmability is a very recent development.
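
The "thousands of threads" model is easier to picture with a toy. The sketch below is plain Python, not CUDA and not NVIDIA's API: it only imitates the idea that the programmer writes the work for one thread, and the hardware runs that same function once per thread index. The names saxpy_kernel and launch are hypothetical, and the threads here run one after another rather than in parallel:

```python
def saxpy_kernel(thread_id, a, x, y, out):
    """Work done by one logical thread: it handles exactly one element."""
    if thread_id < len(x):  # guard: more threads may be launched than there is data
        out[thread_id] = a * x[thread_id] + y[thread_id]


def launch(kernel, num_threads, *args):
    """Stand-in for a GPU launch: runs every thread id, sequentially."""
    for thread_id in range(num_threads):
        kernel(thread_id, *args)


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)

launch(saxpy_kernel, 8, 2.0, x, y, out)  # 8 threads, only 4 have work to do
print(out)
```

On a real GPU the loop inside launch disappears: each thread id is handed to its own processing element, which is where the parallel speedup comes from.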

I expect the relationship between CPU and GPU to largely be a symbiotic one: they're good at different things. But I also expect quite a few computing problems to make the jump from CPU to GPU in the next 5 years. The potential order-of-magnitude performance improvements are just too large to ignore.

Posted by Jeff Atwood
37 Comments

We can only hope that we see more non-gaming apps offshored to the GPU. I need a good reason to buy one of those 600 dollar monster cards after all.

Taylor on November 26, 2006 7:16 AM

Your analysis helps to explain how/why videogame consoles are usually at an advantage over PCs for at least the first few years they're out.

Game systems are a lot like IBM's Deep Blue in the sense that they're designed solely to excel at a specific application: games. To this end, they come with a custom hardware bus, and whatever coprocessors are required in order to squeeze every last ounce of performance from each cycle. You're essentially seeing the same thing happen with these advanced video cards: though they might run at relatively lower clock rates and other resources, that entire circuit board is one highly optimized piece of hardware, which in the end can outperform the entire host system - at certain kinds of tasks.

It's not about proving that a gaming PC can match or beat a PS3 or 360 - of course they can, that's generally the case when you throw enough money, memory and MegaHertz at a problem. It's more about giving the consoles their due and recognizing that they're designed differently from the ground-up.

Pravin Wagh on November 26, 2006 7:50 AM

A good example is a sorting competition sponsored by Microsoft Research. A single Nvidia card won against standard microprocessor cores. See http://research.microsoft.com/barc/SortBenchmark/

silicon4ever on November 26, 2006 9:03 AM

it's just a little nit, but Deep Blue had ~500 specialized chess processors, so each specialized chip could do less than half a million moves a second.

eas on November 26, 2006 10:11 AM

"But I also expect quite a few computing problems to make the jump from CPU to GPU in the next 5 years"

It's at this point we start wondering why AMD bought ATI. Prolly not so they could sell graphics cards.

Factory on November 26, 2006 10:44 AM

Yep, those GPUs are great. However, let's not get carried away. Modern CPUs excel at scalar, branch-heavy code with random memory access patterns. Any one of those would make an 8800GTX crawl.

"2) CPUs aren't scaling very well right now. GPUs are scaling well beyond Moore's Law speed."

Moore's Law applies to transistor count and has nothing to do with performance.

Jim Battin on November 27, 2006 2:14 AM

Just like the FPU was merged into the CPU, I expect we'll see the GPU merged into the CPU as well. In fact, Intel already produces chipsets with (somewhat primitive) graphics chips. This move, if taken to fruition, could edge NVidia out of the market, were it not that NVidia designs so much faster than Intel. ATI could create the same chipset for AMD.

The last I heard, openGL might also be on the cutting block. There was one, and only one, reason that openGL drivers were included in the last generation of NVidia processors: the guys who write Doom said that they'd not consider a DirectX implementation.

GreggT on November 27, 2006 2:19 AM


I did read that article yesterday... together with this one... Supposedly DirectX 10 will need only a tenth as many instructions sent to the graphics card to get the same job done.

http://tomshardware.co.uk/2006/11/08/what_direct3d_10_is_all_about_uk/page6.html


GPUs are getting amazingly powerful!! If I were at Intel I would be worried... AMD is, and has already bought ATI. That combo will enrich both makers' own base processor development and will yield even more impressive performance. (Warning: wishful thinking going on)

Nice to see the @home projects using the spare power of these home monster processors.

argatxa on November 27, 2006 6:35 AM

"Moore's Law applies to transistor count and has nothing to do with performance."

I get what you're driving at here, but to imply that # of transistors per CPU has no correlation with performance is absurd.

http://en.wikipedia.org/wiki/Moore's_law
--
The most popular formulation is of the doubling of the number of transistors on integrated circuits (a rough measure of computer processing power) every 18 months. At the end of the 1970s, Moore's Law became known as the limit for the number of transistors on the most complex chips. However, it is also common to cite Moore's Law to refer to the rapidly continuing advance in computing power per unit cost.

Jeff Atwood on November 27, 2006 6:42 AM

"But I also expect quite a few computing problems to make the jump from CPU to GPU in the next 5 years"

why? with the quantity of cores increasing, why wouldn't you just throw those computing problems on another core of the CPU?

David on November 27, 2006 6:55 AM

Incidentally, it's not entirely correct to say 3DNow, MMX etc. is like Stream processing. These are single-instruction-multiple-data (SIMD) operations like adding two matrices to one another.

Stream processing involves multiple-instruction-multiple-data processing. Each processor is fed a (necessarily) small group of instructions called a "kernel". So each processor is in effect executing a loop consisting of a small group of SIMD instructions. The NVidia 8800 has 128 processing elements, so that's a lot of loops running at once!

The only fly in the ointment is that - the last time I heard - graphics cards didn't use IEEE floating point so you have to be very careful about round-off errors and so on.

Bryan on November 27, 2006 7:02 AM
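
Bryan's round-off warning can be illustrated with a small experiment. This is only a sketch under a stated assumption: it uses IEEE single precision (emulated with Python's struct module) as a stand-in for reduced-precision GPU arithmetic. The non-IEEE behavior of actual graphics cards is not modeled here, and the helper names are mine:

```python
import struct


def to_single(value: float) -> float:
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def accumulate(step: float, count: int, single: bool) -> float:
    """Add `step` to a running total `count` times, optionally in single precision."""
    if single:
        step = to_single(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single:
            total = to_single(total)  # round after every add, as 32-bit hardware would
    return total


COUNT = 100_000
EXACT = 10_000.0  # 0.1 added 100,000 times, in exact arithmetic

error_single = abs(accumulate(0.1, COUNT, single=True) - EXACT)
error_double = abs(accumulate(0.1, COUNT, single=False) - EXACT)

print(f"single-precision error: {error_single:.6f}")
print(f"double-precision error: {error_double:.12f}")
```

The same sum drifts much further from the exact answer in single precision, which is why long accumulations on a GPU need extra care.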

"with the quantity of cores increasing, why wouldn't you just throw those computing problems on another core of the CPU?"

1) CPUs aren't good at these kinds of problems. As mentioned in the post, SIMD instructions are quite slow relative to what a GPU can do. CPUs are somewhat parallel, whereas GPUs are *massively* parallel, to the tune of 48 or 96 "processors" on today's cards right now. Plus a GPU has many times the memory bandwidth of any CPU.

2) CPUs aren't scaling very well right now. GPUs are scaling well beyond Moore's Law speed. Right off the top of my head, I can tell you that the last three nVidia cards I owned were truly 2x faster than each other [in games], and they were all released less than a year apart.

When was the last time you bought a CPU that *doubled* the speed of your applications? Probably never, unless your last CPU is of 2001 vintage.

Jeff Atwood on November 27, 2006 7:36 AM

the arstechnica article is more (get it??) thorough.

http://arstechnica.com/articles/paedia/cpu/moore.ars/1

basically, everything you think is true, isn't. using some of that Transistor Budget for a full BCD adder/multiplier would finally put a stake in the heart of the MainFrame. and we could all go back to writing COBOL G/L programs.

buggyfunbunny on November 27, 2006 7:46 AM

To answer the question "when was the last time you bought a CPU that *doubled* the speed of your applications?", check out this Sysmark 2004 graph:

http://www.tomshardware.com/2004/03/18/spring_speed_leap/page25.html

The highest value in the "Office Productivity" chart is 204, which means we need a sysmark 2004 score of 102 to prove a true doubling of speed across all applications in SysMark 2004. The slowest processor on the list, the Athlon XP 2600+, has a score of 140. Nobody has benchmarks that go back far enough, but I'd presume a system around the level of a Pentium 4 1.8 GHz or so would dip down to 102.

That review was posted in March 2004, and the P4 1.8 GHz was introduced in July 2001. So it took about *three years* of CPU speed improvements to double performance in typical office applications.

Jeff Atwood on November 27, 2006 8:10 AM

You might be interested in (yet another) Microsoft Research project, Accelerator.
http://research.microsoft.com/research/downloads/Details/25e1bea3-142e-4694-bde5-f0d44f9d8709/Details.aspx

Jonathan de Halleux on November 27, 2006 9:16 AM

My bad: the SysMark 2004 scores are calibrated to a reference system, a P4 2.0 GHz. See page 14 of this PDF:

http://www.bapco.com/techdocs/SYSmark2004WhitePaper.pdf

Thus, a system which scores 200 on the Sysmark 2004 office benchmark will be twice as fast as that system. Duh! The Pentium 4 "Extreme Edition" 3.2 GHz scores 197 on the Tom's Hardware page.

http://www.tomshardware.com/2004/03/18/spring_speed_leap/page25.html

Pentium 4 2.0 GHz - August 27th, 2001
Pentium 4EE 3.2 GHz - November 3rd, 2003

Thus, it took *26* months-- over two years-- for CPU speeds to double actual real world performance in typical office-style applications. At least according to SysMark 2004, which is a fairly solid real-world benchmark.

Jeff Atwood on November 27, 2006 12:42 PM
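
The 26-month figure in the comment above can be checked from the two launch dates it lists. A minimal sketch; the variable names are mine, and the reference score of 100 is implied by the statement that a score of 200 means twice as fast as the reference system:

```python
from datetime import date

p4_2ghz_launch = date(2001, 8, 27)   # Pentium 4 2.0 GHz, the SysMark reference system
p4ee_launch = date(2003, 11, 3)      # Pentium 4EE 3.2 GHz

days = (p4ee_launch - p4_2ghz_launch).days
months = days / (365.25 / 12)        # average month length

REFERENCE_SCORE = 100                # assumption: the reference system scores 100
speedup = 197 / REFERENCE_SCORE      # 197 is the P4EE office productivity score

print(f"{days} days, about {months:.1f} months, for a {speedup:.2f}x speedup")
```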

It's absurd that a conventional microprocessor uses about 100 million transistors to execute a single stream of instructions -- 30 years ago, microprocessors were able to execute a stream of instructions with less than ten thousand transistors!

For the past 15 years we've seen processors use pipelined and superscalar architectures to discover "hidden parallelism" in a single stream of instructions, but that's a losing game.

People saw that Moore's Law was going in this direction back around 1980: the Japanese government predicted that a "Fifth Generation" computer architecture would involve massive parallelism on a single chip. They launched a ten year effort to develop a programming language, hardware architecture and software environment for parallel programming.

Many people think the Fifth Generation project was a failure. Those people are wrong. The Fifth Generation project delivered working hardware and software, and achieved good parallelism for some tasks. There were two reasons why the world didn't care: (1) The world was losing interest in the "Artificial Intelligence" and Logic Programming paradigm it was based on, and (2) Commodity hardware was improving so rapidly in performance to leave them behind.

Today's multi-core processors are the beginning of the real fifth generation. Parallel chips power GPUs and the PS3, as well as advanced radios and network routers. Soon we'll be putting billions of transistors on a die, and software people like us will be struggling to keep up!

Paul Houle on November 28, 2006 3:43 AM

isn't this the reason the ps3 used a multiple core system with specific tasks assigned to different cores such as video rendering and 3-d functions. so it looks like sony already paid ibm to combine them

andrew on November 28, 2006 5:20 AM

Of course, I don't remember where I saw it, but the multi-core issue boils down to two problems: how to exploit parallelism for inherently single task problems (most business software) without running afoul of thread stomping; and its flip side, which is that the trend in coring is to run at lower clock rates. The MIPS of such a machine still goes up, but only if the code can effectively exploit multi-threading. That's going to be the trick.

As to 5th generation; Prolog turned out not to be a general purpose language. Although the Amzi! folk still keep chugging along.

The irony is that Codd invented the RDBMS just before Prolog was created which implemented what amounted to a database: a row in a table is a rule, a rule in Prolog is a row.

buggyfunbunny on November 28, 2006 7:50 AM

Your chess benchmark example is not valid. IBM's hardware had a very simple position evaluator. This allowed them to evaluate an incredible number of positions per second. Fritz (the software you gave the link to) uses a better evaluator which is slower but produces better results. That's why Fritz with its 8 million positions per second is better than IBM's machine with its 200 million positions per second.

Alexei on November 29, 2006 5:13 AM

buggyfunbunny -- You can write imperative programs in prolog just fine; the question is "would you want to?"

Imperative programs can be confusing in prolog; often you need to use a logical failure to implement the flow of control when an imperative operation succeeds, or use logical success when an imperative operation fails. Your head gets twisted into knots quickly.

Prolog is more powerful than a relational database, because it can do reasoning to chain multiple rules together. As a result, it's less scalable than a relational database -- like all AI systems, it starts to crumble when there are more than about 10,000 rules.

Warren discovered that Prolog could be executed quicker than most people would expect, but it's never going to be as fast as Fortran for numerical work.

Early on people had hope that Prolog could be parallelized, but it turned out that Prolog's semantics are too powerful for parallelization. The Japanese invented a language, KL1, with weaker semantics. They built a KL1 runtime that got excellent parallelization for some workloads, and moderate parallelization for others. It was never compelling enough to catch on in the real world.

I see the multi-core transition going in two directions: running today's applications faster and with less power, and enabling tomorrow's applications.

I don't think there's a lot of pressure to speed up word processors. Spreadsheets can certainly be parallelized, as can databases. Today's business apps are increasingly database-driven web sites, and these take to parallel computers like ducks to water. (Sun's Niagara processor wipes the floor with the competition when it comes to web apps.) Games and other multimedia applications benefit marvelously from multi-core systems, as do scientific applications.

Multi-core systems (and other parallel processors) will open up all kinds of new applications:

* software radio -- digital radio systems that do d/a at RF rates and are fully programmable
* network processors -- programmable network routers, intrusion detection systems, SAN routers, TCP offload engines
* "perception engines" -- parallel processing will enable new applications in machine vision, machine learning, speech recognition, pattern recognition, et al. Many of the aims of the 5th generation project ~will~ be realized this time around, but by different means. Rule-based programming is dead, replaced by machine learning techniques that are structured more like scientific codes (matrix math for the support vector machine) or like database systems (multidimensional search for k-nearest neighbors)

Paul Houle on November 29, 2006 6:13 AM

Mr. Houle -- You can write imperative programs in prolog just fine; the question is "would you want to?"

No. But I've had to work on an application made with a Prolog mutant (not Amzi!), which was built by COBOL coders. Let's just say it was the worst of both worlds.

I found the multi-core discussion, again. Trip over to Artima. A number of threads running at the moment; java-centric so you MicroSofties be warned. The issue applies to any function (as opposed to OO) based threading semantics. Holub discussed this years ago. Got generally roasted for being "too negative", but that's still the core issue.

buggyfunbunny on November 29, 2006 8:06 AM

I hope you're right about seeing that jump to doing more on GPUs. I hope that for a very specific reason: I work for Peakstream (www.peakstreaminc.com) and we specifically write software that schedules large matrix calculations on GPUs :-)

Our trick to good performance in these operations is pretty simple: do large, SIMD-type operations, and then have what amounts to a JIT compiler to get everything scheduled on one or more GPUs and/or CPUs. Well, okay, it's simple to say. Writing it takes a bit more work.

Noah on November 30, 2006 7:16 AM

For people wanting to do more general purpose programming on the GPU: I am the originator of an open-source shader meta-programming framework that generates shader code from C# at runtime. This eliminates the need for a shading language. There's an alpha release and a getting started guide over at my website.

Do let me know what you think.

Ananrth B. on November 30, 2006 12:11 PM

Yeah, I had the idea to do this as soon as GPUs started getting powerful, but it seemed too tough to get the GPU to execute arbitrary code, so I abandoned the idea.

wkerney on December 1, 2006 4:56 AM

You are wrong in stating that "Ten year old custom hardware is still 25 times faster than the best general purpose CPUs". You are comparing a single *processor* made today with a massively-parallel *machine* that had *thirty* CPUs and 480 specialized chess chips. The fact of the matter is that it is today's CPUs that are orders of magnitude faster than 10 year old hardware, not the opposite.

Ricardo on December 2, 2006 3:40 AM

Neural circuit simulations are an unbelievable fit with GPU computation architecture - our research group, Evolved Machines, applies large scale neural circuits to sensory problems and is working with the G8800 now - will be announcing something soon -

Paul Rhodes on December 2, 2006 7:48 AM

Two of ATI's upcoming R600 video cards in a SLI configuration deliver 1 teraflop of performance:

http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~116238,00.html

A teraflop is one trillion floating point operations per second. In 2002, it took 50 of the world's fastest computers to build a 1 teraflop machine:

http://www.zdnet.com.au/news/business/soa/Australian_astronomers_get_1_Teraflop_supercomputer/0,139023166,120267896,00.htm

Jeff Atwood on March 2, 2007 3:50 AM

My dollar is on the GPU!

Glen on August 20, 2007 4:10 AM

Jeff,
You can't fairly compare ATI's R600 GPU to earlier supercomputers when talking about operations-per-second for two reasons.

1) When measuring FLOPS of supercomputers, they are almost always referring to 64-BIT DOUBLE PRECISION floating point operations, NOT single precision operations which is what graphics card makers (and Sony regarding their CELL BE) love to throw around.

2) Just as importantly, the architectures are completely different... The supercomputers achieving a teraflop are using general purpose processors executing complex code, whereas the massively parallel GPU is executing massively parallel graphics operations. Can these really be compared?

sw on March 16, 2008 11:58 AM

Fortran?? Sorry pal, the 1970s ended in approximately 1982 more or less, when Roland released the TB-303. Think a bit and C for yourself...

If you are talking numerical processing, let's talk machine-language libraries running over specialized hardware. This is not a language issue, unlike producing specialized administrative and business software, where you need to deal with large databases and model your client's (often contradictory and incomplete) needs in less time than you would like to.

NIC1138 on March 24, 2008 6:38 AM

Just waiting for an Nvidia CUDA/ATI driver for MSSQL, so I'll upgrade my web db server with a 3d graphics card, or use an xbox for parallel processing :)

Basically, we have a small number of commands that deal with a bunch of data. It's like it was made for SIMD processors, like a GPU, where the whole graphics hw engine can be programmed.

Didn't I read some time ago that some folks at MIT or another university adopted graphics hw for db tasks? Have to find the link...

hhrvoje on June 18, 2008 5:57 AM

The bulk of the transistor counts on modern CPUs is cache memory, and so provides something of a false picture of CPU complexity. I'm not sure what the bulk of the transistor count is on GPUs.

Hamilton Lovecraft on February 6, 2010 9:51 PM

"The last I heard, openGL might also be on the cutting block. There was one, and only one, reason that openGL drivers were included in the last generation of NVidia processors: the guys who write Doom said that they'd not consider a DirectX implementation."

That's not credible, given the existence of NVidia and ATI graphics chips on Macs.

Hamilton Lovecraft on February 6, 2010 9:51 PM

"isn't this the reason the ps3 used a multiple core system with specific tasks assigned to different cores such as video rendering and 3-d functions. so it looks like sony already paid ibm to combine them"

No, PS3 has a relatively conventional NVidia 3D chipset in addition to the cell processors. The purpose of the 3D hardware is to draw the pretty pictures, and the purpose of the cell processors is to let them claim the machine runs at 2+ TFLOPS.

Hamilton Lovecraft on February 6, 2010 9:51 PM

"It's absurd that a conventional microprocessor uses about 100 million transistors to execute a single stream of instructions -- 30 years ago, microprocessors were able to execute a stream of instructions with less than ten thousand transistors!"

It's absurd that you need a multiply operation to multiply two numbers -- early microprocessors were able to use the add instruction in a loop to do multiplication!

As previously noted, a modern CPU core is ~15 million transistors exclusive of L2 cache. The thousand-fold increase in transistor count from the good old days includes making all the registers 4x wider, making instructions that took 4 to 8 cycles execute in a single cycle, adding a floating point unit, pipelining the system, adding branch predictors, and in general making the system over 10,000 times faster.

Hamilton Lovecraft on February 6, 2010 9:51 PM

I set up a webpage about the 2n3904 transistor if anyone is interested.
http://www.2n3904.net

Joshua Dungan on August 5, 2010 7:10 PM
