This lecture on the Programming Track for GDC 2006 will be given by Pete Isensee. I haven't been doing enough C++ lately - it's been crowded out of my schedule by so much other technology - Web 2.0 experiments, SVG and W3C reading and ... well, surfing.
I really enjoyed the talk a few years back on how they ported DirectX to the original Xbox and I also got a kick out of a GDC talk on advanced compiler optimizations for C++. I think this one could be right up my alley. Of course I like to be an informed attendee, so I'm going to do some research on the issue raised here and get some background. That should get me ready to get the most out of this lecture - if I get it into my schedule. The list of sessions I'm building up here goes along with Jeff's post on Scrum yesterday and mine on TDD the other day.
It sounds like the issues they're talking about revolve around optimizing for a few deep instruction pipelines in a single processor versus multiple processing units with shallow pipelines. The differences here are in parallel code execution and the cost of branches.
Any time you have a problem that can be broken up for parallel processing, you can reduce code execution time by splitting the processing across multiple processing cores. Multiple cores used to mean having more than one physical CPU package on a board - I'll use the term "multiprocessor" for this. This was a great way to improve the speed of multitasking. The reasoning was simple: multitasking means running several applications at the same time which have very little or no dependence on each other. Having few interdependencies makes it easier to share the information they need to share and easier to divide up the tasks. A simple example would be running Winamp at the same time as Lotus 1-2-3 (if we're going old school, then I'm going all the way back). Winamp doesn't care a whit about what's in your spreadsheet and Lotus has nothing to do with decoding an MP3 data stream. Great, there's your multiprocessor multitasking solution.
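To make "splitting the processing" concrete, here's a minimal C++ sketch of a sum divided across worker threads. The function name `parallel_sum` and the way I chunk the data are my own assumptions for illustration, not anything taken from the lecture. Each worker gets its own slice and its own output slot, so the threads have nothing to fight over until the final add.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` by handing each worker thread its own slice.
long long parallel_sum(const std::vector<int>& data, unsigned workers) {
    assert(workers > 0);                          // at least one worker is required
    std::vector<long long> partial(workers, 0);   // one result slot per worker
    std::vector<std::thread> pool;
    const std::size_t chunk = (data.size() + workers - 1) / workers;  // ceiling division

    for (unsigned w = 0; w < workers; ++w) {
        // Clamp the slice so trailing workers get an empty range if data runs out.
        const std::size_t begin = std::min(data.size(), w * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        pool.emplace_back([&data, &partial, w, begin, end] {
            partial[w] = std::accumulate(
                data.begin() + static_cast<std::ptrdiff_t>(begin),
                data.begin() + static_cast<std::ptrdiff_t>(end), 0LL);
        });
    }
    for (auto& t : pool) t.join();                // wait for every slice to finish
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Calling `parallel_sum(values, std::thread::hardware_concurrency())` would be one way to size the pool, though that call can return 0 when the core count is unknown, so a real caller would need a fallback.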
System-On-Chip (SoC) technologies started showing up once chip geometries went from really really small to really really really small. I don't know the actual trace dimensions that make it viable, but the idea is simply that once you can etch all the functions that happen on a chip into a small enough area on the die, you have enough space left over to actually put all the functions of another chip in there too. So in the case of CPUs, this meant that functions that used to be outside, like a cache memory and the memory controller (to access external RAM), started to migrate inside the CPU. This saves energy since the CPU can use less power to send signals internally instead of sending them across the motherboard to another chip. It also saves space on the motherboard and manufacturing costs since there are fewer chips to place.
Now over to the clock frequency side of things. Clock frequency is a spec measured in GHz and really refers to how many clock cycles happen in a second inside the chip. Clock cycles are used in a chip to synchronize loading instructions or data into the CPU processing core from cache memory or external devices. More Hertz (be they mega, giga, or tera) means more instructions or data moving in a second. In the simple case, you get an instruction, like add, and you get some data, like numbers, then you perform the instruction on the data and go put the result somewhere. When the CPU core frequency goes up, then we can just add numbers faster. If the output device, like a hard drive, video card, or pretty much anything (even your rockin' fast PC4200 RAM) can't keep up with the CPU, then we can just speed on ahead and buffer the results until the device is ready for them. If the processor core is really fast, then it could end up waiting for the cache memory to supply instructions and data to work with. Then what? While an instruction is being processed, more instructions and data can be fetched and lined up behind it in what's called a pipeline (strictly speaking, a pipeline splits each instruction into stages so that several instructions can be in flight at once, but a queue of waiting work is a fair mental picture). The longer the pipeline is, the less likely it is that the processor core will be starved for instructions to execute.
If all your instructions were like add, then life would be simple and things would go faster just by ratcheting up the clock speed of the CPU. The processing core can fetch instructions out of the pipeline as fast as it can and the pipeline can just keep getting filled up with instructions and data so the processing core never has to wait. The problem comes with instructions like if. When the CPU comes to an if, then it has to decide where to go next in the code. The code branches caused by an if instruction can sometimes be guessed before the processing core is ready to execute them. This way the pipeline can keep getting filled with code and data from one side of the branch. When the guess is wrong, the pipeline has to be flushed and the processing core sits idle while instructions and data from the other side of the branch are fetched. This makes a fast CPU slow down to be as slow as the memory controller. It sucks. The deeper the pipeline is, the worse this sucks. So what this means is that if, between the CPU and your compiler, you can do perfect branch prediction then deep pipelines are a great optimization. The reality is that you never get perfect branch prediction so there's going to be a point where deeper pipelines are likely to cause slowdowns.
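Here's a small C++ sketch of what dodging a hard-to-predict branch can look like. Both function names are made up for this post. The first version has an if inside the loop that the CPU has to guess on every element; the second turns the comparison into a 0-or-1 multiplier so there is no data-dependent branch in the source. Take it with a grain of salt: an optimizing compiler may well turn the first version into branch-free code on its own, and the second version does extra arithmetic, so the only way to know which is faster on a given chip is to measure.

```cpp
#include <vector>

// Branchy version: the CPU has to predict the `if` for every element.
long long sum_above_branchy(const std::vector<int>& v, int threshold) {
    long long total = 0;
    for (int x : v) {
        if (x >= threshold) total += x;
    }
    return total;
}

// Branch-free version: the comparison yields 0 or 1, which scales the addend.
long long sum_above_branchless(const std::vector<int>& v, int threshold) {
    long long total = 0;
    for (int x : v) {
        total += static_cast<long long>(x) * (x >= threshold);
    }
    return total;
}
```

With random data the if is close to a coin flip, which is the worst case for a predictor; with sorted data the same if becomes almost perfectly predictable.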
Anyhow, as feature geometries on the chip keep shrinking, CPU manufacturers need something to do with all that space. At the same time CPU performance has to keep up with Moore's Law or some variation of it. Higher clock speeds mean deeper pipelines and that can't improve things forever.
So the logical next step is to include another processing core inside the CPU package. Well, I hope I painted that as logical. I may have left out a couple steps.
From the way your application software sees things, having multiple cores inside one physical processor is very similar to a multiprocessor system. Only the drivers underneath need to know the details. Microsoft included processor affinity information in the Windows Driver Model (WDM) years ago in order to support multiprocessor systems, so WDM drivers should at least work on multicore systems. I've never done drivers for Linux but I know that it's been deployed on multiprocessor systems for years as well, so I don't expect multiple cores were a huge issue there either.
Working with multiple cores and benefitting from multiple cores are two different things, of course. Back to the Winamp and Lotus 1-2-3 example earlier, when the song is over in Winamp, the second processor sits there with nothing to do. The situation can be the same with multicore processors. Splitting multiple applications across multiple processors or multiple cores can be pretty straightforward. Splitting a single application across multiple cores can involve a complete rewrite. On a desktop PC, we run many applications whether we realize it or not. If an application just uses one core then another application can be run on the other.
Still with me? I'm coming around to what all this means to the lecture that I started looking at.
A gaming console just runs one application: the game you're playing. So the CPU in a gaming console is getting faster in the same way that desktop computer processors got faster - by adding more cores. The big problem is that the games written for previous consoles have been optimized for deeper pipelines. This can be at odds with the goal of parallel code execution. So what's a console developer to do? Spawn more tasks. I expect that at least part of this lecture will be focused on writing multi-threaded code, how to synchronize code between threads, and how to avoid deadlocks when both threads need a resource.
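As a taste of the deadlock problem, here's a minimal C++ sketch where two threads each need the same two resources but ask for them in opposite order. The `Resource` struct and both function names are hypothetical, invented for this example. Locking the two mutexes one after the other could leave each thread holding one lock and waiting on the other; `std::lock` from the standard library acquires both using a deadlock-avoidance algorithm instead.

```cpp
#include <mutex>
#include <thread>

// Hypothetical shared resource guarded by its own mutex.
struct Resource {
    std::mutex m;
    int value = 0;
};

// Moves `amount` between two distinct resources while holding both locks.
// Assumes `from` and `to` are different objects.
void move_value(Resource& from, Resource& to, int amount) {
    std::unique_lock<std::mutex> a(from.m, std::defer_lock);
    std::unique_lock<std::mutex> b(to.m, std::defer_lock);
    std::lock(a, b);          // takes both locks without risking deadlock
    from.value -= amount;
    to.value += amount;
}                             // both locks are released here

// Runs two threads that name the resources in opposite order.
int run_transfers(int rounds) {
    Resource x, y;
    x.value = 1000;
    std::thread t1([&] { for (int i = 0; i < rounds; ++i) move_value(x, y, 1); });
    std::thread t2([&] { for (int i = 0; i < rounds; ++i) move_value(y, x, 1); });
    t1.join();
    t2.join();
    return x.value + y.value; // the total is conserved if the locking is correct
}
```

The other classic fix is to agree on a single global lock order and stick to it everywhere, which costs nothing at runtime but relies on every programmer on the team remembering the rule.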
One of the links on the home page of Pete Isensee (the lecturer) points to a talk he gave on OpenMP. OpenMP is a specification for shared-memory parallel programming, the kind you do on SMP machines. The presentation he gave looks like it could be a forerunner to the one he'll give this coming March. It sounds like Pete has some solid background in the area, so this lecture is definitely going on my list.
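For anyone who hasn't seen OpenMP, here's a minimal C++ sketch of its flavor. This is my own toy example, not something from Pete's talk, and `sum_of_squares` is a made-up name. The `parallel for` directive asks the compiler to divide the loop iterations among threads, and the `reduction` clause gives each thread a private copy of `total` that gets combined at the end. It assumes a compiler with OpenMP switched on (for example `-fopenmp` on GCC); without that, the pragma is ignored and the loop just runs serially with the same result.

```cpp
#include <cstddef>
#include <vector>

// Sums the squares of `v`, in parallel when OpenMP is enabled.
long long sum_of_squares(const std::vector<int>& v) {
    long long total = 0;
    // A signed loop counter keeps this compatible with older OpenMP versions.
    const long long n = static_cast<long long>(v.size());
    #pragma omp parallel for reduction(+ : total)
    for (long long i = 0; i < n; ++i) {
        const long long x = v[static_cast<std::size_t>(i)];
        total += x * x;       // each thread adds into its own private copy
    }
    return total;
}
```

The appeal is that the serial loop stays readable: the threading is one line of annotation rather than a rewrite.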
[...] C++ On Next-Gen Consoles: Effective Code For New Architectures (GDC description) [...]