Overcoming the Challenges of Developing Programs for the Cell Processor
In June 2008, Los Alamos National Lab announced the achievement of a numerical goal to which computational scientists have aspired for years—its newest Linux-powered supercomputer, named Roadrunner, had reached a measured performance of just over one petaflop. In doing so, it doubled the performance achieved by the world's second fastest supercomputer, the Blue Gene/L at Lawrence Livermore National Lab, which also runs Linux. [See James Gray's “The Roadrunner Supercomputer: a Petaflop's No Problem” on page 56 of this issue.]
The competition for prestigious spots on the list of the world's fastest supercomputers (www.top500.org) is serious business for those involved and an intriguing showcase for new technologies that may show up in your data center, on your desk or even in your living room.
One of the key technologies enabling Roadrunner's remarkable performance is the Cell processor, collaboratively designed and produced by IBM, Sony and Toshiba. Like the more-familiar multicore processors from Intel and others, the Cell incorporates multiple cores so that multiple streams of computation can proceed simultaneously within the processor. But unlike multicore, general-purpose CPUs, which typically include a set of four, eight or more cores that behave essentially like previous-generation processors in the same family, the Cell has two different kinds of processing cores. One is a general-purpose processor, referred to as the Power Processing Element (PPE), and the remaining (eight in the current configurations) are highly optimized for performing intensive single-precision and double-precision floating-point calculations. These eight cores, called Synergistic Processing Elements (SPEs), are capable of performing about 100 billion double-precision floating-point operations per second (100 Gflops).
The Cell processor also powers the Sony PlayStation 3 and is available from IBM in blade format for clusters and the data center. With a wide range of systems using the Cell, it is interesting to speculate on the importance it will have on the market. Although a number of significant factors beyond flop ratings will contribute to its success or failure, traditional considerations, such as price, availability and reliability, still will be important. Power efficiency and the difficulty or ease of adapting applications to take advantage of the architecture also are critical factors. If only a slim subset of applications can take advantage of a processor, or if application adaptation requires an investment of time and energy disproportionate to its benefits, how widely will it be adopted?
Recognizing the significant interest in the Cell and the accompanying concern about programmability, TotalView Technologies recently introduced a version of the TotalView debugger specifically adapted for debugging on the Cell. Doing so required significant changes to the way that the debugger models what takes place within threads and processes in order to provide computational scientists and developers with a clear representation of what is happening in a Cell program.
This article outlines the challenges faced in writing or porting applications to the Cell, some of the elements of the Cell environment and typical Cell programs that make understanding what is happening especially difficult, and specific actions users can take to develop and troubleshoot Cell programs effectively. Many of the same issues encountered in debugging Cell applications also arise with other techniques, such as general-purpose computing on graphics processing units (or GPGPU programming).
The architecture of the Cell presents two general challenges software developers must solve when writing new software or porting existing software to the Cell. The first challenge is breaking the key components of a program into small chunks that can execute on the eight SPEs. In numerical programs, this often means dividing the data into small independent units. In other cases, individual tasks or pipeline stages may be delegated to specific SPEs. This issue of problem decomposition is similar to that of adapting an application to a distributed memory cluster environment, although the granularity is different due to the limited memory space directly available to each SPE.
Each of the eight SPE cores that is part of the Cell processor has its own independent registers and a small amount of local memory (256KB) used for storing instructions and data. This memory acts similarly to a cache in a general-purpose processor, in that it has a limited size and can be read and written very quickly. Unlike a cache, however, its contents are managed directly by the application. The code units running on the SPE elements are allowed to initiate direct memory access operations that can copy chunks of data from the main memory into the SPE local store or back to main memory from the local store. The same memory is used for the machine instructions and global memory, heap memory and stack. The heap and the stack change size over time, and the programmer must take care to avoid collisions between these different memory ranges, because there is no memory protection. This explicit management of memory is the second major challenge in designing Cell implementations. Because each processor has a completely separate local store, the structure of a program for the Cell is very different from the structure of a traditional multithreaded application, where all the threads share the same memory.
Developers who want to write code for the Cell processor will need to find ways to address these two issues in their programs. Architecture abstraction layers, like the one provided by RapidMind, may allow developers to write code that will run on a range of processors, including the Cell. If developers are going to obtain the best possible performance, however, they need to make use of the control obtained by programming for the SPEs themselves, at the expense of longer and more complex code.
The Cell provides exceptional performance by making it possible to achieve and sustain a high level of concurrent execution. Each SPE element can execute simultaneously and asynchronously, and has 128-bit wide registers and support for vectorized instructions that can apply a mathematical operation synchronously to more than one number at a time. Performance gains do come with trade-offs, however, including increased complexity and unpredictability. When working with a serial application in which all computations happen in a single sequence, the system tends to be highly predictable. Given a set of input conditions, a serial program tends to behave the same way every time, even if it is malfunctioning.
Concurrent applications, especially malfunctioning concurrent applications, may behave in different ways if the sequence of operations in the various threads differs from run to run. These differences lead to elusive race conditions when only some sequences provide the correct behavior. In order to eliminate race conditions, developers can use synchronization constructs, such as barriers, to constrain the set of execution sequences whenever two or more threads need to read or write the same data. Synchronization needs to be applied very carefully, as its misuse will result in reduced performance or deadlocks.