Overcoming the Challenges of Developing Programs for the Cell Processor
TotalView is a highly interactive graphical source code debugger that provides developers working in C, C++ or FORTRAN with a way to explore and control their programs. Originally designed to debug one of the first distributed memory architectures, the BBN Butterfly, TotalView has since been enhanced continually with a focus on multiprocess and multithreaded applications. Today, TotalView is used to develop, troubleshoot and maintain applications in a wide range of situations. One group uses it to develop Linux-based commercial embedded computing applications consisting of a single process with a hundred or more separate threads that simultaneously interact with a graphical user interface (GUI), sensors, on-line databases and network services. Other users troubleshoot sophisticated mathematical models of complex physical systems in astrophysics, geophysics, climate modeling and other areas. These models typically consist of thousands of separate processes that run on large-scale high-performance computing (HPC) clusters on the top 500 list. In both cases, TotalView provides users with a debugging session in which they can examine in detail any one of the many threads or processes that make up the application, control each thread or process singly or as a part of various predefined and user-defined groups, and display multiple types of data and state information from across many processes and threads.
The version of TotalView for the Cell applies these same user-interface techniques and capabilities to the multiple threads of execution that are running on the PPE and SPEs within each Cell processor. It also extends to Cell applications that run as distributed memory parallel applications on clusters of Cell BE blades. In order to extend this interface, however, it was necessary to extend significantly the definitions and models of processes and threads that the debugger uses internally to represent what is happening.
TotalView's traditional model for programs has three tiers: threads, processes and groups of processes. A thread is a single execution context, with its own stack (and, therefore, its own copy of any automatic or local variables) and its own state (either running or stopped). A thread also may have additional characteristics, such as a handle to “thread private” memory, an identity related to a specific OpenMP construct in another thread that is part of the application, or an operating system-assigned thread ID. A thread shares a single virtual memory address space, including both text (instructions) and data (global variables, heap memory and stack memory) sections, with the other threads that may make up a process.
A process is made up of one or more threads that are all executing a single executable “image” that represents a program. The image involves the base program and a list of library images that are loaded either at launch or during runtime. A process also has a number of resources attached to it that are managed by the operating system, including a virtual memory address space (shared with its threads), file handles, parent and/or child relationship with other processes and an operating system-assigned process ID.
Groups are made up of one or many processes running on one or more machines. These processes can be all executing copies of the same program image or, more rarely, even a different image. TotalView notes which processes in a group share the same image and shares and/or reuses information wherever possible. When the user sets a breakpoint on a line of code within a group, TotalView ensures that the breakpoint appears at the right place in each process that shares the same image, even if those images have different offsets (which can happen due to randomized library load addresses across the many nodes of a cluster).
In order to support the Cell processor correctly, several changes to this model were necessary. Most significant, the SPE threads that make up a Cell process have a more distinct identity than threads created on a traditional processor. The SPE core uses a different instruction and register set from the PPE core, so the Cell version of TotalView can handle both, transparently switching between them depending on the kind of thread under operation. The SPE thread actually executes in a separate local memory space, so all the characteristics of memory space traditionally associated with processes are now associated in TotalView with threads on the Cell. As a result, looking up a variable or function in any specific SPE thread or in the PPE thread will give very different results, depending on the variables and functions that exist within the memory local to that thread.
And, this is important to represent clearly what actually is happening in the program. Depending on the Cell executable's construction, the SPE threads may look like separate programs (which happen to do all of their IO via DMA calls) with their own main() routine. Much of the key information about the code running in the SPE threads is different from the PPE process that contains it. User-defined data types, functions and global variables that accidentally or intentionally have the same name in both the SPE and PPE probably refer to distinct things, and the code may be different between two different SPE threads that are part of the same process.
A single Cell program may load many distinct images at the same time to run on different SPEs, and a breakpoint set on a specific line of a specific SPE thread may or may not need to be duplicated across other threads (the user can choose). It certainly shouldn't be set at that same address in threads that don't execute that line of code there. Because the threads are executing and issuing DMA memory read and write requests and synchronization operations around those requests, there is a significant opportunity for race conditions and deadlocks. TotalView provides a set of thread control commands that allow developers to approach these problems systematically by exploring specific sequences of execution among the various SPE and PPE threads.
The intensive explicit memory management required for each unit of data moved in and out of the SPE local memory means that developers are concerned with low-level issues of data representation, layout and alignment. As they are moving units of memory from the PPE to the SPE and back again, it is quite common to want to examine the data on both sides of the transfer. TotalView provides the ability to explore the data in any thread in exactly the same terms as it is being used within that thread. This is a significant benefit. One computational scientist noted that before TotalView was available on the Cell, he had to perform offset calculations with a calculator every time he wanted to look at a variable on the SPE.
To understand the data that has been transferred to or from the SPE threads, scientists and programmers can navigate data structures and large arrays, following pointers as if they were HTML links in a browser, and sort, filter and graph the data they find.
The TotalView process group model applies to the version on the Cell with very little modification. Users can work freely with large groups of processes running on one or many blades. The only change is a more nuanced model that allows breakpoints to be shared between threads that are all executing the same image, rather than between processes executing the same image.
The changes made to TotalView in order to provide for clear troubleshooting on the Cell processor highlight the architectural distinctions that are uniquely challenging for scientists and programmers writing software for the Cell. The capabilities that were developed in response to these distinctions make the process of adapting software to the Cell a bit less daunting for those scientists and programmers.
These lessons may apply in other situations. For example, a program written for the Cell somewhat resembles a program written to take advantage of graphics co-processors as computational accelerators. This technique, called GPGPU (General-Purpose Graphics Processing Unit) programming, also involves writing small bits of code, called kernels, which will execute independently on specialized hardware using a separate device memory for local store. GPGPU programming introduces the additional notion of extremely lightweight context swapping and encourages developers to think of creating thousands of threads, where each might operate on a single element of a large array (a stream, in GPGPU context). Many of the same programming and debugging challenges exist for this model, and future debuggers for GPGPU should give developers clear ways of examining the stream data in both main memory and the GPU store by accurately representing what is happening in the GPU, and displaying the state of thousands of threads to the user.
Developers who work with multithreaded applications on other platforms can thank their caches that they don't need to deal with explicitly loading data into and out of each thread. They can concentrate instead on the challenge of finding the right level of concurrency for their application, their processor and its cache.
Chris Gottbrath is Product Manager for the TotalView debugger and MemoryScape product lines at TotalView Technologies, where he focuses on making it easier for programmers, scientists and engineers to solve even the most complex bugs and get back to work. He welcomes your comments at email@example.com.