How We Should Program GPGPUs

Advanced compilers can simplify GPU and accelerator programming.
Programming GPUs—a Better Way

Current approaches to programming GPUs still are relatively immature. It's much better than it was a few years ago, when programmers had to cast their algorithms into OpenGL or something similar, but it's still unnecessarily difficult.

Programmers must manage (allocate and deallocate) the device, deal with the separate host and device memories, allocate and free device memory, and move data from and to the host, manage (upload) the kernel code, pack up the arguments, initiate the kernel, wait for completion and free the code space when done. All this is in addition to writing the kernel in the first place, exposing the parallelism, optimizing the data access patterns and a host of other machine-specific items, testing and tuning. A matrix multiplication takes a few lines in FORTRAN or C. Converting this to CUDA or Brook takes a page or more of code, even when making simplifying assumptions. One might question whether there is a better way.

Compilers are good at keeping track of details and should be taken advantage of for that as much as possible. Is there anything specific to a GPU that makes it a more difficult compiler target than vector computers or attached processors, both of which had very successful, aggressive compilers? Could a compiler be created that would generate both host and GPU or accelerator code from a single source file, using standard C or FORTRAN, without language extensions?

I think it's feasible (though nontrivial), and a good idea. Here, I discuss what such a compiler might look like and what steps it would have to take. Two overriding goals are that the compiler operates just like a host compiler, except perhaps with a command-line flag to enable or disable the GPU code generator, and that no changes are needed to the other system tools (such as a linker and library archiver).

A significant difference between such a compiler and one for a vector computer has to do with the cost of failure. If a compiler fails to vectorize a specific loop, the performance cost can be a factor of five or ten, which is enough that a programmer will pay attention to messages from the compiler. If a compiler does a bad job of code generation for a GPU, the cost can be a slowdown (relative to host code) of a factor of ten or 100. This is enough that a fully hands-off, automatic approach just isn't feasible, at least not yet. At all steps, a programmer must be able to understand what the compiler has done and, if necessary, to override it.

Figure 3. Five-Step Flowchart

The first, and perhaps most important step is to select what part or parts of the program should be converted to a kernel. Currently, that is done explicitly by a programmer who rips out that part of the program, replaces it with the code to manage the GPU, writes a kernel to execute on the GPU and combines it all into a single program.

Abstractly, we can use compute intensity to determine the parts of the program that are attractive for GPU acceleration. Compute intensity for a function, loop or block of code is the ratio of the number of operations to the amount of data that needs to be moved. For GPU computing, we are most limited by the host-GPU bandwidth, so the critical ratio is the amount of data that needs to be moved from the host to GPU and back, divided into the number of operations that the GPU will execute. If the ratio is high enough, it's worth the cost of the data movement to get the high compute bandwidth of the GPU, assuming the computation has enough parallelism.