How We Should Program GPGPUs
Listing 3. Fortran Matrix Multiplication Loop, Tagged to Be Compiled for the Accelerator
!$acc begin do i = 1,n1 do k = 1,n3 c(i,k) = 0.0 do j = 1,n2 c(i,k) = c(i,k) + a(i,j) + b(j,k) enddo enddo enddo !$acc end
Although a compiler may be able to determine or estimate compute intensity, there are enough issues with GPU computing that it's better to leave this step to the programmer. Let's suppose a programmer can add a pragma or directive to the program, telling the compiler that a particular routine or loop or region of code should be compiled for the GPU.
The second step is data analysis on the region: what data needs to be allocated on the device memory and what needs to be copied from the host and back to the host afterward? This is within the scope of current compiler technology, though peculiar coding styles can defeat the analysis. In such cases, the compiler reports usage patterns with strange boundary conditions; usually, it's easy to determine where this comes from and adjust the program to avoid it. In many cases, it arises from a potential bug lurking in the code, such as a hard-coded constant in one place instead of the symbolic value used everywhere else. Nonetheless, the compiler must have a mechanism to report the data analysis results, and the user must be able to override those results, in cases where the compiler is being too conservative (and moving too much data, for example).
The third step is parallelism analysis on the loops in the region. The GPU's speed comes from structured parallelism, so parallelism must be rampant for the translation to succeed, whether translated automatically or manually. Traditional vectorizing and parallelizing compiler techniques are mature enough to apply here. Although vectorizing compilers were quite successful, both practically and commercially, automatic parallelization for multiprocessors has been less so. Much of that failure has been due to over-aggressive expectations. Compilers aren't magic; they can't find parallelism that isn't there and may not find parallelism that's been cleverly hidden or disguised by such tricks as pointer arithmetic.
Yet, parallelism analysis for GPUs has three advantages. First, the application domain is likely to be self-selected to include those with lots of rampant, structured parallelism. Second, structured parallelism is exactly the domain where the classical compiler techniques apply. And finally, the payoff for success is high enough that even when automatic parallelization fails, if the compiler reports that failure specifically enough, the programmer can rewrite that part of the code to enable the compiler to proceed.
The fourth step is to map the program parallelism onto the machine. Today's GPUs have two or three levels of parallelism. For instance, the NVIDIA G80 architecture has multiprocessor (MIMD) parallelism across the 16 processors. It also has SIMD parallelism within each processor, and it uses another level of parallelism to enable multithreading within a processor to tolerate the long global memory latencies. The loop-level program parallelism must map onto the machine in such a way as to optimize, as much as possible, the performance features of the machine. On the NVIDIA, this means mapping a loop with stride-1 memory accesses to the SIMD-level parallelism and mapping a loop that requires synchronization to the multithread-level parallelism. This step is likely very specific to each GPU or accelerator.
The fifth step is to generate the GPU code. This is more difficult than code generation for a CPU only because the GPU is less general. Otherwise, this uses standard code-generation technology. A single GPU region may generate several GPU kernels to be invoked in order from the host. Some of the code-generation goals can be different from that of a CPU. For instance, a CPU has a fixed number of registers; compilers often will use an extra register if it allows them to schedule instructions more advantageously. A GPU has a large number of registers, but it has to share them among the simultaneously active threads. We want a lot of active threads, so when one thread is busy with a global memory access, the GPU has other work to keep it busy. Using extra registers may give a better schedule for each thread, but if it reduces the number of active threads, the total performance may suffer.
The final step is to replace the kernel region on the host with device and kernel management code. Most of this will turn into library calls, allocating memory, moving data and invoking kernels.
These five steps are the same that a programmer has to perform when moving a program from a host to CUDA or Brook or other GPU-specific language. At least four of them can be mostly or fully automated, which would simplify programming greatly. Perhaps OpenCL, recently submitted by Apple to the Khronos Group for standardization, will address some of these issues.
There are some other issues that still have to be addressed. One is a policy issue. Can a user grab the GPU and hold onto it as a dedicated device? In many cases, there is only one user, so sharing the device is unimportant, but in a computing center, this issue will arise. Another issue has to do with the fixed size, nonvirtual GPU device memory. Whose job is it to split up the computation so it fits onto the GPU? A compiler can apply strip-mining to the loops in the GPU region, processing chunks of data at a time. The compiler also can use this strategy to overlap communication with computation by sending data for the next chunk while the GPU is processing the current chunk.
There are other issues that aren't addressed in this article, such as allocating data on the GPU and leaving it there for the life of a program, or managing multiple GPUs from a single host. These can all be solved in the same framework, all without requiring language extensions or wholesale program rewrites.