How We Should Program GPGPUs

Advanced compilers can simplify GPU and accelerator programming.

Using GPUs for general-purpose computing is attractive because of the high-compute bandwidth available, yet programming them still is a costly task. This article makes the claim that current GPU-oriented languages, like CUDA or Brook, require programmers to do a lot of busy work and keep track of a lot of details that would be better left to a compiler. I argue that host-based, accelerator-enabled C, C++ and FORTRAN compilers are feasible with current technology and discuss the issues and difficulties with automatic compilation for GPUs and how to address them.

Programming GPUs and Accelerators

Today's programmable GPUs are the latest instantiation of programmable accelerators. Going back to the 1970s, the Floating Point Systems AP-120 and FPS-164 attached processors provided mainframe computational speeds at minicomputer cost. Many other vendors built similar products until the advent of the full mini-supercomputer. Today, we have the Clearspeed board, selling into much the same market. In the embedded world, music players and digital phones use programmable digital signal processors (DSPs) to encode and decode audio. Typically, they use dedicated hardware blocks to handle the video to minimize power requirements. In each case, the accelerator is designed to improve the performance for a particular family of applications, such as scientific computing or signal processing. GPUs are themselves (obviously) designed to speed up graphic shaders.

Figure 1. NVIDIA Architecture Block Diagram

Accelerators are made programmable so as to be as flexible as possible. DSPs can be reprogrammed to handle new encoding standards or to use more efficient algorithms, for example. The programming strategy always is a trade-off between performance and convenience, and efficiency and productivity. Designing a language that is close to the hardware allows a user to write as efficient a program as possible, yet such a program likely will have to be rewritten for each such accelerator. It even may need re-tuning for different generations of accelerator from the same vendor and may bear no resemblance to the corresponding CPU program from which it was ported. On the other hand, using a higher-level language puts performance in the hands of a compiler, which may leave a significant fraction of the potential on the table.

Some arguments in favor of a lower-level language are that many programmers want that level of control; compiler development is expensive and time consuming; hardware moves so quickly that compilers are out of date by the time they are tuned properly; and programmers don't trust compilers to deliver good performance.

There are similarities between these arguments and those used 50 years ago during the development of the first high-level languages. When developing the first FORTRAN compiler, the state of the art of programming was machine or assembler language. The developers at IBM realized they had to overcome a very high-acceptance barrier before anyone would be willing to adopt the language. Where would we be today had they not forged ahead? Although FORTRAN is out of favor with most Linux programmers, I submit that all subsequent imperative languages are descendants of, and owe a debt to, that first FORTRAN effort, which formalized named procedures and variables, and introduced the if keyword and the assignment statement (and what languages have no assignment statement?).

Now, having had some years of experience programming GPUs in various ways, we should look at whether we can use less-specific (and less-arcane), more general-purpose languages, open up GPU programming to a wider audience, and still achieve high performance.

Comparison to Vector Computers

Vector computing was first introduced in the 1960s, but was commercially successful only with the introduction of the Cray 1 in 1976. It may be difficult for less-experienced developers to believe, but the Cray 1's 80MHz clock (12.5ns) was the fastest on the planet, and it was the fastest scalar machine available at the time. In addition, the Cray 1 had a vector instruction set that could deliver floating-point results six to ten times faster than the corresponding scalar code. Because the payoff was so high, programmers were quite willing to invest nontrivial effort to make sure their programs ran as fast as possible.

Figure 2. Cray 1 Supercomputer at the Computer History Museum, Photograph by Ed Toton, April 15, 2007

As with GPUs and accelerators today, the parallelism exploited on a Cray covered a wide range of applications, but it was not universal; it tended to self-select the applications that were ported to it. Moving a program to the Cray may have required changing an algorithm to one that was more amenable to vector processing, changing the data layout and, even with appropriate algorithms and data, may have required recoding the program to expose the parallelism. When it first was introduced, many users decried the effort they had to expend. In response, many optimized library kernels (such as the BLAS) were introduced, which could be written in assembly and presumably were highly optimized—programs that used these kernels for computationally intensive regions could expect good performance.

Some of the Cray 1's contemporaries and competitors introduced new languages or language extensions. The Control Data Cyber 205 exposed its vector instruction set in its version of FORTRAN.

Writing a(1;n) meant a vector starting at a(1) with a length of n. Two vectors could be added, as a(1;n) + b(2;n). Nontrivial operations were handled by a large set of Q8 intrinsics, which generated inline code. The expression q8ssum(a(1;n)) would generate the vector instruction to compute the single-precision summation of the vector a(1;n). Such an approach made programming the Cyber 205 costly, and programs optimized for it were not portable.

The Texas Instruments' Advanced Scientific Computer (TI-ASC) came with an aggressive automatic vectorizing compiler, though the machine was not a commercial success. The Cray Fortran Translator (CFT) was the first widely used vectorizing compiler, and it became the dominant method by which the Cray vector instruction set was used. Many tuned libraries still were written in assembly, but the vectorizing compiler was good enough that most programmers could use it effectively.

Although CFT did an effective job of vectorization, the key to its success was something else. It was one of the first compilers that gave performance feedback to the user. The compiler produced a program listing that included a vectorization table. It would tell which loops would run in vector mode and which would not. Moreover, if a loop failed to vectorize, the compiler gave very specific information as to why not, down to which variable or array reference in which statement in the loop prevented vectorization. This had two effects; the first was that programmers knew what to change to make this loop run faster.

The second effect was that programmers became trained to write vector loops: take out procedure calls, don't use conditionals, use stride-1 array references and so on. The next program they wrote was more likely to vectorize from the start. After a few years, programmers were quite comfortable using the “vectorizable subset” of the language, and they could achieve predictable high performance. An important factor in this success was that the style of programming the compiler encouraged was portable, in that it gave predictably good performance across a wide range of vector computers, from Cray, IBM, NEC, Fujitsu, Convex and many others.

More recently, SSE registers with packed arithmetic instructions were added to the X86 architecture with the Pentium III in 1999. When first introduced, compilers were enhanced with a large set of intrinsics that operated on 128-bit wide data; _mm_add_ss(m,n) would do a packed four-wide single-precision floating-point add on two __m128 operands, which presumably would be allocated to XMM registers.

PGI applied classical vectorization technology in its version 3.1 C and FORTRAN compilers to generate the packed arithmetic instructions automatically from loops. The other x86 compilers soon followed, and this is now the dominant method by which the SSE instructions on the x86 are used. Again, there are library routines written in assembly, but the compilers are good enough to be used effectively. As with the Cray compilers, performance feedback is crucial to effective use. If performance is critical, programmers will look at the compiler messages and rewrite loops that don't vectorize.