Linux and the Alpha

Part 2 brings us optimization techniques for speeding up code to get the best performance from your Alpha or other RISC processor.
Lending gcc a Hand

One reason gcc sometimes generates significantly less efficient code than Digital's GEM compiler is that it does not perform inter-procedural alias analysis. What this means is that gcc's alias analysis is sometimes unnecessarily conservative. To illustrate the problem, consider the function shown in Listing 3. It is a simple unrolled loop that reverses all the bytes in array src and stores the result in array dst.

The problem with this code is that the C compiler has no way of knowing how the src and dst pointer relate to each other. For all it knows, dst could point to the second element of the array pointed to by src. When this happens, the two pointers refer to overlapping regions of memory and they are said to alias each other. For the compiler, it is important to know whether the two pointers overlap, since that determines the degree of freedom it has in reordering instructions. For example, if the regions overlap, then storing a value to dst[i+0] may affect the value of src[i+1]. Thus, to be on the safe side, the compiler must generate the loads and stores in the above function strictly in the order in which they occur in the source code.

Now, if it is known that the two arrays passed to the function never alias each other, we can lend gcc a hand by explicitly giving it this information. We can do this by first reading all the values from the memory, then doing all the computation and finally storing the results back to memory. Thus, the above code would be transformed into the code shown in Listing 4.

Since all of the stores occur at the end, gcc knows immediately that none of the stores can affect any of the preceding loads. This provides it complete freedom in generating the best possible code for the loads and computation (the assumption here is that byterev_long is an in-line function).

On the Alpha and most other architectures with lots of registers (e.g., most RISCs), this kind of code never (or at least very rarely) hurts performance and usually improves performance for gcc. Unfortunately, the same isn't true for the x86 architecture. The problem there is that only a few registers are available. So, the code that's better for the RISCs is usually worse for the x86 due to additional stores and loads that are necessary to access the temporaries that end up on the stack. Performance numbers for the two versions are given in Table 2.

Table 2. Performance Values of Byte Reversal Source Code

As the Table 2 shows, merely by separating loads from computation from stores gains 13% with gcc on the Alpha. In contrast, the same code loses 17% on the P6. This illustrates that while most of the techniques in this paper apply to both architectures, there are important differences that sometimes necessitate different coding to achieve the best performance.

Improving the Memory Access Pattern

Now, let's consider a simple problem: matrix multiplication. While simple, it is typical of a problem that is considered floating-point intensive. In reality, most floating-point intensive problems are also memory intensive. For example, they process large vectors or matrices. Thus, the memory access pattern often plays a crucial role in achieving the best possible performance. The textbook implementation of matrix multiplication looks is shown in Listing 5.

Here, the matrix pointed to by a gets multiplied by the matrix pointed to by b and the result is stored in c. The matrix dimension is passed in argument dim. On the Alpha, a 512 by 512 matrix multiply with this routine executes at about thirteen million floating-point operations per second (MFLOPS). This is not too shabby, but let's see whether we can squeeze more out of the machine. Having learned our lesson in performance optimization, we might try to unroll the inner loop and avoid all multiplications due to indexing. This does indeed result in a faster version: now, matrix multiply executes at about 15 MFLOPS.

gcc's optimizer is unable to eliminate the induction variables and, hence, the multiplications due to indexing. gcc 2.72.1 and earlier have a bug in this area that appears when generating code for the Alpha. If the index variables are declared as long instead of int, gcc is able to eliminate the induction variables, as one would expect.

Rather than declaring the problem solved, let's think about the memory access pattern for a minute. Each element in the result, matrix c, is a dot product of a row in a and a column in b. For example, the element c[0][0] is computed as the dot product of the first row in a and the first column in b. This is illustrated in Figure 1. In our naive matrix-multiply routine, this means that the accesses to a form a nice, dense, linear memory-access pattern. Unfortunately, things do not look quite as good for b. There the memory access pattern is sparse: first, the element at offset 0 is read, then the one at offset dim and so on. Such sparse access patterns are bad for many reasons. Suffice to say it's easiest to optimize a machine for dense, linear accesses, so it is likely that those accesses will always be the fastest ones. Fortunately, there is a simple trick that avoids the bad access pattern for matrix b: before doing the actual matrix multiply, we can simply transpose matrix b. Then, all memory accesses are dense. Of course, transposing b causes extra work, but since that matrix is accessed dim times, this may well be worth the trouble.

Figure 1: Matrix Multiply

So, let's change matmul0 into matmul1 by adding a matrix transposition in front of the main loop. The code in Listing 6 assumes that tb is an appropriately sized temporary variable to hold the transposition of b.

If you thought 15 MFLOPS is fast, think again: matmul1 executes at a blazing 45 MFLOPS. Next, we'll add some loop unrolling, etc. For matmul0, this bought us 25%, which isn't bad at all. If we unroll the loop eight times and do some other straightforward optimizations, we get the code shown in Listing 7. For compactness, it assumes that dim is an integer multiple of eight. Was it worth the trouble? By all means yes: matmul2 clocks at a full 80 MFLOPS. Whether you like this kind of code or not may be a matter of taste, but it is certainly fast.

Table 3 presents a summary of the performance results. For comparison, it also includes the results obtained when compiling the same code with Digital's GEM C compiler and the relative results for the P6. As this table shows, with gcc we achieved a performance improvement by over a factor of seven. More importantly, the optimizations paid off in all three cases. In fact, with Digital's compiler, the improvement was more than a factor of eight. The performance gap between Digital C and gcc is rather large. For integer code, the gap is usually smaller, but it certainly looks as if gcc needs some work in the floating-point area. Finally, notice that even on the P6 performance increased by almost a factor of four. This is encouraging since unrolling the loop eight times is a tad aggressive for the x86 architecture (because relatively few registers are available). Presumably, the code could be sped up some more, but the point here is that all three cases benefit from memory access optimizations in the same way.

Table 3. Performance Results of Matrix Multiply Routines

Data-parallel Processing: MPEG Core Loop

Multimedia applications are the rage these days. All mainstream CPU architectures (with the notable exception of the PowerPC) have so-called multimedia extensions. The Alpha is ideally suited for such applications since it has been a 64-bit architecture right from the start. In fact, the Alpha multimedia extension is completely trivial; it adds only four new instruction types (vector minimum/maximum, pixel error, pack and unpack). Since some of these instructions can operate on different data types (byte or word, signed or unsigned), the total number of instructions added is 13, which is much smaller than the corresponding number for other architectures. The Alpha got away with so few additions because its original instruction set already contains many of the instructions needed for multimedia applications. For example, there is an instruction that allows eight bytes to be compared in parallel—a seemingly simple instruction that can prove surprisingly powerful in a number of applications.

We illustrate this using mpeg_play, the Berkeley MPEG decoder. (See Reference 2). Since there is not enough space to illustrate all the optimizations that can be applied to this program, we'll focus on one of the most important operations in the MPEG decoder. This operation involves computing the average of two byte vectors. This is a frequent operation since video often contains images that can be represented as the average of an earlier and a later image. The straightforward way of averaging a byte vector is shown in Listing 8.

This loop executes at about 94ns per byte average (iteration) when compiled with gcc. Unrolling this loop twice and reading ahead to the input values needed in the next iteration yields code that is probably close to optimal with this byte-oriented approach. Indeed, with gcc, performance increases to about 60ns per byte average. (Let's call this unrolled version of the function byte_avg1.)

To get even higher performance, we need to be a bit more aggressive. Considering that the Alpha is a 64-bit architecture, we would like to calculate the average of eight bytes in parallel. Reformulating byte-oriented algorithms in such a data parallel format is often trivial. For byte averaging, it's not quite as simple. The straightforward implementation requires nine bits of precision, since 255 + 255 = 510. If we pack eight bytes into a 64-bit word, there is no extra ninth bit. How can we get around this? Obviously, we can divide the operands by two before adding them. That way, the sum is at most 127 + 127 = 254 which conveniently fits into eight bits. The catch is that the result may be wrong; if both operands are odd, it will be one too small. Fortunately, it's easy to correct for this: if bit 0 in both operands is set, a correction by one is necessary. In other words, we can make space for that extra ninth bit by using an additional long register that is used to hold the correction bits. Since all intermediary results now fit into eight bits, the obstacles to a data-parallel implementation of byte averaging have been removed.

The resulting code is shown in Listing 9. For simplicity, it assumes that the input vectors are long aligned and have a size that is an integer multiple of the size of a long. Note that macro VEC() takes an eight-bit value and replicates it once for each byte in a long—it's much more convenient and less error-prone to write VEC(0x01) instead of 0x0101010101010101. Maybe it's helpful to explain the core of the averaging a bit. Variable CC holds the correction bits, so it's simply the bitwise AND of vectors A0 and B0, masked with a vector of 0x01. We divide A0 and B0 by two by shifting them to the right by one position and masking the resulting long with a vector of 0x7f. This masking is necessary since otherwise bit 0 of the byte “above” a byte would sneak in and become bit 7 of that byte, causing gross errors. The average is computed by simply adding the vectors A0, B0, and CC. This addition does not cause any overflows since, per byte, the largest possible value is 127 + 127 + 1 = 255.

Despite its look, this code is actually very portable. For a 32-bit architecture, all that needs to change is macro VEC (and even that is necessary only to get rid of compiler warnings). Byte order is not an issue since even though the data is accessed one long at a time, each byte is still processed individually. This data-parallel version of the byte-averaging loop runs at 5.3ns per byte-average—more than an order of magnitude faster than the unrolled loop.

A summary of the three averaging routines is given in Table 4. The relative performance is in terms of throughput (number of byte-averages per second) since that's both more intuitive and more impressive. Results for the Alpha are presented both for gcc and Digital's GEM C compiler; as usual, for the P6, gcc was used.

Table 4. Performance of Averaging Routines

Note that GEM C generates much better code for the stupid version (byte_avg0) but just slightly better code for the clever version (byte_avg2). This is a common theme, at least for integer code: for well-structured code, gcc usually generates code that is on par with the GEM C compiler. The other interesting result is that the read ahead and loop-unrolling hurt on the P6. This means that byte_avg2 probably could be optimized more for the P6 (since it uses read ahead and loop-unrolling, too), but even so the P6 is twice as fast with the data-parallel version. This is impressive since the relative overheads are much higher for a 32-bit chip (the Alpha can amortize all the masking and shifting over eight bytes, whereas a 32-bit architecture has only four bytes).

How does all this affect performance of mpeg_play? This is best illustrated by comparing the original Berkeley version with the one optimized using the techniques described in this section (particularly data-parallel processing and avoiding integer divisions). (See Reference 3.) Comparing MPEG performance is a bit tricky since a large fraction of the time is spent displaying images. This can be factored out by using the option -dither<\!s>none, which has the effect that nothing gets displayed (while the MPEG stream is still decoded as usual). Table 5 shows the result for this mode as well as when using an ordered dither. The ordered dither itself was also optimized using data-parallel processing, which resulted in a version called ordered4. The options used for mpeg_play were -controls<\!s>none<\!s>-framerate<\!s>0<\!s>-dither<\!s>D. The value of D was either none, ordered or ordered4, as indicated in column labeled “Dither” in Table 5. The movie that was used in these measurements was a 320 by 240 pixel-sized computer animation called RedsNightmare.mpg. (See Reference 4.)

Table 5. MPEG Performance Values

As the table shows, the optimization techniques do indeed result in tremendous performance improvements even at the application level. Of course, few people would enjoy watching a movie at 98 frames per second, but with the optimized code, you can either watch much larger videos in real-time, or you could have CNN on while watching your favorite movie. Who needs picture-in-picture capability when we've got real windowing systems?