Kernel Tuning Gives 40% Gains

Adjusting assembly language routines for higher performance from Alpha processors.
Instructions to Avoid

The CMOVxx conditional move instructions on the 21264 have been implemented in the hardware by decomposing them into two separate instructions inside the processor. The result of the latency of a CMOVxx instruction is a minimum of two cycles, and can take up to five cycles, depending upon the number of CMOVs in a given fetch block. In some situations, replacing CMOV instructions with highly predictable conditional branches can result in a performance gain on the 21264. Overall, a good rule of thumb is to try to minimize the number of CMOV instructions if possible.

Data Collection and Test Methodology

The data was collected from an API NetWorks' CS20 server, which has dual 833MHz processors with 4MB DDR cache, 1GB of SDRAM and Ultra-160 SCSI disks. Two load-generation tests were run: five builds of the 2.2.18 kernel and five builds of gcc-2.95.3. The average system time (as reported by /usr/bin/time -p) was recorded, using various levels of parallelism with make (see Tables 2 and 3).

Table 2. Load-Generation Test Results of the 2.2.18 Kernel

Table 3. Load-Generation Test Results of gcc-2.95.3 Kernel

A similar version of the experiment was run using the 2.4.2 kernel in default mode (all of the performance patches exist). The results were compared to an unpatched 2.4.2 kernel with most (but not all) of the performance changes reverted.

Table 4. Load-Generation Test Results of Default 2.4.2 Kernel

Table 5. Load-Generation Test Results of gcc-2.4.2 Kernel

This experiment was initially performed on an API NetWorks' UP1000 motherboard system, which has a 700MHz processor with 4MB cache, 128MB SDRAM and IDE disks. Again, five builds of the kernel and gcc were run, and the average times were recorded. The kernel used was 2.4.0-test6, with and without the patches.

Table 6. Test Results of UP1000 System Running 2.4.0 Kernel

Table 7. Test Results of UP1000 System Running gcc-2.4.0 Kernel

On a modestly configured 21664 system (the UP1000), the performance increase is significant in terms of reducing the amount of time spent in the kernel, with improvements in the 40% range for some activities (kernel builds). On a more generously configured CS20, we consistently attained speed increases of 14-15% for the measured loads.

We attribute the differences between the UP1000 and CS20 systems to be related to their memory: the UP1000 has an 800MB/sec, 64-bit bus, while the CS20 has a 2.65GB/sec, 256-bit bus.


All of the rewritten routines have appeared in one form or another (some have undergone subsequent rewriting) as part of the 2.4.2 kernel. Additionally, we have put together a patch for 2.2.17 of the kernel and made it available on our corporate web site, under “Performance”. Through additional efforts, these improvements have also migrated into glibc and will eventually help improve application performance of user-mode code.

Rick Gorton, a member of the technical staff at API NetWorks, was first introduced to Linux in the form of the original 32-bit port to Alpha (BLADE). He has been developing binary translators, optimizers and other binary manipulation tools for Alpha since 1992. He can be reached at