Kernel Tuning Gives 40% Gains
The CMOVxx conditional move instructions on the 21264 are implemented in hardware by decomposing each one into two separate instructions inside the processor. As a result, the latency of a CMOVxx instruction is at least two cycles and can reach five cycles, depending on the number of CMOVs in a given fetch block. In some situations, replacing CMOV instructions with highly predictable conditional branches can result in a performance gain on the 21264. Overall, a good rule of thumb is to minimize the number of CMOV instructions where possible.
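To make the trade-off concrete, here is a minimal C sketch of the same reduction written two ways. The function names are hypothetical and are not taken from the kernel patches. Whether the first form actually becomes a CMOVxx and the second a branch depends on the compiler and its options, so treat this as an illustration of the idea and check the generated assembly before relying on it.

```c
#include <stddef.h>

/* Select-style maximum. A compiler targeting Alpha may lower the
   ternary to a CMOVxx, but that choice belongs to the compiler. */
long max_select(const long *v, size_t n)
{
    long best = v[0];
    for (size_t i = 1; i < n; i++)
        best = (v[i] > best) ? v[i] : best;
    return best;
}

/* Branch-style maximum. The update is skipped with an explicit branch,
   which tends to pay off only when the branch is highly predictable
   (for example, when a new maximum is rare). */
long max_branch(const long *v, size_t n)
{
    long best = v[0];
    for (size_t i = 1; i < n; i++) {
        if (v[i] <= best)
            continue;       /* common, predictable path: nothing to do */
        best = v[i];
    }
    return best;
}
```

Both functions return the same value for any non-empty array. Only the instruction sequence the compiler is encouraged to emit differs.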
The data was collected from an API NetWorks' CS20 server, which has dual 833MHz processors with 4MB DDR cache, 1GB of SDRAM and Ultra-160 SCSI disks. Two load-generation tests were run: five builds of the 2.2.18 kernel and five builds of gcc-2.95.3. The average system time (as reported by /usr/bin/time -p) was recorded, using various levels of parallelism with make (see Tables 2 and 3).
A similar version of the experiment was run using the 2.4.2 kernel in its default form (with all of the performance patches present). The results were compared to a 2.4.2 kernel with most (but not all) of the performance changes reverted.
This experiment was initially performed on an API NetWorks' UP1000 motherboard system, which has a 700MHz processor with 4MB cache, 128MB SDRAM and IDE disks. Again, five builds of the kernel and gcc were run, and the average times were recorded. The kernel used was 2.4.0-test6, with and without the patches.
On a modestly configured 21264 system (the UP1000), the performance increase is significant in terms of reducing the amount of time spent in the kernel, with improvements in the 40% range for some activities (kernel builds). On the more generously configured CS20, we consistently attained speed increases of 14-15% for the measured loads.
We attribute the differences between the UP1000 and CS20 results to their memory systems: the UP1000 has an 800MB/sec, 64-bit bus, while the CS20 has a 2.65GB/sec, 256-bit bus.
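As a rough check on those bus figures, the short C sketch below works out the transfer rate each quoted bandwidth implies and the ratio between the two systems. It assumes decimal units (1GB = 1000MB) and that the quoted numbers are peak bandwidths. The function names are illustrative only.

```c
/* Bytes moved per bus transfer for a bus of the given width in bits. */
static unsigned bytes_per_transfer(unsigned bus_width_bits)
{
    return bus_width_bits / 8u;
}

/* Millions of transfers per second implied by a peak bandwidth (MB/sec)
   on a bus of the given width. Assumes decimal megabytes. */
double implied_mtransfers(double bandwidth_mb_per_sec, unsigned bus_width_bits)
{
    return bandwidth_mb_per_sec / (double)bytes_per_transfer(bus_width_bits);
}

/* How many times more peak bandwidth system A has than system B. */
double bandwidth_ratio(double a_mb_per_sec, double b_mb_per_sec)
{
    return a_mb_per_sec / b_mb_per_sec;
}
```

With the figures quoted above, `implied_mtransfers(800.0, 64)` gives 100 million transfers per second for the UP1000, `implied_mtransfers(2650.0, 256)` gives roughly 83 million for the CS20, and `bandwidth_ratio(2650.0, 800.0)` is about 3.3. The CS20's advantage therefore comes from its wider bus rather than from a faster transfer rate.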
All of the rewritten routines have appeared in one form or another (some have undergone subsequent rewriting) as part of the 2.4.2 kernel. Additionally, we have put together a patch for 2.2.17 of the kernel and made it available on our corporate web site, http://www.api-networks.com/products/downloads/developer_support/ under “Performance”. Through additional efforts, these improvements have also migrated into glibc and will eventually help improve application performance of user-mode code.