Kernel Tuning Gives 40% Gains
The CMOVxx conditional move instructions on the 21264 have been implemented in the hardware by decomposing them into two separate instructions inside the processor. The result of the latency of a CMOVxx instruction is a minimum of two cycles, and can take up to five cycles, depending upon the number of CMOVs in a given fetch block. In some situations, replacing CMOV instructions with highly predictable conditional branches can result in a performance gain on the 21264. Overall, a good rule of thumb is to try to minimize the number of CMOV instructions if possible.
The data was collected from an API NetWorks' CS20 server, which has dual 833MHz processors with 4MB DDR cache, 1GB of SDRAM and Ultra-160 SCSI disks. Two load-generation tests were run: five builds of the 2.2.18 kernel and five builds of gcc-2.95.3. The average system time (as reported by /usr/bin/time -p) was recorded, using various levels of parallelism with make (see Tables 2 and 3).
A similar version of the experiment was run using the 2.4.2 kernel in default mode (all of the performance patches exist). The results were compared to an unpatched 2.4.2 kernel with most (but not all) of the performance changes reverted.
This experiment was initially performed on an API NetWorks' UP1000 motherboard system, which has a 700MHz processor with 4MB cache, 128MB SDRAM and IDE disks. Again, five builds of the kernel and gcc were run, and the average times were recorded. The kernel used was 2.4.0-test6, with and without the patches.
On a modestly configured 21664 system (the UP1000), the performance increase is significant in terms of reducing the amount of time spent in the kernel, with improvements in the 40% range for some activities (kernel builds). On a more generously configured CS20, we consistently attained speed increases of 14-15% for the measured loads.
We attribute the differences between the UP1000 and CS20 systems to be related to their memory: the UP1000 has an 800MB/sec, 64-bit bus, while the CS20 has a 2.65GB/sec, 256-bit bus.
All of the rewritten routines have appeared in one form or another (some have undergone subsequent rewriting) as part of the 2.4.2 kernel. Additionally, we have put together a patch for 2.2.17 of the kernel and made it available on our corporate web site, http://www.api-networks.com/products/downloads/developer_support/ under “Performance”. Through additional efforts, these improvements have also migrated into glibc and will eventually help improve application performance of user-mode code.
Special Reports: DevOps
Have projects in development that need help? Have a great development operation in place that can ALWAYS be better? Regardless of where you are in your DevOps process, Linux Journal can help!
With deep focus on Collaborative Development, Continuous Testing and Release & Deployment, we offer here the DEFINITIVE DevOps for Dummies, a mobile Application Development Primer, advice & help from the experts, plus a host of other books, videos, podcasts and more. All free with a quick, one-time registration. Start browsing now...
- Vigilante Malware
- Non-Linux FOSS: Code Your Way To Victory!
- Disney's Linux Light Bulbs (Not a "Luxo Jr." Reboot)
- Vagrant Simplified
- Libreboot on an X60, Part I: the Setup
- Dealing with Boundary Issues
- System Status as SMS Text Messages
- Bluetooth Hacks
- October 2015 Issue of Linux Journal: Raspberry Pi
- New Products