Kernel Tuning Gives 40% Gains
API NetWorks is a leading developer of high-performance, high-density servers. As such, we (and our customers) are acutely sensitive to system performance issues. With the majority of our customers running Linux on Alpha-based systems, we pay close attention to key open-source components and packages to ensure we don't inadvertently leave performance “on the table”.
In the proprietary world, improvements unearthed after software product releases are saved for the next version. In the world of open-source software, the time between improvements can be measured in hours or days. In addition, tuning techniques are shared quickly for the enhancement of the communal knowledge base. With this in mind, we examined the assembly language routines in the Linux kernel, with an eye toward taking advantage of the micro-architectural features of the latest Alpha 21264 (also known as ev6) processor.
We discovered that by rewriting a handful of routines in the architecture-specific portion of the Linux kernel to take advantage of the features of the 21264, we could attain significant speed improvements, amounting to a 40% reduction in overhead for some workloads. This article describes how and why these optimizations work, so that others may be able to use these techniques to boost their Alpha applications performance. For an e-mail server running on a 21264-based Alpha, for example, applying these tuning tips will enable it to scale to higher loads, extending performance beyond where it would normally tend to taper off. End users will feel as though their system is running faster; systems managers will be pleased because they won't have to buy another system as soon to handle increased traffic.
In the case of the Linux kernel, the performance-critical routines are carefully (and conveniently) separated by architecture and mostly implemented in assembly language. While there is a large body of developers continually working to improve the algorithmic performance of the kernel, there are only a handful of developers with the requisite interests, skills and knowledge to enable them to tune these assembly routines to extract the maximum performance from a processor. Most of these developers are tuning code for x86 systems; the relevant Alpha code base had been dormant for quite some time prior to our tuning.
Anecdotal evidence and previous measurements suggest a disproportionately large amount of time was spent inside the Linux/Alpha kernel when running various web and networking servers. Given that the number of known performance-critical routines in the Linux kernel is relatively small (20-30 assembly language routines), it made sense to sort through them to make sure the code was written with the 21264 in mind. A quick reading of these routines revealed they had been carefully hand-scheduled for the previous generation of Alpha CPUs, the 21164 (also called ev5).
While the difference may seem minor, the 21264 differs significantly from the 21164 at the micro-architectural (chip implementation) level, having features that result in roughly double the performance of the 21164 at the same clock speed.
Before detailing the performance improvements, it is useful to note some of the differences between the 21164 and 21264 processors (see Table 1).
After realizing a careful rewrite could yield significant performance gains, we undertook the tuning of the Alpha assembly language routines in the Linux kernel. (There have been attempts to solve dynamically and automatically the problem of efficiently running binaries compiled for older versions of Alpha CPUs on new Alpha CPUs1, but that is beyond the scope of this article.) Specifically, it was clear that about 20 routines in linux/arch/alpha/lib and four to six routines in linux/include/asm-alpha needed to be rewritten to take full advantage of the features of the 21264.
From an internal perspective, the 21264 is a significantly more complex and powerful CPU than the older 21164. Because of these differences, the nature of the performance improvements can come from transformations (discussed below) involving the rescheduling of the code in order to keep the processor from stalling on fetch, avoid branch and branch target penalties, avoid replay traps from occurring where possible, use the instruction latencies and scheduling rules of the 21264, use instructions available in the 21164 and 21264 that had not been used, use instructions newly available in the 21264 and not use some instructions if possible on the 21264.
When writing code to avoid fetch stalling, one must ensure the instructions dynamically encountered in a fetch block do not result in one or more of the functional units being over-subscribed. On the 21264 it is also important to ensure that branch targets are aligned on a fetch block boundary. Given that it fetches four instructions at a time, branch targets should have an alignment of 0mod4 instructions, equivalent to an address alignment of 0mod16.
|Making Linux and Android Get Along (It's Not as Hard as It Sounds)||May 16, 2013|
|Drupal Is a Framework: Why Everyone Needs to Understand This||May 15, 2013|
|Home, My Backup Data Center||May 13, 2013|
|Non-Linux FOSS: Seashore||May 10, 2013|
|Trying to Tame the Tablet||May 08, 2013|
|Dart: a New Web Programming Experience||May 07, 2013|
- RSS Feeds
- New Products
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Drupal Is a Framework: Why Everyone Needs to Understand This
- A Topic for Discussion - Open Source Feature-Richness?
- Home, My Backup Data Center
- Validate an E-Mail Address with PHP, the Right Way
- New Products
- Developer Poll
- Trying to Tame the Tablet
- not living upto the mobile revolution
34 min 4 sec ago
- Deceptive Advertising and
1 hour 9 min ago
- Let\'s declare that you have
1 hour 10 min ago
- Alterations in Contest Due
1 hour 11 min ago
- At a numbers mindset, your
1 hour 12 min ago
- Do not get Just Almost any
1 hour 16 min ago
- A fantastic rule-of-thumb to
1 hour 17 min ago
- Keren mastah..
2 hours 15 min ago
- mini tablet compare
3 hours 34 min ago
- Looking Good
7 hours 7 min ago
Enter to Win an Adafruit Prototyping Pi Plate Kit for Raspberry Pi
It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Prototyping Pi Plate Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- Next winner announced on 5-21-13!
Free Webinar: Linux Backup and Recovery
Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.
In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.