Kernel Tuning Gives 40% Gains
It is particularly important to avoid branch penalties on the 21264. Sophisticated, trainable branch prediction logic is built in, and it works effectively if only one control-flow change instruction is in a fetch block (a “quad-pack”). In the 21164-tuned kernel assembly language routines, there are a number of places where multiple control-flow change instructions occur within a quad-pack. Additionally, branch targets were aligned to 8mod16 addresses, which often resulted in branch target labels appearing in the middle of a quad-pack. While these sequences run quite well on a 21164, they run relatively slowly on a 21264.
Replay traps occur when the processor must roll back the state of memory to force accesses to a particular memory location to be sequential, or when there are different-sized accesses to the same memory location. In the modified routines, however, the code context was such that replay traps were not an issue, so rewriting sequences to avoid them was unnecessary.
The instruction scheduling and slotting rules for the 21264 are too complex to list here, but for those interested in the details, the 21264 Compiler Writer's Guide is an excellent reference.
Byte- and word-sized loads and stores were introduced in the 21164A (ev56) processor, but they were not used in the original versions of the assembly language routines. Prior experience (in the context of static binary translation of applications) has shown that performance can typically be improved by 10% to 20% by utilizing these instructions. This is particularly true for the stw (store word) and stb (store byte) instructions, as they eliminate the kind of memory traffic that is guaranteed to cause replay traps on the 21264. In the tuned-up kernel routines, these instructions were helpful, but their benefit was typically localized to the tail code of large region copies, while the bulk of the data movement used eight-byte granularity load and store instructions.
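To make the difference concrete, here is a minimal C sketch of the two ways to update a single byte. The function names are hypothetical and are not taken from the kernel routines. The first function mimics, in portable C, the load/mask/insert/store pattern that code must use when no byte store is available; the second is the single byte-sized store that a compiler targeting a processor with byte stores could emit as one instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: update one byte by rewriting the whole quadword.
 * byte_index counts from the least significant byte of the value. */
void store_byte_rmw(uint64_t *quad, unsigned byte_index, uint8_t value)
{
    assert(byte_index < 8);
    unsigned shift = byte_index * 8;
    uint64_t word = *quad;                  /* load all eight bytes      */
    word &= ~((uint64_t)0xff << shift);     /* clear the old byte        */
    word |= (uint64_t)value << shift;       /* insert the new byte       */
    *quad = word;                           /* store all eight bytes     */
}

/* Illustrative only: the same update as a single byte-sized store. */
void store_byte_direct(uint8_t *bytes, unsigned byte_index, uint8_t value)
{
    assert(byte_index < 8);
    bytes[byte_index] = value;
}
```

The first version touches the same quadword with a load followed by a store, which is the extra memory traffic the byte and word store instructions remove.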
The Alpha architecture also features various forms of pre-fetch instructions. Pre-fetch instructions are hints to the memory subsystem to fetch a block of memory into the data cache for future consumption. These do not normally appear in compiled code, as few compilers have enough context available to permit their generation, although Compaq's compilers do generate them. When moving large amounts of data, it is possible (and desirable) for the assembly language programmer to utilize pre-fetches. The __asm__() feature of gcc enables programmers to insert relevant pre-fetch instructions at key points in routines when rewriting entire routines in assembly language is undesirable. Because they can minimize or prevent data-cache stalls, these instructions can significantly boost performance.
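As a rough illustration of inserting pre-fetch hints from C, the sketch below uses the GCC builtin __builtin_prefetch() rather than hand-written __asm__() statements, so it is a stand-in for the approach described above, not the kernel's actual code. The function name, the pre-fetch distance and the assumption of 64-byte cache blocks are all hypothetical and would need tuning on real hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning value: hint 32 quadwords (256 bytes) ahead. */
#define PREFETCH_AHEAD_WORDS 32

/* Copy count eight-byte words, issuing one read pre-fetch hint per
 * assumed 64-byte block (every eighth quadword). */
void copy_quads_prefetch(uint64_t *dst, const uint64_t *src, size_t count)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < count; i++) {
        if (i % 8 == 0 && i + PREFETCH_AHEAD_WORDS < count) {
            /* Arguments: address, 0 = read access, 0 = low temporal locality. */
            __builtin_prefetch(&src[i + PREFETCH_AHEAD_WORDS], 0, 0);
        }
        dst[i] = src[i];
    }
}
```

The hint never changes the result of the copy; it only affects how early the source data is requested, so a poorly chosen distance costs performance but not correctness.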
The 21264 is the first Alpha implementation to include support for three instructions useful for boosting performance: CTLZ, CTTZ and WH64.
The CTLZ and CTTZ instructions count the number of leading/trailing zeros in a 64-bit register and are handy for string manipulation. When a program performs string operations involving pattern matching (strlen() matches on the terminating NUL byte), it often needs the byte index of the match within an eight-byte value held in a register. Without CTTZ, it takes about ten instructions, including multiple CMOVxx (conditional move) instructions, to determine this index. Using CTTZ instead reduces code size (always useful) and decreases the number of cycles needed to perform string operations. These instructions are also useful in some filesystem primitives that involve finding holes in a bitfield.
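The byte-index computation can be sketched in C with the GCC builtin __builtin_ctzll(), which counts trailing zeros the way CTTZ does. The helper names are hypothetical, and the portable bit trick used to flag zero bytes merely stands in for whatever byte-compare instruction real Alpha code would use. The word is assembled in little-endian order, so byte 0 of the string lands in the low bits.

```c
#include <assert.h>
#include <stdint.h>

/* Assemble eight bytes into a word, lowest address in the low bits. */
uint64_t load_le64(const unsigned char *p)
{
    uint64_t w = 0;
    for (unsigned i = 0; i < 8; i++)
        w |= (uint64_t)p[i] << (8 * i);
    return w;
}

/* Nonzero if any byte of w is zero. The lowest set bit always marks
 * the first zero byte; higher bits may be set spuriously by borrows. */
uint64_t zero_byte_mask(uint64_t w)
{
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    return (w - ones) & ~w & highs;
}

/* Index (0..7) of the first zero byte; w must contain one. */
unsigned first_zero_byte(uint64_t w)
{
    uint64_t mask = zero_byte_mask(w);
    assert(mask != 0);   /* counting trailing zeros of 0 is undefined */
    return (unsigned)__builtin_ctzll(mask) / 8;
}
```

A strlen()-style loop would test zero_byte_mask() on each eight-byte chunk and call first_zero_byte() only on the chunk where the mask is nonzero, replacing a chain of conditional moves with a single count.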
WH64 (write hint for 64 bytes) is a memory subsystem hint that a specified 64-byte region is going to be written to in the near future. The processor can pass this information to the memory subsystem, which can invalidate the target contents and avoid some of the memory system cycles otherwise needed to keep the memory state coherent. Since a process context switch entails moving large amounts of information in memory from one place to another, any improvement in copying performance between kernel-space memory and user-space memory is good news. Program load time is another part of the operating system that depends on a lot of memory-to-memory traffic: the program bits all have to get mapped, and all of the zeroed memory (.bss in executables) must have zeros written to it.