Linux and the Alpha
Two parts in modern CPUs that are easy to forget are the data and instruction translation lookaside buffers (TLBs). The TLBs hold the most recently accessed page table entries. TLBs are necessary since it would be far too slow to access the page tables on every memory access. Since each TLB entry maps an entire page (8KB with the current Alpha chips), the TLB is usually not the limiting factor to performance. The catch is that when a program does suffer from excessive TLB misses, performance will go down the drain fast. In such a case, the slowdown can easily be big enough that it is worthwhile to switch to a completely different (normally slower) algorithm that has a better TLB behavior. For example, on the Alpha system reading one word from each page in an array of 63 pages (504KB) takes about 15ns per access. But doing the same in an array of 64 pages takes 65ns per access—a slow down of more than a factor of four. Since the second-level cache in that system is 4MB in size, this jump in access time is purely due to the data TLB.
The TLB is not usually a first-order bottleneck but a small experiment did show that a hash table that is 50% full and exceeds the data TLB size is no faster than a more compact search tree that needs two memory accesses per look up (the hash table needs just one memory access, but since it exceeds the TLB size, that one access is slow). In general, the TLB may be the primary bottleneck when large data sets are accessed more or less randomly and sparsely.
On modern systems, memory accesses are bad and computation is good. In the ideal case, we would like to completely replace memory accesses by computation. This obviously is not always possible; but where it is possible, the payoff can be big.
For example, let us consider the problem of reversing all the bits in each byte in a long integer. A byte is reversed as follows: bit 0 is swapped with bit 7, bit 1 with bit 6 and so on. Since we want to reverse all the bytes in a long, this algorithm is applied once for each byte in the long.
Why would anyone want to do this? As you may know, both the Alpha and the x86 architecture are little-endian, but when IBM designed the VGA graphics adapter, it chose to use a big-endian bit order for the pixels in the graphics memory. That is, bit 7 in a byte corresponds to the left-most pixel and bit 0 to the rightmost. This is backwards since the coordinate value for the left-most pixel is smaller. So, byte reversal is indeed a relatively important operation for VGA X servers.
The traditional way of implementing byte reversal is shown by the code in Listing 1 To conserve space, we show the code to reverse a 4-byte integer only. The 8-byte long case is a straightforward extension of this one. The code assumes that array byte_reversed has been initialized such that byte_reversed[i] is the reversed value of i. With this naive algorithm, each byte reversal involves a table look up.
All listings in this article are available by anonymous download in the file ftp.linuxjournal.com/pub/lj/listings/issue43/2487.tgz.
Since the Alpha is a 64-bit architecture, this means each long reversal involves eight memory accesses. How could we avoid these expensive memory accesses? Well, a simple and arguably more intuitive way to implement byte reversal is to use shifting and masking to swap the bits. The code that does this is shown in Listing 2.
Note that, except for the initialization of mask, this code is completely independent of the width of a long. So, to make this work on a 32-bit machine, all we need to do is initialize mask with 0x01010101 instead.
Now let's see how the implementations compare. Table 1 gives the results for the Alpha, the Pentium Pro (P6) and a 120MHz Pentium Notebook (P5). In addition to the two implementations shown above, the table also includes a row that shows the performance of the naive implementation when coded in x86 assembly code (implementation byterev_x86). This routine comes straight from the XFree86 v3.2 distribution.
Table 1. Comparison of Byte-Reversal Routines
As the table shows, the byte-reversal routine that avoids memory accesses is over 30% faster on the Alpha. Interestingly, on the P6 this routine is also the fastest. The benefit there is less (only 13%), but given that it's more portable and more intuitive, there is no reason not to use it. What's stunning is the complete failure of the assembly version on the P6. That routine is only half as fast as the version that avoids memory accesses. This may be due to the assembly routine's extensive use of byte accesses to CPU registers. For the Pentium notebook, as shown in the P5 column, the assembly-code routine is the fastest version. Don't be misled by the relative performance numbers; in terms of absolute performance, the P5 is just half as fast as the P6.
To summarize, not only is byterev_long() by far the fastest version on the Alpha, it also appears to be the right solution on a P6.
Practical Task Scheduling Deployment
July 20, 2016 12:00 pm CDT
One of the best things about the UNIX environment (aside from being stable and efficient) is the vast array of software tools available to help you do your job. Traditionally, a UNIX tool does only one thing, but does that one thing very well. For example, grep is very easy to use and can search vast amounts of data quickly. The find tool can find a particular file or files based on all kinds of criteria. It's pretty easy to string these tools together to build even more powerful tools, such as a tool that finds all of the .log files in the /home directory and searches each one for a particular entry. This erector-set mentality allows UNIX system administrators to seem to always have the right tool for the job.
Cron traditionally has been considered another such a tool for job scheduling, but is it enough? This webinar considers that very question. The first part builds on a previous Geek Guide, Beyond Cron, and briefly describes how to know when it might be time to consider upgrading your job scheduling infrastructure. The second part presents an actual planning and implementation framework.
Join Linux Journal's Mike Diehl and Pat Cameron of Help Systems.
Free to Linux Journal readers.Register Now!
- SUSE LLC's SUSE Manager
- Murat Yener and Onur Dundar's Expert Android Studio (Wrox)
- My +1 Sword of Productivity
- Managing Linux Using Puppet
- Non-Linux FOSS: Caffeine!
- Doing for User Space What We Did for Kernel Space
- SuperTuxKart 0.9.2 Released
- Parsing an RSS News Feed with a Bash Script
- Google's SwiftShader Released
- Rogue Wave Software's Zend Server
With all the industry talk about the benefits of Linux on Power and all the performance advantages offered by its open architecture, you may be considering a move in that direction. If you are thinking about analytics, big data and cloud computing, you would be right to evaluate Power. The idea of using commodity x86 hardware and replacing it every three years is an outdated cost model. It doesn’t consider the total cost of ownership, and it doesn’t consider the advantage of real processing power, high-availability and multithreading like a demon.
This ebook takes a look at some of the practical applications of the Linux on Power platform and ways you might bring all the performance power of this open architecture to bear for your organization. There are no smoke and mirrors here—just hard, cold, empirical evidence provided by independent sources. I also consider some innovative ways Linux on Power will be used in the future.Get the Guide