Linux and the Alpha
Ever since its announcement in the Fall of 1991, the Alpha architecture (see Reference 3) has been the foundation of the world's fastest systems. In fact, except for one or two brief blips, Alpha systems have been the highest-performing systems based on single-CPU SPECmark performance. With this outstanding performance record comes marketing hype and, sometimes, unrealistic expectations. It is not all that uncommon to find e-mail messages or USENET articles saying things like: “I heard the Alpha is so fast, but now I find that my dusty deck is just 10% faster on the Alpha than on the other system.” So what's the truth? The honest answer is that it depends on what you're doing. Alpha systems are without a doubt fast machines, but it is unreasonable to expect that taking a dusty deck and running it on an Alpha will result in the best possible performance. This is particularly true for programs that were written with the mind-set of the eighties, when CPU cycles were at a premium and memory bandwidth was abundant. Reality looks quite different today: CPU clock-rates above 150MHz are the rule and even laptops can run at 200MHz or more. The result is that, today, the memory system—and not the CPU—is often the first-order bottleneck.
In part 2 of this article, we will demonstrate a few simple techniques that help avoid the memory system bottleneck. Except for one case, the focus is on integer-intensive applications. The topic of optimizing floating-point intensive applications is certainly just as important but, unfortunately, well beyond the scope of this series. The techniques presented can result in tremendous performance improvements. While the techniques will be helpful for all modern systems, they normally extract the biggest benefits on Alpha-based machines. There are a couple of reasons for this bias.
One, the Alpha architecture has been designed with longevity in mind. Specifically, the Alpha architecture should be good for the next 15-25 years, which corresponds roughly to a 1000-fold increase in overall performance. For this reason, some design-tradeoffs were made in favor of long-term viability rather than short-term benefits. For example, the Alpha was right from the start a 64-bit architecture, even though, at the time of its announcement, 32-bit address spaces were considered comfortably large.
Two, the current Alpha implementations are designed to achieve high performance by pushing clock frequency to the limit. This means the CPU-to-memory-system performance gap is the largest for Alpha-based systems. For example, suppose a memory access takes 100ns. On a 500MHz Alpha CPU, this corresponds to 50 clock cycles. In contrast, on a 250MHz CPU, this is only 25 cycles. So the relative performance penalty of accessing memory is much higher on a CPU with higher clock speeds. This may sound like a bad thing, but since the absolute performance is the same, what this really means is that a fast-clock CPU system that is running a memory-bound application will be about as fast as a slower-clock system, but when running a memory-wise application, it will be much faster.
In this part of the series, I present a brief overview of existing and upcoming Alpha implementations. While it is not usually necessary to optimize for a specific CPU, it is helpful to know what the characteristics of current CPUs and systems are. I also discuss a couple of simple performance analysis tools that are available under Linux. When porting legacy code to modern systems, such tools are invaluable, since they avoid wasting time trying to optimize rarely executed code.
So far, the Alpha CPU family tree spans three generations; it all began with the 21064 chip. At the time of its introduction, it was the highest performing CPU, and it still makes for a nice workstation, though it's no longer competitive with the latest generation CPUs. This chip branched off into a version that was called the “Low-Cost Alpha” (LCA), also known as 21066 or 21068. The chip core was identical to the 21064 but it had an integrated memory and PCI-bus controller. This high integration made it possible to build Alpha-based systems at relatively low cost and for the embedded systems market. Unfortunately, the design had a major weakness—the memory system was seriously under-powered. This created the paradoxical situation in which a system based on this chip performed on some applications on average, no better than a 100MHz Pentium, but outperformed a P6 running at 200MHz. As a result, the reaction to this chip varied greatly, and probably resulted in quite a few disappointed customers for Digital. On the other hand, there is no doubt that the low-cost at which 21066-based systems eventually were sold caused a quantum leap in the number of Linux/Alpha users.
Around June 1994, the 21164 chip was announced. It had dramatically improved performance over the 21064 and was the first, and so far only, Alpha CPU to feature a three-level cache hierarchy. The first and second-level caches were both on-chip and only the third-level cache was on the motherboard. This chip, in slightly improved versions, is still going strong. At the Fall 1996 Comdex in Las Vegas, such a chip, coupled with a liquid cooling system, was demonstrated running at 767MHz. Another version, called 21164PC, is scheduled to become available around Spring 1997. It omits the relatively expensive second-level, on-chip cache but adds multi-media extensions and other performance-enhancing features. As the name indicates, this chip is designed to be price-competitive with PC processors, specifically the forthcoming Intel Klamath (an improved P6). While price-competitive, the 21164PC is supposed to deliver over 50% better performance than the Klamath. For this second-generation, low-cost Alpha implementation, it certainly looks like Digital and its co-designer Mitsubishi are not going to repeat the mistakes of the past. The 21164PC promises to be cheap and fast.
If you happen to have a deep pocket or want to take a glance at what PC processors might look like in two or three years, the 21264 might be of interest. It is scheduled to become available in high-end machine during the second half of 1997. With this chip, CPU performance is expected to take another giant leap. Current estimates call for a performance level that is three to four times faster than the fastest CPUs available today.
Between each major chip generation, there are typically “half-generation” CPUs which have improvements that derive primarily from a shrink of the chip manufacturing process. For example, the 21064 chip was followed by the 21064A, and similarly, the 21164 was followed by the 21164A. In the former case, the core of the chip remained virtually identical to the 21064, but the primary caches doubled in size from 8KB to 16KB. In the latter case, instructions for byte and word accesses were added and the maximum clock frequency increased from 333 to 500MHz.
A summary of the performance attributes of the current Alpha chip family is presented in Table 1.
2394s1.ps (min=50, max=52)
Today’s modular x86 servers are compute-centric, designed as a least common denominator to support a wide range of IT workloads. Those generic, virtualized IT workloads have much different resource optimization requirements than hyperscale and cloud applications. They have resulted in a “one size fits all” enterprise IT architecture that is not optimized for a specific set of IT workloads, and especially not emerging hyperscale workloads, such as web applications, big data, and object storage. In this report, you will learn how shifting the focus from traditional compute-centric IT architectures to an innovative disaggregated fabric-based architecture can optimize and scale your data center.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
| Non-Linux FOSS: Seashore | May 10, 2013 |
| Trying to Tame the Tablet | May 08, 2013 |
| Dart: a New Web Programming Experience | May 07, 2013 |
- RSS Feeds
- New Products
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Drupal Is a Framework: Why Everyone Needs to Understand This
- A Topic for Discussion - Open Source Feature-Richness?
- Home, My Backup Data Center
- New Products
- Developer Poll
- Trying to Tame the Tablet
- Validate an E-Mail Address with PHP, the Right Way
- Deceptive Advertising and
1 min 56 sec ago - Let\'s declare that you have
2 min 53 sec ago - Alterations in Contest Due
3 min 59 sec ago - At a numbers mindset, your
5 min 10 sec ago - Do not get Just Almost any
8 min 39 sec ago - A fantastic rule-of-thumb to
10 min 2 sec ago - Keren mastah..
Penting,
1 hour 7 min ago - mini tablet compare
2 hours 26 min ago - Looking Good
5 hours 59 min ago - Hey God - You may not be
10 hours 13 min ago
Enter to Win an Adafruit Prototyping Pi Plate Kit for Raspberry Pi

It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Prototyping Pi Plate Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- Next winner announced on 5-21-13!
Free Webinar: Linux Backup and Recovery
Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.
In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.




Comments
measuring cpu cycle counts
is dere any way in wich i can measure the cpu cycle counts of a pentium processor??
i mean i want 2 measure the cpu cycle counts elapsed before and after a bulk of data has been transformed.