Heterogeneous Processing: a Strategy for Augmenting Moore's Law
Better application performance: everyone wants it, and in the high-performance computing (HPC) community, we've come to expect it. Maybe we've even gotten a little spoiled. After all, we've enjoyed basically continuous performance improvement for four decades, thanks to Moore's Law.
Now in its 40th year, that principle (which predicts a doubling of transistor density every 18 months) is still going strong. But unfortunately, ever-increasing transistor density no longer delivers comparable improvements in application performance. The reasons for this are well known. Adding transistors also adds wire delays and speed-to-memory issues. More aggressive single-core designs also inevitably lead to greater complexity and heat. Finally, scalar processors themselves have a fundamental limitation: a design based on serial execution, which makes it extremely difficult to extract more instruction-level parallelism (ILP) from application codes.
These issues are no longer the sole concern of a small, high-end user base, if they ever were. It is becoming more apparent that major performance improvements could have a profound effect on virtually every scientific field. The President's Information Technology Advisory Committee, which challenged HPC researchers to achieve a sustained petaflop on real applications by 2010, noted that trans-petaflop systems will be crucial for better weather and climate forecasting, manufacturing, pharmaceutical development and other strategic applications. Industry experts at conferences such as Petaflops II are demanding improvements for a laundry list of applications, including crash testing, advanced aircraft and spacecraft design, economic modeling, and combating pandemics and bio-terrorism.
The HPC community is responding by developing new strategies to augment Moore's Law and exploring innovative HPC architectures that can work around the limitations of conventional systems. These strategies include:
Multicore systems that use two or more cores on a die to continue providing steady performance gains.
Specialized processors that deliver enhanced performance in areas where conventional commodity processors fare poorly.
Heterogeneous computing architectures, in which conventional and specialized processors work cooperatively.
Each of these strategies can potentially deliver substantial performance improvements. At Cray, we are exploring all three. But in the long term, we believe heterogeneous computing holds tremendous potential for accelerating applications beyond what one would expect from Moore's Law, while overcoming many of the barriers that can limit conventional architectures. As a participant in the DARPA High Productivity Computing Systems Program, we expect heterogeneous processing to become crucially important over the next several years.
Placing multiple cores on a die is the fastest way to deliver continuous performance gains in line with Moore's Law. A well-known example of a multiple-core processor is the Dual-core AMD Opteron.
Cray and other HPC manufacturers have already embraced this model. Today, Cray is delivering dual-core systems, with expectations to leverage more cores in the future. This strategy offers immediate doubling of computing density, while reducing per-processor power consumption and heat.
For many applications (especially those requiring heavy floating-point operations), multicore processing will provide performance gains for the foreseeable future, and the model will likely serve as the primary vehicle through which Moore's Law is upheld. However, for some applications (notably, those that depend on heavy bit manipulation, sorting and signal processing, such as database searching, audio/video/image processing and encryption/decryption), Moore's Law may not be enough. Major advances in these applications can be realized only with processing speeds orders of magnitude beyond what is available today (or likely to be available anytime soon) through conventional processors. So HPC researchers are exploring alternative models.
In recent years, architectures based on clusters of commodity processors have overtaken high-end, specialized systems in the HPC community, due to their low cost and solid performance for many applications. But, as some users begin to bump up against the inherent limitations of scalar processing, we are beginning to see a reversal in that trend. Examples of this resurgence include:
Vector processors: vector processors increase computational performance by efficiently pipelining identical calculations on large streams of data, eliminating the instruction issue rate limitations of conventional processors.
Multithreaded processors: HPC memory speeds have been increasing at only a fraction of the rate of processor speeds, leading to performance bottlenecks as serial processors wait for memory. Systems incorporating multithreaded processors (such as IBM's Simultaneous Multi-Threading processor and Intel's Hyper-Threading technology) address this issue by modifying the processor architecture to execute multiple threads simultaneously, while sharing memory and bandwidth resources. Cray's multithreaded architecture takes this a step further by allowing dozens of active threads simultaneously, fully utilizing memory bandwidth.
Digital Signal Processors (DSPs): DSPs are optimized for processing a continuous signal, making them extremely useful for audio, video and radar applications. Their low power consumption also makes these processors ideal for use in plasma TVs, cell phones and other embedded devices.
Specialized coprocessors: coprocessors such as the floating-point accelerator developed by Clearspeed Technology and the n-body accelerator GRAPE, use unique array processor architectures to provide a large number of floating-point components (multiply/add units) per chip. They can deliver noticeable improvements on mathematically intense functions, such as multiplying or inverting matrices or solving n-body problems.
Processors such as these can deliver substantially better performance than general-purpose processors on some operations. Vector and multithreaded processors are also latency tolerant and can continue executing instructions even while allowing large numbers of memory references to be underway simultaneously. These enhancements can allow for significant application performance improvement, while reducing inter-cache communication burdens and real estate on the chip required by conventional caching strategies.
However, as specialized processors have traditionally been deployed, they have had serious limitations. First, although they can provide excellent acceleration for some operations, they often run scalar code much more slowly than commodity processors—and most software used in the real world employs at least some scalar code. To address this issue, these processors traditionally have been incorporated into more conventional systems via the PCI bus—essentially as a peripheral. This inadequate communications bandwidth severely limits the acceleration that can be achieved. (Communicating a result back to the conventional system may actually take more time than the calculation itself.) There are also hard economic realities of processor fabrication. Unless the processor has a well-developed market niche that will support commodity production (such as the applicability of DSPs to consumer electronics), few manufacturers are willing to take on the huge costs of bringing new designs to market.
These issues are leading Cray and others to explore an alternative model.