Integrating a Linux Cluster into a Production High-Performance Computing Environment
In August 1999, the Ohio Supercomputer Center (OSC) entered into an agreement with SGI under which OSC would purchase a cluster of 33 SGI 1400L systems running Linux. These systems were to be connected with Myricom's Myrinet high-speed network and used as a “Beowulf cluster on steroids”. The plan was for this cluster eventually to become a production-quality high-performance computing (HPC) system, as well as a testbed for cluster software development by researchers at OSC, SGI, Myricom and elsewhere.
OSC was no stranger to clustering, having built its first workstation cluster (the Beakers, eight DEC Alpha workstations running OSF/1 and connected by FDDI) in 1995. Also, the LAM implementation of MPI started at OSC and was housed there for a number of years. This was not even OSC's first Linux cluster; Pinky, a small cluster of five dual-processor Pentium II systems connected with Myrinet, had been built in early 1999 and was made available to OSC users on a limited basis. However, this new cluster system was different in that it would be expected to be a production HPC system, just as OSC's Cray and SGI systems were.
The new cluster, nicknamed the Brain (after Pinky's smarter half on Animaniacs), consisted of 33 SGI 1400L systems, each with four Pentium III Xeon processors at 550MHz, 2GB of memory, a 10/100Mbps Ethernet interface and an 18GB UW-SCSI system disk. One system was configured as a front end or interactive node with more disks, a second Ethernet interface and an 800Mbps high-performance parallel interface (HIPPI) network interface. The other 32 systems were configured as compute nodes, with two 1.28Gbps Myrinet interfaces each. The reason for putting two Myrinet cards in each system was to increase the available network bandwidth between the nodes; the SGI 1400 systems have two 33MHz 32-bit PCI buses, so one Myrinet card was installed in each PCI bus (a single Myrinet card can easily saturate a 33MHz 32-bit PCI bus, so installing two in a single PCI bus is not a good idea). The 64 Myrinet cards were initially connected to a complex arrangement of eight 16-port Myrinet switches designed to maximize bisection bandwidth (the amount of bandwidth available if half of the network ports simultaneously attempt to communicate with the other half), but in the final installation these were replaced with a single 64-port Myrinet CLOS-64 switch. A 48-port Cisco Ethernet switch was also purchased to connect to the Ethernet cards in each system. This Ethernet network is private; the only network interface to the cluster accessible from the outside is the second Ethernet interface on the front-end node.
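The PCI argument above can be checked with a little arithmetic. The following is a minimal sketch, not a measurement: it uses the nominal figures quoted in the text (33MHz, 32-bit PCI and 1.28Gbps Myrinet links), it ignores bus and protocol overhead, and the variable and function names are illustrative only. The bisection figure is an upper bound that assumes the switch fabric can carry every link at full rate.

```python
# Nominal figures taken from the article; overheads are ignored (assumption).
PCI_CLOCK_HZ = 33_000_000     # 33MHz PCI clock
PCI_WIDTH_BITS = 32           # 32-bit PCI bus
MYRINET_LINK_GBPS = 1.28      # link rate of one Myrinet card
MYRINET_PORTS = 64            # 32 compute nodes x 2 cards


def pci_peak_gbps(clock_hz, width_bits):
    """Theoretical peak of a PCI bus: one transfer of width_bits per clock."""
    return clock_hz * width_bits / 1e9


pci_gbps = pci_peak_gbps(PCI_CLOCK_HZ, PCI_WIDTH_BITS)

# Fraction of a single Myrinet link that one PCI bus can feed at best.
link_fraction = pci_gbps / MYRINET_LINK_GBPS

# Upper bound on one-way bisection bandwidth: half the ports sending to
# the other half, each at full link rate.
bisection_gbps = (MYRINET_PORTS // 2) * MYRINET_LINK_GBPS

print(f"PCI peak:         {pci_gbps:.3f} Gbps")
print(f"Link fraction:    {link_fraction:.3f}")
print(f"Bisection (max):  {bisection_gbps:.2f} Gbps")
```

Because the bus peak (about 1.06Gbps) is already below the rate of a single link, a second card on the same bus could add no usable bandwidth, which is consistent with placing one card on each bus.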
It may seem like overkill to have three separate types of networks (Ethernet, Myrinet and HIPPI) in the cluster, but there is a good reason for each. Ethernet is used mainly for system management tasks using TCP/IP protocols. HIPPI is used on the front end for high-bandwidth access to mass storage (more on this later). Myrinet, on the other hand, is intended for use by parallel applications using the MPI message-passing library. For the Brain cluster (as well as its predecessor, Pinky), the MPI implementation used was MPICH, from Argonne National Laboratory. MPICH was selected over LAM because the developers at Myricom had written a ch_gm driver for MPICH that talked directly to the GM kernel driver for the Myrinet cards, bypassing the Linux TCP/IP stack entirely and allowing for much higher bandwidth and lower latency than would be possible over TCP/IP. There have been several other MPI implementations for Myrinet, such as those built on FM (Fast Messages) and AM (Active Messages), but these did not appear to be as robust or well supported as MPICH/ch_gm.
The system was initially assembled and tested in one of SGI's HPC systems labs in Mountain View, California, during October 1999. It was then shipped to Portland, Oregon, where it was featured prominently and demonstrated in SGI's booth at the Supercomputing '99 conference. After SC99, the cluster was dismantled and shipped to OSC's facility in Columbus, Ohio, where it was permanently installed.
The final installation bears some discussion with respect to floor space, power and cooling. As finally installed, the cluster consisted of seven racks: six with five 1400 nodes each, and one with three 1400 nodes, the Myrinet CLOS-64 switch, the Ethernet switch and a console server (see Figure 1). One of SGI's on-site computer engineers (CEs) estimated that each rack weighed on the order of 700 pounds, and he insisted that the raised floor in the area where the cluster was installed be reinforced (to put this in perspective, the only other OSC system that required floor reinforcement was a Cray T94, which weighs about 3,800 pounds). Each SGI 1400 unit has three redundant power supplies rated at 400 watts, requiring a total of twenty 20-amp circuits to be installed to supply electrical power. The front-end node was placed on UPS, while the compute nodes were placed on building power. Cooling for the room was found to be adequate; the heat load generated by 33 1400Ls ended up being inconsequential next to the cooling requirements for OSC's Cray systems and the Ohio State University's mainframes, all of which are housed in the same facility.
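As a rough sanity check on the power and weight figures above, the numbers can be tallied as follows. This is a back-of-the-envelope sketch only: the 120V circuit voltage is an assumption (the article does not state it), the supply ratings are nameplate values for redundant supplies, so actual draw would be lower, and floor loading depends on how weight is distributed rather than on the total alone.

```python
# Figures from the article.
NODES = 33
SUPPLIES_PER_NODE = 3
SUPPLY_WATTS = 400
CIRCUITS = 20
CIRCUIT_AMPS = 20
RACKS = 7
RACK_POUNDS = 700        # the CE's order-of-magnitude estimate
CRAY_T94_POUNDS = 3800

# Assumption, not stated in the article: standard 120V circuits.
ASSUMED_VOLTS = 120

# Worst-case nameplate rating if every supply ran at its full rating.
nameplate_watts = NODES * SUPPLIES_PER_NODE * SUPPLY_WATTS

# Total capacity of the installed circuits under the voltage assumption.
circuit_capacity_watts = CIRCUITS * CIRCUIT_AMPS * ASSUMED_VOLTS

# Estimated total weight of the racks.
cluster_pounds = RACKS * RACK_POUNDS

print(f"Nameplate supply rating: {nameplate_watts} W")
print(f"Circuit capacity:        {circuit_capacity_watts} W")
print(f"Estimated rack weight:   {cluster_pounds} lb")
print(f"Cray T94 weight:         {CRAY_T94_POUNDS} lb")
```

Under these assumptions the twenty circuits cover the full nameplate rating with some headroom, and the seven racks together are estimated to outweigh the T94.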