Integrating a Linux Cluster into a Production High-Performance Computing Environment
In August 1999, the Ohio Supercomputer Center (OSC) entered into an agreement with SGI, in which OSC would purchase a cluster of 33 SGI 1400L systems (running Linux). These systems were to be connected with Myricom's Myrinet high-speed network and used as a “Beowulf cluster on steroids”. The plan was to make this cluster system eventually a production quality high-performance computing (HPC) system, as well as a testbed for cluster software development by researchers at OSC, SGI, Myricom and elsewhere.
OSC was no stranger to clustering, having built its first workstation cluster (the Beakers, eight DEC Alpha workstations running OSF/1 and connected by FDDI) in 1995. Also, the LAM implementation of MPI started at OSC and was housed there for a number of years. This was not even OSC's first Linux cluster; Pinky, a small cluster of five dual-processor Pentium II systems connected with Myrinet, had been built in early 1999 and was made available to OSC users on a limited basis. However, this new cluster system was different in that it would be expected to be a production HPC system, just as OSC's Cray and SGI systems were.
The new cluster, nicknamed the Brain (after Pinky's smarter half on Animaniacs), consisted of 33 SGI 1400L systems, each with four Pentium III Xeon processors at 550MHz, 2GB of memory, a 10/100Mbps Ethernet interface and an 18GB UW-SCSI system disk. One system was configured as a front end or interactive node with more disks, a second Ethernet interface and an 800Mbps high-performance parallel interface (HIPPI) network interface. The other 32 systems were configured as compute nodes, with two 1.28Gbps Myrinet interfaces each. The reason for putting two Myrinet cards in each system was to increase the available network bandwidth between the nodes; the SGI 1400 systems have two 33MHz 32-bit PCI buses, so one Myrinet card was installed in each PCI bus (a single Myrinet card can easily saturate a 33MHz 32-bit PCI bus, so installing two in a single PCI bus is not a good idea). The 64 Myrinet cards were initially connected to a complex arrangement of eight 16-port Myrinet switches designed to maximize bisection bandwidth (the amount of bandwidth available if half of the network ports simultaneously attempt to communicate with the other half), but in the final installation these were replaced with a single 64-port Myrinet CLOS-64 switch. A 48-port Cisco Ethernet switch was also purchased to connect to the Ethernet cards in each system. This Ethernet network is private; the only network interface to the cluster accessible from the outside is the second Ethernet interface on the front-end node.
It may seem like overkill to have three separate types of networks (Ethernet, Myrinet and HIPPI) in the cluster, but there is actually a good reason for each. Ethernet is used mainly for system management tasks using TCP/IP protocols. HIPPI is used on the front end for high-bandwidth access to mass storage (more on this later). Myrinet, on the other hand, is intended for use by parallel applications using the MPI message-passing library. For the Brain cluster (as well as its predecessor, Pinky), the MPI implementation used was MPICH, from Argonne National Laboratory. The reason for selecting MPICH over LAM was that the developers at Myricom had developed a ch_gm driver for MPICH that talked directly to the GM kernel driver for the Myrinet cards, bypassing the Linux TCP/IP stack entirely and allowing for much higher bandwidth and lower latency than would be possible over TCP/IP. There have been several other MPI implementations for Myrinet, such as FM (fast messaging) and AM (active messaging), but these did not appear to be as robust or well supported as MPICH/ch_gm.
The system was initially assembled and tested in one of SGI's HPC systems labs in Mountain View, California during October 1999. It was then shipped to Portland, Oregon where it was featured and demoed prominently in SGI's booth at the Supercomputing '99 conference. After SC99, the cluster was dismantled and shipped to OSC's facility in Columbus, Ohio where it was permanently installed.
The final installation bears some discussion with respect to floor space, power and cooling. As finally installed, the cluster was comprised of seven racks, six with five 1400 nodes each and one with three 1400 nodes, the Myrinet CLOS-64 switch, the Ethernet switch and a console server (see Figure 1). One of SGI's on-site computer engineers (CEs) estimates that each rack weighs something on the order of 700 pounds, and he insisted on having the raised floor in the area where the cluster was installed reinforced (to put this in perspective, the only other OSC system that required floor reinforcement was a Cray T94, which weighs about 3,800 pounds). Each SGI 1400 unit has three redundant power supplies rated at 400 watts, requiring a total of twenty 20-amp circuits to be installed to supply electrical power. The front-end node was placed on UPS, while the compute nodes were placed on building power. Cooling for the room was found to be adequate; the heat load generated by 33 1400Ls ended up being inconsequential next to the cooling requirements for OSC's Cray systems and the Ohio State University's mainframes, all of which are housed in the same facility.
Practical Task Scheduling Deployment
July 20, 2016 12:00 pm CDT
One of the best things about the UNIX environment (aside from being stable and efficient) is the vast array of software tools available to help you do your job. Traditionally, a UNIX tool does only one thing, but does that one thing very well. For example, grep is very easy to use and can search vast amounts of data quickly. The find tool can find a particular file or files based on all kinds of criteria. It's pretty easy to string these tools together to build even more powerful tools, such as a tool that finds all of the .log files in the /home directory and searches each one for a particular entry. This erector-set mentality allows UNIX system administrators to seem to always have the right tool for the job.
Cron traditionally has been considered another such a tool for job scheduling, but is it enough? This webinar considers that very question. The first part builds on a previous Geek Guide, Beyond Cron, and briefly describes how to know when it might be time to consider upgrading your job scheduling infrastructure. The second part presents an actual planning and implementation framework.
Join Linux Journal's Mike Diehl and Pat Cameron of Help Systems.
Free to Linux Journal readers.Register Now!
- SUSE LLC's SUSE Manager
- My +1 Sword of Productivity
- Murat Yener and Onur Dundar's Expert Android Studio (Wrox)
- Managing Linux Using Puppet
- Non-Linux FOSS: Caffeine!
- Doing for User Space What We Did for Kernel Space
- SuperTuxKart 0.9.2 Released
- Parsing an RSS News Feed with a Bash Script
- Google's SwiftShader Released
- SourceClear Open
With all the industry talk about the benefits of Linux on Power and all the performance advantages offered by its open architecture, you may be considering a move in that direction. If you are thinking about analytics, big data and cloud computing, you would be right to evaluate Power. The idea of using commodity x86 hardware and replacing it every three years is an outdated cost model. It doesn’t consider the total cost of ownership, and it doesn’t consider the advantage of real processing power, high-availability and multithreading like a demon.
This ebook takes a look at some of the practical applications of the Linux on Power platform and ways you might bring all the performance power of this open architecture to bear for your organization. There are no smoke and mirrors here—just hard, cold, empirical evidence provided by independent sources. I also consider some innovative ways Linux on Power will be used in the future.Get the Guide