Integrating a Linux Cluster into a Production High-Performance Computing Environment
Overall, OSC has been quite pleased with the two Linux clusters it has had so far, and Linux clusters are seen at the center as one of the main directions in the future of high-performance computing. However, there are numerous areas in which Linux could be improved to support high-performance computing. Probably the most critical of these from OSC's perspective is a parallel filesystem with better parallel performance than NFS. The main use for this would be temporary storage for jobs; this is currently handled on the Brain cluster by having a $TMPDIR directory on each node, but a globally accessible scratch area would be much easier on users. There are currently two potential open-source candidates for a cluster parallel filesystem under Linux: GFS, from the University of Minnesota and PVFS, from Clemson University (see “A Parallel Virtual Filesystem for Linux Clusters”, LJ December 2000). GFS is a journaled, serverless storage-area network (SAN) filesystem over Fibre Channel. It promises to be an excellent performer, and its serverless aspect is quite attractive. However, as of this writing, the GFS code is in a state of flux following a redesign of its locking mechanisms, and the Fibre Channel switches needed for large topologies remain relatively expensive. PVFS, on the other hand, requires no special hardware; it, in effect, implements RAID-0 (striping) across multiple I/O node systems. PVFS's main downside is that it currently has no support for data redudancy, such that if an I/O node fails the parallel filesystem may be corrupted.
Another area where open-source solutions for high-performance computing clusters may be improved is job scheduling and resource management. While PBS has proven to be an adequate framework for resource management, its default scheduling algorithm leaves much to be desired. Luckily PBS was designed to allow third-party schedulers to be plugged into PBS to allow sites to implement their own scheduling policies. One such third-party scheduler is the Maui Scheduler from the Maui High Performance Computing Center. OSC has recently implemented Maui Scheduler on top of PBS and found it to be a dramatic improvement over the default PBS scheduler in terms of job turnaround time and system utilization. However, the documentation for Maui Scheduler is currently a little rough, although Dave Jackson, Maui's principal author, has been quite responsive with our questions.
A third area for work on Linux for high-performance computing is process checkpoint and restart. On Cray systems, the state of a running process can be written to disk and then used to restart the process after a reboot. A similar facility for Linux clusters would be a godsend to cluster administrators; however, for cluster systems using a network like Myrinet, it is quite difficult to implement due to the amount of state information stored in both the MPI implementation and the network hardware itself. Process checkpointing and migration for Linux is supported by a number of software packages such as Condor, from the University of Wisconsin, and MOSIX, from the Hebrew University of Jerusalem (see “MOSIX: a Cluster Load-Balancing Solution for Linux”, LJ May 2001); however, neither of these currently support the checkpointing of an arbitrary MPI process that uses a Myrinet network.
The major question for the future of clustering at OSC is what hardware platform will be used. To date Intel IA32-based systems have been used, primarily due to the wealth of software available. However, both Intel's IA64 and Compaq's Alpha 21264 promise greatly improved floating point performance over IA32. OSC has been experimenting with both IA64 and Alpha hardware, and the current plan is to install a cluster of dual processor SGI Itanium/IA64 systems connected with Myrinet 2000 some time in early 2001. This leads to another question: what do you do with old cluster hardware when they are retired? In the case of the Brain cluster, the plan is to hold a grant competition among research faculty in Ohio to select a number of labs that will receive smaller clusters of nodes from Brain. This would include both the hardware and the software environment, on the condition that idle cycles be usable by other researchers. OSC is also developing a statewide licensing program for commercial clustering software such as Totalview and the Portland Group compilers, to make cluster computing more ubiquitous in the state of Ohio.
This article would not have been possible without help from the author's coworkers who have worked on the OSC Linux clustering project, both past and present: Jim Giuliani, Dave Heisterberg, Doug Johnson and Pete Wyckoff. Doug deserves special mention, as both Pinky and Brain have been his babies in terms of both architecture and administration.