Integrating a Linux Cluster into a Production High-Performance Computing Environment

Troy discusses the performance and usage of the Brain at the Ohio Supercomputer Center.
Interface with Mass Storage

OSC's other HPC systems at the time of the Brain cluster's installation consisted of the following:

  • mss: an SGI Origin 2000 with eight MIPS R12000 processors at 300MHz, 4GB of memory, 1TB of Fibre Channel RAID, and approximately 60TB of tapes in an IBM 3494 tape robot with four tape drives

  • origin: an SGI Origin 2000 with 32 MIPS R12000 processors at 300MHz and 16GB of memory

  • osca: a Cray T94 with four custom vector processors at 450MHz and 1GB of memory

  • oscb: a Cray SV1 with 16 custom vector processors at 300MHz and 16GB of memory

  • t3e: a Cray T3E-600/LC with 136 Alpha EV5 processors at 300MHz and 16GB of memory

The latter four systems all mounted their user home directories from mss using NFS over a HIPPI network. When the cluster was installed, its front-end node, known as oscbw or node00, was added to the HIPPI network (see Figure 2). In addition, to make staging files into the compute nodes easier for end users, the compute nodes were configured to mount the user home directories over the private Ethernet, using a previously unused Ethernet port on mss (see Figure 3).

Figure 2. HIPPI Network

Figure 3. Compute Nodes Networked with mss via Private Ethernet

One difficulty encountered with this arrangement involved an interaction between the Linux NFS client implementation and hierarchical storage management (HSM). The mss system runs an HSM product from SGI called Tape Migration Facility (TMF). TMF periodically scans though all the files stored on selected filesystems (in this case the users' home directories) looking for large files that have not been accessed in some time and thus can be migrated off to a tape in the 3494 robot. When a user attempts to read a file that has been migrated to tape, the initial read() system call blocks until TMF is able to migrate the contents of the file back to disk. Unfortunately, the Linux 2.2 NFS client implementation queued NFS file reads by NFS server rather than by filesystem, and so trying to read a migrated file often caused the front-end node to lock up while the file was retrieved from disk.

Job Scheduling and Accounting

As with OSC's other HPC systems, the Brain cluster represents a shared resource for researchers at various academic and industrial institutions in Ohio. The Portable Batch System (PBS) version 2.2 was selected to handle resource management and job scheduling on the cluster. This choice was based on several factors:

  • Previous experience: PBS version 2.0 had been used on the Pinky cluster, after an extensive comparison with Platform Computing's LSF suite.

  • Use at large sites: many large Linux cluster sites used PBS as their scheduling software, including the National Aerodynamic Simulation (NAS) facility at NASA Ames where PBS had been developed.

  • Source availability: PBS was an open-source product with considerable Linux support, whereas LSF was closed source and Platform showed little interest at the time in making all of LSF's features available under Linux.

  • Cost: PBS was freely available (although support contracts were available from MRJ), while LSF incurred a significant per-processor licensing cost for production use.

Version 2.2 of PBS had another feature that was a significant improvement over version 2.0: per-processor allocation of cluster nodes. In version 2.0, PBS classified a system as either a time-shared host (e.g., a Cray vector system or large SMP) that could multitask several jobs or a space-shared cluster node (e.g., a uniprocessor node in a Beowulf cluster or IBM SP) that could be allocated to only a single job. PBS 2.2 extended the cluster node concept with a “virtual processor” attribute; a cluster node with multiple virtual processors can have multiple jobs assigned to it, and a user can specifically request nodes with multiple virtual processors per node.

However, PBS required some tinkering to make it work the way OSC's administrators and users had come to expect from a batch system after ten years of using NQE on Cray system. First, each job was assigned a unique working directory (accessed through the $TMPDIR environment variable). PBS job prologue and epilogue scripts were written to create these directories at the start of the job and delete them at the end of the job (see Listings 1 and 2). Scripts were also added to the /etc/profile.d directory on each compute node to set $TMPDIR inside batch jobs (see Listings 3 and 4). A distributed copy command, pbsdcp, was developed to allow users to copy files to $TMPDIR on each of the nodes allocated to their job without needing to know a priori which nodes they would be given (see Listing 5).

Listing 1. PBS Job Prologue Script

Listing 2. PBS Job Epilogue Script

Listing 3. Script to Set $TMPDIR Inside Batch Jobs

Listing 4. Script to Set $TMPDIR Inside Batch Jobs

Listing 5. pbsdcp

To facilitate the use of graphical programs such as the Totalview parallel debugger, a mechanism for doing remote X display from within the cluster's private network was developed. This mechanism relied on the X display port forwarding feature of ssh, as well as the interactive batch job feature of PBS. An interactive job in PBS is just like a normal batch job, except that it runs an interactive shell instead of a shell script. With an X pseudo-display on the front-end node courtesy of ssh, it was possible to make X programs run on the cluster's private network using some unorthodox xauth manipulations (see Listings 6 and 7).

Listing 6. Manipulating xauth to Display X Programs on the Private Network

Listing 7. Manipulating xauth to Display X Programs on the Private Network

The target applications for the Brain cluster were MPI-based parallel programs. To improve the startup time of these programs and make their CPU time accounting accurate, the rsh-based mpirun shell script from MPICH was replaced with a C program called mpiexec, which uses the task management API in PBS to start the MPI processes on individual nodes. This program also allowed a user to specify the number of MPI processes with which their job was run as one per virtual processor, one per Myrinet interface or one per node (see Listing 8 for examples). The OSC mpiexec program is available under the GNU GPL (see Resources for details).

Listing 8. Manipulating mpiexec

Users are charged against their time allocations based on the number of processors used and the duration of use. In the case of the Brain cluster where resources are space-shared, charging is done by multiplying the wall clock time used by a job times the number of processors requested. PBS supplied accounting logs with records of wall clock time and processors used, which were processed by a short Perl program and inserted into OSC's user accounting database. (The reader should keep in mind that no money changes hands for academic use of OSC's systems; researchers are simply granted time based on peer review of their research proposals.)