Integrating a Linux Cluster into a Production High-Performance Computing Environment
OSC's other HPC systems at the time of the Brain cluster's installation consisted of the following:
mss: an SGI Origin 2000 with eight MIPS R12000 processors at 300MHz, 4GB of memory, 1TB of Fibre Channel RAID, and approximately 60TB of tapes in an IBM 3494 tape robot with four tape drives
origin: an SGI Origin 2000 with 32 MIPS R12000 processors at 300MHz and 16GB of memory
osca: a Cray T94 with four custom vector processors at 450MHz and 1GB of memory
oscb: a Cray SV1 with 16 custom vector processors at 300MHz and 16GB of memory
t3e: a Cray T3E-600/LC with 136 Alpha EV5 processors at 300MHz and 16GB of memory
The latter four systems all mounted their user home directories from mss using NFS over a HIPPI network. When the cluster was installed, its front-end node, known as oscbw or node00, was added to the HIPPI network (see Figure 2). In addition, to make staging files into the compute nodes easier for end users, the compute nodes were configured to mount the user home directories over the private Ethernet, using a previously unused Ethernet port on mss (see Figure 3).
One difficulty encountered with this arrangement involved an interaction between the Linux NFS client implementation and hierarchical storage management (HSM). The mss system runs an HSM product from SGI called Tape Migration Facility (TMF). TMF periodically scans though all the files stored on selected filesystems (in this case the users' home directories) looking for large files that have not been accessed in some time and thus can be migrated off to a tape in the 3494 robot. When a user attempts to read a file that has been migrated to tape, the initial read() system call blocks until TMF is able to migrate the contents of the file back to disk. Unfortunately, the Linux 2.2 NFS client implementation queued NFS file reads by NFS server rather than by filesystem, and so trying to read a migrated file often caused the front-end node to lock up while the file was retrieved from disk.
As with OSC's other HPC systems, the Brain cluster represents a shared resource for researchers at various academic and industrial institutions in Ohio. The Portable Batch System (PBS) version 2.2 was selected to handle resource management and job scheduling on the cluster. This choice was based on several factors:
Previous experience: PBS version 2.0 had been used on the Pinky cluster, after an extensive comparison with Platform Computing's LSF suite.
Use at large sites: many large Linux cluster sites used PBS as their scheduling software, including the National Aerodynamic Simulation (NAS) facility at NASA Ames where PBS had been developed.
Source availability: PBS was an open-source product with considerable Linux support, whereas LSF was closed source and Platform showed little interest at the time in making all of LSF's features available under Linux.
Cost: PBS was freely available (although support contracts were available from MRJ), while LSF incurred a significant per-processor licensing cost for production use.
Version 2.2 of PBS had another feature that was a significant improvement over version 2.0: per-processor allocation of cluster nodes. In version 2.0, PBS classified a system as either a time-shared host (e.g., a Cray vector system or large SMP) that could multitask several jobs or a space-shared cluster node (e.g., a uniprocessor node in a Beowulf cluster or IBM SP) that could be allocated to only a single job. PBS 2.2 extended the cluster node concept with a “virtual processor” attribute; a cluster node with multiple virtual processors can have multiple jobs assigned to it, and a user can specifically request nodes with multiple virtual processors per node.
However, PBS required some tinkering to make it work the way OSC's administrators and users had come to expect from a batch system after ten years of using NQE on Cray system. First, each job was assigned a unique working directory (accessed through the $TMPDIR environment variable). PBS job prologue and epilogue scripts were written to create these directories at the start of the job and delete them at the end of the job (see Listings 1 and 2). Scripts were also added to the /etc/profile.d directory on each compute node to set $TMPDIR inside batch jobs (see Listings 3 and 4). A distributed copy command, pbsdcp, was developed to allow users to copy files to $TMPDIR on each of the nodes allocated to their job without needing to know a priori which nodes they would be given (see Listing 5).
To facilitate the use of graphical programs such as the Totalview parallel debugger, a mechanism for doing remote X display from within the cluster's private network was developed. This mechanism relied on the X display port forwarding feature of ssh, as well as the interactive batch job feature of PBS. An interactive job in PBS is just like a normal batch job, except that it runs an interactive shell instead of a shell script. With an X pseudo-display on the front-end node courtesy of ssh, it was possible to make X programs run on the cluster's private network using some unorthodox xauth manipulations (see Listings 6 and 7).
The target applications for the Brain cluster were MPI-based parallel programs. To improve the startup time of these programs and make their CPU time accounting accurate, the rsh-based mpirun shell script from MPICH was replaced with a C program called mpiexec, which uses the task management API in PBS to start the MPI processes on individual nodes. This program also allowed a user to specify the number of MPI processes with which their job was run as one per virtual processor, one per Myrinet interface or one per node (see Listing 8 for examples). The OSC mpiexec program is available under the GNU GPL (see Resources for details).
Users are charged against their time allocations based on the number of processors used and the duration of use. In the case of the Brain cluster where resources are space-shared, charging is done by multiplying the wall clock time used by a job times the number of processors requested. PBS supplied accounting logs with records of wall clock time and processors used, which were processed by a short Perl program and inserted into OSC's user accounting database. (The reader should keep in mind that no money changes hands for academic use of OSC's systems; researchers are simply granted time based on peer review of their research proposals.)
|Designing Electronics with Linux||May 22, 2013|
|Dynamic DNS—an Object Lesson in Problem Solving||May 21, 2013|
|Using Salt Stack and Vagrant for Drupal Development||May 20, 2013|
|Making Linux and Android Get Along (It's Not as Hard as It Sounds)||May 16, 2013|
|Drupal Is a Framework: Why Everyone Needs to Understand This||May 15, 2013|
|Home, My Backup Data Center||May 13, 2013|
- Designing Electronics with Linux
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Dynamic DNS—an Object Lesson in Problem Solving
- Using Salt Stack and Vagrant for Drupal Development
- Build a Skype Server for Your Home Phone System
- Why Python?
- New Products
- A Topic for Discussion - Open Source Feature-Richness?
- Validate an E-Mail Address with PHP, the Right Way
- Tech Tip: Really Simple HTTP Server with Python
- Understanding the Linux Kernel
1 hour 3 min ago
3 hours 33 min ago
- Kernel Problem
13 hours 35 min ago
- BASH script to log IPs on public web server
18 hours 2 min ago
21 hours 38 min ago
- Reply to comment | Linux Journal
22 hours 10 min ago
- All the articles you talked
1 day 34 min ago
- All the articles you talked
1 day 37 min ago
- All the articles you talked
1 day 39 min ago
1 day 5 hours ago
Enter to Win an Adafruit Pi Cobbler Breakout Kit for Raspberry Pi
It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Pi Cobbler Breakout Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- 5-21-13, Prototyping Pi Plate Kit: Philip Kirby
- Next winner announced on 5-27-13!
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?