Integrating a Linux Cluster into a Production High-Performance Computing Environment
OSC's other HPC systems at the time of the Brain cluster's installation consisted of the following:
mss: an SGI Origin 2000 with eight MIPS R12000 processors at 300MHz, 4GB of memory, 1TB of Fibre Channel RAID, and approximately 60TB of tapes in an IBM 3494 tape robot with four tape drives
origin: an SGI Origin 2000 with 32 MIPS R12000 processors at 300MHz and 16GB of memory
osca: a Cray T94 with four custom vector processors at 450MHz and 1GB of memory
oscb: a Cray SV1 with 16 custom vector processors at 300MHz and 16GB of memory
t3e: a Cray T3E-600/LC with 136 Alpha EV5 processors at 300MHz and 16GB of memory
The latter four systems all mounted their user home directories from mss using NFS over a HIPPI network. When the cluster was installed, its front-end node, known as oscbw or node00, was added to the HIPPI network (see Figure 2). In addition, to make staging files into the compute nodes easier for end users, the compute nodes were configured to mount the user home directories over the private Ethernet, using a previously unused Ethernet port on mss (see Figure 3).
One difficulty encountered with this arrangement involved an interaction between the Linux NFS client implementation and hierarchical storage management (HSM). The mss system runs an HSM product from SGI called Tape Migration Facility (TMF). TMF periodically scans though all the files stored on selected filesystems (in this case the users' home directories) looking for large files that have not been accessed in some time and thus can be migrated off to a tape in the 3494 robot. When a user attempts to read a file that has been migrated to tape, the initial read() system call blocks until TMF is able to migrate the contents of the file back to disk. Unfortunately, the Linux 2.2 NFS client implementation queued NFS file reads by NFS server rather than by filesystem, and so trying to read a migrated file often caused the front-end node to lock up while the file was retrieved from disk.
As with OSC's other HPC systems, the Brain cluster represents a shared resource for researchers at various academic and industrial institutions in Ohio. The Portable Batch System (PBS) version 2.2 was selected to handle resource management and job scheduling on the cluster. This choice was based on several factors:
Previous experience: PBS version 2.0 had been used on the Pinky cluster, after an extensive comparison with Platform Computing's LSF suite.
Use at large sites: many large Linux cluster sites used PBS as their scheduling software, including the National Aerodynamic Simulation (NAS) facility at NASA Ames where PBS had been developed.
Source availability: PBS was an open-source product with considerable Linux support, whereas LSF was closed source and Platform showed little interest at the time in making all of LSF's features available under Linux.
Cost: PBS was freely available (although support contracts were available from MRJ), while LSF incurred a significant per-processor licensing cost for production use.
Version 2.2 of PBS had another feature that was a significant improvement over version 2.0: per-processor allocation of cluster nodes. In version 2.0, PBS classified a system as either a time-shared host (e.g., a Cray vector system or large SMP) that could multitask several jobs or a space-shared cluster node (e.g., a uniprocessor node in a Beowulf cluster or IBM SP) that could be allocated to only a single job. PBS 2.2 extended the cluster node concept with a “virtual processor” attribute; a cluster node with multiple virtual processors can have multiple jobs assigned to it, and a user can specifically request nodes with multiple virtual processors per node.
However, PBS required some tinkering to make it work the way OSC's administrators and users had come to expect from a batch system after ten years of using NQE on Cray system. First, each job was assigned a unique working directory (accessed through the $TMPDIR environment variable). PBS job prologue and epilogue scripts were written to create these directories at the start of the job and delete them at the end of the job (see Listings 1 and 2). Scripts were also added to the /etc/profile.d directory on each compute node to set $TMPDIR inside batch jobs (see Listings 3 and 4). A distributed copy command, pbsdcp, was developed to allow users to copy files to $TMPDIR on each of the nodes allocated to their job without needing to know a priori which nodes they would be given (see Listing 5).
To facilitate the use of graphical programs such as the Totalview parallel debugger, a mechanism for doing remote X display from within the cluster's private network was developed. This mechanism relied on the X display port forwarding feature of ssh, as well as the interactive batch job feature of PBS. An interactive job in PBS is just like a normal batch job, except that it runs an interactive shell instead of a shell script. With an X pseudo-display on the front-end node courtesy of ssh, it was possible to make X programs run on the cluster's private network using some unorthodox xauth manipulations (see Listings 6 and 7).
The target applications for the Brain cluster were MPI-based parallel programs. To improve the startup time of these programs and make their CPU time accounting accurate, the rsh-based mpirun shell script from MPICH was replaced with a C program called mpiexec, which uses the task management API in PBS to start the MPI processes on individual nodes. This program also allowed a user to specify the number of MPI processes with which their job was run as one per virtual processor, one per Myrinet interface or one per node (see Listing 8 for examples). The OSC mpiexec program is available under the GNU GPL (see Resources for details).
Users are charged against their time allocations based on the number of processors used and the duration of use. In the case of the Brain cluster where resources are space-shared, charging is done by multiplying the wall clock time used by a job times the number of processors requested. PBS supplied accounting logs with records of wall clock time and processors used, which were processed by a short Perl program and inserted into OSC's user accounting database. (The reader should keep in mind that no money changes hands for academic use of OSC's systems; researchers are simply granted time based on peer review of their research proposals.)
Practical Task Scheduling Deployment
July 20, 2016 12:00 pm CDT
One of the best things about the UNIX environment (aside from being stable and efficient) is the vast array of software tools available to help you do your job. Traditionally, a UNIX tool does only one thing, but does that one thing very well. For example, grep is very easy to use and can search vast amounts of data quickly. The find tool can find a particular file or files based on all kinds of criteria. It's pretty easy to string these tools together to build even more powerful tools, such as a tool that finds all of the .log files in the /home directory and searches each one for a particular entry. This erector-set mentality allows UNIX system administrators to seem to always have the right tool for the job.
Cron traditionally has been considered another such a tool for job scheduling, but is it enough? This webinar considers that very question. The first part builds on a previous Geek Guide, Beyond Cron, and briefly describes how to know when it might be time to consider upgrading your job scheduling infrastructure. The second part presents an actual planning and implementation framework.
Join Linux Journal's Mike Diehl and Pat Cameron of Help Systems.
Free to Linux Journal readers.Register Now!
- Paranoid Penguin - Building a Secure Squid Web Proxy, Part IV
- SUSE LLC's SUSE Manager
- Google's SwiftShader Released
- Murat Yener and Onur Dundar's Expert Android Studio (Wrox)
- Managing Linux Using Puppet
- My +1 Sword of Productivity
- Non-Linux FOSS: Caffeine!
- SuperTuxKart 0.9.2 Released
- Parsing an RSS News Feed with a Bash Script
- Doing for User Space What We Did for Kernel Space
With all the industry talk about the benefits of Linux on Power and all the performance advantages offered by its open architecture, you may be considering a move in that direction. If you are thinking about analytics, big data and cloud computing, you would be right to evaluate Power. The idea of using commodity x86 hardware and replacing it every three years is an outdated cost model. It doesn’t consider the total cost of ownership, and it doesn’t consider the advantage of real processing power, high-availability and multithreading like a demon.
This ebook takes a look at some of the practical applications of the Linux on Power platform and ways you might bring all the performance power of this open architecture to bear for your organization. There are no smoke and mirrors here—just hard, cold, empirical evidence provided by independent sources. I also consider some innovative ways Linux on Power will be used in the future.Get the Guide