Integrating a Linux Cluster into a Production High-Performance Computing Environment

Troy discusses the performance and usage of the Brain at the Ohio Supercomputer Center.
User Environment

Great care was taken to make sure that the interactive environment on the cluster was as friendly and as similar to the other OSC systems as possible. The nodes of the cluster mount their home directories from mss, just as the other systems do. OSC's Cray systems use a modules facility for dynamically modifying environment variables to point to different versions of compilers, libraries and other software; the cluster nodes were given a workalike facility, originally developed at Los Alamos National Laboratory. Also, a complete suite of compilers for C, C++, Fortran 77 and Fortran 90, as well as a debugger and profiling tool, were purchased from the Portland Group. These compilers were selected based on their excellent optimizer, which was originally developed for the Pentium Pro-based ASCI Red TFLOPS system at Sandia National Laboratory. A variety of numerical libraries were made available on the cluster, including both open source (FFTw, PETSc and ScaLAPACK) and closed source (NAG Fortran and C).

One area in which the Brain cluster is rather unique is parallel performance analysis. Performance analysis tools under Linux have historically been rather primitive compared to those available on real supercomputers such as the Cray T3E. However, OSC staff were able either to acquire or develop a respectable collection of performance analysis tools for the Brain cluster. For profiling of serial (i.e., nonparallel) code, both the GNU gprof command-line profiler and the Portland Group's pgprof graphical profiler were installed. For profiling of MPI-based parallel code, the MPICH distribution supplied a profile logging facility and a Java-based graphical analysis tool called jumpshot. Finally, for truly in-depth performance analysis, the author developed an analysis program for hardware performance counter data called lperfex.

Hardware performance counters are a feature built into most modern microprocessors, and Intel processors based on the P6 core (i.e., Pentium Pro, Pentium II, Pentium III, Celeron and Xeon processors) have them as part of the model-specific registers. Erik Hendriks, one of the original Beowulf programmers and now at Scyld Computing, developed a kernel patch and user-space library for accessing these counters. The lperfex used this library to make a command-line performance counter interface, based on the example of the perfex utility found on SGI Origin 2000 systems. The beauty of this tool is that it requires no special compilation; it simply runs another program and records performance counter data (see Listing 9 for an example). It can also be used with MPI parallel applications (see Listing 10 at As with the OSC mpiexec program, lperfex is available under the GNU GPL. Recent versions of the PAPI instrumentation library from the Parallel Tools Consortium have also been shipped with lperfex as part of the distribution.

Listing 9. An Example of the lperfex Counter Tool at Work

User Experiences

The Brain cluster was opened to friendly users in February 2000 and quickly gained a small but loyal following in the OSC user community. Somewhat to the chagrin of the OSC staff, not all of these users were interested in running parallel MPI applications. Many were interested in running older computational chemistry codes such as Amber and Gaussian 98, neither of which support MPI over Myrinet. Another rather novel application run on the cluster was a gene sequence analysis tool called NCBI BLAST, which Dr. Bo Yuan (a researcher in the Human Cancer Genomics Program at Ohio State University's college of medicine) used to annotate about sixty thousand genes from the draft version of the human genome data set in about one week's time. While not written with MPI, BLAST did run in parallel with four processors on a node by using pthreads and shared memory, and further concurrency was achieved by running multiple simultaneous jobs with each analyzing a different sequence. The pattern-matching algorithm used by BLAST is primarily integer arithmetic, and the Intel processors in the Brain cluster's nodes were found to outperform the MIPS processors in OSC's Origin 2000 systems significantly (see Figure 4).

Figure 4. BLAST Performance

One user application that did use MPI over the Brain cluster's Myrinet network was a quantum chromodynamics (QCD) code written by Dr. Greg Kilcup from Ohio State's physics department. This code simulates the interaction of quarks in subatomic particles and is very communication-intensive, with each process sending a small message approximately every 200 floating point operations. This application is very sensitive to MPI latency and available memory bandwidth. On the Brain cluster, MPI latency was quite acceptable (on the order of 13 microseconds), and memory bandwidth became the main performance bottleneck. With four processors sharing an 800MB/s peak pipe to memory, each processor was limited to about 150MB/s sustained memory bandwidth. This limited each processor's floating point performance to about 60 MFLOPS. Using two processors per node improved both the sustained memory bandwidth and the floating point performance per processor (see Figure 5), while allowing higher processor-count runs than using only one processor per node.

Figure 5. Performance Using Two Processors per Node

Another user code that used MPI over Myrinet on the Brain cluster was a Monte Carlo simulation of condensed matter physics, written by Dr. Mark Jarrell from the University of Cincinnati's physics department. This application is “pleasantly” (also known as “embarrassingly”) parallel, meaning that it performs very little communication. However, like the QCD code described above, this code was very sensitive to memory bandwidth. The innermost loop of this application performed an outer product of two large (1,000+ element) arrays. This tended to cause low L2 cache reuse, which increased pressure on the already limited saturated memory bus (see Figure 6). As with the QCD code, this application has a sweet spot of two processors per node. Work is ongoing at OSC to try to improve the performance of this application.

Figure 6. Jarrell Code Performance