Modeling Seismic Wave Propagation on a 156GB PC Cluster
All the nodes in the cluster run Linux Red Hat 6.2. Linux corresponds perfectly to the demands of our application; we require very high reliability because the machine is being used by researchers who need their jobs to run without having to worry about nodes crashing. We have not had a single system crash since the machine was built nine months ago. The operating system needs to be tuned to the hardware in order to reach maximum performance; with the open-source philosophy, we have been able to recompile the kernel with a minimal set of options corresponding to our hardware configuration. We recently installed the 2.4.1 kernel, which has much better support for dual-node SMP machines than the 2.2 kernel. The performance is excellent; by switching from 2.2 to 2.4, the CPU time of our application has decreased by 25%. In terms of network configuration, the 156 nodes are on a private network of 192.168.1.X addresses. For security reasons, the cluster is not connected to the outside world, and all the post-processing and analysis of the results is done locally on the front end. We use rdate once a day in the cron table of each node to synchronize the time with the front end.
The biggest price we had to pay for the use of a PC cluster was the conversion of an existing serial code to a parallel code based on the message-passing philosophy. In our case the price was substantial because our group is composed of researchers who are not professional programmers. This situation meant we had to dedicate a few months to modifying several tens of thousands of lines of serial code. The main difficulty with the message-passing philosophy is that one needs to ensure that a control node (or master node) is distributing the workload evenly between all the other nodes (the compute nodes). Because all the nodes have to synchronize at each time step, each PC should finish its calculations in about the same amount of time. If the load is uneven (or if the load balancing is poor), the PCs are going to synchronize on the slowest node, leading to a worst-case scenario. Another obstacle is the possibility of communication patterns that can deadlock. A typical example is if PC A is waiting to receive information from PC B, while B is also waiting to receive information from A. To avoid deadlocking, one needs to use a master/slave programming methodology.
We use the MPI (message-passing interface) library to implement the message passing. Specifically, we installed the open-source MPICH implementation developed at Argonne National Laboratory (see the 1996 article by W. Gropp and collaborators, available at www-unix.mcs.anl.gov/mpi/mpich). This package has proven to be extremely reliable with Linux. MPI is becoming a standard in the parallel-computing community. Many features of MPI are similar to the older PVM (parallel virtual machine) library described in the article of R. A. Sevenich in Linux Journal, January 1998.
An additional difficulty with our project was the amount of legacy code we had to deal with. A lot of modern codes are based on libraries that contain legacy code developed in the 1970s and 1980s. Almost all scientific libraries were written in Fortran77, the language of choice at that time; use of C was not yet widespread. We decided not to convert the 40,000+ lines of code to C, rather we upgraded from Fortran77 to the modern Fortran90. The new version has dynamic-memory allocation, pointers, etc., and is back-compatible with Fortran77. We wrote a Perl script to perform most of the conversion automatically, fixing a few details by hand and changing memory allocations from static to dynamic. Unfortunately, to our knowledge no free Fortran90 compiler is currently available under Linux. The GNU g77 and f2c packages only support Fortran77. So, we had to buy a commercial package, pgf90 from The Portland Group, pgroup.com. This is the only non-open-source component in our cluster.
A limitation of PC clusters is the problem of system administration and maintenance. Using hundreds of PCs, one increases the probability of hardware or software failure of nodes. In the case of a hardware problem, the nice thing about PCs is that parts are standard and can be bought and replaced in a matter of hours. Therefore, the cost associated with maintenance is low compared to expensive maintenance contracts researchers used to need for classic supercomputers.
Software maintenance is more of an issue—with 156 nodes, how do you make sure they are all working properly? How do you install new software? How do you monitor the performance of a job that is running? When we installed the cluster, we wrote scripts that collected information from the nodes by sending rsh commands. Since then, universities like Berkeley and companies like VA Linux have developed efficient software packages for cluster monitoring and have made them open source. We use a node-cloning package called SystemImager from VA Linux (valinux.com) to do software upgrades. With this package we only need to upgrade the first node manually. Then the package uses rsync and tftp commands to copy (or clone) the changes to the other 155 nodes in a matter of minutes. To monitor the cluster and the jobs that are running, we use the Ganglia package from Matt Massie at Berkeley (millennium.berkeley.edu/ganglia), which is a fast and convenient package that uses a system of dæmons to send information about the state of each node to the front end, where it is gathered and displayed.
In Figure 6, we show a Tcl/Tk interface to the Ganglia package. The GUI we use is based on another open-source package, bWatch by Jacek Radajewski (sci.usq.edu.au/staff/jacek/bWatch). We modified it for our needs and to use Ganglia instead of standard rsh commands for much faster access to the nodes. Also, VA Linux has recently released the VACM package (VA Cluster Management), which we have not yet installed.
|Red Hat Enterprise Linux 7.1 beta available on IBM Power Platform||Jan 23, 2015|
|Designing with Linux||Jan 22, 2015|
|Wondershaper—QOS in a Pinch||Jan 21, 2015|
|Ideal Backups with zbackup||Jan 19, 2015|
|Non-Linux FOSS: Animation Made Easy||Jan 14, 2015|
|Internet of Things Blows Away CES, and it May Be Hunting for YOU Next||Jan 12, 2015|
- Designing with Linux
- Wondershaper—QOS in a Pinch
- Red Hat Enterprise Linux 7.1 beta available on IBM Power Platform
- Ideal Backups with zbackup
- Internet of Things Blows Away CES, and it May Be Hunting for YOU Next
- Slow System? iotop Is Your Friend
- New Products
- Non-Linux FOSS: Animation Made Easy
- Hats Off to Mozilla
- diff -u: What's New in Kernel Development
Editorial Advisory Panel
Thank you to our 2014 Editorial Advisors!
- Jeff Parent
- Brad Baillio
- Nick Baronian
- Steve Case
- Chadalavada Kalyana
- Caleb Cullen
- Keir Davis
- Michael Eager
- Nick Faltys
- Dennis Frey
- Philip Jacob
- Jay Kruizenga
- Steve Marquez
- Dave McAllister
- Craig Oda
- Mike Roberts
- Chris Stark
- Patrick Swartz
- David Lynch
- Alicia Gibb
- Thomas Quinlan
- Carson McDonald
- Kristen Shoemaker
- Charnell Luchich
- James Walker
- Victor Gregorio
- Hari Boukis
- Brian Conner
- David Lane