Quantcast
Username/Email:  Password: 

Building an Inexpensive, Powerful Parallel Machine and Using It for Numerical Simulations

Faced with a limited budget, Roman researched and built his own parallel system for multiprocessor simulations.

During my postdoctoral training at SUNY
Health Science Center at Syracuse two years ago, I had some
exposure to a commercial 30-processor SUN Enterprise E6500 machine.
The price of that machine, together with its support package and
software licenses, ran up around half a million dollars. This cost
is why during the negotiation of my new faculty startup funds at
Montclair State University, when I was offered a maximum of $25,000
for "something parallel", I was skeptical at first. However, in the
spring of 2001, after doing my homework, I embarked on a project of
building my own Beowulf-type cluster of Pentium-based
computers.Choosing Linux as the operating system for the cluster was
obvious. Even though I had only about one year of experience with
Linux, I already considered myself a Linux fan and a big supporter
of open-source software. I spent much more time, therefore,
choosing the right hardware that would optimize the projected
performance under the constrained cost. After the choice was made,
the boxes began to arrive.In July 2001, about three months after the project began, I
had a cluster, which I named GALAXY, up and running. It has since
exceeded my best amateur expectations of its performance and
reliability, and it has become an indispensable tool in my research
of spiral waves in excitable media.Hardware and AssemblyClassic Beowulf-type clusters used to be built with PCs.
However, because of falling prices of rackmountable, Intel-based
servers, I realized that with a small sacrifice in the
price-performance ratio I could assemble my cluster in a standard
cabinet enclosure (on wheels). This would make it much more
manageable, compact and portable, compared to the LOBOS model (lots
of boxes on shelves). I purchased ten dual 1GHz Pentium III
rackmountable servers at about $2,000 per server, for a total of
about $20,000, from Microway, Inc.
(www.microway.com).Gigabit Ethernet over copper (1000BaseT), which uses category
5e conventional Ethernet cables, was chosen as the internode
communication standard. Thus, each node included a 3COM Gigabit PCI
network card, or one card per two CPUs. Each of the ten cards was
connected to a 12-port 3COM Gigabit switch (SuperStack-3 4900),
which cost about $3,000.The rest of the money (about $2,000) was spent on a standard
19" 40U rackmount cabinet with a shelf (APC Netshelter), a UPS for
power backup of the master node, power outlets, Ethernet kit and
cable and KVM extension cables. All the cluster components, except
for the servers, were ordered through MicroWarehouse
(www.microwarehouse.com).One special node out of ten, called the master node, has two
20GB hard drives, a CD-RW drive and 512MB of ECC SDRAM. This is the
only node that has a keyboard, video and mouse connected. Its
rackmount size is 2U. In addition to the Gigabit connection with
the switch, this node is also connected by Fast Ethernet to the
outside network. Each of the nine slave nodes has a single 10GB
hard drive, 256MB of memory and is 1U in size. Every node has a
floppy drive. The cabinet's front and back doors are perforated for
air circulation. The only fans are the ones inside the server boxes
and inside the switch.

Figures 1 and 2 show the front and rear views of GALAXY,
whose assembly was carried out by a single person. Its total peak
performance is represented by the formula 1GHz/CPU x 20 CPUs,
which, assuming one floating-point operation per one-clock tick, is
equal to 20 Gigaflops, or 20 billion floating-point operations per
second.With an optional 4-port module (about $1,000), the total
number of Gigabit ports in the switch, and thus the maximum number
of nodes, can easily be increased to 16, all of which corresponds
to 32 processors and 32 Gigaflops peak performance. The cabinet
space can accommodate twice that number of nodes, which can be
hooked up using multiple cascading switches, but the room's
electrical capabilities and heat dissipation will become an issue
at that point.Although the term "cluster" is traditionally appropriate for
a system like GALAXY, it is not a cluster of PCs because each
server is not exactly a PC. In fact, this cluster is structurally
not very different from the 30-CPU SUN Enterprise server mentioned
above, which nobody calls a cluster. This is why I used the term
"parallel machine" instead of cluster in the title of this
article.SoftwareThere is a choice of parallel software for clusters, some
installed on top of Linux, others integrated into Linux and running
out-of-the-box. An example of the latter is the Scyld Beowulf
Cluster Operating System
(www.scyld.com), which
can be viewed as a numerous set of patches to Red Hat Linux 6.2,
and which I happened to adopt for GALAXY. I purchased a Scyld
Beowulf CD (with version 27bz-7) from Linux Central
(www.linuxcentral.com)
for about $5. This software is open-source, and the source code is
available at
ftp.scyld.com.The biggest advantage of the Scyld Beowulf Cluster OS for me
is its beoboot utility. The Scyld Beowulf OS is installed on the
master node pretty much the same way Red Hat Linux is installed.
After the installation, a beoboot floppy is created. Adding the
first slave node and any subsequent nodes to the cluster is as easy
as booting that floppy and waiting. A node, which might have no OS
on it at all, becomes a part of the cluster in about one minute. It
first boots an initial kernel from the beoboot floppy, which tells
it to get in touch with the master node and download and execute
the full-size kernel from it. Thus, system administration only
needs to be done on the master node. The slave nodes get the
updated OS from the master node every time they boot.Jobs on any node are started from the master node, which is
the only login place in the entire cluster. Scyld Beowulf software
creates an illusion of a single computer (master node) with many
CPUs (those of slave nodes). Monitoring slave nodes from the master
node is easy with a graphical beostatus utility (see Figure 3) or
simply with top.
Figure 3. Node Monitoring with the beostatus
Utility
Newer incompatible network card drivers, the ATAPI CD-RW
drive and the fact that all my nodes are actually SMPs (symmetric
multiprocessors) did cause some problems with Scyld Beowulf OS,
which didn't work smoothly out-of-the-box the way it should. To fix
all those problems I had to recompile both the main and the beoboot
kernels. The Linux kernel recompilation, which scared me at the
beginning, is actually a streamlined process, well-documented in
the Linux HOWTOs. After a while, I had a custom kernel that
included only the features I wanted (SMP support, i686
optimizations, SCSI emulation mode for the ATAPI CD-RW drive,
etc.). I soon realized that the Linux kernel I obtained, as well as
the entire system, was rock-solid and the cluster was functioning
exactly as it should.Message passing is the established programming paradigm for
distributed memory parallel machines, of which GALAXY is an
example. MPI (message passing interface) is an accepted open
standard for message passing between processors, and there are many
concrete implementations of the MPI standard for different
platforms. One of them is MPICH, a customized version of which is
integrated into Scyld Beowulf software. From the programmer's point
of view, an MPI implementation is a set of Fortran, C and C++
libraries. These libraries provide routines, of which the four most
important being: Get_size(), Get_rank(), MPI_Send(...) and
MPI_Recv(...).An MPI program is usually started on all
P available processors. Each concurrently
running instance of that program can query the parallel machine for
the total number of CPUs using Get_size() and for the local CPU
number using Get_rank(). With MPI_Send() and MPI_Recv(), a pair of
CPUs can send data between each other.Having properly installed the Scyld Beowulf Cluster OS
software, one can start writing parallel programs. I have been
writing my code in C++, compile it with g++ and link it with
libmpi++. The parallel Linux-based operating system I built, which
is 100% open-source and has a total cost of under $10, exhibits an
amazing stability. My parallel programs run on GALAXY's 20
processors for hours and days without a glitch. On the standard
Linpack benchmark, which is also included with Scyld Beowulf OS,
GALAXY demonstrated a sustained performance of 7 Gigaflops.Numerical SimulationsI have used GALAXY extensively for numerical modeling of
excitable media. Under certain conditions, excitation waves
propagating in such media break into persistent spiral patterns.
One major example of an excitable medium is cardiac muscular
tissue. Cardiac arrhythmias, the leading cause of deaths in
industrialized countries, are believed to be triggered by the
appearance of spiral waves in cardiac electrical activity. The area
of my current investigation is the asymptotic behavior of a large
number of interacting spirals; that is, the long-term development
in a 2-D excitable system with a large number of initially
generated random spirals.To answer this question I had to solve numerically a system
of the underlying nonlinear PDEs on a very large domain (to
accommodate a large number of spirals) and for a very large number
of time steps. GALAXY helped me reduce the simulation time from
days to hours. Very often, in order to achieve such results,
researchers would reduce the numerical solution of a system to
performing linear algebra operations on large matrices, which can
be parallelized using the existing libraries (such as LAPACK,
www.netlib.org/lapack).
However, this approach is very specific for each individual type of
problem and usually involves rather complicated math. The approach
I used is by far more straightforward and uniform, and it can be
used by anyone who deals with any PDE system on a rectangle,
providing it has already been solved numerically (using some
difference scheme) for a single CPU.The method consists of splitting the rectangular domain into
P equally-sized stripes (see Figure 4) and
assigning each stripe to one of the P
available processors. Each iteration of the main loop consists of
two major steps. First, each processor updates the data on the
grid-points of its corresponding stripe. Then, using simple
MPI_Send() and MPI_Recv() calls, all pairwise neighboring
processors exchange the data along the shared boundary of the
corresponding stripes.
Figure 4. Dividing the Domain among Multiple CPUs
This simple parallelization technique, applied to my
sequential spiral interaction code, led to a 12-fold speed-up on
the 18 slave CPUs (with 18 stripes). The two CPUs of the master
node were used for a periodic data collection and file output (one
pthread) and for displaying results using Mesa 3-D graphics (second
pthread). The numerical simulations revealed a new and unexpected
effect: the interacting spirals tend to form stable triplets (see
Figure 5). Clearly, the above algorithm can be extended to a 3-D
case, which is one of my future directions.
Figure 5. One Simulation Result for Spiral Wave Interaction
It is worth noting one more naturally parallelizable
situation that does not require processor communication (message
passing) at all. Researchers often run one program many times with
different input, for example, to find out which parameters lead a
simulated system to some special behavior. One can start such a
program on P available processors with
P different parameter values, and the running
time will be P times less than with
P consecutive runs on a single CPU.ConclusionGALAXY is a performant message-passing, multiprocessor
system, which can be compared to some commercial parallel machines,
although it carries only a small fraction of their cost. My example
has demonstrated that such a machine can be built, maintained and
successfully used by a single person. With a simple message-passing
algorithm, like the one described above, GALAXY can speed up a
generic PDE solver on a rectangular domain by at least ten times
compared to a single workstation. With such a speed up, some
programs, previously run as batch jobs, can be converted into
real-time, interactive, graphical simulations.Writing an MPI program can be a tedious task. Is the time
spent really justified? Because of physical limitations, the clock
rate of a single CPU can't grow indefinitely. Most researchers
agree that Moore's law, which states that the CPU speed doubles
every 18 months, will probably not hold beyond 10-15 years from
now. Utilization of parallelism will become the primary way to
increase performance. A well-written MPI program receives the total
processor number dynamically at runtime and, therefore, such a
program should seamlessly scale to a growing number of CPUs,
whenever such systems will become available.A combination of two factors makes systems like GALAXY
possible: inexpensive commodity hardware components of tremendous
performance (processors, memory, switches, etc.) and the existence
of Linux and numerous other open-source software products such as
GNU compilers, GNOME desktop, MPICH MPI implementation and Scyld
cluster software. There is an emerging middle ground between the
industrial multiprocessor computers running proprietary software
and self-made systems like GALAXY.ResourcesBeowulf
Project
GALAXYMPIScyld Beowulf Cluster
OS
A
Spiral Wave Java Applet
Spiral Waves in Excitable Media: "On the Interaction of
Vortices in Two-Dimensional Active Media", E.A. Ermakova, A.M.
Pertsov and E.E. Shnol. Physica D 40 (1989)
185-195. North-Holland, Amsterdam.Top 100
Clusters
AcknowledgementsI'd like to thank my Department Chair, Dr. Dorothy Deremer,
and the former College Dean, Dr. Kenneth C. Wolff, for the project
funding and support.Dr. Roman Zaritski is a
professor at the Department of Computer Science of the Montclair
State University in New Jersey. His primary interest is numerical
modeling of biological systems.

email: zaritski@roman.montclair.edu

______________________

Comments

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Re: Building an Inexpensive, Powerful Parallel Machine and Using

Anonymous's picture

Congratulatios for your job!

Luca

Re: Building an Inexpensive, Powerful Parallel Machine and Using

Anonymous's picture

Your math seems a bit fuzzy when talking about 20 Gigaflop peak performance. First of all there's no way to get that from a Beowulf with 20 ! Ghz processors, you're assuming no overhead, no communication, etc. Plus, I Ghz does not necessarily mean floating point.

~The Boys from Testing

Re: Building an Inexpensive, Powerful Parallel Machine and Using

Anonymous's picture

Yes, "Peak Performance" of a cluster is a purely hardware term that is computed as the sum of frequencies of all CPUs. It is some rough measure of how fast/strong a cluster is, which is not very precise (e.g., it does not count how fast the network or how efficient the software is). Nevertheless, many people (e.g., at http://clusters.top500.org/db/) use it to describe a cluster.

In the article I also give a Linpack benchmark result for my cluster: 7 Gigaflops sustained performance. This is a more thorough and realistic measurement.

Roman Zaritski

Post new comment

  • Allowed HTML tags: <a> <em> <strong> <cite> <code> <pre> <ul> <ol> <li> <dl> <dt> <dd> <i> <b>
  • Lines and paragraphs break automatically.
  • Use to create page breaks.

More information about formatting options