I'm Not Going to Pay a Lot for This Supercomputer!

Los Alamos National Laboratory and Caltech obtain gigaflops performance on parallel Linux machines running free software and built from commodity parts costing less than $55,000 each (in September 1996). Now, you can probably build a similar machine for about $25,000.

Because Loki is used for astrophysics, it has computed results for a gravitational N-body problem using a parallelized hashed oct-tree library. (An oct-tree is a three-dimensional tree data structure in which each cubical cell is recursively divided into eight daughter cells. A treecode is a numerical algorithm that uses tree data structures to increase the efficiency of N-body simulations. For details on the treecode, see the URL listed in Resources.) The code is not machine-specific, so comparing the performance of the commodity machines to traditional supercomputers is free of porting issues (with the exception that the versions for the Intel i860 and the Thinking Machines CM-5 have an inner loop coded in assembly).
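To make the "hashed oct-tree" idea concrete, here is a minimal Python sketch. It is not the library's code: the function names, the unit-cube domain and the choice of bit order are assumptions made for illustration. Each cell gets an integer key built from three bits per tree level, with a leading 1 bit marking the root, so the eight daughters and the parent of any cell follow from simple shifts, and cells can be stored in an ordinary hash table instead of being linked by pointers.

```python
def daughter_keys(key):
    """Keys of the eight daughter cells of the cell identified by `key`."""
    return [(key << 3) | octant for octant in range(8)]


def parent_key(key):
    """Key of the cell that contains the cell identified by `key`."""
    return key >> 3


def particle_key(x, y, z, depth):
    """Key of the depth-`depth` cell holding point (x, y, z).

    Assumes the whole simulation volume is the unit cube [0, 1)^3.
    """
    key = 1  # root cell; the leading 1 bit records how deep the key is
    for _ in range(depth):
        x, y, z = 2 * x, 2 * y, 2 * z
        ix, iy, iz = int(x), int(y), int(z)  # 0 or 1: which half on each axis
        key = (key << 3) | (ix << 2) | (iy << 1) | iz
        x, y, z = x - ix, y - iy, z - iz     # coordinates inside the daughter
    return key


def build_table(points, depth):
    """Hash table mapping cell key -> list of points in that cell."""
    table = {}
    for point in points:
        table.setdefault(particle_key(*point, depth), []).append(point)
    return table


if __name__ == "__main__":
    points = [(0.1, 0.1, 0.1), (0.6, 0.1, 0.1), (0.9, 0.9, 0.9)]
    for key, members in sorted(build_table(points, depth=2).items()):
        print(bin(key), members)
```

Because the key alone identifies a cell, a processor can ask for any cell by key without knowing where it is stored, which is one reason a hashed layout is convenient on a parallel machine.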

At Supercomputing '96, Loki and Hyglac were connected (via $3,000 of additional Ethernet cards and cables) to perform as a single 32-node machine with a purchase cost of just over $100,000. Running the N-body benchmark calculation with 10 million particles, Loki+Hyglac achieved 2.19 billion floating-point operations per second (GFLOPS), more than doubling the per-processor performance of a Cray T3D and almost matching that of an IBM SP-2 (see Table 1).

Figure 4. Intermediate stage of a gravitational N-body simulation of galaxy formation using 9.75 million particles. It took about three days to compute on Loki at a sustained rate of about 1 GFLOPS.

As a stand-alone machine at LANL, Loki has performed an N-body calculation with just over 9.75 million particles. This calculation was “real work” and not “proof-of-principle”, so it was tuned to optimize scientific results rather than machine performance. Even with that condition, the performance and results are striking. The total simulation required 10 days (less a few hours) to step through 750 time steps, performed 6.6 x 10^14 floating-point operations to compute 1.97 x 10^13 particle interactions and produced just over 10GB of output data.

For the entire simulation, Loki achieved an average of 879 MFLOPS, yielding a price/performance figure of $58/MFLOP. Contemporary machines such as SGI's Origin are capable of price/performance in this range, but scaling an Origin to the memory and disk necessary to perform a calculation of this magnitude quickly becomes prohibitively expensive; at list price, 2GB of Origin's memory alone costs more than the entire Loki assembly.
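The price/performance figure is simply purchase cost divided by sustained speed. The short Python sketch below works through that arithmetic. Loki's exact purchase price is not quoted alongside the ratio, so the cost here is back-computed from the $58 and 879 MFLOPS figures and should be read as an inference, not a quoted price; the $55,000 ceiling comes from the introduction. The function and variable names are illustrative.

```python
def price_per_mflops(cost_dollars, sustained_mflops):
    """Dollars paid per sustained MFLOPS."""
    return cost_dollars / sustained_mflops


SUSTAINED_MFLOPS = 879   # average over the whole simulation (from the text)
QUOTED_RATIO = 58        # dollars per MFLOPS (from the text)
COST_CEILING = 55_000    # "less than $55,000" (from the introduction)

# Inferred, not quoted: the cost that would give exactly the quoted ratio.
implied_cost = QUOTED_RATIO * SUSTAINED_MFLOPS

# Worst case if the machine had cost the full ceiling price.
ceiling_ratio = price_per_mflops(COST_CEILING, SUSTAINED_MFLOPS)

if __name__ == "__main__":
    print(f"implied cost: ${implied_cost:,}")
    print(f"ratio at ceiling price: ${ceiling_ratio:.1f}/MFLOPS")
```

The implied cost lands a little under the $55,000 ceiling, so the quoted ratio is consistent with the price range given at the start of the article.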

The nature of the treecode is such that later time steps have greater overhead in spanning the tree than in performing floating-point arithmetic, so the average flop rate steadily decreases the longer the code is run. Considering only the first 30 time steps of the simulation, 1.15 x 10^12 particle interactions in 10.25 hours provide a throughput of 1.19 GFLOPS. This figure is actually a better estimate of the amount of useful work than that given for the total simulation, since the treecode's purpose is to avoid floating-point calculations whenever possible.
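The throughput quoted for the first 30 time steps rests on a count of floating-point operations per particle interaction, which is not stated here. As a rough check, the Python sketch below derives the interaction rate from the quoted numbers and then the per-interaction operation count they imply. That count is an inference from this paragraph's figures only, and the variable names are illustrative.

```python
HOURS = 10.25            # wall-clock time for the first 30 steps (from the text)
INTERACTIONS = 1.15e12   # particle interactions in those steps (from the text)
QUOTED_GFLOPS = 1.19     # quoted throughput (from the text)

seconds = HOURS * 3600
interactions_per_second = INTERACTIONS / seconds

# Inferred, not quoted: operations charged to each interaction so that
# the interaction rate reproduces the quoted GFLOPS figure.
implied_flops_per_interaction = QUOTED_GFLOPS * 1e9 / interactions_per_second

if __name__ == "__main__":
    print(f"{interactions_per_second:.3e} interactions per second")
    print(f"about {implied_flops_per_interaction:.1f} operations per interaction")
```

Counting operations per interaction, not every instruction the processor executes, is what makes this a measure of useful work: time spent walking the tree adds to the clock but not to the operation count.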


Loki has also been used to simulate the fusion of two vortex rings. The simulation began with 57,000 vortex particles in two discrete smoke rings, though re-meshing caused the simulation to be tracking 360,000 particles by the final time step. Each processor sustained just over 65 MFLOPS during the simulation for a total system performance of 950 MFLOPS.

Photo-realistic Rendering

Hyglac has been used to perform photo-realistic rendering using a Monte Carlo implementation of the rendering equation. Some of the rendered images are available at http://www.cacr.caltech.edu/research/beowulf/rendering.html. In a direct comparison, Hyglac completed the renderings anywhere from 12% to 20% faster than an IBM SP-2, a machine with a price tag twenty times that of Hyglac.

System Reliability

Even the most blazingly fast system is useless if it can't perform without crashing. System reliability is therefore crucial, especially in the case of a machine like Loki which may need several days without interruption to complete a large-scale calculation. During the burn-in period, a bad SIMM and a handful of bad hard drives were replaced under their warranty terms. The warranties on commodity parts make these commodity supercomputers particularly appealing. Warranties on specialty machines like the Origin tend to be 90 days or less, whereas readily available parts such as Loki's innards generally have warranties ranging from a year to life. In September 1997, most of the Loki nodes had uptimes of over 4 months without a reboot. The only hardware problems encountered have been three ATX power supply fans which failed, resulting in node shutdowns due to overheating. Those nodes were easily swapped with a spare, and the fans replaced in a few minutes.