Natural Selection in a Linux Universe
With the master computer up and running, we turned on each node one at a time. By default, the BIOS in each node tries to boot from the network first. It finds the boot ROM on the Ethernet card, and the ROM image broadcasts a BOOTP request over the network. When the server receives the request, it identifies the associated hardware address, assigns a corresponding IP address, and allows the requesting node to download the boot image. The node loads the kernel image into memory, creates an 8MB initial RAM disk, mounts the root file system, and executes an rc script which starts essential services and daemons.
Once all nodes are up, we log in to the server and start the PVM daemon. An rhosts file in the home directory on each of the nodes allows the server to start up the daemons. We can then run in parallel any executable file that uses the PVM library routines and is included in the root file system.
For our problem, the executable residing on the nodes involves building and vibrating a white dwarf model and comparing the resulting theoretical frequencies to those observed in a real white dwarf. A genetic algorithm running on the master computer is concerned with sending sets of model parameters to each node and modifying the parameter sets based on the results. We tested the performance of the finished metacomputer with the same genetic algorithm master program as our white dwarf project, but with a less computationally intensive node program. The code ran 29.5 times faster using all 32 nodes than it did using a single node. Our tests also indicate that node programs with a higher computation to communication ratio yield an even better efficiency. We expect the white dwarf code to be approximately ten times more computationally intensive than our test problem.
After more than three months without incident, one of the nodes abruptly died. As it turned out, the power supply had gone bad, frying the motherboard and the CPU fan in the process. The processor overheated, shut itself off, and triggered an alarm. We now keep a few spare CPU fans and power supplies on hand. This is the only real problem we have had with the system, and it was easily diagnosed and fixed.
The availability of open-source software like Linux, PVM, Netboot and YARD made this project possible. We would never have considered doing it this way if we'd had to use a substantial fraction of our limited budget to buy software as well as hardware and if we'd been unable to modify it to suit our needs once we had it. This is an aspect of the Open Source movement we have not seen discussed before—the ability to try something new and show it can work, before investing a lot of money in the fond hope that everything will turn out fine.