Clusters for Nothing and Nodes for Free
The most valuable part of the work is the data and all the intermediate states of the work in progress, because any damage here sets you back days even if you have backups and external version control checkpoints. RAID level 1 or 5 is the usual protection. One computer, which need not be one of the fastest, should have at least two hard drives on distinct controllers. It is worth giving each drive a small swap partition so the kernel can use all of them and do some load balancing.
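As a rough illustration of the swap arrangement, the fragment below shows one swap partition per drive in /etc/fstab. The device names are hypothetical. Giving both entries the same pri= value matters: the kernel allocates pages round-robin only among swap areas of equal priority, and otherwise fills the higher-priority area first.

```text
# /etc/fstab excerpt (device names are hypothetical)
/dev/hda2  none  swap  sw,pri=1  0  0
/dev/hdc2  none  swap  sw,pri=1  0  0
```

Running swapon -s afterward should list both partitions with the same priority.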
Turn on the kernel-space NFS server and configure /etc/exports with an eye to protecting the data storage from damage. When NFS is under heavy load, user-space programs have to be swapped out to make space for additional disk cache. Consider defining a runlevel that disables all the services that wake up periodically for minor purposes.
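A minimal sketch of a cautious /etc/exports follows. The directory names and the subnet are assumptions for illustration, not the layout used here. The idea is to export sources read-only, allow writes only where results are collected, limit access to the local network and keep remote root mapped to an unprivileged user.

```text
# /etc/exports excerpt (paths and subnet are hypothetical)
/srv/sim/src      192.168.1.0/24(ro,sync,root_squash)
/srv/sim/results  192.168.1.0/24(rw,sync,root_squash)
```

After editing the file, exportfs -ra makes the server reread it.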
We're using an existing dual-Athlon MP machine with over a terabyte of storage and running Debian stable as our NFS server. The system is overkill for the cluster; we originally built it to archive field test data and then stream the data to multiple clients for analysis. No X server is used, because the cooling fans make so much noise that nobody wants the machine sitting next to his or her desk.
Using make batch2 on a dual-processor machine reduced our runtime by about 40%, with one of the processors being idle near the end of the run. The total runtime was between four and six hours of clock time. This can be improved, even without a cluster, by distributing the work across many machines using OpenSSH with public key authentication. The Linux Journal article (“Eleven SSH Tricks” by Daniel R. Allen, August 2003) explained how to configure this powerful package to avoid endless streams of password prompts while simultaneously enhancing network security.
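Before starting a run that lasts hours, it can help to confirm that every host really does accept key-based logins without a prompt. The following is a minimal sketch of such a preflight check, not part of our original scripts. The host names and the SSH override variable are assumptions for illustration; BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
#!/bin/bash
# Preflight check: report hosts that would still prompt for a password.
# SSH can be overridden (an assumption of this sketch) to ease testing.
SSH=${SSH:-ssh}

check_hosts() {
  local host failed=0
  for host in "$@"; do
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    if "$SSH" -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null
    then
      echo "ok   $host"
    else
      echo "FAIL $host"
      failed=1
    fi
  done
  return $failed
}

# Example (hypothetical host names):
#   check_hosts host1 host2 host5 || echo "fix keys before running"
```

The function returns nonzero if any host fails, so it can guard the start of a batch script.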
Listing 2. This runs simulations in parallel on many computers. The runtime is consistent but can be inefficient.
#! /bin/bash
for pair in host1/test1 host2/test2 \
            host2/test3 host5/test4
do
  test=`basename $pair`
  make $test
  ssh `dirname $pair` vvpstdin \
      < $test > $test.out &
done
wait; make
The Icarus simulation engine vvp cannot load from standard input, so we use this vvpstdin script:
#! /bin/bash
F=/tmp/`basename $0`.tmp.$$
cat > $F
/usr/local/bin/vvp $F
exec rm $F
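The temporary file name above is predictable, and the script always exits with the status of rm rather than that of the simulator. A possible variant, offered only as a sketch, uses mktemp and passes the command's exit status back to the caller. The function name run_on_stdin is hypothetical.

```shell
#!/bin/bash
# Variant of vvpstdin: save stdin to a private temporary file, run a
# command on it, remove the file, and return the command's exit status.
run_on_stdin() {
  local tmp status
  tmp=$(mktemp) || return 1
  cat > "$tmp"
  "$@" "$tmp"          # the temporary file becomes the last argument
  status=$?
  rm -f "$tmp"
  return $status
}

# Example, assuming vvp lives in /usr/local/bin as in the script above:
#   run_on_stdin /usr/local/bin/vvp < test1 > test1.out
```

Because the status is preserved, a calling script can tell a failed simulation from a successful one.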
The machines sharing the work usually have different performance capabilities. It is important to match the relative runtimes of the various tests against individual processor speeds, remembering SMP, so all of the tests finish at about the same time. We found it best to optimize the mapping manually in a script like the one shown in Listing 2.
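A first draft of such a mapping can be generated and then tuned by hand. The sketch below is not the method we used; it is a simple greedy heuristic that takes the longest test first and places each test on the processor that would finish it soonest. The function name, the relative speed figures and the runtimes are all hypothetical. An SMP machine is listed once per processor.

```shell
#!/bin/bash
# Greedy first-draft mapping of tests to processors.
# Arguments: one "host:speed" per processor (100 = reference machine).
# Standard input: lines of "test minutes", where minutes is the expected
# runtime on the reference machine. Output: host/test pairs.
assign_tests() {
  local -a cpu speed work
  local spec name minutes i best best_finish finish
  for spec in "$@"; do
    cpu+=("${spec%%:*}")
    speed+=("${spec##*:}")
    work+=(0)
  done
  # Longest test first, then pick the processor with the earliest finish.
  sort -k2,2nr | while read -r name minutes; do
    best=0; best_finish=-1
    for i in "${!cpu[@]}"; do
      finish=$(( (work[i] + minutes) * 1000 / speed[i] ))
      if (( best_finish < 0 || finish < best_finish )); then
        best=$i; best_finish=$finish
      fi
    done
    work[best]=$(( work[best] + minutes ))
    echo "${cpu[best]}/$name"
  done
}

# Example (hypothetical figures, dual-processor host2 listed twice):
#   assign_tests host1:60 host2:100 host2:100 < runtimes.txt
```

The output uses the same host/test form as the pair list in Listing 2, so it can be pasted into that script and adjusted.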
By using SSH to two dual-Athlon MP machines, one Pentium III laptop and five Pentium II desktops, we reduced runtime to a fairly consistent two hours.
If everyone is running the same version of the same distribution, it probably is sufficient to install the prepackaged binaries of OpenMosix. That gives you workload migration with almost no effort. Always use the autoconfiguration option instead of specifying the list of nodes manually, because the cluster is likely to grow in later stages.
We use several different distributions in the office, so we downloaded a pristine 2.4.20 kernel tarball, the matching OpenMosix patch and the source of the user-space tools to the NFS fileserver. After making careful notes of the configuration settings to keep all the machines in step, we followed the instructions on the OpenMosix Web site. Because recompiling and reinstalling kernels takes time and effort, we modified only the four computers needed to cluster seven processors. This is slightly less capable than the ten processors we reached through SSH. Even so, the worst-case runtime stayed almost identical, because the migration did the load balancing slightly better than our hand-optimized script could achieve. Because Alex could use make -j and let OpenMosix assign the work, all incremental workloads completed faster and did not need the full two hours.
OpenMosix tries to be fair and have all programs run at the same speed by putting more work on the faster computers. This is not optimal for the logic simulation workload, however, as we usually know the relative runtimes. In this case, a short script (not included here) helpfully monitors the contents of /proc. The script periodically looks for process pairs with a big ratio in their expected runtimes but whose node assignments are not providing a corresponding execution speed ratio. The script uses its knowledge of prior runs to request a migration to gain a long-term benefit hidden from OpenMosix. Such a script is not needed if, for your application, the runtimes of all processes are similar.
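The decision at the heart of such a monitor can be sketched separately from the cluster plumbing. The function below is a hypothetical illustration, not the script we run: it only answers whether two processes should trade nodes, given their expected runtimes from prior runs and the relative speeds of the nodes they occupy. Reading the current node assignments from /proc and issuing the actual migration request are left out, because those details depend on the OpenMosix version installed. The RATIO threshold of 2 is an assumption.

```shell
#!/bin/bash
# Succeeds when one process is expected to run at least RATIO times
# longer than the other yet sits on the slower of the two nodes.
# Arguments: runtime_a runtime_b speed_a speed_b (relative integers).
should_swap() {
  local run_a=$1 run_b=$2 speed_a=$3 speed_b=$4
  local ratio=${RATIO:-2}
  if (( run_a >= ratio * run_b && speed_a < speed_b )); then return 0; fi
  if (( run_b >= ratio * run_a && speed_b < speed_a )); then return 0; fi
  return 1
}

# Example (hypothetical figures): a 120-minute job on a half-speed node
# paired with a 30-minute job on a full-speed node.
#   should_swap 120 30 50 100 && echo "request migration"
```

Pairs whose runtimes differ by less than the threshold are left alone, which matches the observation that no such script is needed when all processes run for a similar time.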