Clusters for Nothing and Nodes for Free

When the users are away, your company's legacy desktop systems can become a powerful temporary Linux cluster.

At Quantum Magnetics we do contract R&D. We often need to design silicon chips, simulate electromagnetic systems and analyze masses of data from field tests. When a single set of regression tests started taking longer than a working day to perform, coauthor Alex Perry found himself wondering how to get short-term access to a cluster. We describe here the sequence of steps that enabled us to set up an OpenMosix cluster with little effort and without having to purchase anything.

Each productivity increase justified putting time into the next step of bringing up the company-wide cluster. We omit details here that are provided in the instructions and FAQs for each project (see the on-line Resources section), partly because things will have changed by the time the article goes to print and partly for brevity.

Choose an Application

The simplest applications to run on a cluster are command-line based and can already run as multiple instances on one computer. Applications don't have to be written specifically for Linux, because they can run under WINE or another portability layer. If multiple instances are not possible, much more time has to be put into providing a virtual machine abstraction layer. Before putting any effort into building a cluster, check whether your specific application is capable of benefiting from an OpenMosix-based cluster.
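One quick way to check the multiple-instance requirement is to start several copies of the program at once on a single machine and confirm that each one finishes cleanly. The following is only a minimal sketch: the helper name run_instances and the stand-in workload are hypothetical, and the demo command should be replaced with your own application's command line.

```python
import subprocess
import sys


def run_instances(commands):
    """Start every command at once, then wait for all of them.

    Returns the exit codes in the same order as the commands.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]


# Hypothetical stand-in workload: four trivial Python processes.
# Replace each entry with the command line of the application under test.
demo = [[sys.executable, "-c", f"print('instance {i} done')"] for i in range(4)]

if __name__ == "__main__":
    print(run_instances(demo))
```

If every exit code is zero and the instances do not trip over shared temporary files or license locks, the application is at least a candidate for running as independent processes on a cluster.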

Most of our logic code is written in Verilog partly because, as the joke goes, we can't type fast enough to use VHDL. Mainly, though, our reason is that a broader range of tools is available for Verilog. We use several closed-source place-and-route tools under Microsoft Windows, but their runtime is tiny, so putting these on the cluster is not worth the effort. For simulation, we have both open- and closed-source options. It is convenient to use the graphical tools (all closed-source, unfortunately) that have IDE source-level debuggers when trying to track down a bug, but these either don't like clusters or carry a hefty licensing price tag when running on a cluster. We use Icarus Verilog for non-interactive simulations, as regression testing is more than 99% of the total simulation workload. We like it because multiple simulators can run in parallel; each simulator is a single Linux process; the tool has its own public regression suite; the developers are helpful and responsive; and the syntax parser is paranoid and accurate.

The paranoia of the syntax parser flags a lot of problems for us. Many parsers simply select one interpretation of ambiguously written source, leading to incorrect behavior that is effectively a bug. In contrast, Icarus immediately complains about ambiguities, and after we've made the tiny rewrite, the synthesized chip suddenly starts working the way it was intended.

The developers for Icarus, by responding rapidly to bug reports and patches, enhance the value of the simulator in our work. We update from CVS to benefit from those almost-immediate source changes. In addition, it is much easier to standardize one virtual machine (the cluster) than to manage the versions on the individual workstations.

We run all our proprietary simulation tests immediately before and after a new version of Icarus is retrieved from CVS. About once a year, the simulation results are different, so we submit a bug report that localizes the problem to a test case outside our proprietary work. In this way, all our proprietary work acts as an additional regression suite for the Icarus Project without us having to make it available to our competitors. It also ensures that any official release of Icarus is useful to us.
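The before-and-after comparison described above can be sketched in a few lines. This is an illustration only, not the tooling actually used: it assumes each test run writes one result file per test into a directory, and the function name changed_results and that directory layout are hypothetical.

```python
import filecmp
from pathlib import Path


def changed_results(before_dir, after_dir):
    """Return sorted names of result files that differ between two runs.

    A file counts as changed if its bytes differ or if it exists in only
    one of the two directories.
    """
    before_dir, after_dir = Path(before_dir), Path(after_dir)
    before = {p.name for p in before_dir.iterdir() if p.is_file()}
    after = {p.name for p in after_dir.iterdir() if p.is_file()}

    # Files present on only one side are reported as changes.
    changed = before ^ after

    # shallow=False forces a byte-by-byte comparison of the contents.
    for name in before & after:
        if not filecmp.cmp(before_dir / name, after_dir / name, shallow=False):
            changed.add(name)
    return sorted(changed)
```

An empty list means the new simulator build reproduced every stored result. Any names returned point at the tests worth reducing to a small, non-proprietary case for a bug report.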

In our engineering design work, we use make, as shown in Listing 1, to automate test execution and to manage all the Verilog source files, the reference implementation in C, validated test data, the pool of regression tests and all the simulation results.

Without the cluster, between six and ten hours were needed to complete all the dependencies that resulted from a minor change to a source file. Logic simulation usually is about a factor of a million slower than real life, so the regression suite simulates only about 20 milliseconds of board time. The tests have to be selected carefully, because the board can run for as long as 30 seconds per use, which would take about a year to simulate.
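The arithmetic behind these figures is easy to check. The sketch below takes the factor of a million at face value as a round number, so the results are order-of-magnitude estimates rather than measurements. The names used are illustrative.

```python
SLOWDOWN = 1_000_000  # rough figure from the text: simulation vs. real hardware


def wall_clock_seconds(simulated_ms, slowdown=SLOWDOWN):
    """Wall-clock seconds needed to simulate the given milliseconds of board time."""
    return simulated_ms * slowdown / 1000


regression = wall_clock_seconds(20)      # 20 ms of board time
full_use = wall_clock_seconds(30_000)    # 30 s of board time

print(f"regression suite: {regression / 3600:.1f} hours")
print(f"one full use:     {full_use / 86400:.0f} days")
```

Twenty milliseconds of board time works out to 20,000 seconds, or roughly five and a half hours, which is the same order as the six to ten hours quoted. Thirty seconds of board time comes to about 347 days, hence "about a year".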