Clusters for Nothing and Nodes for Free
Usually, plenty of spare older computers can be found hiding in corners. Set up an X server on one of them and configure it as a terminal into the xdm service on the fast computers. With this machine in place, you can shut down the X servers on the fast computers and release their processor and memory resources back to the important workload. Alex's desktop computer, a 400MHz Pentium II, already had its X server making indirect queries through xdm's chooser. David's work keeps him roaming the building and relying on VNC, so he already was using Xvnc. Only Hoke needed to make minor changes to configuration files.
Next, install LTSP on one computer and set up all the other old computers to use diskless boots to become terminals too. Doing so eliminates the administration of all those operating systems. You now should have enough terminal stations that all your team members are using terminals, and all the fast compute nodes can stay in the stripped runlevel and be as efficient as possible. It doesn't take long to get those two features working, and an excellent time to work on this is whenever you're waiting on the running jobs.
There is no need to get the DHCP and TFTP components of LTSP working at this stage. Put the kernel on a floppy, together with SysLinux configured so that the kernel obtains its address through DHCP (without network booting) and mounts the NFS root filesystem. Then, use that one floppy to do the one-time boot of the terminals. Reboots are needed infrequently, so the slowness of the floppy is fine.
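As a rough sketch of the floppy setup described above, the following helper writes a minimal SysLinux configuration that passes ip=dhcp and root=/dev/nfs to the kernel. The kernel file name vmlinuz, the label ltsp and the helper name write_syslinux_cfg are assumptions for illustration; they are not taken from the authors' floppy. With no nfsroot= parameter given, the kernel expects the NFS root path to come from the DHCP reply.

```shell
#!/bin/bash
# Hypothetical helper: write a minimal syslinux.cfg into the given directory
# (for example, the mount point of the boot floppy).
# Assumed names: label "ltsp" and kernel file "vmlinuz".
write_syslinux_cfg() {
    local target_dir="$1"
    cat > "$target_dir/syslinux.cfg" <<'EOF'
DEFAULT ltsp
LABEL ltsp
    KERNEL vmlinuz
    APPEND ip=dhcp root=/dev/nfs
EOF
}
```

Calling write_syslinux_cfg /mnt/floppy (an assumed mount point) would leave the configuration next to the kernel image on the floppy.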
Once the cluster and LTSP are both functional, we simply combine them. The short script shown in Listing 3 uses the NBI tools to put the patched kernel into /ltsp/i386/boot. Our DHCP server's filename parameter is a soft link, so we can change the LTSP kernel rapidly while testing upgrades. After copying the user-space tools into the client filesystem and renaming the init script as rc.openmosix, we add the few lines in Listing 4 to the LTSP startup script. Slower computers have MOSIX=N in the LTSP configuration file because they would not contribute much performance to the cluster.
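The soft-link arrangement mentioned above could be handled with a small helper along these lines. This is only a sketch: the link name vmlinuz-current and the function name switch_kernel are hypothetical, and the image naming simply follows the vmlinuznbi-<version> pattern used in Listing 3.

```shell
#!/bin/bash
# Hypothetical helper: point a stable link name at a versioned netboot image.
#   boot_dir  : directory served by TFTP (the article uses /ltsp/i386/boot)
#   version   : kernel version suffix, as in vmlinuznbi-<version>
#   link_name : assumed name of the soft link that DHCP's filename refers to
switch_kernel() {
    local boot_dir="$1" version="$2" link_name="${3:-vmlinuz-current}"
    local image="vmlinuznbi-$version"
    if [ ! -f "$boot_dir/$image" ]; then
        echo "missing image: $boot_dir/$image" >&2
        return 1
    fi
    # -f overwrites the old link; -n replaces the link itself rather than
    # following it. The target is relative so it resolves inside boot_dir.
    ln -sfn "$image" "$boot_dir/$link_name"
}
```

For example, switch_kernel /ltsp/i386/boot 2.4.21 would move the link to the newer image, and running it again with 2.4.20 would roll back, without touching the DHCP configuration.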
One line in /ltsp/i386/etc/inittab calls a copy of Debian's shutdown binary by way of the script shown in Listing 5. This ensures that Ctrl-Alt-Del forces a clean disconnect from the cluster before rebooting.
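To confirm which command an inittab actually runs for Ctrl-Alt-Del, a check along the following lines may help. It relies only on the standard sysvinit entry format, id:runlevels:action:process; the function name ctrlaltdel_command is an assumption, and the sketch does not reproduce the authors' own inittab line.

```shell
#!/bin/bash
# Print the command that an inittab file runs for Ctrl-Alt-Del.
# Entries have the form id:runlevels:action:process; comment lines are skipped.
ctrlaltdel_command() {
    local inittab="$1"
    awk -F: '
        /^[ \t]*#/ { next }                      # ignore comments
        $3 == "ctrlaltdel" {
            sub(/^[^:]*:[^:]*:[^:]*:/, "")       # drop the first three fields
            print                                # what remains is the process
        }' "$inittab"
}
```

Running ctrlaltdel_command /ltsp/i386/etc/inittab on the LTSP server should then print the path of the shutdown script rather than a direct call to /sbin/shutdown.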
Listing 3. This script, /ltsp/i386/usr/src/netkernels, builds the kernels and copies them from the build tree to the TFTP directory.
#! /bin/bash
for vsn in 2.4.20 2.4.21
do
  pushd linux-$vsn; make bzImage; popd
  s=linux-$vsn/arch/i386/boot/bzImage
  d=../../boot/vmlinuznbi-$vsn
  mknbi-linux --ip=dhcp \
    --append "root=/dev/nfs" $s >$d
done
Listing 4. These few lines are appended to the LTSP startup script /ltsp/i386/etc/rc.local.
MOSIX=`get_cfg MOSIX Y`
if [ "$MOSIX" = "Y" ]; then
  echo 1 > /proc/hpc/admin/lstay
  AUTODISC=1 /etc/rc.openmosix start
fi
Listing 5. New Shutdown Script
#! /bin/bash
prefix="Control Alt Del detected: "
echo "$prefix OpenMosix"
/etc/rc.openmosix stop
echo "$prefix ShutDown"
/sbin/shutdown -r -n now
echo "$prefix failed (give up)"
Once you are confident that the LTSP-OpenMosix kernel is stable and not going to be changed, you can hand out floppies with the new kernel. The LTSP users won't see a difference, but your compute workload will.
If you would like to maintain the option of changing the kernel without having to hunt around the company to find all the old floppies, now is a good time to get the DHCP network boot working. The LTSP documentation describes how to configure Linux or UNIX servers, but our implementation was running on Microsoft Windows. David, who administers our Windows-based DNS and DHCP servers, set up Netboot in DHCP (Figure 1).
Microsoft DHCP appends a null to the NFSROOT, as discussed in LTSP mailing lists, so you need a soft link:
ln -s /ltsp/i386 /ltsp/i386/000