MOSIX: A Cluster Load-Balancing Solution for Linux
MOSIX schedules newly started programs in the node with the lowest current load. However, if the machine with the lowest load level announces itself to all the nodes in the cluster, then all the nodes will migrate newly started jobs to the node with the lowest load and soon enough this node will be overloaded. However, MOSIX does not operate in this manner. Instead, every node sends its current load status to a random list of nodes. This prevents a single node from being seen by all other nodes as the least busy and prevents any node from being overloaded.
How does MOSIX decide which node is the least busy among all the cluster nodes? This is a good question; however, the answer is a simple one.
MOSIX comes with its own monitoring algorithms that detect the speed of each node, its used and free memory, as well as IPC and I/O rates of each process. This information is used to make near optimal decisions on where to place the processes. The algorithms are very interesting because they try to reconcile different resources (bandwidth, memory and CPU cycles) based on economic principles and competitive analysis. Using this strategy, MOSIX converts the total usage of several heterogeneous resources, such as memory and CPU, into a single homogeneous cost. Jobs are then assigned to the machine where they have the lowest cost. This strategy provides a unified algorithm framework for allocation of computation, communication, memory and I/O resources. It also allows the development of near-optimal on-line algorithms for allocation and sharing these resources.
MOSIX uses its own filesystem, MFS, to make all the directories and regular files throughout a MOSIX cluster available from all nodes as if they were within a single filesystem. One of the advantages of MFS is that it provides cache consistency for files viewed from different nodes by maintaining one cache at the server disk node.
MFS meets the direct file system access (DFSA) standards, which extends the capability of a migrated process to perform some I/O operations locally, in the current node. This provision reduces the need of I/O-bound processes to communicate with their home-node, thus allowing such processes (as well as mixed I/O and CPU processes) to migrate more freely among the cluster's node, for load balancing and parallel file and I/O operations. This also allows parallel file access by proper distribution of files, where each process migrates to the node that has its files.
By meeting the DFSA provision, allowing the execution of system calls locally in the process' current node, MFS reduces the extra overhead of executing I/O-oriented system calls of a migrated process.
In order to test MOSIX, we set up the following environment: 1) a cabinet that consists of 13 Pentium-class CPU cards running at 233MHz with 256MB of RAM each; and 2) a Pentium-based server machine, PC1, running at 233MHz with 256MB of RAM. This machine was used as an NFS and DHCP/TFTP server for the 13 diskless CPUs.
When we start the CPUs, they boot from LAN and broadcast a DHCP request to all addresses on the network. PC1, the DHCP server, will be listening and will reply with a DHCP offer and will send the CPUs the information needed to configure network settings such as the IP addresses (one for each interface, eth0 and eth1), gateway, netmask, domain name, the IP address of the boot server (PC1) and the name of the boot file. The CPUs will then download and boot the specified boot file in the DHCP configuration file, which is a kernel image located under the /tftpboot directory on PC1. Next, the CPUs will download a ramdisk and start three web servers (Apache, Jigsaw and TomCat) and two streaming servers (Real System Server and IceCast Internet Radio).
For this setup, we used the Linux Kernel 2.2.14-5.0 that came with Red Hat 6.2. At the time we conducted this activity, MOSIX was not available for Red Hat; thus, we had to port MOSIX to work with the Red Hat kernel. Our plan was to prepare a MOSIX cluster that consists of the server, PC1 and the 13 diskless CPUs. For this reason, we needed to have a MOSIX-enabled kernel on the server, and we wanted to have the same MOSIX-enabled kernel image under the TFTP server directory to be downloaded and started by the CPUs at boot time. After porting MOSIX to Red Hat, we started the MOSIX modified installation script “mosix.install” that applied the patches to the 2.2.14-5.0 kernel tree on PC1.
Once we finished configuring the kernel and enabling the MOSIX features (using $make xconfig), we compiled it to get a kernel image:
cd /usr/src/linux make clean ; make dep ; modules_install
Next, we copied the new kernel image from /usr/src/linux/arch/i386/boot to /boot and we updated the System.map file:
cp /usr/src/linux/arch/i386/boot/bzImage cp /usr/src/linux/arch/i386/boot/System.map ln /boot/System.map.mosix /boot/System.mapOne of the configuration files that was modified was lilo.conf. We added a new entry for the MOSIX kernel to make the server boot as a MOSIX node by default. The updated lilo.conf on PC1 looked like Listing 1.
Having done that, we needed to complete the configuration steps. In /etc/profile, we added one line to specify the number of nodes in the MOSIX cluster:
# Add to /etc/profile NODES=1-14
We created /etc/mosix.map that allows the local MOSIX node to see all other MOSIX nodes. The mosix.map looked as follows:
# Starting node IP Number of Nodes 1 x.x.x.x 13 14 y.y.y.y 1We created the /mfs directory to be used as a mount point for the MOSIX filesystem. We added mosix.o to /lib/modules/2.2.14-5.0/misc/ so it can be loaded at boot time by the MOSIX startup file. Then we applied the same modifications to the ramdisk that will be downloaded by the diskless CPUs at boot time.
Once we completed these steps, we rebooted PC1, and when it was up and running, we rebooted the diskless CPUs. After reboot, the diskless CPUs received their IP addresses, booted with the MOSIX-enabled kernel, and downloaded the ramdisk using the TFTP protocol. Et voilà! All 14 nodes mounted /mfs as the MOSIX filesystem directory. Figure 2 shows a snapshot of /mfs on CPU10.