Using MPICH to Build a Small Private Beowulf Cluster
In the spring of 2001, my students and I constructed a Beowulf cluster in the Wayne City High School computer science laboratory using MPICH 188.8.131.52 (an implementation of the MPI library) and Red Hat 6.2. Prior to this time, none of us had any serious Linux experience, and a project that initially started on a whim later progressed to a serious learning experience for everyone involved. The successful results of our first cluster experience were published on the Beowulf Underground. Since the publication of our story, I have received many e-mails from around the world asking me to describe, in painstaking detail, how we built our cluster. My standard answer was to do the research like we did, but as time passed, I realized that I really should write down our methodology for our own behalf as well as for the benefit of others.
We are a small rural high school in southern Illinois. Even so, we have offered computer science as a formal class for the past 20 years, beginning with a single Commodore VIC-20 and progressing to our current level of 100-plus computers in a school with 200 students. Linux, however, was only another word to us until last year when we started our project. It was common to hear phrases such as "who the heck is root?" and "what's this tar thing?" for several weeks before things began falling into place. We were self-taught and had to learn the mysteries of dual-booting, networking, X, rsh and any other topic that seemed to pop up at the most inopportune time. It wasn't all fun, but the process was certainly rewarding from an educational standpoint. This fall our administration and school board approved the purchase of new computer hardware for us, which brings us to the present.
Our new Beowulf cluster consists of a private network of eight systems dual-booting between Windows 2000 Professional and Red Hat 7.1. Each computer has a single AMD 1.4GHz Athlon processor, 512MB RAM, a 30GB hard drive, a 64MB NVIDIA video card and 100Mb Ethernet with switching hub (did I say that our administration and school board were kind to us?). We are using the current version of MPICH (184.108.40.206) as our MPI library.
Why did we choose MPICH? When we decided to build a cluster and began our research, we discovered three possible clustering systems: MPI, MOSIX and PVM. Our first attempt at clustering involved MOSIX. Its "fork and forget" philosophy seemed ideal (because we really didn't know what we were doing), and MOSIX claimed to work with existing software. What we didn't realize was that we would have to download and compile a vanilla kernel in order for MOSIX to work properly. This seemed a daunting task for rank amateurs, so we rejected MOSIX. PVM was briefly considered, but some of the research we did indicated (correctly or not) that MPI would be a better choice for us in terms of ease of use and performance. Ultimately, we decided on the MPICH implementation of MPI for its price (free), ease of installation (the subject of this article) and programmability (the subject of a future article?).
First assemble the hardware. MPICH allows you to simulate a cluster using a single computer if that's all you have available. How to do single computer clustering is discussed briefly later in the article. For now, I will assume that you have at least two computers with network cards available. Here at Wayne City, we have successfully built clusters of from two to sixteen computers. With two computers, you can use a simple crossover cable to connect the network cards. With more than two computers, you will need either a hub or a switch to network your systems.
It is possible to mix and match systems. In our first clustering efforts we had various flavors of Pentium, Celeron and Mobile Pentium systems working together before we settled on a homogeneous lab of 16 300MHz Celeron computers with 10Mbps Ethernet and hub. Our current cluster retains this homogeneous structure using AMD Athlon processors and 100Mb Ethernet with a switching hub. I also have successfully connected my Dell Inspiron 8000 notebook computer to our new cluster, although its 1GHz Intel processor is a bit slow compared with the 1.4GHz AMD Athlon. Nevertheless, it's fun to watch the Dell notebook benefit from the combined power of eight additional CPUs.
All the computers we have used and experimented with were complete standalone systems with monitors, hard drives, keyboards, etc. It's possible to build clusters using bare-networked CPU boxes and a single master node with monitor and keyboard, but we did not make any attempt to build such a beast. Also, we have not had the opportunity to mix and match other types of systems such as SGI, SPARC or Alpha workstations (although we would welcome the opportunity--anyone wanting to donate old equipment can contact me directly).
Next, install a Linux distribution on each computer in your cluster (henceforth, we'll call them nodes). We are using Red Hat 7.1 for our current cluster and are pleased with its performance.
During the installation process, assign sensible hostnames and unique IP addresses for each node in your cluster. Usually, one node is designated as the master node (where you'll control the cluster, write and run programs, etc.) with all the other nodes used as computational slaves. We named our nodes node00 through node07 to keep things simple, using node00 as our master node. Our cluster is private, so theoretically we could assign any valid IP address to our nodes as long as each had a unique value. We used IP address 192.168.100.200 for the master node and added one for each slave node (192.168.100.201, etc.). If you already have Linux installed on each node in your cluster, then you don't have to make changes to your IP addresses or hostnames unless you want to. Changes (if needed) can be made using your network configuration program (I'm partial to Linuxconf in Red Hat). Finally, create identical user accounts on each node. In our case, we created the user beowulf on each node in our cluster. You can create the identical user accounts either during installation, or you can use the adduser command as root.
Then configure rsh on each node in your cluster. We used rsh for two reasons: first, rsh appeared to be easier to configure than ssh, and because we have a private network with trusted users, security is not an issue; second, from what I understand, rsh does not have the added overhead of encryption, so its cluster performance should be a bit faster than ssh.
As naive users, rsh was a bit of a problem initially, especially with Red Hat 7.1. We were prompted for passwords each time we attempted to connect with another node. Finally, after much searching on the Internet, we were able to figure out a method that seemed to work for us. Educationally, we wanted to install MPICH from both user and root perspectives, so we configured rsh to allow user and root access. Our methods, however repugnant to Linux security experts, were as follows:
Create .rhosts files in the user and root directories. Our .rhosts files for the beowulf users are as follows:
node00 beowulf node01 beowulf node02 beowulf node03 beowulf node04 beowulf node05 beowulf node06 beowulf node07 beowulf
And the .rhosts files for root users are as follows:
node00 root node01 root node02 root node03 root node04 root node05 root node06 root node07 root
Next, we created a hosts file in the /etc directory. Below is our hosts file for node00 (the master node):
192.168.100.200 node00.home.net node00 127.0.0.1 localhost 192.168.100.201 node01 192.168.100.202 node02 192.168.100.203 node03 192.168.100.204 node04 192.168.100.205 node05 192.168.100.206 node06 192.168.100.207 node07
Again, our network is private, so we used IP addresses beginning with 192.168 and made up the rest. Each node in the cluster had a similar hosts file with appropriate changes to the first line reflecting the hostname of that node. For example, node01 would have a first line:
192.168.100.201 node01.home.net node01
with the third line containing the IP and hostname of node00. All other nodes are configured in the same manner. Do not remove the 127.0.0.1 localhost line (as we found out the hard way). The hosts.allow files on each node were modified by adding ALL+ as the only line in the file. This allows anyone on any node permission to connect to any other node in our private cluster. To allow root users to use rsh, we had to add the following lines to the /etc/securetty file:
rsh, rlogin, rexec, pts/0, pts/1
. Also, we modified the /etc/pam.d/rsh file:
#%PAM-1.0 # For root login to succeed here with pam_securetty, "rsh" must be # listed in /etc/securetty. auth sufficient /lib/security/pam_nologin.so auth optional /lib/security/pam_securetty.so auth sufficient /lib/security/pam_env.so auth sufficient /lib/security/pam_rhosts_auth.so account sufficient /lib/security/pam_stack.so service=system-auth session sufficient /lib/security/pam_stack.so service=system-auth
Finally, after much research, we found out that rsh, rlogin, Telnet and rexec are disabled in Red Hat 7.1 by default. To change this, we navigated to the /etc/xinetd.d directory and modified each of the command files (rsh, rlogin, telnet and rexec), changing the disabled = yes line to disabled = no.
Once the changes were made to each file (and saved), we closed the editor and issued the following command: xinetd -restart to enable rsh, rlogin, etc. We were then good to go with no more rsh password prompts.
Next, download the latest version of MPICH (UNIX all flavors) from www-unix.mcs.anl.gov/mpi/mpich/download.html to the master node (node00 in our cluster). The file is around 9.5MB, and you probably should grab the installation instructions and other documents while you are there. You never know when the installation guide might be useful.
Untar the file in either the common user directory (the identical user you established for all nodes "beowulf" on our cluster) or in the root directory (if you want to run the cluster as root). Issue the command: tar zxfv mpich.tar.gz (or whatever the name of the tar file is for your version of MPICH), and the mpich-220.127.116.11 directory will be created with all subdirectories in place. If you are using a later version of MPICH than we are, the last number might be different than ours.
Change to the newly created mpich-18.104.22.168 directory. Make certain to read the README file (if it exists), but in our experience, the configure and make scripts work without modifications. Type ./configure, and when the configuration is complete and you have a command prompt, type make.
The make may take a few minutes, depending on the speed of your master computer. Once make has finished, add the mpich-22.214.171.124/bin and mpich-126.96.36.199/util directories to your PATH in .bash_profile or however you set your path environment statement. The full root paths for the MPICH bin and util directories on our master node are /root/mpich-188.8.131.52/util and /root/mpich-184.108.40.206/bin. For the beowulf user on our cluster, /root is replaced with /home/beowulf in the path statements. Log out and then log in to enable the modified PATH containing your MPICH directories.
From within the mpich-220.127.116.11 directory, go to the util/machines/ directory and find the machines.LINUX file. This file will contain the hostnames of all the nodes in your cluster. When you first view the file, you<\#146>ll notice that five copies of the hostname of the computer you are using will be in the file. For node00 on our cluster, there will be five copies of node00 in the machines.LINUX file. If you have only one computer available, leave this file unchanged, and you will be able to run MPI/MPICH programs on a single machine. Otherwise, delete the five lines and add a line for each node hostname in your cluster, with the exception of the node you are using. For our cluster, our machines.LINUX file as viewed from node00 (the master node) looks like this:
node01 node02 node03 node04 node05 node06 node07
Then make all the example files and the MPE graphic files. First, navigate to the mpich-18.104.22.168/examples/basic directory and type make to make all the basic example files. When this process has finished, you might as well change to the mpich-22.214.171.124/mpe/contrib directory and make some additional MPE example files, especially if you want to view graphics. Within the mpe/contrib directory, you should see several subdirectories. The one we will be interested in (for now -- you can explore the others on your own) is the mandel directory. Change to the mandel directory, and type make to create the pmandel exec file. You are now ready to test your cluster.
I will describe how to run two test programs as a start. You can explore on your own from that point. The first program we will run is cpilog. From within the mpich-126.96.36.199/examples/basic directory, copy the cpilog exec file (if this file isn't present, try typing make again) to your top-level directory. On our cluster, this is either /root (if we are logged in as root) or /home/beowulf, if we are logged in as beowulf (we have installed MPICH both places). Then, from your top directory, rcp the cpilog file to each node in your cluster, placing the file in the corresponding directory on each node. For example, if I am logged in as beowulf on the master node, I'll issue rcp cpilog node01:/home/beowulf to copy cpilog to the beowulf directory on node01. I'll do the same for each node (I'm certain that a script would facilitate this). If I want to run a program as root, then I'll copy the cpilog file to the root directories of all nodes on the cluster.
Once the files have been copied, I'll type the following from the top directory of my master node to test my cluster:
mpirun -np 1 cpilog
This will run the cpilog program on the master node to see if the program works correctly. Some MPI programs require at least two processors (-np 2), but cpilog will work with only one. The output looks like the following:
pi is approximately 3.1415926535899406, Error is 0.0000000000001474 Process 0 is running on node00.home.net wall clock time = 0.360909
Now try all eight nodes (or however many you want to try) by typing: mpirun -np 8 cpilog and you'll see
pi is approximately 3.1415926535899406, Error is 0.0000000000001474 Process 0 is running on node00.home.net Process 1 is running on node01.home.net Process 2 is running on node02.home.net Process 3 is running on node03.home.net Process 4 is running on node04.home.net Process 5 is running on node05.home.net Process 6 is running on node06.home.net Process 7 is running on node07.home.net wall clock time = 0.0611228
or something similar to this. The "Process x is running on nodexx" lines may not be in numerical order, depending on your machines.LINUX file and network communication speed between the nodes. What you should be able to notice, however, is an increase in execution speed as nodes are added according to the wall clock time. If you have only one computer available, you can use mpirun the same way, but each process will run on the master node only. The number following the -np parameter corresponds with the number of processors (nodes) you want to use in running your program. This number may not exceed the number of machines listed in your machines.LINUX file plus one (the master node is not listed in the machines.LINUX file).
To see some graphics, we must run the pmandel program. Copy the pmandel exec file (from the mpich-188.8.131.52/mpe/contrib/mandel directory) to your top-level directory and then to each node (as you did for cpilog). Then, if X isn't already running, issue a startx command. From a command console, type xhost + to allow any node to use your X display, and then set your DISPLAY variable as follows: DISPLAY=node00:0 (be sure to replace node00 with the hostname of your master node). Setting the DISPLAY variable directs all graphics output to your master node. Run pmandel by typing: mpirun -np 2 pmandel
The pmandel program requires at least two processors to run correctly. You should see the Mandelbrot set rendered on your master node.
Figure 1. The mandelbrot Set Rendered on the Master Node
You can use the mouse to draw a box and zoom into the set if you want. Adding more processors (mpirun -np 8 pmandel) should increase the rendering speed dramatically. The mandelbrot set graphic has been partitioned into small rectangles for rendering by the individual nodes. You actually can see the nodes working as the rectangles are filled in. If one node is a bit slow, then the rectangles from that node will be the last to fill in. It<\#146>s fascinating to watch. We've found no graceful way to exit this program other than pressing Ctrl-C or clicking the close box in the window. You may have to do this several times to kill all the nodes. An option to pmandel is to copy the cool.points file from the original mandel directory to the same top-level directory (on the master node) as pmandel and run mpirun -np 8 pmandel -i cool.points
The -i option runs cool.points as a script and puts on a nice mandelbrot slide show. You could use the cool.points file as a model to create your own display sequence if you like.
You can use make to create the other sample graphics and test files within the mpich-184.108.40.206/mpe/contrib subdirectories. Place the exec files in the top-level directories of all nodes and use mpirun as with the cpilog and pmandel programs.
The g_Life program is a bit different from the pmandel program in terms of execution. Can you explain why increasing the number of nodes may actually slow its performance?
One interesting situation we've noticed (and can't explain) is that the cluster runs about 10% faster when we are logged in as root. If raw performance is an issue, then this increase in cluster speed may be important.
Finally, as I was finishing this article, I was putting the cluster through its paces and discovered that I could use all eight nodes (master plus seven slaves) while my students were working on their OpenGL/GLUT projects on each node. I noticed no degradation of cluster performance, and the students didn't know that I was stealing computer time from their respective machines. MPICH multitasks very well in a cluster environment.
The most common errors we encountered were hardware problems. Make certain all your cables are plugged in and that each node can "talk" to all the others via rsh (or ssh if you prefer). Another error commonly encountered is a failure to include all cluster nodes in the machines.LINUX file. Finally, make certain that each node contains a copy of the current exec file (of the program you are trying to run) in the top-level user directory. If all else fails, read the documentation. There are many great user groups and web sites devoted to clustering. Grab your favorite search engine, type "beowulf cluster" and go!
The purpose of this article is to give the step-by-step approach we used to build our latest Beowulf cluster. It is not meant to be a primer for MPICH or MPI. You should be prepared to read the documentation and all README files in the MPICH directories for tips and instructions on how to get the most from MPICH and your cluster. We feel as though we have learned a great deal from our adventure into clustering. I have no doubt that my students are ready to tackle more ambitious projects, including writing our own MPI-based programs and eventually, distributed graphics. Learning shouldn't be this much fun!