Getting Started with Condor
Cluster computing emerged in the early 1990s when hardware prices were dropping and PCs were becoming more and more powerful. Companies were shifting from large mini-computers to small and powerful micro-computers, and many people realized that this would lead to a large-scale waste of computing power, as computing resources were being fragmented more and more. Organizations today have hundreds to thousands of PCs in their offices. Many of them are idle most of the time. However, the same organizations also face huge computation-intensive problems and thus require great computing power to remain competitive—hence the stable demand for supercomputing solutions that largely are built on cluster computing concepts.
Many vendors offer commercial cluster computing solutions. By using free and open-source software, it is possible to forego the purchase of these expensive commercial cluster computing solutions and set up your own cluster. This article describes such a solution, developed by University of Wisconsin, called Condor.
The idea behind Condor is simple. Install it on every machine you want to make part of the cluster. (In Condor terminology, a Condor cluster is called a pool. This article uses both terms interchangeably.) You can launch jobs from any machine, and Condor matches the requirements of the job with the capabilities offered by the idle computers currently available. Once it finds a suitable idle machine, it transfers the job to it, executes it and retrieves the results of the execution. One of the features of Condor is that it doesn't require programs to be modified to run on the cluster.
In practice, however, Condor is more complicated. Condor is installed in different configurations on each machine. Each Condor pool has a central manager. The central manager, as the name implies, is the central manager of the cluster. It manages the detection of new idle machines and coordinates the matchmaking between job requirements and available resources. Machines in a Condor pool also can have Submit and Full Install configurations. Submit machines are those machines that can only submit jobs, but can't run any jobs; Full Install machines are machines that can do both, submit and execute.
Condor does not require the addition of any new hardware to the network; the existing network itself is sufficient. Condor runs on a variety of operating systems, including Linux, Solaris, Digital Unix, AIX, HP-UX and Mac OS X as well as MS Windows 2000 and XP. It supports various architectures, including Intel x86, PowerPC, SPARC and so on. However, jobs developed on one specific architecture, such as Intel x86, will run only on Intel x86 computers. So, it is best if all the computers in a Condor pool are of a single architecture. It is possible, however, for Java applications to run on different architectures.
In this article, we cover the installation from basic tarballs on Linux, although distribution/OS-specific packages also may be available from the official site or sources. (See the Condor Project site for more details, www.cs.wisc.edu/condor/downloads.)
Download the tarball from the Project site, and uncompress it with:
tar -zvf condor.tar.gz
The condor_install script, located in the sbin directory, is all you need to run to set up Condor on a machine. Before you run this script, add a user named condor. For security reasons, Condor does not allow you to run jobs as root; thus, it is advisable to make a new user to protect the system.
One of the first questions the script asks is how many machines are you setting up to be part of the pool? This is important if you have a shared filesystem. If you do, the installation script will prompt you for the names of those machines, and the installation of Condor on those machines will be handled by the software itself. If a shared filesystem does not exist, you have to install Condor manually on each system. Also, if you want to be able to use Java support, you need to have Sun's Java virtual machine installed prior to installing Condor. The install script provides plenty of help and annotation on each question it asks, and you always can turn to Condor's comprehensive user manual and its associated mailing lists for help.
The variable $CONDOR is used from now on to denote the root path where condor has been installed (untarred).
After the installation, start Condor by running:
This command should spawn all other processes that Condor requires. On the central manager, you should be able to see five condor_ processes running after entering:
ps -aux | grep condor
On the central manager machine, you should have the following processes:
All other machines in the pool should have processes for the following:
And, on submit-only machines you will see:
After that, you should be able to see the central manager machine as part of your Condor cluster when you run condor_status:
$CONDOR/bin/condor_status Name OpSys Arch State Activity LoadAv Mem ActvtyTime Mycluster LINUX INTEL Unclaimed Idle 0.115 3567 0+00:40:04 Machines Owner Claimed Unclaimed Matched Preempting INTEL/LINUX 1 0 0 1 0 0 Total 1 0 0 1 0 0
If you now run condor_master on the other machines in the pool, you should see that they are added to this list within a few minutes (usually around five minutes).