Getting Started with Condor

How to try Condor for multiplatform distributed computing.
Launching Jobs on Your New Cluster

To test our new Condor setup, let's create a simple “Hello Condor” job:

#include <stdio.h>

int main(void)
{
    printf("Hello World!\n");
    return 0;
}

Compile the application with gcc.

Now, to submit a job to Condor, we need to write a submit file. A submit file describes what Condor should do with the job—where to get the application's input, where to write its output and where to record any errors:

Universe = Vanilla
Executable = hello
Output = hello.out
Input = hello.in
Error = hello.err
Log = hello.log
Queue

The Universe entry defines the runtime environment under which Condor runs the job. Two Universes are noteworthy. For long-running jobs, such as those lasting weeks or months, the Standard Universe is recommended: it periodically checkpoints the job's execution state, so if a machine crashes, the job can be relocated to another machine and resume where it left off rather than start over, saving a great deal of processing effort. To use the Standard Universe, however, the application must be “condor compiled”, which requires the source code. The Vanilla Universe is intended for short-lived jobs, although long jobs also can run there if the stability of the machines is guaranteed. Vanilla jobs can run unmodified binaries.
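“Condor compiling” means relinking the program with Condor's checkpointing libraries via the condor_compile wrapper. For our example, assuming Condor is installed and on the PATH, the command would look something like:

```
condor_compile gcc -o hello hello.c
```

The resulting binary can then be submitted under the Standard Universe instead of Vanilla.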

Other Universes in Condor include PVM, MPI and Java, for PVM, MPI and Java applications, respectively. For more detail on Condor Universes consult the documentation.

In this example, our executable file is called hello (the traditional “Hello Condor” program), and we're using the Vanilla Universe. The Input, Output, Error and Log directives tell Condor which files to use for stdin, stdout and stderr and to log the job's execution. Finally, the Queue directive specifies how many copies of the program to run.
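Queue also accepts a count, and Condor's $(Process) macro (which expands to 0, 1, 2, …) keeps each copy's files separate. A sketch of a submit file that queues ten copies of our example:

```
Universe   = Vanilla
Executable = hello
Output     = hello.$(Process).out
Error      = hello.$(Process).err
Log        = hello.log
Queue 10
```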

After you have the submit file ready, run condor_submit hello.sub to submit it to Condor. You can check on the status of your job using condor_q, which will tell you how many jobs are in the queue, their IDs and whether they're running or idle, along with some statistics.

Condor has many other features; so far we have covered only the basics of getting it up and running. A number of tutorials are available on-line, along with the Condor Manual (www.cs.wisc.edu/condor/manual), that will teach you the basic and advanced capabilities of Condor. When reading the Condor Manual, pay particular attention to the Standard Universe, which allows you to checkpoint your job, and the Java Universe, which allows you to run Java jobs seamlessly.

You also can add Condor to the boot sequence of your central manager and other machines. You can shut down cluster machines, and their jobs will continue or restart on a different machine (depending on whether it's a Standard Universe job or a Vanilla job). This allows for a lot of flexibility in managing a system.

Beyond Clusters

Condor is not only about clusters. An extension to Condor allows jobs submitted within one pool of machines to execute on another (separate) Condor pool. Condor calls this flocking. If a machine within the pool where a job is submitted is not available to run the job, the job makes its way to another pool. This is enabled by special configuration of the pools.

The simplest flocking configuration sets a few configuration variables in the condor_config file. For example, let's set up an environment where we have two clusters, A and B, and we want jobs submitted in A to be executed in B. Let's say cluster A has its central manager at a.condor.org and B at b.condor.org. Here's the sample configuration:

FLOCK_TO = b.condor.org
FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)

The FLOCK_TO variable can specify multiple pools as a comma-separated list of central managers. The other two variables usually point to the same hosts as FLOCK_TO. On pool B's side, configuration macros must authorize jobs from pool A to flock in. The following is a sample of configuration macros that allows the flocking of jobs from A to B. Like FLOCK_TO, FLOCK_FROM lets administrators authorize incoming jobs from specific pools:
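For example, to let jobs from A overflow to a second pool as well (c.condor.org here is a hypothetical third central manager), the list form would look like:

```
FLOCK_TO = b.condor.org, c.condor.org
```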

FLOCK_FROM = a.condor.org
HOSTALLOW_WRITE_COLLECTOR = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
HOSTALLOW_WRITE_STARTD = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
HOSTALLOW_READ_COLLECTOR = $(HOSTALLOW_READ), $(FLOCK_FROM)
HOSTALLOW_READ_STARTD = $(HOSTALLOW_READ), $(FLOCK_FROM)

The above settings enable flocking from pool A to pool B, but not the reverse. To enable flocking in both directions, each direction needs to be configured separately. That is, in pool B you would need to set FLOCK_TO, FLOCK_COLLECTOR_HOSTS and FLOCK_NEGOTIATOR_HOSTS to point to pool A, and set the authorization macros in pool A to accept jobs from B.
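Concretely, enabling the reverse direction would mean adding to pool B's condor_config something like:

```
FLOCK_TO = a.condor.org
FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
```

and, in pool A, setting FLOCK_FROM = b.condor.org along with the same HOSTALLOW_* authorization lines shown above, with B's name in place of A's.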

Be careful with HOSTALLOW_WRITE and HOSTALLOW_READ. These settings define which hosts are allowed to join your pool and which may only view the status of your pool without joining it, respectively.

Condor provides flexible ways to define the hosts. It is possible, for example, to allow read access only to the hosts that belong to a specific subnet, like this:

HOSTALLOW_READ = 127.6.45.*

______________________

Comments


Open source?

The Condor FAQ is pretty clear that source is not distributed freely.

How is Condor in any way 'Open Source'?

You just send a request

You just send a request email, and they give you access to a website with the source.

And the license allows you to do anything you want with the source (redistribute, make derivative works), à la the BSD license.

Right source code is NOT

Right, the source code is NOT available from the website. However, it is STILL open source, because if you request it from them and have a good reason to do so, such as extending it, they will not deny the request.
