Getting Started with Condor

How to try Condor for multiplatform distributed computing.
Launching Jobs on Your New Cluster

To test our new condor setup, let's create a simple “Hello Condor” job:

#include
int main()
{ printf("Hello World!\n");}

Compile the application with gcc.

Now, to submit a job to Condor, we need to write a submit file. A submit file describes what Condor needs to do with the job—that is, where it will get the input for the application, where to produce the output and if any errors occur, where it should store them:

Universe = Vanilla
Executable = hello
Output = hello.out
Input = hello.in
Error = hello.err
Log = hello.log
Queue

The first Universe entry defines the runtime environment under which Condor should run the job. Two Universes are noteworthy: for long jobs, such as those that will last for weeks and months, the Standard Universe is recommended, as it ensures reliability and the ability to save partial execution state and relocate the job to another machine automatically if the first machine crashes. This saves a lot of vital processing effort. However, to use the Standard Universe, the application must be “condor compiled”, and the source code is required. The Vanilla Universe is for jobs that are short-lived, but long jobs also can be executed if the stability of the machines is guaranteed. Vanilla jobs can run unmodified binaries.

Other Universes in Condor include PVM, MPI and Java, for PVM, MPI and Java applications, respectively. For more detail on Condor Universes consult the documentation.

In this example, our executable file is called hello (the traditional “Hello Condor” program), and we're using the Vanilla Universe. The Input, Output, Error and Log directives tell Condor which files to use for stdin, stdout and stderr and to log the job's execution. Finally, the Queue directive specifies how many copies of the program to run.

After you have the submit file ready, run condor_submit hello.sub to submit it to Condor. You can check on the status of your job using condor_q, which will tell you how many jobs are in the queue, their IDs and whether they're running or idle, along with some statistics.

Condor has many other features; so far we have covered only the basics of getting it up and running. A number of tutorials are available on-line, along with the Condor Manual (www.cs.wisc.edu/condor/manual), that will teach you the basic and advanced capabilities of Condor. When reading the Condor Manual, pay particular attention to the Standard Universe, which allows you to checkpoint your job, and the Java Universe, which allows you to run Java jobs seamlessly.

You also can add Condor to the boot sequence of your central manager and other machines. You can shut down cluster machines, and their jobs will continue or restart on a different machine (depending on whether it's a Standard Universe job or a Vanilla job). This allows for a lot of flexibility in managing a system.

Beyond Clusters

Condor is not only about clusters. An extension to Condor allows jobs submitted within one pool of machines to execute on another (separate) Condor pool. Condor calls this flocking. If a machine within the pool where a job is submitted is not available to run the job, the job makes its way to another pool. This is enabled by special configuration of the pools.

The simplest flocking configuration sets a few configuration variables in the condor_config file. For example, let's set up an environment where we have two clusters, A and B, and we want jobs submitted in A to be executed in B. Let's say cluster A has its central manager at a.condor.org and B at b.condor.org. Here's the sample configuration:

FLOCK_TO = b.condor.org
FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)

The FLOCK_TO variable can specify multiple pools, by entering a comma-separated list of central managers. The other two variables usually point to the same settings that FLOCK_TO does. The configuration macros that must be set in pool B authorize jobs from pool A to flock to pool B. The following is a sample of configuration macros that allows the flocking of jobs from A to B. As in the FLOCK_TO field, FLOCK_FROM allows users to authorize the flocking of incoming jobs from specific pools:

FLOCK_FROM=a.condor.org
HOSTALLOW_WRITE_COLLECTOR = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
HOSTALLOW_WRITE_STARTD = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
HOSTALLOW_READ_COLLECTOR = $(HOSTALLOW_READ), $(FLOCK_FROM)
HOSTALLOW_READ_STARTD = $(HOSTALLOW_READ), $(FLOCK_FROM)

The above settings set flocking from pool A to pool B, but not the reverse. To enable flocking in both directions, each direction needs to be considered separately. That is, in pool B you would need to set the FLOCK_TO, FLOCK_COLLECTOR_HOSTS and FLOCK_NEGOTIATOR_HOST to point to pool A, and set up the authorization macros in pool A for B.

Be careful with HOSTALLOW_WRITE and HOSTALLOW_READ. These settings let you define the hosts that are allowed to join your pool, or those that can view the status of your pool but are not allowed to join it, respectively.

Condor provides flexible ways to define the hosts. It is possible, for example, to allow read access only to the hosts that belong to a specific subnet, like this:

HOSTALLOW_READ=127.6.45.*

______________________

Comments

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Open source?

ewan's picture

The Condor FAQ is pretty clear that source is not distributed freely.

How is Condor in any way 'Open Source'?

You just send a request

Anonymous's picture

You just send a request email, and they give you access to a website with the source.

And the license allows you to do anything you want with the source (redistribute, make derivative works), ala BSD style.

Right source code is NOT

anonymous's picture

Right source code is NOT available from the website, however it is STILL opensource, because if you request it from them and have a good reason to do, like extending it for something, they will not deny the request.

Webinar
One Click, Universal Protection: Implementing Centralized Security Policies on Linux Systems

As Linux continues to play an ever increasing role in corporate data centers and institutions, ensuring the integrity and protection of these systems must be a priority. With 60% of the world's websites and an increasing share of organization's mission-critical workloads running on Linux, failing to stop malware and other advanced threats on Linux can increasingly impact an organization's reputation and bottom line.

Learn More

Sponsored by Bit9

Webinar
Linux Backup and Recovery Webinar

Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.

In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.

Learn More

Sponsored by Storix