Using MySQL for Load Balancing and Job Control under Slurm

Like most things these days, modern atmospheric science is all about big data. Whether it's an instrument flying in an aircraft taking sets of images several times a second and producing three quarters of a terabyte of data per flight day over a two-week campaign or a satellite instrument producing hundreds of gigs of spectral data daily over a 10–15 year lifetime, data volume is enormous. Simply analyzing a day's worth of data to keep track of basic instrument stability is CPU-intensive. Fully processing a day to retrieve the state of the atmosphere or looking at trends across a decade's worth of data is exponentially so.

High-performance parallel cluster computing is the name of the game. For years I've done this on a very basic level by kicking off a handful of copies of my processing scripts on a couple of computers around the lab, but after a recent move into a new lab, I got my first chance to work on a real cluster system, processing data from a satellite-borne hyperspectral sounder called AIRS (see Resources). AIRS is one of the instruments onboard NASA's AQUA satellite, which was launched in 2002 and has been in continuous operation since. Data from AIRS and similar instruments is used to map out vertical profiles of atmospheric temperature and trace gases globally, but we have to be able to process it first.

The cluster computing game here is strictly to get a whole lot of computers doing the same thing to a whole lot of data so that we can process it faster than we collect it (much faster would be preferable). Since I was new to this game just a few months ago, I've had much to learn about cluster computing and how to design algorithms and processing software to take advantage of multiple CPUs for processing. This was my first experience where I had hundreds of CPUs at my disposal, and it really has changed how I process data in general. I started this article to describe how I was shown to parallelize this type of data processing and a method I put together that makes the process much cleaner.

Basic Slurm

The cluster system here consists of 240 compute nodes, each with dual 8-core processors and 64GB of main memory, running Red Hat Enterprise Linux. Cluster jobs are scheduled to run through the Slurm workload manager (see Resources). In a nutshell, Slurm is a suite of programs that allocates computer resources among users and compute jobs and enforces sharing rules to make sure everyone gets a chance to get their work in. The two most important programs in the suite for actually working on the system are sbatch and srun.

sbatch is the entry point to the Slurm scheduler and reads a high-level Bash control script that specifies job parameters (number of nodes needed, memory per process, expected run times and so on) and spawns the requested number of identical jobs via calls to srun.

Listing 1. Top-level Slurm/sbatch script. This script specifies the request for CPUs, memory and time on the cluster and sets up the program that will be run on each processing node.


#!/bin/bash
# sbatch options
#SBATCH --job-name=RUN_CALFLAG
#SBATCH --partition=batch
#SBATCH --qos=short
#SBATCH --ntasks=20
#SBATCH --mem-per-cpu=18000
#SBATCH --cpus-per-task=1
#SBATCH --time=00:20:00

# matlab options
MATLAB=/usr/bin/matlab
MATOPT=' -nojvm -nodisplay -nosplash'

srun --output=$HOME/run_calflag.log \
     $MATLAB $MATOPT -r "run_calflag; exit;"

The script in Listing 1 asks Slurm to allocate 20 processors (ntasks/cpus-per-task), give each of them 18GB of memory (mem-per-cpu) and run a job on each (the srun call) that should take around 20 minutes (time). The partition and qos directives help the system manage its resources and set rules for the number of processors a user is allowed, CPU time limits and so on. The job-name directive puts a name to your task to help you separate your jobs in an squeue list of the system queue.

The srun request shown here is to run an instance of MATLAB on each of the allocated nodes and, in each instance, run the script run_calflag and exit. Any message output is sent to the file specified by the output parameter. run_calflag could be a simple "hello world" script, or it could be a loop to process a thousand files. It also doesn't have to be MATLAB. MATLAB is our tool of choice here, and examples in this article use it in a background sort of way. There is no need to understand MATLAB to keep reading.

As long as this request doesn't violate any cluster good-behavior rules by hogging processors, memory and so on, Slurm queues it until the resources are available. At that point, Slurm grabs 20 CPUs and starts a copy of matlab/run_calflag on each one. Once the control scripts are in order, the request is submitted to Slurm through the sbatch command:


sbatch run_calflag_batch.sh
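
Once submitted, the job's place in the queue and its run state can be checked with squeue. A quick example, filtering on the job-name set in Listing 1:


squeue --user=$USER --name=RUN_CALFLAG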

Slurm also manages a set of environment variables that can be used to pass some job parameters into the processing scripts.
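
For instance, each task started by srun sees variables like SLURM_JOB_ID, SLURM_NTASKS and SLURM_PROCID, which identify the job, the number of tasks requested and which of those tasks "I" am. A minimal Bash sketch (the echo lines stand in for real work):


#!/bin/bash
# Each task launched by srun gets its own copy of these variables.
echo "job id:      $SLURM_JOB_ID"
echo "total tasks: $SLURM_NTASKS"
echo "this task:   $SLURM_PROCID"    # 0 .. SLURM_NTASKS-1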

Basic Data Chunking

Listing 1 already shows enough to expose one issue with parallelizing this kind of data processing: how does one chunk up the data to pass to each instance of run_calflag started by srun? If I want to process one year, should I ask for 365 processors and do one day on each? 52 processors and one week on each? One per month? How do I handle leap years? The cluster resource allocation rules prevent the 365-processor idea, but beyond that, there is no clear, single answer.

Being new to cluster computing, I looked at what my colleagues have done to set up their processing runs. Most have adopted a three-tier system to run these jobs:

  • A bash script with sbatch directives and srun calls to kick off everything.

  • A MATLAB script (called by srun in the previous script) that runs on each compute node and uses a task ID from the Slurm environment (usually SLURM_PROCID) to index into the range of years/days/files to process. The most common approaches are to request 12 CPUs and assign one month per CPU, or to request 52 and assign a week per CPU. This script then loops over the years/days/files "assigned" to its node (see the Bash sketch after this list).

  • A final MATLAB script, called within that loop, that does the actual processing for each year/day/file in the sequence.
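
To make that second tier concrete, here is a minimal Bash sketch of the same indexing idea (the real scripts here are MATLAB; days_to_process.txt and process_one_day are hypothetical stand-ins for the file list and processing command). Rather than handing each task a fixed month or week, this version stripes the list across tasks round-robin, but the principle is identical: SLURM_PROCID tells each copy of the script which slice of the work is "its" slice.


#!/bin/bash
# Sketch only: stripe a list of days across the tasks in this job.
# days_to_process.txt and process_one_day are hypothetical stand-ins.
me=$SLURM_PROCID        # this task's index, 0 .. SLURM_NTASKS-1
ntasks=$SLURM_NTASKS

i=0
while read -r day; do
    if [ $(( i % ntasks )) -eq "$me" ]; then
        process_one_day "$day"    # whatever does the real work
    fi
    i=$(( i + 1 ))
done < days_to_process.txt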

This approach certainly can work, but it has some significant issues:

  • Ad hoc chunking of data: in general, how does one split "x" things to process over "y" nodes? In practice, this seems to mean you have to edit run scripts and tailor them to just about every run you wish to do. (It is almost guaranteed that you will do multiple runs in this business: once to try to process a contiguous period of data and a second time to re-run the now non-contiguous set of days that failed for one reason or another.)

  • Does not utilize allocated system resources well: parallelizing by month, the node processing February is pretty much guaranteed to sit idle for 8–10% of the total run time simply because February is 2–3 days shorter than the other months. And if the lower-level processing fails and takes out the entire process running on a CPU, the rest of that chunk has to be reprocessed later while that processor sits idle waiting for the others to finish.

  • Does not utilize my time very well: any time I change either the number of processing jobs or the number of nodes to spread them over, I have to recalculate manually how to divide the work and, most likely, edit my control scripts. I absolutely hate keeping a dozen scripts lying around that all do the same thing, each tweaked for some special edge case.

Okay, then, how do we get around this?

______________________

As a chronic sufferer of Multiple Avocation Disorder, Steven lives life in a desperate search for days with more hours in which to stuff new activities.