Mainstream Parallel Programming

Whether you're a scientist, graphic artist, musician or movie executive, you can benefit from the speed and price of today's high-performance Beowulf clusters.

When Donald Becker introduced the idea of a Beowulf cluster while working at NASA in the early 1990s, he forever changed the face of high-performance computing. Instead of institutions forking out millions of dollars for the latest supercomputer, they now could spend hundreds of thousands and get the same performance. In fact, a quick scan of the TOP500 project's list of the world's fastest supercomputers shows just how far-reaching the concept of computer clusters has become. The emergence of the Beowulf cluster—a computer cluster created from off-the-shelf components and that runs Linux—has had an unintended effect as well. It has captivated the imaginations of computer geeks everywhere, most notably, those who frequent the Slashdot news site.

Unfortunately, many people believe that Beowulfs don't lend themselves particularly well to everyday tasks. This does have a bit of truth to it. I, for one, wouldn't pay money for a version of Quake 4 that runs on a Beowulf! On one extreme, companies such as Pixar use these computer systems to render their latest films, and on the other, scientists around the world are using them this minute to do everything from simulations of nuclear reactions to the unraveling of the human genome. The good news is that high-performance computing doesn't have to be confined to academic institutions and Hollywood studios.

Before you make your application parallel, you should consider whether it really needs to be. Parallel applications normally are written because the data they process is too large for conventional PCs or the processes involved in the program require a large amount of time. Is a one-second increase in speed worth the effort of parallelizing your code and managing a Beowulf cluster? In many cases, it is not. However, as we'll see later in this article, in a few situations, parallelizing your code can be done with a minimum amount of effort and yield sizeable performance gains.

You can apply the same methods to tackle image processing, audio processing or any other task that is easily broken up into parts. As an example of how to do this for whatever task you have at hand, I consider applying an image filter to a rather large image of your friend and mine, Tux.

A Typical Setup

The first thing we need is a set of identical computers running Linux, connected to each other through high-speed Ethernet. Gigabit Ethernet is best. The speed of the network connection is one factor that can bring a fast cluster to a crawl. We also need a shared filesystem of some sort and some clustering software/libraries. Most clusters use NFS to share the hard drive, though more exotic filesystems exist, like IBM's General Parallel Filesystem (GPFS). For clustering software, there are a few available choices. The standard these days is the Message Passing Interface (MPI), but the Parallel Virtual Machine (PVM) libraries also should work just fine. MOSIX and openMOSIX have been getting a lot of attention lately, but they are used primarily for programs that are not specifically written to run on clusters, and they work by distributing threads in multithreaded programs to other nodes. This article assumes you have MPI installed, though the process for parallelizing an algorithm with PVM is exactly the same. If you have never installed MPI, Stan Blank and Roman Zaritski have both written nice articles on how to set up an MPI-based cluster on the Linux Journal Web site (see the on-line Resources).

Initializing the Program

At the beginning of each MPI program are a few calls to subroutines that initialize communication between the computers and figure out what “rank” each node is. The rank of a node is a number that identifies it uniquely to the other computers, and it varies from 0 to one less than the total cluster size. Node 0 typically is called the master node and is the controller for the process. After the program is finished, you need to make one additional call to finish up the process before exiting. Here's how it's done:

#include <mpi.h>
#include <stdlib.h>

int main (void) {

           int myRank, clusterSize;
           int imgHeight, lowerBoundY, upperBoundY,

          // Initialize MPI
         MPI_Init((void *) 0, (void *) 0);

         // Get which node number we are.
         MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

         // Get how many total nodes there are.
         MPI_Comm_size(MPI_COMM_WORLD, &clusterSize);

        // boxSize - the amount of the image each node
        //                  will process
        boxSize = imgHeight / clusterSize;

         // lowerBoundY - where each node starts processing.
         lowerBoundY = myRank*boxSize;

        // upperBoundY - where each node stops processing.
        upperBoundY = lowerBoundY + boxSize;

        // Body of program goes here

        // Clean-up and exit:
        return 0;

This code runs independently on each machine that is part of the process, so the values for lowerBoundY and upperBoundY will vary on each machine. These will be used in the next section.



Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

about execution

rajesh k r's picture

can anyone tell me how to run the above code on cluster or on LAN.if anyone has code send me at

image in c

Anonymous's picture

can any1 tel me how to execute above in C

and pls help us for how to put processed image in another file i.e. gray scale image

interesting concept... is

bharath's picture

interesting concept... is there any way u can send the source code, i'd like to use it for my class project...

cristal clear concept necessary for a starter.... :)

Chandra's picture

This article has helped me to explore the extream ends of parallel
computing and MPI.During my study and running codes on cluster PADMA
at CDAC India I always had a dim picture of MPI application till I read this article.
Nice and sincere approch..

Image processing program

Vayrix's picture

Has someone filled up the gaps missing in the code? I'm not an expert in C/C++ programming, but I would like to see and run whole code on ROCKS cluster. So I would appreciate if someone gave me complete program code, if possible.

test Program

BryanC's picture

Any chance you still have this full test program source in C++?

This article has inspired me to set up an HPC cluster with a few fellow students, and I think a graphics algorithm such as this would be exactly what we need to awe our audience (possibly project benefactors, school deans, etc).

me too

Anonymous's picture

I could really use this source code for a class of mine. Let me know if you get it. -Larry Prokov

Good info

Michael Locker MD's picture

Very informative. Thanks.

Michael Locker MD

MOSIX and openMOSIX have

Anonymous's picture

MOSIX and openMOSIX have been getting a lot of attention lately, but they are used primarily for programs that are not specifically written to run on clusters, and they work by distributing threads in multithreaded programs to other nodes.

This is an inaccurate statment of how openMosix and I am pretty sure MOSIX work. They do not migrated threads of a multi threaded application but migrate the whole process. In fact applications that used shared-memory such as multi-threaded applications, like apache, require a secodn pacth on top of openMosix to use the migration system. Still for making fast cluster for multi-stage proccess such as video conversion or manipulation it is very effective at handling the workload.

Is there a way to speed up cpu-intensive application ?

BO Yang's picture

Good article I think .
Just few days ago , I have taken part in the ICFP contest and the task is to write a virtual machine . And I am disappointed that I didn't complete the test because of my slow application which will occupy all my cpu whenever start up . Does parallel programing help in this situation ? I think it will do nothing good with the such kind of cpu-intensive application , any idea ?

CPU Intensive Applications

M. Hore's picture

In my experience, cpu-intensive jobs are easily parallelized and benefit from it whenever their tasks are easily broken up into pieces that are more or less independent of each other (though they need not be). I don't immediately see how one might parallelize a virtual machine, and don't see how it might be beneficial, but I could be wrong.

What do you find to be the major bottleneck in your virtual machine?