Mainstream Parallel Programming

Whether you're a scientist, graphic artist, musician or movie executive, you can benefit from the speed and price of today's high-performance Beowulf clusters.
Compiling and Running Your Code

Both of the popular MPI distributions—LAM and MPICH—include wrapper scripts to allow users to compile their programs easily with the required MPI libraries. These wrapper scripts allow you to pass parameters to GCC like you always do:

  • mpicc: for C programs

  • mpi++: for C++ programs

  • mpif77: for FORTRAN 77 programs

Use mpirun to execute your newly compiled program. For example, I compiled my code with the command mpicc -O3 -o parallel parallel.c and then executed it with mpirun n0 ./parallel. The n0 signifies that the program is to run on node 0 only. To run it on additional nodes, you can specify a range like n0-7 (for eight processors) or use mpirun C to signify that the program is to run on all available nodes.

Performance Gains

So, with only a few simple MPI calls, we have parallelized our image filter algorithm very easily, but did we gain any performance? There are a few ways that we can gain performance. The first is in terms of speed, and the second is in terms of how much work we can do. For example, on a single computer, a 16,000 x 16,000 pixel image would require an array of 768,000,000 elements! This is just too much for many computers—GCC complained to me that the array was simply too big! By breaking the image down as we did above, we can ease memory requirements for our application.

I tested the code above on a 16-node Beowulf cluster running Fedora Core 1. Each node had 1.0GB of RAM, a 3.06GHz Pentium 4 processor and was connected to the other nodes through Gigabit Ethernet. The nodes also shared a common filesystem through NFS. Figure 4 shows the amount of time required to read in the image, process it and write it back to disk.

Figure 4. Shown here are the times that the program took for various image sizes and various cluster sizes. Image sizes ranged from 1,600 x 1,600 pixels to 16,000 x 16,000 pixels. A minimum of four nodes was required for the largest image.

From Figure 4, we can see that parallelizing this image filter sped things up for even moderately sized images, but the real performance gains happened for the largest images. Additionally, for images of more than 10,000 x 10,000 pixels, at least four nodes were required due to memory constraints. This figure also shows where it is a good idea to parallelize the code and where it was not. In particular, there was hardly any difference in the program's performance from 1,600 x 1,600 pixel images to about 3,200 x 3,200 pixel images. In this region, the images are so small that there is also no benefit in parallelizing the code from a memory standpoint, either.

To put some numbers to the performance of our image-processing program, one 3.06GHz machine takes about 50 seconds to read, process and write a 6,400 x 6,400 image to disk, whereas 16 nodes working together perform this task in about ten seconds. Even at 16,000 x 16,000 pixels, 16 nodes working together can process an image faster than one machine processing an image 6.25 times smaller.

Possible Mainstream Applications

This article demonstrates only one possible way to take advantage of the high performance of Beowulf clusters, but the same concepts are used in virtually all parallel programs. Typically, each node reads in a fraction of the data, performs some operation on it, and either sends it back to the master node or writes it out to disk. Here are four examples of areas that I think are prime candidates for parallelization:

  1. Image filters: we saw above how parallel processing can tremendously speed up image processing and also can give users the ability to process huge images. A set of plugins for applications such as The GIMP that take advantage of clustering could be very useful.

  2. Audio processing: applying an effect to an audio file also can take a large amount of time. Open-source projects such as Audacity also stand to benefit from the development of parallel plugins.

  3. Database operations: tasks that require processing of large amounts of records potentially could benefit from parallel processing by having each node build a query that returns only a portion of the entire set needed. Each node then processes the records as needed.

  4. System security: system administrators can see just how secure their users' passwords are. Try a brute-force decoding of /etc/shadow using a Beowulf by dividing up the range of the checks across several machines. This will save you time and give you peace of mind (hopefully) that your system is secure.



Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

about execution

rajesh k r's picture

can anyone tell me how to run the above code on cluster or on LAN.if anyone has code send me at

image in c

Anonymous's picture

can any1 tel me how to execute above in C

and pls help us for how to put processed image in another file i.e. gray scale image

interesting concept... is

bharath's picture

interesting concept... is there any way u can send the source code, i'd like to use it for my class project...

cristal clear concept necessary for a starter.... :)

Chandra's picture

This article has helped me to explore the extream ends of parallel
computing and MPI.During my study and running codes on cluster PADMA
at CDAC India I always had a dim picture of MPI application till I read this article.
Nice and sincere approch..

Image processing program

Vayrix's picture

Has someone filled up the gaps missing in the code? I'm not an expert in C/C++ programming, but I would like to see and run whole code on ROCKS cluster. So I would appreciate if someone gave me complete program code, if possible.

test Program

BryanC's picture

Any chance you still have this full test program source in C++?

This article has inspired me to set up an HPC cluster with a few fellow students, and I think a graphics algorithm such as this would be exactly what we need to awe our audience (possibly project benefactors, school deans, etc).

me too

Anonymous's picture

I could really use this source code for a class of mine. Let me know if you get it. -Larry Prokov

Good info

Michael Locker MD's picture

Very informative. Thanks.

Michael Locker MD

MOSIX and openMOSIX have

Anonymous's picture

MOSIX and openMOSIX have been getting a lot of attention lately, but they are used primarily for programs that are not specifically written to run on clusters, and they work by distributing threads in multithreaded programs to other nodes.

This is an inaccurate statment of how openMosix and I am pretty sure MOSIX work. They do not migrated threads of a multi threaded application but migrate the whole process. In fact applications that used shared-memory such as multi-threaded applications, like apache, require a secodn pacth on top of openMosix to use the migration system. Still for making fast cluster for multi-stage proccess such as video conversion or manipulation it is very effective at handling the workload.

Is there a way to speed up cpu-intensive application ?

BO Yang's picture

Good article I think .
Just few days ago , I have taken part in the ICFP contest and the task is to write a virtual machine . And I am disappointed that I didn't complete the test because of my slow application which will occupy all my cpu whenever start up . Does parallel programing help in this situation ? I think it will do nothing good with the such kind of cpu-intensive application , any idea ?

CPU Intensive Applications

M. Hore's picture

In my experience, cpu-intensive jobs are easily parallelized and benefit from it whenever their tasks are easily broken up into pieces that are more or less independent of each other (though they need not be). I don't immediately see how one might parallelize a virtual machine, and don't see how it might be beneficial, but I could be wrong.

What do you find to be the major bottleneck in your virtual machine?