Parallel Programming Crash Course

I've been covering various scientific programs the past few months, but sometimes it's hard to find a package that does what you need. In those cases, you need to go ahead and write your own code. When you are involved with heavy-duty scientific computing, you usually need to go to parallel computing in order to get the runtimes down to something reasonable. This month, I give a crash course in parallel programming so you can get a feel for what is involved.

There are two broad categories of parallel programs: shared memory and message passing. You likely will see both types being used in various scientific arenas. Shared-memory programming is when all of the processors you are using are on a single box. This limits you as to how big your problem can be. When you use message passing, you can link together as many machines as you have access to over some interconnection network.

Let's start by looking at message-passing parallel programming. The most common version in use today is MPI (Message Passing Interface). MPI is actually a specification, so many different implementations are available, including Open MPI, MPICH and LAM, among others. These implementations are available for C, C++ and FORTRAN. Implementations also are available for Python, OCaml and .NET.

An MPI program consists of multiple processes (called slots), running on one or more machines. Each of these processes can communicate with all other processes. Essentially, they are in a fully connected network. Each process runs a full copy of your program as its executable content and runs independently of the others. The parallelism comes into play when these processes start sending messages to each other.

Assuming you already have some MPI code, the first step in using it is to compile it. MPI implementations include a set of wrapper scripts that handle all of the compiler and linker options for you. They are called mpicc, mpiCC, mpif77 and mpif90, for C, C++, FORTRAN 77 and FORTRAN 90, respectively. You can add extra options for your compiler as options to the wrapper scripts. One very useful option is -showme. This option simply prints out the full command line that would be used to invoke your compiler. This is useful if you have multiple compilers and/or libraries on your system, and you need to verify that the wrapper is doing the right thing.

Once your code is compiled, you need to run it. You don't actually run your program directly. A support program called mpirun takes care of setting up the system and running your code. You need to tell mpirun how many processors you want to run and where they are located. If you are running on one machine, you can hand in the number of processors with the option -np X. If you are running over several machines, you can hand in a list of hostnames either on the command line or in a text file. If this list of hostnames has repeats, mpirun assumes you want to start one process for each repeat.

Now that you know how to compile and run your code, how do you actually write an MPI program? The first step needs to initialize the MPI subsystem. There is a function to do this, which in C is this:

int MPI_Init(&argc, &argv); 

Until you call this function, your program is running a single thread of execution. Also, you can't call any other MPI functions before this, except for MPI_Initialized. Once you run MPI_Init, MPI starts up all of the parallel processes and sets up the communication network. After this initialization work is finished, you are running in parallel, with each process running a copy of your code.

When you've finished all of your work, you need to shut down all of this infrastructure cleanly. The function that does this is:

int MPI_Finalize();

Once this finishes, you are back to running a single thread of execution. After calling this function, the only MPI functions that you can call are MPI_Get_version, MPI_Initialized and MPI_Finalized.

Remember that once your code goes parallel, each processor is running a copy of your code. If so, how does each copy know what it should be doing? In order to have each process do something unique, you need some way to identify different processes. This can be done with the function:

int MPI_Comm_rank(MPI_Comm comm, int *rank);

This function will give a unique identifier, called the rank, of the process calling it. Ranks are simply integers, starting from 0 to N–1, where N is the number of parallel processes.

You also may need to know how many processes are running. To get this, you would need to call the function:

int MPI_Comm_size(MPI_Comm comm, int *size);

Now, you've initialized the MPI subsystem and found out who you are and how many processes are running. The next thing you likely will need to do is to send and receive messages. The most basic method for sending a message is:

int MPI_Send(void *buf, int count, MPI_Datatype type, 
 ↪int dest, int tag, MPI_Comm comm);

In this case, you need a buffer (buf) containing count elements of type type. The parameter dest is the rank of the process that you are sending the message to. You also can label a message with the parameter tag. Your code can decide to do something different based on the tag value you set. The last parameter is the communicator, which I'll look at a little later. On the receiving end, you would need to call:

int MPI_Recv(void *buf, int count, MPI_Datatype type, 
 ↪int source, int tag, MPI_Comm comm, MPI_Status *status);

When you are receiving a message, you may not necessarily care who sent it or what the tag value is. In those cases, you can set these parameters to the special values MPI_ANY_SOURCE and MPI_ANY_TAG. You then can check what the actual values were after the fact by looking at the status struct. The status contains the values:


Both of these functions are blocking. This means that when you send a message, you end up being blocked until the message has finished being sent. Alternatively, if you try to receive a message, you will block until the message has been received completely. Because these calls block until they complete, it is very easy to cause deadlocks where, for example, two processes are both waiting for a message to arrive before any messages get sent. They end up waiting forever. So if you have problems with your code, these calls usually are the first places to look.

These functions are point-to-point calls. But, what if you want to talk to a group of other processes? MPI has a broadcast function:

int MPI_Bcast(void *buf, int count, MPI_Datatype type, 
 ↪int root, MPI_Comm comm);

This function takes a buffer containing count elements of type type and broadcasts to all of the processors, including the root process. The root process (from the parameter root) is the process that actually has the data. All the others receive the data. They all call MPI_Bcast, and the MPI subsystem is responsible for sorting out who has the data and who is receiving. This call also sends the entire contents of the buffer to all the processes, but sometimes you want each process to work on a chunk of the data. In these cases, it doesn't make sense to send the entire data buffer to all of them. There is an MPI function to handle this:

int MPI_Scatter(void *send, int sendcnt, MPI_Datatype type,
                void *recv, int recvcnt, MPI_Datatype type, int root,
                MPI_Comm comm);

In this case, they all call the same function, and the MPI subsystem is responsible for sorting out which is root (the process with the data) and which are receiving data. MPI then divides the send buffer into even-size chunks and sends it out to all of the processes, including the root process. Then, each process can work away on its chunk. When they're done, you can gather up all the results with:

int MPI_Gather(void *send, int sendcnt, MPI_Datatype type,
               void *recv, int recvcnt, MPI_Datatype type, int root,
               MPI_Comm comm);

This is a complete reversal of MPI_Scatter. In this case, all the processes send their little chunks, and the root process gathers them all up and puts them in its receive buffer.

Taking all of the information from above and combining it together, you can put together a basic boilerplate example:

#include <mpi.h>
// Any other include files

int main(int argc, char **argv){
   int id,size;
   // all of your serial code would
   // go here
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &id);
   MPI_Comm_size(MPI_COMM_WORLD, &size);
   // all of your parallel code would
   // go here
   // any single-threaded cleanup code
   // goes here

Hopefully, you now feel more comfortable with MPI programs. I looked at the most basic elements here, but if you feel inspired, you should grab a good textbook and see what other functions are available to you. If not, you at least should be able to read existing MPI code and have a good idea of what it's trying to do. As always, if you'd like to see a certain area covered in this space, feel free to let me know.


Joey Bernard has a background in both physics and computer science. This serves him well in his day job as a computational research consultant at the University of New Brunswick. He also teaches computational physics and parallel programming.


Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Setup a prepaid account

JM Dukes's picture

I would like to setup a prepaid account so that payment can automatically deducted from this prepaid account. Sort of like an iTunes account where you purchase a card to add money to the account when it gets low. That way you get your money at the time of purchase. If it is possible to establish such an account, please, please notify me.

Why MPI instead of OPENMP?

Anonymous's picture

Perhaps it will be nice to have an updated view of MPI vs OPENMP
as quad-processors in a single unit are very common nowadys.

administrators point of view in parallel environment

Anonymous's picture

Thanks allot for this insightful article on parallel environments!
I'm working for a company where clusters are our product, so mpi mpich and related programs/techniques are quite important to me. And then not so much wat you can do with it, but more what are the differences and and benefits of one over the other.
I would love to see more articles on parallel programing from an administrators point of view.

Thanks again!

Not as easy as it sounds.

Edmundo's picture

You should mention that if you use a network to run parallel programs using messages, the performance is heavily limited by the cost of establishing connections and passing the messages, so the speed-up you would get won't be as good as if running all the processes on the same box, even for small scale problems.

Only if the connection and

Anonymous's picture

Only if the connection and setup time out stripes the processing time. If it does, "you're doing it wrong" or have picked too small of a data set.

Actually, this is not true. A

Anonymous's picture

Actually, this is not true. A small data-set does not void parallel programming principle as long as the result is worth it. Perhaps, finding recurring strings of integers in "small data-sets" might not sound beneficial - nevertheless it has been done.

Or, for more connection, set-up time and other non-sense just look @SETI.

Missing some setup/configuration

Anonymous's picture

Having worked with LAM/MPI clusters 10 years ago, I know you're missing some significant setup documentation (and I'm sure/hoping setup has become a little easier since then). A link to current setup stuff would be nice.


diane's picture

Very nice your program but I think that it would be cool to have this app on the iPhone

This looks pretty cool. I'm

jumpnett's picture

This looks pretty cool. I'm going to have to try this out on my spare time.