Parallel Programming Crash Course

I've been covering various scientific programs the past few months, but sometimes it's hard to find a package that does what you need. In those cases, you need to go ahead and write your own code. When you are involved with heavy-duty scientific computing, you usually need to go to parallel computing in order to get the runtimes down to something reasonable. This month, I give a crash course in parallel programming so you can get a feel for what is involved.

There are two broad categories of parallel programs: shared memory and message passing. You likely will see both types being used in various scientific arenas. Shared-memory programming is when all of the processors you are using are on a single box. This limits you as to how big your problem can be. When you use message passing, you can link together as many machines as you have access to over some interconnection network.

Let's start by looking at message-passing parallel programming. The most common version in use today is MPI (Message Passing Interface). MPI is actually a specification, so many different implementations are available, including Open MPI, MPICH and LAM, among others. These implementations are available for C, C++ and FORTRAN. Implementations also are available for Python, OCaml and .NET.

An MPI program consists of multiple processes (called slots), running on one or more machines. Each of these processes can communicate with all other processes. Essentially, they are in a fully connected network. Each process runs a full copy of your program as its executable content and runs independently of the others. The parallelism comes into play when these processes start sending messages to each other.

Assuming you already have some MPI code, the first step in using it is to compile it. MPI implementations include a set of wrapper scripts that handle all of the compiler and linker options for you. They are called mpicc, mpiCC, mpif77 and mpif90, for C, C++, FORTRAN 77 and FORTRAN 90, respectively. You can add extra options for your compiler as options to the wrapper scripts. One very useful option is -showme. This option simply prints out the full command line that would be used to invoke your compiler. This is useful if you have multiple compilers and/or libraries on your system, and you need to verify that the wrapper is doing the right thing.

Once your code is compiled, you need to run it. You don't actually run your program directly. A support program called mpirun takes care of setting up the system and running your code. You need to tell mpirun how many processors you want to run and where they are located. If you are running on one machine, you can hand in the number of processors with the option -np X. If you are running over several machines, you can hand in a list of hostnames either on the command line or in a text file. If this list of hostnames has repeats, mpirun assumes you want to start one process for each repeat.

Now that you know how to compile and run your code, how do you actually write an MPI program? The first step needs to initialize the MPI subsystem. There is a function to do this, which in C is this:

int MPI_Init(&argc, &argv); 

Until you call this function, your program is running a single thread of execution. Also, you can't call any other MPI functions before this, except for MPI_Initialized. Once you run MPI_Init, MPI starts up all of the parallel processes and sets up the communication network. After this initialization work is finished, you are running in parallel, with each process running a copy of your code.

When you've finished all of your work, you need to shut down all of this infrastructure cleanly. The function that does this is:

int MPI_Finalize();

Once this finishes, you are back to running a single thread of execution. After calling this function, the only MPI functions that you can call are MPI_Get_version, MPI_Initialized and MPI_Finalized.

Remember that once your code goes parallel, each processor is running a copy of your code. If so, how does each copy know what it should be doing? In order to have each process do something unique, you need some way to identify different processes. This can be done with the function:

int MPI_Comm_rank(MPI_Comm comm, int *rank);

This function will give a unique identifier, called the rank, of the process calling it. Ranks are simply integers, starting from 0 to N–1, where N is the number of parallel processes.

You also may need to know how many processes are running. To get this, you would need to call the function:

int MPI_Comm_size(MPI_Comm comm, int *size);

Now, you've initialized the MPI subsystem and found out who you are and how many processes are running. The next thing you likely will need to do is to send and receive messages. The most basic method for sending a message is:

int MPI_Send(void *buf, int count, MPI_Datatype type, 
 ↪int dest, int tag, MPI_Comm comm);

In this case, you need a buffer (buf) containing count elements of type type. The parameter dest is the rank of the process that you are sending the message to. You also can label a message with the parameter tag. Your code can decide to do something different based on the tag value you set. The last parameter is the communicator, which I'll look at a little later. On the receiving end, you would need to call:

int MPI_Recv(void *buf, int count, MPI_Datatype type, 
 ↪int source, int tag, MPI_Comm comm, MPI_Status *status);

When you are receiving a message, you may not necessarily care who sent it or what the tag value is. In those cases, you can set these parameters to the special values MPI_ANY_SOURCE and MPI_ANY_TAG. You then can check what the actual values were after the fact by looking at the status struct. The status contains the values:


Both of these functions are blocking. This means that when you send a message, you end up being blocked until the message has finished being sent. Alternatively, if you try to receive a message, you will block until the message has been received completely. Because these calls block until they complete, it is very easy to cause deadlocks where, for example, two processes are both waiting for a message to arrive before any messages get sent. They end up waiting forever. So if you have problems with your code, these calls usually are the first places to look.

These functions are point-to-point calls. But, what if you want to talk to a group of other processes? MPI has a broadcast function:

int MPI_Bcast(void *buf, int count, MPI_Datatype type, 
 ↪int root, MPI_Comm comm);

This function takes a buffer containing count elements of type type and broadcasts to all of the processors, including the root process. The root process (from the parameter root) is the process that actually has the data. All the others receive the data. They all call MPI_Bcast, and the MPI subsystem is responsible for sorting out who has the data and who is receiving. This call also sends the entire contents of the buffer to all the processes, but sometimes you want each process to work on a chunk of the data. In these cases, it doesn't make sense to send the entire data buffer to all of them. There is an MPI function to handle this:

int MPI_Scatter(void *send, int sendcnt, MPI_Datatype type,
                void *recv, int recvcnt, MPI_Datatype type, int root,
                MPI_Comm comm);

In this case, they all call the same function, and the MPI subsystem is responsible for sorting out which is root (the process with the data) and which are receiving data. MPI then divides the send buffer into even-size chunks and sends it out to all of the processes, including the root process. Then, each process can work away on its chunk. When they're done, you can gather up all the results with:

int MPI_Gather(void *send, int sendcnt, MPI_Datatype type,
               void *recv, int recvcnt, MPI_Datatype type, int root,
               MPI_Comm comm);

This is a complete reversal of MPI_Scatter. In this case, all the processes send their little chunks, and the root process gathers them all up and puts them in its receive buffer.

Taking all of the information from above and combining it together, you can put together a basic boilerplate example:

#include <mpi.h>
// Any other include files

int main(int argc, char **argv){
   int id,size;
   // all of your serial code would
   // go here
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &id);
   MPI_Comm_size(MPI_COMM_WORLD, &size);
   // all of your parallel code would
   // go here
   // any single-threaded cleanup code
   // goes here

Hopefully, you now feel more comfortable with MPI programs. I looked at the most basic elements here, but if you feel inspired, you should grab a good textbook and see what other functions are available to you. If not, you at least should be able to read existing MPI code and have a good idea of what it's trying to do. As always, if you'd like to see a certain area covered in this space, feel free to let me know.

Load Disqus comments