Advanced OpenMP

Because the August issue's theme is programming, I thought I should cover some of the more advanced features available in OpenMP. Several issues ago, I looked at the basics of using OpenMP, so you may want to go back and review that article. In scientific programming, the basics tend to be the limit of how people use OpenMP, but there is so much more available, and these other features are useful for far more than just scientific computing. So, in this article, I delve into some corners of OpenMP that never seem to get covered. Who knows, you may even replace POSIX threads with OpenMP.

First, let me quickly review a few of the basics of OpenMP. All of the examples below are done in C. If you remember, OpenMP is built around a set of directives to the compiler, which means you need a compiler that supports OpenMP. These directives are given through pragmas, and a compiler that doesn't support OpenMP simply ignores them, so your code still builds as a plain serial program.

The most common construct is parallelizing a for loop. Say you want to fill an array with the sines of the integers from 0 up to some maximum value. It would look like this:


#pragma omp parallel for
for (i=0; i<max; i++) {
   a[i] = sin(i);
}
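
For anyone who wants to try that out directly, here is a minimal, complete version of the same loop. The array size of 1000 and the file name sines.c are just placeholders I've chosen for this sketch:


/* Compile with: gcc -fopenmp sines.c -o sines -lm */
#include <math.h>
#include <stdio.h>

#define MAX 1000

int main(void)
{
   double a[MAX];
   int i;

   /* the loop iterations are split across the thread team;
      the loop index i is automatically made private */
#pragma omp parallel for
   for (i=0; i<MAX; i++) {
      a[i] = sin(i);
   }

   printf("a[1] = %f\n", a[1]);
   return 0;
}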

Then you would compile this with GCC by using the -fopenmp flag. Although this works great for problems that naturally arrange themselves around for loops, plenty of algorithms don't. In most cases, you need more flexibility in your program design to handle more complicated parallel algorithms. To get that in OpenMP, enter the constructs of sections and tasks. With these, you should be able to do almost anything you would do with POSIX threads.

First, let's look at sections. In the OpenMP specification, sections are defined as blocks of code, each sequential in itself, that can be run in parallel with one another. You define them with a nested structure of pragma statements. The outermost layer is the pragma:


#pragma omp parallel sections
{
   ...commands...
}

Remember that pragmas apply only to the next code block in C. Most simply, this means the next line of code. If you need to use more than one line, you need to wrap them in curly braces, as shown above. This pragma forks off a number of new threads to handle the parallelized code. The number of threads that are created depends on what you set in the environment variable OMP_NUM_THREADS. So, if you want to use four threads, you would execute the following at the command line before running your program:


export OMP_NUM_THREADS=4
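
The environment variable is only a default. Here's a minimal sketch of doing the same thing from inside the program itself, using the OpenMP runtime library (the count of four and the printf are just for illustration):


#include <omp.h>
#include <stdio.h>

int main(void)
{
   /* same effect as export OMP_NUM_THREADS=4, but set in code */
   omp_set_num_threads(4);

#pragma omp parallel
   printf("hello from thread %d of %d\n",
          omp_get_thread_num(), omp_get_num_threads());

   return 0;
}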

Inside the sections region, you need to define a series of individual section regions. Each of these is defined by:


#pragma omp section
{
   ...commands...
}

This should look familiar to anyone who has used MPI before. What you end up with is a series of independent blocks of code that can be run in parallel. Say you defined four threads to be used for your program. This means you can have up to four section regions running in parallel. If you have more than four defined in your code, OpenMP will manage running them as quickly as possible, farming remaining section regions out to the running threads as soon as they become free.

As a more complete example, let's say you have an array of numbers and you want to find the sine, cosine and tangent of the values stored there. You could create three section regions to do all three steps in parallel:


#pragma omp parallel sections
{
#pragma omp section
for (i=0; i<max; i++) {
   sines[i] = sin(A[i]);
}
#pragma omp section
for (j=0; j<max; j++) {
   cosines[j] = cos(A[j]);
}
#pragma omp section
for (k=0; k<max; k++) {
   tangents[k] = tan(A[k]);
}
}

In this case, each of the section regions has a single code block defined by the for loop. Therefore, you don't need to wrap them in curly braces. You also should have noticed that each for loop uses a separate loop index variable. Remember that OpenMP is a shared memory parallel programming model, so all threads can see, and write to, all global variables. So if you use variables that are created outside the parallel region, you need to avoid multiple threads writing to the same variable. If this does happen, it's called a race condition. It might also be called the bane of the parallel programmer.
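
If you would rather reuse a single counter name, one alternative (just a sketch, building on the same arrays as the example above) is to declare the variable private on the sections pragma, which gives each thread in the team its own copy:


#pragma omp parallel sections private(i)
{
#pragma omp section
   /* each thread gets its own uninitialized copy of i,
      so the three loops no longer race on the counter */
   for (i=0; i<max; i++) {
      sines[i] = sin(A[i]);
   }
#pragma omp section
   for (i=0; i<max; i++) {
      cosines[i] = cos(A[i]);
   }
#pragma omp section
   for (i=0; i<max; i++) {
      tangents[i] = tan(A[i]);
   }
}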

The second construct I want to look at in this article is the task. Tasks in OpenMP are even less structured than sections. Section regions need to be grouped together into a single sections region, and that entire region gets parallelized as a unit. Tasks, on the other hand, simply get dumped onto a queue, ready to run as soon as possible. Defining a task is simple:


#pragma omp task
{
...commands...
}

In your code, you would create a general parallel region with the pragma:


#pragma omp parallel

This pragma forks off the number of threads that you set in the OMP_NUM_THREADS environment variable. These threads form a pool that is available to be used by other parallel constructs.

Now, when you create a new task, one of three things might happen. The first is that there is a free thread in the pool, in which case OpenMP has that free thread run the code in the task construct. In the second and third cases, no free thread is available: the task either ends up being run immediately by the thread that created it, or it gets queued up to run as soon as a thread becomes free.

So, let's say you have a function (called func) that you want to call with five different parameters, where the five calls are independent of one another, and you want them to run in parallel. You can do this with the following:


#pragma omp parallel
{
#pragma omp single
   /* only one thread generates the tasks; the rest of the
      pool stands by, ready to pick them up */
   for (i=1; i<6; i++) {
#pragma omp task firstprivate(i)
      func(i);
   }
}

This will create a thread pool, have a single thread run through the for loop, and farm the five resulting tasks out to the pool. The single pragma is what keeps every thread in the pool from generating its own copy of the tasks, and firstprivate(i) hands each task the value of i at the moment that task was created. One cool thing about tasks is that you have a bit more control over how they are scheduled. If you reach a point in your task where it can go to sleep for a while, you actually can tell OpenMP to do that. You can use the pragma:


#pragma omp taskyield

When the currently running task reaches this point in your code, it tells OpenMP that it is willing to be suspended. The runtime can then check the task queue, and if anything is waiting, set the current task aside, run one of the queued tasks and later pick the suspended task back up where it left off. Keep in mind that this is a hint: the runtime also is free to just keep running your task.
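
As a small sketch of where that helps, imagine a task that needs a lock currently held by another task. Rather than spinning and hogging its thread, it can yield on every failed attempt so that queued tasks get a chance to run (the function name here is just something I made up for illustration):


#include <omp.h>

/* Spin until the lock becomes available, but yield to other
   queued tasks after each failed attempt. */
void wait_for_lock(omp_lock_t *lock)
{
   while (!omp_test_lock(lock)) {
#pragma omp taskyield
   }
}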

Hopefully, seeing some of these less-common constructs has inspired you to go and check out what other techniques might be missing from your repertoire. Most parallel frameworks support most techniques, but for historical reasons, each one has tended to be used for only a small subset of them, leaving constructs that hardly ever get exercised. For shared memory programming, the constructs I cover here let you do many of the things you can do with POSIX threads, without the programming overhead. You just have to trade away some of the flexibility you get with POSIX threads.
