Roman's Law and Fast Processing with Multiple CPU Cores

Some practical methods to exploit multiple cores and find thread synchronization problems.

Computers are slow as molasses on a cold day, and they've been like that for years. The hardware industry is doing its work, but computers are not going to get any faster unless the software industry follows suit and does something about it. Processing speed is less about GHz and more about multiple cores these days, and software needs to adapt.

I will be so bold as to postulate my own law of the multicore trend for hardware: the number of cores on a chip would double every three years. It remains to be seen whether I'm going to be as accurate in this prediction as Gordon Moore happened to be, but it looks good so far. With Intel and AMD introducing quad-core systems and Sun pushing the envelope even further with its first hexadeca-core CPU named Rock, the question the software industry has to ask itself is “Are we ready to make all these execution threads do useful work for us?” My strong opinion is that we are not ready. A new paradigm is yet to be invented, and we don't really know what it should look like. What we do know, however, in the words of Herb Sutter, is “The biggest sea change in software development since the OO revolution is knocking at the door, and its name is Concurrency.”

The first layer of software where hardware parallelism is going to percolate is the kernel of your operating system. And, I know of only two kernels that have what it takes to tackle the challenge: Solaris and Linux. If there's any Mac OS X—or dare I say, Windows—fanboys out there, I have two phrases for you: “256 processor systems” and “Completely Fair Scheduler”. Now, having a kernel that is capable of efficiently multiplexing a large number of processes to a large number of hardware threads is only as good as your demand in actually having that large number of individual processes running in parallel. And, as much as it is an ISP or provisioning person's dream come true, on my laptop I rarely have more than four really active processes running at the same time. The real need that I have is for each individual application to be able to take advantage of the underlying hardware parallelism. And, that's how a fundamental hardware concept permeates yet another software layer—an application one. At the end of the day we, as userland software developers, have no other choice but to embrace the parallel view of the world fully. Those pesky hardware people left us no choice whatsoever.

For the rest of this article, I assume that you have at least a dual-core system running Linux kernel 2.6.x and that you have Sun Studio Express installed in /opt/sun/sunstudio and added to your PATH, as follows:


My goal is to explain the kind of practical steps you can take to teach that old serial code of yours a few multicore tricks.

There are three basic steps to iterate through to add parallelism gradually to your serial application:

  1. Identify parallelism.

  2. Express parallelism.

  3. Measure and observe.

And, even though the first two steps sound like the most exciting ones, I cannot stress enough the importance of step 3. Parallel programming is difficult and riddled with unexpected performance gotchas. And, there is no other way of being certain that your code got faster than proving it with numbers. The good news is that it isn't all that difficult. If you happen to develop mostly for Intel architectures, you can use code similar to the FFMPEG's START_TIMER/STOP_TIMER for microbenchmarking:


Additionally, there's the Sun Studio Performance analyzer for observing macro behaviour. You also can use tools like Intel's VTune or even time(1), but whatever you do, make sure that performance regression testing is as much a part of your testing routine as the regular regression testing is. You do regression testing, don't you?

Identifying parallelism in an existing application usually starts with finding spots with data parallel characteristics, task parallel characteristics and figuring out a scheduling model to tie the two together. Data parallelism usually can be found in applications working with large sets of global or static data (think audio, video and image processing, gaming engines and rendering software). Task parallelism, on the other hand, mostly is appropriate when branch-and-bound computation takes place (think Chess-solvers when a bunch of tasks are asked to do similar calculations, but if one finds a solution, there's no need to wait for others).

Once you've identified all potential sources of the parallelism in your application, you have to decide what programming techniques to use for expressing it. For an application written in C or C++, the most commonly used one happens to be explicit parallelization with POSIX threads. This method has been around for decades, and most developers usually have some familiarity with it. On the other hand, given its inherent complexity and the fact that it no longer is the only game in town, I'm going to skip over it.

Let's look at this sample code, which happens to be a very simple routine for calculating how many prime numbers there are between 2 and N:

 1  #include <stdio.h>
 2  #include <stdlib.h>
 3  #include <math.h>
 4  #include <omp.h>
 6  /* pflag[v] == 1 if and only if v is a prime number   */
 7  char *pflag;
 9  int is_prime(int v)
10  {
11      int i;
12      int bound = floor(sqrt(v)) + 1;
14      for (i = 2; i < bound; i++) {
15          if (v % i == 0) { 
16              pflag[v] = 0;
17              return 0; /* v is NOT a prime number */
18          }
19      }
20      return 1; /* v is a prime number */  
21  }
23  int main(int argc, char **argv)
24  {
25      int i;
26      int total = 0;
27      int N = atoi(argv[1]);
28      int primes[N];     /* array of prime numbers */ 
29      pflag=(char*)alloca(N);
31      for (i = 2; i < N; i++) {
32          pflag[i] = 1; /* all numbers are prime until... */   
33      }                 /* ...proven otherwise            */
35      for (i = 2; i < N; i++) { /* let the testing begin! */
36          if ( is_prime(i) ) {
37              primes[total] = i;
38              total++;
39          }
40  }
42  printf("Number of prime numbers between 2 and %d: %d\n",
43         N, total);
45  return 0;
46  }

Granted, the code is silly (some might even say brain-dead), but let's pretend it is a real-life application. In that case, we certainly would benefit from as much automation as possible. And, if you think about it, there's no tool better suited for helping us than a compiler—after all, it already takes care of understanding the semantics of the code in order to perform optimizations. Ideally, what we would need is a compiler that talks back to us, helping us understand the source code better and make reasonable tweaks based on that information. Here's how Sun Studio 12 lets you do that:

$ cc -g -fast prime.c -o prime 
$ er_src prime 
Source loop below has tag L3 
35.     for (i = 2; i < N; i++) { /* let the testing begin! */

Function is_prime inlined from source file prime.c into the code
for the following line.  1 loops inlined 
Loop in function is_prime, line 14 has tag L4
36.          if ( is_prime(i) ) {

Finally! Your compiler actually explains to you in plain human language, what transformations it applied to the source code to make it faster (-fast). Not only that, but it also identifies and tags all key areas (such as loops) that you later can navigate and inspect for the parallelization potential. Identifying parallelism just got easier. But what about expressing parallelism? Would it be completely out of the question to delegate some of that to the compiler as well? After all, we are too lazy to use POSIX threads (besides, they are like the GOTOs of parallel programming, anyway). The good news is that with the right compiler it is possible. But, before we go there, let's remember the third step from our “three-step parallelization program” and establish a performance baseline:

$ cc -fast prime.c -o prime
$ collect ./prime 2000000
Creating experiment database ...
Number of prime numbers between 2 and 2000000: 148933
$ er_print -statistics
                   Execution for entire program

                                  Start Label: Total
                                    End Label: Total
                            Start Time (sec.): 0.028
                              End Time (sec.): 3.364
                              Duration (sec.): 3.336
                        Total LWP Time (sec.): 3.337
                       Average number of LWPs: 1.000

The -fast command-line option instructs the Sun Studio C compiler to generate the fastest possible code for the same architecture where the compilation happens. The last two commands actually run the generated executable and report runtime statistics for the function main. Now we know that whatever we do to the source code, the result shouldn't be slower than 3.336 seconds. With that in mind, let's try asking the compiler to do its best not only at identifying parallelism (-xloopinfo), but at expressing it as well (-xautopar):

$ cc -fast -xloopinfo -xautopar prime.c -o prime
"prime.c", line 14: not parallelized, loop has multiple exits
"prime.c", line 14: not parallelized, loop has multiple exits 
                    (inlined loop)
"prime.c", line 31: PARALLELIZED, and serial version generated
"prime.c", line 35: not parallelized, unsafe dependence (total)

So, with only two extra command-line options, the compiler was smart enough to parallelize the loop on line 31 (-xautopar) and honest enough to explain why two other loops (lines 14 and 35) cannot be parallelized easily (-xloopinfo). That's pretty impressive, but let's see whether it got us any speedup at all:

$ export OMP_NUM_THREADS=4
$ collect ./prime 2000000
Creating experiment database ...
Number of prime numbers between 2 and 2000000: 148933
$  er_print -statistics | grep Duration
                               Duration (sec.): 3.331

Good. It isn't slower (although not significantly faster either), but then again, we didn't have to do anything with the source code. The compiler did everything for us (except letting the runtime system use all the way up to four threads by setting the OMP_NUM_THREADS environment variable to four). Or did it? What about that loop on line 35? It doesn't look any more complicated than the one on line 31. Seems like the compiler is being overly conservative, and we need to step in and help it a bit. This time, let's express parallelism with OpenMP.

The formal (and boring) definition of OpenMP states that it is “an API that supports multiplatform shared-memory parallel programming in C/C++ and Fortran on all architectures, including UNIX platforms and Windows NT platforms”. Personally, I'd like to think about OpenMP as a method of helping the compiler exploit data parallelism in your application when data dependencies get out of hand. In short, OpenMP is something you use when -xautopar complains. Given that, for C and C++, OpenMP is expressed through the #pragmas, it is quite safe to add these hints (although making sure that suggested parallel operations don't have concurrency problems is still your responsibility). As with any #pragma, if the compiler doesn't understand it, it'll skip over it. (At the time of this writing, the following freely available Linux compilers support OpenMP 2.5: Intel Compilers, GCC 4.2 and Sun Studio 12.)

So, how do we use OpenMP to boost the compiler's confidence in the loop on line 35? Simply add the following pragma to line 34:

34 #pragma omp parallel for
35 for (i = 2; i < N; i++) { /* let the testing begin! */
36     if ( is_prime(i) ) {