Roman's Law and Fast Processing with Multiple CPU Cores
Computers are slow as molasses on a cold day, and they've been like that for years. The hardware industry is doing its work, but computers are not going to get any faster unless the software industry follows suit and does something about it. Processing speed is less about GHz and more about multiple cores these days, and software needs to adapt.
I will be so bold as to postulate my own law of the multicore trend for hardware: the number of cores on a chip would double every three years. It remains to be seen whether I'm going to be as accurate in this prediction as Gordon Moore happened to be, but it looks good so far. With Intel and AMD introducing quad-core systems and Sun pushing the envelope even further with its first hexadeca-core CPU named Rock, the question the software industry has to ask itself is “Are we ready to make all these execution threads do useful work for us?” My strong opinion is that we are not ready. A new paradigm is yet to be invented, and we don't really know what it should look like. What we do know, however, in the words of Herb Sutter, is “The biggest sea change in software development since the OO revolution is knocking at the door, and its name is Concurrency.”
The first layer of software where hardware parallelism is going to percolate is the kernel of your operating system. And, I know of only two kernels that have what it takes to tackle the challenge: Solaris and Linux. If there's any Mac OS X—or dare I say, Windows—fanboys out there, I have two phrases for you: “256 processor systems” and “Completely Fair Scheduler”. Now, having a kernel that is capable of efficiently multiplexing a large number of processes to a large number of hardware threads is only as good as your demand in actually having that large number of individual processes running in parallel. And, as much as it is an ISP or provisioning person's dream come true, on my laptop I rarely have more than four really active processes running at the same time. The real need that I have is for each individual application to be able to take advantage of the underlying hardware parallelism. And, that's how a fundamental hardware concept permeates yet another software layer—an application one. At the end of the day we, as userland software developers, have no other choice but to embrace the parallel view of the world fully. Those pesky hardware people left us no choice whatsoever.
For the rest of this article, I assume that you have at least a dual-core system running Linux kernel 2.6.x and that you have Sun Studio Express installed in /opt/sun/sunstudio and added to your PATH, as follows:
PATH=/opt/sun/sunstudio12/bin:$PATH
My goal is to explain the kind of practical steps you can take to teach that old serial code of yours a few multicore tricks.
There are three basic steps to iterate through to add parallelism gradually to your serial application:
Identify parallelism.
Express parallelism.
Measure and observe.
And, even though the first two steps sound like the most exciting ones, I cannot stress enough the importance of step 3. Parallel programming is difficult and riddled with unexpected performance gotchas. And, there is no other way of being certain that your code got faster than proving it with numbers. The good news is that it isn't all that difficult. If you happen to develop mostly for Intel architectures, you can use code similar to the FFMPEG's START_TIMER/STOP_TIMER for microbenchmarking:
START_TIMER
do_interesting_stuff();
STOP_TIMER("do_interesting_stuff_tag")
Additionally, there's the Sun Studio Performance analyzer for observing macro behaviour. You also can use tools like Intel's VTune or even time(1), but whatever you do, make sure that performance regression testing is as much a part of your testing routine as the regular regression testing is. You do regression testing, don't you?
Identifying parallelism in an existing application usually starts with finding spots with data parallel characteristics, task parallel characteristics and figuring out a scheduling model to tie the two together. Data parallelism usually can be found in applications working with large sets of global or static data (think audio, video and image processing, gaming engines and rendering software). Task parallelism, on the other hand, mostly is appropriate when branch-and-bound computation takes place (think Chess-solvers when a bunch of tasks are asked to do similar calculations, but if one finds a solution, there's no need to wait for others).
Once you've identified all potential sources of the parallelism in your application, you have to decide what programming techniques to use for expressing it. For an application written in C or C++, the most commonly used one happens to be explicit parallelization with POSIX threads. This method has been around for decades, and most developers usually have some familiarity with it. On the other hand, given its inherent complexity and the fact that it no longer is the only game in town, I'm going to skip over it.
Let's look at this sample code, which happens to be a very simple routine for calculating how many prime numbers there are between 2 and N:
1 #include <stdio.h>
2 #include <stdlib.h>
3 #include <math.h>
4 #include <omp.h>
5
6 /* pflag[v] == 1 if and only if v is a prime number */
7 char *pflag;
8
9 int is_prime(int v)
10 {
11 int i;
12 int bound = floor(sqrt(v)) + 1;
13
14 for (i = 2; i < bound; i++) {
15 if (v % i == 0) {
16 pflag[v] = 0;
17 return 0; /* v is NOT a prime number */
18 }
19 }
20 return 1; /* v is a prime number */
21 }
22
23 int main(int argc, char **argv)
24 {
25 int i;
26 int total = 0;
27 int N = atoi(argv[1]);
28 int primes[N]; /* array of prime numbers */
29 pflag=(char*)alloca(N);
30
31 for (i = 2; i < N; i++) {
32 pflag[i] = 1; /* all numbers are prime until... */
33 } /* ...proven otherwise */
34
35 for (i = 2; i < N; i++) { /* let the testing begin! */
36 if ( is_prime(i) ) {
37 primes[total] = i;
38 total++;
39 }
40 }
41
42 printf("Number of prime numbers between 2 and %d: %d\n",
43 N, total);
44
45 return 0;
46 }
Granted, the code is silly (some might even say brain-dead), but let's pretend it is a real-life application. In that case, we certainly would benefit from as much automation as possible. And, if you think about it, there's no tool better suited for helping us than a compiler—after all, it already takes care of understanding the semantics of the code in order to perform optimizations. Ideally, what we would need is a compiler that talks back to us, helping us understand the source code better and make reasonable tweaks based on that information. Here's how Sun Studio 12 lets you do that:
$ cc -g -fast prime.c -o prime
$ er_src prime
.....
Source loop below has tag L3
35. for (i = 2; i < N; i++) { /* let the testing begin! */
Function is_prime inlined from source file prime.c into the code
for the following line. 1 loops inlined
Loop in function is_prime, line 14 has tag L4
36. if ( is_prime(i) ) {
Finally! Your compiler actually explains to you in plain human language, what transformations it applied to the source code to make it faster (-fast). Not only that, but it also identifies and tags all key areas (such as loops) that you later can navigate and inspect for the parallelization potential. Identifying parallelism just got easier. But what about expressing parallelism? Would it be completely out of the question to delegate some of that to the compiler as well? After all, we are too lazy to use POSIX threads (besides, they are like the GOTOs of parallel programming, anyway). The good news is that with the right compiler it is possible. But, before we go there, let's remember the third step from our “three-step parallelization program” and establish a performance baseline:
$ cc -fast prime.c -o prime
$ collect ./prime 2000000
Creating experiment database test.1.er ...
Number of prime numbers between 2 and 2000000: 148933
$ er_print -statistics test.1.er
Execution for entire program
Start Label: Total
End Label: Total
Start Time (sec.): 0.028
End Time (sec.): 3.364
Duration (sec.): 3.336
Total LWP Time (sec.): 3.337
Average number of LWPs: 1.000
....................................................
The -fast command-line option instructs the Sun Studio C compiler to generate the fastest possible code for the same architecture where the compilation happens. The last two commands actually run the generated executable and report runtime statistics for the function main. Now we know that whatever we do to the source code, the result shouldn't be slower than 3.336 seconds. With that in mind, let's try asking the compiler to do its best not only at identifying parallelism (-xloopinfo), but at expressing it as well (-xautopar):
$ cc -fast -xloopinfo -xautopar prime.c -o prime
"prime.c", line 14: not parallelized, loop has multiple exits
"prime.c", line 14: not parallelized, loop has multiple exits
(inlined loop)
"prime.c", line 31: PARALLELIZED, and serial version generated
"prime.c", line 35: not parallelized, unsafe dependence (total)
So, with only two extra command-line options, the compiler was smart enough to parallelize the loop on line 31 (-xautopar) and honest enough to explain why two other loops (lines 14 and 35) cannot be parallelized easily (-xloopinfo). That's pretty impressive, but let's see whether it got us any speedup at all:
$ export OMP_NUM_THREADS=4
$ collect ./prime 2000000
Creating experiment database test.2.er ...
Number of prime numbers between 2 and 2000000: 148933
$ er_print -statistics test.2.er | grep Duration
Duration (sec.): 3.331
Good. It isn't slower (although not significantly faster either), but then again, we didn't have to do anything with the source code. The compiler did everything for us (except letting the runtime system use all the way up to four threads by setting the OMP_NUM_THREADS environment variable to four). Or did it? What about that loop on line 35? It doesn't look any more complicated than the one on line 31. Seems like the compiler is being overly conservative, and we need to step in and help it a bit. This time, let's express parallelism with OpenMP.
The formal (and boring) definition of OpenMP states that it is “an API that supports multiplatform shared-memory parallel programming in C/C++ and Fortran on all architectures, including UNIX platforms and Windows NT platforms”. Personally, I'd like to think about OpenMP as a method of helping the compiler exploit data parallelism in your application when data dependencies get out of hand. In short, OpenMP is something you use when -xautopar complains. Given that, for C and C++, OpenMP is expressed through the #pragmas, it is quite safe to add these hints (although making sure that suggested parallel operations don't have concurrency problems is still your responsibility). As with any #pragma, if the compiler doesn't understand it, it'll skip over it. (At the time of this writing, the following freely available Linux compilers support OpenMP 2.5: Intel Compilers, GCC 4.2 and Sun Studio 12.)
So, how do we use OpenMP to boost the compiler's confidence in the loop on line 35? Simply add the following pragma to line 34:
34 #pragma omp parallel for
35 for (i = 2; i < N; i++) { /* let the testing begin! */
36 if ( is_prime(i) ) {
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.
Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.
Sponsored by ActiveState
| Non-Linux FOSS: libnotify, OS X Style | Jun 18, 2013 |
| Containers—Not Virtual Machines—Are the Future Cloud | Jun 17, 2013 |
| Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer | Jun 12, 2013 |
| Weechat, Irssi's Little Brother | Jun 11, 2013 |
| One Tail Just Isn't Enough | Jun 07, 2013 |
| Introduction to MapReduce with Hadoop on Linux | Jun 05, 2013 |
- Containers—Not Virtual Machines—Are the Future Cloud
- Non-Linux FOSS: libnotify, OS X Style
- Linux Systems Administrator
- Validate an E-Mail Address with PHP, the Right Way
- Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer
- Senior Perl Developer
- Technical Support Rep
- UX Designer
- Introduction to MapReduce with Hadoop on Linux
- RSS Feeds






2 hours 5 min ago
3 hours 31 min ago
7 hours 42 min ago
8 hours 27 min ago
8 hours 37 min ago
8 hours 42 min ago
10 hours 52 min ago
10 hours 53 min ago
11 hours 38 min ago
12 hours 27 min ago