Inside the Intel Compiler
The OpenMP specification for C/C++ and Fortran (www.openmp.org) has recently emerged as the de facto standard for shared-memory parallel programming. It allows the user to specify parallelism through compiler directives, without getting involved in the details of iteration partitioning, data sharing, thread scheduling and synchronization. Based on these directives, the Intel compiler transforms the program into multithreaded code automatically. The Intel compiler supports the OpenMP C++ 2.0 and OpenMP Fortran 2.0 standard directives for explicit parallelization. Applications can use these directives to increase performance on multiprocessor systems by exploiting both task and data parallelism.
The following is an example program, illustrating the use of OpenMP directives with the Intel C++ Linux OpenMP compiler:
#define N 10000
void ploop(void)
{
    int k, x[N], y[N], z[N];
    #pragma omp parallel for private(k) shared(x,y,z)
    for (k=0; k<N; k++) {
        x[k] = x[k] * y[k] + workunit(z[k]);
    }
}
The for loop will be executed in parallel by a team of threads that divide the iterations in the loop body amongst themselves. Variable k is marked private—each thread will have its own copy of k—while the arrays x, y and z are shared among the threads.
The resulting multithreaded code is illustrated below. The Intel compiler generates OpenMP runtime library calls for thread creation and management, as well as synchronization (see Resources 1 and 2):
#define N 10000
void ploop(void)
{
    int k, x[N], y[N], z[N];
    __kmpc_fork_call(loc,
                     3,
                     T-entry(_ploop_par_loop),
                     x, y, z);
    goto L1;
    T-entry _ploop_par_loop(loc, tid,
                            x[], y[], z[]) {
        lower_k = 0;
        upper_k = N - 1;   /* inclusive bound: the original loop runs while k < N */
        __kmpc_for_static_init(loc, tid, STATIC,
                               &lower_k,
                               &upper_k, ...);
        for (local_k=lower_k; local_k<=upper_k;
             local_k++) {
            x[local_k] = x[local_k] * y[local_k]
                         + workunit(z[local_k]);
        }
        __kmpc_for_static_fini(loc, tid);
        T-return;
    }
    L1: return;
}
The multithreaded code generator inserts the thread invocation call __kmpc_fork_call with the T-entry point and data environment (for example, thread id tid) for each loop. This call into the Intel OpenMP runtime library forks a number of threads that execute the iterations of the loop in parallel.
The serial loops annotated with the OpenMP directive are converted to multithreaded code by localizing the lower and upper loop bounds and by privatizing the iteration variable. Finally, multithreading runtime initialization and synchronization code is generated for each T-region defined by a [T-entry, T-return] pair. The call __kmpc_for_static_init computes the localized loop lower bound, upper bound and stride for each thread according to a scheduling policy. In this example, the generated code uses static scheduling. The library call __kmpc_for_static_fini informs the runtime system that the current thread has completed one loop chunk.
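As a rough sketch of the kind of computation a static schedule implies, the C function below splits an inclusive iteration range into one contiguous chunk per thread. The name static_chunk and its parameters are hypothetical; this is not the code inside __kmpc_for_static_init, and the real runtime may partition the iterations differently.

```c
#include <assert.h>

/* Hypothetical helper, not part of the Intel OpenMP runtime.
   Splits the inclusive range [lower, upper] into one contiguous chunk
   per thread and returns this thread's inclusive bounds. */
void static_chunk(int tid, int nthreads, int lower, int upper,
                  int *my_lower, int *my_upper)
{
    assert(nthreads > 0 && tid >= 0 && tid < nthreads);

    int total = upper - lower + 1;       /* number of iterations */
    if (total < 0)
        total = 0;                       /* empty range */

    int base  = total / nthreads;        /* every thread gets at least this many */
    int extra = total % nthreads;        /* the first `extra` threads get one more */

    int start = lower + tid * base + (tid < extra ? tid : extra);
    int count = base + (tid < extra ? 1 : 0);

    *my_lower = start;
    *my_upper = start + count - 1;       /* my_upper < my_lower means "no work" */
}
```

With three threads and iterations 0 through 9, this sketch gives thread 0 the range 0..3, thread 1 the range 4..6 and thread 2 the range 7..9. The bounds are inclusive, matching the local_k<=upper_k test in the listing above.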
Rather than performing source-to-source transformations, as is done in other compilers such as OpenMP NanosCompiler and OdinMP, the Intel compiler performs these transformations internally. This allows tight integration of the OpenMP implementation with other advanced, high-level compiler optimizations for improved uniprocessor performance such as vectorization and loop transformations.
Besides the compiler support for exploiting the OpenMP directive-guided explicit parallelism, users also can try auto-parallelization by using the option -parallel. Under this option, the compiler automatically analyzes the loops in the program to detect those that have no loop-carried dependency and can be executed in parallel profitably. The auto-parallelization phase in the compiler relies on the advanced memory disambiguation techniques for its analysis, as well as the profiling information for its heuristics in deciding when to parallelize.
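To make the notion of a loop-carried dependency concrete, here is a small illustrative pair of loops in C. The function names are invented for this sketch, and whether a given compiler actually parallelizes the first loop depends on its own analysis and profitability heuristics.

```c
#include <assert.h>
#include <stddef.h>

/* No loop-carried dependency: iteration i touches only element i.
   `restrict` (C99) promises that dst and src do not overlap, which is the
   kind of fact memory disambiguation tries to establish on its own. */
void scale(float *restrict dst, const float *restrict src,
           size_t n, float factor)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * factor;
}

/* Loop-carried dependency: iteration i reads a[i - 1], which iteration
   i - 1 has just written, so the iterations cannot simply run in parallel. */
void prefix_sum(float *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```

The first loop is a candidate for auto-parallelization because its iterations can run in any order. The second is not, at least not without restructuring the algorithm.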
One of the unique features of the Intel compiler is CPU-Dispatch, which allows the user to target a single object for multiple IA-32 architectures by means of either manual CPU-Dispatch or Auto-CPU-Dispatch. Manual CPU-Dispatch allows the user to write multiple versions of a single function. Each function either is assigned a specific IA-32 architecture platform or is considered generic, meaning it can run on any IA-32 architecture. The Intel compiler generates code that dynamically determines on which architecture the code is running and accordingly chooses the particular version of the function that will actually execute. This runtime determination allows programmers to take advantage of architecture-specific optimizations, such as SSE and SSE2, without sacrificing flexibility, allowing execution of the same binary on architectures that do not support newer instructions.
Auto-CPU-Dispatch is similar but with the added benefit that the compiler automatically generates multiple versions of a given function. During compilation, the compiler decides which routines will gain from architecture-specific optimizations. These routines are then automatically duplicated to produce architecture-specific optimized versions, as well as generic versions. The benefit of this feature is that it requires no rewriting by the programmer. A normal source file can take advantage of the Auto-CPU-Dispatch feature by the simple use of a command-line option. For example, given the function:
void init(float b[], double c[], int n)
{
    int i;
    for (i = 0; i < n; i++) {
        b[i] = (float)i;
    }
    for (i = 0; i < n; i++) {
        c[i] = (double)i;
    }
}
the Intel compiler can produce up to three versions of the function. A generic version of the function is generated that will run on any IA-32 processor. Another version would be tuned for the Pentium III processor by vectorizing the first loop with SSE instructions. A third version would be optimized for the Pentium 4 processor by vectorizing both loops to take advantage of SSE2 instructions.
The resulting function begins with dispatch code like this:
.L1:    testl   $-512, __intel_cpu_indicator
        jne     init.J
        testl   $-128, __intel_cpu_indicator
        jne     init.H
        testl   $-1, __intel_cpu_indicator
        jne     init.A
        call    __intel_cpu_indicator_init
        jmp     .L1
Here, init.A, init.H and init.J are the generic, SSE and SSE2 optimized versions, respectively.
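The control flow of that dispatch sequence can be sketched in plain C. Everything here is a stand-in chosen for illustration: cpu_indicator, cpu_indicator_init, last_version and the three init_* variants are hypothetical names, the variants all run the same generic code, and the stub detector simply reports a generic CPU instead of querying the hardware. Only the masks are taken from the listing: as 32-bit values, -512 has every bit set except the low nine and -128 has every bit set except the low seven.

```c
#include <assert.h>

enum version { RAN_NONE, RAN_GENERIC, RAN_SSE, RAN_SSE2 };

static unsigned cpu_indicator = 0;            /* 0 means "not yet detected" */
static enum version last_version = RAN_NONE;  /* records which variant ran */

/* Shared body: in a real build each variant would be compiled differently. */
static void fill(float b[], double c[], int n)
{
    for (int i = 0; i < n; i++) b[i] = (float)i;
    for (int i = 0; i < n; i++) c[i] = (double)i;
}

static void init_A(float b[], double c[], int n) { last_version = RAN_GENERIC; fill(b, c, n); }
static void init_H(float b[], double c[], int n) { last_version = RAN_SSE;     fill(b, c, n); }
static void init_J(float b[], double c[], int n) { last_version = RAN_SSE2;    fill(b, c, n); }

/* Stub detector (assumption): always reports a generic CPU. */
static void cpu_indicator_init(void)
{
    cpu_indicator = 1u;
    assert(cpu_indicator != 0u);   /* must be nonzero or the retry loop never ends */
}

void init(float b[], double c[], int n)
{
    for (;;) {                                                  /* .L1 */
        if (cpu_indicator & ~511u) { init_J(b, c, n); return; } /* testl $-512 */
        if (cpu_indicator & ~127u) { init_H(b, c, n); return; } /* testl $-128 */
        if (cpu_indicator != 0u)   { init_A(b, c, n); return; } /* testl $-1   */
        cpu_indicator_init();      /* first call only: detect, then retry */
    }
}
```

The tests run from the most capable version down to the generic one, and the detection routine runs only when the indicator is still zero, so after the first call the overhead is a few compare-and-branch instructions.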
Comments
Re: Inside the Intel Compiler
I have tried both gcc and icc 7.0 on cache-intensive code. Also examined the intermediate assembly code. Same code, same performance (better comments for icc), provided that you compile (under gcc) for the right processor type. Default processor is 386 (!!!) for some distributions (e.g., Mandrake), pentium for others (e.g., RedHat). Be careful, the performance advantage can be up to 40%.
Of course, no OpenMP support for gcc. However, when the Intel people dare to make measurements with hyperthreading enabled (please read their papers carefully), I will convince myself that it MIGHT be useful.. :)
Re: Inside the Intel Compiler
Some of the optimizations shown were just part of good programming practice anyway.
Only extreme newbies would program as shown before optimization.
Re: Inside the Intel Compiler
dare to say the current gcc has most of this stuff already implemented.
Re: Inside the Intel Compiler
When did GCC get vectorization and OpenMP support?
Re: Inside the Intel Compiler
3.2 has pretty good vectorization support, not the best but pretty good
Re: Inside the Intel Compiler
Take a look at:
http://gcc.gnu.org/projects/tree-ssa/vectorization.html
3.2 had nothing
Re: Inside the Intel Compiler
"dare to say the current gcc has most of this stuff already implemented."
Not true, although you'll find some things that work better in GCC. The Intel compiler is specifically optimized for IA, while gcc has to run on a lot of different architectures. Your mileage will vary depending on what you're doing.
GCC vs. the Intel Compiler definitely falls into the category of "use the right tool for the right job." Of course, the proprietary nature of the Intel tool will be an obstacle for some, but you can definitely get some performance benefits from using a compiler that is specifically optimized for the architecture.
Re: Inside the Intel Compiler
well, this might be true, but it is still much slower in most benchmarks.
Re: Inside the Intel Compiler
hmmm, is it possible to create a Linux distribution with the Intel Compiler!!??
Re: Inside the Intel Compiler
Well, in other articles people say it can't compile many things that are designed to be compiled with gcc (many linux packages) so I don't think it's feasible
Re: Inside the Intel Compiler
AFAIK, no :(
Re: Inside the Intel Compiler
AFAIK, Gentoo will be, maybe...
Re: Inside the Intel Compiler
Nice article. I hadn't heard anything about OpenMP until now, or CPU dispatch.
One thing I'm curious about though is what stuff the compiler team uses to develop with.. e.g. What language(s) are the C/C++ and Fortran compilers implemented in?
Some benchmarks I've seen would suggest they share a common back-end, but I wonder if the compiler itself is written in C, C++, Fortran or maybe a lisp dialect or some functional language..
The Intel Compiler is
The Intel Compiler is written in pure C.
Re: Inside the Intel Compiler
Umm I think they use a mix of PERL and OCaml.
They do the OpenMP stuff in functional Miranda.
lol
lol
functional miranda
hahahah
Re: Inside the Intel Compiler
How do you know?