Inside the Intel Compiler
First, we will look at static profiling. Consider the following code fragment:
g();
for (i=0; i<10; i++) {
    g();
}
Obviously, the call inside the loop executes ten times more often than the call outside the loop. In many cases, however, there is no way to make a good estimate. In the following code:
for (i=0; i<10; i++) {
    if (condition) {
        g();
    } else {
        h();
    }
}
it is difficult to say whether one branch is more likely to be
taken than the other. If h() happened to be an exit routine or some
other routine known not to return, it would be safe to assume that
the then branch is taken more often, and inlining g() might be
worthwhile. Without such information, however, the decision of
whether to inline one call or the other (or both) becomes more
complicated. Another option is to use dynamic profiling.
Dynamic profiling gathers information from actual executions of a program, which lets the compiler optimize for the way the program really runs. It is a three-step process. First, the application is built with profiling instrumentation embedded in it. Next, the instrumented application is run with a representative sample (or samples) of data, which yields a database for the compiler to use in a subsequent build of the application. Finally, the information in this database guides optimizations such as code placement (grouping frequently executed basic blocks together), function or partial inlining, and register allocation. Register allocation in the Intel compiler is based on graph fusion (see Resource 5), which breaks the code into regions. These regions are typically loop bodies or other cohesive units. With profile information, the regions can be selected more effectively, based on the measured execution frequency of the blocks instead of syntactic guesses. This allows spills to be pushed into less frequently executed parts of the program.
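To make the idea of profiling instrumentation concrete, here is a minimal hand-written sketch in C. It is not how the Intel compiler instruments code; the counters count_then and count_else, the function run_instrumented() and the helper then_branch_is_hot() are hypothetical names invented for this illustration. The sketch adds one counter per branch to the if/else loop shown earlier and then asks which branch was hotter, which is the kind of fact a profile database would record.

```c
#include <assert.h>

/* Hypothetical counters standing in for compiler-inserted instrumentation. */
unsigned long count_then = 0;
unsigned long count_else = 0;

void g(void) { /* work done on the then path */ }
void h(void) { /* work done on the else path */ }

/* The if/else loop from the static-profiling example, with one counter
 * bumped on each branch. conditions[i] plays the role of "condition". */
void run_instrumented(const int *conditions, int n)
{
    int i;
    assert(n >= 0);
    for (i = 0; i < n; i++) {
        if (conditions[i]) {
            count_then++;
            g();
        } else {
            count_else++;
            h();
        }
    }
}

/* Returns 1 when the recorded profile says the then branch ran more often,
 * which would make g() the better inlining candidate. */
int then_branch_is_hot(void)
{
    return count_then > count_else;
}
```

Running run_instrumented() on a representative input and then reading the counters corresponds to the second step of the process; a later build could consult then_branch_is_hot() style facts when deciding what to inline or where to place code.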
Exploiting parallelism is an important way to increase application performance on modern architectures. The Intel compiler helps exploit the potential parallelism in a program through automatic vectorization, automatic parallelization and support for OpenMP directives. Let's look at the automatic conversion of serial loops into a form that takes advantage of the instructions provided by Intel MMX technology or SSE/SSE2 (Streaming SIMD Extensions), a process we refer to as “intra-register vectorization” (see Resource 1). For example, given the function:
void vecadd(float a[], float b[], float c[], int n)
{
    int i;
    for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
the Intel compiler will transform the loop to allow four single-precision floating-point additions to occur simultaneously using the addps instruction. Simply put, using a pseudo-vector notation, the result would look something like this:
for (i = 0; i < n; i+=4) {
    c[i:i+3] = a[i:i+3] + b[i:i+3];
}
A scalar cleanup loop would follow to execute the remaining
iterations if the trip count n is not exactly divisible by four.
Several steps are involved in this process. First, because it is
possible that no information exists about the base addresses of the
arrays, runtime code must be inserted to ensure that the arrays do
not overlap (dynamic dependence testing) and that the bulk of the
loop runs with each vector iteration having addresses aligned along
16-byte boundaries (dynamic loop peeling for alignment). Second,
because vectorization pays off only when there is enough work, only
loops with a sufficient number of iterations are vectorized. If the
number of iterations is too small, a simple serial loop is used
instead. Besides simple loops, the vectorizer also supports loops
with reductions (such as summing an array of numbers or searching
for the max or min in an array), conditional constructs, saturation
arithmetic and other idioms. Even the vectorization of loops with
trigonometric mathematical functions is supported by means of a
vector math library.
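The steps described above can be sketched by hand with SSE intrinsics. This is only an illustration of the shape of the transformed loop, not the code the Intel compiler actually emits. The names vecadd_sketch() and no_overlap() and the threshold VEC_MIN_TRIP are assumptions made for this example; the intrinsics _mm_loadu_ps, _mm_add_ps and _mm_store_ps come from the standard xmmintrin.h header. When SSE is not available, the sketch falls back to four scalar additions per step.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* _mm_loadu_ps, _mm_add_ps, _mm_store_ps */
#endif

/* Hypothetical threshold: shorter loops simply run serially. */
#define VEC_MIN_TRIP 8

/* Returns 1 if the n-element ranges starting at p and q share no bytes. */
static int no_overlap(const float *p, const float *q, int n)
{
    uintptr_t lo_p = (uintptr_t)p;
    uintptr_t lo_q = (uintptr_t)q;
    uintptr_t bytes = (uintptr_t)n * sizeof(float);
    return lo_p + bytes <= lo_q || lo_q + bytes <= lo_p;
}

void vecadd_sketch(const float *a, const float *b, float *c, int n)
{
    int i = 0;
    assert(n >= 0);

    /* Dynamic dependence test plus trip-count check. The overlap test is
     * conservative: any overlap with c selects the serial loop below. */
    if (n >= VEC_MIN_TRIP && no_overlap(a, c, n) && no_overlap(b, c, n)) {
        /* Dynamic loop peeling: scalar iterations until c + i is
         * aligned on a 16-byte boundary. */
        while (i < n && ((uintptr_t)(c + i) & 15u) != 0) {
            c[i] = a[i] + b[i];
            i++;
        }
        /* Main vector loop: four single-precision additions per step. */
        for (; i + 4 <= n; i += 4) {
#if defined(__SSE__)
            __m128 va = _mm_loadu_ps(a + i);   /* unaligned loads */
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_store_ps(c + i, _mm_add_ps(va, vb));   /* aligned store */
#else
            int k;
            for (k = 0; k < 4; k++) {
                c[i + k] = a[i + k] + b[i + k];
            }
#endif
        }
    }

    /* Scalar cleanup loop; also the serial fallback when the checks fail. */
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

Only the destination array is peeled for alignment here, so the loads from a and b use the unaligned form. Whether a real vectorizer aligns the loads or the stores is a design choice this sketch does not try to reproduce.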
To give a taste of the realistic performance improvement that can be obtained by intra-register vectorization, we report some performance numbers for the double-precision version of the Linpack benchmark (available in both Fortran and C at www.netlib.org/benchmark). This benchmark reports the performance of a linear equation solver that uses the routines DGEFA and DGESL for the factorization and solve phases, respectively. Most of the runtime of this benchmark results from repeatedly calling the Level 1 BLAS routine DAXPY for different subcolumns of the coefficient matrix during factorization. Under generic optimizations (switch -O2), this benchmark reports 1,049 MFLOPS for solving a 100×100 system on a 2.66GHz Pentium 4 processor. When intra-register vectorization for the Pentium 4 processor is enabled (switch -xW), the performance goes up to 1,292 MFLOPS, an improvement of about 23%.
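For readers who have not seen it, DAXPY computes y = alpha*x + y on double-precision vectors. The following is a minimal unit-stride sketch in C, not the netlib source: the name daxpy_unit() is made up for this example, and the real BLAS routine also takes stride arguments for x and y. A loop of this shape is a natural candidate for intra-register vectorization, because a 128-bit SSE2 register holds two doubles.

```c
#include <assert.h>

/* y[i] = y[i] + alpha * x[i] for i in [0, n), unit stride only.
 * Hypothetical simplified form of the Level 1 BLAS routine DAXPY. */
void daxpy_unit(int n, double alpha, const double *x, double *y)
{
    int i;
    assert(n >= 0);
    if (alpha == 0.0) {
        return;   /* nothing to add */
    }
    for (i = 0; i < n; i++) {
        y[i] += alpha * x[i];
    }
}
```

During factorization, a call of this kind subtracts a multiple of one subcolumn from another, which is why so much of the Linpack runtime ends up in this one loop.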
Comments
Re: Inside the Intel Compiler
I have tried both gcc and icc 7.0 on cache-intensive code. Also examined the intermediate assembly code. Same code, same performance (better comments for icc), provided that you compile (under gcc) for the right processor type. Default processor is 386 (!!!) for some distributions (e.g., Mandrake), pentium for others (e.g., RedHat). Be careful, the performance advantage can be up to 40%.
Of course, no OpenMP support for gcc. However, when Intel people will dare to make measurements with hyperthreading enabled (please read their papers carefully), I will convince myself that it MIGHT be useful.. :)
Re: Inside the Intel Compiler
Some of the optimizations shown were just part of good programming practice anyway.
Only extreme newbies would program as shown before optimization.
Re: Inside the Intel Compiler
dare to say the current gcc has most of this stuff already implemented.
Re: Inside the Intel Compiler
When did GCC get vectorization and OpenMP support?
Re: Inside the Intel Compiler
3.2 has pretty good vectorization support, not the best but pretty good
Re: Inside the Intel Compiler
Take a look at:
http://gcc.gnu.org/projects/tree-ssa/vectorization.html
3.2 had nothing
Re: Inside the Intel Compiler
"dare to say the current gcc has most of this stuff already implemented."
Not true, although you'll find some things that work better in GCC. The Intel compiler is specifically optimized for IA, while gcc has to run on a lot of different architectures. Your mileage will vary depending on what you're doing.
GCC vs. the Intel Compiler definitely falls into the category of "use the right tool for the right job." Of course, the proprietary nature of the Intel tool will be an obstacle for some, but you can definitely get some performance benefits from using a compiler that is specifically optimized for the architecture.
Re: Inside the Intel Compiler
well, this might be true, but it is still much slower in most benchmarks.
Re: Inside the Intel Compiler
hmmm, is it possible to create a Linux distribution with the Intel Compiler!!??
Re: Inside the Intel Compiler
Well, in other articles people say it can't compile many things that are designed to be compiled with gcc (many linux packages) so I don't think it's feasible
Re: Inside the Intel Compiler
AFAIK, no :(
Re: Inside the Intel Compiler
AFAIK, Gentoo will be, maybe...
Re: Inside the Intel Compiler
Nice article. I hadn't heard anything about OpenMP until now, or CPU dispatch.
One thing I'm curious about though is what stuff the compiler team uses to develop with.. e.g. What language(s) are the C/C++ and Fortran compilers implemented in?
Some benchmarks I've seen would suggest they share a common back-end, but I wonder if the compiler itself is written in C, C++, Fortran or maybe a lisp dialect or some functional language..
The Intel Compiler is
The Intel Compiler is written in pure C.
Re: Inside the Intel Compiler
Umm I think they use a mix of PERL and OCaml.
They do the OpenMP stuff in functional Miranda.
lol
lol
functional miranda
hahahah
Re: Inside the Intel Compiler
How do you know?