Inside the Intel Compiler
First, we will look at static profiling. Consider the following code fragment:
g();
for (i=0; i<10; i++) {
    g();
}
Obviously, the call inside the loop executes ten times more often than the call outside the loop. In many cases, however, there is no way to make a good estimate. In the following code:
for (i=0; i<10; i++) {
    if (condition) {
        g();
    } else {
        h();
    }
}
it is difficult to say whether one branch is more likely to be
taken than the other. If h() happened to be exit() or some other
routine that was known not to return, it would be safe to assume
the then branch was more likely to be taken, and inlining g() may
be worthwhile. Without such information, however, the decision of
whether to inline one call or the other (or both) gets more
complicated. Another option is to use dynamic profiling.
Dynamic profiling gathers information from actual executions of a program. This allows the compiler to take advantage of the way a program actually runs in order to optimize it. In a three-step process, the application is first built with profiling instrumentation embedded in it. Then the resulting application is run with a representative sample (or samples) of data, which yields a database for the compiler to use in a subsequent build of the application. Finally, the information in this database is used to guide optimizations such as code placement or grouping frequently executed basic blocks together, function or partial inlining and register allocation. Register allocation in the Intel compiler is based on graph fusion (see Resource 5), which breaks the code into regions. These regions are typically loop bodies or other cohesive units. With profile information, the regions can be selected more effectively and are based on the actual frequency of the blocks instead of syntactic guesses. This allows spills to be pushed into less frequently executed parts of the program.
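To make the three steps concrete, here is a minimal hand-written sketch of what instrumentation amounts to for the if/else loop shown earlier. It is not the Intel compiler's mechanism: the counter array profile_count, the site names and the helper functions are all hypothetical, and a real compiler inserts the counters and reads the database itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical profile "database": one execution counter per call site. */
enum { SITE_G, SITE_H, SITE_COUNT };
unsigned long profile_count[SITE_COUNT];

static void g(void) { /* placeholder for real work */ }
static void h(void) { /* placeholder for real work */ }

/* Step 1: the instrumented build. Each branch bumps its own counter. */
void run_instrumented(const int *conditions, int n)
{
    assert(conditions != NULL || n == 0);
    for (int i = 0; i < n; i++) {
        if (conditions[i]) {
            profile_count[SITE_G]++;
            g();
        } else {
            profile_count[SITE_H]++;
            h();
        }
    }
}

/* Step 2: after a representative run, write the counters out. */
void dump_profile(FILE *out)
{
    fprintf(out, "g %lu\nh %lu\n",
            profile_count[SITE_G], profile_count[SITE_H]);
}

/* Step 3: a later build can ask which call site was hotter,
 * for example to decide which call to inline. */
int hottest_site(void)
{
    return profile_count[SITE_G] >= profile_count[SITE_H] ? SITE_G : SITE_H;
}
```

Running run_instrumented() on sample data in which the condition holds most of the time makes hottest_site() return SITE_G, which is the kind of evidence the static estimate could not provide.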
Exploiting parallelism is an important way to increase application performance in modern architectures. The Intel compiler can be key in the effort to exploit potential parallelism in a program by facilitating such optimizations as automatic vectorization, automatic parallelization and support for OpenMP directives. Let's look at the automatic conversion of serial loops into a form that takes advantage of the instructions provided by the Intel MMX technology or SSE/SSE2 (Streaming SIMD Extensions), a process we refer to as “intra-register vectorization” (see Resource 1). For example, given the function:
void vecadd(float a[], float b[], float c[], int n)
{
    int i;
    for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
the Intel compiler will transform the loop to allow four single-precision floating-point additions to occur simultaneously using the addps instruction. Simply put, using a pseudo-vector notation, the result would look something like this:
for (i = 0; i < n; i+=4) {
    c[i:i+3] = a[i:i+3] + b[i:i+3];
}
A scalar cleanup loop would follow to execute the remaining
iterations if the trip count n is not exactly divisible by four.
Several steps are involved in this process. First, because it is
possible that no information exists about the base addresses of the
arrays, runtime code must be inserted to ensure that the arrays do
not overlap (dynamic dependence testing) and that the bulk of the
loop runs with each vector iteration having addresses aligned along
16-byte boundaries (dynamic loop peeling for alignment). In order
to vectorize efficiently, only loops of sufficient size are
vectorized. If the number of iterations is too small, a simple
serial loop is used instead. Besides simple loops, the vectorizer
also supports loops with reductions (such as summing an array of
numbers or searching for the max or min in an array), conditional
constructs, saturation arithmetic and other idioms. Even the
vectorization of loops with trigonometric mathematical functions is
supported by means of a vector math library.
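The runtime steps described above can be sketched in plain C. This is only an illustration of the control flow, not the code the Intel compiler generates: the names vecadd_checked, disjoint and MIN_VECTOR_TRIP are hypothetical, the threshold of 8 is an arbitrary assumption, and the four scalar additions in the main loop stand in for a single addps instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed threshold below which the serial loop is used (hypothetical). */
enum { MIN_VECTOR_TRIP = 8 };

/* Nonzero if the n-element float arrays at p and q share no bytes. */
static int disjoint(const float *p, const float *q, size_t n)
{
    uintptr_t ps = (uintptr_t)p, qs = (uintptr_t)q;
    uintptr_t bytes = (uintptr_t)(n * sizeof(float));
    return ps + bytes <= qs || qs + bytes <= ps;
}

void vecadd_checked(const float *a, const float *b, float *c, int n)
{
    assert(n >= 0);
    int i = 0;
    size_t len = (size_t)n;

    /* Dynamic dependence test and trip-count check: fall back to the
     * plain serial loop if c may overlap an input or n is too small. */
    if (n < MIN_VECTOR_TRIP || !disjoint(a, c, len) || !disjoint(b, c, len)) {
        for (; i < n; i++)
            c[i] = a[i] + b[i];
        return;
    }

    /* Dynamic loop peeling: scalar iterations until &c[i] sits on a
     * 16-byte boundary. */
    while (i < n && ((uintptr_t)&c[i] % 16) != 0) {
        c[i] = a[i] + b[i];
        i++;
    }

    /* Main loop: four elements per iteration, standing in for addps. */
    for (; i + 3 < n; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }

    /* Scalar cleanup loop for the remaining zero to three iterations. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

The overlap test here is deliberately conservative: a call such as vecadd_checked(a, b, a, n) simply takes the serial path, so the result is the same as for the original loop.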
To give a taste of a realistic performance improvement that can be obtained by intra-register vectorization, we report some performance numbers for the double-precision version of the Linpack benchmark (available in both Fortran and C at www.netlib.org/benchmark). This benchmark reports the performance of a linear equation solver that uses the routines DGEFA and DGESL for the factorization and solve phases, respectively. Most of the runtime of this benchmark results from repeatedly calling the Level 1 BLAS routine DAXPY for different subcolumns of the coefficient matrix during factorization. Under generic optimizations (switch -O2), this benchmark reports 1,049 MFLOPS for solving a 100×100 system on a 2.66 GHz Pentium 4 processor. When intra-register vectorization for the Pentium 4 processor is enabled (switch -xW), the performance goes up to 1,292 MFLOPS, an improvement of about 23%.
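For readers unfamiliar with DAXPY, the kernel that dominates the benchmark computes y = alpha*x + y. The following is a simplified unit-stride sketch, not the netlib source: the name daxpy_unit is hypothetical, and the stride arguments of the real BLAS routine are left out.

```c
#include <assert.h>

/* y := alpha * x + y over n elements, unit stride only (simplified DAXPY). */
void daxpy_unit(int n, double alpha, const double *x, double *y)
{
    assert(n >= 0);
    if (alpha == 0.0)
        return;                 /* nothing to add */
    for (int i = 0; i < n; i++) {
        y[i] += alpha * x[i];   /* one multiply and one add per element */
    }
}
```

Each iteration is independent of the others, which is what makes this loop a natural candidate for intra-register vectorization: with SSE2, two double-precision elements fit in one 128-bit register.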
Comments
Re: Inside the Intel Compiler
I have tried both gcc and icc 7.0 on cache-intensive code. Also examined the intermediate assembly code. Same code, same performance (better comments for icc), provided that you compile (under gcc) for the right processor type. Default processor is 386 (!!!) for some distributions (e.g., Mandrake), pentium for others (e.g., RedHat). Be careful, the performance advantage can be up to 40%.
Of course, no OpenMP support for gcc. However, when the Intel people dare to make measurements with hyperthreading enabled (please read their papers carefully), I will convince myself that it MIGHT be useful.. :)
Re: Inside the Intel Compiler
Some of the optimizations shown were just part of good programming practice anyway.
Only extreme newbies would program as shown before optimization.
Re: Inside the Intel Compiler
dare to say the current gcc has most of this stuff already implemented.
Re: Inside the Intel Compiler
When did GCC get vectorization and OpenMP support?
Re: Inside the Intel Compiler
3.2 has pretty good vectorization support, not the best but pretty good
Re: Inside the Intel Compiler
Take a look at:
http://gcc.gnu.org/projects/tree-ssa/vectorization.html
3.2 had nothing
Re: Inside the Intel Compiler
"dare to say the current gcc has most of this stuff already implemented."
Not true, although you'll find some things that work better in GCC. The Intel compiler is specifically optimized for IA, while gcc has to run on a lot of different architectures. Your mileage will vary depending on what you're doing.
GCC vs. the Intel Compiler definitely falls into the category of "use the right tool for the right job." Of course, the proprietary nature of the Intel tool will be an obstacle for some, but you can definitely get some performance benefits from using a compiler that is specifically optimized for the architecture.
Re: Inside the Intel Compiler
well, this might be true, but it is still much slower in most benchmarks.
Re: Inside the Intel Compiler
hmmm, is it possible to create a Linux distribution with the Intel Compiler!!??
Re: Inside the Intel Compiler
Well, in other articles people say it can't compile many things that are designed to be compiled with gcc (many linux packages) so I don't think it's feasible
Re: Inside the Intel Compiler
AFAIK, no :(
Re: Inside the Intel Compiler
AFAIK, Gentoo will be, maybe...
Re: Inside the Intel Compiler
Nice article. I hadn't heard anything about OpenMP until now, or CPU dispatch.
One thing I'm curious about though is what stuff the compiler team uses to develop with.. e.g. What language(s) are the C/C++ and Fortran compilers implemented in?
Some benchmarks I've seen would suggest they share a common back-end, but I wonder if the compiler itself is written in C, C++, Fortran or maybe a lisp dialect or some functional language..
The Intel Compiler is
The Intel Compiler is written in pure C.
Re: Inside the Intel Compiler
Umm I think they use a mix of PERL and OCaml.
They do the OpenMP stuff in functional Miranda.
lol
lol
functional miranda
hahahah
Re: Inside the Intel Compiler
How do you know?