Inside the Intel Compiler
The increasing acceptance of Linux among developers and researchers has yet to be matched by a similar increase in the number of available development tools. The recently released Intel C++ and Fortran compilers for Linux aim to bridge this gap by providing application developers with highly optimizing compilers for the Intel IA-32 and Itanium processor families. These compilers provide strict ANSI support, as well as optional support for some popular extensions. This article focuses on the optimizations and features of the compiler for the Intel IA-32 processors. Throughout the rest of this article, we refer to the Intel C++ and Fortran compilers for Linux on IA-32 collectively as “the Intel compiler”.
The Intel compiler optimizes a program at all levels, from high-level loop and interprocedural optimizations to standard compiler data flow optimizations, in addition to efficient low-level optimizations, such as instruction scheduling, basic block layout and register allocation. In this article, we mainly focus on compiler optimizations unique to the Intel compiler. For completeness, however, we also include a brief overview of some of the more traditional optimizations supported by the Intel compiler.
Decreasing the number of instructions that are dynamically executed and replacing instructions with faster equivalents are perhaps the two most obvious ways to improve performance. Many traditional compiler optimizations fall into this category: copy and constant propagation, redundant expression elimination, dead code elimination, peephole optimizations, function inlining, tail recursion elimination and so forth.
The Intel compiler provides a rich variety of both types of optimizations. Many local optimizations are based on the static-single-assignment (SSA) form. Redundant (or partially redundant) expressions, for example, are eliminated according to Chow's algorithm (see Resource 6), where an expression is considered redundant if it is unnecessarily calculated more than once on an execution path. For instance, in the statement:
x[i] += a[i+j*n] + b[i+j*n];
the expression i+j*n is redundant and needs to be calculated only once. Partial redundancy occurs when an expression is redundant on some paths but not necessarily all paths. In the code:
if (c) {
    x = y+a*b;
} else {
    x = a;
}
z = a*b;
the expression a*b is partially redundant. If the else branch is taken, a*b is only calculated once; but if the then branch is taken, it is calculated twice. The code can be modified as follows:
t = a*b;
if (c) {
    x = y+t;
} else {
    x = a;
}
z = t;
so there is only one calculation of a*b, no matter which path is taken.
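As a minimal sketch of the two transformations above, the following C functions write each example by hand in its original and its transformed form. The function names (update_naive, update_cse, branch_naive, branch_pre) and the pointer-based signatures are assumptions made for illustration only; the compiler performs these rewrites internally on its SSA form, not on source text.

```c
/* Original statement: i + j*n is computed once per array reference. */
void update_naive(int *x, const int *a, const int *b, int i, int j, int n)
{
    x[i] += a[i + j * n] + b[i + j * n];
}

/* After redundancy elimination: the index is computed only once. */
void update_cse(int *x, const int *a, const int *b, int i, int j, int n)
{
    int t = i + j * n;
    x[i] += a[t] + b[t];
}

/* Original branch example: a*b is computed twice when c is true. */
void branch_naive(int c, int y, int a, int b, int *x, int *z)
{
    if (c) {
        *x = y + a * b;
    } else {
        *x = a;
    }
    *z = a * b;
}

/* After partial redundancy elimination: a*b is computed once on every path. */
void branch_pre(int c, int y, int a, int b, int *x, int *z)
{
    int t = a * b;
    if (c) {
        *x = y + t;
    } else {
        *x = a;
    }
    *z = t;
}
```

Each pair of functions should produce identical results for any input; only the number of multiplications and additions executed differs.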
Clearly, this transformation must be used judiciously, because the additional temporary values, ideally stored in registers, have longer lifetimes and hence increase register pressure. An algorithm similar to Chow's algorithm (see Resource 9) is used to eliminate dead stores, in which a store is followed by another store to the same location before any fetch from it, and partially dead stores, which are dead along some but not necessarily all paths. Other optimizations based on the SSA form are constant propagation (see Resource 7) and the propagation of conditions. Consider the following example:
if (x>0) {
    if (y>0) {
        . . .
        if (x == 0) {
            . . .
        }
    }
}
Since x>0 holds within the outermost if, we know that x != 0 unless x is changed in between, and therefore the code within the innermost if is dead. Although this and the previous example may seem contrived, such situations are actually quite common in the presence of address calculations, macros or inlined functions.
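The dead and partially dead stores mentioned earlier can be illustrated in the same hand-written style. The sketch below is only an assumption about what such a rewrite might look like at the source level; the function names are hypothetical, and the article does not say where the Intel compiler actually places a store that it moves.

```c
/* Dead store: x is overwritten before it is ever read. */
int dead_store_naive(int a, int b)
{
    int x;
    x = a + b;      /* dead: the next store replaces it before any fetch */
    x = a * b;
    return x;
}

/* After dead store elimination the first assignment is gone. */
int dead_store_removed(int a, int b)
{
    int x;
    x = a * b;
    return x;
}

/* Partially dead store: the first store is dead only when c is true. */
int partial_naive(int c, int a, int b)
{
    int x = a + b;  /* read only on the path where c is false */
    if (c) {
        x = a * b;  /* overwrites x before any fetch */
    }
    return x;
}

/* One possible result: the store is moved to the only path that needs it. */
int partial_sunk(int c, int a, int b)
{
    int x;
    if (c) {
        x = a * b;
    } else {
        x = a + b;
    }
    return x;
}
```

In both pairs the return value is unchanged; the transformed versions simply avoid computing and storing a value that no path goes on to read.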
Powerful memory disambiguation (see Resource 8) is used by the Intel compiler to determine whether memory references might overlap. This analysis is important to enhance, for instance, register allocation and to enable the detection and exploitation of implicit parallelism in the code, as discussed in the following sections.
The Intel compiler also provides extensive interprocedural optimizations, including manual and automatic function inlining, partial inlining where only the hot parts of a routine are inlined, interprocedural constant optimizations and exception-handling optimizations. With the optional “whole program” analysis, the data layout of certain data structures, such as COMMON blocks in Fortran, may be modified to improve memory access on various processors. For example, the data layout could be padded to provide better data alignment.
In addition, to make more intelligent decisions about when and where to inline, the Intel compiler relies on two types of profiling information: static profiling and dynamic profiling. Static profiling refers to information that can be deduced or estimated at compile time. Dynamic profiling is information gathered from actual executions of a program. These two types of profiling are discussed in the next section.
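To see why the memory disambiguation described above matters for register allocation, consider the following sketch. The two hypothetical functions, add_twice and add_twice_cached, are written by hand for illustration and are not taken from the compiler. The second keeps *src in a local variable (in effect, a register), which gives the same answer as the first only if the two pointers refer to different memory.

```c
/* Straightforward version: *src is loaded again after each store to *dst,
   so the result is correct even if dst and src point to the same int. */
void add_twice(int *dst, const int *src)
{
    *dst += *src;
    *dst += *src;
}

/* Version that keeps *src in a local (think: a register). A compiler may
   rewrite add_twice into this form only after proving that dst and src
   never overlap. */
void add_twice_cached(int *dst, const int *src)
{
    int s = *src;
    *dst += 2 * s;
}
```

With distinct arguments, both functions turn a destination of 1 and a source of 1 into 3. If both pointers refer to the same variable holding 1, add_twice yields 4, because the first store changes the value that the second load sees, while add_twice_cached yields 3. The compiler therefore has to rule out the overlap before it can keep the loaded value in a register.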
Comments
Re: Inside the Intel Compiler
I have tried both gcc and icc 7.0 on cache-intensive code. Also examined the intermediate assembly code. Same code, same performance (better comments for icc), provided that you compile (under gcc) for the right processor type. Default processor is 386 (!!!) for some distributions (e.g., Mandrake), pentium for others (e.g., RedHat). Be careful, the performance advantage can be up to 40%.
Of course, no OpenMP support for gcc. However, when the Intel people dare to make measurements with hyperthreading enabled (please read their papers carefully), I will convince myself that it MIGHT be useful.. :)
Re: Inside the Intel Compiler
Some of the optimizations shown were just part of good programming practice anyway.
Only extreme newbies would program as shown before optimization.
Re: Inside the Intel Compiler
dare to say the current gcc has most of this stuff already implemented.
Re: Inside the Intel Compiler
When did GCC get vectorization and OpenMP support?
Re: Inside the Intel Compiler
3.2 has pretty good vectorization support, not the best but pretty good
Re: Inside the Intel Compiler
Take a look at:
http://gcc.gnu.org/projects/tree-ssa/vectorization.html
3.2 had nothing
Re: Inside the Intel Compiler
"dare to say the current gcc has most of this stuff already implemented."
Not true, although you'll find some things that work better in GCC. The Intel compiler is specifically optimized for IA, while gcc has to run on a lot of different architectures. Your mileage will vary depending on what you're doing.
GCC vs. the Intel Compiler definitely falls into the category of "use the right tool for the right job." Of course, the proprietary nature of the Intel tool will be an obstacle for some, but you can definitely get some performance benefits from using a compiler that is specifically optimized for the architecture.
Re: Inside the Intel Compiler
well, this might be true, but it is still much slower in most benchmarks.
Re: Inside the Intel Compiler
hmmm, is it possible to create a Linux distribution with the Intel Compiler!!??
Re: Inside the Intel Compiler
Well, in other articles people say it can't compile many things that are designed to be compiled with gcc (many linux packages) so I don't think it's feasible
Re: Inside the Intel Compiler
AFAIK, no :(
Re: Inside the Intel Compiler
AFAIK, Gentoo will be, maybe...
Re: Inside the Intel Compiler
Nice article. I hadn't heard anything about OpenMP until now, or CPU dispatch.
One thing I'm curious about though is what stuff the compiler team uses to develop with.. e.g. What language(s) are the C/C++ and Fortran compilers implemented in?
Some benchmarks I've seen would suggest they share a common back-end, but I wonder if the compiler itself is written in C, C++, Fortran or maybe a lisp dialect or some functional language..
The Intel Compiler is
The Intel Compiler is written in pure C.
Re: Inside the Intel Compiler
Umm I think they use a mix of PERL and OCaml.
They do the OpenMP stuff in functional Miranda.
lol
lol
functional miranda
hahahah
Re: Inside the Intel Compiler
How do you know?