Inside the Intel Compiler

How did Intel's compiler beat gcc on benchmarks? Intel's compiler developers explain the IA-32 optimizations they use.

The increasing acceptance of Linux among developers and researchers has yet to be matched by a similar increase in the number of available development tools. The recently released Intel C++ and Fortran compilers for Linux aim to bridge this gap by providing application developers with highly optimizing compilers for the Intel IA-32 and Itanium processor families. These compilers provide strict ANSI support, as well as optional support for some popular extensions. This article focuses on the optimizations and features of the compiler for the Intel IA-32 processors. Throughout the rest of this article, we refer to the Intel C++ and Fortran compilers for Linux on IA-32 collectively as “the Intel compiler”.

The Intel compiler optimizes a program at all levels, from high-level loop and interprocedural optimizations to standard compiler data flow optimizations, in addition to efficient low-level optimizations, such as instruction scheduling, basic block layout and register allocation. In this article, we mainly focus on compiler optimizations unique to the Intel compiler. For completeness, however, we also include a brief overview of some of the more traditional optimizations supported by the Intel compiler.

Traditional Compiler Optimizations

Decreasing the number of instructions that are dynamically executed and replacing instructions with faster equivalents are perhaps the two most obvious ways to improve performance. Many traditional compiler optimizations fall into this category: copy and constant propagation, redundant expression elimination, dead code elimination, peephole optimizations, function inlining, tail recursion elimination and so forth.

The Intel compiler provides a rich variety of both types of optimizations. Many local optimizations are based on the static-single-assignment (SSA) form. Redundant (or partially redundant) expressions, for example, are eliminated according to Chow's algorithm (see Resource 6), where an expression is considered redundant if it is unnecessarily calculated more than once on an execution path. For instance, in the statement:

x[i] += a[i+j*n] + b[i+j*n];

the expression i+j*n is redundant and needs to be calculated only once. Partial redundancy occurs when an expression is redundant on some paths but not necessarily all paths. In the code:

if (c) {
    x = y+a*b;
} else {
    x = a;
}
z = a*b;

the expression a*b is partially redundant. If the else branch is taken, a*b is calculated only once; but if the then branch is taken, it is calculated twice. The code can be modified as follows:

t = a*b;
if (c) {
    x = y+t;
} else {
    x = a;
}
z = t;

so there is only one calculation of a*b, no matter which path is taken.

Clearly, this transformation must be used judiciously as the increase in temporary values, ideally stored in registers, can increase lifetimes and, hence, register pressure. An algorithm similar to Chow's algorithm (see Resource 9) is used to eliminate dead stores, in which a store is succeeded by another store to the same location before a fetch, and partially dead stores, which are dead along some but not necessarily all paths. Other optimizations based on the SSA form are constant propagation (see Resource 7) and the propagation of conditions. Consider the following example:

if (x>0) {
    if (y>0) {
        . . .
        if (x == 0) {
            . . .
        }
    }
}

Since x>0 holds within the outermost if and x is not modified in between, we know that x != 0, and therefore the code within the inner if is dead. Although this and the previous example may seem contrived, such situations are actually quite common in the presence of address calculations, macros or inlined functions.

Powerful memory disambiguation (see Resource 8) is used by the Intel compiler to determine whether memory references might overlap. This analysis is important for enhancing, for instance, register allocation and for enabling the detection and exploitation of implicit parallelism in the code, as discussed in the following sections. The Intel compiler also provides extensive interprocedural optimizations, including manual and automatic function inlining, partial inlining where only the hot parts of a routine are inlined, interprocedural constant optimizations and exception-handling optimizations. With the optional “whole program” analysis, the data layout of certain data structures, such as COMMON BLOCKS in Fortran, may be modified to enhance memory accesses on various processors. For example, the data layout could be padded to provide better data alignment. In addition, to make more intelligent decisions about when and where to inline, the Intel compiler relies on two types of profiling information: static profiling and dynamic profiling. Static profiling refers to information that can be deduced or estimated at compile time. Dynamic profiling is information gathered from actual executions of a program. These two types of profiling are discussed in the next section.

______________________

Comments


Re: Inside the Intel Compiler


I have tried both gcc and icc 7.0 on cache-intensive code. Also examined the intermediate assembly code. Same code, same performance (better comments for icc), provided that you compile (under gcc) for the right processor type. The default processor is 386 (!!!) for some distributions (e.g., Mandrake), pentium for others (e.g., RedHat). Be careful, the performance difference can be up to 40%.
Of course, no OpenMP support for gcc. However, when Intel people dare to make measurements with hyperthreading enabled (please read their papers carefully), I will convince myself that it MIGHT be useful.. :)

Re: Inside the Intel Compiler


Some of the optimizations shown were just part of good programming practice anyway.
Only extreme newbies would program as shown before optimization.

Re: Inside the Intel Compiler


dare to say the current gcc has most of this stuff already implemented.

Re: Inside the Intel Compiler


When did GCC get vectorization and OpenMP support?

Re: Inside the Intel Compiler


3.2 has pretty good vectorization support, not the best but pretty good


Re: Inside the Intel Compiler


"dare to say the current gcc has most of this stuff already implemented."

Not true, although you'll find some things that work better in GCC. The Intel compiler is specifically optimized for IA, while gcc has to run on a lot of different architectures. Your mileage will vary depending on what you're doing.

GCC vs. the Intel Compiler definitely falls into the category of "use the right tool for the right job." Of course, the proprietary nature of the Intel tool will be an obstacle for some, but you can definitely get some performance benefits from using a compiler that is specifically optimized for the architecture.

Re: Inside the Intel Compiler


well, this might be true, but it is still much slower in most benchmarks.

Re: Inside the Intel Compiler


Hmmm, is it possible to create a Linux distribution with the Intel compiler!!??

Re: Inside the Intel Compiler


Well, in other articles people say it can't compile many things that are designed to be compiled with gcc (many Linux packages), so I don't think it's feasible.

Re: Inside the Intel Compiler


AFAIK, no :(

Re: Inside the Intel Compiler


AFAIK, Gentoo will be, maybe...

Re: Inside the Intel Compiler


Nice article. I hadn't heard anything about OpenMP until now, or CPU dispatch.

One thing I'm curious about, though, is what the compiler team uses to develop with, e.g., what language(s) are the C/C++ and Fortran compilers implemented in?

Some benchmarks I've seen would suggest they share a common back-end, but I wonder if the compiler itself is written in C, C++, Fortran or maybe a lisp dialect or some functional language..

The Intel Compiler is


The Intel Compiler is written in pure C.

Re: Inside the Intel Compiler


Umm I think they use a mix of PERL and OCaml.

They do the OpenMP stuff in functional Miranda.

lol


lol

functional miranda
hahahah

Re: Inside the Intel Compiler


How do you know?
