Inside the Intel Compiler

How did Intel's compiler beat gcc on benchmarks? Intel's compiler developers explain the IA-32 optimizations they use.
Profiling Optimizations

First, we will look at static profiling. Consider the following code fragment:

g();
for (i=0; i<10; i++) {
    g();
}

Obviously, the call inside the loop executes ten times more often than the call outside the loop. In many cases, however, there is no way to make a good estimate. In the following code:

for (i=0; i<10; i++) {
    if (condition) {
        g();
    } else {
        h();
    }
}

it is difficult to say whether one condition is more likely to occur than the other. If h() happened to be an exit routine or some other routine known not to return, it would be safe to assume the then branch is more likely taken, and inlining g() might be worthwhile. Without such information, however, the decision of whether to inline one call or the other (or both) becomes more complicated.
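
As a hedged illustration of the exit-like case (the fatal_error routine and the GCC-style noreturn attribute here are our own sketch, not code from the compiler or the article), annotating a routine as non-returning lets the compiler treat the other branch as the hot path:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical error handler. The noreturn attribute tells the
   compiler that control never comes back from this call. */
__attribute__((noreturn)) static void fatal_error(const char *msg)
{
    fprintf(stderr, "%s\n", msg);
    exit(1);
}

void process(int *data, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        if (data[i] < 0) {
            fatal_error("negative input"); /* never returns: cold path */
        } else {
            data[i] *= 2;                  /* assumed the likely branch */
        }
    }
}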

Another option is dynamic profiling, which gathers information from actual executions of a program. This allows the compiler to take advantage of the way a program actually runs in order to optimize it. It is a three-step process: first, the application is built with profiling instrumentation embedded in it; then, the resulting application is run with a representative sample (or samples) of data, which yields a profile database for the compiler; finally, the information in this database is used in a subsequent build to guide optimizations such as code placement (grouping frequently executed basic blocks together), function or partial inlining, and register allocation.

Register allocation in the Intel compiler is based on graph fusion (see Resource 5), which breaks the code into regions, typically loop bodies or other cohesive units. With profile information, these regions can be selected according to the actual frequency of the blocks instead of syntactic guesses, which allows spills to be pushed into less frequently executed parts of the program.
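
As a sketch of the three-step workflow (the file names and training input are hypothetical, and the switch spellings -prof_gen and -prof_use follow the Linux icc of this era; treat them as illustrative):

icc -prof_gen -o app app.c      # step 1: build with profiling instrumentation
./app < training_input          # step 2: training run(s) write profile data
icc -prof_use -o app app.c      # step 3: rebuild using the profile database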

Intra-Register Vectorization

Exploiting parallelism is an important way to increase application performance on modern architectures. The Intel compiler can be key in this effort, providing optimizations such as automatic vectorization, automatic parallelization and support for OpenMP directives. Let's look at the automatic conversion of serial loops into a form that takes advantage of the instructions provided by Intel MMX technology or SSE/SSE2 (Streaming SIMD Extensions), a process we refer to as “intra-register vectorization” (see Resource 1). For example, given the function:

void vecadd(float a[], float b[], float c[], int n)
{
  int i;
  for (i = 0; i < n; i++) {
      c[i] = a[i] + b[i];
  }
}

the Intel compiler will transform the loop to allow four single-precision floating-point additions to occur simultaneously using the addps instruction. Simply put, using a pseudo-vector notation, the result would look something like this:

for (i = 0; i < n; i+=4) {
    c[i:i+3] = a[i:i+3] + b[i:i+3];
}

A scalar cleanup loop follows to execute the remaining iterations if the trip count n is not exactly divisible by four. Several steps are involved in this process. First, because no information may exist about the base addresses of the arrays, runtime code must be inserted to ensure that the arrays do not overlap (dynamic dependence testing) and that the bulk of the loop runs with each vector iteration accessing addresses aligned on 16-byte boundaries (dynamic loop peeling for alignment). Because this runtime overhead must be amortized, only loops of sufficient size are vectorized; if the number of iterations is too small, a simple serial loop is used instead. Besides simple loops, the vectorizer also supports loops with reductions (such as summing an array of numbers or searching for the maximum or minimum in an array), conditional constructs, saturation arithmetic and other idioms. Even loops with trigonometric mathematical functions can be vectorized by means of a vector math library.
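
For instance, a reduction loop like the following sum (a minimal sketch of the idiom, not code from the article) can be vectorized by keeping four partial sums in the lanes of an SSE register and combining them after the loop:

float sum(const float a[], int n)
{
    float s = 0.0f;
    int i;
    /* The vectorizer can recognize this as a reduction: partial sums
       accumulate in parallel lanes and are added together at the end. */
    for (i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}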

To give a taste of a realistic performance improvement that can be obtained by intra-register vectorization, we report some performance numbers for the double-precision version of the Linpack benchmark (available in both Fortran and C at www.netlib.org/benchmark). This benchmark reports the performance of a linear equation solver that uses the routines DGEFA and DGESL for the factorization and solve phase, respectively. Most of the runtime of this benchmark results from repetitively calling the Level 1 BLAS routine DAXPY for different subcolumns of the coefficient matrix during factorization. Under generic optimizations (switch -O2), this benchmark reports 1,049 MFLOPS for solving a 100×100 system on a 2.66GHz Pentium 4 processor. When intra-register vectorization for the Pentium 4 processor is enabled (switch -xW), the performance goes up to 1,292 MFLOPS, boosting the performance by about 20%.
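
For reference, the computational core of DAXPY is just a scaled vector update; a minimal C rendering of the unit-stride case (the BLAS original is Fortran and also handles arbitrary strides) looks like this:

/* y := a*x + y, the kernel that dominates the Linpack factorization */
void daxpy(int n, double a, const double x[], double y[])
{
    int i;
    for (i = 0; i < n; i++) {
        y[i] += a * x[i];
    }
}

Loops of exactly this form are prime candidates for the intra-register vectorization described above.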

______________________

Comments

Re: Inside the Intel Compiler

I have tried both gcc and icc 7.0 on cache-intensive code and also examined the intermediate assembly code. Same code, same performance (better comments from icc), provided that you compile (under gcc) for the right processor type. The default processor is 386 (!!!) for some distributions (e.g., Mandrake), pentium for others (e.g., RedHat). Be careful, the performance advantage can be up to 40%.
Of course, there's no OpenMP support for gcc. However, when Intel people dare to make measurements with hyperthreading enabled (please read their papers carefully), I will convince myself that it MIGHT be useful.. :)

Re: Inside the Intel Compiler

Some of the optimizations shown are just part of good programming practice anyway.
Only extreme newbies would program as shown before optimization.

Re: Inside the Intel Compiler

I dare say the current gcc has most of this stuff already implemented.

Re: Inside the Intel Compiler

When did GCC get vectorization and OpenMP support?

Re: Inside the Intel Compiler

3.2 has pretty good vectorization support; not the best, but pretty good.

Re: Inside the Intel Compiler

"dare to say the current gcc has most of this stuff already implemented."

Not true, although you'll find some things that work better in GCC. The Intel compiler is specifically optimized for IA, while gcc has to run on a lot of different architectures. Your mileage will vary depending on what you're doing.

GCC vs. the Intel Compiler definitely falls into the category of "use the right tool for the right job." Of course, the proprietary nature of the Intel tool will be an obstacle for some, but you can definitely get some performance benefits from using a compiler that is specifically optimized for the architecture.

Re: Inside the Intel Compiler

Well, this might be true, but it is still much slower in most benchmarks.

Re: Inside the Intel Compiler

Hmmm, is it possible to create a Linux distribution with the Intel compiler!!??

Re: Inside the Intel Compiler

Well, in other articles people say it can't compile many things that are designed to be compiled with gcc (many Linux packages), so I don't think it's feasible.

Re: Inside the Intel Compiler

AFAIK, no :(

Re: Inside the Intel Compiler

AFAIK, Gentoo will be, maybe...

Re: Inside the Intel Compiler

Nice article. I hadn't heard anything about OpenMP until now, or CPU dispatch.

One thing I'm curious about, though, is what the compiler team uses for development, e.g., what language(s) are the C/C++ and Fortran compilers implemented in?

Some benchmarks I've seen would suggest they share a common back end, but I wonder if the compiler itself is written in C, C++, Fortran or maybe a Lisp dialect or some functional language...

The Intel Compiler is

The Intel compiler is written in pure C.

Re: Inside the Intel Compiler

Umm, I think they use a mix of PERL and OCaml.

They do the OpenMP stuff in functional Miranda.

lol

lol

functional miranda
hahahah

Re: Inside the Intel Compiler

How do you know?
