Inside the Intel Compiler

How did Intel's compiler beat gcc on benchmarks? Intel's compiler developers explain the IA-32 optimizations they use.
OpenMP and Auto-Parallelization

The OpenMP standard for C/C++ and Fortran (www.openmp.org) has recently emerged as the de facto standard for shared-memory parallel programming. It lets the user specify parallelism through directives, without getting involved in the details of iteration partitioning, data sharing, thread scheduling and synchronization. Based on these directives, the Intel compiler transforms the code into multithreaded code automatically. The Intel compiler supports the OpenMP C++ 2.0 and OpenMP Fortran 2.0 standard directives for explicit parallelization. Applications can use these directives to improve performance on multiprocessor systems by exploiting both task and data parallelism.

The following example program illustrates the use of OpenMP directives with the Intel C++ OpenMP compiler for Linux:

#define N 10000
extern int workunit(int);   /* defined elsewhere */

void ploop(void)
{
  int k, x[N], y[N], z[N];
  #pragma omp parallel for private(k) shared(x,y,z)
  for (k = 0; k < N; k++) {
    x[k] = x[k] * y[k] + workunit(z[k]);
  }
}

The for loop will be executed in parallel by a team of threads that divide the iterations in the loop body amongst themselves. Variable k is marked private—each thread will have its own copy of k—while the arrays x, y and z are shared among the threads.

The resulting multithreaded code is illustrated below. The Intel compiler generates OpenMP runtime library calls for thread creation and management, as well as synchronization (see Resources 1 and 2):

#define N 10000
void  ploop(void)
{
    int k, x[N], y[N], z[N];
    __kmpc_fork_call(loc,
                     3,
                     T-entry(_ploop_par_loop),
                     x, y, z);
    goto L1;
    T-entry _ploop_par_loop(loc, tid,
                            x[], y[], z[]) {
       lower_k = 0;
       upper_k = N;
       __kmpc_for_static_init(loc, tid, STATIC,
                              &lower_k,
                              &upper_k, ...);
       for (local_k=lower_k;  local_k<=upper_k;
            local_k++)  {
          x[local_k] = x[local_k] * y[local_k]
                       + workunit(z[local_k]);
       }
       __kmpc_for_static_fini(loc, tid);
       T-return;
    }
L1: return;
}

The multithreaded code generator inserts the thread invocation call __kmpc_fork_call with the T-entry point and data environment (for example, thread id tid) for each loop. This call into the Intel OpenMP runtime library forks a number of threads that execute the iterations of the loop in parallel.

The serial loop annotated with the OpenMP directive is converted to multithreaded code by localizing the lower and upper loop bounds and by privatizing the iteration variable. Finally, multithreading runtime initialization and synchronization code is generated for each T-region defined by a [T-entry, T-return] pair. The call __kmpc_for_static_init computes the localized loop lower bound, upper bound and stride for each thread according to a scheduling policy. In this example, the generated code uses static scheduling. The library call __kmpc_for_static_fini informs the runtime system that the current thread has completed one loop chunk.
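
As a rough illustration of what static scheduling computes for each thread, the sketch below splits the iteration space into near-equal contiguous chunks, one per thread. The function and parameter names are our own for illustration, not the actual OpenMP runtime API:

/* Sketch of static scheduling: divide n iterations into
   near-equal contiguous chunks, one per thread. */
void static_bounds(int tid, int nthreads, int n,
                   int *lower, int *upper)
{
  int chunk = (n + nthreads - 1) / nthreads;  /* ceiling division */
  *lower = tid * chunk;
  *upper = *lower + chunk - 1;                /* inclusive, as above */
  if (*upper >= n)
      *upper = n - 1;                         /* clamp the last chunk */
}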

Rather than performing source-to-source transformations, as is done in other compilers such as OpenMP NanosCompiler and OdinMP, the Intel compiler performs these transformations internally. This allows tight integration of the OpenMP implementation with other advanced, high-level compiler optimizations for improved uniprocessor performance such as vectorization and loop transformations.

Besides the compiler support for exploiting OpenMP directive-guided explicit parallelism, users also can try auto-parallelization by using the option -parallel. Under this option, the compiler automatically analyzes the loops in the program to detect those that have no loop-carried dependences and can be executed profitably in parallel. The auto-parallelization phase relies on advanced memory disambiguation techniques for its analysis, as well as on profiling information for the heuristics that decide when to parallelize.
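
For instance, a loop like the one below is a candidate for auto-parallelization: each iteration writes a distinct element of a[], so there is no loop-carried dependence. The function is our own sketch, not taken from the compiler documentation, and would be compiled with something like icc -parallel:

/* Every iteration is independent, so under -parallel the
   compiler can split this loop across multiple threads. */
void saxpy(float a[], const float x[], const float y[],
           float s, int n)
{
  int i;
  for (i = 0; i < n; i++)
    a[i] = s * x[i] + y[i];
}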

CPU-Dispatch

One of the unique features of the Intel compiler is CPU-Dispatch, which allows the user to target a single object for multiple IA-32 architectures by means of either manual CPU-Dispatch or Auto-CPU-Dispatch. Manual CPU-Dispatch allows the user to write multiple versions of a single function. Each function either is assigned a specific IA-32 architecture platform or is considered generic, meaning it can run on any IA-32 architecture. The Intel compiler generates code that dynamically determines which architecture the code is running on and accordingly chooses the version of the function that will actually execute. This runtime determination lets programmers take advantage of architecture-specific optimizations, such as SSE and SSE2, without sacrificing flexibility: the same binary still runs on architectures that do not support the newer instructions.
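
The sketch below shows what manual CPU-Dispatch looks like, using the __declspec(cpu_dispatch)/__declspec(cpu_specific) syntax from Intel's compiler documentation; the exact set of processor keywords varies by compiler version, so treat the identifiers here as assumptions:

/* Empty dispatch stub: the compiler fills it with a runtime
   check that routes each call to the matching version below. */
__declspec(cpu_dispatch(generic, pentium_4))
void scale(float a[], int n) {}

__declspec(cpu_specific(generic))
void scale(float a[], int n)    /* runs on any IA-32 processor */
{
  int i;
  for (i = 0; i < n; i++)
    a[i] = a[i] * 2.0f;
}

__declspec(cpu_specific(pentium_4))
void scale(float a[], int n)    /* Pentium 4 version; in practice
                                   this body would be hand-tuned
                                   with SSE2 code */
{
  int i;
  for (i = 0; i < n; i++)
    a[i] = a[i] * 2.0f;
}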

Auto-CPU-Dispatch is similar but with the added benefit that the compiler automatically generates multiple versions of a given function. During compilation, the compiler decides which routines will gain from architecture-specific optimizations. These routines are then automatically duplicated to produce architecture-specific optimized versions, as well as generic versions. The benefit of this feature is that it does not require any rewriting by the programmer. A normal source file can take advantage of the Auto-CPU-Dispatch feature through the simple use of a command-line option. For example, given the function:

void init(float b[], double c[], int n)
{
  int i;
  for (i = 0; i < n; i++) {
      b[i] = (float)i;
  }
  for (i = 0; i < n; i++) {
      c[i] = (double)i;
  }
}

the Intel compiler can produce up to three versions of the function. A generic version is generated that will run on any IA-32 processor. Another version is tuned for the Pentium III processor by vectorizing the first loop with SSE instructions; only the first loop qualifies because SSE supports packed single-precision but not packed double-precision arithmetic. A third version is optimized for the Pentium 4 processor by vectorizing both loops to take advantage of SSE2 instructions.
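
In the compiler versions current when this article was written, Auto-CPU-Dispatch was requested with the -ax family of switches. The flag letters below are an assumption based on Intel's documentation of that era, with K requesting a Pentium III/SSE specialization and W a Pentium 4/SSE2 specialization, alongside the generic code:

icc -c -axKW init.c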

The resulting function begins with dispatch code like this:

.L1  testl     $-512, __intel_cpu_indicator
     jne       init.J
     testl     $-128, __intel_cpu_indicator
     jne       init.H
     testl     $-1, __intel_cpu_indicator
     jne       init.A
     call      __intel_cpu_indicator_init
     jmp       .L1

where init.A, init.H and init.J are the generic, SSE-optimized and SSE2-optimized versions, respectively.
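
Read as C, the dispatch stub behaves roughly like the sketch below. The exact bit layout of __intel_cpu_indicator is internal to the Intel runtime, so the masks are shown only to mirror the assembly above, and init_A, init_H and init_J stand in for the compiler-generated bodies:

extern int  __intel_cpu_indicator;            /* set by the runtime */
extern void __intel_cpu_indicator_init(void);
extern void init_A(float b[], double c[], int n);  /* generic */
extern void init_H(float b[], double c[], int n);  /* SSE     */
extern void init_J(float b[], double c[], int n);  /* SSE2    */

void init(float b[], double c[], int n)
{
    for (;;) {
        if (__intel_cpu_indicator & -512) {   /* testl $-512 */
            init_J(b, c, n); return;
        }
        if (__intel_cpu_indicator & -128) {   /* testl $-128 */
            init_H(b, c, n); return;
        }
        if (__intel_cpu_indicator & -1) {     /* testl $-1 */
            init_A(b, c, n); return;
        }
        __intel_cpu_indicator_init();  /* first call: fill in the
                                          indicator, then retry */
    }
}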

______________________

Comments


Re: Inside the Intel Compiler

I have tried both gcc and icc 7.0 on cache-intensive code. I also examined the intermediate assembly code. Same code, same performance (better comments for icc), provided that you compile (under gcc) for the right processor type. The default processor is 386 (!!!) for some distributions (e.g., Mandrake), pentium for others (e.g., RedHat). Be careful, the performance advantage can be up to 40%.
Of course, there's no OpenMP support for gcc. However, when the Intel people dare to make measurements with hyperthreading enabled (please read their papers carefully), I will convince myself that it MIGHT be useful... :)

Re: Inside the Intel Compiler

Some of the optimizations shown were just part of good programming practice anyway.
Only extreme newbies would program as shown before optimization.

Re: Inside the Intel Compiler

dare to say the current gcc has most of this stuff already implemented.

Re: Inside the Intel Compiler

When did GCC get vectorization and OpenMP support?

Re: Inside the Intel Compiler

3.2 has pretty good vectorization support, not the best but pretty good

Re: Inside the Intel Compiler

"dare to say the current gcc has most of this stuff already implemented."

Not true, although you'll find some things that work better in GCC. The Intel compiler is specifically optimized for IA, while gcc has to run on a lot of different architectures. Your mileage will vary depending on what you're doing.

GCC vs. the Intel Compiler definitely falls into the category of "use the right tool for the right job." Of course, the proprietary nature of the Intel tool will be an obstacle for some, but you can definitely get some performance benefits from using a compiler that is specifically optimized for the architecture.

Re: Inside the Intel Compiler

well, this might be true, but it is still much slower in most benchmarks.

Re: Inside the Intel Compiler

hmmm, it's possible to create a Linux distribution with the Intel compiler!!??

Re: Inside the Intel Compiler

Well, in other articles people say it can't compile many things that are designed to be compiled with gcc (many Linux packages), so I don't think it's feasible.

Re: Inside the Intel Compiler

AFAIK, no :(

Re: Inside the Intel Compiler

AFAIK, Gentoo will be, maybe...

Re: Inside the Intel Compiler

Nice article. I hadn't heard anything about OpenMP or CPU dispatch until now.

One thing I'm curious about, though, is what the compiler team uses for development, e.g., what language(s) are the C/C++ and Fortran compilers implemented in?

Some benchmarks I've seen would suggest they share a common back end, but I wonder if the compiler itself is written in C, C++, Fortran or maybe a Lisp dialect or some functional language...

The Intel Compiler is

The Intel Compiler is written in pure C.

Re: Inside the Intel Compiler

Umm I think they use a mix of PERL and OCaml.

They do the OpenMP stuff in functional Miranda.

lol

lol

functional miranda
hahahah

Re: Inside the Intel Compiler

How do you know?
