An Introduction to GCC Compiler Intrinsics in Vector Processing


We're including integer math in the discussion. While integer math is not unique to vector processing, it's useful to have in case your vector hardware is integer-only or if integer math is much faster than floating-point math. Integer math will be an approximation of floating-point math, but you might be able to get a faster answer that is acceptably close.

The first option in integer math is rearranging operations. If your formula is simple enough, you might be able to rearrange operations to preserve precision. For instance, you could rearrange:

```
F = (9/5)C + 32
```
into:
```
F = (9*C)/5 + 32
```

So long as 9 * C doesn't overflow the type you're using, precision is preserved. It's lost when you divide by 5; so do that after the multiplication. Rearrangement may not work for more complex formulas.

The second choice is scaled math. In the scaled math option, you decide how much precision you want, multiply both sides of your equation by a constant, round or truncate your coefficients to integers, and work with that. The final step to get your answer is to divide by that constant. For instance, if you wanted to convert Celsius to Fahrenheit:

```
F = (9/5)C + 32
  = 1.8C + 32           -- but we can't have 1.8, so multiply both sides by 10

sum = 10F = 18C + 320   -- 1.8 is now 18: all integer operations

F = sum/10
```

If you multiply by a power of 2 instead of 10, you change that final division into a shift, which is almost certainly faster, though harder to understand. (So don't do this gratuitously.)

The third choice for integer math is shift-and-add. Shift-and-add is another method based on the idea that a floating-point multiplication can be implemented with a number of shifts and adds. So our troublesome 1.8C can be approximated as:

```
1.0C + 0.5C + 0.25C + ...   OR   C + (C >> 1) + (C >> 2) + ...
```

Again, it's almost certainly faster, but harder to understand.

There are examples of integer math in samples/simple/temperatures*.c, and shift-and-add in samples/colorconv2/scalar.c.

Vector Types, the Compiler and the Debugger

To use your processor's vector hardware, tell the compiler to use intrinsics to generate SIMD code, include the file that defines the vector types, and use a vector type to put your data into vector form.

The compiler's SIMD command-line arguments are listed in Table 1. (This article covers only these, but GCC offers much more.)

Table 1. GCC Command-Line Options to Generate SIMD Code

| Processor | Options |
| --- | --- |
| X86 MMX/SSE1/SSE2 | -mfpmath=sse -mmmx -msse -msse2 |
| ARM Neon | -mfpu=neon -mfloat-abi=softfp |
| Freescale Altivec | -maltivec -mabi=altivec |

Here are the include files you need:

• arm_neon.h - ARM Neon types & intrinsics
• altivec.h - Freescale Altivec types & intrinsics
• mmintrin.h - X86 MMX
• xmmintrin.h - X86 SSE1
• emmintrin.h - X86 SSE2

X86: MMX, SSE, SSE2 Types and Debugging

The X86 compatibles with MMX, SSE1 and SSE2 have the following types:

• MMX: __m64 — 64 bits of integers, broken down as eight 8-bit integers, four 16-bit shorts or two 32-bit integers.
• SSE1: __m128 — 128 bits: four single-precision floats.
• SSE2: __m128i — 128 bits of packed integers of any element size; __m128d — 128 bits: two doubles.

Because the debugger doesn't know how you're using these types, printing X86 vector variables in gdb/ddd shows you the packed form of the vector instead of the collection of elements. To get at the individual elements, tell the debugger how to decode the packed form with "print (type[]) x". For instance, if you have:

```
__m64 avariable; /* storing 4 shorts */
```

you can tell ddd to list the individual elements as shorts by saying:

```
print (short[]) avariable
```

If you're working with char vectors and want gdb to print the vector's elements as numbers instead of characters, you can tell it to with the "/" format option. For instance:

```
print/d acharvector
```

will print the contents of acharvector as a series of decimal values.

______________________


Use GCC's "aligned"

Hi!

> Use GCC's "aligned" attribute to align data sources and destinations on 16-bit
> float anarray[4] __attribute__((aligned(16))) = { 1.2, 3.5, 1.7, 2.8 };

I'm not sure, but it seems to me that instead of "16-bit" it should say "16-byte" ( http://gcc.gnu.org/onlinedocs/gcc/Type-Attributes.html#Type-Attributes ). Isn't it?


Correction

The pattern for ARM Neon types is not [type]x[elementcount]_t, but [type][elementcount]x_t.

re Correction

You might take a look at:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0004a/CHD...

In example 1.1 they use uint32x4_t as a four element vector of 32-bit unsigned integers...

autovectorisation

http://locklessinc.com/articles/vectorize/ has some tips on helping GCC autovectorise code.

So it talks about ancient tech like MMX and SSE2; my guess is that these days you would write about AVX. Also, the links at the end often lead nowhere, or to an article from 2005. This makes me wonder when this article was actually written.

Very perceptive. The article was accepted for publication in July of 2011. That's why the ARM and Freescale links have gone stale. (I'll post an updated set later this week.)

The choice of MMX and SSE2 for X86 was deliberate. For an introductory article, things that are simple and widespread are often the best choices.

I think an AVX article would be wonderful. Any volunteers?

no, intrinsics are no replacement for hand-optimized simd asm

so far, i encountered only one case where intrinsics are somewhat useful - when trying to unroll a loop of non-trivial vector code. if you write a test implementation using intrinsics and let gcc unroll that a bit for you, gcc's liveness analysis and resulting register allocation may give you useful hints for writing the final asm function. but i have never seen a case where gcc produces optimal code from intrinsics for a non-trivial function.

and regarding vendor libraries - the functions they provide are of varying quality with regard to optimization, but even in the cases where the code is pretty good, they don't compete on equal grounds. they have to be pretty generic, which means you always have some overhead. optimizations in simd asm often come from specific knowledge regarding variable ranges, data layout, or data reuse. the vendor lib can't do that.

so write your proof-of-concept using intrinsics or vendor libs. and if performance satisfies you, just keep it that way. but if a function still is a major hotspot, you can do better if you go asm (maybe only a bit, more likely a lot)
