An Introduction to GCC Compiler Intrinsics in Vector Processing

on September 21, 2012

Speed is essential in multimedia, graphics and signal processing. Sometimes programmers resort to assembly language to get every last bit of speed out of their machines. GCC offers an intermediate between assembly and standard C that can get you more speed and processor features without having to go all the way to assembly language: compiler intrinsics. This article discusses GCC's compiler intrinsics, emphasizing vector processing on three platforms: X86 (using MMX, SSE and SSE2); Motorola, now Freescale (using Altivec); and ARM Cortex-A (using Neon). We conclude with some debugging tips and references.

Download the sample code for this article here: http://www.linuxjournal.com/files/linuxjournal.com/code/11108.tar

So, What Are Compiler Intrinsics?

Compiler intrinsics (sometimes called "builtins") are like the library functions you're used to, except they're built in to the compiler. They may be faster than regular library functions (the compiler knows more about them so it can optimize better) or handle a smaller input range than the library functions. Intrinsics also expose processor-specific functionality so you can use them as an intermediate between standard C and assembly language. This gives you the ability to get to assembly-like functionality, but still let the compiler handle details like type checking, register allocation, instruction scheduling and call stack maintenance. Some builtins are portable, others are not--they are processor-specific. You can find the lists of the portable and target specific intrinsics in the GCC info pages and the include files (more about that below). This article focuses on the intrinsics useful for vector processing.

Vectors and Scalars

In this article, a vector is an ordered collection of numbers, like an array. If all the elements of a vector are measures of the same thing, it's said to be a uniform vector. Non-uniform vectors have elements that represent different things, and their elements have to be processed differently. In software, vectors have their own types and operations. A scalar is a single value, a vector of size one. Code that uses vector types and operations is said to be vector code. Code that uses only scalar types and operations is said to be scalar code.

Vector Processing Concepts

Vector processing is in the category of Single Instruction, Multiple Data (SIMD). In SIMD, the same operation happens to all the data (the values in the vector) at the same time. Each value in the vector is computed independently. Vector operations include logic and math. Math within a single vector is called horizontal math. Math between two vectors is called vertical math.

Instead of writing: 10 x 2 = 20, express it vertically as:

In vertical math, vectors are lines of these values; multiple operations happen at the same time:

        -------------------------------
        |  10   |   10  |  10  |  10  |   vector1
        -------------------------------
        -------------------------------
    x   |  2    |   2   |  2   |  2   |   vector2
        -------------------------------
   --------------------------------------
        -------------------------------
        |  20   |   20  |  20  |  20  |   vector3
        -------------------------------

All 10s are multiplied by all 2s at the same time.

So, to convert Celsius to Fahrenheit using F = (9/5) * C + 32 for a vector of temperatures in Celsius:

       -------------------------------
        |  C0   |   C1  |  C2  |  C3  |   Celsius temperatures vector
        -------------------------------
        -------------------------------
    x   |  9    |   9   |  9   |  9   |   vector2
        -------------------------------
   --------------------------------------
        -------------------------------
        |  p1   |   p2  |  p3  |  p4  |   partial result
        -------------------------------
        -------------------------------
    /   |  5    |   5   |  5   |  5   |   vector3
        -------------------------------
   --------------------------------------
        -------------------------------
        |  p1   |   p2  |  p3  |  p4  |   partial result
        -------------------------------
        -------------------------------
   +    |  32   |   32  |  32  |  32  |   vector4
        -------------------------------
    --------------------------------------

        -------------------------------
        |  F0   |   F1  |  F2  |  F3  |   Fahrenheit temperatures vector
        -------------------------------

Saturation arithmetic is like normal arithmetic except that when the result of the operation that would overflow or underflow an element in the vector, that is clamped at the end of the range and not allowed to wrap around. (For instance, 255 is the largest unsigned character. In saturation arithmetic on unsigned characters, 250 + 10 = 255.) Regular arithmetic would allow the value to wrap around zero and become smaller. For example, saturation arithmetic is useful if you have a pixel that is slightly brighter than maximum brightness. It should be maximum brightness, not wrap around to be dark.

We're including integer math in the discussion. While integer math is not unique to vector processing, it's useful to have in case your vector hardware is integer-only or if integer math is much faster than floating-point math. Integer math will be an approximation of floating-point math, but you might be able to get a faster answer that is acceptably close.

The first option in integer math is rearranging operations. If your formula is simple enough, you might be able to rearrange operations to preserve precision. For instance, you could rearrange:

F = (9/5)C + 32

into:

F = (9*C)/5 + 32

So long as 9 * C doesn't overflow the type you're using, precision is preserved. It's lost when you divide by 5; so do that after the multiplication. Rearrangement may not work for more complex formulas.

The seond choice is scaled math. In the scaled math option, you decide how much precision you want, then multiply both sides of your equation by a constant, round or truncate your coefficients to integers, and work with that. The final step to get your answer then is to divide by that constant. For instance, if you wanted to convert Celsius to Fahrenheit:

F = (9/5)C + 32
  = 1.8C + 32            -- but we can't have 1.8, so multiply by 10

sum = 10F = 18C + 320    -- 1.8 is now 18: now all integer operations

F = sum/10

If you multiply by a power of 2 instead of 10, you change that final division into a shift, which is almost certainly faster, though harder to understand. (So don't do this gratuitously.)

The third choice for integer math is shift-and-add. Shift-and-add is another method based on the idea that a floating-point multiplication can be implemented with a number of shifts and adds. So our troublesome 1.8C can be approximated as:

1.0C + 0.5C + 0.25C + ...   OR  C + (C >> 1) + (C >> 2) + ...

Again, it's almost certainly faster, but harder to understand.

There are examples of integer math in samples/simple/temperatures*.c, and shift-and-add in samples/colorconv2/scalar.c.

Vector Types, the Compiler and the Debugger

To use your processor's vector hardware, tell the compiler to use intrinsics to generate SIMD code, include the file that defines the vector types, and use a vector type to put your data into vector form.

The compiler's SIMD command-line arguments are listed in Table 1. (This article covers only these, but GCC offers much more.)

Table 1. GCC Command-Line Options to Generate SIMD Code

Processor/		Options
X86/MMX/SSE1/SSE2	-mfpmath=sse -mmmx -msse -msse2
ARM Neon	-mfpu=neon -mfloat-abi=softfp
Freescale Altivec	-maltivec -mabi=altivec

Here are the include files you need:

arm_neon.h - ARM Neon types & intrinsics
altivec.h - Freescale Altivec types & intrinsics
mmintrin.h - X86 MMX
xmmintrin.h - X86 SSE1
emmintrin.h - X86 SSE2

X86: MMX, SSE, SSE2 Types and Debugging

The X86 compatibles with MMX, SSE1 and SSE2 have the following types:

MMX: __m64 64 bits of integers broken down as eight 8-bit integers, four 16-bit shorts or two 32-bit integers.
SSE1: __m128 128 bits: four single precision floats.
SSE2: __m128i 128 bits of any size packed integers, __m128d 128 bits: two doubles.

Because the debugger doesn't know how you're using these types, printing X86 vector variables in gdb/ddd shows you the packed form of the vector instead of the collection of elements. To get to the individual elements, tell the debugger how to decode the packed form as "print (type[]) x" For instance if you have:


__m64 avariable; /* storing 4 shorts */

You can tell ddd to list individual elements as shorts saying:


print (short[]) avariable

If you're working with char vectors and want gdb to print the vector's elements as numbers instead of characters, you can tell it to using the "/" option. For instance:


print/d acharvector

will print the contents of acharvector as a series of decimal values.

PowerPC Altivec Types and Debugging

PowerPC Processors with Altivec (also known as VMX and Velocity Engine) add the keyword "vector" to their types. They're all 16 bytes long. The following are some Altivec vector types:

vector unsigned char: 16 unsigned chars
vector signed char: 16 signed chars
vector bool char: 16 unsigned chars (0 false, 255 true)
vector unsigned short: 8 unsigned shorts
vector signed short: 8 signed shorts
vector bool short: 8 unsigned shorts (0 false, 65535 true)
vector unsigned int: 4 unsigned ints
vector signed int: 4 signed ints
vector bool int: 4 unsigned ints (0 false, 2^32 -1 true)
vector float: 4 floats

The debugger prints these vectors as collections of individual elements.

ARM Neon Types and Debugging

On ARM processors that have Neon extensions available, the Neon types follow the pattern [type]x[elementcount]_t. Types include those in the following list:

uint64x1_t - single 64-bit unsigned integer
uint32x2_t - pair of 32-bit unsigned integers
uint16x4_t - four 16-bit unsigned integers
uint8x8_t - eight 8-bit unsigned integers
int32x2_t - pair of 32-bit signed integers
int16x4_t - four 16-bit signed integers
int8x8_t - eight 8-bit signed integers
int64x1_t - single 64-bit signed integer
float32x2_t - pair of 32-bit floats
uint32x4_t - four 32-bit unsigned integers
uint16x8_t - eight 16-bit unsigned integers
uint8x16_t - 16 8-bit unsigned integers
int32x4_t - four 32-bit signed integers
int16x8_t - eight 16-bit signed integers
int8x16_t - 16 8-bit signed integers
uint64x2_t - pair of 64-bit unsigned integers
int64x2_t - pair of 64-bit signed integers
float32x4_t - four 32-bit floats
uint32x4_t - four 32-bit unsigned integers
uint16x8_t - eight 16-bit unsigned integers

The debugger prints these vectors as collections of individual elements.

There are examples of these in the samples/simple directory.

Now that we've covered the vector types, let's talk about vector programs.

As Ian Ollman points out, vector programs are blitters. They load data from memory, process it, then store it to memory elsewhere. Moving data between memory and vector registers is necessary, but it's overhead. Taking big bites of data from memory, processing it, then writing it back to memory will minimize that overhead.

Alignment is another aspect of data movement to watch for. Use GCC's "aligned" attribute to align data sources and destinations on 16-bit boundaries for best performance. For instance:


float anarray[4] __attribute__((aligned(16))) = { 1.2, 3.5, 1.7, 2.8 };

Failure to align can result in getting the right answer, silently getting the wrong answer or crashing. Techniques are available for handling unaligned data, but they are slower than using aligned data. There are examples of these in the sample code.

The sample code uses intrinsics for vector operations on X86, Altivec and Neon. These intrinsics follow naming conventions to make them easier to decode. Here are the naming conventions:

Altivec intrinsics are prefixed with "vec_". C++ style overloading accomodates the different type arguments.

Neon intrinsics follow the naming scheme [opname][flags]_[type]. A "q" flag means it operates on quad word (128-bit) vectors.

X86 intrinsics are follow the naming convention _mm_[opname]_[suffix]

    suffix    s single-precision floating point
              d double-precision floating point
              i128 signed 128-bit integer
              i64 signed 64-bit integer
              u64 unsigned 64-bit integer
              i32 signed 32-bit integer
              u32 unsigned 32-bit integer
              i16 signed 16-bit integer
              u16 unsigned 16-bit integer
              i8 signed 8-bit integer
              u8 unsigned 8-bit integer
              pi# 64-bit vector of packed #-bit integers
              pu# 64-bit vector of packed #-bit unsigned integers
              epi# 128-bit vector of packed #-bit unsigned integers
              epu# 128-bit vector of packed #-bit unsigned integers
              ps 128-bit vector of packed single precision floats
              ss 128-bit vector of one single precision float
              pd 128-bit vector of double precision floats
              sd 128-bit vector of one double precision (128-bit) float
              si64 64-bit vector of single 64-bit integer
              si128 128 bit vector

Table 2 lists the intrinsics used in the sample code.

Table 2. Subset of vector operators and intrinsics used in the examples.

Operation	Altivec	Neon	MMX/SSE/SSE2
loading	vec_ld	vld1q_f32	_mm_set_epi16
vector	vec_splat	vld1q_s16	_mm_set1_epi16
	vec_splat_s16	vsetq_lane_f32	_mm_set1_pi16
	vec_splat_s32	vld1_u8	_mm_set_pi16
	vec_splat_s8	vdupq_lane_s16	_mm_load_ps
	vec_splat_u16	vdupq_n_s16	_mm_set1_ps
	vec_splat_u32	vmovq_n_f32	_mm_loadh_pi
	vec_splat_u8	vset_lane_u8	_mm_loadl_pi
storing	vec_st	vst1_u8
vector		vst1q_s16	_mm_store_ps
		vst1q_f32
		vst1_s16
add	vec_madd	vaddq_s16	_mm_add_epi16
	vec_mladd	vaddq_f32	_mm_add_pi16
	vec_adds	vmlaq_n_f32	_mm_add_ps
subtract	vec_sub	vsubq_s16
multiply	vec_madd	vmulq_n_s16	_mm_mullo_epi16
	vec_mladd	vmulq_s16	_mm_mullo_pi16
		vmulq_f32	_mm_mul_ps
		vmlaq_n_f32
arithmetic	vec_sra	vshrq_n_s16	_mm_srai_epi16
shift	vec_srl		_mm_srai_pi16
	vec_sr
byte	vec_perm	vtbl1_u8	_mm_shuffle_pi16
permutation	vec_sel	vtbx1_u8	_mm_shuffle_ps
	vec_mergeh	vget_high_s16
	vec_mergel	vget_low_s16
		vdupq_lane_s16
		vdupq_n_s16
		vmovq_n_f32
		vbsl_u8
type	vec_cts	vmovl_u8	_mm_packs_pu16
conversion	vec_unpackh	vreinterpretq_s16_u16
	vec_unpackl	vcvtq_u32_f32
	vec_cts	vqmovn_s32	_mm_cvtps_pi16
	vec_ctu	vqmovun_s16	_mm_packus_epi16
		vqmovn_u16
		vcvtq_f32_s32
		vmovl_s16
		vmovq_n_f32
vector	vec_pack	vcombine_u16
combination	vec_packsu	vcombine_u8
		vcombine_s16
maximum			_mm_max_ps
minimum			_mm_min_ps
vector			_mm_andnot_ps
logic			_mm_and_ps
			_mm_or_ps
rounding	vec_trunc
misc			_mm_empty

Suggestions for Writing Vector Code

Examine the Tradeoffs

Writing vector code with intrinsics forces you to make trade-offs. Your program will have a balance between scalar and vector operations. Do you have enough work for the vector hardware to make using it worthwhile? You must balance the portability of C against the need for speed and the complexity of vector code, especially if you maintain code paths for scalar and vector code. You must judge the need for speed versus accuracy. It may be that integer math will be fast enough and accurate enough to meet the need. One way to make those decisions is to test: write your program with a scalar code path and a vector code path and compare the two.

Data Structures

Start by laying out your data structures assuming that you'll be using intrinsics. This means getting data items aligned. If you can arrange the data for uniform vectors, do that.

Write Portable Scalar Code and Profile

Next, write your portable scalar code and profile it. This will be your reference code for correctness and the baseline to time your vector code. Profiling the code will show where the bottlenecks are. Make vector versions of the bottlenecks.

Write Vector Code

When you're writing that vector code, group the non-portable code into separate files by architecture. Write a separate Makefile for each architecture. That makes it easy to select the files you want to compile and supply arguments to the compiler for each architecture. Minimize the intermixing of scalar and vector code.

Use Compiler-Supplied Symbols if you #ifdef

For files that are common to more than one architecture, but have architecture-specific parts, you can #ifdef with symbols supplied by the compiler when SIMD instructions are available. These are:

__MMX__ -- X86 MMX
__SSE__ -- X86 SSE
__SSE2__ -- X86 SSE2
__VEC__ -- altivec functions
__ARM_NEON__ -- neon functions

To see the baseline macros defined for other processors:


touch emptyfile.c
gcc -E -dD emptyfile.c | more

To see what's added for SIMD, do this with the SIMD command-line arguments for your compiler (see Table 1). For example:


touch emptyfile.c
gcc -E -dD emptyfile.c -mmmx -msse  -msse2 -mfpmath=sse | more

Then compare the two results.

Check Processor at Runtime

Next, your code should check your processor at runtime to see if you have vector support for it. If you don't have a vector code path for that processor, fall back to your scalar code. If you have vector support, and the vector support is faster, use the vector code path. Test processor features on X86 with the cpuid instruction from <cpuid.h>. (You saw examples of that in samples/simple/x86/*c.) We couldn't find something that well established for Altivec and Neon, so the examples there parse /proc/cpuinfo. (Serious code might insert a test SIMD instruction. If the processor throws a SIGILL signal when it encounters that test instruction, you do not have that feature.)

Test, Test, Test

Test everything. Test for timing: see if your scalar or vector code is faster. Test for correct results: compare the results of your vector code against the results of your scalar code. Test at different optimization levels: the behavior of the programs can change at different levels of optimization. Test against integer math versions of your code. Finally, watch for compiler bugs. GCC's SIMD and intrinsics are a work in progress.

This gets us to our last code sample. In samples/colorconv2 is a colorspace conversion library that takes images in non-planar YUV422 and turns them into RGBA. It runs on PowerPCs using Altivec; ARM Cortex-A using Neon; and X86 using MMX, SSE and SSE2. (We tested on PowerMac G5 running Fedora 12, a Beagleboard running Angstrom 2009.X-test-20090508 and a Pentium 3 running Fedora 10.) Colorconv detects CPU features and uses code for them. It falls back to scalar code if no supported features are detected.

To build, untar the sources file and run make. Make uses the "uname" command to look for an architecture specific Makefile. (Unfortunately, Angstrom's uname on Beagleboard returns "unknown", so that's what the directory is called.)

Test programs are built along with the library. Testrange compares the results of the scalar code to the vector code over the entire range of inputs. Testcolorconv runs timing tests comparing the code paths it has available (intrinsics and scalar code) so you can see which runs faster.

Finally, here are some performance tips.

First, get a recent compiler and use the best code generation options. (Check the info pages that come with your compiler for things like the -mcpu option.)

Second, profile your code. Humans are bad at guessing where the bottlenecks are. Fix the bottlenecks, not other parts.

Third, get the most work you can from each vector operation by using the vector with the narrowest type elements that your data will fit into. Get the most work you can in each time slice by having enough work that you keep your vector hardware busy. Take big bites of data. If your vector hardware can handle a lot of vectors at the same time, use them. However, exceeding the number of vector registers you have available will slow things down. (Check your processor's documentation.)

Fourth, don't re-invent the wheel. Intel, Freescale and ARM all offer libraries and code samples to help you get the most from their processors. These include Intel's Integrated Performance Primitives, Freescale's libmotovec and ARM's OpenMAX.

Summary

In summary, GCC offers intrinsics that allow you to get more from your processor without the work of going all the way to assembly. We have covered basic types and some of the vector math functions. When you use intrinsics, make sure you test thoroughly. Test for speed and correctness against a scalar version of your code. Different features of each processor and how well they operate means that this is a wide open field. The more effort you put into it, the more you will get out.

References:

The GCC include files that map intrinsics to compiler built-ins (eg arm_neon.h) and the GCC info pages that explain those built-ins: