Parallel Programming with NVIDIA CUDA

Using hardware acceleration via general-purpose computing on stock GPUs (GPGPU), I've sped up my algorithms more than tenfold. This article shows how you can achieve these results too!

Programmers have long been interested in leveraging the highly parallel processing power of video cards to speed up applications that are not graphical in nature. Here, I explain how to do this with the CUDA API from NVIDIA. If your GPU is not from NVIDIA, you are not out of luck, as the same can be achieved with other APIs, such as the ATI Stream SDK or OpenCL.

GPGPU and Stream Processing

With GPGPU, general-purpose applications are executed directly on the streaming processors of video cards. Under the stream processing paradigm, a data set is called a stream. You can think of it much like the “file streams” that pass through an OS's pipes.

Streams can be any isolated piece of data, such as a stream of business events or a set of scientific data. Parallel operations are applied on streams with operators, such as split, compute or merge. Figure 1 shows several streams of data and compute operators in parallel.
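The split, compute and merge operators can be sketched on the CPU as follows. This is only an illustration of the paradigm: the Stream alias and the three helper functions are hypothetical names, not part of CUDA or of any stream-processing library, and each chunk is processed in a plain loop where a GPU would process the elements in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: a stream is modeled as an ordered collection of independent values.
using Stream = std::vector<float>;

// split: divide a stream into `parts` contiguous chunks.
std::vector<Stream> split(const Stream& in, std::size_t parts) {
    assert(parts > 0);
    std::vector<Stream> out(parts);
    const std::size_t chunk = (in.size() + parts - 1) / parts;  // ceiling division
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i / chunk].push_back(in[i]);
    }
    return out;
}

// compute: apply the same operation to every element, independently of the others.
template <typename Op>
Stream compute(const Stream& in, Op op) {
    Stream out;
    out.reserve(in.size());
    for (float v : in) {
        out.push_back(op(v));
    }
    return out;
}

// merge: concatenate the chunks back into a single stream, preserving order.
Stream merge(const std::vector<Stream>& parts) {
    Stream out;
    for (const Stream& p : parts) {
        out.insert(out.end(), p.begin(), p.end());
    }
    return out;
}
```

Because compute touches each element in isolation, the chunks produced by split could be handed to different processors without any coordination between them.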

Figure 1. Stream Processing Diagram

Stream processing has been used successfully for general programming, including dataflow programming, financial calculation and industrial automation, just to name a few. Furthermore, system engineers and vendors such as Dell, ASUS, Western Scientific and Microway are building clusters of video cards that are similar to supercomputers, and they're available at a fraction of the cost of their CPU-based counterparts.

You can find many examples of real-life applications that were sped up using CUDA acceleration showcased by NVIDIA at

Identifying an Algorithm to Parallelize

Now that I've touched on what CUDA and stream processing are, let's look at a couple of compute-intensive algorithms you can use to give it a spin.

Vector fields are constructs employed in a variety of professions. In robotics, vector fields can help a mobile robot navigate through a room. Let's define a destination and add one or more obstacles. A good scenario for testing CUDA consists of calculating a series of vectors that indicate the direction a robot should follow in order to reach its destination while avoiding all the obstacles present. The robot should also avoid local minima (see below). Figure 2 shows the robot and vector field (the green arrows are the “vectors”).

Figure 2. The mobile robot wants to reach the center of the board. The vector field shows the way.

I refer to the target point as an attractor and to obstacles as repulsors—the arrows point toward the attractor and away from repulsors (Figure 3). So, how do you calculate the vector field? The vector field is composed of a series of individual fields, one for each attractor and repulsor.

Figure 3. Attractor and Repulsor

Each individual field is calculated by computing the direction toward the attractor and away from repulsors at each point in the room. Once all of the vectors have been calculated, you obtain the complete vector field by adding them up.
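A CPU reference for this summation might look like the sketch below. The Vec2 and Source types, the sign convention for strength (positive attracts, negative repels), the unit-length contributions and all function names are assumptions made for illustration; the actual kernel may weight each contribution differently, for example by distance.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec2 { float x; float y; };

// Assumed convention: strength > 0 is an attractor, strength < 0 is a repulsor.
struct Source { float x; float y; float strength; };

// Row-major index of cell (x, y) in a grid `width` cells wide.
inline std::size_t cellIndex(std::size_t x, std::size_t y, std::size_t width) {
    assert(x < width);  // guard against wrapping into the next row
    return y * width + x;
}

// Individual field of one source at point (px, py): a unit vector toward the
// source, scaled by its strength (so a repulsor's vector points away from it).
Vec2 individualField(const Source& s, float px, float py) {
    const float dx = s.x - px;
    const float dy = s.y - py;
    const float len = std::sqrt(dx * dx + dy * dy);
    if (len == 0.0f) {
        return {0.0f, 0.0f};  // no defined direction at the source itself
    }
    return {s.strength * dx / len, s.strength * dy / len};
}

// Complete field: at every cell, add up the individual fields of all sources.
std::vector<Vec2> computeField(const std::vector<Source>& sources,
                               std::size_t width, std::size_t height) {
    std::vector<Vec2> field(width * height, Vec2{0.0f, 0.0f});
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            Vec2& cell = field[cellIndex(x, y, width)];
            for (const Source& s : sources) {
                const Vec2 v = individualField(s, static_cast<float>(x),
                                               static_cast<float>(y));
                cell.x += v.x;
                cell.y += v.y;
            }
        }
    }
    return field;
}
```

The body of the two outer loops depends only on x, y and the read-only list of sources, which is what makes it a natural candidate for running one cell per GPU thread.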

For this example, I will have three streams and two compute operators. The list of attractors and repulsors will be used as the input stream. Then, a compute operator will be applied to it to obtain a second stream: the vector field. Finally, a second compute operator will provide another stream: the local minima field.
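The second compute operator can be sketched as below, assuming a local minimum is any cell where the summed vector is nearly zero, so the robot receives no usable direction. The Vec2 type, the flat row-major layout, the threshold parameter and the function names are illustrative assumptions rather than the article's actual code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec2 { float x; float y; };

// Euclidean length of a vector.
float norm(const Vec2& v) {
    return std::sqrt(v.x * v.x + v.y * v.y);
}

// Second compute operator: turn the vector-field stream into a stream of
// flags, one per cell, set to 1 where the vector is shorter than `threshold`.
std::vector<unsigned char> findLocalMinima(const std::vector<Vec2>& field,
                                           float threshold) {
    assert(threshold >= 0.0f);
    std::vector<unsigned char> minima(field.size());
    for (std::size_t i = 0; i < field.size(); ++i) {
        minima[i] = norm(field[i]) < threshold;  // bool converts to 0 or 1
    }
    return minima;
}
```

Note that under this definition the attractor's own cell may also be flagged, since the vector there can be zero; a caller would need to treat the destination separately.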

Some Problems Can Be Parallelized, Others Not So Much

Why is this a good demonstration of CUDA? When deciding whether an algorithm is a good candidate for parallelization, you should consider the following criteria:

  • Is the problem compute-intensive?

  • Can the problem be modeled as a stream process?

  • Is the code independent of any shared resources?

  • What sequences of code are independent of any other code?

  • Can the data be represented as arrays of 32-bit objects?

  • Is the sequential algorithm already as optimized as it can be?

In my case, the vector field may be large, and evaluating the whole field could take a long time. The path a robot should follow can be modeled easily with streams. There is no access to shared resources, and the computation of each element in the field is independent of all the others.
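The independence property can be illustrated with a small sketch: if each cell is computed from nothing but its own index, the order of evaluation does not matter. The Cell type, cellFromIndex, evaluateCell and evaluateAll are hypothetical names, and the per-cell formula is a placeholder. In a CUDA kernel, the index would typically be derived from blockIdx, blockDim and threadIdx instead of a loop counter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Cell { std::size_t x; std::size_t y; };

// Map a flat index to 2-D coordinates in a row-major grid.
Cell cellFromIndex(std::size_t i, std::size_t width) {
    assert(width > 0);
    return {i % width, i / width};
}

// Placeholder per-cell computation: it reads only its own coordinates,
// so it never depends on the result of any other cell.
float evaluateCell(Cell c) {
    return static_cast<float>(c.x + 10 * c.y);
}

// Evaluate every cell, either front to back or back to front. Each iteration
// writes only its own output slot, so both orders give the same result.
std::vector<float> evaluateAll(std::size_t width, std::size_t height,
                               bool reverseOrder) {
    const std::size_t n = width * height;
    std::vector<float> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        const std::size_t i = reverseOrder ? n - 1 - k : k;
        out[i] = evaluateCell(cellFromIndex(i, width));
    }
    return out;
}
```

If the forward and reverse results ever differed, some cell would be depending on another cell's output, and the loop could not safely be split across threads without synchronization.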

In terms of computation, robotic engineers usually constrain their algorithms to calculate only the part of the vector field that is needed at a given time, never evaluating the entire vector field. Next, I show you how you can use stream processing for calculating the whole vector field in real time. Let's get started.




The statement minima[y][x] =


The statement
minima[y][x] = (norm(field[y][x]) < threshold) ? true : false
may incur a branching penalty.

You can just use the first part:
minima[y][x] = (norm(field[y][x]) < threshold)