Parallel Programming with NVIDIA CUDA
Programmers have been interested in leveraging the highly parallel processing power of video cards to speed up applications that are not graphic in nature for a long time. Here, I explain how to do this with the CUDA API from NVIDIA. If your GPU is not from NVIDIA, you are not out of luck, as the same can be achieved with other APIs, such as the ATI-based Stream SDK or OpenCL.
With GPGPU, general-purpose applications are executed directly on the streaming processors of video cards. Under the stream processing paradigm, a data set is named a stream. You can think of it much like “file streams” provided by an OS's pipe function.
Streams can be any isolated piece of data, such as a stream of business events or a set of scientific data. Parallel operations are applied on streams with operators, such as split, compute or merge. Figure 1 shows several streams of data and compute operators in parallel.
Stream processing has been used successfully for general programming, including dataflow programming, financial calculation and industrial automation, just to name a few. Furthermore, system engineers and vendors such as Dell, ASUS, Western Scientific and Microway are building clusters of video cards that are similar to supercomputers, and they're available at a fraction of the cost of their CPU-based counterparts.
You can find many examples of real-life applications that were sped up using CUDA acceleration showcased by NVIDIA at www.nvidia.com/cuda.
Now that I've brushed upon what CUDA and stream processing are, let's start looking into a couple compute-intensive algorithms you can use to give it a spin.
Vector fields are constructs employed in a variety of professions. In robotics, vector fields can help a mobile robot navigate through a room. Let's define a destination and add one or more obstacles. A good scenario for testing CUDA consists in calculating a series of vectors that indicate the direction a robot should follow in order to reach its destination while avoiding all the obstacles present. The robot should also avoid local minima (see below). Figure 2 shows the robot and vector field (the green arrows are the “vectors”).
I refer to the target point as an attractor and to obstacles as repulsors—the arrows point toward the attractor and away from repulsors (Figure 3). So, how do you calculate the vector field? The vector field is composed of a series of individual fields, one for each attractor and repulsor.
Each individual field is calculated by computing the direction toward the attractor and away from repulsors at each point in the room. Once all of the vectors have been calculated, you obtain the complete vector field by adding them up.
For this example, I will have three streams and two compute operators. The list of attractors and repulsors will be used as the input stream. Then, a compute operator will be applied to it to obtain a second stream: the vector field. Finally, a second compute operator will provide another stream: the local minima field.
Why is this a good demonstration of CUDA? When deciding whether an algorithm is a good candidate for parallelization, you should consider the following criteria:
Is the problem compute-intensive?
Can the problem be modeled as a stream process?
Is the code independent of any shared resources?
What sequences of code are independent of any other code?
Can the data be represented as arrays of 32-bit objects?
Are there no optimizations of the sequential algorithm possible?
In my case, the vector field may be large and could take a long time when evaluating the whole field. The path a robot should follow can be modeled easily with streams. There is no access to shared resources, and the computation of each element in the field is independent from all the others.
In terms of computation, robotic engineers usually constrain their algorithms to calculate only the part of the vector field that is needed at a given time, never evaluating the entire vector field. Next, I show you how you can use stream processing for calculating the whole vector field in real time. Let's get started.
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Designing Electronics with Linux | May 22, 2013 |
| Dynamic DNS—an Object Lesson in Problem Solving | May 21, 2013 |
| Using Salt Stack and Vagrant for Drupal Development | May 20, 2013 |
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
- New Products
- Linux Systems Administrator
- Senior Perl Developer
- Technical Support Rep
- UX Designer
- Web & UI Developer (JavaScript & j Query)
- Designing Electronics with Linux
- Dynamic DNS—an Object Lesson in Problem Solving
- Using Salt Stack and Vagrant for Drupal Development
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Nice article, thanks for the
6 hours 41 min ago - I once had a better way I
12 hours 27 min ago - Not only you I too assumed
12 hours 44 min ago - another very interesting
14 hours 38 min ago - Reply to comment | Linux Journal
16 hours 31 min ago - Reply to comment | Linux Journal
23 hours 25 min ago - Reply to comment | Linux Journal
23 hours 41 min ago - Favorite (and easily brute-forced) pw's
1 day 1 hour ago - Have you tried Boxen? It's a
1 day 7 hours ago - seo services in india
1 day 11 hours ago









Comments
The statement minima[y][x] =
The statement
minima[y][x] = (norm(field[y][x]) < threshold) ? true : false
may incur branching penalty
You can just use the first part
minima[y][x] = (norm(field[y][x]) < threshold)