Parallel Programming with NVIDIA CUDA
Now, let's apply a second operation that detects local minima on the computed vector field. Local minima are those places where all vectors are converging (Figure 6). Flagging out the local minima will prevent the mobile robot from stopping in one of them with none of the vectors guiding it out.
Under the stream processing model, operators can be daisy-chained: a second operator consumes the output of a first operator, much like the pipe operator of an operating system. In the example CUDA implementation, you will consume the vector field matrix stored in GPU memory. Sequential local minima detection pseudo-code:
In Parameters: calculated vector field, a decimal threshold
Out Parameters: a boolean matrix called "minima"
detect_local_minima_cpu(in field, in threshold, out minima):
for (y=0 to h):
for (x=0 to w):
minima[y][x] =
(norm(field[y][x]) < threshold)? true : false
return
The sequential algorithm takes the vector field as input and fills in a Boolean matrix of the same dimensions with values “true” or “false”, depending on whether the length is below a given threshold. Conversely, the matrix “minima” at position (x, y) indicates whether the norm of the vector located at (x, y) is less than the given threshold. Parallel local minima detection:
In Parameters: the calculated vector field, a decimal threshold
Out Parameters: a boolean matrix called "minima"
detect_local_minima_gpu(in field, in threshold, out minima):
x = blockIdx.x * BLOCK_SIZE + threadIdx.x
y = blockIdx.y * BLOCK_SIZE + threadIdx.y
minima[y][x] = (norm(field[y][x]) < threshold) ? true : false
The output is a field of Boolean values that indicates whether a given point is a local minimum.
At this point, I have implemented four algorithms. You can, of course, download all the source code from our Web site for free and try them out yourself.
So, how does a CUDA algorithm stack up against its CPU equivalent? Next, I compare the parallel versions against their sequential counterparts in order to find out. The hardware used for the benchmark implementation includes:
Intel Core 2 Duo E6320, running at 1.6GHz with 4GB of RAM.
NVIDIA GeForce 8600GT GPU.
Ubuntu Linux 8.10.
CUDA version 2.2.
I implemented all four algorithms in one C++ program that can switch between the CPU and the CUDA versions of the algorithms dynamically. Not only does this make the benchmarking process easier, but it also is a good technique for developing programs that can fall back to the CPU on a computer where CUDA is not supported.
Each of the benchmarks uses different vector field configurations, increasing the size of the field as well as the number of repulsors. The number of attractors always is set to just one. The size of the vector fields are: 16x16, 32x32, 64x64, 128x128 and 256x256. The repulsors are randomly distributed on the field with a ratio of one repulsor per 32 vector field points. Hence, the number of repulsors is 8, 32, 128, 512, 2048 and 8192.
Figure 7 shows the results of the benchmarks. I am using the notation “WxH/R”, where WxH denotes the vector field's dimensions and R the number of repulsors present. The execution time is in milliseconds on a logarithmic scale (so a small difference in graph size is actually a much larger speedup than it appears to be visually).
How much faster is the GPU? The speedup is calculated by dividing the execution time of the sequential algorithm by the execution time of the parallel algorithm (Figure 8).
Computation times are the closest in the case of a small vector field. However, even in that case, we get a speedup of 2.5 times just by switching to the CUDA implementation of the vector field calculation. The local minima detection becomes interesting to parallelize only with slightly larger data sets that are more compute-intensive than smaller ones.
On average, the speedup is around eight times for our algorithms. In layman's terms, this means if you have a computation that takes one work day to complete, just by switching to CUDA, you can have your results in less than one hour.
This provides significant benefits for computations that require a user to run a computation several times while correcting the parameters each time. Such iterative processes are frequent, for instance, in financial models.
Today’s modular x86 servers are compute-centric, designed as a least common denominator to support a wide range of IT workloads. Those generic, virtualized IT workloads have much different resource optimization requirements than hyperscale and cloud applications. They have resulted in a “one size fits all” enterprise IT architecture that is not optimized for a specific set of IT workloads, and especially not emerging hyperscale workloads, such as web applications, big data, and object storage. In this report, you will learn how shifting the focus from traditional compute-centric IT architectures to an innovative disaggregated fabric-based architecture can optimize and scale your data center.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Using Salt Stack and Vagrant for Drupal Development | May 20, 2013 |
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
| Non-Linux FOSS: Seashore | May 10, 2013 |
| Trying to Tame the Tablet | May 08, 2013 |
- RSS Feeds
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- New Products
- Drupal Is a Framework: Why Everyone Needs to Understand This
- A Topic for Discussion - Open Source Feature-Richness?
- Home, My Backup Data Center
- Validate an E-Mail Address with PHP, the Right Way
- Using Salt Stack and Vagrant for Drupal Development
- Tech Tip: Really Simple HTTP Server with Python
- New Products
- Ahh, the Koolaid.
18 min 25 sec ago - git-annex assistant
6 hours 18 min ago - direct cable connection
6 hours 40 min ago - Agreed on AirDroid. With my
6 hours 50 min ago - I just learned this
6 hours 54 min ago - enterprise
7 hours 25 min ago - not living upto the mobile revolution
10 hours 16 min ago - Deceptive Advertising and
10 hours 51 min ago - Let\'s declare that you have
10 hours 52 min ago - Alterations in Contest Due
10 hours 53 min ago









Comments
The statement minima[y][x] =
The statement
minima[y][x] = (norm(field[y][x]) < threshold) ? true : false
may incur branching penalty
You can just use the first part
minima[y][x] = (norm(field[y][x]) < threshold)