Parallel Programming with NVIDIA CUDA
Let's develop a sequential version of a vector field calculation algorithm. The input stream is a list of attractors and a list of repulsors, as shown in Figure 2 and without obstacles (Figure 4). The output stream is a matrix called field. The attractors and repulsors are lists of 2-D points that indicate their positions.
A “field” matrix will hold your vector field data structure. Each element field[y][x] will hold a 2-D vector indicating where the robot should move toward when standing at point (x, y). This vector will be the sum of vectors associated with each attractor and repulsor (Figure 5).
When processing each attractor, the associated vector will be pointing from the current point (x, y) toward the attractor's position. When processing each repulsor, the associated vector will be pointing away from the repulsor's position. Note that the plus and minus operators are performing vector addition and subtraction. Sequential pseudo-code:
In Parameters: a list of attractors, a list of repulsors
Out Parameters: a zero-initialized vector field
calculate_vector_field_cpu(in attractors, in repulsors, out field):
for (y = 0 to height):
for (x = 0 to width):
for (attractor in attractors):
vector = attractor - point(x,y)
field[y][x] = field[y][x] + vector
for (repulsor in repulsors):
vector = point(x,y) - repulsor
if norm(vector) <= 2:
field[y][x] = field[y][x] + vector
return
Okay, so the sequential pseudo-code is ready. Now, let's partition the problem in order to use most of the processing cores on the GPU.
The calculation of each vector field element is independent from the calculation of the other elements. You can leverage this property to parallelize your algorithm. You can calculate each element of the vector field matrix in its own thread, effectively dividing the problem into smaller pieces.
Don't worry about the number of threads. Spawning as many threads as possible when developing CUDA algorithms is encouraged by NVIDIA. It will allow the algorithms to scale across several generations of devices, automatically increasing throughput, as NVIDIA adds more and more processing cores to its video cards.
With this in mind, let's develop a parallel version of our previous algorithm. Parallel pseudo-code:
In Parameters: list of attractors, list of repulsors
Out Parameters: a zero-initialized vector field
calculate_vector_field_gpu(in attractors, in repulsors, out field):
x = blockIdx.x * BLOCK_SIZE + threadIdx.x
y = blockIdx.y * BLOCK_SIZE + threadIdx.y
for (attractor in attractors):
vector = attractor - point(x,y)
field[y][x] = field[y][x] + vector
for (repulsor in repulsors):
vector = point(x,y) - repulsor
if norm(vector) <= 2:4444444444
field[y][x] = field[y][x] + vector
return
Notice I did away with both external for loops, and the points (x, y) are now calculated using a parallel statement.
The new pseudo-code is implemented as a kernel. A kernel is a function that executes on several GPU cores at the same time. Kernels are launched by a host program controlled from the regular CPU that configures the execution environment and supplies the parameters.
How does each thread know what position of the vector field it has to compute? This is where the blockIdx and threadIdx built-in CUDA variables come into place.
As you look at this code, it may not be obvious how this is a parallel implementation, but it's the blockIdx and threadIdx and the CUDA magic associated with them that makes it parallel. When the function is invoked, it actually is invoked multiple times using multiple threads, each thread calculating one part of the result (see the next section).
When the host code sets up an execution environment, it has to determine how the processing cores will be assigned work. As part of its duties, the host must determine how threads will be arranged logically. CUDA allows developers to arrange their threads in a 1-D, 2-D or 3-D structure. It helps developers express design in a natural manner.
Think about the example algorithm. Let's use the GPU to fill a 2-D matrix with data. It would be very convenient if you somehow could assign each thread a “position” in a two-coordinate system, because each thread could use its assigned coordinates to decide which element to compute.
CUDA provides a mechanism through which developers can specify how they want their threads arranged. The compiler takes care of the rest. This feature is available through what is known as a grid of thread blocks. In this example, the host will use a 2-D grid.
You probably are wondering, why a “grid of thread blocks” instead of a “grid of threads”?
Threads do not exist inside grids by themselves, but rather, they are arranged into thread blocks. Each thread is assigned an identifier within its block. Each block, in turn, is assigned an identifier within the grid. The built-in blockIdx and threadIdx variables help determine the current thread and block identifiers. From within a kernel, these identifiers can be seen simply as a C structure containing the thread's x, y and z coordinates.
Using this mechanism, you can have each thread calculate a global thread ID and calculate the x and y variables in your parallel pseudo-code. The pair (x, y) determines which element of the matrix has to be computed. Because each thread will have different values for (x, y), every point of the matrix could, in theory, be computed at the same time if you have enough threads.
Today’s modular x86 servers are compute-centric, designed as a least common denominator to support a wide range of IT workloads. Those generic, virtualized IT workloads have much different resource optimization requirements than hyperscale and cloud applications. They have resulted in a “one size fits all” enterprise IT architecture that is not optimized for a specific set of IT workloads, and especially not emerging hyperscale workloads, such as web applications, big data, and object storage. In this report, you will learn how shifting the focus from traditional compute-centric IT architectures to an innovative disaggregated fabric-based architecture can optimize and scale your data center.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
| Non-Linux FOSS: Seashore | May 10, 2013 |
| Trying to Tame the Tablet | May 08, 2013 |
| Dart: a New Web Programming Experience | May 07, 2013 |
- RSS Feeds
- New Products
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Drupal Is a Framework: Why Everyone Needs to Understand This
- A Topic for Discussion - Open Source Feature-Richness?
- Home, My Backup Data Center
- Developer Poll
- Dart: a New Web Programming Experience
- What's the tweeting protocol?
- New Products
- Thanks for taking the time to
23 min 49 sec ago - Linux is good
2 hours 21 min ago - Reply to comment | Linux Journal
2 hours 38 min ago - Web Hosting IQ
3 hours 8 min ago - Web Hosting IQ
3 hours 9 min ago - Web Hosting IQ
3 hours 10 min ago - Reply to comment | Linux Journal
6 hours 10 min ago - play with linux? i think you mean work-around linux
14 hours 36 min ago - Where is Epistle?
14 hours 42 min ago - You forgot OwnCloud
15 hours 12 min ago








Comments
The statement minima[y][x] =
The statement
minima[y][x] = (norm(field[y][x]) < threshold) ? true : false
may incur branching penalty
You can just use the first part
minima[y][x] = (norm(field[y][x]) < threshold)