How We Should Program GPGPUs
Current approaches to programming GPUs are still relatively immature. Programming them is much easier than it was a few years ago, when programmers had to cast their algorithms into OpenGL or something similar, but it's still unnecessarily difficult.
Programmers must manage (allocate and deallocate) the device; deal with the separate host and device memories, allocating and freeing device memory and moving data to and from the host; manage (upload) the kernel code; pack up the arguments; initiate the kernel; wait for completion; and free the code space when done. All this is in addition to writing the kernel in the first place, exposing the parallelism, optimizing the data access patterns and a host of other machine-specific items, testing and tuning. A matrix multiplication takes a few lines in FORTRAN or C. Converting this to CUDA or Brook takes a page or more of code, even when making simplifying assumptions. One might question whether there is a better way.
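For comparison, here is a minimal sketch of the "few lines in C" version. It assumes the same row-major layout and argument order as the matmul routines in the listings (A is N1 x N2, B is N2 x N3, C is N1 x N3); the name matmul_host is hypothetical and used only for illustration.

```c
#include <stddef.h>

/* Host-only reference (hypothetical name): C = A * B, row-major storage.
   A is N1 x N2, B is N2 x N3, C is N1 x N3. */
void matmul_host(const float *A, const float *B, float *C,
                 size_t N1, size_t N2, size_t N3)
{
    for (size_t i = 0; i < N1; ++i) {
        for (size_t k = 0; k < N3; ++k) {
            float sum = 0.0f;
            /* Dot product of row i of A with column k of B */
            for (size_t j = 0; j < N2; ++j)
                sum += A[i * N2 + j] * B[j * N3 + k];
            C[i * N3 + k] = sum;
        }
    }
}
```

Everything beyond this triple loop in the GPU versions is device, memory and kernel management.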
Listing 1. Simplified Matrix Multiplication in CUDA, Using Tiled Algorithm
__global__ void
matmulKernel( float* C, float* A, float* B, int N2, int N3 ){
    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;
    // Start of this block's 16-row band of A and 16-column band of B
    int aFirst = 16 * by * N2;
    int bFirst = 16 * bx;
    float Csub = 0;
    for( int j = 0; j < N2; j += 16 ) {
        // Each thread loads one element of the current A and B tiles
        __shared__ float Atile[16][16], Btile[16][16];
        Atile[ty][tx] = A[aFirst + j + N2 * ty + tx];
        Btile[ty][tx] = B[bFirst + j * N3 + N3 * ty + tx];
        __syncthreads();  // both tiles fully loaded before use
        for( int k = 0; k < 16; ++k )
            Csub += Atile[ty][k] * Btile[k][tx];
        __syncthreads();  // all threads done before the tiles are overwritten
    }
    int c = N3 * 16 * by + 16 * bx;
    C[c + N3 * ty + tx] = Csub;
}
// Simplified: assumes N1, N2 and N3 are all multiples of 16
void
matmul( float* A, float* B, float* C,
        size_t N1, size_t N2, size_t N3 ){
    float *devA, *devB, *devC;
    cudaSetDevice(0);
    cudaMalloc( (void**)&devA, N1*N2*sizeof(float) );
    cudaMalloc( (void**)&devB, N2*N3*sizeof(float) );
    cudaMalloc( (void**)&devC, N1*N3*sizeof(float) );
    cudaMemcpy( devA, A, N1*N2*sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( devB, B, N2*N3*sizeof(float), cudaMemcpyHostToDevice );
    dim3 threads( 16, 16 );
    // grid.x spans the N3 columns of C, grid.y spans its N1 rows
    dim3 grid( N3 / threads.x, N1 / threads.y );
    matmulKernel<<< grid, threads >>>( devC, devA, devB, N2, N3 );
    cudaMemcpy( C, devC, N1*N3*sizeof(float), cudaMemcpyDeviceToHost );
    cudaFree( devA );
    cudaFree( devB );
    cudaFree( devC );
}
Listing 2. Simplified Matrix Multiplication in Brook
kernel void
matmulKernel( float N2, float A[][], float B[][],
              out float result<> ){
    float2 ik = indexof(result).xy;
    float4 ijjk = float4( ik.x, 0.0f, 0.0f, ik.y );
    float4 jp1 = float4( 0.0f, 1.0f, 1.0f, 0.0f );
    float C = 0.0f;
    float n2 = N2;
    while( n2 > 0 ) {
        C += A[ijjk.zw]*B[ijjk.xy];
        ijjk += jp1;
        n2 -= 1.0f;
    }
    result = C;
}
void
matmul( float* A, float* B, float* C,
        size_t N1, size_t N2, size_t N3 ){
    float Astream<N1, N2>;
    float Bstream<N2, N3>;
    float Cstream<N1, N3>;
    streamRead( Astream, A );
    streamRead( Bstream, B );
    matmulKernel( (float)N2, Astream, Bstream, Cstream );
    streamWrite( Cstream, C );
}
Compilers are good at keeping track of details, and we should rely on them for that as much as possible. Is there anything specific to a GPU that makes it a more difficult compiler target than vector computers or attached processors, both of which had very successful, aggressive compilers? Could a compiler be created that would generate both host and GPU or accelerator code from a single source file, using standard C or FORTRAN, without language extensions?
I think it's feasible (though nontrivial), and a good idea. Here, I discuss what such a compiler might look like and what steps it would have to take. Two overriding goals are that the compiler operates just like a host compiler, except perhaps with a command-line flag to enable or disable the GPU code generator, and that no changes are needed to the other system tools (such as a linker and library archiver).
A significant difference between such a compiler and one for a vector computer has to do with the cost of failure. If a compiler fails to vectorize a specific loop, the performance cost can be a factor of 5 or 10, which is enough that a programmer will pay attention to messages from the compiler. If a compiler does a bad job of code generation for a GPU, the cost can be a slowdown (relative to host code) of a factor of 10 or 100. This is enough that a fully hands-off, automatic approach just isn't feasible, at least not yet. At all steps, a programmer must be able to understand what the compiler has done and, if necessary, to override it.
The first, and perhaps most important, step is to select what part or parts of the program should be converted to a kernel. Currently, that is done explicitly by a programmer who rips out that part of the program, replaces it with the code to manage the GPU, writes a kernel to execute on the GPU and combines it all into a single program.
Abstractly, we can use compute intensity to determine the parts of the program that are attractive for GPU acceleration. Compute intensity for a function, loop or block of code is the ratio of the number of operations to the amount of data that needs to be moved. For GPU computing, we are most limited by the host-GPU bandwidth, so the critical ratio is the number of operations that the GPU will execute divided by the amount of data that needs to be moved from the host to the GPU and back. If the ratio is high enough, it's worth the cost of the data movement to get the high compute bandwidth of the GPU, assuming the computation has enough parallelism.
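As a rough illustration of that ratio, the sketch below estimates compute intensity for the matrix multiplication in the listings. It assumes data is measured in bytes, that A and B are copied to the GPU and C is copied back, and that each inner-loop step counts as two operations (a multiply and an add). The function names and the threshold parameter are hypothetical; a useful threshold depends on the particular host, GPU and bus.

```c
#include <assert.h>
#include <stddef.h>

/* Operations the GPU would execute: one multiply and one add
   for each of the N1 * N2 * N3 inner-loop steps. */
double matmul_ops(size_t N1, size_t N2, size_t N3)
{
    return 2.0 * (double)N1 * (double)N2 * (double)N3;
}

/* Bytes crossing the host-GPU link: A and B go to the device,
   C comes back. All elements are single-precision floats. */
double matmul_bytes_moved(size_t N1, size_t N2, size_t N3)
{
    double elements = (double)N1 * N2 + (double)N2 * N3 + (double)N1 * N3;
    return elements * sizeof(float);
}

/* Compute intensity: operations per byte moved between host and GPU. */
double compute_intensity(double ops, double bytes_moved)
{
    assert(bytes_moved > 0.0);
    return ops / bytes_moved;
}

/* Hypothetical decision rule: offload only when the intensity reaches
   a machine-specific threshold (the threshold is an assumption). */
int worth_offloading(double intensity, double threshold)
{
    return intensity >= threshold;
}
```

Under these assumptions, a 1024 x 1024 x 1024 multiplication performs about 171 operations per byte moved, while adding two vectors (one operation per 12 bytes moved) performs about 0.08. The first is a promising kernel candidate; the second would be dominated by transfer time unless its data already resides on the GPU.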