
High-Performance Parallel Computing Using CUDA on GPUs
Lecture 02
Prof. Md. Mamun Molla – AMCS502: High-Performance Computing
CUDA Thread Hierarchy

• Thread → Block → Grid


• 1D Thread Indexing:
• Thread index in a block: threadIdx.x
• Block index in the grid: blockIdx.x
• Number of threads per block: blockDim.x
• Global thread index:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
• Example:
• If blockDim.x = 256, blockIdx.x = 2, threadIdx.x = 10
• tid = 2*256 + 10 = 522
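A minimal sketch of how this 1D global index is typically used inside a kernel (the kernel and array names here are illustrative, not from the lecture); the bounds check lets spare threads in the last block do nothing:

__global__ void scale(float *x, float alpha, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global 1D thread index
    if (tid < n) {                                    // guard: last block may have extra threads
        x[tid] = alpha * x[tid];
    }
}

// Launch with enough 256-thread blocks to cover n elements:
// scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);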
2D Thread Indexing
Now threads are arranged in 2D inside a block and blocks are
arranged in 2D inside the grid.
• Thread coordinates in block: (threadIdx.x, threadIdx.y)
• Block coordinates in grid: (blockIdx.x, blockIdx.y)
• Block dimensions: (blockDim.x, blockDim.y)
Global thread coordinates:
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;

Flatten to a single global ID (for an M × N domain, with i in [0, M) and j in [0, N)):
int ixy = i + M * j;
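A minimal sketch of this 2D pattern (kernel and array names are illustrative): each thread writes one element of an M × N array stored with the flattened index above.

__global__ void fill2D(float *a, int M, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global x index
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // global y index
    if (i < M && j < N) {
        a[i + M * j] = (float)(i + j);               // flattened 2D index
    }
}

// Launch with a 2D grid of 2D blocks, e.g. 16 x 16 threads per block:
// dim3 block(16, 16);
// dim3 grid((M + 15) / 16, (N + 15) / 16);
// fill2D<<<grid, block>>>(d_a, M, N);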
3D Thread Indexing
Now both the grid and blocks have 3D shapes.
Thread coordinates in block: (threadIdx.x, threadIdx.y,
threadIdx.z)
Block coordinates in grid: (blockIdx.x, blockIdx.y, blockIdx.z)
Block dimensions: (blockDim.x, blockDim.y, blockDim.z)
Grid dimensions: (gridDim.x, gridDim.y, gridDim.z)
Global thread coordinates:
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
int k = blockIdx.z * blockDim.z + threadIdx.z;
Flatten to a single global ID (for an L × M × N domain, with i in [0, L), j in [0, M), and k in [0, N)):
int ixyz = i + L * j + L * M * k;
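A minimal 3D sketch under the same conventions (illustrative names):

__global__ void fill3D(float *v, int L, int M, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < L && j < M && k < N) {
        v[i + L * j + L * M * k] = 1.0f;             // flattened 3D index
    }
}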
Memory Transfer: Host ↔ Device
cudaMemcpy syntax

Example with cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost:

// Copy vectors from host to device
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

// Copy vectors from device to host
cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);
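For reference, the standard CUDA runtime signature is cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind). A minimal host-side round trip might look like this sketch (the function name roundTrip is ours, not from the lecture):

#include <cuda_runtime.h>

// Minimal round trip: host -> device -> host (illustrative only).
void roundTrip(float *h_A, int n) {
    size_t size = n * sizeof(float);
    float *d_A = NULL;
    cudaMalloc((void **)&d_A, size);                      // allocate device buffer
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);   // host -> device
    // ... launch a kernel that reads/writes d_A here ...
    cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d_A);                                        // release device buffer
}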
Common CUDA errors:
• Race condition
• Warp divergence
• Excessive synchronization

For debugging, use:
cudaGetLastError()
cudaDeviceSynchronize()
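A minimal sketch of how these two calls are commonly combined after a kernel launch (the CHECK_CUDA macro name is ours, not from the lecture):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA call reported an error.
#define CHECK_CUDA(err)                                                   \
    do {                                                                  \
        cudaError_t e = (err);                                            \
        if (e != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e));   \
            exit(1);                                                      \
        }                                                                 \
    } while (0)

// Usage after a kernel launch:
// myKernel<<<blocks, threads>>>(...);
// CHECK_CUDA(cudaGetLastError());        // catches launch-configuration errors
// CHECK_CUDA(cudaDeviceSynchronize());   // waits for the kernel and catches runtime errors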

Synchronization:
__syncthreads() is used to avoid race conditions.
__syncthreads() is a barrier synchronization for all threads in a block: it simply
waits until every thread in the block has reached that point before any proceeds.
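A minimal sketch of the kind of race __syncthreads() prevents (illustrative kernel; each thread reads its neighbour's value from shared memory, so all writes must finish first):

// Hypothetical kernel: out[i] = in[i] + in[i+1], staged through shared memory.
__global__ void neighbourSum(const float *in, float *out, int n) {
    __shared__ float s[256];                       // assumes blockDim.x == 256
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) s[threadIdx.x] = in[tid];
    __syncthreads();                               // without this barrier, a thread could read
                                                   // s[threadIdx.x + 1] before its neighbour has
                                                   // written it (a race condition)
    if (tid + 1 < n && threadIdx.x + 1 < blockDim.x) {   // block-boundary elements skipped for brevity
        out[tid] = s[threadIdx.x] + s[threadIdx.x + 1];
    }
}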
Warp divergence
• A warp is a group of 32 threads that execute the same instruction at the
same time on a GPU's streaming multiprocessor (SM).
• NVIDIA hardware is designed so that the smallest scheduling unit is 32 threads.
• Even if your kernel launches fewer than 32 threads in a block, the GPU still
allocates a full warp (the extra threads simply stay idle).
• The GPU executes instructions warp-by-warp, not thread-by-thread.
• If threads in a warp take different code paths (due to an if statement),
warp divergence occurs, and execution slows down because the branches run
serially.
Warp divergence
Analogy:
• Think of a warp as 32 workers walking in step.
• If everyone takes the same route, they move fast.
• If some workers split off to a side street, the main group must wait for them to
finish before proceeding.

• Warp size = 32 (for most modern NVIDIA GPUs, including Tesla M60)
• Warp scheduling: hardware schedules warps, not individual threads
• Performance tip: write kernels so that all threads in a warp follow the same
execution path (especially when you use if statements)
Warp Index Calculation
A warp is a group of 32 threads in NVIDIA GPUs that execute the same
instruction at the same time (SIMT: Single Instruction, Multiple Threads).

If you know the thread ID:
int warpId = threadIdx.x / 32; // warp number within the block
int laneId = threadIdx.x % 32; // position within the warp
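A minimal sketch that prints each thread's warp and lane (device-side printf is standard CUDA; the kernel name is ours). For multi-dimensional blocks, the same formulas apply to the linearized thread index.

#include <cstdio>

// Hypothetical kernel: report which warp and lane each thread belongs to.
__global__ void whoAmI() {
    int warpId = threadIdx.x / 32;   // warp number within the block
    int laneId = threadIdx.x % 32;   // position within the warp
    printf("thread %d -> warp %d, lane %d\n", threadIdx.x, warpId, laneId);
}

// Launch, e.g.: whoAmI<<<1, 64>>>();  // 64 threads = 2 warps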
How Warp Divergence Happens
In CUDA, warp divergence happens when threads within the same warp take
different execution paths due to conditional branching (like if, switch, or loops).

Example:
if (threadIdx.x % 2 == 0) {
    // Even threads do this
} else {
    // Odd threads do this
}
How Warp Divergence Happens

In a warp of 32 threads:

• 16 even threads go into one branch.

• 16 odd threads go into another branch.

The GPU can’t execute both at once → it serializes execution:

• Runs the even branch while odd threads are idle.

• Runs the odd branch while even threads are idle.


How to Reduce Warp Divergence
• Group similar work together so threads in the same warp take the same path
(see the sketch after the example below).
• Avoid complex branching inside kernels.
• Warp divergence should be avoided by writing code with uniform control flow
within a warp.
How to Reduce Warp Divergence
Example:
__global__ void avoidDivergence(int *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid < n) {
        // Instead of branching like:
        // if (tid % 2 == 0) out[tid] = tid * 2;
        // else              out[tid] = tid * 3;

        // Use a branch-free approach: compute both results and select arithmetically.
        int isEven = (tid % 2 == 0);
        out[tid] = isEven * (tid * 2) + (1 - isEven) * (tid * 3);
    }
}
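As a complementary sketch of the "group similar work together" advice (our own illustrative reorganization, not from the lecture): instead of branching on tid % 2, which splits every warp, assign the even-indexed work to the first half of the threads and the odd-indexed work to the second half, so that each warp follows a single path.

// Hypothetical kernel: threads below n/2 handle even elements, the rest handle odd elements.
// Whole warps fall on one side of the comparison (except the single boundary warp),
// so divergence is largely eliminated.
__global__ void groupedWork(int *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int half = n / 2;                     // assumes n is even, for simplicity
    if (tid < half) {
        int i = 2 * tid;                  // even elements
        out[i] = i * 2;
    } else if (tid < n) {
        int i = 2 * (tid - half) + 1;     // odd elements
        out[i] = i * 3;
    }
}

Note that this trades divergence for strided memory accesses; whether it pays off depends on the kernel.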
Typical Structure of a CUDA Program
– Global variables declaration
– Function prototypes
– __global__ void kernelOne(…)
– main()
– allocate memory space on the host and on the device, as needed –
h_a = malloc(size); cudaMalloc(&d_a, size);
– transfer data from host to device –
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
– execution configuration setup
– kernel call – kernelOne<<<execution configuration>>>( args… );
– transfer results from device to host –
cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
– repeat as needed
Vector addition serial code
Compile: nvcc vec_add.c -o serial.run
Run: ./serial.run

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 2048 // Length of the vectors

int main() {
    float A[N], B[N], C[N];

    // Initialize the vectors
    for (int i = 0; i < N; i++) {
        A[i] = i * 1.0f;
        B[i] = i * 2.0f;
    }

    clock_t start = clock();

    // Compute the sum
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }

    // Print the result
    printf("Vector addition result:\n");
    for (int i = 0; i < 10; i++) {
        printf("A[%d] + B[%d] = %.1f + %.1f = %.1f\n", i, i, A[i], B[i], C[i]);
    }

    clock_t end = clock();
    double elapsed_time = (double)(end - start) / CLOCKS_PER_SEC;
    printf("Elapsed time (serial) = %f seconds\n", elapsed_time);
    return 0;
}
Vector addition CUDA code
Compile: nvcc vec_add_cuda.cu -o cuda.run
Run: ./cuda.run

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>
#include <iostream>

#define N 2048 // Size of the vectors

__global__ void vectorAdd(float *A, float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}

int main(void) {
    float *h_A, *h_B, *h_C; // Host vectors
    float *d_A, *d_B, *d_C; // Device vectors
    size_t size = N * sizeof(float);

    h_A = (float *)malloc(size);
    h_B = (float *)malloc(size);
    h_C = (float *)malloc(size);

    // Initialize input vectors
    for (int i = 0; i < N; i++) {
        h_A[i] = i * 1.0f;
        h_B[i] = i * 2.0f;
    }

    // Allocate device memory
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    // Copy vectors from host to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch kernel with enough threads
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    clock_t begin_time = clock();

    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result back to host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Display a few results
    printf("Vector addition result (first 10 elements):\n");
    for (int i = 0; i < 10; i++) {
        printf("h_A[%d] + h_B[%d] = %.1f + %.1f = %.1f\n", i, i, h_A[i], h_B[i], h_C[i]);
    }

    clock_t end_time = clock();
    std::cout << "Spent Time = " << float(end_time - begin_time) / CLOCKS_PER_SEC << "\n";
    return 0;
}
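The listing above does not release its buffers before returning; as a small addition of ours (not on the slide), the usual cleanup just before return 0; would be:

    // Release device and host buffers
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);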
Shared memory in CUDA
• In CUDA, shared memory is a special, fast on-chip memory
that is shared among all threads in a block.
• Location and Speed:
• Location: On the GPU chip (inside the Streaming
Multiprocessor – SM).

• Speed: Much faster than global memory.

• Visible to all threads in the same block.

• Not accessible to threads in other blocks.


• Lifetime = duration of the kernel execution for that block.
Shared memory in CUDA
Why Use Shared Memory?
• Reduced global memory traffic → better performance.
• Inter-thread communication within a block.

Declare shared memory inside a kernel statically:
__shared__ float array[256];
Or dynamically (size decided at kernel launch, passed as the third <<<...>>> parameter):
extern __shared__ float array[];
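A minimal sketch of the dynamic form (kernel name and sizes are illustrative): the shared-memory byte count is supplied as the third launch parameter.

// Hypothetical kernel using dynamically sized shared memory.
__global__ void usesDynamicShared(const float *in, float *out, int n) {
    extern __shared__ float buf[];            // size supplied at launch time
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) buf[threadIdx.x] = in[tid];
    __syncthreads();                          // all threads in the block reach this barrier
    if (tid < n) out[tid] = 2.0f * buf[threadIdx.x];
}

// Launch: 256 threads per block, 256 * sizeof(float) bytes of shared memory per block.
// usesDynamicShared<<<(n + 255) / 256, 256, 256 * sizeof(float)>>>(d_in, d_out, n);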
Vector addition using shared memory
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>
#include <iostream>

#define N 2048 // Size of the vectors

__global__ void vectorAddShared(float *A, float *B, float *C, int n) {
    __shared__ float s_A[256];
    __shared__ float s_B[256];

    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid < n) {
        s_A[threadIdx.x] = A[tid];
        s_B[threadIdx.x] = B[tid];
        __syncthreads(); // Wait for all threads to load data
        C[tid] = s_A[threadIdx.x] + s_B[threadIdx.x];
    }
}

int main(void) {
    float *h_A, *h_B, *h_C; // Host vectors
    float *d_A, *d_B, *d_C; // Device vectors
    size_t size = N * sizeof(float);

    h_A = (float *)malloc(size);
    h_B = (float *)malloc(size);
    h_C = (float *)malloc(size);

    // Initialize input vectors
    for (int i = 0; i < N; i++) {
        h_A[i] = i * 1.0f;
        h_B[i] = i * 2.0f;
    }

    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    clock_t begin_time = clock();

    vectorAddShared<<<N/256, 256>>>(d_A, d_B, d_C, N);

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    for (int i = 0; i < 10; i++)
        printf("%f + %f = %f\n", h_A[i], h_B[i], h_C[i]);

    clock_t end_time = clock();
    std::cout << "Spent Time = " << float(end_time - begin_time) / CLOCKS_PER_SEC << "\n";

    return 0;
}
Matrix multiplication in CUDA
#include <stdio.h>
#include <cuda.h>

#define N 3 // Matrix size

__global__ void kernel(float *A, float *B, float *C) {
    int row = threadIdx.y;
    int col = threadIdx.x;

    float sum = 0.0f;
    for (int k = 0; k < N; k++) {
        sum += A[row * N + k] * B[k * N + col];
    }
    C[row * N + col] = sum;
}

int main() {
    int size = N * N * sizeof(float);
    float h_A[N * N] = {1, 2, 3,
                        4, 5, 6,
                        7, 8, 9};
    float h_B[N * N] = {9, 8, 7,
                        6, 5, 4,
                        3, 2, 1};
    float h_C[N * N]; // Result

    float *d_A, *d_B, *d_C;

    // Allocate device memory
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Copy matrices to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch kernel
    dim3 threadsPerBlock(N, N);
    kernel<<<1, threadsPerBlock>>>(d_A, d_B, d_C);

    // Copy result back to host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Display result
    printf("Result matrix C:\n");
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            printf("%6.1f ", h_C[i * N + j]);
        }
        printf("\n");
    }

    return 0;
}
dim3 threadsPerBlock(N, N)

dim3 threadsPerBlock(N, N) means we are creating a block of threads arranged in a 2D layout:
• N threads along the x-axis (columns)
• N threads along the y-axis (rows)

In our example, N = 3, so:
• 3 threads in the x-direction → threadIdx.x = 0, 1, 2
• 3 threads in the y-direction → threadIdx.y = 0, 1, 2

That makes 3 × 3 = 9 threads in one block.

Visually, with (threadIdx.x, threadIdx.y) as the coordinates inside the block:
(0,0) (1,0) (2,0)
(0,1) (1,1) (2,1)
(0,2) (1,2) (2,2)

Each thread will compute one element of the 3×3 output matrix.
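For matrices larger than a single block, the same idea extends to a 2D grid by combining blockIdx and threadIdx exactly as in the 2D-indexing formulas earlier. A sketch of that generalization (our own illustration, not from the lecture):

// Hypothetical kernel for an n x n matrix spread over many 2D blocks.
__global__ void matMulGrid(const float *A, const float *B, float *C, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global column index
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global row index
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++) {
            sum += A[row * n + k] * B[k * n + col];
        }
        C[row * n + col] = sum;
    }
}

// Launch, e.g. with 16 x 16 thread blocks covering the whole matrix:
// dim3 block(16, 16);
// dim3 grid((n + 15) / 16, (n + 15) / 16);
// matMulGrid<<<grid, block>>>(d_A, d_B, d_C, n);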
