
High-Performance Parallel Computing Using CUDA on GPUs
Lecture 02
Prof. Md. Mamun Molla – AMCS502: High-Performance Computing
CUDA Thread Hierarchy

• Thread → Block → Grid


• 1D Thread Indexing:
• Thread index in a block: threadIdx.x
• Block index in the grid: blockIdx.x
• Number of threads per block: blockDim.x
• Global thread index:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
• Example:
• If blockDim.x = 256, blockIdx.x = 2, threadIdx.x = 10
• tid = 2*256 + 10 = 522
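A minimal sketch of how this 1D global index is typically used inside a kernel (the kernel and array names here are illustrative, not from the lecture); the bounds check lets spare threads in the last block do nothing:

__global__ void scale(float *x, float alpha, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global 1D thread index
    if (tid < n) {                                    // guard: last block may have extra threads
        x[tid] = alpha * x[tid];
    }
}

// Launch with enough 256-thread blocks to cover n elements:
// scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);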
2D Thread Indexing
Now threads are arranged in 2D inside a block and blocks are
arranged in 2D inside the grid.
• Thread coordinates in block: (threadIdx.x, threadIdx.y)
• Block coordinates in grid: (blockIdx.x, blockIdx.y)
• Block dimensions: (blockDim.x, blockDim.y)
Global thread coordinates:
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;

Flatten to a single global ID (for an M × N domain, with i in [0, M) and j in [0, N)):
int ixy = i + M * j;
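A minimal sketch of this 2D pattern (kernel and array names are illustrative): each thread writes one element of an M × N array stored with the flattened index above.

__global__ void fill2D(float *a, int M, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global x index
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // global y index
    if (i < M && j < N) {
        a[i + M * j] = (float)(i + j);               // flattened 2D index
    }
}

// Launch with a 2D grid of 2D blocks, e.g. 16 x 16 threads per block:
// dim3 block(16, 16);
// dim3 grid((M + 15) / 16, (N + 15) / 16);
// fill2D<<<grid, block>>>(d_a, M, N);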
3D Thread Indexing
Now both the grid and blocks have 3D shapes.
Thread coordinates in block: (threadIdx.x, threadIdx.y,
threadIdx.z)
Block coordinates in grid: (blockIdx.x, blockIdx.y, blockIdx.z)
Block dimensions: (blockDim.x, blockDim.y, blockDim.z)
Grid dimensions: (gridDim.x, gridDim.y, gridDim.z)
Global thread coordinates:
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
int k = blockIdx.z * blockDim.z + threadIdx.z;
Flatten to a single global ID (for an L × M × N domain, with i in [0, L), j in [0, M), and k in [0, N)):
int ixyz = i + L * j + L * M * k;
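A minimal 3D sketch under the same conventions (illustrative names):

__global__ void fill3D(float *v, int L, int M, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < L && j < M && k < N) {
        v[i + L * j + L * M * k] = 1.0f;             // flattened 3D index
    }
}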
Memory Transfer: Host ↔ Device
cudaMemcpy syntax

Example with cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost:

// Copy vectors from host to device
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

// Copy vectors from device to host
cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);
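For reference, the standard CUDA runtime signature is cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind). A minimal host-side round trip might look like this sketch (the function name roundTrip is ours, not from the lecture):

#include <cuda_runtime.h>

// Minimal round trip: host -> device -> host (illustrative only).
void roundTrip(float *h_A, int n) {
    size_t size = n * sizeof(float);
    float *d_A = NULL;
    cudaMalloc((void **)&d_A, size);                      // allocate device buffer
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);   // host -> device
    // ... launch a kernel that reads/writes d_A here ...
    cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d_A);                                        // release device buffer
}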
Common CUDA errors:
• Race condition
• Warp divergence
• Excessive synchronization

For debugging, use:
cudaGetLastError()
cudaDeviceSynchronize()
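A minimal sketch of how these two calls are commonly combined after a kernel launch (the CHECK_CUDA macro name is ours, not from the lecture):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA call reported an error.
#define CHECK_CUDA(err)                                                   \
    do {                                                                  \
        cudaError_t e = (err);                                            \
        if (e != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e));   \
            exit(1);                                                      \
        }                                                                 \
    } while (0)

// Usage after a kernel launch:
// myKernel<<<blocks, threads>>>(...);
// CHECK_CUDA(cudaGetLastError());        // catches launch-configuration errors
// CHECK_CUDA(cudaDeviceSynchronize());   // waits for the kernel and catches runtime errors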

Synchronization:
__syncthreads() is used to avoid race conditions.
__syncthreads() is a barrier synchronization for all threads in a block: it simply
waits until every thread in the block has reached that point before any proceeds.
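A minimal sketch of the kind of race __syncthreads() prevents (illustrative kernel; each thread reads its neighbour's value from shared memory, so all writes must finish first):

// Hypothetical kernel: out[i] = in[i] + in[i+1], staged through shared memory.
__global__ void neighbourSum(const float *in, float *out, int n) {
    __shared__ float s[256];                       // assumes blockDim.x == 256
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) s[threadIdx.x] = in[tid];
    __syncthreads();                               // without this barrier, a thread could read
                                                   // s[threadIdx.x + 1] before its neighbour has
                                                   // written it (a race condition)
    if (tid + 1 < n && threadIdx.x + 1 < blockDim.x) {   // block-boundary elements skipped for brevity
        out[tid] = s[threadIdx.x] + s[threadIdx.x + 1];
    }
}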
Warp divergence
• A warp is a group of 32 threads that execute the same instruction at the
same time on a GPU's streaming multiprocessor (SM).
• NVIDIA hardware is designed so that the smallest scheduling unit is 32 threads.
• Even if your kernel launches fewer than 32 threads in a block, the GPU still
allocates a full warp (the extra threads simply stay idle).
• The GPU executes instructions warp-by-warp, not thread-by-thread.
• If threads in a warp take different code paths (due to an if statement),
warp divergence occurs, and execution slows down because the branches run
serially.
Warp divergence
Analogy:
• Think of a warp as 32 workers walking in step.
• If everyone takes the same route, they move fast.
• If some workers split off to a side street, the main group must wait for them to
finish before proceeding.

• Warp size = 32 (for most modern NVIDIA GPUs, including Tesla M60)
• Warp scheduling: hardware schedules warps, not individual threads
• Performance tip: write kernels so that all threads in a warp follow the same
execution path (especially when you use if statements)
Warp Index Calculation
A warp is a group of 32 threads in NVIDIA GPUs that execute the same
instruction at the same time (SIMT: Single Instruction, Multiple Threads).

If you know the thread ID:
int warpId = threadIdx.x / 32; // warp number within the block
int laneId = threadIdx.x % 32; // position within the warp
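A minimal sketch that prints each thread's warp and lane (device-side printf is standard CUDA; the kernel name is ours). For multi-dimensional blocks, the same formulas apply to the linearized thread index.

#include <cstdio>

// Hypothetical kernel: report which warp and lane each thread belongs to.
__global__ void whoAmI() {
    int warpId = threadIdx.x / 32;   // warp number within the block
    int laneId = threadIdx.x % 32;   // position within the warp
    printf("thread %d -> warp %d, lane %d\n", threadIdx.x, warpId, laneId);
}

// Launch, e.g.: whoAmI<<<1, 64>>>();  // 64 threads = 2 warps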
How Warp Divergence Happens
In CUDA, warp divergence happens when threads within the same warp take
different execution paths due to conditional branching (like if, switch, or loops).

Example:
if (threadIdx.x % 2 == 0) {
    // Even threads do this
} else {
    // Odd threads do this
}
How Warp Divergence Happens

In a warp of 32 threads:

• 16 even threads go into one branch.

• 16 odd threads go into another branch.

The GPU can’t execute both at once → it serializes execution:

• Runs the even branch while odd threads are idle.

• Runs the odd branch while even threads are idle.


How to Reduce Warp Divergence
• Group similar work together so threads in the same warp take the same path
(see the sketch after the example below).
• Avoid complex branching inside kernels.
• Warp divergence should be avoided by writing code with uniform control flow
within a warp.
How to Reduce Warp Divergence
Example:
__global__ void avoidDivergence(int *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid < n) {
        // Instead of branching like:
        // if (tid % 2 == 0) out[tid] = tid * 2;
        // else              out[tid] = tid * 3;

        // Use a branch-free approach: compute both results and select arithmetically.
        int isEven = (tid % 2 == 0);
        out[tid] = isEven * (tid * 2) + (1 - isEven) * (tid * 3);
    }
}
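As a complementary sketch of the "group similar work together" advice (our own illustrative reorganization, not from the lecture): instead of branching on tid % 2, which splits every warp, assign the even-indexed work to the first half of the threads and the odd-indexed work to the second half, so that each warp follows a single path.

// Hypothetical kernel: threads below n/2 handle even elements, the rest handle odd elements.
// Whole warps fall on one side of the comparison (except the single boundary warp),
// so divergence is largely eliminated.
__global__ void groupedWork(int *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int half = n / 2;                     // assumes n is even, for simplicity
    if (tid < half) {
        int i = 2 * tid;                  // even elements
        out[i] = i * 2;
    } else if (tid < n) {
        int i = 2 * (tid - half) + 1;     // odd elements
        out[i] = i * 3;
    }
}

Note that this trades divergence for strided memory accesses; whether it pays off depends on the kernel.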
Typical Structure of a CUDA Program
– Global variables declaration
– Function prototypes
– __global__ void kernelOne(…)
– main()
– allocate memory space on the host and on the device, as needed –
h_a = malloc(size); cudaMalloc(&d_a, size);
– transfer data from host to device –
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
– execution configuration setup
– kernel call – kernelOne<<<execution configuration>>>( args… );
– transfer results from device to host –
cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
– repeat as needed
Vector addition serial code
Compile: nvcc vec_add.c -o serial.run
Run: ./serial.run

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 2048 // Length of the vectors

int main() {
    float A[N], B[N], C[N];

    // Initialize the vectors
    for (int i = 0; i < N; i++) {
        A[i] = i * 1.0f;
        B[i] = i * 2.0f;
    }

    clock_t start = clock();

    // Compute the sum
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }

    // Print the result
    printf("Vector addition result:\n");
    for (int i = 0; i < 10; i++) {
        printf("A[%d] + B[%d] = %.1f + %.1f = %.1f\n", i, i, A[i], B[i], C[i]);
    }

    clock_t end = clock();
    double elapsed_time = (double)(end - start) / CLOCKS_PER_SEC;
    printf("Elapsed time (serial) = %f seconds\n", elapsed_time);
    return 0;
}
Vector addition CUDA code
Compile: nvcc vec_add_cuda.cu -o cuda.run
Run: ./cuda.run

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>
#include <iostream>

#define N 2048 // Size of the vectors

__global__ void vectorAdd(float *A, float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}

int main(void) {
    float *h_A, *h_B, *h_C; // Host vectors
    float *d_A, *d_B, *d_C; // Device vectors
    size_t size = N * sizeof(float);

    h_A = (float *)malloc(size);
    h_B = (float *)malloc(size);
    h_C = (float *)malloc(size);

    // Initialize input vectors
    for (int i = 0; i < N; i++) {
        h_A[i] = i * 1.0f;
        h_B[i] = i * 2.0f;
    }

    // Allocate device memory
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    // Copy vectors from host to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch kernel with enough threads
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    clock_t begin_time = clock();

    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result back to host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Display a few results
    printf("Vector addition result (first 10 elements):\n");
    for (int i = 0; i < 10; i++) {
        printf("h_A[%d] + h_B[%d] = %.1f + %.1f = %.1f\n", i, i, h_A[i], h_B[i], h_C[i]);
    }

    clock_t end_time = clock();
    std::cout << "Spent Time = " << float(end_time - begin_time) / CLOCKS_PER_SEC << "\n";
    return 0;
}
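The listing above does not release its buffers before returning; as a small addition of ours (not on the slide), the usual cleanup just before return 0; would be:

    // Release device and host buffers
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);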
Shared memory in CUDA
• In CUDA, shared memory is a special, fast on-chip memory
that is shared among all threads in a block.
• Location and Speed:
• Location: On the GPU chip (inside the Streaming
Multiprocessor – SM).

• Speed: Much faster than global memory.

• Visible to all threads in the same block.

• Not accessible to threads in other blocks.


• Lifetime = duration of the kernel execution for that block.
Shared memory in CUDA
Why Use Shared Memory?
• Reduced global memory traffic → better performance.
• Inter-thread communication within a block.

Declare shared memory inside a kernel statically:
__shared__ float array[256];
Or dynamically (size decided at kernel launch, passed as the third <<<...>>> parameter):
extern __shared__ float array[];
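A minimal sketch of the dynamic form (kernel name and sizes are illustrative): the shared-memory byte count is supplied as the third launch parameter.

// Hypothetical kernel using dynamically sized shared memory.
__global__ void usesDynamicShared(const float *in, float *out, int n) {
    extern __shared__ float buf[];            // size supplied at launch time
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) buf[threadIdx.x] = in[tid];
    __syncthreads();                          // all threads in the block reach this barrier
    if (tid < n) out[tid] = 2.0f * buf[threadIdx.x];
}

// Launch: 256 threads per block, 256 * sizeof(float) bytes of shared memory per block.
// usesDynamicShared<<<(n + 255) / 256, 256, 256 * sizeof(float)>>>(d_in, d_out, n);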
Vector addition using shared memory
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>
#include <iostream>

#define N 2048 // Size of the vectors

__global__ void vectorAddShared(float *A, float *B, float *C, int n) {
    __shared__ float s_A[256];
    __shared__ float s_B[256];

    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid < n) {
        s_A[threadIdx.x] = A[tid];
        s_B[threadIdx.x] = B[tid];
        __syncthreads(); // Wait for all threads to load data
        C[tid] = s_A[threadIdx.x] + s_B[threadIdx.x];
    }
}

int main(void) {
    float *h_A, *h_B, *h_C; // Host vectors
    float *d_A, *d_B, *d_C; // Device vectors
    size_t size = N * sizeof(float);

    h_A = (float *)malloc(size);
    h_B = (float *)malloc(size);
    h_C = (float *)malloc(size);

    // Initialize input vectors
    for (int i = 0; i < N; i++) {
        h_A[i] = i * 1.0f;
        h_B[i] = i * 2.0f;
    }

    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    clock_t begin_time = clock();

    vectorAddShared<<<N/256, 256>>>(d_A, d_B, d_C, N);

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    for (int i = 0; i < 10; i++)
        printf("%f + %f = %f\n", h_A[i], h_B[i], h_C[i]);

    clock_t end_time = clock();
    std::cout << "Spent Time = " << float(end_time - begin_time) / CLOCKS_PER_SEC << "\n";

    return 0;
}
Matrix multiplication in CUDA
#include <stdio.h>
#include <cuda.h>

#define N 3 // Matrix size

__global__ void kernel(float *A, float *B, float *C) {
    int row = threadIdx.y;
    int col = threadIdx.x;

    float sum = 0.0f;
    for (int k = 0; k < N; k++) {
        sum += A[row * N + k] * B[k * N + col];
    }
    C[row * N + col] = sum;
}

int main() {
    int size = N * N * sizeof(float);
    float h_A[N * N] = {1, 2, 3,
                        4, 5, 6,
                        7, 8, 9};
    float h_B[N * N] = {9, 8, 7,
                        6, 5, 4,
                        3, 2, 1};
    float h_C[N * N]; // Result

    float *d_A, *d_B, *d_C;

    // Allocate device memory
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Copy matrices to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch kernel
    dim3 threadsPerBlock(N, N);
    kernel<<<1, threadsPerBlock>>>(d_A, d_B, d_C);

    // Copy result back to host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Display result
    printf("Result matrix C:\n");
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            printf("%6.1f ", h_C[i * N + j]);
        }
        printf("\n");
    }

    return 0;
}
dim3 threadsPerBlock(N, N)

dim3 threadsPerBlock(N, N) means we are creating a block of threads arranged in a 2D layout:
• N threads along the x-axis (columns)
• N threads along the y-axis (rows)

In our example, N = 3, so:
• 3 threads in the x-direction → threadIdx.x = 0, 1, 2
• 3 threads in the y-direction → threadIdx.y = 0, 1, 2

That makes 3 × 3 = 9 threads in one block.

Visually, with (threadIdx.x, threadIdx.y) as the coordinates inside the block:
(0,0) (1,0) (2,0)
(0,1) (1,1) (2,1)
(0,2) (1,2) (2,2)

Each thread will compute one element of the 3×3 output matrix.
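For matrices larger than a single block, the same idea extends to a 2D grid by combining blockIdx and threadIdx exactly as in the 2D-indexing formulas earlier. A sketch of that generalization (our own illustration, not from the lecture):

// Hypothetical kernel for an n x n matrix spread over many 2D blocks.
__global__ void matMulGrid(const float *A, const float *B, float *C, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global column index
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global row index
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++) {
            sum += A[row * n + k] * B[k * n + col];
        }
        C[row * n + col] = sum;
    }
}

// Launch, e.g. with 16 x 16 thread blocks covering the whole matrix:
// dim3 block(16, 16);
// dim3 grid((n + 15) / 16, (n + 15) / 16);
// matMulGrid<<<grid, block>>>(d_A, d_B, d_C, n);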
