A daily log of my hands‑on work in GPU programming alongside learnings from *Programming Massively Parallel Processors* (PMPP).
100‑Day Challenge Base Repo: JanakiSubu/GPU_CUDA_100
Project File: vectadd.cu
What I Did
- Wrote a basic “Hello World” kernel that prints from the GPU!
- Wrote a CUDA kernel that adds two float arrays element‑by‑element. Launched one thread per element, each computing `C[i] = A[i] + B[i]`.
- Used `blockIdx.x * blockDim.x + threadIdx.x` to map threads to data. Added an `if (i < N)` guard to prevent out‑of‑bounds writes.
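A minimal sketch of the kernel described above (names and launch parameters are illustrative):

```cuda
// vectadd.cu — sketch: one thread per element, guarded against overrun
__global__ void vecAdd(const float* A, const float* B, float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < N)                                      // skip out-of-range threads
        C[i] = A[i] + B[i];
}
// launch: vecAdd<<<(N + 255) / 256, 256>>>(dA, dB, dC, N);
```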
Key Takeaways
- Declared and invoked a `__global__` function on the GPU.
- Understood the grid–block–thread hierarchy and how to size grids and blocks.
- Managed the GPU memory lifecycle with `cudaMalloc`, `cudaMemcpy`, and `cudaFree`.
- Synchronized the device using `cudaDeviceSynchronize()` to flush output and catch errors.
What I Read
- Finished PMPP Chapter 1: Overview of parallel architectures, CUDA execution model, and fundamentals of GPU programming.
Project File: matrixadd.cu
What I Did
- Wrote a CUDA kernel that adds two N×N matrices element‑by‑element. Launched a 2D grid of 16×16 thread‑blocks, mapping each thread to one output element.
- Added an if (row < N && col < N) guard to prevent out‑of‑bounds writes.
- 2D thread mapping: used a 2D grid of 16×16 thread‑blocks and computed `int row = blockIdx.y*blockDim.y + threadIdx.y; int col = blockIdx.x*blockDim.x + threadIdx.x;` to map each CUDA thread onto a unique matrix element.
- Allocated and initialized matrices with `cudaMallocManaged` and a simple `initMatrix` loop.
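A sketch of the 2D mapping, assuming row-major N×N matrices (kernel name is illustrative):

```cuda
// matrixadd.cu — sketch: each thread owns one (row, col) element
__global__ void matAdd(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N)                     // 2D bounds guard
        C[row * N + col] = A[row * N + col] + B[row * N + col];
}
// launch: dim3 block(16, 16); dim3 grid((N + 15) / 16, (N + 15) / 16);
```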
Key Takeaways
- Mastered mapping 2D data onto the CUDA grid–block–thread hierarchy.
- Saw how a single `__global__` function can process an entire matrix by distributing work.
What I Read
- Finished PMPP Chapter 2: Memory hierarchy and data locality in CUDA, and why coalesced global loads matter even for simple element‑wise kernels.
Project File: matrix_vec_mult.cu
What I Did
- Wrote a CUDA kernel that multiplies an N×N matrix by a length‑N vector, computing one dot‑product per thread.
- Launched a 1D grid of threads (`gridSize = (N + blockSize - 1)/blockSize`, `blockSize = 256`), mapping `threadIdx.x + blockIdx.x*blockDim.x` → row index.
- Added an `if (row < N)` guard to prevent out‑of‑bounds accesses.
- Allocated matrix and vectors with cudaMallocManaged, initialized host data in simple loops, then let unified memory handle host↔device transfers.
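A sketch of the per-row dot-product kernel (names are illustrative):

```cuda
// matrix_vec_mult.cu — one dot-product per thread (sketch)
__global__ void matVecMul(const float* M, const float* x, float* y, int N) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N) {
        float sum = 0.0f;
        for (int col = 0; col < N; ++col)
            sum += M[row * N + col] * x[col];   // dot row with the vector
        y[row] = sum;
    }
}
```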
Key Takeaways
- Learned how to map a 1D data structure (rows of a matrix) onto CUDA’s grid–block–thread hierarchy for dot‑product operations.
- Understood the importance of bounds checking when the total thread count may exceed the problem size.
- Saw how a single `__global__` kernel can parallelize all row‑wise dot‑products, leveraging unified memory to simplify data management.
Project File: partialsum.cu
What I Did
- Summed two elements per thread into shared memory with guarded, coalesced loads.
- Implemented a tree-based inclusive scan (O(log blockSize)) using synchronized strides.
- Computed `gridSize = (N + 2*blockSize - 1)/(2*blockSize)` to cover all inputs.
- Added a `CUDA_CHECK` macro to validate every CUDA API call.
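For reference, a minimal sketch of the stride-synchronized scan pattern, shown with one element per thread for clarity (the actual kernel loads two elements per thread):

```cuda
// Block-level inclusive scan (Hillis–Steele style) — illustrative only
__global__ void blockScan(const float* in, float* out, int N) {
    extern __shared__ float temp[];                // blockDim.x floats
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    temp[tid] = (i < N) ? in[i] : 0.0f;            // guarded, coalesced load
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        float val = (tid >= stride) ? temp[tid - stride] : 0.0f;
        __syncthreads();                           // finish reads before writes
        temp[tid] += val;
    }
    if (i < N) out[i] = temp[tid];
}
// launch: blockScan<<<grid, block, block.x * sizeof(float)>>>(d_in, d_out, N);
```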
Key Takeaways
- Tree-based scan pattern: How to implement an in-place inclusive prefix sum in shared memory with minimal divergence.
- Memory coalescing: Summing two elements per thread maximizes contiguous loads and stores, boosting bandwidth utilization.
- Grid‐block sizing: Planning 2 elements per thread lets you halve the number of blocks needed, improving occupancy.
Project File: layernorm.cu
What I Did
- Implemented Layer Normalization on each row of a 2D matrix.
- Used shared memory to store row data, reducing global memory accesses.
- Calculated mean and variance for each row, then normalized the elements.
- Used `__syncthreads()` to synchronize threads in a block.
- Applied a 1e-7 epsilon to avoid divide-by-zero errors in the standard deviation.
- Validated CUDA API calls with a `CUDA_CHECK` macro.
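A sketch of the per-row kernel, assuming one block per row and a deliberately simple serial reduction (names are illustrative):

```cuda
// layernorm.cu — sketch: one block per row; the row is staged in shared memory
__global__ void layerNorm(const float* in, float* out, int rows, int cols) {
    extern __shared__ float row[];                  // cols floats
    int r = blockIdx.x;
    for (int c = threadIdx.x; c < cols; c += blockDim.x)
        row[c] = in[r * cols + c];                  // stage the row on-chip
    __syncthreads();
    __shared__ float mean, rstd;
    if (threadIdx.x == 0) {                         // simple serial reduction
        float sum = 0.f, sq = 0.f;
        for (int c = 0; c < cols; ++c) { sum += row[c]; sq += row[c] * row[c]; }
        mean = sum / cols;
        rstd = rsqrtf(sq / cols - mean * mean + 1e-7f);  // epsilon for stability
    }
    __syncthreads();
    for (int c = threadIdx.x; c < cols; c += blockDim.x)
        out[r * cols + c] = (row[c] - mean) * rstd;
}
```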
Key Takeaways
- Parallelism: Leveraged CUDA’s thread-block model to process rows independently.
- Shared Memory: Reduced latency by using shared memory for row data.
- Grid-Block Sizing: One block per row optimized performance.
- Numerical Stability: Added epsilon to avoid divide-by-zero errors.
- Efficient Memory Access: Coalesced memory accesses improved bandwidth utilization.
Project File: MatrixTranspose.cu
What I Did
- Implemented a naïve transpose kernel with bounds checks.
- Chose TILE_DIM = 32 and BLOCK_ROWS = 8 so each block handles a 32×32 tile in eight thread-rows, balancing shared-memory usage and occupancy.
- Launched the kernel with a 2D grid of blocks sized (width + TILE_DIM - 1) / TILE_DIM × (height + TILE_DIM - 1) / TILE_DIM to cover the entire matrix.
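A sketch of the tiled transpose with padded shared memory (the `TILE_DIM`/`BLOCK_ROWS` values match those above):

```cuda
// MatrixTranspose.cu — tiled transpose sketch
#define TILE_DIM   32
#define BLOCK_ROWS 8
__global__ void transposeTiled(const float* in, float* out, int width, int height) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 column avoids bank conflicts
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)   // 32×32 tile in 8-row steps
        if (x < width && (y + j) < height)
            tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];
    __syncthreads();
    x = blockIdx.y * TILE_DIM + threadIdx.x;         // swap block coordinates
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < height && (y + j) < width)
            out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}
```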
Key Takeaways
- Coalesced Access: Tiling groups global loads and stores into contiguous bursts, significantly improving memory throughput.
- Shared-Memory Reuse: Staging data in on-chip shared memory reduces redundant global reads and writes.
- Bank-Conflict Avoidance: Adding one column of padding prevents threads in the same warp from hitting the same shared-memory bank.
Project Files:
`one_d_convolution.cu`, `one_d_convolution_with_tiling.cu`, `2d_convolution_with_tiling.cu`
What I Did
- `one_d_convolution.cu`:
  - Implemented a naïve 1D convolution kernel that slides a 1×K mask over an input array.
  - Mapped each thread to compute one output element, using `blockIdx.x * blockDim.x + threadIdx.x`.
  - Added boundary checks (halo cells) to skip out-of-bounds memory accesses.
- `one_d_convolution_with_tiling.cu`:
  - Extended the 1D version with shared-memory tiling.
  - Loaded each tile (plus halo regions) of the input array into shared memory.
  - Used `__syncthreads()` to synchronize before computing the convolution.
- `2d_convolution_with_tiling.cu`:
  - Generalized tiling to 2D: divided the input matrix into TILE×TILE patches in shared memory.
  - Loaded top, bottom, left, and right halos to handle borders correctly.
  - For each output pixel, computed a full MASK×MASK dot-product against the shared-memory tile.
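A sketch of the tiled 1D variant, assuming an odd-width mask in `__constant__` memory (sizes and names are illustrative):

```cuda
// one_d_convolution_with_tiling.cu — sketch
#define K 5
__constant__ float d_mask[K];
__global__ void conv1dTiled(const float* in, float* out, int N) {
    extern __shared__ float tile[];                  // blockDim.x + K - 1 floats
    int radius = K / 2;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    // cooperatively load the tile plus halos, zero-padding the borders
    for (int idx = threadIdx.x; idx < blockDim.x + 2 * radius; idx += blockDim.x) {
        int g = blockIdx.x * blockDim.x + idx - radius;
        tile[idx] = (g >= 0 && g < N) ? in[g] : 0.0f;
    }
    __syncthreads();                                 // tile ready before use
    if (gid < N) {
        float sum = 0.0f;
        for (int j = 0; j < K; ++j)
            sum += tile[threadIdx.x + j] * d_mask[j];
        out[gid] = sum;
    }
}
```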
Key Takeaways
- Parallel convolution fundamentals: learned to map CUDA threads to output indices and use halo cells to guard against out-of-bounds reads.
- Shared-memory tiling: mastered loading contiguous blocks (and halos) into on-chip memory to reduce global-memory traffic.
- Synchronization & performance trade-offs: saw how `__syncthreads()` ensures data consistency and how tile-size choices impact shared-memory usage, occupancy, and overall throughput.
- Scaling from 1D to 2D: translated 1D tiling patterns into a 2D grid layout and managed corner and edge cases for correct 2D convolution.
Project File: prefixsum_brent_kung_algorithm.cu
What I Did
- Loaded pairs of elements into shared memory and handled out-of-bounds indices with zero-padding.
- Implemented the Brent–Kung scan:
- Up-sweep (reduce): built a partial‐sum tree in-place.
- Down-sweep: distributed the partial sums back down to produce the final prefix sums.
- Launched a 1D grid where each block processes 64 elements (2 × 32 threads).
- Added rigorous bounds checks on loads and stores to avoid illegal memory accesses.
- Used `__restrict__` and `const` qualifiers to help the compiler optimize global memory traffic.
- Wrapped CUDA calls in an inline `checkCudaError()` helper for clearer error reporting.
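A sketch of the block-level Brent–Kung scan with the 64-element sections described above (layout and names are illustrative):

```cuda
// prefixsum_brent_kung_algorithm.cu — block-level inclusive scan sketch
#define SECTION 64                                   // 2 × 32 threads per block
__global__ void brentKungScan(const float* in, float* out, int N) {
    __shared__ float s[SECTION];
    int t = threadIdx.x;
    int base = blockIdx.x * SECTION;
    // two loads per thread, zero-padded past the end of the array
    s[t]              = (base + t              < N) ? in[base + t]              : 0.0f;
    s[t + blockDim.x] = (base + t + blockDim.x < N) ? in[base + t + blockDim.x] : 0.0f;
    // up-sweep (reduce): build the partial-sum tree in place
    for (int stride = 1; stride <= blockDim.x; stride *= 2) {
        __syncthreads();
        int idx = (t + 1) * 2 * stride - 1;
        if (idx < SECTION) s[idx] += s[idx - stride];
    }
    // down-sweep: distribute partial sums back down the tree
    for (int stride = SECTION / 4; stride > 0; stride /= 2) {
        __syncthreads();
        int idx = (t + 1) * 2 * stride - 1;
        if (idx + stride < SECTION) s[idx + stride] += s[idx];
    }
    __syncthreads();
    if (base + t < N)              out[base + t]              = s[t];
    if (base + t + blockDim.x < N) out[base + t + blockDim.x] = s[t + blockDim.x];
}
```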
Key Takeaways
- Hierarchical scan structure: splitting work into a balanced binary tree (up-sweep) then propagating results (down-sweep) yields work-efficient parallel scans.
- Shared-memory orchestration: careful use of `__syncthreads()` between strides is critical to ensure correctness without excessive divergence.
- Block-level vs. full-array scan: while this version computes an inclusive scan within each block, extending it to arbitrary-length arrays requires a second (“carry-propagate”) pass across blocks.
- Performance hygiene: zero-padding incomplete segments and marking pointers as `__restrict__` prevents hidden data hazards and helps maximize GPU throughput.
Project File: flash_attention_forward.cu
What I Did
- Implemented a tile-based Flash Attention forward pass in CUDA (toy example: N=2, d=2).
- Staged Q, K, V in Br×Bc shared-memory tiles, computed per-tile Q·Kᵀ scores.
- Subtracted each row’s max before `expf()` and recorded both the max and the sum-of-exps for numerical stability.
- Accumulated softmax-weighted sums into the output buffer with a single-kernel prototype.
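The stability trick in isolation, as a per-row helper (serial form for clarity; the real kernel fuses this into the tile loop):

```cuda
// Numerically stable softmax for one row of scores (illustrative)
__device__ void softmaxRow(const float* scores, float* probs, int n) {
    float m = scores[0];
    for (int j = 1; j < n; ++j) m = fmaxf(m, scores[j]);  // row max
    float sum = 0.0f;
    for (int j = 0; j < n; ++j) {
        probs[j] = expf(scores[j] - m);                   // never overflows
        sum += probs[j];
    }
    for (int j = 0; j < n; ++j) probs[j] /= sum;          // normalize by sum of exps
}
```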
Key Takeaways
- Shared-memory tiling: Reduced global memory traffic by staging Q/K/V blocks on-chip.
- Numerical stability: Learned the importance of row-max subtraction to avoid overflow in `expf()`.
- Tile sizing trade-offs: Observed how Br and Bc (derived from `SRAM_SIZE`) affect occupancy and register usage.
What I Read
- Tri Dao et al., “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness,” NeurIPS 2022.
- NVIDIA CUDA C Programming Guide — shared memory, occupancy tuning, and kernel optimizations.
Project File: Sparse_MatrixVecMult_Hybrid.cu
What I Did
- Implemented a hybrid SpMV kernel that packs up to TH=20 nonzeros per row into ELL (ELLPACK) format, spilling extras into a global COO array via `atomicAdd`.
- Zero-filled unused ELL slots and stored per-row column indices.
- In each thread:
- ELL multiplication: iterate fixed TH entries per row.
- COO accumulation: scan the global COO list, adding matching-row entries.
- Wrapped CUDA calls in a `CUDA_CHECK` macro for robust error handling.
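A sketch of the ELL half of the hybrid kernel, using the column-major slice layout described in the takeaways (names are illustrative):

```cuda
// Sparse_MatrixVecMult_Hybrid.cu — ELL portion of the hybrid SpMV (sketch)
#define TH 20
__global__ void spmvEll(const float* ellVal, const int* ellCol,
                        const float* x, float* y, int N) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N) {
        float dot = 0.0f;
        for (int p = 0; p < TH; ++p) {
            int idx = p * N + row;                 // column-major slice: coalesced
            dot += ellVal[idx] * x[ellCol[idx]];   // zero-filled slots add 0
        }
        y[row] = dot;                              // COO overflow is added separately
    }
}
```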
Key Takeaways
- ELL vs. COO trade-offs: ELL gives regular accesses for up to TH nonzeros; COO handles overflow with minimal padding.
- Atomic writes: `atomicAdd` appends COO entries without precomputing row quotas, at the cost of serialized writes.
- Memory layout: storing ELL in column-major “slices” (`[p * N + row]`) yields coalesced loads for `x[col]`.
What I Read
- PMPP Chapter 10: Parallel sparse‐matrix techniques—CSR/ELL/COO formats, load balancing, padding for regularized access.
Project File: benchmark.py
What I Did
- Built a Python harness to benchmark my CUDA SpMV vs. `torch.sparse.mm`.
- Automated:
  - NVCC compilation of `main.cu` with injected `N`, `M`, and `threshold`.
  - Kernel timing via CUDA events and PyTorch timing events.
- Logged memory usage (`psutil`) and estimated the sparse footprint to avoid OOM.
Key Takeaways
- End-to-end benchmarking must include compile, transfer, and launch overheads.
- Memory-safety checks (e.g. cap nnz to 70% of RAM) prevent large-matrix crashes.
- Source injection simplifies multi-size testing without manual edits.
Project File: merge_path_parallel_merge.cu
What I Did
- Implemented the Merge Path parallel merge algorithm using CUDA.
- Created a `merge_path_partition()` device function to perform binary search across the diagonal `k = i + j` in the logical 2D merge grid of arrays A and B.
- Launched `N + M` threads where each thread:
  - Computes its diagonal index and determines the co-rank split `(i, j)`.
  - Selects the smaller of `A[i]` or `B[j]` and writes it to `C[k]`.
- Used `printf()` inside the kernel to trace thread ID, diagonal, source array, and output position.
- Verified the final merged array was fully sorted and correct.
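A sketch of the diagonal binary search and the per-element merge it enables (ties take equal elements from A first; names are illustrative):

```cuda
// merge_path_parallel_merge.cu — sketch of the co-rank search along diagonal k
__device__ int merge_path_partition(const int* A, int N, const int* B, int M, int k) {
    int lo = max(0, k - M), hi = min(k, N);        // feasible range for i
    while (lo < hi) {
        int i = (lo + hi) / 2, j = k - i;
        if (A[i] <= B[j - 1]) lo = i + 1;          // take more from A
        else hi = i;
    }
    return lo;                                     // i; the co-rank j is k - i
}

__global__ void mergeKernel(const int* A, int N, const int* B, int M, int* C) {
    int k = blockIdx.x * blockDim.x + threadIdx.x; // one thread per output slot
    if (k < N + M) {
        int i = merge_path_partition(A, N, B, M, k);
        int j = k - i;
        C[k] = (i < N && (j >= M || A[i] <= B[j])) ? A[i] : B[j];
    }
}
```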
Key Takeaways
- Merge Path provides a work-balanced parallel merge across threads with no synchronization needed.
- Learned how to use diagonal binary search to assign merge ranges in parallel.
- Understood how co-ranking is generalized in Merge Path and applied in Thrust and CUB.
- Added thread-wise kernel printouts to verify correctness and thread assignment.
What I Read
- PMPP Chapter 11: Merge Sort, co-rank vs. merge path, and tiled merging.
- “Merge Path: A Visually Intuitive Approach to Parallel Merging” by Green et al.
- Thrust and CUB source code: `merge()` and `merge_by_key()` implementation insights.
Project File: tiled_matmul.cu
What I Did
- Implemented a tiled matrix multiplication (GEMM) kernel on the GPU using shared memory.
- Divided each N×N matrix into `TILE_SIZE×TILE_SIZE` sub-blocks that each thread-block cooperatively loads into shared memory.
- Each thread computes one element of C by iterating over all tiles:
  - Loads a tile of A and a tile of B into `__shared__` arrays (`sA`, `sB`), zero-padding out-of-bounds entries.
  - Synchronizes with `__syncthreads()`.
  - Performs a `#pragma unroll` inner loop of length `TILE_SIZE` to accumulate the dot-product.
  - Synchronizes again before loading the next tile.
- Wrapped the kernel launch in CUDA events (`cudaEventRecord`/`cudaEventElapsedTime`) to measure in-kernel execution time.
- Added host-side setup to:
  - Parse `N` from the command line (default 256 for quick demos).
  - Allocate and initialize host/device arrays.
  - Verify correctness by checking `C[0]` and `C[N*N-1]` against the expected sum.
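A sketch of the tile loop described above (`TILE_SIZE` value is illustrative):

```cuda
// tiled_matmul.cu — shared-memory GEMM sketch
#define TILE_SIZE 16
__global__ void tiledMatMul(const float* A, const float* B, float* C, int N) {
    __shared__ float sA[TILE_SIZE][TILE_SIZE];
    __shared__ float sB[TILE_SIZE][TILE_SIZE];
    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (N + TILE_SIZE - 1) / TILE_SIZE; ++t) {
        int aCol = t * TILE_SIZE + threadIdx.x;
        int bRow = t * TILE_SIZE + threadIdx.y;
        sA[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        sB[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();                          // tile fully staged
        #pragma unroll
        for (int k = 0; k < TILE_SIZE; ++k)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();                          // done before next tile load
    }
    if (row < N && col < N) C[row * N + col] = acc;
}
```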
Key Takeaways
- Shared-Memory Tiling: Greatly reduces global-memory traffic by reusing each tile across multiple multiplications.
- Work Distribution: Each thread handles one output element, ensuring balanced compute.
- Performance Instrumentation: CUDA events provide precise kernel timing, decoupled from host overhead.
What I Read
- PMPP Chapter 4: Tiled algorithms and memory hierarchies for dense linear algebra.
Project Files:
`cmpFHD.cu`, `cmpFHD_real_image.cu`
What I Did
- Core FHD kernel (`cmpFHD.cu`), sketched after this list:
  - Implemented the FHD (Fᴴd) update pass over non-Cartesian k-space samples in CUDA.
  - Broke the full trajectory (`M = 1024` samples) into `CHUNK_SIZE = 256` tiles and loaded each tile into `__constant__` memory (`kx_c`, `ky_c`, `kz_c`).
  - Each GPU thread:
    - Reads its point coordinates `(x[n], y[n], z[n])` and initial complex accumulator `(rPhi, iPhi)`.
    - Loops over the tile, computing `angle = 2π * (kx·x + ky·y + kz·z)`; `realAcc += rMu[m]*cosf(angle) − iMu[m]*sinf(angle)`; `imagAcc += iMu[m]*cosf(angle) + rMu[m]*sinf(angle)`.
    - Writes back the updated `(rPhi, iPhi)` and computes the magnitude `phiMag = sqrt(r² + i²)`.
  - Host orchestration:
    - Allocates and initializes host arrays (`h_x, h_y, h_z, h_rMu, h_iMu, h_rPhi, h_iPhi`).
    - Copies data to device buffers and, per chunk, uploads the trajectory tile via `cudaMemcpyToSymbol`.
    - Launches `cmpFHD<<<blocks, 256>>>(…)` and synchronizes.
    - Copies results back and prints a few sample values for validation.
- Real-image extension (`cmpFHD_real_image.cu`):
  - Integrated OpenCV to load a grayscale image (`lena_gray.png` → `CV_32F [0,1]`).
  - Mapped pixel `(i, j)` → normalized `(x, y)` and intensity → `z`.
  - Ran the identical chunked FHD kernel on this point cloud.
  - Converted the per-pixel magnitude back to an 8-bit image and saved `output.jpg`.
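A sketch of the core kernel's per-thread accumulation loop, assuming the current trajectory chunk is already resident in `__constant__` memory:

```cuda
// cmpFHD.cu — per-thread FHD accumulation over one constant-memory chunk (sketch)
#define CHUNK_SIZE 256
__constant__ float kx_c[CHUNK_SIZE], ky_c[CHUNK_SIZE], kz_c[CHUNK_SIZE];
__global__ void cmpFHD(const float* rMu, const float* iMu,
                       const float* x, const float* y, const float* z,
                       float* rPhi, float* iPhi, int N, int chunkLen) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= N) return;
    float xn = x[n], yn = y[n], zn = z[n];
    float rAcc = rPhi[n], iAcc = iPhi[n];          // resume from previous chunks
    for (int m = 0; m < chunkLen; ++m) {
        float angle = 2.0f * 3.14159265f * (kx_c[m] * xn + ky_c[m] * yn + kz_c[m] * zn);
        float c = cosf(angle), s = sinf(angle);
        rAcc += rMu[m] * c - iMu[m] * s;           // real part of the update
        iAcc += iMu[m] * c + rMu[m] * s;           // imaginary part
    }
    rPhi[n] = rAcc; iPhi[n] = iAcc;                // phiMag = sqrtf(r*r + i*i) after last chunk
}
```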
Key Takeaways
- Constant-Memory Tiling: broadcasting a small tile of k-space samples to all threads dramatically cuts global-memory pressure.
- Per-Thread Work Balance: one thread per output point minimizes divergence and ensures each thread does equal work.
- Hybrid Memory Management: managed vs. explicit `cudaMalloc`/`cudaMemcpy` approaches trade simplicity for control.
- Real-World Pipeline: integrating OpenCV with CUDA means handling I/O and pre/post-processing around the core compute kernel.
What I Read
- PMPP Chapter 14 — Non-Cartesian MRI case study: iterative reconstruction techniques, k-space sampling patterns, and performance-oriented kernel design.
- NVIDIA CUDA C Programming Guide — best practices for constant memory, occupancy tuning, and fast-math intrinsics.
Project Files:
`bfs_kernel.cu`, `glu.cu`, `linear_kernel.cu`
What I Did
- BFS Kernel (`bfs_kernel.cu`):
  - Wrote a parallel breadth-first search kernel using atomic operations to update node labels.
  - Explored level-by-level traversal with thread-safe `atomicCAS` and frontier-completion flags.
- GELU Activation (`glu.cu`):
  - Implemented the GELU (Gaussian Error Linear Unit) activation in CUDA for fast inference.
  - Verified against a Python reference to ensure numerical correctness.
- Linear Layer (`linear_kernel.cu`):
  - Built a batched linear layer using cuBLAS for `C = A × B`, followed by a custom bias-add kernel.
  - Managed host/device memory and orchestrated cuBLAS calls plus CUDA kernel launches.
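A sketch of the GELU kernel using the common tanh approximation (names are illustrative):

```cuda
// glu.cu — GELU activation, tanh approximation (sketch)
__global__ void gelu(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        float inner = 0.7978845608f * (x + 0.044715f * x * x * x);  // sqrt(2/π)·(x + 0.044715x³)
        out[i] = 0.5f * x * (1.0f + tanhf(inner));
    }
}
```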
Key Takeaways
- Designing graph‐traversal kernels with minimal divergence and safe atomic updates.
- Validating complex CUDA math (GELU) against a straightforward Python implementation.
- Integrating cuBLAS GEMM with custom CUDA kernels for a complete linear layer pipeline.
- Gained deeper insight into dynamic parallelism and its trade-offs in real-world examples.
What I Read
- PMPP Chapter 12: Parallel Patterns for Graph Searches. Background on graph structures and traversal mechanisms, sequential vs. parallel BFS implementations, and optimizations for memory bandwidth and load balancing in graph algorithms.
- PMPP Chapter 13: CUDA Dynamic Parallelism. Fundamentals of launching kernels from the device, memory visibility rules, and nesting depth; synchronization with streams and events inside dynamic kernels; a recursive Bézier-curve example with and without dynamic parallelism.
Project Files:
`flash.cu`, `cnn.cu`
What I Did
- Flash Attention Backprop (`flash.cu`):
  - Extended my Flash Attention forward pass to full backprop: computed gradients w.r.t. Q, K, V via softmax-and-matmul reverse chaining.
  - Mirrored the on-chip tiling (Br×Bc) from the forward pass to keep memory-efficient access patterns.
  - Diagnosed zero gradients in spots due to mismatched launch configs and a missing “col2im”-style gather in the tiled layout.
- CNN Backprop (`cnn.cu`):
  - Built an end-to-end CNN layer in CUDA:
    - Forward: conv → ReLU → max-pool using `im2col` → GEMM → `reluAct` → `maxpool` kernels.
    - Backward: pooling grads with `atomicAdd`, weight grads via `gemmDW`, input grads via `gemmDX` + `col2im`.
  - Added a toy test in `main()` to print activations, pooled outputs, `dW`, and `dX` for validation.
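A sketch of the pooling-gradient pattern, assuming the forward pass saved each window's argmax as a flat input index (names are illustrative):

```cuda
// cnn.cu — max-pool backward via atomicAdd (sketch)
__global__ void maxpoolBackward(const float* dOut, const int* argmax,
                                float* dIn, int nOut) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nOut)
        atomicAdd(&dIn[argmax[i]], dOut[i]);  // windows can overlap, so accumulate
}
```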
Key Takeaways
- Forward/Backward Alignment: any unrolling or tiling in the forward pass must be mirrored exactly in backprop (e.g. implement `col2im` for `dX`).
- Launch-Config Precision: off-by-one grid/block calculations often explain zero-gradient anomalies—always double-check your total = rows×cols formulas.
- Toy-Scale Verification: dump intermediate tensors on a small example before scaling up to catch indexing and memory-layout bugs early.
What I Read
- PMPP Ch. 15: Molecular Visualization & Analysis. Thread granularity and memory-coalescing strategies for large biomolecular data.
- PMPP Ch. 16: Machine Learning Case Study. How cuDNN reduces CNN layers to GEMM under the hood for peak performance.
- PMPP Ch. 17: Parallel Programming & Computational Thinking. Systematic problem decomposition and balancing compute vs. memory locality.
Code:
NaiveBayes.cu, NaiveBayesKernel.cuh, NaiveBayesTrain.cuh, NaiveBayesTrain.cpp, and main.cpp
Implemented a CUDA-accelerated Naive Bayes classifier, focusing on the training and inference stages. The implementation leverages shared memory to maximize computational efficiency and divides work among threads to process feature probabilities in parallel.
- `NaiveBayes.cu`:
  - Contains the CUDA kernel responsible for calculating feature likelihoods and class probabilities in parallel.
  - Shared memory was used where possible to minimize global-memory access penalties.
  - Optimized kernel launches to balance grid and block dimensions for datasets with high dimensionality.
- `NaiveBayesKernel.cuh`: header file declaring the kernel functions, ensuring modularity and separation of concerns in the code structure.
- `NaiveBayesTrain.cuh`: declares the host-side training function, encapsulating the logic to copy data to the GPU, launch CUDA kernels, and retrieve results.
- `NaiveBayesTrain.cpp`: implements the host-side training process, providing pre-processing for input data and managing memory transfers between CPU and GPU.
- `main.cpp`: entry point of the program, performing tasks like loading data, splitting datasets for training and testing, and evaluating model performance after training.
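A sketch of the one-thread-per-sample counting pass, assuming integer-binned features and labels (all names are hypothetical):

```cuda
// Histogram-style Naive Bayes training pass (sketch)
__global__ void countFeatures(const int* features, const int* labels,
                              int* featCounts, int* classCounts,
                              int nSamples, int nFeatures, int nBins) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per sample
    if (s < nSamples) {
        int c = labels[s];
        atomicAdd(&classCounts[c], 1);               // class prior counts
        for (int f = 0; f < nFeatures; ++f) {
            int v = features[s * nFeatures + f];
            atomicAdd(&featCounts[(c * nFeatures + f) * nBins + v], 1);  // likelihood counts
        }
    }
}
```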
Key Takeaways
- One-thread-per-sample model enabled scalable histogram-style computation.
- Shared memory significantly improved update locality for priors and likelihoods.
- Host-device modularity ensured reusability and clarity.
- Tuned grid/block dimensions for balanced memory latency and thread occupancy.
What I Read
- PMPP Chapter 9: Parallel Histograms and Voting
- PMPP Chapter 5: Synchronization and Shared Memory
- CUDA C Best Practices Guide: Shared memory vs. global memory access efficiency
Project File: vec_cublas.cu
What I Did
- Implemented vector addition on the GPU using the cuBLAS library.
- Used the `cublasSaxpy()` routine to compute `C = A + B` by performing `y = α * x + y` with `α = 1.0f`.
- Managed the cuBLAS handle lifecycle with `cublasCreate()` and `cublasDestroy()`.
- Allocated device memory, initialized host data, and handled data transfers using `cudaMemcpy`.
- Verified output by copying the result back to the host and printing sample elements.
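A sketch of the call sequence; after the call, `d_B` holds the element-wise sum (device pointers assumed already populated):

```cuda
// vec_cublas.cu — vector add via cublasSaxpy (sketch)
#include <cublas_v2.h>

void vecAddCublas(const float* d_A, float* d_B, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);                           // acquire library context
    const float alpha = 1.0f;
    cublasSaxpy(handle, n, &alpha, d_A, 1, d_B, 1);  // y = α·x + y  →  d_B += d_A
    cublasDestroy(handle);                           // release context
}
```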
Key Takeaways
- cuBLAS Handle Management: learned to initialize and release the cuBLAS context using `cublasCreate()` and `cublasDestroy()` to encapsulate library operations.
- AXPY Operation Basics: understood that `cublasSaxpy()` computes `y = α * x + y`; setting `α = 1.0` makes it equivalent to element-wise addition.
- Performance & Simplicity: observed how vendor-optimized cuBLAS routines offer better performance and cleaner code than writing custom kernels for simple linear algebra operations.
- Memory Safety & Error Checking: wrapped CUDA and cuBLAS calls in error-checking macros to ensure robustness.
What I Read
- PMPP Chapter 3 — Leveraging cuBLAS and cuRAND libraries for optimized linear algebra and random number generation.
- cuBLAS Library Documentation — usage pattern and parameter structure for `cublasSaxpy`.
- CUDA C Programming Guide — best practices for mixing cuBLAS with custom kernels and memory management.
Project File: matmul_cublas.cu
What I Did
- Implemented matrix multiplication using the cuBLAS library function `cublasSgemm()`.
- Initialized matrices A and B on the host with values `A[i][j] = i + j`, `B[i][j] = i + j` for easy verification.
- Allocated GPU memory with `cudaMalloc` and transferred matrices from host to device using `cudaMemcpy`.
- Set scalar values `alpha = 1.0f`, `beta = 0.0f`.
- Used `cublasSgemm()` with the `CUBLAS_OP_N` flags for no transposition of A or B.
- Copied result matrix C back to the host and printed it using column-major indexing (`i + j * M`).
- Managed cuBLAS context creation and destruction using `cublasCreate()` and `cublasDestroy()`.
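A sketch of the Sgemm call for column-major device buffers (dimensions M×K for A, K×N for B, M×N for C):

```cuda
// matmul_cublas.cu — C = A × B via cublasSgemm (sketch)
#include <cublas_v2.h>

void matmulCublas(cublasHandle_t handle, const float* d_A, const float* d_B,
                  float* d_C, int M, int N, int K) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,  // no transposition
                M, N, K,
                &alpha,
                d_A, M,                            // lda = M (column-major)
                d_B, K,                            // ldb = K
                &beta,
                d_C, M);                           // ldc = M
}
```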
Key Takeaways
- cuBLAS for GEMM: learned to use `cublasSgemm()` for efficient matrix multiplication—crucial for deep learning workloads and scientific computing.
- Column-Major Order: cuBLAS expects column-major layout (Fortran-style), so indexing must follow `i + j * leadingDim` to correctly interpret results.
- cuBLAS Handle Lifecycle: understood proper creation and destruction of the cuBLAS context (`cublasHandle_t`) to manage library state.
- Parameter Mapping: mastered the mapping between C-style row-major host matrices and cuBLAS function parameters for correctness and performance.
What I Read
- PMPP Chapter 6: Matrix Multiplication and Shared Memory Optimization. Foundations of tiled GEMM, matrix layout impacts, and performance best practices.
- cuBLAS Developer Guide. Focused on `cublasSgemm` usage, including leading dimensions, transposition flags, and memory alignment strategies.
- CUDA Toolkit Documentation. Reviewed usage of `cudaMalloc`, `cudaMemcpy`, and proper GPU memory management for third-party libraries.
Project File: fcnet.cu
What I Did
- Implemented a fully connected neural network (FCNet) using the cuDNN library in CUDA.
- Constructed a 3-layer architecture:
- Input layer – 1000 neurons
- Two hidden layers – 512 neurons each
- Output layer – 10 neurons
- Emulated dense connections via 1×1 convolutions.
- Applied ReLU activation after each hidden layer using `cudnnActivationForward`.
- Initialized weights with cuRAND and zeroed the biases.
- Created and managed cuDNN descriptors for tensors, filters, convolutions, and activations.
- Ran a 10-epoch forward pass over randomly generated inputs and labels.
- Printed a sample output from the final epoch for verification.
- Cleaned up all GPU memory and cuDNN descriptors.
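A sketch of one ReLU application via cuDNN descriptors, treating the feature dimension as channels (error checks omitted for brevity):

```cuda
// fcnet.cu — in-place ReLU with cudnnActivationForward (sketch)
#include <cudnn.h>

void reluForward(cudnnHandle_t h, float* d_x, int n, int c) {
    cudnnTensorDescriptor_t desc;
    cudnnCreateTensorDescriptor(&desc);
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               n, c, 1, 1);          // batch × features × 1 × 1
    cudnnActivationDescriptor_t act;
    cudnnCreateActivationDescriptor(&act);
    cudnnSetActivationDescriptor(act, CUDNN_ACTIVATION_RELU,
                                 CUDNN_NOT_PROPAGATE_NAN, 0.0);
    const float alpha = 1.0f, beta = 0.0f;
    cudnnActivationForward(h, act, &alpha, desc, d_x, &beta, desc, d_x);
    cudnnDestroyActivationDescriptor(act);
    cudnnDestroyTensorDescriptor(desc);
}
```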
Key Takeaways
- Learned to build dense layers using 1×1 convolutions in cuDNN for GPU acceleration.
- Gained hands-on with tensor, filter, and convolution descriptor setup and teardown.
- Mastered the use of `cudnnConvolutionForward`, `cudnnAddTensor` (bias add), and `cudnnActivationForward`.
- Reinforced best practices for error checking, memory management, and resource cleanup.
- Saw how cuDNN can model non-convolutional layers using its convolution primitives.
What I Read
- PMPP Chapter 8 — Using cuDNN for Deep Learning: structure mapping & inference pipelines
- cuDNN Developer Guide — Descriptor APIs, activation functions, and forward algorithms
- CUDA cuRAND Documentation — Pseudorandom weight initialization best practices
- CUDA Toolkit Programming Guide — Memory management, synchronization, and debugging techniques
Project File: rope.cu
What I Did
- Implemented Rotary Positional Encoding (RoPE) in CUDA to inject relative position information into transformer token embeddings
- Wrote a `__global__` kernel that:
  - Splits each query/key vector into even and odd halves
  - Applies element-wise rotation using precomputed sine and cosine values
  - Uses thread indices to map tokens × dimensions → vector elements
- Loaded the angle table (`θ`) into shared memory once per block to minimize global-memory reads
- Launched a 2D grid, `dim3 grid((seq_len + B-1)/B, (dim/2 + T-1)/T)` with `dim3 block(T, 1)`, for coalesced access across sequence positions
- Validated on a toy sequence (N = 128, D = 64) by comparing against a CPU reference implementation and printing sample rotated vectors
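A sketch of the rotation step, assuming a `[seq_len × dim]` buffer and a host-precomputed angle table `theta[dim/2]` with the half-split pairing described above:

```cuda
// rope.cu — apply rotary encoding in place (sketch)
__global__ void applyRope(float* x, const float* theta, int seqLen, int dim) {
    int pos = blockIdx.x * blockDim.x + threadIdx.x;   // token position
    int d   = blockIdx.y * blockDim.y + threadIdx.y;   // rotation-pair index
    int half = dim / 2;
    if (pos < seqLen && d < half) {
        float angle = pos * theta[d];                  // position-scaled frequency
        float c = cosf(angle), s = sinf(angle);
        float even = x[pos * dim + d];
        float odd  = x[pos * dim + d + half];
        x[pos * dim + d]        = even * c - odd * s;  // 2D rotation of the pair
        x[pos * dim + d + half] = even * s + odd * c;
    }
}
```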
Key Takeaways
- RoPE Fundamentals: Learned how complex‐valued rotations encode relative positions, eliminating the need for explicit absolute embeddings
- CUDA Mapping: Practiced mapping 2D data (tokens × dim/2) onto CUDA’s grid–block–thread hierarchy for element-wise operations
- Shared‐Memory Optimization: Saw significant bandwidth savings by staging constant sin/cos tables in shared memory
- Numerical Stability: Verified that precomputing angles at high precision on the host avoids drift in the GPU’s single‐precision trig evaluations
- Launch‐Config Trade-offs: Balanced block size vs. shared‐memory capacity to maximize occupancy without bank conflicts
Project File: cuda_sgc.cu
What I Did
- Built a CUDA-accelerated SGD trainer for linear regression with weight & bias updates.
- Wrote `compute_loss` to launch one thread per sample, computing predictions (X·W + b) and squared-error losses.
- Implemented `compute_gradients` using shared memory to reduce per-block bias contributions and atomics for the global bias gradient; also computed per‑weight gradients over all samples.
- Developed `update_weights` to apply the SGD step (`W ← W − lr·dW`, `b ← b − lr·db`) entirely on the GPU.
- Created the host routine `train_sgd` to allocate/copy all buffers, orchestrate the loss, gradient, and update kernels across epochs, and retrieve the final parameters.
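A sketch of the update kernel under the single-bias linear-regression setup above (names are illustrative):

```cuda
// cuda_sgc.cu — SGD parameter update (sketch)
__global__ void update_weights(float* W, float* b, const float* dW,
                               const float* db, float lr, int nFeatures) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nFeatures) W[i] -= lr * dW[i];   // W ← W − lr·dW
    if (i == 0)        b[0] -= lr * db[0];   // b ← b − lr·db (single bias term)
}
```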
Key Takeaways
- Shared‑Memory Reduction: Using a block‑local buffer for bias gradients cuts down on global‑memory traffic and improves reduction performance.
- Atomic Operations: Atomic adds safely combine block reductions into the global bias gradient.
- Kernel Partitioning: Separating loss-computation, gradient-accumulation, and parameter-update into distinct kernels simplifies tuning grid/block sizes for each stage.
- GPU Workflow Management: Mastered the end‑to‑end loop: memory allocation, H2D/D2H transfers, kernel synchronization, and cleanup for iterative training.
What I Read
- Best practices for CUDA optimizations: memory coalescing, occupancy tuning, and shared‑memory patterns.
- Literature on efficient SGD implementations in GPU‑based ML libraries, including block‑level reductions and atomic‑free alternatives.
Project File: EM_kernel.cu
What I Did
- Implemented the Expectation-Maximization (EM) algorithm in CUDA to cluster 1D data into a specified number of Gaussian mixtures.
- Wrote the E-step kernel (`eStepKernel`) to compute responsibilities:
  - Launched one thread per data point.
  - Calculated the weighted Gaussian PDF for each cluster (using current means, σ, and π).
  - Normalized to obtain per-point posterior probabilities.
- Wrote the M-step kernel (`mStepKernel`) to accumulate sufficient statistics:
  - Launched one thread per data point, looping over all clusters.
  - Used `atomicAdd` reductions to sum γₖ, γₖ·x, and γₖ·x² across threads.
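A sketch of the E-step for the 1D mixture, with `gamma` stored row-major as `[nPoints × K]` (names are illustrative):

```cuda
// EM_kernel.cu — E-step: per-point responsibilities (sketch)
__global__ void eStepKernel(const float* x, const float* mu, const float* sigma,
                            const float* pi, float* gamma, int nPoints, int K) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per data point
    if (i < nPoints) {
        float total = 0.0f;
        for (int k = 0; k < K; ++k) {
            float d = (x[i] - mu[k]) / sigma[k];
            // weighted Gaussian PDF: π_k · N(x | μ_k, σ_k²); 2.5066… = sqrt(2π)
            float p = pi[k] * expf(-0.5f * d * d) / (sigma[k] * 2.5066282746f);
            gamma[i * K + k] = p;
            total += p;
        }
        for (int k = 0; k < K; ++k)
            gamma[i * K + k] /= total;               // normalize responsibilities
    }
}
```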
Key Takeaways
- CUDA Memory Management: Mastered allocation, H2D/D2H transfers, and resetting accumulators between iterations.
- Parallel Reduction with Atomics: Leveraged `atomicAdd` for safe, block-level accumulation in the M-step.
- GMM & EM Concepts: Reinforced the iterative nature of EM—alternating E-step and M-step—to maximize the expected log-likelihood and refine cluster parameters on the GPU.
What I Read
- PMPP Chapter 7: Probabilistic Models & Expectation-Maximization — theory and convergence properties of EM for Gaussian mixtures.
- NVIDIA CUDA C Programming Guide — best practices for using atomics and optimizing memory throughput in reduction patterns.