CUDA GEMM Optimization

In this homework series, you'll optimize a CUDA implementation of General Matrix Multiply aka GEMM. Note that GEMM is slightly more involved than just matrix multiply: it also uses some constant scaling factors and adds the results to the existing values in the output matrix.
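Concretely, GEMM computes C = alpha*A*B + beta*C, where alpha and beta are scalar constants and C serves as both an input and an output.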

Lambda.ai setup

TBD

CETS Virtual PC Lab setup

TBD

[Deprecated] EC2 setup

Use an EC2 g4dn.xlarge instance (currently the cheapest Nvidia GPU instance) with the ami-05c3e698bd0cffe7e AMI (an official Ubuntu 20.04 image with Nvidia GPU tools & PyTorch installed). Other AMIs can sometimes have fees associated with them.

You can use the cheapest storage (magnetic HDD) as disk performance doesn't matter for us. I recommend setting up an Elastic IP Address so that you have a consistent DNS name for your instance; it makes it much easier to connect to your instance via SSH and VSCode.

I also recommend using VSCode to write your code. Some key extensions to install are Nsight Visual Studio Code Edition, Remote - SSH and C/C++ Extension Pack. This allows you to connect to your instance as a "remote" and write code on your local machine. It also provides integration with the cuda-gdb debugger which is very helpful.

Finally, install the Nvidia Nsight Compute profiler on your local machine (it's pre-installed on your instance) so you can peruse profiling reports easily. Note that you don't need an Nvidia GPU to view profiling data: you'll generate a profiling report on the EC2 instance and then view it on your local machine.

Code overview

Our GEMM algorithms operate on matrices of 32-bit floating-point elements (the float datatype in CUDA).

At a high level, the code provided in cugemm.cu does the following:

  1. allocates input and output square matrices of the requested size
  2. initializes the input matrices with random values
  3. runs the requested GEMM algorithm (more details below) for the requested number of repetitions
  4. (optionally) validates the GEMM result

The matrix size, validation, repetition count and algorithm can all be controlled via command-line flags.

To begin with, only two GEMM algorithms are available: a naive version in runBasic and a super-optimized version from Nvidia's cuBLAS library in runCublas. cuBLAS is the reference point for validation: if validation is requested then we run cuBLAS to get the correct answer and compare the other algorithm's output to it.
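For orientation, a naive kernel along the lines of runBasic might look like the sketch below (an illustrative assumption, not the exact code in cugemm.cu): one thread per output element, each walking a full row of A and a full column of B.

// Illustrative sketch only -- the actual runBasic kernel in cugemm.cu may differ.
// One thread per element of C, computing C = alpha*A*B + beta*C.
// Consecutive threads here step through consecutive rows, so the A and C
// accesses are strided by N floats and do not coalesce.
__global__ void naiveGemm(int N, float alpha, const float *A,
                          const float *B, float beta, float *C) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k) {
            acc += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}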

Build & profile

Build & profile the runBasic code as follows:

git checkout ...
cd <repo-working-copy>/gemm/
make -j3 all
./cugemm.bin --size=2048 --reps=1 --algo=1

This will build 3 versions of the code: an optimized version, an optimized version with some debugging information for profiling, and an unoptimized version with extra debugging symbols. When you run the optimized version cugemm.bin it should report a performance of around 60 GFLOPS, which is far below what the GPU can provide.

Next, we'll profile our kernel to see why it is so slow:

sudo /usr/local/cuda-11.8/bin/ncu -o profile-basic --set full ./cugemm-profile.bin --size=4096 --reps=1 --algo=1 --validate=false

Note: you can follow these instructions to avoid the need for sudo when profiling.

Because we used --set full to collect a full set of profiling data, it will take a couple of minutes to run. The results are best viewed in the Nvidia Nsight Compute profiler running in a graphical environment (i.e., not on the command line) on your local machine.

Profiling will reveal an absurd number of uncoalesced global memory accesses.

Debug

Nvidia ships a number of "compute sanitizers" that check for common memory safety (e.g., out-of-bounds accesses) and concurrency errors. You should run them on your debug binaries to get better reporting of where errors are in your source code. They are an easy way to get some clues about where to start when your code isn't passing validation.

compute-sanitizer --tool memcheck ./cugemm-debug.bin ...
compute-sanitizer --tool racecheck ./cugemm-debug.bin ...

HW1: Fix uncoalesced memory accesses

Your first task is to fix all of the uncoalesced global memory accesses in runBasic. Note that you have control over the order in which the elements of the output matrix are computed, and can leverage floating-point commutativity and also assume associativity (even though in reality floating-point addition and multiplication are not associative). You should compute the dot products incrementally in an order that yields coalesced memory accesses.

Copy the runBasic code to runGmemCoalesced and edit it there. Resolving the issues should result in a significant speedup (~550 GFLOPS on 2048×2048 input matrices).
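One possible shape for the fix (a sketch under the assumption that the naive kernel maps consecutive threads to consecutive rows) is to map consecutive threads to consecutive columns instead, so each warp touches consecutive addresses in B and C:

// Sketch of a possible runGmemCoalesced kernel; your indexing may differ.
// Consecutive threads in a warp now handle consecutive columns of C, so the
// B[k*N + col] loads and the C[row*N + col] read-modify-writes hit consecutive
// addresses and coalesce; A[row*N + k] becomes a warp-wide broadcast.
__global__ void gmemCoalescedGemm(int N, float alpha, const float *A,
                                  const float *B, float beta, float *C) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k) {
            acc += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}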

HW2: Use shared memory

Cache tiles of the input matrices in shared memory to avoid redundant loads from global memory. This should result in another significant speedup, to ~1 TFLOPS.
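One common shape for this kernel (a sketch, not the required solution; the tile width and names are assumptions) has each block stage a TILE x TILE tile of A and of B in shared memory, synchronize, and accumulate partial dot products from the staged tiles:

#define TILE 32  // assumed tile width; launch with TILE x TILE thread blocks

// Sketch of a shared-memory tiled GEMM. Each block computes a TILE x TILE
// tile of C, one element per thread. Assumes N is a multiple of TILE.
__global__ void smemGemm(int N, float alpha, const float *A,
                         const float *B, float beta, float *C) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread stages one element of A and one of B (coalesced loads).
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = alpha * acc + beta * C[row * N + col];
}

With this structure, each element of A and B is loaded from global memory once per block tile (N/TILE times) instead of once per output element.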

HW3: Multiple results per thread

Have each thread compute multiple cells of the output matrix C, instead of just one. This improves arithmetic intensity and should lift performance further, to about ~3 TFLOPS. For reference, cuBLAS was reaching about 7.1 TFLOPS on my instance (the T4's hardware limit being 8.1 TFLOPS), so we're at over 40% of cuBLAS's performance - not too shabby!
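One common way to do this (a sketch with assumed tile sizes and names, not the required structure) keeps the shared-memory staging but has each thread accumulate a small column of TM output elements in registers, so each value read from the B tile is reused TM times:

// Assumed tile parameters for this sketch.
#define BM 64   // rows of C computed per block
#define BN 64   // columns of C computed per block
#define BK 8    // width of the K-slice staged in shared memory
#define TM 8    // output elements per thread

// Sketch: launch with (BM*BN)/TM = 512 threads per block; each thread computes
// a column of TM elements of C. Assumes N is a multiple of BM, BN and BK.
__global__ void multiOutputGemm(int N, float alpha, const float *A,
                                const float *B, float beta, float *C) {
    __shared__ float As[BM][BK];
    __shared__ float Bs[BK][BN];

    const int cRow = blockIdx.y * BM;
    const int cCol = blockIdx.x * BN;

    // Output mapping: threadCol in [0, BN), threadRow in [0, BM/TM).
    const int threadCol = threadIdx.x % BN;
    const int threadRow = threadIdx.x / BN;

    // Mappings for cooperatively loading the shared-memory tiles
    // (both rely on blockDim.x == BM*BK == BK*BN == 512).
    const int aRow = threadIdx.x / BK, aCol = threadIdx.x % BK;
    const int bRow = threadIdx.x / BN, bCol = threadIdx.x % BN;

    float acc[TM] = {0.0f};

    for (int t = 0; t < N; t += BK) {
        As[aRow][aCol] = A[(cRow + aRow) * N + t + aCol];
        Bs[bRow][bCol] = B[(t + bRow) * N + cCol + bCol];
        __syncthreads();

        for (int k = 0; k < BK; ++k) {
            float b = Bs[k][threadCol];          // loaded once, reused TM times
            for (int m = 0; m < TM; ++m)
                acc[m] += As[threadRow * TM + m][k] * b;
        }
        __syncthreads();
    }

    for (int m = 0; m < TM; ++m) {
        int idx = (cRow + threadRow * TM + m) * N + cCol + threadCol;
        C[idx] = alpha * acc[m] + beta * C[idx];
    }
}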

HW4: Pipeline Memory Copies and Kernel Computation

Building on your unmodified runSharedMemMultiOutput kernel from HW3, use CUDA streams to overlap memory copies with kernel execution. The numbered steps below indicate the order of the key operations.

  • Copy the B matrix in its entirety to the device (1), since all of the output elements depend on all of the rows of B.
  • In a stream S0, transfer the first slice of rows of the A (2) and C (3) matrices using cudaMemcpyAsync, launch a kernel to compute those rows of C, and finally copy those rows of C back to the host (6).
  • Meanwhile, in another stream S1, transfer the next slice of rows of A (4) and C (5), launch a kernel to compute those rows of C, and copy them back to the host (7).

The number of streams to use is a runtime argument, determined by the --streams flag. The rows of A and C should be split evenly across all available streams.

Using streams, you will overlap transfer of the A and C matrices with the matmul kernel. For reference, on a GTX Titan X GPU I was seeing about a 15% performance improvement with 8 streams compared to a fully-synchronous single-stream version.
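The host-side pipeline might look roughly like the sketch below (all names, and the slice-kernel signature standing in for your HW3 kernel, are assumptions). Each stream gets its own slice of rows; the two copies, the kernel launch, and the copy-back issued into one stream stay ordered relative to each other, while different streams are free to overlap:

#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel computing `rows` rows of C from the given A/C slices.
__global__ void sliceGemmKernel(int N, int rows, float alpha, const float *A,
                                const float *B, float beta, float *C);

// Sketch of the pipelined driver. Assumes nStreams (from --streams) divides N,
// that B is already resident in dB (step 1), and that hA/hC are pinned buffers.
void runPipelined(int N, int nStreams, float alpha, float beta,
                  const float *hA, float *hC,
                  float *dA, const float *dB, float *dC) {
    int rowsPerStream = N / nStreams;
    size_t sliceBytes = (size_t)rowsPerStream * N * sizeof(float);

    std::vector<cudaStream_t> streams(nStreams);
    for (int s = 0; s < nStreams; ++s)
        cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        size_t offset = (size_t)s * rowsPerStream * N;

        // (2)/(4): this stream's slice of A; (3)/(5): its slice of C.
        cudaMemcpyAsync(dA + offset, hA + offset, sliceBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        cudaMemcpyAsync(dC + offset, hC + offset, sliceBytes,
                        cudaMemcpyHostToDevice, streams[s]);

        // The launch waits (within its stream) for the two copies above.
        // Grid/block shapes are placeholders for whatever your HW3 kernel needs.
        dim3 block(32, 32);
        dim3 grid(N / 32, rowsPerStream / 32);
        sliceGemmKernel<<<grid, block, 0, streams[s]>>>(
            N, rowsPerStream, alpha, dA + offset, dB, beta, dC + offset);

        // (6)/(7): copy the finished slice of C back to the host.
        cudaMemcpyAsync(hC + offset, dC + offset, sliceBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}

Remember that copies issued with cudaMemcpyAsync only overlap with kernel execution when the host buffers are pinned (e.g., allocated with cudaMallocHost); with pageable buffers they degrade to staged, largely synchronous transfers.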
