In this homework series, you'll optimize a CUDA implementation of General Matrix Multiply (GEMM). Note that GEMM is slightly more involved than a plain matrix multiply: it also applies constant scaling factors and adds the result to the existing values in the output matrix.
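Concretely, GEMM computes C ← αAB + βC, where α and β are scalar constants; this is the standard BLAS definition of the operation.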
TBD
TBD
Use an EC2 g4dn.xlarge instance (currently the cheapest Nvidia GPU instance) with the ami-05c3e698bd0cffe7e AMI (an official Ubuntu 20.04 image with Nvidia GPU tools & PyTorch installed). Other AMIs can sometimes have fees associated with them.
You can use the cheapest storage (magnetic HDD) as disk performance doesn't matter for us. I recommend setting up an Elastic IP Address so that you have a consistent DNS name for your instance; it makes it much easier to connect to your instance via SSH and VSCode.
I also recommend using VSCode to write your code. Some key extensions to install are Nsight Visual Studio Code Edition, Remote - SSH, and the C/C++ Extension Pack. Remote - SSH lets you connect to your instance as a "remote" and edit code on it from your local machine, and the Nsight extension provides integration with the cuda-gdb debugger, which is very helpful.
Finally, install the Nvidia Nsight Compute profiler on your local machine (it's pre-installed on your instance) so you can peruse profiling reports easily. Note that you don't need an Nvidia GPU to view profiling data: you'll generate a profiling report on the EC2 instance and then view it on your local machine.
Our GEMM algorithms will operate on matrices with 32-bit float elements, which is the float datatype in CUDA.
At a high level, the code provided in cugemm.cu does the following:
- allocates input and output square matrices of the requested size
- initializes the input matrices with random values
- runs the requested GEMM algorithm (more details below) for the requested number of repetitions
- (optionally) validates the GEMM result
The matrix size, validation, repetition count and algorithm can all be controlled via command-line flags.
To begin with, only two GEMM algorithms are available: a naive version in runBasic
and a super-optimized version from Nvidia's cuBLAS library in runCublas.
cuBLAS is the reference point for validation: if validation is requested then we run cuBLAS to get the correct answer
and compare the other algorithm's output to it.
Build & profile the runBasic code as follows:
git checkout ...
cd <repo-working-copy>/gemm/
make -j3 all
./cugemm.bin --size=2048 --reps=1 --algo=1
This will build 3 versions of the code: an optimized version, an optimized version with some debugging information for profiling, and a version built without optimizations and with extra debugging symbols.
When you run the optimized version cugemm.bin it should report a performance of around 60 GFLOPS, which is far below what the GPU can provide.
Next, we'll profile our kernel to see why it is so slow:
sudo /usr/local/cuda-11.8/bin/ncu -o profile-basic --set full ./cugemm-profile.bin --size=4096 --reps=1 --algo=1 --validate=false
Note: you can follow these instructions to avoid the need for sudo when profiling.
Because we used --set full to collect a full set of profiling data, it will take a couple of minutes to run. The results are best viewed with the Nvidia Nsight Compute profiler running in a graphical environment (i.e., not the command line) on your local machine.
Profiling will reveal an absurd number of uncoalesced global memory accesses.
Nvidia ships a number of "compute sanitizers" that check for common memory safety errors (e.g., out-of-bounds accesses) and concurrency errors.
You should run them on your debug binaries to get better reporting of where errors are in your source code. They are an easy way to get
some clues about where to start when your code isn't passing validation.
compute-sanitizer --tool memcheck ./cugemm-debug.bin ...
compute-sanitizer --tool racecheck ./cugemm-debug.bin ...
Your first task is to fix all of the uncoalesced global memory accesses in runBasic. Note that you have control over the order in which the elements of the output matrix are computed, and can leverage floating-point commutativity and also assume associativity (even though in reality floating-point addition and multiplication are not associative). You should compute the dot products incrementally in an order that yields coalesced memory accesses.
Copy the runBasic code to runGmemCoalesced and edit it there. Resolving the issues should result in a significant speedup (~550 GFLOPS on 2048×2048 input matrices).
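For orientation, here is a minimal sketch of the access pattern. The kernel name, signature, and row-major square-matrix layout are assumptions for illustration, not the exact interface in cugemm.cu:

```cuda
// Sketch only: parameter names, row-major layout, and square N x N matrices are
// assumptions, not the exact interface used in cugemm.cu.
__global__ void gmemCoalescedKernel(int N, float alpha, const float *A,
                                    const float *B, float beta, float *C) {
    // threadIdx.x varies fastest within a warp, so let it pick the column of C:
    // consecutive threads then read consecutive elements of B and write
    // consecutive elements of C (coalesced), while each A element is broadcast.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= N || col >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < N; k++)
        acc += A[row * N + k] * B[k * N + col];

    C[row * N + col] = alpha * acc + beta * C[row * N + col];
}
```

With this mapping and a launch such as dim3 block(32, 32), grid(N/32, N/32), each warp covers 32 consecutive columns of C, so the B loads and C accesses coalesce while each A element is broadcast across the warp.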
Cache tiles of the input matrices in shared memory to avoid redundant loads from global memory. This should result in another significant speedup, to ~1 TFLOPS.
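One possible tiling scheme is sketched below; the tile size, the names, and the assumption that N is a multiple of the tile size (true for 2048 and 4096) are illustrative choices, not the reference solution:

```cuda
// Sketch of one shared-memory tiling scheme, assumed to be launched with
// dim3 block(TILE, TILE), grid(N/TILE, N/TILE) and N a multiple of TILE.
#define TILE 32

__global__ void smemTiledKernel(int N, float alpha, const float *A,
                                const float *B, float beta, float *C) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int col = blockIdx.x * TILE + threadIdx.x;
    int row = blockIdx.y * TILE + threadIdx.y;
    float acc = 0.0f;

    // March across the K dimension one TILE-wide slab at a time.
    for (int t = 0; t < N; t += TILE) {
        // Each thread stages one element of A and one of B; both loads are
        // coalesced because threadIdx.x indexes the innermost dimension.
        As[threadIdx.y][threadIdx.x] = A[row * N + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();

        // Every element staged above is reused TILE times from shared memory.
        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = alpha * acc + beta * C[row * N + col];
}
```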
Have each thread compute multiple cells of the output matrix C, instead of just one. This improves arithmetic intensity and should lift performance further, to ~3 TFLOPS. For reference, cuBLAS reached about 7.1 TFLOPS on my instance (with the T4's hardware limit being 8.1 TFLOPS), so we're at over 40% of that optimal performance - not too shabby!
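One common way to structure this (a sketch under assumed tile sizes, not necessarily how the reference solution is written) is 1D block-tiling: each thread accumulates TM vertically adjacent elements of C in registers, so every value read from the shared-memory tile of B is reused TM times.

```cuda
// Sketch only: BM/BN/BK/TM and all names are illustrative; assumes N is a
// multiple of the tile sizes and a launch of dim3 grid(N/BN, N/BM) with
// (BM*BN)/TM = 512 threads per block.
#define BM 64
#define BN 64
#define BK 8
#define TM 8

__global__ void multiOutputKernel(int N, float alpha, const float *A,
                                  const float *B, float beta, float *C) {
    __shared__ float As[BM][BK];
    __shared__ float Bs[BK][BN];

    int threadCol = threadIdx.x % BN;          // column of the C tile this thread owns
    int threadRow = (threadIdx.x / BN) * TM;   // first of the TM rows it owns
    int cRow = blockIdx.y * BM;
    int cCol = blockIdx.x * BN;

    float acc[TM] = {0.0f};                    // per-thread results live in registers

    for (int t = 0; t < N; t += BK) {
        // Cooperatively stage the BM x BK tile of A and the BK x BN tile of B.
        for (int i = threadIdx.x; i < BM * BK; i += blockDim.x)
            As[i / BK][i % BK] = A[(cRow + i / BK) * N + t + i % BK];
        for (int i = threadIdx.x; i < BK * BN; i += blockDim.x)
            Bs[i / BN][i % BN] = B[(t + i / BN) * N + cCol + i % BN];
        __syncthreads();

        for (int k = 0; k < BK; k++) {
            float b = Bs[k][threadCol];        // read once, reused TM times below
            for (int m = 0; m < TM; m++)
                acc[m] += As[threadRow + m][k] * b;
        }
        __syncthreads();
    }

    for (int m = 0; m < TM; m++) {
        int idx = (cRow + threadRow + m) * N + cCol + threadCol;
        C[idx] = alpha * acc[m] + beta * C[idx];
    }
}
```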
Building on your unmodified runSharedMemMultiOutput kernel from HW3, use CUDA streams to overlap memory copies with kernel execution. The numbers in the timeline below indicate the order of key operations.
- Copy the B matrix in its entirety to the device (1), since all of the output elements depend on all of the rows of B.
- In a stream S0, transfer the first slice of rows of the A (2) and C (3) matrices using cudaMemcpyAsync, then launch a kernel to compute those rows of C, and finally copy those rows of C back to the host (6).
- Meanwhile, in another stream S1, transfer the next slice of rows of A (4) and C (5), launch a kernel to compute those rows of C, and copy them back to the host (7).
The number of streams to use is a runtime argument, determined by the --streams flag. The rows of A and C should be split evenly across all available streams.
Using streams, you will overlap transfer of the A and C matrices with the matmul kernel. For reference, on a GTX Titan X GPU I was seeing about a 15% performance improvement with 8 streams compared to a fully-synchronous single-stream version.
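A host-side sketch of this pipeline is below. The kernel prototype, buffer names, and launch configuration are assumptions, not the actual interface in cugemm.cu; it also assumes the number of streams divides the matrix dimension evenly and that the host buffers are pinned (cudaMallocHost), which is required for cudaMemcpyAsync to overlap copies with kernel execution.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Assumed prototype for the HW3 kernel, adapted to compute only `rows` rows of C;
// the real signature may differ.
__global__ void sharedMemMultiOutputKernel(int rows, int N, float alpha,
                                           const float *A, const float *B,
                                           float beta, float *C);

void runStreamed(int N, int nStreams, float alpha, const float *hA,
                 const float *hB, float beta, float *hC,
                 float *dA, float *dB, float *dC) {
    // (1) Every output row needs all of B, so copy it up front, synchronously.
    cudaMemcpy(dB, hB, (size_t)N * N * sizeof(float), cudaMemcpyHostToDevice);

    std::vector<cudaStream_t> streams(nStreams);
    for (auto &s : streams) cudaStreamCreate(&s);

    int rowsPerStream = N / nStreams;                        // assumes an even split
    size_t sliceBytes = (size_t)rowsPerStream * N * sizeof(float);

    for (int i = 0; i < nStreams; i++) {
        size_t off = (size_t)i * rowsPerStream * N;          // first element of this slice
        cudaStream_t s = streams[i];

        // (2)(3) / (4)(5): copy this stream's slice of A and C host-to-device.
        cudaMemcpyAsync(dA + off, hA + off, sliceBytes, cudaMemcpyHostToDevice, s);
        cudaMemcpyAsync(dC + off, hC + off, sliceBytes, cudaMemcpyHostToDevice, s);

        // Compute only this slice's rows of C (illustrative launch configuration).
        dim3 block(32, 32);
        dim3 grid(N / 32, rowsPerStream / 32);
        sharedMemMultiOutputKernel<<<grid, block, 0, s>>>(
            rowsPerStream, N, alpha, dA + off, dB, beta, dC + off);

        // (6) / (7): copy the finished rows of C back in the same stream, so this
        // copy waits for its kernel but not for the other streams.
        cudaMemcpyAsync(hC + off, dC + off, sliceBytes, cudaMemcpyDeviceToHost, s);
    }

    for (auto &s : streams) { cudaStreamSynchronize(s); cudaStreamDestroy(s); }
}
```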