This repository provides a standalone, high-performance General Matrix Multiplication (GEMM) implementation optimized for AMD GPUs, targeting single-precision floating-point operations (SGEMM).
Note that the library isn't fully tuned: only a limited set of sizes has been tuned so far, and for other inputs the configuration closest to a tuned size is selected. The current workflow is therefore to tune for the specific sizes of your use case before building. This may be improved in the future if time permits.
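As a rough illustration of the size-to-configuration mapping mentioned above, the sketch below picks the tuned configuration nearest to the requested problem size. The `tuned_configs` table, the log-space distance metric, and the function name are illustrative assumptions, not the library's actual lookup code.

```python
import math

# Hypothetical table of tuned (M, N, K) sizes and their configurations.
tuned_configs = {
    (1024, 1024, 1024): "config_a",
    (4096, 4096, 4096): "config_b",
}

def select_config(m, n, k):
    """Return the configuration tuned for the nearest problem size."""
    def distance(size):
        # Compare sizes in log space so that 512 vs 1024 and
        # 2048 vs 4096 count as equally "far" apart.
        return sum(abs(math.log2(a) - math.log2(b))
                   for a, b in zip(size, (m, n, k)))
    return tuned_configs[min(tuned_configs, key=distance)]

print(select_config(3000, 3000, 3000))  # -> "config_b"
```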
This repository aims to:
- Provide a focused, high-performance GEMM kernel for single-precision floating-point operations (SGEMM).
- Explore and implement support for various matrix data layouts (e.g., row-major, column-major, potentially tiled formats); see the indexing sketch after this list.
- Provide a benchmarking executable that reports the average, maximum, and minimum time per kernel run, along with the average TFLOPS.
- Tune the GEMM kernel for different M, N, K sizes.
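For readers unfamiliar with the layout distinction, here is a generic illustration of row-major versus column-major element indexing (a plain sketch, not code from this library):

```python
# Generic illustration (not library code): the linear memory index of
# element (i, j) under each storage layout.
def index_row_major(i, j, num_cols):
    # Rows are contiguous: stepping j by 1 moves 1 element in memory.
    return i * num_cols + j

def index_col_major(i, j, num_rows):
    # Columns are contiguous: stepping i by 1 moves 1 element in memory.
    return j * num_rows + i
```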
This implementation was inspired by several key sources and observations:
- Previous WMMA work: Building on experience from my earlier GEMM implementation, rocm_wmma_gemm, which leveraged WMMA instructions, focused on FP16, and achieved good results against rocBLAS.
- Sebastien Vince's research: Heavily influenced by the excellent article "Deep Dive into Matrix Optimization on AMD GPUs", where he achieved impressive SGEMM performance through hand-tuned ISA optimizations for 4096×4096×4096 row-major matrices on a 7900 XTX.
All implementations were tested on the same hardware (AMD 7900 GRE) using minimum execution times, following Sebastien's benchmarking methodology (which uses identity-matrix multiplication), for direct comparison; all matrices are row-major:
| Implementation | Description | Minimum Time (ms) | Performance (TFLOPS) | vs rocBLAS |
|---|---|---|---|---|
| rocBLAS | Baseline | 5.84 | 23.5 | 100.0% |
| Sebastien K5 | LDS Optimization (HIP C++) | 5.39 | 25.5 | 108.5% |
| Sebastien K6 | VALU Optimization (ISA) | 4.84 | 28.4 | 120.8% |
| Sebastien K7 | Loop Unrolling (ISA) | 4.59 | 30.0 | 127.4% |
| Sebastien K8 | Batched GMem loads (ISA) | 3.99 | 34.5 | 146.7% |
| rocm_sgemm | HIP C++ Optimized | 4.36 | 31.5 | 134.0% |
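For reference, the TFLOPS figures follow from the standard operation count for GEMM, 2 * M * N * K floating-point operations (one multiply and one add per inner-product term). Checking the rocBLAS row of the table above:

```python
# A GEMM performs 2*M*N*K floating-point operations.
M = N = K = 4096
time_s = 5.84e-3  # rocBLAS minimum time from the table
tflops = 2 * M * N * K / time_s / 1e12
print(f"{tflops:.1f} TFLOPS")  # ~23.5, matching the table
```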
Note that average execution times typically provide a more realistic performance indicator for practical applications.
Below are the average execution times, obtained by modifying Sebastien's benchmarking methodology:
| Implementation | Description | Average Time (ms) | Performance (TFLOPS) | vs rocBLAS |
|---|---|---|---|---|
| rocBLAS | Baseline | 6.30 | 21.8 | 100.0% |
| Sebastien K5 | LDS Optimization (HIP C++) | 5.98 | 23.0 | 105.5% |
| Sebastien K6 | VALU Optimization (ISA) | 5.48 | 25.1 | 115.1% |
| Sebastien K7 | Loop Unrolling (ISA) | 5.05 | 27.2 | 124.8% |
| Sebastien K8 | Batched GMem loads (ISA) | 4.54 | 30.3 | 139.0% |
| rocm_sgemm | HIP C++ Optimized | 4.70 | 29.3 | 134.4% |
Key Finding: rocm_sgemm matches or exceeds Sebastien's hand-tuned ISA Kernel 7, showing that the perceived "HIP C++ limitation" can be overcome with the right optimization techniques while maintaining portability across GPU architectures.
While Sebastien noted that his performance gains "would not have been possible using only HIP C++," rocm_sgemm demonstrates that there's still significant optimization potential within the HIP C++ framework. By carefully applying advanced optimization techniques, it's possible to achieve competitive performance while preserving portability and maintainability across different GPU architectures.
Below is a comparison against rocBLAS for different layout permutations, using regular matrix multiplication (different input values).
Square Matrix Performance by Layout:
Prerequisites:
- AMD ROCm installed with HIP support
- CMake version 3.10 or higher
- Python3 (required for config generation and tuning)
- Python packages (can be installed with pip or conda): numpy, optuna
- AMD RDNA GPU (code needs to be modified to support CDNA GPUs)
- Clone the repository:
```bash
git clone https://github.com/adelj88/rocm_sgemm.git
cd rocm_sgemm
```
- Build:
```bash
mkdir build
cd build
CXX=/opt/rocm/bin/hipcc cmake ..
make
```
Run the executables after building:
```bash
# Assumes you're currently in the build directory

# To run unit tests
./test/gemm_test

# To run benchmarks
./benchmark/gemm_bench

# To run the rocBLAS equivalents for verification
./test/rocblas_test
./benchmark/rocblas_bench
```
The library includes an Optuna-based Tree-structured Parzen Estimator (TPE) tuner that automatically finds optimal kernel configurations for different matrix sizes and data layouts.
The tuner uses TPE to efficiently explore the discrete parameter space (a sketch of this setup follows the list below):
- TPE optimization: Models the performance landscape using probabilistic distributions to intelligently sample promising regions
- Smart initialization: Tests proven baseline configurations first to seed the optimization with known good solutions
- Multivariate learning: Understands relationships between parameters (e.g., block sizes and tile configurations)
- Adaptive sampling: Balances exploration of uncertain regions with exploitation of high-performing areas
- Reproducible results: Uses configurable random seeds for consistent and repeatable tuning runs
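Below is a minimal sketch of how such a TPE search can be set up with Optuna. The parameter names, value ranges, and the `benchmark_kernel` hook are illustrative assumptions, not the tuner's actual code.

```python
import optuna

# Hypothetical benchmark hook (not the tuner's real code): compile and
# time the kernel for one configuration, returning achieved TFLOPS.
def benchmark_kernel(block_m, block_n, block_k, threads):
    # Placeholder score so the sketch runs standalone; the real tuner
    # would launch the compiled kernel and measure it.
    return 1.0 / (abs(block_m - 128) + abs(block_k - 8) + threads / 256)

def objective(trial):
    # Discrete search space in the spirit of block/tile parameters;
    # the names and values here are illustrative assumptions.
    block_m = trial.suggest_categorical("block_m", [64, 128, 256])
    block_n = trial.suggest_categorical("block_n", [64, 128, 256])
    block_k = trial.suggest_categorical("block_k", [8, 16, 32])
    threads = trial.suggest_categorical("threads", [128, 256])
    return benchmark_kernel(block_m, block_n, block_k, threads)

# Multivariate TPE models interactions between parameters; a fixed
# seed makes runs reproducible, matching the options described above.
sampler = optuna.samplers.TPESampler(multivariate=True, seed=123)
study = optuna.create_study(direction="maximize", sampler=sampler)

# Evaluate a known-good baseline configuration before TPE sampling.
study.enqueue_trial({"block_m": 128, "block_n": 128, "block_k": 8,
                     "threads": 256})
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```

The `enqueue_trial` call mirrors the smart-initialization idea: a proven baseline is evaluated first so the optimizer starts from a known good solution.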
To run the tuner:
```bash
cd build

# Default behavior (all sizes and layouts); results are written to gemm_config_tuned.json
python3 tune.py

# Test specific sizes
python3 tune.py --sizes 1024,1024,1024 2048,2048,2048

# Adjust the evaluation budget
python3 tune.py --budget 100

# Test specific layouts
python3 tune.py --layouts r,c,r c,c,c

# Reproducible results with a specific seed
python3 tune.py --seed 123

# Target a different GPU architecture
python3 tune.py --gpu-arch gfx1103

# Custom output file
python3 tune.py --output my_config.json

# Custom baseline configurations
python3 tune.py --baselines 128,128,128,8,4,4,2,4,8 256,128,128,8,2,2,4,4,4
```
Below are benchmark results (in TFLOPS) comparing rocm_sgemm against rocBLAS for all layouts and different sizes.
Future plans:
- Further tuning for better performance
- Exploring further optimization opportunities
This project is licensed under the MIT License - see the LICENSE file for details.