ROCm WMMA GEMM

This repository provides a standalone, high-performance General Matrix Multiplication (GEMM) implementation optimized for AMD GPUs using ROCm's Wave Matrix Multiply-Accumulate (WMMA) intrinsics. It is derived from the fastest half-precision GEMM kernel developed in the hgemm sample within the rocm_wmma_samples project. This new repository refactors the kernel to facilitate exploration of different matrix data layouts and further optimizations.

Note that the library is not fully tuned; only a limited set of sizes has been tuned so far (for other inputs, the configuration tuned for the closest size is selected). The current workflow is to tune for the specific sizes of your use case before building. This may be improved in the future, time permitting.
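
As a rough illustration of the idea only (this is not the library's actual selection logic; the table contents and distance metric below are hypothetical placeholders), "pick the configuration tuned for the nearest size" could look like this:

# Illustrative sketch: select the configuration tuned for the nearest (M, N, K).
# The tuned sizes, config names, and log2 distance metric are placeholders.
import math

tuned_configs = {
    # (M, N, K) -> kernel configuration
    (1024, 1024, 1024): "config_a",
    (2048, 2048, 2048): "config_b",
    (4096, 4096, 4096): "config_c",
}

def select_config(m, n, k):
    # Pick the tuned size closest to (m, n, k) in log2 space.
    def distance(size):
        return sum((math.log2(a) - math.log2(b)) ** 2 for a, b in zip(size, (m, n, k)))
    return tuned_configs[min(tuned_configs, key=distance)]

print(select_config(1536, 1536, 1536))  # resolves to one of the tuned entries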

Purpose

This repository aims to:

  • Provide a focused, high-performance GEMM kernel utilizing ROCm WMMA intrinsics.
  • Isolate and refine the fastest GEMM implementation derived from the hgemm sample in rocm_wmma_samples.
  • Explore and implement support for various matrix data layouts (e.g., row-major, column-major, potentially tiled formats) beyond the format used in the sample.
  • Support FP16, BF16, and float accumulators.
  • Tune the GEMM kernel for different M, N, K sizes.

Overview

This implementation leverages ROCm's Wave Matrix Multiply-Accumulate (WMMA) intrinsics to achieve high-performance GEMM operations across diverse matrix configurations and data layouts.

Performance Analysis Across Matrix Shapes

Testing on an AMD Radeon RX 7900 GRE reveals distinct performance patterns for both square and rectangular matrices:

Square Matrix Performance by Layout:

(Figure: WMMA square matrix performance across layouts)

Rectangular Matrix Performance by Layout:

(Figure: WMMA rectangular matrix performance across layouts)

Key Finding: rocm_wmma_gemm remains competitive with rocBLAS across diverse matrix configurations, demonstrating that WMMA intrinsics can be effectively leveraged for high-performance GEMM implementations.

Building the Project

Prerequisites

  • AMD ROCm installed with HIP support
  • CMake version 3.10 or higher
  • Python3 (required for config generation and tuning)
    • Python packages (can be installed with pip or conda)
      • numpy
      • optuna
  • AMD RDNA3/RDNA3.5 GPU (required for WMMA support)

Build Steps

  1. Clone the repository:
    git clone https://github.com/adelj88/rocm_wmma_gemm.git
    cd rocm_wmma_gemm
  2. Build:
    mkdir build
    cd build
    CXX=/opt/rocm/bin/hipcc cmake ..
    make

Usage

Run the executable after building:

# Assumes you're currently in the build directory
# To run unit tests
./test/test_float_accum
./test/test_same_prec

# To run unit benchmarks
./benchmark/bench_bf16_bf16
./benchmark/bench_float_bf16
./benchmark/bench_float_half
./benchmark/bench_half_half

# To run the rocBLAS equivalents for verification
./test/test_rocblas
./benchmark/bench_rocblas

Automatic Kernel Tuning

The library includes an Optuna-based Tree-structured Parzen Estimator (TPE) tuner that automatically finds optimal kernel configurations for different matrix sizes and data layouts.

Tuning Approach

The tuner uses Optuna's TPE (Tree-structured Parzen Estimator) sampler to efficiently explore the discrete parameter space (a minimal sketch follows the list below):

  • TPE optimization: Models the performance landscape using probabilistic distributions to intelligently sample promising regions
  • Smart initialization: Tests proven baseline configurations first to seed the optimization with known good solutions
  • Multivariate learning: Understands relationships between parameters (e.g., block sizes and tile configurations)
  • Adaptive sampling: Balances exploration of uncertain regions with exploitation of high-performing areas
  • Reproducible results: Uses configurable random seeds for consistent and repeatable tuning runs
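
For illustration, here is a minimal sketch of this pattern using Optuna's public API. The parameter names, value ranges, and the run_gemm_benchmark hook are hypothetical placeholders, not the actual tune.py internals:

# Minimal sketch of TPE-based kernel tuning with Optuna (illustrative only).
import optuna

def run_gemm_benchmark(params, size, layout):
    # Placeholder for building/running the kernel with `params` and measuring TFLOPs.
    # A dummy score is returned here so the sketch runs end to end.
    return sum(params.values()) * 1e-3

def make_objective(size, layout):
    def objective(trial):
        params = {
            # Discrete search space (names and values are illustrative only).
            "warps_m": trial.suggest_categorical("warps_m", [1, 2, 4, 8]),
            "warps_n": trial.suggest_categorical("warps_n", [1, 2, 4, 8]),
            "block_m": trial.suggest_categorical("block_m", [64, 128, 256]),
            "block_n": trial.suggest_categorical("block_n", [64, 128, 256]),
        }
        return run_gemm_benchmark(params, size, layout)  # TFLOPs, to be maximized
    return objective

# Multivariate TPE models relationships between parameters; the seed makes runs reproducible.
sampler = optuna.samplers.TPESampler(seed=123, multivariate=True)
study = optuna.create_study(direction="maximize", sampler=sampler)

# Seed the search with a known-good baseline configuration before TPE takes over.
study.enqueue_trial({"warps_m": 4, "warps_n": 4, "block_m": 128, "block_n": 128})

study.optimize(make_objective((1024, 1024, 1024), ("r", "c")), n_trials=100)
print(study.best_params, study.best_value)

In the real tuner, the objective would measure TFLOPs by running the built benchmark, and the best configuration per size and layout is written to a JSON file such as gemm_config_tuned.json.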

To run the tuner:

cd build

# Default behavior (all sizes and layouts)
python3 tune.py # Results written to gemm_config_tuned.json

# Test specific sizes
python3 tune.py --sizes 1024,1024,1024 2048,2048,2048

# Adjust evaluation budget
python3 tune.py --budget 100

# Test specific layouts
python3 tune.py --layouts r,c c,c

# Reproducible results with specific seed
python3 tune.py --seed 123

# Different GPU architecture
python3 tune.py --gpu-arch gfx1103

# Custom output file
python3 tune.py --output my_config.json

# Custom baseline configurations
python3 tune.py --baselines 4,4,4,4,256,0 2,2,4,4,128,1 8,2,2,2,64,0

Performance Results

Below are benchmark results (in TFLOPs) comparing rocm_wmma_gemm against rocBLAS for all layouts and a range of sizes.

Future Plans

  1. Experiment with an SoA (structure-of-arrays) layout for the fragment class
  2. Add batched unit tests
  3. Explore further optimization opportunities (e.g. Stream-K for smaller M, N, K)
  4. Tune for RDNA3.5
  5. Modify fragments to support RDNA4 WMMA

License

This project is licensed under the MIT License - see the LICENSE file for details.
