ROCm WMMA GEMM

This repository provides a standalone, high-performance General Matrix Multiplication (GEMM) implementation optimized for AMD GPUs using ROCm's Wave Matrix Multiply-Accumulate (WMMA) intrinsics. It is derived from the fastest half-precision GEMM kernel developed in the hgemm sample within the rocm_wmma_samples project. This new repository refactors the kernel to facilitate exploration of different matrix data layouts and further optimizations.

Note that the library is not fully tuned; only a limited set of sizes has been tuned so far (for other inputs, the configuration tuned for the closest size is selected). The current workflow is to tune for the specific sizes of your use case before building. This may be improved in the future, time permitting.
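
As a rough illustration of the idea only (this is not the library's actual selection logic; the table contents and distance metric below are hypothetical placeholders), "pick the configuration tuned for the nearest size" could look like this:

# Illustrative sketch: select the configuration tuned for the nearest (M, N, K).
# The tuned sizes, config names, and log2 distance metric are placeholders.
import math

tuned_configs = {
    # (M, N, K) -> kernel configuration
    (1024, 1024, 1024): "config_a",
    (2048, 2048, 2048): "config_b",
    (4096, 4096, 4096): "config_c",
}

def select_config(m, n, k):
    # Pick the tuned size closest to (m, n, k) in log2 space.
    def distance(size):
        return sum((math.log2(a) - math.log2(b)) ** 2 for a, b in zip(size, (m, n, k)))
    return tuned_configs[min(tuned_configs, key=distance)]

print(select_config(1536, 1536, 1536))  # resolves to one of the tuned entries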

Purpose

This repository aims to:

  • Provide a focused, high-performance GEMM kernel utilizing ROCm WMMA intrinsics.
  • Isolate and refine the fastest GEMM implementation derived from the hgemm sample in rocm_wmma_samples.
  • Explore and implement support for various matrix data layouts (e.g., row-major, column-major, potentially tiled formats) beyond the format used in the sample.
  • Support FP16, BF16, and float accumulators.
  • Tune the GEMM kernel for different M, N, K sizes.

Overview

This implementation leverages ROCm's Wave Matrix Multiply-Accumulate (WMMA) intrinsics to achieve high-performance GEMM operations across diverse matrix configurations and data layouts.

Performance Analysis Across Matrix Shapes

Testing on an AMD Radeon RX 7900 GRE reveals distinct performance patterns for both square and rectangular matrices:

Square Matrix Performance by Layout:

(Figure: WMMA square matrix performance across layouts)

Rectangular Matrix Performance by Layout:

(Figure: WMMA rectangular matrix performance across layouts)

Key Finding: rocm_wmma_gemm remains competitive with rocBLAS across diverse matrix configurations, demonstrating that WMMA intrinsics can be effectively leveraged for high-performance GEMM implementations.

Building the Project

Prerequisites

  • AMD ROCm installed with HIP support
  • CMake version 3.10 or higher
  • Python3 (required for config generation and tuning)
    • Python packages (can be installed with pip or conda)
      • numpy
      • optuna
  • AMD RDNA3/RDNA3.5 GPU (required for WMMA support)

Build Steps

  1. Clone the repository:
    git clone https://github.com/adelj88/rocm_wmma_gemm.git
    cd rocm_wmma_gemm
  2. Build:
    mkdir build
    cd build
    CXX=/opt/rocm/bin/hipcc cmake ..
    make

Usage

Run the executable after building:

# Assumes you're currently in the build directory
# To run unit tests
./test/test_float_accum
./test/test_same_prec

# To run unit benchmarks
./benchmark/bench_bf16_bf16
./benchmark/bench_float_bf16
./benchmark/bench_float_half
./benchmark/bench_half_half

# To run the rocBLAS equivalents for verification
./test/test_rocblas
./benchmark/bench_rocblas

Automatic Kernel Tuning

The library includes an Optuna-based Tree-structured Parzen Estimator (TPE) tuner that automatically finds optimal kernel configurations for different matrix sizes and data layouts.

Tuning Approach

The tuner uses Optuna's TPE (Tree-structured Parzen Estimator) sampler to efficiently explore the discrete parameter space (a minimal sketch follows the list below):

  • TPE optimization: Models the performance landscape using probabilistic distributions to intelligently sample promising regions
  • Smart initialization: Tests proven baseline configurations first to seed the optimization with known good solutions
  • Multivariate learning: Understands relationships between parameters (e.g., block sizes and tile configurations)
  • Adaptive sampling: Balances exploration of uncertain regions with exploitation of high-performing areas
  • Reproducible results: Uses configurable random seeds for consistent and repeatable tuning runs
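
For illustration, here is a minimal sketch of this pattern using Optuna's public API. The parameter names, value ranges, and the run_gemm_benchmark hook are hypothetical placeholders, not the actual tune.py internals:

# Minimal sketch of TPE-based kernel tuning with Optuna (illustrative only).
import optuna

def run_gemm_benchmark(params, size, layout):
    # Placeholder for building/running the kernel with `params` and measuring TFLOPs.
    # A dummy score is returned here so the sketch runs end to end.
    return sum(params.values()) * 1e-3

def make_objective(size, layout):
    def objective(trial):
        params = {
            # Discrete search space (names and values are illustrative only).
            "warps_m": trial.suggest_categorical("warps_m", [1, 2, 4, 8]),
            "warps_n": trial.suggest_categorical("warps_n", [1, 2, 4, 8]),
            "block_m": trial.suggest_categorical("block_m", [64, 128, 256]),
            "block_n": trial.suggest_categorical("block_n", [64, 128, 256]),
        }
        return run_gemm_benchmark(params, size, layout)  # TFLOPs, to be maximized
    return objective

# Multivariate TPE models relationships between parameters; the seed makes runs reproducible.
sampler = optuna.samplers.TPESampler(seed=123, multivariate=True)
study = optuna.create_study(direction="maximize", sampler=sampler)

# Seed the search with a known-good baseline configuration before TPE takes over.
study.enqueue_trial({"warps_m": 4, "warps_n": 4, "block_m": 128, "block_n": 128})

study.optimize(make_objective((1024, 1024, 1024), ("r", "c")), n_trials=100)
print(study.best_params, study.best_value)

In the real tuner, the objective would measure TFLOPs by running the built benchmark, and the best configuration per size and layout is written to a JSON file such as gemm_config_tuned.json.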

To run the tuner:

cd build

# Default behavior (all sizes and layouts)
python3 tune.py # Results written to gemm_config_tuned.json

# Test specific sizes
python3 tune.py --sizes 1024,1024,1024 2048,2048,2048

# Adjust evaluation budget
python3 tune.py --budget 100

# Test specific layouts
python3 tune.py --layouts r,c c,c

# Reproducible results with specific seed
python3 tune.py --seed 123

# Different GPU architecture
python3 tune.py --gpu-arch gfx1103

# Custom output file
python3 tune.py --output my_config.json

# Custom baseline configurations
python3 tune.py --baselines 4,4,4,4,256,0 2,2,4,4,128,1 8,2,2,2,64,0

Performance Results

Below are benchmark results (in TFLOPs) comparing rocm_wmma_gemm against rocBLAS for all layouts and a range of sizes.

Future Plans

  1. Experiment with an SoA (structure-of-arrays) layout for the fragment class
  2. Add batched unit tests
  3. Explore further optimization opportunities (e.g. Stream-K for smaller M, N, K)
  4. Tune for RDNA3.5
  5. Modify fragments to support RDNA4 WMMA

License

This project is licensed under the MIT License - see the LICENSE file for details.
