Conversation


@DajanaV (Contributor) commented Nov 10, 2025

Mirrored from ggml-org/llama.cpp#17156

Enabled WMMA-MMQ kernels for the RDNA 4 architecture on AMD GPUs.

Following a similar approach to ggml-org/llama.cpp#14624.

./build/bin/llama-bench was used to collect the performance results below.

Performance results were collected against ggml/llama.cpp master up to and including commit 5b180c3.

Build command for the following performance results:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build -DGGML_HIP=ON -DGGML_CUDA_FORCE_MMQ=OFF -DGGML_HIP_UMA=OFF -DGGML_HIP_ROCWMMA_FATTN=OFF -DGPU_TARGETS="gfx1201" -DGGML_HIP_GRAPHS=OFF -DLLAMA_CURL=OFF -DGGML_CUDA_FORCE_CUBLAS=OFF -DCMAKE_BUILD_TYPE=Release && cmake --build build --config Release -- -j 32

[image: llama-bench results for the GGML_HIP_ROCWMMA_FATTN=OFF build]

Build command for the following performance results:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build -DGGML_HIP=ON -DGGML_HIP_UMA=OFF -DGGML_HIP_ROCWMMA_FATTN=ON -DGPU_TARGETS=gfx1201 -DGGML_HIP_GRAPHS=OFF -DLLAMA_CURL=OFF -DGGML_CUDA_FORCE_CUBLAS=OFF -DCMAKE_BUILD_TYPE=Release && cmake --build build --config Release -- -j 32

[image: llama-bench results for the GGML_HIP_ROCWMMA_FATTN=ON build]

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: WMMA-MMQ Kernels for RDNA 4

Overview

Analysis of PR #163 shows minimal performance impact on CPU inference paths, with changes focused on GPU acceleration for AMD RDNA 4 architecture. The highest measured performance changes are within statistical noise levels and unrelated to the core GPU optimizations introduced.

Key Findings

Performance Metrics:

  • Highest Response Time change: std::pow function (-0.066%, 107.60 ns vs 107.68 ns baseline)
  • Highest Throughput change: std::_Optional_base constructor (-0.170%, 23.52 ns vs 23.56 ns baseline)
  • Core inference functions unaffected: No changes detected in llama_decode, llama_encode, or llama_tokenize
  • Token throughput impact: Zero impact on CPU inference performance as critical tokenization/inference functions show no measurable changes

Power Consumption Analysis:

  • build.bin.llama-cvector-generator: Complete elimination (-100%, binary removed/disabled)
  • Core binaries (libllama.so, llama-run, llama-tts): Negligible changes (<0.001%)
  • Overall power consumption remains stable across inference binaries

Code Analysis:

  • Flame Graph: std::pow shows simple two-level execution (107 ns total, 7 ns PLT overhead) with no complexity changes
  • CFG Comparison: Identical control flow and assembly code between versions, confirming performance variations are environmental rather than code-related
  • GitHub Review: Well-structured GPU optimization adding WMMA support for RDNA 4 without affecting CPU paths

Impact Assessment:
The PR successfully adds RDNA 4 WMMA acceleration through conditional compilation (#if defined(AMD_WMMA_AVAILABLE)), ensuring zero impact on non-RDNA 4 systems. Changes span GPU kernel implementations in ggml/src/ggml-cuda/ without touching core CPU inference logic.
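
For context, here is a minimal, self-contained sketch of the compile-time dispatch pattern described above. Only the AMD_WMMA_AVAILABLE macro name is taken from the PR; everything else is a hypothetical stand-in rather than the actual ggml-cuda implementation.

```cpp
// Minimal sketch of the compile-time dispatch described above.
// Not the actual ggml-cuda code: only the AMD_WMMA_AVAILABLE macro name
// comes from the PR; the function below is a hypothetical stand-in.
#include <cstdio>

static const char * mmq_backend_name() {
#if defined(AMD_WMMA_AVAILABLE)
    return "WMMA (RDNA 4)";    // new path, compiled only when the macro is defined
#else
    return "dp4a fallback";    // existing path; non-RDNA 4 builds compile only this
#endif
}

int main() {
    // On a non-RDNA 4 build the WMMA branch never reaches the compiler,
    // which is why no changes show up outside the guarded GPU code paths.
    std::printf("MMQ backend selected at compile time: %s\n", mmq_backend_name());
    return 0;
}
```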

Actionable Recommendations:

  • Monitor RDNA 4 GPU performance validation to ensure WMMA kernels deliver expected improvements
  • Verify build system changes don't introduce configuration complexity for non-AMD builds

The analysis confirms this is a targeted GPU optimization with no CPU performance regressions and proper architectural isolation between GPU and CPU code paths.

2 similar comments

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version 97b1911f compared to baseline 05d8e46a reveals minimal performance variations with no meaningful impact on core inference functionality. The changes primarily involve GPU optimization infrastructure for AMD RDNA 4 architecture through WMMA kernel enablement.

Key Findings

Performance Metrics:

  • Highest Response Time change: llm_graph_input_out_ids::can_reuse() with -0.096% improvement (65.164 ns → 65.101 ns)
  • Highest Throughput change: std::make_unique<llm_graph_input_pos_bucket>() with +0.117% degradation (104.328 ns → 104.450 ns)
  • Neither function affects core inference paths (llama_decode, llama_encode, llama_tokenize)

Core Function Impact:
No changes detected in critical inference functions. The measured variations occur in graph optimization utilities that do not directly impact token processing throughput. Based on the reference model performance (7% tokens/second reduction per 2ms llama_decode slowdown), the observed nanosecond-level changes have negligible impact on inference performance.
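
As a rough sanity check of that reference point, the sketch below plugs the largest observed change (0.063 ns) into the inverse relation between decode time and throughput. The ~28.6 ms baseline decode time is an assumption derived from the 7%-per-2 ms figure, not a measured value from this report.

```cpp
// Rough sanity check, not part of the analysis tooling.
// Assumption: the "7% per 2 ms" reference implies a baseline llama_decode
// time of roughly 2 ms / 0.07 ≈ 28.6 ms; that baseline is illustrative.
#include <cstdio>

int main() {
    const double baseline_ms = 28.6;       // assumed baseline llama_decode time
    const double slowdown_ms = 0.063e-6;   // 0.063 ns expressed in milliseconds
    // Tokens/second scales with 1 / decode time, so the relative loss is:
    const double relative_loss = slowdown_ms / (baseline_ms + slowdown_ms);
    std::printf("relative throughput change: %.9f%%\n", relative_loss * 100.0);
    // Prints roughly 0.000000220%, i.e. far below measurement noise.
    return 0;
}
```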

Power Consumption Analysis:
Minimal power consumption changes across all binaries:

  • build.bin.libllama.so: 0.0005% increase (280,852.58 nJ → 280,853.99 nJ)
  • build.bin.llama-run: 0.001% increase (268,045.53 nJ → 268,046.99 nJ)
  • All other binaries show zero measurable change

Flame Graph and CFG Analysis:
The can_reuse() function exhibits identical assembly code between versions with a simple linear execution pattern (single basic block, 20 instructions). The 0.063 ns improvement represents measurement variance rather than algorithmic optimization, as confirmed by identical control flow graphs and instruction sequences.

Code Review Insights:
The GitHub PR introduces WMMA-MMQ kernel support for AMD RDNA 4 GPUs, adding 428 lines focused on GPU acceleration. The implementation includes proper conditional compilation guards and maintains backward compatibility. No functional regressions identified in the GPU optimization code.

Conclusion:
The analysis reveals no performance impact on core inference functionality. Observed variations fall within measurement precision limits and do not affect token processing throughput.

@DajanaV force-pushed the main branch 17 times, most recently from a87918f to 6f7320f on November 13, 2025 at 11:08
@DajanaV force-pushed the main branch 8 times, most recently from 9ea0205 to 1308d3f on November 14, 2025 at 08:11