Conversation


@DajanaV (Contributor) commented Nov 10, 2025

Mirrored from ggml-org/llama.cpp#17156

Enabled WMMA-MMQ kernels for the RDNA 4 architecture on AMD GPUs.

Following a similar approach to ggml-org/llama.cpp#14624.

./build/bin/llama-bench was used to collect the performance results below.

Performance results were collected against ggml/llama.cpp master up to and including commit 5b180c3.

Build command for the following performance results:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build -DGGML_HIP=ON -DGGML_CUDA_FORCE_MMQ=OFF -DGGML_HIP_UMA=OFF -DGGML_HIP_ROCWMMA_FATTN=OFF -DGPU_TARGETS="gfx1201" -DGGML_HIP_GRAPHS=OFF -DLLAMA_CURL=OFF -DGGML_CUDA_FORCE_CUBLAS=OFF -DCMAKE_BUILD_TYPE=Release && cmake --build build --config Release -- -j 32

[image: llama-bench results for the GGML_HIP_ROCWMMA_FATTN=OFF build]

Build command for the following performance results:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build -DGGML_HIP=ON -DGGML_HIP_UMA=OFF -DGGML_HIP_ROCWMMA_FATTN=ON -DGPU_TARGETS=gfx1201 -DGGML_HIP_GRAPHS=OFF -DLLAMA_CURL=OFF -DGGML_CUDA_FORCE_CUBLAS=OFF -DCMAKE_BUILD_TYPE=Release && cmake --build build --config Release -- -j 32

[image: llama-bench results for the GGML_HIP_ROCWMMA_FATTN=ON build]

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: WMMA-MMQ Kernels for RDNA 4

Overview

Analysis of PR #163 shows minimal performance impact on CPU inference paths, with changes focused on GPU acceleration for AMD RDNA 4 architecture. The highest measured performance changes are within statistical noise levels and unrelated to the core GPU optimizations introduced.

Key Findings

Performance Metrics:

  • Highest Response Time change: std::pow function (-0.066%, 107.60 ns vs 107.68 ns baseline)
  • Highest Throughput change: std::_Optional_base constructor (-0.170%, 23.52 ns vs 23.56 ns baseline)
  • Core inference functions unaffected: No changes detected in llama_decode, llama_encode, or llama_tokenize
  • Token throughput impact: Zero impact on CPU inference performance as critical tokenization/inference functions show no measurable changes

Power Consumption Analysis:

  • build.bin.llama-cvector-generator: Complete elimination (-100%, binary removed/disabled)
  • Core binaries (libllama.so, llama-run, llama-tts): Negligible changes (<0.001%)
  • Overall power consumption remains stable across inference binaries

Code Analysis:

  • Flame Graph: std::pow shows simple two-level execution (107 ns total, 7 ns PLT overhead) with no complexity changes
  • CFG Comparison: Identical control flow and assembly code between versions, confirming performance variations are environmental rather than code-related
  • GitHub Review: Well-structured GPU optimization adding WMMA support for RDNA 4 without affecting CPU paths

Impact Assessment:
The PR successfully adds RDNA 4 WMMA acceleration through conditional compilation (#if defined(AMD_WMMA_AVAILABLE)), ensuring zero impact on non-RDNA 4 systems. Changes span GPU kernel implementations in ggml/src/ggml-cuda/ without touching core CPU inference logic.
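
For context, here is a minimal, self-contained sketch of the compile-time dispatch pattern described above. Only the AMD_WMMA_AVAILABLE macro name is taken from the PR; everything else is a hypothetical stand-in rather than the actual ggml-cuda implementation.

```cpp
// Minimal sketch of the compile-time dispatch described above.
// Not the actual ggml-cuda code: only the AMD_WMMA_AVAILABLE macro name
// comes from the PR; the function below is a hypothetical stand-in.
#include <cstdio>

static const char * mmq_backend_name() {
#if defined(AMD_WMMA_AVAILABLE)
    return "WMMA (RDNA 4)";    // new path, compiled only when the macro is defined
#else
    return "dp4a fallback";    // existing path; non-RDNA 4 builds compile only this
#endif
}

int main() {
    // On a non-RDNA 4 build the WMMA branch never reaches the compiler,
    // which is why no changes show up outside the guarded GPU code paths.
    std::printf("MMQ backend selected at compile time: %s\n", mmq_backend_name());
    return 0;
}
```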

Actionable Recommendations:

  • Monitor RDNA 4 GPU performance validation to ensure WMMA kernels deliver expected improvements
  • Verify build system changes don't introduce configuration complexity for non-AMD builds

The analysis confirms this is a targeted GPU optimization with no CPU performance regressions and proper architectural isolation between GPU and CPU code paths.

2 similar comments

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version 97b1911f compared to baseline 05d8e46a reveals minimal performance variations with no meaningful impact on core inference functionality. The changes primarily involve GPU optimization infrastructure for AMD RDNA 4 architecture through WMMA kernel enablement.

Key Findings

Performance Metrics:

  • Highest Response Time change: llm_graph_input_out_ids::can_reuse() with -0.096% improvement (65.164 ns → 65.101 ns)
  • Highest Throughput change: std::make_unique<llm_graph_input_pos_bucket>() with +0.117% degradation (104.328 ns → 104.450 ns)
  • Neither function affects core inference paths (llama_decode, llama_encode, llama_tokenize)

Core Function Impact:
No changes detected in critical inference functions. The measured variations occur in graph optimization utilities that do not directly impact token processing throughput. Based on the reference model performance (7% tokens/second reduction per 2ms llama_decode slowdown), the observed nanosecond-level changes have negligible impact on inference performance.
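
As a rough sanity check of that reference point, the sketch below plugs the largest observed change (0.063 ns) into the inverse relation between decode time and throughput. The ~28.6 ms baseline decode time is an assumption derived from the 7%-per-2 ms figure, not a measured value from this report.

```cpp
// Rough sanity check, not part of the analysis tooling.
// Assumption: the "7% per 2 ms" reference implies a baseline llama_decode
// time of roughly 2 ms / 0.07 ≈ 28.6 ms; that baseline is illustrative.
#include <cstdio>

int main() {
    const double baseline_ms = 28.6;       // assumed baseline llama_decode time
    const double slowdown_ms = 0.063e-6;   // 0.063 ns expressed in milliseconds
    // Tokens/second scales with 1 / decode time, so the relative loss is:
    const double relative_loss = slowdown_ms / (baseline_ms + slowdown_ms);
    std::printf("relative throughput change: %.9f%%\n", relative_loss * 100.0);
    // Prints roughly 0.000000220%, i.e. far below measurement noise.
    return 0;
}
```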

Power Consumption Analysis:
Minimal power consumption changes across all binaries:

  • build.bin.libllama.so: 0.0005% increase (280,852.58 nJ → 280,853.99 nJ)
  • build.bin.llama-run: 0.001% increase (268,045.53 nJ → 268,046.99 nJ)
  • All other binaries show zero measurable change

Flame Graph and CFG Analysis:
The can_reuse() function exhibits identical assembly code between versions with a simple linear execution pattern (single basic block, 20 instructions). The 0.063 ns improvement represents measurement variance rather than algorithmic optimization, as confirmed by identical control flow graphs and instruction sequences.

Code Review Insights:
The GitHub PR introduces WMMA-MMQ kernel support for AMD RDNA 4 GPUs, adding 428 lines focused on GPU acceleration. The implementation includes proper conditional compilation guards and maintains backward compatibility. No functional regressions identified in the GPU optimization code.

Conclusion:
The analysis reveals no performance impact on core inference functionality. Observed variations fall within measurement precision limits and do not affect token processing throughput.

@DajanaV force-pushed the main branch 17 times, most recently from a87918f to 6f7320f on November 13, 2025 at 11:08
@DajanaV force-pushed the main branch 8 times, most recently from 9ea0205 to 1308d3f on November 14, 2025 at 08:11