Native AMD Radeon AI PRO R9700 support for llama.cpp with custom ROCm 7.11 build
This fork demonstrates production-ready LLM inference on AMD's RDNA4 architecture. Single-stream throughput is competitive with NVIDIA RTX 4070 to RTX 4070 Ti class GPUs, and the card's 32GB of VRAM is 33% more than the RTX 4090's 24GB.
Benchmarks: Qwen2.5-Coder-7B Q4_K_M (inference) and HIP Vector Addition (memory bandwidth).
Architecture Improvement: RDNA2 → RDNA4
| Metric | RDNA2 | RDNA4 | Change |
|---|---|---|---|
| Memory bandwidth | 567 GB/s | 569 GB/s | Essentially identical |
| AI inference | 34.2 tok/s | 99.0 tok/s | 2.89x faster |
What changed? Same memory bandwidth, but RDNA4 has:
- ✅ WMMA instructions (hardware matrix operations)
- ✅ Improved FP16/BF16 throughput
- ✅ Better compute scheduling for transformers
- ✅ Native ROCm support (no HSA_OVERRIDE hacks)
| GPU | Architecture | Speed (7B) | VRAM | Position |
|---|---|---|---|---|
| RTX 5090 | Blackwell | ~5,841 tok/s* | 32GB | 59x faster (batched) |
| RTX 4090 | Ada Lovelace | ~194 tok/s | 24GB | 1.96x faster |
| R9700 | RDNA4 | 98.97 tok/s | 32GB | 🎯 You are here |
| RTX 3090 | Ampere | 45.2 tok/s | 24GB | 2.19x slower |
| RTX 2080 | Turing | 34.0 tok/s | 8GB | 2.91x slower |
*The RTX 5090 figure uses batch size 8 with TensorRT optimizations, so it is not directly comparable to the single-stream R9700 result.
Key Advantage: R9700 has 32GB VRAM vs RTX 4090's 24GB (33% more capacity) - perfect for large models!
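The Position column is a plain ratio of tokens per second against the R9700's 98.97 tok/s. A minimal sketch of that arithmetic, assuming `awk` is available; the helper name `speed_ratio` is hypothetical and not part of this repository:

```shell
#!/bin/bash
# speed_ratio FASTER SLOWER: print FASTER/SLOWER rounded to two decimals.
# Hypothetical helper for illustration; not part of the repository.
speed_ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

speed_ratio 194 98.97    # RTX 4090 vs R9700
speed_ratio 98.97 45.2   # R9700 vs RTX 3090
speed_ratio 98.97 34.0   # R9700 vs RTX 2080
```

These three calls reproduce the 1.96x, 2.19x, and 2.91x figures in the table.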
Built entire ROCm stack from source with gfx1201 optimizations:
Compiler Optimizations:
- `-O3` maximum optimization
- `-march=znver3 -mtune=znver3` AMD CPU tuning
- GPU targets: `gfx1031;gfx1201`
- Performance mode enabled
Kernel Generation:
- 30,957 hipBLASLt kernels (145 gfx1201-specific files)
- 570 rocBLAS kernels (56 gfx1201-specific files)
- Architecture-optimized matrix operations
- Code object compression (zstd, 10.74% ratio)
Components Built:
- ✅ ROCm Core: ROCR-Runtime, rocminfo, CLR, OpenCL
- ✅ Compilers: LLVM/Clang with AMDGPU backend
- ✅ Math: rocBLAS, rocFFT, rocRAND, rocSOLVER, rocSPARSE
- ✅ AI: MIOpen, rocWMMA, hipBLASLt
- ✅ Profiling: rocProfiler-SDK, rocprofv3
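#!/bin/bash
export ROCM_PATH=/opt/rocm
export HIP_PATH=/opt/rocm
export HIP_PLATFORM=amd
# Configure from a dedicated build directory (the Quick Start expects build-gfx1201/)
mkdir -p build-gfx1201 && cd build-gfx1201
cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_HIP=ON \
  -DCMAKE_HIP_ARCHITECTURES=gfx1201 \
  -DCMAKE_INSTALL_PREFIX=/usr/local
# Compile with one job per logical CPU
cmake --build . --config Release -j"$(nproc)"

Result: Native gfx1201 support, no compatibility hacks needed!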
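One way to confirm that the system really is running on a native gfx1201 target is to look at the ISA names `rocminfo` reports and to check that the `HSA_OVERRIDE_GFX_VERSION` environment variable (the usual override workaround on unsupported GPUs) is not set. The sketch below assumes `rocminfo` is on `PATH`; both function names are hypothetical helpers, not part of this repository:

```shell
#!/bin/bash
# list_gfx_targets: read rocminfo-style text on stdin and print the unique
# gfx ISA names found in it. Hypothetical helper, not part of the repository.
list_gfx_targets() {
  grep -oE 'gfx[0-9a-f]+' | sort -u
}

# override_status: report whether the HSA_OVERRIDE_GFX_VERSION workaround is set.
override_status() {
  if [ -n "${HSA_OVERRIDE_GFX_VERSION:-}" ]; then
    echo "override set: ${HSA_OVERRIDE_GFX_VERSION}"
  else
    echo "no override set"
  fi
}

# Only query the hardware when rocminfo is actually installed.
if command -v rocminfo >/dev/null 2>&1; then
  rocminfo | list_gfx_targets
fi
override_status
```

On a working setup the first command should list `gfx1201` and the second should report that no override is set.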
- AMD Radeon AI PRO R9700 (gfx1201) or compatible RDNA4 GPU
- ROCm 7.11+ installed to `/opt/rocm`
- 32GB+ system RAM recommended
- Fedora/RHEL/Ubuntu Linux
# Clone this repository
git clone https://github.com/YOUR_USERNAME/llama.cpp-gfx1201.git
cd llama.cpp-gfx1201
# Set ROCm environment
export ROCM_PATH=/opt/rocm
export HIP_PATH=/opt/rocm
export HIP_PLATFORM=amd
export PATH=/opt/rocm/bin:$PATH
export LD_LIBRARY_PATH=/opt/rocm/lib:/opt/rocm/lib64:$LD_LIBRARY_PATH
# Build llama.cpp with HIP support
./build_rocm_gfx1201.sh
# Start llama-server with Qwen2.5-Coder-7B
./build-gfx1201/bin/llama-server \
-m /path/to/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
--port 8080 \
--host 127.0.0.1 \
-ngl 99 \
-c 4096 \
-t 16
# Test inference
curl http://127.0.0.1:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "Write a Python hello world function",
"max_tokens": 200,
"temperature": 0.7
}'
Model: Qwen2.5-Coder-7B-Instruct Q4_K_M
Configuration: 4096 context, 99 GPU layers, 16 threads
| Test | Prompt | Tokens | Duration | Speed |
|---|---|---|---|---|
| 1 | Fibonacci in Python | 275 | 2.76s | 99.60 tok/s |
| 2 | Binary Search Tree C++ | 400 | 3.99s | 100.17 tok/s |
| 3 | REST API Node.js | 400 | 4.00s | 99.90 tok/s |
| 4 | SQL Top Customers | 400 | 4.07s | 98.27 tok/s |
| 5 | Quicksort Rust | 400 | 4.13s | 96.91 tok/s |
| AVERAGE | - | 375 | 3.79s | 98.97 tok/s |
Consistency: ±1.6% spread across runs (half the min-max range relative to the mean), which indicates stable throughput.
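The average and the spread can be recomputed from the Speed column. This is a minimal sketch, assuming the spread is defined as half of the min-max range divided by the mean; `summarize_speeds` is a hypothetical helper, not one of the repository's benchmark scripts:

```shell
#!/bin/bash
# summarize_speeds: read one tok/s value per line on stdin and print the mean
# and the half-range as a percentage of the mean.
# Hypothetical helper for illustration; not part of the repository.
summarize_speeds() {
  awk '
    NR == 1 { min = $1; max = $1 }
    { sum += $1; if ($1 < min) min = $1; if ($1 > max) max = $1 }
    END {
      if (NR == 0) exit 1
      mean = sum / NR
      printf "mean=%.2f half_range_pct=%.1f\n", mean, (max - min) / 2 / mean * 100
    }'
}

# Speeds from the five test runs in the table above.
printf '%s\n' 99.60 100.17 99.90 98.27 96.91 | summarize_speeds
```

With the five values from the table this gives a mean of 98.97 tok/s and a half-range of 1.6%.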
Test: HIP Vector Addition (100M elements, 100 iterations)
=== ROCm 7.11 gfx1201 Benchmark ===
GPU: AMD Radeon AI PRO R9700
Arch: gfx1201
Memory: 31 GB
Compute Units: 32
Throughput: 569.248 GB/s
Per-iteration: 1.963 ms
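The throughput line can be cross-checked against the per-iteration time. The sketch below assumes 4-byte float elements and counts three memory passes per element (two input reads and one output write); `gfx1201_benchmark.cpp` may count traffic differently, and `vector_add_bandwidth` is a hypothetical helper:

```shell
#!/bin/bash
# vector_add_bandwidth ELEMENTS BYTES_PER_ELEMENT MS_PER_ITERATION
# Prints effective bandwidth, assuming each iteration reads two input
# vectors and writes one output vector (3 memory passes per element).
# Hypothetical helper for illustration; not part of the repository.
vector_add_bandwidth() {
  awk -v n="$1" -v size="$2" -v ms="$3" 'BEGIN {
    bytes = 3 * n * size
    seconds = ms / 1000
    printf "%.1f GB/s (10^9 bytes), %.1f GiB/s (2^30 bytes)\n",
      bytes / seconds / 1e9, bytes / seconds / 2^30
  }'
}

# 100M elements, 4 bytes each, 1.963 ms per iteration (from the output above)
vector_add_bandwidth 100000000 4 1.963
```

Under these assumptions the binary-unit result lands at about 569, which suggests the reported throughput is measured in units of 2^30 bytes per second; the small difference from 569.248 is consistent with rounding of the per-iteration time.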
1. AI-Specific Hardware
- WMMA (Wave Matrix Multiply-Accumulate) instructions
- Hardware-accelerated transformer operations
- Optimized FP16/BF16 compute units
2. Memory Configuration
- 32GB GDDR6 (256-bit bus)
- 569 GB/s measured bandwidth
- Larger models than 24GB consumer NVIDIA GPUs
3. Software Maturity
- Native gfx1201 in ROCm 7.11
- No HSA_OVERRIDE workarounds
- Production-ready AI stack
Despite similar memory bandwidth (569 vs 567 GB/s), RDNA4 achieves 2.89x AI speedup through:
- WMMA Instructions - Hardware matrix operations for transformers
- Improved Scheduling - Better warp/wave scheduling for AI workloads
- FP16 Throughput - Enhanced half-precision for quantized models
- Cache Hierarchy - Optimized L2/L3 for inference patterns
- ROCm Maturity - Native optimizations vs compatibility mode
The gain comes from the architecture and its native software support, not from raw memory bandwidth.
✅ Large Model Inference (>20B parameters)
- 32GB VRAM allows full model loading
- RTX 4090 limited to 24GB
- Can run models that won't fit on 24GB consumer NVIDIA GPUs
✅ Multi-Model Serving
- Host multiple smaller models simultaneously
- Better GPU utilization
- Cost-effective production deployments
✅ Extended Context Windows
- Larger contexts with more VRAM
- Better for code analysis, document processing
- Ideal for RAG (Retrieval-Augmented Generation)
✅ Open-Source AI Ecosystem
- Native ROCm support
- No CUDA licensing restrictions
- Community-driven development
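Whether a model fits in 32GB can be estimated from its parameter count and quantization width. The sketch below is a rough estimate only: the figure of about 4.8 bits per weight for a Q4_K_M-style quantization and the 6 GB of headroom for KV cache and buffers are assumptions, and both function names are hypothetical:

```shell
#!/bin/bash
# estimate_weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Rough size of the quantized weights in GB (10^9 bytes).
# Hypothetical helper for illustration; not part of the repository.
estimate_weights_gb() {
  awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 }'
}

# fits_in_vram PARAMS_BILLIONS BITS_PER_WEIGHT VRAM_GB HEADROOM_GB
# Prints "fits" when weights plus headroom (KV cache, buffers) stay within VRAM.
fits_in_vram() {
  awk -v p="$1" -v bpw="$2" -v vram="$3" -v extra="$4" 'BEGIN {
    need = p * bpw / 8 + extra
    if (need <= vram) print "fits"; else print "does not fit"
  }'
}

estimate_weights_gb 32 4.8   # assumed 32B model at ~4.8 bits per weight
fits_in_vram 32 4.8 32 6     # 32GB card
fits_in_vram 32 4.8 24 6     # 24GB card
```

Actual memory use depends on context length, batch size, and the exact quantization mix, so treat the result as a first approximation.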
❌ Maximum Single-Model Speed
- RTX 4090 is about 2x faster; RTX 5090 reaches about 59x with batching and TensorRT
- Better for high-throughput serving
❌ Framework Support
- Wider ML framework compatibility
- More pre-optimized models
- Better PyTorch/TensorFlow integration
llama.cpp-gfx1201/
├── build_rocm_gfx1201.sh # HIP build script
├── build-gfx1201/ # Build output directory
│   └── bin/
│       ├── llama-server # OpenAI-compatible API server
│       ├── llama-cli # CLI inference tool
│       └── llama-bench # Benchmarking tool
├── RDNA4_PERFORMANCE_ANALYSIS.md # Comprehensive benchmark report
├── examples/ # Example scripts and configs
└── README.md # This file
Minimum:
- AMD Radeon AI PRO R9700 (gfx1201) or compatible RDNA4 GPU
- ROCm 7.11+
- 16GB system RAM
- Ubuntu 22.04+ / Fedora 38+ / RHEL 9+
Recommended:
- AMD Radeon AI PRO R9700
- ROCm 7.11 custom build (see build notes)
- 32GB+ system RAM
- NVMe SSD for model storage
- Fedora 43+ with latest kernel
Tested configuration:
- ✅ Fedora 43 + ROCm 7.11 custom build
- ✅ AMD Ryzen (znver3) CPU
- ✅ 32GB DDR4 RAM
- ✅ NFS model storage with 4MB read/write buffers
- RDNA4 Performance Analysis - Full benchmark results, methodology, and technical deep dive
- Build Notes - ROCm compilation process (if available)
- `benchmark_inference.sh` - Single AI inference test
- `benchmark_multiple.sh` - Multiple tests for averaging
- `gfx1201_benchmark.cpp` - HIP memory bandwidth test
- `build_rocm_gfx1201.sh` - llama.cpp HIP build
- Build configuration uses CMake with native gfx1201 support
Contributions welcome! This fork focuses on:
- AMD RDNA4 (gfx1201) optimizations
- ROCm integration improvements
- Performance benchmarking
- Documentation for AMD GPU users
- Flash Attention 2 for ROCm
- Additional quantization methods (GPTQ, AWQ)
- Multi-GPU tensor/pipeline parallelism
- vLLM/TGI backend integration
Completed:
- Native gfx1201 HIP support
- ROCm 7.11 custom build
- Comprehensive benchmarking
- NVIDIA competitive analysis
- Production-ready inference
Planned:
- Flash Attention 2 integration
- Multi-GPU support testing
- Additional model benchmarks (13B, 30B, 70B)
- Batch size optimization
- vLLM backend integration
- Text Generation Inference (TGI) support
- Quantization quality analysis
- Power efficiency benchmarks
- RTX 5090 LLM Benchmarks - Runpod
- GPU Ranking for LLMs - Hardware Corner
- LLM Inference GPU Performance - Puget Systems
- GPU Benchmarks on LLM Inference - GitHub
This project inherits the MIT license from llama.cpp.
See LICENSE for details.
- ggerganov and llama.cpp contributors for the excellent inference engine
- AMD ROCm Team for RDNA4 support and documentation
- Open-source AI community for models and tools
- Claude Code for build automation and benchmarking assistance
- Open an issue on GitHub for bugs or feature requests
- Check existing issues for solutions
- Join AMD ROCm community channels
For optimal performance:
- Use ROCm 7.11+ with native gfx1201 support
- Lock the model in system RAM so it is not swapped out (`--mlock` flag)
- Offload all layers to GPU (`-ngl 99`)
- Match thread count to CPU cores (`-t 16`)
- Use appropriate context length for your workload
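The tips above can be combined into a single set of `llama-server` arguments. This is a minimal sketch: `server_args` is a hypothetical helper, and it assumes `nproc` is a reasonable proxy for the thread count (it reports logical CPUs, which can exceed the number of physical cores):

```shell
#!/bin/bash
# server_args MODEL_PATH [CONTEXT]: print llama-server arguments that apply
# the tips above. Hypothetical helper; the flags mirror those used earlier
# in this README plus --mlock.
server_args() {
  local model="$1" ctx="${2:-4096}" threads
  # nproc counts logical CPUs; fall back to 16 if it is unavailable
  threads="$(nproc 2>/dev/null || echo 16)"
  echo "-m ${model} --mlock -ngl 99 -t ${threads} -c ${ctx}"
}

server_args /path/to/model.gguf 8192
```

The output can be passed to the server, for example `./build-gfx1201/bin/llama-server $(server_args /path/to/model.gguf)`. The unquoted expansion splits on whitespace, so this form only works for model paths without spaces.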