Primus-Turbo is a high-performance acceleration library dedicated to large-scale model training on AMD GPUs. Built and optimized for the AMD ROCm platform, it covers the full training stack — including core compute operators (GEMM, Attention, GroupedGEMM), communication primitives, optimizer modules, low-precision computation (FP8), and compute–communication overlap kernels.
With High Performance, Full-Featured, and Developer-Friendly as its guiding principles, Primus-Turbo is designed to fully unleash the potential of AMD GPUs for large-scale training workloads, offering a robust and complete acceleration foundation for next-generation AI systems.
Note: JAX and Optim support are planned but not yet available.

- [2025/9/19] Primus-Turbo introduction blog.
- [2025/9/11] Primus-Turbo initial release, version v0.1.0.
| Module | Role | Key Features | Dependencies / Integration |
|---|---|---|---|
| Primus-LM | End-to-end training framework | - Supports multiple training backends (Megatron, TorchTitan, etc.) - Provides high-performance, scalable distributed training - Deeply integrates with Turbo and Safe | - Can invoke Primus-Turbo kernels and modules - Runs on top of Primus-Safe for stable scheduling |
| Primus-Turbo | High-performance operators & modules | - Provides common LLM training operators (FlashAttention, GEMM, Collectives, GroupedGEMM, etc.) - Modular design, directly pluggable into Primus-LM - Optimized for different architectures and precisions | - Built on AITER, CK, hipBLASLt, Triton, and other operator libraries - Can be enabled via configuration inside Primus-LM |
| Primus-SaFE (Coming soon) | Stability & platform layer | - Cluster sanity check and benchmarking - Kubernetes scheduling with topology awareness - Fault tolerance - Stability enhancements | - Builds a training platform based on the K8s and Slurm ecosystem |
- ROCm >= 6.4
- Python >= 3.10
- PyTorch >= 2.6.0 (with ROCm support)
- rocSHMEM (optional, required for experimental DeepEP). Please refer to our DeepEP Installation Guide for instructions.
- AMD Instinct GPUs
- GFX942: MI300X, MI325X
- GFX950: MI350X, MI355X
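The version floors above can be verified up front before installing. A minimal, stdlib-only sketch (the `meets_minimum` helper is illustrative, not part of Primus-Turbo):

```python
import sys

def meets_minimum(version: str, floor: str) -> bool:
    """Compare dotted version strings numerically, e.g. '6.10' >= '6.4'."""
    to_tuple = lambda v: tuple(int(p) for p in v.split("."))
    return to_tuple(version) >= to_tuple(floor)

# Python floor from the requirements list above.
print(sys.version_info >= (3, 10))

# ROCm and PyTorch floors, shown with example version strings.
print(meets_minimum("6.4.1", "6.4"))    # True
print(meets_minimum("2.6.0", "2.6.0"))  # True
```

In a real environment the ROCm and PyTorch versions would come from the installed toolchain (e.g. `torch.__version__`) rather than hard-coded strings.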
Use the pre-built AMD ROCm image:

```
# For GFX942
rocm/primus:v25.9_gfx942

# For GFX950
rocm/primus:v25.9_gfx950
```
```bash
git clone https://github.com/AMD-AGI/Primus-Turbo.git --recursive
cd Primus-Turbo

pip3 install -r requirements.txt
pip3 install --no-build-isolation .

# Set GPU_ARCHS to compile Turbo for multiple AMD GPU architectures.
GPU_ARCHS="gfx942;gfx950" pip3 install --no-build-isolation .
```
```bash
pip3 install -r requirements.txt
pip3 install --no-build-isolation -e . -v

# Set GPU_ARCHS to compile Turbo for multiple AMD GPU architectures.
GPU_ARCHS="gfx942;gfx950" pip3 install --no-build-isolation -e . -v
```
```bash
pip3 install -r requirements.txt
python3 -m build --wheel --no-isolation
pip3 install --extra-index-url https://test.pypi.org/simple ./dist/primus_turbo-XXX.whl
```
```python
import torch
import primus_turbo.pytorch as turbo

dtype = torch.bfloat16
device = "cuda:0"

a = torch.randn((128, 256), dtype=dtype, device=device)
b = torch.randn((256, 512), dtype=dtype, device=device)

c = turbo.ops.gemm(a, b)
print(c)
print(c.shape)
```

See Examples for usage examples.
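`turbo.ops.gemm(a, b)` computes a standard matrix product `C = A @ B`, so the shapes follow the usual (M, K) x (K, N) -> (M, N) rule. As a reference for the semantics only (not the implementation, and no GPU required), a plain-Python sketch:

```python
def ref_gemm(a, b):
    """Reference matrix product: a is (M, K), b is (K, N), result is (M, N)."""
    m, k = len(a), len(a[0])
    k2, n = len(b), len(b[0])
    assert k == k2, "inner dimensions must match"
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

# Tiny example; in the snippet above the shapes are (128, 256) x (256, 512) -> (128, 512).
print(ref_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```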
See Benchmarks for detailed performance results and comparisons.
Roadmap: Primus-Turbo Roadmap H2 2025
Primus-Turbo is licensed under the MIT License.
© 2025 Advanced Micro Devices, Inc. All rights reserved.