Primus-Turbo is a high-performance acceleration library dedicated to large-scale model training on AMD GPUs. Built and optimized for the AMD ROCm platform, it covers the full training stack — including core compute operators (GEMM, Attention, GroupedGEMM), communication primitives, optimizer modules, low-precision computation (FP8), and compute–communication overlap kernels.
With High Performance, Full-Featured, and Developer-Friendly as its guiding principles, Primus-Turbo is designed to fully unleash the potential of AMD GPUs for large-scale training workloads, offering a robust and complete acceleration foundation for next-generation AI systems.
Note: JAX and Optim support are planned but not yet available.

- [2025/9/19] Primus-Turbo introduction blog.
- [2025/9/11] Primus-Turbo initial release, version v0.1.0.
| Module | Role | Key Features | Dependencies / Integration |
|---|---|---|---|
| Primus-LM | End-to-end training framework | - Supports multiple training backends (Megatron, TorchTitan, etc.) - Provides high-performance, scalable distributed training - Deeply integrates with Turbo and Safe | - Can invoke Primus-Turbo kernels and modules - Runs on top of Primus-Safe for stable scheduling |
| Primus-Turbo | High-performance operators & modules | - Provides common LLM training operators (FlashAttention, GEMM, Collectives, GroupedGEMM, etc.) - Modular design, directly pluggable into Primus-LM - Optimized for different architectures and precisions | - Built on AITER, CK, hipBLASLt, Triton, and other operator libraries - Can be enabled via configuration inside Primus-LM |
| Primus-SaFE (Coming soon) | Stability & platform layer | - Cluster sanity check and benchmarking - Kubernetes scheduling with topology awareness - Fault tolerance - Stability enhancements | - Builds a training platform based on the K8s and Slurm ecosystem |
- ROCm >= 6.4
- Python >= 3.10
- PyTorch >= 2.6.0 (with ROCm support)
- rocSHMEM (optional, required for experimental DeepEP). Please refer to our DeepEP Installation Guide for instructions.
- AMD Instinct GPUs
- GFX942: MI300X, MI325X
- GFX950: MI350X, MI355X
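The version floors above can be verified up front before installing. A minimal, stdlib-only sketch (the `meets_minimum` helper is illustrative, not part of Primus-Turbo):

```python
import sys

def meets_minimum(version: str, floor: str) -> bool:
    """Compare dotted version strings numerically, e.g. '6.10' >= '6.4'."""
    to_tuple = lambda v: tuple(int(p) for p in v.split("."))
    return to_tuple(version) >= to_tuple(floor)

# Python floor from the requirements list above.
print(sys.version_info >= (3, 10))

# ROCm and PyTorch floors, shown with example version strings.
print(meets_minimum("6.4.1", "6.4"))    # True
print(meets_minimum("2.6.0", "2.6.0"))  # True
```

In a real environment the ROCm and PyTorch versions would come from the installed toolchain (e.g. `torch.__version__`) rather than hard-coded strings.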
Use the pre-built AMD ROCm image:

```
# For GFX942
rocm/primus:v25.9_gfx942

# For GFX950
rocm/primus:v25.9_gfx950
```
```bash
git clone https://github.com/AMD-AGI/Primus-Turbo.git --recursive
cd Primus-Turbo

pip3 install -r requirements.txt
pip3 install --no-build-isolation .

# Set GPU_ARCHS to compile Turbo for multiple AMD GPU architectures.
GPU_ARCHS="gfx942;gfx950" pip3 install --no-build-isolation .
```
```bash
pip3 install -r requirements.txt
pip3 install --no-build-isolation -e . -v

# Set GPU_ARCHS to compile Turbo for multiple AMD GPU architectures.
GPU_ARCHS="gfx942;gfx950" pip3 install --no-build-isolation -e . -v
```
```bash
pip3 install -r requirements.txt
python3 -m build --wheel --no-isolation
pip3 install --extra-index-url https://test.pypi.org/simple ./dist/primus_turbo-XXX.whl
```
```python
import torch
import primus_turbo.pytorch as turbo

dtype = torch.bfloat16
device = "cuda:0"

a = torch.randn((128, 256), dtype=dtype, device=device)
b = torch.randn((256, 512), dtype=dtype, device=device)

c = turbo.ops.gemm(a, b)
print(c)
print(c.shape)
```

See Examples for usage examples.
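`turbo.ops.gemm(a, b)` computes a standard matrix product `C = A @ B`, so the shapes follow the usual (M, K) x (K, N) -> (M, N) rule. As a reference for the semantics only (not the implementation, and no GPU required), a plain-Python sketch:

```python
def ref_gemm(a, b):
    """Reference matrix product: a is (M, K), b is (K, N), result is (M, N)."""
    m, k = len(a), len(a[0])
    k2, n = len(b), len(b[0])
    assert k == k2, "inner dimensions must match"
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

# Tiny example; in the snippet above the shapes are (128, 256) x (256, 512) -> (128, 512).
print(ref_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```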
See Benchmarks for detailed performance results and comparisons.
Roadmap: Primus-Turbo Roadmap H2 2025
Primus-Turbo is licensed under the MIT License.
© 2025 Advanced Micro Devices, Inc. All rights reserved.