Primus-Turbo

Primus-Turbo is a high-performance acceleration library dedicated to large-scale model training on AMD GPUs. Built and optimized for the AMD ROCm platform, it covers the full training stack — including core compute operators (GEMM, Attention, GroupedGEMM), communication primitives, optimizer modules, low-precision computation (FP8), and compute–communication overlap kernels.

With High Performance, Full-Featured, and Developer-Friendly as its guiding principles, Primus-Turbo is designed to fully unleash the potential of AMD GPUs for large-scale training workloads, offering a robust and complete acceleration foundation for next-generation AI systems.

Note: JAX and Optim support are planned but not yet available.

🚀 What's New

🧩 Primus Product Matrix

Primus-LM (end-to-end training framework)
  Key Features:
  • Supports multiple training backends (Megatron, TorchTitan, etc.)
  • Provides high-performance, scalable distributed training
  • Deeply integrates with Primus-Turbo and Primus-SaFE
  Dependencies / Integration:
  • Can invoke Primus-Turbo kernels and modules
  • Runs on top of Primus-SaFE for stable scheduling

Primus-Turbo (high-performance operators & modules)
  Key Features:
  • Provides common LLM training operators (FlashAttention, GEMM, Collectives, GroupedGEMM, etc.)
  • Modular design, directly pluggable into Primus-LM
  • Optimized for different architectures and precisions
  Dependencies / Integration:
  • Built on AITER, CK, hipBLASLt, Triton, and other operator libraries
  • Can be enabled via configuration inside Primus-LM

Primus-SaFE (stability & platform layer; coming soon)
  Key Features:
  • Cluster sanity checks and benchmarking
  • Kubernetes scheduling with topology awareness
  • Fault tolerance
  • Stability enhancements
  Dependencies / Integration:
  • Builds a training platform based on the K8s and Slurm ecosystem

📦 Quick Start

1. Dependencies

Software

  • ROCm >= 6.4
  • Python >= 3.10
  • PyTorch >= 2.6.0 (with ROCm support)
  • rocSHMEM (optional, required for experimental DeepEP). Please refer to our DeepEP Installation Guide for instructions.
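
To confirm that the installed PyTorch is a ROCm build that can see an AMD GPU, a quick check like the following can help:

import torch

print(torch.__version__)          # expect 2.6.0 or newer
print(torch.version.hip)          # a HIP version string on ROCm builds, None otherwise
print(torch.cuda.is_available())  # True when an AMD GPU is visible to PyTorch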

Hardware

  • AMD Instinct GPUs
      • GFX942: MI300X, MI325X
      • GFX950: MI350X, MI355X

2. Docker (Recommended)

Use the pre-built AMD ROCm image:

# For GFX942
rocm/primus:v25.9_gfx942
# For GFX950
rocm/primus:v25.9_gfx950
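
A typical way to launch one of these images is shown below; the exact device, memory, and security flags depend on your environment, so treat this as a starting point rather than a required invocation:

docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  --ipc=host --shm-size 16G \
  --security-opt seccomp=unconfined \
  rocm/primus:v25.9_gfx942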

3. Install from Source

Clone Repository

git clone https://github.com/AMD-AGI/Primus-Turbo.git --recursive
cd Primus-Turbo

User Install

pip3 install -r requirements.txt
pip3 install --no-build-isolation .

# Alternatively, set GPU_ARCHS to compile Turbo for multiple AMD GPU architectures:
GPU_ARCHS="gfx942;gfx950" pip3 install --no-build-isolation .

Developer Install (editable mode)

pip3 install -r requirements.txt
pip3 install --no-build-isolation -e . -v

# Alternatively, set GPU_ARCHS to compile Turbo for multiple AMD GPU architectures:
GPU_ARCHS="gfx942;gfx950" pip3 install --no-build-isolation -e . -v
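
After either install, a minimal smoke test confirms that the package imports cleanly (the module path matches the minimal example in step 5):

python3 -c "import primus_turbo.pytorch"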

4. Build & Deploy Wheel

pip3 install -r requirements.txt
python3 -m build --wheel --no-isolation
# Install the wheel produced under dist/ (the exact file name varies by version and platform).
pip3 install --extra-index-url https://test.pypi.org/simple ./dist/primus_turbo-XXX.whl

5. Minimal Example

import torch
import primus_turbo.pytorch as turbo

dtype = torch.bfloat16
device = "cuda:0"  # ROCm builds of PyTorch expose AMD GPUs through the "cuda" device

# BF16 GEMM: (128 x 256) @ (256 x 512) -> (128 x 512)
a = torch.randn((128, 256), dtype=dtype, device=device)
b = torch.randn((256, 512), dtype=dtype, device=device)
c = turbo.ops.gemm(a, b)

print(c)
print(c.shape)
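
As a sanity check (not part of the Primus-Turbo API), the result can be compared against PyTorch's reference matmul; loose tolerances are appropriate for bfloat16:

ref = a @ b
print(torch.allclose(c, ref, rtol=1e-2, atol=1e-2))  # expect True within bf16 tolerance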

💡 Example

See the Examples directory for usage examples.

📊 Performance

See Benchmarks for detailed performance results and comparisons.

📍 Roadmap

Roadmap: Primus-Turbo Roadmap H2 2025

📜 License

Primus-Turbo is licensed under the MIT License.

© 2025 Advanced Micro Devices, Inc. All rights reserved.
