GPU performance benchmarking and debugging toolkit for PyTorch workloads on AMD ROCm.
FSDP2 Compute-Communication Overlap Analysis Debug why distributed training isn't overlapping compute with communication. Runs a synthetic transformer workload with explicit multi-stream execution, captures per-iteration timing, and generates overlap efficiency reports.
Hardware Queue Evaluation Stress-test GPU queue scheduling with 8-64+ concurrent streams. Includes 15 workloads covering distributed training patterns (FSDP, MoE, activation checkpointing), inference (speculative decoding, continuous batching), and latency-sensitive scenarios (heterogeneous kernels, tiny kernel dispatch).
Environment Snapshot for Reproducibility
Capture a versioned, schema-stable snapshot of the trial environment — ROCm / HIP / hipBLASLt / rocBLAS / MIOpen / RCCL identities, GPU arch, PyTorch build flags + cmake cache + per-target HIPCC defines, runtime SDPA backend state, ~30 numerics-relevant env vars — so cross-environment regressions become a jq diff instead of a multi-day investigation. Used standalone (aorta env probe) and embedded automatically into every trial result.
# FSDP2 overlap benchmark
bash scripts/launch_rocm.sh config/default.yaml
# Hardware queue evaluation
python -m aorta.hw_queue_eval list # List workloads
python -m aorta.hw_queue_eval run hetero_kernels --streams 8
python -m aorta.hw_queue_eval sweep hetero_kernels --streams 1,2,4,8,16
# Comm-compute overlap (simulated collectives)
python -m aorta.hw_queue_eval run comms_compute_overlap --streams 4 --profile
# Comm-compute overlap (real NCCL collectives via torchrun)
torchrun --nproc_per_node=8 -m aorta.hw_queue_eval run comms_compute_overlap \
--streams 4 --real-collectives --async-op --backend nccl \
--process-groups "[0,1,2,3,4,5,6,7]" --profile --profile-dir traces/
# Environment snapshot for reproducibility
aorta env probe -o env.json # full snapshot to disk
aorta env probe --summary # one-screen brief, no file write
aorta env probe --field pytorch_build.git_commit # one field, JSON-typed
diff <(jq -S . env_a.json) <(jq -S . env_b.json) # diff two snapshotsAORTA generates comprehensive performance reports comparing ROCm versions across multiple configurations. See a full example report comparing rocm-7.0.8-meta vs rocm-7.0.10-meta:
- 8 configurations tested: 256/512 threads × 28/42/56/70 RCCL channels
- 96 visualizations: Overlap ratios, GEMM throughput, NCCL metrics, timeline comparisons
- Side-by-side diffs: Identify regressions or improvements between driver/library versions
| Guide | Description |
|---|---|
| Getting Started | Prerequisites, Docker setup, installation |
| Running the Benchmark | Launch scripts, torch.compile, direct invocation |
| Hardware Queue Eval | Workloads, CLI usage, metrics |
| Configuration | FSDP tuning, RCCL variables, profiler settings |
| Profiling | Torch profiler, rocprofv3, overlap reports |
| Environment Probe | Capture / diff / query a versioned environment snapshot; jq cookbook |
aorta probe |
Wrap-and-collect opaque launch commands; matrix + classifier |
aorta bundle |
Package probe artifacts with recipe-driven redaction |
| Troubleshooting | Common issues |
src/aorta/
├── training/ # FSDP2 trainer with multi-stream overlap instrumentation
├── hw_queue_eval/ # Hardware queue evaluation framework
├── models/ # Synthetic ranking transformer
├── profiling/ # Stream profiler for overlap measurement
├── instrumentation/ # Environment probe (versioned env.json schema + capture)
├── registry/ # Mitigations + environments registry (extension points)
├── cli/ # `aorta` CLI command groups (run, env probe, ...)
└── utils/ # Config loading, timing, device detection
config/ # YAML configurations for different scenarios
scripts/ # Launch scripts, profiling, analysis tools
analysis/ # Overlap report generation
We recommend using uv for fast, reliable Python environment management.
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create and activate a virtual environment
uv venv && source .venv/bin/activate
# Install PyTorch nightly for ROCm 7.1
uv pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm7.1/
# Install remaining dependencies
uv pip install -r requirements.txt
# For full installation including hw_queue_eval
uv pip install -e ".[hw-queue]"uv pip install -r requirements-dev.txt
pre-commit install
pytest tests/The FSDP2 overlap workloads also run on NVIDIA CUDA for side-by-side comparison with ROCm.



