akii96 · 2026-06-03T23:23:19Z

Motivation

The oneshot MoE sorting kernel runs on a single CU and wins for small token counts, but the multi-phase (MP) kernel distributes work across all CUs and is faster past a hardware-dependent crossover. The current dispatch heuristic (moe_sorting_is_oneshot) only checks LDS capacity — whether oneshot can run — not whether it should. On gfx950 with 128 experts, oneshot is selected up to ~61 tokens even though MP is faster past ~20 tokens.
A fixed threshold doesn't generalize: the crossover depends on CU count, memory subsystem, and expert count, which vary across CDNA generations (gfx90a/gfx942/gfx950) and models (8 to 256 experts).

Technical Details

Refactors moe_sorting_opus() into moe_sorting_opus_oneshot() + an autotuned dispatcher in csrc/include/moe_sorting_opus.h.
On first encounter of each (tokens, num_experts) pair with dispatch_policy=0:

1. LDS capacity check: if oneshot doesn't fit, go straight to MP (unchanged behavior)
2. Time both oneshot and MP with hipEvent on the current stream
3. Cache the winner in a static unordered_map
4. Re-run the winner to produce correct output

Subsequent calls with the same pair hit the cache (~ns lookup, zero GPU overhead). Autotuning cost is ~80 unique pairs × 3 kernel launches = ~240 launches, absorbed by vLLM's existing warmup phase.
dispatch_policy=1 (force oneshot) and dispatch_policy=2 (force MP) bypass the autotune entirely and behave exactly as before.
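
The dispatch flow described above can be sketched in host-only C++. This is a minimal illustration, not the code in csrc/include/moe_sorting_opus.h: `SortingKernels`, `dispatch_sorting`, `make_key`, and `time_ms` are hypothetical names, plain callables stand in for the HIP kernel launches, and a host clock stands in for hipEvent timing so the sketch compiles without a GPU toolchain.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <unordered_map>

enum class SortKernel { Oneshot, MultiPhase };

// Hypothetical bundle of the LDS capacity check and the two kernel launchers.
struct SortingKernels {
    std::function<bool(int, int)> oneshot_fits;  // stand-in for the LDS capacity check
    std::function<void()> run_oneshot;
    std::function<void()> run_multiphase;
};

// Pack (tokens, num_experts) into one 64-bit cache key.
inline std::uint64_t make_key(int tokens, int num_experts) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(tokens)) << 32) |
           static_cast<std::uint32_t>(num_experts);
}

// Host-clock timing; the PR times with hipEvent on the current stream instead.
inline double time_ms(const std::function<void()>& fn) {
    const auto t0 = std::chrono::steady_clock::now();
    fn();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// dispatch_policy: 0 = autotune, 1 = force oneshot, 2 = force multi-phase.
inline SortKernel dispatch_sorting(int tokens, int num_experts, int dispatch_policy,
                                   const SortingKernels& k) {
    assert(dispatch_policy >= 0 && dispatch_policy <= 2);
    // Forced policies skip the autotune (assumed here to skip the cache too).
    if (dispatch_policy == 1) { k.run_oneshot(); return SortKernel::Oneshot; }
    if (dispatch_policy == 2) { k.run_multiphase(); return SortKernel::MultiPhase; }

    // Oneshot does not fit: multi-phase is the only option, nothing to tune.
    if (!k.oneshot_fits(tokens, num_experts)) {
        k.run_multiphase();
        return SortKernel::MultiPhase;
    }

    static std::unordered_map<std::uint64_t, SortKernel> cache;
    const std::uint64_t key = make_key(tokens, num_experts);
    auto it = cache.find(key);
    if (it == cache.end()) {
        // First encounter: time both candidates and remember the faster one.
        const double oneshot_ms = time_ms(k.run_oneshot);
        const double mp_ms = time_ms(k.run_multiphase);
        const SortKernel winner =
            oneshot_ms <= mp_ms ? SortKernel::Oneshot : SortKernel::MultiPhase;
        it = cache.emplace(key, winner).first;
    }
    // Run the winner so the output reflects the selected kernel.
    if (it->second == SortKernel::Oneshot) k.run_oneshot(); else k.run_multiphase();
    return it->second;
}
```

On a first encounter this sketch makes three launches (two timed, one re-run of the winner), matching the per-pair cost quoted above. The static map here is unsynchronized, so a caller invoking it from several host threads would need to add a lock; whether the PR does so is not stated.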

Safety

Strict superset of the old behavior: above the LDS capacity limit the codepath is identical. Below it, the autotune either agrees with the old heuristic (same perf) or switches to MP (only if MP measured faster).
No impact on non-MoE models: the function is never called for dense architectures.
No impact on few-expert MoE models (e.g. Mixtral, 8 experts): tokens × 8 rarely enters the autotunable zone; the LDS capacity check handles dispatch.

Test Plan

Existing op_tests/test_moe_sorting.py for correctness at all rerouted shapes
5-repeat A/B serving benchmarks (ISL=1000, OSL=100, random dataset) on two MoE models with different expert counts and TP configurations
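
For readers reproducing the A/B comparison, one way to reduce repeated runs to a single percentage is to compare medians. The PR does not say which aggregate was used, so the sketch below is an assumption, and `median` and `percent_change` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a small sample; takes a copy so the caller's data stays unsorted.
inline double median(std::vector<double> v) {
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    const std::size_t n = v.size();
    return (n % 2 == 1) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

// Relative change of the candidate build versus the baseline, in percent.
// Positive is better for throughput, negative is better for TPOT.
inline double percent_change(const std::vector<double>& baseline,
                             const std::vector<double>& candidate) {
    const double base = median(baseline);
    return 100.0 * (median(candidate) - base) / base;
}
```

The median is less sensitive than the mean to a single slow repeat, which matters when only five runs are available per configuration.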

Test Result

Measured on gfx950 (MI355X), vLLM 0.22.0:

Model	conc=16	conc=32	conc=64
openai/gpt-oss-120b (128 experts, TP=1)	+4.1% tput, −1.9% TPOT	+2.0% tput, −1.7% TPOT	+1.3% tput (flat TPOT)
MiniMaxAI/MiniMax-M2.5 (TP=4)	+6.0% tput, −4.4% TPOT	+5.4% tput, −3.7% TPOT	+1.3% tput (flat TPOT)
No regressions at any concurrency level on either model.

Refactor moe_sorting_opus() into a dedicated oneshot path and an autotuned dispatcher.

On first encounter of each (tokens, num_experts) pair, both oneshot and multi-phase kernels are timed with hipEvents and the winner is cached. Subsequent calls dispatch to the cached winner with no overhead beyond a hash map lookup.

The oneshot kernel runs on a single CU and wins for small token counts, while the multi-phase kernel distributes across all CUs and wins past a hardware-dependent crossover. The existing LDS capacity check (moe_sorting_is_oneshot) gates whether autotuning runs at all: above the LDS limit, MP is selected unconditionally as before.

dispatch_policy=1 (force oneshot) and dispatch_policy=2 (force MP) are preserved and bypass the autotune entirely.

Measured on gfx950 (MI355X), 5-repeat A/B:

openai/gpt-oss-120b (128 experts, TP=1):
  conc=16: +4.1% throughput, -1.9% TPOT
  conc=32: +2.0% throughput, -1.7% TPOT

MiniMaxAI/MiniMax-M2.5 (TP=4):
  conc=16: +6.0% throughput, -4.4% TPOT
  conc=32: +5.4% throughput, -3.7% TPOT

No regressions at any concurrency level.

Signed-off-by: Aakif Nawaz <[email protected]>

github-actions · 2026-06-03T23:23:37Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label	Tests
`ci:triton-300x`	Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
`ci:sglang`	SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
`ci:atom`	ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
`ci:atom_full`	ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json`
`ci:vllm`	vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
`ci:all`	All standard extended tests (excludes `ci:atom_full`)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3525 --add-label <label>

Autotune MoE sorting dispatch: runtime kernel selection between oneshot and multi-phase paths #3525

akii96 wants to merge 1 commit into main from autotune-moe-sorting-dispatch
