Autotune MoE sorting dispatch: runtime kernel selection between oneshot and multi-phase paths#3525

Draft
akii96 wants to merge 1 commit into main from autotune-moe-sorting-dispatch

@akii96 akii96 commented Jun 3, 2026

Motivation

The oneshot MoE sorting kernel runs on a single CU and wins for small token counts, but the multi-phase (MP) kernel distributes work across all CUs and is faster past a hardware-dependent crossover. The current dispatch heuristic (moe_sorting_is_oneshot) only checks LDS capacity — whether oneshot can run — not whether it should. On gfx950 with 128 experts, oneshot is selected up to ~61 tokens even though MP is faster past ~20 tokens.
A fixed threshold doesn't generalize: the crossover depends on CU count, memory subsystem, and expert count, which vary across CDNA generations (gfx90a/gfx942/gfx950) and models (8 to 256 experts).

Technical Details

Refactors moe_sorting_opus() into moe_sorting_opus_oneshot() + an autotuned dispatcher in csrc/include/moe_sorting_opus.h.
On first encounter of each (tokens, num_experts) pair with dispatch_policy=0:

  1. LDS capacity check — if oneshot doesn't fit, go straight to MP (unchanged behavior)
  2. Time both oneshot and MP with hipEvent on the current stream
  3. Cache the winner in a static unordered_map
  4. Re-run the winner to produce correct output

Subsequent calls with the same pair hit the cache (~ns lookup, zero GPU overhead). Autotuning cost is ~80 unique pairs × 3 kernel launches = ~240 launches, absorbed by vLLM's existing warmup phase.

dispatch_policy=1 (force oneshot) and dispatch_policy=2 (force MP) bypass the autotune entirely and behave exactly as before.
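
The dispatch flow described above can be sketched on the host side as follows. This is a minimal illustration, not the code in csrc/include/moe_sorting_opus.h: the names `AutotunedDispatcher`, `Path`, `ShapeKey`, `fits`, and `time_path` are hypothetical, the timing callback stands in for the hipEvent measurements, and breaking ties in favor of oneshot is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical names throughout; a sketch of the caching logic only.
enum class Path { Oneshot, MultiPhase };

struct ShapeKey {
    int tokens;
    int num_experts;
    bool operator==(const ShapeKey& o) const {
        return tokens == o.tokens && num_experts == o.num_experts;
    }
};

struct ShapeKeyHash {
    std::size_t operator()(const ShapeKey& k) const {
        // Pack both 32-bit fields into one 64-bit value before hashing.
        std::uint64_t packed =
            (static_cast<std::uint64_t>(static_cast<std::uint32_t>(k.tokens)) << 32) |
            static_cast<std::uint32_t>(k.num_experts);
        return std::hash<std::uint64_t>{}(packed);
    }
};

class AutotunedDispatcher {
public:
    // fits(tokens, num_experts): stand-in for the LDS capacity check.
    using FitsFn = std::function<bool(int, int)>;
    // time_path(path, tokens, num_experts): runs one path, returns elapsed time.
    using TimeFn = std::function<double(Path, int, int)>;

    AutotunedDispatcher(FitsFn fits, TimeFn time_path)
        : fits_(std::move(fits)), time_path_(std::move(time_path)) {}

    Path select(int tokens, int num_experts, int dispatch_policy = 0) {
        // Forced policies bypass the autotune and the cache entirely.
        if (dispatch_policy == 1) return Path::Oneshot;
        if (dispatch_policy == 2) return Path::MultiPhase;

        // Step 1: if oneshot cannot run, multi-phase is the only option.
        if (!fits_(tokens, num_experts)) return Path::MultiPhase;

        // Cached winner from an earlier call with the same pair.
        const ShapeKey key{tokens, num_experts};
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;

        // Steps 2 and 3: time both paths once and remember the faster one.
        const double t_oneshot = time_path_(Path::Oneshot, tokens, num_experts);
        const double t_mp = time_path_(Path::MultiPhase, tokens, num_experts);
        const Path winner = (t_mp < t_oneshot) ? Path::MultiPhase : Path::Oneshot;
        cache_.emplace(key, winner);

        // Step 4 (re-running the winner for correct output) is left to the caller.
        return winner;
    }

    std::size_t cached_pairs() const { return cache_.size(); }

private:
    FitsFn fits_;
    TimeFn time_path_;
    std::unordered_map<ShapeKey, Path, ShapeKeyHash> cache_;
};
```

Injecting the timing function keeps the sketch testable without a GPU. In a real dispatcher the callback would launch the kernel on the current stream and return the elapsed time between two recorded events. The sketch also ignores thread safety, which a process-wide static cache would need to consider.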

Safety

  • Strict superset of the old behavior: above the LDS capacity limit the codepath is identical. Below it, the autotune either agrees with the old heuristic (same perf) or switches to MP (only if MP measured faster).
  • No impact on non-MoE models: the function is never called for dense architectures.
  • No impact on few-expert MoE models (e.g. Mixtral, 8 experts): tokens × 8 rarely enters the autotunable zone; the LDS capacity check handles dispatch.

Test Plan

  • Existing op_tests/test_moe_sorting.py for correctness at all rerouted shapes
  • 5-repeat A/B serving benchmarks (ISL=1000, OSL=100, random dataset) on two MoE models with different expert counts and TP configurations

Test Result

Measured on gfx950 (MI355X), vLLM 0.22.0:

| Model | conc=16 | conc=32 | conc=64 |
| --- | --- | --- | --- |
| openai/gpt-oss-120b (128 experts, TP=1) | +4.1% tput, −1.9% TPOT | +2.0% tput, −1.7% TPOT | +1.3% tput (flat TPOT) |
| MiniMaxAI/MiniMax-M2.5 (TP=4) | +6.0% tput, −4.4% TPOT | +5.4% tput, −3.7% TPOT | +1.3% tput (flat TPOT) |

No regressions at any concurrency level on either model.

Refactor moe_sorting_opus() into a dedicated oneshot path and an
autotuned dispatcher. On first encounter of each (tokens, num_experts)
pair, both oneshot and multi-phase kernels are timed with hipEvents and
the winner is cached. Subsequent calls dispatch to the cached winner
with no overhead beyond a hash map lookup.

The oneshot kernel runs on a single CU and wins for small token counts,
while the multi-phase kernel distributes across all CUs and wins past a
hardware-dependent crossover. The existing LDS capacity check
(moe_sorting_is_oneshot) gates whether autotuning runs at all: above
the LDS limit, MP is selected unconditionally as before.

dispatch_policy=1 (force oneshot) and dispatch_policy=2 (force MP) are
preserved and bypass the autotune entirely.

Measured on gfx950 (MI355X), 5-repeat A/B:
  openai/gpt-oss-120b (128 experts, TP=1):
    conc=16: +4.1% throughput, -1.9% TPOT
    conc=32: +2.0% throughput, -1.7% TPOT
  MiniMaxAI/MiniMax-M2.5 (TP=4):
    conc=16: +6.0% throughput, -4.4% TPOT
    conc=32: +5.4% throughput, -3.7% TPOT
  No regressions at any concurrency level.

Signed-off-by: Aakif Nawaz <[email protected]>

github-actions Bot commented Jun 3, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| ci:atom | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| ci:atom_full | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| ci:vllm | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| ci:all | All standard extended tests (excludes ci:atom_full) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3525 --add-label <label>
