Autotune MoE sorting dispatch: runtime kernel selection between oneshot and multi-phase paths #3525
Draft
akii96 wants to merge 1 commit into
Refactor moe_sorting_opus() into a dedicated oneshot path and an
autotuned dispatcher. On first encounter of each (tokens, num_experts)
pair, both oneshot and multi-phase kernels are timed with hipEvents and
the winner is cached. Subsequent calls dispatch to the cached winner
with no overhead beyond a hash map lookup.
The oneshot kernel runs on a single CU and wins for small token counts,
while the multi-phase kernel distributes across all CUs and wins past a
hardware-dependent crossover. The existing LDS capacity check
(moe_sorting_is_oneshot) gates whether autotuning runs at all — above
the LDS limit, MP is selected unconditionally as before.
dispatch_policy=1 (force oneshot) and dispatch_policy=2 (force MP) are
preserved and bypass the autotune entirely.
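The routing rules described above can be sketched as a small pure function. This is only an illustration: `Route`, `resolve_route`, and the `fits_lds` flag are hypothetical names, with `fits_lds` standing in for the result of the existing LDS capacity check. The sketch assumes that forced policies are returned as-is and does not model what a forced oneshot does above the LDS limit.

```cpp
#include <cassert>

// Hypothetical names for illustration; not the identifiers used in the PR.
enum class Route { Oneshot, MultiPhase, Autotune };

// dispatch_policy: 0 = automatic, 1 = force oneshot, 2 = force multi-phase (MP).
// fits_lds: assumed to be the outcome of the existing LDS capacity check.
inline Route resolve_route(int dispatch_policy, bool fits_lds) {
    if (dispatch_policy == 1) return Route::Oneshot;     // forced, no autotune
    if (dispatch_policy == 2) return Route::MultiPhase;  // forced, no autotune
    assert(dispatch_policy == 0);
    // Above the LDS limit oneshot cannot run, so MP is chosen without timing.
    return fits_lds ? Route::Autotune : Route::MultiPhase;
}
```

Only the `Route::Autotune` outcome would go on to time both kernels; every other outcome matches the pre-existing behavior.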
Measured on gfx950 (MI355X), 5-repeat A/B:
openai/gpt-oss-120b (128 experts, TP=1):
conc=16: +4.1% throughput, -1.9% TPOT
conc=32: +2.0% throughput, -1.7% TPOT
MiniMaxAI/MiniMax-M2.5 (TP=4):
conc=16: +6.0% throughput, -4.4% TPOT
conc=32: +5.4% throughput, -3.7% TPOT
No regressions at any concurrency level.
Signed-off-by: Aakif Nawaz <[email protected]>
Motivation
The oneshot MoE sorting kernel runs on a single CU and wins for small token counts, but the multi-phase (MP) kernel distributes work across all CUs and is faster past a hardware-dependent crossover. The current dispatch heuristic (moe_sorting_is_oneshot) only checks LDS capacity: whether oneshot can run, not whether it should. On gfx950 with 128 experts, oneshot is selected up to ~61 tokens even though MP is faster past ~20 tokens.
A fixed threshold doesn't generalize: the crossover depends on CU count, memory subsystem, and expert count, which vary across CDNA generations (gfx90a/gfx942/gfx950) and models (8 to 256 experts).
Technical Details
Refactors moe_sorting_opus() into moe_sorting_opus_oneshot() + an autotuned dispatcher in csrc/include/moe_sorting_opus.h.
On first encounter of each (tokens, num_experts) pair with dispatch_policy=0, both kernels are timed with hipEvent on the current stream and the winner is cached in an unordered_map.
Subsequent calls with the same pair hit the cache (~ns lookup, zero GPU overhead). Autotuning cost is ~80 unique pairs × 3 kernel launches = ~240 launches, absorbed by vLLM's existing warmup phase.
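The time-once-then-cache flow can be sketched as follows. This is a minimal host-side illustration under stated assumptions, not the code in moe_sorting_opus.h: `SortPath`, `TimedLaunch`, and `SortAutotuner` are hypothetical names, and each candidate is passed as a callable that runs its kernel and returns the elapsed time in milliseconds (in the PR that figure would come from hipEvent timing). Packing the pair into one 64-bit key and breaking ties in favor of oneshot are also assumptions of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Hypothetical names for illustration; not the identifiers used in the PR.
enum class SortPath { Oneshot, MultiPhase };

// Runs one candidate kernel and returns its elapsed time in milliseconds.
using TimedLaunch = std::function<double()>;

class SortAutotuner {
public:
    // Returns the cached winner for (tokens, num_experts); times both
    // candidates only the first time a pair is seen.
    SortPath select(std::uint32_t tokens, std::uint32_t num_experts,
                    const TimedLaunch& run_oneshot,
                    const TimedLaunch& run_multi_phase) {
        // Pack the pair into one 64-bit key so std::unordered_map can hash it.
        const std::uint64_t key =
            (static_cast<std::uint64_t>(tokens) << 32) | num_experts;
        const auto hit = cache_.find(key);
        if (hit != cache_.end()) {
            return hit->second;  // cache hit: no kernel is timed again
        }
        const double oneshot_ms = run_oneshot();
        const double multi_phase_ms = run_multi_phase();
        assert(oneshot_ms >= 0.0 && multi_phase_ms >= 0.0);
        // Assumption: a tie goes to oneshot.
        const SortPath winner = (oneshot_ms <= multi_phase_ms)
                                    ? SortPath::Oneshot
                                    : SortPath::MultiPhase;
        cache_.emplace(key, winner);
        return winner;
    }

    // Number of distinct pairs that have been autotuned so far.
    std::size_t tuned_pairs() const { return cache_.size(); }

private:
    std::unordered_map<std::uint64_t, SortPath> cache_;
};
```

The sketch is not thread-safe; a dispatcher called from several host threads would need a lock or a per-thread cache around `select`.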
dispatch_policy=1 (force oneshot) and dispatch_policy=2 (force MP) bypass the autotune entirely and behave exactly as before.
Safety
tokens × 8 rarely enters the autotunable zone; the LDS capacity check handles dispatch.
Test Plan
op_tests/test_moe_sorting.py for correctness at all rerouted shapes
Test Result
Measured on gfx950 (MI355X), vLLM 0.22.0: