Route MXFP4 1-stage decode through fused quant+sort path #3527
Draft
akii96 wants to merge 1 commit into
The 1-stage MXFP4 decode path was quantizing activations separately via per_1x32_mx_quant_hip and then reading the scale back from VRAM for mxfp4_moe_sort_fwd: two kernel launches with a global memory roundtrip.

Route fresh bf16/fp16 activations through fused_dynamic_mxfp4_quant_moe_sort instead, which collapses both operations into a single kernel at small M (decode). At large M the fused entry internally dispatches the same split kernels, so results are identical.

Pre-quantized FP4 inputs (from FP4 dispatch) continue using the unfused sort-only path since there is nothing to re-quantize.

Measured on gfx950 (MI355X) with openai/gpt-oss-120b, 3-repeat:
conc=16: +3.2% throughput, -2.0% TPOT
conc=32: +4.2% throughput, -4.3% TPOT

Signed-off-by: Aakif Nawaz <[email protected]>
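The routing in this commit message can be sketched roughly as follows. This is a minimal illustration only: the stub functions, the `is_prequantized_fp4` flag, and the `route_mxfp4_1stage` name are hypothetical stand-ins, not the aiter kernels or their real signatures.

```python
# Illustrative stand-ins only; these are NOT the aiter kernels.
launches = []  # records which (hypothetical) path was taken


def fused_quant_sort_stub(x):
    # Stands in for the fused quant+sort entry point.
    launches.append("fused_quant_sort")
    return ("quantized+sorted", x)


def sort_only_stub(x):
    # Stands in for the sort-only path used for pre-quantized FP4 inputs.
    launches.append("sort_only")
    return ("sorted", x)


def route_mxfp4_1stage(x, is_prequantized_fp4):
    """Pick the path for the MXFP4 case (hypothetical helper)."""
    if is_prequantized_fp4:
        # Nothing to re-quantize: only the expert sort is needed.
        return sort_only_stub(x)
    # Fresh bf16/fp16 activations: quantize and sort through one entry point.
    return fused_quant_sort_stub(x)


fresh = route_mxfp4_1stage("bf16 activations", is_prequantized_fp4=False)
packed = route_mxfp4_1stage("fp4 activations", is_prequantized_fp4=True)
```

In this sketch the caller never chooses between fused and split kernels by M; per the commit message, that decision is made inside the fused entry point.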
Motivation
The 1-stage MXFP4 MoE decode path quantizes activations and sorts them by expert in two separate kernel launches. The activation scale is written to VRAM by the quantization kernel and immediately read back by mxfp4_moe_sort_fwd, a global memory roundtrip for data that could stay on-chip.

The fused kernel fused_dynamic_mxfp4_quant_moe_sort already exists and handles both operations in a single launch at small M (decode batch sizes), but the 1-stage dispatch path (fused_moe_1stage) was not using it.

Technical Details
Single-file change to aiter/fused_moe.py. In the fused_moe_1stage else-branch, the per_1x32 (MXFP4) case is now handled before the xbf16/generic branches:

- Fresh bf16/fp16 activations are routed through fused_dynamic_mxfp4_quant_moe_sort, which collapses quant+sort into one launch at small M and internally falls back to the same split kernels at large M. Results are byte-identical.
- Pre-quantized FP4 inputs (from FP4 dispatch) continue using the mxfp4_moe_sort_fwd sort-only path since there is nothing to re-quantize.

The redundant standalone per_1x32 block that ran mxfp4_moe_sort_fwd after the generic quantization is removed.

Note: the 2-stage path (fused_moe_2stages) already uses fused_dynamic_mxfp4_quant_moe_sort at lines 1691/1806. This change brings the 1-stage path to parity.

Test Plan
fused_dynamic_mxfp4_quant_moe_sort supports all arguments passed by the 1-stage path (sorted_ids, num_valid_ids, token_num, topk, block_size, num_rows).

Test Result
Measured on gfx950 (MI355X), 3-repeat, ISL=1000/OSL=100:

conc=16: +3.2% throughput, -2.0% TPOT
conc=32: +4.2% throughput, -4.3% TPOT

No regressions. Only affects models using MXFP4 quantization (quant_type == per_1x32).
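The argument-compatibility check from the Test Plan could be automated along these lines. This is a sketch under assumptions: `missing_kwargs` and `fused_stub` are hypothetical, the argument names are taken from the list in the Test Plan and assumed to be keyword names, and `inspect.signature` may not work on compiled or extension-backed callables.

```python
import inspect

# Argument names the 1-stage path passes, as listed in the Test Plan.
ONE_STAGE_ARGS = [
    "sorted_ids", "num_valid_ids", "token_num",
    "topk", "block_size", "num_rows",
]


def missing_kwargs(fn, names):
    """Return the entries of `names` that `fn` cannot accept by keyword."""
    params = inspect.signature(fn).parameters
    # A **kwargs parameter accepts any keyword name.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    accepted = {
        name for name, p in params.items()
        if p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD,
                      inspect.Parameter.KEYWORD_ONLY)
    }
    return [n for n in names if n not in accepted]


def fused_stub(x, sorted_ids=None, num_valid_ids=None, token_num=None,
               topk=None, block_size=None, num_rows=None):
    # Hypothetical stand-in with the keyword names assumed above.
    return x


def partial_stub(x, topk=None):
    # Hypothetical stand-in that lacks most of the arguments.
    return x


ok = missing_kwargs(fused_stub, ONE_STAGE_ARGS)
gaps = missing_kwargs(partial_stub, ONE_STAGE_ARGS)
```

Here `ok` is empty when every listed name is accepted, and `gaps` lists the names a callable would reject.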