Route MXFP4 1-stage decode through fused quant+sort path #3527

Draft
akii96 wants to merge 1 commit into main from mxfp4-decode-quant-sort-fusion


akii96 commented Jun 4, 2026

Motivation

The 1-stage MXFP4 MoE decode path quantizes activations and sorts them by expert in two separate kernel launches. The activation scale is written to VRAM by the quantization kernel and immediately read back by mxfp4_moe_sort_fwd — a global memory roundtrip for data that could stay on-chip.

The fused kernel fused_dynamic_mxfp4_quant_moe_sort already exists and handles both operations in a single launch at small M (decode batch sizes), but the 1-stage dispatch path (fused_moe_1stage) was not using it.
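The difference between the two-launch and one-launch flows can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions: `block_scales`, `split_path`, and `fused_path` are hypothetical names, the scale is simplified to the largest magnitude in each block of 32 values rather than the real MXFP4 scale encoding, and none of this reflects the actual kernel code. It shows only that reordering after quantization and quantizing directly in sorted order produce the same scales.

```python
from typing import List

BLOCK = 32  # assumption: one shared scale per 1x32 block, as the per_1x32 name suggests


def block_scales(row: List[float], block: int = BLOCK) -> List[float]:
    """Toy scale: the largest magnitude in each block (not the real encoding)."""
    return [max(abs(v) for v in row[i:i + block]) for i in range(0, len(row), block)]


def split_path(acts: List[List[float]], sorted_ids: List[int]) -> List[List[float]]:
    # Launch 1: quantize every token and materialize its scales (models the VRAM write).
    scales_in_memory = [block_scales(row) for row in acts]
    # Launch 2: read the scales back and reorder them by expert-sorted token id.
    return [scales_in_memory[t] for t in sorted_ids]


def fused_path(acts: List[List[float]], sorted_ids: List[int]) -> List[List[float]]:
    # Single launch: compute each token's scales and emit them directly in sorted order.
    return [block_scales(acts[t]) for t in sorted_ids]


# Three tokens with 64 features each, so two blocks of 32 per token.
acts = [[float((t + 1) * ((i % 7) - 3)) for i in range(64)] for t in range(3)]
sorted_ids = [2, 0, 1, 0]  # hypothetical expert-sorted token order

assert split_path(acts, sorted_ids) == fused_path(acts, sorted_ids)
```

In this toy the intermediate list `scales_in_memory` stands in for the scale buffer that the split flow writes and then reads back; the fused flow never builds it.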

Technical Details

Single-file change to aiter/fused_moe.py. In the fused_moe_1stage else-branch, the per_1x32 (MXFP4) case is now handled before the xbf16/generic branches:

  • Fresh bf16/fp16 activations: routed through fused_dynamic_mxfp4_quant_moe_sort, which collapses quant+sort into one launch at small M and internally falls back to the same split kernels at large M. Results are byte-identical.
  • Pre-quantized FP4 inputs (from FP4 dispatch): kept on the existing mxfp4_moe_sort_fwd sort-only path since there is nothing to re-quantize.

The redundant standalone per_1x32 block that ran mxfp4_moe_sort_fwd after the generic quantization is removed.
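The branch ordering described above can be sketched as a small selector function. This is a minimal illustration, not the code in aiter/fused_moe.py: `Activation`, `select_1stage_path`, the string dtype tags, and the `"generic_quant"` label for the xbf16/generic branches are all hypothetical stand-ins, and the function returns the name of the path rather than launching a kernel.

```python
from dataclasses import dataclass

# Hypothetical dtype tags for fresh (not yet quantized) activations.
FRESH_DTYPES = {"bf16", "fp16"}


@dataclass
class Activation:
    dtype: str  # e.g. "bf16", "fp16", or a tag for pre-quantized FP4 input


def select_1stage_path(quant_type: str, act: Activation) -> str:
    """Return the name of the path the 1-stage else-branch would take."""
    # The per_1x32 (MXFP4) case is checked first, before the generic branches.
    if quant_type == "per_1x32":
        if act.dtype in FRESH_DTYPES:
            # Fresh activations: quantize and sort through the fused entry point.
            return "fused_dynamic_mxfp4_quant_moe_sort"
        # Pre-quantized FP4 input: nothing to re-quantize, so sort only.
        return "mxfp4_moe_sort_fwd"
    # Placeholder for the existing xbf16/generic handling.
    return "generic_quant"


print(select_1stage_path("per_1x32", Activation("bf16")))
print(select_1stage_path("per_1x32", Activation("fp4")))
```

Checking per_1x32 first is what makes the later standalone sort block unnecessary: by the time the generic branches run, the MXFP4 case has already been fully handled.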

Note: the 2-stage path (fused_moe_2stages) already uses fused_dynamic_mxfp4_quant_moe_sort at lines 1691/1806. This change brings the 1-stage path to parity.

Test Plan

  • Verified that fused_dynamic_mxfp4_quant_moe_sort supports all arguments passed by the 1-stage path (sorted_ids, num_valid_ids, token_num, topk, block_size, num_rows).
  • E2E serving benchmarks on GPT-OSS-120b (gfx950 / MI355X, vLLM 0.22.0, MXFP4 quantization).
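One way to make the argument check in the first bullet repeatable is to compare the names against the function signature with `inspect.signature`. The sketch below is an assumption about how such a check could look, not the check that was actually run: `missing_keywords` is a hypothetical helper, `fused_entry_stub` is a stand-in with made-up defaults, and it assumes the arguments are passed by keyword.

```python
import inspect
from typing import Callable, Iterable, List

# Argument names the 1-stage path passes, as listed in the test plan.
ONE_STAGE_KWARGS = (
    "sorted_ids", "num_valid_ids", "token_num", "topk", "block_size", "num_rows",
)


def missing_keywords(func: Callable, names: Iterable[str]) -> List[str]:
    """Return the names that `func` cannot accept as keyword arguments."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter accepts any keyword, so nothing can be missing.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return [
        n for n in names
        if n not in params or params[n].kind is inspect.Parameter.POSITIONAL_ONLY
    ]


# Stand-in with a hypothetical signature; substitute the real function to run the check.
def fused_entry_stub(x, sorted_ids=None, num_valid_ids=None, token_num=None,
                     topk=None, block_size=None, num_rows=None):
    return None


assert missing_keywords(fused_entry_stub, ONE_STAGE_KWARGS) == []
```

An empty list means every name is accepted; any returned name points at an argument the call site would fail to pass.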

Test Result

Measured on gfx950 (MI355X), 3-repeat, ISL=1000/OSL=100:

| Model | conc=16 | conc=32 |
| --- | --- | --- |
| openai/gpt-oss-120b (128 experts, TP=1) | +3.2% throughput, −2.0% TPOT | +4.2% throughput, −4.3% TPOT |

No regressions. Only affects models using MXFP4 quantization (quant_type == per_1x32).

The 1-stage MXFP4 decode path was quantizing activations separately via
per_1x32_mx_quant_hip and then reading the scale back from VRAM for
mxfp4_moe_sort_fwd — two kernel launches with a global memory roundtrip.
Route fresh bf16/fp16 activations through fused_dynamic_mxfp4_quant_moe_sort
instead, which collapses both operations into a single kernel at small M
(decode). At large M the fused entry internally dispatches the same split
kernels, so results are identical. Pre-quantized FP4 inputs (from FP4
dispatch) continue using the unfused sort-only path since there is nothing
to re-quantize.

Measured on gfx950 (MI355X) with openai/gpt-oss-120b, 3-repeat:
  conc=16: +3.2% throughput, -2.0% TPOT
  conc=32: +4.2% throughput, -4.3% TPOT

Signed-off-by: Aakif Nawaz <[email protected]>


github-actions Bot commented Jun 4, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| ci:atom | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| ci:atom_full | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| ci:vllm | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| ci:all | All standard extended tests (excludes ci:atom_full) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3527 --add-label <label>
