akii96 · 2026-06-04T00:49:33Z

Motivation

The 1-stage MXFP4 MoE decode path quantizes activations and sorts them by expert in two separate kernel launches. The activation scale is written to VRAM by the quantization kernel and immediately read back by mxfp4_moe_sort_fwd — a global memory roundtrip for data that could stay on-chip.

The fused kernel fused_dynamic_mxfp4_quant_moe_sort already exists and handles both operations in a single launch at small M (decode batch sizes), but the 1-stage dispatch path (fused_moe_1stage) was not using it.

Technical Details

Single-file change to aiter/fused_moe.py. In the fused_moe_1stage else-branch, the per_1x32 (MXFP4) case is now handled before the xbf16/generic branches:

- Fresh bf16/fp16 activations: routed through fused_dynamic_mxfp4_quant_moe_sort, which collapses quant+sort into one launch at small M and internally falls back to the same split kernels at large M. Results are byte-identical.
- Pre-quantized FP4 inputs (from FP4 dispatch): kept on the existing mxfp4_moe_sort_fwd sort-only path since there is nothing to re-quantize.

The redundant standalone per_1x32 block that ran mxfp4_moe_sort_fwd after the generic quantization is removed.
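
The branch order described above can be sketched as a small routing function. This is a minimal illustration, not the code in aiter/fused_moe.py: `QuantType`, `Path`, `choose_1stage_path`, and the string dtype tags are hypothetical stand-ins, and rejecting any other dtype under per_1x32 is an assumption of the sketch rather than documented behavior.

```python
from enum import Enum, auto


class QuantType(Enum):
    """Hypothetical stand-in for the quantization-type setting."""
    PER_1X32 = auto()  # MXFP4 (quant_type == per_1x32)
    OTHER = auto()     # anything handled by the xbf16/generic branches


class Path(Enum):
    """Which kernel path the 1-stage else-branch would take."""
    FUSED_QUANT_SORT = auto()  # fused_dynamic_mxfp4_quant_moe_sort
    SORT_ONLY = auto()         # mxfp4_moe_sort_fwd
    GENERIC = auto()           # xbf16/generic branches


def choose_1stage_path(quant_type: QuantType, act_dtype: str) -> Path:
    """Pick a path; the per_1x32 case is checked before the generic branches."""
    if quant_type is QuantType.PER_1X32:
        if act_dtype in ("bf16", "fp16"):
            # Fresh activations: quantize and sort via the fused entry point.
            return Path.FUSED_QUANT_SORT
        if act_dtype == "fp4":
            # Already quantized: nothing to re-quantize, so only sort.
            return Path.SORT_ONLY
        # Assumption of this sketch: other dtypes are not expected here.
        raise ValueError(f"unexpected activation dtype for per_1x32: {act_dtype}")
    return Path.GENERIC


if __name__ == "__main__":
    for dtype in ("bf16", "fp16", "fp4"):
        print(dtype, choose_1stage_path(QuantType.PER_1X32, dtype).name)
```

Checking per_1x32 first means no MXFP4 input reaches the generic quantization step, which is why the standalone sort block after it is no longer needed.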

Note: the 2-stage path (fused_moe_2stages) already uses fused_dynamic_mxfp4_quant_moe_sort at lines 1691/1806. This change brings the 1-stage path to parity.

Test Plan

- Verified that fused_dynamic_mxfp4_quant_moe_sort supports all arguments passed by the 1-stage path (sorted_ids, num_valid_ids, token_num, topk, block_size, num_rows).
- E2E serving benchmarks on GPT-OSS-120b (gfx950 / MI355X, vLLM 0.22.0, MXFP4 quantization).
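
One way to make the argument-compatibility check from the test plan repeatable is to compare the keyword names against the callee's signature. The sketch below is an assumption-laden illustration: `fused_quant_sort_stub` is a hypothetical stand-in whose parameter list mirrors only the names listed above, not the real signature of fused_dynamic_mxfp4_quant_moe_sort, and `missing_kwargs` is a helper invented for this example.

```python
import inspect

# Keyword names the 1-stage path passes, as listed in the test plan.
ONE_STAGE_KWARGS = [
    "sorted_ids", "num_valid_ids", "token_num",
    "topk", "block_size", "num_rows",
]


def fused_quant_sort_stub(x, sorted_ids, num_valid_ids, token_num,
                          topk, block_size, num_rows):
    """Hypothetical stand-in; replace with the real function to check it."""
    return None


def missing_kwargs(fn, names):
    """Return the names in `names` that `fn` cannot accept as keywords."""
    params = inspect.signature(fn).parameters
    # A **kwargs parameter accepts any keyword name.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    accepted = {n for n, p in params.items() if p.kind in keyword_kinds}
    return [n for n in names if n not in accepted]


if __name__ == "__main__":
    print(missing_kwargs(fused_quant_sort_stub, ONE_STAGE_KWARGS))
```

An empty list means every keyword is accepted; this only checks names, not that the values have compatible shapes or dtypes.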

Test Result

Measured on gfx950 (MI355X) over 3 repeats, with input/output sequence lengths ISL=1000/OSL=100:

| Model | conc=16 | conc=32 |
|---|---|---|
| openai/gpt-oss-120b (128 experts, TP=1) | +3.2% throughput, −2.0% TPOT | +4.2% throughput, −4.3% TPOT |

No regressions. Only affects models using MXFP4 quantization (quant_type == per_1x32).

The 1-stage MXFP4 decode path was quantizing activations separately via per_1x32_mx_quant_hip and then reading the scale back from VRAM for mxfp4_moe_sort_fwd: two kernel launches with a global memory roundtrip.

Route fresh bf16/fp16 activations through fused_dynamic_mxfp4_quant_moe_sort instead, which collapses both operations into a single kernel at small M (decode). At large M the fused entry internally dispatches the same split kernels, so results are identical.

Pre-quantized FP4 inputs (from FP4 dispatch) continue using the unfused sort-only path since there is nothing to re-quantize.

Measured on gfx950 (MI355X) with openai/gpt-oss-120b, 3-repeat:
conc=16: +3.2% throughput, -2.0% TPOT
conc=32: +4.2% throughput, -4.3% TPOT

Signed-off-by: Aakif Nawaz <[email protected]>

github-actions · 2026-06-04T00:49:47Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
|---|---|
| `ci:triton-300x` | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| `ci:sglang` | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| `ci:atom` | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| `ci:atom_full` | ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json` |
| `ci:vllm` | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| `ci:all` | All standard extended tests (excludes `ci:atom_full`) |

Only add `ci:atom_full` for FlyDSL or Triton upgrades.
Add labels via the sidebar or `gh pr edit 3527 --add-label <label>`.

Route MXFP4 1-stage decode through fused quant+sort path #3527
akii96 wants to merge 1 commit into main from mxfp4-decode-quant-sort-fusion