QMoE CUDA EP — FP4/FP8/WFP4AFP8 Quantized Mixture-of-Experts + MoE GEMM Refactor #28467
Draft
tianleiwu wants to merge 1 commit into
Description
Add a new `QMoE` contrib operator for the CUDA EP that supports quantized Mixture-of-Experts inference with INT4, INT8, FP4 (MXFP4 e2m1), FP8 (e4m3fn), and WFP4AFP8 (mixed FP4 weight × FP8 activation) quantization formats. This also refactors the existing MoE GEMM infrastructure to support TMA warp-specialized grouped GEMM on Hopper (SM90), native MXFP4 on Blackwell (SM120), and block-scaled tensor ops on SM100+, with automatic fallback to dequantization on older architectures.
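For reference, the sketch below decodes the MXFP4 (e2m1) storage format: one sign bit, two exponent bits (bias 1), and one mantissa bit per value, two values packed per byte, with a shared power-of-two scale per 32-element block. The lookup table is the standard e2m1 value set; the helper names and low-nibble-first packing order are illustrative assumptions, not this PR's exact utilities.

```cpp
#include <cstdint>

// Decode one e2m1 nibble: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit.
inline float DecodeE2M1(uint8_t nibble) {
  // The eight representable magnitudes of e2m1.
  static const float kMagnitude[8] = {0.0f, 0.5f, 1.0f, 1.5f,
                                      2.0f, 3.0f, 4.0f, 6.0f};
  float v = kMagnitude[nibble & 0x7];
  return (nibble & 0x8) ? -v : v;
}

// Two FP4 values per byte; MXFP4 additionally applies a shared
// power-of-two scale per 32-element block (passed in as `scale`).
inline void DecodeFp4Byte(uint8_t packed, float scale, float out[2]) {
  out[0] = DecodeE2M1(packed & 0xF) * scale;         // low nibble (assumed first)
  out[1] = DecodeE2M1((packed >> 4) & 0xF) * scale;  // high nibble
}
```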
Summary of Changes

New QMoE Operator
- `onnxruntime/core/graph/contrib_ops/contrib_defs.cc` — `QMoE` op schema (com.microsoft domain, opset 1; a registration sketch follows this list)
- `onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc/h`
- `onnxruntime/contrib_ops/cuda/moe/qmoe_kernels.cu/h`
- `onnxruntime/contrib_ops/cuda/moe/moe_base.h`
- `docs/contrib_ops/cuda/moe_qmoe.md`
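For context, a CUDA-EP contrib kernel is typically wired up with onnxruntime's registration macro roughly as sketched below; the exact type constraints this PR declares for `QMoE` may differ, and the snippet presumes onnxruntime's internal headers.

```cpp
// Hedged sketch of onnxruntime's usual CUDA contrib-op registration idiom.
namespace onnxruntime {
namespace contrib {
namespace cuda {

ONNX_OPERATOR_KERNEL_EX(
    QMoE,                     // op name in the com.microsoft domain
    kMSDomain, 1,             // domain and since-version (opset 1)
    kCudaExecutionProvider,
    (*KernelDefBuilder::Create())
        .TypeConstraint("T", DataTypeImpl::GetTensorType<MLFloat16>()),
    QMoE);                    // kernel class implementing OpKernel::Compute

}  // namespace cuda
}  // namespace contrib
}  // namespace onnxruntime
```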
MoE GEMM Refactor

- `onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemm_kernels.h` — `CutlassMoeFCRunner` template with FP4/FP8/WFP4AFP8 specializations
- `onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemm_template_dispatch.h` (dispatch sketched after this list)
- `onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemm_profiler.cc/h`
- `onnxruntime/contrib_ops/cuda/llm/moe_gemm/common.h`
- `onnxruntime/contrib_ops/cuda/llm/moe_gemm/launchers/`
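The dispatch this refactor describes can be pictured as below: query the device's SM version and pick the most capable native path, falling back to dequantize-then-GEMM otherwise. The enum and function names are hypothetical; only the SM thresholds come from the PR description.

```cpp
#include <cuda_runtime.h>

// Hypothetical labels for the GEMM paths described in this PR.
enum class MoeGemmPath {
  kNativeMxfp4Sm120,        // Blackwell: native MXFP4
  kBlockScaledSm100,        // SM100+: block-scaled tensor ops
  kTmaWarpSpecializedSm90,  // Hopper: TMA warp-specialized grouped GEMM
  kDequantFallback          // older architectures: dequantize, then GEMM
};

inline int GetSmVersion() {
  int device = 0, major = 0, minor = 0;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
  cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device);
  return major * 10 + minor;
}

// Selection mirrors the PR description's architecture tiers.
inline MoeGemmPath SelectPath(int sm) {
  if (sm >= 120) return MoeGemmPath::kNativeMxfp4Sm120;
  if (sm >= 100) return MoeGemmPath::kBlockScaledSm100;
  if (sm >= 90) return MoeGemmPath::kTmaWarpSpecializedSm90;
  return MoeGemmPath::kDequantFallback;
}
```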
CUTLASS Extensions

- `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/arch/`
- `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/collective/`
- `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/kernel/`
- `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/epilogue/`
- `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/system_barrier.h`
Common CUDA Utilities

- `onnxruntime/contrib_ops/cuda/llm/common/cuda_fp8_utils.cu/h` — FP8 conversion, quantization, dequantization kernels
- `onnxruntime/contrib_ops/cuda/llm/common/memory_utils.cu/h` — Device memory transpose, permute, type conversion utilities
- `onnxruntime/contrib_ops/cuda/llm/common/cuda_type_utils.cuh` — Unified type traits for half/bfloat16/float/fp8/fp4
- `onnxruntime/contrib_ops/cuda/llm/common/quantization.h` — Quantization parameter structs and helpers
- `onnxruntime/contrib_ops/cuda/llm/common/reduce_kernel_utils.cuh` — Warp/block reduction primitives (see the sketch after this list)
- `onnxruntime/contrib_ops/cuda/llm/kernels/quantization.cuh` — FP4/FP8 quantization kernels
- `onnxruntime/contrib_ops/cuda/llm/kernels/pre_quant_scale_kernel.cu/h` — Pre-quantization scaling kernel
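As an illustration of what `reduce_kernel_utils.cuh`-style primitives provide, here is the standard shuffle-based warp/block sum reduction; this is the common CUDA idiom, not a copy of the PR's code.

```cpp
#include <cuda_runtime.h>

__device__ inline float WarpReduceSum(float val) {
  // Butterfly reduction: after 5 shuffle steps every lane holds the warp sum.
  for (int mask = 16; mask > 0; mask >>= 1) {
    val += __shfl_xor_sync(0xffffffffu, val, mask);
  }
  return val;
}

__device__ inline float BlockReduceSum(float val) {
  __shared__ float partial[32];        // one slot per warp (max 1024 threads)
  const int lane = threadIdx.x & 31;
  const int warp = threadIdx.x >> 5;
  val = WarpReduceSum(val);
  if (lane == 0) partial[warp] = val;  // warp leaders publish partial sums
  __syncthreads();
  // The first warp reduces the per-warp partials.
  const int num_warps = (blockDim.x + 31) >> 5;
  val = (threadIdx.x < num_warps) ? partial[lane] : 0.0f;
  if (warp == 0) val = WarpReduceSum(val);
  return val;  // block-wide sum, valid in warp 0
}
```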
GEMM Profiler Refactor

- `onnxruntime/contrib_ops/cuda/llm/gemm_profiler.cc/h` (measurement loop sketched below)
- `onnxruntime/contrib_ops/cuda/llm/cutlass_heuristic.cc/h`
- `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm_configs.h`
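The profiler's core job can be pictured as a measurement loop: run each candidate GEMM tactic under CUDA events and keep the fastest. The real `gemm_profiler.cc/h` adds heuristics and per-shape caching; the function below is a simplified, hypothetical stand-in.

```cpp
#include <cuda_runtime.h>
#include <functional>
#include <vector>

// Returns the index of the fastest tactic, or -1 if none were given.
// Each tactic is a callable that enqueues one GEMM on the stream.
inline int ProfileBestTactic(
    const std::vector<std::function<void(cudaStream_t)>>& tactics,
    cudaStream_t stream, int warmup = 3, int iters = 10) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  int best = -1;
  float best_ms = 1e30f;
  for (size_t i = 0; i < tactics.size(); ++i) {
    for (int w = 0; w < warmup; ++w) tactics[i](stream);  // warm up clocks/caches
    cudaEventRecord(start, stream);
    for (int it = 0; it < iters; ++it) tactics[i](stream);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // total time; same iters for all
    if (ms < best_ms) { best_ms = ms; best = static_cast<int>(i); }
  }
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return best;
}
```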
Build System

- `cmake/CMakeLists.txt` — `ENABLE_FP4`, `ENABLE_FP8`, `ENABLE_CUDA_FP4_QMOE`, `ORT_QUICK_BUILD`, `PLACEHOLDER_KERNELS` options
- `cmake/external/cuda_configuration.cmake`
- `cmake/external/cutlass.cmake`
- `cmake/onnxruntime_providers_cuda.cmake`
- `cmake/onnxruntime_python.cmake` — `onnxruntime_pybind_quant.cc` for Python quantization bindings
Python Quantization Bindings

- `onnxruntime/python/onnxruntime_pybind_quant.cc` (see the sketch below)
- `onnxruntime/python/tools/quantization/quant_utils.py`
- `setup.py`
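A minimal sketch of what a binding in the spirit of `onnxruntime_pybind_quant.cc` might look like: a pybind11-exposed routine that packs two 4-bit values per byte so the Python quantization tools can produce packed weight buffers. The module and function names are assumptions for illustration, not this PR's actual API.

```cpp
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <cstdint>

namespace py = pybind11;

// Pack a 1-D array of int4 values (stored one per int8) two-per-byte,
// low nibble first. Packing order is an assumption here.
py::array_t<uint8_t> PackInt4(py::array_t<int8_t, py::array::c_style> values) {
  auto in = values.unchecked<1>();
  const py::ssize_t n = in.shape(0);
  py::array_t<uint8_t> out((n + 1) / 2);
  auto o = out.mutable_unchecked<1>();
  for (py::ssize_t i = 0; i < n; i += 2) {
    const uint8_t lo = static_cast<uint8_t>(in(i)) & 0xF;
    const uint8_t hi = (i + 1 < n) ? (static_cast<uint8_t>(in(i + 1)) & 0xF) : 0;
    o(i / 2) = static_cast<uint8_t>(lo | (hi << 4));
  }
  return out;
}

PYBIND11_MODULE(qmoe_pack_demo, m) {  // hypothetical demo module name
  m.def("pack_int4", &PackInt4, "Pack signed int4 values two per byte.");
}
```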
Tests

- `onnxruntime/test/python/transformers/test_qmoe_cuda.py`
- `onnxruntime/test/python/transformers/test_qmoe_fp4_cuda.py`
- `onnxruntime/test/python/transformers/test_qmoe_fp8_cuda.py`
- `onnxruntime/test/python/transformers/test_qmoe_wfp4afp8_cuda.py`
- `onnxruntime/test/python/transformers/test_moe_cuda.py`
- `onnxruntime/test/contrib_ops/moe_test.cc`
Existing MoE Refactor

- `onnxruntime/contrib_ops/cuda/moe/moe.cc/h` — Refactored to share a base with QMoE
- `onnxruntime/contrib_ops/cuda/moe/ft_moe/` → `onnxruntime/contrib_ops/cuda/llm/moe_gemm/` — Relocated and rewritten MoE GEMM kernels
- Removed `cuda/quantization/moe_quantization.cc/h` in favor of the new `cuda/moe/moe_quantization.cc/h`
Testing

- `python -m pytest onnxruntime/test/python/transformers/test_qmoe_cuda.py -v` (requires CUDA GPU, SM75+)
- `python -m pytest onnxruntime/test/python/transformers/test_qmoe_fp4_cuda.py -v` (requires SM120+ for native; falls back on older architectures)
- `python -m pytest onnxruntime/test/python/transformers/test_qmoe_fp8_cuda.py -v` (requires SM90+ for native)
- `python -m pytest onnxruntime/test/python/transformers/test_qmoe_wfp4afp8_cuda.py -v` (requires SM100+)
- `python -m pytest onnxruntime/test/python/transformers/test_moe_cuda.py -v`
- `onnxruntime_test_all --gtest_filter=*MoE*`
Motivation and Context

Modern LLMs increasingly use Mixture-of-Experts architectures (e.g., Mixtral, DeepSeek, Phi-3.5-MoE) for efficient scaling. These models benefit significantly from weight quantization to reduce memory bandwidth and enable larger models on fewer GPUs. This PR: