QMoE CUDA EP — FP4/FP8/WFP4AFP8 Quantized Mixture-of-Experts + MoE GEMM Refactor #28467

Draft

tianleiwu wants to merge 1 commit into main from tlwu/20260511/qmoe_cuda

Conversation

@tianleiwu (Contributor)

Description

Add a new QMoE contrib operator for the CUDA EP that supports quantized Mixture-of-Experts inference with INT4, INT8, FP4 (MXFP4 e2m1), FP8 (e4m3fn), and WFP4AFP8 (mixed FP4 weight × FP8 activation) quantization formats. This also refactors the existing MoE GEMM infrastructure to support TMA warp-specialized grouped GEMM on Hopper (SM90), native MXFP4 on Blackwell (SM120), and block-scaled tensor ops on SM100+, with automatic fallback to dequantization on older architectures.
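For orientation, here is a minimal sketch of what building a QMoE node with `onnx.helper` might look like. The input and attribute names below are assumptions for illustration; the authoritative schema is the one registered in `contrib_defs.cc` and documented in `docs/contrib_ops/cuda/moe_qmoe.md`.

```python
from onnx import helper

# Hypothetical QMoE node; input and attribute names are illustrative,
# not the registered schema (see contrib_defs.cc / moe_qmoe.md for that).
qmoe_node = helper.make_node(
    "QMoE",
    inputs=[
        "hidden_states", "router_probs",
        "fc1_weights", "fc1_scales",
        "fc2_weights", "fc2_scales",
    ],
    outputs=["output"],
    domain="com.microsoft",
    k=2,                    # top-k experts per token (assumed attribute name)
    expert_weight_bits=4,   # quantized weight width (assumed attribute name)
)
```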

Summary of Changes

New QMoE Operator

| File | Change |
| --- | --- |
| `onnxruntime/core/graph/contrib_ops/contrib_defs.cc` | Register the QMoE op schema (`com.microsoft` domain, opset 1) |
| `onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc/h` | QMoE CUDA kernel implementation with dynamic runner selection |
| `onnxruntime/contrib_ops/cuda/moe/qmoe_kernels.cu/h` | Softmax top-k router (sketched below), sparse mixer, zero-point pre-packing kernels |
| `onnxruntime/contrib_ops/cuda/moe/moe_base.h` | Shared MoE base-class updates for quantization attributes |
| `docs/contrib_ops/cuda/moe_qmoe.md` | Comprehensive operator documentation (inputs, attributes, quantization formats) |
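For reference, the routing step that `qmoe_kernels.cu` fuses into a single kernel can be written out in a few lines of PyTorch. This is a sketch of standard softmax top-k semantics; whether the CUDA kernel renormalizes the top-k weights is an assumption here.

```python
import torch

def softmax_topk_router(router_logits: torch.Tensor, k: int):
    # router_logits: (num_tokens, num_experts)
    probs = torch.softmax(router_logits.float(), dim=-1)
    topk_w, topk_idx = torch.topk(probs, k, dim=-1)          # per-token expert choice
    topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)       # renormalize (assumed)
    return topk_w, topk_idx
```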

MoE GEMM Refactor

| File | Change |
| --- | --- |
| `onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemm_kernels.h` | Unified `CutlassMoeFCRunner` template with FP4/FP8/WFP4AFP8 specializations |
| `onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemm_template_dispatch.h` | Three-family dispatch (see sketch below): Ampere `GemmGrouped`, TMA warp-specialized, block-scaled tensor ops |
| `onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemm_profiler.cc/h` | MoE-specific GEMM tactic profiler for auto-tuning |
| `onnxruntime/contrib_ops/cuda/llm/moe_gemm/common.h` | Shared MoE GEMM types and config structs |
| `onnxruntime/contrib_ops/cuda/llm/moe_gemm/launchers/` | SM80/SM90/SM120 launcher instantiations (including generated `.cu` files) |
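The three-family split, plus the fallback described in the overview, can be pictured with a small dispatch sketch. This is Python pseudocode only; the actual dispatch in `moe_gemm_template_dispatch.h` happens through C++ template specialization, and the exact architecture/dtype conditions below are assumptions.

```python
def select_gemm_family(sm: int, weight_dtype: str) -> str:
    """Illustrative dispatch only; the real conditions live in C++ templates."""
    if sm >= 100 and weight_dtype in ("mxfp4", "fp8", "wfp4afp8"):
        return "block_scaled_tensor_ops"   # SM100+/SM120: native block-scaled MMA
    if sm >= 90:
        return "tma_warp_specialized"      # Hopper: TMA warp-specialized grouped GEMM
    return "ampere_gemm_grouped"           # SM75/SM80: GemmGrouped + dequant fallback
```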

CUTLASS Extensions

| File | Change |
| --- | --- |
| `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/arch/` | Grid dependency control, TMA copy traits, multi-mem copy operations |
| `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/collective/` | Mixed-input and gated GEMM collective builders for SM90 |
| `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/kernel/` | Fused MoE kernel traits/routines, MoE problem visitors, gated GEMM kernels |
| `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/epilogue/` | MoE finalize epilogue, per-row/per-column scale epilogues |
| `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/system_barrier.h` | System barrier for multi-CTA synchronization |

Common CUDA Utilities

  • `onnxruntime/contrib_ops/cuda/llm/common/cuda_fp8_utils.cu/h` — FP8 conversion, quantization, and dequantization kernels (reference sketch after this list)
  • `onnxruntime/contrib_ops/cuda/llm/common/memory_utils.cu/h` — Device memory transpose, permute, and type-conversion utilities
  • `onnxruntime/contrib_ops/cuda/llm/common/cuda_type_utils.cuh` — Unified type traits for half/bfloat16/float/fp8/fp4
  • `onnxruntime/contrib_ops/cuda/llm/common/quantization.h` — Quantization parameter structs and helpers
  • `onnxruntime/contrib_ops/cuda/llm/common/reduce_kernel_utils.cuh` — Warp/block reduction primitives
  • `onnxruntime/contrib_ops/cuda/llm/kernels/quantization.cuh` — FP4/FP8 quantization kernels
  • `onnxruntime/contrib_ops/cuda/llm/kernels/pre_quant_scale_kernel.cu/h` — Pre-quantization scaling kernel
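As a reference for the FP8 utilities above, here is a sketch of the e4m3fn quantize/dequantize round-trip that `cuda_fp8_utils.cu` implements in fused, vectorized form. It assumes PyTorch >= 2.1 for `torch.float8_e4m3fn`, and uses the common amax-to-max-normal scaling convention, which is not necessarily the convention the kernels use.

```python
import torch

E4M3_MAX = 448.0  # largest normal value representable in float8_e4m3fn

def fp8_quantize(x: torch.Tensor):
    # map the tensor's absolute max onto the FP8 max normal value
    scale = E4M3_MAX / x.abs().amax().clamp(min=1e-12)
    return (x * scale).to(torch.float8_e4m3fn), scale

def fp8_dequantize(x_fp8: torch.Tensor, scale: torch.Tensor):
    return x_fp8.to(torch.float32) / scale
```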

GEMM Profiler Refactor

| File | Change |
| --- | --- |
| `onnxruntime/contrib_ops/cuda/llm/gemm_profiler.cc/h` | Refactored GEMM profiler interface for tactic selection |
| `onnxruntime/contrib_ops/cuda/llm/cutlass_heuristic.cc/h` | Updated heuristics for the new kernel families |
| `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm_configs.h` | Extended GEMM config enums for TMA warp-specialized and gated configs |
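Conceptually, tactic profiling is a timing loop over candidate kernel configurations for a given problem shape, keeping the fastest. The sketch below is a plain-Python analogue; the real profiler in `gemm_profiler.cc/h` times CUDA kernels with events and caches per-shape results, and the function and parameter names here are illustrative.

```python
import time

def profile_tactics(run_gemm, tactics, warmup=3, iters=10):
    """Time each candidate tactic and return the fastest (conceptual sketch)."""
    best, best_ms = None, float("inf")
    for tactic in tactics:
        for _ in range(warmup):          # warm up caches / clocks
            run_gemm(tactic)
        t0 = time.perf_counter()
        for _ in range(iters):
            run_gemm(tactic)
        ms = (time.perf_counter() - t0) * 1000 / iters
        if ms < best_ms:
            best, best_ms = tactic, ms
    return best
```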

Build System

| File | Change |
| --- | --- |
| `cmake/CMakeLists.txt` | Add `ENABLE_FP4`, `ENABLE_FP8`, `ENABLE_CUDA_FP4_QMOE`, `ORT_QUICK_BUILD`, `PLACEHOLDER_KERNELS` options |
| `cmake/external/cuda_configuration.cmake` | FP4/FP8 capability detection based on CUDA version and SM arch |
| `cmake/external/cutlass.cmake` | CUTLASS version bump |
| `cmake/onnxruntime_providers_cuda.cmake` | Add MoE GEMM source files and conditional FP4/FP8 kernel compilation |
| `cmake/onnxruntime_python.cmake` | Add `onnxruntime_pybind_quant.cc` for Python quantization bindings |

Python Quantization Bindings

| File | Change |
| --- | --- |
| `onnxruntime/python/onnxruntime_pybind_quant.cc` | C++ pybind module for MoE weight preprocessing (quantize, pack, preprocess) |
| `onnxruntime/python/tools/quantization/quant_utils.py` | FP4/FP8 quantization utilities |
| `setup.py` | Include the new pybind module in the package build |
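For reference, MXFP4 quantization per the OCP microscaling spec uses 32-element blocks that share one power-of-two (e8m0) scale, with each element rounded to the signed e2m1 grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}. The sketch below illustrates the math; the pybind quantize/pack helpers implement a packed, fused equivalent, and the exact rounding and scale-selection rules here are assumptions.

```python
import torch

E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize_block(block: torch.Tensor):
    """Quantize one 32-element block to the e2m1 grid with a shared 2^n scale."""
    assert block.numel() == 32
    amax = block.abs().amax().clamp(min=1e-12)
    # e8m0-style shared scale: e2m1's max value 6.0 has exponent 2
    scale = 2.0 ** (torch.floor(torch.log2(amax)) - 2)
    scaled = (block / scale).clamp(-6.0, 6.0)        # saturate to the e2m1 range
    # round each magnitude to the nearest grid point, keeping the sign
    idx = (scaled.abs().unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)
    return torch.sign(scaled) * E2M1_GRID[idx], scale
```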

Tests

| File | Change |
| --- | --- |
| `onnxruntime/test/python/transformers/test_qmoe_cuda.py` | INT4/INT8 QMoE tests (Phi3 topology, SwiGLU, blockwise, asymmetric) |
| `onnxruntime/test/python/transformers/test_qmoe_fp4_cuda.py` | MXFP4 QMoE tests |
| `onnxruntime/test/python/transformers/test_qmoe_fp8_cuda.py` | FP8 QMoE tests |
| `onnxruntime/test/python/transformers/test_qmoe_wfp4afp8_cuda.py` | WFP4AFP8 mixed-precision QMoE tests |
| `onnxruntime/test/python/transformers/test_moe_cuda.py` | Existing MoE tests updated for the refactored infrastructure |
| `onnxruntime/test/contrib_ops/moe_test.cc` | C++ MoE unit tests updated |

Existing MoE Refactor

  • `onnxruntime/contrib_ops/cuda/moe/moe.cc/h` — Refactored to share a base with QMoE
  • `onnxruntime/contrib_ops/cuda/moe/ft_moe/` → `onnxruntime/contrib_ops/cuda/llm/moe_gemm/` — MoE GEMM kernels relocated and rewritten
  • Removed the old `cuda/quantization/moe_quantization.cc/h` in favor of the new `cuda/moe/moe_quantization.cc/h`

Testing

  • INT4/INT8 QMoE: `python -m pytest onnxruntime/test/python/transformers/test_qmoe_cuda.py -v` (requires a CUDA GPU, SM75+)
  • FP4 QMoE: `python -m pytest onnxruntime/test/python/transformers/test_qmoe_fp4_cuda.py -v` (requires SM120+ for native execution; falls back to dequantization on older architectures)
  • FP8 QMoE: `python -m pytest onnxruntime/test/python/transformers/test_qmoe_fp8_cuda.py -v` (requires SM90+ for native execution)
  • WFP4AFP8 QMoE: `python -m pytest onnxruntime/test/python/transformers/test_qmoe_wfp4afp8_cuda.py -v` (requires SM100+)
  • Existing MoE: `python -m pytest onnxruntime/test/python/transformers/test_moe_cuda.py -v`
  • C++ MoE tests: build with the CUDA EP enabled and run `onnxruntime_test_all --gtest_filter=*MoE*`
  • All tests compare QMoE output against PyTorch reference implementations with configurable tolerance
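The comparisons follow one pattern, sketched below; the tensor names and tolerance values are illustrative, not the ones in the test files.

```python
import torch

def check_close(qmoe_output: torch.Tensor, reference_output: torch.Tensor,
                rtol: float = 5e-2, atol: float = 5e-2):
    """qmoe_output from an ORT InferenceSession run of the QMoE model;
    reference_output from an equivalent PyTorch MoE forward pass."""
    torch.testing.assert_close(qmoe_output, reference_output, rtol=rtol, atol=atol)
```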

Motivation and Context

Modern LLMs increasingly use Mixture-of-Experts architectures (e.g., Mixtral, DeepSeek, Phi-3.5-MoE) for efficient scaling. These models benefit significantly from weight quantization, which reduces memory-bandwidth pressure and lets larger models fit on fewer GPUs. This PR:

  1. Adds native low-precision MoE support — FP4 and FP8 quantized weights avoid the dequantization overhead of INT4/INT8 on supported hardware (Hopper, Blackwell).
  2. Introduces WFP4AFP8 — A novel mixed-precision mode where weights are MXFP4 and activations are dynamically quantized to FP8, enabling 2× weight compression with minimal accuracy loss on Blackwell GPUs (sketched after this list).
  3. Refactors MoE GEMM infrastructure — The previous FasterTransformer-derived MoE GEMM code is replaced with a modern CUTLASS 4.x-based dispatch system supporting three kernel families across SM75–SM120+.
  4. Adds auto-tuning — The GEMM profiler enables runtime tactic selection for optimal performance across different expert sizes and batch configurations.
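As a rough illustration of item 2 above, the sketch below emulates the WFP4AFP8 flow in PyTorch: weights stay on an FP4 (e2m1) grid with a weight scale, while activations are dynamically quantized to FP8 per tensor. The per-tensor activation scaling and the fp32 emulation are assumptions for clarity; the real kernels consume packed FP4 and FP8 operands directly on SM100+ tensor cores.

```python
import torch

def wfp4afp8_matmul(act: torch.Tensor, w_e2m1: torch.Tensor, w_scale: torch.Tensor):
    # act: (m, k) activations; w_e2m1: (k, n) weights already on the e2m1 grid
    # (stored as float for clarity); w_scale: weight dequantization scale.
    a_scale = 448.0 / act.abs().amax().clamp(min=1e-12)   # dynamic FP8 e4m3fn scale
    a_fp8 = (act * a_scale).to(torch.float8_e4m3fn)
    w = w_e2m1.to(torch.float32) * w_scale                # dequantize FP4 weights
    return (a_fp8.to(torch.float32) @ w) / a_scale        # undo the activation scale
```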

tianleiwu requested a review from Copilot May 12, 2026 00:25

Copilot AI left a comment:

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.
