DO NOT REVIEW YET #8954

nzmora-nvidia · 2025-11-05T20:58:12Z

Summary by CodeRabbit

Release Notes

New Features
- Added Relu2 activation function support for mixture-of-experts operations
- Introduced configurable activation type selection for MoE layers
- Added FP8 quantization support for MoE operations
Tests
- Added comprehensive test coverage for FP8/FP16 MoE fusion paths with multiple activation variants and configurations

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Supported backends: [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Signed-off-by: Neta Zmora <[email protected]>

coderabbitai · 2025-11-05T21:10:07Z

📝 Walkthrough

Walkthrough

This PR introduces Relu² activation support and FP8 quantization infrastructure for mixture-of-experts (MoE) operations. Changes span kernel implementations, C++ runners, and PyTorch bindings: they add new ActivationType enum members, thread activation types through MoE runner signatures, and implement FP8-specific MoE kernels with comprehensive testing.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Relu2 Activation Struct `cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/thread/fused_activations.h` | Added new `Relu2` template struct with overloaded operators applying ReLU followed by squaring; includes `kIsHeavy` flag. |
| ActivationType Enum Extensions `cpp/tensorrt_llm/kernels/cutlass_kernels/include/common.h`, `tensorrt_llm/_torch/modules/fused_moe/routing.py` | Extended `ActivationType` enum to include `Relu2` member alongside existing activation types (Gelu, Relu, Silu, Swiglu, Geglu, SwigluBias, Identity). |
| MOE Kernel Dispatch `cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h`, `cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu` | Added `Relu2` case handling in MOE GEMM switch statements; integrated `IdentityAdaptor` wrapping `Relu2` into activation kernel selection list. |
| MOE Runner Signatures `cpp/tensorrt_llm/thop/moeOp.cpp` | Updated `FusedMoeRunner` constructor and `runMoe`/`runMoeMinLantency` methods to accept `activation_type` parameter; added conditional logic distinguishing gated vs. ungated activations for weight dimension validation. |
| PyTorch Custom Ops Integration `tensorrt_llm/_torch/custom_ops/torch_custom_ops.py` | Introduced public `ActivationType` enum; extended `MoERunner` class and `fused_moe` function to accept and thread `activation_type` through to underlying runners. |
| Auto-Deploy FP8 MoE Functions `tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py` | Added `trtllm_fused_moe` and new `trtllm_quant_fp8moe_fused` functions with `mlp_style` and `act_fn` parameters; implemented FP8 quantization scales computation and activation-type mapping; added fake registration for testing. |
| MOE Test Coverage `tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_trtllm_moe.py` | New comprehensive test module covering FP8 and FP16/BFloat16 MoE paths with routing computation, expert simulation (including Relu2), quantization utilities, and parameterized test suite validating fused operations against reference implementations. |
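For reference, the activation described in the first row (ReLU followed by squaring) can be written as a scalar function. This is a minimal sketch for checking expected values by hand; the name `relu2` and the scalar form are illustrative assumptions and are not the CUTLASS functor from the PR.

```python
def relu2(x: float) -> float:
    """ReLU followed by squaring: max(x, 0) ** 2 (illustrative scalar form)."""
    r = max(x, 0.0)  # ReLU clamps negative inputs to zero
    return r * r     # squaring the clamped value


values = [-2.0, -0.5, 0.0, 0.5, 3.0]
outputs = [relu2(v) for v in values]
print(outputs)
```

Negative inputs map to zero and positive inputs are squared, so the function is non-negative everywhere and ungated (it consumes a single projection rather than a gate/value pair).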

Sequence Diagram(s)

sequenceDiagram
    participant Caller as PyTorch Caller
    participant PyOps as torch_custom_ops.py
    participant MoeRunner as MoERunner
    participant Dispatch as moe_dispatch
    participant Kernel as CUTLASS Kernel
    
    Caller->>PyOps: fused_moe(..., activation_type=Relu2)
    PyOps->>PyOps: Convert activation_type to enum
    PyOps->>MoeRunner: __init__(activation_type=Relu2)
    MoeRunner->>MoeRunner: Store self.activation_type
    MoeRunner->>Dispatch: forward(activation_type=Relu2)
    
    rect rgb(200, 220, 255)
        note right of Dispatch: Activation Dispatch
        Dispatch->>Dispatch: Switch on activation_type
        alt activation_type == Relu2
            Dispatch->>Kernel: Select Relu2 Kernel
        else activation_type == Swiglu (default)
            Dispatch->>Kernel: Select Swiglu Kernel
        else Other
            Dispatch->>Kernel: Select Corresponding Kernel
        end
    end
    
    Kernel->>Kernel: Apply Relu2 (ReLU + Square)
    Kernel-->>Dispatch: Output
    Dispatch-->>MoeRunner: Result
    MoeRunner-->>PyOps: Result
    PyOps-->>Caller: Output Tensor

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Distributed Changes: Modifications span kernel implementations (CUDA/CUTLASS), C++ runner signatures with conditional logic, and multiple layers of PyTorch bindings, requiring context switching and cross-layer understanding.
FP8 Quantization Logic: The trtllm_moe.py file introduces non-trivial FP8 quantization scale computation (gemm1_dequant, gemm2_act_quant, gemm2_dequant, gemm1_input_dequant) and weight handling that demands careful verification.
Activation Type Threading: Activation types must be correctly threaded from Python entry points through C++ dispatch to kernel selection; incorrect threading could silently select wrong kernels.
Conditional Weight Validation: moeOp.cpp introduces gated vs. ungated activation branching for weight dimension checks; logic must align with actual kernel requirements.
Test Complexity: The new test file includes expert simulation, routing computation, and FP8 quantization/dequantization logic that must accurately model kernel behavior.

Areas requiring extra attention:

FP8 quantization scale computation in trtllm_moe.py (4 dequant/quant phases)
Conditional gated/ungated logic in moeOp.cpp and corresponding weight dimension assertions
Activation type enum consistency across torch_custom_ops.py and routing.py files
Expert simulation in test file, particularly Relu2 activation path accuracy
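To make the four scale names above easier to reason about, here is a rough sketch of how such quantities are commonly related in a per-tensor FP8 (E4M3) pipeline. The relationships, the helper names `per_tensor_scale` and `moe_fp8_scales`, and the use of plain Python lists are assumptions for illustration; they are not copied from trtllm_moe.py and should be checked against the actual implementation.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def per_tensor_scale(values):
    """Dequantization scale mapping the tensor's max magnitude onto the FP8 range."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        raise ValueError("cannot derive a scale from an all-zero tensor")
    return amax / FP8_E4M3_MAX


def moe_fp8_scales(input_vals, w1_vals, act_vals, w2_vals):
    # Assumed relationships between the four scales named in the review.
    input_scale = per_tensor_scale(input_vals)
    w1_scale = per_tensor_scale(w1_vals)
    act_scale = per_tensor_scale(act_vals)
    w2_scale = per_tensor_scale(w2_vals)
    return {
        "gemm1_input_dequant": input_scale,       # undo input quantization
        "gemm1_dequant": input_scale * w1_scale,  # undo input and fc1 weight quantization
        "gemm2_act_quant": 1.0 / act_scale,       # quantize the activation feeding fc2
        "gemm2_dequant": act_scale * w2_scale,    # undo activation and fc2 weight quantization
    }


scales = moe_fp8_scales([448.0, -1.0], [-224.0, 3.0], [896.0], [112.0])
print(scales)
```

Under these assumptions, a reviewer can confirm that each GEMM's dequant factor is the product of the scales of its two quantized operands, and that the activation quant factor is the reciprocal of the activation scale.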

Possibly related PRs

[None][feat] AutoDeploy: Add FP8 MOE for Nemotron #8599: Adds FP8 MoE support with Relu2 activation and ActivationType enum extensions, directly aligned with this PR's kernel dispatch, FP8 path implementation, and test infrastructure.

Suggested reviewers

Fridah-nv

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ⚠️ Warning | The title 'DO NOT REVIEW YET' is a placeholder/meta-comment unrelated to the actual code changes introducing Relu2 activation support. | Replace with a descriptive title following the template format, such as '[None][feat] Add Relu2 activation support to MoE kernels' or similar. |
| Description check | ⚠️ Warning | The PR description is entirely the template itself with no actual content filled in under Description, Test Coverage, or PR Checklist sections. | Fill in the Description section explaining the Relu2 feature, Test Coverage listing the test added, and complete the PR Checklist items appropriately. |

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

cpp/tensorrt_llm/thop/moeOp.cpp (2)

331-355: Block invalid enum values; align INT8 WOQ dim check with gated/ungated activations.

Casting int64_t → ActivationType without validation can pass an out‑of‑range value that later indexes activation tables in kernels. Add a guard.
For mUseINT8WoqPerChannel, the inter-size check always enforces “×2” (GLU). It should mirror the new gated/ungated logic used below.

Apply:

-        ActivationType base_activation_type = activation_type.has_value()
-            ? static_cast<ActivationType>(activation_type.value())
-            : ActivationType::Swiglu;
+        ActivationType base_activation_type = ActivationType::Swiglu;
+        if (activation_type.has_value())
+        {
+            auto act = static_cast<ActivationType>(activation_type.value());
+            switch (act)
+            {
+            case ActivationType::Gelu:
+            case ActivationType::Relu:
+            case ActivationType::Silu:
+            case ActivationType::Swiglu:
+            case ActivationType::Geglu:
+            case ActivationType::SwigluBias:
+            case ActivationType::Relu2:
+            case ActivationType::Identity:
+                base_activation_type = act;
+                break;
+            default:
+                TORCH_CHECK(false, "Invalid activation_type value: ", activation_type.value());
+            }
+        }
@@
-        if (mUseINT8WoqPerChannel)
-        {
-            // Note: The weight shape for INT8 weight only quantization is different, e.g., fc2_expert_weights:
-            // [num_experts, inter_size, hidden_size]
-            TORCH_CHECK(fc1_expert_weights.sizes()[2] == fc2_expert_weights.sizes()[1] * mInnerDimMultiplier * 2,
-                "fc1_expert_weights inter size must be 2 times fc2_expert_weights inter size.");
-        }
+        if (mUseINT8WoqPerChannel)
+        {
+            // INT8 weight-only uses [num_experts, inter_size, hidden_size] for fc2.
+            auto const expect_mul = isGatedActivation(base_activation_type) ? 2 : 1;
+            TORCH_CHECK(
+                fc1_expert_weights.sizes()[2] == fc2_expert_weights.sizes()[1] * mInnerDimMultiplier * expect_mul,
+                expect_mul == 2
+                    ? "fc1_expert_weights inter size must be 2x fc2_expert_weights inter size for gated activations."
+                    : "fc1_expert_weights inter size must equal fc2_expert_weights inter size for ungated activations.");
+        }

Based on learnings.
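The two suggestions above (reject out-of-range enum values, then derive the expected fc1 width from whether the activation is gated) can be modeled in a few lines. This is a sketch under stated assumptions: the member names follow the review, but the numeric values are placeholders, and `resolve_activation` and `expected_fc1_inter_size` are hypothetical helpers, not functions in moeOp.cpp. Treating Swiglu, Geglu, and SwigluBias as the gated set is also an assumption based on their names.

```python
from enum import IntEnum


class ActivationType(IntEnum):
    # Names follow the review; numeric values here are placeholders.
    Gelu = 1
    Relu = 2
    Silu = 3
    Swiglu = 4
    Geglu = 5
    SwigluBias = 6
    Identity = 7
    Relu2 = 8


# Assumed gated (GLU-style) activations: fc1 produces gate and value halves.
GATED = frozenset({ActivationType.Swiglu, ActivationType.Geglu, ActivationType.SwigluBias})


def resolve_activation(raw):
    """Map an optional integer to a validated enum member, defaulting to Swiglu."""
    if raw is None:
        return ActivationType.Swiglu
    try:
        return ActivationType(raw)
    except ValueError:
        raise ValueError(f"Invalid activation_type value: {raw}") from None


def expected_fc1_inter_size(fc2_inter_size, inner_dim_multiplier, act):
    """fc1 must be 2x fc2's inter size for gated activations, 1x otherwise."""
    expect_mul = 2 if act in GATED else 1
    return fc2_inter_size * inner_dim_multiplier * expect_mul


print(expected_fc1_inter_size(1024, 1, resolve_activation(None)))
print(expected_fc1_inter_size(1024, 1, resolve_activation(int(ActivationType.Relu2))))
```

Validating before use means an unknown integer fails loudly at the binding layer instead of reaching kernel selection.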

286-316: Duplicate bias-validation block.

The fc1/fc2 bias validation appears twice back‑to‑back; drop one to avoid redundant checks and noisy errors.

Apply:

-        if (fc1_expert_biases.has_value() || fc2_expert_biases.has_value())
-        {
-            CHECK_INPUT(fc1_expert_biases.value(), mOutputDtype);
-            CHECK_INPUT(fc2_expert_biases.value(), mOutputDtype);
-            TORCH_CHECK(fc1_expert_biases.value().dim() == 2, "fc1_expert_biases must be 2D.");
-            TORCH_CHECK(fc2_expert_biases.value().dim() == 2, "fc2_expert_biases must be 2D.");
-            TORCH_CHECK(fc1_expert_weights.sizes()[0] == fc1_expert_biases.value().sizes()[0],
-                "fc1_expert_weights and fc1_expert_biases must have the same number of experts.");
-            TORCH_CHECK(fc2_expert_weights.sizes()[0] == fc2_expert_biases.value().sizes()[0],
-                "fc2_expert_weights and fc2_expert_biases must have the same number of experts.");
-            TORCH_CHECK(fc1_expert_biases.value().sizes()[1] == fc1_expert_weights.sizes()[1],
-                "fc1_expert_biases should match fc1_expert_weights output shape.");
-            TORCH_CHECK(fc2_expert_biases.value().sizes()[1] == fc2_expert_weights.sizes()[1],
-                "fc2_expert_biases should match fc2_expert_weights output shape.");
-        }

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)

949-958: Relu2 activation still hard-fails at runtime

ActivationType::Relu2 is now threaded all the way from the Python API and we introduce a CUTLASS functor for it, yet this switch still raises "Relu2 is not supported.". Any attempt to exercise the new activation will immediately throw here, so the feature cannot work. Please wire this case to the actual Relu² epilogue (or the appropriate adaptor) instead of throwing so that the new activation behaves correctly.

🧹 Nitpick comments (2)

cpp/tensorrt_llm/thop/moeOp.cpp (1)

729-742: Profiler hardcodes Swiglu; consider threading selected activation.

mProfiler->init uses ActivationType::Swiglu regardless of the actual activation. For non‑gated/Relu2 runs this can skew workspace/tactic selection.

Pass the resolved base_activation_type into runGemmProfile (API change), or infer gated vs. ungated from dims to set the activation used for profiling. Based on learnings.
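If the API change is undesirable, the second option (inferring gated vs. ungated from the weight dimensions) could look roughly like the following. The function name `infer_is_gated` and its parameters are hypothetical; the sketch only shows the ratio test and makes no claim about how the profiler is actually wired.

```python
def infer_is_gated(fc1_inter_size, fc2_inter_size, inner_dim_multiplier=1):
    """Return True for a 2x fc1/fc2 ratio (gated), False for 1x (ungated)."""
    base = fc2_inter_size * inner_dim_multiplier
    if base <= 0:
        raise ValueError("fc2 inter size and multiplier must be positive")
    if fc1_inter_size == 2 * base:
        return True
    if fc1_inter_size == base:
        return False
    raise ValueError(
        f"fc1 inter size {fc1_inter_size} is neither 1x nor 2x the fc2 inter size {base}"
    )


print(infer_is_gated(2048, 1024))  # gated layout
print(infer_is_gated(1024, 1024))  # ungated layout
```

Inference only recovers the gated/ungated distinction, not the specific activation, so passing the resolved activation type explicitly remains the more precise option.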

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)

2310-2312: Relu2 added: guard against enum/dispatch drift.

The dispatch uses an array indexed by the integer value of ActivationType (a static_cast of the enum). If the enum order diverges from this array, you’ll pick the wrong kernel or index out of bounds on invalid values.

Prefer a switch on activation_type.activation_type with a default that TLLM_CHECK(false, …). Compilers will turn this into a jump table; safer than positional arrays.

If keeping the array, add a compile-time/static size guard tied to an enum sentinel (e.g., kNumActivations) and validate input at call sites (see moeOp.cpp comments).
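The idea behind both options can be shown in a language-neutral way: key the kernel table by enum member rather than by position, check once that every member has an entry, and reject unknown values at lookup time. The sketch below is a Python analogue, not the CUDA dispatch code; the enum values and the kernel-name strings are placeholders.

```python
from enum import IntEnum


class ActivationType(IntEnum):
    # Names follow the review; numeric values here are placeholders.
    Gelu = 1
    Relu = 2
    Silu = 3
    Swiglu = 4
    Geglu = 5
    SwigluBias = 6
    Identity = 7
    Relu2 = 8


# Keyed by member, so reordering the enum cannot shift entries (placeholder names).
KERNELS = {
    ActivationType.Gelu: "gelu_kernel",
    ActivationType.Relu: "relu_kernel",
    ActivationType.Silu: "silu_kernel",
    ActivationType.Swiglu: "swiglu_kernel",
    ActivationType.Geglu: "geglu_kernel",
    ActivationType.SwigluBias: "swiglu_bias_kernel",
    ActivationType.Identity: "identity_kernel",
    ActivationType.Relu2: "relu2_kernel",
}

# Analogue of a static size guard: fail at import if a member lacks a kernel.
missing = set(ActivationType) - set(KERNELS)
assert not missing, f"no kernel registered for: {sorted(m.name for m in missing)}"


def select_kernel(raw: int) -> str:
    """Analogue of a switch with a failing default: unknown values raise."""
    try:
        act = ActivationType(raw)
    except ValueError:
        raise ValueError(f"Invalid activation_type value: {raw}") from None
    return KERNELS[act]


print(select_kernel(int(ActivationType.Relu2)))
```

Adding a new enum member without registering a kernel then fails immediately at load time rather than selecting a neighbouring kernel at run time.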

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b181568 and 130dbd1.

📒 Files selected for processing (9)

cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/thread/fused_activations.h (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/common.h (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1 hunks)
cpp/tensorrt_llm/thop/moeOp.cpp (6 hunks)
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py (4 hunks)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (9 hunks)
tensorrt_llm/_torch/modules/fused_moe/routing.py (1 hunks)
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_trtllm_moe.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (8)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}