[TRTLLM-8778][feat] Add tree attention support for blackwell arch #8975

sunnyqgg · 2025-11-06T13:31:27Z

Summary by CodeRabbit

New Features
- Added speculative decoding optimizations for Blackwell architecture with custom attention mask preparation
- Introduced layer index tracking for multi-layer attention operations
- Extended attention metadata to support per-key-value head configuration and speculative decoding tree masking
Tests
- Added comprehensive unit tests for custom mask preparation functionality

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

coderabbitai · 2025-11-06T13:37:41Z

Walkthrough

Extends TensorRT-LLM's attention operators to support Blackwell-generation GPU spec-decoding tree masking and layer indexing. Introduces three new tensor buffers for spec-decoding (tree mask offset, tree mask, and sparse mask offset) and propagates layer indices throughout the stack—from high-level attention operations through low-level FMHA kernels. Adds a new custom mask preparation pipeline with CUDA kernels while maintaining backward compatibility for existing dense attention paths.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Spec-Decoding Parameters and Layer Index | `cpp/tensorrt_llm/common/attentionOp.h`, `cpp/tensorrt_llm/common/attentionOp.cpp`, `cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/xqaParams.h` | Adds `layer_idx` scalar and three new pointers (`spec_decoding_bl_tree_mask_offset`, `spec_decoding_bl_tree_mask`, `spec_bl_tree_first_sparse_mask_offset_kv`) to `EnqueueGenerationParams` and `XQAParams`. Propagates parameters from `generationsParams` to `xqaParams` and extends `toString()` for diagnostics. |
| FMHA Kernel Selection and Configuration | `cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h`, `cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaRunnerParams.h` | Adds spec-decoding conditional logic for kernel type selection (KeepsMmaAbForGeneration vs. SwapsMmaAbForGeneration), custom mask preparation pre-kernel call, max heads sizing (128 for spec-decoding), and hash-based kernel selection. Relaxes constness on mask pointers and adds `layer_idx`, `is_spec_dec_tree`, and `spec_decoding_generation_lengths` fields. |
| Custom Mask Preparation Pipeline | `cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/prepareCustomMask.h`, `cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/prepareCustomMask.cu` | New CUDA kernels and host launchers for mask buffer computation and offset prefix-sum (`prepareCustomMaskBuffersKernelForKeepsMmaAb`, `computeCustomMaskOffsetsKernel`). Orchestrates two-step mask preparation via `runPrepareCustomMask` with error handling and validation. |
| Dispatcher and Kernel Integration | `cpp/tensorrt_llm/kernels/xqaDispatcher.cpp`, `cpp/tensorrt_llm/thop/attentionOp.cpp` | Enables spec-decoding support by switching mask type from Dense to Custom when active, propagates `is_spec_dec_tree` flag, and populates new spec-decoding runner parameters (`generalPackedCustoMaskPtr`, custom mask offsets). Conditional SM100-family path accepts six-tensor spec-decoding contract. |
| PyTorch Attention Backend | `tensorrt_llm/_torch/attention_backend/interface.py`, `tensorrt_llm/_torch/attention_backend/trtllm.py`, `tensorrt_llm/_torch/pyexecutor/model_engine.py` | Adds `num_heads_per_kv` to `AttentionMetadata`, extends `TrtllmAttentionWrapper` and metadata with three new spec-decoding tensor fields, SM version–aware spec-decoding gating, and custom mask tile computation. Propagates parameters through plan/run paths for SM >= 100. |
| Test Infrastructure | `cpp/tests/unit_tests/kernels/CMakeLists.txt`, `cpp/tests/unit_tests/kernels/prepareCustomMaskTest.cpp`, `tests/unittest/_torch/modeling/test_modeling_llama.py` | Adds comprehensive unit test for custom mask preparation (CPU reference, GPU execution, stream management) with SM100-gated skip, and updates modeling tests with `num_heads_per_kv`, relaxed tolerances, `get_sm_version()` guards, and explicit metadata preparation calls. |

Sequence Diagram(s)

sequenceDiagram
    participant AttentionOp as Attention Op
    participant XQADispatcher as XQA Dispatcher
    participant FmhaKernels as FMHA Kernels
    participant PrepareCustomMask as Custom Mask Prep
    participant MmaKernel as MMA Kernel

    AttentionOp->>AttentionOp: layer_idx propagation
    AttentionOp->>XQADispatcher: enqueue with spec_decoding params
    
    alt isSpecDecoding && multi_query_tokens
        XQADispatcher->>XQADispatcher: set mask_type = Custom<br/>is_spec_dec_tree = true
        XQADispatcher->>FmhaKernels: configure runner params<br/>(layer_idx, is_spec_dec_tree,<br/>generalPackedCustoMaskPtr, etc.)
        FmhaKernels->>FmhaKernels: select KeepsMmaAbForGeneration<br/>(if mNumHeadsQPerKv ≤ 16)<br/>or SwapsMmaAbForGeneration
        FmhaKernels->>FmhaKernels: set maxNumHeadsQPerKvInCta = 128
        FmhaKernels->>PrepareCustomMask: runPrepareCustomMask()<br/>(layer_idx == 0)
        
        rect rgba(135, 206, 250, 0.3)
            note over PrepareCustomMask: Custom Mask Preparation<br/>(for spec-decoding trees)
            PrepareCustomMask->>PrepareCustomMask: launch computeCustomMaskOffsetsKernel<br/>(BlockScan prefix-sum offsets)
            PrepareCustomMask->>PrepareCustomMask: launch prepareCustomMaskBuffersKernel<br/>(compute mask buffers with atomics)
        end
        
        FmhaKernels->>MmaKernel: run with Custom mask<br/>keepsMmaAb config
    else
        XQADispatcher->>XQADispatcher: set mask_type = Dense
        FmhaKernels->>MmaKernel: run with Dense mask
    end
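
The diagram above describes `computeCustomMaskOffsetsKernel` as a BlockScan prefix-sum over offsets. As a rough CPU-side illustration of that idea only (not the kernel's actual code), the sketch below computes an exclusive prefix sum in Python. The function name `exclusive_prefix_sum` and the per-request sizes are hypothetical and chosen purely for illustration; the real per-request mask sizes are computed by the PR's kernels and are not reproduced here.

```python
from itertools import accumulate
from typing import List


def exclusive_prefix_sum(sizes: List[int]) -> List[int]:
    """Return the start offset of each entry when buffers are packed back to back.

    Hypothetical CPU reference: entry i starts where entries 0..i-1 end.
    """
    if not sizes:
        return []
    # accumulate() yields inclusive running totals; shift right by one
    # and start at 0 to turn them into exclusive (start) offsets.
    inclusive = list(accumulate(sizes))
    return [0] + inclusive[:-1]


# Illustrative sizes only (assumed values, not taken from the PR).
per_request_mask_sizes = [4, 8, 2]
offsets = exclusive_prefix_sum(per_request_mask_sizes)
total = sum(per_request_mask_sizes)
```

With the assumed sizes `[4, 8, 2]`, each request's region begins at `offsets[i]` and the packed buffer needs `total` elements, which is the kind of layout a prefix-sum offset pass typically produces.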

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

fmhaKernels.h: Conditional kernel type selection logic with multiple branches for spec-decoding vs. standard paths; requires understanding of tile sizing, head counts, and kernel dispatch strategy.
prepareCustomMask.cu: New CUDA kernels with atomic operations, BlockScan prefix-sum, and batch indexing; requires careful verification of grid/block configurations and memory access patterns.
trtllm.py: Extensive changes across multiple methods (plan, run, forward, spec-decoding prep logic) with SM version gating; multi-path state management increases cognitive load.
xqaDispatcher.cpp: Parameter propagation and conditional mask type switching; interacts with multiple layers and requires tracing data flow.
Coordination across layers: New parameters flow through attentionOp → xqaDispatcher → fmhaKernels → custom mask kernels; changes in one layer depend on corresponding updates in others, increasing overall verification complexity.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is incomplete and does not follow the required template. The Description and Test Coverage sections are empty, and the PR checklist boxes are unchecked (except a generic checkbox), leaving critical information about implementation details, tests, and verification steps undocumented. | Complete the Description section explaining the implementation approach, rationale, and changes. Populate the Test Coverage section with specific tests that validate the tree attention support. Check the PR Checklist items to confirm guideline compliance, test coverage, and documentation updates. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 32.61% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title '[TRTLLM-8778][feat] Add tree attention support for blackwell arch' clearly summarizes the main change: adding tree attention support for Blackwell architecture. It follows the prescribed format and is specific enough to understand the primary objective. |

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

tensorrt_llm/_torch/attention_backend/trtllm.py (1)
1397-1455: Set num_heads_per_kv before allocating BL custom mask buffers

TrtllmAttentionMetadata.num_heads_per_kv still equals its dataclass default (1) by the time spec_decoding_param_prepare_for_blackwell() runs. For multi-/group-query configs the actual heads-per-KV ratio is >1, so mask_size is undercomputed and the GPU kernel will write past the allocated mask, corrupting memory. Please populate metadata.num_heads_per_kv from the wrapper before you call metadata.update_spec_dec_param(...).
         self.wrapper.plan(
             layer_idx=self.get_local_layer_idx(metadata),
             tokens_per_block=metadata.tokens_per_block,
             max_num_requests=metadata.max_num_requests,
@@
             sparse_attn_indices=sparse_attn_indices,
             sparse_attn_offsets=sparse_attn_offsets,
             sparse_mla_topk=metadata.sparse_mla_topk if hasattr(
                 metadata, 'sparse_mla_topk') else 0)
+        if metadata.num_heads_per_kv is None:
+            metadata.num_heads_per_kv = self.wrapper.num_heads // max(
+                self.wrapper.num_kv_heads, 1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
912-950: Uninitialized num_heads_per_kv can crash metadata setup

num_heads_per_kv is only assigned inside the if self.model.model_config.pretrained_config.num_attention_heads is not None block, yet it is unconditionally passed into self.attn_backend.Metadata(...). For any config that omits num_attention_heads (or sets it to None), _set_up_attn_metadata now raises UnboundLocalError, breaking model initialization and inference. Please default num_heads_per_kv before the conditional (e.g., to 1) and only overwrite it when both head counts are available.
-        if self.model.model_config.pretrained_config.num_attention_heads is not None:
-            if self.model.model_config.pretrained_config.num_key_value_heads is not None:
-                num_heads_per_kv = self.model.model_config.pretrained_config.num_attention_heads // self.model.model_config.pretrained_config.num_key_value_heads
-            else:
-                num_heads_per_kv = 1
+        pretrained_cfg = self.model.model_config.pretrained_config
+        num_attention_heads = getattr(pretrained_cfg, "num_attention_heads", None)
+        num_key_value_heads = getattr(pretrained_cfg, "num_key_value_heads", None)
+        if num_attention_heads is not None and num_key_value_heads:
+            num_heads_per_kv = num_attention_heads // num_key_value_heads
+        else:
+            num_heads_per_kv = 1
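
The suggested fix can be exercised in isolation. The following is a minimal sketch that lifts the same logic into a standalone helper; the function name `resolve_num_heads_per_kv` and the use of `SimpleNamespace` as a stand-in for the pretrained config are assumptions for illustration and are not part of the PR.

```python
from types import SimpleNamespace


def resolve_num_heads_per_kv(pretrained_cfg) -> int:
    """Hypothetical helper mirroring the suggested diff.

    Falls back to 1 when either head count is missing, None, or zero,
    so the result is always bound before it is passed on.
    """
    num_attention_heads = getattr(pretrained_cfg, "num_attention_heads", None)
    num_key_value_heads = getattr(pretrained_cfg, "num_key_value_heads", None)
    # The truthiness check on num_key_value_heads also guards against
    # division by zero, not only against None.
    if num_attention_heads is not None and num_key_value_heads:
        return num_attention_heads // num_key_value_heads
    return 1


# Stand-in configs (assumed shapes, for illustration only).
gqa_cfg = SimpleNamespace(num_attention_heads=32, num_key_value_heads=8)
no_heads_cfg = SimpleNamespace()  # omits both attributes
gqa_ratio = resolve_num_heads_per_kv(gqa_cfg)
fallback_ratio = resolve_num_heads_per_kv(no_heads_cfg)
```

Using `getattr` with a default means a config that omits `num_attention_heads` entirely takes the fallback branch instead of raising, which is the failure mode the comment above describes.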

🧹 Nitpick comments (2)

cpp/tensorrt_llm/thop/attentionOp.cpp (1)

520-526: Drop the raw printf in hot path.

This printf will spam stdout on every speculative decode run and bypasses the project’s logging framework. Please remove it or switch to the existing logging macros before landing.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/prepareCustomMask.cu (1)

184-194: Check CUDA API return codes.

The new runtime allocates and zeroes device memory with raw CUDA calls, but none of the return values are validated. If cudaMallocAsync or cudaMemsetAsync fails under pressure, we will consume an invalid pointer and crash inside the subsequent kernels. Please wrap these calls with TLLM_CUDA_CHECK (or equivalent) so failures are caught immediately and surfaced upstream.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d246f62 and 54854e1.

📒 Files selected for processing (15)

cpp/tensorrt_llm/common/attentionOp.cpp (4 hunks)
cpp/tensorrt_llm/common/attentionOp.h (1 hunks)
cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/xqaParams.h (3 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h (4 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaRunnerParams.h (3 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/prepareCustomMask.cu (1 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/prepareCustomMask.h (1 hunks)
cpp/tensorrt_llm/kernels/xqaDispatcher.cpp (3 hunks)
cpp/tensorrt_llm/thop/attentionOp.cpp (2 hunks)
cpp/tests/unit_tests/kernels/CMakeLists.txt (1 hunks)
cpp/tests/unit_tests/kernels/prepareCustomMaskTest.cpp (1 hunks)
tensorrt_llm/_torch/attention_backend/interface.py (1 hunks)
tensorrt_llm/_torch/attention_backend/trtllm.py (9 hunks)
tensorrt_llm/_torch/pyexecutor/model_engine.py (3 hunks)
tests/unittest/_torch/modeling/test_modeling_llama.py (10 hunks)

🧰 Additional context used

📓 Path-based instructions (8)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}