Add nonpad_kv_seqlen input to Attention #7164
Conversation
Codecov Report

```diff
@@           Coverage Diff           @@
##             main    #7164   +/-  ##
=======================================
  Coverage   53.76%   53.76%
=======================================
  Files         512      512
  Lines       32180    32202    +22
  Branches     2942     2945     +3
=======================================
+ Hits        17300    17312    +12
- Misses      14110    14120    +10
  Partials      770      770
```

☔ View full report in Codecov by Sentry.
```cpp
ONNX_ASSERTM(
    false,
    "%s being converted from %d to %d has nonpad_kv_seqlen input, "
    "which is not supported in opset 23. This conversion cannot be performed.",
    name().c_str(),
    initial_version().version(),
    target_version().version());
```
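For context, a hedged sketch of how this assertion surfaces in practice; the input position of `nonpad_kv_seqlen` and the minimal graph below are illustrative assumptions, not taken from the PR:

```python
from onnx import TensorProto, helper, version_converter

# Minimal opset-24 model: the Attention node supplies nonpad_kv_seqlen,
# assumed here to be the seventh input in the v24 schema (illustrative).
node = helper.make_node(
    "Attention",
    ["Q", "K", "V", "", "", "", "nonpad_kv_seqlen"],
    ["Y"],
)
graph = helper.make_graph(
    [node],
    "attn_downgrade_demo",
    [
        helper.make_tensor_value_info("Q", TensorProto.FLOAT, [2, 3, 4, 8]),
        helper.make_tensor_value_info("K", TensorProto.FLOAT, [2, 3, 6, 8]),
        helper.make_tensor_value_info("V", TensorProto.FLOAT, [2, 3, 6, 8]),
        helper.make_tensor_value_info("nonpad_kv_seqlen", TensorProto.INT64, [2]),
    ],
    [helper.make_tensor_value_info("Y", TensorProto.FLOAT, [2, 3, 4, 8])],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 24)])

# The Attention_24_23 adapter refuses the downgrade, surfacing the
# ONNX_ASSERTM message above as a Python exception.
try:
    version_converter.convert_version(model, 23)
except Exception as e:
    print(e)
```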
Check notice (Code scanning / CodeQL): Too many arguments to formatting function.
LGTM, thanks ... just a couple of minor comments left about documentation of `attn_sequence_length`
Pull Request Overview
This PR adds a new nonpad_kv_seqlen input to the Attention operator in version 24 to support optimized KV cache management. This enhancement accompanies the TensorScatter-24 operator for managing in-place KV cache updates.
Key changes include:
- Addition of `nonpad_kv_seqlen` input to indicate valid (non-padded) tokens in the K and V inputs
- Support for shorter `attn_mask` dimensions that get padded with -inf
- Compatibility between `attn_mask` and `is_causal` attributes
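To make those semantics concrete, here is a minimal numpy sketch (illustrative shapes and names, not the actual reference implementation) of padding a shorter `attn_mask` with -inf and then masking tokens beyond `nonpad_kv_seqlen`:

```python
import numpy as np

batch, kv_len, mask_len = 2, 6, 4
attn_mask = np.random.rand(batch, 1, 1, mask_len).astype(np.float32)
nonpad_kv_seqlen = np.array([3, 4], dtype=np.int64)

# The missing tail of the shorter mask is treated as -inf.
pad = np.full((batch, 1, 1, kv_len - mask_len), -np.inf, dtype=np.float32)
full_mask = np.concatenate([attn_mask, pad], axis=-1)

# Positions at or beyond each batch's valid length are also masked out.
valid = np.arange(kv_len)[None, :] < nonpad_kv_seqlen[:, None]  # [batch, kv_len]
full_mask = np.where(valid[:, None, None, :], full_mask, np.float32(-np.inf))
print(full_mask[0, 0, 0])  # entries past position 3 are -inf for batch 0
```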
Reviewed Changes
Copilot reviewed 10 out of 146 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| onnx/version_converter/convert.h | Registers adapters for converting between Attention v24 and v23 |
| onnx/version_converter/adapters/Attention_24_23.h | Implements downgrade adapter that prevents conversion when nonpad_kv_seqlen is present |
| onnx/reference/ops/op_attention.py | Updates reference implementation to handle nonpad_kv_seqlen and shorter attn_mask |
| onnx/defs/operator_sets.h | Adds Attention-24 to the operator set schema declarations |
| onnx/defs/nn/old.cc | Moves Attention-23 schema to old.cc for version history |
| onnx/defs/nn/defs.cc | Implements Attention-24 with updated documentation and function builder |
| onnx/backend/test/case/node/attention.py | Adds test case for the new nonpad_kv_seqlen functionality |
| docs/TestCoverage.md | Updates test coverage documentation |
| docs/Operators.md | Updates operator documentation for Attention-24 |
| docs/Changelog.md | Adds changelog entry for Attention-24 |
Comments suppressed due to low confidence (2)
onnx/backend/test/case/node/attention.py:1859
- The test uses a fixed `nonpad_kv_seqlen` array with values [3, 4], but the K and V tensors have sequence length 6. Consider adding test cases that cover edge cases like when `nonpad_kv_seqlen` equals the full sequence length, or when it's 0 or 1.

```python
nonpad_kv_seqlen = np.array([3, 4], dtype=np.int64)
```

onnx/backend/test/case/node/attention.py:1858
- The test creates an attention mask with kv_sequence_length=4, but K and V have sequence length 6. This tests the padding functionality, but consider adding a test comment explaining this intentional dimension mismatch to clarify the test's purpose.

```python
attn_mask = np.random.rand(2, 3, 4, 4).astype(np.float32)
```
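Building on those suggestions, a minimal numpy sketch of what expected outputs for such edge cases could look like; the shapes and the plain softmax-attention math below are assumptions for illustration, not the actual test file:

```python
import numpy as np

batch, heads, q_len, kv_len, head_dim = 2, 3, 4, 6, 8
rng = np.random.default_rng(0)
Q = rng.random((batch, heads, q_len, head_dim), dtype=np.float32)
K = rng.random((batch, heads, kv_len, head_dim), dtype=np.float32)
V = rng.random((batch, heads, kv_len, head_dim), dtype=np.float32)

def expected_attention(nonpad_kv_seqlen: np.ndarray) -> np.ndarray:
    """Plain scaled-dot-product attention with per-batch KV padding masked out."""
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
    valid = np.arange(kv_len)[None, :] < nonpad_kv_seqlen[:, None]  # [batch, kv_len]
    scores = np.where(valid[:, None, None, :], scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Edge cases suggested above; nonpad_kv_seqlen == 0 would make a whole score
# row -inf, so its expected behavior needs to be pinned down separately.
for nonpad in ([kv_len, kv_len], [1, kv_len]):
    print(expected_attention(np.array(nonpad, dtype=np.int64)).shape)
```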
```cpp
}
builder
    .Add("KVSeqLenExpanded = Unsqueeze(nonpad_kv_seqlen, One1D)")  // [batch_size, 1]
    .Add("Range = Range(Zero1D, KVSeqLen, One1D)")  // [KVSeqLen,]
```
The inputs should be Scalar: https://github.com/onnx/onnx/blob/main/docs/Operators.md#Range
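To illustrate that requirement, a minimal sketch run through the ONNX reference evaluator; the graph and tensor names are illustrative:

```python
import numpy as np
from onnx import TensorProto, helper
from onnx.reference import ReferenceEvaluator

# Range expects 0-D (scalar) start/limit/delta, not 1-element vectors.
node = helper.make_node("Range", ["start", "limit", "delta"], ["out"])
graph = helper.make_graph(
    [node],
    "range_demo",
    [
        helper.make_tensor_value_info("start", TensorProto.INT64, []),
        helper.make_tensor_value_info("limit", TensorProto.INT64, []),
        helper.make_tensor_value_info("delta", TensorProto.INT64, []),
    ],
    [helper.make_tensor_value_info("out", TensorProto.INT64, [None])],
)
sess = ReferenceEvaluator(helper.make_model(graph))
print(sess.run(None, {
    "start": np.array(0, dtype=np.int64),
    "limit": np.array(6, dtype=np.int64),
    "delta": np.array(1, dtype=np.int64),
}))  # [array([0, 1, 2, 3, 4, 5])]
```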
It's caught by ORT, but the ONNX checker does not complain about this for some reason...
Just like RMSNorm: https://github.com/onnx/onnx/pull/7135/files (reference fix)
@titaiwangms how about this?
.Const("Zero0D", (int64_t)(0))
.Const("One0D", (int64_t)(1))
.Add("KVSeqLen0D = Unsqueeze(KVSeqLen, Zero1D)")
.Add("Range = Range(Zero0D, KVSeqLen0D, One0D)")
Thanks for catching this. We missed it.
PR to fix: #7240
### Description

In the Attention op definition, update the inputs to Range to be scalars as opposed to 1-element vectors, as required by the Range op spec.

### Motivation and Context

See discussion [here](#7164 (comment)).
Description

To accompany the [TensorScatter-24](#7114) op for managing in-place KV cache update, this PR makes the following changes to the Attention op:

- Add `nonpad_kv_seqlen` to indicate the number of valid (non-padded) tokens in the K and V inputs when the K and V inputs are the entire cache tensors (where the number of valid tokens can potentially make up only a small proportion of the cache tensors). The `nonpad_kv_seqlen` input provides optimization opportunities for backends to skip the unnecessary computation on the padding tokens.
- Allow the kv_seqlen dimension (the -1 dimension) of the `attn_mask` input to be shorter than K and V. The missing portion will be assumed to be -inf. The length should still be larger than the max value in `nonpad_kv_seqlen`.
- Allow `attn_mask` and `is_causal` to be present at the same time. This would allow for easier export of HF models later.

Motivation and Context