Add nonpad_kv_seqlen input to Attention #7164
Conversation
Codecov Report

```diff
@@           Coverage Diff           @@
##             main    #7164   +/-  ##
=======================================
  Coverage   53.76%   53.76%
=======================================
  Files         512      512
  Lines       32180    32202    +22
  Branches     2942     2945     +3
=======================================
+ Hits        17300    17312    +12
- Misses      14110    14120    +10
  Partials      770      770
```

☔ View full report in Codecov by Sentry.
```cpp
ONNX_ASSERTM(
    false,
    "%s being converted from %d to %d has nonpad_kv_seqlen input, "
    "which is not supported in opset 23. This conversion cannot be performed.",
    name().c_str(),
    initial_version().version(),
    target_version().version());
```
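For context, a hedged sketch of how this assertion surfaces in practice; the input position of `nonpad_kv_seqlen` and the minimal graph below are illustrative assumptions, not taken from the PR:

```python
from onnx import TensorProto, helper, version_converter

# Minimal opset-24 model: the Attention node supplies nonpad_kv_seqlen,
# assumed here to be the seventh input in the v24 schema (illustrative).
node = helper.make_node(
    "Attention",
    ["Q", "K", "V", "", "", "", "nonpad_kv_seqlen"],
    ["Y"],
)
graph = helper.make_graph(
    [node],
    "attn_downgrade_demo",
    [
        helper.make_tensor_value_info("Q", TensorProto.FLOAT, [2, 3, 4, 8]),
        helper.make_tensor_value_info("K", TensorProto.FLOAT, [2, 3, 6, 8]),
        helper.make_tensor_value_info("V", TensorProto.FLOAT, [2, 3, 6, 8]),
        helper.make_tensor_value_info("nonpad_kv_seqlen", TensorProto.INT64, [2]),
    ],
    [helper.make_tensor_value_info("Y", TensorProto.FLOAT, [2, 3, 4, 8])],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 24)])

# The Attention_24_23 adapter refuses the downgrade, surfacing the
# ONNX_ASSERTM message above as a Python exception.
try:
    version_converter.convert_version(model, 23)
except Exception as e:
    print(e)
```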
Check notice (Code scanning / CodeQL): Too many arguments to formatting function.
LGTM, thanks ... just a couple of minor comments left about documentation of `attn_sequence_length`
Pull Request Overview
This PR adds a new nonpad_kv_seqlen input to the Attention operator in version 24 to support optimized KV cache management. This enhancement accompanies the TensorScatter-24 operator for managing in-place KV cache updates.
Key changes include:
- Addition of `nonpad_kv_seqlen` input to indicate valid (non-padded) tokens in the K and V inputs
- Support for shorter `attn_mask` dimensions that get padded with -inf
- Compatibility between `attn_mask` and `is_causal` attributes
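To make those semantics concrete, here is a minimal numpy sketch (illustrative shapes and names, not the actual reference implementation) of padding a shorter `attn_mask` with -inf and then masking tokens beyond `nonpad_kv_seqlen`:

```python
import numpy as np

batch, kv_len, mask_len = 2, 6, 4
attn_mask = np.random.rand(batch, 1, 1, mask_len).astype(np.float32)
nonpad_kv_seqlen = np.array([3, 4], dtype=np.int64)

# The missing tail of the shorter mask is treated as -inf.
pad = np.full((batch, 1, 1, kv_len - mask_len), -np.inf, dtype=np.float32)
full_mask = np.concatenate([attn_mask, pad], axis=-1)

# Positions at or beyond each batch's valid length are also masked out.
valid = np.arange(kv_len)[None, :] < nonpad_kv_seqlen[:, None]  # [batch, kv_len]
full_mask = np.where(valid[:, None, None, :], full_mask, np.float32(-np.inf))
print(full_mask[0, 0, 0])  # entries past position 3 are -inf for batch 0
```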
Reviewed Changes
Copilot reviewed 10 out of 146 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| onnx/version_converter/convert.h | Registers adapters for converting between Attention v24 and v23 |
| onnx/version_converter/adapters/Attention_24_23.h | Implements downgrade adapter that prevents conversion when nonpad_kv_seqlen is present |
| onnx/reference/ops/op_attention.py | Updates reference implementation to handle nonpad_kv_seqlen and shorter attn_mask |
| onnx/defs/operator_sets.h | Adds Attention-24 to the operator set schema declarations |
| onnx/defs/nn/old.cc | Moves Attention-23 schema to old.cc for version history |
| onnx/defs/nn/defs.cc | Implements Attention-24 with updated documentation and function builder |
| onnx/backend/test/case/node/attention.py | Adds test case for the new nonpad_kv_seqlen functionality |
| docs/TestCoverage.md | Updates test coverage documentation |
| docs/Operators.md | Updates operator documentation for Attention-24 |
| docs/Changelog.md | Adds changelog entry for Attention-24 |
Comments suppressed due to low confidence (2)
onnx/backend/test/case/node/attention.py:1859
- The test uses a fixed `nonpad_kv_seqlen` array with values [3, 4], but the K and V tensors have sequence length 6. Consider adding test cases that cover edge cases like when `nonpad_kv_seqlen` equals the full sequence length, or when it's 0 or 1.

```python
nonpad_kv_seqlen = np.array([3, 4], dtype=np.int64)
```

onnx/backend/test/case/node/attention.py:1858
- The test creates an attention mask with kv_sequence_length=4, but K and V have sequence length 6. This tests the padding functionality, but consider adding a test comment explaining this intentional dimension mismatch to clarify the test's purpose.

```python
attn_mask = np.random.rand(2, 3, 4, 4).astype(np.float32)
```
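Building on those suggestions, a minimal numpy sketch of what expected outputs for such edge cases could look like; the shapes and the plain softmax-attention math below are assumptions for illustration, not the actual test file:

```python
import numpy as np

batch, heads, q_len, kv_len, head_dim = 2, 3, 4, 6, 8
rng = np.random.default_rng(0)
Q = rng.random((batch, heads, q_len, head_dim), dtype=np.float32)
K = rng.random((batch, heads, kv_len, head_dim), dtype=np.float32)
V = rng.random((batch, heads, kv_len, head_dim), dtype=np.float32)

def expected_attention(nonpad_kv_seqlen: np.ndarray) -> np.ndarray:
    """Plain scaled-dot-product attention with per-batch KV padding masked out."""
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
    valid = np.arange(kv_len)[None, :] < nonpad_kv_seqlen[:, None]  # [batch, kv_len]
    scores = np.where(valid[:, None, None, :], scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Edge cases suggested above; nonpad_kv_seqlen == 0 would make a whole score
# row -inf, so its expected behavior needs to be pinned down separately.
for nonpad in ([kv_len, kv_len], [1, kv_len]):
    print(expected_attention(np.array(nonpad, dtype=np.int64)).shape)
```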
```cpp
}
builder
    .Add("KVSeqLenExpanded = Unsqueeze(nonpad_kv_seqlen, One1D)")  // [batch_size, 1]
    .Add("Range = Range(Zero1D, KVSeqLen, One1D)")  // [KVSeqLen,]
```
The inputs should be Scalar: https://github.com/onnx/onnx/blob/main/docs/Operators.md#Range
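To illustrate that requirement, a minimal sketch run through the ONNX reference evaluator; the graph and tensor names are illustrative:

```python
import numpy as np
from onnx import TensorProto, helper
from onnx.reference import ReferenceEvaluator

# Range expects 0-D (scalar) start/limit/delta, not 1-element vectors.
node = helper.make_node("Range", ["start", "limit", "delta"], ["out"])
graph = helper.make_graph(
    [node],
    "range_demo",
    [
        helper.make_tensor_value_info("start", TensorProto.INT64, []),
        helper.make_tensor_value_info("limit", TensorProto.INT64, []),
        helper.make_tensor_value_info("delta", TensorProto.INT64, []),
    ],
    [helper.make_tensor_value_info("out", TensorProto.INT64, [None])],
)
sess = ReferenceEvaluator(helper.make_model(graph))
print(sess.run(None, {
    "start": np.array(0, dtype=np.int64),
    "limit": np.array(6, dtype=np.int64),
    "delta": np.array(1, dtype=np.int64),
}))  # [array([0, 1, 2, 3, 4, 5])]
```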
It's caught by ORT, but the ONNX checker does not complain about this for some reason...
Just like RMSNorm: https://github.com/onnx/onnx/pull/7135/files (reference fix)
@titaiwangms how about this?
.Const("Zero0D", (int64_t)(0))
.Const("One0D", (int64_t)(1))
.Add("KVSeqLen0D = Unsqueeze(KVSeqLen, Zero1D)")
.Add("Range = Range(Zero0D, KVSeqLen0D, One0D)")
Thanks for catching this. We missed it.
PR to fix: #7240
### Description

In the Attention op definition, update the inputs to Range to be scalars as opposed to 1-element vectors, as required by the Range op spec.

### Motivation and Context

See discussion [here](#7164 (comment)).
Description

To accompany the [TensorScatter-24](#7114) op for managing in-place KV cache update, this PR makes the following changes to the Attention op:

- Add `nonpad_kv_seqlen` to indicate the number of valid (non-padded) tokens in the K and V inputs when the K and V inputs are the entire cache tensors (where the number of valid tokens can potentially make up only a small proportion of the cache tensors). The `nonpad_kv_seqlen` input provides optimization opportunities for backends to skip the unnecessary computation on the padding tokens.
- Allow the kv_seqlen dimension (the -1 dimension) of the `attn_mask` input to be shorter than K and V. The missing portion will be assumed to be -inf. The length should still be larger than the max value in `nonpad_kv_seqlen`.
- Allow `attn_mask` and `is_causal` to be present at the same time. This would allow for easier export of HF models later.

Motivation and Context