[TorchInductor] Add ALiBi (Attention with Linear Biases) Fused Attention Pattern #144338
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144338
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit c5fd035 with merge base 96176e3.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
Can you say more about this note from the description?

> If you get error: torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: RuntimeError: Duplicate pattern: expand_default = CallFunction(aten.expand.default, KeywordArg('query'), Ignored())
> -> run export PYTORCH_GEN_PATTERNS=1 in the terminal to generate the attention pattern.

We want to serialize the pattern ahead of time, as with the rest of the attention fusions, because the additional compilation time of generating it at runtime is not insignificant. That is what the PYTORCH_GEN_PATTERNS=1 flow is for.

Can you serialize this? See: torchgen/fuse/gen_patterns.py.
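For anyone picking this up later, here is a minimal sketch of the ahead-of-time flow being requested, assuming the command is run from the repo root. Only the env var name (`PYTORCH_GEN_PATTERNS`) and the script path (`torchgen/fuse/gen_patterns.py`) come from this thread; everything else is illustrative, not the exact workflow in the PR.

```python
# Sketch only: regenerate the serialized attention patterns ahead of time,
# per the review comment above, instead of relying on the runtime
# PYTORCH_GEN_PATTERNS workaround mentioned in the PR description.
import os
import subprocess

# With PYTORCH_GEN_PATTERNS=1 set, pattern registration re-traces and
# serializes the patterns rather than loading the checked-in copies.
env = dict(os.environ, PYTORCH_GEN_PATTERNS="1")
subprocess.run(
    ["python", "torchgen/fuse/gen_patterns.py"],  # path from the comment above
    env=env,
    check=True,
)
```

The regenerated serialized-pattern files would then be checked in alongside the new `_sfdp_pattern_alibi`, so no runtime pattern generation is needed.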
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Summary

This PR adds support for ALiBi (Attention with Linear Biases) in TorchInductor's fused-attention pass. ALiBi applies a position-based bias to attention scores, improving extrapolation for language modeling tasks. With this addition, ALiBi-based attention can leverage PyTorch's optimized `_scaled_dot_product_attention` kernel.

Changes

- `_sfdp_pattern_alibi(...)`: Recognizes [Q @ Kᵀ / √d + alibi_bias] → softmax → dropout → matmul(V).
- `_sfdp_replacement_alibi(...)`: Fuses the pattern into `_scaled_dot_product_attention` using `attn_mask=alibi_bias`.
- New test `_test_sdpa_rewriter_alibi` in `TestSDPAPatternRewriterTemplate`.

Notes

If you get the error `torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: RuntimeError: Duplicate pattern: expand_default = CallFunction(aten.expand.default, KeywordArg('query'), Ignored())`, run `export PYTORCH_GEN_PATTERNS=1` in the terminal to generate the attention pattern.
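To make the Changes list concrete, here is a hedged sketch of what a pattern/replacement pair of this shape could look like, modeled on the existing `_sfdp_pattern_*` helpers. The argument names (`alibi_bias`, `inv_scale`, `dropout_p`) and the use of the public `F.scaled_dot_product_attention` (with its `scale` kwarg) are illustrative assumptions, not the exact code in this PR.

```python
import torch
import torch.nn.functional as F

# Sketch of the traced pattern the matcher should recognize:
# (Q @ K^T / sqrt(d) + alibi_bias) -> softmax -> dropout -> @ V.
# Here inv_scale plays the role of sqrt(d) in the description above.
def _sfdp_pattern_alibi(query, key, value, alibi_bias, inv_scale, dropout_p):
    scores = torch.matmul(query, key.transpose(-2, -1)) / inv_scale
    scores = scores + alibi_bias              # additive linear position bias
    weights = scores.softmax(dim=-1)
    weights = F.dropout(weights, p=dropout_p)
    return torch.matmul(weights, value)


# Sketch of the fused replacement: the ALiBi bias rides along as attn_mask,
# which scaled_dot_product_attention adds to the scores before softmax.
def _sfdp_replacement_alibi(query, key, value, alibi_bias, inv_scale, dropout_p):
    return F.scaled_dot_product_attention(
        query,
        key,
        value,
        attn_mask=alibi_bias,
        dropout_p=dropout_p,
        is_causal=False,
        scale=1.0 / inv_scale,
    )
```

The ALiBi bias itself (per-head slope times relative key position, broadcast to [batch, heads, q_len, k_len]) is computed outside the pattern and simply flows into the fused kernel as a regular tensor input.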
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov