Anri-Lombard · 2026-01-04T04:08:03Z

Summary

Document that MLX's mask="causal" uses lower-right alignment
Clarify the difference from PyTorch's default is_causal=True (upper-left)

When T_q != T_kv, this distinction matters:

MLX (lower-right): Last query aligns with last key
PyTorch default (upper-left): First query aligns with first key

References:

Relates to #2835

Clarify that MLX uses lower-right alignment for causal masks when T_q != T_kv, which differs from PyTorch's default upper-left alignment. Relates to ml-explore#2835

zcbenz

I don't think PyTorch has a causal_lower_right option for SDPA and the description is not really right.

Anri-Lombard · 2026-01-18T13:05:42Z

Hey @zcbenz, it does have causal_lower_right since 2.3 and can be used with SDPA via the attn_mask parameter. I ran a script with:

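import torch.nn.functional as F
from torch.nn.attention.bias import causal_lower_right

# q has shape (B, H, T_q, D); k and v have shape (B, H, T_kv, D)
bias = causal_lower_right(T_q, T_kv)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=bias)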

to verify.

Here is the tutorial that documents this explicitly: https://docs.pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html.

I also verified masks are mathematically identical. For example with T_q=2, T_kv=4:

  MLX's mask (using q_off = max(0, kL - qL)):
        k0  k1  k2  k3
  q0 [  1   1   1   0  ]
  q1 [  1   1   1   1  ]

  PyTorch's causal_lower_right(2, 4):
        k0  k1  k2  k3
  q0 [  1   1   1   0  ]
  q1 [  1   1   1   1  ]

  PyTorch's is_causal=True (upper_left):
        k0  k1  k2  k3
  q0 [  1   0   0   0  ]
  q1 [  1   1   0   0  ]

The first two are identical; the third is different. This is also consistent with MLX's CUDA backend which uses cuDNN's set_causal_mask_bottom_right.
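The three masks above can be reproduced with a few lines of plain Python. This is only an illustrative sketch: `causal_mask` and its `alignment` argument are hypothetical names, not part of MLX or PyTorch, and the offset here is the unclamped `t_kv - t_q`, which coincides with the clamped form whenever T_q <= T_kv.

```python
def causal_mask(t_q, t_kv, alignment="lower_right"):
    """Return a t_q x t_kv list of 0/1 rows; 1 means the query may attend."""
    if alignment == "lower_right":
        offset = t_kv - t_q  # last query lines up with last key
    elif alignment == "upper_left":
        offset = 0  # first query lines up with first key
    else:
        raise ValueError(f"unknown alignment: {alignment}")
    # Query q may attend to key k when k <= q + offset.
    return [[1 if k <= q + offset else 0 for k in range(t_kv)] for q in range(t_q)]


lower_right = causal_mask(2, 4, "lower_right")
upper_left = causal_mask(2, 4, "upper_left")
for name, mask in (("lower_right", lower_right), ("upper_left", upper_left)):
    print(name)
    for row in mask:
        print("  ", row)
```

For T_q=2, T_kv=4 the `lower_right` rows match the MLX and `causal_lower_right` tables, and the `upper_left` rows match the `is_causal=True` table.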

Is there something specific about the description you think is incorrect? If your concern is that causal_lower_right isn't a direct SDPA parameter (like is_causal=True) but rather a separate utility class, I could clarify the wording to use the full module path torch.nn.attention.bias.causal_lower_right.

zcbenz · 2026-01-18T23:07:59Z

Thanks for linking the docs, this is new to me. On the behavior, it actually depends on whether T_q is larger or smaller than T_kv:

mlx/mlx/backend/cuda/scaled_dot_product_attention.cpp

Lines 204 to 208 in 9052f67

    
    if (q.shape(2) > k.shape(2)) {
      options.set_causal_mask(do_causal);
    } else {
      options.set_causal_mask_bottom_right(do_causal);
    }

The mask uses lower-right alignment when T_q <= T_kv and upper-left when T_q > T_kv.
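One way to see why the alignment flips is to compare the clamped offset quoted earlier in the thread (`q_off = max(0, kL - qL)`) with an unclamped one. The sketch below is an assumption-laden illustration in plain Python, not MLX code; `causal_rows` and `clamp` are hypothetical names. It reports, for each query row, how many leading keys that row may attend to.

```python
def causal_rows(q_len, k_len, clamp):
    """For each query row, count how many leading keys it may attend to."""
    offset = k_len - q_len
    if clamp:
        offset = max(0, offset)  # the q_off = max(0, kL - qL) form
    # Query i sees keys 0..i+offset, limited to the keys that actually exist.
    return [min(max(i + offset + 1, 0), k_len) for i in range(q_len)]


short_q_clamped = causal_rows(2, 4, clamp=True)     # T_q <= T_kv
short_q_unclamped = causal_rows(2, 4, clamp=False)  # same result: offset >= 0
long_q_clamped = causal_rows(4, 2, clamp=True)      # T_q > T_kv: upper-left
long_q_unclamped = causal_rows(4, 2, clamp=False)   # lower-right: early rows see nothing

print(short_q_clamped, short_q_unclamped)
print(long_q_clamped, long_q_unclamped)
```

When T_q <= T_kv the clamp has no effect. When T_q > T_kv the clamped offset becomes 0, which is upper-left alignment, while the unclamped offset leaves the first T_q - T_kv query rows with no visible keys.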

Anri-Lombard · 2026-01-19T18:19:03Z

Thanks! Fixed to describe the conditional alignment behavior 🙏

zcbenz

Looks good to me. /cc @awni for a second look.

awni · 2026-01-21T00:22:04Z

The comment definitely makes sense. But I also find it a bit strange that we switch from lower right to upper left depending on if query is longer or shorter than the keys. It's quite rare for the query to be longer than the keys which is why we never really looked at it carefully.

I'm wondering if we should change the behavior in that case rather than documenting something that is a bit unusual? Or maybe it's a good idea to keep it this way?

zcbenz · 2026-01-21T00:39:23Z

I agree the current behavior is unusual, and using lower right in all cases should be a better choice.

awni · 2026-01-21T14:26:44Z

@Anri-Lombard what do you think about changing the behavior to always be lower right even when QL > KL? Do you want to send a patch to this PR / send a new one instead of this?

Anri-Lombard · 2026-01-22T17:43:58Z

Hey @awni, always lower-right makes sense. The change is minimal (unless I'm missing something) - just two cuDNN locations (forward/backward) and the CPU fallback offset calculation. I'll update this PR to make the behavior change instead of just documenting it 👍

awni · 2026-01-22T17:47:05Z

Yes the change should be pretty straight-forward. We may also need to update the mask index calculation in the Metal kernels. If you add a test for this case as well (qL > kL) that would be great. I can help with the metal kernels if needed.

- cuDNN: Always use set_causal_mask_bottom_right() instead of conditionally selecting based on qL vs kL. This aligns with FlashAttention/PyTorch behavior.
- Steel kernels: Add NaN protection for sum_score == 0 edge case when all keys are masked.

Enable scaled_dot_product_attention to handle cases where query sequence is longer than key sequence with causal mask. When qL > kL, early queries have no keys to attend to and output zeros.

Changes:
- Remove Metal routing guard that blocked qL > kL for causal mask
- Fix CPU fallback to use proper lower-right alignment (not clamped)
- Zero out output rows where queries have no keys to attend (row_pos < 0)
- Update test references to handle all-masked rows correctly

Anri-Lombard · 2026-01-23T04:48:04Z

@awni and @zcbenz updated and took a stab at the Metal kernels as well - feel free to push changes directly or point out where I deviated if you don't mind the extra time so I can learn the convention preferences more 🙏

For qL > kL, early queries have no keys to attend. Softmax of all-masked values gives uniform weights (exp(finite_min - finite_min) = 1), not zeros. Following PyTorch's pytorch/pytorch#108108 convention, we explicitly zero these rows... I think this is the only "big" change.
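The uniform-weights point can be checked with a small numerically stable softmax in plain Python. This is a sketch under assumptions: `FINITE_MIN` stands in for whatever finite fill value the mask uses (here float16's most negative finite value), and `softmax` is a hypothetical helper, not the MLX kernel.

```python
import math

FINITE_MIN = -65504.0  # assumption: float16's most negative finite value


def softmax(scores):
    """Numerically stable softmax: subtract the row max before exponentiating."""
    row_max = max(scores)
    exps = [math.exp(s - row_max) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Every key is masked, so every score is the same finite fill value.
fully_masked = softmax([FINITE_MIN] * 4)
# Only the first key is visible; the masked scores underflow to zero weight.
partly_masked = softmax([1.0, FINITE_MIN, FINITE_MIN, FINITE_MIN])

print(fully_masked)
print(partly_masked)
```

In the fully masked row each term is exp(0) = 1, so the weights come out uniform and the output is an average of the values rather than zero.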

Anri-Lombard · 2026-01-23T04:53:36Z

@awni you mentioned the tests, the existing test shapes (127, 65, ...) with mask="causal" cover the qL > kL case. Would you prefer an explicit test that verifies early queries output zeros? 🙏

awni · 2026-01-23T14:26:18Z

Nope if it's already tested that is fine!

awni · 2026-01-23T14:54:31Z

I don't think we should ensure 0s in the qL > kL case. What to do when every key position for a given query is masked is a problem I've looked at in the past, and right now the behavior is not consistent. For now let's leave it as undefined behavior and then look into a more principled fix if necessary. I would also rather not reduce performance overall to handle an edge case we don't really care much about.

Per review feedback, leave qL > kL with causal mask as undefined behavior rather than ensuring zeros. This avoids performance overhead for an edge case. Tests skip this undefined case.

Anri-Lombard · 2026-01-23T16:34:12Z

@awni done - removed the zero-row handling. The qL > kL + causal case is now undefined behavior as suggested. Tests skip that case.

mlx/fast.cpp

awni · 2026-01-26T15:13:17Z

python/tests/test_fast_sdpa.py

+                        # Skip causal tests when qL > kL (undefined behavior)
+                        if mask_str == "causal" and qL > kL:
+                            continue


Rather than skipping this test, could you add a little step after the computation which checks that the parts which should match do match?

So basically slice off the initial qL-kL from the result if it's greater than 0 and then compare.
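A minimal sketch of that check, assuming the result and reference are laid out as one row per query position: the first `max(0, qL - kL)` rows are treated as undefined and excluded before comparing. The helper name and the sample data are hypothetical and not taken from the actual test file.

```python
def max_abs_diff_defined_rows(out, ref, q_len, k_len):
    """Compare per-query rows, ignoring the leading rows with no visible keys."""
    skip = max(0, q_len - k_len)  # rows 0..skip-1 are the undefined region
    diffs = [
        abs(a - b)
        for out_row, ref_row in zip(out[skip:], ref[skip:])
        for a, b in zip(out_row, ref_row)
    ]
    return max(diffs, default=0.0)


# Hypothetical data: q_len=3, k_len=2, so only the first row is undefined.
ref = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
out = [[9.9, -7.0], [1.0, 2.0], [3.0, 4.0]]  # arbitrary values in the undefined row

diff = max_abs_diff_defined_rows(out, ref, q_len=3, k_len=2)
print(diff)
```

When qL <= kL nothing is skipped, so the same helper covers the ordinary case.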

I'm struggling to implement this: I'm hitting a bug where the fast path outputs garbage because the fallback is not applied when it should be. I'll need some time to figure this out 🙏

You could revert this change for now which would likely fix it.

const bool sdpa_full_supported_mask = !has_mask || has_arr_mask || (query_sequence_length <= key_sequence_length && do_causal); // <- add that back

So it won't dispatch to the fused implementation for that case, but that's OK for now. And if you want to fix it in a follow-up that would be great!

Thanks! All tests now pass locally, and the tests slice off the undefined rows instead of skipping 🙏 Happy to do a follow-up PR for a proper fix when qL > kL 👍

Great, I will merge it when the CI tests clear.

Co-authored-by: Awni Hannun <[email protected]>

Per awni's feedback, revert the Metal backend condition to require query_sequence_length <= key_sequence_length for causal mask. This prevents dispatching to the fused kernel for the qL > kL case. The test now slices off the first qL-kL rows (undefined behavior region) before comparison instead of skipping these cases entirely.

When transpose=True, output shape is (B, qL, qH, D) with sequence dimension at index 1. The previous fix was slicing dimension 2 for both cases, causing test failures. Now correctly slices dimension 1 for transpose=True and dimension 2 for transpose=False.
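The axis choice described in this commit message could be captured in a small helper that builds the index tuple for the defined region. This is a hypothetical sketch, not the code from the test file; it assumes a 4-D output and that the transpose=False layout is (B, qH, qL, D).

```python
def defined_region(q_len, k_len, transpose):
    """Index tuple selecting the query rows that have at least one visible key."""
    start = max(0, q_len - k_len)  # number of leading undefined query rows
    # transpose=True: (B, qL, qH, D); transpose=False: assumed (B, qH, qL, D)
    seq_axis = 1 if transpose else 2
    index = [slice(None)] * 4
    index[seq_axis] = slice(start, None)
    return tuple(index)


idx_transposed = defined_region(127, 65, transpose=True)
idx_default = defined_region(127, 65, transpose=False)
print(idx_transposed)
print(idx_default)
```

The returned tuple is meant to be applied to both the result and the reference before comparing them, for array types that accept a tuple of slices as an index.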

zcbenz · 2026-01-30T02:12:02Z

It turns out cuDNN does not like this configuration ☹️ :

======================================================================
ERROR: test_sdpa (test_fast_sdpa.TestSDPA.test_sdpa) (B=1, qsl=127, ksl=65, head_dim=64, n_q_heads=32, n_kv_heads=8, mask='causal', transpose=False, dtype='float16')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\cygwin64\home\cheng\codes\mlx\python\tests\test_fast_sdpa.py", line 637, in test_sdpa
    self.assertLessEqual(mx.max(diff).item(), atol)
                         ~~~~~~~~~~~~~~~~~^^
RuntimeError: graph.prepare() failed: Bottom right causal mask does not support max_s_q > max_s_kv. Please virtually slice the Q tensor and pass it as max_s_q == max_s_kv.

(The test does not run in CI as the hardware is not supported by cuDNN)

I'm going to disable cuDNN SDPA for T_q > T_kv with mask='causal' for now.

Anri-Lombard · 2026-01-30T05:33:06Z

Dang! Sorry to see this @zcbenz 🙏 I can have a look later to see how we could make cuDNN happy, but disabling it for now makes sense

Document causal mask alignment in scaled_dot_product_attention

c55b3f5

zcbenz requested changes Jan 18, 2026

Fix causal mask documentation to describe conditional alignment behavior

f434231

The mask uses lower-right alignment when T_q <= T_kv and upper-left when T_q > T_kv.

zcbenz approved these changes Jan 21, 2026

Anri-Lombard added 2 commits January 23, 2026 05:34

Anri-Lombard changed the title ~~Document causal mask alignment in scaled_dot_product_attention~~ Use lower-right causal mask alignment consistently Jan 23, 2026

Update docstring to reflect consistent lower-right causal alignment

4bc3b75

Merge remote-tracking branch 'upstream/main' into fix/sdpa-causal-mask-offset

fc834bf

Remove zero-row handling for qL > kL causal case

07ba3f5

awni reviewed Jan 25, 2026

mlx/fast.cpp (outdated)

awni reviewed Jan 26, 2026

Anri-Lombard and others added 3 commits January 26, 2026 17:43

Update mlx/fast.cpp

1776be7

Co-authored-by: Awni Hannun <[email protected]>

awni merged commit 0c6a895 into ml-explore:main Jan 29, 2026
16 checks passed
