[CUDA] Use cuDNN SDPA for decoding when using fixed-size KV cache #3113

Merged: zcbenz merged 1 commit into ml-explore:main from zcbenz:cuda-sdpa-sliced on Feb 10, 2026
Conversation

@zcbenz (Collaborator) commented Feb 8, 2026

The KV cache in mlx-lm has a fixed size, and the keys/values passed to fast.sdpa are slices of it. Using this information we can create cuDNN graphs with a fixed sequence length and use padding masks to set the actual sequence lengths.

This makes decoding (T_q == 1) 30%~100% faster for large sequence lengths. For small sequence lengths the cuDNN SDPA is still faster, but the overhead of creating cuDNN graphs would eliminate the advantage, so we fall back to the vector SDPA.

This approach, however, does not support custom array masks. We could add more options to reduce the use of array masks, but cuDNN does not support left padding masks, so integrating with BatchKVCache would require extra effort.
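The fixed-size cache idea can be sketched in Python, with numpy standing in for mlx arrays. The class name, API, and the 256-token growth step below are assumptions for illustration, not mlx-lm's actual code; the point is that update() returns a slice of a contiguous fixed-capacity buffer, which is what lets one cuDNN graph (built for the full capacity, with a padding mask for the live length) be reused across decode steps.

```python
import numpy as np

STEP = 256  # assumed growth step for the cache capacity

class FixedSizeKVCache:
    """Hypothetical sketch of a fixed-size KV cache that hands out slices."""

    def __init__(self, n_heads, head_dim, dtype=np.float32):
        self.keys = None      # backing buffer: (1, n_heads, capacity, head_dim)
        self.offset = 0       # number of tokens actually written
        self.n_heads = n_heads
        self.head_dim = head_dim
        self.dtype = dtype

    def update(self, new_keys):
        t_new = new_keys.shape[2]
        needed = self.offset + t_new
        # Grow the backing buffer in multiples of STEP so the sequence
        # capacity stays fixed between (rare) reallocations.
        if self.keys is None or needed > self.keys.shape[2]:
            capacity = ((needed + STEP - 1) // STEP) * STEP
            grown = np.zeros((1, self.n_heads, capacity, self.head_dim), self.dtype)
            if self.keys is not None:
                grown[:, :, : self.offset] = self.keys[:, :, : self.offset]
            self.keys = grown
        self.keys[:, :, self.offset : needed] = new_keys
        self.offset = needed
        # Return a *slice* (a view) of the fixed-capacity buffer: the graph
        # can be built once for `capacity` and padding-masked to `offset`.
        return self.keys[:, :, : self.offset]

cache = FixedSizeKVCache(n_heads=2, head_dim=4)
k = cache.update(np.ones((1, 2, 3, 4), np.float32))
# k covers 3 tokens but is a view into the 256-token backing buffer.
```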

Comment on lines +78 to +89
auto is_slice = [](const array& kv) {
  // When called during graph building the strides are not available.
if (kv.status() != array::evaluated) {
return (kv.has_primitive() && typeid(kv.primitive()) == typeid(Slice)) ||
(kv.shape(2) % kv_cache_step == 0);
}
// Get pre-sliced sequence length from strides, and check if the buffer
// belongs to a contiguous kv cache.
int64_t T_kv = kv.strides(1) / kv.strides(2);
if (kv.size() / kv.shape(2) * T_kv != kv.buffer_size() / kv.itemsize()) {
return false;
}
Member:
If we assume that the cuDNN case is a strict subset of what the vector fallback can handle, then this is not really necessary. We could just use this function during low-level dispatch (and actually use the strides, which is a lot more robust).

@zcbenz (Collaborator, Author):

Strictly speaking cuDNN supports more head dims, but yeah, it should be more robust to just make it a subset of the vector SDPA. I have updated the code.
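The strides check in the diff above can be illustrated with a numpy stand-in. Here presliced_len is a hypothetical helper, not MLX code; mlx strides are in elements while numpy's are in bytes, hence the itemsize division. For a contiguous (B, H, T, D) cache sliced along the sequence axis, the head-axis stride still encodes the unsliced capacity T_kv, and comparing against the buffer size confirms the slice comes from one contiguous KV cache.

```python
import numpy as np

def presliced_len(kv_slice, full_buffer):
    """Recover the pre-slice sequence capacity from strides, and check
    that the slice belongs to a contiguous KV cache buffer."""
    elem = kv_slice.itemsize
    # strides(1) / strides(2), converted from bytes to element units.
    t_kv = (kv_slice.strides[1] // elem) // (kv_slice.strides[2] // elem)
    # Mirrors: kv.size() / kv.shape(2) * T_kv == kv.buffer_size() / kv.itemsize()
    contiguous = (kv_slice.size // kv_slice.shape[2]) * t_kv == full_buffer.size
    return t_kv, contiguous

buf = np.zeros((1, 8, 256, 64), np.float32)   # fixed-capacity cache buffer
sliced = buf[:, :, :100]                      # what would reach fast.sdpa
t_kv, ok = presliced_len(sliced, buf)
# t_kv recovers the 256-token capacity even though the slice holds 100.
```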

@awni (Member) commented Feb 9, 2026

Very nice improvement. On B200 it's more than 2x speedup for long context:

mlx_lm.benchmark --model Qwen/Qwen3-4B-Instruct-2507 --prompt 64000 --g 512 -n 3

Pre: 72.998 Tok/s
Post: 208.822 Tok/s

I wonder if we should defensively increase the default cache size for the forward op? If you do long generations it will start to thrash after about 16k tokens which is not that many. Maybe we should make it 256 or something?

@zcbenz (Collaborator, Author) commented Feb 9, 2026

Yeah for decoding we definitely need a larger cache size.
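The thrashing awni describes can be sketched with a toy LRU graph cache. The keying by padded KV length, the 256-token step, the eviction policy, and the capacities below are all assumptions for illustration, not the actual MLX graph cache: with a 256-token step, a 16k-token generation touches 64 distinct padded lengths, so a cache much smaller than that evicts graphs that a subsequent generation will need again.

```python
from collections import OrderedDict

class GraphCache:
    """Toy LRU cache of compiled graphs, keyed by padded KV length."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.graphs = OrderedDict()
        self.builds = 0  # counts expensive graph constructions

    def get(self, padded_len):
        if padded_len in self.graphs:
            self.graphs.move_to_end(padded_len)   # mark as recently used
            return self.graphs[padded_len]
        self.builds += 1                          # expensive: build a new graph
        if len(self.graphs) >= self.capacity:
            self.graphs.popitem(last=False)       # evict least recently used
        self.graphs[padded_len] = object()
        return self.graphs[padded_len]

STEP = 256  # assumed KV cache growth step

def decode(cache, n_tokens):
    # Each decode step looks up the graph for the current padded capacity.
    for t in range(1, n_tokens + 1):
        padded = ((t + STEP - 1) // STEP) * STEP
        cache.get(padded)

small = GraphCache(capacity=32)
decode(small, 16_384)   # 16k tokens -> 64 distinct padded lengths
large = GraphCache(capacity=256)
decode(large, 16_384)
```

Running two 16k-token generations back to back in this model, the 32-entry cache ends up rebuilding every graph on the second pass, while the 256-entry cache rebuilds none; that headroom is the kind of benefit a larger default would buy.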

@zcbenz zcbenz merged commit 54bb3ee into ml-explore:main Feb 10, 2026
16 checks passed
@zcbenz zcbenz deleted the cuda-sdpa-sliced branch February 10, 2026 00:15
@awni awni mentioned this pull request Feb 10, 2026
