[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for `sm90`, `sm100` #149282

eqy · 2025-03-16T21:09:20Z

cleanup tuple/tensor boilerplate in cuDNN SDPA, preparation for nested/ragged tensor backward

pytorch-bot · 2025-03-16T21:09:24Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149282

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit f1198b0 with merge base 3cd6935:

NEW FAILURES - The following jobs have failed:

Lint / lintrunner-noclang / linux-job (gh)
>>> Lint for aten/src/ATen/native/transformers/cuda/attention_backward.cu:
pull / linux-focal-py3_9-clang9-xla / build (gh)
ninja: build stopped: subcommand failed

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, ephemeral.linux.2xlarge) (gh) (#144480)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

eqy · 2025-03-16T21:55:38Z

@pytorchmergebot rebase

pytorchmergebot · 2025-03-16T21:57:06Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2025-03-16T21:57:08Z

Tried to rebase and push PR #149282, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

eqy · 2025-03-17T16:55:32Z

@pytorchmergebot rebase

pytorchmergebot · 2025-03-17T16:56:59Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2025-03-17T16:57:03Z

Successfully rebased cudnnsdparefactor onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cudnnsdparefactor && git pull --rebase)

linux-foundation-easycla · 2025-03-17T23:42:38Z

The committers listed above are authorized under a signed CLA.

✅ login: eqy (c8cabec, 0a65f0f, 5ee0b08, 2e05ec5, deea06f, f3b7ed9, 02a4a0f, 9a1d4bf, c615857, 9b9b540, 7fd6acd, 23b5b68, 5b5102c, f2b17aa, f1198b0, 0d08279, 1131056)

Skylion007 · 2025-03-18T16:38:01Z

aten/src/ATen/native/cudnn/MHA.cpp

  }
  auto workspace_size = mha_graph->get_workspace_size();
  auto workspace_ptr =
      c10::cuda::CUDACachingAllocator::get()->allocate(workspace_size);
  TORCH_CHECK(
      mha_graph->execute(handle, variant_pack, workspace_ptr.get()).is_good());
-  mhagraphcache.update(key, graph_and_tensors_values);
+  mhagraphcache.update(key, mha_graph);


The update method for mhagraphcache should probably use perfect forwarding where it is defined, instead of taking an lvalue reference. The same should be done throughout the file to remove extra copies.

Suggested change

-  mhagraphcache.update(key, mha_graph);
+  mhagraphcache.update(key, std::move(mha_graph));
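For illustration, a minimal sketch of what a perfectly forwarding `update` could look like. `GraphCache`, `FakeGraph`, and both helper functions are hypothetical stand-ins, not the cache defined in MHA.cpp, and the sketch assumes the cached value is a `std::shared_ptr`, which the thread does not confirm.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a built cuDNN frontend graph.
struct FakeGraph {
  int id;
};

// Hypothetical cache, loosely modeled on how mhagraphcache is used in the diff.
template <typename Key, typename Value>
class GraphCache {
 public:
  // Forwarding reference: an lvalue argument is copied, an rvalue is moved.
  template <typename V>
  void update(const Key& key, V&& value) {
    cache_.insert_or_assign(key, std::forward<V>(value));
  }

  // Returns nullptr when the key is absent.
  const Value* find(const Key& key) const {
    auto it = cache_.find(key);
    return it == cache_.end() ? nullptr : &it->second;
  }

 private:
  std::unordered_map<Key, Value> cache_;
};

// Passing an lvalue copies the shared_ptr, so the caller keeps its reference.
inline long cached_use_count_after_copy() {
  GraphCache<std::string, std::shared_ptr<FakeGraph>> cache;
  auto graph = std::make_shared<FakeGraph>(FakeGraph{7});
  cache.update("key", graph);
  return cache.find("key")->use_count();
}

// Passing std::move(graph) transfers ownership; no extra reference is taken.
inline long cached_use_count_after_move() {
  GraphCache<std::string, std::shared_ptr<FakeGraph>> cache;
  auto graph = std::make_shared<FakeGraph>(FakeGraph{7});
  cache.update("key", std::move(graph));
  assert(graph == nullptr);  // a moved-from shared_ptr is empty
  return cache.find("key")->use_count();
}
```

With a forwarding reference, a call site picks copy or move just by writing `std::move`, so the suggested change above needs no second overload.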

lint

drisspg · 2025-05-08T23:40:34Z

aten/src/ATen/native/transformers/cuda/sdp_utils.cpp

@@ -563,7 +581,7 @@ bool check_for_nested_inputs(sdp_params const& params, bool debug) {

  const auto dprop = at::cuda::getCurrentDeviceProperties();
  // Check that the input is nested
-  if (dprop->major != 9 && has_for_nested_inputs(params)) {
+  if (dprop->major < 9 && has_for_nested_inputs(params)) {


Should we guard consumer GPUs? I guess that's handled in the dispatch.
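To make the effect of the changed comparison concrete, here is a small sketch that isolates the compute-capability part of the condition. The two helper names are hypothetical, and the `has_for_nested_inputs(params)` half of the real check is left out. It assumes major version 8 covers sm80/sm86/sm89, 9 is sm90, 10 is sm100, and 12 is the consumer Blackwell parts the question is presumably about.

```cpp
#include <cassert>

// Hypothetical helpers: true means the nested-input check would reject the
// device, looking only at the major compute capability.
constexpr bool rejects_nested_old(int major) { return major != 9; }
constexpr bool rejects_nested_new(int major) { return major < 9; }

// Old condition: only sm90 passes.
static_assert(!rejects_nested_old(9), "sm90 was accepted before");
static_assert(rejects_nested_old(10), "sm100 was rejected before");

// New condition: sm90 and everything newer passes, including major 12.
static_assert(rejects_nested_new(8), "sm8x is still rejected");
static_assert(!rejects_nested_new(10), "sm100 is now accepted");
static_assert(!rejects_nested_new(12), "major 12 is not excluded by this check");
```

Under these assumptions, the new condition on its own does not exclude major 12, so any consumer-GPU guard would have to live elsewhere, for example in the dispatch logic as the comment suggests.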

drisspg · 2025-05-08T23:41:48Z

aten/src/ATen/native/transformers/cuda/sdp_utils.cpp

@@ -57,6 +57,8 @@
 namespace sdp {
 namespace {

+bool priority_order_init_ = false;


maybe a comment as to why

drisspg · 2025-05-08T23:44:45Z

aten/src/ATen/native/cudnn/MHA.cpp

    mhagraphbackwardcache;

 namespace {
+
+enum UIDS {


noob q, what for?

eqy · 2025-05-09

UIDs were the API we had in the convolution frontend API days, but they were not supported when SDPA was first added as a cuDNN composite op.

Their absence made using cuDNN graphs after they had been built very clunky: we had to explicitly stash each cuDNN frontend tensor (a shared_ptr) so that it could be associated with a data pointer when executing the graph. In particular, the function that builds a graph had a really awkward return type, a gigantic tuple of "graph and tensors", because we could not drop the frontend object for each tensor.

In reality we know every tensor needed to execute the graph by a unique name, such as 'q', 'k', 'v', 'dO', 'dQ', etc., so there's no need to hold on to unique objects. A UID is just a lightweight way to encode these names: if we give each tensor a UID (currently just an int64_t), we can pass a mapping of UID -> data ptr to execute the graph instead of a clunky shared_ptr (to frontend object) -> data ptr mapping.
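A minimal sketch of the UID -> data ptr idea, without any cuDNN dependency. The enumerator names, `VariantPack`, `make_forward_pack`, and `lookup_data_ptr` are all hypothetical; the PR's actual `UIDS` values and the cuDNN frontend execute call are not shown in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical UID names; each one identifies a tensor role in the graph.
enum UIDS : int64_t { Q = 0, K, V, O, DO, DQ };

// Assumed shape of the mapping handed to graph execution: UID -> data pointer.
using VariantPack = std::unordered_map<int64_t, void*>;

// Builds the mapping for a forward pass from raw data pointers. No frontend
// tensor objects need to be kept alive to build it.
inline VariantPack make_forward_pack(void* q, void* k, void* v, void* o) {
  return {{Q, q}, {K, k}, {V, v}, {O, o}};
}

// Stand-in for what an executor does with the pack: resolve a tensor role
// to its data pointer by UID.
inline void* lookup_data_ptr(const VariantPack& pack, UIDS uid) {
  auto it = pack.find(uid);
  assert(it != pack.end());  // every tensor the graph uses must be bound
  return it->second;
}
```

Because the keys are plain integers, a cached graph can be re-executed with fresh pointers by rebuilding only this map, which is why the cache in the earlier diff can store `mha_graph` alone.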

drisspg · 2025-05-08T23:46:47Z

aten/src/ATen/native/cudnn/MHA.cpp

+  auto q_strides = q.strides();
+  auto k_strides = k.strides();
+  auto v_strides = v.strides();
+  constexpr int strideidx0 = 1;


Maybe add an NB: the SDPA API is transposed vs. cuDNN.

drisspg

any benchmarks?

eqy · 2025-05-09T22:50:36Z

SDPA perf. should be the same, no nested tensor benchmarks yet as that doesn't have caching yet 😅

eqy added the open source, topic: not user facing, and module: sdpa labels Mar 16, 2025

eqy requested a review from syed-ahmed as a code owner March 16, 2025 21:09

eqy changed the title ~~[cuDNN][SDPA] cuDNN SDPA refactor/cleanup~~ [WIP][cuDNN][SDPA] cuDNN SDPA refactor/cleanup Mar 17, 2025

pytorch deleted a comment from pytorch-bot bot Mar 17, 2025

pytorchmergebot force-pushed the cudnnsdparefactor branch from ac66884 to f7c76b8 Compare March 17, 2025 16:57

Skylion007 reviewed Mar 18, 2025

eqy force-pushed the cudnnsdparefactor branch from 3848e20 to bd4432a Compare April 7, 2025 23:31

eqy requested review from albanD and soulitzer as code owners April 15, 2025 00:44

eqy changed the title ~~[WIP][cuDNN][SDPA] cuDNN SDPA refactor/cleanup~~ [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 Apr 28, 2025

eqy requested review from drisspg and jbschlosser April 28, 2025 21:58

jerryzh168 added the triaged label Apr 29, 2025

eqy force-pushed the cudnnsdparefactor branch from b6a75a5 to 0baac8a Compare April 30, 2025 18:42

eqy added 6 commits May 8, 2025 17:44

check in

9a1d4bf

add cudnn_attention_backward native function

f3b7ed9

update

c8cabec

check in

0a65f0f

lint

wip

2e05ec5

wip

7fd6acd

eqy added 10 commits May 8, 2025 17:44

wip

5b5102c

check in

deea06f

check in

f2b17aa

temp check in

1131056

wip

5ee0b08

add missing RAGGED O offset

02a4a0f

fix ragged offset tensor and cleanup

23b5b68

update test

9b9b540

lint

try dropout cond

c615857

lint and fix decomp test

0d08279

eqy force-pushed the cudnnsdparefactor branch from c43594b to 0d08279 Compare May 8, 2025 17:44

drisspg reviewed May 8, 2025

drisspg approved these changes May 8, 2025

address comments

f1198b0
