Add Tensor Core ops to RNNs for Volta #3409

Merged
soumith merged 1 commit into pytorch:master from SeanNaren:rnn-volta
Nov 1, 2017

Conversation

@SeanNaren

Enables Tensor Core operations for RNNs. Checks for cuDNN v7 and CUDA 9 or above. Only supported on Volta cards (AWS P3 instances work!).
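
As a rough illustration of the gating described above, here is a minimal sketch of such a version check. It is not the code from this PR: the function name `tensor_core_rnn_eligible` is hypothetical, and it assumes the cuDNN version is an integer such as `7003` for 7.0.3 and that "Volta" means compute capability 7.0 or higher.

```python
from typing import Optional, Tuple


def tensor_core_rnn_eligible(
    cuda_version: Optional[str],
    cudnn_version: Optional[int],
    compute_capability: Tuple[int, int],
) -> bool:
    """Hypothetical helper: True when the stated requirements are met.

    Assumptions (not taken from the PR's code):
      * cuda_version is a string such as "9.0", or None for CPU-only builds
      * cudnn_version is an integer such as 7003 (cuDNN 7.0.3), or None
      * Volta corresponds to compute capability (7, 0) and newer
    """
    if cuda_version is None or cudnn_version is None:
        return False
    cuda_major = int(cuda_version.split(".")[0])
    return (
        cuda_major >= 9
        and cudnn_version >= 7000
        and compute_capability >= (7, 0)
    )


print(tensor_core_rnn_eligible("9.0", 7003, (7, 0)))  # Volta, CUDA 9, cuDNN 7
print(tensor_core_rnn_eligible("8.0", 6021, (6, 1)))  # older stack and card
```

With PyTorch installed, the three arguments could be taken from `torch.version.cuda`, `torch.backends.cudnn.version()`, and `torch.cuda.get_device_capability()`.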

@soumith
Collaborator

soumith commented Oct 31, 2017

@pytorchbot test this please

@soumith soumith merged commit cf256ee into pytorch:master Nov 1, 2017
@soumith
Collaborator

soumith commented Nov 1, 2017

thanks Sean!

@SeanNaren SeanNaren deleted the rnn-volta branch November 1, 2017 10:19
@SeanNaren
Author

SeanNaren commented Nov 1, 2017

Oh, I should probably mention: Tensor Core ops are only activated when handling matrix sizes that are multiples of 8 (so make your hidden sizes multiples of 8)! I'm not sure where the best place to make this obvious is...
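
One way to follow the multiple-of-8 advice is to round a desired hidden size up before constructing the RNN. The helper below is a small sketch; the name `round_up_to_multiple` is hypothetical and not part of PyTorch.

```python
def round_up_to_multiple(size: int, multiple: int = 8) -> int:
    """Round `size` up to the nearest multiple of `multiple` (hypothetical helper)."""
    if size <= 0 or multiple <= 0:
        raise ValueError("size and multiple must be positive")
    return ((size + multiple - 1) // multiple) * multiple


# A hidden size of 500 is not a multiple of 8, so it is padded up to 504.
hidden_size = round_up_to_multiple(500)
print(hidden_size)
```

The rounded value could then be passed as `hidden_size` when building, for example, a `torch.nn.LSTM`.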

aditvenk added a commit that referenced this pull request May 26, 2026
… .tolist() in sharding prop

Summary:
- DTensor sharding propagator's local-shape adjustment for view ops calls _StridedShard.local_shard_size_and_offset, which materialises offsets via .tolist() on a fake index tensor. Under FakeTensorMode that allocates ~131k unbacked SymInts per call, none bound to the returned DTensor's tensor_meta, tripping PendingUnbackedSymbolNotFound at the downstream compute_unbacked_bindings check. Fix: honor existing skip_offset flag inside _StridedShard.local_shard_size_and_offset to skip the .tolist() when offsets aren't needed (the only known leak source). Run 25 narrowed the fix per wconstab-style blocking review by removing the previously-included defensive ignore_fresh_unbacked_symbols() wrap from _sharding_prop.py. Run 26 addresses aditvenk-style blocking by capturing clean lint evidence inside the submitted artifact set (lint_evidence.txt + verbatim block in report.md). End-to-end verified on torchtitan gpt_oss MoE+EP+TP+FSDP compile: #3409 surface error is gone (separate downstream aten.histc-on-DTensor bug is unrelated). The in-tree regression test test_strided_shard_view_unbacked_local under test/distributed/tensor/test_dtensor_compile.py uses a fake PG (no GPU required) and bisect-verifies the fix.
- User re-issued the same fix-and-verify instruction. The Run 22-26 fix is in place in both the job's pytorch worktree (`~/.ptq_workspace/jobs/20260520-torchtitan-3409/pytorch/`) and the conda

Fixes pytorch/torchtitan#3409
claude Bot pushed a commit that referenced this pull request May 28, 2026
The comment previously framed the unbacked-local / backed-global asymmetry
as the precondition for the bug. The actual root cause is .tolist() in
_StridedShard.local_shard_size_and_offset creating unbacked SymInts during
compile-time sharding propagation for any _StridedShard view, regardless of
whether the local tensor has backed or unbacked dims. torch.nonzero is just
how torchtitan #3409 surfaced it.

Co-authored-by: Aditya Venkataraman
Co-Authored-By: Claude Opus 4.6
pytorchmergebot pushed a commit that referenced this pull request May 30, 2026
… .tolist() in sharding prop

pytorchmergebot pushed a commit that referenced this pull request May 30, 2026
aditvenk added a commit that referenced this pull request Jun 2, 2026
… materialization

Root cause: DTensor's sharding propagator adjusts local shapes for view ops
by calling _StridedShard.local_shard_size_and_offset, which materializes the
offsets list via .tolist() on an index tensor. Under FakeTensorMode (e.g. when
the op is compiled) that .tolist() allocates one unbacked SymInt per element,
none of which are bound to the returned DTensor's tensor_meta. The downstream
compute_unbacked_bindings check then trips PendingUnbackedSymbolNotFound. This
surfaces on torchtitan #3409 (gpt_oss MoE+EP+TP+FSDP under torch.compile), but
the leak is independent of whether local dims are backed or unbacked.

Fix: skip the .tolist() when the caller discards the offsets. The previous
boolean return_first_offset flag could not express "no offsets needed", so it
is replaced by a small _StridedShardOffsetMode enum (FIRST / ALL / NONE).
_get_shard_size_and_offsets passes NONE when skip_offset is set, avoiding the
unbacked-SymInt allocation entirely.

Also removes three stale xfails in test_dtensor_ops.py (linalg.tensorsolve,
nn.functional.instance_norm, take_along_dim) that now compile and pass under
TestCompiledDTensorOps and were reported as unexpected successes.

Fixes pytorch/torchtitan#3409
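
The flag-to-enum change described in the commit message above can be illustrated with a simplified sketch. This is not the DTensor implementation: it uses plain contiguous chunking instead of strided sharding, a Python `list(range(...))` stands in for the `.tolist()` materialization, and the names `OffsetMode` and `shard_size_and_offsets` are hypothetical.

```python
from enum import Enum, auto
from typing import List, Optional, Tuple


class OffsetMode(Enum):
    """Hypothetical stand-in for the three modes named in the commit message."""
    FIRST = auto()  # caller only needs the first offset
    ALL = auto()    # caller needs every offset
    NONE = auto()   # caller discards offsets, so none are materialized


def shard_size_and_offsets(
    dim_size: int, num_shards: int, rank: int, mode: OffsetMode
) -> Tuple[int, Optional[List[int]]]:
    """Return the local shard size and, depending on `mode`, its offsets.

    Simplification: contiguous chunks of ceil(dim_size / num_shards) elements.
    """
    chunk = -(-dim_size // num_shards)  # ceiling division
    start = min(rank * chunk, dim_size)
    end = min(start + chunk, dim_size)
    size = end - start

    if mode is OffsetMode.NONE:
        # Skip the per-element materialization entirely.
        return size, None
    if mode is OffsetMode.FIRST:
        return size, [start] if size > 0 else []
    # OffsetMode.ALL: one entry per local element (the expensive path).
    return size, list(range(start, end))


print(shard_size_and_offsets(10, 4, 1, OffsetMode.ALL))
print(shard_size_and_offsets(10, 4, 1, OffsetMode.NONE))
```

A boolean flag can only choose between two of these behaviors, which is why a third "no offsets" case needs either a second flag or an enum like the one sketched here.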
pytorchmergebot pushed a commit that referenced this pull request Jun 2, 2026
… materialization
