Add Tensor Core ops to RNNs for Volta #3409
Merged
Conversation
soumith
approved these changes
Oct 31, 2017
Collaborator
@pytorchbot test this please
Collaborator
thanks Sean!
Author
Oh, I should probably mention: tensor core ops are only activated when handling matrix sizes that are multiples of 8 (so make your hidden sizes multiples of 8)! I'm not sure where the best place to make this obvious is...
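The sizing rule in the comment above can be sketched in plain Python. This is only an illustration of the "multiples of 8" constraint, not code from this PR; the helper names `round_up_to_multiple` and `is_tensor_core_friendly` are hypothetical.

```python
def round_up_to_multiple(size: int, multiple: int = 8) -> int:
    """Return the smallest multiple of `multiple` that is >= size."""
    if size <= 0 or multiple <= 0:
        raise ValueError("size and multiple must be positive")
    return ((size + multiple - 1) // multiple) * multiple


def is_tensor_core_friendly(*sizes: int, multiple: int = 8) -> bool:
    """True when every given matrix dimension is a multiple of `multiple`."""
    return all(s % multiple == 0 for s in sizes)


# Hypothetical example: a hidden size of 500 is padded up to 504.
hidden_size = round_up_to_multiple(500)
print(hidden_size, is_tensor_core_friendly(hidden_size, 256))
```

A value computed this way could then be passed as `hidden_size` when constructing an RNN module such as `torch.nn.LSTM`; whether tensor core ops are actually selected still depends on the other conditions described in this PR.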
aditvenk
added a commit
that referenced
this pull request
May 26, 2026
… .tolist() in sharding prop Summary: - DTensor sharding propagator's local-shape adjustment for view ops calls _StridedShard.local_shard_size_and_offset, which materialises offsets via .tolist() on a fake index tensor. Under FakeTensorMode that allocates ~131k unbacked SymInts per call, none bound to the returned DTensor's tensor_meta, tripping PendingUnbackedSymbolNotFound at the downstream compute_unbacked_bindings check. Fix: honor existing skip_offset flag inside _StridedShard.local_shard_size_and_offset to skip the .tolist() when offsets aren't needed (the only known leak source). Run 25 narrowed the fix per wconstab-style blocking review by removing the previously-included defensive ignore_fresh_unbacked_symbols() wrap from _sharding_prop.py. Run 26 addresses aditvenk-style blocking by capturing clean lint evidence inside the submitted artifact set (lint_evidence.txt + verbatim block in report.md). End-to-end verified on torchtitan gpt_oss MoE+EP+TP+FSDP compile: #3409 surface error is gone (separate downstream aten.histc-on-DTensor bug is unrelated). The in-tree regression test test_strided_shard_view_unbacked_local under test/distributed/tensor/test_dtensor_compile.py uses a fake PG (no GPU required) and bisect-verifies the fix. - User re-issued the same fix-and-verify instruction. The Run 22-26 fix is - in place in both the job's pytorch worktree - (`~/.ptq_workspace/jobs/20260520-torchtitan-3409/pytorch/`) and the conda Fixes pytorch/torchtitan#3409
claude Bot
pushed a commit
that referenced
this pull request
May 28, 2026
The comment previously framed the unbacked-local / backed-global asymmetry as the precondition for the bug. The actual root cause is .tolist() in _StridedShard.local_shard_size_and_offset creating unbacked SymInts during compile-time sharding propagation for any _StridedShard view, regardless of whether the local tensor has backed or unbacked dims. torch.nonzero is just how torchtitan #3409 surfaced it. Co-authored-by: Aditya Venkataraman <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]>
pytorchmergebot
pushed a commit
that referenced
this pull request
May 30, 2026
… .tolist() in sharding prop Summary: - DTensor sharding propagator's local-shape adjustment for view ops calls _StridedShard.local_shard_size_and_offset, which materialises offsets via .tolist() on a fake index tensor. Under FakeTensorMode that allocates ~131k unbacked SymInts per call, none bound to the returned DTensor's tensor_meta, tripping PendingUnbackedSymbolNotFound at the downstream compute_unbacked_bindings check. Fix: honor existing skip_offset flag inside _StridedShard.local_shard_size_and_offset to skip the .tolist() when offsets aren't needed (the only known leak source). Run 25 narrowed the fix per wconstab-style blocking review by removing the previously-included defensive ignore_fresh_unbacked_symbols() wrap from _sharding_prop.py. Run 26 addresses aditvenk-style blocking by capturing clean lint evidence inside the submitted artifact set (lint_evidence.txt + verbatim block in report.md). End-to-end verified on torchtitan gpt_oss MoE+EP+TP+FSDP compile: #3409 surface error is gone (separate downstream aten.histc-on-DTensor bug is unrelated). The in-tree regression test test_strided_shard_view_unbacked_local under test/distributed/tensor/test_dtensor_compile.py uses a fake PG (no GPU required) and bisect-verifies the fix. - User re-issued the same fix-and-verify instruction. The Run 22-26 fix is - in place in both the job's pytorch worktree - (`~/.ptq_workspace/jobs/20260520-torchtitan-3409/pytorch/`) and the conda Fixes pytorch/torchtitan#3409
pytorchmergebot
pushed a commit
that referenced
this pull request
May 30, 2026
The comment previously framed the unbacked-local / backed-global asymmetry as the precondition for the bug. The actual root cause is .tolist() in _StridedShard.local_shard_size_and_offset creating unbacked SymInts during compile-time sharding propagation for any _StridedShard view, regardless of whether the local tensor has backed or unbacked dims. torch.nonzero is just how torchtitan #3409 surfaced it. Co-authored-by: Aditya Venkataraman <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]>
aditvenk
added a commit
that referenced
this pull request
Jun 2, 2026
… materialization

Root cause: DTensor's sharding propagator adjusts local shapes for view ops by calling _StridedShard.local_shard_size_and_offset, which materializes the offsets list via .tolist() on an index tensor. Under FakeTensorMode (e.g. when the op is compiled) that .tolist() allocates one unbacked SymInt per element, none of which are bound to the returned DTensor's tensor_meta. The downstream compute_unbacked_bindings check then trips PendingUnbackedSymbolNotFound. This surfaces on torchtitan #3409 (gpt_oss MoE+EP+TP+FSDP under torch.compile), but the leak is independent of whether local dims are backed or unbacked.

Fix: skip the .tolist() when the caller discards the offsets. The previous boolean return_first_offset flag could not express "no offsets needed", so it is replaced by a small _StridedShardOffsetMode enum (FIRST / ALL / NONE). _get_shard_size_and_offsets passes NONE when skip_offset is set, avoiding the unbacked-SymInt allocation entirely.

Also removes three stale xfails in test_dtensor_ops.py (linalg.tensorsolve, nn.functional.instance_norm, take_along_dim) that now compile and pass under TestCompiledDTensorOps and were reported as unexpected successes.

Fixes pytorch/torchtitan#3409
pytorchmergebot
pushed a commit
that referenced
this pull request
Jun 2, 2026
… materialization

Root cause: DTensor's sharding propagator adjusts local shapes for view ops by calling _StridedShard.local_shard_size_and_offset, which materializes the offsets list via .tolist() on an index tensor. Under FakeTensorMode (e.g. when the op is compiled) that .tolist() allocates one unbacked SymInt per element, none of which are bound to the returned DTensor's tensor_meta. The downstream compute_unbacked_bindings check then trips PendingUnbackedSymbolNotFound. This surfaces on torchtitan #3409 (gpt_oss MoE+EP+TP+FSDP under torch.compile), but the leak is independent of whether local dims are backed or unbacked.

Fix: skip the .tolist() when the caller discards the offsets. The previous boolean return_first_offset flag could not express "no offsets needed", so it is replaced by a small _StridedShardOffsetMode enum (FIRST / ALL / NONE). _get_shard_size_and_offsets passes NONE when skip_offset is set, avoiding the unbacked-SymInt allocation entirely.

Also removes three stale xfails in test_dtensor_ops.py (linalg.tensorsolve, nn.functional.instance_norm, take_along_dim) that now compile and pass under TestCompiledDTensorOps and were reported as unexpected successes.

Fixes pytorch/torchtitan#3409
Enables Tensor Core operations for RNNs. Checks for cuDNN v7 and CUDA 9 or above. Only supported on Volta cards (AWS P3 instances work!).
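The conditions listed in the description can be sketched as a small predicate. This is a hedged illustration rather than the check implemented in this PR: the function name `meets_tensor_core_rnn_requirements` is hypothetical, the cuDNN version is assumed to be the integer form in which 7.0.5 is reported as 7005, and Volta is assumed to correspond to compute capability 7.0, with anything at or above that treated as eligible.

```python
from typing import Tuple


def meets_tensor_core_rnn_requirements(
    cudnn_version: int,
    cuda_version: str,
    compute_capability: Tuple[int, int],
) -> bool:
    """Hypothetical check mirroring the PR description's three conditions."""
    # Assumption: cuDNN reports its version as an integer, e.g. 7005 for 7.0.5.
    has_cudnn_v7 = cudnn_version >= 7000
    # CUDA version is a string such as "9.0"; only the major part matters here.
    has_cuda_9 = int(cuda_version.split(".")[0]) >= 9
    # Assumption: Volta is compute capability 7.0; newer values are accepted too.
    has_volta = compute_capability >= (7, 0)
    return has_cudnn_v7 and has_cuda_9 and has_volta


# Hypothetical values for a Volta card with cuDNN 7.0.5 and CUDA 9.0.
print(meets_tensor_core_rnn_requirements(7005, "9.0", (7, 0)))
```

In PyTorch, the three inputs could be read from `torch.backends.cudnn.version()`, `torch.version.cuda`, and `torch.cuda.get_device_capability()` on a machine with a CUDA build and a visible GPU.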