
Numerical inaccuracies in "ddp_apply_optim_in_backward" unit tests for gloo backend #111834

Open
jataylo opened this issue Oct 23, 2023 · 3 comments
Labels
oncall: distributed (Add this issue/PR to distributed oncall triage queue)

Comments

jataylo commented Oct 23, 2023

πŸ› Describe the bug

After some experiments in #111791 I have replicated an accuracy issue on CI with the gloo backend affecting DDP models that apply the optimizer in backward ("apply_optim_in_backward") instead of calling .step(). This occurs for both CUDA and ROCm.

There are already unit tests in https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/distributed/distributed_test.py that track this behavior, but torchvision is not present in the distributed CI job, so these tests only run against a simple linear model, in which the bug is not present.
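For context, the fallback comes from an import guard of roughly the following shape (illustrative sketch; the exact variable names in distributed_test.py may differ):

try:
    import torchvision  # noqa: F401
    HAS_TORCHVISION = True
except ImportError:
    HAS_TORCHVISION = False

# Conv/torchvision-based model variants are only exercised when HAS_TORCHVISION
# is True; otherwise the tests fall back to a plain nn.Linear model, which is
# why the CI job never hit the conv code path.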

This can be replicated with tip-of-tree (TOT) PyTorch using the following unit test (as long as torchvision is installed):

BACKEND=gloo WORLD_SIZE=2 PYTORCH_TEST_WITH_ROCM=1 HIP_VISIBLE_DEVICES=0,1 python3 test/distributed/test_distributed_spawn.py TestDistBackendWithSpawn.test_ddp_apply_optim_in_backward_grad_as_bucket_view_false
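For readers unfamiliar with the API under test, here is a minimal single-process sketch of the two update paths being compared; the tiny nn.Linear model, SGD, and learning rate are illustrative assumptions, and the real test wraps both models in DDP:

import copy

import torch
import torch.nn as nn
from torch.distributed.optim import _apply_optimizer_in_backward

model_step = nn.Linear(4, 4)
model_in_bwd = copy.deepcopy(model_step)

# Conventional path: gradients accumulate during backward(), then an explicit step().
optim = torch.optim.SGD(model_step.parameters(), lr=0.1)

# Fused path: a per-parameter optimizer runs inside backward(); no .step() call is made.
_apply_optimizer_in_backward(
    optimizer_class=torch.optim.SGD,
    params=model_in_bwd.parameters(),
    optimizer_kwargs={"lr": 0.1},
)

x = torch.randn(2, 4)

model_step(x).sum().backward()
optim.step()
optim.zero_grad(set_to_none=True)

model_in_bwd(x).sum().backward()  # parameters are updated during this call

# The two paths are expected to produce identical parameters, which is exactly
# what the unit test asserts for the DDP-wrapped models.
for p1, p2 in zip(model_step.parameters(), model_in_bwd.parameters()):
    assert torch.allclose(p1, p2)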

Versions

This can be replicated in both the CUDA and ROCm CI environments in the distributed workflow if we modify the job to install torchvision, as seen here:
https://hud.pytorch.org/pr/111791

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @ezyang @albanD @zou3519 @pearu @nikitaved @soulitzer @lezcano @Varal7 @vincentqb @jbschlosser @janeyx99 @crcrpar

jataylo commented Oct 23, 2023

This seems to occur even with simple models, as long as a convolution is involved.

I updated the test_ddp_apply_optim_in_backward UT to use a dummy model, and we can still replicate the issue:

def _test_ddp_apply_optim_in_backward(
    self,
    optim_cls,
    optim_kwargs,
    init_before,
    gradient_as_bucket_view=True,
):
    # Need to seed to ensure inputs are unique across ranks. Otherwise,
    # allreduce won't have any effect.
    torch.manual_seed(self.rank)
    torch.cuda.manual_seed(self.rank)
    torch.cuda.set_device(self.rank)

    # Define a toy model
    class BasicModel(nn.Module):
        def __init__(
            self,
            use_bn=False,
            input_dim=512,
            input_channels=3,
            conv_out_channels=8,
            kernel_size=3,
            stride=1,
            padding=0,
            linear_out=32,
        ):
            super().__init__()
            self.use_bn = use_bn
            if self.use_bn:
                self.bn = nn.BatchNorm2d(input_channels)
                output_size = input_dim
                self.fc_input_size = input_channels * output_size * output_size
            else:
                self.conv = nn.Conv2d(
                    input_channels, conv_out_channels, kernel_size, stride, padding
                )
                # Both width and height
                output_size = ((input_dim - kernel_size + 2 * padding) // stride) + 1
                self.fc_input_size = conv_out_channels * output_size * output_size

            self.fc = nn.Linear(self.fc_input_size, linear_out)

        def forward(self, x):
            if self.use_bn:
                x = self.bn(x)    # Issue does not occur if we just perform a batch norm
            else:
                x = self.conv(x)  # Issue occurs with a convolution involved
            x = x.view(x.size(0), -1)  # Flatten the tensor
            x = self.fc(x)
            return x

    models_to_test = []
    input_size = 40
    models_to_test.append(BasicModel(input_dim=input_size, use_bn=False).cuda())

    for j, model in enumerate(models_to_test):
        model_optim_in_bwd = copy.deepcopy(model)
        model = nn.parallel.DistributedDataParallel(
            model,
            device_ids=[self.rank],
            gradient_as_bucket_view=gradient_as_bucket_view,
        )

        optim = optim_cls(model.parameters(), **optim_kwargs)

        _apply_optimizer_in_backward(
            optimizer_class=optim_cls,
            params=model_optim_in_bwd.parameters(),
            optimizer_kwargs=optim_kwargs,
            register_hook=False,
        )

        model_optim_in_bwd = nn.parallel.DistributedDataParallel(
            model_optim_in_bwd,
            device_ids=[self.rank],
            gradient_as_bucket_view=gradient_as_bucket_view,
        )

        for p1, p2 in zip(model.parameters(), model_optim_in_bwd.parameters()):
            self.assertEqual(p1, p2, "Parameters not initially equal!")

        # Enable determinism in cudnn operators
        with torch.backends.cudnn.flags(
            enabled=True, deterministic=True, benchmark=False
        ):
            for i in range(8):
                inp = torch.randn(1, 3, input_size, input_size, device="cuda")

                model(inp).sum().backward()
                optim.step()
                model_optim_in_bwd(inp).sum().backward()  # runs optimizer as well

                print(f"Iteration: {i}")
                for p1, p2 in zip(
                    model.parameters(), model_optim_in_bwd.parameters()
                ):
                    self.assertEqual(
                        p1, p2, f"Params not equal at iteration {i}"
                    )
                    self.assertTrue(
                        p2.grad is None,
                        f"Optim in backward grad is not None at {i}",
                    )

                # set_to_none for regular optimizer to match in backward case.
                optim.zero_grad(set_to_none=True)
@soulitzer added the oncall: distributed, module: autograd, and module: optimizer labels Oct 23, 2023
@albanD removed the module: autograd and module: optimizer labels Oct 24, 2023
@jon-chuang jon-chuang changed the title Numerical inaccuracies in "ddp_apply_optim_in_backward" unit tests Numerical inaccuracies in "ddp_apply_optim_in_backward" unit tests for gloo backend Oct 26, 2023
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this issue Oct 3, 2024
========================================

Temporarily skip test_conv3d_64bit_indexing

- Rocblas API support is requested
- SWDEV-383635 & sub task - SWDEV-390218

Skip ddp apply_optim_in_bwd tests for gloo (#1302)

To resolve https://ontrack-internal.amd.com/browse/SWDEV-403530 and https://ontrack-internal.amd.com/browse/SWDEV-419837. For more context check upstream issue pytorch#111834

Add skipIfRocmArch decorator for Navi skips (#1356)

Converted NAVI check as a function (#1364)

* Moved NAVI check to the test file

* Revised NAVI check as a function

[Navi] [Inductor] Unskip Navi inductor UTs (#1514)

Relates to https://ontrack-internal.amd.com/browse/SWDEV-461590

Bad import in test_torchinductor and skip torchvision related UT (#1374)

skip test_inductor_freezing failing UTs (#1375)

Skip test_mm_triton_kernel_benchmark (#1376)

* Running triton kernel on ROCM only has one GB/s metric reported

* Update test_kernel_benchmark.py

skip vmapvjpvjp_linalg_householder_product_cuda_float32 (#1420)

skipIfRocm needs msg parameter

[NO CP] Updated changes to skip few UTs

Imported skipIfRocm in certain test suites (#1577)

Fixes SWDEV-472397

Added functions imports (#1521)

Fixes
inductor.test_torchinductor_dynamic_shapes::TestInductorDynamicCUDA::test_item_unbacked_stride_nobreak_cuda
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this issue Oct 11, 2024
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this issue Oct 11, 2024
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this issue Oct 11, 2024
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this issue Nov 19, 2024
pruthvistony added a commit to ROCm/pytorch that referenced this issue Dec 2, 2024
pruthvistony added a commit to ROCm/pytorch that referenced this issue Dec 21, 2024
dnikolaev-amd pushed a commit to ROCm/pytorch that referenced this issue Apr 17, 2025
dnikolaev-amd pushed a commit to ROCm/pytorch that referenced this issue Apr 24, 2025
@jithunnair-amd

@wconstab This is a longstanding issue that we would like to get closure on.

@wconstab

This might be related: #152300
cc @fduwjj
