
Numerical inaccuracies in "ddp_apply_optim_in_backward" unit tests for gloo backend #111834

Open
jataylo opened this issue Oct 23, 2023 · 3 comments
Labels
oncall: distributed (Add this issue/PR to distributed oncall triage queue)

Comments

jataylo commented Oct 23, 2023

πŸ› Describe the bug

After some experiments in #111791 I have replicated an accuracy issue on CI with the gloo backend affecting DDP models that apply the optimizer in backward ("apply_optim_in_backward") instead of calling .step(). This occurs for both CUDA and ROCm.

There are already unit tests in https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/distributed/distributed_test.py that track this behavior, but torchvision is not present in the distributed CI job, so these tests only run against a simple linear model, in which the bug is not present.
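For context, the fallback comes from an import guard of roughly the following shape (illustrative sketch; the exact variable names in distributed_test.py may differ):

try:
    import torchvision  # noqa: F401
    HAS_TORCHVISION = True
except ImportError:
    HAS_TORCHVISION = False

# Conv/torchvision-based model variants are only exercised when HAS_TORCHVISION
# is True; otherwise the tests fall back to a plain nn.Linear model, which is
# why the CI job never hit the conv code path.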

This can be replicated with tip-of-tree (TOT) PyTorch using the following unit test (as long as torchvision is installed):

BACKEND=gloo WORLD_SIZE=2 PYTORCH_TEST_WITH_ROCM=1 HIP_VISIBLE_DEVICES=0,1 python3 test/distributed/test_distributed_spawn.py TestDistBackendWithSpawn.test_ddp_apply_optim_in_backward_grad_as_bucket_view_false
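For readers unfamiliar with the API under test, here is a minimal single-process sketch of the two update paths being compared; the tiny nn.Linear model, SGD, and learning rate are illustrative assumptions, and the real test wraps both models in DDP:

import copy

import torch
import torch.nn as nn
from torch.distributed.optim import _apply_optimizer_in_backward

model_step = nn.Linear(4, 4)
model_in_bwd = copy.deepcopy(model_step)

# Conventional path: gradients accumulate during backward(), then an explicit step().
optim = torch.optim.SGD(model_step.parameters(), lr=0.1)

# Fused path: a per-parameter optimizer runs inside backward(); no .step() call is made.
_apply_optimizer_in_backward(
    optimizer_class=torch.optim.SGD,
    params=model_in_bwd.parameters(),
    optimizer_kwargs={"lr": 0.1},
)

x = torch.randn(2, 4)

model_step(x).sum().backward()
optim.step()
optim.zero_grad(set_to_none=True)

model_in_bwd(x).sum().backward()  # parameters are updated during this call

# The two paths are expected to produce identical parameters, which is exactly
# what the unit test asserts for the DDP-wrapped models.
for p1, p2 in zip(model_step.parameters(), model_in_bwd.parameters()):
    assert torch.allclose(p1, p2)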

Versions

This can be replicated in both the CUDA and ROCm CI environments in the distributed workflow if we modify the job to install torchvision, as seen here:
https://hud.pytorch.org/pr/111791

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @ezyang @albanD @zou3519 @pearu @nikitaved @soulitzer @lezcano @Varal7 @vincentqb @jbschlosser @janeyx99 @crcrpar

jataylo commented Oct 23, 2023

This seems to occur even with simple models, as long as a convolution is involved.

I updated the test_ddp_apply_optim_in_backward UT to use a dummy model, and we can still replicate the issue:

def _test_ddp_apply_optim_in_backward(
    self,
    optim_cls,
    optim_kwargs,
    init_before,
    gradient_as_bucket_view=True,
):
    # Need to seed to ensure inputs are unique across ranks. Otherwise,
    # allreduce won't have any effect.
    torch.manual_seed(self.rank)
    torch.cuda.manual_seed(self.rank)
    torch.cuda.set_device(self.rank)

    # Define a toy model
    class BasicModel(nn.Module):
        def __init__(
            self,
            use_bn=False,
            input_dim=512,
            input_channels=3,
            conv_out_channels=8,
            kernel_size=3,
            stride=1,
            padding=0,
            linear_out=32,
        ):
            super().__init__()
            self.use_bn = use_bn
            if self.use_bn:
                self.bn = nn.BatchNorm2d(input_channels)
                output_size = input_dim
                self.fc_input_size = input_channels * output_size * output_size
            else:
                self.conv = nn.Conv2d(
                    input_channels, conv_out_channels, kernel_size, stride, padding
                )
                # Both width and height
                output_size = ((input_dim - kernel_size + 2 * padding) // stride) + 1
                self.fc_input_size = conv_out_channels * output_size * output_size

            self.fc = nn.Linear(self.fc_input_size, linear_out)

        def forward(self, x):
            if self.use_bn:
                x = self.bn(x)    # Issue does not occur if we just perform a batch norm
            else:
                x = self.conv(x)  # Issue occurs with a convolution involved
            x = x.view(x.size(0), -1)  # Flatten the tensor
            x = self.fc(x)
            return x

    models_to_test = []
    input_size = 40
    models_to_test.append(BasicModel(input_dim=input_size, use_bn=False).cuda())

    for j, model in enumerate(models_to_test):
        model_optim_in_bwd = copy.deepcopy(model)
        model = nn.parallel.DistributedDataParallel(
            model,
            device_ids=[self.rank],
            gradient_as_bucket_view=gradient_as_bucket_view,
        )

        optim = optim_cls(model.parameters(), **optim_kwargs)

        _apply_optimizer_in_backward(
            optimizer_class=optim_cls,
            params=model_optim_in_bwd.parameters(),
            optimizer_kwargs=optim_kwargs,
            register_hook=False,
        )

        model_optim_in_bwd = nn.parallel.DistributedDataParallel(
            model_optim_in_bwd,
            device_ids=[self.rank],
            gradient_as_bucket_view=gradient_as_bucket_view,
        )

        for p1, p2 in zip(model.parameters(), model_optim_in_bwd.parameters()):
            self.assertEqual(p1, p2, "Parameters not initially equal!")

        # Enable determinism in cudnn operators
        with torch.backends.cudnn.flags(
            enabled=True, deterministic=True, benchmark=False
        ):
            for i in range(8):
                inp = torch.randn(1, 3, input_size, input_size, device="cuda")

                model(inp).sum().backward()
                optim.step()
                model_optim_in_bwd(inp).sum().backward()  # runs optimizer as well

                print(f"Iteration: {i}")
                for p1, p2 in zip(
                    model.parameters(), model_optim_in_bwd.parameters()
                ):
                    self.assertEqual(
                        p1, p2, f"Params not equal at iteration {i}"
                    )
                    self.assertTrue(
                        p2.grad is None,
                        f"Optim in backward grad is not None at {i}",
                    )

                # set_to_none for regular optimizer to match in backward case.
                optim.zero_grad(set_to_none=True)
@soulitzer added the oncall: distributed, module: autograd, and module: optimizer labels Oct 23, 2023
@albanD removed the module: autograd and module: optimizer labels Oct 24, 2023
@jon-chuang jon-chuang changed the title Numerical inaccuracies in "ddp_apply_optim_in_backward" unit tests Numerical inaccuracies in "ddp_apply_optim_in_backward" unit tests for gloo backend Oct 26, 2023
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this issue Oct 3, 2024
========================================

Temporarily skip test_conv3d_64bit_indexing

- Rocblas API support is requested
- SWDEV-383635 & sub task - SWDEV-390218

Skip ddp apply_optim_in_bwd tests for gloo (#1302)

To resolve https://ontrack-internal.amd.com/browse/SWDEV-403530 and https://ontrack-internal.amd.com/browse/SWDEV-419837. For more context check upstream issue pytorch#111834

Add skipIfRocmArch decorator for Navi skips (#1356)

Converted NAVI check as a function (#1364)

* Moved NAVI check to the test file

* Revised NAVI check as a function

[Navi] [Inductor] Unskip Navi inductor UTs (#1514)

Relates to https://ontrack-internal.amd.com/browse/SWDEV-461590

Bad import in test_torchinductor and skip torchvision related UT (#1374)

skip test_inductor_freezing failing UTs (#1375)

Skip test_mm_triton_kernel_benchmark (#1376)

* Running triton kernel on ROCM only has one GB/s metric reported

* Update test_kernel_benchmark.py

skip vmapvjpvjp_linalg_householder_product_cuda_float32 (#1420)

skipIfRocm needs msg parameter

[NO CP] Updated changes to skip few UTs

Imported skipIfRocm in certain test suites (#1577)

Fixes SWDEV-472397

Added functions imports (#1521)

Fixes
inductor.test_torchinductor_dynamic_shapes::TestInductorDynamicCUDA::test_item_unbacked_stride_nobreak_cuda
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this issue Oct 11, 2024
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this issue Oct 11, 2024
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this issue Oct 11, 2024
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this issue Nov 19, 2024
pruthvistony added a commit to ROCm/pytorch that referenced this issue Dec 2, 2024
pruthvistony added a commit to ROCm/pytorch that referenced this issue Dec 21, 2024
dnikolaev-amd pushed a commit to ROCm/pytorch that referenced this issue Apr 17, 2025
dnikolaev-amd pushed a commit to ROCm/pytorch that referenced this issue Apr 24, 2025
@jithunnair-amd

@wconstab This is a longstanding issue that we would like to get closure on.

@wconstab

This might be related: #152300
cc @fduwjj
