[Issue] Torch tests hanging and hitting 6 hour timeouts on Windows test runners #5565

@ScottTodd

Overview

PyTorch unit tests are hanging on multiple test runners, across multiple torch versions. This does not appear to be a recent regression.

Symptoms / evidence / details

Workflow run: https://github.com/ROCm/TheRock/actions/runs/26707885453, using ROCm version 7.14.0a20260531

gfx1151: CS-RORDMZ-DT244 runner

https://github.com/ROCm/TheRock/actions/runs/26738630672/job/78812375372

gfx110X-all: azure-windows-11-gfx1101 runners

Index: https://rocm.nightlies.amd.com/v2-staging/gfx110X-all/

  • Appears to affect only torch versions 2.11, 2.12, and nightly.
  • Not affected:
    • torch version 2.9, tests segfaulted: https://github.com/ROCm/TheRock/actions/runs/26707885453/job/78783769908#step:13:3270

      external-builds\pytorch\pytorch\test\test_cuda.py::TestBlockStateAbsorption::test_tensor_dies_after_checkpoint SKIPPED [0.0001s] [  8%]
      external-builds\pytorch\pytorch\test\test_cuda.py::TestMemPool::test_graph_capture_reclaim_2_streams PASSED [0.0039s] [  8%]
      Windows fatal exception: access violation
      
      Thread 0x00001e94 (most recent call first):
        <no Python frame>
      
      Current thread 0x0000182c (most recent call first):
        File "B:\runner\_work\TheRock\TheRock\external-builds\pytorch\pytorch\test\test_cuda.py", line 5675 in test_graph_capture_reclaim_4_streams
      
    • torch version 2.10, tests completed: https://github.com/ROCm/TheRock/actions/runs/26707885453/job/78783770046#step:13:40191

      FAILED [0.0016s] external-builds\pytorch\pytorch\test\test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest2d_launch_rocm_cuda - RuntimeError: input tensor has spatial dimension larger than the kernel capacity
        = 1 failed, 15541 passed, 24534 skipped, 301 deselected, 44 xfailed, 2 subtests passed in 783.13s (0:13:03) =
      

Observed on torch version e.g. 2.10.0+rocm7.14.0a20260531

Jobs and log snippets:

  • https://github.com/ROCm/TheRock/actions/runs/26707885453/job/78783769930
    Mon, 01 Jun 2026 05:36:55 GMT external-builds\pytorch\pytorch\test\test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_label_smoothing_with_probs_cuda PASSED [0.0178s] [  2%]
    Mon, 01 Jun 2026 05:36:55 GMT external-builds\pytorch\pytorch\test\test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_large_tensor_reduction_mean_cuda SKIPPED [0.4036s] [  2%]
    Mon, 01 Jun 2026 05:36:56 GMT external-builds\pytorch\pytorch\test\test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_large_tensor_reduction_none_cuda SKIPPED [0.4747s] [  2%]
    Mon, 01 Jun 2026 05:36:56 GMT external-builds\pytorch\pytorch\test\test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_large_tensor_reduction_sum_cuda SKIPPED [0.3975s] [  2%]
    Mon, 01 Jun 2026 11:29:09 GMT Error: The operation was canceled.
    
  • https://github.com/ROCm/TheRock/actions/runs/26707885453/job/78783770048
    Mon, 01 Jun 2026 13:01:38 GMT external-builds\pytorch\pytorch\test\test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_label_smoothing_with_probs_cuda PASSED [0.0180s] [  2%]
    Mon, 01 Jun 2026 13:01:38 GMT external-builds\pytorch\pytorch\test\test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_large_tensor_reduction_mean_cuda SKIPPED [0.3561s] [  2%]
    Mon, 01 Jun 2026 13:01:39 GMT external-builds\pytorch\pytorch\test\test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_large_tensor_reduction_none_cuda SKIPPED [0.3301s] [  2%]
    Mon, 01 Jun 2026 13:01:39 GMT external-builds\pytorch\pytorch\test\test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_large_tensor_reduction_sum_cuda SKIPPED [0.3322s] [  2%]
    Mon, 01 Jun 2026 18:54:33 GMT Error: The operation was canceled.
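
The silent period before each cancellation can be measured directly from the timestamps in the snippets above. The following is a minimal sketch in Python, assuming log lines keep the `Day, DD Mon YYYY HH:MM:SS GMT` prefix shown in these snippets. The name `longest_silent_gap` and the `JOB_LOG` sample are illustrative only and are not part of TheRock or PyTorch tooling.

```python
import re
from email.utils import parsedate_to_datetime

# Timestamp prefix as it appears in the log snippets quoted in this issue.
LINE_RE = re.compile(r"^(\w{3}, \d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2} GMT) (.*)$")


def longest_silent_gap(log_lines):
    """Return (gap, message_before_gap) for the longest pause between
    consecutive timestamped lines, or None if fewer than two lines match."""
    stamped = []
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if match:
            stamped.append((parsedate_to_datetime(match.group(1)), match.group(2)))

    best = None
    for (t0, msg0), (t1, _) in zip(stamped, stamped[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, msg0)
    return best


# Last two lines of the first job snippet (job 78783769930), copied from above.
JOB_LOG = [
    r"Mon, 01 Jun 2026 05:36:56 GMT external-builds\pytorch\pytorch\test\test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_large_tensor_reduction_sum_cuda SKIPPED [0.3975s] [  2%]",
    r"Mon, 01 Jun 2026 11:29:09 GMT Error: The operation was canceled.",
]

if __name__ == "__main__":
    gap, last_message = longest_silent_gap(JOB_LOG)
    print(gap)           # 5:52:13
    print(last_message)  # the last test result reported before the pause
```

Applied to the two snippets, this gives a silent period of 5:52:13 for the first job and 5:52:54 for the second, both directly after `test_cross_entropy_large_tensor_reduction_sum_cuda` reports SKIPPED. The log does not name the test that ran next, so the sketch only bounds when the hang began, not which test caused it.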
    

Metadata

Labels

  • ecosystem: PyTorch (issue pertains to PyTorch and related libraries)
  • gfx110X-all (issue/PR related to the gfx110X-all family)
  • gfx1151 (issue/PR relates to gfx1151)
  • platform: Windows