[Issue] Linux PyTorch gfx950-dcgpu: 8 test failures on torch nightly (2.13) #5596

@chiranjeevipattigidi

Context

  • Workflow: Release portable Linux PyTorch Wheels
  • Workflow file: .github/workflows/release_portable_linux_pytorch_wheels.yml
  • Failing run: ↗ View run
  • Platform: Linux
  • Impacted Arch: gfx950-dcgpu
  • PyTorch Version: nightly (2.13.0a0+rocm7.14.0a20260602)
  • Python Versions: 3.10, 3.11, 3.13, 3.14

Failed Tests (8)

test_nn.py::TestNNDeviceTypeCUDA::test_CTCLoss_cudnn_cuda
test_nn.py::TestNNDeviceTypeCUDA::test_LSTM_dropout_per_call_randomness_dropout_p_0_5_training_True_cuda
test_nn.py::TestNNDeviceTypeCUDA::test_ctc_loss_cudnn_tensor_cuda_cuda
test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest2d_launch_rocm_cuda
test_cuda.py::TestCuda::test_hip_device_count
test_cuda.py::TestCudaAllocator::test_memory_compile_regions
test_cuda.py::TestMemPool::test_mempool_empty_cache_inactive
test_cuda.py::TestMemPool::test_mempool_limited_memory_with_allocator

Root Cause

The skip-list loader in run_pytorch_tests.py detects torch version 2.13 and tries to load external-builds/pytorch/skip_tests/pytorch_2.13.py. That file does not exist, so only generic.py is loaded. All of the tests above are already marked as known failures in the stable-version skip files (pytorch_2.9.py through pytorch_2.12.py), but those exclusions do not apply to nightly.
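
To make the fallback concrete, here is a minimal sketch of a loader with the behavior described above. It is not the actual code in run_pytorch_tests.py: the function name `skip_files_for` and the version-parsing regex are assumptions, and only the file names (generic.py, pytorch_<major>.<minor>.py) come from this issue.

```python
import re
from pathlib import Path


def skip_files_for(torch_version: str, skip_dir: Path) -> list[Path]:
    """Return the skip-list files a loader of this shape would pick up.

    Hypothetical helper: illustrates the fallback, not the real loader.
    """
    files = [skip_dir / "generic.py"]  # generic entries always apply

    # Reduce e.g. "2.13.0a0+rocm7.14.0a20260602" to major/minor ("2", "13").
    match = re.match(r"(\d+)\.(\d+)", torch_version)
    if match is None:
        return files

    major, minor = match.groups()
    versioned = skip_dir / f"pytorch_{major}.{minor}.py"
    if versioned.is_file():
        files.append(versioned)  # version-specific known failures
    return files
```

With only generic.py and pytorch_2.12.py on disk, a 2.13 nightly version string yields just generic.py, which matches the symptom reported here: the known failures are no longer excluded.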

The most structurally notable error, behind two of the test failures, is a missing hipBLAS header at compile time:

/__w/TheRock/TheRock/.venv/lib/python3.13/site-packages/torch/include/ATen/hip/Exceptions.h:5:10: fatal error: 'hipblas/hipblas.h' file not found
    5 | #include <hipblas/hipblas.h>
      |          ^~~~~~~~~~~~~~~~~~~
1 error generated.
ninja: build stopped: subcommand failed.
FAILED [7.2476s]  test_cuda.py::TestMemPool::test_mempool_limited_memory_with_allocator
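
A quick way to confirm whether the header is visible on the runner is to look for it in the likely include directories. The sketch below is an assumption-laden diagnostic, not part of the workflow: the helper names are hypothetical, and the search list (ROCM_PATH or /opt/rocm, plus the include directory of the installed torch package) is a guess at where the header could live in this environment.

```python
import importlib.util
import os
from pathlib import Path


def candidate_include_dirs() -> list[Path]:
    """Guess include directories a torch extension build might search."""
    # Assumption: ROCm headers live under $ROCM_PATH/include or /opt/rocm/include.
    dirs = [Path(os.environ.get("ROCM_PATH", "/opt/rocm")) / "include"]

    # Locate the torch package without importing it, then add torch/include.
    spec = importlib.util.find_spec("torch")
    if spec is not None and spec.origin:
        dirs.append(Path(spec.origin).parent / "include")
    return dirs


def find_header(relative_header: str, include_dirs: list[Path]) -> list[Path]:
    """Return every full path where the header exists as a regular file."""
    return [d / relative_header for d in include_dirs
            if (d / relative_header).is_file()]


if __name__ == "__main__":
    hits = find_header("hipblas/hipblas.h", candidate_include_dirs())
    print(hits or "hipblas/hipblas.h not found in the searched directories")
```

An empty result would be consistent with the compiler error above. A non-empty result would suggest the header is present but its directory is not on the include path used for the extension build.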

Suggested Fix

Add external-builds/pytorch/skip_tests/pytorch_2.13.py mirroring the known-failing entries from pytorch_2.12.py for the tests listed above. The test_upsamplingNearest2d_launch_rocm_cuda failure on gfx950 is also separately tracked in #5270.
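
As a starting point for drafting the new file, the failing test IDs above can be grouped by test module. This is only a sketch: the `group_by_module` helper and the resulting dictionary layout are hypothetical, and the real pytorch_2.13.py should copy whatever schema pytorch_2.12.py already uses.

```python
from collections import defaultdict

# The eight failing pytest node IDs reported in this issue.
FAILED_TESTS = [
    "test_nn.py::TestNNDeviceTypeCUDA::test_CTCLoss_cudnn_cuda",
    "test_nn.py::TestNNDeviceTypeCUDA::test_LSTM_dropout_per_call_randomness_dropout_p_0_5_training_True_cuda",
    "test_nn.py::TestNNDeviceTypeCUDA::test_ctc_loss_cudnn_tensor_cuda_cuda",
    "test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest2d_launch_rocm_cuda",
    "test_cuda.py::TestCuda::test_hip_device_count",
    "test_cuda.py::TestCudaAllocator::test_memory_compile_regions",
    "test_cuda.py::TestMemPool::test_mempool_empty_cache_inactive",
    "test_cuda.py::TestMemPool::test_mempool_limited_memory_with_allocator",
]


def group_by_module(node_ids: list[str]) -> dict[str, list[str]]:
    """Group test names by module, e.g. 'test_nn.py' becomes the key 'nn'.

    Hypothetical layout for illustration; not the repository's schema.
    """
    grouped: dict[str, list[str]] = defaultdict(list)
    for node_id in node_ids:
        file_name, _class_name, test_name = node_id.split("::")
        module = file_name.removesuffix(".py").removeprefix("test_")
        grouped[module].append(test_name)
    return {module: sorted(tests) for module, tests in grouped.items()}


if __name__ == "__main__":
    for module, tests in group_by_module(FAILED_TESTS).items():
        print(module, len(tests))
```

Each entry should still be checked against pytorch_2.12.py before it is copied, so that the 2.13 file mirrors existing known failures rather than hiding new regressions.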
