[Issue] Linux gfx94X-dcgpu: GPU Hang kills the tests #5589

@chiranjeevipattigidi

Description

Summary

The Test rocroller (shard 4/5) job on gfx94X-dcgpu failed in bump PR #5556 with a GPU hardware exception (reason: GPU Hang) that aborted both concurrent rocroller ctest processes at the same time (113.49 sec and 113.50 sec). The rocroller test binary is unchanged; only rocm-systems was bumped (18 commits, b8c378e84ed18c).

Context

Runners

  • linux-gfx942-1gpu-core42-ossci-rocm-jj2tj-runner-8kkt6
  • linux-gfx942-1gpu-core42-ossci-rocm-jj2tj-runner-v7hlc
  • linux-gfx942-1gpu-core42-ossci-rocm-jj2tj-runner-tjz6f
  • linux-gfx942-1gpu-core42-ossci-rocm-jj2tj-runner-4htrn

Full logs:

Failure signature

4: HW Exception by GPU node-4 (Agent handle: 0x61372ad74ae0) reason :GPU Hang
8: HW Exception by GPU node-4 (Agent handle: 0x56387a41ea30) reason :GPU Hang
shared/rocroller/test/catch/GlobalLoadStoreTest.cpp:105: FAILED:
  {Unknown expression after the reported line}
due to a fatal error condition:
  SIGABRT - Abort (abnormal termination) signal

1/2 Test #4: rocroller-tests_full_suite .........Subprocess aborted***Exception: 113.49 sec
2/2 Test #8: rocroller-tests-catch_full_suite ...Subprocess aborted***Exception: 113.50 sec
0% tests passed, 2 tests failed out of 2
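
For triaging similar runs, the signature above can be checked mechanically. The following is a minimal sketch, not part of the CI tooling: it assumes the log keeps the `<ctest id>: HW Exception by GPU node-N (Agent handle: 0x...) reason :...` line shape shown above, and the names `HW_EXCEPTION`, `group_hw_exceptions`, and `SAMPLE` are hypothetical. It groups HW exception lines by GPU node so it is easy to see when several ctest processes hit the same node.

```python
import re
from collections import defaultdict

# Assumed line shape, copied from the failure signature in this issue:
#   4: HW Exception by GPU node-4 (Agent handle: 0x61372ad74ae0) reason :GPU Hang
HW_EXCEPTION = re.compile(
    r"^(?P<test>\d+): HW Exception by GPU node-(?P<node>\d+) "
    r"\(Agent handle: (?P<handle>0x[0-9a-fA-F]+)\) reason :(?P<reason>.+)$"
)


def group_hw_exceptions(log_text):
    """Map each GPU node number to a list of (ctest id, agent handle, reason)."""
    by_node = defaultdict(list)
    for line in log_text.splitlines():
        match = HW_EXCEPTION.match(line.strip())
        if match:
            by_node[int(match["node"])].append(
                (int(match["test"]), match["handle"], match["reason"].strip())
            )
    return dict(by_node)


# The two HW exception lines reported in this issue.
SAMPLE = """\
4: HW Exception by GPU node-4 (Agent handle: 0x61372ad74ae0) reason :GPU Hang
8: HW Exception by GPU node-4 (Agent handle: 0x56387a41ea30) reason :GPU Hang
"""

if __name__ == "__main__":
    for node, hits in group_hw_exceptions(SAMPLE).items():
        tests = sorted(test_id for test_id, _, _ in hits)
        print(f"node-{node}: ctest ids {tests} reported a HW exception")
```

On the two lines from this report, both ctest ids (4 and 8) land under node-4, which matches the observation that one GPU hang took down both concurrent processes.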
