Disable RCCL multi-node CI on the MI350 Ruby cluster until the 64-node cluster is stable #5583

@araravik-psd

Description

Summary

We are intermittently seeing MPI initialization failures when running RCCL multi-node CI on the Ruby MI350 cluster. The failures appear to occur only on a subset of nodes.

Because of this instability, we should temporarily disable the RCCL multi-node tests and re-enable them once either:

  1. The Ruby MI350 cluster is stable enough for CI usage, or
  2. A smaller partition/subset of known-good nodes is identified for these CI runs.

Example Failure

Example failed run:

https://github.com/ROCm/rocm-systems/actions/runs/26777069895/job/79118394500?pr=6137

In this run, the MPI launch fails at startup while the multi-node RCCL tests are being executed.

Nodes Seen in the Example Failure

The following nodes were involved in the example failed run and need further debugging:

cv350-rck-g03-c18-18.rck.dcgpu
cv350-rck-g03-c21-08.rck.dcgpu
cv350-rck-g03-c21-18.rck.dcgpu
cv350-rck-g03-e14-18.rck.dcgpu

Unique Ruby Cluster Nodes Observed With This Issue

Across recent failed runs, the following unique Ruby cluster nodes have been observed in jobs showing this issue:

cv350-rck-g03-c11-18.rck.dcgpu
cv350-rck-g03-c18-18.rck.dcgpu
cv350-rck-g03-c21-08.rck.dcgpu
cv350-rck-g03-c21-18.rck.dcgpu
cv350-rck-g03-e14-18.rck.dcgpu
cv350-rck-g03-e15-18.rck.dcgpu
cv350-rck-g03-e21-18.rck.dcgpu
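
To make the relationship between the two node lists explicit, here is a minimal Python sketch. The node names are copied from this issue; the variable names are illustrative only. The last step assumes the CI scheduler accepts a comma-separated host exclusion list (for example Slurm's `--exclude` option); this issue does not say which scheduler the Ruby cluster uses.

```python
# Nodes involved in the linked example failed run (copied from this issue).
example_run_nodes = {
    "cv350-rck-g03-c18-18.rck.dcgpu",
    "cv350-rck-g03-c21-08.rck.dcgpu",
    "cv350-rck-g03-c21-18.rck.dcgpu",
    "cv350-rck-g03-e14-18.rck.dcgpu",
}

# All unique nodes observed across recent failed runs (copied from this issue).
all_observed_nodes = {
    "cv350-rck-g03-c11-18.rck.dcgpu",
    "cv350-rck-g03-c18-18.rck.dcgpu",
    "cv350-rck-g03-c21-08.rck.dcgpu",
    "cv350-rck-g03-c21-18.rck.dcgpu",
    "cv350-rck-g03-e14-18.rck.dcgpu",
    "cv350-rck-g03-e15-18.rck.dcgpu",
    "cv350-rck-g03-e21-18.rck.dcgpu",
}

# Nodes that showed the issue only in runs other than the linked example.
only_in_other_runs = sorted(all_observed_nodes - example_run_nodes)

# Assumption: the scheduler takes a comma-separated exclusion list.
exclude_arg = ",".join(sorted(all_observed_nodes))

for node in only_in_other_runs:
    print(node)
```

The set difference shows that three of the seven nodes (c11-18, e15-18, e21-18) were not part of the linked example run, so debugging should not be limited to that one job.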

Proposed Action

Temporarily disable RCCL multi-node CI tests on the Ruby MI350 cluster until the cluster instability is resolved or a stable CI-specific partition/node subset is identified.

Once a stable set of nodes is available, the multi-node tests can be re-enabled.
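
One possible way to implement the disable and later re-enable step is sketched below. This is not how the existing CI is wired; the environment variable `RCCL_MULTINODE_CI` and both function names are hypothetical, chosen only to illustrate an opt-in gate plus selection from a known-good node subset.

```python
import os


def multinode_tests_enabled(env=None):
    """Return True only when the (hypothetical) opt-in flag is set to "1".

    Defaulting to disabled means the tests stay off until someone
    explicitly re-enables them.
    """
    env = os.environ if env is None else env
    return env.get("RCCL_MULTINODE_CI", "0") == "1"


def pick_nodes(known_good, suspect, needed):
    """Pick `needed` nodes from `known_good`, never using a suspect node.

    Returns a sorted list of node names, or None when too few
    known-good nodes remain to run the job.
    """
    candidates = sorted(set(known_good) - set(suspect))
    if len(candidates) < needed:
        return None
    return candidates[:needed]
```

With a gate like this, re-enabling the tests is a one-line configuration change once a stable node subset is identified, and a job that cannot get enough known-good nodes is skipped rather than failing during MPI startup.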

Metadata

Assignees: No one assigned
Labels: disabled-test (this label indicates that an issue includes a disabled test, and needs to be re-enabled after fixing)
Status: TODO
Milestone: No milestone