Summary
We are intermittently seeing MPI initialization failures when running RCCL multi-node CI on the Ruby MI350 cluster. The failures appear to occur only on a subset of nodes.
Because of this instability, we should temporarily disable the RCCL multi-node tests and re-enable them once either:
- The Ruby MI350 cluster is stable enough for CI usage, or
- A smaller partition/subset of known-good nodes is identified for these CI runs.
Example Failure
Example failed run:
https://github.com/ROCm/rocm-systems/actions/runs/26777069895/job/79118394500?pr=6137
The job log shows MPI startup failures during the multi-node RCCL test execution.
Nodes Seen in the Example Failure
The following nodes were involved in the example failed run and need further debugging:
cv350-rck-g03-c18-18.rck.dcgpu
cv350-rck-g03-c21-08.rck.dcgpu
cv350-rck-g03-c21-18.rck.dcgpu
cv350-rck-g03-e14-18.rck.dcgpu
Unique Ruby Cluster Nodes Observed With This Issue
Across recent failed runs, the following unique Ruby cluster nodes have been observed in jobs showing this issue (the list includes the four nodes from the example run above):
cv350-rck-g03-c11-18.rck.dcgpu
cv350-rck-g03-c18-18.rck.dcgpu
cv350-rck-g03-c21-08.rck.dcgpu
cv350-rck-g03-c21-18.rck.dcgpu
cv350-rck-g03-e14-18.rck.dcgpu
cv350-rck-g03-e15-18.rck.dcgpu
cv350-rck-g03-e21-18.rck.dcgpu
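To help with identifying a known-good subset, the node lists above can be merged into a single de-duplicated exclude list. The following is a minimal sketch, not part of the existing CI: the function name `build_exclude_list` is hypothetical, and the assumption that the resulting comma-separated string can be handed to the cluster scheduler (for example Slurm's `--exclude` option, if Ruby uses Slurm) has not been verified.

```python
# Nodes copied from the lists in this issue.
EXAMPLE_RUN_NODES = [
    "cv350-rck-g03-c18-18.rck.dcgpu",
    "cv350-rck-g03-c21-08.rck.dcgpu",
    "cv350-rck-g03-c21-18.rck.dcgpu",
    "cv350-rck-g03-e14-18.rck.dcgpu",
]

OBSERVED_NODES = [
    "cv350-rck-g03-c11-18.rck.dcgpu",
    "cv350-rck-g03-c18-18.rck.dcgpu",
    "cv350-rck-g03-c21-08.rck.dcgpu",
    "cv350-rck-g03-c21-18.rck.dcgpu",
    "cv350-rck-g03-e14-18.rck.dcgpu",
    "cv350-rck-g03-e15-18.rck.dcgpu",
    "cv350-rck-g03-e21-18.rck.dcgpu",
]


def build_exclude_list(*node_lists):
    """Merge node lists into a sorted, de-duplicated, comma-separated string.

    Hypothetical helper: the output format assumes the scheduler accepts a
    comma-separated host list.
    """
    unique = sorted(
        {node.strip() for nodes in node_lists for node in nodes if node.strip()}
    )
    return ",".join(unique)


exclude = build_exclude_list(EXAMPLE_RUN_NODES, OBSERVED_NODES)
print(exclude)
```

As further failed runs are triaged, their node lists can be passed as additional arguments so the exclude list stays in sync with this issue.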
Proposed Action
Temporarily disable RCCL multi-node CI tests on the Ruby MI350 cluster until the cluster instability is resolved or a stable CI-specific partition/node subset is identified.
Once a stable set of nodes is available, the multi-node tests can be re-enabled.