Describe the bug
Our end-to-end tests fail on main with errors reporting that NCCL operations have failed or timed out.
To Reproduce
Steps to reproduce the behavior:
- Run the end-to-end tests, for example:
E2E (NVIDIA L40S x4)
- The job will fail about 3 hours in.
The logs look like this (example):
[rank2]:[E428 18:32:18.066672558 ProcessGroupNCCL.cpp:629] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5111, OpType=_ALLGATHER_BASE, NumelIn=65553408, NumelOut=262213632, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
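For triage, the fields in that watchdog line can be pulled out with a small script. This is a minimal sketch, not part of the test suite: the regular expression and the `parse_watchdog_timeout` helper are hypothetical and written only against the single message shape shown above, so other PyTorch versions may format the line differently.

```python
import re

# Log line copied verbatim from the report above.
LOG_LINE = (
    "[rank2]:[E428 18:32:18.066672558 ProcessGroupNCCL.cpp:629] [Rank 2] "
    "Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5111, "
    "OpType=_ALLGATHER_BASE, NumelIn=65553408, NumelOut=262213632, "
    "Timeout(ms)=600000) ran for 600009 milliseconds before timing out."
)

# Hypothetical pattern: matches only the message shape seen in this report.
PATTERN = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) "
    r"ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_timeout(line):
    """Return the fields of a watchdog timeout line, or None if it does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Every captured field is numeric except the operation name.
    return {
        key: (value if key == "op" else int(value))
        for key, value in match.groupdict().items()
    }


info = parse_watchdog_timeout(LOG_LINE)
print(info["op"], "seq", info["seq"], "on rank", info["rank"])
print("timeout in minutes:", info["timeout_ms"] / 60000)
print("NumelOut / NumelIn:", info["numel_out"] / info["numel_in"])
```

On the line above this reports a 10-minute timeout and an output-to-input ratio of exactly 4, which is consistent with an all-gather across the four GPUs of this job. That only describes the collective that was pending when the watchdog fired; it does not by itself identify which rank stalled or why.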
Expected behavior
The end-to-end jobs pass on the main branch.