🚀 The feature, motivation and pitch
In #154055, NCCL non-blocking mode was turned off to unblock the PyTorch release after hangs were seen with NCCL 2.26.
A proper fix should add thread safety between the main thread and the watchdog thread when they call ncclCommGetAsyncError, since NCCL provides no thread-safety guarantee for concurrent calls.
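
One possible shape for such a fix is to route every async-error query (ncclCommGetAsyncError in the NCCL API) through a wrapper that holds a per-communicator mutex, so the main thread and the watchdog thread never enter NCCL at the same time. The sketch below is only an illustration of that idea, not the actual PyTorch change: `FakeComm`, `FakeResult`, `fakeGetAsyncError`, `GuardedComm`, and `pollFromTwoThreads` are hypothetical stand-ins so the example compiles without `nccl.h`.

```cpp
#include <mutex>
#include <thread>

// Hypothetical stand-ins for NCCL types, so the sketch builds without nccl.h.
enum FakeResult { kSuccess = 0, kInProgress = 1 };

struct FakeComm {
  int polls = 0;  // unsynchronized state that a query touches
};

// Hypothetical stand-in for the NCCL async-error query.
// Like the real call, it is assumed NOT to be safe for concurrent use.
FakeResult fakeGetAsyncError(FakeComm* comm, FakeResult* asyncError) {
  comm->polls++;  // would be a data race if two threads ran this at once
  *asyncError = kSuccess;
  return kSuccess;
}

// Wrapper that owns the communicator and serializes every query on it.
class GuardedComm {
 public:
  FakeResult checkAsyncError() {
    std::lock_guard<std::mutex> lock(mutex_);  // one caller at a time
    FakeResult state = kSuccess;
    fakeGetAsyncError(&comm_, &state);
    return state;
  }

  int polls() {
    std::lock_guard<std::mutex> lock(mutex_);
    return comm_.polls;
  }

 private:
  std::mutex mutex_;
  FakeComm comm_;
};

// Simulates the main thread and a watchdog thread polling the same communicator.
int pollFromTwoThreads(int iterations) {
  GuardedComm comm;
  auto poll = [&comm, iterations] {
    for (int i = 0; i < iterations; ++i) {
      comm.checkAsyncError();
    }
  };
  std::thread watchdog(poll);  // watchdog thread
  poll();                      // main thread
  watchdog.join();
  return comm.polls();
}
```

Calling `pollFromTwoThreads(1000)` runs both threads against one communicator; because every query goes through the lock, no two calls overlap. A single mutex per communicator is the simplest choice. Whether it is sufficient in non-blocking mode depends on which other NCCL calls on the same communicator also need to be serialized, which this sketch does not address.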
Alternatives
No response
Additional context
No response
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci