Nccl timeout #2673

Merged: awni merged 8 commits into ml-explore:main from nastya236:nccl_timeout on Oct 14, 2025

@nastya236 (Collaborator)

Proposed changes

During multi-node runs, the master host sometimes starts later than the other nodes, so those hosts have to wait until rank 0 binds. In the previous implementation, the binding was retried 30 times at 500 ms intervals (roughly 15 seconds in total); this limit was hard-coded and not always sufficient.

  • Added env::nccl_timeout: the binding fails once the elapsed time exceeds nccl_timeout (default 300,000 ms = 5 minutes; can be overridden via the MLX_NCCL_TIMEOUT environment variable).
  • Removed an erroneous ncclGroupEnd() call (this was a bug).
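
For readers who want to see the difference between a fixed retry count and an elapsed-time deadline, here is a minimal sketch in Python. It is not the MLX implementation (that lives in the C++ NCCL backend). The helper names `timeout_ms_from_env` and `connect_with_deadline` are hypothetical, and treating the environment variable's value as milliseconds is an assumption based on the default quoted above.

```python
import os
import time

DEFAULT_TIMEOUT_MS = 300_000  # default quoted in the PR description (5 minutes)
RETRY_INTERVAL_MS = 500       # retry interval quoted for the previous implementation


def timeout_ms_from_env(environ=os.environ):
    """Read MLX_NCCL_TIMEOUT, assumed to be in milliseconds (hypothetical helper)."""
    return int(environ.get("MLX_NCCL_TIMEOUT", DEFAULT_TIMEOUT_MS))


def connect_with_deadline(try_connect, timeout_ms,
                          interval_ms=RETRY_INTERVAL_MS,
                          clock=time.monotonic, sleep=time.sleep):
    """Call try_connect() until it succeeds or timeout_ms has elapsed.

    Returns the number of attempts made. Raises TimeoutError once the
    elapsed time exceeds timeout_ms, instead of giving up after a fixed
    number of retries.
    """
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        if try_connect():
            return attempts
        elapsed_ms = (clock() - start) * 1000.0
        if elapsed_ms > timeout_ms:
            raise TimeoutError(
                f"no connection after {attempts} attempts ({elapsed_ms:.0f} ms)"
            )
        sleep(interval_ms / 1000.0)
```

With a deadline, the number of attempts is no longer fixed: a late-starting rank 0 is tolerated for as long as the configured timeout allows, and the limit can be changed without recompiling.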

@awni (Member) left a comment

Looks great, thanks for fixing it!

@awni awni merged commit e9eab52 into ml-explore:main Oct 14, 2025
1 check passed
faisalmemon pushed a commit to faisalmemon/mlx that referenced this pull request Oct 30, 2025
* print the error & delete nccl group

* timeout for nccl binding

* typo

* revert error

* fixed a typo