[c10d] Add thread safety when calling ncclGetAsyncError #169484

@kwen2501

Description

🚀 The feature, motivation and pitch

In #154055, NCCL non-blocking mode was turned off to unblock the PyTorch release after hangs were seen with NCCL 2.26.

A proper fix should add thread safety between the main and watchdog threads when they call ncclGetAsyncError, as NCCL provides no thread-safety guarantee.
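For illustration, one possible shape of such a fix is to route every async-error query for a communicator through a single mutex, so the main and watchdog threads can never be inside the call at the same time. The sketch below is not the c10d implementation: `FakeComm`, `FakeResult`, `fakeGetAsyncError`, `GuardedComm`, and `runTwoThreadDemo` are hypothetical stand-ins so the example compiles without NCCL. In real code the guarded call would be NCCL's `ncclCommGetAsyncError(ncclComm_t, ncclResult_t*)`.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for NCCL types, so the sketch builds without NCCL.
enum FakeResult { kSuccess = 0, kInProgress = 1 };

struct FakeComm {
  std::atomic<int> inFlight{0};     // callers currently inside the query
  std::atomic<int> maxInFlight{0};  // highest concurrency observed so far
  std::atomic<int> calls{0};        // total number of queries made
};

// Stand-in for a query that is assumed NOT to be thread-safe,
// such as ncclCommGetAsyncError. It records how many callers overlap.
FakeResult fakeGetAsyncError(FakeComm& comm, FakeResult* out) {
  int now = ++comm.inFlight;
  int prev = comm.maxInFlight.load();
  while (now > prev && !comm.maxInFlight.compare_exchange_weak(prev, now)) {
  }
  std::this_thread::yield();  // widen the window for a potential overlap
  ++comm.calls;
  --comm.inFlight;
  *out = kSuccess;
  return kSuccess;
}

// Hypothetical wrapper: all error queries for this communicator
// are serialized through one mutex.
class GuardedComm {
 public:
  FakeResult checkAsyncError() {
    std::lock_guard<std::mutex> lock(mutex_);
    FakeResult state = kInProgress;
    fakeGetAsyncError(comm_, &state);
    return state;
  }
  int maxConcurrentCalls() const { return comm_.maxInFlight.load(); }
  int totalCalls() const { return comm_.calls.load(); }

 private:
  std::mutex mutex_;
  FakeComm comm_;
};

// Polls from two threads standing in for "main" and "watchdog" and
// returns the highest number of callers seen inside the query at once.
int runTwoThreadDemo(GuardedComm& guarded, int itersPerThread) {
  assert(itersPerThread >= 0);
  auto poll = [&guarded, itersPerThread] {
    for (int i = 0; i < itersPerThread; ++i) guarded.checkAsyncError();
  };
  std::thread watchdog(poll);  // watchdog thread
  poll();                      // main thread
  watchdog.join();
  return guarded.maxConcurrentCalls();
}
```

With the lock in place, the observed concurrency inside the query should never exceed one, no matter how the two threads interleave. A per-communicator mutex keeps contention local, but whether other NCCL calls on the same communicator would also need to take the same lock is left open here.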

Alternatives

No response

Additional context

No response

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci

Metadata

Assignees

Labels

module: c10d (Issues/PRs related to collective communications and process groups)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)

Type

No fields configured for Bug.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests