Add timeout to init_process_group in entrypoint #43

Closed
@apoorvkh

Question: what's the longest a distributed operation should reasonably take?
How long would it take to "all-gather" a large amount of memory (like 80 GB)?
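A rough way to bound the answer, as a back-of-the-envelope sketch rather than a measurement: a ring-style all-gather has each rank receive roughly (N-1)/N of the gathered data, so a bandwidth-only lower bound is that volume divided by the per-rank link bandwidth. The function name and the bandwidth figure below are illustrative assumptions, and the estimate ignores latency, stragglers, and host/device copies.

```python
def allgather_seconds(total_gb: float, world_size: int, link_gb_per_s: float) -> float:
    """Bandwidth-only lower bound for a ring all-gather.

    total_gb: size of the fully gathered result on each rank (GB).
    link_gb_per_s: assumed sustained per-rank bandwidth (GB/s); hypothetical value.
    """
    if world_size < 1 or link_gb_per_s <= 0:
        raise ValueError("world_size must be >= 1 and bandwidth must be positive")
    # Each rank already holds 1/N of the data and must receive the remaining (N-1)/N.
    fraction_received = (world_size - 1) / world_size
    return fraction_received * total_gb / link_gb_per_s


# Example: 80 GB gathered across 8 ranks over an assumed 10 GB/s link.
estimate = allgather_seconds(80, 8, 10)
print(f"{estimate:.1f} s")
```

Under these assumptions the transfer itself takes seconds, not minutes, which suggests a timeout in the low hundreds of seconds leaves headroom. A slower interconnect changes the picture proportionally.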

Let's set a smaller default timeout, maybe 180 seconds?
We can then expose an argument to override it.

dist.init_process_group(
    backend=backend, world_size=worker_args.world_size, rank=worker_args.rank, store=store
)
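One possible shape for the change, as a sketch only: `torch.distributed.init_process_group` accepts a `timeout` argument as a `datetime.timedelta`, so the entrypoint could build that value from a number of seconds with a default of 180. The names `DEFAULT_TIMEOUT_SECONDS`, `init_kwargs`, and `init_distributed` are hypothetical and not part of the project; the 180-second default is the value floated above, not a tested recommendation.

```python
from datetime import timedelta

# Hypothetical default, taken from the suggestion in this issue.
DEFAULT_TIMEOUT_SECONDS = 180


def init_kwargs(backend, world_size, rank, store, timeout_seconds=DEFAULT_TIMEOUT_SECONDS):
    """Build keyword arguments for init_process_group, including the timeout."""
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    return {
        "backend": backend,
        "world_size": world_size,
        "rank": rank,
        "store": store,
        # init_process_group expects a timedelta, not a bare number of seconds.
        "timeout": timedelta(seconds=timeout_seconds),
    }


def init_distributed(backend, world_size, rank, store, timeout_seconds=DEFAULT_TIMEOUT_SECONDS):
    """Initialize the default process group with an overridable timeout."""
    # Imported lazily so this module can be loaded on machines without PyTorch.
    import torch.distributed as dist

    dist.init_process_group(**init_kwargs(backend, world_size, rank, store, timeout_seconds))
```

A caller who expects long-running collectives could then pass a larger value, for example `init_distributed(backend, worker_args.world_size, worker_args.rank, store, timeout_seconds=600)`.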
