Closed
Description
Question: what's the longest a distributed operation should reasonably take?
How long would it take to "all-gather" a large amount of memory (like 80 GB)?
Let's set a smaller default timeout... maybe 180 seconds?
And then we can pass an argument to override this.
torchrunx/src/torchrunx/agent.py
Lines 83 to 85 in cd1a895
Metadata
Metadata
Assignees
Labels
No labels