``beginner_source/ddp_series_fault_tolerance.rst``
`Introduction <ddp_series_intro.html>`__ \|\|
`What is DDP <ddp_series_theory.html>`__ \|\|
`Single-Node Multi-GPU Training <ddp_series_multigpu.html>`__ \|\|
**Fault Tolerance** \|\|
`Multi-Node training <../intermediate/ddp_series_multinode.html>`__ \|\|
`minGPT Training <../intermediate/ddp_series_minGPT.html>`__

Fault-tolerant Distributed Training with ``torchrun``
======================================================

Why use ``torchrun``
~~~~~~~~~~~~~~~~~~~~

don't need to. For instance,

- You don't need to set environment variables or explicitly pass the ``rank`` and ``world_size``; ``torchrun`` assigns this along with several other `environment variables <https://pytorch.org/docs/stable/elastic/run.html#environment-variables>`__.
- No need to call ``mp.spawn`` in your script; you only need a generic ``main()`` entry point, and launch the script with ``torchrun``. This way the same script can be run in non-distributed as well as single-node and multinode setups.
- Gracefully restarting training from the last saved training snapshot.

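As a rough illustration of the first point, here is a minimal sketch (not part of the tutorial's own code) of how a script launched by ``torchrun`` can pick up its rank from the environment, falling back to single-process defaults so the same entry point also runs without ``torchrun``:

```python
import os

def get_dist_info():
    # torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE (among other
    # variables) to every worker it launches; fall back to single-process
    # defaults so the same script also works when run directly with python.
    rank = int(os.environ.get("RANK", 0))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    return rank, local_rank, world_size

rank, local_rank, world_size = get_dist_info()
print(f"rank {rank} of {world_size} (local rank {local_rank})")
```

Launching this with ``torchrun --nproc_per_node=4 script.py`` would print one line per worker, each with a distinct rank, without any ``mp.spawn`` call in the script itself.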
Graceful restarts
~~~~~~~~~~~~~~~~~

For graceful restarts, you should structure your train script like:

       save_snapshot(snapshot_path)

If a failure occurs, ``torchrun`` will terminate all the processes and restart them.
Each process entry point first loads and initializes the last saved snapshot, and continues training from there.
So at any failure, you only lose the training progress from the last saved snapshot.
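A framework-agnostic sketch of this save/load-snapshot structure is below; the helper names are illustrative (in the tutorial's real code, ``torch.save``/``torch.load`` would persist model and optimizer state, for which ``pickle`` stands in here):

```python
import os
import pickle
import tempfile

# Hypothetical snapshot location; any shared, durable path works.
snapshot_path = os.path.join(tempfile.gettempdir(), "train_snapshot.pkl")

def save_snapshot(path, state):
    # Persist enough state to resume: epoch counter, model/optimizer state.
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_snapshot(path):
    # Resume from the last snapshot if one exists, otherwise start fresh.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"epoch": 0}

state = load_snapshot(snapshot_path)
for epoch in range(state["epoch"], 3):
    # ... run one epoch of training here ...
    state["epoch"] = epoch + 1
    save_snapshot(snapshot_path, state)  # checkpoint progress each epoch
```

Because every restarted process begins by calling ``load_snapshot``, a crash at any point costs at most the work done since the last checkpoint.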

In elastic training, whenever there are any membership changes (adding or removing nodes), ``torchrun`` will terminate and spawn processes
Process group initialization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- ``torchrun`` assigns ``RANK`` and ``WORLD_SIZE`` automatically,
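In practice this means process-group setup can be reduced to something like the sketch below (assuming PyTorch is installed; the ``TORCHELASTIC_RUN_ID`` guard is one way to detect a ``torchrun`` launch):

```python
import os

def ddp_setup():
    # Under torchrun, init_process_group defaults to the "env://" method
    # and reads RANK and WORLD_SIZE from the environment, so neither has
    # to be passed explicitly.
    import torch.distributed as dist
    dist.init_process_group(backend="nccl")

# torchrun sets TORCHELASTIC_RUN_ID for the workers it launches; when run
# directly with python, this guard skips distributed initialization.
if os.environ.get("TORCHELASTIC_RUN_ID"):
    ddp_setup()
```

Compare this with the manual setup from the earlier parts of the series, where ``rank`` and ``world_size`` had to be computed and passed to ``init_process_group`` by hand.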