@@ -7,8 +7,9 @@ Shard Optimizer States with ZeroRedundancyOptimizer

 In this recipe, you will learn:

-- The high-level idea of ``ZeroRedundancyOptimizer``.
-- How to use ``ZeroRedundancyOptimizer`` in distributed training and its impact.
+- The high-level idea of `ZeroRedundancyOptimizer <https://pytorch.org/docs/master/distributed.optim.html>`__.
+- How to use `ZeroRedundancyOptimizer <https://pytorch.org/docs/master/distributed.optim.html>`__
+  in distributed training and its impact.


 Requirements
@@ -21,8 +22,8 @@ Requirements
 What is ``ZeroRedundancyOptimizer``?
 ------------------------------------

-The idea of ``ZeroRedundancyOptimizer`` comes from
-`DeepSpeed/ZeRO project <https://github.com/microsoft/DeepSpeed>`_ and
+The idea of `ZeroRedundancyOptimizer <https://pytorch.org/docs/master/distributed.optim.html>`__
+comes from `DeepSpeed/ZeRO project <https://github.com/microsoft/DeepSpeed>`_ and
 `Marian <https://github.com/marian-nmt/marian-dev>`_ that shard
 optimizer states across distributed data-parallel processes to
 reduce per-process memory footprint. In the
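
As a back-of-envelope illustration of the savings described in the hunk above (my arithmetic, not part of the patch): Adam keeps two fp32 state tensors per parameter (``exp_avg`` and ``exp_avg_sq``), so each rank holds roughly 8 bytes of optimizer state per parameter without sharding, and roughly that divided by the world size once the states are sharded. A minimal sketch, assuming fp32 states and ignoring bookkeeping overhead:

::

    # Rough Adam optimizer-state memory per rank (illustrative assumptions:
    # fp32 states, two state tensors per parameter, no overhead counted).
    def adam_state_bytes(num_params: int, world_size: int = 1) -> int:
        bytes_per_param = 2 * 4  # exp_avg + exp_avg_sq, 4 bytes each
        return num_params * bytes_per_param // world_size

    print(adam_state_bytes(10**9))                # ~8 GB per rank, unsharded
    print(adam_state_bytes(10**9, world_size=8))  # ~1 GB per rank, sharded 8 ways
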
@@ -47,12 +48,14 @@ processes, so that all model replicas still land in the same state.
 How to use ``ZeroRedundancyOptimizer``?
 ---------------------------------------

-The code below demonstrates how to use ``ZeroRedundancyOptimizer``. The majority
-of the code is similar to the simple DDP example presented in
+The code below demonstrates how to use
+`ZeroRedundancyOptimizer <https://pytorch.org/docs/master/distributed.optim.html>`__.
+The majority of the code is similar to the simple DDP example presented in
 `Distributed Data Parallel notes <https://pytorch.org/docs/stable/notes/ddp.html>`_.
 The main difference is the ``if-else`` clause in the ``example`` function which
-wraps optimizer constructions, toggling between ``ZeroRedundancyOptimizer`` and
-``Adam`` optimizer.
+wraps optimizer constructions, toggling between
+`ZeroRedundancyOptimizer <https://pytorch.org/docs/master/distributed.optim.html>`__
+and ``Adam`` optimizer.


 ::
@@ -91,7 +94,7 @@ wraps optimizer constructions, toggling between ``ZeroRedundancyOptimizer`` and
         if use_zero:
             optimizer = ZeroRedundancyOptimizer(
                 ddp_model.parameters(),
-                optim=torch.optim.Adam,
+                optimizer_class=torch.optim.Adam,
                 lr=0.01
             )
         else:
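
For quick reference outside the diff, here is a minimal sketch of the construction pattern the last hunk settles on, using the renamed ``optimizer_class`` argument. The rendezvous settings (single node, ``gloo`` backend, port 29500) and the toy ``nn.Linear`` model are assumptions for illustration; the recipe's full example covers the surrounding training loop.

::

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import torch.nn as nn
    from torch.distributed.optim import ZeroRedundancyOptimizer
    from torch.nn.parallel import DistributedDataParallel as DDP

    def example(rank, world_size):
        # Hypothetical single-node rendezvous; adjust address/port as needed.
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

        ddp_model = DDP(nn.Linear(2000, 2000))

        # Each rank materializes only its shard of the Adam states.
        optimizer = ZeroRedundancyOptimizer(
            ddp_model.parameters(),
            optimizer_class=torch.optim.Adam,  # renamed from ``optim``
            lr=0.01,
        )

        ddp_model(torch.randn(20, 2000)).sum().backward()
        optimizer.step()  # steps the local shard, then syncs parameters

        dist.destroy_process_group()

    if __name__ == "__main__":
        world_size = 2
        mp.spawn(example, args=(world_size,), nprocs=world_size, join=True)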