|
11 | 11 |
|
12 | 12 | # TODO: Update overflow check + downscale to use Carl's fused kernel. |
13 | 13 | class FP16_Optimizer(object): |
14 | | - """ |
15 | | - :class:`FP16_Optimizer` is designed to wrap an existing PyTorch optimizer, |
16 | | - and manage static or dynamic loss scaling and master weights in a manner transparent to the user. |
17 | | - For standard use, only two lines must be changed: creating the :class:`FP16_Optimizer` instance, |
18 | | - and changing the call to ``backward``. |
19 | | -
|
20 | | - Example:: |
21 | | -
|
22 | | - model = torch.nn.Linear(D_in, D_out).cuda().half() |
23 | | - optimizer = torch.optim.SGD(model.parameters(), lr=1e-3) |
24 | | - # Name the FP16_Optimizer instance to replace the existing optimizer |
25 | | - # (recommended but not required): |
26 | | - optimizer = FP16_Optimizer(optimizer, static_loss_scale = 128.0) |
27 | | - ... |
28 | | - # loss.backward() becomes: |
29 | | - optimizer.backward(loss) |
30 | | - ... |
31 | | -
|
32 | | - Example with dynamic loss scaling:: |
33 | | -
|
34 | | - ... |
35 | | - optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True) |
36 | | - # optional arg to control dynamic loss scaling behavior |
37 | | - # dynamic_loss_args={'scale_window' : 500}) |
38 | | - # Usually, dynamic_loss_args is not necessary. |
39 | | -
|
40 | | - Args: |
41 | | - init_optimizer (torch.optim.Optimizer): Existing optimizer created with the parameters to optimize. Internally, :class:`FP16_Optimizer` replaces the passed optimizer's fp16 parameters, if any, with fp32 master parameters copied from the original ones. :class:`FP16_Optimizer` also stores references to the original fp16 parameters, and updates these fp16 parameters from the master fp32 copy at the end of each :attr:`step`. |
42 | | - static_loss_scale (float, optional, default=1.0): Loss scale used internally to scale gradients computed by the model. Any fp16 gradients will be copied to fp32, then downscaled before being applied to the fp32 master params, so ``static_loss_scale`` should not affect learning rate. |
43 | | - dynamic_loss_scale (bool, optional, default=False): Use dynamic loss scaling. If True, this will override any ``static_loss_scale`` option. |
44 | | - dynamic_loss_args (dict, optional, default=None): Dict of kwargs that will be forwarded to the internal :class:`LossScaler` instance's constructor. Keys of this dict must match kwargs accepted by :class:`LossScaler`'s constructor. If ``dynamic_loss_args`` is unspecified, :class:`LossScaler`'s defaults will be used. |
45 | | - verbose (bool, optional, default=True): By default, FP16_Optimizer's constructor prints out the parameters and parameter groups it is ingesting, as a sanity check. If this becomes annoying (e.g. for large models), it can be disabled by passing ``verbose=False``. ``verbose=False`` will not disable printing when the loss scale is readjusted during dynamic loss scaling. |
46 | | -
|
47 | | - ``init_optimizer`` is expected to have been constructed in the ordinary way. |
48 | | - It is recommended (although not required) that the newly constructed :class:`FP16_Optimizer` instance be |
49 | | - named to replace ``init_optimizer``, for two reasons: |
50 | | - First, it means that references to the same name |
51 | | - later in the file will not have to change. |
52 | | - Second, :class:`FP16_Optimizer` reserves the right (as an implementation detail) to |
53 | | - modify ``init_optimizer``. If you do choose a unique name for the new |
54 | | - :class:`FP16_Optimizer` instance, you should only work with this new instance, |
55 | | - because the preexisting optimizer might no longer behave as expected. |
56 | | -
|
57 | | - ``init_optimizer`` may be any PyTorch optimizer. |
58 | | - It may contain a mixture of fp16 and fp32 parameters organized into any number of |
59 | | - ``param_groups`` with different hyperparameters. The :class:`FP16_Optimizer` constructor will |
60 | | - ingest these ``param_groups`` and remember them. |
61 | | -
|
62 | | - Calls to :: |
63 | | -
|
64 | | - loss.backward() |
65 | | -
|
66 | | - must be replaced with :: |
67 | | -
|
68 | | - optimizer.backward(loss) |
69 | | -
|
70 | | - because :class:`FP16_Optimizer` requires ownership of the backward pass to implement |
71 | | - loss scaling and copies to master gradients. |
72 | | -
|
73 | | - .. note:: |
74 | | - Loss scaling, either static or dynamic, is orthogonal to learning rate, because gradients |
75 | | - are downscaled before being applied. This means that adjusting the loss scale, or using |
76 | | - dynamic loss scaling, should not require retuning the learning rate or any other |
77 | | - hyperparameters. |
78 | | -
|
79 | | -
|
80 | | - **Advanced options** |
81 | | -
|
82 | | - **Closures**: :class:`FP16_Optimizer` can wrap a PyTorch optimizer that receives a closure. |
83 | | - See docstring for :attr:`step`. |
84 | | -
|
85 | | - **Gradient clipping**: Use :attr:`clip_master_grads`. |
86 | | - |
87 | | - **Multiple losses**: If your model accumulates gradients from multiple losses, |
88 | | - this can be made more efficient by supplying ``update_master_grads=False`` |
89 | | - to :attr:`backward`. See docstring for :attr:`backward`. |
90 | | -
|
91 | | - **Manually adjusting loss scale**: The current loss scale can be retrieved or set via :: |
92 | | -
|
93 | | - print(optimizer.loss_scale) |
94 | | - optimizer.loss_scale = new_loss_scale |
95 | | -
|
96 | | - For static loss scaling, manually adjusting the loss scale over time is a reasonable |
97 | | - thing to do. During later epochs, gradients may become smaller, and a |
98 | | - higher loss scale may be required, analogous to scheduling the learning rate. Dynamic loss |
99 | | - scaling is more subtle (see :class:`DynamicLossScaler`) and in this case, manually adjusting |
100 | | - the loss scale is not recommended. |
101 | | -
|
102 | | - **Multi-GPU training**: If the wrapped ``init_optimizer`` was created from a model wrapped in |
103 | | - PyTorch DistributedDataParallel or Apex DistributedDataParallel, :class:`FP16_Optimizer` |
104 | | - should still work as intended. |
105 | | - """ |
106 | | - |
107 | 14 | def __init__(self, |
108 | 15 | init_optimizer, |
109 | 16 | static_loss_scale=1.0, |
110 | 17 | dynamic_loss_scale=False, |
111 | 18 | dynamic_loss_args=None, |
112 | 19 | verbose=True): |
| 20 | + print("Warning: FP16_Optimizer is deprecated and dangerous, and will be deleted soon. " |
| 21 | + "If it still works, you're probably getting lucky. " |
| 22 | + "For mixed precision, use the documented API https://nvidia.github.io/apex/amp.html, with opt_level=O1.") |
| 23 | + |
113 | 24 | if not torch.cuda.is_available(): |
114 | 25 | raise SystemError("Cannot use fp16 without CUDA.") |
115 | 26 |
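
The removed docstring states that loss scaling is orthogonal to the learning rate because gradients are downscaled before being applied, and that dynamic loss scaling can be tuned with a ``scale_window`` argument. The following is a minimal pure-Python sketch of that idea, not the apex implementation. ``ToyDynamicLossScaler``, ``has_overflow``, ``scaled_step``, and ``toy_grad_fn`` are hypothetical names, and the halve-on-overflow / grow-after-``scale_window``-clean-steps policy is an assumption about how such a scaler typically behaves.

```python
import math


class ToyDynamicLossScaler:
    """Hypothetical stand-in for a dynamic loss scaler (not apex's class)."""

    def __init__(self, init_scale=2.0 ** 16, scale_factor=2.0, scale_window=500):
        self.loss_scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self._good_steps = 0  # consecutive steps without overflow

    def update(self, overflow):
        if overflow:
            # Assumed policy: shrink the scale and restart the counter.
            self.loss_scale = max(self.loss_scale / self.scale_factor, 1.0)
            self._good_steps = 0
        else:
            self._good_steps += 1
            # Assumed policy: grow the scale after scale_window clean steps.
            if self._good_steps % self.scale_window == 0:
                self.loss_scale *= self.scale_factor


def has_overflow(grads):
    """Return True if any gradient is inf or NaN."""
    return any(math.isinf(g) or math.isnan(g) for g in grads)


def scaled_step(master_params, grad_fn, lr, scaler):
    """One SGD step on fp32 'master' params using a scaled loss.

    grad_fn(params, loss_scale) must return gradients of (loss * loss_scale).
    """
    scaled_grads = grad_fn(master_params, scaler.loss_scale)
    overflow = has_overflow(scaled_grads)
    if not overflow:
        # Downscale before applying, so the update does not depend on the scale.
        inv_scale = 1.0 / scaler.loss_scale
        master_params = [
            p - lr * g * inv_scale for p, g in zip(master_params, scaled_grads)
        ]
    # On overflow the step is skipped and only the scale is adjusted.
    scaler.update(overflow)
    return master_params, overflow


def toy_grad_fn(params, loss_scale):
    """Gradients of loss_scale * sum(p**2), i.e. 2 * p * loss_scale."""
    return [2.0 * p * loss_scale for p in params]


if __name__ == "__main__":
    for scale in (1.0, 128.0):
        scaler = ToyDynamicLossScaler(init_scale=scale)
        params, _ = scaled_step([1.0, -2.0], toy_grad_fn, lr=0.1, scaler=scaler)
        print(scale, params)
```

Running the script with loss scales of 1.0 and 128.0 should print matching parameter updates, which is the sense in which the loss scale does not require retuning the learning rate in this sketch.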
|
|