Commit 95d6c00

Making the encouragement to use O1 a bit stronger...
1 parent 4b91326 commit 95d6c00

2 files changed

Lines changed: 8 additions & 95 deletions

apex/fp16_utils/fp16_optimizer.py

Lines changed: 4 additions & 93 deletions
@@ -11,105 +11,16 @@
 
 # TODO: Update overflow check + downscale to use Carl's fused kernel.
 class FP16_Optimizer(object):
-    """
-    :class:`FP16_Optimizer` is designed to wrap an existing PyTorch optimizer,
-    and manage static or dynamic loss scaling and master weights in a manner transparent to the user.
-    For standard use, only two lines must be changed: creating the :class:`FP16_Optimizer` instance,
-    and changing the call to ``backward``.
-
-    Example::
-
-        model = torch.nn.Linear(D_in, D_out).cuda().half()
-        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
-        # Name the FP16_Optimizer instance to replace the existing optimizer
-        # (recommended but not required):
-        optimizer = FP16_Optimizer(optimizer, static_loss_scale = 128.0)
-        ...
-        # loss.backward() becomes:
-        optimizer.backward(loss)
-        ...
-
-    Example with dynamic loss scaling::
-
-        ...
-        optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
-        # optional arg to control dynamic loss scaling behavior
-        # dynamic_loss_args={'scale_window' : 500})
-        # Usually, dynamic_loss_args is not necessary.
-
-    Args:
-        init_optimizer (torch.optim.optimizer): Existing optimizer created with the parameters to optimize. Internally, :class:`FP16_Optimizer` replaces the passed optimizer's fp16 parameters, if any, with fp32 master parameters copied from the original ones. :class:`FP16_Optimizer` also stores references to the original fp16 parameters, and updates these fp16 parameters from the master fp32 copy at the end of each :attr:`step`.
-        static_loss_scale (float, optional, default=1.0): Loss scale used internally to scale gradients computed by the model. Any fp16 gradients will be copied to fp32, then downscaled before being applied to the fp32 master params, so ``static_loss_scale`` should not affect learning rate.
-        dynamic_loss_scale (bool, optional, default=False): Use dynamic loss scaling. If True, this will override any ``static_loss_scale`` option.
-        dynamic_loss_args (dict, optional, default=None): Dict of kwargs that will be forwarded to the internal :class:`LossScaler` instance's constructor. Keys of this dict must match kwargs accepted by :class:`LossScaler`'s constructor. If ``dynamic_loss_args`` is unspecified, :class:`LossScaler`'s defaults will be used.
-        verbose (bool, optional, default=True): By default, FP16_Optimizer's constructor prints out the parameters and parameter groups it is ingesting, as a sanity check. If this becomes annoying (e.g. for large models), it can be disabled by passing ``verbose=False``. ``verbose=False`` will not disable printing when the loss scale is readjusted during dynamic loss scaling.
-
-    ``init_optimizer`` is expected to have been constructed in the ordinary way.
-    It is recommended (although not required) that the newly constructed :class:`FP16_Optimizer` instance be
-    named to replace ``init_optimizer``, for two reasons:
-    First, it means that references to the same name
-    later in the file will not have to change.
-    Second, :class:`FP16_Optimizer` reserves the right (as an implementation detail) to
-    modify ``init_optimizer``. If you do choose a unique name for the new
-    :class:`FP16_Optimizer` instance, you should only work with this new instance,
-    because the preexisting optimizer might no longer behave as expected.
-
-    ``init_optimizer`` may be any Pytorch optimizer.
-    It may contain a mixture of fp16 and fp32 parameters organized into any number of
-    ``param_groups`` with different hyperparameters. The :class:`FP16_Optimizer` constructor will
-    ingest these ``param_groups`` and remember them.
-
-    Calls to ::
-
-        loss.backward()
-
-    must be replaced with ::
-
-        optimizer.backward(loss)
-
-    because :class:`FP16_Optimizer` requires ownership of the backward pass to implement
-    loss scaling and copies to master gradients.
-
-    .. note::
-        Loss scaling, either static or dynamic, is orthogonal to learning rate, because gradients
-        are downscaled before being applied. This means that adjusting the loss scale, or using
-        dynamic loss scaling, should not require retuning the learning rate or any other
-        hyperparameters.
-
-
-    **Advanced options**
-
-    **Closures**: :class:`FP16_Optimizer` can wrap a Pytorch optimizer that receives a closure.
-    See docstring for :attr:`step`.
-
-    **Gradient clipping**: Use :attr:`clip_master_grads`.
-
-    **Multiple losses**: If your model accumulates gradients from multiple losses,
-    this can be made more efficient by supplying ``update_master_grads=False``
-    to :attr:`backward`. See docstring for :attr:`backward`.
-
-    **Manually adjusting loss scale**: The current loss scale can be retrieved or set via ::
-
-        print(optimizer.loss_scale)
-        optimizer.loss_scale = new_loss_scale
-
-    For static loss scaling, manually adjusting the loss scale over time is a reasonable
-    thing to do. During later epochs, gradients may become smaller, and a
-    higher loss scale may be required, analogous to scheduling the learning rate. Dynamic loss
-    scaling is more subtle (see :class:`DynamicLossScaler`) and in this case, manually adjusting
-    the loss scale is not recommended.
-
-    **Multi_GPU training**: If the wrapped ``init_optimizer`` was created from a model wrapped in
-    Pytorch DistributedDataParallel or Apex DistributedDataParallel, :class:`FP16_Optimizer`
-    should still work as intended.
-    """
-
     def __init__(self,
                  init_optimizer,
                  static_loss_scale=1.0,
                  dynamic_loss_scale=False,
                  dynamic_loss_args=None,
                  verbose=True):
+        print("Warning: FP16_Optimizer is deprecated and dangerous, and will be deleted soon. "
+              "If it still works, you're probably getting lucky. "
+              "For mixed precision, use the documented API https://nvidia.github.io/apex/amp.html, with opt_level=O1.")
+
         if not torch.cuda.is_available:
             raise SystemError("Cannot use fp16 without CUDA.")
 

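The docstring removed above notes that loss scaling is orthogonal to the learning rate because gradients are downscaled before being applied. The sketch below illustrates only that algebra, in plain Python floats on a one-parameter squared-error loss. It is not the `FP16_Optimizer` implementation, and the names `grad` and `sgd_step` are hypothetical helpers introduced for this illustration.

```python
def grad(w, x, y):
    """Gradient of the squared error (w * x - y) ** 2 with respect to w."""
    return 2.0 * x * (w * x - y)


def sgd_step(w, x, y, lr, loss_scale=1.0):
    """One SGD step with static loss scaling (illustrative only).

    Multiplying the loss by `loss_scale` multiplies its gradient by the
    same factor (chain rule), so the gradient is divided by `loss_scale`
    again before the weight update.
    """
    scaled_grad = loss_scale * grad(w, x, y)  # gradient of the scaled loss
    master_grad = scaled_grad / loss_scale    # downscale before applying
    return w - lr * master_grad


w0, x, y, lr = 0.5, 3.0, 2.0, 1e-3
unscaled = sgd_step(w0, x, y, lr, loss_scale=1.0)
scaled = sgd_step(w0, x, y, lr, loss_scale=128.0)
print(unscaled == scaled)
```

The two updates agree here because the scale cancels before the step is taken, which is why the docstring says changing the loss scale should not require retuning the learning rate. In real FP16 training the scale matters for a different reason: it keeps small gradient values representable before they are copied to FP32.
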
examples/imagenet/README.md

Lines changed: 4 additions & 2 deletions
@@ -84,7 +84,7 @@ For Resnet50 in particular, `--opt-level O3 --keep-batchnorm-fp32 True` establis
 the "speed of light." (Without `--keep-batchnorm-fp32`, it's slower, because it does
 not use cudnn batchnorm.)
 
-#### `--opt-level O1` ("conservative mixed precision")
+#### `--opt-level O1` (Official Mixed Precision recipe, recommended for typical use)
 
 `O1` patches Torch functions to cast inputs according to a whitelist-blacklist model.
 FP16-friendly (Tensor Core) ops like gemms and convolutions run in FP16, while ops
@@ -105,7 +105,9 @@ $ python -m torch.distributed.launch --nproc_per_node=2 main_amp.py -a resnet50
 For best performance, set `--nproc_per_node` equal to the total number of GPUs on the node
 to use all available resources.
 
-#### `--opt-level O2` ("fast mixed precision")
+#### `--opt-level O2` ("Almost FP16" mixed precision. More dangerous than O1.)
+
+`O2` exists mainly to support some internal use cases. Please prefer `O1`.
 
 `O2` casts the model to FP16, keeps batchnorms in FP32,
 maintains master weights in FP32, and implements

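Both changes in this commit point users from `FP16_Optimizer` toward Amp with `opt_level` O1. A minimal sketch of what that migration might look like follows. `amp.initialize` and `amp.scale_loss` are the Apex Amp entry points; the wrapper names `initialize_mixed_precision` and `backward_with_loss_scaling` are hypothetical and exist only for this illustration. Apex is imported lazily, so the functions can be defined without it installed, but calling them requires Apex and a CUDA-capable setup.

```python
def initialize_mixed_precision(model, optimizer, opt_level="O1"):
    """Wrap an existing model and optimizer with Amp.

    Assumption: `model` is an ordinary FP32 model that has not been
    cast with .half(); O1 handles casting by patching Torch functions.
    """
    from apex import amp  # lazy import; requires Apex to be installed
    return amp.initialize(model, optimizer, opt_level=opt_level)


def backward_with_loss_scaling(loss, optimizer):
    """Stand-in for the old `optimizer.backward(loss)` call.

    Amp scales the loss inside the context manager; backward() is
    called on the scaled loss instead of the original one.
    """
    from apex import amp
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()


# Hypothetical training-loop usage (model, criterion, and data not shown):
#   model, optimizer = initialize_mixed_precision(model, optimizer)
#   loss = criterion(model(inputs), targets)
#   optimizer.zero_grad()
#   backward_with_loss_scaling(loss, optimizer)
#   optimizer.step()
```

Passing `opt_level="O2"` to the same call would select the mode the README now describes as more dangerous; the Amp documentation linked in the deprecation warning remains the authority on the available options.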