Rearranging documentation

definitelynotmcarilli · definitelynotmcarilli · commit 4606df98af28 · 2019-03-07T11:33:28.000-08:00
diff --git a/README.md b/README.md
@@ -12,7 +12,7 @@ users as quickly as possible.
 
 ## 1. Mixed Precision
 
-### amp:  Automatic Mixed Precision
+### Amp:  Automatic Mixed Precision
 
 `apex.amp` is a tool to enable mixed precision training by changing only 3 lines of your script.
 Users can easily experiment with different pure and mixed precision training modes by supplying
@@ -27,7 +27,7 @@ different flags to `amp.initialize`.
 
 [DCGAN example coming soon...](https://github.com/NVIDIA/apex/tree/master/examples/dcgan)
 
-[Moving to the new Amp API] (for users of the deprecated tools formerly called "Amp" and "FP16_Optimizer")
+[Moving to the new Amp API](https://nvidia.github.io/apex/amp.html#transition-guide-for-old-api-users) (for users of the deprecated tools formerly called "Amp" and "FP16_Optimizer")
 
 ## 2. Distributed Training
 
diff --git a/docs/source/amp.rst b/docs/source/amp.rst
@@ -7,11 +7,22 @@ apex.amp
 This page documents the updated API for Amp (Automatic Mixed Precision),
 a tool to enable Tensor Core-accelerated training in only 3 lines of Python.
 
-Users **should not** manually cast their model or data to ``.half()``, regardless of what ``opt_level``
-or properties are chosen.  Amp intends that users start with an existing default (FP32) script,
-add the three lines corresponding to the Amp API, and begin training with mixed precision.
-Amp can also be disabled, in which case the original script will behave exactly as it used to.
-In this way, there's no risk adhering to the Amp API, and a lot of potential performance benefit.
+A `runnable, comprehensive Imagenet example`_ demonstrating good practices can be found
+on the Github page.
+
+GANs are a tricky case that many people have requested.  A `comprehensive DCGAN example`_
+is under construction.
+
+``opt_level``\ s and Properties
+-------------------------------
+
+Amp allows users to easily experiment with different pure and mixed precision modes.
+Commonly-used default modes are chosen by
+selecting an "optimization level" or ``opt_level``; each ``opt_level`` establishes a set of
+properties that govern Amp's implementation of pure or mixed precision training.
+Finer-grained control of how a given ``opt_level`` behaves can be achieved by passing values for
+particular properties directly to ``amp.initialize``.  These manually specified values
+override the defaults established by the ``opt_level``.
 
 Example::
 
@@ -27,39 +38,33 @@ Example::
             scaled_loss.backward()
         ...
 
-A `runnable, comprehensive Imagenet example`_ demonstrating good practices can be found
-on the Github page.
+Users **should not** manually cast their model or data to ``.half()``, regardless of what ``opt_level``
+or properties are chosen.  Amp intends that users start with an existing default (FP32) script,
+add the three lines corresponding to the Amp API, and begin training with mixed precision.
+Amp can also be disabled, in which case the original script will behave exactly as it used to.
+In this way, there's no risk adhering to the Amp API, and a lot of potential performance benefit.
 
-GANs are a tricky case that many people have requested.  A `comprehensive DCGAN example`_
-is under construction.
+.. note::
+    Because it's never necessary to manually cast your model (aside from the call ``amp.initialize``)
+    or input data, a script that adheres to the new API
+    can switch between different ``opt-level``\ s without having to make any other changes.
 
 .. _`runnable, comprehensive Imagenet example`:
     https://github.com/NVIDIA/apex/tree/master/examples/imagenet
 
 .. _`comprehensive DCGAN example`:
     https://github.com/NVIDIA/apex/tree/master/examples/dcgan
 
-``opt_level``\ s and Properties
--------------------------------
-
-Amp allows users to easily experiment with different pure and mixed precision modes, including
-pure FP16 training and pure FP32 training.  Commonly-used default modes are chosen by
-selecting an "optimization level" or ``opt_level``; each ``opt_level`` establishes a set of
-properties that govern Amp's implementation of pure or mixed precision training.
-Finer-grained control of how a given ``opt_level`` behaves can be achieved by passing values for
-particular properties directly to ``amp.initialize``.  These manually specified values will
-override the defaults established by the ``opt_level``.
-
 Properties
 **********
 
 Currently, the under-the-hood properties that govern pure or mixed precision training are the following:
 
 - ``cast_model_type``:  Casts your model's parameters and buffers to the desired type.
 - ``patch_torch_functions``: Patch all Torch functions and Tensor methods to perform Tensor Core-friendly ops like GEMMs and convolutions in FP16, and any ops that benefit from FP32 precision in FP32.
-- ``keep_batchnorm_fp32``:  To enhance precision and enable cudnn batchnorm (which improves performance), it's often beneficial to keep batchnorms in particular in FP32 even if the rest of the model is FP16.
+- ``keep_batchnorm_fp32``:  To enhance precision and enable cudnn batchnorm (which improves performance), it's often beneficial to keep batchnorm weights in FP32 even if the rest of the model is FP16.
 - ``master_weights``:  Maintain FP32 master weights to accompany any FP16 model weights.  FP32 master weights are stepped by the optimizer to enhance precision and capture small gradients.
-- ``loss_scale``:  If ``loss_scale`` is a float value, use this value as the static (fixed) loss scale.  If ``loss_scale`` is the string ``"dynamic"``, adapatively adjust the loss scale over time.  Dynamic loss scale adjustments are performed by Amp automatically.
+- ``loss_scale``:  If ``loss_scale`` is a float value, use this value as the static (fixed) loss scale.  If ``loss_scale`` is the string ``"dynamic"``, adaptively adjust the loss scale over time.  Dynamic loss scale adjustments are performed by Amp automatically.
 
 Again, you often don't need to specify these properties by hand.  Instead, select an ``opt_level``,
 which will set them up for you.  After selecting an ``opt_level``, you can optionally pass property
@@ -85,7 +90,7 @@ Your incoming model should be FP32 already, so this is likely a no-op.
 | Default properties set by ``O0``:
 | ``cast_model_type=torch.float32``
 | ``patch_torch_functions=False``
-| ``keep_batchnorm_fp32=None`` (effectively, "not applicable")
+| ``keep_batchnorm_fp32=None`` (effectively, "not applicable," everything is FP32)
 | ``master_weights=False``
 | ``loss_scale=1.0``
 |
@@ -116,21 +121,23 @@ what gives the best speedup and accuracy for your model.
 
 Patch all Torch functions and Tensor methods to cast their inputs according to a whitelist-blacklist
 model.  Whitelist ops (for example, Tensor Core-friendly ops like GEMMs and convolutions) are performed
-in FP16.  Blacklist ops that benefit from FP32 precision (for example, batchnorm and softmax)
+in FP16.  Blacklist ops that benefit from FP32 precision (for example, softmax)
 are performed in FP32.  ``O1`` also uses dynamic loss scaling, unless overridden.
 
 | Default properties set by ``O1``:
 | ``cast_model_type=None`` (not applicable)
 | ``patch_torch_functions=True``
-| ``keep_batchnorm_fp32=None`` (not necessary to specify True, batchnorm inputs are cast to FP32)
+| ``keep_batchnorm_fp32=None`` (again, "not applicable," all model weights remain FP32)
 | ``master_weights=None`` (not applicable, model weights remain FP32)
 | ``loss_scale="dynamic"``
 |
 |
 
 ``O2``:  Fast Mixed Precision
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-``O2`` casts the model to FP16, keeps batchnorms in FP32, maintains master weights in FP32,
+``O2`` casts the model weights to FP16,
+patches the model's ``forward`` method to cast input
+data to FP16, keeps batchnorms in FP32, maintains FP32 master weights,
 and implements dynamic loss scaling (unless overridden).
 Unlike ``O1``, ``O2`` does not patch Torch functions or Tensor methods.
 
@@ -221,14 +228,9 @@ One annoying aspect of FP16_Optimizer was that the user had to manually convert
 (either by calling ``.half()`` on it, or using a function or module wrapper from
 ``apex.fp16_utils``), and also manually call ``.half()`` on input data.  **Neither of these are
 necessary in the new API.  No matter what --opt-level
-you choose, you can and should simply build your model in the default FP32 format.**  The new Amp
-API will perform the right conversions during
+you choose, you can and should simply build your model and pass input data in the default FP32 format.**
+The new Amp API will perform the right conversions during
 ``model, optimizer = amp.initialize(model, optimizer, opt_level=....)`` based on the ``--opt-level``
 and any overridden flags.  Floating point input data may be FP32 or FP16, but you may as well just
 let it be FP16, because the ``model`` returned by ``amp.initialize`` will have its ``forward``
 method patched to cast the input data appropriately.
-
-.. note::
-    Aside from the call to ``amp.initialize`` itself, it's never necessary to manually cast
-    your model or data with the new API.  Therefore, a script that adheres to the new API
-    can switch between different ``opt-level``\ s without having to make any other changes.