@@ -23,17 +23,20 @@ Comparison between ``DataParallel`` and ``DistributedDataParallel``
 -------------------------------------------------------------------

 Before we dive in, let's clarify why, despite the added complexity, you would
-consider using ``DistributedDataParallel`` over ``DataParallel``, remembering
-that **model parallel** (covered in the
-`prior tutorial <https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html>`__)
-is necessary to use in either case if your model is too large to fit on a single
-GPU.
+consider using ``DistributedDataParallel`` over ``DataParallel``:

+- First, recall from the
+  `prior tutorial <https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html>`__
+  that if your model is too large to fit on a single GPU, you must use **model parallel**
+  to split it across multiple GPUs. ``DistributedDataParallel`` works with
+  **model parallel**; ``DataParallel`` does not at this time.
 - ``DataParallel`` is single-process, multi-threaded, and only works on a single
   machine, while ``DistributedDataParallel`` is multi-process and works for both
   single- and multi-machine training. Thus, even for single-machine training,
   where your **data** is small enough to fit on a single machine, ``DistributedDataParallel``
-  is expected to be faster than ``DataParallel``.
+  is expected to be faster than ``DataParallel``. ``DistributedDataParallel``
+  also replicates the model upfront instead of on each iteration, and it
+  sidesteps Python's Global Interpreter Lock.
 - If both your data is too large to fit on one machine **and** your
   model is too large to fit on a single GPU, you can combine model parallel
   (splitting a single model across multiple GPUs) with ``DistributedDataParallel``.
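For context, the ``DistributedDataParallel`` workflow the changed text alludes to can be sketched as follows. This is a minimal, single-process CPU example on the ``gloo`` backend; the toy model, hyperparameters, and port are illustrative, and a real job would launch one process per GPU (e.g. with ``torchrun``) so that each process gets its own rank:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run():
    # Single-process setup purely for illustration; real jobs spawn one
    # process per GPU with matching rank/world_size (e.g. via torchrun).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = nn.Linear(10, 10)  # toy model, small enough for one device
    ddp_model = DDP(model)     # model is replicated once, upfront; gradients
                               # are all-reduced across processes in backward()

    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    loss = loss_fn(outputs, torch.randn(20, 10))
    loss.backward()            # DDP synchronizes gradients here
    optimizer.step()

    dist.destroy_process_group()
    return loss.item()


if __name__ == "__main__":
    print(f"loss: {run():.4f}")
```

Because each rank is a separate process, this structure avoids the per-iteration model scatter and GIL contention that ``DataParallel``'s single-process, multi-threaded design incurs.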