Commit 270bf22

Merge branch 'master' into 1.9-RC-TEST
2 parents 43e1026 + 3945dd8

6 files changed

Lines changed: 68 additions & 5 deletions


beginner_source/basics/buildmodel_tutorial.py

Lines changed: 1 addition & 1 deletion
@@ -97,7 +97,7 @@ def forward(self, x):
 # Model Layers
 # -------------------------
 #
-# Lets break down the layers in the FashionMNIST model. To illustrate it, we
+# Let's break down the layers in the FashionMNIST model. To illustrate it, we
 # will take a sample minibatch of 3 images of size 28x28 and see what happens to it as
 # we pass it through the network.
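For context, a minimal sketch of the walkthrough this passage introduces, using only the shapes stated in the tutorial text (a sample minibatch of 3 images of size 28x28):

import torch
from torch import nn

# A sample minibatch of 3 images of size 28x28, as in the edited passage.
input_image = torch.rand(3, 28, 28)

# nn.Flatten turns each 28x28 image into a contiguous vector of 784 values
# before it is passed on to the model's linear layers.
flatten = nn.Flatten()
flat_image = flatten(input_image)
print(flat_image.size())  # torch.Size([3, 784])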

beginner_source/basics/data_tutorial.py

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@
 # -------------------
 #
 # Here is an example of how to load the `Fashion-MNIST <https://research.zalando.com/project/fashion_mnist/fashion_mnist/>`_ dataset from TorchVision.
-# Fashion-MNIST is a dataset of Zalando’s article images consisting of of 60,000 training examples and 10,000 test examples.
+# Fashion-MNIST is a dataset of Zalando’s article images consisting of 60,000 training examples and 10,000 test examples.
 # Each example comprises a 28×28 grayscale image and an associated label from one of 10 classes.
 #
 # We load the `FashionMNIST Dataset <https://pytorch.org/vision/stable/datasets.html#fashion-mnist>`_ with the following parameters:
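The loading call the passage refers to looks like the following sketch; the root directory "data" is an arbitrary choice here, and train=False would select the 10,000-example test split instead:

import torchvision
from torchvision.transforms import ToTensor

# Download the 60,000-example training split described above.
training_data = torchvision.datasets.FashionMNIST(
    root="data",           # directory where the data is stored
    train=True,            # select the training split
    download=True,         # fetch the data if it is not present at root
    transform=ToTensor(),  # convert PIL images to tensors scaled to [0, 1]
)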

beginner_source/dcgan_faces_tutorial.py

Lines changed: 1 addition & 1 deletion
@@ -52,7 +52,7 @@
 # with the discriminator. Let :math:`x` be data representing an image.
 # :math:`D(x)` is the discriminator network which outputs the (scalar)
 # probability that :math:`x` came from training data rather than the
-# generator. Here, since we are dealing with images the input to
+# generator. Here, since we are dealing with images, the input to
 # :math:`D(x)` is an image of CHW size 3x64x64. Intuitively, :math:`D(x)`
 # should be HIGH when :math:`x` comes from training data and LOW when
 # :math:`x` comes from the generator. :math:`D(x)` can also be thought of
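To make the shapes in this passage concrete, here is a hypothetical stand-in for D, not the tutorial's actual convolutional discriminator, that maps a 3x64x64 image to one scalar probability:

import torch
from torch import nn

# Hypothetical stand-in for D(x): flattens a CHW 3x64x64 image and maps it
# to a single probability. The tutorial uses a stack of conv layers instead.
D = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1), nn.Sigmoid())

x = torch.rand(1, 3, 64, 64)  # a batch containing one CHW image
print(D(x).shape)             # torch.Size([1, 1]): one probability per image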

beginner_source/text_sentiment_ngrams_tutorial.py

Lines changed: 1 addition & 1 deletion
@@ -101,7 +101,7 @@
 #
 # Before sending to the model, ``collate_fn`` function works on a batch of samples generated from ``DataLoader``. The input to ``collate_fn`` is a batch of data with the batch size in ``DataLoader``, and ``collate_fn`` processes them according to the data processing pipelines declared previously. Pay attention here and make sure that ``collate_fn`` is declared as a top level def. This ensures that the function is available in each worker.
 #
-# In this example, the text entries in the original data batch input are packed into a list and concatenated as a single tensor for the input of ``nn.EmbeddingBag``. The offset is a tensor of delimiters to represent the beginning index of the individual sequence in the text tensor. Label is a tensor saving the labels of indidividual text entries.
+# In this example, the text entries in the original data batch input are packed into a list and concatenated as a single tensor for the input of ``nn.EmbeddingBag``. The offset is a tensor of delimiters to represent the beginning index of the individual sequence in the text tensor. Label is a tensor saving the labels of individual text entries.
 
 
 from torch.utils.data import DataLoader
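A sketch of the ``collate_fn`` behavior described above, assuming each sample is already a (label, token_ids) pair; the tutorial builds the token ids with a text pipeline, omitted here:

import torch

def collate_batch(batch):
    # A top-level def, so DataLoader workers can pickle it.
    label_list, text_list, offsets = [], [], [0]
    for label, token_ids in batch:
        label_list.append(label)
        processed = torch.tensor(token_ids, dtype=torch.int64)
        text_list.append(processed)
        offsets.append(processed.size(0))
    labels = torch.tensor(label_list, dtype=torch.int64)
    # Cumulative lengths give the beginning index of each sequence.
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    # One flat tensor of token ids, as nn.EmbeddingBag expects.
    text = torch.cat(text_list)
    return labels, text, offsets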

intermediate_source/seq2seq_translation_tutorial.py

Lines changed: 1 addition & 1 deletion
@@ -139,7 +139,7 @@
 # the networks later. To keep track of all this we will use a helper class
 # called ``Lang`` which has word → index (``word2index``) and index → word
 # (``index2word``) dictionaries, as well as a count of each word
-# ``word2count`` to use to later replace rare words.
+# ``word2count`` which will be used to replace rare words later.
 #
 
 SOS_token = 0
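A sketch of the ``Lang`` helper with the three attributes named in the passage, simplified from the tutorial's version:

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}   # word -> index
        self.word2count = {}   # word -> count, to replace rare words later
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2       # count SOS and EOS tokens

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1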

recipes_source/recipes/tuning_guide.py

Lines changed: 63 additions & 0 deletions
@@ -178,6 +178,69 @@ def fused_gelu(x):
 # `torch.autograd.gradgradcheck <https://pytorch.org/docs/stable/autograd.html#torch.autograd.gradgradcheck>`_
 #
 
+###############################################################################
+# CPU specific optimizations
+# --------------------------
+
+###############################################################################
+# Utilize Non-Uniform Memory Access (NUMA) Controls
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# NUMA, or non-uniform memory access, is a memory layout design used in data center machines to take advantage of memory locality in multi-socket machines with multiple memory controllers and blocks. Generally speaking, all deep learning workloads, training or inference, get better performance without accessing hardware resources across NUMA nodes. Thus, inference can be run with multiple instances, each running on one socket, to raise throughput. For training tasks on a single node, distributed training is recommended so that each training process runs on one socket.
+#
+# In general, the following command executes a PyTorch script on cores of the Nth node only, and avoids cross-socket memory access to reduce memory access overhead.
+
+# numactl --cpunodebind=N --membind=N python <pytorch_script>
+
+###############################################################################
+# More detailed descriptions can be found `here <https://software.intel.com/content/www/us/en/develop/articles/how-to-get-better-performance-on-pytorchcaffe2-with-intel-acceleration.html>`_.
+
+###############################################################################
+# Utilize OpenMP
+# ~~~~~~~~~~~~~~
+# OpenMP is utilized to bring better performance for parallel computation tasks.
+# OMP_NUM_THREADS is the easiest switch that can be used to accelerate computations. It determines the number of threads used for OpenMP computations.
+# The CPU affinity setting controls how workloads are distributed over multiple cores. It affects communication overhead, cache line invalidation overhead, and page thrashing, so a proper CPU affinity setting brings performance benefits. GOMP_CPU_AFFINITY or KMP_AFFINITY determines how to bind OpenMP* threads to physical processing units. Detailed information can be found `here <https://software.intel.com/content/www/us/en/develop/articles/how-to-get-better-performance-on-pytorchcaffe2-with-intel-acceleration.html>`_.
+
+###############################################################################
+# With the following command, PyTorch runs the task on N OpenMP threads.
+
+# export OMP_NUM_THREADS=N
+
+###############################################################################
+# Typically, the following environment variables are used to set CPU affinity with the GNU OpenMP implementation. OMP_PROC_BIND specifies whether threads may be moved between processors. Setting it to CLOSE keeps OpenMP threads close to the primary thread in contiguous place partitions. OMP_SCHEDULE determines how OpenMP threads are scheduled. GOMP_CPU_AFFINITY binds threads to specific CPUs.
+
+# export OMP_SCHEDULE=STATIC
+# export OMP_PROC_BIND=CLOSE
+# export GOMP_CPU_AFFINITY="N-M"
+
+###############################################################################
+# Intel OpenMP Runtime Library (libiomp)
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# By default, PyTorch uses GNU OpenMP (GNU libgomp) for parallel computation. On Intel platforms, the Intel OpenMP Runtime Library (libiomp) provides OpenMP API specification support. It sometimes brings more performance benefits compared to libgomp. Utilizing the environment variable LD_PRELOAD can switch the OpenMP library to libiomp:
+
+# export LD_PRELOAD=<path>/libiomp5.so:$LD_PRELOAD
+
+###############################################################################
+# Similar to CPU affinity settings in GNU OpenMP, environment variables are provided in libiomp to control CPU affinity settings.
+# KMP_AFFINITY binds OpenMP threads to physical processing units. KMP_BLOCKTIME sets the time, in milliseconds, that a thread should wait, after completing the execution of a parallel region, before sleeping. In most cases, setting KMP_BLOCKTIME to 1 or 0 yields good performance.
+# The following commands show common settings with the Intel OpenMP Runtime Library.
+
+# export KMP_AFFINITY=granularity=fine,compact,1,0
+# export KMP_BLOCKTIME=1
+
+###############################################################################
+# Switch Memory allocator
+# ~~~~~~~~~~~~~~~~~~~~~~~
+# For deep learning workloads, Jemalloc or TCMalloc can get better performance than the default malloc function by reusing memory as much as possible. `Jemalloc <https://github.com/jemalloc/jemalloc>`_ is a general purpose malloc implementation that emphasizes fragmentation avoidance and scalable concurrency support. `TCMalloc <https://google.github.io/tcmalloc/overview.html>`_ also features a couple of optimizations to speed up program executions. One of them is holding memory in caches to speed up access of commonly-used objects. Holding such caches even after deallocation also helps avoid costly system calls if such memory is later re-allocated.
+# Use the environment variable LD_PRELOAD to take advantage of one of them.
+
+# export LD_PRELOAD=<jemalloc.so/tcmalloc.so>:$LD_PRELOAD
+
+###############################################################################
+# Train a model on CPU with PyTorch DistributedDataParallel (DDP) functionality
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# For small-scale models or memory-bound models, such as DLRM, training on CPU is also a good choice. On a machine with multiple sockets, distributed training brings highly efficient hardware resource usage to accelerate the training process. `Torch-ccl <https://github.com/intel/torch-ccl>`_, optimized with Intel(R) oneCCL (collective communications library) for efficient distributed deep learning training implementing collectives such as allreduce, allgather, and alltoall, implements the PyTorch C10D ProcessGroup API and can be dynamically loaded as an external ProcessGroup. On top of the optimizations implemented in the PyTorch DDP module, torch-ccl accelerates communication operations. Besides the optimizations made to communication kernels, torch-ccl also features simultaneous computation-communication functionality.
+
 ###############################################################################
 # GPU specific optimizations
 # --------------------------
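The OMP_NUM_THREADS switch described in the OpenMP section above also has an in-process counterpart in PyTorch's thread APIs; a minimal sketch (the count 4 is an arbitrary example):

import torch

print(torch.get_num_threads())  # current intra-op parallelism thread count
torch.set_num_threads(4)        # set it before running heavy computation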

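And a minimal sketch of the CPU DDP setup the last added subsection describes, under stated assumptions: the stock "gloo" backend stands in for torch-ccl, and the script is launched with torchrun, which sets RANK, WORLD_SIZE, and MASTER_ADDR for the default env:// rendezvous:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="gloo")  # CPU-capable process group backend
model = torch.nn.Linear(16, 1)           # placeholder model
ddp_model = DDP(model)                   # no device_ids argument on CPU

Launch one process per socket, for example with torchrun --nproc_per_node=2, combined with the numactl and affinity settings shown above.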