@@ -21,8 +21,9 @@ paradigms:
    data between observers and the trainer
 2) Your model might be too large to fit in GPUs on a single machine, and hence
    would need a library to help split a model onto multiple machines. Or you
-   might be implementing a parameter server training framework, where model
-   parameters and trainers live on different machines.
+   might be implementing a `parameter server <https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf>`__
+   training framework, where model parameters and trainers live on different
+   machines.


 The `torch.distributed.rpc <https://pytorch.org/docs/master/rpc.html>`__ package
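
For readers skimming the diff, here is a minimal sketch of the core
``torch.distributed.rpc`` calls this tutorial builds on; the worker names, the
two-process layout, and the ``torch.add`` payload are illustrative, not code
from the tutorial:

.. code:: python

    import os
    import torch
    import torch.distributed.rpc as rpc

    # Rendezvous settings for a local two-process run (illustrative values).
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"

    # Process 0 of 2; the names "caller"/"callee" are arbitrary.
    rpc.init_rpc("caller", rank=0, world_size=2)

    # Synchronous RPC: run torch.add on the peer and ship the result back.
    ret = rpc.rpc_sync("callee", torch.add, args=(torch.ones(2), torch.ones(2)))

    # rpc.remote returns an RRef, a reference to a value living on the peer.
    rref = rpc.remote("callee", torch.add, args=(torch.ones(2), 1))
    print(ret, rref.to_here())

    rpc.shutdown()

    # Process 1 only joins the group and waits:
    #   rpc.init_rpc("callee", rank=1, world_size=2)
    #   rpc.shutdown()
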
@@ -360,21 +361,27 @@ borrowed from the word language model in PyTorch
 `example <https://github.com/pytorch/examples/tree/master/word_language_model>`__
 repository, which contains three main components: an embedding table, an
 ``LSTM`` layer, and a decoder. The code below wraps the embedding table and the
-decode into sub-modules, so that their constructors can be passed to the RPC
-API.
+decoder into sub-modules, so that their constructors can be passed to the RPC
+API. In the ``EmbeddingTable`` sub-module, we intentionally put the
+``Embedding`` layer on GPU to demonstrate the use case. In v1.4, RPC always
+creates tensor arguments and return values on CPU on the destination worker.
+If a function expects a GPU tensor, you need to move it to the proper device
+explicitly.


 .. code:: python

     class EmbeddingTable(nn.Module):
+        r"""
+        Encoding layers of the RNNModel
+        """
         def __init__(self, ntoken, ninp, dropout):
             super(EmbeddingTable, self).__init__()
             self.drop = nn.Dropout(dropout)
-            self.encoder = nn.Embedding(ntoken, ninp)
+            self.encoder = nn.Embedding(ntoken, ninp).cuda()
             self.encoder.weight.data.uniform_(-0.1, 0.1)

         def forward(self, input):
-            return self.drop(self.encoder(input))
+            return self.drop(self.encoder(input.cuda()).cpu())


     class Decoder(nn.Module):
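
The device-handling convention above can be seen in isolation in the following
sketch; the function and the worker name ``"ps"`` are hypothetical, and it
assumes an RPC group has already been initialized:

.. code:: python

    import torch
    import torch.distributed.rpc as rpc

    def relu_on_gpu(x):
        # In v1.4, RPC hands ``x`` to this function as a CPU tensor no matter
        # where it lived on the caller; move it to the GPU explicitly, and
        # move the result back to CPU before returning it over RPC.
        return torch.relu(x.cuda()).cpu()

    # Caller side, once a peer named "ps" has joined the group:
    # y = rpc.rpc_sync("ps", relu_on_gpu, args=(torch.randn(4),))
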
@@ -470,8 +477,9 @@ Then, as the ``RNNModel`` contains three sub-modules, we need to call
 Now, we are ready to implement the training loop. After initializing the model
 arguments, we create the ``RNNModel`` and the ``DistributedOptimizer``. The
 distributed optimizer will take a list of parameter ``RRefs``, find all distinct
-owner workers, and create the given local optimizer (i.e., ``SGD`` in this case)
-on each of the owner worker using the given arguments (i.e., ``lr=0.05``).
+owner workers, and create the given local optimizer (i.e., ``SGD`` in this
+case; other local optimizers work as well) on each of the owner workers using
+the given arguments (i.e., ``lr=0.05``).
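As a minimal sketch of that construction, with RPC assumed initialized and a
stand-in ``nn.Linear`` in place of the tutorial's ``RNNModel``:

.. code:: python

    import torch.nn as nn
    import torch.optim as optim
    from torch.distributed.rpc import RRef
    from torch.distributed.optim import DistributedOptimizer

    # Stand-in model; in the tutorial, the RRefs come from the RNNModel and
    # may point at parameters owned by a remote parameter server.
    model = nn.Linear(4, 4)
    param_rrefs = [RRef(p) for p in model.parameters()]

    opt = DistributedOptimizer(
        optim.SGD,    # local optimizer class, created once per owner worker
        param_rrefs,  # distinct owner workers are discovered from these RRefs
        lr=0.05,      # forwarded to every local SGD constructor
    )
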
 In the training loop, it first creates a distributed autograd context, which
 will help the distributed autograd engine to find gradients and involved RPC
 send/recv functions.
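
Concretely, one training step under such a context could look like the sketch
below; ``model``, ``opt``, and the loss function are assumed from the
surrounding tutorial, and the ``backward``/``step`` signatures are the v1.4
ones (later releases take the context id explicitly):

.. code:: python

    import torch
    import torch.distributed.autograd as dist_autograd

    criterion = torch.nn.CrossEntropyLoss()

    def train_step(data, target):
        # Forward, backward, and optimizer step all run inside one
        # distributed autograd context; gradients are accumulated in the
        # context rather than in param.grad, and freed when it exits.
        with dist_autograd.context() as context_id:
            output = model(data)
            loss = criterion(output, target)
            dist_autograd.backward([loss])  # v1.4; later: backward(context_id, [loss])
            opt.step()                      # v1.4; later: opt.step(context_id)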