# of the word (see the next paragraph for more details). The
# ``nn.TransformerEncoder`` consists of multiple layers of
# `nn.TransformerEncoderLayer <https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html>`__.
- # To produce a probability distribution over output words, the output of
- # the ``nn.TransformerEncoder`` model is passed through a linear layer.
+ # Along with the input sequence, a square attention mask is required because
+ # the self-attention layers in ``nn.TransformerEncoder`` are only allowed to
+ # attend to earlier positions in the sequence: for the language modeling
+ # task, any tokens in future positions must be masked. To produce a
+ # probability distribution over output words, the output of the
+ # ``nn.TransformerEncoder`` model is passed through a linear layer to
+ # produce unnormalized logits. The log-softmax function isn't applied here
+ # because `CrossEntropyLoss <https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html>`__,
+ # applied later, performs the log-softmax internally and expects
+ # unnormalized logits as input.
#
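A minimal sketch of the square attention mask described above (not part of the diff; the helper name and size are illustrative, and the tutorial's own helper may differ). Future positions hold ``-inf`` so they receive zero attention weight after the softmax inside self-attention:

import torch

# Hypothetical helper, shown only to illustrate the mask's shape and values;
# mask[i, j] == -inf means position i may not attend to position j.
def square_subsequent_mask(sz: int) -> torch.Tensor:
    # -inf above the main diagonal marks future positions; 0.0 elsewhere.
    return torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)

print(square_subsequent_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])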
import math
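And a sketch of the loss computation implied by the last sentence of the comment block (shapes and variable names are assumptions, not the tutorial's code): the linear layer's raw logits go straight into ``CrossEntropyLoss``, which combines log-softmax and negative log-likelihood internally, so applying ``log_softmax`` beforehand would be redundant:

import torch
import torch.nn as nn

ntokens = 10           # hypothetical vocabulary size
seq_len, batch = 7, 4  # hypothetical sequence length and batch size

# Stand-ins for the linear layer's output and the target token indices.
logits = torch.randn(seq_len, batch, ntokens)
targets = torch.randint(ntokens, (seq_len, batch))

criterion = nn.CrossEntropyLoss()
# CrossEntropyLoss expects (N, C) logits and (N,) class indices here,
# so flatten the sequence and batch dimensions together.
loss = criterion(logits.view(-1, ntokens), targets.view(-1))
print(loss.item())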