
Recurrent neural networks

and LSTMs
NLP Week 7

Thanks to Dan Jurafsky for most of the slides this week!


Plan for today
1. Recap: Estimating n-gram probabilities

2. A note on notation

3. Simple Recurrent Neural Networks (RNNs)

4. RNN architectures

1. For language modeling

2. For classification

3. For sequences (seq-to-seq)

5. Long Short Term Memory Units (LSTMs)

6. Basic attention mechanism

7. Group exercises
This semester
We will build language models, adding a layer of complexity at each step:

1. Bag of words models ( basic statistical models of language )

2. N-gram models ( + sequential dependencies )

3. Hidden Markov models ( + latent categories )

4. Recurrent neural networks ( + distributed representations )

5. LSTM language models ( + long distance dependencies )

6. Transformer language models ( + attention-based dependency learning )

= Today’s language models!


Recap: estimating n-gram probabilities

N-gram for n > 1:

P(wn | wn−1) = C(wn−1 wn) / Σw C(wn−1 w) = C(wn−1 wn) / C(wn−1)

Uni-gram:

P(wn) = C(wn) / ?
Recap: estimating n-gram probabilities

N-gram for n > 1:

P(wn | wn−1) = C(wn−1 wn) / Σw C(wn−1 w) = C(wn−1 wn) / C(wn−1)

Uni-gram (here wn−1 = {}):

P(wn) = C(wn) / Σw C(w)

To get a valid probability distribution we need to normalize, i.e., divide by the count of all possible elements in our set:
• In the uni-gram case this is simply the sum of all uni-gram counts.
• In the n-gram (n > 1) case it is the sum of counts of all n-grams that share the same prefix/context (which is equivalent to the count of the prefix/context).
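A minimal Python sketch of these two estimators on a toy corpus (the corpus and function names are illustrative, not from the slides):

from collections import Counter

corpus = "the green witch arrived and the witch left".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
total_tokens = sum(unigram_counts.values())

def unigram_prob(w):
    # P(w) = C(w) / sum of all uni-gram counts
    return unigram_counts[w] / total_tokens

def bigram_prob(w, prev):
    # P(w | prev) = C(prev w) / C(prev)
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(unigram_prob("witch"))        # 2/8 = 0.25
print(bigram_prob("witch", "the"))  # C(the witch)=1, C(the)=2 -> 0.5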
A note on notation
Quick recap on our notation and matrix-vector multiplication:

• Let A denote a p × d matrix
• Let x denote a d × 1 vector (column vector)

We need to understand y = Ax:
• Linear combination view
• Inner product view

👉 will do this on the whiteboard …
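A quick numpy illustration of the two views of y = Ax (the concrete numbers are just an example):

import numpy as np

A = np.array([[1., 2., 0.],
              [0., 1., 3.]])   # p x d with p = 2, d = 3
x = np.array([2., 1., 1.])     # d-dimensional column vector

# Inner product view: each y[i] is the dot product of row i of A with x
y_inner = np.array([A[i] @ x for i in range(A.shape[0])])

# Linear combination view: y is a weighted sum of the columns of A
y_lincomb = sum(x[j] * A[:, j] for j in range(A.shape[1]))

assert np.allclose(y_inner, A @ x)
assert np.allclose(y_lincomb, A @ x)
print(A @ x)   # [4. 4.]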


Modeling Time in Neural Networks
Language is inherently temporal.
Yet the simple NLP classifiers we've seen (for example for
sentiment analysis) mostly ignore time
• (Feedforward neural LMs (and the transformers we'll see
later) use a "moving window" approach to time.)
Here we introduce a deep learning architecture with a different
way of representing time
• RNNs and their variants like LSTMs
Recurrent Neural Networks (RNNs)

Any network that contains a cycle within its network connections.
The value of some unit is directly, or indirectly, dependent on its own earlier outputs as an input.
Simple recurrent neural units (SRN or Elman Net)

[Diagram: simple recurrent unit with input xt, hidden layer ht, and output yt]
The hidden layer has a recurrence as part of its input
The activation value ht depends on xt but also ht-1!
Forward inference in simple RNNs

Very similar to the feedforward networks we've seen!


[Diagram: simple recurrent unit; ht is computed from xt (via W) and ht-1 (via U), and produces output yt]
Inference has to be incremental

Computing ht at time t requires that we first computed ht-1 at the previous time step!

The fact that the computation at time t requires the value of the hidden layer from time t−1 mandates an incremental inference algorithm that proceeds from the start of the sequence to the end, as illustrated in Fig. 8.3. The sequential nature of simple recurrent networks can also be seen by unrolling the network in time, as shown in Fig. 8.4. In this figure, the various layers of units are copied for each time step to illustrate that they will have differing values over time. However, the various weight matrices are shared across time.

function FORWARD_RNN(x, network) returns output sequence y
  h0 ← 0
  for i ← 1 to LENGTH(x) do
    hi ← g(U hi−1 + W xi)
    yi ← f(V hi)
  return y

Figure 8.3: Forward inference in a simple recurrent network. The matrices U, V and W are shared across time, while new values for h and y are calculated with each time step.
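A minimal numpy sketch of the forward-inference loop in Figure 8.3 (assuming tanh for g and softmax for f; the function name is mine, not from the text):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_rnn(X, U, W, V):
    """X is a list of input vectors x_1..x_N; returns the output sequence y_1..y_N."""
    h = np.zeros(U.shape[0])            # h_0 = 0
    ys = []
    for x in X:                         # proceed incrementally from start to end
        h = np.tanh(U @ h + W @ x)      # h_i = g(U h_{i-1} + W x_i)
        ys.append(softmax(V @ h))       # y_i = f(V h_i)
    return ys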
Training in simple RNNs
Just like feedforward training, we need:
• a training set,
• a loss function,
• backpropagation

[Diagram: simple recurrent unit with weight matrices W, U, and V]

Weights that need to be updated:
• W, the weights from the input layer to the hidden layer,
• U, the weights from the previous hidden layer to the current hidden layer,
• V, the weights from the hidden layer to the output layer.
Training in simple RNNs: unrolling in time
Unlike feedforward networks:

1. To compute the loss function for the output at time t we need the hidden layer from time t − 1.

2. The hidden layer at time t influences the output at time t and the hidden layer at time t + 1 (and hence the output and loss at t + 1).

So: to measure error accruing to ht, we need to know its influence on both the current output as well as the ones that follow.

[Diagram: RNN unrolled over three time steps (h0, x1 → h1 → y1; h1, x2 → h2 → y2; h2, x3 → h3 → y3), with the weights U, W, V shared across time]
Training in simple RNNs: unrolling in time
y3

We unroll the RNN into a V

feedforward computational graph y2 h3

eliminating recurrence! y1
V

h2
U W

x3

U W
V

Given an input sequence:


h1 x2

U W

1. Generate an unrolled h0 x1

feedforward network specific to


input
2. Use graph to train weights
directly via ordinary backprop
RNN Architectures
We will see three types of RNN-based architectures that can
be used for different tasks:
1. RNNs for language modeling

2. RNNs for classification

3. RNNs for sequences


Reminder: Language Modeling
Modeling the probability of words in context.
Task: predict next word wt
given prior words wt-1, wt-2, wt-3, …
Problem with N-gram and feedforward NNs: Dealing with
sequences of arbitrary length.
Previous solution: Sliding windows (of fixed length)

What about RNNs? No limit on context size! All prior words count.
The size of the conditioning context for different LMs

The n-gram LM:


Context size is the n − 1 prior words we condition on.

The feedforward LM:


Context is the window size.

The RNN LM:


No fixed context size; ht-1 represents entire history
Feed forward LMs vs RNN LMs

[Diagram: (a) a feedforward LM, which conditions only on a fixed window of embeddings et-2, et-1, et, vs. (b) an RNN LM, whose hidden state ht carries forward the entire history]
Forward inference in an RNN language model

Forward inference in the RNN LM proceeds exactly as described for the simple RNN.

Given an input sequence X = [x1; ...; xt; ...; xN] of N tokens, where each xt is a one-hot vector of size |V| × 1, the model produces at each time step a vector representing a probability distribution over the vocabulary.

At time t:

• Use the embedding matrix E to retrieve the embedding for the current token xt:

  et = E xt

• Combine it with the hidden layer from the previous step (weighted by the weight matrix U) to compute a new hidden layer:

  ht = g(U ht−1 + W et)

• This hidden layer is then used to generate an output layer, which is passed through a softmax to produce a probability distribution over the entire vocabulary:

  ŷt = softmax(V ht)
Shapes

• ŷt : |V| × 1
• V : |V| × dh
• U : dh × dh
• ht : dh × 1
• W : dh × de
• et : de × 1
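These shapes can be verified with a tiny numpy sketch of one step of the RNN LM (the concrete dimensions chosen below are arbitrary):

import numpy as np

V_size, d_e, d_h = 10, 4, 6          # |V|, embedding dim, hidden dim

E = np.random.randn(d_e, V_size)     # embedding matrix: d_e x |V|
W = np.random.randn(d_h, d_e)        # input -> hidden:  d_h x d_e
U = np.random.randn(d_h, d_h)        # hidden -> hidden: d_h x d_h
V = np.random.randn(V_size, d_h)     # hidden -> output: |V| x d_h

x_t = np.eye(V_size)[3]              # one-hot token, |V| x 1
h_prev = np.zeros(d_h)

e_t = E @ x_t                                    # d_e x 1
h_t = np.tanh(U @ h_prev + W @ e_t)              # d_h x 1
y_t = np.exp(V @ h_t) / np.exp(V @ h_t).sum()    # softmax over vocab, |V| x 1
print(e_t.shape, h_t.shape, y_t.shape)           # (4,) (6,) (10,)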
Computing the probability that the next word is word k

The probability that a particular word k in the vocabulary is the next word is represented by ŷt[k], the kth component of ŷt:

P(wt+1 = k | w1, ..., wt) = ŷt[k]

Computing the probability of, or scoring, a sentence

The probability of an entire sequence is just the product of the probabilities of each word in the sequence, where we use ŷi[wi] to mean the probability of the true word wi at time step i:

P(w1:n) = ∏_{i=1}^{n} P(wi | w1:i−1) = ∏_{i=1}^{n} ŷi[wi]
Training the RNN LM
• Self-supervision (like with the feedforward LM):
  • take a corpus of text as training material
  • ask the model to predict the next word
  • (unlike the feedforward LM) at each time step t!
• Why is this called self-supervised? We don't need human labels; the text is its own supervision signal.
• We train the model to minimize the error in predicting the true next word in the training sequence, using cross-entropy as the loss function.
Cross-entropy loss

Minimizes the difference between (1) a predicted probability distribution and (2) the correct distribution:

LCE = − Σ_{w∈V} yt[w] log ŷt[w]    (8.10)

The CE loss for LMs can actually be simplified, since:
• The correct distribution yt is a one-hot vector over the vocabulary (i.e. the entry for the actual next word is 1, and all the other entries are 0).
• So the cross-entropy loss for language modeling is determined only by the probability the model assigns to the correct next word.
• So at time t, the CE loss is the negative log probability the model assigns to the next word in the training sequence:

LCE(ŷt, yt) = − log ŷt[wt+1]    (8.11)
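A small numeric check of this simplification: with a one-hot target, the full cross-entropy sum collapses to the negative log probability of the true next word (the toy distribution below is made up):

import numpy as np

y_hat = np.array([0.1, 0.6, 0.2, 0.1])   # predicted distribution over a toy vocab
true_idx = 1                             # index of the actual next word w_{t+1}

y_true = np.zeros_like(y_hat)            # one-hot correct distribution
y_true[true_idx] = 1.0

full_ce = -np.sum(y_true * np.log(y_hat))   # L_CE = -sum_w y_t[w] log yhat_t[w]   (8.10)
simplified = -np.log(y_hat[true_idx])       # L_CE = -log yhat_t[w_{t+1}]          (8.11)
assert np.isclose(full_ce, simplified)
print(full_ce)   # 0.5108...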
Teacher forcing
We always give the model the correct history to predict the next word (rather than feeding the model its own prediction from the previous time step, which could be wrong).
This is called teacher forcing: in training we force the context to be correct, based on the gold words.
Summary: Training RNN for language modeling

[Figure: the RNN LM training setup, computing the likelihood of each word in the vocabulary from the final hidden layer via Vh, with a cross-entropy loss at each time step. V is of shape |V| × dh.]

Weight tying

An optional architectural modification when de = dh.

When the embedding dimension and the hidden dimension are the same, the embedding matrix E and the final layer matrix V have similar shapes: E is d × |V| while V is |V| × d. That is, the rows of V are shaped like a transpose of E, meaning that V provides a second set of learned word embeddings.

Instead of having two separate matrices, we just tie them together, using E⊤ (E transposed) in place of V, so the same embeddings appear in two places. The weight-tied equations for an RNN language model then become:

et = E xt    (8.12)
ht = g(U ht−1 + W et)    (8.13)
ŷt = softmax(E⊤ ht)    (8.14)
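A minimal sketch of a weight-tied step, with a single matrix E used both for the input embeddings and (transposed) for the output layer; shapes follow the slide (E is d × |V|), and all names are illustrative:

import numpy as np

V_size, d = 10, 6                    # here d_e = d_h = d
E = np.random.randn(d, V_size)       # single embedding matrix, d x |V|
U = np.random.randn(d, d)
W = np.random.randn(d, d)

def tied_step(x_onehot, h_prev):
    e_t = E @ x_onehot                   # e_t = E x_t                  (8.12)
    h_t = np.tanh(U @ h_prev + W @ e_t)  # h_t = g(U h_{t-1} + W e_t)   (8.13)
    logits = E.T @ h_t                   # reuse E^T in place of V      (8.14)
    z = np.exp(logits - logits.max())
    return h_t, z / z.sum()              # yhat_t = softmax(E^T h_t)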
Autoregressive generation from a RNN LM
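The figure for this slide is not reproduced here, but the idea is: start from a beginning-of-sentence token, sample a word from ŷt, feed the sampled word back in as the next input, and repeat until an end token (or a length limit) is reached. A hedged sketch, reusing a step function like tied_step above (the token ids and <s>/</s> handling are illustrative):

import numpy as np

def generate(step_fn, d_h, vocab_size, bos_id, eos_id, max_len=20, seed=0):
    """Autoregressive generation: each sampled token becomes the next input."""
    rng = np.random.default_rng(seed)
    h = np.zeros(d_h)
    token, output = bos_id, []
    for _ in range(max_len):
        x = np.eye(vocab_size)[token]                  # one-hot of the previous token
        h, y_hat = step_fn(x, h)                       # one RNN LM step
        token = int(rng.choice(vocab_size, p=y_hat))   # sample the next word from yhat_t
        if token == eos_id:
            break
        output.append(token)
    return output

# e.g. generate(tied_step, d_h=6, vocab_size=10, bos_id=0, eos_id=1)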
RNN Architectures
We will see three types of RNN-based architectures that can
be used for different tasks:
1. RNNs for language modeling

2. RNNs for classification

3. RNNs for sequences


RNNs for text classification
There is only one final output prediction, rather than an output
at each state.
We can use just the final hidden state as a sequence encoding
that gets fed to a regular feedforward NN for classification.

RNNs for text classification


Figure 8.8: Sequence classification using a simple RNN combined with a feedforward network. The final hidden state from the RNN is used as the input to a feedforward network that performs the classification.

Mean pooling

Alternatively, instead of taking only the last hidden state, we can use some pooling function of all the hidden states, like mean pooling, which pools all the n hidden states by taking their element-wise mean:

hmean = (1/n) Σ_{i=1}^{n} hi    (8.15)

Or we can take the element-wise max: the element-wise max of a set of n vectors is a new vector whose kth element is the max of the kth elements of all the n vectors.

The long contexts of RNNs make it quite difficult to successfully backpropagate error all the way through the entire input; we'll talk about this problem, and some standard solutions, in Section 8.5.
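A small numpy sketch of the pooling options over the RNN hidden states (shapes are illustrative):

import numpy as np

H = np.random.randn(5, 8)       # n = 5 hidden states, each of dimension d_h = 8

h_last = H[-1]                  # option 1: only the final hidden state
h_mean = H.mean(axis=0)         # option 2: element-wise mean of all states (8.15)
h_max  = H.max(axis=0)          # option 3: element-wise max of all states

# Any of these vectors becomes the input to the feedforward classifier.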
RNN Architectures
We will see three types of RNN-based architectures that can
be used for different tasks:
1. RNNs for language modeling

2. RNNs for classification

3. RNNs for sequences


Two types of sequence prediction tasks
1. sequence labeling : When output sequence length matches
input
2. seq-to-seq or encoder-decoder : When output and input
sequence are of different lengths
RNNs for sequence labeling
Assign a label to each element of a sequence (e.g. POS tagging):
• The main difference from the LM and classification architectures is that the output layer's dimension is over labels (not the vocabulary), and there is an output prediction at each time step (not just at the last one).
RNNs for sequence to sequence tasks
An example of such a task is translation:
‘Have you tasted loroco before?’
‘Avez-vous déjà goûté au loroco?’
3 components of an encoder-decoder
1. An encoder that accepts an input sequence, x1:n, and generates
a corresponding sequence of contextualized representations, h1:n.
2. A context vector, c, which is a function of h1:n, and conveys the
essence of the input to the decoder.
3. A decoder, which accepts c as input and generates an arbitrary
length sequence of hidden states h1:m, from which a
corresponding sequence of output states y1:m, can be obtained
Encoder-decoder for translation

We build a pair of models that can translate from a source text in one language to a target text in a second language: add a sentence separation marker at the end of the source text, and then simply concatenate the target text.

Let's use <s> for our sentence separator token, and let's think about translating an English source text ("the green witch arrived") to a Spanish sentence ("llegó la bruja verde", which can be glossed word-by-word as 'arrived the witch green').

Regular language modeling: in any language model, we can break down the probability of a sequence y as follows:

p(y) = p(y1) p(y2|y1) p(y3|y1, y2) . . . p(ym|y1, ..., ym−1)    (8.28)

Conditioned sequence scoring: let x be the source text plus a separator token <s>, and y the target text:

x = The green witch arrived <s>
y = llegó la bruja verde

Then an encoder-decoder model computes the probability p(y|x) as follows:

p(y|x) = p(y1|x) p(y2|y1, x) p(y3|y1, y2, x) . . . p(ym|y1, ..., ym−1, x)    (8.31)

Fig. 8.17 shows the setup for a simplified version of the encoder-decoder model.
Encoder-decoder simplified

[Figure 8.17: a simplified encoder-decoder. The source text "the green witch arrived", the separator <s>, and the target text "llegó la bruja verde </s>" are concatenated and passed through the embedding, hidden, and softmax layers; the softmax output over the source portion is ignored.]
Encoder-decoder showing context

Each hidden state of the decoder is conditioned not only on the input and the prior hidden state, but also on the context:

h^d_t = g(ŷt−1, h^d_{t−1}, c)

[Figure: encoder-decoder network. The encoder produces hidden states h^e_1 ... h^e_n; the final encoder state h^e_n becomes the context c = h^d_0, which initializes the decoder states h^d_1 ... h^d_m. The output is ignored during encoding.]
Encoder-decoder equations

c = h^e_n
h^d_0 = c
h^d_t = g(ŷt−1, h^d_{t−1}, c)
ŷt = softmax(h^d_t)    (8.33)

• g is a stand-in for some flavor of RNN.
• ŷt−1 is the embedding for the output sampled from the softmax at the previous step.
• ŷt is the output vector of probabilities over the vocabulary, representing the probability of each word occurring at time t.
• To generate text, we sample from this distribution ŷt. For example, the greedy choice is simply to choose the most probable word to generate at each timestep.
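A hedged sketch of greedy decoding with equations (8.33). Exactly how c enters g depends on the flavor of RNN; here it is simply added through an extra (made-up) weight matrix C, and the embedding table and weight names are likewise illustrative:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_decode(h_enc_final, embed, U, W, C, V_out, bos_id, eos_id, max_len=20):
    c = h_enc_final                           # c = h^e_n
    h = c                                     # h^d_0 = c
    token, output = bos_id, []
    for _ in range(max_len):
        y_prev = embed[token]                 # embedding of the previous output token
        h = np.tanh(U @ h + W @ y_prev + C @ c)   # h^d_t = g(yhat_{t-1}, h^d_{t-1}, c)
        y_hat = softmax(V_out @ h)            # distribution over the vocabulary
        token = int(np.argmax(y_hat))         # greedy choice: most probable word
        if token == eos_id:
            break
        output.append(token)
    return output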
Training the encoder-decoder with teacher forcing
[Figure: training the encoder-decoder with teacher forcing. The source "the green witch arrived <s>" is encoded; the decoder is trained against the gold answers "llegó la bruja verde </s>", with per-word cross-entropy losses L1 = −log P(y1), ..., L5 = −log P(y5).]

Total loss is the average cross-entropy loss per target word.
RNN Architectures
We will see three types of RNN-based architectures that can
be used for different tasks:
1. RNNs for language modeling

2. RNNs for classification

3. RNNs for sequences


Other architectural modifications

We have seen different ways to use a simple single-layered RNN in a unidirectional manner. In all use cases we can additionally modify our architectures to include:

1. Multiple layers, or stacked RNNs

2. Bidirectional encoding
Stacked RNNs
Bidirectional RNNs
Bidirectional RNNs for classification
Other types of recurrent units
Recall: an RNN is any network that contains a cycle within its network connections, i.e. the value of some unit is directly, or indirectly, dependent on its own earlier outputs as an input.

All the architectures we have just seen can be implemented with ANY type of RNN.

We saw simple recurrent neural units (SRN). Two other common types are gated recurrent units (GRU) and long short-term memory units (LSTM).
Motivating the LSTM: dealing with distance
Hidden layers in RNN are forced to do two things:
a) Provide information useful for the current decision,
b) Update and carry forward information required for future decisions.

Leads to two problems:

1. It's hard to assign probabilities accurately when context is very far away: 'The flights the airline was canceling were full' requires knowing and prioritizing information from far-away prior states over closer information.

2. During backprop, we have to repeatedly multiply gradients through time and many hidden states, which can lead to the "vanishing gradient" problem.
LSTM: Long short-term memory network
LSTMs divide the context management problem into two
subproblems:
1. removing information no longer needed from the context,
2. adding information likely to be needed for later decision
making

LSTMs add:
- explicit context layer
- Neural circuits with gates to control information flow
Difference between neural units

Feedforward unit Simple recurrent unit LSTM unit


The LSTM
[Figure: the LSTM unit as a computation graph, with inputs ct-1, ht-1, xt, gates f (forget), g and i (add), and o (output), and outputs ct, ht]
The LSTM
[Figure: the LSTM unit, highlighting the forget gate f]
Forget gate

Its role: delete information from the context that is no longer needed.

• Computes a weighted sum of the previous hidden state and the current input, and passes it through a sigmoid activation. (Values in the layer being gated that align with values near 1 in the mask are passed through nearly unchanged; values corresponding to lower mask values are essentially erased.)
• This mask is then multiplied element-wise with the context vector to remove information from the context that is no longer required.
• Element-wise multiplication of two vectors (represented by the operator ⊙, and sometimes called the Hadamard product) is the vector of the same dimension as the two input vectors, where each element i is the product of element i of the two input vectors.

ft = σ(Uf ht−1 + Wf xt)    (8.20)
kt = ct−1 ⊙ ft    (8.21)
The LSTM
[Figure: the LSTM unit, highlighting the add gate]
Add gate

Its role: selecting the information to add to the current context.

The next task is to compute the actual information we need to extract from the previous hidden state and current input. First calculate the regular information passing (as with the SRN):

gt = tanh(Ug ht−1 + Wg xt)    (8.22)

Then we generate the mask for the add gate, to select the information to add to the current context, by multiplying the sigmoid output element-wise with the current state output:

it = σ(Ui ht−1 + Wi xt)    (8.23)
jt = gt ⊙ it    (8.24)

Next, we add this to the modified context vector to get our new context vector:

ct = jt + kt    (8.25)
The LSTM
[Figure: the LSTM unit, highlighting the output gate o]
The LSTM

Make sure to check out https://colah.github.io/posts/2015-08-Understanding-LSTMs/

[Figure: the complete LSTM unit as a computation graph]
Output gate

Its role: decide what information is required for the current hidden state (as opposed to what information needs to be preserved for future decisions).

Multiplies element-wise the current state's sigmoid output by the tanh of the current context:

ot = σ(Uo ht−1 + Wo xt)    (8.26)
ht = ot ⊙ tanh(ct)    (8.27)

Fig. 8.13 illustrates the complete computation for a single LSTM unit: the inputs to each unit consist of the previous hidden state, ht−1, and the previous context, ct−1; the outputs are a new hidden state, ht, and a new context, ct.
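Putting the three gates together, a single LSTM step (equations 8.20–8.27) can be sketched in numpy as follows; the sigmoid helper and the parameter packing are my own conventions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    Uf, Wf, Ug, Wg, Ui, Wi, Uo, Wo = params
    f_t = sigmoid(Uf @ h_prev + Wf @ x_t)   # forget gate               (8.20)
    k_t = c_prev * f_t                      # erase stale context       (8.21)
    g_t = np.tanh(Ug @ h_prev + Wg @ x_t)   # candidate information     (8.22)
    i_t = sigmoid(Ui @ h_prev + Wi @ x_t)   # add gate                  (8.23)
    j_t = g_t * i_t                         # selected new information  (8.24)
    c_t = j_t + k_t                         # new context               (8.25)
    o_t = sigmoid(Uo @ h_prev + Wo @ x_t)   # output gate               (8.26)
    h_t = o_t * np.tanh(c_t)                # new hidden state          (8.27)
    return h_t, c_t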
LSTMs for natural language data
Because natural language has many long-distance dependencies, LSTMs are the recurrent neural network of choice for language modeling.

This is also true of LSTM encoder-decoders.
Encoder-decoder

Recall, we use the final hidden state of the encoder as the context for the decoder:

h^d_t = g(ŷt−1, h^d_{t−1}, c)

[Figure: encoder-decoder network. The final encoder hidden state h^e_n = c = h^d_0 initializes the decoder hidden states h^d_1 ... h^d_m.]
Problem with passing context c only from end

Requiring the context c to be only the encoder's final hidden state forces all the information from the entire source sentence to pass through this representational bottleneck.
Solution: attention

Instead of taking only the last hidden state, we apply an attention mechanism to all the hidden states: specifically, we take a weighted average of all the hidden states of the encoder as our context.

That is, c = f(h^e_1 . . . h^e_n, h^d_{i−1}). The weights focus on ('attend to') the part of the source text that is relevant for the token i that the decoder is currently producing.

This weighted average is also informed by part of the decoder state: the state of the decoder right before the current token i.

Attention thus replaces the static context vector with one that is dynamically derived from the encoder hidden states and is different for each token in decoding. This context vector, ci, is generated anew with each decoding step i and takes all of the encoder hidden states into account in its derivation.
This context vector, ci , is generated anew with each de
all of the encoder hidden states into account in its derivati
i 1
capture relevance by computing— at each state i during decoding—a score(hdi 1 , hej )

How to compute c ?
dot-product
attention
for each encoder state j.
i
The simplest such score, called dot-product attention, implements relevance as
similarity: measuring how similar the decoder hidden state is to an encoder hidden
First state,
we score the relevance
by computing of between
the dot product each encoder
them: state j to hidden
decoder state i-1:
score(hdi 1 , hej ) = hdi 1 · hej (8.35)

The score that results from this dot product is a scalar that reflects the degree of
Second, we between
similarity normalize with
the two a softmax
vectors. The vectorto create
of these weights
scores across allαthe
i j encoder
hidden states gives us the relevance of each encoder state to the current step of the
decoder.
To make use of these scores, we’ll normalize them with a softmax to create a
vector of weights, ai j , that tells us the proportional relevance of each encoder hidden
24 C HAPTER
state8 j •to the
RNN S AND
prior LSTM
hidden S
decoder state, hdi 1 .
Finally, we use these scores to help create
d a eweighted average:
states. ai j = softmax(score(hi 1 , h j ))
X
= ai j hdiej 1 , hej )
ciexp(score(h (8.37)
= P j d e
(8.36)
k exp(score(hi 1 , hk ))
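A small numpy sketch of dot-product attention at one decoder step (equations 8.35–8.37); the function name and shapes are illustrative:

import numpy as np

def attention_context(h_dec_prev, H_enc):
    """h_dec_prev: decoder state h^d_{i-1}, shape (d,); H_enc: encoder states, shape (n, d)."""
    scores = H_enc @ h_dec_prev              # score(h^d_{i-1}, h^e_j) = dot product  (8.35)
    w = np.exp(scores - scores.max())
    alphas = w / w.sum()                     # alpha_ij = softmax over the scores     (8.36)
    c_i = alphas @ H_enc                     # c_i = sum_j alpha_ij h^e_j             (8.37)
    return c_i, alphas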
Encoder-decoder with attention, focusing on the computation of ci

Recall, without attention the context is fixed and the same for all decoder states:

h^d_t = g(ŷt−1, h^d_{t−1}, c)

With attention, the context changes for every state! The context vector ci is generated anew with each decoding step i and takes all of the encoder hidden states into account in its derivation. We then condition the computation of the current decoder hidden state on it (along with the prior hidden state and the previous output of the decoder), as in this equation (and Fig. 8.21):

h^d_i = g(ŷi−1, h^d_{i−1}, ci)    (8.34)

[Figure: decoder with attention; each decoder state h^d_i is computed with its own context vector ci.]
[15 minute break]
Working with RNN models!

Team up!

Open exercises/week 7 in your course folder and start writing/running code!