NLP Week 7: RNNs and LSTMs

2. A note on notation
4. RNN architectures
7. Group exercises
This semester
We will build language models, adding to their complexity layer by layer:
Recap: estimating n-gram probabilities

N-gram (n > 1):
P(w_n | w_{n-1}) = C(w_{n-1} w_n) / Σ_w C(w_{n-1} w) = C(w_{n-1} w_n) / C(w_{n-1})

Uni-gram (the context is empty, w_{n-1} = {}):
P(w_n) = C(w_n) / Σ_w C(w)
To get a valid probability distribution we need to normalize, i.e., divide by the total count over all possible elements in our set:
• In the uni-gram case this is simply the sum of all uni-gram counts
• In the n-gram (n > 1) case it is the sum of counts of all n-grams that share the same prefix/context
(which is equivalent to the count of the prefix/context)
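A minimal sketch in plain Python of these estimates on a toy corpus (the corpus and tokenization are illustrative assumptions):

```python
from collections import Counter

# Toy corpus (hypothetical example, for illustration only).
corpus = "the green witch arrived <s> the witch arrived".split()

# Count uni-grams and bi-grams.
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

total_tokens = sum(unigram_counts.values())

def p_unigram(w):
    # P(w) = C(w) / sum over all uni-gram counts
    return unigram_counts[w] / total_tokens

def p_bigram(w_prev, w):
    # P(w | w_prev) = C(w_prev w) / C(w_prev)
    # The denominator equals the sum of counts of all bi-grams sharing the prefix w_prev.
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(p_unigram("witch"))        # 2 / 8
print(p_bigram("the", "witch"))  # 1 / 2
```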
A note on notation
Quick recap on our notation and matrix-vector multiplication
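As a reminder of the conventions used below, a small numpy sketch of a matrix-vector product followed by an element-wise non-linearity (the dimensions are arbitrary):

```python
import numpy as np

d_in, d_h = 4, 3                 # arbitrary example dimensions
W = np.random.randn(d_h, d_in)   # weight matrix: d_h x d_in
x = np.random.randn(d_in, 1)     # input as a column vector: d_in x 1

z = W @ x                        # matrix-vector product: (d_h x d_in)(d_in x 1) -> d_h x 1
h = np.tanh(z)                   # element-wise non-linearity g

print(W.shape, x.shape, z.shape, h.shape)   # (3, 4) (4, 1) (3, 1) (3, 1)
```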
[Figure: a simple recurrent network with input x_t, hidden layer h_t, and output y_t; the hidden layer has a recurrent connection to itself.]
The hidden layer has a recurrence as part of its input: the activation value h_t depends on x_t but also on h_{t-1}!
Forward inference in simple RNNs
[Figure: h_t is computed from h_{t-1} (weighted by U) and x_t (weighted by W).]
Inference has to be incremental: computing h_t at time t requires that we first computed h_{t-1} at the previous time step!
The fact that the computation at time t requires the value of the hidden layer from time t-1 mandates an incremental inference algorithm that proceeds from the start of the sequence to the end, as illustrated in Fig. 8.3. The sequential nature of simple recurrent networks can also be seen by unrolling the network in time, as shown in Fig. 8.4. In this figure, the various layers of units are copied for each time step to illustrate that they will have differing values over time. However, the various weight matrices are shared across time.
h_0 ← 0
for i ← 1 to LENGTH(x) do
    h_i ← g(U h_{i-1} + W x_i)
    y_i ← f(V h_i)
return y
Figure 8.3: Forward inference in a simple recurrent network. The matrices U, V and W are shared across time, while new values for h and y are calculated with each time step.
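A minimal numpy sketch of this forward-inference loop, assuming tanh for g and a softmax for f (weights and dimensions are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward_rnn(X, U, W, V, g=np.tanh):
    """Forward inference in a simple RNN.
    X: sequence of input vectors, shape (T, d_in)
    U: (d_h, d_h), W: (d_h, d_in), V: (d_out, d_h), all shared across time.
    Returns the list of outputs y_1..y_T."""
    h = np.zeros(U.shape[0])             # h_0 = 0
    ys = []
    for x in X:                          # for i = 1 to LENGTH(x)
        h = g(U @ h + W @ x)             # h_i = g(U h_{i-1} + W x_i)
        ys.append(softmax(V @ h))        # y_i = f(V h_i)
    return ys

# Tiny usage example with random weights (illustrative dimensions).
rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 4, 3, 5, 6
U, W, V = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in)), rng.normal(size=(d_out, d_h))
X = rng.normal(size=(T, d_in))
print([y.shape for y in forward_rnn(X, U, W, V)])   # T vectors of shape (d_out,)
```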
Training in simple RNNs
Just like feedforward training, we need:
• a training set,
• a loss function,
• backpropagation
[Figure: the RNN cell at time t (weights U, V, W) and the network unrolled over times t-1, t, t+1.]
Two things to keep in mind:
1. to compute the loss for the output at time t we need the hidden layer from time t - 1;
2. the hidden layer at time t influences the output at time t and the hidden layer at time t + 1.
1. Generate an unrolled network, eliminating recurrence! The layers are copied for each time step (hidden states h_0, h_1, h_2 with inputs x_1, x_2, x_3 and outputs y_1, y_2), while the weights U, V, W are shared.
[Figure: (a) the network unrolled across time steps; (b) a feed-forward network (FFN) compared with a recurrent network (RNN), both with hidden layer h_t and output ŷ_t.]
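In practice, frameworks such as PyTorch perform the unrolling and backpropagation through time automatically; a minimal sketch of one training step on random toy data (all dimensions, hyperparameters, and the per-timestep classification loss are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Toy dimensions (assumptions for illustration only).
d_in, d_h, n_classes, T, batch = 8, 16, 5, 10, 4

rnn = nn.RNN(input_size=d_in, hidden_size=d_h, batch_first=True)  # simple (Elman) RNN
out_layer = nn.Linear(d_h, n_classes)                             # plays the role of V
loss_fn = nn.CrossEntropyLoss()
optim = torch.optim.SGD(list(rnn.parameters()) + list(out_layer.parameters()), lr=0.1)

# Random training batch: inputs x_1..x_T and a label per time step.
x = torch.randn(batch, T, d_in)
targets = torch.randint(0, n_classes, (batch, T))

hidden_states, _ = rnn(x)                      # unrolled forward pass: (batch, T, d_h)
logits = out_layer(hidden_states)              # (batch, T, n_classes)
loss = loss_fn(logits.reshape(-1, n_classes), targets.reshape(-1))

optim.zero_grad()
loss.backward()                                # backpropagation through time
optim.step()
print(float(loss))
```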
Forward inference in an RNN language model

Forward inference in a recurrent language model proceeds exactly as described before. Given an input X of N tokens:
• The input sequence X = [x_1; ...; x_t; ...; x_N] consists of a series of words, each represented as a one-hot vector of size |V| × 1, and the output prediction ŷ is a vector representing a probability distribution over the vocabulary.
• At each step, the model uses the word embedding matrix E to retrieve the embedding for the current token x_t, multiplies it by the weight matrix W, and then adds it to the hidden layer from the previous step (weighted by the weight matrix U) to compute a new hidden layer.
• This hidden layer is then used to generate an output layer, which is passed through a softmax layer to generate a probability distribution over the entire vocabulary.

That is, at time t:
e_t = E x_t    (8.12)
h_t = g(U h_{t-1} + W e_t)    (8.13)
ŷ_t = softmax(E^T h_t)    (8.14)

[Figure: the unrolled RNN LM with dimensions V: |V| × d_h, U: d_h × d_h, W: d_h × d_e, embeddings e_t: d_e × 1, hidden states h_t: d_h × 1.]

Computing the probability that the next word is word k
The probability that a particular word k in the vocabulary is the next word is represented by ŷ_t[k], the k-th component of ŷ_t:
P(w_{t+1} = k | w_1, ..., w_t) = ŷ_t[k]
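A minimal numpy sketch of one time step of the RNN LM (Eqs. 8.12-8.14), with the output layer tied to the embedding matrix as in Eq. 8.14; the vocabulary, weights, and the assumption d_h = d_e are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm_step(token_id, h_prev, E, U, W):
    """One time step of the RNN LM (Eqs. 8.12-8.14) with tied input/output embeddings.
    E is stored as (|V|, d_e) with one row per word, so the row lookup implements
    e_t = E x_t for a one-hot x_t, and E @ h_t implements E^T h_t in the slide's
    notation. Weight tying assumes d_h == d_e."""
    e_t = E[token_id]                       # e_t = E x_t
    h_t = np.tanh(U @ h_prev + W @ e_t)     # h_t = g(U h_{t-1} + W e_t)
    y_t = softmax(E @ h_t)                  # ŷ_t = softmax(E^T h_t), a |V|-dim distribution
    return h_t, y_t

# Toy usage (all sizes are illustrative assumptions).
rng = np.random.default_rng(0)
V, d = 10, 6                                # vocabulary size and shared d_e = d_h
E, U, W = rng.normal(size=(V, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d))
h, y = rnn_lm_step(3, np.zeros(d), E, U, W)
print(y.shape, round(float(y.sum()), 3))    # (10,) 1.0
```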
Autoregressive generation from a RNN LM
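A hedged sketch of the autoregressive loop with the same tied-embedding LM: start from <s>, sample a word from ŷ_t, feed it back in as the next input, and stop at </s> or a length limit (token ids, weights, and dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate(E, U, W, bos_id, eos_id, max_len=20):
    """Autoregressive generation from an RNN LM with tied embeddings: at each step
    the sampled word becomes the next input. Sampling is used here; greedy argmax
    is the other common choice."""
    d = U.shape[0]
    h = np.zeros(d)
    token, output = bos_id, []
    for _ in range(max_len):
        e_t = E[token]
        h = np.tanh(U @ h + W @ e_t)
        y = softmax(E @ h)
        token = rng.choice(len(y), p=y)     # sample w_{t+1} ~ ŷ_t
        if token == eos_id:
            break
        output.append(int(token))
    return output

# Toy usage with random weights (token ids 0 = <s>, 1 = </s> are assumptions).
V, d = 10, 6
E, U, W = rng.normal(size=(V, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d))
print(generate(E, U, W, bos_id=0, eos_id=1))
```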
RNN Architectures
We will see three types of RNN-based architectures that can
be used for different tasks:
1. RNNs for language modeling
[Figure: autoregressive generation with an RNN LM: the embedding layer and hidden layer(s) read the Source Text up to a Separator token, reaching hidden state h_n, and then generate the Target Text.]
Encoder-decoder, showing context
Each hidden state of the decoder is conditioned not only on the input and the prior hidden state, but also on the context. To make the context available to more than just the first decoder hidden state, and to ensure that the influence of the context vector c doesn't wane as the output sequence is generated, we add c as a parameter to the computation of the current hidden state, using the following equation:
h_t^d = g(ŷ_{t-1}, h_{t-1}^d, c)
[Figure: the encoder-decoder model, with context available at each decoding timestep: the Encoder reads x_1 ... x_n (its output is ignored during encoding); after the <s> token, the Decoder generates y_1 ... y_m </s> through embedding, hidden, and softmax layers.]
Encoder-decoder equations
Recall that g is a stand-in for some flavor of RNN and ŷ_{t-1} is the embedding for the output sampled from the softmax at the previous step:
c = h_n^e
h_0^d = c
h_t^d = g(ŷ_{t-1}, h_{t-1}^d, c)
ŷ_t = softmax(h_t^d)    (8.33)
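A minimal numpy sketch of these equations; since g stands in for any flavor of RNN, it is simplified here to a single tanh layer over the concatenated [ŷ_{t-1}; h_{t-1}^d; c], and a learned output projection is added before the softmax (both simplifying assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(X, U, W):
    """Simple RNN encoder; returns the final hidden state h_n^e used as context c."""
    h = np.zeros(U.shape[0])
    for x in X:
        h = np.tanh(U @ h + W @ x)
    return h

def decoder_step(y_prev_emb, h_prev, c, Wd, Vd):
    """h_t^d = g(ŷ_{t-1}, h_{t-1}^d, c): here g is one tanh layer over the concatenation
    of the three inputs (a simplification). The slide applies the softmax to h_t^d
    directly; projecting to the vocabulary first (Vd) is the usual practice."""
    h_t = np.tanh(Wd @ np.concatenate([y_prev_emb, h_prev, c]))
    y_t = softmax(Vd @ h_t)
    return h_t, y_t

# Toy usage (all dimensions and weights are illustrative assumptions).
rng = np.random.default_rng(0)
d_in, d_h, d_e, V_out, T_src = 4, 5, 3, 7, 6
U, W = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in))
Wd = rng.normal(size=(d_h, d_e + d_h + d_h))
Vd = rng.normal(size=(V_out, d_h))

c = encode(rng.normal(size=(T_src, d_in)), U, W)   # c = h_n^e
h = c                                              # h_0^d = c
y_prev = rng.normal(size=d_e)                      # embedding of <s>
h, y = decoder_step(y_prev, h, c, Wd, Vd)
print(y.shape, round(float(y.sum()), 3))           # (7,) 1.0
```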
[Figure: training the encoder-decoder. The Encoder reads x_1 ... x_4 = "the green witch arrived"; after <s>, the decoder's softmax outputs y_1 ... y_5 are compared against the gold answers "llegó la bruja verde </s>".]
RNN Architectures
We will see three types of RNN-based architectures that can
be used for different tasks:
1. RNNs for language modeling
2. Bidirectional encoding
Stacked RNNs
Bidirectional RNNs
Bidirectional RNNs for classification
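A minimal numpy sketch of bidirectional encoding for classification, under the usual scheme of running one RNN left-to-right and one right-to-left and concatenating their final states (all weights and dimensions are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def run_rnn(X, U, W):
    """Simple RNN pass; returns the final hidden state."""
    h = np.zeros(U.shape[0])
    for x in X:
        h = np.tanh(U @ h + W @ x)
    return h

def birnn_classify(X, params):
    """Bidirectional encoding for classification: the forward RNN reads x_1..x_n,
    the backward RNN reads x_n..x_1, and the concatenated final states feed a
    softmax classifier."""
    h_fwd = run_rnn(X, params["Uf"], params["Wf"])          # left-to-right pass
    h_bwd = run_rnn(X[::-1], params["Ub"], params["Wb"])    # right-to-left pass
    h = np.concatenate([h_fwd, h_bwd])                      # 2 * d_h features
    return softmax(params["V"] @ h)

# Toy usage (dimensions, weights and class count are illustrative assumptions).
rng = np.random.default_rng(0)
d_in, d_h, n_classes, T = 4, 3, 2, 5
params = {
    "Uf": rng.normal(size=(d_h, d_h)), "Wf": rng.normal(size=(d_h, d_in)),
    "Ub": rng.normal(size=(d_h, d_h)), "Wb": rng.normal(size=(d_h, d_in)),
    "V":  rng.normal(size=(n_classes, 2 * d_h)),
}
print(birnn_classify(rng.normal(size=(T, d_in)), params))
```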
Other types of recurrent units
Recall: an RNN is any network that contains a cycle within its network connections, i.e., the value of some unit is directly or indirectly dependent on its own earlier outputs as an input.
LSTMs add:
- an explicit context layer
- neural circuits with gates to control the flow of information
Difference between neural units
[Figure: an LSTM unit as a computation graph: inputs x_t, h_{t-1}, c_{t-1}; gate activations σ (forget gate f, add gate i, output gate o) and tanh (g); outputs h_t and c_t.]
The LSTM
[Figure: the LSTM computation graph, with the forget gate f highlighted.]
Forget gate
Its role: delete information from the context that is no longer needed.
The forget gate computes a weighted sum of the previous hidden state and the current input and passes that through a sigmoid activation. This mask is then multiplied element-wise by the context vector to remove the information from the context that is no longer required. Values in the layer being gated that align with values near 1 in the mask are passed through nearly unchanged; values corresponding to lower values are essentially erased.
Element-wise multiplication of two vectors (represented by the operator ⊙, and sometimes called the Hadamard product) is the vector of the same dimension as the two input vectors, where each element i is the product of element i in the two input vectors:
f_t = σ(U_f h_{t-1} + W_f x_t)    (8.20)
k_t = c_{t-1} ⊙ f_t    (8.21)
The LSTM
[Figure: the LSTM computation graph, with the add gate highlighted.]
Add gate
Its role: selecting the information to add to the current context.
The next task is to compute the actual information we need to extract from the previous hidden state and current inputs; this is the same basic computation we've been using for all our recurrent networks.
First calculate the regular information passing (as with a simple RNN):
g_t = tanh(U_g h_{t-1} + W_g x_t)    (8.22)
Next, we generate the mask for the add gate to select the information to add to the current context, multiplying the sigmoid output element-wise with the current state output:
i_t = σ(U_i h_{t-1} + W_i x_t)    (8.23)
j_t = g_t ⊙ i_t    (8.24)
Next, we add this to the modified context vector to get our new context vector:
c_t = j_t + k_t    (8.25)
The LSTM
[Figure: the LSTM computation graph, with the output gate o highlighted.]
The LSTM (make sure to check out https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
[Figure: the complete LSTM unit as a computation graph.]
Output gate
Its role: decide what information is required for the current hidden state (as opposed to what information needs to be preserved for future decisions).
The final gate we'll use is the output gate. It computes a sigmoid mask from the previous hidden state and the current input, and multiplies it element-wise with the tanh of the current context:
o_t = σ(U_o h_{t-1} + W_o x_t)    (8.26)
h_t = o_t ⊙ tanh(c_t)    (8.27)
Fig. 8.13 illustrates the complete computation for a single LSTM unit. The inputs to the unit are the current input x_t, the previous hidden state h_{t-1}, and the previous context c_{t-1}; given the appropriate weights for the various gates, the unit outputs a new hidden state h_t and an updated context c_t.
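Putting the three gates together, a minimal numpy sketch of one LSTM step implementing Eqs. 8.20-8.27 (bias terms omitted as on the slides; weights and dimensions are toy assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step (Eqs. 8.20-8.27). p holds the U_* (d_h x d_h) and W_* (d_h x d_in)
    matrices; element-wise * is the Hadamard product ⊙."""
    f_t = sigmoid(p["Uf"] @ h_prev + p["Wf"] @ x_t)   # forget gate (8.20)
    k_t = c_prev * f_t                                # erase stale context (8.21)
    g_t = np.tanh(p["Ug"] @ h_prev + p["Wg"] @ x_t)   # candidate information (8.22)
    i_t = sigmoid(p["Ui"] @ h_prev + p["Wi"] @ x_t)   # add (input) gate (8.23)
    j_t = g_t * i_t                                   # selected information (8.24)
    c_t = j_t + k_t                                   # new context (8.25)
    o_t = sigmoid(p["Uo"] @ h_prev + p["Wo"] @ x_t)   # output gate (8.26)
    h_t = o_t * np.tanh(c_t)                          # new hidden state (8.27)
    return h_t, c_t

# Toy usage (random weights; d_in and d_h are arbitrary).
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
p = {name: rng.normal(size=(d_h, d_h if name.startswith("U") else d_in))
     for name in ["Uf", "Wf", "Ug", "Wg", "Ui", "Wi", "Uo", "Wo"]}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), p)
print(h.shape, c.shape)   # (3,) (3,)
```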
LSTMs for natural language data
Because natural language has many long-distance dependencies, LSTMs are the recurrent neural network of choice for language modeling.
h_t^d = g(ŷ_{t-1}, h_{t-1}^d, c)
[Figure: the encoder-decoder model again: the Encoder reads x_1 ... x_n (output is ignored during encoding); after <s>, the Decoder generates y_1 ... y_m </s> through embedding, hidden, and softmax layers.]
Problem with passing the context c only from the end
How do we compute c?
Dot-product attention
First, we score the relevance of each encoder state j to the decoder hidden state i-1. The simplest such score, called dot-product attention, implements relevance as similarity, measuring how similar the decoder hidden state is to an encoder hidden state by computing the dot product between them:
score(h_{i-1}^d, h_j^e) = h_{i-1}^d · h_j^e    (8.35)
The score that results from this dot product is a scalar that reflects the degree of similarity between the two vectors. The vector of these scores across all the encoder hidden states gives us the relevance of each encoder state to the current step of the decoder.
Second, we normalize the scores with a softmax to create a vector of weights, α_ij, that tells us the proportional relevance of each encoder hidden state j to the prior decoder hidden state, h_{i-1}^d:
α_ij = softmax(score(h_{i-1}^d, h_j^e)) = exp(score(h_{i-1}^d, h_j^e)) / Σ_k exp(score(h_{i-1}^d, h_k^e))    (8.36)
Finally, we use these weights to create a weighted average of the encoder hidden states:
c_i = Σ_j α_ij h_j^e    (8.37)
Encoder-decoder with attention, focusing on
the computation of ci
Recall, without attention the context is fixed and the same for all decoder states:
h_t^d = g(ŷ_{t-1}, h_{t-1}^d, c)
Attention replaces the static context vector with one that is dynamically derived from the encoder hidden states, but is also informed by, and hence different for, the token i that the decoder is currently producing. The context vector c_i is generated anew with each decoding step i and takes all of the encoder hidden states into account in its derivation. We then make this context available during decoding by conditioning the computation of the current decoder hidden state on it (along with the prior hidden state and the previous output generated by the decoder), as we see in this equation (and Fig. 8.21):
h_i^d = g(ŷ_{i-1}, h_{i-1}^d, c_i)    (8.34)
With attention, the context changes for every state!
[Figure: decoder states h_1^d, h_2^d, ..., h_i^d producing outputs y_1, y_2, ..., y_i, each conditioned on its own context vector c_1, c_2, ..., c_i.]
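A minimal numpy sketch of dot-product attention for one decoder step i (Eqs. 8.35-8.37): score each encoder state against the prior decoder state, softmax the scores into weights α_ij, and take the weighted average as c_i (the encoder states and decoder state below are random placeholders):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dot_product_attention(h_dec_prev, H_enc):
    """Compute the context vector c_i for one decoder step.
    h_dec_prev: prior decoder hidden state h_{i-1}^d, shape (d_h,)
    H_enc: encoder hidden states h_j^e stacked as rows, shape (n, d_h)."""
    scores = H_enc @ h_dec_prev          # score(h_{i-1}^d, h_j^e) = dot product (8.35)
    alphas = softmax(scores)             # α_ij over all encoder states (8.36)
    c_i = alphas @ H_enc                 # c_i = Σ_j α_ij h_j^e (8.37)
    return c_i, alphas

# Toy usage: 5 encoder states of dimension 4 (illustrative only).
rng = np.random.default_rng(0)
H_enc = rng.normal(size=(5, 4))
h_dec_prev = rng.normal(size=4)
c_i, alphas = dot_product_attention(h_dec_prev, H_enc)
print(alphas.round(3), round(float(alphas.sum()), 3), c_i.shape)   # weights sum to 1.0, c_i is (4,)
```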
[15 minute break]
Working with RNN models!
Team up!