Neural Models for NLP
Lecture 15
COMS 4705, Spring 2020
Yassine Benajiba
Columbia University
One more thing about embeddings
No learning
If you don’t want to use any additional learning, you could just stop here and take the average of the embeddings as the representation of the sentence. What are the pros and cons of such an approach?
If you wanted to weight the words without any additional learning, what could you use?
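A minimal sketch of both options (not from the slides; the toy embeddings and IDF values are made up): a plain average of pretrained word embeddings as the sentence representation, and an IDF-weighted average as one way to weight the words without any additional learning.

```python
import numpy as np

# Toy pretrained embeddings: one 4-d vector per vocabulary word (made-up numbers).
emb = {
    "the":   np.array([0.1, 0.0, 0.2, 0.1]),
    "movie": np.array([0.5, 0.3, 0.1, 0.0]),
    "was":   np.array([0.0, 0.1, 0.0, 0.2]),
    "great": np.array([0.7, 0.9, 0.4, 0.1]),
}

def avg_sentence_vector(tokens):
    """Unweighted average of word embeddings: no learning, but ignores word order."""
    return np.mean([emb[t] for t in tokens if t in emb], axis=0)

def weighted_sentence_vector(tokens, idf):
    """IDF-weighted average: frequent, uninformative words count less, still no learning."""
    vecs = [idf.get(t, 1.0) * emb[t] for t in tokens if t in emb]
    weights = [idf.get(t, 1.0) for t in tokens if t in emb]
    return np.sum(vecs, axis=0) / np.sum(weights)

idf = {"the": 0.1, "was": 0.2, "movie": 1.5, "great": 2.0}  # toy IDF values
sent = "the movie was great".split()
print(avg_sentence_vector(sent))
print(weighted_sentence_vector(sent, idf))
```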
doc2vec
Doc2vec/paragraph2vec
https://medium.com/district-data-labs/forward-propagation-building-a-skip-gram-net-from-the-ground-up-9578814b221
At the word level, skip-gram takes a word as input and predicts its context.
We can make a few changes and get a document representation.
Doc2vec/paragraph2vec
[Skip-gram diagram: the input word's one-hot encoding is multiplied by p, a matrix that reduces the # of dimensions (one row per word; this is what we save at the end as the word embedding matrix), then by p’, a matrix that increases the # of dimensions (this is what we throw away at the end), to predict the one-hot encoded context words]
Doc2vec/paragraph2vec
[Doc2vec diagram: same architecture, but the input is a document's one-hot encoding, which we should call d(t) instead; p reduces the # of dimensions (one row per doc; this is what we save at the end as the doc embedding matrix, giving the doc vector v_d), p’ increases the # of dimensions and is thrown away at the end, and the targets are the one-hot encoded context words occurring in the document]
Doc2vec/paragraph2vec
What happens for documents that were not in the training data?
Add a new column to d(t) and a new parameter row to p, freeze all old parameters, and train for a few iterations.
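A minimal sketch of these ideas with gensim's Doc2Vec (an assumption: gensim is not mentioned in the slides, and the corpus and hyperparameters are toy values). It trains doc vectors and then infers a vector for an unseen document, which does essentially what is described above: add a new doc slot, keep the learned parameters fixed, and run a few extra iterations.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    "the movie was great",
    "the plot was boring and slow",
    "a fantastic and moving film",
]
tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

# vector_size = dimensionality of v_d; the model stores one doc vector per training document
model = Doc2Vec(tagged, vector_size=50, window=2, min_count=1, epochs=40)

train_vec = model.dv[0]                                   # embedding of a training document
new_vec = model.infer_vector("an amazing film".split())   # unseen document: old parameters stay frozen
print(train_vec.shape, new_vec.shape)                      # (50,) (50,)
```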
Doc2vec/paragraph2vec
This approach doesn't allow for composition, i.e. it does not build the document representation from words.
It is nice for document similarity, but not for applications like machine translation, question answering, sentiment, etc.
Let’s take a look at the ones that attempt to do so.
dense nets
Deep Averaging Networks (DANs)
https://cs.umd.edu/~miyyer/pubs/2015_acl_dan.pdf
Legend for the results table in the DAN paper:
RAND -> randomly initialized embeddings
NBOW -> avg + softmax
ROOT -> sentiment at the sentence level
dense nets
Deep Averaging Networks (DANs)
- Simple
- Fast
But what are we missing here?
Primarily word order. DANs could beat recursive NNs (RecNNs) but not CNNs!
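A minimal PyTorch sketch of the DAN recipe: average the word embeddings (optionally applying word dropout), pass the average through a few dense layers, then a softmax classifier. Layer sizes, the dropout rate, and the random inputs are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class DAN(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=300, n_classes=2, word_dropout=0.3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.word_dropout = word_dropout
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),   # logits; softmax is applied inside the loss
        )

    def forward(self, token_ids):           # token_ids: (batch, seq_len)
        vecs = self.emb(token_ids)           # (batch, seq_len, emb_dim)
        if self.training:                    # word dropout: randomly drop whole words before averaging
            keep = torch.rand(token_ids.shape, device=vecs.device) > self.word_dropout
            vecs = vecs * keep.float().unsqueeze(-1)
        avg = vecs.mean(dim=1)               # order is ignored: this is the "averaging" part
        return self.ff(avg)

logits = DAN(vocab_size=10_000)(torch.randint(0, 10_000, (8, 20)))
print(logits.shape)  # torch.Size([8, 2])
```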
convolutional nets
CNNs
intuition
CNNs
feature maps
• Zhang Y. and Wallace B. A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification. https://arxiv.org/pdf/1510.03820v2.pdf
CNNs
the convolution operation
[Figures (Wikipedia): a 1D convolution of an input signal with a kernel, and a 2D convolution: Input * kernel = Output]
CNNs
the convolution operation
[Figure: A (input) * B (kernel) = C (output)]
C[m, n] = (A ∗ B)[m, n] = Σ_j Σ_k B[j, k] · A[m + j, n + k]
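A direct numpy implementation of the formula above (a sketch; this is the cross-correlation form used on the slide, with no kernel flipping, and a 'valid' output size):

```python
import numpy as np

def conv2d_valid(A, B):
    """C[m, n] = sum_j sum_k B[j, k] * A[m + j, n + k], computed naively."""
    J, K = B.shape
    M = A.shape[0] - J + 1
    N = A.shape[1] - K + 1
    C = np.zeros((M, N))
    for m in range(M):
        for n in range(N):
            C[m, n] = np.sum(B * A[m:m + J, n:n + K])
    return C

A = np.arange(16, dtype=float).reshape(4, 4)   # toy input
B = np.array([[1.0, 0.0], [0.0, -1.0]])         # toy kernel
print(conv2d_valid(A, B))                        # 3x3 output
```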
CNNs
- Shared parameters
- Sparse connections between neurons (each output depends on only a small local neighborhood of the input)
CNNs
Max pooling: retain only the max values.
Why does this make sense?
https://arxiv.org/pdf/1510.03820v2.pdf
Convolutional neural networks
https://github.com/bwallace/CNN-for-text-classification/blob/master/CNN_text.py
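A minimal PyTorch sketch of a text CNN in the spirit of Zhang & Wallace (not the code in the linked repo): filters of a few widths slide over the sequence of word embeddings, max pooling retains only the max value of each feature map, and a final linear + softmax layer classifies. All sizes are illustrative.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, n_filters=100, widths=(3, 4, 5), n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # one 1-D convolution per filter width, sliding over the word dimension
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, kernel_size=w) for w in widths
        )
        self.out = nn.Linear(n_filters * len(widths), n_classes)

    def forward(self, token_ids):                 # (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)    # (batch, emb_dim, seq_len)
        # max pooling over time: keep only the max value of each feature map
        feats = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.out(torch.cat(feats, dim=1))   # logits; softmax applied in the loss

print(TextCNN(vocab_size=5000)(torch.randint(0, 5000, (4, 30))).shape)  # torch.Size([4, 2])
```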
CNNs
Now we have something that can take word order into consideration, but we still have two problems:
- Sentence length
- Long-distance dependencies
... Let's talk about recurrent neural networks.
https://arxiv.org/pdf/1510.03820v2.pdf
recurrent nets
RNNs
Vanilla
The Unreasonable Effectiveness of Recurrent Neural Networks
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
How can we do this with words instead of chars?
Key properties of a vanilla RNN:
- sequential
- shared params
- generative
- the last hidden state is the sequence representation
- can be used to classify either each element of the sequence or the whole sequence (language models too)
Training:
- Algorithm: BPTT (backpropagation through time)
- Issue: vanishing gradients
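A minimal numpy sketch of a vanilla RNN encoder illustrating the points above: the same parameters are reused at every time step (shared params), the loop is inherently sequential, and the last hidden state serves as the sequence representation. Dimensions and inputs are toy values.

```python
import numpy as np

d, h_dim = 4, 8                                   # embedding size, hidden size
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(h_dim, d))      # shared across all time steps
W_hh = rng.normal(scale=0.1, size=(h_dim, h_dim))  # shared across all time steps
b = np.zeros(h_dim)

def rnn_encode(xs):
    h = np.zeros(h_dim)
    for x in xs:                                   # one step per word embedding, in order
        h = np.tanh(W_xh @ x + W_hh @ h + b)
    return h                                       # last hidden state = sequence representation

sentence = [rng.normal(size=d) for _ in range(6)]  # 6 toy word embeddings
print(rnn_encode(sentence).shape)                  # (8,)
```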
Gated RNNs: LSTMs
[LSTM cell diagram (colah's blog): at each step the cell takes C_{t-1}, h_{t-1}, and the input x_t and produces C_t and h_t through its gates]
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Gated RNNs: LSTMs
Given an NLP task, the LSTM will pick what information to remember and what information is irrelevant in order to build a sentence-level representation: h(n), where n is the length of the sentence.
Gated RNNs: LSTMs
Works much better... What problems does it solve?
- Vanishing gradients
- Longer memory
Gated RNNs: LSTMs
BiLSTM means bidirectional LSTM: basically two LSTMs, one starting from the left and the other starting from the right. At each word we concatenate the contents of their hidden layers.
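A minimal PyTorch sketch of a BiLSTM encoder (sizes are illustrative): bidirectional=True runs one LSTM left-to-right and one right-to-left and concatenates their hidden states at each word; concatenating the two final states gives one possible sentence representation.

```python
import torch
import torch.nn as nn

emb_dim, hidden = 50, 64
lstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden, bidirectional=True, batch_first=True)

x = torch.randn(1, 7, emb_dim)            # a batch of one 7-word sentence of embeddings
outputs, (h_n, c_n) = lstm(x)
print(outputs.shape)                       # (1, 7, 128): concatenated forward/backward state at each word

# final forward state + final backward state as a sentence-level representation
sentence_rep = torch.cat([h_n[0], h_n[1]], dim=-1)
print(sentence_rep.shape)                  # (1, 128)
```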
Gated RNNs: LSTMs
Let's discuss some limitations of LSTMs:
- Speed
- Accuracy of representation
Where do you think the state of the art is today?
Gated RNNs: LSTMs
http://nlpprogress.com/
Other architectures
ADANs
one-shot learning
Back to the intro