Word Embeddings
SEPTEMBER 13, 2018 Elena Voita
Yandex Research,
University of Amsterdam
[email protected]
Plan
➢ Why do we need word representations?
➢ Distributional semantics
➢ Word2Vec in detail
➢ GloVe overview
➢ Let’s take a walk!
➢ Further directions: subword information
➢ Further directions: abstract the ideas to sentence-level
➢ Further directions: exploiting the structure of semantic spaces
➢ Hack of the day!
Why do we need word representations?
The pipeline:
➢ Text: I saw a cat.
➢ Sequence of tokens: I saw a cat .
➢ Word representations - vectors (word embeddings)
➢ Your algorithm: any algorithm for solving any task (e.g., a neural network)
Representing words as discrete symbols
In traditional NLP, we regard words as discrete symbols: hotel, conference, motel
Words can be represented by one-hot vectors (a single 1, the rest 0s):
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
Vector dimension = number of words in the vocabulary (e.g., 500,000)
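For a concrete picture, here is a minimal sketch in Python (the tiny vocabulary is made up for the example): each word gets a vector of vocabulary size with a single 1.

```python
import numpy as np

# Minimal sketch: one-hot vectors whose dimension equals the vocabulary size.
vocab = ["a", "cat", "conference", "hotel", "i", "motel", "saw"]
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a single 1 at the word's index, 0s elsewhere."""
    v = np.zeros(len(vocab))
    v[word2id[word]] = 1.0
    return v

print(one_hot("motel"))
print(one_hot("hotel"))
# Note: the dot product of two different one-hot vectors is always 0.
print(one_hot("motel") @ one_hot("hotel"))
```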
Problem with words as discrete symbols
Example: in web search, if a user searches for “Seattle motel”, we would like to
match documents containing “Seattle hotel”.
But:
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
These two vectors are orthogonal.
There is no natural notion of similarity for one-hot vectors!
These vectors contain no information about the meaning of a word.
But what is meaning?
What is bardiwac?
➢ He handed her a glass of bardiwac.
➢ Beef dishes are made to complement the bardiwac.
➢ Nigel staggered to his feet, face flushed from too much bardiwac.
➢ Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.
➢ I dined off bread and cheese and this excellent bardiwac.
➢ The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.
Bardiwac is a red alcoholic beverage made from grapes.
Distributional semantics
bardiwac fits all of these contexts. What other words fit into them?
➢ A bottle of _________ is on the table. (1)
➢ Everybody likes _________. (2)
➢ Don’t have _________ before you drive. (3)
➢ We make _________ out of corn. (4)

            (1)  (2)  (3)  (4)  …
bardiwac     1    1    1    1
loud         0    0    0    0
motor oil    1    0    0    1
tortillas    0    1    0    1
wine         1    1    1    0
choices      0    1    0    0
Distributional semantics
Does vector similarity imply semantic similarity?
The distributional hypothesis, stated by Firth (1957):
“You shall know a word by the company it keeps.”
Idea: co-occurrence counts
Corpus sentences -> co-occurrence counts -> vector -> dimensionality reduction -> small vector
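Here is a minimal sketch of this idea, assuming a toy corpus and a window size of 2 (both made up for the example): count co-occurrences within the window and compare the resulting count vectors with cosine similarity.

```python
import numpy as np

# Toy corpus and window size are assumptions for illustration only.
corpus = [
    "i drank a glass of wine".split(),
    "i drank a glass of juice".split(),
    "we make juice out of corn".split(),
]
window = 2
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Count how often each word appears within the window of every other word.
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for t, w in enumerate(sent):
        for j in range(max(0, t - window), min(len(sent), t + window + 1)):
            if j != t:
                counts[idx[w], idx[sent[j]]] += 1

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

# Words appearing in similar contexts get similar count vectors.
print(cosine(counts[idx["wine"]], counts[idx["juice"]]))
print(cosine(counts[idx["wine"]], counts[idx["corn"]]))
```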
Latent semantic analysis (LSA)
X - document-term co-occurrence matrix
(variants: raw co-occurrence counts, tf-idf; filter stop-words, lemmatize, …)

X ≈ X̂ = U Σ V^T   (truncated SVD of the document-term matrix: d documents, w terms)

➢ LSA document vectors come from the document factor U (scaled by Σ). Hope: documents discussing similar topics have similar representations.
➢ LSA term vectors come from the term factor V (scaled by Σ). Hope: terms having common meaning are mapped to the same direction.
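A minimal LSA sketch using scikit-learn (the toy documents and the number of components are assumptions for illustration): build a tf-idf document-term matrix and factorize it with a truncated SVD.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy documents; in practice X would come from a large corpus.
docs = [
    "wine is made from grapes",
    "beer and wine are alcoholic beverages",
    "the cat sat on the mat",
    "a cat and a dog played on the mat",
]
X = TfidfVectorizer().fit_transform(docs)          # documents x terms

svd = TruncatedSVD(n_components=2, random_state=0) # X ≈ U Σ V^T, keep 2 dimensions
doc_vectors = svd.fit_transform(X)                 # rows of U Σ: document vectors
term_vectors = svd.components_.T                   # rows of V: term vectors

print(doc_vectors.shape, term_vectors.shape)
```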
Count-based methods
However, this is not the only way to induce distributional representations (and not the best one).

Why not learn distributed representations instead?
I mean, really, why not?
We will learn a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts.
These word vectors are called word embeddings or word representations.
Word2Vec
➢ We have a large corpus of text
➢ Every word in a fixed vocabulary is represented by a vector
➢ Go through each position t in the text, which has a center word c and context (“outside”) words o
➢ Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
➢ Keep adjusting the word vectors to maximize this probability
Mikolov et al., 2013, https://arxiv.org/pdf/1310.4546.pdf
Word2Vec
➢ Example windows and the process for computing P(w_{t+j} | w_t)
http://web.stanford.edu/class/cs224n/syllabus.html
Word2Vec: objective function
For each position t = 1, ..., T, predict context words within a window of fixed size m, given center word w_t.

Likelihood:  L(θ) = ∏_{t=1}^{T} ∏_{-m ≤ j ≤ m, j ≠ 0} P(w_{t+j} | w_t; θ)
(θ is all variables to be optimized)

The objective function (or loss, or cost function) J(θ) is the (average) negative log likelihood:

J(θ) = -(1/T) log L(θ) = -(1/T) ∑_{t=1}^{T} ∑_{-m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t; θ)

Minimizing the objective function ⇔ maximizing predictive accuracy
Word2Vec: objective function
➢ We want to minimize the objective function
➢ Question: how do we calculate P(w_{t+j} | w_t; θ)?
➢ Answer: we will use two vectors per word w:
  ▪ v_w when w is a center word
  ▪ u_w when w is a context word
➢ Then for a center word c and a context word o:

  P(o | c) = exp(u_o^T v_c) / ∑_{w∈V} exp(u_w^T v_c)
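As a sanity check on this formula, here is a minimal numpy sketch (the vocabulary size, embedding dimension, and random vectors are arbitrary placeholders): store the center vectors v_w and the context vectors u_w in two matrices and compute P(o | c) as a softmax over the dot products u_w^T v_c.

```python
import numpy as np

V, d = 10, 4                       # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
v = rng.normal(size=(V, d))        # v_w: vectors for words used as centers
u = rng.normal(size=(V, d))        # u_w: vectors for words used as contexts

def p_context_given_center(c):
    scores = u @ v[c]              # u_w^T v_c for every word w in the vocabulary
    exp = np.exp(scores - scores.max())   # subtract max for numerical stability
    return exp / exp.sum()

probs = p_context_given_center(c=3)
print(probs.sum())                 # 1.0: a valid distribution over the V words
```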
Word2Vec
➢ Example windows and the process for computing P(w_{t+j} | w_t)
➢ P(u_problems | v_into) is short for P(problems | into; u_problems, v_into, θ)
http://web.stanford.edu/class/cs224n/syllabus.html
Word2Vec: prediction function

P(o | c) = exp(u_o^T v_c) / ∑_{w∈V} exp(u_w^T v_c)

➢ The dot product measures the similarity of o and c: a larger dot product means a larger probability
➢ After taking the exponent, normalize over the entire vocabulary

Mikolov et al., 2013, https://arxiv.org/pdf/1310.4546.pdf
This is softmax!
Softmax function ℝ^n → ℝ^n:

softmax(x)_i = exp(x_i) / ∑_{j=1}^{n} exp(x_j) = p_i

➢ maps arbitrary values x_i to a probability distribution p_i
➢ ”max” because it amplifies the probability of the largest x_i
➢ “soft” because it still assigns some probability to smaller x_i
➢ often used in Deep Learning!
Where is θ?
➢ θ consists of d-dimensional vectors for the V words in the vocabulary
➢ every word has two vectors!
➢ we optimize these parameters
Word2Vec: Skip-gram (SG)
➢ Predict context (”outside”) words (position independent) given the center word

Word2Vec: Continuous Bag of Words (CBOW)
➢ Predict the center word from the (bag of) context words
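Both architectures are available in off-the-shelf tools. A minimal sketch with gensim (argument names follow gensim 4.x, e.g. vector_size rather than the older size; the toy corpus is made up): the sg flag switches between Skip-gram and CBOW.

```python
from gensim.models import Word2Vec

sentences = [
    "i saw a cat".split(),
    "i saw a dog".split(),
    "the cat sat on the mat".split(),
]

skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # Skip-gram
cbow     = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)  # CBOW

print(skipgram.wv["cat"].shape)              # (50,)
print(cbow.wv.most_similar("cat", topn=2))   # nearest neighbours in the toy space
```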
Word2Vec: Additional efficiency in training

P(o | c) = exp(u_o^T v_c) / ∑_{w∈V} exp(u_w^T v_c)

Huge sum! The time for calculating the gradients is proportional to |V|.

Possible solutions:
➢ Hierarchical softmax
➢ Negative sampling: replace the sum over the entire vocabulary, ∑_{w∈V} exp(u_w^T v_c), with a sum over a small subset (the negative sample S_k, |S_k| = k): ∑_{w∈{o}∪S_k} exp(u_w^T v_c)

Mikolov et al., 2013, https://arxiv.org/pdf/1310.4546.pdf
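Here is a minimal numpy sketch of the sampling idea shown above (this is an illustration of the subset-sum trick, not the exact SGNS objective from the paper): normalize over the true context word plus k sampled words instead of over the whole vocabulary.

```python
import numpy as np

V, d, k = 10000, 100, 5            # toy sizes: vocabulary, dimension, negatives
rng = np.random.default_rng(0)
v = rng.normal(size=(V, d))        # center vectors
u = rng.normal(size=(V, d))        # context vectors

def approx_p(o, c):
    negatives = rng.integers(0, V, size=k)         # S_k: k sampled words
    candidates = np.concatenate(([o], negatives))  # {o} ∪ S_k
    scores = u[candidates] @ v[c]
    exp = np.exp(scores - scores.max())
    return exp[0] / exp.sum()      # P(o | c) normalized over the small subset

print(approx_p(o=42, c=7))         # denominator has k+1 terms instead of |V|
```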
Word2Vec: (near) equivalence to matrix factorization

PMI(w, c) = log( N(w, c) · |V| / (N(w) · N(c)) )

PMI = X ≈ X̂ = V_d Σ_d U_d^T    (rows of X are words w, columns are contexts c)

➢ Word vectors: rows of V_d Σ_d
➢ Context vectors: columns of U_d^T

Levy et al., TACL 2015, http://www.aclweb.org/anthology/Q15-1016
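A minimal numpy sketch of this view, assuming a toy word-context count matrix: build a PMI matrix and factorize it with a truncated SVD (note that numpy's U and V here play the roles of the slide's V_d and U_d, respectively).

```python
import numpy as np

# Toy word-context counts: rows are words, columns are contexts.
counts = np.array([[8., 2., 0.],
                   [6., 3., 1.],
                   [0., 1., 9.]])
total = counts.sum()
p_w = counts.sum(axis=1, keepdims=True) / total
p_c = counts.sum(axis=0, keepdims=True) / total
p_wc = counts / total
with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))
pmi[np.isinf(pmi)] = 0.0              # a common choice: zero out unseen pairs

U, S, Vt = np.linalg.svd(pmi)
d = 2
word_vectors = U[:, :d] * S[:d]       # one common choice for the word factor
context_vectors = Vt[:d].T
print(word_vectors.shape, context_vectors.shape)
```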
GloVe: combine count-based and direct prediction methods

The GloVe objective (Pennington et al., 2014):
J = ∑_{i,j} f(X_ij) (u_i^T v_j + b_i + b'_j - log X_ij)²

➢ X_ij counts how often word j appears in the context of word i (so X_ij / X_i is the probability that word j appears in the context of word i)
➢ the idea is close to factorizing the log of the co-occurrence matrix (closely related to LSA)
➢ the weighting f(X_ij) discards rare, noisy co-occurrences
➢ final word vectors: X_final = U + V

Pennington et al., EMNLP 2014, https://www.aclweb.org/anthology/D14-1162
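To make the "discard rare, noisy co-occurrences" point concrete, here is a sketch of the GloVe weighting function; x_max = 100 and alpha = 0.75 are the defaults reported in Pennington et al. (2014).

```python
# Downweights rare (noisy) co-occurrences and caps the influence of very
# frequent ones.
def glove_weight(x, x_max=100.0, alpha=0.75):
    return (x / x_max) ** alpha if x < x_max else 1.0

for x in [1, 10, 100, 1000]:
    print(x, round(glove_weight(x), 3))
```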
Word2Vec: what are the relations between vectors?

v(king) – v(man) + v(woman) ≈ v(queen)

Mikolov et al., 2013, https://arxiv.org/pdf/1310.4546.pdf
http://nlp.stanford.edu/projects/glove/
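With pretrained vectors, such analogy queries are a one-liner. A minimal gensim sketch (the model name "glove-wiki-gigaword-100" is assumed to be available through gensim-data; it is downloaded on first use):

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # pretrained GloVe vectors
# king - man + woman ≈ ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```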
Let’s walk through space…
Semantic space!
https://pste.eu/p/ZRot.html
How to evaluate embeddings
Intrinsic: evaluation on a specific/intermediate subtask
➢ word analogies: “a is to b as c is to ___?”
➢ word similarity: correlation of the rankings
➢…
Extrinsic: evaluation on a real task
➢ take some task (MT, NER, coreference resolution, …) or several tasks
➢ train with different pretrained word embeddings
➢ if the task quality is better -> win!
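A minimal sketch of the intrinsic word-similarity evaluation (the tiny set of human scores and the random "embeddings" are placeholders): score word pairs with cosine similarity and report the Spearman correlation with the human rankings.

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder human similarity judgements and random stand-in embeddings.
human_scores = {("cat", "dog"): 8.0, ("cat", "car"): 2.0, ("cup", "mug"): 9.0}
rng = np.random.default_rng(0)
embedding = {w: rng.normal(size=50) for pair in human_scores for w in pair}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

model_scores = [cosine(embedding[a], embedding[b]) for a, b in human_scores]
corr, _ = spearmanr(model_scores, list(human_scores.values()))
print(corr)   # correlation of the model's ranking with the human ranking
```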
What if… we want to use subword information?
Adding subword information: FastText
Model: SG-NS (skip-gram with negative sampling)
Change the way word vectors are formed:
➢ each word is represented as a bag of character n-grams
➢ associate a vector representation with each n-gram
➢ represent a word by the sum of the vector representations of its n-grams
Bojanowski et al., TACL 2017, http://aclweb.org/anthology/Q17-1010
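A minimal sketch of the n-gram decomposition (the boundary symbols "<" and ">" and the n-gram lengths 3-6 follow the paper; the random n-gram vectors are placeholders standing in for learned ones):

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    w = "<" + word + ">"
    grams = {w}                                   # the whole word is also kept
    for n in range(n_min, n_max + 1):
        grams.update(w[i:i + n] for i in range(len(w) - n + 1))
    return sorted(grams)

grams = char_ngrams("where")
print(grams[:5])                                  # e.g. '<wh', '<whe', ...

rng = np.random.default_rng(0)
ngram_vectors = {g: rng.normal(size=50) for g in grams}
word_vector = sum(ngram_vectors[g] for g in grams)  # word = sum of its n-gram vectors
print(word_vector.shape)
```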
Plug in any function of characters - get new embeddings!
Char-aware word embedding recipe:
➢ take any model that learns word embeddings
➢ choose how to get a word representation from representations of chars or char n-grams (RNN, CNN, pooling - mean, sum, etc. - anything reasonable)
➢ replace the word vector in the model with the representation gathered from char/subword representations
➢ train as before
➢ DONE!
One example: Ling et al., EMNLP 2015, https://www.aclweb.org/anthology/D/D15/D15-1176.pdf
Or just pretend to be some other embeddings!
Match the predicted embeddings f(w_k), where f can be any function of the characters, to the pre-trained word embeddings e_{w_k} by minimizing the squared Euclidean distance ||f(w_k) - e_{w_k}||².
Pinter et al., EMNLP 2017, http://www.aclweb.org/anthology/D17-1010
What if… we abstract the skip-gram model to the sentence level?
Skip-Thought Vectors
Before: use a word to predict its surrounding context.
Now: encode a sentence to predict the sentences around it.
Kiros et al., NIPS 2015, https://papers.nips.cc/paper/5950-skip-thought-vectors.pdf
Discourse-Based Objectives
If information about neighboring sentences is useful for sentence embeddings, let's predict something about them:
➢ Binary Ordering of Sentences
➢ Next Sentence (classifier)
➢ Conjunction Prediction (predict the conjunction phrase if the second sentence starts with one)
Jernite et al., 2017, https://arxiv.org/pdf/1705.00557.pdf
What if… we exploit the structure of semantic spaces?
Exploiting Similarities among Languages for Machine Translation
➢ we are given a set of word pairs and their associated vector representations
➢ find a transformation matrix W
➢ for any given new word, we can map it to the other language's space
Mikolov et al., 2013, https://arxiv.org/pdf/1309.4168.pdf
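A minimal numpy sketch of this setup (random vectors stand in for real source- and target-language embeddings of the seed dictionary): fit W by least squares and use it to map new words into the other language's space.

```python
import numpy as np

n_pairs, d_src, d_tgt = 200, 50, 50
rng = np.random.default_rng(0)
X = rng.normal(size=(n_pairs, d_src))   # source-language vectors of known pairs
Z = rng.normal(size=(n_pairs, d_tgt))   # target-language vectors of known pairs

# W minimizes ||X W - Z||^2 over the seed dictionary.
W, *_ = np.linalg.lstsq(X, Z, rcond=None)

new_source_vector = rng.normal(size=d_src)
mapped = new_source_vector @ W          # now lives in the target-language space
print(mapped.shape)
```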
Here we assumed that we already know some word pairs.
But what if we know nothing about the languages?
Word Translation Without Parallel Data
Map semantic spaces so that their samples are indistinguishable
(Spoiler alert! You’ll know how to do it later in the course)
Conneau et al, ICLR 2018, https://arxiv.org/pdf/1710.04087.pdf
Are the underlying maps really linear?
Look at the local linear approximations:
➢ If they're identical, then the mapping is indeed linear
➢ If they are not, it probably isn't (and in fact, they're not)
Nakashole & Flauger, ACL 2018, http://aclweb.org/anthology/P18-2036
A piece of practice: when do we really need to learn word representations?
Why do we need word representations?
Recall the pipeline: text -> sequence of tokens -> word representations (word embeddings) -> a neural network (any NN for solving any task).
Do we REALLY need to learn word representations in advance?
Ok, but if we already have an NN for our task, why do we have to learn parameters for the word embeddings using some other NN?
When to use pretrained embeddings?
➢ Not enough data, or the task is too simple: use embeddings pretrained on another task (word2vec, GloVe, etc.)
➢ Enough data and a hard task (LM, MT, …): train the embeddings together with the model
Hack of the day
Tensorboard of a healthy man vs. tensorboard of a man who doesn’t shuffle his data.
Shuffle your data!
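A minimal sketch of the hack (plain Python lists stand in for a real dataset): reshuffle the training examples before every epoch so that consecutive batches are not ordered by length, topic, or collection time.

```python
import numpy as np

data = list(range(10))          # stand-in for a list of training examples
rng = np.random.default_rng(0)

for epoch in range(3):
    order = rng.permutation(len(data))        # a fresh random order every epoch
    shuffled = [data[i] for i in order]
    # ... iterate over mini-batches of `shuffled` here ...
    print(epoch, shuffled[:5])
```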
Congratulations, you’ve just
survived the first NLP lecture!
Looking forward to next week's episode…
Sincerely yours,
Yandex Research