Distributional Semantics - Introduction
Pawan Goyal (IIT Kharagpur) Week 7, Lecture 1 / 14
Introduction
What is Semantics?
The study of meaning: the relation between symbols and their denotata.
Example: John told Mary that the train moved out of the station at 3 o'clock.
Computational Semantics
The study of how to automate the process of constructing and
reasoning with meaning representations of natural language
expressions.
Methods in Computational Semantics generally fall into two categories:
Formal Semantics: Construction of precise mathematical
models of the relations between expressions in a natural
language and the world.
John chases a bat → ∃x[bat(x) ∧ chase(john, x)]
Distributional Semantics: The study of statistical patterns of
human word usage to extract semantics.
Distributional Hypothesis
Distributional Hypothesis: Basic Intuition
“The meaning of a word is its use in language.”
(Wittgenstein, 1953)
“You shall know a word by the company it keeps.” (Firth, 1957)
→ Word meaning (whatever it might be) is reflected in linguistic
distributions.
“Words that occur in the same contexts tend to have
similar meanings.” (Zellig Harris, 1968)
→ Semantically similar words tend to have similar
distributional patterns.
Distributional Semantics: a linguistic perspective
“If linguistics is to deal with meaning, it can only do so
through distributional analysis.” (Zellig Harris)
“If we consider words or morphemes A and B to be more
different in meaning than A and C, then we will often find that
the distributions of A and B are more different than the
distributions of A and C. In other words, difference in
meaning correlates with difference of distribution.” (Zellig
Harris, “Distributional Structure”)
Differential and not referential
Distributional Semantics: a cognitive perspective
Contextual representation
A word’s contextual representation is an abstract cognitive
structure that accumulates from encounters with the word in
various linguistic contexts.
We learn new words based on contextual cues
He filled the wampimuk with the substance, passed it around and we all drank some.
We found a little wampimuk sleeping behind the tree.
Distributional Semantic Models (DSMs)
Computational models that build contextual semantic representations from corpus data.
DSMs are models for semantic representations:
The semantic content is represented by a vector
Vectors are obtained through the statistical analysis of the linguistic contexts of a word
Alternative names:
corpus-based semantics
statistical semantics
geometrical models of meaning
vector semantics
word space models
Distributional Semantics: The general intuition
Distributions are vectors in a multidimensional semantic
space, that is, objects with a magnitude and a direction.
The semantic space has dimensions which correspond to
possible contexts, as gathered from a given corpus.
Vector Space
In practice, many more dimensions are used.
cat = [. . . dog 0.8, eat 0.7, joke 0.01, mansion 0.2, . . .]
Word Space
Small Dataset
An automobile is a wheeled motor vehicle used for transporting passengers.
A car is a form of transport, usually with four wheels and the capacity to carry around five passengers.
Transport for the London games is limited, with spectators strongly advised to avoid the use of cars.
The London 2012 soccer tournament began yesterday, with plenty of goals in the opening matches.
Giggs scored the first goal of the football tournament at Wembley, North London.
Bellamy was largely a passenger in the football match, playing no part in either goal.
Target words: ⟨automobile, car, soccer, football⟩
Term vocabulary: ⟨wheel, transport, passenger, tournament, London, goal, match⟩
Constructing Word spaces
Informal algorithm for constructing word spaces
Pick the words you are interested in: target words
Define a context window: the number of words surrounding the target word
The context can in general also be defined in terms of documents, paragraphs or sentences
Count the number of times the target word co-occurs with the context words: the co-occurrence matrix
Build vectors out of (a function of) these co-occurrence counts
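A minimal Python sketch of this informal algorithm, using whole sentences as contexts (as in the toy dataset above); the function name and the assumption of lowercased, lemmatized tokens are illustrative:

```python
from collections import Counter

def cooccurrence_matrix(sentences, targets, contexts):
    """Sentence-level co-occurrence counts: counts[t][c] is the number of
    times context word c appears in sentences that contain target t.
    `sentences` is a list of lists of (lemmatized, lowercased) tokens."""
    counts = {t: Counter() for t in targets}
    for tokens in sentences:
        for t in targets:
            if t in tokens:
                for c in contexts:
                    counts[t][c] += tokens.count(c)
    return [[counts[t][c] for c in contexts] for t in targets]

targets  = ["automobile", "car", "soccer", "football"]
contexts = ["wheel", "transport", "passenger", "tournament",
            "london", "goal", "match"]
```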
Constructing Word spaces: distributional vectors
distributional matrix = targets × contexts
wheel transport passenger tournament London goal match
automobile 1 1 1 0 0 0 0
car 1 2 1 0 1 0 0
soccer 0 0 0 1 1 1 1
football 0 0 1 1 1 2 1
[Plot: the four target words in the two-dimensional space spanned by the contexts 'transport' (x-axis) and 'goal' (y-axis): automobile at (1, 0), car at (2, 0), soccer at (0, 1), football at (0, 2).]
Computing similarity
wheel transport passenger tournament London goal match
automobile 1 1 1 0 0 0 0
car 1 2 1 0 1 0 0
soccer 0 0 0 1 1 1 1
football 0 0 1 1 1 2 1
Using simple vector product
automobile · car = 4    automobile · soccer = 0    automobile · football = 1
car · soccer = 1    car · football = 2    soccer · football = 5
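A quick check of these numbers with NumPy, using the rows of the matrix above:

```python
import numpy as np

M = np.array([[1, 1, 1, 0, 0, 0, 0],   # automobile
              [1, 2, 1, 0, 1, 0, 0],   # car
              [0, 0, 0, 1, 1, 1, 1],   # soccer
              [0, 0, 1, 1, 1, 2, 1]])  # football

automobile, car, soccer, football = M
print(automobile @ car)       # 4
print(automobile @ soccer)    # 0
print(automobile @ football)  # 1
print(car @ soccer)           # 1
print(car @ football)         # 2
print(soccer @ football)      # 5
```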
Distributional Models of Semantics
Pawan Goyal (IIT Kharagpur) Week 7, Lecture 1 / 19
Vector Space Model without distributional similarity
Words are treated as atomic symbols
One-hot representation
Distributional Similarity Based Representations
You shall know a word by the company it keeps
These words will represent banking
Building a DSM step-by-step
The “linguistic” steps
Pre-process a corpus (to define targets and contexts)
⇓
Select the targets and the contexts
The “mathematical” steps
Count the target-context co-occurrences
⇓
Weight the contexts (optional)
⇓
Build the distributional matrix
⇓
Reduce the matrix dimensions (optional)
⇓
Compute the vector distances on the (reduced) matrix
Many design choices
General Questions
How do the rows (words, ...) relate to each other?
How do the columns (contexts, documents, ...) relate to each
other?
The parameter space
A number of parameters to be fixed
Which type of context?
Which weighting scheme?
Which similarity measure?
...
A specific parameter setting determines a particular type of DSM (e.g., LSA, HAL, etc.)
Documents as context: Word × document
Words as context: Word × Word
Words as contexts
Parameters
Window size
Window shape - rectangular/triangular/other
Consider the following passage
Suspected communist rebels on 4 July 1989 killed Col. Herminio
Taylo, police chief of Makati, the Philippines major financial center, in an
escalation of street violence sweeping the Capitol area. The gunmen
shouted references to the rebel New People’s Army. They fled in a
commandeered passenger jeep. The military says communist rebels
have killed up to 65 soldiers and police in the Capitol region since
January.
5-word window (unfiltered): 2 words on either side of the target word
[The passage above, repeated with each unfiltered 5-word context window highlighted.]
5-word window (filtered): 2 words on either side of the target word, with stop words filtered out
[The passage above, repeated with each filtered 5-word context window highlighted.]
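A small sketch of how such windows can be collected; the stop-word list used for the filtered variant is illustrative:

```python
STOP_WORDS = {"the", "of", "to", "in", "a", "on", "and", "they", "an"}  # illustrative

def context_windows(tokens, target, window=2, filter_stopwords=False):
    """Return the +/- `window` context words around each occurrence of
    `target` (a 5-word window in the slides' terms when window=2)."""
    if filter_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    windows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            windows.append(left + right)
    return windows

tokens = "Suspected communist rebels on 4 July 1989 killed Col. Herminio Taylo".split()
print(context_windows(tokens, "rebels"))  # [['Suspected', 'communist', 'on', '4']]
```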
Context weighting: documents as context
Indexing function F: Essential factors
Word frequency (fij): how many times does a word appear in the document? F ∝ fij
Document length (|Di|): how many words appear in the document? F ∝ 1/|Di|
Document frequency (Nj): number of documents in which a word appears. F ∝ 1/Nj
Indexing weight: tf-idf
fij · log(N/Nj); for each term, the weight in a document is normalized with respect to the L2-norm.
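A minimal sketch of this tf-idf weighting with L2 normalization (variable names are illustrative; `docs` is a list of tokenized documents):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per
    document, weighted by f_ij * log(N / N_j) and L2-normalized."""
    N = len(docs)
    df = Counter()                       # N_j: document frequency of term j
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)             # f_ij: term frequency in document i
        w = {t: f * math.log(N / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        vectors.append({t: v / norm for t, v in w.items()})
    return vectors
```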
Context weighting: words as context
basic intuition
word1   word2           freq(1,2)   freq(1)   freq(2)
dog     small           855         33,338    490,580
dog     domesticated    29          33,338    918
Context weighting: words as context
basic intuition
Association measures are used to give more weight to contexts that are more significantly associated with a target word.
The less frequent the target and the context element are, the higher the weight given to their co-occurrence count should be.
⇒ Co-occurrence with the frequent context element small is less informative than co-occurrence with the rarer domesticated.
Different measures exist, e.g., mutual information, log-likelihood ratio.
Pointwise Mutual Information (PMI)
PMI(w1, w2) = log2 [ Pcorpus(w1, w2) / Pind(w1, w2) ]
PMI(w1, w2) = log2 [ Pcorpus(w1, w2) / (Pcorpus(w1) · Pcorpus(w2)) ]
Pcorpus(w1, w2) = freq(w1, w2) / N
Pcorpus(w) = freq(w) / N
PMI: Issues and Variations
Positive PMI
All PMI values less than zero are replaced with zero.
Bias towards infrequent events
Consider wj having the maximum association with wi, i.e., Pcorpus(wi) ≈ Pcorpus(wj) ≈ Pcorpus(wi, wj). PMI then increases as the probability of wi decreases.
Also, consider a word wj that occurs only once in the corpus, and that single occurrence is in the context of wi.
A discounting factor proposed by Pantel and Lin:
δij = [ fij / (fij + 1) ] · [ min(fi, fj) / (min(fi, fj) + 1) ]
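A sketch that puts these formulas together: PMI from corpus counts, clipped to Positive PMI, with the Pantel-Lin discount applied as a multiplier. The co-occurrence counts echo the dog/small vs. dog/domesticated table above; the corpus size N is illustrative:

```python
import math

def pmi(freq_12, freq_1, freq_2, N):
    """PMI(w1, w2) = log2( P(w1, w2) / (P(w1) P(w2)) ) from raw counts."""
    p12 = freq_12 / N
    p1, p2 = freq_1 / N, freq_2 / N
    return math.log2(p12 / (p1 * p2))

def discounted_ppmi(freq_12, freq_1, freq_2, N):
    """Positive PMI multiplied by the Pantel-Lin discounting factor delta."""
    value = max(0.0, pmi(freq_12, freq_1, freq_2, N))
    delta = (freq_12 / (freq_12 + 1)) * (min(freq_1, freq_2) / (min(freq_1, freq_2) + 1))
    return value * delta

N = 10_000_000  # illustrative corpus size
print(pmi(855, 33_338, 490_580, N))   # association of (dog, small)
print(pmi(29, 33_338, 918, N))        # association of (dog, domesticated): higher
```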
Distributional Vectors: Example
Normalized Distributional Vectors using Pointwise Mutual Information
petroleum: oil:0.032 gas:0.029 crude:0.029 barrels:0.028 exploration:0.027 barrel:0.026 opec:0.026 refining:0.026 gasoline:0.026 fuel:0.025 natural:0.025 exporting:0.025
drug: trafficking:0.029 cocaine:0.028 narcotics:0.027 fda:0.026 police:0.026 abuse:0.026 marijuana:0.025 crime:0.025 colombian:0.025 arrested:0.025 addicts:0.024
insurance: insurers:0.028 premiums:0.028 lloyds:0.026 reinsurance:0.026 underwriting:0.025 pension:0.025 mortgage:0.025 credit:0.025 investors:0.024 claims:0.024 benefits:0.024
forest: timber:0.028 trees:0.027 land:0.027 forestry:0.026 environmental:0.026 species:0.026 wildlife:0.026 habitat:0.025 tree:0.025 mountain:0.025 river:0.025 lake:0.025
robotics: robots:0.032 automation:0.029 technology:0.028 engineering:0.026 systems:0.026 sensors:0.025 welding:0.025 computer:0.025 manufacturing:0.025 automated:0.025
Distributional Semantics: Applications, Structured Models
Pawan Goyal (IIT Kharagpur) Week 7, Lecture 1 / 15
Application to Query Expansion: Addressing Term Mismatch
Term Mismatch Problem in Information Retrieval
Stems from the word independence assumption during document indexing.
User query: insurance cover which pays for long term care.
A relevant document may contain terms different from the actual user query.
Some relevant words concerning this query: {medicare, premiums, insurers}
Using DSMs for Query Expansion
Given a user query, reformulate it using related terms to enhance the
retrieval performance.
The distributional vectors for the query terms are computed.
The expanded query is obtained by a linear combination or a function of these vectors.
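A sketch of this expansion step, assuming the distributional vectors are rows of a NumPy matrix indexed by a vocabulary; the function name and the simple summation are illustrative:

```python
import numpy as np

def expand_query(query_terms, vectors, vocab, k=10):
    """Combine the query terms' distributional vectors (here by summation)
    and return the k vocabulary terms closest to the combined vector."""
    idx = {w: i for i, w in enumerate(vocab)}
    combined = np.sum([vectors[idx[t]] for t in query_terms if t in idx], axis=0)
    sims = vectors @ combined / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(combined) + 1e-12)
    ranked = np.argsort(-sims)
    return [vocab[i] for i in ranked[:k]]
```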
Query Expansion using Unstructured DSMs
TREC Topic 104: catastrophic health insurance
Query Representation: surtax:1.0 hcfa:0.97 medicare:0.93
hmos:0.83 medicaid:0.8 hmo:0.78 beneficiaries:0.75
ambulatory:0.72 premiums:0.72 hospitalization:0.71 hhs:0.7
reimbursable:0.7 deductible:0.69
Broad expansion terms: medicare, beneficiaries, premiums, ...
Specific domain terms: HCFA (Health Care Financing Administration), HMO (Health Maintenance Organization), HHS (Health and Human Services)
TREC Topic 355: ocean remote sensing
Query Representation: radiometer:1.0 landsat:0.97
ionosphere:0.94 cnes:0.84 altimeter:0.83 nasda:0.81
meterology:0.81 cartography:0.78 geostationary:0.78
doppler:0.78 oceanographic:0.76
Similarity Measures for Binary Vectors
Let X and Y denote the binary distributional vectors for words X
and Y.
Similarity Measures
Dice coefficient: 2|X ∩ Y| / (|X| + |Y|)
Jaccard coefficient: |X ∩ Y| / |X ∪ Y|
Overlap coefficient: |X ∩ Y| / min(|X|, |Y|)
Jaccard coefficient penalizes small number of shared entries, while
Overlap coefficient uses the concept of inclusion.
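These three coefficients in Python, treating each binary vector as the set of its non-zero dimensions (a sketch; the example sets are read off the automobile and car rows above):

```python
def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    return len(x & y) / len(x | y)

def overlap(x, y):
    return len(x & y) / min(len(x), len(y))

automobile = {"wheel", "transport", "passenger"}
car = {"wheel", "transport", "passenger", "london"}
print(dice(automobile, car), jaccard(automobile, car), overlap(automobile, car))
# 0.857..., 0.75, 1.0  (overlap = 1.0 because automobile is included in car)
```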
Similarity Measures for Vector Spaces
Let →X and →Y denote the distributional vectors for words X and Y.
→X = [x1, x2, . . . , xn], →Y = [y1, y2, . . . , yn]
Similarity Measures
Cosine similarity: cos(→X, →Y) = (→X · →Y) / (|→X| |→Y|)
Euclidean distance: |→X − →Y| = √( Σi=1..n (xi − yi)² )
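The same two measures in NumPy (a sketch, reusing two rows of the toy matrix):

```python
import numpy as np

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def euclidean(x, y):
    return float(np.linalg.norm(x - y))

car      = np.array([1, 2, 1, 0, 1, 0, 0])
football = np.array([0, 0, 1, 1, 1, 2, 1])
print(cosine(car, football), euclidean(car, football))
```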
Similarity Measure for Probability Distributions
Let p and q denote the probability distributions corresponding
to two distributional vectors.
Similarity Measures
KL-divergence: D(p‖q) = Σi pi log(pi / qi)
Information Radius: D(p ‖ (p+q)/2) + D(q ‖ (p+q)/2)
L1-norm: Σi |pi − qi|
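A sketch of these three measures for discrete distributions; a small epsilon is added to avoid log(0) and division by zero, which the formulas above leave implicit:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    return float(np.sum(p * np.log(p / q)))

def information_radius(p, q):
    m = (np.asarray(p, float) + np.asarray(q, float)) / 2
    return kl(p, m) + kl(q, m)

def l1_norm(p, q):
    return float(np.sum(np.abs(np.asarray(p, float) - np.asarray(q, float))))

p = [0.5, 0.3, 0.2, 0.0]
q = [0.4, 0.4, 0.1, 0.1]
print(kl(p, q), information_radius(p, q), l1_norm(p, q))
```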
Attributional Similarity vs. Relational Similarity
Attributional Similarity
The attributional similarity between two words a and b depends on the
degree of correspondence between the properties of a and b.
Ex: dog and wolf
Relational Similarity
Two pairs (a, b) and (c, d) are relationally similar if they have many similar relations.
Ex: dog: bark and cat: meow
Relational Similarity: Pair-pattern matrix
Pair-pattern matrix
Row vectors correspond to pairs of words, such as mason:
stone and
carpenter: wood
Column vectors correspond to the patterns in which the pairs
occur, e.g.
X cuts Y and X works with Y
Compute the similarity of rows to find similar pairs
Extended Distributional Hypothesis (Lin and Pantel)
Patterns that co-occur with similar pairs tend to have similar meanings.
This matrix can also be used to measure the semantic similarity of
patterns.
Given a pattern such as “X solves Y”, you can use this matrix to find
similar patterns, such as “Y is solved by X”, “Y is resolved in X”, “X
resolves Y”.
Structured DSMs
Basic Issue
Words may not be the basic context units anymore
How to capture and represent syntactic information?
X solves Y and Y is solved by X
An Ideal Formalism
Should mirror semantic relationships as closely as possible
Incorporate word-based information and syntactic analysis
Should be applicable to different languages
Use the dependency grammar framework
Structured DSMs
Using Dependency Structure: How does it help?
The teacher eats a red apple.
‘eat’ is not a legitimate context for ‘red’.
The ‘object’ relation connecting ‘eat’ and ‘apple’ is treated as a
different type of co-occurrence from the ‘modifier’ relation linking
‘red’ and ‘apple’.
Structured DSMs
Structured DSMs: Words as ‘legitimate’ contexts
Co-occurrence statistics are collected using parser-extracted
relations.
To qualify as context of a target item, a word must be linked to it
by some (interesting) lexico-syntactic relation
Structured DSMs
Distributional models, as guided by dependency
Ex: For the sentence ‘This virus affects the body’s defense system.’, the dependency parse is:
[Figure: dependency parse of the sentence]
Word vectors
<system, dobj, affects> ...
Corpus-derived ternary data can also be mapped onto a 2-way matrix
2-way matrix
<system, dobj, affects>
<virus, nsubj, affects>
The dependency information can be dropped:
<system, dobj, affects> ⇒ <system, affects>
<virus, nsubj, affects> ⇒ <virus, affects>
Link and one word can be concatenated and treated as attributes:
virus = {nsubj-affects: 0.05, ...}, system = {dobj-affects: 0.03, ...}
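A sketch of mapping such parser-extracted triples onto 2-way representations, showing both options (drop the relation, or concatenate link and word); weights here are raw counts rather than the association scores on the slide:

```python
from collections import Counter, defaultdict

triples = [("system", "dobj", "affects"),
           ("virus", "nsubj", "affects")]   # (target, relation, governor)

drop_relation = defaultdict(Counter)    # <target, word> counts
link_plus_word = defaultdict(Counter)   # <target, relation-word> counts
for target, rel, word in triples:
    drop_relation[target][word] += 1
    link_plus_word[target][f"{rel}-{word}"] += 1

print(dict(link_plus_word["virus"]))    # {'nsubj-affects': 1}
print(dict(drop_relation["system"]))    # {'affects': 1}
```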
Structured DSMs for Selectional Preferences
Selectional Preferences for Verbs
Most verbs prefer arguments of a particular type. This regularity is
known as selectional preference.
From a parsed corpus, noun vectors are calculated as shown for
‘virus’ and ‘system’.
[Table: noun vectors over dependency-based dimensions such as obj-carry, obj-buy, obj-drive, obj-eat, obj-store, sub-fly, ..., with rows for nouns such as vegetable and biscuit and their association weights.]
Structured DSMs for Selectional Preferences
Selectional Preferences
Suppose we want to compute the selectional preferences of nouns as the object of the verb ‘eat’.
The n nouns having the highest weight in the dimension ‘obj-eat’ are selected; let {vegetable, biscuit, . . .} be the set of these n nouns.
The complete vectors of these n nouns are used to obtain an ‘object prototype’ of the verb.
The ‘object prototype’ will indicate various attributes, e.g., that these nouns can be consumed, bought, carried, stored, etc.
The similarity of a noun to this ‘object prototype’ is used to denote the plausibility of that noun being an object of the verb ‘eat’.
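A sketch of this prototype construction with NumPy. The dimension names and the use of cosine similarity for the final plausibility score follow the description above; the noun matrix, the noun ‘truck’, and all numbers are illustrative:

```python
import numpy as np

dims  = ["obj-carry", "obj-buy", "obj-drive", "obj-eat", "obj-store", "sub-fly"]
nouns = ["vegetable", "biscuit", "truck"]
M = np.array([[0.3, 0.4, 0.0, 0.8, 0.2, 0.0],    # illustrative weights
              [0.4, 0.4, 0.0, 0.5, 0.4, 0.0],
              [0.2, 0.3, 0.9, 0.0, 0.1, 0.0]])

def object_prototype(verb_dim, n=2):
    """Average the vectors of the n nouns with the highest weight on `verb_dim`."""
    col = dims.index(verb_dim)
    top = np.argsort(-M[:, col])[:n]
    return M[top].mean(axis=0)

def plausibility(noun, prototype):
    v = M[nouns.index(noun)]
    return float(v @ prototype / (np.linalg.norm(v) * np.linalg.norm(prototype)))

proto = object_prototype("obj-eat")               # built from vegetable and biscuit
print(plausibility("biscuit", proto), plausibility("truck", proto))  # high vs. low
```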
Word Embeddings - Part I
Pawan Goyal (IIT Kharagpur) Week 7, Lecture 1 / 19
Word Vectors
At one level, it is simply a vector of weights.
In a simple 1-of-N (or ‘one-hot’) encoding every element in the
vector is associated with a word in the vocabulary.
The encoding of a given word is simply the vector in which
the corresponding element is set to one, and all other
elements are zero.
One-hot representation
Word Vectors - One-hot Encoding
Suppose our vocabulary has only five words: King, Queen, Man,
Woman, and Child.
We could encode the word ‘Queen’ as: [0, 1, 0, 0, 0]
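A sketch of this encoding, with the vocabulary in the order listed above:

```python
import numpy as np

vocab = ["King", "Queen", "Man", "Woman", "Child"]

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

print(one_hot("Queen"))   # [0. 1. 0. 0. 0.]
```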
Limitations of One-hot encoding
Word vectors are not comparable
Using such an encoding, there is no meaningful comparison we
can make between word vectors other than equality testing.
Word2Vec – A distributed representation
Distributional representation – word embedding?
Any word wi in the corpus is given a distributional representation by an embedding wi ∈ R^d, i.e., a d-dimensional vector, which is mostly learnt!
Distributional Representation
Take a vector with several hundred dimensions (say 1000).
Each word is represented by a distribution of weights
across those elements.
So instead of a one-to-one mapping between an element in the vector and a word, the representation of a word is spread across all of the elements in the vector, and
each element in the vector contributes to the definition of many words.
Distributional Representation: Illustration
If we label the dimensions in a hypothetical word vector (there are
no such pre-assigned labels in the algorithm of course), it might
look a bit like this:
Such a vector comes to represent in some abstract way the ‘meaning’ of
a word
Word Embeddings
d typically in the range 50 to 1000
Similar words should have similar embeddings
SVD can also be thought of as an embedding method
Reasoning with Word Vectors
It has been found that the learned word representations in fact
capture meaningful syntactic and semantic regularities in a
very simple way.
Specifically, the regularities are observed as constant vector
offsets between pairs of words sharing a particular
relationship.
Case of Singular-Plural Relations
If we denote the vector for word i as xi, and focus on the
singular/plural relation, we observe that
xapple − xapples ≈ xcar − xcars ≈ xfamily − xfamilies, and so on.
Reasoning with Word Vectors
Perhaps more surprisingly, we find that this is also the case for a
variety of semantic relations.
Good at answering analogy questions
a is to b, as c is to ?
man is to woman as uncle is to ? (aunt)
A simple vector offset method based on cosine distance shows
the relation.
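A sketch of the offset method: add the offset b − a to c and return the nearest vocabulary word by cosine similarity. The embedding matrix and vocabulary are assumed to exist; names are illustrative:

```python
import numpy as np

def analogy(a, b, c, vectors, vocab):
    """Answer 'a is to b as c is to ?' via the vector offset b - a + c."""
    idx = {w: i for i, w in enumerate(vocab)}
    target = vectors[idx[b]] - vectors[idx[a]] + vectors[idx[c]]
    sims = vectors @ target / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(target) + 1e-12)
    for i in np.argsort(-sims):            # best match, excluding the input words
        if vocab[i] not in (a, b, c):
            return vocab[i]

# e.g. analogy("man", "woman", "uncle", vectors, vocab)  ->  ideally "aunt"
```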
Vector Offset for Gender Relation
Vector Offset for Singular-Plural Relation
Encoding Other Dimensions of Similarity
Analogy Testing
Analogy Testing
Country-capital city relationships
More Analogy Questions
Element-wise Addition
We can also use element-wise addition of vector elements to ask questions such as ‘German + airlines’ and, by looking at the closest tokens to the composite vector, come up with impressive answers:
Two Variations: CBOW and Skip-grams
Word Embeddings - Part II
Pawan Goyal (IIT Kharagpur) Week 7, Lecture 1 / 14
CBOW
Consider a piece of prose such as:
“The recently introduced continuous Skip-gram model is an
efficient method for learning high-quality distributed vector
representations that capture a large number of syntactic and
semantic word relationships.”
Imagine a sliding window over the text, that includes the
central word currently in focus, together with the four words
that precede it, and the four words that follow it:
CBOW
The context words form the input layer. Each word is encoded in one-hot form. There is a single hidden layer and a single output layer.
CBOW: Training Objective
The training objective is to maximize the conditional probability of
observing the actual output word (the focus word) given the input
context words, with regard to the weights.
In our example, given the input (“an”, “efficient”, “method”, “for”,
“high”, “quality”, “distributed”, “vector”), we want to maximize the
probability of getting “learning” as the output.
CBOW: Input to Hidden Layer
Since our input vectors are one-hot, multiplying an input vector by
the weight matrix W1 amounts to simply selecting a row from W1.
Given C input word vectors, the activation function for the hidden
layer h amounts to simply summing the corresponding ‘hot’ rows in
W1, and dividing by C to take their average.
CBOW: Hidden to Output Layer
From the hidden layer to the output layer, the second weight matrix W2
can be used to compute a score for each word in the vocabulary, and
softmax can be used to obtain the posterior distribution of words.
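A sketch of this forward pass, covering both the averaging step (input to hidden) and the softmax over vocabulary scores (hidden to output); W1, W2 and the dimensions are illustrative:

```python
import numpy as np

V, d = 8, 4                      # vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, d))     # input -> hidden weights (one row per word)
W2 = rng.normal(size=(d, V))     # hidden -> output weights

def cbow_forward(context_ids):
    """Average the 'hot' rows of W1, score every word with W2, then softmax."""
    h = W1[context_ids].mean(axis=0)         # hidden layer activation
    scores = h @ W2                          # one score per vocabulary word
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                   # posterior distribution over words

probs = cbow_forward([0, 2, 5, 7])           # four one-hot context words
print(probs.argmax(), probs.sum())           # predicted word id, 1.0
```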
Skip-gram Model
The skip-gram model is the opposite of the C B O W model. It is
constructed with the focus word as the single input vector, and the
target context words are now at the output layer:
Skip-gram Model: Training
The activation function for the hidden layer simply amounts to
copying the corresponding row from the weights matrix W1 (linear)
as we saw before. At the output layer, we now output C
multinomial distributions instead of
just one.
The training objective is to minimize the summed prediction error
across all context words in the output layer. In our example, the
input would be “learning”, and we hope to see (“an”, “efficient”,
“method”, “for”, “high”, “quality”, “distributed”, “vector”) at the
output layer.
Skip-gram Model
Details
Predict surrounding words in a window of length c of each word
Objective function: maximize the log probability of any context word given the current center word:
J(θ) = (1/T) Σt=1..T Σ−c ≤ j ≤ c, j ≠ 0 log p(wt+j | wt)
Word Vectors
For p(wt+j | wt), the simplest first formulation is
p(wO | wI) = exp(v′wOᵀ vwI) / Σw=1..W exp(v′wᵀ vwI)
where v and v′ are the “input” and “output” vector representations of w (so every word has two vectors)
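The same formulation as a small function (a sketch); `v_in` and `v_out` stand for the matrices of "input" and "output" word vectors:

```python
import numpy as np

def skipgram_prob(out_id, in_id, v_in, v_out):
    """p(w_O | w_I) = exp(v'_{w_O}^T v_{w_I}) / sum_w exp(v'_w^T v_{w_I})."""
    scores = v_out @ v_in[in_id]          # one score per vocabulary word
    exp = np.exp(scores - scores.max())   # numerically stabilized softmax
    return exp[out_id] / exp.sum()

# v_in, v_out: (W, d) matrices; every word has one row in each.
```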
Parameters θ
With d-dimensional vectors and V many words, θ contains the “input” and “output” vector of every word, i.e., 2dV parameters in total.
Gradient Descent for Parameter Updates
θj(new) = θj(old) − α ∂J(θ)/∂θj(old)
Two sets of vectors
Best solution is to sum these up
Lfinal = L + L′
A good tutorial to understand parameter learning:
https://arxiv.org/pdf/1411.2738.pdf
An interactive Demo
https://ronxin.github.io/wevi/
GloVe
Combine the best of both worlds – count-based methods as well as direct prediction methods
Fast training
Scalable to huge corpora
Good performance even with small corpus, and small vectors
Code and vectors: http://nlp.stanford.edu/projects/glove/