Week 7
Pawan Goyal
Week 7, Lecture 1
What is Semantics?
The study of meaning: Relation between symbols and their denotata.
John told Mary that the train moved out of the station at 3 o'clock.
Computational Semantics
The study of how to automate the process of constructing and reasoning with meaning representations of natural language expressions.

Formal Semantics: Construction of precise mathematical models of the relations between expressions in a natural language and the world.
John chases a bat → ∃x[bat(x) ∧ chase(john, x)]

Distributional Semantics: The study of statistical patterns of human word usage to extract semantics.
The Distributional Hypothesis
“The meaning of a word is its use in the language.” (Wittgenstein, 1953)
→ Word meaning (whatever it might be) is reflected in linguistic distributions.
“Words that occur in the same contexts tend to have similar meanings.” (Zellig Harris, 1968)
“If we consider words or morphemes A and B to be more different in meaning than A and C, then we will often find that the distributions of A and B are more different than the distributions of A and C. In other words, difference in meaning correlates with difference of distribution.” (Zellig Harris, “Distributional Structure”)
Differential and not referential
Contextual representation
A word's contextual representation is an abstract cognitive structure that accumulates from encounters with the word in various linguistic contexts.
We learn new words based on contextual cues:
He filled the wampimuk with the substance, passed it around and we all drank some.
We found a little wampimuk sleeping behind the tree.
The semantic content is represented by a vector.
Vectors are obtained through the statistical analysis of the linguistic contexts of a word.
Alternative names: corpus-based semantics, statistical semantics, geometrical models of meaning, vector semantics, word space models.
Distributions are vectors in a multidimensional semantic space, that is, objects with a magnitude and a direction.
The semantic space has dimensions which correspond to possible contexts, as gathered from a given corpus.
For example: cat = [... dog 0.8, eat 0.7, joke 0.01, mansion 0.2, ...]
In practice, many more dimensions are used.
Small Dataset
An automobile is a wheeled motor vehicle used for transporting passengers.
A car is a form of transport, usually with four wheels and the capacity to carry around five passengers.
Transport for the London games is limited, with spectators strongly advised to avoid the use of cars.
The London 2012 soccer tournament began yesterday, with plenty of goals in the opening matches.
Giggs scored the first goal of the football tournament at Wembley, North London.
Bellamy was largely a passenger in the football match, playing no part in either goal.

Target words: ⟨automobile, car, soccer, football⟩
Term vocabulary: ⟨wheel, transport, passenger, tournament, London, goal, match⟩
Define a context window: a number of words surrounding the target word.
The context can in general be defined in terms of documents, paragraphs or sentences.
Count the number of times the target word co-occurs with the context words: this gives the co-occurrence matrix.
Build vectors out of (a function of) these co-occurrence counts.
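A minimal sketch of counting such window co-occurrences, producing a matrix like the one below. The whitespace tokenization, lemmatization (e.g. mapping "wheels" to "wheel") and window size are assumptions, not fixed by the slides:

```python
from collections import Counter

def cooccurrence(sentences, targets, contexts, window=5):
    """Count how often each target co-occurs with each context word
    within +/- `window` positions, sentence by sentence."""
    counts = {t: Counter() for t in targets}
    for tokens in sentences:                      # each sentence: a list of lemmatized tokens
        for i, tok in enumerate(tokens):
            if tok not in targets:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in contexts:
                    counts[tok][tokens[j]] += 1
    return counts

# Toy usage (tokens assumed lemmatized, e.g. "wheels" -> "wheel"):
sents = [["automobile", "wheel", "motor", "vehicle", "transport", "passenger"]]
print(cooccurrence(sents, ["automobile"], ["wheel", "transport", "passenger"]))
```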
            wheel  transport  passenger  tournament  London  goal  match
automobile    1        1          1           0         0      0     0
car           1        2          1           0         1      0     0
soccer        0        0          0           1         1      1     1
football      0        0          1           1         1      2     1
[Figure: the four target words plotted in the 2-D subspace spanned by the transport (x-axis) and goal (y-axis) dimensions: automobile (1,0), car (2,0), soccer (0,1), football (0,2).]
Pawan Goyal
Week 7, Lecture 2
One-hot representation
[Figure: context words in a window around the target word; these words will represent "banking".]
The "mathematical" steps
Count the target-context co-occurrences
⇓
Weight the contexts (optional)
⇓
Build the distributional matrix
⇓
Reduce the matrix dimensions (optional)
⇓
Compute the vector distances on the (reduced) matrix
General Questions
How do the rows (words, ...) relate to each other?
How do the columns (contexts, documents, ...) relate to each other?
Which type of context?
Which weighting scheme?
Which similarity measure?
...
A specific parameter setting determines a particular type of DSM (e.g. LSA, HAL, etc.)
Parameters
Window size
Window shape: rectangular/triangular/other

Example text:
Suspected communist rebels on 4 July 1989 killed Col. Herminio Taylo, police chief of Makati, the Philippines' major financial center, in an escalation of street violence sweeping the Capitol area. The gunmen shouted references to the rebel New People's Army. They fled in a commandeered passenger jeep. The military says communist rebels have killed up to 65 soldiers and police in the Capitol region since January.
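The slides leave the exact window weighting open; a common convention (an assumption here) is that a rectangular window counts every position equally, while a triangular window down-weights more distant words linearly:

```python
def window_weights(window, shape="rectangular"):
    """Weight for a context word at distance d = 1..window from the target.
    Rectangular: every position counts 1; triangular: closer words count more."""
    if shape == "rectangular":
        return {d: 1.0 for d in range(1, window + 1)}
    if shape == "triangular":
        return {d: (window - d + 1) / window for d in range(1, window + 1)}
    raise ValueError(shape)

print(window_weights(4, "triangular"))   # {1: 1.0, 2: 0.75, 3: 0.5, 4: 0.25}
```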
Document length (|Di|): how many words appear in the document? F ∝ 1/|Di|
Document frequency (Nj): number of documents in which a word appears. F ∝ 1/Nj

Indexing weight: tf-idf
fij · log(N/Nj) for each term; then normalize the weight in a document with respect to the L2-norm.
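A minimal sketch of this tf-idf weighting with L2 normalization (the data and variable names below are illustrative, not from the slides):

```python
import math

def tfidf(doc_term_freqs):
    """doc_term_freqs: list of dicts mapping term -> raw frequency f_ij in that document.
    Returns the same structure with weights f_ij * log(N / N_j), L2-normalized per document."""
    N = len(doc_term_freqs)
    df = {}
    for doc in doc_term_freqs:
        for term in doc:
            df[term] = df.get(term, 0) + 1          # N_j: documents containing the term
    weighted = []
    for doc in doc_term_freqs:
        w = {t: f * math.log(N / df[t]) for t, f in doc.items()}
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        weighted.append({t: v / norm for t, v in w.items()})
    return weighted

docs = [{"oil": 3, "market": 1}, {"market": 2, "football": 1}]
print(tfidf(docs))
```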
Basic intuition

word1  word2         freq(1,2)  freq(1)   freq(2)
dog    small            855      33,338   490,580
dog    domesticated      29      33,338       918

Association measures are used to give more weight to contexts that are more significantly associated with a target word.
The less frequent the target and the context element are, the higher the weight given to their co-occurrence count should be.
⇒ Co-occurrence with the frequent context element small is less informative than co-occurrence with the rarer domesticated.
Different measures can be used, e.g., mutual information, log-likelihood ratio.
PMI(w1, w2) = log2 [ Pcorpus(w1, w2) / Pind(w1, w2) ] = log2 [ Pcorpus(w1, w2) / (Pcorpus(w1) · Pcorpus(w2)) ]

where Pcorpus(w1, w2) = freq(w1, w2) / N and Pcorpus(w) = freq(w) / N.
Positive PMI
All PMI values less than zero are replaced with zero.

PMI is biased towards rare events. Consider wj having the maximum association with wi, i.e. Pcorpus(wi) ≈ Pcorpus(wj) ≈ Pcorpus(wi, wj): PMI then increases as the probability of wi decreases. Also, consider a word wj that occurs only once in the corpus, and that occurrence is in the context of wi. A discounting factor proposed by Pantel and Lin:

δij = [fij / (fij + 1)] · [min(fi, fj) / (min(fi, fj) + 1)]
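A hedged sketch of PPMI with the Pantel-Lin discount. The total count N below is invented for illustration, and clamping at zero after discounting is one reasonable way to compose the two steps, not the only one:

```python
import math

def ppmi(f_ij, f_i, f_j, N, discount=True):
    """Positive PMI of target i and context j from raw counts.
    f_ij: co-occurrence count, f_i/f_j: marginal counts, N: total count."""
    if f_ij == 0:
        return 0.0
    pmi = math.log2((f_ij / N) / ((f_i / N) * (f_j / N)))
    if discount:                                  # Pantel & Lin discounting factor
        m = min(f_i, f_j)
        pmi *= (f_ij / (f_ij + 1)) * (m / (m + 1))
    return max(pmi, 0.0)

# dog/small vs dog/domesticated from the table above (N is illustrative, not from the slides)
N = 10_000_000
print(ppmi(855, 33_338, 490_580, N))   # frequent context 'small': low (here zero) weight
print(ppmi(29, 33_338, 918, N))        # rarer context 'domesticated': higher weight
```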
petroleum: oil:0.032 gas:0.029 crude:0.029 barrels:0.028 exploration:0.027 barrel:0.026 opec:0.026 refining:0.026 gasoline:0.026 fuel:0.025 natural:0.025 exporting:0.025
drug: trafficking:0.029 cocaine:0.028 narcotics:0.027 fda:0.026 police:0.026 abuse:0.026 marijuana:0.025 crime:0.025 colombian:0.025 arrested:0.025 addicts:0.024
insurance: insurers:0.028 premiums:0.028 lloyds:0.026 reinsurance:0.026 underwriting:0.025 pension:0.025 mortgage:0.025 credit:0.025 investors:0.024 claims:0.024 benefits:0.024
forest: timber:0.028 trees:0.027 land:0.027 forestry:0.026 environmental:0.026 species:0.026 wildlife:0.026 habitat:0.025 tree:0.025 mountain:0.025 river:0.025 lake:0.025
robotics: robots:0.032 automation:0.029 technology:0.028 engineering:0.026 systems:0.026 sensors:0.025 welding:0.025 computer:0.025 manufacturing:0.025 automated:0.025
Distributional Semantics: Applications, Structured Models
Pawan Goyal
Week 7, Lecture 3
Application to Query Expansion: Addressing Term Mismatch
User query: insurance cover which pays for long term care.
A relevant document may contain terms different from the actual user query.
Some relevant words concerning this query: {medicare, premiums, insurers}
Query Expansion using Unstructured DSMs
TREC Topic 104: catastrophic health insurance
Query Representation: surtax:1.0 hcfa:0.97 medicare:0.93 hmos:0.83 medicaid:0.8 hmo:0.78 beneficiaries:0.75 ambulatory:0.72 premiums:0.72 hospitalization:0.71 hhs:0.7 reimbursable:0.7 deductible:0.69
Broad expansion terms: medicare, beneficiaries, premiums . . .

TREC Topic 355: ocean remote sensing
Query Representation: radiometer:1.0 landsat:0.97 ionosphere:0.94 cnes:0.84 altimeter:0.83 nasda:0.81 meterology:0.81 cartography:0.78 geostationary:0.78 doppler:0.78 oceanographic:0.76
Broad expansion terms: radiometer, landsat, ionosphere . . .
Specific domain terms: CNES (Centre National d'Études Spatiales) and NASDA (National Space Development Agency of Japan)
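The slides do not spell out the exact expansion procedure; one simple unstructured-DSM approach (an assumption on my part) is to sum the weighted context vectors of the query terms and keep the highest-scoring dimensions as expansion terms:

```python
import numpy as np

def expansion_terms(query_terms, dsm, vocab, k=10):
    """Rank candidate expansion terms by summing the distributional vectors
    of the query terms and reading off the highest-weighted context dimensions.
    dsm: dict word -> np.array over `vocab` dimensions (e.g. PPMI-weighted)."""
    profile = np.zeros(len(vocab))
    for q in query_terms:
        if q in dsm:
            profile += dsm[q]
    ranked = sorted(zip(vocab, profile), key=lambda p: -p[1])
    return [(w, round(float(s), 2)) for w, s in ranked[:k] if w not in query_terms]

# toy vectors, purely illustrative
vocab = ["medicare", "premiums", "insurers", "goal"]
dsm = {"insurance": np.array([0.6, 0.9, 0.8, 0.0]),
       "health":    np.array([0.7, 0.2, 0.1, 0.0])}
print(expansion_terms(["insurance", "health"], dsm, vocab))
```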
Similarity Measures for Binary Vectors
Let X and Y denote the binary distributional vectors for words X and Y.

Similarity Measures
Dice coefficient: 2|X ∩ Y| / (|X| + |Y|)
Jaccard coefficient: |X ∩ Y| / |X ∪ Y|
Overlap coefficient: |X ∩ Y| / min(|X|, |Y|)

The Jaccard coefficient penalizes a small number of shared entries, while the Overlap coefficient uses the concept of inclusion.
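A direct translation of the three binary-vector measures, representing each word by the set of its 'on' dimensions (toy data):

```python
def dice(X, Y):
    """Dice coefficient of two binary vectors given as sets of 'on' dimensions."""
    return 2 * len(X & Y) / (len(X) + len(Y))

def jaccard(X, Y):
    return len(X & Y) / len(X | Y)

def overlap(X, Y):
    return len(X & Y) / min(len(X), len(Y))

X = {"wheel", "transport", "passenger"}
Y = {"transport", "passenger", "goal", "match"}
print(dice(X, Y), jaccard(X, Y), overlap(X, Y))   # 0.571..., 0.4, 0.666...
```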
Similarity Measures for Vector Spaces
X = [x1, x2, ..., xn], Y = [y1, y2, ..., yn]

Similarity Measures
Cosine similarity: cos(X, Y) = X · Y / (|X| |Y|)
Euclidean distance: |X − Y| = √( Σi (xi − yi)² )
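The two vector-space measures applied to rows of the toy co-occurrence matrix built earlier; football turns out to be much closer to soccer than to car:

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

car      = [1, 2, 1, 0, 1, 0, 0]   # rows of the co-occurrence matrix above
soccer   = [0, 0, 0, 1, 1, 1, 1]
football = [0, 0, 1, 1, 1, 2, 1]
print(cosine(football, soccer), cosine(football, car))   # ~0.88 vs ~0.27
print(euclidean(football, soccer))
```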
Similarity Measure for Probability Distributions
The distributions p and q are obtained from the (normalized) distributional vectors.

Similarity Measures
KL-divergence: D(p||q) = Σi pi log(pi / qi)
Information Radius: D(p || (p+q)/2) + D(q || (p+q)/2)
L1-norm: Σi |pi − qi|
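A minimal sketch of the three measures; p and q are assumed to be already-normalized distributions, and qi is assumed positive wherever pi is:

```python
import math

def kl(p, q):
    """KL-divergence D(p||q) = sum_i p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def information_radius(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl(p, m) + kl(q, m)

def l1(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

# distributions obtained by normalizing two rows (illustrative values)
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(information_radius(p, q), l1(p, q))
```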
Attributional Similarity vs. Relational Similarity
Attributional Similarity
The attributional similarity between two words a and b depends on the degree of correspondence between the properties of a and b.
Ex: dog and wolf
Relational Similarity
Two pairs (a, b) and (c, d) are relationally similar if they have many similar relations.
Ex: dog : bark and cat : meow
Relational Similarity: Pair-pattern matrix
Pair-pattern matrix
Row vectors correspond to pairs of words, such as mason : stone and carpenter : wood
Column vectors correspond to the patterns in which the pairs occur, e.g. X cuts Y and X works with Y
Compute the similarity of rows to find similar pairs
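A toy sketch of building a pair-pattern count matrix from (pair, pattern) observations; the observations themselves are invented for illustration, and similar rows indicate relationally similar pairs:

```python
from collections import Counter

# toy (pair, pattern) observations extracted from a corpus (illustrative, not from the slides)
observations = [
    (("mason", "stone"), "X cuts Y"),
    (("mason", "stone"), "X works with Y"),
    (("carpenter", "wood"), "X cuts Y"),
    (("carpenter", "wood"), "X works with Y"),
    (("dog", "bark"), "X lets out a Y"),
]

matrix = Counter(observations)                   # pair-pattern co-occurrence counts
pairs = sorted({p for p, _ in matrix})
patterns = sorted({pat for _, pat in matrix})
rows = {p: [matrix[(p, pat)] for pat in patterns] for p in pairs}
print(rows[("mason", "stone")], rows[("carpenter", "wood")])   # identical rows here
```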
Structured DSMs
Basic Issue
Words may not be the basic context units anymore
How to capture and represent syntactic information?
X solves Y and Y is solved by X

An Ideal Formalism
Should mirror semantic relationships as closely as possible
Incorporate word-based information and syntactic analysis
Should be applicable to different languages
Structured DSMs
[Figure: dependency parse of a sentence like "… eats a red apple", with the 'object' and 'modifier' relations marked.]
'eat' is not a legitimate context for 'red'.
The 'object' relation connecting 'eat' and 'apple' is treated as a different type of co-occurrence from the 'modifier' relation linking 'red' and 'apple'.
Structured DSMs: Words as 'legitimate' contexts
Co-occurrence statistics are collected using parser-extracted relations.
To qualify as context of a target item, a word must be linked to it by some (interesting) lexico-syntactic relation.
Word vectors are built from parser-extracted dependency triples such as <system, dobj, affects>, ...
Corpus-derived ternary data can also be mapped onto a 2-way matrix.
2-way matrix
The dependency information can be dropped:
<system, dobj, affects> ⇒ <system, affects>
<virus, nsubj, affects> ⇒ <virus, affects>
Alternatively, the link and one word can be concatenated and treated as attributes:
virus = {nsubj-affects: 0.05, . . .}
system = {dobj-affects: 0.03, . . .}
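A sketch of both mappings of ternary data onto a 2-way matrix; the triples and counts below are illustrative:

```python
from collections import defaultdict

triples = [                      # corpus-derived ternary data (illustrative counts)
    ("virus", "nsubj", "affects", 5),
    ("system", "dobj", "affects", 3),
    ("virus", "nsubj", "spreads", 2),
]

dropped = defaultdict(int)       # option 1: drop the dependency label
concat = defaultdict(int)        # option 2: concatenate link + word into an attribute
for word, rel, head, count in triples:
    dropped[(word, head)] += count
    concat[(word, f"{rel}-{head}")] += count

print(dict(dropped))   # {('virus', 'affects'): 5, ('system', 'affects'): 3, ...}
print(dict(concat))    # {('virus', 'nsubj-affects'): 5, ('system', 'dobj-affects'): 3, ...}
```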
Structured DSMs for Selectional Preferences
Structured DSMs can also be used to model selectional preference.
From a parsed corpus, noun vectors are calculated as shown for 'virus' and 'system'.

            obj-carry  obj-buy  obj-drive  obj-eat  obj-store  sub-fly  ...
car            0.1       0.4      0.8       0.02      0.2       0.05    ...
vegetable      0.3       0.5      0         0.6       0.3       0.05    ...
biscuit        0.4       0.4      0         0.5       0.4       0.02    ...
...            ...       ...      ...       ...       ...       ...     ...
Selectional Preferences
Suppose we want to compute the selectional preferences of the nouns as objects of the verb 'eat'.
The n nouns having the highest weight in the dimension 'obj-eat' are selected; let {vegetable, biscuit, . . .} be the set of these n nouns.
The complete vectors of these n nouns are used to obtain an 'object prototype' of the verb.
The 'object prototype' will indicate various attributes, e.g. that these nouns can be consumed, bought, carried, stored etc.
The similarity of a noun to this 'object prototype' is used to denote the plausibility of that noun being an object of the verb 'eat'.
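A sketch of the whole procedure on the toy table above. The slides do not fix how the prototype is built or which similarity is used; averaging the top-n vectors and using cosine similarity are assumptions:

```python
import math

nouns = {   # rows of the table above: obj-carry, obj-buy, obj-drive, obj-eat, obj-store, sub-fly
    "car":       [0.1, 0.4, 0.8, 0.02, 0.2, 0.05],
    "vegetable": [0.3, 0.5, 0.0, 0.6,  0.3, 0.05],
    "biscuit":   [0.4, 0.4, 0.0, 0.5,  0.4, 0.02],
}
OBJ_EAT = 3   # index of the 'obj-eat' dimension

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

# 1. pick the n nouns with the highest 'obj-eat' weight
n = 2
top = sorted(nouns, key=lambda w: -nouns[w][OBJ_EAT])[:n]          # ['vegetable', 'biscuit']
# 2. average their full vectors to get the 'object prototype' of 'eat'
proto = [sum(nouns[w][i] for w in top) / n for i in range(6)]
# 3. plausibility of a candidate noun as object of 'eat' = similarity to the prototype
print(cosine(nouns["car"], proto), cosine(nouns["biscuit"], proto))  # car is far less plausible
```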
Word Embeddings - Part I
Pawan Goyal
Week 7, Lecture 4
One-hot representation
Each element of the vector is associated with a word in the vocabulary. The encoding of a given word is simply the vector in which the corresponding element is set to one, and all other elements are zero.
Suppose our vocabulary has only five words: King, Queen, Man, Woman, and Child.
We could encode the word 'Queen' as:
[Figure: the one-hot vector for 'Queen', e.g. (0, 1, 0, 0, 0).]
Word vectors are not comparable
Using such an encoding, there is no meaningful comparison we can make between word vectors other than equality testing.
Word embeddings
Instead, each word can be represented as wi ∈ R^d, i.e., a d-dimensional vector, which is mostly learnt!
Each word is represented by a distribution of weights across those elements. So instead of a one-to-one mapping between an element in the vector and a word, the representation of a word is spread across all of the elements in the vector, and each element in the vector contributes to the definition of many words.
d is typically in the range 50 to 1000.
Similar words should have similar embeddings.
SVD can also be thought of as an embedding method.
It has been found that the learned word representations in fact capture meaningful syntactic and semantic regularities in a very simple way.
Specifically, the regularities are observed as constant vector offsets between pairs of words sharing a particular relationship.
If we denote the vector for word i as xi, and focus on the singular/plural relation, we observe that
xapple − xapples ≈ xcar − xcars ≈ xfamily − xfamilies
and so on.
Perhaps more surprisingly, we find that this is also the case for a variety of semantic relations.
Good at answering analogy questions:
a is to b, as c is to ?
man is to woman as uncle is to ? (aunt)
A simple vector offset method based on cosine distance shows the relation.
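A minimal sketch of the vector offset method with cosine similarity; the 2-dimensional vectors are purely illustrative (real embeddings have 50-1000 dimensions):

```python
import numpy as np

def analogy(a, b, c, vectors, exclude=True):
    """Return the word d such that a : b :: c : d, using the vector offset method:
    find the word whose vector is most cosine-similar to x_b - x_a + x_c."""
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    best, best_sim = None, -1.0
    for w, v in vectors.items():
        if exclude and w in (a, b, c):
            continue
        sim = np.dot(v, target) / np.linalg.norm(v)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

vecs = {"man": np.array([1.0, 0.0]), "woman": np.array([1.0, 1.0]),
        "uncle": np.array([2.0, 0.0]), "aunt": np.array([2.0, 1.0])}
print(analogy("man", "woman", "uncle", vecs))   # -> 'aunt'
```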
Analogy Testing
[Figures: analogy-test examples; nearest neighbours of the composite vector come up with impressive answers.]
Pawan Goyal
Week 7, Lecture 5
"… capture a large number of syntactic and semantic word relationships."
Imagine a sliding window over the text, that includes the central word currently in focus, together with the four words that precede it, and the four words that follow it:
Continuous Bag-of-Words (CBOW)
The training objective is to maximize the conditional probability of observing the actual output word (the focus word) given the input context words, with regard to the weights.
In our example, given the input ("an", "efficient", "method", "for", "high", "quality", "distributed", "vector"), we want to maximize the probability of getting "learning" as the output.
Since our input vectors are one-hot, multiplying an input vector by the weight matrix W1 amounts to simply selecting a row from W1.
Given C input word vectors, the activation function for the hidden layer h amounts to simply summing the corresponding 'hot' rows in W1, and dividing by C to take their average.
From the hidden layer to the output layer, the second weight matrix W2 can be used to compute a score for each word in the vocabulary, and softmax can be used to obtain the posterior distribution of words.
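A minimal numpy sketch of the CBOW forward pass described above (toy sizes; the training/backpropagation step is omitted):

```python
import numpy as np

V, d = 10, 4                      # vocabulary size and embedding dimension (toy values)
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, d))      # input->hidden weights: row i is the vector of word i
W2 = rng.normal(size=(d, V))      # hidden->output weights

def cbow_forward(context_ids, W1, W2):
    """Forward pass of CBOW: average the W1 rows of the context words,
    score every vocabulary word with W2, and softmax into a distribution."""
    h = W1[context_ids].mean(axis=0)          # hidden layer: average of the 'hot' rows
    scores = h @ W2                           # one score per vocabulary word
    exp = np.exp(scores - scores.max())       # numerically stable softmax
    return exp / exp.sum()

p = cbow_forward([1, 3, 5, 7], W1, W2)        # probability of each word being the focus word
print(p.argmax(), p.sum())                    # predicted word id; probabilities sum to 1
```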
Skip-gram
The activation function for the hidden layer simply amounts to copying the corresponding row from the weight matrix W1 (linear) as we saw before.
At the output layer, we now output C multinomial distributions instead of just one.
The training objective is to minimize the summed prediction error across all context words in the output layer. In our example, the input would be "learning", and we hope to see ("an", "efficient", "method", "for", "high", "quality", "distributed", "vector") at the output layer.
Details
Predict surrounding words in a window of length c of each word.
Objective Function: maximize the log probability of any context word given the current center word:

J(θ) = (1/T) Σ_{t=1}^{T} Σ_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)
p(wO | wI) = exp(v′_wO · v_wI) / Σ_{w=1}^{W} exp(v′_w · v_wI)

where v and v′ are the "input" and "output" vector representations of w (so every word has two vectors).
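The same softmax written as a small numpy function (toy, randomly initialized vectors):

```python
import numpy as np

def skipgram_prob(center_id, context_id, V_in, V_out):
    """p(w_O | w_I) with the softmax over all W output vectors.
    V_in[i] is the 'input' vector v_i, V_out[i] the 'output' vector v'_i."""
    scores = V_out @ V_in[center_id]          # v'_w . v_{w_I} for every word w
    scores -= scores.max()                    # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[context_id]

W, d = 10, 4
rng = np.random.default_rng(1)
V_in, V_out = rng.normal(size=(W, d)), rng.normal(size=(W, d))
print(skipgram_prob(2, 7, V_in, V_out))
```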
Each parameter θj is updated iteratively using the gradient of the objective:

θj^(new) = θj^(old) − α · ∂J(θ)/∂θj^(old)
Since every word has two vector representations, the input and output matrices can be combined into a final representation:
Lfinal = L + L′

A good tutorial to understand parameter learning:
https://arxiv.org/pdf/1411.2738.pdf

An interactive demo:
https://ronxin.github.io/wevi/
GloVe
Combines the best of both worlds – count-based methods as well as direct prediction methods
Fast training
Scalable to huge corpora
Good performance even with a small corpus, and small vectors
Code and vectors: http://nlp.stanford.edu/projects/glove/
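The slides only summarize GloVe's properties; for reference, a sketch of the weighted least-squares loss the model optimizes. The weighting parameters x_max and alpha follow the GloVe paper's defaults; everything else below is toy data:

```python
import numpy as np

def glove_loss(X, W, W_tilde, b, b_tilde, x_max=100, alpha=0.75):
    """GloVe weighted least-squares loss over a co-occurrence matrix X:
    sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    loss = 0.0
    for i, j in zip(*np.nonzero(X)):
        f = min(1.0, (X[i, j] / x_max) ** alpha)          # weighting function f(X_ij)
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        loss += f * diff * diff
    return loss

V, d = 5, 3
rng = np.random.default_rng(2)
X = rng.integers(0, 10, size=(V, V)).astype(float)        # toy co-occurrence counts
W, W_t = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_t = np.zeros(V), np.zeros(V)
print(glove_loss(X, W, W_t, b, b_t))
```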