
Distributional Semantics - Introduction

Pawan Goyal

CSE, IIT Kharagpur

Week 7, Lecture 1

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 1 / 14


Introduction

EL
What is Semantics?

PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 2 / 14


Introduction

EL
What is Semantics?
The study of meaning: Relation between symbols and their denotata.

PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 2 / 14


Introduction

EL
What is Semantics?
The study of meaning: Relation between symbols and their denotata.

PT
John told Mary that the train moved out of the station at 3 o’clock.
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 2 / 14


Computational Semantics

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 3 / 14


Computational Semantics

Computational Semantics
The study of how to automate the process of constructing and reasoning with

EL
meaning representations of natural language expressions.

PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 3 / 14


Computational Semantics

Computational Semantics
The study of how to automate the process of constructing and reasoning with

EL
meaning representations of natural language expressions.

Methods in Computational Semantics generally fall in two categories:

PT
Formal Semantics: Construction of precise mathematical models of the
relations between expressions in a natural language and the world.
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 3 / 14


Computational Semantics

Computational Semantics
The study of how to automate the process of constructing and reasoning with

EL
meaning representations of natural language expressions.

Methods in Computational Semantics generally fall in two categories:

PT
Formal Semantics: Construction of precise mathematical models of the
relations between expressions in a natural language and the world.
John chases a bat → ∃x[bat(x) ∧ chase(john, x)]
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 3 / 14


Computational Semantics

Computational Semantics
The study of how to automate the process of constructing and reasoning with

EL
meaning representations of natural language expressions.

Methods in Computational Semantics generally fall in two categories:

PT
Formal Semantics: Construction of precise mathematical models of the
relations between expressions in a natural language and the world.
John chases a bat → ∃x[bat(x) ∧ chase(john, x)]
N
Distributional Semantics: The study of statistical patterns of human
word usage to extract semantics.

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 3 / 14


Distributional Hypothesis

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 4 / 14


Distributional Hypothesis

Distributional Hypothesis: Basic Intuition


“The meaning of a word is its use in language.” (Wittgenstein,

EL
1953)

“You know a word by the company it keeps.” (Firth, 1957)

PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 4 / 14


Distributional Hypothesis

Distributional Hypothesis: Basic Intuition


“The meaning of a word is its use in language.” (Wittgenstein,

EL
1953)

“You know a word by the company it keeps.” (Firth, 1957)

PT
→ Word meaning (whatever it might be) is reflected in linguistic distributions.
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 4 / 14


Distributional Hypothesis

Distributional Hypothesis: Basic Intuition


“The meaning of a word is its use in language.” (Wittgenstein,

EL
1953)

“You know a word by the company it keeps.” (Firth, 1957)

PT
→ Word meaning (whatever it might be) is reflected in linguistic distributions.
“Words that occur in the same contexts tend to have similar
N
meanings.” (Zellig Harris, 1968)

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 4 / 14


Distributional Hypothesis

Distributional Hypothesis: Basic Intuition


“The meaning of a word is its use in language.” (Wittgenstein,

EL
1953)

“You know a word by the company it keeps.” (Firth, 1957)

PT
→ Word meaning (whatever it might be) is reflected in linguistic distributions.
“Words that occur in the same contexts tend to have similar
N
meanings.” (Zellig Harris, 1968)

→ Semantically similar words tend to have similar distributional patterns.

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 4 / 14


Distributional Semantics: a linguistic perspective

“If linguistics is to deal with meaning, it can only do so through


distributional analysis.” (Zellig Harris)

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 5 / 14


Distributional Semantics: a linguistic perspective

“If linguistics is to deal with meaning, it can only do so through


distributional analysis.” (Zellig Harris)

EL
“If we consider words or morphemes A and B to be more
different in meaning than A and C, then we will often find that the
distributions of A and B are more different than the distributions of A

PT
and C. In other words, difference in meaning correlates with
difference of distribution.” (Zellig Harris, “Distributional Structure”)
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 5 / 14


Distributional Semantics: a linguistic perspective

“If linguistics is to deal with meaning, it can only do so through


distributional analysis.” (Zellig Harris)

EL
“If we consider words or morphemes A and B to be more
different in meaning than A and C, then we will often find that the
distributions of A and B are more different than the distributions of A

PT
and C. In other words, difference in meaning correlates with
difference of distribution.” (Zellig Harris, “Distributional Structure”)
N
Differential and not referential

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 5 / 14


Distributional Semantics: a cognitive perspective

Contextual representation

EL
A word’s contextual representation is an abstract cognitive structure that
accumulates from encounters with the word in various linguistic contexts.

PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 6 / 14


Distributional Semantics: a cognitive perspective

Contextual representation

EL
A word’s contextual representation is an abstract cognitive structure that
accumulates from encounters with the word in various linguistic contexts.

PT
We learn new words based on contextual cues
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 6 / 14


Distributional Semantics: a cognitive perspective

Contextual representation

EL
A word’s contextual representation is an abstract cognitive structure that
accumulates from encounters with the word in various linguistic contexts.

PT
We learn new words based on contextual cues
He filled the wampimuk with the substance, passed it around and we all drunk
some.
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 6 / 14


Distributional Semantics: a cognitive perspective

Contextual representation

EL
A word’s contextual representation is an abstract cognitive structure that
accumulates from encounters with the word in various linguistic contexts.

PT
We learn new words based on contextual cues
He filled the wampimuk with the substance, passed it around and we all drunk
some.
N
We found a little wampimuk sleeping behind the tree.

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 6 / 14


Distributional Semantic Models (DSMs)

Computational models that build contextual semantic representations from


corpus data

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 7 / 14


Distributional Semantic Models (DSMs)

Computational models that build contextual semantic representations from


corpus data
DSMs are models for semantic representations

EL
I The semantic content is represented by a vector
I Vectors are obtained through the statistical analysis of the linguistic
contexts of a word

PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 7 / 14


Distributional Semantic Models (DSMs)

Computational models that build contextual semantic representations from
corpus data

DSMs are models for semantic representations
  - The semantic content is represented by a vector
  - Vectors are obtained through the statistical analysis of the linguistic
    contexts of a word

Alternative names
  - corpus-based semantics
  - statistical semantics
  - geometrical models of meaning
  - vector semantics
  - word space models

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 7 / 14


Distributional Semantics: The general intuition

EL
Distributions are vectors in a multidimensional semantic space, that is,
objects with a magnitude and a direction.

PT
The semantic space has dimensions which correspond to possible
contexts, as gathered from a given corpus.
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 8 / 14


Vector Space

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 9 / 14


Vector Space

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 9 / 14


Vector Space

EL
PT
N
In practice, many more dimensions are used.
cat = [...dog 0.8, eat 0.7, joke 0.01, mansion 0.2,...]

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 9 / 14


Word Space

Small Dataset
An automobile is a wheeled motor vehicle used for transporting passengers .
A car is a form of transport , usually with four wheels and the capacity to carry around

EL
five passengers .
Transport for the London games is limited , with spectators strongly advised to avoid
the use of cars .

PT
The London 2012 soccer tournament began yesterday , with plenty of goals in the
opening matches .
Giggs scored the first goal of the football tournament at Wembley , North London .
Bellamy was largely a passenger in the football match , playing no part in either goal .
N
Target words: ⟨automobile, car, soccer, football⟩
Term vocabulary: ⟨wheel, transport, passenger, tournament, London, goal, match⟩

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 10 / 14


Constructing Word spaces

Informal algorithm for constructing word spaces


Pick the words you are interested in: target words

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 11 / 14


Constructing Word spaces

Informal algorithm for constructing word spaces


Pick the words you are interested in: target words

EL
Define a context window, number of words surrounding target word

PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 11 / 14


Constructing Word spaces

Informal algorithm for constructing word spaces


Pick the words you are interested in: target words

EL
Define a context window, number of words surrounding target word
I The context can in general be defined in terms of documents, paragraphs
or sentences.

PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 11 / 14


Constructing Word spaces

Informal algorithm for constructing word spaces


Pick the words you are interested in: target words

EL
Define a context window, number of words surrounding target word
I The context can in general be defined in terms of documents, paragraphs
or sentences.

PT
Count number of times the target word co-occurs with the context words:
co-occurrence matrix
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 11 / 14


Constructing Word spaces

Informal algorithm for constructing word spaces


Pick the words you are interested in: target words

Define a context window, i.e. the number of words surrounding the target word
  - The context can in general be defined in terms of documents, paragraphs
    or sentences.

Count the number of times the target word co-occurs with the context words:
co-occurrence matrix

Build vectors out of (a function of) these co-occurrence counts
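A minimal Python sketch of this counting procedure (an illustration with an assumed toy corpus and window size, not code from the lecture):

# Count how often each target word co-occurs with each context word inside a fixed window.
from collections import defaultdict

def cooccurrence_matrix(sentences, targets, vocab, window=2):
    """sentences: list of token lists; returns {target: {context: count}}."""
    counts = {t: defaultdict(int) for t in targets}
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok not in counts:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in vocab:
                    counts[tok][tokens[j]] += 1
    return counts

# Usage on a tiny corpus:
sents = [["a", "car", "is", "a", "form", "of", "transport"],
         ["the", "first", "goal", "of", "the", "football", "tournament"]]
m = cooccurrence_matrix(sents, targets={"car", "football"},
                        vocab={"transport", "goal", "tournament"}, window=5)
print(m["car"]["transport"], m["football"]["goal"])   # 1 1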

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 11 / 14


Constructing Word spaces: distributional vectors

distributional matrix = targets × contexts

              wheel  transport  passenger  tournament  London  goal  match
automobile      1        1          1          0          0      0      0
car             1        2          1          0          1      0      0
soccer          0        0          0          1          1      1      1
football        0        0          1          1          1      2      1

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 12 / 14


[Figure: the word vectors plotted on the two dimensions "transport" (x-axis) and "goal" (y-axis):
automobile at (1,0), car at (2,0), soccer at (0,1), football at (0,2).]

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 13 / 14


Computing similarity

wheel transport passenger tournament London goal match


automobile 1 1 1 0 0 0 0

EL
car 1 2 1 0 1 0 0
soccer 0 0 0 1 1 1 1
football 0 0 1 1 1 2 1

Using simple vector product

automobile · car = 4          car · soccer = 1
automobile · soccer = 0       car · football = 2
automobile · football = 1     soccer · football = 5
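A small sketch reproducing these dot products from the co-occurrence matrix above with NumPy (illustrative only):

import numpy as np

contexts = ["wheel", "transport", "passenger", "tournament", "London", "goal", "match"]
M = np.array([[1, 1, 1, 0, 0, 0, 0],   # automobile
              [1, 2, 1, 0, 1, 0, 0],   # car
              [0, 0, 0, 1, 1, 1, 1],   # soccer
              [0, 0, 1, 1, 1, 2, 1]])  # football
words = ["automobile", "car", "soccer", "football"]

dot = M @ M.T                      # all pairwise dot products
print(dot[words.index("automobile"), words.index("car")])   # 4
print(dot[words.index("soccer"), words.index("football")])  # 5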

Pawan Goyal (IIT Kharagpur) Distributional Semantics - Introduction Week 7, Lecture 1 14 / 14


Distributional Models of Semantics

Pawan Goyal

CSE, IIT Kharagpur

Week 7, Lecture 2

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 1 / 19


Vector Space Model without distributional similarity

Words are treated as atomic symbols

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 2 / 19


Vector Space Model without distributional similarity

Words are treated as atomic symbols

EL
One-hot representation

PT
N

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 2 / 19


Distributional Similarity Based Representations

You know a word by the company it keeps

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 3 / 19


Distributional Similarity Based Representations

You know a word by the company it keeps

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 3 / 19


Distributional Similarity Based Representations

You know a word by the company it keeps

EL
PT
These words will represent banking
N

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 3 / 19


Building a DSM step-by-step

The “linguistic” steps


Pre-process a corpus (to define targets and contexts)

Select the targets and the contexts

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 4 / 19


Building a DSM step-by-step

The “linguistic” steps


Pre-process a corpus (to define targets and contexts)

Select the targets and the contexts

EL
The “mathematical” steps
Count the target-context co-occurrences

PT ⇓
Weight the contexts (optional)

N
Build the distributional matrix

Reduce the matrix dimensions (optional)

Compute the vector distances on the (reduced) matrix

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 4 / 19


Many design choices

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 5 / 19


Many design choices

EL
PT
N
General Questions
How do the rows (words, ...) relate to each other?
How do the columns (contexts, documents, ...) relate to each other?

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 5 / 19


The parameter space

A number of parameters to be fixed

EL
Which type of context?
Which weighting scheme?
Which similarity measure?
...
PT
A specific parameter setting determines a particular type of DSM (e.g. LSA,
N
HAL, etc.)

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 6 / 19


Documents as context: Word × document

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 7 / 19


Words as context: Word × Word

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 8 / 19


Words as contexts

EL
Parameters
Window size

PT
Window shape - rectangular/triangular/other
N

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 9 / 19


Words as contexts

Parameters
Window size

EL
Window shape - rectangular/triangular/other

Consider the following passage

PT
Suspected communist rebels on 4 July 1989 killed Col. Herminio Taylo, police
chief of Makati, the Philippines major financial center, in an escalation of street
violence sweeping the Capitol area. The gunmen shouted references to the
N
rebel New People’s Army. They fled in a commandeered passenger jeep. The
military says communist rebels have killed up to 65 soldiers and police in the
Capitol region since January.

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 9 / 19


Words as contexts

Parameters
Window size

EL
Window shape - rectangular/triangular/other

5 words window (unfiltered): 2 words either side of the target word

PT
Suspected communist rebels on 4 July 1989 killed Col. Herminio Taylo, police
chief of Makati, the Philippines major financial center, in an escalation of street
violence sweeping the Capitol area. The gunmen shouted references to the
N
rebel New People’s Army. They fled in a commandeered passenger jeep. The
military says communist rebels have killed up to 65 soldiers and police in the
Capitol region since January.

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 9 / 19


Words as contexts

Parameters
Window size

EL
Window shape - rectangular/triangular/other

5 words window (filtered): 2 words either side of the target word

PT
Suspected communist rebels on 4 July 1989 killed Col. Herminio Taylo, police
chief of Makati, the Philippines major financial center, in an escalation of street
violence sweeping the Capitol area. The gunmen shouted references to the
N
rebel New People’s Army. They fled in a commandeered passenger jeep. The
military says communist rebels have killed up to 65 soldiers and police in the
Capitol region since January.

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 9 / 19


Context weighting: documents as context

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 10 / 19


Context weighting: documents as context

Indexing function F: Essential factors


Word frequency (f_ij): How many times a word appears in the document?  F ∝ f_ij
Document length (|D_i|): How many words appear in the document?  F ∝ 1/|D_i|
Document frequency (N_j): Number of documents in which a word appears.  F ∝ 1/N_j

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 10 / 19


Context weighting: documents as context

Indexing function F: Essential factors


Word frequency (fij ): How many times a word appears in the document?
F ∝ fij

EL
Document length (|Di |): How many words appear in the document?
1
F∝ |Di |

appears. F ∝ N1
j PT
Document frequency (Nj ): Number of documents in which a word
N
Indexing Weight: tf-Idf
fij ∗ log( NNj ) for each term, normalize the weight in a document with
respect to L2 -norm.
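A small sketch of this tf-idf weighting with L2 normalization (an assumed implementation; the toy documents are made up):

import math

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    N = len(docs)
    df = {}                                   # N_j: number of documents containing term j
    for d in docs:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    weighted = []
    for d in docs:
        tf = {}
        for t in d:
            tf[t] = tf.get(t, 0) + 1          # f_ij
        w = {t: f * math.log(N / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0   # L2-normalize per document
        weighted.append({t: v / norm for t, v in w.items()})
    return weighted

print(tfidf([["health", "insurance", "insurance"], ["health", "care"]])[0])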

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 10 / 19


Context weighting: words as context

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 11 / 19


Context weighting: words as context

basic intuition
word1 word2 freq(1,2) freq(1) freq(2)
dog small 855 33,338 490,580

EL
dog domesticated 29 33,338 918

PT
N

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 11 / 19


Context weighting: words as context

basic intuition
word1 word2 freq(1,2) freq(1) freq(2)
dog small 855 33,338 490,580

EL
dog domesticated 29 33,338 918

Association measures are used to give more weight to contexts that are
more significantly associated with a target word.
N

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 11 / 19


Context weighting: words as context

basic intuition
word1 word2 freq(1,2) freq(1) freq(2)
dog small 855 33,338 490,580

EL
dog domesticated 29 33,338 918

Association measures are used to give more weight to contexts that are
more significantly associated with a target word.
The less frequent the target and context element are, the higher the
weight given to their co-occurrence count should be.
N

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 11 / 19


Context weighting: words as context

basic intuition
word1 word2 freq(1,2) freq(1) freq(2)
dog small 855 33,338 490,580

EL
dog domesticated 29 33,338 918

Association measures are used to give more weight to contexts that are
more significantly associated with a target word.
The less frequent the target and context element are, the higher the
weight given to their co-occurrence count should be.
N
⇒ Co-occurrence with frequent context element small is less informative
than co-occurrence with rarer domesticated.

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 11 / 19


Context weighting: words as context

basic intuition
word1 word2 freq(1,2) freq(1) freq(2)
dog small 855 33,338 490,580

EL
dog domesticated 29 33,338 918

Association measures are used to give more weight to contexts that are
more significantly associated with a target word.
The less frequent the target and context element are, the higher the
weight given to their co-occurrence count should be.
N
⇒ Co-occurrence with frequent context element small is less informative
than co-occurrence with rarer domesticated.
different measures - e.g., Mutual information, Log-likelihood ratio

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 11 / 19


Pointwise Mutual Information (PMI)

PMI(w1, w2) = log2 [ P_corpus(w1, w2) / P_ind(w1, w2) ]

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 12 / 19


Pointwise Mutual Information (PMI)

PMI(w1, w2) = log2 [ P_corpus(w1, w2) / P_ind(w1, w2) ]

PMI(w1, w2) = log2 [ P_corpus(w1, w2) / (P_corpus(w1) · P_corpus(w2)) ]

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 12 / 19


Pointwise Mutual Information (PMI)

PMI(w1, w2) = log2 [ P_corpus(w1, w2) / P_ind(w1, w2) ]

PMI(w1, w2) = log2 [ P_corpus(w1, w2) / (P_corpus(w1) · P_corpus(w2)) ]

P_corpus(w1, w2) = freq(w1, w2) / N
P_corpus(w) = freq(w) / N
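A short sketch of PMI computed from corpus counts, following the definitions above (the frequency counts are taken from the earlier dog/domesticated example; the corpus size N is an assumed value):

import math

def pmi(freq_12, freq_1, freq_2, N):
    p12 = freq_12 / N
    p1, p2 = freq_1 / N, freq_2 / N
    return math.log2(p12 / (p1 * p2))

# (dog, domesticated) with an assumed corpus size N:
print(round(pmi(freq_12=29, freq_1=33338, freq_2=918, N=50_000_000), 2))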

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 12 / 19


PMI: Issues and Variations

Positive PMI
All PMI values less than zero are replaced with zero.

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 13 / 19


PMI: Issues and Variations

Positive PMI
All PMI values less than zero are replaced with zero.

Bias towards infrequent events

EL
Consider wj having the maximum association with wi ,
Pcorpus (wi ) ≈ Pcorpus (wj ) ≈ Pcorpus (wi , wj )

PT
N

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 13 / 19


PMI: Issues and Variations

Positive PMI
All PMI values less than zero are replaced with zero.

Bias towards infrequent events

EL
Consider wj having the maximum association with wi ,
Pcorpus (wi ) ≈ Pcorpus (wj ) ≈ Pcorpus (wi , wj )

PT
PMI increases as the probability of wi decreases.
N

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 13 / 19


PMI: Issues and Variations

Positive PMI
All PMI values less than zero are replaced with zero.

Bias towards infrequent events

EL
Consider wj having the maximum association with wi ,
Pcorpus (wi ) ≈ Pcorpus (wj ) ≈ Pcorpus (wi , wj )

PT
PMI increases as the probability of wi decreases.
Also, consider a word wj that occurs once in the corpus, also in the context of
wi .
N

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 13 / 19


PMI: Issues and Variations

Positive PMI
All PMI values less than zero are replaced with zero.

Bias towards infrequent events

EL
Consider wj having the maximum association with wi ,
Pcorpus (wi ) ≈ Pcorpus (wj ) ≈ Pcorpus (wi , wj )

PT
PMI increases as the probability of wi decreases.
Also, consider a word wj that occurs once in the corpus, also in the context of
wi. A discounting factor proposed by Pantel and Lin:

δ_ij = [ f_ij / (f_ij + 1) ] · [ min(f_i, f_j) / (min(f_i, f_j) + 1) ]

PMI_new(w_i, w_j) = δ_ij · PMI(w_i, w_j)
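A sketch combining positive PMI with this discounting factor (toy, assumed counts):

import math

def ppmi_discounted(f_ij, f_i, f_j, N):
    p_ij, p_i, p_j = f_ij / N, f_i / N, f_j / N
    pmi = math.log2(p_ij / (p_i * p_j))
    pmi = max(pmi, 0.0)                              # positive PMI
    m = min(f_i, f_j)
    delta = (f_ij / (f_ij + 1)) * (m / (m + 1))      # discount for rare events
    return delta * pmi

# Assumed toy counts: a context word seen only once, and only with the target.
print(ppmi_discounted(f_ij=1, f_i=1000, f_j=1, N=1_000_000))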

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 13 / 19


Distributional Vectors: Example

Normalized Distributional Vectors using Pointwise Mutual Information

petroleum: oil:0.032 gas:0.029 crude:0.029 barrels:0.028 exploration:0.027 barrel:0.026
           opec:0.026 refining:0.026 gasoline:0.026 fuel:0.025 natural:0.025 exporting:0.025
drug:      trafficking:0.029 cocaine:0.028 narcotics:0.027 fda:0.026 police:0.026 abuse:0.026
           marijuana:0.025 crime:0.025 colombian:0.025 arrested:0.025 addicts:0.024
insurance: insurers:0.028 premiums:0.028 lloyds:0.026 reinsurance:0.026 underwriting:0.025
           pension:0.025 mortgage:0.025 credit:0.025 investors:0.024 claims:0.024 benefits:0.024
forest:    timber:0.028 trees:0.027 land:0.027 forestry:0.026 environmental:0.026 species:0.026
           wildlife:0.026 habitat:0.025 tree:0.025 mountain:0.025 river:0.025 lake:0.025
robotics:  robots:0.032 automation:0.029 technology:0.028 engineering:0.026 systems:0.026
           sensors:0.025 welding:0.025 computer:0.025 manufacturing:0.025 automated:0.025

Pawan Goyal (IIT Kharagpur) Distributional Models of Semantics Week 7, Lecture 2 14 / 19


Distributional Semantics: Applications, Structured
Models

Pawan Goyal

CSE, IIT Kharagpur

Week 7, Lecture 3

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 1 / 15
Application to Query Expansion: Addressing Term
Mismatch

Term Mismatch Problem in Information Retrieval


Stems from the word independence assumption during document indexing.

EL
User query: insurance cover which pays for long term care.

A relevant document may contain terms different from the actual user query.

PT
Some relevant words concerning this query: {medicare, premiums, insurers}
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 2 / 15
Application to Query Expansion: Addressing Term
Mismatch

Term Mismatch Problem in Information Retrieval


Stems from the word independence assumption during document indexing.

EL
User query: insurance cover which pays for long term care.

A relevant document may contain terms different from the actual user query.

PT
Some relevant words concerning this query: {medicare, premiums, insurers}

Using DSMs for Query Expansion


N
Given a user query, reformulate it using related terms to enhance the retrieval
performance.

The distributional vectors for the query terms are computed.

Expanded query is obtained by a linear combination or a functional combination


of these vectors.
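A minimal sketch of such an expansion step, assuming a {word: {context: weight}} distributional model and a simple sum as the combination function (illustrative values):

def expand_query(query_terms, vectors, k=10):
    combined = {}
    for term in query_terms:
        for ctx, w in vectors.get(term, {}).items():
            combined[ctx] = combined.get(ctx, 0.0) + w
    return sorted(combined, key=combined.get, reverse=True)[:k]

vectors = {"insurance": {"premiums": 0.8, "insurers": 0.7, "medicare": 0.3},
           "care":      {"medicare": 0.6, "nursing": 0.5}}
print(expand_query(["insurance", "care"], vectors, k=3))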

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 2 / 15
Query Expansion using Unstructured DSMs
TREC Topic 104: catastrophic health insurance
Query Representation: surtax:1.0 hcfa:0.97 medicare:0.93 hmos:0.83
medicaid:0.8 hmo:0.78 beneficiaries:0.75 ambulatory:0.72 premiums:0.72
hospitalization:0.71 hhs:0.7 reimbursable:0.7 deductible:0.69

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 3 / 15
Query Expansion using Unstructured DSMs
TREC Topic 104: catastrophic health insurance
Query Representation: surtax:1.0 hcfa:0.97 medicare:0.93 hmos:0.83
medicaid:0.8 hmo:0.78 beneficiaries:0.75 ambulatory:0.72 premiums:0.72
hospitalization:0.71 hhs:0.7 reimbursable:0.7 deductible:0.69

EL
Broad expansion terms: medicare, beneficiaries, premiums . . .

Specific domain terms: HCFA (Health Care Financing Administration), HMO


(Health Maintenance Organization), HHS (Health and Human Services)

PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 3 / 15
Query Expansion using Unstructured DSMs
TREC Topic 104: catastrophic health insurance
Query Representation: surtax:1.0 hcfa:0.97 medicare:0.93 hmos:0.83
medicaid:0.8 hmo:0.78 beneficiaries:0.75 ambulatory:0.72 premiums:0.72
hospitalization:0.71 hhs:0.7 reimbursable:0.7 deductible:0.69

EL
Broad expansion terms: medicare, beneficiaries, premiums . . .

Specific domain terms: HCFA (Health Care Financing Administration), HMO


(Health Maintenance Organization), HHS (Health and Human Services)

PT
TREC Topic 355: ocean remote sensing
Query Representation: radiometer:1.0 landsat:0.97 ionosphere:0.94
N
cnes:0.84 altimeter:0.83 nasda:0.81 meterology:0.81 cartography:0.78
geostationary:0.78 doppler:0.78 oceanographic:0.76

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 3 / 15
Query Expansion using Unstructured DSMs
TREC Topic 104: catastrophic health insurance
Query Representation: surtax:1.0 hcfa:0.97 medicare:0.93 hmos:0.83
medicaid:0.8 hmo:0.78 beneficiaries:0.75 ambulatory:0.72 premiums:0.72
hospitalization:0.71 hhs:0.7 reimbursable:0.7 deductible:0.69

EL
Broad expansion terms: medicare, beneficiaries, premiums . . .

Specific domain terms: HCFA (Health Care Financing Administration), HMO


(Health Maintenance Organization), HHS (Health and Human Services)

PT
TREC Topic 355: ocean remote sensing
Query Representation: radiometer:1.0 landsat:0.97 ionosphere:0.94
N
cnes:0.84 altimeter:0.83 nasda:0.81 meterology:0.81 cartography:0.78
geostationary:0.78 doppler:0.78 oceanographic:0.76
Broad expansion terms: radiometer, landsat, ionosphere . . .

Specific domain terms: CNES (Centre National d'Études Spatiales) and NASDA
(National Space Development Agency of Japan)

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 3 / 15
Similarity Measures for Binary Vectors

Let X and Y denote the binary distributional vectors for words X and Y .

Similarity Measures

EL
2|X∩Y|
Dice coefficient : |X|+|Y|

PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 4 / 15
Similarity Measures for Binary Vectors

Let X and Y denote the binary distributional vectors for words X and Y .

Similarity Measures

EL
2|X∩Y|
Dice coefficient : |X|+|Y|
|X∩Y|
Jaccard Coefficient : |X∪Y|

PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 4 / 15
Similarity Measures for Binary Vectors

Let X and Y denote the binary distributional vectors for words X and Y .

Similarity Measures

EL
2|X∩Y|
Dice coefficient : |X|+|Y|
|X∩Y|
Jaccard Coefficient : |X∪Y|

PT
Overlap Coefficient : min(|X|,|Y|)
|X∩Y|
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 4 / 15
Similarity Measures for Binary Vectors

Let X and Y denote the binary distributional vectors for words X and Y .

Similarity Measures

Dice coefficient: 2|X∩Y| / (|X| + |Y|)
Jaccard coefficient: |X∩Y| / |X∪Y|
Overlap coefficient: |X∩Y| / min(|X|, |Y|)

The Jaccard coefficient penalizes a small number of shared entries, while the Overlap
coefficient uses the concept of inclusion.
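A small sketch of the three measures, treating the binary vectors as sets of "on" contexts (the example sets are illustrative):

def dice(X, Y):    return 2 * len(X & Y) / (len(X) + len(Y))
def jaccard(X, Y): return len(X & Y) / len(X | Y)
def overlap(X, Y): return len(X & Y) / min(len(X), len(Y))

X = {"wheel", "transport", "passenger"}            # e.g. contexts of "automobile"
Y = {"wheel", "transport", "passenger", "London"}  # e.g. contexts of "car"
print(dice(X, Y), jaccard(X, Y), overlap(X, Y))    # 0.857..., 0.75, 1.0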

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 4 / 15
Similarity Measures for Vector Spaces

Let ~X and ~Y denote the distributional vectors for words X and Y .

EL
~X = [x1 , x2 , . . . , xn ], ~Y = [y1 , y2 , . . . , yn ]

PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 5 / 15
Similarity Measures for Vector Spaces

Let ~X and ~Y denote the distributional vectors for words X and Y .

EL
~X = [x1 , x2 , . . . , xn ], ~Y = [y1 , y2 , . . . , yn ]

Similarity Measures

PT ~
Cosine similarity : cos(~X,~Y) = ~X̄·Y~
|X||Y|
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 5 / 15
Similarity Measures for Vector Spaces

Let X and Y denote the distributional vectors for words X and Y.

X = [x1, x2, . . . , xn],  Y = [y1, y2, . . . , yn]

Similarity Measures

Cosine similarity: cos(X, Y) = (X · Y) / (|X| |Y|)
Euclidean distance: |X − Y| = sqrt( Σ_{i=1}^{n} (x_i − y_i)^2 )
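A short NumPy sketch of the two measures (illustrative vectors):

import numpy as np

def cosine(x, y):    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
def euclidean(x, y): return float(np.linalg.norm(x - y))

x = np.array([1.0, 2.0, 1.0, 0.0])
y = np.array([0.0, 2.0, 1.0, 1.0])
print(round(cosine(x, y), 3), round(euclidean(x, y), 3))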

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 5 / 15
Similarity Measure for Probability Distributions

Let p and q denote the probability distributions corresponding to two

EL
distributional vectors.

PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 6 / 15
Similarity Measure for Probability Distributions

Let p and q denote the probability distributions corresponding to two

distributional vectors.

Similarity Measures

KL-divergence: D(p‖q) = Σ_i p_i log(p_i / q_i)
Information Radius: D(p‖(p+q)/2) + D(q‖(p+q)/2)
L1-norm: Σ_i |p_i − q_i|
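A short sketch of these measures for two toy distributions (assumes p_i > 0 wherever the logs are taken):

import math

def kl(p, q):   return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
def irad(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl(p, m) + kl(q, m)
def l1(p, q):   return sum(abs(pi - qi) for pi, qi in zip(p, q))

p, q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]
print(round(kl(p, q), 3), round(irad(p, q), 3), round(l1(p, q), 3))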

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 6 / 15
Attributional Similarity vs. Relational Similarity

Attributional Similarity
The attributional similarity between two words a and b depends on the degree

EL
of correspondence between the properties of a and b.
Ex: dog and wolf

Relational Similarity
PT
Two pairs(a, b) and (c, d) are relationally similar if they have many similar
N
relations.
Ex: dog: bark and cat: meow

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 7 / 15
Relational Similarity: Pair-pattern matrix

Pair-pattern matrix
Row vectors correspond to pairs of words, such as mason: stone and
carpenter: wood

EL
Column vectors correspond to the patterns in which the pairs occur, e.g.
X cuts Y and X works with Y

PT
Compute the similarity of rows to find similar pairs
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 8 / 15
Relational Similarity: Pair-pattern matrix

Pair-pattern matrix
Row vectors correspond to pairs of words, such as mason: stone and
carpenter: wood

EL
Column vectors correspond to the patterns in which the pairs occur, e.g.
X cuts Y and X works with Y

PT
Compute the similarity of rows to find similar pairs

Extended Distributional Hypothesis; Lin and Pantel


N
Patterns that co-occur with similar pairs tend to have similar meanings.

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 8 / 15
Relational Similarity: Pair-pattern matrix

Pair-pattern matrix
Row vectors correspond to pairs of words, such as mason: stone and
carpenter: wood

EL
Column vectors correspond to the patterns in which the pairs occur, e.g.
X cuts Y and X works with Y

PT
Compute the similarity of rows to find similar pairs

Extended Distributional Hypothesis; Lin and Pantel


N
Patterns that co-occur with similar pairs tend to have similar meanings.
This matrix can also be used to measure the semantic similarity of patterns.

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 8 / 15
Relational Similarity: Pair-pattern matrix

Pair-pattern matrix
Row vectors correspond to pairs of words, such as mason: stone and
carpenter: wood

EL
Column vectors correspond to the patterns in which the pairs occur, e.g.
X cuts Y and X works with Y

PT
Compute the similarity of rows to find similar pairs

Extended Distributional Hypothesis; Lin and Pantel


N
Patterns that co-occur with similar pairs tend to have similar meanings.
This matrix can also be used to measure the semantic similarity of patterns.
Given a pattern such as “X solves Y”, you can use this matrix to find similar patterns,
such as “Y is solved by X”, “Y is resolved in X”, “X resolves Y”.

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 8 / 15
Structured DSMs

Basic Issue
Words may not be the basic context units anymore
How to capture and represent syntactic information?

EL
X solves Y and Y is solved by X

PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 9 / 15
Structured DSMs

Basic Issue
Words may not be the basic context units anymore
How to capture and represent syntactic information?

EL
X solves Y and Y is solved by X

An Ideal Formalism

PT
Should mirror semantic relationships as close as possible
Incorporate word-based information and syntactic analysis
N
Should be applicable to different languages

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 9 / 15
Structured DSMs

Basic Issue
Words may not be the basic context units anymore
How to capture and represent syntactic information?

EL
X solves Y and Y is solved by X

An Ideal Formalism

PT
Should mirror semantic relationships as close as possible
Incorporate word-based information and syntactic analysis
N
Should be applicable to different languages

Use Dependency grammar framework

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 9 / 15
Structured DSMs

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 10 / 15
Structured DSMs

Using Dependency Structure: How does it help?


The teacher eats a red apple.

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 10 / 15
Structured DSMs

Using Dependency Structure: How does it help?


The teacher eats a red apple.

EL
PT
N
‘eat’ is not a legitimate context for ‘red’.
The ‘object’ relation connecting ‘eat’ and ‘apple’ is treated as a different
type of co-occurrence from the ‘modifier’ relation linking ‘red’ and ‘apple’.

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 10 / 15
Structured DSMs

EL
Structured DSMs: Words as ‘legitimate’ contexts
Co-occurrence statistics are collected using parser-extracted relations.

PT
To qualify as context of a target item, a word must be linked to it by some
(interesting) lexico-syntactic relation
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 11 / 15
Structured DSMs

Distributional models, as guided by dependency


Ex: For the sentence ‘This virus affects the body’s defense system.’, the
dependency parse is:

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 12 / 15
Structured DSMs

Distributional models, as guided by dependency


Ex: For the sentence ‘This virus affects the body’s defense system.’, the
dependency parse is:

EL
PT
N
Word vectors
<system, dobj, affects> ...
Corpus-derived ternary data can also be mapped onto a 2-way matrix

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 12 / 15
2-way matrix

<system, dobj, affects>


<virus, nsubj, affects>

EL
The dependency information can be dropped
<system, dobj, affects> ⇒ <system, affects>
<virus, nsubj, affects> ⇒ <virus, affects>

PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 13 / 15
2-way matrix

<system, dobj, affects>


<virus, nsubj, affects>

EL
The dependency information can be dropped
<system, dobj, affects> ⇒ <system, affects>
<virus, nsubj, affects> ⇒ <virus, affects>

Link and one word can be concatenated and treated as attributes:
virus={nsubj-affects:0.05,. . .},
system={dobj-affects:0.03,. . .}
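A minimal sketch of this mapping from parser-extracted triples to relation-head attributes (co-occurrence counts as weights; the triples are the ones above):

from collections import defaultdict

def dependency_vectors(triples):
    """triples: iterable of (dependent, relation, head)."""
    vectors = defaultdict(lambda: defaultdict(int))
    for dependent, rel, head in triples:
        vectors[dependent][f"{rel}-{head}"] += 1
    return vectors

triples = [("virus", "nsubj", "affects"), ("system", "dobj", "affects")]
v = dependency_vectors(triples)
print(dict(v["virus"]), dict(v["system"]))
# {'nsubj-affects': 1} {'dobj-affects': 1}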

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 13 / 15
Structured DSMs for Selectional Preferences

Selectional Preferences for Verbs


Most verbs prefer arguments of a particular type. This regularity is known as

EL
selectional preference.

PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 14 / 15
Structured DSMs for Selectional Preferences

Selectional Preferences for Verbs


Most verbs prefer arguments of a particular type. This regularity is known as

EL
selectional preference.
From a parsed corpus, noun vectors are calculated as shown for ‘virus’ and
‘system’.

PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 14 / 15
Structured DSMs for Selectional Preferences

Selectional Preferences for Verbs


Most verbs prefer arguments of a particular type. This regularity is known as

EL
selectional preference.
From a parsed corpus, noun vectors are calculated as shown for ‘virus’ and
‘system’.

              obj-carry  obj-buy  obj-drive  obj-eat  obj-store  sub-fly  ...
car              0.1       0.4       0.8       0.02      0.2       0.05   ...
vegetable        0.3       0.5       0         0.6       0.3       0.05   ...
biscuit          0.4       0.4       0         0.5       0.4       0.02   ...
···              ...       ...       ...       ...       ...       ...    ...

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 14 / 15
Structured DSMs for Selectional Preferences

Selectional Preferences
Suppose we want to compute the selectional preferences of the nouns as
object of verb ‘eat’.

EL
PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 15 / 15
Structured DSMs for Selectional Preferences

Selectional Preferences
Suppose we want to compute the selectional preferences of the nouns as
object of verb ‘eat’.

EL
n nouns having highest weight in the dimension ‘obj-eat’ are selected, let
{vegetable, biscuit,. . .} be the set of these n nouns.

PT
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 15 / 15
Structured DSMs for Selectional Preferences

Selectional Preferences
Suppose we want to compute the selectional preferences of the nouns as
object of verb ‘eat’.

EL
n nouns having highest weight in the dimension ‘obj-eat’ are selected, let
{vegetable, biscuit,. . .} be the set of these n nouns.

PT
The complete vectors of these n nouns are used to obtain an ‘object
prototype’ of the verb.
N

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 15 / 15
Structured DSMs for Selectional Preferences

Selectional Preferences
Suppose we want to compute the selectional preferences of the nouns as
object of verb ‘eat’.

EL
n nouns having highest weight in the dimension ‘obj-eat’ are selected, let
{vegetable, biscuit,. . .} be the set of these n nouns.

prototype’ of the verb.


PT
The complete vectors of these n nouns are used to obtain an ‘object

‘object prototype’ will indicate various attributes such as these nouns can
N
be consumed, bought, carried, stored etc.

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 15 / 15
Structured DSMs for Selectional Preferences

Selectional Preferences
Suppose we want to compute the selectional preferences of the nouns as
object of verb ‘eat’.

EL
n nouns having highest weight in the dimension ‘obj-eat’ are selected, let
{vegetable, biscuit,. . .} be the set of these n nouns.

prototype’ of the verb.


PT
The complete vectors of these n nouns are used to obtain an ‘object

‘object prototype’ will indicate various attributes such as these nouns can
N
be consumed, bought, carried, stored etc.
Similarity of a noun to this ‘object prototype’ is used to denote the
plausibility of that noun being an object of verb ‘eat’.
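A sketch of this procedure on the toy table above (here n = 2; averaging the top-n vectors and scoring by cosine is one plausible instantiation, not necessarily the exact formulation in the lecture):

import numpy as np

dims = ["obj-carry", "obj-buy", "obj-drive", "obj-eat", "obj-store", "sub-fly"]
nouns = {"car":       np.array([0.1, 0.4, 0.8, 0.02, 0.2, 0.05]),
         "vegetable": np.array([0.3, 0.5, 0.0, 0.6,  0.3, 0.05]),
         "biscuit":   np.array([0.4, 0.4, 0.0, 0.5,  0.4, 0.02])}

eat_idx = dims.index("obj-eat")
top = sorted(nouns, key=lambda n: nouns[n][eat_idx], reverse=True)[:2]   # top-n nouns for 'obj-eat'
prototype = np.mean([nouns[n] for n in top], axis=0)                     # 'object prototype' of 'eat'

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

for noun in nouns:   # plausibility of each noun as object of 'eat'
    print(noun, round(cosine(nouns[noun], prototype), 2))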

Pawan Goyal (IIT Kharagpur) Distributional Semantics: Applications, Structured Models Week 7, Lecture 3 15 / 15
Word Embeddings - Part I

Pawan Goyal

CSE, IIT Kharagpur

Week 7, Lecture 4

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 1 / 19


Word Vectors

At one level, it is simply a vector of weights.

EL
PT
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 2 / 19


Word Vectors

At one level, it is simply a vector of weights.


In a simple 1-of-N (or ‘one-hot’) encoding every element in the vector is

EL
associated with a word in the vocabulary.

PT
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 2 / 19


Word Vectors

At one level, it is simply a vector of weights.


In a simple 1-of-N (or ‘one-hot’) encoding every element in the vector is

EL
associated with a word in the vocabulary.
The encoding of a given word is simply the vector in which the
corresponding element is set to one, and all other elements are zero.

One-hot representation PT
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 2 / 19


Word Vectors - One-hot Encoding

Suppose our vocabulary has only five words: King, Queen, Man, Woman,
and Child.
We could encode the word ‘Queen’ as:

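For example, a minimal sketch of such an encoding, assuming the vocabulary is ordered King, Queen, Man, Woman, Child:

vocab = ["King", "Queen", "Man", "Woman", "Child"]

def one_hot(word):
    return [1 if w == word else 0 for w in vocab]

print(one_hot("Queen"))  # [0, 1, 0, 0, 0]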

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 3 / 19


Limitations of One-hot encoding

EL
PT
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 4 / 19


Limitations of One-hot encoding

EL
Word vectors are not comparable
Using such an encoding, there is no meaningful comparison we can make

PT
between word vectors other than equality testing.
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 4 / 19


Word2Vec – A distributed representation
Distributional representation – word embedding?
Any word wi in the corpus is given a distributional representation by an
embedding

wi ∈ Rd

EL
i.e., a d−dimensional vector, which is mostly learnt!

PT
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 5 / 19


Word2Vec – A distributed representation
Distributional representation – word embedding?
Any word wi in the corpus is given a distributional representation by an
embedding

wi ∈ Rd

EL
i.e., a d−dimensional vector, which is mostly learnt!

PT
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 5 / 19


Distributional Representation

Take a vector with several hundred dimensions (say 1000).

EL
Each word is represented by a distribution of weights across those
elements.
So instead of a one-to-one mapping between an element in the vector

PT
and a word, the representation of a word is spread across all of the
elements in the vector, and
N
Each element in the vector contributes to the definition of many words.

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 6 / 19


Distributional Representation: Illustration
If we label the dimensions in a hypothetical word vector (there are no such
pre-assigned labels in the algorithm of course), it might look a bit like this:

EL
PT
N

Such a vector comes to represent in some abstract way the ‘meaning’ of a


word

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 7 / 19


Word Embeddings

EL
d typically in the range 50 to 1000
Similar words should have similar embeddings

PT
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 8 / 19


Word Embeddings

EL
d typically in the range 50 to 1000
Similar words should have similar embeddings

PT
SVD can also be thought of as an embedding method
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 8 / 19


Reasoning with Word Vectors

It has been found that the learned word representations in fact capture
meaningful syntactic and semantic regularities in a very simple way.

EL
PT
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 9 / 19


Reasoning with Word Vectors

It has been found that the learned word representations in fact capture
meaningful syntactic and semantic regularities in a very simple way.
Specifically, the regularities are observed as constant vector offsets

EL
between pairs of words sharing a particular relationship.

PT
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 9 / 19


Reasoning with Word Vectors

It has been found that the learned word representations in fact capture
meaningful syntactic and semantic regularities in a very simple way.
Specifically, the regularities are observed as constant vector offsets

EL
between pairs of words sharing a particular relationship.

Case of Singular-Plural Relations

PT
If we denote the vector for word i as xi , and focus on the singular/plural
relation, we observe that
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 9 / 19


Reasoning with Word Vectors

It has been found that the learned word representations in fact capture
meaningful syntactic and semantic regularities in a very simple way.
Specifically, the regularities are observed as constant vector offsets

EL
between pairs of words sharing a particular relationship.

Case of Singular-Plural Relations

PT
If we denote the vector for word i as xi , and focus on the singular/plural
relation, we observe that
N
xapple − xapples ≈ xcar − xcars ≈ xfamily − xfamilies ≈ . . .

and so on.

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 9 / 19


Reasoning with Word Vectors

Perhaps more surprisingly, we find that this is also the case for a variety of

EL
semantic relations.
Good at answering analogy questions
a is to b, as c is to ?

PT
man is to woman as uncle is to ? (aunt)
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 10 / 19


Reasoning with Word Vectors

Perhaps more surprisingly, we find that this is also the case for a variety of

EL
semantic relations.
Good at answering analogy questions
a is to b, as c is to ?

PT
man is to woman as uncle is to ? (aunt)

A simple vector offset method based on cosine distance shows the relation.
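A minimal sketch of this offset-plus-cosine method, assuming a {word: vector} embedding table (the toy 2-D vectors are made up purely to illustrate the mechanics):

import numpy as np

def analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' via the offset b - a + c and cosine similarity."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    best, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue
        sim = float(vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target)))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

embeddings = {w: np.array(v, dtype=float) for w, v in
              {"man": [1, 0], "woman": [1, 1], "uncle": [2, 0], "aunt": [2, 1]}.items()}
print(analogy("man", "woman", "uncle", embeddings))  # 'aunt' in this toy space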
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 10 / 19


Vector Offset for Gender Relation

EL
PT
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 11 / 19


Vector Offset for Singular-Plural Relation

EL
PT
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 12 / 19


Encoding Other Dimensions of Similarity

Analogy Testing

EL
PT
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 13 / 19


Analogy Testing

EL
PT
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 14 / 19


Country-capital city relationships

EL
PT
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 15 / 19


More Analogy Questions

EL
PT
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 16 / 19


Element Wise Addition

We can also use element-wise addition of vector elements to ask questions


such as ‘German + airlines’ and by looking at the closest tokens to the

EL
composite vector come up with impressive answers:

PT
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 17 / 19


Learning Word Vectors

EL
Basic Idea

Instead of capturing co-occurrence counts directly, predict the surrounding
words of every word.
PT
Code as well as word-vectors: https://code.google.com/p/word2vec/
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 18 / 19


Two Variations: CBOW and Skip-grams

EL
PT
N

Pawan Goyal (IIT Kharagpur) Word Embeddings - Part I Week 7, Lecture 4 19 / 19


Word Embeddings - Part II

Pawan Goyal

CSE, IIT Kharagpur

Week 7, Lecture 5

Pawan Goyal (IIT Kharagpur) Word Embedings - Part II Week 7, Lecture 5 1 / 14


CBOW

Consider a piece of prose such as:


“The recently introduced continuous Skip-gram model is an efficient
method for learning high-quality distributed vector representations that

EL
capture a large number of syntactic and semantic word relationships.”

PT
N

Pawan Goyal (IIT Kharagpur) Word Embedings - Part II Week 7, Lecture 5 2 / 14


CBOW

Consider a piece of prose such as:


“The recently introduced continuous Skip-gram model is an efficient
method for learning high-quality distributed vector representations that

EL
capture a large number of syntactic and semantic word relationships.”
Imagine a sliding window over the text, that includes the central word

PT
currently in focus, together with the four words that precede it, and the
four words that follow it:
N

Pawan Goyal (IIT Kharagpur) Word Embedings - Part II Week 7, Lecture 5 2 / 14


CBOW

Consider a piece of prose such as:


“The recently introduced continuous Skip-gram model is an efficient
method for learning high-quality distributed vector representations that

EL
capture a large number of syntactic and semantic word relationships.”
Imagine a sliding window over the text, that includes the central word

PT
currently in focus, together with the four words that precede it, and the
four words that follow it:
N

Pawan Goyal (IIT Kharagpur) Word Embedings - Part II Week 7, Lecture 5 2 / 14


CBOW
The context words form the input layer. Each word is encoded in one-hot form.
A single hidden and output layer.

EL
PT
N

Pawan Goyal (IIT Kharagpur) Word Embedings - Part II Week 7, Lecture 5 3 / 14


CBOW: Training Objective

The training objective is to maximize the conditional probability of

EL
observing the actual output word (the focus word) given the input context
words, with regard to the weights.

PT
In our example, given the input (“an”, “efficient”, “method”, “for”, “high”,
“quality”, “distributed”, “vector”), we want to maximize the probability of
getting “learning” as the output.
N

Pawan Goyal (IIT Kharagpur) Word Embedings - Part II Week 7, Lecture 5 4 / 14


CBOW: Input to Hidden Layer

Since our input vectors are one-hot, multiplying an input vector by the weight
matrix W1 amounts to simply selecting a row from W1 .

Given C input word vectors, the activation function for the hidden layer h
amounts to simply summing the corresponding ‘hot’ rows in W1 , and dividing
by C to take their average.
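A small NumPy sketch of this forward pass on toy, randomly initialized matrices (the softmax output layer described on the next slide is included for completeness):

import numpy as np

V, d = 10, 4                      # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(V, d)), rng.normal(size=(d, V))

context_ids = [2, 5, 7, 9]        # indices of the C context words (one-hot inputs)
h = W1[context_ids].mean(axis=0)  # average of the 'hot' rows of W1
scores = h @ W2
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary
print(probs.argmax(), round(float(probs.sum()), 2))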

Pawan Goyal (IIT Kharagpur) Word Embedings - Part II Week 7, Lecture 5 5 / 14


CBOW: Hidden to Output Layer

EL
From the hidden layer to the output layer, the second weight matrix W2 can be
used to compute a score for each word in the vocabulary, and softmax can be

PT
used to obtain the posterior distribution of words.
N

Pawan Goyal (IIT Kharagpur) Word Embedings - Part II Week 7, Lecture 5 6 / 14


Skip-gram Model
The skip-gram model is the opposite of the CBOW model. It is constructed
with the focus word as the single input vector, and the target context words are
now at the output layer:

EL
PT
N

Pawan Goyal (IIT Kharagpur) Word Embedings - Part II Week 7, Lecture 5 7 / 14


Skip-gram Model: Training

The activation function for the hidden layer simply amounts to copying the

EL
corresponding row from the weights matrix W1 (linear) as we saw before.
At the output layer, we now output C multinomial distributions instead of
just one.

PT
The training objective is to minimize the summed prediction error across
all context words in the output layer. In our example, the input would be
“learning”, and we hope to see (“an”, “efficient”, “method”, “for”, “high”,
N
“quality”, “distributed”, “vector”) at the output layer.

Pawan Goyal (IIT Kharagpur) Word Embedings - Part II Week 7, Lecture 5 8 / 14


Skip-gram Model

Details

EL
Predict surrounding words in a window of length c of each word

PT
N

Pawan Goyal (IIT Kharagpur) Word Embedings - Part II Week 7, Lecture 5 9 / 14


Skip-gram Model

Details

EL
Predict surrounding words in a window of length c of each word
Objective Function: Maximize the log probability of any context word given
the current center word:

J(θ) = (1/T) Σ_{t=1}^{T} Σ_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)

Pawan Goyal (IIT Kharagpur) Word Embedings - Part II Week 7, Lecture 5 9 / 14


Word Vectors

For p(wt+j |wt ) the simplest first formulation is

p(w_O | w_I) = exp(v′_{w_O}ᵀ v_{w_I}) / Σ_{w=1}^{W} exp(v′_wᵀ v_{w_I})

Pawan Goyal (IIT Kharagpur) Word Embedings - Part II Week 7, Lecture 5 10 / 14


Word Vectors

For p(wt+j |wt ) the simplest first formulation is

p(w_O | w_I) = exp(v′_{w_O}ᵀ v_{w_I}) / Σ_{w=1}^{W} exp(v′_wᵀ v_{w_I})

where v and v′ are the “input” and “output” vector representations of w (so every
word has two vectors)
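A toy NumPy sketch of this softmax formulation, with separate input (v) and output (v′) vector tables (random values, purely illustrative):

import numpy as np

V, d = 8, 3
rng = np.random.default_rng(1)
v  = rng.normal(size=(V, d))      # input vectors  v_w
v_ = rng.normal(size=(V, d))      # output vectors v'_w

def p_out_given_in(wO, wI):
    scores = v_ @ v[wI]                  # v'_w . v_{wI} for every w in the vocabulary
    e = np.exp(scores - scores.max())    # numerically stable softmax
    return e[wO] / e.sum()

print(round(float(p_out_given_in(wO=3, wI=0)), 4))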
N

Pawan Goyal (IIT Kharagpur) Word Embedings - Part II Week 7, Lecture 5 10 / 14


Parameters θ

With d−dimensional words and V many words:

EL
PT
N

Pawan Goyal (IIT Kharagpur) Word Embedings - Part II Week 7, Lecture 5 11 / 14


Gradient Descent for Parameter Updates

θ_j^new = θ_j^old − α · ∂J(θ) / ∂θ_j^old

Pawan Goyal (IIT Kharagpur) Word Embedings - Part II Week 7, Lecture 5 12 / 14


Two sets of vectors

Best solution is to sum these up

L_final = L + L′

EL
PT
N

Pawan Goyal (IIT Kharagpur) Word Embedings - Part II Week 7, Lecture 5 13 / 14


Two sets of vectors

Best solution is to sum these up

L_final = L + L′

EL
A good tutorial to understand parameter learning:

PT
https://arxiv.org/pdf/1411.2738.pdf
N

Pawan Goyal (IIT Kharagpur) Word Embedings - Part II Week 7, Lecture 5 13 / 14


Two sets of vectors

Best solution is to sum these up

L_final = L + L′

EL
A good tutorial to understand parameter learning:

PT
https://arxiv.org/pdf/1411.2738.pdf
N
An interactive Demo
https://ronxin.github.io/wevi/

Pawan Goyal (IIT Kharagpur) Word Embedings - Part II Week 7, Lecture 5 13 / 14


GloVe

EL
Combine the best of both worlds – count based methods as well as direct
prediction methods
PT
N

Pawan Goyal (IIT Kharagpur) Word Embedings - Part II Week 7, Lecture 5 14 / 14


GloVe

EL
Combine the best of both worlds – count based methods as well as direct
prediction methods
Fast training
PT
N
Scalable to huge corpora
Good performance even with small corpus, and small vectors
Code and vectors: http://nlp.stanford.edu/projects/glove/

Pawan Goyal (IIT Kharagpur) Word Embedings - Part II Week 7, Lecture 5 14 / 14
