Unit 4 - Distributional Semantics


Distributional Semantics - Introduction



Introduction

What is Semantics?
The study of meaning: the relation between symbols and their denotata.
Example: John told Mary that the train moved out of the station at 3 o’clock.



Computational Semantics

Computational Semantics
The study of how to automate the process of constructing and
reasoning with meaning representations of natural language
expressions.

Methods in Computational Semantics generally fall into two categories:


Formal Semantics: Construction of precise mathematical
models of the relations between expressions in a natural
language and the world.
John chases a bat → ∃x[bat(x) ∧ chase(john, x)]
Distributional Semantics: The study of statistical patterns of
human word usage to extract semantics.



Distributional Hypothesis

Distributional Hypothesis: Basic Intuition


“The meaning of a word is its use in language.”
(Wittgenstein, 1953)

“You shall know a word by the company it keeps.” (Firth, 1957)

→ Word meaning (whatever it might be) is reflected in linguistic distributions.

“Words that occur in the same contexts tend to have similar meanings.” (Zellig Harris, 1968)

→ Semantically similar words tend to have similar distributional patterns.



Distributional Semantics: a linguistic perspective

“If linguistics is to deal with meaning, it can only do so through distributional analysis.” (Zellig Harris)




“If we consider words or morphemes A and B to be more different in meaning than A and C, then we will often find that the distributions of A and B are more different than the distributions of A and C. In other words, difference in meaning correlates with difference of distribution.” (Zellig Harris, “Distributional Structure”)

Differential and not referential



Distributional Semantics: a cognitive perspective

Contextual representation
A word’s contextual representation is an abstract cognitive
structure that accumulates from encounters with the word in
various linguistic contexts.

We learn new words based on contextual cues:

“He filled the wampimuk with the substance, passed it around and we all drank some.”
“We found a little wampimuk sleeping behind the tree.”



Distributional Semantic Models (DSMs)

Computational models that build contextual semantic representations from corpus data.

DSMs are models for semantic representations:
- The semantic content is represented by a vector
- Vectors are obtained through the statistical analysis of the linguistic contexts of a word

Alternative names:
- corpus-based semantics
- statistical semantics
- geometrical models of meaning
- vector semantics
- word space models



Distributional Semantics: The general intuition

Distributions are vectors in a multidimensional semantic space, that is, objects with a magnitude and a direction.
The semantic space has dimensions which correspond to possible contexts, as gathered from a given corpus.



Vector Space

In practice, many more dimensions are used.

cat = [... dog 0.8, eat 0.7, joke 0.01, mansion 0.2, ...]



Word Space

Small Dataset
An automobile is a wheeled motor vehicle used for transporting passengers.
A car is a form of transport, usually with four wheels and the capacity to carry around five passengers.
Transport for the London games is limited, with spectators strongly advised to avoid the use of cars.
The London 2012 soccer tournament began yesterday, with plenty of goals in the opening matches.
Giggs scored the first goal of the football tournament at Wembley, North London.
Bellamy was largely a passenger in the football match, playing no part in either goal.

Target words: ⟨automobile, car, soccer, football⟩


Term vocabulary: ⟨wheel, transport, passenger, tournament, London, goal, match⟩
Constructing Word spaces

Informal algorithm for constructing word spaces

Pick the words you are interested in: target words
Define a context window: the number of words surrounding the target word
- The context can in general also be defined in terms of documents, paragraphs or sentences.
Count the number of times the target word co-occurs with the context words: co-occurrence matrix
Build vectors out of (a function of) these co-occurrence counts



Constructing Word spaces: distributional vectors

distributional matrix = targets × contexts

             wheel  transport  passenger  tournament  London  goal  match
automobile     1        1          1          0          0      0     0
car            1        2          1          0          1      0     0
soccer         0        0          0          1          1      1     1
football       0        0          1          1          1      2     1
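A minimal sketch of the informal algorithm above, applied to the toy corpus with sentences as contexts. The crude prefix matching that stands in for lemmatization (e.g. 'transporting' → 'transport') is an assumption of this sketch, not something from the lecture; it happens to reproduce the matrix above.

```python
from collections import defaultdict

sentences = [
    "an automobile is a wheeled motor vehicle used for transporting passengers",
    "a car is a form of transport usually with four wheels and the capacity to carry around five passengers",
    "transport for the london games is limited with spectators strongly advised to avoid the use of cars",
    "the london 2012 soccer tournament began yesterday with plenty of goals in the opening matches",
    "giggs scored the first goal of the football tournament at wembley north london",
    "bellamy was largely a passenger in the football match playing no part in either goal",
]
targets = ["automobile", "car", "soccer", "football"]
contexts = ["wheel", "transport", "passenger", "tournament", "london", "goal", "match"]

def matches(token, term):
    # crude stand-in for lemmatization: 'cars' -> 'car', 'transporting' -> 'transport'
    return token.startswith(term)

counts = defaultdict(int)
for sent in sentences:
    tokens = sent.split()
    for t in targets:
        if any(matches(tok, t) for tok in tokens):          # sentence contains the target
            for c in contexts:
                counts[(t, c)] += sum(matches(tok, c) for tok in tokens)

for t in targets:
    print(t, [counts[(t, c)] for c in contexts])
```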



[Figure: the word vectors plotted on the dimensions ‘transport’ (x-axis) and ‘goal’ (y-axis): automobile at (1,0), car at (2,0), soccer at (0,1), football at (0,2).]


Computing similarity

             wheel  transport  passenger  tournament  London  goal  match
automobile     1        1          1          0          0      0     0
car            1        2          1          0          1      0     0
soccer         0        0          0          1          1      1     1
football       0        0          1          1          1      2     1

Using the simple vector (dot) product:

automobile · car = 4          car · soccer = 1
automobile · soccer = 0       car · football = 2
automobile · football = 1     soccer · football = 5
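A small illustration (not from the lecture) computing these dot products directly from the rows of the matrix:

```python
import numpy as np

# rows of the co-occurrence matrix: wheel, transport, passenger, tournament, london, goal, match
vectors = {
    "automobile": np.array([1, 1, 1, 0, 0, 0, 0]),
    "car":        np.array([1, 2, 1, 0, 1, 0, 0]),
    "soccer":     np.array([0, 0, 0, 1, 1, 1, 1]),
    "football":   np.array([0, 0, 1, 1, 1, 2, 1]),
}

print(vectors["automobile"] @ vectors["car"])      # 4
print(vectors["automobile"] @ vectors["soccer"])   # 0
print(vectors["soccer"] @ vectors["football"])     # 5
```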



Distributional Models of Semantics



Vector Space Model without distributional similarity

Words are treated as atomic symbols

One-hot representation



Distributional Similarity Based Representations

You shall know a word by the company it keeps

These words will represent banking



Building a DSM step-by-step

The “linguistic” steps
- Pre-process a corpus (to define targets and contexts)
- Select the targets and the contexts

The “mathematical” steps
- Count the target-context co-occurrences
- Weight the contexts (optional)
- Build the distributional matrix
- Reduce the matrix dimensions (optional)
- Compute the vector distances on the (reduced) matrix

Many design choices

General Questions
How do the rows (words, ...) relate to each other?
How do the columns (contexts, documents, ...) relate to each
other?



The parameter space

A number of parameters to be fixed
- Which type of context?
- Which weighting scheme?
- Which similarity measure?
- ...

A specific parameter setting determines a particular type of DSM (e.g. LSA, HAL, etc.)



Documents as context: Word × document



Words as context: Word × Word



Words as contexts

Parameters
Window size
Window shape - rectangular/triangular/other

Consider the following passage


Suspected communist rebels on 4 July 1989 killed Col. Herminio
Taylo, police chief of Makati, the Philippines major financial center, in an
escalation of street violence sweeping the Capitol area. The gunmen
shouted references to the rebel New People’s Army. They fled in a
commandeered passenger jeep. The military says communist rebels
have killed up to 65 soldiers and police in the Capitol region since
January.



5 words window (unfiltered): 2 words either side of the target word
5 words window (filtered): 2 words either side of the target word
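A minimal sketch of window-based context extraction with a rectangular ±2 window; the stop-word list used for the 'filtered' variant is an assumption of this sketch.

```python
STOPWORDS = {"on", "of", "the", "in", "to", "a", "an"}   # assumed filter list, not from the lecture

def window_contexts(tokens, target_index, size=2, filtered=False):
    """Return up to `size` context words on either side of the target word."""
    if filtered:
        # drop stop words but keep track of the target's new position
        kept = [(i, t) for i, t in enumerate(tokens)
                if t.lower() not in STOPWORDS or i == target_index]
        positions = [i for i, _ in kept]
        tokens = [t for _, t in kept]
        target_index = positions.index(target_index)
    left = tokens[max(0, target_index - size):target_index]
    right = tokens[target_index + 1:target_index + 1 + size]
    return left + right

tokens = ("Suspected communist rebels on 4 July 1989 killed Col. Herminio Taylo , "
          "police chief of Makati").split()
print(window_contexts(tokens, tokens.index("killed")))                 # ['July', '1989', 'Col.', 'Herminio']
print(window_contexts(tokens, tokens.index("rebels"), filtered=True))  # ['Suspected', 'communist', '4', 'July']
```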


Context weighting: documents as context

Indexing function F: essential factors

Word frequency (f_ij): how many times word j appears in document D_i.
    F ∝ f_ij
Document length (|D_i|): how many words appear in the document.
    F ∝ 1 / |D_i|
Document frequency (N_j): number of documents in which word j appears.
    F ∝ 1 / N_j

Indexing weight: tf-idf

    w_ij = f_ij · log(N / N_j),  where N is the total number of documents.

For each term, the weight in a document is then normalized with respect to the L2-norm.
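A small sketch of the tf-idf weighting with L2 normalization described above; function and variable names are my own, and the toy documents are invented.

```python
import math

def tfidf_weights(doc_term_counts):
    """doc_term_counts: list of {term: raw frequency} dicts, one per document."""
    n_docs = len(doc_term_counts)
    # document frequency N_j: number of documents containing term j
    df = {}
    for counts in doc_term_counts:
        for term in counts:
            df[term] = df.get(term, 0) + 1

    weighted = []
    for counts in doc_term_counts:
        w = {t: f * math.log(n_docs / df[t]) for t, f in counts.items()}   # f_ij * log(N / N_j)
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0            # L2 normalization
        weighted.append({t: v / norm for t, v in w.items()})
    return weighted

docs = [{"insurance": 3, "premiums": 1, "care": 2},
        {"insurance": 1, "football": 4},
        {"football": 2, "goal": 3}]
print(tfidf_weights(docs)[0])
```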



Context weighting: words as context

basic intuition

word1   word2          freq(1,2)   freq(1)   freq(2)
dog     small             855       33,338   490,580
dog     domesticated       29       33,338       918



Context weighting: words as context

basic intuition

Association measures are used to give more weight to contexts that are more significantly associated with a target word.
The less frequent the target and the context element are, the higher the weight given to their co-occurrence count should be.
⇒ Co-occurrence with the frequent context element small is less informative than co-occurrence with the rarer domesticated.
Different measures can be used, e.g., mutual information, log-likelihood ratio.



Pointwise Mutual Information (PMI)

PMI(w1, w2) = log2 [ P_corpus(w1, w2) / P_ind(w1, w2) ]

            = log2 [ P_corpus(w1, w2) / (P_corpus(w1) · P_corpus(w2)) ]

where

P_corpus(w1, w2) = freq(w1, w2) / N
P_corpus(w) = freq(w) / N
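A minimal sketch of PMI over a word-word co-occurrence matrix, including the positive-PMI clipping discussed below; the toy counts are invented.

```python
import numpy as np

# rows = target words, columns = context words; toy counts
counts = np.array([[10.0, 0.0, 3.0],
                   [ 2.0, 8.0, 1.0],
                   [ 0.0, 1.0, 6.0]])

N = counts.sum()
p_joint = counts / N                       # P_corpus(w1, w2)
p_target = counts.sum(axis=1) / N          # P_corpus(w1)
p_context = counts.sum(axis=0) / N         # P_corpus(w2)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(p_joint / np.outer(p_target, p_context))
pmi[~np.isfinite(pmi)] = 0.0               # zero counts give -inf; zero them out
ppmi = np.maximum(pmi, 0.0)                # Positive PMI: clip negative values to zero
print(ppmi)
```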



PMI: Issues and Variations

Positive PMI
All PMI values less than zero are replaced with zero.

Bias towards infrequent events

Consider w_j having the maximum association with w_i:
    P_corpus(w_i) ≈ P_corpus(w_j) ≈ P_corpus(w_i, w_j)
Then PMI increases as the probability of w_i decreases.
Also, consider a word w_j that occurs only once in the corpus, and that occurrence is in the context of w_i.
A discounting factor proposed by Pantel and Lin:

    δ_ij = [ f_ij / (f_ij + 1) ] · [ min(f_i, f_j) / (min(f_i, f_j) + 1) ]
Distributional Vectors: Example

Normalized Distributional Vectors using Pointwise Mutual Information


petroleum: oil:0.032 gas:0.029 crude:0.029 barrels:0.028 exploration:0.027 barrel:0.026 opec:0.026 refining:0.026 gasoline:0.026 fuel:0.025 natural:0.025 exporting:0.025

drug: trafficking:0.029 cocaine:0.028 narcotics:0.027 fda:0.026 police:0.026 abuse:0.026 marijuana:0.025 crime:0.025 colombian:0.025 arrested:0.025 addicts:0.024

insurance: insurers:0.028 premiums:0.028 lloyds:0.026 reinsurance:0.026 underwriting:0.025 pension:0.025 mortgage:0.025 credit:0.025 investors:0.024 claims:0.024 benefits:0.024

forest: timber:0.028 trees:0.027 land:0.027 forestry:0.026 environmental:0.026 species:0.026 wildlife:0.026 habitat:0.025 tree:0.025 mountain:0.025 river:0.025 lake:0.025

robotics: robots:0.032 automation:0.029 technology:0.028 engineering:0.026 systems:0.026 sensors:0.025 welding:0.025 computer:0.025 manufacturing:0.025 automated:0.025



Distributional Semantics: Applications, Structured Models



Application to Query Expansion: Addressing Term Mismatch

Term Mismatch Problem in Information Retrieval

Stems from the word independence assumption during document indexing.
User query: insurance cover which pays for long term care.
A relevant document may contain terms different from the actual user query.
Some relevant words concerning this query: {medicare, premiums, insurers}




Using DSMs for Query Expansion

Given a user query, reformulate it using related terms to enhance the retrieval performance.
The distributional vectors for the query terms are computed.
The expanded query is obtained by a linear combination or a functional ...
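A sketch of the expansion idea, assuming distributional vectors are already available as a {word: vector} mapping; the plain sum used as the linear combination and the cosine ranking of candidate terms are choices of this sketch, not prescribed by the lecture.

```python
import numpy as np

def expand_query(query_terms, vectors, vocab, k=5):
    """Return the k vocabulary terms closest to the combined query vector."""
    # assumes at least one query term is present in `vectors`
    q = sum(vectors[t] for t in query_terms if t in vectors)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    scored = [(w, cos(q, vectors[w])) for w in vocab if w not in query_terms]
    return sorted(scored, key=lambda x: -x[1])[:k]

# e.g. expand_query(["insurance", "care"], vectors, vectors.keys())
```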
Query Expansion using Unstructured DSMs
TREC Topic 104: catastrophic health insurance
Query Representation: surtax:1.0 hcfa:0.97 medicare:0.93
hmos:0.83 medicaid:0.8 hmo:0.78 beneficiaries:0.75
ambulatory:0.72 premiums:0.72 hospitalization:0.71 hhs:0.7
reimbursable:0.7 deductible:0.69
Broad expansion terms: medicare, beneficiaries, premiums, ...
Specific domain terms: HCFA (Health Care Financing Administration), HMO (Health Maintenance Organization), HHS (Health and Human Services)

TREC Topic 355: ocean remote sensing


Query Representation: radiometer:1.0 landsat:0.97
ionosphere:0.94 cnes:0.84 altimeter:0.83 nasda:0.81
meterology:0.81 cartography:0.78 geostationary:0.78
doppler:0.78 oceanographic:0.76
Similarity Measures for Binary Vectors

Let X and Y denote the binary distributional vectors for words X and Y.

Similarity Measures
Dice coefficient: 2|X ∩ Y| / (|X| + |Y|)
Jaccard coefficient: |X ∩ Y| / |X ∪ Y|
Overlap coefficient: |X ∩ Y| / min(|X|, |Y|)

The Jaccard coefficient penalizes a small number of shared entries, while the Overlap coefficient uses the concept of inclusion.
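A small illustration (not from the lecture) treating each word's set of non-zero contexts as a Python set, with toy data from the earlier co-occurrence matrix:

```python
def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    return len(x & y) / len(x | y)

def overlap(x, y):
    return len(x & y) / min(len(x), len(y))

car = {"wheel", "transport", "passenger", "london"}
football = {"passenger", "tournament", "london", "goal", "match"}
print(dice(car, football), jaccard(car, football), overlap(car, football))
```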



Similarity Measures for Vector Spaces

Let →X and →Y denote the distributional vectors for words X and Y:
→X = [x1, x2, ..., xn],  →Y = [y1, y2, ..., yn]

Similarity Measures
Cosine similarity: cos(→X, →Y) = (→X · →Y) / (|→X| |→Y|)
Euclidean distance: |→X − →Y| = √( Σ_{i=1..n} (xi − yi)² )
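A small illustration of both measures on toy vectors taken from the earlier co-occurrence matrix:

```python
import numpy as np

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def euclidean(x, y):
    return float(np.linalg.norm(x - y))

car      = np.array([1, 2, 1, 0, 1, 0, 0])
football = np.array([0, 0, 1, 1, 1, 2, 1])
print(cosine(car, football), euclidean(car, football))
```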



Similarity Measure for Probability Distributions

Let p and q denote the probability distributions corresponding to two distributional vectors.

Similarity Measures
KL-divergence: D(p || q) = Σ_i p_i log(p_i / q_i)
Information Radius: D(p || (p+q)/2) + D(q || (p+q)/2)
L1-norm: Σ_i |p_i − q_i|
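A small sketch of the three measures; the natural log is used for the KL-divergence (the base only changes the scale), and the toy distributions are invented.

```python
import numpy as np

def kl(p, q):
    # D(p || q); assumes q_i > 0 wherever p_i > 0
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def information_radius(p, q):
    m = (p + q) / 2
    return kl(p, m) + kl(q, m)

def l1_norm(p, q):
    return float(np.sum(np.abs(p - q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), information_radius(p, q), l1_norm(p, q))
```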



Attributional Similarity vs. Relational Similarity

Attributional Similarity
The attributional similarity between two words a and b depends on the
degree of correspondence between the properties of a and b.
Ex: dog and wolf

Relational Similarity
Two pairs (a, b) and (c, d) are relationally similar if they have many similar relations.
Ex: dog : bark and cat : meow



Relational Similarity: Pair-pattern matrix

Pair-pattern matrix
Row vectors correspond to pairs of words, such as mason : stone and carpenter : wood
Column vectors correspond to the patterns in which the pairs occur, e.g., X cuts Y and X works with Y
Compute the similarity of rows to find similar pairs

Extended Distributional Hypothesis (Lin and Pantel)

Patterns that co-occur with similar pairs tend to have similar meanings.
This matrix can also be used to measure the semantic similarity of patterns.
Given a pattern such as “X solves Y”, you can use this matrix to find similar patterns, such as “Y is solved by X”, “Y is resolved in X”, “X resolves Y”.
Structured DSMs

Basic Issue
Words may not be the basic context units anymore.
How to capture and represent syntactic information?
    X solves Y and Y is solved by X

An Ideal Formalism
- Should mirror semantic relationships as closely as possible
- Should incorporate word-based information and syntactic analysis
- Should be applicable to different languages

Use the Dependency grammar framework



Structured DSMs

Using Dependency Structure: How does it help?


The teacher eats a red apple.

‘eat’ is not a legitimate context for ‘red’.


The ‘object’ relation connecting ‘eat’ and ‘apple’ is treated as a
different type of co-occurrence from the ‘modifier’ relation linking
‘red’ and ‘apple’.



Structured DSMs

Structured DSMs: Words as ‘legitimate’ contexts


Co-occurrence statistics are collected using parser-extracted
relations.
To qualify as context of a target item, a word must be linked to it
by some (interesting) lexico-syntactic relation



Structured DSMs

Distributional models, as guided by dependency

Ex: For the sentence “This virus affects the body’s defense system.”, the dependency parse is:
[Figure: dependency parse of the sentence]

Word vectors
<system, dobj, affects> ...
Corpus-derived ternary data can also be mapped onto a 2-way matrix



2-way matrix

<system, dobj, affects>
<virus, nsubj, affects>

The dependency information can be dropped:
<system, dobj, affects> ⇒ <system, affects>
<virus, nsubj, affects> ⇒ <virus, affects>

Alternatively, the link and one word can be concatenated and treated as attributes:
virus = {nsubj-affects: 0.05, ...}
system = {dobj-affects: 0.03, ...}
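A sketch of extracting ⟨dependent, relation, head⟩ triples with a dependency parser. spaCy and its en_core_web_sm model are assumptions of this sketch (any parser producing typed dependencies would do), and the exact labels depend on the parser.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumed model; must be installed separately
doc = nlp("This virus affects the body's defense system.")

triples = [(tok.text, tok.dep_, tok.head.text) for tok in doc if tok.dep_ != "ROOT"]
print(triples)   # expected to include ('virus', 'nsubj', 'affects') and ('system', 'dobj', 'affects')

# concatenate the relation and the head to get 2-way matrix attributes, e.g. 'nsubj-affects'
attributes = [f"{dep}-{head}" for _, dep, head in triples]
```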



Structured DSMs for Selectional Preferences

Selectional Preferences for Verbs


Most verbs prefer arguments of a particular type. This regularity is
known as selectional preference.
From a parsed corpus, noun vectors are calculated as shown for
‘virus’ and ‘system’.
[Table: noun vectors over dependency-based dimensions such as obj-carry, obj-buy, obj-drive, obj-eat, obj-store and sub-fly, with rows for nouns such as vegetable and biscuit and their association weights.]



Structured DSMs for Selectional Preferences

Selectional Preferences
Suppose we want to compute the selectional preferences of nouns as objects of the verb ‘eat’.
The n nouns having the highest weight in the dimension ‘obj-eat’ are selected; let {vegetable, biscuit, ...} be the set of these n nouns.
The complete vectors of these n nouns are used to obtain an ‘object prototype’ of the verb.
The ‘object prototype’ will indicate various attributes, e.g., that these nouns can be consumed, bought, carried, stored, etc.
The similarity of a noun to this ‘object prototype’ is used to denote the plausibility of that noun being an object of the verb ‘eat’.
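A sketch of the object-prototype idea, assuming noun vectors over dependency-based dimensions are already available; the toy vectors are invented, and taking the centroid of the top nouns is one simple way to build the prototype.

```python
import numpy as np

dims = ["obj-carry", "obj-buy", "obj-eat", "obj-store", "sub-fly"]   # hypothetical dimensions
nouns = {
    "vegetable": np.array([0.3, 0.5, 0.8, 0.4, 0.0]),
    "biscuit":   np.array([0.4, 0.4, 0.5, 0.4, 0.0]),
    "car":       np.array([0.1, 0.6, 0.0, 0.2, 0.0]),
    "bird":      np.array([0.1, 0.0, 0.3, 0.0, 0.7]),
}

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

# 1. pick the n nouns with the highest weight on the 'obj-eat' dimension
eat_dim = dims.index("obj-eat")
top = sorted(nouns, key=lambda w: -nouns[w][eat_dim])[:2]
# 2. the object prototype is (here) the centroid of their full vectors
prototype = np.mean([nouns[w] for w in top], axis=0)
# 3. plausibility of a noun as object of 'eat' = similarity to the prototype
for w in nouns:
    print(w, round(cosine(nouns[w], prototype), 3))
```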



Word Embeddings - Part I



Word Vectors

At one level, a word vector is simply a vector of weights.
In a simple 1-of-N (or ‘one-hot’) encoding, every element in the vector is associated with a word in the vocabulary.
The encoding of a given word is simply the vector in which the corresponding element is set to one, and all other elements are zero.

One-hot representation



Word Vectors - One-hot Encoding

Suppose our vocabulary has only five words: King, Queen, Man,
Woman, and Child.
We could encode the word ‘Queen’ as:
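A tiny illustration, assuming the vocabulary is ordered as listed above:

```python
vocab = ["King", "Queen", "Man", "Woman", "Child"]   # assumed ordering

def one_hot(word):
    return [1 if w == word else 0 for w in vocab]

print(one_hot("Queen"))   # [0, 1, 0, 0, 0]
```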



Limitations of One-hot encoding

Word vectors are not comparable

Using such an encoding, there is no meaningful comparison we can make between word vectors other than equality testing.

Word2Vec – A distributed representation
Distributional representation – word embedding?
Any word w_i in the corpus is given a distributional representation by an embedding

    w_i ∈ ℝ^d

i.e., a d-dimensional vector, which is mostly learnt!



Distributional Representation

Take a vector with several hundred dimensions (say 1000).
Each word is represented by a distribution of weights across those elements.
So instead of a one-to-one mapping between an element in the vector and a word, the representation of a word is spread across all of the elements in the vector, and each element in the vector contributes to the definition of many words.



Distributional Representation: Illustration
If we label the dimensions in a hypothetical word vector (there are no such pre-assigned labels in the algorithm, of course), it might look a bit like this:

Such a vector comes to represent, in some abstract way, the ‘meaning’ of a word.



Word Embeddings

d is typically in the range 50 to 1000.

Similar words should have similar embeddings.

SVD can also be thought of as an embedding method.



Reasoning with Word Vectors

It has been found that the learned word representations in fact capture meaningful syntactic and semantic regularities in a very simple way.
Specifically, the regularities are observed as constant vector offsets between pairs of words sharing a particular relationship.

Case of Singular-Plural Relations

If we denote the vector for word i as x_i and focus on the singular/plural relation, we observe that

    x_apple − x_apples ≈ x_car − x_cars ≈ x_family − x_families ≈ ...

and so on.
Reasoning with Word Vectors

Perhaps more surprisingly, we find that this is also the case for a
variety of semantic relations.
Good at answering analogy questions
a is to b, as c is to ?
man is to woman as uncle is to ? (aunt)

A simple vector offset method based on cosine distance shows the relation.
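A sketch of the vector-offset method: answer "a is to b as c is to ?" with the vocabulary word closest (by cosine) to x_b − x_a + x_c. The vectors argument is assumed to be a {word: np.array} mapping of learned embeddings, which is not constructed here.

```python
import numpy as np

def analogy(a, b, c, vectors):
    """Return the word d (other than a, b, c) maximizing cos(x_b - x_a + x_c, x_d)."""
    target = vectors[b] - vectors[a] + vectors[c]
    def cos(x, y):
        return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(target, vectors[w]))

# analogy("man", "woman", "uncle", vectors)  ->  ideally "aunt"
```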



Vector Offset for Gender Relation



Vector Offset for Singular-Plural Relation



Encoding Other Dimensions of Similarity

Analogy Testing



Analogy Testing



Country-capital city relationships



More Analogy Questions



Element Wise Addition

We can also use element-wise addition of vectors to ask questions such as ‘German + airlines’ and, by looking at the closest tokens to the composite vector, come up with impressive answers:



Two Variations: CBOW and Skip-grams



Word Embeddings - Part II



CBOW

Consider a piece of prose such as:

“The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of syntactic and semantic word relationships.”

Imagine a sliding window over the text that includes the central word currently in focus, together with the four words that precede it and the four words that follow it:



CBOW
The context words form the input layer. Each word is encoded in one-hot form. There is a single hidden layer and an output layer.



CBOW: Training Objective

The training objective is to maximize the conditional probability of


observing the actual output word (the focus word) given the input
context words, with regard to the weights.
In our example, given the input (“an”, “efficient”, “method”, “for”,
“high”, “quality”, “distributed”, “vector”), we want to maximize the
probability of getting “learning” as the output.



CBOW: Input to Hidden Layer

Since our input vectors are one-hot, multiplying an input vector by


the weight matrix W1 amounts to simply selecting a row from W1.

Given C input word vectors, the activation function for the hidden
layer h amounts to simply summing the corresponding ‘hot’ rows in
W1, and dividing by C to take their average.



CBOW: Hidden to Output Layer

From the hidden layer to the output layer, the second weight matrix W2
can be used to compute a score for each word in the vocabulary, and
softmax can be used to obtain the posterior distribution of words.
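A sketch of the CBOW forward pass just described: with one-hot inputs, multiplying by W1 selects rows, which are averaged, scored against W2 and passed through a softmax. Sizes and the random initialization are placeholders, not values from the lecture.

```python
import numpy as np

V, d = 10, 4                      # vocabulary size and embedding dimension (placeholders)
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, d))      # input -> hidden weights (one row per word)
W2 = rng.normal(size=(d, V))      # hidden -> output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cbow_forward(context_ids):
    h = W1[context_ids].mean(axis=0)   # average the 'hot' rows of W1
    scores = h @ W2                    # one score per vocabulary word
    return softmax(scores)             # posterior distribution over the focus word

probs = cbow_forward([1, 3, 5, 7])     # indices of the context words
print(probs.argmax(), probs.sum())     # predicted focus word id; probabilities sum to 1
```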



Skip-gram Model
The skip-gram model is the opposite of the CBOW model. It is constructed with the focus word as the single input vector, and the target context words are now at the output layer:



Skip-gram Model: Training

The activation function for the hidden layer simply amounts to copying the corresponding row from the weight matrix W1 (linear), as we saw before.
At the output layer, we now output C multinomial distributions instead of just one.
The training objective is to minimize the summed prediction error across all context words in the output layer. In our example, the input would be “learning”, and we hope to see (“an”, “efficient”, “method”, “for”, “high”, “quality”, “distributed”, “vector”) at the output layer.



Skip-gram Model

Details
Predict the surrounding words in a window of length c around each word.
Objective function: maximize the log probability of any context word given the current center word:

    J(θ) = (1/T) Σ_{t=1}^{T} Σ_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)



Word Vectors

For p(w_{t+j} | w_t), the simplest first formulation is

    p(w_O | w_I) = exp(v′_{w_O}ᵀ v_{w_I}) / Σ_{w=1}^{W} exp(v′_wᵀ v_{w_I})

where v_w and v′_w are the “input” and “output” vector representations of w (so every word has two vectors).
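A sketch of this softmax formulation with separate “input” (v) and “output” (v′) vectors per word; the matrices are randomly initialized placeholders rather than trained embeddings.

```python
import numpy as np

W_vocab, d = 10, 4
rng = np.random.default_rng(1)
V_in  = rng.normal(size=(W_vocab, d))   # v_w   ("input" vectors)
V_out = rng.normal(size=(W_vocab, d))   # v'_w  ("output" vectors)

def p_output_given_input(w_o, w_i):
    scores = V_out @ V_in[w_i]                  # v'_w^T v_{w_I} for every word w
    exp = np.exp(scores - scores.max())         # shift for numerical stability
    return exp[w_o] / exp.sum()

print(p_output_given_input(w_o=2, w_i=5))
```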



Parameters θ

With d-dimensional vectors and V words, θ collects the “input” and “output” vectors of every word, i.e., 2dV parameters:



Gradient Descent for Parameter Updates


    θ_j^{new} = θ_j^{old} − α · ∂J(θ)/∂θ_j^{old}



Two sets of vectors

The best solution is to sum these up:

Lfinal = L + L′

A good tutorial to understand parameter learning:


https://arxiv.org/pdf/1411.2738.pdf

An interactive Demo
https://ronxin.github.io/wevi/





GloVe

Combines the best of both worlds – count-based methods as well as direct prediction methods.

Fast training
Scalable to huge corpora
Good performance even with a small corpus and small vectors

Code and vectors: http://nlp.stanford.edu/projects/glove/

