Distributional Semantics - Introduction
Pawan Goyal (IIT Kharagpur) Week 7, Lecture 1 / 14
Introduction
What is Semantics?
The study of meaning: the relation between symbols and their denotata.
Example: John told Mary that the train moved out of the station at 3 o'clock.
Computational Semantics
The study of how to automate the process of constructing and
reasoning with meaning representations of natural language
expressions.
Methods in Computational Semantics generally fall into two categories:
Formal Semantics: Construction of precise mathematical
models of the relations between expressions in a natural
language and the world.
John chases a bat → ∃x[bat(x) ∧ chase(john, x)]
Distributional Semantics: The study of statistical patterns of
human word usage to extract semantics.
Distributional Hypothesis
Distributional Hypothesis: Basic Intuition
“The meaning of a word is its use in language.”
(Wittgenstein, 1953)
“You shall know a word by the company it keeps.” (Firth, 1957)
→ Word meaning (whatever it might be) is reflected in linguistic
distributions.
“Words that occur in the same contexts tend to have
similar meanings.” (Zellig Harris, 1968)
→ Semantically similar words tend to have similar
distributional patterns.
Distributional Semantics: a linguistic perspective
“If linguistics is to deal with meaning, it can only do so
through distributional analysis.” (Zellig Harris)
“If we consider words or morphemes A and B to be more
different in meaning than A and C, then we will often find that
the distributions of A and B are more different than the
distributions of A and C. In other words, difference in
meaning correlates with difference of distribution.” (Zellig
Harris, “Distributional Structure”)
Differential and not referential
Distributional Semantics: a cognitive perspective
Contextual representation
A word’s contextual representation is an abstract cognitive
structure that accumulates from encounters with the word in
various linguistic contexts.
We learn new words based on contextual cues
He filled the wampimuk with the substance, passed it around and we all drank some.
We found a little wampimuk sleeping behind the tree.
Distributional Semantic Models (DSMs)
Computational models that build contextual semantic representations from corpus data.
DSMs are models for semantic representations:
The semantic content is represented by a vector
Vectors are obtained through the statistical analysis of the linguistic contexts of a word
Alternative names:
corpus-based semantics
statistical semantics
geometrical models of meaning
vector semantics
word space models
Distributional Semantics: The general intuition
Distributions are vectors in a multidimensional semantic
space, that is, objects with a magnitude and a direction.
The semantic space has dimensions which correspond to
possible contexts, as gathered from a given corpus.
Vector Space
In practice, many more dimensions are used.
cat = [. . . dog 0.8, eat 0.7, joke 0.01, mansion 0.2, . . .]
Word Space
Small Dataset
An automobile is a wheeled motor vehicle used for transporting passengers.
A car is a form of transport, usually with four wheels and the capacity to carry around five passengers.
Transport for the London games is limited, with spectators strongly advised to avoid the use of cars.
The London 2012 soccer tournament began yesterday, with plenty of goals in the opening matches.
Giggs scored the first goal of the football tournament at Wembley, North London.
Bellamy was largely a passenger in the football match, playing no part in either goal.
Target words: ⟨automobile, car, soccer, football⟩
Term vocabulary: ⟨wheel, transport, passenger, tournament, London, goal, match⟩
Constructing Word spaces
Informal algorithm for constructing word spaces
Pick the words you are interested in: target words
Define a context window: the number of words surrounding the target word
The context can in general also be defined in terms of documents, paragraphs or sentences
Count the number of times the target word co-occurs with the context words: the co-occurrence matrix
Build vectors out of (a function of) these co-occurrence counts
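A minimal Python sketch of this informal algorithm, using whole sentences as contexts (as in the toy dataset above); the function name and the assumption of lowercased, lemmatized tokens are illustrative:

```python
from collections import Counter

def cooccurrence_matrix(sentences, targets, contexts):
    """Sentence-level co-occurrence counts: counts[t][c] is the number of
    times context word c appears in sentences that contain target t.
    `sentences` is a list of lists of (lemmatized, lowercased) tokens."""
    counts = {t: Counter() for t in targets}
    for tokens in sentences:
        for t in targets:
            if t in tokens:
                for c in contexts:
                    counts[t][c] += tokens.count(c)
    return [[counts[t][c] for c in contexts] for t in targets]

targets  = ["automobile", "car", "soccer", "football"]
contexts = ["wheel", "transport", "passenger", "tournament",
            "london", "goal", "match"]
```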
Constructing Word spaces: distributional vectors
distributional matrix = targets × contexts
wheel transport passenger tournament London goal match
automobile 1 1 1 0 0 0 0
car 1 2 1 0 1 0 0
soccer 0 0 0 1 1 1 1
football 0 0 1 1 1 2 1
[Plot: the four target words in the two-dimensional space spanned by the contexts 'transport' (x-axis) and 'goal' (y-axis): automobile at (1, 0), car at (2, 0), soccer at (0, 1), football at (0, 2).]
Computing similarity
wheel transport passenger tournament London goal match
automobile 1 1 1 0 0 0 0
car 1 2 1 0 1 0 0
soccer 0 0 0 1 1 1 1
football 0 0 1 1 1 2 1
Using simple vector product
automobile · car = 4    automobile · soccer = 0    automobile · football = 1
car · soccer = 1    car · football = 2    soccer · football = 5
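A quick check of these numbers with NumPy, using the rows of the matrix above:

```python
import numpy as np

M = np.array([[1, 1, 1, 0, 0, 0, 0],   # automobile
              [1, 2, 1, 0, 1, 0, 0],   # car
              [0, 0, 0, 1, 1, 1, 1],   # soccer
              [0, 0, 1, 1, 1, 2, 1]])  # football

automobile, car, soccer, football = M
print(automobile @ car)       # 4
print(automobile @ soccer)    # 0
print(automobile @ football)  # 1
print(car @ soccer)           # 1
print(car @ football)         # 2
print(soccer @ football)      # 5
```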
Distributional Models of Semantics
Pawan Goyal (IIT Kharagpur) Week 7, Lecture 1 / 19
Vector Space Model without distributional similarity
Words are treated as atomic symbols
One-hot representation
Distributional Similarity Based Representations
You shall know a word by the company it keeps
These words will represent banking
Building a DSM step-by-step
The “linguistic” steps
Pre-process a corpus (to define targets and contexts)
⇓
Select the targets and the contexts
The “mathematical” steps
Count the target-context co-occurrences
⇓
Weight the contexts (optional)
⇓
Build the distributional matrix
⇓
Reduce the matrix dimensions (optional)
⇓
Compute the vector distances on the (reduced) matrix
Many design choices
General Questions
How do the rows (words, ...) relate to each other?
How do the columns (contexts, documents, ...) relate to each
other?
The parameter space
A number of parameters to be fixed
Which type of context?
Which weighting scheme?
Which similarity measure?
...
A specific parameter setting determines a particular type of DSM (e.g., LSA, HAL, etc.)
Documents as context: Word × document
Words as context: Word × Word
Words as contexts
Parameters
Window size
Window shape - rectangular/triangular/other
Consider the following passage
Suspected communist rebels on 4 July 1989 killed Col. Herminio
Taylo, police chief of Makati, the Philippines major financial center, in an
escalation of street violence sweeping the Capitol area. The gunmen
shouted references to the rebel New People’s Army. They fled in a
commandeered passenger jeep. The military says communist rebels
have killed up to 65 soldiers and police in the Capitol region since
January.
5-word window (unfiltered): 2 words on either side of the target word
[The passage above, repeated with each unfiltered 5-word context window highlighted.]
5-word window (filtered): 2 words on either side of the target word, with stop words filtered out
[The passage above, repeated with each filtered 5-word context window highlighted.]
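A small sketch of how such windows can be collected; the stop-word list used for the filtered variant is illustrative:

```python
STOP_WORDS = {"the", "of", "to", "in", "a", "on", "and", "they", "an"}  # illustrative

def context_windows(tokens, target, window=2, filter_stopwords=False):
    """Return the +/- `window` context words around each occurrence of
    `target` (a 5-word window in the slides' terms when window=2)."""
    if filter_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    windows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            windows.append(left + right)
    return windows

tokens = "Suspected communist rebels on 4 July 1989 killed Col. Herminio Taylo".split()
print(context_windows(tokens, "rebels"))  # [['Suspected', 'communist', 'on', '4']]
```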
Context weighting: documents as context
Indexing function F: Essential factors
Word frequency (fij): how many times does a word appear in the document? F ∝ fij
Document length (|Di|): how many words appear in the document? F ∝ 1/|Di|
Document frequency (Nj): number of documents in which a word appears. F ∝ 1/Nj
Indexing weight: tf-idf
fij · log(N/Nj); for each term, the weight in a document is normalized with respect to the L2-norm.
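A minimal sketch of this tf-idf weighting with L2 normalization (variable names are illustrative; `docs` is a list of tokenized documents):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per
    document, weighted by f_ij * log(N / N_j) and L2-normalized."""
    N = len(docs)
    df = Counter()                       # N_j: document frequency of term j
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)             # f_ij: term frequency in document i
        w = {t: f * math.log(N / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        vectors.append({t: v / norm for t, v in w.items()})
    return vectors
```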
Context weighting: words as context
basic intuition
word1   word2           freq(1,2)   freq(1)   freq(2)
dog     small           855         33,338    490,580
dog     domesticated    29          33,338    918
Context weighting: words as context
basic intuition
Association measures are used to give more weight to contexts that are more significantly associated with a target word.
The less frequent the target and the context element are, the higher the weight given to their co-occurrence count should be.
⇒ Co-occurrence with the frequent context element small is less informative than co-occurrence with the rarer domesticated.
Different measures exist, e.g., mutual information, log-likelihood ratio.
Pointwise Mutual Information (PMI)
PMI(w1, w2) = log2 [ Pcorpus(w1, w2) / Pind(w1, w2) ]
PMI(w1, w2) = log2 [ Pcorpus(w1, w2) / (Pcorpus(w1) · Pcorpus(w2)) ]
Pcorpus(w1, w2) = freq(w1, w2) / N
Pcorpus(w) = freq(w) / N
PMI: Issues and Variations
Positive PMI
All PMI values less than zero are replaced with zero.
Bias towards infrequent events
Consider wj having the maximum association with wi, i.e., Pcorpus(wi) ≈ Pcorpus(wj) ≈ Pcorpus(wi, wj). PMI then increases as the probability of wi decreases.
Also, consider a word wj that occurs only once in the corpus, and that single occurrence is in the context of wi.
A discounting factor proposed by Pantel and Lin:
δij = [ fij / (fij + 1) ] · [ min(fi, fj) / (min(fi, fj) + 1) ]
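A sketch that puts these formulas together: PMI from corpus counts, clipped to Positive PMI, with the Pantel-Lin discount applied as a multiplier. The co-occurrence counts echo the dog/small vs. dog/domesticated table above; the corpus size N is illustrative:

```python
import math

def pmi(freq_12, freq_1, freq_2, N):
    """PMI(w1, w2) = log2( P(w1, w2) / (P(w1) P(w2)) ) from raw counts."""
    p12 = freq_12 / N
    p1, p2 = freq_1 / N, freq_2 / N
    return math.log2(p12 / (p1 * p2))

def discounted_ppmi(freq_12, freq_1, freq_2, N):
    """Positive PMI multiplied by the Pantel-Lin discounting factor delta."""
    value = max(0.0, pmi(freq_12, freq_1, freq_2, N))
    delta = (freq_12 / (freq_12 + 1)) * (min(freq_1, freq_2) / (min(freq_1, freq_2) + 1))
    return value * delta

N = 10_000_000  # illustrative corpus size
print(pmi(855, 33_338, 490_580, N))   # association of (dog, small)
print(pmi(29, 33_338, 918, N))        # association of (dog, domesticated): higher
```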
Distributional Vectors: Example
Normalized Distributional Vectors using Pointwise Mutual Information
petroleum: oil:0.032 gas:0.029 crude:0.029 barrels:0.028 exploration:0.027 barrel:0.026 opec:0.026 refining:0.026 gasoline:0.026 fuel:0.025 natural:0.025 exporting:0.025
drug: trafficking:0.029 cocaine:0.028 narcotics:0.027 fda:0.026 police:0.026 abuse:0.026 marijuana:0.025 crime:0.025 colombian:0.025 arrested:0.025 addicts:0.024
insurance: insurers:0.028 premiums:0.028 lloyds:0.026 reinsurance:0.026 underwriting:0.025 pension:0.025 mortgage:0.025 credit:0.025 investors:0.024 claims:0.024 benefits:0.024
forest: timber:0.028 trees:0.027 land:0.027 forestry:0.026 environmental:0.026 species:0.026 wildlife:0.026 habitat:0.025 tree:0.025 mountain:0.025 river:0.025 lake:0.025
robotics: robots:0.032 automation:0.029 technology:0.028 engineering:0.026 systems:0.026 sensors:0.025 welding:0.025 computer:0.025 manufacturing:0.025 automated:0.025
Distributional Semantics: Applications, Structured Models
Pawan Goyal (IIT Kharagpur) Week 7, Lecture 1 / 15
Application to Query Expansion: Addressing Term Mismatch
Term Mismatch Problem in Information Retrieval
Stems from the word independence assumption during document indexing.
User query: insurance cover which pays for long term care.
A relevant document may contain terms different from the actual user query.
Some relevant words concerning this query: {medicare, premiums, insurers}
Using DSMs for Query Expansion
Given a user query, reformulate it using related terms to enhance the
retrieval performance.
The distributional vectors for the query terms are computed.
The expanded query is obtained by a linear combination or a function of these vectors.
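A sketch of this expansion step, assuming the distributional vectors are rows of a NumPy matrix indexed by a vocabulary; the function name and the simple summation are illustrative:

```python
import numpy as np

def expand_query(query_terms, vectors, vocab, k=10):
    """Combine the query terms' distributional vectors (here by summation)
    and return the k vocabulary terms closest to the combined vector."""
    idx = {w: i for i, w in enumerate(vocab)}
    combined = np.sum([vectors[idx[t]] for t in query_terms if t in idx], axis=0)
    sims = vectors @ combined / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(combined) + 1e-12)
    ranked = np.argsort(-sims)
    return [vocab[i] for i in ranked[:k]]
```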
Query Expansion using Unstructured DSMs
TREC Topic 104: catastrophic health insurance
Query Representation: surtax:1.0 hcfa:0.97 medicare:0.93
hmos:0.83 medicaid:0.8 hmo:0.78 beneficiaries:0.75
ambulatory:0.72 premiums:0.72 hospitalization:0.71 hhs:0.7
reimbursable:0.7 deductible:0.69
Broad expansion terms: medicare, beneficiaries, premiums, ...
Specific domain terms: HCFA (Health Care Financing Administration), HMO (Health Maintenance Organization), HHS (Health and Human Services)
TREC Topic 355: ocean remote sensing
Query Representation: radiometer:1.0 landsat:0.97
ionosphere:0.94 cnes:0.84 altimeter:0.83 nasda:0.81
meterology:0.81 cartography:0.78 geostationary:0.78
doppler:0.78 oceanographic:0.76
Similarity Measures for Binary Vectors
Let X and Y denote the binary distributional vectors for words X
and Y.
Similarity Measures
Dice coefficient: 2|X ∩ Y| / (|X| + |Y|)
Jaccard coefficient: |X ∩ Y| / |X ∪ Y|
Overlap coefficient: |X ∩ Y| / min(|X|, |Y|)
Jaccard coefficient penalizes small number of shared entries, while
Overlap coefficient uses the concept of inclusion.
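These three coefficients in Python, treating each binary vector as the set of its non-zero dimensions (a sketch; the example sets are read off the automobile and car rows above):

```python
def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    return len(x & y) / len(x | y)

def overlap(x, y):
    return len(x & y) / min(len(x), len(y))

automobile = {"wheel", "transport", "passenger"}
car = {"wheel", "transport", "passenger", "london"}
print(dice(automobile, car), jaccard(automobile, car), overlap(automobile, car))
# 0.857..., 0.75, 1.0  (overlap = 1.0 because automobile is included in car)
```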
Similarity Measures for Vector Spaces
Let →X and →Y denote the distributional vectors for words X and Y.
→X = [x1, x2, . . . , xn], →Y = [y1, y2, . . . , yn]
Similarity Measures
Cosine similarity: cos(→X, →Y) = (→X · →Y) / (|→X| |→Y|)
Euclidean distance: |→X − →Y| = √( Σi=1..n (xi − yi)² )
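The same two measures in NumPy (a sketch, reusing two rows of the toy matrix):

```python
import numpy as np

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def euclidean(x, y):
    return float(np.linalg.norm(x - y))

car      = np.array([1, 2, 1, 0, 1, 0, 0])
football = np.array([0, 0, 1, 1, 1, 2, 1])
print(cosine(car, football), euclidean(car, football))
```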
Similarity Measure for Probability Distributions
Let p and q denote the probability distributions corresponding
to two distributional vectors.
Similarity Measures
KL-divergence: D(p‖q) = Σi pi log(pi / qi)
Information Radius: D(p ‖ (p+q)/2) + D(q ‖ (p+q)/2)
L1-norm: Σi |pi − qi|
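A sketch of these three measures for discrete distributions; a small epsilon is added to avoid log(0) and division by zero, which the formulas above leave implicit:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    return float(np.sum(p * np.log(p / q)))

def information_radius(p, q):
    m = (np.asarray(p, float) + np.asarray(q, float)) / 2
    return kl(p, m) + kl(q, m)

def l1_norm(p, q):
    return float(np.sum(np.abs(np.asarray(p, float) - np.asarray(q, float))))

p = [0.5, 0.3, 0.2, 0.0]
q = [0.4, 0.4, 0.1, 0.1]
print(kl(p, q), information_radius(p, q), l1_norm(p, q))
```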
Attributional Similarity vs. Relational Similarity
Attributional Similarity
The attributional similarity between two words a and b depends on the
degree of correspondence between the properties of a and b.
Ex: dog and wolf
Relational Similarity
Two pairs (a, b) and (c, d) are relationally similar if they have many similar relations.
Ex: dog: bark and cat: meow
Relational Similarity: Pair-pattern matrix
Pair-pattern matrix
Row vectors correspond to pairs of words, such as mason:
stone and
carpenter: wood
Column vectors correspond to the patterns in which the pairs
occur, e.g.
X cuts Y and X works with Y
Compute the similarity of rows to find similar pairs
Extended Distributional Hypothesis (Lin and Pantel)
Patterns that co-occur with similar pairs tend to have similar meanings.
This matrix can also be used to measure the semantic similarity of
patterns.
Given a pattern such as “X solves Y”, you can use this matrix to find
similar patterns, such as “Y is solved by X”, “Y is resolved in X”, “X
resolves Y”.
Structured DSMs
Basic Issue
Words may not be the basic context units anymore
How to capture and represent syntactic information?
X solves Y and Y is solved by X
An Ideal Formalism
Should mirror semantic relationships as closely as possible
Incorporate word-based information and syntactic analysis
Should be applicable to different languages
Use the dependency grammar framework
Structured DSMs
Using Dependency Structure: How does it help?
The teacher eats a red apple.
‘eat’ is not a legitimate context for ‘red’.
The ‘object’ relation connecting ‘eat’ and ‘apple’ is treated as a
different type of co-occurrence from the ‘modifier’ relation linking
‘red’ and ‘apple’.
Structured DSMs
Structured DSMs: Words as ‘legitimate’ contexts
Co-occurrence statistics are collected using parser-extracted
relations.
To qualify as context of a target item, a word must be linked to it
by some (interesting) lexico-syntactic relation
Structured DSMs
Distributional models, as guided by dependency
Ex: For the sentence ‘This virus affects the body’s defense system.’, the dependency parse is:
[Figure: dependency parse of the sentence]
Word vectors
<system, dobj, affects> ...
Corpus-derived ternary data can also be mapped onto a 2-way matrix
2-way matrix
<system, dobj, affects>
<virus, nsubj, affects>
The dependency information can be dropped:
<system, dobj, affects> ⇒ <system, affects>
<virus, nsubj, affects> ⇒ <virus, affects>
Link and one word can be concatenated and treated as attributes:
virus = {nsubj-affects: 0.05, ...}, system = {dobj-affects: 0.03, ...}
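A sketch of mapping such parser-extracted triples onto 2-way representations, showing both options (drop the relation, or concatenate link and word); weights here are raw counts rather than the association scores on the slide:

```python
from collections import Counter, defaultdict

triples = [("system", "dobj", "affects"),
           ("virus", "nsubj", "affects")]   # (target, relation, governor)

drop_relation = defaultdict(Counter)    # <target, word> counts
link_plus_word = defaultdict(Counter)   # <target, relation-word> counts
for target, rel, word in triples:
    drop_relation[target][word] += 1
    link_plus_word[target][f"{rel}-{word}"] += 1

print(dict(link_plus_word["virus"]))    # {'nsubj-affects': 1}
print(dict(drop_relation["system"]))    # {'affects': 1}
```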
Structured DSMs for Selectional Preferences
Selectional Preferences for Verbs
Most verbs prefer arguments of a particular type. This regularity is
known as selectional preference.
From a parsed corpus, noun vectors are calculated as shown for
‘virus’ and ‘system’.
[Table: noun vectors over dependency-based dimensions such as obj-carry, obj-buy, obj-drive, obj-eat, obj-store, sub-fly, ..., with rows for nouns such as vegetable and biscuit and their association weights.]
Structured DSMs for Selectional Preferences
Selectional Preferences
Suppose we want to compute the selectional preferences of nouns as the object of the verb ‘eat’.
The n nouns having the highest weight in the dimension ‘obj-eat’ are selected; let {vegetable, biscuit, . . .} be the set of these n nouns.
The complete vectors of these n nouns are used to obtain an ‘object prototype’ of the verb.
The ‘object prototype’ will indicate various attributes, e.g., that these nouns can be consumed, bought, carried, stored, etc.
The similarity of a noun to this ‘object prototype’ is used to denote the plausibility of that noun being an object of the verb ‘eat’.
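A sketch of this prototype construction with NumPy. The dimension names and the use of cosine similarity for the final plausibility score follow the description above; the noun matrix, the noun ‘truck’, and all numbers are illustrative:

```python
import numpy as np

dims  = ["obj-carry", "obj-buy", "obj-drive", "obj-eat", "obj-store", "sub-fly"]
nouns = ["vegetable", "biscuit", "truck"]
M = np.array([[0.3, 0.4, 0.0, 0.8, 0.2, 0.0],    # illustrative weights
              [0.4, 0.4, 0.0, 0.5, 0.4, 0.0],
              [0.2, 0.3, 0.9, 0.0, 0.1, 0.0]])

def object_prototype(verb_dim, n=2):
    """Average the vectors of the n nouns with the highest weight on `verb_dim`."""
    col = dims.index(verb_dim)
    top = np.argsort(-M[:, col])[:n]
    return M[top].mean(axis=0)

def plausibility(noun, prototype):
    v = M[nouns.index(noun)]
    return float(v @ prototype / (np.linalg.norm(v) * np.linalg.norm(prototype)))

proto = object_prototype("obj-eat")               # built from vegetable and biscuit
print(plausibility("biscuit", proto), plausibility("truck", proto))  # high vs. low
```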
Word Embeddings - Part I
Pawan Goyal (IIT Kharagpur) Week 7, Lecture 1 / 19
Word Vectors
At one level, it is simply a vector of weights.
In a simple 1-of-N (or ‘one-hot’) encoding every element in the
vector is associated with a word in the vocabulary.
The encoding of a given word is simply the vector in which
the corresponding element is set to one, and all other
elements are zero.
One-hot representation
Word Vectors - One-hot Encoding
Suppose our vocabulary has only five words: King, Queen, Man,
Woman, and Child.
We could encode the word ‘Queen’ as: [0, 1, 0, 0, 0]
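A sketch of this encoding, with the vocabulary in the order listed above:

```python
import numpy as np

vocab = ["King", "Queen", "Man", "Woman", "Child"]

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

print(one_hot("Queen"))   # [0. 1. 0. 0. 0.]
```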
Limitations of One-hot encoding
Word vectors are not comparable
Using such an encoding, there is no meaningful comparison we
can make between word vectors other than equality testing.
Word2Vec – A distributed representation
Distributional representation – word embedding?
Any word wi in the corpus is given a distributional representation by an embedding wi ∈ R^d, i.e., a d-dimensional vector, which is mostly learnt!
Distributional Representation
Take a vector with several hundred dimensions (say 1000).
Each word is represented by a distribution of weights
across those elements.
So instead of a one-to-one mapping between an element in the vector and a word, the representation of a word is spread across all of the elements in the vector, and
each element in the vector contributes to the definition of many words.
Distributional Representation: Illustration
If we label the dimensions in a hypothetical word vector (there are
no such pre-assigned labels in the algorithm of course), it might
look a bit like this:
Such a vector comes to represent in some abstract way the ‘meaning’ of
a word
Word Embeddings
d typically in the range 50 to 1000
Similar words should have similar embeddings
SVD can also be thought of as an embedding method
Reasoning with Word Vectors
It has been found that the learned word representations in fact
capture meaningful syntactic and semantic regularities in a
very simple way.
Specifically, the regularities are observed as constant vector
offsets between pairs of words sharing a particular
relationship.
Case of Singular-Plural Relations
If we denote the vector for word i as xi, and focus on the
singular/plural relation, we observe that
xapple − xapples ≈ xcar − xcars ≈ xfamily − xfamilies, and so on.
Reasoning with Word Vectors
Perhaps more surprisingly, we find that this is also the case for a
variety of semantic relations.
Good at answering analogy questions
a is to b, as c is to ?
man is to woman as uncle is to ? (aunt)
A simple vector offset method based on cosine distance shows
the relation.
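A sketch of the offset method: add the offset b − a to c and return the nearest vocabulary word by cosine similarity. The embedding matrix and vocabulary are assumed to exist; names are illustrative:

```python
import numpy as np

def analogy(a, b, c, vectors, vocab):
    """Answer 'a is to b as c is to ?' via the vector offset b - a + c."""
    idx = {w: i for i, w in enumerate(vocab)}
    target = vectors[idx[b]] - vectors[idx[a]] + vectors[idx[c]]
    sims = vectors @ target / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(target) + 1e-12)
    for i in np.argsort(-sims):            # best match, excluding the input words
        if vocab[i] not in (a, b, c):
            return vocab[i]

# e.g. analogy("man", "woman", "uncle", vectors, vocab)  ->  ideally "aunt"
```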
Vector Offset for Gender Relation
Vector Offset for Singular-Plural Relation
Encoding Other Dimensions of Similarity
Analogy Testing
Analogy Testing
Country-capital city relationships
More Analogy Questions
Element-wise Addition
We can also use element-wise addition of vector elements to ask questions such as ‘German + airlines’ and, by looking at the closest tokens to the composite vector, come up with impressive answers:
Two Variations: CBOW and Skip-grams
Word Embeddings - Part II
Pawan Goyal (IIT Kharagpur) Week 7, Lecture 1 / 14
CBOW
Consider a piece of prose such as:
“The recently introduced continuous Skip-gram model is an
efficient method for learning high-quality distributed vector
representations that capture a large number of syntactic and
semantic word relationships.”
Imagine a sliding window over the text, that includes the
central word currently in focus, together with the four words
that precede it, and the four words that follow it:
CBOW
The context words form the input layer. Each word is encoded in one-hot form. There is a single hidden layer and a single output layer.
CBOW: Training Objective
The training objective is to maximize the conditional probability of
observing the actual output word (the focus word) given the input
context words, with regard to the weights.
In our example, given the input (“an”, “efficient”, “method”, “for”,
“high”, “quality”, “distributed”, “vector”), we want to maximize the
probability of getting “learning” as the output.
CBOW: Input to Hidden Layer
Since our input vectors are one-hot, multiplying an input vector by
the weight matrix W1 amounts to simply selecting a row from W1.
Given C input word vectors, the activation function for the hidden
layer h amounts to simply summing the corresponding ‘hot’ rows in
W1, and dividing by C to take their average.
CBOW: Hidden to Output Layer
From the hidden layer to the output layer, the second weight matrix W2
can be used to compute a score for each word in the vocabulary, and
softmax can be used to obtain the posterior distribution of words.
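A sketch of this forward pass, covering both the averaging step (input to hidden) and the softmax over vocabulary scores (hidden to output); W1, W2 and the dimensions are illustrative:

```python
import numpy as np

V, d = 8, 4                      # vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, d))     # input -> hidden weights (one row per word)
W2 = rng.normal(size=(d, V))     # hidden -> output weights

def cbow_forward(context_ids):
    """Average the 'hot' rows of W1, score every word with W2, then softmax."""
    h = W1[context_ids].mean(axis=0)         # hidden layer activation
    scores = h @ W2                          # one score per vocabulary word
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                   # posterior distribution over words

probs = cbow_forward([0, 2, 5, 7])           # four one-hot context words
print(probs.argmax(), probs.sum())           # predicted word id, 1.0
```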
Skip-gram Model
The skip-gram model is the opposite of the C B O W model. It is
constructed with the focus word as the single input vector, and the
target context words are now at the output layer:
Skip-gram Model: Training
The activation function for the hidden layer simply amounts to
copying the corresponding row from the weights matrix W1 (linear)
as we saw before. At the output layer, we now output C
multinomial distributions instead of
just one.
The training objective is to minimize the summed prediction error
across all context words in the output layer. In our example, the
input would be “learning”, and we hope to see (“an”, “efficient”,
“method”, “for”, “high”, “quality”, “distributed”, “vector”) at the
output layer.
Skip-gram Model
Details
Predict surrounding words in a window of length c of each word
Objective function: maximize the log probability of any context word given the current center word:
J(θ) = (1/T) Σt=1..T Σ−c ≤ j ≤ c, j ≠ 0 log p(wt+j | wt)
Word Vectors
For p(wt+j | wt), the simplest first formulation is
p(wO | wI) = exp(v′wOᵀ vwI) / Σw=1..W exp(v′wᵀ vwI)
where v and v′ are the “input” and “output” vector representations of w (so every word has two vectors)
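The same formulation as a small function (a sketch); `v_in` and `v_out` stand for the matrices of "input" and "output" word vectors:

```python
import numpy as np

def skipgram_prob(out_id, in_id, v_in, v_out):
    """p(w_O | w_I) = exp(v'_{w_O}^T v_{w_I}) / sum_w exp(v'_w^T v_{w_I})."""
    scores = v_out @ v_in[in_id]          # one score per vocabulary word
    exp = np.exp(scores - scores.max())   # numerically stabilized softmax
    return exp[out_id] / exp.sum()

# v_in, v_out: (W, d) matrices; every word has one row in each.
```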
Parameters θ
With d-dimensional vectors and V many words, θ contains the “input” and “output” vector of every word, i.e., 2dV parameters in total.
Gradient Descent for Parameter Updates
θj(new) = θj(old) − α ∂J(θ)/∂θj(old)
Two sets of vectors
Best solution is to sum these up
Lfinal = L + L′
A good tutorial to understand parameter learning:
https://arxiv.org/pdf/1411.2738.pdf
An interactive Demo
https://ronxin.github.io/wevi/
GloVe
Combine the best of both worlds – count-based methods as well as direct prediction methods
Fast training
Scalable to huge corpora
Good performance even with small corpus, and small vectors
Code and vectors: http://nlp.stanford.edu/projects/glove/