Week 3

This document covers advanced smoothing models for language modelling (Good-Turing, absolute discounting, and Kneser-Ney), backoff and interpolation, computational morphology (morphemes, affixes, and the distinction between inflectional and derivational morphology), finite-state methods for morphological analysis, and an introduction to part-of-speech tagging with hidden Markov models.


Language Modelling: Advanced Smoothing Models

Pawan Goyal
CSE, IITKGP
Week 3: Lecture 1

Advanced smoothing algorithms

Some Examples
Good-Turing
Kneser-Ney

Good-Turing: Basic Intuition
Use the count of things we have seen once to help estimate the count of things we have never seen.

Nc: Frequency of frequency c

Example Sentences
<s> I am here </s>
<s> who am I </s>
<s> I would like </s>

Computing Nc

word    count
I       3
am      2
here    1        N1 = 4
who     1        N2 = 1
would   1        N3 = 1
like    1

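The frequency-of-frequency table above can be reproduced in a few lines of Python (a minimal sketch; the variable names are ours, not from the lecture):

    from collections import Counter

    # The three example sentences, without the <s>/</s> markers.
    sentences = [["I", "am", "here"], ["who", "am", "I"], ["I", "would", "like"]]

    word_counts = Counter(w for sent in sentences for w in sent)
    # Counter({'I': 3, 'am': 2, 'here': 1, 'who': 1, 'would': 1, 'like': 1})

    # N_c: number of word types that occur exactly c times.
    freq_of_freq = Counter(word_counts.values())
    print(sorted(freq_of_freq.items()))  # [(1, 4), (2, 1), (3, 1)] -> N1 = 4, N2 = 1, N3 = 1
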
Good-Turing Estimation

Idea
Reallocate the probability mass of n-grams that occur r + 1 times in the training data to the n-grams that occur r times.
In particular, reallocate the probability mass of n-grams that were seen once to the n-grams that were never seen.

Adjusted count
For each count c, an adjusted count c* is computed as:

c* = (c + 1) N_{c+1} / N_c

where N_c is the number of n-grams seen exactly c times.

Good-Turing Estimation

Good-Turing Smoothing

P*_GT(things with frequency c) = c* / N

where c* = (c + 1) N_{c+1} / N_c

What if c = 0?

P*_GT(things with frequency 0) = N_1 / N

where N denotes the total number of bigrams that actually occur in training.

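As a rough illustration of the adjusted-count formula, here is a minimal Python sketch (the function name and the handling of a missing N_{c+1} are our own choices; practical implementations first smooth the N_c values, as in Simple Good-Turing):

    def good_turing_adjusted_count(c, freq_of_freq):
        # c* = (c + 1) * N_{c+1} / N_c; not defined when N_c or N_{c+1} is zero.
        n_c = freq_of_freq.get(c, 0)
        n_c_plus_1 = freq_of_freq.get(c + 1, 0)
        if n_c == 0 or n_c_plus_1 == 0:
            return None
        return (c + 1) * n_c_plus_1 / n_c

    # Toy numbers from the earlier example: N1 = 4, N2 = 1, N3 = 1, with N = 9 tokens.
    freq_of_freq = {1: 4, 2: 1, 3: 1}
    N = 9
    print(good_turing_adjusted_count(1, freq_of_freq))  # (1 + 1) * 1 / 4 = 0.5
    print(freq_of_freq[1] / N)                          # N1 / N: mass reserved for unseen events
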
Complications

What about words with high frequency?
For small c, Nc > Nc+1
For large c, too jumpy

Simple Good-Turing
Replace empirical Nk with a best-fit power law once counts get unreliable.

Good-Turing numbers: Example

22 million words of AP Newswire

c* = (c + 1) N_{c+1} / N_c

Empirically, it looks like c* ≈ c − 0.75

Absolute Discounting Interpolation

Why don't we just subtract 0.75 (or some d)?

P_AbsoluteDiscounting(w_i | w_{i-1}) = (c(w_{i-1}, w_i) − d) / c(w_{i-1}) + λ(w_{i-1}) P(w_i)

We may keep some more values of d for counts 1 and 2.
But can we do better than using the regular unigram probability?

Kneser-Ney Smoothing

Intuition
Shannon game: I can't see without my reading ...: glasses/Francisco?
"Francisco" is more common than "glasses"
But "Francisco" mostly follows "San"

P(w): "How likely is w?"
Instead, P_continuation(w): "How likely is w to appear as a novel continuation?"
For each word, count the number of bigram types it completes
Every bigram type was a novel continuation the first time it was seen

P_continuation(w) ∝ |{w_{i-1} : c(w_{i-1}, w) > 0}|

Kneser-Ney Smoothing

How many times does w appear as a novel continuation?

P_continuation(w) ∝ |{w_{i-1} : c(w_{i-1}, w) > 0}|

Normalized by the total number of word bigram types:

P_continuation(w) = |{w_{i-1} : c(w_{i-1}, w) > 0}| / |{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0}|

A frequent word (Francisco) occurring in only one context (San) will have a low continuation probability.

Kneser-Ney Smoothing

P_KN(w_i | w_{i-1}) = max(c(w_{i-1}, w_i) − d, 0) / c(w_{i-1}) + λ(w_{i-1}) P_continuation(w_i)

λ is a normalizing constant:

λ(w_{i-1}) = (d / c(w_{i-1})) · |{w : c(w_{i-1}, w) > 0}|

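To make the pieces concrete, here is a minimal bigram Kneser-Ney sketch in Python (the names and the treatment of unseen contexts are our own simplifications, not code from the lecture):

    from collections import Counter, defaultdict

    def kneser_ney_bigram(bigram_counts, d=0.75):
        # bigram_counts: a Counter over (prev, w) pairs from the training data.
        unigram_counts = Counter()        # c(prev), summed over continuations
        followers = defaultdict(set)      # prev -> distinct words following it
        continuations = defaultdict(set)  # w -> distinct left contexts
        for (prev, w), c in bigram_counts.items():
            unigram_counts[prev] += c
            followers[prev].add(w)
            continuations[w].add(prev)
        num_bigram_types = len(bigram_counts)

        def p_kn(w, prev):
            # Assumes prev was seen in training.
            p_cont = len(continuations[w]) / num_bigram_types
            lam = d * len(followers[prev]) / unigram_counts[prev]
            discounted = max(bigram_counts[(prev, w)] - d, 0) / unigram_counts[prev]
            return discounted + lam * p_cont

        return p_kn
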
Model Combination

As N increases
The power (expressiveness) of an N-gram model increases
But the ability to estimate accurate parameters from sparse data decreases (i.e. the smoothing problem gets worse)

A general approach is to combine the results of multiple N-gram models.

Backoff and Interpolation

It might help to use less context when you haven't learned much about larger contexts.

Backoff
use trigram if you have good evidence
otherwise bigram, otherwise unigram

Interpolation
mix unigram, bigram, trigram

Backoff

Estimating P(w_i | w_{i-2} w_{i-1})

If we do not have counts to compute P(w_i | w_{i-2} w_{i-1}), estimate it using the bigram probability P(w_i | w_{i-1}).
If we do not have counts to compute P(w_i | w_{i-1}), estimate it using the unigram probability P(w_i).

P_bo(w_i | w_{i-2} w_{i-1}) =
    P̂(w_i | w_{i-2} w_{i-1}),               if c(w_{i-2} w_{i-1} w_i) > 0
    λ(w_{i-2} w_{i-1}) P_bo(w_i | w_{i-1}),  otherwise

where P_bo(w_i | w_{i-1}) =
    P̂(w_i | w_{i-1}),    if c(w_{i-1} w_i) > 0
    λ(w_{i-1}) P̂(w_i),   otherwise

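The recursive structure of backoff can be sketched as follows (a simplified illustration; p_hat and lam stand for the discounted estimate P̂ and the backoff weight λ defined above, and the example call uses hypothetical arguments):

    def backoff_prob(w, context, counts, p_hat, lam):
        # Use the longest context with a nonzero count; otherwise back off to a
        # shorter context, weighted by lambda(context).
        if not context:                       # base case: unigram estimate
            return p_hat(w, ())
        if counts.get(context + (w,), 0) > 0:
            return p_hat(w, context)          # discounted higher-order estimate
        return lam(context) * backoff_prob(w, context[1:], counts, p_hat, lam)

    # e.g. backoff_prob("a", ("a", "b"), counts, p_hat, lam)
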
Example Problem

In a corpus, suppose there are 4 words: a, b, c, and d. You are provided with the following counts.

n-gram   count     n-gram   count     n-gram   count
aba      4         ba       5         a        8
abb      0         bb       3         b        9
abc      0         bc       0         c        8
abd      0         bd       0         d        7

Use the recursive definition of backoff smoothing to obtain the probability distribution P_backoff(w_n | w_{n-2} w_{n-1}), where w_{n-1} = b and w_{n-2} = a.
Also assume that P̂(x) = P(x) − 1/8.

Linear Interpolation

Simple Interpolation

P̃(w_n | w_{n-1} w_{n-2}) = λ1 P(w_n | w_{n-1} w_{n-2}) + λ2 P(w_n | w_{n-1}) + λ3 P(w_n)

Σ_i λ_i = 1

Lambdas conditional on context

P̃(w_n | w_{n-1} w_{n-2}) = λ1(w_{n-2}, w_{n-1}) P(w_n | w_{n-1} w_{n-2}) + λ2(w_{n-2}, w_{n-1}) P(w_n | w_{n-1}) + λ3(w_{n-2}, w_{n-1}) P(w_n)

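A minimal sketch of simple interpolation in Python (the lambda values shown are placeholders, not tuned values):

    def interpolated_prob(w, prev2, prev1, p_uni, p_bi, p_tri, lambdas=(0.5, 0.3, 0.2)):
        # P~(w | prev2 prev1) = l1 P(w | prev2 prev1) + l2 P(w | prev1) + l3 P(w)
        l1, l2, l3 = lambdas  # must sum to 1
        return l1 * p_tri(w, prev2, prev1) + l2 * p_bi(w, prev1) + l3 * p_uni(w)
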
Setting the lambda values

Use a held-out corpus
Choose λs to maximize the probability of held-out data:
Find the N-gram probabilities on the training data
Search for λs that give the largest probability to held-out data

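One simple, if brute-force, way to pick the lambdas on held-out data is a grid search over triples that sum to 1 (a sketch; in practice EM or a smarter search is used, and all names here are ours):

    import math

    def heldout_log_prob(lambdas, heldout_trigrams, p_uni, p_bi, p_tri):
        l1, l2, l3 = lambdas
        total = 0.0
        for prev2, prev1, w in heldout_trigrams:
            p = l1 * p_tri(w, prev2, prev1) + l2 * p_bi(w, prev1) + l3 * p_uni(w)
            total += math.log(p) if p > 0 else float("-inf")
        return total

    def grid_search_lambdas(heldout_trigrams, p_uni, p_bi, p_tri, steps=10):
        candidates = [(i / steps, j / steps, (steps - i - j) / steps)
                      for i in range(steps + 1) for j in range(steps + 1 - i)]
        return max(candidates,
                   key=lambda ls: heldout_log_prob(ls, heldout_trigrams, p_uni, p_bi, p_tri))
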
Computational Morphology

Pawan Goyal
CSE, IITKGP
Week 3: Lecture 2


Morphology

Morphology studies the internal structure of words: how words are built up from smaller meaningful units called morphemes.

dogs
2 morphemes, 'dog' and 's'
's' is a plural marker on nouns

unladylike
3 morphemes
un- 'not'
lady 'well-behaved woman'
-like 'having the characteristic of'


Allomorphs

Variants of the same morpheme, but cannot be replaced by one another.

Example
opposite: un-happy, in-comprehensible, im-possible, ir-rational


Bound and Free Morphemes

Bound
Cannot appear as a word by itself.
-s (dog-s), -ly (quick-ly), -ed (walk-ed)

Free
Can appear as a word by itself; often can combine with other morphemes too.
house (house-s), walk (walk-ed), of, the, or




Stems and Affixes

Stems (roots): The core meaning-bearing units
Affixes: Bits and pieces adhering to stems to change their meanings and grammatical functions
Mostly, stems are free morphemes and affixes are bound morphemes


Types of affixes

Prefix: un-, anti-, etc. (a-, ati-, pra-, etc.)
un-happy, pre-existing

Suffix: -ity, -ation, etc. (-taa, -ke, -ka, etc.)
talk-ing, quick-ly

Infix: 'n' in 'vindati' (he knows), as contrasted with vid (to know).
Philippines: basa 'read' → b-um-asa 'read'
English: abso-bloody-lutely (emphasis)

Circumfixes: precede and follow the stem
Dutch: berg 'mountain', ge-berg-te 'mountains'


Content and functional morphemes

Content morphemes
Carry some semantic content
car, -able, un-

Functional morphemes
Provide grammatical information
-s (plural), -s (3rd singular)




Inflectional and Derivational Morphology

Two different kinds of relationships among words

Inflectional morphology
Grammatical: number, tense, case, gender
Creates new forms of the same word: bring, brought, brings, bringing

Derivational morphology
Creates new words by changing part-of-speech: logic, logical, illogical, illogicality, logician
Fairly systematic but some derivations missing: sincere - sincerity, scarce - scarcity, curious - curiosity, fierce - fiercity?


Morphological processes

Concatenation
Adding continuous affixes - the most common process:
hope+less, un+happy, anti+capital+ist+s

Often, there are phonological/graphemic changes on morpheme boundaries:
book + s [s], shoe + s [z]
happy + er → happier


Morphological processes

Reduplication: part of the word or the entire word is doubled
Nama: 'go' (look), 'go-go' (examine with attention)
Tagalog: 'basa' (read), 'ba-basa' (will read)
Sanskrit: 'pac' (cook), 'papāca' (perfect form, cooked)
Phrasal reduplication (Telugu): pillavād.u nad.ustū nad.ustū pad.i pōyād.u (The child fell down while walking)


Morphological processes

Suppletion
'irregular' relation between the words
go - went, good - better

Morpheme internal changes
The word changes internally
sing - sang - sung, man - men, goose - geese


Word Formation

Compounding
Words formed by combining two or more words
Examples in English:
Adj + Adj → Adj: bitter-sweet
N + N → N: rain-bow
V + N → V: pick-pocket
P + V → V: over-do

Particular to languages
room-temperature: Hindi translation?






Word Formation

Acronyms
laser: Light Amplification by Stimulated Emission of Radiation

Blending
Parts of two different words are combined
breakfast + lunch → brunch
smoke + fog → smog
motor + hotel → motel

Clipping
Longer words are shortened
doctor, laboratory, advertisement, dormitory, examination, bicycle, refrigerator


Processing morphology

Lemmatization: word → lemma
saw → {see, saw}

Morphological analysis: word → setOf(lemma + tag)
saw → { <see, verb.past>, <saw, noun.sg> }

Tagging: word → tag, considers context
Peter saw her → { <see, verb.past> }

Morpheme segmentation: de-nation-al-iz-ation

Generation: see + verb.past → saw


What are the applications?

Text-to-speech synthesis:
lead: verb or noun?
read: present or past?

Search and information retrieval
Machine translation, grammar correction


Morphological Analysis

Goal
To take input forms like those in the first column and produce output forms like those in the second column (e.g., cats → cat +N +PL).

[Table of example input/output forms not reproduced in this transcript.]

Output contains stem and additional information: +N for noun, +SG for singular, +PL for plural, +V for verb, etc.


Issues involved

boy → boys
fly → flys → flies (y → i rule)

Toiling → toil
Duckling → duckl?

Getter → get + er
Doer → do + er
Beer → be + er?




Knowledge Required

Knowledge of stems or roots
Duck is a possible root, not duckl.
We need a dictionary (lexicon)

Morphotactics
Which class of morphemes follow other classes of morphemes inside the word?
Ex: plural morpheme follows the noun

Only some endings go on some words
Do+er: ok
Be+er: not so

Spelling change rules
Adjust the surface form using spelling change rules
Get + er → getter

Why can't this be put in a big lexicon?

English: just 317,477 forms from 90,196 lexical entries, a ratio of 3.5:1
Sanskrit: 11 million forms from a lexicon of 170,000 entries, a ratio of 64.7:1
New forms can be created, compounding etc.

One of the most common methods is finite-state machines.


Finite-state methods for morphology

Pawan Goyal
CSE, IITKGP
Week 3: Lecture 3

Finite State Automaton (FSA)

What is an FSA?
A kind of directed graph
Nodes are called states; edges are labeled with symbols (possibly the empty symbol ε)
Start state and accepting states
Recognizes regular languages, i.e., languages specified by regular expressions

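A deterministic FSA can be run with a few lines of Python; the toy automaton below only illustrates the idea of states, labeled edges, and accepting states (it is not the automaton from the lecture's figures):

    def accepts(transitions, start, accepting, symbols):
        # transitions maps (state, symbol) -> next state.
        state = start
        for sym in symbols:
            if (state, sym) not in transitions:
                return False
            state = transitions[(state, sym)]
        return state in accepting

    # Toy FSA: a regular noun optionally followed by the plural morpheme "s".
    transitions = {(0, "noun"): 1, (1, "s"): 2}
    print(accepts(transitions, 0, {1, 2}, ["noun", "s"]))  # True
    print(accepts(transitions, 0, {1, 2}, ["s"]))          # False
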
FSA for nominal inflection in English

[Figure: FSA diagram not reproduced in this transcript.]

FSA for English Adjectives

[Figure: FSA diagram not reproduced in this transcript.]

Words modeled
happy, happier, happiest, real, unreal, cool, coolly, clear, clearly, unclear, unclearly, ...

Morphotactics

The last two examples model some parts of the English morphotactics
But what about the information about regular and irregular roots?

Lexicon
Can we include the lexicon in the FSA?

FSA for nominal inflection in English

[Figure repeated; not reproduced in this transcript.]

After adding a mini-lexicon

[Figure not reproduced in this transcript.]

Some properties of FSAs: Elegance

The recognition problem can be solved in linear time (independent of the size of the automaton)
There is an algorithm to transform each automaton into a unique equivalent automaton with the least number of states
An FSA is deterministic iff it has no empty (ε) transition and, for each state and each symbol, there is at most one applicable transition
Every non-deterministic automaton can be transformed into a deterministic one

But ...

FSAs are language recognizers/generators.
We need transducers to build Morphological Analyzers.

Finite State Transducers
Translate strings from one language to strings in another language
Like an FSA, but each edge is associated with two strings

An example FST

[Figure: FST diagram not reproduced in this transcript.]

Two-level morphology

Given the input cats, we would like to output cat+N+PL, telling us that cat is a plural noun.

We do this via a version of two-level morphology: a correspondence between a lexical level (morphemes and features) and a surface level (actual spelling).

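The desired input/output behaviour can be imitated, very crudely, with a lexicon lookup plus a regular-plural rule; a real system composes lexicon and rule transducers, so treat this only as an illustration (the lexicon entries are our own toy examples):

    LEXICON = {"cat", "dog", "fox"}

    def analyze(surface):
        analyses = []
        if surface in LEXICON:
            analyses.append(surface + "+N+SG")
        if surface.endswith("s") and surface[:-1] in LEXICON:
            analyses.append(surface[:-1] + "+N+PL")
        return analyses

    print(analyze("cats"))  # ['cat+N+PL']
    print(analyze("cat"))   # ['cat+N+SG']
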
Intermediate tape for Spelling change rules

[Figure not reproduced in this transcript.]

English Nominal Inflection FST

[Figure not reproduced in this transcript.]

Spelling Handling

A spelling change rule would insert an e only in the appropriate environment.

[Figure not reproduced in this transcript.]

Rule Handling

Rule Notation
a → b / c _ d : "rewrite a as b when it occurs between c and d."

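The rewrite notation can be mimicked with a regular-expression substitution; the strings in this example are placeholders, not rules from the lecture:

    import re

    def rewrite(s, a, b, c, d):
        # Rewrite a as b only when it occurs between c and d.
        pattern = "(?<=" + re.escape(c) + ")" + re.escape(a) + "(?=" + re.escape(d) + ")"
        return re.sub(pattern, b, s)

    print(rewrite("cad cat", "a", "b", "c", "d"))  # "cbd cat": only the "a" before "d" is rewritten
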
Morphological Analysis: Approaches

Two different ways to address phonological/graphemic variations

Linguistic approach: A phonological component accompanying the simple concatenative process of attaching an ending
Engineering approach: Phonological changes and irregularities are factored into endings and a higher number of paradigms

Different Approaches: Example from Czech

[Figure/table not reproduced in this transcript.]

Tools Available

AT&T FSM Library and Lextools
http://www2.research.att.com/~fsmtools/fsm/
OpenFST (Google and NYU)
http://www.openfst.org/

Introduction to POS Tagging

Pawan Goyal
CSE, IITKGP
Week 3: Lecture 4


Part-of-Speech (POS) tagging

Task
Given a text of English, identify the parts of speech of each word


Parts of Speech: How many?

Open class words (content words)
nouns, verbs, adjectives, adverbs
mostly content-bearing: they refer to objects, actions, and features in the world
open class, since new words are added all the time

Closed class words
pronouns, determiners, prepositions, connectives, ...
there is a limited number of these
mostly functional: to tie the concepts of a sentence together


POS examples

[Table of POS examples not reproduced in this transcript.]


POS tagging: Choosing a tagset

To do POS tagging, a standard tag set needs to be chosen
Could pick very coarse tagsets: N, V, Adj, Adv
More commonly used is a finer-grained set, the "UPenn TreeBank tagset", with 45 tags

A Nice Tutorial on POS tags
https://sites.google.com/site/partofspeechhelp/


UPenn TreeBank POS tag set

[Table of tags not reproduced in this transcript.]


Using the UPenn tagset

Example Sentence
The grand jury commented on a number of other topics.

POS tagged sentence
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.


Why is POS tagging hard?

Words often have more than one POS: back
The back door: back/JJ
On my back: back/NN
Win the voters back: back/RB
Promised to back the bill: back/VB

POS tagging problem
To determine the POS tag for a particular instance of a word


Ambiguous word types in the Brown Corpus

Ambiguity in the Brown corpus
40% of word tokens are ambiguous
12% of word types are ambiguous

[Table: breakdown of ambiguous word types not reproduced in this transcript.]




How bad is the ambiguity problem?

One tag is usually more likely than the others.
In the Brown corpus, race is a noun 98% of the time, and a verb 2% of the time
A tagger for English that simply chooses the most likely tag for each word can achieve good performance
Any new approach should be compared against the unigram baseline (assigning each token its most likely tag)

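The unigram (most-frequent-tag) baseline is easy to implement; a minimal sketch, where the fallback for unseen words is our own choice:

    from collections import Counter, defaultdict

    def train_unigram_baseline(tagged_corpus):
        # tagged_corpus: a list of (word, tag) pairs.
        tag_counts = defaultdict(Counter)
        for word, tag in tagged_corpus:
            tag_counts[word][tag] += 1
        overall = Counter(tag for _, tag in tagged_corpus)
        default_tag = overall.most_common(1)[0][0]
        best = {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}
        return lambda w: best.get(w, default_tag)

    tagger = train_unigram_baseline([("the", "DT"), ("back", "JJ"), ("back", "NN"),
                                     ("back", "NN"), ("door", "NN")])
    print(tagger("back"))  # NN, its most frequent tag in this toy corpus
    print(tagger("fish"))  # NN, falling back to the overall most frequent tag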


Deciding the correct POS

Can be difficult even for people
Mrs./NNP Shaefer/NNP never/RB got/VBD around/_ to/TO joining/VBG.
All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/_ the/DT corner/NN.
Chateau/NNP Petrus/NNP costs/VBZ around/_ 2500/CD.


Deciding the correct POS

Can be difficult even for people
Mrs./NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG.
All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN.
Chateau/NNP Petrus/NNP costs/VBZ around/RB 2500/CD.


Relevant knowledge for POS tagging

The word itself
Some words may only be nouns, e.g. arrow
Some words are ambiguous, e.g. like, flies
Probabilities may help, if one tag is more likely than another

Local context
Two determiners rarely follow each other
Two base form verbs rarely follow each other
A determiner is almost always followed by an adjective or noun


POS tagging: Two approaches

Rule-based Approach
Assign each word in the input a list of potential POS tags
Then winnow down this list to a single tag using hand-written rules

Statistical tagging
Get a training corpus of tagged text, learn transformation rules from the most frequent tags (TBL tagger)
Probabilistic: Find the most likely sequence of tags T for a sequence of words W


TBL Tagger

Label the training set with most frequent tags
The can was rusted.
The/DT can/MD was/VBD rusted/VBD.

Add transformation rules to reduce training mistakes
MD → NN: DT _
VBD → VBN: VBD _


Probabilistic Tagging: Two different families of models

Problem at hand
We have some data {(d, c)} of paired observations d and hidden classes c.

Different instances of d and c
Part-of-Speech Tagging: words are observed and tags are hidden.
Text Classification: sentences/documents are observed and the category is hidden.
Categories can be positive/negative for sentiments, sports/politics/business for documents, ...

What gives rise to the two families?
Whether they generate the observed data from the hidden classes, or model the hidden structure given the data.


Generative vs. Conditional Models

Generative (Joint) Models
Generate the observed data from hidden stuff, i.e. put a probability over the observations given the class: P(d, c) in terms of P(d|c)
e.g. Naïve Bayes classifiers, Hidden Markov Models, etc.

Discriminative (Conditional) Models
Take the data as given, and put a probability over hidden structure given the data: P(c|d)
e.g. logistic regression, maximum entropy models, conditional random fields
SVMs, perceptron, etc. are discriminative classifiers but not directly probabilistic




Generative vs. Discriminative Models

[Figure not reproduced in this transcript.]

Joint vs. conditional likelihood
A joint model gives probabilities P(d, c) and tries to maximize this joint likelihood.
A conditional model gives probabilities P(c|d), taking the data as given and modeling only the conditional probability of the class.

Hidden Markov Models for POS Tagging

Pawan Goyal
CSE, IITKGP
Week 3: Lecture 5

Probabilistic Tagging

W = w1 ... wn: words in the corpus (observed)
T = t1 ... tn: the corresponding tags (unknown)

Tagging: Probabilistic View (Generative Model)
Find

T̂ = argmax_T P(T|W)
  = argmax_T P(W|T) P(T) / P(W)
  = argmax_T P(W|T) P(T)
  = argmax_T Π_i P(w_i | w_1 ... w_{i-1}, t_1 ... t_i) P(t_i | t_1 ... t_{i-1})

Further simplifications

T̂ = argmax_T Π_i P(w_i | w_1 ... w_{i-1}, t_1 ... t_i) P(t_i | t_1 ... t_{i-1})

The probability of a word appearing depends only on its own POS tag:
P(w_i | w_1 ... w_{i-1}, t_1 ... t_i) ≈ P(w_i | t_i)

Bigram assumption: the probability of a tag appearing depends only on the previous tag:
P(t_i | t_1 ... t_{i-1}) ≈ P(t_i | t_{i-1})

Using these simplifications:
T̂ = argmax_T Π_i P(w_i | t_i) P(t_i | t_{i-1})

Computing the probability values

Tag Transition probabilities p(t_i | t_{i-1})

P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})

P(NN|DT) = C(DT, NN) / C(DT) = 56,509 / 116,454 = 0.49

Word Likelihood probabilities p(w_i | t_i)

P(w_i | t_i) = C(t_i, w_i) / C(t_i)

P(is|VBZ) = C(VBZ, is) / C(VBZ) = 10,073 / 21,627 = 0.47

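Estimating these tables from a tagged corpus is just counting; a minimal sketch (the sentence-boundary pseudo-tag "<s>" is our own convention):

    from collections import Counter

    def estimate_hmm_params(tagged_sentences):
        # tagged_sentences: a list of sentences, each a list of (word, tag) pairs.
        trans, emit = Counter(), Counter()
        prev_counts, tag_counts = Counter(), Counter()
        for sent in tagged_sentences:
            prev = "<s>"
            for word, tag in sent:
                trans[(prev, tag)] += 1
                prev_counts[prev] += 1
                emit[(tag, word)] += 1
                tag_counts[tag] += 1
                prev = tag
        p_trans = {k: v / prev_counts[k[0]] for k, v in trans.items()}  # P(t_i | t_{i-1})
        p_emit = {k: v / tag_counts[k[0]] for k, v in emit.items()}     # P(w_i | t_i)
        return p_trans, p_emit
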
Disambiguating "race"

[Figure not reproduced in this transcript.]

Disambiguating "race"

Difference in probability due to
P(VB|TO) vs. P(NN|TO)
P(race|VB) vs. P(race|NN)
P(NR|VB) vs. P(NR|NN)

After computing the probabilities
P(NN|TO) P(NR|NN) P(race|NN) = 0.00047 × 0.0012 × 0.00057 = 0.00000000032
P(VB|TO) P(NR|VB) P(race|VB) = 0.83 × 0.0027 × 0.00012 = 0.00000027

What is this model?

[Figure not reproduced in this transcript.]

This is a Hidden Markov Model.

Hidden Markov Models

Tag Transition probabilities p(t_i | t_{i-1})
Word Likelihood probabilities (emissions) p(w_i | t_i)
What we have described with these probabilities is a hidden Markov model.
Let us quickly introduce the Markov Chain, or observable Markov Model.

Markov Chain = First-order Markov Model

Weather example
Three types of weather: sunny, rainy, foggy
q_n: variable denoting the weather on the nth day
We want to find the following conditional probabilities:
P(q_n | q_{n-1}, q_{n-2}, ..., q_1)

First-order Markov Assumption
P(q_n | q_{n-1}, q_{n-2}, ..., q_1) = P(q_n | q_{n-1})

Markov Chain Transition Table

[Table of transition probabilities not reproduced in this transcript.]

Using Markov Chain

Given that today the weather is sunny, what is the probability that tomorrow is sunny and the day after is rainy?

P(q2 = sunny, q3 = rainy | q1 = sunny)
= P(q3 = rainy | q2 = sunny, q1 = sunny) × P(q2 = sunny | q1 = sunny)
= P(q3 = rainy | q2 = sunny) × P(q2 = sunny | q1 = sunny)
= 0.05 × 0.8
= 0.04

Hidden Markov Model

For Markov chains, the output symbols are the same as the states
'sunny' weather is both the observable and the state

But in POS tagging
The output symbols are words
But the hidden states are POS tags

A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states
We don't know which state we are in

Hidden Markov Models (HMMs)

Elements of an HMM model
A set of states (here: the tags)
An output alphabet (here: words)
Initial state (here: beginning of sentence)
State transition probabilities (here: p(t_n | t_{n-1}))
Symbol emission probabilities (here: p(w_i | t_i))

Graphical Representation

When tagging a sentence, we are walking through the state graph:

[Figure: state graph not reproduced in this transcript.]

Edges are labeled with the state transition probabilities: p(t_n | t_{n-1})

Graphical Representation

At each state we emit a word: P(w_n | t_n)

[Figure not reproduced in this transcript.]

Walking through the states: best path

[Figures: best-path walkthrough not reproduced in this transcript.]

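Finding the best path means computing argmax_T Π_i P(w_i|t_i) P(t_i|t_{i-1}). The brute-force sketch below enumerates all tag sequences for clarity; in practice the same argmax is computed efficiently with dynamic programming (the Viterbi algorithm), which these slides do not spell out:

    import itertools

    def best_tag_sequence(words, tags, p_trans, p_emit, start="<s>"):
        # p_trans: {(prev_tag, tag): prob}, p_emit: {(tag, word): prob}.
        best, best_score = None, 0.0
        for seq in itertools.product(tags, repeat=len(words)):
            prev, score = start, 1.0
            for w, t in zip(words, seq):
                score *= p_trans.get((prev, t), 0.0) * p_emit.get((t, w), 0.0)
                prev = t
            if score > best_score:
                best, best_score = seq, score
        return best, best_score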
