Language Modeling: Part II
Pawan Goyal
CSE, IITKGP
July 31, 2014
Lower perplexity = better model
WSJ Corpus
Training: 38 million words
Test: 1.5 million words
Unigram perplexity: 962
The model is as confused on test data as if it had to choose uniformly and
independently among 962 possibilities for each word.
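To make this interpretation concrete, here is a minimal Python sketch of perplexity for a plain unigram MLE model (the toy corpus, the function name, and the assumption that every test word was seen in training are all illustrative, not from the slides):

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of a unigram MLE model: PP(W) = P(w_1 ... w_N) ** (-1/N).
    Assumes every test word was seen in training (otherwise P = 0 and PP blows up)."""
    counts = Counter(train_tokens)
    total = len(train_tokens)
    log_prob = sum(math.log(counts[w] / total) for w in test_tokens)
    return math.exp(-log_prob / len(test_tokens))

# Toy usage: a uniform model over 4 words gives perplexity 4 --
# "as confused as choosing uniformly among 4 possibilities per word".
train = ["a", "b", "c", "d"] * 10
test = ["a", "b", "c", "d"]
print(unigram_perplexity(train, test))  # 4.0
```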
The Shannon Visualization Method
Use the language model to generate word sequences:
Choose a random bigram (<s>, w) as per its probability
Choose a random bigram (w, x) as per its probability
And so on, until we choose </s>
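A minimal sketch of this generation procedure, assuming the bigram model is stored as a dict mapping each history word to a {next_word: probability} dict (the data structure, function name, and toy probabilities are illustrative):

```python
import random

def generate_sentence(bigram_probs, max_len=30):
    """Sample a sentence by repeatedly choosing a random bigram (w, x)
    according to its conditional probability, starting from <s> and
    stopping when </s> is chosen."""
    sentence, w = [], "<s>"
    for _ in range(max_len):
        nexts = bigram_probs[w]
        x = random.choices(list(nexts), weights=nexts.values())[0]
        if x == "</s>":
            break
        sentence.append(x)
        w = x
    return " ".join(sentence)

# Toy model: P(x | w) for a tiny vocabulary.
bigram_probs = {
    "<s>":   {"I": 0.7, "who": 0.3},
    "I":     {"am": 0.5, "would": 0.5},
    "who":   {"am": 1.0},
    "am":    {"here": 0.6, "I": 0.4},
    "would": {"like": 1.0},
    "here":  {"</s>": 1.0},
    "like":  {"</s>": 1.0},
}
print(generate_sentence(bigram_probs))
```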
Shakespeare as Corpus
N = 884,647 tokens, V = 29,066
Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams.
Approximating Shakespeare
Problems with simple MLE estimate: zeros
Training set
... denied the allegations
... denied the reports
... denied the claims
... denied the request
Test data
... denied the offer
... denied the loan
Zero probability bigrams
P(offer | denied the) = 0
The test set will be assigned a probability of 0, and the perplexity can't be computed.
Language Modeling: Smoothing
With sparse statistics
Steal probability mass to generalize better
Laplace Smoothing (Add-one estimation)
Pretend as if we saw each word one more time than we actually did
Just add one to all the counts!
MLE estimate: $P_{MLE}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$
Add-1 estimate: $P_{Add\text{-}1}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$
Reconstituted counts as effect of smoothing
Effective bigram count $c^*(w_{n-1} w_n)$:
$$\frac{c^*(w_{n-1} w_n)}{c(w_{n-1})} = \frac{c(w_{n-1} w_n) + 1}{c(w_{n-1}) + V}$$
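To make the two formulas concrete, here is a minimal Python sketch (the toy counts, function names, and data structures are illustrative, not from the slides) computing the add-1 probability and the corresponding effective count:

```python
from collections import Counter

# Toy bigram and unigram counts (illustrative).
bigram_counts = Counter({("denied", "the"): 3, ("the", "allegations"): 1})
unigram_counts = Counter({"denied": 3, "the": 4, "allegations": 1})
V = len(unigram_counts)  # vocabulary size

def p_add1(w_prev, w):
    """Add-1 (Laplace) bigram estimate: (c(w_prev, w) + 1) / (c(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

def reconstituted_count(w_prev, w):
    """Effective count c*: the add-1 probability multiplied back by c(w_prev)."""
    return p_add1(w_prev, w) * unigram_counts[w_prev]

print(p_add1("denied", "the"), reconstituted_count("denied", "the"))
```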
Comparing with bigrams: Restaurant corpus
Add-1 estimation
Not used for N-grams: there are better smoothing methods
It is used to smooth other NLP models
In domains where the number of zeros isn't so large
For text classification
More general formulations: Add-k
$$P_{Add\text{-}k}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + k}{c(w_{i-1}) + kV}$$
Equivalently, writing m = kV:
$$P_{Add\text{-}k}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + m\left(\frac{1}{V}\right)}{c(w_{i-1}) + m}$$
Unigram prior smoothing:
$$P_{UnigramPrior}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + m\,P(w_i)}{c(w_{i-1}) + m}$$
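A minimal sketch of the unigram-prior variant (the toy corpus, the function names, and the choice m = 1.0 are illustrative assumptions, not from the slides):

```python
from collections import Counter

tokens = "I am here who am I I would like".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def p_unigram(w):
    return unigram_counts[w] / N

def p_unigram_prior(w_prev, w, m=1.0):
    """P(w | w_prev) = (c(w_prev, w) + m * P(w)) / (c(w_prev) + m).
    With P(w) = 1/V this reduces to add-k smoothing with k = m/V."""
    return (bigram_counts[(w_prev, w)] + m * p_unigram(w)) / (unigram_counts[w_prev] + m)

print(p_unigram_prior("am", "here"))
```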
Advanced smoothing algorithms
Basic Intuition
Use the count of things we have seen once
to help estimate the count of things we have never seen
Smoothing algorithms
Good-Turing
Kneser-Ney
Witten-Bell
Nc : Frequency of frequency c
Example Sentences
<s>I am here </s>
<s>who am I </s>
<s>I would like </s>
Computing Nc
I: 3, am: 2, here: 1, who: 1, would: 1, like: 1
N1 = 4, N2 = 1, N3 = 1
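A minimal Python sketch of this computation (variable names are mine); it reproduces the Nc values above:

```python
from collections import Counter

sentences = [["I", "am", "here"], ["who", "am", "I"], ["I", "would", "like"]]
word_counts = Counter(w for s in sentences for w in s)

# Nc[c] = number of word types that occur exactly c times.
Nc = Counter(word_counts.values())
print(dict(word_counts))  # {'I': 3, 'am': 2, 'here': 1, 'who': 1, 'would': 1, 'like': 1}
print(dict(Nc))           # {3: 1, 2: 1, 1: 4}  -> N1 = 4, N2 = 1, N3 = 1
```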
Good-Turing smoothing intuition
You are fishing and caught:
10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish
How likely is it that the next species is trout? 1/18
How likely is it that the next species is new?
Use the estimate of things-we-saw-once to estimate the new things: 3/18 (N1 = 3)
So, how likely is it that the next species is trout?
Must be less than 1/18
Good-Turing calculations
$P_{GT}(\text{things with zero frequency}) = \frac{N_1}{N}$
Unseen word: $P_{GT}(\text{unseen}) = 3/18$
Things with non-zero frequency:
$$c^* = \frac{(c+1)N_{c+1}}{N_c}$$
Seen once (trout):
$c^*(\text{trout}) = 2 \times N_2/N_1 = 2/3$
$P_{GT}(\text{trout}) = \frac{2/3}{18} = 1/27$
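A minimal sketch of these calculations for the fishing example (the dictionary and function names are illustrative):

```python
from collections import Counter

catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(catch.values())        # 18 fish
Nc = Counter(catch.values())   # frequency of frequency c

def c_star(c):
    """Good-Turing discounted count: c* = (c + 1) * N_{c+1} / N_c."""
    return (c + 1) * Nc.get(c + 1, 0) / Nc[c]

p_unseen = Nc[1] / N           # mass for unseen species: N1 / N = 3/18
p_trout = c_star(1) / N        # (2 * N2 / N1) / N = (2/3) / 18 = 1/27
print(p_unseen, p_trout)
```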
Intuition
Intuition from leave-one-out validation
Training dataset: c tokens
Take each of the c training words out in turn
c training sets of size c − 1, held-out sets of size 1
What fraction of held-out words are unseen in training? $N_1/c$
What fraction of held-out words are seen k times in training? $(k+1)N_{k+1}/c$
We expect $(k+1)N_{k+1}/c$ of the words to be those with training count k
There are $N_k$ words with training count k
Each should occur with probability: $\frac{(k+1)N_{k+1}}{cN_k}$
Expected count: $k^* = \frac{(k+1)N_{k+1}}{N_k}$
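A minimal numerical check of this leave-one-out argument on a toy corpus (the token list and variable names are mine; the identity holds for any token list):

```python
from collections import Counter

tokens = "I am here who am I I would like".split()
c = len(tokens)
counts = Counter(tokens)
Nk = Counter(counts.values())      # Nk[k] = number of word types seen k times

for k in range(0, max(counts.values())):
    # A held-out token has training count k exactly when its full-corpus count is k + 1.
    direct = sum(1 for w in tokens if counts[w] == k + 1) / c
    predicted = (k + 1) * Nk.get(k + 1, 0) / c
    print(k, direct, predicted)    # the two fractions agree for every k
```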
Complications
What about a very frequent word like "the"?
For small k, Nk > Nk+1
For large k, the Nk are too jumpy
Simple Good-Turing
Replace empirical Nk with a best-fit power law once counts get unreliable
Good-Turing numbers: Example
22 million words of AP Newswire
$$c^* = \frac{(c+1)N_{c+1}}{N_c}$$
It looks like $c^* \approx c - 0.75$
Absolute Discounting Interpolation
Why don't we just subtract 0.75 (or some d)?
$$P_{AbsoluteDiscounting}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) - d}{c(w_{i-1})} + \lambda(w_{i-1})P(w_i)$$
We may keep separate values of d for counts 1 and 2
But can we do better than using the regular unigram probability P(w)?
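A minimal sketch of this estimator on the same toy counts as before (the names, the discount d = 0.75, and the use of max(c − d, 0) to keep unseen bigrams from going negative are illustrative assumptions):

```python
from collections import Counter

tokens = "I am here who am I I would like".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def p_absolute_discount(w_prev, w, d=0.75):
    """P(w | w_prev) = max(c(w_prev, w) - d, 0)/c(w_prev) + lambda(w_prev) * P(w),
    where lambda(w_prev) redistributes the discounted mass over the unigram P(w)."""
    c_prev = unigram_counts[w_prev]
    discounted = max(bigram_counts[(w_prev, w)] - d, 0) / c_prev
    n_followers = len({x for (p, x) in bigram_counts if p == w_prev})
    lam = d * n_followers / c_prev
    return discounted + lam * unigram_counts[w] / N

print(p_absolute_discount("am", "here"))
```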
Kneser-Ney Smoothing
Intuition
Shannon game: I can't see without my reading ___: glasses or Francisco?
"Francisco" is more common than "glasses"
But "Francisco" mostly follows "San"
P(w): How likely is w?
Instead, P_continuation(w): How likely is w to appear as a novel continuation?
For each word, count the number of bigram types it completes
Every bigram type was a novel continuation the first time it was seen
$P_{continuation}(w) \propto |\{w_{i-1} : c(w_{i-1}, w) > 0\}|$
Kneser-Ney Smoothing
How many times does w appear as a novel continuation?
$P_{continuation}(w) \propto |\{w_{i-1} : c(w_{i-1}, w) > 0\}|$
Normalized by the total number of word bigram types $|\{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0\}|$:
$$P_{continuation}(w) = \frac{|\{w_{i-1} : c(w_{i-1}, w) > 0\}|}{|\{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0\}|}$$
A frequent word (Francisco) occurring in only one context (San) will have a low continuation probability
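A minimal sketch of the continuation probability on the same toy bigram counts (names and corpus are illustrative):

```python
from collections import Counter

tokens = "I am here who am I I would like".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_continuation(w):
    """Fraction of distinct bigram types that end in w."""
    types_ending_in_w = len({prev for (prev, x) in bigram_counts if x == w})
    total_bigram_types = len(bigram_counts)
    return types_ending_in_w / total_bigram_types

print(p_continuation("I"), p_continuation("like"))
```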
Kneser-Ney Smoothing
$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - d, 0)}{c(w_{i-1})} + \lambda(w_{i-1})P_{continuation}(w_i)$$
λ is a normalizing constant:
$$\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\,|\{w : c(w_{i-1}, w) > 0\}|$$
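A minimal sketch of the full interpolated estimate on the same toy corpus, restating p_continuation from the previous sketch (the corpus, names, and discount d = 0.75 are illustrative assumptions):

```python
from collections import Counter

tokens = "I am here who am I I would like".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_continuation(w):
    # Fraction of distinct bigram types that end in w (see the previous sketch).
    return len({p for (p, x) in bigram_counts if x == w}) / len(bigram_counts)

def p_kneser_ney(w_prev, w, d=0.75):
    """P_KN(w | w_prev) = max(c(w_prev, w) - d, 0) / c(w_prev)
                          + lambda(w_prev) * P_continuation(w)."""
    c_prev = unigram_counts[w_prev]
    n_followers = len({x for (p, x) in bigram_counts if p == w_prev})
    lam = d * n_followers / c_prev   # the normalizing constant lambda(w_prev)
    return max(bigram_counts[(w_prev, w)] - d, 0) / c_prev + lam * p_continuation(w)

print(p_kneser_ney("am", "I"))
```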
Model Combination
As N increases
The power (expressiveness) of an N-gram model increases
but the ability to estimate accurate parameters from sparse data
decreases (i.e. the smoothing problem gets worse).
A general approach is to combine the results of multiple N-gram models.
Backoff and Interpolation
It might help to use less context
when you haven't learned much about larger contexts
Backoff
use trigram if you have good evidence,
otherwise bigram, otherwise unigram
Interpolation
mix unigram, bigram, and trigram
Interpolation is found to work better
Linear Interpolation
Simple Interpolation
$$\hat{P}(w_n \mid w_{n-1}w_{n-2}) = \lambda_1 P(w_n \mid w_{n-1}w_{n-2}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)$$
$$\sum_i \lambda_i = 1$$
Lambdas conditional on context:
$$\hat{P}(w_n \mid w_{n-1}w_{n-2}) = \lambda_1(w_{n-2}, w_{n-1})P(w_n \mid w_{n-1}w_{n-2}) + \lambda_2(w_{n-2}, w_{n-1})P(w_n \mid w_{n-1}) + \lambda_3(w_{n-2}, w_{n-1})P(w_n)$$
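A minimal sketch of simple interpolation with fixed lambdas (the toy corpus, names, MLE component models, and the example weights (0.5, 0.3, 0.2) are all illustrative assumptions):

```python
from collections import Counter

tokens = "I am here who am I I would like".split()
N = len(tokens)
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))

def p_interpolated(w, w1, w2, lambdas=(0.5, 0.3, 0.2)):
    """lambda1 * P(w | w2 w1) + lambda2 * P(w | w1) + lambda3 * P(w);
    w1 is the previous word, w2 the one before it."""
    l1, l2, l3 = lambdas
    p_tri = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    p_bi = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
    p_uni = uni[w] / N
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

print(p_interpolated("here", "am", "I"))
```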
Setting the lambda values
Use a held-out corpus
Choose λs to maximize the probability of held-out data:
Find the N-gram probabilities on the training data
Search for λs that give the largest probability to the held-out data
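A minimal sketch of this search as a coarse grid over lambda triples that maximizes held-out log-probability. It assumes the p_interpolated function from the previous sketch; the held-out token list and the grid step are illustrative (in practice the held-out data is disjoint from the training data):

```python
import math
from itertools import product

# Assumes p_interpolated(w, w1, w2, lambdas) from the previous sketch,
# trained on the training data, and a (toy) held-out token list.
held_out = "I am here I would like".split()

def heldout_logprob(lambdas):
    lp = 0.0
    for i in range(2, len(held_out)):
        p = p_interpolated(held_out[i], held_out[i - 1], held_out[i - 2], lambdas)
        lp += math.log(p) if p > 0 else float("-inf")
    return lp

# Coarse grid: all (l1, l2, l3) in steps of 0.1 that sum to 1.
grid = [l for l in product([0.1 * k for k in range(1, 9)], repeat=3)
        if abs(sum(l) - 1.0) < 1e-9]
best = max(grid, key=heldout_logprob)
print(best, heldout_logprob(best))
```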