Language Modeling: Part II
Pawan Goyal
CSE, IITKGP
July 31, 2014
Lower perplexity = better model
WSJ Corpus
Training: 38 million words
Test: 1.5 million words
Unigram perplexity: 962
The model is as confused on test data as if it had to choose uniformly and
independently among 962 possibilities for each word.
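To make this interpretation concrete, here is a minimal Python sketch of perplexity for a plain unigram MLE model (the toy corpus, the function name, and the assumption that every test word was seen in training are all illustrative, not from the slides):

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of a unigram MLE model: PP(W) = P(w_1 ... w_N) ** (-1/N).
    Assumes every test word was seen in training (otherwise P = 0 and PP blows up)."""
    counts = Counter(train_tokens)
    total = len(train_tokens)
    log_prob = sum(math.log(counts[w] / total) for w in test_tokens)
    return math.exp(-log_prob / len(test_tokens))

# Toy usage: a uniform model over 4 words gives perplexity 4 --
# "as confused as choosing uniformly among 4 possibilities per word".
train = ["a", "b", "c", "d"] * 10
test = ["a", "b", "c", "d"]
print(unigram_perplexity(train, test))  # 4.0
```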
The Shannon Visualization Method
Use the language model to generate word sequences:
Choose a random bigram (<s>, w) as per its probability
Choose a random bigram (w, x) as per its probability
And so on, until we choose </s>
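A minimal sketch of this generation procedure, assuming the bigram model is stored as a dict mapping each history word to a {next_word: probability} dict (the data structure, function name, and toy probabilities are illustrative):

```python
import random

def generate_sentence(bigram_probs, max_len=30):
    """Sample a sentence by repeatedly choosing a random bigram (w, x)
    according to its conditional probability, starting from <s> and
    stopping when </s> is chosen."""
    sentence, w = [], "<s>"
    for _ in range(max_len):
        nexts = bigram_probs[w]
        x = random.choices(list(nexts), weights=nexts.values())[0]
        if x == "</s>":
            break
        sentence.append(x)
        w = x
    return " ".join(sentence)

# Toy model: P(x | w) for a tiny vocabulary.
bigram_probs = {
    "<s>":   {"I": 0.7, "who": 0.3},
    "I":     {"am": 0.5, "would": 0.5},
    "who":   {"am": 1.0},
    "am":    {"here": 0.6, "I": 0.4},
    "would": {"like": 1.0},
    "here":  {"</s>": 1.0},
    "like":  {"</s>": 1.0},
}
print(generate_sentence(bigram_probs))
```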
Shakespeare as Corpus
N = 884,647 tokens, V = 29,066
Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams.
Approximating Shakespeare
Problems with simple MLE estimate: zeros
Training set
... denied the allegations
... denied the reports
... denied the claims
... denied the request
Test data
... denied the offer
... denied the loan
Zero probability bigrams
P(offer | denied the) = 0
The test set will be assigned a probability of 0, and the perplexity can't be computed.
Language Modeling: Smoothing
With sparse statistics
Steal probability mass to generalize better
Laplace Smoothing (Add-one estimation)
Pretend as if we saw each word one more time than we actually did
Just add one to all the counts!
MLE estimate: $P_{MLE}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$
Add-1 estimate: $P_{Add\text{-}1}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$
Reconstituted counts as effect of smoothing
Effective bigram count $c^*(w_{n-1} w_n)$:
$$\frac{c^*(w_{n-1} w_n)}{c(w_{n-1})} = \frac{c(w_{n-1} w_n) + 1}{c(w_{n-1}) + V}$$
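To make the two formulas concrete, here is a minimal Python sketch (the toy counts, function names, and data structures are illustrative, not from the slides) computing the add-1 probability and the corresponding effective count:

```python
from collections import Counter

# Toy bigram and unigram counts (illustrative).
bigram_counts = Counter({("denied", "the"): 3, ("the", "allegations"): 1})
unigram_counts = Counter({"denied": 3, "the": 4, "allegations": 1})
V = len(unigram_counts)  # vocabulary size

def p_add1(w_prev, w):
    """Add-1 (Laplace) bigram estimate: (c(w_prev, w) + 1) / (c(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

def reconstituted_count(w_prev, w):
    """Effective count c*: the add-1 probability multiplied back by c(w_prev)."""
    return p_add1(w_prev, w) * unigram_counts[w_prev]

print(p_add1("denied", "the"), reconstituted_count("denied", "the"))
```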
Comparing with bigrams: Restaurant corpus
Add-1 estimation
Not used for N-grams: there are better smoothing methods
It is used to smooth other NLP models
In domains where the number of zeros isn't so large
For text classification
More general formulations: Add-k
$$P_{Add\text{-}k}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + k}{c(w_{i-1}) + kV}$$
Equivalently, writing m = kV:
$$P_{Add\text{-}k}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + m\left(\frac{1}{V}\right)}{c(w_{i-1}) + m}$$
Unigram prior smoothing:
$$P_{UnigramPrior}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + m\,P(w_i)}{c(w_{i-1}) + m}$$
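A minimal sketch of the unigram-prior variant (the toy corpus, the function names, and the choice m = 1.0 are illustrative assumptions, not from the slides):

```python
from collections import Counter

tokens = "I am here who am I I would like".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def p_unigram(w):
    return unigram_counts[w] / N

def p_unigram_prior(w_prev, w, m=1.0):
    """P(w | w_prev) = (c(w_prev, w) + m * P(w)) / (c(w_prev) + m).
    With P(w) = 1/V this reduces to add-k smoothing with k = m/V."""
    return (bigram_counts[(w_prev, w)] + m * p_unigram(w)) / (unigram_counts[w_prev] + m)

print(p_unigram_prior("am", "here"))
```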
Advanced smoothing algorithms
Basic Intuition
Use the count of things we have seen once
to help estimate the count of things we have never seen
Smoothing algorithms
Good-Turing
Kneser-Ney
Witten-Bell
Nc : Frequency of frequency c
Example Sentences
<s>I am here </s>
<s>who am I </s>
<s>I would like </s>
Computing Nc
I: 3, am: 2, here: 1, who: 1, would: 1, like: 1
N1 = 4, N2 = 1, N3 = 1
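A minimal Python sketch of this computation (variable names are mine); it reproduces the Nc values above:

```python
from collections import Counter

sentences = [["I", "am", "here"], ["who", "am", "I"], ["I", "would", "like"]]
word_counts = Counter(w for s in sentences for w in s)

# Nc[c] = number of word types that occur exactly c times.
Nc = Counter(word_counts.values())
print(dict(word_counts))  # {'I': 3, 'am': 2, 'here': 1, 'who': 1, 'would': 1, 'like': 1}
print(dict(Nc))           # {3: 1, 2: 1, 1: 4}  -> N1 = 4, N2 = 1, N3 = 1
```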
Good-Turing smoothing intuition
You are fishing and caught:
10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish
How likely is it that the next species is trout? 1/18
How likely is it that the next species is new?
Use the estimate of things-we-saw-once to estimate the new things: 3/18 (N1 = 3)
So, how likely is it that the next species is trout?
Must be less than 1/18
Good-Turing calculations
$P_{GT}(\text{things with zero frequency}) = \frac{N_1}{N}$
Unseen word: $P_{GT}(\text{unseen}) = 3/18$
Things with non-zero frequency:
$$c^* = \frac{(c+1)N_{c+1}}{N_c}$$
Seen once (trout):
$c^*(\text{trout}) = 2 \times N_2/N_1 = 2/3$
$P_{GT}(\text{trout}) = \frac{2/3}{18} = 1/27$
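A minimal sketch of these calculations for the fishing example (the dictionary and function names are illustrative):

```python
from collections import Counter

catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(catch.values())        # 18 fish
Nc = Counter(catch.values())   # frequency of frequency c

def c_star(c):
    """Good-Turing discounted count: c* = (c + 1) * N_{c+1} / N_c."""
    return (c + 1) * Nc.get(c + 1, 0) / Nc[c]

p_unseen = Nc[1] / N           # mass for unseen species: N1 / N = 3/18
p_trout = c_star(1) / N        # (2 * N2 / N1) / N = (2/3) / 18 = 1/27
print(p_unseen, p_trout)
```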
Intuition
Intuition from leave-one-out validation
Training dataset: c tokens
Take each of the c training words out in turn
c training sets of size c − 1, held-out sets of size 1
What fraction of held-out words are unseen in training? $N_1/c$
What fraction of held-out words are seen k times in training? $(k+1)N_{k+1}/c$
We expect $(k+1)N_{k+1}/c$ of the words to be those with training count k
There are $N_k$ words with training count k
Each should occur with probability: $\frac{(k+1)N_{k+1}}{cN_k}$
Expected count: $k^* = \frac{(k+1)N_{k+1}}{N_k}$
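A minimal numerical check of this leave-one-out argument on a toy corpus (the token list and variable names are mine; the identity holds for any token list):

```python
from collections import Counter

tokens = "I am here who am I I would like".split()
c = len(tokens)
counts = Counter(tokens)
Nk = Counter(counts.values())      # Nk[k] = number of word types seen k times

for k in range(0, max(counts.values())):
    # A held-out token has training count k exactly when its full-corpus count is k + 1.
    direct = sum(1 for w in tokens if counts[w] == k + 1) / c
    predicted = (k + 1) * Nk.get(k + 1, 0) / c
    print(k, direct, predicted)    # the two fractions agree for every k
```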
Complications
What about a very frequent word like "the"?
For small k, Nk > Nk+1
For large k, the Nk are too jumpy
Simple Good-Turing
Replace empirical Nk with a best-fit power law once counts get unreliable
Good-Turing numbers: Example
22 million words of AP Newswire
$$c^* = \frac{(c+1)N_{c+1}}{N_c}$$
It looks like $c^* \approx c - 0.75$
Absolute Discounting Interpolation
Why don't we just subtract 0.75 (or some d)?
$$P_{AbsoluteDiscounting}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) - d}{c(w_{i-1})} + \lambda(w_{i-1})P(w_i)$$
We may keep separate values of d for counts 1 and 2
But can we do better than using the regular unigram probability P(w)?
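A minimal sketch of this estimator on the same toy counts as before (the names, the discount d = 0.75, and the use of max(c − d, 0) to keep unseen bigrams from going negative are illustrative assumptions):

```python
from collections import Counter

tokens = "I am here who am I I would like".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def p_absolute_discount(w_prev, w, d=0.75):
    """P(w | w_prev) = max(c(w_prev, w) - d, 0)/c(w_prev) + lambda(w_prev) * P(w),
    where lambda(w_prev) redistributes the discounted mass over the unigram P(w)."""
    c_prev = unigram_counts[w_prev]
    discounted = max(bigram_counts[(w_prev, w)] - d, 0) / c_prev
    n_followers = len({x for (p, x) in bigram_counts if p == w_prev})
    lam = d * n_followers / c_prev
    return discounted + lam * unigram_counts[w] / N

print(p_absolute_discount("am", "here"))
```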
Kneser-Ney Smoothing
Intuition
Shannon game: I can't see without my reading ___: glasses or Francisco?
"Francisco" is more common than "glasses"
But "Francisco" mostly follows "San"
P(w): How likely is w?
Instead, P_continuation(w): How likely is w to appear as a novel continuation?
For each word, count the number of bigram types it completes
Every bigram type was a novel continuation the first time it was seen
$P_{continuation}(w) \propto |\{w_{i-1} : c(w_{i-1}, w) > 0\}|$
Kneser-Ney Smoothing
How many times does w appear as a novel continuation?
$P_{continuation}(w) \propto |\{w_{i-1} : c(w_{i-1}, w) > 0\}|$
Normalized by the total number of word bigram types $|\{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0\}|$:
$$P_{continuation}(w) = \frac{|\{w_{i-1} : c(w_{i-1}, w) > 0\}|}{|\{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0\}|}$$
A frequent word (Francisco) occurring in only one context (San) will have a low continuation probability
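A minimal sketch of the continuation probability on the same toy bigram counts (names and corpus are illustrative):

```python
from collections import Counter

tokens = "I am here who am I I would like".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_continuation(w):
    """Fraction of distinct bigram types that end in w."""
    types_ending_in_w = len({prev for (prev, x) in bigram_counts if x == w})
    total_bigram_types = len(bigram_counts)
    return types_ending_in_w / total_bigram_types

print(p_continuation("I"), p_continuation("like"))
```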
Kneser-Ney Smoothing
$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - d, 0)}{c(w_{i-1})} + \lambda(w_{i-1})P_{continuation}(w_i)$$
λ is a normalizing constant:
$$\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\,|\{w : c(w_{i-1}, w) > 0\}|$$
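A minimal sketch of the full interpolated estimate on the same toy corpus, restating p_continuation from the previous sketch (the corpus, names, and discount d = 0.75 are illustrative assumptions):

```python
from collections import Counter

tokens = "I am here who am I I would like".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_continuation(w):
    # Fraction of distinct bigram types that end in w (see the previous sketch).
    return len({p for (p, x) in bigram_counts if x == w}) / len(bigram_counts)

def p_kneser_ney(w_prev, w, d=0.75):
    """P_KN(w | w_prev) = max(c(w_prev, w) - d, 0) / c(w_prev)
                          + lambda(w_prev) * P_continuation(w)."""
    c_prev = unigram_counts[w_prev]
    n_followers = len({x for (p, x) in bigram_counts if p == w_prev})
    lam = d * n_followers / c_prev   # the normalizing constant lambda(w_prev)
    return max(bigram_counts[(w_prev, w)] - d, 0) / c_prev + lam * p_continuation(w)

print(p_kneser_ney("am", "I"))
```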
Model Combination
As N increases
The power (expressiveness) of an N-gram model increases
but the ability to estimate accurate parameters from sparse data
decreases (i.e. the smoothing problem gets worse).
A general approach is to combine the results of multiple N-gram models.
Backoff and Interpolation
It might help to use less context
when you haven't learned much about larger contexts
Backoff
use trigram if you have good evidence,
otherwise bigram, otherwise unigram
Interpolation
mix unigram, bigram, and trigram
Interpolation is found to work better
Linear Interpolation
Simple Interpolation
$$\hat{P}(w_n \mid w_{n-1}w_{n-2}) = \lambda_1 P(w_n \mid w_{n-1}w_{n-2}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)$$
$$\sum_i \lambda_i = 1$$
Lambdas conditional on context:
$$\hat{P}(w_n \mid w_{n-1}w_{n-2}) = \lambda_1(w_{n-2}, w_{n-1})P(w_n \mid w_{n-1}w_{n-2}) + \lambda_2(w_{n-2}, w_{n-1})P(w_n \mid w_{n-1}) + \lambda_3(w_{n-2}, w_{n-1})P(w_n)$$
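A minimal sketch of simple interpolation with fixed lambdas (the toy corpus, names, MLE component models, and the example weights (0.5, 0.3, 0.2) are all illustrative assumptions):

```python
from collections import Counter

tokens = "I am here who am I I would like".split()
N = len(tokens)
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))

def p_interpolated(w, w1, w2, lambdas=(0.5, 0.3, 0.2)):
    """lambda1 * P(w | w2 w1) + lambda2 * P(w | w1) + lambda3 * P(w);
    w1 is the previous word, w2 the one before it."""
    l1, l2, l3 = lambdas
    p_tri = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    p_bi = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
    p_uni = uni[w] / N
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

print(p_interpolated("here", "am", "I"))
```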
Setting the lambda values
Use a held-out corpus
Choose λs to maximize the probability of held-out data:
Find the N-gram probabilities on the training data
Search for λs that give the largest probability to the held-out data
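A minimal sketch of this search as a coarse grid over lambda triples that maximizes held-out log-probability. It assumes the p_interpolated function from the previous sketch; the held-out token list and the grid step are illustrative (in practice the held-out data is disjoint from the training data):

```python
import math
from itertools import product

# Assumes p_interpolated(w, w1, w2, lambdas) from the previous sketch,
# trained on the training data, and a (toy) held-out token list.
held_out = "I am here I would like".split()

def heldout_logprob(lambdas):
    lp = 0.0
    for i in range(2, len(held_out)):
        p = p_interpolated(held_out[i], held_out[i - 1], held_out[i - 2], lambdas)
        lp += math.log(p) if p > 0 else float("-inf")
    return lp

# Coarse grid: all (l1, l2, l3) in steps of 0.1 that sum to 1.
grid = [l for l in product([0.1 * k for k in range(1, 9)], repeat=3)
        if abs(sum(l) - 1.0) < 1e-9]
best = max(grid, key=heldout_logprob)
print(best, heldout_logprob(best))
```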