Natural Language Processing
Curated by
Dr. Tohida Rehman
Assistant Professor
Department of Information Technology
Jadavpur University
Contents
1. Language Modeling
2. References
What is a Language Model (LM)?
A language model (LM) is a machine learning model that predicts the next word(s) in a
sequence.
Formally, it assigns a probability to each possible next word(s).
LMs can also assign a probability to an entire sentence.
To specify a correct probability distribution, the probability of all sentences in a language
must sum to 1.
Why Is Word Prediction Important in NLP?
Many NLP applications (translation, chatbots, autocomplete) rely on predicting likely
word sequences.
Modern large language models (LLMs) are trained primarily via next-word
prediction.
If a model predicts words well, it implicitly learns grammar, facts, and reasoning.
This implicit knowledge is important: it is what the model draws on when assigning probabilities word by word via the chain rule of probability.
Simple Example: N-Gram Language Models
Basic Approach: Predicts the next word based on the previous (n-1) words (e.g., bigram: "the
cat" → "sat").
Limitation: Fixed context window (no long-range dependencies).
Relies on the chain rule of probability (factorizing sentence probability).
"The sky is ___"
Possible predictions: blue, cloudy, falling, limitless
Correct predictions (blue) require knowledge of:
Grammar (adjective follows "is").
World knowledge (skies are typically blue).
Probabilistic Language Models
Main goal: assign a probability to a sentence
Machine Translation:
P(high winds tonite) > P(large winds tonite)
Spell Correction
The office is about fifteen minuets from my house
P(about fifteen minutes from) > P(about fifteen minuets from)
Speech Recognition
P(I saw a van) >> P(eyes awe of an)
+ Summarization, question answering, and many other applications
Probabilistic Language Models
Goal: compute the probability of a sentence or sequence of words:
P(W) = P(w1,w2,w3,w4,w5…wn)
Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4)
A model that computes either of these:
P(W) or P(wn|w1,w2…wn-1) is called a language model.
A better name might be "the grammar", but "language model" or LM is the standard term.
How to estimate these probabilities
Recall the definition of conditional probability: P(B|A) = P(A,B) / P(A), so P(A,B) = P(A) P(B|A)
More variables:
P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute joint probability of words in sentence
P(w1 w2 … wn) = ∏_i P(wi | w1 w2 … wi-1)
P("its water is so transparent") = P(its) × P(water|its) × P(is|its water) × P(so|its water is) × P(transparent|its water is so)
Probabilistic Language Models
Could we just count and divide?
P(the | its water is so transparent that) = Count(its water is so transparent that the) / Count(its water is so transparent that)
No! Too many possible sentences!
We’ll never see enough data for estimating these
Language is creative and any particular context might have never occurred
before!
Markov Assumption
Simplifying assumption:
P(the | its water is so transparent that) ≈ P(the | that)
Or maybe
P(the | its water is so transparent that) ≈ P(the | transparent that)
P(w1 w2 … wn) ≈ ∏_i P(wi | wi-k … wi-1)
In other words, we approximate each component in the product
Markov Assumption
P(w1 w2 … wn) ≈ ∏_i P(wi | wi-k … wi-1)
In other words, we approximate each component in the product:
P(wi | w1 w2 … wi-1) ≈ P(wi | wi-k … wi-1)
Simplest case: Unigram model
P(w1 w2 … wn) ≈ ∏_i P(wi)
Example corpus counts (from a tiny corpus):

Word    Count    Probability P(wi)
"the"   100      100/200 = 0.5
"cat"   50       50/200 = 0.25
"sat"   30       30/200 = 0.15
"on"    20       20/200 = 0.1
Total   200

Calculate sentence probability for the example sentence "the cat sat on the cat":
P(the) × P(cat) × P(sat) × P(on) × P(the) × P(cat)
= 0.5 × 0.25 × 0.15 × 0.1 × 0.5 × 0.25
= 0.000234375 (or 2.34 × 10⁻⁴)
Note the issues with this unigram model:
Repetition: repeated words ("the", "cat") reduce the probability exponentially.
Sparsity: if any word is unseen (like "dog"), the entire sentence probability becomes zero.
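To make the unigram calculation concrete, here is a minimal Python sketch (ours, not from the slides) using the hypothetical counts in the table above; note how an unseen word such as "dog" zeroes out the whole sentence.

# Minimal unigram language model sketch using the toy counts above (200 tokens total).
unigram_counts = {"the": 100, "cat": 50, "sat": 30, "on": 20}
total_tokens = sum(unigram_counts.values())  # 200

def unigram_prob(word):
    # MLE unigram probability; unseen words get 0 (the sparsity problem).
    return unigram_counts.get(word, 0) / total_tokens

def sentence_prob(sentence):
    # Multiply the unigram probabilities of each token.
    p = 1.0
    for word in sentence.split():
        p *= unigram_prob(word)
    return p

print(sentence_prob("the cat sat on the cat"))  # 0.000234375
print(sentence_prob("the dog sat"))             # 0.0 -- "dog" is unseen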
Bigram model
P(wi | w1 w2 … wi-1) ≈ P(wi | wi-1)

Unigram counts (same tiny corpus):

Word    Count    P(wi)
"the"   100      100/200 = 0.5
"cat"   50       50/200 = 0.25
"sat"   30       30/200 = 0.15
"on"    20       20/200 = 0.1
Total   200

Bigram counts:

Bigram      Count    P(wi | wi-1)
"the cat"   40       40/100 = 0.4
"cat sat"   25       25/50 = 0.5
"sat on"    15       15/30 = 0.5
"on the"    10       10/20 = 0.5

Calculating the sentence probability of "the cat sat on the cat":
P(the) × P(cat|the) × P(sat|cat) × P(on|sat) × P(the|on) × P(cat|the)
= 0.5 × 0.4 × 0.5 × 0.5 × 0.5 × 0.4
= 0.01
N-gram models
We can extend to trigrams, 4-grams, 5-grams
In general this is an insufficient model of language
because language has long-distance dependencies:
“The computer which I had just put into the machine room on the fifth floor crashed.”
But we can often get away with N-gram models
Estimating bigram probabilities
The Maximum Likelihood Estimate
P(wi | wi-1) = count(wi-1, wi) / count(wi-1) = c(wi-1, wi) / c(wi-1)
An example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

P(wi | wi-1) = c(wi-1, wi) / c(wi-1)
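As an illustration, here is a small Python sketch (ours) that estimates bigram probabilities from this mini-corpus by counting and dividing, exactly as in the MLE formula; the printed values such as P(I|<s>) = 2/3 follow directly from the counts.

from collections import Counter

# The mini-corpus from the slide, with sentence-boundary markers as tokens.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_mle(w, w_prev):
    # P(w | w_prev) = c(w_prev, w) / c(w_prev)
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(bigram_mle("I", "<s>"))     # 2/3
print(bigram_mle("Sam", "<s>"))   # 1/3
print(bigram_mle("am", "I"))      # 2/3
print(bigram_mle("</s>", "Sam"))  # 1/2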
More examples: Berkeley Restaurant Project sentences
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Raw bigram counts (out of 9222 sentences): [table of bigram counts from the Berkeley Restaurant Project]
Raw bigram probabilities: normalize each bigram count by the unigram count of the first word. [resulting table of bigram probabilities]
An example
P(<s> I want english food </s>) =
P(I|<s>)
× P(want|I)
× P(english|want)
× P(food|english)
× P(</s>|food)
= .000031
What kinds of knowledge?
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P (i | <s>) = .25
Evaluation and Perplexity
Does our language model prefer good sentences to bad ones?
Assign higher probability to “real” or “frequently observed” sentences
Than “ungrammatical” or “rarely observed” sentences?
We train parameters of our model on a training set.
We test the model’s performance on data we haven’t seen.
A test set is an unseen dataset that is different from our training set, totally unused.
An evaluation metric tells us how well our model does on the test set.
Evaluation and Perplexity
Best evaluation for comparing models A and B
Put each model in a task
spelling corrector, speech recognizer, MT system
Run the task, get an accuracy for A and for B
How many misspelled words corrected properly
How many words translated correctly
Compare accuracy for A and B
Evaluation and Perplexity
Extrinsic evaluation
Time-consuming; can take days or weeks
So
Sometimes use intrinsic evaluation: perplexity
Bad approximation
unless the test data looks just like the training data
So generally only useful in pilot experiments
But is helpful to think about.
Perplexity
The best language model is one that best predicts an unseen test set
Gives the highest P(sentence)

Perplexity is the inverse probability of the test set, normalized by the number of words:
PP(W) = P(w1 w2 … wN)^(-1/N) = (1 / P(w1 w2 … wN))^(1/N)

Chain rule: PP(W) = ( ∏_i 1 / P(wi | w1 … wi-1) )^(1/N)
For bigrams: PP(W) = ( ∏_i 1 / P(wi | wi-1) )^(1/N)
Minimizing perplexity is the same as maximizing probability
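A minimal sketch of how perplexity could be computed for a bigram model in Python, assuming some conditional-probability function bigram_prob(w, w_prev) (for example the MLE estimator sketched earlier) is available; working in log space avoids numerical underflow.

import math

def perplexity(tokens, bigram_prob):
    # PP(W) = P(w1 ... wN)^(-1/N), computed in log space for stability.
    # `bigram_prob(w, w_prev)` is any conditional probability estimate,
    # e.g. an MLE or smoothed bigram model defined elsewhere.
    log_prob = 0.0
    n = 0
    for w_prev, w in zip(tokens, tokens[1:]):
        p = bigram_prob(w, w_prev)
        if p == 0:
            return float("inf")  # any zero probability makes perplexity infinite
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

# Example with the toy MLE bigram model sketched earlier:
# perplexity("<s> I am Sam </s>".split(), bigram_mle)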
Intuition of Perplexity
The Shannon Game: how well can we predict the next word?
  I always order pizza with cheese and ____
  The 33rd President of the US was ____
  I saw a ____
For the first blank, a model might assign: mushrooms 0.1, pepperoni 0.1, anchovies 0.01, …, fried rice 0.0001, …, and 1e-100
Unigrams are terrible at this game. (Why?)
A better model of a text
is one which assigns a higher probability to the word that actually occurs
Perplexity as branching factor
Let's suppose a sentence consisting of random digits.
What is the perplexity of this sentence according to a model that assigns P = 1/10
to each digit?
Lower perplexity = better model
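Working this out from the perplexity definition above, for a sentence of N random digits:

PP(W) = P(w1 w2 … wN)^(-1/N) = ((1/10)^N)^(-1/N) = (1/10)^(-1) = 10

So the perplexity is 10: exactly the branching factor, the number of equally likely choices at each position.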
Common Smoothing Techniques
Laplace (Add-1) Smoothing
Adds 1 to all counts.
Problem: Over-smooths and distorts probabilities.
Good-Turing Smoothing
Adjusts counts based on frequency of frequencies.
Problem: Doesn’t handle higher-order n-grams well.
Interpolation & Backoff
Mixes higher and lower-order n-grams.
Problem: May not generalize well.
Smoothing: Add-one (Laplace) smoothing
A simple solution to the problem of generalization and zeros
Also called Laplace smoothing
Pretend we saw each word one more time than we did
Just add one to all the counts!
MLE estimate (Maximum Likelihood estimate):
P_MLE(wi | wi-1) = c(wi-1, wi) / c(wi-1)

Add-1 estimate:
P_Add-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)

Laplace-smoothed bigrams
where V is the vocabulary size
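A minimal sketch of the add-1 estimate in Python, reusing the unigram_counts and bigram_counts built from the mini-corpus sketch earlier (assumed to be in scope); V is the vocabulary size.

# Add-1 (Laplace) smoothed bigram estimate, reusing the counts built
# from the mini-corpus sketch above (assumed to be in scope).
V = len(unigram_counts)  # vocabulary size

def bigram_add1(w, w_prev):
    # P_Add-1(w | w_prev) = (c(w_prev, w) + 1) / (c(w_prev) + V)
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

print(bigram_add1("I", "<s>"))    # seen bigram: smoothed below its MLE of 2/3
print(bigram_add1("ham", "<s>"))  # unseen bigram: small but non-zero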
Add-1 estimation is a blunt instrument
So add-1 isn’t used for N-grams:
We’ll see better methods
But add-1 is used to smooth other NLP models
For text classification
In domains where the number of zeros isn’t so huge.
Reminder: Add-1 (Laplace) Smoothing, Add-k
P_Add-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)

P_Add-k(wi | wi-1) = (c(wi-1, wi) + k) / (c(wi-1) + kV)

P_Add-k(wi | wi-1) = (c(wi-1, wi) + m(1/V)) / (c(wi-1) + m)

Unigram prior smoothing
P_UnigramPrior(wi | wi-1) = (c(wi-1, wi) + m·P(wi)) / (c(wi-1) + m)
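Continuing the same sketch, add-k and unigram-prior smoothing are small generalizations of the add-1 estimator (again reusing the toy counts; k and m are hyperparameters chosen by the modeler).

# Add-k and unigram-prior smoothing, generalizing the add-1 sketch above.
def bigram_add_k(w, w_prev, k=0.5):
    # P_Add-k(w | w_prev) = (c(w_prev, w) + k) / (c(w_prev) + k*V)
    return (bigram_counts[(w_prev, w)] + k) / (unigram_counts[w_prev] + k * V)

def bigram_unigram_prior(w, w_prev, m=1.0):
    # P_UnigramPrior(w | w_prev) = (c(w_prev, w) + m*P(w)) / (c(w_prev) + m)
    p_w = unigram_counts[w] / sum(unigram_counts.values())
    return (bigram_counts[(w_prev, w)] + m * p_w) / (unigram_counts[w_prev] + m)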
Backoff and Interpolation
Sometimes it helps to use less context
Condition on less context for contexts you haven’t learned much about
Backoff:
use trigram if you have good evidence,
otherwise bigram, otherwise unigram
Interpolation:
mix unigram, bigram, trigram
Interpolation works better
Linear Interpolation
Simple interpolation:
P̂(wi | wi-2, wi-1) = λ1 P(wi | wi-2, wi-1) + λ2 P(wi | wi-1) + λ3 P(wi), with Σ λi = 1
Lambdas conditional on context: the λs can themselves depend on the preceding words, λi(wi-2, wi-1)
How to set the lambdas?
Use a held-out corpus
Training Data | Held-Out Data | Test Data
Choose λs to maximize the probability of held-out data:
Fix the N-gram probabilities (on the training data)
Then search for λs that give largest probability to held-out set:
log P(w1 … wn | M(λ1 … λk)) = Σ_i log P_{M(λ1 … λk)}(wi | wi-1)
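As a sketch of how the λs might be tuned, the following Python grid search (ours, with assumed estimator functions p_uni, p_bi, p_tri) picks the interpolation weights that maximize the log-probability of a held-out set.

import itertools
import math

def choose_lambdas(held_out_trigrams, p_uni, p_bi, p_tri, steps=10):
    # Grid search for the lambdas that maximize held-out log-probability.
    # `held_out_trigrams` is a list of (w_{i-2}, w_{i-1}, w_i) tuples;
    # p_uni, p_bi, p_tri are assumed probability estimators trained on the training set.
    best, best_ll = None, float("-inf")
    grid = [i / steps for i in range(steps + 1)]
    for l1, l2 in itertools.product(grid, repeat=2):
        l3 = 1.0 - l1 - l2
        if l3 < -1e-9:          # lambdas must sum to 1 and be non-negative
            continue
        l3 = max(l3, 0.0)
        ll = 0.0
        for w1, w2, w in held_out_trigrams:
            p = l1 * p_tri(w, w1, w2) + l2 * p_bi(w, w2) + l3 * p_uni(w)
            if p <= 0:
                ll = float("-inf")
                break
            ll += math.log(p)
        if ll > best_ll:
            best, best_ll = (l1, l2, l3), ll
    return best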
Unknown words: Open versus closed vocabulary tasks
If we know all the words in advance
Vocabulary V is fixed
Closed vocabulary task
Often we don’t know this
Out Of Vocabulary = OOV words
Open vocabulary task
Create an unknown word token <UNK>
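One possible way to implement this (a Python sketch under the assumption that we fix the vocabulary to the most frequent training words): replace out-of-vocabulary tokens with <UNK> before counting, then score unseen test words through the <UNK> entry.

from collections import Counter

def build_vocab(training_tokens, max_size=10000):
    # Keep the most frequent words; everything else becomes <UNK>.
    counts = Counter(training_tokens)
    vocab = {w for w, _ in counts.most_common(max_size)}
    vocab.add("<UNK>")
    return vocab

def map_unk(tokens, vocab):
    # Replace out-of-vocabulary tokens with <UNK>; <UNK> is then trained
    # (and scored at test time) like any other word.
    return [w if w in vocab else "<UNK>" for w in tokens]

# Usage sketch: vocab = build_vocab(train_tokens); train = map_unk(train_tokens, vocab)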
Huge web-scale n-grams
How to deal with, e.g., Google N-gram corpus
Pruning
Only store N-grams with count > threshold.
Remove singletons of higher-order n-grams
Entropy-based pruning
Efficiency
Efficient data structures like tries
Bloom filters: approximate language models
Store words as indexes, not strings
Use Huffman coding to fit large numbers of words into two bytes
Quantize probabilities (4-8 bits instead of 8-byte float)
Smoothing for Web-scale N-grams
“Stupid backoff” (Brants et al. 2007)
No discounting, just use relative frequencies
S(wi | wi-k+1 … wi-1) = count(wi-k+1 … wi) / count(wi-k+1 … wi-1)   if count(wi-k+1 … wi) > 0
                      = 0.4 · S(wi | wi-k+2 … wi-1)                 otherwise

S(wi) = count(wi) / N
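A minimal Python sketch of stupid backoff for arbitrary-order n-grams, following the recursion above; `counts` is a hypothetical dictionary mapping n-gram tuples to their counts, and the result is a relative-frequency score S, not a normalized probability.

def stupid_backoff(w, context, counts, total_tokens, alpha=0.4):
    # `counts` maps n-gram tuples of any order to their counts;
    # `context` is a tuple of preceding words, e.g. (w_{i-2}, w_{i-1}).
    if not context:
        return counts.get((w,), 0) / total_tokens        # S(w) = count(w) / N
    ngram = context + (w,)
    if counts.get(ngram, 0) > 0 and counts.get(context, 0) > 0:
        return counts[ngram] / counts[context]           # relative frequency, no discounting
    # back off to a shorter context, multiplying the score by alpha
    return alpha * stupid_backoff(w, context[1:], counts, total_tokens, alpha)

# Usage sketch: stupid_backoff("food", ("want", "chinese"), counts, total_tokens)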
N-gram Smoothing Summary
Add-1 smoothing:
OK for text categorization, not for language modeling
The most commonly used method:
Extended Interpolated Kneser-Ney
For very large N-grams like the Web:
Stupid backoff
Advanced Language Modeling
Discriminative models:
choose n-gram weights to improve a task, not to fit the training set
Parsing-based models
Caching Models
Recently used words are more likely to appear
These perform very poorly for speech recognition (why?)
P_CACHE(w | history) = λ P(wi | wi-2, wi-1) + (1 − λ) · c(w ∈ history) / |history|
Advanced smoothing algorithms
Intuition used by many smoothing algorithms
Good-Turing
Kneser-Ney
Witten-Bell
Use the count of things we’ve seen once
to help estimate the count of things we’ve never seen
Why Kneser-Ney Smoothing?
Problem with Traditional Smoothing:
Handles unseen n-grams but ignores context diversity.
Example:
"San Francisco" vs. "Francisco" (rare alone but frequent after "San").
Traditional methods overestimate "Francisco" in new contexts.
Kneser-Ney Smoothing – Key Idea
Instead of raw counts, it considers how many different contexts a word appears in.
Advantages of Kneser-Ney:
Handles Rare Words Better:
Considers context diversity.
Works well for unseen n-grams.
Outperforms other smoothing techniques in practice.
Notation: Nc = Frequency of frequency c
Nc = the count of things we’ve seen c times
Example corpus: Sam I am I am Sam I do not eat

Word    Count
I       3
sam     2
am      2
do      1
not     1
eat     1

N1 = 3, N2 = 2, N3 = 1
Good-Turing smoothing intuition
You are fishing (a scenario from Josh Goodman), and caught:
10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish
How likely is it that the next species is trout?
1/18
How likely is it that the next species is new (i.e., catfish or bass)?
Let’s use our estimate of things-we-saw-once to estimate the new things.
3/18 (because N1 = 3)
Assuming so, how likely is it that the next species is trout?
Must be less than 1/18
How to estimate?
Good Turing calculations
P*_GT(things with zero frequency) = N1 / N          c* = (c+1) Nc+1 / Nc

Unseen (bass or catfish):
  c = 0
  MLE p = 0/18 = 0
  P*_GT(unseen) = N1/N = 3/18

Seen once (trout):
  c = 1
  MLE p = 1/18
  c*(trout) = 2 · N2/N1 = 2 · 1/3 = 2/3
  P*_GT(trout) = (2/3) / 18 = 1/27
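The same calculation in a short Python sketch (ours), using the fishing counts above.

from collections import Counter

# The fishing example: species counts (18 fish total).
catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(catch.values())                # 18
Nc = Counter(catch.values())           # frequency of frequencies: N1=3, N2=1, N3=1, N10=1

p_unseen = Nc[1] / N                   # probability mass for unseen species: 3/18

def gt_count(c):
    # Good-Turing adjusted count: c* = (c+1) * N_{c+1} / N_c
    return (c + 1) * Nc[c + 1] / Nc[c]

c_star_trout = gt_count(1)             # 2 * N2/N1 = 2/3
p_trout = c_star_trout / N             # (2/3)/18 = 1/27
print(p_unseen, c_star_trout, p_trout)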
Ney et al.’s Good Turing Intuition
H. Ney, U. Essen, and R. Kneser, 1995. On the estimation of 'small' probabilities by leaving-one-out.
IEEE Trans. PAMI. 17:12,1202-1212
Held-out words: [figure comparing words in the training set with a held-out set]
Ney et al. Good Turing Intuition
(slide from Dan Klein)
Intuition from leave-one-out validation:
  Take each of the c training words out in turn
  c training sets of size c−1, held-out sets of size 1
  What fraction of held-out words are unseen in training? N1/c
  What fraction of held-out words are seen k times in training? (k+1)Nk+1/c
  So in the future we expect (k+1)Nk+1/c of the words to be those with training count k
  There are Nk words with training count k
  Each should occur with probability (k+1)Nk+1 / (c · Nk)
  So the adjusted count is k* = (k+1) Nk+1 / Nk
[Figure: training count classes N1, N2, N3, …, N3511, N4417 matched to held-out classes N0, N1, N2, …, N3510, N4416]
Good-Turing complications
(slide from Dan Klein)
Problem: what about “the”? (say c = 4417)
For small k, Nk > Nk+1
For large k, the counts are too jumpy, and zeros wreck the estimates
Simple Good-Turing [Gale and Sampson]: replace the empirical Nk with a best-fit power law once counts get unreliable
[Figure: Nk versus k, empirical and smoothed]
Resulting Good-Turing numbers
Numbers from Church and Gale (1991), 22 million words of AP Newswire
c* = (c+1) Nc+1 / Nc

Count c    Good-Turing c*
0          0.0000270
1          0.446
2          1.26
3          2.24
4          3.24
5          4.22
6          5.19
7          6.21
8          7.24
9          8.25
Language Modeling
Advanced:
Kneser-Ney Smoothing
Resulting Good-Turing numbers
Numbers from Church and Gale (1991), 22 million words of AP Newswire
c* = (c+1) Nc+1 / Nc
It sure looks like c* = (c − 0.75)

Count c    Good-Turing c*
0          0.0000270
1          0.446
2          1.26
3          2.24
4          3.24
5          4.22
6          5.19
7          6.21
8          7.24
9          8.25
Absolute Discounting Interpolation
Save ourselves some time and just subtract 0.75 (or some d)!
P_AbsoluteDiscounting(wi | wi-1) = (c(wi-1, wi) − d) / c(wi-1) + λ(wi-1) P(w)
(the first term is the discounted bigram; λ(wi-1) is the interpolation weight on the unigram P(w))
(Maybe keeping a couple extra values of d for counts 1 and 2)
But should we really just use the regular unigram P(w)?
Kneser-Ney Smoothing I
Better estimate for probabilities of lower-order unigrams!
Shannon game: I can’t see without my reading___________?
Francisco
glasses
“Francisco” is more common than “glasses”
… but “Francisco” always follows “San”
The unigram is useful exactly when we haven’t seen this bigram!
Instead of P(w): “How likely is w”
Pcontinuation(w): “How likely is w to appear as a novel continuation?”
For each word, count the number of bigram types it completes
Every bigram type was a novel continuation the first time it was seen
P_CONTINUATION(w) ∝ |{wi-1 : c(wi-1, w) > 0}|
Kneser-Ney Smoothing II
How many times does w appear as a novel continuation:
P_CONTINUATION(w) ∝ |{wi-1 : c(wi-1, w) > 0}|
Normalized by the total number of word bigram types, |{(wj-1, wj) : c(wj-1, wj) > 0}|:
P_CONTINUATION(w) = |{wi-1 : c(wi-1, w) > 0}| / |{(wj-1, wj) : c(wj-1, wj) > 0}|
Kneser-Ney Smoothing III
Alternative metaphor: the number of word types seen to precede w, |{wi-1 : c(wi-1, w) > 0}|,
normalized by the number of word types preceding all words:
P_CONTINUATION(w) = |{wi-1 : c(wi-1, w) > 0}| / Σ_{w'} |{w'i-1 : c(w'i-1, w') > 0}|
A frequent word (Francisco) occurring in only one context (San) will have a low
continuation probability
Kneser-Ney Smoothing IV
P_KN(wi | wi-1) = max(c(wi-1, wi) − d, 0) / c(wi-1) + λ(wi-1) P_CONTINUATION(wi)

λ is a normalizing constant: the probability mass we’ve discounted
λ(wi-1) = (d / c(wi-1)) · |{w : c(wi-1, w) > 0}|
  d / c(wi-1) is the normalized discount
  |{w : c(wi-1, w) > 0}| is the number of word types that can follow wi-1
    = the number of word types we discounted = the number of times we applied the normalized discount
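A minimal Python sketch of interpolated Kneser-Ney for bigrams, following the formula above and reusing the toy unigram_counts and bigram_counts from the earlier mini-corpus sketch (assumed in scope); d = 0.75 as suggested by the Good-Turing numbers.

def kneser_ney_bigram(w, w_prev, d=0.75):
    # Interpolated Kneser-Ney for bigrams, following the formula above.
    # Reuses `bigram_counts` and `unigram_counts` from the earlier sketch.
    # Discounted bigram term: max(c(w_prev, w) - d, 0) / c(w_prev)
    discounted = max(bigram_counts[(w_prev, w)] - d, 0) / unigram_counts[w_prev]

    # lambda(w_prev) = (d / c(w_prev)) * |{w' : c(w_prev, w') > 0}|
    follow_types = len({b for (a, b) in bigram_counts if a == w_prev})
    lam = (d / unigram_counts[w_prev]) * follow_types

    # Continuation probability: how many distinct left contexts w has,
    # normalized by the total number of bigram types.
    precede_types = len({a for (a, b) in bigram_counts if b == w})
    p_cont = precede_types / len(bigram_counts)

    return discounted + lam * p_cont

print(kneser_ney_bigram("am", "I"))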
Kneser-Ney Smoothing: Recursive formulation
P_KN(wi | wi-n+1 … wi-1) = max(c_KN(wi-n+1 … wi) − d, 0) / c_KN(wi-n+1 … wi-1) + λ(wi-n+1 … wi-1) · P_KN(wi | wi-n+2 … wi-1)

c_KN(·) = count(·) for the highest order
        = continuation count(·) for lower orders

Continuation count = number of unique single-word contexts for ·
HW
What is the primary limitation of Maximum Likelihood Estimation (MLE) in n-gram
models that Good-Turing smoothing addresses?
Explain the intuition behind Good-Turing smoothing. How does it adjust the counts of
unseen or rare n-grams?
Explain how Kneser-Ney smoothing would assign a probability to a word in a real-world example.
Reference Books
1. Daniel Jurafsky and James H. Martin. 2020. Speech and Language Processing, 3rd Edition (draft).
2. Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
3. Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, and Harshit Surana. 2020. Practical Natural Language Processing. O'Reilly.
4. NPTEL NLP course.
5. https://www.google.co.in/
6. Coursera course - Natural Language Processing