Natural Language Processing
Mahmmoud Mahdi
Language Modeling
N-Grams
Probabilistic Language Models
Goal: assign a probability to a sentence
● Machine Translation:
○ P(high winds tonite) > P(large winds tonite)
● Spell Correction
○ The office is about fifteen minuets from my house
■ P(about fifteen minutes from) > P(about fifteen minuets from)
● Speech Recognition
○ P(I saw a van) >> P(eyes awe of an)
● + Summarization, question-answering, etc., etc.!!
Probabilistic Language Modeling
● Goal: compute the probability of a sequence of words:
P(W) = P(w1,w2,w3,w4,w5…wn)
● Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4)
● A model that computes P(W) or P(wn|w1,w2…wn-1) is called a
language model.
● A better name would be "the grammar", but "language model" (LM) is the standard term
How to compute P(W)
● How to compute this joint probability:
○ P(its, water, is, so, transparent, that)
● Intuition: let’s rely on the Chain Rule of Probability
Reminder: The Chain Rule
● Recall the definition of conditional probabilities
P(B|A) = P(A,B) / P(A)
Rewriting: P(A,B) = P(A) P(B|A)
● More variables:
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
● The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute joint probability of
words in sentence
P(“its water is so transparent”) =
P(its) × P(water|its) × P(is|its water)
× P(so|its water is) × P(transparent|its water is so)
How to estimate these probabilities
● Could we just count and divide?
○ No! Too many possible sentences!
○ We'll never see enough data to estimate these counts
Markov Assumption
Simplifying assumption (due to Andrei Markov):
P(the | its water is so transparent that) ≈ P(the | that)
Or maybe:
P(the | its water is so transparent that) ≈ P(the | transparent that)
Markov Assumption
P(w1,w2,…,wn) ≈ ∏ P(wi | wi-k,…,wi-1)
● In other words, we approximate each component in the product:
P(wi | w1,…,wi-1) ≈ P(wi | wi-k,…,wi-1)
Simplest case: Unigram model
P(w1,w2,…,wn) ≈ ∏ P(wi)
● Automatically generated sentences from a unigram model read like unordered word lists, because every word is sampled independently of its neighbors
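As a minimal sketch of what "generating from a unigram model" means, the snippet below samples each word independently from a toy distribution; the probabilities and the </s> stopping convention are invented for illustration, not taken from any corpus in these slides.

```python
import random

# Toy unigram distribution (invented for illustration only).
unigram_probs = {
    "the": 0.20, "of": 0.10, "is": 0.10, "water": 0.05,
    "transparent": 0.05, "so": 0.05, "its": 0.05, "</s>": 0.40,
}

def generate_unigram_sentence(probs, max_len=20):
    """Sample words independently until </s> or max_len is reached."""
    words = []
    choices, weights = list(probs), list(probs.values())
    for _ in range(max_len):
        w = random.choices(choices, weights=weights, k=1)[0]
        if w == "</s>":
            break
        words.append(w)
    return " ".join(words)

print(generate_unigram_sentence(unigram_probs))
```

Because each draw ignores the previous words, the output is word salad, which is exactly why the unigram model is only the simplest baseline.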
Bigram model
● Condition on the previous word:
P(wi | w1,w2,…,wi-1) ≈ P(wi | wi-1)
N-gram models
● We can extend to trigrams, 4-grams, 5-grams (a short sketch follows this list)
● In general this is an insufficient model of language, because language has long-distance dependencies
● But we can often get away with N-gram models
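A quick sketch of how the "N" just sets the context window size: the helper below extracts padded N-grams from a tokenized sentence (the function name and the <s>/</s> padding convention are my own choices, not something defined in these slides).

```python
def ngrams(tokens, n):
    """Return all n-grams of a sentence, padded with <s> and </s> markers."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

sent = "its water is so transparent".split()
print(ngrams(sent, 2))  # bigrams
print(ngrams(sent, 3))  # trigrams
```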
Language Modeling
Estimating N-gram Probabilities
Estimating bigram probabilities
● The Maximum Likelihood Estimate:
P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
An example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
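A minimal count-and-divide sketch of the maximum likelihood estimate on the three training sentences above; the variable and function names are mine, not part of the slides.

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts, bigram_counts = Counter(), Counter()
for line in corpus:
    tokens = line.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """P(word | prev) = count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("I", "<s>"))   # 2/3: <s> is followed by I in 2 of the 3 sentences
print(p_mle("am", "I"))    # 2/3
print(p_mle("Sam", "am"))  # 1/2
```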
More examples: Berkeley Restaurant Project sentences
● can you tell me about any good cantonese restaurants close by
● mid priced thai food is what i’m looking for
● tell me about chez panisse
● can you give me a listing of the kinds of food that are available
● i’m looking for a good place to eat breakfast
● when is caffe venezia open during the day
Raw bigram counts
● Out of 9222 sentences
Raw bigram probabilities
● Normalize by unigrams:
● Result:
Bigram estimates of sentence probabilities
P(<s> I want english food </s>) =
P(I|<s>)
× P(want|I)
× P(english|want)
× P(food|english)
× P(</s>|food)
= .000031
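The Berkeley bigram tables themselves are not reproduced here, but the same chaining can be sketched on the tiny Sam corpus from earlier; the hard-coded values below are the MLE bigram probabilities from those three sentences.

```python
# MLE bigram probabilities from the <s> I am Sam </s> corpus above.
bigram_p = {
    ("<s>", "I"): 2/3,
    ("I", "am"): 2/3,
    ("am", "Sam"): 1/2,
    ("Sam", "</s>"): 1/2,
}

def sentence_prob(tokens, probs):
    """Multiply P(w_i | w_{i-1}) over consecutive token pairs."""
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= probs.get((prev, word), 0.0)  # unseen bigram -> probability 0
    return p

print(sentence_prob("<s> I am Sam </s>".split(), bigram_p))  # 1/9 ≈ 0.111
```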
What kinds of knowledge?
● P(english | want) = .0011
● P(chinese | want) = .0065
● P(to | want) = .66
● P(eat | to) = .28
● P(food | to) = 0
● P(want | spend) = 0
● P(i | <s>) = .25
Practical Issues
● We do everything in log space
○ avoids numerical underflow
○ adding is also faster than multiplying
○ log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
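A minimal sketch of the log-space trick: summing log probabilities gives the same answer as multiplying the probabilities, without the risk of underflow on long sentences. The probability values below are illustrative only, not real corpus estimates.

```python
import math

# Illustrative bigram probabilities for one sentence (not real corpus values).
probs = [0.25, 0.33, 0.0011, 0.5, 0.68]

direct = math.prod(probs)                  # repeated multiplication
log_sum = sum(math.log(p) for p in probs)  # addition in log space

print(direct)             # ~3.1e-05
print(math.exp(log_sum))  # same value, recovered from log space
```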
Questions?