Week 3

Natural Language Processing involves probabilistic language modeling to assign probabilities to sequences of words. N-gram models are commonly used, which make the Markov assumption that the probability of a word depends only on the previous N-1 words. The chain rule of probability is applied to compute the joint probability of a sentence as the product of conditional probabilities of each word given previous words. These conditional probabilities are estimated from large corpora by calculating maximum likelihood estimates of n-gram probabilities based on raw counts of n-grams in the data. N-gram models are simple but often sufficient for many NLP tasks.


Natural Language Processing

Mahmmoud Mahdi
Language Modeling: N-Grams
Probabilistic Language Models
Goal: assign a probability to a sentence

● Machine Translation:
○ P(high winds tonite) > P(large winds tonite)
● Spell Correction
○ The office is about fifteen minuets from my house
■ P(about fifteen minutes from) > P(about fifteen minuets from)
● Speech Recognition
○ P(I saw a van) >> P(eyes awe of an)
● Also summarization, question answering, and many other tasks
Probabilistic Language Modeling
● Goal: compute the probability of a sequence of words:

P(W) = P(w1,w2,w3,w4,w5…wn)

● Related task: probability of an upcoming word:

P(w5|w1,w2,w3,w4)

● A model that computes P(W) or P(wn|w1,w2,…,wn-1) is called a language model.
● A better name would be “the grammar”, but “language model” (LM) is the standard term.
How to compute P(W)
● How to compute this joint probability:
○ P(its, water, is, so, transparent, that)

● Intuition: let’s rely on the Chain Rule of Probability


Reminder: The Chain Rule

● Recall the definition of conditional probabilities


P(B|A) = P(A,B) / P(A)

Rewriting: P(A,B) = P(A) P(B|A)

● More variables:
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)

● The Chain Rule in General

P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute the joint probability of words in a sentence

P(“its water is so transparent”) =

P(its) × P(water|its) × P(is|its water)

× P(so|its water is) × P(transparent|its water is so)
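
As a quick illustration, here is a minimal Python sketch that prints the chain-rule factorization of a sentence (the function name chain_rule_factors is ours, not from the slides):

# Print the chain-rule factorization P(w1) P(w2|w1) ... P(wn|w1..wn-1)
# for a given sentence. Purely symbolic; no probabilities are computed.
def chain_rule_factors(sentence):
    words = sentence.split()
    factors = []
    for i, w in enumerate(words):
        history = " ".join(words[:i])
        factors.append(f"P({w}|{history})" if history else f"P({w})")
    return " * ".join(factors)

print(chain_rule_factors("its water is so transparent"))
# P(its) * P(water|its) * P(is|its water)
#        * P(so|its water is) * P(transparent|its water is so)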


How to estimate these probabilities

● Could we just count and divide?

P(the | its water is so transparent that) =
Count(its water is so transparent that the) / Count(its water is so transparent that)

● No! Too many possible sentences!
● We’ll never see enough data for estimating these.
Markov Assumption

● Simplifying assumption (Andrei Markov):

P(the | its water is so transparent that) ≈ P(the | that)

● Or maybe:

P(the | its water is so transparent that) ≈ P(the | transparent that)
Markov Assumption

● In other words, we approximate each component in the product:

P(wi | w1,w2,…,wi-1) ≈ P(wi | wi-k,…,wi-1)
Simplest case: Unigram model

P(w1,w2,…,wn) ≈ ∏ P(wi)

Some automatically generated sentences from a unigram model
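
To make this concrete, here is a minimal sketch of unigram generation: each word is drawn independently from the unigram distribution, so word order is ignored entirely. The toy distribution below is invented for illustration, not taken from the slides.

import random

# Draw each word i.i.d. from a (made-up) unigram distribution;
# the resulting "sentence" has no notion of word order.
unigram_probs = {"the": 0.3, "of": 0.2, "water": 0.2, "is": 0.15, "transparent": 0.15}

words = list(unigram_probs)
weights = list(unigram_probs.values())
sentence = " ".join(random.choices(words, weights=weights, k=8))
print(sentence)  # e.g. "of the water the is transparent of the"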


Bigram model

● Condition on the previous word:

P(wi | w1,w2,…,wi-1) ≈ P(wi | wi-1)


N-gram models

● We can extend to trigrams, 4-grams, 5-grams


● In general this is an insufficient model of language, because language has long-distance dependencies (e.g., “The computer which I had just put into the machine room on the fifth floor crashed”).
● But we can often get away with N-gram models.
Language Modeling: Estimating N-gram Probabilities
Estimating bigram probabilities

● The Maximum Likelihood Estimate:

P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
An example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
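
A minimal Python sketch of the MLE computation on this tiny corpus (the helper function p is ours, not from the slides); it reproduces, for example, P(I|<s>) = 2/3 and P(Sam|<s>) = 1/3:

from collections import Counter

# MLE for bigrams: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}),
# estimated from the three training sentences above.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)                  # count(w)
    bigrams.update(zip(tokens, tokens[1:]))  # count(w_{i-1}, w_i)

def p(word, prev):
    return bigrams[(prev, word)] / unigrams[prev]

print(p("I", "<s>"))     # 2/3
print(p("Sam", "<s>"))   # 1/3
print(p("am", "I"))      # 2/3
print(p("</s>", "Sam"))  # 1/2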
More examples: Berkeley Restaurant Project sentences

● can you tell me about any good cantonese restaurants close by
● mid priced thai food is what i’m looking for
● tell me about chez panisse
● can you give me a listing of the kinds of food that are available
● i’m looking for a good place to eat breakfast
● when is caffe venezia open during the day
Raw bigram counts
● Out of 9222 sentences
Raw bigram probabilities
● Normalize by unigrams:

● Result:
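
A minimal numpy sketch of this normalization step, assuming a small made-up count table rather than the actual Berkeley Restaurant Project numbers:

import numpy as np

# Normalize by unigrams: divide each raw bigram count C(w_{i-1}, w_i)
# by the unigram count C(w_{i-1}) of its row's word.
# The counts below are toy values for illustration only.
words = ["i", "want", "to"]
unigram_counts = np.array([2500, 900, 2400])  # C(w) for each row word
bigram_counts = np.array([
    [5, 820, 0],    # C(i, i),    C(i, want),    C(i, to)
    [2, 0, 600],    # C(want, i), C(want, want), C(want, to)
    [10, 5, 4],     # C(to, i),   C(to, want),   C(to, to)
])

bigram_probs = bigram_counts / unigram_counts[:, None]  # row-wise divide
print(np.round(bigram_probs, 4))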
Bigram estimates of sentence probabilities
P(<s> I want english food </s>) =
P(I|<s>)
× P(want|I)
× P(english|want)
× P(food|english)
× P(</s>|food)
= .000031
What kinds of knowledge?

● P(english | want) = .0011
● P(chinese | want) = .0065
● P(to | want) = .66
● P(eat | to) = .28
● P(food | to) = 0
● P(want | spend) = 0
● P(i | <s>) = .25
Practical Issues

● We do everything in log space:

log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4

○ Avoids underflow
○ (also, adding is faster than multiplying)
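
A minimal sketch of log-space scoring, with placeholder probabilities standing in for bigram lookups from a trained model (the values are not taken from the slides' tables):

import math

# Sum log probabilities instead of multiplying raw probabilities,
# so long sentences do not underflow to 0.0 in floating point.
probs = [0.25, 0.33, 0.0011, 0.5, 0.68]  # P(w_i | w_{i-1}) per position

log_p = sum(math.log(p) for p in probs)
print(log_p)             # log-probability of the whole sentence
print(math.exp(log_p))   # convert back to a probability if needed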
Questions?
