Unit-V
Language Modeling
• Introduction
• n-Gram Models
• Language Model Evaluation
• Parameter Estimation
• Language Model Adaptation
• Types of Language Models
• Language Specific Modeling Problems
• Multilingual and Cross-lingual Language Modeling
Introduction
• A statistical language model specifies the a priori probability of a
particular word sequence in the language of interest.
• Given an alphabet or inventory of units ∑ and a sequence W =
w1w2…wt ∈ ∑*, a language model can be used to compute the
probability of W based on parameters previously estimated from a
training set.
• The inventory ∑ is the list of unique words encountered in the training
data.
• Selecting the units over which a language model should be defined is
a difficult problem particularly in languages other than English.
Introduction
• A language model is typically combined with one or more other models
that hypothesize possible word sequences.
• In speech recognition, for example, the recognizer combines acoustic
model scores with language model scores to decode spoken word
sequences from an acoustic signal.
• Language models have also become a standard tool in information
retrieval, authorship identification, and document classification.
n-Gram Models
• Estimating the probability of every word sequence directly is not
possible, because natural language permits an infinite number of word
sequences of variable length.
• The probability P(W) can be decomposed into a product of component
probabilities according to the chain rule of probability:
P(W) = P(w1) P(w2|w1) P(w3|w1w2) … P(wt|w1…wt-1)
• Since the individual terms in the above product are difficult to estimate
directly, the n-gram approximation was introduced.
n-Gram Models
• The assumption is that all the preceding words except the n-1 words
directly preceding the current word are irrelevant for predicting the
current word.
• Hence P(W) is approximated as:
P(W) ≈ ∏i=1..t P(wi|wi-n+1…wi-1)
• This model is also called an (n-1)-th order Markov model, because the
current word is assumed to be independent of all preceding words
except the n-1 words directly before it.
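• As an illustration, a minimal Python sketch of scoring a word sequence
under the n-gram approximation; the probability table `probs` and the
flooring of unseen n-grams are illustrative assumptions, not part of the
definition above.
```python
import math

# `probs` is a hypothetical dictionary mapping (history_tuple, word) -> probability,
# assumed to have been estimated from a training corpus beforehand.
def sentence_logprob(words, probs, n=3, floor=1e-7):
    """Sum of log P(wi | previous n-1 words) under the n-gram approximation."""
    logp = 0.0
    for i, w in enumerate(words):
        history = tuple(words[max(0, i - (n - 1)):i])  # at most n-1 preceding words
        p = probs.get((history, w), floor)             # small floor for unseen n-grams
        logp += math.log(p)
    return logp
```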
Language Model Evaluation
• Now let us look at the problem of judging the performance of a
language model.
• The question is: how can we tell whether a language model is
successful at estimating word sequence probabilities?
• Two criteria are used:
• Coverage rate and perplexity, measured on a held-out test set that
does not form part of the training data.
• The coverage rate measures the percentage of n-grams in the test set
that are represented in the language model.
• A special case is the out-of-vocabulary rate (OOV) which is the
percentage of unique word types not covered by the language model.
Language Model Evaluation
• The second criterion, perplexity, is an information-theoretic measure.
• Given a model p of a discrete probability distribution, perplexity can
be defined as 2 raised to the entropy of p:
PP(p) = 2^H(p), where H(p) = −∑x p(x) log2 p(x)
• In language modeling we are more interested in the performance of a
language model q on a test set of a fixed size, say t words (w1w2…wt).
• The language model perplexity can then be computed as:
PP(q) = 2^(−(1/t) ∑i=1..t log2 q(wi))
• q(wi) is the probability the model assigns to the ith word.
Language Model Evaluation
• If q(wi) is an n-gram probability, the equation becomes:
PP(q) = 2^(−(1/t) ∑i=1..t log2 q(wi|wi-n+1…wi-1))
• When comparing different language models, their perplexities must
be normalized with respect to the same number of units in order to
obtain a meaningful comparison.
• Perplexity is the average number of equally likely successor words
when transitioning from one position in the word string to the next.
• If the model has no predictive power, perplexity is equal to the
vocabulary size.
Language Model Evaluation
• A model achieving perfect prediction has a perplexity of one.
• The goal in language model development is to minimize the perplexity
on a held-out data set representative of the domain of interest.
• Sometimes the goal of language modeling is instead to distinguish
between “good” and “bad” word sequences.
• In such cases the optimization criterion may not be perplexity
minimization.
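• A minimal Python sketch of the perplexity computation above; the
probability table `probs` and the floor for unseen n-grams are
illustrative assumptions.
```python
import math

# `probs` is a hypothetical (history_tuple, word) -> probability table,
# estimated beforehand on training data.
def perplexity(test_words, probs, n=3, floor=1e-7):
    """Perplexity = 2 ** ( -(1/t) * sum_i log2 q(wi | wi-n+1 ... wi-1) )."""
    t = len(test_words)
    log2_sum = 0.0
    for i, w in enumerate(test_words):
        history = tuple(test_words[max(0, i - (n - 1)):i])
        log2_sum += math.log2(probs.get((history, w), floor))
    return 2.0 ** (-log2_sum / t)
```
• For a model with no predictive power (a uniform distribution over a
vocabulary of size V), this value approaches V, matching the
interpretation given above.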
Parameter Estimation
• Maximum-Likelihood Estimation and Smoothing
• Bayesian Parameter Estimation
• Large-Scale Language Models
Maximum-Likelihood Estimation and Smoothing
• The standard procedure in training n-gram models is to estimate n-
gram probabilities using the maximum-likelihood criterion in
combination with parameter smoothing.
• The maximum-likelihood estimate is obtained by simply computing
relative frequencies, for example for a trigram model:
P(wi|wi-1,wi-2) = c(wi-2,wi-1,wi) / c(wi-2,wi-1)
• Where c(wi-2,wi-1,wi) is the count of the trigram wi-2wi-1wi in the
training data and c(wi-2,wi-1) is the count of its bigram history.
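• A minimal Python sketch of the relative-frequency (maximum-likelihood)
estimate for a trigram model; function and variable names are
illustrative.
```python
from collections import Counter

def mle_trigram_probs(corpus_words):
    """Maximum-likelihood trigram estimates: c(wi-2, wi-1, wi) / c(wi-2, wi-1)."""
    trigrams = Counter(zip(corpus_words, corpus_words[1:], corpus_words[2:]))
    bigrams = Counter(zip(corpus_words, corpus_words[1:]))
    return {(u, v, w): c / bigrams[(u, v)] for (u, v, w), c in trigrams.items()}
```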
Maximum-Likelihood Estimation and
Smoothing
• This method fails to assign nonzero probabilities to word sequences
that have not been observed in the training data.
• The probability of sequences that were observed might also be
overestimated.
• The process of redistributing probability mass such that peaks in the
n-gram probability distribution are flattened and zero estimates are
floored to some small nonzero value is called smoothing.
• The most common smoothing technique is backoff.
Maximum-Likelihood Estimation and
Smoothing
• Backoff involves splitting n-grams into those whose counts in the
training data fall below a predetermined threshold τ and those whose
counts exceed the threshold.
• In the former case the maximum-likelihood estimate of the n-gram
probability is replaced with an estimate derived from the probability
of the lower-order (n-1)-gram and a backoff weight.
• In the latter case, n-grams retain their maximum-likelihood estimates,
discounted by a factor that redistributes probability mass to the
lower-order distribution.
Maximum-Likelihood Estimation and
Smoothing
• The backoff probability PBO for wi given wi-1,wi-2 is computed as
follows:
PBO(wi|wi-1,wi-2) = dc P(wi|wi-1,wi-2)            if c > τ
PBO(wi|wi-1,wi-2) = α(wi-1,wi-2) PBO(wi|wi-1)     otherwise
• Where c is the count of the trigram wi-2wi-1wi and dc is a discounting
factor applied to the higher-order distribution.
• The normalization factor α(wi-1,wi-2) ensures that the entire distribution
sums to one and is computed as:
α(wi-1,wi-2) = (1 − ∑w:c>τ PBO(w|wi-1,wi-2)) / (1 − ∑w:c>τ PBO(w|wi-1))
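• A minimal Python sketch of the backoff recursion; the tables
`disc_probs` and `backoff_weights` are hypothetical, assumed to have
been precomputed from discounted counts as described above.
```python
# `disc_probs` maps (history_tuple, word) -> discounted probability for n-grams
# retained above the threshold; `backoff_weights` maps history_tuple -> alpha.
def backoff_prob(w, history, disc_probs, backoff_weights, floor=1e-7):
    """P_BO(w | history): discounted estimate if retained, otherwise back off."""
    if (history, w) in disc_probs:          # count exceeded the threshold tau
        return disc_probs[(history, w)]
    if not history:                         # no lower-order model left: use a floor
        return floor
    alpha = backoff_weights.get(history, 1.0)
    return alpha * backoff_prob(w, history[1:], disc_probs, backoff_weights, floor)
```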
Maximum-Likelihood Estimation and
Smoothing
• The way in which the discounting factor is computed determines the
precise smoothing technique.
• Well-known techniques include:
• Good-Turing
• Witten-Bell
• Kneser-Ney
• In Kneser-Ney smoothing a fixed discounting parameter D is applied
to the raw n-gram counts before computing the probability estimates:
P(wi|wi-1,wi-2) = max(c(wi-2,wi-1,wi) − D, 0) / c(wi-2,wi-1)
Maximum-Likelihood Estimation and
Smoothing
• In modified Kneser-Ney smoothing, which is one of the most widely
used techniques, different discounting factors D1, D2, and D3+ are used
for n-grams with exactly one, two, or three or more counts:
D1 = 1 − 2Y(n2/n1),  D2 = 2 − 3Y(n3/n2),  D3+ = 3 − 4Y(n4/n3),
where Y = n1/(n1 + 2n2)
• Where n1, n2, … are the numbers of n-grams occurring exactly once,
twice, … in the training data.
Maximum-Likelihood Estimation and
Smoothing
• Another common way of smoothing language model estimates is linear
model interpolation.
• In linear interpolation, M models are combined as:
P(wi|h) = ∑m=1..M λm Pm(wi|h)
• Where λm is a model-specific weight.
• The following constraints hold for the model weights: 0 ≤ λm ≤ 1 and
∑m λm = 1.
• Weights are estimated by maximizing the log-likelihood on a held-out
data set that is different from the training set of the component models.
Maximum-Likelihood Estimation and
Smoothing
• This is done using the expectation-maximization (EM) procedure.
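• A minimal Python sketch of estimating interpolation weights with EM;
the table `heldout_probs` of per-word component probabilities is a
hypothetical precomputed input.
```python
def em_interpolation_weights(heldout_probs, iterations=20):
    """Estimate mixture weights lambda_m by EM.

    `heldout_probs[i][m]` is the probability P_m(wi | hi) that component
    model m assigns to the i-th held-out word."""
    M = len(heldout_probs[0])
    lam = [1.0 / M] * M                              # start from uniform weights
    for _ in range(iterations):
        expected = [0.0] * M
        for p in heldout_probs:
            mix = sum(lam[m] * p[m] for m in range(M))
            for m in range(M):
                expected[m] += lam[m] * p[m] / mix   # E-step: responsibility of model m
        total = sum(expected)
        lam = [e / total for e in expected]          # M-step: renormalize to sum to one
    return lam
```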
Bayesian Parameter Estimation
• This is an alternative parameter estimation method in which the set of
parameters is viewed as a random variable governed by a prior
statistical distribution.
• Given a training sample S and a set of parameters 𝜃, P(𝜃) denotes a
prior distribution over different possible values of 𝜃, and P(𝜃|S) is the
posterior distribution, expressed using Bayes' rule as:
P(𝜃|S) = P(S|𝜃) P(𝜃) / P(S)
Bayesian Parameter Estimation
• In language modeling, 𝜃 = <P(w1),…,P(wK)> (where K is the
vocabulary size) for a unigram model.
• For an n-gram model, 𝜃 = <P(w1|h1),…,P(wK|hK)> with K n-grams and
histories h of a specified length.
• The training sample S is a sequence of words, w1…wt.
• We require a point estimate of 𝜃 given the constraints expressed by
the prior distribution and the training sample.
• A maximum a posteriori (MAP) estimate can be used for this.
Bayesian Parameter Estimation
• The Bayesian criterion finds the expected value of 𝜃 given the sample S:
E[𝜃|S] = ∫ 𝜃 P(𝜃|S) d𝜃
• Assuming that the prior distribution is a uniform distribution, the MAP is
equivalent to the maximum-likelihood estimate.
Bayesian Parameter Estimation
• The Bayesian (expected-value) estimate under a uniform prior is
equivalent to the maximum-likelihood estimate with Laplace (add-one)
smoothing:
P(wk|S) = (nk + 1) / (∑j nj + K), where nk is the count of word k in S
• Different choices for the prior distribution lead to different estimation
functions.
• The most commonly used prior distribution in language modeling is the
Dirichlet distribution.
Bayesian Parameter Estimation
• The Dirichlet distribution is the conjugate prior of the multinomial
distribution. It is defined as:
P(𝜃|α1,…,αK) = ( Γ(∑k αk) / ∏k Γ(αk) ) ∏k 𝜃k^(αk−1)
• Where Γ is the gamma function and α1,…,αK are the parameters of
the Dirichlet distribution.
• It can also be thought of as counts derived from an a priori training
sample.
Bayesian Parameter Estimation
• The MAP estimate under the Dirichlet prior is:
P(wk) = (nk + αk − 1) / ∑j (nj + αj − 1)
• Where nk is the number of times word k occurs in the training sample.
• The posterior is another Dirichlet distribution, parameterized by nk + αk.
• The MAP estimate of P(𝜃|W,α) is thus equivalent to the maximum-likelihood
estimate with add-m smoothing:
• mk = αk − 1; that is, pseudocounts of size αk − 1 are added to each word count.
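• A minimal Python sketch of the MAP unigram estimate under a Dirichlet
prior; the per-word prior parameters `alphas` are a hypothetical input.
```python
from collections import Counter

def map_unigram_probs(training_words, vocab, alphas=None):
    """MAP unigram estimate under a Dirichlet prior:
    P(wk) = (nk + alpha_k - 1) / sum_j (nj + alpha_j - 1)."""
    alphas = alphas or {}                  # hypothetical per-word prior parameters
    counts = Counter(training_words)
    adjusted = {w: counts[w] + alphas.get(w, 1.0) - 1.0 for w in vocab}
    total = sum(adjusted.values())
    return {w: a / total for w, a in adjusted.items()}
```
• With the default αk = 1 this reduces to the maximum-likelihood estimate;
with αk = 2 for every word it becomes Laplace (add-one) smoothing.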
Large-Scale Language Models
• As the amount of available monolingual data increases daily, models can be
built from sets as large as several billion or even trillion words.
• Scaling language models to data sets of this size requires modifications to
the ways in which language models are trained.
• There are several approaches to large-scale language modeling.
• In distributed language modeling, the entire training data is subdivided
into several partitions, and counts or probabilities derived from each
partition are stored in separate physical locations.
• Distributed language modeling scales to very large amounts of data and
large vocabulary sizes and allows new data to be added dynamically without
having to recompute static model parameters.
Large-Scale Language Models
• The drawback of distributed approaches is the slow speed of
networked queries.
• One technique uses the raw relative frequency estimate instead of a
discounted probability whenever the n-gram count exceeds the minimum
threshold (in this case 0):
S(wi|wi-n+1…wi-1) = c(wi-n+1…wi) / c(wi-n+1…wi-1)   if c(wi-n+1…wi) > 0
S(wi|wi-n+1…wi-1) = α S(wi|wi-n+2…wi-1)             otherwise
• The α parameter is fixed for all contexts rather than being dependent
on the lower-order n-gram.
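• A minimal Python sketch of this count-based, fixed-α backoff scoring;
the layout of the `counts` table and the default α value (around 0.4 is a
commonly cited choice in the literature) are illustrative assumptions.
```python
# `counts` is a hypothetical dict from word tuples to raw counts, with counts[()]
# set to the total number of tokens so that unigram relative frequencies work.
def backoff_score(w, history, counts, alpha=0.4):
    """Raw relative frequency if the n-gram was seen; otherwise a fixed penalty
    alpha times the lower-order score (these are scores, not true probabilities)."""
    ngram = history + (w,)
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[history]
    if not history:
        return 1e-7                        # floor for words never seen at all
    return alpha * backoff_score(w, history[1:], counts, alpha)
```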
Large-Scale Language Models
• An alternative possibility is to use large-scale distributed language
models only in a second-pass rescoring stage, after first-pass
hypotheses have been generated using a smaller language model.
• The overall trend in large-scale language modeling is to abandon
exact parameter estimation of the type described above in favor of
approximate techniques.
Language Model Adaptation
• Language model adaptation is about designing and tuning a model such that
it performs well on a new test set for which little equivalent training data is
available.
• The most commonly used adaptation method is that of mixture language
models or model interpolation.
• One popular method is topic-dependent language model adaptation.
• The documents are first clustered into a large number of different topics
and individual language models can be built for each topic cluster.
• The desired final model is then fine-tuned by choosing and interpolating a
smaller number of topic-specific language models.
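• A minimal Python sketch of this topic-mixture adaptation; the
topic-specific models and their weights are hypothetical inputs, with the
weights tuned, e.g., by EM on in-domain held-out data as in the
interpolation sketch above.
```python
def adapted_prob(w, history, topic_models, topic_weights):
    """Adapted probability as a weighted mixture of topic-specific language models.

    `topic_models` is a list of callables returning P_m(w | history);
    `topic_weights` are the corresponding interpolation weights."""
    return sum(lam * model(w, history)
               for lam, model in zip(topic_weights, topic_models))
```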
Language Model Adaptation
• A form of dynamic self-adaptation of a language model is provided by
trigger models.
• The idea is that, in accordance with the underlying topic of the text,
certain word combinations are more likely than others to co-occur.
• Some words are said to trigger others; for example, the words stock
and market in a financial news text.