Probabilistic language models

Notes by Yuzhen Feng


November 20, 2024

1 Citation
• Blei, David M. Probabilistic topic models. Communications of the ACM 55.4 (2012): 77-84.
• Bengio, Yoshua, Réjean Ducharme, and Pascal Vincent. A neural probabilistic language model.
Advances in Neural Information Processing Systems 13 (2000).
• Jurafsky, Daniel, and James H. Martin. N-gram language models. Speech and Language Processing.
Draft of August 20, 2024.

2 Notes
2.1 N-gram language model
The general objective of language models is to learn and infer p(w[1 : T ]), where w[1 : t] denotes the first t
words and w[t] the t-th one. Using the chain rule of conditional probability, we can decompose it as
p(w[1 : T ]) = ∏_t p(w[t] | w[1 : t − 1]).    (1)
However, computing ∏_t p(w[t] | w[1 : t − 1]) directly is still very difficult. Under the Markov assumption, we approximate

p(w[t]|w[1 : t − 1]) ≈ p(w[t]|w[t − N + 1 : t − 1]), (2)

and this is called the N-gram model. The key challenge of the N-gram model is zeros: word sequences that
never appear in the training set but do appear in the test set. Such sequences receive a probability of zero,
which underestimates their true probability and makes the perplexity ill-defined.
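As a toy illustration (my own sketch, not from the readings), the following Python snippet estimates maximum-likelihood bigram probabilities and shows how an unseen bigram receives probability zero:

```python
# Minimal sketch (toy corpus, not from the cited papers): maximum-likelihood
# bigram estimation, showing how an unseen bigram gets probability zero.
from collections import Counter

train = "the sea is blue the sea is calm".split()
bigrams = Counter(zip(train, train[1:]))   # C(w[t-1], w[t])
unigrams = Counter(train)                  # C(w[t-1])

def p_mle(prev, word):
    # p(w[t] | w[t-1]) = C(w[t-1], w[t]) / C(w[t-1])
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_mle("sea", "is"))    # seen bigram   -> 1.0
print(p_mle("sea", "calm"))  # unseen bigram -> 0.0, the "zeros" problem
```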
There are several methods to deal with the zeros. The first is to smooth the probabilities, as in the
following add-k method
p(w[t] | w[t − 1]) = (C(w[t], w[t − 1]) + k) / (C(w[t − 1]) + kV),    (3)
where V is the vocabulary size. The second, interpolation, estimates the probability as a linear combination
of conditional probabilities on shorter contexts

p(w[t]|w[t − 2 : t − 1]) = λ1 p(w[t]) + λ2 p(w[t]|w[t − 1]) + λ3 p(w[t]|w[t − 2 : t − 1]). (4)

The third, backoff, simply estimates the probability with a lower-order N-gram, possibly with a
discounting factor.
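A minimal sketch of the first two remedies, using the same kind of toy corpus (the corpus and the interpolation weights are assumptions, not values from the readings):

```python
# Minimal sketch (toy corpus and assumed lambdas): add-k smoothing (Eq. 3)
# and linear interpolation (Eq. 4) for a bigram model.
from collections import Counter

train = "the sea is blue the sea is calm".split()
V = len(set(train))                         # vocabulary size
T = len(train)
unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))

def p_addk(prev, w, k=1.0):
    # Eq. (3): (C(w[t], w[t-1]) + k) / (C(w[t-1]) + k*V)
    return (bigrams[(prev, w)] + k) / (unigrams[prev] + k * V)

def p_interp(prev, w, lambdas=(0.3, 0.7)):
    # Eq. (4), truncated to unigram and bigram terms; the lambdas must sum to 1
    l1, l2 = lambdas
    p_uni = unigrams[w] / T
    p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    return l1 * p_uni + l2 * p_bi

print(p_addk("sea", "calm"))    # unseen bigram now has a small nonzero probability
print(p_interp("sea", "calm"))  # backed up by the unigram probability of "calm"
```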
Perplexity is used to evaluate language models and is calculated as

P(w[1 : T ])^{−1/T} = 2^{H(W)},    (5)

where H(W) is the cross-entropy. From the perspective of probability, a more accurate model assigns a
higher probability to the test set and therefore has lower perplexity. From the perspective of entropy, a
more accurate model has lower cross-entropy, closer to the true entropy of the language, and thus lower
perplexity. From both perspectives, perplexity is a good way to evaluate a language model.
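A small sketch of the computation, with hypothetical per-word probabilities standing in for some model's output:

```python
# Minimal sketch (hypothetical per-word probabilities): perplexity as in Eq. (5),
# i.e. P(w[1:T])^(-1/T) = 2^(H(W)) with H(W) measured in bits per word.
import math

word_probs = [0.2, 0.1, 0.05, 0.3]                  # assumed p(w[t] | context) values
T = len(word_probs)

log2_prob = sum(math.log2(p) for p in word_probs)   # log2 P(w[1:T])
cross_entropy = -log2_prob / T                      # H(W)
perplexity = 2 ** cross_entropy                     # equals P(w[1:T]) ** (-1/T)

print(cross_entropy, perplexity)                    # lower is better for both
```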

2.2 Neural probabilistic language model
Dimension reduction and smooth representation In language models, each word w[t] can be represented
by a V -dimensional one-hot vector with a single non-zero element. However, such a representation is
discrete and not smooth: “walking” and “running”, or “blue sea” and “sea is blue”, may look entirely different,
because similarities (1) between words and (2) between contexts with different arrangements of words are not captured.
The paper proposes word feature vectors in place of the original 0-1 vectors to solve this problem. The
detailed model is as follows.

• w → x: x = C(w), where x is the word feature vector;
• x → y: y = b + W x + U tanh(d + Hx);
• y → p: p = softmax(y).
Here, the parameters to be learned are θ = (b, d, W, U, H, C). A parallel algorithm is proposed to learn
the parameters from the dataset (the main computational workload comes from the softmax normalization
over the vocabulary). Extensions can be made by also representing the output word with its word feature vector.
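As a hedged illustration of the three mappings above, here is a NumPy sketch of the forward pass with assumed sizes (V, m, n, h and the parameter initialization are my assumptions, not the authors' implementation):

```python
# Minimal sketch of the forward pass y = b + W x + U tanh(d + H x), p = softmax(y),
# with assumed sizes; parameter shapes follow the notation in the notes above.
import numpy as np

V, m, n, h = 1000, 30, 3, 50      # vocab size, feature dim, context length, hidden units
rng = np.random.default_rng(0)

C = rng.normal(scale=0.1, size=(V, m))        # word feature vectors (the matrix C)
H = rng.normal(scale=0.1, size=(h, n * m))    # hidden layer weights
d = np.zeros(h)                               # hidden layer bias
U = rng.normal(scale=0.1, size=(V, h))        # hidden-to-output weights
W = rng.normal(scale=0.1, size=(V, n * m))    # direct input-to-output weights
b = np.zeros(V)                               # output bias

def forward(context_word_ids):
    # w -> x: look up and concatenate the feature vectors of the context words
    x = C[context_word_ids].reshape(-1)       # shape (n*m,)
    # x -> y: affine plus tanh hidden layer, plus a direct linear connection
    y = b + W @ x + U @ np.tanh(d + H @ x)    # shape (V,)
    # y -> p: softmax over the whole vocabulary (the expensive normalization step)
    e = np.exp(y - y.max())
    return e / e.sum()

p = forward([12, 7, 45])                      # assumed word ids for an n-word context
print(p.shape, p.sum())                       # (1000,) and ~1.0
```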

2.3 Probabilistic topic model


Mixed membership within one single group The objective of probabilistic topic models is to use the
observed documents to infer the hidden topic structure. The detailed model is:
p(β, θ, z, w) = ∏_{k=1}^{K} p(β_k) ∏_{d=1}^{D} ( p(θ_d) ∏_{n=1}^{N} p(z_{d,n} | θ_d) p(w_{d,n} | β, z_{d,n}) )    (6)

For each topic k, a distributional realization β_k shows how words are distributed under topic k. For each
document d, a distributional realization θ_d shows how topics are distributed in document d. For each
word position n, given θ_d, a realization z_{d,n} shows which topic the word is about, and given β and z_{d,n}
(i.e., β_{z_{d,n}}), a realization w_{d,n} shows which word it is. Extensions can be made by generating each word
conditional on the previous one.
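A minimal sketch of the generative process behind Eq. (6), with assumed corpus sizes and Dirichlet hyperparameters (α and η are assumptions, not values from Blei's paper):

```python
# Minimal sketch (assumed sizes and priors): the generative process behind Eq. (6),
# drawing beta_k, theta_d, z_{d,n}, and w_{d,n} for a toy corpus.
import numpy as np

rng = np.random.default_rng(0)
K, D, N, V = 3, 5, 20, 50         # topics, documents, words per document, vocabulary size
eta, alpha = 0.1, 0.5             # assumed Dirichlet hyperparameters

beta = rng.dirichlet(np.full(V, eta), size=K)     # beta_k: word distribution of topic k
docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))    # theta_d: topic distribution of document d
    words = []
    for n in range(N):
        z_dn = rng.choice(K, p=theta_d)           # z_{d,n} ~ p(z | theta_d)
        w_dn = rng.choice(V, p=beta[z_dn])        # w_{d,n} ~ p(w | beta, z_{d,n})
        words.append(w_dn)
    docs.append(words)

print(len(docs), len(docs[0]))                    # 5 documents, 20 word ids each
```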

3 Discussion Questions
• What are N-grams? What are the key challenges of estimating N-gram probabilities?
• What does perplexity mean? Is it a good way to evaluate a language model? Why?

• What are the assumptions behind probabilistic topic models? A generative process is assumed.
Specifically, the number of topics is finite and given in advance, and topics and words in each document
are generated independently. Because of this independence, word order, context, and correlation are ignored.
• What are the pros and cons of Latent Dirichlet Allocation (LDA) in terms of language modeling? It
introduces themes into language modeling: a document revolves around one or several themes. It can
also be generalized to model mixed membership (themes) within one single group (document). The
cons follow from its assumptions: word order, context, and correlation among both topics and words
are omitted.
• What are the basic ideas behind neural probabilistic language models? The model learns word
representations that reduce the dimensionality and make the representation smoother. In this way, similar
words, and contexts with different arrangements of words, are naturally considered “similar”.
• What are the key merits of neural probabilistic language models compared to N-grams and LDA?
Are there any limitations? Similar but distinct words and contexts are considered “similar” in this
model. It also takes into account word order, context, and possible correlations between a word and
its preceding words. But it is learned and applied in a left-to-right (Markov-style) fashion, so how
can correlations between a word and the words after it (or even the full document, which LDA can
capture) be modeled? The matrix C also needs to be made more interpretable, and prior knowledge
about the topic of the document could be learned (which suggests LDA could be integrated into the
model). Finally, computing the probabilities of all V words in the vocabulary is required (which is
not needed for N-grams).

4 Further Reading
• Teh, Yee, Michael Jordan, Matthew Beal, and David Blei. Sharing clusters among related groups:
Hierarchical Dirichlet processes. Advances in Neural Information Processing Systems 17 (2004).
