Probabilistic language models

Notes by Yuzhen Feng


November 20, 2024

1 Citation
• Blei, David M. Probabilistic topic models. Communications of the ACM 55.4 (2012): 77-84.
• Bengio, Yoshua, Réjean Ducharme, and Pascal Vincent. A neural probabilistic language model.
Advances in Neural Information Processing Systems 13 (2000).
• Jurafsky, Daniel, and James H. Martin. N-gram language models. Speech and Language Processing.
Draft of August 20, 2024.

2 Notes
2.1 N-gram language model
The general objective of language models is to learn and infer p(w[1 : T ]), where w[1 : t] denotes the first t
words and w[t] the t-th one. Using the chain rule of conditional probability, we can decompose it as
p(w[1 : T ]) = ∏_t p(w[t] | w[1 : t − 1]).    (1)
However, computing ∏_t p(w[t] | w[1 : t − 1]) directly is still very difficult. Under the Markov assumption, we approximate

p(w[t]|w[1 : t − 1]) ≈ p(w[t]|w[t − N + 1 : t − 1]), (2)

and this is called the N-gram model. The key challenge of the N-gram model is zeros: word sequences that
never appear in the training set but do appear in the test set. Such sequences receive a probability of zero,
which underestimates their true probability and makes the perplexity ill-defined.
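As a toy illustration (my own sketch, not from the readings), the following Python snippet estimates maximum-likelihood bigram probabilities and shows how an unseen bigram receives probability zero:

```python
# Minimal sketch (toy corpus, not from the cited papers): maximum-likelihood
# bigram estimation, showing how an unseen bigram gets probability zero.
from collections import Counter

train = "the sea is blue the sea is calm".split()
bigrams = Counter(zip(train, train[1:]))   # C(w[t-1], w[t])
unigrams = Counter(train)                  # C(w[t-1])

def p_mle(prev, word):
    # p(w[t] | w[t-1]) = C(w[t-1], w[t]) / C(w[t-1])
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_mle("sea", "is"))    # seen bigram   -> 1.0
print(p_mle("sea", "calm"))  # unseen bigram -> 0.0, the "zeros" problem
```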
There are several methods to deal with the zeros. The first is to smooth the probabilities, as in the
following add-k method
p(w[t] | w[t − 1]) = (C(w[t], w[t − 1]) + k) / (C(w[t − 1]) + kV),    (3)
where V is the vocabulary size. The second, interpolation, estimates the probability as a linear combination
of conditional probabilities on shorter contexts

p(w[t]|w[t − 2 : t − 1]) = λ1 p(w[t]) + λ2 p(w[t]|w[t − 1]) + λ3 p(w[t]|w[t − 2 : t − 1]). (4)

The third, backoff, simply estimates the probability with a lower-order N-gram, possibly with a
discounting factor.
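A minimal sketch of the first two remedies, using the same kind of toy corpus (the corpus and the interpolation weights are assumptions, not values from the readings):

```python
# Minimal sketch (toy corpus and assumed lambdas): add-k smoothing (Eq. 3)
# and linear interpolation (Eq. 4) for a bigram model.
from collections import Counter

train = "the sea is blue the sea is calm".split()
V = len(set(train))                         # vocabulary size
T = len(train)
unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))

def p_addk(prev, w, k=1.0):
    # Eq. (3): (C(w[t], w[t-1]) + k) / (C(w[t-1]) + k*V)
    return (bigrams[(prev, w)] + k) / (unigrams[prev] + k * V)

def p_interp(prev, w, lambdas=(0.3, 0.7)):
    # Eq. (4), truncated to unigram and bigram terms; the lambdas must sum to 1
    l1, l2 = lambdas
    p_uni = unigrams[w] / T
    p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    return l1 * p_uni + l2 * p_bi

print(p_addk("sea", "calm"))    # unseen bigram now has a small nonzero probability
print(p_interp("sea", "calm"))  # backed up by the unigram probability of "calm"
```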
Perplexity is used to evaluate language models and is calculated as

P(w[1 : T ])^{−1/T} = 2^{H(W)},    (5)

where H(W) is the cross-entropy. From the perspective of probability, a more accurate model assigns a
higher probability to the test set and therefore has lower perplexity. From the perspective of entropy, a
more accurate model has lower cross-entropy, closer to the true entropy of the language, and thus lower
perplexity. From both perspectives, perplexity is a good way to evaluate a language model.
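A small sketch of the computation, with hypothetical per-word probabilities standing in for some model's output:

```python
# Minimal sketch (hypothetical per-word probabilities): perplexity as in Eq. (5),
# i.e. P(w[1:T])^(-1/T) = 2^(H(W)) with H(W) measured in bits per word.
import math

word_probs = [0.2, 0.1, 0.05, 0.3]                  # assumed p(w[t] | context) values
T = len(word_probs)

log2_prob = sum(math.log2(p) for p in word_probs)   # log2 P(w[1:T])
cross_entropy = -log2_prob / T                      # H(W)
perplexity = 2 ** cross_entropy                     # equals P(w[1:T]) ** (-1/T)

print(cross_entropy, perplexity)                    # lower is better for both
```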

2.2 Neural probabilistic language model
Dimension reduction and smooth representation In language models, each word w[t] can be represented
by a V -dimensional one-hot vector with a single non-zero element. However, such a representation is
discrete and not smooth: “walking” and “running”, or “blue sea” and “sea is blue”, may look entirely different,
because similarities (1) between words and (2) between contexts with different arrangements of words are not captured.
The paper proposes word feature vectors in place of the original 0-1 vectors to solve this problem. The
detailed model is as follows.

• w → x: x = C(w), where x is the word feature vector;
• x → y: y = b + W x + U tanh(d + Hx);
• y → p: p = softmax(y).
Here, the parameters to be learned are θ = (b, d, W, U, H, C). A parallel algorithm is proposed to learn
the parameters from the dataset (the main computational workload comes from the softmax normalization
over the vocabulary). Extensions can be made by also representing the output word with its word feature vector.
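As a hedged illustration of the three mappings above, here is a NumPy sketch of the forward pass with assumed sizes (V, m, n, h and the parameter initialization are my assumptions, not the authors' implementation):

```python
# Minimal sketch of the forward pass y = b + W x + U tanh(d + H x), p = softmax(y),
# with assumed sizes; parameter shapes follow the notation in the notes above.
import numpy as np

V, m, n, h = 1000, 30, 3, 50      # vocab size, feature dim, context length, hidden units
rng = np.random.default_rng(0)

C = rng.normal(scale=0.1, size=(V, m))        # word feature vectors (the matrix C)
H = rng.normal(scale=0.1, size=(h, n * m))    # hidden layer weights
d = np.zeros(h)                               # hidden layer bias
U = rng.normal(scale=0.1, size=(V, h))        # hidden-to-output weights
W = rng.normal(scale=0.1, size=(V, n * m))    # direct input-to-output weights
b = np.zeros(V)                               # output bias

def forward(context_word_ids):
    # w -> x: look up and concatenate the feature vectors of the context words
    x = C[context_word_ids].reshape(-1)       # shape (n*m,)
    # x -> y: affine plus tanh hidden layer, plus a direct linear connection
    y = b + W @ x + U @ np.tanh(d + H @ x)    # shape (V,)
    # y -> p: softmax over the whole vocabulary (the expensive normalization step)
    e = np.exp(y - y.max())
    return e / e.sum()

p = forward([12, 7, 45])                      # assumed word ids for an n-word context
print(p.shape, p.sum())                       # (1000,) and ~1.0
```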

2.3 Probabilistic topic model


Mixed membership within one single group The objective of probabilistic topic models is to use the
observed documents to infer the hidden topic structure. The detailed model is:
p(β, θ, z, w) = ∏_{k=1}^{K} p(β_k) ∏_{d=1}^{D} ( p(θ_d) ∏_{n=1}^{N} p(z_{d,n} | θ_d) p(w_{d,n} | β, z_{d,n}) )    (6)

For each topic k, a distributional realization β_k shows how words are distributed under topic k. For each
document d, a distributional realization θ_d shows how topics are distributed in document d. For each
word position n, given θ_d, a realization z_{d,n} shows which topic the word is about, and given β and z_{d,n}
(i.e., β_{z_{d,n}}), a realization w_{d,n} shows which word it is. Extensions can be made by generating each word
conditional on the previous one.
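A minimal sketch of the generative process behind Eq. (6), with assumed corpus sizes and Dirichlet hyperparameters (α and η are assumptions, not values from Blei's paper):

```python
# Minimal sketch (assumed sizes and priors): the generative process behind Eq. (6),
# drawing beta_k, theta_d, z_{d,n}, and w_{d,n} for a toy corpus.
import numpy as np

rng = np.random.default_rng(0)
K, D, N, V = 3, 5, 20, 50         # topics, documents, words per document, vocabulary size
eta, alpha = 0.1, 0.5             # assumed Dirichlet hyperparameters

beta = rng.dirichlet(np.full(V, eta), size=K)     # beta_k: word distribution of topic k
docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))    # theta_d: topic distribution of document d
    words = []
    for n in range(N):
        z_dn = rng.choice(K, p=theta_d)           # z_{d,n} ~ p(z | theta_d)
        w_dn = rng.choice(V, p=beta[z_dn])        # w_{d,n} ~ p(w | beta, z_{d,n})
        words.append(w_dn)
    docs.append(words)

print(len(docs), len(docs[0]))                    # 5 documents, 20 word ids each
```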

3 Discussion Questions
• What are N-grams? What are the key challenges of estimating N-gram probabilities?
• What does perplexity mean? Is it a good way to evaluate a language model? Why?

• What are the assumptions behind probabilistic topic models? A generative process is assumed.
Specifically, the number of topics is finite and given in advance, and topics and words in each document
are generated independently. Because of this independence, word order, context, and correlation are ignored.
• What are the pros and cons of Latent Dirichlet Allocation (LDA) in terms of language modeling? It
introduces themes into language modeling: a document revolves around one or several themes. It can
also be generalized to model mixed membership (themes) within one single group (document). The
cons follow from its assumptions: word order, context, and correlation among both topics and words
are omitted.
• What are the basic ideas behind neural probabilistic language models? The model learns word
representations that reduce the dimensionality and make the representation smoother. In this way, similar
words, and contexts with different arrangements of words, are naturally considered “similar”.
• What are the key merits of neural probabilistic language models compared to N-grams and LDA?
Are there any limitations? Similar but distinct words and contexts are considered “similar” in this
model. It also takes into account word order, context, and possible correlations between a word and
its preceding words. But it is learned and applied in a left-to-right (Markov-style) fashion, so how
can correlations between a word and the words after it (or even the full document, which LDA can
capture) be modeled? The matrix C also needs to be made more interpretable, and prior knowledge
about the topic of the document could be learned (which suggests LDA could be integrated into the
model). Finally, computing the probabilities of all V words in the vocabulary is required (which is
not needed for N-grams).

4 Further Reading
• Teh, Yee, Michael Jordan, Matthew Beal, and David Blei. Sharing clusters among related groups:
Hierarchical Dirichlet processes. Advances in Neural Information Processing Systems 17 (2004).
