COL333/671: Introduction to AI
Semester I, 2024-25
Learning with Probabilities
Rohan Paul
Outline
• Last Class
• CSPs
• This Class
• Bayesian Learning, MLE/MAP, Learning in Probabilistic Models.
• Reference Material
• Please follow the notes as the primary reference on this topic. Supplementary
reading on topics covered in class: AIMA Ch. 20, Sections 20.1 – 20.2.4.
Acknowledgement
These slides are intended for teaching purposes only. Some material
has been used/adapted from web sources and from slides by Doina
Precup, Dorsa Sadigh, Percy Liang, Mausam, Parag, Emma Brunskill,
Alexander Amini, Dan Klein, Anca Dragan, Nicholas Roy and others.
Learning Probabilistic Models
• Models are useful for making optimal decisions.
• Probabilistic models express a theory about the domain and can be used for
decision making.
• How to acquire these models in the first place?
• Solution: data or experience can be used to build these models
• Key question: how to learn from data?
• Bayesian view of learning (learning task itself is probabilistic inference)
• Learning with complete and incomplete data.
• Essentially, rely on counting.
Example: Which candy bag is it?
(Figure: Statistics vs. Probability.)
Bayesian Learning – in a nutshell
(Figure: Bayes net with hypothesis node H, prior P(H), and i.i.d. observations D1, D2, …, DN with likelihood P(d|H).)
In these slides, X and d are used interchangeably.
Posterior probability of the hypothesis given observations: the probability of a bag of a certain type given the observations.
Now, as observations arrive incrementally, how does our belief change?
Bayes Rule
IID assumption
Posterior Probability of Hypothesis given Observations
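As a reference, these two relations in their standard form (following the AIMA Ch. 20 notation, with observations d = d1, …, dN):

\[
P(h_i \mid \mathbf{d}) = \alpha \, P(\mathbf{d} \mid h_i)\, P(h_i) \quad \text{(Bayes rule)},
\qquad
P(\mathbf{d} \mid h_i) = \prod_{j=1}^{N} P(d_j \mid h_i) \quad \text{(i.i.d. assumption)}
\]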
Incremental Belief Update
The true hypothesis eventually dominates: the probability of indefinitely producing uncharacteristic data goes to 0.
Predictions given Belief over Hypotheses
What is the probability that the next candy is of type lime?
(Figure: probabilities plotted against the number of observations.)
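The prediction averages over all hypotheses, weighted by their posterior probabilities; in the standard form (X is the next observation):

\[
P(X \mid \mathbf{d}) = \sum_i P(X \mid h_i)\, P(h_i \mid \mathbf{d})
\]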
Bayesian Prediction – Evidence arrives incrementally
Key ideas
• Predictions are a weighted average over the predictions of the individual hypotheses.
• Bayesian prediction eventually agrees with the true hypothesis.
• For any fixed prior that does not rule out the true hypothesis, the posterior probability of any false hypothesis will eventually vanish.
• Why keep all the hypotheses?
• Learning from small data: early commitment to a hypothesis is risky; later evidence may lead to a different likely hypothesis.
• Better accounting of uncertainty in making predictions.
• Problem: may be slow and intractable; we cannot always estimate and marginalize out the hypotheses.
(Figure: the changing belief over hypotheses, and the prediction obtained by model averaging.)
Marginalization over Hypotheses – challenging!
Ideally, one needs to marginalize over, or account for, all the hypotheses.
Can we pick one good hypothesis and just use that for predictions?
Maximum a-posteriori (MAP) Approximation
P(X|d): the probability of observing new data X, given the evidence d.
Estimate the best hypothesis given the data while incorporating prior knowledge.
What is the probability of a hypothesis given the data?
The prior term says which hypotheses are likelier than others. Typically, it is related to the number of bits needed to encode the hypothesis.
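In the standard form, the MAP hypothesis maximizes the posterior (equivalently, the likelihood times the prior), and predictions then use this single hypothesis:

\[
h_{\mathrm{MAP}} = \arg\max_h P(h \mid \mathbf{d}) = \arg\max_h P(\mathbf{d} \mid h)\, P(h),
\qquad
P(X \mid \mathbf{d}) \approx P(X \mid h_{\mathrm{MAP}})
\]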
MAP vs. Bayesian Estimation
The difference between marginalization (accounting for all hypotheses) and committing to one hypothesis and making predictions from it.
Maximum Likelihood Estimation
Make predictions with the hypothesis that maximizes the data likelihood. Essentially, this assumes a uniform prior, with no preference for one hypothesis over another.
MLE is also called the Maximum Likelihood (ML) approximation.
Maximum Likelihood Approximation
θ (theta) represents the parameters of the probabilistic model. These parameters define the specific configuration of the hypothesis or model we are using.
θ_ML is the maximum likelihood estimate of the parameters: the value of θ that makes the observed data most likely under the model.
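In symbols, assuming i.i.d. observations d1, …, dN:

\[
\theta_{\mathrm{ML}} = \arg\max_\theta P(\mathbf{d} \mid \theta)
= \arg\max_\theta \sum_{j=1}^{N} \log P(d_j \mid \theta)
\]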
ML Estimation in General: Bernoulli Model
The hypothesis is the probability of generating a candy of a specific flavor.
Cherry, Lime, Lime, Cherry, Cherry,
Lime, Cherry, Cherry
This is similar to observing tosses of a biased coin and estimating the bias (fractional) parameter.
ML Estimation in General: Estimation for
Bernoulli Model
Cherry, Lime, Lime, Cherry, Cherry,
Lime, Cherry, Cherry
As in the coin tossing problem, one takes the fraction of heads (or tails) over the total number of tosses (see the sketch below).
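A minimal sketch of this counting estimate for the candy sequence above (the variable names are illustrative):

```python
# Maximum likelihood estimate for a Bernoulli parameter: the fraction of
# "cherry" outcomes among all observed candies.
observations = ["cherry", "lime", "lime", "cherry", "cherry",
                "lime", "cherry", "cherry"]

n_cherry = sum(1 for c in observations if c == "cherry")
theta_ml = n_cherry / len(observations)   # P(flavor = cherry)

print(theta_ml)  # 5/8 = 0.625
```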
MAP vs. MLE Estimation
• Maximum likelihood estimate (MLE)
• Estimates the parameters that maximize the data likelihood.
• Relative counts give the MLE estimates.
• Maximum a posteriori (MAP) estimate
• Bayesian parameter estimation.
• Encodes a prior over the parameters (not all parameter values are equally likely a priori).
• Combines the prior and the likelihood while estimating the parameters (see the sketch below).
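A minimal sketch contrasting the two estimates on the candy sequence, assuming, purely for illustration, a Beta(a, b) prior over the cherry probability (the pseudo-count values are arbitrary):

```python
# MLE vs. MAP for a Bernoulli parameter.
observations = ["cherry", "lime", "lime", "cherry", "cherry",
                "lime", "cherry", "cherry"]
n_cherry = sum(c == "cherry" for c in observations)
n_lime = len(observations) - n_cherry

# MLE: relative counts.
theta_mle = n_cherry / (n_cherry + n_lime)

# MAP with a Beta(a, b) prior: the prior acts like (a-1) extra "cherry"
# and (b-1) extra "lime" pseudo-counts.  a = b = 3 is an arbitrary choice.
a, b = 3, 3
theta_map = (n_cherry + a - 1) / (n_cherry + n_lime + a + b - 2)

print(theta_mle, theta_map)   # 0.625 vs. 0.583...
```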
ML Estimation in General: Learning Parameters
for a Probability Model
• Probabilistic models require
parameters (numbers in the
conditional probability tables).
• We need these values to make
predictions.
• Can we learn these from data (i.e.,
samples from the Bayes Net)?
• How to do this? Counting and averaging.
• Can we use samples to estimate the values in the tables?
Learning Parameters for a Probability Model
Classification Problem
• Task: given inputs x, predict labels (classes) y
• Examples:
• Spam detection (input: document,
classes: spam / ham)
• OCR (input: images, classes: characters)
• Medical diagnosis (input: symptoms, classes: diseases)
• Fraud detection (input: account activity, classes: fraud / no fraud)
Bayes Net for Classification
• Input: images / pixel grids
• Output: a digit 0-9
• Setup:
• Get a large collection of example images, each labeled with a digit.
• Note: someone has to hand-label all this data!
• Want to learn to predict labels of new, future digit images.
• Features: the attributes used to make the digit decision
• Pixels: (6,8) = ON
• Shape patterns: NumComponents, AspectRatio, NumLoops
• …
(Figure: example digit images with labels 0, 1, 2, 1; one image is ambiguous and marked "not clear".)
Bayes Net for Classification
• Naïve Bayes: assume all features are independent effects of the label.
• Simple digit recognition:
• One feature (variable) Fij for each grid position <i,j>
• Feature values are on / off, based on whether the intensity is more or less than 0.5 in the underlying image.
• Each input maps to a feature vector.
(Figure: Naïve Bayes network with label node Y and feature nodes F1, F2, …, Fn.)
Parameter Estimation
• Need estimates of the local conditional probability tables:
• P(Y), the prior over labels
• P(Fi|Y) for each feature (evidence variable)
• These probabilities are collectively called the parameters of the model and are denoted by θ.
• Until now, the table values were provided.
• Now, we use data to acquire these values.
Parameter Estimation
• P(Y) – how frequent is the class type, e.g., digit 3?
• If you take a sample of images of digits, how frequent is this digit?
• P(Fi|Y) – for digit 3, what fraction of the time is the cell on?
• Conditioned on the class type, how frequent is the feature?
• Use relative frequencies from the data to estimate these values (see the sketch below).
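A minimal sketch of estimating these tables by counting, assuming a hypothetical dataset of binary feature vectors with labels (the data and names like `data` are illustrative):

```python
from collections import Counter, defaultdict

# Hypothetical training data: (label, feature vector) pairs, features are 0/1.
data = [
    (3, [1, 0, 1]),
    (3, [1, 1, 1]),
    (8, [0, 1, 1]),
    (8, [1, 1, 0]),
]

n = len(data)
label_counts = Counter(y for y, _ in data)
p_y = {y: c / n for y, c in label_counts.items()}          # P(Y)

# feature_on_counts[y][i] = number of examples with label y where feature i is on.
feature_on_counts = defaultdict(Counter)
for y, feats in data:
    for i, f in enumerate(feats):
        feature_on_counts[y][i] += f

# P(F_i = 1 | Y = y) as relative frequencies.
p_f_given_y = {
    y: {i: feature_on_counts[y][i] / label_counts[y]
        for i in range(len(data[0][1]))}
    for y in label_counts
}

print(p_y)           # {3: 0.5, 8: 0.5}
print(p_f_given_y)   # e.g. {3: {0: 1.0, 1: 0.5, 2: 1.0}, 8: {...}}
```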
Parameter Estimation: Complete Data
Note: The data is “complete”. Each data point has observed values for “all” the variables in the model.
Parameter Estimation
Problem: values not seen in the training data
If a feature value was not seen in the training data, the likelihood goes to zero.
Not seeing a feature value in the training data does not mean we will never see it at test time. Essentially, this is overfitting to the training data set.
Laplace Smoothing
• Pretend that every outcome occurs once more than it is observed. (Example observations: H H T)
• If certain outcomes are not seen in training, that does not mean they have zero probability of occurring in the future.
• Another version of Laplace smoothing:
• Instead of adding 1, add k to every count.
• k is an adjustable parameter.
• Essentially, this encodes a prior (pseudo-counts); see the sketch below.
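A minimal sketch of add-k (Laplace) smoothing for the coin example above; k = 1 recovers plain Laplace smoothing:

```python
from collections import Counter

def smoothed_estimates(observations, outcomes, k=1):
    """P(x) = (count(x) + k) / (N + k * |outcomes|) for each possible outcome."""
    counts = Counter(observations)
    denom = len(observations) + k * len(outcomes)
    return {x: (counts[x] + k) / denom for x in outcomes}

print(smoothed_estimates(["H", "H", "T"], outcomes=["H", "T"], k=1))
# {'H': 0.6, 'T': 0.4}  instead of the unsmoothed 2/3 and 1/3
```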
Learning Multiple Parameters
• Estimate the unknown parameters using MLE.
• There are two CPTs in this example.
• Observations include both variables: Flavor and Wrapper.
• Take the log likelihood.
Learning Multiple Parameters
• Maximize the data likelihood to estimate the parameters.
Maximum likelihood parameter learning with complete data for a Bayes net decomposes into separate learning problems, one for each parameter.
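As a reference, the decomposition for the flavor/wrapper example in its standard form (as in AIMA Ch. 20.2.1): let θ = P(Flavor = cherry), θ1 = P(Wrapper = red | cherry), θ2 = P(Wrapper = red | lime); with counts c, ℓ of cherry and lime candies, and r_c, g_c, r_ℓ, g_ℓ of red/green wrappers within each flavor, the log likelihood splits into three independent terms, each maximized by its own relative counts:

\[
L(\theta, \theta_1, \theta_2) =
\big[\, c \log\theta + \ell \log(1-\theta) \,\big]
+ \big[\, r_c \log\theta_1 + g_c \log(1-\theta_1) \,\big]
+ \big[\, r_\ell \log\theta_2 + g_\ell \log(1-\theta_2) \,\big]
\]
\[
\hat\theta = \frac{c}{c+\ell}, \qquad
\hat\theta_1 = \frac{r_c}{r_c + g_c}, \qquad
\hat\theta_2 = \frac{r_\ell}{r_\ell + g_\ell}
\]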
How to learn the structure of the Bayes Net?
• Problem: estimate/learn the structure of the model.
• Set up a search process (e.g., local search, hill climbing).
• For each structure, learn the parameters.
• How to score a solution?
• Use maximum likelihood estimation.
• Penalize the complexity of the structure (we don’t want a fully connected model).
• Additionally, check the validity of the conditional independencies.
Parameter Learning when some variables are
not observed
• If we knew the missing value of B, then we could estimate the CPTs.
• If we knew the CPTs, then we could infer the probability of the missing value of B.
• It is a chicken-and-egg problem.
(Figure: a Bayes net with its conditional probability tables. The data is incomplete; one sample has A = 1, B = ?, C = 0.)
Expectation Maximization
• Initialization
• Initialize the CPT parameter values (ignoring the missing information).
• Expectation
• Compute expected values of the unobserved variables assuming the current parameter values.
• Involves Bayes net inference (exact or approximate).
• Maximization
• Compute new parameters (of the CPTs) to maximize the probability of the data (observed and estimated).
• Alternate the E and M steps until convergence. Convergence (to a local optimum of the likelihood) is guaranteed.
EM Example
Problem: learning the parameters of a Bayes net that models ratings given by reviewers.
We postulate that the ratings (1 or 2) are conditioned on the “genre” or “type” of the movie (Comedy or Drama).
Observations: we only see the ratings given by the reviewers.
Apply EM to learn the parameters.
The reviewers rate individually (their CPTs are assumed to be the same).
Slide adapted from Dorsa Sadigh and Percy Liang
What objective are we optimizing in EM?
Maximum Marginal Likelihood
Latent Variables are variables in a model that
are not directly observed in the data but are
inferred through relationships with observed
variables.
In this example, G (genre of the movie) acts as
a latent variable because it influences the
observed ratings, but its value might not be
directly provided in the data.
Latent Vectors typically refer to multi-
dimensional representations of these latent
variables, but in this context, they simply mean
the possible values or states of G that we sum
over to compute the marginal likelihood in the
EM objective.
Marginalize over the latent variables in the likelihood.
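For this example, the objective being maximized has the standard marginal-likelihood form (with θ denoting all CPT entries and the product running over the observed rating pairs):

\[
\theta^\ast = \arg\max_\theta \prod_{i} \sum_{g \in \{\text{c},\,\text{d}\}}
P\big(G = g,\; R_1 = r_1^{(i)},\; R_2 = r_2^{(i)};\; \theta\big)
\]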
Slide adapted from Dorsa Sadigh and Percy Liang
E and M steps
In the E-step, we estimate the probabilities of the hidden (latent) variables given the observed data and the current parameters. This is computed for every value of h and for each setting of the evidence variables.
In the M-step, we update the model parameters to maximize the likelihood of the observed data, given these estimated probabilities. The estimated (fractional) data points from the E-step are used to update the CPTs.
Slide adapted from Dorsa Sadigh and Percy Liang
EM: Estimating and using weighted samples
Estimated Fractional samples
(g=c, r1=2, r2=2) prob: 0.69
(g=d, r1=2, r2=2) prob: 0.31
(g=c, r1=1, r2=2) prob: 0.5
(g=d, r1=1, r2=2) prob: 0.5
Revising the probabilities based on the fractional samples.
The CPTs for the two reviewers are the same (see the sketch below).
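A minimal sketch of EM for this model, P(G) P(R1|G) P(R2|G) with a shared reviewer CPT; the initial parameter values and the observed rating pairs below are illustrative assumptions, not the numbers from the slides:

```python
# EM for the movie-rating model: latent genre G in {c, d}, observed ratings in {1, 2}.
# The two reviewers share the same CPT P(R | G).
data = [(2, 2), (1, 2), (2, 2), (1, 1)]          # hypothetical (r1, r2) pairs

p_g = {"c": 0.5, "d": 0.5}                       # arbitrary initialization
p_r_given_g = {"c": {1: 0.4, 2: 0.6},            # arbitrary initialization
               "d": {1: 0.6, 2: 0.4}}

for _ in range(50):
    # E-step: posterior over the genre for each observed rating pair.
    posteriors = []
    for r1, r2 in data:
        joint = {g: p_g[g] * p_r_given_g[g][r1] * p_r_given_g[g][r2]
                 for g in ("c", "d")}
        z = sum(joint.values())
        posteriors.append({g: joint[g] / z for g in ("c", "d")})

    # M-step: re-estimate the CPTs from the fractional (weighted) samples.
    for g in ("c", "d"):
        weight_g = sum(q[g] for q in posteriors)
        p_g[g] = weight_g / len(data)
        for r in (1, 2):
            # Each example contributes two ratings, weighted by its posterior q(g).
            num = sum(q[g] * ((r1 == r) + (r2 == r))
                      for q, (r1, r2) in zip(posteriors, data))
            p_r_given_g[g][r] = num / (2 * weight_g)

print(p_g)
print(p_r_given_g)
```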
Related Topic: Clustering
Example: Clustering images in a database
Clustering is subjective
Clustering is based on a distance metric
Clustering depends on the distance function used. Euclidean distance? Edit distance? …
K-Means Clustering
A GMM yields a probability distribution over the cluster assignment for each point, whereas K-Means gives a single hard assignment.
GMM: Gaussian Mixture Model
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
K-Means Clustering Algorithm
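The algorithm itself appeared as a figure; below is a minimal NumPy sketch of the usual two-step loop (assign each point to its nearest centroid, then recompute the centroids). The initialization and stopping rule are simple illustrative choices:

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate hard assignments and centroid updates."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for each point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Illustrative usage on two synthetic blobs.
X = np.vstack([np.random.randn(50, 2) + [0, 0],
               np.random.randn(50, 2) + [5, 5]])
centroids, labels = k_means(X, k=2)
print(centroids)
```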
What objective is K-Means optimizing?
Each step of K-Means does not increase the distortion objective, so the algorithm converges (to a local minimum of the distortion metric, shown below).
(Figure: data points and cluster assignments after Iteration I, Iteration II, and Iteration III.)
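In the standard form, with cluster assignments z_n and centroids μ_k, the distortion that K-Means minimizes is:

\[
J(z, \mu) = \sum_{n=1}^{N} \big\lVert x_n - \mu_{z_n} \big\rVert^2
\]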
How to pick “k”?
A reasonable value for the number of clusters (k) can be identified by comparing the distortion metric for different values of k.
K-Means Application: Segmentation
The goal of segmentation is to partition an image into regions, each of which has a reasonably homogeneous visual appearance.
Apply K-Means in the colour space.
EM in Continuous Space: Gaussian Mixture
Modeling
• Problem: a clustering task where we want to discern multiple categories in a given collection of points.
• Assume a mixture of (Gaussian) components.
• We don’t know which data point comes from which component.
• Use EM to iteratively determine the assignments and the parameters of the Gaussian components.
Web link: https://lukapopijac.github.io/gaussian-mixture-model/
Soft vs. hard assignments during clustering
Some slides courtesy: https://nakulgopalan.github.io/cs4641/course/20-gaussian-mixture-model.pdf
Gaussian Mixture Models (GMMs)
GMMs are a generative model of data: they model how the data was generated from an underlying model.
Each f is a normal distribution. The overall data set is generated by sampling from a mixture of these distributions.
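In the standard notation, the mixture density with K components, mixing weights π_k, and Gaussian components is:

\[
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \sum_{k=1}^{K} \pi_k = 1
\]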
Learning a GMM: Optimizing the likelihood of
generating the data
We want to fit the parameters of the Gaussian mixture model (the mixing fractions and the parameters of the Gaussians), given the data.
E-step (associating data points with clusters, i.e., computing responsibilities)
M-step (given the responsibilities, optimize the GMM parameters)
EM for GMMs
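The update equations were shown as a figure; the sketch below illustrates one common form of EM for a GMM using NumPy and SciPy. The initialization, number of iterations, and covariance regularization are arbitrary illustrative choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=50, seed=0):
    """EM for a Gaussian mixture: alternate responsibilities and parameter updates."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    means = X[rng.choice(n, size=k, replace=False)]
    covs = np.array([np.eye(d) for _ in range(k)])
    weights = np.full(k, 1.0 / k)

    for _ in range(n_iters):
        # E-step: responsibility of each component for each point (soft assignment).
        resp = np.column_stack([
            weights[j] * multivariate_normal.pdf(X, means[j], covs[j])
            for j in range(k)
        ])
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means, and covariances from responsibilities.
        nk = resp.sum(axis=0)              # effective number of points per component
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            # Small diagonal term added for numerical stability.
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)

    return weights, means, covs

# Illustrative usage on two synthetic blobs.
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + [4, 4]])
weights, means, covs = em_gmm(X, k=2)
print(weights, means, sep="\n")
```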
GMM Example
Colours indicate cluster membership likelihood.
Online demo: https://lukapopijac.github.io/gaussian-mixture-model/