COL333/671: Introduction to AI
Semester I, 2024-25
Learning with Probabilities
Rohan Paul
Outline
• Last Class
• CSPs
• This Class
• Bayesian Learning, MLE/MAP, Learning in Probabilistic Models.
• Reference Material
• Please follow the notes as the primary reference on this topic. Supplementary
reading on topics covered in class: AIMA Ch. 20, Sections 20.1 – 20.2.4.
Acknowledgement
These slides are intended for teaching purposes only. Some material
has been used/adapted from web sources and from slides by Doina
Precup, Dorsa Sadigh, Percy Liang, Mausam, Parag, Emma Brunskill,
Alexander Amini, Dan Klein, Anca Dragan, Nicholas Roy and others.
Learning Probabilistic Models
• Models are useful for making optimal decisions.
• Probabilistic models express a theory about the domain and can be used for
decision making.
• How to acquire these models in the first place?
• Solution: data or experience can be used to build these models
• Key question: how to learn from data?
• Bayesian view of learning (learning task itself is probabilistic inference)
• Learning with complete and incomplete data.
• Essentially, rely on counting.
Example: Which candy bag is it?
(Figure: Statistics vs. Probability.)
Bayesian Learning – in a nutshell
(Figure: Bayes net with hypothesis node H, prior P(H), and i.i.d. observations D1, D2, …, DN with likelihood P(d|H).)
In these slides, X and d are used interchangeably.
Posterior probability of the hypothesis given observations: the probability of a bag of a certain type given the observations.
Now, as observations arrive incrementally, how does our belief change?
Bayes Rule
IID assumption
Posterior Probability of Hypothesis given Observations
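As a reference, these two relations in their standard form (following the AIMA Ch. 20 notation, with observations d = d1, …, dN):

\[
P(h_i \mid \mathbf{d}) = \alpha \, P(\mathbf{d} \mid h_i)\, P(h_i) \quad \text{(Bayes rule)},
\qquad
P(\mathbf{d} \mid h_i) = \prod_{j=1}^{N} P(d_j \mid h_i) \quad \text{(i.i.d. assumption)}
\]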
Incremental Belief Update
The true hypothesis eventually dominates: the probability of indefinitely producing uncharacteristic data goes to 0.
Predictions given Belief over Hypotheses
What is the probability that the next candy is of type lime?
(Figure: probabilities plotted against the number of observations.)
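The prediction averages over all hypotheses, weighted by their posterior probabilities; in the standard form (X is the next observation):

\[
P(X \mid \mathbf{d}) = \sum_i P(X \mid h_i)\, P(h_i \mid \mathbf{d})
\]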
Bayesian Prediction – Evidence arrives incrementally
Key ideas
• Predictions are a weighted average over the predictions of the individual hypotheses.
• Bayesian prediction eventually agrees with the true hypothesis.
• For any fixed prior that does not rule out the true hypothesis, the posterior probability of any false hypothesis will eventually vanish.
• Why keep all the hypotheses?
• Learning from small data: early commitment to a hypothesis is risky; later evidence may lead to a different likely hypothesis.
• Better accounting of uncertainty in making predictions.
• Problem: may be slow and intractable; we cannot always estimate and marginalize out the hypotheses.
(Figure: the changing belief over hypotheses, and the prediction obtained by model averaging.)
Marginalization over Hypotheses – challenging!
Ideally, one needs to marginalize over, or account for, all the hypotheses.
Can we pick one good hypothesis and just use that for predictions?
Maximum a-posteriori (MAP) Approximation
P(X|d): the probability of observing new data X, given the evidence d.
Estimate the best hypothesis given the data while incorporating prior knowledge.
What is the probability of a hypothesis given the data?
The prior term says which hypotheses are likelier than others. Typically, it is related to the number of bits needed to encode the hypothesis.
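In the standard form, the MAP hypothesis maximizes the posterior (equivalently, the likelihood times the prior), and predictions then use this single hypothesis:

\[
h_{\mathrm{MAP}} = \arg\max_h P(h \mid \mathbf{d}) = \arg\max_h P(\mathbf{d} \mid h)\, P(h),
\qquad
P(X \mid \mathbf{d}) \approx P(X \mid h_{\mathrm{MAP}})
\]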
MAP vs. Bayesian Estimation
The difference between marginalization (accounting for all hypotheses) and committing to one hypothesis and making predictions from it.
Maximum Likelihood Estimation
Make predictions with the hypothesis that maximizes the data likelihood. Essentially, this assumes a uniform prior, with no preference for one hypothesis over another.
MLE is also called the Maximum Likelihood (ML) approximation.
Maximum Likelihood Approximation
θ (theta) represents the parameters of the probabilistic model. These parameters define the specific configuration of the hypothesis or model we are using.
θ_ML is the maximum likelihood estimate of the parameters: the value of θ that makes the observed data most likely under the model.
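In symbols, assuming i.i.d. observations d1, …, dN:

\[
\theta_{\mathrm{ML}} = \arg\max_\theta P(\mathbf{d} \mid \theta)
= \arg\max_\theta \sum_{j=1}^{N} \log P(d_j \mid \theta)
\]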
ML Estimation in General: Bernoulli Model
The hypothesis is the probability of generating a candy of a specific flavor.
Cherry, Lime, Lime, Cherry, Cherry,
Lime, Cherry, Cherry
This is similar to observing tosses of a biased coin and estimating the bias (fractional) parameter.
ML Estimation in General: Estimation for
Bernoulli Model
Cherry, Lime, Lime, Cherry, Cherry,
Lime, Cherry, Cherry
As in the coin tossing problem, one takes the fraction of heads (or tails) over the total number of tosses (see the sketch below).
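A minimal sketch of this counting estimate for the candy sequence above (the variable names are illustrative):

```python
# Maximum likelihood estimate for a Bernoulli parameter: the fraction of
# "cherry" outcomes among all observed candies.
observations = ["cherry", "lime", "lime", "cherry", "cherry",
                "lime", "cherry", "cherry"]

n_cherry = sum(1 for c in observations if c == "cherry")
theta_ml = n_cherry / len(observations)   # P(flavor = cherry)

print(theta_ml)  # 5/8 = 0.625
```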
MAP vs. MLE Estimation
• Maximum likelihood estimate (MLE)
• Estimates the parameters that maximize the data likelihood.
• Relative counts give the MLE estimates.
• Maximum a posteriori (MAP) estimate
• Bayesian parameter estimation.
• Encodes a prior over the parameters (not all parameter values are equally likely a priori).
• Combines the prior and the likelihood while estimating the parameters (see the sketch below).
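A minimal sketch contrasting the two estimates on the candy sequence, assuming, purely for illustration, a Beta(a, b) prior over the cherry probability (the pseudo-count values are arbitrary):

```python
# MLE vs. MAP for a Bernoulli parameter.
observations = ["cherry", "lime", "lime", "cherry", "cherry",
                "lime", "cherry", "cherry"]
n_cherry = sum(c == "cherry" for c in observations)
n_lime = len(observations) - n_cherry

# MLE: relative counts.
theta_mle = n_cherry / (n_cherry + n_lime)

# MAP with a Beta(a, b) prior: the prior acts like (a-1) extra "cherry"
# and (b-1) extra "lime" pseudo-counts.  a = b = 3 is an arbitrary choice.
a, b = 3, 3
theta_map = (n_cherry + a - 1) / (n_cherry + n_lime + a + b - 2)

print(theta_mle, theta_map)   # 0.625 vs. 0.583...
```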
ML Estimation in General: Learning Parameters
for a Probability Model
• Probabilistic models require
parameters (numbers in the
conditional probability tables).
• We need these values to make
predictions.
• Can we learn these from data (i.e.,
samples from the Bayes Net)?
• How to do this? Counting and averaging.
• Can we use samples to estimate the values in the tables?
Learning Parameters for a Probability Model
Classification Problem
• Task: given inputs x, predict labels (classes) y
• Examples:
• Spam detection (input: document,
classes: spam / ham)
• OCR (input: images, classes: characters)
• Medical diagnosis (input: symptoms, classes: diseases)
• Fraud detection (input: account activity, classes: fraud / no fraud)
Bayes Net for Classification
• Input: images / pixel grids
• Output: a digit 0-9
• Setup:
• Get a large collection of example images, each labeled with a digit.
• Note: someone has to hand-label all this data!
• Want to learn to predict labels of new, future digit images.
• Features: the attributes used to make the digit decision
• Pixels: (6,8) = ON
• Shape patterns: NumComponents, AspectRatio, NumLoops
• …
(Figure: example digit images with labels 0, 1, 2, 1; one image is ambiguous and marked "not clear".)
Bayes Net for Classification
• Naïve Bayes: assume all features are independent effects of the label.
• Simple digit recognition:
• One feature (variable) Fij for each grid position <i,j>
• Feature values are on / off, based on whether the intensity is more or less than 0.5 in the underlying image.
• Each input maps to a feature vector.
(Figure: Naïve Bayes network with label node Y and feature nodes F1, F2, …, Fn.)
Parameter Estimation
• Need estimates of the local conditional probability tables:
• P(Y), the prior over labels
• P(Fi|Y) for each feature (evidence variable)
• These probabilities are collectively called the parameters of the model and are denoted by θ.
• Until now, the table values were provided.
• Now, we use data to acquire these values.
Parameter Estimation
• P(Y) – how frequent is the class type, e.g., digit 3?
• If you take a sample of images of digits, how frequent is this digit?
• P(Fi|Y) – for digit 3, what fraction of the time is the cell on?
• Conditioned on the class type, how frequent is the feature?
• Use relative frequencies from the data to estimate these values (see the sketch below).
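A minimal sketch of estimating these tables by counting, assuming a hypothetical dataset of binary feature vectors with labels (the data and names like `data` are illustrative):

```python
from collections import Counter, defaultdict

# Hypothetical training data: (label, feature vector) pairs, features are 0/1.
data = [
    (3, [1, 0, 1]),
    (3, [1, 1, 1]),
    (8, [0, 1, 1]),
    (8, [1, 1, 0]),
]

n = len(data)
label_counts = Counter(y for y, _ in data)
p_y = {y: c / n for y, c in label_counts.items()}          # P(Y)

# feature_on_counts[y][i] = number of examples with label y where feature i is on.
feature_on_counts = defaultdict(Counter)
for y, feats in data:
    for i, f in enumerate(feats):
        feature_on_counts[y][i] += f

# P(F_i = 1 | Y = y) as relative frequencies.
p_f_given_y = {
    y: {i: feature_on_counts[y][i] / label_counts[y]
        for i in range(len(data[0][1]))}
    for y in label_counts
}

print(p_y)           # {3: 0.5, 8: 0.5}
print(p_f_given_y)   # e.g. {3: {0: 1.0, 1: 0.5, 2: 1.0}, 8: {...}}
```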
Parameter Estimation: Complete Data
Note: The data is “complete”. Each data point has observed values for “all” the variables in the model.
Parameter Estimation
Problem: values not seen in the training data
If a feature value was not seen in the training data, the likelihood goes to zero.
Not seeing a feature value in the training data does not mean we will never see it at test time. Essentially, this is overfitting to the training data set.
Laplace Smoothing
• Pretend that every outcome occurs once more than it is observed. (Example observations: H H T)
• If certain outcomes are not seen in training, that does not mean they have zero probability of occurring in the future.
• Another version of Laplace smoothing:
• Instead of adding 1, add k to every count.
• k is an adjustable parameter.
• Essentially, this encodes a prior (pseudo-counts); see the sketch below.
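A minimal sketch of add-k (Laplace) smoothing for the coin example above; k = 1 recovers plain Laplace smoothing:

```python
from collections import Counter

def smoothed_estimates(observations, outcomes, k=1):
    """P(x) = (count(x) + k) / (N + k * |outcomes|) for each possible outcome."""
    counts = Counter(observations)
    denom = len(observations) + k * len(outcomes)
    return {x: (counts[x] + k) / denom for x in outcomes}

print(smoothed_estimates(["H", "H", "T"], outcomes=["H", "T"], k=1))
# {'H': 0.6, 'T': 0.4}  instead of the unsmoothed 2/3 and 1/3
```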
Learning Multiple Parameters
• Estimate the unknown parameters using MLE.
• There are two CPTs in this example.
• Observations include both variables: Flavor and Wrapper.
• Take the log likelihood.
Learning Multiple Parameters
• Maximize the data likelihood to estimate the parameters.
Maximum likelihood parameter learning with complete data for a Bayes net decomposes into separate learning problems, one for each parameter.
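As a reference, the decomposition for the flavor/wrapper example in its standard form (as in AIMA Ch. 20.2.1): let θ = P(Flavor = cherry), θ1 = P(Wrapper = red | cherry), θ2 = P(Wrapper = red | lime); with counts c, ℓ of cherry and lime candies, and r_c, g_c, r_ℓ, g_ℓ of red/green wrappers within each flavor, the log likelihood splits into three independent terms, each maximized by its own relative counts:

\[
L(\theta, \theta_1, \theta_2) =
\big[\, c \log\theta + \ell \log(1-\theta) \,\big]
+ \big[\, r_c \log\theta_1 + g_c \log(1-\theta_1) \,\big]
+ \big[\, r_\ell \log\theta_2 + g_\ell \log(1-\theta_2) \,\big]
\]
\[
\hat\theta = \frac{c}{c+\ell}, \qquad
\hat\theta_1 = \frac{r_c}{r_c + g_c}, \qquad
\hat\theta_2 = \frac{r_\ell}{r_\ell + g_\ell}
\]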
How to learn the structure of the Bayes Net?
• Problem: estimate/learn the structure of the model.
• Set up a search process (e.g., local search, hill climbing).
• For each structure, learn the parameters.
• How to score a solution?
• Use maximum likelihood estimation.
• Penalize the complexity of the structure (we don’t want a fully connected model).
• Additionally, check the validity of the conditional independencies.
Parameter Learning when some variables are
not observed
• If we knew the missing value of B, then we could estimate the CPTs.
• If we knew the CPTs, then we could infer the probability of the missing value of B.
• It is a chicken-and-egg problem.
(Figure: a Bayes net with its conditional probability tables. The data is incomplete; one sample has A = 1, B = ?, C = 0.)
Expectation Maximization
• Initialization
• Initialize the CPT parameter values (ignoring the missing information).
• Expectation
• Compute expected values of the unobserved variables assuming the current parameter values.
• Involves Bayes net inference (exact or approximate).
• Maximization
• Compute new parameters (of the CPTs) to maximize the probability of the data (observed and estimated).
• Alternate the E and M steps until convergence. Convergence (to a local optimum of the likelihood) is guaranteed.
EM Example
Problem: learning the parameters of a Bayes net that models ratings given by reviewers.
We postulate that the ratings (1 or 2) are conditioned on the “genre” or “type” of the movie (Comedy or Drama).
Observations: we only see the ratings given by the reviewers.
Apply EM to learn the parameters.
The reviewers rate individually (their CPTs are assumed to be the same).
Slide adapted from Dorsa Sadigh and Percy Liang
What objective are we optimizing in EM?
Maximum Marginal Likelihood
Latent Variables are variables in a model that
are not directly observed in the data but are
inferred through relationships with observed
variables.
In this example, G (genre of the movie) acts as
a latent variable because it influences the
observed ratings, but its value might not be
directly provided in the data.
Latent Vectors typically refer to multi-
dimensional representations of these latent
variables, but in this context, they simply mean
the possible values or states of G that we sum
over to compute the marginal likelihood in the
EM objective.
Marginalize over the latent variables in the likelihood.
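For this example, the objective being maximized has the standard marginal-likelihood form (with θ denoting all CPT entries and the product running over the observed rating pairs):

\[
\theta^\ast = \arg\max_\theta \prod_{i} \sum_{g \in \{\text{c},\,\text{d}\}}
P\big(G = g,\; R_1 = r_1^{(i)},\; R_2 = r_2^{(i)};\; \theta\big)
\]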
Slide adapted from Dorsa Sadigh and Percy Liang
E and M steps
In the E-step, we estimate the probabilities of the hidden (latent) variables given the observed data and the current parameters. This is computed for every value of h and for each setting of the evidence variables.
In the M-step, we update the model parameters to maximize the likelihood of the observed data, given these estimated probabilities. The estimated (fractional) data points from the E-step are used to update the CPTs.
Slide adapted from Dorsa Sadigh and Percy Liang
EM: Estimating and using weighted samples
Estimated Fractional samples
(g=c, r1=2, r2=2) prob: 0.69
(g=d, r1=2, r2=2) prob: 0.31
(g=c, r1=1, r2=2) prob: 0.5
(g=d, r1=1, r2=2) prob: 0.5
Revising the probabilities based on the fractional samples.
The CPTs for the two reviewers are the same (see the sketch below).
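A minimal sketch of EM for this model, P(G) P(R1|G) P(R2|G) with a shared reviewer CPT; the initial parameter values and the observed rating pairs below are illustrative assumptions, not the numbers from the slides:

```python
# EM for the movie-rating model: latent genre G in {c, d}, observed ratings in {1, 2}.
# The two reviewers share the same CPT P(R | G).
data = [(2, 2), (1, 2), (2, 2), (1, 1)]          # hypothetical (r1, r2) pairs

p_g = {"c": 0.5, "d": 0.5}                       # arbitrary initialization
p_r_given_g = {"c": {1: 0.4, 2: 0.6},            # arbitrary initialization
               "d": {1: 0.6, 2: 0.4}}

for _ in range(50):
    # E-step: posterior over the genre for each observed rating pair.
    posteriors = []
    for r1, r2 in data:
        joint = {g: p_g[g] * p_r_given_g[g][r1] * p_r_given_g[g][r2]
                 for g in ("c", "d")}
        z = sum(joint.values())
        posteriors.append({g: joint[g] / z for g in ("c", "d")})

    # M-step: re-estimate the CPTs from the fractional (weighted) samples.
    for g in ("c", "d"):
        weight_g = sum(q[g] for q in posteriors)
        p_g[g] = weight_g / len(data)
        for r in (1, 2):
            # Each example contributes two ratings, weighted by its posterior q(g).
            num = sum(q[g] * ((r1 == r) + (r2 == r))
                      for q, (r1, r2) in zip(posteriors, data))
            p_r_given_g[g][r] = num / (2 * weight_g)

print(p_g)
print(p_r_given_g)
```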
Related Topic: Clustering
Example: Clustering images in a database
Clustering is subjective
Clustering is based on a distance metric
Clustering depends on the distance function used. Euclidean distance? Edit distance? …
K-Means Clustering
A GMM yields a probability distribution over the cluster assignment for each point, whereas K-Means gives a single hard assignment.
GMM: Gaussian Mixture Model
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
K-Means Clustering Algorithm
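The algorithm itself appeared as a figure; below is a minimal NumPy sketch of the usual two-step loop (assign each point to its nearest centroid, then recompute the centroids). The initialization and stopping rule are simple illustrative choices:

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate hard assignments and centroid updates."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for each point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Illustrative usage on two synthetic blobs.
X = np.vstack([np.random.randn(50, 2) + [0, 0],
               np.random.randn(50, 2) + [5, 5]])
centroids, labels = k_means(X, k=2)
print(centroids)
```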
What objective is K-Means optimizing?
Each step of K-Means does not increase the distortion objective, so the algorithm converges (to a local minimum of the distortion metric, shown below).
(Figure: data points and cluster assignments after Iteration I, Iteration II, and Iteration III.)
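In the standard form, with cluster assignments z_n and centroids μ_k, the distortion that K-Means minimizes is:

\[
J(z, \mu) = \sum_{n=1}^{N} \big\lVert x_n - \mu_{z_n} \big\rVert^2
\]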
How to pick “k”?
A reasonable value for the number of clusters (k) can be identified by comparing the distortion metric for different values of k.
K-Means Application: Segmentation
The goal of segmentation is to partition an image into regions, each of which has a reasonably homogeneous visual appearance.
Apply K-Means in the colour space.
EM in Continuous Space: Gaussian Mixture
Modeling
• Problem: a clustering task where we want to discern multiple categories in a given collection of points.
• Assume a mixture of (Gaussian) components.
• We don’t know which data point comes from which component.
• Use EM to iteratively determine the assignments and the parameters of the Gaussian components.
Web link: https://lukapopijac.github.io/gaussian-mixture-model/
Soft vs. hard assignments during clustering
Some slides courtesy: https://nakulgopalan.github.io/cs4641/course/20-gaussian-mixture-model.pdf
Gaussian Mixture Models (GMMs)
GMMs are a generative model of data: they model how the data was generated from an underlying model.
Each f is a normal distribution. The overall data set is generated by sampling from a mixture of these distributions.
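In the standard notation, the mixture density with K components, mixing weights π_k, and Gaussian components is:

\[
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \sum_{k=1}^{K} \pi_k = 1
\]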
Learning a GMM: Optimizing the likelihood of
generating the data
We want to fit the parameters of the Gaussian mixture model (the mixing fractions and the parameters of the Gaussians), given the data.
E-step (associating data points with clusters, i.e., computing responsibilities)
M-step (given the responsibilities, optimize the GMM parameters)
EM for GMMs
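The update equations were shown as a figure; the sketch below illustrates one common form of EM for a GMM using NumPy and SciPy. The initialization, number of iterations, and covariance regularization are arbitrary illustrative choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=50, seed=0):
    """EM for a Gaussian mixture: alternate responsibilities and parameter updates."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    means = X[rng.choice(n, size=k, replace=False)]
    covs = np.array([np.eye(d) for _ in range(k)])
    weights = np.full(k, 1.0 / k)

    for _ in range(n_iters):
        # E-step: responsibility of each component for each point (soft assignment).
        resp = np.column_stack([
            weights[j] * multivariate_normal.pdf(X, means[j], covs[j])
            for j in range(k)
        ])
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means, and covariances from responsibilities.
        nk = resp.sum(axis=0)              # effective number of points per component
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            # Small diagonal term added for numerical stability.
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)

    return weights, means, covs

# Illustrative usage on two synthetic blobs.
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + [4, 4]])
weights, means, covs = em_gmm(X, k=2)
print(weights, means, sep="\n")
```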
GMM Example
Colours indicate cluster membership likelihood.
Online demo: https://lukapopijac.github.io/gaussian-mixture-model/