CSC 411: Lecture 09: Naive Bayes
Richard Zemel, Raquel Urtasun and Sanja Fidler
University of Toronto
October 12, 2016
Today
Classification – Multi-dimensional (Gaussian) Bayes classifier
Estimate probability densities from data
Naive Bayes classifier
Generative vs Discriminative
Two approaches to classification:
Discriminative classifiers estimate parameters of decision boundary/class
separator directly from labeled examples
  – learn p(y|x) directly (logistic regression models)
  – learn mappings from inputs to classes (least-squares, neural nets)
Generative approach: model the distribution of inputs characteristic of the
class (Bayes classifier)
  – build a model of p(x|y)
  – apply Bayes' rule
Bayes Classifier
Aim to diagnose whether a patient has diabetes: classify into one of two classes (yes: C = 1; no: C = 0)
Run a battery of tests
Given the patient's results x = [x_1, x_2, · · · , x_d]^T, we want to update the class probabilities using Bayes' rule:
  p(C | x) = p(x | C) p(C) / p(x)
More formally:
  posterior = (class likelihood × prior) / evidence
How can we compute p(x) for the two-class case?
p(x) = p(x|C = 0)p(C = 0) + p(x|C = 1)p(C = 1)
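For concreteness, here is a minimal Python sketch of this computation; the likelihood and prior values below are hypothetical, not from the diabetes data:

```python
# Minimal sketch of the two-class Bayes rule (illustrative, made-up numbers).
# The class likelihoods p(x|C=0), p(x|C=1) and the prior p(C=1) are assumed given.

def posterior_two_class(lik_c0, lik_c1, prior_c1):
    """Return p(C=1|x) from the two class likelihoods and the prior p(C=1)."""
    prior_c0 = 1.0 - prior_c1
    evidence = lik_c0 * prior_c0 + lik_c1 * prior_c1   # p(x)
    return lik_c1 * prior_c1 / evidence

# Example with hypothetical values:
print(posterior_two_class(lik_c0=0.03, lik_c1=0.10, prior_c1=0.2))  # ~0.4545
```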
Classification: Diabetes Example
Last class we had a single observation per patient: white blood cell count
  p(C = 1 | x = 48) = p(x = 48 | C = 1) p(C = 1) / p(x = 48)
Add a second observation: plasma glucose value
Now our input x is 2-dimensional
Gaussian Discriminant Analysis (Gaussian Bayes Classifier)
Gaussian Discriminant Analysis in its general form assumes that p(x|t) is
distributed according to a multivariate normal (Gaussian) distribution
Multivariate Gaussian distribution:
  p(x | t = k) = (2π)^{-d/2} |Σ_k|^{-1/2} exp( −(1/2) (x − µ_k)^T Σ_k^{−1} (x − µ_k) )
where |Σ_k| denotes the determinant of the covariance matrix, and d is the dimension of x
Each class k has an associated mean vector µ_k and covariance matrix Σ_k
Typically the classes share a single covariance matrix Σ (“share” means they have the same parameters, here the covariance matrix):
  Σ = Σ_1 = · · · = Σ_k
Multivariate Data
Multiple measurements (sensors)
d inputs/features/attributes
N instances/observations/examples
  X = [ x_1^{(1)}  x_2^{(1)}  · · ·  x_d^{(1)}
        x_1^{(2)}  x_2^{(2)}  · · ·  x_d^{(2)}
          ⋮          ⋮                  ⋮
        x_1^{(N)}  x_2^{(N)}  · · ·  x_d^{(N)} ]
Multivariate Parameters
Mean
  E[x] = [µ_1, · · · , µ_d]^T
Covariance
  Σ = Cov(x) = E[(x − µ)(x − µ)^T] = [ σ_1²   σ_12  · · ·  σ_1d
                                        σ_12   σ_2²  · · ·  σ_2d
                                          ⋮      ⋮             ⋮
                                        σ_d1   σ_d2  · · ·  σ_d² ]
The correlation Corr(x) is the covariance divided by the product of the standard deviations:
  ρ_ij = σ_ij / (σ_i σ_j)
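A small numpy sketch of these sample estimates (toy data; numpy is assumed):

```python
import numpy as np

# Sketch: sample estimates of the multivariate parameters from an N x d data matrix X.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))            # toy data, N=500 examples, d=3 features

mu = X.mean(axis=0)                       # sample mean, shape (d,)
Sigma = np.cov(X, rowvar=False)           # sample covariance, shape (d, d)
std = np.sqrt(np.diag(Sigma))
Rho = Sigma / np.outer(std, std)          # correlation: rho_ij = sigma_ij / (sigma_i * sigma_j)

print(mu, Sigma, Rho, sep="\n")
```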
Multivariate Gaussian Distribution
x ∼ N (µ, Σ), a Gaussian (or normal) distribution defined as
  p(x) = (2π)^{-d/2} |Σ|^{-1/2} exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )
The Mahalanobis distance (x − µ)^T Σ^{−1} (x − µ) measures the distance from x to µ in terms of Σ
It normalizes for differences in variances and correlations
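A short sketch (numpy assumed) that evaluates this density in log space via the Mahalanobis distance:

```python
import numpy as np

# Sketch: evaluate the multivariate Gaussian log-density using the Mahalanobis distance.
def mvn_logpdf(x, mu, Sigma):
    d = mu.shape[0]
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)        # (x - mu)^T Sigma^{-1} (x - mu)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
print(mvn_logpdf(np.array([1.0, -1.0]), mu, Sigma))
```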
Bivariate Normal
  Σ = I,   Σ = 0.5 I,   Σ = 2 I
[Figures: probability density function and contour plot of the pdf for each Σ]
Bivariate Normal
  var(x_1) = var(x_2),   var(x_1) > var(x_2),   var(x_1) < var(x_2)
[Figures: probability density function and contour plot of the pdf for each case]
Bivariate Normal
  Σ = [1 0; 0 1],   Σ = [1 0.5; 0.5 1],   Σ = [1 0.8; 0.8 1]
[Figures: probability density function and contour plot of the pdf for each Σ]
Bivariate Normal
  Cov(x_1, x_2) = 0,   Cov(x_1, x_2) > 0,   Cov(x_1, x_2) < 0
[Figures: probability density function and contour plot of the pdf for each case]
Gaussian Discriminant Analysis (Gaussian Bayes Classifier)
GDA (GBC) decision boundary is based on class posterior:
  log p(t_k | x) = log p(x | t_k) + log p(t_k) − log p(x)
                 = −(d/2) log(2π) − (1/2) log |Σ_k| − (1/2) (x − µ_k)^T Σ_k^{−1} (x − µ_k) + log p(t_k) − log p(x)
Decision: take the class with the highest posterior probability
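As an illustration, a minimal Python sketch of this rule (the parameter names are hypothetical; the −(d/2) log 2π and log p(x) terms are dropped since they do not depend on k):

```python
import numpy as np

# Sketch of the GDA decision rule: pick the class with the largest (unnormalized)
# log posterior. mus[k], Sigmas[k], priors[k] are assumed to be given per class.
def gda_predict(x, mus, Sigmas, priors):
    scores = []
    for mu, Sigma, prior in zip(mus, Sigmas, priors):
        diff = x - mu
        _, logdet = np.linalg.slogdet(Sigma)
        maha = diff @ np.linalg.solve(Sigma, diff)
        scores.append(-0.5 * logdet - 0.5 * maha + np.log(prior))  # constants dropped
    return int(np.argmax(scores))
```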
Decision Boundary
[Figure: class likelihoods, the posterior for t_1, and the discriminant where p(t_1|x) = 0.5]
Decision Boundary with a Shared Covariance Matrix
Learning
Learn the parameters using maximum likelihood
  ℓ(φ, µ_0, µ_1, Σ) = − log ∏_{n=1}^{N} p(x^{(n)}, t^{(n)} | φ, µ_0, µ_1, Σ)
                    = − log ∏_{n=1}^{N} p(x^{(n)} | t^{(n)}, µ_0, µ_1, Σ) p(t^{(n)} | φ)
What have we assumed?
More on MLE
Assume the prior is Bernoulli (we have two classes)
  p(t | φ) = φ^t (1 − φ)^{1−t}
You can compute the ML estimate in closed form
  φ = (1/N) ∑_{n=1}^{N} 1[t^{(n)} = 1]

  µ_0 = ( ∑_{n=1}^{N} 1[t^{(n)} = 0] · x^{(n)} ) / ( ∑_{n=1}^{N} 1[t^{(n)} = 0] )

  µ_1 = ( ∑_{n=1}^{N} 1[t^{(n)} = 1] · x^{(n)} ) / ( ∑_{n=1}^{N} 1[t^{(n)} = 1] )

  Σ = (1/N) ∑_{n=1}^{N} (x^{(n)} − µ_{t^{(n)}}) (x^{(n)} − µ_{t^{(n)}})^T
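A compact numpy sketch of these estimates for the two-class case (the function name and the label encoding t ∈ {0, 1} are assumptions for illustration):

```python
import numpy as np

# Sketch of the closed-form ML estimates for two-class GDA with a shared covariance.
# X: (N, d) numpy array of inputs, t: (N,) numpy array of binary labels in {0, 1}.
def fit_gda(X, t):
    N = X.shape[0]
    phi = np.mean(t == 1)
    mu0 = X[t == 0].mean(axis=0)
    mu1 = X[t == 1].mean(axis=0)
    mus = np.where((t == 1)[:, None], mu1, mu0)   # mu_{t^(n)} for each example
    diff = X - mus
    Sigma = diff.T @ diff / N                     # shared covariance estimate
    return phi, mu0, mu1, Sigma
```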
Gaussian Discriminant Analysis vs Logistic Regression
If you examine p(t = 1|x) under GDA, you will find that it looks like this:
  p(t = 1 | x, φ, µ_0, µ_1, Σ) = 1 / (1 + exp(−w^T x))
where w is an appropriate function of (φ, µ_0, µ_1, Σ)
So the decision boundary has the same form as logistic regression!
When should we prefer GDA to LR, and vice versa?
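Before answering, a quick numerical sketch of the sigmoid claim above (the expressions for w and the bias are one standard way to write them when the covariance is shared and x is not augmented; all numbers are made up):

```python
import numpy as np

# Sketch: for shared-covariance GDA, p(t=1|x) is a logistic function of x.
def gda_as_logistic(phi, mu0, mu1, Sigma):
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu0)
    b = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu0 @ Sinv @ mu0 + np.log(phi / (1 - phi))
    return w, b

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Check against Bayes rule with Gaussian likelihoods at a random point:
rng = np.random.default_rng(1)
phi, mu0, mu1 = 0.3, np.array([0.0, 0.0]), np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = rng.normal(size=2)

def logN(x, mu):   # unnormalized log Gaussian; the normalizer cancels since Sigma is shared
    diff = x - mu
    return -0.5 * diff @ np.linalg.solve(Sigma, diff)

post = phi * np.exp(logN(x, mu1)) / (phi * np.exp(logN(x, mu1)) + (1 - phi) * np.exp(logN(x, mu0)))
w, b = gda_as_logistic(phi, mu0, mu1, Sigma)
print(post, sigmoid(w @ x + b))   # the two numbers should agree
```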
Gaussian Discriminant Analysis vs Logistic Regression
GDA makes a stronger modeling assumption: it assumes the class-conditional data is multivariate Gaussian
If this assumption holds, GDA is asymptotically efficient (the best model in the limit of large N)
But LR is more robust and less sensitive to incorrect modeling assumptions
Many class-conditional distributions lead to a logistic classifier
When these distributions are non-Gaussian, LR beats GDA in the limit of large N
Simplifying the Model
What if x is high-dimensional?
For the Gaussian Bayes classifier, if the input x is high-dimensional, then the covariance matrix has many parameters
Save some parameters by using a covariance matrix shared across the classes
Any other idea you can think of?
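To make the savings concrete, a small illustrative count (the counting convention used here is one reasonable choice, not taken from the slides):

```python
# Illustrative parameter counts for d-dimensional Gaussian class-conditionals with K classes
# (counting only means and covariances, not the class priors).
def gaussian_param_counts(d, K):
    full_per_class = K * (d + d * (d + 1) // 2)   # per-class mean + full covariance
    shared_cov     = K * d + d * (d + 1) // 2     # per-class means + one shared covariance
    diagonal       = K * (d + d)                  # per-class means + per-class variances
    return full_per_class, shared_cov, diagonal

print(gaussian_param_counts(d=100, K=2))   # (10300, 5250, 400)
```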
Naive Bayes
Naive Bayes is an alternative generative model: it assumes the features are independent given the class
  p(x | t = k) = ∏_{i=1}^{d} p(x_i | t = k)
Assuming the likelihoods are Gaussian, how many parameters are required for the Naive Bayes classifier?
Important note: Naive Bayes itself does not assume a particular distribution for the per-feature likelihoods
Naive Bayes Classifier
Given
  – the prior p(t = k)
  – assuming the features are conditionally independent given the class, the likelihood p(x_i | t = k) for each x_i
The decision rule:
  y = arg max_k  p(t = k) ∏_{i=1}^{d} p(x_i | t = k)
If the assumption of conditional independence holds, NB is the optimal
classifier
If not, NB is a heavily regularized version of the generative classifier
What’s the regularization?
Note: NB's assumption of conditional independence typically does not hold in practice. However, the resulting algorithm still works well on many problems, and it often serves as a decent baseline for more sophisticated models
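A minimal sketch of this rule in log space (sums of log probabilities avoid underflow when multiplying many per-feature likelihoods; the per-feature log-likelihood function log_lik is an assumed hook, to be filled in by whichever distribution one chooses):

```python
import numpy as np

# Sketch of the Naive Bayes decision rule in log space.
# log_priors[k] = log p(t=k); log_lik(i, xi, k) = log p(x_i = xi | t = k).
def nb_predict(x, log_priors, log_lik):
    K = len(log_priors)
    scores = [log_priors[k] + sum(log_lik(i, xi, k) for i, xi in enumerate(x))
              for k in range(K)]
    return int(np.argmax(scores))
```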
Gaussian Naive Bayes
Gaussian Naive Bayes classifier assumes that the likelihoods are Gaussian:
  p(x_i | t = k) = 1 / (√(2π) σ_ik) exp( −(x_i − µ_ik)² / (2 σ_ik²) )
(this is just a 1-dim Gaussian, one for each input dimension)
The model is the same as Gaussian Discriminant Analysis with a diagonal covariance matrix
Maximum likelihood estimate of parameters
  µ_ik = ( ∑_{n=1}^{N} 1[t^{(n)} = k] · x_i^{(n)} ) / ( ∑_{n=1}^{N} 1[t^{(n)} = k] )

  σ_ik² = ( ∑_{n=1}^{N} 1[t^{(n)} = k] · (x_i^{(n)} − µ_ik)² ) / ( ∑_{n=1}^{N} 1[t^{(n)} = k] )
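A compact numpy sketch of these estimates and the resulting classifier (the label encoding 0..K−1 and function names are assumptions for illustration):

```python
import numpy as np

# Sketch: ML estimates for Gaussian Naive Bayes and the corresponding prediction.
# X: (N, d) numpy array of inputs, t: (N,) numpy array of integer class labels 0..K-1.
def fit_gaussian_nb(X, t, K):
    priors = np.array([np.mean(t == k) for k in range(K)])
    mu = np.stack([X[t == k].mean(axis=0) for k in range(K)])    # (K, d)
    var = np.stack([X[t == k].var(axis=0) for k in range(K)])    # (K, d), ML (biased) variance
    return priors, mu, var

def predict_gaussian_nb(x, priors, mu, var):
    log_lik = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)  # per-feature log densities
    return int(np.argmax(np.log(priors) + log_lik.sum(axis=1)))
```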
Decision Boundary: Shared Variances (between Classes)
[Figure: decision boundary when variances are shared between classes; the variances may be different across input dimensions]
Decision Boundary: isotropic
Same variance across all classes and input dimensions, all class priors equal
Classification then depends only on the distance to each class mean. Why?
Decision Boundary: isotropic
In this case σ_ik = σ (just one parameter), and the class priors are equal (e.g., p(t_k) = 0.5 in the 2-class case)
Going back to class posterior for GDA:
  log p(t_k | x) = log p(x | t_k) + log p(t_k) − log p(x)
                 = −(d/2) log(2π) − (1/2) log |Σ_k| − (1/2) (x − µ_k)^T Σ_k^{−1} (x − µ_k) + log p(t_k) − log p(x)
where we take Σ_k = σ²I and ignore terms that don't depend on k (they don't matter when we take the max over classes):
  log p(t_k | x) = − (1/(2σ²)) (x − µ_k)^T (x − µ_k)
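So classification reduces to picking the nearest class mean; a minimal sketch of that rule (numpy assumed):

```python
import numpy as np

# Sketch: in the isotropic, equal-prior case the rule is a nearest-mean classifier.
def nearest_mean_predict(x, mus):
    dists = [np.sum((x - mu) ** 2) for mu in mus]   # squared Euclidean distance to each class mean
    return int(np.argmin(dists))
```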
Spam Classification
You have examples of emails that are spam and non-spam
How would you classify spam vs non-spam?
Think about it at home, solution in the next tutorial