
CSC 411: Lecture 09: Naive Bayes

Richard Zemel, Raquel Urtasun and Sanja Fidler

University of Toronto

October 12, 2016

Today

Classification: multi-dimensional (Gaussian) Bayes classifier

Estimate probability densities from data

Naive Bayes classifier

Generative vs Discriminative

Two approaches to classification:

Discriminative classifiers estimate parameters of the decision boundary / class separator directly from labeled examples
  - learn p(y|x) directly (logistic regression models)
  - learn mappings from inputs to classes (least-squares, neural nets)

Generative approach: model the distribution of inputs characteristic of the class (Bayes classifier)
  - build a model of p(x|y)
  - apply Bayes Rule

Bayes Classifier

Aim to diagnose whether a patient has diabetes: classify into one of two classes (yes C = 1; no C = 0)

Run a battery of tests

Given the patient's results x = [x1, x2, · · · , xd]^T, we want to update the class probabilities using Bayes Rule:

    p(C|x) = p(x|C) p(C) / p(x)

More formally:

    posterior = (class likelihood × prior) / evidence

How can we compute p(x) for the two-class case?

    p(x) = p(x|C = 0) p(C = 0) + p(x|C = 1) p(C = 1)
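As a concrete illustration of the two-class computation (the numbers below are made up for illustration, not from the lecture):

    # Hypothetical values for illustration only
    p_C1 = 0.2                    # prior p(C = 1)
    p_C0 = 1.0 - p_C1             # prior p(C = 0)
    lik_C1 = 0.05                 # class likelihood p(x|C = 1) at the observed x
    lik_C0 = 0.01                 # class likelihood p(x|C = 0) at the observed x

    evidence = lik_C0 * p_C0 + lik_C1 * p_C1    # p(x)
    posterior_C1 = lik_C1 * p_C1 / evidence     # p(C = 1|x) by Bayes Rule
    print(posterior_C1)                         # about 0.556 for these numbers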

Classification: Diabetes Example
Last class we had a single observation per patient: white blood cell count
    p(C = 1|x = 48) = p(x = 48|C = 1) p(C = 1) / p(x = 48)

Add second observation: Plasma glucose value


Now our input x is 2-dimensional

Gaussian Discriminant Analysis (Gaussian Bayes Classifier)

Gaussian Discriminant Analysis in its general form assumes that p(x|t) is distributed according to a multivariate normal (Gaussian) distribution

Multivariate Gaussian distribution:

    p(x|t = k) = 1 / ((2π)^(d/2) |Σk|^(1/2)) · exp( −(1/2) (x − µk)^T Σk^(−1) (x − µk) )

where |Σk| denotes the determinant of the matrix and d is the dimension of x

Each class k has an associated mean vector µk and covariance matrix Σk

Typically the classes share a single covariance matrix Σ ("share" means that they have the same parameters; the covariance matrix in this case):

    Σ = Σ1 = · · · = ΣK

Multivariate Data

Multiple measurements (sensors)

d inputs/features/attributes

N instances/observations/examples

    X = [ x1^(1)  x2^(1)  · · ·  xd^(1) ]
        [ x1^(2)  x2^(2)  · · ·  xd^(2) ]
        [   ⋮       ⋮      ⋱       ⋮   ]
        [ x1^(N)  x2^(N)  · · ·  xd^(N) ]

Multivariate Parameters

Mean

    E[x] = [µ1, · · · , µd]^T

Covariance

    Σ = Cov(x) = E[(x − µ)(x − µ)^T] = [ σ1²  σ12  · · ·  σ1d ]
                                       [ σ12  σ2²  · · ·  σ2d ]
                                       [  ⋮    ⋮    ⋱     ⋮  ]
                                       [ σd1  σd2  · · ·  σd² ]

Correlation is the covariance divided by the product of the standard deviations:

    ρij = σij / (σi σj)
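A minimal NumPy sketch of these estimators on a data matrix X whose rows are the N instances and columns the d features (the data below is made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))              # hypothetical N x d data matrix

    mu = X.mean(axis=0)                        # sample mean E[x]
    Xc = X - mu                                # centered data
    Sigma = Xc.T @ Xc / X.shape[0]             # sample covariance E[(x - mu)(x - mu)^T]
    sd = np.sqrt(np.diag(Sigma))               # per-feature standard deviations sigma_i
    Rho = Sigma / np.outer(sd, sd)             # correlation matrix rho_ij = sigma_ij / (sigma_i sigma_j)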

Multivariate Gaussian Distribution

x ∼ N(µ, Σ), a Gaussian (or normal) distribution defined as

    p(x) = 1 / ((2π)^(d/2) |Σ|^(1/2)) · exp( −(1/2) (x − µ)^T Σ^(−1) (x − µ) )

The Mahalanobis distance (x − µ)^T Σ^(−1) (x − µ) measures the distance from x to µ in terms of Σ

It normalizes for differences in variances and correlations
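A minimal sketch of evaluating this density in NumPy (log-space to avoid underflow; x, µ and Σ are assumed to be given as arrays):

    import numpy as np

    def mvn_logpdf(x, mu, Sigma):
        """log N(x; mu, Sigma), a direct transcription of the formula above."""
        d = mu.shape[0]
        diff = x - mu
        maha = diff @ np.linalg.solve(Sigma, diff)   # Mahalanobis term (x - mu)^T Sigma^{-1} (x - mu)
        _, logdet = np.linalg.slogdet(Sigma)         # log|Sigma|, numerically safer than log(det(Sigma))
        return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)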
Bivariate Normal
     
Σ = [1 0; 0 1]        Σ = 0.5 · [1 0; 0 1]        Σ = 2 · [1 0; 0 1]

Figure: probability density function

Figure: contour plot of the pdf
Bivariate Normal
var(x1) = var(x2)        var(x1) > var(x2)        var(x1) < var(x2)

Figure: probability density function

Figure: contour plot of the pdf
Bivariate Normal
     
Σ = [1 0; 0 1]        Σ = [1 0.5; 0.5 1]        Σ = [1 0.8; 0.8 1]

Figure: probability density function

Figure: contour plot of the pdf
Bivariate Normal
Cov(x1, x2) = 0        Cov(x1, x2) > 0        Cov(x1, x2) < 0

Figure: probability density function

Figure: contour plot of the pdf
Gaussian Discriminant Analysis (Gaussian Bayes Classifier)

The GDA (Gaussian Bayes classifier) decision boundary is based on the class posterior:

    log p(tk|x) = log p(x|tk) + log p(tk) − log p(x)
                = −(d/2) log(2π) − (1/2) log |Σk| − (1/2) (x − µk)^T Σk^(−1) (x − µk) + log p(tk) − log p(x)

Decision: take the class with the highest posterior probability

Decision Boundary

Figure: the class likelihoods, the discriminant p(t1|x) = 0.5, and the posterior for t1

Decision Boundary when Shared Covariance Matrix

Learning

Learn the parameters using maximum likelihood

    ℓ(φ, µ0, µ1, Σ) = − log ∏_{n=1}^{N} p(x(n), t(n) | φ, µ0, µ1, Σ)
                    = − log ∏_{n=1}^{N} p(x(n) | t(n), µ0, µ1, Σ) p(t(n) | φ)
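A minimal sketch of evaluating this objective (reusing the mvn_logpdf sketch from the multivariate Gaussian section above; X is an N×d array, t a length-N array of 0/1 labels):

    import numpy as np

    def gda_nll(X, t, phi, mu0, mu1, Sigma):
        """Negative log-likelihood: -sum_n [log p(x(n)|t(n)) + log p(t(n)|phi)]."""
        log_px = np.array([mvn_logpdf(x, mu1 if ti == 1 else mu0, Sigma)
                           for x, ti in zip(X, t)])              # class-conditional terms
        log_pt = np.where(t == 1, np.log(phi), np.log(1 - phi))  # Bernoulli prior terms
        return -(log_px + log_pt).sum()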

What have we assumed?

More on MLE

Assume the prior is Bernoulli (we have two classes)

    p(t|φ) = φ^t (1 − φ)^(1−t)

You can compute the ML estimates in closed form

    φ = (1/N) ∑_{n=1}^{N} 1[t(n) = 1]

    µ0 = ∑_{n=1}^{N} 1[t(n) = 0] · x(n) / ∑_{n=1}^{N} 1[t(n) = 0]

    µ1 = ∑_{n=1}^{N} 1[t(n) = 1] · x(n) / ∑_{n=1}^{N} 1[t(n) = 1]

    Σ = (1/N) ∑_{n=1}^{N} (x(n) − µ_{t(n)}) (x(n) − µ_{t(n)})^T
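A minimal NumPy sketch of these closed-form estimates and of the resulting classifier (function and variable names are our own, not from the lecture; X is an N×d array, t a length-N array of 0/1 labels, and the covariance is shared):

    import numpy as np

    def fit_gda(X, t):
        """Closed-form MLE for two-class GDA with a shared covariance matrix."""
        N = X.shape[0]
        phi = np.mean(t == 1)                              # prior p(t = 1)
        mu0 = X[t == 0].mean(axis=0)
        mu1 = X[t == 1].mean(axis=0)
        diff = X - np.where(t[:, None] == 1, mu1, mu0)     # x(n) - mu_{t(n)}
        Sigma = diff.T @ diff / N                          # shared covariance
        return phi, mu0, mu1, Sigma

    def predict_gda(X, phi, mu0, mu1, Sigma):
        """Pick the class with the higher log posterior (terms constant in the class are dropped)."""
        Sinv = np.linalg.inv(Sigma)
        def score(mu, prior):
            D = X - mu
            return -0.5 * np.einsum('ni,ij,nj->n', D, Sinv, D) + np.log(prior)
        return (score(mu1, phi) > score(mu0, 1 - phi)).astype(int)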

Gaussian Discriminant Analysis vs Logistic Regression

If you examine p(t = 1|x) under GDA, you will find that it looks like this:

    p(t = 1|x, φ, µ0, µ1, Σ) = 1 / (1 + exp(−w^T x))

where w is an appropriate function of (φ, µ0, µ1, Σ)

So the decision boundary has the same form as logistic regression!

When should we prefer GDA to LR, and vice versa?
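For the shared-covariance case, w can be made explicit; this is a standard derivation, not spelled out on the slide (the quadratic terms in x cancel because both classes use the same Σ, and the bias can be absorbed into w by appending a constant 1 to x):

    log [ p(t = 1|x) / p(t = 0|x) ] = log p(x|t = 1) − log p(x|t = 0) + log( φ / (1 − φ) )
                                    = (µ1 − µ0)^T Σ^(−1) x − (1/2) µ1^T Σ^(−1) µ1 + (1/2) µ0^T Σ^(−1) µ0 + log( φ / (1 − φ) )

so the weight on x is w = Σ^(−1) (µ1 − µ0), and the remaining terms form the bias.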

Gaussian Discriminant Analysis vs Logistic Regression

GDA makes a stronger modeling assumption: it assumes the class-conditional data is multivariate Gaussian

If this is true, GDA is asymptotically efficient (best model in the limit of large N)

But LR is more robust, less sensitive to incorrect modeling assumptions

Many class-conditional distributions lead to a logistic classifier

When these distributions are non-Gaussian, in the limit of large N, LR beats GDA

Simplifying the Model

What if x is high-dimensional?

For the Gaussian Bayes classifier, if the input x is high-dimensional, then the covariance matrix has many parameters (on the order of d² per class)

Save some parameters by using a shared covariance for the classes

Any other ideas you can think of?

Naive Bayes

Naive Bayes is an alternative generative model: it assumes the features are independent given the class

    p(x|t = k) = ∏_{i=1}^{d} p(xi|t = k)

Assuming the likelihoods are Gaussian, how many parameters are required for the Naive Bayes classifier?

Important note: Naive Bayes does not by itself assume a particular distribution for the per-feature likelihoods p(xi|t = k)

Naive Bayes Classifier
Given

  - a prior p(t = k)
  - the assumption that features are conditionally independent given the class
  - a likelihood p(xi|t = k) for each feature xi

the decision rule is

    y = arg max_k p(t = k) ∏_{i=1}^{d} p(xi|t = k)

If the assumption of conditional independence holds, NB is the optimal classifier

If not, it is a heavily regularized version of the generative classifier. What's the regularization?

Note: NB's assumptions (conditional independence) typically do not hold in practice. However, the resulting algorithm still works well on many problems, and it typically serves as a decent baseline for more sophisticated models
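In practice the product of many small probabilities underflows, so the decision rule is usually evaluated in log space; a minimal sketch with made-up numbers for two classes and three features:

    import numpy as np

    # Hypothetical quantities: log_prior[k] = log p(t = k); log_lik[k, i] = log p(xi|t = k)
    # evaluated at the observed feature values of a single example.
    log_prior = np.log(np.array([0.5, 0.5]))
    log_lik = np.log(np.array([[0.2, 0.7, 0.1],
                               [0.6, 0.3, 0.4]]))

    scores = log_prior + log_lik.sum(axis=1)   # log p(t = k) + sum_i log p(xi|t = k)
    y = np.argmax(scores)                      # arg max over classes k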
Gaussian Naive Bayes

The Gaussian Naive Bayes classifier assumes that the likelihoods are Gaussian:

    p(xi|t = k) = 1 / (√(2π) σik) · exp( −(xi − µik)² / (2 σik²) )

(this is just a 1-dimensional Gaussian, one for each input dimension)

The model is the same as Gaussian Discriminant Analysis with a diagonal covariance matrix

Maximum likelihood estimates of the parameters:

    µik = ∑_{n=1}^{N} 1[t(n) = k] · xi(n) / ∑_{n=1}^{N} 1[t(n) = k]

    σik² = ∑_{n=1}^{N} 1[t(n) = k] · (xi(n) − µik)² / ∑_{n=1}^{N} 1[t(n) = k]
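A minimal NumPy sketch of these estimates and of the resulting classifier (our own naming, not the lecture's; X is an N×d array, t a length-N array of class indices 0..K−1; prediction is done in log space):

    import numpy as np

    def fit_gnb(X, t, n_classes):
        """Per-class priors and per-class, per-feature Gaussian MLEs."""
        priors = np.array([np.mean(t == k) for k in range(n_classes)])
        mu = np.array([X[t == k].mean(axis=0) for k in range(n_classes)])   # mu[k, i]
        var = np.array([X[t == k].var(axis=0) for k in range(n_classes)])   # sigma_ik^2 (MLE, ddof=0)
        return priors, mu, var

    def predict_gnb(X, priors, mu, var):
        """arg max_k of log p(t = k) + sum_i log N(xi; mu_ik, sigma_ik^2)."""
        # shape (N, K, d): log-density of every feature under every class's 1-d Gaussian
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X[:, None, :] - mu) ** 2 / var)
        scores = np.log(priors) + log_lik.sum(axis=2)
        return scores.argmax(axis=1)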

Decision Boundary: Shared Variances (between Classes)

Figure: decision boundary; the variances may be different (across input dimensions, though shared between the classes)

Decision Boundary: isotropic


Same variance across all classes and input dimensions, all class priors equal
Classification only depends on distance to the mean. Why?
Decision Boundary: isotropic

In this case σi,k = σ (just one parameter), and the class priors are equal (e.g., p(tk) = 0.5 for the 2-class case)

Going back to the class posterior for GDA:

    log p(tk|x) = log p(x|tk) + log p(tk) − log p(x)
                = −(d/2) log(2π) − (1/2) log |Σk| − (1/2) (x − µk)^T Σk^(−1) (x − µk) + log p(tk) − log p(x)

where we take Σk = σ²I and ignore terms that don't depend on k (they don't matter when we take the max over classes):

    log p(tk|x) = − (1/(2σ²)) (x − µk)^T (x − µk) + const
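Expanding the quadratic makes this explicit (a one-line check, not shown on the slide):

    (x − µk)^T (x − µk) = x^T x − 2 µk^T x + µk^T µk

Since x^T x is the same for every class, maximizing the posterior is the same as minimizing ||x − µk||²: assign x to the class whose mean is closest. With equal priors and a single σ, each pairwise decision boundary is therefore linear (the perpendicular bisector of the two class means).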

Spam Classification

You have examples of emails that are spam and non-spam

How would you classify spam vs non-spam?

Think about it at home; solution in the next tutorial

