CSC 411: Lecture 09: Naive Bayes
Richard Zemel, Raquel Urtasun and Sanja Fidler
University of Toronto
October 12, 2016
Today
Classification – Multi-dimensional (Gaussian) Bayes classifier
Estimate probability densities from data
Naive Bayes classifier
Generative vs Discriminative
Two approaches to classification:
Discriminative classifiers estimate parameters of decision boundary/class
separator directly from labeled examples
  – learn p(y|x) directly (logistic regression models)
  – learn mappings from inputs to classes (least-squares, neural nets)
Generative approach: model the distribution of inputs characteristic of the
class (Bayes classifier)
  – build a model of p(x|y)
  – apply Bayes' rule
Bayes Classifier
Aim to diagnose whether a patient has diabetes: classify into one of two classes (yes: C = 1; no: C = 0)
Run a battery of tests
Given the patient's results x = [x_1, x_2, · · · , x_d]^T, we want to update the class probabilities using Bayes' rule:
  p(C | x) = p(x | C) p(C) / p(x)
More formally:
  posterior = (class likelihood × prior) / evidence
How can we compute p(x) for the two-class case?
p(x) = p(x|C = 0)p(C = 0) + p(x|C = 1)p(C = 1)
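For concreteness, here is a minimal Python sketch of this computation; the likelihood and prior values below are hypothetical, not from the diabetes data:

```python
# Minimal sketch of the two-class Bayes rule (illustrative, made-up numbers).
# The class likelihoods p(x|C=0), p(x|C=1) and the prior p(C=1) are assumed given.

def posterior_two_class(lik_c0, lik_c1, prior_c1):
    """Return p(C=1|x) from the two class likelihoods and the prior p(C=1)."""
    prior_c0 = 1.0 - prior_c1
    evidence = lik_c0 * prior_c0 + lik_c1 * prior_c1   # p(x)
    return lik_c1 * prior_c1 / evidence

# Example with hypothetical values:
print(posterior_two_class(lik_c0=0.03, lik_c1=0.10, prior_c1=0.2))  # ~0.4545
```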
Classification: Diabetes Example
Last class we had a single observation per patient: white blood cell count
  p(C = 1 | x = 48) = p(x = 48 | C = 1) p(C = 1) / p(x = 48)
Add a second observation: plasma glucose value
Now our input x is 2-dimensional
Gaussian Discriminant Analysis (Gaussian Bayes Classifier)
Gaussian Discriminant Analysis in its general form assumes that p(x|t) is
distributed according to a multivariate normal (Gaussian) distribution
Multivariate Gaussian distribution:
  p(x | t = k) = (2π)^{-d/2} |Σ_k|^{-1/2} exp( −(1/2) (x − µ_k)^T Σ_k^{−1} (x − µ_k) )
where |Σ_k| denotes the determinant of the covariance matrix, and d is the dimension of x
Each class k has an associated mean vector µ_k and covariance matrix Σ_k
Typically the classes share a single covariance matrix Σ (“share” means they have the same parameters, here the covariance matrix):
  Σ = Σ_1 = · · · = Σ_k
Multivariate Data
Multiple measurements (sensors)
d inputs/features/attributes
N instances/observations/examples
  X = [ x_1^{(1)}  x_2^{(1)}  · · ·  x_d^{(1)}
        x_1^{(2)}  x_2^{(2)}  · · ·  x_d^{(2)}
          ⋮          ⋮                  ⋮
        x_1^{(N)}  x_2^{(N)}  · · ·  x_d^{(N)} ]
Multivariate Parameters
Mean
  E[x] = [µ_1, · · · , µ_d]^T
Covariance
  Σ = Cov(x) = E[(x − µ)(x − µ)^T] = [ σ_1²   σ_12  · · ·  σ_1d
                                        σ_12   σ_2²  · · ·  σ_2d
                                          ⋮      ⋮             ⋮
                                        σ_d1   σ_d2  · · ·  σ_d² ]
The correlation Corr(x) is the covariance divided by the product of the standard deviations:
  ρ_ij = σ_ij / (σ_i σ_j)
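A small numpy sketch of these sample estimates (toy data; numpy is assumed):

```python
import numpy as np

# Sketch: sample estimates of the multivariate parameters from an N x d data matrix X.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))            # toy data, N=500 examples, d=3 features

mu = X.mean(axis=0)                       # sample mean, shape (d,)
Sigma = np.cov(X, rowvar=False)           # sample covariance, shape (d, d)
std = np.sqrt(np.diag(Sigma))
Rho = Sigma / np.outer(std, std)          # correlation: rho_ij = sigma_ij / (sigma_i * sigma_j)

print(mu, Sigma, Rho, sep="\n")
```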
Multivariate Gaussian Distribution
x ∼ N (µ, Σ), a Gaussian (or normal) distribution defined as
  p(x) = (2π)^{-d/2} |Σ|^{-1/2} exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )
The Mahalanobis distance (x − µ)^T Σ^{−1} (x − µ) measures the distance from x to µ in terms of Σ
It normalizes for differences in variances and correlations
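A short sketch (numpy assumed) that evaluates this density in log space via the Mahalanobis distance:

```python
import numpy as np

# Sketch: evaluate the multivariate Gaussian log-density using the Mahalanobis distance.
def mvn_logpdf(x, mu, Sigma):
    d = mu.shape[0]
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)        # (x - mu)^T Sigma^{-1} (x - mu)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
print(mvn_logpdf(np.array([1.0, -1.0]), mu, Sigma))
```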
Bivariate Normal
  Σ = I,   Σ = 0.5 I,   Σ = 2 I
[Figures: probability density function and contour plot of the pdf for each Σ]
Bivariate Normal
  var(x_1) = var(x_2),   var(x_1) > var(x_2),   var(x_1) < var(x_2)
[Figures: probability density function and contour plot of the pdf for each case]
Bivariate Normal
  Σ = [1 0; 0 1],   Σ = [1 0.5; 0.5 1],   Σ = [1 0.8; 0.8 1]
[Figures: probability density function and contour plot of the pdf for each Σ]
Bivariate Normal
  Cov(x_1, x_2) = 0,   Cov(x_1, x_2) > 0,   Cov(x_1, x_2) < 0
[Figures: probability density function and contour plot of the pdf for each case]
Gaussian Discriminant Analysis (Gaussian Bayes Classifier)
GDA (GBC) decision boundary is based on class posterior:
  log p(t_k | x) = log p(x | t_k) + log p(t_k) − log p(x)
                 = −(d/2) log(2π) − (1/2) log |Σ_k| − (1/2) (x − µ_k)^T Σ_k^{−1} (x − µ_k) + log p(t_k) − log p(x)
Decision: take the class with the highest posterior probability
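As an illustration, a minimal Python sketch of this rule (the parameter names are hypothetical; the −(d/2) log 2π and log p(x) terms are dropped since they do not depend on k):

```python
import numpy as np

# Sketch of the GDA decision rule: pick the class with the largest (unnormalized)
# log posterior. mus[k], Sigmas[k], priors[k] are assumed to be given per class.
def gda_predict(x, mus, Sigmas, priors):
    scores = []
    for mu, Sigma, prior in zip(mus, Sigmas, priors):
        diff = x - mu
        _, logdet = np.linalg.slogdet(Sigma)
        maha = diff @ np.linalg.solve(Sigma, diff)
        scores.append(-0.5 * logdet - 0.5 * maha + np.log(prior))  # constants dropped
    return int(np.argmax(scores))
```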
Decision Boundary
[Figure: class likelihoods, the posterior for t_1, and the discriminant where p(t_1|x) = 0.5]
Decision Boundary with a Shared Covariance Matrix
Learning
Learn the parameters using maximum likelihood
  ℓ(φ, µ_0, µ_1, Σ) = − log ∏_{n=1}^{N} p(x^{(n)}, t^{(n)} | φ, µ_0, µ_1, Σ)
                    = − log ∏_{n=1}^{N} p(x^{(n)} | t^{(n)}, µ_0, µ_1, Σ) p(t^{(n)} | φ)
What have we assumed?
More on MLE
Assume the prior is Bernoulli (we have two classes)
  p(t | φ) = φ^t (1 − φ)^{1−t}
You can compute the ML estimate in closed form
  φ = (1/N) ∑_{n=1}^{N} 1[t^{(n)} = 1]

  µ_0 = ( ∑_{n=1}^{N} 1[t^{(n)} = 0] · x^{(n)} ) / ( ∑_{n=1}^{N} 1[t^{(n)} = 0] )

  µ_1 = ( ∑_{n=1}^{N} 1[t^{(n)} = 1] · x^{(n)} ) / ( ∑_{n=1}^{N} 1[t^{(n)} = 1] )

  Σ = (1/N) ∑_{n=1}^{N} (x^{(n)} − µ_{t^{(n)}}) (x^{(n)} − µ_{t^{(n)}})^T
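A compact numpy sketch of these estimates for the two-class case (the function name and the label encoding t ∈ {0, 1} are assumptions for illustration):

```python
import numpy as np

# Sketch of the closed-form ML estimates for two-class GDA with a shared covariance.
# X: (N, d) numpy array of inputs, t: (N,) numpy array of binary labels in {0, 1}.
def fit_gda(X, t):
    N = X.shape[0]
    phi = np.mean(t == 1)
    mu0 = X[t == 0].mean(axis=0)
    mu1 = X[t == 1].mean(axis=0)
    mus = np.where((t == 1)[:, None], mu1, mu0)   # mu_{t^(n)} for each example
    diff = X - mus
    Sigma = diff.T @ diff / N                     # shared covariance estimate
    return phi, mu0, mu1, Sigma
```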
Gaussian Discriminant Analysis vs Logistic Regression
If you examine p(t = 1|x) under GDA, you will find that it looks like this:
  p(t = 1 | x, φ, µ_0, µ_1, Σ) = 1 / (1 + exp(−w^T x))
where w is an appropriate function of (φ, µ_0, µ_1, Σ)
So the decision boundary has the same form as logistic regression!
When should we prefer GDA to LR, and vice versa?
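Before answering, a quick numerical sketch of the sigmoid claim above (the expressions for w and the bias are one standard way to write them when the covariance is shared and x is not augmented; all numbers are made up):

```python
import numpy as np

# Sketch: for shared-covariance GDA, p(t=1|x) is a logistic function of x.
def gda_as_logistic(phi, mu0, mu1, Sigma):
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu0)
    b = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu0 @ Sinv @ mu0 + np.log(phi / (1 - phi))
    return w, b

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Check against Bayes rule with Gaussian likelihoods at a random point:
rng = np.random.default_rng(1)
phi, mu0, mu1 = 0.3, np.array([0.0, 0.0]), np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = rng.normal(size=2)

def logN(x, mu):   # unnormalized log Gaussian; the normalizer cancels since Sigma is shared
    diff = x - mu
    return -0.5 * diff @ np.linalg.solve(Sigma, diff)

post = phi * np.exp(logN(x, mu1)) / (phi * np.exp(logN(x, mu1)) + (1 - phi) * np.exp(logN(x, mu0)))
w, b = gda_as_logistic(phi, mu0, mu1, Sigma)
print(post, sigmoid(w @ x + b))   # the two numbers should agree
```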
Gaussian Discriminant Analysis vs Logistic Regression
GDA makes a stronger modeling assumption: it assumes the class-conditional data is multivariate Gaussian
If this assumption holds, GDA is asymptotically efficient (the best model in the limit of large N)
But LR is more robust and less sensitive to incorrect modeling assumptions
Many class-conditional distributions lead to a logistic classifier
When these distributions are non-Gaussian, LR beats GDA in the limit of large N
Simplifying the Model
What if x is high-dimensional?
For the Gaussian Bayes classifier, if the input x is high-dimensional, then the covariance matrix has many parameters
Save some parameters by using a covariance matrix shared across the classes
Any other idea you can think of?
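To make the savings concrete, a small illustrative count (the counting convention used here is one reasonable choice, not taken from the slides):

```python
# Illustrative parameter counts for d-dimensional Gaussian class-conditionals with K classes
# (counting only means and covariances, not the class priors).
def gaussian_param_counts(d, K):
    full_per_class = K * (d + d * (d + 1) // 2)   # per-class mean + full covariance
    shared_cov     = K * d + d * (d + 1) // 2     # per-class means + one shared covariance
    diagonal       = K * (d + d)                  # per-class means + per-class variances
    return full_per_class, shared_cov, diagonal

print(gaussian_param_counts(d=100, K=2))   # (10300, 5250, 400)
```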
Naive Bayes
Naive Bayes is an alternative generative model: it assumes the features are independent given the class
  p(x | t = k) = ∏_{i=1}^{d} p(x_i | t = k)
Assuming the likelihoods are Gaussian, how many parameters are required for the Naive Bayes classifier?
Important note: Naive Bayes itself does not assume a particular distribution for the per-feature likelihoods
Naive Bayes Classifier
Given
  – the prior p(t = k)
  – assuming the features are conditionally independent given the class, the likelihood p(x_i | t = k) for each x_i
The decision rule:
  y = arg max_k  p(t = k) ∏_{i=1}^{d} p(x_i | t = k)
If the assumption of conditional independence holds, NB is the optimal
classifier
If not, NB is a heavily regularized version of the generative classifier
What’s the regularization?
Note: NB's assumption of conditional independence typically does not hold in practice. However, the resulting algorithm still works well on many problems, and it often serves as a decent baseline for more sophisticated models
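A minimal sketch of this rule in log space (sums of log probabilities avoid underflow when multiplying many per-feature likelihoods; the per-feature log-likelihood function log_lik is an assumed hook, to be filled in by whichever distribution one chooses):

```python
import numpy as np

# Sketch of the Naive Bayes decision rule in log space.
# log_priors[k] = log p(t=k); log_lik(i, xi, k) = log p(x_i = xi | t = k).
def nb_predict(x, log_priors, log_lik):
    K = len(log_priors)
    scores = [log_priors[k] + sum(log_lik(i, xi, k) for i, xi in enumerate(x))
              for k in range(K)]
    return int(np.argmax(scores))
```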
Gaussian Naive Bayes
Gaussian Naive Bayes classifier assumes that the likelihoods are Gaussian:
  p(x_i | t = k) = 1 / (√(2π) σ_ik) exp( −(x_i − µ_ik)² / (2 σ_ik²) )
(this is just a 1-dim Gaussian, one for each input dimension)
The model is the same as Gaussian Discriminant Analysis with a diagonal covariance matrix
Maximum likelihood estimate of parameters
  µ_ik = ( ∑_{n=1}^{N} 1[t^{(n)} = k] · x_i^{(n)} ) / ( ∑_{n=1}^{N} 1[t^{(n)} = k] )

  σ_ik² = ( ∑_{n=1}^{N} 1[t^{(n)} = k] · (x_i^{(n)} − µ_ik)² ) / ( ∑_{n=1}^{N} 1[t^{(n)} = k] )
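A compact numpy sketch of these estimates and the resulting classifier (the label encoding 0..K−1 and function names are assumptions for illustration):

```python
import numpy as np

# Sketch: ML estimates for Gaussian Naive Bayes and the corresponding prediction.
# X: (N, d) numpy array of inputs, t: (N,) numpy array of integer class labels 0..K-1.
def fit_gaussian_nb(X, t, K):
    priors = np.array([np.mean(t == k) for k in range(K)])
    mu = np.stack([X[t == k].mean(axis=0) for k in range(K)])    # (K, d)
    var = np.stack([X[t == k].var(axis=0) for k in range(K)])    # (K, d), ML (biased) variance
    return priors, mu, var

def predict_gaussian_nb(x, priors, mu, var):
    log_lik = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)  # per-feature log densities
    return int(np.argmax(np.log(priors) + log_lik.sum(axis=1)))
```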
Decision Boundary: Shared Variances (between Classes)
[Figure: decision boundary when variances are shared between classes; the variances may be different across input dimensions]
Decision Boundary: isotropic
Same variance across all classes and input dimensions, all class priors equal
Classification then depends only on the distance to each class mean. Why?
Decision Boundary: isotropic
In this case σ_ik = σ (just one parameter), and the class priors are equal (e.g., p(t_k) = 0.5 in the 2-class case)
Going back to class posterior for GDA:
  log p(t_k | x) = log p(x | t_k) + log p(t_k) − log p(x)
                 = −(d/2) log(2π) − (1/2) log |Σ_k| − (1/2) (x − µ_k)^T Σ_k^{−1} (x − µ_k) + log p(t_k) − log p(x)
where we take Σ_k = σ²I and ignore terms that don't depend on k (they don't matter when we take the max over classes):
  log p(t_k | x) = − (1/(2σ²)) (x − µ_k)^T (x − µ_k)
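So classification reduces to picking the nearest class mean; a minimal sketch of that rule (numpy assumed):

```python
import numpy as np

# Sketch: in the isotropic, equal-prior case the rule is a nearest-mean classifier.
def nearest_mean_predict(x, mus):
    dists = [np.sum((x - mu) ** 2) for mu in mus]   # squared Euclidean distance to each class mean
    return int(np.argmin(dists))
```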
Spam Classification
You have examples of emails that are spam and non-spam
How would you classify spam vs non-spam?
Think about it at home, solution in the next tutorial