CS 60050
Machine Learning
Naïve Bayes Classifier
Some slides taken from course materials of Tan, Steinbach, Kumar
Bayes Classifier
● A probabilistic framework for solving classification
problems
● An approach for modeling probabilistic relationships
between the attribute set and the class variable
– It may not be possible to predict the class label of a
test record with certainty, even if its attributes are
identical to those of some training records
– Reason: noisy data, or the presence of factors that
are not included in the analysis
Probability Basics
● P(A = a, C = c): joint probability that random
variables A and C will take values a and c
respectively
● P(A = a | C = c): conditional probability that A will
take the value a, given that C has taken value c
P(C | A) = P(A, C) / P(A)

P(A | C) = P(A, C) / P(C)
Bayes Theorem
● Bayes theorem:
P(C | A) = P(A | C) P(C) / P(A)
● P(C) is known as the prior probability
● P(C | A) is known as the posterior probability
Example of Bayes Theorem
● Given:
– A doctor knows that meningitis causes stiff neck 50% of the
time
– Prior probability of any patient having meningitis is 1/50,000
– Prior probability of any patient having stiff neck is 1/20
● If a patient has stiff neck, what’s the probability
he/she has meningitis?
P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
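The same arithmetic in Python, as a quick sanity check (a minimal sketch; the variable names are mine, not from the source):

```python
# Bayes theorem: P(M|S) = P(S|M) * P(M) / P(S)
p_s_given_m = 0.5          # meningitis causes stiff neck 50% of the time
p_m = 1 / 50000            # prior probability of meningitis
p_s = 1 / 20               # prior probability of stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)         # 0.0002
```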
Bayesian Classifiers
● Consider each attribute and class label as random
variables
● Given a record with attributes (A1, A2, …, An)
– Goal is to predict class C
– Specifically, we want to find the value of C that
maximizes P(C | A1, A2, …, An)
● Can we estimate P(C | A1, A2, …, An) directly from
data?
Bayesian Classifiers
● Approach:
– Compute the posterior probability P(C | A1, A2, …, An) for
all values of C using the Bayes theorem:

P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)

– Terminology: P(A1, A2, …, An | C) is the class-conditional
probability, P(C) the prior probability, P(C | A1, A2, …, An)
the posterior probability, and P(A1, A2, …, An) the evidence
– Choose value of C that maximizes
P(C | A1, A2, …, An)
– Equivalent to choosing value of C that maximizes
P(A1, A2, …, An|C) P(C)
● How to estimate P(A1, A2, …, An | C )?
Naïve Bayes Classifier
● Assumes all attributes Ai are conditionally independent,
given the class C:
– P(A1, A2, …, An |C) = P(A1| C) P(A2| C)… P(An| C)
– Can estimate P(Ai | Cj) for all Ai and Cj.
– New point is classified to Cj if P(Cj) Π P(Ai | Cj) is
maximal.
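A minimal sketch of this decision rule in Python, assuming the priors P(Cj) and the per-attribute conditionals P(Ai | Cj) have already been estimated (the function and parameter names are illustrative, not a library API):

```python
import math

def nb_predict(x, classes, prior, cond):
    """Return the class c maximizing P(c) * prod_i P(x_i | c).

    prior: dict mapping each class to P(c)
    cond:  function (i, value, c) -> estimate of P(A_i = value | c)
    """
    best, best_score = None, -math.inf
    for c in classes:
        # accumulate in log space to avoid floating-point underflow
        score = math.log(prior[c])
        for i, v in enumerate(x):
            p = cond(i, v, c)
            score += math.log(p) if p > 0 else -math.inf
        if best is None or score > best_score:
            best, best_score = c, score
    return best
```

Accumulating log-probabilities rather than multiplying raw probabilities is the usual implementation trick: with many attributes the raw product quickly underflows.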
Conditional independence: basics
● Let X, Y, Z denote three sets of random variables
● The variables in X are said to be conditionally
independent of the variables in Y, given Z, if
P( X | Y, Z ) = P( X | Z )
● An example
– People's reading skill tends to increase with the
length of their arm
– Explanation: both increase with the age of a person
– If age is given, arm length and reading skills are
(conditionally) independent
Conditional independence: basics
● If X and Y are conditionally independent, given Z
P( X, Y | Z ) = P(X, Y, Z) / P(Z)
= P(X, Y, Z) / P(Y, Z) * P(Y, Z) / P(Z)
= P(X | Y, Z) * P(Y | Z)
= P(X | Z) * P(Y | Z)
P( X, Y | Z ) = P(X | Z) * P(Y | Z)
NB assumption:
P(A1, A2, …, An |C) = P(A1| C) P(A2| C)… P(An| C)
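To see the identity concretely, one can build a small joint distribution that satisfies the assumption and check both the definition and the consequence numerically (a minimal sketch with made-up numbers):

```python
# Build P(X,Y,Z) = P(X|Z) P(Y|Z) P(Z) for binary X, Y, Z, so that X and Y
# are conditionally independent given Z by construction
p_z = {0: 0.3, 1: 0.7}
p_x_given_z = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.6, 1: 0.4}}  # P(X=x | Z=z)
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}  # P(Y=y | Z=z)

def joint(x, y, z):
    return p_x_given_z[z][x] * p_y_given_z[z][y] * p_z[z]

# Check the definition: P(X | Y, Z) = P(X | Z) for every assignment
for z in (0, 1):
    for y in (0, 1):
        p_yz = sum(joint(xx, y, z) for xx in (0, 1))      # P(Y=y, Z=z)
        for x in (0, 1):
            assert abs(joint(x, y, z) / p_yz - p_x_given_z[z][x]) < 1e-12

# Check the consequence: P(X, Y | Z) = P(X | Z) P(Y | Z)
for z in (0, 1):
    for x in (0, 1):
        for y in (0, 1):
            lhs = joint(x, y, z) / p_z[z]
            rhs = p_x_given_z[z][x] * p_y_given_z[z][y]
            assert abs(lhs - rhs) < 1e-12
```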
How to Estimate Probabilities from Data?

Training data (Refund and Marital Status are categorical,
Taxable Income is continuous, Evade is the class):

Tid | Refund | Marital Status | Taxable Income | Evade
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         | 70K            | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | Divorced       | 95K            | Yes
 6  | No     | Married        | 60K            | No
 7  | Yes    | Divorced       | 220K           | No
 8  | No     | Single         | 85K            | Yes
 9  | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

● Class prior: P(C) = Nc / N
– e.g., P(No) = 7/10, P(Yes) = 3/10
● For discrete attributes:
P(Ai | Ck) = |Aik| / Nck
– where |Aik| is the number of instances having attribute
value Ai and belonging to class Ck, and Nck is the
number of instances of class Ck
– Examples:
P(Status=Married|No) = 4/7
P(Refund=Yes|Yes) = 0
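These estimates are straightforward counting; a minimal sketch in Python with the training table hard-coded (Taxable Income omitted, since it is continuous):

```python
from collections import Counter

# (Refund, Marital Status, Evade); Taxable Income is handled separately
data = [
    ("Yes", "Single",   "No"),  ("No", "Married",  "No"),
    ("No",  "Single",   "No"),  ("Yes", "Married", "No"),
    ("No",  "Divorced", "Yes"), ("No", "Married",  "No"),
    ("Yes", "Divorced", "No"),  ("No", "Single",   "Yes"),
    ("No",  "Married",  "No"),  ("No", "Single",   "Yes"),
]

n_c = Counter(evade for _, _, evade in data)   # class counts N_c
print(n_c["No"] / len(data))                   # P(No) = 7/10 = 0.7

married_no = sum(1 for _, status, evade in data
                 if status == "Married" and evade == "No")
print(married_no / n_c["No"])                  # P(Status=Married|No) = 4/7
```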
How to Estimate Probabilities from Data?
● For continuous attributes, two options:
– Discretize the range into bins
◆ one ordinal attribute per bin (see the sketch after this list)
– Probability density estimation:
◆ Assume the attribute follows a Gaussian / normal
distribution
◆ Use data to estimate the parameters of the distribution
(e.g., mean and standard deviation)
◆ Once the probability distribution is known, it can be used
to estimate the conditional probability P(Ai | c)
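A sketch of the binning option using numpy, over the Income values from the table above (the choice of four equal-width bins is arbitrary, for illustration only):

```python
import numpy as np

income = np.array([125, 100, 70, 120, 95, 60, 220, 85, 75, 90])  # in K

# Four equal-width bins over the observed range; each record then gets an
# ordinal bin id that is treated like any other discrete attribute value
edges = np.linspace(income.min(), income.max(), num=5)  # 4 bins -> 5 edges
bin_ids = np.digitize(income, edges[1:-1])              # ids in {0, 1, 2, 3}
print(bin_ids)                                          # [1 1 0 1 0 0 3 0 0 0]
```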
How to Estimate Probabilities from Data?
● Normal distribution:

P(Ai | cj) = (1 / √(2π σij²)) exp( −(Ai − μij)² / (2σij²) )

– One distribution for each (Ai, cj) pair
● For (Income, Class=No) in the training data above:
– sample mean = 110
– sample variance = 2975

P(Income = 120 | No) = (1 / (√(2π) × 54.54)) exp( −(120 − 110)² / (2 × 2975) )
                     = 0.0072
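The value 0.0072 can be checked directly; note that 54.54 ≈ √2975, the sample standard deviation (a minimal sketch):

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

print(gaussian_pdf(120, 110, 2975))   # ≈ 0.0072
```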
A complete example

Example of Naïve Bayes Classifier
Given a test record:
X = (Refund = No, Marital Status = Married, Income = 120K)

Training data: the same 10-record table as above.

Naive Bayes classifier learned from the training data:
P(Refund=Yes|No) = 3/7
P(Refund=No|No) = 4/7
P(Refund=Yes|Yes) = 0
P(Refund=No|Yes) = 1
P(Marital Status=Single|No) = 2/7
P(Marital Status=Divorced|No) = 1/7
P(Marital Status=Married|No) = 4/7
P(Marital Status=Single|Yes) = 2/7
P(Marital Status=Divorced|Yes) = 1/7
P(Marital Status=Married|Yes) = 0
For taxable income:
If class=No: sample mean = 110, sample variance = 2975
If class=Yes: sample mean = 90, sample variance = 25
Example of Naïve Bayes Classifier
Given a test record:
X = (Refund = No, Marital Status = Married, Income = 120K)

● P(X|Class=No) = P(Refund=No|Class=No)
                × P(Married|Class=No)
                × P(Income=120K|Class=No)
              = 4/7 × 4/7 × 0.0072 = 0.0024

● P(X|Class=Yes) = P(Refund=No|Class=Yes)
                 × P(Married|Class=Yes)
                 × P(Income=120K|Class=Yes)
               = 1 × 0 × 1.2 × 10⁻⁹ = 0

Since P(X|No) P(No) > P(X|Yes) P(Yes),
P(No|X) > P(Yes|X)
=> Class = No
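The full calculation, assembled into one script using the estimates above (a minimal sketch; the probabilities are hard-coded rather than estimated from the table):

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

prior = {"No": 7 / 10, "Yes": 3 / 10}
p_refund_no = {"No": 4 / 7, "Yes": 1.0}               # P(Refund=No | class)
p_married = {"No": 4 / 7, "Yes": 0.0}                 # P(Married   | class)
income_params = {"No": (110, 2975), "Yes": (90, 25)}  # (mean, variance)

scores = {}
for c in ("No", "Yes"):
    mean, var = income_params[c]
    likelihood = p_refund_no[c] * p_married[c] * gaussian_pdf(120, mean, var)
    scores[c] = likelihood * prior[c]                 # P(X|c) P(c)

print(scores)                        # No: ≈ 0.0016, Yes: 0.0
print(max(scores, key=scores.get))   # 'No'
```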
Naïve Bayes Classifier
● If one of the conditional probabilities is zero, then
the entire expression becomes zero
● Probability estimation:

Original:    P(Ai | C) = Nic / Nc

Laplace:     P(Ai | C) = (Nic + 1) / (Nc + c)

m-estimate:  P(Ai | C) = (Nic + mp) / (Nc + m)

where c: number of classes, p: prior probability,
m: parameter
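A sketch of the three estimators applied to the zero count from the running example, P(Refund=Yes|Yes) with Nic = 0 and Nc = 3 (the m and p values below are arbitrary choices, for illustration only):

```python
def original_estimate(n_ic, n_c):
    return n_ic / n_c

def laplace_estimate(n_ic, n_c, c):
    # c: number of classes, as defined on the slide
    return (n_ic + 1) / (n_c + c)

def m_estimate(n_ic, n_c, m, p):
    # m: parameter, p: prior probability of the attribute value
    return (n_ic + m * p) / (n_c + m)

print(original_estimate(0, 3))          # 0.0 -- wipes out the whole product
print(laplace_estimate(0, 3, c=2))      # 0.2
print(m_estimate(0, 3, m=3, p=1/3))     # 0.1666...
```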
Naïve Bayes: Pros and Cons
● Robust to isolated noise points
● Can handle missing values by ignoring the
instance during probability estimate calculations
● Robust to irrelevant attributes
● Independence assumption may not hold for some
attributes
– Presence of correlated attributes can degrade
performance of NB classifier
Example with correlated attribute
● Two attributes A, B and class Y (all binary)
● Prior probabilities:
– P(Y=0) = P(Y=1) = 0.5
● Class conditional probabilities of A:
– P(A=0 | Y=0) = 0.4 P(A=1 | Y=0) = 0.6
– P(A=0 | Y=1) = 0.6 P(A=1 | Y=1) = 0.4
● Class-conditional probabilities of B are the same as
those of A
● B is perfectly correlated with A when Y=0, but is
independent of A when Y=1
Example with correlated attribute
● Need to classify a record with A=0, B=0
● P(Y=0 | A=0,B=0) = P(A=0,B=0 | Y=0) P(Y=0) / P(A=0,B=0)
                   = P(A=0|Y=0) P(B=0|Y=0) P(Y=0) / P(A=0,B=0)
                   = (0.16 × 0.5) / P(A=0,B=0)
● P(Y=1 | A=0,B=0) = P(A=0,B=0 | Y=1) P(Y=1) / P(A=0,B=0)
                   = P(A=0|Y=1) P(B=0|Y=1) P(Y=1) / P(A=0,B=0)
                   = (0.36 × 0.5) / P(A=0,B=0)
● Hence the naive Bayes prediction is Y=1
Example with correlated attribute
● Need to classify a record with A=0, B=0
● In reality, since B is perfectly correlated with A when
Y=0:
● P(Y=0 | A=0,B=0) = P(A=0,B=0 | Y=0) P(Y=0) / P(A=0,B=0)
                   = P(A=0|Y=0) P(Y=0) / P(A=0,B=0)
                   = (0.4 × 0.5) / P(A=0,B=0)
● Hence the prediction should have been Y=0
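The two sets of numbers side by side (a minimal sketch; the evidence term P(A=0,B=0) cancels, so only the numerators are compared):

```python
p_a0 = {0: 0.4, 1: 0.6}   # P(A=0 | Y=y) for y = 0, 1
p_b0 = {0: 0.4, 1: 0.6}   # P(B=0 | Y=y), same as A
prior = 0.5               # P(Y=0) = P(Y=1)

# Naive Bayes scores (independence assumed for both classes)
nb_y0 = p_a0[0] * p_b0[0] * prior       # 0.16 * 0.5 = 0.08
nb_y1 = p_a0[1] * p_b0[1] * prior       # 0.36 * 0.5 = 0.18
print("NB predicts Y =", int(nb_y1 > nb_y0))           # 1

# True scores: when Y=0, B is a copy of A, so P(A=0,B=0|Y=0) = P(A=0|Y=0)
true_y0 = p_a0[0] * prior               # 0.4  * 0.5 = 0.20
true_y1 = p_a0[1] * p_b0[1] * prior     # 0.36 * 0.5 = 0.18
print("Truth predicts Y =", int(true_y1 > true_y0))    # 0
```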
Other Bayesian classifiers
● If it is suspected that the attributes may be
correlated:
● Can use other techniques such as Bayesian
Belief Networks (BBN)
● A BBN uses a graphical model (network) to capture prior
knowledge in a particular domain, and causal
dependencies among variables