Bayesian Learning
→ To determine the best hypothesis from space H, given the observed training data.
→ Bayes Theorem
→ To find the most probable hypothesis
→ Given the data D
→ Given initial knowledge about the prior probabilities of the various
hypotheses in H
Notations
→ P(h)
→ Initial probability that the hypothesis h holds, before we have observed the
training data.
→ Prior probability of h
→ That h is a correct hypothesis
→ If no such prior knowledge is available, simply assign the same prior probability to
each candidate hypothesis.
→ P(D)
→ Prior probability that the training data D will be observed, given no knowledge
about which hypothesis holds.
→ P(D|h)
→ Probability of observing data D, given some world in which hypothesis h
holds.
→ P(h|D)
→ Posterior probability of h
→ Probability that h holds, given the observed training data D
→ Reflects the influence of the training data.
Bayes Theorem
→ P(h|D) = P(D|h) P(h) / P(D)
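As an illustration of the theorem, here is a minimal Python sketch; the two hypotheses h1 and h2 and all the numbers are made-up values, not taken from the notes:

```python
# A minimal sketch of Bayes' theorem: P(h|D) = P(D|h) * P(h) / P(D)
# The two hypotheses and all numbers below are illustrative assumptions.
priors = {"h1": 0.5, "h2": 0.5}        # P(h): prior probability of each hypothesis
likelihoods = {"h1": 0.8, "h2": 0.3}   # P(D|h): probability of the observed data under each h

# P(D) by total probability: sum over h of P(D|h) * P(h)
p_D = sum(likelihoods[h] * priors[h] for h in priors)

# Posterior P(h|D) for each hypothesis
posteriors = {h: likelihoods[h] * priors[h] / p_D for h in priors}
print(posteriors)  # {'h1': 0.727..., 'h2': 0.272...}
```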
Naïve Bayes Classifier
→ Naive (Stupid!) Bayes is a classification technique based on Bayes’ Theorem.
→ Assumes all predictors (features/attributes) are independent of each other, i.e., there is
no inter-relation among them.
→ Every attribute is considered for classification/prediction and contributes equally to the
outcome.
→ Each instance 𝑥 is described by a conjunction of attribute values.
→ The target function 𝑓(𝑥) can take on any value from some finite set 𝑉
→ Given a set of training examples of the target function
→ A new instance described by the tuple of attribute values 〈𝑎1 , 𝑎2 , … , 𝑎𝑛 〉
→ The learner is asked to predict the target value, or classification, for this new
instance.
Bayesian approach
→ To assign the most probable target value 𝑣𝑀𝐴𝑃 (maximum a posteriori)
→ Given the attribute values 〈𝑎1 , 𝑎2 , … , 𝑎𝑛 〉 that describe the instance
→ v_MAP = argmax_{v_j ∈ V} P(v_j | a_1, a_2, …, a_n)
→ Pick the target value with the maximum posterior probability
→ Applying Bayes Theorem: P(h|D) = P(D|h) P(h) / P(D)
→ v_MAP = argmax_{v_j ∈ V} P(v_j | a_1, a_2, …, a_n)
         = argmax_{v_j ∈ V} P(a_1, a_2, …, a_n | v_j) P(v_j) / P(a_1, a_2, …, a_n)
         = argmax_{v_j ∈ V} P(a_1, a_2, …, a_n | v_j) P(v_j)
→ The denominator P(a_1, a_2, …, a_n) is the same for every v_j, so it can be dropped from the argmax.
→ Estimate 𝑃(𝑣𝑗 )
→ by counting how many times each target value 𝑣𝑗 occurs in the training
data
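A quick Python sketch of this counting estimate (the label list is an assumed toy example, not the PlayTennis data):

```python
from collections import Counter

# Hypothetical list of target values for the training examples (illustrative only)
labels = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no", "yes", "no"]

counts = Counter(labels)                                   # how often each target value occurs
priors = {v: c / len(labels) for v, c in counts.items()}   # P(v_j) = count(v_j) / total
print(priors)  # {'yes': 0.6, 'no': 0.4}
```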
Naïve Bayes Classifier Approach
→ Assume that the attribute values are conditionally independent, given the target value.
→ That is, the probability of observing the conjunction 𝑎1 , 𝑎2 , … , 𝑎𝑛 is just the product
of the probabilities for the individual attributes.
→ P(a_1, a_2, …, a_n | v_j) = ∏_i P(a_i | v_j)
→ Substituting in the previous equation
→ Target value output by the Naive Bayes classifier
→ v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(a_i | v_j)
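Putting the pieces together, a minimal Python sketch of such a classifier (an assumed counting-based implementation with illustrative function names, not code from the notes):

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples, labels):
    """Estimate P(v_j) and P(a_i | v_j) by counting (categorical attributes only)."""
    class_counts = Counter(labels)
    priors = {v: c / len(labels) for v, c in class_counts.items()}   # P(v_j)
    cond_counts = defaultdict(int)          # (attribute index, value, class) -> count
    for x, v in zip(examples, labels):
        for i, a in enumerate(x):
            cond_counts[(i, a, v)] += 1
    likelihoods = {k: c / class_counts[k[2]] for k, c in cond_counts.items()}  # P(a_i | v_j)
    return priors, likelihoods

def predict(x, priors, likelihoods):
    """Return v_NB = argmax_v P(v) * prod_i P(a_i | v)."""
    best_v, best_score = None, -1.0
    for v, p_v in priors.items():
        score = p_v
        for i, a in enumerate(x):
            score *= likelihoods.get((i, a, v), 0.0)   # unseen value -> probability 0
        if score > best_score:
            best_v, best_score = v, score
    return best_v
```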
Example
→ Outlook = sunny
Temperature = cool
Humidity = high
Wind = strong
→ To predict the target value (yes or no) of the target concept Play Tennis for this new
instance.
→ v_NB = argmax_{v_j ∈ {yes, no}} P(v_j) P(Out = sunny | v_j) P(Temp = cool | v_j)
  P(Hum = high | v_j) P(Wind = strong | v_j)
→ Calculating v_NB requires 10 probabilities (2 class priors and 8 conditional
probabilities), which can be estimated from the training data.
→ From the 14 training examples
→ P(PlayTennis = yes) = 9/14 = 0.64
→ P(PlayTennis = no) = 5/14 = 0.36
→ Conditional probabilities, e.g., for Wind = strong
→ P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
  P(Wind = strong | PlayTennis = no) = 3/5 = 0.60
→ P(Hum = high | PlayTennis = yes) = 3/9 = 0.33
  P(Hum = high | PlayTennis = no) = 4/5 = 0.80
→ P(Out = sunny | PlayTennis = yes) = 2/9 = 0.22
  P(Out = sunny | PlayTennis = no) = 3/5 = 0.60
→ P(Temp = cool | PlayTennis = yes) = 3/9 = 0.33
  P(Temp = cool | PlayTennis = no) = 1/5 = 0.20
→ Using these probability estimates and similar estimates from the remaining attribute
values
→ P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes)
  = 0.64 × 0.22 × 0.33 × 0.33 × 0.33 ≈ 0.005
→ P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no)
  = 0.36 × 0.60 × 0.20 × 0.80 × 0.60 ≈ 0.0207
→ Thus the naïve Bayes Classifier assigns the target value Play Tennis = ‘No’ to this
new instance, based on the probability estimates learned from the training data.
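The arithmetic above can be verified with a short Python sketch that uses the same probability estimates:

```python
# Re-checking the PlayTennis example with the probability estimates listed above
priors = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": {"sunny": 2 / 9, "cool": 3 / 9, "high": 3 / 9, "strong": 3 / 9},
    "no":  {"sunny": 3 / 5, "cool": 1 / 5, "high": 4 / 5, "strong": 3 / 5},
}
instance = ["sunny", "cool", "high", "strong"]

scores = {}
for v in priors:
    score = priors[v]
    for a in instance:
        score *= cond[v][a]            # multiply in P(a_i | v_j)
    scores[v] = score

print(scores)                          # roughly {'yes': 0.0053, 'no': 0.0206}
print(max(scores, key=scores.get))     # 'no'
```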
Pros and Cons of Naive Bayes
→ Pros:
→ It is easy and fast to predict the class of a test data set.
→ It also performs well in multi-class prediction.
→ When the assumption of independence holds, a Naive Bayes classifier performs
better than other models such as logistic regression, and it needs less
training data.
→ It performs well with categorical (discrete) input variables compared to
numerical (continuous) ones.
→ Cons:
→ If a categorical variable has a category in the test data set that was not observed
in the training data set, the model will assign it a zero probability and will be
unable to make a prediction (illustrated in the sketch after this list).
→ This is often known as the “Zero Frequency” problem.
→ Another limitation of Naive Bayes is the assumption of independent
predictors.
→ In real life, it is almost impossible to get a set of predictors that are
completely independent.
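A tiny Python sketch of the zero-frequency issue (the probabilities below are assumed, not taken from the notes); an attribute value never seen with a class drives the whole product to zero:

```python
# Zero-frequency problem: an attribute value never observed with a class zeroes the product
# (all numbers below are illustrative assumptions)
priors = {"yes": 0.6, "no": 0.4}
cond_outlook = {
    "yes": {"sunny": 0.2, "overcast": 0.5, "rain": 0.3},
    "no":  {"sunny": 0.6, "rain": 0.4},   # "overcast" never observed with class "no"
}

for v in priors:
    score = priors[v] * cond_outlook[v].get("overcast", 0.0)
    print(v, score)   # class "no" scores 0.0 purely because of the unseen category
```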