Machine Learning
CSE343/CSE543/ECE363/ECE563
Lecture 15 | Take your own notes during lectures
Vinayak Abrol <[email protected]>
Association in Clustering
E Step: Assignment
(e.g., K-means)
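As a quick refresher, here is a minimal sketch of the K-means assignment (E) step with made-up 1-D points and centroids: each example is associated with its nearest centroid.

# A minimal sketch of the K-means E step (made-up 1-D data): each point
# is associated with the index of its nearest centroid.
points = [0.2, 0.9, 3.8, 4.1]
centroids = [0.5, 4.0]

assignments = [min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
               for p in points]
print(assignments)  # [0, 0, 1, 1]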
Now let’s generalize this idea of association.
If you have a distribution for each class, then it is easy to
calculate the probability of an example x
belonging to a particular class cj:

p(cj | x)

[In many cases a closed-form expression exists
for the formula above]
Association: Conditional Probability
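To make this concrete, here is a minimal sketch assuming each class generates 1-D data from a Gaussian whose parameters (made-up here) were estimated beforehand; the closed-form density gives p(x | cj), and normalizing gives p(cj | x).

from scipy.stats import norm

# A minimal sketch with made-up parameters: each class cj is modeled by a
# 1-D Gaussian p(x | cj) together with a prior p(cj).
params = {
    "c1": {"mean": 0.0, "std": 1.0, "prior": 0.6},
    "c2": {"mean": 3.0, "std": 1.5, "prior": 0.4},
}

def posterior(x):
    # p(cj | x) is proportional to p(x | cj) * p(cj); normalize over classes.
    scores = {c: norm.pdf(x, p["mean"], p["std"]) * p["prior"]
              for c, p in params.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

print(posterior(1.0))  # {'c1': ~0.77, 'c2': ~0.23}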
Bayes Classifier
Bayesian classifiers use Bayes theorem, i.e.,

p(cj | x) = p(x | cj) p(cj) / p(x)

p(cj | x): Probability of instance x being in class cj
To compute this we need the following:
p(x | cj): Probability of generating instance x given class cj
p(cj): Probability of occurrence of class cj
This is just how frequent class cj is in our database
p(x): Probability of instance x occurring
This can actually be ignored, since it is the same for all classes.
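To see why p(x) can be ignored, here is a minimal sketch with made-up likelihoods and priors: the denominator scales every class's score by the same constant, so the most probable class is unchanged.

# Made-up values of p(x | cj) and p(cj) for a single instance x.
likelihood = {"c1": 0.02, "c2": 0.08}
prior      = {"c1": 0.70, "c2": 0.30}

# Unnormalized scores p(x | cj) * p(cj).
scores = {c: likelihood[c] * prior[c] for c in likelihood}

# p(x) is the same constant for every class, so dividing by it
# never changes which class attains the maximum.
p_x = sum(scores.values())
posterior = {c: s / p_x for c, s in scores.items()}

assert max(scores, key=scores.get) == max(posterior, key=posterior.get)
print(max(scores, key=scores.get))  # 'c2'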
Naïve Bayes Classifier
How to handle multiple features?
Assume the features are independent given the class; what you get is called the Naïve Bayes Classifier:

p(x | cj) = p(x1 | cj) p(x2 | cj) … = ∏α p(xα | cj)

Here x is an instance/vector and ‘xα’ denotes its individual features.
- Naïve Bayes is fast and space efficient
Note: We can look up all the probabilities with a single scan of the database and
store them in a (small) table…
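A minimal sketch of that note, assuming a made-up categorical dataset: a single scan fills a small table of counts, and prediction just multiplies probabilities looked up from it.

from collections import defaultdict

# Made-up toy records: (features, class label).
data = [
    ({"outlook": "sunny", "windy": "no"},  "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "no"},  "play"),
]

class_count = defaultdict(int)   # counts of each class cj
feat_count  = defaultdict(int)   # counts of (cj, feature, value)

for features, label in data:     # the single scan of the database
    class_count[label] += 1
    for name, value in features.items():
        feat_count[(label, name, value)] += 1

def predict(features):
    best, best_score = None, -1.0
    for cj, cnt in class_count.items():
        score = cnt / len(data)               # prior p(cj)
        for name, value in features.items():  # product of p(xα | cj)
            score *= feat_count[(cj, name, value)] / cnt
        if score > best_score:
            best, best_score = cj, score
    return best

print(predict({"outlook": "sunny", "windy": "no"}))  # 'play'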
Naïve Bayes Classifier
- Robust to irrelevant attributes
If p(x1 | cj) = p(x1), then attribute x1 just contributes a constant factor to the
calculation of the conditional probability.
- Can handle missing values
If attribute xi has a value in only some instances x = [x1 x2 x3 x4 …], we can still
estimate p(xi | cj) from the instances where it is present; at prediction time a
missing feature simply drops out of the product (see the sketch at the end of this slide).
- Robust to isolated noise points
It uses all attributes for all predictions. The probability of a data point belonging to each class is
computed based on the distribution of features for each class, not its position/distance. Hence, feature
values are averaged out and isolated points can’t really alter the distribution.
Homework: Not entirely true for Bernoulli NB. Why?
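A minimal sketch of the missing-value point above, using made-up conditional probabilities: a feature without a value simply contributes no factor to the product.

# Made-up values of p(xα = value | cj) and p(cj).
cond_prob = {
    ("play", "outlook", "sunny"): 0.50, ("play", "windy", "no"): 1.00,
    ("stay", "outlook", "sunny"): 0.25, ("stay", "windy", "no"): 0.00,
}
prior = {"play": 0.5, "stay": 0.5}

def score(cj, features):
    s = prior[cj]
    for name, value in features.items():
        if value is None:        # missing value: skip this factor
            continue
        s *= cond_prob[(cj, name, value)]
    return s

x = {"outlook": "sunny", "windy": None}   # 'windy' is missing
print(max(prior, key=lambda cj: score(cj, x)))  # 'play'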
Thanks