Bayesian Classification
A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with
strong (naive) independence assumptions. A more descriptive term for the underlying probability
model would be "independent feature model".
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular
feature of a class is unrelated to the presence (or absence) of any other feature.
Let X be a data sample (“evidence”): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), the probability that the hypothesis holds given the
observed data sample X
P(H) (prior probability): the initial probability of the hypothesis, before observing X.
E.g., the probability that X will buy a computer, regardless of age, income, etc.
P(X): the probability that the sample data is observed.
P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds.
E.g., given that X will buy a computer, the probability that X is aged 31...40 with medium income.
Bayes' theorem is useful in that it provides a way of calculating the posterior probability P(H|X)
from P(H), P(X), and P(X|H). Bayes' theorem can be stated as
P(H|X) = \frac{P(X|H)\, P(H)}{P(X)}
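As a quick illustration, the posterior can be computed directly once the three quantities on the right-hand side are known. The snippet below is a minimal sketch in Python; the function name bayes_posterior and the numbers in the example call are illustrative, not taken from the text.

def bayes_posterior(p_x_given_h, p_h, p_x):
    """Compute P(H|X) = P(X|H) * P(H) / P(X) via Bayes' theorem."""
    return p_x_given_h * p_h / p_x

# Hypothetical values: P(X|H) = 0.6, P(H) = 0.3, P(X) = 0.4
print(bayes_posterior(0.6, 0.3, 0.4))  # 0.45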
Naive Bayesian Classification
The naive Bayesian classifier, or simple Bayesian classifier, works as follows:
Let D be a training set of tuples and their associated class labels, and each tuple is
represented by an n-D attribute vector X = (x1, x2, …, xn)
Suppose there are m classes C1, C2, …, Cm.
Classification is to derive the maximum posterior probability, i.e., the maximal P(Ci|X)
This can be derived from Bayes’ theorem
P(C_i|X) = \frac{P(X|C_i)\, P(C_i)}{P(X)}
Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.
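Put differently, the predicted class is the Ci that maximizes the product P(X|Ci) P(Ci). A minimal sketch of this decision rule; the class labels and probability values in the example call are hypothetical:

def predict(priors, likelihoods):
    """Return the class Ci maximizing P(X|Ci) * P(Ci).

    priors:      dict mapping class label -> P(Ci)
    likelihoods: dict mapping class label -> P(X|Ci)
    """
    return max(priors, key=lambda c: likelihoods[c] * priors[c])

# Hypothetical two-class example
print(predict({"yes": 0.6, "no": 0.4}, {"yes": 0.05, "no": 0.02}))  # yes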
A simplifying assumption: the attributes are conditionally independent given the class, i.e., there
is no dependence relation between attributes, so P(X|Ci) becomes

P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)
If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by
|Ci,D| (the number of tuples of class Ci in D)
If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with
mean μ and standard deviation σ:

g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

and

P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})
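This density can be evaluated directly from the class-specific mean and standard deviation estimated on the training tuples of Ci. A minimal sketch; the mean, standard deviation, and attribute value in the example call are hypothetical:

import math

def gaussian(x, mu, sigma):
    """Gaussian density g(x, mu, sigma), used as P(xk|Ci) for a continuous attribute Ak."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical: ages of the tuples in class Ci have mean 38 and standard deviation 12
print(gaussian(35, 38, 12))  # estimated density of age = 35 under that class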
Ex: Consider the following training dataset to illustrate classification by predicting the class
label for the situation "a student whose age is less than or equal to 30, with medium income and a
fair credit rating, buys a computer or not".
Understanding the Data
The table represents a dataset used for a classification problem. We're trying to predict whether
someone "buys a computer" (the "Class" column) based on several attributes:
RID (Record ID): A unique identifier for each data point.
age: Categorical age ranges: "<=30", "31...40", ">40".
income: Categorical income levels: "high", "medium", "low".
student: Binary (yes/no) indicating if the person is a student.
credit_rating: Categorical credit rating: "fair", "excellent".
Class: buys_computer: The target variable we want to predict. It's binary (yes/no).
1. Calculate Prior Probabilities P(Ci):
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
2. Compute Conditional Probabilities P(xk|Ci):
This step calculates the probability of observing each specific attribute value xk given each class
Ci.
Compute P(xk|Ci) for each attribute value of X and each class:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
3. Compute P(X|Ci) and P(X|Ci)P(Ci) for X = (age <= 30, income = medium, student = yes, credit_rating = fair):
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
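The entire calculation can be reproduced in a few lines of code. The sketch below simply plugs in the priors and conditional probabilities estimated above from the 14 training tuples; the dictionary keys such as "age<=30" are just illustrative labels.

# Priors and conditionals estimated in steps 1 and 2 above
priors = {"yes": 9 / 14, "no": 5 / 14}
conditionals = {
    "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9,
            "student=yes": 6 / 9, "credit=fair": 6 / 9},
    "no":  {"age<=30": 3 / 5, "income=medium": 2 / 5,
            "student=yes": 1 / 5, "credit=fair": 2 / 5},
}

# X = (age <= 30, income = medium, student = yes, credit_rating = fair)
scores = {}
for c in priors:
    p_x_given_c = 1.0
    for p in conditionals[c].values():
        p_x_given_c *= p              # naive independence: multiply the P(xk|Ci)
    scores[c] = p_x_given_c * priors[c]   # P(X|Ci) * P(Ci)

print(scores)                         # roughly {'yes': 0.028, 'no': 0.007}
print(max(scores, key=scores.get))    # yes -> buys_computer = yes

Since only the relative magnitudes of the two scores matter for picking the class, P(X) is never computed, exactly as noted in the derivation above.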