Module 3 - Classification
CLASSIFICATION
CLASSIFICATION: DEFINITION
Given a collection of records (the training set), each consisting of a set of attributes, one of which is the class: learn a model that expresses the class attribute as a function of the other attributes, so that previously unseen records are assigned a class as accurately as possible.
Figure: the classification workflow. A learning algorithm builds a model from the labeled Training Set (Tid, Attrib1-3, Class); the model is then applied to a Test Set of unseen records (e.g., Tid 11 and 15, class "?") to predict their labels.
Process (1): Model Construction
A classification algorithm analyzes the training data and constructs a classifier.

Training data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Process (2): Using the Model in Prediction
The classifier is evaluated on testing data and then applied to unseen data, e.g., the tuple (Jeff, Professor, 4): Tenured?
CLASSIFICATION TECHNIQUES

Naïve Bayesian Classifier: Comments
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumes class conditional independence, which causes a loss of accuracy
In practice, dependencies exist among variables
E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
Dependencies among these cannot be modeled by a Naïve Bayesian classifier
How to deal with these dependencies? Bayesian Belief Networks
Example: Naïve Bayes
Example: Play Tennis
Test instance x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
Example
Learning Phase
Class priors: P(Play=Yes) = 9/14, P(Play=No) = 5/14

Outlook    Play=Yes  Play=No      Temperature  Play=Yes  Play=No
Sunny      2/9       3/5          Hot          2/9       2/5
Overcast   4/9       0/5          Mild         4/9       2/5
Rain       3/9       2/5          Cool         3/9       1/5

Humidity   Play=Yes  Play=No      Wind         Play=Yes  Play=No
High       3/9       4/5          Strong       3/9       3/5
Normal     6/9       1/5          Weak         6/9       2/5
MAP rule (Test Phase)
P(Yes|x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No|x') ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
Since 0.0206 > 0.0053, the MAP rule labels x' as Play=No.
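As a quick check, a minimal Python sketch of this computation, using the probability tables above (the dictionary layout is illustrative, not from the slides):

```python
# Conditional probabilities estimated in the learning phase (tables above)
p_yes, p_no = 9/14, 5/14  # class priors

likelihood = {
    "Yes": {"Sunny": 2/9, "Cool": 3/9, "High": 3/9, "Strong": 3/9},
    "No":  {"Sunny": 3/5, "Cool": 1/5, "High": 4/5, "Strong": 3/5},
}

x = ["Sunny", "Cool", "High", "Strong"]  # test instance x'

score_yes, score_no = p_yes, p_no
for value in x:
    score_yes *= likelihood["Yes"][value]
    score_no *= likelihood["No"][value]

print(round(score_yes, 4), round(score_no, 4))  # 0.0053 0.0206
print("Play =", "Yes" if score_yes > score_no else "No")  # Play = No
```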
Splitting Attributes
Training data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Tree 1: split on Refund first.
Refund = Yes -> NO
Refund = No  -> MarSt
    MarSt = Married          -> NO
    MarSt = Single, Divorced -> TaxInc
        TaxInc < 80K -> NO
        TaxInc > 80K -> YES

Tree 2: split on MarSt first.
MarSt = Married          -> NO
MarSt = Single, Divorced -> Refund
    Refund = Yes -> NO
    Refund = No  -> TaxInc
        TaxInc < 80K -> NO
        TaxInc > 80K -> YES

There can be more than one tree that fits the same data.
Figure: the decision tree classification task. A tree induction algorithm learns a decision tree from the Training Set; the tree is then applied to the Test Set (e.g., Tid 11 and 15, class unknown) to predict labels.
APPLY MODEL TO TEST DATA
Start from the root of the tree.
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Tree:
Refund = Yes -> NO
Refund = No  -> MarSt
    MarSt = Married          -> NO
    MarSt = Single, Divorced -> TaxInc
        TaxInc < 80K -> NO
        TaxInc > 80K -> YES

Traversal: Refund = No, so follow the No branch to MarSt; MarSt = Married, so we reach the leaf NO.
Assign Cheat to "No".
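A minimal sketch of this traversal in Python, encoding the tree above as nested conditionals (function and dict names are illustrative):

```python
def classify(record):
    """Walk the decision tree above for one record given as a dict."""
    if record["Refund"] == "Yes":
        return "No"                     # Refund = Yes -> leaf NO
    if record["MarSt"] == "Married":
        return "No"                     # MarSt = Married -> leaf NO
    return "Yes" if record["TaxInc"] > 80 else "No"  # TaxInc split at 80K

test = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test))  # No
```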
DECISION TREE CLASSIFICATION TASK
Figure: a tree induction algorithm learns a decision tree from the labeled Training Set (Tid 1-10); the Apply Model step then uses the tree to classify the Test Set (Tid 11 and 15, class unknown).
DECISION TREE INDUCTION
Many algorithms:
Hunt's Algorithm (one of the earliest)
CART (Classification And Regression Trees)
ID3 (Iterative Dichotomiser), C4.5
SLIQ, SPRINT
TREE INDUCTION
Greedy strategy:
Split the records based on an attribute test that optimizes a certain criterion.
Issues
Determine how to split the records
    How to specify the attribute test condition?
    How to determine the best split?
Determine when to stop splitting
HOW TO SPECIFY THE TEST CONDITION?
Depends on the attribute type:
Nominal
Ordinal
Continuous

Splitting based on nominal attributes
Multi-way split: use as many partitions as there are distinct values.
    CarType: Family / Sports / Luxury
Binary split: divides values into two subsets; need to find the optimal partitioning.
    CarType: {Sports, Luxury} vs. {Family}  OR  {Family, Luxury} vs. {Sports}

Splitting based on ordinal attributes
Multi-way split: use as many partitions as there are distinct values.
    Size: Small / Medium / Large
Binary split: must preserve the order of the values. A partition such as {Small, Large} vs. {Medium} violates the order property.
SPLITTING BASED ON CONTINUOUS ATTRIBUTES
Different ways of handling:
Discretization to form an ordinal categorical attribute
    Static: discretize once at the beginning
    Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
Binary decision: (A < v) or (A >= v)
    E.g., Taxable Income > 80K? Yes / No
Multi-way split into ranges
    E.g., Taxable Income? < 10K / [10K, 80K] / > 80K
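A small Python sketch contrasting the two static discretization schemes on the Taxable Income values from the earlier table (the bin count k = 4 is an arbitrary choice):

```python
# Taxable Income values (in K) from the Refund/MarSt/Cheat table above
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
k = 4

# Equal-interval bucketing: bins of equal width over [min, max]
lo, hi = min(incomes), max(incomes)
width = (hi - lo) / k
equal_width_bin = [min(int((v - lo) / width), k - 1) for v in incomes]

# Equal-frequency bucketing: roughly the same number of values per bin
ranked = sorted(range(len(incomes)), key=lambda i: incomes[i])
equal_freq_bin = [0] * len(incomes)
for rank, i in enumerate(ranked):
    equal_freq_bin[i] = rank * k // len(incomes)

print(equal_width_bin)  # equal-interval bins
print(equal_freq_bin)   # equal-frequency (quantile) bins
```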
Which test condition is the best? The measures developed for selecting the best split are often based on the degree of impurity of the child nodes. Impurity measures include Entropy(t), Gini(t), and Classification error(t).
HOW TO DETERMINE THE BEST SPLIT
Greedy approach:
Nodes with a homogeneous class distribution are preferred.
Need a measure of node impurity:
    C0: 5, C1: 5  -> non-homogeneous, high degree of impurity
    C0: 9, C1: 1  -> homogeneous, low degree of impurity
CHOOSING ATTRIBUTES
• The order in which attributes are chosen determines how complicated the tree is.
• ID3 uses information theory to determine the most informative attribute.
• The information content of a message is inversely related to the probability of receiving it:
    information(M) = log2(1 / probability(M))
Entropy:
Given a collection S of examples drawn from c classes,
    Entropy(S) = Σ -p(I) log2 p(I)
where p(I) is the proportion of S belonging to class I and the sum runs over the c classes.

Gain(S, A), the information gain of example set S on attribute A, is defined as
    Gain(S, A) = Entropy(S) - Σv (|Sv| / |S|) · Entropy(Sv)
where:
    v ranges over all possible values of attribute A
    Sv = subset of S for which attribute A has value v
    |Sv| = number of elements in Sv
    |S| = number of elements in S
These are the formulas used by the ID3 algorithm.
TRAINING SET
RID Age Income Student Credit Buys
1 <30 High No Fair No
2 <30 High No Excellent No
3 31-40 High No Fair Yes
4 >40 Medium No Fair Yes
5 >40 Low Yes Fair Yes
6 >40 Low Yes Excellent No
7 31-40 Low Yes Excellent Yes
8 <30 Medium No Fair No
9 <30 Low Yes Fair Yes
10 >40 Medium Yes Fair Yes
11 <30 Medium Yes Excellent Yes
12 31-40 Medium No Excellent Yes
13 31-40 High Yes Fair Yes
14 >40 Medium No Excellent No
S is a collection of 14 examples with 9 YES and 5 NO examples, so
Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Let us consider Age as the splitting attribute:
Gain(S, Age) = Entropy(S) - (5/14)·Entropy(S<30) - (4/14)·Entropy(S31-40) - (5/14)·Entropy(S>40)
             = 0.940 - (5/14)·0.971 - (4/14)·0 - (5/14)·0.971
             = 0.246
Similarly, consider Student as the splitting attribute:
Gain(S, Student) = Entropy(S) - (7/14)·Entropy(S_YES) - (7/14)·Entropy(S_NO) = 0.151
Similarly, consider Credit as the splitting attribute:
Gain(S, Credit) = Entropy(S) - (8/14)·Entropy(S_FAIR) - (6/14)·Entropy(S_EXCELLENT) = 0.048
Similarly, consider Income as the splitting attribute:
Gain(S, Income) = Entropy(S) - (4/14)·Entropy(S_HIGH) - (6/14)·Entropy(S_MED) - (4/14)·Entropy(S_LOW) = 0.029
We see that Gain(S, Age) is the maximum, at 0.246. Hence Age is chosen as the splitting attribute.

Age
    <30:   5 samples (split further)
    31-40: YES (all samples in this branch are YES)
    >40:   5 samples (split further)
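A short Python sketch that reproduces Entropy(S) = 0.940 and Gain(S, Age) = 0.246 on this training set (the tuple encoding is illustrative):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr_index, labels):
    """Information gain of splitting `rows` on the attribute at `attr_index`."""
    total, n = entropy(labels), len(rows)
    for value in set(r[attr_index] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr_index] == value]
        total -= (len(subset) / n) * entropy(subset)
    return total

# (Age, Income, Student, Credit) for the 14 training records, with Buys labels
rows = [("<30","High","No","Fair"), ("<30","High","No","Excellent"),
        ("31-40","High","No","Fair"), (">40","Medium","No","Fair"),
        (">40","Low","Yes","Fair"), (">40","Low","Yes","Excellent"),
        ("31-40","Low","Yes","Excellent"), ("<30","Medium","No","Fair"),
        ("<30","Low","Yes","Fair"), (">40","Medium","Yes","Fair"),
        ("<30","Medium","Yes","Excellent"), ("31-40","Medium","No","Excellent"),
        ("31-40","High","Yes","Fair"), (">40","Medium","No","Excellent")]
labels = ["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No"]

print(round(entropy(labels), 3))        # 0.940
print(round(gain(rows, 0, labels), 3))  # Gain(S, Age) = 0.246
```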
Recursively, we find the splitting attributes at the next level.
Consider Student as the next splitting attribute for the Age<30 branch:
Gain(Age<30, Student) = Entropy(Age<30) - (2/5)·Entropy(S_YES) - (3/5)·Entropy(S_NO)
                      = 0.971 - (2/5)·0 - (3/5)·0 = 0.971
Similarly, with Credit as the next splitting attribute:
Gain(Age<30, Credit) = Entropy(Age<30) - (3/5)·Entropy(S_FAIR) - (2/5)·Entropy(S_EXCELLENT) = 0.0202
Similarly, with Income as the next splitting attribute:
Gain(Age<30, Income) = Entropy(Age<30) - (2/5)·Entropy(S_HIGH) - (2/5)·Entropy(S_MED) - (1/5)·Entropy(S_LOW) = 0.571
We see that Gain(Age<30, Student) is the maximum, at 0.971. Hence Student is chosen as the next splitting attribute.
Similarly, we find that Gain(Age>40, Credit) is the maximum, so Credit splits the Age>40 branch. The final tree:

Age
    <30:   Student
               yes: YES
               no:  NO
    31-40: YES
    >40:   Credit
               fair:      YES
               excellent: NO
EXTRACTING CLASSIFICATION RULES FROM TREES
Represent the knowledge in the form of IF-THEN rules:
One rule is created for each path from the root to a leaf.
Each attribute-value pair along a path forms a conjunction.
The leaf node holds the class prediction.
Rules are easier for humans to understand.
Example
IF age = "<30" AND student = "no" THEN buys_computer = "no"
IF age = "<30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31-40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
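As a minimal sketch, the extracted rules translate directly into code (the function name is illustrative):

```python
def buys_computer(age, student, credit_rating):
    """IF-THEN rules read off the decision tree, one per root-to-leaf path."""
    if age == "<30":
        return "yes" if student == "yes" else "no"
    if age == "31-40":
        return "yes"
    # age == ">40"
    return "yes" if credit_rating == "fair" else "no"

print(buys_computer("<30", "no", "fair"))  # no
print(buys_computer(">40", "no", "fair"))  # yes
```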
CONSTRUCT A DECISION TREE FOR THE
BELOW EXAMPLE:
Tree Pruning
Post-pruning
Grow the decision tree to its entirety.
Trim the nodes of the decision tree in a bottom-up fashion.
If the generalization error improves after trimming, replace the sub-tree by a leaf node.
The class label of the leaf node is determined from the majority class of instances in the sub-tree.
A combined approach may be used, as no single pruning method has been found to be superior to the others.
LINEAR REGRESSION: LEAST-SQUARES COEFFICIENTS
For a line y = w0 + w1·x fitted to a training data set D of (xi, yi) pairs, the least-squares estimates are

    w1 = Σ_{i=1}^{|D|} (xi - x̄)(yi - ȳ) / Σ_{i=1}^{|D|} (xi - x̄)²
    w0 = ȳ - w1·x̄
RESIDUALS
Figure: the six tip amounts (5, 17, 11, 8, 14, 5) plotted against the mean line ȳ = 10; each residual is the vertical distance from an observed point to the line.
SQUARING THE RESIDUALS (ERRORS)
The goal of simple linear regression is to create a linear model that minimizes the sum of squared residuals/errors (SSE).
If our regression model is significant, it will "eat up" much of the raw SSE. Our aim is to develop a regression line that fits the data with minimum SSE.
QUICK REVIEW
Figure: review plots of correlation (a scatter plot of two variables) and ANOVA.
LINEAR REGRESSION
y = mx + b
where
x = the independent (predictor) variable
m = slope of the line = rise / run
b = the y-intercept, i.e., the value of y when x = 0
SIMPLE LINEAR REGRESSION MODEL
y = β0 + β1·x + ε
where β0 is the y-intercept of the population regression line, β1 is its slope, and ε is the random error term.

WHEN THE SLOPE β1 = 0
If β1 = 0, y does not depend on x: the regression line is horizontal at ȳ, and x has no predictive value for y.
GETTING READY FOR LEAST SQUARES

Meal (#)  Bill Amount (Rs)  Tip Amount (Rs)
1         34.00             5.00
2         108.00            17.00
3         64.00             11.00
4         88.00             8.00
5         99.00             14.00
6         51.00             5.00

We want to know to what degree the tip amount can be predicted from the bill.
TIP is the dependent variable; Bill is the independent variable.
(Figure: scatter plot of tip in Rs. against bill amount.)
LEAST SQUARES CRITERIA
Choose the line that minimizes the sum of squared vertical distances (residuals) between the observed points and the line.
STEP I: DRAW A SCATTER PLOT
(Figure: scatter plot of the six (bill, tip) points.)

STEP II: LOOK FOR A VISUAL LINE
(Figure: the same scatter plot with a candidate straight line drawn through the points.)
STEP III: CORRELATION (OPTIONAL)
Is the relationship strong? Yes, in our case!
(Figure: the scatter plot shows a clear positive trend.)
PEARSON CORRELATION
r = Σ (xi - x̄)(yi - ȳ) / √( Σ (xi - x̄)² · Σ (yi - ȳ)² )

The result lies between -1 and 1. You will very rarely see exactly 0, -1, or 1; you'll get a number somewhere in between. The closer the value of r gets to zero, the greater the variation of the data points around the line of best fit.
High correlation: 0.5 to 1.0 or -0.5 to -1.0
Medium correlation: 0.3 to 0.5 or -0.3 to -0.5
Low correlation: 0.1 to 0.3 or -0.1 to -0.3
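A small Python check of Pearson's r on the bill/tip data; the value, roughly 0.866, is computed from the table above rather than stated on the slides:

```python
from math import sqrt

bills = [34, 108, 64, 88, 99, 51]
tips  = [5, 17, 11, 8, 14, 5]

def pearson(xs, ys):
    """Pearson correlation coefficient r for paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(round(pearson(bills, tips), 3))  # ~0.866, a high correlation
```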
STEP IV: DESCRIPTIVE STATISTICS / CENTROID
Using the bill/tip table above:
    x̄ (mean bill) = 74, ȳ (mean tip) = 10
The best-fit line passes through the centroid (x̄, ȳ) = (74, 10).
(Figure: scatter plot with the centroid marked; the x-axis is the total bill in Rs.)
STEP V: CALCULATIONS
Using the least-squares formulas on the tips data (x̄ = 74, ȳ = 10):
    Σ (xi - x̄)(yi - ȳ) = 200 + 238 - 10 - 28 + 100 + 115 = 615
    Σ (xi - x̄)² = 1600 + 1156 + 100 + 196 + 625 + 529 = 4206
    w1 (slope) = 615 / 4206 ≈ 0.146

B0 CALCULATION
    w0 (intercept) = ȳ - w1·x̄ = 10 - 0.146 × 74 ≈ -0.82

YOUR REGRESSION LINE
    ŷ ≈ -0.82 + 0.146·x (predicted tip for a bill of x rupees)
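A compact Python sketch reproducing these least-squares estimates from the table:

```python
bills = [34, 108, 64, 88, 99, 51]
tips  = [5, 17, 11, 8, 14, 5]

n = len(bills)
x_bar = sum(bills) / n  # 74.0
y_bar = sum(tips) / n   # 10.0

# Least-squares slope and intercept
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(bills, tips))  # 615.0
den = sum((x - x_bar) ** 2 for x in bills)                         # 4206.0
w1 = num / den            # ~0.146
w0 = y_bar - w1 * x_bar   # ~-0.82

print(f"tip ≈ {w0:.2f} + {w1:.3f} * bill")
print(round(w0 + w1 * 100, 2))  # predicted tip for a Rs. 100 bill (~13.8)
```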
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:

Actual class \ Predicted class   C1                      ¬C1
C1                               True Positives (TP)     False Negatives (FN)
¬C1                              False Positives (FP)    True Negatives (TN)
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
Accuracy = (TP + TN) / All
Error rate = (FP + FN) / All
Sensitivity (true positive rate) = TP / P, where P is the number of positive tuples
Specificity (true negative rate) = TN / N, where N is the number of negative tuples

Classifier Evaluation Metrics: Precision and Recall, and F-measures
Precision: exactness, i.e., what % of tuples that the classifier labeled as positive are actually positive: Precision = TP / (TP + FP)
Recall: completeness, i.e., what % of positive tuples the classifier labeled as positive: Recall = TP / (TP + FN)
F-measure: the harmonic mean of precision and recall: F = 2 × Precision × Recall / (Precision + Recall)
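A quick sketch of these metrics in Python; the TP/FN/FP/TN counts below are made-up illustration values, not from the slides:

```python
# Hypothetical confusion-matrix counts for illustration
TP, FN = 90, 10   # actual positives: P = 100
FP, TN = 5, 95    # actual negatives: N = 100

total = TP + TN + FP + FN
accuracy    = (TP + TN) / total               # 0.925
error_rate  = (FP + FN) / total               # 0.075
sensitivity = TP / (TP + FN)                  # 0.90
specificity = TN / (TN + FP)                  # 0.95
precision   = TP / (TP + FP)                  # ~0.947
recall      = TP / (TP + FN)                  # 0.90
f_measure   = 2 * precision * recall / (precision + recall)  # ~0.923

print(accuracy, round(precision, 3), recall, round(f_measure, 3))
```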
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
Holdout method
Given data is randomly partitioned into two independent
sets
Training set (e.g., 2/3) for model construction
Test set (e.g., 1/3) for accuracy estimation
Random sampling: a variation of holdout
Repeat holdout k times, accuracy = avg. of the accuracies
obtained
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets D1, ..., Dk, each of approximately equal size
At the i-th iteration, use Di as the test set and the others as the training set
Leave-one-out: k folds where k = # of tuples, for small-sized data
*Stratified cross-validation*: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
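As a sketch of k-fold evaluation in practice, assuming scikit-learn is available (the Iris data and decision-tree estimator are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold stratified cross-validation of a decision tree classifier
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print(scores.mean())  # average accuracy over the 10 folds
```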
Evaluating Classifier Accuracy: Bootstrap
Bootstrap
Works well with small data sets
Samples the given training tuples uniformly with replacement
i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
There are several bootstrap methods; a common one is the .632 bootstrap
A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 - 1/d)^d ≈ e^(-1) ≈ 0.368)
Repeat the sampling procedure k times; the overall accuracy of the model is
    Acc(M) = Σ_{i=1}^{k} (0.632 × Acc(Mi) on the test set + 0.368 × Acc(Mi) on the training set)
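A tiny simulation of the 63.2% / 36.8% split, using a synthetic index set (sizes and seed are arbitrary):

```python
import random

random.seed(0)
d = 10_000
data = list(range(d))

# Sample d tuples uniformly with replacement (one bootstrap training set)
train = [random.choice(data) for _ in range(d)]
test = set(data) - set(train)  # tuples never selected form the test set

print(len(set(train)) / d)  # ~0.632 of distinct tuples land in training
print(len(test) / d)        # ~0.368 form the test set
```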
Estimating Confidence Intervals: Classifier Models M1 vs. M2
Suppose we have two classifiers, M1 and M2. Which one is better?
Use 10-fold cross-validation to obtain mean error rates err(M1) and err(M2)
These mean error rates are just estimates of error on the true population of future data cases
What if the difference between the two error rates is just attributed to chance?
Use a test of statistical significance
Obtain confidence limits for our error estimates
Estimating Confidence Intervals: Null Hypothesis
Perform 10-fold cross-validation
Assume the samples follow a t distribution with k-1 degrees of freedom (here, k = 10)
Use the t-test (Student's t-test)
Null hypothesis: M1 and M2 are the same
If we can reject the null hypothesis, then
we conclude that the difference between M1 and M2 is statistically significant
Choose the model with the lower error rate
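A minimal sketch of the paired t-test on per-fold error rates, assuming SciPy; the two error-rate lists are made-up illustration values:

```python
from scipy.stats import ttest_rel

# Per-fold error rates of M1 and M2 over the same 10 folds (illustrative)
err_m1 = [0.12, 0.15, 0.11, 0.14, 0.13, 0.12, 0.16, 0.13, 0.12, 0.14]
err_m2 = [0.18, 0.17, 0.16, 0.19, 0.17, 0.18, 0.20, 0.16, 0.18, 0.17]

t_stat, p_value = ttest_rel(err_m1, err_m2)
if p_value < 0.05:
    print("Reject H0: the difference is statistically significant")
    print("Choose the model with the lower mean error rate")
else:
    print("Cannot reject H0: the difference may be due to chance")
```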
Ensemble Methods: Increasing the Accuracy
Ensemble methods
Use a combination of models to increase accuracy
Combine a series of k learned models, M1, M2, ..., Mk, with the aim of creating an improved model M*
Popular ensemble methods
Bagging: averaging the prediction over a collection of classifiers
Boosting: weighted vote with a collection of classifiers
Ensemble: combining a set of heterogeneous classifiers

Bagging: Bootstrap Aggregation
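A minimal bagging sketch, assuming scikit-learn (the dataset and parameters are illustrative; BaggingClassifier's default base learner is a decision tree):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 50 base trees, each fit on a bootstrap sample of the training data;
# class predictions are aggregated by majority vote
bagging = BaggingClassifier(n_estimators=50, random_state=0)
print(cross_val_score(bagging, X, y, cv=10).mean())  # mean 10-fold accuracy
```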