INTRODUCTION TO DATA MINING
UNIT # 4
ACKNOWLEDGEMENT
Most of the slides in this presentation are taken from material provided by
Han and Kamber (Data Mining: Concepts and Techniques) and
Tan, Steinbach and Kumar (Introduction to Data Mining)
TODAY’S AGENDA
Recap
Handling Multi-State Variables
Confusion Matrix and Accuracy Computation
Recall and Precision
Sensitivity and Specificity
ROC Curve
CATEGORICAL ATTRIBUTES: COMPUTING GINI INDEX
From a historical perspective, the Gini index has always created a binary tree.
As a result, in the case of an attribute with multiple values, values are merged together to find the best binary split.
For each distinct value, gather counts for each class in the dataset:
Two-way split
Multi-way split (find the best partition of values)
Multi-way split on CarType:

          Family   Sports   Luxury
  C1         1        2        1
  C2         4        1        1
  Gini = 0.393

Two-way split {Sports, Luxury} vs. {Family}:

          {Sports, Luxury}   {Family}
  C1             3               1
  C2             2               4
  Gini = 0.400

Two-way split {Sports} vs. {Family, Luxury}:

          {Sports}   {Family, Luxury}
  C1          2              2
  C2          1              5
  Gini = 0.419
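As a sanity check, here is a minimal Python sketch (an illustration, not from the slides) that reproduces the three weighted Gini values above from the per-branch class counts:

```python
# A minimal sketch: weighted Gini of a candidate split, given per-branch
# class counts [count(C1), count(C2)].
def gini(counts):
    # Gini impurity: 1 - sum of squared class proportions
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def split_gini(partitions):
    # Weighted average of the branch impurities
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

print(round(split_gini([[1, 4], [2, 1], [1, 1]]), 3))  # multi-way split:          0.393
print(round(split_gini([[3, 2], [1, 4]]), 3))          # {Sports,Luxury}|{Family}: 0.4
print(round(split_gini([[2, 1], [2, 5]]), 3))          # {Sports}|{Family,Luxury}: 0.419
```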
HANDLING OF MULTI-STATE VARIABLES
The Gini index (and entropy) become biased towards variables having many states.
To overcome this, the following approach was recommended (in C4.5 using entropy, but it can be generalized to the Gini index as well):
Gain = SR(D) − SR_A(D)
where SR = splitting rule metric
D = class variable
A = an attribute on which the splitting rule is conditioned
Gain Ratio = Gain / SplitInfo
SPLITINFO
Gini(class) = 0.46
Gini_outlook(class) = 0.34 : Gain = 0.12
Gini_temperature(class) = 0.44 : Gain = 0.02
Gini_humidity(class) = 0.37 : Gain = 0.09
Gini_windy(class) = 0.43 : Gain = 0.03
SplitInfo is the splitting rule applied unconditionally to the attribute itself. If one is using Gini, then it becomes:
SplitInfo(outlook) = Gini(outlook) = 0.66
SplitInfo(temperature) = Gini(temperature) = 0.65
SplitInfo(humidity) = Gini(humidity) = 0.50
SplitInfo(windy) = Gini(windy) = 0.49
GAIN_RATIO
To obtain the gain ratio, we divide the gain by the SplitInfo:
Gain_ratio (outlook) = 0.12 / 0.66 = 0.18
Gain_ratio (temperature) = 0.02 / 0.65 = 0.03
Gain_ratio (humidity) = 0.09 / 0.5 = 0.18
Gain_ratio (windy) = 0.03 / 0.49 = 0.06
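A small Python sketch (assuming the Gini and SplitInfo values quoted on the previous slides) that reproduces this arithmetic:

```python
# Values taken from the weather/play example on the previous slides.
gini_class = 0.46
conditional_gini = {"outlook": 0.34, "temperature": 0.44,
                    "humidity": 0.37, "windy": 0.43}
split_info = {"outlook": 0.66, "temperature": 0.65,
              "humidity": 0.50, "windy": 0.49}

for attr, g in conditional_gini.items():
    gain = gini_class - g                 # Gain = SR(D) - SR_A(D)
    ratio = gain / split_info[attr]       # Gain Ratio = Gain / SplitInfo
    print(f"{attr}: gain = {gain:.2f}, gain ratio = {ratio:.2f}")
```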
EXAMPLE
Attribute 1 Attribute 2 Attribute 3 Class
A 70 T C1
A 90 T C2
A 85 F C2
A 95 F C2
A 70 F C1
B 90 T C1
B 78 F C1
B 65 T C1
B 75 F C1
C 80 T C2
C 70 T C2
C 80 F C1
C 80 F C1
C 96 F C1
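As a sketch of how the best splitting attribute for this exercise data might be found programmatically, the following Python (an illustration, not part of the original slides) computes the Gini gain and gain ratio for the two categorical attributes; Attribute 2 is numeric and would need a threshold search, so it is omitted here:

```python
from collections import Counter

# Exercise data from the slide as (Attribute 1, Attribute 3, Class) tuples.
rows = [
    ("A", "T", "C1"), ("A", "T", "C2"), ("A", "F", "C2"), ("A", "F", "C2"),
    ("A", "F", "C1"), ("B", "T", "C1"), ("B", "F", "C1"), ("B", "T", "C1"),
    ("B", "F", "C1"), ("C", "T", "C2"), ("C", "T", "C2"), ("C", "F", "C1"),
    ("C", "F", "C1"), ("C", "F", "C1"),
]

def gini(labels):
    # Gini impurity of a list of labels: 1 - sum(p_i^2)
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_after_split(rows, idx):
    # Weighted Gini of the class labels after a multi-way split on column idx
    groups = {}
    for row in rows:
        groups.setdefault(row[idx], []).append(row[-1])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())

for idx, name in [(0, "Attribute 1"), (1, "Attribute 3")]:
    gain = gini([r[-1] for r in rows]) - gini_after_split(rows, idx)
    split_info = gini([r[idx] for r in rows])  # unconditional Gini of the attribute
    print(f"{name}: gain = {gain:.3f}, gain ratio = {gain / split_info:.3f}")
```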
EXAMPLE II
Height Hair Eyes Class
Short Blond Blue +
Tall Blond Brown -
Tall Red Blue +
Short Dark Blue -
Tall Dark Blue -
Tall Blond Blue +
Tall Dark Brown -
Short Blond Brown -
ACCURACY OR ERROR RATES
Partition: training-and-testing
Use two independent data sets, e.g., a training set (2/3) and a test set (1/3)
Used for data sets with a large number of examples
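A minimal sketch of such a partition; scikit-learn and the iris data are stand-ins here, not part of the original slides:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Hold out one third of the data for testing, train on the remaining two thirds.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)
print(len(X_train), len(X_test))  # 100 50
```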
METRICS FOR PERFORMANCE EVALUATION…
                            Predicted Label
                            Positive (+)           Negative (-)
True Label   Positive (+)   True Positive (TP)     False Negative (FN)
             Negative (-)   False Positive (FP)    True Negative (TN)

Most widely-used metric:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
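A one-function Python sketch of this metric (the example counts are hypothetical):

```python
# Accuracy from the four confusion-matrix cells.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=4, tn=1, fp=3, fn=2))  # 0.5 (hypothetical counts)
```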
IMBALANCED CLASS PROBLEM
A class imbalance problem occurs when one or more classes have very low proportions in the training data as compared to the other classes.
In online advertising, an advertisement presented to a viewer creates an impression. The click-through rate is the number of times an ad was clicked divided by the total number of impressions, and it tends to be very low.
LIMITATION OF ACCURACY
Consider a 2-class problem
Number of Class 0 examples = 9990
Number of Class 1 examples = 10
If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
Accuracy is misleading because the model does not detect any class 1 examples
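A short sketch reproducing this pitfall:

```python
# The majority-class pitfall: 99.9% accuracy, zero class-1 detections.
y_true = [0] * 9990 + [1] * 10
y_pred = [0] * 10000  # the model predicts everything as class 0

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
detected = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
print(accuracy, detected)  # 0.999 0
```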
COST-SENSITIVE MEASURES
Precision (p) = TP / (TP + FP)
Recall (r) = TP / (TP + FN)
F-measure (F) = 2rp / (r + p) = 2TP / (2TP + FN + FP)
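A minimal Python sketch of the three measures from confusion-matrix counts (the counts in the usage line anticipate the worked example on the next slide):

```python
# Precision, recall and F-measure from confusion-matrix counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    # Harmonic mean of p and r: 2rp/(r+p) simplifies to 2TP/(2TP+FN+FP)
    return 2 * tp / (2 * tp + fn + fp)

print(precision(4, 3), recall(4, 2), f_measure(4, 3, 2))  # 4/7, 4/6, 8/13
```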
RECALL AND PRECISION
Actual Prediction
T T
T F
F T
F F
F T
T T
T T
T F
F T
T T
Recall = 4 / 6
Precision = 4 / 7
F-Measure = 8 / 13
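A short sketch verifying these numbers from the (actual, prediction) pairs in the table:

```python
# Recomputing recall, precision and F-measure from the table above.
pairs = [("T", "T"), ("T", "F"), ("F", "T"), ("F", "F"), ("F", "T"),
         ("T", "T"), ("T", "T"), ("T", "F"), ("F", "T"), ("T", "T")]

tp = sum(a == "T" and p == "T" for a, p in pairs)  # 4
fp = sum(a == "F" and p == "T" for a, p in pairs)  # 3
fn = sum(a == "T" and p == "F" for a, p in pairs)  # 2

print(tp / (tp + fn))               # recall    = 4/6
print(tp / (tp + fp))               # precision = 4/7
print(2 * tp / (2 * tp + fn + fp))  # F-measure = 8/13
```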
TERMINOLOGY
True Positive: The number of positive examples correctly predicted by the
classification model.
False Negative: The number of positive examples wrongly predicted as negative
by the classification model.
False Positive: The number of negative examples wrongly predicted as positive
by the classification model.
True Negative: The number of negative examples correctly predicted by the
classification model.
TERMINOLOGY (CONT’D)
The true positive rate (TPR) or sensitivity is defined as TPR = TP / (TP + FN).
The true negative rate (TNR) or specificity is defined as TNR = TN / (TN + FP).
The false positive rate (FPR) is defined as FPR = FP / (TN + FP).
The false negative rate (FNR) is defined as FNR = FN / (TP + FN).
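A small sketch computing the four rates; the counts reuse the earlier recall/precision example, where TP = 4, FN = 2, FP = 3, TN = 1:

```python
# The four rates from confusion-matrix counts.
def rates(tp, fn, fp, tn):
    return {
        "TPR (sensitivity)": tp / (tp + fn),
        "TNR (specificity)": tn / (tn + fp),
        "FPR": fp / (tn + fp),
        "FNR": fn / (tp + fn),
    }

print(rates(tp=4, fn=2, fp=3, tn=1))
```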
ROC (RECEIVER OPERATING CHARACTERISTIC)
Developed in the 1950s for signal detection theory to analyze noisy signals
Characterizes the trade-off between positive hits and false alarms
The ROC curve plots TPR (on the y-axis) against FPR (on the x-axis)
Remember that TPR represents "sensitivity" while FPR represents "1 − specificity".
ROC CURVES
Suppose sensitivity in a given scenario is poor (40%) while specificity is
fairly high (92.9%).
The values are calculated from classes that are determined with the
default 50% probability threshold.
Lowering the threshold to 30% results in a model with 60% sensitivity
and 79.3% specificity.
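A tiny illustration of what lowering the threshold does (the probabilities here are hypothetical):

```python
# Thresholding class probabilities: a lower threshold can only turn more
# samples positive. Probabilities are made up for illustration.
probs = [0.95, 0.45, 0.35, 0.62, 0.28]

at_50 = [p >= 0.50 for p in probs]  # [True, False, False, True, False]
at_30 = [p >= 0.30 for p in probs]  # [True, True, True, True, False]
print(at_50, at_30)
```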
ROC CURVE (CONT’D)
The ROC curve is created by evaluating the class probabilities for the
model across a continuum of thresholds.
For each candidate threshold, the resulting true-positive rate (sensitivity)
and the false-positive rate (1-specificity) are plotted against each other.
ROC CURVE (CONT’D)
It is important to remember that altering the threshold only has the effect of making samples more positive (or negative, as the case may be).
In the confusion matrix, it cannot move samples out of both off-diagonal table cells. There is almost always a decrease in either sensitivity or specificity as the other is increased.
ROC CURVE (CONT’D)
The optimal model should be shifted towards the upper left corner of
the plot.
Alternatively, the model with the largest area under the ROC curve
would be the most effective.
The ROC curve is only defined for two-class problems but has been
extended to handle three or more classes.
HOW TO CONSTRUCT AN ROC CURVE
Instance   P(+|A)   True Class
1          0.95     +
2          0.93     +
3          0.87     -
4          0.852    -
5          0.851    -
6          0.850    +
7          0.76     -
8          0.53     +
9          0.43     -
10         0.25     +

• Use a classifier that produces a posterior probability P(+|A) for each test instance A
• Sort the instances according to P(+|A) in decreasing order
• Apply a threshold at each unique value of P(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TP rate, TPR = TP / (TP + FN)
• FP rate, FPR = FP / (FP + TN)
HOW TO CONSTRUCT AN ROC CURVE (CONT'D)
True class     +      -      +      -      +      -      -      -      +      +
Threshold >=   0.25   0.43   0.53   0.76   0.850  0.851  0.852  0.87   0.93   0.95   1.00
TP             5      4      4      3      3      2      2      2      2      1      0
FP             5      5      4      4      3      3      2      1      0      0      0
TN             0      0      1      1      2      2      3      4      5      5      5
FN             0      1      1      2      2      3      3      3      3      4      5
TPR            1      0.8    0.8    0.6    0.6    0.4    0.4    0.4    0.4    0.2    0
FPR            1      1      0.8    0.8    0.6    0.6    0.4    0.2    0      0      0
(Figure: the resulting ROC curve, plotting the (FPR, TPR) pairs above.)
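A Python sketch that reconstructs the threshold table above from the ten (probability, class) pairs; in practice, sklearn.metrics.roc_curve computes the same points directly:

```python
# Reconstructing the threshold table from the ten (P(+|A), class) pairs.
scores = [0.95, 0.93, 0.87, 0.852, 0.851, 0.850, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]

pos = labels.count("+")  # 5 actual positives
neg = labels.count("-")  # 5 actual negatives

for t in sorted(set(scores)) + [1.00]:  # each unique score, plus 1.00
    # Predict "+" whenever the score is at or above the threshold.
    tp = sum(s >= t and l == "+" for s, l in zip(scores, labels))
    fp = sum(s >= t and l == "-" for s, l in zip(scores, labels))
    print(f"t={t:<5}  TPR={tp / pos:.1f}  FPR={fp / neg:.1f}")
```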