INTRODUCTION TO DATA MINING
UNIT # 4
ACKNOWLEDGEMENT
Most of the slides in this presentation are taken from material provided by
Han and Kamber (Data Mining: Concepts and Techniques) and
Tan, Steinbach and Kumar (Introduction to Data Mining)
TODAY’S AGENDA
Recap
Handling Multi-State Variables
Confusion Matrix and Accuracy Computation
Recall and Precision
Sensitivity and Specificity
ROC Curve
CATEGORICAL ATTRIBUTES: COMPUTING GINI INDEX
From a historical perspective, the Gini index has always created a binary tree.
As a result, in the case of an attribute with multiple values, values are merged together to find the best binary split.
For each distinct value, gather counts for each class in the dataset:
Two-way split
Multi-way split (find the best partition of values)
Multi-way split on CarType:

          Family   Sports   Luxury
  C1         1        2        1
  C2         4        1        1
  Gini = 0.393

Two-way split {Sports, Luxury} vs. {Family}:

          {Sports, Luxury}   {Family}
  C1             3               1
  C2             2               4
  Gini = 0.400

Two-way split {Sports} vs. {Family, Luxury}:

          {Sports}   {Family, Luxury}
  C1          2              2
  C2          1              5
  Gini = 0.419
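As a sanity check, here is a minimal Python sketch (an illustration, not from the slides) that reproduces the three weighted Gini values above from the per-branch class counts:

```python
# A minimal sketch: weighted Gini of a candidate split, given per-branch
# class counts [count(C1), count(C2)].
def gini(counts):
    # Gini impurity: 1 - sum of squared class proportions
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def split_gini(partitions):
    # Weighted average of the branch impurities
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

print(round(split_gini([[1, 4], [2, 1], [1, 1]]), 3))  # multi-way split:          0.393
print(round(split_gini([[3, 2], [1, 4]]), 3))          # {Sports,Luxury}|{Family}: 0.4
print(round(split_gini([[2, 1], [2, 5]]), 3))          # {Sports}|{Family,Luxury}: 0.419
```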
HANDLING OF MULTI-STATE VARIABLES
The Gini index (and entropy) become biased towards variables having many states.
To overcome this, the following approach was recommended (in C4.5 using entropy, but it can be generalized to the Gini index as well):
Gain = SR(D) − SR_A(D)
where SR = splitting rule metric
D = class variable
A = an attribute on which the splitting rule is conditioned
Gain Ratio = Gain / SplitInfo
SPLITINFO
Gini(class) = 0.46
Gini_outlook(class) = 0.34 : Gain = 0.12
Gini_temperature(class) = 0.44 : Gain = 0.02
Gini_humidity(class) = 0.37 : Gain = 0.09
Gini_windy(class) = 0.43 : Gain = 0.03
SplitInfo is the splitting rule applied unconditionally to the attribute itself. If one is using Gini, then it becomes:
SplitInfo(outlook) = Gini(outlook) = 0.66
SplitInfo(temperature) = Gini(temperature) = 0.65
SplitInfo(humidity) = Gini(humidity) = 0.50
SplitInfo(windy) = Gini(windy) = 0.49
GAIN_RATIO
To obtain the gain ratio, we divide the gain by the SplitInfo:
Gain_ratio (outlook) = 0.12 / 0.66 = 0.18
Gain_ratio (temperature) = 0.02 / 0.65 = 0.03
Gain_ratio (humidity) = 0.09 / 0.5 = 0.18
Gain_ratio (windy) = 0.03 / 0.49 = 0.06
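A small Python sketch (assuming the Gini and SplitInfo values quoted on the previous slides) that reproduces this arithmetic:

```python
# Values taken from the weather/play example on the previous slides.
gini_class = 0.46
conditional_gini = {"outlook": 0.34, "temperature": 0.44,
                    "humidity": 0.37, "windy": 0.43}
split_info = {"outlook": 0.66, "temperature": 0.65,
              "humidity": 0.50, "windy": 0.49}

for attr, g in conditional_gini.items():
    gain = gini_class - g                 # Gain = SR(D) - SR_A(D)
    ratio = gain / split_info[attr]       # Gain Ratio = Gain / SplitInfo
    print(f"{attr}: gain = {gain:.2f}, gain ratio = {ratio:.2f}")
```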
EXAMPLE
Attribute 1 Attribute 2 Attribute 3 Class
A 70 T C1
A 90 T C2
A 85 F C2
A 95 F C2
A 70 F C1
B 90 T C1
B 78 F C1
B 65 T C1
B 75 F C1
C 80 T C2
C 70 T C2
C 80 F C1
C 80 F C1
C 96 F C1
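As a sketch of how the best splitting attribute for this exercise data might be found programmatically, the following Python (an illustration, not part of the original slides) computes the Gini gain and gain ratio for the two categorical attributes; Attribute 2 is numeric and would need a threshold search, so it is omitted here:

```python
from collections import Counter

# Exercise data from the slide as (Attribute 1, Attribute 3, Class) tuples.
rows = [
    ("A", "T", "C1"), ("A", "T", "C2"), ("A", "F", "C2"), ("A", "F", "C2"),
    ("A", "F", "C1"), ("B", "T", "C1"), ("B", "F", "C1"), ("B", "T", "C1"),
    ("B", "F", "C1"), ("C", "T", "C2"), ("C", "T", "C2"), ("C", "F", "C1"),
    ("C", "F", "C1"), ("C", "F", "C1"),
]

def gini(labels):
    # Gini impurity of a list of labels: 1 - sum(p_i^2)
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_after_split(rows, idx):
    # Weighted Gini of the class labels after a multi-way split on column idx
    groups = {}
    for row in rows:
        groups.setdefault(row[idx], []).append(row[-1])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())

for idx, name in [(0, "Attribute 1"), (1, "Attribute 3")]:
    gain = gini([r[-1] for r in rows]) - gini_after_split(rows, idx)
    split_info = gini([r[idx] for r in rows])  # unconditional Gini of the attribute
    print(f"{name}: gain = {gain:.3f}, gain ratio = {gain / split_info:.3f}")
```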
EXAMPLE II
Height Hair Eyes Class
Short Blond Blue +
Tall Blond Brown -
Tall Red Blue +
Short Dark Blue -
Tall Dark Blue -
Tall Blond Blue +
Tall Dark Brown -
Short Blond Brown -
ACCURACY OR ERROR RATES
Partition: training-and-testing
Use two independent data sets, e.g., a training set (2/3) and a test set (1/3)
Used for data sets with a large number of examples
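A minimal sketch of such a partition; scikit-learn and the iris data are stand-ins here, not part of the original slides:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Hold out one third of the data for testing, train on the remaining two thirds.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)
print(len(X_train), len(X_test))  # 100 50
```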
METRICS FOR PERFORMANCE EVALUATION…
                            Predicted Label
                            Positive (+)           Negative (-)
True Label   Positive (+)   True Positive (TP)     False Negative (FN)
             Negative (-)   False Positive (FP)    True Negative (TN)

Most widely-used metric:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
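A one-function Python sketch of this metric (the example counts are hypothetical):

```python
# Accuracy from the four confusion-matrix cells.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=4, tn=1, fp=3, fn=2))  # 0.5 (hypothetical counts)
```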
IMBALANCED CLASS PROBLEM
A class imbalance problem occurs when one or more classes have very low proportions in the training data as compared to the other classes.
In online advertising, an advertisement presented to a viewer creates an impression. The click-through rate is the number of times an ad was clicked divided by the total number of impressions, and it tends to be very low.
LIMITATION OF ACCURACY
Consider a 2-class problem
Number of Class 0 examples = 9990
Number of Class 1 examples = 10
If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
Accuracy is misleading because the model does not detect any class 1 examples
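A short sketch reproducing this pitfall:

```python
# The majority-class pitfall: 99.9% accuracy, zero class-1 detections.
y_true = [0] * 9990 + [1] * 10
y_pred = [0] * 10000  # the model predicts everything as class 0

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
detected = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
print(accuracy, detected)  # 0.999 0
```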
COST-SENSITIVE MEASURES
Precision (p) = TP / (TP + FP)
Recall (r) = TP / (TP + FN)
F-measure (F) = 2rp / (r + p) = 2TP / (2TP + FN + FP)
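A minimal Python sketch of the three measures from confusion-matrix counts (the counts in the usage line anticipate the worked example on the next slide):

```python
# Precision, recall and F-measure from confusion-matrix counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    # Harmonic mean of p and r: 2rp/(r+p) simplifies to 2TP/(2TP+FN+FP)
    return 2 * tp / (2 * tp + fn + fp)

print(precision(4, 3), recall(4, 2), f_measure(4, 3, 2))  # 4/7, 4/6, 8/13
```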
RECALL AND PRECISION
Actual Prediction
T T
T F
F T
F F
F T
T T
T T
T F
F T
T T
Recall = 4 / 6
Precision = 4 / 7
F-Measure = 8 / 13
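A short sketch verifying these numbers from the (actual, prediction) pairs in the table:

```python
# Recomputing recall, precision and F-measure from the table above.
pairs = [("T", "T"), ("T", "F"), ("F", "T"), ("F", "F"), ("F", "T"),
         ("T", "T"), ("T", "T"), ("T", "F"), ("F", "T"), ("T", "T")]

tp = sum(a == "T" and p == "T" for a, p in pairs)  # 4
fp = sum(a == "F" and p == "T" for a, p in pairs)  # 3
fn = sum(a == "T" and p == "F" for a, p in pairs)  # 2

print(tp / (tp + fn))               # recall    = 4/6
print(tp / (tp + fp))               # precision = 4/7
print(2 * tp / (2 * tp + fn + fp))  # F-measure = 8/13
```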
TERMINOLOGY
True Positive: The number of positive examples correctly predicted by the
classification model.
False Negative: The number of positive examples wrongly predicted as negative
by the classification model.
False Positive: The number of negative examples wrongly predicted as positive
by the classification model.
True Negative: The number of negative examples correctly predicted by the
classification model.
TERMINOLOGY (CONT’D)
The true positive rate (TPR) or sensitivity is defined as TPR = TP / (TP + FN).
The true negative rate (TNR) or specificity is defined as TNR = TN / (TN + FP).
The false positive rate (FPR) is defined as FPR = FP / (TN + FP).
The false negative rate (FNR) is defined as FNR = FN / (TP + FN).
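A small sketch computing the four rates; the counts reuse the earlier recall/precision example, where TP = 4, FN = 2, FP = 3, TN = 1:

```python
# The four rates from confusion-matrix counts.
def rates(tp, fn, fp, tn):
    return {
        "TPR (sensitivity)": tp / (tp + fn),
        "TNR (specificity)": tn / (tn + fp),
        "FPR": fp / (tn + fp),
        "FNR": fn / (tp + fn),
    }

print(rates(tp=4, fn=2, fp=3, tn=1))
```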
ROC (RECEIVER OPERATING CHARACTERISTIC)
Developed in the 1950s for signal detection theory to analyze noisy signals
Characterizes the trade-off between positive hits and false alarms
The ROC curve plots TPR (on the y-axis) against FPR (on the x-axis)
Remember that TPR represents "sensitivity" while FPR represents "1 − specificity".
ROC CURVES
Suppose sensitivity in a given scenario is poor (40%) while specificity is
fairly high (92.9%).
The values are calculated from classes that are determined with the
default 50% probability threshold.
Lowering the threshold to 30% results in a model with 60% sensitivity
and 79.3% specificity.
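A tiny illustration of what lowering the threshold does (the probabilities here are hypothetical):

```python
# Thresholding class probabilities: a lower threshold can only turn more
# samples positive. Probabilities are made up for illustration.
probs = [0.95, 0.45, 0.35, 0.62, 0.28]

at_50 = [p >= 0.50 for p in probs]  # [True, False, False, True, False]
at_30 = [p >= 0.30 for p in probs]  # [True, True, True, True, False]
print(at_50, at_30)
```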
ROC CURVE (CONT’D)
The ROC curve is created by evaluating the class probabilities for the
model across a continuum of thresholds.
For each candidate threshold, the resulting true-positive rate (sensitivity)
and the false-positive rate (1-specificity) are plotted against each other.
ROC CURVE (CONT’D)
It is important to remember that altering the threshold only has the effect of making samples more positive (or negative, as the case may be).
In the confusion matrix, it cannot move samples out of both off-diagonal table cells. There is almost always a decrease in either sensitivity or specificity as the other is increased.
ROC CURVE (CONT’D)
The optimal model should be shifted towards the upper left corner of
the plot.
Alternatively, the model with the largest area under the ROC curve
would be the most effective.
The ROC curve is only defined for two-class problems but has been
extended to handle three or more classes.
HOW TO CONSTRUCT AN ROC CURVE
Instance   P(+|A)   True Class
1          0.95     +
2          0.93     +
3          0.87     -
4          0.852    -
5          0.851    -
6          0.850    +
7          0.76     -
8          0.53     +
9          0.43     -
10         0.25     +

• Use a classifier that produces a posterior probability P(+|A) for each test instance A
• Sort the instances according to P(+|A) in decreasing order
• Apply a threshold at each unique value of P(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TP rate, TPR = TP / (TP + FN)
• FP rate, FPR = FP / (FP + TN)
HOW TO CONSTRUCT AN ROC CURVE (CONT'D)
True class     +      -      +      -      +      -      -      -      +      +
Threshold >=   0.25   0.43   0.53   0.76   0.850  0.851  0.852  0.87   0.93   0.95   1.00
TP             5      4      4      3      3      2      2      2      2      1      0
FP             5      5      4      4      3      3      2      1      0      0      0
TN             0      0      1      1      2      2      3      4      5      5      5
FN             0      1      1      2      2      3      3      3      3      4      5
TPR            1      0.8    0.8    0.6    0.6    0.4    0.4    0.4    0.4    0.2    0
FPR            1      1      0.8    0.8    0.6    0.6    0.4    0.2    0      0      0
(Figure: the resulting ROC curve, plotting the (FPR, TPR) pairs above.)
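A Python sketch that reconstructs the threshold table above from the ten (probability, class) pairs; in practice, sklearn.metrics.roc_curve computes the same points directly:

```python
# Reconstructing the threshold table from the ten (P(+|A), class) pairs.
scores = [0.95, 0.93, 0.87, 0.852, 0.851, 0.850, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]

pos = labels.count("+")  # 5 actual positives
neg = labels.count("-")  # 5 actual negatives

for t in sorted(set(scores)) + [1.00]:  # each unique score, plus 1.00
    # Predict "+" whenever the score is at or above the threshold.
    tp = sum(s >= t and l == "+" for s, l in zip(scores, labels))
    fp = sum(s >= t and l == "-" for s, l in zip(scores, labels))
    print(f"t={t:<5}  TPR={tp / pos:.1f}  FPR={fp / neg:.1f}")
```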