6012B0419Y Machine Learning
Model Evaluation and Class Imbalance
27-11-2023
Guido van Capelleveen
(Prepared by: Stevan Rudinac)
Slide Credit
● Andreas Müller, lecturer at the Data Science Institute at Columbia University
● Author of the book we will be using for this course: “Introduction to Machine Learning with Python”
● Great materials available at:
● https://github.com/amueller/applied_ml_spring_2017/
● https://amueller.github.io/applied_ml_spring_2017/
Reading
Pages: 277 – 305
Metrics for Binary Classification
Kinds of Errors
● Example: Early cancer detection screening
– The test is negative: patient is assumed healthy
– The test is positive: patient undergoes additional test
● Possible mistakes:
– Healthy patient is classified as positive: false positive or type I error
– Sick patient is classified as negative: false negative or type II error
Review: confusion matrix
Accuracy: the sum of the diagonal (correct predictions) divided by everything (all entries).
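A minimal sketch of reading accuracy off a confusion matrix with scikit-learn; the label and prediction arrays below are hypothetical.

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]   # hypothetical true labels
y_pred = [0, 1, 1, 1, 0, 0]   # hypothetical predictions
cm = confusion_matrix(y_true, y_pred)
print(cm)                     # rows: true class, columns: predicted class
print(cm.trace() / cm.sum())  # accuracy = diagonal divided by everything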
Problems with accuracy
● Imbalanced classes lead to hard-to-interpret accuracy: on data where 90% of samples belong to the negative class, is 90% accuracy OK?
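A small sketch of the problem, using scikit-learn's DummyClassifier on hypothetical data with 90% negatives: the "classifier" never learns anything yet still reports 90% accuracy.

import numpy as np
from sklearn.dummy import DummyClassifier

X = np.random.randn(1000, 5)           # hypothetical features
y = np.array([0] * 900 + [1] * 100)    # 90% negative class

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
print(dummy.score(X, y))               # ~0.90 accuracy without learning anything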
Precision, Recall, f-score
● Precision = TP / (TP + FP), also known as Positive Predictive Value (PPV)
● Recall = TP / (TP + FN), also known as sensitivity, coverage, or true positive rate
● F1 = 2 · precision · recall / (precision + recall)
● All depend on the definition of positive and negative!
● The whole zoo of related metrics: https://en.wikipedia.org/wiki/Precision_and_recall
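These metrics are available in sklearn.metrics; a minimal sketch with hypothetical labels and predictions:

from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

y_true = [0, 0, 0, 1, 1, 1, 1]   # hypothetical true labels
y_pred = [0, 1, 0, 1, 1, 0, 1]   # hypothetical predictions

print(precision_score(y_true, y_pred))        # TP / (TP + FP)
print(recall_score(y_true, y_pred))           # TP / (TP + FN)
print(f1_score(y_true, y_pred))               # harmonic mean of precision and recall
print(classification_report(y_true, y_pred))  # all of the above, per class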
Goal setting!
● What do I want? What do I care about? (precision, recall, something else)
● Can I assign costs to the confusion matrix? (i.e. a false positive costs me $10, a false negative $100)
● What guarantees do we want to give?
Changing Thresholds
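A minimal sketch of moving the decision threshold away from the default 0.5, assuming a logistic regression on synthetic imbalanced data (all names and values are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# synthetic, imbalanced toy data (illustrative only)
X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
y_default = clf.predict(X_test)                               # implicit threshold of 0.5
y_low = (clf.predict_proba(X_test)[:, 1] > 0.25).astype(int)  # lower threshold: more positives,
                                                              # higher recall, lower precision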
Precision-Recall Curve
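A sketch of computing the curve with sklearn.metrics.precision_recall_curve; the labels and scores below are hypothetical.

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                    # hypothetical labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]  # hypothetical decision scores

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
plt.plot(recall, precision)
plt.xlabel("recall")
plt.ylabel("precision")
plt.show()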
Comparing RF and SVC
Average Precision
AP = Σ_k (R_k − R_{k−1}) · P_k, where P_k is the precision at threshold k, (R_k − R_{k−1}) is the change in recall between thresholds k and k−1, and the sum runs over the data points ranked by the decision function.
Same as the area under the precision-recall curve (depending on how you treat edge cases).
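scikit-learn implements this as average_precision_score; a tiny sketch with hypothetical labels and scores:

from sklearn.metrics import average_precision_score

y_true = [0, 0, 1, 1, 0, 1]                # hypothetical labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # hypothetical decision scores
print(average_precision_score(y_true, y_score))  # sum over k of (R_k - R_{k-1}) * P_k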
F1 vs average precision
Receiver Operating Characteristic (ROC) Curve
● Plots the true positive rate, TPR = TP / (TP + FN) (= recall), against the false positive rate, FPR = FP / (FP + TN), as the decision threshold varies.
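A sketch using sklearn.metrics.roc_curve with hypothetical labels and scores:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                    # hypothetical labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]  # hypothetical decision scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
plt.plot(fpr, tpr)
plt.xlabel("false positive rate (FPR)")
plt.ylabel("true positive rate (recall)")
plt.show()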
ROC
AUC
● Area under the ROC Curve
● Always 0.5 for random / constant prediction
● Evaluation of the ranking: the probability that a randomly picked positive sample will have a higher score than a randomly picked negative sample
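A sketch with roc_auc_score; the labels and scores are hypothetical.

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                # hypothetical labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # hypothetical decision scores
# probability that a random positive is ranked above a random negative
print(roc_auc_score(y_true, y_score))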
The Relationship Between Precision-Recall and ROC Curves
https://www.biostat.wisc.edu/~page/rocpr.pdf
Multi-class classification
Confusion Matrix
Normalizing the confusion matrix (by rows) can be helpful.
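A sketch of row normalization using the normalize parameter of scikit-learn's confusion_matrix; the three-class labels below are hypothetical.

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2, 2]   # hypothetical multi-class labels
y_pred = [0, 1, 1, 1, 2, 0, 2]   # hypothetical predictions
# normalize="true" divides each row by the number of true samples of that class
print(confusion_matrix(y_true, y_pred, normalize="true"))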
Micro and Macro F1
● Macro-average f1: average of the f1 scores over classes
● Micro-average f1: computes the total number of FP, FN and TP over all classes, then computes P, R and f1 using these counts
● Weighted: mean of the per-class f1 scores, weighted by support
● Macro: “all classes are equally important”
● Micro: “all samples are equally important” – the same holds for averages of other metrics
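These averaging strategies correspond to the average parameter of f1_score; a sketch with hypothetical multi-class predictions:

from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2, 2]   # hypothetical multi-class labels
y_pred = [0, 1, 1, 1, 2, 0, 2]   # hypothetical predictions

print(f1_score(y_true, y_pred, average="macro"))     # unweighted mean over classes
print(f1_score(y_true, y_pred, average="micro"))     # from global TP, FP, FN counts
print(f1_score(y_true, y_pred, average="weighted"))  # per-class f1, weighted by support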
Multi-class ROC AUC
● Hand & Till, 2001: one-vs-one
● Provost & Domingos, 2000: one-vs-rest
● https://github.com/scikit-learn/scikit-learn/pull/7663
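In current scikit-learn this is exposed through the multi_class parameter of roc_auc_score; a sketch on the iris data, fit and evaluated on the same data purely for illustration:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = load_iris(return_X_y=True)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)

print(roc_auc_score(y, proba, multi_class="ovo"))  # one-vs-one (Hand & Till style)
print(roc_auc_score(y, proba, multi_class="ovr"))  # one-vs-rest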
Picking metrics?
● Accuracy is rarely what you want
● Problems are rarely balanced
● Find the right criterion for the task
● OR pick one arbitrarily, but at least think about it
● Emphasis on recall or precision?
● Which classes are the important ones?
Using metrics in cross-validation
● Pass the “scoring” parameter to cross_val_score to evaluate each fold with your chosen metric
● The same works for GridSearchCV: it will make GridSearchCV.score use your metric!
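A sketch of both, with the breast cancer data and an SVC chosen only for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# evaluate each fold with ROC AUC instead of the default accuracy
print(cross_val_score(SVC(), X, y, scoring="roc_auc"))

# GridSearchCV selects parameters and reports scores with the given metric
grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, scoring="average_precision")
grid.fit(X, y)
print(grid.best_score_)  # GridSearchCV.score now also uses this metric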
Built-in scoring
● “scoring” can be a string or a callable.
● Strings: for example “accuracy”, “roc_auc”, “average_precision”, “f1_macro”, “neg_mean_squared_error”, …
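In recent scikit-learn versions the full list of valid strings can be printed like this (sketch):

from sklearn.metrics import get_scorer_names

# all strings accepted by the scoring parameter
print(sorted(get_scorer_names()))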
Providing your own callable
● Takes estimator, X, y
● Returns score – higher is better (always!)
def accuracy_scoring(est, X, y):
    return (est.predict(X) == y).mean()
You can access the model!
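A sketch of plugging such a callable into cross-validation; the data set and estimator are chosen only for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

def accuracy_scoring(est, X, y):
    # est is the fitted model, so any of its attributes can be used here
    return (est.predict(X) == y).mean()

X, y = load_breast_cancer(return_X_y=True)
print(cross_val_score(LogisticRegression(max_iter=5000), X, y, scoring=accuracy_scoring))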
Metrics for regression models
Built-in standard metrics
● R^2: easy-to-understand scale
● MSE: easy to relate to input
● Mean absolute error, median absolute error: more robust
● When using “scoring”, use “neg_mean_squared_error” etc.
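A sketch of these metrics in sklearn.metrics, with hypothetical targets and predictions:

from sklearn.metrics import (r2_score, mean_squared_error,
                             mean_absolute_error, median_absolute_error)

y_true = [3.0, 5.0, 2.5, 7.0]   # hypothetical targets
y_pred = [2.5, 5.0, 4.0, 8.0]   # hypothetical predictions

print(r2_score(y_true, y_pred))
print(mean_squared_error(y_true, y_pred))
print(mean_absolute_error(y_true, y_pred))
print(median_absolute_error(y_true, y_pred))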
Prediction plots
Residual Plots
Target vs Feature
Residual vs Feature
Absolute vs relative error:
● Mean absolute percentage error (MAPE) = mean(|y − ŷ| / |y|)
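A sketch of MAPE with hypothetical values; newer scikit-learn versions also ship mean_absolute_percentage_error.

import numpy as np

y_true = np.array([100.0, 50.0, 20.0])   # hypothetical targets
y_pred = np.array([110.0, 40.0, 25.0])   # hypothetical predictions

mape = np.mean(np.abs(y_true - y_pred) / np.abs(y_true))
print(mape)   # relative error: each deviation is scaled by the true value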
Over vs under
● Overprediction and underprediction can have different costs.
● Try to create a cost matrix: how much do overprediction and underprediction cost?
● Is it linear?