Lecture 7: Classification
Kobina Abakah-Paintsil
Lecture Objectives
The objectives of this lecture are to:
1. Equip students to use logistic regression;
2. Equip students to use support vector machines;
3. Equip students to understand how to evaluate classification models.
Logistic Regression
• Logistic regression is a supervised machine learning algorithm widely used for binary classification tasks.
• It estimates the probability that an instance belongs to a given class.
• Logistic regression predicts the output of a categorical dependent variable.
• The output can be Yes or No, 0 or 1, true or false, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values that lie between 0 and 1.
• In logistic regression, instead of fitting a straight regression line, we fit an “S”-shaped logistic function, which saturates at the two extreme values (0 and 1).
• The sigmoid function is the mathematical function used to map the predicted values to probabilities.
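As a minimal sketch (not part of the slides), the sigmoid can be written in NumPy to show how it maps any real input into a probability between 0 and 1:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5 — the midpoint of the S-curve
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```

Large positive inputs saturate toward 1 and large negative inputs toward 0, which is exactly the “two maximum values” behaviour described above.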
Types of Logistic Regression
• Binomial: In binomial logistic regression, the dependent variable has only two possible values, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial logistic regression, the dependent variable has 3 or more possible unordered values, such as “cat”, “dog”, or “sheep”.
• Ordinal: In ordinal logistic regression, the dependent variable has 3 or more possible ordered values, such as “low”, “medium”, or “high”.
Assumptions of Logistic Regression
• Independent observations: Each observation is independent of the others, meaning there is no correlation between any input variables.
• Binary dependent variable: The dependent variable must be binary or dichotomous, meaning it can take only two values. For more than two categories, the softmax function is used.
• Linear relationship between independent variables and log-odds: The relationship between the independent variables and the log-odds of the dependent variable should be linear.
• No outliers: There should be no outliers in the dataset.
• Large sample size: The sample size is sufficiently large.
Terminologies of Logistic Regression
• Logistic function: The formula used to represent how the independent and
dependent variables relate to one another. The logistic function transforms the
input variables into a probability value between 0 and 1.
• Odds: It is the ratio of something occurring to something not occurring. It is
different from probability as the probability is the ratio of something occurring to
everything that could possibly occur.
• Log-odds: The log-odds, also known as the logit function, is the natural logarithm
of the odds. In logistic regression, the log odds of the dependent variable are
modeled as a linear combination of the independent variables and the intercept.
• Coefficient: The logistic regression model’s estimated parameters, which show how the independent and dependent variables relate to one another.
• Intercept: A constant term in the logistic regression model, which represents the log-odds when all independent variables are equal to zero.
Logistic Regression
• Probability needs to satisfy two basic conditions:
  • Always positive, i.e. > 0
  • Always less than or equal to 1
• Start from simple linear regression: $y = b_0 + b_1 x$ (SLR)
• $e^y$ is always positive
• Dividing by $e^y + 1$ keeps it less than 1:
$$P = \frac{e^y}{e^y + 1} \quad \text{(probability of success)}$$
Logistic Regression
If $P = \dfrac{e^y}{e^y + 1}$, then the probability of failure is
$$Q = 1 - P = 1 - \frac{e^y}{e^y + 1} = \frac{e^y + 1 - e^y}{e^y + 1} = \frac{1}{e^y + 1}$$
Therefore,
$$\mathrm{Odds} = \frac{P(\mathrm{Success})}{P(\mathrm{Failure})} = \frac{P}{1 - P} = e^y$$
$$\log(\mathrm{Odds}) = \log\frac{P}{1 - P} = \log e^y$$
$$\log\frac{P}{1 - P} = y = b_0 + b_1 x$$
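The derivation above can be verified numerically. The coefficients $b_0$ and $b_1$ below are arbitrary illustrative values, not taken from the lecture:

```python
import math

b0, b1, x = -1.5, 0.8, 2.0    # hypothetical coefficients and input
y = b0 + b1 * x                # linear predictor

P = math.exp(y) / (math.exp(y) + 1)   # probability of success
Q = 1 - P                             # probability of failure
odds = P / (1 - P)

print(odds, math.exp(y))              # odds equal e^y
print(math.log(P / (1 - P)), y)       # log-odds recover the linear predictor
```

The odds come out equal to $e^y$ and the log-odds equal $b_0 + b_1 x$, confirming the algebra on this slide.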
Logistic Regression
[Figure: the fitted S-shaped logistic curve]
Stratification
• Splitting data without stratification – grouping of like responses
[Figure: train/test split in which like responses cluster together]
• Splitting data without stratification – imbalanced split
[Figure: train/test split with imbalanced class proportions]
• Stratification with a 50% split
[Figure: train and test sets each preserving the class proportions]
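As a sketch with scikit-learn (using the iris data as an example dataset, not necessarily the one in the figures), the `stratify` argument of `train_test_split` preserves class proportions in both halves:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)   # 50 samples per class

# stratify=y keeps the class proportions identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

print(np.bincount(y_tr))  # [25 25 25]
print(np.bincount(y_te))  # [25 25 25]
```

Without `stratify`, a random split can leave one class over-represented in the training set and under-represented in the test set, which is the imbalanced split shown above.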
Confusion Matrix
• A confusion matrix, also known as an error matrix, is a table that summarizes the
performance of a machine learning model on a set of test data. It is useful for
evaluating the performance of classification algorithms.
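A small sketch with scikit-learn’s `confusion_matrix`; the labels here are made up for illustration:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 0]   # actual classes
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]   # model predictions

# rows = actual class, columns = predicted class
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 1]
#  [1 3]]
```

Here the diagonal holds the correct predictions (3 true negatives, 3 true positives) and the off-diagonal cells the errors (1 false positive, 1 false negative).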
Support Vector Machine
• A Support Vector Machine (SVM) is a supervised machine learning algorithm that
classifies data by finding an optimal line or hyperplane in an N-dimensional
space.
• The goal is to maximize the margin between the classes, ensuring effective separation.
• SVMs can handle both linear and non-linear classification tasks using a
technique called the kernel trick, which implicitly maps data into higher-
dimensional feature spaces.
• They are useful for binary classification and regression analysis.
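A minimal scikit-learn sketch of an SVM classifier, using the iris data as an example; the hyperparameter values shown are scikit-learn defaults, not values from the lecture:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# kernel='rbf' applies the kernel trick implicitly:
# data is separated in a higher-dimensional feature space
clf = SVC(kernel='rbf', C=1.0, gamma='scale')
clf.fit(X_tr, y_tr)

print(clf.score(X_te, y_te))  # test accuracy, typically high on iris
```

Switching `kernel='linear'` gives the linear hyperplane case; the RBF kernel handles the non-linear case discussed later in this lecture.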
Support Vector Machine
Hyperplane
• The separating boundary is a point in 1D, a line in 2D, and a plane in 3D; in 4D and above it is called a hyperplane.
Support Vector Machine
Hyperplane
• The margin is the distance between the hyperplane and the nearest data points; SVM selects the hyperplane that maximizes this margin.
Support Vector Machine
The Mathematics
$$V_1 = \begin{pmatrix} x_1 \\ y_1 \end{pmatrix}, \qquad V_2 = \begin{pmatrix} x_2 \\ y_2 \end{pmatrix}$$
$$V_1 \cdot V_2 = \lVert V_1 \rVert\,\lVert V_2 \rVert \cos\theta$$
Let $w$ be the projection of $V_1$ onto $V_2$:
$$\cos\theta = \frac{w}{\lVert V_1 \rVert} \;\Rightarrow\; w = \lVert V_1 \rVert \cos\theta$$
Also, $\cos\theta = \dfrac{V_1 \cdot V_2}{\lVert V_1 \rVert\,\lVert V_2 \rVert}$, so
$$w = \lVert V_1 \rVert \cdot \frac{V_1 \cdot V_2}{\lVert V_1 \rVert\,\lVert V_2 \rVert} = V_1 \cdot \frac{V_2}{\lVert V_2 \rVert} = V_1 \cdot u$$
where $u$ is a unit vector in the direction of $V_2$.
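The projection $w = V_1 \cdot u$ can be checked numerically; the example vectors below are arbitrary:

```python
import numpy as np

v1 = np.array([3.0, 4.0])   # vector being projected
v2 = np.array([1.0, 0.0])   # direction to project onto

u = v2 / np.linalg.norm(v2)   # unit vector along v2
w = np.dot(v1, u)             # scalar projection of v1 onto v2

print(w)  # 3.0 — the component of v1 along the x-axis
```

This is the same quantity SVM uses: projecting a point onto the unit normal of the hyperplane tells us which side of the boundary the point falls on.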
Support Vector Machine
Selecting the Right Hyperplane and Class Determination
$$w = V_1 \cdot u$$
where $u$ is a unit vector normal to the hyperplane. If the normal vector magnitude is $c$ and $b$ is the distance between the hyperplane and the positive hyperplane, then:
• Margin = $2b$
• Positive plane: $c + b$
• Negative plane: $c - b$
Support Vector Machine
Changing Perspectives
[Figure: Gaussian transformation mapping the data into a space where it becomes separable]
Support Vector Machine
Radial Basis Function
[Figure: RBF kernel transformation of the data]
$$\gamma = \frac{1}{2\sigma^2}$$
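A sketch of the RBF kernel using the $\gamma$ parameter defined above; the example points are arbitrary:

```python
import numpy as np

def rbf_kernel(x1, x2, gamma):
    """K(x1, x2) = exp(-gamma * ||x1 - x2||^2), with gamma = 1 / (2*sigma^2)."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

a = np.array([1.0, 2.0])
b = np.array([2.0, 3.0])

print(rbf_kernel(a, a, gamma=0.5))  # 1.0 — identical points have maximum similarity
print(rbf_kernel(a, b, gamma=0.5))  # exp(-1) ≈ 0.3679 — similarity decays with distance
```

The kernel measures similarity: it equals 1 for identical points and decays toward 0 as points move apart, with larger $\gamma$ (smaller $\sigma$) giving a faster decay.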
Support Vector Machine
Types of Kernel Functions
Iris Dataset
[Figure: the iris dataset]
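For reference, the iris dataset shown here can be loaded from scikit-learn (one common source; the lecture figures may come from elsewhere):

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)       # (150, 4) — 150 flowers, 4 measurements each
print(iris.target_names)     # ['setosa' 'versicolor' 'virginica']
print(iris.feature_names)    # sepal/petal length and width in cm
```

The three species make it a convenient small benchmark for the classifiers in this lecture.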
Evaluating Classification Models
• Accuracy is the proportion of the total number of predictions that are correct.
$$\text{Accuracy} = \frac{\text{True Negatives} + \text{True Positives}}{\text{Total Observations}}$$
• Accuracy is a poor evaluator when dealing with class-imbalanced data.
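A quick illustration of why accuracy misleads on imbalanced data; the 95/5 class split is a made-up example:

```python
# 95 negatives, 5 positives; a "model" that always predicts the majority class
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

accuracy = (tn + tp) / len(y_true)
print(accuracy)  # 0.95 — yet the model never detects a single positive case
```

A 95% accurate model that misses every positive is exactly the failure mode the slide warns about, which is why precision and recall come next.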
Evaluating Classification Models
• Precision is the proportion of correct positive results out of all predicted positive results.
$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$
• Recall (Sensitivity) is the proportion of actual positive cases predicted correctly.
$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
• The F1 score, also known as the F-measure, combines precision and recall into a single metric. It is the harmonic mean of precision and recall.
$$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
• The F1 score helps you find a balance between avoiding false positives (precision) and minimizing false negatives (recall).
Evaluating Classification Models
• Specificity or Selectivity is the proportion of actual negative cases predicted correctly.
$$\text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}$$
• High precision is preferred when false positives are costly and it is acceptable to have some false negatives.
• High recall is preferred when the cost of a false negative is very high.
Evaluating Classification Models
Threshold and Adjusting Threshold
[Figure: predicted probabilities evaluated at two candidate decision thresholds]
• New threshold 1: Accuracy = 0.8616, Precision = 0.84375, Recall = 0.9818
• New threshold 2: metrics shown in the figure
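Threshold adjustment can be sketched with `predict_proba`; the data is synthetic and the thresholds 0.5 and 0.3 are illustrative, not the ones from the figure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)[:, 1]   # probability of the positive class

# default threshold 0.5 vs a lower threshold
pred_default = (proba >= 0.5).astype(int)
pred_low = (proba >= 0.3).astype(int)

# lowering the threshold can only add positive predictions,
# trading precision for recall
print(pred_default.sum(), pred_low.sum())
```

Lowering the threshold flags more instances as positive, which raises recall at the cost of precision; raising it does the reverse.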
Evaluating Classification Models
AUC ROC Curve
• The AUC-ROC curve stands for the Area Under the Receiver Operating
Characteristic curve.
ROC (Receiver Operating Characteristics) Curve:
• The ROC curve is a graphical representation of how well a binary classification
model performs across different classification thresholds.
• It plots the True Positive Rate (TPR) against the False Positive Rate (FPR).
• TPR (Recall) represents the proportion of actual positive instances correctly
predicted by the model.
• FPR (1 − specificity) represents the proportion of actual negative instances incorrectly predicted as positive by the model.
Evaluating Classification Models
AUC ROC Curve
AUC (Area Under Curve):
• The AUC summarizes the overall performance of the binary classification model.
• It represents the area under the ROC curve.
• A greater AUC value indicates better model performance.
• The AUC measures the probability that the model assigns a randomly chosen
positive instance a higher predicted probability than a randomly chosen negative
instance.
Evaluating Classification Models
AUC ROC Curve
[Figure: ROC curve with the area under it shaded]
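The ROC curve and AUC can be sketched with scikit-learn; the scores below are a toy example, not real model output:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]                 # actual classes
scores = [0.1, 0.4, 0.35, 0.8]        # model's predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

print(auc)  # 0.75
```

The value 0.75 matches the probabilistic reading of AUC: of the four (positive, negative) pairs here, the positive instance receives the higher score in three, so 3/4 = 0.75.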
Reading Assignment
1. Read and write short notes on the following classification models:
i. Decision Trees
ii. Random Forest
2. Read and write short notes on the following classification evaluation
metrics:
i. Macro-averaged F1 score
ii. Micro-averaged F1 score
iii. Sample-weighted F1 score
iv. Fβ score