Class 10 – Logistic Regression
Prof. Pedram Jahangiry
Road map
• ML Algorithms
  • Supervised
    • Regression: Linear / Polynomial regression, Penalized regression, KNN, SVR, CART, Random Forest
    • Classification: Logistic regression, KNN, SVC, CART, Random Forest
  • Unsupervised
    • Clustering: K-Means, Hierarchical
    • Dimensionality Reduction: Principal Component Analysis (PCA)
Topics
Part I
1. Linear probability model (LPM) vs Logistic regression
2. Sigmoid function
3. Logistic regression
Part II
1. Classification performance metrics
a) Accuracy
b) Precision
c) Recall
d) F1 score
e) ROC and AUC
Classification
• Qualitative variables can be either nominal or ordinal.
• Qualitative variables are often referred to as categorical.
• Classification is the process of predicting categorical variables.
• Classification problems are quite common, perhaps even more common than regression problems.
• Examples:
• Financial instrument tranches (investment grade or junk)
• Online transactions (fraudulent or not)
• Loan application (approved or denied)
• Credit card default (default or not)
• Car insurance customers (high, medium, low risk)
Credit card default example
Goal: Build a classifier that performs well on both the training and the test set.
Part I
Logistic Regression
Linear Probability Model (LPM) vs Logistic Regression
Starting with a simple LPM: $y = \beta_0 + \beta_1\, bal + \epsilon$, where $Y = 1$ for default and $0$ otherwise. Then
$$E[Y \mid bal] = \Pr(Y = 1 \mid bal) = p(x) = \beta_0 + \beta_1\, bal$$
• It seems that simple linear regression is perfect for this task.
• But what are the caveats? One is illustrated in the sketch below.
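A minimal sketch of the main caveat, assuming synthetic balance/default data (all names and numbers below are illustrative): a linear probability model happily produces fitted "probabilities" below 0 and above 1.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: card balance and a 0/1 default indicator
rng = np.random.default_rng(0)
bal = rng.uniform(0, 3000, size=500)
p_true = 1 / (1 + np.exp(-(bal - 2000) / 300))
y = (rng.uniform(size=500) < p_true).astype(int)

lpm = LinearRegression().fit(bal.reshape(-1, 1), y)

# Fitted values at extreme balances escape the [0, 1] interval
print(lpm.predict(np.array([[0.0], [5000.0]])))  # e.g. a negative value and a value above 1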
Sigmoid Function
• We need a monotone mapping function that takes values in (0, 1); the sigmoid, sketched below, is the standard choice.
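A minimal NumPy sketch of the sigmoid (logistic) function:

import numpy as np

def sigmoid(z):
    # Monotone map from the real line into (0, 1); sigmoid(0) = 0.5
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ≈ [0.0000454, 0.5, 0.9999546]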
Logistic Regression (Model)
• The model:
$$f_{w,b}(X) = \frac{1}{1 + e^{-(WX + b)}}$$
• In the case of two classes, $f_{w,b}(X) = \Pr(Y = 1 \mid x) = p(x)$.
• A bit of rearrangement gives
$$\log\left(\frac{p(x)}{1 - p(x)}\right) = WX + b$$
• This monotone transformation is called the log odds or logit transformation of 𝑝(𝑥).
• Logistic regression ensures that our estimates always lie between 0 and 1.
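The rearrangement, step by step, writing $z = WX + b$:

$$p(x) = \frac{1}{1 + e^{-z}} \;\Rightarrow\; 1 - p(x) = \frac{e^{-z}}{1 + e^{-z}} \;\Rightarrow\; \frac{p(x)}{1 - p(x)} = e^{z} \;\Rightarrow\; \log\frac{p(x)}{1 - p(x)} = WX + b$$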
Logistic regression fit (Decision boundary)
• Depending on how we define $WX + b$ (e.g., raw features vs. polynomial features), the logistic regression classifier can produce linear or nonlinear decision boundaries; see the sketch below.
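A scikit-learn sketch of this idea on synthetic two-moons data (the dataset and polynomial degree are illustrative choices): with raw features the boundary $WX + b = 0$ is a straight line, while polynomial features bend it.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(noise=0.2, random_state=0)

linear = LogisticRegression().fit(X, y)  # linear decision boundary
curved = make_pipeline(PolynomialFeatures(degree=3),
                       LogisticRegression(max_iter=1000)).fit(X, y)  # nonlinear boundary

print(linear.score(X, y), curved.score(X, y))  # the curved boundary typically fits better here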
Logistic Regression (Maximum Likelihood)
• In logistic regression, instead of minimizing the average loss, we maximize the likelihood
of the training data according to our model. This is called maximum likelihood estimation.
• A fantastic visualization!
• Can you do the same visualization with the S curve?
$$L_{w,b} = \prod_i f_{w,b}(x_i)^{\,y_i}\left(1 - f_{w,b}(x_i)\right)^{1 - y_i}$$
Logistic Regression (Objective function)
• Maximizing the likelihood function:
$$\max_{w,b}\; L_{w,b} = \prod_i f_{w,b}(x_i)^{\,y_i}\left(1 - f_{w,b}(x_i)\right)^{1 - y_i}$$
• Solution: In practice, it is more convenient to maximize the log-likelihood function
$$\log L_{w,b} = \sum_i \left[\, y_i \log f_{w,b}(x_i) + (1 - y_i)\log\left(1 - f_{w,b}(x_i)\right) \right]$$
Maximizing it gives us $w^*$ and $b^*$. There is no closed-form solution to this optimization problem, so we use gradient descent (sketched below).
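A minimal NumPy sketch of that gradient step, ascending the log-likelihood (the learning rate and iteration count are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    # X: (n, d) feature matrix; y: (n,) labels in {0, 1}
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)          # current predicted probabilities
        w += lr * (X.T @ (y - p) / n)   # d(log-likelihood)/dw, averaged over the sample
        b += lr * np.mean(y - p)        # d(log-likelihood)/db
    return w, b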
• We are now ready to make predictions.
$$f_{w^*,\,b^*}(X) = \frac{1}{1 + e^{-(W^*X + b^*)}}$$
• Depending on the probability threshold we choose, we can classify the observations. In practice, the appropriate threshold depends on the problem.
Logistic regression output for credit card default example
$$P(default \mid bal, inc) = \frac{1}{1 + e^{-(b + w_1\, bal + w_2\, inc)}}$$
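A sketch of fitting this model with scikit-learn on stand-in data (the DataFrame, column names, and generating numbers below are hypothetical, not the textbook Default dataset):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
df = pd.DataFrame({"bal": rng.uniform(0, 2700, 1000),
                   "inc": rng.uniform(10_000, 75_000, 1000)})
df["default"] = (rng.uniform(size=1000) < 1 / (1 + np.exp(-(df["bal"] - 1900) / 250))).astype(int)

clf = LogisticRegression(max_iter=1000).fit(df[["bal", "inc"]], df["default"])
probs = clf.predict_proba(df[["bal", "inc"]])[:, 1]  # estimated P(default | bal, inc)
labels = (probs >= 0.5).astype(int)                  # 0.5 threshold; adjust per problem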
Part II
Classification Performance Metrics
Confusion Matrix
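A minimal scikit-learn sketch showing how the four cells are laid out (toy labels):

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# scikit-learn layout for binary labels:
# [[TN, FP],
#  [FN, TP]]  (rows = actual, columns = predicted)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3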
Accuracy, Precision, Recall, and F1 Score
While recall expresses the ability to find all relevant instances in a dataset, precision expresses the proportion of the data points our model flags as relevant that are actually relevant.
F1 uses the harmonic mean instead of a simple average because it
punishes extreme values.
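In confusion-matrix terms:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$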
ROC (Receiver Operating Characteristic)
(Figure: the ROC curve plots TPR against FPR as the probability threshold ρ varies; lowering ρ reduces false negatives (FN) at the cost of more false positives (FP).)
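A scikit-learn sketch of tracing the curve and computing AUC (toy labels and scores):

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
scores = [0.1, 0.6, 0.8, 0.7, 0.3, 0.2, 0.9, 0.4]  # predicted P(Y = 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (FPR, TPR) point per threshold
print(roc_auc_score(y_true, scores))              # 0.875 for these toy numbers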
AUC (Area Under the ROC Curve)
• AUC summarizes the entire ROC curve in a single number: 0.5 corresponds to random guessing, 1.0 to a perfect classifier.
Some other classification metrics
Students’ questions
1) Are we treating (classifying) $\hat{y} = 0.51$ and $\hat{y} = 0.99$ the same?
2) Does it make sense to have non-linear decision boundaries in logistic regression?
3) Is logistic regression useful for anything beyond probability prediction?
4) What do ROC and AUC tell us about our predictions?