CMPE 442 Introduction to
Machine Learning
• Logistic Regression/ Softmax Regression
• Measurements
Logistic Regression
Classification problem: just like the regression problem, except that the target y takes only a small number of discrete values.
Assume a binary classifier: y can take on only two values, 0 and 1.
Ex.: spam email (1, positive class), ham email (0, negative class)
Linear Regression
h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + ⋯ + θ_n·x_n
Classification Problem
Threshold the classifier output at 0.5:
If h_θ(x) ≥ 0.5, predict y = 1
If h_θ(x) < 0.5, predict y = 0
Example (one feature): h_θ(x) = −0.34 + 1.58·x
Classification Problem
y ∈ {0, 1}
So we want 0 ≤ h_θ(x) ≤ 1
Logistic Regression
Let's change the form of the hypothesis h_θ(x):
h_θ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))
g(z) = 1 / (1 + e^(−z))  — the logistic function or sigmoid function
Logistic Regression
g(z) = 1 / (1 + e^(−z))
g(z) tends towards 1 as z → ∞
g(z) tends towards 0 as z → −∞
g(z) is always bounded between 0 and 1
h_θ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))
θᵀx = θ_0 + Σ_{j=1}^{n} θ_j·x_j, with x_0 = 1
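A minimal NumPy sketch of the sigmoid and the hypothesis defined above (the array names X and theta are illustrative):
```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """h_theta(x) = g(theta^T x) for each row of X (X already contains x_0 = 1)."""
    return sigmoid(X @ theta)

# quick check: g(0) = 0.5, large positive z -> ~1, large negative z -> ~0
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ≈ [4.5e-05, 0.5, 0.99995]
```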
Logistic Regression
Predicts something that is True or False, instead of something continuous like size
Logistic Regression
Interpretation of Logistic Function
h_θ(x) can be interpreted as the estimated probability that y = 1 for input x:
h_θ(x) = P(y = 1 | x; θ)
P(y = 0 | x; θ) = 1 − P(y = 1 | x; θ)
Decision Boundary
Assume: y = 1 if h_θ(x) ≥ 0.5, and y = 0 if h_θ(x) < 0.5
h_θ(x) = g(θ_0 + θ_1·x_1 + θ_2·x_2)
θ = [−3, 1, 1]ᵀ
Predict y = 1 if −3 + x_1 + x_2 ≥ 0
x_1 + x_2 = 3 is the decision boundary
Logistic Regression
Given the logistic regression model, how do we fit θ for it?
Logistic Regression
Training set: {(x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m))} — m examples
x = [x_0, x_1, …, x_n]ᵀ with x_0 = 1
y ∈ {0, 1}
h_θ(x) = 1 / (1 + e^(−θᵀx))
Cost Function
Linear Regression: J(θ) = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
Cost(h_θ(x), y) = (h_θ(x^(i)) − y^(i))²
Using this cost function for logistic regression gives a non-convex function:
gradient descent is not guaranteed to converge to the global minimum.
Cost Function
Cost(h_θ(x), y) = −log(h_θ(x))      if y = 1
Cost(h_θ(x), y) = −log(1 − h_θ(x))  if y = 0
Cost = 0 if y = 1 and h_θ(x) = 1
But as h_θ(x) → 0, Cost → ∞
If h_θ(x) = 0, we predict P(y = 1 | x; θ) = 0 → if actually y = 1, the learning algorithm is penalized by a very large cost.
Cost Function
Cost(h_θ(x), y) = −log(h_θ(x)) if y = 1;  −log(1 − h_θ(x)) if y = 0
J(θ) = (1/m) Σ_{i=1}^{m} Cost(h_θ(x^(i)), y^(i))
[Figures: Cost vs. h_θ(x) for the cases y = 1 and y = 0]
Cost Function
Cost(h_θ(x), y) = −log(h_θ(x)) if y = 1;  −log(1 − h_θ(x)) if y = 0
y = 1 or y = 0 → only two values are possible, so the two cases can be combined:
Cost(h_θ(x), y) = −y·log(h_θ(x)) − (1 − y)·log(1 − h_θ(x))
J(θ) = (1/m) Σ_{i=1}^{m} Cost(h_θ(x^(i)), y^(i))
     = −(1/m) [ Σ_{i=1}^{m} y^(i)·log(h_θ(x^(i))) + (1 − y^(i))·log(1 − h_θ(x^(i))) ]
This is a convex function.
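A minimal sketch of this cross-entropy cost in NumPy (the small epsilon clip is an assumption added only for numerical stability; it is not part of the formula above):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, eps=1e-12):
    """J(theta) = -(1/m) * sum( y*log(h) + (1-y)*log(1-h) )."""
    m = len(y)
    h = sigmoid(X @ theta)
    h = np.clip(h, eps, 1 - eps)  # avoid log(0)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```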
Logistic Regression
To fit parameters θ:
❑ min_θ J(θ) to obtain θ
To make a prediction on a given new x:
❑ Output h_θ(x) = 1 / (1 + e^(−θᵀx)) ➔ p(y = 1 | x; θ)
Gradient Descent
Want min_θ J(θ):
Repeat {
    θ_j := θ_j − η · ∂J(θ)/∂θ_j
} (simultaneously update all θ_j)
Gradient Descent
J(θ) = (1/m) Σ_{i=1}^{m} Cost(h_θ(x^(i)), y^(i))
     = −(1/m) Σ_{i=1}^{m} [ y^(i)·log(h_θ(x^(i))) + (1 − y^(i))·log(1 − h_θ(x^(i))) ]
∂J(θ)/∂θ_j = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x_j^(i)
Gradient Descent
Want min_θ J(θ), with h_θ(x) = 1 / (1 + e^(−θᵀx)):
Repeat {
    θ_j := θ_j − η · (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x_j^(i)
} (simultaneously update all θ_j)
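A minimal batch gradient descent sketch of this update rule (the learning rate and iteration count are illustrative choices):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent; X is (m, n+1) with a leading column of ones (x_0 = 1)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)            # h_theta(x^(i)) for all i
        grad = (1.0 / m) * X.T @ (h - y)  # (1/m) * sum (h - y) * x_j
        theta -= lr * grad                # simultaneous update of all theta_j
    return theta
```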
Python Implementation
log_reg.predict_proba() gives probabilities.
To get the class labels, just use log_reg.predict()
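The code from the original slide is not reproduced here; below is a minimal Scikit-Learn sketch of the same idea, using a toy one-feature dataset as an illustration:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy data: one feature, classes roughly separated around x = 1.0
rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(100, 1))
y = (X[:, 0] > 1.0).astype(int)

log_reg = LogisticRegression()
log_reg.fit(X, y)

X_new = np.array([[0.5], [1.5]])
print(log_reg.predict_proba(X_new))  # estimated probabilities for classes 0 and 1
print(log_reg.predict(X_new))        # hard class labels (threshold at 0.5)
```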
Example
Let's apply Logistic Regression to the Iris dataset.
Iris → a famous dataset containing the sepal and petal length and width of 150 iris flowers
Three different species: Iris-Setosa, Iris-Versicolor and Iris-Virginica
Iris Dataset
Example
Nonlinear Decision Boundary
Some data cannot be separated by a straight line (e.g. a boundary of the form x_2 = x_1²).
Linear hypothesis: h_θ(x) = g(θ_0 + θ_1·x_1 + θ_2·x_2)
Add higher-order terms: h_θ(x) = g(θ_0 + θ_1·x_1 + θ_2·x_2 + θ_3·x_1² + θ_4·x_2²)
θ = [−1, 0, 0, 1, 1]ᵀ
Predict y = 1 if −1 + x_1² + x_2² ≥ 0, i.e. x_1² + x_2² ≥ 1
Decision boundary: x_1² + x_2² = 1 (a circle of radius 1)
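A minimal sketch of fitting such a nonlinear boundary with Scikit-Learn by expanding the features first; degree 2 matches the x_1², x_2² terms above, and the circular toy data is an illustrative assumption:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# toy data: label 1 outside the unit circle, 0 inside
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 >= 1).astype(int)

# degree-2 polynomial features give the model access to x1^2 and x2^2 terms
model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
model.fit(X, y)
print(model.predict([[0.0, 0.0], [1.5, 1.5]]))  # expected: [0, 1]
```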
Multiclass Classification
What if y ∈ {1, 2, …, k} classes?
Example: weather prediction {sunny, cloudy, snowy, …}
[Figure: data from several classes in the (x_1, x_2) plane]
One-vs-all Classifier (one-vs-rest)
Class 1, Class 2, Class 3:
[Figures: three binary sub-problems, one per class, with classifiers h^(1)(x), h^(2)(x), h^(3)(x)]
One-vs-all Classifier (one-vs-rest)
h^(i)(x) = p(y = i | x; θ) for i = 1, 2, 3
One-vs-all:
❑ Train a logistic regression classifier h^(i)(x) for each class i to predict the probability that y = i
❑ On a new input x, to make a prediction pick the class i that maximizes h^(i)(x): max_i h^(i)(x)
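A minimal sketch of this one-vs-all scheme, using the Iris data purely for illustration (one binary classifier per class, then argmax over the estimated probabilities):
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# one binary classifier per class: class i vs. the rest
models = [LogisticRegression(max_iter=1000).fit(X, (y == i).astype(int))
          for i in range(3)]

def ova_predict(models, X):
    """Pick, for each sample, the class whose classifier gives the highest P(y = i | x)."""
    probs = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
    return probs.argmax(axis=1)

print(ova_predict(models, X[:5]))  # predictions for the first five samples
```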
Softmax Regression
Multi-class classification
Multinomial Logistic Regression
Generalization of logistic regression
𝑦 ∈ {1,2, … , 𝐾}
Softmax Regression
Given a test input x, estimate the probability P(y = k | x) for each value of k = 1, …, K
Estimate the probability of the class label taking on each of the K different possible values.
Our hypothesis will output a K-dimensional vector (whose elements sum to 1) giving K estimated probabilities.
Softmax Regression: Hypothesis
The hypothesis h_θ(x) takes the form:
h_θ(x) = [ P(y = 1 | x; θ), P(y = 2 | x; θ), …, P(y = K | x; θ) ]ᵀ
       = (1 / Σ_{j=1}^{K} exp(θ^(j)ᵀx)) · [ exp(θ^(1)ᵀx), exp(θ^(2)ᵀx), …, exp(θ^(K)ᵀx) ]ᵀ
θ^(1), θ^(2), …, θ^(K) ∈ ℝⁿ are the parameters of the model
The factor 1 / Σ_{j=1}^{K} exp(θ^(j)ᵀx) normalizes the distribution so that it sums to one
θ = [ θ^(1) … θ^(K) ] — an n-by-K matrix
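A minimal NumPy sketch of this hypothesis (subtracting the maximum logit is an assumption added only for numerical stability; it does not change the result):
```python
import numpy as np

def softmax_hypothesis(Theta, x):
    """Theta is n-by-K (one column theta^(k) per class); x is an n-vector.
    Returns the K estimated probabilities P(y = k | x; Theta)."""
    logits = Theta.T @ x              # theta^(k)^T x for each class k
    logits -= logits.max()            # stability trick
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()  # normalize so the probabilities sum to 1
```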
Softmax Regression: Cost Function
1{·} is the indicator function: 1{true statement} = 1 and 1{false statement} = 0
Minimize the cost function:
J(θ) = − Σ_{i=1}^{m} Σ_{k=1}^{K} 1{y^(i) = k} · log( exp(θ^(k)ᵀx^(i)) / Σ_{j=1}^{K} exp(θ^(j)ᵀx^(i)) )
where P(y^(i) = k | x^(i); θ) = exp(θ^(k)ᵀx^(i)) / Σ_{j=1}^{K} exp(θ^(j)ᵀx^(i))
Softmax Regression
We cannot solve for the minimum of J(θ) analytically
Use an iterative optimization algorithm.
Taking derivatives, one can show that the gradient is:
∇_{θ^(k)} J(θ) = − Σ_{i=1}^{m} x^(i) · ( 1{y^(i) = k} − P(y^(i) = k | x^(i); θ) )
where P(y^(i) = k | x^(i); θ) = exp(θ^(k)ᵀx^(i)) / Σ_{j=1}^{K} exp(θ^(j)ᵀx^(i))
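A minimal sketch of this cost and gradient in NumPy, assuming labels y are encoded as integers 0, …, K−1 (the slides index classes from 1; this is only a bookkeeping difference):
```python
import numpy as np

def softmax_cost_and_grad(Theta, X, y, K):
    """Theta: n-by-K, X: m-by-n, y: integer labels in {0, ..., K-1}."""
    m = X.shape[0]
    logits = X @ Theta                           # m-by-K matrix of theta^(k)^T x^(i)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)            # P[i, k] = P(y^(i) = k | x^(i); Theta)

    Y = np.zeros((m, K))
    Y[np.arange(m), y] = 1                       # indicator 1{y^(i) = k}

    cost = -np.sum(Y * np.log(P))                # J(Theta)
    grad = -X.T @ (Y - P)                        # n-by-K, column k is the gradient w.r.t. theta^(k)
    return cost, grad
```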
Example: Iris dataset
Python Implementation
❑ Scikit-Learn's LogisticRegression uses one-versus-all by default
❑ Switch to Softmax Regression by setting multi_class="multinomial"
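A minimal sketch under those settings; the choice of petal length and width as features is an illustrative assumption, and in recent Scikit-Learn versions the multinomial behaviour is already the default (the multi_class parameter is being phased out):
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 2:]   # petal length and petal width
y = iris.target        # 0 = Setosa, 1 = Versicolor, 2 = Virginica

softmax_reg = LogisticRegression(multi_class="multinomial")
softmax_reg.fit(X, y)

print(softmax_reg.predict([[5.0, 2.0]]))        # predicted class
print(softmax_reg.predict_proba([[5.0, 2.0]]))  # K probabilities summing to 1
```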
Example: MNIST dataset
70,000 images of digits handwritten by high school students and employees of the US Census Bureau
Each image is 28×28 pixels (hence has 784 features)
Each feature simply represents one pixel's intensity, from 0 (white) to 255 (black)
It is already split into a training set (the first 60,000 images) and a test set (the last 10,000 images)
Example: MNIST dataset
[Figures: sample digits, the learned weights for each class, and the decrease of the cost function over time]
Training Accuracy: 0.9210
Test Accuracy: 0.9206
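The training code is not shown on the slides; a minimal sketch of how such an experiment could be set up with Scikit-Learn (fetching MNIST via fetch_openml is an assumption here, and the exact accuracies will differ from the numbers above):
```python
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression

# 70,000 images of 28x28 = 784 pixel intensities in [0, 255]
mnist = fetch_openml("mnist_784", as_frame=False)
X, y = mnist.data / 255.0, mnist.target      # scale pixels to [0, 1]
X_train, X_test = X[:60000], X[60000:]       # standard 60k/10k split
y_train, y_test = y[:60000], y[60000:]

# multinomial (softmax) behaviour is the default for multi-class in recent versions
softmax_reg = LogisticRegression(max_iter=100)
softmax_reg.fit(X_train, y_train)
print("Train accuracy:", softmax_reg.score(X_train, y_train))
print("Test accuracy:", softmax_reg.score(X_test, y_test))
```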
Question
Suppose you train a logistic regression classifier and your hypothesis function h_θ(x) is as given on the slide.
Which of the following figures represents the decision boundary given by the above classifier?
Question
Imagine you are given the graph below of logistic regression, which shows the relationship between the cost function and the number of iterations for 3 different learning rate values (different colors show curves for different learning rates).
Suppose you save the graph for future reference but forget to save the values of the different learning rates. Now you want to find out the relation between the learning rate values of these curves. Which of the following is the true relation?
Note:
1. The learning rate for blue is l1
2. The learning rate for red is l2
3. The learning rate for green is l3
A) l1 > l2 > l3
B) l1 = l2 = l3
C) l1 < l2 < l3
D) None of these
Question
Suppose you are dealing with a 4-class classification problem and you want to train a logistic regression model on the data using the one-vs-all method.
How many logistic regression models do we need to fit for a 4-class classification problem?
Machine Learning Experiments
Measurements
Accuracy:
Accuracy = (Number of correctly classified samples) / (Total number of samples)
▪ In the case of imbalanced data, accuracy should not be used.
Measurements
Confusion Matrix:
                      Predicted
                      0        1
Actual      0         TN       FP
            1         FN       TP

TPR = TP / P        FPR = FP / N        Precision = TP / (TP + FP)        avg-recall = (TPR + TNR) / 2
TNR = TN / N        FNR = FN / P        Recall = TP / (TP + FN) = TP / P = TPR
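A minimal sketch of these quantities computed from raw confusion-matrix counts (the example counts are the ones used in the worked examples later in this section):
```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute the rates defined above from confusion-matrix counts."""
    p, n = tp + fn, tn + fp          # total actual positives / negatives
    tpr = tp / p                     # recall / sensitivity
    tnr = tn / n
    fpr = fp / n
    fnr = fn / p
    precision = tp / (tp + fp)
    avg_recall = (tpr + tnr) / 2
    return {"TPR": tpr, "TNR": tnr, "FPR": fpr, "FNR": fnr,
            "Precision": precision, "avg-recall": avg_recall}

print(confusion_metrics(tp=60, tn=20, fp=0, fn=20))  # matches Example 1 below
```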
Measurements
Confusion Matrix:
▪ For skewed classes it is better to use Precision/Recall.
▪ For a better model, both TPR and TNR should be high, and FPR and FNR should be as low as possible.
▪ The importance of each measure is domain specific:
  ▪ Ex: Suppose we want to predict y = 1 (cancer) only if very confident: higher precision, lower recall, low FPR.
  ▪ Ex: Suppose we want to avoid missing too many cases of cancer: higher recall, lower precision, low FNR, acceptable FPR.
Precision = TP / (TP + FP)        Recall = TP / (TP + FN) = TP / P = TPR
Performance Measure: Precision
precision = TP / (TP + FP)
TP → number of true positives
FP → number of false positives
Accuracy of the positive predictions (the fraction of predicted positives that are actually positive)
Performance Measure: Recall
Recall = TP / (TP + FN)
TP→ number of true positives
FN→ number of false negatives
Also called sensitivity or true positive rate (TPR)
Ratio of positive instances that are correctly detected by the classifier.
Performance Measure: Precision/Recall
Performance Measure: Precision/Recall tradeoff
Performance Measure: F1 score
We can combine precision and recall into a single metric
Allows us to compare two classifiers
Harmonic mean of precision and recall
The regular mean treats all values equally
The harmonic mean gives much more weight to low values
A classifier gets a high F1 score only if both recall and precision are high
F1 = 2 / (1/precision + 1/recall) = 2 × (precision × recall) / (precision + recall) = TP / (TP + (FN + FP)/2)
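A minimal check that the two forms of F1 above agree (the counts are illustrative):
```python
def f1_score_from_counts(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    harmonic = 2 * precision * recall / (precision + recall)  # harmonic-mean form
    direct = tp / (tp + (fn + fp) / 2)                        # equivalent count form
    assert abs(harmonic - direct) < 1e-12
    return harmonic

print(f1_score_from_counts(tp=75, fp=10, fn=5))  # ≈ 0.909
```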
Examples
Example 1: P = 80, N = 20; TP = 60, TN = 20, FP = 0, FN = 20
TPR = Recall = TP/P = 60/80 = 0.75
TNR = TN/N = 20/20 = 1
FPR = FP/N = 0/20 = 0
FNR = FN/P = 20/80 = 0.25
Precision = TP/(TP + FP) = 60/(60 + 0) = 1
acc = 0.8        avg-recall = (0.75 + 1)/2 = 0.875

Example 2: P = 80, N = 20; TP = 75, TN = 10, FP = 10, FN = 5
TPR = Recall = TP/P = 75/80 = 0.94
TNR = TN/N = 10/20 = 0.5
FPR = FP/N = 10/20 = 0.5
FNR = FN/P = 5/80 = 0.0625
Precision = TP/(TP + FP) = 75/(75 + 10) = 0.88
acc = 0.85        avg-recall = (0.94 + 0.5)/2 = 0.72
Examples
Example 3: P = 80, N = 920; TP = 75, TN = 910, FP = 10, FN = 5
TPR = Recall = TP/P = 75/80 = 0.94
TNR = TN/N = 910/920 = 0.99
FPR = FP/N = 10/920 = 0.01
FNR = FN/P = 5/80 = 0.0625
Precision = TP/(TP + FP) = 75/(75 + 10) = 0.88
acc = 0.985        avg-recall = (0.94 + 0.99)/2 = 0.965

Example 4: P = 80, N = 20; TP = 75, TN = 10, FP = 10, FN = 5
TPR = Recall = TP/P = 75/80 = 0.94
TNR = TN/N = 10/20 = 0.5
FPR = FP/N = 10/20 = 0.5
FNR = FN/P = 5/80 = 0.0625
Precision = TP/(TP + FP) = 75/(75 + 10) = 0.88
acc = 0.85        avg-recall = (0.94 + 0.5)/2 = 0.72