
CMPE 442 Introduction to Machine Learning

• Logistic Regression / Softmax Regression
• Measurements
Logistic Regression

• Classification problem
• Similar to the regression problem, but the target is discrete
• Assume a binary classifier: y can take on only two values, 0 and 1
• Ex.: spam email (1, positive class), ham email (0, negative class)
Linear Regression

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
Classification Problem

Threshold the classifier output at 0.5:

• If $h_\theta(x) \ge 0.5$, predict $y = 1$
• If $h_\theta(x) < 0.5$, predict $y = 0$

Example fitted line: $h_\theta(x) = -0.34 + 1.58x$
Classification Problem

• $y \in \{0, 1\}$
• So, we want $0 \le h_\theta(x) \le 1$
Logistic Regression

• Let's change the form of the hypothesis $h_\theta(x)$:

$h_\theta(x) = g(\theta^T x) = \dfrac{1}{1 + e^{-\theta^T x}}$

$g(z) = \dfrac{1}{1 + e^{-z}}$ is the logistic function, or sigmoid function
Logistic Regression

$g(z) = \dfrac{1}{1 + e^{-z}}$

• $g(z)$ tends towards 1 as $z \to \infty$
• $g(z)$ tends towards 0 as $z \to -\infty$
• $g(z)$ is always bounded between 0 and 1

$h_\theta(x) = g(\theta^T x) = \dfrac{1}{1 + e^{-\theta^T x}}$

$\theta^T x = \theta_0 + \sum_{j=1}^{n} \theta_j x_j, \qquad x_0 = 1$
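As a quick illustration, a minimal NumPy sketch of the sigmoid and the hypothesis above (function and variable names are my own, not from the slides):

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    # h_theta(x) = g(theta^T x); X carries a leading column of ones so that x_0 = 1
    return sigmoid(X @ theta)

# Example: theta = [-3, 1, 1] evaluated at the point (x1, x2) = (2, 2)
theta = np.array([-3.0, 1.0, 1.0])
X = np.array([[1.0, 2.0, 2.0]])
print(hypothesis(theta, X))  # about 0.73, so we would predict y = 1
```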
Logistic Regression
Predicts something that is True or False, instead of something continuous like size.
Interpretation of Logistic Function

• $h_\theta(x)$ can be interpreted as the estimated probability that $y = 1$ on input $x$:

• $h_\theta(x) = P(y = 1 \mid x; \theta)$
• $P(y = 0 \mid x; \theta) = 1 - P(y = 1 \mid x; \theta)$
Decision Boundary

• Assume: predict $y = 1$ if $h_\theta(x) \ge 0.5$, and $y = 0$ if $h_\theta(x) < 0.5$

$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$

$\theta = \begin{bmatrix} -3 \\ 1 \\ 1 \end{bmatrix}$

Predict $y = 1$ if $-3 + x_1 + x_2 \ge 0$

$x_1 + x_2 = 3$ is the decision boundary
Logistic Regression

• Given the logistic regression model, how do we fit $\theta$ for it?


Logistic Regression

• Training set: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$
• $m$ examples
• $x = \begin{bmatrix} x_0 \\ \vdots \\ x_n \end{bmatrix}$, with $x_0 = 1$
• $y \in \{0, 1\}$
• $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^T x}}$
Cost Function

• Linear Regression: $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$

• $Cost(h_\theta(x), y) = (h_\theta(x^{(i)}) - y^{(i)})^2$

• Using this cost function for logistic regression gives a non-convex function
• Gradient descent is then not guaranteed to converge to the global minimum
Cost Function

• $Cost(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

• $Cost = 0$ if $y = 1$ and $h_\theta(x) = 1$
• But as $h_\theta(x) \to 0$, $Cost \to \infty$
• If $h_\theta(x) = 0$, we predict $p(y = 1 \mid x; \theta) = 0$; if actually $y = 1$, the learning algorithm is penalized by a very large cost
Cost Function

• $Cost(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

• $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} Cost(h_\theta(x^{(i)}), y^{(i)})$

[Figure: plots of the cost against $h_\theta(x)$ for $y = 1$ and for $y = 0$]
Cost Function

• $Cost(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

• $y = 1$ or $y = 0$: only two values are possible, so the two cases can be written as one expression:

• $Cost(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$

• $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} Cost(h_\theta(x^{(i)}), y^{(i)}) = -\frac{1}{m} \left[ \sum_{i=1}^{m} y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$
• This is a convex function
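A minimal NumPy sketch of this cross-entropy cost (the names and the small epsilon guard are my own, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(theta, X, y):
    # J(theta) = -(1/m) * sum( y*log(h) + (1 - y)*log(1 - h) )
    h = sigmoid(X @ theta)
    eps = 1e-12  # guard against log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
```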
Logistic Regression

• To fit parameters $\theta$:
  ❑ $\min_\theta J(\theta)$ to obtain $\theta$

• To make a prediction on a given new $x$:
  ❑ Output $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^T x}}$, interpreted as $p(y = 1 \mid x; \theta)$
Gradient Descent

• Want $\min_\theta J(\theta)$:

Repeat {
  $\theta_j := \theta_j - \eta \dfrac{\partial}{\partial \theta_j} J(\theta)$
} (simultaneously update all $\theta_j$)


Gradient Descent

$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} Cost(h_\theta(x^{(i)}), y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$

$\dfrac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \, x_j^{(i)}$
Gradient Descent

• Want $\min_\theta J(\theta)$, with $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^T x}}$:

Repeat {
  $\theta_j := \theta_j - \eta \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \, x_j^{(i)}$
} (simultaneously update all $\theta_j$)
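Putting the update rule together, a minimal batch gradient descent sketch in NumPy (the fit_logistic_gd name, learning rate, and iteration count are my own illustrative choices, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, eta=0.1, n_iters=1000):
    # X: (m, n+1) design matrix with a leading column of ones; y: (m,) labels in {0, 1}
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)          # h_theta(x^(i)) for every example
        grad = (X.T @ (h - y)) / m      # (1/m) * sum (h - y) * x_j
        theta -= eta * grad             # simultaneous update of all theta_j
    return theta
```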


Python Implementation

Gives probabilities. To get the classes, just use log_reg.predict()
Example
Let's apply Logistic Regression to the Iris dataset.
Iris → a famous dataset containing the sepal and petal length and width of 150 iris flowers
Three different species: Iris-Setosa, Iris-Versicolor, and Iris-Virginica
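A minimal scikit-learn sketch of this example; detecting Iris-Virginica from petal width alone is my own illustrative binary setup, since the exact features used on the slides are not shown in the extracted text:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris["data"][:, 3:]                 # petal width (cm)
y = (iris["target"] == 2).astype(int)   # 1 if Iris-Virginica, else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)

print(log_reg.predict_proba([[1.7]]))   # class probabilities for a new flower
print(log_reg.predict([[1.7]]))         # predicted class label
```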
Iris Dataset
Example
Example
Nonlinear Decision Boundary
Nonlinear Decision Boundary

$x_2 = x_1^2$
Nonlinear Decision Boundary

$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$

$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$

$\theta = \begin{bmatrix} -1 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}$

Predict $y = 1$ if $-1 + x_1^2 + x_2^2 \ge 0$, i.e. $x_1^2 + x_2^2 \ge 1$

Decision boundary: $x_1^2 + x_2^2 = 1$

[Figure: circular decision boundary in the $(X_1, X_2)$ plane]
Multiclass Classification

• What if $y \in \{1, 2, \dots, k\}$ classes?

• Example: weather prediction {sunny, cloudy, snowy, …}

[Figure: multiclass data plotted in the $(X_1, X_2)$ plane]
One-vs-all Classifier (one-vs-rest)

[Figure: the multiclass problem split into three binary sub-problems in the $(X_1, X_2)$ plane, one per class (Class 1, Class 2, Class 3), with classifiers $h^{(1)}(x)$, $h^{(2)}(x)$, $h^{(3)}(x)$]
One-vs-all Classifier (one-vs-rest)

• $h^{(i)}(x) = p(y = i \mid x; \theta)$ for $i = 1, 2, 3$
• One-vs-all:
  ❑ Train a logistic regression classifier $h^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$ (class $i$ is relabeled as 1 and all other classes as 0)
  ❑ On a new input $x$, pick the class $i$ that maximizes $h^{(i)}(x)$ (a sketch follows below)
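A minimal NumPy sketch of the one-vs-all scheme, reusing the hypothetical sigmoid and fit_logistic_gd helpers from the gradient descent sketch earlier (all names are illustrative, not from the slides):

```python
import numpy as np

def fit_one_vs_all(X, y, classes, eta=0.1, n_iters=1000):
    # Train one binary logistic regression per class: that class is relabeled 1, the rest 0
    thetas = {}
    for c in classes:
        y_binary = (y == c).astype(float)
        thetas[c] = fit_logistic_gd(X, y_binary, eta=eta, n_iters=n_iters)
    return thetas

def predict_one_vs_all(thetas, X):
    # For each example, pick the class whose classifier outputs the highest probability
    classes = list(thetas.keys())
    probs = np.column_stack([sigmoid(X @ thetas[c]) for c in classes])
    return np.array(classes)[np.argmax(probs, axis=1)]
```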
Softmax Regression

• Multi-class classification
• Also called Multinomial Logistic Regression
• Generalization of logistic regression
• $y \in \{1, 2, \dots, K\}$
Softmax Regression

• Given a test input $x$, estimate the probability $P(y = k \mid x)$ for each value of $k = 1, \dots, K$
• That is, estimate the probability of the class label taking on each of the $K$ different possible values
• Our hypothesis will output a $K$-dimensional vector (whose elements sum to 1) giving $K$ estimated probabilities
Softmax Regression: Hypothesis

• The hypothesis $h_\theta(x)$ takes the form:

$h_\theta(\boldsymbol{x}) = \begin{bmatrix} P(y = 1 \mid \boldsymbol{x}; \theta) \\ P(y = 2 \mid \boldsymbol{x}; \theta) \\ \vdots \\ P(y = K \mid \boldsymbol{x}; \theta) \end{bmatrix} = \dfrac{1}{\sum_{j=1}^{K} \exp(\boldsymbol{\theta}^{(j)T} \boldsymbol{x})} \begin{bmatrix} \exp(\boldsymbol{\theta}^{(1)T} \boldsymbol{x}) \\ \exp(\boldsymbol{\theta}^{(2)T} \boldsymbol{x}) \\ \vdots \\ \exp(\boldsymbol{\theta}^{(K)T} \boldsymbol{x}) \end{bmatrix}$

• $\boldsymbol{\theta}^{(1)}, \boldsymbol{\theta}^{(2)}, \dots, \boldsymbol{\theta}^{(K)} \in \mathbb{R}^n$ are the parameters of the model
• The factor $\dfrac{1}{\sum_{j=1}^{K} \exp(\boldsymbol{\theta}^{(j)T} \boldsymbol{x})}$ normalizes the distribution so it sums to one
• $\boldsymbol{\theta} = \begin{bmatrix} | & & | \\ \boldsymbol{\theta}^{(1)} & \cdots & \boldsymbol{\theta}^{(K)} \\ | & & | \end{bmatrix}$ is an $n$-by-$K$ matrix
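A minimal NumPy sketch of this hypothesis (the max-subtraction is a standard numerical-stability trick and not part of the slide's formula; names are my own):

```python
import numpy as np

def softmax_hypothesis(Theta, x):
    # Theta: (n, K) matrix whose columns are theta^(1), ..., theta^(K); x: (n,) input
    scores = Theta.T @ x                  # theta^(k)T x for every class k
    scores -= scores.max()                # subtract the max score for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # K probabilities that sum to 1
```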
Softmax Regression: Cost Function

 1 ∙ - indicator function
 1 𝑇𝑟𝑢𝑒 𝑠𝑡𝑎𝑡𝑒𝑚𝑒𝑛𝑡 = 1 and 1 𝐹𝑎𝑙𝑠𝑒 𝑠𝑡𝑎𝑡𝑒𝑚𝑒𝑛𝑡 = 0
 Minimize the cost function:
𝑚 𝐾
exp(𝜽 𝑘 𝑇 𝒙(𝑖) )
𝐽 𝜃 = − ෍෍1 𝑦 (𝑖) = 𝑘 𝑙𝑜𝑔 𝐾
σ𝑗=1 exp(𝜽 𝑗 𝑇 𝒙(𝑖) )
𝑖=1 𝑘=1

𝑖
exp(𝜽 𝑘 𝑇 𝒙(𝑖) )
𝑃 𝑦 =𝑘 𝒙(𝒊) ; 𝜽 = 𝐾
σ𝑗=1 exp(𝜽 𝑗 𝑇 𝒙(𝑖) )
Softmax Regression

• We cannot solve for the minimum of $J(\theta)$ analytically

• Use an iterative optimization algorithm
• Taking derivatives, one can show that the gradient is:

$\nabla_{\boldsymbol{\theta}^{(k)}} J(\boldsymbol{\theta}) = -\sum_{i=1}^{m} \boldsymbol{x}^{(i)} \left( 1\{y^{(i)} = k\} - P(y^{(i)} = k \mid \boldsymbol{x}^{(i)}; \boldsymbol{\theta}) \right)$

$P(y^{(i)} = k \mid \boldsymbol{x}^{(i)}; \boldsymbol{\theta}) = \dfrac{\exp(\boldsymbol{\theta}^{(k)T} \boldsymbol{x}^{(i)})}{\sum_{j=1}^{K} \exp(\boldsymbol{\theta}^{(j)T} \boldsymbol{x}^{(i)})}$
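A minimal NumPy sketch of this gradient for a full batch, assuming one-hot encoded labels (names are my own, not from the slides):

```python
import numpy as np

def softmax_gradient(Theta, X, Y_onehot):
    # X: (m, n) inputs; Y_onehot: (m, K) with Y_onehot[i, k] = 1{y^(i) = k}
    scores = X @ Theta                            # (m, K) class scores
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)     # P(y^(i) = k | x^(i); Theta)
    # Gradient w.r.t. theta^(k): -sum_i x^(i) * (1{y^(i)=k} - P(y^(i)=k | x^(i)))
    return -X.T @ (Y_onehot - probs)              # (n, K); column k is the gradient for theta^(k)
```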
Example: Iris dataset
Python Implementation

❑ Scikit-Learn's LogisticRegression uses one-versus-all by default

❑ Switch to Softmax Regression by setting multi_class="multinomial"
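A minimal sketch of that switch on the Iris data; the feature choice and hyperparameters are my own illustrative picks, and note that the default multi-class behavior and the multi_class parameter itself vary across scikit-learn versions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris["data"][:, (2, 3)]   # petal length and petal width
y = iris["target"]            # three classes: 0, 1, 2

softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
softmax_reg.fit(X, y)

print(softmax_reg.predict([[5, 2]]))        # predicted class
print(softmax_reg.predict_proba([[5, 2]]))  # probabilities for all three classes
```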
Example: MNIST dataset

• 70,000 images of digits handwritten by high school students and employees of the US Census Bureau
• Each image is 28×28 pixels (hence 784 features)
• Each feature simply represents one pixel's intensity, from 0 (white) to 255 (black)
• Already split into a training set (the first 60,000 images) and a test set (the last 10,000 images)
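A minimal sketch of loading MNIST and training softmax regression with scikit-learn; the solver, scaling, and iteration count are illustrative choices and are not claimed to reproduce the accuracies reported on the following slides:

```python
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression

# Download MNIST: 70,000 images, each with 784 pixel-intensity features
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data / 255.0, mnist.target   # scale intensities to [0, 1]

# Conventional split: first 60,000 images for training, last 10,000 for testing
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=100)
softmax_reg.fit(X_train, y_train)

print("Train accuracy:", softmax_reg.score(X_train, y_train))
print("Test accuracy:", softmax_reg.score(X_test, y_test))
```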
Example: MNIST dataset
Example: MNIST dataset
Learned Weights
Example: MNIST dataset

Illustration of the decrease of the cost function over time

• Training Accuracy: 0.9209666666666667

• Test Accuracy: 0.9206
Question

• Suppose you train a logistic regression classifier and your hypothesis function H is the one shown on the slide.

• Which of the following figures represents the decision boundary given by the above classifier?
Question

• Imagine you are given the graph below of logistic regression, which shows the relationship between the cost function and the number of iterations for 3 different learning rate values (different colors show curves for different learning rates).
Suppose you saved the graph for future reference but forgot to save the learning rate values. Now you want to find the relation between the learning rate values of these curves. Which of the following is the true relation?
Note:
1. The learning rate for blue is l1
2. The learning rate for red is l2
3. The learning rate for green is l3

A) l1 > l2 > l3
B) l1 = l2 = l3
C) l1 < l2 < l3
D) None of these
Question

• Suppose you are dealing with a 4-class classification problem and you want to train a logistic regression model on the data using the one-vs-all method.
• How many logistic regression models do we need to fit for a 4-class classification problem?
Machine Learning Experiments
Measurements

• Accuracy:

$\text{Accuracy} = \dfrac{\text{Number of correctly classified samples}}{\text{Total number of samples}}$

▪ In the case of imbalanced data, accuracy should not be used.


Measurements

• Confusion Matrix:

              Predicted 0    Predicted 1
  Actual 0    TN             FP
  Actual 1    FN             TP

$TPR = \dfrac{TP}{P}$   $FPR = \dfrac{FP}{N}$   $Precision = \dfrac{TP}{TP + FP}$   $\text{avg-recall} = \dfrac{TPR + TNR}{2}$

$TNR = \dfrac{TN}{N}$   $FNR = \dfrac{FN}{P}$   $Recall = \dfrac{TP}{TP + FN} = \dfrac{TP}{P} = TPR$
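A minimal scikit-learn sketch computing these quantities (the labels below are illustrative data, not from the slides; in scikit-learn's confusion_matrix, rows are actual classes and columns are predicted classes):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 0])

# For binary labels {0, 1}, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)        # recall / sensitivity
tnr = tn / (tn + fp)
fpr = fp / (fp + tn)
precision = tp / (tp + fp)
print(tpr, tnr, fpr, precision)
```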
Measurements

• Confusion Matrix:
  ▪ For skewed classes, it is better to use Precision/Recall
  ▪ For a better model, TPR and TNR should be as high as possible and FPR and FNR as low as possible
  ▪ The importance of each measure is domain specific:
    ▪ Ex: Suppose we want to predict y=1 (cancer) only if very confident: higher precision, lower recall, low FPR
    ▪ Ex: Suppose we want to avoid missing too many cases of cancer: higher recall, lower precision, low FNR, acceptable FPR

$Precision = \dfrac{TP}{TP + FP}$   $Recall = \dfrac{TP}{TP + FN} = \dfrac{TP}{P} = TPR$
Performance Measure: Precision

$precision = \dfrac{TP}{TP + FP}$

• TP → number of true positives
• FP → number of false positives
• Accuracy of positive predictions (the fraction of positive predictions that are actually positive)
Performance Measure: Recall

$Recall = \dfrac{TP}{TP + FN}$

• TP → number of true positives
• FN → number of false negatives
• Also called sensitivity or true positive rate (TPR)
• The ratio of positive instances that are correctly detected by the classifier
Performance Measure: Precision/Recall
Performance Measure: Precision/Recall Tradeoff
Performance Measure: F1 Score

• We can combine precision and recall into a single metric

• Allows us to compare two classifiers
• Harmonic mean of precision and recall
• The regular mean treats all values equally
• The harmonic mean gives much more weight to low values
• A classifier gets a high F1 score only if both recall and precision are high

$F_1 = \dfrac{2}{\frac{1}{precision} + \frac{1}{recall}} = 2 \times \dfrac{precision \times recall}{precision + recall} = \dfrac{TP}{TP + \frac{FN + FP}{2}}$
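A minimal scikit-learn sketch computing these metrics on the same illustrative labels used in the confusion matrix sketch above:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1 score: ", f1_score(y_true, y_pred))         # harmonic mean of the two
```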
Examples

Example 1 (P = 80, N = 20):
  $TPR = Recall = \frac{TP}{P} = \frac{60}{80} = 0.75$
  $TNR = \frac{TN}{N} = \frac{20}{20} = 1$
  $FPR = \frac{FP}{N} = \frac{0}{20} = 0$
  $FNR = \frac{FN}{P} = \frac{20}{80} = 0.25$
  $Precision = \frac{TP}{TP + FP} = \frac{60}{60 + 0} = 1$
  $acc = 0.8$,  $\text{avg-recall} = \frac{0.75 + 1}{2} = 0.875$

Example 2 (P = 80, N = 20):
  $TPR = Recall = \frac{TP}{P} = \frac{75}{80} = 0.94$
  $TNR = \frac{TN}{N} = \frac{10}{20} = 0.5$
  $FPR = \frac{FP}{N} = \frac{10}{20} = 0.5$
  $FNR = \frac{FN}{P} = \frac{5}{80} = 0.0625$
  $Precision = \frac{TP}{TP + FP} = \frac{75}{75 + 10} = 0.88$
  $acc = 0.85$,  $\text{avg-recall} = \frac{0.94 + 0.5}{2} = 0.72$
Examples

Example 3 (P = 80, N = 920):
  $TPR = Recall = \frac{TP}{P} = \frac{75}{80} = 0.94$
  $TNR = \frac{TN}{N} = \frac{910}{920} = 0.99$
  $FPR = \frac{FP}{N} = \frac{10}{920} = 0.01$
  $FNR = \frac{FN}{P} = \frac{5}{80} = 0.0625$
  $Precision = \frac{TP}{TP + FP} = \frac{75}{75 + 10} = 0.88$
  $acc = 0.985$,  $\text{avg-recall} = \frac{0.94 + 0.99}{2} = 0.965$

Example 4 (P = 80, N = 20):
  $TPR = Recall = \frac{TP}{P} = \frac{75}{80} = 0.94$
  $TNR = \frac{TN}{N} = \frac{10}{20} = 0.5$
  $FPR = \frac{FP}{N} = \frac{10}{20} = 0.5$
  $FNR = \frac{FN}{P} = \frac{5}{80} = 0.0625$
  $Precision = \frac{TP}{TP + FP} = \frac{75}{75 + 10} = 0.88$
  $acc = 0.85$,  $\text{avg-recall} = \frac{0.94 + 0.5}{2} = 0.72$
