Logistic Regression
• A supervised classification algorithm.
• Predicted output is discrete – only specific values or classes
are allowed, e.g., Pass/Fail, Spam/No spam, Malignant or
Benign tumor, etc.
• That is, the output y (y = f (x)), predicted from x (inputs or
features), takes only discrete values/classes.
• Types of logistic regression:
(a) Binary logistic regression: y takes two discrete values (for
two classes).
Logistic Regression (contd.)
(b) Multi-class logistic regression: y takes more than two
discrete values (for multiple classes). For example, in digit
classification, there are 10 classes (0 – 9).
(c) Ordinal logistic regression: Multiple classes in some order
(e.g., Low, Medium, High).
• Predicts class based on probability.
Binary Logistic Regression
• Two-class classification: y = 0 or 1.
• y can be predicted based on single variable or feature (x):
y = f (x) = β0 + β1 x
• Based on multiple features (x1 , x2 , · · · , xk ):
y = f (x1 , x2 , · · · , xk ) = β0 + β1 x1 + β2 x2 + · · · + βk xk
• An example on student dataset:
Hours studied   Hours slept   Result (Pass (1)/Fail (0))
4.85            9.63          1
8.62            3.23          0
5.43            8.23          1
9.21            6.34          0
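For concreteness, this toy dataset could be written as NumPy arrays, as in the minimal Python sketch below (the variable names are illustrative):

    import numpy as np

    # Each row holds one student's features: [hours studied, hours slept].
    X = np.array([[4.85, 9.63],
                  [8.62, 3.23],
                  [5.43, 8.23],
                  [9.21, 6.34]])

    # Labels: 1 = Pass, 0 = Fail.
    Y = np.array([1, 0, 1, 0])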
Binary Logistic Regression (contd.)
• f should generate probabilities (in the range, 0 to 1), to
predict classes.
• Choose a threshold P (say, 0.5) such that, if probability is less
than threshold P, predict y as Class 0. Otherwise, predict y
as Class 1.
• The above function, f , is the one used for linear regression.
• As shown in the figure below, it is not useful for logistic
regression, since the predicted output should be in the range, [0, 1].
Binary Logistic Regression (contd.)
• A function that has the range, [0, 1], is required.
• Sigmoid function maps any real value to the range, [0, 1].
σ(z) = 1 / (1 + e^−z )
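As a quick Python sketch (the function name sigmoid is illustrative):

    import numpy as np

    def sigmoid(z):
        """Map any real value z to a value strictly between 0 and 1."""
        return 1.0 / (1.0 + np.exp(-z))

    print(sigmoid(0.0))    # 0.5
    print(sigmoid(10.0))   # close to 1
    print(sigmoid(-10.0))  # close to 0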
Binary Logistic Regression (contd.)
• If z = β0 + β1 x1 + β2 x2 + · · · + βk xk , σ(z) gives values in the
range, [0, 1].
Binary Logistic Regression (contd.)
• Logistic regression is based on the Sigmoid function, which is
also called the Logistic function, given by:
y = f (x1 , x2 , · · · , xk ) = 1 / (1 + e^−(β0 + β1 x1 + β2 x2 + · · · + βk xk ) )
• Similar to linear regression, coefficients (β0 , β1 , etc.) are
learnt from training data, based on gradient descent on the
error or cost function – Training process.
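A sketch of this hypothesis in Python, assuming beta[0] holds the intercept β0 :

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def f(x, beta):
        """Logistic model: sigmoid of beta0 + beta1*x1 + ... + betak*xk.
        x: feature vector of length k; beta: coefficients of length k + 1."""
        z = beta[0] + np.dot(beta[1:], x)
        return sigmoid(z)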
Cost function
• Consider n training points or samples, (Xi , Yi ), where each Xi
is a k-dimensional feature vector (k features) and Yi = 0 or 1.
• For logistic regression, the Sigmoid function (with its
exponential term in the denominator) makes MSE non-convex in
the coefficients, so the following cost function, called
Log-Loss or Cross Entropy, is used (instead of the MSE used
for linear regression):
J(θ) = J(β0 , β1 , · · · , βk ) = −(1/n) Σ_{i=1}^{n} [Yi log(f (Xi )) + (1 − Yi ) log(1 − f (Xi ))]
• Similar to MSE, the above cost function which quantifies the
difference between actual (Yi ) and predicted (f (Xi )) outputs,
has to be minimized.
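A minimal Python sketch of the Log-Loss (here f_X stands for the predicted probabilities f (Xi ); the probe values are illustrative):

    import numpy as np

    def log_loss(Y, f_X):
        """Cross entropy between actual labels Y (0/1) and
        predicted probabilities f_X, averaged over the n samples."""
        return -np.mean(Y * np.log(f_X) + (1 - Y) * np.log(1 - f_X))

    Y   = np.array([1, 0, 1, 0])
    f_X = np.array([0.9, 0.2, 0.8, 0.3])
    print(log_loss(Y, f_X))  # small, since the predictions match the labels well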
Cost function (contd.)
• If the actual output Yi = 1, the loss term for that sample is
−log (f (Xi )). If f (Xi ) is low, the error is high. As f (Xi )
approaches 1, the loss moves towards 0.
• If the actual output Yi = 0, the loss term is −log (1 − f (Xi )).
If f (Xi ) is high, the error is high. As f (Xi ) approaches 0,
the loss moves towards 0.
• That means, if the actual and predicted outputs, Yi and
f (Xi ), match, the error is 0.
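A quick numeric check of the two cases (the probe probabilities are illustrative):

    import numpy as np

    # Yi = 1: loss is -log(f(Xi)); it shrinks as f(Xi) approaches 1.
    for p in [0.1, 0.5, 0.9, 0.99]:
        print(f"f(Xi)={p}: loss={-np.log(p):.3f}")

    # Yi = 0: loss is -log(1 - f(Xi)); it shrinks as f(Xi) approaches 0.
    for p in [0.9, 0.5, 0.1, 0.01]:
        print(f"f(Xi)={p}: loss={-np.log(1 - p):.3f}")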
Cost function (contd.)
• Illustrated in the plots below: J(θ) vs f (X ), for Y = 1 and for Y = 0.
Gradient descent for training
• Initialize each βj , j = 0, 1, · · · , k, to some random value.
• For one or more epochs, or until some minimum error
threshold (say, ϵ < 0.001) is reached, do the following:
For each j = 0, 1, · · · , k,
(i) δβj = −η ∂J(θ)/∂βj
(ii) βj = βj + δβj
• Training process can be monitored by plotting cost, J(θ) vs
number of training epochs.
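A batch gradient descent sketch in Python, assuming the standard log-loss gradient (1/n) Σ (f (Xi ) − Yi ) Xij ; names like train, eta, and epochs are illustrative:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train(X, Y, eta=0.1, epochs=1000):
        """Learn beta by batch gradient descent on the log-loss.
        X: (n, k) feature matrix; Y: (n,) labels in {0, 1}."""
        n, k = X.shape
        Xb = np.hstack([np.ones((n, 1)), X])   # prepend 1s so beta[0] is the intercept
        beta = np.random.default_rng(0).normal(scale=0.01, size=k + 1)
        for _ in range(epochs):
            f_X = sigmoid(Xb @ beta)           # predicted probabilities f(Xi)
            grad = Xb.T @ (f_X - Y) / n        # dJ/dbeta for the log-loss
            beta += -eta * grad                # steps (i) and (ii) above
        return beta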
Gradient descent for training (contd.)
[Plot: cost, J(θ), vs number of training epochs]
Testing
• Once the weights or coefficients (β0 , β1 , · · · , βk ) are learnt
from training, the function,
y = f (x1 , x2 , · · · , xk ) = 1 / (1 + e^−(β0 + β1 x1 + β2 x2 + · · · + βk xk ) )
can be used to predict the output, y , for any new input,
(x1 , x2 , · · · , xk ).
• Choose a threshold, P, say 0.5.
• If y ≥ P, output Class 1.
• If y < P, output Class 0.
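A sketch of this prediction rule (predict is an illustrative name; beta is assumed to come from training, e.g. the train sketch above):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict(x, beta, P=0.5):
        """Output Class 1 if the predicted probability y >= P, else Class 0."""
        y = sigmoid(beta[0] + np.dot(beta[1:], x))
        return 1 if y >= P else 0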
Multi-class logistic regression
• Suppose there are c classes.
• Divide the problem into c binary classification problems, one
for each class (the one-vs-rest approach) – generate c binary
classifiers.
• To train binary classifier i, training samples from class i have
actual output, y = 1. For the remaining samples, y = 0.
• After training c binary classifiers, each classifier i can be used
to determine if a sample belongs to class i or not (binary).
• For testing, compute probabilities from the c binary classifiers
corresponding to the c classes, and then output the class
which has the maximum probability.
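A one-vs-rest sketch, reusing the binary pieces above (train_one_vs_rest and predict_multiclass are illustrative names; train_binary can be any binary trainer such as the train sketch above):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_one_vs_rest(X, Y, c, train_binary):
        """Train c binary classifiers; classifier i treats class i as 1, the rest as 0."""
        return [train_binary(X, (Y == i).astype(float)) for i in range(c)]

    def predict_multiclass(x, betas):
        """Output the class whose binary classifier gives the maximum probability."""
        probs = [sigmoid(b[0] + np.dot(b[1:], x)) for b in betas]
        return int(np.argmax(probs))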
Thank You