DDA3020 Machine Learning
Lecture 06 Logistic Regression
Jicong Fan
School of Data Science, CUHK-SZ
October 10/12, 2022
Outline
1 Review of last week
2 Classification and representation
3 Logistic regression
4 Regularized logistic regression
5 Probabilistic perspective of logistic regression
6 Summary: linear regression vs. logistic regression
Linear regression: deterministic perspective
Linear hypothesis function: $f_{w,b}(x) = x^\top w + b$, or simply $f_w(x) = x^\top w$ by concatenating $b$ and $w$ together and augmenting $x$ to $[1; x]$
Linear regression by minimizing the residual sum of squares (RSS):
$$w^* = \arg\min_w J(w), \quad \text{where } J(w) = \frac{1}{2}\sum_{i=1}^m (x_i^\top w - y_i)^2 = \frac{1}{2}\|Xw - y\|^2$$
Two solutions:
Closed-form solution: $w^* = (X^\top X)^{-1} X^\top y$
Gradient descent: $w \leftarrow w - \alpha X^\top (Xw - y)$, repeated for multiple iterations until convergence
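As a concrete illustration, here is a minimal numpy sketch of both solutions (not part of the original slides; the synthetic data, the learning rate, and the iteration count are assumptions made for demonstration):

import numpy as np

# Synthetic data: the design matrix X has a leading column of ones,
# so its first weight plays the role of the bias b.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=100)

# Closed-form solution: w* = (X^T X)^{-1} X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent: w <- w - alpha * X^T (X w - y)
w = np.zeros(X.shape[1])
alpha = 0.005
for _ in range(1000):
    w -= alpha * X.T @ (X @ w - y)

print(w_closed, w)  # the two estimates should agree up to numerical precision

Both solvers minimize the same convex objective, so they recover (approximately) the same weights.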
Linear regression: probabilistic perspective
We assume that $y = w^\top x + e$, where $e \sim \mathcal{N}(0, \sigma^2)$ is called observation noise or residual error
$y$ is also a random variable, and its conditional probability is $p(y|x, w) = \mathcal{N}(w^\top x, \sigma^2)$
Maximum log-likelihood estimation:
$$w_{MLE} = \arg\max_w \log L(w|\mathcal{D}) = \arg\max_w \log \prod_{i=1}^m p(y_i|x_i, w) \quad (1)$$
$$= \arg\max_w \sum_{i=1}^m \log p(y_i|x_i, w) = \arg\max_w \sum_{i=1}^m \log \mathcal{N}(w^\top x_i, \sigma^2) \quad (2)$$
$$= \arg\max_w -\frac{1}{2}\log\big(\sigma^{2m}(2\pi)^m\big) - \frac{1}{2\sigma^2}\sum_{i=1}^m (y_i - w^\top x_i)^2 \quad (3)$$
$$= \arg\min_w \frac{1}{2}\sum_{i=1}^m (y_i - w^\top x_i)^2 \quad (4)$$
Variants of linear regression
Ridge regression to avoid over-fitting, through MAP estimation:
$$w_{MAP} = \arg\max_w \sum_{i=1}^m \log p(y_i|x_i, w) + \log p(w) \quad (5)$$
$$= \arg\max_w \sum_{i=1}^m \log \mathcal{N}(w^\top x_i, \sigma^2) + \log \mathcal{N}(w|0, \tau^2 I) \quad (6)$$
$$\equiv \arg\min_w \sum_{i=1}^m (w^\top x_i - y_i)^2 + \lambda\|w\|_2^2 \quad (7)$$
Polynomial regression: linear model with basis expansion $\phi(x)$
$$f_{w,b}(x) = b + \sum_{i=1}^d w_i x_i + \sum_{i=1}^d\sum_{j=1}^d w_{ij} x_i x_j + \sum_{i=1}^d\sum_{j=1}^d\sum_{k=1}^d w_{ijk} x_i x_j x_k + \ldots = \phi(x)^\top w, \quad (8)$$
$$\phi(x) = [1, x_1, \ldots, x_d, \ldots, x_i x_j, \ldots, x_i x_j x_k, \ldots]^\top,$$
$$w = [b, w_1, \ldots, w_d, \ldots, w_{ij}, \ldots, w_{ijk}, \ldots]^\top.$$
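A minimal numpy sketch combining the degree-2 basis expansion of Eq. (8) with the ridge solution of Eq. (7) (added for illustration; the feature map, the value of lam, and the synthetic data are assumptions, and for simplicity the constant feature's weight is penalized as well):

import numpy as np

def poly2_features(X):
    """Degree-2 basis expansion phi(x) = [1, x_1, ..., x_d, ..., x_i * x_j, ...]."""
    m, d = X.shape
    cross = [X[:, i] * X[:, j] for i in range(d) for j in range(i, d)]
    return np.column_stack([np.ones(m), X] + cross)

def ridge_fit(Phi, y, lam=1.0):
    """Closed-form ridge solution: w = (Phi^T Phi + lam * I)^{-1} Phi^T y."""
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + X[:, 0] - 2.0 * X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=200)
w = ridge_fit(poly2_features(X), y)  # weights over the expanded features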
Variants of linear regression
Lasso regression to obtain a sparse model:
$$w_{MAP} = \arg\max_w \sum_{i=1}^m \log \mathcal{N}(w^\top x_i, \sigma^2) + \log \mathrm{Lap}(w|0, b) \quad (9)$$
$$= \arg\min_w \sum_{i=1}^m (w^\top x_i - y_i)^2 + \lambda\|w\|_1 \quad (10)$$
Robust regression for data with outliers:
$$w_{MLE} = \arg\min_w \sum_{i=1}^m |w^\top x_i - y_i| \quad (11)$$
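The absolute-error objective in Eq. (11) is not differentiable at zero residual, but it can still be minimized with a subgradient method. A minimal sketch (added for illustration; the step size and iteration count are arbitrary choices, not values from the slides):

import numpy as np

def robust_fit(X, y, lr=0.01, n_iters=2000):
    """Minimize (1/m) * sum_i |x_i^T w - y_i| by subgradient descent.

    A subgradient of |x_i^T w - y_i| w.r.t. w is sign(x_i^T w - y_i) * x_i
    (np.sign(0) = 0 is a valid subgradient at the kink).
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residual = X @ w - y
        w -= lr * X.T @ np.sign(residual) / len(y)
    return w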
Summary of different linear regressions
Note that the uniform distribution will not change the mode of the likelihood.
Thus, MAP estimation with a uniform prior corresponds to MLE.
p(y|x, w) p(w) regression method
Gaussian Uniform Least squares
Gaussian Gaussian Ridge regression
Gaussian Laplace Lasso regression
Laplace Uniform Robust regression
Student Uniform Robust regression
[Figure: contours in parameter space (axes $u_1$, $u_2$) marking the ML estimate, the MAP estimate, and the prior mean.]
Classification
Classification: classifying input data into discrete states
Email filtering: spam / not spam?
Weather forecast: sunny / not sunny?
Tumor: malignant / benign?
The label y ∈ {0, 1}:
y = 0: negative class, e.g., not spam, not sunny, benign
y = 1: positive class, e.g., spam, sunny, malignant
Threshold classifier with linear regression
We assume a linear hypothesis function $f_{w,b}(x) = x^\top w + b$
A simple threshold classifier with this hypothesis function is
If $f_{w,b}(x) > 0.5$, then $y = 1$, i.e., malignant tumor
If $f_{w,b}(x) < 0.5$, then $y = 0$, i.e., benign tumor
Threshold classifier with linear regression
It seems that the simple threshold classifier with linear regression works
well on this classification task
However, if there is a positive sample with very large tumor size (plot
above), what will happen?
The fitted hypothesis function will change significantly, causing some positive samples to be misclassified as negative (not malignant). How can we handle this? By adjusting the threshold value, or by adopting robust linear regression.
Threshold classifier with linear regression
But there is still something weird.
Our goal is to predict $y \in \{0, 1\}$, but the prediction could be $f_{w,b}(x) > 1$ or $f_{w,b}(x) < 0$, which does not serve our purpose.
A desired hypothesis function for this task should satisfy $f_{w,b}(x) \in [0, 1]$.
Threshold classifier with linear regression
Exercise: Which statements are true?
If linear regression does not work well, as in the example above, feature scaling may help
If the training set satisfies $y_i \in [0, 1]$ for all points $(x_i, y_i)$, then the linear hypothesis function satisfies $f_{w,b}(x_i) \in [0, 1]$ for all values of $x_i$
None of the above is correct
Hypothesis representation
A desired hypothesis function for this task should satisfy $f_{w,b}(x) \in [0, 1]$
To this end, we introduce a novel function, as follows:
$$f_{w,b}(x) = g(w^\top x) \in [0, 1], \quad g(z) = \frac{1}{1 + \exp(-z)},$$
where $g(\cdot)$ is called the sigmoid function or logistic function (shown below)
[Figure: the sigmoid function $g(z) = 1/(1 + \exp(-z))$ plotted for $z \in [-10, 10]$; it increases monotonically from 0 to 1, with $g(0) = 0.5$.]
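In code, the logistic function is a one-liner; the variant below (an implementation detail added here, not discussed in the slides) avoids overflow in exp for inputs of large magnitude:

import numpy as np

def sigmoid(z):
    """Numerically stable logistic function g(z) = 1 / (1 + exp(-z))."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))          # always <= 1, so exp never overflows
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

The later snippets in this lecture reuse this helper.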
Hypothesis representation
Interpretation of the sigmoid/logistic function:
$f_{w,b}(x)$ = estimated probability that $y = 1$ for input $x$.
For example (plot below), if $f_{w,b}(x) = 0.8$, it means that a patient with tumor size $x$ has an 80% chance of the tumor being malignant. In this task, a larger tumor size gives a larger probability of the tumor being malignant.
Thus, we can say that
$$f_{w,b}(x) = P(y = 1|x; w).$$
[Figure: the sigmoid curve, illustrating how the input is mapped to a predicted probability $P(y = 1|x; w)$ in $[0, 1]$.]
Decision boundary
[Figure: the sigmoid function $g(z)$, highlighting that $g(z) \ge 0.5$ exactly when $z \ge 0$.]
In logistic regression, we have
$$f_{w,b}(x) = g(w^\top x + b) = P(y = 1|x; w) \in [0, 1], \quad g(z) = \frac{1}{1 + \exp(-z)}.$$
Suppose that if $f_{w,b}(x) \ge 0.5$, then we predict $y = 1$; if $f_{w,b}(x) < 0.5$, then we predict $y = 0$
Correspondingly, if $w^\top x + b \ge 0$, we predict $y = 1$; if $w^\top x + b < 0$, then we predict $y = 0$.
It determines the decision boundary, which is the curve/hyperplane corresponding to $f_{w,b}(x) = 0.5$, or $w^\top x + b = 0$
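Because $g(z) \ge 0.5$ exactly when $z \ge 0$, the label can be predicted from the sign of the linear score alone. A small sketch (added for illustration; X is assumed to hold one example per row):

import numpy as np

def predict_label(X, w, b):
    """Logistic-regression decision rule: y = 1 iff f(x) >= 0.5, i.e. w^T x + b >= 0."""
    return (X @ w + b >= 0).astype(int)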
Decision boundary
$f_{w,b}(x) = g(b + w_1 x_1 + w_2 x_2) = g(-3 + x_1 + x_2)$
Predict $y = 1$ if $-3 + x_1 + x_2 \ge 0$, i.e., the decision boundary is the line $x_1 + x_2 = 3$
Decision boundary
Figure: Non-linear decision boundary
$f_{w,b}(x) = g(b + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2) = g(-1 + x_1^2 + x_2^2)$
Predict $y = 1$ if $-1 + x_1^2 + x_2^2 \ge 0$, i.e., the decision boundary is the circle $x_1^2 + x_2^2 = 1$
Cost function
Training set: $m$ training examples $\{(x_i, y_i)\}_{i=1}^m$
Hypothesis function: $f_{w,b}(x) = g(w^\top x + b) = \frac{1}{1 + \exp(-w^\top x - b)}$
Cost function:
Linear regression: $J(w) = \frac{1}{2m}\sum_{i=1}^m (f_{w,b}(x_i) - y_i)^2 = \frac{1}{2m}\|Xw - y\|^2$, which is called the $\ell_2$ loss or residual sum of squares
It is convex w.r.t. $w$ for linear regression
Logistic regression: if we adopt the same cost function for logistic regression, we have
$$J(w) = \frac{1}{2m}\sum_{i=1}^m (g(w^\top x_i) - y_i)^2.$$
However, it is non-convex w.r.t. $w$.
Exercise 1: Prove the $\ell_2$ loss is convex w.r.t. $w$ for linear regression.
Exercise 2: Prove the $\ell_2$ loss is non-convex w.r.t. $w$ for logistic regression.
Cost function
Cross-entropy:
$$H(p, q) = -\int_x p(x)\log(q(x))\,dx \quad \text{or} \quad -\sum_x p(x)\log(q(x)),$$
where $p(x), q(x)$ are probability density functions (PDFs) of $x$ if $x$ is a continuous random variable, or probability mass functions if $x$ is a discrete random variable
We set
ground-truth posterior probability: $y(x) = P(y = 1|x)$,
predicted posterior probability: $f_{w,b}(x) = P(y = 1|x; w)$.
Cross-entropy loss:
$$\mathrm{cost}\big(y(x), f_{w,b}(x)\big) = H\big(y(x), f_{w,b}(x)\big) = -P(y = 1|x)\log P(y = 1|x; w) - P(y = 0|x)\log P(y = 0|x; w) = \begin{cases} -\log(f_{w,b}(x)), & \text{if } y(x) = 1 \\ -\log(1 - f_{w,b}(x)), & \text{if } y(x) = 0 \end{cases}$$
Cost function for logistic regression
Cross-entropy loss:
$$\mathrm{cost}\big(y(x), f_{w,b}(x)\big) = \begin{cases} -\log(f_{w,b}(x)), & \text{if } y(x) = 1 \\ -\log(1 - f_{w,b}(x)), & \text{if } y(x) = 0 \end{cases}$$
For $y = 1$: if $f_{w,b}(x) = 1$, i.e., $P(y = 1|x; w) = 1$, then the prediction equals the ground-truth label and the cost is 0.
For $y = 1$: if $f_{w,b}(x) \to 0$, i.e., $P(y = 1|x; w) \to 0$, then it should be penalized with a very large cost. Here we have $\mathrm{cost}(y(x), f_{w,b}(x)) \to \infty$.
Cost function for logistic regression
Cross-entropy loss:
$$\mathrm{cost}\big(y(x), f_{w,b}(x)\big) = \begin{cases} -\log(f_{w,b}(x)), & \text{if } y(x) = 1 \\ -\log(1 - f_{w,b}(x)), & \text{if } y(x) = 0 \end{cases}$$
For $y = 0$: if $f_{w,b}(x) = 0$, i.e., $P(y = 1|x; w) = 0$, then the prediction equals the ground-truth label and the cost is 0
For $y = 0$: if $f_{w,b}(x) \to 1$, i.e., $P(y = 1|x; w) \to 1$, then it should be penalized with a very large cost. Here we have $\mathrm{cost}(y(x), f_{w,b}(x)) \to \infty$
Cost function for logistic regression
Cross-entropy loss:
$$\mathrm{cost}\big(y(x), f_{w,b}(x)\big) = \begin{cases} -\log(f_{w,b}(x)), & \text{if } y(x) = 1 \\ -\log(1 - f_{w,b}(x)), & \text{if } y(x) = 0 \end{cases}$$
Exercise: Which statements are true?
If $f_{w,b}(x) = y$, then $\mathrm{cost}(y(x), f_{w,b}(x)) = 0$ for both $y = 0$ and $y = 1$
If $y = 0$, then $\mathrm{cost}(y(x), f_{w,b}(x)) \to \infty$ as $f_{w,b}(x) \to 1$
If $y = 0$, then $\mathrm{cost}(y(x), f_{w,b}(x)) \to \infty$ as $f_{w,b}(x) \to 0$
Regardless of whether $y = 0$ or $y = 1$, if $f_{w,b}(x) = 0.5$, then $\mathrm{cost}(y(x), f_{w,b}(x)) > 0$
Cost function of logistic regression
The cost function of logistic regression is
$$J(w) = \frac{1}{m}\sum_{i=1}^m \mathrm{cost}\big(y_i, f_{w,b}(x_i)\big), \quad \mathrm{cost}\big(y(x), f_{w,b}(x)\big) = \begin{cases} -\log(f_{w,b}(x)), & \text{if } y(x) = 1 \\ -\log(1 - f_{w,b}(x)), & \text{if } y(x) = 0 \end{cases}$$
The above cost function can be simplified as follows:
$$J(w) = -\frac{1}{m}\sum_{i=1}^m \Big[ y_i\log(f_{w,b}(x_i)) + (1 - y_i)\log(1 - f_{w,b}(x_i)) \Big].$$
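A direct numpy translation of this cost (a sketch added for illustration; it reuses the sigmoid helper defined earlier, and the small eps clip is only there to avoid evaluating log(0)):

import numpy as np

def logistic_cost(w, b, X, y, eps=1e-12):
    """Cross-entropy cost J(w) = -(1/m) * sum_i [y_i log f_i + (1 - y_i) log(1 - f_i)]."""
    f = sigmoid(X @ w + b)          # predicted probabilities P(y = 1 | x_i; w)
    f = np.clip(f, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(y * np.log(f) + (1.0 - y) * np.log(1.0 - f))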
Exercise: Please prove that J(w) is convex w.r.t. w.
Gradient descent for logistic regression
Learn $w$ by minimizing $J(w)$, i.e.,
$$w^* = \arg\min_w J(w) = \arg\min_w -\frac{1}{m}\sum_{i=1}^m \Big[ y_i\log(f_{w,b}(x_i)) + (1 - y_i)\log(1 - f_{w,b}(x_i)) \Big].$$
Gradient descent: repeat the following update until convergence
$$w \leftarrow w - \alpha\nabla_w J(w), \quad \nabla_w J(w) = \frac{1}{m}\sum_{i=1}^m \big[f_{w,b}(x_i) - y_i\big]\,x_i$$
How to define convergence? Compute the change of $J(w)$ or $w$ over the last $K$ steps; if the change is below a threshold, the algorithm can be regarded as converged. Remember that choosing a suitable learning rate $\alpha$ is important for reaching a good solution.
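Putting the update rule and the convergence check into code (a minimal sketch reusing the sigmoid and logistic_cost helpers above; the learning rate, iteration cap, and tolerance are illustrative choices):

import numpy as np

def logistic_gd(X, y, alpha=0.1, n_iters=5000, tol=1e-7):
    """Fit logistic regression by gradient descent on the cross-entropy cost."""
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    prev = np.inf
    for _ in range(n_iters):
        f = sigmoid(X @ w + b)
        w -= alpha * X.T @ (f - y) / m   # grad_w J = (1/m) sum_i (f_i - y_i) x_i
        b -= alpha * np.mean(f - y)      # same form for the bias term
        cur = logistic_cost(w, b, X, y)
        if abs(prev - cur) < tol:        # simple convergence check on J(w)
            break
        prev = cur
    return w, b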
Gradient descent for logistic regression
Exercise: Suppose you are running a logistic regression model and you monitor the learning procedure to find a suitable learning rate $\alpha$. Which of the following is a reasonable way to make sure $\alpha$ is set properly and that gradient descent is running correctly?
Plot $J(w) = -\frac{1}{m}\sum_{i=1}^m (y_i - f_{w,b}(x_i))^2$ as a function of the number of iterations (i.e., the horizontal axis is the iteration number) and make sure $J(w)$ is decreasing on every iteration.
Plot $J(w) = -\frac{1}{m}\sum_{i=1}^m \big[ y_i\log(f_{w,b}(x_i)) + (1 - y_i)\log(1 - f_{w,b}(x_i)) \big]$ as a function of the number of iterations (i.e., the horizontal axis is the iteration number) and make sure $J(w)$ is decreasing on every iteration.
Plot $J(w)$ as a function of $w$ and make sure it is decreasing on every iteration.
Plot $J(w)$ as a function of $w$ and make sure it is convex.
Multi-class classification
Binary classification: in the above examples and derivations, we only considered the binary classification problem, i.e., $y \in \{0, 1\}$.
Multi-class/multi-category classification: however, many practical problems involve multi-category outputs, i.e., $y \in \{1, \ldots, C\}$:
Weather forecast: sunny, cloudy, rain, snow
Email tagging: work, friends, families, hobby
Multi-class classification: one-vs-all
One-vs-all logistic regression:
Train a binary logistic regression $f_{w_j,b_j}(\cdot)$ for each class $j$, treating all samples of the other classes as the negative class
For a new test sample $x$, predict its class as $\arg\max_j f_{w_j,b_j}(x)$
Pros: easy to implement
Cons: the training cost is high, and the approach is difficult to scale to tasks with a large number of classes
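A sketch of the one-vs-all scheme built on the binary trainer from the previous section (added for illustration; logistic_gd and sigmoid are the helpers sketched earlier, and labels are assumed to be integers 0, ..., C-1):

import numpy as np

def one_vs_all_fit(X, y, n_classes):
    """Train one binary logistic regression per class (class j vs. the rest)."""
    return [logistic_gd(X, (y == j).astype(float)) for j in range(n_classes)]

def one_vs_all_predict(X, models):
    """Predict the class whose binary classifier outputs the largest probability."""
    scores = np.column_stack([sigmoid(X @ w + b) for (w, b) in models])
    return np.argmax(scores, axis=1)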
Multi-class classification: Softmax regression
Softmax function:
$$f^{(j)}_{W,b}(x) = \frac{\exp(w_j^\top x + b_j)}{\sum_{c=1}^C \exp(w_c^\top x + b_c)} = P(y = j|x; W, b), \quad (12)$$
where $W = [w_1, \ldots, w_C]$, $b = [b_1; b_2; \ldots; b_C]$, with $C$ being the number of classes. For simplicity, in the following we write $f^{(j)}_{W,b}(\cdot)$ as $f_{w_j,b_j}(\cdot)$
Cost function:
$$J(W) = -\frac{1}{m}\sum_{i=1}^m\sum_{j=1}^C \mathbb{I}(y_i = j)\log\big(f_{w_j,b_j}(x_i)\big), \quad (13)$$
where $\mathbb{I}(a) = 1$ if $a$ is true, otherwise $\mathbb{I}(a) = 0$.
Multi-class classification: Softmax regression
It can also be optimized by gradient descent:
$$w_j \leftarrow w_j - \alpha\frac{\partial J(W)}{\partial w_j},$$
$$\frac{\partial J(W)}{\partial w_j} = -\frac{1}{m}\sum_{i=1}^m \Bigg[\frac{\mathbb{I}(y_i = j)}{f_{w_j,b_j}(x_i)}\cdot\frac{\partial f_{w_j,b_j}(x_i)}{\partial w_j} + \sum_{c \ne j} \frac{\mathbb{I}(y_i = c)}{f_{w_c,b_c}(x_i)}\cdot\frac{\partial f_{w_c,b_c}(x_i)}{\partial w_j}\Bigg]$$
$$\frac{\partial f_{w_j,b_j}(x_i)}{\partial w_j} = f_{w_j,b_j}(x_i)\cdot\big(1 - f_{w_j,b_j}(x_i)\big)\cdot x_i, \qquad \frac{\partial f_{w_c,b_c}(x_i)}{\partial w_j} = -f_{w_j,b_j}(x_i)\cdot f_{w_c,b_c}(x_i)\cdot x_i \;\;(c \ne j)$$
$$\Longrightarrow \frac{\partial J(W)}{\partial w_j} = \frac{1}{m}\sum_{i=1}^m \big(f_{w_j,b_j}(x_i) - \mathbb{I}(y_i = j)\big)\,x_i \quad (14)$$
Note: $\{w_c\}_{c=1}^C$ should be updated in parallel, rather than sequentially.
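A compact sketch of softmax regression trained with Eq. (14) (added for illustration; all classes are updated in parallel as the note requires; the learning rate and iteration count are arbitrary, and labels are assumed to be 0-based integers rather than the 1-based labels used above):

import numpy as np

def softmax_gd(X, y, n_classes, alpha=0.1, n_iters=3000):
    """Softmax regression: one weight column and one bias per class."""
    m, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                         # one-hot labels, shape (m, C)
    for _ in range(n_iters):
        scores = X @ W + b                           # shape (m, C)
        scores -= scores.max(axis=1, keepdims=True)  # shift for numerical stability
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)            # P[i, j] = f_{w_j, b_j}(x_i)
        G = (P - Y) / m                              # Eq. (14) for all classes at once
        W -= alpha * X.T @ G                         # parallel update of all w_j
        b -= alpha * G.sum(axis=0)
    return W, b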
Overfitting in linear regression
Overfitting: If we have too many features, the learned hypothesis may fit the
training data very well (low bias), but fail to generalize to new examples.
Overfitting in logistic regression
[Figure: three logistic-regression decision boundaries illustrating under-fitting, good fitting, and over-fitting.]
Addressing Overfitting
Generally, there are two approaches to address the overfitting problem:
Reducing the number of features:
Feature selection
Dimensionality reduction (introduced in later lectures)
Regularization:
Keep all features, but reduce the magnitude/value of each parameter, so that each feature contributes only a little to predicting y
In the following, we will focus on the regularization-based approach.
Regularized logistic regression
The objective function of regularized logistic regression is formulated as follows:
$$\bar{J}(w) = J(w) + \frac{\lambda}{2m}\sum_{j=1}^d w_j^2 = -\frac{1}{m}\sum_{i=1}^m \Big[ y_i\log(f_{w,b}(x_i)) + (1 - y_i)\log(1 - f_{w,b}(x_i)) \Big] + \frac{\lambda}{2m}\sum_{j=1}^d w_j^2.$$
Note: the bias parameter $w_0$ (or $b$) is not regularized/penalized.
The above objective function can also be solved by gradient descent, as follows:
$$w_0 \leftarrow w_0 - \frac{\alpha}{m}\sum_{i=1}^m \big(f_{w,b}(x_i) - y_i\big)\cdot x_i(0), \quad \text{where } x_i(0) = 1, \;\forall i$$
$$w_j \leftarrow w_j - \frac{\alpha}{m}\Big[\sum_{i=1}^m \big(f_{w,b}(x_i) - y_i\big)\cdot x_i(j) + \lambda\cdot w_j\Big], \quad j = 1, \ldots, d,$$
where $x_i(j)$ denotes the $j$-th entry of the augmented $x_i$, $j = 0, \ldots, d$.
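These two update rules translate directly into code (a sketch added for illustration; it reuses the sigmoid helper from earlier, leaves the bias unpenalized, and uses placeholder values for lam, alpha, and the iteration count):

import numpy as np

def regularized_logistic_gd(X, y, lam=1.0, alpha=0.1, n_iters=5000):
    """Gradient descent on the L2-regularized cross-entropy cost (bias not penalized)."""
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        f = sigmoid(X @ w + b)
        w -= alpha / m * (X.T @ (f - y) + lam * w)  # regularized update for w_1, ..., w_d
        b -= alpha / m * np.sum(f - y)              # plain update for the bias term w_0
    return w, b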
Regularized logistic regression
Exercise: When using regularized logistic regression, which of these is the best way to monitor whether gradient descent is working correctly?
Plot $J(w)$ as a function of the number of iterations and make sure it's decreasing
Plot $J(w) - \frac{\lambda}{2m}\sum_{j=1}^d w_j^2$ as a function of the number of iterations and make sure it's decreasing
Plot $J(w) + \frac{\lambda}{2m}\sum_{j=1}^d w_j^2$ as a function of the number of iterations and make sure it's decreasing
Plot $\sum_{j=1}^d w_j^2$ as a function of the number of iterations and make sure it's decreasing
Logistic regression: probabilistic modeling
Behind logistic regression for binary classification, we assume that both the feature $x$ and the label $y$ are random variables, as follows:
$$\mu(x|w) = \mathrm{Sigmoid}(w^\top x), \qquad y(x|w) \sim \mathrm{Bernoulli}\big(\mu(x|w)\big).$$
Then, we have
$$P(y|x; w) = \begin{cases} \mu & \text{if } y = 1, \\ 1 - \mu & \text{if } y = 0. \end{cases}$$
The log-likelihood function of $P(y|x; w)$ is formulated as
$$L(w) = y\log(\mu) + (1 - y)\log(1 - \mu).$$
Thus, we obtain
$$\max_w L(w) \equiv \min_w J(w).$$
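To make the last equivalence explicit (a short derivation added here; $L_i(w)$ denotes the log-likelihood of the $i$-th training example), summing over the $m$ training examples gives exactly $-m\,J(w)$:
$$\sum_{i=1}^m L_i(w) = \sum_{i=1}^m \Big[ y_i\log\mu(x_i|w) + (1 - y_i)\log\big(1 - \mu(x_i|w)\big)\Big] = -m\,J(w),$$
so $\arg\max_w \sum_{i=1}^m L_i(w) = \arg\min_w J(w)$.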
Logistic regression: probabilistic modeling
Behind logistic regression, we assume that
$$\mu(x|w) = \mathrm{Sigmoid}(w^\top x), \qquad y(x|w) \sim \mathrm{Bernoulli}\big(\mu(x|w)\big).$$
$\ell_2$-regularized logistic regression: we further assume $w \sim \mathcal{N}(w|0, \sigma^2 I)$, then we have
$$\max_w L(w) + \log \mathcal{N}(w|0, \sigma^2 I) \equiv \min_w J(w) + \frac{\lambda}{2m}\sum_{j=1}^d w_j^2.$$
$\ell_1$-regularized logistic regression: if we assume $w \sim \mathrm{Laplace}(w|0, b)$, then we have
$$\max_w L(w) + \log \mathrm{Laplace}(w|0, b) \equiv \min_w J(w) + \frac{\lambda}{2m}\sum_{j=1}^d |w_j|.$$
Summary: linear regression vs. logistic regression
Linear regression vs. logistic regression:
Task: regression (linear regression) vs. classification (logistic regression)
Hypothesis $f_{w,b}(x)$: $w^\top x + b \in (-\infty, \infty)$ vs. $g(w^\top x + b) \in [0, 1]$
Objective $J(w)$: $\frac{1}{2m}\sum_{i=1}^m (y_i - w^\top x_i)^2$ vs. $-\frac{1}{m}\sum_{i=1}^m \big[ y_i\log(f_{w,b}(x_i)) + (1 - y_i)\log(1 - f_{w,b}(x_i)) \big]$
Solution: closed-form or gradient descent vs. gradient descent
Note: each variant of linear/logistic regression can be derived from both the deterministic and the probabilistic perspectives.
Own reading: both linear regression and logistic regression are special cases of generalized linear models. If interested, you can find more details in Section 4 of the book "Pattern Recognition and Machine Learning", Bishop, 2006.