Ben-Gurion University - School of Electrical and Computer Engineering - 361-1-3040
Lecture 2: Linear and Logistic Regression
Fall 2024/5
Lecturer: Nir Shlezinger and Asaf Cohen
We now start the part of the course that is dedicated to machine learning algorithms, and particularly to different forms of inductive biases and the corresponding learning algorithms. We will
dedicate the next three lectures to classic machine learning models, referring to techniques that
constitute the foundations of machine learning, after which we will proceed to the more recent
deep learning techniques. Today’s lecture and the next will deal with basic supervised learning
techniques, with the current lecture focusing on linear models, and is based mostly on [1, Ch. 9].
1 Linear Model
Recall that we are considering parametric families of machine learning models, denoted Fθ . For
linear models, Fθ encompasses the set of affine mappings from RN to S. Many learning algorithms
that are widely used in practice rely on linear predictors, first and foremost because of the
ability to learn them efficiently in many cases. In addition, linear predictors are intuitive, are easy
to interpret, and fit the data reasonably well in many natural learning problems.
Definition 1.1 (Linear Model). A linear model is a mapping fθ : RN → S which can be written
as
fθ (x) = σ(wT x + b), (1)
where σ : R → S is a fixed mapping. The parameters of the model in (1) are given by θ = [w, b].
Note that the affine transformation in (1) can also be written as a linear transformation applied
to x′ = [x, 1], as (1) can be written as
σ(wT x + b) = σ(θ T x′ ). (2)
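As a quick numerical sanity check of the equivalence in (2), the following sketch evaluates both the affine form and the augmented linear form; the values of w, b, and x are arbitrary illustrative numbers, not taken from the lecture:

```python
import numpy as np

# Arbitrary illustrative parameters and input (not from the lecture)
w = np.array([1.0, -2.0])
b = 0.5
x = np.array([3.0, 1.0])

affine = w @ x + b            # sigma's argument in (1): w^T x + b
x_aug = np.append(x, 1.0)     # augmented input x' = [x, 1]
theta = np.append(w, b)       # stacked parameters theta = [w, b]
linear = theta @ x_aug        # sigma's argument in (2): theta^T x'
print(affine, linear)         # both equal 1.5
```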
2 Linear Regression
The first task we consider in the context of linear models is regression, which can be viewed as
estimating a continuous-valued variable. Specifically, the label takes values in S = R, and thus
σ(·) is the identity mapping.
2.1 Empirical Risk
The loss used is the ℓ2 loss, i.e.,
l(f, x, s) = (f (x) − s)2 . (3)
Given a labeled dataset D = {(xt , st )}_{t=1}^{nt} , the empirical risk is thus given by

LD (θ) = (1/nt ) Σ_{t=1}^{nt} (fθ (xt ) − st )²
= (1/nt ) Σ_{t=1}^{nt} (θ T x′t − st )². (4)
2.2 Least Squares
Recall that the learning task aims at finding the empirical risk minimizer, i.e.,
θ⋆ = arg min_θ LD (θ). (5)
For linear regression, where the empirical risk is given by (4), the empirical risk minimizer is in
fact given in closed form using the least squares solution.
To formulate the least squares solution, we define the matrix
X ≜ [x′1 , . . . , x′nt ]T , (6)
and the vector
s ≜ [s1 , . . . , snt ]T . (7)
Using these notations, we can write the empirical risk (4) as
LD (θ) = (1/nt ) (Xθ − s)T (Xθ − s). (8)
We can now set the derivative of the empirical risk to zero, i.e.,

0 = ∇θ LD (θ)|θ=θ⋆ = (1/nt ) ∇θ (Xθ − s)T (Xθ − s)|θ=θ⋆
= (2/nt ) X T (Xθ⋆ − s). (9)
Rearranging the above equality, we obtain that

X T Xθ⋆ = X T s ⇒ θ⋆ = (X T X)−1 X T s. (10)
The closed-form solution to the empirical risk minimizer obtained in (10) is known as the least
squares solution, and dates back as early as the works of Gauss and Legendre. It is one of the rare
cases in machine learning where an empirical risk minimizer can be found in closed form. Almost
in every other case we would have to resort to some iterative optimization technique, as we will do
in the following.
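As a sketch of (10) in code, the least squares solution can be computed directly with NumPy; the toy data and the assumed ground-truth line s = 2x + 1 are illustrative, and `np.linalg.lstsq` solves the same normal equations in a numerically stabler way than forming (X T X)−1 explicitly:

```python
import numpy as np

# Toy data drawn around an assumed ground-truth affine relation s = 2*x + 1
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(50, 1))
s = 2.0 * x[:, 0] + 1.0 + 0.01 * rng.standard_normal(50)

# Build X from the augmented inputs x' = [x, 1], as in (6)
X = np.hstack([x, np.ones((50, 1))])

# Least squares solution (10): theta* = (X^T X)^{-1} X^T s
theta_star, *_ = np.linalg.lstsq(X, s, rcond=None)
print(theta_star)  # close to [2.0, 1.0]
```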
Figure 1: Illustration of matching a dataset with two-dimensional inputs using a linear classifier (a), and the fact that multiple linear classifiers match the same data (b).
3 Linear Classification
We proceed to consider the classification task, where S = {−1, 1}. To force the model to produce
outputs in S, we set σ(·) to be the sign function, namely, a linear classifier is given by
fθ (x) = sign(θ T x′ ) = sign(wT x + b). (11)
Obviously, the linear classifier (11) operates by dividing the input space RN into two halfspaces.
Mappings of the form (11) actually serve as the historical basis for neural networks and perceptron
models, which will be discussed later in the course.
3.1 Learning a Linear Classifier
Consider a dataset D = {(xt , st )}_{t=1}^{nt} . Assuming that the data points can indeed be divided into
halfspaces, we would like to have
halfspaces, we would like to have
st = sign(wT xt + b), ∀t ∈ 1, . . . , nt , (12)
which is equivalent to having
st (wT xt + b) > 0 ∀t ∈ 1, . . . , nt . (13)
However, as illustrated in Fig. 1, there may be multiple linear classifiers satisfying condition
(13). How do we select which one to use? A possible solution is based on support vector machines
(SVMs).
3.2 SVMs
SVMs aim at finding linear classifiers by seeking to maximize their margins, i.e., the minimal
distance between the data points and the dividing hyperplane. Recall that the linear classifier is
based on a boundary hyperplane, given by the set of points
Hθ = {p ∈ RN : wT p + b = 0}. (14)
To formulate the SVM objective, we need to quantify the distance of a point x from the hyper-
plane Hθ , which is defined as the distance from the closest point in the hyperplane, i.e.,
d(x; Hθ ) = min_{v∈Hθ} ∥x − v∥. (15)
For our linear formulation of the hyperplane, the distance is given in closed form, as stated in the
following lemma:
Lemma 3.1. The distance between a point x and Hθ is
d(x; Hθ ) = |wT x + b| / ∥w∥. (16)
Proof. Let us first consider the point v = x − ((wT x + b)/∥w∥²) w. We note that v ∈ Hθ since
wT v + b = wT x − (wT x + b) + b = 0. (17)
Furthermore, we have that
∥x − v∥ = (|wT x + b|/∥w∥²) ∥w∥ = |wT x + b| / ∥w∥. (18)
Finally, we have that v is the closest point to x in the hyperplane since for every u ∈ Hθ it holds
that
∥x − u∥² = ∥x − v + (v − u)∥²
= ∥x − v∥² + ∥v − u∥² + 2(x − v)T (v − u)
≥ ∥x − v∥² + 2(x − v)T (v − u)
(a)= ∥x − v∥² + 2 ((wT x + b)/∥w∥²) wT (v − u)
(b)= ∥x − v∥², (19)
where (a) follows from the definition of v, and (b) stems from the fact that u, v ∈ Hθ , and thus
wT u = wT v = −b. Equation (19) proves the lemma.
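Lemma 3.1 can also be checked numerically. The sketch below, using arbitrary illustrative numbers, verifies that the projection v from the proof lies on Hθ and that ∥x − v∥ matches the closed-form distance (16):

```python
import numpy as np

# Hyperplane w^T p + b = 0 and a test point (arbitrary illustrative numbers)
w = np.array([3.0, 4.0])
b = -5.0
x = np.array([4.0, 3.0])

# Closed-form distance from Lemma 3.1: |w^T x + b| / ||w||
d = abs(w @ x + b) / np.linalg.norm(w)

# The projection v = x - ((w^T x + b)/||w||^2) w used in the proof
v = x - ((w @ x + b) / (w @ w)) * w

print(d)                     # closed-form distance, here 3.8
print(w @ v + b)             # 0: v indeed lies on the hyperplane
print(np.linalg.norm(x - v)) # equals d
```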
Figure 2: SVM illustration, with r = |wT x + b| / ∥w∥.
Using Lemma 3.1, we can formulate the task of finding the linear classifier with the maximal
margin as
arg max_{w,b} min_{t=1,...,nt} |wT xt + b| / ∥w∥, subject to st (wT xt + b) > 0 ∀t ∈ 1, . . . , nt . (20)
Assuming that there is a set of linear classifiers satisfying these constraints, it holds that
st (wT xt + b) = |wT xt + b|, and thus we can write (20) as

arg max_{w,b} min_{t=1,...,nt} st (wT xt + b) / ∥w∥. (21)
Hard SVM reformulates (21) by defining
δ ≜ min_{t=1,...,nt} st (wT xt + b), (22)
using which (21) can be written as
arg max_{w,b} δ/∥w∥ subject to st (wT xt + b) ≥ δ ∀t ∈ 1, . . . , nt . (23)
An illustration is depicted in Fig. 2.
By replacing the maximization term in (23) with minimizing the inverse and writing w̃ = (1/δ)w,
b̃ = (1/δ)b, we obtain the hard SVM problem
arg min kw̃k subject to st (w̃T xt + b̃) ≥ 1 ∀t ∈ 1, . . . , nt . (24)
w̃,b̃
Note that the 1/δ factor does not affect the decision rule due to the sign operation, as

sign(wT x + b) ≡ sign((1/δ) wT x + (1/δ) b).
The motivation for formulating the hard SVM learning problem via (24) is that it is given
by a convex quadratic optimization problem, i.e., an optimization problem whose objective is
quadratic in the optimization variables and the constraints are linear functions of the variables.
Such problems are simple to solve using iterative optimization methods.
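As an illustration, the hard SVM problem (24) can be handed to a generic constrained-optimization solver. The sketch below uses `scipy.optimize.minimize` with the SLSQP method on toy separable data; the data, the solver choice, and the tolerance are all illustrative assumptions, not part of the lecture:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data with labels in {-1, +1} (illustrative)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
s = np.array([1.0, 1.0, -1.0, -1.0])

# Hard SVM objective (24): minimize ||w||, here via the equivalent 0.5*||w||^2
def objective(theta):
    w = theta[:-1]
    return 0.5 * w @ w

# One margin constraint s_t (w^T x_t + b) >= 1 per sample
constraints = [
    {"type": "ineq",
     "fun": lambda theta, x=x, st=st: st * (theta[:-1] @ x + theta[-1]) - 1.0}
    for x, st in zip(X, s)
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print("w =", w, "b =", b)
print(np.all(s * (X @ w + b) >= 1 - 1e-4))  # all margin constraints hold
```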
4 Logistic Regression
Logistic regression deals with the problem of using linear models for classification by having them
produce probabilistic estimates. We again focus on binary classification (detection), while writing
S = {0, 1}. However, instead of having our model produce outputs in S, we want it to produce
outputs in [0, 1], which are used as an estimate of the probability mass function, i.e.,
fθ (x) ≈ P(s = 1|x). (25)
To obtain outputs in the desired range [0, 1], we employ the sigmoid function, setting
σ(z) = 1/(1 + e−z ) (26)
in (1). Note that this function is monotonically increasing, with its limits at −∞ and ∞ being 0
and 1, respectively.
4.1 From Maximum Likelihood to Binary Cross Entropy
Again, we are given a data set D = {xt , st }, assumed to be drawn in an i.i.d. manner from the
true data generating distribution P. However, since our model is trying to learn a distribution
function, and we only have data points, the question is how to formulate a loss function that indeed
encourages the model to approach the true conditional distribution?
In our first lecture we showed that the cross entropy loss (when used to compute true risk) is
minimized when the learned conditional distribution (dictated by the model) matches the true one
(dictated by P). Here, we will provide an alternative derivation for the binary classification case,
which considers its usage with the empirical risk, i.e., without assuming an asymptotic regime
where the empirical risk approaches the true risk.
A natural way to assess whether a set of samples was drawn from a given distribution is to
compute their likelihood, i.e.,
P(s1 , . . . , snt |x1 , . . . , xnt ) (a)= Π_{i=1}^{nt} P(s = si |x = xi )
= Π_{(si ,xi )∈D:si =1} P(s = 1|xi ) · Π_{(sj ,xj )∈D:sj =0} (1 − P(s = 1|xj )) , (27)
where (a) follows from the assumption that the data samples are drawn i.i.d. We can now set
the model parameters θ to be the ones maximizing the likelihood using (25), where we take the
(monotonically increasing) logarithm of the likelihood to convert the product into a sum, i.e., we wish
to maximize
log [ Π_{(si ,xi )∈D:si =1} fθ (xi ) · Π_{(sj ,xj )∈D:sj =0} (1 − fθ (xj )) ]
= Σ_{(si ,xi )∈D:si =1} log fθ (xi ) + Σ_{(sj ,xj )∈D:sj =0} log(1 − fθ (xj ))
= Σ_{t=1}^{nt} st log fθ (xt ) + (1 − st ) log(1 − fθ (xt )). (28)
Note that (28) denotes an objective which is to be maximized. Negating it to convert it into a loss
function, and scaling by 1/nt (which does not affect the minimizer of the empirical risk), we obtain
exactly the empirical risk with the binary cross entropy loss, namely

LD (θ) = (1/nt ) Σ_{t=1}^{nt} −st log fθ (xt ) − (1 − st ) log(1 − fθ (xt )). (29)
Next, we substitute the expression for our logistic regressor, which is
fθ (x) = 1/(1 + e−θ T x′ ), 1 − fθ (x) = 1/(1 + eθ T x′ ). (30)
Substituting (30) into (29) leads to
LD (θ) = (1/nt ) Σ_{t=1}^{nt} st log(1 + e−θ T x′t ) + (1 − st ) log(1 + eθ T x′t )
= (1/nt ) Σ_{t=1}^{nt} log(1 + e−s̃t ·θ T x′t ), (31)
where s̃t is a representation of st in {−1, 1} instead of {0, 1}, i.e., s̃t = 2st − 1. While the
empirical risk in (31) does not admit a closed-form solution, it is convex in θ, and therefore
one can find θ⋆ = arg min_θ LD (θ) by iterative optimization, and particularly based on first-order
(gradient) methods. We will discuss such optimizers in much more detail in future lectures.
References
[1] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.