Ben-Gurion University - School of Electrical and Computer Engineering - 361-1-3040
Lecture 2: Linear and Logistic Regression
Fall 2024/5
Lecturer: Nir Shlezinger and Asaf Cohen
We now start the part of the course that is dedicated to machine learning algorithms, and particularly to different forms of inductive biases and the corresponding learning algorithms. We will
dedicate the next three lectures to classic machine learning models, referring to techniques that
constitute the foundations of machine learning, after which we will proceed to the more recent
deep learning techniques. Today’s lecture and the next will deal with basic supervised learning
techniques, with the current lecture focusing on linear models, and is based mostly on [1, Ch. 9].
1 Linear Model
Recall that we are considering parametric families of machine learning models, denoted Fθ . For
linear models, Fθ encompasses the set of affine mappings from RN to S. Many learning algorithms
that are widely used in practice rely on linear predictors, first and foremost because of the
ability to learn them efficiently in many cases. In addition, linear predictors are intuitive, are easy
to interpret, and fit the data reasonably well in many natural learning problems.
Definition 1.1 (Linear Model). A linear model is a mapping fθ : RN → S which can be written
as
fθ (x) = σ(wT x + b), (1)
where σ : R → S is a fixed mapping. The parameters of the model in (1) are given by θ = [w, b].
Note that the affine transformation in (1) can also be written as a linear transformation applied
to x′ = [x, 1], as (1) can be written as
σ(wT x + b) = σ(θ T x′ ). (2)
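As a quick numerical sanity check of the equivalence in (2), the following sketch evaluates both the affine form and the augmented linear form; the values of w, b, and x are arbitrary illustrative numbers, not taken from the lecture:

```python
import numpy as np

# Arbitrary illustrative parameters and input (not from the lecture)
w = np.array([1.0, -2.0])
b = 0.5
x = np.array([3.0, 1.0])

affine = w @ x + b            # sigma's argument in (1): w^T x + b
x_aug = np.append(x, 1.0)     # augmented input x' = [x, 1]
theta = np.append(w, b)       # stacked parameters theta = [w, b]
linear = theta @ x_aug        # sigma's argument in (2): theta^T x'
print(affine, linear)         # both equal 1.5
```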
2 Linear Regression
The first task we consider in the context of linear models is regression, which can be viewed as
estimating a continuous-valued variable. Specifically, the label takes values in S = R, and thus
σ(·) is the identity mapping.
2.1 Empirical Risk
The loss used is the ℓ2 loss, i.e.,
l(f, x, s) = (f (x) − s)2 . (3)
Given a labeled dataset D = {(xt , st )}_{t=1}^{nt} , the empirical risk is thus given by

LD (θ) = (1/nt ) Σ_{t=1}^{nt} (fθ (xt ) − st )²
= (1/nt ) Σ_{t=1}^{nt} (θ T x′t − st )². (4)
2.2 Least Squares
Recall that the learning task aims at finding the empirical risk minimizer, i.e.,
θ⋆ = arg min_θ LD (θ). (5)
For linear regression, where the empirical risk is given by (4), the empirical risk minimizer is in
fact given in closed form using the least squares solution.
To formulate the least squares solution, we define the matrix
X ≜ [x′1 , . . . , x′nt ]T , (6)
and the vector
s ≜ [s1 , . . . , snt ]T . (7)
Using these notations, we can write the empirical risk (4) as
LD (θ) = (1/nt ) (Xθ − s)T (Xθ − s). (8)
We can now set the derivative of the empirical risk to zero, i.e.,

0 = ∇θ LD (θ)|θ=θ⋆ = (1/nt ) ∇θ (Xθ − s)T (Xθ − s)|θ=θ⋆
= (2/nt ) X T (Xθ⋆ − s). (9)
Rearranging the above equality, we obtain that

X T Xθ⋆ = X T s ⇒ θ⋆ = (X T X)−1 X T s. (10)
The closed-form solution to the empirical risk minimizer obtained in (10) is known as the least
squares solution, and dates back as early as the works of Gauss and Legendre. It is one of the rare
cases in machine learning where an empirical risk minimizer can be found in closed form. Almost
in every other case we would have to resort to some iterative optimization technique, as we will do
in the following.
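As a sketch of (10) in code, the least squares solution can be computed directly with NumPy; the toy data and the assumed ground-truth line s = 2x + 1 are illustrative, and `np.linalg.lstsq` solves the same normal equations in a numerically stabler way than forming (X T X)−1 explicitly:

```python
import numpy as np

# Toy data drawn around an assumed ground-truth affine relation s = 2*x + 1
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(50, 1))
s = 2.0 * x[:, 0] + 1.0 + 0.01 * rng.standard_normal(50)

# Build X from the augmented inputs x' = [x, 1], as in (6)
X = np.hstack([x, np.ones((50, 1))])

# Least squares solution (10): theta* = (X^T X)^{-1} X^T s
theta_star, *_ = np.linalg.lstsq(X, s, rcond=None)
print(theta_star)  # close to [2.0, 1.0]
```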
Figure 1: Illustration of matching a dataset with two-dimensional inputs using a linear classifier (a), and the fact that multiple linear classifiers match the same data (b).
3 Linear Classification
We proceed to consider the classification task, where S = {−1, 1}. To force the model to produce
outputs in S, we set σ(·) to be the sign function, namely, a linear classifier is given by
fθ (x) = sign(θ T x′ ) = sign(wT x + b). (11)
Obviously, the linear classifier (11) operates by dividing the input space RN into two halfspaces.
Mappings of the form (11) actually serve as the historical basis for neural networks and perceptron
models, which will be discussed later in the course.
3.1 Learning a Linear Classifier
Consider a dataset D = {(xt , st )}_{t=1}^{nt} . Assuming that the data points can indeed be divided into
halfspaces, we would like to have
halfspaces, we would like to have
st = sign(wT xt + b), ∀t ∈ 1, . . . , nt , (12)
which is equivalent to having
st (wT xt + b) > 0 ∀t ∈ 1, . . . , nt . (13)
However, as illustrated in Fig. 1, there may be multiple linear classifiers satisfying condition
(13). How do we select which one to use? A possible solution is based on support vector machines
(SVMs).
3.2 SVMs
SVMs aim at finding linear classifiers by seeking to maximize their margins, i.e., the minimal
distance between the data points and the dividing hyperplane. Recall that the linear classifier is
based on a boundary hyperplane, given by the set of points
Hθ = {p ∈ RN : wT p + b = 0}. (14)
To formulate the SVM objective, we need to quantify the distance of a point x from the hyper-
plane Hθ , which is defined as the distance from the closest point in the hyperplane, i.e.,
d(x; Hθ ) = min_{v∈Hθ} ∥x − v∥. (15)
For our linear formulation of the hyperplane, the distance is given in closed form, as stated in the
following lemma:
Lemma 3.1. The distance between a point x and Hθ is
d(x; Hθ ) = |wT x + b| / ∥w∥. (16)
Proof. Let us first consider the point v = x − ((wT x + b)/∥w∥²) w. We note that v ∈ Hθ since
wT v + b = wT x − (wT x + b) + b = 0. (17)
Furthermore, we have that
∥x − v∥ = (|wT x + b|/∥w∥²) ∥w∥ = |wT x + b| / ∥w∥. (18)
Finally, we have that v is the closest point to x in the hyperplane since for every u ∈ Hθ it holds
that
∥x − u∥² = ∥x − v + (v − u)∥²
= ∥x − v∥² + ∥v − u∥² + 2(x − v)T (v − u)
≥ ∥x − v∥² + 2(x − v)T (v − u)
(a)= ∥x − v∥² + 2 ((wT x + b)/∥w∥²) wT (v − u)
(b)= ∥x − v∥², (19)
where (a) follows from the definition of v, and (b) stems from the fact that u, v ∈ Hθ , and thus
wT u = wT v = −b. Equation (19) proves the lemma.
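Lemma 3.1 can also be checked numerically. The sketch below, using arbitrary illustrative numbers, verifies that the projection v from the proof lies on Hθ and that ∥x − v∥ matches the closed-form distance (16):

```python
import numpy as np

# Hyperplane w^T p + b = 0 and a test point (arbitrary illustrative numbers)
w = np.array([3.0, 4.0])
b = -5.0
x = np.array([4.0, 3.0])

# Closed-form distance from Lemma 3.1: |w^T x + b| / ||w||
d = abs(w @ x + b) / np.linalg.norm(w)

# The projection v = x - ((w^T x + b)/||w||^2) w used in the proof
v = x - ((w @ x + b) / (w @ w)) * w

print(d)                     # closed-form distance, here 3.8
print(w @ v + b)             # 0: v indeed lies on the hyperplane
print(np.linalg.norm(x - v)) # equals d
```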
Figure 2: SVM illustration, with r = |wT x + b| / ∥w∥.
Using Lemma 3.1, we can formulate the task of finding the linear classifier with the maximal
margin as
arg max_{w,b} min_{t=1,...,nt} |wT xt + b| / ∥w∥, subject to st (wT xt + b) > 0 ∀t ∈ 1, . . . , nt . (20)
Assuming that there is a set of linear classifiers satisfying these constraints, it holds that
st (wT xt + b) = |wT xt + b|, and thus we can write (20) as

arg max_{w,b} min_{t=1,...,nt} st (wT xt + b) / ∥w∥. (21)
Hard SVM reformulates (21) by defining
δ ≜ min_{t=1,...,nt} st (wT xt + b), (22)
using which (21) can be written as
arg max_{w,b} δ/∥w∥ subject to st (wT xt + b) ≥ δ ∀t ∈ 1, . . . , nt . (23)
An illustration is depicted in Fig. 2.
By replacing the maximization term in (23) with minimizing the inverse and writing w̃ = (1/δ)w,
b̃ = (1/δ)b, we obtain the hard SVM problem
arg min kw̃k subject to st (w̃T xt + b̃) ≥ 1 ∀t ∈ 1, . . . , nt . (24)
w̃,b̃
Note that the 1/δ factor does not affect the decision rule due to the sign operation, as

sign(wT x + b) ≡ sign((1/δ) wT x + (1/δ) b).
The motivation for formulating the hard SVM learning problem via (24) is that it is given
by a convex quadratic optimization problem, i.e., an optimization problem whose objective is
quadratic in the optimization variables and the constraints are linear functions of the variables.
Such problems are simple to solve using iterative optimization methods.
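As an illustration, the hard SVM problem (24) can be handed to a generic constrained-optimization solver. The sketch below uses `scipy.optimize.minimize` with the SLSQP method on toy separable data; the data, the solver choice, and the tolerance are all illustrative assumptions, not part of the lecture:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data with labels in {-1, +1} (illustrative)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
s = np.array([1.0, 1.0, -1.0, -1.0])

# Hard SVM objective (24): minimize ||w||, here via the equivalent 0.5*||w||^2
def objective(theta):
    w = theta[:-1]
    return 0.5 * w @ w

# One margin constraint s_t (w^T x_t + b) >= 1 per sample
constraints = [
    {"type": "ineq",
     "fun": lambda theta, x=x, st=st: st * (theta[:-1] @ x + theta[-1]) - 1.0}
    for x, st in zip(X, s)
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print("w =", w, "b =", b)
print(np.all(s * (X @ w + b) >= 1 - 1e-4))  # all margin constraints hold
```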
4 Logistic Regression
Logistic regression deals with the problem of using linear models for classification by having them
produce probabilistic estimates. We again focus on binary classification (detection), while writing
S = {0, 1}. However, instead of having our model produce outputs in S, we want it to produce
outputs in [0, 1], which are used as an estimate of the probability mass function, i.e.,
fθ (x) ≈ P(s = 1|x). (25)
To obtain outputs in the desired range [0, 1], we employ the sigmoid function, setting
σ(z) = 1/(1 + e−z ) (26)
in (1). Note that this function is monotonically increasing, with its limits at −∞ and ∞ being 0
and 1, respectively.
4.1 From Maximum Likelihood to Binary Cross Entropy
Again, we are given a data set D = {xt , st }, assumed to be drawn in an i.i.d. manner from the
true data generating distribution P. However, since our model is trying to learn a distribution
function, and we only have data points, the question is how to formulate a loss function that indeed
encourages the model to approach the true conditional distribution?
In our first lecture we showed that the cross entropy loss (when used to compute true risk) is
minimized when the learned conditional distribution (dictated by the model) matches the true one
(dictated by P). Here, we will provide an alternative derivation for the binary classification case,
which considers its usage with the empirical risk, i.e., without assuming an asymptotic regime
where the empirical risk approaches the true risk.
A natural way to assess whether a set of samples was drawn from a given distribution is to
compute their likelihood, i.e.,
P(s1 , . . . , snt |x1 , . . . , xnt ) (a)= Π_{i=1}^{nt} P(s = si |x = xi )
= Π_{(si ,xi )∈D:si =1} P(s = 1|xi ) · Π_{(sj ,xj )∈D:sj =0} (1 − P(s = 1|xj )) , (27)
where (a) follows from the assumption that the data samples are drawn i.i.d. We can now set
the model parameters θ to be the ones maximizing the likelihood using (25), where we take the
(monotonically increasing) logarithm of the likelihood to convert the product into a sum, i.e., we wish
to maximize
log [ Π_{(si ,xi )∈D:si =1} fθ (xi ) · Π_{(sj ,xj )∈D:sj =0} (1 − fθ (xj )) ]
= Σ_{(si ,xi )∈D:si =1} log fθ (xi ) + Σ_{(sj ,xj )∈D:sj =0} log(1 − fθ (xj ))
= Σ_{t=1}^{nt} st log fθ (xt ) + (1 − st ) log(1 − fθ (xt )). (28)
Note that (28) denotes an objective which is to be maximized. Negating it to convert it into a loss
function, and scaling by 1/nt (which does not affect the minimizer of the empirical risk), we obtain
exactly the empirical risk with the binary cross entropy loss, namely

LD (θ) = (1/nt ) Σ_{t=1}^{nt} −st log fθ (xt ) − (1 − st ) log(1 − fθ (xt )). (29)
Next, we substitute the expression for our logistic regressor, which is
fθ (x) = 1/(1 + e−θ T x′ ), 1 − fθ (x) = 1/(1 + eθ T x′ ). (30)
Substituting (30) into (29) leads to
LD (θ) = (1/nt ) Σ_{t=1}^{nt} st log(1 + e−θ T x′t ) + (1 − st ) log(1 + eθ T x′t )
= (1/nt ) Σ_{t=1}^{nt} log(1 + e−s̃t ·θ T x′t ), (31)
where s̃t is a representation of st in {−1, 1} instead of {0, 1}, i.e., s̃t = 2st − 1. While the
empirical risk in (31) does not admit a closed-form solution, it is convex in θ, and therefore
one can find θ⋆ = arg min_θ LD (θ) by iterative optimization, and particularly based on first-order
(gradient) methods. We will discuss such optimizers in much more detail in future lectures.
References
[1] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.