Linear Regression
CE-717: Machine Learning
Sharif University of Technology
M. Soleymani
Fall 2016
Topics
Linear regression
Error (cost) function
Optimization
Generalization
Regression problem
The goal is to make (real-valued) predictions given features.
Example: predicting house price from 3 attributes:
Size (m²) | Age (year) | Region | Price (10⁶ T)
100       | 2          | 5      | 500
80        | 25         | 3      | 250
…         | …          | …      | …
Learning problem
Selecting a hypothesis space
  Hypothesis space: a set of mappings from feature vector to target
Learning (estimation): optimization of a cost function
  Based on the training set $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$ and a cost function, we find (an estimate) $\hat{f} \in F$ of the target function
Evaluation: we measure how well $\hat{f}$ generalizes to unseen examples
Hypothesis space
Specify the class of functions (e.g., linear)
We begin with the class of linear functions:
  it is easy to extend them to generalized linear models and thus cover more complex regression functions.
Linear regression: hypothesis space
Univariate: $f: \mathbb{R} \to \mathbb{R}$, $f(x; \mathbf{w}) = w_0 + w_1 x$
Multivariate: $f: \mathbb{R}^d \to \mathbb{R}$, $f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_d x_d$
$\mathbf{w} = [w_0, w_1, \ldots, w_d]^T$ are the parameters we need to set.
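As a concrete illustration (my addition, not from the slides), a minimal NumPy sketch of this hypothesis class, where the intercept $w_0$ is handled by prepending a column of ones:

```python
import numpy as np

def predict(X, w):
    """Linear hypothesis f(x; w) = w0 + w1*x1 + ... + wd*xd.

    X: (n, d) matrix of feature vectors; w: (d+1,) parameters,
    with w[0] playing the role of the intercept w0.
    """
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend 1s for w0
    return X_aug @ w
```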
Learning algorithm
Select how to measure the error (i.e., the prediction loss)
Find the minimum of the resulting error (cost) function
Learning algorithm
[Diagram: the training set $D$ feeds a learning algorithm that outputs parameters $w_0, w_1$; the resulting $\hat{f}(x) = f(x; \mathbf{w})$ maps the size of a house $x$ to its estimated price]
We need to:
(1) measure how well $f(x; \mathbf{w})$ approximates the target
(2) choose $\mathbf{w}$ to minimize the error measure
How to measure the error
[Scatter plot of price vs. size with a fitted line; vertical segments mark the residuals $y^{(i)} - f(x^{(i)}; \mathbf{w})$]
Squared error: $(y^{(i)} - f(x^{(i)}; \mathbf{w}))^2$
Linear regression: univariate example
[Same scatter plot, with residuals $y^{(i)} - f(x^{(i)}; \mathbf{w})$]
Cost function:
$J(\mathbf{w}) = \sum_{i=1}^{n} (y^{(i)} - f(x^{(i)}; \mathbf{w}))^2 = \sum_{i=1}^{n} (y^{(i)} - w_0 - w_1 x^{(i)})^2$
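A small sketch (my addition) of this cost on toy data; the numbers are hypothetical:

```python
import numpy as np

def sse_cost(w0, w1, x, y):
    """Sum-of-squares error J(w) = sum_i (y_i - w0 - w1*x_i)^2."""
    residuals = y - (w0 + w1 * x)
    return np.sum(residuals ** 2)

x = np.array([100.0, 80.0, 120.0])   # sizes (hypothetical)
y = np.array([500.0, 250.0, 550.0])  # prices (hypothetical)
print(sse_cost(0.0, 4.5, x, y))
```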
Regression: squared loss
In the SSE cost function, we use the squared error as the prediction loss:
$\mathrm{Loss}(y, \hat{y}) = (y - \hat{y})^2$, where $\hat{y} = f(\mathbf{x}; \mathbf{w})$
Cost function (based on the training set):
$J(\mathbf{w}) = \sum_{i=1}^{n} \mathrm{Loss}(y^{(i)}, f(\mathbf{x}^{(i)}; \mathbf{w})) = \sum_{i=1}^{n} (y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}))^2$
Minimizing the sum (or mean) of squared errors is a common approach in curve fitting, neural networks, etc.
Sum of Squares Error (SSE) cost function
$J(\mathbf{w}) = \sum_{i=1}^{n} (y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}))^2$
$J(\mathbf{w})$: sum of the squares of the prediction errors on the training set
We want to find the best regression function $f(\mathbf{x}^{(i)}; \mathbf{w})$, equivalently the best $\mathbf{w}$:
Minimize $J(\mathbf{w})$
Find the optimal $\hat{f}(\mathbf{x}) = f(\mathbf{x}; \hat{\mathbf{w}})$ where $\hat{\mathbf{w}} = \operatorname{argmin}_{\mathbf{w}} J(\mathbf{w})$
Cost function: univariate example
[Left: price ($) in 1000's vs. size in feet² ($x$) with a candidate line; right: surface of $J(\mathbf{w})$ as a function of the parameters $w_0, w_1$]
This example has been adapted from Prof. Andrew Ng's slides.
Cost function: univariate example
$f(x; w_0, w_1) = w_0 + w_1 x$ (for fixed $w_0, w_1$, a function of $x$) vs. $J(w_0, w_1)$ (a function of the parameters $w_0, w_1$)
[A sequence of slides: for several candidate lines drawn on the data, the corresponding point on the contour plot of $J(w_0, w_1)$]
This example has been adapted from Prof. Andrew Ng's slides.
Cost function optimization: univariate
$J(\mathbf{w}) = \sum_{i=1}^{n} (y^{(i)} - w_0 - w_1 x^{(i)})^2$
Necessary conditions for the "optimal" parameter values:
$\frac{\partial J(\mathbf{w})}{\partial w_0} = 0$
$\frac{\partial J(\mathbf{w})}{\partial w_1} = 0$
Optimality conditions: univariate
$J(\mathbf{w}) = \sum_{i=1}^{n} (y^{(i)} - w_0 - w_1 x^{(i)})^2$
$\frac{\partial J(\mathbf{w})}{\partial w_0} = \sum_{i=1}^{n} 2(y^{(i)} - w_0 - w_1 x^{(i)})(-1) = 0$
$\frac{\partial J(\mathbf{w})}{\partial w_1} = \sum_{i=1}^{n} 2(y^{(i)} - w_0 - w_1 x^{(i)})(-x^{(i)}) = 0$
A system of 2 linear equations
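Solving these two equations in closed form gives the familiar formulas below; the sketch is my addition:

```python
import numpy as np

def fit_univariate(x, y):
    """Closed-form solution of the two optimality conditions:
    w1 = cov(x, y) / var(x),  w0 = mean(y) - w1 * mean(x)."""
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1
```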
Cost function: multivariate
We have to minimize the empirical squared loss:
$J(\mathbf{w}) = \sum_{i=1}^{n} (y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}))^2$
$f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_d x_d$
$\mathbf{w} = [w_0, w_1, \ldots, w_d]^T$
$\hat{\mathbf{w}} = \operatorname{argmin}_{\mathbf{w} \in \mathbb{R}^{d+1}} J(\mathbf{w})$
Cost function and optimal linear model
Necessary conditions for the "optimal" parameter values:
$\nabla_{\mathbf{w}} J(\mathbf{w}) = \mathbf{0}$
A system of $d+1$ linear equations
Cost function: matrix notation
$J(\mathbf{w}) = \sum_{i=1}^{n} (y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}))^2 = \sum_{i=1}^{n} (y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)})^2$
$\mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix} \quad \mathbf{X} = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_d^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_d^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \cdots & x_d^{(n)} \end{bmatrix} \quad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{bmatrix}$
$J(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$
Minimizing cost function
Optimal linear weight vector (for the SSE cost function):
$J(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$
$\nabla_{\mathbf{w}} J(\mathbf{w}) = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w})$
$\nabla_{\mathbf{w}} J(\mathbf{w}) = \mathbf{0} \;\Rightarrow\; \mathbf{X}^T\mathbf{X}\,\hat{\mathbf{w}} = \mathbf{X}^T\mathbf{y}$
$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
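A sketch (my addition) of solving the normal equations numerically; solving the linear system is preferable to forming the explicit inverse:

```python
import numpy as np

def fit_normal_equations(X, y):
    """Solve X^T X w = X^T y, where X already contains the all-ones column."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```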
Minimizing cost function
$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{X}^{\dagger}\mathbf{y}$
$\mathbf{X}^{\dagger} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is the pseudo-inverse of $\mathbf{X}$
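In NumPy (an illustration of mine; the data are hypothetical), the pseudo-inverse route and the numerically preferred least-squares solver agree:

```python
import numpy as np

X = np.array([[1.0, 100.0], [1.0, 80.0], [1.0, 120.0]])  # ones column + size
y = np.array([500.0, 250.0, 550.0])

w_pinv = np.linalg.pinv(X) @ y                 # pseudo-inverse route
w_lsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # preferred in practice
print(w_pinv, w_lsq)
```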
Another approach for optimizing the sum of squared errors
An iterative approach for solving the following optimization problem:
$J(\mathbf{w}) = \sum_{i=1}^{n} (y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}))^2$
Review:
Iterative optimization of cost function
Cost function: $J(\mathbf{w})$
Optimization problem: $\hat{\mathbf{w}} = \operatorname{argmin}_{\mathbf{w}} J(\mathbf{w})$
Steps:
Start from $\mathbf{w}^0$
Repeat:
  update $\mathbf{w}^t$ to $\mathbf{w}^{t+1}$ in order to reduce $J$
  $t \leftarrow t + 1$
until we hopefully end up at a minimum
Review:
Gradient descent
First-order optimization algorithm to find $\mathbf{w}^* = \operatorname{argmin}_{\mathbf{w}} J(\mathbf{w})$
Also known as "steepest descent"
In each step, it takes a step proportional to the negative of the gradient of the function at the current point $\mathbf{w}^t$:
$\mathbf{w}^{t+1} = \mathbf{w}^t - \gamma_t \nabla J(\mathbf{w}^t)$
$J(\mathbf{w})$ decreases fastest if one goes from $\mathbf{w}^t$ in the direction of $-\nabla J(\mathbf{w}^t)$
Assumption: $J(\mathbf{w})$ is defined and differentiable in a neighborhood of the point $\mathbf{w}^t$
Gradient ascent takes steps proportional to (the positive of) the gradient to find a local maximum of the function
Review:
Gradient descent
Minimize $J(\mathbf{w})$:
$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J(\mathbf{w}^t)$, where $\eta$ is the step size (learning rate parameter)
$\nabla_{\mathbf{w}} J(\mathbf{w}) = \left[\frac{\partial J(\mathbf{w})}{\partial w_1}, \frac{\partial J(\mathbf{w})}{\partial w_2}, \ldots, \frac{\partial J(\mathbf{w})}{\partial w_d}\right]$
If $\eta$ is small enough, then $J(\mathbf{w}^{t+1}) \le J(\mathbf{w}^t)$.
$\eta$ can be allowed to change at every iteration as $\eta_t$.
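A generic sketch of this loop (my own illustration; grad_J and the stopping tolerance are placeholders you would supply):

```python
import numpy as np

def gradient_descent(grad_J, w0, eta=0.01, n_iters=1000, tol=1e-8):
    """Iterate w <- w - eta * grad_J(w) until the step becomes tiny."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        step = eta * grad_J(w)
        w = w - step
        if np.linalg.norm(step) < tol:  # stop once progress stalls
            break
    return w
```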
Review:
Gradient descent disadvantages
Local minima problem
However, when $J$ is convex, all local minima are also global minima ⇒ gradient descent can converge to the global solution.
Review: Problem of gradient descent with non-convex cost functions
[Two slides: surface plot of a non-convex $J(w_0, w_1)$; started from different initial points, gradient descent ends up in different local minima]
This example has been adapted from Prof. Ng's slides (ML Online Course, Stanford).
Gradient descent for SSE cost function
Minimize $J(\mathbf{w})$:
$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J(\mathbf{w}^t)$
$J(\mathbf{w})$: sum of squares error
$J(\mathbf{w}) = \sum_{i=1}^{n} (y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}))^2$
Weight update rule for $f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T\mathbf{x}$:
$\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \sum_{i=1}^{n} (y^{(i)} - {\mathbf{w}^t}^T\mathbf{x}^{(i)})\,\mathbf{x}^{(i)}$
Gradient descent for SSE cost function
Weight update rule for $f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T\mathbf{x}$:
$\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \sum_{i=1}^{n} (y^{(i)} - {\mathbf{w}^t}^T\mathbf{x}^{(i)})\,\mathbf{x}^{(i)}$
Batch mode: each step considers all training data.
$\eta$ too small → gradient descent can be slow.
$\eta$ too large → gradient descent can overshoot the minimum; it may fail to converge, or even diverge.
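A compact sketch (mine, following the update rule above) of batch gradient descent for the SSE cost:

```python
import numpy as np

def batch_gd_sse(X, y, eta=1e-4, n_iters=5000):
    """w <- w + eta * sum_i (y_i - w.x_i) * x_i, using all examples per step.

    X: (n, d+1) design matrix with an all-ones column; y: (n,) targets.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residuals = y - X @ w          # y^(i) - w^T x^(i), for all i at once
        w = w + eta * (X.T @ residuals)
    return w
```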
$f(x; w_0, w_1) = w_0 + w_1 x$ vs. $J(w_0, w_1)$ (a function of the parameters $w_0, w_1$)
[A sequence of slides: successive gradient descent steps traced on the contour plot of $J(w_0, w_1)$, with the corresponding fitted line shown on the data at each step]
This example has been adapted from Prof. Ng's slides (ML Online Course, Stanford).
Stochastic gradient descent
Batch techniques process the entire training set in one go; thus, they can be computationally costly for large data sets.
Stochastic gradient descent applies when the cost function comprises a sum over data points:
$J(\mathbf{w}) = \sum_{i=1}^{n} J^{(i)}(\mathbf{w})$
Update after the presentation of $(\mathbf{x}^{(i)}, y^{(i)})$:
$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J^{(i)}(\mathbf{w}^t)$
Stochastic gradient descent
Example: linear regression with the SSE cost function
$J^{(i)}(\mathbf{w}) = (y^{(i)} - \mathbf{w}^T\mathbf{x}^{(i)})^2$
$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J^{(i)}(\mathbf{w}^t) = \mathbf{w}^t + \eta (y^{(i)} - {\mathbf{w}^t}^T\mathbf{x}^{(i)})\,\mathbf{x}^{(i)}$
This is the Least Mean Squares (LMS) update.
It is suitable for sequential or online learning.
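A sketch of the LMS updates (my addition; per-epoch shuffling is a common choice, not something the slides specify):

```python
import numpy as np

def lms_sgd(X, y, eta=1e-4, n_epochs=50, seed=0):
    """One LMS update per example: w <- w + eta * (y_i - w.x_i) * x_i."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):  # visit examples in random order
            w = w + eta * (y[i] - w @ X[i]) * X[i]
    return w
```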
Stochastic gradient descent: online learning
Sequential learning is also appropriate for real-time applications:
data observations arrive in a continuous stream,
and predictions must be made before seeing all of the data.
The value of $\eta$ needs to be chosen with care to ensure that the algorithm converges.
Evaluation and generalization
Why minimize the cost function (based only on the training data) when we are interested in the performance on new examples?
$\min_{\boldsymbol{\theta}} \sum_{i=1}^{n} \mathrm{Loss}(y^{(i)}, f(\mathbf{x}^{(i)}; \boldsymbol{\theta}))$  (empirical loss)
Evaluation: after training, we need to measure how well the learned prediction function predicts the target for unseen examples.
Training and test performance
Assumption: training and test examples are drawn independently at random from the same but unknown distribution.
Each training/test example $(\mathbf{x}, y)$ is a sample from the joint probability distribution $P(\mathbf{x}, y)$, i.e., $(\mathbf{x}, y) \sim P$
Empirical (training) loss $= \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(y^{(i)}, f(\mathbf{x}^{(i)}; \boldsymbol{\theta}))$
Expected (test) loss $= E_{\mathbf{x},y}\big[\mathrm{Loss}(y, f(\mathbf{x}; \boldsymbol{\theta}))\big]$
We minimize the empirical loss (on the training data) and expect to also obtain an acceptable expected loss:
the empirical loss serves as a proxy for the performance over the whole distribution.
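To make the distinction concrete, a small simulation of mine that contrasts the training loss with a held-out estimate of the expected loss; the data-generating line is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw (X, y) from a fixed distribution: y = 2 + 3x + noise."""
    x = rng.uniform(0, 10, n)
    y = 2 + 3 * x + rng.normal(0, 1, n)
    return np.column_stack([np.ones(n), x]), y

X_train, y_train = sample(20)
X_test, y_test = sample(1000)

w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_mse = np.mean((y_train - X_train @ w_hat) ** 2)  # empirical loss
test_mse = np.mean((y_test - X_test @ w_hat) ** 2)     # proxy for expected loss
print(train_mse, test_mse)
```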
Linear regression: number of training data
[Fitted lines for training-set sizes $n = 10$, $n = 20$, and $n = 50$]
Linear regression: generalization
By increasing the number of training examples, will the solution get better?
Why does the mean squared error stop decreasing after reaching a certain level?
Linear regression: types of errors
Structural error: the error introduced by the limited function class (even with infinite training data):
$\mathbf{w}^* = \operatorname{argmin}_{\mathbf{w}} E_{\mathbf{x},y}\big[(y - \mathbf{w}^T\mathbf{x})^2\big]$
Structural error: $E_{\mathbf{x},y}\big[(y - \mathbf{w}^{*T}\mathbf{x})^2\big]$
where $\mathbf{w}^* = (w_0^*, \ldots, w_d^*)$ are the optimal linear regression parameters (infinite training data).
Linear regression: types of errors
Approximation error measures how close we can get to the optimal linear predictions with limited training data:
$\mathbf{w}^* = \operatorname{argmin}_{\mathbf{w}} E_{\mathbf{x},y}\big[(y - \mathbf{w}^T\mathbf{x})^2\big]$
$\hat{\mathbf{w}} = \operatorname{argmin}_{\mathbf{w}} \sum_{i=1}^{n} (y^{(i)} - \mathbf{w}^T\mathbf{x}^{(i)})^2$
Approximation error: $E_{\mathbf{x}}\big[(\mathbf{w}^{*T}\mathbf{x} - \hat{\mathbf{w}}^T\mathbf{x})^2\big]$
where $\hat{\mathbf{w}}$ are the parameter estimates based on a small training set (so they are themselves random variables).
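A rough simulation (my construction, not from the slides) that approximates $\mathbf{w}^*$ with a very large sample and then measures the approximation error of a small-sample estimate $\hat{\mathbf{w}}$:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Nonlinear ground truth, so the best linear fit has structural error."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(2 * x) + rng.normal(0, 0.1, n)
    return np.column_stack([np.ones(n), x]), y

# w*: optimal linear parameters, approximated with a huge sample
X_big, y_big = sample(200_000)
w_star, *_ = np.linalg.lstsq(X_big, y_big, rcond=None)

# w_hat: estimate from a small training set
X_small, y_small = sample(15)
w_hat, *_ = np.linalg.lstsq(X_small, y_small, rcond=None)

# approximation error E_x[(w*.x - w_hat.x)^2], estimated on fresh inputs
X_eval, _ = sample(100_000)
print(np.mean((X_eval @ w_star - X_eval @ w_hat) ** 2))
```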
Linear regression: error decomposition
The expected error decomposes into the sum of the structural and approximation errors:
$E_{\mathbf{x},y}\big[(y - \hat{\mathbf{w}}^T\mathbf{x})^2\big] = E_{\mathbf{x},y}\big[(y - \mathbf{w}^{*T}\mathbf{x})^2\big] + E_{\mathbf{x}}\big[(\mathbf{w}^{*T}\mathbf{x} - \hat{\mathbf{w}}^T\mathbf{x})^2\big]$
Derivation:
$E_{\mathbf{x},y}\big[(y - \hat{\mathbf{w}}^T\mathbf{x})^2\big] = E_{\mathbf{x},y}\big[(y - \mathbf{w}^{*T}\mathbf{x} + \mathbf{w}^{*T}\mathbf{x} - \hat{\mathbf{w}}^T\mathbf{x})^2\big]$
$= E_{\mathbf{x},y}\big[(y - \mathbf{w}^{*T}\mathbf{x})^2\big] + E_{\mathbf{x}}\big[(\mathbf{w}^{*T}\mathbf{x} - \hat{\mathbf{w}}^T\mathbf{x})^2\big] + 2 E_{\mathbf{x},y}\big[(y - \mathbf{w}^{*T}\mathbf{x})(\mathbf{w}^{*T}\mathbf{x} - \hat{\mathbf{w}}^T\mathbf{x})\big]$
The cross term is 0: the optimality condition for $\mathbf{w}^*$ gives $E_{\mathbf{x},y}\big[(y - \mathbf{w}^{*T}\mathbf{x})\,\mathbf{x}\big] = \mathbf{0}$, since $\nabla_{\mathbf{w}} E_{\mathbf{x},y}\big[(y - \mathbf{w}^T\mathbf{x})^2\big]\big|_{\mathbf{w}^*} = 0$.