Linear Regression
CE-717: Machine Learning
Sharif University of Technology
M. Soleymani
Fall 2016
Topics
Linear regression
Error (cost) function
Optimization
Generalization
Regression problem
The goal is to make (real-valued) predictions given features.
Example: predicting house price from 3 attributes:
Size (m²) | Age (year) | Region | Price (10⁶ T)
100       | 2          | 5      | 500
80        | 25         | 3      | 250
…         | …          | …      | …
Learning problem
Selecting a hypothesis space
  Hypothesis space: a set of mappings from feature vector to target
Learning (estimation): optimization of a cost function
  Based on the training set $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$ and a cost function, we find (an estimate) $\hat{f} \in F$ of the target function
Evaluation: we measure how well $\hat{f}$ generalizes to unseen examples
Hypothesis space
Specify the class of functions (e.g., linear)
We begin with the class of linear functions:
  it is easy to extend them to generalized linear models and thus cover more complex regression functions.
Linear regression: hypothesis space
Univariate: $f: \mathbb{R} \to \mathbb{R}$, $f(x; \mathbf{w}) = w_0 + w_1 x$
Multivariate: $f: \mathbb{R}^d \to \mathbb{R}$, $f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_d x_d$
$\mathbf{w} = [w_0, w_1, \ldots, w_d]^T$ are the parameters we need to set.
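As a concrete illustration (my addition, not from the slides), a minimal NumPy sketch of this hypothesis class, where the intercept $w_0$ is handled by prepending a column of ones:

```python
import numpy as np

def predict(X, w):
    """Linear hypothesis f(x; w) = w0 + w1*x1 + ... + wd*xd.

    X: (n, d) matrix of feature vectors; w: (d+1,) parameters,
    with w[0] playing the role of the intercept w0.
    """
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend 1s for w0
    return X_aug @ w
```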
Learning algorithm
Select how to measure the error (i.e., the prediction loss)
Find the minimum of the resulting error (cost) function
Learning algorithm
[Diagram: the training set $D$ feeds a learning algorithm that outputs parameters $w_0, w_1$; the resulting $\hat{f}(x) = f(x; \mathbf{w})$ maps the size of a house $x$ to its estimated price]
We need to:
(1) measure how well $f(x; \mathbf{w})$ approximates the target
(2) choose $\mathbf{w}$ to minimize the error measure
How to measure the error
[Scatter plot of price vs. size with a fitted line; vertical segments mark the residuals $y^{(i)} - f(x^{(i)}; \mathbf{w})$]
Squared error: $(y^{(i)} - f(x^{(i)}; \mathbf{w}))^2$
Linear regression: univariate example
[Same scatter plot, with residuals $y^{(i)} - f(x^{(i)}; \mathbf{w})$]
Cost function:
$J(\mathbf{w}) = \sum_{i=1}^{n} (y^{(i)} - f(x^{(i)}; \mathbf{w}))^2 = \sum_{i=1}^{n} (y^{(i)} - w_0 - w_1 x^{(i)})^2$
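A small sketch (my addition) of this cost on toy data; the numbers are hypothetical:

```python
import numpy as np

def sse_cost(w0, w1, x, y):
    """Sum-of-squares error J(w) = sum_i (y_i - w0 - w1*x_i)^2."""
    residuals = y - (w0 + w1 * x)
    return np.sum(residuals ** 2)

x = np.array([100.0, 80.0, 120.0])   # sizes (hypothetical)
y = np.array([500.0, 250.0, 550.0])  # prices (hypothetical)
print(sse_cost(0.0, 4.5, x, y))
```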
Regression: squared loss
In the SSE cost function, we use the squared error as the prediction loss:
$\mathrm{Loss}(y, \hat{y}) = (y - \hat{y})^2$, where $\hat{y} = f(\mathbf{x}; \mathbf{w})$
Cost function (based on the training set):
$J(\mathbf{w}) = \sum_{i=1}^{n} \mathrm{Loss}(y^{(i)}, f(\mathbf{x}^{(i)}; \mathbf{w})) = \sum_{i=1}^{n} (y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}))^2$
Minimizing the sum (or mean) of squared errors is a common approach in curve fitting, neural networks, etc.
Sum of Squares Error (SSE) cost function
$J(\mathbf{w}) = \sum_{i=1}^{n} (y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}))^2$
$J(\mathbf{w})$: sum of the squares of the prediction errors on the training set
We want to find the best regression function $f(\mathbf{x}^{(i)}; \mathbf{w})$, equivalently the best $\mathbf{w}$:
Minimize $J(\mathbf{w})$
Find the optimal $\hat{f}(\mathbf{x}) = f(\mathbf{x}; \hat{\mathbf{w}})$ where $\hat{\mathbf{w}} = \operatorname{argmin}_{\mathbf{w}} J(\mathbf{w})$
Cost function: univariate example
[Left: price ($) in 1000's vs. size in feet² ($x$) with a candidate line; right: surface of $J(\mathbf{w})$ as a function of the parameters $w_0, w_1$]
This example has been adapted from Prof. Andrew Ng's slides.
Cost function: univariate example
$f(x; w_0, w_1) = w_0 + w_1 x$ (for fixed $w_0, w_1$, a function of $x$) vs. $J(w_0, w_1)$ (a function of the parameters $w_0, w_1$)
[A sequence of slides: for several candidate lines drawn on the data, the corresponding point on the contour plot of $J(w_0, w_1)$]
This example has been adapted from Prof. Andrew Ng's slides.
Cost function optimization: univariate
$J(\mathbf{w}) = \sum_{i=1}^{n} (y^{(i)} - w_0 - w_1 x^{(i)})^2$
Necessary conditions for the "optimal" parameter values:
$\frac{\partial J(\mathbf{w})}{\partial w_0} = 0$
$\frac{\partial J(\mathbf{w})}{\partial w_1} = 0$
Optimality conditions: univariate
$J(\mathbf{w}) = \sum_{i=1}^{n} (y^{(i)} - w_0 - w_1 x^{(i)})^2$
$\frac{\partial J(\mathbf{w})}{\partial w_0} = \sum_{i=1}^{n} 2(y^{(i)} - w_0 - w_1 x^{(i)})(-1) = 0$
$\frac{\partial J(\mathbf{w})}{\partial w_1} = \sum_{i=1}^{n} 2(y^{(i)} - w_0 - w_1 x^{(i)})(-x^{(i)}) = 0$
A system of 2 linear equations
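Solving these two equations in closed form gives the familiar formulas below; the sketch is my addition:

```python
import numpy as np

def fit_univariate(x, y):
    """Closed-form solution of the two optimality conditions:
    w1 = cov(x, y) / var(x),  w0 = mean(y) - w1 * mean(x)."""
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1
```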
Cost function: multivariate
We have to minimize the empirical squared loss:
$J(\mathbf{w}) = \sum_{i=1}^{n} (y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}))^2$
$f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_d x_d$
$\mathbf{w} = [w_0, w_1, \ldots, w_d]^T$
$\hat{\mathbf{w}} = \operatorname{argmin}_{\mathbf{w} \in \mathbb{R}^{d+1}} J(\mathbf{w})$
Cost function and optimal linear model
Necessary conditions for the "optimal" parameter values:
$\nabla_{\mathbf{w}} J(\mathbf{w}) = \mathbf{0}$
A system of $d+1$ linear equations
Cost function: matrix notation
$J(\mathbf{w}) = \sum_{i=1}^{n} (y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}))^2 = \sum_{i=1}^{n} (y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)})^2$
$\mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix} \quad \mathbf{X} = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_d^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_d^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \cdots & x_d^{(n)} \end{bmatrix} \quad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{bmatrix}$
$J(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$
Minimizing cost function
Optimal linear weight vector (for the SSE cost function):
$J(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$
$\nabla_{\mathbf{w}} J(\mathbf{w}) = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w})$
$\nabla_{\mathbf{w}} J(\mathbf{w}) = \mathbf{0} \;\Rightarrow\; \mathbf{X}^T\mathbf{X}\,\hat{\mathbf{w}} = \mathbf{X}^T\mathbf{y}$
$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
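A sketch (my addition) of solving the normal equations numerically; solving the linear system is preferable to forming the explicit inverse:

```python
import numpy as np

def fit_normal_equations(X, y):
    """Solve X^T X w = X^T y, where X already contains the all-ones column."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```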
Minimizing cost function
$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{X}^{\dagger}\mathbf{y}$
$\mathbf{X}^{\dagger} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is the pseudo-inverse of $\mathbf{X}$
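In NumPy (an illustration of mine; the data are hypothetical), the pseudo-inverse route and the numerically preferred least-squares solver agree:

```python
import numpy as np

X = np.array([[1.0, 100.0], [1.0, 80.0], [1.0, 120.0]])  # ones column + size
y = np.array([500.0, 250.0, 550.0])

w_pinv = np.linalg.pinv(X) @ y                 # pseudo-inverse route
w_lsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # preferred in practice
print(w_pinv, w_lsq)
```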
Another approach for optimizing the sum of squared errors
An iterative approach for solving the following optimization problem:
$J(\mathbf{w}) = \sum_{i=1}^{n} (y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}))^2$
Review:
Iterative optimization of cost function
Cost function: $J(\mathbf{w})$
Optimization problem: $\hat{\mathbf{w}} = \operatorname{argmin}_{\mathbf{w}} J(\mathbf{w})$
Steps:
Start from $\mathbf{w}^0$
Repeat:
  update $\mathbf{w}^t$ to $\mathbf{w}^{t+1}$ in order to reduce $J$
  $t \leftarrow t + 1$
until we hopefully end up at a minimum
Review:
Gradient descent
First-order optimization algorithm to find $\mathbf{w}^* = \operatorname{argmin}_{\mathbf{w}} J(\mathbf{w})$
Also known as "steepest descent"
In each step, it takes a step proportional to the negative of the gradient of the function at the current point $\mathbf{w}^t$:
$\mathbf{w}^{t+1} = \mathbf{w}^t - \gamma_t \nabla J(\mathbf{w}^t)$
$J(\mathbf{w})$ decreases fastest if one goes from $\mathbf{w}^t$ in the direction of $-\nabla J(\mathbf{w}^t)$
Assumption: $J(\mathbf{w})$ is defined and differentiable in a neighborhood of the point $\mathbf{w}^t$
Gradient ascent takes steps proportional to (the positive of) the gradient to find a local maximum of the function
Review:
Gradient descent
Minimize $J(\mathbf{w})$:
$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J(\mathbf{w}^t)$, where $\eta$ is the step size (learning rate parameter)
$\nabla_{\mathbf{w}} J(\mathbf{w}) = \left[\frac{\partial J(\mathbf{w})}{\partial w_1}, \frac{\partial J(\mathbf{w})}{\partial w_2}, \ldots, \frac{\partial J(\mathbf{w})}{\partial w_d}\right]$
If $\eta$ is small enough, then $J(\mathbf{w}^{t+1}) \le J(\mathbf{w}^t)$.
$\eta$ can be allowed to change at every iteration as $\eta_t$.
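A generic sketch of this loop (my own illustration; grad_J and the stopping tolerance are placeholders you would supply):

```python
import numpy as np

def gradient_descent(grad_J, w0, eta=0.01, n_iters=1000, tol=1e-8):
    """Iterate w <- w - eta * grad_J(w) until the step becomes tiny."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        step = eta * grad_J(w)
        w = w - step
        if np.linalg.norm(step) < tol:  # stop once progress stalls
            break
    return w
```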
Review:
Gradient descent disadvantages
Local minima problem
However, when $J$ is convex, all local minima are also global minima ⇒ gradient descent can converge to the global solution.
Review: Problem of gradient descent with non-convex cost functions
[Two slides: surface plot of a non-convex $J(w_0, w_1)$; started from different initial points, gradient descent ends up in different local minima]
This example has been adapted from Prof. Ng's slides (ML Online Course, Stanford).
Gradient descent for SSE cost function
Minimize $J(\mathbf{w})$:
$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J(\mathbf{w}^t)$
$J(\mathbf{w})$: sum of squares error
$J(\mathbf{w}) = \sum_{i=1}^{n} (y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}))^2$
Weight update rule for $f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T\mathbf{x}$:
$\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \sum_{i=1}^{n} (y^{(i)} - {\mathbf{w}^t}^T\mathbf{x}^{(i)})\,\mathbf{x}^{(i)}$
Gradient descent for SSE cost function
Weight update rule for $f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T\mathbf{x}$:
$\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \sum_{i=1}^{n} (y^{(i)} - {\mathbf{w}^t}^T\mathbf{x}^{(i)})\,\mathbf{x}^{(i)}$
Batch mode: each step considers all training data.
$\eta$ too small → gradient descent can be slow.
$\eta$ too large → gradient descent can overshoot the minimum; it may fail to converge, or even diverge.
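A compact sketch (mine, following the update rule above) of batch gradient descent for the SSE cost:

```python
import numpy as np

def batch_gd_sse(X, y, eta=1e-4, n_iters=5000):
    """w <- w + eta * sum_i (y_i - w.x_i) * x_i, using all examples per step.

    X: (n, d+1) design matrix with an all-ones column; y: (n,) targets.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residuals = y - X @ w          # y^(i) - w^T x^(i), for all i at once
        w = w + eta * (X.T @ residuals)
    return w
```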
$f(x; w_0, w_1) = w_0 + w_1 x$ vs. $J(w_0, w_1)$ (a function of the parameters $w_0, w_1$)
[A sequence of slides: successive gradient descent steps traced on the contour plot of $J(w_0, w_1)$, with the corresponding fitted line shown on the data at each step]
This example has been adapted from Prof. Ng's slides (ML Online Course, Stanford).
Stochastic gradient descent
Batch techniques process the entire training set in one go; thus, they can be computationally costly for large data sets.
Stochastic gradient descent applies when the cost function comprises a sum over data points:
$J(\mathbf{w}) = \sum_{i=1}^{n} J^{(i)}(\mathbf{w})$
Update after the presentation of $(\mathbf{x}^{(i)}, y^{(i)})$:
$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J^{(i)}(\mathbf{w}^t)$
Stochastic gradient descent
Example: linear regression with the SSE cost function
$J^{(i)}(\mathbf{w}) = (y^{(i)} - \mathbf{w}^T\mathbf{x}^{(i)})^2$
$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J^{(i)}(\mathbf{w}^t) = \mathbf{w}^t + \eta (y^{(i)} - {\mathbf{w}^t}^T\mathbf{x}^{(i)})\,\mathbf{x}^{(i)}$
This is the Least Mean Squares (LMS) update.
It is suitable for sequential or online learning.
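A sketch of the LMS updates (my addition; per-epoch shuffling is a common choice, not something the slides specify):

```python
import numpy as np

def lms_sgd(X, y, eta=1e-4, n_epochs=50, seed=0):
    """One LMS update per example: w <- w + eta * (y_i - w.x_i) * x_i."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):  # visit examples in random order
            w = w + eta * (y[i] - w @ X[i]) * X[i]
    return w
```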
Stochastic gradient descent: online learning
Sequential learning is also appropriate for real-time applications:
data observations arrive in a continuous stream,
and predictions must be made before seeing all of the data.
The value of $\eta$ needs to be chosen with care to ensure that the algorithm converges.
Evaluation and generalization
Why minimize the cost function (based only on the training data) when we are interested in the performance on new examples?
$\min_{\boldsymbol{\theta}} \sum_{i=1}^{n} \mathrm{Loss}(y^{(i)}, f(\mathbf{x}^{(i)}; \boldsymbol{\theta}))$  (empirical loss)
Evaluation: after training, we need to measure how well the learned prediction function predicts the target for unseen examples.
Training and test performance
Assumption: training and test examples are drawn independently at random from the same but unknown distribution.
Each training/test example $(\mathbf{x}, y)$ is a sample from the joint probability distribution $P(\mathbf{x}, y)$, i.e., $(\mathbf{x}, y) \sim P$
Empirical (training) loss $= \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(y^{(i)}, f(\mathbf{x}^{(i)}; \boldsymbol{\theta}))$
Expected (test) loss $= E_{\mathbf{x},y}\big[\mathrm{Loss}(y, f(\mathbf{x}; \boldsymbol{\theta}))\big]$
We minimize the empirical loss (on the training data) and expect to also obtain an acceptable expected loss:
the empirical loss serves as a proxy for the performance over the whole distribution.
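To make the distinction concrete, a small simulation of mine that contrasts the training loss with a held-out estimate of the expected loss; the data-generating line is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw (X, y) from a fixed distribution: y = 2 + 3x + noise."""
    x = rng.uniform(0, 10, n)
    y = 2 + 3 * x + rng.normal(0, 1, n)
    return np.column_stack([np.ones(n), x]), y

X_train, y_train = sample(20)
X_test, y_test = sample(1000)

w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_mse = np.mean((y_train - X_train @ w_hat) ** 2)  # empirical loss
test_mse = np.mean((y_test - X_test @ w_hat) ** 2)     # proxy for expected loss
print(train_mse, test_mse)
```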
Linear regression: number of training data
[Fitted lines for training-set sizes $n = 10$, $n = 20$, and $n = 50$]
Linear regression: generalization
By increasing the number of training examples, will the solution get better?
Why does the mean squared error stop decreasing after reaching a certain level?
Linear regression: types of errors
Structural error: the error introduced by the limited function class (even with infinite training data):
$\mathbf{w}^* = \operatorname{argmin}_{\mathbf{w}} E_{\mathbf{x},y}\big[(y - \mathbf{w}^T\mathbf{x})^2\big]$
Structural error: $E_{\mathbf{x},y}\big[(y - \mathbf{w}^{*T}\mathbf{x})^2\big]$
where $\mathbf{w}^* = (w_0^*, \ldots, w_d^*)$ are the optimal linear regression parameters (infinite training data).
Linear regression: types of errors
Approximation error measures how close we can get to the optimal linear predictions with limited training data:
$\mathbf{w}^* = \operatorname{argmin}_{\mathbf{w}} E_{\mathbf{x},y}\big[(y - \mathbf{w}^T\mathbf{x})^2\big]$
$\hat{\mathbf{w}} = \operatorname{argmin}_{\mathbf{w}} \sum_{i=1}^{n} (y^{(i)} - \mathbf{w}^T\mathbf{x}^{(i)})^2$
Approximation error: $E_{\mathbf{x}}\big[(\mathbf{w}^{*T}\mathbf{x} - \hat{\mathbf{w}}^T\mathbf{x})^2\big]$
where $\hat{\mathbf{w}}$ are the parameter estimates based on a small training set (so they are themselves random variables).
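A rough simulation (my construction, not from the slides) that approximates $\mathbf{w}^*$ with a very large sample and then measures the approximation error of a small-sample estimate $\hat{\mathbf{w}}$:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Nonlinear ground truth, so the best linear fit has structural error."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(2 * x) + rng.normal(0, 0.1, n)
    return np.column_stack([np.ones(n), x]), y

# w*: optimal linear parameters, approximated with a huge sample
X_big, y_big = sample(200_000)
w_star, *_ = np.linalg.lstsq(X_big, y_big, rcond=None)

# w_hat: estimate from a small training set
X_small, y_small = sample(15)
w_hat, *_ = np.linalg.lstsq(X_small, y_small, rcond=None)

# approximation error E_x[(w*.x - w_hat.x)^2], estimated on fresh inputs
X_eval, _ = sample(100_000)
print(np.mean((X_eval @ w_star - X_eval @ w_hat) ** 2))
```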
Linear regression: error decomposition
The expected error decomposes into the sum of the structural and approximation errors:
$E_{\mathbf{x},y}\big[(y - \hat{\mathbf{w}}^T\mathbf{x})^2\big] = E_{\mathbf{x},y}\big[(y - \mathbf{w}^{*T}\mathbf{x})^2\big] + E_{\mathbf{x}}\big[(\mathbf{w}^{*T}\mathbf{x} - \hat{\mathbf{w}}^T\mathbf{x})^2\big]$
Derivation:
$E_{\mathbf{x},y}\big[(y - \hat{\mathbf{w}}^T\mathbf{x})^2\big] = E_{\mathbf{x},y}\big[(y - \mathbf{w}^{*T}\mathbf{x} + \mathbf{w}^{*T}\mathbf{x} - \hat{\mathbf{w}}^T\mathbf{x})^2\big]$
$= E_{\mathbf{x},y}\big[(y - \mathbf{w}^{*T}\mathbf{x})^2\big] + E_{\mathbf{x}}\big[(\mathbf{w}^{*T}\mathbf{x} - \hat{\mathbf{w}}^T\mathbf{x})^2\big] + 2 E_{\mathbf{x},y}\big[(y - \mathbf{w}^{*T}\mathbf{x})(\mathbf{w}^{*T}\mathbf{x} - \hat{\mathbf{w}}^T\mathbf{x})\big]$
The cross term is 0: the optimality condition for $\mathbf{w}^*$ gives $E_{\mathbf{x},y}\big[(y - \mathbf{w}^{*T}\mathbf{x})\,\mathbf{x}\big] = \mathbf{0}$, since $\nabla_{\mathbf{w}} E_{\mathbf{x},y}\big[(y - \mathbf{w}^T\mathbf{x})^2\big]\big|_{\mathbf{w}^*} = 0$.