Linear Regression – I
Spring 2020
ECE – Carnegie Mellon University
Outline
1. Recap of MLE/MAP
2. Linear Regression
Motivation
Algorithm
Univariate solution
Multivariate Solution
Probabilistic interpretation
Computational and numerical optimization
Recap of MLE/MAP
Dogecoin
• You flip the coin 10 times . . .
• It comes up as ’H’ 8 times and ’T’ 2 times
Machine Learning Pipeline
Maximum Likelihood Estimation (MLE)
MAP for Dogecoin

Prior on θ:

P(θ) = θ^(α−1) (1 − θ)^(β−1) / B(α, β)
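To make the prior concrete, here is a minimal sketch that evaluates this Beta density directly from the formula, using B(α, β) = Γ(α)Γ(β)/Γ(α + β); the choice α = β = 5 is only an illustrative assumption, not a value from the slides.

```python
import math

# Evaluate the Beta prior P(theta) = theta^(a-1) (1-theta)^(b-1) / B(a, b),
# with B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b).
# alpha = beta = 5 is an assumed choice for illustration.

def beta_prior(theta, a=5.0, b=5.0):
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / B

for theta in (0.2, 0.5, 0.8):
    print(f"P({theta}) = {beta_prior(theta):.3f}")
# With alpha = beta = 5 the density peaks at theta = 0.5,
# encoding a prior belief that the coin is roughly fair.
```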
Putting it all together

θ̂_MLE = n_H / (n_H + n_T)

θ̂_MAP = (α + n_H − 1) / (α + β + n_H + n_T − 2)
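As a quick sanity check, here is a minimal Python sketch of the two estimators for the Dogecoin example (8 heads, 2 tails, as on the earlier slide); the prior hyperparameters α = β = 5 are an illustrative assumption.

```python
# MLE vs. MAP estimates for the coin-flip example.
# Data (8 heads, 2 tails) follows the slides; the Beta prior
# hyperparameters alpha = beta = 5 are an illustrative assumption.

n_H, n_T = 8, 2
alpha, beta = 5.0, 5.0  # assumed Beta prior hyperparameters

theta_mle = n_H / (n_H + n_T)
theta_map = (alpha + n_H - 1) / (alpha + beta + n_H + n_T - 2)

print(f"MLE estimate: {theta_mle:.3f}")   # 0.800
print(f"MAP estimate: {theta_map:.3f}")   # 0.667, pulled toward the prior mean 0.5
```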
Linear Regression
Motivation
Task 1: Regression
[Figure: predicting house price ($) from house size]
Supervised Learning

In a supervised learning problem, you have access to input variables (X)
and outputs (Y), and the goal is to predict an output given an input.
• Examples:
• Housing prices (Regression): predict the price of a house based on features (size, location, etc.)
• Cat vs. Dog (Classification): predict whether a picture is of a cat or a dog
Regression
Ex: predicting the sale price of a house
Features used to predict
Correlation between square footage and sale price
Roughly linear relationship
Data Can be Compactly Represented by Matrices
[Figure: predicting house price ($) from house size]
Some Concepts That You Should Know
Excellent Resources:
Matrix Inverse
You could have data from many houses
• Sale price = price per sqft × square footage + fixed expense + unexplainable stuff
• Want to learn the price per sqft and fixed expense
• Training data: past sales records
Want to predict the best price per sqft and fixed expense
Reduce prediction error
Geometric Illustration: Each house corresponds to one line
Each house n gives one linear equation c_n^T w = y_n (the lines c_1^T w = y_1, …, c_4^T w = y_4 in the w-plane). Stacking the residuals of all houses:

r = ( c_1^T w − y_1,  c_2^T w − y_2,  …,  c_4^T w − y_4 )^T = Aw − y
Norms and Loss Functions

Our model:
Sale price = price per sqft × square footage + fixed expense + unexplainable stuff

Training data:

sqft    sale price    prediction    error    squared error
2000    810K          720K          90K      90² = 8100
2100    907K          800K          107K     107²
1100    312K          350K          38K      38²
5500    2,600K        2,600K        0        0
···
Total: 8100 + 107² + 38² + 0 + ···

Aim:
Adjust the price per sqft and fixed expense such that the sum of the squared
errors is minimized, i.e., the unexplainable stuff is minimized.
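To make the bookkeeping concrete, here is a minimal Python sketch that recomputes the error and squared-error columns from the four rows shown above (prices in $K).

```python
# Squared-error bookkeeping from the table above: for each house we have the
# actual sale price and a model prediction (both in $K), and we accumulate
# the squared prediction errors.

sqft       = [2000, 2100, 1100, 5500]
sale_price = [810, 907, 312, 2600]    # in $K
prediction = [720, 800, 350, 2600]    # in $K

errors = [y - yhat for y, yhat in zip(sale_price, prediction)]
squared_errors = [e ** 2 for e in errors]

print("errors (K):", errors)              # [90, 107, -38, 0]
print("squared errors:", squared_errors)  # [8100, 11449, 1444, 0]
print("total squared error:", sum(squared_errors))
```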
Algorithm
Linear regression

Setup:
• Training data: N examples {(x_n, y_n)}, n = 1, …, N, with inputs x_n ∈ R^D and outputs y_n ∈ R
• Model: ŷ = w_0 + w_1 x_1 + … + w_D x_D = w̃^T x̃, where x̃ = (1, x_1, …, x_D)^T and w̃ = (w_0, w_1, …, w_D)^T
• Objective: the residual sum of squares RSS(w̃) = Σ_n (y_n − w̃^T x̃_n)²
Univariate solution
A simple case: x is just one-dimensional (D=1)

Stationary points:
Take the derivative with respect to each parameter and set it to zero:

∂RSS(w̃)/∂w_0 = 0  ⇒  −2 Σ_n [y_n − (w_0 + w_1 x_n)] = 0
∂RSS(w̃)/∂w_1 = 0  ⇒  −2 Σ_n [y_n − (w_0 + w_1 x_n)] x_n = 0

For the data in the example, solving these two equations gives w_1 ≈ 1.6 and w_0 ≈ 0.45.
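Solving the two stationarity conditions in closed form gives the familiar formulas w_1 = Σ_n (x_n − x̄)(y_n − ȳ) / Σ_n (x_n − x̄)² and w_0 = ȳ − w_1 x̄, with x̄ and ȳ the sample means. A minimal sketch follows; the toy data here is made up for illustration and is not the dataset behind the w_1 ≈ 1.6, w_0 ≈ 0.45 numbers above.

```python
import numpy as np

# Univariate least squares from the stationarity conditions:
#   w1 = sum_n (x_n - xbar)(y_n - ybar) / sum_n (x_n - xbar)^2
#   w0 = ybar - w1 * xbar
# The toy data below is made up for illustration.

x = np.array([0.5, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.1, 3.5, 5.3, 6.8])

xbar, ybar = x.mean(), y.mean()
w1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
w0 = ybar - w1 * xbar

print(f"w1 = {w1:.3f}, w0 = {w0:.3f}")
# Sanity check against numpy's degree-1 polynomial fit.
print(np.polyfit(x, y, 1))  # returns [w1, w0]
```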
Multivariate Solution
Least Mean Squares when x is D-dimensional

Expanding RSS(w̃) = Σ_n (y_n − w̃^T x̃_n)² leads to

RSS(w̃) = Σ_n (y_n − w̃^T x̃_n)(y_n − x̃_n^T w̃)
        = Σ_n [ w̃^T x̃_n x̃_n^T w̃ − 2 y_n x̃_n^T w̃ ] + const.
        = w̃^T ( Σ_n x̃_n x̃_n^T ) w̃ − 2 ( Σ_n y_n x̃_n^T ) w̃ + const.
RSS(w̃) in new notations

Collect the data into a design matrix and an output vector:

X̃ = [ x̃_1^T ; x̃_2^T ; … ; x̃_N^T ]  (one row per example),   y = (y_1, …, y_N)^T

Compact expression:

RSS(w̃) = ‖X̃w̃ − y‖_2² = w̃^T X̃^T X̃ w̃ − 2 (X̃^T y)^T w̃ + const
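A small numpy sketch of this notation: build X̃ by prepending a column of ones to the raw inputs, and check that ‖X̃w̃ − y‖² equals the example-by-example sum of squared residuals. The data and candidate weights are made-up illustrations.

```python
import numpy as np

# Build the augmented design matrix X_tilde = [1, x] (one row per example)
# and check that ||X_tilde @ w - y||^2 matches sum_n (y_n - w^T x_n)^2.
# Data and weights are made up for illustration.

rng = np.random.default_rng(0)
N, D = 5, 3
X = rng.normal(size=(N, D))                  # raw inputs
y = rng.normal(size=N)                       # outputs
X_tilde = np.hstack([np.ones((N, 1)), X])    # prepend the bias column

w_tilde = rng.normal(size=D + 1)             # some candidate parameters

rss_compact = np.sum((X_tilde @ w_tilde - y) ** 2)
rss_per_row = sum((y[n] - w_tilde @ X_tilde[n]) ** 2 for n in range(N))

print(np.isclose(rss_compact, rss_per_row))  # True
```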
Example: RSS(w̃) in compact form

Compact expression:

RSS(w̃) = ‖X̃w̃ − y‖_2² = w̃^T X̃^T X̃ w̃ − 2 (X̃^T y)^T w̃ + const
Solution in matrix form

Compact expression:

RSS(w̃) = ‖X̃w̃ − y‖_2² = w̃^T X̃^T X̃ w̃ − 2 (X̃^T y)^T w̃ + const

Useful gradient identities:
• ∇_x (b^T x) = b
• ∇_x (x^T A x) = 2Ax (for symmetric A)

Normal equation: setting ∇_w̃ RSS(w̃) = 2 X̃^T X̃ w̃ − 2 X̃^T y = 0 gives

X̃^T X̃ w̃ = X̃^T y   ⇒   w̃* = (X̃^T X̃)^(−1) X̃^T y   (when X̃^T X̃ is invertible)
Example: RSS(w̃) in compact form

Can use solvers in Matlab, Python, etc., to compute this for any given X̃ and y.
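For instance, a minimal numpy sketch on synthetic data (the data itself is made up; np.linalg.lstsq is generally preferred over explicitly forming and inverting X̃^T X̃):

```python
import numpy as np

# Solve the normal equation X^T X w = X^T y on synthetic data.
# np.linalg.lstsq is the numerically preferred route; solving the normal
# equation directly is shown only to mirror the formula on the slide.

rng = np.random.default_rng(1)
N, D = 100, 3
X = rng.normal(size=(N, D))
X_tilde = np.hstack([np.ones((N, 1)), X])          # bias column
true_w = np.array([0.5, 2.0, -1.0, 3.0])           # assumed ground truth
y = X_tilde @ true_w + 0.1 * rng.normal(size=N)    # noisy observations

# Route 1: normal equation (as on the slide).
w_normal = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)

# Route 2: least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))  # True (up to numerical error)
print(w_lstsq)                         # close to true_w
```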
Exercise: RSS(w̃) in compact form

(Here x̄ = (1/N) Σ_n x_n and ȳ = (1/N) Σ_n y_n denote the sample means.)
Probabilistic interpretation
Why is minimizing RSS sensible?
Recall the geometric picture: each house gives a line c_n^T w = y_n, and the stacked residual vector is

r = ( c_1^T w − y_1,  c_2^T w − y_2,  …,  c_4^T w − y_4 )^T = Aw − y
Probabilistic interpretation
• Noisy observation model:
Y = w_0 + w_1 X + η,   with Gaussian noise η ∼ N(0, σ²)
Probabilistic interpretation (cont’d)

log P(D) = log ∏_{n=1}^N p(y_n | x_n) = Σ_n log p(y_n | x_n)

         = Σ_n ( − [y_n − (w_0 + w_1 x_n)]² / (2σ²) − log √(2π) σ )

         = − (1/(2σ²)) Σ_n [y_n − (w_0 + w_1 x_n)]² − (N/2) log σ² − N log √(2π)

         = − (1/2) { (1/σ²) Σ_n [y_n − (w_0 + w_1 x_n)]² + N log σ² } + const
Maximum likelihood estimation

• Maximize over s = σ²:

∂ log P(D)/∂s = − (1/2) { − (1/s²) Σ_n [y_n − (w_0 + w_1 x_n)]² + N/s } = 0

  ⇒  σ*² = s* = (1/N) Σ_n [y_n − (w_0 + w_1 x_n)]²
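A short sketch tying the two MLE results together on synthetic data: the least-squares fit gives the MLE of (w_0, w_1), and the mean squared residual gives the MLE of σ². The data-generating values (including reusing w_0 ≈ 0.45, w_1 ≈ 1.6 from the earlier example) are assumptions for illustration.

```python
import numpy as np

# MLE for the Gaussian-noise model y = w0 + w1 * x + eta.
# The generating parameters below are assumptions for illustration.

rng = np.random.default_rng(2)
N = 200
x = rng.uniform(0, 10, size=N)
y = 0.45 + 1.6 * x + rng.normal(scale=0.5, size=N)   # true sigma = 0.5

# MLE of (w0, w1) coincides with the least-squares fit.
w1, w0 = np.polyfit(x, y, 1)

# MLE of the noise variance: the mean squared residual.
residuals = y - (w0 + w1 * x)
sigma2_mle = np.mean(residuals ** 2)

print(f"w0 = {w0:.3f}, w1 = {w1:.3f}, sigma^2 = {sigma2_mle:.3f}")
```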
How does this probabilistic interpretation help us?
Computational and numerical optimization
Computational complexity of the Least Squares Solution
Alternative method: Batch Gradient Descent
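When D is large, forming and solving the normal equation (roughly O(ND²) to form X̃^T X̃ plus O(D³) to solve) becomes expensive, which motivates iterative methods. Below is a minimal sketch of the standard batch update w ← w − η ∇RSS(w) with ∇RSS(w) = 2 X̃^T (X̃w − y); the step size, iteration count, and synthetic data are assumptions for illustration.

```python
import numpy as np

# Batch gradient descent on RSS(w) = ||Xw - y||^2.
# Gradient: grad RSS(w) = 2 X^T (X w - y); update: w <- w - eta * grad.
# Step size, iteration count, and data are illustrative assumptions.

rng = np.random.default_rng(3)
N, D = 200, 4
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D - 1))])
true_w = rng.normal(size=D)
y = X @ true_w + 0.1 * rng.normal(size=N)

eta = 1e-3                # learning rate (assumed; must be small vs. curvature 2 X^T X)
w = np.zeros(D)
for t in range(2000):
    grad = 2 * X.T @ (X @ w - y)   # uses the full batch each iteration: O(ND) per step
    w = w - eta * grad

# Should agree with the least-squares solution.
print(np.allclose(w, np.linalg.lstsq(X, y, rcond=None)[0], atol=1e-3))
```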
Why would this work?

Hessian of RSS:

RSS(w) = w^T X^T X w − 2 (X^T y)^T w + const
  ⇒  ∂²RSS(w) / ∂w ∂w^T = 2 X^T X

X^T X is positive semidefinite, because for any v:  v^T X^T X v = ‖Xv‖_2² ≥ 0.
So RSS is convex, and gradient descent with a suitable step size converges to a global minimizer.
Stochastic gradient descent (SGD)

g_t = (x_t^T w^(t) − y_t) x_t

How does the complexity per iteration compare with gradient descent?
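Following the per-example gradient g_t above, here is a minimal SGD sketch (learning rate, step count, and data are assumptions for illustration). Each iteration touches one row of X, costing O(D), instead of the O(ND) of a full batch gradient step, which is the contrast the question above points to.

```python
import numpy as np

# Stochastic gradient descent for least squares: each step uses a single
# example's gradient g_t = (x_t^T w - y_t) x_t instead of the full batch.
# Learning rate, number of steps, and data are illustrative assumptions.

rng = np.random.default_rng(4)
N, D = 500, 4
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D - 1))])
true_w = rng.normal(size=D)
y = X @ true_w + 0.1 * rng.normal(size=N)

w = np.zeros(D)
eta = 0.01
for t in range(20000):
    i = rng.integers(N)                 # pick one example at random
    g = (X[i] @ w - y[i]) * X[i]        # per-example gradient: O(D) work
    w = w - eta * g

print(np.round(w - true_w, 2))          # close to zero
```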
SGD versus Batch GD
How to Choose Learning Rate η in practice?
Mini-Summary