
18-661 Introduction to Machine Learning

Linear Regression – I

Spring 2020
ECE – Carnegie Mellon University
Outline

1. Recap of MLE/MAP

2. Linear Regression
Motivation
Algorithm
Univariate solution
Multivariate Solution
Probabilistic interpretation
Computational and numerical optimization

1
Recap of MLE/MAP
Dogecoin

• Scenario: You find a coin on the ground.

• You ask yourself: Is this a fair or biased coin? What is the probability that I will flip heads?

2
• You flip the coin 10 times . . .
• It comes up as ’H’ 8 times and ’T’ 2 times

• Can we learn from this data?

3
Machine Learning Pipeline

[Pipeline: data → feature extraction → ML method (model & parameters, optimization) → intelligence (evaluation)]

Two approaches that we discussed:

• Maximum Likelihood Estimation (MLE)
• Maximum a Posteriori Estimation (MAP)

4
Maximum Likelihood Estimation (MLE)

• Data: Observed set D of nH heads and nT tails
• Model: Each flip follows a Bernoulli distribution
  P(H) = θ, P(T) = 1 − θ, θ ∈ [0, 1]
  Thus, the likelihood of observing sequence D is

  P(D | θ) = θ^nH (1 − θ)^nT

• Question: Given this model and the data we've observed, can we
  calculate an estimate of θ?
• MLE: Choose θ that maximizes the likelihood of the observed data

  θ̂MLE = arg max_θ P(D | θ) = arg max_θ log P(D | θ) = nH / (nH + nT)

5
MAP for Dogecoin

  θ̂MAP = arg max_θ P(θ | D) = arg max_θ P(D | θ) P(θ)

• Recall that P(D | θ) = θ^nH (1 − θ)^nT
• How should we set the prior, P(θ)?
• Common choice for a binomial likelihood is to use the Beta
  distribution, θ ∼ Beta(α, β):

  P(θ) = (1 / B(α, β)) θ^(α−1) (1 − θ)^(β−1)

• Interpretation: α = number of expected heads, β = number of
  expected tails. Larger value of α + β denotes more confidence (and
  smaller variance).

6
Putting it all together

  θ̂MLE = nH / (nH + nT)
  θ̂MAP = (α + nH − 1) / (α + β + nH + nT − 2)

• Suppose θ* := 0.5 and we observe: D = {H, H, T, T, T, T}
• Scenario 1: We assume θ ∼ Beta(4, 4). Which is more accurate –
  θ̂MLE or θ̂MAP?
  • θ̂MAP = 5/12, θ̂MLE = 1/3
• Scenario 2: We assume θ ∼ Beta(1, 7). Which is more accurate –
  θ̂MLE or θ̂MAP?
  • θ̂MAP = 1/6, θ̂MLE = 1/3

7
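As a quick numerical sanity check of the two scenarios above, here is a minimal Python sketch (ours, not from the slides); the function names are our own.

```python
def theta_mle(n_h, n_t):
    # MLE: n_H / (n_H + n_T)
    return n_h / (n_h + n_t)

def theta_map(n_h, n_t, alpha, beta):
    # MAP with a Beta(alpha, beta) prior:
    # (alpha + n_H - 1) / (alpha + beta + n_H + n_T - 2)
    return (alpha + n_h - 1) / (alpha + beta + n_h + n_t - 2)

# D = {H, H, T, T, T, T}  ->  n_H = 2, n_T = 4, true theta* = 0.5
n_h, n_t = 2, 4
print(theta_mle(n_h, n_t))           # 0.333... = 1/3
print(theta_map(n_h, n_t, 4, 4))     # 0.4166... = 5/12 (closer to 0.5)
print(theta_map(n_h, n_t, 1, 7))     # 0.1666... = 1/6 (prior pulls the wrong way)
```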
Linear Regression
Recap of MLE/MAP

Linear Regression
Motivation
Algorithm
Univariate solution
Multivariate Solution
Probabilistic interpretation
Computational and numerical optimization

8
Task 1: Regression

How much should you sell your house for?

[Pipeline: data → regression → intelligence]

[Scatter plot: price ($) vs. house size, with a new house whose price = ?? is to be predicted]

input: houses & features    learn: x → y relationship    predict: y (continuous)

Course Covers: Linear/Ridge Regression, Loss Function, SGD, Feature
Scaling, Regularization, Cross Validation

9
Supervised Learning

Supervised learning
In a supervised learning problem, you have access to input variables (X)
and outputs (Y), and the goal is to predict an output given an input.

• Examples:
• Housing prices (Regression): predict the price of a house based on
features (size, location, etc)
• Cat vs. Dog (Classification): predict whether a picture is of a cat
or a dog

10
Regression

Predicting a continuous outcome variable:


• Predicting a company’s future stock price using its profit and other
financial info
• Predicting annual rainfall based on local flora and fauna
• Predicting distance from a traffic light using LIDAR measurements

Magnitude of the error matters:

• We can measure ‘closeness’ of predictions and labels, leading to
  different ways to evaluate prediction errors.
• Predicting stock price: better to be off by $1 than by $20
• Predicting distance from a traffic light: better to be off by 1 m than
  by 10 m
• We should choose learning models and algorithms accordingly.

11
Ex: predicting the sale price of a house

Retrieve historical sales records


(This will be our training data)

12
Features used to predict

13
Correlation between square footage and sale price

14
Roughly linear relationship

Sale price ≈ price per sqft × square footage + fixed expense

15
Data Can be Compactly Represented by Matrices

[Scatter plot: price ($) vs. house size, with a new house whose price = ?? is to be predicted]

• Learn parameters (w0, w1) of the orange line y = w1 x + w0

  House 1 (1000 sq.ft): 1000 × w1 + w0 = 200,000
  House 2 (2000 sq.ft): 2000 × w1 + w0 = 350,000

• Can represent compactly in matrix notation

  [1000 1; 2000 1] [w1; w0] = [200,000; 350,000]

16
Some Concepts That You Should Know

• Invertibility of Matrices and Computing Inverses


• Vector Norms – L2, Frobenius etc., Inner Products
• Eigenvalues and Eigenvectors
• Singular Value Decomposition
• Covariance Matrices and Positive Semi-definiteness

Excellent Resources:

• Essence of Linear Algebra YouTube Series


• Prof. Gilbert Strang’s course at MIT

17
Matrix Inverse

• Let us solve the house-price prediction problem

  [1000 1; 2000 1] [w1; w0] = [200,000; 350,000]                  (1)
  [w1; w0] = [1000 1; 2000 1]⁻¹ [200,000; 350,000]                (2)
           = (1/−1000) [1 −1; −2000 1000] [200,000; 350,000]      (3)
           = (1/−1000) [−150,000; −5 × 10⁷]                       (4)
  [w1; w0] = [150; 50,000]                                        (5)

18
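To double-check this small system numerically, a minimal NumPy sketch (ours, not from the slides):

```python
import numpy as np

# House 1: 1000*w1 + w0 = 200,000;  House 2: 2000*w1 + w0 = 350,000
A = np.array([[1000.0, 1.0],
              [2000.0, 1.0]])
b = np.array([200_000.0, 350_000.0])

w1, w0 = np.linalg.solve(A, b)   # solves A @ [w1, w0] = b
print(w1, w0)                    # 150.0 50000.0
```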
You could have data from many houses

• Sale price =
price per sqft × square footage + fixed expense + unexplainable stuff
• Want to learn the price per sqft and fixed expense
• Training data: past sales record.

sqft sale price


2000 800K
2100 907K
1100 312K
5500 2,600K
··· ···

Problem: there isn't a w = [w1, w0]ᵀ that will satisfy all equations

19
Want to predict the best price per sqft and fixed expense

• Sale price =
price per sqft × square footage + fixed expense + unexplainable stuff
• Want to learn the price per sqft and fixed expense
• Training data: past sales record.

sqft sale price prediction


2000 810K 720K
2100 907K 800K
1100 312K 350K
5500 2,600K 2,600K
··· ··· ···

20
Reduce prediction error

How to measure errors?

• absolute difference: |prediction − sale price|.
• squared difference: (prediction − sale price)² [differentiable!].

  sqft    sale price    prediction    abs error    squared error
  2000    810K          720K          90K          90² = 8100
  2100    907K          800K          107K         107²
  1100    312K          350K          38K          38²
  5500    2,600K        2,600K        0            0
  ···     ···           ···           ···          ···

21
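A minimal NumPy sketch (ours) of the two error measures on the rows above (values in K):

```python
import numpy as np

sale_price = np.array([810.0, 907.0, 312.0, 2600.0])   # in K
prediction = np.array([720.0, 800.0, 350.0, 2600.0])   # in K

abs_error = np.abs(prediction - sale_price)       # [90, 107, 38, 0]
squared_error = (prediction - sale_price) ** 2    # [90^2, 107^2, 38^2, 0]
print(abs_error, squared_error)
```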
Geometric Illustration: Each house corresponds to one line

[Figure: the lines c1ᵀw = y1, c2ᵀw = y2, c3ᵀw = y3, c4ᵀw = y4 in the plane of w, and the
residual vector r = [c1ᵀw − y1; c2ᵀw − y2; . . . ; c4ᵀw − y4] = Aw − y]

• Want to find w that minimizes the difference between Xw and y
• But since this is a vector, we need an operator that can map the
  residual vector r(w) = y − Xw to a scalar

22
Norms and Loss Functions

• A vector norm is any function f : Rⁿ → R with
  • f(x) ≥ 0 and f(x) = 0 ⇐⇒ x = 0
  • f(ax) = |a| f(x) for a ∈ R
  • triangle inequality: f(x + y) ≤ f(x) + f(y)
  • e.g., ℓ2 norm: ‖x‖₂ = √(xᵀx) = √(Σ_{i=1}^n xᵢ²)
  • e.g., ℓ1 norm: ‖x‖₁ = Σ_{i=1}^n |xᵢ|
  • e.g., ℓ∞ norm: ‖x‖∞ = maxᵢ |xᵢ|

[Figure: from inside to outside, the ℓ1, ℓ2, ℓ∞ norm balls]

23
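A minimal NumPy sketch (ours) of these three norms applied to a residual vector:

```python
import numpy as np

r = np.array([90.0, 107.0, -38.0, 0.0])   # hypothetical residual vector

print(np.linalg.norm(r, ord=1))        # l1 norm: sum of |r_i| = 235.0
print(np.linalg.norm(r, ord=2))        # l2 norm: sqrt(sum of r_i^2)
print(np.linalg.norm(r, ord=np.inf))   # l-infinity norm: max |r_i| = 107.0
```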


Minimize squared errors

Our model:
Sale price =
price per sqft × square footage + fixed expense + unexplainable stuff

Training data:

  sqft    sale price    prediction    error    squared error
  2000    810K          720K          90K      90² = 8100
  2100    907K          800K          107K     107²
  1100    312K          350K          38K      38²
  5500    2,600K        2,600K        0        0
  ···     ···           ···           ···      ···
  Total   8100 + 107² + 38² + 0 + · · ·

Aim:
Adjust price per sqft and fixed expense such that the sum of the squared
errors is minimized — i.e., the unexplainable stuff is minimized.

24
Recap of MLE/MAP

Linear Regression
Motivation
Algorithm
Univariate solution
Multivariate Solution
Probabilistic interpretation
Computational and numerical optimization

25
Linear regression

Setup:

• Input: x ∈ R^D (covariates, predictors, features, etc.)
• Output: y ∈ R (responses, targets, outcomes, outputs, etc.)
• Model: f : x → y, with f(x) = w0 + Σ_{d=1}^D wd xd = w0 + wᵀx.
  • w = [w1 w2 · · · wD]ᵀ: weights, parameters, or parameter vector
  • w0 is called the bias.
  • Sometimes, we also call w̃ = [w0 w1 w2 · · · wD]ᵀ the parameters.
• Training data: D = {(xn, yn), n = 1, 2, . . . , N}

Minimize the Residual sum of squares:

  RSS(w̃) = Σ_{n=1}^N [yn − f(xn)]² = Σ_{n=1}^N [yn − (w0 + Σ_{d=1}^D wd xnd)]²

26
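The following minimal NumPy sketch (ours, with made-up data) spells out the model f and the RSS objective exactly as defined above:

```python
import numpy as np

def predict(w0, w, X):
    """Linear model f(x) = w0 + w^T x, applied row-wise to an N x D matrix X."""
    return w0 + X @ w

def rss(w0, w, X, y):
    """Residual sum of squares over the training data."""
    residuals = y - predict(w0, w, X)
    return np.sum(residuals ** 2)

# Tiny illustration: one feature (sqft in 1000's), prices in 100k
X = np.array([[1.0], [2.0], [1.5], [2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])
print(rss(0.45, np.array([1.6]), X, y))   # RSS of one particular (w0, w1)
```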
Recap of MLE/MAP

Linear Regression
Motivation
Algorithm
Univariate solution
Multivariate Solution
Probabilistic interpretation
Computational and numerical optimization

27
A simple case: x is just one-dimensional (D=1)

Residual sum of squares:

  RSS(w̃) = Σ_n [yn − f(xn)]² = Σ_n [yn − (w0 + w1 xn)]²

What kind of function is this? CONVEX (has a unique global minimum)

28
A simple case: x is just one-dimensional (D=1)

Residual sum of squares:

  RSS(w̃) = Σ_n [yn − f(xn)]² = Σ_n [yn − (w0 + w1 xn)]²

Stationary points:
Take the derivatives with respect to the parameters and set them to zero

  ∂RSS(w̃)/∂w0 = 0 ⇒ −2 Σ_n [yn − (w0 + w1 xn)] = 0,
  ∂RSS(w̃)/∂w1 = 0 ⇒ −2 Σ_n [yn − (w0 + w1 xn)] xn = 0.

29
A simple case: x is just one-dimensional (D=1)

  ∂RSS(w̃)/∂w0 = 0 ⇒ −2 Σ_n [yn − (w0 + w1 xn)] = 0
  ∂RSS(w̃)/∂w1 = 0 ⇒ −2 Σ_n [yn − (w0 + w1 xn)] xn = 0

Simplify these expressions to get the “Normal Equations”:

  Σ_n yn = N w0 + w1 Σ_n xn
  Σ_n xn yn = w0 Σ_n xn + w1 Σ_n xn²

Solving the system we obtain the least squares coefficient estimates:

  w1 = Σ_n (xn − x̄)(yn − ȳ) / Σ_n (xn − x̄)²   and   w0 = ȳ − w1 x̄

where x̄ = (1/N) Σ_n xn and ȳ = (1/N) Σ_n yn.

30
Example

  sqft (1000’s)    sale price (100k)
  1                2
  2                3.5
  1.5              3
  2.5              4.5

Residual sum of squares:

  RSS(w̃) = Σ_n [yn − f(xn)]² = Σ_n [yn − (w0 + w1 xn)]²

The w1 and w0 that minimize this are given by:

  w1 ≈ 1.6
  w0 ≈ 0.45

32
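A minimal NumPy sketch (ours) that evaluates the closed-form univariate estimates on this table:

```python
import numpy as np

x = np.array([1.0, 2.0, 1.5, 2.5])   # sqft in 1000's
y = np.array([2.0, 3.5, 3.0, 4.5])   # sale price in 100k

x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar
print(w1, w0)   # 1.6 0.45
```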
Recap of MLE/MAP

Linear Regression
Motivation
Algorithm
Univariate solution
Multivariate Solution
Probabilistic interpretation
Computational and numerical optimization

33
Least Mean Squares when x is D-dimensional

  sqft (1000’s)    bedrooms    bathrooms    sale price (100k)
  1                2           1            2
  2                2           2            3.5
  1.5              3           2            3
  2.5              4           2.5          4.5

RSS(w̃) in matrix form:

  RSS(w̃) = Σ_n [yn − (w0 + Σ_d wd xnd)]² = Σ_n [yn − w̃ᵀx̃n]²,

where we have redefined some variables (by augmenting)

  x̃ ← [1 x1 x2 . . . xD]ᵀ,   w̃ ← [w0 w1 w2 . . . wD]ᵀ

34
Least Mean Squares when x is D-dimensional

RSS(w̃) in matrix form:

  RSS(w̃) = Σ_n [yn − (w0 + Σ_d wd xnd)]² = Σ_n [yn − w̃ᵀx̃n]²,

where we have redefined some variables (by augmenting)

  x̃ ← [1 x1 x2 . . . xD]ᵀ,   w̃ ← [w0 w1 w2 . . . wD]ᵀ

which leads to

  RSS(w̃) = Σ_n (yn − w̃ᵀx̃n)(yn − x̃nᵀw̃)
          = Σ_n ( w̃ᵀ x̃n x̃nᵀ w̃ − 2 yn x̃nᵀ w̃ ) + const.
          = w̃ᵀ ( Σ_n x̃n x̃nᵀ ) w̃ − 2 ( Σ_n yn x̃nᵀ ) w̃ + const.

35
RSS(w̃) in new notations

From the previous slide:

  RSS(w̃) = w̃ᵀ ( Σ_n x̃n x̃nᵀ ) w̃ − 2 ( Σ_n yn x̃nᵀ ) w̃ + const.

Design matrix and target vector:

  X̃ = [x̃1ᵀ; x̃2ᵀ; . . . ; x̃Nᵀ] ∈ R^(N×(D+1)),   y = [y1; y2; . . . ; yN] ∈ R^N

Compact expression:

  RSS(w̃) = ‖X̃w̃ − y‖₂² = w̃ᵀX̃ᵀX̃w̃ − 2 (X̃ᵀy)ᵀ w̃ + const

36
Example: RSS(w̃) in compact form

  sqft (1000’s)    bedrooms    bathrooms    sale price (100k)
  1                2           1            2
  2                2           2            3.5
  1.5              3           2            3
  2.5              4           2.5          4.5

Design matrix and target vector:

      [1  1    2  1  ]        [2  ]
  X̃ = [1  2    2  2  ],   y = [3.5]
      [1  1.5  3  2  ]        [3  ]
      [1  2.5  4  2.5]        [4.5]

Compact expression:

  RSS(w̃) = ‖X̃w̃ − y‖₂² = w̃ᵀX̃ᵀX̃w̃ − 2 (X̃ᵀy)ᵀ w̃ + const

38
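A minimal NumPy sketch (ours) that builds this design matrix and evaluates the compact RSS expression for a candidate w̃:

```python
import numpy as np

X = np.array([[1.0, 2, 1],        # raw features: sqft (1000's), bedrooms, bathrooms
              [2.0, 2, 2],
              [1.5, 3, 2],
              [2.5, 4, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])

# Augment with a leading column of ones for the bias w0
X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])

def rss(w_tilde):
    r = X_tilde @ w_tilde - y
    return r @ r                  # squared l2 norm of the residual

print(rss(np.zeros(4)))           # RSS of the all-zero parameter vector
```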
Solution in matrix form

Compact expression

  RSS(w̃) = ‖X̃w̃ − y‖₂² = w̃ᵀX̃ᵀX̃w̃ − 2 (X̃ᵀy)ᵀ w̃ + const

Gradients of Linear and Quadratic Functions

• ∇x (bᵀx) = b
• ∇x (xᵀAx) = 2Ax (symmetric A)

Normal equation

  ∇w̃ RSS(w̃) = 2X̃ᵀX̃w̃ − 2X̃ᵀy = 0

This leads to the least-mean-squares (LMS) solution

  w̃LMS = (X̃ᵀX̃)⁻¹ X̃ᵀy

39
Example: RSS(w̃) in compact form

  sqft (1000’s)    bedrooms    bathrooms    sale price (100k)
  1                2           1            2
  2                2           2            3.5
  1.5              3           2            3
  2.5              4           2.5          4.5

Write the least-mean-squares (LMS) solution

  w̃LMS = (X̃ᵀX̃)⁻¹ X̃ᵀy

Can use solvers in Matlab, Python etc., to compute this for any given X̃
and y.

40
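For instance, a minimal NumPy sketch (ours) for the table above; np.linalg.lstsq is used instead of an explicit inverse, which is the numerically safer way to evaluate the same formula:

```python
import numpy as np

X = np.array([[1.0, 2, 1],
              [2.0, 2, 2],
              [1.5, 3, 2],
              [2.5, 4, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])

X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the bias column

# Least-squares solution; equivalent to (X~^T X~)^{-1} X~^T y when X~^T X~ is invertible
w_tilde, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)
print(w_tilde)   # [w0, w_sqft, w_bedrooms, w_bathrooms]
```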
Exercise: RSS(w̃) in compact form

Using the general least-mean-squares (LMS) solution

  w̃LMS = (X̃ᵀX̃)⁻¹ X̃ᵀy

recover the uni-variate solution that we had computed earlier:

  w1 = Σ_n (xn − x̄)(yn − ȳ) / Σ_n (xn − x̄)²   and   w0 = ȳ − w1 x̄

where x̄ = (1/N) Σ_n xn and ȳ = (1/N) Σ_n yn.

41
Exercise: RSS(w̃) in compact form

For the 1-D case, the least-mean-squares solution is

  w̃LMS = (X̃ᵀX̃)⁻¹ X̃ᵀy
        = ( [1 1 . . . 1; x1 x2 . . . xN] [1 x1; 1 x2; . . . ; 1 xN] )⁻¹ [1 1 . . . 1; x1 x2 . . . xN] [y1; y2; . . . ; yN]
        = [N, N x̄; N x̄, Σ_n xn²]⁻¹ [Σ_n yn; Σ_n xn yn]

  [w0; w1] = (1 / Σ_n (xn − x̄)²) [ȳ Σ_n (xn − x̄)² − x̄ Σ_n (xn − x̄)(yn − ȳ); Σ_n (xn − x̄)(yn − ȳ)]

where x̄ = (1/N) Σ_n xn and ȳ = (1/N) Σ_n yn.

42
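A minimal NumPy check (ours) that the general matrix solution and the univariate closed form agree on the earlier toy data:

```python
import numpy as np

x = np.array([1.0, 2.0, 1.5, 2.5])
y = np.array([2.0, 3.5, 3.0, 4.5])

# General LMS solution with the augmented design matrix [1, x]
X_tilde = np.column_stack([np.ones_like(x), x])
w0_lms, w1_lms = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)

# Univariate closed form
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

print(np.allclose([w0_lms, w1_lms], [w0, w1]))   # True
```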
Recap of MLE/MAP

Linear Regression
Motivation
Algorithm
Univariate solution
Multivariate Solution
Probabilistic interpretation
Computational and numerical optimization

43
Why is minimizing RSS sensible?

[Figure: the lines c1ᵀw = y1, . . . , c4ᵀw = y4 in the plane of w, and the residual vector
r = [c1ᵀw − y1; c2ᵀw − y2; . . . ; c4ᵀw − y4] = Aw − y]

• Want to find w that minimizes the difference between Xw and y
• But since this is a vector, we need an operator that can map the
  residual vector r(w) = y − Xw to a scalar
• We take the sum of the squares of the elements of r(w)

44
Why is minimizing RSS sensible?

Probabilistic interpretation
• Noisy observation model:

  Y = w0 + w1 X + η

  where η ∼ N(0, σ²) is a Gaussian random variable

• Conditional likelihood of one training sample:

  p(yn | xn) = N(w0 + w1 xn, σ²) = (1 / (√(2π) σ)) e^(−[yn − (w0 + w1 xn)]² / (2σ²))

45
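A minimal NumPy sketch (ours, with made-up parameter values) of this noisy observation model; fitting least squares to the generated (x, y) pairs should roughly recover w0 and w1:

```python
import numpy as np

rng = np.random.default_rng(0)

w0, w1, sigma = 0.45, 1.6, 0.2        # hypothetical "true" parameters
x = rng.uniform(1.0, 3.0, size=100)   # house sizes in 1000's of sqft

# y = w0 + w1*x + eta, with eta ~ N(0, sigma^2)
y = w0 + w1 * x + rng.normal(0.0, sigma, size=x.shape)
```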
Probabilistic interpretation (cont’d)

Log-likelihood of the training data D (assuming i.i.d.):

  log P(D) = log Π_{n=1}^N p(yn | xn) = Σ_n log p(yn | xn)
           = Σ_n ( −[yn − (w0 + w1 xn)]² / (2σ²) − log √(2π) σ )
           = −(1 / (2σ²)) Σ_n [yn − (w0 + w1 xn)]² − (N/2) log σ² − N log √(2π)
           = −(1/2) ( (1/σ²) Σ_n [yn − (w0 + w1 xn)]² + N log σ² ) + const

What is the relationship between minimizing RSS and maximizing the
log-likelihood?

46
Maximum likelihood estimation

Estimating σ, w0 and w1 can be done in two steps

• Maximize over w0 and w1:

  max log P(D) ⇔ min Σ_n [yn − (w0 + w1 xn)]²   ← This is RSS(w̃)!

• Maximize over s = σ²:

  ∂ log P(D)/∂s = −(1/2) ( −(1/s²) Σ_n [yn − (w0 + w1 xn)]² + N/s ) = 0
  → σ*² = s* = (1/N) Σ_n [yn − (w0 + w1 xn)]²

47
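A minimal NumPy sketch (ours) of this two-step estimate on the toy housing data: fit (w0, w1) by least squares, then take the mean squared residual as the noise-variance MLE:

```python
import numpy as np

x = np.array([1.0, 2.0, 1.5, 2.5])
y = np.array([2.0, 3.5, 3.0, 4.5])

# Step 1: least-squares fit of (w0, w1)
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

# Step 2: MLE of the noise variance = mean squared residual
residuals = y - (w0 + w1 * x)
sigma2_mle = np.mean(residuals ** 2)
print(sigma2_mle)   # 0.0125 for this toy data
```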
How does this probabilistic interpretation help us?

• It gives a solid footing to our intuition: minimizing RSS(w̃) is a
sensible thing based on reasonable modeling assumptions.
• Estimating σ ∗ tells us how much noise there is in our predictions.
For example, it allows us to place confidence intervals around our
predictions.

48
Recap of MLE/MAP

Linear Regression
Motivation
Algorithm
Univariate solution
Multivariate Solution
Probabilistic interpretation
Computational and numerical optimization

49
Computational complexity of the Least Squares Solution

Bottleneck of computing the solution?

  w = (XᵀX)⁻¹ Xᵀy

• Matrix multiply of XᵀX ∈ R^((D+1)×(D+1))
• Inverting the matrix XᵀX

How many operations do we need?

• O(ND²) for matrix multiplication
• O(D³) (e.g., using Gauss-Jordan elimination) or O(D^2.373) (recent
  theoretical advances) for matrix inversion
• Impractical for very large D or N

50
Alternative method: Batch Gradient Descent

(Batch) Gradient descent

• Initialize w to w(0) (e.g., randomly);
  set t = 0; choose η > 0
• Loop until convergence
  1. Compute the gradient
     ∇RSS(w) = Xᵀ(Xw(t) − y)
  2. Update the parameters
     w(t+1) = w(t) − η ∇RSS(w)
  3. t ← t + 1

What is the complexity of each iteration? O(ND)

51
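A minimal NumPy sketch (ours) of this loop on the toy housing data; the learning rate and iteration count are arbitrary choices, not from the slides:

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.05, num_iters=2000):
    """Minimize RSS(w) = ||Xw - y||^2 by batch gradient descent."""
    w = np.zeros(X.shape[1])          # w(0): start from all zeros
    for _ in range(num_iters):
        grad = X.T @ (X @ w - y)      # gradient of RSS at the current w
        w = w - eta * grad            # step with learning rate eta
    return w

# Augmented design matrix [1, sqft] and prices (same toy data as before)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 1.5], [1.0, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])
print(batch_gradient_descent(X, y))   # approaches [0.45, 1.6]
```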
Why would this work?

If gradient descent converges, it will converge to the same solution as
using matrix inversion.

This is because RSS(w) is a convex function in its parameters w.

Hessian of RSS

  RSS(w) = wᵀXᵀXw − 2 (Xᵀy)ᵀ w + const
  ⇒ ∂²RSS(w) / ∂w ∂wᵀ = 2XᵀX

XᵀX is positive semidefinite, because for any v

  vᵀXᵀXv = ‖Xv‖₂² ≥ 0

52
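A quick numerical illustration (ours) of this positive-semidefiniteness on the toy design matrix:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 1.5], [1.0, 2.5]])

H = 2 * X.T @ X                 # Hessian of RSS(w)
print(np.linalg.eigvalsh(H))    # all eigenvalues >= 0, so RSS is convex
```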
Stochastic gradient descent (SGD)

Widrow-Hoff rule: update parameters using one example at a time

• Initialize w to some w(0); set t = 0; choose η > 0
• Loop until convergence
  1. Randomly choose a training sample xt
  2. Compute its contribution to the gradient
     gt = (xtᵀw(t) − yt) xt
  3. Update the parameters
     w(t+1) = w(t) − η gt
  4. t ← t + 1

How does the complexity per iteration compare with gradient descent?

• O(ND) for gradient descent versus O(D) for SGD

54
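A minimal NumPy sketch (ours) of this rule on the same toy data; the step size, iteration count, and seed are arbitrary choices:

```python
import numpy as np

def sgd(X, y, eta=0.01, num_iters=10000, seed=0):
    """Widrow-Hoff / SGD: one randomly chosen example per update."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        t = rng.integers(len(y))          # pick one training example at random
        g = (X[t] @ w - y[t]) * X[t]      # its contribution to the gradient
        w = w - eta * g                   # O(D) update
    return w

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 1.5], [1.0, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])
print(sgd(X, y))   # noisy, but close to the least-squares solution [0.45, 1.6]
```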
SGD versus Batch GD

• SGD reduces per-iteration complexity from O(ND) to O(D)


• But it is noisier and can take longer to converge

55
How to Choose Learning Rate η in practice?

• Try 0.0001, 0.001, 0.01, 0.1 etc. on a validation dataset (more on
  this later) and choose the one that gives the fastest, stable convergence
• Reduce η by a constant factor (e.g., 10) when learning saturates so
  that we can reach closer to the true minimum (see the sketch after this list).
• More advanced learning rate schedules such as AdaGrad, Adam, and
  AdaDelta are used in practice.

56
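A rough sketch (ours, not from the slides) of the "reduce η when learning saturates" heuristic for batch gradient descent; the decay factor, patience, and tolerance are arbitrary choices:

```python
import numpy as np

def gd_with_step_decay(X, y, eta=0.05, decay=10.0, patience=200, num_iters=3000):
    """Batch GD that divides eta by `decay` when RSS stops improving."""
    w = np.zeros(X.shape[1])
    best_rss = np.inf
    since_improved = 0
    for _ in range(num_iters):
        w = w - eta * (X.T @ (X @ w - y))
        rss = np.sum((X @ w - y) ** 2)
        if rss < best_rss - 1e-12:
            best_rss, since_improved = rss, 0
        else:
            since_improved += 1
            if since_improved >= patience:   # learning has saturated
                eta /= decay                 # reduce the learning rate
                since_improved = 0
    return w

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 1.5], [1.0, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])
print(gd_with_step_decay(X, y))   # approaches [0.45, 1.6]
```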
Mini-Summary

• Linear regression is the linear combination of features
  f : x → y, with f(x) = w0 + Σ_d wd xd = w0 + wᵀx
• If we minimize the residual sum of squares as our learning objective, we
  get a closed-form solution for the parameters
• Probabilistic interpretation: maximum likelihood if we assume the residual
  is Gaussian distributed
• Gradient Descent and mini-batch SGD can overcome computational
  issues

57
