
18-661 Introduction to Machine Learning

Linear Regression – I

Spring 2020
ECE – Carnegie Mellon University
Outline

1. Recap of MLE/MAP

2. Linear Regression
Motivation
Algorithm
Univariate solution
Multivariate Solution
Probabilistic interpretation
Computational and numerical optimization

1
Recap of MLE/MAP
Dogecoin

• Scenario: You find a coin on the ground.

• You ask yourself: Is this a fair or biased coin? What is the probability that I will flip heads?

2
• You flip the coin 10 times . . .
• It comes up as ’H’ 8 times and ’T’ 2 times

• Can we learn from this data?

3
Machine Learning Pipeline

[Pipeline: data → feature extraction → ML method (model & parameters, optimization) → intelligence (evaluation)]

Two approaches that we discussed:

• Maximum Likelihood Estimation (MLE)
• Maximum a Posteriori Estimation (MAP)

4
Maximum Likelihood Estimation (MLE)

• Data: Observed set D of nH heads and nT tails
• Model: Each flip follows a Bernoulli distribution
  P(H) = θ, P(T) = 1 − θ, θ ∈ [0, 1]
  Thus, the likelihood of observing sequence D is

  P(D | θ) = θ^nH (1 − θ)^nT

• Question: Given this model and the data we've observed, can we
  calculate an estimate of θ?
• MLE: Choose θ that maximizes the likelihood of the observed data

  θ̂MLE = arg max_θ P(D | θ) = arg max_θ log P(D | θ) = nH / (nH + nT)

5
MAP for Dogecoin

  θ̂MAP = arg max_θ P(θ | D) = arg max_θ P(D | θ) P(θ)

• Recall that P(D | θ) = θ^nH (1 − θ)^nT
• How should we set the prior, P(θ)?
• Common choice for a binomial likelihood is to use the Beta
  distribution, θ ∼ Beta(α, β):

  P(θ) = (1 / B(α, β)) θ^(α−1) (1 − θ)^(β−1)

• Interpretation: α = number of expected heads, β = number of
  expected tails. Larger value of α + β denotes more confidence (and
  smaller variance).

6
Putting it all together

  θ̂MLE = nH / (nH + nT)
  θ̂MAP = (α + nH − 1) / (α + β + nH + nT − 2)

• Suppose θ* := 0.5 and we observe: D = {H, H, T, T, T, T}
• Scenario 1: We assume θ ∼ Beta(4, 4). Which is more accurate –
  θ̂MLE or θ̂MAP?
  • θ̂MAP = 5/12, θ̂MLE = 1/3
• Scenario 2: We assume θ ∼ Beta(1, 7). Which is more accurate –
  θ̂MLE or θ̂MAP?
  • θ̂MAP = 1/6, θ̂MLE = 1/3

7
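As a quick numerical sanity check of the two scenarios above, here is a minimal Python sketch (ours, not from the slides); the function names are our own.

```python
def theta_mle(n_h, n_t):
    # MLE: n_H / (n_H + n_T)
    return n_h / (n_h + n_t)

def theta_map(n_h, n_t, alpha, beta):
    # MAP with a Beta(alpha, beta) prior:
    # (alpha + n_H - 1) / (alpha + beta + n_H + n_T - 2)
    return (alpha + n_h - 1) / (alpha + beta + n_h + n_t - 2)

# D = {H, H, T, T, T, T}  ->  n_H = 2, n_T = 4, true theta* = 0.5
n_h, n_t = 2, 4
print(theta_mle(n_h, n_t))           # 0.333... = 1/3
print(theta_map(n_h, n_t, 4, 4))     # 0.4166... = 5/12 (closer to 0.5)
print(theta_map(n_h, n_t, 1, 7))     # 0.1666... = 1/6 (prior pulls the wrong way)
```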
Linear Regression
Recap of MLE/MAP

Linear Regression
Motivation
Algorithm
Univariate solution
Multivariate Solution
Probabilistic interpretation
Computational and numerical optimization

8
Task 1: Regression

How much should you sell your house for?

[Pipeline: data → regression → intelligence]

[Scatter plot: price ($) vs. house size, with a new house whose price = ?? is to be predicted]

input: houses & features    learn: x → y relationship    predict: y (continuous)

Course Covers: Linear/Ridge Regression, Loss Function, SGD, Feature
Scaling, Regularization, Cross Validation

9
Supervised Learning

Supervised learning
In a supervised learning problem, you have access to input variables (X)
and outputs (Y), and the goal is to predict an output given an input.

• Examples:
• Housing prices (Regression): predict the price of a house based on
features (size, location, etc)
• Cat vs. Dog (Classification): predict whether a picture is of a cat
or a dog

10
Regression

Predicting a continuous outcome variable:


• Predicting a company’s future stock price using its profit and other
financial info
• Predicting annual rainfall based on local flora and fauna
• Predicting distance from a traffic light using LIDAR measurements

Magnitude of the error matters:

• We can measure ‘closeness’ of predictions and labels, leading to
  different ways to evaluate prediction errors.
• Predicting stock price: better to be off by $1 than by $20
• Predicting distance from a traffic light: better to be off by 1 m than
  by 10 m
• We should choose learning models and algorithms accordingly.

11
Ex: predicting the sale price of a house

Retrieve historical sales records


(This will be our training data)

12
Features used to predict

13
Correlation between square footage and sale price

14
Roughly linear relationship

Sale price ≈ price per sqft × square footage + fixed expense

15
Data Can be Compactly Represented by Matrices

[Scatter plot: price ($) vs. house size, with a new house whose price = ?? is to be predicted]

• Learn parameters (w0, w1) of the orange line y = w1 x + w0

  House 1 (1000 sq.ft): 1000 × w1 + w0 = 200,000
  House 2 (2000 sq.ft): 2000 × w1 + w0 = 350,000

• Can represent compactly in matrix notation

  [1000 1; 2000 1] [w1; w0] = [200,000; 350,000]

16
Some Concepts That You Should Know

• Invertibility of Matrices and Computing Inverses


• Vector Norms – L2, Frobenius etc., Inner Products
• Eigenvalues and Eigenvectors
• Singular Value Decomposition
• Covariance Matrices and Positive Semi-definiteness

Excellent Resources:

• Essence of Linear Algebra YouTube Series


• Prof. Gilbert Strang’s course at MIT

17
Matrix Inverse

• Let us solve the house-price prediction problem

  [1000 1; 2000 1] [w1; w0] = [200,000; 350,000]                  (1)
  [w1; w0] = [1000 1; 2000 1]⁻¹ [200,000; 350,000]                (2)
           = (1/−1000) [1 −1; −2000 1000] [200,000; 350,000]      (3)
           = (1/−1000) [−150,000; −5 × 10⁷]                       (4)
  [w1; w0] = [150; 50,000]                                        (5)

18
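To double-check this small system numerically, a minimal NumPy sketch (ours, not from the slides):

```python
import numpy as np

# House 1: 1000*w1 + w0 = 200,000;  House 2: 2000*w1 + w0 = 350,000
A = np.array([[1000.0, 1.0],
              [2000.0, 1.0]])
b = np.array([200_000.0, 350_000.0])

w1, w0 = np.linalg.solve(A, b)   # solves A @ [w1, w0] = b
print(w1, w0)                    # 150.0 50000.0
```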
You could have data from many houses

• Sale price =
price per sqft × square footage + fixed expense + unexplainable stuff
• Want to learn the price per sqft and fixed expense
• Training data: past sales record.

sqft sale price


2000 800K
2100 907K
1100 312K
5500 2,600K
··· ···

Problem: there isn't a w = [w1, w0]ᵀ that will satisfy all equations

19
Want to predict the best price per sqft and fixed expense

• Sale price =
price per sqft × square footage + fixed expense + unexplainable stuff
• Want to learn the price per sqft and fixed expense
• Training data: past sales record.

sqft sale price prediction


2000 810K 720K
2100 907K 800K
1100 312K 350K
5500 2,600K 2,600K
··· ··· ···

20
Reduce prediction error

How to measure errors?

• absolute difference: |prediction − sale price|.
• squared difference: (prediction − sale price)² [differentiable!].

  sqft    sale price    prediction    abs error    squared error
  2000    810K          720K          90K          90² = 8100
  2100    907K          800K          107K         107²
  1100    312K          350K          38K          38²
  5500    2,600K        2,600K        0            0
  ···     ···           ···           ···          ···

21
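A minimal NumPy sketch (ours) of the two error measures on the rows above (values in K):

```python
import numpy as np

sale_price = np.array([810.0, 907.0, 312.0, 2600.0])   # in K
prediction = np.array([720.0, 800.0, 350.0, 2600.0])   # in K

abs_error = np.abs(prediction - sale_price)       # [90, 107, 38, 0]
squared_error = (prediction - sale_price) ** 2    # [90^2, 107^2, 38^2, 0]
print(abs_error, squared_error)
```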
Geometric Illustration: Each house corresponds to one line

[Figure: the lines c1ᵀw = y1, c2ᵀw = y2, c3ᵀw = y3, c4ᵀw = y4 in the plane of w, and the
residual vector r = [c1ᵀw − y1; c2ᵀw − y2; . . . ; c4ᵀw − y4] = Aw − y]

• Want to find w that minimizes the difference between Xw and y
• But since this is a vector, we need an operator that can map the
  residual vector r(w) = y − Xw to a scalar

22
Norms and Loss Functions

• A vector norm is any function f : Rⁿ → R with
  • f(x) ≥ 0 and f(x) = 0 ⇐⇒ x = 0
  • f(ax) = |a| f(x) for a ∈ R
  • triangle inequality: f(x + y) ≤ f(x) + f(y)
  • e.g., ℓ2 norm: ‖x‖₂ = √(xᵀx) = √(Σ_{i=1}^n xᵢ²)
  • e.g., ℓ1 norm: ‖x‖₁ = Σ_{i=1}^n |xᵢ|
  • e.g., ℓ∞ norm: ‖x‖∞ = maxᵢ |xᵢ|

[Figure: from inside to outside, the ℓ1, ℓ2, ℓ∞ norm balls]

23
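A minimal NumPy sketch (ours) of these three norms applied to a residual vector:

```python
import numpy as np

r = np.array([90.0, 107.0, -38.0, 0.0])   # hypothetical residual vector

print(np.linalg.norm(r, ord=1))        # l1 norm: sum of |r_i| = 235.0
print(np.linalg.norm(r, ord=2))        # l2 norm: sqrt(sum of r_i^2)
print(np.linalg.norm(r, ord=np.inf))   # l-infinity norm: max |r_i| = 107.0
```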


Minimize squared errors

Our model:
Sale price =
price per sqft × square footage + fixed expense + unexplainable stuff

Training data:

  sqft    sale price    prediction    error    squared error
  2000    810K          720K          90K      90² = 8100
  2100    907K          800K          107K     107²
  1100    312K          350K          38K      38²
  5500    2,600K        2,600K        0        0
  ···     ···           ···           ···      ···
  Total   8100 + 107² + 38² + 0 + · · ·

Aim:
Adjust price per sqft and fixed expense such that the sum of the squared
errors is minimized — i.e., the unexplainable stuff is minimized.

24
Recap of MLE/MAP

Linear Regression
Motivation
Algorithm
Univariate solution
Multivariate Solution
Probabilistic interpretation
Computational and numerical optimization

25
Linear regression

Setup:

• Input: x ∈ R^D (covariates, predictors, features, etc.)
• Output: y ∈ R (responses, targets, outcomes, outputs, etc.)
• Model: f : x → y, with f(x) = w0 + Σ_{d=1}^D wd xd = w0 + wᵀx.
  • w = [w1 w2 · · · wD]ᵀ: weights, parameters, or parameter vector
  • w0 is called the bias.
  • Sometimes, we also call w̃ = [w0 w1 w2 · · · wD]ᵀ the parameters.
• Training data: D = {(xn, yn), n = 1, 2, . . . , N}

Minimize the Residual sum of squares:

  RSS(w̃) = Σ_{n=1}^N [yn − f(xn)]² = Σ_{n=1}^N [yn − (w0 + Σ_{d=1}^D wd xnd)]²

26
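The following minimal NumPy sketch (ours, with made-up data) spells out the model f and the RSS objective exactly as defined above:

```python
import numpy as np

def predict(w0, w, X):
    """Linear model f(x) = w0 + w^T x, applied row-wise to an N x D matrix X."""
    return w0 + X @ w

def rss(w0, w, X, y):
    """Residual sum of squares over the training data."""
    residuals = y - predict(w0, w, X)
    return np.sum(residuals ** 2)

# Tiny illustration: one feature (sqft in 1000's), prices in 100k
X = np.array([[1.0], [2.0], [1.5], [2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])
print(rss(0.45, np.array([1.6]), X, y))   # RSS of one particular (w0, w1)
```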
Recap of MLE/MAP

Linear Regression
Motivation
Algorithm
Univariate solution
Multivariate Solution
Probabilistic interpretation
Computational and numerical optimization

27
A simple case: x is just one-dimensional (D=1)

Residual sum of squares:

  RSS(w̃) = Σ_n [yn − f(xn)]² = Σ_n [yn − (w0 + w1 xn)]²

What kind of function is this? CONVEX (has a unique global minimum)

28
A simple case: x is just one-dimensional (D=1)

Residual sum of squares:

  RSS(w̃) = Σ_n [yn − f(xn)]² = Σ_n [yn − (w0 + w1 xn)]²

Stationary points:
Take the derivatives with respect to the parameters and set them to zero

  ∂RSS(w̃)/∂w0 = 0 ⇒ −2 Σ_n [yn − (w0 + w1 xn)] = 0,
  ∂RSS(w̃)/∂w1 = 0 ⇒ −2 Σ_n [yn − (w0 + w1 xn)] xn = 0.

29
A simple case: x is just one-dimensional (D=1)

  ∂RSS(w̃)/∂w0 = 0 ⇒ −2 Σ_n [yn − (w0 + w1 xn)] = 0
  ∂RSS(w̃)/∂w1 = 0 ⇒ −2 Σ_n [yn − (w0 + w1 xn)] xn = 0

Simplify these expressions to get the “Normal Equations”:

  Σ_n yn = N w0 + w1 Σ_n xn
  Σ_n xn yn = w0 Σ_n xn + w1 Σ_n xn²

Solving the system we obtain the least squares coefficient estimates:

  w1 = Σ_n (xn − x̄)(yn − ȳ) / Σ_n (xn − x̄)²   and   w0 = ȳ − w1 x̄

where x̄ = (1/N) Σ_n xn and ȳ = (1/N) Σ_n yn.

30
Example

  sqft (1000’s)    sale price (100k)
  1                2
  2                3.5
  1.5              3
  2.5              4.5

Residual sum of squares:

  RSS(w̃) = Σ_n [yn − f(xn)]² = Σ_n [yn − (w0 + w1 xn)]²

The w1 and w0 that minimize this are given by:

  w1 ≈ 1.6
  w0 ≈ 0.45

32
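A minimal NumPy sketch (ours) that evaluates the closed-form univariate estimates on this table:

```python
import numpy as np

x = np.array([1.0, 2.0, 1.5, 2.5])   # sqft in 1000's
y = np.array([2.0, 3.5, 3.0, 4.5])   # sale price in 100k

x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar
print(w1, w0)   # 1.6 0.45
```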
Recap of MLE/MAP

Linear Regression
Motivation
Algorithm
Univariate solution
Multivariate Solution
Probabilistic interpretation
Computational and numerical optimization

33
Least Mean Squares when x is D-dimensional

  sqft (1000’s)    bedrooms    bathrooms    sale price (100k)
  1                2           1            2
  2                2           2            3.5
  1.5              3           2            3
  2.5              4           2.5          4.5

RSS(w̃) in matrix form:

  RSS(w̃) = Σ_n [yn − (w0 + Σ_d wd xnd)]² = Σ_n [yn − w̃ᵀx̃n]²,

where we have redefined some variables (by augmenting)

  x̃ ← [1 x1 x2 . . . xD]ᵀ,   w̃ ← [w0 w1 w2 . . . wD]ᵀ

34
Least Mean Squares when x is D-dimensional

RSS(w̃) in matrix form:

  RSS(w̃) = Σ_n [yn − (w0 + Σ_d wd xnd)]² = Σ_n [yn − w̃ᵀx̃n]²,

where we have redefined some variables (by augmenting)

  x̃ ← [1 x1 x2 . . . xD]ᵀ,   w̃ ← [w0 w1 w2 . . . wD]ᵀ

which leads to

  RSS(w̃) = Σ_n (yn − w̃ᵀx̃n)(yn − x̃nᵀw̃)
          = Σ_n ( w̃ᵀ x̃n x̃nᵀ w̃ − 2 yn x̃nᵀ w̃ ) + const.
          = w̃ᵀ ( Σ_n x̃n x̃nᵀ ) w̃ − 2 ( Σ_n yn x̃nᵀ ) w̃ + const.

35
RSS(w̃) in new notations

From the previous slide:

  RSS(w̃) = w̃ᵀ ( Σ_n x̃n x̃nᵀ ) w̃ − 2 ( Σ_n yn x̃nᵀ ) w̃ + const.

Design matrix and target vector:

  X̃ = [x̃1ᵀ; x̃2ᵀ; . . . ; x̃Nᵀ] ∈ R^(N×(D+1)),   y = [y1; y2; . . . ; yN] ∈ R^N

Compact expression:

  RSS(w̃) = ‖X̃w̃ − y‖₂² = w̃ᵀX̃ᵀX̃w̃ − 2 (X̃ᵀy)ᵀ w̃ + const

36
Example: RSS(w̃) in compact form

  sqft (1000’s)    bedrooms    bathrooms    sale price (100k)
  1                2           1            2
  2                2           2            3.5
  1.5              3           2            3
  2.5              4           2.5          4.5

Design matrix and target vector:

      [1  1    2  1  ]        [2  ]
  X̃ = [1  2    2  2  ],   y = [3.5]
      [1  1.5  3  2  ]        [3  ]
      [1  2.5  4  2.5]        [4.5]

Compact expression:

  RSS(w̃) = ‖X̃w̃ − y‖₂² = w̃ᵀX̃ᵀX̃w̃ − 2 (X̃ᵀy)ᵀ w̃ + const

38
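A minimal NumPy sketch (ours) that builds this design matrix and evaluates the compact RSS expression for a candidate w̃:

```python
import numpy as np

X = np.array([[1.0, 2, 1],        # raw features: sqft (1000's), bedrooms, bathrooms
              [2.0, 2, 2],
              [1.5, 3, 2],
              [2.5, 4, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])

# Augment with a leading column of ones for the bias w0
X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])

def rss(w_tilde):
    r = X_tilde @ w_tilde - y
    return r @ r                  # squared l2 norm of the residual

print(rss(np.zeros(4)))           # RSS of the all-zero parameter vector
```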
Solution in matrix form

Compact expression

  RSS(w̃) = ‖X̃w̃ − y‖₂² = w̃ᵀX̃ᵀX̃w̃ − 2 (X̃ᵀy)ᵀ w̃ + const

Gradients of Linear and Quadratic Functions

• ∇x (bᵀx) = b
• ∇x (xᵀAx) = 2Ax (symmetric A)

Normal equation

  ∇w̃ RSS(w̃) = 2X̃ᵀX̃w̃ − 2X̃ᵀy = 0

This leads to the least-mean-squares (LMS) solution

  w̃LMS = (X̃ᵀX̃)⁻¹ X̃ᵀy

39
Example: RSS(w̃) in compact form

  sqft (1000’s)    bedrooms    bathrooms    sale price (100k)
  1                2           1            2
  2                2           2            3.5
  1.5              3           2            3
  2.5              4           2.5          4.5

Write the least-mean-squares (LMS) solution

  w̃LMS = (X̃ᵀX̃)⁻¹ X̃ᵀy

Can use solvers in Matlab, Python etc., to compute this for any given X̃
and y.

40
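For instance, a minimal NumPy sketch (ours) for the table above; np.linalg.lstsq is used instead of an explicit inverse, which is the numerically safer way to evaluate the same formula:

```python
import numpy as np

X = np.array([[1.0, 2, 1],
              [2.0, 2, 2],
              [1.5, 3, 2],
              [2.5, 4, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])

X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the bias column

# Least-squares solution; equivalent to (X~^T X~)^{-1} X~^T y when X~^T X~ is invertible
w_tilde, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)
print(w_tilde)   # [w0, w_sqft, w_bedrooms, w_bathrooms]
```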
Exercise: RSS(w̃) in compact form

Using the general least-mean-squares (LMS) solution

  w̃LMS = (X̃ᵀX̃)⁻¹ X̃ᵀy

recover the uni-variate solution that we had computed earlier:

  w1 = Σ_n (xn − x̄)(yn − ȳ) / Σ_n (xn − x̄)²   and   w0 = ȳ − w1 x̄

where x̄ = (1/N) Σ_n xn and ȳ = (1/N) Σ_n yn.

41
Exercise: RSS(w̃) in compact form

For the 1-D case, the least-mean-squares solution is

  w̃LMS = (X̃ᵀX̃)⁻¹ X̃ᵀy
        = ( [1 1 . . . 1; x1 x2 . . . xN] [1 x1; 1 x2; . . . ; 1 xN] )⁻¹ [1 1 . . . 1; x1 x2 . . . xN] [y1; y2; . . . ; yN]
        = [N, N x̄; N x̄, Σ_n xn²]⁻¹ [Σ_n yn; Σ_n xn yn]

  [w0; w1] = (1 / Σ_n (xn − x̄)²) [ȳ Σ_n (xn − x̄)² − x̄ Σ_n (xn − x̄)(yn − ȳ); Σ_n (xn − x̄)(yn − ȳ)]

where x̄ = (1/N) Σ_n xn and ȳ = (1/N) Σ_n yn.

42
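A minimal NumPy check (ours) that the general matrix solution and the univariate closed form agree on the earlier toy data:

```python
import numpy as np

x = np.array([1.0, 2.0, 1.5, 2.5])
y = np.array([2.0, 3.5, 3.0, 4.5])

# General LMS solution with the augmented design matrix [1, x]
X_tilde = np.column_stack([np.ones_like(x), x])
w0_lms, w1_lms = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)

# Univariate closed form
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

print(np.allclose([w0_lms, w1_lms], [w0, w1]))   # True
```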
Recap of MLE/MAP

Linear Regression
Motivation
Algorithm
Univariate solution
Multivariate Solution
Probabilistic interpretation
Computational and numerical optimization

43
Why is minimizing RSS sensible?

[Figure: the lines c1ᵀw = y1, . . . , c4ᵀw = y4 in the plane of w, and the residual vector
r = [c1ᵀw − y1; c2ᵀw − y2; . . . ; c4ᵀw − y4] = Aw − y]

• Want to find w that minimizes the difference between Xw and y
• But since this is a vector, we need an operator that can map the
  residual vector r(w) = y − Xw to a scalar
• We take the sum of the squares of the elements of r(w)

44
Why is minimizing RSS sensible?

Probabilistic interpretation
• Noisy observation model:

  Y = w0 + w1 X + η

  where η ∼ N(0, σ²) is a Gaussian random variable

• Conditional likelihood of one training sample:

  p(yn | xn) = N(w0 + w1 xn, σ²) = (1 / (√(2π) σ)) e^(−[yn − (w0 + w1 xn)]² / (2σ²))

45
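A minimal NumPy sketch (ours, with made-up parameter values) of this noisy observation model; fitting least squares to the generated (x, y) pairs should roughly recover w0 and w1:

```python
import numpy as np

rng = np.random.default_rng(0)

w0, w1, sigma = 0.45, 1.6, 0.2        # hypothetical "true" parameters
x = rng.uniform(1.0, 3.0, size=100)   # house sizes in 1000's of sqft

# y = w0 + w1*x + eta, with eta ~ N(0, sigma^2)
y = w0 + w1 * x + rng.normal(0.0, sigma, size=x.shape)
```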
Probabilistic interpretation (cont’d)

Log-likelihood of the training data D (assuming i.i.d.):

  log P(D) = log Π_{n=1}^N p(yn | xn) = Σ_n log p(yn | xn)
           = Σ_n ( −[yn − (w0 + w1 xn)]² / (2σ²) − log √(2π) σ )
           = −(1 / (2σ²)) Σ_n [yn − (w0 + w1 xn)]² − (N/2) log σ² − N log √(2π)
           = −(1/2) ( (1/σ²) Σ_n [yn − (w0 + w1 xn)]² + N log σ² ) + const

What is the relationship between minimizing RSS and maximizing the
log-likelihood?

46
Maximum likelihood estimation

Estimating σ, w0 and w1 can be done in two steps

• Maximize over w0 and w1:

  max log P(D) ⇔ min Σ_n [yn − (w0 + w1 xn)]²   ← This is RSS(w̃)!

• Maximize over s = σ²:

  ∂ log P(D)/∂s = −(1/2) ( −(1/s²) Σ_n [yn − (w0 + w1 xn)]² + N/s ) = 0
  → σ*² = s* = (1/N) Σ_n [yn − (w0 + w1 xn)]²

47
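A minimal NumPy sketch (ours) of this two-step estimate on the toy housing data: fit (w0, w1) by least squares, then take the mean squared residual as the noise-variance MLE:

```python
import numpy as np

x = np.array([1.0, 2.0, 1.5, 2.5])
y = np.array([2.0, 3.5, 3.0, 4.5])

# Step 1: least-squares fit of (w0, w1)
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

# Step 2: MLE of the noise variance = mean squared residual
residuals = y - (w0 + w1 * x)
sigma2_mle = np.mean(residuals ** 2)
print(sigma2_mle)   # 0.0125 for this toy data
```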
How does this probabilistic interpretation help us?

• It gives a solid footing to our intuition: minimizing RSS(w̃) is a
sensible thing based on reasonable modeling assumptions.
• Estimating σ ∗ tells us how much noise there is in our predictions.
For example, it allows us to place confidence intervals around our
predictions.

48
Recap of MLE/MAP

Linear Regression
Motivation
Algorithm
Univariate solution
Multivariate Solution
Probabilistic interpretation
Computational and numerical optimization

49
Computational complexity of the Least Squares Solution

Bottleneck of computing the solution?

  w = (XᵀX)⁻¹ Xᵀy

• Matrix multiply of XᵀX ∈ R^((D+1)×(D+1))
• Inverting the matrix XᵀX

How many operations do we need?

• O(ND²) for matrix multiplication
• O(D³) (e.g., using Gauss-Jordan elimination) or O(D^2.373) (recent
  theoretical advances) for matrix inversion
• Impractical for very large D or N

50
Alternative method: Batch Gradient Descent

(Batch) Gradient descent

• Initialize w to w(0) (e.g., randomly);
  set t = 0; choose η > 0
• Loop until convergence
  1. Compute the gradient
     ∇RSS(w) = Xᵀ(Xw(t) − y)
  2. Update the parameters
     w(t+1) = w(t) − η ∇RSS(w)
  3. t ← t + 1

What is the complexity of each iteration? O(ND)

51
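A minimal NumPy sketch (ours) of this loop on the toy housing data; the learning rate and iteration count are arbitrary choices, not from the slides:

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.05, num_iters=2000):
    """Minimize RSS(w) = ||Xw - y||^2 by batch gradient descent."""
    w = np.zeros(X.shape[1])          # w(0): start from all zeros
    for _ in range(num_iters):
        grad = X.T @ (X @ w - y)      # gradient of RSS at the current w
        w = w - eta * grad            # step with learning rate eta
    return w

# Augmented design matrix [1, sqft] and prices (same toy data as before)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 1.5], [1.0, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])
print(batch_gradient_descent(X, y))   # approaches [0.45, 1.6]
```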
Why would this work?

If gradient descent converges, it will converge to the same solution as
using matrix inversion.

This is because RSS(w) is a convex function in its parameters w.

Hessian of RSS

  RSS(w) = wᵀXᵀXw − 2 (Xᵀy)ᵀ w + const
  ⇒ ∂²RSS(w) / ∂w ∂wᵀ = 2XᵀX

XᵀX is positive semidefinite, because for any v

  vᵀXᵀXv = ‖Xv‖₂² ≥ 0

52
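A quick numerical illustration (ours) of this positive-semidefiniteness on the toy design matrix:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 1.5], [1.0, 2.5]])

H = 2 * X.T @ X                 # Hessian of RSS(w)
print(np.linalg.eigvalsh(H))    # all eigenvalues >= 0, so RSS is convex
```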
Stochastic gradient descent (SGD)

Widrow-Hoff rule: update parameters using one example at a time

• Initialize w to some w(0); set t = 0; choose η > 0
• Loop until convergence
  1. Randomly choose a training sample xt
  2. Compute its contribution to the gradient
     gt = (xtᵀw(t) − yt) xt
  3. Update the parameters
     w(t+1) = w(t) − η gt
  4. t ← t + 1

How does the complexity per iteration compare with gradient descent?

• O(ND) for gradient descent versus O(D) for SGD

54
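A minimal NumPy sketch (ours) of this rule on the same toy data; the step size, iteration count, and seed are arbitrary choices:

```python
import numpy as np

def sgd(X, y, eta=0.01, num_iters=10000, seed=0):
    """Widrow-Hoff / SGD: one randomly chosen example per update."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        t = rng.integers(len(y))          # pick one training example at random
        g = (X[t] @ w - y[t]) * X[t]      # its contribution to the gradient
        w = w - eta * g                   # O(D) update
    return w

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 1.5], [1.0, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])
print(sgd(X, y))   # noisy, but close to the least-squares solution [0.45, 1.6]
```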
SGD versus Batch GD

• SGD reduces per-iteration complexity from O(ND) to O(D)


• But it is noisier and can take longer to converge

55
How to Choose Learning Rate η in practice?

• Try 0.0001, 0.001, 0.01, 0.1 etc. on a validation dataset (more on
  this later) and choose the one that gives the fastest, stable convergence
• Reduce η by a constant factor (e.g., 10) when learning saturates so
  that we can reach closer to the true minimum (see the sketch after this list).
• More advanced learning rate schedules such as AdaGrad, Adam, and
  AdaDelta are used in practice.

56
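A rough sketch (ours, not from the slides) of the "reduce η when learning saturates" heuristic for batch gradient descent; the decay factor, patience, and tolerance are arbitrary choices:

```python
import numpy as np

def gd_with_step_decay(X, y, eta=0.05, decay=10.0, patience=200, num_iters=3000):
    """Batch GD that divides eta by `decay` when RSS stops improving."""
    w = np.zeros(X.shape[1])
    best_rss = np.inf
    since_improved = 0
    for _ in range(num_iters):
        w = w - eta * (X.T @ (X @ w - y))
        rss = np.sum((X @ w - y) ** 2)
        if rss < best_rss - 1e-12:
            best_rss, since_improved = rss, 0
        else:
            since_improved += 1
            if since_improved >= patience:   # learning has saturated
                eta /= decay                 # reduce the learning rate
                since_improved = 0
    return w

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 1.5], [1.0, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])
print(gd_with_step_decay(X, y))   # approaches [0.45, 1.6]
```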
Mini-Summary

• Linear regression is the linear combination of features
  f : x → y, with f(x) = w0 + Σ_d wd xd = w0 + wᵀx
• If we minimize the residual sum of squares as our learning objective, we
  get a closed-form solution for the parameters
• Probabilistic interpretation: maximum likelihood if we assume the residual
  is Gaussian distributed
• Gradient Descent and mini-batch SGD can overcome computational
  issues

57
