
Machine learning: lecture 2

Tommi S. Jaakkola
MIT AI Lab
Topics
• Brief review of background
• Linear regression
– estimation criterion
– least squares solution, properties
– generalization, overfitting

Brief review of background
• Expectation and sample mean
E_{X \sim P}\{ X \} = E\{ X \} \approx \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

where each x_i is a sample from P.


• Variance and sample variance
\mathrm{Var}\{ X \} = E\{ (X - E\{X\})^2 \} \approx \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2

Brief review of background cont’d
• Covariance and sample covariance
 
\mathrm{Cov}\{ X_1, X_2 \} = E\{ (X_1 - E\{X_1\})(X_2 - E\{X_2\}) \} \approx \frac{1}{n} \sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)

where (x_{1i}, x_{2i}) is the i-th joint sample from P.
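A quick numerical illustration of these sample estimates (a minimal NumPy sketch; the distribution used here is made up for illustration, and the estimates divide by n as on the slide rather than by n − 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw n joint samples (x1_i, x2_i) from some distribution P.
n = 1000
x1 = rng.normal(loc=1.0, scale=2.0, size=n)
x2 = 0.5 * x1 + rng.normal(size=n)   # correlated with x1 by construction

x1_bar = x1.mean()                                     # sample mean
var_x1 = ((x1 - x1_bar) ** 2).mean()                   # sample variance
cov_x1_x2 = ((x1 - x1_bar) * (x2 - x2.mean())).mean()  # sample covariance

print(x1_bar, var_x1, cov_x1_x2)
```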

Brief review of background cont’d
• Conditional expectation:
E\{ Y \mid x \} = \int_y y \, p(y \mid x) \, dy, \qquad E\{ E\{ Y \mid X \} \} = E\{ Y \}

where X and Y are random variables governed by a distribution p; x is a possible value of X.
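A small Monte Carlo check of the iterated-expectation identity above (a sketch; the particular distributions for X and Y are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200_000

# X ~ Bernoulli(0.3); given X = x, Y ~ Normal(2x, 1), so E{Y|X} = 2X.
x = rng.binomial(1, 0.3, size=m)
y = rng.normal(loc=2.0 * x, scale=1.0)

# E{ E{Y|X} } should equal E{Y}; both are roughly 2 * 0.3 = 0.6.
print((2.0 * x).mean())  # sample version of E{ E{Y|X} }
print(y.mean())          # sample version of E{Y}
```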

Regression
[Figure: example regression data, responses y plotted against inputs x]

• The goal is to predict responses/outputs for inputs


• We need to define
1. function class, the type of predictions we consider
2. fitting criterion (loss) that measures the degree of fit to
the data

Linear regression
• Linear functions of one variable (two parameters)
f(x; w) = w_0 + w_1 x, \qquad w = [w_0 \ w_1]^T

and a squared loss: \mathrm{Loss}(y, f(x; w)) = (y - f(x; w))^2 / 2.
• Estimation based on minimizing the empirical loss

J_n(w) = \sum_{i=1}^{n} \mathrm{Loss}(y_i, f(x_i; w))

with respect to the parameters w.
[Figure: example data with a fitted linear function f(x; w)]
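A minimal NumPy sketch of this function class and empirical loss (the toy data and the candidate parameter vector are made up for illustration):

```python
import numpy as np

def f(x, w):
    """Linear function f(x; w) = w0 + w1 * x."""
    w0, w1 = w
    return w0 + w1 * x

def empirical_loss(w, x, y):
    """J_n(w) = sum_i (y_i - f(x_i; w))^2 / 2."""
    return 0.5 * np.sum((y - f(x, w)) ** 2)

# Toy data and a candidate parameter vector w = [w0, w1].
x = np.array([-1.0, 0.0, 1.0, 2.0])
y = np.array([-2.1, 0.2, 1.9, 4.2])
print(empirical_loss(np.array([0.0, 2.0]), x, y))
```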

Linear regression: estimation
• We minimize the empirical squared loss
J_n(w) = \sum_{i=1}^{n} \mathrm{Loss}(y_i, f(x_i; w)) = \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)^2 / 2

Setting the derivatives with respect to w_0 and w_1 to zero, we get necessary conditions for the “optimal” parameter values

\frac{\partial}{\partial w_0} J_n(w) = -\sum_{i=1}^{n} (y_i - w_0 - w_1 x_i) = 0

\frac{\partial}{\partial w_1} J_n(w) = -\sum_{i=1}^{n} (y_i - w_0 - w_1 x_i) \, x_i = 0

Note: these conditions mean that the prediction error (y_i - w_0 - w_1 x_i) has zero mean and is decorrelated with the inputs x_i.
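The note above is easy to verify numerically: after a least-squares fit, the residuals sum to zero and are uncorrelated with the inputs. A sketch on synthetic data, using np.polyfit as one way to obtain the least-squares line:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=100)
y = 1.5 * x - 0.5 + rng.normal(scale=1.0, size=100)

# Least-squares line; polyfit returns [w1, w0] for degree 1.
w1_hat, w0_hat = np.polyfit(x, y, 1)
residuals = y - w0_hat - w1_hat * x

# The two zero-derivative conditions (up to round-off):
print(residuals.sum())        # ~ 0  (zero-mean errors)
print((residuals * x).sum())  # ~ 0  (decorrelated with the inputs)
```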
Linear regression: estimation
• The prediction error (y_i - w_0 - w_1 x_i) is decorrelated with the inputs x_i

[Figure: the fitted line on the data (left) and the prediction errors plotted against the inputs (right)]

Linear regression: estimation
\frac{\partial}{\partial w_0} J_n(w) = -\sum_{i=1}^{n} (y_i - w_0 - w_1 x_i) = 0

\frac{\partial}{\partial w_1} J_n(w) = -\sum_{i=1}^{n} (y_i - w_0 - w_1 x_i) \, x_i = 0

• Solution via matrix inversion

w_0 \left( \sum_{i=1}^{n} 1 \right) + w_1 \left( \sum_{i=1}^{n} x_i \right) = \sum_{i=1}^{n} y_i

w_0 \left( \sum_{i=1}^{n} x_i \right) + w_1 \left( \sum_{i=1}^{n} x_i^2 \right) = \sum_{i=1}^{n} y_i x_i

or \Phi w = b, where

\Phi = \begin{bmatrix} \sum_{i=1}^{n} 1 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{bmatrix}, \qquad b = \begin{bmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} y_i x_i \end{bmatrix}

• If \Phi is invertible, we get our parameter estimates via \hat{w} = \Phi^{-1} b
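A sketch of this solution on synthetic data, building Φ and b directly from the sums above (np.linalg.solve is used rather than forming Φ^{-1} explicitly):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=50)
y = -1.0 + 2.0 * x + rng.normal(scale=0.5, size=50)

n = len(x)
Phi = np.array([[n,       x.sum()],
                [x.sum(), (x ** 2).sum()]])
b = np.array([y.sum(), (y * x).sum()])

# If Phi is invertible, w_hat = Phi^{-1} b; solve() avoids the explicit inverse.
w_hat = np.linalg.solve(Phi, b)
print(w_hat)   # approximately [-1.0, 2.0] = [w0, w1]
```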

Linear regression
• In matrix notation, we minimize:

\frac{1}{2} \left\| \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} - \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} \right\|^2

or \frac{1}{2} \| y - Xw \|^2

By setting the derivatives to zero, we get

X^T y - X^T X w = 0 \;\Rightarrow\; \hat{w} = (X^T X)^{-1} X^T y = \Phi^{-1} b

Note: the solution is a linear function of the outputs y
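The same estimate in NumPy, using the design matrix X with a column of ones (a sketch; np.linalg.lstsq solves the least-squares problem without forming (X^T X)^{-1} explicitly):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=50)
y = -1.0 + 2.0 * x + rng.normal(scale=0.5, size=50)

# Design matrix with rows [1, x_i].
X = np.column_stack([np.ones_like(x), x])

# w_hat = (X^T X)^{-1} X^T y, computed in a numerically stable way.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)   # ~ [-1.0, 2.0], matching the Phi w = b solution above
```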

Linear regression: pseudo-inverse
• 2-D example: \hat{w} = (X^T X)^{\dagger} X^T y

y_i \approx f(x_i; \hat{w}) = \hat{w}_0 + \hat{w}_1 x_{1i} + \hat{w}_2 x_{2i} = \hat{w}^T \begin{bmatrix} 1 \\ x_{1i} \\ x_{2i} \end{bmatrix}
[Figure: the weight vector w lies in the subspace spanned by the examples]

• We find the solution in the subspace spanned by the examples (weight vector set to zero in the orthogonal dimensions)
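A sketch of the pseudo-inverse estimate in an underdetermined case (two examples, three parameters; the numbers are made up): the minimum-norm solution returned by np.linalg.pinv lies in the row space of X, i.e., the subspace spanned by the examples.

```python
import numpy as np

# Two examples, three parameters (w0, w1, w2): an underdetermined problem.
X = np.array([[1.0, 0.5, 1.0],
              [1.0, 2.0, 0.0]])
y = np.array([1.0, 3.0])

# w_hat = (X^T X)^dagger X^T y, equivalently pinv(X) @ y.
w_hat = np.linalg.pinv(X) @ y
print(w_hat)

# Projecting onto the subspace spanned by the example rows leaves w_hat
# unchanged: its component in the orthogonal directions is zero.
P_rows = np.linalg.pinv(X) @ X   # orthogonal projector onto the row space
print(np.allclose(P_rows @ w_hat, w_hat))   # True
```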
Properties of estimates
[Figure: example regression data]

• Suppose the mean response (output) for any input x can indeed be modeled as a linear function with some “true” parameters w^*:

E\{ y \mid x \} = f(x; w^*) = {w^*}^T \begin{bmatrix} 1 \\ x \end{bmatrix}

We can ask if our estimate \hat{w} (based on a limited training set) is in any sense close to w^*.
Bias
• One measure of deviation is bias, which gauges any systematic deviation from w^*:

\mathrm{Bias} = E\{ \hat{w} \} - w^*

where the expectation is taken with respect to resampled training sets of the same size; each input/output pair (x, y) in the training set is assumed to be an independent sample from some distribution P
• In linear regression the estimate \hat{w} is unbiased, i.e., E\{ \hat{w} \} - w^* = 0 (problem set)
• This means that the predictions are unbiased as well:

E\{ f(x; \hat{w}) \} = E\left\{ \hat{w}^T \begin{bmatrix} 1 \\ x \end{bmatrix} \right\} = {w^*}^T \begin{bmatrix} 1 \\ x \end{bmatrix} = f(x; w^*)
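A small simulation of this unbiasedness claim (a sketch, not the problem-set proof; the “true” parameters w^* and the noise model are assumed for illustration): averaging \hat{w} over many resampled training sets of the same size should approach w^*.

```python
import numpy as np

rng = np.random.default_rng(4)
w_star = np.array([1.0, -2.0])   # assumed "true" parameters [w0*, w1*]
n, trials = 20, 5000

estimates = np.empty((trials, 2))
for t in range(trials):
    # Fresh training set of size n, resampled from the same distribution P.
    x = rng.uniform(-2, 2, size=n)
    y = w_star[0] + w_star[1] * x + rng.normal(scale=1.0, size=n)
    X = np.column_stack([np.ones_like(x), x])
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    estimates[t] = w_hat

# E{ w_hat } is approximated by the average over training sets.
print(estimates.mean(axis=0))   # close to w_star = [1.0, -2.0]
```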

Linear regression: generalization
• We’d like to understand how our linear predictions “improve” as a function of the number of training examples {(x_1, y_1), ..., (x_n, y_n)}

[Figure: fitted lines based on n = 20, n = 40, and n = 60 training examples, compared with the n = ∞ solution]

We assume that there is a systematic relation between x and y: each training example (x, y) is an independent sample from a fixed but unknown distribution P

Linear regression: generalization
Training examples {(x_1, y_1), ..., (x_n, y_n)}
Test examples {(x_{n+1}, y_{n+1}), ..., (x_{n+N}, y_{n+N})}
\hat{w} is the parameter estimate from the training examples.
• Types of errors:

\text{Mean training error} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{w}_0 - \hat{w}_1 x_i)^2

\text{Mean test error} = \frac{1}{N} \sum_{i=n+1}^{n+N} (y_i - \hat{w}_0 - \hat{w}_1 x_i)^2

\text{“Generalization” error} = E_{(x,y) \sim P} \left\{ (y - \hat{w}_0 - \hat{w}_1 x)^2 \right\}

(note: \hat{w}_0 and \hat{w}_1 are themselves random variables as they depend on the training set)
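A sketch computing these errors on synthetic data (the generalization error is approximated here by the mean test error on a large held-out sample):

```python
import numpy as np

rng = np.random.default_rng(5)

def sample(size):
    """Draw (x, y) pairs from an assumed distribution P."""
    x = rng.uniform(-2, 2, size=size)
    y = 0.5 + 1.5 * x + rng.normal(scale=1.0, size=size)
    return x, y

x_tr, y_tr = sample(20)      # n training examples
x_te, y_te = sample(10_000)  # N test examples

w1_hat, w0_hat = np.polyfit(x_tr, y_tr, 1)   # estimate from the training set only

train_err = np.mean((y_tr - w0_hat - w1_hat * x_tr) ** 2)
test_err = np.mean((y_te - w0_hat - w1_hat * x_te) ** 2)
print(train_err, test_err)   # the test error is typically somewhat larger
```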

Linear regression: generalization
• We can decompose the “generalization” error

E_{(x,y) \sim P} \left\{ (y - \hat{w}_0 - \hat{w}_1 x)^2 \right\}

into two terms:

1. error of the best predictor in the class

E_{(x,y) \sim P} \left\{ (y - w_0^* - w_1^* x)^2 \right\} = \min_{w_0, w_1} E_{(x,y) \sim P} \left\{ (y - w_0 - w_1 x)^2 \right\}

2. and how well we approximate the best predictor

E_{(x,y) \sim P} \left\{ \left( (w_0^* + w_1^* x) - (\hat{w}_0 + \hat{w}_1 x) \right)^2 \right\}

• This holds for any input/output relation depicted by the distribution P

Brief derivation

(y - \hat{w}_0 - \hat{w}_1 x)^2 = \left( \left[ y - (w_0^* + w_1^* x) \right] + \left[ (w_0^* + w_1^* x) - (\hat{w}_0 + \hat{w}_1 x) \right] \right)^2

= \left[ y - (w_0^* + w_1^* x) \right]^2 + 2 \left[ y - (w_0^* + w_1^* x) \right] \left[ (w_0^* + w_1^* x) - (\hat{w}_0 + \hat{w}_1 x) \right] + \left[ (w_0^* + w_1^* x) - (\hat{w}_0 + \hat{w}_1 x) \right]^2

The cross-term vanishes when we take the expectation with respect to (x, y) ∼ P: since w^* is the best linear predictor, its residual y - (w_0^* + w_1^* x) is uncorrelated with any linear function of x, in particular with (w_0^* + w_1^* x) - (\hat{w}_0 + \hat{w}_1 x).

Overfitting
• With too few training examples our linear regression model may achieve zero training error but nevertheless has a large generalization error
[Figure: a line fit exactly through two training points (marked x)]

When the training error no longer bears any relation to the generalization error, the model overfits the data
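A sketch of this failure mode (synthetic data; the noise level is assumed for illustration): with only two training points the fitted line passes through them exactly, so the training error is zero, yet the error on fresh samples from P remains large.

```python
import numpy as np

rng = np.random.default_rng(6)

def sample(size):
    x = rng.uniform(-2, 2, size=size)
    # Noisy relation: even the best line has a nonzero expected error.
    y = 0.5 + 1.5 * x + rng.normal(scale=2.0, size=size)
    return x, y

x_tr, y_tr = sample(2)                      # far too few training examples
w1_hat, w0_hat = np.polyfit(x_tr, y_tr, 1)  # the line passes through both points

x_te, y_te = sample(100_000)                # large sample approximates P

print(np.mean((y_tr - w0_hat - w1_hat * x_tr) ** 2))  # ~ 0: zero training error
print(np.mean((y_te - w0_hat - w1_hat * x_te) ** 2))  # large generalization error
```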
