Machine learning: lecture 2
Tommi S. Jaakkola
MIT AI Lab
Topics
• Brief review of background
• Linear regression
– estimation criterion
– least squares solution, properties
– generalization, overfitting
Brief review of background
• Expectation and sample mean
E_{X∼P}{ X } = E{ X } ≈ x̄ = (1/n) Σ_{i=1}^n xi
where each xi is a sample from P.
• Variance and sample variance
Var{X} = E{ (X − E{X})² } ≈ (1/n) Σ_{i=1}^n (xi − x̄)²
Brief review of background cont’d
• Covariance and sample covariance
Cov{X1, X2} = E{ (X1 − E{X1})(X2 − E{X2}) } ≈ (1/n) Σ_{i=1}^n (x1i − x̄1)(x2i − x̄2)
where (x1i, x2i) is the ith joint sample from P.
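As a quick numerical illustration (a minimal numpy sketch; the Gaussian sampling model, the 0.5 coefficient, and the seed are my assumptions, not part of the lecture), the sample quantities approach the population ones as n grows:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    x1 = rng.normal(loc=1.0, scale=2.0, size=n)   # samples of X1 ~ N(1, 4)
    x2 = 0.5 * x1 + rng.normal(size=n)            # samples of X2, correlated with X1

    print(x1.mean())                                       # sample mean      ≈ E{X1} = 1
    print(x1.var())                                        # sample variance  ≈ Var{X1} = 4
    print(np.mean((x1 - x1.mean()) * (x2 - x2.mean())))    # sample covariance ≈ Cov{X1, X2} = 2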
Brief review of background cont’d
• Conditional expectation:
E{ Y |x } = ∫ y p(y|x) dy
E{ E{ Y |X } } = E{ Y }
where X and Y are random variables governed by a
distribution p; x is a possible value of X.
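A small Monte Carlo check of the identity E{ E{Y|X} } = E{Y} (a sketch with an assumed toy model Y = 2X + noise, X standard normal; not from the lecture):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=100_000)
    y = 2.0 * x + rng.normal(size=x.size)   # Y = 2X + noise, so E{Y|X=x} = 2x

    print(np.mean(2.0 * x))   # ≈ E{ E{Y|X} }
    print(np.mean(y))         # ≈ E{Y}; the two agree up to sampling noise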
Regression
(Figure: scatter plot of training outputs y against inputs x)
• The goal is to predict responses/outputs for inputs
• We need to define
1. function class, the type of predictions we consider
2. fitting criterion (loss) that measures the degree of fit to
the data
Linear regression
• Linear functions of one variable (two parameters)
f(x; w) = w0 + w1 x,   w = [w0 w1]^T
and a squared loss: Loss(y, f(x; w)) = (y − f(x; w))²/2.
• Estimation based on minimizing the empirical loss
Jn(w) = Σ_{i=1}^n Loss(yi, f(xi; w))
with respect to the parameters w.
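For concreteness, a tiny sketch of the function class and the empirical loss (the names f, loss, and J_n are mine, not from the slides; the example data are made up):

    import numpy as np

    def f(x, w):                    # linear function class: f(x; w) = w0 + w1*x
        return w[0] + w[1] * x

    def loss(y, pred):              # squared loss: (y − f(x; w))² / 2
        return 0.5 * (y - pred) ** 2

    def J_n(w, x, y):               # empirical loss: sum of losses over the training set
        return loss(y, f(x, w)).sum()

    x = np.array([-1.0, 0.0, 1.0])
    y = np.array([-2.0, -0.5, 1.0])
    print(J_n(np.array([0.0, 1.0]), x, y))   # empirical loss of the candidate w = [0, 1]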
Linear regression: estimation
• We minimize the empirical squared loss
Jn(w) = Σ_{i=1}^n Loss(yi, f(xi; w)) = Σ_{i=1}^n (yi − w0 − w1 xi)²/2
Setting the derivatives with respect to w0 and w1 to zero, we get
necessary conditions for the “optimal” parameter values
∂Jn(w)/∂w0 = − Σ_{i=1}^n (yi − w0 − w1 xi) = 0
∂Jn(w)/∂w1 = − Σ_{i=1}^n (yi − w0 − w1 xi) xi = 0
Note: These conditions mean that the prediction error
(yi − w0 − w1xi) has zero mean and is decorrelated with the
inputs xi
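These two properties are easy to verify numerically (a minimal numpy sketch; the toy data-generating model and seed are assumptions, not from the lecture):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(-2, 2, size=200)
    y = -0.5 + 1.5 * x + rng.normal(size=x.size)   # assumed toy data

    X = np.column_stack([np.ones_like(x), x])      # rows (1, xi)
    w0_hat, w1_hat = np.linalg.lstsq(X, y, rcond=None)[0]

    err = y - w0_hat - w1_hat * x                  # prediction errors
    print(err.mean())                              # ≈ 0: zero mean
    print(np.mean(err * x))                        # ≈ 0: decorrelated with the inputs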
Linear regression: estimation
• The prediction error (yi − w0 − w1xi) is decorrelated with
the inputs xi
Linear regression: estimation
∂Jn(w)/∂w0 = − Σ_{i=1}^n (yi − w0 − w1 xi) = 0
∂Jn(w)/∂w1 = − Σ_{i=1}^n (yi − w0 − w1 xi) xi = 0
• Solution via matrix inversion
w0 (Σ_{i=1}^n 1) + w1 (Σ_{i=1}^n xi) = Σ_{i=1}^n yi
w0 (Σ_{i=1}^n xi) + w1 (Σ_{i=1}^n xi²) = Σ_{i=1}^n yi xi
or Φw = b, where
Φ = [ Σ_{i=1}^n 1     Σ_{i=1}^n xi  ]        b = [ Σ_{i=1}^n yi    ]
    [ Σ_{i=1}^n xi    Σ_{i=1}^n xi² ]            [ Σ_{i=1}^n yi xi ]
• If Φ is invertible, we get our parameter estimates via ŵ = Φ⁻¹ b
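A minimal numpy sketch of this 2×2 system (the toy data model is assumed; solving the linear system is used here rather than forming Φ⁻¹ explicitly):

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.uniform(-2, 2, size=100)
    y = -0.5 + 1.5 * x + rng.normal(size=x.size)   # assumed toy data

    Phi = np.array([[x.size,        x.sum()],
                    [x.sum(), (x ** 2).sum()]])    # the 2x2 matrix of sums
    b = np.array([y.sum(), (y * x).sum()])

    w_hat = np.linalg.solve(Phi, b)                # [ŵ0, ŵ1]
    print(w_hat)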
Linear regression
• In matrix notation, we minimize
  ‖ [y1; … ; yn] − [1 x1; … ; 1 xn] [w0; w1] ‖² / 2
or ‖y − Xw‖²/2
By setting the derivatives to zero, we get
X^T y − X^T X w = 0   ⇒   ŵ = (X^T X)⁻¹ X^T y
where X^T X = Φ and X^T y = b as before
Note: the solution is a linear function of the outputs y
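In numpy this is a one-liner (same assumed toy data as above; np.linalg.lstsq minimizes ‖y − Xw‖² without explicitly forming (X^T X)⁻¹, which is numerically safer):

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.uniform(-2, 2, size=100)
    y = -0.5 + 1.5 * x + rng.normal(size=x.size)   # assumed toy data

    X = np.column_stack([np.ones_like(x), x])      # design matrix with rows (1, xi)
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes ‖y − Xw‖²
    print(w_hat)                                   # [ŵ0, ŵ1]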
Linear regression: pseudo-inverse
• 2-D example: ŵ = (X^T X)† X^T y
  yi ≈ f(xi; ŵ) = ŵ0 + ŵ1 x1i + ŵ2 x2i = ŵ^T [1 x1i x2i]^T
(Figure: the estimated weight vector w lies in the subspace spanned by the examples)
• We find the solution in the subspace spanned by the examples
(weight vector set to zero in the orthogonal dimensions)
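A small numpy sketch of this (assumed data with two inputs and only two examples): np.linalg.pinv gives the minimum-norm least squares solution, whose weight vector lies in the span of the example rows.

    import numpy as np

    rng = np.random.default_rng(5)
    X = np.column_stack([np.ones(2), rng.normal(size=(2, 2))])   # 2 examples, 3 parameters
    y = rng.normal(size=2)

    w_hat = np.linalg.pinv(X) @ y              # minimum-norm least squares solution
    print(X @ w_hat - y)                       # ≈ 0: both examples are fit exactly

    P_row = np.linalg.pinv(X) @ X              # projector onto the span of the example rows
    print(np.allclose(P_row @ w_hat, w_hat))   # True: ŵ has no component outside that span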
Properties of estimates
• Suppose the mean response (output) for any input x can
indeed be modeled as a linear function with some “true”
parameters w∗:
E{ y|x } = f(x; w∗) = (w∗)^T [1 x]^T
We can ask if our estimate ŵ (based on a limited training
set) is in any sense close to w∗.
Bias
• One measure of deviation is bias, which gauges any
systematic deviation from w∗:
Bias = E{ ŵ } − w∗
where the expectation is taken with respect to resampled
training sets of the same size; each input/output pair (x, y)
in the training set is assumed to be an independent sample
from some distribution P
• In linear regression the estimate ŵ is unbiased, i.e., E{ ŵ }−
w∗ = 0 (problem set)
• This means that the predictions are unbiased as well:
E{ f(x; ŵ) } = E{ ŵ^T [1 x]^T } = (w∗)^T [1 x]^T = f(x; w∗)
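A quick simulation sketch of unbiasedness (the linear-plus-Gaussian-noise data model, training set size, and number of repetitions are assumptions): averaging ŵ over many resampled training sets approaches w∗.

    import numpy as np

    rng = np.random.default_rng(6)
    w_star = np.array([-0.5, 1.5])                 # assumed "true" parameters [w0*, w1*]

    def fit_once(n=20):
        x = rng.uniform(-2, 2, size=n)
        y = w_star[0] + w_star[1] * x + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x])
        return np.linalg.lstsq(X, y, rcond=None)[0]

    w_hats = np.array([fit_once() for _ in range(5000)])
    print(w_hats.mean(axis=0))                     # ≈ w_star, i.e. E{ŵ} − w∗ ≈ 0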
Linear regression: generalization
• We’d like to understand how our linear predictions
“improve” as a function of the number of training examples
{(x1, y1), . . . , (xn, yn)}
(Figure: fitted lines for training sets of size n = 20, 40, 60 and in the limit n = ∞)
We assume that there is a systematic relation between x and
y: each training example (x, y) is an independent sample
from a fixed but unknown distribution P
Linear regression: generalization
Training examples {(x1, y1), . . . , (xn, yn)}
Test examples {(xn+1, yn+1), . . . , (xn+N , yn+N )}
ŵ is the parameter estimate from the training examples.
• Types of errors:
Mean training error = (1/n) Σ_{i=1}^n (yi − ŵ0 − ŵ1 xi)²
Mean test error = (1/N) Σ_{i=n+1}^{n+N} (yi − ŵ0 − ŵ1 xi)²
“Generalization” error = E_{(x,y)∼P} { (y − ŵ0 − ŵ1 x)² }
(note: ŵ0 and ŵ1 are themselves random variables as they
depend on the training set)
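A short sketch computing the mean training and test errors for a least squares fit (the toy data model and the sizes n and N are assumptions); with a very large held-out sample the mean test error approximates the generalization error:

    import numpy as np

    rng = np.random.default_rng(7)

    def sample(n):
        x = rng.uniform(-2, 2, size=n)
        return x, -0.5 + 1.5 * x + rng.normal(size=n)   # assumed data model

    x_tr, y_tr = sample(20)                 # training examples
    x_te, y_te = sample(10_000)             # test examples (large N)

    X_tr = np.column_stack([np.ones_like(x_tr), x_tr])
    w0, w1 = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]

    print(np.mean((y_tr - w0 - w1 * x_tr) ** 2))   # mean training error
    print(np.mean((y_te - w0 - w1 * x_te) ** 2))   # mean test error ≈ generalization error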
Linear regression: generalization
• We can decompose the “generalization” error
E_{(x,y)∼P} { (y − ŵ0 − ŵ1 x)² }
into two terms:
1. error of the best predictor in the class
E_{(x,y)∼P} { (y − w0∗ − w1∗ x)² } = min_{w0,w1} E_{(x,y)∼P} { (y − w0 − w1 x)² }
2. and how well we approximate the best predictor
E_{(x,y)∼P} { ((w0∗ + w1∗ x) − (ŵ0 + ŵ1 x))² }
• This holds for any input/output relation described by the
distribution P
Brief derivation
(y − ŵ0 − ŵ1 x)²
  = ( (y − (w0∗ + w1∗ x)) + ((w0∗ + w1∗ x) − (ŵ0 + ŵ1 x)) )²
  = (y − (w0∗ + w1∗ x))²
    + 2 (y − (w0∗ + w1∗ x)) ((w0∗ + w1∗ x) − (ŵ0 + ŵ1 x))
    + ((w0∗ + w1∗ x) − (ŵ0 + ŵ1 x))²
The cross-term (the middle line) vanishes when we take the expectation with
respect to (x, y) ∼ P: since w∗ is the best linear predictor, the residual
y − (w0∗ + w1∗ x) has zero mean and is uncorrelated with x, hence orthogonal
to any linear function of x such as (w0∗ + w1∗ x) − (ŵ0 + ŵ1 x).
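A Monte Carlo sketch of the decomposition (the data model and the particular fixed ŵ are assumptions; ŵ is held fixed, as in the derivation):

    import numpy as np

    rng = np.random.default_rng(8)
    x = rng.uniform(-2, 2, size=200_000)
    y = -0.5 + 1.5 * x + rng.normal(size=x.size)   # E{y|x} is linear, so w∗ = [-0.5, 1.5]

    w0s, w1s = -0.5, 1.5                           # best linear predictor w∗
    w0h, w1h = -0.3, 1.2                           # some fixed estimate ŵ

    gen_err    = np.mean((y - w0h - w1h * x) ** 2)
    best_err   = np.mean((y - w0s - w1s * x) ** 2)
    approx_err = np.mean(((w0s + w1s * x) - (w0h + w1h * x)) ** 2)
    cross      = np.mean((y - w0s - w1s * x) * ((w0s + w1s * x) - (w0h + w1h * x)))

    print(gen_err, best_err + approx_err)          # the two agree up to Monte Carlo error
    print(cross)                                   # ≈ 0: the cross-term vanishes in expectation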
Overfitting
• With too few training examples our linear regression model
may achieve zero training error but nevertheless have a large
generalization error
(Figure: a linear fit passing exactly through a couple of training points, marked ×)
When the training error no longer bears any relation to the
generalization error, the model overfits the data
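A final sketch of this effect (same assumed toy model as before): fitting a line to just two training points gives zero training error but a much larger test error.

    import numpy as np

    rng = np.random.default_rng(9)

    def sample(n):
        x = rng.uniform(-2, 2, size=n)
        return x, -0.5 + 1.5 * x + rng.normal(size=n)

    x_tr, y_tr = sample(2)                          # only two training examples
    x_te, y_te = sample(10_000)

    X_tr = np.column_stack([np.ones_like(x_tr), x_tr])
    w0, w1 = np.linalg.solve(X_tr, y_tr)            # the line passes through both points

    print(np.mean((y_tr - w0 - w1 * x_tr) ** 2))    # essentially zero training error
    print(np.mean((y_te - w0 - w1 * x_te) ** 2))    # typically much larger test error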