Machine learning: lecture 2
Tommi S. Jaakkola
MIT AI Lab
Topics
• Brief review of background
• Linear regression
– estimation criterion
– least squares solution, properties
– generalization, overfitting
Brief review of background
• Expectation and sample mean
E_{X∼P}{ X } = E{ X } ≈ x̄ = (1/n) Σ_{i=1}^n xi
where each xi is a sample from P.
• Variance and sample variance
Var{X} = E{ (X − E{X})² } ≈ (1/n) Σ_{i=1}^n (xi − x̄)²
Brief review of background cont’d
• Covariance and sample covariance
Cov{X1, X2} = E{ (X1 − E{X1})(X2 − E{X2}) } ≈ (1/n) Σ_{i=1}^n (x1i − x̄1)(x2i − x̄2)
where (x1i, x2i) is the ith joint sample from P.
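As a quick numerical illustration (a minimal numpy sketch; the Gaussian sampling model, the 0.5 coefficient, and the seed are my assumptions, not part of the lecture), the sample quantities approach the population ones as n grows:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    x1 = rng.normal(loc=1.0, scale=2.0, size=n)   # samples of X1 ~ N(1, 4)
    x2 = 0.5 * x1 + rng.normal(size=n)            # samples of X2, correlated with X1

    print(x1.mean())                                       # sample mean      ≈ E{X1} = 1
    print(x1.var())                                        # sample variance  ≈ Var{X1} = 4
    print(np.mean((x1 - x1.mean()) * (x2 - x2.mean())))    # sample covariance ≈ Cov{X1, X2} = 2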
Brief review of background cont’d
• Conditional expectation:
E{ Y |x } = ∫ y p(y|x) dy
E{ E{ Y |X } } = E{ Y }
where X and Y are random variables governed by a
distribution p; x is a possible value of X.
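A small Monte Carlo check of the identity E{ E{Y|X} } = E{Y} (a sketch with an assumed toy model Y = 2X + noise, X standard normal; not from the lecture):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=100_000)
    y = 2.0 * x + rng.normal(size=x.size)   # Y = 2X + noise, so E{Y|X=x} = 2x

    print(np.mean(2.0 * x))   # ≈ E{ E{Y|X} }
    print(np.mean(y))         # ≈ E{Y}; the two agree up to sampling noise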
Regression
(Figure: scatter plot of training outputs y against inputs x)
• The goal is to predict responses/outputs for inputs
• We need to define
1. function class, the type of predictions we consider
2. fitting criterion (loss) that measures the degree of fit to
the data
Linear regression
• Linear functions of one variable (two parameters)
f(x; w) = w0 + w1 x,   w = [w0 w1]^T
and a squared loss: Loss(y, f(x; w)) = (y − f(x; w))²/2.
• Estimation based on minimizing the empirical loss
Jn(w) = Σ_{i=1}^n Loss(yi, f(xi; w))
with respect to the parameters w.
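For concreteness, a tiny sketch of the function class and the empirical loss (the names f, loss, and J_n are mine, not from the slides; the example data are made up):

    import numpy as np

    def f(x, w):                    # linear function class: f(x; w) = w0 + w1*x
        return w[0] + w[1] * x

    def loss(y, pred):              # squared loss: (y − f(x; w))² / 2
        return 0.5 * (y - pred) ** 2

    def J_n(w, x, y):               # empirical loss: sum of losses over the training set
        return loss(y, f(x, w)).sum()

    x = np.array([-1.0, 0.0, 1.0])
    y = np.array([-2.0, -0.5, 1.0])
    print(J_n(np.array([0.0, 1.0]), x, y))   # empirical loss of the candidate w = [0, 1]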
Linear regression: estimation
• We minimize the empirical squared loss
Jn(w) = Σ_{i=1}^n Loss(yi, f(xi; w)) = Σ_{i=1}^n (yi − w0 − w1 xi)²/2
Setting the derivatives with respect to w0 and w1 to zero, we get
necessary conditions for the “optimal” parameter values
∂Jn(w)/∂w0 = − Σ_{i=1}^n (yi − w0 − w1 xi) = 0
∂Jn(w)/∂w1 = − Σ_{i=1}^n (yi − w0 − w1 xi) xi = 0
Note: These conditions mean that the prediction error
(yi − w0 − w1xi) has zero mean and is decorrelated with the
inputs xi
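These two properties are easy to verify numerically (a minimal numpy sketch; the toy data-generating model and seed are assumptions, not from the lecture):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(-2, 2, size=200)
    y = -0.5 + 1.5 * x + rng.normal(size=x.size)   # assumed toy data

    X = np.column_stack([np.ones_like(x), x])      # rows (1, xi)
    w0_hat, w1_hat = np.linalg.lstsq(X, y, rcond=None)[0]

    err = y - w0_hat - w1_hat * x                  # prediction errors
    print(err.mean())                              # ≈ 0: zero mean
    print(np.mean(err * x))                        # ≈ 0: decorrelated with the inputs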
Linear regression: estimation
• The prediction error (yi − w0 − w1xi) is decorrelated with
the inputs xi
Linear regression: estimation
∂Jn(w)/∂w0 = − Σ_{i=1}^n (yi − w0 − w1 xi) = 0
∂Jn(w)/∂w1 = − Σ_{i=1}^n (yi − w0 − w1 xi) xi = 0
• Solution via matrix inversion
w0 (Σ_{i=1}^n 1) + w1 (Σ_{i=1}^n xi) = Σ_{i=1}^n yi
w0 (Σ_{i=1}^n xi) + w1 (Σ_{i=1}^n xi²) = Σ_{i=1}^n yi xi
or Φw = b, where
Φ = [ Σ_{i=1}^n 1     Σ_{i=1}^n xi  ]        b = [ Σ_{i=1}^n yi    ]
    [ Σ_{i=1}^n xi    Σ_{i=1}^n xi² ]            [ Σ_{i=1}^n yi xi ]
• If Φ is invertible, we get our parameter estimates via ŵ = Φ⁻¹ b
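A minimal numpy sketch of this 2×2 system (the toy data model is assumed; solving the linear system is used here rather than forming Φ⁻¹ explicitly):

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.uniform(-2, 2, size=100)
    y = -0.5 + 1.5 * x + rng.normal(size=x.size)   # assumed toy data

    Phi = np.array([[x.size,        x.sum()],
                    [x.sum(), (x ** 2).sum()]])    # the 2x2 matrix of sums
    b = np.array([y.sum(), (y * x).sum()])

    w_hat = np.linalg.solve(Phi, b)                # [ŵ0, ŵ1]
    print(w_hat)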
Linear regression
• In matrix notation, we minimize
  ‖ [y1; … ; yn] − [1 x1; … ; 1 xn] [w0; w1] ‖² / 2
or ‖y − Xw‖²/2
By setting the derivatives to zero, we get
X^T y − X^T X w = 0   ⇒   ŵ = (X^T X)⁻¹ X^T y
where X^T X = Φ and X^T y = b as before
Note: the solution is a linear function of the outputs y
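In numpy this is a one-liner (same assumed toy data as above; np.linalg.lstsq minimizes ‖y − Xw‖² without explicitly forming (X^T X)⁻¹, which is numerically safer):

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.uniform(-2, 2, size=100)
    y = -0.5 + 1.5 * x + rng.normal(size=x.size)   # assumed toy data

    X = np.column_stack([np.ones_like(x), x])      # design matrix with rows (1, xi)
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes ‖y − Xw‖²
    print(w_hat)                                   # [ŵ0, ŵ1]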
Linear regression: pseudo-inverse
• 2-D example: ŵ = (X^T X)† X^T y
  yi ≈ f(xi; ŵ) = ŵ0 + ŵ1 x1i + ŵ2 x2i = ŵ^T [1 x1i x2i]^T
(Figure: the estimated weight vector w lies in the subspace spanned by the examples)
• We find the solution in the subspace spanned by the examples
(weight vector set to zero in the orthogonal dimensions)
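A small numpy sketch of this (assumed data with two inputs and only two examples): np.linalg.pinv gives the minimum-norm least squares solution, whose weight vector lies in the span of the example rows.

    import numpy as np

    rng = np.random.default_rng(5)
    X = np.column_stack([np.ones(2), rng.normal(size=(2, 2))])   # 2 examples, 3 parameters
    y = rng.normal(size=2)

    w_hat = np.linalg.pinv(X) @ y              # minimum-norm least squares solution
    print(X @ w_hat - y)                       # ≈ 0: both examples are fit exactly

    P_row = np.linalg.pinv(X) @ X              # projector onto the span of the example rows
    print(np.allclose(P_row @ w_hat, w_hat))   # True: ŵ has no component outside that span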
Properties of estimates
• Suppose the mean response (output) for any input x can
indeed be modeled as a linear function with some “true”
parameters w∗:
E{ y|x } = f(x; w∗) = (w∗)^T [1 x]^T
We can ask if our estimate ŵ (based on a limited training
set) is in any sense close to w∗.
Bias
• One measure of deviation is bias, which gauges any
systematic deviation from w∗:
Bias = E{ ŵ } − w∗
where the expectation is taken with respect to resampled
training sets of the same size; each input/output pair (x, y)
in the training set is assumed to be an independent sample
from some distribution P
• In linear regression the estimate ŵ is unbiased, i.e., E{ ŵ }−
w∗ = 0 (problem set)
• This means that the predictions are unbiased as well:
E{ f(x; ŵ) } = E{ ŵ^T [1 x]^T } = (w∗)^T [1 x]^T = f(x; w∗)
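A quick simulation sketch of unbiasedness (the linear-plus-Gaussian-noise data model, training set size, and number of repetitions are assumptions): averaging ŵ over many resampled training sets approaches w∗.

    import numpy as np

    rng = np.random.default_rng(6)
    w_star = np.array([-0.5, 1.5])                 # assumed "true" parameters [w0*, w1*]

    def fit_once(n=20):
        x = rng.uniform(-2, 2, size=n)
        y = w_star[0] + w_star[1] * x + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x])
        return np.linalg.lstsq(X, y, rcond=None)[0]

    w_hats = np.array([fit_once() for _ in range(5000)])
    print(w_hats.mean(axis=0))                     # ≈ w_star, i.e. E{ŵ} − w∗ ≈ 0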
Linear regression: generalization
• We’d like to understand how our linear predictions
“improve” as a function of the number of training examples
{(x1, y1), . . . , (xn, yn)}
(Figure: fitted lines for training sets of size n = 20, 40, 60 and in the limit n = ∞)
We assume that there is a systematic relation between x and
y: each training example (x, y) is an independent sample
from a fixed but unknown distribution P
Linear regression: generalization
Training examples {(x1, y1), . . . , (xn, yn)}
Test examples {(xn+1, yn+1), . . . , (xn+N , yn+N )}
ŵ is the parameter estimate from the training examples.
• Types of errors:
Mean training error = (1/n) Σ_{i=1}^n (yi − ŵ0 − ŵ1 xi)²
Mean test error = (1/N) Σ_{i=n+1}^{n+N} (yi − ŵ0 − ŵ1 xi)²
“Generalization” error = E_{(x,y)∼P} { (y − ŵ0 − ŵ1 x)² }
(note: ŵ0 and ŵ1 are themselves random variables as they
depend on the training set)
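A short sketch computing the mean training and test errors for a least squares fit (the toy data model and the sizes n and N are assumptions); with a very large held-out sample the mean test error approximates the generalization error:

    import numpy as np

    rng = np.random.default_rng(7)

    def sample(n):
        x = rng.uniform(-2, 2, size=n)
        return x, -0.5 + 1.5 * x + rng.normal(size=n)   # assumed data model

    x_tr, y_tr = sample(20)                 # training examples
    x_te, y_te = sample(10_000)             # test examples (large N)

    X_tr = np.column_stack([np.ones_like(x_tr), x_tr])
    w0, w1 = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]

    print(np.mean((y_tr - w0 - w1 * x_tr) ** 2))   # mean training error
    print(np.mean((y_te - w0 - w1 * x_te) ** 2))   # mean test error ≈ generalization error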
Linear regression: generalization
• We can decompose the “generalization” error
E_{(x,y)∼P} { (y − ŵ0 − ŵ1 x)² }
into two terms:
1. error of the best predictor in the class
E_{(x,y)∼P} { (y − w0∗ − w1∗ x)² } = min_{w0,w1} E_{(x,y)∼P} { (y − w0 − w1 x)² }
2. and how well we approximate the best predictor
E_{(x,y)∼P} { ((w0∗ + w1∗ x) − (ŵ0 + ŵ1 x))² }
• This holds for any input/output relation described by the
distribution P
Brief derivation
(y − ŵ0 − ŵ1 x)²
  = ( (y − (w0∗ + w1∗ x)) + ((w0∗ + w1∗ x) − (ŵ0 + ŵ1 x)) )²
  = (y − (w0∗ + w1∗ x))²
    + 2 (y − (w0∗ + w1∗ x)) ((w0∗ + w1∗ x) − (ŵ0 + ŵ1 x))
    + ((w0∗ + w1∗ x) − (ŵ0 + ŵ1 x))²
The cross-term (the middle line) vanishes when we take the expectation with
respect to (x, y) ∼ P: since w∗ is the best linear predictor, the residual
y − (w0∗ + w1∗ x) has zero mean and is uncorrelated with x, hence orthogonal
to any linear function of x such as (w0∗ + w1∗ x) − (ŵ0 + ŵ1 x).
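A Monte Carlo sketch of the decomposition (the data model and the particular fixed ŵ are assumptions; ŵ is held fixed, as in the derivation):

    import numpy as np

    rng = np.random.default_rng(8)
    x = rng.uniform(-2, 2, size=200_000)
    y = -0.5 + 1.5 * x + rng.normal(size=x.size)   # E{y|x} is linear, so w∗ = [-0.5, 1.5]

    w0s, w1s = -0.5, 1.5                           # best linear predictor w∗
    w0h, w1h = -0.3, 1.2                           # some fixed estimate ŵ

    gen_err    = np.mean((y - w0h - w1h * x) ** 2)
    best_err   = np.mean((y - w0s - w1s * x) ** 2)
    approx_err = np.mean(((w0s + w1s * x) - (w0h + w1h * x)) ** 2)
    cross      = np.mean((y - w0s - w1s * x) * ((w0s + w1s * x) - (w0h + w1h * x)))

    print(gen_err, best_err + approx_err)          # the two agree up to Monte Carlo error
    print(cross)                                   # ≈ 0: the cross-term vanishes in expectation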
Overfitting
• With too few training examples our linear regression model
may achieve zero training error but nevertheless have a large
generalization error
(Figure: a linear fit passing exactly through a couple of training points, marked ×)
When the training error no longer bears any relation to the
generalization error, the model overfits the data
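A final sketch of this effect (same assumed toy model as before): fitting a line to just two training points gives zero training error but a much larger test error.

    import numpy as np

    rng = np.random.default_rng(9)

    def sample(n):
        x = rng.uniform(-2, 2, size=n)
        return x, -0.5 + 1.5 * x + rng.normal(size=n)

    x_tr, y_tr = sample(2)                          # only two training examples
    x_te, y_te = sample(10_000)

    X_tr = np.column_stack([np.ones_like(x_tr), x_tr])
    w0, w1 = np.linalg.solve(X_tr, y_tr)            # the line passes through both points

    print(np.mean((y_tr - w0 - w1 * x_tr) ** 2))    # essentially zero training error
    print(np.mean((y_te - w0 - w1 * x_te) ** 2))    # typically much larger test error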