Introduction to
Machine Learning and Data Mining
(Học máy và Khai phá dữ liệu)
Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology
2023
Contents
Introduction to Machine Learning & Data Mining
Supervised learning
Linear regression
Unsupervised learning
Practical advice
Linear regression: introduction
Regression problem: learn a function y = f(x) from a given training set
D = {(x1, y1), (x2, y2), …, (xM, yM)} such that
yi ≅ f(xi) for every i.
Each observation of x is represented by a vector in an n-
dimensional space, e.g., xi = (xi1, xi2, …, xin)T. Each dimension
represents an attribute/feature/variate.
Bold characters denote vectors.
Linear model: if f(x) is assumed to be of linear form
f(x,w) = w0 + w1x1 + … + wnxn
w0, w1, …, wn are the regression coefficients/weights. w0
is sometimes called the “bias”.
Note: learning a linear model is equivalent to learning the
coefficient vector w = (w0, w1, …, wn)T.
Linear regression: example
What is the best function?
     x         y
   0.13     -0.91
   1.02     -0.17
   3.17      1.61
  -2.76     -3.31
   1.44      0.18
   5.28      3.36
  -1.74     -2.46
   7.93      5.56
    ...       ...
[Figure: scatter plot of the data (x, y) with a candidate function f(x)]
Prediction
For each observation x = (x1, x2, …, xn)T
The true output: cx
(but unknown for future data)
Prediction by our model:
yx = w0 + w1x1 + … + wnxn
We often expect yx ≅ cx.
Prediction for a future observation z = (z1, z2, …, zn)T
Use the learned function to make prediction
f(z,w) = w0 + w1z1 + … + wnzn
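To make the prediction step concrete, here is a minimal Python/NumPy sketch (not from the lecture; the weights and the observation are made-up values for illustration):

```python
import numpy as np

def predict(w0, w, x):
    """Linear-model prediction: y = w0 + w1*x1 + ... + wn*xn."""
    return w0 + np.dot(w, x)

# Made-up weights and a made-up future observation z, just to show the call.
w0 = 0.5
w = np.array([1.2, -0.7, 3.0])   # (w1, w2, w3)
z = np.array([0.1, 2.0, -1.5])   # z = (z1, z2, z3)
print(predict(w0, w, z))          # w0 + w.z
```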
Learning a regression function
Learning goal: learn a function f* whose predictions in the
future are the best, i.e., whose generalization is the best.
Difficulty: there are infinitely many candidate functions.
How can we learn?
Is function f better than g?
We need a measure.
A loss function is often used to guide learning.
[Figure: plot of f(x) versus x]
Loss function
Definition:
The error/loss of the prediction for an observation x = (x1, x2, …, xn)T:
r(f,x) = [cx – f(x,w)]^2 = (cx – w0 – w1x1 – … – wnxn)^2
The expected loss (cost/risk) of f over the whole space:
E = Ex[r(f,x)] = Ex[cx – f(x)]^2
(Ex is the expectation over x)
The goal of learning is to find f* that minimizes the expected
loss:
f* = argmin_{f ∈ H} Ex[r(f,x)]
where H is the space of functions of linear form.
But we cannot work with this problem directly during learning,
because the distribution of x is unknown and the expectation cannot be computed exactly.
Empirical loss
We can only observe a set of training data D = {(x1, y1), (x2,
y2), …, (xM, yM)}, and have to learn f from D.
Residual sum of squares:
RSS(f) = Σ_{i=1}^{M} [yi – f(xi, w)]^2
Empirical loss (lỗi thực nghiệm): RSS/M.
It is an approximation of Ex[r(f,x)], which is often known as the
generalization error of f (lỗi tổng quát hoá).
Many learning algorithms are based on this RSS or its variants.
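As an illustration, RSS and the empirical loss can be computed in a few lines; this sketch assumes a linear model with weights (w0, w) and uses arbitrary made-up data:

```python
import numpy as np

def rss(w0, w, X, y):
    """Residual sum of squares: sum_i [y_i - f(x_i, w)]^2."""
    predictions = w0 + X @ w        # f(x_i, w) for every row x_i of X
    return np.sum((y - predictions) ** 2)

# Made-up data: M = 4 observations, n = 2 features.
X = np.array([[0.1, 1.0], [2.0, -0.5], [1.5, 0.3], [-1.0, 2.2]])
y = np.array([0.9, 1.1, 1.8, 0.2])
w0, w = 0.2, np.array([0.5, 0.4])

loss = rss(w0, w, X, y)
print("RSS =", loss)
print("empirical loss =", loss / len(y))   # RSS / M approximates Ex[r(f,x)]
```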
Methods: ordinary least squares (OLS)
Given D, we find f* (i.e., its weight vector w*) that minimizes RSS:
w* = argmin_w Σ_{i=1}^{M} (yi – w0 – w1xi1 – … – wnxin)^2     (1)
This method is often known as ordinary least squares (OLS, bình
phương tối thiểu).
Find w* by taking the gradient of RSS and solving the equation
RSS’ = 0. We obtain:
w* = (A^T A)^{-1} A^T y
where A is the data matrix of size M×(n+1), whose ith row is
Ai = (1, xi1, xi2, …, xin); B^{-1} denotes the inverse of matrix B; and y = (y1, y2, …, yM)T.
Methods: OLS
Input: D = {(x1, y1), (x2, y2), …, (xM, yM)}
Output: w*
Learning: compute
w* = (A^T A)^{-1} A^T y
where A is the data matrix of size M×(n+1), whose ith row is
Ai = (1, xi1, xi2, …, xin); B^{-1} denotes the inverse of matrix B; and y = (y1, y2, …, yM)T.
Note: we assume that A^T A is invertible.
Prediction for a new x: f(x, w*) = w0* + w1*x1 + … + wn*xn
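A minimal NumPy sketch of this recipe: build A by prepending a column of ones, then solve the normal equations (solving the linear system is usually preferred over forming the inverse explicitly). The tiny 1-D dataset reuses a few points from the earlier example table:

```python
import numpy as np

def ols_fit(X, y):
    """OLS: solve (A^T A) w = A^T y, where A = [1, X]."""
    A = np.column_stack([np.ones(len(X)), X])     # M x (n+1) data matrix
    return np.linalg.solve(A.T @ A, A.T @ y)

def ols_predict(w, X):
    """Prediction: f(x, w*) = w0* + w1*x1 + ... + wn*xn."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ w

# A few (x, y) points from the example table on the earlier slide.
X = np.array([[0.13], [1.02], [3.17], [-2.76], [1.44]])
y = np.array([-0.91, -0.17, 1.61, -3.31, 0.18])

w = ols_fit(X, y)
print("w* =", w)
print("prediction for x = 2.0:", ols_predict(w, np.array([[2.0]])))
```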
Methods: OLS example
[Figure: the training data (x, y) plotted together with the fitted line f*]
f*(x) = 0.81x – 0.78
Methods: limitations of OLS
OLS cannot work if A^T A is not invertible.
If some columns (attributes/features) of A are linearly dependent, then
A^T A will be singular and therefore not invertible.
(Nếu một vài cột của A phụ thuộc tuyến tính thì A sẽ không khả nghịch)
OLS requires considerable computation due to the need to
compute a matrix inverse.
It can be intractable for very high-dimensional problems.
OLS tends to overfit, because the learning phase
focuses only on minimizing the error on the training data.
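The first limitation is easy to demonstrate: with a duplicated (or linearly dependent) feature column, A^T A loses full rank and the inverse does not exist. A small sketch with made-up data:

```python
import numpy as np

# Made-up data in which the second feature is exactly twice the first,
# so the columns of A are linearly dependent.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [4.0, 8.0]])
A = np.column_stack([np.ones(len(X)), X])   # M x (n+1) data matrix
G = A.T @ A

print("rank of A^T A:", np.linalg.matrix_rank(G), "out of", G.shape[0])
try:
    np.linalg.inv(G)                        # the inverse that OLS needs
except np.linalg.LinAlgError as err:
    print("OLS cannot proceed:", err)       # typically reports a singular matrix
```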
Methods: Ridge regression (1)
Given D = {(x1, y1), (x2, y2), …, (xM, yM)}, we solve for:
w* = argmin_w [ Σ_{i=1}^{M} (yi – Ai·w)^2 + λ·||w||2^2 ]     (2)
where Ai = (1, xi1, xi2, …, xin) is composed from xi; λ is a
regularization constant (λ > 0); and ||w||2 is the L2 norm of w.
Methods: Ridge regression (2)
Problem (2) is equivalent to the following:
w* = argmin_w Σ_{i=1}^{M} (yi – Ai·w)^2     (3)
subject to ||w||2^2 ≤ t,
for some constant t.
The regularization/penalty term λ·||w||2^2:
Limits the magnitude/size of w* (i.e., reduces the search space for
f*).
Helps us to trade off between the fit of f on D and its
generalization on future observations.
Methods: Ridge regression (3)
We solve for w* by taking the gradient of the objective
function in (2), and then zeroing it. Therefore we obtain:
Where A is the data matrix of size Mx(n+1), whose the ith row is
Ai = (1, xi1, xi2, …, xin); B-1 is the inversion of matrix B; y = (y1, y2,
…, yM)T; In+1 is the identity matrix of size n+1.
Compared with OLS, Ridge can
Avoid the cases of singularity, unlike OLS. Hence Ridge always
works.
Reduce overfitting.
But error in the training data might be greater than OLS.
Note: the predictiveness of Ridge depends heavily on the
choice of the hyperparameter λ.
Methods: Ridge regression (4)
Input: D = {(x1, y1), (x2, y2), …, (xM, yM)} and λ>0
Output: w*
Learning: compute
w* = (A^T A + λI_{n+1})^{-1} A^T y
Prediction for a new x: f(x, w*) = w0* + w1*x1 + … + wn*xn
Note: to avoid some negative effects of the magnitude of y on
the covariates x, one should remove w0 from the penalty term in
(2). In this case, the solution for w* should be modified slightly.
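A minimal NumPy sketch of this Ridge recipe (penalizing w0 as in (2); the data and λ are made up):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge: solve (A^T A + lam*I) w = A^T y, where A = [1, X]."""
    A = np.column_stack([np.ones(len(X)), X])             # M x (n+1)
    I = np.eye(A.shape[1])                                # identity of size n+1
    return np.linalg.solve(A.T @ A + lam * I, A.T @ y)

def ridge_predict(w, X):
    A = np.column_stack([np.ones(len(X)), X])
    return A @ w

# Made-up data; lam = 0.1 is an arbitrary choice of the hyperparameter.
X = np.array([[0.13, 1.0], [1.02, -0.3], [3.17, 0.7], [-2.76, 2.1], [1.44, 0.0]])
y = np.array([-0.91, -0.17, 1.61, -3.31, 0.18])

w = ridge_fit(X, y, lam=0.1)
print("w* =", w)
print("training predictions:", ridge_predict(w, X))
```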
An example of using Ridge and OLS
The training set D contains 67 observations on prostate
cancer, each represented by 8 attributes. Ridge and
OLS were learned from D and then used to predict 30 new
observations.

w           Ordinary Least Squares    Ridge
w0                  2.465             2.452
lcavol              0.680             0.420
lweight             0.263             0.238
age                −0.141            −0.046
lbph                0.210             0.162
svi                 0.305             0.227
lcp                −0.288             0.000
gleason            −0.021             0.040
pgg45               0.267             0.133
Test RSS            0.521             0.492
Effects of λ in Ridge regression
W* = (w0, S1, S2, S3, S4, S5, S6, AGE, SEX, BMI, BP) changes
as the regularization constant λ changes.
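One way to observe this effect is to recompute the Ridge solution on a grid of λ values and watch the weights shrink toward zero as λ grows; a sketch with made-up data (different from the dataset behind the figure):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge closed form: w* = (A^T A + lam*I)^{-1} A^T y."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

# Made-up data with n = 3 features, only two of them informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    w = ridge_fit(X, y, lam)
    print(f"lambda = {lam:6}: w* = {np.round(w, 3)}")   # weights shrink as lambda grows
```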
LASSO
Ridge regression use L2 norm for regularization:
subject to (3)
Replacing L2 by L1 norm will result in LASSO:
Subject to
Equivalently:
This problem is non-differentiable the training algorithm
(4)
should be more complex than Ridge.
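Because of this non-differentiability, LASSO is typically solved with iterative methods such as coordinate descent rather than a closed form. The sketch below uses scikit-learn's Lasso (assuming scikit-learn is available); the data is made up, and alpha plays the role of λ in (4) up to scaling:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Made-up data: only the first 2 of 8 features actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = Lasso(alpha=0.1).fit(X, y)   # alpha ~ lambda in (4), up to scaling
print("w0 =", model.intercept_)
print("w  =", model.coef_)           # irrelevant features typically get weight exactly 0
```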
LASSO: regularization role
Different regularization types lead to different feasible domains for w.
LASSO often produces sparse solutions, i.e., many
components of w are exactly zero.
It performs shrinkage and selection at the same time.
Figure by Nicoguaro - Own work, CC BY 4.0,
https://commons.wikimedia.org/w/index.php?
OLS, Ridge, and LASSO
The training set D contains 67 observations on prostate
cancer, each was represented with 8 attributes. OLS, Ridge,
and LASSO were trained from D, and then predicted 30 new
observations.
Ordinary Least
w Squares Ridge LASSO
0 2.465 2.452 2.468
lcavol 0.680 0.420 0.533
lweight 0.263 0.238 0.169 Some weights
age −0.141 −0.046 are 0
lbph 0.210 0.162 0.002
some
attributes
svi 0.305 0.227 0.094 may not be
lcp −0.288 0.000 important
gleason −0.021 0.040
pgg45 0.267 0.133
Test RSS 0.521 0.492 0.479
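A sketch of this kind of comparison on made-up data, mimicking the 67/30 train/test split and reporting test RSS for the three methods with scikit-learn (the numbers will of course differ from the prostate-cancer table above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Made-up data: 8 features, only a few of them informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(97, 8))
y = 1.5 * X[:, 0] + 0.8 * X[:, 2] + rng.normal(scale=0.7, size=97)

# Train on the first 67 observations, test on the remaining 30.
X_train, y_train = X[:67], y[:67]
X_test, y_test = X[67:], y[67:]

for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("LASSO", Lasso(alpha=0.1))]:
    model.fit(X_train, y_train)
    test_rss = np.sum((y_test - model.predict(X_test)) ** 2)
    print(f"{name:6s} test RSS = {test_rss:.3f}")
```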
References
Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of
Statistical Learning. Springer, 2009.
Robert Tibshirani (1996). "Regression Shrinkage and Selection via the Lasso".
Journal of the Royal Statistical Society, Series B (Methodological), 58(1): 267–288.
Exercises
Derive the solutions of (1) and (2) in detail.
Derive the solution of (2) when removing w0 from the penalty
term.