L3 Linear Regression


Introduction to

Machine Learning and Data Mining


(Học máy và Khai phá dữ liệu)

Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology

2023
Contents
 Introduction to Machine Learning & Data Mining
 Supervised learning
     Linear regression
 Unsupervised learning
 Practical advice
Linear regression: introduction
 Regression problem: learn a function y = f(x) from given training data D = {(x1, y1), (x2, y2), …, (xM, yM)} such that yi ≅ f(xi) for every i.
 Each observation of x is represented by a vector in an n-dimensional space, e.g., xi = (xi1, xi2, …, xin)^T. Each dimension represents an attribute/feature/variate.
 Bold characters denote vectors.
 Linear model: f(x) is assumed to be of linear form
     f(x, w) = w0 + w1x1 + … + wnxn
 w0, w1, …, wn are the regression coefficients/weights; w0 is sometimes called the "bias".
 Note: learning a linear model is equivalent to learning the coefficient vector w = (w0, w1, …, wn)^T.
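As a quick illustration (not part of the original slides), the linear model above is just a dot product plus the bias term. A minimal NumPy sketch, with illustrative function and variable names:

import numpy as np

def f(x, w):
    # Linear model f(x, w) = w0 + w1*x1 + ... + wn*xn.
    # x: feature vector of length n; w: coefficients of length n+1, where w[0] is the bias w0.
    return w[0] + np.dot(w[1:], x)

# Example: with w = (1.0, 2.0, -0.5) and x = (3.0, 4.0),
# f(x, w) = 1.0 + 2.0*3.0 - 0.5*4.0 = 5.0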
Linear regression: example
What is the best function?

    x       y
    0.13   -0.91
    1.02   -0.17
    3.17    1.61
   -2.76   -3.31
    1.44    0.18
    5.28    3.36
   -1.74   -2.46
    7.93    5.56
    ...     ...
 [Figure: scatter plot of the data points with a candidate function f(x)]
Prediction
 For each observation x = (x1, x2, …, xn)^T:
 The true output: cx (but unknown for future data)
 Prediction by our model: yx = w0 + w1x1 + … + wnxn
 We often expect yx ≅ cx.
 Prediction for a future observation z = (z1, z2, …, zn)^T:
 Use the learned function to make the prediction: f(z, w) = w0 + w1z1 + … + wnzn
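Continuing the sketch above, predicting for a new observation z simply re-applies the learned coefficients (the numbers below are illustrative, not learned from any data):

import numpy as np

w = np.array([0.5, 1.2, -0.3])   # illustrative coefficients (w0, w1, w2)
z = np.array([2.0, 1.0])         # a new observation z = (z1, z2)
y_z = w[0] + np.dot(w[1:], z)    # f(z, w) = 0.5 + 1.2*2.0 - 0.3*1.0 = 2.6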
Learning a regression function
 Learning goal: learn a function f* such that its prediction in the future is the best, i.e., its generalization is the best.
 Difficulty: there is an infinite number of candidate functions.
 How can we learn? Is function f better than g?
 We need a measure: a loss function is often used to guide learning.
 [Figure: several candidate functions f(x) for the same data]
Loss function
 Definition:
 The error/loss of the prediction for an observation x = (x1, x2, …, xn)^T:
     r(f, x) = [cx - f(x, w)]^2 = (cx - w0 - w1x1 - … - wnxn)^2
 The expected loss (cost/risk) of f over the whole space:
     E = Ex[r(f, x)] = Ex[cx - f(x)]^2
   (Ex is the expectation over x)
 The goal of learning is to find f* that minimizes the expected loss:
     f* = argmin_{f ∈ H} Ex[r(f, x)]
   where H is the space of functions of linear form.
 But we cannot work directly with this problem during learning, since the distribution of x (and the true output cx for future data) is unknown.
Empirical loss
 We can only observe a set of training data D = {(x1, y1), (x2, y2), …, (xM, yM)}, and have to learn f from D.
 Residual sum of squares:
     RSS(f) = Σ_{i=1}^{M} [yi - f(xi, w)]^2
 Empirical loss: RSS(f)/M, which is an approximation of Ex[r(f, x)].
 Ex[r(f, x)] is often known as the generalization error of f.
 Many learning algorithms are based on this RSS or its variants.
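A minimal NumPy sketch of the RSS and the averaged empirical loss; the function names and the averaging over the M training examples are illustrative assumptions:

import numpy as np

def rss(w, X, y):
    # Residual sum of squares: sum_i (y_i - f(x_i, w))^2
    # X: (M, n) data matrix, y: (M,) targets, w: (n+1,) coefficients with w[0] = w0.
    predictions = w[0] + X @ w[1:]
    return np.sum((y - predictions) ** 2)

def empirical_loss(w, X, y):
    # Approximates Ex[r(f, x)] by averaging the loss over the M training examples.
    return rss(w, X, y) / len(y)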
Methods: ordinary least squares (OLS)
 Given D, we find f* that minimizes the RSS:
     w* = argmin_w Σ_{i=1}^{M} (yi - w0 - w1xi1 - … - wnxin)^2    (1)
 This method is known as ordinary least squares (OLS).
 Find w* by taking the gradient of the RSS and solving the equation RSS' = 0. We obtain:
     w* = (A^T A)^(-1) A^T y
 where A is the data matrix of size M x (n+1) whose i-th row is Ai = (1, xi1, xi2, …, xin); B^(-1) denotes the inverse of a matrix B; and y = (y1, y2, …, yM)^T.
Methods: OLS
Input: D = {(x1, y1), (x2, y2), …, (xM, yM)}
Output: w*
Learning: compute
    w* = (A^T A)^(-1) A^T y
 where A is the data matrix of size M x (n+1) whose i-th row is Ai = (1, xi1, xi2, …, xin); B^(-1) denotes the inverse of a matrix B; and y = (y1, y2, …, yM)^T.
 Note: we assume that A^T A is invertible.
Prediction for a new x: f(x, w*) = w0* + w1*·x1 + … + wn*·xn
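A minimal NumPy sketch of this OLS recipe (ols_fit and ols_predict are illustrative names; np.linalg.lstsq is used instead of forming (A^T A)^(-1) explicitly, for numerical stability):

import numpy as np

def ols_fit(X, y):
    # X: (M, n) data matrix, y: (M,) targets. Returns w* of length n+1 with w*[0] = w0.
    M = X.shape[0]
    A = np.hstack([np.ones((M, 1)), X])      # prepend a column of 1s for the bias w0
    # Solves min_w ||A w - y||^2, i.e. the same solution as w* = (A^T A)^(-1) A^T y
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def ols_predict(w, X_new):
    # f(x, w*) = w0 + w1*x1 + ... + wn*xn for each row of X_new
    return w[0] + X_new @ w[1:]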
Methods: OLS example

 [Figure: the training data points (the same data as in the earlier example) and the fitted line f*]
 f*(x) = 0.81x - 0.78
Methods: limitations of OLS
 OLS cannot work if A^T A is not invertible.
 If some columns (attributes/features) of A are linearly dependent, then A^T A is singular and therefore not invertible.
 OLS requires considerable computation due to the need to compute a matrix inverse.
 This can be intractable for very high-dimensional problems.
 OLS tends to overfit, because the learning phase focuses only on minimizing the error on the training data.
Methods: Ridge regression (1)
 Given D = {(x1, y1), (x2, y2), …, (xM, yM)}, we solve for:
     w* = argmin_w [ Σ_{i=1}^{M} (yi - Ai·w)^2 + λ·||w||_2^2 ]    (2)
 where Ai = (1, xi1, xi2, …, xin) is composed from xi; λ is a regularization constant (λ > 0); and ||w||_2 is the L2 norm.
Methods: Ridge regression (2)
 Problem (2) is equivalent to the following:
     w* = argmin_w Σ_{i=1}^{M} (yi - Ai·w)^2    (3)
     subject to ||w||_2^2 ≤ t, for some constant t.
 The regularization/penalty term λ·||w||_2^2:
 Limits the magnitude/size of w* (i.e., reduces the search space for f*).
 Helps us to trade off between the fit of f on D and its generalization on future observations.
Methods: Ridge regression (3)
 We solve for w* by taking the gradient of the objective function in (2) and setting it to zero. We obtain:
     w* = (A^T A + λ·I_{n+1})^(-1) A^T y
 where A is the data matrix of size M x (n+1) whose i-th row is Ai = (1, xi1, xi2, …, xin); B^(-1) denotes the inverse of a matrix B; y = (y1, y2, …, yM)^T; and I_{n+1} is the identity matrix of size n+1.
 Compared with OLS, Ridge can:
 Avoid the singular cases, hence Ridge always works.
 Reduce overfitting.
 But the error on the training data might be greater than that of OLS.
 Note: the predictive performance of Ridge depends heavily on the choice of the hyperparameter λ.
Methods: Ridge regression (4)
Input: D = {(x1, y1), (x2, y2), …, (xM, yM)} and λ > 0
Output: w*
Learning: compute
    w* = (A^T A + λ·I_{n+1})^(-1) A^T y
Prediction for a new x: f(x, w*) = w0* + w1*·x1 + … + wn*·xn
 Note: to avoid some undesirable effects of the magnitude of y on the covariates x, one should remove w0 from the penalty term in (2). In this case, the solution for w* changes slightly.
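A minimal NumPy sketch of the closed-form Ridge solution above (illustrative names; note that this version penalizes w0 exactly as in (2), whereas the note above suggests excluding w0 from the penalty in practice):

import numpy as np

def ridge_fit(X, y, lam):
    # X: (M, n) data matrix, y: (M,) targets, lam: regularization constant λ > 0.
    M, n = X.shape
    A = np.hstack([np.ones((M, 1)), X])          # prepend 1s for the bias w0
    # w* = (A^T A + λ I_{n+1})^(-1) A^T y, computed via a linear solve
    w = np.linalg.solve(A.T @ A + lam * np.eye(n + 1), A.T @ y)
    return w

def ridge_predict(w, X_new):
    return w[0] + X_new @ w[1:]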
An example of using Ridge and OLS
The training set D contains 67 observations on prostate cancer, each represented by 8 attributes. Ridge and OLS were learned from D, and then used to predict 30 new observations.

    w            Ordinary Least Squares    Ridge
    w0           2.465                     2.452
    lcavol       0.680                     0.420
    lweight      0.263                     0.238
    age          −0.141                    −0.046
    lbph         0.210                     0.162
    svi          0.305                     0.227
    lcp          −0.288                    0.000
    gleason      −0.021                    0.040
    pgg45        0.267                     0.133
    Test RSS     0.521                     0.492
Effects of λ in Ridge regression
The learned coefficients w* = (w0, S1, S2, S3, S4, S5, S6, AGE, SEX, BMI, BP) change as the regularization constant λ changes.
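A small sketch of this effect on synthetic data (the data and the λ grid are illustrative assumptions): as λ grows, the Ridge coefficients shrink toward zero.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
true_w = np.array([2.0, -1.0, 0.5, 0.0])
y = X @ true_w + rng.normal(scale=0.1, size=50)

A = np.hstack([np.ones((50, 1)), X])
for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    w = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ y)
    print(lam, np.round(w, 3))    # coefficients shrink toward 0 as λ increases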
LASSO
 Ridge regression uses the L2 norm for regularization:
     minimize_w Σ_{i=1}^{M} (yi - Ai·w)^2   subject to ||w||_2^2 ≤ t    (3)
 Replacing the L2 norm by the L1 norm results in LASSO:
     minimize_w Σ_{i=1}^{M} (yi - Ai·w)^2   subject to ||w||_1 ≤ t
 Equivalently:
     w* = argmin_w [ Σ_{i=1}^{M} (yi - Ai·w)^2 + λ·||w||_1 ]    (4)
 This problem is non-differentiable, so the training algorithm must be more involved than that of Ridge.
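Since (4) is non-differentiable, LASSO is usually solved with specialized algorithms such as coordinate descent. A minimal sketch using scikit-learn's Lasso on synthetic data (the data and the alpha value, which plays the role of λ up to scaling, are illustrative assumptions):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 0.8 * X[:, 0] - 0.5 * X[:, 3] + rng.normal(scale=0.1, size=100)

model = Lasso(alpha=0.1)               # solved internally by coordinate descent
model.fit(X, y)
print(model.intercept_, model.coef_)   # many coefficients come out exactly 0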
LASSO: regularization role
 The two regularization types lead to different feasible domains for w.
 LASSO often produces sparse solutions, i.e., many components of w are zero.
 Shrinkage and selection at the same time.
 [Figure: the constraint regions induced by the two regularization types. Figure by Nicoguaro, own work, CC BY 4.0, https://commons.wikimedia.org/w/index.php?]
OLS, Ridge, and LASSO
The training set D contains 67 observations on prostate cancer, each represented by 8 attributes. OLS, Ridge, and LASSO were trained from D, and then used to predict 30 new observations.

    w            Ordinary Least Squares    Ridge     LASSO
    w0           2.465                     2.452     2.468
    lcavol       0.680                     0.420     0.533
    lweight      0.263                     0.238     0.169
    age          −0.141                    −0.046    0
    lbph         0.210                     0.162     0.002
    svi          0.305                     0.227     0.094
    lcp          −0.288                    0.000     0
    gleason      −0.021                    0.040     0
    pgg45        0.267                     0.133     0
    Test RSS     0.521                     0.492     0.479

 Some LASSO weights are exactly 0, so the corresponding attributes may not be important (LASSO performs shrinkage and variable selection at once).
References
 Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning. Springer, 2009.
 Tibshirani, Robert (1996). "Regression Shrinkage and Selection via the Lasso". Journal of the Royal Statistical Society, Series B (Methodological), 58(1): 267-288.
Exercises
 Derive the solution of (1) and (2) in detail.
 Derive the solution of (2) when removing w0 from the penalty term.
