Introduction to
Machine Learning and Data Mining
(Học máy và Khai phá dữ liệu)
Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology
2023
Contents
Introduction to Machine Learning & Data Mining
Supervised learning
Linear regression
Unsupervised learning
Practical advice
Linear regression: introduction
Regression problem: learn a function y = f(x) from a given training set
D = {(x1, y1), (x2, y2), …, (xM, yM)} such that
yi ≅ f(xi) for every i.
Each observation of x is represented by a vector in an n-
dimensional space, e.g., xi = (xi1, xi2, …, xin)T. Each dimension
represents an attribute/feature/variate.
Bold characters denote vectors.
Linear model: if f(x) is assumed to be of linear form
f(x,w) = w0 + w1x1 + … + wnxn
w0, w1, …, wn are the regression coefficients/weights. w0
is sometimes called the “bias”.
Note: learning a linear model is equivalent to learning the
coefficient vector w = (w0, w1, …, wn)T.
Linear regression: example
What is the best function?
     x         y
   0.13     -0.91
   1.02     -0.17
   3.17      1.61
  -2.76     -3.31
   1.44      0.18
   5.28      3.36
  -1.74     -2.46
   7.93      5.56
    ...       ...
[Figure: scatter plot of the data (x, y) with a candidate function f(x)]
Prediction
For each observation x = (x1, x2, …, xn)T
The true output: cx
(but unknown for future data)
Prediction by our model:
yx = w0 + w1x1 + … + wnxn
We often expect yx ≅ cx.
Prediction for a future observation z = (z1, z2, …, zn)T
Use the learned function to make prediction
f(z,w) = w0 + w1z1 + … + wnzn
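To make the prediction step concrete, here is a minimal Python/NumPy sketch (not from the lecture; the weights and the observation are made-up values for illustration):

```python
import numpy as np

def predict(w0, w, x):
    """Linear-model prediction: y = w0 + w1*x1 + ... + wn*xn."""
    return w0 + np.dot(w, x)

# Made-up weights and a made-up future observation z, just to show the call.
w0 = 0.5
w = np.array([1.2, -0.7, 3.0])   # (w1, w2, w3)
z = np.array([0.1, 2.0, -1.5])   # z = (z1, z2, z3)
print(predict(w0, w, z))          # w0 + w.z
```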
Learning a regression function
Learning goal: learn a function f* whose predictions in the
future are the best, i.e., whose generalization is the best.
Difficulty: there are infinitely many candidate functions.
How can we learn?
Is function f better than g?
We need a measure.
A loss function is often used to guide learning.
[Figure: plot of f(x) versus x]
Loss function
Definition:
The error/loss of the prediction for an observation x = (x1, x2, …, xn)T:
r(f,x) = [cx – f(x,w)]^2 = (cx – w0 – w1x1 – … – wnxn)^2
The expected loss (cost/risk) of f over the whole space:
E = Ex[r(f,x)] = Ex[cx – f(x)]^2
(Ex is the expectation over x)
The goal of learning is to find f* that minimizes the expected
loss:
f* = argmin_{f ∈ H} Ex[r(f,x)]
where H is the space of functions of linear form.
But we cannot work with this problem directly during learning,
because the distribution of x is unknown and the expectation cannot be computed exactly.
Empirical loss
We can only observe a set of training data D = {(x1, y1), (x2,
y2), …, (xM, yM)}, and have to learn f from D.
Residual sum of squares:
RSS(f) = Σ_{i=1}^{M} [yi – f(xi, w)]^2
Empirical loss (lỗi thực nghiệm): RSS/M.
It is an approximation of Ex[r(f,x)], which is often known as the
generalization error of f (lỗi tổng quát hoá).
Many learning algorithms are based on this RSS or its variants.
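As an illustration, RSS and the empirical loss can be computed in a few lines; this sketch assumes a linear model with weights (w0, w) and uses arbitrary made-up data:

```python
import numpy as np

def rss(w0, w, X, y):
    """Residual sum of squares: sum_i [y_i - f(x_i, w)]^2."""
    predictions = w0 + X @ w        # f(x_i, w) for every row x_i of X
    return np.sum((y - predictions) ** 2)

# Made-up data: M = 4 observations, n = 2 features.
X = np.array([[0.1, 1.0], [2.0, -0.5], [1.5, 0.3], [-1.0, 2.2]])
y = np.array([0.9, 1.1, 1.8, 0.2])
w0, w = 0.2, np.array([0.5, 0.4])

loss = rss(w0, w, X, y)
print("RSS =", loss)
print("empirical loss =", loss / len(y))   # RSS / M approximates Ex[r(f,x)]
```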
Methods: ordinary least squares (OLS)
Given D, we find f* (i.e., its weight vector w*) that minimizes RSS:
w* = argmin_w Σ_{i=1}^{M} (yi – w0 – w1xi1 – … – wnxin)^2     (1)
This method is often known as ordinary least squares (OLS, bình
phương tối thiểu).
Find w* by taking the gradient of RSS and solving the equation
RSS’ = 0. We obtain:
w* = (A^T A)^{-1} A^T y
where A is the data matrix of size M×(n+1), whose ith row is
Ai = (1, xi1, xi2, …, xin); B^{-1} denotes the inverse of matrix B; and y = (y1, y2, …, yM)T.
Methods: OLS
Input: D = {(x1, y1), (x2, y2), …, (xM, yM)}
Output: w*
Learning: compute
w* = (A^T A)^{-1} A^T y
where A is the data matrix of size M×(n+1), whose ith row is
Ai = (1, xi1, xi2, …, xin); B^{-1} denotes the inverse of matrix B; and y = (y1, y2, …, yM)T.
Note: we assume that A^T A is invertible.
Prediction for a new x: f(x, w*) = w0* + w1*x1 + … + wn*xn
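A minimal NumPy sketch of this recipe: build A by prepending a column of ones, then solve the normal equations (solving the linear system is usually preferred over forming the inverse explicitly). The tiny 1-D dataset reuses a few points from the earlier example table:

```python
import numpy as np

def ols_fit(X, y):
    """OLS: solve (A^T A) w = A^T y, where A = [1, X]."""
    A = np.column_stack([np.ones(len(X)), X])     # M x (n+1) data matrix
    return np.linalg.solve(A.T @ A, A.T @ y)

def ols_predict(w, X):
    """Prediction: f(x, w*) = w0* + w1*x1 + ... + wn*xn."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ w

# A few (x, y) points from the example table on the earlier slide.
X = np.array([[0.13], [1.02], [3.17], [-2.76], [1.44]])
y = np.array([-0.91, -0.17, 1.61, -3.31, 0.18])

w = ols_fit(X, y)
print("w* =", w)
print("prediction for x = 2.0:", ols_predict(w, np.array([[2.0]])))
```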
Methods: OLS example
[Figure: the training data (x, y) plotted together with the fitted line f*]
f*(x) = 0.81x – 0.78
Methods: limitations of OLS
OLS cannot work if A^T A is not invertible.
If some columns (attributes/features) of A are linearly dependent, then
A^T A will be singular and therefore not invertible.
(Nếu một vài cột của A phụ thuộc tuyến tính thì A sẽ không khả nghịch)
OLS requires considerable computation due to the need to
compute a matrix inverse.
It can be intractable for very high-dimensional problems.
OLS tends to overfit, because the learning phase
focuses only on minimizing the error on the training data.
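The first limitation is easy to demonstrate: with a duplicated (or linearly dependent) feature column, A^T A loses full rank and the inverse does not exist. A small sketch with made-up data:

```python
import numpy as np

# Made-up data in which the second feature is exactly twice the first,
# so the columns of A are linearly dependent.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [4.0, 8.0]])
A = np.column_stack([np.ones(len(X)), X])   # M x (n+1) data matrix
G = A.T @ A

print("rank of A^T A:", np.linalg.matrix_rank(G), "out of", G.shape[0])
try:
    np.linalg.inv(G)                        # the inverse that OLS needs
except np.linalg.LinAlgError as err:
    print("OLS cannot proceed:", err)       # typically reports a singular matrix
```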
Methods: Ridge regression (1)
Given D = {(x1, y1), (x2, y2), …, (xM, yM)}, we solve for:
w* = argmin_w [ Σ_{i=1}^{M} (yi – Ai·w)^2 + λ·||w||2^2 ]     (2)
where Ai = (1, xi1, xi2, …, xin) is composed from xi; λ is a
regularization constant (λ > 0); and ||w||2 is the L2 norm of w.
Methods: Ridge regression (2)
Problem (2) is equivalent to the following:
w* = argmin_w Σ_{i=1}^{M} (yi – Ai·w)^2     (3)
subject to ||w||2^2 ≤ t,
for some constant t.
The regularization/penalty term λ·||w||2^2:
Limits the magnitude/size of w* (i.e., reduces the search space for
f*).
Helps us to trade off between the fit of f on D and its
generalization on future observations.
Methods: Ridge regression (3)
We solve for w* by taking the gradient of the objective
function in (2), and then zeroing it. Therefore we obtain:
Where A is the data matrix of size Mx(n+1), whose the ith row is
Ai = (1, xi1, xi2, …, xin); B-1 is the inversion of matrix B; y = (y1, y2,
…, yM)T; In+1 is the identity matrix of size n+1.
Compared with OLS, Ridge can
Avoid the cases of singularity, unlike OLS. Hence Ridge always
works.
Reduce overfitting.
But error in the training data might be greater than OLS.
Note: the predictiveness of Ridge depends heavily on the
choice of the hyperparameter λ.
Methods: Ridge regression (4)
Input: D = {(x1, y1), (x2, y2), …, (xM, yM)} and λ>0
Output: w*
Learning: compute
w* = (A^T A + λI_{n+1})^{-1} A^T y
Prediction for a new x: f(x, w*) = w0* + w1*x1 + … + wn*xn
Note: to avoid some negative effects of the magnitude of y on
the covariates x, one should remove w0 from the penalty term in
(2). In this case, the solution for w* should be modified slightly.
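A minimal NumPy sketch of this Ridge recipe (penalizing w0 as in (2); the data and λ are made up):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge: solve (A^T A + lam*I) w = A^T y, where A = [1, X]."""
    A = np.column_stack([np.ones(len(X)), X])             # M x (n+1)
    I = np.eye(A.shape[1])                                # identity of size n+1
    return np.linalg.solve(A.T @ A + lam * I, A.T @ y)

def ridge_predict(w, X):
    A = np.column_stack([np.ones(len(X)), X])
    return A @ w

# Made-up data; lam = 0.1 is an arbitrary choice of the hyperparameter.
X = np.array([[0.13, 1.0], [1.02, -0.3], [3.17, 0.7], [-2.76, 2.1], [1.44, 0.0]])
y = np.array([-0.91, -0.17, 1.61, -3.31, 0.18])

w = ridge_fit(X, y, lam=0.1)
print("w* =", w)
print("training predictions:", ridge_predict(w, X))
```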
An example of using Ridge and OLS
The training set D contains 67 observations on prostate
cancer, each represented by 8 attributes. Ridge and
OLS were learned from D and then used to predict 30 new
observations.

w           Ordinary Least Squares    Ridge
w0                  2.465             2.452
lcavol              0.680             0.420
lweight             0.263             0.238
age                −0.141            −0.046
lbph                0.210             0.162
svi                 0.305             0.227
lcp                −0.288             0.000
gleason            −0.021             0.040
pgg45               0.267             0.133
Test RSS            0.521             0.492
Effects of λ in Ridge regression
W* = (w0, S1, S2, S3, S4, S5, S6, AGE, SEX, BMI, BP) changes
as the regularization constant λ changes.
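One way to observe this effect is to recompute the Ridge solution on a grid of λ values and watch the weights shrink toward zero as λ grows; a sketch with made-up data (different from the dataset behind the figure):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge closed form: w* = (A^T A + lam*I)^{-1} A^T y."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

# Made-up data with n = 3 features, only two of them informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    w = ridge_fit(X, y, lam)
    print(f"lambda = {lam:6}: w* = {np.round(w, 3)}")   # weights shrink as lambda grows
```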
LASSO
Ridge regression use L2 norm for regularization:
subject to (3)
Replacing L2 by L1 norm will result in LASSO:
Subject to
Equivalently:
This problem is non-differentiable the training algorithm
(4)
should be more complex than Ridge.
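Because of this non-differentiability, LASSO is typically solved with iterative methods such as coordinate descent rather than a closed form. The sketch below uses scikit-learn's Lasso (assuming scikit-learn is available); the data is made up, and alpha plays the role of λ in (4) up to scaling:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Made-up data: only the first 2 of 8 features actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = Lasso(alpha=0.1).fit(X, y)   # alpha ~ lambda in (4), up to scaling
print("w0 =", model.intercept_)
print("w  =", model.coef_)           # irrelevant features typically get weight exactly 0
```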
LASSO: regularization role
Different regularization types lead to different feasible domains for w.
LASSO often produces sparse solutions, i.e., many
components of w are exactly zero.
It performs shrinkage and selection at the same time.
Figure by Nicoguaro - Own work, CC BY 4.0,
https://commons.wikimedia.org/w/index.php?
OLS, Ridge, and LASSO
The training set D contains 67 observations on prostate
cancer, each was represented with 8 attributes. OLS, Ridge,
and LASSO were trained from D, and then predicted 30 new
observations.
Ordinary Least
w Squares Ridge LASSO
0 2.465 2.452 2.468
lcavol 0.680 0.420 0.533
lweight 0.263 0.238 0.169 Some weights
age −0.141 −0.046 are 0
lbph 0.210 0.162 0.002
some
attributes
svi 0.305 0.227 0.094 may not be
lcp −0.288 0.000 important
gleason −0.021 0.040
pgg45 0.267 0.133
Test RSS 0.521 0.492 0.479
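A sketch of this kind of comparison on made-up data, mimicking the 67/30 train/test split and reporting test RSS for the three methods with scikit-learn (the numbers will of course differ from the prostate-cancer table above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Made-up data: 8 features, only a few of them informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(97, 8))
y = 1.5 * X[:, 0] + 0.8 * X[:, 2] + rng.normal(scale=0.7, size=97)

# Train on the first 67 observations, test on the remaining 30.
X_train, y_train = X[:67], y[:67]
X_test, y_test = X[67:], y[67:]

for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("LASSO", Lasso(alpha=0.1))]:
    model.fit(X_train, y_train)
    test_rss = np.sum((y_test - model.predict(X_test)) ** 2)
    print(f"{name:6s} test RSS = {test_rss:.3f}")
```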
References
Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of
Statistical Learning. Springer, 2009.
Robert Tibshirani (1996). "Regression Shrinkage and Selection via the Lasso".
Journal of the Royal Statistical Society, Series B (Methodological), 58(1): 267–288.
Exercises
Derive the solutions of (1) and (2) in detail.
Derive the solution of (2) when removing w0 from the penalty
term.