Chapter 5
Regularized regression

EEE 485/585 Statistical Learning and Data Analytics

Cem Tekin
Bilkent University

Cannot be distributed outside this class without the permission of the instructor.

Outline: Regularization, Ridge regression, Lasso
Regularization

Properties of the least squares estimate:
- When the relation between Y and X = [X_1, ..., X_p]^T is almost linear, the least squares estimate has low bias.
- But it can have high variance, e.g., when p ≈ n or p > n.
- Shrinking the regression coefficients can result in a better fit.

Regularization: reducing the complexity of linear regression
- Ridge regression
- Lasso
Two methods for regularization

Ordinary least squares:

RSS(\beta) = \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2

Ridge regression (penalizes the squares of the parameters):

Loss_R(\beta, \lambda) = \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^p \beta_j^2 = RSS(\beta) + \lambda \sum_{j=1}^p \beta_j^2

Lasso (penalizes the absolute values of the parameters):

Loss_L(\beta, \lambda) = \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^p |\beta_j| = RSS(\beta) + \lambda \sum_{j=1}^p |\beta_j|
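As a concrete illustration, here is a minimal NumPy sketch of these three objectives. The names X, y, beta0, beta, and lam are hypothetical placeholders, not from the slides:

```python
import numpy as np

def rss(X, y, beta0, beta):
    """Residual sum of squares for the linear model y ~ beta0 + X @ beta."""
    residuals = y - beta0 - X @ beta
    return np.sum(residuals ** 2)

def ridge_loss(X, y, beta0, beta, lam):
    """RSS plus the ridge (squared) penalty; note beta0 is not penalized."""
    return rss(X, y, beta0, beta) + lam * np.sum(beta ** 2)

def lasso_loss(X, y, beta0, beta, lam):
    """RSS plus the lasso (absolute value) penalty; beta0 is not penalized."""
    return rss(X, y, beta0, beta) + lam * np.sum(np.abs(beta))
```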
Ridge regression

Loss_R(\beta, \lambda) = \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 + \underbrace{\lambda}_{\text{tuning parameter}} \underbrace{\sum_{j=1}^p \beta_j^2}_{\text{penalty}}

\hat{\beta}^R = \arg\min_\beta Loss_R(\beta, \lambda)

What happens when
- \lambda \to 0: the penalty vanishes and ridge reduces to ordinary least squares.
- \lambda \to \infty: the penalty dominates and the coefficients shrink toward zero.

How to select \lambda? Use CV to select \lambda. Note that \beta_0 is not penalized.
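One common way to select λ by cross-validation, sketched with scikit-learn's RidgeCV. The data and the alpha grid below are arbitrary illustrations (scikit-learn calls the tuning parameter alpha):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Hypothetical data, for illustration only.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.standard_normal(100)

# RidgeCV fits ridge regression for each candidate penalty value and
# keeps the one with the best cross-validated error.
model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
print("selected penalty:", model.alpha_)
print("coefficients:", model.coef_)
```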
Example - Credit card balance prediction

Y = card balance
X = (income, limit, rating, student, ...)

Lines show the estimated regression coefficients \hat{\beta}^R_\lambda obtained by ridge regression.

[Figure: standardized ridge coefficients for income, limit, rating, and student, plotted against \lambda (left) and against \|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2 (right).]

Figure from "An Introduction to Statistical Learning" by James et al.
Scale invariance

Least squares linear regression is scale invariant.
Is ridge regression scale invariant? No: rescaling a predictor (e.g., income measured in thousands, x_i^{new} = 0.001 x_i^{old}) changes the fitted ridge coefficients, because the penalty depends on the scale of each coefficient.

Making ridge regression fair: standardize the predictors.

\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{\frac{1}{n} \sum_{i=1}^n (x_{ij} - \bar{x}_j)^2}}

where \bar{x}_j = \frac{1}{n} \sum_{i=1}^n x_{ij} is the average value of feature j.

Properties of standardized predictors:
- \frac{1}{n} \sum_{i=1}^n \tilde{x}_{ij} = 0 (zero mean)
- \frac{1}{n} \sum_{i=1}^n \tilde{x}_{ij}^2 = 1 (unit variance)

Also center the response: \tilde{y}_i = y_i - \bar{y}, where \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i.
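A minimal NumPy sketch of this preprocessing step (variable names are illustrative):

```python
import numpy as np

def standardize(X, y):
    """Center y and standardize each column of X to zero mean, unit variance.

    Uses the 1/n (population) standard deviation, matching the formula above.
    """
    x_bar = X.mean(axis=0)
    x_std = X.std(axis=0)          # np.std uses the 1/n convention by default
    X_tilde = (X - x_bar) / x_std
    y_tilde = y - y.mean()
    return X_tilde, y_tilde
```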
Bias-variance tradeoff

In general, the expected prediction error decomposes into squared bias plus variance (plus irreducible noise); the tuning parameter \lambda trades one off against the other.

[Figure: bias (black), variance (green), and MSE (red) for ridge regression, plotted against \lambda (left) and against \|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2 (right); small \lambda overfits (high variance), large \lambda underfits (high bias).]

MSE := \frac{1}{n} \sum_{i=1}^n \big( y_i - \hat{f}(x_i) \big)^2

Figure from "An Introduction to Statistical Learning" by James et al.
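To make the tradeoff concrete, here is a toy simulation sketch, assuming an entirely synthetic linear model; it uses the closed-form ridge solution derived on the next slides and estimates squared bias and variance at one test point by refitting on fresh training sets:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 20
beta_true = np.zeros(p)
beta_true[:5] = [2.0, -1.0, 1.5, 0.5, -2.0]

def ridge_fit(X, y, lam):
    # Closed-form ridge solution (derived on the next slides).
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

lams = np.logspace(-2, 3, 6)
x0 = rng.standard_normal(p)               # one fixed test input
preds = {lam: [] for lam in lams}
for _ in range(200):                      # fresh training set each round
    X = rng.standard_normal((n, p))
    y = X @ beta_true + rng.standard_normal(n)
    for lam in lams:
        preds[lam].append(x0 @ ridge_fit(X, y, lam))

f0 = x0 @ beta_true                       # true mean response at x0
for lam in lams:
    ph = np.array(preds[lam])
    print(f"lam={lam:8.2f}  bias^2={(ph.mean() - f0)**2:.3f}  var={ph.var():.3f}")
```

As λ grows, the printed variance shrinks while the squared bias grows, reproducing the shape of the curves in the figure.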
How to solve ridge regression?

Loss_R(\beta, \lambda) = \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^p \beta_j^2

\hat{\beta}^R = \arg\min_\beta Loss_R(\beta, \lambda)

- Center the predictors and the response (centering makes the intercept \hat{\beta}_0^R = 0).
- Standardize the predictors.

In matrix form (with y and X centered):

Loss_R(\beta, \lambda) = (y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta = y^T y - 2 y^T X \beta + \beta^T X^T X \beta + \lambda \beta^T \beta

Setting the gradient with respect to \beta to zero:

-2 X^T y + 2 X^T X \beta + 2 \lambda \beta = 0 \;\Rightarrow\; (X^T X + \lambda I) \beta = X^T y
How to solve ridge regression?

Some notation (y and X centered):

y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}, \quad X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}

Linear algebra and matrix calculus give:

\hat{\beta}^R = (X^T X + \lambda I)^{-1} X^T y

Hence, given a new (centered and scaled) input x, the (centered) prediction is \hat{y} = x^T \hat{\beta}^R.

Compare with the least squares solution:

\hat{\beta}^{RSS} = (X^T X)^{-1} X^T y
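A minimal NumPy sketch of this closed-form solution (illustrative names; np.linalg.solve is used rather than an explicit matrix inverse for numerical stability):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam*I) beta = X^T y for centered, standardized X and y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def predict(x_new, beta_hat):
    """Centered prediction for a new (centered and scaled) input."""
    return x_new @ beta_hat
```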
Advantage of ridge regression

- Reduces variance.
- X^T X + \lambda I, \lambda > 0, is invertible even when X^T X is not invertible (e.g., when p > n or the predictors are collinear).

Figure from http://stats.stackexchange.com
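A quick sketch demonstrating this point on synthetic p > n data, where X^T X is singular but the ridge system still solves:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 20                      # p > n, so the p x p matrix X^T X has rank <= n < p
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

print(np.linalg.matrix_rank(X.T @ X))   # at most 10 < 20: singular
lam = 1.0
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)  # still solvable
print(beta.shape)
```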
Disadvantage of ridge regression

Coefficients will be small, but almost all of them will still be nonzero: ridge shrinks coefficients toward zero without setting them exactly to zero, so it does not perform variable selection.
Lasso (least absolute shrinkage and selection operator)

Loss_L(\beta, \lambda) = \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^p |\beta_j|

\hat{\beta}^L = \arg\min_\beta Loss_L(\beta, \lambda)

No closed-form solution (in general).

What happens when
- \lambda \to 0: lasso reduces to ordinary least squares.
- \lambda \to \infty: all coefficients are driven to zero (\hat{\beta}^L \to 0).
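Since there is no closed form, the lasso is solved numerically; scikit-learn's Lasso uses coordinate descent. A hedged sketch on made-up data (note that scikit-learn scales its objective as (1/(2n))·RSS + alpha·||β||_1, so its alpha is a rescaled version of the λ above):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: only 2 of 10 predictors truly matter.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.standard_normal(200)

model = Lasso(alpha=0.1).fit(X, y)   # solved internally by coordinate descent
print(model.coef_)                    # most entries come out exactly zero
```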
Example - Credit card balance prediction

Y = card balance
X = (income, limit, rating, student, ...)

Lines show the estimated regression coefficients \hat{\beta}^L_\lambda obtained by lasso.
Lasso performs variable selection (results in a sparse model).

[Figure: standardized lasso coefficients for income, limit, rating, and student, plotted against \lambda (left) and against \|\hat{\beta}^L_\lambda\|_1 / \|\hat{\beta}\|_1 (right).]

Figure from "An Introduction to Statistical Learning" by James et al.
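A quick way to see the selection effect in code, using the same hypothetical setup as the previous sketch: compare nonzero coefficient counts for ridge and lasso fits (the penalty values are arbitrary illustrations):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.standard_normal(200)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("ridge nonzeros:", np.sum(ridge.coef_ != 0))   # typically all 10
print("lasso nonzeros:", np.sum(lasso.coef_ != 0))   # typically just a few
```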
Ridge regression and lasso as constrained minimization problems

Ridge:

\text{minimize}_\beta \; \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p \beta_j^2 \le s

Lasso:

\text{minimize}_\beta \; \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \le s

For each s in the constrained minimization problem there is a corresponding \lambda in the equivalent unconstrained minimization problem. What differs between the two methods is the geometry of the constraint region.
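To illustrate the constrained form directly, here is a sketch that solves the s-constrained lasso problem with scipy.optimize.minimize (SLSQP supports inequality constraints). The data and the choice of s are illustrative, and SLSQP handles the nonsmooth absolute-value constraint only approximately, so this is a didactic sketch rather than a production solver:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + rng.standard_normal(100)

def rss(beta):
    return np.sum((y - X @ beta) ** 2)

s = 2.0  # budget on the L1 norm of the coefficients
cons = {"type": "ineq", "fun": lambda beta: s - np.sum(np.abs(beta))}

res = minimize(rss, x0=np.zeros(5), method="SLSQP", constraints=cons)
print(res.x)  # constrained solution; compare with an unconstrained fit
```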
Geometric interpretation

[Figure: RSS contours together with the lasso (diamond) and ridge (disk) constraint regions in the (\beta_1, \beta_2) plane.]

- Red lines: error contours for RSS (same error for all \beta values on the same contour).
- \hat{\beta}: least squares solution.
- Blue areas: constraint regions, |\beta_1| + |\beta_2| \le s for lasso or \beta_1^2 + \beta_2^2 \le s for ridge. The lasso region has corners on the axes, which is why the constrained solution often has some coefficients exactly zero.

Figure from "An Introduction to Statistical Learning" by James et al.