Lecture 11: Regression: Penalized Approach

11.1 Penalized Regression
For the regression tree, we discussed the case where we want to select the number of leaves M based on the following criterion:
$$C_{\lambda,n}(M) = \underbrace{\frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i - \hat{m}(X_i)\bigr)^2}_{\text{fitting to the data}} \;+\; \underbrace{\lambda M}_{\text{penalty on the complexity}}. \tag{11.1}$$
It turns out that this type of criterion is very general in regression analysis because we want to avoid the
problem of overfitting.
Overfitting means that we fit too complex a model to the data, so that although the fitted curve is close to most of the observations, the actual prediction is very bad. For instance, the following picture shows the fitted result using a smoothing/cubic spline (here the quantity spar is related to λ):
This data is generated from a sine function plus a small amount of noise. When λ is too small (orange curve), we fit a very complicated model to the data, which does not capture the right structure. On the other hand, when λ is too large (green curve), we fit too simple a model (a straight line), which is also bad at predicting the actual outcome. When λ is too small, this is called overfitting (orange curve), whereas when λ is too large, it is called underfitting (green curve). In fact, overfitting is similar to undersmoothing and underfitting is similar to oversmoothing. In regression analysis, people prefer to use overfitting and underfitting to describe the outcome, and in density estimation, people prefer to use undersmoothing and oversmoothing.
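To make this concrete, here is a rough sketch (my own illustration, not from the notes; it assumes SciPy ≥ 1.10 so that `make_smoothing_spline` is available) that reproduces the behavior in the picture: sine-plus-noise data fit by a cubic smoothing spline at a tiny, a moderate, and a huge value of λ.

```python
# Sketch: over/underfitting of a cubic smoothing spline on sine-plus-noise data.
# Assumes SciPy >= 1.10; `lam` plays the role of the penalty parameter λ.
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 2 * np.pi, 80))
y = np.sin(x) + 0.2 * rng.normal(size=x.size)

grid = np.linspace(x.min(), x.max(), 400)
for lam in (1e-7, 1e-2, 1e3):          # too small, moderate, too large
    spl = make_smoothing_spline(x, y, lam=lam)
    mse = np.mean((spl(grid) - np.sin(grid)) ** 2)
    print(f"lam = {lam:g}: mean squared error against the true sine = {mse:.4f}")
# A tiny lam gives a wiggly overfit (undersmoothing); a huge lam gives nearly a
# straight line (oversmoothing); a moderate lam recovers the sine shape.
```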
Finding a regression estimator using a criterion that combines a term fitting the data with a penalty on the complexity is called penalized regression. In the case of the regression tree, let $\mathcal{M}_{\rm Tree}$ be the collection of all possible regression trees. We can rewrite equation (11.1) as
$$\hat m_{\rm Tree} = \operatorname*{argmin}_{m\in\mathcal{M}_{\rm Tree}} \frac{1}{n}\sum_{i=1}^{n} (Y_i - m(X_i))^2 + P_\lambda(m),$$
where $P_\lambda(m) = \lambda \times$ (number of regions in m). Thus, with the penalty on the number of regions, the regression tree is a penalized regression approach.
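As a toy illustration of how this criterion trades off fit against complexity, one can evaluate $C_{\lambda,n}(M)$ over a grid of candidate values of M and pick the minimizer. The training errors below are made-up numbers, not from the notes.

```python
# Toy sketch: choosing the number of leaves M by minimizing C_{lambda,n}(M).
import numpy as np

# Hypothetical training MSEs: more leaves always fit the training data better.
M_values = np.array([1, 2, 4, 8, 16, 32])
train_mse = np.array([1.00, 0.55, 0.30, 0.22, 0.20, 0.19])

lam = 0.01
criterion = train_mse + lam * M_values        # fitting part + penalty part
print("C_{lambda,n}(M):", np.round(criterion, 3))
print("selected number of leaves:", M_values[np.argmin(criterion)])
```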
Any penalized regression approach can be written in the following abstract form:
$$\hat m = \operatorname*{argmin}_{m\in\mathcal{M}} \frac{1}{n}\sum_{i=1}^{n} (Y_i - m(X_i))^2 + P_\lambda(m), \tag{11.2}$$
where $\mathcal{M}$ is a collection of regression estimators, $P_\lambda(m)$ is the amount of penalty imposed on a regression estimator $m \in \mathcal{M}$, and λ is a tuning parameter that determines the amount of penalty. A penalized regression always has a fitting part (e.g., $\frac{1}{n}\sum_{i=1}^{n}(Y_i - m(X_i))^2$) and a penalty part (also called the regularization part) $P_\lambda(m)$. The fitting part makes sure the model fits the data well, while the penalty part guarantees that the model is not too complex. Thus, penalized regression often leads to a simple model with a good fit to the data.
11.2 Spline
The smoothing spline is a famous example of a penalized regression method. Here we consider the case of univariate regression (i.e., the covariate X is univariate or, equivalently, d = 1) and focus on the case where the covariate belongs to [0, 1]. Namely, our data is $(X_1, Y_1), \cdots, (X_n, Y_n)$ with $X_i \in [0, 1] \subset \mathbb{R}$ for each i.
Let $\mathcal{M}_2$ denote the collection of all univariate functions with a second derivative on [0, 1]. The cubic (smoothing) spline finds the estimator
$$\hat m = \operatorname*{argmin}_{m\in\mathcal{M}_2} \frac{1}{n}\sum_{i=1}^{n} (Y_i - m(X_i))^2 + \lambda \int_0^1 |m''(x)|^2\,dx. \tag{11.3}$$
In the cubic spline the penalty function is $\lambda \int_0^1 |m''(x)|^2\,dx$, which imposes a restriction on the smoothness: the curve m(x) cannot change too drastically, otherwise the second derivative will be large. Thus, the cubic spline leads to a smooth curve that still fits the data well.
Why is the estimator $\hat m$ called a cubic spline? Because it turns out that $\hat m$ is a piecewise polynomial function (a spline) of degree 3. Namely, there exist knots $\tau_1 < \cdots < \tau_K$ such that for $x \in (\tau_k, \tau_{k+1})$,
$$\hat m(x) = \gamma_{0,k} + \gamma_{1,k} x + \gamma_{2,k} x^2 + \gamma_{3,k} x^3$$
for some $\gamma_{0,k}, \cdots, \gamma_{3,k}$, with the restriction that $\hat m(x)$ has continuous second derivatives at each knot. In the case of the cubic smoothing spline, it turns out that the knots are just the data points.
The representation of a cubic spline is often done using basis functions. Here we introduce a simple basis called the truncated power basis. Let $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$ be the order statistics of $X_1, \cdots, X_n$. In the cubic spline, the knots are
$$\tau_1 = X_{(1)},\ \tau_2 = X_{(2)},\ \cdots,\ \tau_n = X_{(n)},$$
and the basis functions are
$$h_1(x) = 1,\quad h_2(x) = x,\quad h_3(x) = x^2,\quad h_4(x) = x^3,\quad h_j(x) = (x - \tau_{j-4})_+^3,\ \ j = 5, 6, \cdots, n+4,$$
where $(x)_+ = \max\{x, 0\}$. Then the estimator $\hat m$ can be written as
$$\hat m(x) = \sum_{j=1}^{n+4} \hat\beta_j h_j(x).$$
How do we compute $\hat\beta_1, \cdots, \hat\beta_{n+4}$? They are chosen by minimizing equation (11.3). Here is how we compute them. Define an n × (n + 4) matrix $\mathbf{H}$ such that
$$\mathbf{H}_{ij} = h_j(X_i).$$
Given a point x, let $H(x) = (h_1(x), h_2(x), \cdots, h_{n+4}(x))$ be an (n + 4)-dimensional vector. Then the predicted value $\hat m(x)$ has a simple form:
$$\hat m(x) = H^T(x)\hat\beta = H^T(x)\left(\mathbf{H}^T\mathbf{H} + \lambda\Omega\right)^{-1}\mathbf{H}^T Y = \sum_{i=1}^{n} \ell_i(x) Y_i,$$
where Ω is the (n + 4) × (n + 4) penalty matrix obtained by applying the roughness penalty $\int_0^1 |m''(x)|^2\,dx$ to the basis functions, and
$$\ell_i(x) = H^T(x)\left(\mathbf{H}^T\mathbf{H} + \lambda\Omega\right)^{-1}\mathbf{H}^T e_i,$$
with $e_i = (0, \cdots, 0, 1, 0, \cdots, 0)^T$ the unit vector whose i-th coordinate equals 1. Therefore, again, the cubic spline is a linear smoother.
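The following is a minimal numerical sketch of this construction (my own, not from the notes): it builds the design matrix $\mathbf{H}$ from the truncated power basis, approximates the penalty matrix Ω by numerically integrating the products of second derivatives of the basis functions, and solves for $\hat\beta$. The penalty is entered as nλΩ to match the 1/n factor in criterion (11.3); some treatments absorb the n into λ.

```python
# Sketch: cubic smoothing spline via the truncated power basis (knots at data points).
import numpy as np

def truncated_power_basis(x, knots):
    """Columns 1, x, x^2, x^3, (x - tau_j)^3_+ for each knot tau_j."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - t, 0.0, None) ** 3 for t in knots]
    return np.column_stack(cols)

def basis_second_derivative(x, knots):
    """Second derivatives of the basis: 0, 0, 2, 6x, 6 (x - tau_j)_+."""
    x = np.asarray(x, dtype=float)
    cols = [np.zeros_like(x), np.zeros_like(x), 2 * np.ones_like(x), 6 * x]
    cols += [6 * np.clip(x - t, 0.0, None) for t in knots]
    return np.column_stack(cols)

def fit_smoothing_spline(X, Y, lam, n_grid=2000):
    knots = np.sort(X)                          # knots at the data points
    H = truncated_power_basis(X, knots)         # n x (n + 4) design matrix
    grid = np.linspace(0.0, 1.0, n_grid)
    D2 = basis_second_derivative(grid, knots)
    Omega = (D2.T @ D2) * (grid[1] - grid[0])   # Riemann sum for int h_j'' h_k'' dx
    n = len(Y)
    beta = np.linalg.solve(H.T @ H + n * lam * Omega, H.T @ Y)
    return lambda x_new: truncated_power_basis(x_new, knots) @ beta

# Usage: sine-plus-noise data on [0, 1].  (The truncated power basis is poorly
# conditioned numerically; the B-spline basis mentioned below is preferred in practice.)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30))
Y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=30)
m_hat = fit_smoothing_spline(X, Y, lam=1e-3)
print(np.round(m_hat(np.array([0.25, 0.5, 0.75])), 3))
```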
Note that when the sample size n is large, the spline estimator behaves like a kernel regression in the sense that
$$\ell_i(x) \approx \frac{1}{p(X_i)\, h(X_i)}\, K\!\left(\frac{X_i - x}{h(X_i)}\right)$$
and
$$h(x) = \left(\frac{\lambda}{n\, p(x)}\right)^{1/4}, \qquad K(x) = \frac{1}{2}\exp\left(-\frac{|x|}{\sqrt{2}}\right)\sin\left(\frac{|x|}{\sqrt{2}} + \frac{\pi}{4}\right),$$
where p denotes the density of the covariates.
Remark.
• Regression spline. In the case where we use the spline basis to do regression but without a penalty and with fewer knots (and we allow the knots to be at non-data points), the resulting estimator is called a regression spline. Namely, a regression spline is an estimator of the form $\hat m(x) = \sum_{j=1}^{M} \hat\beta_j h_j(x)$, where $\hat\beta_1, \cdots, \hat\beta_M$ are determined by minimizing
$$\frac{1}{n}\sum_{i=1}^{n}\Bigl(Y_i - \sum_{j=1}^{M}\beta_j h_j(X_i)\Bigr)^2.$$
In matrix form, the fitted curve is
$$\hat m(x) = H^T(x)\hat\beta = H^T(x)\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T Y.$$
A short numerical sketch of a regression spline appears after this remark.
• B-spline basis. There are other bases that can be used to construct a spline estimator. One of the most famous is the B-spline basis. This basis is defined recursively, so we will not go into the details here. If you are interested, you can check https://cran.r-project.org/web/packages/crs/vignettes/spline_primer.pdf. The main advantage of the B-spline basis is computational: its basis functions have local support, which leads to sparse and much better-conditioned design matrices.
• Higher-order splines. There are higher-order splines. If we modify the optimization criterion to
$$\frac{1}{n}\sum_{i=1}^{n} (Y_i - m(X_i))^2 + \lambda \int_0^1 |m^{(\beta)}(x)|^2\,dx,$$
where $m^{(\beta)}$ denotes the β-th derivative, then the resulting estimator is again a spline, i.e., a piecewise polynomial (of degree 2β − 1, so β = 2 recovers the cubic spline). As you may expect, we can construct a corresponding truncated power basis using polynomials of the appropriate degree together with the knots.
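As promised above, here is a minimal sketch (my own illustration) of a regression spline: an unpenalized least-squares fit on a cubic truncated power basis with a few hand-picked knots.

```python
# Sketch: regression spline = plain least squares on a small truncated power basis.
import numpy as np

def cubic_power_basis(x, knots):
    """Columns 1, x, x^2, x^3, (x - tau_j)^3_+ for each knot tau_j."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - t, 0.0, None) ** 3 for t in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(0, 1, 60))
Y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=60)

knots = np.array([0.2, 0.4, 0.6, 0.8])             # few knots, not at data points
H = cubic_power_basis(X, knots)                    # n x (4 + number of knots)
beta_hat, *_ = np.linalg.lstsq(H, Y, rcond=None)   # no penalty: ordinary least squares

grid = np.linspace(0, 1, 5)
print(np.round(cubic_power_basis(grid, knots) @ beta_hat, 3))  # fitted curve on a grid
```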
11.3 Penalized Linear Regression
We now consider a multivariate linear regression problem with d covariates, i.e., $X_i \in \mathbb{R}^d$. In linear regression, we assume that the relationship between the response and the covariates can be written as
$$Y_i = \beta^T X_i + \epsilon_i,$$
where $\epsilon_i$ is a mean-zero noise random variable. Note that here we assume the intercept is 0.
In many cases, we may have many covariates, so d is large. However, we believe that some of these covariates are useless covariates (their slopes are 0), and only a few covariates have an actual linear relation with the response. Even if we know this is true, if we naively apply the least squares approach to find β, we often obtain fitted coefficients that are all non-zero, and some of them could even be quite significant just due to the randomness of the data. Note that the least squares estimator finds the fitted parameter as
$$\hat\beta_{\rm LSE} = \operatorname*{argmin}_{\beta} \frac{1}{n}\sum_{i=1}^{n} (Y_i - \beta^T X_i)^2.$$
The idea of penalization/regularization can help in this case. There are two common penalized parametric regression models: (i) the ridge regression model, and (ii) LASSO (least absolute shrinkage and selection operator).
Ridge regression. Ridge regression adds a penalty, called the L2 penalty, to the minimization criterion. Namely, ridge regression finds the fitted parameter as
$$\hat\beta_{\rm Ridge} = \operatorname*{argmin}_{\beta} \frac{1}{n}\sum_{i=1}^{n} (Y_i - \beta^T X_i)^2 + \lambda\|\beta\|_2^2,$$
where $\|\beta\|_2^2 = \sum_{j=1}^{d}\beta_j^2$ is the squared 2-norm of the vector β. The penalty $\lambda\|\beta\|_2^2$ is called the L2 penalty because it is based on the L2 norm of the parameter.
It turns out that ridge regression has a closed-form solution that is similar to the least squares estimator and the spline estimator:
$$\hat\beta_{\rm Ridge} = \left(\mathbf{X}^T\mathbf{X} + n\lambda I_d\right)^{-1}\mathbf{X}^T Y,$$
where $\mathbf{X}$ is the n × d data matrix and $I_d$ is the d × d identity matrix.
Let $\hat\beta_{\rm LS} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T Y$ be the ordinary least squares estimator (no penalty, the classical approach). The ridge regression coefficients are very similar to the least squares coefficients, but they are moved toward 0 because of the extra $n\lambda I_d$ term inside the matrix inverse. We say that ridge regression shrinks the estimator $\hat\beta_{\rm Ridge}$ toward 0.
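Here is a minimal numerical sketch (my own, with simulated data) of the closed-form ridge estimator next to the least squares estimator; the ridge coefficients are visibly shrunk toward 0 but remain non-zero.

```python
# Sketch: closed-form ridge estimator (X^T X + n*lam*I)^{-1} X^T Y vs. least squares.
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 10
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:3] = [2.0, -1.0, 0.5]                 # only a few non-zero slopes
Y = X @ beta_true + 0.5 * rng.normal(size=n)

lam = 0.1
beta_ls = np.linalg.solve(X.T @ X, X.T @ Y)                           # least squares
beta_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)  # ridge

print("LS:   ", np.round(beta_ls, 3))
print("Ridge:", np.round(beta_ridge, 3))
```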
LASSO. LASSO (least absolute shrinkage and selection operator) is one of the most famous penalized parametric regression models. It has revolutionized modern statistical research because of its attractive properties. LASSO finds the regression parameters/coefficients using
$$\hat\beta_{\rm LASSO} = \operatorname*{argmin}_{\beta} \frac{1}{n}\sum_{i=1}^{n} (Y_i - \beta^T X_i)^2 + \lambda\|\beta\|_1,$$
where $\|\beta\|_1 = \sum_{j=1}^{d}|\beta_j|$ is the 1-norm of the vector β. The penalty $\lambda\|\beta\|_1$ is called the L1 penalty.
Unlike ridge regression, LASSO has no closed-form solution in general. However, in the special case of an orthogonal design, each LASSO coefficient is a soft-thresholded version of the corresponding least squares coefficient: the coefficients are shrunk toward 0, and those whose magnitude falls below a threshold proportional to λ are set exactly to 0.
When λ is large or the signal is small, many coefficients will be 0. This is called sparsity in statistics (only a few non-zero coefficients). Thus, we say that LASSO outputs a sparse estimate. A coefficient $\hat\beta_j$ will be 0 if the corresponding covariate does not provide much improvement in predicting Y, so LASSO naturally leads to an estimator with an automatic variable selection property. The value of λ affects the estimate $\hat\beta$: a larger λ encourages a sparser $\hat\beta$ (namely, more coefficients are 0), whereas a smaller λ leads to a less sparse $\hat\beta$.
Although ridge regression also shrinks the coefficients toward 0, it does not yield a sparse estimator; the coefficients are just smaller but generally non-zero. On the other hand, LASSO not only shrinks the values of the coefficients but also sets them exactly to 0 when the effect is very weak. This is a property of the L1 penalty: it tends to yield a sparse estimator, i.e., an estimator with many 0's.
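Although there is no closed form in general, LASSO can be computed by coordinate descent. The following is a minimal sketch (my own illustration of one standard algorithm) that cyclically soft-thresholds one coordinate at a time for the criterion above.

```python
# Sketch: LASSO by cyclic coordinate descent for (1/n)||Y - X b||^2 + lam * ||b||_1.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, Y, lam, n_iter=200):
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = np.sum(X**2, axis=0)              # ||X_j||^2 for each column
    for _ in range(n_iter):
        for j in range(d):
            # Partial residual: remove the contribution of all other coordinates.
            r = Y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r, n * lam / 2) / col_sq[j]
    return beta

rng = np.random.default_rng(3)
n, d = 100, 10
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:3] = [2.0, -1.0, 0.5]
Y = X @ beta_true + 0.5 * rng.normal(size=n)

print(np.round(lasso_cd(X, Y, lam=0.2), 3))    # many coefficients are exactly 0
```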
Remark.
• L0 penalty. There is also something called the L0 penalty. For a vector β, its L0 norm is
$$\|\beta\|_0 = \sum_{j=1}^{d} I(\beta_j \neq 0),$$
the number of non-zero coefficients. The resulting coefficients are related to the so-called best subset estimators.
However, a problem with the L0 penalty is that finding the minimum of $\frac{1}{n}\sum_{i=1}^{n}(Y_i - \beta^T X_i)^2 + \lambda\|\beta\|_0$ is difficult: it is a non-convex problem and is NP-hard (you can view these two statements as 'computationally very, very difficult'). Thus, in many situations we replace the L0 penalty by an L1 penalty, because the L1-penalized problem is convex and therefore computationally tractable. The process of replacing the L0 penalty (or another non-convex problem) by the L1 penalty (or another convex problem) is called convex relaxation, a common trick in machine learning and optimization.
• High-dimensional problem. Both ridge regression and LASSO are common tools in high-dimensional data analysis. A so-called high-dimensional problem is one where the number of variables d is much larger than the number of observations n. In this case, the usual least squares estimator does not work because we have more variables (d) than observations (n). The good news is that both ridge regression and LASSO still work. In particular, LASSO is very powerful in this scenario because it leads to a sparse estimator (many coefficients are 0). Sparsity is a common belief in high-dimensional statistics because we anticipate that only a few covariates are actually related to the response and most covariates are useless. Note that high-dimensional problems are very common in genetics (there are many genes per individual but often few patients in a study), neuroscience (an fMRI machine produces many voxels per person at a given time), and many other scientific domains. A small numerical sketch of LASSO with d much larger than n is given below.
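As referenced in the remark above, here is a rough sketch of LASSO in a high-dimensional setting. It assumes scikit-learn is available; note that sklearn's Lasso minimizes $\frac{1}{2n}\|Y - \mathbf{X}\beta\|_2^2 + \alpha\|\beta\|_1$, so α plays the role of λ/2 in our notation.

```python
# Sketch: LASSO with d >> n still recovers a sparse coefficient vector.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, d = 50, 500                                # far more covariates than observations
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]   # only 5 covariates actually matter
Y = X @ beta_true + 0.5 * rng.normal(size=n)

fit = Lasso(alpha=0.1, max_iter=10_000).fit(X, Y)
print("non-zero coefficients:", int(np.sum(fit.coef_ != 0)), "out of", d)
print("estimates on the 5 true signals:", np.round(fit.coef_[:5], 2))
```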