CHAPTER TWO
Regression with penalization: subset selection
Module Code: EC4308
Module title: Machine Learning and Economic Forecasting
Instructor: Vu Thanh Hai
Reference: ISLR 6.1
Acknowledgement: I adapt Dr. Denis Tkachenko’s notes with
some minor modifications. Thanks to Denis!
Lecture outline
▪ Introduction and motivation
▪ Best subset selection
▪ Forward stepwise selection
▪ Backward stepwise selection
▪ Concluding thoughts: pros and cons.
Introduction and motivation
▪ The linear regression model that you have learned:
𝑦𝑖 = 𝛽0 + 𝛽1 𝑥1,𝑖 + 𝛽2 𝑥2,𝑖 + ⋯ + 𝛽𝑝 𝑥𝑝,𝑖 + 𝜀𝑖
▪ Very commonly used to model 𝐸(𝑌 ∣ 𝑋) - recall that this is what we are after
in predictive modelling under squared loss.
▪ Can describe nonlinear but still additive relationships: the model only needs to be
linear in the coefficients (𝛽’s).
▪ Linearity yields advantages in inference, and in many applications delivers
surprisingly good approximations.
Motivation: is OLS good enough?
▪ We typically fit such models with OLS, so why look for alternatives?
▪ Two key concerns: prediction accuracy and model interpretability.
▪ OLS will have low bias if the true relationship is approximately linear.
▪ If 𝑃 << 𝑁 (the "typical" econometrics textbook setting), OLS also has low variance.
No need to do anything.
Motivation: prediction accuracy
▪ When 𝑁 is not much larger than 𝑃 - OLS predictions will have high variance
→ overfit → poor OOS prediction.
▪ When 𝑃 > 𝑁 (or 𝑃 >> 𝑁 ), no unique solution: OLS variance is infinite →
cannot use this method at all.
▪ Our goal is to get a linear combination of 𝑋 to get a good forecast of 𝑌.
▪ We want to reduce the variance by somehow reducing model complexity at a minor
cost in terms of bias - constrain coefficients on some 𝑋's to 0 or shrink them.
Motivation: Interpretability
▪ Oftentimes some of the 𝑋's entering the regression are irrelevant.
▪ OLS will almost never set their coefficients exactly to 0.
▪ By removing potentially many such variables (setting their 𝛽 's to 0 ), we enhance
model interpretability.
▪ Sacrifice "small details" to get the "big picture".
▪ Caution: "interpretation" here need not mean causal effect! Rather, it is about which
variables are important predictors.
Introduction: Three important approaches
▪ Subset selection: find a (hopefully small) subset of the X's that yields good
performance. Fit OLS on this subset.
▪ Shrinkage: Fit a model using all 𝑃 variables, but shrink coefficients to 0
relative to OLS (some can go to exactly 0 ).
▪ Dimension reduction: attempt to "summarize" the 𝑃 predictors with 𝑀 < 𝑃
linear combinations of the variables (e.g., principal components)
▪ We will talk about the first approach today.
Subset Selection
▪ We try to find a possibly sparse subset of 𝑋's to forecast 𝑌 well: find the 𝑆 << 𝑁
predictors needed for a high-quality forecast.
▪ Dropping a variable = setting its coefficient to zero. A form of penalty called
the 𝐿0 penalty.
▪ Bias/variance tradeoff: leave out a variable highly correlated with the signal → bias;
put in too many 𝑋's → high variance.
▪ In principle, want to try out all possible combinations and choose the one
with the best OOS performance.
Example: the ‘Credit’ data set
▪ Data from the ISLR package on credit card balances and 10 predictors, 𝑁 = 400.
ID Identification
Income Income in $10,000 's
Limit Credit limit
Rating Credit rating
Cards Number of credit cards
Age Age in years
Education Number of years of education
Gender A factor with levels Male and Female
Student A factor with levels No and Yes indicating whether the individual was a student
Married A factor with levels No and Yes indicating whether the individual was married
Ethnicity A factor with levels African American, Asian, and Caucasian indicating the individual's ethnicity.
Balance Average credit card balance in $
Best Subset Selection
▪ Let 𝑝̄ ≤ 𝑝 be the maximum model size.
▪ Goal: for each 𝑘 ∈ {0, 1, 2, … , 𝑝̄}, find the subset of size 𝑘 that gives the
best 𝑅². Then pick the overall best model.
▪ For 𝑘 = 0, 1, … , 𝑝̄:
• Fit all models that contain exactly 𝑘 predictors. There are (𝑝 choose 𝑘) = 𝑝!/(𝑘!(𝑝 − 𝑘)!)
such models. If 𝑘 = 0, the forecast is the unconditional mean.
• Pick the best (e.g., highest 𝑅²) among these models, and denote it by ℳ𝑘.
▪ Optimize over ℳ0, … , ℳ𝑝̄ using cross-validation (or other criteria like AIC
or BIC).
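▪ To make the two-step recipe concrete, here is a minimal Python sketch (not from the lecture): it assumes the predictors sit in a numeric pandas DataFrame X (dummies already created for factors) and the response in a vector y, uses statsmodels for the OLS fits, and the name best_subset is purely illustrative.

```python
# Best subset sketch: for each size k, keep the model with the highest R^2 (that is M_k),
# then compare the per-size winners with BIC (AIC or cross-validation work the same way).
from itertools import combinations

import numpy as np
import statsmodels.api as sm


def best_subset(X, y, max_size=None):
    p = X.shape[1]
    max_size = p if max_size is None else max_size
    best_per_size = {}
    for k in range(max_size + 1):
        best_r2, best_model = -np.inf, None
        for subset in combinations(X.columns, k):
            # k = 0 is the intercept-only model: the forecast is the unconditional mean
            design = sm.add_constant(X[list(subset)]) if k else np.ones((len(y), 1))
            fit = sm.OLS(y, design).fit()
            if fit.rsquared > best_r2:
                best_r2, best_model = fit.rsquared, (subset, fit)
        best_per_size[k] = best_model                 # this is M_k
    # Step 2: compare M_0, ..., M_pbar using an information criterion
    best_k = min(best_per_size, key=lambda k: best_per_size[k][1].bic)
    return best_per_size[best_k]                      # (chosen variables, fitted model)
```

Swapping fit.bic for fit.aic, or for a cross-validated MSE, changes only the final comparison.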
Best Subset Selection
▪ The above direct approach is called all subsets or best subsets regression
▪ However, we often can't examine all possible models, since there are 2^𝑝 of them; for
example, when 𝑝 = 40 there are over a trillion models!
▪ The prediction is highly unstable: the subsets of variables in ℳ10 and ℳ11 can be very
different from each other → high variance (the best subset of size 3 need not include any of
the variables in the best subset of size 2).
▪ If P is large, the chance of finding models that work great in training data, but not so
much OOS increases – overfitting.
▪ Estimates fluctuate across different samples, and so does the best model choice.
▪ Nice feature: not confined just to OLS, can do, e.g., logit regression, just replace 𝑅2 by
deviance (−2log(L)).
Best Subset Selection
▪ But the more predictors, the lower the in-sample MSE. With large 𝑘, we might have too
flexible a model. With small 𝑘, we have a rigid model. Both do badly out-of-sample.
▪ So which 𝑘 is the best? In other words, out of the best models for every 𝑘, which model is
the best?
▪ We need the model with the lowest OOS MSE ('test error’).
▪ We can use information criteria (IC) to indirectly estimate the OOS MSE. ICs adjust the
in-sample MSE to penalize potential overfitting: the more regressors we add, the higher
the penalty.
▪ Or we can directly estimate the OOS MSE using a validation set or CV.
Mallow’s 𝐶𝑝 and adjusted 𝑅²: a refresher
▪ Mallow's 𝐶𝑝:
𝐶𝑝 = (1/𝑛) (RSS + 2𝑘𝜎̂²)
▪ where 𝑘 is the number of predictors used and 𝜎̂² is an estimate of the variance of the error 𝜀 associated with
each response measurement.
▪ For a least squares model with 𝑘 variables, the adjusted 𝑅² statistic is calculated as
Adjusted 𝑅² = 1 − [RSS/(𝑛 − 𝑘 − 1)] / [TSS/(𝑛 − 1)]
▪ Maximizing the adjusted 𝑅² is equivalent to minimizing RSS/(𝑛 − 𝑘 − 1). While RSS always
decreases as the number of variables in the model increases, RSS/(𝑛 − 𝑘 − 1) may increase or decrease,
due to the presence of 𝑘 in the denominator.
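▪ As a concrete illustration of the two formulas, a small Python sketch (illustrative, not from the lecture), assuming a fitted statsmodels OLS result fit with 𝑘 predictors plus an intercept and an external error-variance estimate sigma2_hat (e.g., from the full model):

```python
import numpy as np


def mallows_cp(fit, sigma2_hat):
    """C_p = (RSS + 2*k*sigma2_hat) / n, with k = number of predictors (intercept excluded)."""
    n, k, rss = int(fit.nobs), fit.df_model, fit.ssr
    return (rss + 2 * k * sigma2_hat) / n


def adjusted_r2(fit):
    """Adjusted R^2 = 1 - [RSS/(n - k - 1)] / [TSS/(n - 1)]; should match fit.rsquared_adj."""
    n, k, rss, tss = int(fit.nobs), fit.df_model, fit.ssr, fit.centered_tss
    return 1 - (rss / (n - k - 1)) / (tss / (n - 1))
```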
AIC: A refresher
▪ AIC (Akaike Information Criterion) is, up to a constant, an (approximately) unbiased
estimate of the expected Kullback-Leibler divergence between the true data-generating
distribution and the fitted model (if all the assumptions hold).
▪ The 𝐴𝐼𝐶 criterion is defined for a large class of models fit by maximum
likelihood:
AIC = −2log 𝐿 + 2 ⋅ 𝑘
where L is the maximized value of the likelihood function for the estimated model.
▪ In the linear regression model, it equals:
𝐴𝐼𝐶 = (1/(𝑁𝜎̂²)) (𝑅𝑆𝑆 + 2𝑘𝜎̂²)
▪ AIC is used in different model types, not just linear.
BIC: a refresher
▪ BIC ((Schwarz) Bayesian Information Criterion) is tied to the posterior probability of the
model: under a uniform prior over models, the model with the lowest BIC is (approximately)
the one with the highest posterior probability.
▪ The 𝐵𝐼𝐶 criterion is defined for a large class of models fit by maximum
likelihood:
BIC = −2log 𝐿 + log(N) ⋅ 𝑘
where L is the maximized value of the likelihood function for the estimated model.
▪ In the linear regression model, it equals:
𝐵𝐼𝐶 = (1/(𝑁𝜎̂²)) (𝑅𝑆𝑆 + log(𝑁) 𝑘𝜎̂²)
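▪ The RSS-based AIC/BIC forms above translate directly into code; a hedged sketch, again assuming a fitted statsmodels OLS result fit and a fixed error-variance estimate sigma2_hat held constant across candidate models. (statsmodels' own fit.aic and fit.bic use the −2log(𝐿) form with 𝜎² estimated within each model, so the two versions need not rank models identically.)

```python
import numpy as np


def aic_rss(fit, sigma2_hat):
    """AIC = (RSS + 2*k*sigma2_hat) / (n * sigma2_hat): the linear-regression form above."""
    n, k, rss = int(fit.nobs), fit.df_model, fit.ssr
    return (rss + 2 * k * sigma2_hat) / (n * sigma2_hat)


def bic_rss(fit, sigma2_hat):
    """BIC = (RSS + log(n)*k*sigma2_hat) / (n * sigma2_hat)."""
    n, k, rss = int(fit.nobs), fit.df_model, fit.ssr
    return (rss + np.log(n) * k * sigma2_hat) / (n * sigma2_hat)
```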
Estimating 𝜎̂² when applying IC
▪ How do we find 𝜎̂² to compute AIC and BIC?
▪ First approach: use the 𝜎̂² estimated directly from each model.
▪ When 𝑃 << 𝑁: use the 𝜎̂² estimated from the full model (all predictors used
→ low-bias model; suggested in the textbook).
▪ When 𝑃/𝑁 is not small, try the iterative procedure below:
▪ Use 𝜎̂0² = 𝑠𝑌² (the sample variance of 𝑌), select the best model based on AIC/BIC,
and call it 𝑀𝑘0.
▪ Take 𝜎̂1² = the error-variance estimate from 𝑀𝑘0, select the best model on the IC again,
and call it 𝑀𝑘1. Iterate until convergence (often two steps are enough).
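▪ A rough sketch of the iteration (illustrative only; select_by_ic is a hypothetical stand-in for any selection routine in this lecture that, given a fixed 𝜎̂², returns the IC-best fitted OLS model):

```python
import numpy as np


def iterate_sigma2(X, y, select_by_ic, tol=1e-8, max_iter=10):
    """Alternate between model selection and error-variance re-estimation."""
    sigma2 = np.var(y, ddof=1)                  # sigma_0^2 = sample variance of Y
    for _ in range(max_iter):
        fit = select_by_ic(X, y, sigma2)        # best model under AIC/BIC given current sigma^2
        new_sigma2 = fit.ssr / fit.df_resid     # variance estimate from the selected model
        if abs(new_sigma2 - sigma2) < tol:      # stop once the estimate settles (often 2 steps)
            break
        sigma2 = new_sigma2
    return fit, sigma2
```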
Best subset selection for ‘Credit’
Source: James et al. (2014), ISLR, Springer
For each possible model containing a subset of the ten predictors in the Credit data set, the 𝑅𝑆𝑆
and 𝑅2 are displayed. The red frontier tracks the best model for a given number of predictors,
according to 𝑅𝑆𝑆 and 𝑅 2 .
Best subset selection for ‘Credit’; 𝑝̄ = 8
𝑘: 1 2 3 4 5 6 7 8
Variable (∗ = included in the best model of size 𝑘)
Income ∗ ∗ ∗ ∗ ∗ ∗ ∗
Limit ∗ ∗ ∗ ∗ ∗
Rating ∗ ∗ ∗ ∗ ∗ ∗ ∗
Cards ∗ ∗ ∗ ∗ ∗
Age ∗ ∗ ∗
Education ∗
Female ∗
Student ∗ ∗ ∗ ∗ ∗ ∗
Married ∗
Asian ∗ ∗
Caucasian ∗
Best subset optimal model for 'Credit’
Criterion 𝑘 Test MSE
AIC 6 10155.78
BIC 4 10307.72
10-fold CV 6 10155.78
▪ NB : Alternative variance estimation for AIC/BIC doesn't affect the selection.
Stepwise Selection
▪ For computational reasons, best subset selection cannot be applied with very
large 𝑝. Why not?
▪ Best subset selection may also suffer from statistical problems when 𝑝 is large: the
larger the search space, the higher the chance of finding models that look
good on the training data, even though they might not have any predictive
power on future data.
▪ Thus, an enormous search space can lead to overfitting and high variance of
the coefficient estimates.
▪ For both of these reasons, stepwise methods, which explore a far more
restricted set of models, are attractive alternatives to best subset selection.
Forward Stepwise Selection: algorithm
1. Let ℳ0 denote the null model, which contains no predictors.
2. For 𝑘 = 0, … , 𝑝̄ − 1:
▪ Consider all 𝑝 − 𝑘 models that augment the predictors in ℳ𝑘 with one additional
predictor.
▪ Choose the best among these 𝑝 − 𝑘 models, and call it ℳ𝑘+1 . Here best is defined as
having smallest RSS or highest 𝑅2 .
3. Select a single best model from among ℳ0, … , ℳ𝑝̄ using cross-validated
prediction error, 𝐶𝑝, AIC, BIC, or adjusted 𝑅².
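▪ A minimal Python sketch of the algorithm (illustrative names; numeric pandas DataFrame X, response y, statsmodels OLS, and BIC for the final pick):

```python
import numpy as np
import statsmodels.api as sm


def forward_stepwise(X, y, max_size=None):
    p = X.shape[1]
    max_size = p if max_size is None else max_size
    selected = []
    path = [sm.OLS(y, np.ones((len(y), 1))).fit()]             # M_0: the null model
    for _ in range(max_size):
        remaining = [c for c in X.columns if c not in selected]
        fits = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit() for c in remaining}
        best = max(fits, key=lambda c: fits[c].rsquared)        # best of the p - k candidates
        selected.append(best)
        path.append(fits[best])                                 # this is M_{k+1}
    best_k = int(np.argmin([f.bic for f in path]))              # final pick by BIC (or AIC/CV)
    return path[best_k], selected[:best_k]
```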
Forward Stepwise Selection
▪ FSS involves fitting one null model and (𝑝 − 𝑘) models for each iteration of 𝑘 =
0, 1, … , (𝑝̄ − 1). If 𝑝̄ = 𝑝 ⇒ 1 + 𝑝(𝑝 + 1)/2 models in total. E.g., for 𝑝 = 20 we fit
211 models (vs. 1,048,576 using best subset).
▪ Observe: the models are nested - ℳ1 contains the predictors of ℳ0, ℳ2 contains the
predictors of ℳ1, and so on. This is not the case in best subset selection; thus FSS may
not always find the best possible model of each size.
▪ Forward stepwise selection can be applied in high-dimensional scenarios where 𝑛 < 𝑝,
provided 𝑝̄ < 𝑛.
▪ Still not confined just to OLS, can do, e.g., logit regression, just replace 𝑅2 by
deviance (−2log(L)).
Forward subset selection for ‘Credit’; 𝑝̄ = 8
𝑘 𝟏 𝟐 𝟑 𝟒 𝟓 𝟔 𝟕 𝟖
Variable
Income ∗ ∗ ∗ ∗ ∗ ∗ ∗
Limit ∗ ∗ ∗ ∗ ∗
Rating ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗
Cards ∗ ∗ ∗ ∗
Age ∗ ∗ ∗
Education
Female ∗
Student ∗ ∗ ∗ ∗ ∗ ∗
Married
Asian ∗ ∗
Caucasian
Forward sel. optimal model for 'Credit'
Criterion 𝑘 Test MSE
AIC 6 10155.78
BIC 5 10281.33
10-fold CV 5 10281.33
▪ NB : Alternative variance estimation for AIC/BIC doesn't affect the selection.
Backward Stepwise Selection (General)
▪ Let ℳ𝑝̄ denote the full model, which contains all 𝑝̄ predictors.
▪ For 𝑘 = 𝑝̄, 𝑝̄ − 1, … , 1:
▪ Consider all 𝑘 models that contain all but one of the predictors in ℳ𝑘 , for a total of
𝑘 − 1 predictors.
▪ Choose the best among these 𝑘 models, and call it ℳ𝑘−1 . Here, best is defined as
having smallest RSS or highest 𝑅2 (or maximum log likelihood or lowest deviance
depending on the estimation.)
▪ Select a single best model from among ℳ0, … , ℳ𝑝̄ using cross-validated
prediction error, 𝐶𝑝, AIC, BIC, or adjusted 𝑅².
Backward Stepwise Selection (OLS using t-stat)
▪ Let ℳ𝑝 denote the full model, which contains all 𝑝 predictors.
▪ For 𝑘 = 𝑝̄, 𝑝̄ − 1, … , 1:
▪ Fit the model with k predictors.
▪ Identify the least useful predictor (smallest “classical” t-stat.), drop it and
call the resulting model 𝑀𝑘−1
▪ Select a single best model from among ℳ0, … , ℳ𝑝̄ using cross-validated
prediction error, 𝐶𝑝, AIC, BIC, or adjusted 𝑅².
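▪ A minimal sketch of the t-stat version (illustrative; same assumptions as the earlier sketches, with BIC for the final pick):

```python
import numpy as np
import statsmodels.api as sm


def backward_stepwise(X, y):
    selected = list(X.columns)
    path = [sm.OLS(y, sm.add_constant(X[selected])).fit()]      # M_p: the full model
    while selected:
        tvals = path[-1].tvalues.drop("const").abs()            # t-stats of current predictors
        selected.remove(tvals.idxmin())                         # drop the least useful predictor
        design = sm.add_constant(X[selected]) if selected else np.ones((len(y), 1))
        path.append(sm.OLS(y, design).fit())                    # this is M_{k-1}
    return min(path, key=lambda f: f.bic)                       # final pick by an IC (or CV)
```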
Backward Stepwise Selection
▪ Instead of fitting 2^𝑃 models, we need only 𝑃 + 1 fits for OLS (with the t-stat shortcut)
and 1 + 𝑝(𝑝 + 1)/2 for logit regression (why?).
▪ Statistical cost: we constrain the search to reduce variance, but perhaps incur
more bias.
▪ Can't be used when 𝑃 > 𝑁
▪ Still not confined just to OLS, can do, e.g., logit regression.
Backward subset selection for ‘Credit’; 𝑝̄ = 8
𝑘 𝟏 𝟐 𝟑 𝟒 𝟓 𝟔 𝟕 𝟖
Variable
Income ∗ ∗ ∗ ∗ ∗ ∗ ∗
Limit ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗
Rating ∗ ∗ ∗ ∗
Cards ∗ ∗ ∗ ∗ ∗
Age ∗ ∗ ∗
Education
Female ∗
Student ∗ ∗ ∗ ∗ ∗ ∗
Married
Asian ∗ ∗
Caucasian
Backward-selection optimal model for 'Credit'
Criterion 𝑘 Test MSE
AIC 6 10155.78
BIC 5 10307.72
10-fold CV 5 10155.78
▪ NB : Alternative variance estimation for AIC/BIC doesn't affect the selection.
High-dimensional setting (𝑃 >> 𝑁)
▪ Standard AIC and BIC are not well suited for selection - they will typically
overfit.
▪ BIC can sometimes be fine if 𝑃 is proportional to 𝑁.
▪ Chen and Chen (2008) propose the extended BIC, formally justified for the high-
dimensional setting: replace log(𝑁) with log(𝑁) + 2log(𝑃) in the penalty, so that
EBIC = −2log(𝐿) + (log(𝑁) + 2log(𝑃)) ⋅ 𝑘.
▪ Wang (2012) shows that forward selection using the above BIC is consistent in
ultra high-dimensional settings under some assumptions (normal
homoscedastic linear regression plus some regularity conditions).
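▪ In code, the modification is a one-line change to the usual likelihood-based BIC (illustrative sketch; fit is a fitted statsmodels model and n_candidates is the number of candidate predictors 𝑃):

```python
import numpy as np


def extended_bic(fit, n_candidates):
    """EBIC = -2 log L + k * (log(N) + 2*log(P)), following Chen and Chen (2008)."""
    k = fit.df_model                  # counting the intercept as well would only shift
    n = fit.nobs                      # every candidate model by the same constant
    return -2 * fit.llf + k * (np.log(n) + 2 * np.log(n_candidates))
```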
Final comments
▪ Nice feature: end up with a single model that is easy to work with and discuss. Most
likely not the true model.
▪ Very interpretable, but careful about claiming causality!
▪ Can do this type of selection with essentially any criterion function (e.g., log-
likelihood for binary choice).
▪ Many flavors of selection methods, not all have theoretical justification. Need to
figure out what works when.
▪ Standard approaches get very slow very fast. New techniques push the envelope
quite a bit though.
▪ Stability w.r.t. the data sample could be an issue - a starkly different model may be
obtained if another sample is considered.