CHAPTER TWO
Regression with penalization: subset selection
Module Code: EC4308
Module title: Machine Learning and Economic Forecasting
Instructor: Vu Thanh Hai
Reference: ISLR 6.1
Acknowledgement: I adapt Dr. Denis Tkachenko’s notes with
some minor modifications. Thanks to Denis!
Lecture outline
▪ Introduction and motivation
▪ Best subset selection
▪ Forward stepwise selection
▪ Backward stepwise selection
▪ Concluding thoughts: pros and cons.
Introduction and motivation
▪ The linear regression model that you have learned:
𝑦𝑖 = 𝛽0 + 𝛽1 𝑥1,𝑖 + 𝛽2 𝑥2,𝑖 + ⋯ + 𝛽𝑝 𝑥𝑝,𝑖 + 𝜀𝑖
▪ Very commonly used to model 𝐸(𝑌 ∣ 𝑋) - recall that this is what we are after
in predictive modelling under squared loss.
▪ Can describe nonlinear but still additive relationships: the model only needs to be
linear in the coefficients (𝛽’s).
▪ Linearity yields advantages in inference, and in many applications delivers
surprisingly good approximations.
Motivation: is OLS good enough?
▪ We typically fit such models with OLS, so why look for alternatives?
▪ Two key concerns: prediction accuracy and model interpretability.
▪ OLS will have low bias if the true relationship is approximately linear.
▪ If 𝑃 << 𝑁 (the "typical" econometrics textbook setting), OLS also has low variance.
No need to do anything.
Motivation: prediction accuracy
▪ When 𝑁 is not much larger than 𝑃 - OLS predictions will have high variance
→ overfit → poor OOS prediction.
▪ When 𝑃 > 𝑁 (or 𝑃 >> 𝑁 ), no unique solution: OLS variance is infinite →
cannot use this method at all.
▪ Our goal is to get a linear combination of 𝑋 to get a good forecast of 𝑌.
▪ We want to reduce the variance by somehow reducing model complexity at a minor
cost in terms of bias - constrain coefficients on some 𝑋's to 0 or shrink them.
Motivation: Interpretability
▪ Oftentimes some of the 𝑋's entering the regression are irrelevant.
▪ OLS will almost never set their coefficients exactly to 0.
▪ By removing potentially many such variables (setting their 𝛽 's to 0 ), we enhance
model interpretability.
▪ Sacrifice "small details" to get the "big picture".
▪ Caution: "interpretation" here need not mean causal effect! Rather, it is about which
variables are important predictors.
Introduction: Three important approaches
▪ Subset selection: find a (hopefully small) subset of the X's that yields good
performance. Fit OLS on this subset.
▪ Shrinkage: Fit a model using all 𝑃 variables, but shrink coefficients to 0
relative to OLS (some can go to exactly 0 ).
▪ Dimension reduction: attempt to "summarize" the 𝑃 predictors with 𝑀 < 𝑃
linear combinations of the variables (e.g., principal components)
▪ We will talk about the first approach today.
Subset Selection
▪ We try to find a possibly sparse subset of 𝑋's to forecast 𝑌 well: find the 𝑆 << 𝑁
predictors needed for a high-quality forecast.
▪ Dropping a variable = setting its coefficient to zero. A form of penalty called
the 𝐿0 penalty.
▪ Bias/variance tradeoff: leave out a variable highly correlated with the signal → bias;
put in too many 𝑋's → high variance.
▪ In principle, want to try out all possible combinations and choose the one
with the best OOS performance.
Example: the ‘Credit’ data set
▪ Data from the ISLR package on credit card balances and 10 predictors, 𝑁 = 400.
ID Identification
Income Income in $10,000 's
Limit Credit limit
Rating Credit rating
Cards Number of credit cards
Age Age in years
Education Number of years of education
Gender A factor with levels Male and Female
Student A factor with levels No and Yes indicating whether the individual was a student
Married A factor with levels No and Yes indicating whether the individual was married
Ethnicity A factor with levels African American, Asian, and Caucasian indicating the individual's ethnicity.
Balance Average credit card balance in $
Best Subset Selection
▪ Let 𝑝̄ ≤ 𝑝 be the maximum model size.
▪ Goal: for each 𝑘 ∈ {0, 1, 2, … , 𝑝̄}, find the subset of size 𝑘 that gives the
best 𝑅². Then pick the overall best model.
▪ For 𝑘 = 0, 1, … , 𝑝̄:
• Fit all models that contain exactly 𝑘 predictors. There are (𝑝 choose 𝑘) = 𝑝!/(𝑘!(𝑝 − 𝑘)!)
such models. If 𝑘 = 0, the forecast is the unconditional mean.
• Pick the best (e.g., highest 𝑅²) among these models, and denote it by ℳ𝑘.
▪ Optimize over ℳ0, … , ℳ𝑝̄ using cross-validation (or other criteria like AIC
or BIC).
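▪ To make the two-step recipe concrete, here is a minimal Python sketch (not from the lecture): it assumes the predictors sit in a numeric pandas DataFrame X (dummies already created for factors) and the response in a vector y, uses statsmodels for the OLS fits, and the name best_subset is purely illustrative.

```python
# Best subset sketch: for each size k, keep the model with the highest R^2 (that is M_k),
# then compare the per-size winners with BIC (AIC or cross-validation work the same way).
from itertools import combinations

import numpy as np
import statsmodels.api as sm


def best_subset(X, y, max_size=None):
    p = X.shape[1]
    max_size = p if max_size is None else max_size
    best_per_size = {}
    for k in range(max_size + 1):
        best_r2, best_model = -np.inf, None
        for subset in combinations(X.columns, k):
            # k = 0 is the intercept-only model: the forecast is the unconditional mean
            design = sm.add_constant(X[list(subset)]) if k else np.ones((len(y), 1))
            fit = sm.OLS(y, design).fit()
            if fit.rsquared > best_r2:
                best_r2, best_model = fit.rsquared, (subset, fit)
        best_per_size[k] = best_model                 # this is M_k
    # Step 2: compare M_0, ..., M_pbar using an information criterion
    best_k = min(best_per_size, key=lambda k: best_per_size[k][1].bic)
    return best_per_size[best_k]                      # (chosen variables, fitted model)
```

Swapping fit.bic for fit.aic, or for a cross-validated MSE, changes only the final comparison.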
Best Subset Selection
▪ The above direct approach is called all subsets or best subsets regression
▪ However, we often can't examine all possible models, since there are 2^𝑝 of them; for
example, when 𝑝 = 40 there are over a trillion models!
▪ The prediction is highly unstable: the subsets of variables in ℳ10 and ℳ11 can be very
different from each other → high variance (the best subset of size 3 need not include any of
the variables in the best subset of size 2).
▪ If P is large, the chance of finding models that work great in training data, but not so
much OOS increases – overfitting.
▪ Estimates fluctuate across different samples, and so does the best model choice.
▪ Nice feature: not confined just to OLS, can do, e.g., logit regression, just replace 𝑅2 by
deviance (−2log(L)).
Best Subset Selection
▪ But the more predictors, the lower the in-sample MSE. With large 𝑘, we might have too
flexible a model. With small 𝑘, we have a rigid model. Both do badly out-of-sample.
▪ So which 𝑘 is the best? In other words, out of the best models for every 𝑘, which model is
the best?
▪ We need the model with the lowest OOS MSE ('test error’).
▪ We can use information criteria (IC) to indirectly estimate the OOS MSE. ICs adjust the
in-sample MSE to penalize potential overfitting: the more regressors we add, the higher
the penalty.
▪ Or we can directly estimate the OOS MSE using a validation set or CV.
Mallow’s 𝐶𝑝 and adjusted 𝑅²: a refresher
▪ Mallow's 𝐶𝑝:
𝐶𝑝 = (1/𝑛) (RSS + 2𝑘𝜎̂²)
▪ where 𝑘 is the number of predictors used and 𝜎̂² is an estimate of the variance of the error 𝜀 associated with
each response measurement.
▪ For a least squares model with 𝑘 variables, the adjusted 𝑅² statistic is calculated as
Adjusted 𝑅² = 1 − [RSS/(𝑛 − 𝑘 − 1)] / [TSS/(𝑛 − 1)]
▪ Maximizing the adjusted 𝑅² is equivalent to minimizing RSS/(𝑛 − 𝑘 − 1). While RSS always
decreases as the number of variables in the model increases, RSS/(𝑛 − 𝑘 − 1) may increase or decrease,
due to the presence of 𝑘 in the denominator.
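▪ As a concrete illustration of the two formulas, a small Python sketch (illustrative, not from the lecture), assuming a fitted statsmodels OLS result fit with 𝑘 predictors plus an intercept and an external error-variance estimate sigma2_hat (e.g., from the full model):

```python
import numpy as np


def mallows_cp(fit, sigma2_hat):
    """C_p = (RSS + 2*k*sigma2_hat) / n, with k = number of predictors (intercept excluded)."""
    n, k, rss = int(fit.nobs), fit.df_model, fit.ssr
    return (rss + 2 * k * sigma2_hat) / n


def adjusted_r2(fit):
    """Adjusted R^2 = 1 - [RSS/(n - k - 1)] / [TSS/(n - 1)]; should match fit.rsquared_adj."""
    n, k, rss, tss = int(fit.nobs), fit.df_model, fit.ssr, fit.centered_tss
    return 1 - (rss / (n - k - 1)) / (tss / (n - 1))
```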
AIC: A refresher
▪ AIC (Akaike Information Criterion) is, up to a constant, an (approximately) unbiased
estimate of the expected Kullback-Leibler divergence between the true data-generating
distribution and the fitted model (if all the assumptions hold).
▪ The 𝐴𝐼𝐶 criterion is defined for a large class of models fit by maximum
likelihood:
AIC = −2log 𝐿 + 2 ⋅ 𝑘
where L is the maximized value of the likelihood function for the estimated model.
▪ In the linear regression model, it equals:
𝐴𝐼𝐶 = (1/(𝑁𝜎̂²)) (𝑅𝑆𝑆 + 2𝑘𝜎̂²)
▪ AIC is used in different model types, not just linear.
BIC: a refresher
▪ BIC ((Schwarz) Bayesian Information Criterion) is tied to the posterior probability of the
model: under a uniform prior over models, the model with the lowest BIC is (approximately)
the one with the highest posterior probability.
▪ The 𝐵𝐼𝐶 criterion is defined for a large class of models fit by maximum
likelihood:
BIC = −2log 𝐿 + log(N) ⋅ 𝑘
where L is the maximized value of the likelihood function for the estimated model.
▪ In the linear regression model, it equals:
𝐵𝐼𝐶 = (1/(𝑁𝜎̂²)) (𝑅𝑆𝑆 + log(𝑁) 𝑘𝜎̂²)
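▪ The RSS-based AIC/BIC forms above translate directly into code; a hedged sketch, again assuming a fitted statsmodels OLS result fit and a fixed error-variance estimate sigma2_hat held constant across candidate models. (statsmodels' own fit.aic and fit.bic use the −2log(𝐿) form with 𝜎² estimated within each model, so the two versions need not rank models identically.)

```python
import numpy as np


def aic_rss(fit, sigma2_hat):
    """AIC = (RSS + 2*k*sigma2_hat) / (n * sigma2_hat): the linear-regression form above."""
    n, k, rss = int(fit.nobs), fit.df_model, fit.ssr
    return (rss + 2 * k * sigma2_hat) / (n * sigma2_hat)


def bic_rss(fit, sigma2_hat):
    """BIC = (RSS + log(n)*k*sigma2_hat) / (n * sigma2_hat)."""
    n, k, rss = int(fit.nobs), fit.df_model, fit.ssr
    return (rss + np.log(n) * k * sigma2_hat) / (n * sigma2_hat)
```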
Estimating 𝜎̂² when applying IC
▪ How do we find 𝜎̂² to compute AIC and BIC?
▪ First approach: use the 𝜎̂² estimated directly from each model.
▪ When 𝑃 << 𝑁: use the 𝜎̂² estimated from the full model (all predictors used
→ low-bias model; suggested in the textbook).
▪ When 𝑃/𝑁 is not small, try the iterative procedure below:
▪ Use 𝜎̂0² = 𝑠𝑌² (the sample variance of 𝑌), select the best model based on AIC/BIC,
and call it 𝑀𝑘0.
▪ Take 𝜎̂1² = the error-variance estimate from 𝑀𝑘0, select the best model on the IC again,
and call it 𝑀𝑘1. Iterate until convergence (often two steps are enough).
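▪ A rough sketch of the iteration (illustrative only; select_by_ic is a hypothetical stand-in for any selection routine in this lecture that, given a fixed 𝜎̂², returns the IC-best fitted OLS model):

```python
import numpy as np


def iterate_sigma2(X, y, select_by_ic, tol=1e-8, max_iter=10):
    """Alternate between model selection and error-variance re-estimation."""
    sigma2 = np.var(y, ddof=1)                  # sigma_0^2 = sample variance of Y
    for _ in range(max_iter):
        fit = select_by_ic(X, y, sigma2)        # best model under AIC/BIC given current sigma^2
        new_sigma2 = fit.ssr / fit.df_resid     # variance estimate from the selected model
        if abs(new_sigma2 - sigma2) < tol:      # stop once the estimate settles (often 2 steps)
            break
        sigma2 = new_sigma2
    return fit, sigma2
```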
Best subset selection for ‘Credit’
Source: James et al. (2014), ISLR, Springer
For each possible model containing a subset of the ten predictors in the Credit data set, the 𝑅𝑆𝑆
and 𝑅2 are displayed. The red frontier tracks the best model for a given number of predictors,
according to 𝑅𝑆𝑆 and 𝑅 2 .
Best subset selection for ‘Credit’; 𝑝̄ = 8
𝑘: 1 2 3 4 5 6 7 8
Variable (∗ = included in the best model of size 𝑘)
Income ∗ ∗ ∗ ∗ ∗ ∗ ∗
Limit ∗ ∗ ∗ ∗ ∗
Rating ∗ ∗ ∗ ∗ ∗ ∗ ∗
Cards ∗ ∗ ∗ ∗ ∗
Age ∗ ∗ ∗
Education ∗
Female ∗
Student ∗ ∗ ∗ ∗ ∗ ∗
Married ∗
Asian ∗ ∗
Caucasian ∗
Best subset optimal model for 'Credit’
Criterion 𝑘 Test MSE
AIC 6 10155.78
BIC 4 10307.72
10-fold CV 6 10155.78
▪ NB : Alternative variance estimation for AIC/BIC doesn't affect the selection.
Stepwise Selection
▪ For computational reasons, best subset selection cannot be applied with very
large 𝑝. Why not?
▪ Best subset selection may also suffer from statistical problems when 𝑝 is large: the
larger the search space, the higher the chance of finding models that look
good on the training data, even though they might not have any predictive
power on future data.
▪ Thus, an enormous search space can lead to overfitting and high variance of
the coefficient estimates.
▪ For both of these reasons, stepwise methods, which explore a far more
restricted set of models, are attractive alternatives to best subset selection.
Forward Stepwise Selection: algorithm
1. Let ℳ0 denote the null model, which contains no predictors.
2. For 𝑘 = 0, … , 𝑝̄ − 1:
▪ Consider all 𝑝 − 𝑘 models that augment the predictors in ℳ𝑘 with one additional
predictor.
▪ Choose the best among these 𝑝 − 𝑘 models, and call it ℳ𝑘+1 . Here best is defined as
having smallest RSS or highest 𝑅2 .
3. Select a single best model from among ℳ0, … , ℳ𝑝̄ using cross-validated
prediction error, 𝐶𝑝, AIC, BIC, or adjusted 𝑅².
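▪ A minimal Python sketch of the algorithm (illustrative names; numeric pandas DataFrame X, response y, statsmodels OLS, and BIC for the final pick):

```python
import numpy as np
import statsmodels.api as sm


def forward_stepwise(X, y, max_size=None):
    p = X.shape[1]
    max_size = p if max_size is None else max_size
    selected = []
    path = [sm.OLS(y, np.ones((len(y), 1))).fit()]             # M_0: the null model
    for _ in range(max_size):
        remaining = [c for c in X.columns if c not in selected]
        fits = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit() for c in remaining}
        best = max(fits, key=lambda c: fits[c].rsquared)        # best of the p - k candidates
        selected.append(best)
        path.append(fits[best])                                 # this is M_{k+1}
    best_k = int(np.argmin([f.bic for f in path]))              # final pick by BIC (or AIC/CV)
    return path[best_k], selected[:best_k]
```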
Forward Stepwise Selection
▪ FSS involves fitting one null model and (𝑝 − 𝑘) models for each iteration of 𝑘 =
0, 1, … , (𝑝̄ − 1). If 𝑝̄ = 𝑝 ⇒ 1 + 𝑝(𝑝 + 1)/2 models in total. E.g., for 𝑝 = 20 we fit
211 models (vs. 1,048,576 using best subset).
▪ Observe: the models are nested - ℳ1 contains the predictors of ℳ0, ℳ2 contains the
predictors of ℳ1, and so on. This is not the case in best subset selection; thus FSS may
not always find the best possible model of each size.
▪ Forward stepwise selection can be applied in high-dimensional scenarios where 𝑛 < 𝑝,
provided 𝑝̄ < 𝑛.
▪ Still not confined just to OLS, can do, e.g., logit regression, just replace 𝑅2 by
deviance (−2log(L)).
Forward subset selection for ‘Credit’; 𝑝̄ = 8
𝑘 𝟏 𝟐 𝟑 𝟒 𝟓 𝟔 𝟕 𝟖
Variable
Income ∗ ∗ ∗ ∗ ∗ ∗ ∗
Limit ∗ ∗ ∗ ∗ ∗
Rating ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗
Cards ∗ ∗ ∗ ∗
Age ∗ ∗ ∗
Education
Female ∗
Student ∗ ∗ ∗ ∗ ∗ ∗
Married
Asian ∗ ∗
Caucasian
Forward sel. optimal model for 'Credit'
Criterion 𝑘 Test MSE
AIC 6 10155.78
BIC 5 10281.33
10-fold CV 5 10281.33
▪ NB : Alternative variance estimation for AIC/BIC doesn't affect the selection.
Backward Stepwise Selection (General)
▪ Let ℳ𝑝̄ denote the full model, which contains all 𝑝̄ predictors.
▪ For 𝑘 = 𝑝̄, 𝑝̄ − 1, … , 1:
▪ Consider all 𝑘 models that contain all but one of the predictors in ℳ𝑘 , for a total of
𝑘 − 1 predictors.
▪ Choose the best among these 𝑘 models, and call it ℳ𝑘−1 . Here, best is defined as
having smallest RSS or highest 𝑅2 (or maximum log likelihood or lowest deviance
depending on the estimation.)
▪ Select a single best model from among ℳ0, … , ℳ𝑝̄ using cross-validated
prediction error, 𝐶𝑝, AIC, BIC, or adjusted 𝑅².
Backward Stepwise Selection (OLS using t-stat)
▪ Let ℳ𝑝 denote the full model, which contains all 𝑝 predictors.
▪ For 𝑘 = 𝑝̄, 𝑝̄ − 1, … , 1:
▪ Fit the model with k predictors.
▪ Identify the least useful predictor (smallest “classical” t-stat.), drop it and
call the resulting model 𝑀𝑘−1
▪ Select a single best model from among ℳ0, … , ℳ𝑝̄ using cross-validated
prediction error, 𝐶𝑝, AIC, BIC, or adjusted 𝑅².
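▪ A minimal sketch of the t-stat version (illustrative; same assumptions as the earlier sketches, with BIC for the final pick):

```python
import numpy as np
import statsmodels.api as sm


def backward_stepwise(X, y):
    selected = list(X.columns)
    path = [sm.OLS(y, sm.add_constant(X[selected])).fit()]      # M_p: the full model
    while selected:
        tvals = path[-1].tvalues.drop("const").abs()            # t-stats of current predictors
        selected.remove(tvals.idxmin())                         # drop the least useful predictor
        design = sm.add_constant(X[selected]) if selected else np.ones((len(y), 1))
        path.append(sm.OLS(y, design).fit())                    # this is M_{k-1}
    return min(path, key=lambda f: f.bic)                       # final pick by an IC (or CV)
```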
Backward Stepwise Selection
▪ Instead of fitting 2^𝑃 models, we need only 𝑃 + 1 fits for OLS (with the t-stat shortcut)
and 1 + 𝑝(𝑝 + 1)/2 for logit regression (why?).
▪ Statistical cost: we constrain the search to reduce variance, but perhaps incur
more bias.
▪ Can't be used when 𝑃 > 𝑁
▪ Still not confined just to OLS, can do, e.g., logit regression.
Backward subset selection for ‘Credit’; 𝑝̄ = 8
𝑘 𝟏 𝟐 𝟑 𝟒 𝟓 𝟔 𝟕 𝟖
Variable
Income ∗ ∗ ∗ ∗ ∗ ∗ ∗
Limit ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗
Rating ∗ ∗ ∗ ∗
Cards ∗ ∗ ∗ ∗ ∗
Age ∗ ∗ ∗
Education
Female ∗
Student ∗ ∗ ∗ ∗ ∗ ∗
Married
Asian ∗ ∗
Caucasian
Backward-selection optimal model for 'Credit'
Criterion 𝑘 Test MSE
AIC 6 10155.78
BIC 5 10307.72
10-fold CV 5 10155.78
▪ NB : Alternative variance estimation for AIC/BIC doesn't affect the selection.
High-dimensional setting (𝑃 >> 𝑁)
▪ Standard AIC and BIC are not well suited for selection - they will typically
overfit.
▪ BIC can sometimes be fine if 𝑃 is proportional to 𝑁.
▪ Chen and Chen (2008) propose the extended BIC, formally justified for the high-
dimensional setting: replace log(𝑁) with log(𝑁) + 2log(𝑃) in the penalty, so that
EBIC = −2log(𝐿) + (log(𝑁) + 2log(𝑃)) ⋅ 𝑘.
▪ Wang (2012) shows that forward selection using the above BIC is consistent in
ultra high-dimensional settings under some assumptions (normal
homoscedastic linear regression plus some regularity conditions).
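▪ In code, the modification is a one-line change to the usual likelihood-based BIC (illustrative sketch; fit is a fitted statsmodels model and n_candidates is the number of candidate predictors 𝑃):

```python
import numpy as np


def extended_bic(fit, n_candidates):
    """EBIC = -2 log L + k * (log(N) + 2*log(P)), following Chen and Chen (2008)."""
    k = fit.df_model                  # counting the intercept as well would only shift
    n = fit.nobs                      # every candidate model by the same constant
    return -2 * fit.llf + k * (np.log(n) + 2 * np.log(n_candidates))
```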
Final comments
▪ Nice feature: end up with a single model that is easy to work with and discuss. Most
likely not the true model.
▪ Very interpretable, but careful about claiming causality!
▪ Can do this type of selection with essentially any criterion function (e.g., log-
likelihood for binary choice).
▪ Many flavors of selection methods, not all have theoretical justification. Need to
figure out what works when.
▪ Standard approaches get very slow very fast. New techniques push the envelope
quite a bit though.
▪ Stability w.r.t. the data sample could be an issue - a starkly different model may be
obtained if another sample is considered.