Machine Learning and Deep Learning
with R
Instructor: Babu Adhimoolam
Learning objectives: Subset Selection Methods

Why additional linear methods?

To improve prediction accuracy on the test dataset: when the number of observations (n) is not much larger than the number of predictors (p), the least squares estimates have high variance.

By constraining or shrinking the coefficients associated with the p predictors, we can substantially reduce the variance associated with the test error.
To improve model interpretability: including variables that are not associated with the response adds unnecessary complexity to a model.

By setting the coefficients of variables that do not contribute to the response to zero, we obtain more interpretable models.
Extensions of linear methods

Subset Selection Methods
• Best Subset Selection
• Forward Stepwise Selection
• Backward Stepwise Selection

Shrinkage Methods
• Ridge Regression
• Lasso Regression

Dimensionality Reduction Methods
• Principal Components Regression
• Partial Least Squares
The Best Subset Selection Method
• We fit a least squares regression for each possible combination of the p predictors.
• The total number of possible models with p predictors is 2^p.
• We first start with the null model (M0) containing no predictors, and then compute Mk for each value of k:
for k = 1, 2, …, p:
- fit all models that contain exactly k predictors.
- choose the best of these models (lowest RSS or highest R²) and call it Mk.
• We finally choose the single best model from the list of available models M0, …, Mp.
Application of Best Subset Selection to the Credit data set
Models M1 to Mp
Response: Balance
Predictors: Income, Limit, Rating, Cards, Age, Education, Own, Student, Married, and Region
(Figure: RSS and R² for all subset models; the red line traces the best model within each subset size.)
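A minimal sketch of how this could be run in R, assuming the Credit data from the ISLR2 package and the regsubsets() function from the leaps package (both package choices are assumptions, not stated in the slides):

```r
# Best subset selection on the Credit data (sketch).
library(ISLR2)   # provides the Credit data set (assumed source)
library(leaps)   # provides regsubsets()

# Fit all subsets for the response Balance.
# Region contributes two dummy variables, so there are 11 predictor columns.
best_fit <- regsubsets(Balance ~ ., data = Credit, nvmax = 11)
best_summary <- summary(best_fit)

# RSS and R^2 of the best model of each size (the "red line" in the figure).
best_summary$rss
best_summary$rsq

# Predictors included in the best four-variable model.
coef(best_fit, 4)
```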
Limitations of best subset selection

• Suffers heavily from computational limitations as p becomes larger. Recall that the total number of possible models with p predictors is 2^p.
• If p = 10, 2^10 ≈ 1,000 models to evaluate.
• If p = 20, 2^20 ≈ 1,000,000 models to evaluate.
• p > 40 is computationally infeasible!
• In addition, the large model space invites overfitting: models that fit the training data well may not generalize to high accuracy on test data.
Forward Stepwise Selection
• Forward stepwise selection is a computationally feasible and efficient alternative to best subset selection, as it considers far fewer models than the 2^p models of best subset selection.
• We begin with a model containing no predictors (M0), and then add predictors to the model, one at a time, until all predictors are included.
for k = 0, 1, …, p-1:
- consider all (p − k) models that augment the predictors in Mk with one additional predictor.
- choose the best among these (p − k) models (lowest RSS or highest R²) and call it Mk+1.
• We finally choose the single best model among the list of available models M0, …, Mp.
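A minimal sketch of forward stepwise selection in R, again assuming the leaps package and the Credit data from ISLR2:

```r
library(ISLR2)
library(leaps)

# Forward stepwise selection: method = "forward" adds one predictor at a time.
fwd_fit <- regsubsets(Balance ~ ., data = Credit, nvmax = 11, method = "forward")
fwd_summary <- summary(fwd_fit)

# Best k-variable model found at each step.
fwd_summary$outmat

# Compare the four-variable models chosen by best subset (best_fit, from the
# earlier sketch) and by forward stepwise selection.
coef(best_fit, 4)
coef(fwd_fit, 4)
```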
Computational feasibility of forward stepwise selection

Unlike best subset selection, which involves selecting among 2^p models (with p predictors), forward stepwise selection fits only

1 + p(p + 1)/2 models (the null model plus (p − k) candidate models at each of the p steps).

So, if p = 20, best subset selection must fit approximately 1,048,576 models. In contrast, forward stepwise selection fits only 211 models.
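A quick check of these counts in R:

```r
p <- 20
2^p                  # models fit by best subset selection: 1,048,576
1 + p * (p + 1) / 2  # models fit by forward stepwise selection: 211
```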
Forward stepwise selection does not always find the best model

Note that the best four-variable models differ between best subset selection and forward stepwise selection.
Backward Stepwise Selection
• Unlike best subset selection and forward stepwise selection, here we start with the full model (Mp) containing all the predictors.
• We then iteratively remove the least useful predictor, one at a time.
for k = p, p-1, …, 1:
- consider all k models that contain all but one of the predictors in Mk, for a total of k − 1 predictors.
- choose the best among these k models (lowest RSS or highest R²) and call it Mk-1.
• We then choose the single best model out of M0, …, Mp.
• Backward stepwise selection is computationally similar to forward stepwise selection.
• It requires n > p (so that the full model can be fit by least squares), unlike forward stepwise selection.
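The same regsubsets() sketch works for backward stepwise selection (again assuming the leaps package and the Credit data):

```r
library(ISLR2)
library(leaps)

# Backward stepwise selection: start from the full model and drop one predictor at a time.
# Note: this requires n > p so the full least squares model can be fit.
bwd_fit <- regsubsets(Balance ~ ., data = Credit, nvmax = 11, method = "backward")
summary(bwd_fit)$outmat
```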
Choosing the optimal model among M0, …, Mp

R² and RSS are not good metrics for selecting the best model, because the model containing all the predictors will always have the highest R² and the lowest RSS.

Indirect methods of test error estimation adjust the training error rate to account for the bias due to overfitting.

Direct methods of test error estimation use validation or cross-validation.
Indirect methods for adjusting training error rates
• Cp
• Akaike Information Criterion (AIC)
• Bayesian Information Criterion (BIC)
• Adjusted R²
Cp

For a fitted least squares model with d predictors, the Cp estimate of the test MSE is:

Cp = (1/n)(RSS + 2 d σ̂²), where σ̂² is an estimate of the variance of the error term.

Cp adds a penalty that is proportional to the number of predictors in the model: a model with more predictors incurs a larger penalty.

Cp tends to take a small value for models with low test error, so the model with the lowest Cp is the best candidate.
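A small sketch of computing Cp by hand for a least squares fit in R (the helper name cp_stat and the use of the full model's residual variance as σ̂² are illustrative assumptions):

```r
# Cp = (1/n) * (RSS + 2 * d * sigma2_hat), where sigma2_hat estimates Var(error).
library(ISLR2)

full_fit   <- lm(Balance ~ ., data = Credit)
sigma2_hat <- summary(full_fit)$sigma^2   # error variance estimated from the full model

cp_stat <- function(fit, sigma2_hat) {
  n   <- length(residuals(fit))
  d   <- length(coef(fit)) - 1            # number of predictors (excluding intercept)
  rss <- sum(residuals(fit)^2)
  (rss + 2 * d * sigma2_hat) / n
}

small_fit <- lm(Balance ~ Income + Limit + Cards + Student, data = Credit)
cp_stat(small_fit, sigma2_hat)
```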
Akaike Information Criterion (AIC)

AIC is defined for a large class of models fit by the maximum likelihood approach.

In the case of least squares models it is given (up to irrelevant constants) by:

AIC = (1/(n σ̂²))(RSS + 2 d σ̂²)

AIC and Cp measure the same thing for least squares models and are proportional to each other.

AIC takes smaller values for better models.
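In R, AIC is available directly for models fit by maximum likelihood via the base AIC() function (base R's definition keeps the constants that the proportionality statement above ignores, so the values differ but the ranking of models does not):

```r
# AIC for two candidate least squares models; smaller is better.
library(ISLR2)

fit_small <- lm(Balance ~ Income + Limit + Cards + Student, data = Credit)
fit_full  <- lm(Balance ~ ., data = Credit)

AIC(fit_small)
AIC(fit_full)
```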
Bayesian Information Criterion (BIC)

Like Cp and AIC, the BIC takes a small value for models with low test error and, for a least squares model with d predictors, is given by:

BIC = (1/n)(RSS + log(n) d σ̂²)

Because log(n) > 2 whenever n > 7, the BIC places a heavier penalty than Cp or AIC on models with many variables.
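Base R likewise provides BIC() for fitted models, which applies the log(n) penalty (reusing the fits from the AIC sketch above):

```r
# BIC penalizes model size by log(n); smaller is better.
BIC(fit_small)
BIC(fit_full)
```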
Adjusted R²

For a least squares model with d predictors:

Adjusted R² = 1 − [RSS/(n − d − 1)] / [TSS/(n − 1)]

Maximizing the adjusted R² is equivalent to minimizing RSS/(n − d − 1).

Adding noise variables produces only a very small decrease in RSS while increasing d, so RSS/(n − d − 1) increases and the adjusted R² decreases.

In comparison with the regular R², the adjusted R² therefore accounts for nuisance variables (the model with the largest adjusted R² is preferred).
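A sketch comparing the manual computation with the value reported by lm() (variable names are illustrative):

```r
# Adjusted R^2 = 1 - (RSS / (n - d - 1)) / (TSS / (n - 1))
library(ISLR2)

fit <- lm(Balance ~ Income + Limit + Cards + Student, data = Credit)

n   <- nrow(Credit)
d   <- length(coef(fit)) - 1
rss <- sum(residuals(fit)^2)
tss <- sum((Credit$Balance - mean(Credit$Balance))^2)

1 - (rss / (n - d - 1)) / (tss / (n - 1))   # manual adjusted R^2
summary(fit)$adj.r.squared                  # matches lm's reported value
```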
Optimal model selection in the Credit data set

(Figure: indirect criteria computed for the best model of each size, M1, …, Mp, on the Credit data.)

Low values of Cp, AIC, and BIC, and high values of adjusted R², reveal models with low test error rates.
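With regsubsets() from the earlier sketches, the summary object already carries Cp, BIC, and adjusted R² for the best model of each size, so the selection can look like this (reusing best_fit from the best subset example):

```r
best_summary <- summary(best_fit)

# Model sizes favored by each indirect criterion.
which.min(best_summary$cp)     # lowest Cp
which.min(best_summary$bic)    # lowest BIC
which.max(best_summary$adjr2)  # highest adjusted R^2

# Coefficients of, e.g., the model chosen by BIC.
coef(best_fit, which.min(best_summary$bic))
```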
Choosing the optimal model with validation and cross-validation

Validation and cross-validation methods estimate the test error directly and are generally preferred for model selection over the indirect methods above.
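A minimal sketch of the validation-set approach with regsubsets() (the predict_regsubsets helper is not part of the leaps package; it is written here purely for illustration):

```r
library(ISLR2)
library(leaps)

set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(Credit), replace = TRUE)

# Fit best subset selection on the training half only.
fit_train <- regsubsets(Balance ~ ., data = Credit[train, ], nvmax = 11)

# Helper: predictions from the model of a given size (illustrative, not from leaps).
predict_regsubsets <- function(object, newdata, id, formula) {
  mat  <- model.matrix(formula, newdata)
  cofs <- coef(object, id = id)
  mat[, names(cofs)] %*% cofs
}

# Validation-set MSE for each model size.
val_errors <- sapply(1:11, function(k) {
  pred <- predict_regsubsets(fit_train, Credit[!train, ], k, Balance ~ .)
  mean((Credit$Balance[!train] - pred)^2)
})

which.min(val_errors)   # model size with the lowest validation error
```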