Stat 422/722, Class 2

Richard Waterman

The Wharton School

Fall 2019

Table of Contents

1 Last time

2 Today’s class

3 Review

4 JMP tasks

Last time

What is a linear regression? A linear model for the conditional mean.


The least squares criteria.
The standard regression assumptions.
Predictions can be viewed as weighted linear combinations of the y_i.
Leverage as measured by the hats (h_ii).
Hypothesis testing.

Today’s class

Introduce stepwise regression.


Illustrate the JMP platform.
Identify the BIG problem – overfitting.
Alternative stopping rules (in addition to p-values)
1 RMSE
2 Adjusted R²
3 Mallows’ Cp
4 Akaike Information Criterion (AIC)
5 Bayesian Information Criterion (BIC)
Comparing the RMSE in and out-of-sample.

The Apple data set

The goal: based on data available today, predict Apple's return tomorrow.
We have 106 data points, 42 are held out, October onward.
Variables: 10 other stocks (GOOG, INTC, etc.).
These features provide 41 variables to choose from: price, volume,
number of trades and returns.
Even just considering main effects (no interactions or squares) there are 2^41 possible models. That is 2,199,023,255,552 (two trillion) different models to explore.
Including all possible interactions and squares there are

    41 (main effects) + 41 (squares) + (1/2) × 41 × 40 (interactions) = 902 terms,

and hence 2^902 ≈ 3.381 × 10^271 possible models. (For scale, the number of atoms in the universe is estimated at 10^80.)
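A quick way to double-check these counts (a small Python sketch; every number is taken from the slide above):

    from math import comb, log10

    p = 41                                  # candidate main effects
    print(2 ** p)                           # 2199023255552 models using main effects only

    terms = p + p + comb(p, 2)              # main effects + squares + pairwise interactions
    print(terms)                            # 902

    print(902 * log10(2))                   # ~271.5, so 2**902 is roughly 3.4 x 10**271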
Heuristic

So it is a big problem. A big space of models to explore – the so-called curse of dimensionality.
We bring a heuristic to the problem.
Merriam-Webster: involving or serving as an aid to learning, discovery, or problem-solving by experimental and especially trial-and-error methods.
Then follow our natural instinct: iterative model fitting. Find the
single best variable. Given this variable, find the second best. Given
these two, find the third best and so on.
The essence of stepwise regression, the original automated model
selection tool.
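A minimal sketch of that add-one-variable-at-a-time heuristic with a p-value threshold (Python with statsmodels; a toy illustration, not the JMP implementation; X is assumed to be a pandas DataFrame of candidate predictors and y the response):

    import statsmodels.api as sm

    def forward_stepwise(X, y, p_enter=0.05):
        """Greedy forward selection: at each step add the candidate whose
        t-test p-value is smallest, and stop when no candidate beats p_enter."""
        selected, candidates = [], list(X.columns)
        while candidates:
            pvals = {}
            for var in candidates:
                design = sm.add_constant(X[selected + [var]])
                pvals[var] = sm.OLS(y, design).fit().pvalues[var]
            best = min(pvals, key=pvals.get)
            if pvals[best] > p_enter:       # stopping rule: p-value threshold
                break
            selected.append(best)
            candidates.remove(best)
        return selected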

Elements of stepwise

Choose a direction to make a path through the big model space:


1 Forward selection.
2 Backwards elimination.
3 Forwards and backwards = mixed.
Note: when there is collinearity these three approaches will not necessarily identify the same model.
A rule for variable selection.
1 Add step: add the variable with the lowest p-value (equivalently, the one that moves R² up the most).
2 Removal step: remove the variable with the highest p-value (the one that moves R² down the least).
A rule for stopping.
1 P-Value Threshold
2 AICc
3 BIC
Details: rules for treating categorical variables and interactions.

Choosing the variables to offer to stepwise

The stepwise dialog

The stepwise elements

Variable selection: the variable that makes R² go up the most (or down the least).
Direction: mixed
Stopping rule: P-Value threshold
Rules: No rules!

Output from the chosen model

All variables highly significant. R² = 54%. RMSE = 0.012573. The initial raw standard deviation of the returns was 0.0178048. So RMSE is

    0.012573 / 0.0178048 = 70.616%

of the initial unexplained variation. The model looks good by most standards.
Out-of-sample prediction

Disaster strikes! A plot of the absolute forecast error, both in- and out-of-sample.

Overfitting and the in and out-of-sample RMSE

This is overfitting. The big danger of greedy algorithms run amok.


The model actually performs much worse out-of-sample than the
in-sample summaries suggest.
We will see how to mitigate this in the next class.
In-sample RMSE = 0.012573.
Out-of-sample RMSE = 0.0623.
A 500% inflation factor.
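A small sketch of the comparison (Python; the two RMSE values are from the slides, the helper is generic):

    import numpy as np

    def rmse(y_true, y_pred):
        """Root mean squared prediction error."""
        err = np.asarray(y_true) - np.asarray(y_pred)
        return float(np.sqrt(np.mean(err ** 2)))

    rmse_in, rmse_out = 0.012573, 0.0623    # values from the slide
    print(rmse_out / rmse_in)               # about 5x, i.e. the ~500% inflation factor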

Comments on stepwise regression

You can use stepwise after a hand-crafted model has been made to
make sure nothing has been overlooked.
Stepwise can’t find variables unless you offer them to it!
Stepwise can’t think about transformations and normalization.
Stepwise can’t help in interpretation.
Stepwise looks one step ahead. It is a greedy algorithm; that is, one
that makes locally optimal decisions in the hope that it comes close
to a globally optimal one. You could imagine looking over pairs of
variables or triplets, rather than one at a time. Kasparov looked 3-5
moves ahead in chess and sometimes as many as 12. Stepwise looks
one step ahead!
Use the Center Polynomials option to reduce collinearity.

Use stepwise as a validation/exploratory tool, not as the only approach.

Stopping criteria other than the p-value cut-off

K.I.S.S. = Occam's razor = Parsimony

Among competing theories that equally well explain the observations, choose the one that is simplest.
Comparing R² (the same as minimizing the Sum of Squared Errors [SSE]) across models doesn't capture the idea of simplicity.
Neither does RMSE. Two models can have the same RMSE, but that doesn't distinguish between the complexity of the models.
Unlike regular R², Adjusted R² doesn't have to increase with additional variables, so it looks like a better choice, but

    Adjusted R² = 1 − RMSE² / s_y²,

so maximizing Adjusted R² is equivalent to minimizing RMSE.
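A quick numerical check of that identity (Python/statsmodels on simulated data, not the Apple set):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n, k = 200, 3
    X = rng.normal(size=(n, k))
    y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

    fit = sm.OLS(y, sm.add_constant(X)).fit()
    rmse = np.sqrt(fit.ssr / (n - k - 1))   # RMSE with the degrees-of-freedom adjustment
    s_y2 = y.var(ddof=1)                    # sample variance of y
    print(1 - rmse**2 / s_y2)               # agrees with...
    print(fit.rsquared_adj)                 # ...the reported Adjusted R²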


We need a new idea.
Explicitly incorporating complexity in the model selection
criterion
Rather than choosing the model with the smallest sum of squares error (SSE), you can penalize more complex models directly through the number of variables included (k is the number of variables in the model and σ̂² is an estimate of the variance of the errors ε_i).
1 Mallows' Cp = SSE_k/σ̂² − n + 2k.
2 Akaike Information Criterion: AIC(k) ∝ SSE_k/σ̂² + 2k.
3 Bayesian Information Criterion: BIC(k) ∝ SSE_k/σ̂² + log(n) k (all three are sketched in code below).
With normally distributed error terms Cp and AIC are equivalent.
BIC penalizes complexity more than AIC (when log(n) > 2) so prefers
smaller models.
These are the other Stopping Rules in the stepwise dialog.
These stopping rules are more appropriate when the goal is model
selection for prediction.
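A small sketch of the three penalties (Python; it assumes σ̂² comes from some full-model fit, one common convention):

    import numpy as np

    def selection_criteria(sse_k, k, n, sigma2_hat):
        """Complexity-penalized fit measures; smaller is better for each."""
        cp  = sse_k / sigma2_hat - n + 2 * k
        aic = sse_k / sigma2_hat + 2 * k             # up to an additive constant
        bic = sse_k / sigma2_hat + np.log(n) * k     # heavier penalty once log(n) > 2
        return cp, aic, bic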

Interpreting the model selection criteria through t-stats

We add a variable to the model if the increased complexity (k goes up) is appropriately offset by a smaller unexplained variation (SSE).
One can show that a new variable is added to an existing model if:

Criterion       Approx |t| cut-off    Equiv p-value   Goal
Adjusted R²     |t| > 1               0.33            Minimize RMSE
Cp / AIC        |t| > √2              0.16            Achieve an unbiased estimate of prediction accuracy
BIC             |t| > √(log(n))       Depends on n    Something Bayesian!

Recall that in standard hypothesis testing a significant t is one such that |t| > 2. The value 2 is arbitrary and is chosen to control the Type I error rate at α = 0.05.
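The equivalent p-values in the table are two-sided tail areas at each cut-off; a quick check with the normal approximation (Python/scipy) gives roughly 0.32 and 0.16, close to the table's values:

    from math import sqrt, log
    from scipy.stats import norm

    for label, cutoff in [("Adjusted R^2", 1.0),
                          ("Cp / AIC", sqrt(2)),
                          ("BIC, n = 100", sqrt(log(100)))]:
        p = 2 * (1 - norm.cdf(cutoff))      # two-sided tail probability
        print(f"{label:14s} |t| > {cutoff:.2f} -> p ~ {p:.2f}")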

The reason to adjust for complexity
Statistics estimates parameters through optimization – typically by
making something as small as possible. In particular, in regression, by
making the Sums of Squares Error (SSE) as small as possible.
But we end up using the data twice: once to estimate the parameters and then again to judge the quality of fit (RMSE = √(SSE/(n − p))).
Hence it provides an over-optimistic view of what happens in practice.
Penalizing the fitting criterion by the number of parameters in the model is an explicit way to mitigate this over-optimism.
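A minimal simulation of that over-optimism (Python/numpy; a pure-noise response regressed on many junk predictors, so any apparent fit is spurious):

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 100, 40
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)                  # true model: pure noise, sigma = 1

    Xc = np.column_stack([np.ones(n), X])
    beta = np.linalg.lstsq(Xc, y, rcond=None)[0]
    resid = y - Xc @ beta
    print(np.sqrt(np.mean(resid ** 2)))     # clearly below 1: in-sample fit looks better than it is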
Recommendation: use AIC when you are looking for a predictive
model that is trying to get close to an unobtainable complex truth.
That is, you see your final model as an approximation to the truth.
Choose the model with the lowest AIC. But AIC will only be reliable if
n >> k.
Recommendation: use BIC, when you are searching for the right
model, within a set of predefined models, and you believe that the
truth lies in the set under consideration.
Categorical variables in stepwise

JMP creates two-level contrasts of the categorical variables. That is, it buckets the multi-level categorical variable into a set of two-level comparisons.
There is no reason why these two-level comparisons should be easily
interpretable.
Example: After fitting a stepwise model for GP1000M City with a
categorical variable (Transmission) and weight and horsepower,
run the model and then look at the prediction formula to see the
coding.
If a categorical is selected by stepwise, you could choose to put the
entire variable into the final model, or you could create an
interpretable recoding or even use the contrasts chosen by stepwise
itself.

Illustration of the coding for the Transmission variable

JMP will use a {+1,-1,0} coding by default, and not the {+1,0} dummy
variable coding scheme we saw in Stat 613/621.

Transmission is a four-level categorical with levels {A, AS, AV, M}.

Interpretation of the categorical variable parameters

JMP notation: “&” means to combine the categories and “-” means to
compare/contrast them.

Contrast        A    AS   AV    M
AV&M - A&AS    -1    -1   +1   +1
AV - M          0     0   +1   -1
A - AS         +1    -1    0    0

Representation of the categorical variable in the data table

Adding the variable Transmission{AV - M} to the model adds a parameter estimate. Its value is -4.37. So the AV's forecast changes by -4.37, the M's changes by +4.37, and the A's and AS's change by 0.

Stepwise regression review

The need for tools like stepwise – the space of all models is typically
too big to exhaustively explore.
The mechanics of stepwise: stopping rules, variable selection criteria.
The big issue with stepwise: over-fitting.
The information criteria that penalize complexity.
How JMP treats categoricals in stepwise.

JMP tasks

Make sure you can:


1 Use the model dialog to include all interactions and squares as
potential model effects.
2 Run the stepwise tool.

Creating all interactions and squared terms

In the Fit Model dialog:

Select the X-variables of interest.
In Model Effects, go to Macros.
Choose Response Surface.

Stepwise dialog

JMP stepwise

After having selected the X-variables in the Fit Model dialog:
In the Personality menu choose Stepwise.
Check Keep dialog open and click Run.
For Stopping Rule choose P-value Threshold.
Enter Prob To Enter and Prob to Leave.
For Direction choose Mixed.
For Rules choose No rules.
Click Step to step through the variables.

