Shrinkage techniques
Presentation Capita Selecta
28-10-2014
Prediction model
Aim:
• Predict the outcome for new subjects
Overfitting:
• The data in the sample are described well, but the model is not valid for new subjects
• For new subjects, high predictions turn out too high and low predictions too low
Steyerberg EW. Clinical prediction models, chapter 5 . Springer, 2009
Overfitting
Good performance in sample
Bad performance in new subjects
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/AlexZhao/Overfitting_Cancer_workshop_04162010.pdf
What you see is not what you get
Overfitting
Fitting a model with too many degrees of freedom in the modeling process.
E.g. univariable selection, interactions, transformations, etc.
Solutions:
• Use fewer degrees of freedom
• Increase the degrees of freedom you can use
Steyerberg EW. Clinical prediction models, chapter 5 . Springer, 2009
How many degrees of freedom?
Candidate predictors : events    Reducing overfitting
> 1:10                           Necessary
1:10 – 1:20                      Advisable
1:20 – 1:50                      Not necessary
< 1:50                           Not necessary
http://painconsortium.nih.gov/symptomresearch/chapter_8/sec8/cess8pg2.htm
What is shrinkage?
• Shrinking the regression coefficients towards less extreme values
[Figure: predicted probabilities, no shrinkage vs. after shrinkage]
Moons KGM e.a. J Clin Epidemiol. 2004;57(12):1262-70
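For illustration (hypothetical numbers): with a shrinkage factor of s = 0.89, a regression coefficient of 1.20 becomes 0.89 × 1.20 ≈ 1.07, so the predicted probabilities move towards the average risk.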
How to shrink?
• 4 methods
• Applied during or after estimation of the betas
• Ranging from simple to more difficult
Shrinkage after estimation
• Apply a shrinkage factor to the regression coefficients: β_shrunken = s × β
How to determine s?
• Formula
• Bootstrap
Steyerberg EW. Clinical prediction models, chapter 13. Springer, 2009
Shrinkage after estimation
Formula (van Houwelingen, Le Cessie)
s = (model χ² − df) / model χ² = 0.89
Example is based on dataset with 24 outcomes
van Houwelingen JC, Le Cessie S. Stat Med 1990;9(11):1303-25
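A minimal R sketch of this formula, assuming a logistic regression fitted with glm() on a hypothetical data frame dat (variable names are illustrative):

```r
# Heuristic shrinkage factor of van Houwelingen & Le Cessie:
#   s = (model chi-square - df) / model chi-square
fit <- glm(y ~ x1 + x2 + x3, family = binomial, data = dat)  # hypothetical model
model_chi2 <- fit$null.deviance - fit$deviance  # likelihood-ratio model chi-square
df <- length(coef(fit)) - 1                     # honestly count ALL candidate df
s <- (model_chi2 - df) / model_chi2
s
```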
Shrinkage by formula
Increased shrinkage with:
↓ sample size (= ↓ model χ²)
↑ predictors
• 3 predictors: 0.92
• 5 predictors: 0.89
• 8 predictors: 0.83
Note: estimate the df honestly (count all candidate predictors, including those examined but dropped)!
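For illustration (hypothetical model χ² of 45): with 3 df, s = (45 − 3) / 45 ≈ 0.93, but with 8 df, s = (45 − 8) / 45 ≈ 0.82; every extra degree of freedom spent forces more shrinkage.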
Shrinkage by formula
New intercept
• β_shrunken = s × β
• Calculate the LP: β_s1·x1 + … + β_sn·xn
• Calculate the predicted probabilities
• Re-estimate the intercept so that the sum of the predictions equals the observed # of events
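A minimal sketch in R of the intercept step, assuming the glm fit fit and the shrinkage factor s from the earlier sketch:

```r
# Re-estimate the intercept after shrinking the coefficients, so that the
# sum of the predicted probabilities equals the observed number of events.
b <- coef(fit)
lp_shrunken <- s * (predict(fit, type = "link") - b[1])  # shrunken LP without intercept
refit <- glm(fit$y ~ 1, family = binomial, offset = lp_shrunken)
new_intercept <- coef(refit)[1]
sum(fitted(refit))  # equals the observed number of events
```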
Questions?
Shrinkage by bootstrap
1. Take a bootstrap sample (size = n, with
replacement)
Steyerberg EW. Clinical prediction models, chapter 13. Springer, 2009
Shrinkage by bootstrap
1. Take a bootstrap sample
2. Estimate the regression coefficients in the
bootstrap sample (same selection & estimation strategy)
Steyerberg EW. Clinical prediction models, chapter 13. Springer, 2009
Shrinkage by bootstrap
1. Take a bootstrap sample (size n, drawn with replacement)
2. Estimate the regression coefficients (same
selection & estimation strategy)
3. Calculate linear predictor (β1*x1+…+βn*xn) in original
sample with bootstrapped coefficients.
Steyerberg EW. Clinical prediction models, chapter 13. Springer, 2009
Shrinkage by bootstrap
1. Take a bootstrap sample
2. Estimate the regression coefficients (same selection & estimation strategy)
3. Calculate linear predictor (β1*x1+β2*x2 etc) in original sample with bootstrapped coefficients.
4. Slope of LP: regression with outcome of
patients in original sample and LP as covariable.
[Figure: observed vs. predicted probability in the original sample; slope of the LP = 0.9]
Shrinkage by bootstrap
1. Take a bootstrap sample
2. Estimate the regression coefficients (same selection &
estimation strategy)
3. Calculate linear predictor (β1*x1+β2*x2 etc) in original
sample with bootstrapped coefficients.
4. Slope of LP: regression with outcome of patients in
original sample and LP as covariable.
Repeat steps 1 – 4 (e.g. 200x)
Steyerberg EW. Clinical prediction models, chapter 13. Springer, 2009
Shrinkage by bootstrap
Shrinkage factor: the average slope of the LP over the bootstrap repetitions
Intercept: re-estimated so that the sum of the predictions equals the observed # of events
R output
Example: 24 outcomes, 8 predictors, backward selection.
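For orientation, a minimal base-R sketch of steps 1–4 (the data frame dat, the formula and B = 200 repetitions are hypothetical; the backward selection used in the slide's example is omitted for brevity):

```r
# Bootstrap estimate of the shrinkage factor (calibration slope of the LP).
set.seed(1)
form <- y ~ x1 + x2 + x3            # hypothetical model formula
B <- 200
slopes <- numeric(B)
for (b in 1:B) {
  boot  <- dat[sample(nrow(dat), replace = TRUE), ]     # 1. bootstrap sample (size n)
  fit_b <- glm(form, family = binomial, data = boot)    # 2. estimate betas (repeat full strategy!)
  lp    <- predict(fit_b, newdata = dat, type = "link") # 3. LP in the original sample
  slopes[b] <- coef(glm(dat$y ~ lp, family = binomial))[2]  # 4. slope of the LP
}
s <- mean(slopes)   # shrinkage factor = average slope
```

In practice the validate() function of the R package rms reports this calibration slope directly and can repeat a backward selection in every bootstrap sample.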
Questions?
Shrinkage during estimation
Penalized maximum likelihood
- Normally: maximum likelihood estimation
- Penalized ML: maximize log L − 0.5 · penalty factor · Σ (β_scaled)²
- Penalty factor chosen by optimizing the AIC
(related to the model χ² & # of predictors)
- Trial & error
Moons KGM e.a. J Clin Epidemiol. 2004;57(12):1262-70
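A minimal sketch with the R package rms (data frame dat and formula are hypothetical); lrm() maximizes this penalized log-likelihood, and pentrace() tries a grid of penalty factors and reports the one with the best (corrected) AIC:

```r
library(rms)
fit <- lrm(y ~ x1 + x2 + x3, data = dat, x = TRUE, y = TRUE)  # ordinary ML fit
ptr <- pentrace(fit, seq(0, 20, by = 1))   # trial & error over a grid of penalty factors
ptr$penalty                                # penalty factor with the best corrected AIC
fit_pen <- update(fit, penalty = ptr$penalty)  # refit with penalized ML
```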
Penalty factor
[Figure: AIC as a function of the penalty factor; values 8 and 24 marked in the plot]
Steyerberg EW. Clinical prediction models, chapter 13. Springer, 2009
Questions?
Shrinkage for selection
Lasso
- Shrinkage + selection
- Some predictors are set to zero
- Constraint: the sum of |β_standardized| is kept below a maximum
- How to determine this maximum?
Trial & error (e.g. cross validation/AIC)
Steyerberg EW e.a. Stat Neerl 2001;55:76-88
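A minimal sketch with the R package glmnet (predictor matrix x and 0/1 outcome y are hypothetical); cv.glmnet() chooses the amount of shrinkage by cross-validation:

```r
library(glmnet)
# Lasso (alpha = 1): shrinkage + selection; predictors are standardized internally.
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
coef(cvfit, s = "lambda.min")  # shrunken coefficients; some are exactly zero (deselected)
```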
Questions?
Result of shrinkage
Steyerberg EW. Clinical prediction models, chapter 13. Springer, 2009
Differences in amount of shrinkage
Shrinking afterwards
• Same shrinkage factor for every predictor
PML
• Different shrinkage for different predictors
• Unstable predictors are shrunk the most
Lasso
• Also selection of predictors (less is more)
Differences in performance
Case study by Steyerberg et al.
• GUSTO-I study
• Mortality at 30 days
• 8 predictor model
• Subsample: 336 patients, 20 cases
• Total sample: 20512 patients, 1435 cases
Steyerberg EW e.a. Stat Neerl 2001;55:76-88
Differences in obtained betas
Steyerberg EW e.a. Stat Neerl 2001;55:76-88
[Figure: differences in obtained betas; values 1.09, 1.01, 0.90 and 0.68 marked in the plot]
Steyerberg EW e.a. Stat Neerl 2001;55:76-88
Differences in performance
Steyerberg EW e.a. Stat Neerl 2001;55:76-88
Question
Can you explain why shrinkage methods make predictions more reliable, improving calibration but hardly affecting discrimination?
Conclusion case study
Shrinkage in small datasets
• No shrinkage: poor performance
• No/minor advantage on discrimination
• Major improvement of calibration
• No major differences between methods
Steyerberg EW e.a. Stat Neerl 2001;55:76-88
Software differences
Shrinking afterwards (relatively easy)
• Formula: all software
• Bootstrapping: SAS/Stata/R
PML & Lasso (advanced)
• Only in R
Questions?
Conclusion
Shrinkage is necessary in small datasets
or with many predictors
(like studies on molecular markers)
But…
Clinical practice
Vickers e.a. Cancer 2008;112(8):1862-8
Conclusion
Shrinkage is necessary in small datasets
• When? More than 1 candidate predictor per 20 events
• How? Different methods available
• All result in better performance
• Techniques: simple to difficult
Reminder: make fewer data-driven decisions
Questions?
Differences in performance
Simulation study by Vach et al.
Lasso
• Overcorrects larger effects
• Cautious in removing variables
Formula
• Good estimation of shrinkage
• No variable selection
Vach K e.a. Stat Neerl 2001;55:53-75