BDA Lecture 11a


Variable selection with projpred

• In your project it is sufficient to compare 2–3 models
• ...but if you are interested in variable selection, then the number
  of potential models is 2^p, where p is the number of variables
• ...in such a case I recommend using brms + projpred
• projpred avoids overfitting in model selection

Use of reference models in model selection

• Background
• First example
• Bayesian and decision theoretical justification
• More examples

Not a novel idea

• Lindley (1968): The choice of variables in multiple regression
  • Bayesian and decision theoretical justification, but simplified
    model and computation
• Goutis & Robert (1998): Model choice in generalised linear
  models: a Bayesian approach via Kullback-Leibler projections
  • one key part for practical computation
• Related approaches
  • gold standard, preconditioning, teacher and student, distilling, . . .
• Motivation in these
  • measurement cost in covariates
  • running cost of the predictive model
  • easier explanation / learning from the model

Example: Simulated regression

f ∼ N(0, 1),       xj | f ∼ N(√ρ f, 1 − ρ),   j = 1, . . . , 150,
y | f ∼ N(f, 1)    xj | f ∼ N(0, 1),          j = 151, . . . , 500.

[Figure: a scatter plot of y against the latent f, followed by scatter plots of
y against individual covariates x[, ?].]

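A minimal R sketch of how this data could be simulated (assuming ρ = 0.5 and
n = 50, the values stated on the later Lasso-comparison slide):

# simulate latent f, target y, 150 relevant and 350 irrelevant covariates
set.seed(1)
n <- 50; p <- 500; p_rel <- 150; rho <- 0.5
f <- rnorm(n)                                   # f ~ N(0, 1)
y <- f + rnorm(n)                               # y | f ~ N(f, 1)
x_rel <- sapply(1:p_rel, function(j) sqrt(rho) * f + sqrt(1 - rho) * rnorm(n))
x_irr <- matrix(rnorm(n * (p - p_rel)), n, p - p_rel)
x <- cbind(x_rel, x_irr)                        # n x p covariate matrix
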
Example: Individual correlations

f ∼ N(0, 1),       xj | f ∼ N(√ρ f, 1 − ρ),   j = 1, . . . , 150,
y | f ∼ N(f, 1)    xj | f ∼ N(0, 1),          j = 151, . . . , 500.

[Figures: |R(xj, y)| plotted against the variable index j and against a
randomized variable index; |R(xj, f)| against a randomized variable index; and
|R(xj, f∗)| against a randomized variable index, where f∗ is a PCA + linear
regression reference fit.]

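A sketch of how these correlations could be computed in R using the simulated
data above; the reference fit f∗ follows the description on the last panel
(3 principal components + linear regression), and the variable names are
illustrative:

r_y <- abs(cor(x, y))                      # |R(xj, y)| for each covariate
r_f <- abs(cor(x, f))                      # |R(xj, f)|, known only in simulation
z <- prcomp(x, scale. = TRUE)$x[, 1:3]     # first 3 principal components
fstar <- fitted(lm(y ~ z))                 # reference predictions f*
r_fstar <- abs(cor(x, fstar))              # |R(xj, f*)|
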
Knowing the latent values would help

[Figure A: sample correlation with y, |R(xj, y)|, vs. sample correlation with
f, |R(xj, f)|; irrelevant and relevant xj shown separately.]

Estimating the latent values with a reference model helps

[Figure: A) sample correlation with y vs. sample correlation with f;
B) sample correlation with y vs. sample correlation with f∗, where f∗ is a
linear regression fit with 3 principal components; irrelevant and relevant xj
shown separately.]

Bayesian justification

• Theory says to integrate over all the uncertainties
  • build a rich model
  • do model checking etc.
  • this model can be the reference model
• Consider model selection as a decision problem
• Replace the full posterior p(θ | D) with some constrained q(θ) so
  that the predictive distribution changes as little as possible
• Example constraints
  • q(θ) can have only a point mass at some θ0
    ⇒ “Optimal point estimates”
  • Some covariates must have exactly zero regression coefficient
    ⇒ “Which covariates can be discarded”
  • Much simpler model
    ⇒ “Easier explanation”

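As a sketch of the underlying objective (following the Kullback-Leibler
projection of Goutis & Robert; the notation here is illustrative): for each
posterior draw θ^(s) of the reference model, the projected draw θ⊥^(s) of the
constrained submodel is, in LaTeX notation,

\theta_\perp^{(s)} = \underset{\theta_\perp}{\arg\min} \;
  \frac{1}{n} \sum_{i=1}^{n}
  \mathrm{KL}\!\left( p(\tilde{y}_i \mid \theta^{(s)}) \,\|\, q(\tilde{y}_i \mid \theta_\perp) \right),

and the collection of projected draws forms the projected posterior q(θ⊥).
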
Logistic regression with two covariates

[Figure sequence: each panel shows the posterior of (β1, β2) on the left and
contours of the predicted class probability over (x1, x2) on the right, for:
the full posterior for β1 and β2; projected point estimates for β1 and β2;
projected point estimates with the constraint β1 = 0; projected point estimates
with the constraint β2 = 0; draw-by-draw projection with the constraint β1 = 0;
and draw-by-draw projection with the constraint β2 = 0.]

Predictive projection

• Replace the full posterior p(θ | D) with some constrained q(θ) so
  that the predictive distribution changes as little as possible
• As the full posterior p(θ | D) is projected to q(θ)
  • the prior is also projected and there is no need to define priors
    for submodels separately
  • even if we constrain some coefficients to be 0, the predictive
    inference is conditioned on the information the related features
    contributed to the reference model
  • solves the problem of how to do the inference after the model
    selection

Projective selection

• How to select a feature combination?
• For a given model size, choose the feature combination with
  minimal projective loss
• Search heuristics, e.g.
  • Monte Carlo search
  • Forward search
  • L1-penalization (as in Lasso)
• Use cross-validation to select the appropriate model size
  • need to cross-validate over the search paths

Projective selection vs. Lasso

Same simulated regression data as before,
n = 50, p = 500, prel = 150, ρ = 0.5

[Figure: mean squared error and log predictive density as a function of the
number of covariates, for Lasso, relaxed Lasso, and the projection, with the
reference model shown for comparison.]

Bodyfat: small p example of projection predictive variable selection

Predict bodyfat percentage. The reference value is obtained by immersing the
person in water. n = 251.

[Figure: correlation matrix of siri (the bodyfat percentage to be predicted)
and the 13 candidate covariates: age, weight, height, neck, chest, abdomen,
hip, thigh, knee, ankle, biceps, forearm, wrist.]

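A sketch of how such a reference model and the variable selection could be set
up in R with rstanarm + projpred (the data frame df and the default priors are
assumptions for illustration, not part of the slides):

library(rstanarm)
library(projpred)
# reference model: all 13 measurements as predictors of bodyfat (siri)
fit <- stan_glm(siri ~ age + weight + height + neck + chest + abdomen + hip +
                  thigh + knee + ankle + biceps + forearm + wrist,
                data = df, family = gaussian(), seed = 1)
# projection predictive variable selection with LOO-CV
vsel <- cv_varsel(fit, method = 'forward', cv_method = 'loo')
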
Bodyfat

Marginal posteriors of coefficients

[Figure: marginal posteriors of the regression coefficients for age, weight,
height, neck, chest, abdomen, hip, thigh, knee, ankle, biceps, forearm, and
wrist.]

Bodyfat

Bivariate marginal of weight and height

[Figure: bivariate marginal posterior of the weight and height coefficients.]

Bodyfat

The predictive performance of the full model and the submodels

[Figure: difference to the baseline in elpd and rmse as a function of the
number of variables in the submodel (0 to 15).]

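With projpred, a plot like this and a suggested submodel size can be obtained
roughly as follows (a sketch using the vsel object from the earlier code):

plot(vsel, stats = c('elpd', 'rmse'), deltas = TRUE)  # difference to reference
suggest_size(vsel)                                    # heuristic size suggestion
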
Bodyfat

Marginals of the reference and projected posterior

[Figure: marginal posteriors of all coefficients in the reference model (left)
and the projected posterior of the selected submodel with abdomen and weight
(right).]

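The projected posterior on the right can be obtained by projecting the
reference model's draws onto the selected submodel, roughly as below (a sketch;
two terms matches the abdomen + weight submodel shown on the slide):

prj <- project(vsel, nterms = 2)       # project onto the 2-term submodel
prj_draws <- as.matrix(prj)            # matrix of projected posterior draws
bayesplot::mcmc_areas(prj_draws)       # marginals of the projected posterior
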
Predictive performance vs. selected variables

• The initial aim: find the minimal set of variables providing similar
  predictive performance as the reference model
• Some keep asking whether it can find the true variables
• What do you mean by true variables?

[Figure: the bodyfat correlation matrix shown next to the projected marginals
of the abdomen and weight coefficients.]

Variability under data perturbation

Comparing projection predictive variable selection (projpred) and stepwise
maximum likelihood (steplm) over bootstrapped datasets

[Figure: bar chart showing, for each variable (abdomen, weight, wrist, height,
age, neck, chest, biceps, thigh, ankle, forearm, hip, knee), how often it was
selected across the bootstrapped datasets, for projpred and steplm.]

• Reduced variability, but in case of noisy finite data, there will be
  some variability under data perturbation
• projpred uses
  • Bayesian inference for the reference
  • The reference model
  • Projection for submodel inference

Multilevel regression and GAMMs

• projpred also supports hierarchical and additive models in brms

Catalina, Bürkner, and Vehtari (2022). Projection predictive inference
for generalized linear and additive multilevel models. Proceedings of
the 24th International Conference on Artificial Intelligence and
Statistics (AISTATS), PMLR 151:4446–4461.
https://proceedings.mlr.press/v151/catalina22a.html

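A sketch of what the hierarchical case could look like with brms + projpred
(the formula, data frame, and grouping variable are illustrative assumptions):

library(brms)
library(projpred)
# reference model with population-level effects and a group-level intercept
fit <- brm(y ~ x1 + x2 + x3 + (1 | group), data = df, family = gaussian())
vsel <- cv_varsel(fit, method = 'forward', cv_method = 'loo')
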
Scaling

• So far the biggest number of variables we’ve tested is 22K
  • 96s for creating a reference model
  • 14s for projection predictive variable selection

Intro paper and brms and rstanarm + projpred examples

• McLatchie, Rögnvaldsson, Weber, and Vehtari (2024). Advances in
  projection predictive inference. Statistical Science.
  https://arxiv.org/abs/2306.15581
• https://mc-stan.org/projpred/articles/projpred.html
• https://users.aalto.fi/~ave/casestudies.html
• Fast and often sufficient if n ≫ p
    varsel <- cv_varsel(fit, method='forward', cv_method='loo',
                        validate_search=FALSE)
• Slower but needed if not n ≫ p
    varsel <- cv_varsel(fit, method='forward', cv_method='kfold', K=10,
                        validate_search=TRUE)
• If p is very big
    varsel <- cv_varsel(fit, method='L1', cv_method='kfold', K=5,
                        validate_search=TRUE)
