STAT 535: Statistical Machine Learning Autumn 2019
Lecture 7: Model Selection and Prediction
Instructor: Yen-Chi Chen
Useful references:
• Chapter 7 of The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
• Chapter 13 (especially Section 13.6) of All of Statistics by Larry Wasserman.
• Section 5.3 of All of Nonparametric Statistics by Larry Wasserman.
7.1 Introduction
Suppose that we observe $X_1, \cdots, X_n$ from an unknown distribution function, and assume that the data come from a parametric model $p_\theta$ with $\theta \in \Theta \subset \mathbb{R}^d$.
Let
$$\hat\theta_n = \mathop{\mathrm{argmax}}_\theta L(\theta|X_1,\cdots,X_n) = \mathop{\mathrm{argmax}}_\theta \ell(\theta|X_1,\cdots,X_n)$$
be the MLE. For simplicity, we denote
$$\ell_n = \ell(\hat\theta_n|X_1,\cdots,X_n)$$
as the maximal value of the empirical log-likelihood function, and we define
$$\bar\ell(\theta) = \mathbb{E}(\ell(\theta|X_1))$$
as the population log-likelihood function. The population log-likelihood function defines the population MLE
$$\theta^* = \mathop{\mathrm{argmax}}_\theta \bar\ell(\theta).$$
7.2 AIC: Akaike information criterion
The AIC is an information criterion that is commonly used for model selection. For model selection, the AIC proposes the following criterion:
$$\mathrm{AIC} = 2d - 2\ell_n,$$
where $d$ is the dimension of the model.
The idea of the AIC is to adjust the empirical risk to be an unbiased estimator of the true risk in a parametric model. Under a likelihood framework, the loss function is the negative log-likelihood, so the empirical risk is
$$\hat R_n(\hat\theta_n) = -\ell_n = -\ell(\hat\theta_n|X_1,\cdots,X_n) = -\ell_n(\hat\theta_n).$$
On the other hand, the true risk of the MLE is
$$R(\hat\theta_n) = \mathbb{E}(-n\bar\ell(\hat\theta_n)).$$
Note that we multiply by $n$ to reflect the fact that the empirical risk was not divided by $n$.
To derive the AIC, we examine the asymptotic bias of the empirical risk $\hat R_n(\hat\theta_n)$ relative to the true risk $R(\hat\theta_n)$.
Analysis of true risk. To analyze the true risk $R(\hat\theta_n)$, we first investigate the asymptotic behavior of $\bar\ell(\hat\theta_n)$ around $\theta^*$:
$$\bar\ell(\hat\theta_n) \approx \bar\ell(\theta^*) + (\hat\theta_n - \theta^*)^T \underbrace{\nabla\bar\ell(\theta^*)}_{=0} + \frac{1}{2}(\hat\theta_n - \theta^*)^T \underbrace{\nabla\nabla\bar\ell(\theta^*)}_{=-I(\theta^*)}(\hat\theta_n - \theta^*) = \bar\ell(\theta^*) - \frac{1}{2}(\hat\theta_n - \theta^*)^T I(\theta^*)(\hat\theta_n - \theta^*).$$
Thus, the true risk is
$$R(\hat\theta_n) = -n\mathbb{E}(\bar\ell(\hat\theta_n)) \approx -n\bar\ell(\theta^*) + \frac{n}{2}\mathbb{E}\left((\hat\theta_n - \theta^*)^T I(\theta^*)(\hat\theta_n - \theta^*)\right). \tag{7.1}$$
Analysis of expected empirical risk. For the expected empirical risk, we first expand $\ell_n$ as follows:
$$\ell_n = \sum_{i=1}^n \ell(\hat\theta_n|X_i) \approx \sum_{i=1}^n \ell(\theta^*|X_i) + \underbrace{(\hat\theta_n - \theta^*)^T \sum_{i=1}^n \nabla\ell(\theta^*|X_i)}_{=(I)} + \underbrace{\frac{1}{2}(\hat\theta_n - \theta^*)^T \sum_{i=1}^n \nabla\nabla\ell(\theta^*|X_i)(\hat\theta_n - \theta^*)}_{=(II)}. \tag{7.2}$$
The expectation of the first quantity is $n\bar\ell(\theta^*)$, which agrees with the first term in the true risk, so all we need is to understand the behavior of the remaining two quantities.
For quantity (I), using the fact that $\sum_{i=1}^n \nabla\ell(\hat\theta_n|X_i) = 0$,
$$\sum_{i=1}^n \nabla\ell(\theta^*|X_i) = \sum_{i=1}^n \nabla\left(\ell(\theta^*|X_i) - \ell(\hat\theta_n|X_i)\right) \approx \left(\sum_{i=1}^n \nabla\nabla\ell(\theta^*|X_i)\right)(\theta^* - \hat\theta_n) \approx n\underbrace{\nabla\nabla\mathbb{E}(\ell(\theta^*|X_1))}_{=-I(\theta^*)}(\theta^* - \hat\theta_n) = nI(\theta^*)(\hat\theta_n - \theta^*).$$
Thus,
$$(I) \approx n(\hat\theta_n - \theta^*)^T I(\theta^*)(\hat\theta_n - \theta^*).$$
For quantity (II), note that
$$\frac{1}{n}\sum_{i=1}^n \nabla\nabla\ell(\theta^*|X_i) \approx \nabla\nabla\mathbb{E}(\ell(\theta^*|X_1)) = -I(\theta^*),$$
so
$$(II) = \frac{1}{2}(\hat\theta_n - \theta^*)^T \sum_{i=1}^n \nabla\nabla\ell(\theta^*|X_i)(\hat\theta_n - \theta^*) \approx -\frac{n}{2}(\hat\theta_n - \theta^*)^T I(\theta^*)(\hat\theta_n - \theta^*).$$
Combining (I) and (II) into equation (7.2) and taking the expectation, we obtain
$$\mathbb{E}(\hat R_n(\hat\theta_n)) = -\mathbb{E}(\ell_n) \approx -n\bar\ell(\theta^*) - \frac{n}{2}\mathbb{E}\left((\hat\theta_n - \theta^*)^T I(\theta^*)(\hat\theta_n - \theta^*)\right).$$
Comparing this to equation (7.1), we obtain
$$\mathbb{E}(\hat R_n(\hat\theta_n)) - R(\hat\theta_n) \approx -n\mathbb{E}\left((\hat\theta_n - \theta^*)^T I(\theta^*)(\hat\theta_n - \theta^*)\right).$$
Note that by the theory of the MLE, one can show that
$$\sqrt{n}(\hat\theta_n - \theta^*) \overset{d}{\to} N(0, I^{-1}(\theta^*)),$$
so
$$n(\hat\theta_n - \theta^*)^T I(\theta^*)(\hat\theta_n - \theta^*) \overset{d}{\to} \chi^2_d,$$
which implies that
$$n\mathbb{E}\left((\hat\theta_n - \theta^*)^T I(\theta^*)(\hat\theta_n - \theta^*)\right) \to d.$$
(Formally, we need a few more conditions here; convergence in distribution alone is not enough for convergence in expectation.)
Thus, to make the empirical risk an asymptotically unbiased estimator of the true risk $R(\hat\theta_n)$, we modify it to
$$\hat R_n(\hat\theta_n) + d = -\ell_n + d.$$
Multiplying this quantity by 2, we obtain the AIC:
$$\mathrm{AIC} = 2d - 2\ell_n.$$
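As a concrete illustration, here is a minimal Python sketch of using the AIC to compare two candidate parametric models; the simulated data, the candidate families, and the use of scipy's generic fit routines are all illustrative assumptions rather than part of the lecture.
```python
import numpy as np
from scipy import stats

def aic(loglik, d):
    # AIC = 2d - 2*ell_n, where ell_n is the maximized log-likelihood.
    return 2 * d - 2 * loglik

# Simulated data (assumed setup): the true model is a gamma distribution.
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.5, size=500)

# Model 1: exponential, d = 1 free parameter (the scale; location fixed at 0).
_, scale1 = stats.expon.fit(x, floc=0)
ll1 = stats.expon.logpdf(x, scale=scale1).sum()

# Model 2: gamma, d = 2 free parameters (shape and scale; location fixed at 0).
a2, _, scale2 = stats.gamma.fit(x, floc=0)
ll2 = stats.gamma.logpdf(x, a2, scale=scale2).sum()

print("AIC exponential:", aic(ll1, d=1))
print("AIC gamma      :", aic(ll2, d=2))   # the smaller AIC is preferred
```
Since the data are drawn from a gamma distribution, the gamma model should typically attain the smaller AIC despite its extra parameter.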
Remark. From the derivation of the AIC, we see that its goal is to adjust the empirical risk so that we are comparing unbiased estimates of the true risks across different models. Thus, the model selected by minimizing the AIC can be viewed as the model selected by minimizing unbiased estimates of the true risks. From the risk minimization point of view, this amounts to making a prediction using a good risk estimator. Thus, some people would comment that the AIC is designed to choose a model that makes good predictions.
7.3 BIC: Bayesian information criterion
Another common approach for model selection is the BIC:
$$\mathrm{BIC} = d\log n - 2\ell_n,$$
where again $d$ denotes the dimension of the model.
Here is the derivation of the BIC. In the Bayesian setting, we place a prior $\pi(m)$ over all possible models and, within each model, a prior $\pi(\theta|m)$ over the parameters. The BIC is a Bayesian criterion, which means that we select the model according to the posterior distribution of each model $m$. Namely, we will try to derive $\pi(m|X_1,\cdots,X_n)$.
By Bayes rule, we have
$$\pi(m|X_1,\cdots,X_n) = \frac{\pi(m, X_1,\cdots,X_n)}{p(X_1,\cdots,X_n)} \propto p(X_1,\cdots,X_n|m)\pi(m),$$
so all we need is to derive the marginal density under a model, $p(X_1,\cdots,X_n|m)$.
With a prior $\pi(\theta|m)$, this marginal density can be written as
$$p(X_1,\cdots,X_n|m) = \int p(X_1,\cdots,X_n|\theta, m)\pi(\theta|m)\,d\theta. \tag{7.3}$$
Suppose that the model $m$ is correct in the sense that under model $m$, there exists $\theta^* \in \Theta$ such that the data are indeed generated from $p(x|\theta^*)$.
Using the log-likelihood function, we can write
$$p(X_1,\cdots,X_n|\theta, m) = e^{\ell(\theta|X_1,\cdots,X_n,m)} = e^{\sum_{i=1}^n \ell(\theta|X_i,m)}. \tag{7.4}$$
Asymptotically, the log-likelihood function can be further expanded around the MLE $\hat\theta_n$ (where its gradient vanishes):
$$\sum_{i=1}^n \ell(\theta|X_i, m) = \underbrace{\sum_{i=1}^n \ell(\hat\theta_n|X_i, m)}_{=\ell_n} + (\theta - \hat\theta_n)^T \underbrace{\sum_{i=1}^n \nabla\ell(\hat\theta_n|X_i, m)}_{=0} + \frac{1}{2}(\theta - \hat\theta_n)^T \sum_{i=1}^n \nabla\nabla\ell(\hat\theta_n|X_i, m)(\theta - \hat\theta_n) + \text{small terms}$$
$$= \ell_n + \frac{n}{2}(\theta - \hat\theta_n)^T \underbrace{\frac{1}{n}\sum_{i=1}^n \nabla\nabla\ell(\hat\theta_n|X_i, m)}_{\approx -I(\theta^*)}(\theta - \hat\theta_n) + \text{small terms},$$
where $I(\theta^*)$ is the Fisher information matrix. Plugging this into equation (7.4) and ignoring the remainder terms, we obtain
$$p(X_1,\cdots,X_n|\theta, m) \approx e^{\ell_n - \frac{n}{2}(\theta - \hat\theta_n)^T I(\theta^*)(\theta - \hat\theta_n)}. \tag{7.5}$$
Thus, equation (7.3) can be rewritten as
$$p(X_1,\cdots,X_n|m) \approx e^{\ell_n} \int e^{-\frac{n}{2}(\theta - \hat\theta_n)^T I(\theta^*)(\theta - \hat\theta_n)}\pi(\theta|m)\,d\theta. \tag{7.6}$$
To compute the above integral, consider a random vector $Y \sim N(\hat\theta_n, \frac{1}{n}I^{-1}(\theta^*))$. The expectation
$$\mathbb{E}(\pi(Y|m)) = \left(\frac{n}{2\pi}\right)^{d/2}\det{}^{1/2}(I(\theta^*))\int e^{-\frac{n}{2}(y - \hat\theta_n)^T I(\theta^*)(y - \hat\theta_n)}\pi(y|m)\,dy \approx \pi(\hat\theta_n|m)$$
when $n \to \infty$, since the distribution of $Y$ concentrates around $\hat\theta_n$. This implies that
$$\int e^{-\frac{n}{2}(\theta - \hat\theta_n)^T I(\theta^*)(\theta - \hat\theta_n)}\pi(\theta|m)\,d\theta \approx \left(\frac{2\pi}{n}\right)^{d/2}\det{}^{-1/2}(I(\theta^*))\,\pi(\hat\theta_n|m).$$
Putting this into equation (7.6), we conclude that the Bayesian evidence is
$$p(X_1,\cdots,X_n|m) \approx e^{\ell_n}\left(\frac{2\pi}{n}\right)^{d/2}\det{}^{-1/2}(I(\theta^*))\,\pi(\hat\theta_n|m),$$
so the log evidence is
$$\log p(X_1,\cdots,X_n|m) \approx \ell_n - \frac{d}{2}\log n + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log\det(I(\theta^*)) + \log\pi(\hat\theta_n|m).$$
The only quantities that grow with the sample size $n$ are the first two terms, so after multiplying by $-2$ and keeping only these two dominating terms, we obtain
$$\mathrm{BIC} = d\log n - 2\ell_n.$$
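The only difference between the two criteria is the penalty: $2d$ for the AIC versus $d\log n$ for the BIC, so once $n > e^2 \approx 7.4$ the BIC penalizes model dimension more heavily. A minimal sketch, with made-up log-likelihood numbers purely for illustration:
```python
import numpy as np

def aic(loglik, d):
    return 2 * d - 2 * loglik

def bic(loglik, d, n):
    return d * np.log(n) - 2 * loglik

# A hypothetical model with d = 5 parameters and maximized log-likelihood
# -120 on n = 1000 observations:
print(aic(-120.0, d=5))          # 2*5 + 240 = 250.0
print(bic(-120.0, d=5, n=1000))  # 5*log(1000) + 240 ≈ 274.5
```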
Remark. Although the BIC leads to a criterion similar to the AIC, the reasoning is somewhat different. In the construction of the BIC, the effect of the priors is ignored, since we are working in the limiting regime, but we still use the Bayesian evidence as a model selection criterion: we select the model with the highest evidence. When the data are indeed generated from one of the models in the collection we are choosing from, the posterior will concentrate on this correct model, so the BIC will eventually select it. Therefore, some people would argue that unlike the AIC, which chooses the best predictive model, the BIC attempts to select the true model if it exists in the model set.
7.4 Information criterion in regression
Using an information criterion in a regression problem is very common in practice. However, there is a caveat in the construction. The information criteria we derived are based on a likelihood framework, but the regression problem is often formulated as an empirical risk minimization, e.g., the least squares approach. So we need to associate the likelihood function with the loss function used in the regression problem to properly construct a model selection criterion. Here we give one example using the least squares approach under a linear regression model.
Let $(X_1, Y_1), \cdots, (X_n, Y_n)$ be the observed data such that $X_i \in \mathbb{R}^d$ and $Y_i \in \mathbb{R}$. A regression model associates $X_i$ and $Y_i$ via
$$Y_i = X_i^T\beta + \epsilon_i,$$
where the $\epsilon_i$'s are the noise. To associate the least squares method with a likelihood framework, we assume that $\epsilon_i \sim N(0, \sigma^2)$. Under this framework, one can easily show that the MLE of $\beta$ and the least squares estimate are the same.
What will the likelihood function look like in this case? For simplicity, we consider the fixed design, where the $X_i$'s are non-random and the only random quantities are the noise $\epsilon_i$'s. Because $\epsilon_i = Y_i - X_i^T\beta$, the log-likelihood function is
$$\ell(\beta, \sigma^2|X_i, Y_i) = \log p_\epsilon(Y_i - X_i^T\beta; \sigma^2) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(Y_i - X_i^T\beta)^2.$$
Let $\hat\beta$ be the least squares estimate (MLE) and $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n e_i^2$ be the MLE of the noise level, where $e_i = Y_i - X_i^T\hat\beta$ is the residual.
The maximized log-likelihood under the MLE/least squares estimate is
$$\ell_n = \sum_{i=1}^n \ell(\hat\beta, \hat\sigma^2|X_i, Y_i) = -\frac{n}{2}\log(2\pi\hat\sigma^2) - \frac{1}{2\hat\sigma^2}\underbrace{\sum_{i=1}^n (Y_i - X_i^T\hat\beta)^2}_{=n\hat\sigma^2} = -\frac{n}{2}(\log 2 + \log\pi + \log\hat\sigma^2) - \frac{n}{2} = C_n - \frac{n}{2}\log\left(\frac{1}{n}\sum_{i=1}^n e_i^2\right) = C_n - \frac{n}{2}\log\frac{\mathrm{RSS}_n}{n},$$
where $\mathrm{RSS}_n = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (Y_i - X_i^T\hat\beta)^2$ is the least squares objective function, the so-called residual sum of squares, and $C_n = -\frac{n}{2}(\log(2\pi) + 1)$ collects the terms that do not involve the residuals.
The quantity $C_n$ is invariant across the models/variables we choose. Thus, dropping this common constant, the AIC and BIC of the regression model are
$$\mathrm{AIC} = 2d + n\log\frac{\mathrm{RSS}_n}{n}, \qquad \mathrm{BIC} = d\log n + n\log\frac{\mathrm{RSS}_n}{n}.$$
(Note the plus sign: $-2\ell_n = -2C_n + n\log(\mathrm{RSS}_n/n)$, so a smaller residual sum of squares gives a smaller criterion value.)
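As an illustration, the sketch below applies these RSS-based criteria to choose the degree of a polynomial regression; the simulated data and the candidate degrees are assumptions for the example.
```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(scale=0.5, size=n)  # true degree: 2

for degree in range(1, 6):
    X = np.vander(x, degree + 1)                   # design matrix, d = degree + 1 columns
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares fit
    rss = np.sum((y - X @ beta) ** 2)
    d = X.shape[1]
    aic = 2 * d + n * np.log(rss / n)              # constant C_n dropped (common to all models)
    bic = d * np.log(n) + n * np.log(rss / n)
    print(f"degree={degree}: AIC={aic:.1f}, BIC={bic:.1f}")
```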
7.5 Prediction risk in regression
Now we consider a general regression prediction problem where we observe $(X_1, Y_1), \cdots, (X_n, Y_n)$ with $X_i \in \mathbb{R}^d$ and $Y_i \in \mathbb{R}$ drawn from an unknown joint distribution $F_{X,Y}$. Here we do not specify a particular model for the regression function; all we assume is
$$Y_i = m(X_i) + \epsilon_i,$$
where the $\epsilon_i$'s are mean-zero noise with $\mathrm{Var}(\epsilon_i|X_i = x) = \sigma^2(x)$, and $m(x) = \mathbb{E}(Y|X = x)$ is the regression function. Suppose that we have an estimator $\hat m$ of $m$ and we would like to know the prediction risk (expected loss in predicting a future outcome) of this estimator.
To define the prediction risk, let $(X_{\mathrm{new}}, Y_{\mathrm{new}})$ be a new pair of observations from $F_{X,Y}$. Then the prediction risk is defined as
$$R(\hat m) = \mathbb{E}((\hat Y_{\mathrm{new}} - Y_{\mathrm{new}})^2) = \mathbb{E}((\hat m(X_{\mathrm{new}}) - Y_{\mathrm{new}})^2),$$
where $\hat Y = \hat m(X)$ is the predicted value of $Y$ given covariate $X$.
Let $\bar m(x) = \mathbb{E}(\hat m(x))$ be the expected regression function of the estimator/predictor $\hat m$. We can decompose the prediction risk using the following expansion:
$$R(\hat m) = \mathbb{E}((\hat m(X_{\mathrm{new}}) - Y_{\mathrm{new}})^2) = \mathbb{E}\left(\mathbb{E}((\hat m(X_{\mathrm{new}}) - Y_{\mathrm{new}})^2|X_{\mathrm{new}})\right) = \mathbb{E}(R(X_{\mathrm{new}})) = \int R(x)p_X(x)\,dx,$$
where
$$R(x) = \mathbb{E}((\hat m(X_{\mathrm{new}}) - Y_{\mathrm{new}})^2|X_{\mathrm{new}} = x)$$
is a local predictive risk and $p_X$ is the PDF of the covariate $X$.
The local predictive risk can be decomposed as
$$R(x) = \mathbb{E}\Big(\big(\underbrace{\hat m(X_{\mathrm{new}}) - \bar m(X_{\mathrm{new}})}_{\text{variance of }\hat m} + \underbrace{\bar m(X_{\mathrm{new}}) - m(X_{\mathrm{new}})}_{\text{bias of }\hat m} + \underbrace{m(X_{\mathrm{new}}) - Y_{\mathrm{new}}}_{\text{intrinsic variance}}\big)^2 \,\Big|\, X_{\mathrm{new}} = x\Big) = V(x) + b^2(x) + \sigma^2(x),$$
where the cross terms vanish because $\mathbb{E}(\hat m(x)) = \bar m(x)$ and $\mathbb{E}(Y_{\mathrm{new}} - m(X_{\mathrm{new}})|X_{\mathrm{new}} = x) = 0$, and
$$V(x) = \mathrm{Var}(\hat m(X_{\mathrm{new}})|X_{\mathrm{new}} = x), \qquad b(x) = \bar m(x) - m(x).$$
Thus, the predictive risk can be decomposed into
$$R(\hat m) = \underbrace{\mathbb{E}(b^2(X_{\mathrm{new}}))}_{\text{bias}} + \underbrace{\mathbb{E}(V(X_{\mathrm{new}}))}_{\text{variance}} + \underbrace{\mathbb{E}(\sigma^2(X_{\mathrm{new}}))}_{\text{intrinsic error}}.$$
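The decomposition can be checked by simulation. Below is a Monte Carlo sketch (with an assumed true function, noise level, and estimator) that estimates $b^2(x_0)$ and $V(x_0)$ at a fixed point $x_0$ by refitting a polynomial estimator over many training sets.
```python
import numpy as np

rng = np.random.default_rng(2)
m = lambda x: np.sin(2 * x)    # assumed true regression function
sigma = 0.5                    # assumed homoscedastic noise level
x0, n, reps = 1.0, 50, 2000

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(-2, 2, n)
    y = m(x) + rng.normal(scale=sigma, size=n)
    beta = np.polyfit(x, y, deg=3)       # the estimator m-hat: cubic least squares
    preds[r] = np.polyval(beta, x0)      # m-hat(x0) for this training set

bias2 = (preds.mean() - m(x0)) ** 2      # b^2(x0) = (m-bar(x0) - m(x0))^2
var = preds.var()                        # V(x0) = Var(m-hat(x0))
# local predictive risk: R(x0) ≈ bias^2 + variance + sigma^2
print(f"bias^2={bias2:.4f}, variance={var:.4f}, intrinsic={sigma**2:.4f}")
```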
To compare the performance of predictors, we should use a good estimate of the predictive risk. A naive estimator is the training error (or empirical risk),
$$\hat R(\hat m) = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat Y_i)^2 = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat m(X_i))^2.$$
Although this estimator seems intuitively correct, it may suffer from a severe bias! To see this, note that the expected value of the empirical risk is
$$\mathbb{E}(\hat R(\hat m)) = \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left((Y_i - \hat m(X_i))^2\right).$$
For each $i$,
$$\mathbb{E}\left((Y_i - \hat m(X_i))^2\right) = \mathbb{E}\left((Y_i - m(X_i) + m(X_i) - \bar m(X_i) + \bar m(X_i) - \hat m(X_i))^2\right)$$
$$= \mathbb{E}(\sigma^2(X_i)) + \mathbb{E}(b^2(X_i)) + \mathbb{E}(V(X_i)) + 2\mathbb{E}\left((Y_i - m(X_i))(\bar m(X_i) - \hat m(X_i))\right)$$
$$= R(\hat m) - 2\mathbb{E}\left((Y_i - m(X_i))(\hat Y_i - \bar m(X_i))\right) = R(\hat m) - 2\mathbb{E}\left(\mathrm{Cov}(Y_i, \hat Y_i|X_i)\right).$$
Thus, the empirical risk and the predictive risk have the following relationship:
$$\mathbb{E}(\hat R(\hat m)) = R(\hat m) - \frac{2}{n}\sum_{i=1}^n \mathbb{E}\left(\mathrm{Cov}(Y_i, \hat Y_i|X_i)\right),$$
which implies that the empirical risk systematically underestimates the predictive risk. The underestimation is especially severe when the predicted value $\hat Y_i$ is highly correlated with the $i$-th observation, a common scenario when we are overfitting the model.
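A quick simulation (assumed setup) illustrates this optimism of the training error: as the polynomial degree grows and the fit starts to overfit, the gap between the training error and the predictive risk widens.
```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(degree, n=50, n_test=5000):
    m = lambda x: np.sin(2 * x)                      # assumed true regression function
    x = rng.uniform(-2, 2, n)
    y = m(x) + rng.normal(scale=0.5, size=n)
    beta = np.polyfit(x, y, deg=degree)
    train_err = np.mean((y - np.polyval(beta, x)) ** 2)
    x_new = rng.uniform(-2, 2, n_test)               # fresh data approximate the risk
    y_new = m(x_new) + rng.normal(scale=0.5, size=n_test)
    pred_risk = np.mean((y_new - np.polyval(beta, x_new)) ** 2)
    return train_err, pred_risk

for deg in (1, 3, 10):
    tr, pr = simulate(deg)
    print(f"degree={deg}: training error={tr:.3f}, predictive risk={pr:.3f}")
```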
Note that in the fixed design case (the covariates $X$ are non-random), the above expression reduces to
$$\mathbb{E}(\hat R(\hat m)) = R(\hat m) - \frac{2}{n}\sum_{i=1}^n \mathrm{Cov}(Y_i, \hat Y_i),$$
and the quantity $\sum_{i=1}^n \mathrm{Cov}(Y_i, \hat Y_i)/\sigma^2$ is called the degrees of freedom. For more details, see the following papers:
1. Tibshirani, Ryan J. "Degrees of freedom and model search." Statistica Sinica (2015): 1265-1296.
2. Tibshirani, Ryan J., and Saharon Rosset. "Excess Optimism: How Biased is the Apparent Error of an Estimator Tuned by SURE?" Journal of the American Statistical Association 114, no. 526 (2019): 697-712.
For a general regression model, a simple and consistent estimator of the predictive risk is cross-validation (CV). With the growing power of computing, it is a very convenient tool nowadays.
If we are using a linear smoother, there is a closed form for the leave-one-out CV:
$$\hat R_{\mathrm{LOO}}(\hat m) = \frac{1}{n}\sum_{i=1}^n \left(\frac{Y_i - \hat m(X_i)}{1 - L_{ii}}\right)^2,$$
where $L_{ij}$ is the $(i,j)$ entry of the linear smoother matrix $L$, i.e., $\hat{\mathbf{Y}} = L\mathbf{Y}$. There is also a related method, called generalized cross-validation (GCV), that gives a good estimate of the predictive risk:
$$\hat R_{\mathrm{GCV}}(\hat m) = \frac{1}{n}\sum_{i=1}^n \left(\frac{Y_i - \hat m(X_i)}{1 - \nu/n}\right)^2,$$
where $\nu = \sum_i L_{ii}$.
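Here is a sketch of these shortcut formulas for one concrete linear smoother; ridge regression is an assumed choice, with smoothing matrix $L = X(X^TX + \lambda I)^{-1}X^T$.
```python
import numpy as np

def loo_and_gcv(X, y, lam):
    # Smoothing matrix of ridge regression: Y-hat = L Y.
    n = X.shape[0]
    L = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    resid = y - L @ y
    loo = np.mean((resid / (1 - np.diag(L))) ** 2)   # closed-form leave-one-out CV
    nu = np.trace(L)                                 # nu = sum_i L_ii
    gcv = np.mean((resid / (1 - nu / n)) ** 2)       # generalized cross-validation
    return loo, gcv

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(size=100)
for lam in (0.01, 1.0, 100.0):
    loo, gcv = loo_and_gcv(X, y, lam)
    print(f"lambda={lam}: LOO={loo:.3f}, GCV={gcv:.3f}")
```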
Remark (Mallows' Cp). Mallows' Cp is an approach that estimates $\mathrm{Cov}(Y_i, \hat Y_i)$ and corrects the empirical risk to obtain a better estimate of the predictive risk. Mallows' Cp is
$$\hat R_{C_p} = \hat R(\hat m) + \frac{2d\hat\sigma^2}{n},$$
where $\hat\sigma^2$ is an estimate of the overall noise level $\sigma^2 = \int \sigma^2(x)p_X(x)\,dx$ and $d$ is the dimension of the model; in the case of a linear smoother, $d = \nu$. One can simply use the average of the squared residuals as $\hat\sigma^2$ in this case. Mallows' Cp also leads to a model selection criterion, but its goodness-of-fit part is different from that of the AIC/BIC.