STAT 535: Statistical Machine Learning Autumn 2019
Lecture 7: Model Selection and Prediction
Instructor: Yen-Chi Chen
Useful references:
• Chapter 7 of The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
• Chapter 13 (especially Section 13.6) of All of Statistics by Larry Wasserman.
• Section 5.3 of All of Nonparametric Statistics by Larry Wasserman.
7.1 Introduction
Suppose that we observe $X_1, \cdots, X_n$ from an unknown distribution function, and assume that the data come from a parametric model $p_\theta$ with $\theta \in \Theta \subset \mathbb{R}^d$.
Let
$$\hat\theta_n = \mathop{\mathrm{argmax}}_\theta L(\theta|X_1,\cdots,X_n) = \mathop{\mathrm{argmax}}_\theta \ell(\theta|X_1,\cdots,X_n)$$
be the MLE. For simplicity, we denote
$$\ell_n = \ell(\hat\theta_n|X_1,\cdots,X_n)$$
as the maximal value of the empirical log-likelihood function, and we define
$$\bar\ell(\theta) = \mathbb{E}(\ell(\theta|X_1))$$
as the population log-likelihood function. The population log-likelihood function defines the population MLE
$$\theta^* = \mathop{\mathrm{argmax}}_\theta \bar\ell(\theta).$$
7.2 AIC: Akaike information criterion
The AIC is an information criterion that is commonly used for model selection. For model selection, the AIC proposes the following criterion:
$$\mathrm{AIC} = 2d - 2\ell_n,$$
where $d$ is the dimension of the model.
The idea of the AIC is to adjust the empirical risk to be an unbiased estimator of the true risk in a parametric model. Under a likelihood framework, the loss function is the negative log-likelihood, so the empirical risk is
$$\hat R_n(\hat\theta_n) = -\ell_n = -\ell(\hat\theta_n|X_1,\cdots,X_n) = -\ell_n(\hat\theta_n).$$
On the other hand, the true risk of the MLE is
$$R(\hat\theta_n) = \mathbb{E}(-n\bar\ell(\hat\theta_n)).$$
Note that we multiply by $n$ to reflect the fact that the empirical risk was not divided by $n$.
To derive the AIC, we examine the asymptotic bias of the empirical risk $\hat R_n(\hat\theta_n)$ relative to the true risk $R(\hat\theta_n)$.
Analysis of true risk. To analyze the true risk $R(\hat\theta_n)$, we first investigate the asymptotic behavior of $\bar\ell(\hat\theta_n)$ around $\theta^*$:
$$\bar\ell(\hat\theta_n) \approx \bar\ell(\theta^*) + (\hat\theta_n - \theta^*)^T \underbrace{\nabla\bar\ell(\theta^*)}_{=0} + \frac{1}{2}(\hat\theta_n - \theta^*)^T \underbrace{\nabla\nabla\bar\ell(\theta^*)}_{=-I(\theta^*)}(\hat\theta_n - \theta^*) = \bar\ell(\theta^*) - \frac{1}{2}(\hat\theta_n - \theta^*)^T I(\theta^*)(\hat\theta_n - \theta^*).$$
Thus, the true risk is
$$R(\hat\theta_n) = -n\mathbb{E}(\bar\ell(\hat\theta_n)) \approx -n\bar\ell(\theta^*) + \frac{n}{2}\mathbb{E}\left((\hat\theta_n - \theta^*)^T I(\theta^*)(\hat\theta_n - \theta^*)\right). \tag{7.1}$$
Analysis of expected empirical risk. For the expected empirical risk, we first expand $\ell_n$ as follows:
$$\ell_n = \sum_{i=1}^n \ell(\hat\theta_n|X_i) \approx \sum_{i=1}^n \ell(\theta^*|X_i) + \underbrace{(\hat\theta_n - \theta^*)^T \sum_{i=1}^n \nabla\ell(\theta^*|X_i)}_{=(I)} + \underbrace{\frac{1}{2}(\hat\theta_n - \theta^*)^T \sum_{i=1}^n \nabla\nabla\ell(\theta^*|X_i)(\hat\theta_n - \theta^*)}_{=(II)}. \tag{7.2}$$
The expectation of the first quantity is $n\bar\ell(\theta^*)$, which agrees with the first term in the true risk, so all we need is to understand the behavior of the remaining two quantities.
For quantity (I), using the fact that $\sum_{i=1}^n \nabla\ell(\hat\theta_n|X_i) = 0$,
$$\sum_{i=1}^n \nabla\ell(\theta^*|X_i) = \sum_{i=1}^n \nabla\left(\ell(\theta^*|X_i) - \ell(\hat\theta_n|X_i)\right) \approx \left(\sum_{i=1}^n \nabla\nabla\ell(\theta^*|X_i)\right)(\theta^* - \hat\theta_n) \approx n\underbrace{\nabla\nabla\mathbb{E}(\ell(\theta^*|X_1))}_{=-I(\theta^*)}(\theta^* - \hat\theta_n) = nI(\theta^*)(\hat\theta_n - \theta^*).$$
Thus,
$$(I) \approx n(\hat\theta_n - \theta^*)^T I(\theta^*)(\hat\theta_n - \theta^*).$$
For quantity (II), note that
$$\frac{1}{n}\sum_{i=1}^n \nabla\nabla\ell(\theta^*|X_i) \approx \nabla\nabla\mathbb{E}(\ell(\theta^*|X_1)) = -I(\theta^*),$$
so
$$(II) = \frac{1}{2}(\hat\theta_n - \theta^*)^T \sum_{i=1}^n \nabla\nabla\ell(\theta^*|X_i)(\hat\theta_n - \theta^*) \approx -\frac{n}{2}(\hat\theta_n - \theta^*)^T I(\theta^*)(\hat\theta_n - \theta^*).$$
Combining (I) and (II) into equation (7.2) and taking the expectation, we obtain
$$\mathbb{E}(\hat R_n(\hat\theta_n)) = -\mathbb{E}(\ell_n) \approx -n\bar\ell(\theta^*) - \frac{n}{2}\mathbb{E}\left((\hat\theta_n - \theta^*)^T I(\theta^*)(\hat\theta_n - \theta^*)\right).$$
Comparing this to equation (7.1), we obtain
$$\mathbb{E}(\hat R_n(\hat\theta_n)) - R(\hat\theta_n) \approx -n\mathbb{E}\left((\hat\theta_n - \theta^*)^T I(\theta^*)(\hat\theta_n - \theta^*)\right).$$
Note that by the theory of the MLE, one can show that
$$\sqrt{n}(\hat\theta_n - \theta^*) \overset{d}{\to} N(0, I^{-1}(\theta^*)),$$
so
$$n(\hat\theta_n - \theta^*)^T I(\theta^*)(\hat\theta_n - \theta^*) \overset{d}{\to} \chi^2_d,$$
which implies that
$$n\mathbb{E}\left((\hat\theta_n - \theta^*)^T I(\theta^*)(\hat\theta_n - \theta^*)\right) \to d.$$
(Formally, we need a few more conditions here; convergence in distribution alone is not enough for convergence in expectation.)
Thus, to make the empirical risk an asymptotically unbiased estimator of the true risk $R(\hat\theta_n)$, we modify it to
$$\hat R_n(\hat\theta_n) + d = -\ell_n + d.$$
Multiplying this quantity by 2, we obtain the AIC:
$$\mathrm{AIC} = 2d - 2\ell_n.$$
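As a concrete illustration, here is a minimal Python sketch of using the AIC to compare two candidate parametric models; the simulated data, the candidate families, and the use of scipy's generic fit routines are all illustrative assumptions rather than part of the lecture.
```python
import numpy as np
from scipy import stats

def aic(loglik, d):
    # AIC = 2d - 2*ell_n, where ell_n is the maximized log-likelihood.
    return 2 * d - 2 * loglik

# Simulated data (assumed setup): the true model is a gamma distribution.
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.5, size=500)

# Model 1: exponential, d = 1 free parameter (the scale; location fixed at 0).
_, scale1 = stats.expon.fit(x, floc=0)
ll1 = stats.expon.logpdf(x, scale=scale1).sum()

# Model 2: gamma, d = 2 free parameters (shape and scale; location fixed at 0).
a2, _, scale2 = stats.gamma.fit(x, floc=0)
ll2 = stats.gamma.logpdf(x, a2, scale=scale2).sum()

print("AIC exponential:", aic(ll1, d=1))
print("AIC gamma      :", aic(ll2, d=2))   # the smaller AIC is preferred
```
Since the data are drawn from a gamma distribution, the gamma model should typically attain the smaller AIC despite its extra parameter.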
Remark. From the derivation of the AIC, we see that its goal is to adjust the empirical risk so that we are comparing unbiased estimates of the true risks across different models. Thus, the model selected by minimizing the AIC can be viewed as the model selected by minimizing unbiased estimates of the true risks. From the risk minimization point of view, this amounts to making a prediction using a good risk estimator. Thus, some people would comment that the AIC is designed to choose a model that makes good predictions.
7.3 BIC: Bayesian information criterion
Another common approach for model selection is the BIC:
$$\mathrm{BIC} = d\log n - 2\ell_n,$$
where again $d$ denotes the dimension of the model.
Here is the derivation of the BIC. In the Bayesian setting, we place a prior $\pi(m)$ over all possible models and, within each model, a prior $\pi(\theta|m)$ over the parameters. The BIC is a Bayesian criterion, which means that we select the model according to the posterior distribution of each model $m$. Namely, we will try to derive $\pi(m|X_1,\cdots,X_n)$.
By Bayes rule, we have
$$\pi(m|X_1,\cdots,X_n) = \frac{\pi(m, X_1,\cdots,X_n)}{p(X_1,\cdots,X_n)} \propto p(X_1,\cdots,X_n|m)\pi(m),$$
so all we need is to derive the marginal density under a model, $p(X_1,\cdots,X_n|m)$.
With a prior $\pi(\theta|m)$, this marginal density can be written as
$$p(X_1,\cdots,X_n|m) = \int p(X_1,\cdots,X_n|\theta, m)\pi(\theta|m)\,d\theta. \tag{7.3}$$
Suppose that the model $m$ is correct in the sense that under model $m$, there exists $\theta^* \in \Theta$ such that the data are indeed generated from $p(x|\theta^*)$.
Using the log-likelihood function, we can write
$$p(X_1,\cdots,X_n|\theta, m) = e^{\ell(\theta|X_1,\cdots,X_n,m)} = e^{\sum_{i=1}^n \ell(\theta|X_i,m)}. \tag{7.4}$$
Asymptotically, the log-likelihood function can be further expanded around the MLE $\hat\theta_n$ (where its gradient vanishes):
$$\sum_{i=1}^n \ell(\theta|X_i, m) = \underbrace{\sum_{i=1}^n \ell(\hat\theta_n|X_i, m)}_{=\ell_n} + (\theta - \hat\theta_n)^T \underbrace{\sum_{i=1}^n \nabla\ell(\hat\theta_n|X_i, m)}_{=0} + \frac{1}{2}(\theta - \hat\theta_n)^T \sum_{i=1}^n \nabla\nabla\ell(\hat\theta_n|X_i, m)(\theta - \hat\theta_n) + \text{small terms}$$
$$= \ell_n + \frac{n}{2}(\theta - \hat\theta_n)^T \underbrace{\frac{1}{n}\sum_{i=1}^n \nabla\nabla\ell(\hat\theta_n|X_i, m)}_{\approx -I(\theta^*)}(\theta - \hat\theta_n) + \text{small terms},$$
where $I(\theta^*)$ is the Fisher information matrix. Plugging this into equation (7.4) and ignoring the remainder terms, we obtain
$$p(X_1,\cdots,X_n|\theta, m) \approx e^{\ell_n - \frac{n}{2}(\theta - \hat\theta_n)^T I(\theta^*)(\theta - \hat\theta_n)}. \tag{7.5}$$
Thus, equation (7.3) can be rewritten as
$$p(X_1,\cdots,X_n|m) \approx e^{\ell_n} \int e^{-\frac{n}{2}(\theta - \hat\theta_n)^T I(\theta^*)(\theta - \hat\theta_n)}\pi(\theta|m)\,d\theta. \tag{7.6}$$
To compute the above integral, consider a random vector $Y \sim N(\hat\theta_n, \frac{1}{n}I^{-1}(\theta^*))$. The expectation
$$\mathbb{E}(\pi(Y|m)) = \left(\frac{n}{2\pi}\right)^{d/2}\det{}^{1/2}(I(\theta^*))\int e^{-\frac{n}{2}(y - \hat\theta_n)^T I(\theta^*)(y - \hat\theta_n)}\pi(y|m)\,dy \approx \pi(\hat\theta_n|m)$$
when $n \to \infty$, since the distribution of $Y$ concentrates around $\hat\theta_n$. This implies that
$$\int e^{-\frac{n}{2}(\theta - \hat\theta_n)^T I(\theta^*)(\theta - \hat\theta_n)}\pi(\theta|m)\,d\theta \approx \left(\frac{2\pi}{n}\right)^{d/2}\det{}^{-1/2}(I(\theta^*))\,\pi(\hat\theta_n|m).$$
Putting this into equation (7.6), we conclude that the Bayesian evidence is
$$p(X_1,\cdots,X_n|m) \approx e^{\ell_n}\left(\frac{2\pi}{n}\right)^{d/2}\det{}^{-1/2}(I(\theta^*))\,\pi(\hat\theta_n|m),$$
so the log evidence is
$$\log p(X_1,\cdots,X_n|m) \approx \ell_n - \frac{d}{2}\log n + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log\det(I(\theta^*)) + \log\pi(\hat\theta_n|m).$$
The only quantities that grow with the sample size $n$ are the first two terms, so after multiplying by $-2$ and keeping only these two dominating terms, we obtain
$$\mathrm{BIC} = d\log n - 2\ell_n.$$
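The only difference between the two criteria is the penalty: $2d$ for the AIC versus $d\log n$ for the BIC, so once $n > e^2 \approx 7.4$ the BIC penalizes model dimension more heavily. A minimal sketch, with made-up log-likelihood numbers purely for illustration:
```python
import numpy as np

def aic(loglik, d):
    return 2 * d - 2 * loglik

def bic(loglik, d, n):
    return d * np.log(n) - 2 * loglik

# A hypothetical model with d = 5 parameters and maximized log-likelihood
# -120 on n = 1000 observations:
print(aic(-120.0, d=5))          # 2*5 + 240 = 250.0
print(bic(-120.0, d=5, n=1000))  # 5*log(1000) + 240 ≈ 274.5
```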
Remark. Although the BIC leads to a criterion similar to the AIC, the reasoning is somewhat different. In the construction of the BIC, the effect of the priors is ignored, since we are working in the limiting regime, but we still use the Bayesian evidence as a model selection criterion: we select the model with the highest evidence. When the data are indeed generated from one of the models in the collection we are choosing from, the posterior will concentrate on this correct model, so the BIC will eventually select it. Therefore, some people would argue that unlike the AIC, which chooses the best predictive model, the BIC attempts to select the true model if it exists in the model set.
7.4 Information criterion in regression
Using an information criterion in a regression problem is very common in practice. However, there is a caveat in the construction. The information criteria we derived are based on a likelihood framework, but the regression problem is often formulated as an empirical risk minimization, e.g., the least squares approach. So we need to associate the likelihood function with the loss function used in the regression problem to properly construct a model selection criterion. Here we give one example using the least squares approach under a linear regression model.
Let $(X_1, Y_1), \cdots, (X_n, Y_n)$ be the observed data such that $X_i \in \mathbb{R}^d$ and $Y_i \in \mathbb{R}$. A regression model associates $X_i$ and $Y_i$ via
$$Y_i = X_i^T\beta + \epsilon_i,$$
where the $\epsilon_i$'s are the noise. To associate the least squares method with a likelihood framework, we assume that $\epsilon_i \sim N(0, \sigma^2)$. Under this framework, one can easily show that the MLE of $\beta$ and the least squares estimate are the same.
What will the likelihood function look like in this case? For simplicity, we consider the fixed design, where the $X_i$'s are non-random and the only random quantities are the noise $\epsilon_i$'s. Because $\epsilon_i = Y_i - X_i^T\beta$, the log-likelihood function is
$$\ell(\beta, \sigma^2|X_i, Y_i) = \log p_\epsilon(Y_i - X_i^T\beta; \sigma^2) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(Y_i - X_i^T\beta)^2.$$
Let $\hat\beta$ be the least squares estimate (MLE) and $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n e_i^2$ be the MLE of the noise level, where $e_i = Y_i - X_i^T\hat\beta$ is the residual.
The maximized log-likelihood under the MLE/least squares estimate is
$$\ell_n = \sum_{i=1}^n \ell(\hat\beta, \hat\sigma^2|X_i, Y_i) = -\frac{n}{2}\log(2\pi\hat\sigma^2) - \frac{1}{2\hat\sigma^2}\underbrace{\sum_{i=1}^n (Y_i - X_i^T\hat\beta)^2}_{=n\hat\sigma^2} = -\frac{n}{2}(\log 2 + \log\pi + \log\hat\sigma^2) - \frac{n}{2} = C_n - \frac{n}{2}\log\left(\frac{1}{n}\sum_{i=1}^n e_i^2\right) = C_n - \frac{n}{2}\log\frac{\mathrm{RSS}_n}{n},$$
where $\mathrm{RSS}_n = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (Y_i - X_i^T\hat\beta)^2$ is the least squares objective function, the so-called residual sum of squares, and $C_n = -\frac{n}{2}(\log(2\pi) + 1)$ collects the terms that do not involve the residuals.
The quantity $C_n$ is invariant across the models/variables we choose. Thus, dropping this common constant, the AIC and BIC of the regression model are
$$\mathrm{AIC} = 2d + n\log\frac{\mathrm{RSS}_n}{n}, \qquad \mathrm{BIC} = d\log n + n\log\frac{\mathrm{RSS}_n}{n}.$$
(Note the plus sign: $-2\ell_n = -2C_n + n\log(\mathrm{RSS}_n/n)$, so a smaller residual sum of squares gives a smaller criterion value.)
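As an illustration, the sketch below applies these RSS-based criteria to choose the degree of a polynomial regression; the simulated data and the candidate degrees are assumptions for the example.
```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(scale=0.5, size=n)  # true degree: 2

for degree in range(1, 6):
    X = np.vander(x, degree + 1)                   # design matrix, d = degree + 1 columns
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares fit
    rss = np.sum((y - X @ beta) ** 2)
    d = X.shape[1]
    aic = 2 * d + n * np.log(rss / n)              # constant C_n dropped (common to all models)
    bic = d * np.log(n) + n * np.log(rss / n)
    print(f"degree={degree}: AIC={aic:.1f}, BIC={bic:.1f}")
```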
7.5 Prediction risk in regression
Now we consider a general regression prediction problem where we observe $(X_1, Y_1), \cdots, (X_n, Y_n)$ with $X_i \in \mathbb{R}^d$ and $Y_i \in \mathbb{R}$ drawn from an unknown joint distribution $F_{X,Y}$. Here we do not specify a particular model for the regression function; all we assume is
$$Y_i = m(X_i) + \epsilon_i,$$
where the $\epsilon_i$'s are mean-zero noise with $\mathrm{Var}(\epsilon_i|X_i = x) = \sigma^2(x)$, and $m(x) = \mathbb{E}(Y|X = x)$ is the regression function. Suppose that we have an estimator $\hat m$ of $m$ and we would like to know the prediction risk (expected loss in predicting a future outcome) of this estimator.
To define the prediction risk, let $(X_{\mathrm{new}}, Y_{\mathrm{new}})$ be a new pair of observations from $F_{X,Y}$. Then the prediction risk is defined as
$$R(\hat m) = \mathbb{E}((\hat Y_{\mathrm{new}} - Y_{\mathrm{new}})^2) = \mathbb{E}((\hat m(X_{\mathrm{new}}) - Y_{\mathrm{new}})^2),$$
where $\hat Y = \hat m(X)$ is the predicted value of $Y$ given covariate $X$.
Let $\bar m(x) = \mathbb{E}(\hat m(x))$ be the expected regression function of the estimator/predictor $\hat m$. We can decompose the prediction risk using the following expansion:
$$R(\hat m) = \mathbb{E}((\hat m(X_{\mathrm{new}}) - Y_{\mathrm{new}})^2) = \mathbb{E}\left(\mathbb{E}((\hat m(X_{\mathrm{new}}) - Y_{\mathrm{new}})^2|X_{\mathrm{new}})\right) = \mathbb{E}(R(X_{\mathrm{new}})) = \int R(x)p_X(x)\,dx,$$
where
$$R(x) = \mathbb{E}((\hat m(X_{\mathrm{new}}) - Y_{\mathrm{new}})^2|X_{\mathrm{new}} = x)$$
is a local predictive risk and $p_X$ is the PDF of the covariate $X$.
The local predictive risk can be decomposed as
$$R(x) = \mathbb{E}\Big(\big(\underbrace{\hat m(X_{\mathrm{new}}) - \bar m(X_{\mathrm{new}})}_{\text{variance of }\hat m} + \underbrace{\bar m(X_{\mathrm{new}}) - m(X_{\mathrm{new}})}_{\text{bias of }\hat m} + \underbrace{m(X_{\mathrm{new}}) - Y_{\mathrm{new}}}_{\text{intrinsic variance}}\big)^2 \,\Big|\, X_{\mathrm{new}} = x\Big) = V(x) + b^2(x) + \sigma^2(x),$$
where the cross terms vanish because $\mathbb{E}(\hat m(x)) = \bar m(x)$ and $\mathbb{E}(Y_{\mathrm{new}} - m(X_{\mathrm{new}})|X_{\mathrm{new}} = x) = 0$, and
$$V(x) = \mathrm{Var}(\hat m(X_{\mathrm{new}})|X_{\mathrm{new}} = x), \qquad b(x) = \bar m(x) - m(x).$$
Thus, the predictive risk can be decomposed into
$$R(\hat m) = \underbrace{\mathbb{E}(b^2(X_{\mathrm{new}}))}_{\text{bias}} + \underbrace{\mathbb{E}(V(X_{\mathrm{new}}))}_{\text{variance}} + \underbrace{\mathbb{E}(\sigma^2(X_{\mathrm{new}}))}_{\text{intrinsic error}}.$$
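The decomposition can be checked by simulation. Below is a Monte Carlo sketch (with an assumed true function, noise level, and estimator) that estimates $b^2(x_0)$ and $V(x_0)$ at a fixed point $x_0$ by refitting a polynomial estimator over many training sets.
```python
import numpy as np

rng = np.random.default_rng(2)
m = lambda x: np.sin(2 * x)    # assumed true regression function
sigma = 0.5                    # assumed homoscedastic noise level
x0, n, reps = 1.0, 50, 2000

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(-2, 2, n)
    y = m(x) + rng.normal(scale=sigma, size=n)
    beta = np.polyfit(x, y, deg=3)       # the estimator m-hat: cubic least squares
    preds[r] = np.polyval(beta, x0)      # m-hat(x0) for this training set

bias2 = (preds.mean() - m(x0)) ** 2      # b^2(x0) = (m-bar(x0) - m(x0))^2
var = preds.var()                        # V(x0) = Var(m-hat(x0))
# local predictive risk: R(x0) ≈ bias^2 + variance + sigma^2
print(f"bias^2={bias2:.4f}, variance={var:.4f}, intrinsic={sigma**2:.4f}")
```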
To compare the performance of predictors, we should use a good estimate of the predictive risk. A naive estimator is the training error (or empirical risk),
$$\hat R(\hat m) = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat Y_i)^2 = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat m(X_i))^2.$$
Although this estimator seems intuitively correct, it may suffer from a severe bias! To see this, note that the expected value of the empirical risk is
$$\mathbb{E}(\hat R(\hat m)) = \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left((Y_i - \hat m(X_i))^2\right).$$
For each $i$,
$$\mathbb{E}\left((Y_i - \hat m(X_i))^2\right) = \mathbb{E}\left((Y_i - m(X_i) + m(X_i) - \bar m(X_i) + \bar m(X_i) - \hat m(X_i))^2\right)$$
$$= \mathbb{E}(\sigma^2(X_i)) + \mathbb{E}(b^2(X_i)) + \mathbb{E}(V(X_i)) + 2\mathbb{E}\left((Y_i - m(X_i))(\bar m(X_i) - \hat m(X_i))\right)$$
$$= R(\hat m) - 2\mathbb{E}\left((Y_i - m(X_i))(\hat Y_i - \bar m(X_i))\right) = R(\hat m) - 2\mathbb{E}\left(\mathrm{Cov}(Y_i, \hat Y_i|X_i)\right).$$
Thus, the empirical risk and the predictive risk have the following relationship:
$$\mathbb{E}(\hat R(\hat m)) = R(\hat m) - \frac{2}{n}\sum_{i=1}^n \mathbb{E}\left(\mathrm{Cov}(Y_i, \hat Y_i|X_i)\right),$$
which implies that the empirical risk systematically underestimates the predictive risk. The underestimation is especially severe when the predicted value $\hat Y_i$ is highly correlated with the $i$-th observation, a common scenario when we are overfitting the model.
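A quick simulation (assumed setup) illustrates this optimism of the training error: as the polynomial degree grows and the fit starts to overfit, the gap between the training error and the predictive risk widens.
```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(degree, n=50, n_test=5000):
    m = lambda x: np.sin(2 * x)                      # assumed true regression function
    x = rng.uniform(-2, 2, n)
    y = m(x) + rng.normal(scale=0.5, size=n)
    beta = np.polyfit(x, y, deg=degree)
    train_err = np.mean((y - np.polyval(beta, x)) ** 2)
    x_new = rng.uniform(-2, 2, n_test)               # fresh data approximate the risk
    y_new = m(x_new) + rng.normal(scale=0.5, size=n_test)
    pred_risk = np.mean((y_new - np.polyval(beta, x_new)) ** 2)
    return train_err, pred_risk

for deg in (1, 3, 10):
    tr, pr = simulate(deg)
    print(f"degree={deg}: training error={tr:.3f}, predictive risk={pr:.3f}")
```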
Note that in the fixed design case (the covariates $X$ are non-random), the above expression reduces to
$$\mathbb{E}(\hat R(\hat m)) = R(\hat m) - \frac{2}{n}\sum_{i=1}^n \mathrm{Cov}(Y_i, \hat Y_i),$$
and the quantity $\sum_{i=1}^n \mathrm{Cov}(Y_i, \hat Y_i)/\sigma^2$ is called the degrees of freedom. For more details, see the following papers:
1. Tibshirani, Ryan J. "Degrees of freedom and model search." Statistica Sinica (2015): 1265-1296.
2. Tibshirani, Ryan J., and Saharon Rosset. "Excess Optimism: How Biased is the Apparent Error of an Estimator Tuned by SURE?" Journal of the American Statistical Association 114, no. 526 (2019): 697-712.
For a general regression model, a simple and consistent estimator of the predictive risk is cross-validation (CV). With the growing power of computing, it is a very convenient tool nowadays.
If we are using a linear smoother, there is a closed form for the leave-one-out CV:
$$\hat R_{\mathrm{LOO}}(\hat m) = \frac{1}{n}\sum_{i=1}^n \left(\frac{Y_i - \hat m(X_i)}{1 - L_{ii}}\right)^2,$$
where $L_{ij}$ is the $(i,j)$ entry of the linear smoother matrix $L$, i.e., $\hat{\mathbf{Y}} = L\mathbf{Y}$. There is also a related method, called generalized cross-validation (GCV), that gives a good estimate of the predictive risk:
$$\hat R_{\mathrm{GCV}}(\hat m) = \frac{1}{n}\sum_{i=1}^n \left(\frac{Y_i - \hat m(X_i)}{1 - \nu/n}\right)^2,$$
where $\nu = \sum_i L_{ii}$.
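Here is a sketch of these shortcut formulas for one concrete linear smoother; ridge regression is an assumed choice, with smoothing matrix $L = X(X^TX + \lambda I)^{-1}X^T$.
```python
import numpy as np

def loo_and_gcv(X, y, lam):
    # Smoothing matrix of ridge regression: Y-hat = L Y.
    n = X.shape[0]
    L = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    resid = y - L @ y
    loo = np.mean((resid / (1 - np.diag(L))) ** 2)   # closed-form leave-one-out CV
    nu = np.trace(L)                                 # nu = sum_i L_ii
    gcv = np.mean((resid / (1 - nu / n)) ** 2)       # generalized cross-validation
    return loo, gcv

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(size=100)
for lam in (0.01, 1.0, 100.0):
    loo, gcv = loo_and_gcv(X, y, lam)
    print(f"lambda={lam}: LOO={loo:.3f}, GCV={gcv:.3f}")
```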
Remark (Mallows' Cp). Mallows' Cp is an approach that estimates $\mathrm{Cov}(Y_i, \hat Y_i)$ and corrects the empirical risk to obtain a better estimate of the predictive risk. Mallows' Cp is
$$\hat R_{C_p} = \hat R(\hat m) + \frac{2d\hat\sigma^2}{n},$$
where $\hat\sigma^2$ is an estimate of the overall noise level $\sigma^2 = \int \sigma^2(x)p_X(x)\,dx$ and $d$ is the dimension of the model; in the case of a linear smoother, $d = \nu$. One can simply use the average of the squared residuals as $\hat\sigma^2$ in this case. Mallows' Cp also leads to a model selection criterion, but its goodness-of-fit part is different from that of the AIC/BIC.