Lecture Notes 15

36-705

1 Asymptotic theory
This lecture and the next will focus on asymptotic theory for the MLE. We suppose that we obtain a sample $X_1, \dots, X_n \sim p(X;\theta)$ and want to estimate $\theta$. We are interested in two questions:

1. Consistency: Does the MLE converge in probability to $\theta$, i.e. does $\hat\theta_{\rm MLE} \xrightarrow{p} \theta$? This is analogous to the LLN.

2. Asymptotic distribution: What can we say about the distribution of $\sqrt{n}(\hat\theta_{\rm MLE} - \theta)$? This is analogous to the CLT.

We will begin with the question of consistency.

2 Consistency of the MLE


The main take-home from this section is that under somewhat mild conditions the MLE
is a consistent estimator. We will try to develop the necessary conditions and build some
intuition about the MLE and about what consistency entails.

2.1 MLE as Empirical Risk Minimization

We have discussed previously the idea of empirical risk minimization, where we construct an
estimator by minimizing an empirical estimate of the risk. We looked at the particular case
of classification with the 0/1 loss. The MLE can be viewed as a special case of ERM with a
different loss function.
Suppose we define the loss function:

$$R_n(\hat\theta, \theta) = \frac{1}{n}\sum_{i=1}^n \log \frac{p(X_i;\theta)}{p(X_i;\hat\theta)}.$$

Observe that minimizing this loss function is identical to maximizing the likelihood. Notice that we introduced an extra $p(X_i;\theta)$ term in the numerator, but since it does not depend on $\hat\theta$ it does not affect the minimizer. Of course, if this is the empirical risk, it is natural to wonder what the associated population risk is. This is
$$R(\hat\theta, \theta) = E_\theta\left[\log \frac{p(X;\theta)}{p(X;\hat\theta)}\right] = \int p(x;\theta)\,\log \frac{p(x;\theta)}{p(x;\hat\theta)}\,dx,$$

which is the Kullback–Leibler divergence, i.e. the population risk is the KL divergence $KL(p(x;\theta)\,\|\,p(x;\hat\theta))$. Notice that the empirical risk is a sum of i.i.d. terms, so by the LLN we have, for any fixed $\tilde\theta$,

$$R_n(\tilde\theta, \theta) \xrightarrow{p} R(\tilde\theta, \theta).$$

To analyze empirical risk minimization we needed a uniform LLN, and we will need exactly this to show consistency. An important property of the KL divergence is that it is zero iff $p(x;\theta) = p(x;\hat\theta)$ almost everywhere (i.e. they are equal except on sets of measure 0). The main thing to remember is the connection between the MLE and the KL divergence.
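To make the LLN claim concrete, here is a minimal numerical sketch (our own illustration in Python/numpy, not part of the notes) for the unit-variance Gaussian location model, where the KL divergence has the closed form $KL(N(\theta,1)\,\|\,N(\tilde\theta,1)) = (\theta - \tilde\theta)^2/2$: the empirical risk $R_n(\tilde\theta,\theta)$ converges to this value as $n$ grows.

```python
import numpy as np

# X ~ N(theta, 1). Closed form: KL(N(theta,1) || N(theta_tilde,1)) = (theta - theta_tilde)^2 / 2.
rng = np.random.default_rng(0)
theta, theta_tilde = 0.0, 1.5

def log_density(x, t):
    # log of the N(t, 1) density
    return -0.5 * (x - t) ** 2 - 0.5 * np.log(2 * np.pi)

for n in [100, 10_000, 1_000_000]:
    X = rng.normal(theta, 1.0, size=n)
    # Empirical risk R_n(theta_tilde, theta): average log-likelihood ratio.
    R_n = np.mean(log_density(X, theta) - log_density(X, theta_tilde))
    print(n, R_n, "->", (theta - theta_tilde) ** 2 / 2)
```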

2.2 Conditions for consistency

Condition 1: Identifiability: A basic requirement for constructing any consistent estimator is that the model be identifiable, i.e. if $\theta_1 \neq \theta_2$ then it must be the case that $p(x;\theta_1) \neq p(x;\theta_2)$.
We will in general require something slightly stronger than this:
Condition 2: Strong identifiability: We assume that for every $\epsilon > 0$,

$$\inf_{\tilde\theta : |\tilde\theta - \theta| \ge \epsilon} KL(p(x;\theta)\,\|\,p(x;\tilde\theta)) > 0.$$

This condition is essentially the same as Condition 1, except that it does not allow the
difference between the two distributions to be vanishingly small. The two conditions are
equivalent if θ is restricted to lie in a compact set.
Condition 3: Uniform LLN: Assume that

$$\sup_{\tilde\theta}\,|R_n(\tilde\theta,\theta) - R(\tilde\theta,\theta)| \xrightarrow{p} 0.$$

This condition is a uniform LLN. As we have seen before, it holds for instance if the Rademacher complexity of the class of functions of the form $f_{\tilde\theta}(X) = \log p(X;\theta)/p(X;\tilde\theta)$ is not too large.

Theorem 1 Suppose that Conditions 2 and 3 above hold. Then the MLE is consistent.

Proof: Fix an $\epsilon > 0$. Using the strong identifiability condition, there is an $\eta > 0$ such that

$$KL(p(x;\theta)\,\|\,p(x;\tilde\theta)) \ge \eta$$

if $|\tilde\theta - \theta| \ge \epsilon$. We will show that for the MLE $\hat\theta$, we have $KL(p(x;\theta)\,\|\,p(x;\hat\theta)) \le \eta$ with probability tending to 1 as $n \to \infty$. This in turn implies that $|\hat\theta - \theta| \le \epsilon$, which implies that $\hat\theta \xrightarrow{p} \theta$. It remains to show that $KL(p(x;\theta)\,\|\,p(x;\hat\theta)) \le \eta$ as $n \to \infty$. Notice that

$$KL(p(X;\theta)\,\|\,p(X;\hat\theta)) = R(\hat\theta, \theta) = R(\hat\theta, \theta) - R_n(\hat\theta, \theta) + R_n(\hat\theta, \theta) \overset{(i)}{\le} R(\hat\theta, \theta) - R_n(\hat\theta, \theta) \xrightarrow{p} 0,$$

where the final convergence simply uses Condition 3. The inequality (i) follows since

$$R_n(\hat\theta, \theta) = \frac{1}{n}\sum_{i=1}^n \log \frac{p(X_i;\theta)}{p(X_i;\hat\theta)} \le 0,$$

since $\hat\theta$ maximizes the likelihood, so $\prod_i p(X_i;\hat\theta) \ge \prod_i p(X_i;\theta)$.
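As a sanity check on the theorem, the following sketch (our own; the Cauchy model and the grid-search maximizer are choices made for illustration) computes the MLE of a Cauchy location parameter, which has no closed form, and shows the estimates concentrating around the true $\theta$ as $n$ grows.

```python
import numpy as np

# Cauchy location model: p(x; theta) = 1 / (pi * (1 + (x - theta)^2)).
# No closed-form MLE, so maximize the log-likelihood over a fine grid.
rng = np.random.default_rng(1)
theta_true = 2.0
grid = np.linspace(-5.0, 10.0, 3001)

def mle(X):
    # Log-likelihood at every grid point (constants dropped), summed over the sample.
    ll = -np.sum(np.log1p((X[:, None] - grid[None, :]) ** 2), axis=0)
    return grid[np.argmax(ll)]

for n in [50, 500, 5000]:
    ests = np.array([mle(rng.standard_cauchy(n) + theta_true) for _ in range(100)])
    print(n, ests.mean(), ests.std())   # spread shrinks as n grows
```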

3 Inconsistency of the MLE


The MLE can fail to be consistent. When the model is not identifiable it is clear that we cannot have consistent estimators. The other possible failure is the failure of the uniform law. This typically happens when the parameter space is too large. Here is a simple example:
Example: Suppose that we measure some outcome (say blood sugar) for $n$ individuals using a machine. We do it twice for every individual so that we can assess the variability of the machine, i.e. suppose we observe:
$$Y_{11}, Y_{12} \sim N(\mu_1, \sigma^2)$$
$$\vdots$$
$$Y_{n1}, Y_{n2} \sim N(\mu_n, \sigma^2),$$

and want to estimate $\sigma^2$. Even though we only want to estimate $\sigma^2$, the model has a growing number of parameters $\mu_1, \dots, \mu_n, \sigma^2$, and the MLE for $\sigma^2$ will depend on estimating the $\mu_i$. Formally, we can see that the MLE for the means is:

$$\hat\mu_i = \frac{Y_{i1} + Y_{i2}}{2}.$$
The log-likelihood for $\sigma^2$ can be written as:

$$LL(\sigma^2, \mu) = -n\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n\left[(Y_{i1} - \mu_i)^2 + (Y_{i2} - \mu_i)^2\right],$$

which is maximized when we take:

$$\hat\sigma^2 = \frac{1}{2n}\sum_{i=1}^n\left[(Y_{i1} - \hat\mu_i)^2 + (Y_{i2} - \hat\mu_i)^2\right] = \frac{1}{4n}\sum_{i=1}^n (Y_{i1} - Y_{i2})^2.$$
Notice that

$$E[\hat\sigma^2] = \frac{\sigma^2}{2},$$
so by the LLN the MLE converges in probability to $\sigma^2/2$ and is therefore inconsistent. One could easily fix this in this particular problem (by multiplying the MLE by 2), but more generally this could be tricky. We note that in this type of problem, where the number of parameters is not fixed (and grows with the sample size), it is not even clear how to define convergence of the log-likelihood, since its limit changes with the sample size.
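A quick simulation (ours, in numpy) makes the failure visible: no matter how large $n$ gets, $\hat\sigma^2$ settles at $\sigma^2/2$ rather than $\sigma^2$.

```python
import numpy as np

# Neyman-Scott setup: two measurements per individual, each with its own mean mu_i.
rng = np.random.default_rng(2)
sigma2 = 4.0

for n in [100, 10_000, 1_000_000]:
    mu = rng.normal(0.0, 10.0, size=n)            # nuisance means mu_1, ..., mu_n
    Y1 = rng.normal(mu, np.sqrt(sigma2))
    Y2 = rng.normal(mu, np.sqrt(sigma2))
    sigma2_mle = np.mean((Y1 - Y2) ** 2) / 4.0    # (1/4n) * sum_i (Y_i1 - Y_i2)^2
    print(n, sigma2_mle, "vs true", sigma2)       # converges to sigma2 / 2 = 2.0
```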

4 MLE under misspecification


In statistical modeling we do not typically believe the model is correct, i.e. that the samples were in fact generated by some distribution in our model. Rather, we think of the model as a useful idealization or simplification. In this (more realistic) case, one might wonder what the MLE converges to, or whether it converges at all.
Suppose $X_1, \dots, X_n \sim q$, and we compute $\hat\theta_{\rm MLE}$; what can we say about our estimate? To answer this, we can follow a similar argument to what we did at the beginning of the lecture and observe that at the population level (i.e. with infinite samples) the MLE is:

$$\hat\theta_{\rm MLE} = \arg\max_{\theta\in\Theta} E_q[\log p(X;\theta)].$$

How do we interpret this statement? As before, we can re-write it in terms of KL divergences and see that:

$$KL(q\,\|\,p_{\hat\theta_{\rm MLE}}) \le KL(q\,\|\,p_\theta) \quad \text{for all } \theta\in\Theta.$$

So at the population level, when $q$ does not belong to our model, the MLE is essentially estimating the KL projection of the data-generating distribution $q$ onto our model. One can also impose conditions similar to those of the last section (uniform law + strong identifiability) to complete the consistency argument under model misspecification.
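To illustrate, here is a sketch of our own: data drawn from an exponential distribution but fit with a (misspecified) Gaussian model $N(\mu, \sigma^2)$. The Gaussian MLE is the sample mean and sample variance, and the KL projection of $q$ onto the Gaussian family is the Gaussian matching $q$'s mean and variance, so the MLE converges to those pseudo-true values.

```python
import numpy as np

# q = Exponential(rate 1): mean 1, variance 1. Model: N(mu, sigma^2), which is wrong.
# The KL projection of q onto the Gaussian family matches q's mean and variance,
# so the MLE (sample mean, sample variance) should converge to (1, 1).
rng = np.random.default_rng(3)

for n in [100, 10_000, 1_000_000]:
    X = rng.exponential(1.0, size=n)
    print(n, X.mean(), X.var())   # -> (1, 1), the pseudo-true parameters
```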

5 Limiting Distribution of the MLE


Now we will address the question of the asymptotic distribution of the MLE. This is analogous to the CLT, which gave the asymptotic distribution of averages. In some cases, we can do this directly. For instance, if $X_1, \dots, X_n \sim \text{Ber}(p)$ then the MLE is just the average:

$$\hat p = \frac{1}{n}\sum_{i=1}^n X_i,$$

and so we know by the CLT:

$$\sqrt{n}\,\frac{\hat p - p}{\sqrt{p(1-p)}} \xrightarrow{d} N(0, 1),$$

which tells us the asymptotic distribution of the MLE.
More generally, however, the MLE need not be a simple average of i.i.d. terms, but the main take-away is that asymptotically it often behaves like one.
Recall that the score function is

$$s(\theta) = \sum_{i=1}^n \nabla_\theta \log p(X_i;\theta),$$

which is the gradient of the log-likelihood, and the Fisher Information of the sample is

$$I_n(\theta) = E[s(\theta)s(\theta)^T].$$

We showed that $s(\theta)$ has mean 0, so $I_n(\theta) = \text{Var}(s(\theta))$. The Fisher information is alternatively the negative expected Hessian of the log-likelihood:

$$I_n(\theta) = -E\left[\sum_{i=1}^n \nabla^2_\theta \log p(X_i;\theta)\right].$$

It is worth remembering that the score is data-dependent, while the Fisher Information is not (it is an expectation over the data, so it does not depend on the values of $X_1, \dots, X_n$).
Let $\hat\theta$ denote the MLE. Our goal is to show that (under enough regularity conditions),

$$\sqrt{n}(\hat\theta - \theta) \xrightarrow{d} N(0, [I_1(\theta)]^{-1}),$$

where $I_1(\theta)$ denotes the Fisher information of a single observation (so that $I_n(\theta) = n\,I_1(\theta)$ for i.i.d. data).

6 Counterexample
The usual counterexample to the above convergence in distribution is the MLE for the uniform distribution, for which most of the regularity conditions fail. Formally, we observe $X_1, \dots, X_n \sim U[0,\theta]$ and want to estimate $\theta$. The log-likelihood is:

$$\ell(\theta) = \log\left(\frac{1}{\theta^n}\, I\Big(\theta \ge \max_i X_i\Big)\right).$$

The MLE is $\hat\theta = \max_i X_i$. Observe that the log-likelihood is not differentiable at the MLE, so the Fisher information is not defined there.
Another thing that we used frequently in deriving the equivalent forms of the Fisher information was the exchange of derivatives (with respect to $\theta$) and integrals (with respect to $X$). This in general does not work if the domain of integration depends on the parameter with respect to which we are taking the derivative. For the uniform distribution the support of the density depends on the parameter.
On the other hand, things are usually nice for exponential families. They automatically satisfy all the regularity conditions (provided the family is identifiable, i.e. full-rank and minimal), and the MLE is extremely well-behaved in such models.
Returning to the uniform case, we can directly analyze the distribution of the MLE. In a previous lecture we showed that

$$n(\hat\theta - \theta) \xrightarrow{d} -\text{Exp}(1/\theta)$$

(we did this when $\theta = 1$, but you can work out the general case). It follows that $\sqrt{n}(\hat\theta - \theta) \xrightarrow{d} \delta_0$, where $\delta_0$ is a point mass at 0, so the MLE does not have a Gaussian limit.
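A simulation (our sketch) confirms the non-Gaussian limit: $n(\theta - \hat\theta)$ is supported on $[0,\infty)$ and skewed, matching an exponential with mean $\theta$ rather than anything Gaussian.

```python
import numpy as np

# Uniform[0, theta]: MLE is max X_i, and n * (theta - max X_i) has an Exp(mean theta) limit.
rng = np.random.default_rng(4)
theta, n, reps = 1.0, 1_000, 20_000

X_max = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)
Z = n * (theta - X_max)
# For Exp(mean theta): mean = std = theta, median = theta * ln 2, no mass below 0.
print(Z.mean(), Z.std(), np.median(Z), np.mean(Z < 0))
```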

7 MLE asymptotics
We will only attempt a heuristic calculation here. If you are curious to see a rigorous proof with minimal regularity assumptions, you should look at van der Vaart's book Asymptotic Statistics. Here is a list of some sufficient regularity conditions:

1. The dimension of the parameter space does not change with $n$, i.e. $\theta \in \mathbb{R}^d$ and $d$ is fixed. We have seen that if $d$ grows the MLE need not even be consistent.

2. $p(x;\theta)$ is a smooth (thrice differentiable) function of $\theta$.

3. We can interchange differentiation with respect to $\theta$ and integration over $X$. This in turn requires that the range of $X$ does not depend on $\theta$, and some integrability conditions on $p(x;\theta)$.

4. The parameter $\theta$ is identifiable.

5. If the parameter space is restricted, i.e. $\theta \in \Theta$ for some set $\Theta$, then $\theta$ is in the interior of the set $\Theta$ (i.e. it cannot be on the boundary).

We will focus on the case when the parameter is one-dimensional, although everything carries
over almost exactly in the general (fixed) d case.

Theorem 2 Under the regularity conditions above,

$$\sqrt{n}(\hat\theta - \theta) \xrightarrow{d} N(0, 1/I(\theta)),$$

where $I(\theta) = I_1(\theta)$ here denotes the Fisher information of a single observation.

We note that under the conditions of the theorem one can verify that the MLE is consistent, i.e. that $\hat\theta \xrightarrow{p} \theta$. The basic idea is to verify that, under the differentiability assumptions on the density, we can effectively treat the parameter space as compact, then derive a uniform law of large numbers, and then apply the proof from the previous lecture notes. This is a complicated technical proof, but you can look it up by searching for Wald's proof of the consistency of the MLE.
The proof will use all the facts about scores and the Fisher information that we derived
earlier.
Proof: To begin with let us note the following fact: if $\hat\theta \xrightarrow{p} \theta$, then

$$E_\theta[-\nabla^2_\theta \log p(X;\hat\theta)] \xrightarrow{p} E_\theta[-\nabla^2_\theta \log p(X;\theta)] = I(\theta).$$

Since $\hat\theta$ maximizes the log-likelihood, we know that the derivative of the log-likelihood at $\hat\theta$ must be 0, i.e.

$$\ell'(\hat\theta) = 0.$$

Formally you need to know that $\hat\theta$ is not on the boundary of the parameter space. To prove this you will need to use the fact that $\theta$ is not on the boundary and that $\hat\theta \xrightarrow{p} \theta$.
By a Taylor expansion of the derivative of the log-likelihood we obtain that

$$0 = \ell'(\hat\theta) = \ell'(\theta) + (\hat\theta - \theta)\,\ell''(\tilde\theta),$$

where $\tilde\theta$ is some point in between $\hat\theta$ and $\theta$. This in turn gives us that

$$(\hat\theta - \theta) = \frac{\ell'(\theta)}{-\ell''(\tilde\theta)},$$

so that

$$\sqrt{n}(\hat\theta - \theta) = \frac{\ell'(\theta)/\sqrt{n}}{-\ell''(\tilde\theta)/n}.$$

We will look at the numerator and denominator separately. The denominator is:

$$-\frac{\ell''(\tilde\theta)}{n} = \frac{1}{n}\sum_{i=1}^n -\nabla^2_\theta \log p(X_i;\tilde\theta) \xrightarrow{p} E_\theta[-\nabla^2_\theta \log p(X;\tilde\theta)] \xrightarrow{p} E_\theta[-\nabla^2_\theta \log p(X;\theta)] = I(\theta),$$

where the last step uses the fact that $\tilde\theta \xrightarrow{p} \theta$.
The numerator is just the score function, i.e.

$$\frac{1}{\sqrt{n}}\,\ell'(\theta) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \nabla_\theta \log p(X_i;\theta) = \sqrt{n} \times \frac{1}{n}\sum_{i=1}^n \Big[\nabla_\theta \log p(X_i;\theta) - E[\nabla_\theta \log p(X;\theta)]\Big] \xrightarrow{d} N(0, \text{Var}(\nabla_\theta \log p(X;\theta))) = N(0, I(\theta)),$$

where we used the facts that the score has mean 0, that the variance of the score is the Fisher information, and that by the CLT $\sqrt{n}$ times an average of i.i.d. terms minus its expectation converges in distribution to a normal.
Putting the pieces together via Slutsky's theorem, we obtain that

$$\sqrt{n}(\hat\theta - \theta) \xrightarrow{d} \frac{1}{I(\theta)}\,N(0, I(\theta)) = N(0, 1/I(\theta)),$$

which is what we wanted to prove. $\square$
Example: Suppose that $X_1, \dots, X_n \sim \text{Exp}(\theta)$ (rate parameter $\theta$); then the log-likelihood is

$$\ell(\theta) = n\log\theta - \theta\sum_{i=1}^n X_i.$$

The score function is

$$s(\theta) = \frac{n}{\theta} - \sum_{i=1}^n X_i,$$

and the Fisher information is

$$I(\theta) = \frac{n}{\theta^2}.$$

The MLE is $\hat\theta = 1/\bar{X}$. So we can use the above result to conclude that, approximately,

$$\hat\theta - \theta \approx N\left(0, \frac{\theta^2}{n}\right).$$
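A quick numerical check of this conclusion (ours; note that numpy parameterizes the exponential by its mean $1/\theta$, not its rate):

```python
import numpy as np

# X_i ~ Exp(rate theta): theta_hat = 1 / mean(X), and sqrt(n)*(theta_hat - theta)
# should be approximately N(0, theta^2).
rng = np.random.default_rng(5)
theta, n, reps = 2.0, 2_000, 5_000

X = rng.exponential(1.0 / theta, size=(reps, n))   # scale = mean = 1/rate
theta_hat = 1.0 / X.mean(axis=1)
Z = np.sqrt(n) * (theta_hat - theta)
print(Z.mean(), Z.var(), "vs target variance", theta ** 2)
```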

8 Influence Functions and Regular Asymptotically Linear Estimators

We could have followed a similar proof as above to conclude that the MLE can be written as:

$$\hat\theta = \theta + \frac{1}{n}\sum_{i=1}^n \frac{\nabla_\theta \log p(X_i;\theta)}{I(\theta)} + \text{Remainder},$$

where $I(\theta)$ is the Fisher information of a single observation and the remainder is small (roughly proportional to the previous term multiplied by $[I(\tilde\theta) - I(\theta)] \to 0$). The term

$$\psi(x) = \frac{\nabla_\theta \log p(x;\theta)}{I(\theta)}$$

is called the influence function.
Thinking of a complex predictor like a deep neural network, one can try to obtain some information about the predictor by computing the influence of training images on the final predictor. A paper that did this (and quite a bit more), Koh and Liang's "Understanding Black-box Predictions via Influence Functions," won ICML's best paper award in 2017.
Returning to the expression:

$$\hat\theta \approx \theta + \frac{1}{n}\sum_{i=1}^n \psi(X_i).$$

Estimators that satisfy this type of expansion are called asymptotically linear estimators (many non-MLE estimators also satisfy expansions of this form). There is a classical result due to Le Cam that any sufficiently well-behaved (regular) estimator is asymptotically linear. It is not easy to prove (see van der Vaart's book). Together with the Cramér–Rao lower bound, this implies that the MLE is the "best regular asymptotically linear estimator".
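Continuing the exponential example from the previous section (our own sketch), the expansion can be checked numerically: for $\text{Exp}(\theta)$ the score of one observation is $1/\theta - x$ and $I_1(\theta) = 1/\theta^2$, so $\psi(x) = \theta - \theta^2 x$, and the error $\hat\theta - \theta$ tracks the averaged influence function closely.

```python
import numpy as np

# Exp(rate theta): psi(x) = theta^2 * (1/theta - x) = theta - theta^2 * x.
rng = np.random.default_rng(6)
theta, n = 2.0, 100_000

X = rng.exponential(1.0 / theta, size=n)
theta_hat = 1.0 / X.mean()
linear_term = np.mean(theta - theta ** 2 * X)   # (1/n) * sum_i psi(X_i)
print(theta_hat - theta, linear_term)           # nearly identical; the remainder is tiny
```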

9 Asymptotic Relative Efficiency


Once you restrict attention to asymptotically Normal estimators, comparing estimators in terms of their MSE boils down to comparing their variances. Specifically, if

$$\sqrt{n}(W_n - \tau(\theta)) \xrightarrow{d} N(0, \sigma_W^2), \qquad \sqrt{n}(V_n - \tau(\theta)) \xrightarrow{d} N(0, \sigma_V^2),$$

then the asymptotic relative efficiency (ARE) is

$$ARE(V_n, W_n) = \frac{\sigma_W^2}{\sigma_V^2}.$$

Example 3 Let $X_1, \dots, X_n \sim \text{Poisson}(\lambda)$. The MLE of $\lambda$ is $\bar{X}$. Let

$$\tau = P(X_i = 0) = e^{-\lambda}.$$

Define $Y_i = I(X_i = 0)$. This suggests the estimator

$$W_n = \frac{1}{n}\sum_{i=1}^n Y_i.$$

Another estimator is the MLE

$$V_n = e^{-\hat\lambda}.$$

The delta method gives

$$\text{Var}(V_n) \approx \frac{\lambda e^{-2\lambda}}{n}.$$

We have

$$\sqrt{n}(W_n - \tau) \xrightarrow{d} N(0, e^{-\lambda}(1 - e^{-\lambda})), \qquad \sqrt{n}(V_n - \tau) \xrightarrow{d} N(0, \lambda e^{-2\lambda}).$$

So

$$ARE(W_n, V_n) = \frac{\lambda}{e^\lambda - 1} \le 1. \ \square$$

Since the MLE is efficient, we know that, in general, $ARE(W_n, \text{MLE}) \le 1$.
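A simulation of the two estimators (our sketch) reproduces this ARE:

```python
import numpy as np

# Compare W_n (fraction of zeros) with V_n = exp(-lambda_hat) for tau = e^{-lambda}.
rng = np.random.default_rng(7)
lam, n, reps = 2.0, 1_000, 20_000

X = rng.poisson(lam, size=(reps, n))
W = np.mean(X == 0, axis=1)
V = np.exp(-X.mean(axis=1))
print("simulated ARE(W_n, V_n):", V.var() / W.var())
print("theory:                 ", lam / (np.exp(lam) - 1))   # ~ 0.313 for lambda = 2
```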

10 Multivariate Case
Now let $\theta = (\theta_1, \dots, \theta_k)$. In this case we have

$$\sqrt{n}(\hat\theta - \theta) \xrightarrow{d} N(0, I^{-1}(\theta)),$$

where $I^{-1}(\theta)$ is the inverse of the (single-observation) Fisher information matrix. The approximate standard error of $\hat\theta_j$ is $\sqrt{I^{-1}_{jj}/n}$. If $\tau = g(\theta)$ with $g : \mathbb{R}^k \to \mathbb{R}$, then by the delta method,

$$\sqrt{n}(\hat\tau - \tau) \xrightarrow{d} N(0, (g')^T I^{-1} g'),$$

where $g'$ is the gradient of $g$ evaluated at $\theta$.
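As a concrete check (our own sketch, not from the notes), take $\theta = (\mu, \sigma^2)$ in a Gaussian model and $\tau = g(\theta) = \mu/\sigma$. The single-observation Fisher information is $\text{diag}(1/\sigma^2, 1/(2\sigma^4))$, and the delta method gives asymptotic variance $(g')^T I^{-1} g' = 1 + \mu^2/(2\sigma^2)$, which a simulation reproduces:

```python
import numpy as np

# theta = (mu, sigma^2) for N(mu, sigma^2); tau = g(theta) = mu / sigma.
# I(theta) = diag(1/sigma^2, 1/(2 sigma^4)), g' = (1/sigma, -mu/(2 sigma^3)),
# so (g')^T I^{-1} g' = 1 + mu^2 / (2 sigma^2).
rng = np.random.default_rng(8)
mu, sigma, n, reps = 1.0, 2.0, 2_000, 10_000

X = rng.normal(mu, sigma, size=(reps, n))
tau_hat = X.mean(axis=1) / X.std(axis=1)   # plug-in MLE of mu/sigma (std uses ddof=0, the MLE)
Z = np.sqrt(n) * (tau_hat - mu / sigma)
print(Z.var(), "vs theory", 1 + mu ** 2 / (2 * sigma ** 2))   # theory = 1.125 here
```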
