Lecture Notes 15

36-705

1 Asymptotic theory
This lecture and the next will focus on asymptotic theory for the MLE. We suppose that we obtain a sample $X_1, \dots, X_n \sim p(X;\theta)$ and want to estimate $\theta$. We are interested in two questions:

1. Consistency: Does the MLE converge in probability to $\theta$, i.e. does $\hat\theta_{\rm MLE} \xrightarrow{p} \theta$? This is analogous to the LLN.

2. Asymptotic distribution: What can we say about the distribution of $\sqrt{n}(\hat\theta_{\rm MLE} - \theta)$? This is analogous to the CLT.

We will begin with the question of consistency.

2 Consistency of the MLE


The main take-home from this section is that under somewhat mild conditions the MLE
is a consistent estimator. We will try to develop the necessary conditions and build some
intuition about the MLE and about what consistency entails.

2.1 MLE as Empirical Risk Minimization

We have discussed previously the idea of empirical risk minimization, where we construct an
estimator by minimizing an empirical estimate of the risk. We looked at the particular case
of classification with the 0/1 loss. The MLE can be viewed as a special case of ERM with a
different loss function.
Suppose we define the loss function:

$$R_n(\hat\theta, \theta) = \frac{1}{n}\sum_{i=1}^n \log \frac{p(X_i;\theta)}{p(X_i;\hat\theta)}.$$

Observe that minimizing this loss function is identical to maximizing the likelihood. Notice that we introduced an extra $p(X_i;\theta)$ term in the numerator, but since it does not depend on $\hat\theta$ it does not affect the minimizer. Of course, if this is the empirical risk, it is natural to wonder what the associated population risk is. This is
$$R(\hat\theta, \theta) = E_\theta\left[\log \frac{p(X;\theta)}{p(X;\hat\theta)}\right] = \int p(x;\theta)\,\log \frac{p(x;\theta)}{p(x;\hat\theta)}\,dx,$$

which is the Kullback–Leibler divergence, i.e. the population risk is the KL divergence $KL(p(x;\theta)\,\|\,p(x;\hat\theta))$. Notice that the empirical risk is a sum of i.i.d. terms, so by the LLN we have, for any fixed $\tilde\theta$,

$$R_n(\tilde\theta, \theta) \xrightarrow{p} R(\tilde\theta, \theta).$$

To analyze empirical risk minimization we needed a uniform LLN, and we will need exactly this to show consistency. An important property of the KL divergence is that it is zero iff $p(x;\theta) = p(x;\hat\theta)$ almost everywhere (i.e. they are equal except on sets of measure 0). The main thing to remember is the connection between the MLE and the KL divergence.
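To make the LLN claim concrete, here is a minimal numerical sketch (our own illustration in Python/numpy, not part of the notes) for the unit-variance Gaussian location model, where the KL divergence has the closed form $KL(N(\theta,1)\,\|\,N(\tilde\theta,1)) = (\theta - \tilde\theta)^2/2$: the empirical risk $R_n(\tilde\theta,\theta)$ converges to this value as $n$ grows.

```python
import numpy as np

# X ~ N(theta, 1). Closed form: KL(N(theta,1) || N(theta_tilde,1)) = (theta - theta_tilde)^2 / 2.
rng = np.random.default_rng(0)
theta, theta_tilde = 0.0, 1.5

def log_density(x, t):
    # log of the N(t, 1) density
    return -0.5 * (x - t) ** 2 - 0.5 * np.log(2 * np.pi)

for n in [100, 10_000, 1_000_000]:
    X = rng.normal(theta, 1.0, size=n)
    # Empirical risk R_n(theta_tilde, theta): average log-likelihood ratio.
    R_n = np.mean(log_density(X, theta) - log_density(X, theta_tilde))
    print(n, R_n, "->", (theta - theta_tilde) ** 2 / 2)
```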

2.2 Conditions for consistency

Condition 1: Identifiability: A basic requirement for constructing any consistent estimator is that the model be identifiable, i.e. if $\theta_1 \neq \theta_2$ then it must be the case that $p(x;\theta_1) \neq p(x;\theta_2)$.
We will in general require something slightly stronger than this:
Condition 2: Strong identifiability: We assume that for every $\epsilon > 0$,

$$\inf_{\tilde\theta : |\tilde\theta - \theta| \ge \epsilon} KL(p(x;\theta)\,\|\,p(x;\tilde\theta)) > 0.$$

This condition is essentially the same as Condition 1, except that it does not allow the
difference between the two distributions to be vanishingly small. The two conditions are
equivalent if θ is restricted to lie in a compact set.
Condition 3: Uniform LLN: Assume that

$$\sup_{\tilde\theta}\,|R_n(\tilde\theta,\theta) - R(\tilde\theta,\theta)| \xrightarrow{p} 0.$$

This condition is a uniform LLN. As we have seen before, it holds for instance if the Rademacher complexity of the class of functions of the form $f_{\tilde\theta}(X) = \log p(X;\theta)/p(X;\tilde\theta)$ is not too large.

Theorem 1 Suppose that Conditions 2 and 3 above hold. Then the MLE is consistent.

Proof: Fix an $\epsilon > 0$. Using the strong identifiability condition, there is an $\eta > 0$ such that

$$KL(p(x;\theta)\,\|\,p(x;\tilde\theta)) \ge \eta$$

if $|\tilde\theta - \theta| \ge \epsilon$. We will show that for the MLE $\hat\theta$, we have $KL(p(x;\theta)\,\|\,p(x;\hat\theta)) \le \eta$ with probability tending to 1 as $n \to \infty$. This in turn implies that $|\hat\theta - \theta| \le \epsilon$, which implies that $\hat\theta \xrightarrow{p} \theta$. It remains to show that $KL(p(x;\theta)\,\|\,p(x;\hat\theta)) \le \eta$ as $n \to \infty$. Notice that

$$KL(p(X;\theta)\,\|\,p(X;\hat\theta)) = R(\hat\theta, \theta) = R(\hat\theta, \theta) - R_n(\hat\theta, \theta) + R_n(\hat\theta, \theta) \overset{(i)}{\le} R(\hat\theta, \theta) - R_n(\hat\theta, \theta) \xrightarrow{p} 0,$$

where the final convergence simply uses Condition 3. The inequality (i) follows since

$$R_n(\hat\theta, \theta) = \frac{1}{n}\sum_{i=1}^n \log \frac{p(X_i;\theta)}{p(X_i;\hat\theta)} \le 0,$$

since $\hat\theta$ maximizes the likelihood, so $\prod_i p(X_i;\hat\theta) \ge \prod_i p(X_i;\theta)$.
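As a sanity check on the theorem, the following sketch (our own; the Cauchy model and the grid-search maximizer are choices made for illustration) computes the MLE of a Cauchy location parameter, which has no closed form, and shows the estimates concentrating around the true $\theta$ as $n$ grows.

```python
import numpy as np

# Cauchy location model: p(x; theta) = 1 / (pi * (1 + (x - theta)^2)).
# No closed-form MLE, so maximize the log-likelihood over a fine grid.
rng = np.random.default_rng(1)
theta_true = 2.0
grid = np.linspace(-5.0, 10.0, 3001)

def mle(X):
    # Log-likelihood at every grid point (constants dropped), summed over the sample.
    ll = -np.sum(np.log1p((X[:, None] - grid[None, :]) ** 2), axis=0)
    return grid[np.argmax(ll)]

for n in [50, 500, 5000]:
    ests = np.array([mle(rng.standard_cauchy(n) + theta_true) for _ in range(100)])
    print(n, ests.mean(), ests.std())   # spread shrinks as n grows
```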

3 Inconsistency of the MLE


The MLE can fail to be consistent. When the model is not identifiable it is clear that we cannot have consistent estimators. The other possible failure is the failure of the uniform law. This typically happens when the parameter space is too large. Here is a simple example:
Example: Suppose that we measure some outcome (say blood sugar) for $n$ individuals using a machine. We do it twice for every individual so that we can assess the variability of the machine, i.e. suppose we observe:
$$Y_{11}, Y_{12} \sim N(\mu_1, \sigma^2)$$
$$\vdots$$
$$Y_{n1}, Y_{n2} \sim N(\mu_n, \sigma^2),$$

and want to estimate $\sigma^2$. Even though we only want to estimate $\sigma^2$, the model has a growing number of parameters $\mu_1, \dots, \mu_n, \sigma^2$, and the MLE for $\sigma^2$ will depend on estimating the $\mu_i$. Formally, we can see that the MLE for the means is:

$$\hat\mu_i = \frac{Y_{i1} + Y_{i2}}{2}.$$
The log-likelihood for $\sigma^2$ can be written as:

$$LL(\sigma^2, \mu) = -n\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n\left[(Y_{i1} - \mu_i)^2 + (Y_{i2} - \mu_i)^2\right],$$

which is maximized when we take:

$$\hat\sigma^2 = \frac{1}{2n}\sum_{i=1}^n\left[(Y_{i1} - \hat\mu_i)^2 + (Y_{i2} - \hat\mu_i)^2\right] = \frac{1}{4n}\sum_{i=1}^n (Y_{i1} - Y_{i2})^2.$$
Notice that

$$E[\hat\sigma^2] = \frac{\sigma^2}{2},$$
so by the LLN the MLE converges in probability to $\sigma^2/2$ and is therefore inconsistent. One could easily fix this in this particular problem (by multiplying the MLE by 2), but more generally this could be tricky. We note that in this type of problem, where the number of parameters is not fixed (and grows with the sample size), it is not even clear how to define convergence of the log-likelihood, since its limit changes with the sample size.
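A quick simulation (ours, in numpy) makes the failure visible: no matter how large $n$ gets, $\hat\sigma^2$ settles at $\sigma^2/2$ rather than $\sigma^2$.

```python
import numpy as np

# Neyman-Scott setup: two measurements per individual, each with its own mean mu_i.
rng = np.random.default_rng(2)
sigma2 = 4.0

for n in [100, 10_000, 1_000_000]:
    mu = rng.normal(0.0, 10.0, size=n)            # nuisance means mu_1, ..., mu_n
    Y1 = rng.normal(mu, np.sqrt(sigma2))
    Y2 = rng.normal(mu, np.sqrt(sigma2))
    sigma2_mle = np.mean((Y1 - Y2) ** 2) / 4.0    # (1/4n) * sum_i (Y_i1 - Y_i2)^2
    print(n, sigma2_mle, "vs true", sigma2)       # converges to sigma2 / 2 = 2.0
```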

4 MLE under misspecification


In statistical modeling we do not typically believe the model is correct, i.e. that the samples were in fact generated by some distribution in our model. Rather, we think of the model as a useful idealization or simplification. In this (more realistic) case, one might wonder what the MLE converges to, or whether it converges at all.
Suppose $X_1, \dots, X_n \sim q$, and we compute $\hat\theta_{\rm MLE}$; what can we say about our estimate? To answer this, we can follow a similar argument to what we did at the beginning of the lecture and observe that at the population level (i.e. with infinite samples) the MLE is:

$$\hat\theta_{\rm MLE} = \arg\max_{\theta\in\Theta} E_q[\log p(X;\theta)].$$

How do we interpret this statement? As before, we can re-write it in terms of KL divergences and see that:

$$KL(q\,\|\,p_{\hat\theta_{\rm MLE}}) \le KL(q\,\|\,p_\theta) \quad \text{for all } \theta\in\Theta.$$

So at the population level, when $q$ does not belong to our model, the MLE is essentially estimating the KL projection of the data-generating distribution $q$ onto our model. One can also impose conditions similar to those of the last section (uniform law + strong identifiability) to complete the consistency argument under model misspecification.
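To illustrate, here is a sketch of our own: data drawn from an exponential distribution but fit with a (misspecified) Gaussian model $N(\mu, \sigma^2)$. The Gaussian MLE is the sample mean and sample variance, and the KL projection of $q$ onto the Gaussian family is the Gaussian matching $q$'s mean and variance, so the MLE converges to those pseudo-true values.

```python
import numpy as np

# q = Exponential(rate 1): mean 1, variance 1. Model: N(mu, sigma^2), which is wrong.
# The KL projection of q onto the Gaussian family matches q's mean and variance,
# so the MLE (sample mean, sample variance) should converge to (1, 1).
rng = np.random.default_rng(3)

for n in [100, 10_000, 1_000_000]:
    X = rng.exponential(1.0, size=n)
    print(n, X.mean(), X.var())   # -> (1, 1), the pseudo-true parameters
```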

5 Limiting Distribution of the MLE


Now we will address the question of the asymptotic distribution of the MLE. This is analogous to the CLT, which gave the asymptotic distribution of averages. In some cases, we can do this directly. For instance, if $X_1, \dots, X_n \sim \text{Ber}(p)$ then the MLE is just the average:

$$\hat p = \frac{1}{n}\sum_{i=1}^n X_i,$$

and so we know by the CLT:

$$\sqrt{n}\,\frac{\hat p - p}{\sqrt{p(1-p)}} \xrightarrow{d} N(0, 1),$$

which tells us the asymptotic distribution of the MLE.
More generally, however, the MLE need not be a simple average of i.i.d. terms, but the main take-away is that asymptotically it often behaves like one.
Recall that the score function is

$$s(\theta) = \sum_{i=1}^n \nabla_\theta \log p(X_i;\theta),$$

which is the gradient of the log-likelihood, and the Fisher Information of the sample is

$$I_n(\theta) = E[s(\theta)s(\theta)^T].$$

We showed that $s(\theta)$ has mean 0, so $I_n(\theta) = \text{Var}(s(\theta))$. The Fisher information is alternatively the negative expected Hessian of the log-likelihood:

$$I_n(\theta) = -E\left[\sum_{i=1}^n \nabla^2_\theta \log p(X_i;\theta)\right].$$

It is worth remembering that the score is data-dependent, while the Fisher Information is not (it is an expectation over the data, so it does not depend on the values of $X_1, \dots, X_n$).
Let $\hat\theta$ denote the MLE. Our goal is to show that (under enough regularity conditions),

$$\sqrt{n}(\hat\theta - \theta) \xrightarrow{d} N(0, [I_1(\theta)]^{-1}),$$

where $I_1(\theta)$ denotes the Fisher information of a single observation (so that $I_n(\theta) = n\,I_1(\theta)$ for i.i.d. data).

6 Counterexample
The usual counterexample to the above convergence in distribution is the MLE for the uniform distribution, for which most of the regularity conditions fail. Formally, we observe $X_1, \dots, X_n \sim U[0,\theta]$ and want to estimate $\theta$. The log-likelihood is:

$$\ell(\theta) = \log\left(\frac{1}{\theta^n}\, I\Big(\theta \ge \max_i X_i\Big)\right).$$

The MLE is $\hat\theta = \max_i X_i$. Observe that the log-likelihood is not differentiable at the MLE, so the Fisher information is not defined there.
Another thing that we used frequently in deriving the equivalent forms of the Fisher information was the exchange of derivatives (with respect to $\theta$) and integrals (with respect to $X$). This in general does not work if the domain of integration depends on the parameter with respect to which we are taking the derivative. For the uniform distribution the support of the density depends on the parameter.
On the other hand, things are usually nice for exponential families. They automatically satisfy all the regularity conditions (provided the family is identifiable, i.e. full-rank and minimal), and the MLE is extremely well-behaved in such models.
Returning to the uniform case, we can directly analyze the distribution of the MLE. In a previous lecture we showed that

$$n(\hat\theta - \theta) \xrightarrow{d} -\text{Exp}(1/\theta)$$

(we did this when $\theta = 1$, but you can work out the general case). It follows that $\sqrt{n}(\hat\theta - \theta) \xrightarrow{d} \delta_0$, where $\delta_0$ is a point mass at 0, so the MLE does not have a Gaussian limit.
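A simulation (our sketch) confirms the non-Gaussian limit: $n(\theta - \hat\theta)$ is supported on $[0,\infty)$ and skewed, matching an exponential with mean $\theta$ rather than anything Gaussian.

```python
import numpy as np

# Uniform[0, theta]: MLE is max X_i, and n * (theta - max X_i) has an Exp(mean theta) limit.
rng = np.random.default_rng(4)
theta, n, reps = 1.0, 1_000, 20_000

X_max = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)
Z = n * (theta - X_max)
# For Exp(mean theta): mean = std = theta, median = theta * ln 2, no mass below 0.
print(Z.mean(), Z.std(), np.median(Z), np.mean(Z < 0))
```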

7 MLE asymptotics
We will only attempt a heuristic calculation here. If you are curious to see a rigorous proof with minimal regularity assumptions, you should look at van der Vaart's book Asymptotic Statistics. Here is a list of some sufficient regularity conditions:

1. The dimension of the parameter space does not change with $n$, i.e. $\theta \in \mathbb{R}^d$ and $d$ is fixed. We have seen that if $d$ grows the MLE need not even be consistent.

2. $p(x;\theta)$ is a smooth (thrice differentiable) function of $\theta$.

3. We can interchange differentiation with respect to $\theta$ and integration over $X$. This in turn requires that the range of $X$ does not depend on $\theta$, and some integrability conditions on $p(x;\theta)$.

4. The parameter $\theta$ is identifiable.

5. If the parameter space is restricted, i.e. $\theta \in \Theta$ for some set $\Theta$, then $\theta$ is in the interior of the set $\Theta$ (i.e. it cannot be on the boundary).

We will focus on the case when the parameter is one-dimensional, although everything carries
over almost exactly in the general (fixed) d case.

Theorem 2 Under the regularity conditions above,

$$\sqrt{n}(\hat\theta - \theta) \xrightarrow{d} N(0, 1/I(\theta)),$$

where $I(\theta) = I_1(\theta)$ here denotes the Fisher information of a single observation.

We note that under the conditions of the theorem one can verify that the MLE is consistent, i.e. that $\hat\theta \xrightarrow{p} \theta$. The basic idea is to verify that, under the differentiability assumptions on the density, we can effectively treat the parameter space as compact, then derive a uniform law of large numbers, and then apply the proof from the previous lecture notes. This is a complicated technical proof, but you can look it up by searching for Wald's proof of the consistency of the MLE.
The proof will use all the facts about scores and the Fisher information that we derived
earlier.
Proof: To begin with let us note the following fact: if $\hat\theta \xrightarrow{p} \theta$, then

$$E_\theta[-\nabla^2_\theta \log p(X;\hat\theta)] \xrightarrow{p} E_\theta[-\nabla^2_\theta \log p(X;\theta)] = I(\theta).$$

Since $\hat\theta$ maximizes the log-likelihood, we know that the derivative of the log-likelihood at $\hat\theta$ must be 0, i.e.

$$\ell'(\hat\theta) = 0.$$

Formally you need to know that $\hat\theta$ is not on the boundary of the parameter space. To prove this you will need to use the fact that $\theta$ is not on the boundary and that $\hat\theta \xrightarrow{p} \theta$.
By a Taylor expansion of the derivative of the log-likelihood we obtain that

$$0 = \ell'(\hat\theta) = \ell'(\theta) + (\hat\theta - \theta)\,\ell''(\tilde\theta),$$

where $\tilde\theta$ is some point in between $\hat\theta$ and $\theta$. This in turn gives us that

$$(\hat\theta - \theta) = \frac{\ell'(\theta)}{-\ell''(\tilde\theta)},$$

so that

$$\sqrt{n}(\hat\theta - \theta) = \frac{\ell'(\theta)/\sqrt{n}}{-\ell''(\tilde\theta)/n}.$$

We will look at the numerator and denominator separately. The denominator is:

$$-\frac{\ell''(\tilde\theta)}{n} = \frac{1}{n}\sum_{i=1}^n -\nabla^2_\theta \log p(X_i;\tilde\theta) \xrightarrow{p} E_\theta[-\nabla^2_\theta \log p(X;\tilde\theta)] \xrightarrow{p} E_\theta[-\nabla^2_\theta \log p(X;\theta)] = I(\theta),$$

where the last step uses the fact that $\tilde\theta \xrightarrow{p} \theta$.
The numerator is just the score function, i.e.

$$\frac{1}{\sqrt{n}}\,\ell'(\theta) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \nabla_\theta \log p(X_i;\theta) = \sqrt{n} \times \frac{1}{n}\sum_{i=1}^n \Big[\nabla_\theta \log p(X_i;\theta) - E[\nabla_\theta \log p(X;\theta)]\Big] \xrightarrow{d} N(0, \text{Var}(\nabla_\theta \log p(X;\theta))) = N(0, I(\theta)),$$

where we used the facts that the score has mean 0, that the variance of the score is the Fisher information, and that by the CLT $\sqrt{n}$ times an average of i.i.d. terms minus its expectation converges in distribution to a normal.
Putting the pieces together via Slutsky's theorem, we obtain that

$$\sqrt{n}(\hat\theta - \theta) \xrightarrow{d} \frac{1}{I(\theta)}\,N(0, I(\theta)) = N(0, 1/I(\theta)),$$

which is what we wanted to prove. $\square$
Example: Suppose that $X_1, \dots, X_n \sim \text{Exp}(\theta)$ (rate parameter $\theta$); then the log-likelihood is

$$\ell(\theta) = n\log\theta - \theta\sum_{i=1}^n X_i.$$

The score function is

$$s(\theta) = \frac{n}{\theta} - \sum_{i=1}^n X_i,$$

and the Fisher information is

$$I(\theta) = \frac{n}{\theta^2}.$$

The MLE is $\hat\theta = 1/\bar{X}$. So we can use the above result to conclude that, approximately,

$$\hat\theta - \theta \approx N\left(0, \frac{\theta^2}{n}\right).$$
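A quick numerical check of this conclusion (ours; note that numpy parameterizes the exponential by its mean $1/\theta$, not its rate):

```python
import numpy as np

# X_i ~ Exp(rate theta): theta_hat = 1 / mean(X), and sqrt(n)*(theta_hat - theta)
# should be approximately N(0, theta^2).
rng = np.random.default_rng(5)
theta, n, reps = 2.0, 2_000, 5_000

X = rng.exponential(1.0 / theta, size=(reps, n))   # scale = mean = 1/rate
theta_hat = 1.0 / X.mean(axis=1)
Z = np.sqrt(n) * (theta_hat - theta)
print(Z.mean(), Z.var(), "vs target variance", theta ** 2)
```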

8 Influence Functions and Regular Asymptotically Linear Estimators

We could have followed a similar proof as above to conclude that the MLE can be written as:

$$\hat\theta = \theta + \frac{1}{n}\sum_{i=1}^n \frac{\nabla_\theta \log p(X_i;\theta)}{I(\theta)} + \text{Remainder},$$

where $I(\theta)$ is the Fisher information of a single observation and the remainder is small (roughly proportional to the previous term multiplied by $[I(\tilde\theta) - I(\theta)] \to 0$). The term

$$\psi(x) = \frac{\nabla_\theta \log p(x;\theta)}{I(\theta)}$$

is called the influence function.
Thinking of a complex predictor like a deep neural network, one can try to obtain some information about the predictor by computing the influence of training images on the final predictor. A paper that did this (and quite a bit more), Koh and Liang's "Understanding Black-box Predictions via Influence Functions," won ICML's best paper award in 2017.
Returning to the expression:

$$\hat\theta \approx \theta + \frac{1}{n}\sum_{i=1}^n \psi(X_i).$$

Estimators that satisfy this type of expansion are called asymptotically linear estimators (many non-MLE estimators also satisfy expansions of this form). There is a classical result due to Le Cam that any sufficiently well-behaved (regular) estimator is asymptotically linear. It is not easy to prove (see van der Vaart's book). Together with the Cramér–Rao lower bound, this implies that the MLE is the "best regular asymptotically linear estimator".
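Continuing the exponential example from the previous section (our own sketch), the expansion can be checked numerically: for $\text{Exp}(\theta)$ the score of one observation is $1/\theta - x$ and $I_1(\theta) = 1/\theta^2$, so $\psi(x) = \theta - \theta^2 x$, and the error $\hat\theta - \theta$ tracks the averaged influence function closely.

```python
import numpy as np

# Exp(rate theta): psi(x) = theta^2 * (1/theta - x) = theta - theta^2 * x.
rng = np.random.default_rng(6)
theta, n = 2.0, 100_000

X = rng.exponential(1.0 / theta, size=n)
theta_hat = 1.0 / X.mean()
linear_term = np.mean(theta - theta ** 2 * X)   # (1/n) * sum_i psi(X_i)
print(theta_hat - theta, linear_term)           # nearly identical; the remainder is tiny
```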

9 Asymptotic Relative Efficiency


Once you restrict attention to asymptotically Normal estimators, comparing estimators in terms of their MSE boils down to comparing their variances. Specifically, if

$$\sqrt{n}(W_n - \tau(\theta)) \xrightarrow{d} N(0, \sigma_W^2), \qquad \sqrt{n}(V_n - \tau(\theta)) \xrightarrow{d} N(0, \sigma_V^2),$$

then the asymptotic relative efficiency (ARE) is

$$ARE(V_n, W_n) = \frac{\sigma_W^2}{\sigma_V^2}.$$

Example 3 Let $X_1, \dots, X_n \sim \text{Poisson}(\lambda)$. The MLE of $\lambda$ is $\bar{X}$. Let

$$\tau = P(X_i = 0) = e^{-\lambda}.$$

Define $Y_i = I(X_i = 0)$. This suggests the estimator

$$W_n = \frac{1}{n}\sum_{i=1}^n Y_i.$$

Another estimator is the MLE

$$V_n = e^{-\hat\lambda}.$$

The delta method gives

$$\text{Var}(V_n) \approx \frac{\lambda e^{-2\lambda}}{n}.$$

We have

$$\sqrt{n}(W_n - \tau) \xrightarrow{d} N(0, e^{-\lambda}(1 - e^{-\lambda})), \qquad \sqrt{n}(V_n - \tau) \xrightarrow{d} N(0, \lambda e^{-2\lambda}).$$

So

$$ARE(W_n, V_n) = \frac{\lambda}{e^\lambda - 1} \le 1. \ \square$$

Since the MLE is efficient, we know that, in general, $ARE(W_n, \text{MLE}) \le 1$.
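A simulation of the two estimators (our sketch) reproduces this ARE:

```python
import numpy as np

# Compare W_n (fraction of zeros) with V_n = exp(-lambda_hat) for tau = e^{-lambda}.
rng = np.random.default_rng(7)
lam, n, reps = 2.0, 1_000, 20_000

X = rng.poisson(lam, size=(reps, n))
W = np.mean(X == 0, axis=1)
V = np.exp(-X.mean(axis=1))
print("simulated ARE(W_n, V_n):", V.var() / W.var())
print("theory:                 ", lam / (np.exp(lam) - 1))   # ~ 0.313 for lambda = 2
```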

10 Multivariate Case
Now let $\theta = (\theta_1, \dots, \theta_k)$. In this case we have

$$\sqrt{n}(\hat\theta - \theta) \xrightarrow{d} N(0, I^{-1}(\theta)),$$

where $I^{-1}(\theta)$ is the inverse of the (single-observation) Fisher information matrix. The approximate standard error of $\hat\theta_j$ is $\sqrt{I^{-1}_{jj}/n}$. If $\tau = g(\theta)$ with $g : \mathbb{R}^k \to \mathbb{R}$, then by the delta method,

$$\sqrt{n}(\hat\tau - \tau) \xrightarrow{d} N(0, (g')^T I^{-1} g'),$$

where $g'$ is the gradient of $g$ evaluated at $\theta$.
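As a concrete check (our own sketch, not from the notes), take $\theta = (\mu, \sigma^2)$ in a Gaussian model and $\tau = g(\theta) = \mu/\sigma$. The single-observation Fisher information is $\text{diag}(1/\sigma^2, 1/(2\sigma^4))$, and the delta method gives asymptotic variance $(g')^T I^{-1} g' = 1 + \mu^2/(2\sigma^2)$, which a simulation reproduces:

```python
import numpy as np

# theta = (mu, sigma^2) for N(mu, sigma^2); tau = g(theta) = mu / sigma.
# I(theta) = diag(1/sigma^2, 1/(2 sigma^4)), g' = (1/sigma, -mu/(2 sigma^3)),
# so (g')^T I^{-1} g' = 1 + mu^2 / (2 sigma^2).
rng = np.random.default_rng(8)
mu, sigma, n, reps = 1.0, 2.0, 2_000, 10_000

X = rng.normal(mu, sigma, size=(reps, n))
tau_hat = X.mean(axis=1) / X.std(axis=1)   # plug-in MLE of mu/sigma (std uses ddof=0, the MLE)
Z = np.sqrt(n) * (tau_hat - mu / sigma)
print(Z.var(), "vs theory", 1 + mu ** 2 / (2 * sigma ** 2))   # theory = 1.125 here
```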
