STA732
Statistical Inference
Lecture 06: Information Inequality
Yuansi Chen
Spring 2023
Duke University
https://www2.stat.duke.edu/courses/Spring23/sta732.01/
Recap from Lecture 05
• Convex loss and Jensen’s inequality
• Rao-Blackwell Theorem allows us to improve an estimator
using sufficient statistics
• UMVU exists and is unique when the estimand is U-estimable
and complete sufficient statistics exist
Goal of Lecture 06
1. Second thoughts about bias
2. Log-likelihood, score and Fisher information
3. Cramér-Rao lower bound
4. Hammersley-Chapman-Robbins inequality
Chap. 4.2, 4.5-4.6 in Keener or Chap. 2.5 in Lehmann and Casella
Second thoughts about bias
Admissibility
Def. Admissible
An estimator 𝛿 is called inadmissible if there exists 𝛿∗ which has a better risk:
𝑅(𝜃, 𝛿∗) ≤ 𝑅(𝜃, 𝛿) for all 𝜃 ∈ Ω, with 𝑅(𝜃1, 𝛿∗) < 𝑅(𝜃1, 𝛿) for some 𝜃1 ∈ Ω.
We also say that 𝛿∗ dominates 𝛿.
Uniform distribution example from last lecture
𝑋1 , … , 𝑋𝑛 are i.i.d. from the uniform distribution on (0, 𝜃).
𝑇 = max {𝑋1 , … , 𝑋𝑛 } is complete sufficient.
• We have derived that ((𝑛 + 1)/𝑛) 𝑇 is UMVU for estimating 𝜃.
• Among estimators that are multiples of 𝑇, is the UMVU estimator admissible?
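A minimal Monte Carlo sketch (my own illustration, not from the slides): among estimators 𝑐𝑇, a standard moment calculation gives the MSE-minimizing constant 𝑐 = (𝑛 + 2)/(𝑛 + 1), which differs from the UMVU choice (𝑛 + 1)/𝑛; simulating both suggests the answer to the question above.

import numpy as np

rng = np.random.default_rng(0)
n, theta, reps = 10, 1.0, 200_000
T = rng.uniform(0, theta, size=(reps, n)).max(axis=1)   # complete sufficient statistic

# same draws of T for both constants, so the comparison is paired
for c, label in [((n + 1) / n, "UMVU constant (n+1)/n"),
                 ((n + 2) / (n + 1), "MSE-optimal constant (n+2)/(n+1)")]:
    mse = np.mean((c * T - theta) ** 2)
    print(f"{label}: estimated MSE = {mse:.6f}")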
Gaussian sequence model example
𝑋𝑖 ∼ 𝒩(𝜇𝑖, 1), 𝑖 = 1, …, 𝑛, independent. Want to estimate ‖𝜇‖₂², where 𝜇 = (𝜇1, …, 𝜇𝑛)⊤.
• Find a UMVU estimator: ‖𝑋‖₂² − 𝑛
• Can we find a better estimator (if 𝜇 = 0)?
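A minimal Monte Carlo sketch (my own illustration): at 𝜇 = 0 the UMVU estimator is negative with substantial probability even though ‖𝜇‖₂² ≥ 0, and simply truncating it at zero reduces the squared error at this point.

import numpy as np

rng = np.random.default_rng(1)
n, reps = 10, 200_000
X = rng.normal(0.0, 1.0, size=(reps, n))        # mu = 0, so the true value of ||mu||_2^2 is 0
umvu = (X ** 2).sum(axis=1) - n                 # UMVU estimator ||X||_2^2 - n
trunc = np.maximum(umvu, 0.0)                   # biased competitor: truncate at zero

print("P(UMVU estimate < 0):", np.mean(umvu < 0))
print("MSE of UMVU at mu=0 :", np.mean(umvu ** 2))
print("MSE of truncation   :", np.mean(trunc ** 2))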
Thoughts about unbiased estimators
• A UMVU estimator is not necessarily admissible!
• It might even be absurd (Ex 4.7 in Keener)
• It is a good estimator to start with, but in general we shall not
insist on UMVU
Log-likelihood, score and Fisher
information
Log-likelihood
Suppose 𝑋 has distribution from a family P = {𝑃𝜃 , 𝜃 ∈ Ω}.
Assume each 𝑃𝜃 has density 𝑝𝜃 with respect to a common measure 𝜇, and that all the densities share the common support {𝑥 ∣ 𝑝𝜃(𝑥) > 0}. The log-likelihood is
ℓ(𝜃; 𝑋) = log 𝑝𝜃 (𝑋)
Score
Def. Score
The score is defined as the gradient of the log-likelihood with
respect to the parameter vector
∇ℓ(𝜃; 𝑋)
Remark
• can treat it as a “local sufficient statistic”: for 𝜉 ≈ 0,
𝑝𝜃0+𝜉(𝑥) = exp ℓ(𝜃0 + 𝜉; 𝑥) ≈ exp[𝜉⊤ ∇ℓ(𝜃0; 𝑥)] ⋅ 𝑝𝜃0(𝑥)
• indicates the sensitivity of the model to infinitesimal changes in 𝜃.
Expected value of score is zero
Under enough regularity conditions, we have
𝔼𝜃 [∇ℓ(𝜃; 𝑋)] = 0
Proof:
1 = ∫ exp ℓ(𝜃; 𝑥)𝑑𝜇(𝑥)
Differentiating with respect to 𝜃𝑗 (under regularity conditions) gives
0 = ∫ 𝜕ℓ(𝜃; 𝑥)/𝜕𝜃𝑗 ⋅ exp ℓ(𝜃; 𝑥) 𝑑𝜇(𝑥)
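A quick numerical check of this identity (my own sketch, using an Exponential(𝜃) model with density 𝜃 exp(−𝜃𝑥) as the example):

import numpy as np

rng = np.random.default_rng(2)
theta, reps = 2.0, 1_000_000
X = rng.exponential(scale=1 / theta, size=reps)        # p_theta(x) = theta * exp(-theta * x)
score = 1 / theta - X                                  # d/dtheta of log(theta) - theta * x
print("Monte Carlo mean of the score:", score.mean())  # should be close to 0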
Fisher information
Def. Fisher information
For 𝜃 taking values in ℝ𝑠, the Fisher information is the 𝑠 × 𝑠 matrix
𝐼(𝜃) = Cov𝜃(∇ℓ(𝜃; 𝑋)) = 𝔼𝜃[−∇²ℓ(𝜃; 𝑋)]
why are the two definitions equivalent?
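A numerical check of the equivalence (my own sketch, using a Poisson(𝜃) model): both the variance of the score and the expected negative second derivative should estimate 𝐼(𝜃) = 1/𝜃.

import numpy as np

rng = np.random.default_rng(3)
theta, reps = 3.0, 1_000_000
X = rng.poisson(theta, size=reps)

score = X / theta - 1          # d/dtheta of x*log(theta) - theta - log(x!)
neg_hess = X / theta ** 2      # minus the second derivative
print("Var of the score      :", score.var())
print("E[- second derivative]:", neg_hess.mean())
print("1 / theta             :", 1 / theta)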
Cramér-Rao lower bound
Cramér-Rao lower bound in 1-dimension case
Consider an estimator 𝛿(𝑋) which is unbiased for 𝑔(𝜃). Then
𝑔(𝜃) = 𝔼𝜃 𝛿
Under enough regularity
𝑔′(𝜃) = ∫ 𝛿(𝑥) ℓ′(𝜃; 𝑥) exp ℓ(𝜃; 𝑥) 𝑑𝜇(𝑥) = 𝔼𝜃[𝛿 ℓ′(𝜃; 𝑋)]
Thm 4.9 in Keener
Let P = {𝑃𝜃 ∶ 𝜃 ∈ Ω} be a dominated family with differentiable densities 𝑝𝜃. Under enough regularity conditions (𝔼𝜃 ℓ′ = 0, 𝔼𝜃 𝛿² < ∞, 𝑔′ well defined), we have
Var𝜃(𝛿) ≥ [𝑔′(𝜃)]² / 𝐼(𝜃), for all 𝜃 ∈ Ω
called the Cramér-Rao lower bound or information lower bound
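A numerical illustration of the theorem (my own sketch): for 𝑛 i.i.d. Bernoulli(𝜃) observations the sample mean is unbiased for 𝑔(𝜃) = 𝜃 and its variance equals the bound 𝜃(1 − 𝜃)/𝑛, so the CRLB is attained in this example.

import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 0.3, 25, 200_000
X = rng.binomial(1, theta, size=(reps, n))
delta = X.mean(axis=1)                    # unbiased for g(theta) = theta

crlb = theta * (1 - theta) / n            # [g'(theta)]^2 / I_n(theta), with I_n(theta) = n / (theta (1 - theta))
print("Monte Carlo Var(delta):", delta.var())
print("Cramér-Rao lower bound:", crlb)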
Proof idea: Cauchy-Schwarz inequality
Cramér-Rao lower bound in high dimension
For 𝜃 ∈ ℝ𝑠 and 𝛿 unbiased for 𝑔(𝜃), we have
Var𝜃(𝛿) ≥ ∇𝑔(𝜃)⊤ 𝐼(𝜃)⁻¹ ∇𝑔(𝜃)
Interpretation of the Cramér-Rao lower bound
• To estimate 𝑔(𝜃), no unbiased estimator can have variance smaller than ∇𝑔(𝜃)⊤ 𝐼(𝜃)⁻¹ ∇𝑔(𝜃)
• For an unbiased estimator 𝛿 and any random variable 𝜓, we always have a lower bound of the form
Var𝜃(𝛿) ≥ Cov𝜃(𝛿, 𝜓)² / Var𝜃(𝜓)
What is a good 𝜓?
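A small sketch of why 𝜓 = ∇ℓ(𝜃; 𝑋) is the good choice (my own illustration, for the 𝒩(𝜃, 1) model with 𝛿 the sample mean): the score gives the tight bound 1/𝑛, while an arbitrary mean-zero 𝜓 can give a much weaker one.

import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 0.0, 10, 500_000
X = rng.normal(theta, 1.0, size=(reps, n))
delta = X.mean(axis=1)                          # unbiased for theta

def bound(psi):
    # Cov_theta(delta, psi)^2 / Var_theta(psi)
    return np.cov(delta, psi)[0, 1] ** 2 / psi.var()

score = (X - theta).sum(axis=1)                 # psi = score: yields the CRLB, tight here
other = X[:, 0] - theta                         # another mean-zero psi: a much weaker bound
print("Var(delta)               :", delta.var())   # about 1/n
print("bound with psi = score   :", bound(score))  # about 1/n
print("bound with psi = X1-theta:", bound(other))  # about 1/n^2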
Example: Cramér-Rao lower bound for i.i.d. samples
Suppose 𝑋1, …, 𝑋𝑛 are i.i.d. with single-observation density 𝑝𝜃^{(1)}, 𝜃 ∈ Ω. The joint density is
𝑝𝜃(𝑥) = ∏_{𝑖=1}^{𝑛} 𝑝𝜃^{(1)}(𝑥𝑖)
What is the relationship between Fisher information for 𝑛 i.i.d.
observations and that for a single observation?
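A numerical check of the additivity answer (my own sketch, using i.i.d. Exponential(𝜃) observations): the score of the joint density is the sum of per-observation scores, so its variance is about 𝑛 times that of a single score, i.e. 𝐼𝑛(𝜃) = 𝑛 𝐼1(𝜃).

import numpy as np

rng = np.random.default_rng(6)
theta, n, reps = 2.0, 5, 500_000
X = rng.exponential(scale=1 / theta, size=(reps, n))

score_joint = (1 / theta - X).sum(axis=1)   # score of the joint density: sum of per-observation scores
score_one = 1 / theta - X[:, 0]
print("Var of joint score  :", score_joint.var())    # estimates I_n(theta) = n / theta^2
print("n * Var of one score:", n * score_one.var())  # estimates n * I_1(theta)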
Efficiency
CRLB is not always attainable
Def. efficiency
The efficiency of an unbiased estimator 𝛿 is
eff𝜃(𝛿) = CRLB / Var𝜃(𝛿)
Remark
• According to the definition and the Cramér-Rao lower bound,
for “regular” unbiased estimators, eff𝜃 (𝛿) ≤ 1
• Efficiency 1 is rarely achieved in finite samples, but usually we
can approach it asymptotically as 𝑛 → ∞
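A Monte Carlo sketch of an estimator with efficiency below 1 (my own illustration): for 𝒩(𝜃, 1) data the sample median is unbiased for 𝜃 by symmetry, and its efficiency is close to 2/𝜋 ≈ 0.64 for large 𝑛, while the sample mean has efficiency essentially 1.

import numpy as np

rng = np.random.default_rng(7)
theta, n, reps = 0.0, 101, 200_000
X = rng.normal(theta, 1.0, size=(reps, n))

crlb = 1.0 / n                                 # I_n(theta) = n for the N(theta, 1) model
med = np.median(X, axis=1)
print("efficiency of sample median:", crlb / med.var())             # near 2/pi for large n
print("efficiency of sample mean  :", crlb / X.mean(axis=1).var())  # essentially 1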
Hammersley-Chapman-Robbins
Inequality
Motivation behind Hammersley-Chapman-Robbins Inequality
The Cramér-Rao lower bound requires differentiation under the integral sign, and hence regularity conditions under which this differentiation is well defined.
We can get a more general statement if we replace ∇ℓ(𝜃; 𝑋) with
the corresponding finite difference.
Hammersley-Chapman-Robbins Inequality (1)
Recall that by Cauchy-Schwarz, for an unbiased estimator 𝛿 and any random variable 𝜓, we always have a lower bound of the form
Var𝜃(𝛿) ≥ Cov𝜃(𝛿, 𝜓)² / Var𝜃(𝜓)
• In the CRLB, we took 𝜓 = ∇ℓ(𝜃; 𝑋)
• Here we take
𝜓 = 𝑝𝜃+𝜖(𝑋)/𝑝𝜃(𝑋) − 1 = exp(ℓ(𝜃 + 𝜖; 𝑋) − ℓ(𝜃; 𝑋)) − 1 ≈ 𝜖⊤ ∇ℓ(𝜃; 𝑋) for small 𝜖
Hammersley-Chapman-Robbins Inequality (2)
We verify that
• 𝔼𝜃[𝑝𝜃+𝜖(𝑋)/𝑝𝜃(𝑋) − 1] = 0
• Cov𝜃(𝛿(𝑋), 𝑝𝜃+𝜖(𝑋)/𝑝𝜃(𝑋) − 1) = ∫ 𝛿(𝑥) (𝑝𝜃+𝜖(𝑥)/𝑝𝜃(𝑥) − 1) 𝑝𝜃(𝑥) 𝑑𝜇(𝑥) = 𝔼𝜃+𝜖[𝛿] − 𝔼𝜃[𝛿] = 𝑔(𝜃 + 𝜖) − 𝑔(𝜃)
Hence the Hammersley-Chapman-Robbins inequality (HCRI):
Var𝜃(𝛿) ≥ (𝑔(𝜃 + 𝜖) − 𝑔(𝜃))² / 𝔼𝜃[(𝑝𝜃+𝜖(𝑋)/𝑝𝜃(𝑋) − 1)²]
The CRLB follows by taking 𝜖 → 0, but taking the sup over 𝜖 can give better bounds.
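A numerical sketch of the HCRI (my own illustration, for a single observation from 𝒩(𝜃, 1) and 𝑔(𝜃) = 𝜃): in this model the bound works out to 𝜖²/(exp(𝜖²) − 1), which increases to the CRLB value 1 as 𝜖 → 0.

import numpy as np

rng = np.random.default_rng(8)
theta, reps = 0.0, 2_000_000
X = rng.normal(theta, 1.0, size=reps)

def hcr_bound(eps):
    # likelihood ratio p_{theta+eps}(X) / p_theta(X) for the N(theta, 1) model
    lr = np.exp(eps * (X - theta) - eps ** 2 / 2)
    # (g(theta + eps) - g(theta))^2 / E_theta[(lr - 1)^2] with g(theta) = theta
    return eps ** 2 / np.mean((lr - 1) ** 2)

for eps in [1.0, 0.5, 0.1]:
    print(f"eps = {eps}: HCR bound = {hcr_bound(eps):.4f}")
print("CRLB (eps -> 0 limit):", 1.0)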
Example 1: exponential family
What is the Cramér-Rao lower bound for the exponential family?
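One way to sanity-check the answer numerically (my own illustration with a Poisson(𝜃) family, where 𝑇(𝑥) = 𝑥 and 𝐼(𝜃) = 1/𝜃): the natural sufficient statistic is unbiased for its own mean and attains the bound.

import numpy as np

rng = np.random.default_rng(9)
theta, reps = 4.0, 1_000_000
X = rng.poisson(theta, size=reps)      # one-parameter exponential family with T(x) = x

crlb = theta                           # [g'(theta)]^2 / I(theta) = theta for g(theta) = E_theta[X] = theta
print("Var(X)                :", X.var())
print("Cramér-Rao lower bound:", crlb)   # attained by the natural statistic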
Example 2: curved exponential family
What is the Cramér-Rao lower bound for the curved exponential
family?
𝑝𝜃 (𝑥) = exp(𝜂(𝜃)⊤ 𝑇 (𝑥) − 𝐵(𝜃))ℎ(𝑥), 𝜃 ∈ ℝ, 𝑇 (𝑥) ∈ ℝ𝑠
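A sketch of the corresponding computation in the slide's notation (my own summary; here 𝜂̇ denotes d𝜂/d𝜃, and the last line follows from Thm 4.9):

\begin{align*}
\ell(\theta; x) &= \eta(\theta)^\top T(x) - B(\theta) + \log h(x),
\qquad
\ell'(\theta; x) = \dot\eta(\theta)^\top T(x) - B'(\theta), \\
I(\theta) &= \mathrm{Var}_\theta\big(\ell'(\theta; X)\big)
           = \dot\eta(\theta)^\top \mathrm{Cov}_\theta\big(T(X)\big)\, \dot\eta(\theta), \\
\mathrm{Var}_\theta(\delta) &\ge \frac{[g'(\theta)]^2}{\dot\eta(\theta)^\top \mathrm{Cov}_\theta\big(T(X)\big)\, \dot\eta(\theta)}
\quad \text{for } \delta \text{ unbiased for } g(\theta).
\end{align*}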
Summary
• Restricting attention to unbiased estimators gives a nice theory (UMVU), but a UMVU estimator is not always admissible in terms of total risk
• Score and Fisher information
• Cramér-Rao lower bound and its variant
What is next?
• Equivariance
Thank you