5. Maximum Likelihood Estimation
Ismaïla Ba
[email protected]
STAT 3100 - Winter 2024
Contents
1 Preamble
2 The Likelihood Function
3 Maximum Likelihood Estimation
Preamble
MLE: intuition
Example 1
Suppose we toss an unfair coin. Let p be the probability of getting heads, that is, P(heads) = p, and consider the observed sample x = (0, 1, 1, 1) (0 = Tails, 1 = Heads). For what value of p is the observed sample most likely to have occurred?
1 Let X1, X2, X3, X4 be iid Bernoulli(p) random variables (the distribution of a single coin toss).
2 For the observed sample, the probability of the event {X1 = x1, X2 = x2, X3 = x3, X4 = x4} is
f(x1, x2, x3, x4; p) = (1 − p)p³.
3 The natural idea of maximum likelihood: find the value of p that maximizes the probability of observing the sample x = (0, 1, 1, 1); see the numerical sketch below.
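A quick numerical check of this idea (a sketch using NumPy; the grid search is purely illustrative and the sample is the one above):

```python
import numpy as np

# Likelihood of the observed sample x = (0, 1, 1, 1) as a function of p
def likelihood(p):
    return (1 - p) * p**3

# Evaluate on a fine grid over [0, 1] and pick the maximizer
grid = np.linspace(0, 1, 10001)
p_hat = grid[np.argmax(likelihood(grid))]
print(p_hat)  # 0.75, matching the analytical answer 3/4
```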
What do you think?
[Figure: the function (1 − p)p³ and its derivative with respect to p, each plotted over p ∈ [0, 1].]
⇒ maximum likelihood estimate: p̂(x) = 3/4!
The Likelihood Function
Definition 1
Let X1, . . . , Xn have joint pmf or pdf f_{X1,...,Xn}(·; θ), where the parameters θ = (θ1, . . . , θm) have unknown values. Given that X1 = x1, X2 = x2, . . . , Xn = xn is observed, the function of θ defined by
L(θ) = L(θ; x) = f(x1, x2, . . . , xn; θ)
is the likelihood function. If X = (X1, . . . , Xn) is a random sample from a distribution with density f(x; θ), the likelihood function is
L(θ; x) = ∏_{i=1}^n f(x_i; θ),
where f(x_i; θ) is the common density of the Xi.
The likelihood function is not a probability density function.
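As a concrete illustration (a small sketch with a made-up sample; the Exp(β) model here anticipates Example 2 below), the likelihood is simply the product of the marginal densities evaluated at the observed data:

```python
import numpy as np
from scipy import stats

# A hypothetical observed sample, treated as fixed numbers
x = np.array([1.2, 0.4, 2.7, 0.9, 1.5])

# Likelihood of an Exp(beta) model (scale parameterization) at a candidate beta
def likelihood(beta):
    return np.prod(stats.expon.pdf(x, scale=beta))

print(likelihood(1.0), likelihood(x.mean()))  # the sample mean gives the larger value
```

For larger samples the product underflows quickly, which is one practical reason to work with the log-likelihood introduced below.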
Maximum Likelihood Estimation
Methodology
The maximum likelihood principle for estimation is to choose the value of θ
that maximizes the likelihood function L(θ; x) for the observed x.
Definition: maximum likelihood
Let x = (x1, . . . , xn) be a realization of X = (X1, . . . , Xn) with likelihood function L(θ; x), where θ ∈ Θ and Θ is the parameter space of θ. Then a maximum likelihood estimator (MLE) of θ is any θ̂ that satisfies
L(θ̂; x) = max_{θ∈Θ} f(x1, . . . , xn; θ).
We could also write
θ̂(X) = argmax_{θ∈Θ} L(θ; X) = argmax_{θ∈Θ} f(X1, . . . , Xn; θ).
Log-likelihood
It is often more convenient to work with the log-likelihood function, that is,
ln L(θ; x) = ∑_{i=1}^n ln f(x_i; θ),
and since x ↦ ln x is increasing,
θ̂(X) = argmax_{θ∈Θ} ln L(θ; X).
General methodology: For X ∼ f(·; θ) with θ ∈ Θ ⊂ R,
1 Compute the log-likelihood function;
2 If it is differentiable with respect to θ, compute (ln L(θ; x))′ and let θ̂(x) be the value of θ where the derivative vanishes (otherwise find another argument);
3 Verify that it is indeed a maximum by checking that (ln L(θ; x))′′ |_{θ=θ̂(x)} < 0.
A numerical sketch of this methodology follows below.
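The same recipe can be carried out numerically when the calculus is inconvenient. A minimal sketch (assuming a Poisson model with made-up counts; here the numerical maximizer can be compared with the closed form x̄_n derived in Example 3 below):

```python
import numpy as np
from scipy import optimize, stats

# Hypothetical observed counts, assumed to follow a Poisson(lambda) model
x = np.array([3, 1, 4, 2, 2, 5, 0, 3])

# Step 1: the log-likelihood (negated, since we minimize); steps 2-3 are done numerically
def neg_log_likelihood(lam):
    return -np.sum(stats.poisson.logpmf(x, lam))

res = optimize.minimize_scalar(neg_log_likelihood, bounds=(1e-8, 50), method="bounded")
print(res.x, x.mean())  # the numerical maximizer agrees with the sample mean
```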
Examples - Exponential distribution
Example 2
Let X1, . . . , Xn be a random sample from an Exp(β) distribution, so that
L(β; x) = f(x_1, . . . , x_n; β) = (1/β^n) exp(−(1/β) ∑_{i=1}^n x_i) 1{x_i > 0, ∀i = 1, . . . , n}.
Now,
ln L(β; x) = −n ln β − (1/β) ∑_{i=1}^n x_i.
To maximize this with respect to β, we differentiate to obtain
d/dβ ln L(β; x) = d/dβ [−n ln β − (1/β) ∑_{i=1}^n x_i] = −n/β + (1/β²) ∑_{i=1}^n x_i.
Example 2 continued
Setting this equal to 0 and solving for β yields
β = (1/n) ∑_{i=1}^n x_i = x̄_n.
We should always check that this is a maximum. The second derivative is
d²/dβ² ln L(β; x) = d/dβ [−n/β + (1/β²) ∑_{i=1}^n x_i] = n/β² − (2/β³) ∑_{i=1}^n x_i,
which is negative when evaluated at β = x̄_n. Thus, β̂ = X̄_n is the maximum likelihood estimator (MLE) of β.
Remark: This is the same as the method of moments estimator, which we have seen is unbiased and consistent for β.
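A brief Monte Carlo sketch illustrating this remark (the true value β = 2 and the number of replications are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 2.0  # assumed true scale parameter for the simulation

# Sampling distribution of beta-hat = X-bar for increasing sample sizes
for n in (10, 100, 1000):
    beta_hat = rng.exponential(scale=beta, size=(5000, n)).mean(axis=1)
    print(n, beta_hat.mean().round(3), beta_hat.var().round(4))  # mean stays near 2, variance ~ beta^2/n
```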
Examples - Poisson distribution
Example 3
Let X1 , . . . , Xn be a random sample from a Poisson distribution with
parameter λ unknown. The likelihood function is
L(λ; x) = ∏_{i=1}^n (λ^{x_i} e^{−λ} / x_i!) = λ^{∑_{i=1}^n x_i} e^{−nλ} / ∏_{i=1}^n x_i!.
The log-likelihood function is
ln L(λ; x) = (∑_{i=1}^n x_i) ln λ − nλ − ln(∏_{i=1}^n x_i!).
Differentiating with respect to λ yields
d/dλ ln L(λ; x) = (1/λ) ∑_{i=1}^n x_i − n.
Example 3 continued
Setting this equal to 0 and solving for λ yields λ = x̄_n. The second derivative is
d²/dλ² ln L(λ; x) = −(1/λ²) ∑_{i=1}^n x_i,
which is negative when evaluated at λ = x̄_n. The MLE of λ is hence λ̂ = X̄_n. Since X̄_n is unbiased and V(X̄_n) = λ/n, we also have that λ̂ is consistent for λ.
Examples - Gamma distribution
Example 4
Let X1, . . . , Xn be a random sample from a Gamma(α, β) distribution with parameters α and β unknown. The likelihood function is
L(α, β; x) = ∏_{i=1}^n x_i^{α−1} e^{−x_i/β} / (β^α Γ(α)) = (1/(β^α Γ(α)))^n (∏_{i=1}^n x_i)^{α−1} exp(−(1/β) ∑_{i=1}^n x_i).
The log-likelihood function becomes
ln L(α, β; x) = −nα ln(β) − n ln Γ(α) + (α − 1) ln(∏_{i=1}^n x_i) − (1/β) ∑_{i=1}^n x_i.
Example 4 continued
The partial derivatives are
d/dα ln L(α, β; x) = −n ln(β) − n Γ′(α)/Γ(α) + ln(∏_{i=1}^n x_i);
d/dβ ln L(α, β; x) = −nα/β + (1/β²) ∑_{i=1}^n x_i.
Now, define
ψ(α) := d/dα ln Γ(α) = Γ′(α)/Γ(α) and x̃_n = (∏_{i=1}^n x_i)^{1/n},
where ψ(·) is called the digamma function and x̃_n is the geometric mean of x1, . . . , xn.
Example 4 continued
Setting the partial derivatives to 0, we obtain the maximum likelihood
equations
β = x̄_n / α,
ln(α) − ψ(α) − ln(x̄_n / x̃_n) = 0.
There is no closed form solution for these equations, but they can be
solved numerically.
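A minimal numerical sketch of this (simulated data; the true values α = 3 and β = 2 are assumed only to generate the sample), using SciPy's digamma function and a root finder:

```python
import numpy as np
from scipy import optimize, special

rng = np.random.default_rng(1)
x = rng.gamma(shape=3.0, scale=2.0, size=500)  # simulated Gamma(alpha=3, beta=2) data

xbar = x.mean()
log_xtilde = np.mean(np.log(x))  # ln of the geometric mean x-tilde

# Solve ln(alpha) - psi(alpha) - ln(xbar/xtilde) = 0, then set beta = xbar / alpha
def mle_equation(alpha):
    return np.log(alpha) - special.digamma(alpha) - (np.log(xbar) - log_xtilde)

alpha_hat = optimize.brentq(mle_equation, 1e-3, 1e3)
beta_hat = xbar / alpha_hat
print(alpha_hat, beta_hat)  # close to 3 and 2 for a sample of this size
```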
Order Statistics
Definition: order statistics
The order statistics from a random sample X1, . . . , Xn are the random variables X1:n, X2:n, . . . , Xn:n given by
X1:n = the smallest among X1, . . . , Xn
X2:n = the second smallest among X1, . . . , Xn
...
Xn:n = the largest among X1, . . . , Xn
so that, with probability 1, −∞ < X1:n < X2:n < . . . < Xn:n < ∞.
Remark
X1:n = min{X1 , . . . , Xn }.
Xn:n = max{X1 , . . . , Xn }.
Order Statistics (2)
Proposition: joint density and marginal distribution
Let g_{X1:n,...,Xn:n}(x_{1:n}, . . . , x_{n:n}; θ) denote the joint density of the order statistics X1:n, . . . , Xn:n resulting from a random sample of Xi's from a density fX(x). Then
g_{X1:n,...,Xn:n}(x_{1:n}, . . . , x_{n:n}; θ) = n! ∏_{i=1}^n fX(x_{i:n}) 1{x_{1:n} < x_{2:n} < . . . < x_{n:n}}.
The marginal distribution of the first r order statistics is given by
g_{X1:n,...,Xr:n}(x_{1:n}, . . . , x_{r:n}; θ) = (n!/(n − r)!) ∏_{i=1}^r fX(x_{i:n}) × [1 − FX(x_{r:n})]^{n−r}
when x_{1:n} < x_{2:n} < . . . < x_{r:n}, and 0 otherwise.
Order Statistics (3)
More generally, consider the joint (marginal) distribution of X3:10 and X7:10. Draw a picture to see what is going on.
The (marginal) joint density of (X3:10, X7:10) is
g(x_{3:10}, x_{7:10}) = (n!/(2! 1! 3! 1! 3!)) [FX(x_{3:10})]² fX(x_{3:10}) [FX(x_{7:10}) − FX(x_{3:10})]³ fX(x_{7:10}) [1 − FX(x_{7:10})]³
when x_{3:10} < x_{7:10}, and 0 otherwise.
The multinomial coefficient represents the number of ways to arrange the 10 observations into groups of sizes 2, 1, 3, 1, 3.
Not a formal proof, but it works!
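One way to convince yourself of the coefficient (a sketch for a Uniform(0, 1) sample, so that F(x) = x and f(x) = 1): the resulting joint density should integrate to 1 over the region x_{3:10} < x_{7:10}.

```python
import math
from scipy import integrate

# Multinomial coefficient 10! / (2! 1! 3! 1! 3!)
C = math.factorial(10) / (math.factorial(2) * math.factorial(1) * math.factorial(3)
                          * math.factorial(1) * math.factorial(3))

# Joint density of (X_{3:10}, X_{7:10}) for a Uniform(0,1) sample
def g(x7, x3):  # dblquad passes the inner variable first
    return C * x3**2 * (x7 - x3) ** 3 * (1 - x7) ** 3

# Integrate over 0 < x3 < x7 < 1; the result should be ~1.0
total, _ = integrate.dblquad(g, 0, 1, lambda x3: x3, lambda x3: 1)
print(total)
```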
Order Statistics (4)
Suppose we have a random sample X1, . . . , Xn from a continuous distribution with CDF F and pdf f, with n ≥ 3, and, for some fixed i, j, k with i < j < k, we want the joint likelihood (or density) of Xi:n, Xj:n and Xk:n. Using the above trick, we can easily write this down:
g(x_{i:n}, x_{j:n}, x_{k:n}) = n! / [(i − 1)! (j − i − 1)! (k − j − 1)! (n − k)!]
× [F(x_{i:n})]^{i−1} f(x_{i:n}) [F(x_{j:n}) − F(x_{i:n})]^{j−i−1} f(x_{j:n})
× [F(x_{k:n}) − F(x_{j:n})]^{k−j−1} f(x_{k:n}) [1 − F(x_{k:n})]^{n−k}
when x_{i:n} < x_{j:n} < x_{k:n}, and 0 otherwise.
Remark: Adding up the exponents should yield n − ν, where ν is the number of arguments in the joint density g. Here, they add up to n − 3.
Exercise on joint density for order statistics
Exercise 1
Consider a random sample X1 , . . . , X20 from a continuous distribution with
CDF F and density f . What is the joint density of X2:20 , X5:20 , and X13:20 ?
Example 5
Suppose that the lifetime of a particular component has an Exp(β) distribution and that n of these are randomly chosen (independently) and placed into service. We observe the times of the first r failures (i.e. we observe x_{1:n}, . . . , x_{r:n}). From the above slides on order statistics, the joint pdf of x_{1:n}, . . . , x_{r:n} is given by
L(β; x) = g(x_{1:n}, . . . , x_{r:n}; β)
= (n!/(n − r)!) ∏_{i=1}^r f(x_{i:n}; β) [1 − F(x_{r:n}; β)]^{n−r}
= (n!/(n − r)!) (1/β^r) exp(−(1/β) ∑_{i=1}^r x_{i:n}) exp(−(n − r) x_{r:n} / β)
= (n!/(n − r)!) (1/β^r) exp(−(1/β) [∑_{i=1}^r x_{i:n} + (n − r) x_{r:n}]).
Example 5 continued
Notice that t = T(x1, . . . , xn) := ∑_{i=1}^r x_{i:n} + (n − r) x_{r:n} represents the observed total time in service of the n items when the experiment is terminated (at the time of the r-th failure). The log-likelihood (in terms of t) is
ln L(β; x) = −r ln(β) − t/β + const
and its derivative with respect to β is
d/dβ ln L(β; x) = −r/β + t/β².
Setting this equal to 0 and solving for β yields β = t/r. The second derivative is
d²/dβ² ln L(β; x) = (1/β²)(r − 2t/β),
which is negative when evaluated at β = t/r.
Example 5 continued
Thus, the MLE becomes
β̂ = T / r
with T = T(X1, . . . , Xn) := ∑_{i=1}^r X_{i:n} + (n − r) X_{r:n}.
Remark: If r = n, then t = ∑_{i=1}^n x_{i:n} = ∑_{i=1}^n x_i since, either way, we are adding up the entire sample, and the MLE becomes β̂ = X̄_n as before.
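A short simulation sketch of this censored sampling scheme (the values β = 5, n = 20 and r = 8 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
beta, n, r = 5.0, 20, 8  # assumed true scale, sample size, number of observed failures

# One Type II censored experiment: only the first r failure times are observed
lifetimes = np.sort(rng.exponential(scale=beta, size=n))
observed = lifetimes[:r]
t = observed.sum() + (n - r) * observed[-1]  # total time in service at the r-th failure
print(t / r)  # the MLE beta-hat = T / r
```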
Exercise 2
Suppose that X1, . . . , Xn are iid N(µ, σ²) random variables with both µ and σ² unknown. Find the MLE's of µ and σ².
Example 6
Let X1, . . . , Xn be a random sample from a two-parameter exponential distribution Exp(1, η) with density
f(x; η) = e^{−(x−η)} 1_{(η,∞)}(x).
The likelihood function for η is
L(η; x) = exp(−∑_{i=1}^n (x_i − η)) 1{x_i ≥ η ∀i} = exp(−∑_{i=1}^n (x_i − η)) 1{η ≤ x_{1:n}}.
This likelihood function is monotonically increasing in η up to x_{1:n} and is then 0 for all η > x_{1:n}. The derivative with respect to η won't help us, since the maximum occurs on the boundary and L(η) is not continuous at this point. Nevertheless, the MLE for η is η̂ = X1:n, the minimum of the Xi. This is quite different from the method of moments estimator.
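Evaluating the likelihood on a small made-up sample (a sketch; the numbers are arbitrary) makes the boundary behaviour visible:

```python
import numpy as np

x = np.array([2.3, 1.7, 3.1, 2.0, 1.9])  # hypothetical observed sample, so x_(1:n) = 1.7

# L(eta) = exp(-sum(x_i - eta)) for eta <= min(x), and 0 otherwise
def L(eta):
    return np.exp(-np.sum(x - eta)) if eta <= x.min() else 0.0

for eta in (0.5, 1.0, 1.5, 1.7, 1.8):
    print(eta, L(eta))  # increases up to eta = 1.7, then drops to 0
```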
The following exercise examines the properties of the MLE found in Example 6, that is, of η̂ = X1:n.
Exercise 3
Let X1, . . . , Xn be a random sample from a two-parameter exponential distribution Exp(1, η) with density
f(x; η) = e^{−(x−η)} 1_{(η,∞)}(x).
From the above, the MLE of η is η̂ = X1:n.
(a) Show that F_{X1:n}(t) = P(X1:n ≤ t) = [1 − e^{−n(t−η)}] 1_{(η,∞)}(t).
(b) Show that f_{X1:n}(t) = n e^{−n(t−η)} 1_{(η,∞)}(t).
(c) Use (b) to show that E(η̂) = E(X1:n) = η + 1/n and V(η̂) = V(X1:n) = 1/n².
(d) Use (c) to conclude that η̂ is asymptotically unbiased and consistent for η.
Exercise 4
Let X1, . . . , Xn be a random sample from a two-parameter exponential distribution Exp(β, η) with density
f(x; η, β) = (1/β) e^{−(x−η)/β} 1_{(η,∞)}(x).
Find the MLE's of β and η, and compare with the method of moments result from Chapter 4.
Example 7
Let X1, . . . , Xn be a random sample from a two-parameter Pareto distribution Pareto(α, κ) with density
f(x; α, κ) = (α κ^α / x^{α+1}) 1_{(κ,∞)}(x).
Examples - Two-parameter Pareto distribution
Example 7 continued
The log-likelihood function is
ln L(α, κ; x) = [n ln(α) + nα ln(κ) + ln(∏_{i=1}^n 1/x_i^{α+1})] 1{κ ≤ x_i ∀i}
= [n ln(α) + nα ln(κ) − (α + 1) ∑_{i=1}^n ln(x_i)] 1{κ ≤ x_i ∀i}.
Differentiating ln L(α, κ; x) with respect to κ and setting this equal to 0 yields nα/κ = 0, which has no solution (it would require κ = ∞), and this is impossible because κ ≤ x_{1:n}. As a function of κ, L(α, κ; x) is monotonically increasing in κ up to x_{1:n}, after which L(α, κ; x) becomes 0. Therefore, the MLE for κ is κ̂ = X1:n.
Example 7 continued
Differentiating ln L(α, κ; x) with respect to α yields
d/dα ln L(α, κ; x) = n/α + n ln(κ) − ∑_{i=1}^n ln(x_i).
Setting this equal to 0 and solving for α yields
α = n / ∑_{i=1}^n ln(x_i/κ).
Thus, the MLE for α is
α̂ = n / ∑_{i=1}^n ln(X_i/X_{1:n}).
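A quick sketch checking these formulas on simulated data (the true values α = 2.5 and κ = 1.5 are assumed only for the simulation; scipy.stats.pareto with a scale argument gives this density):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, kappa = 2.5, 1.5  # assumed true values
x = stats.pareto.rvs(alpha, scale=kappa, size=1000, random_state=rng)

kappa_hat = x.min()                                 # MLE of kappa: the sample minimum
alpha_hat = len(x) / np.sum(np.log(x / kappa_hat))  # MLE of alpha
print(kappa_hat, alpha_hat)  # close to 1.5 and 2.5
```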
Exercises
Exercise 5
Find the MLE's when X1, . . . , Xn is a random sample from the following distributions:
(a) U(0, θ) where θ > 0 is unknown.
(b) Weibull(1/2, γ) where γ > 0 is unknown.
(c) Binomial(20, p) where p ∈ [0, 1] is unknown.
(d) Geometric(p) where p ∈ (0, 1) is unknown.
(e) Laplace(λ) where λ > 0 is unknown.
Exercises - Two-parameter Laplace distribution
Exercise 6
Let X1, . . . , Xn be a random sample from a distribution with density
f(x; η, β) = (1/(2β)) e^{−|x−η|/β} 1_{(−∞,∞)}(x).
Find the MLE's for β and η. Hint: the value of a that minimizes ∑_{i=1}^n |x_i − a| is a = median(x1, . . . , xn). What are the method of moments estimators?
Definition: Invariance Property
If θ̂ is the MLE of θ and if u(θ) is a function of θ, then u(θ̂) is an MLE for
u(θ).
Example 8
Let X1, . . . , Xn be a random sample from an Exp(β) distribution. What is the MLE for estimating p(β) = P(X ≥ 1) = e^{−1/β}? Since X̄_n is the MLE for β, the MLE of p(β) is p(β̂) = e^{−1/X̄_n}.
Example 9
Let X1, . . . , Xn be a random sample from a Poisson(λ) distribution. What is the MLE for estimating p(λ) = P(X = 0) = e^{−λ}? Since X̄_n is the MLE for λ, the MLE of p(λ) is p(λ̂) = e^{−X̄_n}.
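A two-line sketch of the invariance property in action (both samples are made up):

```python
import numpy as np

x_exp = np.array([0.8, 2.1, 1.4, 0.6, 3.2])    # hypothetical Exp(beta) sample
print(np.exp(-1 / x_exp.mean()))               # MLE of P(X >= 1) = exp(-1/beta)

x_pois = np.array([2, 0, 3, 1, 1, 4])          # hypothetical Poisson(lambda) sample
print(np.exp(-x_pois.mean()))                  # MLE of P(X = 0) = exp(-lambda)
```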