
Lecture Notes 14

36-705

We continue with our discussion of decision theory.

1 Decision Theory
Suppose we want to estimate a parameter $\theta$ using data $X^n = (X_1, \ldots, X_n)$. What is the best possible estimator $\hat\theta = \hat\theta(X_1, \ldots, X_n)$ of $\theta$? Decision theory provides a framework for answering this question.

1.1 The Risk Function

Let $\hat\theta = \hat\theta(X^n)$ be an estimator for the parameter $\theta \in \Theta$. We start with a loss function $L(\theta, \hat\theta)$ that measures how good the estimator is. For example:

$$\begin{aligned}
L(\theta, \hat\theta) &= (\theta - \hat\theta)^2 && \text{squared error loss,}\\
L(\theta, \hat\theta) &= |\theta - \hat\theta| && \text{absolute error loss,}\\
L(\theta, \hat\theta) &= |\theta - \hat\theta|^p && L_p \text{ loss,}\\
L(\theta, \hat\theta) &= \begin{cases} 0 & \text{if } \theta = \hat\theta\\ 1 & \text{if } \theta \neq \hat\theta \end{cases} && \text{zero--one loss,}\\
L(\theta, \hat\theta) &= I(|\hat\theta - \theta| > c) && \text{large deviation loss,}\\
L(\theta, \hat\theta) &= \int \log\!\left(\frac{p(x; \theta)}{p(x; \hat\theta)}\right) p(x; \theta)\, dx && \text{Kullback--Leibler loss.}
\end{aligned}$$

If $\theta = (\theta_1, \ldots, \theta_k)$ is a vector then some common loss functions are

$$L(\theta, \hat\theta) = \|\theta - \hat\theta\|^2 = \sum_{j=1}^k (\hat\theta_j - \theta_j)^2,$$

$$L(\theta, \hat\theta) = \|\theta - \hat\theta\|_p = \left( \sum_{j=1}^k |\hat\theta_j - \theta_j|^p \right)^{1/p}.$$

When the problem is to predict $Y \in \{0, 1\}$ based on some classifier $h(x)$, a commonly used loss is
$$L(Y, h(X)) = I(Y \neq h(X)).$$
For real-valued prediction a common loss function is
$$L(Y, \hat{Y}) = (Y - \hat{Y})^2.$$
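For concreteness, here is a small Python sketch of some of these losses (an added illustration, not part of the notes; the Kullback--Leibler loss is specialized to the $N(\theta, 1)$ model, where it reduces to $(\theta - \hat\theta)^2/2$):

```python
import numpy as np

# Sketch implementations of some of the losses above (scalar theta).
def squared_error(theta, theta_hat):
    return (theta - theta_hat) ** 2

def absolute_error(theta, theta_hat):
    return np.abs(theta - theta_hat)

def zero_one(theta, theta_hat):
    return float(theta != theta_hat)

def large_deviation(theta, theta_hat, c):
    # I(|theta_hat - theta| > c)
    return float(np.abs(theta_hat - theta) > c)

def kl_loss_normal(theta, theta_hat):
    # Kullback-Leibler loss specialized to the N(theta, 1) model,
    # where it reduces to (theta - theta_hat)^2 / 2.
    return 0.5 * (theta - theta_hat) ** 2
```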

Figure 1: Comparing two risk functions, $R(\theta, \hat\theta_1)$ and $R(\theta, \hat\theta_2)$. Neither risk function dominates the other at all values of $\theta$.

The risk of an estimator $\hat\theta$ is

$$R(\theta, \hat\theta) = \mathbb{E}_\theta\bigl[ L(\theta, \hat\theta) \bigr] = \int L(\theta, \hat\theta(x_1, \ldots, x_n))\, p(x_1, \ldots, x_n; \theta)\, dx. \tag{1}$$

When the loss function is squared error, the risk is just the MSE (mean squared error):

$$R(\theta, \hat\theta) = \mathbb{E}_\theta (\hat\theta - \theta)^2 = \mathrm{Var}_\theta(\hat\theta) + \mathrm{bias}^2. \tag{2}$$

If we do not state what loss function we are using, assume the loss function is squared error.

1.2 Comparing Risk Functions

To compare two estimators, we compare their risk functions. However, this does not provide
a clear answer as to which estimator is better. Consider the following examples.

Example 1 Let $X \sim N(\theta, 1)$ and assume we are using squared error loss. Consider two estimators: $\hat\theta_1 = X$ and $\hat\theta_2 = 3$. The risk functions are $R(\theta, \hat\theta_1) = \mathbb{E}_\theta(X - \theta)^2 = 1$ and $R(\theta, \hat\theta_2) = \mathbb{E}_\theta(3 - \theta)^2 = (3 - \theta)^2$. If $2 < \theta < 4$ then $R(\theta, \hat\theta_2) < R(\theta, \hat\theta_1)$; otherwise, $R(\theta, \hat\theta_1) < R(\theta, \hat\theta_2)$. Neither estimator uniformly dominates the other; see Figure 1.
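As a quick numerical sanity check (an added sketch, not part of the notes), the two risk functions in Example 1 can be approximated by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_risk(theta, estimator, n_sims=100_000):
    # Approximate R(theta, theta_hat) = E_theta[(theta_hat(X) - theta)^2] for X ~ N(theta, 1).
    x = rng.normal(loc=theta, scale=1.0, size=n_sims)
    return np.mean((estimator(x) - theta) ** 2)

theta_hat_1 = lambda x: x                      # theta_hat_1 = X
theta_hat_2 = lambda x: np.full_like(x, 3.0)   # theta_hat_2 = 3

for theta in [0.0, 2.5, 3.0, 5.0]:
    r1 = monte_carlo_risk(theta, theta_hat_1)
    r2 = monte_carlo_risk(theta, theta_hat_2)
    print(f"theta={theta:3.1f}  R1 ~ {r1:.3f} (exact 1)  R2 ~ {r2:.3f} (exact {(3 - theta) ** 2:.3f})")
```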

Example 2 Let $X_1, \ldots, X_n \sim \mathrm{Bernoulli}(p)$. Consider squared error loss and let $\hat{p}_1 = \overline{X}$. Since this has zero bias, we have that
$$R(p, \hat{p}_1) = \mathrm{Var}(\overline{X}) = \frac{p(1-p)}{n}.$$
Another estimator is
$$\hat{p}_2 = \frac{Y + \alpha}{\alpha + \beta + n}$$
where $Y = \sum_{i=1}^n X_i$ and $\alpha$ and $\beta$ are positive constants (this is the posterior mean using a $\mathrm{Beta}(\alpha, \beta)$ prior). Now,
$$\begin{aligned}
R(p, \hat{p}_2) &= \mathrm{Var}_p(\hat{p}_2) + \bigl(\mathrm{bias}_p(\hat{p}_2)\bigr)^2\\
&= \mathrm{Var}_p\!\left( \frac{Y + \alpha}{\alpha + \beta + n} \right) + \left( \mathbb{E}_p\!\left[ \frac{Y + \alpha}{\alpha + \beta + n} \right] - p \right)^2\\
&= \frac{np(1-p)}{(\alpha + \beta + n)^2} + \left( \frac{np + \alpha}{\alpha + \beta + n} - p \right)^2.
\end{aligned}$$
Let $\alpha = \beta = \sqrt{n/4}$. The resulting estimator is
$$\hat{p}_2 = \frac{Y + \sqrt{n/4}}{n + \sqrt{n}}$$
and the risk function is
$$R(p, \hat{p}_2) = \frac{n}{4(n + \sqrt{n})^2}.$$
The risk functions are plotted in Figure 2. As we can see, neither estimator uniformly dominates the other.
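A quick numerical check (a sketch; $n = 50$ is an arbitrary choice) that with $\alpha = \beta = \sqrt{n/4}$ the risk of $\hat{p}_2$ is constant in $p$ and equal to $n/(4(n+\sqrt{n})^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
alpha = beta = np.sqrt(n / 4)

def risk_p2_exact(p):
    # Variance + squared bias of (Y + alpha) / (alpha + beta + n), as derived above.
    var = n * p * (1 - p) / (alpha + beta + n) ** 2
    bias = (n * p + alpha) / (alpha + beta + n) - p
    return var + bias ** 2

def risk_p2_monte_carlo(p, n_sims=200_000):
    y = rng.binomial(n, p, size=n_sims)
    p2 = (y + alpha) / (alpha + beta + n)
    return np.mean((p2 - p) ** 2)

constant = n / (4 * (n + np.sqrt(n)) ** 2)
for p in [0.1, 0.3, 0.5, 0.9]:
    print(f"p={p}: exact={risk_p2_exact(p):.6f}  MC~{risk_p2_monte_carlo(p):.6f}  "
          f"n/(4(n+sqrt(n))^2)={constant:.6f}")
```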

These examples highlight the need to be able to compare risk functions. To do so, we need a
one-number summary of the risk function. Two such summaries are the maximum risk and
the Bayes risk.
The maximum risk is
$$R(\hat\theta) = \sup_{\theta \in \Theta} R(\theta, \hat\theta) \tag{3}$$
and the Bayes risk under prior $\pi$ is
$$B_\pi(\hat\theta) = \int R(\theta, \hat\theta)\, \pi(\theta)\, d\theta. \tag{4}$$

Example 3 Consider again the two estimators in Example 2. We have
$$R(\hat{p}_1) = \max_{0 \le p \le 1} \frac{p(1-p)}{n} = \frac{1}{4n}$$

Figure 2: Risk functions for $\hat{p}_1$ and $\hat{p}_2$ in Example 2. The solid curve is $R(\hat{p}_1)$. The dotted line is $R(\hat{p}_2)$.

and
$$R(\hat{p}_2) = \max_p \frac{n}{4(n + \sqrt{n})^2} = \frac{n}{4(n + \sqrt{n})^2}.$$
Based on maximum risk, $\hat{p}_2$ is a better estimator since $R(\hat{p}_2) < R(\hat{p}_1)$. However, when $n$ is large, $\hat{p}_1$ has smaller risk except for a small region in the parameter space near $p = 1/2$. Thus, many people prefer $\hat{p}_1$ to $\hat{p}_2$. This illustrates that one-number summaries like the maximum risk are imperfect.
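To see this numerically (an added sketch), one can compare the two maximum risks and the fraction of the parameter space on which $\hat{p}_1$ beats $\hat{p}_2$ as $n$ grows:

```python
import numpy as np

for n in [10, 100, 1000, 10000]:
    max_risk_p1 = 1 / (4 * n)                      # max_p p(1-p)/n, attained at p = 1/2
    risk_p2 = n / (4 * (n + np.sqrt(n)) ** 2)      # constant risk of p_hat_2

    p = np.linspace(0, 1, 100_001)
    risk_p1 = p * (1 - p) / n                      # risk function of p_hat_1
    frac_p1_better = np.mean(risk_p1 < risk_p2)    # fraction of [0,1] where p_hat_1 wins

    print(f"n={n:6d}  max risk p1={max_risk_p1:.2e}  max risk p2={risk_p2:.2e}  "
          f"p_hat_1 better on ~{frac_p1_better:.1%} of [0,1]")
```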

These two summaries of the risk function suggest two different methods for devising estimators: choosing $\hat\theta$ to minimize the maximum risk leads to minimax estimators; choosing $\hat\theta$ to minimize the Bayes risk leads to Bayes estimators.

An estimator $\hat\theta$ that minimizes the Bayes risk is called a Bayes estimator. That is,
$$B_\pi(\hat\theta) = \inf_{\tilde\theta} B_\pi(\tilde\theta) \tag{5}$$
where the infimum is over all estimators $\tilde\theta$. An estimator that minimizes the maximum risk is called a minimax estimator. That is,
$$\sup_{\theta} R(\theta, \hat\theta) = \inf_{\tilde\theta} \sup_{\theta} R(\theta, \tilde\theta) \tag{6}$$
where the infimum is over all estimators $\tilde\theta$. We call the right hand side of (6), namely,
$$R_n \equiv R_n(\Theta) = \inf_{\hat\theta} \sup_{\theta \in \Theta} R(\theta, \hat\theta), \tag{7}$$

the minimax risk. Statistical decision theory has two main goals: determine the minimax risk $R_n$ and find an estimator that achieves this risk. Once we have found the minimax risk $R_n$, we want to find the minimax estimator that achieves this risk:
$$\sup_{\theta \in \Theta} R(\theta, \hat\theta) = \inf_{\hat\theta} \sup_{\theta \in \Theta} R(\theta, \hat\theta). \tag{8}$$

1.3 Bayes Estimators

Let $\pi$ be a prior distribution. After observing $X^n = (X_1, \ldots, X_n)$, the posterior distribution is, according to Bayes' theorem,
$$\mathbb{P}(\theta \in A \mid X^n) = \frac{\int_A p(X_1, \ldots, X_n \mid \theta)\, \pi(\theta)\, d\theta}{\int_\Theta p(X_1, \ldots, X_n \mid \theta)\, \pi(\theta)\, d\theta} = \frac{\int_A L(\theta)\, \pi(\theta)\, d\theta}{\int_\Theta L(\theta)\, \pi(\theta)\, d\theta} \tag{9}$$
where $L(\theta) = p(x^n; \theta)$ is the likelihood function. The posterior has density
$$\pi(\theta \mid x^n) = \frac{p(x^n \mid \theta)\, \pi(\theta)}{m(x^n)} \tag{10}$$
where $m(x^n) = \int p(x^n \mid \theta)\, \pi(\theta)\, d\theta$ is the marginal distribution of $X^n$. Define the posterior risk of an estimator $\hat\theta(x^n)$ by
$$r(\hat\theta \mid x^n) = \int L(\theta, \hat\theta(x^n))\, \pi(\theta \mid x^n)\, d\theta. \tag{11}$$

Theorem 4 The Bayes risk $B_\pi(\hat\theta)$ satisfies
$$B_\pi(\hat\theta) = \int r(\hat\theta \mid x^n)\, m(x^n)\, dx^n. \tag{12}$$
Let $\hat\theta(x^n)$ be the value of $\theta$ that minimizes $r(\hat\theta \mid x^n)$. Then $\hat\theta$ is the Bayes estimator.

Proof:
Let $p(x, \theta) = p(x \mid \theta)\, \pi(\theta)$ denote the joint density of $X$ and $\theta$. We can rewrite the Bayes risk as follows:
$$\begin{aligned}
B_\pi(\hat\theta) &= \int R(\theta, \hat\theta)\, \pi(\theta)\, d\theta = \int \left( \int L(\theta, \hat\theta(x^n))\, p(x^n \mid \theta)\, dx^n \right) \pi(\theta)\, d\theta\\
&= \int \int L(\theta, \hat\theta(x^n))\, p(x^n, \theta)\, dx^n\, d\theta = \int \int L(\theta, \hat\theta(x^n))\, \pi(\theta \mid x^n)\, m(x^n)\, dx^n\, d\theta\\
&= \int \left( \int L(\theta, \hat\theta(x^n))\, \pi(\theta \mid x^n)\, d\theta \right) m(x^n)\, dx^n = \int r(\hat\theta \mid x^n)\, m(x^n)\, dx^n.
\end{aligned}$$

If we choose $\hat\theta(x^n)$ to be the value of $\theta$ that minimizes $r(\hat\theta \mid x^n)$ then we will minimize the integrand at every $x$ and thus minimize the integral $\int r(\hat\theta \mid x^n)\, m(x^n)\, dx^n$.
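Theorem 4 can be illustrated on a small discrete toy problem (a sketch; the prior, the Binomial model, and the add-one estimator below are arbitrary choices): the Bayes risk computed by averaging $R(\theta, \hat\theta)$ over the prior equals the average of the posterior risk over the marginal $m(x^n)$.

```python
import numpy as np
from scipy.stats import binom

n = 10
thetas = np.array([0.2, 0.5, 0.8])            # discrete parameter space
prior = np.array([0.3, 0.4, 0.3])             # prior pi(theta)
x_vals = np.arange(n + 1)                     # possible values of Y = sum of Bernoulli draws

def theta_hat(y):                             # an arbitrary estimator: add-one smoothing
    return (y + 1) / (n + 2)

# p(x | theta) for every (theta, x) pair.
px_given_theta = np.array([binom.pmf(x_vals, n, th) for th in thetas])   # shape (3, n+1)
loss = (thetas[:, None] - theta_hat(x_vals)[None, :]) ** 2               # squared error loss

# Left-hand side: B_pi = sum_theta pi(theta) * R(theta, theta_hat).
risk_per_theta = np.sum(loss * px_given_theta, axis=1)
bayes_risk_lhs = np.sum(prior * risk_per_theta)

# Right-hand side: B_pi = sum_x m(x) * r(theta_hat | x).
m_x = np.sum(prior[:, None] * px_given_theta, axis=0)                    # marginal of the data
posterior = prior[:, None] * px_given_theta / m_x[None, :]               # pi(theta | x)
posterior_risk = np.sum(loss * posterior, axis=0)                        # r(theta_hat | x)
bayes_risk_rhs = np.sum(m_x * posterior_risk)

print(bayes_risk_lhs, bayes_risk_rhs)         # the two numbers agree
```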
Now we can find an explicit formula for the Bayes estimator for some specific loss functions.

Theorem 5 If $L(\theta, \hat\theta) = (\theta - \hat\theta)^2$ then the Bayes estimator is
$$\hat\theta(x^n) = \int \theta\, \pi(\theta \mid x^n)\, d\theta = \mathbb{E}(\theta \mid X^n = x^n). \tag{13}$$
If $L(\theta, \hat\theta) = |\theta - \hat\theta|$ then the Bayes estimator is the median of the posterior $\pi(\theta \mid x^n)$. If $L(\theta, \hat\theta)$ is zero--one loss, then the Bayes estimator is the mode of the posterior $\pi(\theta \mid x^n)$.

Proof:
We will prove the theorem for squared error loss. The Bayes estimator $\hat\theta(x^n)$ minimizes $r(\hat\theta \mid x^n) = \int (\theta - \hat\theta(x^n))^2\, \pi(\theta \mid x^n)\, d\theta$. Taking the derivative of $r(\hat\theta \mid x^n)$ with respect to $\hat\theta(x^n)$ and setting it equal to zero yields the equation $2 \int (\theta - \hat\theta(x^n))\, \pi(\theta \mid x^n)\, d\theta = 0$. Solving for $\hat\theta(x^n)$ we get (13).
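The three cases can be checked numerically by discretizing a posterior on a grid and minimizing the posterior risk directly (a sketch; the Bernoulli model with a uniform prior and the data below are arbitrary illustrative choices, and the agreement is up to the grid resolution):

```python
import numpy as np

# Posterior for Bernoulli(theta) with a flat prior after y successes in n trials, on a grid.
theta_grid = np.linspace(0.001, 0.999, 999)
n, y = 20, 6
log_post = y * np.log(theta_grid) + (n - y) * np.log(1 - theta_grid)   # likelihood x flat prior
post = np.exp(log_post - log_post.max())
post /= post.sum()                                                      # normalized on the grid

post_mean = np.sum(theta_grid * post)                                   # squared error loss
post_median = theta_grid[np.searchsorted(np.cumsum(post), 0.5)]         # absolute error loss
post_mode = theta_grid[np.argmax(post)]                                 # zero-one loss

# Direct check: minimize the posterior risk r(theta_hat | x^n) over the grid for each loss.
def argmin_posterior_risk(loss):
    risks = [np.sum(loss(theta_grid, t) * post) for t in theta_grid]
    return theta_grid[int(np.argmin(risks))]

print("mean  :", round(post_mean, 4),   "vs", argmin_posterior_risk(lambda th, t: (th - t) ** 2))
print("median:", round(post_median, 4), "vs", argmin_posterior_risk(lambda th, t: np.abs(th - t)))
print("mode  :", round(post_mode, 4),   "vs", argmin_posterior_risk(lambda th, t: (th != t).astype(float)))
```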

Example 6 Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ where $\sigma^2$ is known. Suppose we use a $N(a, b^2)$ prior for $\mu$. The Bayes estimator with respect to squared error loss is the posterior mean, which is
$$\hat\theta(X_1, \ldots, X_n) = \frac{b^2}{b^2 + \frac{\sigma^2}{n}}\, \overline{X} + \frac{\frac{\sigma^2}{n}}{b^2 + \frac{\sigma^2}{n}}\, a. \tag{14}$$
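A quick check of formula (14) against a grid approximation of the posterior mean (a sketch; the values of $\sigma$, $a$, $b$, $n$, and the true mean below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, a, b, n, mu_true = 1.0, 0.0, 2.0, 25, 1.5
x = rng.normal(mu_true, sigma, size=n)
xbar = x.mean()

# Closed-form posterior mean from (14).
w = b**2 / (b**2 + sigma**2 / n)
bayes_closed_form = w * xbar + (1 - w) * a

# The same posterior mean via a grid approximation of the posterior.
mu_grid = np.linspace(-5, 5, 20_001)
log_post = -n * (xbar - mu_grid) ** 2 / (2 * sigma**2) - (mu_grid - a) ** 2 / (2 * b**2)
post = np.exp(log_post - log_post.max())
post /= post.sum()
bayes_grid = np.sum(mu_grid * post)

print(bayes_closed_form, bayes_grid)   # the two values agree closely
```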

It is worth keeping in mind the trade-off: Bayes estimators, although easy to compute, are subjective in that they depend strongly on the prior $\pi$. Minimax estimators, although more challenging to compute, are not subjective, but they have the drawback of protecting against the worst case, which can lead to pessimistic conclusions.

2 Minimax Estimators through Bayes Estimators

Our goal is to compute a minimax estimator $\hat\theta$ that satisfies
$$\sup_{\theta \in \Theta} R(\theta, \hat\theta) \le \inf_{\tilde\theta} \sup_{\theta \in \Theta} R(\theta, \tilde\theta).$$
We will let $\theta_{\mathrm{minimax}}$ denote a minimax estimator.

2.1 Bounding the Minimax Risk

One strategy to find the minimax estimator is by finding (upper and lower) bounds on the minimax risk that match. Then the estimator that achieves the upper bound is a minimax estimator.

Upper bounding the minimax risk is straightforward. Given an estimator $\hat\theta_{\mathrm{up}}$ we can compute its maximum risk and use it to upper bound the minimax risk, i.e.
$$\inf_{\tilde\theta} \sup_{\theta \in \Theta} R(\theta, \tilde\theta) \le \sup_{\theta \in \Theta} R(\theta, \hat\theta_{\mathrm{up}}).$$
The Bayes risk of the Bayes estimator for any prior $\pi$ lower bounds the minimax risk. Fix a prior $\pi$ and suppose that $\hat\theta_{\mathrm{low}}$ is the Bayes estimator with respect to $\pi$. Then we have
$$B_\pi(\hat\theta_{\mathrm{low}}) \le B_\pi(\theta_{\mathrm{minimax}}) \le \sup_{\theta} R(\theta, \theta_{\mathrm{minimax}}) = \inf_{\tilde\theta} \sup_{\theta \in \Theta} R(\theta, \tilde\theta).$$

Let us see an example of this in action.


Example: We will prove a classical result: if we observe independent draws from a $d$-dimensional Gaussian, $X_1, \ldots, X_n \sim N(\theta, I_d)$, then the average
$$\hat\theta = \frac{1}{n} \sum_{i=1}^n X_i$$
is a minimax estimator of $\theta$ with respect to the squared loss.

Let $R_n$ denote the minimax risk. First, let us compute the upper bound on $R_n$. We note that
$$\hat\theta \sim N(\theta, I_d/n),$$
so that its risk is
$$R(\theta, \hat\theta) = \mathbb{E}\left[ \sum_{i=1}^d (\hat\theta_i - \theta_i)^2 \right] = \mathbb{E}\left[ \sum_{i=1}^d Z_i^2 \right],$$
where $Z_i \sim N(0, 1/n)$. This yields
$$\inf_{\tilde\theta} \sup_{\theta \in \Theta} R(\theta, \tilde\theta) \le R(\theta, \hat\theta) = \frac{d}{n}.$$
Now we lower bound the minimax risk using the Bayes risk. Let us take the prior to be a zero-mean Gaussian, i.e. we take $\pi = N(0, c^2 I_d)$. By sufficiency, we can replace the data with $\hat\theta$. We can write:
$$\theta \sim N(0, c^2 I_d), \qquad \hat\theta \mid \theta \sim N(\theta, I_d/n).$$

We can write this as
$$\theta = c\,\varepsilon, \qquad \hat\theta = \theta + \frac{1}{\sqrt{n}}\, Z,$$
where $\varepsilon, Z \sim N(0, I_d)$ are independent. Hence,
$$\begin{pmatrix} \theta \\ \hat\theta \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} c^2 I_d & c^2 I_d \\ c^2 I_d & (c^2 + 1/n) I_d \end{pmatrix} \right).$$
We can now compute the posterior (using standard conditional Gaussian formulae), and obtain its mean:
$$\mathbb{E}[\theta \mid \hat\theta] = \frac{c^2}{c^2 + 1/n}\, \hat\theta.$$
Now, the risk of this Bayes estimator is
$$R\!\left(\theta, \frac{c^2}{c^2 + 1/n}\, \hat\theta\right) = \mathbb{E}\left\| \frac{c^2}{c^2 + 1/n}\, \hat\theta - \theta \right\|^2.$$
Write $\hat\theta = \theta + W$, where $W \sim N(0, I_d/n)$. Then
$$R\!\left(\theta, \frac{c^2}{c^2 + 1/n}\, \hat\theta\right) = \mathbb{E}_W \left\| \frac{c^2}{c^2 + 1/n}\, W - \frac{\theta}{n(c^2 + 1/n)} \right\|^2.$$
Let us denote $\beta := c^2 + 1/n$. Then we obtain
$$R\!\left(\theta, \frac{c^2}{\beta}\, \hat\theta\right) = \frac{\|\theta\|_2^2}{n^2 \beta^2} + \frac{c^4}{\beta^2}\, \mathbb{E}\|W\|_2^2 = \frac{\|\theta\|_2^2}{n^2 \beta^2} + \frac{c^4 d}{\beta^2 n}.$$
The Bayes risk further averages this over $\theta \sim N(0, c^2 I_d)$ to obtain
$$B_\pi\!\left( \frac{c^2}{c^2 + 1/n}\, \hat\theta \right) = \frac{c^2 d}{n^2 \beta^2} + \frac{c^4 d}{\beta^2 n} = \frac{c^2 d}{n \beta} = \frac{d}{n(1 + 1/(n c^2))}.$$
We conclude that
$$\frac{d}{n(1 + 1/(n c^2))} \le R_n \le \frac{d}{n}.$$
This is true for every $c > 0$. Since $c$ was arbitrary, we can take the limit as $c \to \infty$ to conclude that the minimax risk is both upper and lower bounded by $d/n$. Hence $R_n = d/n$ and the sample average $\hat\theta$ is minimax.
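The conclusion can be checked by simulation (a sketch; $d$, $n$, $\theta$, and the prior scale $c$ below are arbitrary choices): the Monte Carlo risk of the sample mean matches $d/n$, and the Bayes lower bound sits just below it.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n, n_sims = 5, 20, 50_000
theta = rng.normal(size=d)            # any fixed theta; the risk of the sample mean does not depend on it

x = rng.normal(loc=theta, scale=1.0, size=(n_sims, n, d))
theta_hat = x.mean(axis=1)            # sample mean for each simulated data set
risk_mc = np.mean(np.sum((theta_hat - theta) ** 2, axis=1))

c = 10.0                              # prior scale used in the lower bound
lower = d / (n * (1 + 1 / (n * c**2)))
print(f"Monte Carlo risk ~ {risk_mc:.4f},  d/n = {d / n:.4f},  Bayes lower bound (c={c}) = {lower:.4f}")
```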

2.2 Least Favorable Prior

The other way to obtain minimax estimators is by constructing what are called least favorable priors.

Theorem 7 Let $\hat\theta$ be the Bayes estimator for some prior $\pi$. If
$$R(\theta, \hat\theta) \le B_\pi(\hat\theta) \quad \text{for all } \theta \tag{15}$$
then $\hat\theta$ is minimax and $\pi$ is called a least favorable prior.

Proof:
Suppose that $\hat\theta$ is not minimax. Then there is another estimator $\hat\theta_0$ such that $\sup_\theta R(\theta, \hat\theta_0) < \sup_\theta R(\theta, \hat\theta)$. Since the average of a function is always less than or equal to its maximum, we have that $B_\pi(\hat\theta_0) \le \sup_\theta R(\theta, \hat\theta_0)$. Hence,
$$B_\pi(\hat\theta_0) \le \sup_\theta R(\theta, \hat\theta_0) < \sup_\theta R(\theta, \hat\theta) \le B_\pi(\hat\theta), \tag{16}$$
so $B_\pi(\hat\theta_0) < B_\pi(\hat\theta)$, which contradicts the fact that $\hat\theta$ is the Bayes estimator (it minimizes the Bayes risk).

Theorem 8 Suppose that $\hat\theta$ is the Bayes estimator with respect to some prior $\pi$. If the risk is constant then $\hat\theta$ is minimax.

Proof:
The Bayes risk is $B_\pi(\hat\theta) = \int R(\theta, \hat\theta)\, \pi(\theta)\, d\theta = c$ and hence $R(\theta, \hat\theta) \le B_\pi(\hat\theta)$ for all $\theta$. Now apply the previous theorem.

Example 9 Consider the Bernoulli model with squared error loss. We showed previously that the estimator
$$\hat{p} = \frac{\sum_{i=1}^n X_i + \sqrt{n/4}}{n + \sqrt{n}}$$
has a constant risk function. This estimator is the posterior mean, and hence the Bayes estimator, for the prior $\mathrm{Beta}(\alpha, \beta)$ with $\alpha = \beta = \sqrt{n/4}$. Hence, by the previous theorem, this estimator is minimax.
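As a small check (a sketch; $n = 100$ is an arbitrary choice), the estimator above coincides exactly with the posterior mean under a $\mathrm{Beta}(\sqrt{n/4}, \sqrt{n/4})$ prior:

```python
import numpy as np

n = 100
alpha = beta = np.sqrt(n / 4)
y = np.arange(n + 1)                               # all possible values of Y = sum_i X_i

p_hat = (y + np.sqrt(n / 4)) / (n + np.sqrt(n))
# The posterior under a Beta(alpha, beta) prior is Beta(alpha + y, beta + n - y),
# whose mean is (alpha + y) / (alpha + beta + n).
posterior_mean = (alpha + y) / (alpha + beta + n)

print("max |p_hat - posterior mean| =", np.max(np.abs(p_hat - posterior_mean)))   # ~ 0
```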
