
FIT3154 Studio 2

Estimation, Loss and Risk


Daniel F. Schmidt
July 29, 2024

Contents

1 Estimation Risk for Bernoulli Model

2 Prediction Risk for Normal Distribution

Introduction
This studio covers some problems in statistical decision theory. To gain some familiarity with the
ideas of risk functions for characterising estimators, we will examine the risk of a class of estimators
of the probability of success for the Bernoulli model. The second question will explore the Kullback–
Leibler divergence prediction loss for estimation of the mean and variance of a normal distribution.

1 Estimation Risk for Bernoulli Model


For the first part of the studio, we will look at the theoretical squared-error estimation risk of a class
of common estimators for the Bernoulli model. In this case, we observe a series of zeros and ones,
y = (y1 , . . . , yn ) with yj ∈ {0, 1} that we assume were generated independently and identically from a
Bernoulli model
p(yj = 1 | θ) = θ, j = 1, . . . , n (1)
so that θ is the probability of observing a success (“1”). The problem is to estimate θ from our
sequence of n trials. This problem is very common in data science, as it is essentially about prediction
of future binary events (storm/no storm, win election/lose election, etc.) under the assumption that
each event is independent and unaffected by external effects. More general models of this type allow θ
to be determined by predictors, etc. (i.e., logistic regression), but as we will see, how to best estimate
θ even in the simple model (1) is not immediately clear. A very common class of estimators is the
so-called smoothed frequency estimators:
θ̂α(y) = (s + α)/(n + 2α)    (2)
where s = Σ_{j=1}^n yj is the number of successes in our sequence of trials, and α ≥ 0 is a tunable
smoothing/shrinkage/regularisation constant chosen by the user. Let us examine the risk of the estimator (2)
for different choices of α. We will first look at its risk under the squared-error loss
L(θ, θ̂) = (θ − θ̂)²
Using the fact that if y follows (1), then
E [s] = nθ, V [s] = nθ(1 − θ),
we can find the bias and variance of (2) as functions of the population parameter θ and constant α:
biasθ(θ̂α) = −α(2θ − 1)/(n + 2α),    Vθ[θ̂α] = nθ(1 − θ)/(n + 2α)².    (3)
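For concreteness, the formulas in (3) translate directly into vectorised R. The sketch below is not the
provided bin.risk() file (the function name bin.risk.sketch and the default θ grid are our own choices);
it simply evaluates (3) over a grid of θ values and combines bias and variance into the squared-error risk
using the decomposition quoted in question 2 below.

# Sketch: bias, variance and squared-error risk of the smoothed estimator (2),
# evaluated from the formulas in (3) over a grid of theta values.
bin.risk.sketch = function(alpha, n, theta = seq(0, 1, length.out = 1e3))
{
  bias = -alpha*(2*theta - 1)/(n + 2*alpha)     # bias of (2) at each theta
  v    = n*theta*(1 - theta)/(n + 2*alpha)^2    # variance of (2) at each theta
  list(theta = theta, bias = bias, var = v, risk = bias^2 + v)
}
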
We can use these to explore the squared-error risk for different choices of α:
1. Write a line of R code to calculate the estimates of θ using (2) for s = 0, . . . , 10 (i.e., for n = 10,
try all possible numbers of successes). (hint: this can be done in a single line by making a vector
of s values, i.e., s = 0:10).
Try this for α = 0 and then α = 1/2. How do the two estimates differ? When α = 0, what
estimator is (2) equivalent to?
2. Download the file containing the function bin.risk(). This computes the risk function for the
estimator (2) using the fact that
Eθ[(θ̂α(y) − θ)²] = biasθ(θ̂α)² + Vθ[θ̂α]
along with (3). Open and examine the code.

3. Using the function bin.risk() create a plot showing the risk curves for n = 10 when α = 0 and
α = 1/2:

rv = bin.risk(alpha=0, n=10)          # risk curve for alpha = 0, n = 10
plot(rv$theta, rv$risk, type="l")     # plot risk against theta
rv = bin.risk(alpha=1/2, n=10)        # risk curve for alpha = 1/2
lines(rv$theta, rv$risk, col="red")   # overlay on the same plot, in red

Please answer the following questions.


(a) Where is the point of maximum risk (i.e., which value of θ is the hardest to estimate well)?
Why do you think the population parameter is hardest to estimate when it takes on this
value?
(b) At what points are the risk curves at their smallest? Why do you think these values of the
population parameter are easier to estimate well?
(c) How do the two curves differ?
(d) Does either choice of α dominate (see Slide 21, Lecture 1) the other?

(hint: remember to use plot() to plot the first curve, then lines() to overlay successive curves)

4. The choice of α = √n/2 produces an estimator that is minimax (see Slide 29, Lecture 1). Overlay
the risk curve when α takes on this value. What does this curve look like?
5. Create a new plot and overlay the risk curves for n = 100 (a much larger sample) for α = 0,
α = 1/2 and the minimax estimator α = √n/2. How do the curves compare? Do you think the
minimax estimator is a good choice? Why or why not?
6. Challenge Question. See if you can derive the bias and variance formulas (3) for the smoothed
estimator (2). (complete out of studio).
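If you attempt the challenge question, one way to sanity-check your algebra is to compare the formulas
in (3) against Monte Carlo estimates of the bias and variance. The sketch below is not part of the studio
files, and the settings (θ = 0.3, α = 1/2, n = 10) are arbitrary illustrative choices.

# Sketch: simulate many success counts and compare the empirical bias and
# variance of the smoothed estimator (2) against the formulas in (3).
set.seed(1)
n = 10; alpha = 1/2; theta = 0.3
s = rbinom(1e5, size = n, prob = theta)                           # simulated success counts
theta.hat = (s + alpha)/(n + 2*alpha)                             # smoothed estimates (2)
c(mean(theta.hat) - theta, -alpha*(2*theta - 1)/(n + 2*alpha))    # empirical vs theoretical bias
c(var(theta.hat), n*theta*(1 - theta)/(n + 2*alpha)^2)            # empirical vs theoretical variance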

2 Prediction Risk for Normal Distribution
As a second problem, let us consider the problem of estimating the mean µ and variance σ 2 of a normal
distribution. For reference, its probability density function is
p(y | µ, σ²) = (1/(2πσ²))^{1/2} exp(−(y − µ)²/(2σ²)).
In Lecture 1 we examined the estimation of the mean µ from a sample y = (y1, . . . , yn) using the
sample mean µ̂(y) = (1/n) Σ_{i=1}^n yi, under squared-error loss

L(µ, µ̂) = (µ − µ̂)²,    (4)

and found the squared-error risk to be (Slide 24, Lecture 1)


R(µ, µ̂) = Eµ[L(µ, µ̂)] = σ²/n    (5)
which is independent of the population value of µ, but depends on the noise variance (the larger σ 2 ,
the bigger the average squared-error) and the sample size (the larger the sample size, the smaller the
average squared-error). Now let us consider the risk when estimating a normal using Kullback–Leibler
(KL) divergence (see Slide 36, Lecture 1); the KL divergence is a loss function that takes into account
the structural properties of the probability model and measures the ability of our estimated model
to predict future data from the same population. The KL divergence from a normal distribution with
mean µ and variance σ² to an estimated normal distribution with mean µ̂ and variance σ̂² is
KL(µ, σ²||µ̂, σ̂²) = (1/2) log(σ̂²/σ²) + σ²/(2σ̂²) + (µ − µ̂)²/(2σ̂²) − 1/2    (6)
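The divergence (6) is straightforward to code up in R. The sketch below is not one of the provided studio
files (the name kl.normal is our own choice), but it may be handy for checking your answers to the
questions that follow.

# Sketch: KL divergence (6) from N(mu, sigma2) to an estimated N(mu.hat, sigma2.hat)
kl.normal = function(mu, sigma2, mu.hat, sigma2.hat)
{
  (1/2)*log(sigma2.hat/sigma2) + sigma2/(2*sigma2.hat) + (mu - mu.hat)^2/(2*sigma2.hat) - 1/2
}
kl.normal(0, 1, 0, 1)      # zero when the estimated model matches the population
kl.normal(0, 1, 0.5, 2)    # positive otherwise
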
Please answer the following questions.
1. Imagine we only need to estimate the mean µ, i.e., somehow we know the true value of the
variance σ 2 .
(a) Write down the simplified formula for the KL divergence (6) in this case, i.e., when we set
σ̂ 2 = σ 2 .
(b) How does this compare to the standard squared-error loss (4)? In what way is it similar,
and how does it differ? Why is this the case?
2. Calculate the KL risk for the sample mean assuming that we know σ 2 , i.e., set σ̂ 2 = σ 2 , using
your previously simplified formula. How does it differ from the squared-error risk?
3. Imagine now instead that we are estimating the variance σ 2 and somehow we know the true
population mean µ.
(a) Write down the simplified formula for the KL divergence (6) in this case, i.e., when we set
µ̂ = µ.
(b) Plot the function for σ 2 = 1 and σ̂ 2 ∈ (0.1, 10) using

sigma2 = 1
sigma2.hat = seq(0.1, 10, length.out=1e3)                        # grid of candidate variance estimates
kl = (1/2)*log(sigma2.hat/sigma2) + sigma2/2/sigma2.hat - 1/2    # KL divergence (6) with mu.hat = mu
plot(sigma2.hat, kl, type="l")

(c) What does the curve look like? Does the KL divergence penalize overestimation of σ 2
differently from underestimation? Which is less costly in terms of loss?
(d) Overlay the KL divergence for σ 2 = 2 and σ̂ 2 ∈ (0.1, 10). Does it look similar to the previous
curve?
4. Standard estimators that we have previously examined for the mean and variance of a normal
distribution are the sample mean and (adjusted) sample variance
µ̂ = (1/n) Σ_{i=1}^n yi ,    σ̂² = (1/(n − k)) Σ_{i=1}^n (yi − µ̂)² ,    (7)

where k ≥ 0 is a user-chosen constant. For k = 0 the estimator is the sample variance, and
for k = 1 it is the unbiased estimate of variance. The KL risk of these estimators is found by
plugging (7) into (6) and taking appropriate expectations, and is a little tricky; we could instead
use simulation to obtain an approximate value of risk.
Download and source the file normal.kl.risk.R. This contains a function
normal.kl.risk(mu,sigma2,k,n,m=1e6)
which runs m simulations, each of which involves drawing a sample of size n from the normal with
specified mean and variance, calculating the estimates (7) using this sample, and then calculating
the KL divergence (6) for these estimates. All m KL divergences are then averaged to estimate
the risk. This is an example of the simulation procedure from Slide 20, Lecture 1. Examine the
code and work through the lines; you should be able to understand what is going on. (A rough,
self-contained sketch of this simulation procedure is also given at the end of this question list.)
5. Run the normal.kl.risk() function for mu = 0, sigma2 = 1, n = 10 and k = 0. Compare this
against the risk obtained using k = 1. Which is better? Why do you think it might be better?
6. It turns out that the exact KL risk for these estimators is given by the formula
    
E[KL(µ, σ²||µ̂, σ̂²)] = (1/2) ψ((n − 1)/2) + (1/2) log(2/(n − k)) + (n + 1)(n − k)/(2(n − 3)n) − 1/2    (8)
where ψ(x) is a special function called the digamma function (digamma() in R). In what way
does the KL risk (8) for the estimators (7) depend on the population values of µ and σ 2 ? What
does this imply, and is this a good thing?
7. Evaluate (8) for n = 10 and k = 0 and k = 1 using the function normal.kl.risk.exact(n,k=0)
from the file normal.kl.risk.R. How do these quantities compare to the estimates of risk
obtained by simulation previously? Do the same for n = 100; how do they compare?
8. Challenge Question 1. See if you can derive the formula for the KL divergence (6) between
two normal distributions. As a hint, it is easier to rewrite the KL divergence formula as
KL(θ||θ̂) = ∫ p(y | θ) log p(y | θ) dy − ∫ p(y | θ) log p(y | θ̂) dy
         = Eθ[log p(y | θ)] − Eθ[log p(y | θ̂)]

where Eθ [f (y)] denotes an expectation of f (y) with respect to the distribution p(y | θ). (complete
out of studio)
9. Challenge Question 2. Using (8) determine the optimal value of k for a given value of n.
What happens to this value as n → ∞? (complete out of studio)
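For reference, the sketch below is the rough self-contained simulation mentioned in question 4 (it is not
the provided normal.kl.risk.R file, and the settings mu = 0, sigma2 = 1, n = 10, k = 0, m = 1e5 are
illustrative choices only). It estimates the KL risk by simulation and then evaluates the exact formula (8)
for comparison.

# Sketch: simulated KL risk for the estimators (7) versus the exact formula (8)
set.seed(1)
mu = 0; sigma2 = 1; n = 10; k = 0; m = 1e5
kl = numeric(m)
for (i in 1:m)
{
  y = rnorm(n, mean = mu, sd = sqrt(sigma2))       # draw a sample of size n
  mu.hat = mean(y)                                 # sample mean (7)
  sigma2.hat = sum((y - mu.hat)^2)/(n - k)         # adjusted sample variance (7)
  kl[i] = (1/2)*log(sigma2.hat/sigma2) + sigma2/(2*sigma2.hat) +
          (mu - mu.hat)^2/(2*sigma2.hat) - 1/2     # KL divergence (6)
}
mean(kl)                                           # simulated estimate of the KL risk
(1/2)*digamma((n - 1)/2) + (1/2)*log(2/(n - k)) +
  (n + 1)*(n - k)/(2*(n - 3)*n) - 1/2              # exact KL risk (8)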
