FIT3154 Studio 5
Bayesian Inference Part 1
Daniel F. Schmidt
August 19, 2024
Contents
1 Introduction
2 Bayesian Estimation of the Bernoulli Model
2.1 Challenge Question: Deriving the Posterior Distribution
3 Bayesian Analysis of the Normal Distribution
3.1 Challenge Question: Deriving the Posterior Distribution
1 Introduction
Studio 5 introduces us to some very basic Bayesian analysis of data. We will examine two impor-
tant building blocks for much of data science – the Bernoulli distribution and the normal distribution.
Question 2 examines how basic Bayesian analysis of binary data can be performed, and Question 3
examines a basic approach to inference of the mean of normally distributed data in the Bayesian
framework.
2 Bayesian Estimation of the Bernoulli Model
We will begin by using the Bayesian approach to analyse binary data. Lecture 4 examined the usual
approach to Bayesian analysis of the Bernoulli distribution based on the beta prior distribution. We
will briefly recap these results. Consider a data vector y = (y1 , . . . , yn ) of 0s and 1s; the Bernoulli
likelihood for this data is
$$ p(y \mid \theta) = \theta^{k} (1 - \theta)^{n-k} \tag{1} $$
where $k = \sum_{j=1}^{n} y_j$ is the number of successes in our data, n is the number of trials and θ is the
(unknown) probability of success. The usual maximum likelihood estimator, found by maximising (1)
with respect to θ, is equivalent to the sample (empirical) proportion:
$$ \hat{\theta}_{\mathrm{ML}}(y) = \frac{k}{n}. $$
In the usual likelihood/confidence interval/hypothesis testing framework, inference proceeds by assum-
ing the population success probability θ is unknown but fixed. The Bayesian approach to inference
involves making a different assumption: that the unknown θ is itself a realisation of a random process,
and can therefore be characterised by a probability distribution π(θ), which we call the prior distribu-
tion. The prior distribution can be chosen to reflect subjective beliefs (i.e., the probability we assign
different values of θ is related to how likely we believe the different values of θ are to be the value of
the true population parameter), it can be chosen for convenience, or even chosen to produce a specific
type of estimator. Regardless of how the priors are chosen, the central quantity in Bayesian analysis
is the posterior distribution
$$ p(\theta \mid y) = \frac{p(y \mid \theta)\,\pi(\theta)}{\int p(y \mid \theta)\,\pi(\theta)\,d\theta} \tag{2} $$
which describes the probability of the true population parameter θ taking on different values in the parameter space Θ,
conditional on the fact we have seen the data y. In Lecture 4 we examined the use of the beta
distribution as a suitable prior distribution for the success probability θ. The beta distribution has
two parameters, α and β, and is given by
$$ \pi(\theta \mid \alpha, \beta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)} \tag{3} $$
where B(α, β) is a special function called the beta function. The beta distribution is used as a prior
for two main reasons: (i) it is highly flexible, allowing a range of prior beliefs about θ to be encoded
by appropriate choice of α and β, and (ii) it is “conjugate” in the sense that when used in (2) with
likelihood (1) the posterior distribution is also a beta distribution:
$$ \theta \mid y, \alpha, \beta \sim \mathrm{Beta}(k + \alpha,\; n - k + \beta). \tag{4} $$
This makes inference about θ very straightforward, as most packages have specialised functions to find
quantiles, etc. of the beta distribution. Let’s look at this in some more detail.
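As a concrete illustration, below is a minimal R sketch of this conjugate update; the function beta.bernoulli.posterior() is hypothetical (it is not part of any provided script), and returns the posterior parameters together with the posterior mean and a 95% equal-tailed credible interval.

# A minimal sketch of the conjugate beta-Bernoulli update in (4).
# The function name is hypothetical, not from a provided script.
beta.bernoulli.posterior <- function(y, alpha = 1, beta = 1) {
  k <- sum(y)       # number of successes
  n <- length(y)    # number of trials
  a <- k + alpha; b <- n - k + beta   # posterior is Beta(a, b)
  list(a = a, b = b,
       mean = a / (a + b),                    # posterior mean; cf. (5) below
       ci95 = qbeta(c(0.025, 0.975), a, b))   # 95% equal-tailed credible interval
}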
1. How can we interpret the values of the hyperparameters α and β in the posterior distribution
(4)?
2. The posterior mean is a commonly used Bayesian point estimate (best guess of θ); for (4) it is
given by
$$ E[\theta \mid y] = \frac{k + \alpha}{n + \alpha + \beta} \tag{5} $$
(a) Can you identify this estimator as one you have seen before when α = β?
(b) What happens to the posterior mean as α and β grow so that the ratio α/(α + β) = θ′
remains fixed, and k, n are held constant?
(c) What happens to the posterior mean as k and n grow so that the ratio k/n = θ′ remains
fixed, and α and β are held constant?
3. The posterior variance is a measure of how much uncertainty we have about our estimate of θ;
for (4) it is given by
$$ V[\theta \mid y] = \frac{(k + \alpha)(n - k + \beta)}{(n + \alpha + \beta)^2 (n + \alpha + \beta + 1)} \tag{6} $$
(a) What happens to the posterior variance as α = β grow and k, n are held constant?
(b) What happens to the posterior variance as n → ∞, and α and β are held constant?
4. Let us analyse some basic data using (4). Imagine our friend claims that she can accurately
predict the outcome (win/loss/draw) of Australian Rules Football matches. You believe she can’t
do better than a 50-50 coin toss, and decide to put her claim to the test. You then recorded
whether she correctly predicted the outcome of the next 12 games:
y = (0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0) (7)
with a “1” denoting that she correctly predicted the result, and a “0” denoting that she got the
result incorrect. As a reference,
(a) let us first compute the maximum likelihood estimate θ̂ML ≡ θ̂ML (y) of the population
parameter θ,
(b) the standard error of θ̂ML using the central limit theorem
$$ \mathrm{se}(\hat{\theta}_{\mathrm{ML}}) = \sqrt{\frac{\hat{\theta}_{\mathrm{ML}}(1 - \hat{\theta}_{\mathrm{ML}})}{n}} $$
(c) and the 95% confidence interval for θ using the usual (asymptotic) approximation
$$ \mathrm{CI}_{95} = \left( \hat{\theta}_{\mathrm{ML}} - 1.96\,\mathrm{se}(\hat{\theta}_{\mathrm{ML}}),\; \hat{\theta}_{\mathrm{ML}} + 1.96\,\mathrm{se}(\hat{\theta}_{\mathrm{ML}}) \right) $$
Remember the standard error is a measure of how much we would expect our estimate
θ̂ML to change by if we obtained a new sample of the same sample size n from the same
population/process, and recomputed our estimate. It is a measure of the variability of θ̂ML .
Using these quantities, do you think the data support the hypothesis that she can predict the
outcomes of games better than a 50-50 coin toss? (A sketch of the computations for questions
4–6 appears after this list of questions.)
5. Now let us perform a Bayesian analysis of the same data. We have no strong prior beliefs about
the frequency with which our friend can correctly predict games, so we choose α = β = 1 (an
“uninformative” uniform prior on θ). For the data (7), do the following:
(a) Plot the posterior distribution (4) (hint: use the dbeta() function in R).
(b) Compute the posterior mean (5) and posterior standard deviation (the square root of (6)).
How do these compare to the maximum likelihood estimate and standard error?
(c) Compute the 95% credible interval for θ (see Lecture 4, Slide 37) using the quantiles p = 0.025 and p = 0.975.
How does this compare to the approximate 95% confidence interval we obtained above?
Does your Bayesian analysis lead to different conclusions about your friend’s ability to predict
AFL results?
6. One of the strengths of the Bayesian framework is the ability to incorporate any actual prior
information on a problem that you may have. The seamless, formal way in which this is done
through the prior distribution is one of the cornerstones of the theory. Imagine now that another
friend of ours comes along and tells us that he also conducted a similar experiment to test
our original friend’s claim of being able to predict AFL matches, and in that experiment, she
successfully predicted the outcome of 11 out of 12 matches.
(a) How can you incorporate this information into your prior distribution by the appropriate
choice of values for α and β?
(b) Plot your revised posterior distribution.
(c) Compute the posterior mean, posterior standard deviation and 95% credible interval for
your posterior distribution computed using this revised prior.
Does this lead to a revised conclusion regarding your friend’s ability to predict AFL results?
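Below is one possible sketch of the computations asked for in questions 4–6; the variable names are illustrative, and the revised prior counts in the final step are an assumed (natural) way of encoding the second experiment.

y <- c(0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0)   # the data (7)
k <- sum(y); n <- length(y)

# Question 4: maximum likelihood estimate, standard error and 95% CI
theta.ml <- k / n
se.ml    <- sqrt(theta.ml * (1 - theta.ml) / n)
ci.ml    <- c(theta.ml - 1.96 * se.ml, theta.ml + 1.96 * se.ml)

# Question 5: Bayesian analysis under the uniform prior alpha = beta = 1
a.prior <- 1; b.prior <- 1
theta.grid <- seq(0, 1, length.out = 500)
plot(theta.grid, dbeta(theta.grid, k + a.prior, n - k + b.prior), type = "l",
     xlab = expression(theta), ylab = "Posterior density")
post.mean <- (k + a.prior) / (n + a.prior + b.prior)                       # (5)
post.sd   <- sqrt((k + a.prior) * (n - k + b.prior) /
                  ((n + a.prior + b.prior)^2 * (n + a.prior + b.prior + 1)))  # sqrt of (6)
ci.bayes  <- qbeta(c(0.025, 0.975), k + a.prior, n - k + b.prior)

# Question 6: fold the second experiment (11 correct out of 12) into the
# prior; one natural (assumed) choice is to update the uniform prior counts
a.prior2 <- 1 + 11; b.prior2 <- 1 + 1
post.mean2 <- (k + a.prior2) / (n + a.prior2 + b.prior2)
ci.bayes2  <- qbeta(c(0.025, 0.975), k + a.prior2, n - k + b.prior2)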
2.1 Challenge Question: Deriving the Posterior Distribution
As a challenge question, see if you can derive the posterior distribution (4) by using the Bernoulli
likelihood (1) and beta prior (3) in (2) (hint: rather than trying to compute the integral in the
denominator, try writing out the numerator and seeing if you can identify the beta distribution it is
proportional to).
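If you get stuck, the following sketch shows the shape of the argument; verifying the final step and recovering the normalising constant is left to you:
$$ p(\theta \mid y) \propto \theta^{k}(1-\theta)^{n-k} \cdot \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)} \propto \theta^{(k+\alpha)-1}(1-\theta)^{(n-k+\beta)-1}, $$
which is the kernel of the Beta(k + α, n − k + β) distribution appearing in (4).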
3 Bayesian Analysis of the Normal Distribution
In this question we will examine one approach to Bayesian analysis of the normal distribution. Un-
surprisingly, given the importance of this problem, there has been an extensive (vast) literature on
different prior distributions for the mean of a normal distribution, and the literature continues to
grow. We will examine one of the most basic priors, which actually remains very relevant due to the
fact it is usually used as a building block for more complex Bayesian analysis in a hierarchical model
framework. We observe a sample of n numbers y = (y1 , . . . , yn ) which we assume to be generated by
the population model
$$ y_j \sim N(\mu, \sigma^2), \quad j = 1, \ldots, n, $$
where µ is the unknown population mean, and for simplicity we assume that the population variance
σ² is known. We will see how to relax this assumption next Studio. The likelihood for the normal
distribution is
$$ p(y \mid \mu) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{j=1}^{n} (y_j - \mu)^2 \right) \tag{8} $$
The usual maximum likelihood estimate for µ is equivalent to the sample mean
$$ \hat{\mu}_{\mathrm{ML}}(y) = \frac{1}{n} \sum_{j=1}^{n} y_j = \bar{y}. $$
To perform a Bayesian analysis of the normal distribution, we need a suitable prior distribution for
the mean µ. The most basic prior distribution used in the literature is the normal distribution; that
is, we assume that the unknown population mean itself follows a normal distribution
$$ \pi(\mu \mid m, s^2) = \left( \frac{1}{2\pi s^2} \right)^{1/2} \exp\left( -\frac{(m - \mu)^2}{2 s^2} \right) \tag{9} $$
where m and s2 are hyperparameters controlling the prior distribution over the mean µ. So, using this
prior, m controls the a priori (i.e., before seeing data) expected value of the unknown parameter µ (i.e.,
our “best guess”) and s2 controls how confident we are about our guess (larger s2 ⇒ less confident,
smaller s2 ⇒ more confident). We can write this Bayesian model in the hierarchical form:
$$ y_j \mid \mu \sim N(\mu, \sigma^2), \quad j = 1, \ldots, n $$
$$ \mu \mid m, s^2 \sim N(m, s^2) $$
Using a normal distribution for µ is convenient for two reasons: (i) it allows a reasonably flexible
specification of our prior beliefs about µ by appropriate choices of the mean m and variance s2 , and
(ii) it is conjugate to the normal distribution for yj ; this means that the posterior distribution for µ
using this prior is also a normal distribution:
$$ \mu \mid y \sim N\!\left( w\bar{y} + (1 - w)m,\; \left( \frac{n}{\sigma^2} + \frac{1}{s^2} \right)^{-1} \right) \tag{10} $$
where
$$ w = \frac{n/\sigma^2}{n/\sigma^2 + 1/s^2} \tag{11} $$
Let us now examine this posterior distribution in some more detail.
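Before working through the questions, here is a minimal R sketch of the calculations implied by (10) and (11); it is an assumed implementation, and need not match the bayes.normal.normal() function provided in studio5.normal.R line for line.

# A sketch of the normal-normal posterior (10)-(11); assumed implementation
bayes.normal.sketch <- function(y, sigma, m, s) {
  n    <- length(y)
  ybar <- mean(y)
  w    <- (n / sigma^2) / (n / sigma^2 + 1 / s^2)   # the weight (11)
  post.mean <- w * ybar + (1 - w) * m               # posterior mean of (10)
  post.sd   <- sqrt(1 / (n / sigma^2 + 1 / s^2))    # posterior std. deviation
  # The posterior is normal, so a 95% credible interval uses normal quantiles
  ci <- qnorm(c(0.025, 0.975), mean = post.mean, sd = post.sd)
  list(mean = post.mean, sd = post.sd, ci95 = ci)
}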
1. From (10) the posterior mean of µ is clearly given by:
$$ E[\mu \mid y] = w\bar{y} + (1 - w)m \tag{12} $$
where w is given by (11).
(a) How can we interpret this estimate? How can we interpret w?
(b) What happens to w as the sample size n increases?
(c) What happens to E [µ | y] as the prior mean m → ∞?
(d) What happens to w as the prior variance s2 → 0?
(e) Does this suggest a way in which we could interpret m in terms of a previously observed
data sample?
2. From (10) the posterior variance of µ is clearly
$$ V[\mu \mid y] = \left( \frac{n}{\sigma^2} + \frac{1}{s^2} \right)^{-1}. \tag{13} $$
Think about the following questions:
(a) What happens to (13) as the prior variance s2 → 0?
(b) What happens to (13) as the sample size n → ∞?
3. Download the R script studio5.normal.R. It contains a function analyse.normal() which
calculates the standard sample mean, standard error $\sigma/\sqrt{n}$ and 95% confidence interval
$(\bar{y} - 1.96\,\sigma/\sqrt{n},\ \bar{y} + 1.96\,\sigma/\sqrt{n})$ for a given sample of data and a specified σ.
4. The R script also contains the function bayes.normal.normal(y, sigma, m, s) for performing
Bayesian analysis. It returns the posterior mean (12), the posterior standard deviation (square-
root of (13)) and the 95% credible interval for µ given hyperparameters m and s. Examine the
function and see if you can match up the formulas.
5. Let us analyse some data using this estimator. Download the file bpdata.csv. We will look to
estimate the mean blood pressure of this population of men aged 47–56. From a large study
we know that the standard deviation of blood pressure in the population for men aged 40–59 is
approximately 13.5mmHg. Using this as our value for the standard deviation for our population,
use analyse.normal() to compute the non-Bayesian inferential statistics (mean, standard error,
confidence interval) on bpdata$BP.
6. We will now use your Bayesian estimator to infer the mean blood pressure of this cohort. We
need values for our prior distribution. From the large study of blood pressure, we find that the
population median blood pressure of men aged 40−59 is 122mmHg. We can use this as the value
for our m hyperparameter (prior guess at likely value of blood pressure). The s2 hyperparameter
is a little trickier to set; from the study on blood pressure, it was found that the median blood
pressure of cohorts of men of different ethnicities aged 40–59 varied by up to 4mmHg. This suggests
we might take s = 2, so that a priori we say there is a 95% chance that the µ for our population
lies between 118 and 126.
Using these values for the hyperparameters (m = 122, s = 2), calculate the posterior mean,
posterior standard deviation and credible interval using your function. How do they compare to
the sample mean, standard error and confidence interval?
7. Now, imagine a colleague who was an expert in blood pressure saw your analysis, and suggested
your prior was too tightly concentrated around 122mmHg. They might comment that you
don’t know if the cohort of people have been taking drugs to lower blood pressure, or may have
been selected because of extra low or extra high blood pressure, etc. That is, the group of
people may not be an “average” subset of the US population, as we don’t know how they were
selected, and our prior may not be appropriate. So instead of using a prior based on the range
of average blood pressures across ethnicities, we could use something that was less informative.
A decent approach here is to use the variability of possible blood pressures observed in the US
population as a measure of how widely we might reasonably expect our study population’s blood
pressure to range; the standard deviation of blood pressure measurements in the US population
is 13.5mmHg, as we determined above, so we now use m = 122, s = 13.5 as the settings for our
prior hyperparameters.
Rerun the Bayesian analysis using these new values for the prior distribution (see the usage
sketch after this question). How do the
posterior mean, posterior standard deviation and credible intervals compare to those produced
in the previous question, and the sample mean, standard error and confidence interval produced
by the non-Bayesian approach?
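A possible usage sketch for questions 5–7, assuming bpdata.csv is in the working directory and reusing the bayes.normal.sketch() function from above (the provided functions should produce the same quantities):

bpdata <- read.csv("bpdata.csv")

# Question 5 (non-Bayesian reference, computed directly; the provided
# analyse.normal() should produce the same quantities)
n <- length(bpdata$BP); ybar <- mean(bpdata$BP); se <- 13.5 / sqrt(n)
c(ybar, se, ybar - 1.96 * se, ybar + 1.96 * se)

# Question 6: informative prior from the variation across ethnicities
bayes.normal.sketch(bpdata$BP, sigma = 13.5, m = 122, s = 2)

# Question 7: less informative prior using the population variability
bayes.normal.sketch(bpdata$BP, sigma = 13.5, m = 122, s = 13.5)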
8. We saw that as s was made larger, the prior had less effect on our estimates. One idea then is to let
s get really big – in fact, we can take s2 → ∞. This spreads the prior probability over (−∞, ∞)
infinitesimally thinly, but has the advantage of saying that any value of µ is a priori equally
likely – an expression of true ignorance about µ (Bayesian inference may truly be the one place
that ignorance is potentially a good thing!). Examine the posterior equation (10).
(a) Write the resulting posterior if we let s2 → ∞ (hint: to do this, take limits as s2 → ∞ of
both the posterior mean and variance terms).
(b) What is the posterior mean when s2 → ∞? How about the posterior variance? Are these
quantities familiar?
While this approach might seem appealing, it has an unfortunate drawback. The prior distri-
bution implied by taking s → ∞ is actually not normalisable – that is, it does not integrate to
one because it spreads its probability too thinly! These are called improper priors, and are a
little hard to get your head around. Unfortunately, most truly uninformative prior distributions
are improper. While use of improper priors in Bayes theorem can (often) result in proper (nor-
malisable) posterior distributions, they lack any probability interpretation. In a later studio we will
see an alternative approach to making “mostly uninformative” priors that largely circumvent the
problems of having to specify hyperparameters, while remaining proper (normalisable).
9. See if you can analyse the Weight data in bpdata. Imagine you determined from a previous
study that σ = 18 is a reasonable value for the variability of weight in the population that takes
into account not just how much weight varies, but also the errors in self-reporting weight that
are common in data collection and research.
Further imagine that you are told by a colleague who is an expert on weight and nutrition that
it is a reasonable prior belief that 75% of men have weights between 75kg and 105kg.
(a) See if you can work out how to select the prior hyperparameters m and s so that your
normal prior on µ encodes this prior information (one possible approach is sketched after
this question).
(b) Plot the resulting posterior distribution for the Weight data using this choice of hyperpa-
rameters.
(c) Analyse the weight data and report on the posterior mean, standard deviation and credible
interval for our dataset. How do these compare with the non-Bayesian statistics?
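One possible way to back out the hyperparameters for question 9, treating the colleague's 75% statement as a central prior interval for µ (an assumption) and again reusing the bayes.normal.sketch() function:

# Centre the prior on the middle of the stated interval
m <- (75 + 105) / 2                    # = 90
# 75% central prior mass within (75, 105) => 105 is the 87.5th percentile
s <- (105 - 75) / 2 / qnorm(0.875)     # approximately 13.0

# Posterior summaries and a plot of the posterior density for Weight
post <- bayes.normal.sketch(bpdata$Weight, sigma = 18, m = m, s = s)
mu.grid <- seq(post$mean - 4 * post$sd, post$mean + 4 * post$sd,
               length.out = 500)
plot(mu.grid, dnorm(mu.grid, post$mean, post$sd), type = "l",
     xlab = expression(mu), ylab = "Posterior density")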
3.1 Challenge Question: Deriving the Posterior Distribution
As a further challenge question, see if you can derive the posterior distribution (10). To do this, it is
easier to rewrite the likelihood (8) as
$$ p(y \mid \mu) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{1}{2\sigma^2} \left( Y_2 - 2\mu n\bar{y} + n\mu^2 \right) \right) \tag{14} $$
where $Y_2 = \sum_{j=1}^{n} y_j^2$. Then you can use (14) and (9) in (2). Again the best approach is to compute
the numerator of (2) and see if you can figure out how to identify the normal distribution it is
proportional to.
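As a starting point, discarding every factor that does not involve µ leaves
$$ p(\mu \mid y) \propto \exp\left( -\frac{1}{2} \left[ \left( \frac{n}{\sigma^2} + \frac{1}{s^2} \right) \mu^2 - 2\mu \left( \frac{n\bar{y}}{\sigma^2} + \frac{m}{s^2} \right) \right] \right); $$
completing the square in µ then identifies a normal density whose variance is $(n/\sigma^2 + 1/s^2)^{-1}$ and whose mean is that variance multiplied by $n\bar{y}/\sigma^2 + m/s^2$, which you should check agrees with (10) and (11).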