Monte Carlo Methods
∗ Monte Carlo methods are a way of approximating the value
of an integral using large samples of random variables.
∗ These samples of random variables are typically computer
generated.
∗ Since we regularly need to calculate integrals in Bayesian
inference, Monte Carlo methods are very popular in that setting.
6-1
Monte Carlo Integration
∗ Suppose that we need to evaluate
$$I = \int_A h(x)\, dx$$
∗ Let f be a probability density function with support A.
∗ Then we can write
$$I = \int_A \frac{h(x)}{f(x)} f(x)\, dx = \int_A g(x) f(x)\, dx$$
where g(x) = h(x)/f(x).
∗ Now if Y is a random variable with pdf f then we have
I = E[g(Y)].
6-2
Monte Carlo Integration
∗ Hence if we have Y_1, . . . , Y_N iid ∼ f, the Weak Law of Large
Numbers tells us that
$$\hat{I} = \frac{1}{N} \sum_{i=1}^{N} g(Y_i) \xrightarrow{\,p\,} I.$$
∗ We can therefore use Î to approximate I very well for large
enough N.
∗ Furthermore we can estimate the variability in Î using
the sample variance of the random sample g(Y_1), . . . , g(Y_N)
divided by N.
∗ Since N is totally in our control we can choose N to be large
enough to make the variability as low as we desire.
6-3
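The estimator and its standard error take only a few lines of code. A minimal sketch, assuming the toy integral I = ∫₀¹ x² dx = 1/3 with f the Uniform(0, 1) density (so g(x) = h(x)/f(x) = x²); the integrand is an illustrative choice, not from the slides:

```python
import math
import random

random.seed(0)

# Estimate I = ∫_0^1 x^2 dx = 1/3 by drawing Y_i ~ Uniform(0, 1):
# f(x) = 1 on (0, 1), so g(x) = h(x)/f(x) = x^2.
N = 100_000
g_vals = [random.random() ** 2 for _ in range(N)]

I_hat = sum(g_vals) / N  # Monte Carlo estimate of I

# Estimated variance of I_hat: sample variance of the g(Y_i) divided by N.
sample_var = sum((g - I_hat) ** 2 for g in g_vals) / (N - 1)
se_hat = math.sqrt(sample_var / N)

print(I_hat, se_hat)
```

Since the standard error shrinks like 1/√N, halving it requires quadrupling N.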
Generating Uniform Random Variates
∗ Computers are unable to generate truly random numbers.
∗ They can, however, be used to generate pseudo-random numbers.
∗ These are sequences of numbers which are generated from a
deterministic algorithm but which behave like a sequence of
iid random variates.
∗ Typically the numbers generated by a computer can be thought
of as coming from a Uniform(0,1) distribution.
6-4
Generating Non-Uniform Random Numbers
∗ Uniform random variates are rarely what we need for simulation
or Monte Carlo inference.
∗ They are, however, the building blocks for generating random
variates from any other distribution.
∗ Much of this is based on the following theorem
Theorem 6.1 (Probability Integral Transform)
Suppose that U ∼ Uniform(0, 1) and that F is a continuous cdf
with unique inverse F^{-1}. Then the random variable
Y = F^{-1}(U)
has a distribution with cdf F.
6-5
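Theorem 6.1 translates directly into code. A small sketch, assuming an Exponential(rate) target (an illustrative choice), whose cdf F(y) = 1 − exp(−rate·y) has the closed-form inverse F⁻¹(u) = −log(1 − u)/rate:

```python
import math
import random

random.seed(1)

def exponential_variate(rate: float) -> float:
    # F(y) = 1 - exp(-rate * y)  =>  F^{-1}(u) = -log(1 - u) / rate
    u = random.random()  # U ~ Uniform(0, 1)
    return -math.log(1.0 - u) / rate

sample = [exponential_variate(2.0) for _ in range(100_000)]
print(sum(sample) / len(sample))  # should be close to the mean 1/rate = 0.5
```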
Generating Discrete Random Variates
∗ The method as described above requires that we have a continuous cdf.
∗ A similar technique can also be used to generate discrete
random variables.
∗ Suppose that p(y) is the probability mass function and the
support of the random variable is Y = {y : p(y) > 0}. Then
we can define the inverse cdf as
$$F^{-1}(u) = \min\{y \in \mathcal{Y} : F(y) \geqslant u\}$$
∗ Then if U ∼ Uniform(0, 1), the random variable Y = F^{-1}(U)
will be distributed with probability mass function p(y).
6-6
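A minimal sketch of the discrete version, scanning the support for the smallest y with F(y) ⩾ u; the three-point pmf is made up for the example:

```python
import random

random.seed(2)

# Illustrative three-point distribution (not from the slides).
support = [1, 2, 3]
pmf = [0.2, 0.5, 0.3]

def discrete_variate() -> int:
    u = random.random()
    cumulative = 0.0
    for y, p in zip(support, pmf):
        cumulative += p       # cumulative = F(y)
        if cumulative >= u:   # smallest y with F(y) >= u
            return y
    return support[-1]        # guard against floating-point rounding at u ≈ 1

counts = {y: 0 for y in support}
for _ in range(100_000):
    counts[discrete_variate()] += 1
print({y: c / 100_000 for y, c in counts.items()})  # close to the pmf
```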
Special Methods
∗ In many cases, the inverse cdf is not available in closed form
and so this method cannot be used. We can often, however,
use algorithms based on transformations for such situations.
∗ Suppose that U1 and U2 are two independent Uniform(0, 1)
random variables; then it is easy to show that
$$Y_1 = \sqrt{-2 \log U_1}\, \sin(2\pi U_2) \quad \text{and} \quad Y_2 = \sqrt{-2 \log U_1}\, \cos(2\pi U_2)$$
are independent standard normal random variables.
∗ This is known as the Box-Muller Algorithm.
6-7
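The transform is straightforward to implement. A sketch (using 1 − U in place of U to keep the logarithm away from zero; the two forms have the same distribution):

```python
import math
import random

random.seed(3)

def box_muller() -> tuple[float, float]:
    u1 = 1.0 - random.random()  # in (0, 1], keeps log(u1) finite
    u2 = random.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.sin(2.0 * math.pi * u2), r * math.cos(2.0 * math.pi * u2)

draws: list[float] = []
for _ in range(50_000):
    y1, y2 = box_muller()
    draws.append(y1)
    draws.append(y2)

mean = sum(draws) / len(draws)
var = sum((y - mean) ** 2 for y in draws) / (len(draws) - 1)
print(mean, var)  # should be close to 0 and 1
```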
Accept/Reject Algorithm
∗ A more general technique which is useful when the inverse
cdf method cannot be applied is called the Accept/Reject
Algorithm.
∗ This method relies on generating a different random variable
V which has the same support as the required variable Y .
∗ We also require that the ratio of densities is bounded by a
known constant
$$M = \sup_{y} \frac{f_Y(y)}{f_V(y)} < \infty$$
6-8
Accept/Reject Algorithm
1. Calculate M = sup_y f_Y(y)/f_V(y).
2. Generate V ∼ fV and independently U ∼ Uniform(0, 1).
3. If
$$U < \frac{f_Y(V)}{M f_V(V)}$$
then set Y = V. Otherwise discard U and V and return to
step 2.
6-9
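A sketch of the three steps, assuming an illustrative Beta(2, 2) target f_Y(y) = 6y(1 − y) on (0, 1) with a Uniform(0, 1) proposal, so that f_V ≡ 1 and M = f_Y(1/2) = 3/2 (none of these choices come from the slides):

```python
import random

random.seed(4)

def f_target(y: float) -> float:
    # Beta(2, 2) density on (0, 1)
    return 6.0 * y * (1.0 - y)

M = 1.5  # sup_y f_Y(y) / f_V(y) = f_Y(1/2) = 3/2, since f_V ≡ 1

def accept_reject() -> float:
    while True:
        v = random.random()        # step 2: V ~ f_V = Uniform(0, 1)
        u = random.random()        # step 2: U ~ Uniform(0, 1), independent
        if u < f_target(v) / M:    # step 3 (f_V(v) = 1 in the denominator)
            return v               # accept: Y = V
        # otherwise discard U and V and try again

sample = [accept_reject() for _ in range(100_000)]
print(sum(sample) / len(sample))  # Beta(2, 2) has mean 1/2
```

On average one candidate in M is accepted, so a proposal giving a small M makes the sampler efficient; here the acceptance rate is 1/M = 2/3.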
Markov Chain Monte Carlo Methods
∗ Many of the methods described so far are not very useful for
generating multivariate random variates.
∗ Markov Chain Monte Carlo methods are now widely used in
these settings.
∗ The methods work on the idea of constructing a Markov
chain which has a stationary distribution equal to the
distribution of interest.
∗ Under certain conditions, the distribution of the elements in
such a chain will converge to this stationary distribution.
6-10
Markov Chain Monte Carlo Methods
∗ These algorithms start with some initial value for the random
variable of interest.
∗ They then run a carefully constructed Markov chain starting
from that initial value for a sufficiently long time.
∗ It is not always easy to know how long the chains should be
run but various diagnostics have been proposed.
∗ Any observations in the chain after this burn-in period may
be considered as (at least approximately) distributed with the
stationary distribution.
6-11
Metropolis–Hastings Algorithm
∗ First introduced in statistical physics by Metropolis
et al. in 1953. Its statistical properties were established by
Hastings in 1970.
∗ It is basically a Markov chain version of the accept/reject
algorithm.
∗ Random variates are generated from some candidate
distribution conditional on the current state of the chain;
the new state is then either accepted, or rejected, in which
case the chain stays where it is.
6-12
Metropolis–Hastings Algorithm
Suppose we wish to sample Y ∼ f_Y.
First initialize the chain with some value Y^(0).
Then for t = 1, 2, . . . we generate Y^(t) by
1. Generate V^(t) ∼ f_{V|Y}(v | Y^(t−1)).
2. Calculate the acceptance probability
$$\rho_t = \min\left( \frac{f_Y(V^{(t)})}{f_Y(Y^{(t-1)})} \times \frac{f_{V|Y}(Y^{(t-1)} \mid V^{(t)})}{f_{V|Y}(V^{(t)} \mid Y^{(t-1)})},\; 1 \right)$$
3. Generate U_t ∼ Uniform(0, 1) and set
$$Y^{(t)} = \begin{cases} V^{(t)} & \text{if } U_t \leqslant \rho_t \\ Y^{(t-1)} & \text{if } U_t > \rho_t \end{cases}$$
6-13
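A sketch of the algorithm for an illustrative Exponential(1) target with a non-symmetric log-normal candidate V^(t) = Y^(t−1)·exp(Z), Z ∼ N(0, 0.5²); none of these choices come from the slides. For this candidate the proposal densities contribute exactly a factor V^(t)/Y^(t−1) to ρ_t:

```python
import math
import random

random.seed(5)

def log_target(y: float) -> float:
    # log f_Y(y) = -y for the Exponential(1) target (up to normalisation)
    return -y

def run_chain(n: int, y0: float = 1.0) -> list[float]:
    chain = [y0]
    for _ in range(n):
        y = chain[-1]
        v = y * math.exp(random.gauss(0.0, 0.5))  # step 1: V ~ f_{V|Y}(. | y)
        # Log-normal candidate densities satisfy
        # f_{V|Y}(y | v) / f_{V|Y}(v | y) = v / y.
        log_rho = min(log_target(v) - log_target(y) + math.log(v / y), 0.0)
        if random.random() < math.exp(log_rho):   # step 3: accept
            chain.append(v)
        else:
            chain.append(y)                       # reject: chain stays put
    return chain

chain = run_chain(200_000)
burned = chain[1_000:]  # discard burn-in
print(sum(burned) / len(burned))  # Exponential(1) has mean 1
```

Working with log densities, as here, avoids numerical underflow when the densities themselves are tiny.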
Independence Metropolis–Hastings Algorithm
∗ It is often convenient to generate V^(t) from the same
distribution at every iteration.
∗ In this case we have f_{V|Y}(v | Y^(t−1)) = f_V(v) and so the
acceptance probability becomes
$$\rho_t = \min\left( \frac{f_Y(V^{(t)})}{f_Y(Y^{(t-1)})} \times \frac{f_V(Y^{(t-1)})}{f_V(V^{(t)})},\; 1 \right) = \min\left( \frac{f_Y(V^{(t)})}{f_V(V^{(t)})} \times \frac{f_V(Y^{(t-1)})}{f_Y(Y^{(t-1)})},\; 1 \right)$$
6-14
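A sketch of the independence sampler for an illustrative Gamma(2, 1) target f_Y(y) ∝ y·e^(−y) with Exponential(1) candidates (neither choice is from the slides); with these densities the acceptance probability reduces to min(V^(t)/Y^(t−1), 1):

```python
import math
import random

random.seed(6)

def run_chain(n: int, y0: float = 1.0) -> list[float]:
    chain = [y0]
    for _ in range(n):
        y = chain[-1]
        v = -math.log(1.0 - random.random())  # V ~ Exponential(1), inverse cdf
        # rho = min( f_Y(v) f_V(y) / (f_Y(y) f_V(v)), 1 )
        #     = min( (v e^-v e^-y) / (y e^-y e^-v), 1 ) = min(v / y, 1)
        rho = min(v / y, 1.0)
        chain.append(v if random.random() < rho else y)
    return chain

chain = run_chain(200_000)[1_000:]  # discard burn-in
print(sum(chain) / len(chain))  # Gamma(2, 1) has mean 2
```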
Random Walk Metropolis–Hastings Algorithm
∗ Another special case is where f_{V|Y}(v | y) = f_Z(v − y) where
f_Z is a distribution symmetric about 0.
∗ We generate Z^(t) ∼ f_Z and set V^(t) = Y^(t−1) + Z^(t).
∗ In stochastic processes this is called a random walk.
∗ The acceptance probability for the Metropolis–Hastings
algorithm then becomes
$$\rho_t = \min\left( \frac{f_Y(V^{(t)})}{f_Y(Y^{(t-1)})},\; 1 \right)$$
6-15
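A sketch of random walk Metropolis–Hastings for an illustrative standard normal target, using N(0, 1) increments as f_Z (both choices are for illustration only):

```python
import math
import random

random.seed(7)

def log_target(y: float) -> float:
    # log of the N(0, 1) density up to an additive constant
    return -0.5 * y * y

def run_chain(n: int, y0: float = 0.0) -> list[float]:
    chain = [y0]
    for _ in range(n):
        y = chain[-1]
        v = y + random.gauss(0.0, 1.0)  # V^(t) = Y^(t-1) + Z^(t), Z symmetric
        log_rho = min(log_target(v) - log_target(y), 0.0)
        chain.append(v if random.random() < math.exp(log_rho) else y)
    return chain

chain = run_chain(200_000)[1_000:]  # discard burn-in
mean = sum(chain) / len(chain)
var = sum((y - mean) ** 2 for y in chain) / (len(chain) - 1)
print(mean, var)  # should be close to 0 and 1
```

The increment scale matters in practice: steps that are too small explore slowly, steps that are too large are rarely accepted.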
Gibbs Sampler
∗ The Gibbs Sampler (Geman & Geman, 1984) is designed to
generate observations from a complex multivariate distribution.
∗ The Markov chain is constructed by considering the
univariate conditional distributions.
∗ Suppose that the random vector of interest is Y = (Y_1, . . . , Y_d)
and that we can generate observations from the full conditional
distributions
$$f_j(y \mid Y_{-j} = y_{-j}) = f_{Y_j \mid Y_{-j}}(y \mid Y_{-j} = y_{-j}), \qquad j = 1, \ldots, d$$
where Y_{−j} = (Y_1, . . . , Y_{j−1}, Y_{j+1}, . . . , Y_d).
6-16
Gibbs Sampler
Initialise the chain to some value $Y^{(0)} = (Y_1^{(0)}, \ldots, Y_d^{(0)})$.
For t = 1, 2, . . .
1. Generate $Y_1^{(t)}$ from $f_1(y_1 \mid Y_2^{(t-1)}, \ldots, Y_d^{(t-1)})$.
2. Generate $Y_2^{(t)}$ from $f_2(y_2 \mid Y_1^{(t)}, Y_3^{(t-1)}, \ldots, Y_d^{(t-1)})$.
...
j. Generate $Y_j^{(t)}$ from $f_j(y_j \mid Y_1^{(t)}, \ldots, Y_{j-1}^{(t)}, Y_{j+1}^{(t-1)}, \ldots, Y_d^{(t-1)})$.
...
d. Generate $Y_d^{(t)}$ from $f_d(y_d \mid Y_1^{(t)}, \ldots, Y_{d-1}^{(t)})$.
Then we set $Y^{(t)} = (Y_1^{(t)}, \ldots, Y_d^{(t)})$.
6-17
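The steps above can be sketched for an illustrative bivariate normal target with standard margins and correlation r = 0.8 (not from the slides), whose full conditionals are Y₁ | Y₂ = y₂ ∼ N(r·y₂, 1 − r²) and symmetrically for Y₂:

```python
import math
import random

random.seed(8)

r = 0.8                       # correlation of the target
sd = math.sqrt(1.0 - r * r)   # conditional standard deviation

y1, y2 = 0.0, 0.0             # Y^(0): arbitrary starting value
draws = []
for _ in range(100_000):
    y1 = random.gauss(r * y2, sd)  # step 1: Y1 from f1(y1 | Y2)
    y2 = random.gauss(r * y1, sd)  # step 2: Y2 from f2(y2 | Y1)
    draws.append((y1, y2))

burned = draws[1_000:]  # discard burn-in
m1 = sum(a for a, _ in burned) / len(burned)
corr_hat = sum(a * b for a, b in burned) / len(burned)  # estimates E[Y1 Y2] = r
print(m1, corr_hat)
```

Note that each update immediately uses the freshest value of the other coordinate, exactly as in steps 1 through d above.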