Bayesian Inference
by Hoai Nam Nguyen
September 9, 2017
The setting is the same. Given a population that follows a distribution $P$, where $P$ contains one or more unknown parameters, we want to construct an estimator for each of them. In this course, I consider the simple case where there is only one unknown parameter $\theta$. To do this, we proceed by collecting an i.i.d. sample $X_1, \ldots, X_n \sim P$.
Similar to Maximum Likelihood Estimation, we first find the likelihood function $L(\theta)$:
$$L(\theta) = f_{X_1,\ldots,X_n}(x_1,\ldots,x_n \mid \theta)$$
In Bayesian inference, we treat the parameter $\theta$ as a random variable. That is, $\theta$ follows a probability distribution with pdf $\pi(\theta)$. We call $\pi(\theta)$ the prior distribution of $\theta$.
By Bayes's formula, we have
$$\pi(\theta \mid x_1,\ldots,x_n) = \frac{f_{X_1,\ldots,X_n}(x_1,\ldots,x_n \mid \theta)\,\pi(\theta)}{f_{X_1,\ldots,X_n}(x_1,\ldots,x_n)} \propto f_{X_1,\ldots,X_n}(x_1,\ldots,x_n \mid \theta)\,\pi(\theta)$$
where $\pi(\theta \mid x_1,\ldots,x_n)$ is the pdf of $\theta$ given the sample data. This is called the posterior distribution of $\theta$.
Let me clarify the last step further. The symbol $\propto$ means "proportional to". Since the left-hand side is the distribution of $\theta$ conditional on the sample data $\{x_1,\ldots,x_n\}$, all the $x_i$ are assumed to be known, and the denominator $f_{X_1,\ldots,X_n}(x_1,\ldots,x_n)$ is therefore no more than a constant.
In this setting, we are given the population distribution $P$ and the prior distribution $\pi(\theta)$. We have to find the posterior distribution $\pi(\theta \mid x_1,\ldots,x_n)$. We then use the posterior mean $E[\theta \mid x_1,\ldots,x_n]$ to estimate the unknown parameter $\theta$. That is,
$$\hat{\theta} = E[\theta \mid X_1,\ldots,X_n]$$
NOTE: when calculating $\pi(\theta \mid x_1,\ldots,x_n)$, always use proportionality by removing constants, because this will simplify the calculation a lot.
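To make the recipe concrete, here is a minimal numerical sketch (my own illustration, not part of the course material) that approximates a posterior on a grid; the Bernoulli model, the uniform prior, and the made-up data are all assumptions for the demo.

```python
import numpy as np

# Bayesian recipe on a grid: posterior ∝ likelihood × prior.
x = np.array([1, 0, 1, 1, 0, 1])        # hypothetical i.i.d. Bernoulli sample
theta = np.linspace(0.001, 0.999, 999)  # grid over the parameter space
d = theta[1] - theta[0]                 # grid spacing

likelihood = theta**x.sum() * (1 - theta)**(len(x) - x.sum())
prior = np.ones_like(theta)             # pi(theta) = 1, i.e. Uniform(0, 1)

posterior = likelihood * prior
posterior /= posterior.sum() * d        # the dropped denominator is just a constant

posterior_mean = (theta * posterior).sum() * d  # the Bayesian estimator
print(posterior_mean)  # ≈ (sum(x_i) + 1) / (n + 2) = 5/8, as Example 1 below predicts
```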
Example 1
The population distribution is Bernoulli($p$), where $p \sim \mathrm{Uniform}(0, 1)$. Use Bayesian inference to construct an estimator $\hat{p}$.
The likelihood function is given by:
$$L(p) = \prod_{i=1}^{n} f_{X_i}(x_i \mid p) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{\sum x_i}(1-p)^{n-\sum x_i}$$
The pdf of the prior distribution is $\pi(p) = 1$, for $0 < p < 1$.
Therefore, the posterior distribution is given by:
$$\pi(p \mid x_1,\ldots,x_n) \propto f_{X_1,\ldots,X_n}(x_1,\ldots,x_n \mid p)\,\pi(p) = p^{\sum x_i}(1-p)^{n-\sum x_i}, \quad \text{for } 0 < p < 1$$
Recall the pdf of Beta($\alpha$, $\beta$):
$$f_X(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}, \quad \text{for } 0 < x < 1$$
By comparing, we can see that the posterior distribution of $p$ is Beta($\sum x_i + 1$, $n - \sum x_i + 1$).
We know that the expectation of Beta($\alpha$, $\beta$) is $\frac{\alpha}{\alpha+\beta}$. Therefore, the posterior mean is given by:
$$E[p \mid x_1,\ldots,x_n] = \frac{\sum x_i + 1}{n+2}$$
Thus, $\hat{p} = \frac{\sum X_i + 1}{n+2}$ is the Bayesian estimator for $p$.
Note that we used proportionality when calculating the posterior distribution. By comparing with the pdf of Beta($\alpha$, $\beta$), we can easily recover the missing constant:
$$c = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} = \frac{\Gamma(n+2)}{\Gamma\!\left(\sum x_i + 1\right)\Gamma\!\left(n - \sum x_i + 1\right)}$$
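As a sanity check (my own illustration, with made-up data and assuming scipy is available), the closed-form result of Example 1 can be compared against scipy's Beta distribution:

```python
import numpy as np
from scipy import stats

x = np.array([1, 1, 0, 1, 0, 1, 1, 0])  # hypothetical Bernoulli(p) sample
n, s = len(x), x.sum()

# Posterior from Example 1: p | x_1, ..., x_n ~ Beta(sum(x_i) + 1, n - sum(x_i) + 1)
posterior = stats.beta(s + 1, n - s + 1)

print(posterior.mean())   # posterior mean computed by scipy
print((s + 1) / (n + 2))  # closed-form Bayesian estimator; the two should match
```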
Example 2
Same as Example 1, except that $p \sim \mathrm{Beta}(a, b)$, where both $a$ and $b$ are given constants.
The likelihood function stays unchanged:
$$L(p) = p^{\sum x_i}(1-p)^{n-\sum x_i}$$
The pdf of the prior distribution is given by:
$$\pi(p) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, p^{a-1}(1-p)^{b-1}, \quad \text{for } 0 < p < 1$$
Therefore, the pdf of the posterior distribution is given by:
$$\pi(p \mid x_1,\ldots,x_n) \propto f_{X_1,\ldots,X_n}(x_1,\ldots,x_n \mid p)\,\pi(p) \propto p^{\sum x_i}(1-p)^{n-\sum x_i}\, p^{a-1}(1-p)^{b-1} = p^{\sum x_i + a - 1}(1-p)^{n-\sum x_i + b - 1}, \quad \text{for } 0 < p < 1$$
We recognise this as Beta($\sum x_i + a$, $n - \sum x_i + b$).
The posterior mean is $E[p \mid x_1,\ldots,x_n] = \frac{\sum x_i + a}{n+a+b}$. The Bayesian estimator for $p$ is given by:
$$\hat{p} = \frac{\sum X_i + a}{n+a+b}$$
Again, you can recover the normalising constant in the pdf of the posterior distribution:
$$c = \frac{\Gamma(n+a+b)}{\Gamma\!\left(\sum x_i + a\right)\Gamma\!\left(n - \sum x_i + b\right)}$$
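One useful way to read this estimator is as a compromise between the prior mean $\frac{a}{a+b}$ and the sample mean $\bar{x}$, with the prior acting like $a+b$ extra pseudo-observations. A small sketch (with hypothetical $a$, $b$, and data) makes the equivalence explicit:

```python
import numpy as np

a, b = 2.0, 2.0                          # hypothetical prior Beta(a, b)
x = np.array([1, 0, 1, 1, 1, 0, 1, 1])  # hypothetical Bernoulli sample
n = len(x)

p_hat = (x.sum() + a) / (n + a + b)      # Bayesian estimator from Example 2

# Equivalent form: weighted average of the sample mean and the prior mean.
w = n / (n + a + b)
blend = w * x.mean() + (1 - w) * (a / (a + b))

print(p_hat, blend)  # identical values
```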
Example 3
The population distribution is $N(\mu, \sigma^2)$, where $\mu$ is unknown and $\sigma^2$ is known. The parameter $\mu$ follows a prior distribution $N(\nu, \tau^2)$, where both $\nu$ and $\tau^2$ are given constants. Use Bayesian inference to construct an estimator $\hat{\mu}$.
The likelihood function is given by:
$$L(\mu) = \prod_{i=1}^{n} f_{X_i}(x_i \mid \mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right) \propto \prod_{i=1}^{n} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right), \quad \text{because } \sigma^2 \text{ is known}$$
Also, the pdf of the prior distribution is given by:
$$\pi(\mu) = \frac{1}{\sqrt{2\pi\tau^2}} \exp\left(-\frac{(\mu-\nu)^2}{2\tau^2}\right) \propto \exp\left(-\frac{(\mu-\nu)^2}{2\tau^2}\right), \quad \text{because } \tau^2 \text{ is known}$$
Then, calculate the pdf of the posterior distribution:
$$\begin{aligned}
\pi(\mu \mid x_1,\ldots,x_n) &\propto \left[\prod_{i=1}^{n} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)\right] \exp\left(-\frac{(\mu-\nu)^2}{2\tau^2}\right) \\
&= \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right) \exp\left(-\frac{1}{2\tau^2}(\mu^2 - 2\nu\mu + \nu^2)\right) \\
&\propto \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right) \exp\left(-\frac{1}{2\tau^2}(\mu^2 - 2\nu\mu)\right), \quad \text{by removing } \exp\left(-\frac{\nu^2}{2\tau^2}\right) \\
&= \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i^2 - 2x_i\mu + \mu^2)\right) \exp\left(-\frac{1}{2\tau^2}(\mu^2 - 2\nu\mu)\right) \\
&\propto \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(-2x_i\mu + \mu^2)\right) \exp\left(-\frac{1}{2\tau^2}(\mu^2 - 2\nu\mu)\right), \quad \text{by removing } \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n} x_i^2\right) \\
&= \exp\left(\frac{\mu}{\sigma^2}\sum_{i=1}^{n} x_i - \frac{n\mu^2}{2\sigma^2}\right) \exp\left(-\frac{1}{2\tau^2}(\mu^2 - 2\nu\mu)\right) \\
&= \exp\left[-\frac{1}{2}\left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)\mu^2 + \left(\frac{\nu}{\tau^2} + \frac{1}{\sigma^2}\sum_{i=1}^{n} x_i\right)\mu\right] \\
&= \exp\left(-A\mu^2 + B\mu\right), \quad \text{where } A = \frac{1}{2}\left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right) \text{ and } B = \frac{\nu}{\tau^2} + \frac{1}{\sigma^2}\sum_{i=1}^{n} x_i \\
&= \exp\left(-\frac{\mu^2 - (B/A)\mu}{1/A}\right) \\
&\propto \exp\left(-\frac{\mu^2 - (B/A)\mu + B^2/(4A^2)}{1/A}\right) \\
&= \exp\left[-\frac{\left(\mu - B/(2A)\right)^2}{1/A}\right]
\end{aligned}$$
Comparing with the pdf of a Normal distribution, we deduce that the posterior distribution of $\mu$ is given by:
$$\mu \mid x_1,\ldots,x_n \sim N\left(\frac{B}{2A},\ \frac{1}{2A}\right)$$
Clearly, $E[\mu \mid x_1,\ldots,x_n] = \frac{B}{2A}$. Therefore,
$$\hat{\mu} = \frac{B}{2A} = \frac{\dfrac{1}{\sigma^2}\sum_{i=1}^{n} X_i + \dfrac{\nu}{\tau^2}}{\dfrac{n}{\sigma^2} + \dfrac{1}{\tau^2}}$$
is the Bayesian estimator for $\mu$: a precision-weighted average of the sample information and the prior mean.
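To see these formulas in action, here is a small numerical sketch (my own, with made-up values of $\nu$, $\tau^2$, $\sigma^2$, and data) that evaluates $A$, $B$, and the posterior mean $B/(2A)$:

```python
import numpy as np

sigma2 = 4.0         # known population variance sigma^2
nu, tau2 = 0.0, 1.0  # hypothetical prior: mu ~ N(nu, tau^2)
x = np.array([1.2, 0.5, 2.1, 1.7, 0.9])  # hypothetical sample from N(mu, sigma^2)
n = len(x)

# Coefficients of the completed square: posterior ∝ exp(-A*mu^2 + B*mu)
A = 0.5 * (n / sigma2 + 1 / tau2)
B = nu / tau2 + x.sum() / sigma2

post_mean = B / (2 * A)  # the Bayesian estimator mu-hat
post_var = 1 / (2 * A)   # posterior variance

# Same mean, written as a precision-weighted average of the data and the prior:
check = (x.sum() / sigma2 + nu / tau2) / (n / sigma2 + 1 / tau2)
print(post_mean, check, post_var)
```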
Example 4
Consider the following types of treatment:
Treatment 1: 100% of the patients are cured (3 out of 3)
Treatment 2: 95% of the patients are cured (19 out of 20)
Treatment 3: 90% of the patients are cured (90,000 out of 100,000)
Which one is the best?
Treatment 1 cured 100% of the patients, but the sample was so small that we should cast doubt on the result. On the other hand, Treatment 3's huge sample is very reassuring, but its cure rate is a bit lower.
Let $p$ be the probability that a patient is cured. Then, the probability that a patient is not cured is $1-p$. Therefore, the population follows Bernoulli($p$), where $p$ is an unknown parameter.
In Example 1, we found that $\hat{p} = \frac{\sum x_i + 1}{n+2}$ provides an estimate for $p$.
Treatment 1: $\hat{p} = \frac{3+1}{3+2} = \frac{4}{5} = 0.8$
Treatment 2: $\hat{p} = \frac{19+1}{20+2} = \frac{20}{22} \approx 0.909$
Treatment 3: $\hat{p} = \frac{90000+1}{100000+2} = \frac{90001}{100002} \approx 0.9$
We can see that $\hat{p}$ for Treatment 2 is the highest. Therefore, we predict that Treatment 2 is the best one. Treatment 1, despite curing everyone in the sample, is predicted to be the worst due to its small sample size.
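The arithmetic is easy to reproduce; a short sketch:

```python
# Bayesian estimator from Example 1: p_hat = (cured + 1) / (treated + 2)
treatments = {
    "Treatment 1": (3, 3),
    "Treatment 2": (19, 20),
    "Treatment 3": (90_000, 100_000),
}

for name, (cured, treated) in treatments.items():
    p_hat = (cured + 1) / (treated + 2)
    print(f"{name}: p_hat = {p_hat:.4f}")
# Treatment 2 comes out highest (~0.909), beating the tiny perfect sample
# (0.8) and the large but lower-rate sample (~0.9).
```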