Lec11 Introduction to Bayesian Statistics
Email: [email protected]
URL: https://www.zabaras.com/
September 2, 2020
Bayesian Statistics, Bayes rule, Prior, Likelihood and Posterior, Posterior Point Estimates,
Predictive Distribution
ℓ (𝜃 | 𝑥) = 𝑓 (𝑥 | 𝜃)
The statistical step now deals with the evaluation of 𝛼1, 𝛼2, 𝛼3.
Let us consider a simple example. Let 𝑋 = (𝑋1, 𝑋2, … , 𝑋𝑁) be i.i.d. from
𝒩(𝜇, 𝜎²) with 𝜃 = (𝜇, 𝜎²). Then we can write:
$$f(x \mid \theta) = \prod_{j=1}^{N}\mathcal{N}(x_j \mid \theta)
= \left(\frac{1}{2\pi\sigma^2}\right)^{N/2}\exp\!\left(-\frac{1}{2\sigma^2}\sum_{j=1}^{N}(x_j-\mu)^2\right)
= \left(\frac{1}{2\pi\sigma^2}\right)^{N/2}\exp\!\left(-\frac{1}{2\sigma^2}\sum_{j=1}^{N}x_j^2 + \frac{\mu}{\sigma^2}\sum_{j=1}^{N}x_j - \frac{N\mu^2}{2\sigma^2}\right)$$
so the likelihood depends on the data only through $T(x) = \left(\sum_{j=1}^{N}x_j,\ \sum_{j=1}^{N}x_j^2\right)$, which is therefore a sufficient statistic for $\theta = (\mu, \sigma^2)$.
On the other hand, the estimate $\hat{\mu}_1 = x_1$ does not satisfy the sufficiency
principle for $N > 1$, because if we have another dataset $x'_{1:N}$ such that
$T(x'_{1:N}) = T(x_{1:N})$, then $\hat{\mu}_2 = x'_1 \neq \hat{\mu}_1$ if $x'_1 \neq x_1$.
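To make this concrete, here is a minimal NumPy sketch (an illustration added here, not part of the original slides): two datasets with the same sufficient statistic $T(x)$ give exactly the same Gaussian likelihood, while the "first observation" estimate differs between them.

```python
import numpy as np

def gaussian_log_lik(x, mu, sigma2):
    """Log-likelihood of i.i.d. Gaussian data; it depends on x only through
    N, sum(x) and sum(x**2), i.e. through the sufficient statistic T(x)."""
    N = len(x)
    return -0.5 * N * np.log(2 * np.pi * sigma2) - 0.5 * np.sum((x - mu) ** 2) / sigma2

x  = np.array([1.0, 2.0, 3.0])   # original dataset
xp = np.array([3.0, 2.0, 1.0])   # permuted dataset: same T(x) = (sum x, sum x^2)

mu, sigma2 = 1.5, 2.0
print(gaussian_log_lik(x, mu, sigma2), gaussian_log_lik(xp, mu, sigma2))  # identical
print(x[0], xp[0])  # but the "first observation" estimate differs: 1.0 vs 3.0
```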
The Likelihood Principle
Likelihood Principle. In the inference about 𝜃, the information brought by an
observation is entirely contained in the likelihood function ℓ ( 𝜃 | 𝑥) = 𝑓 (𝑥 | 𝜃).
Also, two likelihood functions contain the same information about 𝜃 if they are
proportional to each other; i.e.
ℓ1 ( 𝜃 | 𝑥) = 𝑐 (𝑥)ℓ2 ( 𝜃 | 𝑥)
Indeed:
$$\arg\sup_{\theta}\,\ell(\theta \mid x) = \arg\sup_{\theta}\,g(x)\,h(T(x)\mid\theta) = \arg\sup_{\theta}\,h(T(x)\mid\theta)$$
i.i.d. data: Data points that are drawn independently from the same
distribution are said to be independent and identically distributed, which is
often abbreviated to i.i.d.
Likelihood function:
$$p(x \mid \mu, \sigma^2) = \prod_{i=1}^{N}\mathcal{N}(x_i \mid \mu, \sigma^2)$$
Maximizing with respect to $\mu$ and $\sigma^2$ gives the maximum likelihood solutions
$$\mu_{ML} = \frac{1}{N}\sum_{i=1}^{N}x_i, \qquad \sigma^2_{ML} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_{ML})^2$$
The MLE underestimates the variance (bias due to overfitting) because 𝜇𝑀𝐿
fitted some of the noise in the data.
The maximum likelihood solutions $\mu_{ML}$, $\sigma^2_{ML}$ are functions of the data set
values $x_1, . . . , x_N$. Consider the expectations of these quantities with respect to
the data set values, which themselves come from a Gaussian.
$$\begin{aligned}
\mathbb{E}\!\left[\sigma^2_{ML}\right] &= \mathbb{E}\!\left[\frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_{ML})^2\right]
= \mathbb{E}\!\left[\frac{1}{N}\sum_{n=1}^{N}\Big(x_n-\frac{1}{N}\sum_{m=1}^{N}x_m\Big)^{\!2}\right] \\
&= \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\!\left[x_n^2\right] - \frac{2}{N^2}\sum_{n=1}^{N}\sum_{m=1}^{N}\mathbb{E}\!\left[x_n x_m\right] + \frac{1}{N^2}\sum_{m=1}^{N}\sum_{l=1}^{N}\mathbb{E}\!\left[x_m x_l\right] \\
&= \frac{1}{N}\,N(\mu^2+\sigma^2) - \frac{2}{N^2}\Big[N(\mu^2+\sigma^2) + N(N-1)\mu^2\Big] + \frac{1}{N^2}\Big[N(\mu^2+\sigma^2) + N(N-1)\mu^2\Big] \\
&= \mu^2 + \sigma^2 - \frac{1}{N}\Big[(\mu^2+\sigma^2) + (N-1)\mu^2\Big] = \frac{N-1}{N}\,\sigma^2
\end{aligned}$$
On average the MLE estimate obtains the correct mean but will underestimate
the true variance by a factor (𝑁 − 1)/𝑁.
The 𝑁 − 1 factor takes into account the fact that "one degree of freedom has been
used in fitting the mean" and removes the bias of the MLE: the unbiased estimate is
$\tilde{\sigma}^2 = \frac{N}{N-1}\sigma^2_{ML} = \frac{1}{N-1}\sum_{n=1}^{N}(x_n - \mu_{ML})^2$.
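This bias can be checked directly by simulation; the following is a minimal sketch (added here for illustration), assuming data drawn from 𝒩(𝜇, 𝜎²) with made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, N, trials = 0.0, 4.0, 5, 200_000

X = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
mu_ml = X.mean(axis=1)
var_ml = ((X - mu_ml[:, None]) ** 2).mean(axis=1)   # biased MLE of the variance

print(mu_ml.mean())                    # ~ mu: the mean is estimated correctly on average
print(var_ml.mean())                   # ~ (N-1)/N * sigma2 = 3.2, not 4.0
print(N / (N - 1) * var_ml.mean())     # bias-corrected estimate, ~ sigma2 = 4.0
```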
MLE: Underestimating the Variance
In the schematic from Bishop's PRML, we consider 3 cases, each with 2 data points drawn
from the true Gaussian. [Figure: the true Gaussian and the MLE estimate fitted to the
2 data points in each case.]
$$\ln p(X \mid \mu, \Sigma) = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu)$$
(𝐷 being the dimensionality of 𝑥ₙ).
Setting the derivatives wrt 𝝁 and 𝚺 equal to zero gives the following:
$$\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N}x_n, \qquad \Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_{ML})(x_n-\mu_{ML})^T$$
Here we used: $|A^{-1}| = |A|^{-1}$ and $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ (see also the matrix identities in the Appendix below).
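A short NumPy sketch of these multivariate MLE formulas (illustrative code added here, with made-up parameters); the result coincides with NumPy's biased sample covariance:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 5000, 3
true_mu = np.array([1.0, -2.0, 0.5])
M = rng.normal(size=(D, D))
true_Sigma = M @ M.T + np.eye(D)        # a valid (positive definite) covariance

X = rng.multivariate_normal(true_mu, true_Sigma, size=N)

mu_ml = X.mean(axis=0)                              # (1/N) sum_n x_n
centered = X - mu_ml
Sigma_ml = centered.T @ centered / N                # (1/N) sum_n (x_n - mu)(x_n - mu)^T

print(mu_ml)                                           # close to true_mu
print(np.allclose(Sigma_ml, np.cov(X.T, bias=True)))   # matches NumPy's biased covariance
```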
Appendix: Some Useful Matrix Operations
Show that
$$\frac{\partial}{\partial A}\,\mathrm{Tr}(AB) = B^T \qquad \text{and} \qquad \frac{\partial}{\partial A}\,\mathrm{Tr}(A^T B) = B$$
Indeed,
$$\frac{\partial}{\partial A_{mn}}\,\mathrm{Tr}(AB) = \frac{\partial}{\partial A_{mn}}\sum_{i,k}A_{ik}B_{ki} = B_{nm} \quad\Rightarrow\quad \frac{\partial}{\partial A}\,\mathrm{Tr}(AB) = B^T$$
Show that
$$\frac{\partial}{\partial A}\ln|A| = \left(A^{-1}\right)^T$$
Some statisticians question this approach, but most accept probabilistic modeling of
the observations.
Example: Assume you want to measure the speed of light given some
observations. Why should you put a prior on a physical constant?
Due to the limited accuracy of the measurement, this constant will never
be known exactly.
If a tested patient has the disease, 100% of the time the test will be
positive.
If a tested patient does not have the disease, 95% of the time the test will
be negative (5% false positive).
With $A$ = "the patient has the disease" (prevalence $P(A) = 0.0001$) and $B$ = "the test is positive", Bayes' rule gives
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \bar{A})\,P(\bar{A})} = \frac{1 \times 0.0001}{1 \times 0.0001 + 0.05 \times 0.9999} \approx 0.002$$
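The same computation in a few lines of Python (a sketch with the numbers from the example above; variable names are chosen here for clarity):

```python
# Prevalence and test characteristics from the example above
p_disease = 0.0001
p_pos_given_disease = 1.0     # sensitivity: test is always positive if diseased
p_pos_given_healthy = 0.05    # 5% false positive rate

# Bayes' rule: P(disease | positive test)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)    # ~0.002: a positive test still leaves the disease unlikely
```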
In Bayesian settings all variables are random and all inferences are probabilistic.
Posterior 𝜋(ℎ | 𝑥): How likely is ℎ after data 𝑥 have been observed.
$$\pi(h \mid x) = \frac{f(x \mid h)\,\pi(h)}{m(x)}$$
Prior 𝝅(𝜽)
We use the prior to introduce quantitatively some insights on the parameters
of interest.
If you know the input 𝑋 = 𝑥 to your problem, the likelihood can represent the
computed output 𝑦 = 𝑓(𝑥).
Posterior 𝝅(𝜽|𝒙): it weights the data and the prior information in making probabilistic inferences,
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{m(x)}, \qquad m(x) = \int f(x \mid \theta)\,\pi(\theta)\,d\theta$$
Posterior Mean
Posterior Quantiles
$$\Pr(\theta \ge a \mid x) = \int_{a}^{\infty} \pi(\theta \mid x)\,d\theta$$
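Such posterior summaries can be computed numerically; below is a grid-based sketch (added for illustration) for a hypothetical model with 𝒩(𝜃, 1) data and a 𝒩(0, 4) prior on 𝜃:

```python
import numpy as np

# Hypothetical setup: x_i ~ N(theta, 1) with prior theta ~ N(0, 4)
x = np.array([1.2, 0.7, 1.9, 1.1])
theta = np.linspace(-5, 5, 10_001)
dtheta = theta[1] - theta[0]

log_prior = -0.5 * theta**2 / 4.0
log_lik = -0.5 * ((x[:, None] - theta[None, :]) ** 2).sum(axis=0)
post = np.exp(log_prior + log_lik)
post /= post.sum() * dtheta                  # normalize: divide by m(x)

post_mean = (theta * post).sum() * dtheta    # posterior mean E[theta | x]
cdf = np.cumsum(post) * dtheta
print(post_mean)
print(1.0 - np.interp(1.0, theta, cdf))      # posterior quantile Pr(theta >= 1 | x)
```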
We have:
$$g(\hat{\boldsymbol{x}} \mid \boldsymbol{x}) = \int g(\hat{\boldsymbol{x}}, \theta \mid \boldsymbol{x})\,d\theta
= \int \frac{\pi(\hat{\boldsymbol{x}}, \theta, \boldsymbol{x})}{m(\boldsymbol{x})}\,d\theta
= \int \frac{\pi(\hat{\boldsymbol{x}}, \theta, \boldsymbol{x})}{\phi(\theta, \boldsymbol{x})}\,\frac{\phi(\theta, \boldsymbol{x})}{m(\boldsymbol{x})}\,d\theta
= \int f(\hat{\boldsymbol{x}} \mid \theta, \boldsymbol{x})\,\pi(\theta \mid \boldsymbol{x})\,d\theta
= \int f(\hat{\boldsymbol{x}} \mid \theta)\,\pi(\theta \mid \boldsymbol{x})\,d\theta$$
where $\phi(\theta, \boldsymbol{x})$ denotes the joint density of $(\theta, \boldsymbol{x})$ and the last step uses the conditional independence of $\hat{\boldsymbol{x}}$ and $\boldsymbol{x}$ given $\theta$.
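The last expression suggests a simple Monte Carlo approximation of the predictive density: draw 𝜃 from the posterior and average 𝑓(𝑥̂|𝜃). A sketch (illustrative, assuming the conjugate Gaussian-mean model derived in the following slides):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical model: x_i ~ N(theta, sigma2) with prior theta ~ N(mu0, s0sq)
sigma2, mu0, s0sq = 1.0, 0.0, 4.0
x = np.array([0.8, 1.4, 1.1])
N = len(x)

# Conjugate posterior on theta (formulas derived in the slides that follow)
sNsq = 1.0 / (1.0 / s0sq + N / sigma2)
muN = sNsq * (x.sum() / sigma2 + mu0 / s0sq)

# Monte Carlo estimate of g(xhat | x) = int f(xhat | theta) pi(theta | x) dtheta
theta_samples = rng.normal(muN, np.sqrt(sNsq), size=100_000)
xhat = 1.0
mc_estimate = stats.norm.pdf(xhat, loc=theta_samples, scale=np.sqrt(sigma2)).mean()
exact = stats.norm.pdf(xhat, loc=muN, scale=np.sqrt(sNsq + sigma2))  # analytic predictive
print(mc_estimate, exact)   # the two should agree closely
```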
For a single observation $x_1$ (with known variance $\sigma^2$ and prior $\theta \sim \mathcal{N}(\mu_0, \sigma_0^2)$):
$$\pi(\theta \mid x_1) \propto \exp\!\left[-\frac{\theta^2}{2}\left(\frac{1}{\sigma^2}+\frac{1}{\sigma_0^2}\right) + \theta\left(\frac{x_1}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right)\right] \propto \exp\!\left[-\frac{1}{2\sigma_1^2}\left(\theta-\mu_1\right)^2\right],$$
$$\mu_1 = \sigma_1^2\left(\frac{x_1}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right), \qquad \frac{1}{\sigma_1^2} = \frac{1}{\sigma^2}+\frac{1}{\sigma_0^2},$$
and the predictive distribution for a new observation is
$$\hat{X} \mid x_1 \sim \mathcal{N}\!\left(\mu_1,\ \sigma_1^2 + \sigma^2\right).$$
For $N$ observations $X = (x_1, \ldots, x_N)$:
$$\pi(\theta \mid X) \propto \exp\!\left[-\frac{\theta^2}{2}\left(\frac{N}{\sigma^2}+\frac{1}{\sigma_0^2}\right) + \theta\left(\frac{\sum_{n=1}^{N}x_n}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right)\right] \propto \exp\!\left[-\frac{1}{2\sigma_N^2}\left(\theta-\mu_N\right)^2\right]$$
Bayesian Inference for the Gaussian
So the posterior is a Gaussian as before with
$$\theta \mid X \sim \mathcal{N}(\mu_N, \sigma_N^2), \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2},$$
$$\mu_N = \sigma_N^2\left(\frac{\sum_{n=1}^{N}x_n}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right) = \frac{N\sigma_0^2\,\mu_{ML} + \sigma^2\mu_0}{N\sigma_0^2 + \sigma^2} = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{ML}$$
The posterior precision is the sum of the precision of the prior plus one
contribution of the data precision for each observed data point.
For 𝑁 → ∞ the posterior peaks around the 𝜇𝑀𝐿 and the posterior variance goes
to zero, i.e. the point MLE estimate is recovered within the Bayesian paradigm
for infinite data. In addition, 𝔼[𝜃|𝑥1, . . . , 𝑥𝑁] = 𝜇𝑁 ≃ 𝜇𝑀𝐿.
How about when $\sigma_0^2 \to \infty$? In this case note that $\sigma_N^2 \to \dfrac{\sigma^2}{N}$ and $\mu_N \to \mu_{ML}$.
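These update formulas are easy to wrap in a small helper; the following is an illustrative Python sketch (not PMTK code) of the conjugate update for the mean:

```python
import numpy as np

def gauss_mean_posterior(x, sigma2, mu0, s0sq):
    """Posterior N(mu_N, sigma_N^2) for the mean of N(theta, sigma2) data,
    given the conjugate prior theta ~ N(mu0, s0sq)."""
    x = np.asarray(x, dtype=float)
    N = x.size
    sNsq = 1.0 / (1.0 / s0sq + N / sigma2)          # posterior variance
    muN = sNsq * (x.sum() / sigma2 + mu0 / s0sq)    # posterior mean
    return muN, sNsq

x = np.random.default_rng(2).normal(0.8, 1.0, size=10)
# For a broad prior (s0sq large) this approaches (x.mean(), sigma2/N)
print(gauss_mean_posterior(x, sigma2=1.0, mu0=0.0, s0sq=100.0))
```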
[Figure: posterior 𝜋(𝜃 | 𝑋) for increasing numbers of observations, with curves labeled N = 2 and N = 10; the posterior sharpens as data accumulate.]
We can derive sequential estimates of the posterior variance and mean. They
are as follows:
$$\frac{1}{\sigma_N^2} = \frac{1}{\sigma_{N-1}^2} + \frac{1}{\sigma^2}, \qquad \mu_N = \frac{\sigma_N^2}{\sigma_{N-1}^2}\,\mu_{N-1} + \frac{\sigma_N^2}{\sigma^2}\,x_N$$
Show this by recognizing the sequential nature of Bayesian inference (the
posterior at the previous step becomes the new prior) and recalling the single-observation posterior update derived earlier.
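As a quick check, here is a short Python sketch (added for illustration) verifying that the sequential updates reproduce the batch posterior:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2, mu0, s0sq = 1.0, 0.0, 4.0
x = rng.normal(1.5, np.sqrt(sigma2), size=20)

# Batch posterior
sNsq_batch = 1.0 / (1.0 / s0sq + len(x) / sigma2)
muN_batch = sNsq_batch * (x.sum() / sigma2 + mu0 / s0sq)

# Sequential updates: the posterior after each point becomes the new prior
mu, s2 = mu0, s0sq
for xn in x:
    s2_new = 1.0 / (1.0 / s2 + 1.0 / sigma2)
    mu = (s2_new / s2) * mu + (s2_new / sigma2) * xn
    s2 = s2_new

print(np.allclose([mu, s2], [muN_batch, sNsq_batch]))   # True
```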
[Figure: prior, likelihood, and posterior for the Gaussian mean (two panels), produced with gaussInferParamsMean1d from PMTK.]
If we apply the data sequentially, we can write the posterior mean after the collection of one
data point ($N = 1$), i.e. $\mu_{ML} = y$, as follows:
$$\mu_1 = y - \frac{\sigma^2}{\sigma^2 + \sigma_0^2}\,(y - \mu_0) \qquad (\text{shrinkage: the data } y \text{ is adjusted towards the prior mean } \mu_0)$$
for $y = x + \varepsilon$ (observed signal), $x \sim \mathcal{N}(\mu_0, \sigma_0^2)$ (true signal), $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ (noise).
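A quick numeric check of the shrinkage form against the general posterior-mean formula (made-up numbers, added for illustration):

```python
# Shrinkage form of the one-observation posterior mean (made-up numbers)
y, mu0, sigma2, s0sq = 3.0, 0.0, 1.0, 4.0

mu1_shrink = y - sigma2 / (sigma2 + s0sq) * (y - mu0)       # data y shrunk towards mu0
mu1_direct = (s0sq * y + sigma2 * mu0) / (s0sq + sigma2)    # mu_N formula with N = 1
print(mu1_shrink, mu1_direct)   # both equal 2.4
```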
$$p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1}), \qquad p(y \mid x) = \mathcal{N}(y \mid Ax + b, L^{-1})$$
Using Bayes' rule, we want to find 𝑝(𝒚) and 𝑝(𝒙|𝒚).
We start with the joint distribution over 𝒛 = (𝒙, 𝒚) which is quadratic in the
components of 𝒛 – so 𝑝(𝒛) is a Gaussian.
$$\ln p(z) = \ln p(x) + \ln p(y \mid x) = -\frac{1}{2}(x-\mu)^T\Lambda(x-\mu) - \frac{1}{2}(y - Ax - b)^T L\,(y - Ax - b) + \text{const}$$
Recall also the marginalization result for a partitioned Gaussian: $p(x_a) = \mathcal{N}(x_a \mid \mu_a, \Sigma_{aa})$.
Based on our calculations:
$$\mathbb{E}[z] = \begin{pmatrix} \mu \\ A\mu + b \end{pmatrix}, \qquad \mathrm{cov}[z] = \begin{pmatrix} \Lambda^{-1} & \Lambda^{-1}A^T \\ A\Lambda^{-1} & L^{-1} + A\Lambda^{-1}A^T \end{pmatrix}$$
we conclude:
$$p(y) = \mathcal{N}\!\left(y \mid A\mu + b,\ L^{-1} + A\Lambda^{-1}A^T\right), \qquad
p(x \mid y) = \mathcal{N}\!\left(x \mid \left(\Lambda + A^T L A\right)^{-1}\!\left\{A^T L (y - b) + \Lambda\mu\right\},\ \left(\Lambda + A^T L A\right)^{-1}\right)$$
Proof:
$$\begin{aligned}
\mathbb{E}[x \mid y] &= \mu + \left(\Lambda + A^T L A\right)^{-1} A^T L\,(y - A\mu - b) \\
&= \left(\Lambda + A^T L A\right)^{-1}\left[\left(\Lambda + A^T L A\right)\mu - A^T L A\,\mu + A^T L (y - b)\right]
= \left(\Lambda + A^T L A\right)^{-1}\left[\Lambda\mu + A^T L (y - b)\right], \\
\mathrm{cov}[x \mid y] &= \left(\Lambda + A^T L A\right)^{-1}.
\end{aligned}$$
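A NumPy sketch of these marginal and conditional formulas with small, made-up matrices (added here for illustration; names and values are arbitrary):

```python
import numpy as np

# p(x) = N(mu, Lambda^{-1}),  p(y|x) = N(Ax + b, L^{-1})
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 0.3], [0.3, 1.0]])          # prior precision Lambda
A = np.array([[1.0, 0.5], [0.0, 1.0], [2.0, -1.0]])
b = np.array([0.1, 0.2, -0.3])
L = np.diag([4.0, 4.0, 4.0])                      # noise precision L

# Marginal: p(y) = N(A mu + b, L^{-1} + A Lambda^{-1} A^T)
y_mean = A @ mu + b
y_cov = np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T

# Conditional: p(x|y) = N(S {A^T L (y - b) + Lambda mu}, S), S = (Lambda + A^T L A)^{-1}
y_obs = np.array([0.5, 0.0, 1.0])
S = np.linalg.inv(Lam + A.T @ L @ A)
x_post_mean = S @ (A.T @ L @ (y_obs - b) + Lam @ mu)

print(y_mean, np.diag(y_cov))
print(x_post_mean, S)
```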