Introduction to Bayesian Statistics: Sufficiency and Likelihood Principles,
Prior, Posterior and Posterior Predictive Distributions, Gaussian Examples

Prof. Nicholas Zabaras

Email: [email protected]
URL: https://www.zabaras.com/

September 2, 2020

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras


References
 C. P. Robert, The Bayesian Choice: From Decision-Theoretic Motivations to
Computational Implementation, Springer-Verlag, NY, 2001 (online resource)
 A. Gelman, J. B. Carlin, H. S. Stern and D. B. Rubin, Bayesian Data Analysis, Chapman
and Hall/CRC Press, 2nd Edition, 2003.
 J. M. Marin and C. P. Robert, The Bayesian Core, Springer-Verlag, 2007 (online
resource)
 D. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial, Oxford University Press,
2006.
 B. Vidakovic, Bayesian Statistics for Engineering, online course at Georgia Tech.
 C. Bishop, Pattern Recognition and Machine Learning (PRML), Chapter 2.
 M. Jordan, An Introduction to Probabilistic Graphical Models, Chapter 8 (pre-print).
 K. Murphy, Machine Learning: A Probabilistic Perspective, Chapters 2 and 4.



Contents
 Parametric modeling

 Sufficiency principle, MLE and the Likelihood principles

 MLE for a Univariate Gaussian, MLE for the Multivariate Gaussian

 Bayesian Statistics, Bayes rule, Prior, Likelihood and Posterior, Posterior Point Estimates,
Predictive Distribution

 Univariate Gaussian with Unknown Mean, Appendix: Gaussian Linear Models

 The goals for today’s lecture are:

 Understand the sufficiency and likelihood principles


 Understand the fundamentals of Bayesian inference from prior modeling to computing the
posterior predictive distribution
 Learn to perform Bayesian inference on the mean 𝜇 in univariate Gaussian models
Parametric Modeling
 Statistical theory derives from observations of a random phenomenon an
inference about the probability distribution underlying this phenomenon.a

 We consider parametric modeling: the observations 𝑥 are the realizations of a random
variable 𝑋 of known probability density function 𝑓(𝑥|𝜃), where 𝜃 is unknown and
belongs to a space Θ of finite dimension.

 The function 𝑓(𝑥|𝜃), considered as a function of 𝜃 for a fixed realization of the
observation 𝑋 = 𝑥, is called the likelihood function:

$$\ell(\theta \mid x) = f(x \mid \theta)$$

a Here we follow closely:


 C. P. Robert, The Bayesian Choice, Springer, 2nd edition, chapter 1 (full text available)
 Brani Vidakovic, Bayesian Statistics for Engineers, online course.
Example of Parametric Modeling
 Consider the problem of forest fires. Ecological and meteorological factors
influence their eruption. Determining the probability 𝑝 of fire as a function of
these factors can be useful in the prevention of forest fires.

 We assume a parametrized form for the function 𝑝.

 Denoting by ℎ the humidity rate, 𝑡 the average temperature, and 𝑥 the degree of
management of the forest, a logistic model (Bernoulli random variable of parameter 𝑝)
could be proposed as:

𝑝 = exp(𝛼1ℎ + 𝛼2𝑡 + 𝛼3𝑥)/ [1 + exp(𝛼1ℎ + 𝛼2𝑡 + 𝛼3𝑥)]

 The statistical step now deals with the evaluation of 𝛼1, 𝛼2, 𝛼3.
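To make the parametrization concrete, here is a minimal sketch (Python/NumPy, not part of the original slides) that evaluates the logistic fire probability for given covariates; the coefficient values in `alpha` are purely hypothetical placeholders, not estimates.

```python
import numpy as np

def fire_probability(h, t, x, alpha):
    """Logistic model: p = exp(a1*h + a2*t + a3*x) / (1 + exp(a1*h + a2*t + a3*x))."""
    eta = alpha[0] * h + alpha[1] * t + alpha[2] * x   # linear predictor
    return 1.0 / (1.0 + np.exp(-eta))                  # equivalent, numerically stable form

# Hypothetical coefficients; in practice alpha1, alpha2, alpha3 are estimated from data.
alpha = np.array([-2.0, 0.1, -0.5])
print(fire_probability(h=0.3, t=30.0, x=1.0, alpha=alpha))
```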



Sufficiency Principle
 Consider 𝑋 ~ 𝑓(𝑥|𝜃). A function 𝑻 of 𝑿 (called a statistic of 𝑿) is said to be
sufficient if the distribution of 𝑿 conditional upon 𝑻(𝑿) is independent of 𝜽:*

$$f(X \mid \theta) = h\bigl(T(X) \mid \theta\bigr)\, g\bigl(X \mid T(X)\bigr)$$

 A sufficient statistic 𝑻(𝒙) contains the whole information brought by 𝒙 about 𝜽.

 Let us consider a simple example. Let 𝑋 = (𝑋1, 𝑋2, … , 𝑋𝑁) be i.i.d. from
𝒩(𝜇, 𝜎²) with 𝜃 = (𝜇, 𝜎²). Then we can write:

$$f(x \mid \theta) = \prod_{j=1}^{N} \mathcal{N}(x_j \mid \mu, \sigma^2)
= \prod_{j=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x_j - \mu)^2\right)
= \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{j=1}^{N}(x_j - \mu)^2\right)$$

$$= \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{j=1}^{N}x_j^2 + \frac{\mu}{\sigma^2}\sum_{j=1}^{N}x_j - \frac{N\mu^2}{2\sigma^2}\right)$$

* $f(X \mid \theta) = f(X, T(X) \mid \theta) = h\bigl(T(X) \mid \theta\bigr)\, g\bigl(X \mid T(X), \theta\bigr) = h\bigl(T(X) \mid \theta\bigr)\, g\bigl(X \mid T(X)\bigr)$


Sufficiency Principle
$$f(x \mid \theta) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{j=1}^{N}x_j^2 + \frac{\mu}{\sigma^2}\sum_{j=1}^{N}x_j - \frac{N\mu^2}{2\sigma^2}\right)$$

 We can see that 𝑓(𝑥|𝜃) depends on the data only through $T(X) = \left(\sum_{j=1}^{N}x_j,\; \sum_{j=1}^{N}x_j^2\right)$, which is our set of sufficient statistics.

 Introducing $\bar{x} = \frac{1}{N}\sum_{j=1}^{N}x_j$ and $s^2 = \sum_{j=1}^{N}(x_j - \bar{x})^2$, and writing $\theta = (\theta_1, \theta_2) = (\mu, \sigma^2)$, we can also re-write the above equation as:

$$f(x \mid \theta) = \frac{1}{(2\pi\theta_2)^{N/2}} \exp\left(-\frac{s^2 + N\bar{x}^2}{2\theta_2} + \frac{\theta_1 N\bar{x}}{\theta_2} - \frac{N\theta_1^2}{2\theta_2}\right)$$

 So $(\bar{x}, s^2)$ is an alternative set of sufficient statistics.
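As a quick numerical illustration (a sketch, not from the slides), the snippet below evaluates the Gaussian log-likelihood for two different datasets that share the same sufficient statistics $\left(\sum_j x_j, \sum_j x_j^2\right)$; the log-likelihoods agree for every choice of (𝜇, 𝜎²), as the factorization above predicts.

```python
import numpy as np

def gaussian_loglik(x, mu, sigma2):
    """Log-likelihood of i.i.d. N(mu, sigma2) data."""
    x = np.asarray(x, dtype=float)
    return -0.5 * len(x) * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

x = np.array([1.0, 5.0, 6.0])   # sum = 12, sum of squares = 62
y = np.array([2.0, 3.0, 7.0])   # different data, same sufficient statistics T = (12, 62)

for mu, sigma2 in [(0.0, 1.0), (4.0, 2.5), (-1.0, 10.0)]:
    print(mu, sigma2, gaussian_loglik(x, mu, sigma2), gaussian_loglik(y, mu, sigma2))
# The two log-likelihood columns coincide: the data enter f(x|theta) only through T(x).
```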



Sufficiency Principle
 Sufficiency principle: If two observations 𝑥 and 𝑦 have the same value 𝑇(𝑥) = 𝑇(𝑦)
of a statistic sufficient for 𝑓(·|𝜃), then the inferences about 𝜃 based on 𝑥 and 𝑦
should be the same.

 Consider 𝑥ᵢ ~ 𝒩(𝜇, 1) and suppose we want to estimate 𝜇 (our 𝜃) based on 𝑁 data.
In this case, the sufficient statistic is $T(x_{1:N}) = \sum_{j=1}^{N}x_j$.

 Consider the (MLE) estimate $\hat{\mu}_1 = \frac{1}{N}\sum_{j=1}^{N}x_j$. It satisfies the sufficiency principle
because if we have another dataset $x'_{1:N}$ such that $T(x'_{1:N}) = T(x_{1:N})$, then

$$\hat{\mu}_2 = \frac{1}{N}\sum_{j=1}^{N}x'_j = \frac{1}{N}\sum_{j=1}^{N}x_j = \hat{\mu}_1$$

 On the other hand, the estimate $\hat{\mu}_1 = x_1$ does not satisfy the sufficiency principle
for 𝑁 > 1, because if we have another dataset $x'_{1:N}$ such that $T(x'_{1:N}) = T(x_{1:N})$,
then $\hat{\mu}_2 = x'_1 \neq \hat{\mu}_1$ if $x'_1 \neq x_1$.
The Likelihood Principle
 Likelihood Principle. In the inference about 𝜃, the information brought by an
observation is entirely contained in the likelihood function ℓ ( 𝜃 | 𝑥) = 𝑓 (𝑥 | 𝜃).

 Also, two likelihood functions contain the same information about 𝜃 if they are
proportional to each other; i.e.

ℓ1 ( 𝜃 | 𝑥) = 𝑐 (𝑥)ℓ2 ( 𝜃 | 𝑥)

 It is straightforward to show that the MLE (maximum likelihood procedure) satisfies
the likelihood principle:

$$\arg\max_{\theta} \ell_1(\theta \mid x) = \arg\max_{\theta} \ell_2(\theta \mid x)$$

 Classical approaches do not necessarily satisfy the likelihood principle.



MLE: Summary
 The likelihood principle is fairly vague since it does not lead to the selection of
a particular procedure.

 Maximum likelihood estimation is one way to implement the sufficiency and likelihood
principles:

$$\hat{\theta} = \arg\sup_{\theta} \ell(\theta \mid x)$$

 Indeed:

$$\arg\sup_{\theta} \ell(\theta \mid x) = \arg\sup_{\theta} g(x)\, h(T(x) \mid \theta) = \arg\sup_{\theta} h(T(x) \mid \theta)$$

$$\ell_1(\theta \mid x) = c(x)\, \ell_2(\theta \mid x) \;\Rightarrow\; \arg\sup_{\theta} \ell_1(\theta \mid x) = \arg\sup_{\theta} \ell_2(\theta \mid x)$$



Maximum Likelihood for a Gaussian
 Suppose that we have a data set of observations 𝓓 = (𝑥1, . . . , 𝑥𝑁)𝑻,
representing 𝑁 observations of the scalar random variable 𝑋. The
observations are drawn independently from a Gaussian distribution whose
mean 𝜇 and variance 𝜎2 are unknown.

 We would like to determine these parameters from the data set.

 i.i.d. data: Data points that are drawn independently from the same
distribution are said to be independent and identically distributed, which is
often abbreviated to i.i.d.



Maximum Likelihood for a Gaussian
 Because our data set 𝓓 is i.i.d., we can write the probability of the data set,
given 𝜇 and 𝜎2, in the form

$$\text{Likelihood function:}\quad p(x \mid \mu, \sigma^2) = \prod_{i=1}^{N} \mathcal{N}(x_i \mid \mu, \sigma^2)$$

This is viewed as a function of 𝜇, 𝜎².



Max Likelihood for a Gaussian Distribution
$$\text{Likelihood function:}\quad p(x \mid \mu, \sigma^2) = \prod_{i=1}^{N} \mathcal{N}(x_i \mid \mu, \sigma^2)$$

 One common criterion for determining the parameters in a probability distribution
using an observed data set is to find the parameter values that maximize the likelihood
function, i.e. maximizing the probability of the data given the parameters (contrast
this with maximizing the probability of the parameters given the data).

 We can equivalently maximize the log-likelihood:

$$\max_{\mu,\sigma^2} \ln p(x \mid \mu, \sigma^2) = \max_{\mu,\sigma^2} \left\{ -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2 - \frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi) \right\}$$

$$\mu_{ML} = \frac{1}{N}\sum_{i=1}^{N}x_i, \qquad \sigma^2_{ML} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_{ML})^2$$


Maximum Likelihood for a Gaussian Distribution
$$\mu_{ML} = \frac{1}{N}\sum_{i=1}^{N}x_i \;\;(\text{sample mean}), \qquad
\sigma^2_{ML} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_{ML})^2 \;\;(\text{sample variance about the ML mean, not the exact mean})$$

 The MLE underestimates the variance (bias due to overfitting) because 𝜇𝑀𝐿
fitted some of the noise in the data.

 The maximum likelihood solutions 𝜇𝑀𝐿, 𝜎²𝑀𝐿 are functions of the data set values
𝑥1, . . . , 𝑥𝑁. Consider the expectations of these quantities with respect to the data
set values, which come from a Gaussian.

 Using the equations above you can show that:

$$\mathbb{E}[\mu_{ML}] = \mu, \qquad \mathbb{E}\left[\sigma^2_{ML}\right] = \frac{N-1}{N}\sigma^2$$

In this derivation you need to use:

$$\mathbb{E}[x_i x_j] = \mu^2 \;\;(i \neq j), \qquad \mathbb{E}\left[x_i^2\right] = \mu^2 + \sigma^2$$


Maximum Likelihood for a Gaussian Distribution
We use:

$$\mathbb{E}\left[\sigma^2_{ML}\right] = \frac{N-1}{N}\sigma^2, \qquad \mathbb{E}[x_i x_j] = \mu^2 \;\;(i \neq j), \qquad \mathbb{E}\left[x_i^2\right] = \mu^2 + \sigma^2$$

$$\mathbb{E}\left[\sigma^2_{ML}\right] = \mathbb{E}\left[\frac{1}{N}\sum_{n=1}^{N}(x_n - \mu_{ML})^2\right]
= \mathbb{E}\left[\frac{1}{N}\sum_{n=1}^{N}\Bigl(x_n - \frac{1}{N}\sum_{m=1}^{N}x_m\Bigr)^2\right]$$

$$= \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\left[x_n^2 - \frac{2}{N}x_n\sum_{m=1}^{N}x_m + \frac{1}{N^2}\sum_{m=1}^{N}\sum_{l=1}^{N}x_m x_l\right]$$

$$= (\mu^2 + \sigma^2) - \frac{2}{N}\bigl[(N-1)\mu^2 + (\mu^2 + \sigma^2)\bigr] + \frac{1}{N}\bigl[(N-1)\mu^2 + (\mu^2 + \sigma^2)\bigr]$$

$$= (\mu^2 + \sigma^2) - \frac{1}{N}\bigl[(N-1)\mu^2 + (\mu^2 + \sigma^2)\bigr]
= \frac{N-1}{N}\sigma^2$$


Maximum Likelihood for a Gaussian Distribution
N 1 2 1 N
1 N
 ML    ,  2
ML   ,  ML   x , i
2
ML   (x  
i ML )2
N N i 1 N i 1

 On average the MLE estimate obtains the correct mean but will underestimate
the true variance by a factor (𝑁 − 1)/𝑁.

 An unbiased estimate of the variance is given as:


𝑁 For large 𝑁,
𝑁 2
1 the bias is not
𝜎ത 2 = 𝜎 = ෍ 𝑥𝑖 − 𝜇𝑀𝐿 2
𝑁 − 1 𝑀𝐿 𝑁 − 1 a problem
𝑖=1

 This result can be obtained from “a Bayesian treatment” in which we marginalize over
the unknown mean.

 The 𝑁 − 1 factor takes into account the fact that “1 degree of freedom has been used
in fitting the mean” and removes the bias of the MLE.
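A small Monte Carlo sketch (assuming data drawn from 𝒩(0, 1) and 𝑁 = 5; not part of the original slides) illustrating the (𝑁 − 1)/𝑁 bias of 𝜎²𝑀𝐿 and that the corrected estimator is unbiased on average:

```python
import numpy as np

rng = np.random.default_rng(1)
N, trials = 5, 200_000
x = rng.normal(size=(trials, N))               # many datasets of size N from N(0, 1)

mu_ml = x.mean(axis=1, keepdims=True)
sigma2_ml = ((x - mu_ml) ** 2).mean(axis=1)    # biased ML variance per dataset
sigma2_unbiased = sigma2_ml * N / (N - 1)      # corrected estimator

print(sigma2_ml.mean())        # approx (N-1)/N * sigma^2 = 0.8
print(sigma2_unbiased.mean())  # approx sigma^2 = 1.0
```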
MLE: Underestimating the Variance
 In the schematic from Bishop's PRML, we consider 3 cases, each with 2 data points
extracted from the true Gaussian. [Figure: the true Gaussian together with the MLE
estimate fitted to the 2 data points in each case.]

 The mean of the 3 distributions predicted via MLE (i.e. averaged over the data) is
correct.

 However, the variance is underestimated since it is a variance with respect to the
sample mean and NOT the true mean.
MLE for the Multivariate Gaussian
 We can easily generalize the earlier MLE results to a multivariate Gaussian (data of
dimension 𝐷). The log-likelihood takes the form:

$$\ln p(X \mid \mu, \Sigma) = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}(x_n - \mu)^T \Sigma^{-1}(x_n - \mu)$$

 Setting the derivatives with respect to 𝝁 and 𝚺 equal to zero gives the following:

$$\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N}x_n, \qquad \Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu_{ML})(x_n - \mu_{ML})^T$$

 We provide a proof of the calculation of 𝜮𝑀𝐿 next.
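A short sketch of the multivariate ML estimators on synthetic data (the true mean and covariance below are arbitrary illustrative values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.6],
                       [0.6, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=5000)   # N x D data matrix

mu_ml = X.mean(axis=0)                          # (1/N) sum_n x_n
centered = X - mu_ml
Sigma_ml = centered.T @ centered / X.shape[0]   # (1/N) sum_n (x_n - mu)(x_n - mu)^T
print(mu_ml)
print(Sigma_ml)
```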



MLE for the Multivariate Gaussian
$$\ln p(X \mid \mu, \Sigma) = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}(x_n - \mu)^T \Sigma^{-1}(x_n - \mu)$$

 We differentiate the log-likelihood with respect to 𝚺⁻¹. Each contributing term is:

$$\frac{\partial}{\partial \Sigma^{-1}}\left(-\frac{N}{2}\ln|\Sigma|\right) = \frac{\partial}{\partial \Sigma^{-1}}\left(\frac{N}{2}\ln|\Sigma^{-1}|\right) = \frac{N}{2}\Sigma^T = \frac{N}{2}\Sigma \qquad \text{(a useful trick!)}$$

$$\frac{\partial}{\partial \Sigma^{-1}}\left(-\frac{1}{2}\sum_{n=1}^{N}(x_n - \mu)^T \Sigma^{-1}(x_n - \mu)\right)
= -\frac{1}{2}\frac{\partial}{\partial \Sigma^{-1}} N\,\mathrm{Tr}\!\left[\Sigma^{-1}\frac{1}{N}\sum_{n=1}^{N}(x_n - \mu)(x_n - \mu)^T\right]
= -\frac{N}{2}S \qquad (S \text{ symmetric})$$

$$\text{where} \quad S = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu)(x_n - \mu)^T$$

 Setting the derivative equal to zero leads to: $\Sigma_{ML} = S$.

 Here we used:

$$\frac{\partial}{\partial \Lambda}\mathrm{Tr}(\Lambda\Sigma) = \Sigma^T, \qquad \frac{\partial}{\partial A}\ln|A| = \left(A^{-1}\right)^T, \qquad |A^{-1}| = |A|^{-1}, \qquad \mathrm{tr}(AB) = \mathrm{tr}(BA)$$
Appendix: Some Useful Matrix Operations
 Show that

$$\frac{\partial}{\partial A}\mathrm{Tr}(AB) = B^T$$

Indeed:

$$\frac{\partial}{\partial A_{mn}}\mathrm{Tr}(AB) = \frac{\partial}{\partial A_{mn}}\sum_{i,k}A_{ik}B_{ki} = B_{nm} \;\Rightarrow\; \frac{\partial}{\partial A}\mathrm{Tr}(AB) = B^T$$

 Show that

$$\frac{\partial}{\partial A}\ln|A| = \left(A^{-1}\right)^T$$

Using the cofactor expansion of the determinant, $|A| = \sum_{j}(-1)^{i+j}A_{ij}M_{ij}$,

$$\frac{\partial}{\partial A_{mn}}\ln|A| = \frac{1}{|A|}\frac{\partial |A|}{\partial A_{mn}} = \frac{1}{|A|}(-1)^{m+n}M_{mn} = \left(A^{-1}\right)_{nm}$$

where in the last step we used Cramer's rule.



MLE for a Multivariate Gaussian
$$\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N}x_n \equiv \bar{x}, \qquad
\Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu_{ML})(x_n - \mu_{ML})^T = \frac{1}{N}\sum_{n=1}^{N}x_n x_n^T - \bar{x}\bar{x}^T$$

 Note that the unconstrained maximization of the log-likelihood gives a symmetric 𝚺.

 As for the univariate case, we can define an unbiased covariance as:

$$\bar{\Sigma}_{ML} = \frac{1}{N-1}\sum_{n=1}^{N}(x_n - \mu_{ML})(x_n - \mu_{ML})^T, \qquad \mathbb{E}\left[\bar{\Sigma}_{ML}\right] = \Sigma$$

 To prove this, you will need to use that:

$$\mathbb{E}\left[x_n x_m^T\right] = \mu\mu^T + \delta_{mn}\Sigma$$
Prior Knowledge is Essential
 We cannot do everything simply based on data – prior knowledge is essential
to inference and prediction.

 Bayesian inference integrates data and prior models.


Bayesian Statistics
 A Bayesian model is made of a parametric statistical model (𝒳, 𝑓(𝑥|𝜃)) and a
prior distribution on the parameters (Θ, 𝜋(𝜃)).

 The unknown parameters are now considered as random.

 Some statisticians question this approach but most accept the probabilistic
modeling on the observations.

 Example: Assume you want to measure the speed of light given some
observations. Why should you put a prior on a physical constant?

 Due to the limited accuracy of the measurement, this constant will never
be known exactly.

 It is thus justified to put a (e.g. uniform) prior on this parameter reflecting this
uncertainty.
Recall Bayes’ rule
 Coming back from a trip, you feel sick and your doctor thinks you might have
contracted a rare disease (0.01% of the population has the disease).

 A test is available but not perfect.

 If a tested patient has the disease, 100% of the time the test will be
positive.

 If a tested patient does not have the disease, 95% of the time the test will
be negative (5% false positive).

 Your test is positive, should you really care?



Medical Diagnosis Example
 Let 𝐴 = `the patient has the disease'

 Let 𝐵 = `the test returns a positive result'

$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B \mid A)P(A) + P(B \mid \bar{A})P(\bar{A})} = \frac{1 \times 0.0001}{1 \times 0.0001 + 0.05 \times 0.9999} \approx 0.002$$

 Such a test would be a complete waste of money.
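The same computation as a short sketch (plain Python, with the values taken from the example above):

```python
# Posterior probability of disease given a positive test (Bayes' rule).
p_disease = 0.0001           # prior P(A): 0.01% prevalence
p_pos_given_disease = 1.0    # sensitivity P(B|A)
p_pos_given_healthy = 0.05   # false-positive rate P(B|not A)

evidence = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
p_disease_given_pos = p_pos_given_disease * p_disease / evidence
print(p_disease_given_pos)   # about 0.002
```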



Prior, Likelihood and Posterior
In the previous example, we can identify the following:

 Data 𝑥 (e.g.` the test is positive’)


 Hypothesis ℎ (e.g. `do you have the disease?’). We want to make inference
about ℎ.

In Bayesian settings all variables are random and all inferences are probabilistic.

We identify three key ingredients of a Bayesian inference approach:

 Prior 𝜋(ℎ): How likely is hypothesis ℎ before looking at the data

 Likelihood 𝑓(𝑥 | ℎ): How likely it is to observe 𝑥 assuming ℎ is true.

 Posterior 𝜋(ℎ | 𝑥): How likely is ℎ after data 𝑥 have been observed.
$$\pi(h \mid x) = \frac{f(x \mid h)\,\pi(h)}{m(x)}$$
Prior 𝝅(𝜽)
 We use the prior to introduce quantitatively some insights on the parameters
of interest.

 This can be as subjective or as objective as you want it to be – and that’s why


frequentists do not like Bayesian approaches!

 There is no such thing as a true prior!

 Even when prior information is heavily subjective, the Bayesian inference


model is honest.



Likelihood 𝑓(𝑥|𝜽)
 The likelihood encapsulates the mathematical model of the physical
phenomena you are investigating.

 If you know the input 𝑋 = 𝑥 to your problem, the likelihood can represent the
computed output 𝑦 = 𝑓(𝑥).

 It is the most computationally expensive part of Bayesian approaches to inference
problems (inverse problems).



Posterior 𝝅(𝜽|𝑥): Inference and Prediction
 It combines the prior and likelihood.

 It weights the data and the prior information in making probabilistic inferences

 The posterior distribution is also useful in estimating the probability of


observing a future outcome (prediction)

$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{m(x)}$$

 The normalizing factor is given as:

$$m(x) = \int f(x \mid \theta)\,\pi(\theta)\,d\theta$$



Posterior Inference: Point Estimates
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{m(x)}$$

Maximum A Posteriori estimate (MAP):

$$\theta^{*} = \arg\max_{\theta} \log \pi(\theta \mid x) = \arg\max_{\theta} \left[\log f(x \mid \theta) + \log \pi(\theta)\right]$$

Posterior mean:

$$\hat{\theta} = \mathbb{E}_{\pi(\theta \mid x)}[\theta] = \int \theta\, \pi(\theta \mid x)\, d\theta$$

Posterior quantiles:

$$\Pr(\theta \geq a) = \int_{a}^{\infty} \pi(\theta \mid x)\, d\theta$$



Prediction
 Suppose we have observed 𝒙 and we want to make a prediction about (future) unknown
observables: what is the probability of observing data $\hat{\boldsymbol{x}}$ if we have already observed
data 𝒙?

 This means finding $g(\hat{\boldsymbol{x}} \mid \boldsymbol{x})$.

 We have:

$$g(\hat{\boldsymbol{x}} \mid \boldsymbol{x}) = \int g(\hat{\boldsymbol{x}}, \theta \mid \boldsymbol{x})\, d\theta
= \int \frac{\pi(\hat{\boldsymbol{x}}, \theta, \boldsymbol{x})}{m(\boldsymbol{x})}\, d\theta
= \int \frac{\pi(\hat{\boldsymbol{x}}, \theta, \boldsymbol{x})}{\phi(\theta, \boldsymbol{x})}\,\frac{\phi(\theta, \boldsymbol{x})}{m(\boldsymbol{x})}\, d\theta
= \int f(\hat{\boldsymbol{x}} \mid \theta, \boldsymbol{x})\,\pi(\theta \mid \boldsymbol{x})\, d\theta
= \int f(\hat{\boldsymbol{x}} \mid \theta)\,\pi(\theta \mid \boldsymbol{x})\, d\theta$$

where $\phi(\theta, \boldsymbol{x})$ denotes the joint density of $\theta$ and $\boldsymbol{x}$.

 Compare this with the normalizing factor in Bayes' rule:

$$m(\hat{\boldsymbol{x}}) = \int f(\hat{\boldsymbol{x}} \mid \theta)\,\pi(\theta)\, d\theta$$



A Gaussian Example
 Consider $x_1 \mid \theta \sim \mathcal{N}(\theta, \sigma^2)$, with prior $\theta \sim \mathcal{N}(\mu_0, \sigma_0^2)$.

 Then we can derive the following:

$$\pi(\theta \mid x_1) \propto f(x_1 \mid \theta)\,\pi(\theta) \propto \exp\left(-\frac{(x_1 - \theta)^2}{2\sigma^2} - \frac{(\theta - \mu_0)^2}{2\sigma_0^2}\right)$$

$$\pi(\theta \mid x_1) \propto \exp\left(-\frac{\theta^2}{2}\left(\frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}\right) + \theta\left(\frac{x_1}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right)\right) \propto \exp\left(-\frac{1}{2\sigma_1^2}(\theta - \mu_1)^2\right)$$

With prior $\theta \sim \mathcal{N}(\mu_0, \sigma_0^2)$ and observation $x_1$:

$$\theta \mid x_1 \sim \mathcal{N}(\mu_1, \sigma_1^2) \quad \text{with} \quad
\frac{1}{\sigma_1^2} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma^2} \;\Leftrightarrow\; \sigma_1^2 = \frac{\sigma_0^2\sigma^2}{\sigma_0^2 + \sigma^2},
\qquad \mu_1 = \sigma_1^2\left(\frac{x_1}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right)$$


A Gaussian Example: Continued
To predict the distribution of a new observation $x \mid \theta \sim \mathcal{N}(\theta, \sigma^2)$ in light of $x_1$, we use the predictive distribution as follows:

$$f(x \mid x_1) = \int \underbrace{f(x \mid \theta)}_{\text{Likelihood}}\;\underbrace{\pi(\theta \mid x_1)}_{\text{Posterior}}\, d\theta
\propto \int e^{-\frac{(x-\theta)^2}{2\sigma^2}}\, e^{-\frac{(\theta-\mu_1)^2}{2\sigma_1^2}}\, d\theta
= \int e^{-\frac{1}{2}\left(\frac{(x-\theta)^2}{\sigma^2} + \frac{(\theta-\mu_1)^2}{\sigma_1^2}\right)}\, d\theta$$

 You can verify by direct substitution the following:

$$-\frac{1}{2}\left(\frac{(x-\theta)^2}{\sigma^2} + \frac{(\theta-\mu_1)^2}{\sigma_1^2}\right)
= -\frac{1}{2}\begin{pmatrix} x-\mu_1 & \theta-\mu_1 \end{pmatrix}\frac{1}{\sigma_1^2\sigma^2}\begin{pmatrix} \sigma_1^2 & -\sigma_1^2 \\ -\sigma_1^2 & \sigma_1^2+\sigma^2 \end{pmatrix}\begin{pmatrix} x-\mu_1 \\ \theta-\mu_1 \end{pmatrix} + \cdots$$

$$= -\frac{1}{2}\begin{pmatrix} x-\mu_1 & \theta-\mu_1 \end{pmatrix}\begin{pmatrix} \sigma_1^2+\sigma^2 & \sigma_1^2 \\ \sigma_1^2 & \sigma_1^2 \end{pmatrix}^{-1}\begin{pmatrix} x-\mu_1 \\ \theta-\mu_1 \end{pmatrix} + \cdots$$

 This is a bivariate Gaussian and thus $f(x \mid x_1)$ is the marginal in $x$, i.e.

$$X \mid x_1 \sim \mathcal{N}(\mu_1,\; \sigma_1^2 + \sigma^2)$$
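A quick Monte Carlo check of this marginalization (a sketch continuing the illustrative numbers of the previous snippet): sample θ from the posterior, then x from 𝒩(θ, σ²), and compare the sample mean and variance with μ₁ and σ₁² + σ².

```python
import numpy as np

rng = np.random.default_rng(3)
mu1, sigma12, sigma2 = 1.5, 0.5, 1.0   # posterior parameters from the previous sketch

theta = rng.normal(mu1, np.sqrt(sigma12), size=500_000)  # theta ~ posterior
x_new = rng.normal(theta, np.sqrt(sigma2))               # x | theta ~ N(theta, sigma^2)

print(x_new.mean(), x_new.var())   # approx mu1 = 1.5 and sigma1^2 + sigma^2 = 1.5
```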



Bayesian Inference for the Gaussian
 Consider $X = (x_1, x_2, \ldots, x_N) \sim \mathcal{N}(\mu, \sigma^2)$, with prior $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$.

 The likelihood takes the form:

$$p(X \mid \mu) = \prod_{n=1}^{N} f(x_n \mid \mu) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left(-\frac{\sum_{n=1}^{N}(x_n - \mu)^2}{2\sigma^2}\right)$$

 Note that in terms of 𝜇 this is not a probability density and is not normalized.
Introducing the conjugate (Gaussian) prior on 𝜇 leads to:

$$\pi(\mu \mid X) \propto \prod_{n=1}^{N} f(x_n \mid \mu)\,\pi(\mu) \propto \exp\left(-\frac{\sum_{n=1}^{N}(x_n - \mu)^2}{2\sigma^2} - \frac{(\mu - \mu_0)^2}{2\sigma_0^2}\right)$$

$$\pi(\mu \mid X) \propto \exp\left(-\frac{\mu^2}{2}\left(\frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}\right) + \mu\left(\frac{\sum_{n=1}^{N}x_n}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right)\right) \propto \exp\left(-\frac{1}{2\sigma_N^2}(\mu - \mu_N)^2\right)$$
Bayesian Inference for the Gaussian
 So the posterior is a Gaussian as before with

$$\mu \mid X \sim \mathcal{N}(\mu_N, \sigma_N^2) \quad \text{with} \quad
\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2} \;\Leftrightarrow\; \sigma_N^2 = \frac{\sigma_0^2\sigma^2}{N\sigma_0^2 + \sigma^2},$$

$$\mu_N = \sigma_N^2\left(\frac{\sum_{n=1}^{N}x_n}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right)
= \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{ML} + \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0, \qquad \mu_{ML} = \frac{1}{N}\sum_{n=1}^{N}x_n$$
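A minimal sketch of these batch update formulas (the data-generating values mimic the later plot, X ~ 𝒩(0.8, 0.1) with prior 𝒩(0, 0.1), where 0.1 is taken to be a variance; the helper name is mine, not from the slides):

```python
import numpy as np

def posterior_mean_variance(x, sigma2, mu0, sigma02):
    """Posterior N(mu_N, sigma_N^2) of mu for x_1..x_N ~ N(mu, sigma2), prior N(mu0, sigma02)."""
    N = len(x)
    mu_ml = np.mean(x)
    sigma_N2 = sigma02 * sigma2 / (N * sigma02 + sigma2)
    mu_N = (N * sigma02 * mu_ml + sigma2 * mu0) / (N * sigma02 + sigma2)
    return mu_N, sigma_N2

rng = np.random.default_rng(4)
x = rng.normal(0.8, np.sqrt(0.1), size=10)   # synthetic data
print(posterior_mean_variance(x, sigma2=0.1, mu0=0.0, sigma02=0.1))
```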


Bayesian Inference for the Gaussian
$$\mu \mid X \sim \mathcal{N}(\mu_N, \sigma_N^2) \quad \text{with} \quad
\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}, \qquad
\sigma_N^2 = \frac{\sigma_0^2\sigma^2}{N\sigma_0^2 + \sigma^2}, \qquad
\mu_N = \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{ML} + \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0$$

 The posterior precision is the sum of the precision of the prior plus one
contribution of the data precision for each observed data point.

 Observe the posterior mean for 𝑁 → ∞ and 𝑁 → 0.

 For 𝑁 → ∞ the posterior peaks around 𝜇𝑀𝐿 and the posterior variance goes to zero,
i.e. the point MLE estimate is recovered within the Bayesian paradigm for infinite
data. In addition, $\mathbb{E}[\mu \mid x_1, \ldots, x_N] = \mu_N \simeq \mu_{ML}$.

 How about when $\sigma_0^2 \to \infty$? In this case note that $\sigma_N^2 = \dfrac{\sigma^2}{N}$ and $\mu_N = \mu_{ML}$.



Bayesian Inference for the Gaussian
$$\mu \mid X \sim \mathcal{N}(\mu_N, \sigma_N^2) \quad \text{with} \quad
\sigma_N^2 = \frac{\sigma_0^2\sigma^2}{N\sigma_0^2 + \sigma^2}, \qquad
\mu_N = \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{ML} + \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0$$

[Figure (MATLAB implementation): the prior and the posterior 𝜋(𝜇|X) after N = 1, 2 and
10 observations, plotted on the interval [-1, 1].]

$$X = (x_1, x_2, \ldots, x_N) \sim \mathcal{N}(0.8,\, 0.1), \quad \text{with prior} \quad \mu \sim \mathcal{N}(0,\, 0.1)$$


Sequential Bayesian Inference
$$\mu \mid X \sim \mathcal{N}(\mu_N, \sigma_N^2) \quad \text{with} \quad
\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}, \qquad
\sigma_N^2 = \frac{\sigma_0^2\sigma^2}{N\sigma_0^2 + \sigma^2}, \qquad
\mu_N = \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{ML} + \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0$$

 We can derive sequential estimates of the posterior variance and mean. They are as follows:

$$\frac{1}{\sigma_N^2} = \frac{1}{\sigma_{N-1}^2} + \frac{1}{\sigma^2}, \qquad
\mu_N = \frac{\sigma_N^2}{\sigma_{N-1}^2}\,\mu_{N-1} + \frac{\sigma_N^2}{\sigma^2}\,x_N$$

 Show this by recognizing the sequential nature of Bayesian inference (the posterior at
the previous step becomes the new prior) and recalling that with prior $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$ and
observation $x_1$: $\mu \mid x_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ with

$$\frac{1}{\sigma_1^2} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma^2} \;\Leftrightarrow\; \sigma_1^2 = \frac{\sigma_0^2\sigma^2}{\sigma_0^2 + \sigma^2}, \qquad \mu_1 = \sigma_1^2\left(\frac{x_1}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right)$$


Linear Gaussian Systems: Inferring the Mean
 Revisit the Bayesian inference problem for the Gaussian. Consider
$y = (y_1, y_2, \ldots, y_N) \sim \mathcal{N}(y \mid x, \sigma^2 = \lambda_y^{-1})$, with prior $x \sim \mathcal{N}(x \mid \mu_0, \sigma_0^2 = \lambda_0^{-1})$.

 Likelihood for the 𝑁-dataset in the form of a linear Gaussian system:

$$p(y \mid x) = \mathcal{N}(y \mid Ax + b, \Sigma_y), \qquad A = \mathbf{1}_N \;(\text{column vector of 1's}), \quad b = 0, \quad \Sigma_y^{-1} = \lambda_y I$$

 Applying conditional Gaussian results (to be reviewed later on) for

$$p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1}), \qquad p(y \mid x) = \mathcal{N}(y \mid Ax + b, L^{-1}),$$

$$p(x \mid y) = \mathcal{N}\!\left(x \,\Big|\, \left(\Lambda + A^T L A\right)^{-1}\left[A^T L (y - b) + \Lambda\mu\right],\; \left(\Lambda + A^T L A\right)^{-1}\right),$$

gives

$$p(x \mid y) = \mathcal{N}\!\left(x \,\Big|\, \left(\lambda_0 + \mathbf{1}_N^T \lambda_y I\, \mathbf{1}_N\right)^{-1}\left[\mathbf{1}_N^T \lambda_y I\,(y - 0) + \lambda_0\mu_0\right],\; \left(\lambda_0 + \mathbf{1}_N^T \lambda_y I\, \mathbf{1}_N\right)^{-1}\right)$$

 This can be simplified as $p(x \mid \boldsymbol{y}) = \mathcal{N}(x \mid \mu_N, \lambda_N^{-1})$ where:

$$p(x \mid \boldsymbol{y}) = \mathcal{N}\!\left(x \,\Big|\, \frac{N\lambda_y}{\lambda_0 + N\lambda_y}\,\bar{y} + \frac{\lambda_0}{\lambda_0 + N\lambda_y}\,\mu_0,\; \left(\lambda_0 + N\lambda_y\right)^{-1}\right)$$

 The precision is the prior precision plus 𝑁 measurement precisions. The mean is the weighted
average of the MLE and the prior mean. These are identical results to those obtained earlier.
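A sketch applying the general linear-Gaussian posterior formula directly, with A = 1_N, b = 0 and precisions λ₀ = λ_y = 10 (i.e. σ₀² = σ² = 0.1, as in the earlier example); it reproduces the simplified scalar result. The function name and values are illustrative assumptions, not from the slides.

```python
import numpy as np

def linear_gaussian_posterior(y, A, b, L, mu, Lam):
    """p(x|y) for p(x) = N(mu, Lam^-1), p(y|x) = N(Ax + b, L^-1):
    Sigma = (Lam + A^T L A)^-1,  mean = Sigma (A^T L (y - b) + Lam mu)."""
    Sigma = np.linalg.inv(Lam + A.T @ L @ A)
    mean = Sigma @ (A.T @ L @ (y - b) + Lam @ mu)
    return mean, Sigma

N, lam_y, lam0, mu0 = 10, 10.0, 10.0, 0.0
rng = np.random.default_rng(6)
y = rng.normal(0.8, np.sqrt(1.0 / lam_y), size=(N, 1))

A = np.ones((N, 1))            # column vector of 1's
b = np.zeros((N, 1))
L = lam_y * np.eye(N)          # likelihood precision matrix
mean, Sigma = linear_gaussian_posterior(y, A, b, L, np.array([[mu0]]), np.array([[lam0]]))
print(mean.item(), Sigma.item())
# Same as mu_N = (N*lam_y*ybar + lam0*mu0)/(lam0 + N*lam_y) and var = 1/(lam0 + N*lam_y).
```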
Inferring the Mean of a Gaussian
[Figure (gaussInferParamsMean1d from PMTK): two panels, "prior variance = 1.00" and
"prior variance = 5.00", each showing the prior, likelihood and posterior densities.]

Inference about 𝑥 given a single noisy observation 𝑦 = 3.

 (a) Strong prior N(0, 1). The posterior mean is "shrunk" towards the prior mean,
which is 0.

 (b) Weak prior N(0, 5). The posterior mean is similar to the MLE.
Shrinkage and Signal-To-Noise Ratio
$$\sigma_N^2 = \frac{\sigma_0^2\sigma^2}{N\sigma_0^2 + \sigma^2}, \qquad
\mu_N = \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{ML} + \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0$$

 The posterior precision is the sum of the precision of the prior plus one contribution of the
data precision for each observed data point. For 𝑁 → ∞ the posterior peaks around 𝜇𝑀𝐿 and the
posterior variance goes to zero, i.e. the MLE estimate is recovered within the Bayesian paradigm.

 If we apply the data sequentially, we can write the posterior mean after the collection of one
data point (𝑁 = 1), i.e. 𝜇𝑀𝐿 = 𝑦, as follows:

$$\mu_1 = y - \frac{\sigma^2}{\sigma_0^2 + \sigma^2}(y - \mu_0) \qquad (\text{shrinkage: the data } y \text{ is adjusted towards the prior mean } \mu_0)$$

 Shrinkage is also often measured with the signal-to-noise ratio:

$$\mathrm{SNR} = \frac{\mathbb{E}[X^2]}{\sigma^2} = \frac{\mu_0^2 + \sigma_0^2}{\sigma^2}, \quad \text{for} \quad y = x + \epsilon \;(\text{observed signal}), \quad x \sim \mathcal{N}(\mu_0, \sigma_0^2) \;(\text{true signal}), \quad \epsilon \sim \mathcal{N}(0, \sigma^2) \;(\text{noise})$$

 How about when $\sigma_0^2 \to \infty$? In this case note that $\sigma_N^2 = \dfrac{\sigma^2}{N}$ and $\mu_N = \mu_{ML}$.
Appendix:
Gaussian Linear Models



Bayes’ Theorem and Gaussian Linear Models
 Consider a linear Gaussian model: A Gaussian marginal distribution 𝑝(𝒙) and
a Gaussian conditional distribution 𝑝(𝒚|𝒙) in which 𝑝(𝒚|𝒙) has a mean that is a
linear function of 𝒙, and a covariance which is independent of 𝒙.

p  x   N  x | ,  1

p  y | x   N  y | Ax  b, L 1

 Using Bayes' rule, we want to find 𝑝(𝒚) and 𝑝(𝒙|𝒚).

 We start with the joint distribution over 𝒛 = (𝒙, 𝒚) which is quadratic in the
components of 𝒛 – so 𝑝(𝒛) is a Gaussian.



Bayes’ Theorem and Gaussian Linear Models
p  x   N  x |  , 1 
p  y | x   N  y | Ax  b, L1 

1
ln p( z )  ln p  x   ln p  y | x    ( x   )T  ( x   )
2
1
 ( y  Ax  b)T L( y  Ax  b)  const
2



Covariance of the Joint Distribution
$$\ln p(z) = -\frac{1}{2}x^T\left(\Lambda + A^T L A\right)x - \frac{1}{2}y^T L y + \frac{1}{2}y^T L A x + \frac{1}{2}x^T A^T L y + \text{const}$$

$$= -\frac{1}{2}\begin{pmatrix} x \\ y \end{pmatrix}^T \begin{pmatrix} \Lambda + A^T L A & -A^T L \\ -LA & L \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \text{const} \qquad (\text{only quadratic terms shown})$$

 We can immediately write down the covariance of 𝒛:

$$\mathrm{cov}[z] = R^{-1} = \begin{pmatrix} \Lambda + A^T L A & -A^T L \\ -LA & L \end{pmatrix}^{-1} = \begin{pmatrix} \Lambda^{-1} & \Lambda^{-1}A^T \\ A\Lambda^{-1} & L^{-1} + A\Lambda^{-1}A^T \end{pmatrix}$$

 In the matrix inversion we used a result from an earlier lecture:

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} M^{-1} & -M^{-1}BD^{-1} \\ -D^{-1}CM^{-1} & D^{-1} + D^{-1}CM^{-1}BD^{-1} \end{pmatrix}, \qquad \text{where } M = A - BD^{-1}C$$
Mean of the Joint Distribution
$$\ln p(z) = x^T\Lambda\mu - x^T A^T L b + y^T L b + \cdots
= \begin{pmatrix} x \\ y \end{pmatrix}^T \begin{pmatrix} \Lambda\mu - A^T L b \\ Lb \end{pmatrix} + \cdots \qquad (\text{only linear terms shown})$$

 We can immediately write down the mean of 𝒛:

$$\mathbb{E}[z] = \mathrm{cov}[z]\begin{pmatrix} \Lambda\mu - A^T L b \\ Lb \end{pmatrix}
= \begin{pmatrix} \Lambda^{-1} & \Lambda^{-1}A^T \\ A\Lambda^{-1} & L^{-1} + A\Lambda^{-1}A^T \end{pmatrix}\begin{pmatrix} \Lambda\mu - A^T L b \\ Lb \end{pmatrix}
= \begin{pmatrix} \mu \\ A\mu + b \end{pmatrix}$$

 It remains to find the marginal 𝑝(𝒚). We can use earlier derived results.


Marginal 𝑝(𝑦) Distribution
 Recall results from an earlier lecture for computing the marginal:

$$p(x_a) = \mathcal{N}(x_a \mid \mu_a, \Sigma_{aa})$$

 Based on our calculations,

$$\mathbb{E}[z] = \begin{pmatrix} \mu \\ A\mu + b \end{pmatrix}, \qquad
\mathrm{cov}[z] = \begin{pmatrix} \Lambda^{-1} & \Lambda^{-1}A^T \\ A\Lambda^{-1} & L^{-1} + A\Lambda^{-1}A^T \end{pmatrix},$$

we conclude:

$$p(y) = \mathcal{N}\!\left(y \mid A\mu + b,\; L^{-1} + A\Lambda^{-1}A^T\right), \qquad \mathbb{E}[y] = A\mu + b, \qquad \mathrm{cov}[y] = L^{-1} + A\Lambda^{-1}A^T$$

 Note that for $A = I$, $p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1})$, $p(y \mid x) = \mathcal{N}(y \mid x + b, L^{-1})$, the convolution
of the two Gaussians gives the well known result:

$$\mathbb{E}[y] = \mu + b, \qquad \mathrm{cov}[y] = L^{-1} + \Lambda^{-1}$$


Conditional 𝑝(𝑥|𝑦) Distribution
 Recall from an earlier lecture the result for computing the conditional:

$$p(x_a \mid x_b) = \mathcal{N}\!\left(x_a \mid \mu_{a|b}, \Lambda_{aa}^{-1}\right), \qquad \mu_{a|b} = \mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b - \mu_b)$$

 Based on our calculations,

$$\mathbb{E}[z] = \begin{pmatrix} \mu \\ A\mu + b \end{pmatrix}, \qquad
\mathrm{cov}[z]^{-1} = \begin{pmatrix} \Lambda + A^T L A & -A^T L \\ -LA & L \end{pmatrix},$$

we conclude:

$$p(x \mid y) = \mathcal{N}\!\left(x \,\Big|\, \left(\Lambda + A^T L A\right)^{-1}\left[A^T L (y - b) + \Lambda\mu\right],\; \left(\Lambda + A^T L A\right)^{-1}\right)$$

Proof:

$$\mathbb{E}[x \mid y] = \mu + \left(\Lambda + A^T L A\right)^{-1} A^T L\,(y - A\mu - b)
= \left(\Lambda + A^T L A\right)^{-1}\left[\left(\Lambda + A^T L A\right)\mu + A^T L y - A^T L A\mu - A^T L b\right]
= \left(\Lambda + A^T L A\right)^{-1}\left[\Lambda\mu + A^T L (y - b)\right]$$

$$\mathrm{cov}[x \mid y] = \left(\Lambda + A^T L A\right)^{-1}$$
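As a numerical sanity check of these appendix results (a sketch with small arbitrary matrices, not part of the original slides), the snippet below builds the joint covariance given above, conditions on an arbitrary y with the generic Gaussian-conditioning formulas, and verifies the closed forms for p(x|y).

```python
import numpy as np

inv = np.linalg.inv
rng = np.random.default_rng(7)

dx, dy = 2, 3
mu = rng.normal(size=(dx, 1))
A = rng.normal(size=(dy, dx))
b = rng.normal(size=(dy, 1))
Lam = 2.0 * np.eye(dx)        # prior precision (SPD)
L = 0.5 * np.eye(dy)          # likelihood precision (SPD)

# Joint covariance of z = (x, y) from the slides; E[z] = (mu, A mu + b)
cov_z = np.block([[inv(Lam),     inv(Lam) @ A.T],
                  [A @ inv(Lam), inv(L) + A @ inv(Lam) @ A.T]])

# Condition on an arbitrary y using the generic partitioned-Gaussian formulas ...
y = rng.normal(size=(dy, 1))
Sxx, Sxy, Syy = cov_z[:dx, :dx], cov_z[:dx, dx:], cov_z[dx:, dx:]
cond_mean = mu + Sxy @ inv(Syy) @ (y - (A @ mu + b))
cond_cov = Sxx - Sxy @ inv(Syy) @ Sxy.T

# ... and compare with the closed forms on this slide
Sigma = inv(Lam + A.T @ L @ A)
print(np.allclose(cond_cov, Sigma))                                    # True
print(np.allclose(cond_mean, Sigma @ (A.T @ L @ (y - b) + Lam @ mu)))  # True
```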
