Lec11 Introduction to Bayesian Statistics
Email: [email protected]
URL: https://www.zabaras.com/
September 2, 2020
Bayesian Statistics, Bayes rule, Prior, Likelihood and Posterior, Posterior Point Estimates,
Predictive Distribution
ℓ (𝜃 | 𝑥) = 𝑓 (𝑥 | 𝜃)
The statistical step now deals with the evaluation of 𝛼1, 𝛼2, 𝛼3.
Let us consider a simple example. Let 𝑋 = (𝑋1, 𝑋2, … , 𝑋𝑁) be i.i.d. from
𝒩(𝜇, 𝜎²) with 𝜃 = (𝜇, 𝜎²). Then we can write:
$$f(x \mid \theta) = \prod_{j=1}^{N}\mathcal{N}(x_j \mid \theta)
= \left(\frac{1}{2\pi\sigma^2}\right)^{N/2}\exp\!\left(-\frac{1}{2\sigma^2}\sum_{j=1}^{N}(x_j-\mu)^2\right)
= \left(\frac{1}{2\pi\sigma^2}\right)^{N/2}\exp\!\left(-\frac{1}{2\sigma^2}\sum_{j=1}^{N}x_j^2 + \frac{\mu}{\sigma^2}\sum_{j=1}^{N}x_j - \frac{N\mu^2}{2\sigma^2}\right)$$
so the likelihood depends on the data only through $T(x) = \left(\sum_{j=1}^{N}x_j,\ \sum_{j=1}^{N}x_j^2\right)$, which is therefore a sufficient statistic for $\theta = (\mu, \sigma^2)$.
On the other hand, the estimate $\hat{\mu}_1 = x_1$ does not satisfy the sufficiency
principle for $N > 1$, because if we have another dataset $x'_{1:N}$ such that
$T(x'_{1:N}) = T(x_{1:N})$, then $\hat{\mu}_2 = x'_1 \neq \hat{\mu}_1$ if $x'_1 \neq x_1$.
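To make this concrete, here is a minimal NumPy sketch (an illustration added here, not part of the original slides): two datasets with the same sufficient statistic $T(x)$ give exactly the same Gaussian likelihood, while the "first observation" estimate differs between them.

```python
import numpy as np

def gaussian_log_lik(x, mu, sigma2):
    """Log-likelihood of i.i.d. Gaussian data; it depends on x only through
    N, sum(x) and sum(x**2), i.e. through the sufficient statistic T(x)."""
    N = len(x)
    return -0.5 * N * np.log(2 * np.pi * sigma2) - 0.5 * np.sum((x - mu) ** 2) / sigma2

x  = np.array([1.0, 2.0, 3.0])   # original dataset
xp = np.array([3.0, 2.0, 1.0])   # permuted dataset: same T(x) = (sum x, sum x^2)

mu, sigma2 = 1.5, 2.0
print(gaussian_log_lik(x, mu, sigma2), gaussian_log_lik(xp, mu, sigma2))  # identical
print(x[0], xp[0])  # but the "first observation" estimate differs: 1.0 vs 3.0
```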
The Likelihood Principle
Likelihood Principle. In the inference about 𝜃, the information brought by an
observation is entirely contained in the likelihood function ℓ ( 𝜃 | 𝑥) = 𝑓 (𝑥 | 𝜃).
Also, two likelihood functions contain the same information about 𝜃 if they are
proportional to each other; i.e.
ℓ1 ( 𝜃 | 𝑥) = 𝑐 (𝑥)ℓ2 ( 𝜃 | 𝑥)
Indeed:
$$\arg\sup_{\theta}\,\ell(\theta \mid x) = \arg\sup_{\theta}\,g(x)\,h(T(x)\mid\theta) = \arg\sup_{\theta}\,h(T(x)\mid\theta)$$
i.i.d. data: Data points that are drawn independently from the same
distribution are said to be independent and identically distributed, which is
often abbreviated to i.i.d.
Likelihood function:
$$p(x \mid \mu, \sigma^2) = \prod_{i=1}^{N}\mathcal{N}(x_i \mid \mu, \sigma^2)$$
Maximizing with respect to $\mu$ and $\sigma^2$ gives the maximum likelihood solutions
$$\mu_{ML} = \frac{1}{N}\sum_{i=1}^{N}x_i, \qquad \sigma^2_{ML} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_{ML})^2$$
The MLE underestimates the variance (bias due to overfitting) because 𝜇𝑀𝐿
fitted some of the noise in the data.
The maximum likelihood solutions $\mu_{ML}$, $\sigma^2_{ML}$ are functions of the data set
values $x_1, . . . , x_N$. Consider the expectations of these quantities with respect to
the data set values, which themselves come from a Gaussian.
$$\begin{aligned}
\mathbb{E}\!\left[\sigma^2_{ML}\right] &= \mathbb{E}\!\left[\frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_{ML})^2\right]
= \mathbb{E}\!\left[\frac{1}{N}\sum_{n=1}^{N}\Big(x_n-\frac{1}{N}\sum_{m=1}^{N}x_m\Big)^{\!2}\right] \\
&= \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\!\left[x_n^2\right] - \frac{2}{N^2}\sum_{n=1}^{N}\sum_{m=1}^{N}\mathbb{E}\!\left[x_n x_m\right] + \frac{1}{N^2}\sum_{m=1}^{N}\sum_{l=1}^{N}\mathbb{E}\!\left[x_m x_l\right] \\
&= \frac{1}{N}\,N(\mu^2+\sigma^2) - \frac{2}{N^2}\Big[N(\mu^2+\sigma^2) + N(N-1)\mu^2\Big] + \frac{1}{N^2}\Big[N(\mu^2+\sigma^2) + N(N-1)\mu^2\Big] \\
&= \mu^2 + \sigma^2 - \frac{1}{N}\Big[(\mu^2+\sigma^2) + (N-1)\mu^2\Big] = \frac{N-1}{N}\,\sigma^2
\end{aligned}$$
On average the MLE estimate obtains the correct mean but will underestimate
the true variance by a factor (𝑁 − 1)/𝑁.
The 𝑁 − 1 factor takes into account the fact that "one degree of freedom has been
used in fitting the mean" and removes the bias of the MLE: the unbiased estimate is
$\tilde{\sigma}^2 = \frac{N}{N-1}\sigma^2_{ML} = \frac{1}{N-1}\sum_{n=1}^{N}(x_n - \mu_{ML})^2$.
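This bias can be checked directly by simulation; the following is a minimal sketch (added here for illustration), assuming data drawn from 𝒩(𝜇, 𝜎²) with made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, N, trials = 0.0, 4.0, 5, 200_000

X = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
mu_ml = X.mean(axis=1)
var_ml = ((X - mu_ml[:, None]) ** 2).mean(axis=1)   # biased MLE of the variance

print(mu_ml.mean())                    # ~ mu: the mean is estimated correctly on average
print(var_ml.mean())                   # ~ (N-1)/N * sigma2 = 3.2, not 4.0
print(N / (N - 1) * var_ml.mean())     # bias-corrected estimate, ~ sigma2 = 4.0
```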
MLE: Underestimating the Variance
In the schematic from Bishop's PRML, we consider 3 cases, each with 2 data points drawn
from the true Gaussian. [Figure: the true Gaussian and the MLE estimate fitted to the
2 data points in each case.]
$$\ln p(X \mid \mu, \Sigma) = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu)$$
(𝐷 being the dimensionality of 𝑥ₙ).
Setting the derivatives wrt 𝝁 and 𝚺 equal to zero gives the following:
$$\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N}x_n, \qquad \Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_{ML})(x_n-\mu_{ML})^T$$
Here we used: $|A^{-1}| = |A|^{-1}$ and $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ (see also the matrix identities in the Appendix below).
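A short NumPy sketch of these multivariate MLE formulas (illustrative code added here, with made-up parameters); the result coincides with NumPy's biased sample covariance:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 5000, 3
true_mu = np.array([1.0, -2.0, 0.5])
M = rng.normal(size=(D, D))
true_Sigma = M @ M.T + np.eye(D)        # a valid (positive definite) covariance

X = rng.multivariate_normal(true_mu, true_Sigma, size=N)

mu_ml = X.mean(axis=0)                              # (1/N) sum_n x_n
centered = X - mu_ml
Sigma_ml = centered.T @ centered / N                # (1/N) sum_n (x_n - mu)(x_n - mu)^T

print(mu_ml)                                           # close to true_mu
print(np.allclose(Sigma_ml, np.cov(X.T, bias=True)))   # matches NumPy's biased covariance
```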
Appendix: Some Useful Matrix Operations
Show that
$$\frac{\partial}{\partial A}\,\mathrm{Tr}(AB) = B^T \qquad \text{and} \qquad \frac{\partial}{\partial A}\,\mathrm{Tr}(A^T B) = B$$
Indeed,
$$\frac{\partial}{\partial A_{mn}}\,\mathrm{Tr}(AB) = \frac{\partial}{\partial A_{mn}}\sum_{i,k}A_{ik}B_{ki} = B_{nm} \quad\Rightarrow\quad \frac{\partial}{\partial A}\,\mathrm{Tr}(AB) = B^T$$
Show that
$$\frac{\partial}{\partial A}\ln|A| = \left(A^{-1}\right)^T$$
Some statisticians question this approach, but most accept probabilistic modeling of
the observations.
Example: Assume you want to measure the speed of light given some
observations. Why should you put a prior on a physical constant?
Due to the limited accuracy of the measurement, this constant will never
be known exactly.
If a tested patient has the disease, 100% of the time the test will be
positive.
If a tested patient does not have the disease, 95% of the time the test will
be negative (5% false positive).
With $A$ = "the patient has the disease" (prevalence $P(A) = 0.0001$) and $B$ = "the test is positive", Bayes' rule gives
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \bar{A})\,P(\bar{A})} = \frac{1 \times 0.0001}{1 \times 0.0001 + 0.05 \times 0.9999} \approx 0.002$$
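The same computation in a few lines of Python (a sketch with the numbers from the example above; variable names are chosen here for clarity):

```python
# Prevalence and test characteristics from the example above
p_disease = 0.0001
p_pos_given_disease = 1.0     # sensitivity: test is always positive if diseased
p_pos_given_healthy = 0.05    # 5% false positive rate

# Bayes' rule: P(disease | positive test)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)    # ~0.002: a positive test still leaves the disease unlikely
```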
In Bayesian settings all variables are random and all inferences are probabilistic.
Posterior 𝜋(ℎ | 𝑥): How likely is ℎ after data 𝑥 have been observed.
$$\pi(h \mid x) = \frac{f(x \mid h)\,\pi(h)}{m(x)}$$
Prior 𝝅(𝜽)
We use the prior to introduce quantitatively some insights on the parameters
of interest.
If you know the input 𝑋 = 𝑥 to your problem, the likelihood can represent the
computed output 𝑦 = 𝑓(𝑥).
Posterior 𝝅(𝜽|𝒙): it weights the data and the prior information in making probabilistic inferences,
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{m(x)}, \qquad m(x) = \int f(x \mid \theta)\,\pi(\theta)\,d\theta$$
Posterior Mean
Posterior Quantiles
$$\Pr(\theta \ge a \mid x) = \int_{a}^{\infty} \pi(\theta \mid x)\,d\theta$$
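Such posterior summaries can be computed numerically; below is a grid-based sketch (added for illustration) for a hypothetical model with 𝒩(𝜃, 1) data and a 𝒩(0, 4) prior on 𝜃:

```python
import numpy as np

# Hypothetical setup: x_i ~ N(theta, 1) with prior theta ~ N(0, 4)
x = np.array([1.2, 0.7, 1.9, 1.1])
theta = np.linspace(-5, 5, 10_001)
dtheta = theta[1] - theta[0]

log_prior = -0.5 * theta**2 / 4.0
log_lik = -0.5 * ((x[:, None] - theta[None, :]) ** 2).sum(axis=0)
post = np.exp(log_prior + log_lik)
post /= post.sum() * dtheta                  # normalize: divide by m(x)

post_mean = (theta * post).sum() * dtheta    # posterior mean E[theta | x]
cdf = np.cumsum(post) * dtheta
print(post_mean)
print(1.0 - np.interp(1.0, theta, cdf))      # posterior quantile Pr(theta >= 1 | x)
```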
We have:
$$g(\hat{\boldsymbol{x}} \mid \boldsymbol{x}) = \int g(\hat{\boldsymbol{x}}, \theta \mid \boldsymbol{x})\,d\theta
= \int \frac{\pi(\hat{\boldsymbol{x}}, \theta, \boldsymbol{x})}{m(\boldsymbol{x})}\,d\theta
= \int \frac{\pi(\hat{\boldsymbol{x}}, \theta, \boldsymbol{x})}{\phi(\theta, \boldsymbol{x})}\,\frac{\phi(\theta, \boldsymbol{x})}{m(\boldsymbol{x})}\,d\theta
= \int f(\hat{\boldsymbol{x}} \mid \theta, \boldsymbol{x})\,\pi(\theta \mid \boldsymbol{x})\,d\theta
= \int f(\hat{\boldsymbol{x}} \mid \theta)\,\pi(\theta \mid \boldsymbol{x})\,d\theta$$
where $\phi(\theta, \boldsymbol{x})$ denotes the joint density of $(\theta, \boldsymbol{x})$ and the last step uses the conditional independence of $\hat{\boldsymbol{x}}$ and $\boldsymbol{x}$ given $\theta$.
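The last expression suggests a simple Monte Carlo approximation of the predictive density: draw 𝜃 from the posterior and average 𝑓(𝑥̂|𝜃). A sketch (illustrative, assuming the conjugate Gaussian-mean model derived in the following slides):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical model: x_i ~ N(theta, sigma2) with prior theta ~ N(mu0, s0sq)
sigma2, mu0, s0sq = 1.0, 0.0, 4.0
x = np.array([0.8, 1.4, 1.1])
N = len(x)

# Conjugate posterior on theta (formulas derived in the slides that follow)
sNsq = 1.0 / (1.0 / s0sq + N / sigma2)
muN = sNsq * (x.sum() / sigma2 + mu0 / s0sq)

# Monte Carlo estimate of g(xhat | x) = int f(xhat | theta) pi(theta | x) dtheta
theta_samples = rng.normal(muN, np.sqrt(sNsq), size=100_000)
xhat = 1.0
mc_estimate = stats.norm.pdf(xhat, loc=theta_samples, scale=np.sqrt(sigma2)).mean()
exact = stats.norm.pdf(xhat, loc=muN, scale=np.sqrt(sNsq + sigma2))  # analytic predictive
print(mc_estimate, exact)   # the two should agree closely
```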
For a single observation $x_1$ (with known variance $\sigma^2$ and prior $\theta \sim \mathcal{N}(\mu_0, \sigma_0^2)$):
$$\pi(\theta \mid x_1) \propto \exp\!\left[-\frac{\theta^2}{2}\left(\frac{1}{\sigma^2}+\frac{1}{\sigma_0^2}\right) + \theta\left(\frac{x_1}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right)\right] \propto \exp\!\left[-\frac{1}{2\sigma_1^2}\left(\theta-\mu_1\right)^2\right],$$
$$\mu_1 = \sigma_1^2\left(\frac{x_1}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right), \qquad \frac{1}{\sigma_1^2} = \frac{1}{\sigma^2}+\frac{1}{\sigma_0^2},$$
and the predictive distribution for a new observation is
$$\hat{X} \mid x_1 \sim \mathcal{N}\!\left(\mu_1,\ \sigma_1^2 + \sigma^2\right).$$
For $N$ observations $X = (x_1, \ldots, x_N)$:
$$\pi(\theta \mid X) \propto \exp\!\left[-\frac{\theta^2}{2}\left(\frac{N}{\sigma^2}+\frac{1}{\sigma_0^2}\right) + \theta\left(\frac{\sum_{n=1}^{N}x_n}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right)\right] \propto \exp\!\left[-\frac{1}{2\sigma_N^2}\left(\theta-\mu_N\right)^2\right]$$
Bayesian Inference for the Gaussian
So the posterior is a Gaussian as before with
$$\theta \mid X \sim \mathcal{N}(\mu_N, \sigma_N^2), \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2},$$
$$\mu_N = \sigma_N^2\left(\frac{\sum_{n=1}^{N}x_n}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right) = \frac{N\sigma_0^2\,\mu_{ML} + \sigma^2\mu_0}{N\sigma_0^2 + \sigma^2} = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{ML}$$
The posterior precision is the sum of the precision of the prior plus one
contribution of the data precision for each observed data point.
For 𝑁 → ∞ the posterior peaks around the 𝜇𝑀𝐿 and the posterior variance goes
to zero, i.e. the point MLE estimate is recovered within the Bayesian paradigm
for infinite data. In addition, 𝔼[𝜃|𝑥1, . . . , 𝑥𝑁] = 𝜇𝑁 ≃ 𝜇𝑀𝐿.
How about when $\sigma_0^2 \to \infty$? In this case note that $\sigma_N^2 \to \dfrac{\sigma^2}{N}$ and $\mu_N \to \mu_{ML}$.
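These update formulas are easy to wrap in a small helper; the following is an illustrative Python sketch (not PMTK code) of the conjugate update for the mean:

```python
import numpy as np

def gauss_mean_posterior(x, sigma2, mu0, s0sq):
    """Posterior N(mu_N, sigma_N^2) for the mean of N(theta, sigma2) data,
    given the conjugate prior theta ~ N(mu0, s0sq)."""
    x = np.asarray(x, dtype=float)
    N = x.size
    sNsq = 1.0 / (1.0 / s0sq + N / sigma2)          # posterior variance
    muN = sNsq * (x.sum() / sigma2 + mu0 / s0sq)    # posterior mean
    return muN, sNsq

x = np.random.default_rng(2).normal(0.8, 1.0, size=10)
# For a broad prior (s0sq large) this approaches (x.mean(), sigma2/N)
print(gauss_mean_posterior(x, sigma2=1.0, mu0=0.0, s0sq=100.0))
```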
[Figure: posterior 𝜋(𝜃 | 𝑋) for increasing numbers of observations, with curves labeled N = 2 and N = 10; the posterior sharpens as data accumulate.]
We can derive sequential estimates of the posterior variance and mean. They
are as follows:
$$\frac{1}{\sigma_N^2} = \frac{1}{\sigma_{N-1}^2} + \frac{1}{\sigma^2}, \qquad \mu_N = \frac{\sigma_N^2}{\sigma_{N-1}^2}\,\mu_{N-1} + \frac{\sigma_N^2}{\sigma^2}\,x_N$$
Show this by recognizing the sequential nature of Bayesian inference (the
posterior at the previous step becomes the new prior) and recalling the single-observation posterior update derived earlier.
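As a quick check, here is a short Python sketch (added for illustration) verifying that the sequential updates reproduce the batch posterior:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2, mu0, s0sq = 1.0, 0.0, 4.0
x = rng.normal(1.5, np.sqrt(sigma2), size=20)

# Batch posterior
sNsq_batch = 1.0 / (1.0 / s0sq + len(x) / sigma2)
muN_batch = sNsq_batch * (x.sum() / sigma2 + mu0 / s0sq)

# Sequential updates: the posterior after each point becomes the new prior
mu, s2 = mu0, s0sq
for xn in x:
    s2_new = 1.0 / (1.0 / s2 + 1.0 / sigma2)
    mu = (s2_new / s2) * mu + (s2_new / sigma2) * xn
    s2 = s2_new

print(np.allclose([mu, s2], [muN_batch, sNsq_batch]))   # True
```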
[Figure: prior, likelihood, and posterior for the Gaussian mean (two panels), produced with gaussInferParamsMean1d from PMTK.]
If we apply the data sequentially, we can write the posterior mean after the collection of one
data point ($N = 1$), i.e. $\mu_{ML} = y$, as follows:
$$\mu_1 = y - \frac{\sigma^2}{\sigma^2 + \sigma_0^2}\,(y - \mu_0) \qquad (\text{shrinkage: the data } y \text{ is adjusted towards the prior mean } \mu_0)$$
for $y = x + \varepsilon$ (observed signal), $x \sim \mathcal{N}(\mu_0, \sigma_0^2)$ (true signal), $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ (noise).
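A quick numeric check of the shrinkage form against the general posterior-mean formula (made-up numbers, added for illustration):

```python
# Shrinkage form of the one-observation posterior mean (made-up numbers)
y, mu0, sigma2, s0sq = 3.0, 0.0, 1.0, 4.0

mu1_shrink = y - sigma2 / (sigma2 + s0sq) * (y - mu0)       # data y shrunk towards mu0
mu1_direct = (s0sq * y + sigma2 * mu0) / (s0sq + sigma2)    # mu_N formula with N = 1
print(mu1_shrink, mu1_direct)   # both equal 2.4
```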
$$p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1}), \qquad p(y \mid x) = \mathcal{N}(y \mid Ax + b, L^{-1})$$
Using Bayes' rule, we want to find 𝑝(𝒚) and 𝑝(𝒙|𝒚).
We start with the joint distribution over 𝒛 = (𝒙, 𝒚) which is quadratic in the
components of 𝒛 – so 𝑝(𝒛) is a Gaussian.
$$\ln p(z) = \ln p(x) + \ln p(y \mid x) = -\frac{1}{2}(x-\mu)^T\Lambda(x-\mu) - \frac{1}{2}(y - Ax - b)^T L\,(y - Ax - b) + \text{const}$$
Recall also the marginalization result for a partitioned Gaussian: $p(x_a) = \mathcal{N}(x_a \mid \mu_a, \Sigma_{aa})$.
Based on our calculations:
$$\mathbb{E}[z] = \begin{pmatrix} \mu \\ A\mu + b \end{pmatrix}, \qquad \mathrm{cov}[z] = \begin{pmatrix} \Lambda^{-1} & \Lambda^{-1}A^T \\ A\Lambda^{-1} & L^{-1} + A\Lambda^{-1}A^T \end{pmatrix}$$
we conclude:
$$p(y) = \mathcal{N}\!\left(y \mid A\mu + b,\ L^{-1} + A\Lambda^{-1}A^T\right), \qquad
p(x \mid y) = \mathcal{N}\!\left(x \mid \left(\Lambda + A^T L A\right)^{-1}\!\left\{A^T L (y - b) + \Lambda\mu\right\},\ \left(\Lambda + A^T L A\right)^{-1}\right)$$
Proof:
$$\begin{aligned}
\mathbb{E}[x \mid y] &= \mu + \left(\Lambda + A^T L A\right)^{-1} A^T L\,(y - A\mu - b) \\
&= \left(\Lambda + A^T L A\right)^{-1}\left[\left(\Lambda + A^T L A\right)\mu - A^T L A\,\mu + A^T L (y - b)\right]
= \left(\Lambda + A^T L A\right)^{-1}\left[\Lambda\mu + A^T L (y - b)\right], \\
\mathrm{cov}[x \mid y] &= \left(\Lambda + A^T L A\right)^{-1}.
\end{aligned}$$
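A NumPy sketch of these marginal and conditional formulas with small, made-up matrices (added here for illustration; names and values are arbitrary):

```python
import numpy as np

# p(x) = N(mu, Lambda^{-1}),  p(y|x) = N(Ax + b, L^{-1})
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 0.3], [0.3, 1.0]])          # prior precision Lambda
A = np.array([[1.0, 0.5], [0.0, 1.0], [2.0, -1.0]])
b = np.array([0.1, 0.2, -0.3])
L = np.diag([4.0, 4.0, 4.0])                      # noise precision L

# Marginal: p(y) = N(A mu + b, L^{-1} + A Lambda^{-1} A^T)
y_mean = A @ mu + b
y_cov = np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T

# Conditional: p(x|y) = N(S {A^T L (y - b) + Lambda mu}, S), S = (Lambda + A^T L A)^{-1}
y_obs = np.array([0.5, 0.0, 1.0])
S = np.linalg.inv(Lam + A.T @ L @ A)
x_post_mean = S @ (A.T @ L @ (y_obs - b) + Lam @ mu)

print(y_mean, np.diag(y_cov))
print(x_post_mean, S)
```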