Bayesian Statistics
Introduction
Shaobo Jin
Department of Mathematics
Introduction: Frequentist Paradigm
Parametric Statistical Model
Suppose that the vector of observations x = (x1 , ..., xn ) is generated
from a probability distribution with density f (x | θ), where θ is the
vector of parameters.
For example, if we further assume the observations are iid, then

    f(x | θ) = ∏_{i=1}^{n} f(x_i | θ).
A parametric statistical model consists of the observation x of a
random variable X, distributed according to the density f(x | θ), where
the parameter θ belongs to a parameter space Θ of finite dimension.
Likelihood Function
Definition
For an observation x of a random variable X with density f(x | θ), the
likelihood function L(· | x) : Θ → [0, ∞) is defined by L(θ | x) = f(x | θ).
Example
If X = (X_1, ..., X_n)^T is a sample of independent random variables, then

    L(θ | x) = ∏_{i=1}^{n} f_i(x_i | θ),

as a function of θ conditional on x.
Likelihood Function: Example
1. If X_1, ..., X_n is a sample of i.i.d. random variables distributed according to N(µ, σ²), with θ = (µ, σ²), then

    L(θ | x) = ∏_{i=1}^{n} (1/√(2πσ²)) exp{ −(x_i − µ)² / (2σ²) }.

2. If X_1, ..., X_n is a sample of i.i.d. random variables distributed according to Binomial(k, θ), then

    L(θ | x) = ∏_{i=1}^{n} \binom{k}{x_i} θ^{x_i} (1 − θ)^{k − x_i}.
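As a minimal numerical sketch (not part of the slides), the Gaussian log-likelihood in case 1 can be evaluated for hypothetical data and parameter values:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(loc=1.0, scale=2.0, size=100)   # hypothetical observations

    def normal_log_likelihood(mu, sigma, x):
        # log L(θ | x) = Σ_i log f(x_i | θ) under the iid assumption
        return np.sum(stats.norm.logpdf(x, loc=mu, scale=sigma))

    print(normal_log_likelihood(1.0, 2.0, x))   # θ close to the data-generating values
    print(normal_log_likelihood(0.0, 1.0, x))   # a worse-fitting θ gives a smaller value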
Likelihood Function: Another Example
Consider the case where
  for i ≠ j, (X_{i1}, ..., X_{ip}) and (X_{j1}, ..., X_{jp}) are independent and identically distributed;
  for each i, X_{i1}, ..., X_{ip} are not necessarily independent (for example, p measurements on the same subject i).
Then, the likelihood is

    L(θ | x) = ∏_{i=1}^{n} f(x_{i1}, ..., x_{ip} | θ),

where f(x_{i1}, ..., x_{ip} | θ) is the joint density of (X_{i1}, ..., X_{ip}).
Inference Principle
In the frequentist context,
1. likelihood principle: the information brought by observation x is entirely contained in the likelihood function L(θ | x).
2. sufficiency principle: two observations x and y factorizing through the same value of a sufficient statistic T as T(x) = T(y) must lead to the same inference on θ.
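A standard illustration of the sufficiency principle (not spelled out on the slide): for an i.i.d. Bernoulli(θ) sample, T(x) = ∑_{i=1}^{n} x_i is sufficient, since

    L(θ | x) = θ^{T(x)} (1 − θ)^{n − T(x)},

so any two samples with the same number of successes give the same likelihood and hence the same inference on θ.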
Introduction: Bayesian Paradigm
Bayes Formula
If A and E are two events, then

    P(A | E) = P(E | A) P(A) / P(E)
             = P(E | A) P(A) / [ P(E | A) P(A) + P(E | A^c) P(A^c) ].

If X and Y are two random variables, then

    f(y | x) = f(x | y) f(y) / f(x) = f(x | y) f(y) / ∫ f(x | y) f(y) dy.
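As a quick numerical illustration with hypothetical numbers (not from the slides), take P(A) = 0.01, P(E | A) = 0.95 and P(E | A^c) = 0.05. Then

    P(A | E) = (0.95 × 0.01) / (0.95 × 0.01 + 0.05 × 0.99) = 0.0095 / 0.059 ≈ 0.16.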
Prior and Posterior
A Bayes model consists of a distribution π (θ) on the parameters, and a
conditional probability distribution f (x | θ) on the observations.
The distribution π (θ) is called the prior distribution.
The unknown parameter θ is a random parameter.
By Bayes formula,

    π(θ | x) = f(x | θ) π(θ) / m(x) = f(x | θ) π(θ) / ∫ f(x | θ) π(θ) dθ,

where the conditional distribution π(θ | x) is the posterior distribution
and m(x) is the marginal distribution of x.
Update Our Knowledge on θ
The prior often summarizes the prior information about θ.
From similar experiences, the average number of accidents at a crossing is 1 per 30 days. We assume

    π(θ) = 30 exp(−30θ),  with θ measured in [day]^{−1}.

Our experiment resulted in an observation x.
Three accidents have been recorded after monitoring the roundabout for one year. The likelihood is

    f(X = 3 | θ) = ((365θ)^3 / 3!) exp(−365θ).

We use the information in x to update our knowledge on θ.
By Bayes' formula,

    π(θ | x) = f(X = 3 | θ) π(θ) / m(x).
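Completing the update (the algebra is not shown on the slide):

    π(θ | x = 3) ∝ f(X = 3 | θ) π(θ) ∝ θ^3 exp(−365θ) exp(−30θ) = θ^3 exp(−395θ),

so θ | x ∼ Gamma(4, 395) (shape 4, rate 395), with posterior mean 4/395 ≈ 0.010 accidents per day.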
Distributions
In a Bayesian model, we will have many distributions:
  prior distribution: π(θ).
  conditional distribution of X | θ (likelihood): f(x | θ).
  joint distribution of (θ, X): f(x, θ) = f(x | θ) π(θ).
  posterior distribution: π(θ | x).
  marginal distribution of X: m(x) = ∫ f(x | θ) π(θ) dθ.
Most of the time we use π(·) and m(·) as generic symbols, but in several cases they are tied to specific functions.
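As a small numerical sketch (not from the slides), m(x) can be computed by quadrature for a hypothetical Binomial likelihood with a Beta prior:

    from scipy import stats, integrate

    # Hypothetical model, only for illustration: X | θ ~ Binomial(n = 10, θ), prior θ ~ Beta(2, 3).
    n, a0, b0, x = 10, 2.0, 3.0, 4

    def integrand(theta):
        return stats.binom.pmf(x, n, theta) * stats.beta.pdf(theta, a0, b0)   # f(x | θ) π(θ)

    m_x, _ = integrate.quad(integrand, 0.0, 1.0)   # m(x) = ∫ f(x | θ) π(θ) dθ
    print(m_x)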
Use Bayes Formula To Obtain Posterior
Example
Find the posterior distribution.
1. Suppose that we have an iid sample X_i | θ ∼ Bernoulli(θ), i = 1, ..., n. The prior is θ ∼ Beta(a_0, b_0).
2. Suppose that we have an iid sample X_i | µ ∼ N(µ, σ²), i = 1, ..., n, where σ² is known. The prior is µ ∼ N(µ_0, σ_0²).
3. Suppose that we have an iid sample X_i | µ, σ² ∼ N(µ, σ²), i = 1, ..., n. The priors are µ | σ² ∼ N(µ_0, σ²/λ_0) and σ² ∼ InvGamma(a_0, b_0), where

    π(σ²) = (b_0^{a_0} / Γ(a_0)) (σ²)^{−(a_0 + 1)} exp(−b_0 / σ²).
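For the first case, the calculation runs as follows (the remaining two cases proceed in the same way):

    π(θ | x) ∝ L(θ | x) π(θ) ∝ θ^{∑ x_i} (1 − θ)^{n − ∑ x_i} θ^{a_0 − 1} (1 − θ)^{b_0 − 1}
             = θ^{a_0 + ∑ x_i − 1} (1 − θ)^{b_0 + n − ∑ x_i − 1},

so θ | x ∼ Beta(a_0 + ∑_{i=1}^{n} x_i, b_0 + n − ∑_{i=1}^{n} x_i).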
Bayesian Inference Principle
Information on the underlying parameter θ is entirely contained in the
posterior distribution π(θ | x). That is, all statistical inference is based
on the posterior distribution π(θ | x).
Some examples are
1. posterior mean: E[θ | x].
2. posterior mode (MAP): the θ that maximizes π(θ | x).
3. predictive distribution of a new observation:

    f(y | x) = ∫ f(y | x, θ) π(θ | x) dθ.
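A minimal Monte Carlo sketch (not from the slides) of these quantities, assuming the Beta posterior from the Bernoulli example above with hypothetical parameter values:

    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed posterior for illustration: θ | x ~ Beta(a_n, b_n) with hypothetical a_n, b_n.
    a_n, b_n = 5.0, 8.0
    theta = rng.beta(a_n, b_n, size=100_000)   # posterior draws

    print(theta.mean())                         # Monte Carlo estimate of the posterior mean E[θ | x]
    y_new = rng.binomial(1, theta)              # predictive draws: Y_new | θ ~ Bernoulli(θ)
    print(y_new.mean())                         # estimate of the predictive probability P(Y_new = 1 | x)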
Introduction: Multivariate Normal Distribution
From Univariate to Multivariate Normal
Let Z ∼ N(0, 1). Then X = σZ + µ ∼ N(µ, σ²), where E[X] = µ and Var(X) = σ².
Let Z = (Z_1, Z_2, ..., Z_p)^T be a random vector, where each Z_j ∼ N(0, 1) and Z_j is independent of Z_k for any j ≠ k. Then

    X = Σ^{1/2} Z + µ ∈ R^p

follows a p-dimensional multivariate normal distribution, denoted by X ∼ N_p(µ, Σ), where E[X] = µ and Var(X) = Σ.
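A minimal simulation sketch of this construction (hypothetical µ and Σ, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical µ and Σ, just to illustrate X = Σ^{1/2} Z + µ.
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])

    # Symmetric square root of Σ via its eigendecomposition.
    eigval, eigvec = np.linalg.eigh(Sigma)
    Sigma_half = eigvec @ np.diag(np.sqrt(eigval)) @ eigvec.T

    Z = rng.standard_normal(size=(100_000, 2))   # rows of iid N(0, 1) entries
    X = Z @ Sigma_half + mu                      # each row is a draw of X ∼ N_p(µ, Σ)

    print(X.mean(axis=0))   # approximately µ
    print(np.cov(X.T))      # approximately Σ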
From Univariate to Multivariate Normal: Density
The density function of the random variable X ∼ N(µ, σ²) with σ > 0 can be expressed as

    (1/√(2πσ²)) exp{ −(x − µ)² / (2σ²) } = (1/√(2πσ²)) exp{ −(1/2) (x − µ) (1/σ²) (x − µ) }.

A p-dimensional random variable X ∼ N_p(µ, Σ) with Σ > 0 has the density

    f(x) = (1 / ((2π)^{p/2} √(det(Σ)))) exp{ −(1/2) (x − µ)^T Σ^{−1} (x − µ) }.
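A small numerical check of this formula (hypothetical values, not from the slides), comparing the hand-coded density with scipy:

    import numpy as np
    from scipy import stats

    # Hypothetical values, only to check the density formula numerically.
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    x = np.array([0.5, -1.5])

    p = len(mu)
    diff = x - mu
    quad_form = diff @ np.linalg.solve(Sigma, diff)   # (x − µ)^T Σ^{−1} (x − µ)
    density = np.exp(-0.5 * quad_form) / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma)))

    print(density)
    print(stats.multivariate_normal.pdf(x, mean=mu, cov=Sigma))   # the two values should agree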
Some Useful Properties
1. Linear combinations of normals remain normal: Suppose that X ∼ N_p(µ, Σ). Then AX + d ∼ N_q(Aµ + d, AΣA^T) for every q × p constant matrix A and every q × 1 constant vector d.
2. Marginal normality plus independence imply joint normality: If X_1 and X_2 are independent and distributed N_p(µ_1, Σ_{11}) and N_q(µ_2, Σ_{22}), respectively, then

    \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} ∼ N_{p+q}\left( \begin{pmatrix} µ_1 \\ µ_2 \end{pmatrix}, \begin{pmatrix} Σ_{11} & 0 \\ 0 & Σ_{22} \end{pmatrix} \right).

3. Conditional distribution: Let

    \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} ∼ N_{p+q}\left( \begin{pmatrix} µ_1 \\ µ_2 \end{pmatrix}, \begin{pmatrix} Σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{pmatrix} \right).

Then the conditional distribution of X_1 given X_2 = x_2 is

    X_1 | X_2 = x_2 ∼ N(µ_1 + Σ_{12} Σ_{22}^{−1} (x_2 − µ_2), Σ_{11} − Σ_{12} Σ_{22}^{−1} Σ_{21}).
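A tiny numerical sketch of property 3 with hypothetical scalar blocks (p = q = 1), not from the slides:

    # Hypothetical partitioned parameters with p = q = 1, to illustrate property 3.
    mu1, mu2 = 1.0, -2.0
    S11, S12, S21, S22 = 2.0, 0.5, 0.5, 1.0
    x2 = 0.0

    cond_mean = mu1 + S12 / S22 * (x2 - mu2)   # µ_1 + Σ_12 Σ_22^{−1} (x_2 − µ_2)
    cond_var = S11 - S12 / S22 * S21           # Σ_11 − Σ_12 Σ_22^{−1} Σ_21
    print(cond_mean, cond_var)                  # 2.0 and 1.75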
Multivariate Normal In Bayesian Statistics
Example
Suppose that X | θ ∼ N_p(Cθ, Σ), where C_{p×q} and Σ > 0 are known. The prior is θ ∼ N_q(µ_0, Λ_0^{−1}). Find the posterior of θ.

We can in fact use the property of the conditional distribution of a multivariate normal distribution to simplify the steps.

Result
If we know X_1 | X_2 ∼ N_p(CX_2, Σ) and X_2 ∼ N_q(m, Ω), then

    \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} ∼ N_{p+q}\left( \begin{pmatrix} Cm \\ m \end{pmatrix}, \begin{pmatrix} Σ + CΩC^T & CΩ \\ ΩC^T & Ω \end{pmatrix} \right).
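Applying the Result with X_1 = X, X_2 = θ, m = µ_0 and Ω = Λ_0^{−1}, and then the conditional-distribution property above, gives the posterior (the algebra is not carried out on the slide):

    θ | X = x ∼ N_q( µ_0 + Λ_0^{−1} C^T (Σ + C Λ_0^{−1} C^T)^{−1} (x − Cµ_0),  Λ_0^{−1} − Λ_0^{−1} C^T (Σ + C Λ_0^{−1} C^T)^{−1} C Λ_0^{−1} ).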