
Bayesian Estimation of Parameters

Bayesian parameter estimation is a technique for estimating the probability density function of random
variables with unknown parameters. It involves identifying the prior distribution, collecting data and
forming the likelihood, and finding the posterior distribution. A Bayes estimator is an estimator
that minimizes the posterior expected loss or maximizes the posterior expected utility.
Classical theory assumes parameters are fixed quantities that we want to estimate as precisely as
possible. The Bayesian perspective is different: parameters are random variables, and probabilities are
assigned to particular parameter values to reflect the degree of evidence for those values.

The steps involved in Bayesian inference

 Define the research question or problem: Clearly articulate the problem you aim to solve using
Bayesian inference.

 Specify the prior probabilities: Determine the initial beliefs or probabilities based on available
information.

 Collect and analyze data: Gather relevant data and analyze it using appropriate statistical methods.

 Update the prior probabilities: Apply Bayes’ theorem to combine the prior probabilities with the
observed data and obtain posterior probabilities.

 Interpret the results: Interpret the posterior probabilities in the context of the research question
and draw conclusions.
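As a concrete illustration of these five steps, here is a minimal Python sketch; the coin-bias question, the grid of candidate values, and the data counts are all hypothetical, chosen only to make the update explicit:

```python
import numpy as np

# Step 1: define the question -- what is the bias theta of a coin?
# Step 2: specify the prior -- a uniform prior over a grid of candidates.
theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / len(theta)

# Step 3: collect data -- suppose 7 heads in 10 flips (hypothetical).
heads, flips = 7, 10
likelihood = theta**heads * (1 - theta)**(flips - heads)

# Step 4: update via Bayes' theorem -- posterior is proportional to
# likelihood x prior, normalised over the grid.
posterior = likelihood * prior
posterior /= posterior.sum()

# Step 5: interpret -- e.g., report the posterior mean as a point estimate.
print("Posterior mean of theta:", np.sum(theta * posterior))
```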

Bayesian estimation of parameters relies on several key assumptions. These assumptions ensure that the
process of updating beliefs about a parameter using Bayes' theorem is valid and meaningful. Here are the
main assumptions:

1. Model Assumptions

a. Correct Model Specification

 It is assumed that the statistical model chosen correctly represents the underlying data generation
process.
 The likelihood function p(X∣θ) accurately describes the probability of observing the data given the
parameter θ.

b. Independence of Observations

 Observations are often assumed to be independent and identically distributed (i.i.d.).
 For models that assume independence, each data point provides independent information about
the parameter.

2. Prior Assumptions

a. Choice of Prior Distribution

 The prior distribution p(θ) reflects prior beliefs or knowledge about the parameter before observing
the data.
 The prior should be chosen based on subject matter knowledge, previous research, or convenience
(e.g., conjugate priors).

b. Informative vs. Non-informative Priors

 Informative priors incorporate substantial prior knowledge.
 Non-informative (or weakly informative) priors are used when little prior information is available,
often aiming to have minimal influence on the posterior.

3. Data Assumptions

Data Representativeness

 The observed data is representative of the population or process being studied.
 The data is collected in a way that avoids biases that could distort the estimation.

4. Computational Assumptions

Convergence of Algorithms

 When using computational methods such as Markov Chain Monte Carlo (MCMC) to approximate
the posterior distribution, it is assumed that the algorithms converge to the true posterior
distribution.
 Proper diagnostics and checks are necessary to ensure convergence.

5. Mathematical and Logical Assumptions

Bayes' Theorem Validity

 Bayes' theorem is mathematically sound and applies to the problem at hand.
 The prior, likelihood, and evidence are properly defined and integrable functions.

Mathematical derivation:

Bayes parameter estimation (BPE) is a widely used technique for estimating the probability density
function of random variables with unknown parameters. Suppose that we have an observable random
variable X for an experiment and its distribution depends on unknown parameter θ taking values in a
parameter space Θ. The probability density function of X for a given value of θ is denoted by p(x|θ ). It
should be noted that the random variable X and the parameter θ can be vector-valued. Now we obtain a
set of independent observations or samples S = {x1, x2, ..., xn} from an experiment. Our goal is to compute
p(x|S) which is as close as we can come to obtain the unknown p(x), the probability density function of X.
In Bayes parameter estimation, the parameter θ is viewed as a random variable or random vector following
the distribution p(θ). Then the probability density function of X given a set of observations S can be
estimated by

p(x|S) = ∫Θ p(x|θ) p(θ|S) dθ .............. (1)

So if we know the form of p(x|θ) with unknown parameter vector θ, then we need to estimate
the weight p(θ|S), often called the posterior, so as to obtain p(x|S) using Eq. (1). Based on Bayes'
theorem, the posterior can be written as

p(θ|S) = p(S|θ) p(θ) / ∫Θ p(S|θ) p(θ) dθ .............. (2)

where p(θ) is called the prior distribution, or simply the prior, and p(S|θ) is called the likelihood
function. A prior is intended to reflect our knowledge of the parameter before we gather data, and the
posterior is the updated distribution after obtaining the information from the data.

Example 1: Binomial Model


Let S = {x1,x2,...,xn} be a set of coin flipping observations, where xi = 1 denotes 'Head' and xi = 0 denotes
'Tail'. Assume the coin is weighted and our goal is to estimate parameter θ , the probability of 'Head'.
Assume that we flipped a coin 20 times yesterday, but we did not remember how many times 'Head'
was observed. What we know is that the probability of 'Head' is around 1/4, but this probability is
uncertain since we only did 20 trials and we did not remember the number of 'Heads'. With this prior
information, we decide to do this experiment today so as to estimate the parameter θ.

A prior represents our previous knowledge or belief about the parameter θ. Based on our memories from
yesterday, assume that the prior of θ follows the Beta distribution Beta(5, 15).

Today we flipped the same coin n times and y 'Heads' were observed. Then we compute
the posterior with today's data. Consider Eq. (2); the posterior is written as

p(θ|S) ∝ θ^y (1−θ)^(n−y) · θ^4 (1−θ)^14 = θ^(y+4) (1−θ)^(n−y+14),

i.e., the posterior is Beta(y+5, n−y+15).

Assume that we did 500 trials and 'Head' appeared 220 times; then the posterior is Beta(225, 295). It can be
noted that the posterior and the prior distribution have the same form. This kind of prior distribution is called
a conjugate prior. The Beta distribution is conjugate to the binomial distribution, which gives the likelihood
of i.i.d. Bernoulli trials.
As we can see, the conjugate prior successfully includes previous information, or our belief of parameter θ
into the posterior. So our knowledge about the parameter is updated with today’s data, and the posterior
obtained today can be used as prior for tomorrow’s estimation. This reveals an important property of Bayes
parameter estimation, that the Bayes estimator is based on cumulative information or knowledge of
unknown parameters, from past and present.
After we obtain the posterior, we can estimate the probability density function of the random variable X.
Consider Eq. (1); the density function can be expressed as

p(x = 1|S) = ∫ θ p(θ|S) dθ = (y + 5) / (n + 20).

It suggests that the prior Beta(5, 15) is actually equivalent to adding 20 Bernoulli observations to the data:
5 Heads and 15 Tails. This means the posterior summarizes all our knowledge about the parameter θ, and
the prior does affect the estimate of the density of the random variable X. However, as we do more and more
coin-flipping trials (i.e., as n gets larger), the density function p(x|S) will almost surely converge to the
underlying distribution (Figure 3), which means the prior becomes less important. Figure 4 illustrates that
as n gets larger, the posterior becomes sharper. In our experiment, when n = 10000, the posterior has a
sharp peak at θ = 0.4. In this case, the prior (centered around 1/4) has little effect on the posterior, and we
have strong evidence to say that the probability of 'Head' is around 0.4.
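The conjugate update in this example is easy to reproduce numerically. Below is a minimal Python sketch (using SciPy) that reproduces the Beta(225, 295) posterior from the counts above; the credible-interval summary at the end is an added illustration, not part of the original example:

```python
from scipy import stats

# Prior from yesterday's vague memory: Beta(5, 15), mean 0.25.
a0, b0 = 5, 15

# Today's data: 220 'Heads' in 500 flips.
y, n = 220, 500

# Conjugacy: Beta(a0, b0) prior + binomial likelihood
#   -> Beta(a0 + y, b0 + n - y) posterior.
a_post, b_post = a0 + y, b0 + (n - y)
print(f"Posterior: Beta({a_post}, {b_post})")          # Beta(225, 295)

# Posterior mean (y + 5) / (n + 20): the prior acts like 20 extra flips.
print("Posterior mean:", a_post / (a_post + b_post))   # ~0.433

# Full distributional information: a 95% credible interval for theta.
print("95% credible interval:", stats.beta(a_post, b_post).interval(0.95))
```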
Example 2: Gaussian Model
The Bernoulli distribution discussed above is a discrete example; here we will illustrate a continuous
example, a Gaussian random variable X [3]. Assume S = {x1, x2, ..., xn} is a set of independent
observations of a Gaussian random variable X ∼ N(μ, σ²) with known variance σ² and unknown mean μ.
Here we use a conjugate prior p(μ) ∼ N(μ0, σ0²). Then the posterior can be computed by

p(μ|S) ∝ p(S|μ) p(μ) = [ ∏ᵢ₌₁ⁿ p(xᵢ|μ) ] · p(μ)

So the posterior also follows a Gaussian distribution N(μn, σn²), where μn and σn² are defined by

μn = (n σ0² x̄ + σ² μ0) / (n σ0² + σ²),    σn² = (σ0² σ²) / (n σ0² + σ²),

where x̄ is the sample mean of the observations.
Consider Eq. (1); the estimated density of the random variable X is then N(μn, σ² + σn²).

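A quick numerical check of these Gaussian-update formulas, as a hedged sketch; the noise level, prior parameters, and simulated data below are hypothetical illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)

sigma = 2.0                  # known standard deviation of X
mu0, sigma0 = 0.0, 1.0       # conjugate prior: mu ~ N(mu0, sigma0^2)

# Hypothetical data generated with true mean 1.5
x = rng.normal(1.5, sigma, size=50)
n, xbar = len(x), x.mean()

# Posterior N(mu_n, var_n) from the formulas above
mu_n = (n * sigma0**2 * xbar + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)
var_n = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)
print(f"Posterior:  N({mu_n:.3f}, {var_n:.4f})")

# Estimated (predictive) density of X via Eq. (1): N(mu_n, sigma^2 + var_n)
print(f"Predictive: N({mu_n:.3f}, {sigma**2 + var_n:.4f})")
```

As n grows, var_n shrinks and mu_n is pulled towards the sample mean, mirroring the coin-flipping example where the prior's influence fades with more data.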
Merits and Demerits

Merits of Bayesian Estimation

1. Incorporation of Prior Knowledge:
- Allows integration of prior knowledge or expert opinion through the prior distribution.

2. Full Distributional Information:
- Provides a full posterior distribution, offering comprehensive uncertainty quantification.

3. Sequential Updating:
- Supports real-time data assimilation and updating of beliefs as new data arrives.

4. Flexible Modelling:
- Handles complex hierarchical and mixture models, accommodating various data structures.

5. Effective with Small Sample Sizes:
- Performs well with limited data by leveraging prior information and reducing overfitting.

Demerits of Bayesian Estimation

1. Sensitivity to Prior Choice:
- Results can be heavily influenced by the choice of prior, requiring careful selection.

2. Computational Complexity:
- Often requires intensive algorithms like MCMC, demanding significant computational resources.

3. Challenging Prior Interpretation:
- Difficult to specify appropriate priors without solid prior information, which can be subjective.

4. Complex Integration and Normalization:
- Computing the marginal likelihood for model comparison can be challenging, especially in high dimensions.

5. Difficulty in Communication:
- Explaining Bayesian concepts, such as priors and posterior distributions, can be complex for non-statisticians.

Stochastic Process
A family of random variables {X(t), t ≥ 0} indexed by the time parameter t is called a stochastic
process.
The parameter t belongs to T, where T is known as the index set; it can be either discrete or continuous.
T is also known as the parameter space.
The values assumed by the process are called the states, and the set of possible values is called the state
space, denoted by S.
A state space is discrete if it contains a finite or a denumerably infinite number of points; otherwise it is
continuous.

Markov Chain
A stochastic process {Xn; n ∈ T} is said to be a Markov chain if the index set and the state space
are both discrete and the sequence of random variables satisfies the following condition:

P[Xn+1 = j | X0 = i0, X1 = i1, ..., Xn = in] = P[Xn+1 = j | Xn = in] for all n ≥ 0

[i.e., the probability of jumping from one state to the next depends only on the current
state and not on the sequence of previous states that led to this current state.]

where i0, i1, ..., in, j belong to S; the sequence {Xn} is said to possess the Markov property.
If the state space is finite, the Markov chain is known as a finite Markov chain, and if the state space is
countably infinite, the Markov chain is known as a countable Markov chain.
The outcomes of the trials are called the states of the MC. If Xn = j, then the process is said to be
in state j at the nth trial.

Transition Probabilities
Pij(n) = P(Xn = j | X0 = i) is known as the n-step transition probability of a Markov chain from state i at time
0 to state j after n time units.
Pij(n, n+1) is known as the one-step transition probability of the Markov chain from state i at the time point tn
to the state j at the time point tn+1.
These one-step transition probabilities should satisfy Pij(n, n+1) ≥ 0 and ∑j∈S Pij(n, n+1) = 1 for all i; they can
be represented in matrix form, which is known as the transition probability matrix (TPM) of the MC and is
denoted by P.
Stationary Distribution
Suppose we have a process with a few states and a fixed transition probability matrix P for jumping between
states.
We start with some probability distribution Si over all states at time step i; to estimate the probability
distribution over all states at the next time step i+1, we multiply it by the transition matrix P:

Si+1 = Si P

If we keep doing this, after a while S stops changing when multiplied by the matrix P; this is when we say it
has reached a stationary distribution:

S = S P
Ex: Find the stationary distribution of the MC with TPM

            0     1     2
      0 [   0    2/3   1/3 ]
  P = 1 [  3/8   1/8   1/2 ]
      2 [  1/2   1/2    0  ]

Sol: We want to satisfy S = S P, i.e., (S0, S1, S2) = (S0, S1, S2) P. Writing this out componentwise gives

S0 = (3/8) S1 + (1/2) S2 .............. (1)
S1 = (2/3) S0 + (1/8) S1 + (1/2) S2 .............. (2)
S2 = (1/3) S0 + (1/2) S1 .............. (3)
After solving the above three equations together with the normalization S0 + S1 + S2 = 1, we obtain
S = (S0, S1, S2) = (0.3, 0.4, 0.3), which are the steady-state probabilities of the MC,
i.e., P0 = (0.3, 0.4, 0.3),
P1 = P0 P = (0.3, 0.4, 0.3),
..., Pn = P0 Pⁿ = (0.3, 0.4, 0.3).
This is the stationary distribution of the MC.
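The steady-state vector is easy to verify numerically; the following Python sketch simply iterates S ← S P from an arbitrary starting distribution until it stops changing:

```python
import numpy as np

P = np.array([[0,   2/3, 1/3],
              [3/8, 1/8, 1/2],
              [1/2, 1/2, 0  ]])

S = np.array([1.0, 0.0, 0.0])       # any starting distribution works
for _ in range(500):
    S_next = S @ P                  # one step: S_{i+1} = S_i P
    if np.allclose(S_next, S, atol=1e-12):
        break
    S = S_next

print(S)                            # -> [0.3 0.4 0.3]
```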

Markov Chain Monte Carlo (MCMC)


MCMC can be used to sample from any probability distribution. Mostly we use it to sample from an
intractable posterior distribution for the purpose of inference. Estimating the posterior using Bayes'
theorem can be difficult: in most cases we can find the functional form of likelihood × prior, but
computing the marginal probability P(B) can be computationally expensive, especially when it involves a
continuous distribution.
The trick here is to avoid calculating the normalization constant altogether.
The General Idea for the algorithm is to start with some random probability distribution and
gradually move towards desired probability distribution.
 Initiate a Markov chain with a random probability distribution over states.
 Gradually move along the chain, converging towards the stationary distribution.
 Apply a condition (detailed balance) that ensures this stationary distribution resembles the
desired probability distribution.

Thus, on reaching the stationary distribution, we have approximated the posterior probability distribution.

 The probability p(A) represents the probability of being at A, and T(A → B) represents the
probability of moving from A to B.
 The probability p(B) represents the probability of being at B, and T(B → A) represents the
probability of moving from B to A.
 Each side of the detailed balance condition p(A) T(A → B) = p(B) T(B → A) represents the
probability flow from A to B or from B to A, respectively.
If the condition is satisfied, it guarantees that the stationary state approximately represents the
posterior distribution.
 Although MCMC methods themselves are complicated, they provide a lot of flexibility: they give us
efficient sampling in high dimensions and can be used to solve problems with a large state space.

Limitation: MCMC does not perform well at approximating probability distributions that have multiple modes.

 The marginal probability P(B) in this case is a constant, known as a normalization constant, obtained
by summing (or integrating) the numerator over all its possible values.
 There are techniques for training or evaluating models that have an intractable normalization constant
(also known as a partition function); a few of them use MCMC for sampling in their algorithms.

Metropolis–Hastings Algorithm

The Metropolis-Hastings algorithm is a specific type of Markov Chain Monte Carlo (MCMC) method used to
generate samples from a target distribution when direct sampling is difficult. This algorithm constructs a
Markov chain with the desired stationary distribution by using a proposal distribution to suggest new states
and an acceptance probability to decide whether to accept or reject these new states.
Suppose we are sampling from a distribution p(x) = f(x) / Z, where Z is the intractable normalization constant.
Our objective is to sample from p(x) in a way that makes use of the numerator f(x) alone and avoids
having to estimate the denominator Z.

The Metropolis–Hastings algorithm generates a sequence of sample values such that, as more and
more sample values are produced, the distribution of values more closely approximates the desired
distribution. These sample values are produced iteratively, with the distribution of the next sample
depending only on the current sample value, which makes the sequence of samples a Markov chain.

Derivation

The purpose of the Metropolis–Hastings algorithm is to generate a collection of states according to a
desired distribution P(x). To accomplish this, the algorithm uses a Markov process, which asymptotically
reaches a unique stationary distribution π(x) such that π(x) = P(x).

A Markov process is uniquely defined by its transition probabilities 𝑃(𝑥′∣𝑥), the probability of transitioning
from any given state 𝑥 to any other given state 𝑥′ . It has a unique stationary distribution 𝜋(𝑥) when the
following two conditions are met:

1. Existence of stationary distribution: there must exist a stationary distribution π(x). A sufficient but
not necessary condition is detailed balance, which requires that each transition x → x′ is reversible:
for every pair of states x, x′, the probability of being in state x and transitioning to state x′ must be
equal to the probability of being in state x′ and transitioning to state x, i.e., π(x) P(x′|x) = π(x′) P(x|x′).
2. Uniqueness of stationary distribution: the stationary distribution 𝜋(𝑥) must be unique. This is
guaranteed by ergodicity of the Markov process, which requires that every state must
 Be aperiodic—the system does not return to the same state at fixed intervals; and
 Be positive recurrent—the expected number of steps for returning to the same state is
finite.

Step 1: The Metropolis–Hastings algorithm involves designing a Markov process (by constructing transition
probabilities) that fulfils the two conditions above, such that its stationary distribution π(x) is chosen to
be P(x). The derivation of the algorithm starts with the condition of detailed balance:

P(x′|x) P(x) = P(x|x′) P(x′)

Step 2: which is re-written as

P(x′|x) / P(x|x′) = P(x′) / P(x)

Step 3: The approach is to separate the transition into two sub-steps: the proposal and the
acceptance-rejection.
The proposal distribution g(x′|x) is the conditional probability of proposing a state x′ given x, and the
acceptance distribution A(x′, x) is the probability of accepting the proposed state x′.
The transition probability can be written as the product of the two:

P(x′|x) = g(x′|x) A(x′, x),

and similarly P(x|x′) = g(x|x′) A(x, x′).

Step 4: Inserting this relation into the previous equation, we have

A(x′, x) / A(x, x′) = [ P(x′) g(x|x′) ] / [ P(x) g(x′|x) ]
Step 5: The next step in the derivation is to choose an acceptance ratio that fulfils the condition above.
One common choice is the Metropolis choice:

A(x′, x) = min( 1, [ P(x′) g(x|x′) ] / [ P(x) g(x′|x) ] )

Step 6: We keep on sampling for a long time and discard the initial few samples as the chain has not
reached its stationary state yet (this period is known as burn-in period).

* For this Metropolis acceptance ratio A, either A(x′, x) = 1 or A(x, x′) = 1, and either way the detailed
balance condition is satisfied.

Thus the Metropolis–Hastings algorithm in Bayesian inference can be written as follows:

1. Initialise
 Pick an initial state θ0.
 Set i = 0.

2. Iterate
i) Generate a random candidate state θ∗ according to Q(θ∗|θi).
ii) Calculate the acceptance probability

Pacc(θi → θ∗) = min( 1, [ L(y|θ∗) P(θ∗) Q(θi|θ∗) ] / [ L(y|θi) P(θi) Q(θ∗|θi) ] )

iii) Accept or reject:
 Generate a uniform random number u ∈ [0, 1];
 If u ≤ Pacc(θi → θ∗), then accept the new state and set θi+1 = θ∗;
 If u > Pacc(θi → θ∗), then reject the new state and copy the old state forward: θi+1 = θi.
iv) Increment: set i = i + 1.

Provided that the specified conditions are met, the empirical distribution of saved states θ0, θ1, ..., θi will
approach P(θ).

MCMC can be used to draw samples from the posterior distribution of a statistical model. The acceptance
probability is given above, where L is the likelihood, P(θ) the prior probability density, and Q the
(conditional) proposal probability.
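As a concrete illustration, here is a minimal random-walk Metropolis sampler in Python. The unnormalised target f (a standard normal without its constant Z), the proposal step size, and the burn-in length are all illustrative choices; in Bayesian inference f(θ) would be likelihood × prior:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Unnormalised target: exp(-x^2 / 2), i.e. N(0, 1) without its Z.
    return np.exp(-0.5 * x**2)

def metropolis(n_samples, step=1.0, x0=0.0, burn_in=1000):
    x, samples = x0, []
    for i in range(n_samples + burn_in):
        x_prop = x + rng.normal(0.0, step)     # symmetric proposal g
        # With a symmetric g the ratio g(x|x')/g(x'|x) cancels, leaving
        # the Metropolis choice A(x', x) = min(1, f(x') / f(x)).
        if rng.uniform() <= min(1.0, f(x_prop) / f(x)):
            x = x_prop                         # accept the candidate
        if i >= burn_in:                       # discard the burn-in period
            samples.append(x)
    return np.array(samples)

draws = metropolis(20000)
print(draws.mean(), draws.std())   # ~0 and ~1 for the N(0, 1) target
```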


Gibbs Sampling Algorithm

Gibbs sampling is a special case of the Metropolis–Hastings algorithm. Gibbs sampling is applicable when
the joint distribution is not known explicitly or is difficult to sample from directly, but the conditional
distribution of each variable is known and is easy (or at least, easier) to sample from.

The point of Gibbs sampling is that given a multivariate distribution it is simpler to sample from a
conditional distribution than to marginalize by integrating over a joint distribution.

Bivariate Distribution case


Our goal is to sample from the 2D normal distribution with

μ = (0, 0),    Σ = [  1   1/2 ]
                   [ 1/2   1  ]
Now we need to sample from this 2D N(μ, Σ) with pdf f(x, y), which might be a difficult task. So instead of
sampling directly from the joint pdf, we consider the conditional distributions f(x|y) and f(y|x) and obtain
the samples from them. This makes the task easy.

From the given data (here ρ = 1/2),

f(x|y) = N(ρy, 1 − ρ²) = N(y/2, 3/4)

and similarly f(y|x) = N(ρx, 1 − ρ²) = N(x/2, 3/4).

Now we have two univariate distributions from which it is quite easy to sample.

Procedure:
1) Start at (x(0), y(0)).
2) Sample x(1) ~ f(x | y(0)).
3) Sample y(1) ~ f(y | x(1)).

Then iterate steps 2 and 3 until we obtain the desired number of samples.

This is for the Bivariate Distribution.
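A minimal Python sketch of this bivariate Gibbs sampler (with ρ = 1/2; the seed, burn-in, and sample size are arbitrary illustration values):

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.5                             # off-diagonal of Sigma
sd = np.sqrt(1 - rho**2)              # conditional sd, sqrt(3/4)

x, y = 0.0, 0.0                       # 1) start at (x(0), y(0))
samples, burn_in = [], 500
for i in range(10000 + burn_in):
    x = rng.normal(rho * y, sd)       # 2) x ~ f(x | y) = N(y/2, 3/4)
    y = rng.normal(rho * x, sd)       # 3) y ~ f(y | x) = N(x/2, 3/4)
    if i >= burn_in:
        samples.append((x, y))

samples = np.array(samples)
print(np.cov(samples.T))              # approaches [[1, 0.5], [0.5, 1]]
```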

Generalization
Let y denote observations generated from the sampling distribution f(y|θ) and π(θ) be a prior
supported on the parameter space Θ. Then one of the central goals of Bayesian statistics is to
approximate the posterior density

π(θ|y) = f(y|θ) · π(θ) / m(y),

where the marginal likelihood m(y) = ∫Θ f(y|θ) · π(θ) dθ is assumed to be finite for all y.
To explain the Gibbs sampler, we additionally assume that the parameter space Θ is decomposed as

Θ = ∏ᵢ Θᵢ = Θ1 × ⋯ × Θᵢ × ⋯ × ΘK,   (K > 1),

where × represents the Cartesian product. Each component parameter space Θᵢ can be a set of scalar
components, subvectors, or matrices.
Define a set Θ−ᵢ that complements Θᵢ. The essential ingredients of the Gibbs sampler are the i-th full
conditional posterior distributions, for each i = 1, ..., K:

π(θᵢ|θ−ᵢ, y) = π(θᵢ|θ1, ..., θᵢ−1, θᵢ+1, ..., θK, y).

The following algorithm details a generic Gibbs sampler

Initialize: Pick an arbitrary starting value θ(1) = (θ1(1), θ2(1), ..., θK(1)).

Iterate a Cycle:

Step 1. Draw 𝜃1(𝑠+1)∼𝜋(𝜃1|𝜃2(𝑠),𝜃3(𝑠),⋯,𝜃𝐾(𝑠),𝑦)

Step 2. Draw 𝜃2(𝑠+1)∼𝜋(𝜃2|𝜃1(𝑠+1),𝜃3(𝑠),⋯,𝜃𝐾(𝑠),𝑦)

Step i. Draw 𝜃𝑖(𝑠+1)∼𝜋(𝜃𝑖|𝜃1(𝑠+1),𝜃2(𝑠+1),⋯,𝜃𝑖−1(𝑠+1),𝜃𝑖+1(𝑠),⋯,𝜃𝐾(𝑠),𝑦)

Step i+1. Draw 𝜃𝑖+1(𝑠+1)∼𝜋(𝜃𝑖+1|𝜃1(𝑠+1),𝜃2(𝑠+1),⋯,𝜃𝑖(𝑠+1),𝜃𝑖+2(𝑠),⋯,𝜃𝐾(𝑠),𝑦)

Step K. Draw 𝜃𝐾(𝑠+1)∼𝜋(𝜃𝐾|𝜃1(𝑠+1),𝜃2(𝑠+1),⋯,𝜃𝐾−1(𝑠+1),𝑦)


End Iterate: Repeat the cycle for a number of iterations, updating one variable at a time, for all the
variables present in the system.
A Markov chain is created in the process which converges to the target distribution; we have to
discard the initial values that led us to the target sample, i.e., we have to discard the burn-in phase.
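Schematically, one cycle of the generic sampler just loops over the K full conditionals in turn. The Python skeleton below is a hedged sketch: the full-conditional sampling functions are model-specific placeholders you would supply, with the data captured in their closures:

```python
import numpy as np

def gibbs(theta0, full_conditionals, n_iter, burn_in):
    # Generic Gibbs sampler skeleton.
    #   theta0            : list of K starting values theta(1)
    #   full_conditionals : list of K functions; the i-th draws theta_i
    #                       given the current values of all other components
    theta = list(theta0)
    draws = []
    for s in range(n_iter):
        for i, draw_i in enumerate(full_conditionals):
            theta[i] = draw_i(theta)    # always uses the freshest values
        if s >= burn_in:                # discard the burn-in phase
            draws.append(list(theta))
    return np.array(draws)
```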
