Probability and Statistics 1 (확통1), Lecture Note 09: Bayesian Statistical Inference
Statistical Inference (1)
◼ hypothesis testing
◼ significance testing
2
Statistical Inference (2)
◼ Most important methodologies of inference
◼ maximum a posteriori (MAP) probability rule
◼ regression
3
Probability versus Statistics
◼ Probability Theory
◼ Self-contained mathematical tool based on axioms
◼ For a problem, this gives a unique correct answer.
◼ Statistical Inference
◼ Different methods may give different answers; no principal best method can be selected.
◼ Two main viewpoints: Classical/Frequentist and Bayesian
5
Bayesian
◼ When trying to infer the nature of an unknown model, the Bayesian approach views the model as chosen randomly from a given class of models.
◼ In the Bayesian world, we regard a parameter Θ that characterizes the model as random.
◼ Once the distribution of Θ is given, only a single probabilistic model is assumed, which brings the Bayesian approach back within the realm of probability theory.
◼ First, postulate a prior distribution 𝑝Θ 𝜃 .
◼ Next, given observed data 𝑥, obtain a conditional
distribution (or likelihood) 𝑝𝑋|Θ (𝑥|𝜃).
◼ Finally, one can use Bayes' rule to derive a posterior
distribution 𝑝Θ|𝑋 𝜃|𝑥 .
◼ This captures all information that 𝑥 can provide about 𝜃.
6
Classical/Frequentist
◼ View the unknown quantity 𝜃 as an unknown constant.
◼ The classical approach strives to develop an estimate of 𝜃.
◼ For each possible value of 𝜃, various competitive
models are postulated.
◼ So, we are dealing with multiple candidate
probabilistic models, and a best model can be
selected.
7
Model versus Variable Inference
◼ Model inference: the object of study is a real phenomenon or process for which we wish to construct or validate a probability model based on available data
◼ e.g., do planets follow elliptical trajectories?
◼ Variable inference: we wish to estimate the value of one or more unknown variables on the basis of available measurements
◼ e.g., what is my current position, given a few GPS readings?
8
Statistical Inference Problems (1)
◼ Estimation: a model is fully specified, except for an unknown, possibly multidimensional, parameter 𝜃, which we wish to estimate.
◼ This parameter can be viewed either as a random variable (Bayesian approach) or as an unknown constant (classical approach).
◼ The objective is to produce an estimate of 𝜃 that is close to its true value.
9
Statistical Inference Problems (2)
◼ Binary hypothesis testing:
◼ Start with two hypotheses 𝐻0 and 𝐻1, and use the available data to decide which of the two is true.
◼ 𝑚-ary hypothesis testing:
◼ There is a finite number 𝑚 of competing hypotheses, 𝐻0, 𝐻1, …, 𝐻𝑚−1.
◼ Evaluation: typically by the probability of a decision error.
10
Contents
◼ Bayesian inference and the posterior distribution
◼ Point estimation, Hypothesis testing and MAP
◼ Bayesian least mean squares estimation
◼ Bayesian linear least mean squares estimation
11
Bayesian Inference (1)
◼ In Bayesian inference, the unknown quantity of interest is modeled as a random variable or as a finite-dimensional random vector.
◼ We usually denote it by Θ.
◼ We aim to extract information about Θ based on observing a collection 𝑋 = (𝑋1, …, 𝑋𝑛) of related random variables, called the observation vector.
12
Bayesian Inference (2)
◼ We assume that we know the joint distribution of Θ
and 𝑋.
◼ Equivalently, we assume that we know
◼ A prior distribution 𝑝Θ or 𝑓Θ , depending on whether
Θ is discrete or continuous.
◼ A conditional distribution 𝑝𝑋|Θ or 𝑓𝑋|Θ, depending on whether 𝑋 is discrete or continuous.
13
Bayesian Inference (3)
◼ After a specific value 𝑥 of 𝑋 has been given, a
complete answer to the Bayesian inference problem is
provided by the posterior distribution 𝑝Θ|𝑋 or 𝑓Θ|𝑋 .
◼ It encapsulates everything there is to know about Θ, given
the available information.
14
Summary of Bayesian Inference
1. We start with a prior distribution 𝑝Θ or 𝑓Θ for the
unknown random variable Θ.
2. We have a model 𝑝𝑋|Θ or 𝑓𝑋|Θ of the observation
vector 𝑋.
3. After observing the value 𝑥 of 𝑋, we form the
posterior distribution of Θ, using the appropriate
version of Bayes' rule.
15
Bayes’ Rule: Summary
◼ Depending on whether Θ and 𝑋 are each discrete or continuous, there are four versions of Bayes' rule.
16
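For reference, the four versions can be written out as follows; this is a sketch of the standard forms, using the same 𝑝/𝑓 notation as the rest of these notes.

```latex
% The four cases, depending on whether Theta and X are discrete or continuous.
\begin{align*}
\Theta \text{ discrete},\; X \text{ discrete}: \quad
 & p_{\Theta|X}(\theta\mid x)
   = \frac{p_\Theta(\theta)\,p_{X|\Theta}(x\mid\theta)}
          {\sum_{\theta'} p_\Theta(\theta')\,p_{X|\Theta}(x\mid\theta')} \\[2pt]
\Theta \text{ discrete},\; X \text{ continuous}: \quad
 & p_{\Theta|X}(\theta\mid x)
   = \frac{p_\Theta(\theta)\,f_{X|\Theta}(x\mid\theta)}
          {\sum_{\theta'} p_\Theta(\theta')\,f_{X|\Theta}(x\mid\theta')} \\[2pt]
\Theta \text{ continuous},\; X \text{ discrete}: \quad
 & f_{\Theta|X}(\theta\mid x)
   = \frac{f_\Theta(\theta)\,p_{X|\Theta}(x\mid\theta)}
          {\int f_\Theta(\theta')\,p_{X|\Theta}(x\mid\theta')\,d\theta'} \\[2pt]
\Theta \text{ continuous},\; X \text{ continuous}: \quad
 & f_{\Theta|X}(\theta\mid x)
   = \frac{f_\Theta(\theta)\,f_{X|\Theta}(x\mid\theta)}
          {\int f_\Theta(\theta')\,f_{X|\Theta}(x\mid\theta')\,d\theta'}
\end{align*}
```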
Example: Romeo/Juliet Meeting (1)
◼ Romeo and Juliet meeting: Juliet will be late on any
date by a random amount 𝑋, uniformly distributed over
the interval 0, 𝜃 .
◼ The maximum waiting time 𝜃 is unknown and is
modeled as the value of a random variable Θ which is
uniformly distributed in 0,1 .
◼ Assume that Juliet was late by an amount 𝑥 on their
first date.
◼ Question: How should Romeo use this information to
update the distribution of 𝜃?
17
Example: Romeo/Juliet Meeting (2)
◼ Prior density: 𝑓Θ(𝜃) = 1 if 0 ≤ 𝜃 ≤ 1, and 𝑓Θ(𝜃) = 0 otherwise.
18
Example: Romeo/Juliet Meeting (3)
◼ 𝑓Θ(𝜃) = 1 if 0 ≤ 𝜃 ≤ 1
◼ 𝑓𝑋|Θ(𝑥|𝜃) = 1/𝜃 if 0 ≤ 𝑥 ≤ 𝜃
◼ Use Bayes' rule: the posterior density is
𝑓Θ|𝑋(𝜃|𝑥) = 𝑓Θ(𝜃) 𝑓𝑋|Θ(𝑥|𝜃) / ∫_{𝑥}^{1} 𝑓Θ(𝜃′) 𝑓𝑋|Θ(𝑥|𝜃′) 𝑑𝜃′
= (1/𝜃) / ∫_{𝑥}^{1} (1/𝜃′) 𝑑𝜃′ = 1/(𝜃 |log 𝑥|), if 𝑥 ≤ 𝜃 ≤ 1,
and
𝑓Θ|𝑋(𝜃|𝑥) = 0, otherwise.
19
Example: Romeo/Juliet Meeting (4)
◼ Consider now a variation involving the first 𝑛 dates, with the amounts of lateness independent given Θ. Let 𝑋 = (𝑋1, …, 𝑋𝑛) and 𝑥 = (𝑥1, …, 𝑥𝑛). Then, similarly to the case 𝑛 = 1,
𝑓𝑋|Θ(𝑥|𝜃) = 1/𝜃^𝑛 if 𝑥(𝑛) ≤ 𝜃 ≤ 1,
with 𝑥(𝑛) = max{𝑥1, …, 𝑥𝑛}.
◼ Then, the posterior density is
𝑓Θ|𝑋(𝜃|𝑥) = (1/𝜃^𝑛) / ∫_{𝑥(𝑛)}^{1} (1/(𝜃′)^𝑛) 𝑑𝜃′, if 𝑥(𝑛) ≤ 𝜃 ≤ 1,
and
𝑓Θ|𝑋(𝜃|𝑥) = 0, otherwise.
20
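As a quick numerical sanity check of this posterior, the sketch below evaluates the unnormalized density 1/𝜃^𝑛 on a grid over [𝑥(𝑛), 1] and normalizes it; the observed lateness values are made-up illustrative numbers.

```python
import numpy as np

# Illustrative (made-up) lateness amounts observed on n = 3 dates.
x_obs = np.array([0.30, 0.55, 0.42])
n = len(x_obs)
x_max = x_obs.max()                          # x_(n) = max{x_1, ..., x_n}

theta = np.linspace(x_max, 1.0, 100_001)     # posterior support [x_(n), 1]
unnormalized = theta ** (-n)                 # f_Theta(theta) * f_{X|Theta}(x|theta)
dtheta = theta[1] - theta[0]
posterior = unnormalized / (unnormalized.sum() * dtheta)   # Riemann-sum normalization

print("integral of posterior:", posterior.sum() * dtheta)  # ~ 1.0
print("posterior mode:", theta[np.argmax(posterior)])      # = x_(n) = 0.55
```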
Example: Inference of Common
Mean of Gaussian (1)
◼ We observe a collection 𝑋 = 𝑋1 , … , 𝑋𝑛 of rvs, with an
unknown common mean, whose value we wish to infer.
Given the value of the common mean, we assume that
the 𝑋𝑖 are normal and independent of each other, with
known variances 𝜎12 , … , 𝜎𝑛2 , that is 𝑋𝑖 ~𝑁(Θ, 𝜎𝑖2 )
◼ We model the common mean as Θ, with a given normal prior, that is, Θ ~ 𝑁(𝑥0, 𝜎0²), i.e., 𝑓Θ(𝜃) = 𝑐1 exp(−(𝜃 − 𝑥0)²/(2𝜎0²)).
◼ Likelihood: 𝑓𝑋|Θ(𝑥|𝜃) = 𝑐2 ∏_{𝑖=1}^{𝑛} exp(−(𝑥𝑖 − 𝜃)²/(2𝜎𝑖²))
◼ 𝑐1 and 𝑐2 are normalizing constants.
◼ By Bayes' rule, the posterior density is
𝑓Θ|𝑋(𝜃|𝑥) = 𝑓Θ(𝜃) 𝑓𝑋|Θ(𝑥|𝜃) / ∫ 𝑓Θ(𝜃′) 𝑓𝑋|Θ(𝑥|𝜃′) 𝑑𝜃′
◼ Note: Usually, calculating the normalizing constant in
the denominator (called evidence) is challenging.
22
Example: Inference of Common
Mean of Gaussian (3)
◼ The numerator is of the form
𝑓Θ(𝜃) 𝑓𝑋|Θ(𝑥|𝜃) = 𝑐1 𝑐2 exp(− Σ_{𝑖=0}^{𝑛} (𝑥𝑖 − 𝜃)²/(2𝜎𝑖²)).
◼ The exponent is a quadratic form in 𝜃, so for some constant 𝑑 it can be written as
𝑑 ⋅ exp(−(𝜃 − 𝑚)²/(2𝑣)),
where
1/𝑣 = Σ_{𝑖=0}^{𝑛} 1/𝜎𝑖²,  𝑚/𝑣 = Σ_{𝑖=0}^{𝑛} 𝑥𝑖/𝜎𝑖².
23
Example: Inference of Common
Mean of Gaussian (4)
◼ The constant 𝑑 depends only on 𝑥, not on 𝜃. The denominator does not depend on 𝜃 either. Thus
𝑓Θ|𝑋(𝜃|𝑥) ∝ exp(−(𝜃 − 𝑚)²/(2𝑣))
◼ So, we conclude that the posterior density 𝑓Θ|𝑋 𝜃|𝑥 is
normal with mean 𝑚 and variance 𝑣.
◼ Recall prior: Θ ∼ 𝑁 𝑥0 , 𝜎02 .
◼ A remarkable property of the normal family: the posterior is in the same family as the prior. In general, when the posterior is in the same family as the prior, the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood.
24
Example: Inference of Common
Mean of Gaussian (5)
◼ This property opens up the possibility of efficient
recursive inference.
◼ Suppose that after 𝑋1 , … , 𝑋𝑛 are observed, an
additional observation 𝑋𝑛+1 is obtained.
◼ Instead of solving the inference problem from scratch,
we can view 𝑓Θ|𝑋1,…,𝑋𝑛 as our prior, and use the new
observation to obtain the new posterior 𝑓Θ|𝑋1 ,…,𝑋𝑛 ,𝑋𝑛+1 .
◼ Thus, the new posterior is a normal distribution with mean 𝑚′ and variance 𝑣′, where
𝑚′/𝑣′ = 𝑚/𝑣 + 𝑥_{𝑛+1}/𝜎_{𝑛+1}²,
1/𝑣′ = 1/𝑣 + 1/𝜎_{𝑛+1}².
25
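A minimal sketch of this recursive update in Python is shown below; the prior parameters, observations, and noise variances are illustrative assumptions, not values from the notes.

```python
# Conjugate recursive update for the common-mean example: prior N(m, v),
# new observation x_new ~ N(theta, sigma2_new).
def gaussian_posterior_update(m, v, x_new, sigma2_new):
    v_new = 1.0 / (1.0 / v + 1.0 / sigma2_new)      # 1/v' = 1/v + 1/sigma^2
    m_new = v_new * (m / v + x_new / sigma2_new)    # m'/v' = m/v + x/sigma^2
    return m_new, v_new

m, v = 0.0, 1.0                                     # assumed prior: x0 = 0, sigma0^2 = 1
for x_i, s2_i in [(1.2, 0.5), (0.8, 0.5), (1.1, 0.25)]:   # assumed data and variances
    m, v = gaussian_posterior_update(m, v, x_i, s2_i)
print("posterior mean m =", m, " posterior variance v =", v)
```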
Example: Beta Priors on the Bias of
a Coin
◼ Besides the normal family, another prominent example involves the Beta family, which is conjugate to the likelihood arising from Bernoulli trials.
◼ We wish to estimate the probability of heads, Θ, of a
biased coin, and suppose Θ has beta prior 𝑓Θ 𝜃 =
Beta(𝛼, 𝛽).
◼ The coin is tossed 𝑛 times and the number of heads is
denoted by 𝑋~Bin 𝑛, 𝜃 .
◼ The posterior of Θ is, for 0 ≤ 𝜃 ≤ 1,
𝑓Θ|𝑋(𝜃|𝑘) = 𝑐 𝑓Θ(𝜃) 𝑝𝑋|Θ(𝑘|𝜃) = 𝑑 𝑓Θ(𝜃) 𝜃^𝑘 (1 − 𝜃)^(𝑛−𝑘)
= (𝑑/𝐵(𝛼, 𝛽)) 𝜃^(𝑘+𝛼−1) (1 − 𝜃)^(𝑛−𝑘+𝛽−1),
which is also a beta density, with 𝛼′ = 𝑘 + 𝛼 and 𝛽′ = 𝑛 − 𝑘 + 𝛽.
26
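The conjugate update is a one-line computation; the sketch below applies it and reads off the MAP (mode) and conditional-expectation (mean) estimates of the resulting Beta posterior. The prior Beta(1, 1) and the counts are illustrative choices.

```python
def beta_binomial_update(alpha, beta, k, n):
    """Posterior Beta parameters after observing k heads in n tosses."""
    return alpha + k, beta + (n - k)

alpha, beta = 1.0, 1.0                              # uniform prior = Beta(1, 1)
alpha, beta = beta_binomial_update(alpha, beta, k=7, n=10)

map_estimate = (alpha - 1) / (alpha + beta - 2)     # mode of Beta (valid for alpha, beta > 1)
lms_estimate = alpha / (alpha + beta)               # mean of Beta
print(map_estimate, lms_estimate)                   # 0.7 = k/n and 8/12 = (k+1)/(n+2)
```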
Multiparameter Problems
◼ The case of multiple unknown parameters is entirely
similar. The principles for calculating the posterior density
are essentially the same, regardless of whether Θ
consists of one or multiple components.
◼ However, while the posterior density can be obtained in
principle using Bayes’ rule, a closed form solution should
not be expected in general.
◼ If Θ is high-dimensional, computing the denominator of
Bayes’ formula in terms of numerical integration becomes
formidable.
◼ We resort to sophisticated numerical approximation
methods, based on random sampling, such as Monte
Carlo integration, Gibbs sampling, or Markov chain
Monte Carlo (MCMC).
27
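As a flavor of these sampling methods, below is a minimal random-walk Metropolis sketch (one simple MCMC algorithm) for a one-dimensional posterior; the prior, data values, step size, and iteration counts are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: prior N(0, 1) on theta, three observations with unit noise variance.
# In this conjugate case the exact posterior mean is sum(data)/4 = 0.85, which the
# chain should approximately recover.
data = np.array([0.9, 1.1, 1.4])

def log_unnormalized_posterior(theta):
    log_prior = -0.5 * theta**2                        # N(0, 1) prior, up to a constant
    log_lik = -0.5 * np.sum((data - theta) ** 2)       # unit-variance Gaussian likelihood
    return log_prior + log_lik

theta, step = 0.0, 0.5
samples = []
for _ in range(20_000):
    proposal = theta + step * rng.standard_normal()    # random-walk proposal
    log_ratio = log_unnormalized_posterior(proposal) - log_unnormalized_posterior(theta)
    if np.log(rng.uniform()) < log_ratio:              # Metropolis acceptance test
        theta = proposal
    samples.append(theta)

print("MCMC posterior mean ~", np.mean(samples[2000:]))   # ~ 0.85
```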
Contents
◼ Bayesian inference and the posterior distribution
◼ Point estimation, Hypothesis testing and MAP
◼ Bayesian least mean squares estimation
◼ Bayesian linear least mean squares estimation
28
MAP (1)
◼ Given the value 𝑥 of the observation, we select a value of 𝜃, denoted 𝜃̂, that maximizes the posterior distribution:
◼ 𝑝Θ|𝑋(𝜃|𝑥) if Θ is discrete,
◼ 𝑓Θ|𝑋(𝜃|𝑥) if Θ is continuous.
29
MAP (2)
◼ This is called the Maximum a Posteriori probability
(MAP) rule.
30
MAP (3)
◼ When Θ is discrete, the MAP rule has an important
optimality property.
◼ Since it chooses 𝜃̂ to be the most likely value of Θ, the MAP rule maximizes the probability of a correct decision (selecting the correct value of Θ) for any given value 𝑥; i.e., for every decision rule 𝑔(𝑥),
𝐏(Θ = 𝑔(𝑥) | 𝑋 = 𝑥) ≤ 𝐏(Θ = 𝑔MAP(𝑥) | 𝑋 = 𝑥).
◼ This implies that it also maximizes the overall (averaged over all possible values of 𝑋) probability of a correct decision; that is, the following holds for all decision rules 𝑔(𝑋),
𝐏(Θ = 𝑔(𝑋) | 𝑋) ≤ 𝐏(Θ = 𝑔MAP(𝑋) | 𝑋).
Taking expectations with respect to 𝑋 and using the total probability law,
𝐏(Θ = 𝑔(𝑋)) ≤ 𝐏(Θ = 𝑔MAP(𝑋)).
31
Computational Shortcut
◼ Recall the posterior: 𝑝Θ|𝑋(𝜃|𝑥) = 𝑝Θ(𝜃) 𝑝𝑋|Θ(𝑥|𝜃) / Σ_{𝜃′} 𝑝Θ(𝜃′) 𝑝𝑋|Θ(𝑥|𝜃′)
◼ The denominator does not depend on 𝜃, so the MAP rule can simply maximize the numerator 𝑝Θ(𝜃) 𝑝𝑋|Θ(𝑥|𝜃), without computing the normalizing denominator.
32
Example: MAP for the Inference of
Common Mean of Gaussian
◼ 𝑋1, …, 𝑋𝑛 are independent normal rvs with an unknown common mean Θ ~ 𝑁(𝑥0, 𝜎0²) and known variances 𝜎1², …, 𝜎𝑛².
◼ Posterior: 𝑓Θ|𝑋(𝜃|𝑥) ∝ exp(−(𝜃 − 𝑚)²/(2𝑣)) with
1/𝑣 = Σ_{𝑖=0}^{𝑛} 1/𝜎𝑖²,  𝑚/𝑣 = Σ_{𝑖=0}^{𝑛} 𝑥𝑖/𝜎𝑖².
◼ The MAP estimate: 𝜃̂ = 𝑚,
◼ because the normal PDF is maximized at its mean
33
Example: MAP for Spam Filtering
◼ Let {𝑤1 , . . . , 𝑤𝑛 } be a collection of words whose
appearance suggests a spam message. Θ takes values
1 and 2, corresponding to spam or legitimate
messages, with given probabilities 𝑝Θ (1) and 𝑝Θ (2),
and 𝑋𝑖 is the Bernoulli rv that models the appearance
of 𝑤𝑖 (𝑋𝑖 = 1 if 𝑤𝑖 appears and 𝑋𝑖 = 0, otherwise)
◼ Posterior: 𝐏(Θ = 𝑚 | 𝑋1 = 𝑥1, …, 𝑋𝑛 = 𝑥𝑛)
= 𝑝Θ(𝑚) ∏_{𝑖=1}^{𝑛} 𝑝𝑋𝑖|Θ(𝑥𝑖|𝑚) / Σ_{𝑗=1}^{2} 𝑝Θ(𝑗) ∏_{𝑖=1}^{𝑛} 𝑝𝑋𝑖|Θ(𝑥𝑖|𝑗),  𝑚 = 1, 2.
◼ The MAP rule decides that the message is spam if
𝐏(Θ = 1 | 𝑋1 = 𝑥1, …, 𝑋𝑛 = 𝑥𝑛) > 𝐏(Θ = 2 | 𝑋1 = 𝑥1, …, 𝑋𝑛 = 𝑥𝑛), or equivalently
𝑝Θ(1) ∏_{𝑖=1}^{𝑛} 𝑝𝑋𝑖|Θ(𝑥𝑖|1) > 𝑝Θ(2) ∏_{𝑖=1}^{𝑛} 𝑝𝑋𝑖|Θ(𝑥𝑖|2)
34
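A minimal sketch of this decision rule in Python is shown below; the prior probabilities and per-word appearance probabilities are made-up illustrative values, and the comparison is done in the log domain for numerical stability.

```python
import numpy as np

prior = {1: 0.4, 2: 0.6}                        # Theta = 1: spam, Theta = 2: legitimate (assumed)
p_word_given_class = {                           # P(X_i = 1 | Theta = m) for each word w_i (assumed)
    1: np.array([0.8, 0.6, 0.5]),
    2: np.array([0.1, 0.2, 0.3]),
}
x = np.array([1, 0, 1])                          # observed appearance indicators

def log_joint(m):
    p = p_word_given_class[m]
    # Bernoulli likelihood p^x * (1-p)^(1-x) for each word, combined in log form.
    log_lik = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return np.log(prior[m]) + log_lik

decision = max(prior, key=log_joint)             # MAP rule: larger prior * likelihood
print("MAP decision:", "spam" if decision == 1 else "legitimate")
```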
Point Estimation
◼ Point estimate: a single numerical value that represents
our best guess of the parameter Θ
◼ Estimate: the deterministic numerical value 𝜃̂ that we choose upon observing 𝑥.
◼ 𝜃̂ is determined by applying some function 𝑔 to the observation 𝑥, resulting in 𝜃̂ = 𝑔(𝑥).
◼ 𝑔 is called the decision rule; applied to the random observation 𝑋, it defines the estimator Θ̂ = 𝑔(𝑋).
35
Two Popular Estimators
◼ We can use different decision rules 𝑔 to form different
estimators, and some will be better than others.
◼ Two popular estimates:
◼ MAP estimate: 𝜃̂ = argmax_𝜃 𝑝Θ|𝑋(𝜃|𝑥)
◼ Conditional expectation estimate: 𝜃̂ = 𝐄[Θ|𝑋 = 𝑥].
◼ Conditional expectation estimator is also called the
least mean squares (LMS) estimator.
◼ It minimizes the mean squared estimation error
36
Example: Romeo/Juliet Meeting (1)
◼ Juliet is late on the first date by a random amount 𝑋.
◼ The distribution of 𝑋 is uniform over 0, Θ .
◼ Θ is an unknown random variable with a uniform prior
𝑓Θ over the interval 0,1 .
◼ Recall: 𝑓Θ|𝑋(𝜃|𝑥) = 1/(𝜃 |log 𝑥|), if 𝑥 ≤ 𝜃 ≤ 1
◼ MAP estimate: 𝜃̂ = 𝑥,
◼ because 𝑓Θ|𝑋 𝜃|𝑥 is decreasing in 𝜃 over the range
𝑥, 1 .
◼ Note that the MAP estimate is optimistic: if Juliet is late by 𝑥 on the first date, the rule estimates the maximum lateness Θ to be exactly 𝑥, the smallest value consistent with the observation.
38
Example: Bias of a Coin
◼ We wish to estimate the probability of heads, Θ, of a
biased coin, and suppose Θ has uniform prior, Θ~𝑈[0,1].
◼ We want to derive the MAP and conditional expectation
estimators of Θ.
◼ When 𝑋 = 𝑘, the posterior of Θ is, for 0 ≤ 𝜃 ≤ 1,
𝑓Θ|𝑋(𝜃|𝑘) = (1/𝐵(𝑘 + 1, 𝑛 − 𝑘 + 1)) 𝜃^𝑘 (1 − 𝜃)^(𝑛−𝑘)
▪ MAP estimate: the maximum of the posterior is at 𝜃̂ = 𝑘/𝑛.
▪ Conditional expectation estimate: 𝐄[Θ|𝑋 = 𝑘] is the first moment of Beta(𝛼 = 𝑘 + 1, 𝛽 = 𝑛 − 𝑘 + 1),
𝐄[Θ|𝑋 = 𝑘] = 𝐵(𝛼 + 1, 𝛽)/𝐵(𝛼, 𝛽) = (𝑘 + 1)/(𝑛 + 2).
▪ For large 𝑛, the two estimates nearly coincide.
39
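The last point is easy to check numerically; the snippet below compares 𝑘/𝑛 and (𝑘 + 1)/(𝑛 + 2) for a fixed (illustrative) fraction of heads as 𝑛 grows.

```python
for n in (10, 100, 10_000):
    k = round(0.3 * n)                  # illustrative: ~30% of the tosses are heads
    map_est = k / n                     # MAP estimate under the uniform prior
    lms_est = (k + 1) / (n + 2)         # conditional expectation (LMS) estimate
    print(f"n={n:6d}  MAP={map_est:.4f}  LMS={lms_est:.4f}")
```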
Hypothesis Testing (1)
◼ Θ takes one of 𝑚 values, 𝜃1 , … , 𝜃𝑚 .
◼ 𝑚 is usually a small integer; often 𝑚 = 2.
40
Hypothesis Testing (2)
◼ MAP rule: select the hypothesis 𝐻𝑖 : Θ = 𝜃𝑖 with the
largest posterior probability 𝐏 Θ = 𝜃𝑖 |𝑋 = 𝑥 .
◼ Equivalently (a computational shortcut), it selects the hypothesis 𝐻𝑖 with the largest 𝑝Θ(𝜃𝑖) 𝑝𝑋|Θ(𝑥|𝜃𝑖) (if 𝑋 is discrete) or 𝑝Θ(𝜃𝑖) 𝑓𝑋|Θ(𝑥|𝜃𝑖) (if 𝑋 is continuous).
◼ The MAP rule minimizes the probability of selecting an
incorrect hypothesis, or the probability of error over all
decision rules.
41
Hypothesis Testing (3)
◼ Once we derive the MAP rule, we can compute the
probability of a correct decision or error.
◼ Suppose the true value of Θ is 𝜃𝑖.
◼ Let 𝑔MAP(𝑥) be the hypothesis selected by the MAP rule when 𝑋 = 𝑥; the decision is correct exactly when 𝑔MAP(𝑥) = 𝜃𝑖.
◼ The probability of a correct decision (selecting the correct value of Θ) when 𝑋 = 𝑥 is
𝐏(Θ = 𝑔MAP(𝑥) | 𝑋 = 𝑥),
which equals 𝐏(Θ = 𝜃𝑖 | 𝑋 = 𝑥) whenever the MAP rule selects 𝜃𝑖.
42
Hypothesis Testing (4)
◼ For each hypothesis 𝐻𝑖, let 𝑆𝑖 be the set of all 𝑥 for which the MAP rule selects 𝐻𝑖:
𝑆𝑖 = {𝑥 : 𝑔MAP(𝑥) = 𝐻𝑖}.
◼ Then the overall (averaged over all possible values of 𝑥) probability of a correct decision is
𝐏(correct) = 𝐏(Θ = 𝑔MAP(𝑋)) = Σ_{𝑖} 𝐏(Θ = 𝜃𝑖, 𝑋 ∈ 𝑆𝑖)
▪ The corresponding overall probability of error is
𝐏(error) = Σ_{𝑖} 𝐏(Θ ≠ 𝜃𝑖, 𝑋 ∈ 𝑆𝑖)
43
Example: Hypothesis Testing (1)
◼ We have two biased coins, with probabilities of heads
equal to 𝑝1 and 𝑝2 , respectively.
◼ We choose a coin at random: either coin is equally
likely to be chosen. This gives the uniform prior.
◼ We want to infer its identity (1 or 2), based on the
outcome of a single toss.
◼ Let 𝐻1 = {Θ = 1} and 𝐻2 = {Θ = 2} be the hypotheses
that coin 1 or 2 was chosen, respectively.
◼ Depending on the outcome of the toss, let 𝑋 = 1 if heads and 𝑋 = 0 if tails.
▪ MAP rule: compare 𝑝Θ(1) 𝑝𝑋|Θ(𝑥|1) with 𝑝Θ(2) 𝑝𝑋|Θ(𝑥|2) and select the hypothesis with the larger value.
44
Example: Hypothesis Testing (2)
◼ Since 𝑝Θ 1 = 𝑝Θ 2 = 1/2, we just need to compare
𝑝𝑋|Θ 𝑥|1 and 𝑝𝑋|Θ 𝑥|2 .
◼ For example, if 𝑝1 = 0.46, 𝑝2 = 0.52, and the outcome is a tail (𝑥 = 0), then 𝑝𝑋|Θ(0|1) = 1 − 𝑝1 > 𝑝𝑋|Θ(0|2) = 1 − 𝑝2, so we decide in favor of coin 1.
◼ Let us consider the general case that we toss a
randomly selected coin 𝑛 times.
◼ Let 𝑋 be the number of heads obtained.
◼ Then, the preceding argument of single toss is still
valid, and the MAP rule selects the hypothesis under
which the observed outcome is most likely.
45
Example: Hypothesis Testing (3)
◼ If 𝑋 = 𝑘, we should select 𝐻1 = {Θ = 1} if
𝑝𝑋|Θ(𝑘|1) = 𝑝1^𝑘 (1 − 𝑝1)^(𝑛−𝑘) > 𝑝𝑋|Θ(𝑘|2) = 𝑝2^𝑘 (1 − 𝑝2)^(𝑛−𝑘)
[Figure: the two likelihoods plotted as functions of 𝑘; they cross at a threshold 𝑘∗]
46
Example: Hypothesis Testing (4)
◼ The characteristic of the MAP rule, as illustrated in the
figure, is typical of decision rules in binary hypothesis
testing problems.
◼ It is specified by a partition of the observation space
into the two disjoint sets in which each of the two
hypotheses is chosen.
◼ In this example, the MAP rule is specified by a single
threshold 𝑘 ∗ :
◼ Accept Θ = 1 if 𝑘 ≤ 𝑘 ∗ , accept Θ = 2 otherwise.
47
Example: Hypothesis Testing (5)
◼ The overall probability of error is obtained by using the total
probability rule:
𝐏(error) = 𝐏(Θ = 1, 𝑋 > 𝑘∗) + 𝐏(Θ = 2, 𝑋 ≤ 𝑘∗)
= (1/2) Σ_{𝑘=𝑘∗+1}^{𝑛} 𝑐(𝑘) 𝑝1^𝑘 (1 − 𝑝1)^(𝑛−𝑘) + (1/2) Σ_{𝑘=0}^{𝑘∗} 𝑐(𝑘) 𝑝2^𝑘 (1 − 𝑝2)^(𝑛−𝑘),
where 𝑐(𝑘) denotes the binomial coefficient (𝑛 choose 𝑘).
48
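The threshold 𝑘∗ and the error probability can be computed directly from the binomial PMFs; the sketch below uses the 𝑝1, 𝑝2 values from this example, while 𝑛 = 50 is an assumed illustrative value.

```python
from math import comb

p1, p2, n = 0.46, 0.52, 50

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# MAP favors coin 1 when p1^k (1-p1)^(n-k) > p2^k (1-p2)^(n-k); since p1 < p2,
# the likelihood ratio is decreasing in k, so this region has the form {k <= k*}.
k_star = max(k for k in range(n + 1) if binom_pmf(k, n, p1) > binom_pmf(k, n, p2))

p_error = 0.5 * sum(binom_pmf(k, n, p1) for k in range(k_star + 1, n + 1)) \
        + 0.5 * sum(binom_pmf(k, n, p2) for k in range(0, k_star + 1))
print("k* =", k_star, "  P(error) =", round(p_error, 4))
```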
Contents
◼ Bayesian inference and the posterior distribution
◼ Point estimation, Hypothesis testing and MAP
◼ Bayesian least mean squares estimation
◼ Bayesian linear least mean squares estimation
49
Mean Squared Estimation Error
without Observation (1)
◼ First, let us consider the simpler problem of estimating Θ with a constant 𝜃̂, in the absence of an observation 𝑋.
◼ The estimation error: 𝜃̂ − Θ, which is random
◼ The mean squared estimation error (MSE):
𝐄[(Θ − 𝜃̂)²]
◼ Question: what is the minimum value of the MSE over all choices of 𝜃̂?
◼ Answer: 𝐕(Θ); the minimum is achieved when 𝜃̂ = 𝐄[Θ]. Equivalently,
𝐄[(Θ − 𝐄[Θ])²] ≤ 𝐄[(Θ − 𝜃̂)²], for all 𝜃̂.
50
Mean Squared Estimation Error
without Observation (2)
◼ 𝐄[(Θ − 𝜃̂)²]
= 𝐕(Θ − 𝜃̂) + (𝐄[Θ − 𝜃̂])²   // definition of variance
= 𝐕(Θ) + (𝐄[Θ − 𝜃̂])²        // variance is unchanged by the constant shift 𝜃̂
= 𝐕(Θ) + (𝐄[Θ] − 𝜃̂)²        // linearity of expectation
≥ 𝐕(Θ)                       // "=" is achieved when 𝜃̂ = 𝐄[Θ]
51
Mean Squared Estimation Error
with Observation
◼ Second, suppose that we have observation 𝑋.
◼ We still want to estimate Θ to minimize the MSE.
◼ Note that once we know the value 𝑥 of 𝑋, the situation
is identical to the one considered earlier, except that
we are now in a new setting: everything is conditioned
on 𝑋 = 𝑥.
◼ For any given observation 𝑥, the conditional expectation estimate 𝜃̂ = 𝐄[Θ|𝑋 = 𝑥] minimizes the conditional MSE 𝐄[(Θ − 𝜃̂)²|𝑋 = 𝑥].
◼ Equivalently, for any given observation 𝑥 and for all 𝜃̂,
𝐄[(Θ − 𝐄[Θ|𝑋 = 𝑥])² | 𝑋 = 𝑥] ≤ 𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑥].
52
Mean Squared Estimation Error in
General Case
◼ More generally, viewed across all values of the observation 𝑋, the MSE associated with an estimator 𝑔(𝑋) is defined as
𝐄[(Θ − 𝑔(𝑋))² | 𝑋].
◼ If we view 𝐄[Θ|𝑋] as a function of 𝑋, the preceding analysis shows that, out of all possible estimators 𝑔(𝑋) of Θ based on 𝑋, the conditional MSE is minimized when 𝑔(𝑋) = 𝐄[Θ|𝑋]:
◼ That is, for all estimators 𝑔(𝑋),
𝐄[(Θ − 𝐄[Θ|𝑋])² | 𝑋] ≤ 𝐄[(Θ − 𝑔(𝑋))² | 𝑋]
◼ Similarly, using the total expectation rule, for all estimators 𝑔(𝑋),
𝐄[(Θ − 𝐄[Θ|𝑋])²] ≤ 𝐄[(Θ − 𝑔(𝑋))²]
53
Example: Conditional MSE (1)
◼ We observe Θ with error 𝑊:
𝑋 =Θ+𝑊
where Θ ~ 𝑈[4, 10] and 𝑊 ~ 𝑈[−1, 1] is independent of Θ.
◼ We want to obtain the conditional MSE
𝐄[(Θ − 𝐄[Θ|𝑋])² | 𝑋 = 𝑥].
◼ 𝑓Θ 𝜃 = 1/6 if 4 ≤ 𝜃 ≤ 10 (and 0 otherwise).
◼ Given Θ = 𝜃, 𝑋 equals 𝜃 + 𝑊, so
▪ {𝑋|Θ = 𝜃} ~ 𝑈[𝜃 − 1, 𝜃 + 1],
▪ 𝑓𝑋|Θ(𝑥|𝜃) = 1/2 if 𝜃 − 1 ≤ 𝑥 ≤ 𝜃 + 1.
◼ Joint density: 𝑓Θ,𝑋(𝜃, 𝑥) = 𝑓Θ(𝜃) 𝑓𝑋|Θ(𝑥|𝜃) = (1/6) ⋅ (1/2) = 1/12
◼ when 𝜃 ∈ 4,10 and 𝑥 ∈ 𝜃 − 1, 𝜃 + 1 .
54
Example: Conditional MSE (2)
◼ The joint density of Θ and 𝑋 is uniform over the
parallelogram given in the following figure.
◼ Given that 𝑋 = 𝑥, the posterior density 𝑓Θ|𝑋 is
proportional to the joint density and is also uniform on
the corresponding vertical section of the parallelogram.
55
Example: Conditional MSE (3)
◼ At 𝑥 = 5:
▪ 𝜃̂ = 𝐄[Θ|𝑋 = 𝑥] = 5 and 𝑓Θ|𝑋(𝜃|𝑥) = 1/2 for 4 ≤ 𝜃 ≤ 6.
▪ CMSE = 𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑥] = ∫_{4}^{6} (𝜃 − 5)² ⋅ (1/2) 𝑑𝜃 = (1/2) ∫_{−1}^{1} 𝑦² 𝑑𝑦 = 1/3
◼ For 3 ≤ 𝑥 ≤ 5:
▪ 𝜃̂ = 4 + (𝑥 − 3)/2 = (𝑥 + 5)/2 and 𝑓Θ|𝑋(𝜃|𝑥) = 1/(𝑥 − 3) for 4 ≤ 𝜃 ≤ 𝑥 + 1.
▪ CMSE = 𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑥] = ∫_{4}^{𝑥+1} (𝜃 − (𝑥 + 5)/2)² ⋅ (1/(𝑥 − 3)) 𝑑𝜃
= (1/(𝑥 − 3)) ∫_{−(𝑥−3)/2}^{(𝑥−3)/2} 𝑦² 𝑑𝑦 = (𝑥 − 3)²/12
▪ Similarly, for 9 ≤ 𝑥 ≤ 11,
▪ CMSE = (𝑥 − 11)²/12
56
Example: Conditional MSE (4)
◼ Thus, 𝐄 Θ 𝑋 = 𝑥 is the midpoint of that section, which
is a piecewise linear function of 𝑥.
◼ Conditioned on a specific value 𝑥, the conditional MSE
𝐄[(Θ − 𝐄[Θ|𝑋 = 𝑥])² | 𝑋 = 𝑥]
is the conditional variance of Θ given 𝑋 = 𝑥.
57
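The piecewise formulas above amount to: given 𝑋 = 𝑥, Θ is uniform on the section [max(4, 𝑥 − 1), min(10, 𝑥 + 1)], so the LMS estimate is the midpoint of that interval and the conditional MSE is its variance. A small sketch:

```python
# Theta ~ U[4, 10], W ~ U[-1, 1], X = Theta + W (the example on slides 54-57).
def conditional_mean_and_mse(x):
    lo, hi = max(4.0, x - 1.0), min(10.0, x + 1.0)   # vertical section of the parallelogram
    mean = (lo + hi) / 2.0                           # E[Theta | X = x]: midpoint
    mse = (hi - lo) ** 2 / 12.0                      # variance of a uniform interval
    return mean, mse

for x in (4.0, 5.0, 7.0, 10.0):
    print(x, conditional_mean_and_mse(x))
# x = 5 gives (5.0, 1/3); x = 4 gives ((4+5)/2, 1/12), matching (x-3)^2/12.
```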
Example: Romeo/Juliet Meeting (1)
◼ Juliet is late on the first date by a random amount 𝑋
that is uniformly distributed over 0, Θ .
◼ Θ: uniform prior over the interval 0,1 .
◼ MAP estimate: 𝜃̂ = 𝑥.
◼ LMS estimate: 𝜃̂ = 𝐄[Θ|𝑋 = 𝑥] = ∫_{𝑥}^{1} 𝜃 ⋅ (1/(𝜃 |log 𝑥|)) 𝑑𝜃 = (1 − 𝑥)/|log 𝑥|
◼ In the above, we have used the conditional density
𝑓Θ|𝑋(𝜃|𝑥) = 1/(𝜃 |log 𝑥|), for 𝑥 ≤ 𝜃 ≤ 1.
◼ We want to calculate the conditional MSE for the MAP
and LMS estimates.
58
Example: Romeo/Juliet Meeting (2)
◼ Given that 𝑋 = 𝑥, for any estimate 𝜃̂, we have
𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑥]
= ∫_{𝑥}^{1} (𝜃 − 𝜃̂)² ⋅ (1/(𝜃 |log 𝑥|)) 𝑑𝜃
= ∫_{𝑥}^{1} (𝜃² − 2𝜃̂𝜃 + 𝜃̂²) ⋅ (1/(𝜃 |log 𝑥|)) 𝑑𝜃
= (1 − 𝑥²)/(2 |log 𝑥|) − 2𝜃̂ (1 − 𝑥)/|log 𝑥| + 𝜃̂².
▪ In the above, we have again used the conditional density
𝑓Θ|𝑋(𝜃|𝑥) = 1/(𝜃 |log 𝑥|).
59
Example: Romeo/Juliet Meeting (3)
◼ 𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑥] = (1 − 𝑥²)/(2 |log 𝑥|) − 2𝜃̂ (1 − 𝑥)/|log 𝑥| + 𝜃̂².
◼ For MAP: substituting 𝜃̂ = 𝑥,
𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑥] = 𝑥² + (3𝑥² − 4𝑥 + 1)/(2 |log 𝑥|)
◼ For LMS: substituting 𝜃̂ = (1 − 𝑥)/|log 𝑥|,
𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑥] = (1 − 𝑥²)/(2 |log 𝑥|) − ((1 − 𝑥)/|log 𝑥|)²
60
Example: Romeo/Juliet Meeting (4)
◼ The MAP estimate 𝑥 takes smaller values than the LMS estimate (1 − 𝑥)/|log 𝑥| for every 𝑥 in (0, 1).
◼ The LMS estimate has uniformly smaller conditional MSE.
61
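This comparison can be reproduced by evaluating the two conditional-MSE expressions on a grid of 𝑥 values, as in the sketch below.

```python
import numpy as np

x = np.linspace(0.05, 0.95, 19)
L = np.abs(np.log(x))                       # |log x|

def cmse(theta_hat, x, L):
    # E[(Theta - theta_hat)^2 | X = x] from slide 59
    return (1 - x**2) / (2 * L) - 2 * theta_hat * (1 - x) / L + theta_hat**2

cmse_map = cmse(x, x, L)                    # MAP estimate: theta_hat = x
cmse_lms = cmse((1 - x) / L, x, L)          # LMS estimate: theta_hat = (1 - x)/|log x|
print(np.all(cmse_lms <= cmse_map))         # True: LMS has uniformly smaller CMSE
```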
Example: Bias of a Coin (1)
◼ The probability of heads is modeled as Θ, and we
assume the prior of Θ is uniform over the interval 0,1 .
◼ We want to calculate the conditional MSE for the MAP
and LMS estimates.
◼ The coin is tossed 𝑛 times and the number of heads is
distributed by 𝑋~Bin(𝑛, Θ).
◼ When 𝑋 = 𝑘, the posterior density is Beta(𝛼 = 𝑘 + 1, 𝛽 = 𝑛 − 𝑘 + 1), and the MAP estimate is 𝜃̂ = 𝑘/𝑛.
◼ Using the formula for the moments of the Beta density,
𝐄[Θ^𝑚 | 𝑋 = 𝑘] = (𝑘 + 1)(𝑘 + 2)⋯(𝑘 + 𝑚) / ((𝑛 + 2)(𝑛 + 3)⋯(𝑛 + 𝑚 + 1)),
and for 𝑚 = 1 the LMS estimate is 𝜃̂ = 𝐄[Θ|𝑋 = 𝑘] = (𝑘 + 1)/(𝑛 + 2).
62
Example: Bias of a Coin (2)
◼ Given 𝑋 = 𝑘, the conditional MSE for any estimate 𝜃̂ is
𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑘] = 𝐄[Θ² | 𝑋 = 𝑘] − 2𝜃̂ 𝐄[Θ | 𝑋 = 𝑘] + 𝜃̂²
= (𝑘 + 1)(𝑘 + 2)/((𝑛 + 2)(𝑛 + 3)) − 2𝜃̂ (𝑘 + 1)/(𝑛 + 2) + 𝜃̂².
▪ For MAP (𝜃̂ = 𝑘/𝑛), the conditional MSE is
𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑘] = (𝑘 + 1)(𝑘 + 2)/((𝑛 + 2)(𝑛 + 3)) − 2(𝑘/𝑛)(𝑘 + 1)/(𝑛 + 2) + (𝑘/𝑛)².
▪ For LMS (𝜃̂ = 𝐄[Θ|𝑋 = 𝑘] = (𝑘 + 1)/(𝑛 + 2)), the conditional MSE is
𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑘] = 𝐄[Θ² | 𝑋 = 𝑘] − (𝐄[Θ | 𝑋 = 𝑘])²
= (𝑘 + 1)(𝑘 + 2)/((𝑛 + 2)(𝑛 + 3)) − ((𝑘 + 1)/(𝑛 + 2))².
63
Example: Bias of a Coin (3)
◼ LMS estimate has uniformly smaller conditional MSE.
64
Properties of Estimation Error (1)
◼ For the LMS estimator and its associated estimation
error, respectively, let us denote
Θ̂ = 𝐄[Θ|𝑋],  Θ̃ = Θ̂ − Θ
◼ 𝐄[Θ̃] = 𝐄[Θ̂ − Θ] = 𝐄[𝐄[Θ|𝑋]] − 𝐄[Θ] = 0   // law of iterated expectations: 𝐄[𝐄[𝑋|𝑌]] = 𝐄[𝑋]
▪ 𝐄[Θ̃|𝑋] = 𝐄[Θ̂ − Θ|𝑋]
= 𝐄[Θ̂|𝑋] − 𝐄[Θ|𝑋]
= 𝐄[𝐄[Θ|𝑋] | 𝑋] − 𝐄[Θ|𝑋]   // using 𝐄[𝐄[𝑋|𝑌]|𝑌] = 𝐄[𝑋|𝑌]
= 𝐄[Θ|𝑋] − 𝐄[Θ|𝑋] = 0
◼ The estimation error is unbiased since it has zero
unconditional and conditional mean.
65
Properties of Estimation Error (2)
◼ Again, Θ̂ = 𝐄[Θ|𝑋] and Θ̃ = Θ̂ − Θ.
◼ 𝐄[Θ̂Θ̃] = 𝐄[𝐄[Θ̂Θ̃|𝑋]]   // total expectation rule
= 𝐄[Θ̂ 𝐄[Θ̃|𝑋]]   // Θ̂ is completely determined by 𝑋
= 0   // 𝐄[Θ̃|𝑋] = 0 from the preceding result
◼ Cov(Θ̂, Θ̃) = 𝐄[Θ̂Θ̃] − 𝐄[Θ̂]𝐄[Θ̃] = 0 − 0 = 0.   // the error Θ̃ is uncorrelated with the estimator Θ̂
◼ Therefore, by considering the variance of both sides of Θ = Θ̂ − Θ̃, we have
𝐕(Θ) = 𝐕(Θ̂) + 𝐕(Θ̃)
66
Properties of Estimation Error (3)
◼ The observation 𝑋 is uninformative if the MSE 𝐄[Θ̃²] = 𝐕(Θ̃) is the same as 𝐕(Θ). (Note 𝐄[Θ̃²] = 𝐕(Θ̃) because 𝐄[Θ̃] = 0.) When is this the case?
◼ Using 𝐕(Θ) = 𝐕(Θ̂) + 𝐕(Θ̃), we see that 𝑋 is uninformative if and only if 𝐕(Θ̂) = 0.
◼ The variance of a rv is zero only when the rv is a constant, equal to its mean. Thus 𝑋 is uninformative if and only if the estimator Θ̂ = 𝐄[Θ|𝑋] is equal to 𝐄[Θ] for every value of 𝑋.
67
Contents
◼ Bayesian inference and the posterior distribution
◼ Point estimation, Hypothesis testing and MAP
◼ Bayesian least mean squares estimation
◼ Bayesian linear least mean squares estimation
68
Linear LMS Estimator (1)
◼ LMS estimator is sometimes hard to compute, and we
need alternatives.
◼ We derive an estimator by minimizing the MSE within
a restricted class of estimators: those that are linear
functions of the observations.
◼ This estimator may result in a higher MSE than the unrestricted LMS estimator 𝐄[Θ|𝑋].
◼ But it has a significant computational advantage.
◼ It requires simple calculations, involving only the means, variances, and covariances of Θ and the observations.
69
Linear LMS Estimator (2)
◼ A linear estimator Θ̂ of a random variable Θ, based on observations 𝑋1, …, 𝑋𝑛, has the form
Θ̂ = 𝑎1𝑋1 + ⋯ + 𝑎𝑛𝑋𝑛 + 𝑏
◼ Given a specified choice of the scalars 𝑎1, …, 𝑎𝑛, 𝑏, the corresponding MSE is
𝐄[(Θ − 𝑎1𝑋1 − ⋯ − 𝑎𝑛𝑋𝑛 − 𝑏)²]
◼ The linear LMS estimator chooses 𝑎1 , … , 𝑎𝑛 , 𝑏 to
minimize the above expression.
70
Linear LMS Estimator (3)
◼ We first develop the solution for the case where 𝑛 = 1,
and then generalize.
◼ The estimator is Θ̂ = 𝑎𝑋 + 𝑏 and the MSE is 𝐄[(Θ − 𝑎𝑋 − 𝑏)²].
◼ We are interested in finding 𝑎 and 𝑏 that minimize this
MSE.
71
Linear LMS Estimator (4)
◼ If 𝑎 is chosen, then it’s easy to find the optimal 𝑏:
◼ Choose a constant 𝑏 to estimate the random variable
Θ − 𝑎𝑋.
◼ By the discussion in the previous section, the best choice is 𝑏 = 𝐄[Θ − 𝑎𝑋] = 𝐄[Θ] − 𝑎𝐄[𝑋].
◼ With this choice of 𝑏, it remains to minimize the MSE with respect to 𝑎, i.e., to minimize
𝐄[(Θ − 𝑎𝑋 − 𝐄[Θ] + 𝑎𝐄[𝑋])²] = 𝐄[(Θ − 𝑎𝑋 − 𝐄[Θ − 𝑎𝑋])²] = 𝐕(Θ − 𝑎𝑋)
◼ MSE = 𝐕(Θ − 𝑎𝑋)
= 𝐕(Θ) + 𝑎²𝐕(𝑋) + 2 cov(Θ, −𝑎𝑋)
= 𝐕(Θ) + 𝑎²𝐕(𝑋) − 2𝑎 cov(Θ, 𝑋)
72
Linear LMS Estimator (5)
◼ We set its derivative with respect to 𝑎 to zero and solve
for 𝑎. This yields
𝑎 = cov(Θ, 𝑋)/𝐕(𝑋) = 𝜌 𝜎Θ/𝜎𝑋
◼ 𝜎Θ and 𝜎𝑋: the standard deviations of Θ and 𝑋, respectively.
◼ 𝜌 = cov(Θ, 𝑋)/(𝜎Θ 𝜎𝑋): the correlation coefficient of Θ and 𝑋.
◼ With this choice of 𝑎, the linear LMS estimator is
Θ̂ = 𝑎𝑋 + 𝑏 = 𝑎𝑋 + 𝐄[Θ] − 𝑎𝐄[𝑋] = 𝐄[Θ] + (cov(Θ, 𝑋)/𝐕(𝑋)) (𝑋 − 𝐄[𝑋])
◼ The resulting MSE is
𝐕(Θ − Θ̂) = (1 − 𝜌²) 𝐕(Θ)
▪ Note: since 𝑏 = 𝐄[Θ − 𝑎𝑋], the estimation error Θ̂ − Θ has zero mean, so the MSE 𝐄[(Θ − Θ̂)²] equals 𝐕(Θ − Θ̂).
▪ Note: the maximal MSE is 𝐕(Θ), since |𝜌| ≤ 1.
73
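A minimal helper implementing these single-observation formulas is sketched below; the moment values in the example call are arbitrary illustrative numbers, not values from the notes.

```python
def linear_lms_coefficients(mean_theta, mean_x, var_theta, var_x, cov_theta_x):
    a = cov_theta_x / var_x                      # a = cov(Theta, X) / V(X)
    b = mean_theta - a * mean_x                  # b = E[Theta] - a E[X]
    rho_sq = cov_theta_x**2 / (var_theta * var_x)
    mse = (1.0 - rho_sq) * var_theta             # resulting MSE = (1 - rho^2) V(Theta)
    return a, b, mse

# Illustrative moments (assumed): E[Theta]=1, E[X]=2, V(Theta)=4, V(X)=9, cov=3.
a, b, mse = linear_lms_coefficients(1.0, 2.0, 4.0, 9.0, 3.0)
print(f"Theta_hat = {a:.3f} * X + {b:.3f},   MSE = {mse:.3f}")
```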
Example: Romeo/Juliet Meeting (1)
◼ Juliet is late by an amount 𝑋 uniformly distributed over
0, Θ , and Θ is a random variable with a uniform prior
𝑓Θ 𝜃 over the interval 0,1 .
◼ Let us derive the linear LMS estimator of Θ based on 𝑋.
◼ By the law of total expectation,
𝐄[𝑋] = 𝐄[𝐄[𝑋|Θ]] = 𝐄[Θ/2] = 𝐄[Θ]/2 = 1/4
◼ By the law of total variance,
𝐕(𝑋) = 𝐄[𝐕(𝑋|Θ)] + 𝐕(𝐄[𝑋|Θ]) = 𝐄[Θ²/12] + 𝐕(Θ/2)
= (1/12) ∫_{0}^{1} 𝜃² 𝑑𝜃 + (1/4) ⋅ (1 − 0)²/12 = 1/36 + 1/48 = 7/144
74
Example: Romeo/Juliet Meeting (2)
◼ Now we compute cov(Θ, 𝑋):
𝐄[Θ𝑋] = 𝐄[𝐄[Θ𝑋|Θ]] = 𝐄[Θ 𝐄[𝑋|Θ]] = 𝐄[Θ²/2] = 1/6
cov(Θ, 𝑋) = 𝐄[Θ𝑋] − 𝐄[Θ]𝐄[𝑋] = 1/6 − (1/2)(1/4) = 1/24
◼ The linear LMS estimator is
Θ̂ = 𝐄[Θ] + (cov(Θ, 𝑋)/𝐕(𝑋)) (𝑋 − 𝐄[𝑋]) = (6/7) 𝑋 + 2/7
◼ The conditional MSE is obtained from
𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑥] = (1 − 𝑥²)/(2 |log 𝑥|) − 2𝜃̂ (1 − 𝑥)/|log 𝑥| + 𝜃̂²
with 𝜃̂ = (6/7) 𝑥 + 2/7.
75
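A Monte Carlo sanity check of these moments and of the resulting estimator coefficients (a sketch; the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 1.0, size=2_000_000)   # Theta ~ U[0, 1]
x = rng.uniform(0.0, theta)                     # X | Theta = theta  ~  U[0, theta]

cov = np.cov(theta, x, bias=True)[0, 1]
a_hat = cov / x.var()
b_hat = theta.mean() - a_hat * x.mean()
print("E[X]  ~", x.mean(), " (exact 1/4)")
print("V(X)  ~", x.var(), " (exact 7/144 =", 7/144, ")")
print("cov   ~", cov, " (exact 1/24 =", 1/24, ")")
print("a, b  ~", a_hat, b_hat, " (exact 6/7, 2/7)")
```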
Example: Bias of a Coin (1)
◼ The probability of heads is modeled as Θ, and the prior is uniform over the interval [0, 1], so
𝐄[Θ] = 1/2, 𝐕(Θ) = 1/12, 𝐄[Θ²] = 1/3.
◼ The coin is tossed 𝑛 times and the number of heads is distributed as 𝑋 ~ Bin(𝑛, Θ).
◼ By the law of total expectation,
𝐄[𝑋] = 𝐄[𝐄[𝑋|Θ]] = 𝐄[𝑛Θ] = 𝑛/2.
◼ By the law of total variance,
𝐕(𝑋) = 𝐄[𝐕(𝑋|Θ)] + 𝐕(𝐄[𝑋|Θ]) = 𝐄[𝑛Θ(1 − Θ)] + 𝐕(𝑛Θ)
= 𝑛/2 − 𝑛/3 + 𝑛²/12 = 𝑛(𝑛 + 2)/12.
76
Example: Bias of a Coin (2)
◼ Now we compute cov(Θ, 𝑋):
𝐄[Θ𝑋] = 𝐄[𝐄[Θ𝑋|Θ]] = 𝐄[Θ 𝐄[𝑋|Θ]] = 𝐄[𝑛Θ²] = 𝑛/3,
cov(Θ, 𝑋) = 𝐄[Θ𝑋] − 𝐄[Θ]𝐄[𝑋] = 𝑛/3 − (1/2)(𝑛/2) = 𝑛/12.
◼ The linear LMS estimator is
Θ̂ = 𝐄[Θ] + (cov(Θ, 𝑋)/𝐕(𝑋)) (𝑋 − 𝐄[𝑋]) = 𝑋/(𝑛 + 2) + 1/(𝑛 + 2).
▪ This agrees with the LMS estimator 𝐄[Θ|𝑋] = (𝑋 + 1)/(𝑛 + 2) derived earlier (Example 8.13 in Section 8.3 of the textbook).
77
Homework #10
Textbook “Introduction to Probability”, 2nd Edition, D. Bertsekas and J. Tsitsiklis
Chapter 8, pp. 445–455, Problems 1, 2, 3, 5, 7, 10, 11
Due date: see the assignment posting on AjouBB (아주BB)
Homework #11
Textbook “Introduction to Probability”, 2nd Edition, D. Bertsekas and J. Tsitsiklis
Chapter 8, pp. 445–455, Problems 14, 15, 16, 17, 18, 19, 24
Due date: see the assignment posting on AjouBB (아주BB)
78