Probability and Statistics 1 (확통1), Lecture Note 09: Bayesian Statistical Inference
Statistical Inference (1)
◼ hypothesis testing
◼ significance testing
2
Statistical Inference (2)
◼ Most important methodologies of inference
◼ maximum a posteriori (MAP) probability rule
◼ regression
3
Probability versus Statistics
◼ Probability Theory
◼ Self-contained mathematical tool based on axioms
◼ For a problem, this gives a unique correct answer.
◼ Statistical Inference
◼ Different methods may give different answers; no principal best method can be selected.
◼ Two main viewpoints: Classical/Frequentist and Bayesian
5
Bayesian
◼ When trying to infer the nature of an unknown model, the Bayesian approach views the model as chosen randomly from a given class of models.
◼ In the Bayesian world, we regard a parameter Θ that characterizes the model as random.
◼ Once the distribution of Θ is given, only a single probabilistic model is assumed, which brings the Bayesian approach back within the realm of probability theory.
◼ First, postulate a prior distribution 𝑝Θ 𝜃 .
◼ Next, given observed data 𝑥, obtain a conditional
distribution (or likelihood) 𝑝𝑋|Θ (𝑥|𝜃).
◼ Finally, one can use Bayes' rule to derive a posterior
distribution 𝑝Θ|𝑋 𝜃|𝑥 .
◼ This captures all information that 𝑥 can provide about 𝜃.
6
Classical/Frequentist
◼ View the unknown quantity 𝜃 as an unknown constant.
◼ The classical approach strives to develop an estimate of 𝜃.
◼ For each possible value of 𝜃, various competitive
models are postulated.
◼ So, we are dealing with multiple candidate
probabilistic models, and a best model can be
selected.
7
Model versus Variable Inference
◼ Model inference: the object of study is a real phenomenon or process for which we wish to construct or validate a probability model based on available data
◼ e.g., do planets follow elliptical trajectories?
◼ Variable inference: we wish to estimate the value of one or more unknown variables on the basis of available measurements
◼ e.g., what is my current position, given a few GPS readings?
8
Statistical Inference Problems (1)
◼ Estimation: a model is fully specified, except for an unknown, possibly multidimensional, parameter 𝜃, which we wish to estimate.
◼ This parameter can be viewed either as a random variable (Bayesian approach) or as an unknown constant (classical approach).
◼ The objective is to produce an estimate of 𝜃 that is close to its true value.
9
Statistical Inference Problems (2)
◼ Binary hypothesis testing:
◼ Start with two hypotheses 𝐻0 and 𝐻1, and use the available data to decide which of the two is true.
◼ 𝑚-ary hypothesis testing:
◼ There is a finite number 𝑚 of competing hypotheses, 𝐻0, 𝐻1, …, 𝐻𝑚−1.
◼ Evaluation: typically by the probability of a decision error.
10
Contents
◼ Bayesian inference and the posterior distribution
◼ Point estimation, Hypothesis testing and MAP
◼ Bayesian least mean squares estimation
◼ Bayesian linear least mean squares estimation
11
Bayesian Inference (1)
◼ In Bayesian inference, the unknown quantity of interest is modeled as a random variable or as a finite-dimensional random vector.
◼ We usually denote it by Θ.
◼ We aim to extract information about Θ based on observing a collection 𝑋 = (𝑋1, …, 𝑋𝑛) of related random variables, called the observation vector.
12
Bayesian Inference (2)
◼ We assume that we know the joint distribution of Θ
and 𝑋.
◼ Equivalently, we assume that we know
◼ A prior distribution 𝑝Θ or 𝑓Θ , depending on whether
Θ is discrete or continuous.
◼ A conditional distribution 𝑝𝑋|Θ or 𝑓𝑋|Θ, depending on whether 𝑋 is discrete or continuous.
13
Bayesian Inference (3)
◼ After a specific value 𝑥 of 𝑋 has been given, a
complete answer to the Bayesian inference problem is
provided by the posterior distribution 𝑝Θ|𝑋 or 𝑓Θ|𝑋 .
◼ It encapsulates everything there is to know about Θ, given
the available information.
14
Summary of Bayesian Inference
1. We start with a prior distribution 𝑝Θ or 𝑓Θ for the
unknown random variable Θ.
2. We have a model 𝑝𝑋|Θ or 𝑓𝑋|Θ of the observation
vector 𝑋.
3. After observing the value 𝑥 of 𝑋, we form the
posterior distribution of Θ, using the appropriate
version of Bayes' rule.
15
Bayes’ Rule: Summary
◼ Depending on whether Θ and 𝑋 are each discrete or continuous, there are four versions of Bayes' rule.
16
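For reference, the four versions can be written out as follows; this is a sketch of the standard forms, using the same 𝑝/𝑓 notation as the rest of these notes.

```latex
% The four cases, depending on whether Theta and X are discrete or continuous.
\begin{align*}
\Theta \text{ discrete},\; X \text{ discrete}: \quad
 & p_{\Theta|X}(\theta\mid x)
   = \frac{p_\Theta(\theta)\,p_{X|\Theta}(x\mid\theta)}
          {\sum_{\theta'} p_\Theta(\theta')\,p_{X|\Theta}(x\mid\theta')} \\[2pt]
\Theta \text{ discrete},\; X \text{ continuous}: \quad
 & p_{\Theta|X}(\theta\mid x)
   = \frac{p_\Theta(\theta)\,f_{X|\Theta}(x\mid\theta)}
          {\sum_{\theta'} p_\Theta(\theta')\,f_{X|\Theta}(x\mid\theta')} \\[2pt]
\Theta \text{ continuous},\; X \text{ discrete}: \quad
 & f_{\Theta|X}(\theta\mid x)
   = \frac{f_\Theta(\theta)\,p_{X|\Theta}(x\mid\theta)}
          {\int f_\Theta(\theta')\,p_{X|\Theta}(x\mid\theta')\,d\theta'} \\[2pt]
\Theta \text{ continuous},\; X \text{ continuous}: \quad
 & f_{\Theta|X}(\theta\mid x)
   = \frac{f_\Theta(\theta)\,f_{X|\Theta}(x\mid\theta)}
          {\int f_\Theta(\theta')\,f_{X|\Theta}(x\mid\theta')\,d\theta'}
\end{align*}
```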
Example: Romeo/Juliet Meeting (1)
◼ Romeo and Juliet meeting: Juliet will be late on any
date by a random amount 𝑋, uniformly distributed over
the interval 0, 𝜃 .
◼ The maximum waiting time 𝜃 is unknown and is
modeled as the value of a random variable Θ which is
uniformly distributed in 0,1 .
◼ Assume that Juliet was late by an amount 𝑥 on their
first date.
◼ Question: How should Romeo use this information to
update the distribution of 𝜃?
17
Example: Romeo/Juliet Meeting (2)
◼ Prior density: 𝑓Θ(𝜃) = 1 if 0 ≤ 𝜃 ≤ 1, and 𝑓Θ(𝜃) = 0 otherwise.
18
Example: Romeo/Juliet Meeting (3)
◼ 𝑓Θ(𝜃) = 1 if 0 ≤ 𝜃 ≤ 1
◼ 𝑓𝑋|Θ(𝑥|𝜃) = 1/𝜃 if 0 ≤ 𝑥 ≤ 𝜃
◼ Use Bayes' rule: the posterior density is
𝑓Θ|𝑋(𝜃|𝑥) = 𝑓Θ(𝜃) 𝑓𝑋|Θ(𝑥|𝜃) / ∫_{𝑥}^{1} 𝑓Θ(𝜃′) 𝑓𝑋|Θ(𝑥|𝜃′) 𝑑𝜃′
= (1/𝜃) / ∫_{𝑥}^{1} (1/𝜃′) 𝑑𝜃′ = 1/(𝜃 |log 𝑥|), if 𝑥 ≤ 𝜃 ≤ 1,
and
𝑓Θ|𝑋(𝜃|𝑥) = 0, otherwise.
19
Example: Romeo/Juliet Meeting (4)
◼ Consider now a variation involving the first 𝑛 dates, with the amounts of lateness independent given Θ. Let 𝑋 = (𝑋1, …, 𝑋𝑛) and 𝑥 = (𝑥1, …, 𝑥𝑛). Then, similarly to the case 𝑛 = 1,
𝑓𝑋|Θ(𝑥|𝜃) = 1/𝜃^𝑛 if 𝑥(𝑛) ≤ 𝜃 ≤ 1,
with 𝑥(𝑛) = max{𝑥1, …, 𝑥𝑛}.
◼ Then, the posterior density is
𝑓Θ|𝑋(𝜃|𝑥) = (1/𝜃^𝑛) / ∫_{𝑥(𝑛)}^{1} (1/(𝜃′)^𝑛) 𝑑𝜃′, if 𝑥(𝑛) ≤ 𝜃 ≤ 1,
and
𝑓Θ|𝑋(𝜃|𝑥) = 0, otherwise.
20
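As a quick numerical sanity check of this posterior, the sketch below evaluates the unnormalized density 1/𝜃^𝑛 on a grid over [𝑥(𝑛), 1] and normalizes it; the observed lateness values are made-up illustrative numbers.

```python
import numpy as np

# Illustrative (made-up) lateness amounts observed on n = 3 dates.
x_obs = np.array([0.30, 0.55, 0.42])
n = len(x_obs)
x_max = x_obs.max()                          # x_(n) = max{x_1, ..., x_n}

theta = np.linspace(x_max, 1.0, 100_001)     # posterior support [x_(n), 1]
unnormalized = theta ** (-n)                 # f_Theta(theta) * f_{X|Theta}(x|theta)
dtheta = theta[1] - theta[0]
posterior = unnormalized / (unnormalized.sum() * dtheta)   # Riemann-sum normalization

print("integral of posterior:", posterior.sum() * dtheta)  # ~ 1.0
print("posterior mode:", theta[np.argmax(posterior)])      # = x_(n) = 0.55
```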
Example: Inference of Common
Mean of Gaussian (1)
◼ We observe a collection 𝑋 = 𝑋1 , … , 𝑋𝑛 of rvs, with an
unknown common mean, whose value we wish to infer.
Given the value of the common mean, we assume that
the 𝑋𝑖 are normal and independent of each other, with
known variances 𝜎12 , … , 𝜎𝑛2 , that is 𝑋𝑖 ~𝑁(Θ, 𝜎𝑖2 )
◼ We model the common mean as Θ, with a given normal prior, that is, Θ ~ 𝑁(𝑥0, 𝜎0²), i.e., 𝑓Θ(𝜃) = 𝑐1 exp(−(𝜃 − 𝑥0)²/(2𝜎0²)).
◼ Likelihood: 𝑓𝑋|Θ(𝑥|𝜃) = 𝑐2 ∏_{𝑖=1}^{𝑛} exp(−(𝑥𝑖 − 𝜃)²/(2𝜎𝑖²))
◼ 𝑐1 and 𝑐2 are normalizing constants.
◼ By Bayes' rule, the posterior density is
𝑓Θ|𝑋(𝜃|𝑥) = 𝑓Θ(𝜃) 𝑓𝑋|Θ(𝑥|𝜃) / ∫ 𝑓Θ(𝜃′) 𝑓𝑋|Θ(𝑥|𝜃′) 𝑑𝜃′
◼ Note: Usually, calculating the normalizing constant in
the denominator (called evidence) is challenging.
22
Example: Inference of Common
Mean of Gaussian (3)
◼ The numerator is of the form
𝑓Θ(𝜃) 𝑓𝑋|Θ(𝑥|𝜃) = 𝑐1 𝑐2 exp(− Σ_{𝑖=0}^{𝑛} (𝑥𝑖 − 𝜃)²/(2𝜎𝑖²)).
◼ The exponent is a quadratic form in 𝜃, so for some constant 𝑑 it can be written as
𝑑 ⋅ exp(−(𝜃 − 𝑚)²/(2𝑣)),
where
1/𝑣 = Σ_{𝑖=0}^{𝑛} 1/𝜎𝑖²,  𝑚/𝑣 = Σ_{𝑖=0}^{𝑛} 𝑥𝑖/𝜎𝑖².
23
Example: Inference of Common
Mean of Gaussian (4)
◼ The constant 𝑑 depends only on 𝑥, not on 𝜃. The denominator does not depend on 𝜃 either. Thus
𝑓Θ|𝑋(𝜃|𝑥) ∝ exp(−(𝜃 − 𝑚)²/(2𝑣))
◼ So, we conclude that the posterior density 𝑓Θ|𝑋 𝜃|𝑥 is
normal with mean 𝑚 and variance 𝑣.
◼ Recall prior: Θ ∼ 𝑁 𝑥0 , 𝜎02 .
◼ A remarkable property of the normal family: the posterior is in the same family as the prior. In general, when the posterior is in the same family as the prior, the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood.
24
Example: Inference of Common
Mean of Gaussian (5)
◼ This property opens up the possibility of efficient
recursive inference.
◼ Suppose that after 𝑋1 , … , 𝑋𝑛 are observed, an
additional observation 𝑋𝑛+1 is obtained.
◼ Instead of solving the inference problem from scratch,
we can view 𝑓Θ|𝑋1,…,𝑋𝑛 as our prior, and use the new
observation to obtain the new posterior 𝑓Θ|𝑋1 ,…,𝑋𝑛 ,𝑋𝑛+1 .
◼ Thus, the new posterior is a normal distribution with mean 𝑚′ and variance 𝑣′, where
𝑚′/𝑣′ = 𝑚/𝑣 + 𝑥_{𝑛+1}/𝜎_{𝑛+1}²,
1/𝑣′ = 1/𝑣 + 1/𝜎_{𝑛+1}².
25
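A minimal sketch of this recursive update in Python is shown below; the prior parameters, observations, and noise variances are illustrative assumptions, not values from the notes.

```python
# Conjugate recursive update for the common-mean example: prior N(m, v),
# new observation x_new ~ N(theta, sigma2_new).
def gaussian_posterior_update(m, v, x_new, sigma2_new):
    v_new = 1.0 / (1.0 / v + 1.0 / sigma2_new)      # 1/v' = 1/v + 1/sigma^2
    m_new = v_new * (m / v + x_new / sigma2_new)    # m'/v' = m/v + x/sigma^2
    return m_new, v_new

m, v = 0.0, 1.0                                     # assumed prior: x0 = 0, sigma0^2 = 1
for x_i, s2_i in [(1.2, 0.5), (0.8, 0.5), (1.1, 0.25)]:   # assumed data and variances
    m, v = gaussian_posterior_update(m, v, x_i, s2_i)
print("posterior mean m =", m, " posterior variance v =", v)
```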
Example: Beta Priors on the Bias of
a Coin
◼ Besides the normal family, another prominent example involves the Beta family, which is conjugate to the likelihood arising from Bernoulli trials.
◼ We wish to estimate the probability of heads, Θ, of a
biased coin, and suppose Θ has beta prior 𝑓Θ 𝜃 =
Beta(𝛼, 𝛽).
◼ The coin is tossed 𝑛 times and the number of heads is
denoted by 𝑋~Bin 𝑛, 𝜃 .
◼ The posterior of Θ is, for 0 ≤ 𝜃 ≤ 1,
𝑓Θ|𝑋(𝜃|𝑘) = 𝑐 𝑓Θ(𝜃) 𝑝𝑋|Θ(𝑘|𝜃) = 𝑑 𝑓Θ(𝜃) 𝜃^𝑘 (1 − 𝜃)^(𝑛−𝑘)
= (𝑑/𝐵(𝛼, 𝛽)) 𝜃^(𝑘+𝛼−1) (1 − 𝜃)^(𝑛−𝑘+𝛽−1),
which is also a beta density, with 𝛼′ = 𝑘 + 𝛼 and 𝛽′ = 𝑛 − 𝑘 + 𝛽.
26
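The conjugate update is a one-line computation; the sketch below applies it and reads off the MAP (mode) and conditional-expectation (mean) estimates of the resulting Beta posterior. The prior Beta(1, 1) and the counts are illustrative choices.

```python
def beta_binomial_update(alpha, beta, k, n):
    """Posterior Beta parameters after observing k heads in n tosses."""
    return alpha + k, beta + (n - k)

alpha, beta = 1.0, 1.0                              # uniform prior = Beta(1, 1)
alpha, beta = beta_binomial_update(alpha, beta, k=7, n=10)

map_estimate = (alpha - 1) / (alpha + beta - 2)     # mode of Beta (valid for alpha, beta > 1)
lms_estimate = alpha / (alpha + beta)               # mean of Beta
print(map_estimate, lms_estimate)                   # 0.7 = k/n and 8/12 = (k+1)/(n+2)
```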
Multiparameter Problems
◼ The case of multiple unknown parameters is entirely
similar. The principles for calculating the posterior density
are essentially the same, regardless of whether Θ
consists of one or multiple components.
◼ However, while the posterior density can be obtained in
principle using Bayes’ rule, a closed form solution should
not be expected in general.
◼ If Θ is high-dimensional, computing the denominator of
Bayes’ formula in terms of numerical integration becomes
formidable.
◼ We resort to sophisticated numerical approximation
methods, based on random sampling, such as Monte
Carlo integration, Gibbs sampling, or Markov chain
Monte Carlo (MCMC).
27
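As a flavor of these sampling methods, below is a minimal random-walk Metropolis sketch (one simple MCMC algorithm) for a one-dimensional posterior; the prior, data values, step size, and iteration counts are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: prior N(0, 1) on theta, three observations with unit noise variance.
# In this conjugate case the exact posterior mean is sum(data)/4 = 0.85, which the
# chain should approximately recover.
data = np.array([0.9, 1.1, 1.4])

def log_unnormalized_posterior(theta):
    log_prior = -0.5 * theta**2                        # N(0, 1) prior, up to a constant
    log_lik = -0.5 * np.sum((data - theta) ** 2)       # unit-variance Gaussian likelihood
    return log_prior + log_lik

theta, step = 0.0, 0.5
samples = []
for _ in range(20_000):
    proposal = theta + step * rng.standard_normal()    # random-walk proposal
    log_ratio = log_unnormalized_posterior(proposal) - log_unnormalized_posterior(theta)
    if np.log(rng.uniform()) < log_ratio:              # Metropolis acceptance test
        theta = proposal
    samples.append(theta)

print("MCMC posterior mean ~", np.mean(samples[2000:]))   # ~ 0.85
```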
Contents
◼ Bayesian inference and the posterior distribution
◼ Point estimation, Hypothesis testing and MAP
◼ Bayesian least mean squares estimation
◼ Bayesian linear least mean squares estimation
28
MAP (1)
◼ Given the value 𝑥 of the observation, we select a value of 𝜃, denoted 𝜃̂, that maximizes the posterior distribution:
◼ 𝑝Θ|𝑋(𝜃|𝑥) if Θ is discrete,
◼ 𝑓Θ|𝑋(𝜃|𝑥) if Θ is continuous.
29
MAP (2)
◼ This is called the Maximum a Posteriori probability
(MAP) rule.
30
MAP (3)
◼ When Θ is discrete, the MAP rule has an important
optimality property.
◼ Since it chooses 𝜃̂ to be the most likely value of Θ, the MAP rule maximizes the probability of a correct decision (selecting the correct value of Θ) for any given value 𝑥; i.e., for every decision rule 𝑔(𝑥),
𝐏(Θ = 𝑔(𝑥) | 𝑋 = 𝑥) ≤ 𝐏(Θ = 𝑔MAP(𝑥) | 𝑋 = 𝑥).
◼ This implies that it also maximizes the overall (averaged over all possible values of 𝑋) probability of a correct decision; that is, the following holds for all decision rules 𝑔(𝑋),
𝐏(Θ = 𝑔(𝑋) | 𝑋) ≤ 𝐏(Θ = 𝑔MAP(𝑋) | 𝑋).
Taking expectations with respect to 𝑋 and using the total probability law,
𝐏(Θ = 𝑔(𝑋)) ≤ 𝐏(Θ = 𝑔MAP(𝑋)).
31
Computational Shortcut
◼ Recall the posterior: 𝑝Θ|𝑋(𝜃|𝑥) = 𝑝Θ(𝜃) 𝑝𝑋|Θ(𝑥|𝜃) / Σ_{𝜃′} 𝑝Θ(𝜃′) 𝑝𝑋|Θ(𝑥|𝜃′)
◼ The denominator does not depend on 𝜃, so the MAP rule can simply maximize the numerator 𝑝Θ(𝜃) 𝑝𝑋|Θ(𝑥|𝜃), without computing the normalizing denominator.
32
Example: MAP for the Inference of
Common Mean of Gaussian
◼ 𝑋1, …, 𝑋𝑛 are independent normal rvs with an unknown common mean Θ ~ 𝑁(𝑥0, 𝜎0²) and known variances 𝜎1², …, 𝜎𝑛².
◼ Posterior: 𝑓Θ|𝑋(𝜃|𝑥) ∝ exp(−(𝜃 − 𝑚)²/(2𝑣)) with
1/𝑣 = Σ_{𝑖=0}^{𝑛} 1/𝜎𝑖²,  𝑚/𝑣 = Σ_{𝑖=0}^{𝑛} 𝑥𝑖/𝜎𝑖².
◼ The MAP estimate: 𝜃̂ = 𝑚,
◼ because the normal PDF is maximized at its mean
33
Example: MAP for Spam Filtering
◼ Let {𝑤1 , . . . , 𝑤𝑛 } be a collection of words whose
appearance suggests a spam message. Θ takes values
1 and 2, corresponding to spam or legitimate
messages, with given probabilities 𝑝Θ (1) and 𝑝Θ (2),
and 𝑋𝑖 is the Bernoulli rv that models the appearance
of 𝑤𝑖 (𝑋𝑖 = 1 if 𝑤𝑖 appears and 𝑋𝑖 = 0, otherwise)
◼ Posterior: 𝐏(Θ = 𝑚 | 𝑋1 = 𝑥1, …, 𝑋𝑛 = 𝑥𝑛)
= 𝑝Θ(𝑚) ∏_{𝑖=1}^{𝑛} 𝑝𝑋𝑖|Θ(𝑥𝑖|𝑚) / Σ_{𝑗=1}^{2} 𝑝Θ(𝑗) ∏_{𝑖=1}^{𝑛} 𝑝𝑋𝑖|Θ(𝑥𝑖|𝑗),  𝑚 = 1, 2.
◼ The MAP rule decides that the message is spam if
𝐏(Θ = 1 | 𝑋1 = 𝑥1, …, 𝑋𝑛 = 𝑥𝑛) > 𝐏(Θ = 2 | 𝑋1 = 𝑥1, …, 𝑋𝑛 = 𝑥𝑛), or equivalently
𝑝Θ(1) ∏_{𝑖=1}^{𝑛} 𝑝𝑋𝑖|Θ(𝑥𝑖|1) > 𝑝Θ(2) ∏_{𝑖=1}^{𝑛} 𝑝𝑋𝑖|Θ(𝑥𝑖|2)
34
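A minimal sketch of this decision rule in Python is shown below; the prior probabilities and per-word appearance probabilities are made-up illustrative values, and the comparison is done in the log domain for numerical stability.

```python
import numpy as np

prior = {1: 0.4, 2: 0.6}                        # Theta = 1: spam, Theta = 2: legitimate (assumed)
p_word_given_class = {                           # P(X_i = 1 | Theta = m) for each word w_i (assumed)
    1: np.array([0.8, 0.6, 0.5]),
    2: np.array([0.1, 0.2, 0.3]),
}
x = np.array([1, 0, 1])                          # observed appearance indicators

def log_joint(m):
    p = p_word_given_class[m]
    # Bernoulli likelihood p^x * (1-p)^(1-x) for each word, combined in log form.
    log_lik = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return np.log(prior[m]) + log_lik

decision = max(prior, key=log_joint)             # MAP rule: larger prior * likelihood
print("MAP decision:", "spam" if decision == 1 else "legitimate")
```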
Point Estimation
◼ Point estimate: a single numerical value that represents
our best guess of the parameter Θ
◼ Estimate: the deterministic numerical value 𝜃̂ that we choose upon observing 𝑥.
◼ 𝜃̂ is determined by applying some function 𝑔 to the observation 𝑥, resulting in 𝜃̂ = 𝑔(𝑥).
◼ 𝑔 is called the decision rule; applied to the random observation 𝑋, it defines the estimator Θ̂ = 𝑔(𝑋).
35
Two Popular Estimators
◼ We can use different decision rules 𝑔 to form different
estimators, and some will be better than others.
◼ Two popular estimates:
◼ MAP estimate: 𝜃̂ = argmax_𝜃 𝑝Θ|𝑋(𝜃|𝑥)
◼ Conditional expectation estimate: 𝜃̂ = 𝐄[Θ|𝑋 = 𝑥].
◼ Conditional expectation estimator is also called the
least mean squares (LMS) estimator.
◼ It minimizes the mean squared estimation error
36
Example: Romeo/Juliet Meeting (1)
◼ Juliet is late on the first date by a random amount 𝑋.
◼ The distribution of 𝑋 is uniform over 0, Θ .
◼ Θ is an unknown random variable with a uniform prior
𝑓Θ over the interval 0,1 .
◼ Recall: 𝑓Θ|𝑋(𝜃|𝑥) = 1/(𝜃 |log 𝑥|), if 𝑥 ≤ 𝜃 ≤ 1
◼ MAP estimate: 𝜃̂ = 𝑥,
◼ because 𝑓Θ|𝑋 𝜃|𝑥 is decreasing in 𝜃 over the range
𝑥, 1 .
◼ Note that the MAP estimate is optimistic: if Juliet is late by 𝑥 on the first date, the rule estimates the maximum lateness Θ to be exactly 𝑥, the smallest value consistent with the observation.
38
Example: Bias of a Coin
◼ We wish to estimate the probability of heads, Θ, of a
biased coin, and suppose Θ has uniform prior, Θ~𝑈[0,1].
◼ We want to derive the MAP and conditional expectation
estimators of Θ.
◼ When 𝑋 = 𝑘, the posterior of Θ is, for 0 ≤ 𝜃 ≤ 1,
𝑓Θ|𝑋(𝜃|𝑘) = (1/𝐵(𝑘 + 1, 𝑛 − 𝑘 + 1)) 𝜃^𝑘 (1 − 𝜃)^(𝑛−𝑘)
▪ MAP estimate: the maximum of the posterior is at 𝜃̂ = 𝑘/𝑛.
▪ Conditional expectation estimate: 𝐄[Θ|𝑋 = 𝑘] is the first moment of Beta(𝛼 = 𝑘 + 1, 𝛽 = 𝑛 − 𝑘 + 1),
𝐄[Θ|𝑋 = 𝑘] = 𝐵(𝛼 + 1, 𝛽)/𝐵(𝛼, 𝛽) = (𝑘 + 1)/(𝑛 + 2).
▪ For large 𝑛, the two estimates nearly coincide.
39
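The last point is easy to check numerically; the snippet below compares 𝑘/𝑛 and (𝑘 + 1)/(𝑛 + 2) for a fixed (illustrative) fraction of heads as 𝑛 grows.

```python
for n in (10, 100, 10_000):
    k = round(0.3 * n)                  # illustrative: ~30% of the tosses are heads
    map_est = k / n                     # MAP estimate under the uniform prior
    lms_est = (k + 1) / (n + 2)         # conditional expectation (LMS) estimate
    print(f"n={n:6d}  MAP={map_est:.4f}  LMS={lms_est:.4f}")
```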
Hypothesis Testing (1)
◼ Θ takes one of 𝑚 values, 𝜃1 , … , 𝜃𝑚 .
◼ 𝑚 is usually a small integer; often 𝑚 = 2.
40
Hypothesis Testing (2)
◼ MAP rule: select the hypothesis 𝐻𝑖 : Θ = 𝜃𝑖 with the
largest posterior probability 𝐏 Θ = 𝜃𝑖 |𝑋 = 𝑥 .
◼ Equivalently (a computational shortcut), it selects the hypothesis 𝐻𝑖 with the largest 𝑝Θ(𝜃𝑖) 𝑝𝑋|Θ(𝑥|𝜃𝑖) (if 𝑋 is discrete) or 𝑝Θ(𝜃𝑖) 𝑓𝑋|Θ(𝑥|𝜃𝑖) (if 𝑋 is continuous).
◼ The MAP rule minimizes the probability of selecting an
incorrect hypothesis, or the probability of error over all
decision rules.
41
Hypothesis Testing (3)
◼ Once we derive the MAP rule, we can compute the
probability of a correct decision or error.
◼ Suppose the true value of Θ is 𝜃𝑖.
◼ Let 𝑔MAP(𝑥) be the hypothesis selected by the MAP rule when 𝑋 = 𝑥; the decision is correct exactly when 𝑔MAP(𝑥) = 𝜃𝑖.
◼ The probability of a correct decision (selecting the correct value of Θ) when 𝑋 = 𝑥 is
𝐏(Θ = 𝑔MAP(𝑥) | 𝑋 = 𝑥),
which equals 𝐏(Θ = 𝜃𝑖 | 𝑋 = 𝑥) whenever the MAP rule selects 𝜃𝑖.
42
Hypothesis Testing (4)
◼ For each hypothesis 𝐻𝑖, let 𝑆𝑖 be the set of all 𝑥 for which the MAP rule selects 𝐻𝑖:
𝑆𝑖 = {𝑥 : 𝑔MAP(𝑥) = 𝐻𝑖}.
◼ Then the overall (averaged over all possible values of 𝑥) probability of a correct decision is
𝐏(correct) = 𝐏(Θ = 𝑔MAP(𝑋)) = Σ_{𝑖} 𝐏(Θ = 𝜃𝑖, 𝑋 ∈ 𝑆𝑖)
▪ The corresponding overall probability of error is
𝐏(error) = Σ_{𝑖} 𝐏(Θ ≠ 𝜃𝑖, 𝑋 ∈ 𝑆𝑖)
43
Example: Hypothesis Testing (1)
◼ We have two biased coins, with probabilities of heads
equal to 𝑝1 and 𝑝2 , respectively.
◼ We choose a coin at random: either coin is equally
likely to be chosen. This gives the uniform prior.
◼ We want to infer its identity (1 or 2), based on the
outcome of a single toss.
◼ Let 𝐻1 = {Θ = 1} and 𝐻2 = {Θ = 2} be the hypotheses
that coin 1 or 2 was chosen, respectively.
◼ Depending on the outcome of the toss, let 𝑋 = 1 if heads and 𝑋 = 0 if tails.
▪ MAP rule: compare 𝑝Θ(1) 𝑝𝑋|Θ(𝑥|1) with 𝑝Θ(2) 𝑝𝑋|Θ(𝑥|2) and select the hypothesis with the larger value.
44
Example: Hypothesis Testing (2)
◼ Since 𝑝Θ 1 = 𝑝Θ 2 = 1/2, we just need to compare
𝑝𝑋|Θ 𝑥|1 and 𝑝𝑋|Θ 𝑥|2 .
◼ For example, if 𝑝1 = 0.46, 𝑝2 = 0.52, and the outcome is a tail (𝑥 = 0), then 𝑝𝑋|Θ(0|1) = 1 − 𝑝1 > 𝑝𝑋|Θ(0|2) = 1 − 𝑝2, so we decide in favor of coin 1.
◼ Let us consider the general case that we toss a
randomly selected coin 𝑛 times.
◼ Let 𝑋 be the number of heads obtained.
◼ Then, the preceding argument of single toss is still
valid, and the MAP rule selects the hypothesis under
which the observed outcome is most likely.
45
Example: Hypothesis Testing (3)
◼ If 𝑋 = 𝑘, we should select 𝐻1 = {Θ = 1} if
𝑝𝑋|Θ(𝑘|1) = 𝑝1^𝑘 (1 − 𝑝1)^(𝑛−𝑘) > 𝑝𝑋|Θ(𝑘|2) = 𝑝2^𝑘 (1 − 𝑝2)^(𝑛−𝑘)
[Figure: the two likelihoods plotted as functions of 𝑘; they cross at a threshold 𝑘∗]
46
Example: Hypothesis Testing (4)
◼ The characteristic of the MAP rule, as illustrated in the
figure, is typical of decision rules in binary hypothesis
testing problems.
◼ It is specified by a partition of the observation space
into the two disjoint sets in which each of the two
hypotheses is chosen.
◼ In this example, the MAP rule is specified by a single
threshold 𝑘 ∗ :
◼ Accept Θ = 1 if 𝑘 ≤ 𝑘 ∗ , accept Θ = 2 otherwise.
47
Example: Hypothesis Testing (5)
◼ The overall probability of error is obtained by using the total
probability rule:
𝐏(error) = 𝐏(Θ = 1, 𝑋 > 𝑘∗) + 𝐏(Θ = 2, 𝑋 ≤ 𝑘∗)
= (1/2) Σ_{𝑘=𝑘∗+1}^{𝑛} 𝑐(𝑘) 𝑝1^𝑘 (1 − 𝑝1)^(𝑛−𝑘) + (1/2) Σ_{𝑘=0}^{𝑘∗} 𝑐(𝑘) 𝑝2^𝑘 (1 − 𝑝2)^(𝑛−𝑘),
where 𝑐(𝑘) denotes the binomial coefficient (𝑛 choose 𝑘).
48
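The threshold 𝑘∗ and the error probability can be computed directly from the binomial PMFs; the sketch below uses the 𝑝1, 𝑝2 values from this example, while 𝑛 = 50 is an assumed illustrative value.

```python
from math import comb

p1, p2, n = 0.46, 0.52, 50

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# MAP favors coin 1 when p1^k (1-p1)^(n-k) > p2^k (1-p2)^(n-k); since p1 < p2,
# the likelihood ratio is decreasing in k, so this region has the form {k <= k*}.
k_star = max(k for k in range(n + 1) if binom_pmf(k, n, p1) > binom_pmf(k, n, p2))

p_error = 0.5 * sum(binom_pmf(k, n, p1) for k in range(k_star + 1, n + 1)) \
        + 0.5 * sum(binom_pmf(k, n, p2) for k in range(0, k_star + 1))
print("k* =", k_star, "  P(error) =", round(p_error, 4))
```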
Contents
◼ Bayesian inference and the posterior distribution
◼ Point estimation, Hypothesis testing and MAP
◼ Bayesian least mean squares estimation
◼ Bayesian linear least mean squares estimation
49
Mean Squared Estimation Error
without Observation (1)
◼ First, let us consider the simpler problem of estimating Θ with a constant 𝜃̂, in the absence of an observation 𝑋.
◼ The estimation error: 𝜃̂ − Θ, which is random
◼ The mean squared estimation error (MSE):
𝐄[(Θ − 𝜃̂)²]
◼ Question: what is the minimum value of the MSE over all choices of 𝜃̂?
◼ Answer: 𝐕(Θ); the minimum is achieved when 𝜃̂ = 𝐄[Θ]. Equivalently,
𝐄[(Θ − 𝐄[Θ])²] ≤ 𝐄[(Θ − 𝜃̂)²], for all 𝜃̂.
50
Mean Squared Estimation Error
without Observation (2)
◼ 𝐄[(Θ − 𝜃̂)²]
= 𝐕(Θ − 𝜃̂) + (𝐄[Θ − 𝜃̂])²   // definition of variance
= 𝐕(Θ) + (𝐄[Θ − 𝜃̂])²        // variance is unchanged by the constant shift 𝜃̂
= 𝐕(Θ) + (𝐄[Θ] − 𝜃̂)²        // linearity of expectation
≥ 𝐕(Θ)                       // "=" is achieved when 𝜃̂ = 𝐄[Θ]
51
Mean Squared Estimation Error
with Observation
◼ Second, suppose that we have observation 𝑋.
◼ We still want to estimate Θ to minimize the MSE.
◼ Note that once we know the value 𝑥 of 𝑋, the situation
is identical to the one considered earlier, except that
we are now in a new setting: everything is conditioned
on 𝑋 = 𝑥.
◼ For any given observation 𝑥, the conditional expectation estimate 𝜃̂ = 𝐄[Θ|𝑋 = 𝑥] minimizes the conditional MSE 𝐄[(Θ − 𝜃̂)²|𝑋 = 𝑥].
◼ Equivalently, for any given observation 𝑥 and for all 𝜃̂,
𝐄[(Θ − 𝐄[Θ|𝑋 = 𝑥])² | 𝑋 = 𝑥] ≤ 𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑥].
52
Mean Squared Estimation Error in
General Case
◼ More generally, viewed across all values of the observation 𝑋, the MSE associated with an estimator 𝑔(𝑋) is defined as
𝐄[(Θ − 𝑔(𝑋))² | 𝑋].
◼ If we view 𝐄[Θ|𝑋] as a function of 𝑋, the preceding analysis shows that, out of all possible estimators 𝑔(𝑋) of Θ based on 𝑋, the conditional MSE is minimized when 𝑔(𝑋) = 𝐄[Θ|𝑋]:
◼ That is, for all estimators 𝑔(𝑋),
𝐄[(Θ − 𝐄[Θ|𝑋])² | 𝑋] ≤ 𝐄[(Θ − 𝑔(𝑋))² | 𝑋]
◼ Similarly, using the total expectation rule, for all estimators 𝑔(𝑋),
𝐄[(Θ − 𝐄[Θ|𝑋])²] ≤ 𝐄[(Θ − 𝑔(𝑋))²]
53
Example: Conditional MSE (1)
◼ We observe Θ with error 𝑊:
𝑋 =Θ+𝑊
where Θ ~ 𝑈[4, 10] and 𝑊 ~ 𝑈[−1, 1] is independent of Θ.
◼ We want to obtain the conditional MSE
𝐄[(Θ − 𝐄[Θ|𝑋])² | 𝑋 = 𝑥].
◼ 𝑓Θ 𝜃 = 1/6 if 4 ≤ 𝜃 ≤ 10 (and 0 otherwise).
◼ Given Θ = 𝜃, 𝑋 equals 𝜃 + 𝑊, so
▪ {𝑋|Θ = 𝜃} ~ 𝑈[𝜃 − 1, 𝜃 + 1],
▪ 𝑓𝑋|Θ(𝑥|𝜃) = 1/2 if 𝜃 − 1 ≤ 𝑥 ≤ 𝜃 + 1.
◼ Joint density: 𝑓Θ,𝑋(𝜃, 𝑥) = 𝑓Θ(𝜃) 𝑓𝑋|Θ(𝑥|𝜃) = (1/6) ⋅ (1/2) = 1/12
◼ when 𝜃 ∈ 4,10 and 𝑥 ∈ 𝜃 − 1, 𝜃 + 1 .
54
Example: Conditional MSE (2)
◼ The joint density of Θ and 𝑋 is uniform over the
parallelogram given in the following figure.
◼ Given that 𝑋 = 𝑥, the posterior density 𝑓Θ|𝑋 is
proportional to the joint density and is also uniform on
the corresponding vertical section of the parallelogram.
55
Example: Conditional MSE (3)
◼ At 𝑥 = 5:
▪ 𝜃̂ = 𝐄[Θ|𝑋 = 𝑥] = 5 and 𝑓Θ|𝑋(𝜃|𝑥) = 1/2 for 4 ≤ 𝜃 ≤ 6.
▪ CMSE = 𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑥] = ∫_{4}^{6} (𝜃 − 5)² ⋅ (1/2) 𝑑𝜃 = (1/2) ∫_{−1}^{1} 𝑦² 𝑑𝑦 = 1/3
◼ For 3 ≤ 𝑥 ≤ 5:
▪ 𝜃̂ = 4 + (𝑥 − 3)/2 = (𝑥 + 5)/2 and 𝑓Θ|𝑋(𝜃|𝑥) = 1/(𝑥 − 3) for 4 ≤ 𝜃 ≤ 𝑥 + 1.
▪ CMSE = 𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑥] = ∫_{4}^{𝑥+1} (𝜃 − (𝑥 + 5)/2)² ⋅ (1/(𝑥 − 3)) 𝑑𝜃
= (1/(𝑥 − 3)) ∫_{−(𝑥−3)/2}^{(𝑥−3)/2} 𝑦² 𝑑𝑦 = (𝑥 − 3)²/12
▪ Similarly, for 9 ≤ 𝑥 ≤ 11,
▪ CMSE = (𝑥 − 11)²/12
56
Example: Conditional MSE (4)
◼ Thus, 𝐄 Θ 𝑋 = 𝑥 is the midpoint of that section, which
is a piecewise linear function of 𝑥.
◼ Conditioned on a specific value 𝑥, the conditional MSE
𝐄[(Θ − 𝐄[Θ|𝑋 = 𝑥])² | 𝑋 = 𝑥]
is the conditional variance of Θ given 𝑋 = 𝑥.
57
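The piecewise formulas above amount to: given 𝑋 = 𝑥, Θ is uniform on the section [max(4, 𝑥 − 1), min(10, 𝑥 + 1)], so the LMS estimate is the midpoint of that interval and the conditional MSE is its variance. A small sketch:

```python
# Theta ~ U[4, 10], W ~ U[-1, 1], X = Theta + W (the example on slides 54-57).
def conditional_mean_and_mse(x):
    lo, hi = max(4.0, x - 1.0), min(10.0, x + 1.0)   # vertical section of the parallelogram
    mean = (lo + hi) / 2.0                           # E[Theta | X = x]: midpoint
    mse = (hi - lo) ** 2 / 12.0                      # variance of a uniform interval
    return mean, mse

for x in (4.0, 5.0, 7.0, 10.0):
    print(x, conditional_mean_and_mse(x))
# x = 5 gives (5.0, 1/3); x = 4 gives ((4+5)/2, 1/12), matching (x-3)^2/12.
```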
Example: Romeo/Juliet Meeting (1)
◼ Juliet is late on the first date by a random amount 𝑋
that is uniformly distributed over 0, Θ .
◼ Θ: uniform prior over the interval 0,1 .
◼ MAP estimate: 𝜃̂ = 𝑥.
◼ LMS estimate: 𝜃̂ = 𝐄[Θ|𝑋 = 𝑥] = ∫_{𝑥}^{1} 𝜃 ⋅ (1/(𝜃 |log 𝑥|)) 𝑑𝜃 = (1 − 𝑥)/|log 𝑥|
◼ In the above, we have used the conditional density
𝑓Θ|𝑋(𝜃|𝑥) = 1/(𝜃 |log 𝑥|), for 𝑥 ≤ 𝜃 ≤ 1.
◼ We want to calculate the conditional MSE for the MAP
and LMS estimates.
58
Example: Romeo/Juliet Meeting (2)
◼ Given that 𝑋 = 𝑥, for any estimate 𝜃̂, we have
𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑥]
= ∫_{𝑥}^{1} (𝜃 − 𝜃̂)² ⋅ (1/(𝜃 |log 𝑥|)) 𝑑𝜃
= ∫_{𝑥}^{1} (𝜃² − 2𝜃̂𝜃 + 𝜃̂²) ⋅ (1/(𝜃 |log 𝑥|)) 𝑑𝜃
= (1 − 𝑥²)/(2 |log 𝑥|) − 2𝜃̂ (1 − 𝑥)/|log 𝑥| + 𝜃̂².
▪ In the above, we have again used the conditional density
𝑓Θ|𝑋(𝜃|𝑥) = 1/(𝜃 |log 𝑥|).
59
Example: Romeo/Juliet Meeting (3)
◼ 𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑥] = (1 − 𝑥²)/(2 |log 𝑥|) − 2𝜃̂ (1 − 𝑥)/|log 𝑥| + 𝜃̂².
◼ For MAP: substituting 𝜃̂ = 𝑥,
𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑥] = 𝑥² + (3𝑥² − 4𝑥 + 1)/(2 |log 𝑥|)
◼ For LMS: substituting 𝜃̂ = (1 − 𝑥)/|log 𝑥|,
𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑥] = (1 − 𝑥²)/(2 |log 𝑥|) − ((1 − 𝑥)/|log 𝑥|)²
60
Example: Romeo/Juliet Meeting (4)
◼ The MAP estimate 𝑥 takes smaller values than the LMS estimate (1 − 𝑥)/|log 𝑥| for every 𝑥 in (0, 1).
◼ The LMS estimate has uniformly smaller conditional MSE.
61
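This comparison can be reproduced by evaluating the two conditional-MSE expressions on a grid of 𝑥 values, as in the sketch below.

```python
import numpy as np

x = np.linspace(0.05, 0.95, 19)
L = np.abs(np.log(x))                       # |log x|

def cmse(theta_hat, x, L):
    # E[(Theta - theta_hat)^2 | X = x] from slide 59
    return (1 - x**2) / (2 * L) - 2 * theta_hat * (1 - x) / L + theta_hat**2

cmse_map = cmse(x, x, L)                    # MAP estimate: theta_hat = x
cmse_lms = cmse((1 - x) / L, x, L)          # LMS estimate: theta_hat = (1 - x)/|log x|
print(np.all(cmse_lms <= cmse_map))         # True: LMS has uniformly smaller CMSE
```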
Example: Bias of a Coin (1)
◼ The probability of heads is modeled as Θ, and we
assume the prior of Θ is uniform over the interval 0,1 .
◼ We want to calculate the conditional MSE for the MAP
and LMS estimates.
◼ The coin is tossed 𝑛 times and the number of heads is
distributed by 𝑋~Bin(𝑛, Θ).
◼ When 𝑋 = 𝑘, the posterior density is Beta(𝛼 = 𝑘 + 1, 𝛽 = 𝑛 − 𝑘 + 1), and the MAP estimate is 𝜃̂ = 𝑘/𝑛.
◼ Using the formula for the moments of the Beta density,
𝐄[Θ^𝑚 | 𝑋 = 𝑘] = (𝑘 + 1)(𝑘 + 2)⋯(𝑘 + 𝑚) / ((𝑛 + 2)(𝑛 + 3)⋯(𝑛 + 𝑚 + 1)),
and for 𝑚 = 1 the LMS estimate is 𝜃̂ = 𝐄[Θ|𝑋 = 𝑘] = (𝑘 + 1)/(𝑛 + 2).
62
Example: Bias of a Coin (2)
◼ Given 𝑋 = 𝑘, the conditional MSE for any estimate 𝜃̂ is
𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑘] = 𝐄[Θ² | 𝑋 = 𝑘] − 2𝜃̂ 𝐄[Θ | 𝑋 = 𝑘] + 𝜃̂²
= (𝑘 + 1)(𝑘 + 2)/((𝑛 + 2)(𝑛 + 3)) − 2𝜃̂ (𝑘 + 1)/(𝑛 + 2) + 𝜃̂².
▪ For MAP (𝜃̂ = 𝑘/𝑛), the conditional MSE is
𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑘] = (𝑘 + 1)(𝑘 + 2)/((𝑛 + 2)(𝑛 + 3)) − 2(𝑘/𝑛)(𝑘 + 1)/(𝑛 + 2) + (𝑘/𝑛)².
▪ For LMS (𝜃̂ = 𝐄[Θ|𝑋 = 𝑘] = (𝑘 + 1)/(𝑛 + 2)), the conditional MSE is
𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑘] = 𝐄[Θ² | 𝑋 = 𝑘] − (𝐄[Θ | 𝑋 = 𝑘])²
= (𝑘 + 1)(𝑘 + 2)/((𝑛 + 2)(𝑛 + 3)) − ((𝑘 + 1)/(𝑛 + 2))².
63
Example: Bias of a Coin (3)
◼ LMS estimate has uniformly smaller conditional MSE.
64
Properties of Estimation Error (1)
◼ For the LMS estimator and its associated estimation
error, respectively, let us denote
Θ̂ = 𝐄[Θ|𝑋],  Θ̃ = Θ̂ − Θ
◼ 𝐄[Θ̃] = 𝐄[Θ̂ − Θ] = 𝐄[𝐄[Θ|𝑋]] − 𝐄[Θ] = 0   // law of iterated expectations: 𝐄[𝐄[𝑋|𝑌]] = 𝐄[𝑋]
▪ 𝐄[Θ̃|𝑋] = 𝐄[Θ̂ − Θ|𝑋]
= 𝐄[Θ̂|𝑋] − 𝐄[Θ|𝑋]
= 𝐄[𝐄[Θ|𝑋] | 𝑋] − 𝐄[Θ|𝑋]   // using 𝐄[𝐄[𝑋|𝑌]|𝑌] = 𝐄[𝑋|𝑌]
= 𝐄[Θ|𝑋] − 𝐄[Θ|𝑋] = 0
◼ The estimation error is unbiased since it has zero
unconditional and conditional mean.
65
Properties of Estimation Error (2)
◼ Again, Θ̂ = 𝐄[Θ|𝑋] and Θ̃ = Θ̂ − Θ.
◼ 𝐄[Θ̂Θ̃] = 𝐄[𝐄[Θ̂Θ̃|𝑋]]   // total expectation rule
= 𝐄[Θ̂ 𝐄[Θ̃|𝑋]]   // Θ̂ is completely determined by 𝑋
= 0   // 𝐄[Θ̃|𝑋] = 0 from the preceding result
◼ Cov(Θ̂, Θ̃) = 𝐄[Θ̂Θ̃] − 𝐄[Θ̂]𝐄[Θ̃] = 0 − 0 = 0.   // the error Θ̃ is uncorrelated with the estimator Θ̂
◼ Therefore, by considering the variance of both sides of Θ = Θ̂ − Θ̃, we have
𝐕(Θ) = 𝐕(Θ̂) + 𝐕(Θ̃)
66
Properties of Estimation Error (3)
◼ The observation 𝑋 is uninformative if the MSE 𝐄[Θ̃²] = 𝐕(Θ̃) is the same as 𝐕(Θ). (Note 𝐄[Θ̃²] = 𝐕(Θ̃) because 𝐄[Θ̃] = 0.) When is this the case?
◼ Using 𝐕(Θ) = 𝐕(Θ̂) + 𝐕(Θ̃), we see that 𝑋 is uninformative if and only if 𝐕(Θ̂) = 0.
◼ The variance of a rv is zero only when the rv is a constant, equal to its mean. Thus 𝑋 is uninformative if and only if the estimator Θ̂ = 𝐄[Θ|𝑋] is equal to 𝐄[Θ] for every value of 𝑋.
67
Contents
◼ Bayesian inference and the posterior distribution
◼ Point estimation, Hypothesis testing and MAP
◼ Bayesian least mean squares estimation
◼ Bayesian linear least mean squares estimation
68
Linear LMS Estimator (1)
◼ LMS estimator is sometimes hard to compute, and we
need alternatives.
◼ We derive an estimator by minimizing the MSE within
a restricted class of estimators: those that are linear
functions of the observations.
◼ This estimator may result in a higher MSE than the unrestricted LMS estimator 𝐄[Θ|𝑋].
◼ But it has a significant computational advantage.
◼ It requires simple calculations, involving only the means, variances, and covariances of Θ and the observations.
69
Linear LMS Estimator (2)
◼ A linear estimator Θ̂ of a random variable Θ, based on observations 𝑋1, …, 𝑋𝑛, has the form
Θ̂ = 𝑎1𝑋1 + ⋯ + 𝑎𝑛𝑋𝑛 + 𝑏
◼ Given a specified choice of the scalars 𝑎1, …, 𝑎𝑛, 𝑏, the corresponding MSE is
𝐄[(Θ − 𝑎1𝑋1 − ⋯ − 𝑎𝑛𝑋𝑛 − 𝑏)²]
◼ The linear LMS estimator chooses 𝑎1 , … , 𝑎𝑛 , 𝑏 to
minimize the above expression.
70
Linear LMS Estimator (3)
◼ We first develop the solution for the case where 𝑛 = 1,
and then generalize.
◼ The estimator is Θ̂ = 𝑎𝑋 + 𝑏 and the MSE is 𝐄[(Θ − 𝑎𝑋 − 𝑏)²].
◼ We are interested in finding 𝑎 and 𝑏 that minimize this
MSE.
71
Linear LMS Estimator (4)
◼ If 𝑎 is chosen, then it’s easy to find the optimal 𝑏:
◼ Choose a constant 𝑏 to estimate the random variable
Θ − 𝑎𝑋.
◼ By the discussion in the previous section, the best choice is 𝑏 = 𝐄[Θ − 𝑎𝑋] = 𝐄[Θ] − 𝑎𝐄[𝑋].
◼ With this choice of 𝑏, it remains to minimize the MSE with respect to 𝑎, i.e., to minimize
𝐄[(Θ − 𝑎𝑋 − 𝐄[Θ] + 𝑎𝐄[𝑋])²] = 𝐄[(Θ − 𝑎𝑋 − 𝐄[Θ − 𝑎𝑋])²] = 𝐕(Θ − 𝑎𝑋)
◼ MSE = 𝐕(Θ − 𝑎𝑋)
= 𝐕(Θ) + 𝑎²𝐕(𝑋) + 2 cov(Θ, −𝑎𝑋)
= 𝐕(Θ) + 𝑎²𝐕(𝑋) − 2𝑎 cov(Θ, 𝑋)
72
Linear LMS Estimator (5)
◼ We set its derivative with respect to 𝑎 to zero and solve
for 𝑎. This yields
𝑎 = cov(Θ, 𝑋)/𝐕(𝑋) = 𝜌 𝜎Θ/𝜎𝑋
◼ 𝜎Θ and 𝜎𝑋: the standard deviations of Θ and 𝑋, respectively.
◼ 𝜌 = cov(Θ, 𝑋)/(𝜎Θ 𝜎𝑋): the correlation coefficient of Θ and 𝑋.
◼ With this choice of 𝑎, the linear LMS estimator is
Θ̂ = 𝑎𝑋 + 𝑏 = 𝑎𝑋 + 𝐄[Θ] − 𝑎𝐄[𝑋] = 𝐄[Θ] + (cov(Θ, 𝑋)/𝐕(𝑋)) (𝑋 − 𝐄[𝑋])
◼ The resulting MSE is
𝐕(Θ − Θ̂) = (1 − 𝜌²) 𝐕(Θ)
▪ Note: since 𝑏 = 𝐄[Θ − 𝑎𝑋], the estimation error Θ̂ − Θ has zero mean, so the MSE 𝐄[(Θ − Θ̂)²] equals 𝐕(Θ − Θ̂).
▪ Note: the maximal MSE is 𝐕(Θ), since |𝜌| ≤ 1.
73
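A minimal helper implementing these single-observation formulas is sketched below; the moment values in the example call are arbitrary illustrative numbers, not values from the notes.

```python
def linear_lms_coefficients(mean_theta, mean_x, var_theta, var_x, cov_theta_x):
    a = cov_theta_x / var_x                      # a = cov(Theta, X) / V(X)
    b = mean_theta - a * mean_x                  # b = E[Theta] - a E[X]
    rho_sq = cov_theta_x**2 / (var_theta * var_x)
    mse = (1.0 - rho_sq) * var_theta             # resulting MSE = (1 - rho^2) V(Theta)
    return a, b, mse

# Illustrative moments (assumed): E[Theta]=1, E[X]=2, V(Theta)=4, V(X)=9, cov=3.
a, b, mse = linear_lms_coefficients(1.0, 2.0, 4.0, 9.0, 3.0)
print(f"Theta_hat = {a:.3f} * X + {b:.3f},   MSE = {mse:.3f}")
```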
Example: Romeo/Juliet Meeting (1)
◼ Juliet is late by an amount 𝑋 uniformly distributed over
0, Θ , and Θ is a random variable with a uniform prior
𝑓Θ 𝜃 over the interval 0,1 .
◼ Let us derive the linear LMS estimator of Θ based on 𝑋.
◼ By the law of total expectation,
𝐄[𝑋] = 𝐄[𝐄[𝑋|Θ]] = 𝐄[Θ/2] = 𝐄[Θ]/2 = 1/4
◼ By the law of total variance,
𝐕(𝑋) = 𝐄[𝐕(𝑋|Θ)] + 𝐕(𝐄[𝑋|Θ]) = 𝐄[Θ²/12] + 𝐕(Θ/2)
= (1/12) ∫_{0}^{1} 𝜃² 𝑑𝜃 + (1/4) ⋅ (1 − 0)²/12 = 1/36 + 1/48 = 7/144
74
Example: Romeo/Juliet Meeting (2)
◼ Now we compute cov(Θ, 𝑋):
𝐄[Θ𝑋] = 𝐄[𝐄[Θ𝑋|Θ]] = 𝐄[Θ 𝐄[𝑋|Θ]] = 𝐄[Θ²/2] = 1/6
cov(Θ, 𝑋) = 𝐄[Θ𝑋] − 𝐄[Θ]𝐄[𝑋] = 1/6 − (1/2)(1/4) = 1/24
◼ The linear LMS estimator is
Θ̂ = 𝐄[Θ] + (cov(Θ, 𝑋)/𝐕(𝑋)) (𝑋 − 𝐄[𝑋]) = (6/7) 𝑋 + 2/7
◼ The conditional MSE is obtained from
𝐄[(Θ − 𝜃̂)² | 𝑋 = 𝑥] = (1 − 𝑥²)/(2 |log 𝑥|) − 2𝜃̂ (1 − 𝑥)/|log 𝑥| + 𝜃̂²
with 𝜃̂ = (6/7) 𝑥 + 2/7.
75
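A Monte Carlo sanity check of these moments and of the resulting estimator coefficients (a sketch; the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 1.0, size=2_000_000)   # Theta ~ U[0, 1]
x = rng.uniform(0.0, theta)                     # X | Theta = theta  ~  U[0, theta]

cov = np.cov(theta, x, bias=True)[0, 1]
a_hat = cov / x.var()
b_hat = theta.mean() - a_hat * x.mean()
print("E[X]  ~", x.mean(), " (exact 1/4)")
print("V(X)  ~", x.var(), " (exact 7/144 =", 7/144, ")")
print("cov   ~", cov, " (exact 1/24 =", 1/24, ")")
print("a, b  ~", a_hat, b_hat, " (exact 6/7, 2/7)")
```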
Example: Bias of a Coin (1)
◼ The probability of heads is modeled as Θ, and the prior is uniform over the interval [0, 1], so
𝐄[Θ] = 1/2, 𝐕(Θ) = 1/12, 𝐄[Θ²] = 1/3.
◼ The coin is tossed 𝑛 times and the number of heads is distributed as 𝑋 ~ Bin(𝑛, Θ).
◼ By the law of total expectation,
𝐄[𝑋] = 𝐄[𝐄[𝑋|Θ]] = 𝐄[𝑛Θ] = 𝑛/2.
◼ By the law of total variance,
𝐕(𝑋) = 𝐄[𝐕(𝑋|Θ)] + 𝐕(𝐄[𝑋|Θ]) = 𝐄[𝑛Θ(1 − Θ)] + 𝐕(𝑛Θ)
= 𝑛/2 − 𝑛/3 + 𝑛²/12 = 𝑛(𝑛 + 2)/12.
76
Example: Bias of a Coin (2)
◼ Now we compute cov(Θ, 𝑋):
𝐄[Θ𝑋] = 𝐄[𝐄[Θ𝑋|Θ]] = 𝐄[Θ 𝐄[𝑋|Θ]] = 𝐄[𝑛Θ²] = 𝑛/3,
cov(Θ, 𝑋) = 𝐄[Θ𝑋] − 𝐄[Θ]𝐄[𝑋] = 𝑛/3 − (1/2)(𝑛/2) = 𝑛/12.
◼ The linear LMS estimator is
Θ̂ = 𝐄[Θ] + (cov(Θ, 𝑋)/𝐕(𝑋)) (𝑋 − 𝐄[𝑋]) = 𝑋/(𝑛 + 2) + 1/(𝑛 + 2).
▪ This agrees with the LMS estimator 𝐄[Θ|𝑋] = (𝑋 + 1)/(𝑛 + 2) derived earlier (Example 8.13 in Section 8.3 of the textbook).
77
Homework #10
Textbook “Introduction to Probability”, 2nd Edition, D. Bertsekas and J. Tsitsiklis
Chapter 8, pp. 445–455, Problems 1, 2, 3, 5, 7, 10, 11
Due date: see the assignment posting on AjouBB (아주BB)
Homework #11
Textbook “Introduction to Probability”, 2nd Edition, D. Bertsekas and J. Tsitsiklis
Chapter 8, pp. 445–455, Problems 14, 15, 16, 17, 18, 19, 24
Due date: see the assignment posting on AjouBB (아주BB)
78