
1. Bayes Theorem

Why Bayes’ Rule?

We’re learning how to estimate an unknown random variable X based on an observation Y (which gives us some clue about what X might be). This is exactly what we’re doing in adaptive filtering and optimal estimation problems.

To understand how to make this estimate as good as possible, we need to understand the relationship between X and Y, and that’s where Bayes' Rule and joint PDFs come in.

What is Bayes’ Rule in probability terms?

In its simplest form:

P(X|Y) = P(Y|X) ⋅ P(X) / P(Y)

This just says:

"If I know how likely Y is when X happens, and I know how likely X is in general, then
I can figure out how likely X is when Y happens."

Bayes' Rule with Probability Densities (PDFs)

When we deal with continuous random variables, we use probability density functions
(PDFs) instead of probabilities.

Bayes’ Rule becomes:

f_{X|Y}(x|y) = f_{Y|X}(y|x) ⋅ f_X(x) / f_Y(y)

Here:

f_{X|Y}(x|y) is the PDF of X given observation Y = y

f_{Y|X}(y|x) is the likelihood — how likely are we to observe Y = y given X = x

f_X(x) is the prior PDF of X

f_Y(y) is just a normalization term (makes the PDF integrate to 1)
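
To make this concrete, here is a minimal numerical sketch (not part of the original notes) of applying Bayes' rule with PDFs on a grid. The model is assumed purely for illustration: X ~ N(0, 1), Y = X + noise with noise standard deviation 0.5, and an observed value y₀ = 1.2.

```python
# Minimal sketch: Bayes' rule with PDFs evaluated on a grid.
# Assumed toy model: X ~ N(0, 1), Y = X + noise, noise ~ N(0, 0.5^2), observation y0 = 1.2.
import numpy as np
from scipy.stats import norm

x = np.linspace(-5, 5, 2001)        # grid of candidate values for X
dx = x[1] - x[0]
y0 = 1.2                            # the specific observed value Y = y0

prior = norm.pdf(x, loc=0.0, scale=1.0)          # f_X(x)
likelihood = norm.pdf(y0, loc=x, scale=0.5)      # f_{Y|X}(y0 | x), viewed as a function of x

unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dx)   # dividing by f_Y(y0)

print("posterior area:", posterior.sum() * dx)   # ~1.0, as a PDF should be
```

The division in the second-to-last line plays the role of f_Y(y₀): it only rescales the curve so that it integrates to 1.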
Intuition: What does this mean?

Think of f_{X|Y}(x|y) as a “slice” of the joint PDF f_{X,Y}(x, y) at a specific observed value of y.

Imagine a 3D plot: x-axis is X, y-axis is Y, height is the joint PDF value.

If we observe a specific y = y₀, we are slicing this surface vertically at that value.

What we’re left with is a curve along the X-axis that shows how likely different x values are, given that slice.

That curve is the conditional distribution f_{X|Y}(x|y₀).

Expectation from the Joint PDF

Here’s where it gets really elegant:


We want to calculate the expected value (i.e., average) of X, and we want to see how this can be done by conditioning on Y.
Let’s say:

E[X] = ∬ x ⋅ f_{X,Y}(x, y) dx dy

Using Bayes’ Rule, we rewrite:

E[X] = ∫ [ ∫ x ⋅ f_{X|Y}(x|y) dx ] f_Y(y) dy

The inner integral is:

E[X|Y = y] = ∫ x ⋅ f_{X|Y}(x|y) dx

So overall:

E[X] = E [E[X|Y ]]

This is called the law of total expectation or the tower property.

It means: You can compute the average of X by:

1. First averaging it over each slice (conditional on Y ),


2. Then averaging those results over the distribution of Y .

This is how nested expectations work, and it’s crucial in Bayesian estimation and
adaptive filtering.
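
Here is a small Monte Carlo check of the tower property (a sketch only; the jointly Gaussian model below is assumed so that E[X|Y] has a simple closed form):

```python
# Minimal sketch of the tower property E[X] = E[E[X|Y]].
# Assumed toy model: X ~ N(1, 1), Y = X + V with V ~ N(0, 0.5^2).
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(1.0, 1.0, n)             # prior mean 1, variance 1
v = rng.normal(0.0, 0.5, n)
y = x + v

# For jointly Gaussian (X, Y): E[X|Y] = E[X] + Cov(X, Y) / Var(Y) * (Y - E[Y])
cond_mean = 1.0 + (1.0 / 1.25) * (y - 1.0)

print(np.mean(x))            # direct average of X
print(np.mean(cond_mean))    # average of the per-slice conditional averages -> same value (~1.0)
```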
Summary of what we just did

Bayes' rule gives us a way to go from the likelihood f_{Y|X}(y|x) to the posterior distribution f_{X|Y}(x|y).

Once we observe Y = y₀, we can compute the updated distribution of X.

We can use this conditional distribution to compute the best estimate of X, like the least mean square estimate, by choosing:

x̂ = E[X | Y = y₀]

which is the mean of the distribution of X given Y = y₀.
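
As a quick illustration (continuing the same assumed Gaussian toy model as in the earlier sketch), the posterior mean computed numerically on a grid matches the known closed-form conditional mean:

```python
# Minimal sketch: the least-mean-square estimate is the mean of the posterior f_{X|Y}(x | y0).
# Assumed toy model: X ~ N(0, 1), Y = X + N(0, 0.5^2), observation y0 = 1.2.
import numpy as np
from scipy.stats import norm

x = np.linspace(-5, 5, 2001)
dx = x[1] - x[0]
y0 = 1.2

posterior = norm.pdf(y0, loc=x, scale=0.5) * norm.pdf(x, 0.0, 1.0)
posterior /= posterior.sum() * dx                  # normalize so it integrates to 1

x_hat = (x * posterior).sum() * dx                 # numerical E[X | Y = y0]
print(x_hat, y0 / (1.0 + 0.5**2))                  # ~0.96 both: matches the Gaussian closed form
```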

2. Optimal Estimator

Main point

The conditional mean

E[X ∣ Y ]

is the optimal estimator of a random variable X given observation Y in the mean squared error (MSE) sense. It is unbiased and minimizes E[(X − X̂)²] among all possible functions X̂ = g(Y), meaning no other estimator achieves a lower MSE.

Explanation

You have:

An unknown variable X (like a signal you're trying to estimate),

An observation Y (something related to X, e.g., a noisy version of it),

You want to find the best estimate x̂(y) of X, based only on what you observe: Y = y.

The best estimator is the one that minimizes:

E[(X − x̂(Y))²]

The Candidate Estimator: Conditional Expectation

We're considering:
x̂(Y) = E[X ∣ Y]

But we want to prove that this is indeed the best possible estimator (i.e., no other
function of Y does better).

Step 1: Is this estimator unbiased?

We want to check if:

E[X − x̂(Y)] = 0

And yes — it is. Why?


Because:

E[X − E[X ∣ Y ]] = 0

This follows from the law of total expectation.


So this estimator doesn’t consistently overestimate or underestimate — it’s unbiased.

Step 2: Is it the best (i.e., MMSE optimal)?

Suppose someone gives you another estimator, say:

g(Y )

We want to prove that it can’t be better than E[X ∣ Y] .

The Trick: Decompose the error term

We write:

X − g(Y) = (X − E[X ∣ Y]) + (E[X ∣ Y] − g(Y))

where the first bracket is the true error and the second is the difference from the optimal estimator.

So when you compute the expected squared error:

E[(X − g(Y))²]

you expand it into:

E[((X − E[X ∣ Y]) + (E[X ∣ Y] − g(Y)))²]

Expanding the square gives 3 terms:

E[(X − E[X ∣ Y])²] — this is the minimum achievable error

E[(E[X ∣ Y] − g(Y))²] — this is the extra penalty for not using the optimal estimator
A cross term — but...

Here’s the key insight:


The cross term vanishes because the estimation error X − E[X ∣ Y] is uncorrelated
with any function of Y .
(This follows from the orthogonality principle.)
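
Filling in that step explicitly (a standard iterated-expectation argument; the notes state it but do not show it): since E[X ∣ Y] − g(Y) is itself a function of Y, condition on Y first:

E[(X − E[X ∣ Y]) ⋅ (E[X ∣ Y] − g(Y))] = E[ E[X − E[X ∣ Y] ∣ Y] ⋅ (E[X ∣ Y] − g(Y)) ] = E[ 0 ⋅ (E[X ∣ Y] − g(Y)) ] = 0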

So:

E[(X − g(Y))²] = E[(X − E[X ∣ Y])²] + E[(E[X ∣ Y] − g(Y))²]

where the first term is the minimum possible error and the second term is always ≥ 0.

Which means:
✨ Only when g(Y) = E[X ∣ Y] does that extra term become zero — and you reach the lowest possible MSE.

Intuition Behind Orthogonality

Imagine your true value X is a point in space. You observe a distorted version Y , and
you want to "guess" X based on Y .

The best guess is like projecting X onto a subspace defined by functions of Y . This
projection is the conditional expectation:

x̂(Y) = E[X ∣ Y]

In geometry, the error vector (true minus estimate) is perpendicular (orthogonal) to your estimate.

That’s what the math shows:

(X − x̂(Y)) ⊥ x̂(Y)

and more generally,

(X − x̂(Y)) ⊥ any function of Y
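
A quick Monte Carlo sketch of this orthogonality (the Gaussian toy model is assumed, as in the earlier sketches, so that E[X ∣ Y] is known exactly):

```python
# Minimal sketch: the error X - E[X|Y] is (approximately) uncorrelated with arbitrary functions of Y.
# Assumed toy model: X ~ N(0, 1), Y = X + N(0, 0.5^2), so E[X|Y] = (1/1.25) Y.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.normal(0.0, 1.0, n)
y = x + rng.normal(0.0, 0.5, n)

x_hat = (1.0 / 1.25) * y           # E[X|Y] for this jointly Gaussian pair
err = x - x_hat                    # estimation error

for g in (y, y**2, np.sin(y)):     # a few arbitrary functions of Y
    print(np.mean(err * g))        # all approximately 0
```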

Final Conclusion

The MMSE optimal estimator is E[X ∣ Y]

It is unbiased
It produces an error uncorrelated with any function of Y
No other estimator g(Y ) can do better
Hence, it minimizes E[(X − x̂(Y))²]

3. Estimating Random Vectors

The Big Idea

So far, we were estimating a single random variable X from a single observation Y — and we saw that the optimal estimate is the conditional mean:

X̂ = E[X ∣ Y]

1. What happens if we have multiple observations?

Let’s say you're estimating tomorrow's temperature (X).

Instead of just knowing today's temperature (Y ), you now also know yesterday’s, the
day before, and so on.
So your observations are now a vector:

Y = [Y_0, Y_1, …, Y_N]^T

Intuitively, more information means a better estimate — you're not just guessing
based on one point, but on a trend.

💡 Good news:
The optimal estimate is still the conditional mean:

X̂ = E[X ∣ Y]

It doesn’t matter whether Y is scalar or a vector — the math still works!


The derivation still holds — just with more components in the conditioning.

2. What happens if the thing you’re estimating is also a vector?

Now imagine you're estimating the temperature profile for the next several days.
So now X itself is a vector:
X = [X_0, X_1, …, X_M]^T

And you have multiple past temperature readings Y.

Again, the optimal estimate is still:

X̂ = E[X ∣ Y]

But this means you are estimating each component X_k using a function of all the observed values:

X̂_k = h_k(Y) = E[X_k ∣ Y]

So you're really just doing M + 1 scalar estimations — one for each component of X.

3. What about the error?

We define the error vector:

e = X − X̂

This tells us how far off we are in each component of the estimate.
Then, we want to minimize the overall error. There are two equivalent ways to measure this:

Option A: L2 Norm (Euclidean distance)

This is the standard:

∥e∥² = ∑_{k=0}^{M} e_k² = total squared error

We try to find the best functions h_0, …, h_M that minimize this expected error:

min_{h_0,…,h_M} E[∥e∥²]

Option B: Using Trace of Covariance Matrix

The error covariance matrix is:

E[ee^T]

This is a matrix where:

the diagonal has the variances of each component error (e₀², e₁², etc.)
the off-diagonal has the correlations between errors in different components.

The trace of this matrix (sum of diagonal elements) is just:

trace(E[ee^T]) = ∑_k E[e_k²]

Which is exactly the same as the L2 norm squared.
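
A tiny numerical check of that equivalence (the error vectors below are just made-up standard-normal draws standing in for e = X − X̂):

```python
# Minimal sketch: trace of the error covariance matrix equals the expected squared error norm.
import numpy as np

rng = np.random.default_rng(2)
e = rng.normal(size=(100_000, 3))                      # stand-in error vectors e = X - X_hat
R_e = (e[:, :, None] * e[:, None, :]).mean(axis=0)     # sample E[e e^T], shape (3, 3)

print(np.trace(R_e))                    # sum of the diagonal entries
print(np.mean(np.sum(e**2, axis=1)))    # E[||e||^2]  -> the same number
```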

Final Insight:

Whether you're estimating a scalar from a vector of observations, or estimating a vector from a vector — the conditional mean remains optimal.

All the complexity is just notation and dimensionality — the concept remains the
same:

Use all the information available (vector Y) to find the best mean-squared-error-
minimizing estimate of your target (scalar or vector X).

4. Linear Estimators

What’s the problem again?

We learned that the optimal estimator of a random variable X, given an observation Y, is the conditional expectation:

X̂(Y) = E[X ∣ Y]

This is unbiased and has the lowest possible mean squared error (MSE) among all
estimators - linear or nonlinear.

BUT...

This conditional expectation is usually very hard or impossible to compute in real life.

Why? Because:

We usually don’t know the exact joint PDF p(x, y).


Even if we did, integrating over it to compute E[X ∣ Y] is computationally difficult.
So what’s the solution?

We approximate the optimal estimator using something more tractable - a linear estimator:

X̂(Y) = HY + b

Where:

H is a matrix (or scalar if Y is scalar),


b is a bias term (vector or scalar).

This is a simple linear function of the observation Y.

Linear estimators are much easier to compute because we only need means and
covariances, not full PDFs.

But wait — what are we sacrificing?

We're no longer guaranteed to get the optimal estimate, except in one special case:

If X and Y are jointly Gaussian, then the linear estimator is the optimal one!

That’s because in Gaussian distributions, the conditional expectation E[X ∣ Y] is itself a linear function of Y.

So:

If X and Y are jointly Gaussian, linear estimator = optimal estimator


If not, linear estimator ≠ optimal, but still pretty good and way easier to compute.
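
One way to see the Gaussian case empirically (a sketch with an assumed toy model, X ~ N(0, 1) and Y = X + N(0, 0.5²)): approximate E[X ∣ Y] by binning the observations and averaging X inside each bin, then compare against the straight line with slope Cov(X, Y)/Var(Y) = 1/1.25.

```python
# Minimal sketch: for jointly Gaussian (X, Y), the conditional mean E[X|Y] traced out by binning
# lies (approximately) on a straight line. Assumed toy model: X ~ N(0,1), Y = X + N(0, 0.5^2).
import numpy as np

rng = np.random.default_rng(3)
n = 2_000_000
x = rng.normal(0.0, 1.0, n)
y = x + rng.normal(0.0, 0.5, n)

edges = np.linspace(-2, 2, 9)               # bin edges for Y
idx = np.digitize(y, edges)
for k in range(1, len(edges)):
    sel = idx == k                          # samples whose Y falls in bin k
    y_mid = 0.5 * (edges[k - 1] + edges[k])
    print(f"y ~ {y_mid:+.2f}: binned E[X|Y] = {x[sel].mean():+.3f}, "
          f"linear prediction = {(1.0 / 1.25) * y_mid:+.3f}")
```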

Step 1: Zero-mean assumption

We assume both X and Y have zero mean:

E[X] = 0, E[Y ] = 0

Why? Because:

If they’re not, we can just subtract the mean from each to make them zero mean.
Subtracting the mean is easy and doesn’t affect the correlation structure.

This simplifies the math: now the bias term b becomes unnecessary because:

E[X̂] = H E[Y] + b = 0 ⇒ b = 0

So the linear estimator becomes:

X̂(Y) = HY

Step 2: Minimize the mean squared error (MSE)

We now find the best matrix H by minimizing:


E[∥X − HY∥²]

This is a standard optimization problem and leads to the Wiener-Hopf equation, which gives us the best linear estimator in terms of covariances:

H = R_XY R_Y^{-1}

Where:

R_XY = E[X Y^T]: cross-covariance matrix

R_Y = E[Y Y^T]: auto-covariance of the observations
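
Here is a minimal sketch of computing H from sample covariances (the data model, true weights, and noise level are assumed purely for illustration):

```python
# Minimal sketch: best linear (Wiener) estimator H = R_XY R_Y^{-1} from sample covariances.
# Assumed toy model: zero-mean Y in R^3, scalar X = a^T Y + noise.
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
y = rng.normal(size=(n, 3))                    # zero-mean observation vectors Y
a = np.array([0.5, -1.0, 2.0])                 # made-up true weights
x = y @ a + 0.3 * rng.normal(size=n)           # scalar X linearly related to Y plus noise

R_xy = (x[:, None] * y).mean(axis=0)                   # sample E[X Y^T], shape (3,)
R_y = (y[:, :, None] * y[:, None, :]).mean(axis=0)     # sample E[Y Y^T], shape (3, 3)

H = R_xy @ np.linalg.inv(R_y)                  # Wiener solution
print(H)                                       # close to the true weights [0.5, -1.0, 2.0]

x_hat = y @ H                                  # linear MMSE estimates of X
print(np.mean((x - x_hat) ** 2))               # residual MSE ~ 0.09 (the noise variance)
```

Notice that only means and covariances enter the computation — the full joint PDF is never needed.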

Intuition Recap

Imagine:

You can't get the perfect answer (conditional expectation).


But you can get a pretty good approximation by drawing a straight line (linear
function) that tries to best match the observations to the quantity you're trying to
estimate.
This is like fitting a line to noisy data — not perfect, but often good enough,
especially if the noise is Gaussian.

5. Linear Estimators Example

The key idea here is that we're trying to recover or "undo" the effect of a noisy
communication channel, using a linear estimator (filter).

What's the goal?

We want to recover the original transmitted signal D(n) (like a voice or data stream)
from what we actually received Y (n), which is a distorted and noisy version of it.
Think of D(n) like the clean audio you’re sending over a walkie-talkie, and Y (n) like the
scratchy version your friend hears on the other end.

What does the system look like?

1. Transmitter sends: A clean sequence D(n).


2. Channel adds distortion: Like echoes, attenuation, delays (modeled by a system
with impulse response G(n)).
3. Noise gets added: Random, unwanted stuff (V (n)), like static on a radio.
4. Receiver gets: Y (n) = D(n) ∗ G(n) + V (n) → (convolution of signal with channel +
noise).
5. We apply a filter H(n): This is the equalizer at the receiver that tries to undo the
distortion and cancel out the noise.
6. Output estimate: D̂(n) = H (n) ∗ Y (n) — our best guess of what was sent.

The Ideal Situation

Imagine if we could find a filter H (n) such that:


H(n) ∗ G(n) = δ(n − n₀)

This means: if we convolve the channel response G(n) with our filter H (n), we get a
delta function — a perfect spike at some delay n₀.

Why is this ideal? Because that would mean all the spreading/delaying effects of the
channel are reversed. So now:


D̂(n) = H(n) ∗ Y(n) = δ(n − n₀) ∗ D(n) + H(n) ∗ V(n) = delayed D(n) + filtered noise

Perfect! Except... the noise is still there.

What's the problem then?

While H (n) ∗ G(n) ≈ δ(n − n₀) might give us a great recovery of the signal...
It might horribly amplify the noise.
Why? Because to cancel the effects of the channel, H (n) may need huge gain at
some frequencies (especially if the channel suppresses certain frequencies).

Imagine the channel wipes out mid-range tones in your music. To recover them, the
equalizer might crank the mid-range way up — but that also cranks up noise in that
range.
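
A small simulation sketch of this effect (the channel taps, noise level, and binary symbols below are assumed purely for illustration): a blind inverse ("zero-forcing") filter undoes the channel exactly, but the remaining error is dominated by noise that the inverse filter has amplified roughly five-fold.

```python
# Minimal sketch: inverting the channel recovers the signal but amplifies the noise.
# Assumed toy setup: D(n) = random +/-1 symbols, channel G = [1.0, 0.9], additive noise V(n).
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(5)
n = 50_000
d = rng.choice([-1.0, 1.0], size=n)             # transmitted symbols D(n)

g = np.array([1.0, 0.9])                        # channel G(n): near-null at high frequencies
noise = 0.1 * rng.normal(size=n)                # additive noise V(n)
y = lfilter(g, [1.0], d) + noise                # received Y(n) = D(n)*G(n) + V(n)

d_hat = lfilter([1.0], g, y)                    # zero-forcing equalizer: the exact inverse 1/G(z)

print("error power after equalization:", np.mean((d_hat - d) ** 2))
print("noise power before equalizer  :", np.mean(noise ** 2))
print("noise power after equalizer   :", np.mean(lfilter([1.0], g, noise) ** 2))   # ~5x larger
```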

What’s the key takeaway?

This is a classic signal recovery problem: equalization in the presence of noise.


A simple approach (just undoing the channel effect blindly) can fail, because it may
boost noise too much.
Instead, we need to optimize the filter not just to cancel distortion, but also to
minimize mean squared error — taking both signal and noise into account.

This is where Wiener filter and adaptive filters come into play!

(The Wiener filter videos are skipped. Refer to Rich Radke's DSP notes.)
