
1. Bayes Theorem

Why Bayes’ Rule?

We’re learning how to estimate an unknown random variable X based on an observation Y (which gives us some clue about what X might be). This is exactly what we’re doing in adaptive filtering and optimal estimation problems.

To understand how to make this estimate as good as possible, we need to understand the relationship between X and Y, and that’s where Bayes' Rule and joint PDFs come in.

What is Bayes’ Rule in probability terms?

In its simplest form:

P(X|Y) = P(Y|X) ⋅ P(X) / P(Y)

This just says:

"If I know how likely Y is when X happens, and I know how likely X is in general, then
I can figure out how likely X is when Y happens."

Bayes' Rule with Probability Densities (PDFs)

When we deal with continuous random variables, we use probability density functions
(PDFs) instead of probabilities.

Bayes’ Rule becomes:

f_{X|Y}(x|y) = f_{Y|X}(y|x) ⋅ f_X(x) / f_Y(y)

Here:

f_{X|Y}(x|y) is the PDF of X given observation Y = y

f_{Y|X}(y|x) is the likelihood — how likely are we to observe Y = y given X = x

f_X(x) is the prior PDF of X

f_Y(y) is just a normalization term (makes the PDF integrate to 1)
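
To make this concrete, here is a minimal numerical sketch (not part of the original notes) of applying Bayes' rule with PDFs on a grid. The model is assumed purely for illustration: X ~ N(0, 1), Y = X + noise with noise standard deviation 0.5, and an observed value y₀ = 1.2.

```python
# Minimal sketch: Bayes' rule with PDFs evaluated on a grid.
# Assumed toy model: X ~ N(0, 1), Y = X + noise, noise ~ N(0, 0.5^2), observation y0 = 1.2.
import numpy as np
from scipy.stats import norm

x = np.linspace(-5, 5, 2001)        # grid of candidate values for X
dx = x[1] - x[0]
y0 = 1.2                            # the specific observed value Y = y0

prior = norm.pdf(x, loc=0.0, scale=1.0)          # f_X(x)
likelihood = norm.pdf(y0, loc=x, scale=0.5)      # f_{Y|X}(y0 | x), viewed as a function of x

unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dx)   # dividing by f_Y(y0)

print("posterior area:", posterior.sum() * dx)   # ~1.0, as a PDF should be
```

The division in the second-to-last line plays the role of f_Y(y₀): it only rescales the curve so that it integrates to 1.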
Intuition: What does this mean?

Think of f_{X|Y}(x|y) as a “slice” of the joint PDF f_{X,Y}(x, y) at a specific observed value of y.

Imagine a 3D plot: x-axis is X, y-axis is Y, height is the joint PDF value.

If we observe a specific y = y₀, we are slicing this surface vertically at that value.

What we’re left with is a curve along the X-axis that shows how likely different x values are, given that slice.

That curve is the conditional distribution f_{X|Y}(x|y₀).

Expectation from the Joint PDF

Here’s where it gets really elegant:


We want to calculate the expected value (i.e., average) of X, and we want to see how this can be done by conditioning on Y.
Let’s say:

E[X] = ∬ x ⋅ f_{X,Y}(x, y) dx dy

Using Bayes’ Rule, we rewrite:

E[X] = ∫ [ ∫ x ⋅ f_{X|Y}(x|y) dx ] f_Y(y) dy

The inner integral is:

E[X|Y = y] = ∫ x ⋅ f_{X|Y}(x|y) dx

So overall:

E[X] = E [E[X|Y ]]

This is called the law of total expectation or the tower property.

It means: You can compute the average of X by:

1. First averaging it over each slice (conditional on Y ),


2. Then averaging those results over the distribution of Y .

This is how nested expectations work, and it’s crucial in Bayesian estimation and
adaptive filtering.
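
Here is a small Monte Carlo check of the tower property (a sketch only; the jointly Gaussian model below is assumed so that E[X|Y] has a simple closed form):

```python
# Minimal sketch of the tower property E[X] = E[E[X|Y]].
# Assumed toy model: X ~ N(1, 1), Y = X + V with V ~ N(0, 0.5^2).
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(1.0, 1.0, n)             # prior mean 1, variance 1
v = rng.normal(0.0, 0.5, n)
y = x + v

# For jointly Gaussian (X, Y): E[X|Y] = E[X] + Cov(X, Y) / Var(Y) * (Y - E[Y])
cond_mean = 1.0 + (1.0 / 1.25) * (y - 1.0)

print(np.mean(x))            # direct average of X
print(np.mean(cond_mean))    # average of the per-slice conditional averages -> same value (~1.0)
```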
Summary of what we just did

Bayes' rule gives us a way to go from the likelihood f_{Y|X}(y|x) to the posterior distribution f_{X|Y}(x|y).

Once we observe Y = y₀, we can compute the updated distribution of X.

We can use this conditional distribution to compute the best estimate of X, like the least mean square estimate, by choosing:

x̂ = E[X | Y = y₀]

which is the mean of the distribution of X given Y = y₀.
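
As a quick illustration (continuing the same assumed Gaussian toy model as in the earlier sketch), the posterior mean computed numerically on a grid matches the known closed-form conditional mean:

```python
# Minimal sketch: the least-mean-square estimate is the mean of the posterior f_{X|Y}(x | y0).
# Assumed toy model: X ~ N(0, 1), Y = X + N(0, 0.5^2), observation y0 = 1.2.
import numpy as np
from scipy.stats import norm

x = np.linspace(-5, 5, 2001)
dx = x[1] - x[0]
y0 = 1.2

posterior = norm.pdf(y0, loc=x, scale=0.5) * norm.pdf(x, 0.0, 1.0)
posterior /= posterior.sum() * dx                  # normalize so it integrates to 1

x_hat = (x * posterior).sum() * dx                 # numerical E[X | Y = y0]
print(x_hat, y0 / (1.0 + 0.5**2))                  # ~0.96 both: matches the Gaussian closed form
```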

2. Optimal Estimator

Main point

The conditional mean

E[X ∣ Y ]

is the optimal estimator of a random variable X given observation Y in the mean squared error (MSE) sense. It is unbiased and minimizes E[(X − X̂)²] among all possible functions X̂ = g(Y), meaning no other estimator achieves a lower MSE.

Explanation

You have:

An unknown variable X (like a signal you're trying to estimate),

An observation Y (something related to X, e.g., a noisy version of it),

You want to find the best estimate x̂(y) of X, based only on what you observe: Y = y.

The best estimator is the one that minimizes:

E[(X − x̂(Y))²]

The Candidate Estimator: Conditional Expectation

We're considering:
x̂(Y) = E[X ∣ Y]

But we want to prove that this is indeed the best possible estimator (i.e., no other
function of Y does better).

Step 1: Is this estimator unbiased?

We want to check if:

E[X − x̂(Y)] = 0

And yes — it is. Why?


Because:

E[X − E[X ∣ Y ]] = 0

This follows from the law of total expectation.


So this estimator doesn’t consistently overestimate or underestimate — it’s unbiased.

Step 2: Is it the best (i.e., MMSE optimal)?

Suppose someone gives you another estimator, say:

g(Y )

We want to prove that it can’t be better than E[X ∣ Y] .

The Trick: Decompose the error term

We write:

X − g(Y) = (X − E[X ∣ Y]) + (E[X ∣ Y] − g(Y))

where the first bracket is the true error and the second is the difference from the optimal estimator.

So when you compute the expected squared error:

E[(X − g(Y))²]

you expand it into:

E[((X − E[X ∣ Y]) + (E[X ∣ Y] − g(Y)))²]

Expanding the square gives 3 terms:

E[(X − E[X ∣ Y])²] — this is the minimum achievable error

E[(E[X ∣ Y] − g(Y))²] — this is the extra penalty for not using the optimal estimator
A cross term — but...

Here’s the key insight:


The cross term vanishes because the estimation error X − E[X ∣ Y] is uncorrelated
with any function of Y .
(This follows from the orthogonality principle.)
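
Filling in that step explicitly (a standard iterated-expectation argument; the notes state it but do not show it): since E[X ∣ Y] − g(Y) is itself a function of Y, condition on Y first:

E[(X − E[X ∣ Y]) ⋅ (E[X ∣ Y] − g(Y))] = E[ E[X − E[X ∣ Y] ∣ Y] ⋅ (E[X ∣ Y] − g(Y)) ] = E[ 0 ⋅ (E[X ∣ Y] − g(Y)) ] = 0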

So:

E[(X − g(Y))²] = E[(X − E[X ∣ Y])²] + E[(E[X ∣ Y] − g(Y))²]

where the first term is the minimum possible error and the second term is always ≥ 0.

Which means:
✨ Only when g(Y) = E[X ∣ Y] does that extra term become zero — and you reach the lowest possible MSE.

Intuition Behind Orthogonality

Imagine your true value X is a point in space. You observe a distorted version Y , and
you want to "guess" X based on Y .

The best guess is like projecting X onto a subspace defined by functions of Y . This
projection is the conditional expectation:

x̂(Y) = E[X ∣ Y]

In geometry, the error vector (true minus estimate) is perpendicular (orthogonal) to your estimate.

That’s what the math shows:

(X − x̂(Y)) ⊥ x̂(Y)

and more generally,

(X − x̂(Y)) ⊥ any function of Y
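
A quick Monte Carlo sketch of this orthogonality (the Gaussian toy model is assumed, as in the earlier sketches, so that E[X ∣ Y] is known exactly):

```python
# Minimal sketch: the error X - E[X|Y] is (approximately) uncorrelated with arbitrary functions of Y.
# Assumed toy model: X ~ N(0, 1), Y = X + N(0, 0.5^2), so E[X|Y] = (1/1.25) Y.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.normal(0.0, 1.0, n)
y = x + rng.normal(0.0, 0.5, n)

x_hat = (1.0 / 1.25) * y           # E[X|Y] for this jointly Gaussian pair
err = x - x_hat                    # estimation error

for g in (y, y**2, np.sin(y)):     # a few arbitrary functions of Y
    print(np.mean(err * g))        # all approximately 0
```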

Final Conclusion

The MMSE optimal estimator is E[X ∣ Y]

It is unbiased
It produces an error uncorrelated with any function of Y
No other estimator g(Y ) can do better
Hence, it minimizes E[(X − x̂(Y))²]

3. Estimating Random Vectors

The Big Idea

So far, we were estimating a single random variable X from a single observation Y — and we saw that the optimal estimate is the conditional mean:

X̂ = E[X ∣ Y]

1. What happens if we have multiple observations?

Let’s say you're estimating tomorrow's temperature (X).

Instead of just knowing today's temperature (Y ), you now also know yesterday’s, the
day before, and so on.
So your observations are now a vector:

Y = [Y_0, Y_1, …, Y_N]^T

Intuitively, more information means a better estimate — you're not just guessing
based on one point, but on a trend.

💡 Good news:
The optimal estimate is still the conditional mean:

X̂ = E[X ∣ Y]

It doesn’t matter whether Y is scalar or a vector — the math still works!


The derivation still holds — just with more components in the conditioning.

2. What happens if the thing you’re estimating is also a vector?

Now imagine you're estimating the temperature profile for the next several days.
So now X itself is a vector:
X = [X_0, X_1, …, X_M]^T

And you have multiple past temperature readings Y.

Again, the optimal estimate is still:

X̂ = E[X ∣ Y]

But this means you are estimating each component X_k using a function of all the observed values:

X̂_k = h_k(Y) = E[X_k ∣ Y]

So you're really just doing M + 1 scalar estimations — one for each component of X.

3. What about the error?

We define the error vector:

e = X − X̂

This tells us how far off we are in each component of the estimate.
Then, we want to minimize the overall error. There are two equivalent ways to measure this:

Option A: L2 Norm (Euclidean distance)

This is the standard:

∥e∥² = ∑_{k=0}^{M} e_k² = total squared error

We try to find the best functions h_0, …, h_M that minimize this expected error:

min_{h_0,…,h_M} E[∥e∥²]

Option B: Using Trace of Covariance Matrix

The error covariance matrix is:

E[ee^T]

This is a matrix where:

the diagonal has the variances of each component error (e₀², e₁², etc.)
the off-diagonal has the correlations between errors in different components.

The trace of this matrix (sum of diagonal elements) is just:

trace(E[ee^T]) = ∑_k E[e_k²]

Which is exactly the same as the L2 norm squared.
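
A tiny numerical check of that equivalence (the error vectors below are just made-up standard-normal draws standing in for e = X − X̂):

```python
# Minimal sketch: trace of the error covariance matrix equals the expected squared error norm.
import numpy as np

rng = np.random.default_rng(2)
e = rng.normal(size=(100_000, 3))                      # stand-in error vectors e = X - X_hat
R_e = (e[:, :, None] * e[:, None, :]).mean(axis=0)     # sample E[e e^T], shape (3, 3)

print(np.trace(R_e))                    # sum of the diagonal entries
print(np.mean(np.sum(e**2, axis=1)))    # E[||e||^2]  -> the same number
```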

Final Insight:

Whether you're estimating a scalar from a vector of observations, or estimating a vector from a vector — the conditional mean remains optimal.

All the complexity is just notation and dimensionality — the concept remains the
same:

Use all the information available (vector Y) to find the best mean-squared-error-
minimizing estimate of your target (scalar or vector X).

4. Linear Estimators

What’s the problem again?

We learned that the optimal estimator of a random variable X, given an observation Y, is the conditional expectation:

X̂(Y) = E[X ∣ Y]

This is unbiased and has the lowest possible mean squared error (MSE) among all
estimators - linear or nonlinear.

BUT...

This conditional expectation is usually very hard or impossible to compute in real life.

Why? Because:

We usually don’t know the exact joint PDF p(x, y).


Even if we did, integrating over it to compute E[X ∣ Y] is computationally difficult.
So what’s the solution?

We approximate the optimal estimator using something more tractable - a linear estimator:

X̂(Y) = HY + b

Where:

H is a matrix (or scalar if Y is scalar),


b is a bias term (vector or scalar).

This is a simple linear function of the observation Y.

Linear estimators are much easier to compute because we only need means and
covariances, not full PDFs.

But wait — what are we sacrificing?

We're no longer guaranteed to get the optimal estimate, except in one special case:

If X and Y are jointly Gaussian, then the linear estimator is the optimal one!

That’s because in Gaussian distributions, the conditional expectation E[X ∣ Y] is itself a linear function of Y.

So:

If X and Y are jointly Gaussian, linear estimator = optimal estimator


If not, linear estimator ≠ optimal, but still pretty good and way easier to compute.
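
One way to see the Gaussian case empirically (a sketch with an assumed toy model, X ~ N(0, 1) and Y = X + N(0, 0.5²)): approximate E[X ∣ Y] by binning the observations and averaging X inside each bin, then compare against the straight line with slope Cov(X, Y)/Var(Y) = 1/1.25.

```python
# Minimal sketch: for jointly Gaussian (X, Y), the conditional mean E[X|Y] traced out by binning
# lies (approximately) on a straight line. Assumed toy model: X ~ N(0,1), Y = X + N(0, 0.5^2).
import numpy as np

rng = np.random.default_rng(3)
n = 2_000_000
x = rng.normal(0.0, 1.0, n)
y = x + rng.normal(0.0, 0.5, n)

edges = np.linspace(-2, 2, 9)               # bin edges for Y
idx = np.digitize(y, edges)
for k in range(1, len(edges)):
    sel = idx == k                          # samples whose Y falls in bin k
    y_mid = 0.5 * (edges[k - 1] + edges[k])
    print(f"y ~ {y_mid:+.2f}: binned E[X|Y] = {x[sel].mean():+.3f}, "
          f"linear prediction = {(1.0 / 1.25) * y_mid:+.3f}")
```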

Step 1: Zero-mean assumption

We assume both X and Y have zero mean:

E[X] = 0, E[Y ] = 0

Why? Because:

If they’re not, we can just subtract the mean from each to make them zero mean.
Subtracting the mean is easy and doesn’t affect the correlation structure.

This simplifies the math: now the bias term b becomes unnecessary because:

E[X̂] = H E[Y] + b = 0 ⇒ b = 0

So the linear estimator becomes:

X̂(Y) = HY

Step 2: Minimize the mean squared error (MSE)

We now find the best matrix H by minimizing:


E[∥X − HY∥²]

This is a standard optimization problem and leads to the Wiener-Hopf equation, which gives us the best linear estimator in terms of covariances:

H = R_XY R_Y^{-1}

Where:

R_XY = E[X Y^T]: cross-covariance matrix

R_Y = E[Y Y^T]: auto-covariance of the observations
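
Here is a minimal sketch of computing H from sample covariances (the data model, true weights, and noise level are assumed purely for illustration):

```python
# Minimal sketch: best linear (Wiener) estimator H = R_XY R_Y^{-1} from sample covariances.
# Assumed toy model: zero-mean Y in R^3, scalar X = a^T Y + noise.
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
y = rng.normal(size=(n, 3))                    # zero-mean observation vectors Y
a = np.array([0.5, -1.0, 2.0])                 # made-up true weights
x = y @ a + 0.3 * rng.normal(size=n)           # scalar X linearly related to Y plus noise

R_xy = (x[:, None] * y).mean(axis=0)                   # sample E[X Y^T], shape (3,)
R_y = (y[:, :, None] * y[:, None, :]).mean(axis=0)     # sample E[Y Y^T], shape (3, 3)

H = R_xy @ np.linalg.inv(R_y)                  # Wiener solution
print(H)                                       # close to the true weights [0.5, -1.0, 2.0]

x_hat = y @ H                                  # linear MMSE estimates of X
print(np.mean((x - x_hat) ** 2))               # residual MSE ~ 0.09 (the noise variance)
```

Notice that only means and covariances enter the computation — the full joint PDF is never needed.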

Intuition Recap

Imagine:

You can't get the perfect answer (conditional expectation).


But you can get a pretty good approximation by drawing a straight line (linear
function) that tries to best match the observations to the quantity you're trying to
estimate.
This is like fitting a line to noisy data — not perfect, but often good enough,
especially if the noise is Gaussian.

5. Linear Estimators Example

The key idea here is that we're trying to recover or "undo" the effect of a noisy
communication channel, using a linear estimator (filter).

What's the goal?

We want to recover the original transmitted signal D(n) (like a voice or data stream)
from what we actually received Y (n), which is a distorted and noisy version of it.
Think of D(n) like the clean audio you’re sending over a walkie-talkie, and Y (n) like the
scratchy version your friend hears on the other end.

What does the system look like?

1. Transmitter sends: A clean sequence D(n).


2. Channel adds distortion: Like echoes, attenuation, delays (modeled by a system
with impulse response G(n)).
3. Noise gets added: Random, unwanted stuff (V (n)), like static on a radio.
4. Receiver gets: Y (n) = D(n) ∗ G(n) + V (n) → (convolution of signal with channel +
noise).
5. We apply a filter H(n): This is the equalizer at the receiver that tries to undo the
distortion and cancel out the noise.
6. Output estimate: D̂(n) = H (n) ∗ Y (n) — our best guess of what was sent.

The Ideal Situation

Imagine if we could find a filter H (n) such that:


H(n) ∗ G(n) = δ(n − n₀)

This means: if we convolve the channel response G(n) with our filter H (n), we get a
delta function — a perfect spike at some delay n₀.

Why is this ideal? Because that would mean all the spreading/delaying effects of the
channel are reversed. So now:


D̂(n) = H(n) ∗ Y(n) = δ(n − n₀) ∗ D(n) + H(n) ∗ V(n) = delayed D(n) + filtered noise

Perfect! Except... the noise is still there.

What's the problem then?

While H (n) ∗ G(n) ≈ δ(n − n₀) might give us a great recovery of the signal...
It might horribly amplify the noise.
Why? Because to cancel the effects of the channel, H (n) may need huge gain at
some frequencies (especially if the channel suppresses certain frequencies).

Imagine the channel wipes out mid-range tones in your music. To recover them, the
equalizer might crank the mid-range way up — but that also cranks up noise in that
range.
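
A small simulation sketch of this effect (the channel taps, noise level, and binary symbols below are assumed purely for illustration): a blind inverse ("zero-forcing") filter undoes the channel exactly, but the remaining error is dominated by noise that the inverse filter has amplified roughly five-fold.

```python
# Minimal sketch: inverting the channel recovers the signal but amplifies the noise.
# Assumed toy setup: D(n) = random +/-1 symbols, channel G = [1.0, 0.9], additive noise V(n).
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(5)
n = 50_000
d = rng.choice([-1.0, 1.0], size=n)             # transmitted symbols D(n)

g = np.array([1.0, 0.9])                        # channel G(n): near-null at high frequencies
noise = 0.1 * rng.normal(size=n)                # additive noise V(n)
y = lfilter(g, [1.0], d) + noise                # received Y(n) = D(n)*G(n) + V(n)

d_hat = lfilter([1.0], g, y)                    # zero-forcing equalizer: the exact inverse 1/G(z)

print("error power after equalization:", np.mean((d_hat - d) ** 2))
print("noise power before equalizer  :", np.mean(noise ** 2))
print("noise power after equalizer   :", np.mean(lfilter([1.0], g, noise) ** 2))   # ~5x larger
```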

What’s the key takeaway?

This is a classic signal recovery problem: equalization in the presence of noise.


A simple approach (just undoing the channel effect blindly) can fail, because it may
boost noise too much.
Instead, we need to optimize the filter not just to cancel distortion, but also to
minimize mean squared error — taking both signal and noise into account.

This is where Wiener filter and adaptive filters come into play!

(The Wiener filter videos are skipped. Refer to Rich Radke's DSP notes.)
