
Statistical Signal Processing & Inference

Minimum Variance Unbiased Estimation (MVU)
Danilo Mandic
room 813, ext: 46271

Department of Electrical and Electronic Engineering


Imperial College London, UK
[email protected], URL: www.commsp.ee.ic.ac.uk/∼mandic

c D. P. Mandic Statistical Signal Processing & Inference 1


Motivation (from Lecture 3)
A natural criterion for defining optimal estimators is the Mean Square Error (MSE):

MSE(θ̂) = E[(θ̂ − θ)²]

which measures the average squared deviation of the estimate, θ̂, from the true parameter value, θ.

We desire to minimise the error power (see Lectures 3, 6 and 7):

MSE(θ̂) = var(θ̂) + B²(θ̂)

MSE = VARIANCE OF THE ESTIMATOR + SQUARED BIAS

Of particular interest are unbiased estimators, for which

min_θ̂ MSE(θ̂) ≡ min_θ̂ var(θ̂)

so that we can use our available degrees of freedom to minimise one
performance metric (variance), instead of two (bias, variance).
We now establish a theoretical bound on the optimality of MVU estimators.
c D. P. Mandic Statistical Signal Processing & Inference 2
Motivation: Practical example from High Definition TV
There are so many ways SSP&I can help with sustainability.

c D. P. Mandic Statistical Signal Processing & Inference 3


Objectives
◦ Learn the concept of minimum variance unbiased (MVU) estimation
◦ Investigate how the accuracy of an estimator depends upon the
relationship between the unknown parameter(s) and the PDF of noise
◦ Study the requirements for the design of an efficient estimator
◦ Analyse the Cramer–Rao Lower Bound (CRLB) for the scalar case
◦ Extension to the Cramer–Rao Lower Bound (CRLB) for the vector case
◦ Optimal parameter estimation, linear models, General Linear Model
◦ Dependence on data length (motivation for ’sufficient statistics’)
◦ Examples:
~ DC level in WGN (frequency estimation in smart grid, bioengineering)
~ Regression as in Capital Asset Pricing Model (CAPM) in finance
~ Finding parameters of a sinusoid, e.g. in communications, radar,
sonar, bioengineering (scalar case, vector case)
~ A new, statistical, view of Fourier analysis, performance bounds
~ System identification
c D. P. Mandic Statistical Signal Processing & Inference 4
What is the Cramer–Rao Lower Bound (CRLB)
The CRLB is a lower bound on the variance of any unbiased estimator.
In other words, if θ̂ is an unbiased estimator of θ, then

σ²_θ̂ ≥ CRLB_θ̂(θ)   or   σ_θ̂ ≥ √(CRLB_θ̂(θ))

Therefore, the CRLB is a benchmark which tells us the best we can
ever expect to be able to achieve with an unbiased estimator.
The CRLB is a must-check quantitative bound for:
◦ Feasibility studies (sensor relevance, whether we meet the problem specifications)
◦ Assessment of the quality (goodness) of any derived estimator (we can
only do as well as the CRLB)
◦ It can sometimes provide the form of the MVU estimator (we just read
it out from the CRLB theorem)
◦ It may be used to demonstrate the importance of physical/signal
parameters to the estimation problem (e.g. optimum freq. for power)
c D. P. Mandic Statistical Signal Processing & Inference 5
The need for the “parametrised” pdf, p(x[0]; θ)
Interpretation of p(x; θ) # a function of θ for fixed observed data x

Q: What determines how well we estimate the unknown θ from the observed data x?

A: Since the data x is a random process which depends on θ, it is the parametrised pdf, denoted by p(x; θ), which describes that dependence.

Clearly, if p(x; θ) depends strongly/weakly on θ, then this implies that we
should be able to estimate θ well/poorly.

(Figure: Left: strong dependence on θ. Right: weak dependence on θ. The mean of the parametrised pdf (red and blue slices) depends on the observed point x[0].)

c D. P. Mandic Statistical Signal Processing & Inference 6


Example 1: Consider a single observation x[0] = A + w[0], where w[0] ∼ N(0, σ²)

The simplest estimator of the DC level A in white noise w[0] ∼ N(0, σ²) is
Â = x[0] ⇒ the estimator Â is unbiased, with a variance of σ²

To show that the estimator accuracy improves as σ² decreases:
◦ Consider

p_i(x[0]; A) = (1/√(2πσ²_i)) exp[ −(1/(2σ²_i)) (x[0] − A)² ]

for x[0] = 3 and i = 1, 2, with σ₁ = 1/3 and σ₂ = 1 (the fundamental step: we are fixing the data value).

(Figure: p₁(x[0]=3; A), Case I with σ₁ = 1/3 (narrow), and p₂(x[0]=3; A), Case II with σ₂ = 1 (broad), both centred at A = 3.)

Clearly, as σ₁ < σ₂, the DC level A is estimated more accurately with p₁(x[0]; A).
Likely candidates for the values of A lie in 3 ± 3σ ⇒ therefore [2, 4] for σ₁ and [0, 6] for σ₂.
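To make this concrete, here is a minimal numerical sketch of the idea (assuming numpy is available; the variable names and the 3σ "likely" threshold are illustrative):

```python
import numpy as np

x0 = 3.0                           # the single, fixed observation x[0]
A_grid = np.linspace(-1, 7, 801)   # candidate values of the DC level A

def likelihood(A, sigma):
    """Gaussian likelihood p(x[0]; A) for one sample, viewed as a function of A."""
    return np.exp(-(x0 - A)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

for sigma in (1/3, 1.0):
    L = likelihood(A_grid, sigma)
    # 'likely' candidates for A: within 3 sigma, i.e. above e^{-4.5} of the peak
    likely = A_grid[L > L.max() * np.exp(-4.5)]
    print(f"sigma = {sigma:.2f}: likely A in [{likely.min():.2f}, {likely.max():.2f}]")
# prints roughly [2, 4] for sigma = 1/3 and [0, 6] for sigma = 1
```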
c D. P. Mandic Statistical Signal Processing & Inference 7
Putting this into context # Sharpe ratio in finance
In financial modelling, the Sharpe Ratio (SR) models the risk-adjusted
returns, whereby the volatility (risk) is designated by the variance of the
distribution of returns (return = price(t)/price(t − 1)).
Sharpe ratio: the blue asset (narrower pdf) is less profitable but also less
risky. To balance between risk and profit, we can use the Sharpe ratio

SR_{1:T} = √T · E[r_{1:T}] / √(Var[r_{1:T}]),   or, for a single asset, SR = µ/σ

(Figure: histograms of simple returns, rₜ (%). Left: equal return levels, (µ, σ²) = (1, 1) vs (1, 9). Right: unequal return levels, (1, 1) vs (4, 9).)

Here, SR_blue = √T · (1/1), which is smaller than SR_red = √T · (4/3).
We therefore choose the red asset.
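A small sketch of this comparison (assuming numpy; the simulated return streams and T = 252 trading days are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 252   # one year of daily simple returns (illustrative)

# Simulated returns, in %: blue asset (mu, sigma^2) = (1, 1), red asset (4, 9)
r_blue = rng.normal(1.0, 1.0, T)
r_red  = rng.normal(4.0, 3.0, T)

def sharpe(r):
    """SR_{1:T} = sqrt(T) * E[r] / sqrt(Var[r]), as defined on this slide."""
    return np.sqrt(len(r)) * r.mean() / r.std(ddof=1)

print(f"SR_blue = {sharpe(r_blue):.1f}, SR_red = {sharpe(r_red):.1f}")
# the per-sample ratios approach 1/1 and 4/3, so the red asset is preferred
```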
c D. P. Mandic Statistical Signal Processing & Inference 8
Likelihood function
When a PDF is viewed as a function of an unknown parameter (with the
dataset {x} = x[0], x[1], . . . fixed) it is termed the “likelihood function”.
◦ The “sharpness” of the likelihood function determines the accuracy at
which the unknown parameter may be estimated.
◦ Sharpness is measured by the “curvature” # a negative of the second
derivative of the logarithm of the likelihood function at its peak.
Example 2: Estimation based on one sample of a DC level in WGN
ln p(x[0]; A) = −ln√(2πσ²) − (1/(2σ²))(x[0] − A)²

then

∂ ln p(x[0]; A)/∂A = (1/σ²)(x[0] − A)

and the curvature

−∂² ln p(x[0]; A)/∂A² = 1/σ²

Therefore, as expected, the curvature increases as σ² decreases.

Curvature ↑ ⇒ PDF concentration ↑ ⇒ Accuracy ↑

c D. P. Mandic Statistical Signal Processing & Inference 9
Likelihood function: Curvature
Since we know that the variance of the estimator equals σ², then

var(Â) = 1 / ( −∂² ln p(x[0]; A)/∂A² )

and the variance decreases as the curvature increases.

Generally, the second derivative does depend upon the data point,
x[0], and hence a more appropriate measure of curvature is the
statistical measure (an average over many random x[0])

−E[ ∂² ln p(x[0]; A)/∂A² ] evaluated at A = true value

which measures the average curvature of the log-likelihood function.

Note: The likelihood function is a random variable, due to x[0].

Recall: The Mean Square Error # MSE = Bias² + variance

It makes perfect sense to look for a minimum variance unbiased (MVU) solution!

c D. P. Mandic Statistical Signal Processing & Inference 10


Link between the curvature and human perception
In the 1950s, the psychologist Fred Attneave recorded eye dwellings on objects

Example 3a): The drawing of a bean (top)


and the histogram of eye dwellings (bottom) Example 3b): Read the words below ...
now read letter by letter ... are you still
sure?

Example 3c): Is the drawing on the left still a penguin?

So, what is the sufficient information to ’estimate’ an object?

c D. P. Mandic Statistical Signal Processing & Inference 11


THE KEY: Cramer-Rao Lower Bound (CRLB) for a
scalar parameter (performance of the theoretically best estimator)
The Cramer–Rao Lower Bound (CRLB)

Theorem [CRLB]. Assumption: the PDF p(x; θ) satisfies the "regularity" condition

E[ ∂ ln p(x; θ)/∂θ ] = 0,   ∀θ

where the expectation is taken with respect to p(x; θ).

Then, the variance of any unbiased estimator, θ̂, must satisfy

var(θ̂) ≥ 1 / ( −E[ ∂² ln p(x; θ)/∂θ² ] )   ← the denominator is the average curvature

where the derivative is evaluated at the true value of θ.

c D. P. Mandic Statistical Signal Processing & Inference 12


CRLB for a scalar parameter, continued
Moreover, an unbiased estimator that attains the bound for all θ may be found if and only if, for some functions g and I,

∂ ln p(x; θ)/∂θ = I(θ)( g(x) − θ )

This estimator is the minimum variance unbiased (MVU) estimator, for which θ̂ = g(x), and its minimum variance is 1/I(θ).
—— end of CRLB theorem ——

Remark: Since the variance var(θ̂) ≥ 1/( −E[ ∂² ln p(x; θ)/∂θ² ] ), the evaluation of the "curvature term" gives

E[ ∂² ln p(x; θ)/∂θ² ] = ∫ ( ∂² ln p(x; θ)/∂θ² ) p(x; θ) dx

Obviously, in general the bound depends on the parameter θ and the data length

c D. P. Mandic Statistical Signal Processing & Inference 13


Example 4: Physical relevance of the CRLB
Point 3 from Slide 5: "the CRLB can sometimes provide the form of the MVU estimator".
Let us therefore compare the form of the regularity condition with Example 2.

Regularity condition: ∂ ln p(x; θ)/∂θ = I(θ)( g(x) − θ ), where I(θ) is the inverse of the minimum achievable variance, and g(x) gives the form of the optimum estimator.

Compare with what we have derived for x[0] = A + w[0] (Slide 9):

∂ ln p(x[0]; A)/∂A = (1/σ²)( x[0] − A )

where σ² is the inverse of the Fisher information, g(x[0]) = x[0], and A is the unknown parameter.

By inspection, the optimum estimate is Â = g(x[0]) = x[0].
From the CRLB theorem, the optimum variance of this estimator is 1/I(θ) = σ².

Therefore: Good estimator ⇒ variance ↓ and curvature ↑
Poor estimator ⇒ variance ↑ and curvature ↓ (see Slide 9)

c D. P. Mandic Statistical Signal Processing & Inference 14


Example 5: DC level in WGN for N data points
for the validity of the Gaussian assumption, see Appendix 2 and Lecture 3 (S 21)

Consider the estimation of a DC level in WGN; assume N observations

x[n] = A + w[n],   n = 0, 1, 2, . . . , N − 1

where A is the unknown DC level and w[n] ∼ N(0, σ²) is noise with a known pdf.

Determine the CRLB for the unknown DC level A, starting from (θ = A)

p(x; θ) = p(x; A) = ∏_{n=0}^{N−1} (1/√(2πσ²)) exp[ −(1/(2σ²)) (x[n] − A)² ]
= (1/(2πσ²)^{N/2}) exp[ −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ]

Estimation of a DC level is very useful; e.g., in the time-frequency plane a
sinusoid of frequency f is represented by a straight line (specgramdemo).

c D. P. Mandic Statistical Signal Processing & Inference 15


Example 5: DC level in WGN for N data points # contd.
Upon taking the first derivative, we have

∂ ln p(x; A)/∂A = ∂/∂A [ −ln(2πσ²)^{N/2} − (1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ]
= (1/σ²) Σ_{n=0}^{N−1} (x[n] − A) = (N/σ²)( x̄ − A )

where x̄ is the sample mean; we can read out the estimator as g(x) = x̄.

CRLB connection: x̄ = (1/N) Σ_{n=0}^{N−1} x[n] = g(x),   var(Â) = 1/I(A) = σ²/N

Upon differentiating again (the result does not depend on x, so no E{·} is needed):

∂² ln p(x; A)/∂A² = −N/σ²

Therefore var(Â) = σ²/N = CRLB, which implies that the sample mean
estimator attains the Cramer-Rao LB and must, therefore, be an
MVU estimator of a DC level in WGN.
c D. P. Mandic Statistical Signal Processing & Inference 16
Example 5: DC level in WGN (spelling out the previous slide)

Upon taking the first derivative, we have

∂ ln p(x; A)/∂A = ∂/∂A [ −ln(2πσ²)^{N/2} − (1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ]
= (1/σ²) Σ_{n=0}^{N−1} (x[n] − A) = (N/σ²)( x̄ − A ) = I(A)( g(x) − A )   (by the CRLB Theorem)

with I(A) = N/σ² and g(x) = x̄.

CRLB connection: x̄ = (1/N) Σ_{n=0}^{N−1} x[n] = g(x),   var(Â) = 1/I(A) = σ²/N

Upon differentiating again (the result does not depend on x, so no E{·} is needed):

∂² ln p(x; A)/∂A² = −N/σ²

Therefore var(Â) = σ²/N = CRLB, which implies that the sample mean
estimator attains the Cramer-Rao LB and must, therefore, be an
MVU estimator of A in WGN (for any other estimator, Ã, var(Ã) ≥ σ²/N).
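A quick Monte Carlo check of this result (a sketch, assuming numpy; A, σ and N are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(42)
A, sigma, N, trials = 1.5, 2.0, 100, 200_000

# Each row is one realisation of x[n] = A + w[n], n = 0, ..., N-1
x = A + sigma * rng.standard_normal((trials, N))
A_hat = x.mean(axis=1)               # the sample-mean estimator, per realisation

print(f"mean(A_hat) = {A_hat.mean():.4f}   (true A = {A}, i.e. unbiased)")
print(f"var(A_hat)  = {A_hat.var():.5f}")
print(f"CRLB sigma^2/N = {sigma**2 / N:.5f}")   # the two should agree closely
```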
c D. P. Mandic Statistical Signal Processing & Inference 17
Let us have a closer look at the CRLB for N data points
The figures below illustrate the behaviour of

CRLB_N = var(Â) = σ²/N   (cf. CRLB₁ = σ²)

with a change in the DC level, A, the data length, N, and the noise variance, σ².

(Figure: CRLB vs A for fixed N and σ² (constant); CRLB vs σ² for fixed N (linear increase); CRLB vs N for fixed σ² (inverse decay).)

Properties of the CRLB for DC level estimation from N noisy data points:

◦ It does not depend on the DC level A
◦ The CRLB increases linearly with the noise variance, σ²
◦ The CRLB decreases inversely with the data length, N. For
example, doubling the data length halves the CRLB
c D. P. Mandic Statistical Signal Processing & Inference 18
Efficient estimator # concept cf. consistent est.
Def: An estimator which is unbiased and attains the CRLB is said to be
efficient. In other words, an estimator is efficient if:
◦ It is a Minimum Variance Unbiased (MVU) estimator, and
◦ It efficiently uses the data.

(Figure: var(θ̂) vs θ for three unbiased estimators θ̂₁, θ̂₂, θ̂₃ against the CRLB. Left: θ̂₁ attains the CRLB for all θ, so it is efficient and MVU; θ̂₂ and θ̂₃ are not. Right: θ̂₁ has the lowest variance but lies above the CRLB, so it may be MVU but is not efficient.)

Not all estimators (e.g. phase estimators), and not all MVU estimators, are efficient.
c D. P. Mandic Statistical Signal Processing & Inference 19
Fisher information and a general form of MVU estimator
(measures the “expected goodness” of data for making an estimate)
The term

I(θ) = −E[ ∂² ln p(x; θ)/∂θ² ]

in the CRLB theorem is referred to as the Fisher information.

Intuitively: the more information available ⇒ the lower the bound ⇒ the lower the variance

Essential properties of an information measure:
◦ Non-negative
◦ Additive for independent observations

General CRLB for arbitrary signals in WGN (cf. σ²/N; see the next slide):

var(θ̂) ≥ σ² / Σ_{n=0}^{N−1} ( ∂s[n; θ]/∂θ )²

where the denominator measures the sensitivity of the signal to a parameter change.

Accurate estimators: the signal is very sensitive to a parameter change.
Therefore, ∂s[n; θ]/∂θ above acts as a "sensitivity" term. (see Appendix)
c D. P. Mandic Statistical Signal Processing & Inference 20
General case: Arbitrary signal in noise
CRLB via parameter sensitivity (for alternative forms, see Appendix)

Consider a deterministic signal s[n; θ] observed in WGN, w[n] ∼ N(0, σ²):

x[n] = s[n; θ] + w[n],   n = 0, 1, . . . , N − 1

Then, the PDF of x parametrised by θ has the form

p(x; θ) = (1/(2πσ²)^{N/2}) exp[ −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − s[n; θ])² ]

and so

∂ ln p(x; θ)/∂θ = (1/σ²) Σ_{n=0}^{N−1} (x[n] − s[n; θ]) ∂s[n; θ]/∂θ

∂² ln p(x; θ)/∂θ² = (1/σ²) Σ_{n=0}^{N−1} [ (x[n] − s[n; θ]) ∂²s[n; θ]/∂θ² − ( ∂s[n; θ]/∂θ )² ]

Therefore, using E{x[n]} = s[n; θ], the Fisher information is

I(θ) = −E[ ∂² ln p(x; θ)/∂θ² ] = (1/σ²) Σ_{n=0}^{N−1} ( ∂s[n; θ]/∂θ )²

c D. P. Mandic Statistical Signal Processing & Inference 21


Example 6: Sinusoidal frequency estimation
(the CRLB depends both on the unknown parameter f0 and the data length N )

Consider a general sinewave in noise: x[n] = A cos(2πf₀n + Φ) + w[n]
If only the frequency f₀ is unknown (A and Φ known), then

s[n; f₀] = A cos(2πf₀n + Φ),   0 < f₀ < 1/2   (normalised frequency)

From Slide 20:

var(f̂₀) ≥ σ² / Σ_{n=0}^{N−1} ( ∂s[n; θ]/∂θ )² = σ² / ( A² Σ_{n=0}^{N−1} [ 2πn sin(2πf₀n + Φ) ]² )

(Figure: Cramer-Rao lower bound against frequency, for N = 10, Φ = 0, SNR = 1. Note the preferred frequencies, e.g. f ≈ 0.03, and that for f₀ → {0, 1/2} the CRLB → ∞.)
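The curve on this slide can be reproduced numerically (a sketch, assuming numpy, and taking A = σ = 1 as the SNR = 1 convention, which is an assumption on my part):

```python
import numpy as np

N, Phi = 10, 0.0
A, sigma = 1.0, 1.0                      # A = sigma = 1, i.e. SNR = 1 (assumed)
n = np.arange(N)
f0 = np.linspace(0.001, 0.499, 1000)     # avoid the endpoints, where the bound diverges

# CRLB(f0) = sigma^2 / ( A^2 * sum_n [2*pi*n*sin(2*pi*f0*n + Phi)]^2 )
sens = 2 * np.pi * n * np.sin(2 * np.pi * np.outer(f0, n) + Phi)
crlb = sigma**2 / (A**2 * (sens**2).sum(axis=1))

print(f"preferred frequency: f0 = {f0[np.argmin(crlb)]:.3f}, CRLB = {crlb.min():.2e}")
# the bound dips at the preferred frequencies and blows up as f0 -> 0 or 1/2
```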
c D. P. Mandic Statistical Signal Processing & Inference 22
Example 6: Sinusoidal frequency estimation (contd.)
Practical context: Frequency estimation in power (smart) grids, fs = 1000 Hz

To illustrate the bias–variance–consistency, consider some recent frequency


estimation algorithms (see Lecture Supplement). For convenience, the
performance was evaluated against the signal to noise ratio (SNR).
Observe that the bias, the variance, and the CRLB are all functions of the SNR.

Left: Bias in frequency estimation Right: Variance against the CRLB


The AI-MVDR algorithm was asymptotically unbiased and also consistent,
as it approached the CRLB for frequency est. with an increase in SNR.
c D. P. Mandic Statistical Signal Processing & Inference 23
CRLB Theorem: Extension to a vector parameter
We now have the Fisher Information Matrix, I(θ), whereby [I(θ)]_{ij} = −E[ ∂² ln p(x; θ)/∂θᵢ∂θⱼ ]

Formulation: estimate a vector parameter θ = [θ₁, θ₂, . . . , θ_p]ᵀ
◦ Recall that an unbiased estimator θ̂ is efficient (and therefore an MVU
estimator) when it satisfies the conditions of the CRLB
◦ It is assumed that the PDF p(x; θ) satisfies the regularity conditions

E[ ∂ ln p(x; θ)/∂θ ] = 0,   ∀θ

◦ Then, the covariance matrix, C_θ̂, of any unbiased estimator θ̂ satisfies
C_θ̂ − I⁻¹(θ) ≥ 0 (the symbol ≥ 0 means that this matrix is positive semidefinite)

◦ An unbiased estimator θ̂ = g(x) exists that attains the bound,
C_θ̂ = I⁻¹(θ), if and only if

∂ ln p(x; θ)/∂θ = I(θ)( g(x) − θ )

c D. P. Mandic Statistical Signal Processing & Inference 24


Extension to a vector parameter: Fisher information matrix

Some observations:

◦ The elements of the information matrix I(θ) are given by

[I(θ)]_{ij} = −E[ ∂² ln p(x; θ)/∂θᵢ∂θⱼ ]

where the derivatives are evaluated at the true values of the parameter vector.

◦ The CRLB theorem provides a powerful tool for finding MVU estimators
for a vector parameter.

MVU estimators for linear models are found with the Cramer–Rao
Lower Bound (CRLB) theorem.

c D. P. Mandic Statistical Signal Processing & Inference 25


Example 7: Sinusoid parameter estimation # vector case
Consider again a general sinewave

s[n] = A cos(2πf₀n + Φ)

where A, f₀ and Φ are all unknown. Then, the data model becomes

x[n] = A cos(2πf₀n + Φ) + w[n],   n = 0, 1, . . . , N − 1

where A > 0, 0 < f₀ < 1/2, and w[n] ∼ N(0, σ²).

Task: Determine the CRLB for the parameter vector θ = [A, f₀, Φ]ᵀ.

Solution: The elements of the Fisher Information Matrix, [I(θ)]_{ij} = −E[ ∂² ln p(x; A, f₀, Φ)/∂θᵢ∂θⱼ ], become (see the P&A sets)

I(θ) = (1/σ²) [ N/2   0                          0
                0     2A²π² Σ_{n=0}^{N−1} n²     πA² Σ_{n=0}^{N−1} n
                0     πA² Σ_{n=0}^{N−1} n        NA²/2 ]
c D. P. Mandic Statistical Signal Processing & Inference 26
Example 7: Sinusoid parameter estimation # continued
Since C_θ̂ = I⁻¹(θ) (Slide 24), take the inverse of the FIM.
After inversion of I(θ), its diagonal components give (η = A²/(2σ²) is the SNR):

var(Â) ≥ 2σ²/N,   var(f̂₀) ≥ 12 / ( (2π)² η N(N² − 1) ),   var(Φ̂) ≥ 2(2N − 1) / ( ηN(N + 1) )

(Figure: CRLB [dB] for the sinusoidal amplitude, frequency and phase estimates against the number of data samples, N, at SNR = −3 dB (dashed lines) and 10 dB (solid lines).)

The variance of the estimated parameters of a sinusoid behaves as ∝ 1/η, and for the frequency as ∝ 1/N³, thus exhibiting a strong sensitivity to the data length.
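The closed-form bounds above are easy to tabulate (a sketch, assuming numpy; the chosen N and SNR values are illustrative):

```python
import numpy as np

def sinusoid_crlbs(N, eta):
    """CRLBs from this slide for frequency and phase; eta = A^2/(2*sigma^2) is the SNR."""
    var_f0  = 12 / ((2 * np.pi)**2 * eta * N * (N**2 - 1))
    var_phi = 2 * (2 * N - 1) / (eta * N * (N + 1))
    return var_f0, var_phi

for N in (10, 20, 40):
    v_f0, v_phi = sinusoid_crlbs(N, eta=10.0)     # SNR = 10 dB
    print(f"N = {N:2d}: var(f0) >= {v_f0:.2e}, var(Phi) >= {v_phi:.2e}")
# doubling N reduces the frequency bound by ~8x: the 1/N^3 behaviour
```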

c D. P. Mandic Statistical Signal Processing & Inference 27


Need for Linear Models (regression models) (see Lecture 6)
These underpin many areas e.g. the CAPM and Fama-French models in finance

(Figures: Top: daily returns of crude oil vs. the Vanguard Energy ETF (data from Apr. 2024), with the regression line and the residuals of the linear fit. Bottom: S&P 500 vs. gold prices in April 2024, with linear and quadratic fits and their residuals; the quadratic fit gives the smaller residual sum of squares, Σres² = 42270 vs. 47375.)

c D. P. Mandic Statistical Signal Processing & Inference 28


Linear Models
Generally, it is difficult to determine the MVU estimator.

◦ In practice, however, a linear data model can often be employed # it is then straightforward to determine the MVU estimator.

Example 8: Linear model of a straight line in noise

x[n] = A + Bn + w[n],   n = 0, 1, . . . , N − 1

where
◦ w[n] ∼ N(0, σ²),
◦ B is the slope, and
◦ A is the intercept.

(Figure: the noisy line x[n] scattered around the ideal noiseless line A + Bn.)

c D. P. Mandic Statistical Signal Processing & Inference 29


Linear models: Compact notation (Example 8 contd.)
This data model can be written more compactly in matrix notation as

x = Hθ + w

where x is observed, H is known, θ is unknown, and w has a known pdf:

x = [x[0], x[1], . . . , x[N − 1]]ᵀ

H = [ 1   0
      1   1
      ⋮   ⋮
      1   N−1 ]

θ = [A, B]ᵀ
w = [w[0], w[1], . . . , w[N − 1]]ᵀ
w ∼ N(0, σ²I)

where I = diag(1, 1, . . . , 1) is the N×N identity matrix.

c D. P. Mandic Statistical Signal Processing & Inference 30


From a scalar to the vector/matrix notation
The "spelled out" form of the likelihood function for θ = [A, B]ᵀ is

p(x; A, B) = (1/(2πσ²)^{N/2}) exp[ −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A − Bn)² ]   (∗)

To arrive at the vector form, substitute w[n] = x[n] − A − Bn.
Then, with w = [w[0], . . . , w[N − 1]]ᵀ, the term Σ_{n=0}^{N−1} w²[n], which
appears in the above likelihood function, can be written as

Σ_{n=0}^{N−1} w²[n] = wᵀw

This applies to any vector, so that for w = x − Hθ we have

Σ_{n=0}^{N−1} (x[n] − A − Bn)² = (x − Hθ)ᵀ(x − Hθ)

This gives

p(x; A, B) = p(x; θ) = (1/(2πσ²)^{N/2}) exp[ −(1/(2σ²)) (x − Hθ)ᵀ(x − Hθ) ]   equivalent to (∗)

c D. P. Mandic Statistical Signal Processing & Inference 31


Linear models: Fisher information matrix
NB: p(x; θ) = (1/(2πσ²)^{N/2}) exp[ −(1/(2σ²)) (x − Hθ)ᵀ(x − Hθ) ]

The CRLB theorem can be used to obtain the MVU estimator for θ.
The MVU estimator, θ̂ = g(x), will then satisfy

∂ ln p(x; θ)/∂θ = I(θ)( g(x) − θ )

where I(θ) is the Fisher information matrix, whose elements are

[I]_{ij} = −E[ ∂² ln p(x; θ)/∂θᵢ∂θⱼ ]

Then, for the Linear Model,

∂ ln p(x; θ)/∂θ = ∂/∂θ [ −(N/2) ln(2πσ²) − (1/(2σ²)) (x − Hθ)ᵀ(x − Hθ) ]
= −(1/(2σ²)) ∂/∂θ [ Nσ² ln(2πσ²) + xᵀx − 2xᵀHθ + θᵀHᵀHθ ]

Note that only the last two terms depend on θ, and that the matrix H enters through them.

c D. P. Mandic Statistical Signal Processing & Inference 32


Linear models: Some useful matrix/vector derivatives
the derivation is given in the Lecture Supplement

Use the identities (remember that both bᵀθ and θᵀAθ are scalars)

∂(bᵀθ)/∂θ = b      #      ∂(xᵀHθ)/∂θ = (xᵀH)ᵀ = Hᵀx
∂(θᵀAθ)/∂θ = 2Aθ   #      ∂(θᵀHᵀHθ)/∂θ = 2HᵀHθ

(which you should prove for yourself); that is, follow the rules of
vector/matrix differentiation.

As a rule of thumb, watch for the position of the (·)ᵀ operator.

Then, the form of the partial derivative from the previous slide becomes

∂ ln p(x; θ)/∂θ = (1/σ²)( Hᵀx − HᵀHθ )

c D. P. Mandic Statistical Signal Processing & Inference 33


Linear models: Cramer-Rao lower bound
Find the MVU estimator from: ∂ ln p(x; θ)/∂θ = I(θ)( g(x) − θ )

Similarly to the vector CRLB # recall that (HᵀH)ᵀ = HᵀH

I(θ) = −∂/∂θ ( ∂ ln p(x; θ)/∂θ )ᵀ = (1/σ²) HᵀH   ← does not depend on the data

Therefore

∂ ln p(x; θ)/∂θ = (1/σ²) HᵀH [ (HᵀH)⁻¹ Hᵀx − θ ],   with I(θ) = (1/σ²)HᵀH and g(x) = (HᵀH)⁻¹Hᵀx

By inspection, the linear MVU estimator is then given by

θ̂ = g(x) = (HᵀH)⁻¹ Hᵀx

provided (HᵀH)⁻¹ exists (it does, as H is full rank, with linearly independent columns).

The covariance matrix of θ̂ now becomes C_θ̂ = I⁻¹(θ) = σ² (HᵀH)⁻¹
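As a sanity check, a minimal sketch of this estimator for the straight line of Example 8 (assuming numpy; np.linalg.lstsq applies (HᵀH)⁻¹Hᵀx in a numerically stable way):

```python
import numpy as np

rng = np.random.default_rng(1)
N, A, B, sigma = 50, 2.0, 0.5, 1.0

n = np.arange(N)
H = np.column_stack([np.ones(N), n])     # the N x 2 observation matrix of Example 8
x = H @ np.array([A, B]) + sigma * rng.standard_normal(N)

theta_hat = np.linalg.lstsq(H, x, rcond=None)[0]   # MVU: (H^T H)^{-1} H^T x
C_theta = sigma**2 * np.linalg.inv(H.T @ H)        # CRLB covariance

print(f"A_hat = {theta_hat[0]:.3f}, B_hat = {theta_hat[1]:.3f}")
print("var(A_hat), var(B_hat) >=", np.diag(C_theta))
```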

c D. P. Mandic Statistical Signal Processing & Inference 34


CRLB for linear models – continued
◦ We have seen that the MVU estimator for the linear model is
efficient # it attains the CRLB
◦ The columns of H must be linearly independent for HᵀH to be invertible

Theorem (Minimum Variance Unbiased Estimator for the Linear Model):

If the observed data can be modelled as

x = Hθ + w

where
x is an N×1 "vector of observed data"
H is an N×p "observation (measurement) matrix" of rank p
θ is a p×1 unknown "parameter vector"
w is an N×1 additive "noise vector" ∼ N(0, σ²I)

c D. P. Mandic Statistical Signal Processing & Inference 35


CRLB for linear models: Theorem, continued

Then, the MVU estimator is given by

θ̂ = (HᵀH)⁻¹ Hᵀx

for which the covariance matrix has the form

C_θ̂ = σ² (HᵀH)⁻¹

Note that the statistical performance of θ̂ is now completely described,
because θ̂ is a linear transformation of the Gaussian vector x, i.e.

θ̂ ∼ N( θ, σ² (HᵀH)⁻¹ )

—— end of Theorem ——

c D. P. Mandic Statistical Signal Processing & Inference 36


Example 9: Fourier analysis as a linear estimator
Recall that we need to calculate θ̂ = (HᵀH)⁻¹ Hᵀx

The data model is given by (n = 0, 1, . . . , N − 1, w[n] ∼ N(0, σ²))

x[n] = Σ_{k=1}^{M} a_k cos(2πkn/N) + Σ_{k=1}^{M} b_k sin(2πkn/N) + w[n]

where the Fourier coefficients, a_k and b_k, that is, the amplitudes of the
cosine and sine terms, are to be estimated.

◦ The frequencies are multiples of the fundamental f₁ = 1/N, that is, f_k = k/N.
◦ Then, the parameter vector is θ = [a₁, a₂, . . . , a_M, b₁, b₂, . . . , b_M]ᵀ
and the observation matrix H is N × 2M dimensional (p = 2M), taking the form

H = [ 1                . . .   1                 0                . . .   0
      cos(2π/N)        . . .   cos(2πM/N)        sin(2π/N)        . . .   sin(2πM/N)
      ⋮                        ⋮                 ⋮                        ⋮
      cos(2π(N−1)/N)   . . .   cos(2πM(N−1)/N)   sin(2π(N−1)/N)   . . .   sin(2πM(N−1)/N) ]  (N×2M)

c D. P. Mandic Statistical Signal Processing & Inference 37


Example 9: Fourier analysis, geometric view
Data model:

x[n] = Σ_{k=1}^{M} a_k cos(2πkn/N) + Σ_{k=1}^{M} b_k sin(2πkn/N) + w[n]

with the a_k, b_k the parameters to estimate, and w[n] the WGN.

Parameter vector: θ = [a₁, a₂, . . . , a_M, b₁, b₂, . . . , b_M]ᵀ (the Fourier coefficients)

(Figure, drawn for k = 1: the columns h₁, . . . , h_{2M} of H hold the samples n = 0, 1, . . . , N−1 of cos(2πkn/N) and sin(2πkn/N) down each column, shown as equally spaced points on the unit circle in the z-plane.)
c D. P. Mandic Statistical Signal Processing & Inference 38


Example 9: Fourier analysis # continued
(see also Lecture 6)

For H not to be under-determined, it has to satisfy N > p ⇒ M < N/2

For mathematical convenience, the columns of H should be orthogonal.

This is because the columns of H form a basis of a new representation
space, which is obvious if we rewrite the measurement matrix in the form

H = [ h₁ | h₂ | · · · | h_{2M} ]

where hᵢ is the i-th column of H.

Then, for a large enough number of data points, N, due to the
orthogonality properties of products of sines and cosines of different
frequencies,

hᵢᵀ hⱼ = 0   for i ≠ j

In other words, hᵢ ⊥ hⱼ; that is, the columns of matrix H are orthogonal.

c D. P. Mandic Statistical Signal Processing & Inference 39


Example 9: Fourier analysis # contd. contd.
The orthogonality of the columns of H (for large N) follows from the
properties of sines and cosines of different frequencies:

Σ_{n=0}^{N−1} cos(2πin/N) cos(2πjn/N) = (N/2) δᵢⱼ   (as the power of a sinusoid is A²/2)

Σ_{n=0}^{N−1} sin(2πin/N) sin(2πjn/N) = (N/2) δᵢⱼ   (as the power of a sinusoid is A²/2)

Σ_{n=0}^{N−1} cos(2πin/N) sin(2πjn/N) = 0   ∀ i, j, s.t. i, j = 1, 2, . . . , M < N/2

where the Kronecker delta is δᵢⱼ = 1 for i = j, and δᵢⱼ = 0 for i ≠ j.

In other words: (i) cos iα ⊥ sin jα, ∀i, j; (ii) cos iα ⊥ cos jα, ∀i ≠ j;
(iii) sin iα ⊥ sin jα, ∀i ≠ j
c D. P. Mandic Statistical Signal Processing & Inference 40
Example 9: Fourier analysis → measurement matrix
Therefore (orthogonality)

HᵀH = [ h₁ᵀ ; . . . ; h_{2M}ᵀ ][ h₁ | · · · | h_{2M} ] = diag(N/2, . . . , N/2) = (N/2) I

and the MVU estimator of the Fourier coefficients is given by

θ̂ = (HᵀH)⁻¹ Hᵀx   ⇒   θ̂_MVU = (2/N) Hᵀx

θ̂ = (2/N) [ h₁ᵀx ; . . . ; h_{2M}ᵀx ] = [ (2/N) Σ_{n=0}^{N−1} x[n] cos(2π·1·n/N) ; . . . ; (2/N) Σ_{n=0}^{N−1} x[n] sin(2πMn/N) ]

Fourier coefficients of a "signal + WGN" are MVU estimates of the Fourier
coefficients of the noise-free signal.

c D. P. Mandic Statistical Signal Processing & Inference 41


Example 9: Finally # Fourier coefficients (Fourier coefficients
of “signal + AWGN” are MVU estimates of Fourier coeff. of noise–free signal)

Therefore, the Fourier analysis represents a linear MVU estimator, given by

â_k = (2/N) Σ_{n=0}^{N−1} x[n] cos(2πkn/N)
b̂_k = (2/N) Σ_{n=0}^{N−1} x[n] sin(2πkn/N)

where the a_k and b_k are the discrete Fourier transform coefficients.

From the CRLB for the Linear Model, the covariance matrix of this estimator is

C_θ̂ = σ² (HᵀH)⁻¹ = (2σ²/N) I   ← decreases with N

i) Note that, as θ̂ is a Gaussian random variable and the covariance
matrix is diagonal, the amplitude estimates are statistically independent;
ii) The orthogonality of the columns of H is fundamental in the
computation of the MVU estimator (an invertible, parsimonious basis);
iii) For accuracy, the measurement matrix H is desired to be a tall matrix
with orthogonal columns.
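A minimal sketch of this estimator (assuming numpy; the chosen N, M, amplitudes and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
N, M, sigma = 64, 3, 0.5
n = np.arange(N)

a_true = [1.0, 0.0, 0.5]     # cosine amplitudes, k = 1, ..., M
b_true = [0.0, 2.0, 0.0]     # sine amplitudes,   k = 1, ..., M
x = sum(a_true[k-1] * np.cos(2*np.pi*k*n/N) + b_true[k-1] * np.sin(2*np.pi*k*n/N)
        for k in range(1, M+1)) + sigma * rng.standard_normal(N)

# MVU estimates: a_k = (2/N) sum_n x[n] cos(2 pi k n / N), and similarly for b_k
a_hat = [2/N * np.sum(x * np.cos(2*np.pi*k*n/N)) for k in range(1, M+1)]
b_hat = [2/N * np.sum(x * np.sin(2*np.pi*k*n/N)) for k in range(1, M+1)]

print("a_hat:", np.round(a_hat, 2), " b_hat:", np.round(b_hat, 2))
# each estimate fluctuates about the true amplitude with variance 2*sigma^2/N
```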
c D. P. Mandic Statistical Signal Processing & Inference 42
Example 10: System Identification (SYS ID)
Aim: To identify the model of a system (filter coefficients {h}) from
input/output data. Assume an FIR filter system model given below

(Figure: tapped-delay-line FIR filter. The inputs u[n], u[n−1], . . . , u[n−p] pass through delay elements z⁻¹, are weighted by the coefficients h(0), h(1), . . . , h(p−1), and summed to form the output.)

◦ The input u[n] "probes" the system; the output of the FIR filter is then
given by the convolution x[n] = Σ_{k=0}^{p−1} h(k) u[n − k]
◦ We wish to estimate the filter coefficients [h(0), . . . , h(p − 1)]T
◦ In practice, the output is corrupted by additive WGN

c D. P. Mandic Statistical Signal Processing & Inference 43


Example 10: SYS ID # data model in noise w ∼ N (0, σ 2)
Data model:

x[n] = Σ_{k=0}^{p−1} h(k) u[n − k] + w[n],   n = 0, 1, . . . , N − 1

The equivalent matrix-vector form is

[ x[0]     ]   [ u[0]     0        . . .   0        ] [ h(0)   ]   [ w[0]     ]
[ x[1]     ] = [ u[1]     u[0]     . . .   0        ] [ h(1)   ] + [ w[1]     ]
[ ⋮        ]   [ ⋮        ⋮                ⋮        ] [ ⋮      ]   [ ⋮        ]
[ x[N−1]   ]   [ u[N−1]   u[N−2]   . . .   u[N−p]   ] [ h(p−1) ]   [ w[N−1]   ]

(observation vector x, measurement matrix H, coefficient vector θ, noise vector w), that is,

x = Hθ + w,   where w ∼ N(0, σ²I)

Then, the MVU estimator is

θ̂ = (HᵀH)⁻¹ Hᵀx,   with C_θ̂ = σ² (HᵀH)⁻¹

This representation also lends itself to state-space modelling.

c D. P. Mandic Statistical Signal Processing & Inference 44


Example 10: SYS ID # more about H
Now, HᵀH becomes a symmetric Toeplitz autocorrelation matrix, given by

HᵀH = N [ r_uu(0)     r_uu(1)     . . .   r_uu(p−1)
          r_uu(1)     r_uu(0)     . . .   r_uu(p−2)
          ⋮           ⋮                   ⋮
          r_uu(p−1)   r_uu(p−2)   . . .   r_uu(0) ]

where

r_uu(k) = (1/N) Σ_{n=0}^{N−1−k} u[n] u[n + k]

For HᵀH to be diagonal, we must have r_uu(k) = 0 for k ≠ 0, which holds
for a pseudorandom noise (PRN) input sequence.

Finally, when HᵀH = N r_uu(0) I, then

var( ĥ(i) ) = σ² / ( N r_uu(0) ),   i = 0, 1, . . . , p − 1

c D. P. Mandic Statistical Signal Processing & Inference 45


Example 10: SYS ID # MVU estimator
For a PRN sequence, the MVU estimator becomes

θ̂ = (HᵀH)⁻¹ Hᵀx

Then

ĥ(k) = (1/(N r_uu(0))) Σ_{n=0}^{N−1} u[n − k] x[n] = r_ux(k) / r_uu(0)

where

r_ux(k) = (1/N) Σ_{n=0}^{N−1−k} u[n] x[n + k],   k = 0, 1, . . . , p − 1

Thus, the MVU estimator is a ratio of the input-output cross-correlation to
the input autocorrelation (which makes perfect physical sense).

Compare with the Wiener filter in Lecture 7 (adaptive inference).
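A minimal sketch of this cross-correlation estimator (assuming numpy; the ±1 probing sequence stands in for a PRN input, and the filter h_true is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N, sigma = 4096, 0.1
h_true = np.array([0.9, -0.4, 0.2])      # the unknown FIR coefficients, p = 3
p = len(h_true)

u = np.sign(rng.standard_normal(N))      # PRN-like +/-1 probing sequence, r_uu(0) = 1
x = np.convolve(u, h_true)[:N] + sigma * rng.standard_normal(N)

# h_hat(k) = r_ux(k) / r_uu(0): input-output cross-correlation over input power
r_uu0 = np.mean(u * u)
h_hat = [np.mean(u[:N-k] * x[k:]) / r_uu0 for k in range(p)]

print("h_hat:", np.round(h_hat, 3))      # close to [0.9, -0.4, 0.2]
```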

c D. P. Mandic Statistical Signal Processing & Inference 46


Example 10a: Adaptive noise cancel. with reference
such as in noise-cancelling headphones on an airplane see Lecture 7

Physical intuition behind "the MVU estimator is a ratio of the input-output cross-correlation to the input autocorrelation", on a noise cancellation example.

(Figure: adaptive noise cancellation with headphones. The reference microphone, N₁, feeds the adaptive filter input x(n); the filter output y(n) is subtracted from the primary input d(n) = s(n) + N₀(n) (speech or music plus additive noise) to form the error e(n).)

Input: the cockpit noise, x(n) = N₁(n), that is, the noise at the Reference
Microphone. The only requirement is that N₁ is correlated with the noise,
N₀, which you hear through the headphones, but not with the music
signal, s(n). The filter aims to estimate N₀ from N₁, that is, y = N̂₀.

Based on the input-output cross-correlation, the filter output can only
produce an estimate of the noise you hear, that is, y(n) = N̂₀(n), as the
cockpit noise, N₁, is not correlated with the music, s.

Therefore, we hear d(n) − y(n) = s(n) + N₀(n) − N̂₀(n) ≈ s(n)
c D. P. Mandic Statistical Signal Processing & Inference 47
The need for a General Linear Model (GLM)
We shall now consider a general case where:
1) the observed signal may contain a known but non-white component, s
2) the observation noise, w, may be non-white, that is, w ∼ N(0, C) with C ≠ σ²I.

Case 1) Often in practical applications (e.g. in radar), the observed signal
consists of some known signal, s, and another signal whose components are
not known, Hθ, so that the linear model of the observed signal becomes

x = Hθ + s + w   (here, the noise is assumed to be white)

The MVU estimator is determined immediately from x′ = x − s, so that

x′ = Hθ + w
θ̂ = (HᵀH)⁻¹ Hᵀ(x − s)
C_θ̂ = σ² (HᵀH)⁻¹   (covariance matrix of θ̂)

An example may be a DC level observed in random white noise, but also
with a known sinusoidal interference (e.g. from the mains).
c D. P. Mandic Statistical Signal Processing & Inference 48
Incorporating coloured (correlated) noise into GLM
Case 2) For coloured noise, w ∼ N(0, C), where C ≠ σ²I, that is, C is not a
scaled identity matrix!

To this end, we can use a whitening approach, as follows.

Since C is positive definite, so too is C⁻¹ # it can be factored as

C⁻¹ = DᵀD,   where D (N×N) is invertible

Now, D acts as a whitening transform when applied to w, since

E[ (Dw)(Dw)ᵀ ] = E[ Dwwᵀ Dᵀ ] = DCDᵀ = D D⁻¹ D⁻ᵀ Dᵀ = I

This allows us to transform the general linear model from x = Hθ + w to

x′ = Dx = DHθ + Dw = H′θ + w′

The noise is now whitened, as w′ = Dw ∼ N(0, I) # use the Linear Model results:

θ̂ = (H′ᵀH′)⁻¹ H′ᵀx′ = (HᵀDᵀDH)⁻¹ HᵀDᵀDx = (HᵀC⁻¹H)⁻¹ HᵀC⁻¹x

In a similar fashion, for the covariance of this estimator we have

C_θ̂ = (H′ᵀH′)⁻¹,   and finally   C_θ̂ = (HᵀC⁻¹H)⁻¹

For C = σ²I we recover our previous results for the standard Linear Estimator.
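A minimal sketch of the whitening route and its closed-form equivalent (assuming numpy; the AR(1)-like covariance C and the line model are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
N, theta_true = 40, np.array([1.0, 0.2])
H = np.column_stack([np.ones(N), np.arange(N)])   # a simple line model, for concreteness

# Coloured noise with covariance C != sigma^2 I (AR(1)-like, rho = 0.8)
idx = np.arange(N)
C = 0.8 ** np.abs(np.subtract.outer(idx, idx))
x = H @ theta_true + rng.multivariate_normal(np.zeros(N), C)

# Whitening: factor C^{-1} = D^T D via Cholesky, then reuse the white-noise MVU estimator
D = np.linalg.cholesky(np.linalg.inv(C)).T
theta_hat = np.linalg.lstsq(D @ H, D @ x, rcond=None)[0]

# Closed form: (H^T C^{-1} H)^{-1} H^T C^{-1} x  -- identical up to rounding
Ci = np.linalg.inv(C)
theta_hat2 = np.linalg.solve(H.T @ Ci @ H, H.T @ Ci @ x)
print(np.round(theta_hat, 4), np.round(theta_hat2, 4))
```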
c D. P. Mandic Statistical Signal Processing & Inference 49
Can we resort to (approximately) Gaussian distribution?
Yes, very often, if we re–cast our problem in an appropriate way (see Lecture 3)
Top panel. Share prices, pn, of Apple (AAPL), General Electric (GE) and
Boeing (BA) and their histogram (right). Bottom panel. Logarithmic
returns for these assets, ln(pn/pn−1), that is, the log of price differences at
consecutive days (left) and the histogram of log returns (right).

Clearly, by a suitable data transformation, we may arrive at symmetric distributions which are more amenable to analysis (bottom right).
c D. P. Mandic Statistical Signal Processing & Inference 50
Theorem: MVU Estimator for the General Linear Model
Upon combining Case 1 and Case 2 above (non-white noise + a known signal component):

i) General linear data model: s(θ) = Hθ + s, so that the observed data are

x = Hθ + s + w,   w ∼ N(0, C)

where x is observed, H and s are known, θ is unknown, and the noise has known statistics (it can be non-white).

ii) Then, the MVU estimator has the form

θ̂ = (HᵀC⁻¹H)⁻¹ HᵀC⁻¹( x − s )

with the covariance matrix

C_θ̂ = (HᵀC⁻¹H)⁻¹

and attains the Cramer-Rao Lower Bound (CRLB).

We must assume that H is full rank, otherwise there exist some θ₁ and θ₂
which both give the same signal component, that is, Hθ₁ = Hθ₂ (no uniqueness).
c D. P. Mandic Statistical Signal Processing & Inference 51
Example 11: The concept of "linear in the parameters" models (e.g. neural networks) (see also Lecture 8)

Recall that the notion "linear" in the term "Linear Models" does not arise
from fitting straight lines to data!

(Figure: observed data x[n] scattered around a true signal of interest which is quadratic in time n; the model is quadratic in n, yet "linear in the parameters!")

Observations: x[n] = θ₀ + θ₁n + θ₂n² + w[n]   ⇒   x = Hθ + w   (linear in the parameters θ)

where

θ = [ θ₀ ; θ₁ ; θ₂ ],   H = [ 1   0     0
                              1   1     1²
                              1   2     2²
                              ⋮   ⋮     ⋮
                              1   N−1   (N−1)² ]
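A minimal sketch of fitting this quadratic-in-n, linear-in-θ model (assuming numpy; the true parameters and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma = 30, 2.0
theta_true = np.array([1.0, -0.5, 0.1])

n = np.arange(N)
H = np.column_stack([np.ones(N), n, n**2])   # quadratic in n, linear in theta
x = H @ theta_true + sigma * rng.standard_normal(N)

theta_hat = np.linalg.lstsq(H, x, rcond=None)[0]   # MVU: (H^T H)^{-1} H^T x
print("theta_hat:", np.round(theta_hat, 3))        # close to [1.0, -0.5, 0.1]
```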

c D. P. Mandic Statistical Signal Processing & Inference 52


What to remember about MVU estimators
◦ An estimator is a random variable and as such its performance can
only be described statistically by its PDF
◦ The use of computer simulations for assessing the performance of an
estimator is rarely conclusive
◦ Unbiased estimators tend to have symmetric PDFs, centred about
the true value of θ
◦ The minimum mean square error (MMSE) criterion is natural to search
for optimal estimators, but it most often leads to unrealisable estimators
(those that cannot be written solely as a function of data)
◦ Since MSE = Bias² + variance, any criterion that depends on bias
should be abandoned # we need to consider alternative approaches
◦ Remedy: Constrain the bias to zero and find an estimator which
minimises the variance # the minimum variance unbiased (MVU) estim.
◦ Minimising the variance of an unbiased estimator also has the effect of
concentrating the PDF of the estimation error, θ̂ − θ, about zero #
this makes it easier to perform the analysis

c D. P. Mandic Statistical Signal Processing & Inference 53


Things to remember about CRLB
Even if the MVU estimator exists, there is no “turn of the crank”
procedure to find it.
The CRLB sets a lower bound on the variance of any unbiased estimator!

This can be extremely useful in several ways:

◦ If we find an estimator that achieves the CRLB # we know we have
found an MVU estimator
◦ The CRLB can provide a benchmark against which we can compare the
performance of any unbiased estimator
◦ The CRLB enables us to rule out infeasible estimators. It is physically
impossible to find an unbiased estimator that beats the CRLB
◦ We may require the estimator to be linear, which is not necessarily a
severe restriction, as shown in the examples on the estimation of Fourier
coefficients and a quadratic curve in noise
c D. P. Mandic Statistical Signal Processing & Inference 54
Some “rule of thumb” practical hints with CRLB
1. Start from the log-likelihood function of the parametrised PDF, which
depends on the unknown parameter θ, that is, ln p(x; θ)

2. Fix x and take the 2nd partial derivative of the log-likelihood function,
that is, ∂² ln p(x; θ)/∂θ²

3. If the result still depends on x, then fix θ and take the expected
value with respect to x. Otherwise, this step is not needed

4. Should the result still depend on θ, then evaluate it at every specific value
of θ

5. For the CRLB, negate the result and take the reciprocal

For some problems, an efficient estimator may not exist, for example
the estimation of sinusoidal phase (see your P&A sets)

c D. P. Mandic Statistical Signal Processing & Inference 55


Appendix 1: An alternative form of CRLB
(via the sensitivity of p(x; θ) to θ)
Sometimes, it is easier to find the CRLB as

var(θ̂) ≥ 1 / E[ ( ∂ ln p(x; θ)/∂θ )² ]   cf. the original var(θ̂) ≥ 1 / ( −E[ ∂² ln p(x; θ)/∂θ² ] )

Motivation: sensitivity analysis, ease of interpretation.

For an increment in θ, i.e. θ → θ + ∆θ ⇒ p(x; θ) → p(x; θ + ∆θ).
Then, the sensitivity of p(x; θ) to that change is

S̃θp(x) = [ ∆p(x; θ)/p(x; θ) ] / [ ∆θ/θ ] = (% change in p(x; θ)) / (% change in θ) = [ ∆p(x; θ)/∆θ ][ θ/p(x; θ) ]

For ∆θ → 0:

Sθp(x) = lim_{∆θ→0} S̃θp(x) = [ ∂p(x; θ)/∂θ ][ θ/p(x; θ) ] = θ ∂ ln p(x; θ)/∂θ

(recall the derivative rule for the log function: ∂ ln f(x)/∂x = (1/f(x)) ∂f(x)/∂x)
c D. P. Mandic Statistical Signal Processing & Inference 56


Appendix 1: An alternative form of CRLB (contd.)
(via the sensitivity of p(x; θ) to θ)

Therefore (Gardner, IEEE Transactions on Information Theory, July 1979)

var(θ̂)/θ² ≥ 1 / ( θ² E[ ( ∂ ln p(x; θ)/∂θ )² ] ) = 1 / E[ ( Sθp(x) )² ]

Interpretation: this is an inverse mean square sensitivity of p(x; θ) to θ.

◦ Modelling and estimation are obviously intertwined

◦ Unknown parameters may have a physical interpretation, such as e.g.
the direction in beamforming, the delay in radar, ...

◦ Otherwise, parameters may be part of an imposed model, such as e.g.
the fixed sine-cosine bases in Fourier analysis

c D. P. Mandic Statistical Signal Processing & Inference 57


Appendix 2: The validity of Gaussian assumption
(The Gaussian data assumption leads to the largest Cramer-Rao bound)

◦ When there is no information about the distribution of observations,


Gaussian assumption appears as the most conservative choice
◦ This follows from the fact that the Gaussian distribution minimises the
Fisher information (inverse of the CRLB), or in other words the Gaussian
distribution maximises the CRLB
◦ Indeed, it leads to the largest CRLB in quite a general class of data
distributions and for a significant set of parameter estimation problems
◦ Therefore, any optimisation based on the CRLB under the Gaussian
assumption is min-max optimal in the sense of minimising the largest CRLB
(they yield the best CRB-related performance in the worst case, and over a
large class of data distributions which satisfy the regularity condition)
◦ Also, the Gaussian random vector maximises the differential entropy (for a
given covariance), and satisfies the worst additive noise lemma
For more detail see: S. Park, E. Serpedin, and K. Qaraqe, “Gaussian assumption: The
least favourable but the most useful”, IEEE Signal Processing Magazine, May 2013, pp.
183–186 and the references therein
c D. P. Mandic Statistical Signal Processing & Inference 58
Appendix 3: Transformation of parameters
Suppose that there is a parameter θ for which we know the CRLB, denoted
by CRLBθ .
Our task is to estimate another parameter, α, which is a function of θ, i.e.

α = g(θ)

Then, it can be shown that (see S. Kay's book on Statistical Signal Processing)

var(α̂) ≥ CRLB_α = ( ∂g(θ)/∂θ )² CRLB_θ

where ∂g(θ)/∂θ is the sensitivity of α to θ.

Therefore, a large sensitivity ∂g(θ)/∂θ means that a small error in θ gives a large
error in α. This, in turn, increases the CRLB (that is, worsens the accuracy).

It can be shown that if g(θ) has an affine form, that is, g(θ) = aθ + b,
then α̂ = g(θ̂) is efficient.
Otherwise, for any other form of g(θ), the result is asymptotically efficient
as N → ∞.

c D. P. Mandic Statistical Signal Processing & Inference 59


Appendix 4: Modelling vs. Estimation
◦ Oftentimes parameters we wish to estimate have some physical
significance (e.g. heart rate, or delay in the time of arrival of the
back-scattered signal in radar).
◦ It is also common that the parameters of interest arise from a
non-physical model which is imposed onto data (e.g. Fourier analysis).
◦ However, even then, the Fourier coefficients for a signal in AWGN are
the MVU estimates of the Fourier coefficients in the noise-free case!
◦ Similar reasoning applies to ARMA modelling, the coefficients may or
may not have physical meaning.
◦ Model # related to data generation (e.g. a generative model)
◦ Estimation # related both to model accuracy (bias/variance) and to using
a model to, e.g., predict future values of a signal (inference).

Modelling and Estimation/Inference are intertwined. It is our goal to
understand the bounds on the best achievable performance for a
certain paradigm, and to use this as domain knowledge for inference.

c D. P. Mandic Statistical Signal Processing & Inference 60


Notes:

c D. P. Mandic Statistical Signal Processing & Inference 61


Notes:

c D. P. Mandic Statistical Signal Processing & Inference 62


Notes:

c D. P. Mandic Statistical Signal Processing & Inference 63


Notes:

c D. P. Mandic Statistical Signal Processing & Inference 64
