SSPI Lecture 4 MVU Estimation 2025
Recall the mean square error, which measures the average mean squared deviation of the estimate, θ̂. By restricting ourselves to unbiased estimators, we can use our available degrees of freedom to minimise one performance metric (variance) instead of two (bias and variance).
We now establish a theoretical bound on the optimality of MVU estimators.
Motivation: Practical example from High Definition TV
There are so many ways SSP&I can help with sustainability.

Consider the likelihoods p1(x[0] = 3; A) and p2(x[0] = 3; A), for x[0] = 3 and i = 1, 2, with σ1 = 1/3 and σ2 = 1
(the fundamental step: we are fixing the data value).

[Figure: the two likelihood functions plotted against A. Case I: p1(x[0] = 3; A) with σ1 = 1/3 (narrow). Case II: p2(x[0] = 3; A) with σ2 = 1 (wide). Both are centred at A = 3.]

Clearly, as σ1 < σ2, the DC level A is estimated more accurately with p1(x[0]; A).
Likely candidates for the value of A lie in 3 ± 3σ ⇒ therefore [2, 4] for σ1 and [0, 6] for σ2.
Putting this into context: the Sharpe ratio in finance
In financial modelling, the Sharpe Ratio (SR) models the risk-adjusted
returns, whereby the volatility (risk) is designated by the variance of the
distribution of returns, with return = price(t)/price(t − 1).
The blue asset (narrower pdf) is less profitable but also less risky.
To balance between the risk and the profit, we can use the Sharpe ratio

SR_{1:T} = √T · E[r_{1:T}] / √(Var[r_{1:T}]),    or, for a single asset,    SR = μ/σ
[Figure: histograms (frequency density) of simple returns, rt (%), for pairs of assets, labelled by (mean, variance). Left panel, equal return levels: (1, 1) versus (1, 9). Right panel, unequal return levels: (1, 1) in blue versus (4, 9) in red.]
Here, SR_blue = √T · (1/1) = √T, which is smaller than SR_red = √T · (4/3).
We therefore choose the red asset.
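As a quick numerical check, a minimal sketch (the (mean, variance) pairs are those from the figure above; the horizon T is an arbitrary illustrative choice):

# Sketch: Sharpe ratios of the two assets, using SR = sqrt(T) * mean / std.
import numpy as np

T = 252                                   # illustrative horizon (one trading year)
assets = {"blue": (1.0, 1.0),             # (mean return, variance of returns), in %
          "red":  (4.0, 9.0)}

for name, (mu, var) in assets.items():
    sr = np.sqrt(T) * mu / np.sqrt(var)   # SR_{1:T} = sqrt(T) * E[r] / sqrt(Var[r])
    print(f"SR_{name} = {sr:.2f}")        # blue: 15.87,  red: 21.17 -> choose red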
Likelihood function
When a PDF is viewed as a function of an unknown parameter (with the
dataset {x} = x[0], x[1], . . . fixed) it is termed the “likelihood function”.
◦ The “sharpness” of the likelihood function determines the accuracy with
which the unknown parameter may be estimated.
◦ Sharpness is measured by the “curvature”, that is, the negative of the
second derivative of the logarithm of the likelihood function at its peak.
Example 2: Estimation based on one sample of a DC level in WGN
ln p(x[0]; A) = −ln √(2πσ²) − (1/(2σ²)) (x[0] − A)²

then

∂ ln p(x[0]; A) / ∂A = (1/σ²) (x[0] − A)

and the curvature

−∂² ln p(x[0]; A) / ∂A² = 1/σ²
var(θ̂) ≥ 1 / ( −E{ ∂² ln p(x; θ) / ∂θ² } )        ← the denominator is the average curvature

where the derivative is evaluated at the true value of θ.
Obviously, in general, the bound depends on the parameter θ and on the data length.
Condition for the bound to be attained (an efficient estimator exists):

∂ ln p(x; θ) / ∂θ = I(θ) ( g(x) − θ )

where I(θ) is the inverse of the minimum achievable variance and g(x) gives the form of the optimum estimator.

Compare with what we have derived for x[0] = A + w[0] (Slide 9):

∂ ln p(x[0]; A) / ∂A = (1/σ²) ( x[0] − A )

where g(x[0]) = x[0] is the optimum estimator, σ² is the inverse of the Fisher information, and A is the unknown parameter.
x[n] = A + w[n],    n = 0, 1, 2, . . . , N − 1

where A is the unknown DC level and w[n] is noise with a known pdf.

p(x; θ) = p(x; A) = ∏_{n=0}^{N−1} (1/√(2πσ²)) exp[ −(1/(2σ²)) (x[n] − A)² ]

                  = (1/(2πσ²)^{N/2}) exp[ −(1/(2σ²)) ∑_{n=0}^{N−1} (x[n] − A)² ]
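For this model the CRLB is σ²/N, and the sample mean attains it. A minimal Monte Carlo sketch (the values of A, σ, N and the number of trials are arbitrary illustrative choices):

# Sketch: the sample mean attains the CRLB sigma^2/N for a DC level in WGN.
import numpy as np

rng = np.random.default_rng(0)
A, sigma, N, trials = 1.0, 0.5, 100, 20000         # illustrative values

x = A + sigma * rng.standard_normal((trials, N))   # x[n] = A + w[n]
A_hat = x.mean(axis=1)                             # sample-mean estimator

print("empirical var(A_hat):", A_hat.var())        # close to 0.0025
print("CRLB  sigma^2 / N   :", sigma**2 / N)       # = 0.0025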
[Figure: variance of three unbiased estimators, θ̂1, θ̂2, θ̂3, plotted against θ and compared with the CRLB. Left panel: θ̂1 attains the CRLB for all θ, so θ̂1 is efficient and MVU, while θ̂2 and θ̂3 are not. Right panel: θ̂1 has the lowest variance but lies above the CRLB, so θ̂1 may be MVU but is not efficient.]

Not all estimators are efficient (e.g. the phase estimator), and not all MVU estimators are efficient.
Fisher information and a general form of MVU estimator
(measures the “expected goodness” of data for making an estimate)
The term

I(θ) = −E[ ∂² ln p(x; θ) / ∂θ² ]

in the CRLB theorem is referred to as the Fisher information.
Intuitively: the more information available ⇒ the lower the bound ⇒ the lower the variance.
Essential properties of an information measure:
◦ Non-negative
◦ Additive for independent observations
General CRLB for arbitrary signals in WGN (cf. σ²/N, see the next slide):

var(θ̂) ≥ σ² / ∑_{n=0}^{N−1} ( ∂s[n; θ] / ∂θ )²

where the sum in the denominator measures the sensitivity of the signal to a change in the parameter.
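A minimal sketch of evaluating this bound, for the illustrative (assumed) signal s[n; θ] = θ cos(2πf0 n), where the parameter θ is the amplitude and f0 is a known frequency:

# Sketch: CRLB = sigma^2 / sum_n (ds[n;theta]/dtheta)^2 for s[n;theta] = theta*cos(2*pi*f0*n).
import numpy as np

sigma, f0, N = 0.5, 0.1, 64                   # illustrative noise level, frequency, data length
n = np.arange(N)

ds_dtheta = np.cos(2 * np.pi * f0 * n)        # here ds/dtheta does not depend on theta
crlb = sigma**2 / np.sum(ds_dtheta**2)

print("CRLB for the amplitude:", crlb)        # roughly 2*sigma^2/N for this signal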
Formulation: Estimate a vector parameter θ = [θ1, θ2, . . . , θp]ᵀ
◦ Recall that an unbiased estimator θ̂ is efficient (and therefore an MVU
estimator) when it satisfies the conditions of the CRLB
◦ It is assumed that the PDF p(x; θ) satisfies the regularity conditions
E[ ∂ ln p(x; θ) / ∂θ ] = 0,    ∀θ
◦ Then, the covariance matrix, Cθ̂, of any unbiased estimator θ̂ satisfies
Cθ̂ − I⁻¹(θ) ≥ 0    (the symbol ≥ 0 means that the matrix Cθ̂ − I⁻¹(θ) is positive semidefinite)
◦ The Fisher Information Matrix is given by

[I(θ)]ij = −E[ ∂² ln p(x; θ) / (∂θi ∂θj) ]
◦ The CRLB theorem provides a powerful tool for finding MVU estimators
for a vector parameter.
MVU estimators for linear models are found with the Cramer–Rao
Lower Bound (CRLB) theorem.
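As an illustration of the Fisher information matrix, a small sketch for the straight-line-in-noise model x[n] = A + Bn + w[n] treated in Example 8 later in this lecture: with θ = [A, B]ᵀ and Gaussian noise, differentiating the log-likelihood twice gives [I(θ)]ij = (1/σ²) ∑ₙ (∂s[n]/∂θi)(∂s[n]/∂θj), which the code below assembles (the values of σ and N are illustrative):

# Sketch: Fisher information matrix for x[n] = A + B*n + w[n], w[n] ~ N(0, sigma^2).
import numpy as np

sigma, N = 1.0, 50                           # illustrative values
n = np.arange(N)

# [I(theta)]_ij = -E[ d^2 ln p / (d theta_i d theta_j) ] = (1/sigma^2) * [[N, sum n], [sum n, sum n^2]]
I = (1.0 / sigma**2) * np.array([[N,       n.sum()],
                                 [n.sum(), (n**2).sum()]])

crlb = np.linalg.inv(I)                      # C_theta_hat - I^{-1}(theta) >= 0
print("I(theta) =\n", I)
print("I^{-1}(theta) (its diagonal bounds var(A_hat), var(B_hat)):\n", crlb)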
[Figure: CRLB (in dB) versus the number of data samples, N, for the estimation of the amplitude, frequency and phase of a sinusoid in noise, together with the CRLB of var(Â) for several fixed settings; in all cases the bound decreases as N grows.]
[Figure: daily returns of crude oil versus the energy sector (Vanguard Energy ETF), data from Apr. 2024, with a fitted regression line (left), and the residuals of the linear fit (right).]

[Figure: S&P 500 versus gold prices in April 2024, with linear and quadratic fits (left), and the residuals of both fits (right); sum of squared residuals: 47375 for the linear fit, 42270 for the quadratic fit.]
Example 8: Linear model of a straight line in noise

x[n] = A + Bn + w[n],    n = 0, 1, . . . , N − 1

where
◦ w[n] ∼ N(0, σ²),
◦ B is the slope, and
◦ A is the intercept.

[Figure: the observed noisy line x[n] and the ideal noiseless line A + Bn plotted against n, with intercept A at n = 0.]

p(x; A, B) = p(x; θ) = (1/(2πσ²)^{N/2}) exp[ −(x − Hθ)ᵀ(x − Hθ) / (2σ²) ]        equivalent to (∗)
NB: p(x; θ) = (1/(2πσ²)^{N/2}) exp[ −(1/(2σ²)) ∑_{n=0}^{N−1} (x[n] − A − Bn)² ]
The CRLB theorem can be used to obtain the MVU estimator for θ.
The MVU estimator, θ̂ = g(x), will then satisfy

∂ ln p(x; θ) / ∂θ = I(θ) ( g(x) − θ )

where I(θ) is the Fisher information matrix, whose elements are

[I]ij = −E[ ∂² ln p(x; θ) / (∂θi ∂θj) ]
Then, for the Linear Model,

∂ ln p(x; θ) / ∂θ = ∂/∂θ [ −(N/2) ln(2πσ²) − (1/(2σ²)) (x − Hθ)ᵀ(x − Hθ) ]

                  = −(1/(2σ²)) ∂/∂θ [ Nσ² ln(2πσ²) + xᵀx − 2xᵀHθ + θᵀHᵀHθ ]
Note that only the last two terms, −2xᵀHθ and θᵀHᵀHθ, depend on θ. Recall that

∂(bᵀθ)/∂θ = b                          ⇒   ∂(xᵀHθ)/∂θ = (xᵀH)ᵀ = Hᵀx
∂(θᵀAθ)/∂θ = 2Aθ  (for symmetric A)    ⇒   ∂(θᵀHᵀHθ)/∂θ = 2HᵀHθ

(which you should prove for yourself), that is, follow the rules of
vector/matrix differentiation.
Then, the form of the partial derivative from the previous slide becomes

∂ ln p(x; θ) / ∂θ = (1/σ²) ( Hᵀx − HᵀHθ )

Therefore

∂ ln p(x; θ) / ∂θ = (1/σ²) HᵀH [ (HᵀH)⁻¹Hᵀx − θ ]

where (1/σ²) HᵀH plays the role of I(θ), and (HᵀH)⁻¹Hᵀx that of g(x).
x = Hθ + w
where
◦ x is an N×1 “vector of observed data”
◦ H is an N×p “observation (measurement) matrix” of rank p
◦ θ is a p×1 unknown “parameter vector”
◦ w is an N×1 additive “noise vector”, w ∼ N(0, σ²I).

The MVU estimator is θ̂ = (HᵀH)⁻¹Hᵀx, with covariance matrix

Cθ̂ = σ² (HᵀH)⁻¹

End of Theorem 2
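A minimal sketch of Theorem 2 in action on Example 8 (the straight line in noise); the true values of A, B, σ and N are arbitrary illustrative choices:

# Sketch: MVU estimator theta_hat = (H^T H)^{-1} H^T x for x[n] = A + B*n + w[n].
import numpy as np

rng = np.random.default_rng(1)
A, B, sigma, N = 2.0, 0.5, 1.0, 100            # illustrative true parameters

n = np.arange(N)
H = np.column_stack([np.ones(N), n])           # N x 2 observation matrix
x = H @ np.array([A, B]) + sigma * rng.standard_normal(N)

theta_hat = np.linalg.solve(H.T @ H, H.T @ x)  # (H^T H)^{-1} H^T x
C_theta   = sigma**2 * np.linalg.inv(H.T @ H)  # covariance of the estimator

print("theta_hat =", np.round(theta_hat, 3))   # close to [A, B] = [2.0, 0.5]
print("var(A_hat), var(B_hat) =", np.diag(C_theta))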
Example 9: Fourier analysis

The data model is given by (n = 0, 1, . . . , N − 1, w[n] ∼ N(0, σ²))

x[n] = ∑_{k=1}^{M} ak cos(2πkn/N) + ∑_{k=1}^{M} bk sin(2πkn/N) + w[n]

where the Fourier coefficients, ak and bk, that is, the amplitudes of the
cosine and sine terms, are to be estimated.
◦ Frequencies are multiples of the fundamental, f1 = 1/N, that is, fk = k/N.
The N×p measurement matrix H, with p = 2M, has the sampled cosines and sines as its columns: the kth cosine column holds cos(2πkn/N) and the kth sine column holds sin(2πkn/N), with the samples n = 0, 1, . . . , N − 1 running down each column.

For H not to be under-determined, it has to satisfy N > p ⇒ M < N/2.

The columns of H are mutually orthogonal, hᵢᵀhⱼ = 0 for i ≠ j.
In other words: (i) cos(iα) ⊥ sin(jα), ∀i, j;  (ii) cos(iα) ⊥ cos(jα), ∀i ≠ j;
(iii) sin(iα) ⊥ sin(jα), ∀i ≠ j.
Example 9: Fourier analysis → measurement matrix
Therefore (orthogonality)

HᵀH = [h1 | · · · | h2M]ᵀ [h1 | · · · | h2M] = diag(N/2, . . . , N/2) = (N/2) I

so that

θ̂ = (HᵀH)⁻¹Hᵀx = (2/N) Hᵀx = (2/N) [h1ᵀx, . . . , h2Mᵀx]ᵀ

that is,

âk = (2/N) ∑_{n=0}^{N−1} x[n] cos(2πkn/N),        b̂k = (2/N) ∑_{n=0}^{N−1} x[n] sin(2πkn/N)

From the CRLB for the Linear Model, the covariance matrix of this estimator is

Cθ̂ = σ² (HᵀH)⁻¹ = (2σ²/N) I        ← decreases with N
i) Note that, as θ̂ is a Gaussian random variable and the covariance
matrix is diagonal, the amplitude estimates are statistically independent;
ii) The orthogonality of the columns of H is fundamental in the
computation of the MVU estimator (invertible parsimonious basis);
iii) For accuracy, the measurement matrix H is desired to be a tall matrix
with orthogonal columns.
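A minimal sketch of this estimator (the choices of M, N, σ and the true coefficients are illustrative): it builds H from the sampled cosines and sines, checks that HᵀH = (N/2)I, and forms θ̂ = (2/N)Hᵀx.

# Sketch: MVU estimation of the Fourier coefficients a_k, b_k in WGN.
import numpy as np

rng = np.random.default_rng(2)
N, M, sigma = 64, 3, 0.3                        # illustrative values, with M < N/2
a_true = np.array([1.0, 0.5, -0.8])             # cosine amplitudes
b_true = np.array([0.2, -1.0, 0.4])             # sine amplitudes

n = np.arange(N)
cols = [np.cos(2*np.pi*k*n/N) for k in range(1, M+1)] + \
       [np.sin(2*np.pi*k*n/N) for k in range(1, M+1)]
H = np.column_stack(cols)                       # N x 2M, with orthogonal columns

x = H @ np.concatenate([a_true, b_true]) + sigma * rng.standard_normal(N)

print(np.allclose(H.T @ H, (N/2)*np.eye(2*M)))  # True: H^T H = (N/2) I
theta_hat = (2.0/N) * (H.T @ x)                 # [a_1..a_M, b_1..b_M] estimates
print(np.round(theta_hat, 3))                   # close to the true coefficients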
Example 10: System Identification (SYS ID)
Aim: To identify the model of a system (filter coefficients {h}) from
input/output data. Assume an FIR filter system model given below
[Figure: tapped-delay-line FIR filter whose output is the convolution ∑_k h(k) u[n−k] of the input u[n] with the filter coefficients h(k).]

◦ The input u[n] “probes” the system, then the output of the FIR filter is
given by the convolution x[n] = ∑_{k=0}^{p−1} h(k) u[n−k]
◦ We wish to estimate the filter coefficients [h(0), . . . , h(p − 1)]ᵀ
◦ In practice, the output is corrupted by additive WGN (see the sketch below).
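A minimal sketch of SYS ID posed as a linear model (the probing signal, noise level, and true coefficients h(k) are illustrative assumptions): row n of H holds [u[n], u[n−1], . . . , u[n−p+1]], so that x = Hθ + w with θ = [h(0), . . . , h(p−1)]ᵀ.

# Sketch: FIR system identification, x[n] = sum_k h(k) u[n-k] + w[n], via the linear model.
import numpy as np

rng = np.random.default_rng(3)
h_true = np.array([0.8, -0.4, 0.2])            # illustrative FIR coefficients, p = 3
p, N, sigma = len(h_true), 200, 0.1

u = rng.standard_normal(N)                     # white probing input
# column k of H holds u delayed by k samples: H[n, k] = u[n-k]  (zero for n < k)
H = np.column_stack([np.r_[np.zeros(k), u[:N-k]] for k in range(p)])
x = H @ h_true + sigma * rng.standard_normal(N)

h_hat = np.linalg.solve(H.T @ H, H.T @ x)      # (H^T H)^{-1} H^T x
print("h_hat =", np.round(h_hat, 3))           # close to h_true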
[Figure: Adaptive Noise Cancellation (ANC) in headphones. Primary input: d(n) = s(n) + N0(n), the speech or music s(n) plus the additive noise N0(n) heard through the headphones. Reference input: the cockpit noise x(n) = N1(n) picked up by a reference microphone. An adaptive filter driven by x(n) produces y(n), which is subtracted from d(n) to give the error e(n).]
Input: the cockpit noise, x(n) = N1(n), that is, the noise at the Reference
Microphone. The only requirement is that N1 is correlated with the noise,
N0, which you hear through the headphones, but not with the music
signal, s(n). The filter aims to estimate N0 from N1, that is, y = N̂0.
Based on the input–output cross-correlation, the filter output can only
produce an estimate of the noise you hear, that is, y(n) = N̂0(n), as the
cockpit noise, N1, is not correlated with the music, s.
Therefore we hear d(n) − y(n) = s(n) + N0(n) − N̂0(n) ≈ s(n).
The need for a General Linear Model (GLM)
We shall now consider a general case where:
1) the observed signal may contain a known but non-white component, s
2) the observation noise, w, may be non-white, that is, w ∼ N(0, C) with C ≠ σ²I.
Case 1) Often in practical applications (e.g. in radar), the observed signal
consists of some known signal, s, and another signal whose components are
not known, Hθ, so that the linear model of the observed signal becomes
x = Hθ + s + w (here, noise is assumed to be white)
The MVU estimator is determined immediately from x′ = x − s, so that
x′ = Hθ + w, and

θ̂ = (HᵀH)⁻¹Hᵀ(x − s)

Cθ̂ = σ² (HᵀH)⁻¹        (covariance matrix of θ̂)
An example may be a DC level observed in random white noise, but also
with a known sinusoidal interference (e.g. from the mains).
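A minimal sketch of Case 1 for the mains-interference example just mentioned (the DC level, interference and noise level are illustrative assumptions): the known signal s is subtracted, and the standard linear-model estimator is then applied to x − s.

# Sketch: DC level A in WGN plus a KNOWN sinusoidal interference s[n].
import numpy as np

rng = np.random.default_rng(4)
A, sigma, N = 1.5, 0.5, 500                    # illustrative values
n = np.arange(N)

s = 0.8 * np.sin(2*np.pi*0.1*n)                # known interference (e.g. from the mains)
H = np.ones((N, 1))                            # linear model for the DC level
x = A * np.ones(N) + s + sigma * rng.standard_normal(N)

theta_hat = np.linalg.solve(H.T @ H, H.T @ (x - s))   # (H^T H)^{-1} H^T (x - s)
print("A_hat =", float(theta_hat[0]))          # close to A = 1.5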
Incorporating coloured (correlated) noise into GLM
Case 2) For coloured noise, w ∼ N(0, C), where C ≠ σ²I, that is, C is not a
scaled identity matrix!
To this end, we can use a whitening approach as follows:
Since C is positive definite, so too is C⁻¹ ⇒ it can be factored as

C⁻¹ = DᵀD,    where the N×N matrix D is invertible.

Now, D acts as a whitening transform when applied to w, since

E[ (Dw)(Dw)ᵀ ] = E[ Dwwᵀ Dᵀ ] = DCDᵀ = D D⁻¹ D⁻ᵀ Dᵀ = I

This allows us to transform the general linear model from x = Hθ + w to

x′ = Dx = DHθ + Dw = H′θ + w′

The noise is now whitened, as w′ = Dw ∼ N(0, I) ⇒ use the Linear Model:

θ̂ = (H′ᵀH′)⁻¹H′ᵀx′ = (HᵀDᵀDH)⁻¹HᵀDᵀDx = (HᵀC⁻¹H)⁻¹HᵀC⁻¹x

In a similar fashion, for the variance of this estimator we have

Cθ̂ = (H′ᵀH′)⁻¹    and finally    Cθ̂ = (HᵀC⁻¹H)⁻¹

For C = σ²I we recover our previous results for the standard Linear Estimator.
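A minimal sketch of Case 2 (the AR(1)-style covariance used for C is an illustrative assumption): the whitening matrix D is obtained from a Cholesky factor of C⁻¹, and the whitened route gives the same estimate as the direct expression θ̂ = (HᵀC⁻¹H)⁻¹HᵀC⁻¹x.

# Sketch: GLM with coloured noise, theta_hat = (H^T C^{-1} H)^{-1} H^T C^{-1} x.
import numpy as np

rng = np.random.default_rng(5)
N, rho = 200, 0.9
n = np.arange(N)

C = rho ** np.abs(n[:, None] - n[None, :])     # illustrative AR(1)-type noise covariance
H = np.column_stack([np.ones(N), n])           # straight-line model
theta_true = np.array([1.0, 0.05])

w = rng.multivariate_normal(np.zeros(N), C)    # coloured Gaussian noise
x = H @ theta_true + w

Cinv = np.linalg.inv(C)
theta_direct = np.linalg.solve(H.T @ Cinv @ H, H.T @ Cinv @ x)

D = np.linalg.cholesky(Cinv).T                 # C^{-1} = D^T D, so D whitens the noise
Hp, xp = D @ H, D @ x                          # transformed (whitened) model
theta_white = np.linalg.solve(Hp.T @ Hp, Hp.T @ xp)

print("direct  :", np.round(theta_direct, 4))
print("whitened:", np.round(theta_white, 4))   # the two routes agree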
Can we resort to (approximately) Gaussian distribution?
Yes, very often, if we re–cast our problem in an appropriate way (see Lecture 3)
Top panel: share prices, pn, of Apple (AAPL), General Electric (GE) and
Boeing (BA) (left) and their histogram (right). Bottom panel: logarithmic
returns for these assets, ln(pn/pn−1), that is, the differences of the log prices
on consecutive days (left), and the histogram of the log returns (right).

The MVU estimator for the general linear model is efficient, that is, it attains the Cramer–Rao Lower Bound (CRLB).
We must assume that H is full rank, otherwise for any s there exist some
θ1 and θ2 which both give s, that is, s = Hθ1 = Hθ2 (no uniqueness).
Example 11: The concept of “linear in the parameters” models (e.g. like neural networks) (see also Lecture 8)
Recall that the notion “linear” in the term “Linear Models” does not arise
from fitting straight lines to data!
[Figure: observed data x[n] versus n = 1, . . . , 6, together with the true signal of interest, which is quadratic in n. The model is quadratic in time “n”, yet “linear in the parameters!”]
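A minimal sketch of this point (the quadratic coefficients and noise level are illustrative assumptions): the model x[n] = θ1 + θ2 n + θ3 n² is quadratic in the time index n, yet linear in the parameters, so the linear-model MVU estimator applies unchanged.

# Sketch: a model quadratic in n but linear in the parameters theta.
import numpy as np

rng = np.random.default_rng(6)
N, sigma = 50, 2.0
theta_true = np.array([1.0, -0.3, 0.05])       # illustrative [constant, linear, quadratic] terms

n = np.arange(N)
H = np.column_stack([np.ones(N), n, n**2])     # columns 1, n, n^2  ->  x = H theta + w
x = H @ theta_true + sigma * rng.standard_normal(N)

theta_hat = np.linalg.solve(H.T @ H, H.T @ x)  # the same MVU estimator as before
print("theta_hat =", np.round(theta_hat, 3))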
2. Fix x and take the 2nd partial derivative of the log-likelihood function, that
is, ∂² ln p(x; θ)/∂θ².
3. If the result still depends on x, then fix θ and take the expected
value with respect to x. Otherwise, this step is not needed.
4. Should the result still depend on θ, then evaluate it at every specific value
of θ.
For some problems, an efficient estimator may not exist, for example
the estimation of sinusoidal phase (see your P&A sets).
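The steps above can be mechanised symbolically; a minimal sketch (using sympy, as an illustration) for the single-sample DC level of Example 2, where the second derivative turns out not to depend on x[0], so step 3 is not needed:

# Sketch: steps 2-4 applied to ln p(x[0]; A) for a DC level in WGN.
import sympy as sp

x0, A = sp.symbols('x0 A', real=True)
sigma = sp.symbols('sigma', positive=True)

log_p = -sp.log(sp.sqrt(2*sp.pi*sigma**2)) - (x0 - A)**2 / (2*sigma**2)

d2 = sp.diff(log_p, A, 2)   # step 2: second partial derivative -> -1/sigma**2
print(d2)                   # independent of x0, so step 3 (expectation) is not needed
print(-1/d2)                # CRLB: var(A_hat) >= sigma**2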
For Δθ → 0,

S^p_θ(x) = lim_{Δθ→0} S̃^p_θ(x) = [ ∂p(x; θ)/∂θ ] [ θ / p(x; θ) ] = θ ∂ ln p(x; θ)/∂θ

(recall the derivative rule for the log function, ∂ ln f(x)/∂x = (1/f(x)) ∂f(x)/∂x)

so that the CRLB can be written in the relative form

var(θ̂)/θ² ≥ 1 / ( θ² E{ [ ∂ ln p(x; θ)/∂θ ]² } ) = 1 / E{ [ S^p_θ(x) ]² }
Therefore, a large sensitivity ∂g(θ)/∂θ (the sensitivity of α to θ) means that a small error in θ gives
a large error in α. This, in turn, increases the CRLB (that is, worsens accuracy).
It can be shown that if g(θ) has an affine form, that is, g(θ) = aθ + b,
then α̂ = g(θ̂) is efficient.
Otherwise, for any other form of g(θ), the result is asymptotically efficient
for N → ∞.
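For reference (a standard result, stated here to make the sensitivity argument explicit): for the scalar transformation α = g(θ), the CRLB becomes

var(α̂) ≥ ( ∂g(θ)/∂θ )² / I(θ),        where I(θ) = −E{ ∂² ln p(x; θ)/∂θ² }

so an affine g(θ) = aθ + b simply scales the bound by a².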