SMSTC Lecture Notes, Lecture 5
INVERSE PROBLEMS
Lecture 5: Maximum Likelihood methods, statistical approach
Anya Kirpichnikova, University of Stirling ([email protected])
www.smstc.ac.uk
Contents
5.1 How can we look at what we have done from a statistical point of view?
5.1.1 The variance of the model parameter estimates
5.2 The unit covariance matrix and the Backus-Gilbert spread function
5.2.1 The unit covariance matrix
5.2.2 Dirichlet spread function (general case), Sylvester equation
5.2.3 Backus-Gilbert (BG) spread function
5.3 Maximum Likelihood Methods
5.3.1 Maximum likelihood principle/function
5.3.2 Inverse problem with Gaussian-distributed data
5.3.3 Underdetermined cases: probabilistic representation of a priori information
5.3.4 Information gain
5.3.5 Exact theory
5.3.6 Fuzzy theories (just some ideas)
5.3.7 Minimising the relative entropy (information gain)
5.1 How can we look at what we have done from a statistical point of view?
5.1.1 The variance of the model parameter estimates
As discussed before, the data contain noise that causes errors in m^est. The goal is to understand how the measurement errors map into errors in the model parameter estimates under a linear transformation
\[ m^{est} = M d + v \]
for some M, v. Assume the data d have some distribution, characterised by a covariance matrix [cov d]; then, as seen in Eq. 4.2, the estimates of the model parameters have a distribution characterised by the covariance matrix
\[ [\mathrm{cov}\, m] = M\, [\mathrm{cov}\, d]\, M^T , \]
so errors in m depend on errors in d.
How do we arrive at an estimate of the variance of the data, σ²_d?
• Way 1: a priori variance: a length is measured by a ruler with 1 mm divisions, so σ_d ≈ 1/2 mm.
• Way 2: a posteriori, based on the size of the distribution of the prediction errors e obtained by fitting the model to the data:
\[ \sigma_d^2 \approx \frac{1}{N-M} \sum_{i=1}^N e_i^2 \]
(compare with the mean-squared error \( \frac{1}{N}\sum_{i=1}^N e_i^2 \)).
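A minimal numerical sketch of Way 2 (the straight-line problem, noise level and seed here are invented purely for illustration):

```python
import numpy as np

# hypothetical straight-line problem: d_i = m1 + m2 * z_i  (N data, M = 2 parameters)
rng = np.random.default_rng(0)
N, M = 20, 2
z = rng.uniform(0, 10, N)
K = np.column_stack([np.ones(N), z])
d_obs = K @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.5, N)   # true sigma_d = 0.5

m_est, *_ = np.linalg.lstsq(K, d_obs, rcond=None)            # LS solution
e = d_obs - K @ m_est                                        # prediction errors
var_d_post = e @ e / (N - M)                                 # a posteriori estimate of sigma_d^2
print(var_d_post, e @ e / N)                                 # compare with the mean-squared error
```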
a [email protected]
5–1
SM ST C : INVERSE PROBLEMS 5–2
Overdetermined case: LS
i.e. the estimates m can be correlated and of unequal variance even when the data are uncorrelated and of equal variance.
• If E(m) = e^T e has a sharp minimum near m^est, then the LS solution is well defined in the sense of having small variance. The latter means that small errors in determining the shape of E(m), due to random fluctuations in the data, lead to small errors in m^est.
Mathematically, the curvature of a function determines the sharpness of its minimum: the variance should be related to the curvature of E(m) at its minimum, which in turn depends on the structure of K. The curvature is captured by the second derivative, i.e.
\[ \Delta E = E(m) - E(m^{est}) = \frac12 \, [m - m^{est}]^T \left.\frac{\partial^2 E}{\partial m^2}\right|_{m=m^{est}} [m - m^{est}], \]
where the matrix \( \frac{\partial^2 E}{\partial m^2} \) has elements \( \frac{\partial^2 E}{\partial m_i \partial m_j} \); notice that we are expanding at a minimum m = m^est, so the first-order term vanishes.
The covariance of the LS solution (for uncorrelated data, all with variance σ²_d) is
\[ [\mathrm{cov}\, m] = \sigma_d^2 \, [K^T K]^{-1} = \sigma_d^2 \left[ \frac12 \left.\frac{\partial^2 E}{\partial m^2}\right|_{m=m^{est}} \right]^{-1}, \]
which means that the covariance of the model parameters [cov m] is controlled by (a) the variance of the data σ²_d, and (b) the measure of the curvature of the prediction error, \( \left[ \frac12 \left.\frac{\partial^2 E}{\partial m^2}\right|_{m=m^{est}} \right]^{-1} \).
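A short sketch of how σ²_d propagates into [cov m], continuing the invented straight-line fit of the previous snippet (the kernel K and the data variance are assumptions):

```python
import numpy as np

# assumed quantities from a toy straight-line fit: kernel K (N x 2) and data variance sigma_d^2
rng = np.random.default_rng(1)
z = rng.uniform(0, 10, 20)
K = np.column_stack([np.ones_like(z), z])
sigma_d2 = 0.5 ** 2

cov_m = sigma_d2 * np.linalg.inv(K.T @ K)      # [cov m] = sigma_d^2 (K^T K)^{-1}
print(np.sqrt(np.diag(cov_m)))                 # standard errors of intercept m1 and slope m2
```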
See Fig. 5.1; the variance is related to the size of the ellipse, m1 is the intercept (downward axis) and m2 is the slope (horizontal axis). The slope is determined more accurately than the intercept; see the discussion in lectures.
Underdetermined case: ML
Figure 5.1: Two examples of data; panels (a) case 1: data, (b) case 1: variance, (c) case 2: data, (d) case 2: variance. Parts (a) and (c) show the data and the LS solution; parts (b) and (d) show the corresponding variance. In case 2 the points are concentrated, so many different gradients fit them almost equally well.
5.2 The unit covariance matrix and the Backus-Gilbert spread function
5.2.1 The unit covariance matrix
Covariance of the model parameters [cov m] depends on [cov d] and on the way the data d are mapped into m, i.e. on K. A unit covariance matrix characterises the degree of error amplification that occurs in this mapping.
• if d is correlated,
\[ [\mathrm{cov}_u m] = K^{-g} [\mathrm{cov}_u d] K^{-g\,T}, \]
where [cov_u d] is some normalisation of [cov d].
A unit covariance matrix [cov_u m] is a useful tool in experimental design: it is independent of the actual values and variances of the data.
For the straight-line fit, for example,
\[ [\mathrm{cov}_u m] = \frac{1}{N\sum z_i^2 - \left(\sum z_i\right)^2} \begin{bmatrix} \sum z_i^2 & -\sum z_i \\ -\sum z_i & N \end{bmatrix}. \]
The estimates of m1 and m2 are uncorrelated only when the data are centered about z = 0. The overall size of the variance is controlled by the denominator of the fraction. If all the values of z are nearly equal (see Fig. 5.2, case 1), the denominator of the fraction is small and the variance is large. If the values of z have a large spread, the denominator is large and the variance is small.
The unit covariance matrix [cov_u m] measures the amount of error amplification in the mapping from d to m; its size is
\[ \mathrm{size}\left([\mathrm{cov}_u m]\right) = \left\| [\mathrm{var}_u m]^{1/2} \right\|_2^2 = \sum_{i=1}^M [\mathrm{cov}_u m]_{ii}, \]
where the square roots are taken element-wise, and only the diagonal elements are taken into account.
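A small sketch of the unit covariance matrix and its size for the straight-line fit; the two depth layouts below are invented to mimic cases 1 and 2 of Fig. 5.2:

```python
import numpy as np

def unit_covariance(z):
    """[cov_u m] for the straight-line fit d_i = m1 + m2*z_i with [cov_u d] = I."""
    K = np.column_stack([np.ones(len(z)), z])
    return np.linalg.inv(K.T @ K)               # equals K^{-g} [cov_u d] K^{-gT} for the LS inverse

z_clustered = np.array([4.8, 4.9, 5.0, 5.1, 5.2])   # case 1: values nearly equal
z_spread    = np.array([0.0, 2.5, 5.0, 7.5, 10.0])  # case 2: well spread

for z in (z_clustered, z_spread):
    C = unit_covariance(z)
    print(C, "size =", np.trace(C))             # size([cov_u m]) = sum of diagonal elements
```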
Figure 5.2: case 1: data points are not spread; case 2: data points are better spread.
Cumbersome (but similar to the above) calculus leads to a Sylvester equation.
• LS solution: α1 = 1, α2 = α3 = 0
• ML solution: α1 = 0, α2 = 1, α3 = 0
• Damped LS solution: α1 = 1, α2 = 0, α3 = ε² and [cov_u d] = I; then
\[ K^{-g} = \left[ K^T K + \varepsilon^2 I \right]^{-1} K^T. \]
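A minimal sketch of the damped LS generalised inverse and the resulting model resolution matrix; the kernel K here is a random placeholder, not a physical example:

```python
import numpy as np

def damped_ls_inverse(K, eps):
    """Damped LS generalised inverse K^{-g} = (K^T K + eps^2 I)^{-1} K^T."""
    M = K.shape[1]
    return np.linalg.solve(K.T @ K + eps**2 * np.eye(M), K.T)

# hypothetical underdetermined kernel, just to exercise the formula
rng = np.random.default_rng(2)
K = rng.normal(size=(8, 12))        # N = 8 data, M = 12 parameters
Kg = damped_ls_inverse(K, eps=0.1)
R = Kg @ K                          # model resolution matrix R = K^{-g} K
print(R.shape, np.trace(R))
```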
If R contains negative off-diagonal elements, then interpreting it as an averaging makes less sense: non-negativity can be included as a new constraint, but this makes the problem even harder. When there is a natural ordering, we would prefer any large elements to be close to the main diagonal, since then we have localised averaging functions. If we continue using the Dirichlet spread function to compute K^{-g}, we would often get sidelobes (large-amplitude regions of the resolution matrix far from the main diagonal).
Two matrices R1 and R2 can have the same Dirichlet spread; see Fig. 5.3. Say the ith row of matrix R1 is
0 0 . . . 0.2 0.6 0.2 . . . 0
Figure 5.3: Sidelobes of the matrices: the same Dirichlet spread, but R1 (A) is better resolved.
where in both cases 0.6 is the diagonal element. The contribution of the ith row to the Dirichlet spread is exactly the same, (0.2 − 0)² + (0.6 − 1)² + (0.2 − 0)², but the distribution of the weights is different.
5.2.3 Backus-Gilbert (BG) spread function
The Backus-Gilbert spread function weights each element of R by its distance from the main diagonal,
\[ \mathrm{spread}_{BG}(R) = \sum_{i=1}^M \sum_{j=1}^M w(i,j)\, R_{ij}^2, \]
where w(i, j) is a weighting factor based on the distance from the main diagonal (it should grow as the element gets further away from the diagonal). Usually w(i, j) = (i − j)² is taken in the natural-ordering case.
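To illustrate, a small sketch comparing the Dirichlet and BG spreads of two hypothetical resolution-matrix rows; the second row (sidelobes far from the diagonal) is invented for contrast:

```python
import numpy as np

def dirichlet_spread_row(row, k):
    """Sum_j (R_kj - I_kj)^2 for one row."""
    target = np.zeros_like(row); target[k] = 1.0
    return np.sum((row - target) ** 2)

def bg_spread_row(row, k):
    """Sum_j (j - k)^2 R_kj^2 for one row (natural-ordering weight)."""
    j = np.arange(len(row))
    return np.sum((j - k) ** 2 * row ** 2)

M, k = 21, 10
row1 = np.zeros(M); row1[k] = 0.6; row1[k - 1] = row1[k + 1] = 0.2   # sidelobes next to diagonal
row2 = np.zeros(M); row2[k] = 0.6; row2[0] = row2[-1] = 0.2          # sidelobes far away (invented)

for row in (row1, row2):
    print(dirichlet_spread_row(row, k), bg_spread_row(row, k))       # same Dirichlet, different BG spread
```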
Similar Backus-Gilbert spread functions can be introduced for the data resolution matrix N.
Since we have introduced a new measure, we can find new generalised inverses!
Underdetermined case
It is easy to make the data-resolution spread, spread(N), small, so we minimise the BG spread: spread(R) → min. The diagonal elements of R are now given no weight, so we add the condition
\[ \sum_{j=1}^M R_{ij} = 1, \]
which keeps the diagonal elements finite and makes each row of R a unit-sum average of m^true. We use calculus again and minimise the spread row by row. Let the spread of the kth row of R be J_k:
" #" #
M M N N
Jk = ∑ w(l, k)Rkl Rkl = ∑ w(l, k) ∑ K−g
ki Kil ∑ K−g
k j K jl
l=1 l=1 i=1 j=1
N N M N N
−g −g
= ∑ ∑ Kki Kk j ∑ w(l, k)Kil K jl = ∑ ∑ K−g −g (k)
ki Kk j S ,
i=1 j=1 l=1 i=1 j=1
Minimising J_k subject to the unit-sum constraint (via a Lagrange multiplier λ) leads to the linear system
\[ \begin{bmatrix} S^{(k)} & u \\ u^T & 0 \end{bmatrix} \begin{bmatrix} k^{(k)} \\ \lambda \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \qquad k^{(k)}_i = K^{-g}_{ki}, \]
where the matrix is square, (N + 1) × (N + 1), and can be inverted by the bordering method: partition it into submatrices with simple properties.
Suppose that \( \begin{bmatrix} S^{(k)} & u \\ u^T & 0 \end{bmatrix}^{-1} \) exists and partition it into an N × N symmetric matrix A, a vector b and a scalar c:
\[ \begin{bmatrix} A & b \\ b^T & c \end{bmatrix} \begin{bmatrix} S^{(k)} & u \\ u^T & 0 \end{bmatrix} = \begin{bmatrix} A S^{(k)} + b u^T & A u \\ b^T S^{(k)} + c u^T & b^T u \end{bmatrix} = \begin{bmatrix} I & 0 \\ 0 & 1 \end{bmatrix}. \]
Since \( \begin{bmatrix} A & b \\ b^T & c \end{bmatrix} \) is the inverse of \( \begin{bmatrix} S^{(k)} & u \\ u^T & 0 \end{bmatrix} \), their product must equal the identity. From A u = 0 and A S^{(k)} + b u^T = I,
\[ b = \frac{[S^{(k)}]^{-1} u}{u^T [S^{(k)}]^{-1} u}, \]
and from b^T S^{(k)} + c u^T = 0 together with b^T u = 1,
\[ c = \frac{-1}{u^T [S^{(k)}]^{-1} u}. \]
We have found the inverse of \( \begin{bmatrix} S^{(k)} & u \\ u^T & 0 \end{bmatrix} \), and hence
\[ \begin{bmatrix} k^{(k)} \\ \lambda \end{bmatrix} = \begin{bmatrix} A & b \\ b^T & c \end{bmatrix} \begin{bmatrix} 0 \\ 1 \end{bmatrix} \ \Rightarrow\ k^{(k)} = b, \quad \lambda = c. \]
Then the required BG generalised inverse is
\[ K^{-g}_{kl} = \frac{\sum_{i=1}^N u_i \left[(S^{(k)})^{-1}\right]_{il}}{\sum_{i=1}^N \sum_{j=1}^N u_i \left[(S^{(k)})^{-1}\right]_{ij} u_j}, \qquad u_i = \sum_{j=1}^M K_{ij}. \tag{5.2} \]
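A sketch of Eq. (5.2) implemented row by row; the exponential kernel below is a small invented example, and a least-squares solve is used because S^(k) is typically very ill-conditioned for smooth kernels:

```python
import numpy as np

def backus_gilbert_inverse(K):
    """Row-wise Backus-Gilbert generalised inverse, Eq. (5.2), with w(l, k) = (l - k)^2."""
    N, M = K.shape
    u = K.sum(axis=1)                                   # u_i = sum_j K_ij
    Kg = np.zeros((M, N))
    l = np.arange(M)
    for k in range(M):
        w = (l - k) ** 2                                # natural-ordering weight
        S = (K * w) @ K.T                               # S^(k)_ij = sum_l w(l,k) K_il K_jl
        x = np.linalg.lstsq(S, u, rcond=None)[0]        # solve S^(k) x = u (robust to near-singular S)
        Kg[k, :] = x / (u @ x)                          # Eq. (5.2)
    return Kg

# toy exponential-decay kernel (assumed only for illustration)
z = np.linspace(0, 10, 30)
c = np.linspace(0.01, 0.1, 10)
K = np.exp(-np.outer(c, z))                             # N = 10 data, M = 30 parameters
Kg = backus_gilbert_inverse(K)
R = Kg @ K
print(R.sum(axis=1))                                    # each row of R averages to one
```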
As an illustration, consider the discrete Laplace transform of Exercise 5–1 below: the data are weighted averages of the model parameters m_j, with weights that decline exponentially with depth z. The decay is controlled by the constant c_i; the smaller i is, the wider the range over which the average is taken. The kernel K_ij = exp(−c_i z_j) acts as a smoothing, so shallow parameters are better resolved. The BG and Dirichlet solutions for this example are compared in Fig. 5.4.
Figure 5.4: The true solution m^true is shown in red in (b), (d). The BG solution (b) is much smoother than the Dirichlet solution (d); the BG resolution matrix (c) has much shallower sidelobes than the Dirichlet resolution matrix (a).
This solution was introduced in the 1970 Backus and Gilbert paper [10]. For noisy data, resolution is traded off against variance by minimising the weighted combination
\[ \alpha\, \mathrm{spread}(R) + (1-\alpha)\, \mathrm{size}([\mathrm{cov}_u m]) = \alpha \sum_{i=1}^M \sum_{j=1}^M w(i,j)\, R_{ij}^2 + (1-\alpha) \sum_{i=1}^M [\mathrm{cov}_u m]_{ii}, \qquad 0 \le \alpha \le 1, \]
where α is a weighting factor signifying the relative contributions of R and [cov_u m]. If we minimise the above linear combination as before (calculus), we get the following Backus-Gilbert generalised inverse:
\[ K^{-g}_{kl} = \frac{\sum_{i=1}^N u_i \left[(S'^{(k)})^{-1}\right]_{il}}{\sum_{i=1}^N \sum_{j=1}^N u_i \left[(S'^{(k)})^{-1}\right]_{ij} u_j}, \qquad S'^{(k)}_{ij} = \alpha S^{(k)}_{ij} + (1-\alpha)[\mathrm{cov}_u d]_{ij}, \qquad u_i = \sum_{j=1}^M K_{ij}. \]
5.3 Maximum Likelihood Methods
5.3.1 Maximum likelihood principle/function
Suppose the observations d_i^obs are drawn independently from a Gaussian distribution with unknown mean m1 and unknown variance σ²; their joint p.d.f. p(d^obs) forms a cloud of probability (see Fig. 5.5, part (b)) centred on the line d1 = d2 = d3 with radius proportional to σ. We look at p(d^obs) as the probability that the observed data were in fact observed, so we imagine sliding the cloud of probability in Fig. 5.5, part (b), up along the line and adjusting its diameter until this probability is maximised. This procedure defines a method of estimating the unknown parameters of the distribution: the method of maximum likelihood.
The maximum is located where the derivatives of p(d^obs) are zero:
\[ \frac{\partial p}{\partial m_1} = 0, \qquad \frac{\partial p}{\partial \sigma} = 0. \]
Since log(p) is a monotonic function of p, we could maximise it instead of p. Let us introduce the likelihood function (ignoring the normalisation constant (2π)^{-N/2}):
\[ L = \log p(d^{obs}) = -N \log\sigma - \frac{1}{2\sigma^2} \sum_{i=1}^N \left( d_i^{obs} - m_1 \right)^2. \]
Figure 5.5: Understanding the nature of the data and the maximum likelihood principle
Setting the derivatives of L to zero,
\[ m_1^{est} = \frac1N \sum_{i=1}^N d_i^{obs}, \qquad (\sigma^{est})^2 = \frac1N \sum_{i=1}^N \left(d_i^{obs} - m_1^{est}\right)^2, \]
we see that m^est is just the usual formula for the sample mean, while σ^est is almost the usual sample standard deviation (it should have had 1/(N−1) instead of 1/N); see Fig. 5.5, part (c). These formulae are only valid for the Gaussian p.d.f.
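A short numerical sketch of this difference (the sample below is invented):

```python
import numpy as np

# hypothetical sample: N Gaussian observations with unknown mean and sigma
rng = np.random.default_rng(3)
d = rng.normal(5.0, 2.0, size=30)

m_est = d.mean()                                                  # ML estimate = sample mean
sigma_ml = np.sqrt(np.sum((d - m_est) ** 2) / len(d))             # ML estimate, 1/N
sigma_sample = np.sqrt(np.sum((d - m_est) ** 2) / (len(d) - 1))   # usual sample std, 1/(N-1)
print(m_est, sigma_ml, sigma_sample)
```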
5.3.2 Inverse problem with Gaussian-distributed data
The p.d.f. p(d) ∝ exp(−½E) is maximal when the argument of the exponential is maximal, i.e. when
\[ E = (d - Km)^T [\mathrm{cov}_u d]^{-1} (d - Km) \to \min, \]
a weighted measure of the prediction error (see before). Conclusion: maximum likelihood in this case (Gaussian-distributed data) solves Km = d by weighted LS with weighting W_e = [cov_u d]^{-1}.
Special case 1: if the data are uncorrelated, all with equal variance, then [cov_u d] = σ²_d I, and we have the simple LS solution.
Special case 2: if the data are uncorrelated but their variances σ²_{d_i} are all different, then
\[ E = \sum_{i=1}^N \sigma_{d_i}^{-2} e_i^2 \to \min, \qquad e_i = d_i^{obs} - d_i^{pre}. \]
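A minimal sketch of Special case 2 (the straight-line problem and the per-datum variances are assumptions made for illustration):

```python
import numpy as np

# hypothetical problem: straight-line fit with data of unequal, known variances
rng = np.random.default_rng(4)
N = 25
z = np.linspace(0, 10, N)
K = np.column_stack([np.ones(N), z])
sigma_d = rng.uniform(0.2, 1.0, N)                       # per-datum standard deviations (assumed known)
d_obs = K @ np.array([1.0, 2.0]) + rng.normal(0, sigma_d)

We = np.diag(1.0 / sigma_d**2)                           # W_e = [cov_u d]^{-1} for uncorrelated data
m_wls = np.linalg.solve(K.T @ We @ K, K.T @ We @ d_obs)  # weighted LS: (K^T W_e K) m = K^T W_e d
print(m_wls)
```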
A priori distributions
Figure 5.6: a priori p.d.f.s; panels (a) pdf 1, (b) pdf 2, (c) pdf 3, (d) pdf 4, (e) pdf 5.
• If we expect the model parameters to be close to µ(m), we might use a Gaussian distribution with mean µ(m) and a variance that reflects the certainty of our knowledge; in Fig. 5.6 (c) we are more certain than in case (d).
• In Fig. 5.6 (b) the values are correlated. This distribution is non-Gaussian, but it can be approximated by a Gaussian distribution with non-zero covariance if the expected range of model parameters is small.
• If the a priori value of one model parameter is more certain than another, we use different variances, see Fig. 5.6 (e).
When the range is unbounded, the uniform distribution does not exist, so some very wide Gaussian can probably be used instead.
The information gain (relative entropy) of pA with respect to the null distribution pN,
\[ S(p_A, p_N) = \int p_A(m) \log\frac{p_A(m)}{p_N(m)}\, dm, \]
is always a non-negative number and is only zero when pA = pN. S has the following properties:
• the more sharply peaked the p.d.f. becomes, the more its information gain increases;
• S is invariant under reparametrisation.
Since the a priori model is independent of the data, we can form an a priori p.d.f.
Since we have an exact theory, we can avoid Lagrange multipliers: all we need to do is substitute d = k(m) into p(m, d) and maximise (equivalently, minimise the negative log). The log of the Gaussian p.d.f. for the a priori model parameters is the function L(m), and the log of the Gaussian p.d.f. for the observations is the function E(m) (both written up to additive terms which do not involve m, so they can be ignored).
This is again a weighted damped LS (WDLS) problem (see Section 2), so its solution is the WDLS solution
\[ m^{WDLS} = \left[ K^T W_e K + \varepsilon^2 W_m \right]^{-1} \left[ K^T W_e d^{obs} + \varepsilon^2 W_m \mu(m) \right] = \left[ K^T [\mathrm{cov}\, d]^{-1} K + [\mathrm{cov}\, m]_A^{-1} \right]^{-1} \left[ K^T [\mathrm{cov}\, d]^{-1} d^{obs} + [\mathrm{cov}\, m]_A^{-1} \mu(m) \right] \]
\[ = \left( \begin{bmatrix} [\mathrm{cov}\, d]^{-1/2} K \\ [\mathrm{cov}\, m]_A^{-1/2} I \end{bmatrix}^T \begin{bmatrix} [\mathrm{cov}\, d]^{-1/2} K \\ [\mathrm{cov}\, m]_A^{-1/2} I \end{bmatrix} \right)^{-1} \begin{bmatrix} [\mathrm{cov}\, d]^{-1/2} K \\ [\mathrm{cov}\, m]_A^{-1/2} I \end{bmatrix}^T \begin{bmatrix} [\mathrm{cov}\, d]^{-1/2} d^{obs} \\ [\mathrm{cov}\, m]_A^{-1/2} \mu(m) \end{bmatrix} = [F^T F]^{-1} [F^T f], \]
which can be considered as the simple LS solution of the following system:
\[ F m = f, \qquad F^T F m^{est} = F^T f. \tag{5.5} \]
Assume the data are uncorrelated and the model parameters have a uniform variance; then we can simplify the equation so that only one parameter is left:
\[ [\mathrm{cov}\, d] = \sigma_d^2 I, \quad [\mathrm{cov}\, m] = \sigma_m^2 I \ \Rightarrow\ F = \begin{bmatrix} K \\ \varepsilon I \end{bmatrix}, \quad f = \begin{bmatrix} d^{obs} \\ \varepsilon\, \mu(m) \end{bmatrix}, \quad \text{with } \varepsilon^2 = \frac{\sigma_d^2}{\sigma_m^2}. \]
So another mysterious question, "what should ε be in the damped LS?", has the answer ε² = σ²_d / σ²_m (not by trial and error this time!): it is just the ratio of the variances of the data and of the a priori model parameters.
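A short sketch verifying that the augmented LS system (5.5) reproduces the damped LS formula; the kernel, prior mean and variances below are invented for illustration:

```python
import numpy as np

# hypothetical setup: uncorrelated data (variance sigma_d^2), a priori model mu with variance sigma_m^2
rng = np.random.default_rng(5)
N, M = 8, 12
K = rng.normal(size=(N, M))
mu = np.zeros(M)
sigma_d, sigma_m = 0.5, 2.0
eps = sigma_d / sigma_m                               # eps^2 = sigma_d^2 / sigma_m^2
d_obs = K @ rng.normal(0, sigma_m, M) + rng.normal(0, sigma_d, N)

# augmented LS system F m = f, Eq. (5.5)
F = np.vstack([K, eps * np.eye(M)])
f = np.concatenate([d_obs, eps * mu])
m_aug, *_ = np.linalg.lstsq(F, f, rcond=None)

# damped LS formula, for comparison
m_dls = np.linalg.solve(K.T @ K + eps**2 * np.eye(M), K.T @ d_obs + eps**2 * mu)
print(np.allclose(m_aug, m_dls))                      # the two solutions agree
```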
\[ p(m|d) = \frac{p(d|m)\, p(m)}{p(d)}. \]
Here p(d) does not depend on the model parameters, so it acts as a normalisation. A combination of two p.d.f.s should satisfy the following properties [8]: the order in which they are combined does not matter; combining with the null distribution leaves the original distribution unchanged; the combination is invariant under a change of variables (reparametrisation); and the combination is zero only where both combined p.d.f.s are zero. The required combination is a product (T stands for total), provided the null distribution is constant, i.e.
\[ p_T(m, d) = p_A(m, d)\, p_k(m, d), \qquad p_N \propto \mathrm{const}. \]
Both d^pre and m^est are obtained simultaneously. This method does not necessarily give the same m^est as the maximisation of the L function above. To find the maximum likelihood point of p_T(m, d) with respect to the model parameters alone, we sum (integrate) all the probabilities along lines of equal model parameters (project onto the d = 0 plane), and only then look for a maximum:
\[ p(m) = \int p_T(m, d)\, dd_1 \dots dd_N. \]
Consider the (WDLS) solution of the linear problem Km = d, assuming the a priori p.d.f. is
\[ p_A(m, d) \propto \exp\left( -\frac12 (m - \mu(m))^T [\mathrm{cov}\, m]_A^{-1} (m - \mu(m)) - \frac12 (d - d^{obs})^T [\mathrm{cov}\, d]^{-1} (d - d^{obs}) \right). \]
If there are no errors in the theory, then the theory p.d.f. is narrow, say, a Dirac delta function:
\[ p_k(m, d) = \delta(Km - d). \]
These are just the ML (when underdetermined) and LS (when overdetermined) solutions. If the a priori model parameters are not zero,
\[ m^{est} = K^{-g} d^{obs} + (I - R)\mu(m) = K^T [K K^T]^{-1} d^{obs} + \left( I - K^T [K K^T]^{-1} K \right)\mu(m). \]
When σ²_d → ∞ and/or σ²_k → ∞, the solution is m^est = µ(m) (no information from either the data or the theory). When instead the a priori information is weak,
\[ m^{est} = [K^T K]^{-1} K^T d^{obs}, \]
so weak a priori information with finite-error data and theory produces the same solution as finite-error a priori information and error-free data [1].
• find the solution p.d.f. pT(m) that has the largest S compared to pA(m)
• find the solution p.d.f. pT(m) that has the smallest possible new information compared to the a priori p.d.f. pA(m)
where
\[ A(m) = -\frac12 (m - \mu(m))^T [\mathrm{cov}\, m]_A^{-1} (m - \mu(m)) - (1 + \lambda_0) - \lambda^T (d - Km). \]
2
the best estimate of mest is the mean of this distribution, which is also a max likelihood point. Solving for a
minimum of A :
mest − µ(m) = [cov m]A KT [K[cov m]A KT ]−1 [d − Kµ(m)]
so the new principle (MRE) applied to the underdetermined problem, gave us the WML solution with Wm−1 =
[cov m]A .
Exercises
5–1. Consider the Laplace transform
\[ d(c) = \int_0^\infty e^{-cz}\, m(z)\, dz \]
and its discrete version
\[ d_i = \sum_{j=1}^M \exp(-c_i z_j)\, m_j, \qquad z_j \in [0, 10]; \]
the data are weighted averages of the model parameters m_j, with weights that decline exponentially with depth z. The decay is controlled by the constant c_i; the smaller i is, the wider the range over which the average is taken. You need to simulate images similar to Fig. 5.4 following the steps below (a partial sketch is given after the exercise).
(a) Generate M = 100 true model parameters m^true, m_i, i ∈ [1, 100], such that they are all zero except for a few, say m_5 = m_10 = m_20 = m_50 = m_90 = 1.
(b) Generate the observed data, N = 80, such that c_j ∈ [0, 0.1] and the kernel is K_ij = exp(−c_i z_j); the noise is normally distributed with zero mean and small standard deviation, d^obs = d^true + noise.
(c) Construct the minimum-length solution K^{-g}_{ML} = [K^T K + ε² I]^{-1} K^T with ε = 10^{-12}, find m^est = K^{-g}_{ML} d^obs and the resolution matrix R = K^{-g}_{ML} K. Draw the heatmap of the model resolution matrix R, the true model parameters and the estimated model parameters on the same graph, as in Fig. 5.4.
(d) Using the same model parameters and the same observed data as above, construct the Backus-Gilbert (BG) solution row-wise by formula (5.2), find m^est = K^{-g}_{BG} d^obs, the corresponding estimates of the model parameters and the corresponding BG model resolution matrix, and complete the set of pictures of Fig. 5.4.
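A minimal sketch of steps (a)–(c); the noise level and plotting details are assumptions, and step (d) can reuse the backus_gilbert_inverse sketch given after Eq. (5.2):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# (a) true model: M = 100 parameters, all zero except a few spikes
M = 100
z = np.linspace(0, 10, M)
m_true = np.zeros(M)
m_true[[4, 9, 19, 49, 89]] = 1.0          # m_5 = m_10 = m_20 = m_50 = m_90 = 1 (1-based indices)

# (b) observed data: N = 80, exponential kernel, small Gaussian noise (noise level assumed)
N = 80
c = np.linspace(0, 0.1, N)
K = np.exp(-np.outer(c, z))               # K_ij = exp(-c_i z_j)
d_obs = K @ m_true + rng.normal(0.0, 1e-3, N)

# (c) damped minimum-length solution and its resolution matrix
eps = 1e-12                               # as prescribed; lstsq copes with the near-singular normal matrix
Kg = np.linalg.lstsq(K.T @ K + eps**2 * np.eye(M), K.T, rcond=None)[0]
m_est = Kg @ d_obs
R = Kg @ K

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].imshow(R, cmap="viridis")           # heatmap of the model resolution matrix
ax[0].set_title("model resolution matrix R")
ax[1].plot(z, m_true, "r", label="m_true")
ax[1].plot(z, m_est, "b", label="m_est")
ax[1].legend(); ax[1].set_xlabel("z")
plt.show()
```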