
SMSTC (2020/21)

INVERSE PROBLEMS
Lecture 5: Maximum Likelihood methods, statistical approach
Anya Kirpichnikova, University of Stirling^a

www.smstc.ac.uk

Contents
5.1 How can we look at what we have done from a statistical point of view?
5.1.1 The variance of the model parameter estimates
5.2 The unit covariance matrix and the Backus-Gilbert spread function
5.2.1 The unit covariance matrix
5.2.2 Dirichlet spread function (general case), Sylvester equation
5.2.3 Backus-Gilbert (BG) spread function
5.3 Maximum Likelihood Methods
5.3.1 Maximum likelihood principle/function
5.3.2 Inverse problem with Gaussian-distributed data
5.3.3 Underdetermined cases: Probabilistic representation of a priori information
5.3.4 Information gain
5.3.5 Exact Theory
5.3.6 Fuzzy theories (just some ideas)
5.3.7 Minimising the Relative entropy (information gain)

5.1 How can we look at what we have done from a statistical point of view?
5.1.1 The variance of the model parameter estimates
As discussed before, the data contain noise that causes errors in $m^{est}$. The goal is to consider how the measurement errors map into errors in the model parameter estimates under a linear transformation
$$m^{est} = Md + v$$
for some $M$, $v$. Assume the data $d$ have some distribution characterised by a covariance matrix $[\mathrm{cov}\,d]$; then, as seen in Eq. 4.2, the estimates of the model parameters have a distribution characterised by the covariance matrix
$$[\mathrm{cov}\,m] = M[\mathrm{cov}\,d]M^T,$$
so errors in $m$ depend on errors in $d$.
How do we arrive at an estimate of the variance of the data $\sigma_d^2$?

• Way 1: a priori variance: e.g. length is measured by a ruler with 1 mm divisions, so $\sigma_d \approx \tfrac{1}{2}$ mm.
• Way 2: a posteriori, based on the size distribution of the prediction errors $e$ obtained by fitting the model to the data:
$$\sigma_d^2 \approx \frac{1}{N-M}\sum_{i=1}^{N} e_i^2$$
(compare with the mean-squared error $\frac{1}{N}\sum_{i=1}^{N} e_i^2$).
^a [email protected]


Overdetermined case: LS

If $d$ is uncorrelated and of equal variance $\sigma_d^2$, then
$$m^{est,LS} = [K^TK]^{-1}K^Td,$$
$$[\mathrm{cov}\,m] = \big([K^TK]^{-1}K^T\big)[\mathrm{cov}\,d]\big([K^TK]^{-1}K^T\big)^T = \sigma_d^2[K^TK]^{-1},$$
i.e. $m$ can be correlated and of unequal variance even when the data are uncorrelated and of equal variance.
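This error-propagation formula is easy to check numerically. The sketch below is illustrative only (a toy straight-line kernel and noise level chosen by me, not from the lecture): it compares the analytic covariance $\sigma_d^2[K^TK]^{-1}$ with the empirical covariance of repeated noisy LS fits.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy straight-line problem d_i = m1 + m2*z_i (illustrative choice, not from the notes)
z = np.linspace(0.0, 10.0, 30)
K = np.column_stack([np.ones_like(z), z])          # N x M kernel
m_true = np.array([1.0, 0.5])
sigma_d = 0.2                                      # data standard deviation

# analytic covariance of the LS estimate: [cov m] = sigma_d^2 [K^T K]^{-1}
cov_m_analytic = sigma_d**2 * np.linalg.inv(K.T @ K)

# Monte-Carlo check: repeat the experiment with fresh noise, look at the spread of m_est
estimates = []
for _ in range(5000):
    d = K @ m_true + sigma_d * rng.standard_normal(len(z))
    estimates.append(np.linalg.lstsq(K, d, rcond=None)[0])
cov_m_empirical = np.cov(np.array(estimates).T)

print(cov_m_analytic)
print(cov_m_empirical)    # should agree up to Monte-Carlo error
```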

• If $E(m) = e^Te$ has a sharp minimum near $m^{est}$, then the LS solution is well defined in the sense of having small variance. This means that small errors in determining the shape of $E(m)$, due to random fluctuations in the data, lead to small errors in $m^{est}$.

• If $E(m)$ has a broad minimum, we expect $m^{est}$ to have large variance.

Mathematically, the curvature of a function is responsible for the sharpness of its minimum: the variance should be related to the curvature of $E(m)$ at its minimum, which in turn depends on the structure of $K$. Curvature is measured by the second derivative, i.e.
$$\Delta E = E(m) - E(m^{est}) = [m - m^{est}]^T\left[\frac{1}{2}\frac{\partial^2 E}{\partial m^2}\Big|_{m=m^{est}}\right][m - m^{est}],$$
where the matrix $\frac{1}{2}\frac{\partial^2 E}{\partial m^2}$ has elements $\frac{1}{2}\frac{\partial^2 E}{\partial m_i\,\partial m_j}$; notice that we are expanding about a minimum $m = m^{est}$, so the first-order term is zero. Explicitly,


$$E(m) = \|d - Km\|_2^2 = \sum_{i=1}^{N} d_i^2 - 2\sum_{i=1}^{N} d_i\sum_{j=1}^{M} K_{ij}m_j + \sum_{i=1}^{N}\sum_{j=1}^{M} K_{ij}m_j\sum_{k=1}^{M} K_{ik}m_k,$$
and hence, differentiating twice,
$$\frac{1}{2}\frac{\partial^2 E}{\partial m^2} = K^TK.$$

The covariance of the LS solution (for uncorrelated data, all with variance $\sigma_d^2$) is
$$[\mathrm{cov}\,m] = \sigma_d^2[K^TK]^{-1} = \sigma_d^2\left[\frac{1}{2}\frac{\partial^2 E}{\partial m^2}\Big|_{m=m^{est}}\right]^{-1},$$
which means that the covariance of the model parameters $[\mathrm{cov}\,m]$ is controlled by (a) the variance of the data $\sigma_d^2$, and (b) a measure of the curvature of the prediction error, $\left[\frac{1}{2}\frac{\partial^2 E}{\partial m^2}\big|_{m=m^{est}}\right]^{-1}$.

Example of straight-line fitting

See Fig. 5.1; the variance is related to the size of the ellipse, where $m_1$ is the intercept (downward axis) and $m_2$ is the slope (horizontal axis). The slope is determined more accurately than the intercept; discussion in lectures.

Underdetermined case: ML

If $d$ is uncorrelated and of equal variance $\sigma_d^2$, then
$$m^{est,ML} = K^T[KK^T]^{-1}d,$$
$$[\mathrm{cov}\,m] = \big(K^T[KK^T]^{-1}\big)[\mathrm{cov}\,d]\big(K^T[KK^T]^{-1}\big)^T = \sigma_d^2\,K^T[KK^T]^{-2}K.$$

Figure 5.1: (a) case 1: data; (b) case 1: variance; (c) case 2: data; (d) case 2: variance. Parts (a) and (c) contain the data and the LS solution; parts (b) and (d) contain the corresponding variance (error ellipses). Two examples of data: in case 2 the points are concentrated, so they leave many more options for the gradient.

5.2 The unit covariance matrix and the Backus-Gilbert spread function
5.2.1 The unit covariance matrix
The covariance of the model parameters $[\mathrm{cov}\,m]$ depends on $[\mathrm{cov}\,d]$ and on the way the data $d$ are mapped into $m$, i.e. on $K$. A unit covariance matrix characterises the degree of error amplification that occurs in the mapping.

• If $d$ is uncorrelated with uniform variance $\sigma^2$, then
$$[\mathrm{cov}_u m] = \sigma^{-2}K^{-g}[\mathrm{cov}\,d]K^{-gT} = \sigma^{-2}K^{-g}\sigma^2K^{-gT} = K^{-g}K^{-gT}.$$
• If $d$ is correlated,
$$[\mathrm{cov}_u m] = K^{-g}[\mathrm{cov}_u d]K^{-gT},$$
where $[\mathrm{cov}_u d]$ is some normalisation of $[\mathrm{cov}\,d]$.

A unit covariance matrix $[\mathrm{cov}_u m]$ is a useful tool in experimental design: it is independent of the actual values and variances of the data.

Example of straight-line fitting

Fit a straight line to $(z, d)$ data:
$$d_i = m_1 + m_2 z_i;$$
the unit covariance matrix for the intercept $m_1$ and slope $m_2$ is
$$[\mathrm{cov}_u m] = \frac{1}{N\sum z_i^2 - \left(\sum z_i\right)^2}
\begin{bmatrix} \sum z_i^2 & -\sum z_i \\ -\sum z_i & N \end{bmatrix}.$$
The estimates of $m_1$ and $m_2$ are uncorrelated only when the data are centred about $z = 0$. The overall size of the variance is controlled by the denominator of the fraction. If all the values of $z$ are nearly equal (see Fig. 5.2, case 1), the denominator is small and the variance is large. If the values of $z$ have a large spread, the denominator is large and the variance is small.
The unit covariance matrix $[\mathrm{cov}_u m]$ is a measure of the amount of error amplification mapped from $d$ to $m$; its size is
$$\mathrm{size}\left([\mathrm{cov}_u m]\right) = \big\|[\mathrm{var}_u m]^{1/2}\big\|_2^2 = \sum_{i=1}^{M}[\mathrm{cov}_u m]_{ii},$$
where the square roots are taken element-wise and only the diagonal elements are taken into account.
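As a quick illustration of the formula above, here is a small sketch (the $z$-values are assumed numbers, not from the lecture) that evaluates $[\mathrm{cov}_u m]$ and its size for clustered versus well-spread $z$-values; the spread case gives a much smaller size, as claimed.

```python
import numpy as np

def unit_covariance_line(z):
    """Unit covariance matrix [cov_u m] for the straight-line fit d_i = m1 + m2*z_i."""
    z = np.asarray(z, dtype=float)
    N = len(z)
    denom = N * np.sum(z**2) - np.sum(z)**2
    return np.array([[np.sum(z**2), -np.sum(z)],
                     [-np.sum(z),    N       ]]) / denom

z_clustered = np.linspace(4.9, 5.1, 20)    # case 1: z-values nearly equal
z_spread    = np.linspace(0.0, 10.0, 20)   # case 2: z-values well spread

for label, z in [("clustered", z_clustered), ("spread", z_spread)]:
    covu = unit_covariance_line(z)
    size = np.sum(np.diag(covu))           # size([cov_u m]) = sum of diagonal elements
    print(label, size)
```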

Figure 5.2: (a) case 1: the data points are not spread out; (b) case 2: the data points are better spread.

Overdetermined case: LS solution

$$[\mathrm{cov}_u m] = K^{-g}K^{-gT} = [K^TK]^{-1}K^TK[K^TK]^{-1} = [K^TK]^{-1}$$

Underdetermined case: ML solution

$$[\mathrm{cov}_u m] = K^{-g}K^{-gT} = K^T[KK^T]^{-1}[KK^T]^{-1}K = K^T[KK^T]^{-2}K$$

5.2.2 Dirichlet spread function (general case), Sylvester equation


The idea is to minimise the combination
$$\alpha_1\,\mathrm{spread}(N) + \alpha_2\,\mathrm{spread}(R) + \alpha_3\,\mathrm{size}([\mathrm{cov}_u m]) \to \min.$$
Cumbersome (but similar to the above) calculus leads to a Sylvester equation
$$\alpha_1[K^TK]K^{-g} + \alpha_2K^{-g}[KK^T] + \alpha_3K^{-g}[\mathrm{cov}_u d] = [\alpha_1 + \alpha_2]K^T. \qquad (5.1)$$

Special cases of the Sylvester equation (a numerical check of one case follows this list):

• LS solution: $\alpha_1 = 1$, $\alpha_2 = \alpha_3 = 0$.
• ML solution: $\alpha_1 = 0$, $\alpha_2 = 1$, $\alpha_3 = 0$.
• Damped LS solution: $\alpha_1 = 1$, $\alpha_2 = 0$, $\alpha_3 = \varepsilon^2$ and $[\mathrm{cov}_u d] = I$; then
$$K^{-g} = \left[K^TK + \varepsilon^2 I\right]^{-1}K^T.$$
• Damped ML solution: $\alpha_1 = 0$, $\alpha_2 = 1$, $\alpha_3 = \varepsilon^2$ and $[\mathrm{cov}_u d] = I$; then
$$K^{-g} = K^T\left[KK^T + \varepsilon^2 I\right]^{-1}.$$
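A quick numerical sanity check of one special case: with the form of Eq. (5.1) written above, the damped LS inverse should satisfy $[K^TK]K^{-g} + \varepsilon^2K^{-g} = K^T$. The sketch below uses a random $K$ and an arbitrary $\varepsilon$ (illustrative numbers only).

```python
import numpy as np

rng = np.random.default_rng(1)

N, M = 6, 4
K = rng.standard_normal((N, M))
eps2 = 0.1

# damped LS generalised inverse (alpha1 = 1, alpha2 = 0, alpha3 = eps^2, [cov_u d] = I)
K_dls = np.linalg.solve(K.T @ K + eps2 * np.eye(M), K.T)

# left-hand side of Eq. (5.1) for this choice of alphas
lhs = (K.T @ K) @ K_dls + eps2 * K_dls
print(np.allclose(lhs, K.T))     # True: the Sylvester equation is satisfied
```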

Model resolution matrix R: what’s wrong with the Dirichlet spread?

If $R$ contains negative off-diagonal elements, then interpreting it as an averaging operator makes less sense: non-negativity can be added as a new constraint, but this makes the problem even harder. When a natural ordering exists, we would prefer any large elements to be close to the main diagonal, since we then have localised averaging functions. If we keep using the Dirichlet spread function to compute $K^{-g}$, we often get sidelobes (large-amplitude regions in the resolution matrix far from the main diagonal).
Two matrices $R_1$ and $R_2$ can have the same Dirichlet spread; see Fig. 5.3, with, say, the $i$th row of $R_1$ being
$$\begin{bmatrix} 0 & 0 & \dots & 0.2 & 0.6 & 0.2 & \dots & 0 \end{bmatrix}$$

Figure 5.3: Sidelobes of the matrices: the same Dirichlet spread, but R1 (A) is better resolved.

and the $i$th row of $R_2$ being
$$\begin{bmatrix} 0 & 0 & \dots & 0.2 & 0 & 0.6 & 0 & 0.2 & \dots & 0 \end{bmatrix},$$
where in both cases $0.6$ is the diagonal element. The contribution of the $i$th row to the Dirichlet spread is exactly the same, $(0.2-0)^2 + (0.6-1)^2 + (0.2-0)^2$, but the distribution of weights is different.

5.2.3 Backus-Gilbert (BG) spread function

We want to build the physical distance from the main diagonal into the spread function [2], [1]:
$$\mathrm{spread}(R) = \sum_{i=1}^{M}\sum_{j=1}^{M} w(i,j)\,[R_{ij} - \delta_{ij}]^2 = \sum_{i=1}^{M}\sum_{j=1}^{M} w(i,j)\,R_{ij}^2,$$
where $w(i,j)$ is a weighting factor based on the distance from the main diagonal (it should grow as the element gets further away from the diagonal). Usually $w(i,j) = (i-j)^2$ is taken in the naturally ordered case; with this choice $w(i,i) = 0$, which is why the $\delta_{ij}$ term drops out. Similar Backus-Gilbert spread functions can be introduced for the data resolution matrix $N$.
Since we have introduced a new measure, we can find new generalised inverses! A short comparison of the Dirichlet and BG spreads for the two rows of Fig. 5.3 is given below.
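The sketch below (a hypothetical 11-element row mimicking Fig. 5.3; sizes and values are mine) computes both spreads for the two rows discussed above: the Dirichlet contributions coincide, while the BG weighting $w(i,j) = (i-j)^2$ penalises the wider row more.

```python
import numpy as np

M, i = 11, 5                                  # row length and diagonal index (illustrative)
r1 = np.zeros(M); r1[[i - 1, i, i + 1]] = [0.2, 0.6, 0.2]   # tight row of R1
r2 = np.zeros(M); r2[[i - 2, i, i + 2]] = [0.2, 0.6, 0.2]   # row of R2, same weights spread wider

e_i = np.zeros(M); e_i[i] = 1.0               # i-th row of the identity

dirichlet = lambda r: np.sum((r - e_i)**2)                  # Dirichlet spread contribution
bg = lambda r: np.sum((np.arange(M) - i)**2 * r**2)         # BG spread with w(i,j) = (i-j)^2

print(dirichlet(r1), dirichlet(r2))    # identical: 0.04 + 0.16 + 0.04 = 0.24
print(bg(r1), bg(r2))                  # 0.08 vs 0.32: BG penalises the wider row
```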

Underdetermined case

It is easy to make $\mathrm{spread}(N)$ small, so we minimise the BG spread, $\mathrm{spread}(R) \to \min$. The diagonal elements of $R$ are given no weight now, so we add the condition
$$\sum_{j=1}^{M} R_{ij} = 1,$$
which keeps the diagonal elements finite and makes each row a unit-weight average of $m^{true}$. We use calculus again and differentiate the spread row-wise. The spread of the $k$th row of $R$ is
$$J_k = \sum_{l=1}^{M} w(l,k)\,R_{kl}R_{kl}
= \sum_{l=1}^{M} w(l,k)\left[\sum_{i=1}^{N} K^{-g}_{ki}K_{il}\right]\left[\sum_{j=1}^{N} K^{-g}_{kj}K_{jl}\right]
= \sum_{i=1}^{N}\sum_{j=1}^{N} K^{-g}_{ki}K^{-g}_{kj}\sum_{l=1}^{M} w(l,k)K_{il}K_{jl}
= \sum_{i=1}^{N}\sum_{j=1}^{N} K^{-g}_{ki}K^{-g}_{kj}\,S^{(k)}_{ij},$$
where the matrices
$$S^{(k)}_{ij} = \sum_{l=1}^{M} w(l,k)\,K_{il}K_{jl}$$
are symmetric in $i, j$. Now the constraint:
$$\sum_{k=1}^{M} R_{ik} = \sum_{k=1}^{M}\sum_{j=1}^{N} K^{-g}_{ij}K_{jk} = \sum_{j=1}^{N} K^{-g}_{ij}\left[\sum_{k=1}^{M} K_{jk}\right] = \sum_{j=1}^{N} K^{-g}_{ij}u_j, \qquad u_j = \sum_{k=1}^{M} K_{jk}.$$

We use Lagrange multipliers to combine the two functions to be minimised:
$$\Phi = \sum_{i=1}^{N}\sum_{j=1}^{N} K^{-g}_{ki}K^{-g}_{kj}\,S^{(k)}_{ij} + 2\lambda\sum_{j=1}^{N} K^{-g}_{kj}u_j,$$
thus we solve (row by row)
$$\frac{\partial\Phi}{\partial K^{-g}_{kp}} = 2\sum_{i=1}^{N} S^{(k)}_{pi}K^{-g}_{ki} + 2\lambda u_p = 0, \qquad
\sum_{j=1}^{M} R_{kj} = 1 = \sum_{i=1}^{N} K^{-g}_{ki}u_i.
$$

The latter can be written as a matrix equation
$$\begin{bmatrix} S^{(k)} & u \\ u^T & 0 \end{bmatrix}\begin{bmatrix} k^{(k)} \\ \lambda \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix},$$
where $k^{(k)}$ denotes the $k$th row of $K^{-g}$ (written as a column vector); the matrix is square, $(N+1)\times(N+1)$, and the system can be solved by the bordering method, i.e. by partitioning the inverse into submatrices with simple properties.
Suppose that $\begin{bmatrix} S^{(k)} & u \\ u^T & 0 \end{bmatrix}^{-1}$ exists and that we partition it into an $N\times N$ symmetric matrix $A$, a vector $b$ and a scalar $c$:
$$\begin{bmatrix} A & b \\ b^T & c \end{bmatrix}\begin{bmatrix} S^{(k)} & u \\ u^T & 0 \end{bmatrix}
= \begin{bmatrix} AS^{(k)} + bu^T & Au \\ b^TS^{(k)} + cu^T & b^Tu \end{bmatrix}
= \begin{bmatrix} I & 0 \\ 0 & 1 \end{bmatrix}.$$
Since $\begin{bmatrix} A & b \\ b^T & c \end{bmatrix}$ is the inverse of $\begin{bmatrix} S^{(k)} & u \\ u^T & 0 \end{bmatrix}$, their product must equal the identity:
$$AS^{(k)} + bu^T = I \;\Rightarrow\; A = \left[I - bu^T\right][S^{(k)}]^{-1},$$
$$Au = 0 \;\Rightarrow\; [S^{(k)}]^{-1}u = b\,u^T[S^{(k)}]^{-1}u \;\Rightarrow\; b = \frac{[S^{(k)}]^{-1}u}{u^T[S^{(k)}]^{-1}u},$$
$$b^TS^{(k)} + cu^T = 0 \;\Rightarrow\; c = \frac{-1}{u^T[S^{(k)}]^{-1}u}.$$
We have found the inverse of $\begin{bmatrix} S^{(k)} & u \\ u^T & 0 \end{bmatrix}$, and hence
$$\begin{bmatrix} k^{(k)} \\ \lambda \end{bmatrix} = \begin{bmatrix} A & b \\ b^T & c \end{bmatrix}\begin{bmatrix} 0 \\ 1 \end{bmatrix} \;\Rightarrow\; k^{(k)} = b, \quad \lambda = c.$$
Then the required BG generalised inverse is
$$K^{-g}_{kl} = \frac{\sum_{i=1}^{N} u_i\left[(S^{(k)})^{-1}\right]_{il}}{\sum_{i=1}^{N}\sum_{j=1}^{N} u_i\left[(S^{(k)})^{-1}\right]_{ij}u_j}, \qquad u_i = \sum_{k=1}^{M} K_{ik}, \qquad (5.2)$$
which is the BG equivalent of the ML solution.
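A minimal sketch of Eq. (5.2), assuming the natural-ordering weight $w(l,k) = (l-k)^2$ and adding a tiny ridge to $S^{(k)}$ for numerical stability (both of these choices are mine, not from the notes):

```python
import numpy as np

def bg_generalised_inverse(K, ridge=1e-10):
    """Backus-Gilbert generalised inverse of the N x M kernel K, built row by row
    using Eq. (5.2) with w(l, k) = (l - k)^2."""
    N, M = K.shape
    u = K.sum(axis=1)                          # u_i = sum_k K_ik  (length N)
    K_bg = np.zeros((M, N))
    for k in range(M):                         # k-th row of K^{-g}
        w = (np.arange(M) - k) ** 2            # weights w(l, k)
        Sk = (K * w) @ K.T                     # S^{(k)}_{ij} = sum_l w(l,k) K_il K_jl
        Sk += ridge * np.trace(Sk) / N * np.eye(N)   # small ridge: S^{(k)} can be ill-conditioned
        Sk_inv_u = np.linalg.solve(Sk, u)
        K_bg[k, :] = Sk_inv_u / (u @ Sk_inv_u)  # k^{(k)} = S^{-1}u / (u^T S^{-1} u)
    return K_bg

# tiny illustrative example with a random kernel
rng = np.random.default_rng(2)
K = rng.random((5, 8))
K_bg = bg_generalised_inverse(K)
R = K_bg @ K
print(np.allclose(R.sum(axis=1), 1.0))         # each row of R sums to one, as required
```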

Example of the discrete version of the Laplace transform

Consider the Laplace transform
$$d(c) = \int_0^{\infty} e^{-cz}m(z)\,dz$$
and its discrete version
$$d_i = \sum_{j=1}^{M}\exp(-c_iz_j)\,m_j, \qquad z_j \in [0, 10].$$
Each datum is a weighted average of the model parameters $m_j$, with weights that decline exponentially with depth $z$. The decay is controlled by the constant $c_i$: the smaller $c_i$ is, the wider the range of depths over which it averages. The kernel $K_{ij} = \exp(-c_iz_j)$ acts as a smoother, so shallow parameters are better resolved.

Figure 5.4: (a) BG resolution matrix $R$; (b) BG solution; (c) Dirichlet resolution matrix $R$; (d) Dirichlet solution. The true solution $m^{true}$ is shown in red in (b) and (d). The BG solution (b) is much smoother than the Dirichlet solution (d); the BG resolution matrix (a) has much shallower sidelobes than the Dirichlet resolution matrix (c).

BG generalised inverse accounting for covariance: damped ML

This solution was introduced in the 1970 Backus and Gilbert paper [10]. Minimise
$$\alpha\,\mathrm{spread}(R) + (1-\alpha)\,\mathrm{size}([\mathrm{cov}_u m]) = \alpha\sum_{i=1}^{M}\sum_{j=1}^{M} w(i,j)R_{ij}^2 + (1-\alpha)\sum_{i=1}^{M}[\mathrm{cov}_u m]_{ii}, \qquad 0 \le \alpha \le 1,$$
where $\alpha$ is a weighting factor signifying the relative contribution of $R$ and $[\mathrm{cov}_u m]$. Minimising this linear combination as before (calculus) gives the Backus-Gilbert generalised inverse
$$K^{-g}_{kl} = \frac{\sum_{i=1}^{N} u_i\left[(S'^{(k)})^{-1}\right]_{il}}{\sum_{i=1}^{N}\sum_{j=1}^{N} u_i\left[(S'^{(k)})^{-1}\right]_{ij}u_j}, \qquad
S'^{(k)}_{ij} = \alpha S^{(k)}_{ij} + (1-\alpha)[\mathrm{cov}_u d]_{ij}, \qquad u_i = \sum_{k=1}^{M} K_{ik},$$
which has a Dirichlet analogue, the damped ML solution.

This is a special case of
$$\alpha_1\,\mathrm{spread}(N) + \alpha_2\,\mathrm{spread}(R) + \alpha_3\,\mathrm{size}([\mathrm{cov}_u m]) \to \min$$
with $\alpha_1 = 0$, $\alpha_2 = 1$, $\alpha_3 = \varepsilon^2$; see the Sylvester equation (5.1) and its fourth special case.

5.3 Maximum Likelihood Methods

5.3.1 Maximum likelihood principle/function
The observed data form one point in the space of all possible observations, $d^{obs} \in S(d)$ (see Fig. 5.5, part (a)); note that the actual realisation may not lie on the line $d_1 = d_2 = d_3$. Assume now the data are independent and each datum is drawn from a Gaussian distribution with the same mean $m_1$ and variance $\sigma^2$ (both unknown):
$$p(d) = \sigma^{-N}(2\pi)^{-N/2}\exp\left[-\frac{1}{2}\sigma^{-2}\sum_{i=1}^{N}[d_i - m_1]^2\right]$$
(see Fig. 5.5, part (b)): a cloud centred on the line $d_1 = d_2 = d_3$ with radius proportional to $\sigma$. We view $p(d^{obs})$ as the probability that the observed data were in fact observed, so we imagine sliding the cloud of probability in Fig. 5.5, part (b), up and down the line and adjusting its diameter until this probability is maximised. This procedure defines a method of estimating the unknown parameters of the distribution: the method of maximum likelihood.
The maximum is located where the derivatives of $p(d^{obs})$ are zero:
$$\frac{\partial p}{\partial m_1} = 0, \qquad \frac{\partial p}{\partial\sigma} = 0.$$
Since $\log(p)$ is a monotonic function of $p$, we can maximise it instead of $p$. Let us introduce the likelihood function (ignoring the normalisation constant $(2\pi)^{-N/2}$)
$$L = \log\left(p(d^{obs})\right) = -N\log(\sigma) - \frac{1}{2}\sigma^{-2}\sum_{i=1}^{N}(d_i^{obs} - m_1)^2.$$

Figure 5.5: Understanding the nature of the data and the maximum likelihood principle: (a) a realisation of the data; (b) the p.d.f. of the data; (c) the likelihood function $L$.

thus the derivatives are
$$\frac{\partial L}{\partial m_1} = \sigma^{-2}\sum_{i=1}^{N}(d_i^{obs} - m_1) = 0, \qquad
\frac{\partial L}{\partial\sigma} = -\frac{N}{\sigma} + \sigma^{-3}\sum_{i=1}^{N}(d_i^{obs} - m_1)^2 = 0,$$
giving
$$m_1^{est} = \frac{1}{N}\sum_{i=1}^{N} d_i^{obs}, \qquad
\sigma^{est} = \left[\frac{1}{N}\sum_{i=1}^{N}(d_i^{obs} - m_1^{est})^2\right]^{1/2}.$$
We see that $m_1^{est}$ is just the usual formula for the sample mean, while $\sigma^{est}$ is almost the usual sample standard deviation (which has $\frac{1}{N-1}$ instead of $\frac{1}{N}$); see Fig. 5.5, part (c). These formulae are only valid for the Gaussian p.d.f.
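A two-line numerical check of these formulae (synthetic numbers, purely illustrative): the ML estimate of $\sigma$ uses $1/N$, whereas the usual unbiased sample standard deviation uses $1/(N-1)$.

```python
import numpy as np

rng = np.random.default_rng(3)
d_obs = rng.normal(loc=3.0, scale=2.0, size=50)    # synthetic data: true m1 = 3, sigma = 2

m1_est = d_obs.mean()                              # ML estimate of the mean
sigma_ml = np.sqrt(np.mean((d_obs - m1_est)**2))   # ML estimate: 1/N inside the sum
sigma_unbiased = d_obs.std(ddof=1)                 # usual sample std: 1/(N-1)

print(m1_est, sigma_ml, sigma_unbiased)
```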

5.3.2 Inverse problem with Gaussian-distributed data

$$Km = d, \qquad p(d) \propto \exp\left[-\frac{1}{2}(d - Km)^T[\mathrm{cov}\,d]^{-1}(d - Km)\right] \to \max$$

Gaussian data with known covariance $[\mathrm{cov}\,d]$ ($m$ unknown)

$p(d)$ is maximal when the argument of the exponential is maximal ($p(d) \propto \exp(-\frac{1}{2}E)$), i.e.
$$E = (d - Km)^T[\mathrm{cov}\,d]^{-1}(d - Km) \to \min,$$
which is a weighted measure of the prediction error (see before). Conclusion: maximum likelihood in this case (Gaussian-distributed data) solves $Km = d$ by weighted LS with weighting $W_e = [\mathrm{cov}\,d]^{-1}$.
Special case 1: if the data are uncorrelated, all with equal variance, then $[\mathrm{cov}\,d] = \sigma_d^2 I$ and we have the simple LS solution.
Special case 2: if the data are uncorrelated but their variances are all different, say $\sigma_{d_i}^2$, then
$$E = \sum_{i=1}^{N}\sigma_{d_i}^{-2}e_i^2 \to \min, \qquad e_i = d_i^{obs} - d_i^{pre}.$$
This $E$ collects the errors weighted by their certainty.

Conclusion: this is a justification of the $L_2$ norm through probability theory: if the data are uncorrelated, have equal variance, and obey Gaussian statistics, the $L_2$ norm is appropriate.
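A short sketch of the weighted LS estimate with $W_e = [\mathrm{cov}\,d]^{-1}$ for the uncorrelated, unequal-variance case (the straight-line kernel and all numbers are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

z = np.linspace(0.0, 10.0, 25)
K = np.column_stack([np.ones_like(z), z])         # straight-line kernel
m_true = np.array([1.0, -0.4])
sigma_i = 0.1 + 0.3 * rng.random(len(z))          # different sigma_{d_i} per datum
d_obs = K @ m_true + sigma_i * rng.standard_normal(len(z))

# minimise E = (d - Km)^T W_e (d - Km) with W_e = [cov d]^{-1} = diag(1/sigma_i^2)
We = np.diag(1.0 / sigma_i**2)
m_wls = np.linalg.solve(K.T @ We @ K, K.T @ We @ d_obs)
print(m_wls)
```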

5.3.3 Underdetermined cases: Probabilistic representation of a priori information

The LS solution does not exist when the problem is underdetermined: in terms of probability this means that $p(d^{obs})$ has no well-defined maximum with respect to variations of the model parameters (see Fig. 5.6 (b)). Assume the probability that the model parameters are near $\mu(m)$ is given by a p.d.f. $p_A(m)$ (A stands for a priori), centred at the a priori value $\mu(m)$, with variance reflecting the uncertainty in the a priori information.

A priori distributions

We need to develop a way to include a priori information in the solution.

Figure 5.6: Various cases of distributions: (a) pdf 1; (b) pdf 2; (c) pdf 3; (d) pdf 4; (e) pdf 5.

• If we expect the model parameters to be close to $\mu(m)$, we might use a Gaussian distribution with mean $\mu(m)$ and a variance that reflects the certainty of our knowledge; in Fig. 5.6 (c) we are more certain than in case (d).
• In Fig. 5.6 (b) the values are correlated. This distribution is non-Gaussian, but it can be approximated by a Gaussian distribution with non-zero covariance if the expected range of model parameters is small.
• If the a priori value of one model parameter is more certain than that of another, we use different variances, see Fig. 5.6 (e).

The general Gaussian case with covariance $[\mathrm{cov}\,m]_A$:
$$p_A(m) \propto \exp\left[-\frac{1}{2}(m - \mu(m))^T[\mathrm{cov}\,m]_A^{-1}(m - \mu(m))\right]$$

5.3.4 Information gain

One can summarise the observations by a distribution with mean $d^{obs}$ and a priori covariance $[\mathrm{cov}\,d]$,
$$p_A(d) \propto \exp\left[-\frac{1}{2}(d - d^{obs})^T[\mathrm{cov}\,d]^{-1}(d - d^{obs})\right].$$
The information gain, or amount of information, is a scalar $S$ (the relative entropy):
$$S(p_A(m)) = \int p_A(m)\log\left[\frac{p_A(m)}{p_N(m)}\right]dm_1\dots dm_M, \qquad (5.3)$$
$$S(p_A(d)) = \int p_A(d)\log\left[\frac{p_A(d)}{p_N(d)}\right]dd_1\dots dd_N. \qquad (5.4)$$
The knowledge needs to be compared with the state of no knowledge (complete ignorance), which we call the null p.d.f., $p_N(d)$, $p_N(m)$. Note that $S = 0$ when $p_A = p_N$ (we know nothing). A wide distribution is more random than a narrow one.
When the ranges of $m$ and $d$ are bounded, $p_N$ can be taken proportional to a constant, i.e.
$$p_N(d) \propto \mathrm{const}, \quad p_N(m) \propto \mathrm{const},$$
meaning that $m$ and $d$ can be anything. When the range is unbounded, the uniform distribution does not exist, so a very wide Gaussian can be used instead.
The information gain is always non-negative and is zero only when $p_A = p_N$; $S$ has the following properties:

• $S$ of the null distribution is zero;
• all distributions except the null distribution have positive $S$;
• the more sharply peaked the p.d.f. becomes, the more its information gain increases;
• $S$ is invariant under reparametrisation.

Since the a priori model is independent of the data, we can form an a priori p.d.f.
$$p_A(m, d) = p_A(m)\,p_A(d).$$

5.3.5 Exact Theory

When the theory is exact, $d = k(m)$ defines a surface in the combined space of data and model parameters on which the estimated model parameters and predicted data must lie. For a linear theory the surface is planar. The trick: maximise $p_A(m, d)$ (or $\log p_A(m, d)$) subject to the constraint (being on the surface) $d - k(m) = 0$. The solution gives the estimated model parameters.

MaxLike principle: Gaussian-distributed data and Gaussian-distributed a priori information

Since we have an exact theory we can avoid Lagrange multipliers: all we need to do is substitute $d = k(m)$ into $p(m, d)$ and minimise. Up to additive terms that do not involve $m$ (and so can be ignored), minus the log of the Gaussian p.d.f. for the a priori model parameters gives the function $L(m)$, and minus the log of the Gaussian p.d.f. for the observations gives the function $E(m)$:
$$\Phi(m) = L(m) + E(m) \to \min \quad \text{(with respect to } m\text{)},$$
$$L(m) = (m - \mu(m))^T[\mathrm{cov}\,m]_A^{-1}(m - \mu(m)) = (m - \mu(m))^T\varepsilon^2 W_m(m - \mu(m)),$$
$$E(m) = \left[Km - d^{obs}\right]^T[\mathrm{cov}\,d]^{-1}\left[Km - d^{obs}\right] = \left[Km - d^{obs}\right]^T W_e\left[Km - d^{obs}\right].$$
This is again a weighted damped LS (WDLS) problem (see Section 2), so its solution is the WDLS solution
$$m^{WDLS} = \left[K^TW_eK + \varepsilon^2 W_m\right]^{-1}\left[K^TW_ed^{obs} + \varepsilon^2 W_m\mu(m)\right]
= \left[K^T[\mathrm{cov}\,d]^{-1}K + [\mathrm{cov}\,m]_A^{-1}\right]^{-1}\left[K^T[\mathrm{cov}\,d]^{-1}d^{obs} + [\mathrm{cov}\,m]_A^{-1}\mu(m)\right] = [F^TF]^{-1}[F^Tf],$$
which can be considered as the simple LS solution of the system
$$Fm = f, \qquad F^TFm^{est} = F^Tf, \qquad (5.5)$$

where, from above,
$$F = \begin{bmatrix}[\mathrm{cov}\,d]^{-1/2}K \\ [\mathrm{cov}\,m]_A^{-1/2}I\end{bmatrix}, \qquad
f = \begin{bmatrix}[\mathrm{cov}\,d]^{-1/2}d^{obs} \\ [\mathrm{cov}\,m]_A^{-1/2}\mu(m)\end{bmatrix}.$$
The matrices $[\mathrm{cov}\,d]^{-1/2}$ and $[\mathrm{cov}\,m]_A^{-1/2}$ can be interpreted as the certainty of $d^{obs}$ and of $\mu(m)$ respectively: the top part of Eq. (5.5) is just $Km = d$ weighted by its certainty, and the bottom part is the prior equation $m = \mu(m)$ weighted by its certainty. Thus the weighting matrices $W_e$ and $W_m$ now have an interpretation.
The vector $f$ has unit covariance,
$$[\mathrm{cov}\,f] = I, \quad\text{since}\quad \left([\mathrm{cov}\,d]^{-1/2}\right)^T[\mathrm{cov}\,d]\,[\mathrm{cov}\,d]^{-1/2} = I, \qquad \left([\mathrm{cov}\,m]_A^{-1/2}\right)^T[\mathrm{cov}\,m]_A\,[\mathrm{cov}\,m]_A^{-1/2} = I,$$

and hence the covariance of $m^{est}$ is
$$[\mathrm{cov}\,m^{est}] = [F^TF]^{-1}.$$

Assume the data are uncorrelated and the model parameters have a uniform variance; then the equations simplify so that only one parameter is left:
$$[\mathrm{cov}\,d] = \sigma_d^2 I, \quad [\mathrm{cov}\,m]_A = \sigma_m^2 I \;\Rightarrow\; F = \begin{bmatrix}K \\ \varepsilon I\end{bmatrix}, \quad f = \begin{bmatrix}d^{obs} \\ \varepsilon\mu(m)\end{bmatrix}, \quad\text{with } \varepsilon^2 = \frac{\sigma_d^2}{\sigma_m^2}.$$
So another mysterious question, "what should $\varepsilon$ be in the damped LS?", has the answer $\varepsilon^2 = \sigma_d^2/\sigma_m^2$ (not found by trial and error this time!): it is just the ratio of the variances of the data and of the a priori model parameters.
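The augmented system is straightforward to use in practice. A minimal sketch (random underdetermined problem; sizes, variances and the zero prior are my assumptions) builds $F$ and $f$, solves the normal equations, and checks the result against the closed-form WDLS solution with $\varepsilon^2 = \sigma_d^2/\sigma_m^2$:

```python
import numpy as np

rng = np.random.default_rng(5)

N, M = 5, 10                                     # underdetermined: fewer data than parameters
K = rng.standard_normal((N, M))
m_prior = np.zeros(M)                            # a priori values mu(m)
sigma_d, sigma_m = 0.05, 1.0                     # data and a priori model standard deviations
d_obs = K @ rng.standard_normal(M) + sigma_d * rng.standard_normal(N)

# F and f of Eq. (5.5); here [cov d] = sigma_d^2 I and [cov m]_A = sigma_m^2 I
F = np.vstack([K / sigma_d, np.eye(M) / sigma_m])
f = np.concatenate([d_obs / sigma_d, m_prior / sigma_m])
m_est = np.linalg.lstsq(F, f, rcond=None)[0]     # simple LS solution of F m = f

# closed-form WDLS solution with eps^2 = sigma_d^2 / sigma_m^2
eps2 = (sigma_d / sigma_m)**2
m_wdls = np.linalg.solve(K.T @ K + eps2 * np.eye(M), K.T @ d_obs + eps2 * m_prior)
print(np.allclose(m_est, m_wdls))                # True
```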

Very useful system with a priori information

If we somehow have a linear system of a priori information,
$$Hm = h, \quad\text{with covariance } [\mathrm{cov}\,h]_A,$$
then Eq. (5.5) becomes
$$F = \begin{bmatrix}[\mathrm{cov}\,d]^{-1/2}K \\ [\mathrm{cov}\,h]_A^{-1/2}H\end{bmatrix}, \qquad
f = \begin{bmatrix}[\mathrm{cov}\,d]^{-1/2}d^{obs} \\ [\mathrm{cov}\,h]_A^{-1/2}h\end{bmatrix}.$$
This formulation is extremely practical and can be solved by the biconjugate gradient method.
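A sketch of this setup, with first-difference smoothness constraints as a hypothetical choice of $H$ and arbitrary sizes and variances, solving the normal equations $F^TFm = F^Tf$ with SciPy's biconjugate gradient routine and comparing against a direct solve:

```python
import numpy as np
from scipy.sparse.linalg import bicg

rng = np.random.default_rng(6)

N, M = 8, 20
K = rng.standard_normal((N, M))
d_obs = rng.standard_normal(N)

# a priori information Hm = h: first differences m_{j+1} - m_j = 0 (a smoothness prior)
H = np.eye(M, k=1)[:M - 1, :] - np.eye(M)[:M - 1, :]
h = np.zeros(M - 1)
sigma_d, sigma_h = 0.05, 0.1

F = np.vstack([K / sigma_d, H / sigma_h])
f = np.concatenate([d_obs / sigma_d, h / sigma_h])

A = F.T @ F                                      # normal equations F^T F m = F^T f
b = F.T @ f
m_bicg, info = bicg(A, b, atol=1e-12)            # info == 0 means the iteration converged
m_direct = np.linalg.solve(A, b)
print(info, np.max(np.abs(m_bicg - m_direct)))
```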

5.3.6 Fuzzy theories (just some ideas)

An inexact theory is one that is only approximately correct, so instead of a sharp surface $k(m) = d$ we have a cloudy set (the maximum likelihood solution above was a point on the surface). We must have some a priori notion of how approximate it is (see [2], [1]). The a priori p.d.f. $p_A(m, d)$ is the same as before, but we now have another p.d.f. $p_k(m, d)$, centred about $k(m) = d$, with width proportional to its uncertainty. We change our approach from maximisation on the surface to combining both distributions into a single distribution,
$$p(m \mid d) = \frac{p(d \mid m)\,p(m)}{p(d)},$$
where $p(d)$ does not depend on the model parameters and so acts as a normalisation. The following properties should be satisfied by a combination of two p.d.f.s [8]: the order in which they are combined does not matter; combining with the null distribution should leave the original distribution unchanged; the combination should be invariant under a change of variables (reparametrisation); and the combination should be zero only if both combined p.d.f.s are zero. The required combination is a product (T stands for total), provided the null distribution is constant, i.e.
$$p_T(m, d) = p_A(m, d)\,p_k(m, d), \qquad p_N \propto \mathrm{const}.$$
Both $d^{pre}$ and $m^{est}$ are obtained simultaneously. This method does not necessarily give the same $m^{est}$ as the maximisation of the $L$ function above. To find the maximum likelihood point of $p_T(m, d)$ with respect to the model parameters alone, we sum (integrate) all the probabilities along lines of equal model parameters (project onto the $d = 0$ plane), and only then look for a maximum:
$$p(m) = \int p_T(m, d)\,dd_1\dots dd_N.$$
If all the distributions are Gaussian, both solutions are the same.

Example/illustration (limiting case)

Consider the WDLS solution of the linear problem $Km = d$, assuming the a priori p.d.f. is
$$p_A(m, d) \propto \exp\left[-\frac{1}{2}(m - \mu(m))^T[\mathrm{cov}\,m]_A^{-1}(m - \mu(m)) - \frac{1}{2}(d - d^{obs})^T[\mathrm{cov}\,d]^{-1}(d - d^{obs})\right].$$
If there are no errors in the theory, then the theory p.d.f. is infinitely narrow, a Dirac delta function,
$$p_k(m, d) = \delta(Km - d),$$
which implies that the total p.d.f. is
$$p_T(m, d) = p_A(m, d)\,\delta(Km - d).$$
Now, substituting the a priori distribution and integrating along lines of equal model parameters,
$$p(m) = \int p_T(m, d)\,dd_1\dots dd_N = \int p_A(m, d)\,\delta(Km - d)\,dd_1\dots dd_N$$
$$\propto \exp\left[-\frac{1}{2}(m - \mu(m))^T[\mathrm{cov}\,m]_A^{-1}(m - \mu(m)) - \frac{1}{2}(Km - d^{obs})^T[\mathrm{cov}\,d]^{-1}(Km - d^{obs})\right],$$
which coincides with the WDLS solution (take the log).

Exact data and theory

Suppose $\sigma_d^2 = \sigma_k^2 = 0$. With zero a priori model parameters the solution is
$$m^{est} = K^T[KK^T]^{-1}d^{obs},$$
i.e. just the ML solution (when underdetermined); in the overdetermined case one gets the LS solution $[K^TK]^{-1}K^Td^{obs}$. If the a priori model parameters are not zero,
$$m^{est} = K^{-g}d^{obs} + (I - R)\mu(m) = K^T[KK^T]^{-1}d^{obs} + \left[I - K^T[KK^T]^{-1}K\right]\mu(m).$$


Infinitely inexact data and theory

When $\sigma_d^2 \to \infty$ and/or $\sigma_k^2 \to \infty$, the solution is $m^{est} = \mu(m)$ (no information from either the data or the theory).

No a priori knowledge of the model parameters

In this case $\sigma_m^2 \to \infty$, and the solution is
$$m^{est} = K^T[KK^T]^{-1}d^{obs} + \left[I - K^T[KK^T]^{-1}K\right]\mu(m) \;\text{(underdetermined)}, \qquad m^{est} = [K^TK]^{-1}K^Td^{obs} \;\text{(overdetermined)}:$$
weak a priori information with finite-error data and theory produces the same solution as finite-error a priori information with error-free data [1].

5.3.7 Minimising the Relative entropy (information gain)

Another guiding principle for solving inverse problems is to find a p.d.f. $p_T(m)$ (the solution to the inverse problem) that minimises the information gain of $p_T(m)$ relative to some a priori p.d.f. $p_A(m)$. This approach needs constraints, otherwise the solution is $p_T = p_A$ (with $S = 0$). Two possible viewpoints:

• find the solution p.d.f. $p_T(m)$ that has the largest $S$ compared to $p_A(m)$;
• find the solution p.d.f. $p_T(m)$ that contains the smallest possible amount of new information compared to the a priori p.d.f. $p_A(m)$.

Therefore, assuming $p_A(m)$ is given, find $p_T(m)$ such that
$$S = \int p_T(m)\log\left[\frac{p_T(m)}{p_A(m)}\right]dm_1\dots dm_M \to \min \quad\text{subject to the constraints}$$
$$\int p_T(m)\,dm_1\dots dm_M = 1 \quad\text{(normalisation)}$$
and, for the underdetermined case,
$$\int p_T(m)(d - Km)\,dm_1\dots dm_M = 0 \quad\text{(the prediction error is zero on average).}$$

Solving with Euler-Lagrange for the case of a Gaussian a priori p.d.f.: set
$$\Phi = p_T\log(p_T) - p_T\log(p_A) + \lambda_0 p_T + \lambda^T(d - Km)p_T;$$
differentiating with respect to $p_T$,
$$p_T(m) = p_A(m)\exp\left[-(1 + \lambda_0) - \lambda^T(d - Km)\right].$$
Now assuming
$$p_A(m) \propto \exp\left[-\frac{1}{2}(m - \mu(m))^T[\mathrm{cov}\,m]_A^{-1}(m - \mu(m))\right] \;\Rightarrow\; p_T(m) \propto \exp\left[A(m)\right],$$
where
$$A(m) = -\frac{1}{2}(m - \mu(m))^T[\mathrm{cov}\,m]_A^{-1}(m - \mu(m)) - (1 + \lambda_0) - \lambda^T(d - Km),$$
the best estimate $m^{est}$ is the mean of this distribution, which is also its maximum likelihood point. Solving for the maximum of $A$ (and using the constraints to eliminate $\lambda$):
$$m^{est} - \mu(m) = [\mathrm{cov}\,m]_AK^T\left[K[\mathrm{cov}\,m]_AK^T\right]^{-1}\left[d - K\mu(m)\right].$$
So the new principle (MRE), applied to the underdetermined problem, gives the weighted ML solution with $W_m^{-1} = [\mathrm{cov}\,m]_A$.
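A minimal sketch of the resulting estimate (a random underdetermined problem with an assumed a priori covariance; all numbers are mine); note that the data are fitted exactly, as expected for this weighted minimum-length form:

```python
import numpy as np

rng = np.random.default_rng(7)

N, M = 4, 12                                    # underdetermined problem
K = rng.standard_normal((N, M))
d = rng.standard_normal(N)
mu_m = np.zeros(M)                              # a priori mean mu(m)
cov_mA = np.diag(0.5 + rng.random(M))           # a priori model covariance [cov m]_A

# m_est - mu = [cov m]_A K^T (K [cov m]_A K^T)^{-1} (d - K mu)
m_est = mu_m + cov_mA @ K.T @ np.linalg.solve(K @ cov_mA @ K.T, d - K @ mu_m)

print(np.allclose(K @ m_est, d))                # True: the prediction error is zero
```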

Exercises
5–1. Consider the Laplace transform
$$d(c) = \int_0^{\infty}e^{-cz}m(z)\,dz$$
and its discrete version
$$d_i = \sum_{j=1}^{M}\exp(-c_iz_j)\,m_j, \qquad z_j \in [0, 10].$$
Each datum is a weighted average of the model parameters $m_j$, with weights that decline exponentially with depth $z$; the decay is controlled by the constant $c_i$: the smaller $c_i$ is, the wider the range of depths it averages over. You need to simulate images similar to Fig. 5.4 following the steps below.

(a) Generate $M = 100$ true model parameters $m^{true}$ such that $m_i$, $i \in [1, 100]$, are all zero except for a few, say $m_5 = m_{10} = m_{20} = m_{50} = m_{90} = 1$.
(b) Generate $N = 80$ observed data such that $c_i \in [0, 0.1]$ and the kernel is $K_{ij} = \exp(-c_iz_j)$; the noise is normally distributed with mean zero and small standard deviation, $d^{obs} = d^{true} + \text{noise}$.
(c) Construct the minimum-length solution $K^{-g}_{ML} = \left[K^TK + \varepsilon^2I\right]^{-1}K^T$ with $\varepsilon = 10^{-12}$, find $m^{est} = K^{-g}_{ML}d^{obs}$ and the resolution matrix $R = K^{-g}_{ML}K$. Draw a heatmap of the model resolution matrix $R$, and plot the true and estimated model parameters on the same graph, as in Fig. 5.4.
(d) Using the same model parameters and the same observed data as above, construct the Backus-Gilbert (BG) solution row-wise via formulae (5.2), find $m^{est} = K^{-g}_{BG}d^{obs}$, the corresponding estimates of the model parameters and the corresponding BG model resolution matrix, and complete the set of pictures of Fig. 5.4.
