SMSTC Lecture Notes, Lecture 5
INVERSE PROBLEMS
Lecture 5: Maximum Likelihood methods, statistical approach
Anya Kirpichnikova, University of Stirling ([email protected])
www.smstc.ac.uk
Contents
5.1 How can we look at what we have done from a statistical point of view?
5.1.1 The variance of the model parameter estimates
5.2 The unit covariance matrix and the Backus-Gilbert spread function
5.2.1 The unit covariance matrix
5.2.2 Dirichlet spread function (general case), Sylvester equation
5.2.3 Backus-Gilbert (BG) spread function
5.3 Maximum Likelihood Methods
5.3.1 Maximum likelihood principle/function
5.3.2 Inverse problem with Gaussian-distributed data
5.3.3 Underdetermined cases: probabilistic representation of a priori information
5.3.4 Information gain
5.3.5 Exact theory
5.3.6 Fuzzy theories (just some ideas)
5.3.7 Minimising the relative entropy (information gain)
5.1 How can we look at what we have done from a statistical point of view?
5.1.1 The variance of the model parameter estimates
As discussed before, the data contain noise that causes errors in m^est. The goal is to understand how the measurement errors map into errors in the model parameter estimates under a linear transformation
\[ m^{est} = M d + v \]
for some M, v. Assume the data d have some distribution, characterised by a covariance matrix [cov d]; then, as seen in Eq. 4.2, the estimates of the model parameters have a distribution characterised by the covariance matrix
\[ [\mathrm{cov}\, m] = M\, [\mathrm{cov}\, d]\, M^T , \]
so errors in m depend on errors in d.
How do we arrive at an estimate of the variance of the data, σ²_d?
• Way 1: a priori variance: a length is measured by a ruler with 1 mm divisions, so σ_d ≈ 1/2 mm.
• Way 2: a posteriori, based on the size of the distribution of the prediction errors e obtained by fitting the model to the data:
\[ \sigma_d^2 \approx \frac{1}{N-M} \sum_{i=1}^N e_i^2 \]
(compare with the mean-squared error \( \frac{1}{N}\sum_{i=1}^N e_i^2 \)).
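A minimal numerical sketch of Way 2 (the straight-line problem, noise level and seed here are invented purely for illustration):

```python
import numpy as np

# hypothetical straight-line problem: d_i = m1 + m2 * z_i  (N data, M = 2 parameters)
rng = np.random.default_rng(0)
N, M = 20, 2
z = rng.uniform(0, 10, N)
K = np.column_stack([np.ones(N), z])
d_obs = K @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.5, N)   # true sigma_d = 0.5

m_est, *_ = np.linalg.lstsq(K, d_obs, rcond=None)            # LS solution
e = d_obs - K @ m_est                                        # prediction errors
var_d_post = e @ e / (N - M)                                 # a posteriori estimate of sigma_d^2
print(var_d_post, e @ e / N)                                 # compare with the mean-squared error
```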
a [email protected]
5–1
SM ST C : INVERSE PROBLEMS 5–2
Overdetermined case: LS
i.e. the estimates m can be correlated and of unequal variance even when the data are uncorrelated and of equal variance.
• If E(m) = e^T e has a sharp minimum near m^est, then the LS solution is well defined in the sense of having small variance. The latter means that small errors in determining the shape of E(m), due to random fluctuations in the data, lead to small errors in m^est.
Mathematically, the curvature of a function determines the sharpness of its minimum: the variance should be related to the curvature of E(m) at its minimum, which in turn depends on the structure of K. The curvature is captured by the second derivative, i.e.
\[ \Delta E = E(m) - E(m^{est}) = \frac12 \, [m - m^{est}]^T \left.\frac{\partial^2 E}{\partial m^2}\right|_{m=m^{est}} [m - m^{est}], \]
where the matrix \( \frac{\partial^2 E}{\partial m^2} \) has elements \( \frac{\partial^2 E}{\partial m_i \partial m_j} \); notice that we are expanding at a minimum m = m^est, so the first-order term vanishes.
The covariance of the LS solution (for uncorrelated data, all with variance σ²_d) is
\[ [\mathrm{cov}\, m] = \sigma_d^2 \, [K^T K]^{-1} = \sigma_d^2 \left[ \frac12 \left.\frac{\partial^2 E}{\partial m^2}\right|_{m=m^{est}} \right]^{-1}, \]
which means that the covariance of the model parameters [cov m] is controlled by (a) the variance of the data σ²_d, and (b) the measure of the curvature of the prediction error, \( \left[ \frac12 \left.\frac{\partial^2 E}{\partial m^2}\right|_{m=m^{est}} \right]^{-1} \).
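A short sketch of how σ²_d propagates into [cov m], continuing the invented straight-line fit of the previous snippet (the kernel K and the data variance are assumptions):

```python
import numpy as np

# assumed quantities from a toy straight-line fit: kernel K (N x 2) and data variance sigma_d^2
rng = np.random.default_rng(1)
z = rng.uniform(0, 10, 20)
K = np.column_stack([np.ones_like(z), z])
sigma_d2 = 0.5 ** 2

cov_m = sigma_d2 * np.linalg.inv(K.T @ K)      # [cov m] = sigma_d^2 (K^T K)^{-1}
print(np.sqrt(np.diag(cov_m)))                 # standard errors of intercept m1 and slope m2
```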
See Fig. 5.1; the variance is related to the size of the ellipse, m1 is the intercept (downward axis) and m2 is the slope (horizontal axis). The slope is determined more accurately than the intercept; see the discussion in lectures.
Underdetermined case: ML
Figure 5.1: Two examples of data; panels (a) case 1: data, (b) case 1: variance, (c) case 2: data, (d) case 2: variance. Parts (a) and (c) show the data and the LS solution; parts (b) and (d) show the corresponding variance. In case 2 the points are concentrated, so many different gradients fit them almost equally well.
5.2 The unit covariance matrix and the Backus-Gilbert spread function
5.2.1 The unit covariance matrix
Covariance of the model parameters [cov m] depends on [cov d] and on the way the data d are mapped into m, i.e. on K. A unit covariance matrix characterises the degree of error amplification that occurs in this mapping.
• if d is correlated,
\[ [\mathrm{cov}_u m] = K^{-g} [\mathrm{cov}_u d] K^{-g\,T}, \]
where [cov_u d] is some normalisation of [cov d].
A unit covariance matrix [cov_u m] is a useful tool in experimental design: it is independent of the actual values and variances of the data.
For the straight-line fit, for example,
\[ [\mathrm{cov}_u m] = \frac{1}{N\sum z_i^2 - \left(\sum z_i\right)^2} \begin{bmatrix} \sum z_i^2 & -\sum z_i \\ -\sum z_i & N \end{bmatrix}. \]
The estimates of m1 and m2 are uncorrelated only when the data are centered about z = 0. The overall size of the variance is controlled by the denominator of the fraction. If all the values of z are nearly equal (see Fig. 5.2, case 1), the denominator of the fraction is small and the variance is large. If the values of z have a large spread, the denominator is large and the variance is small.
The unit covariance matrix [cov_u m] measures the amount of error amplification in the mapping from d to m; its size is
\[ \mathrm{size}\left([\mathrm{cov}_u m]\right) = \left\| [\mathrm{var}_u m]^{1/2} \right\|_2^2 = \sum_{i=1}^M [\mathrm{cov}_u m]_{ii}, \]
where the square roots are taken element-wise, and only the diagonal elements are taken into account.
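A small sketch of the unit covariance matrix and its size for the straight-line fit; the two depth layouts below are invented to mimic cases 1 and 2 of Fig. 5.2:

```python
import numpy as np

def unit_covariance(z):
    """[cov_u m] for the straight-line fit d_i = m1 + m2*z_i with [cov_u d] = I."""
    K = np.column_stack([np.ones(len(z)), z])
    return np.linalg.inv(K.T @ K)               # equals K^{-g} [cov_u d] K^{-gT} for the LS inverse

z_clustered = np.array([4.8, 4.9, 5.0, 5.1, 5.2])   # case 1: values nearly equal
z_spread    = np.array([0.0, 2.5, 5.0, 7.5, 10.0])  # case 2: well spread

for z in (z_clustered, z_spread):
    C = unit_covariance(z)
    print(C, "size =", np.trace(C))             # size([cov_u m]) = sum of diagonal elements
```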
Figure 5.2: case 1: data points are not spread; case 2: data points are better spread.
Cumbersome (but similar to the above) calculus leads to a Sylvester equation.
• LS solution: α1 = 1, α2 = α3 = 0
• ML solution: α1 = 0, α2 = 1, α3 = 0
• Damped LS solution: α1 = 1, α2 = 0, α3 = ε² and [cov_u d] = I; then
\[ K^{-g} = \left[ K^T K + \varepsilon^2 I \right]^{-1} K^T. \]
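A minimal sketch of the damped LS generalised inverse and the resulting model resolution matrix; the kernel K here is a random placeholder, not a physical example:

```python
import numpy as np

def damped_ls_inverse(K, eps):
    """Damped LS generalised inverse K^{-g} = (K^T K + eps^2 I)^{-1} K^T."""
    M = K.shape[1]
    return np.linalg.solve(K.T @ K + eps**2 * np.eye(M), K.T)

# hypothetical underdetermined kernel, just to exercise the formula
rng = np.random.default_rng(2)
K = rng.normal(size=(8, 12))        # N = 8 data, M = 12 parameters
Kg = damped_ls_inverse(K, eps=0.1)
R = Kg @ K                          # model resolution matrix R = K^{-g} K
print(R.shape, np.trace(R))
```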
If R contains negative off-diagonal elements, then interpreting it as an averaging makes less sense: non-negativity can be included as a new constraint, but this makes the problem even harder. When there is a natural ordering, we would prefer any large elements to be close to the main diagonal, since then we have localised averaging functions. If we continue using the Dirichlet spread function to compute K^{-g}, we would often get sidelobes (large-amplitude regions of the resolution matrix far from the main diagonal).
Two matrices R1 and R2 can have the same Dirichlet spread; see Fig. 5.3. Say the ith row of matrix R1 is
0 0 . . . 0.2 0.6 0.2 . . . 0
Figure 5.3: Sidelobes of the matrices: the same Dirichlet spread, but R1 (A) is better resolved.
where in both cases 0.6 is the diagonal element. The contribution of the ith row to the Dirichlet spread is exactly the same, (0.2 − 0)² + (0.6 − 1)² + (0.2 − 0)², but the distribution of the weights is different.
5.2.3 Backus-Gilbert (BG) spread function
The Backus-Gilbert spread function weights each element of R by its distance from the main diagonal,
\[ \mathrm{spread}_{BG}(R) = \sum_{i=1}^M \sum_{j=1}^M w(i,j)\, R_{ij}^2, \]
where w(i, j) is a weighting factor based on the distance from the main diagonal (it should grow as the element gets further away from the diagonal). Usually w(i, j) = (i − j)² is taken in the natural-ordering case.
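To illustrate, a small sketch comparing the Dirichlet and BG spreads of two hypothetical resolution-matrix rows; the second row (sidelobes far from the diagonal) is invented for contrast:

```python
import numpy as np

def dirichlet_spread_row(row, k):
    """Sum_j (R_kj - I_kj)^2 for one row."""
    target = np.zeros_like(row); target[k] = 1.0
    return np.sum((row - target) ** 2)

def bg_spread_row(row, k):
    """Sum_j (j - k)^2 R_kj^2 for one row (natural-ordering weight)."""
    j = np.arange(len(row))
    return np.sum((j - k) ** 2 * row ** 2)

M, k = 21, 10
row1 = np.zeros(M); row1[k] = 0.6; row1[k - 1] = row1[k + 1] = 0.2   # sidelobes next to diagonal
row2 = np.zeros(M); row2[k] = 0.6; row2[0] = row2[-1] = 0.2          # sidelobes far away (invented)

for row in (row1, row2):
    print(dirichlet_spread_row(row, k), bg_spread_row(row, k))       # same Dirichlet, different BG spread
```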
Similar Backus-Gilbert spread functions can be introduced for the data resolution matrix N.
Since we have introduced a new measure, we can find new generalised inverses!
Underdetermined case
It is easy to make the data-resolution spread, spread(N), small, so we minimise the BG spread: spread(R) → min. The diagonal elements of R are now given no weight, so we add the condition
\[ \sum_{j=1}^M R_{ij} = 1, \]
which keeps the diagonal elements finite and makes each row of R a unit-sum average of m^true. We use calculus again and minimise the spread row by row. Let the spread of the kth row of R be J_k:
" #" #
M M N N
Jk = ∑ w(l, k)Rkl Rkl = ∑ w(l, k) ∑ K−g
ki Kil ∑ K−g
k j K jl
l=1 l=1 i=1 j=1
N N M N N
−g −g
= ∑ ∑ Kki Kk j ∑ w(l, k)Kil K jl = ∑ ∑ K−g −g (k)
ki Kk j S ,
i=1 j=1 l=1 i=1 j=1
Minimising J_k subject to the unit-sum constraint (via a Lagrange multiplier λ) leads to the linear system
\[ \begin{bmatrix} S^{(k)} & u \\ u^T & 0 \end{bmatrix} \begin{bmatrix} k^{(k)} \\ \lambda \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \qquad k^{(k)}_i = K^{-g}_{ki}, \]
where the matrix is square, (N + 1) × (N + 1), and can be inverted by the bordering method: partition it into submatrices with simple properties.
Suppose that \( \begin{bmatrix} S^{(k)} & u \\ u^T & 0 \end{bmatrix}^{-1} \) exists and partition it into an N × N symmetric matrix A, a vector b and a scalar c:
\[ \begin{bmatrix} A & b \\ b^T & c \end{bmatrix} \begin{bmatrix} S^{(k)} & u \\ u^T & 0 \end{bmatrix} = \begin{bmatrix} A S^{(k)} + b u^T & A u \\ b^T S^{(k)} + c u^T & b^T u \end{bmatrix} = \begin{bmatrix} I & 0 \\ 0 & 1 \end{bmatrix}. \]
Since \( \begin{bmatrix} A & b \\ b^T & c \end{bmatrix} \) is the inverse of \( \begin{bmatrix} S^{(k)} & u \\ u^T & 0 \end{bmatrix} \), their product must equal the identity. From A u = 0 and A S^{(k)} + b u^T = I,
\[ b = \frac{[S^{(k)}]^{-1} u}{u^T [S^{(k)}]^{-1} u}, \]
and from b^T S^{(k)} + c u^T = 0 together with b^T u = 1,
\[ c = \frac{-1}{u^T [S^{(k)}]^{-1} u}. \]
We have found the inverse of \( \begin{bmatrix} S^{(k)} & u \\ u^T & 0 \end{bmatrix} \), and hence
\[ \begin{bmatrix} k^{(k)} \\ \lambda \end{bmatrix} = \begin{bmatrix} A & b \\ b^T & c \end{bmatrix} \begin{bmatrix} 0 \\ 1 \end{bmatrix} \ \Rightarrow\ k^{(k)} = b, \quad \lambda = c. \]
Then the required BG generalised inverse is
\[ K^{-g}_{kl} = \frac{\sum_{i=1}^N u_i \left[(S^{(k)})^{-1}\right]_{il}}{\sum_{i=1}^N \sum_{j=1}^N u_i \left[(S^{(k)})^{-1}\right]_{ij} u_j}, \qquad u_i = \sum_{j=1}^M K_{ij}. \tag{5.2} \]
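A sketch of Eq. (5.2) implemented row by row; the exponential kernel below is a small invented example, and a least-squares solve is used because S^(k) is typically very ill-conditioned for smooth kernels:

```python
import numpy as np

def backus_gilbert_inverse(K):
    """Row-wise Backus-Gilbert generalised inverse, Eq. (5.2), with w(l, k) = (l - k)^2."""
    N, M = K.shape
    u = K.sum(axis=1)                                   # u_i = sum_j K_ij
    Kg = np.zeros((M, N))
    l = np.arange(M)
    for k in range(M):
        w = (l - k) ** 2                                # natural-ordering weight
        S = (K * w) @ K.T                               # S^(k)_ij = sum_l w(l,k) K_il K_jl
        x = np.linalg.lstsq(S, u, rcond=None)[0]        # solve S^(k) x = u (robust to near-singular S)
        Kg[k, :] = x / (u @ x)                          # Eq. (5.2)
    return Kg

# toy exponential-decay kernel (assumed only for illustration)
z = np.linspace(0, 10, 30)
c = np.linspace(0.01, 0.1, 10)
K = np.exp(-np.outer(c, z))                             # N = 10 data, M = 30 parameters
Kg = backus_gilbert_inverse(K)
R = Kg @ K
print(R.sum(axis=1))                                    # each row of R averages to one
```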
As an illustration, consider the discrete Laplace transform of Exercise 5–1 below: the data are weighted averages of the model parameters m_j, with weights that decline exponentially with depth z. The decay is controlled by the constant c_i; the smaller i is, the wider the range over which the average is taken. The kernel K_ij = exp(−c_i z_j) acts as a smoothing, so shallow parameters are better resolved. The BG and Dirichlet solutions for this example are compared in Fig. 5.4.
Figure 5.4: The true solution m^true is shown in red in (b), (d). The BG solution (b) is much smoother than the Dirichlet solution (d); the BG resolution matrix (c) has much shallower sidelobes than the Dirichlet resolution matrix (a).
This solution was introduced in the 1970 Backus and Gilbert paper [10]. For noisy data, resolution is traded off against variance by minimising the weighted combination
\[ \alpha\, \mathrm{spread}(R) + (1-\alpha)\, \mathrm{size}([\mathrm{cov}_u m]) = \alpha \sum_{i=1}^M \sum_{j=1}^M w(i,j)\, R_{ij}^2 + (1-\alpha) \sum_{i=1}^M [\mathrm{cov}_u m]_{ii}, \qquad 0 \le \alpha \le 1, \]
where α is a weighting factor signifying the relative contributions of R and [cov_u m]. If we minimise the above linear combination as before (calculus), we get the following Backus-Gilbert generalised inverse:
\[ K^{-g}_{kl} = \frac{\sum_{i=1}^N u_i \left[(S'^{(k)})^{-1}\right]_{il}}{\sum_{i=1}^N \sum_{j=1}^N u_i \left[(S'^{(k)})^{-1}\right]_{ij} u_j}, \qquad S'^{(k)}_{ij} = \alpha S^{(k)}_{ij} + (1-\alpha)[\mathrm{cov}_u d]_{ij}, \qquad u_i = \sum_{j=1}^M K_{ij}. \]
5.3 Maximum Likelihood Methods
5.3.1 Maximum likelihood principle/function
Suppose the observations d_i^obs are drawn independently from a Gaussian distribution with unknown mean m1 and unknown variance σ²; their joint p.d.f. p(d^obs) forms a cloud of probability (see Fig. 5.5, part (b)) centred on the line d1 = d2 = d3 with radius proportional to σ. We look at p(d^obs) as the probability that the observed data were in fact observed, so we imagine sliding the cloud of probability in Fig. 5.5, part (b), up along the line and adjusting its diameter until this probability is maximised. This procedure defines a method of estimating the unknown parameters of the distribution: the method of maximum likelihood.
The maximum is located where the derivatives of p(d^obs) are zero:
\[ \frac{\partial p}{\partial m_1} = 0, \qquad \frac{\partial p}{\partial \sigma} = 0. \]
Since log(p) is a monotonic function of p, we could maximise it instead of p. Let us introduce the likelihood function (ignoring the normalisation constant (2π)^{-N/2}):
\[ L = \log p(d^{obs}) = -N \log\sigma - \frac{1}{2\sigma^2} \sum_{i=1}^N \left( d_i^{obs} - m_1 \right)^2. \]
Figure 5.5: Understanding the nature of the data and the maximum likelihood principle
Setting the derivatives of L to zero,
\[ m_1^{est} = \frac1N \sum_{i=1}^N d_i^{obs}, \qquad (\sigma^{est})^2 = \frac1N \sum_{i=1}^N \left(d_i^{obs} - m_1^{est}\right)^2, \]
we see that m^est is just the usual formula for the sample mean, while σ^est is almost the usual sample standard deviation (it should have had 1/(N−1) instead of 1/N); see Fig. 5.5, part (c). These formulae are only valid for the Gaussian p.d.f.
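A short numerical sketch of this difference (the sample below is invented):

```python
import numpy as np

# hypothetical sample: N Gaussian observations with unknown mean and sigma
rng = np.random.default_rng(3)
d = rng.normal(5.0, 2.0, size=30)

m_est = d.mean()                                                  # ML estimate = sample mean
sigma_ml = np.sqrt(np.sum((d - m_est) ** 2) / len(d))             # ML estimate, 1/N
sigma_sample = np.sqrt(np.sum((d - m_est) ** 2) / (len(d) - 1))   # usual sample std, 1/(N-1)
print(m_est, sigma_ml, sigma_sample)
```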
5.3.2 Inverse problem with Gaussian-distributed data
The p.d.f. p(d) ∝ exp(−½E) is maximal when the argument of the exponential is maximal, i.e. when
\[ E = (d - Km)^T [\mathrm{cov}_u d]^{-1} (d - Km) \to \min, \]
a weighted measure of the prediction error (see before). Conclusion: maximum likelihood in this case (Gaussian-distributed data) solves Km = d by weighted LS with weighting W_e = [cov_u d]^{-1}.
Special case 1: if the data are uncorrelated, all with equal variance, then [cov_u d] = σ²_d I, and we have the simple LS solution.
Special case 2: if the data are uncorrelated but their variances σ²_{d_i} are all different, then
\[ E = \sum_{i=1}^N \sigma_{d_i}^{-2} e_i^2 \to \min, \qquad e_i = d_i^{obs} - d_i^{pre}. \]
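A minimal sketch of Special case 2 (the straight-line problem and the per-datum variances are assumptions made for illustration):

```python
import numpy as np

# hypothetical problem: straight-line fit with data of unequal, known variances
rng = np.random.default_rng(4)
N = 25
z = np.linspace(0, 10, N)
K = np.column_stack([np.ones(N), z])
sigma_d = rng.uniform(0.2, 1.0, N)                       # per-datum standard deviations (assumed known)
d_obs = K @ np.array([1.0, 2.0]) + rng.normal(0, sigma_d)

We = np.diag(1.0 / sigma_d**2)                           # W_e = [cov_u d]^{-1} for uncorrelated data
m_wls = np.linalg.solve(K.T @ We @ K, K.T @ We @ d_obs)  # weighted LS: (K^T W_e K) m = K^T W_e d
print(m_wls)
```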
A priori distributions
Figure 5.6: a priori p.d.f.s; panels (a) pdf 1, (b) pdf 2, (c) pdf 3, (d) pdf 4, (e) pdf 5.
• If we expect the model parameters to be close to µ(m), we might use a Gaussian distribution with mean µ(m) and a variance that reflects the certainty of our knowledge; in Fig. 5.6 (c) we are more certain than in case (d).
• In Fig. 5.6 (b) the values are correlated. This distribution is non-Gaussian, but it can be approximated by a Gaussian distribution with non-zero covariance if the expected range of model parameters is small.
• If the a priori value of one model parameter is more certain than another, we use different variances, see Fig. 5.6 (e).
When the range is unbounded, the uniform distribution does not exist, so some very wide Gaussian can probably be used instead.
The information gain (relative entropy) of pA with respect to the null distribution pN,
\[ S(p_A, p_N) = \int p_A(m) \log\frac{p_A(m)}{p_N(m)}\, dm, \]
is always a non-negative number and is only zero when pA = pN. S has the following properties:
• the more sharply peaked the p.d.f. becomes, the more its information gain increases;
• S is invariant under reparametrisation.
Since the a priori model is independent of the data, we can form an a priori p.d.f.
Since we have an exact theory, we can avoid Lagrange multipliers: all we need to do is substitute d = k(m) into p(m, d) and maximise (equivalently, minimise the negative log). The log of the Gaussian p.d.f. for the a priori model parameters is the function L(m), and the log of the Gaussian p.d.f. for the observations is the function E(m) (both written up to additive terms which do not involve m, so they can be ignored).
This is again a weighted damped LS (WDLS) problem (see Section 2), so its solution is the WDLS solution
\[ m^{WDLS} = \left[ K^T W_e K + \varepsilon^2 W_m \right]^{-1} \left[ K^T W_e d^{obs} + \varepsilon^2 W_m \mu(m) \right] = \left[ K^T [\mathrm{cov}\, d]^{-1} K + [\mathrm{cov}\, m]_A^{-1} \right]^{-1} \left[ K^T [\mathrm{cov}\, d]^{-1} d^{obs} + [\mathrm{cov}\, m]_A^{-1} \mu(m) \right] \]
\[ = \left( \begin{bmatrix} [\mathrm{cov}\, d]^{-1/2} K \\ [\mathrm{cov}\, m]_A^{-1/2} I \end{bmatrix}^T \begin{bmatrix} [\mathrm{cov}\, d]^{-1/2} K \\ [\mathrm{cov}\, m]_A^{-1/2} I \end{bmatrix} \right)^{-1} \begin{bmatrix} [\mathrm{cov}\, d]^{-1/2} K \\ [\mathrm{cov}\, m]_A^{-1/2} I \end{bmatrix}^T \begin{bmatrix} [\mathrm{cov}\, d]^{-1/2} d^{obs} \\ [\mathrm{cov}\, m]_A^{-1/2} \mu(m) \end{bmatrix} = [F^T F]^{-1} [F^T f], \]
which can be considered as the simple LS solution of the following system:
\[ F m = f, \qquad F^T F m^{est} = F^T f. \tag{5.5} \]
Assume the data are uncorrelated and the model parameters have a uniform variance; then we can simplify the equation so that only one parameter is left:
\[ [\mathrm{cov}\, d] = \sigma_d^2 I, \quad [\mathrm{cov}\, m] = \sigma_m^2 I \ \Rightarrow\ F = \begin{bmatrix} K \\ \varepsilon I \end{bmatrix}, \quad f = \begin{bmatrix} d^{obs} \\ \varepsilon\, \mu(m) \end{bmatrix}, \quad \text{with } \varepsilon^2 = \frac{\sigma_d^2}{\sigma_m^2}. \]
So another mysterious question, "what should ε be in the damped LS?", has the answer ε² = σ²_d / σ²_m (not by trial and error this time!): it is just the ratio of the variances of the data and of the a priori model parameters.
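A short sketch verifying that the augmented LS system (5.5) reproduces the damped LS formula; the kernel, prior mean and variances below are invented for illustration:

```python
import numpy as np

# hypothetical setup: uncorrelated data (variance sigma_d^2), a priori model mu with variance sigma_m^2
rng = np.random.default_rng(5)
N, M = 8, 12
K = rng.normal(size=(N, M))
mu = np.zeros(M)
sigma_d, sigma_m = 0.5, 2.0
eps = sigma_d / sigma_m                               # eps^2 = sigma_d^2 / sigma_m^2
d_obs = K @ rng.normal(0, sigma_m, M) + rng.normal(0, sigma_d, N)

# augmented LS system F m = f, Eq. (5.5)
F = np.vstack([K, eps * np.eye(M)])
f = np.concatenate([d_obs, eps * mu])
m_aug, *_ = np.linalg.lstsq(F, f, rcond=None)

# damped LS formula, for comparison
m_dls = np.linalg.solve(K.T @ K + eps**2 * np.eye(M), K.T @ d_obs + eps**2 * mu)
print(np.allclose(m_aug, m_dls))                      # the two solutions agree
```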
\[ p(m|d) = \frac{p(d|m)\, p(m)}{p(d)}. \]
Here p(d) does not depend on the model parameters, so it acts as a normalisation. A combination of two p.d.f.s should satisfy the following properties [8]: the order in which they are combined does not matter; combining with the null distribution leaves the original distribution unchanged; the combination is invariant under a change of variables (reparametrisation); and the combination is zero only where both combined p.d.f.s are zero. The required combination is a product (T stands for total), provided the null distribution is constant, i.e.
\[ p_T(m, d) = p_A(m, d)\, p_k(m, d), \qquad p_N \propto \mathrm{const}. \]
Both d^pre and m^est are obtained simultaneously. This method does not necessarily give the same m^est as the maximisation of the L function above. To find the maximum likelihood point of p_T(m, d) with respect to the model parameters alone, we sum (integrate) all the probabilities along lines of equal model parameters (project onto the d = 0 plane), and only then look for a maximum:
\[ p(m) = \int p_T(m, d)\, dd_1 \dots dd_N. \]
Consider the (WDLS) solution of the linear problem Km = d, assuming the a priori p.d.f. is
\[ p_A(m, d) \propto \exp\left( -\frac12 (m - \mu(m))^T [\mathrm{cov}\, m]_A^{-1} (m - \mu(m)) - \frac12 (d - d^{obs})^T [\mathrm{cov}\, d]^{-1} (d - d^{obs}) \right). \]
If there are no errors in the theory, then the theory p.d.f. is narrow, say, a Dirac delta function:
\[ p_k(m, d) = \delta(Km - d). \]
These are just the ML (when underdetermined) and LS (when overdetermined) solutions. If the a priori model parameters are not zero,
\[ m^{est} = K^{-g} d^{obs} + (I - R)\mu(m) = K^T [K K^T]^{-1} d^{obs} + \left( I - K^T [K K^T]^{-1} K \right)\mu(m). \]
When σ²_d → ∞ and/or σ²_k → ∞, the solution is m^est = µ(m) (no information from either the data or the theory). When instead the a priori information is weak,
\[ m^{est} = [K^T K]^{-1} K^T d^{obs}, \]
so weak a priori information with finite-error data and theory produces the same solution as finite-error a priori information and error-free data [1].
• find the solution p.d.f. pT(m) that has the largest S compared to pA(m)
• find the solution p.d.f. pT(m) that has the smallest possible new information compared to the a priori p.d.f. pA(m)
where
\[ A(m) = -\frac12 (m - \mu(m))^T [\mathrm{cov}\, m]_A^{-1} (m - \mu(m)) - (1 + \lambda_0) - \lambda^T (d - Km). \]
2
the best estimate of mest is the mean of this distribution, which is also a max likelihood point. Solving for a
minimum of A :
mest − µ(m) = [cov m]A KT [K[cov m]A KT ]−1 [d − Kµ(m)]
so the new principle (MRE) applied to the underdetermined problem, gave us the WML solution with Wm−1 =
[cov m]A .
Exercises
5–1. Consider the Laplace transform
\[ d(c) = \int_0^\infty e^{-cz}\, m(z)\, dz \]
and its discrete version
\[ d_i = \sum_{j=1}^M \exp(-c_i z_j)\, m_j, \qquad z_j \in [0, 10]; \]
the data are weighted averages of the model parameters m_j, with weights that decline exponentially with depth z. The decay is controlled by the constant c_i; the smaller i is, the wider the range over which the average is taken. You need to simulate images similar to Fig. 5.4 following the steps below (a partial sketch is given after the exercise).
(a) Generate M = 100 true model parameters m^true, m_i, i ∈ [1, 100], such that they are all zero except for a few, say m_5 = m_10 = m_20 = m_50 = m_90 = 1.
(b) Generate the observed data, N = 80, such that c_j ∈ [0, 0.1] and the kernel is K_ij = exp(−c_i z_j); the noise is normally distributed with zero mean and small standard deviation, d^obs = d^true + noise.
(c) Construct the minimum-length solution K^{-g}_{ML} = [K^T K + ε² I]^{-1} K^T with ε = 10^{-12}, find m^est = K^{-g}_{ML} d^obs and the resolution matrix R = K^{-g}_{ML} K. Draw the heatmap of the model resolution matrix R, the true model parameters and the estimated model parameters on the same graph, as in Fig. 5.4.
(d) Using the same model parameters and the same observed data as above, construct the Backus-Gilbert (BG) solution row-wise by formula (5.2), find m^est = K^{-g}_{BG} d^obs, the corresponding estimates of the model parameters and the corresponding BG model resolution matrix, and complete the set of pictures of Fig. 5.4.
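A minimal sketch of steps (a)–(c); the noise level and plotting details are assumptions, and step (d) can reuse the backus_gilbert_inverse sketch given after Eq. (5.2):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# (a) true model: M = 100 parameters, all zero except a few spikes
M = 100
z = np.linspace(0, 10, M)
m_true = np.zeros(M)
m_true[[4, 9, 19, 49, 89]] = 1.0          # m_5 = m_10 = m_20 = m_50 = m_90 = 1 (1-based indices)

# (b) observed data: N = 80, exponential kernel, small Gaussian noise (noise level assumed)
N = 80
c = np.linspace(0, 0.1, N)
K = np.exp(-np.outer(c, z))               # K_ij = exp(-c_i z_j)
d_obs = K @ m_true + rng.normal(0.0, 1e-3, N)

# (c) damped minimum-length solution and its resolution matrix
eps = 1e-12                               # as prescribed; lstsq copes with the near-singular normal matrix
Kg = np.linalg.lstsq(K.T @ K + eps**2 * np.eye(M), K.T, rcond=None)[0]
m_est = Kg @ d_obs
R = Kg @ K

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].imshow(R, cmap="viridis")           # heatmap of the model resolution matrix
ax[0].set_title("model resolution matrix R")
ax[1].plot(z, m_true, "r", label="m_true")
ax[1].plot(z, m_est, "b", label="m_est")
ax[1].legend(); ax[1].set_xlabel("z")
plt.show()
```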