1.2 Lecture 2: Investigation of the Least Squares approach
Recall that we have seen examples of modelling (or discretising) a problem into the linear system of equations
$$ Km = d, \qquad m = m^{true}. \qquad (1.2) $$
Even though the matrix K is an N × M matrix, we will try to "invert" it. We will use the following notation, adopted from [1]: $m^{est}$ is an estimate of the model parameters; it could be given as bounds, or with some degree of certainty. For example, $m^{est}_1 = 1.2 \pm 0.1$ (95%) would mean that there is a 0.95 probability that the true value of the model parameter $m_1 = m^{true}_1$ lies between 1.1 and 1.3. The estimated model parameters can then generate predicted data:
$$ K m^{est} = d^{pre}. $$
Remark: This linear discrete system (1.2) can also be used as a local approximation for weakly non-linear problems:
$$ d = k(m) \approx k(\hat m_n) + \nabla k\,[m - \hat m_n] = k(\hat m_n) + K_n\,\Delta m_{n+1}, \qquad n \in \{0, 1, 2, \dots\}, $$
using a Taylor series expansion around $\hat m_n$, where
$$ K_n = \nabla k\big|_{m = \hat m_n}, \qquad [K_n]_{ij} = \frac{\partial k_i}{\partial \hat m^{(n)}_j}, \qquad \Delta m_{n+1} = m - \hat m_n. $$
Thus we need to invert $d' = d - k(\hat m_n) = K_n\,\Delta m_{n+1}$ for $\Delta m_{n+1}$, starting with $\hat m_0$.
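As an illustration (a minimal NumPy sketch added here, not part of the original notes), the iteration above can be written out for a made-up nonlinear forward map $k(m) = m_1 e^{m_2 z}$; the forward function, its Jacobian and all numerical values below are assumptions chosen purely for the example.

```python
import numpy as np

# Hypothetical nonlinear forward map (not from the notes): d_i = m1 * exp(m2 * z_i)
z = np.linspace(0.0, 1.0, 20)

def k(m):
    return m[0] * np.exp(m[1] * z)

def jacobian(m):
    # K_n = grad k evaluated at the current model
    return np.column_stack([np.exp(m[1] * z), m[0] * z * np.exp(m[1] * z)])

m_true = np.array([2.0, -1.5])
d = k(m_true)                        # noise-free data, for illustration only

m_hat = np.array([1.0, 0.0])         # starting model m_0
for n in range(20):
    Kn = jacobian(m_hat)             # K_n
    d_prime = d - k(m_hat)           # d' = d - k(m_hat_n)
    # solve K_n * delta_m_{n+1} = d' in the least squares sense
    delta_m, *_ = np.linalg.lstsq(Kn, d_prime, rcond=None)
    m_hat = m_hat + delta_m
print(m_hat)                         # should approach m_true = (2.0, -1.5)
```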
We will consider methods based on the size (length) of $d^{pre}$ relative to the observed data, i.e. on the length of the prediction error.
Example: Fitting a straight line.
$d^{obs} = [d^{obs}_1, \dots, d^{obs}_N]^T$ is the observed data (known). We are interested in the model parameters
$$ m = [m_1, m_2]^T, \quad \text{the y-intercept and the gradient}, \quad M = 2. $$
We want to choose $d^{pre}$ as close as possible to $d^{obs}$. To make "as close as possible" precise, we need to define the distance/length that we will minimise.
Consider
$$ e_i = d^{obs}_i - d^{pre}_i, $$
the prediction error of the i-th measurement, or residual. Then the total overall error should tend to its minimum, i.e.
$$ E = \sum_{i=1}^N e_i^2 = e^T e \to \min, $$
where
$$ e = d^{obs} - d^{pre}. $$
1.2.1 Measures of length
Given a vector space V over a field F ($\mathbb{C}$ or $\mathbb{R}$), a norm on V is a non-negative-valued function $p : V \to \mathbb{R}$ such that for any $\alpha \in F$ and for any $u, v \in V$,
• $p(u + v) \le p(u) + p(v)$ (triangle inequality)
• $p(\alpha v) = |\alpha|\, p(v)$ (absolute scalability)
• $p(v) = 0 \Rightarrow v = 0$ (positive-definiteness)
If p satisfies only the first two, it is called a semi-norm.
Vector norms
$$ L_1\ (l_1)\ \text{norm:} \quad \|e\|_1 = \sum_i |e_i| $$
$$ L_2\ (l_2)\ \text{(Euclidean) norm:} \quad \|e\|_2 = \Big( \sum_i |e_i|^2 \Big)^{1/2} = (e, e)^{1/2} $$
$$ L_n\ (l_n)\ \text{norm:} \quad \|e\|_n = \Big( \sum_i |e_i|^n \Big)^{1/n} $$
Example: Consider a vector (of errors) $e = [0.01,\ 0.1,\ 3]^T$; then its norms are:
$$ \|e\|_1 = 0.01 + 0.1 + 3 = 3.11 $$
$$ \|e\|_2 = \sqrt{0.01^2 + 0.1^2 + 3^2} \approx 3.00168286 $$
$$ \|e\|_{25} = \big(0.01^{25} + 0.1^{25} + 3^{25}\big)^{1/25} \approx 3.000000000000000000000000000000000000014163\ldots $$
The largest component matters more and more as $n \to \infty$; hence, taking the limit, we obtain the infinity norm
$$ L_\infty\ (l_\infty)\ \text{norm:} \quad \|e\|_\infty = \max_i |e_i|, $$
which selects the vector element with the largest absolute value as the measure of length.
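These values can be checked directly; the following is a small NumPy sketch (an addition, not part of the original notes):

```python
import numpy as np

e = np.array([0.01, 0.1, 3.0])
print(np.linalg.norm(e, 1))        # L1 norm: 3.11
print(np.linalg.norm(e, 2))        # L2 norm: ~3.0016829
print(np.linalg.norm(e, 25))       # L25 norm: ~3.0000...
print(np.linalg.norm(e, np.inf))   # L-infinity norm: 3.0
```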
The choice of norm
Figure 1.6: This illustration [1] shows the difference in taking various norms
Computations are simpler with the L2 norm than with the L1 norm. The least squares method uses the L2 norm to quantify length; the norms above differ in the weight they give to outliers. (An outlier is a data point far from the average trend.) If the data are very accurate, we might use a higher-order norm, since an outlier would then contain a lot of information and should be taken into account properly. The L2 norm implies that the data obey Gaussian statistics (as we will see under the probabilistic approach); it weights outliers heavily, since the Gaussian distribution is short-tailed, i.e. it leaves little room for outliers (a few scattered points), see Fig. 1.7. Long-tailed distributions have many scattered (improbable) points and are better treated with the L1 norm. The choice of norm thus depends on the statistics the data obey. Methods that can tolerate a few bad data points are known as robust.
Figure 1.7: This illustration [1] compares two distributions with different tails: the left distribution is long-tailed,
the right one is short-tailed
Matrix norms
A vector-induced matrix norm is
$$ \|K\| = \max_{m \neq 0} \frac{\|Km\|}{\|m\|} \qquad \text{or} \qquad \|K\| = \max_{\|m\| = 1} \|Km\|; $$
the latter implies that $\|Km\| \le \|K\|\,\|m\|$ and $\|I\| = 1$.
The $L_1$ norm is:
$$ \|K\|_1 = \max_{m \neq 0} \frac{\|Km\|_1}{\|m\|_1} = \max_j \sum_i |k_{ij}| = \max_j \|c_j\|_1, $$
where $c_j$ is the $j$th column vector of the matrix K. Thus the matrix $L_1$ norm is the largest $L_1$ norm of the columns of the matrix.
The $L_\infty$ vector norm generates a very similar $L_\infty$ matrix norm, i.e.
$$ \|K\|_\infty = \max_{m \neq 0} \frac{\|Km\|_\infty}{\|m\|_\infty} = \max_i \sum_j |k_{ij}| = \max_i \|r_i\|_1, $$
where $r_i$ is the $i$th row vector of the matrix K. Thus the matrix $L_\infty$ norm is the largest $L_1$ norm of the rows of the matrix.
If we use the vector L2 norm, we get the matrix L2 norm:
$$ \|K\|_2 = \max_{m \neq 0} \frac{\|Km\|_2}{\|m\|_2} = \sqrt{\rho(K^* K)} = \sqrt{\rho(K K^*)} = \|K^*\|_2, $$
where $\rho(K)$ is the spectral radius of the matrix K and $K^*$ is the adjoint of K. For square matrices, the spectral radius is
$$ \rho(K) = \max_{1 \le i \le N} |\lambda_i(K)|, $$
where, in turn, $\lambda_i(K)$ are the eigenvalues of K, such that $\det(K - \lambda I_N) = 0$. The matrix L2 norm of K is the largest square root of the eigenvalues of the matrix $K K^*$ or of $K^* K$ (the largest singular value of K). The singular values of a matrix K are the positive square roots of the eigenvalues of $K K^T$ or $K^T K$, i.e.
$$ \mu_i(K) = \sqrt{\lambda_i(K^T K)} = \sqrt{\lambda_i(K K^T)}. $$
If the matrix K is Hermitian (self-adjoint), $K = K^*$, or symmetric, $K = K^T$, the L2 norm equals the spectral radius of K:
$$ \|K\|_2 = \rho(K) \ \text{ if } K = K^* \text{ or } K = K^T; \qquad \text{in general } \|K\| \ge \rho(K). $$
The Frobenius norm is not a vector-induced matrix norm, but it can be computed easily:
$$ \|K\|_F = \Big( \sum_{i=1}^N \sum_{j=1}^M |k_{ij}|^2 \Big)^{1/2}, \qquad \text{or} \qquad \|K\|_F = \{\mathrm{tr}(K^* K)\}^{1/2}. $$
The latter allows one to estimate the L2 norm efficiently for an N × N matrix by using
$$ \|K\|_2 \le \|K\|_F \le \sqrt{N}\,\|K\|_2. $$
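As a numerical illustration (a NumPy sketch added to the notes; the particular matrix is arbitrary), the matrix norms above and the Frobenius bound can be checked directly:

```python
import numpy as np

K = np.array([[4.0, 0.0],
              [-3.0, 2.0],
              [1.0, 5.0]])

print(np.linalg.norm(K, 1))        # largest absolute column sum
print(np.linalg.norm(K, np.inf))   # largest absolute row sum
n2 = np.linalg.norm(K, 2)          # largest singular value
nF = np.linalg.norm(K, 'fro')      # Frobenius norm
# L2 norm equals the square root of the largest eigenvalue of K^T K
print(np.isclose(n2, np.sqrt(np.max(np.linalg.eigvalsh(K.T @ K)))))
# Frobenius bounds on the L2 norm
print(n2 <= nF <= np.sqrt(min(K.shape)) * n2)
```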
1.2.2 Classification of linear inverse problems based on the quality of information
We are still dealing with equation (1.2), where M is the number of model parameters and N is the number of data. We judge the quality of information, which depends on the N data points and the number M of model parameters, but also on the structure of the matrix K, which can be sparse or full.
(Purely) underdetermined problems The data do not provide enough information to determine m uniquely. One easy explanation is N < M: we have more unknowns than data. We usually have several solutions that deliver zero prediction error E = 0. Assume N < M and that there are no inconsistencies; then we can find more than one solution such that the total overall error is E = 0.
Solution: minimisation of the norm of the estimated solution.
Overdetermined problems We have too much information, usually N > M, and we minimise the total overall error.
Solution: minimisation of the prediction error.
Mixed-determined problems Frequently, the data determine some of the model parameters uniquely, but not others. The situation is typical for tomography: if the rays miss some of the blocks but go through others, some blocks may have plenty of data about their parameters, while a missed block has no chance of being reconstructed. Usually $K^T K$ is singular and, even though M < N, the data kernel has poor structure.
Even-determined problems These have exactly enough information to determine the model parameters.
1.2.3 Least squares problem for a straight line
Assume the data $d = [d_1, \dots, d_N]^T$ can be described by the straight line $d = m_1 + m_2 z$, hence
$$ d_i = m_1 + m_2 z_i, \qquad m = [m_1, m_2]^T, \qquad M = 2. $$
Usually N > M, and then there is no solution (for which e = 0), except for the special case when all the data points actually lie on a straight line; hence the inverse problem is overdetermined. So we look for an approximate solution, such that
$$ E = e^T e = \sum_{i=1}^N (d_i - m_1 - m_2 z_i)^2 \to \min. $$
This is a calculus problem: we consider the total overall error as a function of the model parameters, $E = E(m_1, m_2)$, and take the derivatives $\partial E / \partial m_q$, $q \in \{1, 2\}$. For each $i \in \{1, 2, \dots, N\}$ we get
$$ \frac{\partial}{\partial m_1}(d_i - m_1 - m_2 z_i)^2 = 2m_1 - 2d_i + 2m_2 z_i, $$
$$ \frac{\partial}{\partial m_2}(d_i - m_1 - m_2 z_i)^2 = 2m_1 z_i - 2d_i z_i + 2m_2 z_i^2, $$
hence we have
$$ \frac{\partial E}{\partial m_1} = 2N m_1 - 2\sum_{i=1}^N d_i + 2m_2 \sum_{i=1}^N z_i, $$
$$ \frac{\partial E}{\partial m_2} = 2m_1 \sum_{i=1}^N z_i - 2\sum_{i=1}^N d_i z_i + 2m_2 \sum_{i=1}^N z_i^2. $$
Setting them to zero for a minimum,
$$ 2N m_1 - 2\sum_{i=1}^N d_i + 2m_2 \sum_{i=1}^N z_i = 0, $$
$$ 2m_1 \sum_{i=1}^N z_i - 2\sum_{i=1}^N d_i z_i + 2m_2 \sum_{i=1}^N z_i^2 = 0. $$
Dividing everything by two, the latter can be written as a matrix equation:
$$ \begin{bmatrix} N & \sum_{i=1}^N z_i \\ \sum_{i=1}^N z_i & \sum_{i=1}^N z_i^2 \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^N d_i \\ \sum_{i=1}^N z_i d_i \end{bmatrix}. $$
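A minimal NumPy sketch of this 2×2 system (an addition to the notes, using synthetic data; the "true" intercept 1.5 and gradient 0.7 below are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(0)
z = np.linspace(0.0, 10.0, 50)
d = 1.5 + 0.7 * z + 0.1 * rng.standard_normal(z.size)   # synthetic noisy data

N = z.size
A = np.array([[N,       z.sum()],
              [z.sum(), (z**2).sum()]])
b = np.array([d.sum(), (z * d).sum()])
m1, m2 = np.linalg.solve(A, b)
print(m1, m2)    # estimated intercept and gradient, close to (1.5, 0.7)
```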
1.2.4 Least squares solution to the linear inverse problem
The matrix equation we consider is the same:
$$ d = Km, \qquad m = [m_1, m_2, \dots, m_M]^T, $$
and recall the total overall error
$$ E = e^T e = \big(d^{obs} - d^{pre}\big)^T \big(d^{obs} - d^{pre}\big) = (d - Km)^T (d - Km) = \sum_{i=1}^N \Big[ d_i - \sum_{j=1}^M K_{ij} m_j \Big] \Big[ d_i - \sum_{k=1}^M K_{ik} m_k \Big]. $$
Then, taking the derivatives, we get the so-called normal equation
$$ K^T K m^{est} - K^T d = 0, \qquad (1.3) $$
where $K^T K$ is a square M × M matrix, m is a vector of length M, and $K^T d$ is also a vector of length M.
If $[K^T K]^{-1}$ exists, then the solution exists and is given by
$$ m^{LS} = [K^T K]^{-1} K^T d, $$
where $m^{LS}$ is the least squares solution to Km = d. Notice that $[K^T K]^{-1}$ can be extremely hard to compute, as it can be huge. Usually, even if K is sparse, $K^T K$ is not sparse enough to simplify the computations. For large matrices, the normal equation can instead be solved iteratively, e.g. by the biconjugate gradient algorithm.
If $[K^T K]^{-1}$ does not exist, the solution might not be unique.
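For illustration (a sketch on random data, not part of the original notes), the normal-equation formula can be compared with a library least squares routine:

```python
import numpy as np

rng = np.random.default_rng(1)
K = rng.standard_normal((50, 3))              # N = 50 data, M = 3 parameters
m_true = np.array([1.0, -2.0, 0.5])
d = K @ m_true + 0.05 * rng.standard_normal(50)

m_ls = np.linalg.solve(K.T @ K, K.T @ d)      # normal equations (K^T K invertible here)
m_np, *_ = np.linalg.lstsq(K, d, rcond=None)  # library least squares (QR/SVD based)
print(np.allclose(m_ls, m_np))                # True: the two solutions agree
```

Note that forming $K^T K$ explicitly squares the condition number of the problem, which is why library routines are usually based on QR or SVD factorisations rather than on the normal equations.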
1.2.5 An alternative approach to least squares approximation
Consider again the normal equation (1.3),
$$ K^T K m^{est} = K^T d, \qquad (1.4) $$
where K is a given N × M matrix and d is a given vector in $\mathbb{R}^N$. We define a real-valued function $F : \mathbb{R}^M \to \mathbb{R}$ by
$$ F(m) := \|Km - d\|^2 = (Km - d) \cdot (Km - d), \qquad \text{for all } m \in \mathbb{R}^M. $$
Expanding, $F(m) = m^T K^T K m - 2 m^T K^T d + d^T d$, so the gradient vector of F satisfies
$$ \nabla F(m) = 2\big(K^T K m - K^T d\big), \qquad \text{for all } m \in \mathbb{R}^M. \qquad (1.5) $$
Thus $\nabla F(m) = 0$ at the least squares solution $m^{est}$. According to equation (1.5), the least squares solution must therefore satisfy the normal equation (1.4).
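As a quick sanity check of (1.5) (a sketch with random K and d, added here, not part of the original notes), the analytic gradient can be compared with central finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)
K = rng.standard_normal((6, 4))
d = rng.standard_normal(6)
m = rng.standard_normal(4)

def F(m):
    r = K @ m - d
    return r @ r                       # F(m) = ||Km - d||^2

grad_analytic = 2.0 * (K.T @ K @ m - K.T @ d)

h = 1e-6                               # central finite differences
grad_fd = np.array([(F(m + h * ei) - F(m - h * ei)) / (2 * h) for ei in np.eye(4)])
print(np.allclose(grad_analytic, grad_fd, atol=1e-5))   # True
```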
1.2.6 Minimal Length solution: purely underdetermined problems
We consider equation (1.2) in the case where it is purely underdetermined, i.e. more than one solution minimises the error.
What should we do? We use another guiding principle: add some a priori information. Mainly, this quantifies expectations about the character of the solution that are not based on the actual data; examples include: the density is positive; the solution is "small" or "simple" (in some measure of length).
We then might want to consider
$$ L = m^T m = \sum_i m_i^2 \to \min. \qquad (1.6) $$
We still want to minimise the error (but there are several solutions that do so), so we add that as a constraint and obtain a Lagrange multipliers problem:
$$ \min_m L \quad \text{subject to the constraint} \quad e = d - Km = 0. $$
We are thus looking to minimise
$$ \Phi(m) = L + \sum_{i=1}^N \lambda_i e_i = \sum_{i=1}^M m_i^2 + \sum_{i=1}^N \lambda_i \Big[ d_i - \sum_{j=1}^M K_{ij} m_j \Big]. \qquad (1.7) $$
Taking the derivatives and setting $\partial \Phi / \partial m_q = 0$, we get
$$ \frac{\partial \Phi}{\partial m_q} = 2 m_q - \sum_{i=1}^N \lambda_i K_{iq} = 0, $$
and hence, in matrix form,
$$ 2m = K^T \lambda, \qquad Km = d. $$
The latter implies that
$$ d = K K^T \frac{\lambda}{2}, $$
where $K K^T$ is an N × N matrix, and if $[K K^T]^{-1}$ exists, then
$$ \lambda = 2 [K K^T]^{-1} d, $$
and hence the minimal length solution is
$$ m^{ML} = K^T [K K^T]^{-1} d. \qquad (1.8) $$
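A minimal NumPy sketch of (1.8) for a random purely underdetermined system (the sizes are arbitrary, chosen for illustration); it also checks agreement with the minimum-norm solution returned by the pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(3)
K = rng.standard_normal((3, 6))                  # N = 3 < M = 6: purely underdetermined
d = rng.standard_normal(3)

m_ml = K.T @ np.linalg.solve(K @ K.T, d)         # m_ML = K^T [K K^T]^{-1} d
print(np.allclose(K @ m_ml, d))                  # True: fits the data exactly
print(np.allclose(m_ml, np.linalg.pinv(K) @ d))  # True: agrees with the pseudoinverse solution
```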
1.2.7 Weak underdetermination: damped least squares
For weakly underdetermined problems we can adopt an approximation to partitioning the matrix. Recall that if some parameters can be determined and others cannot, we would like to separate them (see the mixed-determined case in a later section).
We determine a solution that minimises some combination Φ(m) of the prediction error and the solution length for m:
$$ \Phi(m) = E + \varepsilon^2 L = e^T e + \varepsilon^2 m^T m, \qquad (1.9) $$
where $\varepsilon^2$ is a weighting factor which determines the relative importance given to the prediction error and to the solution length.
If ε is large, we minimise the underdetermined part of the solution: BUT this tends to shrink the overdetermined part as well, hence E is not properly minimised.
If ε = 0, then E is minimal, but no a priori information is provided to single out the underdetermined part.
We need a compromise; there are some methods for choosing ε, but it is mostly trial and error.
Minimisation of Φ (lectures and exercise) gives
$$ \Phi(m) = E + \varepsilon^2 L = \sum_{i=1}^N \Big[ d_i - \sum_{j=1}^M K_{ij} m_j \Big] \Big[ d_i - \sum_{k=1}^M K_{ik} m_k \Big] + \varepsilon^2 \sum_{i=1}^M m_i^2, $$
$$ \frac{\partial \Phi}{\partial m_q} = 2\varepsilon^2 m_q - 2\sum_{i=1}^N K_{iq} d_i + 2\sum_{k=1}^M m_k \sum_{i=1}^N K_{iq} K_{ik} = 0, $$
and hence
$$ K^T K m - K^T d + \varepsilon^2 m = 0. $$
The corresponding solution is called the damped least squares solution:
$$ [K^T K + \varepsilon^2 I]\, m^{DLS} = K^T d. \qquad (1.10) $$
The underdeterminacy is said to be damped.
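A minimal NumPy sketch of the damped least squares solution (1.10), with an arbitrary random system and an arbitrary choice of ε, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
K = rng.standard_normal((20, 10))
d = rng.standard_normal(20)
eps = 0.5                                   # damping parameter, chosen by trial and error

M = K.shape[1]
m_dls = np.linalg.solve(K.T @ K + eps**2 * np.eye(M), K.T @ d)
print(np.linalg.norm(K @ m_dls - d))        # prediction error
print(np.linalg.norm(m_dls))                # solution length (damped by eps)
```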
1.2.8 Other a priori information
Example If we want to reconstruct density fluctuations in the ocean, we do not want the solution to be close to zero, but rather to some typical a priori value $m^{priori}$ (e.g. a typical sea-water value), so $L = m^T m$ might not be a good measure, and we need
$$ L = (m - m^{priori})^T (m - m^{priori}). $$
Example We want our solution to be flat, and that is easy to quantify. Flatness is the opposite of steepness, and the first derivative controls it. In the discrete case we consider
$$ \frac{\partial m}{\partial x} \to \frac{m_{i+1} - m_i}{\Delta x}, $$
thus the steepness of m is
$$ l = \frac{1}{\Delta x} \begin{bmatrix} -1 & 1 & 0 & 0 & \cdots & 0 \\ 0 & -1 & 1 & 0 & \cdots & 0 \\ 0 & 0 & -1 & 1 & \cdots & 0 \\ \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ \vdots \\ m_M \end{bmatrix} = Dm, $$
where D is the steepness matrix, an approximation of $\frac{dm}{dx}$.
Example Assume the solution m is smooth, i.e. the parameters vary slowly with position. Roughness is the opposite of smoothness, and the second derivative controls it. The rows of the matrix will now contain terms like
$$ (\Delta x)^{-2}\,[\ \dots\ \ 1\ \ -2\ \ 1\ \ \dots\ ]. $$
Hence the matrix approximating $\frac{d^2 m}{dx^2}$ is
$$ D = \frac{1}{(\Delta x)^2} \begin{bmatrix} 1 & -2 & 1 & 0 & \cdots & 0 \\ 0 & 1 & -2 & 1 & \cdots & 0 \\ 0 & 0 & 1 & -2 & \cdots & 0 \\ \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \end{bmatrix}, $$
and the measure of length is
$$ L = l^T l = [Dm]^T Dm = m^T D^T D\, m = m^T W_m m. $$
The matrix $W_m = D^T D$ is the weighting factor that enters into the calculation of the length of the vector m.
$\|m\|^2_{weighted} = m^T W_m m$ is not a proper norm, as it violates positive-definiteness: $\|m\|^2_{weighted} = 0$ for some non-zero vectors, such as a constant vector. This can cause non-uniqueness.
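A small NumPy sketch (an addition to the notes) that builds first- and second-difference matrices and verifies that a constant model has zero weighted length, which is exactly the non-uniqueness issue mentioned above:

```python
import numpy as np

M, dx = 6, 1.0

# first-difference (steepness) matrix, size (M-1) x M
D1 = (np.eye(M, k=1) - np.eye(M))[:-1] / dx
# second-difference (roughness) matrix, size (M-2) x M
D2 = (np.eye(M, k=2) - 2 * np.eye(M, k=1) + np.eye(M))[:-2] / dx**2

Wm = D2.T @ D2
m_const = np.ones(M)
print(m_const @ Wm @ m_const)   # 0.0: a constant model has zero weighted "length"
```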
1.2.9 A priori and weighting matrix
The measure of solution simplicity can therefore be generalised to
$$ L = (m - m^{priori})^T W_m (m - m^{priori}); \qquad (1.11) $$
see Section 1.2.8 for the matrix $W_m$. By suitably choosing the a priori model $m^{priori}$ and the weights $W_m$, one can generate a variety of measures of simplicity. We can also consider the generalised prediction error
$$ E = e^T W_e e, $$
where $W_e$, in its turn, defines the relative contribution of each individual error (normally this is a diagonal matrix).
Example: Weighted least squares To solve the completely overdetermined Km = d with $E = e^T W_e e$, we have
$$ m^{WLS} = [K^T W_e K]^{-1} K^T W_e d. $$
Example: Weighted minimum length To solve the completely underdetermined Km = d with $L = [m - m^{priori}]^T W_m [m - m^{priori}]$, we have
$$ m^{WML} = m^{priori} + W_m^{-1} K^T [K W_m^{-1} K^T]^{-1} [d - K m^{priori}]. $$
Example: Weighted damped least squares To solve the slightly underdetermined Km = d with $\Phi(m) = E + \varepsilon^2 L$ (ε is chosen by trial and error), we have
$$ m^{WDLS} = [K^T W_e K + \varepsilon^2 W_m]^{-1} [K^T W_e d + \varepsilon^2 W_m m^{priori}]. $$
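A minimal NumPy sketch of the weighted damped least squares formula (an addition; the random data, the diagonal $W_e$, the identity $W_m$ and the value of ε are all arbitrary choices made for the illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 20, 10
K = rng.standard_normal((N, M))
d = rng.standard_normal(N)
m_prior = np.zeros(M)

We = np.diag(rng.uniform(0.5, 2.0, N))   # data weights, e.g. inverse variances
Wm = np.eye(M)                           # simple model weighting
eps = 0.5

lhs = K.T @ We @ K + eps**2 * Wm
rhs = K.T @ We @ d + eps**2 * Wm @ m_prior
m_wdls = np.linalg.solve(lhs, rhs)
print(m_wdls)
```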
1.2.10 Moore-Penrose Pseudoinverses and least squares
Consider
$$ \min_m \|Km - d\|. $$
Assume $m_0$ and $m_1$ are any two solutions to the normal equation (1.3), (1.4),
$$ K^T K m = K^T d; $$
then
$$ K^T K (m_0 - m_1) = 0, \quad \text{so} \quad m_0 - m_1 \in \mathrm{Null}(K^T K). $$
Recall that $\mathrm{Null}(K) = \{ m : Km = 0 \}$.
Theorem 1.1. Let K be an N × M matrix, let m be a vector in $\mathbb{R}^M$ and let d be a vector in $\mathbb{R}^N$; then
$$ Km \cdot d = m \cdot K^T d, $$
where by · we denote the dot (scalar) product of vectors (it will generally be omitted).
Proof. The proof follows the same computation as the example below.
Example This example illustrates well what happens in Theorem 1.1. Consider
$$ K = \begin{bmatrix} 4 & 0 \\ -3 & 2 \\ 1 & 5 \end{bmatrix}, \qquad m = \begin{bmatrix} m_1 \\ m_2 \end{bmatrix}, \qquad d = \begin{bmatrix} d_1 \\ d_2 \\ d_3 \end{bmatrix}, \qquad M = 2,\ N = 3. $$
Then
$$ Km \cdot d = \begin{bmatrix} 4m_1 \\ -3m_1 + 2m_2 \\ m_1 + 5m_2 \end{bmatrix} \cdot \begin{bmatrix} d_1 \\ d_2 \\ d_3 \end{bmatrix} = 4m_1 d_1 + (-3m_1 + 2m_2) d_2 + (m_1 + 5m_2) d_3 = [4d_1 - 3d_2 + d_3] m_1 + [2d_2 + 5d_3] m_2 $$
$$ = \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} \cdot \begin{bmatrix} 4d_1 - 3d_2 + d_3 \\ 2d_2 + 5d_3 \end{bmatrix} = \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} \cdot \begin{bmatrix} 4 & -3 & 1 \\ 0 & 2 & 5 \end{bmatrix} \begin{bmatrix} d_1 \\ d_2 \\ d_3 \end{bmatrix} = m \cdot K^T d. $$
Theorem 1.2 (Corollary). If d is orthogonal to the range of K, then d is in $\mathrm{Null}(K^T)$.
Proof. If d is orthogonal to the range of K, then
$$ Km \cdot d = 0 \ \text{ for all } m \in \mathbb{R}^M \ \Rightarrow\ m \cdot K^T d = 0, \ \text{ by Theorem 1.1,} $$
i.e. the vector $K^T d$ is orthogonal to every vector in $\mathbb{R}^M$, in particular to itself, so
$$ K^T d \cdot K^T d = 0, \ \text{ i.e. } \ \|K^T d\|^2 = 0 \ \Rightarrow\ K^T d = 0. $$
The latter proves that $d \in \mathrm{Null}(K^T)$.
Theorem 1.3. For an arbitrary matrix K, $\mathrm{Null}(K)$ is the same as $\mathrm{Null}(K^T K)$.
Proof. We show the direct inclusion first: suppose m is in $\mathrm{Null}(K)$, so Km = 0; then
$$ m \in \mathrm{Null}(K) \ \Rightarrow\ Km = 0 \ \Rightarrow\ K^T K m = K^T 0 = 0 \ \Rightarrow\ m \in \mathrm{Null}(K^T K). $$
Now the reverse inclusion. Assume $m \in \mathrm{Null}(K^T K)$, so $K^T K m = 0$ (recall that the range of K is a subspace of $\mathbb{R}^N$ and the null space of K is a subspace of $\mathbb{R}^M$). Then, using Theorem 1.1,
$$ \|Km\|^2 = Km \cdot Km = m \cdot K^T K m = m \cdot 0 = 0 \ \Rightarrow\ Km = 0. $$
Thus the null spaces are the same.
Hence $m_0 - m_1 \in \mathrm{Null}(K)$, and thus there exists a unique solution to the normal equation that is also orthogonal to the null space of K: if there were two such solutions, their difference would be in the null space of K and at the same time orthogonal to the null space, and hence zero.
Let us denote this unique solution by $m^+$; then $\|m^+\|$ is as small as possible among the solutions of the normal equation. Some explanation: any other solution must differ from $m^+$ by a component in the null space of K, orthogonal to $m^+$, and adding an orthogonal component only increases the norm, by the Pythagorean theorem.
The uniquely determined vector $m^+$ is called the Moore-Penrose solution to the normal equation, and $m^+ \in \mathrm{Range}(K^T)$.
If $K^T K$ is invertible, the solution to the normal equation is unique and the Moore-Penrose solution coincides with the least squares solution, $m^{LS} = m^{MP} = [K^T K]^{-1} K^T d$.
Example: Consider
$$ K = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix} \quad \text{and data} \quad d = \begin{bmatrix} 1 \\ 3 \\ 8 \\ 2 \end{bmatrix}. $$
$K^T K$ is not invertible (exercise), $\mathrm{Null}(K) = \{ t(-1, 1, 1)^T : t \in \mathbb{R} \}$, and hence the set of solutions to the corresponding normal equation is given by
$$ \{ (1, 1, 4)^T + t(-1, 1, 1)^T : t \in \mathbb{R} \}, $$
since all the solutions to the normal equation have the form $\hat m = (x,\ 2 - x,\ 5 - x)^T$. The norm of a typical solution is
$$ \sqrt{(1 - t)^2 + (1 + t)^2 + (4 + t)^2} = \sqrt{3t^2 + 8t + 18} \to \min \ \text{ when } \ 6t + 8 = 0 \ \Rightarrow\ t = -\frac{8}{6}. $$
Thus the Moore-Penrose solution is
$$ m^{MP} = m^+ = (1, 1, 4)^T - \frac{8}{6}(-1, 1, 1)^T = \Big( \frac{7}{3}, -\frac{1}{3}, \frac{8}{3} \Big)^T, $$
and it is indeed orthogonal to the vector $(-1, 1, 1)^T$.
Summary: here we used the following approach: find all the solutions to the normal equation and then take the one with minimal norm. The singular value decomposition provides another route to the Moore-Penrose solution.
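The example above can be verified numerically; below is a short NumPy sketch (an addition to the notes) using the built-in pseudoinverse, which returns the minimum-norm least squares solution:

```python
import numpy as np

K = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0]])
d = np.array([1.0, 3.0, 8.0, 2.0])

m_mp = np.linalg.pinv(K) @ d
print(m_mp)                                      # ~[ 2.3333, -0.3333, 2.6667] = (7/3, -1/3, 8/3)
print(np.dot(m_mp, np.array([-1.0, 1.0, 1.0])))  # ~0: orthogonal to the null space of K
```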
Exercises
1–6. Take
$$ K = \begin{bmatrix} -1 & 2 \\ 2 & -3 \\ -1 & 3 \end{bmatrix} \quad \text{and data} \quad d = \begin{bmatrix} 4 \\ 1 \\ 2 \end{bmatrix}. $$
Show that $K^T K$ is invertible and then find the unique least squares solution $m^{est} = \begin{bmatrix} 3 \\ 2 \end{bmatrix}$. Compute $\|K m^{est} - d\|$.
1–7. Take
$$ K = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix} \quad \text{and data} \quad d = \begin{bmatrix} 1 \\ 3 \\ 8 \\ 2 \end{bmatrix}. $$
Show that $K^T K$ is non-invertible. Find all the solutions to the normal equation and show that the unique element closest to d in the range of K is $\hat y = \begin{bmatrix} 2 \\ 2 \\ 5 \\ 5 \end{bmatrix}$. Compute $\|\hat y - d\|$ in this case.
1–8. Compute the singular values of
$$ K = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}. $$
References
[1] William Menke, Geophysical Data Analysis: Discrete Inverse Theory, 3rd edition, Elsevier, 2012.
[2] Per Christian Hansen, Discrete Inverse Problems: Insight and Algorithms, SIAM, 2010.