Strategies For Computing The Condition Number of A Matrix
Author: Per Niklas Waaler
Supervisor: Claus Führer
Abstract
The main objective of this thesis is to present the article An estimate for the condition number of a matrix by Cline et al., which describes an algorithm for obtaining a reliable order-of-magnitude estimate of the condition number of a matrix in $O(n^2)$ operations. In addition, the thesis introduces an iterative process for estimating the condition number in the 2-norm with arbitrary precision. It is a generalisation of the algorithm in the article; it establishes a connection between power iteration and the algorithm from the article, and provides an alternative route to deriving its key equations. The thesis also covers basic theory surrounding the condition number, and for completeness it includes a brief presentation of Skeel's condition number, based on the article Scaling for numerical stability in Gaussian elimination by Skeel.
Acknowledgements
I would like to thank my thesis supervisor Claus for all his help and his insistence
on understanding everything on a more intuitive level, and my friend Rasmus
Høier for all his good advice and all his time spent reading my work to weed
out errors.
Contents
Introduction
4 Numerical testing
4.1 Numerical testing using QR decomposition
4.2 Numerical testing using LU decomposition
Introduction

Chapter 1
Theory of error sensitivity
appropriate to use depends on the problem. For instance, if we are solving the system Ax = b, where b is input data and x is output data, it might be tempting to simply use the 2-norm as a measure of the perturbation of b. But it might be the case that the various components of b are measurements of greatly varying magnitude, and that perturbing one component by some fixed amount represents a catastrophe in terms of the precision of that measurement (in physics, for instance, the least precise measurement tends to be somewhat of a bottleneck in terms of precision), whereas perturbing another component by the same amount would hardly matter at all if that measurement is of a comparatively large order of magnitude.
This consideration is not taken into account by the 2-norm, and so in some cases it is not a useful way to measure distance. In fact, if we perturb a component $x_i$ by $\delta$ and take the 2-norm of $x$, we will have under the square-root sign the term $(x_i + \delta)^2 = x_i^2 + 2\delta x_i + \delta^2$. We can see that the amount by which the perturbation changes the length of $x$ is determined by the sum $2\delta x_i + \delta^2$, and for positive $x_i$ and $\delta$ the increase in length becomes greater when the perturbed component $x_i$ is large. We see here a disagreement between distance as measured in the 2-norm and our intuitive notion of distance between two sets of measurements; intuitively, we think of an error of fixed size as making more of a difference when the error is in a measurement of small magnitude than when it is in a measurement of large magnitude, and we would like our definition of distance to reflect this intuition. Considerations such as these have led to the formulation of Skeel's condition number, which we will get to later.
use $\|\cdot\|$ without specifying the norm by an index. The norm that is intended can be read from the context, so when we write $\|f(x)\|$ we mean $\|f(x)\|_{(Y)}$; hence the norm can be inferred by asking which space the element belongs to. Now, if we let $\delta f = f(\bar{x}) - f(x)$, then this can be expressed as
With this definition at hand, we turn our attention to finding the condition numbers related to problems involving systems of equations, which we will in turn use to define the condition number of a matrix. Note that there are several ways to view the equation Ax = b in terms of what to consider as input and output. For instance, we can view x as output and then consider the effect of perturbing both A and b simultaneously, or we can perturb just A or just b, which is sometimes done for simplicity. Often, b will contain measurements, and A will be a matrix of coefficients that reflects the physical laws or properties that characterize the physical system under observation, and it is therefore useful to consider perturbations to both or either of these.
As an example, when applying Kirchhoff's rules in order to determine how the currents flow in an electrical circuit, we obtain a system of equations equivalent to the matrix equation Ax = b, where the components of A are determined by the circuit's resistances (and the directions of the currents in the circuit) as well as how the circuit elements are connected, b is determined by the voltages of the emf sources and by how the circuit elements are connected, and x is the vector of unknown currents. In this case we can expect there to be uncertainty in the data that goes into both b and A, and so ideally we want to consider the effect of introducing perturbations to both when we compute the condition number of the problem. Note that if we consider the problem in this way, the elements of the input space are matrix-vector pairs, which raises the question of how to define distance between such objects. Note also that some of the elements of A and b will be exact zeros, and as such they should not be considered as measurements, which are susceptible to errors. This means that it would be inappropriate to include perturbations to these elements in our error-sensitivity analysis, as doing so yields an overly dramatic estimate of the error sensitivity (the reason it becomes larger is that we are maximizing the relative perturbation over a larger set of input perturbations). The subjects of how to deal with exact zeros and how to define distance between matrix-vector pairs are addressed in the section on Skeel's condition number.
$$\|A\|_{(m,n)} := \sup_{x \in \mathbb{R}^n} \frac{\|Ax\|_{(m)}}{\|x\|_{(n)}} = \max_{\substack{x \in \mathbb{R}^n \\ \|x\|_{(n)} = 1}} \|Ax\|_{(m)} \qquad (1.3)$$
Note that if we swap the roles of x and b and consider b as input and x as output, then we have essentially the same problem as before, with the roles of A and $A^{-1}$ interchanged. Hence we would end up with the same bound.
This bound also turns up when we consider A as input and x as output, with b held fixed. Let $\delta A$ be the perturbation in A, and let $\delta x$ be the resulting perturbation in x. We then have $(A + \delta A)(x + \delta x) = b$. Since the double infinitesimal $\delta A\,\delta x$ becomes vanishingly small relative to the other terms for small perturbations, we drop it and obtain $Ax + A\,\delta x + \delta A\,x = b$. After subtracting $b = Ax$ and multiplying by $A^{-1}$ on each side, we get $\delta x = -A^{-1}\delta A\,x$. Taking norms on each side yields $\|\delta x\| = \|A^{-1}\delta A\,x\| \le \|A^{-1}\|\|\delta A\,x\| \le \|A^{-1}\|\|\delta A\|\|x\|$ (where we use the fact that $\|Ax\| \le \|A\|\|x\|$ twice). Finally, using this upper bound for $\|\delta x\|$, we compute
$$\frac{\|\delta x\| / \|x\|}{\|\delta A\| / \|A\|} \le \frac{\|A^{-1}\|\|\delta A\|\|x\|}{\|x\|} \cdot \frac{\|A\|}{\|\delta A\|} = \|A\|\|A^{-1}\|,$$
which is the same bound as before. Note that in this case the norm of the input space is the induced matrix norm.
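As a quick numerical sanity check of this first-order bound (our own illustration, not part of the thesis), we can perturb a random system and compare the ratio of relative perturbations against the bound:

import numpy as np

rng = np.random.default_rng(4)
A = rng.uniform(-1, 1, (20, 20))
x = rng.uniform(-1, 1, 20)
b = A @ x

dA = 1e-8 * rng.uniform(-1, 1, (20, 20))       # a small perturbation of A
x_pert = np.linalg.solve(A + dA, b)

rel_output = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
rel_input = np.linalg.norm(dA, 2) / np.linalg.norm(A, 2)
print(rel_output / rel_input, np.linalg.cond(A, 2))   # the ratio is bounded (to first order) by kappa(A)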
Since this bound turns up so often, it is defined as the condition number of A, denoted $\kappa(A)$. Note that it is independent of x and b, and it can be considered a property of A which gives us information about the tendency of A and $A^{-1}$ to amplify errors in the vectors they operate on. In the following we will use the notation $\kappa(A)_l$, where l indicates the norm we are using.
Besides giving an estimate of the accuracy of our solution, $\kappa(A)$ can also be used as a way to detect potential errors that have been made. For instance, a matrix which is "nearly singular" (imagine for instance taking a triangular matrix with a zero on the diagonal and perturbing the zero by a small amount) tends to be highly ill-conditioned, and an extremely large condition number could be a sign that A in its pre-converted form (before having its entries converted to floating-point numbers) was actually singular, but due to round-off errors introduced in the conversion it has become invertible.
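As a concrete illustration (a minimal sketch in Python/NumPy; the matrix is our own example, not from the thesis), the condition number of a triangular matrix whose zero diagonal entry has been perturbed by a tiny amount blows up like the reciprocal of that perturbation:

import numpy as np

# A triangular matrix that was singular before the (2,2) entry was perturbed by eps.
eps = 1e-10
A = np.array([[1.0, 2.0, 3.0],
              [0.0, eps, 4.0],
              [0.0, 0.0, 5.0]])

print(np.linalg.cond(A, 2))   # 2-norm condition number, of order 1/eps
print(np.linalg.cond(A, 1))   # 1-norm condition number, also of order 1/eps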
Using this metric effectively restricts the set of perturbations over which we maximize the relative perturbation, since for any $\epsilon > 0$ the perturbation of an exact zero is bounded by $\epsilon \cdot 0 = 0$. Also, it measures distance entirely in terms of relative component-wise error, which means that it is meaningful regardless of the physical units of the measurement data. It is also a neat solution to the problem of how to measure distance between elements of the form (A, b).
We will not include the derivation of Cond(A) as it takes us beyond the scope of this thesis. By using definition 1.1 one can show that the condition number of the problem where A and b are inputs and x is output - using our newly defined measure of distance - is given by
$$\frac{\bigl\|\,|A^{-1}||A||x| + |A^{-1}||b|\,\bigr\|_\infty}{\|x\|_\infty},$$
where the notation $|\cdot|$ used on arrays signifies that the arrays are component-wise non-negative, but otherwise the same as the array inside the modulus sign.
If we hold b fixed, then we have
$$\mathrm{Cond}(A, x) = \frac{\|\,|A^{-1}||A||x|\,\|_\infty}{\|x\|_\infty}. \qquad (1.7)$$
To simplify matters further, we can define Skeel's condition number as the maximum over all $\|x\|_\infty = 1$. Since $|A^{-1}||A|$ has only non-negative components, the ratio is maximized by setting $x = (1, 1, \dots, 1)^T$. Since $\|\,|A^{-1}||A|(1, 1, \dots, 1)^T\|_\infty = \|\,|A^{-1}||A|\,\|_\infty$, we get that
$$\mathrm{Cond}(A) = \|\,|A^{-1}||A|\,\|_\infty.$$
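These quantities are easy to evaluate directly for small matrices (a sketch in NumPy; the helper name is ours, not from the article):

import numpy as np

def skeel_cond(A, x=None):
    """Skeel's condition number: || |A^-1||A||x| ||_inf / ||x||_inf,
    or || |A^-1||A| ||_inf when no x is supplied."""
    M = np.abs(np.linalg.inv(A)) @ np.abs(A)
    if x is None:
        return np.linalg.norm(M, np.inf)
    return np.linalg.norm(M @ np.abs(x), np.inf) / np.linalg.norm(x, np.inf)

Note that this forms $A^{-1}$ explicitly and therefore costs $O(n^3)$ operations; it is meant only as a check of the formulas, not as an efficient estimator.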
Let us compute $\kappa(A_0)_2$. To compute $\|A_0\|_2$, we find the vector x of length 1 that maximizes $\|A_0 x\|_2$. This is a straightforward maximization problem subject to the constraint $x_1^2 + x_2^2 + x_3^2 = 1$, and if we solve it we find that a maximizer is $x = (\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}, 0)^T$, hence $\|A_0\|_2 = \|A_0 x\|_2 = 1$. By the same method we also find that $\|(A_0)^{-1}\|_2 = 1/\epsilon$. Hence $\kappa(A_0)_2 = 1 \cdot 1/\epsilon = 1/\epsilon$, which can be made arbitrarily large by choosing $\epsilon$ sufficiently small. This number is a reflection of the fact that - given that we measure the sizes of the perturbations in the 2-norm - the system Ax = b is highly sensitive to perturbations in $A_{33}$, $x_3$ and $b_3$ due to the small size of $\epsilon$, provided that we place no restrictions on the direction (that is, the size of the components relative to each other) of the perturbation vectors we consider.
So how do we account for the fact that the two estimates vary so dramatically in this regard? This has to do with the different ways in which distance is measured in each condition number. In the 2-norm, if we perturb a small component with a small perturbation $\delta x_i$ (here, small means small relative to $\|x\|_2$), then $\|\delta x\|_2$ will be small, as we have touched on before, and so the perturbation will be measured as being small. Hence we get a large perturbation of output resulting from a small perturbation of input, and are therefore given the impression that the system is sensitive to perturbations. However, if we measure the size of the same perturbation using the metric of relative component-wise perturbation (as we do in Skeel's condition number), the same perturbation will be measured as being very large if the perturbation is large relative to $x_i$. Consequently, the two norms can give very different impressions of the perturbation sensitivity of
A. Another reason for the disagreement is that in $\kappa(A)$ we are maximizing the relative perturbation over a larger set of perturbations, since in Cond(A) we are effectively placing the restriction that perturbations of exact zeros are not allowed.
Chapter 2
Estimating the condition number of A
In this chapter we cover the main subject of this thesis, which is the question of how to estimate $\kappa(A)$. Given that $\kappa(A) = \|A\|\|A^{-1}\|$, our efforts to estimate $\kappa(A)$ can be broken down into the tasks of estimating $\|A\|$ and $\|A^{-1}\|$. Computing $\|A\|$ in the 1-norm or infinity norm is particularly simple, as it is just a matter of finding the column vector (in the case of the 1-norm) or row vector (in the case of the $\infty$-norm) with the largest 1-norm. We do not know $A^{-1}$ however, and although computing it would yield the exact value (ignoring round-off errors), this would be too expensive to be worthwhile, as it is a task that requires $O(n^3)$ operations, especially considering that in many applications only an order of magnitude estimate is required.
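In NumPy these norms are one-liners (our own illustration):

import numpy as np

A = np.array([[1.0, -2.0], [3.0, 4.0]])
print(np.abs(A).sum(axis=0).max())   # ||A||_1: largest column sum, equals np.linalg.norm(A, 1)
print(np.abs(A).sum(axis=1).max())   # ||A||_inf: largest row sum, equals np.linalg.norm(A, np.inf)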
The ease with which we can compute $\|A\|_1$ makes it tempting to compute the condition number in the 1-norm. It is not immediately obvious how to estimate $\|A^{-1}\|_1$, however. The approach presented in An estimate for the condition number of a matrix is in essence to construct a right-hand side in Ax = b in a way that tends to maximize the ratio $\|x\|/\|b\|$, which then yields a lower bound estimate, as we see from
$$\|A^{-1}\| \ge \frac{\|A^{-1}b\|}{\|b\|} = \frac{\|x\|}{\|b\|}. \qquad (2.1)$$
This - as one might expect, and as we will show examples of later - leads to a less accurate estimate, but one which also seems reliable in the sense that it reliably indicates the correct order of magnitude of the condition number.
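To make the inequality concrete, here is a small NumPy check (our own illustration, not from the article): any right-hand side gives a ratio that never exceeds the true norm of the inverse.

import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(-1, 1, (50, 50))
b = rng.uniform(-1, 1, 50)
x = np.linalg.solve(A, b)                          # x = A^{-1} b

lower_bound = np.linalg.norm(x) / np.linalg.norm(b)
true_value = np.linalg.norm(np.linalg.inv(A), 2)   # ||A^{-1}||_2, the largest singular value of A^{-1}
print(lower_bound, true_value)                     # lower_bound never exceeds true_value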
With these facts at hand, we turn our attention again to the ratio $\|x\|/\|b\|$. Since what we are looking for, $\sigma_1$ and $\sigma_n^{-1}$, are linked so intimately to the SVD of A, a promising place to start our analysis is to express A in terms of its SVD, and then expand b in terms of the basis $\{u_i\}$ or $\{v_i\}$ (it will become obvious which choice is suitable). Now, note that
$$Ax = U\Sigma V^T x = b \qquad (2.2)$$
$$x = V\Sigma^{-1} U^T b. \qquad (2.3)$$
Observe that the multiplication $U^T b$ yields a vector whose elements are formed by computing the inner products $u_i^T b$, which suggests that matters are simplified by expressing b in terms of the basis $\{u_i\}$, since $u_i^T u_j$ equals 1 if $i = j$, and 0 otherwise. So we let
$$b = \|b\| \sum_i \alpha_i u_i, \qquad \sum_i \alpha_i^2 = 1. \qquad (2.4)$$
Inserting this expression for b into equation 2.3, and letting $e_i$ denote the unit vector where the i'th component equals 1 and all other components equal 0, we obtain
$$x = V\Sigma^{-1}U^T \|b\| \sum_i \alpha_i u_i = \|b\|\, V\Sigma^{-1} \sum_i \alpha_i e_i = \|b\|\, V \sum_i \frac{\alpha_i}{\sigma_i} e_i = \|b\| \sum_i \frac{\alpha_i}{\sigma_i} v_i \qquad (2.5)$$
This expression is clearly maximized when all the weight is given to the term with the largest coefficient $1/\sigma_n$, i.e. when $\alpha_n = 1$ and $\alpha_i = 0$ for $i \neq n$. From equation 2.6 it seems plausible (assuming that the coefficients $\alpha_i$ are randomly chosen) that $\|x\|/\|b\|$ is of the order $\|A^{-1}\|_2$, unless we get unlucky and $\alpha_n$ is particularly small. Note also that the probability of the ratio indicating the right order of magnitude increases when $\sigma_n$ is very small relative to the other $\sigma_i$, which implies that A is ill-conditioned.
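A quick numerical check of this observation (our own sketch): if b is chosen to be exactly $u_n$, so that $\alpha_n = 1$ and all other $\alpha_i = 0$, then $\|x\|/\|b\|$ equals $1/\sigma_n = \|A^{-1}\|_2$.

import numpy as np

rng = np.random.default_rng(1)
A = rng.uniform(-1, 1, (30, 30))
U, s, Vt = np.linalg.svd(A)

b = U[:, -1]                     # b = u_n, the left singular vector of the smallest singular value
x = np.linalg.solve(A, b)
print(np.linalg.norm(x) / np.linalg.norm(b), 1.0 / s[-1])   # the two numbers agree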
The most important information we gather from equation 2.6, however, is that $\|x\|/\|b\|$ provides a good estimate when $u_n$ is well represented in the right-hand side b of our equation; that is, when $u_n$ has a coefficient that is large relative to the coefficients of all the other $u_i$. A natural next step, then, is to try to construct a b where $u_n$ is well represented.
Taking a look at x as expressed in equation 2.6, note that $v_n$ is likely to be well represented in x due to the presence of the $\sigma_i$ in the denominators, since $\sigma_n^{-1} > \sigma_i^{-1}$ for $i \neq n$. It is tempting to exploit this amplification of $v_n$ by somehow using x as a new right-hand side. The only complication is that it is $v_n$, not $u_n$, that is scaled up. Adjusting for this turns out to be quite simple, however. To see how, notice that if we were to carry out the above SVD analysis using $A^T$ instead of A (recall that both matrices have the same condition number), nothing would change fundamentally; the only difference would be that the roles of U and V (and thus also the roles of the vectors $u_i$
$$r(B, z_k) = \frac{z_k^T B z_k}{z_k^T z_k} \qquad (2.7)$$
The diagonal elements of $\Sigma^{-2}$ are $\sigma_i^{-2}$. We know from elementary linear algebra that if we can express a matrix B in the form $B = QDQ^T$ - where D is diagonal and Q is orthogonal - then the column vectors of Q are the eigenvectors of B, and the diagonal elements of D are the corresponding eigenvalues. Therefore, we can see in the above equations that $(A^TA)^{-1}$ and $(AA^T)^{-1}$ both have eigenvalues $\sigma_i^{-2}$, with eigenvectors $v_i$ and $u_i$ respectively.
We first use the same notation that we used earlier, with the vectors x, y and b, in order to make it clear that we end up with the same equations. Later we adopt new notation which is more convenient for expressing the entire algorithm. So, let b be the initial vector with which we start off the power iteration where we apply the matrix $(A^TA)^{-1}$. In the first step we solve the equation
$$y = (A^TA)^{-1} b = V\Sigma^{-2}V^T \cdot \sum_i \alpha_i v_i = \sum_i \frac{\alpha_i}{\sigma_i^2} v_i \qquad (2.11)$$
and
$$x = Ay = U\Sigma V^T \cdot \sum_i \frac{\alpha_i}{\sigma_i^2} v_i = \sum_i \frac{\alpha_i}{\sigma_i} u_i. \qquad (2.12)$$
In the expression for y, the presence of $\sigma_i^2$ in the denominators increases the probability that $v_n$ will dominate the expression; $v_n$ is also the eigenvector of $(A^TA)^{-1}$ corresponding to the eigenvalue $\sigma_n^{-2}$. Hence we are likely to obtain a good estimate of $\sigma_n^{-2}$ from the Rayleigh quotient.
Now, we could keep going to obtain a finer estimate. To this end, we will change notation and present the first few steps in order to make the pattern clear. Using the notation $Ay_k = y_{k-1}$ when k is even, and $A^T y_k = y_{k-1}$ when k is odd, we get
$$A^T y_1 = y_0$$
$$A y_2 = y_1$$
$$A^T y_3 = y_2$$
$$A y_4 = y_3$$
$$\vdots$$
$$A y_{2k} = y_{2k-1}$$
$$A^T y_{2k+1} = y_{2k}$$
The vector $y_k$ is then obtained from $y_{k-1}$ via the recurrence relation $y_k = A^{-T} y_{k-1}$ (with k referring to the iteration we are in, so starting at k = 1) when k is odd, and $y_k = A^{-1} y_{k-1}$ when k is even. To make clear which Rayleigh quotients to use for each $y_k$, we observe that
$$y_1 = A^{-T} y_0 = U\Sigma^{-1}V^T \cdot \sum_i \alpha_i v_i = \sum_i \frac{\alpha_i}{\sigma_i} u_i$$
$$y_2 = A^{-1} y_1 = V\Sigma^{-1}U^T \cdot \sum_i \frac{\alpha_i}{\sigma_i} u_i = \sum_i \frac{\alpha_i}{\sigma_i^2} v_i$$
$$y_3 = A^{-T} y_2 = U\Sigma^{-1}V^T \cdot \sum_i \frac{\alpha_i}{\sigma_i^2} v_i = \sum_i \frac{\alpha_i}{\sigma_i^3} u_i$$
$$\vdots$$
$$y_{2k} = \sum_i \frac{\alpha_i}{\sigma_i^{2k}} v_i$$
$$y_{2k+1} = \sum_i \frac{\alpha_i}{\sigma_i^{2k+1}} u_i$$
From this we gather the form of the Rayleigh quotients: for odd k we have $\|y_k\|^2 = y_{k-1}^T A^{-1}A^{-T} y_{k-1} = y_{k-1}^T (A^TA)^{-1} y_{k-1}$, so the Rayleigh quotient of $(A^TA)^{-1}$ at $y_{k-1}$ equals $\|y_k\|^2/\|y_{k-1}\|^2$, and analogously for even k with $(AA^T)^{-1}$. Hence the estimate of $\sigma_n^{-1}$ at step k is obtained by computing $\|y_k\|/\|y_{k-1}\|$.
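The whole procedure can be sketched in a few lines of NumPy (our own illustration of the iteration described above; in practice the solves would of course reuse a precomputed factorization of A rather than calls to a dense solver):

import numpy as np

def inverse_norm_estimate(A, y0, steps=4):
    """Alternate solves A^T y_k = y_{k-1} (k odd) and A y_k = y_{k-1} (k even);
    the ratio ||y_k|| / ||y_{k-1}|| converges to 1/sigma_n = ||A^{-1}||_2."""
    y_prev = y0
    estimate = None
    for k in range(1, steps + 1):
        M = A.T if k % 2 == 1 else A
        y = np.linalg.solve(M, y_prev)
        estimate = np.linalg.norm(y) / np.linalg.norm(y_prev)
        y_prev = y
    return estimate

rng = np.random.default_rng(2)
A = rng.uniform(-1, 1, (40, 40))
print(inverse_norm_estimate(A, rng.uniform(-1, 1, 40)),
      1.0 / np.linalg.svd(A, compute_uv=False)[-1])   # estimate vs exact value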
If we instead want to estimate $\sigma_1$, we start out the iteration with the equation $y_2 = A^TAy_0$. The arguments leading to the sequence of equations that follow from this are analogous to those in the previous derivation, and if we fill out the details we obtain the formulas
$$y_k = A^T y_{k-1}, \quad k \text{ even}$$
$$y_k = A y_{k-1}, \quad k \text{ odd}$$
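The analogous sketch for $\sigma_1$ simply replaces the solves with matrix-vector products (again our own illustration):

import numpy as np

def norm_estimate(A, y0, steps=4):
    """Alternate y_k = A y_{k-1} (k odd) and y_k = A^T y_{k-1} (k even);
    the ratio ||y_k|| / ||y_{k-1}|| converges to sigma_1 = ||A||_2."""
    y_prev = y0
    estimate = None
    for k in range(1, steps + 1):
        y = (A if k % 2 == 1 else A.T) @ y_prev
        estimate = np.linalg.norm(y) / np.linalg.norm(y_prev)
        y_prev = y
    return estimate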
see [3]. Keep in mind that this refers to the normal power iteration algorithm, as opposed to the algorithm we just derived, and the increase in accuracy with each iteration k that equation 2.17 indicates is, in our case, only obtained with every second iteration we perform.
It might be tempting to use the Rayleigh quotient at each step in order to make shifts that speed up the rate of convergence, as is done in Rayleigh quotient iteration. The problem with this approach is that it ruins the simplicity of the algorithm. In practice - as we will discuss later - a factorization of A into a QR or LU decomposition is computed as part of estimating the condition number, which reduces most of the computation down to solving triangular systems of equations. Shifting by a constant $\mu$ would result in the shifted matrix no longer being triangular, which means that the algorithm would require $O(n^3)$ operations, and here we are looking specifically for algorithms that require $O(n^2)$ operations. If accuracy were essential, however, it might be preferable to use this approach instead, given that Rayleigh quotient iteration is much faster; it triples the number of digits of accuracy with each iteration [3].
Although the rate of convergence is slow for power iteration, the power iteration algorithm (PIA) is sped up by the fact that the eigenvalues in our case are $\sigma_i^2$ or $1/\sigma_i^2$, which means that the inaccuracy of the estimate at each even-numbered step k is proportional to $(\sigma_2/\sigma_1)^{4k}$ or $(\sigma_n/\sigma_{n-1})^{4k}$ respectively. Hence, if we are estimating $\sigma_1$, then adding two steps of iteration reduces the error by a constant factor $\approx (\sigma_2/\sigma_1)^4$. Also, when A is ill-conditioned the ratio $\sigma_n/\sigma_{n-1}$ tends to be smaller, which further speeds up the rate of convergence. The fact that the convergence is slow for well-conditioned matrices is not a big issue. The worst thing that can happen if $\kappa(A)$ is small is that we underestimate it by a large factor (since we are computing a lower bound), but this is not catastrophic since the matrix is well conditioned anyway. It would be problematic, however, if there were a substantial risk of significantly underestimating $\kappa(A)$ in cases where it is very large, but the likelihood of this becomes smaller the more ill-conditioned A is, which is reassuring.
Chapter 3
Factorization of A and a strategic choice of b
Besides using factorizations to improve efficiency, we also have the opportunity to speed up the rate of convergence by making a good choice of initial vector $b = y_0$. We should expect an algorithm which does this to be very simple if it is going to be worthwhile, since it might otherwise be better to simply generate the components of b randomly and then add an extra iteration in the PIA. Given the emphasis on simplicity, an appealing idea is to compute the components of b successively as we solve $R^T x = b$. In each equation we solve for one component $x_i$, and in that equation only one component of b, namely $b_i$, is present. We then have the freedom to choose $b_i$ in such a way that it promotes growth in $\|x\|_2$. In order to keep the algorithm cheap, we restrict our choice of each
$b_i$ to two choices: +1 and -1. When solving the i'th equation, we then have $r_{ii}x_i = -r_{1i}x_1 - r_{2i}x_2 - \dots - r_{i-1,i}x_{i-1} \pm 1$. A simple way to make this choice is to compute $x_i$ for each choice, which we denote $x_i^+$ and $x_i^-$, and choose whichever sign maximizes $|x_i|$. In the following this strategy will be denoted as the local strategy, abbreviated to LS.
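A minimal sketch of this strategy in NumPy (our own illustration; it also includes the randomised ±θ variant described later in this chapter):

import numpy as np

def local_strategy_solve(R, randomized=False, rng=None):
    """Forward-substitute R^T x = b, choosing each b_i in {+1, -1}
    (or {+theta, -theta} with theta in (0.5, 1) for the randomized variant)
    so that |x_i| is as large as possible."""
    rng = rng or np.random.default_rng()
    RT = R.T                                 # lower triangular when R is upper triangular
    n = RT.shape[0]
    x = np.zeros(n)
    b = np.zeros(n)
    for i in range(n):
        s = RT[i, :i] @ x[:i]                # contribution of the already-fixed components
        mag = rng.uniform(0.5, 1.0) if randomized else 1.0
        b[i] = mag if s <= 0 else -mag       # the sign that maximizes |b_i - s|
        x[i] = (b[i] - s) / RT[i, i]
    return x, b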
A drawback of this strategy is that it is blind to the effect that a choice will have on subsequent equations. If we are dealing with randomly generated matrices, this is very unlikely to cause severe issues. As an example, if the elements of R are random numbers that are uniformly distributed over the interval (-1, 1), then the larger the modulus of each $x_k$, $1 \le k < j$, the greater the likelihood that $x_j$ will be large, since we see from
$$x_j = \sum_{k=1}^{j-1} \frac{-x_k}{b_j}\, r_{kj} + 1 \qquad (3.1)$$
that the first j - 1 terms of the sum are independent random numbers, each uniformly distributed on an interval $(-\tfrac{|x_k|}{|b_j|}, \tfrac{|x_k|}{|b_j|})$. The wider these intervals are, the more likely it is that the modulus of the sum is large, since its variance then increases and its probability density function becomes more spread out. This demonstrates that the strategy, given a random matrix, will be successful on average, and has virtually zero probability of generating a "worst case b", i.e. a b which is orthogonal to $v_n$.
The problem, however, is that in practice matrices are not generally random. Often they have a particular structure, and this structure could be such that our strategy fails completely. To demonstrate this, consider the following example [1]. Let
$$R^T = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ k & -k & 1 & 0 \\ -k & k & 0 & 1 \end{pmatrix} \qquad (3.2)$$
When computing $x_1$ and $x_2$, our strategy so far offers no way to determine the signs of $b_1$ and $b_2$. Let us arbitrarily decide that the default choice is the positive sign, in which case $x_1 = 1$ and $x_2 = 1$. But then it becomes immediately clear that k will not appear in equations 3 or 4 due to cancellation, and hence will have no influence on the estimate of the condition number. This is a problem, as we see when we consider how ill-conditioned R is when k is large: $\|R\|_\infty = \|R^T\|_\infty = 1 + 2k$.
The situation can be somewhat remedied by generating at each step a random number $\theta$ between 0.5 and 1, and choosing between $\pm\theta$ instead of $\pm 1$; we will refer to this randomised version of the local strategy as RLS. With this modification the probability of getting the kind of exact cancellation of terms seen in the above example becomes practically zero, though we can still run into situations where
3.2 A look ahead strategy for choosing b
$$\Bigl|\sum_{k=1}^{i-1} r_{ki}x_k + b_i\Bigr| + \sum_{j=i+1}^{n} \Bigl|\sum_{k=1}^{i} r_{kj}x_k\Bigr|. \qquad (3.3)$$
The first part of this expression ensures that the effect on $|r_{ii}x_i|$ is taken into account, and the second part of the expression takes into account the effect on the remaining n - i equations. With this strategy we will avoid the unfortunate scenario we encountered in our example, regardless of how we choose the signs of the k's in $R^T$, since - if k is large enough - the algorithm will choose the sign which does not lead to the cancellation of the terms involving k.
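A sketch of this look-ahead strategy (our own NumPy illustration of the criterion in equation 3.3; variable names are not from the article):

import numpy as np

def look_ahead_strategy_solve(R):
    """Forward-substitute R^T x = b, choosing b_i = +-1 by a look-ahead score:
    the effect on |r_ii x_i| plus the effect on the partial sums of the
    remaining n - i equations."""
    RT = R.T                                  # lower triangular
    n = RT.shape[0]
    x = np.zeros(n)
    b = np.zeros(n)
    p = np.zeros(n)                           # p[j] = sum_{k < i} r_{kj} x_k, updated as we go
    for i in range(n):
        best_score = -np.inf
        for cand in (1.0, -1.0):
            xi = (cand - p[i]) / RT[i, i]
            score = abs(RT[i, i] * xi) + np.sum(np.abs(p[i+1:] + RT[i+1:, i] * xi))
            if score > best_score:
                best_score, b[i], x[i] = score, cand, xi
        p[i+1:] += RT[i+1:, i] * x[i]         # fold x_i into the running sums of later rows
    return x, b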
This strategy is more costly than the first one. In the first strategy, the cost is approximately the same as when we solve the system for a given b; for each row we are adding one extra multiplication and addition to the workload when we compute $x_i^+$ and $x_i^-$, and for large n this cost is negligible compared to the cost of computing the first i - 1 terms in $\sum_{k=1}^{i-1} r_{ki}x_k$ for each $x_i$. So for large n, the cost of the first strategy is approximately $n^2$ flops, about the same as the cost of solving a triangular system of equations in the regular way.
So far we have been discussing how to choose b in a way that gives a good estimate of $\sigma_n^{-1}$, but the ideas that we have discussed apply also when computing $\|A\|_2$, since the success of the PIA relies on $u_1$ being well represented, which will be the case when x is large. The difference in how to apply the growth-promoting strategies in each case is a matter of details.
Let us now compute the cost of the new strategy. Consider the i'th iteration, where we compute $x_i$. The bulk of the workload lies in computing $\sum_{k=1}^{i} r_{kj}x_k$ for all $i < j \le n$, so we will do n - i such computations. But since $r_{1j}x_1 + r_{2j}x_2 + \dots + r_{i-1,j}x_{i-1}$ was computed in the previous iteration for each j, we only count the cost associated with adding the new term $r_{ij}x_i$ to the sum for each j. For each computed value $x_i^+$ and $x_i^-$ we will perform two operations (1 multiplication and 1 addition), so 4 operations in total for each j. And so, the number of flops in iteration i is $\sim 4(n - i)$. Summing together, we get
$$\sum_{i=1}^{n} 4(n - i) = 4\Bigl\{n^2 - \frac{n(n+1)}{2}\Bigr\} = 4\Bigl(n^2 - \frac{n^2}{2} - \frac{n}{2}\Bigr) = 2n^2 - 2n \qquad (3.4)$$
The factor by which the cost goes up with the "look ahead strategy", which in the following will be denoted LAS, can now be computed. Ignoring the first-order term, we get $\frac{2n^2}{n^2} = 2$, and so this strategy is roughly twice as expensive as the LS. If we are using the QR decomposition and perform two iterations to obtain an estimate, the overall increase in work is by a factor $\frac{2n^2 + n^2}{n^2 + n^2} = \frac{3n^2}{2n^2} = 1.5$. At this point we may note that the cost of choosing b by the LAS is about the same as the cost of an extra iteration, and so we may wonder if we gain more accuracy by using the cheaper RLS for choosing b, and then using the work we save in doing so to perform an extra iteration in the PIA. We return to this speculation in Chapter 4, where we test it numerically.
$$U^T z = b \qquad (3.5)$$
$$L^T x = z \qquad (3.6)$$
It is less obvious how to go about choosing b in this case than it was when we had a QR factorization, since we now have two factors instead of one. The overall objective is still the same; we are looking for a strategy for choosing the components of b in such a way as to promote growth in the size of x. The strategy suggested in the article is to choose b such that z becomes large in equation 3.5, and hope that this will result in a large x. However, as of yet we have no assurance that this will be the case. In the QR case this was not an issue, since the choices of $b_i$ were designed to ensure large components of x, or at least to avoid the worst-case scenario where b is orthogonal to $v_n$, or nearly so. In practice, the strategy of choosing b such that z becomes large was successful; in fact it was about as successful as in the QR case when tested on a large number of random matrices of various sizes.
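One plausible arrangement of this procedure in code (our own sketch, using SciPy's LU factorization with A = PLU; the article's own implementation details may differ):

import numpy as np
from scipy.linalg import lu, solve_triangular

def inverse_norm_estimate_lu(A, choose_b):
    """Estimate ||A^{-1}||_2 from an LU factorization: solve A^T x = b via the two
    triangular systems U^T z = b and L^T w = z (choosing b so that z grows),
    then take one further step A y = x and return ||y|| / ||x||."""
    P, L, U = lu(A)                            # A = P @ L @ U
    z, b = choose_b(U)                         # forward-substitute U^T z = b with a growth-promoting b
    w = solve_triangular(L.T, z, lower=False)  # L^T w = z
    x = P @ w                                  # undo the row permutation, so that A^T x = b
    y = np.linalg.solve(A, x)                  # the next (even) step of the iteration
    return np.linalg.norm(y) / np.linalg.norm(x)

# e.g. inverse_norm_estimate_lu(A, local_strategy_solve), reusing the LS sketch from earlier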
This raises the question of why it was so successful. We can get some intuition for this by considering the following: due to the way row pivoting is done in the factorization of A, the ill-conditioning of A tends to be mostly reflected in the matrix U [1], and therefore U tends to be more ill-conditioned than L. The fact that U is ill-conditioned helps us promote growth in the length of z in the first step, and the fact that L is well conditioned helps us keep the advantage we gained in the first step.
To illustrate this point, consider the case where L is well conditioned and $\sigma_1/\sigma_n \sim 1$, with $\sigma_1$ and $\sigma_n$ being the largest and smallest singular values of L respectively. In the extreme case where $\sigma_1/\sigma_n = 1$, the unit ball gets mapped to a hyperellipse in which all the semi-axes have the same length, and so the image of the unit ball under L is a scaled version of the unit ball. This means that, if $\sigma_1/\sigma_n = 1$, it does not matter which direction z has in terms of how its length will be altered when multiplied by $L^{-T}$, and so we maximize $\|x\|_2$ by choosing b such that $\|z\|_2$ is maximized. The more general point is that - given that U is more ill-conditioned than L - the first step is more important than the second step in determining the length of x, and therefore the strategy of picking b so as to maximize $\|z\|_2$ is expected to be successful on average.
are better off with an LU factorization even if our only goal is to estimate the condition number - given that the number of iterations is not too high - since the part of the process where we estimate $\kappa(A)$ is only $O(n^2)$, and thus for large n the overall cost associated with estimating $\kappa(A)$ will be dominated by the cost of the factorization.
Chapter 4
Numerical testing
There are many ways to obtain an estimate of $\kappa(A)$, depending on which norm we are using, the number of iterations in the PIA, and the strategy for choosing b. If accuracy is an important aspect, then we can compute a 2-norm estimate, since we have the tools to reach arbitrary precision in that case. If computational cost is the most important aspect, we can settle for a 1-norm estimate, since the cost of computing $\|A\|_1$ exactly is only $n^2$ flops. If we then use two iterations in the PIA, using RLS to choose b, this brings the total cost down to $n^2 + 2n^2 = 3n^2$, which is about as cheap as we can make it and still get a reliable order-of-magnitude estimate. To test the quality of this estimate, 500 random¹ 50 × 50 matrices were generated, and for each such matrix we computed the relative deviation $\frac{|\mathrm{estimate}(\kappa(A)_1) - \kappa(A)_1|}{\kappa(A)_1}$. The relative deviation was greater than 1 for 7 out of the 500 matrices generated, and the highest such value was 1.52. Hence this method is successful in terms of indicating the correct order of magnitude. Note that the accuracy we can get in the 1-norm estimate is limited by the fact that the vector pair y and x that maximizes $\|y\|_2/\|x\|_2$ need not be the same pair of vectors that maximizes $\|y\|_1/\|x\|_1$; hence we can't improve accuracy indefinitely by just adding more iterations the way we can with a 2-norm estimate. Also note that we got a relative deviation greater than 1, which might seem impossible since we should have $|\mathrm{estimate}(\kappa(A)_1) - \kappa(A)_1| \le \kappa(A)_1$ given that our estimate is a lower bound, but we must keep in mind that we are actually obtaining a lower bound on $\kappa(R)_1$, which may very well be larger than $\kappa(A)_1$; equality between these two holds only in the 2-norm.
¹ By random we mean that the components were uniformly distributed on the interval [-1, 1].
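The experiment is easy to reproduce in outline (our own simplified sketch: it uses a random ±1 right-hand side rather than the RLS choice, and QR rather than LU, but the structure of the estimator is the same):

import numpy as np

rng = np.random.default_rng(3)
worst = 0.0
for _ in range(500):
    A = rng.uniform(-1, 1, (50, 50))
    _, R = np.linalg.qr(A)
    b = rng.choice([-1.0, 1.0], 50)
    x = np.linalg.solve(R.T, b)                 # iteration 1: R^T x = b
    y = np.linalg.solve(R, x)                   # iteration 2: R y = x
    estimate = np.linalg.norm(A, 1) * np.linalg.norm(y, 1) / np.linalg.norm(x, 1)
    true_kappa = np.linalg.cond(A, 1)
    worst = max(worst, abs(estimate - true_kappa) / true_kappa)
print(worst)   # the largest relative deviation over the 500 trials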
Next we consider the two different strategies for estimating $\sigma_n^{-1}$ that were mentioned at the end of section 3.2: one strategy where we use LAS combined with 2 iterations, and one where we use RLS combined with 3 iterations. These are of interest given that both use approximately the same number of flops. We will refer to these strategies simply as the 2-iterations strategy and the 3-iterations strategy respectively. 3000 random matrices of dimension 20 × 20 were generated, and for each matrix we computed an estimate using the 2-iterations and the 3-iterations strategies. For 87% of the matrices, the 3-iterations strategy resulted in a better estimate, with the average of the ratio $\mathrm{estimate}(\sigma_n^{-1})/\sigma_n^{-1}$ - which we refer to as the success ratio - being 0.96, and the smallest such ratio being 0.22. For the 2-iterations strategy the corresponding values were 0.92 and 0.09 respectively, and similar results were obtained for matrices of various sizes. The two strategies were also tested on the matrix from the example in section 3.1 with k = 1000; the average success ratio was 0.99979 using the 3-iterations strategy, and 0.999996 using the 2-iterations strategy. We also computed 100 estimates for this matrix with the 3-iterations strategy to see how likely it was to generate a bad estimate (since we have added an element of randomness into the process), and out of the 100 runs the smallest success ratio was 0.9967; hence the modified version of LS seems to cope very well in this situation. Based on these results, it appears that out of the strategies for estimating $\sigma_n^{-1}$ that require around $3n^2$ flops, the best one is the 3-iterations strategy.
First we wanted to test how well the two strategies for picking the components of b perform now that we are using an LU factorization. To this end, we generated 4000 random matrices of dimension 40 × 40 and estimated $\sigma_n^{-1}$ for each using RLS and LAS, using 2 iterations of the PIA in each case. We obtained average success ratios of 0.87 and 0.89 respectively, with the smallest success ratios being 0.07 and 0.12 respectively, and with LAS being more successful 54% of the time; hence LAS performed slightly better. We also wanted to see how much we would gain if, instead of using 2 iterations and LAS, we used RLS and added an extra iteration, and so we tested using 3 iterations combined with RLS. The average success ratio was 0.96, which is a 7% increase in average accuracy compared to the strategy of performing 2 iterations combined with the LAS, and the cost of this improvement is an additional $n^2$ flops. A 7% increase in the average success ratio for an additional $n^2$ flops seems like a decent trade-off, given the fact that we are only increasing the amount of work by a factor of $6n^2/5n^2 = 6/5 = 1.2$, not including the factorization cost of course.
The exact same strategies were tested on their ability to estimate $\sigma_1$. When comparing RLS against LAS we found that LAS was more successful 75% of the time, and the average success ratios were 0.74 and 0.79 respectively, a larger performance difference than when we estimated $\sigma_n^{-1}$. Performing RLS
combined with 3 iterations gave an average success ratio of 0.82, which is only marginally better than the result we got when using 2 iterations combined with the LAS. In contrast to when we were estimating $\sigma_n^{-1}$, it seems that here the LAS is preferable to using RLS.
Finally, we generated 4000 matrices of dimension 40 × 40 and estimated $\kappa(A)_2$ and $\kappa(A)_1$ for each. For each A we used 3 iterations combined with RLS to estimate $\sigma_n^{-1}$, and 3 iterations combined with LAS to estimate $\sigma_1$, which adds up to $6n^2 + 7n^2 = 13n^2$ flops to estimate $\kappa(A)_2$. To estimate $\kappa(A)_1$ we used 3 iterations combined with RLS to estimate $\|A^{-1}\|_1$, which together with the computation of $\|A\|_1$ adds up to $6n^2 + n^2 = 7n^2$ flops. When estimating $\kappa(A)_2$ we got an average success ratio of 0.80, with the smallest success ratio being 0.11. When estimating $\kappa(A)_1$ we got an average success ratio of 0.44, the smallest success ratio was 0.06, and 0.43% of the estimates had a success ratio below 0.1. Although the average success ratio was much lower for the 1-norm estimate, the lowest success ratio was not that much lower, given the fact that it only uses $7n^2$ flops as opposed to the $13n^2$ flops used to estimate $\kappa(A)_2$. Hence the 1-norm does indeed seem to be a suitable choice when an order-of-magnitude estimate is all that is required.
Bibliography
[1] Alan K. Cline, Cleve B. Moler, George W. Stewart, and James H. Wilkinson. An estimate for the condition number of a matrix. SIAM Journal on Numerical Analysis, 16, 1979.
[2] Robert D. Skeel. Scaling for numerical stability in Gaussian elimination. Journal of the ACM, 26, 1979.
[3] Lloyd N. Trefethen and David Bau, III. Numerical Linear Algebra. SIAM, 1997.