Course notes
Dual Bachelor in Data Science and Engineering and Telecommunication Technologies Engineering
Bachelor in Data Science and Engineering

Index

0 Notation
5 Numerical interpolation
5.1 Polynomial interpolation
5.2 Piecewise linear interpolation
5.3 Piecewise cubic interpolation
5.4 Cubic piecewise interpolation with splines
5.5 The Newton form and the method of divided differences
5.5.1 Bases of polynomials and the interpolating polynomial
5.6 Interpolation errors
Bibliography
This course requires the use of the codes that can be found in the folder (or directory) ncm of the book [M04]. As indicated at the beginning of Chapter 1 of that book, in order to use these codes you should either open matlab in that directory or add the directory to the matlab path (for instance, with pathtool).
0 Notation
• The coordinates of a vector will be denoted using the same letter as for the vector, adding a subscript for the corresponding coordinate. For instance, x1 and xi are the first and the ith coordinate of the vector x, respectively.
• We use capital calligraphic letters for sets (like F for the floating point sys-
tem in Chapter 2).
1 Short introduction to matlab
In this chapter we show the basic commands that will be used by default in this
course, as well as the elementary syntax of matlab.
format short: Shows the floating point representation of the number with 4 decimal digits. It is the default format.
fzero(f,x0): Computes a root of the function f near the initial value x0.
help [command]: Displays an explanation of command. For instance: help sqrt.
f=inline(’function’): Creates a function, f, where ’function’ is the expres-
sion of the function which is defined, using a symbolic variable (for instance, x).
Example:
f=inline(’x^2+2*x+9’)
This allows us to evaluate f(a), where a is any complex number. It is also possible to create a function of several variables, f=inline('function','x1',...,'xn').
linspace(a,b,n): Generates a vector with n equispaced coordinates between a
and b.
lookfor [word]: Displays all files of the program where this word appears in the
description.
nnz(A): Number of nonzero entries of A.
norm(A), or norm(A,2): The 2-norm of the matrix A.
norm(A,1): The 1-norm of the matrix A.
norm(A,’fro’): The Frobenius norm of the matrix A.
norm(A,inf): The infinity norm of the matrix A.
[m,pos]=max(c): Provides the maximum entry of the vector c (namely, m) and the position of this entry (pos) among its coordinates (for complex vectors, the entry with largest modulus is returned).
ones(m,n): Generates the m×n matrix whose entries are all equal to 1.
pi: The number π.
rand(m,n): Generates a matrix of size m×n with random entries (uniform distri-
bution) within the interval [0, 1].
rank(a): Returns the (numerical) rank of a.
roots([an ... a1 a0]): Computes the roots of the polynomial a_n z^n + · · · + a_1 z + a_0.
size(A): Displays the size of the matrix A.
solve(eqn,var): Solves, when possible, the equation eqn in terms of the variable var, which must have been previously declared as symbolic with the command syms var. The equation must be inserted in the form f(x) == g(x).
• plot(x,y): If x,y are two vectors of the same dimension, this command
represents the points given by the corresponding coordinates, and joins each
point with the next one (in the order of the vector x) with a segment.
• loglog(x,y): Acts like the command plot, but using the logarithmic scale in both variables.
• semilogx(x,y): Acts like the command plot, but using the logarithmic scale
in the first variable.
• semilogy(x,y): Acts like the command plot, but using the logarithmic scale
in the second variable.
• scatter(x,y): Acts like the command plot, but plots the points with small
circles, without joining them.
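As a small illustration of these plotting commands (a minimal sketch; the function and the number of points are arbitrary choices), the following lines draw e^{-x} on [0, 10] in linear scale and in logarithmic scale for the second variable, where the exponential decay becomes a straight line:

% minimal plotting sketch (arbitrary example)
x = linspace(0,10,200);   % 200 equispaced points in [0,10]
y = exp(-x);
plot(x,y)                 % linear scale
figure
semilogy(x,y)             % logarithmic scale in the second variable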
1.2 Syntax
The (row) vectors in matlab are introduced with brackets “[, ]”, separating the coordinates with blank spaces or commas; different rows are separated with semicolons. For instance:
A=[1 2 3 4;5 6 7 8]
is the matrix
A = [ 1 2 3 4
      5 6 7 8 ].
The conjugate transpose of a matrix is obtained by adding an apostrophe (the single quote character, the one on the key of the question mark “?” on the keyboard). For instance, if A is the previous matrix, then
A' = [ 1 5
       2 6
       3 7
       4 8 ].
If A is a matrix, the command A(i,j) recovers the (i,j) entry of A. For instance, in the matrix A above, A(1,2) is equal to 2. If v is a vector, then v(i) is the ith coordinate of v.
The product of the matrices A and B is obtained with A*B.
It is also possible to obtain the inverse of the matrix A by means of either the
command inv(A) or A^(-1) (though they are different!).
Similarly, it is possible to multiply and to divide two vectors or matrices of the
same size entry-wise, by adding a dot, “.”, before the symbol of the operation. This
way, v.*w and v./w provide the vector obtained after multiplying and dividing
(respectively) the corresponding entries of the vectors v and w.
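A minimal sketch of the difference between the entry-wise operations and the matrix product (the vectors are an arbitrary choice):

% entry-wise operations vs. matrix product (arbitrary example)
v = [1 2 3]; w = [4 5 6];
v.*w      % entry-wise product: [4 10 18]
v./w      % entry-wise quotient: [0.25 0.4 0.5]
v*w'      % matrix product of 1x3 times 3x1: the scalar 32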
The notation a:b, where a and b are integers with a ≤ b, produces the ordered list of integers between a and b. If a and b are not integers, then a:b produces the ordered list of numbers starting at a and obtained by repeatedly adding 1, ending at the largest such number that is smaller than or equal to b. For instance: the command 2.8:5.4 produces the list 2.8, 3.8, 4.8.
Similarly, the command a:c:b produces the list of equispaced numbers between a and (the closest number which is smaller than or equal to) b, each one differing from the previous one by c (where c can be positive or negative).
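A minimal sketch of the colon notation (the particular ranges are arbitrary):

% colon notation (arbitrary examples)
2.8:5.4      % produces 2.8  3.8  4.8
1:2:10       % produces 1  3  5  7  9
10:-2:1      % produces 10  8  6  4  2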
We can extract submatrices from a matrix A with the colon notation. For instance, if
A = [ 1 2 3 4
      5 6 7 8
      9 10 11 12 ],
then
A(1:2,2:4) = [ 2 3 4
               6 7 8 ],
A([1,3],[2,4]) = [ 2 4
                   10 12 ].
In the previous two cases, if all rows are wanted (respectively, all columns), just type “:” in the place of the rows (respectively, the columns). For instance, A(2,:) is the whole second row of A.
It is also possible to join matrices together in order to create a larger matrix, provided that the dimensions are compatible. For instance, if A is an m × n matrix and B is another m × q matrix, [A B] is the matrix obtained by placing the columns of B after those of A. Analogously, if B is q × n, [A;B] is the matrix obtained by placing the rows of B below those of A.
It is possible to exchange rows and columns of a matrix by means of a “permu-
tation vector”, which is a vector with n coordinates, that are the natural numbers
from 1 to n in some order. In particular, if p is such a vector, then:
• A(p,:) reorders the rows of A according to the order of the natural numbers
in the vector p (in this case, n must be the number of rows of A).
• A(:,p) reorders the columns of A according to the order of the natural num-
bers in the vector p (in this case, n must be the number of columns of A).
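A minimal sketch of reordering with a permutation vector (the matrix and the permutation are arbitrary):

% reordering rows and columns with a permutation vector (arbitrary example)
A = [1 2 3; 4 5 6; 7 8 9];
p = [3 1 2];     % a permutation vector
A(p,:)           % rows of A in the order 3, 1, 2
A(:,p)           % columns of A in the order 3, 1, 2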
matlab functions are files with the extension “.m”. For instance, the function that computes the roots of a polynomial is the function roots.m. Everything you want to indicate in the file of a function, but is not part of the code (for instance, explanations), must be written in lines starting with the symbol %; the first lines of the code roots.m, for example, consist of such comment lines describing what the function does.
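As a hypothetical illustration of such a file (the function name and its contents are made up for this sketch), a file mysq.m could contain:

function y = mysq(x)
% MYSQ  Square of the input.
%   y = mysq(x) returns the entry-wise square of x.
%   These comment lines are not executed; they are what "help mysq" displays.
y = x.^2;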
2 Floating point arithmetic
In this chapter, we will focus on how to represent and operate with real numbers.
Computers are, of course, able to do the same with complex numbers. However,
the representation and the arithmetic of complex numbers are essentially based
on those of real numbers (just recall that a complex number can be represented as
a pair of real numbers), but their analysis would complicate unnecessarily all the
arguments and developments that we will carry out in this chapter.
Since every computer has a finite memory, it is only able to store a finite number
of real numbers. In this chapter we will present a summary of some of the basic
features of the storage and the arithmetic systems used by computers. For in-
stance, we will answer questions like: How many numbers can a computer store?
How big is the distance between two consecutive machine numbers? What
happens if we introduce in the computer a number that does not belong to its
system? How does a computer perform an arithmetic operation? etc.
Every nonzero number of the floating point system F is of the form ±(1 + f) · 2^e, where the mantissa f is of the form f = d1/2 + d2/2^2 + · · · + dt/2^t, and where:
• di is 0 or 1, for i = 1, . . . , t;
Proof: We will prove the statement only for positive numbers, since for negative ones it is analogous. From the representation of any number in F, given by (1 + f) · 2^e = 2^t(1 + f) · 2^{e−t}, we define m := 2^t(1 + f), which is a positive integer that can be bounded as:
2^t ≤ m = 2^t(1 + f) ≤ 2^t (1 + 1/2 + · · · + 1/2^t) = 2^{t+1} − 1,
so two consecutive numbers of F with the same exponent e are of the form m · 2^{e−t} and (m + 1) · 2^{e−t}, and their difference is
x_{n+1} − x_n = 2^{e−t}.
Moreover, and using again the representation (2.2), the largest number with exponent e is x_n = (2^{t+1} − 1) · 2^{e−t}, whereas the first (smallest) number with exponent e + 1 is x_{n+1} = 2^{e+1}. Therefore, the difference between these two consecutive numbers is also equal to
x_{n+1} − x_n = 2^{e+1} − (2^{t+1} − 1) · 2^{e−t} = 2^{e−t}.
Now, let us consider the relative distance. Assume that 2^e ≤ x_n < x_{n+1} ≤ 2^{e+1}. Then, the relative distance, s_r, between x_n and x_{n+1} satisfies
2^{−t−1} = 2^{e−t}/2^{e+1} ≤ s_r = (x_{n+1} − x_n)/x_n = 2^{e−t}/x_n ≤ 2^{e−t}/2^e = 2^{−t}.
Moreover, the maximum of this relative distance, 2^{−t}, is achieved for x_n = 2^e (the smallest number with exponent e), and it decreases up to its minimum value, 1/(2^{t+1} − 1), which is achieved for x_n = (2^{t+1} − 1) · 2^{e−t} (the largest number with exponent e). This fact is illustrated in Figure 2.2 for a system with precision t = 23.
Figure 2.2: Relative distance from x to the next machine number (t = 23), [H98, p. 40].
2.1.2 Roundoff
A relevant question that maybe you have already considered (even if you know the
answer) is the following: What happens if you introduce in the machine a number
that is within the range of the machine number system, but does not belong to it?
The answer is the expected one: the machine “rounds” the number to the closest
one. In the case of a “tie”, namely, if the number is at the same distance from two
consecutive machine numbers, then there are several ways to break this tie, but the
standard one is to choose the number having b52 = 0. Nonetheless, this fact has
no relevance at all in the contents of this course.
Anyway, the mathematical formulation of roundoff is the following. Let x ∈ R be a real number (not necessarily a machine number), and let us denote by fl(x) the machine number closest to x (read “float of x”). Then,
|fl(x) − x| / |fl(x)| ≤ eps/2 = u,
where u is the unit roundoff. In particular, we have the following result. From now on, the range of F is the interval [f_min, f_max], where f_min and f_max are, respectively, the smallest (negative) and the largest (positive) number of F.
Theorem 1 Let x ∈ R be a number within the range of F. Then
(a) fl(x) = x(1 + δ), with |δ| ≤ 2^{−t−1},
(b) fl(x) = x/(1 + α), with |α| ≤ 2^{−t−1}.
Proof: Without loss of generality, we assume x > 0. If x = m · 2^{e−t}, with 2^t ≤ m < 2^{t+1} and e_min ≤ e ≤ e_max, as in (2.2), then |fl(x) − x| ≤ 2^{e−t}/2, since this is half the distance between x and the next floating point number. Also, x, fl(x) ≥ 2^e. Now:
(a) Let δ = (fl(x) − x)/x; then |δ| ≤ (2^{e−t}/2)/2^e = 2^{−t−1}.
(b) Let α = (x − fl(x))/fl(x); then |α| ≤ (2^{e−t}/2)/2^e = 2^{−t−1}. □
As a conclusion, we have
The unit roundoff, u = 2−t−1 , of a system with precision
t is the largest relative distance between a number and its
representation in the machine number system.
Let x, y ∈ F. The standard model of floating point arithmetic assumes that each elementary operation ∗ ∈ {+, −, ×, ÷} is computed as
x ⊛ y = (x ∗ y)(1 + δ),  with |δ| ≤ u,
where ⊛ denotes the floating point counterpart of ∗ and u is the unit roundoff. Then:
• The model does not tell us which is exactly x ⊛ y. It only provides a bound on it with respect to the exact value x ∗ y. This allows us to deal easily with rounding errors, at the expense of working with unknown quantities (δ).
• Note that the model does not guarantee that some of the standard arithmetic laws (like the associative or distributive ones) are still true in floating point arithmetic. For instance, in general x ⊗ (y ⊕ z) ≠ (x ⊗ y) ⊕ (x ⊗ z), and x ⊗ (y ⊗ z) ≠ (x ⊗ y) ⊗ z.
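A minimal matlab sketch of the failure of the associative law in floating point arithmetic (the particular numbers are an arbitrary choice):

% associativity can fail in floating point arithmetic (arbitrary example)
a = 0.1; b = 0.2; c = 0.3;
(a + b) + c == a + (b + c)       % returns 0 (false): the two results differ
((a + b) + c) - (a + (b + c))    % a difference of the order of eps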
where in the last equality we have used that x, y are both positive. Equivalently
|fl(x + y) − (x + y)| / |x + y| ≤ 2u + u^2.   (2.5)
Equation (2.5) provides a bound on the relative error when computing the sum of two
positive numbers in floating point arithmetic.
system | precision (t) | e_min | e_max | eps (2^{−t}) | unit roundoff (2^{−t−1})
IEEE single | 23 | −126 | 127 | 1.2 · 10^{−7} | 6 · 10^{−8}
IEEE double | 52 | −1022 | 1023 | 2.2 · 10^{−16} | 1.1 · 10^{−16}
Table 2.1: Basic ingredients of the IEEE system in single and double precision
In this section we are going to analyze in more detail the IEEE system with double precision. This system stores each number of the system F in a “word” of 64 binary digits (bits):
More precisely, each word consists of 1 bit for the sign, 11 bits for the (shifted) exponent e + 1023, and 52 bits for the mantissa f.
Let us note that there are two extreme values of the exponent, namely e = −1023 and e = 1024, which are not indicated in Table 2.1. These exponents are used to store some special numbers, indicated in Table 2.2. These numbers are either too large or too small, and the machine treats them in an exceptional way (that is, it does not treat them as “machine numbers”). For instance, when e + 1023 = 0, the machine represents the number in the form f · 2^{−1022}, instead of (1 + f) · 2^{−1022}, which allows us to obtain numbers that are even smaller (known as subnormal numbers).
e | e + 1023 | f | type
−1023 | 0 | 0 | 0
−1023 | 0 | ≠ 0 | subnormal
1024 | 2047 | 0 | ±Inf
1024 | 2047 | ≠ 0 | NaN
Table 2.2: Exceptional numbers corresponding to the extreme values of the exponent.
mentioned, the machine is capable of recognizing some larger and smaller numbers. More precisely, the smallest positive number that the machine is able to recognize is the smallest subnormal number, namely 2^{−t} · 2^{e_min}, which in IEEE double precision is equal to 2^{−1022−52} = 4.9407 · 10^{−324}. Any smaller positive number will be treated as 0.
If, during any computation, a number that is not within the range of the com-
puter is obtained, then the underflow phenomenon is produced (if the number is
smaller than realmin) or overflow (when it is larger than realmax). In this last
case, the program matlab displays an error message, consisting of either Inf or
NaN (see Table 2.3). However, the underflow phenomenon does not produce any
error message, since any number smaller than realmin is rounded to 0.
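These quantities can be inspected directly in matlab; a minimal sketch:

% limits of IEEE double precision in matlab
eps              % machine epsilon, 2^(-52) = 2.2204e-16
realmin          % smallest normalized positive number, 2^(-1022)
realmax          % largest floating point number
realmin*eps      % smallest subnormal number, 2^(-1074) = 4.9407e-324
realmin*eps/2    % underflow: rounded to 0
2*realmax        % overflow: Inf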
matlab can also display numbers in hexadecimal format, which uses the sixteen characters
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f.
Each hexadecimal character represents an integer x between 0 and 15, which corresponds to four binary digits c3 c2 c1 c0 through
x = c0 · 2^0 + c1 · 2^1 + c2 · 2^2 + c3 · 2^3,  ci = 0, 1,  i = 0, 1, 2, 3.
Table 2.4 shows the correspondence between each hexadecimal character and its corresponding binary representation. By means of this equivalence, the IEEE format in double precision (binary) can be written compactly with hexadecimal characters.
In particular, the formula to go from one of the systems to the other one is:
mantissa f = w1/16 + w2/16^2 + · · · + w13/16^{13}.
There are several formats in matlab to see the hexadecimal representation (and to go back to the decimal one):
• format hex: Shifts from the decimal representation to the hexadecimal one.
HIGHLIGHTS:
A floating point system aims for the relative distance between two consecutive numbers to be essentially constant for all numbers in the system.
The relative distance between two consecutive numbers is bounded below by the unit roundoff, β^{−t}, and above by the machine epsilon, β^{1−t}.
The absolute separation increases as the absolute value of the numbers increases. It increases by a factor β as the exponent of the numbers increases by 1.
The floating point arithmetic model assumes that the relative error when rounding an elementary arithmetic operation of machine numbers is at most the unit roundoff of the computer.
3 Conditioning and stability
In this chapter we introduce the notions of conditioning (and condition number) of a problem and stability of a numerical algorithm. The first notion (conditioning) refers to how small changes in the data of a problem affect the solution of this problem and, as a consequence, it is an intrinsic notion of the problem, independent of the algorithm that is used to solve it. By contrast, the notion of stability refers to the algorithm that is used to solve a problem, and provides a measure of how “good” the solution provided by this algorithm is, namely: how far it is from being the exact solution.
3.1 Conditioning
When working with computers, we can not be sure that the original data are the
exact ones. When introducing our (hopefully exact) data in the computer, these
data are affected by rounding errors (see Section 2.1.2).
We can think of a problem as a function from a set of data (X ) to a set of
solutions (Y ), where each set is endowed with a norm.
Problem f
f : X −→ Y
x (data) 7→ f ( x ) (solution)
The absolute condition number of the problem f at the data x is defined as
κ̂_f(x) := lim_{δ→0} sup_{∥δx∥≤δ} ∥δf∥/∥δx∥,   (3.2)
with δf as in (3.1). Some observations are in order:
(a) The condition number κ̂_f(x) compares absolute changes in the solutions with absolute changes in the data.
(b) For δ small enough,
κ̂_f(x) ≈ sup_{∥δx∥≤δ} ∥δf∥/∥δx∥.
(c) If f is differentiable at x, then
κ̂_f(x) = ∥J_f(x)∥,
where J_f is the Jacobian matrix of f (namely, the linear operator associated to the differential).
If we look at relative variations of the data, and look, accordingly, for relative
variations in the solution, then we arrive at the following notion.
κ_f(x) := lim_{δ→0} sup_{∥δx∥/∥x∥≤δ} (∥δf∥/∥f∥) / (∥δx∥/∥x∥).   (3.3)
Remark 2 If f is differentiable at x, then
κ_f(x) = (∥x∥/∥f(x)∥) · ∥J_f(x)∥.
Remark 3 (Choice of the norm). Condition numbers depend on the norms in X and Y. This is not, in general, a problem, and it should be clear from the very beginning which is the norm we are using. In this course, both X and Y will be either F^n or F^{m×n} (the matrix space), with F = R or C, and the norms we use are the following:
• The infinity norm: ∥(x1, . . . , xn)^⊤∥_∞ = max{|x1|, . . . , |xn|}.
• The 1-norm: ∥(x1, . . . , xn)^⊤∥_1 = |x1| + · · · + |xn|.
• The 2-norm: ∥(x1, . . . , xn)^⊤∥_2 = √(|x1|^2 + · · · + |xn|^2).
In all cases, | x | denotes the absolute value (or the modulus) of the complex number x. The
matrix norms will be introduced in Chapter 4.
Remark 4 For any vector x ∈ C^n, the following relations between the norms introduced in Remark 3 hold:
∥x∥_2/√n ≤ ∥x∥_∞ ≤ ∥x∥_2,
∥x∥_2 ≤ ∥x∥_1 ≤ √n · ∥x∥_2,
∥x∥_1/n ≤ ∥x∥_∞ ≤ ∥x∥_1.
Let us illustrate the notions of absolute and relative condition number with the
following examples, corresponding to elementary arithmetic operations:
Example 2 Let f be the function (problem) assigning to each nonzero real value its inverse:
f : R \ {0} −→ R,  x ↦ f(x) = 1/x.
This function is differentiable, so
κ̂_f(x) = |f′(x)| = 1/|x|^2.
Note that when x → 0, the condition number κ̂_f(x) goes to infinity, so inverting a number close to zero is ill-conditioned in absolute terms.
However, in relative terms
κ_f(x) = (|x|/|f(x)|) · |f′(x)| = 1,
so the problem of inverting nonzero real numbers is well conditioned in relative terms.
Example 3 Let us consider the following particular values in Example 2. Set x = 10^{−6} and x̃ = 10^{−6} + 10^{−10}. Then the quotient between the absolute error in 1/x and the absolute error in x is:
|1/x − 1/x̃| / |x − x̃| = 1/(10^{−12} + 10^{−16}) ≈ 10^{12},
which is a big number. However, the relative error is
(|1/x − 1/x̃| ÷ |1/x|) / (|x − x̃|/|x|) = |x|/|x̃| ≈ 1.
Note that, in absolute terms,
1/x − 1/x̃ = 10^6 − 10^6/(1 + 10^{−4}) ≈ 10^2,
which is a big quantity, compared just with the original data, which are of order 10^{−6}. However, it is a small error compared to the magnitude of 1/x.
In the following, for the sake of simplicity, and since the function f is clear by
the context, we will replace the subscript f in the notation for condition numbers
by a reference to the norm. Then, for instance, κb∞ is the absolute condition number
in the ∞-norm, and κ1 is the relative condition number in the 1-norm.
Example 4 Let us now consider the function assigning to each pair of real numbers their difference:
f : R^2 −→ R,  (x, y)^⊤ ↦ f((x, y)^⊤) = x − y.
Again, this function is differentiable, and
κ̂_∞((x, y)^⊤) = ∥J_f((x, y)^⊤)∥_∞ = ∥[∂f/∂x  ∂f/∂y]∥_∞ = ∥[1  −1]∥_∞ = 2,
κ̂_1((x, y)^⊤) = 1.
However,
κ_1((x, y)^⊤) = (∥(x, y)^⊤∥_1/|x − y|) · ∥J_f((x, y)^⊤)∥_1 = (|x| + |y|)/|x − y|.
Therefore, κ_1((x, y)^⊤) can be arbitrarily large if x ≈ y and x, y are not close to zero.
Example 4 highlights the well-known fact that subtracting nearby numbers can be problematic. But note that this only gives problems in relative terms, since the absolute condition number is not big. Summarizing:
The problem of subtracting two numbers is relatively ill–conditioned when
these two numbers are close to each other (and not close to zero), though it is
absolutely well-conditioned. The effect produced by ill-conditioning when sub-
tracting two nearby numbers is known as cancellation.
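A minimal matlab sketch of cancellation (the numbers are an arbitrary choice): subtracting two nearby numbers wipes out the correct leading digits and leaves mostly rounding error.

% cancellation when subtracting nearby numbers (arbitrary example)
x = 1 + 1e-15;    % a number very close to 1
(x - 1)*1e15      % the exact value would be 1, but we get about 1.1102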
Exercise 1 Let x = 99, y = 100, x + δx = 99, y + δy = 101, and set f((x, y)^⊤) = x − y.
(a) Compute κ̂_1((x, y)^⊤) without using the Jacobian.
(b) Compute κ_1((x, y)^⊤) without using the Jacobian.
(c) Compute the absolute error:
∥f((x + δx, y + δy)^⊤) − f((x, y)^⊤)∥_1 / ∥(δx, δy)^⊤∥_1.
(d) Compute the relative error:
(∥f((x + δx, y + δy)^⊤) − f((x, y)^⊤)∥_1 / ∥f((x, y)^⊤)∥_1) / (∥(δx, δy)^⊤∥_1 / ∥(x, y)^⊤∥_1).
(e) Compare the results obtained in (c) and (d) in the light of (a) and (b).
In the previous examples we have included the absolute condition number just for the sake of comparison. However, it is not very illustrative, because it does not take into account the magnitude of the original data or the solution. In the following, we will only focus on the relative condition number. Nonetheless, the next example shows that this is not the whole story, and that the normwise relative condition number is not always the best way to get an idea of how sensitive the problem is to realistic perturbations.
where Erc ( x ) is some measure of the componentwise relative error in the data x. In
the case of Example 5, this componentwise relative error is:
E_rc := E_rc((x, y)^⊤) = max{|δx|/|x|, |δy|/|y|}.
Then, with some elementary manipulations, we get
δf((x, y)^⊤) = (x + δx)/(y + δy) − x/y = (x/y) · (δx/x − δy/y)/(1 + δy/y),
so that
|δf((x, y)^⊤)| ≤ |x/y| · 2E_rc/(1 − E_rc),
so
|δf((x, y)^⊤)| / |f((x, y)^⊤)| ≤ 2 · E_rc/(1 − E_rc) ≈ 2E_rc.
As a consequence, κ_f^{comp}((x, y)^⊤) = 2, which means that the problem of dividing two real numbers x, y, with y ≠ 0, is well-conditioned componentwise.
Though componentwise conditioning is relevant in numerical linear algebra, we
will not pay much attention to it, due to time restrictions.
which is equivalent to
p(r + δr) + δa_j (r + δr)^j = 0.   (3.6)
Expanding in Taylor series, we have
p(r + δr) = p(r) + δr · p′(r) + O((δr)^2) = δr · p′(r) + O((δr)^2),
since p(r) = 0. Replacing in (3.6) we get
δr · p′(r) + O((δr)^2) + r^j δa_j + δa_j · O(δr) = 0.
Now, ignoring second order terms, we obtain δr · p′(r) + r^j δa_j = 0, from which
|δr| = |r^j| · |δa_j| / |p′(r)|,
and, from this, (3.5) is immediate.
As a consequence of (3.5), the condition number of a multiple root is infinite. The
condition number can be also quite large for simple roots.
Let us check the formula obtained in (3.5) for quadratic polynomials p(x) = a + bx + x^2. We know that the roots in this case are given by
r1 = (−b + √(b^2 − 4a))/2,   r2 = (−b − √(b^2 − 4a))/2.
Since these are differentiable functions of a and b, we can compute the condition number
using Remark 2. Let us focus on r1 (the developments for r2 are similar). First, for the
coefficient a, we have
dr1/da = −1/√(b^2 − 4a),
so the formula in Remark 2 gives
(∥a∥/∥r1(a)∥) · ∥J_{r1}(a)∥ = |a| / (|r1| · √(b^2 − 4a)).
On the other hand, since p′(x) = b + 2x, then p′(r1) = √(b^2 − 4a). Therefore, the right-hand side in (3.5) reads |a r1^{−1}|/√(b^2 − 4a), which coincides with the previous expression.
As for the coefficient b, we proceed in the same way:
dr1/db = (1/2)(−1 + b/√(b^2 − 4a)) = −r1/√(b^2 − 4a) = −r1/p′(r1),
so the formula in Remark 2 reads
(∥b∥/∥r1(b)∥) · ∥J_{r1}(b)∥ = (|b| · |r1|)/(|r1| · |p′(r1)|) = |b|/|p′(r1)|,
which coincides, again, with the right-hand side of (3.5).
Table 3.1: Ratio (|δr|/|r|) / (|δa2|/|a2|) for p(x) = (x − 1)^3, δa2 = 10^{−n} (r = 1).
Problem f
f : X −→ Y
x (data) 7→ f ( x ) (exact solution)
Algorithm fe
fe : X −→ Y
x (data) 7→ fe( x ) (computed solution)
Definition 3 The algorithm f̃ for the problem f is accurate if, for any x ∈ X,
∥f̃(x) − f(x)∥ / ∥f(x)∥ = O(u).   (3.8)
Definition 4 The algorithm f̃ for the problem f is stable if for any x ∈ X there is some x̃ ∈ X satisfying
∥f̃(x) − f(x̃)∥/∥f(x̃)∥ = O(u),  and  ∥x − x̃∥/∥x∥ = O(u).
Definition 5 The algorithm f̃ for the problem f is backward stable if for any x ∈ X there is some x̃ ∈ X satisfying
f̃(x) = f(x̃),  and  ∥x − x̃∥/∥x∥ = O(u).
In other words, a backward stable algorithm provides the exact solution for nearby data. This is the most ambitious requirement for an algorithm, since data are always subject to rounding errors.
Definitions 3, 4, and 5 depend on two objects: (a) the norm in the sets of data and solutions, and (b) the quantity C such that O(u) ≤ Cu when u approaches 0. Regarding the norm, in all problems of this course, X and Y are finite dimensional vector spaces. Since all norms in finite dimensional spaces are equivalent, the notions of accuracy, stability, and backward stability do not depend on the norm. As for the quantity C, it must be independent of u, but it will usually depend on the dimension of the spaces X, Y. However, in order for the quantity O(u) not to be too large, this dependence is expected to be, at most, polynomial.
A natural question regarding Definitions 3, 4, and 5 is about the relationships between them. For instance: is an accurate algorithm necessarily backward stable, or vice versa? It is immediate that a backward stable algorithm is stable, and that an accurate algorithm is stable. However, it is not so easy to see whether the other implications are true or not. Actually, none of them is true. In particular, an accurate algorithm is not necessarily backward stable.
Example 9 Equation (2.5) tells us that the sum of two real numbers of the same sign in
floating point arithmetic is an accurate algorithm, and (2.4) implies that it is backward
stable as well.
In terms of the goal mentioned at the beginning of this section, one may wonder
which is the connection between the stability (or the backward stability) and the
relative error (3.7). The following result provides an answer to this question.
Theorem 2 Let f̃ be a stable algorithm for a problem f whose relative condition number is κ. Then, for any x ∈ X,
∥f̃(x) − f(x)∥ / ∥f(x)∥ = O(u κ(x)).
The proof of Theorem 2 for f̃ being backward stable is immediate (see [TB97, Th. 15.1]). For f̃ being just stable the proof is more involved, and it is out of the scope of this course.
What Theorem 2 tells us is that when using a stable algorithm for solving a
problem, the forward error of the algorithm depends on the conditioning. In
particular, this can be summarized as:
This is another way to express the general rule (see [H98, p. 9]): forward error ≲ condition number × backward error.
HIGHLIGHTS:
4 Solution of linear systems of equations
In this chapter we are going to study the standard method for solving systems
of linear equations, which is known as Gauss method (or Gaussian elimination) with
partial pivoting. In particular, we will see that this method is associated with a
decomposition of the coefficient matrix of the system, which is known as LU
factorization. For this, we will review some notions from the subject Linear Algebra.
Finally, we will analyze the errors committed by the previous method, by means
of the study of the sensitivity and the condition numbers.
Throughout this chapter, the system of linear equations (SLE) will be repre-
sented in matrix form as:
Ax = b, A ∈ Cm × n , b ∈ Cm , (4.1)
where x ∈ Cn is the unknown. The matrix A in (4.1) is the coefficient matrix of the
system, and the vector b is the right-hand side of the system. During all this chapter
we will assume that the matrix A is square, namely, m = n in (4.1). Moreover, we
will assume that the matrix is invertible, in order to guarantee that it has a unique
solution.
We will start examining the solution of triangular systems in Section 4.1 but,
before, we will introduce some notation and basic definitions from matrix theory.
We denote by I_n the n × n identity matrix, that is, the matrix with 1's on the main diagonal and 0's elsewhere.
The main diagonal of a matrix A consists of the entries A(i, i ) (namely, the entries
for which both indices coincide).
A triangular matrix is a matrix in which either all entries above (lower triangular) or below (upper triangular) the main diagonal are zero. Mathematically, A = [a_ij] is lower triangular if a_ij = 0 whenever j > i, and upper triangular if a_ij = 0 whenever i > j. Note that the adjective “lower” or “upper” refers to the “relevant” part of the matrix (namely, the one that is not necessarily zero).
A diagonal matrix is a matrix which is both lower and upper triangular. In a
diagonal matrix all entries outside the main diagonal are zero.
A permutation matrix is a square matrix whose entries are all 0 or 1, in such a way that there is exactly one entry equal to 1 in each row and each column. The name of these matrices comes from the fact that multiplying another matrix A by a permutation matrix P produces a permutation of the rows (when multiplying on the left, namely PA) or the columns (when multiplying on the right, namely AP) of A. For instance, set
P = [ 0 1 0
      0 0 1
      1 0 0 ],
A = [ 1 2 3
      4 5 6
      7 8 9 ].
You can see that PA is obtained from A permuting the rows, whereas AP is ob-
tained from A by a permutation of the columns. Moreover, this permutation is
encoded in the position of the entries equal to 1 in the matrix P. In fact, a permu-
tation matrix P is obtained by permuting the rows (or the columns) of the identity
matrix, and this permutation of the identity matrix that leads to the matrix P is,
precisely, the same one that P produces over A when multiplying A and P.
Some properties of permutation matrices that will be used in this course (some-
times without any explicit mention to them) are:
In Section 4.3 we will use matrix norms. A matrix norm indicates how the norm of a vector can change when it is multiplied by the matrix. More precisely, if ∥ · ∥_p is any of the norms introduced in Remark 3 (namely, p can be 1, 2, or ∞), then the pth matrix norm of A ∈ C^{n×n} is
∥A∥_p = max_{0≠x∈C^n} ∥Ax∥_p / ∥x∥_p.   (4.2)
In words, the norm ∥ A∥ p indicates how much the norm ∥ Ax∥ p increases with
respect to ∥x∥ p (a nice geometric interpretation of matrix norms can be found at
[TB97, Lecture 3]).
The following result provides explicit formulas for the matrix norms associated to the vector norms ∥ · ∥_p, for p = 1, 2, ∞. The proof can be found on pages 20–21, for p = 1, ∞, and on page 34, for p = 2, of [TB97].
Lemma 2 Let A ∈ C^{n×n}, and denote by Col_i(A) and Row_i(A) the ith column and the ith row of A, respectively. Then:
∥A∥_1 = max_{1≤i≤n} ∥Col_i(A)∥_1,   ∥A∥_∞ = max_{1≤i≤n} ∥Row_i(A)∥_1,
and ∥A∥_2 is the largest singular value of A.
The basic properties of matrix norms that will be used in this course are given
in the following result, whose proof is straightforward.
1. ∥ Ax∥ p ≤ ∥ A∥ p · ∥x∥ p ,
2. ∥ AB∥ p ≤ ∥ A∥ p · ∥ B∥ p ,
3. ∥ In ∥ p = 1.
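A minimal sketch checking the column-sum and row-sum formulas of Lemma 2 in matlab (the matrix is an arbitrary choice):

% matrix norms and the formulas of Lemma 2 (arbitrary example)
A = [1 -2; 3 4];
norm(A,1)              % 6: largest 1-norm of a column (|-2| + |4|)
max(sum(abs(A),1))     % the same value computed by hand
norm(A,inf)            % 7: largest 1-norm of a row (|3| + |4|)
max(sum(abs(A),2))     % the same value computed by hand
norm(A,2)              % the 2-norm (largest singular value)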
Exercise 3 Prove that ∥uv∗ ∥2 = ∥u∥2 ∥v∥2 , for any two vectors u, v ∈ Cn .
Definition 6 (Condition number of a matrix). Let A ∈ C^{n×n} be invertible. Its condition number in the p-norm is κ_p(A) = ∥A∥_p · ∥A^{−1}∥_p.
We can solve for the unknown xn in the last equation and replace the obtained
value in the remaining equations, then solve for xn−1 in the last-but-one equation
and replace again the obtained value in the remaining equations, and so on. This
procedure is known as backward substitution, and provides the following expression
for the unknowns:
x_n = b_n / u_{nn},
x_{n−1} = (b_{n−1} − u_{n−1,n} x_n) / u_{n−1,n−1},
  ⋮
x_k = (b_k − u_{k,k+1} x_{k+1} − · · · − u_{kn} x_n) / u_{kk},   (4.4)
  ⋮
x_1 = (b_1 − u_{12} x_2 − · · · − u_{1n} x_n) / u_{11}.
Remember that U is invertible, which means that u11 · · · unn ̸= 0, so the previous
expressions make sense.
The expression (4.4) can be easily implemented in matlab with a for loop:
function x = triusol(U,b)
% Solution of Ux=b, with U upper triangular nxn, by backward substitution (4.4)
n = length(b);
x = zeros(n,1);
for k = n:-1:1
    j = k+1:n;
    x(k) = (b(k) - U(k,j)*x(j))/U(k,k);
end
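A minimal usage sketch (random data, only to check that the residual is of the order of the unit roundoff):

% usage of triusol (arbitrary, well-conditioned example)
U = triu(rand(5)) + 5*eye(5);   % upper triangular with nonzero diagonal
b = rand(5,1);
x = triusol(U,b);
norm(U*x - b)                   % should be of the order of eps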
When the coefficient matrix is lower triangular, a similar method is derived, and
it is known as forward substitution.
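A forward substitution code can be written along the same lines as triusol above (a sketch; the name trilsol is made up for this illustration):

function x = trilsol(L,b)
% Solution of Lx=b, with L lower triangular nxn, by forward substitution
n = length(b);
x = zeros(n,1);
for k = 1:n
    j = 1:k-1;
    x(k) = (b(k) - L(k,j)*x(j))/L(k,k);
end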
OP1 (Exchange). Exchange two rows. We denote the operation that exchanges the ith and jth rows by R_ij.
Step 1: Select, in the leftmost column, a nonzero entry (pivot) and exchange the row containing this entry (OP1) to take it to the topmost position.
Step 2: Make zeroes below the pivot by adding to each row below it a suitable multiple of the pivot row; we denote by R_ij(c) the operation that adds c times the jth row to the ith row.
Step 3: Repeat steps 1 and 2 in the submatrix obtained from the previous one by removing the row and column containing the pivot.
As an illustration of the Gauss method you can look at the example in Section 2.3
in [M04]. Nonetheless, we are going to show another example, with an explanation
that will allow us to relate this method with the LU factorization:
Let us apply Gauss elimination to the following matrix:
A = [ 2 −4 4 −2
      6 −9 7 −3
      −1 −4 8 0 ].   (4.5)
Step 1: We choose the (1,1) entry, 2, as the pivot, so that no row exchange is needed.
Step 2: Now, we make zeroes below the pivot with the operations R21(−3) and R31(1/2):
[ 2 −4 4 −2
  0 3 −5 3
  0 −6 10 −1 ].
Now we iterate the previous steps over the submatrix A(2:3, 2:4):
Step 1: We choose the (2,2) entry as a pivot to avoid, again, a row exchange.
Step 2: We make zeroes below the pivot with the operation R32(2):
[ 2 −4 4 −2
  0 3 −5 3
  0 0 0 5 ].
With this, we have arrived at a matrix in echelon form, which is upper triangular (we denote it by U):
U := [ 2 −4 4 −2
       0 3 −5 3
       0 0 0 5 ].
Now, we gather together all the elementary row operations that we have applied, and we construct a lower triangular matrix with 1's in the main diagonal which, in each column below the diagonal, contains the factors that we have used in Step 2 at the corresponding iteration, with the opposite sign (we denote this matrix by L):
L := [ 1 0 0
       3 1 0
       −1/2 −2 1 ].
Finally, we note that the original matrix, A, is the product of L and U, namely:
A = LU = [ 1 0 0 ; 3 1 0 ; −1/2 −2 1 ] · [ 2 −4 4 −2 ; 0 3 −5 3 ; 0 0 0 5 ].
This is, precisely, the LU factorization of A. In the next subsection we show the
general method to obtain the LU factorization, and we analyze some of its basic
features.
where a11 is the (1,1) entry of A, A12 is a row vector of size 1 × (n − 1), A21 is a column vector of size (m − 1) × 1 and A22 is a matrix of size (m − 1) × (n − 1). If a11 ≠ 0, we can express the previous decomposition by blocks as follows:
A = [ a11 A12 ; A21 A22 ] = [ 1 0 ; (1/a11)A21 I_{m−1} ] · [ a11 A12 ; 0 A22 − (1/a11)A21A12 ].
We set
A^(0) := A ∈ C^{m×n},   A^(1) := A22 − (1/a11) A21 A12 ∈ C^{(m−1)×(n−1)}.
Now, let us assume that we know the LU factorization of the matrix A^(1), say A^(1) = L1 U1. Then, the previous factorization of A allows us to write:
A = [ 1 0 ; (1/a11)A21 L1 ] · [ a11 A12 ; 0 U1 ] =: LU,   (4.6)
• L is the lower triangular part of A(d−1) (without the main diagonal), adding
1’s in the main diagonal, and padding up with 0’s below the diagonal entries
in the columns n + 1 : m for m > n.
Example 10 The matrices A^(k) obtained in the previous process for the matrix A in (4.5) are the following (the multipliers are stored below the diagonal):
A^(0) = A,
A^(1) = [ 2 −4 4 −2 ; 3 3 −5 3 ; −1/2 −6 10 −1 ],
A^(2) = [ 2 −4 4 −2 ; 3 3 −5 3 ; −1/2 −2 0 5 ].
From the last matrix we read off
L = [ 1 0 0 ; 3 1 0 ; −1/2 −2 1 ],   U = [ 2 −4 4 −2 ; 0 3 −5 3 ; 0 0 0 5 ],
which coincide with the factors obtained before. The procedure can be implemented in matlab as follows:
function [L,U] = lu(A)
% Computes the L and U matrices of an LU factorization of A (no pivoting)
n = size(A,1);
for k = 1:n-1
    A(k+1:n,k) = A(k+1:n,k)/A(k,k);
    A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k)*A(k,k+1:n);
end
L = eye(n) + tril(A,-1);
U = triu(A);
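A minimal usage sketch with an arbitrary square matrix. Note that saving the file above as lu.m shadows matlab's built-in lu; if the built-in is called instead, the check A = LU below still holds, since with two output arguments the built-in also returns factors whose product is A.

% usage sketch (arbitrary example with nonzero pivots)
A = [2 1 1; 4 3 3; 8 7 9];
[L,U] = lu(A);
norm(A - L*U)     % should be 0, or of the order of eps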
Theorem 3 Let A ∈ F^{n×n}, with F = R or C. Then there exist a permutation matrix P, a lower triangular matrix L with 1's on the main diagonal, and an upper triangular matrix U, all of them n × n, such that
PA = LU.   (4.7)
Remark 6 Theorem 3 is stated for F = R or C, which means that, if the original matrix
is real, then there is a factorization (4.7) where all three matrices P, L, U are real.
Even in those cases where the diagonal entry a11^{(k)} is nonzero, it is advisable, for
stability reasons, to exchange rows to get an appropriate pivot. This technique of
exchanging rows to get an appropriate pivot is known as pivoting. There are two
basic pivoting strategies:
(a) Partial pivoting: The chosen pivot (that is taken to the (1,1) position of the matrix A^(k) by row exchange) is the entry with largest modulus within the first column of A^(k). Mathematically, this process is carried out as follows. Since we have already carried out the first k steps of the LU factorization, we can write
A = [ L11 0 ; L21 I ] · [ U11 U12 ; 0 A^(k) ].
Then, the row exchange only affects the last n − k rows, namely:
[ I 0 ; 0 P^(k) ] · A = [ L11 0 ; P^(k)L21 P^(k) ] · [ U11 U12 ; 0 A^(k) ]
                      = [ L11 0 ; P^(k)L21 I ] · [ U11 U12 ; 0 P^(k)A^(k) ],
where P^(k) is the permutation matrix that exchanges the corresponding rows according to the partial pivoting criterion.
(b) Total pivoting: The chosen pivot is the entry with largest modulus of A^(k). This pivoting strategy requires exchanging rows and columns, which leads to a factorization of the form PLUQ, with Q being another permutation matrix, and will not be considered in this course.
The natural question is: what is the reason for exchanging rows even when the
(1, 1) entry of A(k) is nonzero? Or, more precisely, what is the advantage of partial
pivoting? To answer this, let us consider the following example (another one can
be found in [M04, §2.6]).
Example 11 Let
A = [ 10^{−20} 1 ; 1 1 ].
The LU factorization of A is
L = [ 1 0 ; 10^{20} 1 ],   U = [ 10^{−20} 1 ; 0 1 − 10^{20} ].
However, if we apply the previous LU algorithm, without row exchange (since, in princi-
ple, it is not necessary because the diagonal entries are nonzero), in matlab (namely, in
floating point arithmetic), we get
L̂ = [ 1 0 ; fl(10^{20}) 1 ] = [ 1 0 ; 10^{20} 1 ],
Û = [ 10^{−20} 1 ; 0 fl(1 − 10^{20}) ] = [ 10^{−20} 1 ; 0 −10^{20} ].
It can be checked in matlab that fl(10^{20}) = 10^{20} and fl(1 − 10^{20}) = −10^{20}. Therefore, we get
L̂Û = [ 10^{−20} 1 ; 1 0 ].
This matrix should be the original matrix A. However, its (2,2) entry is equal to 0, which has nothing to do with the (2,2) entry of A, which is equal to 1.
Example 11 shows that roundoff errors can produce a large error in the LU
factorization. If we analyze the procedure that we have followed in this exam-
ple, we will realize that the problem comes from approximating fl(1 − 1020 ) ≡
fl( a22 − 1020 ) = −1020 , and it is produced because there are some entries of L or
U which are too large compared to the entries of A.
The pivoting strategies aim to solve this problem. In particular, the partial piv-
oting strategy guarantees that the entries of L always have modulus at most 1.
Once the factorization (4.7), PA = LU, of the coefficient matrix is available, the SLE (4.1) can be solved in the following four steps:
1. Compute the factorization PA = LU.
2. Multiply Pb.
3. Solve the lower triangular system Ly = Pb (forward substitution).
4. Solve the upper triangular system Ux = y (backward substitution).
As can be seen, once we have computed the factorization (4.7) of the coefficient
matrix A of the system (4.1), the solution of the system reduces to solving two
triangular systems, whose cost is notably smaller than the one for solving a general
system. We will see more on the computational cost in Section 4.2.4.
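In matlab, once the factorization (4.7) is available, the four steps above can be carried out as in the following minimal sketch, using the built-in lu and the backslash operator for the two triangular solves (the data are an arbitrary example):

% solving Ax=b through PA=LU (arbitrary example)
A = rand(4); b = rand(4,1);
[L,U,P] = lu(A);    % Step 1: PA = LU
y = L\(P*b);        % Steps 2-3: permute b and solve the lower triangular system
x = U\y;            % Step 4: solve the upper triangular system
norm(A*x - b)       % residual, of the order of eps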
Example 11 (Continued). Assume that we want to solve the SLE Ax = b, where b = [ 1 ; 0 ]. The exact solution is
x = 1/(1 − 10^{−20}) · [ −1 ; 1 ].   (4.8)
However, if we follow the procedure described above with the computed factors L̂, Û, when solving the lower triangular system we get
L̂ŷ = [ 1 ; 0 ]  ⇒  ŷ = [ 1 ; −10^{20} ],
and then, solving the upper triangular system,
Ûx̂ = ŷ  ⇒  x̂ = [ 0 ; 1 ].   (4.9)
If we compare x and x̂ in (4.8) and (4.9) we will immediately realize that there is a large error, up to the point that the computed solution has nothing to do with the exact one (they are in completely different directions!).
• [L,U,p] = lutx(A): Code from the package ncm [M04] that computes an LU factorization of A with partial pivoting (the row exchanges are encoded in the permutation vector p).
• [L,U,P] = lu(A): Computes the matrices L, U, and P of the factorization (4.7) of A. It is the built-in (professional) matlab version of the previous code.
• x = bslashtx(A,b): This code is also part of the package ncm. It solves the system Ax=b using the LU factorization computed with lutx and then solving the two associated triangular systems, as explained in Section 4.2.2. More precisely, it makes use of the following codes for solving the triangular systems:
– x = forward(L,b): Solves a lower triangular system Lx=b by forward substitution.
– x = backsubs(U,b): Solves an upper triangular system Ux=b by backward substitution.
Now we estimate the cost of the LU algorithm introduced at the end of Section 4.2.1. The first line of the algorithm at the ith step involves n − i divisions. The second line involves (n − i)^2 products and (n − i)^2 subtractions. Then, the overall cost of the algorithm is approximately (2/3)n^3 operations, that is, O(n^3) (see the derivation sketched below).
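A short derivation of this total, under the per-step costs just stated (a sketch in LaTeX):

\[
\sum_{i=1}^{n-1}\Bigl[(n-i) + 2(n-i)^2\Bigr]
 = \sum_{j=1}^{n-1} j + 2\sum_{j=1}^{n-1} j^2
 = \frac{n(n-1)}{2} + \frac{(n-1)n(2n-1)}{3}
 = \frac{2}{3}\,n^3 + O(n^2).
\]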
Finally, let us analyze the cost of the algorithm in the box in Section 4.2.2. This
algorithm consists of four steps. The first one requires the computation of the LU
decomposition of A (recall that the permutations are not counted) that we have
just estimated. The second one is, simply, a permutation of the coordinates of the
vector b, and does not count. Step 3 requires solving a triangular system with 1’s
in the main diagonal, so there are no divisions involved. In Step 4 we have to
solve another triangular system. Adding up the computational cost obtained in
the previous sections, we arrive at a total of approximately (2/3)n^3 + 2n^2 operations for solving the SLE (4.1), that is, O(n^3).
In particular, solving several systems with the same coefficient matrix, say r systems, with LU has a computational cost of O(n^3) + O(rn^2), whereas doing it without reusing the LU factorization has a cost of O(rn^3).
This situation, in which one wants to solve a lot of systems with the same coef-
ficient matrix, appears in many problems of engineering, for instance in systems
where the right-hand side varies with time, but the rest of the system remains
constant.
In order to measure the errors we are going to use the vector norms ∥ · ∥ p ,
with p = 1, 2, ∞, introduced in Remark 3, as well as the induced matrix norms,
introduced at the beginning of this chapter (see Lemma 2). If x is the exact solution of the SLE (4.1), then ∥Ax − b∥_p = 0. It is expected, then, that if x̂ is a good approximation to the solution, then ∥Ax̂ − b∥_p must be small. Then, a natural way to estimate the error committed when solving the SLE (4.1) consists in evaluating ∥Ax̂ − b∥_p. The vector r := Ax̂ − b is called the residual of the solution. Note that the residual is known, since all its ingredients are known (they are the input data, A, b, together with the computed solution x̂). The following result shows that the residual is a good measure of how far x̂ is from being the exact solution of a nearby problem.
Theorem 4 Let A ∈ C^{n×n}, b ∈ C^n, and let x̂ ≠ 0 be an approximation to the solution of the SLE (4.1), with residual r = Ax̂ − b. Then
∥r∥_2 / (∥A∥_2 ∥x̂∥_2) = min { ∥E∥_2/∥A∥_2 : (A + E)x̂ = b }.   (4.10)
Proof: In the first place, for any matrix E such that (A + E)x̂ = b, after solving for r and taking norms we get:
∥r∥_2 = ∥Ex̂∥_2 ≤ ∥E∥_2 ∥x̂∥_2,
Definition 8 (Backward error). The quantity ∥r∥_2/(∥A∥_2 ∥x̂∥_2) is known as the backward error (in the 2-norm) of the computed solution x̂ for solving the SLE (4.1).
Note that the backward error in Definition 8 does not take into account the
right-hand side b. In fact, the perturbed system (4.12) has the same right-hand
side as the original system (4.1).
Despite what we have seen in Theorem 4, a small residual does not guarantee
that the computed solution is a good approximation to the exact solution, as it is
shown in the following example, taken from [I09, Ex. 3.6].
Example 9 Let
A = [ 1 10^8 ; 0 1 ],   b = [ 1 + 10^8 ; 1 ],   x = [ 1 ; 1 ],
and assume that the computed solution is x̂ = [ 0 ; 1 + 10^{−8} ]. Then the residual is r = Ax̂ − b = [ 0 ; 10^{−8} ], so the backward error ∥r∥_2/(∥A∥_2∥x̂∥_2) is very small. However, the computed solution is far away from being the exact one, since
∥x − x̂∥_2 / ∥x∥_2 = ∥[ 1 ; −10^{−8} ]∥_2 / √2 ≈ 1/√2.
Example 9 shows that the residual is not necessarily a good measure for the error
in the solution of the SLE (4.1). The following theorem explains what is going on.
The proof can be found in [I09, p. 48].
Theorem 5 Let A ∈ C^{n×n} be an invertible matrix and let x be the solution of the SLE Ax = b. Let E ∈ C^{n×n} be a matrix such that A + E is also invertible and let x̂ be the solution of the SLE (A + E)x̂ = b + b_E. Let us assume that ∥E∥_p < 1/∥A^{−1}∥_p, with p = 1, 2, ∞. Then
∥x − x̂∥_p / ∥x∥_p ≤ ( κ_p(A) / (1 − κ_p(A) ∥E∥_p/∥A∥_p) ) · ( ∥E∥_p/∥A∥_p + ∥b_E∥_p/∥b∥_p ).   (4.13)
The expression (4.13) provides a bound, and an estimation, of the relative error in
the solution of the SLE (4.1) (see Definition 7). More precisely, it tells us that this
error depends on the condition number of the coefficient matrix, A (see Definition
6), as well as on the backward errors in the data, given by the right factor in the bound (4.13). Let us assume that we have used a backward stable method for computing the solution x̂. This means that the quotients ∥E∥_p/∥A∥_p and ∥b_E∥_p/∥b∥_p are very small. On the other hand, the hypothesis ∥E∥_p < 1/∥A^{−1}∥_p implies that the denominator 1 − κ_p(A)∥E∥_p/∥A∥_p is a positive number smaller than 1. If the product κ_p(A)∥E∥_p/∥A∥_p is small (something that should be expected), then the denominator 1 − κ_p(A)∥E∥_p/∥A∥_p will be close to 1. In this situation, the bound (4.13) says that the relative error in the solution can be magnified by a factor equal to the condition number of the matrix A with respect to the backward error. This is even more emphasized if κ_p(A)∥E∥_p/∥A∥_p is not small.
Note that in Theorem 5 there is a perturbation in the right-hand side, unlike what happened in Theorem 4. Now, if we combine both results (with b_E = 0 as in Theorem 4), we arrive at the bound:
∥x − x̂∥_2 / ∥x∥_2 ≤ ( κ_2(A) / (1 − κ_2(A) ∥r∥_2/(∥A∥_2∥x̂∥_2)) ) · ∥r∥_2/(∥A∥_2∥x̂∥_2).
If the product κ_2(A) ∥r∥_2/(∥A∥_2∥x̂∥_2) is small (close to 0), which is expected since, even if κ_2(A) is large, the residual should be very small if the computed solution is a good approximation to the exact solution, then
∥x − x̂∥_2 / ∥x∥_2 ≤ κ_2(A) · ∥r∥_2/(∥A∥_2∥x̂∥_2).   (4.14)
Equation (4.14) is the most eloquent formula in this section, since it indicates in
a clear way which are the sources of error in the solution of a SLE (4.1). More
precisely, it depends on two factors:
Remark 7 The bound given in (4.14) highlights the implication (3.10) in the solution of the SLE (4.1). More precisely, if the algorithm is stable, the factor ∥r∥_2/(∥A∥_2∥x̂∥_2) should be of the order of the unit roundoff. On the other hand, if the problem is well conditioned, the factor κ_2(A) should be moderate. If both premises hold, then the product will be of the order of the unit roundoff, which will provide an accurate algorithm.
The previous results refer to the solution of a SLE using any algorithm (not spec-
ified). However, in the precedent sections we have studied a particular algorithm,
namely Gaussian elimination with partial pivoting and the LU factorization. The
natural question that arises is: which is the relative error in the solution when
using this algorithm? The answer is given in the following result.
Theorem 6 Let x̂ be the solution of the SLE (4.1) computed by Gaussian elimination with partial pivoting in floating point arithmetic. Then
∥r∥_2 / (∥A∥_2 ∥x̂∥_2) ≤ ρ · ε,
where ρ is a constant which “rarely” is larger than 10 and ε is the unit roundoff. As a consequence,
∥x − x̂∥_2 / ∥x̂∥_2 ≤ κ_2(A) · ρ · ε.   (4.15)
The inequality (4.15) provides a simple bound of the relative error in the solution of the SLE (4.1) with the method analyzed in Section 4.2. The bound indicates that the error is of the order of the unit roundoff, except for a factor equal to ρκ_2(A). This factor is, in general, approximately κ_2(A) since, as indicated in the theorem, the factor ρ is, in general, moderate (around 10). This means that, in general, the condition number of the matrix A is the only value that may affect the accuracy of the Gaussian elimination algorithm with partial pivoting, which is what happens in Example 9. In practice, the condition number of A indicates the number of significant digits that are lost in the computation of the solution of the SLE (4.1).
Finally, we note that, in the relative error (4.15), the denominator is not the norm
of the exact solution, but the norm of the computed one. This is not a problem,
since both quantities are expected to be close to each other and, moreover, ∥b x∥ is
known, whereas ∥x∥ is not.
Remark 8 In practice, since the exact solution x is unknown, the way to estimate the
error is by means of error bounds. The bounds that we have seen in Theorems 4–6 are
just this: bounds, which means that they just give an idea of the maximum error that can
be committed. Nonetheless, these bounds usually provide a very accurate idea of the error
in practice. Anyway, it is important that these bounds are computable, since, otherwise,
we will not be able to get an estimation of the error. In particular, the bound given in
(4.15) depends on two quantities that are difficult to compute, namely: ρ and κ2 ( A).
The value ρ, though difficult to estimate, is in practice, as we have said, less than 10.
Regarding the condition number, though it is costly to compute numerically, there are
some commands in matlab to compute it. In particular, these commands are cond(A,p), where p can be 1, 2, or inf, and condest(A), which estimates the condition number in the 1-norm in a faster way than the previous one.
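A minimal sketch of these commands (the matrix is an arbitrary ill-conditioned example):

% condition number commands (arbitrary example)
A = [1 1; 1 1+1e-10];
cond(A,2)       % condition number in the 2-norm, about 4e10
cond(A,1)       % condition number in the 1-norm
condest(A)      % cheaper estimate of the 1-norm condition number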
In other words, the bandwidth is the number of nonzero diagonals around the
main diagonal, including the main diagonal (this is the reason for the addend 1 in
r + s + 1 in Definition 9).
An m × n banded matrix is a matrix having a low bandwidth (compared to the
total width of the matrix, namely m + n − 1).
Notice that a tridiagonal matrix is not necessarily square, as in the previous exam-
ple (the matrix is 6 × 5). However, in this course we will focus on square matrices.
In particular, we are going to see that the Gaussian elimination algorithm with-
out pivoting to solve the SLE, when the coefficient matrix is tridiagonal, becomes
quite simple.
In general, a tridiagonal n × n matrix can be written as:
T_n = [ b1 c1 0 0 · · · 0
        a1 b2 c2 0 · · · 0
        0 a2 b3 c3 · · · 0
        · · ·
        0 · · · 0 a_{n−2} b_{n−1} c_{n−1}
        0 0 · · · 0 a_{n−1} b_n ].
Now, let us write d = (d1, · · ·, dn)^⊤ for the right-hand side and let us apply the Gaussian elimination method to the augmented matrix of the SLE T_n x = d. The first operation, R21(−a1/b1), only modifies the second row of the augmented matrix, which becomes
( 0  b2′  c2  0  · · ·  0 | d2′ ),   where b2′ = b2 − (a1/b1) c1 and d2′ = d2 − (a1/b1) d1.
Proceeding with the next row, the operation R32(−a2/b2′) only modifies the third row, which becomes
( 0  0  b3′  c3  0  · · ·  0 | d3′ ),   where b3′ = b3 − (a2/b2′) c2 and d3′ = d3 − (a2/b2′) d2′.
These two steps are enough to observe a template that is repeated in all iterations of Gaussian elimination:
(i) For each pivot only one elementary operation is needed, and it is applied on the next row. More precisely, at each step we only modify the following two entries:
b_i′ = b_i − (a_{i−1}/b_{i−1}′) c_{i−1}   and   d_i′ = d_i − (a_{i−1}/b_{i−1}′) d_{i−1}′
(with b1′ = b1 and d1′ = d1).
(ii) The matrix obtained at the end of each step is still tridiagonal.
Property (ii) above is the one that allows us to confirm that property (i) will still
hold up to the end of the algorithm. This property of preserving the structure is
key in numerical analysis.
As a consequence of property (i) above, the overall computational cost of the method is 8(n − 1) operations. That is, the cost is O(n), which is much smaller than the O(n^3) cost for dense matrices.
The augmented matrix in echelon form that we obtain at the end of the procedure is
[ b1 c1 0 0 · · · 0 | d1
  0 b2′ c2 0 · · · 0 | d2′
  0 0 b3′ c3 · · · 0 | d3′
  · · ·
  0 · · · 0 0 b_{n−1}′ c_{n−1} | d_{n−1}′
  0 0 · · · 0 0 b_n′ | d_n′ ].
In the program folder ncm the previous algorithm for tridiagonal matrices is
implemented in the code tridisolve.
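A minimal usage sketch of tridisolve, assuming the calling sequence x = tridisolve(a,b,c,d), with a, b, c the sub-, main and super-diagonals and d the right-hand side (check help tridisolve in the ncm folder to confirm the interface):

% tridiagonal solve, O(n) cost (arbitrary example)
n = 5;
a = -ones(n-1,1);            % subdiagonal
b = 2*ones(n,1);             % main diagonal
c = -ones(n-1,1);            % superdiagonal
d = ones(n,1);               % right-hand side
x = tridisolve(a,b,c,d);
T = diag(b) + diag(a,-1) + diag(c,1);
norm(T*x - d)                % residual check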
5 Numerical interpolation
Let us assume that a function f : R → R is unknown, but we know (or we are able to evaluate) the images of f in some values of the domain. Can we extract any further information on f in other values of the domain? Interpolation provides an answer to this question. More precisely, let
(x1, y1), (x2, y2), . . . , (xn, yn),   with x1 < x2 < · · · < xn,   (5.1)
be the known data. Anyway, f is known as the interpolating function in the nodes (5.1), and the points (5.1) are known as interpolation nodes.
Note that, in both cases above, the interpolating functions are polynomial func-
tions. The reason for using such functions is that these are the most manageable
functions.
P(x) = a_{n−1} x^{n−1} + · · · + a1 x + a0,   (5.2)
P(x1) = y1 ⇒ a_{n−1} x1^{n−1} + · · · + a1 x1 + a0 = y1,
  ⋮
P(xn) = yn ⇒ a_{n−1} xn^{n−1} + · · · + a1 xn + a0 = yn,   (5.3)
whose augmented matrix is
[ x1^{n−1} x1^{n−2} · · · x1 1 | y1
  ⋮
  xn^{n−1} xn^{n−2} · · · xn 1 | yn ].
The coefficient matrix of the system,
V(x1, . . . , xn) := [ x1^{n−1} x1^{n−2} · · · x1 1
                       ⋮
                       xn^{n−1} xn^{n−2} · · · xn 1 ],   (5.4)
is the Vandermonde matrix associated to the vector x = (x1, . . . , xn)^⊤. By means of elementary row operations we can obtain its determinant
det V(x1, . . . , xn) = ∏_{i<j} (xi − xj).
P ( c ) = a n −1 c n −1 + · · · + a 1 c + a 0 ,
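A minimal matlab sketch of constructing and evaluating the interpolating polynomial in power form by solving the Vandermonde system (the four nodes are an arbitrary choice); the built-in polyfit does essentially the same:

% power-form interpolation through the Vandermonde system (arbitrary nodes)
x = [0 1 2 3]';  y = [1 2 0 5]';
V = vander(x);           % Vandermonde matrix (5.4)
a = V\y;                 % coefficients a_{n-1},...,a_1,a_0
polyval(a,x)             % reproduces y at the nodes
a2 = polyfit(x,y,3);     % same coefficients obtained with polyfit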
The Lagrange form is an alternative to the power form that provides an explicit expression for the interpolating polynomial. To define it we first introduce the following concept. The Lagrange polynomials associated to the nodes (5.1) are
ℓ_k(x) = ∏_{i≠k} (x − xi) / ∏_{i≠k} (xk − xi),   k = 1, . . . , n.   (5.5)
The Lagrange polynomials (5.5) have the following properties, that can be easily checked: each ℓ_k is a polynomial of degree n − 1, ℓ_k(xk) = 1, and ℓ_k(xi) = 0 for i ≠ k. The interpolating polynomial in the Lagrange form is
P(x) = y1 ℓ_1(x) + y2 ℓ_2(x) + · · · + yn ℓ_n(x).   (5.6)
Note that the interpolating polynomial in the Lagrange form (5.6) has degree
n − 1 and that P( xi ) = yi , for i = 1, . . . , n.
The Lagrange form is also expensive to obtain, because of the amount of differ-
ences xk − xi that must be computed, as well as their respective products. It is also
expensive to evaluate, since ℓk (c) requires computing lots of products when n is
large.
Anyway, polynomial interpolation is not frequently used in practice, due to the
amount of errors involved, that are usually located close to the endpoints of the
interpolating interval, when the function has large oscillations in this interval (see
Section 5.6).
1. Obtain the index, k, of the interval such that xk ≤ x < xk+1, with 1 ≤ k ≤ n − 1.
2. Evaluate P(x) = yk + ((yk+1 − yk)/(xk+1 − xk)) (x − xk) = yk + δk · sk, where
δk := (yk+1 − yk)/(xk+1 − xk)   (5.7)
is the slope of the segment joining (xk, yk) and (xk+1, yk+1), and
sk := x − xk   (5.8)
is the local variable.
This function is implemented in the folder ncm. Given two vectors x,y, and an-
other vector u, the function v = piecelin(x,y,u) provides a vector v such that
v(k) is P(u(k)), where P is the piecewise linear interpolating polynomial through
the nodes
(x(1),y(1)),(x(2),y(2)),...
Namely: the function evaluates the interpolating polynomial in a given number
of real values.
It is important to note that, in order for piecelin to provide the output, the
coordinates of the vector x must be ordered in increasing order: x(1)<x(2)<...
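A minimal usage sketch of piecelin (the nodes are an arbitrary choice):

% piecewise linear interpolation with piecelin from ncm (arbitrary nodes)
x = [0 1 2 4];  y = [1 3 2 0];     % nodes, with x in increasing order
u = linspace(0,4,200);             % evaluation points
v = piecelin(x,y,u);               % values of the interpolant at u
plot(x,y,'o',u,v,'-')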
Theorem 8 (Hermite interpolation). Let ( xk , yk ) and ( xk+1 , yk+1 ) be two points such
that xk ̸= xk+1 , and let dk , dk+1 be any two real numbers. Then there exists a unique cubic
polynomial, Pk ( x ), such that
Pk(xk) = yk,   Pk(xk+1) = yk+1,
Pk′(xk) = dk,   Pk′(xk+1) = dk+1.   (5.9)
Such a polynomial is given explicitly, in terms of the local variables s = x − xk and h = xk+1 − xk, by
Pk(x) = ((3hs^2 − 2s^3)/h^3) yk+1 + ((h^3 − 3hs^2 + 2s^3)/h^3) yk + (s^2(s − h)/h^2) dk+1 + (s(s − h)^2/h^2) dk.   (5.10)
Proof: The polynomial (5.10) is clearly of degree at most 3. Moreover, we can see
that it satisfies (5.9). The first line is immediate:
• If x = xk then sk = 0, so Pk ( xk ) = 0 + yk + 0 + 0 = yk .
For the derivatives, note that differentiating with respect to x is the same as
differentiating with respect to sk :
and we obtain:
• Pk′ ( xk ) = 0 + 0 + 0 + dk = dk ,
□
Piecewise cubic interpolation provides a function consisting of n − 1 cubic
polynomials, P1 ( x ), . . . , Pn−1 ( x ), where the polynomial Pk ( x ) is defined in the in-
terval [ xk , xk+1 ).
Note that piecewise cubic interpolation depends on the values of the derivatives
dk and dk+1 . In this course we are going to see two different ways to determine
these values, that give rise to two different interpolating formulas. One of them is
known as splines interpolation, that is analyzed in Section 5.4, and the other one is
produced by the matlab command pchip, that is explained below. In both cases,
the choice gives rise to differentiable functions, since for the set of nodes (5.1) the
variables that we need to choose are d1 , . . . , dn (namely: the variable dk is the same
for both couples ( xk−1 , yk−1 ), ( xk , yk ) and ( xk , yk ), ( xk+1 , yk+1 )). This is summarized
in the following remark.
This command is implemented in matlab, and it has a simple version in the folder
ncm, called pchiptx.
The command chooses the values dk and dk+1 taking into account the slopes
of the straight lines joining two consecutive couples of points ( xk−1 , yk−1 ), ( xk , yk )
and ( xk , yk ), ( xk+1 , yk+1 ) that is, δk and δk+1 given by (5.7). More precisely:
• If δk−1 δk ≤ 0, then dk = 0.
• If δk−1 δk > 0 then dk is the harmonic mean of the two slopes if xk+1 − xk =
xk − xk−1 , or a weighted mean otherwise.
67
5 Numerical interpolation
Introducing in condition (5.11) the first and last previous identities we get:
h2 2( h1 + h2 ) h1 d1
h3 2( h2 + h3 ) h2 d2
.. .. .. .
..
. . .
h n −1 2 ( h n −2 + h n −1 ) h n −2 dn
3(h2 δ1 + h1 δ2 )
3 (h3 δ2 + h2 δ3 )
= .
..
.
3(hn−1 δn−2 + hn−2 δn−1 )
68
5 Numerical interpolation
a unique solution we need to add another two restrictions. Now, we study the
model obtained from a particular choice of these conditions.
The strategy that we follow to uniquely determine the interpolating function is
the one that is known as “not a knot”, and consists in making the nodes x2 and
xn−1 not being nodes anymore, by forcing the functions in the intervals [ x1 , x3 )
and [ xn−2 , xn ) to be (exactly) cubic polynomials. In other words, we impose
The following result, whose proof is left as an exercise, will allow us to obtain
an alternative condition.
Remark 10, together with (5.11) and Lemma 5 tell us that it is enough to impose
the additional conditions
namely:
h22 d1 + (h2 + h1 )(h2 − h1 )d2 − h21 d3 = 2h22 δ1 − 2h21 δ2 . (5.15)
Equation (5.12) with k = 2 gives
69
5 Numerical interpolation
h2n−1 dn−2 + (hn−1 + hn−2 )(hn−1 − hn−2 )dn−1 − h2n−2 dn = 2h2n−1 δn−2 − 2h2n−2 δn−1 ,
(5.18)
while equation (5.12) with k = n − 1 is equal to
hn−1 dn−2 + 2(hn−2 + hn−1 )(hn−2 − hn−1 )dn−1 + hn−2 dn = 3(hn−1 δn−2 + hn−2 δn−1 ),
(5.19)
so that, subtracting to (5.18) equation (5.19) multiplied by hn−1 , we obtain
Remark 11 The piecewise cubic interpolation polynomial (5.10) can be also expressed in
the local variable sk = x − xk in power form
Pk ( x ) = yk + sdk + s2 ck + s3 bk , (5.22)
where
3δk − 2dk − dk+1 d − 2δk + dk+1
ck = , bk = k .
h h2
70
5 Numerical interpolation
P( x ) = α0 + α1 ( x − x1 ) + α2 ( x − x1 )( x − x2 ) + . . . + αn−1 ( x − x1 )( x − x2 ) · · · ( x − xn ).
(5.23)
The coefficient α0 is easy to obtain: taking into account that P( x1 ) = y1 , it must be
α0 = y1 . Similarly, the coefficient αn−1 is the same as c1 in the power form, since
in both cases is the coefficient of the term of degree n. How can we obtain the
remaining coefficients αi of (5.23)? Let us see two ways to do it.
71
5 Numerical interpolation
P0 ( x ) = α0
is the interpolating polynomial through the node ( x1 , y1 ).
P1 ( x ) = α0 + α1 ( x − x1 )
is the interpolating polynomial through the nodes ( x1 , y1 ), ( x2 , y2 ).
P2 ( x ) = α0 + α1 ( x − x1 ) + α2 ( x − x1 )( x − x2 )
is the interpolating polynomial through the nodes ( x1 , y1 ), ( x2 , y2 ), ( x3 , y3 ).
..
.
P( x ) = Pn−1 ( x ) = α0 + α1 ( x − x1 ) + α2 ( x − x1 )( x − x2 ) + · · · + αn−1 ( x − x1 ) · · · ( x − xn )
is the interpolating polynomial through the nodes (5.1).
Definition 12 (Divided differences). Given the nodes (5.1), the kth divided differ-
ence in the subset ( xi1 , yi1 ), . . . , ( xik , yik ) is the coefficient of largest degree of the interpo-
lating polynomial through the nodes ( xi1 , yi1 ), . . . , ( xik , yik ). We denote it by F [ xi1 . . . xik ].
As we have seen above, the interpolating polynomial in the Newton form can
be expressed as
P( x ) = F [ x1 ] + F [ x1 x2 ]( x − x1 ) + F [ x1 x2 x3 ]( x − x1 )( x − x2 )
+ · · · + F [ x1 x2 . . . xn ]( x − x1 )( x − x2 ) · · · ( x − xn−1 ).
72
5 Numerical interpolation
F [ x 2 . . . x k ] − F [ x 1 . . . x k −1 ]
F [ x1 x2 . . . x k ] = , (5.25)
x k − x1
for k = 1, . . . , n.
Therefore:
x − x1
p( x ) = q( x ) + · (r ( x ) − q( x )).
x k − x1
From this identity we deduce that the coefficient of degree k − 1 of p( x ) coin-
cides with the difference between the coefficients of degree k − 1 of r ( x ) and q( x ),
divided by xk − x1 . In other words, (5.25) holds. □
Theorem 9 allows us to obtain the kth divided differences from the (k − 1)st
ones, following the scheme in Figure 5.1. In this figure, the entries of each column
are obtained from the two entries indicated in the previous column using the
formula (5.25), which also involves the division between the values corresponding
to the variable x.
73
5 Numerical interpolation
y1 = F [ x1 ]
y2 = F [ x2 ] F [ x1 x2 ]
y3 = F [ x3 ] F [ x2 x3 ] F [ x1 x2 x3 ]
y4 = F [ x4 ] F [ x3 x4 ] F [ x2 x3 x4 ] ···
.. ..
. .
y n = F [ x n −1 ] F [ x n −2 n n −1 ]
yn = F [ xn ] F [ x n −1 x n ] F [ x n −2 x n −1 x n ] ···
Figure 5.1: Scheme of the divided differences to obtain the coefficients of the Newton interpolat-
ing polynomial.
The interpolating polynomial is the same in all three cases because, according
to Theorem 7, this polynomial is unique. Nonetheless, these expressions for the
interpolating polynomial correspond to three different bases of the vector space of
polynomials with degree at most n − 1. More precisely:
• The power form corresponds to the monomial basis centered at the origin:
{1, x, x2 , . . . , x n−1 }.
• The Lagrange formula corresponds to the Lagrange basis: {ℓ1 ( x ), . . . , ℓn ( x )},
where ℓ1 ( x ), . . . , ℓn ( x ) are given in (5.5).
74
5 Numerical interpolation
Example 10 Assume we want to approximate the following two functions in the interval
[0, 2):
(a) f 1 ( x ) = x6 − 2x5 + x + 1.
2x5 + 2
(b) f 2 ( x ) = .
3x3 − x2 + 2
When taking equispaced values in the interval [0, 2) (namely, x1 = 0, x2 = 1, x3 = 2),
we get the nodes
(0, 1), (1, 1), (2, 3).
It can be easily checked that the graph of both functions contains these nodes. We can obtain
the (quadratic) interpolating polynomial through these nodes, P( x ) = a2 x2 + a1 x + a0 , as
in the proof of Theorem 7, namely solving the linear system whose augmented matrix is
0 0 1 1
1 1 1 1 .
4 2 1 3
75
5 Numerical interpolation
One could expect that the larger is the number of nodes, the smaller the error
is, namely: the graph of the interpolating polynomial would better fit to the one
of the function. In particular, one would expect that
lim Pn ( x ) = f ( x ), x ∈ [ x1 , x n ),
n→∞
76
5 Numerical interpolation
Remark 12 It can be proved (see, for instance, [BF11, Th. 3.6]) that
f (n) ( η )
= F [ x1 x2 . . . x n ],
n!
with η as in the statement of Theorem 10.
Theorem 10 does not assume that the nodes are ordered such that x1 < x2 <
· · · < xn . In this situation, if x ∈ [ x1 , xn ), this theorem implies
f (n) ( ξ )
| f ( x ) − P( x )| ≤ max · |( x − x1 ) · · · ( x − xn )|. (5.26)
ξ ∈[ x1 ,xn ] n!
The inequality (5.26) provides a bound for the maximum error of the interpolating
polynomial through the nodes (5.1) in the interval [ x1 , xn ). The first factor of the
bound depends on the function, f , that we want to approximate. More precisely,
it depends on how big the nth derivative of f is within the interpolation interval.
Therefore, if f is a function having a small derivative, this term will be small.
However, for functions with large oscillations it can be large (though it is not
necessarily so, since (5.26) is just a bound).
Despite the denominator n!, which increases very fast as n increases, the
bound in (5.26) can be quite large, even for n going to infinity.
On the other hand, there is a second factor, |( x − x1 ) · · · ( x − xn )|, which is
independent of the function, and depends only on the values of the abscissa of
the nodes. To bound this term is not easy and depends, among other things, on
the choice of the nodes. Nevertheless, if the nodes are equispaced (namely, the
distance between the abscissa of two consecutive nodes is constant), we can get a
bound for this error.
77
5 Numerical interpolation
If, again, we consider equispaced nodes, with h = xn+1 − xn , then the bound in
(5.28) reads
5
max | f ( x ) − S( x )| ≤ Mh4 ,
x1 ≤ x ≤ x n 384
where maxx1 ≤ x≤ xn | f (4) ( x )| ≤ M. This is a bound of the same order of error than
(5.27) (namely O(h4 )) and it is even worse if we compare the coefficients that are
multiplied by the term h4 . Nonetheless, the splines enjoy the interesting property
described below.
A differentiable function f has large oscillations when its first derivative has large
variations (in absolute value). In other words, f ′′ has large variations (in absolute
78
5 Numerical interpolation
Theorem 13 Let (5.1) be the interpolation nodes, with x1 < · · · < xn . Let f be a
twice differentiable function with continuous derivatives in [ x1 , xn ] such that f ( xi ) = yi ,
for i = 1, . . . , n. Let S( x ) be the cubic spline in the nodes (5.1) with S( xi ) = yi and
S′ ( xi ) = f ′ ( xi ), for i = 1, . . . , n. Then
Z b Z b
(S′′ ( x ))2 dx ≤ ( f ′′ ( x ))2 dx.
a a
79
6 Roots of polynomials and functions
The basic idea behind the bisection method is dividing a given interval [ a, b] in
which f ( a) f (b) < 0 in two halves, and to choose at each step the half subinterval
that fulfills this condition, namely:
• Choose [ a, a+2 b ] if f ( a) f ( a+2 b ) < 0.
Example
√ 11 The following matlab algorithm is the basic bisection algorithm
√ to compute
2. Let us consider [1, 2] as the starting interval, where the value 2 is located. The
algorithm is applied to the function is f ( x ) = x2 − 2, which satisfies f (1) f (2) < 0.
80
6 Roots of polynomials and functions
function[x,k] = bisectsqrt2
r=2;
a=1;
b=2;
k=0; % iteration counter
while (b-a)>eps*b
x=(a+b)/2;
if x^2-r>0
b=x;
else
a=x
end
k=k+1;
end
The previous algorithm does not have any input variable, since we have
√ chosen the starting
interval beforehand. The output variables are x (approximation to 2) and k (number of
iterations).
The code in the previous example can be extended without difficulty to more
general functions. The main difference is the criterion to choose the appropriate
half subinterval at each step. For this, we need to evaluate the function in the
endpoints of the interval.
The stopping criterion by default is
| a − b|
< ε, (6.1)
|b|
where ε is the machine epsilon (b can be replaced by a in the denominator). The
criterion (6.1) establishes that the relative error in the approximation is less than
the machine epsilon, which is, as we know, twice the roundoff error. Nonetheless,
the bisection method is slow, and to get such a small threshold can lead to a lot of
iterations. In this case, as an alternative stopping criterion we can introduce some
factor in the right-hand side of (6.1) (namely, cε, with c > 1) or to fix in advance
the number of iterations.
Cons:
81
6 Roots of polynomials and functions
• It is very slow.
Pros:
• Always converges.
It may happen that the initial interval contains more than one root. In this case,
the bisection method will approximate one of them, but the method does not
allow to know the number of roots.
f ( xn ) f ( xn )
f ′ ( xn ) = ⇒ x n +1 = x n − ′ (6.2)
x n − x n +1 f ( xn )
82
6 Roots of polynomials and functions
f ( x ) = f ( x n ) + ( x − x n ) f ′ ( x n ).
f ( xn )
0 = f ( xn ) + ( x − xn ) f ′ ( xn ) ⇔ x = xn − ,
f ′ ( xn )
which is the value of xn+1 in Newton’s method. This is the idea that will be used
in Section 8.2.1 to extend Newton’s method to multivariable functions.
83
6 Roots of polynomials and functions
Cons:
• It does not always converge. In fact, the function f must satisfy certain
“regularity” properties (namely, the derivative should not present large vari-
ations) and the initial value, x0 , must be close enough to the root for conver-
gence.
• We need to be able to evaluate not just f , but also f ′ , in values that are not
located in any particular interval.
Pros:
f ( x n −1 ) − f ( x n )
y= ( x − x n ) + f ( x n ),
x n −1 − x n
so the intersection with the horizontal axis y = 0 is given by
f ( xn ) f ( x n −1 ) − f ( x n )
= .
x − xn x n −1 − x n
Solving for x we obtain the next value of the iteration:
x n − x n −1
x n +1 = x n − f ( x n ) · (6.3)
f ( x n ) − f ( x n −1 )
84
6 Roots of polynomials and functions
It can be checked that this sequence consists of alternating positive and negative numbers,
which are larger at each iteration (in absolute value). The explanation of this can be seen
in Figure 6.4.
In Example 13, the value x2 is outside the interval [ x0 , x1 ] and this is the reason
for the divergence. This, however, does not happen if the evaluation of the function
in the endpoints produces two numbers with different signs. More precisely:
If f ( xn−1 ) f ( xn ) < 0, then xn+1 is between xn−1 and xn .
The algorithm fzero, that we will analyze in Section 6.7, will make use of this
fact.
85
6 Roots of polynomials and functions
They are the same as the ones for Newton’s method, except for the following:
lim xn = r ?
n→∞
86
6 Roots of polynomials and functions
Example 14 Let g( x ) = − x3 + 3x2 − 1. This function has a fixed point in [2, 3], since
the conditions in the statement of Theorem 15 are satisfied. However, this fixed point
cannot be approximated with the fixed point iteration, even for nearby initial values. Table
6.1 provides the first values of the fixed point iteration for some initial values in [2, 3]. As
you may guess, all sequences diverge.
In Figure 6.5 you can see that the fixed point is between 2.4 and 2.5. Nevertheless, none
of these initial values produce a convergent fixed point sequence.
87
6 Roots of polynomials and functions
In Example 14 we have seen that, though there is a fixed point in [2, 3], the fixed
point iterations, even for initial values that are close to the fixed point, produce
values that lie outside the interval. It is natural to expect that, in order for the
sequence to be convergent, the values of the sequence must lie within the initial
interval where the root is located. The following result provides conditions for
this to happen.
88
6 Roots of polynomials and functions
where ξ n is some value between xn+1 and xn (in particular, it is in [ a, b]). Iterating
this inequality we arrive at
| x n +1 − x n | ≤ M n · | x 1 − x 0 | .
The sum ∑∞ k
k=0 M is a geometric series with with ratio M < 1, so its value is
1
1− M .
Replacing this value in the previous bound we get
Mn
lim | xm − xn | ≤ · | x1 − x0 |. (6.6)
m→∞ 1−M
Since M < 1, we conclude that limm→∞ | xm − xn | tends to 0 when n → ∞. In
other words, the difference between any two terms, xm and xn , of the fixed point
iteration, tends to 0. Namely, the fixed point iteration is a Cauchy sequence. We
know by elementary Calculus that every Cauchy sequence over R is convergent.
Therefore, the fixed point iteration is convergent.
It remains to prove that the fixed point iteration converges to r, and that equa-
tion (6.5) is true. For the first claim, let ℓ Be the limit of the fixed point sequence.
Since g is continuous, we get
89
6 Roots of polynomials and functions
(c) −1 < g′ < 0: Oscillatory convergence. (d) g′ < −1: Oscillatory divergence.
approximation. This bound, (6.5), depends on the value of M that bounds the
absolute value of the derivative of g. The bound (6.5) suggests that the smaller
the value of the derivative is, the faster is the convergence. In Section 6.8 we will
come back to this issue on the speed of convergence.
Figure 6.6 illustrates the condition on the derivative of g in Theorem 16. In this
figure we show all four possible cases when | g′ ( x )| > 1 (divergence) and when
| g′ ( x )| < 1 (convergence).
Condition g([ a, b]) ⊆ [ a, b] in the statement of Theorem 15 means that g( x ) ∈
[ a, b], for all x ∈ [ a, b]. Though it is very easy to state and interpret, this condition
is not, in general, easy to check. The following result avoids to impose this restric-
tion, though it assumes that we know beforehand that there is a fixed point in the
starting interval.
Theorem 17 If g( x ) is differentiable in [ a, b] and satisfies
(i) it has a fixed point in [ a, b],
90
6 Roots of polynomials and functions
then g has a unique fixed point, r, in [ a, b] and the fixed point iteration converges to r for
every initial value x0 ∈ [ a, b]. Moreover, the fixed point sequence is monotone, namely:
Proof: For the first claim in the statement, it suffices to prove that condition (i)
implies g([ a, b]) ⊆ [ a, b] and apply Theorem 16. Indeed, x ∈ [ a, b] and let us
denote by r the fixed point guaranteed by (i). Let us distinguish two cases:
so g( x ) ∈ [ a, b].
so g( x ) ∈ [ a, b].
To prove the second claim (the fixed point sequence is monotone) it suffices to
look at the inequalities obtained in the two previous items, since
• r ≤ x n ⇒ x n +1 = g ( x n ) < x n .
• x n ≤ r ⇒ x n +1 = g ( x n ) > x n .
91
6 Roots of polynomials and functions
x = b2 y2 + b1 y + b0 . (6.7)
This parabola always intersects the horizontal axis (at (b0 , 0)).
Therefore, the method is as follows:
• xn+1 = b0 .
xcx = inline(’x-cos(x)’);
z1=fzerotx(xcx,[0,1])
(b) Use a function from some file.m together with the command @. This last
command calls a function fun.m:
92
6 Roots of polynomials and functions
z2=fzerotx(@name,[0,1])
z3=fzerotx(@cos,[0,1])
z4=fzerotx(@(x) x-cos(x),[0,1])
The command feval is quite useful to evaluate functions in matlab. The syntax
for using this command is the following
feval(f,values)
where f is the function that we want to evaluate (introduced in any of the ways
that we have seen above) and values are the values where we want to evaluate it,
that can be just one value or a vector. For instance:
xcx=inline(’x-cos(x)’);
feval(xcx,[0,1,6,9])
93
6 Roots of polynomials and functions
* f ( a) f (b) < 0,
* | f (b)| ≤ | f ( a)|,
* c is the precedent value to b.
– If c ̸= a, consider one step of IQI.
– If c = a, consider one step of the secant method.
– If either the IQI or the secant method step is in the interval, [ a, b], take
it.
– If the step is not in the interval [ a, b], use bisection method.
This algorithm is infallible and always contains a root in the considered inter-
val. It uses methods with fast convergence (IQI) at the moment where they are
reliable, together with some methods which are slower but reliable (bisection, se-
cant), when necessary.”
The version that we will use in this course is the code fzerotx, which is im-
plemented in the folder ncm (see [M04, §4.7]). In this folder you can also find
the code fzerogui, which represents graphically the steps of fzerotx and may
provide a list of them. Moreover, it allows us to choose, at each step, the next
iteration.
94
6 Roots of polynomials and functions
The expression (6.8) indicates that, from some n0 , the absolute error in the ap-
proximation { xn } is “raised to α” at each step. This roughly means that the num-
ber of significant digits of the approximation at each iteration is multiplied by
α. Therefore, the larger the convergence order is, the fastest will the sequence
converge to its limit r.
In this section we are going to present the convergence order of the iterative
methods analyzed in this chapter. In all cases we assume that the method is
convergent (though, as we know, not all of them are), and we denote by r the root
of f which is the limit of the sequence.
f (r + ϵn )(ϵn − ϵn−1 )
ϵ n +1 = ϵ n − . (6.9)
f ( r + ϵ n ) − f ( r + ϵ n −1 )
f ′′ (r ) 2
f (r + ϵ ) ≈ f (r ) + f ′ (r ) ϵ + ϵ .
2
f ′′ (r )
Let us denote, for simplicity, M := 2 f ′ (r )
. Taking into account that f (r ) = 0 (since
r is a root of f ), we can write:
95
6 Roots of polynomials and functions
Summarizing:
ϵn+1 ≈ ϵn−1 ϵn M. (6.10)
But the method has convergence order α if and only if |ϵn+1 | ≈ C |ϵn |α , for some
constant C > 0. If this holds, replacing in (6.10) we get
α−1 1
| M| | M|
1
α −1
C | ϵ n | ≈ | M | · | ϵ n −1 | · | ϵ n | ⇔ | ϵ n |
α
≈ · | ϵ n −1 | ⇔ | ϵ n | ≈ | ϵ n − 1 | α −1 .
C C
namely, if g′ (r ) ̸= 0 then the convergence order of the fixed point method is equal to 1.
96
6 Roots of polynomials and functions
The second result says that, if the root is multiple, namely g(r ) = 0 = g′ (r ), the
converge, instead of being linear, is, at least, quadratic:
(i) g′ (r ) = 0,
then there is some δ > 0 such that, for every initial value x0 ∈ [ p − δ, p + δ], the fixed
point sequence { xn+1 = g( xn )} converges at least quadratically to r. If, moreover
(ii) g′′ is continuous and | g′′ ( x )| < M for all x in an open interval containing r,
M
| x n +1 − r | < · | x n − r |2 . (6.13)
2
Proof: By hypothesis (i) and, since g′ is continuous (because it is twice differen-
tiable), there are δ > 0 and k ∈ (0, 1) such that [r − δ, r + δ] is contained in the
interval mentioned in part (ii) and | g′ ( x )| ≤ k for all x ∈ [r − δ, r + δ]. As we
have seen in the proof of Theorem 16, this implies that the terms of the fixed point
iteration { xn+1 = g( xn )} belong to the interval [r − δ, r + δ], for any initial value
x0 . Using the hypotheses g(r ) = r and g′ (r ) = 0, the Taylor expansion of g around
r in [r − δ, r + δ] gives
g′′ (ξ ) g′′ (ξ )
g( x ) = g(r ) + g′ (r )( x − r ) + ( x − r )2 = r + ( x − r )2 ,
2 2
with ξ being. a value between r and x. In particular, if x = xn ,
g′′ (ξ n ) g′′ (ξ n )
x n +1 = g ( x n ) = r + ( x n − r )2 ⇒ x n +1 − r = ( x n − r )2 , (6.14)
2 2
for some ξ n between r and xn .
As | g′ ( x )| ≤ k < 1 in [r − δ, r + δ] and g([r − δ, r + δ]) ⊆ [r − δ, r + δ], Theorem
16 guarantees that { xn } converges to r. Since ξ n is between xn and r, {ξ n } also
converges to r, so (6.14) implies that
| x n +1 − r | | g′′ (r )|
lim = .
n → ∞ | x n − r |2 2
97
6 Roots of polynomials and functions
f (x)
g( x ) = x − .
f ′ (x)
f ′ ( x )2 − f ′′ ( x ) f ( x ) f ′ (r )2 − f ′′ (r ) f (r )
g′ ( x ) = 1 − ′
⇒ g ′ (r ) = 1 − = 1 − 1 = 0.
f (x) 2 f ′ (r )2
98
7 Least squares problems
Remark 14 Recall that the 2-norm of a vector is the square root of the sum of the squares
of the modulus of the coordinates of the vector. The square root of a sum of squares
is minimized when the sum of squares itself is minimized (since the square root is an
increasing function). Therefore, the least squares solution is the one that minimizes the
sum of squares of the vector Ax − b. This is the reason for the name “least squares”.
Least squares problems arise in many contexts (in engineering, economy, biol-
ogy, physics, etc.). In particular, they arise in the “curve fitting” problem. This
problem has a similar motivation to the interpolation problem (though, as we will
see, it is solved in a different way), and it is explained below.
Curve fitting
99
7 Least squares problems
Let us also assume that y is a function of x that can be written in the form
y = β 1 ψ1 ( x ) + β 2 ψ2 ( x ) + · · · + β n ψn ( x ), (7.2)
ψ1 ( x1 ) ψ2 ( x1 ) · · · ψn ( x1 ) y1
β1
ψ1 ( x2 ) ψ2 ( x2 ) · · · ψn ( x2 ) β 2 y2
. = . . (7.3)
.
.. .. ..
ψ1 ( xm ) ψ2 ( xm ) · · · ψn ( xm ) βn ym
The coefficient matrix of the system (7.3) is m × n. If m > n, the system (7.3) is
overdetermined and, in general does not have a solution. In this case, we are in
the situation described at the beginning of the section.
The most elementary examples of curve fitting correspond to polynomial regres-
sion, which includes, as a particular case linear regression. We analyze this last one
independently because of its simplicity and relevance.
100
7 Least squares problems
The previous system is not consistent, as long as at least three points in the
set are not aligned.
1 x1 . . . x1n y1
β1
1 x2 . . . x n β 2 y2
2
= . (7.4)
. .
.. .. . .
. . .. ... ...
1 xm . . . xmn βn ym
The coefficient matrix of the system is a Vandermonde matrix (5.4) but, un-
like in the case considered in polynomial interpolation, when m > n the
system does not have a solution (except when the nodes are placed in the
graph of a polynomial function with degree at most m).
The command “\” which solves the system (4.1) when it has a unique solution,
solves the least squares problem associated to the SLE (4.1) when the system is not
consistent. Then, the command A\b solves the SLE (4.1) by least squares.
For polynomial fitting we can use the command polyfit. More precisely,
polyfit(x,y,m) solves the least squares problem associated to the system (7.4),
where
x = [ x1 x2 . . . xm ]⊤ and y = [y1 y2 . . . ym ]⊤ .
∥ Qx∥2 = ∥x∥2 .
101
7 Least squares problems
∥ Qx∥2 = x⊤ Q⊤ Qx = x⊤ x = ∥x∥2 .
□
Lemma 7 tells us that multiplying a vector by an orthogonal matrix provides a
vector with the same 2-norm.
We will also use the following property.
Lemma 8 The product of two orthogonal matrices of the same dimension is again an
orthogonal matrix.
so Q1 Q2 is orthogonal as well. □
and also as
A=Q
e R,
e (Reduced QR factorization) (7.6)
where Q, R, Q,
e Re satisfy:
(a) Q is orthogonal m × m.
(c) Q e⊤ Q
e is m × n with orthonormal columns, namely: Q e = In .
102
7 Least squares problems
where ai is the ith column of A. Note that, since the rank of A is n, the dimension
of the column space of A is n. If in (7.7) we solve for the vectors ai in terms of the
vectors q j and we write down the expressions that we obtain in matrix form, we
arrive at
∥ v1 ∥2 ∗ ... ∗
0
∥ v2 ∥2 . . . ∗
a1 a2 . . . a n = q1 q2 . . . q n . .. =: Q e R.
. . .
e
. . .
0 ... 0 ∥ v n ∥2
The entries marked with ∗ are not relevant, but can be obtained from (7.7) as indi-
cated in the previous paragraph. Note that the diagonal entries of R e are nonzero
numbers, since they are the norm of nonzero vectors (the fact that the vectors vi
are nonzero is a consequence of the fact that the rank of A is n).
To obtain the full QR factorization (7.5) it is enough to complete the basis
{q1 , . . . , qn } of the column space of A to an orthonormal basis of Rm , {q1 , . . . , qn , qn+1 , . . . , qm },
and to add a zero block to the matrix R, e in the form:
∥ v1 ∥2 ∗ ... ∗
0
∥ v2 ∥2 . . . ∗
. .. ..
.. . .
a 1 a 2 . . . a n = q 1 q 2 . . . q n q n +1 . . . q m 0 ... 0 ∥vn ∥2 =: QR.
0 ... 0 0
.. ..
.. ..
. . . .
0 ... 0 0
□
Figure 7.1 shows the shape and the size of the full and reduced QR factorization.
Remark 15 Some remarks about the QR factorization, in the conditions of Theorem 20,
are in order:
103
7 Least squares problems
(b) The reduced QR factorization is unique up to the sign of the columns of Qe and the
rows of R.
e Namely: if Q
eRe and Q e′ R
e ′ are two reduced QR factorizations of a matrix
A, then
e=Q
Q e ′ S, e = SR
R e′ ,
where
±1
±1
S=
..
.
±1
is a diagonal matrix whose diagonal entries are 1 or −1.
(c) In the proof of Theorem 20 the diagonal entries of Re are positive, since they are the
norm of nonzero vectors. The diagonal entries of the matrix R e in the QR factoriza-
tion can also be negative but, by the previous remark, if we impose the condition that
they are all positive, then the reduced QR factorization of a matrix is unique.
(d) In the full QR factorization the last m − n columns Q are not unique.
where ( Q⊤ b)1 are the first n rows of Q⊤ b, and ( Q⊤ b)2 are the last m − n rows. To
obtain the second identity we have used Lemma 7.
104
7 Least squares problems
To minimize (7.8) we can only work on the upper part of the vector obtained at
e − ( Q⊤ b)1 , since the other term, −( Q⊤ b)2 , does not depend
the end, namely Rx
on x. Moreover, the minimum is achieved when the first term is zero, namely:
e = ( Q ⊤ b )1 .
Rx
Since R
e is an upper triangular matrix, the previous system can be solved by back-
ward substitution. This provides the following procedure to solve the least squares
problem (LSP):
2. Multiply Q⊤ b.
x = ( Q⊤ b)(1 :
3. Solve, by backward substitution, the SLE R(1 : n, 1 : n)b
n ).
Remark 16 The previous algorithm is, essentially, the one that matlab follows with the
command A\b.
105
7 Least squares problems
(i) Hu⊤ = Hu .
scalar (namely, a real number). As vectors x and ce1 must have the same norm, it
must be c = ±∥x∥2 . There are two possible choices of the vector u such that Hu
takes x to ±∥x∥2 e1 , which are u = ±∥x∥2 e1 − x. Figure 7.2 illustrates the effect, on
the plane, of the Householder reflector that is obtained taking these two vectors.
106
7 Least squares problems
In practice, and for stability reasons, the standard choice is the one such that
c = sign( x1 )∥x∥2 , where x1 is the first coordinate of the vector x.
Householder reflectors are used to obtain the QR factorization in the following
way. In the first place, we choose the Householder that takes the first column of
A, denoted by a1 , to the vector ±∥a1 ∥2 e1 following the previous criterion, namely
u = sign((a1 )1 )∥a1 ∥2 . This effect is produced by multiplying, on the left, the start-
ing matrix, A, by Hu . Then, we proceed in the same way with the Householder
reflector that takes the first column of the matrix that results from this product,
and then removing the first row and first column, to the corresponding multiple
of the vectors e1 , and so on. The following example illustrates this procedure.
107
7 Least squares problems
Now:
H3 H2 H1 A = R, upper triangular matrix,
H3 H2 H1 = Q⊤ , orthogonal matrix (since it is a product of orthogonal matrices, see Lemma 8),
so A = QR is a full QR factorization of A.
( Hn Hn−1 · · · H2 H1 ) A = R,
Q⊤ = Hn Hn−1 · · · H2 H1 ,
Q = H1 H2 · · · Hn−1 Hn ,
(H1 , . . . , Hn are Householder reflectors)
• The 2-norm of A.
• The distance of A to the set of matrices with smaller rank. That is, the 2-norm
of the smallest perturbation, ∆A, that makes rank ( A + ∆A) < rank A. In
particular, if A is invertible, it will allow us to know how far is A from being
non-invertible (singular).
In this course, however, we will use the SVD as a tool to solve LSPs. In the
following result we introduce the SVD. Though it is valid for matrices of any size,
we keep imposing the same restriction as in previous sections, namely, when A is
m × n with m ≥ n.
108
7 Least squares problems
109
7 Least squares problems
The solution to this problem goes though the SVD of A. To state it, we need
to recall that any matrix M ∈ Cm×n with rank M = ρ can be written as a sum of
rank-1 matrices
M = x1 y1∗ + · · · + xρ y∗ρ ,
110
7 Least squares problems
Then
∥ A − Aρ ∥2 = min{∥ A − B∥2 : rank B ≤ ρ} = σρ+1 ,
where, if ρ = r, then σρ+1 = 0.
Proof: First note that, if A and Aρ are as in (7.12) and (7.13), respectively, then
∥ A − Aρ ∥2 = ∥ ∑ri=ρ+1 σi ui vi∗ ∥2 = σρ+1 , by (7.11). It remains to prove that, for any
B with rank B ≤ ρ, it must be ∥ A − B∥2 ≥ σρ+1 . We omit this part and refer to
[I09, Fact 4.13] for a proof. □
As a consequence of Theorem 22, the best approximation to A with rank ρ is
obtained from the SVD of A by truncating the sum in (7.12) from the ρth singular
value on.
Image compression
We are going to see an interesting application of the SVD in the process of com-
pression and restoration of images.
For the sake of simplicity, let us focus on the simplest case of a grey scale. The
information of this image is stored in a matrix, whose dimensions correspond to
the resolution of the image. For instance, an image with resolution 1368 × 1712
means that the image is decomposed into a grid of 1368 × 1712 small squares
(pixels), each one having an independent intensity of grey color. This corresponds
to a matrix of size 1368 × 1712, where each entry encodes the intensity of grey
color in the corresponding pixel. In some cases, it may be preferable to reduce
the amount of stored information, at the cost of loosing some definition in the
image. This leads to the question on how to get the best compromise between the
111
7 Least squares problems
amount of stored information and the quality of the image. The SVD is helpful
in this question, based on Equation (7.12) and Theorem 22. More precisely, the
decomposition in (7.12) allows us to store the information of the matrix A ∈ Cm×n
through the sum of r rank-1 matrices, together with r real values (the singular
values σ1 ≥ · · · ≥ σr > 0). This is a total amount of r (m + n) + r numbers, since
each rank-1 matrix of size m × n is stored in 2 vectors, one of size m × 1, and the
other one of size 1 × n. Of course, if r = m = n (so the matrix is invertible), then
this amounts to 2n2 + n, which is more information than storing directly the n × n
whole matrix. However, when the matrix is rectangular, and m << n or n << m,
or even when it is square n × n but r << n, this quantity can be much smaller than
mn, which is the storage requirement of the whole matrix. Then, what we can do
is to replace the whole matrix A by a good low-rank approximation. The notion
of “good” here depends on the particular case, but the idea is to find a low rank
approximation that allows to decrease the storage cost without loosing too much
definition in the image. Theorem 22 tells us that the best rank-ρ approximation
to a given matrix (in terms of the distance induced by the 2-norm) is obtained by
truncating the sum (7.12), namely removing the last addends, and keeping the first
ρ addends, which correspond to the largest singular values. The compression ratio
(that is usually presented as a percentage) is the ratio between the information that
is needed to store the low-rank approximation and the one for the full matrix, and
is given by
ρ ( m + n + 1)
cρ = .
mn
Note that if ρ = min{m, n} then cρ > 1, so this approach is not useful for low-rank
approximations with rank close to the full rank of the original matrix. However,
the good news is that low-rank approximations with low-to-moderate rank can
provide very good results.
In general, a rank-100 approximation (that is, using only the largest 100 singular
values) is enough to get a good approximation to the original image. However,
the amount of information stored in this approximation is much less than the one
stored in the original matrix. In particular, for an image with 1368 × 1712 reso-
lution, the information required to store the rank-100 approximation as a matrix
(7.12) with ρ = 100 is equal to: 100(1368 + 1712 + 1) = 308100 bytes. Compared
to the size of the whole matrix this gives a compression rate of
308100
cρ = = 0.1315,
1368 · 1712
which means that the information stored in the approximation is only 13.15% of
the whole information, which is an important saving.
112
7 Least squares problems
Figure 7.3 shows three compressions of the same 720 × 1080 image, which has
full row rank (namely, 720). These images correspond to three low-rank approxi-
mations, whose rank and compression ratio are shown in Table 7.1
Table 7.1: Rank and compression ratio of the images in Figure 7.3.
113
7 Least squares problems
Note that, since σ1 , . . . , σr are nonzero, the matrix Σ+ is invertible. This leads to
the notion of pseudoinverse.
is the pseudoinverse of A.
The basic properties of the pseudoinverse that we are going to use in this course
are summarized in the following result.
(i) A† ∈ Rn×m .
Ir 0r×(n−r)
†
(ii) A A = V V⊤.
0(n−r)×r 0n −r
(iv) If r = n then A† A = In .
114
7 Least squares problems
Σ+
h i
A⊤ A = V Σ+ 0r×(m−r) U⊤U V⊤
0(m−r)×r
i Σ+
h
= V Σ+ 0r×(m−r) V⊤
0(m−r)×r
= VΣ2+ V ⊤ .
Since V and Σ+ are both invertible, the product VΣ2+ V ⊤ is also invertible. More-
over, ( A⊤ A)−1 = V (Σ2+ )−1 V ⊤ , so
h i h i
( A⊤ A)−1 A⊤ = V (Σ2+ )−1 V ⊤ V Σ+ 0r×(m−r) U ⊤ = V (Σ2+ )−1 Σ+ 0r×(m−r) U ⊤
h i
= V Σ− +
1
0 ⊤ †
r ×(m−r ) U = A .
Theorem 23 Let A ∈ Rm×n , with m ≥ n and rank( A) = r, and let A† be the pseudoin-
verse of A. Then:
(i) If r = n, the LSP associated to the SLE (4.1) has a unique solution and it is given
by bx = A† b.
115
7 Least squares problems
(ii) If r < n then the LSP has infinitely many solutions that can be written as:
x = A † b + α r +1 V ( : , r + 1 ) + · · · + α n V ( : , n ),
b αr+1 , . . . , αn ∈ R. (7.15)
Proof: We are going to prove only claim (i), using the SVD of A (7.10):
y = Σ− 1 ⊤
+ (U b )(1 : n ),
namely h i
x = VΣ−
b +
1
( U ⊤
b )( 1 : n ) = V Σ −1
+ 0 ⊤ †
n×(m−n) U b = A b.
As for claim (ii), we are just going to give an idea on why it is true. When r < n,
it can be proved that b x = A† b is a particular solution of the LSP. Now, for any
vector, z, in the null space of A, we have
∥ A(b
x + z) − b∥2 = ∥ Ab
x + Az − b∥2 = ∥ Ab
x − b ∥2 ,
116
8 Numerical Optimization
In this chapter we look for (local) maxima and minima of vector functions f :
Rn → R. We start, in Section 8.1, by introducing the basic notions and results, and
then, in Section 8.2, we introduce the numerical methods that are studied in this
course.
B(x0 , r ) := {x ∈ Rn : ∥x − x0 ∥2 < r }.
117
8 Numerical Optimization
∂2 f
When i = j in the previous definition, we write ∂2 x (x0 ). From elementary Calculus
i
we know that
∂2 f ∂2 f
( x0 ) = ( x0 ). (8.2)
∂xi ∂x j ∂x j ∂xi
The Hessian of the function f at x0 is the n × n matrix
∂2 f ∂2 f ∂2 f
∂2 x1
( x0 ) ∂x1 ∂x2 ( x0 ) ... ∂x1 ∂xn ( x0 )
∂2 f ∂2 f ∂2 f
∂x2 ∂x1 (x0 ) ( x0 ) ... ∂x2 ∂xn ( x0 )
∂2 x2
∇2 f ( x0 ) : = .. .. .. .. .
. . . .
∂2 f ∂2 f ∂2 f
∂xn ∂x1 ( x0 ) ∂xn ∂x2 ( x0 ) ... ∂2 x n
( x0 )
By (8.2), the Hessian matrix is symmetric. We recall the following notions for
symmetric matrices.
(v) indefinite if there are two vectors x, y such that x⊤ Sx > 0 and y⊤ Sy < 0.
We know from basic Linear Algebra that every real symmetric matrix has real
eigenvalues. The following result characterizes the definiteness of a symmetric
matrix in terms of the sign of its eigenvalues.
(v) indefinite if and only if S has at least one positive and one negative eigenvalue.
118
8 Numerical Optimization
for y = Qx. Since Q is invertible, y is any vector in Rn . Now the result follows
taking into account that
λ1 y1
.. ..
y⊤ Dy = y1 · · · yn 2 2
. . = λ1 y1 + · · · + λ n y n .
λn yn
□
If f is twice differentiable at x0 , we know from basic Calculus that it can be
expanded as a Taylor expansion around x0 :
1
f (x) = f (x0 ) + ∇ f (x0 )⊤ (x − x0 ) + (x − x0 )⊤ ∇2 f (x0 )(x − x0 ) + o (∥x − x0 ∥22 ),
2
(8.3)
for x ∈ R close enough to x0 , where o (∥x − x0 ∥2 ) contain terms such that
n 2
o (∥x − x0 ∥22 )
lim = 0.
x → x0 ∥x − x0 ∥22
Now we introduce the basic notions of this chapter.
Definition 19 (Critical and saddle point, local maxima and minima). Given a func-
tion f : Rn → R and a point x0 ∈ Rn , we say that x0 is
(ii) a local minimum of f if f (x) > f (x0 ), for all x ∈ B(x0 , r ) and some r > 0,
(iii) a local maximum of f if f (x) < f (x0 ), for all x ∈ B(x0 , r ) and some r > 0,
(iv) a non-strict local minimum of f if f (x) ≥ f (x0 ), for all x ∈ B(x0 , r ) and some
r > 0,
(v) a non-strict local maximum of f if f (x) ≤ f (x0 ), for all x ∈ B(x0 , r ) and some
r > 0,
(vi) a saddle point if ∇ f (x0 ) = 0 and, for any r > 0, there are x1 , x2 ∈ B(x0 , r ) such
that f (x2 ) < f (x0 ) < f (x1 ).
119
8 Numerical Optimization
Figure 8.1 illustrates the notion of local maximum and minimum, as well as a
saddle point, and in Figure 8.2 we illustrate the notion of non-strict local min-
imum. Note that there are infinitely many such points in a straight line at the
bottom of the graph.
We are interested in computing (or approximating) local maxima and local min-
ima. For this, note that f has a local maximum at x0 if and only if − f has a local
minimum at x0 . Therefore, we can concentrate on local minima.
Now the question is: what is the connection between all the previous stuff?
Namely, what is the role played by the gradient, the Hessian, the Taylor expansion,
or Definition 18 and Theorem 24 in the context of obtaining the local minima of a
function? The answer comes from the following result.
(ii) f has a local minimum at x0 if and only if ∇ f (x0 ) = 0 and ∇2 f (x0 ) is positive
definite.
120
8 Numerical Optimization
(iii) f has a non-strict local minimum at x0 if and only if ∇ f (x0 ) = 0 and ∇2 f (x0 ) is
positive semidefinite.
Proof: All claims in the statement are a consequence of (8.3). More precisely, if
f has a local minimum at x0 then it must be f (x) > f (x0 ) for x close enough to
x0 . Note that, for x close enough to x0 , the terms 21 (x − x0 )⊤ ∇2 f (x0 )(x − x0 ) +
o (∥x − x0 ∥22 ) in (8.3) are smaller than ∇ f (x0 )⊤ (x − x0 ) (note that they are all real
numbers), so in order to have f (x) > f (x0 ) it must be ∇ f (x0 ) = 0, since otherwise
there are some x1 and x2 close enough to x0 such that ∇ f (x0 )⊤ (x1 − x0 ) > 0 and
∇ f (x0 )⊤ (x2 − x0 ) < 0. This proves claim (i).
To prove claim (ii), we first assume that x0 is a local minimum of f . Then, by
claim (i), it must be ∇ f (x0 ) = 0 so, by (8.3),
1
f (x) ≈ f (x0 ) + (x − x0 )⊤ ∇2 f (x0 )(x − x0 ), (8.4)
2
for x close enough to x0 . Therefore, ∇2 f (x0 ) must be positive definite. Conversely,
if ∇ f (x0 ) = 0 equation (8.4) holds for x close enough to x0 and, if ∇2 f (x0 ) is
positive definite, this implies that f (x) > f (x0 ), so x0 is a local minimum of f .
The proof of claim (iii) is similar to that of claim (ii), just noticing that, if ∇2 f (x0 )
is positive semidefinite, then it may happen that (x − x0 )⊤ ∇2 f (x − x0 ) = 0, for x
arbitrarily close to x0 , which gives f (x) = f (x0 ) by (8.4). □
Example 16 Let us consider the function f (x) = e x1 +2x2 sin( x1 ) + x2 , whose gradient
at some vector x is x +2x
e 1 2 (sin( x1 ) + cos( x1 ))
∇ f (x) = .
2e x1 +2x2 sin( x1 ) + 1
The gradient is zero at the points
x1 = 7π
4 + 2πn n ∈ Z,
1
x2 = − 2 ( x1 + log(−2 sin( x1 ))).
Let us analyze what happens for n = 0 above. In this case the point is
" 7π
#
x1 4 √
x0 : = = .
x2 − 21 7π4 + log( 2)
The Hessian of f at x is
2 x1 +2x2 2 cos( x1 ) 2(sin( x1 ) + cos( x1 ))
∇ f (x) = e ,
2(sin( x1 ) + cos( x1 )) 4 sin( x1 )
121
8 Numerical Optimization
This is a diagonal matrix with positive diagonal entries, so it is positive definite. Therefore,
x0 is a local minimum of f .
In plain words, in a convex function the segment joining the images of f (x) and
f (y) is above the graph of the function, as it is illustrated in Figure 8.3.
The function to the left in Figure 8.1 is strictly convex, whereas the one in Figure
8.2 is convex, but not strictly convex. The functions in the middle and the right of
Figure 8.1, however, are not convex.
An example of strictly convex function is f : R2 → R defined by f ( x, y) =
x + y2 .
2
However, it may happen that not all points in the segment joining x and y
belong to the domain or the subset of the domain where we are searching the
local minima. This leads to the following notion.
122
8 Numerical Optimization
Definition 21 (Convex set). A set D ⊆ Rn is convex if the segment joining any two
points x, y ∈ D belongs to D . Namely, αx + (1 − α)y ∈ D , for all α ∈ [0, 1].
Examples of convex sets are the whole Rn , an open ball B(x, r ), or an n-dimensional
rectangle [ a1 , b1 ] × · · · × [ an , bn ].
We are going to see in Theorem 27 that convex functions over convex sets have
just one local minimum which is also a global minimum. This is a great advantage
against general functions, where the number of local minima is not known (and
not even whether such a minimum exists). In order to prove Theorem 27 we first
provide in Theorem 26 some properties of convex functions.
(b) If f is twice differentiable with continuous second derivative, then f is convex if and
only if
∇2 f (x) is positive semidefinite for all x ∈ D . (8.7)
Parts (a) and (b) remain true if we replace “convex” by “strictly convex”, the inequality
in (8.6) by a strict inequality and “semidefinite” in (8.7) by “definite”.
Proof: (a) Let us first assume that f is convex. Then, f ((1 − α)x + αy) ≤ (1 −
α) f (x) + α f (y), for any α ∈ [0, 1], which is equivalent to
f (x + α(y − x)) − f (x)
≤ f ( y ) − f ( x ),
α
and, taking the limit as α → 0, the previous inequality, together with (8.1), gives
f (x + α(y − x)) − f (x)
∇ f (x)⊤ (y − x) = Dy−x f (x) = lim ≤ f ( y ) − f ( x ),
α →0 α
so (8.6) holds.
Now assume that (8.6) holds and let us prove that f is convex. Let x, y ∈ D and
α ∈ [0, 1], and set z := αy + (1 − α)x. By assumption
f (y) ≥ f (z) + ∇ f (z)⊤ (y − z)
f ( x ) ≥ f ( z ) + ∇ f ( z ) ⊤ ( x − z ).
Multiplying the first inequality above by α and adding up the second one multi-
plied by (1 − α) we obtain:
α f ( y ) + (1 − α ) f ( x ) ≥ f (z) + ∇ f (z)⊤ (x − z) + α∇ f (z)⊤ (y − z)
= f (z) + ∇ f (z)⊤ (αy + (1 − α)x − z) = f (αy + (1 − α)x)),
123
8 Numerical Optimization
Exercise 4 Prove that if z⊤ ∇2 f (x)z ≥ 0 for all z ∈ B(x, r ), for some r > 0, then
∇2 f (x) is positive semidefinite.
Proof: Let us first prove the statement for f being convex. If x0 is a local minimum
we already know, by Theorem 25, that ∇ f (x0 ) = 0. Now, let us assume that
∇ f (x0 ) = 0. The Taylor expansion (8.3) implies that, for y close enough to x0 ,
1
f (y) ≈ f (x0 ) + (y − x0 )⊤ ∇2 f (x0 )(y − x0 ).
2
124
8 Numerical Optimization
• The function may have more than one local minimum. In this case, we expect
the sequence to approximate just one of them. However, if the function
is convex (and twice differentiable with continuous second derivative), we
know, by Theorem 27, that the minimum is unique.
• (Stopping criterion): We should provide a criterion for the method (or algo-
rithm) to stop at some point. There are several options, and we can include
all them as stopping criterion in the particular algorithm, or just some of
them (the only necessary condition is to include at least one, since otherwise
the algorithm would not stop!). The standard ones are the following:
– Fix the number of iterations. The advantage of this criterion is that
we guarantee the termination of the algorithm. However, this does not
allow us to know at all whether the output is a good approximation to
the solution or not.
– Impose a tolerance (ε) on the (norm of the) gradient, namely: ∥∇ f (xk )∥2 ≤
ε. If x∗ is a local minimum, Theorem 25 guarantees that ∇ f (x∗ ) = 0.
Therefore, we expect that, if xk is a good approximation to x∗ , then
∇ f (xk ) is close to zero, namely its norm is small. However, a small
value of the norm of the gradient does not guarantee that xk is close
enough to x∗ .
– Impose a tolerance in the change of iterations, namely: ∥xk+1 − xk ∥2 < ε.
The idea behind this condition is that when there is no much difference
in two consecutive outputs, we cannot expect too much improvement
in further iterations.
125
8 Numerical Optimization
x k +1 = x k + α k d k , (8.10)
where dk is a suitable direction, called the line search, and αk is a real number. In
other words, the vector xk+1 is obtained from xk by adding up a suitable vector in
some particular direction dk . What is important for the method is to appropriately
choose both the search direction, dk , and the stepsize αk . The choice of these two
ingredients determine the method. The common feature of all descent methods is
that, provided that αk > 0, the direction search dk satisfies:
• d⊤k ∇ f ( xk ) < 0 if ∇ f ( xk ) ̸ = 0. The reason for imposing this condition is the
following: the vector ∇ f (xk ), which determines the direction of maximal
increase of f , divides the space Rn in two parts: one part where vectors v ∈
Rn satisfy v⊤ ∇ f (xk ) > 0 and the other one for vectors with v⊤ ∇ f (xk ) < 0.
The first one contains the “uphill” directions, where the function f increases,
and the second one contains the “downhill” directions, where f decreases.
Since we are looking for a new iteration xk+1 such that f (xk+1 ) < f (xk )
(namely we want f to decrease), it is natural to choose vectors pointing to
the second part.
• d⊤
k ∇ f ( xk ) = 0 if ∇ f ( xk ) = 0. In this (very unlikely) case, xk+1 = xk , so the
method terminates at xk , since we have reached a point where ∇ f (xk ) = 0,
which is the local minimum.
Depending on the choice of dk , we obtain the three different methods that are
considered in this course, namely:
• Newton’s method: dk = −[∇2 f (xk )]−1 ∇ f (xk ).
126
8 Numerical Optimization
Newton’s method
Let us consider the second-order Taylor polynomial of the function f around the
kth iteration, xk :
1
q(x) = f (xk ) + ∇ f (xk )⊤ (x − xk ) + (x − xk )⊤ ∇2 f (xk )(x − xk ).
2
Instead of looking at the minimum of f , we look at the minimum of q (and this is
the only idea behind the method). So, in order to get the next iteration, xk+1 , we
impose that it is the minimum of q, namely ∇q(xk+1 ) = 0. If we differentiate in
the expression above for q
127
8 Numerical Optimization
• Compute ∇ f (xk ).
• Compute ∇2 f (xk ).
• Update xk+1 = xk + dk .
• Check convergence.
To solve the first problem we can reduce the length of the step in the form xk+1 =
xk + αk dk , so we end up with an expression like (8.10).
As for the second drawback we can use a “shifted Hessian”
Fk := ∇2 f (xk ) + γk I, γk > 0,
128
8 Numerical Optimization
Despite the previous drawbacks, the good news is that, provided that f is twice
differentiable and ∇2 f (x) is Lipschitz continuous (something that, for instance, all
polynomial functions satisfy in bounded regions) and there is some local mini-
mum of x∗ where ∇2 f (x∗ ) is positive definite, then Newton’s method
For a proof of these facts, see, for instance, [NW06, Th. 3.5].
The Newton method requires computing the Hessian ∇2 f (xk ) at each step, and
this may be too expensive (we need to compute the partial derivatives of f at
xk ). Then, Inexact Newton’s methods aim to overcome this problem, replacing the
Hessian by some approximation, Bk . Some of the standard methods use a low-
rank approximation (rank-1 or rank-2) of the matrix obtained at the previous step
(starting with, for instance, B0 = I) and they are obtained from a second-order
Taylor approximation to ∇ f (xk+1 ). We are not going to see these methods in this
course, but if you are interested in them you can have a look at [BC11, §3.8].
129
8 Numerical Optimization
Figure 8.4: Approximations and search directions of the steepest descent with exact line search.
In general, it is not easy to find the solution of the line search method, and this
leads to the “inexact line search methods”, that are analyzed in Section 8.2.3. How-
ever, if we assume that, at each step, αk is the solution of the line search problem,
then the steepest descent method produces a “zig-zag” iteration due to the follow-
ing result.
Lemma 11 If αk is the solution of the line search problem and xk+1 = xk + αk dk then
∇ f (xk+1 )⊤ dk = 0. (8.12)
Proof: Note that the objective function ϕ(α) = f (xk + αdk ) of the line search
problem is a composition ϕ(α) = ( f ◦ g)(α), with g(α) = xk + αdk . Then, applying
the chain rule, the derivative of ϕ is:
n
∂f ∂((xk )i + α(dk )i )
ϕ′ (α) = ∑ ∂xi (xk + αdk ) ∂α
= ∇ f (xk + αdk )⊤ dk .
i =1
• The method is convergent for any starting point x0 under some “mild con-
ditions”.
130
8 Numerical Optimization
f ( x k +1 ) − f ( x ∗ ) κ−1 2
≈ .
f (xk ) − f (x∗ ) κ+1
This means that, as k increases, the approximation provided by the (k + 1)st
iteration does not improve very much the one in the previous step.
• Set α = 1.
However, the previous algorithm may result in very small reductions of f , mak-
ing the algorithm quite slow and expensive. Another strategy is to choose another
“loop” criterion. For instance, we choose another value 0 < β < 1 and impose
131
8 Numerical Optimization
• f (xk + tdk ) ≈ f (xk ) + t∇ f (xk )⊤ dk < f (xk ) + αt∇ f (xk )⊤ dk , so the algorithm
eventually terminates (see [BV09, p. 465]).
• Set α = 1.
132
9 Numerical integration
a
f ( x )dx ≈ ∑ ωi f ( x i ), (9.2)
i =0
where x0 , . . . , xn are the quadrature nodes and ω0 , . . . , ωn are the quadrature weights.
A relevant property that we are going to impose to these formulae is given in the
following definition.
Definition 22 We will say that a quadrature rule (9.2) is exact for polynomials of
degree n if it gives the exact value of the integral when f ( x ) is a polynomial of degree at
most n.
The ideal situation is to approximate the integral (9.1) with as much accuracy
as desired. In particular, given a “tolerance”, tol, we want to obtain an algorithm
that allows us to choose the quadrature nodes and weights in such a way that
Z b n
f ( x )dx − ∑ ωi f ( xi ) ≤ tol. (9.3)
a i =0
133
9 Numerical integration
x0 = a,
b−a
x1 = a+ ,
n
..
.
b−a
xk = a+k· ,
n
..
.
b−a
xn = a + n · = b.
n
Step 3: Construct the interpolating polynomial through the nodes ( x0 , f ( x0 )), . . . , ( xn , f ( xn )).
In the Lagrange formula, this polynomial is
n x − xj
Pn ( x ) = ∑ f ( xk )ℓk ( x ), ℓk ( x ) = ∏
xk − x j
.
k =0 j̸=k
a
f ( x )dx ≈
a
Pn ( x )dx = ∑ f ( xk )
a
ℓk ( x )dx.
k =0
a
f ( x )dx ≈ ∑ f ( xk )αk , αk =
a
ℓk ( x )dx.
k =0
The most elementary cases of these formulas are the ones obtained for n = 1
(trapezoid’s rule) and n = 2 (Simpson’s rule), which are illustrated in Figure 9.1.
134
9 Numerical integration
Rb f ( a)+ f (b)
a
f ( x )dx ≈ (b − a) · 2
Rb
b− a
a
f ( x )dx ≈ 6 · f ( a) + 4 · f a+2 b + f (b)
135
9 Numerical integration
a
p( x )dx = ∑ p( xk )αk , (9.5)
k =0
a
ℓ j ( x )dx = ∑ ℓ j ( xk )αk = α j ,
k =0
α0 + α1 + · · · + α n = 1 ( j = 0),
1
x0 α0 + x1 α1 + · · · + x n α n = 2 ( j = 1),
1
x02 α0 + x12 α1 + · · · + xn2 αn = 3 ( j = 2), (9.6)
.. ..
. .
1
x0 α0 + x1 α1 + · · · + xnn αn
n n = n +1 ( j = n ),
136
9 Numerical integration
Example 17 Let us obtain the values of the coefficients of the simple closed Newton-Cotes
formulas for the first three values of n:
Z b
b−a
f ( x )dx ≈ · ( f ( a) + f (b)) (Trapezoid’s rule)
a 2
Z b
b−a
a+b
f ( x )dx ≈ · f ( a) + 4 · f + f (b) (Simpson’s rule)
a 6 2
• n = 3 (Simpson’s 3/8 rule): When solving the corresponding system (9.6) and
using (9.4) we obtain
Z b
b−a
a+b a+b
f ( x )dx ≈ · f ( a) + 3 · f a+ +3· f a+2· + f (b) (Simpson’s 3/8 rule)
a 8 3 3
137
9 Numerical integration
( b − a ) n +2
En ( f ) ≤ · max | f (n+1) ( x )| . (9.7)
( n + 1) ! a ≤ x ≤ b
( b − a ) n +2
Remark 21 The error bound (9.7) is a product of two factors. The first of them, (n+1)! ,
depends on the integration interval, and not on the function. This factor is small if the
length of the interval is too small (b − a << 1), and tends to 0 as n tends to infinity if
138
9 Numerical integration
the length of the interval, b − a, is less than 1. The second factor, maxa≤ x≤b | f (n+1) ( x )|,
depends on the function f and its oscillations in the interval [ a, b] (which are determined
by the derivatives of the function). In summary, if these oscillations are bounded and the
length of the interval is less than 1, the simple closed Newton-Cotes formulas tend to the
exact value of the integral as n tends to infinity.
Remark 21 tells us that, for small intervals and functions which are “smooth
enough”, the simple Newton-Cotes formulas converge to the value of the integral
by increasing the number of nodes. Nonetheless, we may be interested in com-
puting an integral over a relatively large interval. This is addressed in the next
section.
[ a, b] = [ a, x1 ] ∪ [ x1 , x2 ] ∪ · · · ∪ [ x p−1 , b]
x0 = a, x1 = a + h N , . . . , x j = a + j · h N , . . . , x N = a + N · h N = b.
where, in each integral of the right-hand side we apply the simple closed Newton-
Cotes formula in n + 1 nodes.
139
9 Numerical integration
Let us explicitly write the formulas that are obtained in the simplest two cases
(n = 1, 2) of the previous procedure.
Z b
hN
f ( x )dx ≈ ( f ( a) + 2 f ( x1 ) + · · · + 2 f ( x N −1 ) + f (b))
a 2
Z b
hN
f ( x )dx ≈ ( E + 2I )
a 2
where
– E = f ( a) + f (b): is the sum of the evaluations of f in the endpoints of
the interval.
– I = f ( x1 ) + · · · + f ( x N −1 ): is the sum of the evaluations of f in the
inner nodes.
140
9 Numerical integration
where
• E = f ( a) + f (b): is the sum of the evaluations of f in the endpoints of the
interval.
b − a n +1 n n +1
E( f ) ≤ (b − a) · · max | f (n+1) ( x )| (n odd)
N ( n + 1) ! a ≤ x ≤ b
Analogously if n is even:
b − a n +2 n +2
E( f ) ≤ cn · (b − a) ·n · max | f (n+2) ( x )| (n even)
N a≤ x ≤b
Remark 22 If the number of interpolation nodes, n, is fixed, and f is a function with con-
tinuous derivatives in [ a, b] (up to degree, at least, n + 1), then | f (n+1) ( x )| is bounded, so
maxa≤ x≤b | f (n+1) ( x )| is a fixed number (the same happens with | f (n+2) ( x )| for n even).
Then, when N → ∞, the error E( f ) tends to 0. In other words: the composite closed
Newton-Cotes formulas guarantee the convergence of the approximation if f is “continu-
ous enough” (namely, if its derivatives are continuous).
141
9 Numerical integration
Unlike Remark 21, Remark 22 tells us that, in order for the composite closed
Newton-Cotes formulas to converge, it is not necessary to impose that the deriva-
tives of f are bounded: it is enough to impose that they are continuous. The
continuity condition is, in general, much easier to check than the regularity of the
derivatives. For instance, all polynomial and rational functions have continuous
derivatives (in those values where the denominator is not zero for rational func-
tions), as well as all trigonometric ans exponential functions. Similarly, it is not
necessary to increase the number of nodes to obtain a quadrature formula that
converges to the integral. It is enough to fix the number of nodes and to increase
the number of subintervals in the composite formula. This is the main advantage
of the composite formulas over the simple ones. Nonetheless, the composite formulas present a relevant drawback, which we analyze below.
where c is a constant. This formula indicates that the convergence order of A(h)
is n. For instance, as we have seen in Section 9.2.1, the composite closed Newton-
Cotes formulas are of order n + 1 (if n is odd) or n + 2 (if n is even). Now, the
function
\[
R(h,t) := \frac{t^n A(h/t) - A(h)}{t^n - 1}, \tag{9.9}
\]
has convergence order \(n+1\), since
\[
R(h,t) = \frac{t^n\left(A + c\left(\frac{h}{t}\right)^{n} + O(h^{n+1})\right) - \left(A + c\,h^{n} + O(h^{n+1})\right)}{t^n - 1} = A + O(h^{n+1}).
\]
The function \(R(h,t)\) is known as the Richardson extrapolation of \(A(h)\). In many situations it is easier to reach the prescribed tolerance using \(R(h,t)\) than by reducing the step size, since, as we have seen, reducing the step size involves more calculations and more evaluations of the given function, which not only are more expensive but may also introduce larger roundoff errors.
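As a small illustration (with our own choice of function and step sizes), take \(A(h)\) to be the composite trapezoid rule from the comptrap sketch above, whose order is 2; extrapolating with \(t = 2\) gives a noticeably more accurate approximation (in fact, \(R(h,2)\) turns out to be the composite Simpson rule):

f = @(x) exp(x); a = 0; b = 1;
exact = exp(1) - 1;
Ah  = comptrap(f,a,b,10);              % A(h)   with h = (b-a)/10
Ah2 = comptrap(f,a,b,20);              % A(h/2), i.e. t = 2
R = (2^2*Ah2 - Ah)/(2^2 - 1);          % Richardson extrapolation (9.9) with n = 2
fprintf('trapezoid error:     %10.3e\n', abs(Ah2 - exact))
fprintf('extrapolation error: %10.3e\n', abs(R - exact))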
so \(R(h,2)\) is equal to
\[
R(h,2) = \frac{2^4\,S_2 - S}{2^4 - 1} = \frac{16\,S_2 - S}{15},
\]
where \(S\) denotes the simple Simpson rule on \([a,b]\) and \(S_2\) the composite Simpson rule with 2 subintervals.
It can be checked, solving the system (9.6) for n = 4, that the previous extrap-
olated Simpson rule is the simple closed Newton-Cotes formula for 5 nodes, so it
is exact for polynomials with degree at most 5.
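This exactness can be checked numerically on, for instance, \(f(x) = x^5\) (a sketch with our own variable names):

f = @(x) x.^5;  a = 0;  b = 1;  m = (a+b)/2;
S  = ((b-a)/6)*(f(a) + 4*f(m) + f(b));                % simple Simpson rule on [a,b]
S2 = ((m-a)/6)*(f(a) + 4*f((a+m)/2) + f(m)) + ...
     ((b-m)/6)*(f(m) + 4*f((m+b)/2) + f(b));          % Simpson rule on each half
Q  = (16*S2 - S)/15;                                   % extrapolated Simpson rule
fprintf('Q = %.15f, exact = %.15f\n', Q, 1/6)          % both values should coincide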
Step 1: Compute S and S2 as in Section 9.3.2 (namely: the simple Simpson rule
and the composite Simpson rule with 2 subintervals).
Step 2: Evaluate E = |S − S2 |, which gives an estimation of the error.
If \(E \le \mathrm{tol}\), then
\[
\int_a^b f(x)\,dx \approx \frac{16\,S_2 - S}{15} = Q,
\]
and we are finished.
If \(E > \mathrm{tol}\), proceed with the next step.
Step 3: If \(E > \mathrm{tol}\), repeat steps 1 and 2 in each of the two halves of the interval, \([a,(a+b)/2]\) and \([(a+b)/2,b]\) (with tolerance \(\mathrm{tol}/2\) in each of them), and add the two resulting approximations.
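A minimal recursive sketch of this adaptive strategy (the function name adaptsimp and the stopping details are our own simplified choices; the quadrature codes of [M04] implement more refined versions of the same idea):

function Q = adaptsimp(f,a,b,tol)
% Adaptive Simpson quadrature following Steps 1-3 above (simplified sketch).
m  = (a+b)/2;
S  = ((b-a)/6)*(f(a) + 4*f(m) + f(b));                 % simple Simpson rule
S2 = ((m-a)/6)*(f(a) + 4*f((a+m)/2) + f(m)) + ...
     ((b-m)/6)*(f(m) + 4*f((m+b)/2) + f(b));           % Simpson rule on the two halves
E  = abs(S - S2);                                       % error estimate
if E <= tol
    Q = (16*S2 - S)/15;                                 % accept the extrapolated value
else
    Q = adaptsimp(f,a,m,tol/2) + adaptsimp(f,m,b,tol/2);% recurse on each half
end
end

For instance, adaptsimp(@(x) sin(x),0,pi,1e-8) should return a value very close to 2.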
10 Numerical differentiation
The basic idea of numerical differentiation is to approximate the derivative of \(f\) at a given point \(x_0\) by the derivative of a polynomial \(P(x)\) that interpolates \(f\) at certain nodes:
\[
f'(x_0) \approx P'(x_0).
\]
We are going to consider equispaced nodes (the stepsize will be denoted, as usual,
by h). There are three standard ways to consider these nodes (the choice can
depend on the information we have at hand, namely the nodes where we are
allowed to evaluate f ), which are indicated in Table 10.1
Type                                   Nodes
Forward differentiation                \(x_0,\ x_0+h,\ \ldots,\ x_0+nh\)
Backward differentiation               \(x_0,\ x_0-h,\ \ldots,\ x_0-nh\)
Centered differentiation (\(n\) even)  \(x_0,\ x_0-h,\ x_0+h,\ \ldots,\ x_0-\frac{n}{2}h,\ x_0+\frac{n}{2}h\)
Table 10.1: Nodes used in the numerical differentiation formulas
In the following subsections, we are going to show the specific formulas that we
get for small values of n (namely, n = 1, 2) when P( x ) is the Lagrange interpolating
polynomial introduced in Definition 11.
Type                        Formula
Forward differentiation     \(f'(x_0) \approx \dfrac{f(x_0+h)-f(x_0)}{h}\)
Backward differentiation    \(f'(x_0) \approx \dfrac{f(x_0)-f(x_0-h)}{h}\)
Table 10.2: Numerical differentiation formulas for \(n = 1\)
Type                        Formula
Forward differentiation     \(f'(x_0) \approx \dfrac{-3f(x_0)+4f(x_0+h)-f(x_0+2h)}{2h}\)
Backward differentiation    \(f'(x_0) \approx \dfrac{3f(x_0)-4f(x_0-h)+f(x_0-2h)}{2h}\)
Centered differentiation    \(f'(x_0) \approx \dfrac{f(x_0+h)-f(x_0-h)}{2h}\)
Table 10.3: Numerical differentiation formulas for \(n = 2\)
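These formulas are easy to try directly in matlab. A small sketch (the function, point and step are arbitrary choices of ours):

f = @(x) exp(sin(x));  x0 = 1;  h = 0.1;
exact = cos(x0)*exp(sin(x0));                       % f'(x0) computed by hand
fw1 = (f(x0+h) - f(x0))/h;                          % forward,  n = 1
bw1 = (f(x0) - f(x0-h))/h;                          % backward, n = 1
fw2 = (-3*f(x0) + 4*f(x0+h) - f(x0+2*h))/(2*h);     % forward,  n = 2
ct2 = (f(x0+h) - f(x0-h))/(2*h);                    % centered, n = 2
[fw1 bw1 fw2 ct2] - exact                           % errors of the four formulas

The \(n = 2\) formulas should already give visibly smaller errors than the \(n = 1\) ones for this value of \(h\).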
\[
E'(x) = f'(x) - P'(x) = \frac{f^{(n+1)}(\eta_x)}{(n+1)!}\cdot\Pi'(x) + \frac{f^{(n+2)}(\xi_x)}{(n+2)!}\cdot\Pi(x), \tag{10.1}
\]
where \(\Pi(x) = (x-x_0)(x-x_1)\cdots(x-x_n)\).
Formula (10.1) is obtained by differentiating the interpolation error. Recall that this error can be written as
\[
E(x) = f(x) - P(x) = \frac{f^{(n+1)}(\eta_x)}{(n+1)!}\,\Pi(x),
\]
and also, in terms of divided differences, as \(E(x) = f[x_0,x_1,\ldots,x_n,x]\,\Pi(x)\). Differentiating,
\[
E'(x) = f'(x) - P'(x) = \frac{f^{(n+1)}(\eta_x)}{(n+1)!}\cdot\Pi'(x) + \Pi(x)\,\frac{d}{dx}\,f[x_0,x_1,\ldots,x_n,x],
\]
so it remains to show that
\[
\frac{d}{dx}\,f[x_0,x_1,\ldots,x_n,x] = \frac{f^{(n+2)}(\xi_x)}{(n+2)!}.
\]
Setting \(D(x) := f[x_0,x_1,\ldots,x_n,x]\), we have
\[
\begin{aligned}
\lim_{h\to 0}\frac{D(x+h)-D(x)}{h} &= \lim_{h\to 0}\frac{f[x_0,x_1,\ldots,x_n,x+h]-f[x_0,x_1,\ldots,x_n,x]}{h}\\
&= \lim_{h\to 0}\frac{f[x+h,x_0,x_1,\ldots,x_n]-f[x_0,x_1,\ldots,x_n,x]}{(x+h)-x}\\
&= \lim_{h\to 0} f[x+h,x_0,x_1,\ldots,x_n,x]\\
&= \lim_{h\to 0} f[x_0,x_1,\ldots,x_n,x,x+h]
= \lim_{h\to 0}\frac{f^{(n+2)}(\xi_{x,h})}{(n+2)!}
= \frac{f^{(n+2)}(\xi_x)}{(n+2)!},
\end{aligned}
\]
where, to get the last identity, we set \(\lim_{h\to 0}\xi_{x,h} =: \xi_x \in (a,b)\) and we use that \(f^{(n+2)}\) is continuous in \((a,b)\). □
The formula (10.1) tells us that the error in the differentiation formulas depends on the \((n+1)\)st and \((n+2)\)nd derivatives of \(f\), which is somewhat expected, since the interpolation error depends on the \((n+1)\)st derivative.
If we let \(x\) tend to one of the nodes, \(x_i\), then, since \(f^{(n+1)}\) and \(f^{(n+2)}\) are continuous in \((a,b)\), and using that \(\Pi(x_i) = 0\) and
\[
\Pi'(x_i) = \prod_{j\ne i}(x_i - x_j),
\]
we conclude that the error of the differentiation formulas at the node \(x_i\) is given by
\[
E'(x_i) = f'(x_i) - P'(x_i) = \frac{f^{(n+1)}(\eta_{x_i})}{(n+1)!}\cdot\prod_{j\ne i}(x_i - x_j). \tag{10.2}
\]
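For equispaced nodes with step \(h\), the product \(\prod_{j\ne i}(x_i-x_j)\) in (10.2) has size \(O(h^n)\), so the formulas of Tables 10.2 and 10.3 have errors of order \(O(h)\) and \(O(h^2)\), respectively. This can be observed numerically (a sketch, with an arbitrary test function of our choice):

f = @(x) exp(sin(x));  x0 = 1;
exact = cos(x0)*exp(sin(x0));
for h = [0.1 0.05 0.025]
    efw = abs((f(x0+h) - f(x0))/h - exact);           % forward,  n = 1: error O(h)
    ect = abs((f(x0+h) - f(x0-h))/(2*h) - exact);     % centered, n = 2: error O(h^2)
    fprintf('h = %6.3f   forward = %9.2e   centered = %9.2e\n', h, efw, ect)
end
% halving h should roughly halve the first error and divide the second one by 4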
11 The Fast Fourier Transform
The Fast Fourier Transform (FFT) is used to compute the Discrete Fourier Transform (DFT) in an efficient way. Hence, for completeness, we first recall the DFT in Section 11.1, even though we assume the student to be familiar with it. This section is merely a summary, and for more information you can have a look at the basic references [BF11, ChK08, SB80].
\[
p_{N,f}(x_j) = f(x_j), \qquad j = 0,\ldots,N-1.
\]
These conditions uniquely determine the coefficients \(\hat f_k\) of (11.1). More precisely, these coefficients are (see, for instance, [SB80, Th. 2.3.1.9]):
\[
\hat f_k = \frac{1}{N}\sum_{j=0}^{N-1} f(x_j)\,e^{-\frac{2\pi j k i}{N}}, \qquad k = 0,1,\ldots,N-1. \tag{11.2}
\]
The DFT consists in obtaining the coefficients (11.2) from \(f\) (namely, from the values \(f(x_j)\), for \(0 \le j \le N-1\)). The opposite problem, namely obtaining the values \(f(x_j)\) from the coefficients \(\hat f_k\), is known as the Inverse Discrete Fourier Transform (IDFT). In matrix form, the DFT (11.2) reads
\[
\hat{\mathbf f} := \begin{bmatrix} \hat f_0 \\ \hat f_1 \\ \vdots \\ \hat f_{N-1} \end{bmatrix}
= \frac{1}{N}\begin{bmatrix}
1 & 1 & \cdots & 1\\
1 & \omega_N & \cdots & (\omega_N)^{N-1}\\
\vdots & \vdots & \ddots & \vdots\\
1 & (\omega_N)^{N-1} & \cdots & (\omega_N)^{(N-1)^2}
\end{bmatrix}
\begin{bmatrix} f(x_0) \\ f(x_1) \\ \vdots \\ f(x_{N-1}) \end{bmatrix}
= \frac{1}{N}\,F_N\,\mathbf f, \tag{11.3}
\]
where \(\omega_N := e^{-\frac{2\pi i}{N}}\), \(F_N\), called the Fourier matrix of size \(N\), is the \(N\times N\) matrix whose \((i,j)\) entry is \((F_N)_{ij} = \omega_N^{(i-1)(j-1)}\), and \((\mathbf f)_i = f(x_{i-1})\), for \(1 \le i \le N\).
The IDFT can be performed using the inverse of the matrix FN . In order to get
this inverse we note the following:
• \(F_N\) is a symmetric matrix,
• therefore, the product of the \(i\)th row of \(F_N\) by the \(i\)th column of \(\overline{F_N}\) (where \(\overline{F_N}\) denotes the conjugate of \(F_N\), namely the matrix obtained from \(F_N\) by conjugating all its entries) is equal to
\[
\begin{bmatrix} 1 & (\omega_N)^i & \cdots & (\omega_N)^{(N-1)i}\end{bmatrix}
\begin{bmatrix} 1 \\ \overline{(\omega_N)^{\,i}} \\ \vdots \\ \overline{(\omega_N)^{(N-1)i}} \end{bmatrix}
= 1 + (\omega_N)^i\,\overline{(\omega_N)^{\,i}} + \cdots + (\omega_N)^{(N-1)i}\,\overline{(\omega_N)^{(N-1)i}}
= 1 + 1 + \cdots + 1 = N,
\]
where we have used that \(\omega_N = e^{-\frac{2\pi i}{N}}\), so that \((\omega_N)^{(N-1)j}\,\overline{(\omega_N)^{(N-1)j}} = 1\), for all \(j\in\mathbb N\),
• whereas the product of the \(i\)th row of \(F_N\) by the \(j\)th column of \(\overline{F_N}\), when \(i\ne j\), is equal to
\[
1 + (\omega_N)^i\,\overline{(\omega_N)^{\,j}} + \cdots + (\omega_N)^{(N-1)i}\,\overline{(\omega_N)^{(N-1)j}}
= 1 + e^{-\frac{2\pi i}{N}(i-j)} + \cdots + e^{-\frac{2\pi i}{N}(N-1)(i-j)}
= 1 + \sigma + \sigma^2 + \cdots + \sigma^{N-1} = 0,
\]
since the complex number \(\sigma := e^{-\frac{2\pi i}{N}(i-j)}\), when \(i\ne j\), is an \(N\)th root of 1 (different from 1), so it satisfies \(1+\sigma+\cdots+\sigma^{N-1} = 0\), because of the identity \(z^N - 1 = (z-1)(1+z+\cdots+z^{N-1})\).
As a consequence:
\[
F_N\,\overline{F_N} =
\begin{bmatrix}
1 & 1 & \cdots & 1\\
1 & \omega_N & \cdots & (\omega_N)^{N-1}\\
\vdots & \vdots & \ddots & \vdots\\
1 & (\omega_N)^{N-1} & \cdots & (\omega_N)^{(N-1)^2}
\end{bmatrix}
\begin{bmatrix}
1 & 1 & \cdots & 1\\
1 & \overline{\omega_N} & \cdots & \overline{(\omega_N)^{N-1}}\\
\vdots & \vdots & \ddots & \vdots\\
1 & \overline{(\omega_N)^{N-1}} & \cdots & \overline{(\omega_N)^{(N-1)^2}}
\end{bmatrix}
=
\begin{bmatrix}
N & 0 & \cdots & 0\\
0 & N & \ddots & \vdots\\
\vdots & \ddots & \ddots & 0\\
0 & \cdots & 0 & N
\end{bmatrix}
= N\,I_N.
\]
Therefore, \((F_N)^{-1} = \frac{1}{N}\,\overline{F_N}\) and, taking into account the factor \(\frac{1}{N}\) already present in (11.3), the IDFT is performed as:
\[
\mathbf f = N\,(F_N)^{-1}\,\hat{\mathbf f} = \overline{F_N}\,\hat{\mathbf f}.
\]
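These identities are easy to verify numerically. A short sketch (the size N and the test vector are arbitrary choices of ours):

N = 8;
omega = exp(-2*pi*1i/N);
FN = omega.^((0:N-1)'*(0:N-1));        % Fourier matrix: (FN)_{jk} = omega^{jk}
norm(FN*conj(FN) - N*eye(N))           % should be (essentially) zero
fv   = rand(N,1);                      % arbitrary data vector
fhat = (1/N)*FN*fv;                    % DFT, as in (11.3)
norm(conj(FN)*fhat - fv)               % the IDFT recovers fv, up to roundoff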
namely, \(\hat f_k^{\,e}\) and \(\hat f_k^{\,o}\) are the sums of, respectively, the even-numbered terms (namely, the ones with indices \(0,2,4,\ldots\)) and the odd-numbered terms (those with indices \(1,3,5,\ldots\)) in (11.2), with \(N/2\) instead of \(N\) (and removing the initial factor \(2/N\)); that is,
\[
\hat f_k^{\,e} = \sum_{j=0}^{\frac N2 -1} f(x_{2j})\,(\omega_{N/2})^{kj},
\qquad
\hat f_k^{\,o} = \sum_{j=0}^{\frac N2 -1} f(x_{2j+1})\,(\omega_{N/2})^{kj}.
\]
Then, (11.4) is equivalent to
\[
\hat f_k = \frac{1}{N}\left(\hat f_k^{\,e} + (\omega_N)^k\,\hat f_k^{\,o}\right), \tag{11.5}
\]
so it remains to prove (11.5). This is a consequence of the following chain of
identities:
\[
\begin{aligned}
\hat f_k &= \frac{1}{N}\sum_{j=0}^{N-1} f(x_j)\,(\omega_N)^{kj}
= \frac{1}{N}\sum_{j=0}^{N-1} f(x_j)\,\bigl(e^{-\frac{2\pi}{N}i}\bigr)^{kj}\\
&= \frac{1}{N}\left(\sum_{j=0}^{\frac N2-1} f(x_{2j})\,\bigl(e^{-\frac{2\pi}{N}i}\bigr)^{2kj}
+ \sum_{j=0}^{\frac N2-1} f(x_{2j+1})\,\bigl(e^{-\frac{2\pi}{N}i}\bigr)^{k(2j+1)}\right)\\
&= \frac{1}{N}\left(\sum_{j=0}^{\frac N2-1} f(x_{2j})\,\bigl(e^{-\frac{2\pi}{N/2}i}\bigr)^{kj}
+ \bigl(e^{-\frac{2\pi}{N}i}\bigr)^{k}\sum_{j=0}^{\frac N2-1} f(x_{2j+1})\,\bigl(e^{-\frac{2\pi}{N/2}i}\bigr)^{kj}\right)\\
&= \frac{1}{N}\left(\sum_{j=0}^{\frac N2-1} f(x_{2j})\,(\omega_{N/2})^{kj}
+ (\omega_N)^k\sum_{j=0}^{\frac N2-1} f(x_{2j+1})\,(\omega_{N/2})^{kj}\right)
= \frac{1}{N}\bigl(\hat f_k^{\,e} + (\omega_N)^k\,\hat f_k^{\,o}\bigr),
\end{aligned}
\]
as wanted. □
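Formula (11.5) translates directly into a recursive implementation. The following is a minimal sketch (the name myfft is ours; note that it includes the factor 1/N of (11.2), which matlab's built-in fft omits):

function fhat = myfft(fv)
% Recursive radix-2 FFT for the DFT (11.2); length(fv) must be a power of 2.
fv = fv(:);                            % work with a column vector
N  = length(fv);
if N == 1
    fhat = fv;
else
    fe = myfft(fv(1:2:N-1));           % DFT of the even-indexed samples
    fo = myfft(fv(2:2:N));             % DFT of the odd-indexed samples
    k  = (0:N/2-1).';
    w  = exp(-2*pi*1i*k/N);            % the factors (omega_N)^k
    % combination step based on (11.5); the 1/2 factors accumulate to the 1/N of (11.2)
    fhat = (1/2)*[fe + w.*fo; fe - w.*fo];
end
end

For a vector fv whose length is a power of 2, myfft(fv) should agree with (1/length(fv))*fft(fv) up to roundoff.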
As mentioned, the number of flops involved in the DFT (11.3) is \(O(2N^2)\). Let us estimate the number of flops when it is computed as in (11.4). What we need to do is to multiply the product of matrices in the right-hand side of (11.4) by \(\mathbf f\). Therefore, this approach consists of three steps:
here comes the second half of the story: if we assume that \(N = 2^\ell\), then we can iterate the previous steps, halving the size of the resulting Fourier matrix, until we end up, after \(\ell-1\) iterations, with (\(2^{\ell-1}\) copies of) \(F_2\). At the end of this process we obtain
\[
\bigl(G_{2^{\ell}}\,G_{2^{\ell-1}}\cdots G_{2^{2}}\;F_2^{\oplus 2^{\ell-1}}\,P\bigr)\,\mathbf f,
\qquad\text{where}\qquad
F_2^{\oplus 2^{\ell-1}} :=
\begin{bmatrix}
F_2 & 0 & \cdots & 0\\
0 & F_2 & \ddots & \vdots\\
\vdots & \ddots & \ddots & 0\\
0 & \cdots & 0 & F_2
\end{bmatrix}.
\]
Note that, when \(N = 2^\ell\), then \(\ell = \log_2 N\). The FFT can also be carried out for arbitrary values of \(N\in\mathbb N\) (but this is out of the scope of this course), and the overall cost is \(O(N\log_2 N)\), instead of the cost \(O(N^2)\) of the standard DFT (11.3). Table 11.1 shows the values of \(N\log_2 N\) against \(N^2\), for some values of \(N = 2^\ell\). Note that, already for \(\ell = 14\), namely \(N = 16384\), the difference between the corresponding values is quite noticeable.
Note that a similar approach can be followed for the IDFT, applying the same arguments to \(\overline{F_N}\) instead of \(F_N\).
ℓ      N        N^2          N log_2 N
10     1024     1048576      10240
12     4096     16777216     49152
14     16384    268435456    229376
Table 11.1: Values of N^2 against N log_2 N for some N = 2^ℓ
Example 18 Let us compare the execution time in matlab for the DFT of the function
f ( x ) = sin( x ) cos( x ) in [0, 2π ] in two ways: (a) with the formula (11.3) and (b) using
the FFT. In both cases we first create the vector of equispaced nodes in [0, 2π ] by typing
v=linspace(0,2*pi,N+1)
(we set the number of nodes to be N + 1 since linspace includes the right endpoint,
namely 2π, that we want to exclude). Now we evaluate the function in this vector by
typing, for instance:
f=inline(’cos(x).*sin(x)’)
fv=feval(f,v(1:N)).'
(the transpose makes fv a column vector, so that the matrix-vector product below is well defined).
Now, let us apply (11.3) to this vector. One way to get the matrix FN with matlab is
by typing fft(eye(N,N)) (see the meaning of the command fft below and then convince
yourself about this). Therefore, the DFT (11.3) can be obtained with the command
f1=(1/N)*fft(eye(N,N))*fv
As for the FFT, we just type f2=fft(fv) (note that matlab's fft does not include the factor 1/N, so f2 equals N*f1).
Table 11.2 contains the execution times (obtained with the commands tic; toc;) for N = 2^ℓ, for some values of ℓ.
Table 11.2: Execution times (in seconds) of the DFT with and without using the FFT for f(x) = sin(x) cos(x), with N = 2^ℓ.
In this section we have essentially followed the approach in [S19, §IV.1], and I recommend that you have a look at this reference for more details.
Bibliography
[HJ13] R. A. Horn, C. R. Johnson. Matrix Analysis, 2nd ed. Cambridge University Press, Cambridge, 2013.
[IK66] E. Isaacson, H. B. Keller. Analysis of Numerical Methods. John Wiley & Sons, New York, 1966.
[S19] G. Strang. Linear Algebra and Learning from Data. Wellesley-Cambridge Press, Wellesley, MA, 2019.