
Numerical Methods

course notes
Dual Bachelor in Data Science and Engineering and
Telecommunication Technologies Engineering
Bachelor in Data Science and Engineering

Fernando De Terán Vergara*

* Departamento de Matemáticas, Universidad Carlos III de Madrid, Avenida de la Universidad 30, 28911 Leganés, Spain ([email protected])
Index

Index 2

0 Notation 6

1 Short introduction to matlab 7


1.1 Basic commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.1 Built-in functions in matlab . . . . . . . . . . . . . . . . . . . 10
1.1.2 Graphics in matlab . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2.1 Creating matlab functions . . . . . . . . . . . . . . . . . . . . 13

2 Floating point arithmetic 15


2.1 Floating point system . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1 Absolute and relative separation between consecutive numbers 17
2.1.2 Roundoff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.3 Floating point operations . . . . . . . . . . . . . . . . . . . . . 20
2.2 The IEEE system in double precision . . . . . . . . . . . . . . . . . . . 21
2.2.1 matlab representation of machine numbers . . . . . . . . . . 23

3 Conditioning and stability 26


3.1 Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Stability of numerical algorithms . . . . . . . . . . . . . . . . . . . . . 35

4 Solution of linear systems of equations 39


4.1 Solution of triangular systems . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 Gaussian elimination and LU factorization . . . . . . . . . . . . . . . 43
4.2.1 The LU factorization . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.2 Solution of SLE with the LU factorization . . . . . . . . . . . 50
4.2.3 matlab commands for the solution of SLE and the LU factorization . . . 51
4.2.4 Computational cost . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 Error analysis for the solution of SLE . . . . . . . . . . . . . . . . . . 54
4.4 Banded and sparse matrices . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4.1 Solution of a SLE with a tridiagonal coefficient matrix . . . . 59


5 Numerical interpolation 62
5.1 Polynomial interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2 Piecewise linear interpolation . . . . . . . . . . . . . . . . . . . . . . . 65
5.3 Piecewise cubic interpolation . . . . . . . . . . . . . . . . . . . . . . . 66
5.4 Cubic piecewise interpolation with splines . . . . . . . . . . . . . . . 67
5.5 The Newton form and the method of divided differences . . . . . . . 71
5.5.1 Bases of polynomials and the interpolating polynomial . . . 74
5.6 Interpolation errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

6 Roots of polynomials and functions 80


6.1 The bisection method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.2 Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.3 The secant method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.4 Fixed point iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.5 Inverse quadratic iteration (IQI) . . . . . . . . . . . . . . . . . . . . . . 91
6.6 Functions in matlab . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.7 The program fzero . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.8 Convergence order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.8.1 Convergence of bisection method . . . . . . . . . . . . . . . . 95
6.8.2 Convergence of the secant method . . . . . . . . . . . . . . . . 95
6.8.3 Convergence of fixed point methods . . . . . . . . . . . . . . . 96
6.8.4 Convergence of Newton’s method . . . . . . . . . . . . . . . . 98

7 Least squares problems 99


7.1 Setting the problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.2 Orthogonal matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.3 The QR factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.3.1 Solution of least squares problems using the QR factorization 104
7.3.2 Computation of the QR factorization . . . . . . . . . . . . . . 105
7.4 The Singular Value Decomposition (SVD) . . . . . . . . . . . . . . . . 108
7.4.1 Application of the SVD to low-rank approximation . . . . . . 110
7.5 The pseudoinverse of a matrix . . . . . . . . . . . . . . . . . . . . . . 114
7.5.1 Application of the pseudoinverse to the LSP . . . . . . . . . . 115

8 Numerical Optimization 117


8.1 Unconstrained optimization . . . . . . . . . . . . . . . . . . . . . . . . 117
8.1.1 Convex functions and convex optimization . . . . . . . . . . . 122
8.2 Descent methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
8.2.1 Newton-like methods . . . . . . . . . . . . . . . . . . . . . . . 127
8.2.2 Steepest descent . . . . . . . . . . . . . . . . . . . . . . . . . . 129


8.2.3 Inexact line search . . . . . . . . . . . . . . . . . . . . . . . . . 131

9 Numerical integration 133


9.1 Basic closed Newton-Cotes formulas . . . . . . . . . . . . . . . . . . . 134
9.1.1 Computation of the weights of closed Newton-Cotes quadrature rules . . . 136
9.1.2 Error in the basic Newton-Cotes formulas . . . . . . . . . . . 137
9.2 Composite closed Newton-Cotes formulas . . . . . . . . . . . . . . . 139
9.2.1 Error in the composite closed Newton-Cotes formulas . . . . 141
9.3 Richardson extrapolation. Extrapolated Simpson’s formula . . . . . 142
9.3.1 Richardson extrapolation . . . . . . . . . . . . . . . . . . . . . 142
9.3.2 Extrapolated Simpson rule . . . . . . . . . . . . . . . . . . . . 143
9.4 Adaptive integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

10 Numerical differentiation 145


10.1 Forward, backward, and centered formulas . . . . . . . . . . . . . . . 145
10.1.1 Formulas for n = 1 . . . . . . . . . . . . . . . . . . . . . . . . . 145
10.1.2 Formulas for n = 2 . . . . . . . . . . . . . . . . . . . . . . . . . 146
10.2 Error of the previous formulas . . . . . . . . . . . . . . . . . . . . . . 146

11 The Fast Fourier Transform 149


11.1 Discrete Fourier Transform (summary) . . . . . . . . . . . . . . . . . 149
11.2 Fast Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

Bibliography 156


This course requires the use of codes that can be found in the folder (or directory) ncm of the book [M04]. As indicated at the beginning of Chapter 1 in that book, in order to make use of these codes you should either open matlab in that directory or add the directory to the matlab path (for instance, with the pathtool command).
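For instance, assuming the ncm folder has been downloaded into the current working directory, a minimal way to make its codes available for the session is:

    addpath('ncm')    % add the ncm codes to the matlab search path
    % alternatively, open the graphical path editor and add the folder there:
    pathtool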

0 Notation

Throughout this course we use the following general notation:

• R denotes the field of real numbers.

• C denotes the field of complex numbers.

• Matrices are denoted with capital letters, like A, U, Q, S, etc.

• Vectors are written in boldface style, like x, b, or v.

• The coordinates of a vector will be denoted using the same letter as for
the vector and adding a subindex for the corresponding coordinate. For
instance, x1 and xi are the first and the ith coordinate of the vector x, respec-
tively.

• We use capital calligraphic letters for sets (like F for the floating point sys-
tem in Chapter 2).

1 Short introduction to matlab

In this chapter we show the basic commands that will be used by default in this
course, as well as the elementary syntax of matlab.

1.1 Basic commands


The working space of matlab is the “workspace”, which is like a local memory
where the variables introduced in each session are stored. It is erased after closing
the session.
ans: Is the solution produced by matlab when typing a command (ans is the
abbreviation of “answer”). The variable ans is recorded in the “workspace”.
clear: Erases all variables in the “workspace”.
diag(x): If x is a vector, this command creates a diagonal matrix whose diagonal
entries are those of x. If x is a matrix (namely, it has more than 1 row), then the
command extracts the principal diagonal as a column vector.
diff(x): Difference vector. It is a vector with one coordinate less than x, and
whose coordinates are the differences between the consecutive coordinates of x.
disp(x): Shows the value of the variable x.
double(variable): Transforms the variable into double precision (16 significant digits).
edit [namefun]: Allows editing a function, namefun, previously created. The code of the function is opened in a new window.
eps: Distance from 1.0 to the next machine number. It is equal to $2^{-52} = 2.2204 \cdot 10^{-16}$, which is twice the unit roundoff. It is known as the machine epsilon.
eps(x): Is the distance from x to the next machine number.
eye(n): Identity matrix of order n.
format long: Shows the floating point representation of the number with 16 dig-
its.


format short: Shows the floating point representation of the number with 4 dig-
its. Is the default format.
fzero(f,x0): Computes the root of the function f which is closest to the value x0.
help [order]: Displays an explanation of order. For instance: help sqrt.
f=inline(’function’): Creates a function, f, where ’function’ is the expres-
sion of the function which is defined, using a symbolic variable (for instance, x).
Example:

f=inline(’x^2+2*x+9’)

This allows evaluating f(a), where a is any complex number. It is also possible to create a function of several variables, f=inline(’function’,’x1’,...,’xn’).
linspace(a,b,n): Generates a vector with n equispaced coordinates between a
and b.
lookfor [word]: Displays all files of the program where this word appears in the
description.
nnz(A): Number of nonzero entries of A.
norm(A), or norm(A,2): The 2-norm of the matrix A.
norm(A,1): The 1-norm of the matrix A.
norm(A,’fro’): The Frobenius norm of the matrix A.
norm(A,inf): The infinity norm of the matrix A.
[m,pos]=max(c): Provides the maximum value (in modulus) of an entry of the
vector c (namely, m) and the position of this entry (pos) among its coordinates.
ones(m,n): Generates the m×n matrix whose entries are all equal to 1.
pi: π number.
rand(m,n): Generates a matrix of size m×n with random entries (uniform distri-
bution) within the interval [0, 1].
rank(a): Returns the (numerical) rank of a.
roots([an ... a1 a0]): Computes the roots of the polynomial $a_n z^n + \cdots + a_1 z + a_0$.
size(A): Displays the size of the matrix A.
solve(eqn,var): Solves, when possible, the equation eqn in terms of the variable var, which must have been previously created as a symbolic variable (for instance, with syms x). The equation must be inserted in the form f(x) == g(x).


sort: Sorts in ascending order. In particular, if v is a vector, sort(v) sorts the elements of v in ascending order (sort(v,'descend') sorts them in descending order).
spy(a): Plots the nonzero entries of the matrix a.
subplot(m,n,i): Divides the figure window into an m×n array of panels, and places the current plot in the ith position.
sym(’x’): Creates a symbolic variable x. You can also use syms x.
tic [command] toc: Computes (and displays) the execution time of the command.
tril(A,k): Is the matrix whose entries on and below the kth diagonal are the same as those of A, and the remaining ones are zero (the kth diagonal consists of the entries whose indices (i, j) satisfy j − i = k).
triu(A,k): Is the matrix whose entries on and above the kth diagonal are the same as those of A, and the remaining ones are zero.
type [name]: Displays the full information of the file name.m.
vander(x): Creates a Vandermonde matrix associated to the vector x.
varargin: Means “variable number of arguments in”. Corresponds to a (cell array),
namely a vector whose coordinates are variables that can be of different type and
have different dimensions (see [HH17, §10.5]).
varargout: Means “variable number of arguments out”. Is similar to varargin,
but it is used for the output variables of a code (see [HH17, §10.5]).
vectorize(f): When f is a scalar function (or the string defining it), this command “vectorizes” it by introducing a “.” in the operations with variables, to convert them into entry-wise vector operations. For instance:
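(A small illustration, assuming the string form of the command, as used with inline functions:)

    vectorize('x^2*sin(x) + 1/x')
    % returns the string 'x.^2.*sin(x) + 1./x', which can then be
    % evaluated on vectors, entry by entry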

vpa(variable,n): Displays the value of the variable with n significant digits.


zeros(m,n): Generates the null matrix of size m×n. Similarly, zeros(size(v))
generates a null matrix with the same size as v.


1.1.1 Built-in functions in matlab


matlab already has some predefined functions. Some of them are (in alphabetical order):
acos: arc-cosine.
asin: arc-sine.
atan: arc-tangent.
cos: cosine.
exp: exponential.
log: logarithm (natural).
sin: sine.
sqrt: square root.
tan: tangent.
In all cases, in order to evaluate the function f in x, it suffices to insert, in the
command line, f(x).
By default, the values of the trigonometric functions are given in radians.

1.1.2 Graphics in matlab


matlab allows us to represent points and graphs of functions by means of the following commands:

• plot(x,y): If x,y are two vectors of the same dimension, this command
represents the points given by the corresponding coordinates, and joins each
point with the next one (in the order of the vector x) with a segment.

• loglog(x,y): Acts like the command plot, but using the logarithmic scale in both variables.

• semilogx(x,y): Acts like the command plot, but using the logarithmic scale
in the first variable.

• semilogy(x,y): Acts like the command plot, but using the logarithmic scale
in the second variable.

• scatter(x,y): Acts like the command plot, but plots the points with small
circles, without joining them.

• ezplot(’f’,x0,x1): Represents the graph of the function f in the interval [x0,x1].


• fplot(@(x) f(x),[x0,x1]): Represents the graph of the function f(x) in the interval [x0,x1].

In the commands plot, loglog, semilogx, semilogy, and scatter, it is possible to insert a third argument to specify the color of the dots and lines, the kind of mark desired for the points (no mark, or a mark like a circle, an asterisk, a cross, etc.), or the kind of line (continuous, dashed, dotted). For instance: plot(x,y,’r*--’) plots the points with red asterisks, and joins them with a red dashed line.
To overlap several plots, the hold on command can be used. This command
saves the previous plot and overlaps the next one.
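As a small sketch combining the previous commands (the particular functions plotted are just an example):

    x = linspace(0, 2*pi, 100);
    plot(x, sin(x), 'r*--')   % red asterisks joined by red dashed segments
    hold on                   % keep the previous plot
    plot(x, cos(x), 'b-')     % overlap a solid blue curve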

1.2 Syntax
The (row) vectors in matlab are introduced with brackets “[, ]”, separating the coordinates with blank spaces or commas. For instance:

v=[1 2 3 4] and v=[1,2,3,4],

produce the same vector.


To add more rows and to create a matrix, a semi-colon “;” must be inserted to
separate the rows. For instance:

A=[1 2 3 4;5 6 7 8]

is the matrix
$$A = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \end{pmatrix}.$$
The conjugate transpose of a matrix is obtained by adding an apostrophe (the character below the question mark “?” on the keyboard). For instance, if A is the previous matrix, then

$$A' = \begin{pmatrix} 1 & 5 \\ 2 & 6 \\ 3 & 7 \\ 4 & 8 \end{pmatrix}.$$

If A is a matrix, the command A(i,j) recovers the (i,j) entry of A. For instance, for the matrix A above, A(1,2) is equal to 2. If v is a vector, then v(i) is the ith coordinate of v.
The product of the matrices A and B is obtained with A*B.


It is also possible to obtain the inverse of the matrix A by means of either the
command inv(A) or A^(-1) (though they are different!).
Similarly, it is possible to multiply and to divide two vectors or matrices of the
same size entry-wise, by adding a dot, “.”, before the symbol of the operation. This
way, v.*w and v./w provide the vector obtained after multiplying and dividing
(respectively) the corresponding entries of the vectors v and w.
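For instance:

    v = [1 2 3];  w = [4 5 6];
    v.*w    % entry-wise product:  [4 10 18]
    v./w    % entry-wise division: [0.2500 0.4000 0.5000]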
The notation a:b, where a and b are integers with a ≤ b, produces the ordered list of integers between a and b. If a and b are not integers, then a:b produces the ordered list of numbers starting at a and obtained by adding 1 to the previous one, ending at the largest such number that is smaller than or equal to b. For instance: the command 2.8:5.4 produces the list 2.8, 3.8, 4.8.
Similarly, the command a:c:b produces the list of equispaced numbers between a and (the closest such number that does not go beyond) b, each one at a distance c from the previous one (where c can be positive or negative).
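Some further examples:

    2.8:5.4     % 2.8000  3.8000  4.8000
    1:0.5:3     % 1.0000  1.5000  2.0000  2.5000  3.0000
    10:-2:3     % 10  8  6  4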
We can extract submatrices from a matrix A with the following commands:

• A(i1:i2, j1:j2) is the submatrix of A consisting of the entries within rows i1 to i2 and columns j1 to j2.

• A([i1 i2 ... ik], [j1 j2 ... jℓ]) is the submatrix of A consisting of the entries which are within the rows listed in the vector [i1 i2 ... ik] and within the columns listed in the vector [j1 j2 ... jℓ].

This way, for instance, if
$$A = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \end{pmatrix},$$
then
$$A(1\!:\!2,2\!:\!4) = \begin{pmatrix} 2 & 3 & 4 \\ 6 & 7 & 8 \end{pmatrix}, \qquad A([1,3],[2,4]) = \begin{pmatrix} 2 & 4 \\ 10 & 12 \end{pmatrix}.$$
In the previous two cases, if all rows are wanted (respectively, all columns), just
type “:” in the information of the rows (respectively, the columns). For instance,


for the previous matrix A:
$$A(:,2\!:\!4) = \begin{pmatrix} 2 & 3 & 4 \\ 6 & 7 & 8 \\ 10 & 11 & 12 \end{pmatrix}, \qquad A([1\ 3],:) = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 9 & 10 & 11 & 12 \end{pmatrix}.$$

It is also possible to join some matrices to other ones in order to create a larger matrix, provided that the dimensions are compatible. For instance, if A is an m × n matrix and B is another m × q matrix, [A B] is the matrix that is obtained by placing the columns of B after those of A. Analogously, if B is q × n, [A;B] is the matrix that is obtained by placing the rows of B below those of A.
It is possible to exchange rows and columns of a matrix by means of a “permu-
tation vector”, which is a vector with n coordinates, that are the natural numbers
from 1 to n in some order. In particular, if p is such a vector, then:

• A(p,:) reorders the rows of A according to the order of the natural numbers
in the vector p (in this case, n must be the number of rows of A).

• A(:,p) reorders the columns of A according to the order of the natural num-
bers in the vector p (in this case, n must be the number of columns of A).
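For instance, the following sketch reorders the rows of a 3 × 2 matrix:

    A = [1 2; 3 4; 5 6];
    p = [3 1 2];
    A(p,:)    % rows of A in the order 3, 1, 2, that is, [5 6; 1 2; 3 4]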

1.2.1 Creating matlab functions


The first line of the function, that defines the number of input and output vari-
ables, together with the name of the function, will be:

function[output variables] = name(input variables)

matlab functions are files with the extension “.m”. For instance, the function that computes the roots of a polynomial is the file roots.m.
Everything you want to include in the file of the function that is not part of the code (for instance, explanations) must be written in a line starting with the symbol %; the first lines of roots.m, for example, are comment lines of this kind, and they form the help text of the function.
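As an illustration (the function name and its contents are made up for this sketch; they are not taken from roots.m), a small function file maxabs.m could look as follows, with the leading comment lines playing the role of the help text displayed by help maxabs:

    function [m, p] = maxabs(v)
    % MAXABS   Largest entry of the vector v in absolute value.
    %    [m, p] = maxabs(v) returns m = max(abs(v)) and the position p
    %    where it is attained.
    [m, p] = max(abs(v));
    end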


It is important to avoid displaying, in the command window, the values of the variables obtained at each iteration, since this would slow down the execution of the code. For this, a semicolon “;” should be added at the end of the line where the variable is defined.
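The following small loop (an illustrative sketch, not taken from the code of roots) shows the difference:

    s = 0;
    for k = 1:1000
        s = s + 1/k^2;   % the ";" suppresses the printing of s at each step
    end
    s                    % without ";", the final value of s is displayed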

To become familiar with the commands for creating functions (including control structures such as for, while, if, etc.), I recommend having a look at Section 6.2 in [HH17]. I also recommend this reference for any further information about matlab.

2 Floating point arithmetic

In this chapter, we will focus on how to represent and operate with real numbers.
Computers are, of course, able to do the same with complex numbers. However,
the representation and the arithmetic of complex numbers are essentially based
on those of real numbers (just recall that a complex number can be represented as
a pair of real numbers), but their analysis would complicate unnecessarily all the
arguments and developments that we will carry out in this chapter.
Since every computer has a finite memory, it is only able to store a finite number
of real numbers. In this chapter we will present a summary of some of the basic
features of the storage and the arithmetic systems used by computers. For in-
stance, we will answer questions like: How many numbers can a computer store?
How big is the distance between two consecutive machine numbers? What
happens if we introduce in the computer a number that does not belong to its
system? How does a computer perform an arithmetic operation? etc.

2.1 Floating point system


The standard system currently implemented in computers is the floating point sys-
tem. The basic motivation for creating these systems is, as we will see, to keep the
relative distance between any two consecutive numbers more or less constant.
A floating point system (FPS) in base 2 is a subset, F , of the set of real numbers
of the form:
 
$$\mathcal{F} := \left\{ \pm(1+f)\cdot 2^{e} \ :\ f = \frac{d_1}{2} + \frac{d_2}{2^2} + \cdots + \frac{d_t}{2^t} \right\} \cup \{0\}, \tag{2.1}$$

where:

• di is 0 or 1, for i = 1, . . . , t;

• t is a positive integer, known as the precision of the system;

• e is an integer between a minimum and a maximum value, that will be


denoted by emin and emax , respectively (both integers). It is known as the
exponent of the number.


The numbers of F are known as floating point numbers, or machine numbers.


The number $f$ in (2.1) is the mantissa (or fraction) of the machine number, and we represent it in abbreviated “decimal” notation as:
$$0.d_1 d_2 \cdots d_t := \frac{d_1}{2} + \frac{d_2}{2^2} + \cdots + \frac{d_t}{2^t}.$$
The reason to express the first factor of the numbers of F in (2.1) as 1 + f , instead
of f , is to guarantee that the representation of each number of F is unique (though
this uniqueness deserves a proof), something that is no longer true if we remove
the term 1 in 1 + f . For instance, the number 1 has the following two different
representations in the form $f \cdot 2^e$:
$$1 \cdot 2^{0} \ (f = 1,\ e = 0), \qquad \text{and} \qquad \frac{1}{2} \cdot 2^{1} \ \left(f = \tfrac{1}{2},\ e = 1\right),$$
whereas its representation as (2.1) is unique.
Now, we indicate several properties and basic notions of the system of machine
numbers F as in (2.1):
(i) The number 0 belongs to F , but it is represented with a special memory
position.
(ii) The number 1 belongs to F , and its representation is 1 = (1 + 0) · 20 (namely,
f = 0).
(iii) The distance between 1 and the next number of F is called machine epsilon (abbreviated in matlab notation by eps). Its value is eps $:= 2^{-t}$, since the number following 1 is $(1 + \frac{1}{2^t}) \cdot 2^0$. Half of the machine epsilon is known as the unit roundoff, and it is equal to $u := 2^{-t-1}$.
(iv) The value of the mantissa is always between 0 and 1, since:
$$f \le \frac{1}{2} + \frac{1}{2^2} + \cdots + \frac{1}{2^t} = 1 - \frac{1}{2^t} \le 1.$$
(v) In order to store the mantissa, the t binary digits of the term f are needed,
but the first factor of the number (namely 1 + f ) contains t + 1 digits, due to
the addend 1.
The following result provides an alternative representation of the machine num-
bers.
Lemma 1 (Integer representation of the machine numbers). Every nonzero number of the system F in (2.1) can be expressed as
$$x = \pm\, m \cdot 2^{e-t}, \tag{2.2}$$
for some $e_{\min} \le e \le e_{\max}$, where $m$ is a positive integer within the interval $[2^t, 2^{t+1}-1]$.


Proof: We will prove the statement only for positive numbers, since for negative ones it is analogous. From the representation of any number in F, given by $(1+f)\cdot 2^e = 2^t(1+f)\cdot 2^{e-t}$, we define $m := 2^t(1+f)$, which is a positive integer that can be bounded as:
$$2^t \le m = 2^t(1+f) \le 2^t\left(1 + \frac{1}{2} + \cdots + \frac{1}{2^t}\right) = 2^{t+1} - 1,$$
and this concludes the proof. □


The representation (2.2) indicates, in particular, that two consecutive numbers
of F having the same exponent, e, are obtained from two consecutive values of m
(namely, m and m + 1).

2.1.1 Absolute and relative separation between consecutive numbers


In this section we restrict ourselves to positive numbers, for simplicity. We want to
analyze the absolute and relative distance between two consecutive numbers in F ,
that we denote by xn and xn+1 , where xn < xn+1 . Let us first analyze the absolute
distance.
Let us fix an exponent $e_{\min} \le e \le e_{\max}$. From the expression in (2.2) we deduce that the absolute distance between two consecutive numbers with the same exponent $e$ in (2.1) is equal to
$$x_{n+1} - x_n = 2^{e-t}.$$
Moreover, and using again the representation (2.2), the largest number with exponent $e$ is $x_n = (2^{t+1} - 1) \cdot 2^{e-t}$, whereas the first (smallest) number with exponent $e+1$ is $x_{n+1} = 2^{e+1}$. Therefore, the difference between these two consecutive numbers is equal to
$$x_{n+1} - x_n = 2^{e+1} - (2^{t+1} - 1) \cdot 2^{e-t} = 2^{e-t}.$$

As a consequence, any two consecutive numbers of F between $2^e$ and $2^{e+1}$ have the same distance, which is equal to $2^{e-t}$. This means, in particular, that each time we shift the exponent by adding one unit, the separation between two consecutive numbers of F is multiplied by 2.
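This can be checked directly in matlab with the command eps(x) introduced in Chapter 1 (the values below assume IEEE double precision, with t = 52):

    eps(1)            % 2^(-52): spacing of the machine numbers just above 2^0
    eps(2)            % 2^(-51): the spacing doubles when the exponent increases by 1
    eps(2)/eps(1)     % = 2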
Figure 2.1 shows all positive numbers of a floating point system (2.1) with t =
3, emin = −4, emax = 3. As can be seen, the separation between two consecutive
numbers increases as we move to the right. You can also obtain a similar picture
for other systems, by changing the precision and the maximum and minimum
exponents using the code floatgui included in the folder ncm.


Figure 2.1: Positive machine numbers of a system with $t = 3$, $e_{\min} = -4$, $e_{\max} = 3$.

Now, let us consider the relative distance. Assume that $2^e \le x_n < x_{n+1} \le 2^{e+1}$. Then, the relative distance, $s_r$, between $x_n$ and $x_{n+1}$ satisfies
$$2^{-t-1} = \frac{2^{e-t}}{2^{e+1}} \le s_r = \frac{x_{n+1} - x_n}{x_n} = \frac{2^{e-t}}{x_n} \le \frac{2^{e-t}}{2^e} = 2^{-t}.$$
Moreover, the maximum of this relative distance, $2^{-t}$, is achieved for $x_n = 2^e$ (the smallest number with exponent $e$), and it decreases up to its minimum value, $\frac{1}{2^{t+1}-1}$, which is achieved for $x_n = (2^{t+1}-1)\cdot 2^{e-t}$ (the largest number with exponent $e$).
This fact is illustrated in Figure 2.2 for a system with precision t = 23.

Figure 2.2: Relative distance from x to the next machine number (t = 23), [H98, p. 40].

As a conclusion, the relative distance between two consecutive machine numbers is almost constant along the whole number system, but it changes (decreasing) inside each interval consisting of all numbers with the same exponent. In particular, at the end of each interval the relative distance between two consecutive numbers is half the distance at the beginning of the interval.
Another consequence of the previous arguments is the following:


The machine epsilon, eps $= 2^{-t}$, of a system with precision $t$ is the largest relative distance between two consecutive machine numbers.

2.1.2 Roundoff
A relevant question that maybe you have already considered (even if you know the
answer) is the following: What happens if you introduce in the machine a number
that is within the range of the machine number system, but does not belong to it?
The answer is the expected one: the machine “rounds” the number to the closest one. In the case of a “tie”, namely, if the number is at the same distance from two consecutive machine numbers, there are several ways to break the tie, but the standard one is to choose the number having $b_{52} = 0$ (“round to even”). Nonetheless, this detail has no relevance at all for the contents of this course.
Anyway, the mathematical formulation of roundoff is the following. Let $x \in \mathbb{R}$ be a real number (not necessarily a machine number). Let us denote by $\mathrm{fl}(x)$ the closest machine number to $x$ (read “float of $x$”). Then,
$$\left|\frac{\mathrm{fl}(x) - x}{\mathrm{fl}(x)}\right| \le \frac{\mathrm{eps}}{2} = u,$$
where u is the unit roundoff. In particular, we have the following result. From now
on, the range of F is the interval [ f min , f max ], where f min and f max are, respectively,
the smallest (negative) and the largest (positive) number of F .
Theorem 1 Let $x \in \mathbb{R}$ be a number within the range of F. Then
(a) $\mathrm{fl}(x) = x(1 + \delta)$, with $|\delta| \le 2^{-t-1}$,
(b) $\mathrm{fl}(x) = \dfrac{x}{1 + \alpha}$, with $|\alpha| \le 2^{-t-1}$.
Proof: Without loss of generality, we assume $x > 0$. If $x = m \cdot 2^{e-t}$, with $2^t \le m < 2^{t+1}$ and $e_{\min} \le e \le e_{\max}$, as in (2.2), then $|\mathrm{fl}(x) - x| \le \frac{2^{e-t}}{2}$, since this is half the distance between $x$ and the next floating point number. Also, $x, \mathrm{fl}(x) \ge 2^e$. Now:
(a) Let $\delta = \dfrac{\mathrm{fl}(x) - x}{x}$; then $|\delta| \le \dfrac{2^{e-t}/2}{2^e} = 2^{-t-1}$.
(b) Let $\alpha = \dfrac{x - \mathrm{fl}(x)}{\mathrm{fl}(x)}$; then $|\alpha| \le \dfrac{2^{e-t}/2}{2^e} = 2^{-t-1}$. □
As a conclusion, we have

The unit roundoff, $u = 2^{-t-1}$, of a system with precision $t$ is the largest relative distance between a number and its representation in the machine number system.
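These bounds can be observed in matlab (IEEE double precision, so $u = 2^{-53}$): any real number closer to 1 than half the spacing eps is rounded back to 1.

    (1 + eps/4) == 1    % true: 1 + 2^(-54) is rounded to 1
    (1 + eps)   == 1    % false: 1 + 2^(-52) is the machine number following 1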


2.1.3 Floating point operations


By “floating point operations” we understand the following arithmetic operations:

$+$ (sum), $-$ (subtraction), $\times$ (multiplication), $\div$ (division), and $\sqrt{\ \cdot\ }$ (square root).
Clearly, the system F in (2.1) is not closed under such arithmetic operations. Then,
the question that arises is: what does the computer do when the result of operating
two numbers in F is not a number in F ? The answer is not easy and, essentially,
depends on the computer. However, it is possible to model what most computers
do as follows.
 In the following, we denote by ∗ any of the arithmetic operations mentioned
above, and by ⊛ its floating counterpart, that is: if x, y ∈ F , then x ⊛ y is the
output provided by the computer as x ∗ y.

Floating point arithmetic model

Let x, y ∈ F. Then

x ⊛ y = ( x ∗ y)(1 + δ), with |δ| ≤ u, (2.3)

where u is the unit roundoff.

Some comments on the floating point arithmetic model are in order:

• Comparing with the natural assumption x ⊛ y = fl( x ∗ y), the assumption


(2.3) is more pessimistic, because in the case x ∗ y ∈ F , the model does not
require δ = 0.

• However, and having in mind the precedent comment, we will sometimes


write fl( x ∗ y) instead of x ⊛ y. This will simplify the developments.

• The model does not tell us which is exactly x ⊛ y. It only provides a bound
on it with respect to the exact value x ∗ y. This allows us to deal easily with
rounding errors, at the expense of working with unknown quantities (δ).

• Note that the model does not guarantee that some of the standard arithmetic laws (like the associative or distributive ones) are still true in floating point arithmetic. For instance, in general $x \otimes (y \oplus z) \ne (x \otimes y) \oplus (x \otimes z)$, and $x \otimes (y \otimes z) \ne (x \otimes y) \otimes z$ (see the sketch after this list).

• In general, we start with two real numbers, x, y ∈ R not necessarily in F . In


this case, the operation performed by the computer would be fl( x ) ⊛ fl(y).
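A quick way to observe the failure of these laws in matlab is the following minimal sketch (any decimal numbers that are not machine numbers will do):

    (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)      % false: the two sums differ
    ((0.1 + 0.2) + 0.3) - (0.1 + (0.2 + 0.3))   % about 1.1e-16, of the order of u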


Example 1 Let $x, y \in \mathbb{R}^+$. We are going to analyze the error of computing $x + y$ in a computer with unit roundoff $u$.
As mentioned before, the computer performs:
$$\mathrm{fl}(x) \oplus \mathrm{fl}(y) \equiv \mathrm{fl}(x+y) = x(1+u_1) \oplus y(1+u_2) = [x(1+u_1) + y(1+u_2)](1+u_3),$$
with $|u_1|, |u_2|, |u_3| \le u$, by (2.3). Then
$$\mathrm{fl}(x+y) = x(1+u_1)(1+u_3) + y(1+u_2)(1+u_3) = x(1+\delta_1) + y(1+\delta_2), \tag{2.4}$$
with $|\delta_1|, |\delta_2| \le 2u + u^2$. As a consequence, $\mathrm{fl}(x+y) - (x+y) = x\delta_1 + y\delta_2$, so
$$|\mathrm{fl}(x+y) - (x+y)| \le |x||\delta_1| + |y||\delta_2| \le (|x| + |y|)(2u + u^2) = |x+y|(2u + u^2),$$
where in the last equality we have used that $x, y$ are both positive. Equivalently,
$$\frac{|\mathrm{fl}(x+y) - (x+y)|}{|x+y|} \le 2u + u^2. \tag{2.5}$$

Equation (2.5) provides a bound on the relative error when computing the sum of two
positive numbers in floating point arithmetic.

2.2 The IEEE system in double precision


The two standard systems of machine numbers are the single and double precision IEEE systems, whose main features are indicated in Table 2.1.

    system          precision (t)   e_min    e_max    eps (2^{-t})      unit roundoff (2^{-t-1})
    IEEE single          23         -126      127     1.2 · 10^{-7}          6 · 10^{-8}
    IEEE double          52        -1022     1023     2.2 · 10^{-16}        1.1 · 10^{-16}

Table 2.1: Basic ingredients of the IEEE systems in single and double precision

In this section we are going to analyze in more detail the IEEE system with double precision. This system stores each number of the system F in a “word” of 64 binary digits:

    1 bit            11 bits                   52 bits
      s         d10 d9 · · · d1 d0        b1 b2 · · · b51 b52
    sign             exponent                  mantissa


More precisely

• The sign, s, is 0 (positive) or 1 (negative).

• The exponent, e, is obtained from the number
$$e + 1023 = d_0 + 2 d_1 + \cdots + 2^{10} d_{10} \le 1 + 2 + \cdots + 2^{10} = 2^{11} - 1 = 2047.$$

• The mantissa, f, is obtained from
$$f = \sum_{i=1}^{52} \frac{b_i}{2^i} = \frac{b_1}{2} + \frac{b_2}{2^2} + \cdots + \frac{b_{52}}{2^{52}}.$$

(Note that the number 1 in $1 + f$ is not stored, though it is necessary to reconstruct the machine number.)

Let us note that there are two extreme values of the exponent, namely e = −1023
and e = 1024, which are not indicated in Table 2.1. These exponents are used to
store some special numbers, indicated in Table 2.2. These numbers are either too
large or too small, and the machine treats them in an exceptional way (that is, it
does not treat them as “machine numbers”). For instance, when e + 1023 = 0, the
machine represents the number in the form $f \cdot 2^{-1022}$, instead of $(1+f) \cdot 2^{-1022}$,
which allows us to obtain numbers that are even smaller (known as subnormal
numbers).

      e       e + 1023     d_i       f         type
    -1023         0         0        0           0
                                   ≠ 0       subnormal
     1024       2047        1        0         ±Inf
                                   ≠ 0         NaN

Table 2.2: Exceptional numbers corresponding to the extreme values of the exponent.

As a consequence of the previous developments, we can obtain the largest and


smallest machine number (in absolute value). These numbers are indicated in Ta-
ble 2.3. In particular, realmax is the largest positive machine number, and realmin
is the smallest positive machine number. Both are matlab variables, that can be
displayed typing their names in the command window.
In particular, the range of positive machine numbers is within the interval [realmin, realmax], which in IEEE double precision is, according to Table 2.3, the interval $[2.2251 \cdot 10^{-308},\ 1.7977 \cdot 10^{308}]$. Nonetheless, as we have already mentioned, the machine is capable of recognizing some larger and smaller numbers. More precisely, the smallest positive number that the machine is able to recognize is the smallest subnormal number, namely $2^{-t} \cdot 2^{e_{\min}}$, which in IEEE double precision is equal to $2^{-1022-52} = 4.9407 \cdot 10^{-324}$. Any smaller number will be treated as 0.
If, during any computation, a number that is not within the range of the computer is obtained, then the underflow phenomenon is produced (if the nonzero result is smaller, in absolute value, than realmin) or overflow (when it is larger than realmax). In this last case, matlab displays either Inf or NaN (see Table 2.2). However, the underflow phenomenon does not produce any message: the result is either stored as a subnormal number or, below the smallest subnormal, rounded to 0.
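These phenomena can be reproduced in matlab (IEEE double precision):

    realmax*2     % overflow: returns Inf
    0/0           % indeterminate operation: returns NaN
    2^(-1060)     % below realmin but above 2^(-1074): stored as a subnormal number
    2^(-1080)     % below the smallest subnormal: underflows to 0 (no message)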

2.2.1 matlab representation of machine numbers


The program matlab uses a hexadecimal system of 16 characters to represent
numbers. These 16 characters are denoted by:

0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f .

Each hexadecimal character is represented by 4 binary digits. In particular, if x is an integer such that 0 ≤ x ≤ 15 (in decimal format), then x can be represented by 4 binary digits $c_0, c_1, c_2, c_3$ as:
$$x = c_0 \cdot 2^0 + c_1 \cdot 2^1 + c_2 \cdot 2^2 + c_3 \cdot 2^3, \qquad c_i \in \{0,1\},\ i = 0,1,2,3.$$

Table 2.4 shows the correspondence between each hexadecimal character and its
corresponding binary representation. By means of this equivalence, the IEEE for-
mat in double precision (binary)

1 bit 11 bits 52 bits


s d10 d9 · · · d1 d0 b1 b2 · · · b51 b52
sign exponent mantissa

can be represented in hexadecimal form as:

    3 hexadecimal digits        13 hexadecimal digits
         h2 h1 h0                 w1 w2 · · · w13
      sign + exponent                 mantissa

    Name        e        b_i           value                   value in IEEE double
   realmax    e_max       1      (2 - 2^{-t}) · 2^{e_max}        1.7977 · 10^{308}
   realmin    e_min       0            2^{e_min}                 2.2251 · 10^{-308}

Table 2.3: Largest and smallest machine numbers.

In particular, the formulas to go from one system to the other are:
$$\text{mantissa:}\quad f = \frac{w_1}{16} + \frac{w_2}{16^2} + \cdots + \frac{w_{13}}{16^{13}},$$
$$\text{sign and exponent:}\quad
\begin{cases}
h_2 < 8\ (\text{sign } +): & e + 1023 = h_2 \cdot 16^2 + h_1 \cdot 16 + h_0,\\
h_2 \ge 8\ (\text{sign } -): & e + 1023 = (h_2 - 8) \cdot 16^2 + h_1 \cdot 16 + h_0.
\end{cases}$$

There are several formats in matlab to see the hexadecimal representation (and
to go back to the decimal one):

• num2hex(x): Shows the hexadecimal representation of the number x.

hexadecimal character binary representation


0 (0, 0, 0, 0)
1 (0, 0, 0, 1)
2 (0, 0, 1, 0)
3 (0, 0, 1, 1)
4 (0, 1, 0, 0)
5 (0, 1, 0, 1)
6 (0, 1, 1, 0)
7 (0, 1, 1, 1)
8 (1, 0, 0, 0)
9 (1, 0, 0, 1)
a (1, 0, 1, 0)
b (1, 0, 1, 1)
c (1, 1, 0, 0)
d (1, 1, 0, 1)
e (1, 1, 1, 0)
f (1, 1, 1, 1)

Table 2.4: Binary representation of hexadecimal characters.


• hex2num(’x’): Returns the double precision number whose hexadecimal representation is the string x. In this case it is important not to forget the quotation marks.

• format hex: Switches the display from the decimal representation to the hexadecimal one.

• format: Goes back to the default (decimal) display.
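For instance (the hexadecimal strings below correspond to IEEE double precision):

    num2hex(1)                    % '3ff0000000000000': e + 1023 = 3ff (hex) = 1023, f = 0
    hex2num('3ff0000000000000')   % 1
    num2hex(-2)                   % 'c000000000000000': sign -, e + 1023 = 1024, f = 0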

HIGHLIGHTS:

 A floating point system aims to keep the relative distance between two consecutive numbers essentially constant for all numbers in the system.
 The relative distance between two consecutive numbers is bounded between the unit roundoff, $2^{-t-1}$, and the machine epsilon, $2^{-t}$.
 The absolute separation increases as the absolute value of the numbers increases. It is multiplied by 2 (the base) each time the exponent of the numbers increases by 1.
 The floating point arithmetic model assumes that the relative error when
rounding an elementary arithmetic operation of machine numbers is at most
the unit roundoff of the computer.

3 Conditioning and stability
In this chapter we introduce the notions of conditioning (and condition number) of
a problem and stability of a numerical algorithm. The first notion (conditioning)
refers to how small changes in the data of a problem affect the solution of this
problem and, as a consequence, it is an intrinsic notion of the problem, independent of the algorithm that is used to solve it. By contrast, the notion of stability refers to the algorithm that is used to solve a problem, and provides a measure of how “good” the solution provided by this algorithm is, namely: how far it is from being the exact solution.

3.1 Conditioning
When working with computers, we can not be sure that the original data are the
exact ones. When introducing our (hopefully exact) data in the computer, these
data are affected by rounding errors (see Section 2.1.2).
We can think of a problem as a function from a set of data (X ) to a set of
solutions (Y ), where each set is endowed with a norm.

Problem f

f : X −→ Y
x (data) 7→ f ( x ) (solution)

We will refer to the elements in X as “points”, even though X is not necessarily


an affine space. Hence, you can replace sentences like “a given point x ∈ X ” by
“a given data x ∈ X ”. Moreover, for simplicity we will write ∥ · ∥ for both norms
in X and Y , even though they are not necessarily the same.
In this setting, x denotes the exact data, which are affected by some perturbation,
δx. These perturbations may come not only from rounding errors, but also from
uncertainty in the measurements, or from any other source of error. Then, the
error in the solution is given by
δ f := f ( x + δx ) − f ( x ) (3.1)
 We want to compare δ f with δx, where δ f is as in (3.1).


Definition 1 (Normwise absolute condition number). The condition number of the


problem f at a given point x ∈ X is the non-negative number

$$\widehat{\kappa}_f(x) := \lim_{\delta \to 0}\ \sup_{\|\delta x\| \le \delta} \frac{\|\delta f\|}{\|\delta x\|}, \tag{3.2}$$

with δ f as in (3.1).

Remark 1 Some observations about Definition 1 are in order:

(a) The condition number κb f ( x ) compares absolute changes in the solutions with abso-
lute changes in the data.

(b) In general, to get an acceptable measure of the condition number κb f ( x ) it is not


necessary to take the limit, but just to get

$$\widehat{\kappa}_f(x) \approx \sup_{\|\delta x\| \le \delta} \frac{\|\delta f\|}{\|\delta x\|},$$

for δ small enough.

(c) If f is a differentiable function of the data x ∈ X , then

κb f ( x ) = ∥ J f ( x )∥,

where J f is the Jacobian matrix of f (namely, the linear operator associated to the
differential).

If we look at relative variations of the data, and look, accordingly, for relative
variations in the solution, then we arrive at the following notion.

Definition 2 (Normwise relative condition number). The relative condition num-


ber of the problem f at the point x ∈ X is the non-negative number:

$$\kappa_f(x) := \lim_{\delta \to 0}\ \sup_{\|\delta x\|/\|x\| \le \delta} \frac{\|\delta f\|/\|f(x)\|}{\|\delta x\|/\|x\|}. \tag{3.3}$$

Remark 2 Again, if f is a differentiable function of the data x ∈ X , then we have

$$\kappa_f(x) = \frac{\|x\|}{\|f(x)\|} \cdot \|J_f(x)\|.$$


We will frequently say that a given problem is well-conditioned or ill-conditioned,


depending on how big the condition number is. There is no fixed quantity that establishes the difference between these two categories. In general, whether a problem is well-conditioned or ill-conditioned depends on the context. Is a relative condition number of $10^3$ enough to say that a problem is ill-conditioned? There is no satisfactory answer to this question. What the relative condition number tells us in this case is that a relative error of 1 unit in the data could cause a relative error of $10^3$ units in the solution or, in other words, that the error in the solution of the problem can be $10^3$ times bigger than the error in the data.

Remark 3 (Choice of the norm). Condition numbers depend on the norms in X and
Y . This is not, in general, a problem, and it should be clear from the very beginning which
is the norm we are using. In this course, both X and Y will be either Fn or Fm×n (the
matrix space), with F = R or C, and the norms we use are the following:
 
• The infinity norm: $\|x\|_\infty = \max\{|x_1|, \ldots, |x_n|\}$.

• The 1-norm: $\|x\|_1 = |x_1| + \cdots + |x_n|$.

• The 2-norm: $\|x\|_2 = \sqrt{|x_1|^2 + \cdots + |x_n|^2}$.

In all cases, | x | denotes the absolute value (or the modulus) of the complex number x. The
matrix norms will be introduced in Chapter 4.

Remark 4 For any vector x ∈ Cn , the following relation between the norms introduced
in Remark 3 hold:
$$\frac{\|x\|_2}{\sqrt{n}} \le \|x\|_\infty \le \|x\|_2, \qquad \|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2, \qquad \frac{\|x\|_1}{n} \le \|x\|_\infty \le \|x\|_1.$$
Let us illustrate the notions of absolute and relative condition number with the
following examples, corresponding to elementary arithmetic operations:


Example 2 Let f be the function (problem) assigning to each nonzero real value its in-
verse:
$$f : \mathbb{R}\setminus\{0\} \longrightarrow \mathbb{R}, \qquad x \mapsto f(x) = \frac{1}{x}.$$
This function is differentiable, so
$$\widehat{\kappa}_f(x) = \frac{1}{|x|^2}.$$
Note that when $x \to 0$, the condition number $\widehat{\kappa}_f(x)$ goes to infinity, so inverting a number close to zero is ill-conditioned in absolute terms.
However, in relative terms
$$\kappa_f(x) = \frac{|x|}{|f(x)|}\,|f'(x)| = 1,$$

so the problem of inverting nonzero real numbers is well conditioned in relative terms.

Example 3 Let us consider the following particular values in Example 2. Set: $x = 10^{-6}$ and $\tilde x = 10^{-6} + 10^{-10}$. Then the absolute error in the inversion of $x$ is given by:
$$\frac{\left|\frac{1}{x} - \frac{1}{\tilde x}\right|}{|x - \tilde x|} = \frac{1}{10^{-12} + 10^{-16}} \approx 10^{12},$$
which is a big number. However, the relative error is
$$\frac{\left|\frac{1}{x} - \frac{1}{\tilde x}\right| \Big/ \left|\frac{1}{x}\right|}{|x - \tilde x| / |x|} = \frac{|x|}{|\tilde x|} \approx 1.$$
The reason for such a difference is that
$$\frac{1}{x} - \frac{1}{\tilde x} = 10^{6} - \frac{10^{6}}{1 + 10^{-4}} \approx 10^{2},$$
which is a big quantity, compared just with the original data, which are of order $10^{-6}$. However, it is a small error compared to the magnitude of $1/x$.
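The quantities in Example 3 can be checked numerically in matlab:

    x = 1e-6;  xt = 1e-6 + 1e-10;
    abs(1/x - 1/xt)/abs(x - xt)                       % about 1e12 (absolute ratio)
    (abs(1/x - 1/xt)/abs(1/x))/(abs(x - xt)/abs(x))   % about 1 (relative ratio)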

In the following, for the sake of simplicity, and since the function f is clear by
the context, we will replace the subscript f in the notation for condition numbers
by a reference to the norm. Then, for instance, κb∞ is the absolute condition number
in the ∞-norm, and κ1 is the relative condition number in the 1-norm.


Example 4 Let us now consider the function assigning to each pair of real numbers their difference:
$$f : \mathbb{R}^2 \longrightarrow \mathbb{R}, \qquad \binom{x}{y} \mapsto f\binom{x}{y} = x - y.$$
Again, this function is differentiable, and
$$\widehat{\kappa}_\infty\binom{x}{y} = \left\| J\binom{x}{y} \right\|_\infty = \left\| \begin{bmatrix} \frac{\partial f}{\partial x} & \frac{\partial f}{\partial y} \end{bmatrix} \right\|_\infty = \left\| \begin{bmatrix} 1 & -1 \end{bmatrix} \right\|_\infty = 2, \qquad \widehat{\kappa}_1\binom{x}{y} = 1.$$
However,
$$\kappa_1\binom{x}{y} = \frac{\left\|\binom{x}{y}\right\|_1}{|x-y|}\, \left\| J\binom{x}{y} \right\|_1 = \frac{|x| + |y|}{|x-y|}.$$
Therefore, $\kappa_1\binom{x}{y}$ can be arbitrarily large if $x \approx y$ and $x, y$ are not close to zero.

Example 4 highlights the well-known fact that subtracting nearby numbers can
be problematic. But note that this just gives problems in relative terms, since the
absolute condition number is not big. Summarizing:
 The problem of subtracting two numbers is relatively ill–conditioned when
these two numbers are close to each other (and not close to zero), though it is
absolutely well-conditioned. The effect produced by ill-conditioning when sub-
tracting two nearby numbers is known as cancellation.
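Cancellation is easy to observe in matlab. In the following sketch, the data x and y agree in their first 15 digits, and the computed difference retains barely one correct significant digit:

    x = 1 + 1e-15;  y = 1;
    computed = x - y                   % 1.1102e-15: only the order of magnitude is right
    abs(computed - 1e-15)/abs(1e-15)   % relative error of about 0.11 w.r.t. the exact
                                       % difference 1e-15 of the original data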
 
Exercise 1 Let $x = 99$, $y = 100$, $x + \delta x = 99$, $y + \delta y = 101$, and set $f\binom{x}{y} = x - y$.

(a) Compute $\widehat{\kappa}_1\binom{x}{y}$ without using the Jacobian.

(b) Compute $\kappa_1\binom{x}{y}$ without using the Jacobian.

(c) Compute the absolute error: $\dfrac{\left\| f\binom{x+\delta x}{y+\delta y} - f\binom{x}{y} \right\|_1}{\left\| \binom{\delta x}{\delta y} \right\|_1}$.

(d) Compute the relative error: $\dfrac{\left\| f\binom{x+\delta x}{y+\delta y} - f\binom{x}{y} \right\|_1 \Big/ \left\| f\binom{x}{y} \right\|_1}{\left\| \binom{\delta x}{\delta y} \right\|_1 \Big/ \left\| \binom{x}{y} \right\|_1}$.

(e) Compare the results obtained in (c) and (d) in the light of (a) and (b).

In the previous examples we have included the absolute condition number just
for the sake of comparison. However, it is not illustrative, because it does not take
into account the magnitude of the original data or the solution. In the follow-
ing, we will only focus on the relative condition number. Nonetheless, the next
example shows that this is not the whole story, and that the normwise relative
condition number is not always the best way to get an idea of how sensitive the problem is to realistic perturbations.

Example 5 Let f be the function
$$f : \mathbb{R}^2 \setminus \left\{ \binom{x}{0} \right\} \longrightarrow \mathbb{R}, \qquad \binom{x}{y} \mapsto f\binom{x}{y} = \frac{x}{y}.$$
This is again a differentiable function, so
$$\kappa_2\binom{x}{y} = \frac{\left\|\binom{x}{y}\right\|_2}{|x/y|}\,\left\| J\binom{x}{y} \right\|_2 = \frac{\sqrt{x^2+y^2}}{|x/y|}\,\left\| \begin{bmatrix} \dfrac{1}{y} & -\dfrac{x}{y^2} \end{bmatrix} \right\|_2 = \frac{x^2+y^2}{|xy|} = \left|\frac{x}{y}\right| + \left|\frac{y}{x}\right|.$$

Example 5 shows that the division of two real numbers x, y is ill-conditioned


when either |x/y| or |y/x| is large, namely, when x and y are quite different from each other. However, this does not agree with practical experience. The reason behind the results of Example 5 is that in this example we are con-
sidering all perturbations of x, y, and this is not realistic. To be more precise, the
usual perturbations in practice are small perturbations of x and y independently.
However, in Example 5 we are allowing all kind of perturbations and this, when
x, y are very different from each other, may produce unrealistic perturbations. An
example of a non-realistic perturbation is the following.


Example 6 Let f be as in Example 5. Set $x = 1$, $y = 10^{-16}$, and $x + \delta x = 1$, $y + \delta y = 10^{-10}$. Note that the perturbation in the variable $y$ is huge in relative terms if we just look at $y$, but compared to the norm of the whole pair $(x, y)$ it is a small perturbation:
$$\frac{\left\|\binom{\delta x}{\delta y}\right\|_\infty}{\left\|\binom{x}{y}\right\|_\infty} = 10^{-10} - 10^{-16}.$$
However, it produces a big (relative) change in the function f, namely:
$$\frac{\left\| f\binom{x+\delta x}{y+\delta y} - f\binom{x}{y} \right\|_\infty \Big/ \left\| f\binom{x}{y} \right\|_\infty}{\left\|\binom{\delta x}{\delta y}\right\|_\infty \Big/ \left\|\binom{x}{y}\right\|_\infty} = \frac{(10^{16} - 10^{10})/10^{16}}{10^{-10} - 10^{-16}} \approx 10^{10}.$$

Exercise 2 We have seen in Example 2 that inverting a nonzero number is a well-


conditioned problem in relative terms. However, in Example 5 we have seen that, even
if we fix the numerator to be 1, the condition number of the division between two real
numbers x, y can be arbitrarily large. What is the reason for this?

A realistic perturbation of the pair ( x, y) in Example 5 would be a small relative


perturbation in both x and y, namely, a perturbation of the form (δx, δy) with both
|δx |/| x | and |δy|/|y| being small.
The normwise relative condition number, however, does not take into account
this issue. For this, we need to look at what happens in each coordinate of the
data. This leads to the notion of componentwise relative condition number, which we
are not going to define formally because it is beyond the scope of this course but,
following (3.3), it can be informally defined as
$$\kappa_f^{(\mathrm{comp})}(x) := \lim_{\delta \to 0}\ \sup_{E_{rc}(x) \le \delta} \frac{\|\delta f\|/\|f(x)\|}{E_{rc}(x)}, \tag{3.4}$$

where Erc ( x ) is some measure of the componentwise relative error in the data x. In
the case of Example 5, this componentwise relative error is:
$$E_{rc} := E_{rc}\binom{x}{y} = \max\left\{ \frac{|\delta x|}{|x|},\ \frac{|\delta y|}{|y|} \right\}.$$
Then, with some elementary manipulations, we get
$$\delta f = \frac{x + \delta x}{y + \delta y} - \frac{x}{y} = \frac{x}{y}\cdot\frac{\frac{\delta x}{x} - \frac{\delta y}{y}}{1 + \frac{\delta y}{y}}, \qquad \text{so} \qquad |\delta f| \le \left|\frac{x}{y}\right| \frac{2 E_{rc}}{1 - E_{rc}},$$


so
$$\frac{\left|\delta f\binom{x}{y}\right|}{\left| f\binom{x}{y} \right|} \le 2 \cdot \frac{E_{rc}}{1 - E_{rc}} \approx 2 E_{rc}.$$
As a consequence, $\kappa_f^{(\mathrm{comp})}\binom{x}{y} = 2$, which means that the problem of dividing two real numbers $x, y$, with $y \ne 0$, is well-conditioned componentwise.
Though componentwise conditioning is relevant in numerical linear algebra, we
will not pay much attention to it, due to time restrictions.

Example 7 (Conditioning of the polynomial root-finding). Let us analyze the con-


ditioning of the polynomial root-finding problem (namely, the problem of computing the
roots of a polynomial). We consider a (monic) polynomial of degree n, p( x ) = a0 + a1 x +
· · · + an−1 x n−1 + x n , with a0 , . . . , an−1 ∈ C. There are several ways to measure the con-
ditioning of computing the roots of p( x ), depending on the input and the output data we
are interested in. In particular, the input data can be (i) the whole set of coefficients of the
polynomial or (ii) just one of the coefficients; and the output data can be (a) all roots of the
polynomial, or (b) just one of the roots. We are going to consider the case (ii)-(b), so the
problem we are interested in is in measuring how a small change in the jth coefficient of the
polynomial p( x ) affects a particular root of p( x ). Note that the root is fixed, but arbitrary,
so this will be valid for any of the roots of the polynomial. In this case, the function we
want to analyze is the following:
ri : C → C
,
a j 7 → ri ( a j )
where a j is the jth coefficient of p( x ) and ri is the ith root of p( x ) (we assume the roots
ordered in a certain way). We are going to see that the relative condition number of the
previous problem is
$$\kappa_{r_i}(a_j) = \frac{|a_j\, r_i^{\,j-1}|}{|p'(r_i)|}, \tag{3.5}$$
where $p'(x)$ stands for the derivative of $p(x)$. In order to see (3.5), let us first simplify the
where p′ ( x ) stands for the derivative of p( x ). In order to see (3.5), let us first simplify the
notation by writing $r$ instead of $r_i$. Then, we need to measure the ratio
$$\frac{|\delta r|/|r|}{|\delta a_j|/|a_j|}$$
when $\delta a_j$ approaches zero. The number $r + \delta r$ is not a root of the original $p(x)$, but of a nearby polynomial, obtained after perturbing the $j$th coefficient by $\delta a_j$, namely
$$a_0 + a_1(r+\delta r) + \cdots + (a_j + \delta a_j)(r+\delta r)^j + \cdots + a_{n-1}(r+\delta r)^{n-1} + (r+\delta r)^n = 0,$$


which is equivalent to
$$p(r+\delta r) + \delta a_j (r+\delta r)^j = 0. \tag{3.6}$$
Expanding in Taylor series, we have
$$p(r+\delta r) = p(r) + \delta r \cdot p'(r) + O((\delta r)^2) = \delta r \cdot p'(r) + O((\delta r)^2),$$
since $p(r) = 0$. Replacing in (3.6) we get
$$\delta r \cdot p'(r) + O((\delta r)^2) + r^j \delta a_j + \delta a_j \cdot O(\delta r) = 0.$$
Now, ignoring second order terms, we obtain $\delta r \cdot p'(r) + r^j \delta a_j = 0$, from which
$$|\delta r| = \frac{|r^j| \cdot |\delta a_j|}{|p'(r)|},$$
and, from this, (3.5) is immediate.
As a consequence of (3.5), the condition number of a multiple root is infinite. The condition number can also be quite large for simple roots.
Let us check the formula obtained in (3.5) for quadratic polynomials $p(x) = a + bx + x^2$. We know that the roots in this case are given by
$$r_1 = \frac{-b + \sqrt{b^2 - 4a}}{2}, \qquad r_2 = \frac{-b - \sqrt{b^2 - 4a}}{2}.$$
Since these are differentiable functions of $a$ and $b$, we can compute the condition number using Remark 2. Let us focus on $r_1$ (the developments for $r_2$ are similar). First, for the coefficient $a$, we have
$$\frac{dr_1}{da}(a) = -\frac{1}{\sqrt{b^2 - 4a}},$$
so the formula in Remark 2 gives
$$\frac{\|a\|}{\|r_1(a)\|} \cdot \|J_{r_1}(a)\| = \frac{|a|}{|r_1|\,\left|\sqrt{b^2 - 4a}\right|}.$$
On the other hand, since $p'(x) = b + 2x$, then $p'(r_1) = \sqrt{b^2 - 4a}$. Therefore, the right-hand side in (3.5) reads $|a\, r_1^{-1}|/\left|\sqrt{b^2 - 4a}\right|$, which coincides with the previous expression.
As for the coefficient $b$, we proceed in the same way:
$$\frac{dr_1}{db}(b) = \frac{1}{2}\left(-1 + \frac{b}{\sqrt{b^2 - 4a}}\right) = \frac{-r_1}{\sqrt{b^2 - 4a}} = -\frac{r_1}{p'(r_1)},$$
so the formula in Remark 2 reads
$$\frac{\|b\|}{\|r_1(b)\|} \cdot \|J_{r_1}(b)\| = \frac{|b| \cdot |r_1|}{|r_1| \cdot |p'(r_1)|} = \frac{|b|}{|p'(r_1)|},$$
which coincides, again, with the right-hand side of (3.5).


Example 8 Let us consider the polynomial $p(x) = -1 + 3x - 3x^2 + x^3 = (x-1)^3$, and let us perturb the coefficient of $x^2$ to get the polynomial $p_n(x) = -1 + 3x - (3 + 10^{-n})x^2 + x^3$. Table 3.1 shows the ratio $(|\delta r|/|r|)/(|\delta a_j|/|a_j|)$, with $\delta a_j = 10^{-n}$, $a_j = a_2 = -3$, and $r = 1$, for some values of $n$. Here we choose $1 + \delta r$ as the root which is the farthest away from $r = 1$. This ratio provides a lower bound on the condition number (which is, as we have seen in Example 7, equal to infinity). The second column of the table shows the approximated root.

    n         1 + δr                  |δr| / (|δa2|/|a2|)
    6         1.01                        3 · 10^4
    8         1.0022                      6.6 · 10^5
    10        1.0005                      1.4 · 10^7
    12        1.000099996830444           3 · 10^8

Table 3.1: Ratio $(|\delta r|/|r|)/(|\delta a_2|/|a_2|)$ for $p(x) = (x-1)^3$, $\delta a_2 = 10^{-n}$ ($r = 1$).
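The numbers in Table 3.1 can be reproduced with the matlab command roots (a sketch for n = 8; recall that roots takes the coefficients from the highest to the lowest degree):

    n  = 8;
    p  = [1 -3 3 -1];             % (x-1)^3
    pn = [1 -(3+10^(-n)) 3 -1];   % perturb the coefficient of x^2 by 10^(-n)
    r  = roots(pn);
    dr = max(abs(r - 1))          % about 2.2e-3
    dr/(10^(-n)/3)                % roughly 6.5e5, of the order of the value in the table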

3.2 Stability of numerical algorithms


In this section we use the notation O(u) to denote a function of the variable u
such that, when u approaches 0, there is a number C, independent of u, such that
O(u) ≤ Cu. However, the number C may depend on some other quantities, like
the dimension of the problem.
Following the developments in Section 3.1, we represent both a problem, f , and
an algorithm to solve this problem, fe, as functions from a set of data (X ) to a set
of solutions (Y ):

Problem f

f : X −→ Y
x (data) 7→ f ( x ) (exact solution)

Algorithm fe

fe : X −→ Y
x (data) 7→ fe( x ) (computed solution)


For a given data, x ∈ X , the algorithm computes an approximate solution, fe( x ),


to the exact solution f ( x ).
 The GOAL of this section is to compare f ( x ) and fe( x ) in a relative way, that is,
to measure the relative error
$$\frac{\|\tilde f(x) - f(x)\|}{\|f(x)\|}. \tag{3.7}$$
The first attempt to introduce a notion of a good algorithm in this respect is the
following definition.

Definition 3 The algorithm fe for the problem f is accurate if, for any x ∈ X ,

$$\frac{\|\tilde f(x) - f(x)\|}{\|f(x)\|} = O(u). \tag{3.8}$$

However, condition (3.8) is too ambitious, since it is a condition on the algorithm


that disregards the conditioning of the problem. More precisely, even in the ideal
situation where
fe( x ) = f (fl( x )), (3.9)
then the relative error (3.7) can be large if the problem is ill-conditioned (and this
last property has nothing to do with the algorithm). We also want to note that
(3.9) means that the only error committed by the computer is the roundoff error in
the data, which is too optimistic (note also that this error is unavoidable, but there
may be other errors and, in general, there are).
The following definition is more realistic, since it takes into account the condi-
tioning of the problem.

Definition 4 The algorithm fe for the problem f is stable if for any x ∈ X there is some
xe ∈ X satisfying

$$\frac{\|\tilde f(x) - f(\tilde x)\|}{\|f(x)\|} = O(u), \qquad \text{and} \qquad \frac{\|x - \tilde x\|}{\|x\|} = O(u).$$

In other words, a stable algorithm produces an approximate solution for an


approximate data. We can go even further in this direction.

Definition 5 The algorithm fe for the problem f is backward stable if for any x ∈ X
there is some xe ∈ X satisfying

    fe(x) = f(xe),        and        ∥x − xe∥ / ∥x∥ = O(u).


In other words, a backward stable algorithm provides the exact solution for a
nearby data. This is the most ambitious requirement for an algorithm, since data
are always subject to rounding errors.
Definitions 3, 4, and 5 depend on two objects: (a) the norm in the sets of data
and solutions, and (b) the quantity C such that O(u) ≤ Cu when u approaches 0.
Regarding the norm, in all problems of this course, X and Y are finite dimensional
vector spaces. Since all norms in finite dimensional spaces are equivalent, the
notions of accuracy, stability, and backward stability do not depend on the norm.
As for the quantity C, it must be independent of u, but it will usually depend on
the dimension of the spaces X, Y. However, in order for the quantity O(u) not to
be too large, this dependence is expected to be, at most, polynomial.
A natural question regarding Definitions 3, 4, and 5 is about the relationships
between them. For instance: is an accurate algorithm necessarily backward stable,
or vice versa? It is immediate that a backward stable algorithm is stable, and
that an accurate algorithm is stable. However, it is not so easy to see whether the
other implications are true or not. Actually, none of them is true. In particular, an
accurate algorithm is not necessarily backward stable.

Example 9 Equation (2.5) tells us that the sum of two real numbers of the same sign in
floating point arithmetic is an accurate algorithm, and (2.4) implies that it is backward
stable as well.
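As a small illustration in matlab (the particular numbers 0.1, 0.2, and 0.3 are an arbitrary choice), one can check that the discrepancy between the computed sum and the exact value is of the order of the unit roundoff eps, coming from the rounding of the data and of the sum:

% 0.1, 0.2 and 0.3 are not machine numbers, so each of them is rounded;
% the computed sum then differs from 0.3 only by a few rounding errors
relerr = abs((0.1 + 0.2) - 0.3)/0.3
relerr/eps      % a small constant: the error is O(u)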

In terms of the goal mentioned at the beginning of this section, one may wonder
what the connection is between the stability (or the backward stability) and the
relative error (3.7). The following result provides an answer to this question.

Theorem 2 Let fe be a stable algorithm for the problem f : X → Y , whose relative


condition number at the data x ∈ X is denoted by κ ( x ). Then

    ∥fe(x) − f(x)∥ / ∥f(x)∥ = O(u κ(x)).

The proof of Theorem 2 for fe being backward stable is immediate (see [TB97, Th.
15.1]). For fe being just stable it is more involved, and is out of the scope of this
course.
What Theorem 2 tells us is that when using a stable algorithm for solving a
problem, the forward error of the algorithm depends on the conditioning. In
particular, this can be summarized as:

fe stable + f well-conditioned ⇒ fe accurate (3.10)


This is another way to express the general rule (see [H98, p. 9]):

forward error ≲ condition number × backward error (3.11)

(where a ≲ b means that a ⩽ b and a ≃ b).

HIGHLIGHTS:

• The conditioning of a problem measures the sensitivity of the solution of
this problem to (small) changes in the data.

• The word “conditioning” makes reference to a problem, and not to an
algorithm. Moreover, it is a mathematical property, and it has nothing to do
with the computer.

• The normwise condition number measures the sensitivity of the solution
with respect to changes in norm of the set of data. In some cases, it is
more appropriate to look at the componentwise condition number, which
measures the sensitivity of the solution with respect to changes in each of
the components of the data, independently.

• If the problem, f, is differentiable, then the condition number at a certain
data, x, is the norm of the Jacobian of f at x.

• Subtracting two nearby numbers is ill-conditioned.

• Stability is a property of an algorithm used to solve a given problem.

• The best property one can expect from an algorithm is to be backward
stable, because accuracy is too ambitious due to the presence of rounding
errors.

• A stable algorithm applied to solve a well-conditioned problem produces
an accurate solution.

4 Solution of linear systems of equations

In this chapter we are going to study the standard method for solving systems
of linear equations, which is known as Gauss method (or Gaussian elimination) with
partial pivoting. In particular, we will see that this method is associated with a
decomposition of the coefficient matrix of the system, which is known as LU
factorization. For this, we will review some notions from the subject Linear Algebra.
Finally, we will analyze the errors committed by the previous method, by means
of the study of the sensitivity and the condition numbers.
Throughout this chapter, the system of linear equations (SLE) will be repre-
sented in matrix form as:

Ax = b, A ∈ Cm × n , b ∈ Cm , (4.1)

where x ∈ Cn is the unknown. The matrix A in (4.1) is the coefficient matrix of the
system, and the vector b is the right-hand side of the system. During all this chapter
we will assume that the matrix A is square, namely, m = n in (4.1). Moreover, we
will assume that the matrix is invertible, in order to guarantee that it has a unique
solution.
We will start examining the solution of triangular systems in Section 4.1 but,
before, we will introduce some notation and basic definitions from matrix theory.

Matrices: notation and basic definitions

We denote the n × n identity matrix by In:

    In := [ 1  0  ...  0
            0  1  ...  0
            ...
            0  0  ...  1 ]  (n × n).

The null m × n matrix is denoted by 0m×n.


If A is any matrix, we denote by aij or A(i, j) (in matlab notation) the entry
in position (i, j).
Given a matrix A ∈ Cm×n , its transpose is denoted by A⊤ and its conjugate trans-
pose is denoted by A∗ .


The main diagonal of a matrix A consists of the entries A(i, i ) (namely, the entries
for which both indices coincide).
A triangular matrix is a matrix in which either all entries above (lower triangular)
or below (upper triangular) the main diagonal are zero. Mathematically:

A lower triangular: A(i, j) = 0 when i < j,


A upper triangular: A(i, j) = 0 when i > j.

Note that the adjective “lower” or “upper” refers to the “relevant” part of the
matrix (namely, the one that is not necessarily zero).
A diagonal matrix is a matrix which is both lower and upper triangular. In a
diagonal matrix all entries outside the main diagonal are zero.
A permutation matrix is a square matrix whose entries are all 0 or 1, in such a
way that there is exactly one entry equal to 1 in each row and each column.
The name of these matrices comes from the fact that multiplying another matrix A
by a permutation matrix P produces a permutation of the rows (when multiplying
on the left, namely PA) or the columns (when multiplying on the right, namely
AP) of A. For instance, set

    P = [ 0  1  0          A = [ 1  2  3
          0  0  1                4  5  6
          1  0  0 ],             7  8  9 ].

The matrix P is a permutation matrix. Let us see the effect of multiplying A on
the left and on the right by P:

    PA = [ 4  5  6         AP = [ 3  1  2
           7  8  9                6  4  5
           1  2  3 ],             9  7  8 ].

You can see that PA is obtained from A permuting the rows, whereas AP is ob-
tained from A by a permutation of the columns. Moreover, this permutation is
encoded in the position of the entries equal to 1 in the matrix P. In fact, a permu-
tation matrix P is obtained by permuting the rows (or the columns) of the identity
matrix, and this permutation of the identity matrix that leads to the matrix P is,
precisely, the same one that P produces over A when multiplying A and P.
Some properties of permutation matrices that will be used in this course (some-
times without any explicit mention to them) are:

• Permutation matrices are invertible (with determinant ±1) and P−1 = P⊤ .

• The product of permutation matrices is again a permutation matrix.
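These properties can be checked quickly in matlab; the following lines (just a small verification, reusing the matrices P and A of the example above) show the effect of multiplying by P and the identity P⁻¹ = P⊤:

P = [0 1 0; 0 0 1; 1 0 0];     % a permutation of the rows of the identity
A = [1 2 3; 4 5 6; 7 8 9];
P*A                             % permutes the rows of A
A*P                             % permutes the columns of A
P'*P                            % the identity, so inv(P) = P'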


In Section 4.3 we will use matrix norms. A matrix norm indicates how the norm
of a vector can change when it is multiplied by the matrix. More precisely, if ∥ · ∥ p
is any of the norms introduced in Remark 3 (namely, p can be 1, 2, or ∞), then the
pth matrix norm of A ∈ Cn×n is

    ∥A∥p = max_{0 ≠ x ∈ Cn} ∥Ax∥p / ∥x∥p.        (4.2)

In words, the norm ∥ A∥ p indicates how much the norm ∥ Ax∥ p increases with
respect to ∥x∥ p (a nice geometric interpretation of matrix norms can be found at
[TB97, Lecture 3]).
The following result provides explicit formulas for the matrix norms associated
to vector norms ∥ · ∥ p , for p = 1, 2, ∞. The proof can be found in pages 20 − 21,
for p = 1, ∞, and in page 34, for p = 2, of [TB97].

Lemma 2 Let A ∈ Cn×n , and denote by Col i ( A) and Row i ( A) the ith column and the
ith row of A, respectively. Then:

• ∥ A∥1 = max{∥Col 1 ( A)∥1 , . . . , ∥Col n ( A)∥1 }.

• ∥A∥2 = max{ √|λ| : λ is an eigenvalue of A∗A }.
• ∥ A∥∞ = max{∥Row 1 ( A)∥1 , . . . , ∥Row n ( A)∥1 }.

The basic properties of matrix norms that will be used in this course are given
in the following result, whose proof is straightforward.

Lemma 3 Let A, B ∈ Cn×n and x ∈ Cn . Then, for p = 1, 2, ∞:

1. ∥ Ax∥ p ≤ ∥ A∥ p · ∥x∥ p ,

2. ∥ AB∥ p ≤ ∥ A∥ p · ∥ B∥ p ,

3. ∥ In ∥ p = 1.

Exercise 3 Prove that ∥uv∗ ∥2 = ∥u∥2 ∥v∥2 , for any two vectors u, v ∈ Cn .

In Section 4.3 the following definition will be key.

Definition 6 (Condition number of a matrix). The condition number of an invertible


matrix A in the p-norm is the non-negative real number:

κ p ( A ) = ∥ A ∥ p · ∥ A −1 ∥ p .
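As a sketch of how these quantities can be computed in matlab (the 3 × 3 matrix below is just an arbitrary invertible example), the built-in commands norm and cond implement the formulas of Lemma 2 and Definition 6:

A = [2 -4 4; 6 -9 5; -1 -4 8];     % an arbitrary invertible matrix
norm(A,1)                          % largest column sum of absolute values
norm(A,inf)                        % largest row sum of absolute values
norm(A,2)                          % 2-norm of A
sqrt(max(abs(eig(A'*A))))          % the same value, following Lemma 2
cond(A,2)                          % kappa_2(A)
norm(A,2)*norm(inv(A),2)           % the same value, following Definition 6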


4.1 Solution of triangular systems


A triangular system (4.1) is a system whose coefficient matrix is triangular (either
lower or upper). Let us focus on the case in which A is upper triangular (when A
is lower triangular the arguments are similar). In this case, we use the notation U,
instead of A, for the coefficient matrix, since it is the initial letter of “upper”. The
matrix U and the vector b are of the form

u11 . . . u1,n−1 u1n


 
b1
 
..
 0 ...  b2 
 
. 
U= . , b =  . ,
  
 .. . .. u
 .
 . 
n−1,n−1 un−1,n

0 ··· 0 unn bn

so the system reads

    u11 x1 + · · · + u1,n−1 xn−1 + u1n xn = b1,
                            ...
          un−1,n−1 xn−1 + un−1,n xn = bn−1,        (4.3)
                             unn xn = bn.

We can solve for the unknown xn in the last equation and replace the obtained
value in the remaining equations, then solve for xn−1 in the last-but-one equation
and replace again the obtained value in the remaining equations, and so on. This
procedure is known as backward substitution, and provides the following expression
for the unknowns:
    xn = bn / unn,
    xn−1 = (bn−1 − un−1,n xn) / un−1,n−1,
    ...
    xk = (bk − uk,k+1 xk+1 − · · · − ukn xn) / ukk,        (4.4)
    ...
    x1 = (b1 − u12 x2 − · · · − u1n xn) / u11.
Remember that U is invertible, which means that u11 · · · unn ̸= 0, so the previous
expressions make sense.
The expression (4.4) can be easily implemented in matlab with a for loop:


function [x] = triusol(U,b)
% solution of Ux=b, with U upper triangular nxn, by backward substitution
n = length(b);
x = zeros(n,1);
for k = n:-1:1
    j = k+1:n;
    x(k) = (b(k) - U(k,j)*x(j))/U(k,k);
end

When the coefficient matrix is lower triangular, a similar method is derived, and
it is known as forward substitution.
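A sketch of the corresponding matlab code (the name trilsol is ours, mirroring triusol above) could be:

function [x] = trilsol(L,b)
% solution of Lx=b, with L lower triangular nxn, by forward substitution
n = length(b);
x = zeros(n,1);
for k = 1:n
    j = 1:k-1;
    x(k) = (b(k) - L(k,j)*x(j))/L(k,k);
end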

4.2 Gaussian elimination and LU factorization


When the coefficient matrix of (4.1) is not triangular, the standard procedure for
solving the system consists in obtaining an equivalent system (namely, having the
same solutions) whose coefficient matrix is triangular, and then solve this new
triangular system by backward substitution. To obtain an equivalent system we
use the following elementary row operations:

OP1 (Exchange). Exchange two rows. We denote the operation that exchanges
the ith and jth rows by Rij .

OP2 (Replacement). Add to some row a multiple of another one. We denote by


Rij (α) the operation that adds to the ith row the jth one multiplied by α.

The procedure to obtain an equivalent system with an upper triangular coeffi-


cient matrix (known as an echelon form of A) is carried out in a systematic way, as
it is introduced in a basic course on Linear Algebra, and consists in iterating the
following steps applied to the coefficient matrix A:

Gauss method to obtain an echelon form:

Step 1: Select, in the leftmost column, a nonzero entry (pivot) and exchange
the row containing this entry (OP1) to take it to the topmost position.

Step 2: By means of replacement operations (OP2), add a multiple of the


row containing the pivot to the rows below this one in order to get 0 in the
entries below the pivot (in the same column).

Step 3: Repeat steps 1 and 2 in the submatrix obtained from the previous
one by removing the row and column containing the pivot.


As an illustration of the Gauss method you can look at the example in Section 2.3
in [M04]. Nonetheless, we are going to show another example, with an explanation
that will allow us to relate this method with the LU factorization:
Let us apply Gauss elimination to the following matrix:

2 −4 4 −2
 

A =  6 −9 7 −3 . (4.5)
−1 −4 8 0

For this we apply the procedure described above:

Step 1: We take as a pivot the (1, 1) entry, so we do not need to exchange


rows:  
2 −4 4 −2
 6 −9 7 −3 .
−1 −4 8 0

Step 2: Now, we make zeroes below the pivot with the operations R21 (−3)
and R31 (1/2):
 
2 −4 4 −2
0 3 −5 3  .
0 −6 10 −1
Now we iterate the previous steps over the submatrix A(2 : 3, 2 : 4):

Step 1: We choose the (2, 2) entry as a pivot to avoid, again, a row exchange:

2 −4 4 −2
 
0 3 −5 3  .
0 −6 10 −1

Step 2: The operation, in this case, is R32 (2):

2 −4 4 −2
 
0 3 −5 3  .
0 0 0 5

With this, we have arrived at a matrix in echelon form, which is upper trian-
gular (we denote it by U):

2 −4 4 −2
 

U : = 0 3 −5 3  .
0 0 0 5


Now, we gather together all the elementary row operations that we have applied,
and we construct a lower triangular matrix, with 1’s in the main diagonal, which
contains, in each column below the diagonal, the factors that we have used
in Step 2 at the corresponding iteration, with the opposite sign (we denote this
matrix by L):
 
1 0 0
L :=  3 1 0 .
−1/2 −2 1
Finally, we note that the original matrix, A, is the product of L and U, namely:

    A = LU = [  1    0   0     [ 2  −4   4  −2
                3    1   0       0   3  −5   3
              −1/2  −2   1 ]     0   0   0   5 ].

This is, precisely, the LU factorization of A. In the next subsection we show the
general method to obtain the LU factorization, and we analyze some of its basic
features.

4.2.1 The LU factorization


Our goal is to decompose a given matrix A ∈ Cm×n as a product A = LU, where L
is a lower triangular matrix (L and U are the initial letters of “lower” and “upper”,
respectively). Moreover, L is square m × m with 1’s in the main diagonal, and U
is of size m × n.
Let A ∈ Cm×n be an arbitrary matrix. In the first place, we decompose it, by
blocks, in the following way:
 
a11 A12
A= ,
A21 A22

where a11 is the (1, 1) entry of A, A12 is a row vector of size 1 × (n − 1), A21 is a
column vector of size (m − 1) × 1 and A22 is a matrix of size (m − 1) × (n − 1).
If a11 ≠ 0, we can express the previous decomposition by blocks as follows:

    A = [ a11  A12 ; A21  A22 ] = [ 1  0 ; (1/a11)A21  Im−1 ] · [ a11  A12 ; 0  A22 − (1/a11)A21A12 ].

Let us introduce the following notation, to obtain an iterative method:

    A(0) := A ∈ Cm×n,        A(1) := A22 − (1/a11) A21 A12 ∈ C(m−1)×(n−1).


Now, let us assume that we know the LU factorization of the matrix A(1) , say
A(1) = L1U1. Then, the previous factorization of A allows us to write:

    A = [ 1  0 ; (1/a11)A21  L1 ] · [ a11  A12 ; 0  U1 ] =: LU,        (4.6)

which is an LU factorization of A, where

    L = [ 1  0 ; (1/a11)A21  L1 ],        U = [ a11  A12 ; 0  U1 ].

Remark 5 The decomposition (4.6) satisfies the following properties:

1. The first row of U is equal to the first row of A.

2. The first column of L is equal to the first column of A divided by a11 .


Moreover, if A(1) := A22 − (1/a11) A21 A12 ∈ C(m−1)×(n−1):

3. The second row of U is equal to the first row of A(1).

4. The second column of L is equal to the first column of A(1) divided by a11^(1).

Analogously, if A(2) := A22^(1) − (1/a11^(1)) A21^(1) A12^(1) ∈ C(m−2)×(n−2):

5. The third row of U is equal to the first row of A(2).

6. The third column of L is equal to the first column of A(2) divided by a11^(2).

Remark 5 allows us to obtain an iterative procedure to compute the LU factor-


ization of A. The main observation is that, from the matrices A(0) , A(1) , A(2) , . . .
we can obtain the rows of U and the columns of L. Moreover, we can store the
information obtained at each step in just one matrix of size m × n (the size of the
original matrix A). This can be done in d − 1 steps (where d = min{m, n}) replac-
ing, at the kth step, the submatrix A(k−1) (k : m, k : n) by a matrix whose first row is
the one of A(k−1) , whose first column is, except fo the (1, 1) entry, the first column
( k −1)
of A(k−1) divided by a11 , and the rest of the matrix is A(k) . Figure 4.1 illustrates
this procedure. The LU factorization of A is obtained from A(d−1) (the matrix in
the last step) in the following way:

• L is the lower triangular part of A(d−1) (without the main diagonal), adding
1’s in the main diagonal, and padding up with 0’s below the diagonal entries
in the columns n + 1 : m for m > n.


Figure 4.1: Storage of the LU factorization.

• U is the upper triangular part of A(d−1) (including the main diagonal).

Example 10 The matrices A(k) obtained in the previous process for the matrix A in (4.5)
are the following:

    A(0) = A,
    A(1) = [  2   −4    4   −2
              3    3   −5    3
            −1/2  −6   10   −1 ],
    A(2) = [  2   −4    4   −2
              3    3   −5    3
            −1/2  −2    0    5 ].

From A(2) we get the matrices L and U:

    L = [  1    0   0          U = [ 2  −4   4  −2
           3    1   0                0   3  −5   3
         −1/2  −2   1 ],             0   0   0   5 ],

which coincide with the ones previously obtained.

Though, as we have seen, the LU factorization is valid for rectangular matrices,


from now on we will focus on square matrices of size n × n, since we are only
going to solve linear systems with square coefficient matrices.
Now, we show the matlab code for the previous algorithm.

function [L,U] = lu(A)
% Computes the L and U matrices of an LU factorization of A
n = size(A,1);
for k = 1:n-1
    A(k+1:n,k) = A(k+1:n,k)/A(k,k);
    A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k)*A(k,k+1:n);
end
L = eye(n) + tril(A,-1);
U = triu(A);
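For instance, saving the code above in its own file under a different name (say lufact.m, a name of our choosing, so as not to shadow matlab's built-in lu) and applying it to a small square matrix whose pivots are nonzero, the product of the computed factors recovers the original matrix:

A = [2 -4 4; 6 -9 5; -1 -4 8];    % a square test matrix (no row exchanges needed)
[L,U] = lufact(A)                  % lufact.m contains the function above
norm(A - L*U)                      % equal (or very close) to zero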

The pivoting strategy

So far we have seen an algorithm to compute the LU factorization of a matrix with-


out row exchanges (that is, we have not used the operation OP1), and we have only
used the replacement operation OP2. This is so because we have assumed that the
entries a11^(i) at each step were nonzero (note that this hypothesis has been explicitly
posed in (4.6) for A(0), but it is also implicit in the algorithm for computing the
LU factorization, since we have divided by a11^(i); see Steps 4 and 6 in Remark 5,
where the entries a11^(1) and a11^(2) appear). This condition does not necessarily hold
and, when it does not hold, the matrix does not have an LU factorization. An im-
mediate consequence of this is that not every matrix has an LU factorization. In
those cases, the exchange operation OP1 is needed. The following result guarantees that if
we use this operation, then any matrix has an LU factorization, after multiplying
it, if necessary, by a permutation matrix that encodes the information of the row
exchanges (see [HJ13, Th. 3.5.8]).

Theorem 3 Given a matrix A ∈ Fn×n , there is a permutation matrix P ∈ Fn×n , a lower


triangular matrix L ∈ Fn×n and an upper triangular matrix U ∈ Fn×n such that

PA = LU (4.7)

(or, equivalently, A = P⊤ LU).

Remark 6 Theorem 3 is stated for F = R or C, which means that, if the original matrix
is real, then there is a factorization (4.7) where all three matrices P, L, U are real.

Even in those cases where the diagonal entry a11^(k) is nonzero, it is advisable, for
stability reasons, to exchange rows to get an appropriate pivot. This technique of
exchanging rows to get an appropriate pivot is known as pivoting. There are two
basic pivoting strategies:

(a) Partial pivoting: The chosen pivot (that is taken to the (1, 1) position of the
matrix A(k) by row exchange) is the entry with largest modulus within the
first column of A(k) . Mathematically, this process is carried out by means of


a permutation matrix, denoted by P(k) , in the following way. Let us assume


that A is factorized as

    A = [ L11  0 ; L21  I ] · [ U11  U12 ; 0  A(k) ],

since we have already carried out the first k steps of the LU factorization.
Then, the row exchange only affects the last n − k rows, namely:

    [ I  0 ; 0  P(k) ] · A = [ L11  0 ; P(k)L21  P(k) ] · [ U11  U12 ; 0  A(k) ]
                           = [ L11  0 ; P(k)L21  I ] · [ U11  U12 ; 0  P(k)A(k) ],

where P(k) is the permutation matrix that exchanges the corresponding rows
according to the partial pivoting criterion.

(b) Total pivoting: The chosen pivot is the entry with largest modulus of A(k) .
This pivoting strategy requires to exchange rows and columns, which leads
to a factorization of the form PLUQ, with Q being another permutation
matrix, and will not be considered in this course.

The natural question is: what is the reason for exchanging rows even when the
(1, 1) entry of A(k) is nonzero? Or, more precisely, what is the advantage of partial
pivoting? To answer this, let us consider the following example (another one can
be found in [M04, §2.6]).

Example 11 Let

    A = [ 10^(−20)   1
           1          1 ].

The LU factorization of A is

    L = [ 1         0          U = [ 10^(−20)   1
          10^(20)   1 ],             0           1 − 10^(20) ].

However, if we apply the previous LU algorithm, without row exchange (since, in princi-
ple, it is not necessary because the diagonal entries are nonzero), in matlab (namely, in
floating point arithmetic), we get

    L̂ = [ 1             0      = [ 1         0
           fl(10^(20))   1 ]       10^(20)   1 ],

    Û = [ 10^(−20)   1                  = [ 10^(−20)    1
           0          fl(1 − 10^(20)) ]      0         −10^(20) ].


It can be checked in matlab that fl(10^(20)) = 10^(20) and fl(1 − 10^(20)) = −10^(20). Therefore,
we get

    L̂ Û = [ 10^(−20)   1
             1          0 ].

This matrix should be the original matrix A. However, its (2, 2) entry is equal to 0,
which has nothing to do with the (2, 2) entry of A (which is equal to 1).

Example 11 shows that roundoff errors can produce a large error in the LU
factorization. If we analyze the procedure that we have followed in this exam-
ple, we will realize that the problem comes from approximating fl(1 − 10^(20)) ≡
fl(a22 − 10^(20)) = −10^(20), and it is produced because there are some entries of L or
U which are too large compared to the entries of A.
The pivoting strategies aim to solve this problem. In particular, the partial piv-
oting strategy guarantees that the entries of L always have modulus at most 1.
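This can be checked in matlab with the built-in command [L,U,P] = lu(A) (described in Section 4.2.3) on the matrix of Example 11:

A = [1e-20 1; 1 1];
[L,U,P] = lu(A)       % partial pivoting exchanges the two rows
max(max(abs(L)))      % equals 1: the entries of L have modulus at most 1
norm(P'*L*U - A)      % (close to) zero, unlike the factorization without pivoting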

4.2.2 Solution of SLE with the LU factorization


Once we have the coefficient matrix A factorized as (4.7), the system (4.1) is equiv-
alent to
LUx = Pb.
This way, the procedure to solve the SLE (4.1) by means of the LU factorization is
the following:

Procedure to solve the system Ax = b using the


PA = LU factorization:

1. Compute the factorization PA = LU.

2. Multiply Pb.

3. Solve the lower triangular system Ly = Pb.

4. Solve the upper triangular system Ux = y.

As can be seen, once we have computed the factorization (4.7) of the coefficient
matrix A of the system (4.1), the solution of the system reduces to solving two
triangular systems, whose cost is notably smaller than the one for solving a general
system. We will see more on the computational cost in Section 4.2.4.
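A minimal matlab sketch of these four steps, using the built-in commands of Section 4.2.3 (here A stands for any square invertible coefficient matrix and b for a right-hand side):

[L,U,P] = lu(A);      % Step 1: factorization PA = LU
c = P*b;              % Step 2: permute the right-hand side
y = L\c;              % Step 3: lower triangular system Ly = Pb
x = U\y;              % Step 4: upper triangular system Ux = y
% (the backslash operator detects the triangular structure and applies
% forward/backward substitution)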

Example 11 (Continued). Assume that we want to solve the SLE Ax = b, where b = [1; 0]. The exact solution is

    x = 1/(1 − 10^(−20)) · [−1; 1].        (4.8)

However, if we follow the procedure described above with the computed factors L̂, Û, when
solving the lower triangular system we get

    L̂ ŷ = [1; 0]  ⇒  ŷ = [1; −10^(20)],

and now, when solving the upper triangular system we obtain

    Û x̂ = ŷ  ⇒  x̂ = [0; 1].        (4.9)

If we compare x and x̂ in (4.8) and (4.9) we will immediately realize that there is a large
error, up to the point that the computed solution has nothing to do with the exact one (they
are in completely different directions!)

4.2.3 matlab commands for the solution of SLE and the LU


factorization
These are the matlab commands for this chapter:

• [L,U,P] = lu(A): Computes the matrices L, U, and P of the factorization (4.7)
of A. This is the professional version built into matlab.

• [L,U,p] = lutx(A): This is the simplified version of the previous code lu,
included in the package ncm.

• x=A\b: Solves the system Ax=b.

• x = bslashtx(A,b): This code is also part of the package ncm. Solves the
system Ax=b using the LU factorization computed with lutx and then solv-
ing the two associated triangular systems, as explained in Section 4.2.2. More
precisely, it makes use of the following codes for solving the triangular sys-
tems:
– x = forward(L,b): Solves a lower triangular system Lx=b by forward
substitution.
– x = backsubs(U,b): Solves an upper triangular system Ux=b by back-
ward substitution.

4.2.4 Computational cost


We have introduced, in Section 3.2, the notation O(u) for arbitrarily small quanti-
ties. In the present section (and forthcoming ones) we will use the notation O( f ),


for arbitrarily large quantities, with a different meaning. More precisely, O( f ),


where f (n) is a polynomial in the corresponding variable (typically n, which in
this chapter denotes the order of the coefficient matrix), represents a quantity sat-
isfying
    lim_{n→∞} O(f)/f(n) = c,

where c is a constant (independent of n). The context in which this expression


appears in this section is the estimation of the number of operations that requires
a numerical algorithm, in terms of the variable n, which represents the size of the
coefficient matrix of the system (4.1) (we will restrict ourselves to square matrices,
namely, A ∈ Cn×n ). This way, O(n3 ) means that the number of operations required
by the algorithm is a cubic polynomial in n. The number of operations required
by an algorithm is what we mean by computational cost.
The operations that are counted in the computational cost are the elementary
arithmetic operations that have been mentioned in Section 2.1.3, namely addition,
subtraction, multiplication, division, and square roots (though here square roots
are not needed). This means, in particular, that permutations for the computation
of (4.7) are not counted.
For brevity, we will refer to each of the elementary arithmetic operations as
“flops” (which is an abbreviation of “floating point operations”).

Computational cost of backward substitution

We start by analyzing the number of operations involved in the solution of the
system (4.1) when A is an upper triangular matrix, as described in Section 4.1 (for
lower triangular matrices the analysis is similar).
Note that the first equation of (4.4) only involves one operation (a division). As
we go “downwards” in (4.4) we add 2 new operations at each step, in particular
one subtraction and one multiplication. This way, the overall cost of (4.4) is

    1 + 3 + 5 + · · · + (2n − 1) = n².

This cost is exact, and not an approximation.

Computational cost of the LU algorithm

Now we estimate the cost of the LU algorithm introduced at the end of Section
4.2.1:
The first line of the algorithm at the ith step involves n − i divisions. The second
line involves (n − i )2 products and (n − i )2 subtractions. Then, the overall cost of
the algorithm is:


Cost of the LU algorithm:

    Σ_{i=1}^{n−1} [ 2(n − i)² + (n − i) ] = 2 Σ_{j=1}^{n−1} j² + Σ_{j=1}^{n−1} j
                                          = 2 · (n − 1)n(2n − 1)/6 + (n − 1)n/2
                                          = (n − 1)n(4n + 1)/6 = (2/3) n³ + O(n²).

Computational cost of solving SLE with the LU factorization

Finally, let us analyze the cost of the algorithm in the box in Section 4.2.2. This
algorithm consists of four steps. The first one requires the computation of the LU
decomposition of A (recall that the permutations are not counted) that we have
just estimated. The second one is, simply, a permutation of the coordinates of the
vector b, and does not count. Step 3 requires solving a triangular system with 1’s
in the main diagonal, so there are no divisions involved. In Step 4 we have to
solve another triangular system. Adding up the computational cost obtained in
the previous sections, we arrive at:

The computational cost of the algorithm for solving a SLE
(4.1) using Gaussian elimination with the LU factorization
described in Section 4.2.2 is (2/3) n³ + O(n²).

Why to use LU for solving a SLE?

A natural question is the following: why is the LU factorization used to solve
linear systems, instead of plain Gaussian elimination? The answer is the following:
both methods (Gaussian elimination and LU) are similar. As we have seen, the LU
factorization is obtained by performing the same operations as in Gaussian elim-
ination, but without taking into account the right-hand side, which is incorporated
afterwards by solving two triangular linear systems (one lower and one upper).
From the point of view of the computational cost (number of flops) both methods
are equivalent if one wants to solve just one system. However, if more than one
linear system is to be solved, the LU factorization can be stored and reused to solve
all the systems with a smaller computational cost. More precisely, as we have seen
in the preceding section, the cost of the LU factorization is O(n³), whereas the cost
of solving a triangular system is O(n²). For large matrices, the difference between
n³ and n² can be huge. By contrast, solving each system with Gaussian elimination,
without using LU, is O(n³). Therefore, solving, say, r systems with LU has a
computational cost of O(n³) + O(rn²), whereas doing it without LU has a cost of O(rn³).
This situation, in which one wants to solve a lot of systems with the same coef-
ficient matrix, appears in many problems of engineering, for instance in systems
where the right-hand side varies with time, but the rest of the system remains
constant.
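The following matlab sketch illustrates the difference (the size n, the number of systems r, and the random data are arbitrary choices): the factorization is computed once and reused for all the right-hand sides.

n = 1000; r = 50;
A = randn(n); B = randn(n,r);        % r right-hand sides with the same matrix
tic                                   % option 1: one LU factorization, reused
[L,U,P] = lu(A);
X1 = zeros(n,r);
for k = 1:r
    X1(:,k) = U\(L\(P*B(:,k)));      % two triangular solves per system
end
toc
tic                                   % option 2: each system solved from scratch
X2 = zeros(n,r);
for k = 1:r
    X2(:,k) = A\B(:,k);              % O(n^3) work for every single system
end
toc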

4.3 Error analysis for the solution of SLE


In this section we want to evaluate the error committed when solving a SLE (4.1).
More precisely, we want to identify the ingredients that may cause a large error.
For this, we denote by x̂ the solution of (4.1) computed with an algorithm (regardless
of which algorithm it is) and, by x, the exact solution. Let us recall
that, since A is invertible, the exact solution is x = A⁻¹b.

Definition 7 (Absolute and relative error in the solution of a SLE). Let p = 1, 2, ∞.
The absolute error of the solution of the SLE (4.1), in the p-norm, is the number ∥x − x̂∥p.
The relative error of the solution of the SLE (4.1), in the p-norm, is the number ∥x − x̂∥p / ∥x∥p.

In order to measure the errors we are going to use the vector norms ∥·∥p,
with p = 1, 2, ∞, introduced in Remark 3, as well as the induced matrix norms,
introduced at the beginning of this chapter (see Lemma 2). If x is the exact solution
of the SLE (4.1), then ∥Ax − b∥p = 0. It is expected, then, that if x̂ is a good
approximation to the solution, then ∥Ax̂ − b∥p must be small. Then, a natural way
to estimate the error committed when solving the SLE (4.1) consists in evaluating
∥Ax̂ − b∥p. The vector r := b − Ax̂ is called the residual of the solution. Note that
the residual is known, since all its ingredients are known (they are the input data,
A, b, together with the computed solution x̂). The following result shows that the
residual is a good measure of how far x̂ is from being the exact solution of
a nearby problem.

Theorem 4 For any A ∈ Cn×n, b ∈ Cn, and x̂ ∈ Cn:

    ∥r∥2 / (∥A∥2 ∥x̂∥2) = min { ∥E∥2/∥A∥2 : (A + E) x̂ = b }.        (4.10)

Proof: In the first place, for any matrix E such that (A + E) x̂ = b, after solving for
r and taking norms we get:

    ∥r∥2 = ∥E x̂∥2 ≤ ∥E∥2 ∥x̂∥2,

and this in turn gives:

    ∥r∥2 / (∥A∥2 ∥x̂∥2) ≤ ∥E∥2/∥A∥2,

or, equivalently,

    ∥r∥2 / (∥A∥2 ∥x̂∥2) ≤ min { ∥E∥2/∥A∥2 : (A + E) x̂ = b }.

Now we see that the equality is achieved for some matrix E0, and this will tell us
that the previous inequality is an equality. For this, we are going to find a matrix
E0 satisfying

    ∥r∥2 / (∥A∥2 ∥x̂∥2) = ∥E0∥2/∥A∥2,        (A + E0) x̂ = b.        (4.11)

In particular, such a matrix is the following:

    E0 = (r · x̂*) / ∥x̂∥2².

We see, in the first place, that

    E0 · x̂ = (1/∥x̂∥2²) · r · (x̂* x̂) = r,

so the right identity in (4.11) holds. Now, let us see that the left identity also holds:

    ∥E0∥2/∥A∥2 = ∥r · x̂*∥2 / (∥A∥2 ∥x̂∥2²) = (∥r∥2 · ∥x̂∥2) / (∥A∥2 ∥x̂∥2²) = ∥r∥2 / (∥A∥2 ∥x̂∥2),

as wanted. (To obtain the second identity in the last chain of identities we have
used Exercise 3). □
Theorem 4 shows that, if the residual is small, then x̂ is the exact solution of a
nearby problem, namely the following SLE that has a nearby coefficient matrix:

    (A + E0) x̂ = b,    with    ∥E0∥2/∥A∥2 = ∥r∥2 / (∥A∥2 ∥x̂∥2).        (4.12)

Therefore, the quantity that appears to the right in the previous formula is a mea-
sure of how far the computed solution is (in relative terms) from being the exact
solution of a nearby problem. Let us recall that this is, precisely, what is mea-
sured to know whether an algorithm is backward stable or not. In this case, the
algorithm is any algorithm that has allowed us to get a solution of the SLE (4.1),
since no explicit algorithm is mentioned either in the statement or in the proof of
Theorem 4. Note that nothing is said either about x̂ being the computed solution
of a SLE: it can be an arbitrary vector. Nonetheless, when we see this vector
as the computed solution of a SLE, we reach the following notion.


Definition 8 (Backward error). The quantity ∥r∥2/(∥A∥2 ∥x̂∥2) is known as the back-
ward error (in the 2-norm) of the computed solution x̂ for solving the SLE (4.1).

Note that the backward error in Definition 8 does not take into account the
right-hand side b. In fact, the perturbed system (4.12) has the same right-hand
side as the original system (4.1).
Despite what we have seen in Theorem 4, a small residual does not guarantee
that the computed solution is a good approximation to the exact solution, as it is
shown in the following example, taken from [I09, Ex. 3.6].

Example 9 Let

    A = [ 1   10^8          b = [ 1 + 10^8          x = [ 1
          0   1 ],                1 ],                    1 ].

The vector x is the exact solution of Ax = b. Assume that we have obtained the computed
solution x̂ = [0; 1 + 10^(−8)]. The norm of the residual

    r = b − A x̂ = [0; −10^(−8)],        ∥r∥2 = 10^(−8),

is very small. However, the computed solution is far away from being the exact one, since

    ∥x − x̂∥2 / ∥x∥2 = ∥ [1; −10^(−8)] ∥2 / √2 ≈ 1/√2.

Example 9 shows that the residual is not necessarily a good measure for the error
in the solution of the SLE (4.1). The following theorem explains what is going on.
The proof can be found in [I09, p. 48].

Theorem 5 Let A ∈ Cn×n be an invertible matrix and let x be the solution of the SLE
Ax = b. Let E ∈ Cn×n be a matrix such that A + E is also invertible and let x̂ be the
solution of the SLE (A + E) x̂ = b + bE. Let us assume that ∥E∥p ≤ 1/∥A⁻¹∥p, with
p = 1, 2, ∞. Then

    ∥x − x̂∥p / ∥x∥p ≤ [ κp(A) / (1 − κp(A) ∥E∥p/∥A∥p) ] · ( ∥E∥p/∥A∥p + ∥bE∥p/∥b∥p ).        (4.13)

The expression (4.13) provides a bound, and an estimation, of the relative error in
the solution of the SLE (4.1) (see Definition 7). More precisely, it tells us that this
error depends on the condition number of the coefficient matrix, A (see Definition
6), as well as on the backward errors in the data given by the right factor in the
bound (4.13). Let us assume that we have used a backward stable method for com-
puting the solution x̂. This means that the quotients ∥E∥p/∥A∥p and ∥bE∥p/∥b∥p
are very small. On the other hand, the hypothesis ∥E∥p ≤ 1/∥A⁻¹∥p implies that
the denominator 1 − κp(A)∥E∥p/∥A∥p is a positive number smaller than 1. If the product
κp(A)∥E∥p/∥A∥p is small (something that should be expected), then the denominator
1 − κp(A)∥E∥p/∥A∥p will be close to 1. In this situation, the bound (4.13) says that the
relative error in the solution can be magnified by a factor equal to the condition
number of the matrix A with respect to the backward error. This is even more
emphasized if κp(A)∥E∥p/∥A∥p is not small.
Note that in Theorem 5 there is a perturbation in the right-hand side, unlike
what happened in Theorem 4. Now, if we combine both results (with bE = 0, as in
Theorem 4), we arrive at the bound:

    ∥x − x̂∥2 / ∥x∥2 ≤ [ κ2(A) / (1 − κ2(A) ∥r∥2/(∥A∥2 ∥x̂∥2)) ] · ∥r∥2 / (∥A∥2 ∥x̂∥2).

If the product κ2(A) ∥r∥2/(∥A∥2 ∥x̂∥2) is small (close to 0), which is expected since, even
if κ2(A) is large, the residual should be very small if the computed solution is a
good approximation to the exact solution, then

    ∥x − x̂∥2 / ∥x∥2 ≤ κ2(A) · ∥r∥2 / (∥A∥2 ∥x̂∥2).        (4.14)

Equation (4.14) is the most eloquent formula in this section, since it indicates in
a clear way which are the sources of error in the solution of a SLE (4.1). More
precisely, it depends on two factors:

(a) the condition number of the coefficient matrix A, κ2 ( A), and

(b) the norm of the residual (in relative terms).

In particular, this explains what happens in Example 9. In this example, the


condition number of the matrix A is quite large. In particular, it is approximately
10^16. On the other hand, the factor ∥r∥2/(∥A∥2 ∥x̂∥2) is approximately 10^(−16). This
implies that the relative error in the solution is close to 1. Anyway, the formula
(4.14) tells us that the only case in which a small residual (in relative terms) does
not provide a good solution to the SLE (4.1) is when the condition number of the
coefficient matrix is large.
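These quantities can be reproduced in matlab for the data of Example 9 (xh below stands for the computed solution x̂ given in that example):

A = [1 1e8; 0 1]; b = [1+1e8; 1];
x  = [1; 1];                          % exact solution
xh = [0; 1+1e-8];                     % computed solution of Example 9
r  = b - A*xh;                        % residual
norm(r)/(norm(A)*norm(xh))            % backward error, about 1e-16
cond(A)                               % condition number, about 1e16
norm(x - xh)/norm(x)                  % relative error, about 0.7, in line with (4.14)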


Remark 7 The bound given in (4.14) highlights the implication (3.10) in the solution of
the SLE (4.1). More precisely, if the algorithm is stable, the factor ∥r∥2/(∥A∥2 ∥x̂∥2) should be of the
order of the unit roundoff. On the other hand, if the problem is well conditioned, the factor
κ2(A) should be moderate. If both premises hold, then the product will be of the order of
the unit roundoff, which will provide an accurate algorithm.

The previous results refer to the solution of a SLE using any algorithm (not spec-
ified). However, in the precedent sections we have studied a particular algorithm,
namely Gaussian elimination with partial pivoting and the LU factorization. The
natural question that arises is: which is the relative error in the solution when
using this algorithm? The answer is given in the following result.

Theorem 6 If x̂ is the solution of Ax = b (where A is invertible) computed using Gaus-
sian elimination with partial pivoting, then

    ∥r∥2 / (∥A∥2 ∥x̂∥2) ≤ ρ · ε,

where ρ is a constant which “rarely” is larger than 10 and ε is the unit roundoff. As a
consequence,

    ∥x − x̂∥2 / ∥x̂∥2 ≤ κ2(A) · ρ · ε.        (4.15)

The inequality (4.15) provides a simple bound of the relative error in the solution
of the SLE (4.1) with the method analyzed in Section 4.2. The bound indicates that
the error is of the order of the unit roundoff, except for a factor equal to ρ κ2(A).
This factor is, in general, approximately κ2(A) since, as indicated in the theorem,
the factor ρ is, in general, moderate (around 10). This means that, in general, the
condition number of the matrix A is the only value that may affect the stability of
the Gaussian elimination algorithm with partial pivoting, which is what happens
in Example 9. In practice, the condition number of A indicates the number of
significant digits that are lost in the computation of the solution of the SLE (4.1).
Finally, we note that, in the relative error (4.15), the denominator is not the norm
of the exact solution, but the norm of the computed one. This is not a problem,
since both quantities are expected to be close to each other and, moreover, ∥x̂∥ is
known, whereas ∥x∥ is not.

Remark 8 In practice, since the exact solution x is unknown, the way to estimate the
error is by means of error bounds. The bounds that we have seen in Theorems 4–6 are
just this: bounds, which means that they just give an idea of the maximum error that can
be committed. Nonetheless, these bounds usually provide a very accurate idea of the error
in practice. Anyway, it is important that these bounds are computable, since, otherwise,


we will not be able to get an estimation of the error. In particular, the bound given in
(4.15) depends on two quantities that are difficult to compute, namely: ρ and κ2 ( A).
The value ρ, though difficult to estimate, is in practice, as we have said, less than 10.
Regarding the condition number, though it is costly to compute numerically, there are
some commands in matlab to compute it. In particular, these commands are cond(A,p),
where p = 1, 2, inf, and condest(A), which estimates the condition number in the 1-
norm in a faster way than the previous one.

4.4 Banded and sparse matrices


A matrix is said to be sparse when most of its entries are zero (otherwise it is called
dense).
There are some algorithms that are specially designed for sparse matrices. These
algorithms take advantage of the zero structure of the matrix to get the solution in
a more efficient way (namely, with a smaller number of operations). A particular
case of these matrices are banded matrices, namely those having only a small number
of nonzero diagonals, which are the ones closest to the main diagonal. To provide
a proper definition, we need the following concept.

Definition 9 (Bandwidth). The bandwidth of a matrix A is r + s + 1, where A(i, j) =
0 for i − j > r and for i − j < −s, and there are (i0, j0), (i1, j1) with i0 − j0 = r and
i1 − j1 = −s such that A(i0, j0) ≠ 0 ≠ A(i1, j1).

In other words, the bandwidth is the number of nonzero diagonals around the
main diagonal, including the main diagonal (this is the reason for the addend 1 in
r + s + 1 in Definition 9).
An m × n banded matrix is a matrix having a low bandwidth (compared to the
total width of the matrix, namely m + n − 1).

4.4.1 Solution of a SLE with a tridiagonal coefficient matrix


A relevant class of banded matrices is the one of tridiagonal matrices, which fre-
quently appear in vibration problems. These matrices are those with bandwidth
equal to 3. For instance:

    [  1   6   0   0   0
      −2   5   7   0   0
       0   7   1   5   0
       0   0   6   7   8
       0   0   0   0  −4
       0   0   0   0   7 ].


Notice that a tridiagonal matrix is not necessarily square, as in the previous exam-
ple (the matrix is 6 × 5). However, in this course we will focus on square matrices.
In particular, we are going to see that the Gaussian elimination algorithm with-
out pivoting to solve the SLE, when the coefficient matrix is tridiagonal, becomes
quite simple.
In general, a tridiagonal n × n matrix can be written as:

    Tn = [ b1   c1   0    0    ...   0
           a1   b2   c2   0    ...   0
           0    a2   b3   c3   ...   0
           ...  ...  ...  ...  ...   ...
           0    ...  0    an−2  bn−1  cn−1
           0    0    ...  0     an−1  bn ].

Now, let us write d = [d1 · · · dn]^T for the right-hand side and let us apply
the Gaussian elimination method to the augmented matrix of the SLE, namely
Tn x = d:
    [ b1  c1  0   ...  0  | d1                    [ b1  c1   0   ...  0  | d1
      a1  b2  c2  ...  0  | d2                      0   b2′  c2  ...  0  | d2′
      0   a2  b3  ...  0  | d3       −→             0   a2   b3  ...  0  | d3
      ...                 | ...   R21(−a1/b1)       ...                  | ...
      0   ...  0  an−1 bn | dn ]                    0   ...  0   an−1 bn | dn ],

where b2′ = b2 − (a1/b1) c1 and d2′ = d2 − (a1/b1) d1. Proceeding with the next row:

    [ b1  c1   0   ...  0  | d1                    [ b1  c1   0    ...  0  | d1
      0   b2′  c2  ...  0  | d2′                     0   b2′  c2   ...  0  | d2′
      0   a2   b3  ...  0  | d3       −→             0   0    b3′  c3  ...  | d3′
      ...                  | ...   R32(−a2/b2′)      ...                   | ...
      0   ...  0   an−1 bn | dn ]                    0   ...  0    an−1 bn | dn ],

where b3′ = b3 − (a2/b2′) c2 and d3′ = d3 − (a2/b2′) d2′. These two steps are enough to observe
a template that is repeated in all iterations of Gaussian elimination:

(i) For each pivot only one elementary operation is needed, which is applied to
the next row. More precisely, at step i we only modify the following two
entries: bi′ = bi − (ai−1/b′i−1) ci−1 and di′ = di − (ai−1/b′i−1) d′i−1.

(ii) The matrix obtained at the end of each step is still tridiagonal.


Property (ii) above is the one that allows us to confirm that property (i) will still
hold up to the end of the algorithm. This property of preserving the structure is
key in numerical analysis.
 As a consequence of property (i) above, the overall computational cost of the
method is 8(n − 1). That is, the cost is O(n), which is much smaller than the O(n³)
cost for dense matrices.
The augmented matrix in echelon form that we obtain at the end of the proce-
dure is

    [ b1  c1   0    0    ...  0      | d1
      0   b2′  c2   0    ...  0      | d2′
      0   0    b3′  c3   ...  0      | d3′
      ...                     ...    | ...
      0   ...  0    b′n−1     cn−1   | d′n−1
      0   0    ...  0         b′n    | d′n ].
In the program folder ncm the previous algorithm for tridiagonal matrices is
implemented in the code tridisolve.
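A minimal sketch of such a solver (we call it trisolve here; a, b, c are the subdiagonal, diagonal, and superdiagonal of Tn, and d is the right-hand side, following the notation above) could be:

function x = trisolve(a,b,c,d)
% Solves Tn*x = d for a tridiagonal Tn, without pivoting, in O(n) operations
n = length(b);
for k = 2:n                      % elimination: one row operation per pivot
    m = a(k-1)/b(k-1);           % multiplier (b(k-1) already holds b'_{k-1})
    b(k) = b(k) - m*c(k-1);
    d(k) = d(k) - m*d(k-1);
end
x = zeros(n,1);                  % backward substitution on the bidiagonal system
x(n) = d(n)/b(n);
for k = n-1:-1:1
    x(k) = (d(k) - c(k)*x(k+1))/b(k);
end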

5 Numerical interpolation

Let us assume that a function f : R → R is unknown, but we know (or we are able
to evaluate) the images of f in some values of the domain. Can we extract any
further information on f in other values of the domain? Interpolation provides an
answer to this question. More precisely, let

( x1 , y1 ), ( x2 , y2 ), . . . , ( x n , y n ), with xi , yi ∈ R, for i = 1, . . . , n. (5.1)

We want to find a function f : R → R such that f ( xi ) = yi , for i = 1, . . . , n. The


additional properties that we want to impose to f define different interpolation
problems. In this course we will study the following ones:

1. f is a polynomial. This gives rise to polynomial interpolation, which is


analyzed in Section 5.1.

2. f is defined “piecewise”, and in each interval it is a different polynomial of
low degree (linear or cubic). This case gives rise to what is known as piecewise
interpolation, which is analyzed in Sections 5.2, 5.3, and 5.4.

Anyway, f is known as the interpolating function in the nodes (5.1), and the points
(5.1) are known as interpolation nodes.
Note that, in both cases above, the interpolating functions are polynomial func-
tions. The reason for using such functions is that these are the most manageable
functions.

5.1 Polynomial interpolation


It is well-known that two points determine a unique straight line. This is a partic-
ular case of an interpolating polynomial of degree 1. For an arbitrary degree we
introduce the following definition.

Definition 10 (Interpolating polynomial). Given the points ( x1 , y1 ), . . . , ( xn , yn ), with


xi , yi ∈ R and xi ̸= x j , for i ̸= j, the interpolating polynomial associated to these points
is a polynomial with degree at most n − 1, P( x ), such that P( xi ) = yi , for i = 1, . . . , n.

The interpolating polynomial is unique, as stated in the following theorem.


Theorem 7 The interpolating polynomial always exists and it is unique.

Proof: Let us look for a polynomial of degree n − 1,

P ( x ) = a n −1 x n −1 + · · · + a 1 x + a 0 , (5.2)

satisfying the conditions of an interpolating polynomial. These conditions provide


the following system of linear equations:

    P(x1) = y1 ⇒ an−1 x1^(n−1) + · · · + a1 x1 + a0 = y1,
                              ...                                   (5.3)
    P(xn) = yn ⇒ an−1 xn^(n−1) + · · · + a1 xn + a0 = yn,
whose augmented matrix is

    [ x1^(n−1)   x1^(n−2)   ...   x1   1  |  y1
      ...        ...        ...   ...  ... |  ...
      xn^(n−1)   xn^(n−2)   ...   xn   1  |  yn ].

The coefficient matrix of the system,

    V(x1, . . . , xn) := [ x1^(n−1)   x1^(n−2)   ...   x1   1
                           ...        ...        ...   ...  ...
                           xn^(n−1)   xn^(n−2)   ...   xn   1 ],        (5.4)

is the Vandermonde matrix associated to the vector x = [x1 . . . xn]^T. By means
of elementary row operations we can obtain its determinant

det V ( x1 , . . . , xn ) = ∏ ( x i − x j ).
i< j

Looking at the determinant, we conclude that, if xi ̸= x j , for i ̸= j, the matrix


V ( x1 , . . . , xn ) is invertible, so the system (5.3) has a solution, for any y1 , . . . , yn .
Moreover, the solution is unique, since the coefficient matrix is invertible. □
The proof of Theorem 7 provides a procedure to get the interpolating polyno-
mial. In particular, it gives the interpolating polynomial in the power form (5.2).
This form is easy to evaluate, for any c ∈ R:

P ( c ) = a n −1 c n −1 + · · · + a 1 c + a 0 ,

since it only requires 2n − 3 multiplications and n − 1 additions (namely, 3n −


4 flops). By contrast, obtaining the coefficients of the polynomial is an expensive
procedure, since it requires solving a SLE of size n × n.
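As a sketch of the procedure in the proof (with arbitrary example data, and without any claim of efficiency or numerical stability), the coefficients of the power form can be computed in matlab by building the Vandermonde matrix (5.4) and solving the system:

x = [1; 2; 3; 4]; y = [2; 3; 5; 4];   % arbitrary interpolation nodes
n = length(x);
V = zeros(n);
for j = 1:n
    V(:,j) = x.^(n-j);                % columns x.^(n-1), ..., x, 1, as in (5.4)
end
a = V\y;                              % coefficients a_{n-1}, ..., a_1, a_0
polyval(a,2.5)                        % evaluation of the power form at one point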


The Lagrange form of the interpolating polynomial

The Lagrange form is an alternative to the power form that provides an explicit
expression for the interpolating polynomial. To define it we first introduce the
following concept.

Definition 11 (Lagrange polynomials). Given the nodes ( x1 , y1 ), . . . ( xn , yn ), the La-


grange polynomials associated to these nodes are of the form

    ℓk(x) = ∏_{i≠k} (x − xi) / ∏_{i≠k} (xk − xi),        k = 1, . . . , n.        (5.5)

The Lagrange polynomials (5.5) have the following properties, that can be easily
checked:

1. They are all of degree n − 1.



2. ℓk(xj) = 1 if j = k, and ℓk(xj) = 0 if j ≠ k.
Now, the interpolating polynomial in the Lagrange form for the nodes (x1, y1), . . . , (xn, yn)
is given by:

    P(x) = Σ_{k=1}^{n} yk ℓk(x).        (5.6)

Note that the interpolating polynomial in the Lagrange form (5.6) has degree
n − 1 and that P( xi ) = yi , for i = 1, . . . , n.
The Lagrange form is also expensive to obtain, because of the amount of differ-
ences xk − xi that must be computed, as well as their respective products. It is also
expensive to evaluate, since ℓk (c) requires computing lots of products when n is
large.
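As an illustration of formula (5.6), a direct evaluation of the Lagrange form at a single point c could be coded as follows (the function name lagrangeval is ours; x and y are the vectors of nodes):

function P = lagrangeval(x,y,c)
% Evaluates the interpolating polynomial in the Lagrange form (5.6) at c
n = length(x);
P = 0;
for k = 1:n
    ell = 1;                          % build l_k(c) as in (5.5)
    for i = [1:k-1, k+1:n]
        ell = ell*(c - x(i))/(x(k) - x(i));
    end
    P = P + y(k)*ell;                 % accumulate y_k * l_k(c)
end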
Anyway, polynomial interpolation is not frequently used in practice, due to the
amount of errors involved, that are usually located close to the endpoints of the
interpolating interval, when the function has large oscillations in this interval (see
Section 5.6).

The function polyinterp

This function is included in the folder ncm.


Given two vectors, x,y, of the same dimension, and a third vector v, the function
u = polyinterp(x,y,v) provides a vector u such that u(i)=P(v(i)), where P is
the interpolating polynomial through the nodes (x(1),y(1)),(x(2),y(2)),...


5.2 Piecewise linear interpolation


Assume that the nodes (5.1) are ordered such that x1 < x2 < · · · < xn . The
piecewise interpolating linear function is given by

    P(x) = y1 + (y2 − y1)/(x2 − x1) · (x − x1),              if x1 ≤ x < x2,
           y2 + (y3 − y2)/(x3 − x2) · (x − x2),              if x2 ≤ x < x3,
           ...
           yn−1 + (yn − yn−1)/(xn − xn−1) · (x − xn−1),      if xn−1 ≤ x < xn.

The above piecewise function P(x) always exists and it is unique. Note that
yk + (yk+1 − yk)/(xk+1 − xk) · (x − xk) is the straight line through (xk, yk) and (xk+1, yk+1).
x k +1 − x k ( x
To evaluate P(x), with x ∈ R, we need to:

1. Obtain the index, k, of the interval such that xk ≤ x < xk+1, with 1 ≤ k ≤ n − 1.

2. Evaluate P(x) = yk + (yk+1 − yk)/(xk+1 − xk) · (x − xk) = yk + δk · sk, where

    δk := (yk+1 − yk)/(xk+1 − xk)        (5.7)

is known as the first divided difference in [xk, xk+1), and

    sk := x − xk        (5.8)

is known as the local variable in [xk, xk+1).
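A minimal sketch of these two steps in matlab, for a single evaluation point xq with x(1) ≤ xq < x(end) (piecelin, described below, does the same job for whole vectors of points):

k     = find(x(1:end-1) <= xq, 1, 'last');   % index of the interval containing xq
delta = (y(k+1) - y(k))/(x(k+1) - x(k));     % first divided difference (5.7)
s     = xq - x(k);                           % local variable (5.8)
Pxq   = y(k) + delta*s;                      % value of the interpolant at xq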

Piecewise linear interpolation produces a function which is not differentiable


in the interpolation nodes. This problem can be overcome using interpolating
polynomials of higher degree in each interval, like we see in forthcoming sections.

The function piecelin

This function is implemented in the folder ncm. Given two vectors x,y, and an-
other vector u, the function v = piecelin(x,y,u) provides a vector v such that
v(k) is P(u(k)), where P is the piecewise linear interpolating polynomial through
the nodes
(x(1),y(1)),(x(2),y(2)),...
Namely: the function evaluates the interpolating polynomial in a given number
of real values.
It is important to note that, in order for piecelin to provide the output, the
coordinates of the vector x must be ordered in increasing order: x(1)<x(2)<...


5.3 Piecewise cubic interpolation


It aims to obtain an interpolating function, P(x), which on each interval [xk, xk+1) is a
cubic polynomial passing through (xk, yk) and (xk+1, yk+1), for k = 1, . . . , n − 1. There
are infinitely many cubic polynomials through these two points, so we need to impose some
additional condition for the function to be unique. In particular, we impose that
the derivatives at the nodes are prefixed. This leads to the following result.

Theorem 8 (Hermite interpolation). Let ( xk , yk ) and ( xk+1 , yk+1 ) be two points such
that xk ̸= xk+1 , and let dk , dk+1 be any two real numbers. Then there exists a unique cubic
polynomial, Pk ( x ), such that

Pk ( xk ) = yk , Pk ( xk+1 ) = yk+1 ,
(5.9)
Pk′ ( xk ) = dk , Pk′ ( xk+1 ) = dk+1 .

Moreover, such polynomial is

    Pk(x) = (3hk sk² − 2sk³)/hk³ · yk+1 + (hk³ − 3hk sk² + 2sk³)/hk³ · yk + sk²(sk − hk)/hk² · dk+1 + sk(sk − hk)²/hk² · dk,        (5.10)

where hk := xk+1 − xk and sk := x − xk.

Proof: The polynomial (5.10) is clearly of degree at most 3. Moreover, we can see
that it satisfies (5.9). The first line is immediate:

• If x = xk then sk = 0, so Pk ( xk ) = 0 + yk + 0 + 0 = yk .

• If x = xk+1 then sk = hk , so Pk ( xk+1 ) = yk+1 + 0 + 0 + 0 = yk+1 .

For the derivatives, note that differentiating with respect to x is the same as
differentiating with respect to sk :

    Pk′(x) = (6hk sk − 6sk²)/hk³ · yk+1 + (−6hk sk + 6sk²)/hk³ · yk + (2sk(sk − hk) + sk²)/hk² · dk+1 + ((sk − hk)² + 2sk(sk − hk))/hk² · dk,

and we obtain:

• Pk′ ( xk ) = 0 + 0 + 0 + dk = dk ,

• Pk′ ( xk+1 ) = 0 + 0 + dk+1 + 0 = dk+1 .



 Piecewise cubic interpolation provides a function consisting of n − 1 cubic
polynomials, P1 ( x ), . . . , Pn−1 ( x ), where the polynomial Pk ( x ) is defined in the in-
terval [ xk , xk+1 ).
Note that piecewise cubic interpolation depends on the values of the derivatives
dk and dk+1 . In this course we are going to see two different ways to determine
these values, that give rise to two different interpolating formulas. One of them is
known as splines interpolation, that is analyzed in Section 5.4, and the other one is
produced by the matlab command pchip, that is explained below. In both cases,
the choice gives rise to differentiable functions, since for the set of nodes (5.1) the
variables that we need to choose are d1 , . . . , dn (namely: the variable dk is the same
for both couples ( xk−1 , yk−1 ), ( xk , yk ) and ( xk , yk ), ( xk+1 , yk+1 )). This is summarized
in the following remark.

Remark 9 The polynomials Pk ( x ) of piecewise cubic interpolation satisfy

    Pk−1(xk) = Pk(xk),        P′k−1(xk) = P′k(xk),        k = 2, . . . , n − 1.

The command pchip

This command is implemented in matlab, and it has a simple version in the folder
ncm, called pchiptx.
The command chooses each value dk taking into account the slopes δk−1 and δk (given
by (5.7)) of the straight lines joining the consecutive couples of points (xk−1, yk−1), (xk, yk)
and (xk, yk), (xk+1, yk+1). More precisely:

• If δk−1 δk ≤ 0, then dk = 0.

• If δk−1 δk > 0 then dk is the harmonic mean of the two slopes if xk+1 − xk =
xk − xk−1 , or a weighted mean otherwise.

For more information, see [M04, §3.4].
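Both piecewise cubic interpolants can be compared directly with the built-in commands pchip and spline (the data below are arbitrary; spline corresponds to the interpolant of the next section):

x  = 1:6;
y  = [16 18 21 17 15 12];            % arbitrary data
xx = linspace(1,6,201);
yp = pchip(x,y,xx);                  % shape-preserving piecewise cubic
ys = spline(x,y,xx);                 % cubic spline (Section 5.4)
plot(x,y,'o',xx,yp,'-',xx,ys,'--')
legend('data','pchip','spline')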

5.4 Cubic piecewise interpolation with splines


It is possible to obtain a cubic interpolating polynomial which is not only differen-
tiable, but twice differentiable, by appropriately choosing the values dk . For this,
we impose the condition
Pk′′−1 ( xk ) = Pk′′ ( xk ). (5.11)
In order to obtain the corresponding value of dk , we first need to know the value
of Pk′′ ( x ), which is given in the following lemma, whose proof is left as an exercise.


Lemma 4 If Pk(x) is as in (5.10), then

    Pk′′(x) = [ (6hk − 12sk)δk + (6sk − 2hk)dk+1 + (6sk − 4hk)dk ] / hk²,

where δk and sk are as in (5.7) and (5.8), respectively, and hk = xk+1 − xk.

When evaluating the previous expression for the second derivative of Pk ( x ) in
the endpoints xk and xk+1 we get:

    P_k''( x_k ) = \frac{6 h_k \delta_k - 2 h_k d_{k+1} - 4 h_k d_k}{h_k^2}
                 = \frac{6 \delta_k - 2 d_{k+1} - 4 d_k}{h_k},
    P_k''( x_{k+1} ) = \frac{-6 \delta_k + 4 d_{k+1} + 2 d_k}{h_k},   so:
    P_{k-1}''( x_k ) = \frac{-6 \delta_{k-1} + 4 d_k + 2 d_{k-1}}{h_{k-1}}.

Substituting the first and last of the previous identities into condition (5.11) we get:

hk dk−1 + 2(hk−1 + hk )dk + hk−1 dk+1 = 3(hk δk−1 + hk−1 δk ), (5.12)

which, in matrix form, reads:

    \begin{bmatrix}
    h_2 & 2(h_1+h_2) & h_1        &            &                    &         \\
        & h_3        & 2(h_2+h_3) & h_2        &                    &         \\
        &            & \ddots     & \ddots     & \ddots             &         \\
        &            &            & h_{n-1}    & 2(h_{n-2}+h_{n-1}) & h_{n-2}
    \end{bmatrix}
    \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_n \end{bmatrix}
    =
    \begin{bmatrix}
    3(h_2\delta_1+h_1\delta_2) \\ 3(h_3\delta_2+h_2\delta_3) \\ \vdots \\ 3(h_{n-1}\delta_{n-2}+h_{n-2}\delta_{n-1})
    \end{bmatrix}.

The analogue to Remark 9 for splines is the following remark.

Remark 10 The polynomials Pk ( x ) of spline cubic interpolation satisfy

P_{k−1}( x_k ) = P_k( x_k ), P_{k−1}'( x_k ) = P_k'( x_k ), P_{k−1}''( x_k ) = P_k''( x_k ), k = 2, . . . , n − 1.

 The coefficient matrix of the previous system has size (n − 2) × n, which


means that the system is underdetermined. Moreover, since hk ̸= 0, for k =
2, . . . , n − 1, the matrix has n − 2 pivots, so the system is consistent with infinitely
many solutions (with two degrees of freedom). To obtain a system which has


a unique solution we need to add another two restrictions. Now, we study the
model obtained from a particular choice of these conditions.
The strategy that we follow to uniquely determine the interpolating function is
the one known as “not-a-knot”, and consists in making x2 and xn−1 no longer
act as knots, by forcing the interpolant in each of the intervals [ x1 , x3 ) and
[ xn−2 , xn ) to be a single cubic polynomial. In other words, we impose

P1 ( x ) = P2 ( x ) and Pn−2 ( x ) = Pn−1 ( x ). (5.13)

The following result, whose proof is left as an exercise, will allow us to obtain
an alternative condition.

Lemma 5 If P( x ) and Q( x ) are two cubic polynomials such that P( x0 ) = Q( x0 ), P′ ( x0 ) =


Q′ ( x0 ), P′′ ( x0 ) = Q′′ ( x0 ) and P′′′ ( x0 ) = Q′′′ ( x0 ), for some x0 ∈ R, then P( x ) = Q( x ),
for all x ∈ R.

Remark 10, together with (5.11) and Lemma 5 tell us that it is enough to impose
the additional conditions

P1′′′ ( x2 ) = P2′′′ ( x2 ) and Pn′′′−2 ( xn−1 ) = Pn′′′−1 ( xn−1 ) (5.14)

in order to satisfy (5.13).


Differentiating the formula for the second derivative of Pk ( x ) in Lemma 4 we
obtain:

    P_k'''( x ) = \frac{-12 \delta_k + 6 d_{k+1} + 6 d_k}{h_k^2},

so that the first identity in (5.14) can be expressed as

    \frac{-12 \delta_1 + 6 d_2 + 6 d_1}{h_1^2} = \frac{-12 \delta_2 + 6 d_3 + 6 d_2}{h_2^2},

namely:

    h_2^2 d_1 + (h_2 + h_1)(h_2 - h_1) d_2 - h_1^2 d_3 = 2 h_2^2 \delta_1 - 2 h_1^2 \delta_2.        (5.15)
Equation (5.12) with k = 2 gives

h2 d1 + 2(h1 + h2 )d2 + h1 d3 = 3(h2 δ1 + h1 δ2 ), (5.16)

so that, by adding equation (5.16) multiplied by h1 to (5.15), we can eliminate d3
and obtain

    h_2 d_1 + (h_1 + h_2) d_2 = \frac{3 h_1 h_2 + 2 h_2^2}{h_1 + h_2} \cdot \delta_1 + \frac{h_1^2}{h_1 + h_2} \cdot \delta_2 =: 3 r_1.        (5.17)


Analogously, the second identity in (5.14) gives

    h_{n-1}^2 d_{n-2} + (h_{n-1} + h_{n-2})(h_{n-1} - h_{n-2}) d_{n-1} - h_{n-2}^2 d_n = 2 h_{n-1}^2 \delta_{n-2} - 2 h_{n-2}^2 \delta_{n-1},        (5.18)

while equation (5.12) with k = n − 1 is

    h_{n-1} d_{n-2} + 2 (h_{n-2} + h_{n-1}) d_{n-1} + h_{n-2} d_n = 3 (h_{n-1} \delta_{n-2} + h_{n-2} \delta_{n-1}),        (5.19)

so that, subtracting (5.18) from equation (5.19) multiplied by hn−1 , we can eliminate dn−2 and obtain

    (h_{n-1} + h_{n-2}) d_{n-1} + h_{n-2} d_n = \frac{3 h_{n-1} h_{n-2} + 2 h_{n-2}^2}{h_{n-1} + h_{n-2}} \cdot \delta_{n-1} + \frac{h_{n-1}^2}{h_{n-1} + h_{n-2}} \cdot \delta_{n-2} =: 3 r_n.        (5.20)
Adding up equations (5.17) and (5.20) to the system (5.12) we obtain the following
system:
   
    \begin{bmatrix}
    h_2 & h_1+h_2    &            &         &                    &         \\
    h_2 & 2(h_1+h_2) & h_1        &         &                    &         \\
        & h_3        & 2(h_2+h_3) & h_2     &                    &         \\
        &            & \ddots     & \ddots  & \ddots             &         \\
        &            &            & h_{n-1} & 2(h_{n-2}+h_{n-1}) & h_{n-2} \\
        &            &            &         & h_{n-2}+h_{n-1}    & h_{n-2}
    \end{bmatrix}
    \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_n \end{bmatrix}
    = 3
    \begin{bmatrix}
    r_1 \\ h_2\delta_1+h_1\delta_2 \\ h_3\delta_2+h_2\delta_3 \\ \vdots \\ h_{n-1}\delta_{n-2}+h_{n-2}\delta_{n-1} \\ r_n
    \end{bmatrix},        (5.21)
where r1 and rn are given by the right-hand side of (5.17) and (5.20), respectively.
 The SLE (5.21) is a tridiagonal system, which can be solved in an efficient way
using the procedure studied in Section 4.4.1.
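
As an illustration, the following sketch assembles and solves the system (5.21) for given data vectors x and y. The function name splineslopes, as well as the use of a full matrix solved with "\" (instead of the tridiagonal solver of Section 4.4.1 or a sparse matrix), are choices made here only for brevity:

function d = splineslopes(x,y)
% Sketch: slopes d_k of the not-a-knot cubic spline, following (5.21).
% For simplicity the system is stored as a full matrix; in practice one
% would exploit its tridiagonal structure (Section 4.4.1).
  n = length(x);
  h = diff(x); h = h(:);            % h_k = x_{k+1}-x_k, k = 1,...,n-1
  delta = diff(y(:))./h;            % delta_k, slopes of the chords (5.7)
  A = zeros(n); b = zeros(n,1);
  for k = 2:n-1                     % interior equations (5.12)
    A(k,k-1:k+1) = [h(k) 2*(h(k-1)+h(k)) h(k-1)];
    b(k) = 3*(h(k)*delta(k-1) + h(k-1)*delta(k));
  end
  % not-a-knot end conditions (5.17) and (5.20)
  A(1,1:2) = [h(2) h(1)+h(2)];
  b(1) = ((3*h(1)*h(2)+2*h(2)^2)*delta(1) + h(1)^2*delta(2))/(h(1)+h(2));
  A(n,n-1:n) = [h(n-1)+h(n-2) h(n-2)];
  b(n) = ((3*h(n-1)*h(n-2)+2*h(n-2)^2)*delta(n-1) + h(n-1)^2*delta(n-2))/(h(n-1)+h(n-2));
  d = A\b;                          % slopes d_1,...,d_n
end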

Remark 11 The piecewise cubic interpolation polynomial (5.10) can also be expressed in
the local variable sk = x − xk in power form

    P_k( x ) = y_k + s_k d_k + s_k^2 c_k + s_k^3 b_k ,        (5.22)

where

    c_k = \frac{3 \delta_k - 2 d_k - d_{k+1}}{h_k},   b_k = \frac{d_k - 2 \delta_k + d_{k+1}}{h_k^2}.

The command spline

The professional command in matlab for the computation of piecewise cubic


splines is spline. There is a simplified version in the folder ncm, called splinetx.
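
As a quick illustration of both commands, the following lines (a sketch; the data values are arbitrary and not from the notes) evaluate and plot the pchip and spline interpolants of the same set of nodes, so that the effect of the two different choices of the slopes dk can be compared:

% Compare the two piecewise cubic interpolants on some sample nodes
x = 1:6;                      % abscissas of the nodes
y = [16 18 21 17 15 12];      % ordinates of the nodes
u = linspace(1,6,200);        % fine grid where the interpolants are evaluated
vp = pchip(x,y,u);            % shape-preserving choice of the slopes d_k
vs = spline(x,y,u);           % not-a-knot cubic spline of Section 5.4
plot(x,y,'o',u,vp,'-',u,vs,'--')
legend('nodes','pchip','spline')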


5.5 The Newton form and the method of divided differences


Let us come back to the problem of finding the interpolating polynomial of degree
n − 1 through the nodes (5.1). The Newton form of this polynomial is

    P( x ) = α_0 + α_1 ( x − x_1 ) + α_2 ( x − x_1 )( x − x_2 ) + · · · + α_{n−1} ( x − x_1 )( x − x_2 ) · · · ( x − x_{n−1} ).        (5.23)

The coefficient α0 is easy to obtain: taking into account that P( x1 ) = y1 , it must be
α0 = y1 . Similarly, the coefficient αn−1 is the same as c1 in the power form, since
in both cases it is the coefficient of the term of degree n − 1. How can we obtain the
remaining coefficients αi of (5.23)? Let us see two ways to do it.


Method 1: Solving a linear system.


In particular, it is a system obtained by evaluating in the nodes (5.1), as we have
done to obtain the polynomial in the power form. Such system is
    
    \begin{bmatrix}
    1 & 0         & 0                      & \cdots & 0 \\
    1 & x_2 - x_1 & 0                      & \cdots & 0 \\
    1 & x_3 - x_1 & (x_3 - x_1)(x_3 - x_2) & \cdots & 0 \\
    \vdots & \vdots & \vdots               & \ddots & \vdots \\
    1 & x_n - x_1 & (x_n - x_1)(x_n - x_2) & \cdots & (x_n - x_1)(x_n - x_2)\cdots(x_n - x_{n-1})
    \end{bmatrix}
    \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_{n-1} \end{bmatrix}
    =
    \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix}.        (5.24)
The previous system is a lower triangular system, that can be solved efficiently by
forward substitution. Nonetheless, solving it usually presents conditioning issues,
as well as overflow and underflow.
Note that, from the system (5.24) we deduce the following:

P0 ( x ) = α0
is the interpolating polynomial through the node ( x1 , y1 ).
P1 ( x ) = α0 + α1 ( x − x1 )
is the interpolating polynomial through the nodes ( x1 , y1 ), ( x2 , y2 ).
P2 ( x ) = α0 + α1 ( x − x1 ) + α2 ( x − x1 )( x − x2 )
is the interpolating polynomial through the nodes ( x1 , y1 ), ( x2 , y2 ), ( x3 , y3 ).
..
.
P( x ) = Pn−1 ( x ) = α0 + α1 ( x − x1 ) + α2 ( x − x1 )( x − x2 ) + · · · + αn−1 ( x − x1 ) · · · ( x − xn−1 )
is the interpolating polynomial through the nodes (5.1).

In other words: the system (5.24) solves n nested interpolation problems,


with 1, 2, . . . , n nodes, respectively, and gives us an expression for the solution by
means of an interpolating polynomial in the Newton form.

Method 2: The method of divided differences.


Let us start with the following definition.

Definition 12 (Divided differences). Given the nodes (5.1), the kth divided differ-
ence in the subset ( xi1 , yi1 ), . . . , ( xik , yik ) is the coefficient of largest degree of the interpo-
lating polynomial through the nodes ( xi1 , yi1 ), . . . , ( xik , yik ). We denote it by F [ xi1 . . . xik ].

As we have seen above, the interpolating polynomial in the Newton form can
be expressed as

P( x ) = F [ x1 ] + F [ x1 x2 ]( x − x1 ) + F [ x1 x2 x3 ]( x − x1 )( x − x2 )
+ · · · + F [ x1 x2 . . . xn ]( x − x1 )( x − x2 ) · · · ( x − xn−1 ).


In other words, the divided difference F [ x1 . . . xk ] is the coefficient αk−1 of the


Newton polynomial through the nodes (5.1).
The following result provides an iterative method for computing the divided
differences.

Theorem 9 Given the nodes (5.1), it holds that:

    F[ x_1 x_2 \ldots x_k ] = \frac{F[ x_2 \ldots x_k ] - F[ x_1 \ldots x_{k-1} ]}{x_k - x_1},        (5.25)

for k = 2, . . . , n.

Proof: Let us denote by

p ( x ): interpolating polynomial through the nodes ( x1 , y1 ), . . . , ( xk , yk ).


q ( x ): interpolating polynomial through the nodes ( x1 , y1 ), . . . , ( xk−1 , yk−1 ).
r ( x ): interpolating polynomial through the nodes ( x2 , y2 ), . . . , ( xk , yk ).

Let us see that q( x ) + \frac{x - x_1}{x_k - x_1} · (r( x ) − q( x )) is the interpolating polynomial through
( x1 , y1 ), . . . , ( xk , yk ). For this, it suffices to show that it has degree at most k − 1 and
that when we evaluate it at x1 , . . . , xk the respective values y1 , . . . , yk are obtained.
The fact that its degree is at most k − 1 is obvious, since both q( x ) and r ( x ) have
degree at most k − 2. Now, when we evaluate at x1 , . . . , xk we obtain:

    x = x_1:                 q( x_1 ) + \frac{x_1 - x_1}{x_k - x_1} · (r( x_1 ) − q( x_1 )) = q( x_1 ) = y_1.
    x = x_j (2 ≤ j ≤ k − 1): q( x_j ) + \frac{x_j - x_1}{x_k - x_1} · (r( x_j ) − q( x_j )) = y_j + \frac{x_j - x_1}{x_k - x_1} ( y_j − y_j ) = y_j.
    x = x_k:                 q( x_k ) + \frac{x_k - x_1}{x_k - x_1} · (r( x_k ) − q( x_k )) = r( x_k ) = y_k.

Therefore:

    p( x ) = q( x ) + \frac{x - x_1}{x_k - x_1} · (r( x ) − q( x )).
From this identity we deduce that the coefficient of degree k − 1 of p( x ) coin-
cides with the difference between the coefficients of degree k − 1 of r ( x ) and q( x ),
divided by xk − x1 . In other words, (5.25) holds. □
Theorem 9 allows us to obtain the kth divided differences from the (k − 1)st
ones, following the scheme in Figure 5.1. In this figure, the entries of each column
are obtained from the two entries indicated in the previous column using the
formula (5.25), which also involves the division between the values corresponding
to the variable x.


    y_1 = F[x_1]
    y_2 = F[x_2]          F[x_1 x_2]
    y_3 = F[x_3]          F[x_2 x_3]          F[x_1 x_2 x_3]
    y_4 = F[x_4]          F[x_3 x_4]          F[x_2 x_3 x_4]          ···
      ⋮                     ⋮
    y_{n−1} = F[x_{n−1}]  F[x_{n−2} x_{n−1}]  ···
    y_n = F[x_n]          F[x_{n−1} x_n]      F[x_{n−2} x_{n−1} x_n]  ···

Figure 5.1: Scheme of the divided differences to obtain the coefficients of the Newton interpolat-
ing polynomial.
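
The following sketch (the function name divdiff is ours, not from the notes) computes the coefficients α0 , . . . , αn−1 by filling the columns of the scheme in Figure 5.1 with formula (5.25), and then evaluates the Newton form (5.23) with a nested scheme:

function [alpha,P] = divdiff(x,y,t)
% Sketch: Newton interpolating polynomial via divided differences.
% alpha(k) holds F[x_1 ... x_k], i.e., the coefficient alpha_{k-1} of (5.23);
% P is the value of the interpolating polynomial at the points t.
  x = x(:); n = length(x);
  F = y(:);                          % first column of the scheme: F[x_i] = y_i
  alpha = zeros(n,1); alpha(1) = F(1);
  for k = 2:n                        % kth column of the scheme, formula (5.25)
    F = (F(2:end) - F(1:end-1)) ./ (x(k:n) - x(1:n-k+1));
    alpha(k) = F(1);
  end
  P = alpha(n)*ones(size(t));        % evaluate (5.23) by nested multiplication
  for k = n-1:-1:1
    P = alpha(k) + (t - x(k)).*P;
  end
end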

5.5.1 Bases of polynomials and the interpolating polynomial


So far we have seen three different expressions of the interpolating polynomial of
degree n − 1 through the nodes (5.1), namely:

• The power form (Section 5.1).

• The Lagrange form (Section 5.1).

• The Newton form (Section 5.5).

The interpolating polynomial is the same in all three cases because, according
to Theorem 7, this polynomial is unique. Nonetheless, these expressions for the
interpolating polynomial correspond to three different bases of the vector space of
polynomials with degree at most n − 1. More precisely:

• The power form corresponds to the monomial basis centered at the origin:
{1, x, x2 , . . . , x n−1 }.
• The Lagrange formula corresponds to the Lagrange basis: {ℓ1 ( x ), . . . , ℓn ( x )},
where ℓ1 ( x ), . . . , ℓn ( x ) are given in (5.5).


• The Newton formula corresponds to the basis associated with the points x1 , . . . , xn−1 :
  {1, x − x1 , ( x − x1 )( x − x2 ), . . . , ( x − x1 )( x − x2 ) · · · ( x − xn−1 )}.

5.6 Interpolation errors


As we have explained in Section 5.1, the goal of polynomial interpolation is to find
a polynomial that approximates some function whose graph passes through the
nodes (5.1). Since this procedure provides an approximation to such a function, we
want to know the error committed when replacing this function
by the interpolating polynomial. In this section we will provide a bound for this
error, that is, we will bound the maximum error committed in the
interval [ x1 , xn ) (assuming that x1 < x2 < . . . < xn ). The following example aims
to illustrate these notions.

Example 10 Assume we want to approximate the following two functions in the interval
[0, 2):

(a) f_1( x ) = x^6 − 2x^5 + x + 1.

(b) f_2( x ) = \frac{2x^5 + 2}{3x^3 − x^2 + 2}.
When taking equispaced values in the interval [0, 2) (namely, x1 = 0, x2 = 1, x3 = 2),
we get the nodes
(0, 1), (1, 1), (2, 3).
It can be easily checked that the graph of both functions contains these nodes. We can obtain
the (quadratic) interpolating polynomial through these nodes, P( x ) = a2 x2 + a1 x + a0 , as
in the proof of Theorem 7, namely solving the linear system whose augmented matrix is
 
    \left[\begin{array}{ccc|c} 0 & 0 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 4 & 2 & 1 & 3 \end{array}\right].

The solution to this system gives: a2 = 1, a1 = −1, a0 = 1, namely, P( x ) = x2 −


x + 1. Figure 5.2 shows the graph of both functions f 1 ( x ) and f 2 ( x ), together with the
one of the interpolating polynomial, P( x ), in the interval [0, 2). As can be seen, the
graph of the interpolating polynomial fits the graph of f 2 ( x ) better than that of f 1 ( x ).
This fact is confirmed in Figure 5.3, which shows the absolute error committed by the
interpolating polynomial in [0, 2), given by | f i ( x ) − P( x )|, with i = 1, 2. In particular,
the maximum error can be estimated with the matlab commands abs and max, and is given by
max_{x∈[0,2)} | f_1( x ) − P( x )| = 3.7503 and max_{x∈[0,2)} | f_2( x ) − P( x )| = 0.2344.
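
A minimal sketch of how these numbers can be obtained (the error is estimated on a fine grid of the interval, and the grid size 1e5 is an arbitrary choice):

% Sketch of the computations of Example 10 on a fine grid of [0,2)
x = linspace(0,2,1e5); x = x(1:end-1);        % grid of [0,2), excluding 2
f1 = x.^6 - 2*x.^5 + x + 1;
f2 = (2*x.^5 + 2)./(3*x.^3 - x.^2 + 2);
P  = x.^2 - x + 1;                            % interpolating polynomial
err1 = max(abs(f1 - P))                       % approx. 3.7503
err2 = max(abs(f2 - P))                       % approx. 0.2344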


Figure 5.2: Graph of the functions f 1 ( x), f 2 ( x) and P( x) in [0, 2).

Figure 5.3: Interpolation error in [0, 2).

One could expect that the larger the number of nodes, the smaller the error,
namely: that the graph of the interpolating polynomial fits that of the function
better and better. In particular, one would expect that

    \lim_{n→∞} P_n( x ) = f( x ),    x ∈ [ x_1 , x_n ),

where f ( x ) is the function that we want to approximate. However, we will see


that this is not necessarily true.
The first result in this section provides an expression of the error for the standard
polynomial interpolation, that is, the one analyzed in Section 5.1. The proof can


be seen, for instance, in [BF11, p. 112].

Theorem 10 (Bound for the interpolation error). Let f : R → R be an n times


differentiable function and P( x ) the interpolating polynomial through the nodes (5.1).
Then
    f( x ) − P( x ) = \frac{f^{(n)}(\eta)}{n!} · ( x − x_1 )( x − x_2 ) · · · ( x − x_n ),

where η is a real number within the smallest interval that contains x1 , . . . , xn , x.

Remark 12 It can be proved (see, for instance, [BF11, Th. 3.6]) that

    \frac{f^{(n)}(\eta)}{n!} = F[ x_1 x_2 \ldots x_n ],
with η as in the statement of Theorem 10.

Theorem 10 does not assume that the nodes are ordered. If they are ordered, so that x1 < x2 <
· · · < xn , and x ∈ [ x1 , xn ), the theorem implies

    | f( x ) − P( x ) | ≤ \max_{\xi ∈ [x_1,x_n]} \frac{| f^{(n)}(\xi) |}{n!} · |( x − x_1 ) · · · ( x − x_n )|.        (5.26)
The inequality (5.26) provides a bound for the maximum error of the interpolating
polynomial through the nodes (5.1) in the interval [ x1 , xn ). The first factor of the
bound depends on the function, f , that we want to approximate. More precisely,
it depends on how big the nth derivative of f is within the interpolation interval.
Therefore, if f is a function having a small derivative, this term will be small.
However, for functions with large oscillations it can be large (though it is not
necessarily so, since (5.26) is just a bound).
 Despite the denominator n!, which increases very fast as n increases, the
bound in (5.26) can be quite large, even for n going to infinity.
On the other hand, there is a second factor, |( x − x1 ) · · · ( x − xn )|, which is
independent of the function, and depends only on the values of the abscissa of
the nodes. To bound this term is not easy and depends, among other things, on
the choice of the nodes. Nevertheless, if the nodes are equispaced (namely, the
distance between the abscissa of two consecutive nodes is constant), we can get a
bound for this error.

Lemma 6 If xk = x1 + (k − 1)h, for k = 1, . . . , n, with h = ( xn − x1 )/(n − 1), then, if
x ∈ [ x1 , xn ], we have

    |( x − x_1 ) · · · ( x − x_n )| ≤ \frac{h^n (n − 1)!}{4}.


The proof of Lemma 6 can be found in [ChK08, p. 157]. As an immediate


consequence of Theorem 10 and Lemma 6 we get the following result.

Theorem 11 Let x1 < xn and let f : R → R be an n-times differentiable function
in [ x1 , xn ] such that | f^{(n)}( x )| ≤ M, for x ∈ [ x1 , xn ]. Let P( x ) be the interpolating
polynomial through the nodes (5.1), with x_i = x_1 + (i − 1)h and h = \frac{x_n − x_1}{n − 1}. Then, if x ∈ [ x1 , xn ], the
following inequality holds

    | f( x ) − P( x ) | ≤ \frac{M h^n}{4n}.        (5.27)
The bound (5.27) indicates that, for equispaced nodes, a big error in the interpo-
lating polynomial, for n large enough, is due to the fact that the value of the nth
derivative of f in [ x1 , xn ] is large. In particular, if this derivative is bounded in this
interval, uniformly in n, the interpolation error tends to 0 as n tends to infinity. By contrast, for
functions whose higher-order derivatives grow without bound the errors can be quite large. In these
cases, a more appropriate choice of the nodes can make the interpolation
error tend to 0, but we will not address this issue here.
The second result in this section provides a bound for the interpolation error
when using cubic splines (see [BF11, Th. 3.13]).

Theorem 12 (Interpolation error for cubic splines). Let f : R → R be a 4-times
differentiable function and S( x ) be a cubic spline such that S( xi ) = f ( xi ) and S′ ( xi ) = f ′ ( xi ), for
i = 1, . . . , n. If x1 < x2 < · · · < xn , then:

    \max_{x_1 ≤ x ≤ x_n} | f( x ) − S( x ) | ≤ \frac{5}{384} \max_{1 ≤ k ≤ n−1} ( x_{k+1} − x_k )^4 · \max_{x_1 ≤ x ≤ x_n} | f^{(4)}( x ) |.        (5.28)

If, again, we consider equispaced nodes, with h = xk+1 − xk for every k, then the bound in
(5.28) reads

    \max_{x_1 ≤ x ≤ x_n} | f( x ) − S( x ) | ≤ \frac{5}{384} M h^4,

where max_{x_1 ≤ x ≤ x_n} | f^{(4)}( x ) | ≤ M. This is a bound of the same order of error as
(5.27) (namely O(h^4)), and it is even worse if we compare the coefficients that
multiply the term h^4. Nonetheless, the splines enjoy the interesting property
described below.

Optimal character of splines in minimizing oscillations

A differentiable function f has large oscillations when its first derivative has large
variations. In other words, when f ′′ is large (in absolute
value). A central measure of the oscillations of a function f within an interval [ a, b]

value). A central measure of the oscillations of a function f within an interval [ a, b]


is Z b
( f ′′ ( x ))2 dx.
a
The following result tells us that the cubic spline has the smallest oscillation
among all the interpolating functions of a certain function f .

Theorem 13 Let (5.1) be the interpolation nodes, with x1 < · · · < xn . Let f be a
twice differentiable function with continuous derivatives in [ x1 , xn ] such that f ( xi ) = yi ,
for i = 1, . . . , n. Let S( x ) be the cubic spline in the nodes (5.1) with S( xi ) = yi and
S′ ( xi ) = f ′ ( xi ), for i = 1, . . . , n. Then
    \int_{x_1}^{x_n} ( S''( x ) )^2 \, dx ≤ \int_{x_1}^{x_n} ( f''( x ) )^2 \, dx.

6 Roots of polynomials and functions

A function f : R → R has a root (or zero) at r ∈ R if f (r ) = 0.


The goal of this chapter is to study some basic methods and algorithms for
computing roots of real continuous functions. In all cases, they are algorithms
that approximate a root and, as a consequence, we do not expect to get the exact
root. We say that the algorithm is convergent if it is possible to get an approxima-
tion which is as good as we want. All these algorithms are iterative algorithms,
which means that, at each step, they provide a different approximation to the solu-
tion, which is presumably closer than the approximation obtained in the previous
step to the root that we are looking for. In particular, they generate a sequence
{ x0 , x1 , . . . , xn , xn+1 , . . .} that approximates a root, r. The value x0 is known as
the seed or the initial value, and it is relevant because it has to be chosen before-
hand, and this choice can make the method convergent or not and, in the case it
converges, it can determine the speed of convergence.

6.1 The bisection method


The bisection method is based on the following theorem.

Theorem 14 (Bolzano). If f : R → R is a continuous function in the interval [ a, b] and


f ( a) f (b) < 0, then there is, at least, one root of f in [ a, b].

The basic idea behind the bisection method is dividing a given interval [ a, b] in
which f ( a) f (b) < 0 in two halves, and to choose at each step the half subinterval
that fulfills this condition, namely:
• Choose [ a, (a+b)/2 ] if f( a ) f( (a+b)/2 ) < 0.

• Choose [ (a+b)/2, b ] if f( (a+b)/2 ) f( b ) < 0.


Since f ( a) f (b) < 0, only one of the previous conditions is satisfied at each step.
The output of the algorithm at each step can be either an interval containing the
root or the middle point.

Example 11 The following matlab algorithm is the basic bisection algorithm to compute
√2. Let us consider [1, 2] as the starting interval, where the value √2 is located. The
algorithm is applied to the function f ( x ) = x^2 − 2, which satisfies f (1) f (2) < 0.


function [x,k] = bisectsqrt2
% Basic bisection iteration for sqrt(2), seen as the root of x^2 - 2 in [1,2]
r = 2;                 % we look for the square root of r
a = 1; b = 2;          % starting interval, with f(a)f(b) < 0
k = 0;                 % iteration counter
while (b-a) > eps*b
    x = (a+b)/2;       % midpoint
    if x^2 - r > 0     % the root is in the left half
        b = x;
    else               % the root is in the right half
        a = x;
    end
    k = k+1;
end

The previous algorithm does not have any input variable, since we have chosen the starting
interval beforehand. The output variables are x (an approximation to √2) and k (the number of
iterations).

The code in the previous example can be extended without difficulty to more
general functions. The main difference is the criterion to choose the appropriate
half subinterval at each step. For this, we need to evaluate the function in the
endpoints of the interval.
The stopping criterion by default is

    \frac{| a − b |}{| b |} < ε,        (6.1)
where ε is the machine epsilon (b can be replaced by a in the denominator). The
criterion (6.1) establishes that the relative error in the approximation is less than
the machine epsilon, which is, as we know, twice the roundoff error. Nonetheless,
the bisection method is slow, and reaching such a small threshold can require many
iterations. In this case, as an alternative stopping criterion, we can introduce some
factor in the right-hand side of (6.1) (namely, cε, with c > 1) or fix the number of
iterations in advance.
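
As mentioned above, the code of Example 11 extends easily to a general continuous function. A minimal sketch (the function name bisect and the use of a function handle f are our own choices, not part of the ncm codes):

function [x,k] = bisect(f,a,b)
% Sketch of a general bisection method: f is a function handle with f(a)f(b) < 0.
% Returns the midpoint x of the last interval and the number of iterations k.
  fa = f(a);
  k = 0;
  while abs(b-a) > eps*abs(b)
      x = (a+b)/2;
      fx = f(x);
      if fx == 0            % we hit the root exactly
          return
      elseif fa*fx < 0      % the root is in [a,x]
          b = x;
      else                  % the root is in [x,b]
          a = x; fa = fx;
      end
      k = k+1;
  end
  x = (a+b)/2;
end

For instance, [z,k] = bisect(@(x) x - cos(x), 0, 1) approximates the root of x − cos(x) in [0, 1].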

Pros and cons of the bisection method

Cons:

• It requires knowing an initial interval containing the root.


Figure 6.1: Newton’s method.

• It is very slow.

• We need to be able to evaluate f in the initial interval.

Pros:

• Always converges.

• It provides an accurate estimation of the error: it is, at most, the length of


the last interval.

It may happen that the initial interval contains more than one root. In this case,
the bisection method will approximate one of them, but it does not
allow us to determine the number of roots.

6.2 Newton’s method


The basic idea behind Newton’s method (also known as Newton-Raphson) is illus-
trated in Figure 6.1 (the root is labelled as r1 ). Given the function f : R → R,
which is assumed to be differentiable, and an initial value x0 , the next value, x1 , is
obtained by intersecting the tangent line to the graph of f at x0 and the horizon-
tal axis. The iteration of this procedure gives rise to the Newton method, which
produces the sequence { x0 , x1 , . . . , xn , xn+1 , . . .}.
Analytically, we can express the method in the following way, taking into ac-
count that f ′ ( xn ) is the slope of the tangent line of f at xn :

    f'( x_n ) = \frac{f( x_n )}{x_n − x_{n+1}}   ⇒   x_{n+1} = x_n − \frac{f( x_n )}{f'( x_n )}        (6.2)


Figure 6.2: Newton’s iteration for f ( x) = x3 − 2x + 2 with x0 = 1.

Newton’s method is a very well-studied method, and it can be easily extended
to functions of several variables (see Chapter 8). There is an extensive literature
about this method. However, it does not always converge, and when applied to
an arbitrary function it can produce aberrant iterations, such as the one in the following example.

Example 12 Let f ( x ) = x3 − 2x + 2. This function has a root between −1 and −2.


However, if the initial value is not an appropriate one (even if it is close to the root),
Newton’s method is not able to locate it, and produces a non-convergent (in fact, periodic) sequence. Let
us take, for instance, x0 = 1. With this initial value the sequence that Newton’s method
produces is:
{1, 0, 1, 0, 1, 0, . . .}.
The explanation of this fact can be seen in Figure 6.2.

We have introduced Newton’s method with a graphical interpretation. How-


ever, Newton’s method is nothing but the one that is obtained after replacing the
function f by its first-order Taylor approximation at xn . More precisely, this Taylor
approximation, for x close enough to xn , is

    f( x ) ≈ f( x_n ) + ( x − x_n ) f'( x_n ).

Now, if we look for a root of this first-order polynomial, we get:

    0 = f( x_n ) + ( x − x_n ) f'( x_n )   ⇔   x = x_n − \frac{f( x_n )}{f'( x_n )},

which is the value of xn+1 in Newton’s method. This is the idea that will be used
in Section 8.2.1 to extend Newton’s method to multivariable functions.
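
A minimal sketch of the iteration (6.2) in matlab (the function name newtontx, the handles f and df, and the stopping criterion used here are our own choices, not part of the ncm codes):

function [x,k] = newtontx(f,df,x0,maxit)
% Sketch of Newton's method: f and df are handles for f and f', x0 is the seed.
  x = x0;
  for k = 1:maxit
      dx = f(x)/df(x);          % Newton step (6.2)
      x = x - dx;
      if abs(dx) <= eps*abs(x)  % stop when the relative correction is tiny
          return
      end
  end
end

For instance, [z,k] = newtontx(@(x) x.^2-2, @(x) 2*x, 1.5, 50) approximates √2 in a handful of iterations.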


Pros and cons of Newton’s method

Cons:

• It does not always converge. In fact, the function f must satisfy certain
“regularity” properties (namely, the derivative should not present large vari-
ations) and the initial value, x0 , must be close enough to the root for conver-
gence.

• We need to be able to evaluate not just f , but also f ′ , in values that are not
located in any particular interval.

Pros:

• When it converges, it is very fast.

• It is a very simple method which admits modifications to improve conver-


gence.

In Section 6.8 we will analyze in more detail the convergence of Newton’s


method.

6.3 The secant method


Newton’s method is a limit case of the secant method. In this method we need
two initial values x0 and x1 , and the next value of the iteration is obtained by
intersecting the secant line to the graph of f by ( x0 , f ( x0 )) and ( x1 , f ( x1 )) with the
horizontal axis. The iteration of this procedure gives rise to the secant method.
The illustration of this method is given in Figure 6.3.
We can obtain an analytic expression for the secant method in the following
way. The equation of the line through ( xn−1 , f ( xn−1 )) and ( xn , f ( xn )) is

    y = \frac{f( x_{n−1} ) − f( x_n )}{x_{n−1} − x_n} ( x − x_n ) + f( x_n ),

so the intersection with the horizontal axis y = 0 is given by

    \frac{f( x_n )}{x_n − x} = \frac{f( x_{n−1} ) − f( x_n )}{x_{n−1} − x_n}.

Solving for x we obtain the next value of the iteration:

    x_{n+1} = x_n − f( x_n ) · \frac{x_n − x_{n−1}}{f( x_n ) − f( x_{n−1} )}        (6.3)


Figure 6.3: The secant method.

Note that (6.3) is equivalent to


    x_{n+1} = x_n − \frac{f( x_n )}{\;\dfrac{f( x_n ) − f( x_{n−1} )}{x_n − x_{n−1}}\;},

which resembles (6.2). In particular, the difference between both methods is in


the denominator of the subtracting term. In the case of the secant method the
derivative of f at xn appearing in the formula for Newton’s method is replaced by
an approximation to the derivative of f at xn .
As for Newton’s method, the secant method does not always converge. The
following example illustrates this fact.

Example 13 Let f ( x ) = 1/x − 1. This function has a unique root at x = 1. Let us take as


initial values x0 = 8, x1 = 7. The secant method produces the following sequence:

{8, 7, −41, 295, · · · }.

It can be checked that this sequence consists of alternating positive and negative numbers,
which are larger at each iteration (in absolute value). The explanation of this can be seen
in Figure 6.4.

In Example 13, the value x2 is outside the interval [ x0 , x1 ] and this is the reason
for the divergence. This, however, does not happen if the evaluation of the function
in the endpoints produces two numbers with different signs. More precisely:
If f ( xn−1 ) f ( xn ) < 0, then xn+1 is between xn−1 and xn .
The algorithm fzero, that we will analyze in Section 6.7, will make use of this
fact.


Figure 6.4: The secant method for f ( x ) = 1/x − 1.

Pros and cons of the secant method

They are the same as the ones for Newton’s method, except for the following:

• (Pro): It does not require the use of f ′ , unlike Newton’s method.

• (Con): When it converges, the convergence is slower than for Newton’s


method.
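
A minimal sketch of the iteration (6.3) (again, the function name and the stopping criterion are our own choices):

function [x,k] = secanttx(f,x0,x1,maxit)
% Sketch of the secant method (6.3): f is a function handle, x0 and x1 the two seeds.
  for k = 1:maxit
      x = x1 - f(x1)*(x1 - x0)/(f(x1) - f(x0));   % secant step (6.3)
      if abs(x - x1) <= eps*abs(x)
          return
      end
      x0 = x1; x1 = x;                            % shift the two previous iterates
  end
end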

6.4 Fixed point iteration


Given a function g : R → R, we say that r is a fixed point of g if g(r ) = r. Note
that a fixed point of g is a root of f ( x ) = x − g( x ), so the problem of computing
fixed points of a function is equivalent (theoretically) to the one of computing the
roots of another function. Though we are going to see some iterative methods to
approximate fixed points, we are not going to treat fixed point problems as
root-finding problems. In particular, a fixed point iteration for g is an iteration of the
form
    x_{k+1} = g( x_k ),   k = 0, 1, 2, . . .        (6.4)
We are interested in knowing when the fixed point iteration { xn } converges to a
fixed point of g, namely: when does

    \lim_{n→∞} x_n = r

hold?


Figure 6.5: Fixed point in [2, 3] of g( x ) = − x^3 + 3x^2 − 1.

As a first approximation to the answer, we need to locate an interval containing


the fixed point.

Theorem 15 If g : R → R is continuous in [ a, b] and g( a) ≥ a, g(b) ≤ b, then g has,


at least, one fixed point in [ a, b].

Proof: Let f ( x ) = x − g( x ). By the conditions in the statement, f ( a) ≤ 0 and


f (b) ≥ 0. Since f is continuous, Theorem 14 (Bolzano) guarantees the existence of
some r ∈ [ a, b] such that f (r ) = 0, which is equivalent to g(r ) = r. □
Figure 6.5 provides an illustration of Theorem 15. Note that the fixed point
of g is the solution of g( x ) = x, namely, of the system y = g( x ), y = x. As a
consequence, it is the abscissa of the intersecting point of the graphs of y = g( x )
and y = x. If g is such that g( a) > a and g(b) < b, then the graph of g must
intersect the line y = x at some point to go from ( a, g( a)) to (b, g(b)).
Nonetheless, the condition in Theorem 15 does not guarantee at all the conver-
gence of the fixed point iteration, not even for initial values which are close to the
fixed point, as we are going to see in the following example.

Example 14 Let g( x ) = − x3 + 3x2 − 1. This function has a fixed point in [2, 3], since
the conditions in the statement of Theorem 15 are satisfied. However, this fixed point
cannot be approximated with the fixed point iteration, even for nearby initial values. Table
6.1 provides the first values of the fixed point iteration for some initial values in [2, 3]. As
you may guess, all sequences diverge.
In Figure 6.5 you can see that the fixed point is between 2.4 and 2.5. Nevertheless, none
of these initial values produce a convergent fixed point sequence.


    Initial value    Sequence
    2                2, 3, −1, 3, −1, . . .
    2.5              2.5, 2.125, 2.9512, −0.5747, 0.1808, . . .
    2.4              2.4, 2.456, 2.2814, 2.7402, 0.9507, . . .

Table 6.1: Fixed point iterations for g( x ) = − x^3 + 3x^2 − 1.
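
The rows of Table 6.1 can be reproduced with a few lines of matlab (a sketch; the number of iterations shown is arbitrary):

% Sketch: first terms of the fixed point iteration (6.4) for g(x) = -x^3 + 3x^2 - 1
g = @(x) -x.^3 + 3*x.^2 - 1;
for x0 = [2 2.5 2.4]          % the three initial values of Table 6.1
    x = x0;
    seq = x;
    for k = 1:4
        x = g(x);             % x_{k+1} = g(x_k)
        seq = [seq x];        % store the iterates
    end
    disp(seq)
end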

In Example 14 we have seen that, though there is a fixed point in [2, 3], the fixed
point iterations, even for initial values that are close to the fixed point, produce
values that lie outside the interval. It is natural to expect that, in order for the
sequence to be convergent, the values of the sequence must lie within the initial
interval where the root is located. The following result provides conditions for
this to happen.

Theorem 16 If g : R → R is differentiable in [ a, b] and satisfies:


(i) g([ a, b]) ⊆ [ a, b],

(ii) | g′ ( x )| ≤ M < 1, for all x ∈ [ a, b],


then g has a unique fixed point, r, in [ a, b]. Besides this, the fixed point iteration
{ xn+1 = g( xn )} converges to r for any initial value x0 ∈ [ a, b], and the approximation
error is bounded by
    | x_n − r | ≤ \frac{M^n}{1 − M} · | x_1 − x_0 |.        (6.5)
Proof: Theorem 15 and property (i) together imply that g has, at least, one fixed
point in [ a, b]. Let us see that it is unique. Assume, by contradiction, that there
are two fixed points, r1 and r2 . The Mean Value Theorem guarantees that there is an
intermediate value between r1 and r2 , that we denote by ξ, such that
    \frac{g( r_1 ) − g( r_2 )}{r_1 − r_2} = g'( ξ ).

This implies that

    | r_1 − r_2 | = | g( r_1 ) − g( r_2 ) | = | g'( ξ ) | · | r_1 − r_2 | < | r_1 − r_2 |,

which is impossible. Therefore, the fixed point is unique.
Now let us prove that the fixed point iteration converges for every initial value in
[ a, b]. First, by condition (i), we conclude that all terms in the fixed point sequence
belong to the interval [ a, b], regardless of the initial value, as long as it is within
the interval. Using again the Mean Value Theorem and property (ii) we get:

| xn+1 − xn | = | g( xn ) − g( xn−1 )| = | g′ (ξ n )| · | xn − xn−1 | ≤ M · | xn − xn−1 |,


where ξ n is some value between xn+1 and xn (in particular, it is in [ a, b]). Iterating
this inequality we arrive at

| x n +1 − x n | ≤ M n · | x 1 − x 0 | .

Now, for m > n, using the above developments:

    | x_m − x_n | = | ( x_m − x_{m−1} ) + ( x_{m−1} − x_{m−2} ) + · · · + ( x_{n+1} − x_n ) |
                  ≤ | x_m − x_{m−1} | + | x_{m−1} − x_{m−2} | + · · · + | x_{n+1} − x_n |
                  ≤ M^{m−1} | x_1 − x_0 | + M^{m−2} | x_1 − x_0 | + · · · + M^n | x_1 − x_0 |
                  = ( M^{m−1} + M^{m−2} + · · · + M^n ) | x_1 − x_0 |
                  = M^n · \sum_{k=0}^{m−1−n} M^k · | x_1 − x_0 |.

Taking limits in the previous inequality

    \lim_{m→∞} | x_m − x_n | ≤ M^n \lim_{m→∞} \sum_{k=0}^{m−1−n} M^k · | x_1 − x_0 | = M^n \sum_{k=0}^{∞} M^k · | x_1 − x_0 |.

The sum \sum_{k=0}^{∞} M^k is a geometric series with ratio M < 1, so its value is \frac{1}{1−M}.
Replacing this value in the previous bound we get

    \lim_{m→∞} | x_m − x_n | ≤ \frac{M^n}{1 − M} · | x_1 − x_0 |.        (6.6)
Since M < 1, we conclude that limm→∞ | xm − xn | tends to 0 when n → ∞. In
other words, the difference between any two terms, xm and xn , of the fixed point
iteration, tends to 0. Namely, the fixed point iteration is a Cauchy sequence. We
know by elementary Calculus that every Cauchy sequence over R is convergent.
Therefore, the fixed point iteration is convergent.
It remains to prove that the fixed point iteration converges to r, and that equa-
tion (6.5) is true. For the first claim, let ℓ be the limit of the fixed point sequence.
Since g is continuous, we get

ℓ = lim xn ⇒ g(ℓ) = g( lim xn ) = lim g( xn ) = lim xn+1 = ℓ,


n→∞ n→∞ n→∞ n→∞

so ℓ is a fixed point of g which, as we have seen before, is unique, so ℓ = r.


To get (6.5), it remains to replace limm→∞ xm = r in (6.6). □
Theorem 16 is a very strong result, since it provides sufficient conditions for
a function to have a unique fixed point in a given interval, and these conditions
guarantee the convergence of the fixed point iteration. Moreover, it provides a
bound for the absolute error of approximation at each iteration, and this allows
us to predict beforehand the number of iterations that are needed to reach a given


(a) 0 < g′ < 1: Monotone convergence. (b) g′ > 1: Monotone divergence.
(c) −1 < g′ < 0: Oscillatory convergence. (d) g′ < −1: Oscillatory divergence.

Figure 6.6: Condition on the derivative of g: convergence and divergence
(source: https://personales.unican.es/segurajj/nolinCN1.pdf).

approximation. This bound, (6.5), depends on the value of M that bounds the
absolute value of the derivative of g. The bound (6.5) suggests that the smaller
the value of the derivative is, the faster is the convergence. In Section 6.8 we will
come back to this issue on the speed of convergence.
Figure 6.6 illustrates the condition on the derivative of g in Theorem 16. In this
figure we show all four possible cases when | g′ ( x )| > 1 (divergence) and when
| g′ ( x )| < 1 (convergence).
Condition g([ a, b]) ⊆ [ a, b] in the statement of Theorem 16 means that g( x ) ∈
[ a, b], for all x ∈ [ a, b]. Though it is very easy to state and interpret, this condition
is not, in general, easy to check. The following result avoids imposing this restric-
tion, though it assumes that we know beforehand that there is a fixed point in the
starting interval.
Theorem 17 If g( x ) is differentiable in [ a, b] and satisfies
(i) it has a fixed point in [ a, b],

(ii) 0 < g′ ( x ) < 1, for all x ∈ [ a, b],


then g has a unique fixed point, r, in [ a, b] and the fixed point iteration converges to r for
every initial value x0 ∈ [ a, b]. Moreover, the fixed point sequence is monotone, namely:

• If x0 > r, then x0 > x1 > · · · > xn > xn+1 > · · · > r.

• If x0 < r, then x0 < x1 < · · · < xn < xn+1 < · · · < r.

Proof: For the first claim in the statement, it suffices to prove that condition (i)
implies g([ a, b]) ⊆ [ a, b] and apply Theorem 16. Indeed, let x ∈ [ a, b] and let us
denote by r the fixed point guaranteed by (i). Let us distinguish two cases:

• r ≤ x: By the Mean Value Theorem

g( x ) − g(r ) = g′ (ξ )( x − r ) ⇒ 0 ≤ g( x ) − g(r ) = g( x ) − r < x − r


⇒ a ≤ r ≤ g( x ) < x ≤ b,

so g( x ) ∈ [ a, b].

• r ≥ x: By the Mean Value Theorem again

g(r ) − g( x ) = g′ (ξ )(r − x ) ⇒ 0 ≤ g(r ) − g( x ) = r − g( x ) < r − x


⇒ a ≤ x < g( x ) ≤ r ≤ b,

so g( x ) ∈ [ a, b].

To prove the second claim (the fixed point sequence is monotone) it suffices to
look at the inequalities obtained in the two previous items, since

• r ≤ x n ⇒ x n +1 = g ( x n ) < x n .

• x n ≤ r ⇒ x n +1 = g ( x n ) > x n .

6.5 Inverse quadratic iteration (IQI)


Though this is not one of the methods that are in the basic standard literature, it
is implemented as a part of the code fzerotx, that is analyzed in Section 6.7.
It is an iterative method with three values, namely: xn+1 depends on xn , xn−1 ,
and xn−2 . It uses an interpolating polynomial through the nodes ( xn−2 , f ( xn−2 )),
( xn−1 , f ( xn−1 )), ( xn , f ( xn )) and approximates the root of f as the closest root
of this polynomial to xn . This idea is the same as the one behind the secant
method, where the interpolating polynomial is the straight line through the points
( xn−1 , f ( xn−1 )) and ( xn , f ( xn )). Nonetheless, in this case a relevant difficulty


arises: the interpolating polynomial is quadratic, whose graph is a parabola that


does not necessarily intersect the horizontal axis. In other words: the interpolating
polynomial does not necessarily have roots. To overcome this problem, instead of
taking a parabola in the variable x, we take a parabola in the variable y:

    x = b_2 y^2 + b_1 y + b_0 .        (6.7)

This parabola always intersects the horizontal axis (at (b0 , 0)).
Therefore, the method is as follows:

Inverse quadratic iteration:


Given xn−2 , xn−1 , xn :

• Obtain the parabola (6.7) which interpolates


( f ( x n −2 ), x n −2 ), ( f ( x n −1 ), x n −1 ), ( f ( x n ), x n ).

• xn+1 = b0 .

The method requires f ( xn−2 ), f ( xn−1 ), and f ( xn ) to be pairwise distinct, since otherwise


the parabola (6.7) does not exist.
An elementary matlab code for the IQI can be found at [M04, Ch. 4, p.8].
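One step of the IQI can be sketched in matlab by interpolating x as a function of y with polyfit and keeping the constant term. This is only an illustration of the idea (the example function is arbitrary); fzerotx implements the step differently:

% Sketch of one inverse quadratic interpolation step
f = @(x) x.^3 - 2*x - 5;     % example function with a real root near 2.09
x = [1 2 3];                 % three current iterates x_{n-2}, x_{n-1}, x_n
y = f(x);                    % their function values (must be distinct)
c = polyfit(y,x,2);          % parabola x = b2*y^2 + b1*y + b0, as in (6.7)
xnew = polyval(c,0)          % next iterate: value of the parabola at y = 0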

6.6 Functions in matlab


There are several possibilities to work with functions in matlab, that we are going
to illustrate with particular examples by referring to the function fzerotx, which
is introduced in Section 6.7, and which is suitable for root-finding:

(a) Use inline definitions:

xcx = inline(’x-cos(x)’);
z1=fzerotx(xcx,[0,1])

When introducing these commands in the command window of matlab,


the program computes an approximation to the root of the function f ( x ) =
x − cos( x ) in the interval [0, 1] (note that, by Theorem 14, there is, at least,
one root in this interval, since f (0) = −1 < 0 and f (1) = 1 − cos(1) > 0).

(b) Use a function defined in some file name.m together with the operator @. The
syntax @name creates a handle to the function defined in name.m:


z2=fzerotx(@name,[0,1])

(c) Use a previously defined function in matlab:

z3=fzerotx(@cos,[0,1])

(d) Use anonymous functions, directly defined:

z4=fzerotx(@(x) x-cos(x),[0,1])

Evaluate functions with feval

The command feval is quite useful to evaluate functions in matlab. The syntax
for using this command is the following

feval(f,values)

where f is the function that we want to evaluate (introduced in any of the ways
that we have seen above) and values are the values where we want to evaluate it,
that can be just one value or a vector. For instance:

xcx=inline(’x-cos(x)’);
feval(xcx,[0,1,6,9])

provides the result of evaluating f ( x ) = x − cos( x ) in the values x = 0, 1, 6, 9,


namely:

-1.0000 0.4597 5.0398 9.9111

6.7 The program fzero


The professional matlab command for root-finding is fzero. We borrow from
[M04, pp. 8–9, Ch. 4] the following explanations:
“The summary of the algorithm is the following:

• Start with a, b such that f ( a) f (b) < 0.

• Use the secant method to obtain c between a and b.

• Repeat the previous steps until |b − a| < ε|b| or f (b) = 0:


– Take a, b, c such that


* f ( a) f (b) < 0,
* | f (b)| ≤ | f ( a)|,
* c is the previous value of b.
– If c ̸= a, consider one step of IQI.
– If c = a, consider one step of the secant method.
– If either the IQI or the secant method step is in the interval, [ a, b], take
it.
– If the step is not in the interval [ a, b], use bisection method.

This algorithm is infallible: the considered interval always contains a root. It
uses methods with fast convergence (IQI) when they are reliable, together with
slower but safe methods (bisection, secant) when necessary.”
The version that we will use in this course is the code fzerotx, which is im-
plemented in the folder ncm (see [M04, §4.7]). In this folder you can also find
the code fzerogui, which represents graphically the steps of fzerotx and may
provide a list of them. Moreover, it allows us to choose, at each step, the next
iteration.

6.8 Convergence order


When an iterative method, like the ones we have analyzed in this chapter, con-
verges, the output at each iteration, for a number of iterations that is large enough,
is closer to the solution than the previous iteration. This idea gives rise to the fol-
lowing definition.

Definition 13 (Convergence order). Let { xn } be a sequence converging to r. If there


are a number α, a constant c > 0, and a positive integer n0 such that, for n ≥ n0
| x n +1 − r | ≤ c | x n − r | α , (6.8)
with α being the largest number for which (6.8) holds, then we say that α is the conver-
gence order of the sequence and c is the asymptotic error constant.

When α = 1 in Definition 13, then the sequence { xn } is said to converge linearly


to r, whereas if α = 2 then it is said to converge quadratically.

Remark 13 The conditions for α and c in Definition 13 are equivalent to


    \lim_{n→∞} \frac{| x_{n+1} − r |}{| x_n − r |^α} = c.


The expression (6.8) indicates that, from some n0 , the absolute error in the ap-
proximation { xn } is “raised to α” at each step. This roughly means that the num-
ber of significant digits of the approximation at each iteration is multiplied by
α. Therefore, the larger the convergence order is, the faster the sequence
converges to its limit r.
In this section we are going to present the convergence order of the iterative
methods analyzed in this chapter. In all cases we assume that the method is
convergent (though, as we know, not all of them are), and we denote by r the root
of f which is the limit of the sequence.
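In practice, when the root r is known, the convergence order can be estimated numerically from three consecutive errors, using α ≈ log(e_{n+1}/e_n)/log(e_n/e_{n−1}) with e_n = |x_n − r|. A sketch (the choice of Newton's method, the seed, and the number of iterations are ours; the estimates should be close to 2, consistently with Section 6.8.4):

% Sketch: numerical estimation of the convergence order from consecutive errors
f = @(x) x.^2 - 2; df = @(x) 2*x;
r = sqrt(2);                       % known root, used only to measure the errors
x = 1.5; e = abs(x - r);
for n = 1:3
    x = x - f(x)/df(x);            % Newton iterate
    e(end+1) = abs(x - r);         % store the error e_n = |x_n - r|
end
alpha = log(e(3:end)./e(2:end-1)) ./ log(e(2:end-1)./e(1:end-2))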

6.8.1 Convergence of bisection method


The bisection method converges at most linearly. This means, in particular, that,
though it always converges, in some cases the convergence is not even linear
[KL86].

6.8.2 Convergence of the secant method


The secant method has convergence order equal to ϕ (the golden ratio!). Let us
sketch the proof.
For simplicity, we set ϵn := xn − r. Then (6.3) is equivalent to

    ϵ_{n+1} = ϵ_n − \frac{f( r + ϵ_n )( ϵ_n − ϵ_{n−1} )}{f( r + ϵ_n ) − f( r + ϵ_{n−1} )}.        (6.9)

Let us assume that f is twice differentiable and that f ′ (r ) f ′′ (r ) ̸= 0. Using Taylor’s


formula:

    f( r + ϵ ) = f( r ) + f'( r ) ϵ + \frac{f''( r )}{2} ϵ^2 + R_2( ϵ ),

where R2 (ϵ) is the residual, which is of order O(ϵ^3) (namely, it depends on ϵ and
tends to 0 faster than ϵ^2 when ϵ tends to 0). In other words, if we neglect all terms
of order larger than 2 in ϵ in the previous expression, we get:

    f( r + ϵ ) ≈ f( r ) + f'( r ) ϵ + \frac{f''( r )}{2} ϵ^2.
Let us denote, for simplicity, M := \frac{f''( r )}{2 f'( r )}. Taking into account that f (r ) = 0 (since
r is a root of f ), we can write:

    f( r + ϵ_n ) ≈ ϵ_n f'( r )( 1 + M ϵ_n ),   from which:

    f( r + ϵ_n ) − f( r + ϵ_{n−1} ) ≈ f'( r )( ϵ_n − ϵ_{n−1} )( 1 + ( ϵ_n + ϵ_{n−1} ) M ).


Replacing these approximations in (6.9):

    ϵ_{n+1} ≈ ϵ_n − \frac{ϵ_n f'( r )( 1 + M ϵ_n )( ϵ_n − ϵ_{n−1} )}{f'( r )( ϵ_n − ϵ_{n−1} )( 1 + M ( ϵ_n + ϵ_{n−1} ) )}
            = ϵ_n − \frac{ϵ_n ( 1 + M ϵ_n )}{1 + M ( ϵ_n + ϵ_{n−1} )}
            = \frac{ϵ_{n−1} ϵ_n M}{1 + M ( ϵ_n + ϵ_{n−1} )}
            ≈ ϵ_{n−1} ϵ_n M.

Summarizing:
ϵn+1 ≈ ϵn−1 ϵn M. (6.10)
But the method has convergence order α if and only if |ϵ_{n+1}| ≈ C |ϵ_n|^α , for some
constant C > 0. If this holds, replacing in (6.10) we get

    C |ϵ_n|^α ≈ |M| · |ϵ_{n−1}| · |ϵ_n|
    ⇔ |ϵ_n|^{α−1} ≈ \frac{|M|}{C} · |ϵ_{n−1}|
    ⇔ |ϵ_n| ≈ \left( \frac{|M|}{C} \right)^{\frac{1}{α−1}} · |ϵ_{n−1}|^{\frac{1}{α−1}}.

By definition of convergence order, the last approximation must be |ϵ_n| ≈ C |ϵ_{n−1}|^α ,
which means that \left( \frac{|M|}{C} \right)^{\frac{1}{α−1}} = C and, moreover, that α = \frac{1}{α−1}, which implies α = ϕ,
since α > 0.

6.8.3 Convergence of fixed point methods


The first convergence result is focused on fixed points, r, of g (that is, roots of
f ( x ) = x − g( x )) such that g′ (r ) ̸= 0, and tells us that, in this case, if the method is
convergent then the convergence is linear.

Theorem 18 If g : R → R satisfies conditions (i) and (ii) in Theorem 16, and g′ is


continuous, then
    \lim_{n→∞} \frac{r − x_{n+1}}{r − x_n} = g'( r ),        (6.11)

namely, if g′ (r ) ̸= 0 then the convergence order of the fixed point method is equal to 1.

Proof: By the Mean Value Theorem, there is an intermediate value, ξ n , between r


and xn , such that
    \frac{r − x_{n+1}}{r − x_n} = \frac{g( r ) − g( x_n )}{r − x_n} = g'( ξ_n ).        (6.12)

Since ξ n is a value between xn and r, and lim_{n→∞} xn = r, we conclude that lim_{n→∞} ξ n = r.
Using that g′ is continuous, then lim_{n→∞} g′ (ξ n ) = g′ ( lim_{n→∞} ξ n ) = g′ (r ), so taking limits
in (6.12) we arrive at (6.11). □


The second result says that, if g′ (r ) = 0 at the fixed point r, the
convergence, instead of being linear, is, at least, quadratic:

Theorem 19 Let g : R → R be twice differentiable and let r be a solution of the equation


x = g( x ). If

(i) g′ (r ) = 0,

then there is some δ > 0 such that, for every initial value x0 ∈ [r − δ, r + δ], the fixed
point sequence { xn+1 = g( xn )} converges at least quadratically to r. If, moreover

(ii) g′′ is continuous and | g′′ ( x )| < M for all x in an open interval containing r,

then, for n large enough and x0 as in the previous paragraph,

    | x_{n+1} − r | < \frac{M}{2} · | x_n − r |^2.        (6.13)
Proof: By hypothesis (i) and, since g′ is continuous (because it is twice differen-
tiable), there are δ > 0 and k ∈ (0, 1) such that [r − δ, r + δ] is contained in the
interval mentioned in part (ii) and | g′ ( x )| ≤ k for all x ∈ [r − δ, r + δ]. As we
have seen in the proof of Theorem 16, this implies that the terms of the fixed point
iteration { xn+1 = g( xn )} belong to the interval [r − δ, r + δ], for any initial value
x0 . Using the hypotheses g(r ) = r and g′ (r ) = 0, the Taylor expansion of g around
r in [r − δ, r + δ] gives

    g( x ) = g( r ) + g'( r )( x − r ) + \frac{g''( ξ )}{2} ( x − r )^2 = r + \frac{g''( ξ )}{2} ( x − r )^2,

with ξ being a value between r and x. In particular, if x = xn ,

    x_{n+1} = g( x_n ) = r + \frac{g''( ξ_n )}{2} ( x_n − r )^2   ⇒   x_{n+1} − r = \frac{g''( ξ_n )}{2} ( x_n − r )^2,        (6.14)
for some ξ n between r and xn .
As | g′ ( x )| ≤ k < 1 in [r − δ, r + δ] and g([r − δ, r + δ]) ⊆ [r − δ, r + δ], Theorem
16 guarantees that { xn } converges to r. Since ξ n is between xn and r, {ξ n } also
converges to r, so (6.14) implies that

    \lim_{n→∞} \frac{| x_{n+1} − r |}{| x_n − r |^2} = \frac{| g''( r ) |}{2}.

This identity implies that { xn } has convergence order 2 if g′′ (r ) ̸= 0. Moreover,


since g′′ is continuous and bounded, in absolute value, by M in [r − δ, r + δ], the
identity implies that, for n large enough, (6.13) holds. □


6.8.4 Convergence of Newton’s method


Newton’s method is a fixed point method, { xn+1 = g( xn )}, with

    g( x ) = x − \frac{f( x )}{f'( x )}.

If r is a root of f , then r is a fixed point of g. Moreover

    g'( x ) = 1 − \frac{f'( x )^2 − f''( x ) f( x )}{f'( x )^2}   ⇒   g'( r ) = 1 − \frac{f'( r )^2 − f''( r ) f( r )}{f'( r )^2} = 1 − 1 = 0.

As a consequence, Theorem 19 guarantees that, for an initial value x0 which is


close enough to r, the fixed point iteration { xn+1 = g( xn )} converges quadratically
to r.
In other words: Newton’s method converges quadratically, for any initial value
which is close enough to the root.

7 Least squares problems

7.1 Setting the problem


Given a matrix A ∈ Rm×n , the SLE (4.1) does not always have a solution. In those
cases with no solution, we aim to find a vector, denoted by x̂, which is “the closest”
to a solution. Since x is a solution of (4.1) if and only if Ax − b = 0, it is natural to
think that x̂ will be a vector close to the solution if Ax̂ − b is close to 0. But Ax̂ − b
is a vector, so the notion of closeness requires a precise definition. This definition
comes through the norm. Then, Ax̂ − b is close to 0 if the norm of Ax̂ − b is small.
We have already seen in Remark 3 that there are several vector norms. The one
we are going to use here is the 2-norm, which is the standard one. This leads to
the following definition.

Definition 14 The vector x̂ is a least squares solution of the SLE (4.1) if

    ∥ A x̂ − b ∥_2 = \min_{x ∈ R^n} { ∥ A x − b ∥_2 }.

Remark 14 Recall that the 2-norm of a vector is the square root of the sum of the squares
of the modulus of the coordinates of the vector. The square root of a sum of squares
is minimized when the sum of squares itself is minimized (since the square root is an
increasing function). Therefore, the least squares solution is the one that minimizes the
sum of squares of the vector Ax − b. This is the reason for the name “least squares”.

Least squares problems arise in many contexts (in engineering, economy, biol-
ogy, physics, etc.). In particular, they arise in the “curve fitting” problem. This
problem has a similar motivation to the interpolation problem (though, as we will
see, it is solved in a different way), and it is explained below.

Curve fitting

Assume we have a list of m data points corresponding to two variables x and y:


    x      y
    x_1    y_1
    x_2    y_2        (7.1)
    ⋮      ⋮
    x_m    y_m


Let us also assume that y is a function of x that can be written in the form

y = β 1 ψ1 ( x ) + β 2 ψ2 ( x ) + · · · + β n ψn ( x ), (7.2)

where ψ1 ( x ), . . . , ψn ( x ) are functions of x. The relevant property of (7.2) is that


y is a linear combination of the functions ψ1 ( x ), . . . , ψn ( x ). These functions are,
in general, a basis of the vector space of polynomials in x of degree less than n,
where n is not necessarily equal to m (actually, in this chapter we focus on the
case m > n). Nonetheless, they can be more general functions (trigonometric,
exponential, etc). This is the main difference between polynomial interpolation
analyzed in Chapter 5 and least squares problems.
The goal is to determine the values of the parameters β 1 , . . . , β n in (7.2) to find
the least squares solution. For this we impose the conditions given by the data (7.1),
namely:

    y( x_1 ) = y_1  ⇒  β_1 ψ_1( x_1 ) + · · · + β_n ψ_n( x_1 ) = y_1 ,
    y( x_2 ) = y_2  ⇒  β_1 ψ_1( x_2 ) + · · · + β_n ψ_n( x_2 ) = y_2 ,
       ⋮                              ⋮
    y( x_m ) = y_m  ⇒  β_1 ψ_1( x_m ) + · · · + β_n ψ_n( x_m ) = y_m .
The previous system is a SLE, that can be expressed in matrix form as

    \begin{bmatrix}
    ψ_1( x_1 ) & ψ_2( x_1 ) & \cdots & ψ_n( x_1 ) \\
    ψ_1( x_2 ) & ψ_2( x_2 ) & \cdots & ψ_n( x_2 ) \\
    \vdots     & \vdots     &        & \vdots     \\
    ψ_1( x_m ) & ψ_2( x_m ) & \cdots & ψ_n( x_m )
    \end{bmatrix}
    \begin{bmatrix} β_1 \\ β_2 \\ \vdots \\ β_n \end{bmatrix}
    =
    \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}.        (7.3)

The coefficient matrix of the system (7.3) is m × n. If m > n, the system (7.3) is
overdetermined and, in general does not have a solution. In this case, we are in
the situation described at the beginning of the section.
The most elementary examples of curve fitting correspond to polynomial regres-
sion, which includes, as a particular case linear regression. We analyze this last one
independently because of its simplicity and relevance.

• Linear regression: In this case, the functions in (7.2) are ψ1 ( x ) = 1, ψ2 ( x ) =


x. Then, we aim to approximate the set of points ( x1 , y1 ), . . . , ( xm , ym ) by a
straight line, which is known as regression line, y = β 1 + β 2 x. The system
(7.3), in this case, is:
        \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix}
        \begin{bmatrix} β_1 \\ β_2 \end{bmatrix}
        =
        \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}.


The previous system is not consistent, as long as at least three points in the
set are not aligned.

• Polynomial regression: Now the functions in (7.2) are ψ1 ( x ) = 1, ψ2 ( x ) = x, . . . ,
  ψn ( x ) = x^{n−1} (namely, the monomial basis of the vector space of polynomials
  of degree at most n − 1), and the system (7.3) is of the form:

        \begin{bmatrix}
        1 & x_1 & \cdots & x_1^{n-1} \\
        1 & x_2 & \cdots & x_2^{n-1} \\
        \vdots & \vdots & \ddots & \vdots \\
        1 & x_m & \cdots & x_m^{n-1}
        \end{bmatrix}
        \begin{bmatrix} β_1 \\ β_2 \\ \vdots \\ β_n \end{bmatrix}
        =
        \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}.        (7.4)

  The coefficient matrix of the system is a Vandermonde matrix (5.4) but, un-
  like in the case considered in polynomial interpolation, when m > n the
  system does not have a solution (except when the nodes are placed in the
  graph of a polynomial function of degree at most n − 1).

matlab commands for least squares problems and polynomial fitting

The command “\”, which solves the system (4.1) when it has a unique solution,
solves the least squares problem associated to the SLE (4.1) when the system is not
consistent. Thus, the command A\b solves the SLE (4.1) by least squares.
For polynomial fitting we can use the command polyfit, whose third argument is
the degree of the fitting polynomial. More precisely, polyfit(x,y,n-1) solves the
least squares problem associated to the system (7.4), where

    x = [ x_1 x_2 . . . x_m ]^⊤   and   y = [ y_1 y_2 . . . y_m ]^⊤ .
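
As an illustration (a sketch with made-up data, not from the notes), both commands applied to a linear fit y ≈ β1 + β2 x of five points:

% Sketch: least squares straight line through five data points
x = [0 1 2 3 4]';  y = [1.1 1.9 3.2 3.8 5.1]';
A = [ones(size(x)) x];       % coefficient matrix of the linear regression system
beta = A\y                   % least squares solution [beta_1; beta_2]
p = polyfit(x,y,1)           % same fit; polyfit returns [beta_2 beta_1] (highest degree first)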

7.2 Orthogonal matrices


Definition 15 (Orthogonal matrix). A square matrix Q ∈ Rn×n is orthogonal if
Q⊤ Q = In

In other words, an orthogonal matrix is an invertible matrix whose inverse co-


incides with its transpose.
The main property of orthogonal matrices in this course is the following:

Lemma 7 If Q ∈ Rn×n is an orthogonal matrix and x ∈ Rn , then

∥ Qx∥2 = ∥x∥2 .


Proof: By definition of 2-norm:

    ∥ Q x ∥_2^2 = ( Q x )^⊤ ( Q x ) = x^⊤ Q^⊤ Q x = x^⊤ x = ∥ x ∥_2^2 .


Lemma 7 tells us that multiplying a vector by an orthogonal matrix provides a
vector with the same 2-norm.
We will also use the following property.

Lemma 8 The product of two orthogonal matrices of the same dimension is again an
orthogonal matrix.

Proof: If Q1 , Q2 ∈ Rn×n are orthogonal, then

( Q1 Q2 )⊤ ( Q1 Q2 ) = Q2⊤ Q1⊤ Q1 Q2 = Q2⊤ Q2 = In ,

so Q1 Q2 is orthogonal as well. □

7.3 The QR factorization


The QR factorization is a useful tool in several problems of Numerical Linear
Algebra, including the solution of least squares problems. The following result
introduces this factorization in its full and reduced versions.

Theorem 20 (QR factorization). Let A ∈ Rm×n , with m ≥ n and rank ( A) = n. Then


A can be factorized as:

A = QR, (Full QR factorization) (7.5)

and also as
    A = Q̃ R̃,    (Reduced QR factorization)        (7.6)

where Q, R, Q̃, R̃ satisfy:

(a) Q is orthogonal m × m.

(b) R ∈ Rm×n is upper triangular with nonzero diagonal entries.

(c) Q̃ is m × n with orthonormal columns, namely: Q̃^⊤ Q̃ = I_n .

(d) R̃ ∈ Rn×n is upper triangular with nonzero diagonal entries.


Proof: It is possible to obtain a reduced QR factorization (7.6) by means of the


Gram-Schmidt orthogonalization method (normalization included), that is ex-
plained in a basic course on Linear Algebra. More precisely, the column space
of A has an orthonormal basis, {q1 , . . . , qn }, that can be obtained as:
    v_1 = a_1 ,                                                         q_1 = \frac{v_1}{∥ v_1 ∥_2},
    v_2 = a_2 − ( q_1 · a_2 ) q_1 ,                                     q_2 = \frac{v_2}{∥ v_2 ∥_2},        (7.7)
      ⋮                                                                   ⋮
    v_n = a_n − ( q_1 · a_n ) q_1 − · · · − ( q_{n−1} · a_n ) q_{n−1} , q_n = \frac{v_n}{∥ v_n ∥_2},

where ai is the ith column of A. Note that, since the rank of A is n, the dimension
of the column space of A is n. If in (7.7) we solve for the vectors ai in terms of the
vectors q j and we write down the expressions that we obtain in matrix form, we
arrive at
    \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}
    =
    \begin{bmatrix} q_1 & q_2 & \cdots & q_n \end{bmatrix}
    \begin{bmatrix}
    ∥ v_1 ∥_2 & *         & \cdots & *         \\
    0         & ∥ v_2 ∥_2 & \cdots & *         \\
    \vdots    &           & \ddots & \vdots    \\
    0         & \cdots    & 0      & ∥ v_n ∥_2
    \end{bmatrix}
    =: Q̃ R̃.
The entries marked with ∗ are not relevant, but can be obtained from (7.7) as indi-
cated in the previous paragraph. Note that the diagonal entries of R̃ are nonzero
numbers, since they are the norm of nonzero vectors (the fact that the vectors vi
are nonzero is a consequence of the fact that the rank of A is n).
To obtain the full QR factorization (7.5) it is enough to complete the basis
{q1 , . . . , qn } of the column space of A to an orthonormal basis of Rm , {q1 , . . . , qn , qn+1 , . . . , qm },
and to add a zero block to the matrix R̃, in the form:

    \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}
    =
    \begin{bmatrix} q_1 & q_2 & \cdots & q_n & q_{n+1} & \cdots & q_m \end{bmatrix}
    \begin{bmatrix}
    ∥ v_1 ∥_2 & *         & \cdots & *         \\
    0         & ∥ v_2 ∥_2 & \cdots & *         \\
    \vdots    &           & \ddots & \vdots    \\
    0         & \cdots    & 0      & ∥ v_n ∥_2 \\
    0         & \cdots    & 0      & 0         \\
    \vdots    &           &        & \vdots    \\
    0         & \cdots    & 0      & 0
    \end{bmatrix}
    =: Q R.

Figure 7.1 shows the shape and the size of the full and reduced QR factorization.
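As an illustration of the construction used in the proof, the following matlab sketch computes a reduced QR factorization of a full column rank matrix by the classical Gram–Schmidt formulas (7.7); it only mirrors the proof and is not meant to replace the built-in qr command:

function [Q,R] = gram_schmidt_qr(A)
% Reduced QR factorization of A (m x n, m >= n, rank(A) = n) via (7.7)
[m,n] = size(A);
Q = zeros(m,n); R = zeros(n);
for k = 1:n
    v = A(:,k);
    for j = 1:k-1
        R(j,k) = Q(:,j)'*A(:,k);   % entries marked with * in the proof
        v = v - R(j,k)*Q(:,j);
    end
    R(k,k) = norm(v);              % = ||v_k||_2 (nonzero, since rank(A) = n)
    Q(:,k) = v/R(k,k);
end
end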

Remark 15 Some remarks about the QR factorization, in the conditions of Theorem 20,
are in order:


(a) Full QR. (b) Reduced QR.

Figure 7.1: Full and reduced QR factorization of an m × n matrix with m ≥ n.

(a) The matrix R̃ in (7.6) (that is, the upper n × n block of R in (7.5)) is invertible, since it is upper triangular with nonzero diagonal entries.

(b) The reduced QR factorization is unique up to the sign of the columns of Q̃ and the rows of R̃. Namely: if Q̃R̃ and Q̃′R̃′ are two reduced QR factorizations of a matrix A, then

Q̃ = Q̃′S,   R̃ = SR̃′,

where S = diag(±1, ±1, . . . , ±1) is a diagonal matrix whose diagonal entries are 1 or −1.

(c) In the proof of Theorem 20 the diagonal entries of R̃ are positive, since they are the norm of nonzero vectors. The diagonal entries of the matrix R̃ in the QR factorization can also be negative but, by the previous remark, if we impose the condition that they are all positive, then the reduced QR factorization of a matrix is unique.

(d) In the full QR factorization the last m − n columns of Q are not unique.

7.3.1 Solution of least squares problems using the QR factorization


Let A = QR be a full QR factorization of a matrix A. Then the residual of the least
squares problem can be written as
∥Ax − b∥2 = ∥(QR)x − b∥2 = ∥Q(Rx − Q⊤b)∥2 = ∥Rx − Q⊤b∥2
          = ∥ [ R̃x ; 0 ] − [ (Q⊤b)1 ; (Q⊤b)2 ] ∥2 = ∥ [ R̃x − (Q⊤b)1 ; −(Q⊤b)2 ] ∥2,        (7.8)

where the block vectors are written row-wise (rows separated by semicolons), (Q⊤b)1 are the first n rows of Q⊤b, and (Q⊤b)2 are the last m − n rows. To remove the orthogonal factor Q in the third equality we have used Lemma 7.


To minimize (7.8) we can only work on the upper block of the vector obtained at the end, namely R̃x − (Q⊤b)1, since the other block, −(Q⊤b)2, does not depend on x. Moreover, the minimum is achieved when the first block is zero, namely:

R̃x = (Q⊤b)1.

Since R̃ is an upper triangular matrix, the previous system can be solved by backward substitution. This provides the following procedure to solve the least squares problem (LSP):

Algorithm for solving the LSP Ax ≈ b:

Input: A ∈ Rm×n, with m ≥ n and rank(A) = n, and b ∈ Rm.
Output: Solution, x̂, of the LSP.

1. Compute the full QR factorization of A, A = QR.

2. Multiply Q⊤b.

3. Solve, by backward substitution, the SLE R(1 : n, 1 : n)x̂ = (Q⊤b)(1 : n).

4. (Optional) The norm of the minimal residual is

∥Ax̂ − b∥2 = ∥(Q⊤b)(n + 1 : m)∥2.
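A direct matlab transcription of this algorithm could be as follows (the function name lsq_qr is ours):

function [xhat, res] = lsq_qr(A, b)
% Least squares solution of A*x = b (A m x n, m >= n, rank(A) = n) via full QR
[m, n] = size(A);
[Q, R] = qr(A);             % step 1: full QR factorization
c = Q'*b;                   % step 2
xhat = R(1:n,1:n)\c(1:n);   % step 3: backward substitution on the triangular block
res = norm(c(n+1:m));       % step 4: norm of the minimal residual
end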

Remark 16 The previous algorithm is, essentially, the one that matlab follows with the
command A\b.

Remark 17 The solution to the LSP is unique if rank(A) = n. Otherwise, it is not.

7.3.2 Computation of the QR factorization


One way to obtain the QR factorization is by means of the Gram-Schmidt method,
as we have explained before. Here we are going to study another procedure which
is numerically more advantageous. It is based on the following transformations:

Definition 16 Let u ∈ Rm be a nonzero vector. The Householder reflector associated to u is the matrix

Hu := Im − (2/∥u∥2²) uu⊤.   (7.9)

In the following lemma we present the elementary properties of the Householder reflectors that are needed later.


Lemma 9 (Properties of Householder reflectors). Let Hu be the Householder reflector


associated to some vector u ∈ Rm . Then:

(i) Hu⊤ = Hu .

(ii) Hu⊤ Hu = Hu2 = Im .

(iii) If 0 ̸= α ∈ R, then Hαu = Hu .


Proof: (i) Hu⊤ = (Im − (2/∥u∥2²) uu⊤)⊤ = Im⊤ − (2/∥u∥2²) (uu⊤)⊤ = Im − (2/∥u∥2²) uu⊤ = Hu.

(ii) Hu⊤Hu = Hu² (by (i)) = (Im − (2/∥u∥2²) uu⊤)(Im − (2/∥u∥2²) uu⊤) = Im − (4/∥u∥2²) uu⊤ + (4/∥u∥2⁴) u(u⊤u)u⊤ = Im − (4/∥u∥2²) uu⊤ + (4/∥u∥2²) uu⊤ = Im.

(iii) Hαu = Im − (2/∥αu∥2²) (αu)(αu)⊤ = Im − (2/(α²∥u∥2²)) α² uu⊤ = Im − (2/∥u∥2²) uu⊤ = Hu. □
As a consequence of Lemma 9, the Householder reflectors are orthogonal matrices. Moreover, they are reflections across the hyperplane orthogonal to u (in the plane, a reflection through a line, as in Figure 7.2).
Given any vector, x, there exists a Householder matrix that takes it to any other
vector of the same norm. In particular, we are interested in taking the vector x to a multiple of the first canonical vector, e1 = [1 0 . . . 0]⊤. That is: given x ∈ Rm, we want to find a vector u ∈ Rm such that Hu x = ce1, with c being a scalar (namely, a real number). As vectors x and ce1 must have the same norm, it
must be c = ±∥x∥2 . There are two possible choices of the vector u such that Hu
takes x to ±∥x∥2 e1 , which are u = ±∥x∥2 e1 − x. Figure 7.2 illustrates the effect, on
the plane, of the Householder reflector that is obtained taking these two vectors.

Figure 7.2: Householder reflector in R2 [TB97].


In practice, and for stability reasons, the standard choice is c = −sign(x1)∥x∥2, where x1 is the first coordinate of the vector x, so that no cancellation occurs when computing the first entry of u = ce1 − x.
Householder reflectors are used to obtain the QR factorization in the following way. In the first place, we choose the Householder reflector that takes the first column of A, denoted by a1, to the vector ce1 following the previous criterion, namely c = −sign((a1)1)∥a1∥2 and u = ce1 − a1. This effect is produced by multiplying, on the left, the starting matrix, A, by Hu. Then, we proceed in the same way with the Householder reflector that takes the first column of the matrix obtained from this product after removing its first row and first column to the corresponding multiple of e1, and so on. The following example illustrates this procedure.

Example 15 Let us illustrate how the QR factorization of a matrix A ∈ R5×3 is obtained


using Householder reflectors. The entries marked with ∗ are not relevant. Similarly, we
use the variants ∗′ , ∗′′ and ∗′′′ to indicate that the entries are modified at each iteration.
Nonetheless, it should be noted that not all entries marked with ∗ are necessarily equal,
and the same happens with the entries ∗′, ∗′′, and ∗′′′ (matrices are written below in matlab-style notation, with rows separated by semicolons):

A = [ ∗ ∗ ∗ ; ∗ ∗ ∗ ; ∗ ∗ ∗ ; ∗ ∗ ∗ ; ∗ ∗ ∗ ]

−→ (multiply by H1)   H1A = [ ∗′ ∗′ ∗′ ; 0 ∗′ ∗′ ; 0 ∗′ ∗′ ; 0 ∗′ ∗′ ; 0 ∗′ ∗′ ]

−→ (multiply by H2)   H2H1A = [ ∗′ ∗′ ∗′ ; 0 ∗′′ ∗′′ ; 0 0 ∗′′ ; 0 0 ∗′′ ; 0 0 ∗′′ ]

−→ (multiply by H3)   H3H2H1A = [ ∗′ ∗′ ∗′ ; 0 ∗′′ ∗′′ ; 0 0 ∗′′′ ; 0 0 0 ; 0 0 0 ],

where

H1 = Householder reflector such that H1 A(:, 1) = [ ∗′ 0 0 0 0 ]⊤,

H2 = [ 1 0 ; 0 H2′ ], with H2′ the Householder reflector such that H2′ (H1A)(2 : 5, 2) = [ ∗′′ 0 0 0 ]⊤,

H3 = [ I2 0 ; 0 H3′ ], with H3′ the Householder reflector such that H3′ (H2H1A)(3 : 5, 3) = [ ∗′′′ 0 0 ]⊤.


Now:
H3 H2 H1 A = R, upper triangular matrix,
H3 H2 H1 = Q⊤ , orthogonal matrix (since it is a product of orthogonal matrices, see Lemma 8),

so A = QR is a full QR factorization of A.

Example 15 can be extended to any matrix A ∈ Rm×n :

QR factorization using Householder reflectors:

( Hn Hn−1 · · · H2 H1 ) A = R,
Q⊤ = Hn Hn−1 · · · H2 H1 ,
Q = H1 H2 · · · Hn−1 Hn ,
(H1 , . . . , Hn are Householder reflectors)
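The following matlab sketch implements this procedure, applying each reflector Hk implicitly (that is, without forming the matrices Hk); the sign used in the construction of u follows the stability criterion discussed above:

function [Q,R] = householder_qr(A)
% Full QR factorization of A (m x n, m >= n) using Householder reflectors
[m,n] = size(A);
Q = eye(m); R = A;
for k = 1:n
    x = R(k:m,k);
    s = sign(x(1)); if s == 0, s = 1; end
    u = x; u(1) = u(1) + s*norm(x);      % u is a multiple of c*e1 - x (Lemma 9 (iii))
    if norm(u) > 0
        u = u/norm(u);
        R(k:m,k:n) = R(k:m,k:n) - 2*u*(u'*R(k:m,k:n));   % apply H_k on the left
        Q(:,k:m)   = Q(:,k:m) - 2*(Q(:,k:m)*u)*u';       % accumulate Q = H1*H2*...*Hn
    end
end
end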

matlab commands for the QR factorization

[Q R]=qr(A): Provides the matrices Q and R of the full QR factorization of A.


[Q R]=qr(A,0): Provides the matrices Q and R of the reduced QR factorization of
A.
qrsteps: Is a command included in the folder ncm. It shows all the steps followed
in the computation of the QR factorization using Householder reflectors.

7.4 The Singular Value Decomposition (SVD)


The SVD is a decomposition of a matrix A ∈ Rm×n that provides useful informa-
tion regarding:

• The rank of the matrix A.

• The 2-norm of A.

• The distance of A to the set of matrices with smaller rank. That is, the 2-norm of the smallest perturbation, ∆A, that makes rank(A + ∆A) < rank A. In particular, if A is invertible, it will allow us to know how far A is from being non-invertible (singular).

In this course, however, we will use the SVD as a tool to solve LSPs. In the
following result we introduce the SVD. Though it is valid for matrices of any size,
we keep imposing the same restriction as in previous sections, namely, when A is
m × n with m ≥ n.


Theorem 21 (Singular Value Decomposition). Let A ∈ Rm×n, with m ≥ n and rank(A) = r. Then, there are two orthogonal matrices U ∈ Rm×m and V ∈ Rn×n such that

A = UΣV⊤,   (7.10)

with Σ ∈ Rm×n the matrix whose entries are Σii = σi, for i = 1, . . . , n, and zero elsewhere, where

σ1 ≥ σ2 ≥ · · · ≥ σr > σr+1 = · · · = σn = 0.
The factorization (7.10) is known as the Singular Value Decomposition (SVD) of A,
and the real numbers σ1 , σ2 , . . . , σr , σr+1 , . . . , σn are known as the singular values of A.
The proof of Theorem 21 is beyond the scope of this course but can be found,
for instance, in [TB97].
Remark 18 The singular values of A are unique, but the matrices U and V of the SVD
of A are not unique.
Remark 19 The SVD of A provides, among other information, the rank of A, which is
equal to r (the number of nonzero singular values).
The matrix Σ in Theorem 21 can be written as

Σ = [ D ; 0(m−n)×n ],   D = diag(σ1, . . . , σn),

that is, an n × n diagonal matrix on top of an (m − n) × n zero block. When performing the product at the right-hand side in (7.10), the last m − n columns of U disappear, since they are multiplied by zeroes of Σ, so they can be eliminated. This leads to the expression:

Reduced (or economic) SVD of A:

A = U(: , 1 : n) D V⊤,   D = diag(σ1, . . . , σn).


The 2-norm and the 2-condition number

Given an SVD (7.10) of A ∈ Rm×n, we have A∗A = A⊤A = VΣ⊤ΣV⊤. This matrix is similar to Σ⊤Σ, which is a diagonal matrix whose diagonal entries are the squares of the singular values of A. Then, by Lemma 2 we conclude that

∥A∥2 = σ1 (the largest singular value of A).   (7.11)

Moreover, if A is invertible, then A−1 = VΣ−1U⊤ is an SVD of A−1 (up to reordering), so the largest singular value of A−1 is 1/σn, where σn is the smallest singular value of A. As a consequence

κ2(A) = σ1/σn   (ratio between the largest and smallest singular value of A).

matlab command for the SVD

[U,S,V]=svd(A): Provides the matrices U, S and V of the full SVD of A. The reduced (economic) SVD is obtained with [U,S,V]=svd(A,'econ').

7.4.1 Application of the SVD to low-rank approximation


In many situations it may be convenient to approximate a given matrix A by some
other matrix with smaller rank, which is called a low-rank approximation of A. It is
desirable that this low-rank approximation is the closest one to A in some sense. The interest in this closeness is motivated by the idea that the closer the approximation is to A, the more information from A it retains. In other words, we look for the low-rank approximation that best “represents” a given matrix A. A natural notion of “closeness” is given by the distance, and in
particular the distance defined by the 2-norm. Then the low-rank approximation
problem is as follows:

Low-rank approximation problem:


Given a matrix A, find a matrix Aρ such that

∥ A − Aρ ∥2 = min{∥ A − B∥2 : rank B ≤ ρ}

The solution to this problem goes through the SVD of A. To state it, we need to recall that any matrix M ∈ Cm×n with rank M = ρ can be written as a sum of rank-1 matrices

M = x1y1∗ + · · · + xρyρ∗,


for some vectors xi ∈ Cm and yi ∈ Cn, for i = 1, . . . , ρ. In particular, the SVD of A in (7.10) allows us to decompose A as a sum of r rank-1 matrices as follows:

A = ∑_{i=1}^{r} σi ui vi∗,   (7.12)

where ui , vi , for i = 1, . . . , r, are the ith columns of U and V, respectively. With


these considerations in mind, we can state the solution of the Low-rank approxi-
mation problem.

Theorem 22 (Optimal low-rank approximation). Let A ∈ Cm×n be as in (7.12), with σ1 ≥ · · · ≥ σr > 0 being the nonzero singular values of A, and let ρ ≤ r. Set

Aρ := ∑_{i=1}^{ρ} σi ui vi∗.   (7.13)

Then
∥ A − Aρ ∥2 = min{∥ A − B∥2 : rank B ≤ ρ} = σρ+1 ,
where, if ρ = r, then σρ+1 = 0.

Proof: First note that, if A and Aρ are as in (7.12) and (7.13), respectively, then
∥A − Aρ∥2 = ∥∑_{i=ρ+1}^{r} σi ui vi∗∥2 = σρ+1, by (7.11). It remains to prove that, for any
B with rank B ≤ ρ, it must be ∥ A − B∥2 ≥ σρ+1 . We omit this part and refer to
[I09, Fact 4.13] for a proof. □
As a consequence of Theorem 22, the best approximation to A with rank ρ is
obtained from the SVD of A by truncating the sum in (7.12) from the ρth singular
value on.
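In matlab, for a given matrix A the best rank-ρ approximation of Theorem 22 can be obtained by truncating the SVD (sketch; the value of rho is chosen only for illustration):

[U, S, V] = svd(A);
rho = 5;                                       % desired rank (rho <= rank(A))
Arho = U(:,1:rho)*S(1:rho,1:rho)*V(:,1:rho)';  % truncated sum (7.13)
err = norm(A - Arho, 2);                       % equals sigma_{rho+1}, by Theorem 22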

Image compression

We are going to see an interesting application of the SVD in the process of com-
pression and restoration of images.
For the sake of simplicity, let us focus on the simplest case of a grey scale. The
information of this image is stored in a matrix, whose dimensions correspond to
the resolution of the image. For instance, an image with resolution 1368 × 1712
means that the image is decomposed into a grid of 1368 × 1712 small squares
(pixels), each one having an independent intensity of grey color. This corresponds
to a matrix of size 1368 × 1712, where each entry encodes the intensity of grey
color in the corresponding pixel. In some cases, it may be preferable to reduce
the amount of stored information, at the cost of losing some definition in the
image. This leads to the question on how to get the best compromise between the


amount of stored information and the quality of the image. The SVD is helpful
in this question, based on Equation (7.12) and Theorem 22. More precisely, the
decomposition in (7.12) allows us to store the information of the matrix A ∈ Cm×n
through the sum of r rank-1 matrices, together with r real values (the singular
values σ1 ≥ · · · ≥ σr > 0). This is a total amount of r (m + n) + r numbers, since
each rank-1 matrix of size m × n is stored in 2 vectors, one of size m × 1, and the
other one of size 1 × n. Of course, if r = m = n (so the matrix is invertible), then
this amounts to 2n2 + n, which is more information than storing directly the n × n
whole matrix. However, when the matrix is rectangular, and m << n or n << m,
or even when it is square n × n but r << n, this quantity can be much smaller than
mn, which is the storage requirement of the whole matrix. Then, what we can do
is to replace the whole matrix A by a good low-rank approximation. The notion
of “good” here depends on the particular case, but the idea is to find a low rank
approximation that allows to decrease the storage cost without loosing too much
definition in the image. Theorem 22 tells us that the best rank-ρ approximation
to a given matrix (in terms of the distance induced by the 2-norm) is obtained by
truncating the sum (7.12), namely removing the last addends, and keeping the first
ρ addends, which correspond to the largest singular values. The compression ratio
(that is usually presented as a percentage) is the ratio between the information that
is needed to store the low-rank approximation and the one for the full matrix, and
is given by
ρ ( m + n + 1)
cρ = .
mn
Note that if ρ = min{m, n} then cρ > 1, so this approach is not useful for low-rank
approximations with rank close to the full rank of the original matrix. However,
the good news is that low-rank approximations with low-to-moderate rank can
provide very good results.
In general, a rank-100 approximation (that is, using only the largest 100 singular values) is enough to get a good approximation to the original image, while the amount of information stored in this approximation is much less than the one stored in the original matrix. In particular, for an image with 1368 × 1712 resolution, the information required to store the rank-100 approximation in the form (7.13), with ρ = 100, is equal to 100(1368 + 1712 + 1) = 308100 numbers. Compared
to the size of the whole matrix this gives a compression rate of

cρ = 308100/(1368 · 1712) = 0.1315,
which means that the information stored in the approximation is only 13.15% of
the whole information, which is an important saving.
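As an illustration, a grey-scale image stored in a file (the file name below is just a placeholder) can be compressed along these lines in matlab:

X = double(imread('picture.png'));    % grey-scale image as an m x n matrix
[U, S, V] = svd(X);
rho = 100;                            % number of singular values kept
Xrho = U(:,1:rho)*S(1:rho,1:rho)*V(:,1:rho)';
imagesc(Xrho), colormap(gray), axis image     % display the compressed image
[m, n] = size(X);
c = rho*(m + n + 1)/(m*n)             % compression ratio c_rho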


Figure 7.3 shows three compressions of the same 720 × 1080 image, which has
full row rank (namely, 720). These images correspond to three low-rank approxi-
mations, whose rank and compression ratio are shown in Table 7.1

(a) Rank 10. (b) Rank 70.

(c) Rank 360. (d) Rank 720.

Figure 7.3: Compressed images and original picture.

Rank    Compression ratio
10      2.32%
70      16.21%
360     69.49%

Table 7.1: Rank and compression ratio of the images in Figure 7.3.


7.5 The pseudoinverse of a matrix


There are some matrices A ∈ Rm×n , with m ≥ n, which are not invertible (for
instance, if m > n because the matrix is not square) but that can be “inverted”
from the left, namely there is a matrix A† ∈ Rn×m such that A†A = In. This matrix
A† is the pseudoinverse of A. In general, any matrix A ∈ Rm×n has a pseudoinverse.
To define it, we use the following notation. Given the SVD of A (7.10), we express
the matrix Σ as

Σ = [ Σ+  0r×(n−r) ; 0(m−r)×r  0(m−r)×(n−r) ],   Σ+ = diag(σ1, . . . , σr) ∈ Rr×r.   (7.14)

Note that, since σ1 , . . . , σr are nonzero, the matrix Σ+ is invertible. This leads to
the notion of pseudoinverse.

Definition 17 (Pseudoinverse). Let A ∈ Rm×n, with m ≥ n and rank(A) = r, and let (7.10) be the SVD of A. Then

A† := V [ Σ+⁻¹  0r×(m−r) ; 0(n−r)×r  0(n−r)×(m−r) ] U⊤   (with Σ+ as in (7.14))

is the pseudoinverse of A.

The basic properties of the pseudoinverse that we are going to use in this course
are summarized in the following result.

Lemma 10 (Properties of the pseudoinverse). Let A ∈ Rm×n , with m ≥ n and


rank( A) = r, and let A† be the pseudoinverse of A. Then:

(i) A† ∈ Rn×m .
 
(ii) A†A = V [ Ir  0r×(n−r) ; 0(n−r)×r  0(n−r)×(n−r) ] V⊤.

(iii) If r = n then A⊤ A is invertible and A† = ( A⊤ A)−1 A⊤ .

(iv) If r = n then A† A = In .

(v) If A is invertible then A† = A−1 .


Proof: Claim (i) is immediate from the definition of A†.

Claim (ii) follows directly from

A†A = V [ Σ+⁻¹  0r×(m−r) ; 0(n−r)×r  0(n−r)×(m−r) ] U⊤ U [ Σ+  0r×(n−r) ; 0(m−r)×r  0(m−r)×(n−r) ] V⊤
    = V [ Σ+⁻¹  0r×(m−r) ; 0(n−r)×r  0(n−r)×(m−r) ] [ Σ+  0r×(n−r) ; 0(m−r)×r  0(m−r)×(n−r) ] V⊤
    = V [ Ir  0r×(n−r) ; 0(n−r)×r  0(n−r)×(n−r) ] V⊤,

where we have used that U⊤U = Im.

To prove (iii), note that r = n, so Σ = [ Σ+ ; 0(m−n)×n ], and

A⊤A = V [ Σ+  0n×(m−n) ] U⊤ U [ Σ+ ; 0(m−n)×n ] V⊤ = V [ Σ+  0n×(m−n) ] [ Σ+ ; 0(m−n)×n ] V⊤ = VΣ+²V⊤.

Since V and Σ+ are both invertible, the product VΣ+²V⊤ is also invertible. Moreover, (A⊤A)⁻¹ = V(Σ+²)⁻¹V⊤, so

(A⊤A)⁻¹A⊤ = V(Σ+²)⁻¹V⊤ V [ Σ+  0n×(m−n) ] U⊤ = V [ (Σ+²)⁻¹Σ+  0n×(m−n) ] U⊤ = V [ Σ+⁻¹  0n×(m−n) ] U⊤ = A†.

Claim (iv) is an immediate consequence of (iii), since A†A = (A⊤A)⁻¹(A⊤A) = In. Similarly, (v) is an immediate consequence of (iv), as we know from the basic course on Linear Algebra. □

7.5.1 Application of the pseudoinverse to the LSP


The pseudoinverse of A will allow us to give an explicit expression of the solution
to the LSP associated to the SLE (4.1). This expression is given in the following
result.

Theorem 23 Let A ∈ Rm×n , with m ≥ n and rank( A) = r, and let A† be the pseudoin-
verse of A. Then:

(i) If r = n, the LSP associated to the SLE (4.1) has a unique solution and it is given by x̂ = A†b.


(ii) If r < n then the LSP has infinitely many solutions, which can be written as:

x̂ = A†b + αr+1V(: , r + 1) + · · · + αnV(: , n),   αr+1, . . . , αn ∈ R.   (7.15)

Among all of them, A†b is the one with smallest 2-norm.

Proof: We are going to prove only claim (i), using the SVD of A (7.10):

∥Ax − b∥2 = ∥UΣV⊤x − b∥2 = ∥U(ΣV⊤x − U⊤b)∥2 = ∥ΣV⊤x − U⊤b∥2
          = ∥ [ Σ+ ; 0(m−n)×n ] V⊤x − U⊤b ∥2 = ∥ [ Σ+ ; 0(m−n)×n ] y − [ (U⊤b)(1 : n) ; (U⊤b)(n + 1 : m) ] ∥2
          = ∥ [ Σ+y − (U⊤b)(1 : n) ; −(U⊤b)(n + 1 : m) ] ∥2,

where y = V⊤x. The previous norm is minimized when Σ+y = (U⊤b)(1 : n). Since r = n, the matrix Σ+ is invertible, so this identity is equivalent to

y = Σ+⁻¹(U⊤b)(1 : n),

namely

x̂ = VΣ+⁻¹(U⊤b)(1 : n) = V [ Σ+⁻¹  0n×(m−n) ] U⊤b = A†b.

As for claim (ii), we are just going to give an idea of why it is true. When r < n, it can be proved that x̂ = A†b is a particular solution of the LSP. Now, for any vector, z, in the null space of A, we have

∥A(x̂ + z) − b∥2 = ∥Ax̂ + Az − b∥2 = ∥Ax̂ − b∥2,

so x̂ + z has the same residual as x̂. Therefore, x̂ + z is also a solution of the LSP. The addends that appear to the right of A†b in the right-hand side of (7.15) are the general way to write a vector in the null space of A, as a linear combination of the vectors of a basis of this space. □

matlab command for the pseudoinverse

pinv(A): Provides the pseudoinverse of A.
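For a full column rank matrix A and right-hand side b, the three approaches to the LSP seen in this chapter can be compared directly in matlab (sketch):

x1 = A\b;              % least squares via QR (Remark 16)
x2 = pinv(A)*b;        % pseudoinverse, computed by matlab through the SVD
[Q, R] = qr(A, 0);     % reduced QR factorization
x3 = R\(Q'*b);
% x1, x2 and x3 coincide up to rounding errors when rank(A) = n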

8 Numerical Optimization

In this chapter we look for (local) maxima and minima of vector functions f :
Rn → R. We start, in Section 8.1, by introducing the basic notions and results, and
then, in Section 8.2, we introduce the numerical methods that are studied in this
course.

8.1 Unconstrained optimization


We recall that the open ball centered at x0 ∈ Rn with radius r is the set of vectors in
Rn whose distance to x0 is less than r, namely:

B(x0 , r ) := {x ∈ Rn : ∥x − x0 ∥2 < r }.

The ith partial derivative of the function f : Rn → R at x0 ∈ Rn is defined as

∂f/∂xi (x0) := lim_{h→0} ( f(x0 + hei) − f(x0) ) / h,
where ei is the ith canonical vector in Rn, for i = 1, . . . , n. Then, the gradient of f is the vector of partial derivatives of f, namely

∇f(x0) := [ ∂f/∂x1 (x0), . . . , ∂f/∂xn (x0) ]⊤.

The directional derivative of f at x0 in the direction of y is

Dy f(x0) = lim_{h→0} ( f(x0 + hy) − f(x0) ) / h.
We know from basic Calculus that

Dy f (x0 ) = ∇ f (x0 )⊤ y. (8.1)

Similarly, we define the second (i, j) partial derivative of f at x0 as

∂²f/(∂xi∂xj) (x0) := ∂/∂xi ( ∂f/∂xj ) (x0).


When i = j in the previous definition, we write ∂²f/∂xi² (x0). From elementary Calculus we know that

∂²f/(∂xi∂xj) (x0) = ∂²f/(∂xj∂xi) (x0).   (8.2)
The Hessian of the function f at x0 is the n × n matrix

∇²f(x0) := [ ∂²f/(∂xi∂xj) (x0) ]_{i,j=1,...,n},

that is, the matrix whose (i, j) entry is the second partial derivative ∂²f/(∂xi∂xj) (x0).

By (8.2), the Hessian matrix is symmetric. We recall the following notions for
symmetric matrices.

Definition 18 Let S be an n × n real symmetric matrix. Then S is said to be

(i) positive definite if x⊤ Sx > 0, for all nonzero x ∈ Rn ,

(ii) negative definite if x⊤ Sx < 0, for all nonzero x ∈ Rn ,

(iii) positive semidefinite if x⊤ Sx ≥ 0, for all x ∈ Rn ,

(iv) negative semidefinite if x⊤ Sx ≤ 0, for all x ∈ Rn , and

(v) indefinite if there are two vectors x, y such that x⊤ Sx > 0 and y⊤ Sy < 0.

We know from basic Linear Algebra that every real symmetric matrix has real
eigenvalues. The following result characterizes the definiteness of a symmetric
matrix in terms of the sign of its eigenvalues.

Theorem 24 Let S be an n × n real symmetric matrix. Then, S is

(i) positive definite if and only if all eigenvalues of S are positive,

(ii) negative definite if and only if all eigenvalues of S are negative,

(iii) positive semidefinite if and only if all eigenvalues of S are nonnegative,

(iv) negative semidefinite if and only if all eigenvalues of S are nonpositive,

(v) indefinite if and only if S has at least one positive and one negative eigenvalue.


Proof: We know from elementary Linear Algebra that S is orthogonally diagonal-


izable. This means that there is an orthogonal matrix, Q, such that S = Q⊤ DQ,
with D being diagonal, and whose diagonal entries are the eigenvalues of S (de-
noted by λ1 , . . . , λn ). Then, for any x ∈ Rn ,

x⊤ Sx = x⊤ Q⊤ DQx = ( Qx)⊤ D ( Qx) = y⊤ Dy,

for y = Qx. Since Q is invertible, y is any vector in Rn . Now the result follows
taking into account that

y⊤Dy = [ y1 · · · yn ] diag(λ1, . . . , λn) [ y1 ; . . . ; yn ] = λ1y1² + · · · + λnyn². □


If f is twice differentiable at x0, we know from basic Calculus that it can be expanded as a Taylor expansion around x0:

f(x) = f(x0) + ∇f(x0)⊤(x − x0) + ½ (x − x0)⊤∇²f(x0)(x − x0) + o(∥x − x0∥2²),   (8.3)

for x ∈ Rn close enough to x0, where o(∥x − x0∥2²) contains terms such that

lim_{x→x0} o(∥x − x0∥2²) / ∥x − x0∥2² = 0.
Now we introduce the basic notions of this chapter.

Definition 19 (Critical and saddle point, local maxima and minima). Given a func-
tion f : Rn → R and a point x0 ∈ Rn , we say that x0 is

(i) a critical point of f if ∇ f (x0 ) = 0,

(ii) a local minimum of f if f(x) > f(x0), for all x ∈ B(x0, r) with x ≠ x0 and some r > 0,

(iii) a local maximum of f if f(x) < f(x0), for all x ∈ B(x0, r) with x ≠ x0 and some r > 0,

(iv) a non-strict local minimum of f if f (x) ≥ f (x0 ), for all x ∈ B(x0 , r ) and some
r > 0,

(v) a non-strict local maximum of f if f (x) ≤ f (x0 ), for all x ∈ B(x0 , r ) and some
r > 0,

(vi) a saddle point if ∇ f (x0 ) = 0 and, for any r > 0, there are x1 , x2 ∈ B(x0 , r ) such
that f (x2 ) < f (x0 ) < f (x1 ).


Figure 8.1: Maximum, minimum, and saddle point.

Figure 8.2: Non-strict minima.

Figure 8.1 illustrates the notion of local maximum and minimum, as well as a
saddle point, and in Figure 8.2 we illustrate the notion of non-strict local min-
imum. Note that there are infinitely many such points in a straight line at the
bottom of the graph.
We are interested in computing (or approximating) local maxima and local min-
ima. For this, note that f has a local maximum at x0 if and only if − f has a local
minimum at x0 . Therefore, we can concentrate on local minima.
Now the question is: what is the connection between all the previous stuff?
Namely, what is the role played by the gradient, the Hessian, the Taylor expansion,
or Definition 18 and Theorem 24 in the context of obtaining the local minima of a
function? The answer comes from the following result.

Theorem 25 Let f : Rn → R be twice differentiable at x0 ∈ Rn .

(i) If f has a local minimum at x0 then ∇ f (x0 ) = 0.

(ii) f has a local minimum at x0 if and only if ∇ f (x0 ) = 0 and ∇2 f (x0 ) is positive
definite.


(iii) f has a non-strict local minimum at x0 if and only if ∇ f (x0 ) = 0 and ∇2 f (x0 ) is
positive semidefinite.

Proof: All claims in the statement are a consequence of (8.3). More precisely, if
f has a local minimum at x0 then it must be f (x) > f (x0 ) for x close enough to
x0 . Note that, for x close enough to x0 , the terms 21 (x − x0 )⊤ ∇2 f (x0 )(x − x0 ) +
o (∥x − x0 ∥22 ) in (8.3) are smaller than ∇ f (x0 )⊤ (x − x0 ) (note that they are all real
numbers), so in order to have f (x) > f (x0 ) it must be ∇ f (x0 ) = 0, since otherwise
there are some x1 and x2 close enough to x0 such that ∇ f (x0 )⊤ (x1 − x0 ) > 0 and
∇ f (x0 )⊤ (x2 − x0 ) < 0. This proves claim (i).
To prove claim (ii), we first assume that x0 is a local minimum of f . Then, by
claim (i), it must be ∇ f (x0 ) = 0 so, by (8.3),

1
f (x) ≈ f (x0 ) + (x − x0 )⊤ ∇2 f (x0 )(x − x0 ), (8.4)
2
for x close enough to x0 . Therefore, ∇2 f (x0 ) must be positive definite. Conversely,
if ∇ f (x0 ) = 0 equation (8.4) holds for x close enough to x0 and, if ∇2 f (x0 ) is
positive definite, this implies that f (x) > f (x0 ), so x0 is a local minimum of f .
The proof of claim (iii) is similar to that of claim (ii), just noticing that, if ∇²f(x0) is positive semidefinite, then it may happen that (x − x0)⊤∇²f(x0)(x − x0) = 0, for x arbitrarily close to x0, which gives f(x) = f(x0) by (8.4). □

Example 16 Let us consider the function f(x) = e^{x1+2x2} sin(x1) + x2, whose gradient at some vector x is

∇f(x) = [ e^{x1+2x2}(sin(x1) + cos(x1)) ; 2e^{x1+2x2} sin(x1) + 1 ].

The gradient is zero at the points

x1 = 7π/4 + 2πn,  n ∈ Z,      x2 = −½ ( x1 + log(−2 sin(x1)) ).

Let us analyze what happens for n = 0 above. In this case the point is

x0 := [ x1 ; x2 ] = [ 7π/4 ; −½ ( 7π/4 + log(√2) ) ].

The Hessian of f at x is

∇²f(x) = e^{x1+2x2} [ 2 cos(x1)   2(sin(x1) + cos(x1)) ; 2(sin(x1) + cos(x1))   4 sin(x1) ],


Figure 8.3: Illustration of a convex function f : [−2, 4] → R.

which, for x0 above, reads as follows. At x0 we have e^{x1+2x2} = −1/(2 sin(x1)) = 1/√2, sin(x1) = −√2/2 and cos(x1) = √2/2, so

∇²f(x0) = (1/√2) [ √2  0 ; 0  −2√2 ] = [ 1  0 ; 0  −2 ].

This is a diagonal matrix with one positive and one negative diagonal entry, so it is indefinite (Theorem 24). Therefore, x0 is a saddle point of f. In fact, at every critical point of this f we have 2e^{x1+2x2} sin(x1) = −1, so the (2, 2) entry of the Hessian is always −2 and f has no local minima.
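The computations of Example 16 can be checked numerically in matlab (sketch):

f = @(x) exp(x(1)+2*x(2))*sin(x(1)) + x(2);
g = @(x) [exp(x(1)+2*x(2))*(sin(x(1))+cos(x(1))); 2*exp(x(1)+2*x(2))*sin(x(1)) + 1];
H = @(x) exp(x(1)+2*x(2))*[2*cos(x(1)), 2*(sin(x(1))+cos(x(1))); ...
                           2*(sin(x(1))+cos(x(1))), 4*sin(x(1))];
x0 = [7*pi/4; -(7*pi/4 + log(sqrt(2)))/2];
norm(g(x0))    % essentially zero: x0 is a critical point
eig(H(x0))     % eigenvalues 1 and -2: indefinite Hessian, so x0 is a saddle point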

8.1.1 Convex functions and convex optimization


Definition 20 (Convex function). A function f : D ⊆ Rn → R is convex in D if

f (αx + (1 − α)y) ≤ α f (x) + (1 − α) f (y), (8.5)

for all x, y ∈ D and all α ∈ [0, 1].


The function is said to be strictly convex if (8.5) is a strict inequality.

In plain words, for a convex function the segment joining the points (x, f(x)) and (y, f(y)) lies on or above the graph of the function, as illustrated in Figure 8.3.
The function to the left in Figure 8.1 is strictly convex, whereas the one in Figure
8.2 is convex, but not strictly convex. The functions in the middle and the right of
Figure 8.1, however, are not convex.
An example of a strictly convex function is f : R² → R defined by f(x, y) = x² + y².

However, it may happen that not all points in the segment joining x and y
belong to the domain or the subset of the domain where we are searching the
local minima. This leads to the following notion.


Definition 21 (Convex set). A set D ⊆ Rn is convex if the segment joining any two
points x, y ∈ D belongs to D . Namely, αx + (1 − α)y ∈ D , for all α ∈ [0, 1].

Examples of convex sets are the whole Rn , an open ball B(x, r ), or an n-dimensional
rectangle [ a1 , b1 ] × · · · × [ an , bn ].
We are going to see in Theorem 27 that convex functions over convex sets have
just one local minimum which is also a global minimum. This is a great advantage
against general functions, where the number of local minima is not known (and
not even whether such a minimum exists). In order to prove Theorem 27 we first
provide in Theorem 26 some properties of convex functions.

Theorem 26 Let f : D ⊆ Rn → R be a function and D a convex set.


(a) If f is differentiable with continuous derivative in D , then f is convex if and only if

f ( y ) ≥ f ( x ) + ∇ f ( x ) ⊤ ( y − x ), for all x, y ∈ D . (8.6)

(b) If f is twice differentiable with continuous second derivative, then f is convex if and
only if
∇2 f (x) is positive semidefinite for all x ∈ D . (8.7)

Parts (a) and (b) remain true if we replace “convex” by “strictly convex”, the inequality
in (8.6) by a strict inequality and “semidefinite” in (8.7) by “definite”.

Proof: (a) Let us first assume that f is convex. Then, f ((1 − α)x + αy) ≤ (1 −
α) f (x) + α f (y), for any α ∈ [0, 1], which is equivalent to
( f(x + α(y − x)) − f(x) ) / α ≤ f(y) − f(x),

and, taking the limit as α → 0, the previous inequality, together with (8.1), gives

∇f(x)⊤(y − x) = D_{y−x} f(x) = lim_{α→0} ( f(x + α(y − x)) − f(x) ) / α ≤ f(y) − f(x),
so (8.6) holds.
Now assume that (8.6) holds and let us prove that f is convex. Let x, y ∈ D and
α ∈ [0, 1], and set z := αy + (1 − α)x. By assumption
f (y) ≥ f (z) + ∇ f (z)⊤ (y − z)
f ( x ) ≥ f ( z ) + ∇ f ( z ) ⊤ ( x − z ).
Multiplying the first inequality above by α and adding up the second one multi-
plied by (1 − α) we obtain:
αf(y) + (1 − α)f(x) ≥ f(z) + (1 − α)∇f(z)⊤(x − z) + α∇f(z)⊤(y − z) = f(z) + ∇f(z)⊤(αy + (1 − α)x − z) = f(αy + (1 − α)x),


where in the last equality we have used that αy + (1 − α)x − z = 0.


(b) Let us first assume that f is as in the statement (namely, twice differentiable
with continuous second derivative). Let x ∈ D , and y be close enough to x, so that
the Taylor expansion (8.3) holds:
f(y) = f(x) + ∇f(x)⊤(y − x) + ½ (y − x)⊤∇²f(x)(y − x) + o(∥y − x∥2²).   (8.8)
If f is convex, by part (a), f (y) ≥ f (x) + ∇ f (x)⊤ (y − x), so (8.8) implies, for y close
enough to x, that (y − x)⊤ ∇2 f (x)(y − x) ≥ 0. This implies that z⊤ ∇2 f (x)z ≥ 0,
for all z in a closed ball around x. By Exercise 4 this implies that ∇2 f (x) is positive
semidefinite.
For the converse, we are going to use another expression for the Taylor expan-
sion of f around x ∈ D , instead of (8.3), that is known from elementary Calculus,
namely
f(y) = f(x) + ∇f(x)⊤(y − x) + ½ (y − x)⊤∇²f(z)(y − x),   (8.9)

where z is a vector in the segment joining x and y. Now, if ∇²f is positive semidefinite in D, then (y − x)⊤∇²f(z)(y − x) ≥ 0, so the previous expression implies that f(y) ≥ f(x) + ∇f(x)⊤(y − x), for all y ∈ D and this, by part (a), implies that f is convex in D.
To prove the last part of the statement, note that the arguments can be extended
to the case indicated in that sentence (namely, when f is strictly convex instead of
convex, with the appropriate changes in the statements of claims (a) and (b)). □

Exercise 4 Prove that if z⊤ ∇2 f (x)z ≥ 0 for all z ∈ B(x, r ), for some r > 0, then
∇2 f (x) is positive semidefinite.

Theorem 26 provides two characterizations of convex functions (over convex


sets), namely (8.6) and (8.7). Now, we are in the position of proving the main
result of this section.

Theorem 27 Let f : D ⊆ Rn → R be a convex function, which is twice differentiable


with continuous second derivative, and D be a convex set. Then x0 is a non-strict (local
and global) minimum of f if and only if ∇ f (x0 ) = 0. If f is strictly convex, then there is
only one global (and local) minimum of f, x0, which satisfies ∇f(x0) = 0.

Proof: Let us first prove the statement for f being convex. If x0 is a local minimum
we already know, by Theorem 25, that ∇ f (x0 ) = 0. Now, let us assume that
∇ f (x0 ) = 0. The Taylor expansion (8.3) implies that, for y close enough to x0 ,
f(y) ≈ f(x0) + ½ (y − x0)⊤∇²f(x0)(y − x0).


By (8.7), (y − x0 )⊤ ∇2 f (x0 )(y − x0 ) ≥ 0, so f (y) ≥ f (x0 ), namely x0 is a (non-strict)


local minimum. We are going to prove that it is also a (non-strict) global minimum.
Assume, by contradiction, that there is another x1 ∈ D such that f (x0 ) ≥ f (x1 ).
Then, since D is a convex set, the segment joining x0 and x1 belongs to D . By (8.6),
f (x1 ) ≥ f (x0 ) + ∇ f (x0 )(x1 − x0 ) = f (x0 ), so it must be f (x1 ) = f (x0 ).
The proof when f is strictly convex is similar, just noting that the previous
inequalities are strict. □

8.2 Descent methods


We are going to present in this chapter several methods for approximating local
minima of multivariable functions. As in Chapter 6, they are all iterative methods,
which generate a sequence of approximations {xk } that aim to approach the local
minima. Several considerations are in order:

• The function may have more than one local minimum. In this case, we expect
the sequence to approximate just one of them. However, if the function
is convex (and twice differentiable with continuous second derivative), we
know, by Theorem 27, that the minimum is unique.

• (Stopping criterion): We should provide a criterion for the method (or algo-
rithm) to stop at some point. There are several options, and we can include
all them as stopping criterion in the particular algorithm, or just some of
them (the only necessary condition is to include at least one, since otherwise
the algorithm would not stop!). The standard ones are the following:
– Fix the number of iterations. The advantage of this criterion is that
we guarantee the termination of the algorithm. However, this does not
allow us to know at all whether the output is a good approximation to
the solution or not.
– Impose a tolerance (ε) on the (norm of the) gradient, namely: ∥∇ f (xk )∥2 ≤
ε. If x∗ is a local minimum, Theorem 25 guarantees that ∇ f (x∗ ) = 0.
Therefore, we expect that, if xk is a good approximation to x∗ , then
∇ f (xk ) is close to zero, namely its norm is small. However, a small
value of the norm of the gradient does not guarantee that xk is close
enough to x∗ .
– Impose a tolerance in the change of iterations, namely: ∥xk+1 − xk ∥2 < ε.
The idea behind this condition is that when there is no much difference
in two consecutive outputs, we cannot expect too much improvement
in further iterations.


– Impose a tolerance in the change of the function, namely | f (xk+1 ) −


f (xk )| < ε a + ε r | f (xk )|. In this case, ε a is a measure of the absolute error
of f (called absolute tolerance) and ε r is a measure for the relative error
(called relative tolerance).
The drawback of the last three stopping criteria is that they cannot guarantee
that the algorithm terminates. It could happen, for instance, that it gets into a
loop, like in Example 12.
In descent methods, each approximation (say the (k + 1)st) can be obtained from the previous one in the following form

x_{k+1} = x_k + α_k d_k,   (8.10)

where dk is a suitable vector, called the search direction, and αk is a real number. In other words, the vector xk+1 is obtained from xk by adding up a suitable vector in some particular direction dk. What is important for the method is to appropriately choose both the search direction, dk, and the stepsize αk. The choice of these two ingredients determines the method. The common feature of all descent methods is that, provided that αk > 0, the search direction dk satisfies:
• d⊤k ∇ f ( xk ) < 0 if ∇ f ( xk ) ̸ = 0. The reason for imposing this condition is the
following: the vector ∇ f (xk ), which determines the direction of maximal
increase of f , divides the space Rn in two parts: one part where vectors v ∈
Rn satisfy v⊤ ∇ f (xk ) > 0 and the other one for vectors with v⊤ ∇ f (xk ) < 0.
The first one contains the “uphill” directions, where the function f increases,
and the second one contains the “downhill” directions, where f decreases.
Since we are looking for a new iteration xk+1 such that f (xk+1 ) < f (xk )
(namely we want f to decrease), it is natural to choose vectors pointing to
the second part.

• dk⊤∇f(xk) = 0 if ∇f(xk) = 0. In this (very unlikely) case, xk+1 = xk, so the method terminates at xk, since we have reached a point where ∇f(xk) = 0, that is, a critical point (our candidate for a local minimum).
Depending on the choice of dk , we obtain the three different methods that are
considered in this course, namely:
• Newton’s method: dk = −[∇2 f (xk )]−1 ∇ f (xk ).

• Inexact Newton’s methods: dk = − Bk−1 ∇ f (xk ), where Bk is an approxima-


tion to ∇2 f (xk ).

• Steepest descent method: dk = −∇ f (xk ) (note that this is an inexact New-


ton method with Bk = I).


8.2.1 Newton-like methods


Newton’s method for multivariable functions is a generalization of Newton’s method
for single variable functions introduced in Section 5.5. As we are going to explain,
this method requires the computation of the Hessian at each step, which makes it
very expensive, and this is the reason for introducing inexact Newton’s methods,
that aim to overcome this drawback.

Newton’s method

Let us consider the second-order Taylor polynomial of the function f around the
kth iteration, xk :
q(x) = f(xk) + ∇f(xk)⊤(x − xk) + ½ (x − xk)⊤∇²f(xk)(x − xk).
Instead of looking at the minimum of f , we look at the minimum of q (and this is
the only idea behind the method). So, in order to get the next iteration, xk+1 , we
impose that it is the minimum of q, namely ∇q(xk+1 ) = 0. If we differentiate in
the expression above for q

∇q(x) = ∇ f (xk ) + ∇2 f (xk )(x − xk )

and equate to 0, we obtain

∇2 f (xk )(x − xk ) = −∇ f (xk ).

Therefore, xk+1 is the solution of the equation above, namely:

xk+1 = xk − [∇2 f (xk )]−1 ∇ f (xk ) (8.11)

Written in the form of an algorithm, Newton’s method is as follows.


Newton algorithm for multivariable functions


Input: A function f : Rn → R and a seed vector x0 ∈ Rn (and, maybe,
the number of iterations).
Output: Approximation, xm , to a local minimum of f .

For k = 0, 1, ... (until convergence).

• Compute ∇ f (xk ).

• Compute ∇2 f (xk ).

• Solve the linear system ∇2 f (xk )dk = −∇ f (xk ).

• Update xk+1 = xk + dk .

• Check convergence.

In the previous algorithm, m is the total number of iterations, that depends on


the stopping criterion.
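A basic matlab sketch of this algorithm, for a fixed maximum number of iterations and with the gradient and Hessian supplied as function handles, could be:

function x = newton_min(grad, hess, x0, maxit)
% Newton's method for minimization: x_{k+1} = x_k - [hess(x_k)]^{-1} grad(x_k)
x = x0;
for k = 1:maxit
    d = -(hess(x)\grad(x));      % solve hess(x)*d = -grad(x)
    x = x + d;                   % update
    if norm(grad(x)) <= 1e-8     % stopping criterion on the gradient
        break
    end
end
end

Note that, since this basic version only searches for points where the gradient vanishes, it may also converge to saddle points or maxima; the safeguards discussed next (step length control and shifted Hessians) address this issue.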
Two relevant drawbacks arise in the previous algorithm, that may affect convergence (or make the method not convergent at all), namely:

• The function f may be highly nonlinear, so the quadratic approximation is


not good enough.

• The Hessian is not necessarily positive definite at xk, so the method does not necessarily provide a descent iteration. Moreover, ∇²f(xk) could be non invertible (or have a large condition number).

To solve the first problem we can reduce the length of the step in the form xk+1 =
xk + αk dk , so we end up with an expression like (8.10).
As for the second drawback we can use a “shifted Hessian”

Fk := ∇2 f (xk ) + γk I, γk > 0,

by adding up a (positive) multiple of the identity matrix to the Hessian, so that


the resulting matrix Fk is positive definite and, hence, invertible. This is always possible, since the eigenvalues of Fk are those of ∇²f(xk) shifted by γk, so they are all positive for γk large enough. Then, as γk → ∞, the direction dk approaches a positive multiple of −∇f(xk), which is for sure a “downhill” direction. Moreover, we can
use the parameter γk to control the length of the step (see Section 8.2.3). Therefore,
we can initially set γk a moderate parameter and, if in the value of xk+1 that we
get the function f does not decrease, we can increase the value γk (by trial and
error, for instance) until we get a vector xk+1 such that f (xk+1 ) < f (xk ).


Despite the previous drawbacks, the good news is that, provided that f is twice
differentiable and ∇2 f (x) is Lipschitz continuous (something that, for instance, all
polynomial functions satisfy in bounded regions) and there is some local minimum x∗ where ∇²f(x∗) is positive definite, then Newton’s method

• converges for any initial point which is close enough to x∗ , and


• the convergence rate is quadratic, namely there is some M > 0 such that

∥xk+1 − x∗∥2 / ∥xk − x∗∥2² ≤ M,

for k large enough.

For a proof of these facts, see, for instance, [NW06, Th. 3.5].

Inexact Newton’s methods

The Newton method requires computing the Hessian ∇2 f (xk ) at each step, and
this may be too expensive (we need to compute the partial derivatives of f at
xk ). Then, Inexact Newton’s methods aim to overcome this problem, replacing the
Hessian by some approximation, Bk . Some of the standard methods use a low-
rank approximation (rank-1 or rank-2) of the matrix obtained at the previous step
(starting with, for instance, B0 = I) and they are obtained from a second-order
Taylor approximation to ∇ f (xk+1 ). We are not going to see these methods in this
course, but if you are interested in them you can have a look at [BC11, §3.8].

8.2.2 Steepest descent


This is a descent method obtained by taking dk = −∇f(xk) as the search direction in
(8.10). A further motivation for such a choice, besides the general one presented
at the beginning of Section 8.2, is the following.
Let us consider the Taylor series expansion of f at xk :
f (xk + αk dk ) = f (xk ) + αk ∇ f (xk )⊤ dk + o (α2k ),
namely
f (xk + αk dk ) − f (xk ) = αk ∇ f (xk )⊤ dk + o (α2k ).
Then, for αk small enough, f (xk + αk dk ) − f (xk ) ≈ αk ∇ f (xk )⊤ dk . If we set dk =
−∇ f (xk ) and xk+1 = xk + αk dk , for αk > 0, then f (xk + αk dk ) − f (xk ) ≈ −αk ∥∇ f (xk )∥22 <
0, which implies that f (xk+1 ) < f (xk ), as wanted.
Once we have defined the search direction in the descent method (8.10), it remains to determine the length of the step, αk. The line search problem is to find the value of αk that minimizes f along the chosen direction, namely


Figure 8.4: Approximations and search directions of the steepest descent with exact line search.

Line search problem: Given xk and dk in Rn , find αk ∈ R such that the


function ϕ(α) = f (xk + αdk ) is minimized.

In general, it is not easy to find the solution of the line search problem, and this
leads to the “inexact line search methods”, that are analyzed in Section 8.2.3. How-
ever, if we assume that, at each step, αk is the solution of the line search problem,
then the steepest descent method produces a “zig-zag” iteration due to the follow-
ing result.

Lemma 11 If αk is the solution of the line search problem and xk+1 = xk + αk dk then

∇ f (xk+1 )⊤ dk = 0. (8.12)

Proof: Note that the objective function ϕ(α) = f (xk + αdk ) of the line search
problem is a composition ϕ(α) = ( f ◦ g)(α), with g(α) = xk + αdk . Then, applying
the chain rule, the derivative of ϕ is:
ϕ′(α) = ∑_{i=1}^{n} ∂f/∂xi (xk + αdk) · ∂((xk)i + α(dk)i)/∂α = ∇f(xk + αdk)⊤dk.

Therefore, if ϕ(α) reaches its minimum at αk, it must be 0 = ϕ′(αk) = ∇f(xk + αkdk)⊤dk. □
Since dk+1 = −∇ f (xk+1 ), then (8.12) says that the search direction of the (k +
1)st iteration in a descent method (8.10) is orthogonal to the search direction of
the previous iteration when αk is the solution of the line search problem. This is
illustrated in Figure 8.4.
Regarding the convergence features of the steepest descent method, we quote
from [BC11]:

• The method is convergent for any starting point x0 under some “mild con-
ditions”.


• The rate of convergence is linear. More precisely, if x∗ is the local minimum


we are approximating and assuming that ∇2 f (x∗ ) is positive definite with
2-condition number κ, then

f ( x k +1 ) − f ( x ∗ ) κ−1 2
 
≈ .
f (xk ) − f (x∗ ) κ+1
This means that, as k increases, the approximation provided by the (k + 1)st
iteration does not improve very much the one in the previous step.

8.2.3 Inexact line search


To obtain the exact solution to the line search problem is a one-dimensional op-
timization problem whose computational cost may be of the same order as com-
puting the search direction (this problem can be translated into a root-finding
problem for the derivative of the function ϕ(α)). Therefore, several strategies have
been proposed to approximate the solution of this problem at a much lower cost.
Among them, here we are only going to see a couple of elementary choices.
The first strategy is very natural: Just choose some value of α, evaluate f(xk + αdk) and, if this value is larger than (or equal to) f(xk), then reduce the step, say, to
α/2. The algorithm is as follows:

Inexact line search by halving the size:


Input: A function f : Rn → R, a vector xk , and a direction search dk .
Output: Approximation to a local minimum of f (xk + αdk ).

• Set α = 1.

• Compute f (xk + αdk ).

• If f (xk + αdk ) ≥ f (xk ), set α = α/2 and go to the previous step.

However, the previous algorithm may result in very small reductions of f , mak-
ing the algorithm quite slow and expensive. Another strategy is to choose another
“loop” criterion. For instance, we choose another value 0 < β < 1 and impose

f (xk + αdk ) < f (xk ) + βα∇ f (xk )⊤ dk . (8.13)

This condition is based on the following facts:


• βα∇ f (xk )⊤ dk < 0, since ∇ f (xk )⊤ dk < 0 because we have chosen a “down-
hill” direction, and α, β > 0. Therefore, the value of f (xk + αdk ) in (8.13) is
strictly smaller than f (xk ).


• f(xk + tdk) ≈ f(xk) + t∇f(xk)⊤dk < f(xk) + βt∇f(xk)⊤dk, so the algorithm eventually terminates (see [BV09, p. 465]).

This leads to the following algorithm (see [BC11, p. 118]):

Inexact line search based on “sufficient decrease”:


Input: A function f : Rn → R, a vector xk , a direction search dk , and a
value 0 < β < 1.
Output: Approximation to a local minimum of f (xk + αdk ).

• Set α = 1.

• Compute f (xk + αdk ).

• Check whether condition (8.13) is satisfied. If not, set α = α/2 and


go to the previous step.
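Combining the steepest descent direction of Section 8.2.2 with this inexact line search, a minimal matlab sketch is:

function x = steepest_descent(f, grad, x0, maxit, beta)
% Steepest descent with backtracking line search based on condition (8.13)
x = x0;
for k = 1:maxit
    d = -grad(x);                    % search direction
    if norm(d) <= 1e-6, break, end   % stopping criterion on the gradient
    alpha = 1;
    while f(x + alpha*d) >= f(x) + beta*alpha*(grad(x)'*d) && alpha > eps
        alpha = alpha/2;             % halve the step until (8.13) holds
    end
    x = x + alpha*d;
end
end

A typical call is steepest_descent(f, g, x0, 1000, 1e-4), where β = 10⁻⁴ is a common choice for the sufficient decrease parameter.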

9 Numerical integration

The goal of this chapter is to approximate numerically a definite integral of a


function f : R → R in an interval [ a, b]:
∫_a^b f(x) dx.   (9.1)

This approximation will be obtained using a “quadrature rule” (or “quadrature


formula”), which is a finite sum
∫_a^b f(x) dx ≈ ∑_{i=0}^{n} ωi f(xi),   (9.2)

where x0 , . . . , xn are the quadrature nodes and ω0 , . . . , ωn are the quadrature weights.
A relevant property that we are going to impose to these formulae is given in the
following definition.

Definition 22 We will say that a quadrature rule (9.2) is exact for polynomials of
degree n if it gives the exact value of the integral when f ( x ) is a polynomial of degree at
most n.

The ideal situation is to approximate the integral (9.1) with as much accuracy
as desired. In particular, given a “tolerance”, tol, we want to obtain an algorithm
that allows us to choose the quadrature nodes and weights in such a way that
| ∫_a^b f(x) dx − ∑_{i=0}^{n} ωi f(xi) | ≤ tol.   (9.3)

A standard technique in this context is the “change of variable” y = (x − a)/(b − a) (which is equivalent to x = a + (b − a)y), that allows us to restrict ourselves to the interval [0, 1], by means of the identity:

∫_a^b f(x) dx = (b − a) ∫_0^1 f(a + (b − a)y) dy = (b − a) ∫_0^1 g(y) dy,   g(y) = f(a + (b − a)y).   (9.4)


9.1 Basic closed Newton-Cotes formulas


The adjective “closed” refers to the fact that the endpoints of the interval [ a, b] are
part of the quadrature rule.
To obtain the closed Newton-Cotes formulas we follow these 4 steps:
Step 1: Divide [ a, b] in n subintervals (n + 1 nodes) of the same length, (b − a)/n.
Step 2: Define the corresponding quadrature nodes:

x0 = a,
b−a
x1 = a+ ,
n
..
.
b−a
xk = a+k· ,
n
..
.
b−a
xn = a + n · = b.
n

Step 3: Construct the interpolating polynomial through the nodes ( x0 , f ( x0 )), . . . , ( xn , f ( xn )).
In the Lagrange formula, this polynomial is
Pn(x) = ∑_{k=0}^{n} f(xk) ℓk(x),   ℓk(x) = ∏_{j≠k} (x − xj)/(xk − xj).

Step 4: Approximate the integral using the polynomial in Step 3:


∫_a^b f(x) dx ≈ ∫_a^b Pn(x) dx = ∑_{k=0}^{n} f(xk) ∫_a^b ℓk(x) dx.

This way, we obtain the formulas

Closed quadrature Newton-Cotes formulas:


∫_a^b f(x) dx ≈ ∑_{k=0}^{n} f(xk) αk,   αk = ∫_a^b ℓk(x) dx.

The most elementary cases of these formulas are the ones obtained for n = 1
(trapezoid’s rule) and n = 2 (Simpson’s rule), which are illustrated in Figure 9.1.


(a) Trapezoid’s rule: ∫_a^b f(x) dx ≈ (b − a) · ( f(a) + f(b) )/2.

(b) Simpson’s rule: ∫_a^b f(x) dx ≈ ((b − a)/6) · ( f(a) + 4·f((a + b)/2) + f(b) ).

Figure 9.1: Quadrature trapezoid and Simpson rules [BF11, p. 194–195].


9.1.1 Computation of the weights of closed Newton-Cotes quadrature


rules
Let us see how to compute the weights, αk , of the Newton-Cotes formulas that
we have introduced in the previous section. The following result states that these
formulas are exact for polynomials of degree n.
Theorem 28 Let xk = a + k·(b − a)/n, for k = 0, . . . , n, and let ℓk(x) be the Lagrange polynomials associated to these nodes. Then αk = ∫_a^b ℓk(x) dx, for k = 0, . . . , n, are the unique numbers that satisfy

∫_a^b p(x) dx = ∑_{k=0}^{n} p(xk) αk,   (9.5)

where p( x ) is any polynomial with degree at most n.

Proof: If p(x) is a polynomial with degree at most n, then p(x) = ∑_{k=0}^{n} p(xk) ℓk(x), since the interpolating polynomial with degree at most n of a polynomial with degree at most n is the polynomial itself. Therefore, identity (9.5) holds.
Let us now see that, if the identity (9.5) holds for any polynomial with degree at most n, then αk = ∫_a^b ℓk(x) dx. Then, let us assume that (9.5) is exact for any polynomial with degree at most n. In particular, it must hold for p(x) = ℓj(x), since ℓj(x) has degree at most n (for all 0 ≤ j ≤ n). Therefore

∫_a^b ℓj(x) dx = ∑_{k=0}^{n} ℓj(xk) αk = αj,

by property 2 right after Definition 11. □


As a consequence of Theorem 28, the closed Newton-Cotes formulas are exact
for the monomial basis 1, x, . . . , x n . Therefore
n Z 1
1

j
xk αk = x j dx = , j = 0, 1, . . . , n.
k =0 0 j+1

Giving the values j = 0, 1, . . . , n in the previous identity we obtain a linear system,


whose unknowns are the coefficients of the simple closed Newton-Cotes formulas,
αk . More precisely:

α0 + α1 + · · · + αn = 1   (j = 0),
x0 α0 + x1 α1 + · · · + xn αn = 1/2   (j = 1),
x0² α0 + x1² α1 + · · · + xn² αn = 1/3   (j = 2),          (9.6)
⋮
x0ⁿ α0 + x1ⁿ α1 + · · · + xnⁿ αn = 1/(n + 1)   (j = n),


where, since we are considering [0, 1] as the integration interval, x0 = 0, x1 = 1/n, . . . , xn−1 = (n − 1)/n, xn = 1. This allows us to determine the values of αk as the solutions of the SLE (9.6).

Example 17 Let us obtain the values of the coefficients of the simple closed Newton-Cotes
formulas for the first three values of n:

• n = 1 (trapezoid’s rule): The system (9.6) in this case reads:

α0 + α1 = 1,   α1 = 1/2   ⇒   α0 = α1 = 1/2.

Then, using the change of variables (9.4), we obtain the formula:

∫_a^b f(x) dx ≈ ((b − a)/2) · ( f(a) + f(b) )   (Trapezoid’s rule)

• n = 2 (Simpson’s rule): the system now is

α0 + α1 + α2 = 1,   (1/2)α1 + α2 = 1/2,   (1/4)α1 + α2 = 1/3   ⇒   α0 = 1/6,  α1 = 4/6,  α2 = 1/6,

which, using again (9.4), gives rise to the formula:

∫_a^b f(x) dx ≈ ((b − a)/6) · ( f(a) + 4·f((a + b)/2) + f(b) )   (Simpson’s rule)

• n = 3 (Simpson’s 3/8 rule): When solving the corresponding system (9.6) and using (9.4) we obtain

∫_a^b f(x) dx ≈ ((b − a)/8) · ( f(a) + 3·f(a + (b − a)/3) + 3·f(a + 2·(b − a)/3) + f(b) )   (Simpson’s 3/8 rule)
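The weights obtained in Example 17 can be recovered numerically by solving the system (9.6) in matlab (sketch):

n = 3;                       % e.g. Simpson's 3/8 rule
x = (0:n)'/n;                % equally spaced nodes in [0, 1]
V = zeros(n+1); rhs = zeros(n+1,1);
for j = 0:n
    V(j+1,:) = (x.^j)';      % j-th equation of (9.6)
    rhs(j+1) = 1/(j+1);
end
alpha = V\rhs                % for n = 3 this gives [1/8; 3/8; 3/8; 1/8]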

9.1.2 Error in the basic Newton-Cotes formulas


Let us estimate the error when we approximate the integral (9.1) using the simple
closed Newton-Cotes formulas. For this, we use the interpolation error formula in
n + 1 nodes:
f(x) = Pn(x) + ∏_{i=0}^{n} (x − xi) · f^{(n+1)}(ξ(x)) / (n + 1)!,


where ξ ( x ) is a function of x with values between x0 = a and xn = b. This formula


is an immediate consequence of Theorem 10.
Integrating the previous expression:
∫_a^b f(x) dx = ∫_a^b ( Pn(x) + ∏_{i=0}^{n} (x − xi) · f^{(n+1)}(ξ(x)) / (n + 1)! ) dx
             = ∫_a^b Pn(x) dx + ∫_a^b ∏_{i=0}^{n} (x − xi) · f^{(n+1)}(ξ(x)) / (n + 1)! dx
             = ∑_{k=0}^{n} f(xk) αk + ∫_a^b ∏_{i=0}^{n} (x − xi) · f^{(n+1)}(ξ(x)) / (n + 1)! dx.

Then, the error when approximating the integral is


En(f) := | ∫_a^b f(x) dx − ∑_{k=0}^{n} f(xk) αk | = | ∫_a^b ∏_{i=0}^{n} (x − xi) · f^{(n+1)}(ξ(x)) / (n + 1)! dx |
       ≤ ∫_a^b ∏_{i=0}^{n} |x − xi| · |f^{(n+1)}(ξ(x))| / (n + 1)! dx
       ≤ ( (b − a)^{n+1} max_{a≤x≤b} |f^{(n+1)}(x)| / (n + 1)! ) ∫_a^b 1 dx.

Therefore, we obtain the following error bound:

Error bound for the simple closed Newton-Cotes formula:

En(f) ≤ ( (b − a)^{n+2} / (n + 1)! ) · max_{a≤x≤b} |f^{(n+1)}(x)|.   (9.7)

Note that, if f (n+1) ( x ) ≡ 0 in [ a, b], then En ( f ) = 0. This means that, if f is a


function whose (n + 1)st derivative vanishes on [ a, b] (for instance, a polynomial
of degree at most n), then the quadrature formula is exact (the error is 0).

Remark 20 It can be proved [IK66, p. 313] that, if n is even, then

En(f) ≤ cn · (b − a)^{n+3} · max_{a≤x≤b} |f^{(n+2)}(x)|,

where cn = (1/(n + 2)!) · | ∫_0^n t²(t − 1) · · · (t − n) dt | is a constant that depends on n but not on f.
In particular, the previous expression indicates that the closed Newton-Cotes formulas for
n even are exact for polynomials of degree n + 1 (since f (n+2) ( x ) ≡ 0), and not just for n.
However, the formulas for n odd are exact only for degree n.

Remark 21 The error bound (9.7) is a product of two factors. The first of them, (b − a)^{n+2}/(n + 1)!, depends on the integration interval, and not on the function. This factor is small if the length of the interval is small (b − a << 1), and tends to 0 as n tends to infinity if


the length of the interval, b − a, is less than 1. The second factor, maxa≤ x≤b | f (n+1) ( x )|,
depends on the function f and its oscillations in the interval [ a, b] (which are determined
by the derivatives of the function). In summary, if these oscillations are bounded and the
length of the interval is less than 1, the simple closed Newton-Cotes formulas tend to the
exact value of the integral as n tends to infinity.

Remark 21 tells us that, for small intervals and functions which are “smooth
enough”, the simple Newton-Cotes formulas converge to the value of the integral
by increasing the number of nodes. Nonetheless, we may be interested in com-
puting an integral over a relatively large interval. This is addressed in the next
section.

9.2 Composite closed Newton-Cotes formulas


A composite quadrature formula consists in dividing the integration interval [ a, b] in
several subintervals and integrate in each of them. In particular, if we divide

[ a, b] = [ a, x1 ] ∪ [ x1 , x2 ] ∪ · · · ∪ [ x p−1 , b]

then, using the additive property of the integral, we get


\[
\int_a^b f(x)\,dx = \int_a^{x_1} f(x)\,dx + \int_{x_1}^{x_2} f(x)\,dx + \cdots + \int_{x_{p-1}}^{b} f(x)\,dx.
\]

In particular, to obtain the composite Newton-Cotes formula, we apply in each subinterval the simple closed Newton-Cotes formula over n + 1 nodes. Moreover, all subintervals will have the same length.
To obtain the composite closed Newton-Cotes formulas we follow these steps:
Step 1: We define N = np (the total number of elementary subintervals: there are p integration subintervals and, inside each of them, the n subintervals determined by the integration nodes).
Step 2: We set the “integration step”, h N = (b − a)/N, and the integration nodes:

x0 = a, x1 = a + h N , . . . , x j = a + j · h N , . . . , x N = a + N · h N = b.

Step 3: We apply the additive property of the integral


\[
\int_a^b f(x)\,dx = \int_a^{x_n} f(x)\,dx + \int_{x_n}^{x_{2n}} f(x)\,dx + \int_{x_{2n}}^{x_{3n}} f(x)\,dx + \cdots + \int_{x_{(p-1)n}}^{b} f(x)\,dx, \tag{9.8}
\]

where, in each integral on the right-hand side, we apply the simple closed Newton-Cotes formula with n + 1 nodes.


Let us explicitly write the formulas that are obtained in the simplest two cases
(n = 1, 2) of the previous procedure.

• n = 1 ⇒ N = p (composite trapezoidal rule):


\[
\int_a^b f(x)\,dx \approx \frac{x_1-a}{2}\bigl(f(a)+f(x_1)\bigr) + \frac{x_2-x_1}{2}\bigl(f(x_1)+f(x_2)\bigr) + \cdots + \frac{b-x_{N-1}}{2}\bigl(f(x_{N-1})+f(b)\bigr),
\]
and, since $x_{j+1} - x_j = h_N$, for $j = 0, 1, \ldots, N-1$, we arrive at the simplified expression:
\[
\int_a^b f(x)\,dx \approx \frac{h_N}{2}\bigl(f(a) + 2f(x_1) + \cdots + 2f(x_{N-1}) + f(b)\bigr)
\]

that can be written in the abbreviated way:

\[
\int_a^b f(x)\,dx \approx \frac{h_N}{2}\,(E + 2I)
\]

where
– E = f(a) + f(b): the sum of the evaluations of f at the endpoints of the interval.
– I = f(x_1) + · · · + f(x_{N−1}): the sum of the evaluations of f at the inner nodes.

• n = 2 ⇒ N = 2p (composite Simpson’s rule):


\[
\begin{aligned}
\int_a^b f(x)\,dx \approx{}& \frac{x_2-a}{6}\bigl(f(a)+4f(x_1)+f(x_2)\bigr) + \frac{x_4-x_2}{6}\bigl(f(x_2)+4f(x_3)+f(x_4)\bigr)\\
&+ \cdots + \frac{b-x_{N-2}}{6}\bigl(f(x_{N-2})+4f(x_{N-1})+f(b)\bigr).
\end{aligned}
\]
Again, since $x_{j+2} - x_j = 2h_N$, for $j = 0, 1, \ldots, N-2$, we arrive at
\[
\int_a^b f(x)\,dx \approx \frac{h_N}{3}\bigl(f(a)+4f(x_1)+2f(x_2)+4f(x_3)+\cdots+2f(x_{N-2})+4f(x_{N-1})+f(b)\bigr)
\]

The composite Simpson rule is usually abbreviated by


\[
\int_a^b f(x)\,dx \approx \frac{h_N}{3}\,(E + 4I + 2P)
\]


where
• E = f(a) + f(b): the sum of the evaluations of f at the endpoints of the interval.
• I = f(x_1) + · · · + f(x_{N−1}): the sum of the evaluations of f at the nodes with odd index.
• P = f(x_2) + · · · + f(x_{N−2}): the sum of the evaluations of f at the nodes with even index.
Note, in particular, that the number of subintervals, N, in the composite Simpson rule is an even number.
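The abbreviated expressions (h_N/2)(E + 2I) and (h_N/3)(E + 4I + 2P) translate directly into code. The following matlab sketch (the function names comptrap and compsimp are ours, not built-in commands, and f must accept vector arguments) implements both composite rules with N subintervals:

function Q = comptrap(f,a,b,N)
% Composite trapezoidal rule: (h/2)*(E + 2I), with N subintervals
h = (b-a)/N;  x = a + (1:N-1)*h;     % inner nodes x_1, ..., x_{N-1}
E = f(a) + f(b);  I = sum(f(x));
Q = h/2*(E + 2*I);
end

function Q = compsimp(f,a,b,N)
% Composite Simpson rule: (h/3)*(E + 4I + 2P); N must be even
h = (b-a)/N;  x = a + (1:N-1)*h;     % inner nodes x_1, ..., x_{N-1}
E = f(a) + f(b);
I = sum(f(x(1:2:end)));              % nodes with odd index
P = sum(f(x(2:2:end)));              % nodes with even index
Q = h/3*(E + 4*I + 2*P);
end

For instance, compsimp(@(x) exp(x),0,1,100) should be very close to exp(1)-1 (each function goes in its own file, comptrap.m and compsimp.m).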

9.2.1 Error in the composite closed Newton-Cotes formulas


The error made when approximating the integral using equation (9.8), with each addend on the right-hand side approximated by a quadrature rule, is bounded by the sum of the errors in each addend. Therefore, using the bounds that we have obtained in (9.7) and in Remark 20, we arrive at the following error bounds, which we denote by E(f), depending on the parity of n:
If n is odd:
\[
E(f) \leq \sum_{j=0}^{p-1} \frac{(x_{(j+1)n} - x_{jn})^{n+2}}{(n+1)!}\cdot \max_{x_{jn}\le x\le x_{(j+1)n}} |f^{(n+1)}(x)| \leq p\cdot\frac{(n\,h_N)^{n+2}}{(n+1)!}\cdot \max_{a\le x\le b}|f^{(n+1)}(x)|,
\]
where we have used that $x_{(j+1)n} - x_{jn} = n\,h_N$, for $j = 0, 1, \ldots, p-1$. Moreover, since $p = N/n$ and $h_N = (b-a)/N$, we obtain:
\[
E(f) \leq (b-a)\cdot\left(\frac{b-a}{N}\right)^{n+1}\cdot\frac{n^{n+1}}{(n+1)!}\cdot\max_{a\le x\le b}|f^{(n+1)}(x)| \qquad (n \text{ odd})
\]
Analogously, if n is even:
\[
E(f) \leq c_n\cdot(b-a)\cdot\left(\frac{b-a}{N}\right)^{n+2}\cdot n^{n+2}\cdot\max_{a\le x\le b}|f^{(n+2)}(x)| \qquad (n \text{ even})
\]

where cn is as in Remark 20.
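These bounds can also be read in the reverse direction: given a tolerance, they tell us how many subintervals are enough. For the composite trapezoidal rule (n = 1), the bound above becomes E(f) ≤ (b−a)·((b−a)/N)²·(1/2)·max|f''|, so it suffices to take N² ≥ (b−a)³·max|f''|/(2·tol). A small matlab illustration (a sketch; the choice f(x) = e^x on [0, 1], for which max|f''| = e, is ours):

a = 0; b = 1; tol = 1e-6;
M2 = exp(1);                                   % max of |f''| = e^x on [0,1]
N = ceil(sqrt((b-a)^3*M2/(2*tol)));            % N given by the bound for n = 1
x = linspace(a,b,N+1);  h = (b-a)/N;  y = exp(x);
Q = h/2*(y(1) + 2*sum(y(2:end-1)) + y(end));   % composite trapezoidal rule
abs(Q - (exp(1)-1))                            % actual error, well below tol

Since the bound is not sharp, the actual error is typically quite a bit smaller than the prescribed tolerance.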

Remark 22 If the number of interpolation nodes, n, is fixed, and f is a function with continuous derivatives in [a, b] (up to order at least n + 1), then |f^{(n+1)}(x)| is bounded, so max_{a≤x≤b} |f^{(n+1)}(x)| is a fixed number (the same happens with |f^{(n+2)}(x)| for n even).
Then, when N → ∞, the error E( f ) tends to 0. In other words: the composite closed
Newton-Cotes formulas guarantee the convergence of the approximation if f is “continu-
ous enough” (namely, if its derivatives are continuous).


Unlike Remark 21, Remark 22 tells us that, in order for the composite closed
Newton-Cotes formulas to converge, it is not necessary to impose that the deriva-
tives of f are bounded: it is enough to impose that they are continuous. The
continuity condition is, in general, much easier to check than the regularity of the
derivatives. For instance, all polynomial and rational functions have continuous
derivatives (in those values where the denominator is not zero for rational func-
tions), as well as all trigonometric and exponential functions. Similarly, it is not
necessary to increase the number of nodes to obtain a quadrature formula that
converges to the integral. It is enough to fix the number of nodes and to increase
the number of subintervals in the composite formula. This is the main advantage
of the composite formulas against the simple ones. Nonetheless, the composite
formulas present a relevant drawback, that we analyze below.

Problem with the composite Newton-Cotes formulas

The composite Newton-Cotes formulas require many evaluations of the function f in order to obtain a good approximation to the integral. The number of calculations involved in these evaluations, or the fact that it is not always possible to evaluate the function f at such points, means that, in practice, these formulas can be inefficient when a very accurate approximation is required.
Adaptive integration, which we will see in Section 9.4, allows us to overcome this problem.

9.3 Richardson extrapolation. Extrapolated Simpson’s


formula
The Richardson extrapolation method is a general technique that is applied to approx-
imation methods that depend on one “step” h, and allows us to obtain a higher
order of convergence. In Section 9.3.2 it will be applied to obtain an “improved”
Simpson formula, which coincides with the simple Newton-Cotes formula for 5
nodes.

9.3.1 Richardson extrapolation


Let us assume that A(h) is an approximation to some quantity A that depends on a “step” (a variable, that is assumed to be small), h, in the form:
\[
A(h) = A + c\,h^n + O(h^{n+1}),
\]


where c is a constant. This formula indicates that the convergence order of A(h)
is n. For instance, as we have seen in Section 9.2.1, the composite closed Newton-
Cotes formulas are of order n + 1 (if n is odd) or n + 2 (if n is even). Now, the
function
\[
R(h,t) := \frac{t^n\,A(h/t) - A(h)}{t^n - 1}, \tag{9.9}
\]
has convergence order n + 1, since
\[
R(h,t) = \frac{t^n\left(A + c\left(\frac{h}{t}\right)^n + O(h^{n+1})\right) - \bigl(A + c\,h^n + O(h^{n+1})\bigr)}{t^n - 1} = A + O(h^{n+1}).
\]
The function R(h, t) is known as the Richardson extrapolation of A(h). In many situations it is easier to reach the prescribed tolerance using R(h, t) than by reducing the step size since, as we have seen, reducing the step size involves more calculations and evaluations of the given function, which are not only more expensive but may also introduce larger roundoff errors.
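As an illustration (ours, not part of the notes), the following matlab lines apply (9.9) with t = 2 to the composite trapezoidal approximation A(h), whose error behaves like a constant times h² for smooth integrands (cf. the bound for n = 1 in Section 9.2.1, so here n = 2 in (9.9)); the extrapolated value is much more accurate than A(h/2) itself:

f = @(x) exp(x);  a = 0;  b = 1;  exact = exp(1) - 1;
trap = @(h) h/2*(f(a) + 2*sum(f(a+h:h:b-h)) + f(b));   % composite trapezoidal rule with step h
h = (b-a)/8;
A1 = trap(h);  A2 = trap(h/2);
R = (2^2*A2 - A1)/(2^2 - 1);         % Richardson extrapolation (9.9) with t = 2, n = 2
[abs(A2 - exact), abs(R - exact)]    % the second error is much smaller than the first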

9.3.2 Extrapolated Simpson rule


If we apply the Richardson extrapolation to the simple Simpson rule, S(h), with t = 2, in the interval [a, b], then h = b − a and
\[
\begin{aligned}
A(h) &= S(h) = \frac{h}{6}\left(f(a) + 4f\!\left(\frac{a+b}{2}\right) + f(b)\right) &&\text{(basic Simpson rule)}\\[2pt]
A\!\left(\frac{h}{2}\right) &= S_2(h) = \frac{h}{12}\left(f(a) + 4f\!\left(\frac{3a+b}{4}\right) + 2f\!\left(\frac{a+b}{2}\right) + 4f\!\left(\frac{a+3b}{4}\right) + f(b)\right) &&\text{(composite Simpson rule)}
\end{aligned}
\]

so R(h, 2) is equal to
\[
\begin{aligned}
Q(h) &= \frac{16\,S_2(h) - S(h)}{15}\\
&= h\left(\frac{7}{90}\,f(a) + \frac{16}{45}\,f\!\left(\frac{3a+b}{4}\right) + \frac{2}{15}\,f\!\left(\frac{a+b}{2}\right) + \frac{16}{45}\,f\!\left(\frac{a+3b}{4}\right) + \frac{7}{90}\,f(b)\right).
\end{aligned}
\]

Using that h = b − a, we can write the previous rule in the form:

Extrapolated Simpson rule:


       
\[
Q(h) = h\left(\frac{7}{90}\,f(a) + \frac{16}{45}\,f\!\left(a+\frac{h}{4}\right) + \frac{2}{15}\,f\!\left(a+\frac{h}{2}\right) + \frac{16}{45}\,f\!\left(a+\frac{3h}{4}\right) + \frac{7}{90}\,f(b)\right)
\]

It can be checked, solving the system (9.6) for n = 4, that the previous extrap-
olated Simpson rule is the simple closed Newton-Cotes formula for 5 nodes, so it
is exact for polynomials with degree at most 5.
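A quick matlab check of this claim (a sketch with our own variable names): for f(x) = x^5 on [0, 1], whose integral is 1/6, the extrapolated rule Q is exact up to roundoff, while S and S₂ are not:

f = @(x) x.^5;  a = 0;  b = 1;  h = b - a;  exact = 1/6;
S  = h/6*(f(a) + 4*f((a+b)/2) + f(b));                                     % basic Simpson rule
S2 = h/12*(f(a) + 4*f((3*a+b)/4) + 2*f((a+b)/2) + 4*f((a+3*b)/4) + f(b));  % composite Simpson rule
Q  = (16*S2 - S)/15;                                                       % extrapolated Simpson rule
[S - exact, S2 - exact, Q - exact]       % only the last entry is zero (up to roundoff)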


9.4 Adaptive integration


The goal of adaptive integration is the following: given an integrable function,
f , the endpoints of an interval, a, b, and a tolerance tol (fixed by the user), esti-
mate the integral (9.1) with an error at most tol aiming to reduce the number of
evaluations of the function f as much as possible.
Though there are several methods of adaptive integration, we focus on the one
that is obtained applying the extrapolated Simpson rule that we have just seen in
Section 9.3.2. This method consists of the following steps:

Step 1: Compute S and S2 as in Section 9.3.2 (namely: the simple Simpson rule
and the composite Simpson rule with 2 subintervals).
Step 2: Evaluate E = |S − S2 |, which gives an estimation of the error.
If E ≤ tol, then
\[
\int_a^b f(x)\,dx \approx \frac{16\,S_2 - S}{15} = Q,
\]
and we are finished.
If E > tol, proceed with next step.
Step 3: If E > tol, repeat steps 1 and 2 in

Step 3.1: the interval $[a, \frac{a+b}{2}]$, and then in

Step 3.2: the interval $[\frac{a+b}{2}, b]$.

Step 4: Add up the values Q obtained in all steps at each iteration.
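These steps fit naturally in a recursive function. The following matlab sketch is ours (it is a simplified version of the idea behind quadtx, not the actual code, and it re-evaluates f at points that could be reused):

function Q = adaptsimp(f,a,b,tol)
% Adaptive quadrature based on the extrapolated Simpson rule (a sketch)
c  = (a+b)/2;
S  = (b-a)/6 *(f(a) + 4*f(c) + f(b));                                % simple Simpson rule
S2 = (b-a)/12*(f(a) + 4*f((a+c)/2) + 2*f(c) + 4*f((c+b)/2) + f(b));  % Simpson on two subintervals
if abs(S - S2) <= tol
    Q = (16*S2 - S)/15;                                % accept the extrapolated value
else
    Q = adaptsimp(f,a,c,tol) + adaptsimp(f,c,b,tol);   % Step 3: recurse on both halves
end
end

For instance, adaptsimp(@(x) exp(x),0,1,1e-8) should be very close to exp(1)-1. Note that, following Step 3 literally, the same tolerance is used in both halves; practical codes often split it between them.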

matlab commands for adaptive integration

quad(f,a,b,tol): The adaptive quadrature code implemented in matlab. It computes an approximation to (9.1) using the adaptive Simpson rule.
quadtx(f,a,b,tol): Another, more elementary, version of quad, included in the folder ncm.
[Q,fcount]=quadtx(f,a,b,tol): Besides approximating the integral (9.1) using quadtx (in the output variable Q), it counts the number of times that the function f is evaluated throughout the process (output variable fcount).
Both in quad and quadtx, the default tolerance is $10^{-6}$.

10 Numerical differentiation

In this short chapter we aim to approximate f'(x_0), namely the derivative of some function f : [a, b] → R at a certain value x_0 ∈ [a, b]. We assume that the function f can only be evaluated at certain values, which are close to x_0.

10.1 Forward, backward, and centered formulas


The starting idea to get the formulas is to replace the function f by an interpolating polynomial over a certain set of nodes, P(x), and then approximate
\[
f'(x_0) \approx P'(x_0).
\]

We are going to consider equispaced nodes (the stepsize will be denoted, as usual,
by h). There are three standard ways to consider these nodes (the choice can
depend on the information we have at hand, namely the nodes where we are
allowed to evaluate f ), which are indicated in Table 10.1

Type Nodes
Forward differentiation x0 , x0 + h, . . . , x0 + nh
Backward differentiation x0 , x0 − h, . . . , x0 − nh
Centered differentiation (n even) x0 , x0 − h, x0 + h, . . . , x0 − (n/2)h, x0 + (n/2)h

Table 10.1: Numerical differentiation types and corresponding nodes

In the following subsections, we are going to show the specific formulas that we
get for small values of n (namely, n = 1, 2) when P( x ) is the Lagrange interpolating
polynomial introduced in Definition 11.

10.1.1 Formulas for n = 1


Since $P(x) = f(x_0)\cdot\frac{x-x_1}{x_0-x_1} + f(x_1)\cdot\frac{x-x_0}{x_1-x_0}$, we have $P'(x) = \frac{f(x_0)}{x_0-x_1} + \frac{f(x_1)}{x_1-x_0} = \frac{f(x_1)-f(x_0)}{x_1-x_0}$, and we get the formulas in Table 10.2.
Note that these formulas are a first-order approximation to the derivative through its definition, namely
\[
f'(x_0) = \lim_{h\to 0}\frac{f(x_0+h) - f(x_0)}{h}.
\]


Type                        Formula
Forward differentiation     f'(x_0) ≈ (f(x_0 + h) − f(x_0)) / h
Backward differentiation    f'(x_0) ≈ (f(x_0) − f(x_0 − h)) / h

Table 10.2: Numerical differentiation formulas for n = 1

10.1.2 Formulas for n = 2


Now
\[
P(x) = f(x_0)\cdot\frac{(x-x_1)(x-x_2)}{(x_0-x_1)(x_0-x_2)} + f(x_1)\cdot\frac{(x-x_0)(x-x_2)}{(x_1-x_0)(x_1-x_2)} + f(x_2)\cdot\frac{(x-x_0)(x-x_1)}{(x_2-x_0)(x_2-x_1)},
\]
so that
\[
P'(x) = f(x_0)\cdot\frac{2x-(x_1+x_2)}{(x_0-x_1)(x_0-x_2)} + f(x_1)\cdot\frac{2x-(x_0+x_2)}{(x_1-x_0)(x_1-x_2)} + f(x_2)\cdot\frac{2x-(x_0+x_1)}{(x_2-x_0)(x_2-x_1)},
\]
and we get the formulas in Table 10.3.

Type                        Formula
Forward differentiation     f'(x_0) ≈ (−3f(x_0) + 4f(x_0 + h) − f(x_0 + 2h)) / (2h)
Backward differentiation    f'(x_0) ≈ (3f(x_0) − 4f(x_0 − h) + f(x_0 − 2h)) / (2h)
Centered differentiation    f'(x_0) ≈ (f(x_0 + h) − f(x_0 − h)) / (2h)

Table 10.3: Numerical differentiation formulas for n = 2
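A quick matlab experiment (our own choice of function and point) with f(x) = e^x at x_0 = 1, where f'(x_0) = e, shows the different accuracy of the forward formula of Table 10.2 and the centered formula of Table 10.3 as h decreases:

f = @(x) exp(x);  x0 = 1;  exact = exp(1);
h = 10.^(-(1:5));                            % h = 0.1, 0.01, ..., 1e-5
fwd = (f(x0+h) - f(x0))./h;                  % forward formula (Table 10.2)
cen = (f(x0+h) - f(x0-h))./(2*h);            % centered formula (Table 10.3)
[abs(fwd - exact); abs(cen - exact)]         % first row decays like h, second like h^2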

10.2 Error of the previous formulas


The idea to estimate the error in the formulas for numerical differentiation in
the previous section is to estimate the difference E′ ( x ) = f ′ ( x ) − P′ ( x ), where
E( x ) = f ( x ) − P( x ) is the interpolation error, for which we already know an
estimation (see Section 5.6). The following result gives an explicit expression for
this error.

Theorem 29 Let f be n + 2 times differentiable (with f (n+2) being continuous) in [ a, b],


and { x0 , x1 , . . . , xn } ⊆ [ a, b], with xi ̸= x j for i ̸= j. Let also P( x ) be the interpo-
lating polynomial of f through the nodes { x0 , x1 , . . . , xn }. Then, for every x ∈ [ a, b] \
{ x0 , x1 , . . . , xn }, there are some values ξ x , ηx ∈ ( a, b) such that

\[
E'(x) = f'(x) - P'(x) = \frac{f^{(n+1)}(\eta_x)}{(n+1)!}\cdot\Pi'(x) + \frac{f^{(n+2)}(\xi_x)}{(n+2)!}\cdot\Pi(x), \tag{10.1}
\]

where Π( x ) = ( x − x0 )( x − x1 ) · · · ( x − xn ).


Proof: Let x ∈ [a, b] with x ≠ x_i, for i = 0, 1, . . . , n. By the interpolation error formula (see Section 5.6),
\[
E(x) = f(x) - P(x) = \frac{f^{(n+1)}(\eta_x)}{(n+1)!}\,\Pi(x),
\]
and, since the interpolation error can also be written as $E(x) = \Pi(x)\, f[x_0, x_1, \ldots, x_n, x]$, differentiating we get
\[
E'(x) = f'(x) - P'(x) = \frac{f^{(n+1)}(\eta_x)}{(n+1)!}\cdot\Pi'(x) + \Pi(x)\,\frac{d}{dx}\, f[x_0, x_1, \ldots, x_n, x].
\]

It remains to show that
\[
\frac{d}{dx}\, f[x_0, x_1, \ldots, x_n, x] = \frac{f^{(n+2)}(\xi_x)}{(n+2)!},
\]
for some $\xi_x \in (a, b)$. For this, set $D(x) := f[x_0, x_1, \ldots, x_n, x]$. Then

\[
\begin{aligned}
\lim_{h\to 0}\frac{D(x+h)-D(x)}{h} &= \lim_{h\to 0}\frac{f[x_0,x_1,\ldots,x_n,x+h] - f[x_0,x_1,\ldots,x_n,x]}{h}\\
&= \lim_{h\to 0}\frac{f[x+h,x_0,x_1,\ldots,x_n] - f[x_0,x_1,\ldots,x_n,x]}{(x+h)-x}\\
&= \lim_{h\to 0} f[x+h,x_0,x_1,\ldots,x_n,x]\\
&= \lim_{h\to 0} f[x_0,x_1,\ldots,x_n,x,x+h].
\end{aligned}
\]

Now, using Remark 12,
\[
\lim_{h\to 0}\frac{D(x+h)-D(x)}{h} = \lim_{h\to 0}\frac{f^{(n+2)}(\xi_{x,h})}{(n+2)!} = \frac{f^{(n+2)}(\xi_x)}{(n+2)!},
\]
where, to get the last identity, we set $\lim_{h\to 0}\xi_{x,h} =: \xi_x \in (a,b)$ and use that $f^{(n+2)}$ is continuous in $(a,b)$. □
The formula (10.1) tells us that the error in the differentiation formulas depends on the (n + 1)st and (n + 2)nd derivatives of f, which is somewhat expected, since the interpolation error depends on the (n + 1)st derivative.
If we make x tend to one of the nodes, x_i, since f^{(n+1)} and f^{(n+2)} are continuous in (a, b), and using that
\[
\Pi'(x_i) = \prod_{j\ne i}(x_i - x_j),
\]

we conclude that the error of the differentiation formulas at the node x_i is given by
\[
E'(x_i) = f'(x_i) - P'(x_i) = \frac{f^{(n+1)}(\eta_{x_i})}{(n+1)!}\cdot\prod_{j\ne i}(x_i - x_j). \tag{10.2}
\]


If $M = \max_{x\in(a,b)} |f^{(n+1)}(x)|$, then the error in (10.2) can be bounded as
\[
|E'(x_i)| \leq \frac{M}{(n+1)!}\cdot\prod_{j\ne i}|x_i - x_j|.
\]
If the nodes are equispaced, namely $x_i = x_0 + ih$, for $i = 0, 1, \ldots, n$, for some fixed h, then
\[
E'(x_i) = f'(x_i) - P'(x_i) = \frac{h^n}{(n+1)!}\cdot\prod_{j\ne i}(i - j)\cdot f^{(n+1)}(\eta_{x_i}), \qquad \eta_{x_i}\in(a,b),
\]
which can be bounded as before by
\[
|E'(x_i)| \leq \frac{M\,h^n}{(n+1)!}\cdot\prod_{j\ne i}|i - j|.
\]
Recall that the previous formula gives only the error of the differentiation formulas at the nodes. However, it can be a good estimation of the error in the whole interval (a, b). In any case, the differentiation formulas are frequently used to obtain an approximation of the derivative at the nodes (or even at just one of them, namely x_0).
This formula also shows that, provided that the derivatives of f do not increase very much, the error goes to 0 as h → 0 (as expected), and the error is of order h^n. Moreover, the bound is a tool to determine the value of h which is needed to get a prescribed accuracy. The hard part is, again, to get an estimation of M.

The matlab command diff

To approximate the derivative of f at x_0 with matlab we can use the command diff. The basic features of this command are the following: diff is able to compute symbolically the derivative of a function f. For this we first need to declare the symbolic variable with the command syms. For instance:
syms f(x), f(x) = sin(x^2); Df = diff(f,x)
provides the derivative of f(x) = sin(x²).
If we want to calculate the value of the derivative at some particular value of
the symbolic variable, we need to “take it back” to the numerical setting, by using
the command double, namely:
Df2=Df(2), double(Df2)
gives
4*cos(4), -2.6146
The first output provides the symbolic answer, whereas the second one is the
numerical one.
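The command diff can also be used in a purely numerical way: applied to a vector, it returns the differences of consecutive entries, so diff(y)./diff(x) gives the forward-difference approximations of f' at the nodes. A small sketch (the sampled function is our choice):

h = 0.01;  x = 0:h:1;  y = sin(x.^2);            % samples of f(x) = sin(x^2)
dy = diff(y)./diff(x);                           % forward differences at x_0, ..., x_{N-1}
exact = 2*x(1:end-1).*cos(x(1:end-1).^2);        % f'(x) = 2x*cos(x^2) at the same nodes
max(abs(dy - exact))                             % error of order h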

11 The Fast Fourier Transform

The Fast Fourier Transform (FFT) is used to compute the Discrete Fourier Transform
(DFT) in an efficient way. Then, for completeness, we first recall the DFT in Section
11.1, even though we assume the student to be familiar with it. This section is
merely a summary, and to get more information on it you can have a look at the
basic references [BF11, ChK08, SB80].

11.1 Discrete Fourier Transform (summary)


Recall that a way to approximate complex functions f : C → C is by means of trigonometric polynomials
\[
p_{N,f}(x) := \sum_{k=0}^{N-1} \hat{f}_k\, e^{kix}, \tag{11.1}
\]

where i denotes the imaginary unit (namely, $i = \sqrt{-1}$). The polynomial (11.1) is called the Nth Fourier polynomial of f (note that it is a polynomial in the basis $\{e^{kix} : k = 0, \ldots, N-1\}$). The polynomial (11.1) aims to be an interpolating polynomial of the function f in the interval [0, 2π]. In order for this to be the case, we consider N equispaced nodes in this interval, namely
\[
x_j = \frac{2\pi j}{N}, \quad \text{for } j = 0,\ldots,N-1, \qquad \text{i.e.:}
\]
\[
x_0 = 0,\quad x_1 = \frac{2\pi}{N},\quad x_2 = \frac{4\pi}{N},\quad \ldots,\quad x_{N-1} = \frac{2(N-1)\pi}{N},
\]
and we impose the interpolating conditions
\[
p_{N,f}(x_j) = f(x_j), \qquad \text{for } j = 0,\ldots,N-1.
\]

These conditions uniquely determine the coefficients $\hat f_k$ of (11.1). More precisely, these coefficients are (see, for instance, [SB80, Th. 2.3.1.9]):
\[
\hat{f}_k = \frac{1}{N}\sum_{j=0}^{N-1} f(x_j)\, e^{-\frac{2\pi jk i}{N}}, \qquad k = 0, 1, \ldots, N-1. \tag{11.2}
\]

The DFT consists in obtaining the coefficients (11.2) from f (namely, from the values f(x_j), for 0 ≤ j ≤ N − 1). The opposite problem, namely obtaining the values


f ( x j ), for 0 ≤ j ≤ N − 1, from the values of the coefficients (11.2), is the Inverse


Discrete Fourier Transform (IDFT).
The DFT can be written in matrix form as follows. Let us denote $\omega_N := e^{-\frac{2\pi i}{N}}$ (namely, the basic Nth root of unity). Then, from (11.2) we obtain
\[
\hat{\mathbf{f}} := \begin{bmatrix} \hat{f}_0 \\ \hat{f}_1 \\ \vdots \\ \hat{f}_{N-1} \end{bmatrix}
= \frac{1}{N}
\begin{bmatrix}
1 & 1 & \cdots & 1\\
1 & \omega_N & \cdots & (\omega_N)^{N-1}\\
\vdots & \vdots & \ddots & \vdots\\
1 & (\omega_N)^{N-1} & \cdots & (\omega_N)^{(N-1)^2}
\end{bmatrix}
\begin{bmatrix} f(x_0) \\ f(x_1) \\ \vdots \\ f(x_{N-1}) \end{bmatrix}
= \frac{1}{N}\, F_N\, \mathbf{f}, \tag{11.3}
\]
where $F_N$, called the Fourier matrix of size N, is the N × N matrix whose (i, j) entry is $(F_N)_{ij} = (\omega_N)^{(i-1)(j-1)}$, and $(\mathbf{f})_i = f(x_{i-1})$, for 1 ≤ i ≤ N.
The IDFT can be performed using the inverse of the matrix FN . In order to get
this inverse we note the following:

• FN is a symmetric matrix,

• therefore, the product of the ith row of $F_N$ by the ith column of $\overline{F_N}$ (where $\overline{F_N}$ denotes the conjugate of $F_N$, namely the matrix obtained from $F_N$ by conjugating all its entries) is equal to
\[
\begin{bmatrix} 1 & (\omega_N)^i & \cdots & (\omega_N)^{(N-1)i}\end{bmatrix}
\begin{bmatrix} 1 \\ (\overline{\omega_N})^{\,i} \\ \vdots \\ (\overline{\omega_N})^{\,(N-1)i}\end{bmatrix}
= 1 + (\omega_N)^i(\overline{\omega_N})^{\,i} + \cdots + (\omega_N)^{(N-1)i}(\overline{\omega_N})^{\,(N-1)i}
= 1 + 1 + \cdots + 1 = N,
\]
where we have used that $\omega_N = e^{-\frac{2\pi i}{N}}$, so that $(\omega_N)^{(N-1)j}(\overline{\omega_N})^{\,(N-1)j} = 1$, for all $j \in \mathbb{N}$,

• whereas the product of the ith row of $F_N$ by the jth column of $\overline{F_N}$, when i ≠ j, is equal to
\[
1 + e^{\frac{2\pi i}{N}i}\,e^{-\frac{2\pi i}{N}j} + \cdots + e^{\frac{2\pi i}{N}(N-1)i}\,e^{-\frac{2\pi i}{N}(N-1)j}
= 1 + e^{\frac{2\pi i}{N}(i-j)} + \cdots + e^{\frac{2\pi i}{N}(N-1)(i-j)}
= 1 + \sigma + \sigma^2 + \cdots + \sigma^{N-1} = 0,
\]
since the complex number $\sigma := e^{\frac{2\pi i}{N}(i-j)}$, when i ≠ j, is an Nth root of 1 (different from 1), so it satisfies $1 + \sigma + \cdots + \sigma^{N-1} = 0$, because of the identity $z^N - 1 = (z-1)(1 + z + \cdots + z^{N-1})$.


As a consequence:
\[
F_N\,\overline{F_N} = \begin{bmatrix}
1 & 1 & \cdots & 1\\
1 & \omega_N & \cdots & (\omega_N)^{N-1}\\
\vdots & \vdots & \ddots & \vdots\\
1 & (\omega_N)^{N-1} & \cdots & (\omega_N)^{(N-1)^2}
\end{bmatrix}
\begin{bmatrix}
1 & 1 & \cdots & 1\\
1 & \overline{\omega_N} & \cdots & (\overline{\omega_N})^{\,N-1}\\
\vdots & \vdots & \ddots & \vdots\\
1 & (\overline{\omega_N})^{\,N-1} & \cdots & (\overline{\omega_N})^{\,(N-1)^2}
\end{bmatrix}
= \begin{bmatrix}
N & 0 & \cdots & 0\\
0 & N & \ddots & \vdots\\
\vdots & \ddots & \ddots & 0\\
0 & \cdots & 0 & N
\end{bmatrix} = N\,I_N.
\]

Therefore, $(F_N)^{-1} = \frac{1}{N}\,\overline{F_N}$, so the IDFT is performed as
\[
\mathbf{f} = \overline{F_N}\,\hat{\mathbf{f}}.
\]

11.2 Fast Fourier Transform


The DFT (11.3) is a matrix-times-vector multiplication that involves N² multiplications and N(N − 1) additions. For large values of N (which is the usual situation in applied problems), this number of operations can be prohibitive. The FFT allows us to perform this matrix-times-vector product with a much smaller number of operations. How is this possible?
The key idea is to relate the matrix FN with the half-size Fourier matrix FN/2 .
Here we are just going to see how to do this when N is an even number (and, in
order to iterate this relationship, we will assume that N is a power of 2, namely
N = 2ℓ , for some ℓ ∈ N). This being the case, the following result holds.

Theorem 30 (Cooley-Tukey [CT65]). Let N be an even number. Then
\[
F_N = \begin{bmatrix} I_{N/2} & D_{N/2} \\ I_{N/2} & -D_{N/2} \end{bmatrix}
\begin{bmatrix} F_{N/2} & 0 \\ 0 & F_{N/2} \end{bmatrix} P_N, \tag{11.4}
\]
where $D_{N/2}$ is the diagonal matrix whose diagonal entries are $1, \omega_N, (\omega_N)^2, \ldots, (\omega_N)^{\frac{N}{2}-1}$


(in this order), and $P_N$ is the permutation matrix of size N × N:
\[
P_N = \begin{bmatrix} e_1^\top \\ e_3^\top \\ \vdots \\ e_{N-1}^\top \\ e_2^\top \\ e_4^\top \\ \vdots \\ e_N^\top \end{bmatrix}.
\]

Proof: Let us introduce the following notation:
\[
\hat{f}_k^{\,e} := \sum_{j=0}^{\frac{N}{2}-1} f(x_{2j})\,(\omega_{N/2})^{kj}, \qquad
\hat{f}_k^{\,o} := \sum_{j=0}^{\frac{N}{2}-1} f(x_{2j+1})\,(\omega_{N/2})^{kj},
\]
namely $\hat f_k^{\,e}$ and $\hat f_k^{\,o}$ are the sums of, respectively, the even numbered terms (namely, the ones with indices 0, 2, 4, . . .) and the odd numbered terms (those with index 1, 3, 5, . . .) in (11.2) with N/2 instead of N (and removing the initial factor 2/N).
Then, (11.4) is equivalent to
\[
\hat{f}_k = \frac{1}{N}\left(\hat{f}_k^{\,e} + (\omega_N)^k\, \hat{f}_k^{\,o}\right), \tag{11.5}
\]
so it remains to prove (11.5). This is a consequence of the following chain of identities:
\[
\begin{aligned}
\hat{f}_k &= \frac{1}{N}\sum_{j=0}^{N-1} f(x_j)\,(\omega_N)^{kj} = \frac{1}{N}\sum_{j=0}^{N-1} f(x_j)\left(e^{-\frac{2\pi}{N} i}\right)^{kj}\\
&= \frac{1}{N}\left(\sum_{j=0}^{\frac{N}{2}-1} f(x_{2j})\left(e^{-\frac{2\pi}{N} i}\right)^{2kj} + \sum_{j=0}^{\frac{N}{2}-1} f(x_{2j+1})\left(e^{-\frac{2\pi}{N} i}\right)^{k(2j+1)}\right)\\
&= \frac{1}{N}\left(\sum_{j=0}^{\frac{N}{2}-1} f(x_{2j})\left(e^{-\frac{2\pi}{N/2} i}\right)^{kj} + \left(e^{-\frac{2\pi}{N} i}\right)^{k}\sum_{j=0}^{\frac{N}{2}-1} f(x_{2j+1})\left(e^{-\frac{2\pi}{N/2} i}\right)^{kj}\right)\\
&= \frac{1}{N}\left(\sum_{j=0}^{\frac{N}{2}-1} f(x_{2j})\,(\omega_{N/2})^{kj} + (\omega_N)^k\sum_{j=0}^{\frac{N}{2}-1} f(x_{2j+1})\,(\omega_{N/2})^{kj}\right)
= \frac{1}{N}\left(\hat{f}_k^{\,e} + (\omega_N)^k \hat{f}_k^{\,o}\right),
\end{aligned}
\]
as wanted. □


As mentioned, the number of flops involved in the DFT (11.3) is O(2N²). Let us estimate the number of flops when it is computed as in (11.4). What we need to do is to multiply the product of matrices on the right-hand side of (11.4) by f. Therefore, this approach consists of three steps:

(S1) Multiply $P_N \mathbf{f}$, namely, perform a permutation of the entries of $\mathbf{f}$. This does not account for any flop, since permutations are not arithmetic operations. Let us denote
\[
P_N \mathbf{f} = \begin{bmatrix} \mathbf{f}^e \\ \mathbf{f}^o \end{bmatrix},
\]
where $\mathbf{f}^e$ and $\mathbf{f}^o$ are as follows:
\[
\mathbf{f}^e := \begin{bmatrix} f(x_0) \\ f(x_2) \\ \vdots \\ f(x_{N-2}) \end{bmatrix}, \qquad
\mathbf{f}^o := \begin{bmatrix} f(x_1) \\ f(x_3) \\ \vdots \\ f(x_{N-1}) \end{bmatrix}.
\]
(S2) Multiply the vector resulting from (S1) by $\begin{bmatrix} F_{N/2} & 0 \\ 0 & F_{N/2}\end{bmatrix}$. Note that this product reads:
\[
\begin{bmatrix} F_{N/2} & 0 \\ 0 & F_{N/2} \end{bmatrix}
\begin{bmatrix} \mathbf{f}^e \\ \mathbf{f}^o \end{bmatrix}
= \begin{bmatrix} F_{N/2}\,\mathbf{f}^e \\ F_{N/2}\,\mathbf{f}^o \end{bmatrix}.
\]
Namely, in this step we are performing two DFTs but with size N/2 instead of N, so the computational cost is O(2(N/2)²) = O(N²/2). This is precisely the relevant saving in the whole process, and the first key fact of the FFT.
(S3) Multiply the vector obtained in (S2) by $\begin{bmatrix} I_{N/2} & D_{N/2}\\ I_{N/2} & -D_{N/2}\end{bmatrix}$. If the output of (S2) is denoted by $\begin{bmatrix}\mathbf{g}_1\\ \mathbf{g}_2\end{bmatrix}$, partitioned according to the partition of $P_N\mathbf{f}$, then the last product is
\[
\begin{bmatrix} I_{N/2} & D_{N/2} \\ I_{N/2} & -D_{N/2} \end{bmatrix}
\begin{bmatrix} \mathbf{g}_1 \\ \mathbf{g}_2 \end{bmatrix}
= \begin{bmatrix} \mathbf{g}_1 + D_{N/2}\,\mathbf{g}_2 \\ \mathbf{g}_1 - D_{N/2}\,\mathbf{g}_2 \end{bmatrix}.
\]
Here comes the second key fact of the FFT: the operation $\mathbf{g}_1 + D_{N/2}\mathbf{g}_2$ amounts to N/2 products (since $D_{N/2}$ is a diagonal matrix) and N/2 additions, and the same happens for $\mathbf{g}_1 - D_{N/2}\mathbf{g}_2$, except for the product $D_{N/2}\mathbf{g}_2$, which is already known. Therefore, the overall cost of this step is (3/2)N.

Summarizing, the overall cost of one step of the FFT is O(N²/2 + (3/2)N) = O(N²/2) since, for large values of N, the term (3/2)N is negligible compared to N²/2.
However, though a reduction from O(N²) to O(N²/2) can be significant, it may not be enough, since we still have a cost of quadratic order in the size N. But


here comes the second half of the story: if we assume that $N = 2^\ell$, then we can iterate the previous steps, halving the size of the resulting Fourier matrices, until we end up, after ℓ − 1 iterations, with ($2^{\ell-1}$ copies of) $F_2$. Note that, at the end of this process, we end up with
\[
\left(G_{2^{\ell-1}}\, G_{2^{\ell-2}} \cdots G_{2}\; F_2^{\oplus 2^{\ell-1}}\; P\right)\mathbf{f},
\]
where P is a permutation matrix, which is the product of all the permutation matrices obtained at each reduction, $G_{2^k}$ is a block-diagonal matrix whose diagonal blocks are $2^{\ell-k-1}$ copies of $\begin{bmatrix} I_{2^k} & D_{2^k}\\ I_{2^k} & -D_{2^k}\end{bmatrix}$, and

\[
F_2^{\oplus 2^{\ell-1}} := \begin{bmatrix}
F_2 & 0 & \cdots & 0\\
0 & F_2 & \ddots & \vdots\\
\vdots & \ddots & \ddots & 0\\
0 & \cdots & 0 & F_2
\end{bmatrix}
\]
is also a block-diagonal matrix, having $2^{\ell-1}$ copies of $F_2$ as diagonal blocks.


Then, the computational cost of the whole process is obtained from the cost of
multiplying the vector f by all previous matrices:
• The cost of multiplying by P is 0, since it is a permutation matrix.
• The cost of multiplying by $F_2^{\oplus 2^{\ell-1}}$ is $2^{\ell-1}\cdot 6$, since the cost of multiplying by $F_2$ is 6 flops. This is equal to 3N.

• The cost of multiplying by $G_{2^k}$ is $3\cdot 2^k\cdot 2^{\ell-k-1} = 3\cdot 2^{\ell-1} = \frac{3}{2}N$. Since there are ℓ − 1 such products, the total cost of multiplying by these matrices is $\frac{3}{2}N(\ell-1)$.
Therefore, if we disregard the term 3N coming from the product by $F_2^{\oplus 2^{\ell-1}}$, which is negligible compared with the cost of multiplying by the matrices $G_{2^k}$:

The overall cost of the FFT, when $N = 2^\ell$, is $\frac{3}{2}N(\ell-1) = O(N\log_2 N)$.

Note that, when $N = 2^\ell$, then $\ell = \log_2 N$. The FFT can also be carried out for arbitrary values of N ∈ N (but this is out of the scope of this course), and the overall cost is $O(N\log_2 N)$, instead of the cost $O(N^2)$ of the standard DFT (11.3). Table 11.1 shows some values of N log2 N against N^2, for some values of $N = 2^\ell$. Note that, for ℓ = 14, namely N = 16384, the difference between the corresponding values is noticeable.
Note that a similar approach can be followed for the IDFT, applying the same arguments to $\overline{F_N}$ instead of $F_N$.
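The recursion behind (11.4) and (11.5) can be written in a few lines of matlab. The following is a sketch of ours (the name fftrec is not a built-in command) of a radix-2 FFT for N = 2^ℓ; unlike the built-in fft, it returns the coefficients (11.2), i.e., it includes the 1/N factor:

function fhat = fftrec(fv)
% Radix-2 FFT sketch: returns the coefficients (11.2) of the samples in fv
% (length(fv) must be a power of 2); includes the 1/N normalization factor
fv = fv(:);                              % work with a column vector
N  = length(fv);
if N == 1
    fhat = fv;                           % a single sample is its own (normalized) DFT
else
    Fe = fftrec(fv(1:2:end));            % normalized DFT of the even-indexed samples
    Fo = fftrec(fv(2:2:end));            % normalized DFT of the odd-indexed samples
    w  = exp(-2i*pi*(0:N/2-1).'/N);      % twiddle factors (omega_N)^k, k = 0,...,N/2-1
    fhat = [Fe + w.*Fo; Fe - w.*Fo]/2;   % butterfly, cf. (11.5); the 1/2 accounts for the normalization
end
end

For a vector fv of length 2^ℓ, max(abs(fftrec(fv) - fft(fv(:))/N)) should be of the order of the roundoff unit.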


ℓ      N        N^2          N log2 N
10     1024     1048576      10240
12     4096     16777216     49152
14     16384    268435456    229376

Table 11.1: Values of N log2 N and N^2 for some N = 2^ℓ.

Example 18 Let us compare the execution time in matlab for the DFT of the function f(x) = sin(x) cos(x) in [0, 2π] in two ways: (a) with the formula (11.3) and (b) using the FFT. In both cases we first create the vector of equispaced nodes in [0, 2π] by typing
v=linspace(0,2*pi,N+1)
(we set the number of nodes to be N + 1 since linspace includes the right endpoint, namely 2π, that we want to exclude). Now we evaluate the function at this vector by typing, for instance:
f=inline('cos(x).*sin(x)')
fv=feval(f,v(1:N)).'
(the transpose makes fv a column vector, so that it can be multiplied by an N × N matrix). Now, let us apply (11.3) to this vector. One way to get the matrix F_N with matlab is by typing fft(eye(N,N)) (see the meaning of the command fft below and then convince yourself about this). Therefore, the DFT (11.3) can be obtained with the command
f1=(1/N)*fft(eye(N,N))*fv
As for the FFT, we just type f2=fft(fv) (note that fft does not include the 1/N factor, so f2 = N*f1). Table 11.2 contains the execution times (obtained with the commands tic; toc;) for N = 2^ℓ for some values of ℓ.

ℓ      DFT with (11.3)    FFT
10     0.004487           0.001576
12     0.019061           0.000232
14     0.541414           0.005692

Table 11.2: Execution times (in seconds) of the DFT with and without using the FFT for f(x) = sin(x) cos(x) with N = 2^ℓ.

In this section we have essentially followed the approach in [S19, §IV.1], and I recommend you to have a look at this reference for more details.

matlab commands for the FFT

• fft(v): Computes the FFT of the vector v.
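• ifft(v): Computes the inverse transform, so that ifft(fft(v)) recovers v (up to roundoff).

Note that fft does not include the 1/N factor of (11.2). A brief usage sketch (our own example):

N = 8;  xj = 2*pi*(0:N-1)/N;        % nodes x_j = 2*pi*j/N
fv = sin(xj).*cos(xj);              % samples f(x_j)
fhat = fft(fv)/N;                   % coefficients (11.2)
max(abs(ifft(N*fhat) - fv))         % ifft undoes fft, up to roundoff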

Bibliography

[BC11] A. Belegundu, T. Chandrupatla. Optimization Concepts and Applications in Engineering. Cambridge University Press, Cambridge, 2011.

[BV09] S. Boyd, L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.

[BF11] R. L. Burden, J. D. Faires, Numerical Analysis, 9th ed. Brooks/Cole Cengage


Learning, Boston, 2011.

[CT65] J. W. Cooley, J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19 (1965), 297–301.

[ChK08] W. Cheney, D. Kincaid. Numerical Mathematics and Computing, 6th ed.


Thomson Brooks/Cole, Belmont, 2008.

[H98] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, Philadelphia, 1998.

[HH17] D. J. Higham, N. J. Higham. Matlab Guide, 3rd ed. SIAM, Philadelphia, 2017.

[HJ13] R.A. Horn, C. R. Johnson. Matrix Analysis, 2nd ed. Cambridge University
Press, Cambridge, 2013.

[I09] I. Ipsen. Numerical Matrix Analysis. SIAM, Philadelphia, 2009.

[IK66] E. Isaacson, H. B. Keller. Analysis of Numerical Methods. John Wiley & Sons, New York, 1966.

[KL86] E. H. Kauffman, T. D. Lenker. Linear convergence and the bisection algo-


rithm. The American Mathematical Monthly, 93 (1986), 48–51.

[M04] C. Moler. Numerical Computing with matlab. SIAM, Philadelphia, 2004.

[NW06] J. Nocedal, S. J. Wright. Numerical Optimization, 2nd Ed. Springer, New


York, 2006.

[SB80] J. Stoer, R. Bulirsch. Introduction to Numerical Analysis. Springer, New York,


1980.


[S19] G. Strang. Linear Algebra and Learning from Data. Wellesley Cambridge, 2019.

[TB97] L. N. Trefethen, D. Bau III. Numerical Linear Algebra. SIAM, Philadelphia, 1997.
