NumCSE Lecture Document
Contents
0 Introduction 8
0.1 Course Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
0.1.1 Focus of this Course . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
0.1.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
0.1.3 Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
0.2 Teaching Style and Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
0.2.1 Flipped Classroom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
0.2.1.1 Course Videos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
0.2.1.2 Following the Course . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
0.2.2 Clarifications and Frank Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
0.2.3 Requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
0.2.4 Assignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
0.2.5 Information on Examinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
0.2.5.1 For the Course 401-2673-00L Numerical Methods for CSE (BSc CSE) . . 28
0.2.5.2 For the Course 401-0663-00L Numerical Methods for CS (BSc Informatik) 29
0.3 Programming in C++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
0.3.1 Function Arguments and Overloading . . . . . . . . . . . . . . . . . . . . . . . . . 30
0.3.2 Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
0.3.3 Function Objects and Lambda Functions . . . . . . . . . . . . . . . . . . . . . . . 33
0.3.4 Multiple Return Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
0.3.5 A Vector Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
0.3.6 Complex numbers in C++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
0.4 Prerequisite Mathematical Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
0.4.1 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
0.4.2 Complex Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
0.4.3 Trigonometric Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
0.4.4 Linear Algebra and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Index 860
Symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 874
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 876
Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 884
Chapter 0
Introduction
✄ on (efficient and stable) implementation in C++ based on the numerical linear algebra template library EIGEN, a Domain Specific Language (DSL) embedded into C++.
§0.1.1.1 (Aspects outside the scope of this course) No emphasis will be put on
• theory and proofs (unless essential for derivation and understanding of algorithms).
☞ 401-3651-00L Numerical Methods for Elliptic and Parabolic Partial Differential Equations
401-3652-00L Numerical Methods for Hyperbolic Partial Differential Equations
(both courses offered in BSc Mathematics)
• hardware aware implementation (cache hierarchies, CPU pipelining, vectorization, etc.)
☞ 263-0007-00L Advanced System Lab (How To Write Fast Numerical Code, Prof. M. Püschel,
D-INFK)
• issues of high-performance computing (HPC, shared and distributed memory parallelisation, vectorization)
☞ 151-0107-20L High Performance Computing for Science and Engineering (HPCSE,
Prof. P. Koumoutsakos, D-MAVT)
263-2800-00L Design of Parallel and High-Performance Computing (Prof. T. Höfler, D-INFK)
However, note that these other courses partly rely on knowledge of elementary numerical methods, which
is covered in this course. y
§0.1.1.2 (Prerequisites) This course will take for granted basic knowledge of linear algebra, calculus, and programming that you should have acquired during your first year at ETH.
[Fig. 1: overview of the main numerical methods covered in this course — numerical quadrature (approximating integrals), least squares problems, numerical solution of ordinary differential equations, linear systems of equations, and interpolation.]
They are vastly different in terms of ideas, design, analysis, and scope of application. They are the items in a toolbox, some only loosely related by the common purpose of being building blocks for codes for numerical simulation.
y
§0.1.1.4 (Dependencies of topics) Despite the diverse nature of the individual topics covered in this
course, some depend on others for providing essential building blocks. The following directed graph tries
to capture these relationships. The arrows have to be read as “uses results or algorithms of”.
[Figure: topic dependency graph — numerical integration ẏ = f(t, y), Chapter 11; quadrature ∫ f(x) dx, Chapter 7; eigenvalues Ax = λx, Chapter 9; Krylov methods, Chapter 10; least squares ‖Ax − b‖ → min, Chapter 3; function approximation, Chapter 6; non-linear least squares ‖F(x)‖ → min, Section 8.7.]
Any one-semester course “Numerical methods for CSE” will cover only selected chapters and sec-
tions of this document. Only topics addressed in class or in homework problems will be relevant
for exams!
§0.1.1.5 (Relevance of this course) I am a student of computer science. After the exam, may I safely
forget everything I have learned in this mandatory “numerical methods” course? No, because it is highly
likely that other courses or projects will rely on the contents of this course:
[Figure: how topics of this course (singular value decomposition, least squares, function approximation, numerical quadrature, numerical integration, interpolation, eigensolvers, sparse linear systems) feed into other fields such as computational statistics, machine learning, numerical methods for PDEs, computer graphics, and graph theoretic algorithms.]
Hardly anyone will need everything covered in this course, but most of you will need something.
0.1.2 Goals
This course is meant to impart
✦ knowledge of some fundamental algorithms forming the basis of numerical simulations,
✦ familiarity with essential terms in numerical mathematics and the techniques used for the analysis
of numerical algorithms
✦ the skill to choose the appropriate numerical methods for concrete problems,
✦ the ability to interpret numerical results,
✦ proficiency in implementing numerical algorithms efficiently in C++, using numerical libraries.
0.1.3 Literature
Parts of the following textbooks may be used as supplementary reading for this course. References to
relevant sections will be provided in the course material.
✦ [AG11] U. ASCHER AND C. GREIF, A First Course in Numerical Methods, SIAM, Philadelphia, 2011.
Good reference for large parts of this course; provides a lot of simple examples and lucid explana-
tions, but also rigorous mathematical treatment.
(Target audience: undergraduate students in science and engineering)
Available for download as PDF
✦ [Han02] M. HANKE-BOURGEOIS, Grundlagen der Numerischen Mathematik und des Wissenschaftlichen Rechnens, Mathematische Leitfäden, B.G. Teubner, Stuttgart, 2002.
Gives detailed description and mathematical analysis of algorithms and relies on MATLAB. Profound
treatment of theory way beyond the scope of this course. (Target audience: undergraduates in
mathematics)
✦ [QSS00] A. QUARTERONI, R. SACCO, AND F. SALERI, Numerical mathematics, vol. 37 of Texts in Applied Mathematics, Springer, New York, 2000.
Classical introductory numerical analysis text with many examples and detailed discussion of algo-
rithms. (Target audience: undergraduates in mathematics and engineering)
Can be obtained from website.
✦ [DH03] P. DEUFLHARD AND A. HOHMANN, Numerische Mathematik. Eine algorithmisch orientierte Einführung, DeGruyter, Berlin, 1 ed., 1991.
Modern discussion of numerical methods with profound treatment of theoretical aspects (Target
audience: undergraduate students in mathematics).
✦ [GGK14] W. GANDER, M.J. GANDER, AND F. KWOK, Scientific Computing, Texts in Computational Science and Engineering, Springer, 2014.
Comprehensive treatment of elementary numerical methods with an algorithmic focus.
D-INFK maintains a webpage with links to some of these books.
Essential prerequisite for this course is a solid knowledge in linear algebra and calculus. Familiarity with
the topics covered in the first semester courses is taken for granted, see
✦ [NS02] K. NIPP AND D. STOFFER, Lineare Algebra, vdf Hochschulverlag, Zürich, 5 ed., 2002.
✦ [Gut09] M. GUTKNECHT, Lineare Algebra, lecture notes, SAM, ETH Zürich, 2009, available online.
✦ [Str09] M. STRUWE, Analysis für Informatiker. Lecture notes, ETH Zürich, 2009, available online.
A flipped-classroom course
All the course material will be published online through the course Moodle Page. All notes jotted down by
the lecturer during the creation of videos or during the Q&A sessions will be made available as PDF.
In the flipped-classroom teaching model regular lectures will be replaced with pre-recorded videos. These
videos are not commercial-grade clips, but resemble video recordings from a standard classroom setting;
they convey the development of the material on a tablet accompanied by the lecturer’s voice.
§0.2.1.2 (“Pause” and “fast forward”) Videos have two big advantages: you can pause them whenever you like, and you can fast-forward. The video portal also allows you to play the videos at 1.5× speed, which can be useful if the current topic is already very clear to you. You can also skip entire parts using the scroll bar. The same functionality (fast playing and skipping) is offered by most video players, for instance the VLC media player. y
§0.2.1.3 (Review questions) Most lecture units (each corresponding to a video) are accompanied by a list of review questions. Shortly after you have finished studying a unit, you should try to answer them off the top of your head, without consulting any written material.
In case you are utterly clueless about how to approach a review question, you probably need to refresh some of the unit’s topics.
y
§0.2.1.4 (List of available tutorial videos) This is the list of available video tutorials as of February 5,
2025:
1. Video tutorial for Chapter 0 “Introduction”: (16 minutes) Download link, tablet notes
2. Video tutorial for Section 1.1.1 “Notations and Classes of Matrices”: (7 minutes) Download link, tablet notes
3. Video tutorial for Section 1.2.1 “EIGEN”: (11 minutes) Download link, tablet notes
4. Video tutorial for Section 1.2.3 “(Dense) Matrix Storage Formats”: (10 minutes) Download link, tablet notes
5. Video tutorial for Section 1.4 “Computational Effort”: (29 minutes) Download link, tablet notes
6. Video tutorial for Section 1.5 “Machine Arithmetic and Consequences”: (16 minutes) Download link, tablet notes
7. Video tutorial for Section 1.5.4 “Cancellation”: (22 minutes) Download link, tablet notes
8. Video tutorial for Section 1.5.5 “Numerical Stability”: (17 minutes) Download link, tablet notes
10. Video tutorial for Ex. 2.1.0.3 “Nodal Analysis of Linear Electric Circuits”: (8 minutes) Download link, tablet notes
11. Video tutorial for Section 2.2.2 “Sensitivity of Linear Systems”: (15 minutes) Download link, tablet notes
12. Video tutorial for Section 2.3 & Section 2.5 “Gaussian Elimination”: (17 minutes) Download link, tablet notes
14. Video tutorial for Section 2.7.1 “Sparse Matrix Storage Formats”: (10 minutes) Download link, tablet notes
15. Video tutorial for Section 2.7.2 “Sparse Matrices in EIGEN”: (6 minutes) Download link, tablet notes → review questions 2.7.3.7
17. Video tutorial for Section 3.0.1 “Overdetermined Linear Systems of Equations: Examples”: (12 minutes) Download link, tablet notes
18. Video tutorial for Section 3.1.1 “Least Squares Solutions”: (9 minutes) Download link, tablet notes
19. Video tutorial for Section 3.1.2 “Normal Equations”: (16 minutes) Download link, tablet notes
21. Video tutorial for Section 3.2 “Normal Equation Methods”: (12 minutes) Download link, tablet notes
22. Video tutorial for Section 3.3 “Orthogonal Transformation Methods”: (10 minutes) Download link, tablet notes
28. Video tutorial for Section 3.4.2 “SVD in EIGEN”: (9 minutes) Download link, tablet notes
31. Video tutorial for Section 3.4.4.2 “Best Low-Rank Approximation”: (13 minutes) Download link, tablet notes
33. Video tutorial for Section 3.6 “Constrained Least Squares”: (23 minutes) Download link, tablet notes
35. Video tutorial for Section 4.1.2 “LT-FIR Linear Mappings”: (12 minutes) Download link, tablet notes
36. Video tutorial for Section 4.1.3 “Discrete Convolutions”: (9 minutes) Download link, tablet notes
37. Video tutorial for Section 4.1.4 “Periodic Convolutions”: (12 minutes) Download link, tablet notes
38. Video tutorial for Section 4.2.1 “Diagonalizing Circulant Matrices”: (17 minutes) Download link, tablet notes
39. Video tutorial for Section 4.2.2 “Discrete Convolution via DFT”: (7 minutes) Download link, tablet notes
40. Video tutorial for Section 4.2.3 “Frequency filtering via DFT”: (20 minutes) Download link, tablet notes
41. Video tutorial for Section 4.2.5 “Two-Dimensional DFT”: (20 minutes) Download link, tablet notes
42. Video tutorial for Section 4.3 “Fast Fourier Transform (FFT)”: (16 minutes) Download link, tablet notes
43. Video tutorial for Section 4.5 “Toeplitz Matrix Techniques”: (20 minutes) Download link, tablet notes
44. Video tutorial for Section 5.1 “Abstract Interpolation”: (16 minutes) Download link, tablet notes
45. Video tutorial for Section 5.2.1 “Uni-Variate Polynomials”: (7 minutes) Download link, tablet notes
47. Video tutorial for Section 5.2.3 “Polynomial Interpolation: Algorithms”: (18 minutes) Download link, tablet notes
48. Video tutorial for Section 5.2.3.3 “Extrapolation to Zero”: (12 minutes) Download link, tablet notes
49. Video tutorial for Section 5.2.3.4 “Newton Basis and Divided Differences”: (17 minutes) Download link, tablet notes
50. Video tutorial for Section 5.2.4 “Polynomial Interpolation: Sensitivity”: (13 minutes) Download link, tablet notes
52. Video tutorial for Section 5.4.1 “Spline Function Spaces”: (9 minutes) Download link, tablet notes
53. Video tutorial for Section 5.4.2 “Cubic Spline Interpolation”: (14 minutes) Download link, tablet notes
55. Video tutorial for Section 5.6 “Trigonometric Interpolation”: (14 minutes) Download link, tablet notes
56. Video tutorial for Section 5.7 “Least Squares Data Fitting”: (13 minutes) Download link, tablet notes
58. Video tutorial for Section 6.2 “Polynomial Approximation: Theory”: (13 minutes) Download link, tablet notes
61. Video tutorial for Section 6.2.2.3 “Error Estimates for Polynomial Interpolation: Analytic Interpolands”: (27 minutes) Download link, tablet notes
69. Video tutorial for Section 6.6.2 “Cubic Hermite and Spline Interpolation: Error Estimates”: (10 minutes) Download link, tablet notes
74. Video tutorial for Section 7.4.2 “Maximal-Order Quadrature Rules”: (16 minutes) Download link, tablet notes
76. Video tutorial for Section 7.5 “Composite Quadrature”: (18 minutes) Download link, tablet notes
77. Video tutorial for Section 7.6 “Adaptive Quadrature”: (13 minutes) Download link, tablet notes
79. Video tutorial for Section 8.2.1 “Iterative Methods: Fundamental Concepts”: (6 minutes) Download link, tablet notes
82. Video tutorial for Section 8.3 “Fixed-Point Iterations”: (12 minutes) Download link, tablet notes
84. Video tutorial for Section 8.4.2.1 “Newton Method in the Scalar Case”: (20 minutes) Download link, tablet notes
85. Video tutorial for Section 8.4.2.3 “Multi-Point Methods”: (12 minutes) Download link, tablet notes
87. Video tutorial for Section 8.5.1 “The Newton Iteration in R^n (I)”: (10 minutes) Download link, tablet notes
89. Video tutorial for Section 8.5.1 “The Newton Iteration in R^n (II)”: (15 minutes) Download link, tablet notes
92. Video tutorial for Section 8.5.4 “Damped Newton Method”: (11 minutes) Download link, tablet notes
93. Video tutorial for Section 8.6 “Quasi-Newton Method”: (15 minutes) Download link, tablet notes
94. Video tutorial for Section 8.7 “Non-linear Least Squares”: (7 minutes) Download link, tablet notes
96. Video tutorial for Section 8.7.2 “(Trust-region) Gauss-Newton Method”: (13 minutes) Download link, tablet notes
97. Video tutorial for Section 11.1 “Initial-Value Problems (IVPs) for Ordinary Differential Equations (ODEs)”: (35 minutes) Download link, tablet notes
99. Video tutorial for Section 11.3 “General Single-Step Methods”: (14 minutes) Download link, tablet notes
102. Video tutorial for Section 11.5 “Adaptive Stepsize Control”: (32 minutes) Download link, tablet notes
103. Video tutorial for Section 12.1 “Model Problem Analysis”: (40 minutes) Download link, tablet notes
104. Video tutorial for Section 12.2 “Stiff Initial-Value Problems”: (24 minutes) Download link, tablet notes
105. Video tutorial for Section 12.3 “Implicit Runge-Kutta Single-Step Methods”: (50 minutes) Download link, tablet notes
106. Video tutorial for Section 12.4 “Semi-Implicit Runge-Kutta Methods”: (13 minutes) Download link, tablet notes
107. Video tutorial for Section 12.5 “Splitting Methods”: (21 minutes) Download link, tablet notes
! Necessary corrections and updates of the lecture document will sometimes lead to changes in the numbering of paragraphs and formulas, which, of course, cannot be applied to the recorded videos. However, these changes will be taken into account in the tablet notes supplied for every video.
• For every week there is a list of course units and associated videos published on the course
Moodle Page.
• The corresponding contents must be studied in that same week.
! Do not put off studying for this course. Dependencies between the topics will make it very
hard to catch up.
§0.2.1.7 (“Personalized learning”) The flipped classroom model allows students to pursue their preferred
ways of studying. The following approaches can be tried.
• Traditional: You watch the assigned videos similar to attending a conventional classroom lecture.
Afterwards digest the material based on the tablet notes and/or the lecture document. Finally, answer
the review questions and look up more information in the lecture document.
• Reading-centered: You work through the unit reading the tablet notes and, sometimes, related sections of the lecture document. You occasionally watch parts of the videos in case some considerations and arguments have not yet become clear to you.
§0.2.1.8 (Question and Answer (Q&A) sessions) The lecturer will offer a two-hour so-called Q&A ses-
sion almost every week during the teaching period, but not in the weeks in which term exams will be held.
These Q&A sessions will be devoted to
• discussing and answering questions asked by the participants of the course,
• presenting solutions of review questions, and
• offering additional explanations for some parts of the course.
Questions can be asked right during the Q&A session, but participants of the course are encouraged to submit general or specific questions or comments beforehand.
Questions/comments can be posted in dedicated DISCUNA chat channels (folder “Q&A Channels”, community “NumCSE Autumn <YEAR>”), which will be set up for each week in which a regular Q&A session takes place.
It is highly desirable that questions are submitted at least a few hours before the start of the Q&A session
so that the lecturer has the opportunity to structure his or her answer.
Tablet notes of the Q&A sessions will be made available for download. y
• hopefully understand the remaining third when studying for the main examination after the end
of the course.
Perseverance will be rewarded!
0.2.3 Requests
The lecturers very much welcome and, putting it even more strongly, rather depend on feedback and
suggestions of the students taking the course for continuous improvement of the course contents and
presentation. Therefore all participants are strongly encouraged to get involved actively and contribute in
the following ways:
§0.2.3.1 (Reporting errors) As the documents for this course will always be in a state of flux, they will
inevitably and invariably teem with small errors, mainly typos and omissions.
For error reporting we use the DISCUNA online collaboration platform, which runs in the browser.
DISCUNA allows you to attach various types of annotations to shared PDF documents, see the instruction video.
Please report errors in the lecture material through the DISCUNA NumCSE Community, to which various course-related documents have already been uploaded.
§0.2.3.2 (Pointing out technical problems) The DISCUNA NumCSE Community is equipped with a chat channel “Technical Problems”. In case you encounter a problem affecting the videos, the course web pages, or the PDF documents supplied online, say, severely distorted or missing audio tracks or a faulty link, instantly post a comment to this channel with a short description of the problem. You can do this after clicking on the channel name in the left sidebar of the community. y
§0.2.3.3 (Providing comments and suggestions) The chat channel “General Comments” of the DISCUNA NumCSE Community is meant for letting the lecturer know about weaknesses of the contents, structure, and presentation of the course and how they can be remedied. Your statements should be constructive and address specific parts or aspects of the course.
Regularly, students attending the course remark that they have found online resources like instruction
videos that they think present some of the course material in a much clearer and better structured way. It
is important that you tell the lecturer about those online resources so that he can include pointers to them
and get inspiration. Use the “General Comments” channel also for this purpose. State clearly which part of the course you are referring to, and briefly explain why the online resource is superior or a valuable supplement. y
§0.2.3.4 (Asking/posting questions) Whenever a question comes up while you are studying for the course or trying to solve homework problems, and that question lingers, it is probably connected to an issue that also bothers other students. What to do in case you are not able to attend the Q&A session?
Please post arising questions to the DISCUNA Q&A channels even if you do not attend the Q&A session! See also § 0.2.1.8. Doing so will
• initiate a discussion of the question that may also be relevant for other students, and
• make it possible for you to find an answer in the Q&A tablet notes.
0.2.4 Assignments
A steady and persistent effort spent on homework problems is essential for success in this course.
You should expect to spend 3-5 hours per week on trying to solve the homework problems. Since many
involve small coding projects, the time it will take an individual student to arrive at a solution is hard to
predict.
The problems are published online together with plenty of hints. A master solution will also be made available, but it is foolish to read the master solution in parallel with working on a problem sheet: trying to find the solution on one’s own is essential for developing problem-solving skills, though it may occasionally be frustrating.
Please note that this problem collection is being extended throughout the semester. Thus, make sure that you obtain the most current version every week. A polybox link will also be distributed; if you install the Polybox Client, the most current version of all course documents will always be synchronized to your machine.
✦ Some or all of the problems of an assignment sheet will be discussed in the tutorial classes at least
one week after the problems have been assigned.
✦ Your tutors are happy to examine your solutions and give you feedback: You may either hand them your solution papers during the tutorial session (put your name on every sheet and clearly mark the problems you want to be inspected) or upload a scan/photo through the CODE EXPERT upload interface, see § 0.2.4.2 below. You are encouraged to hand in incomplete and wrong solutions, too, so that you can receive valuable feedback even on failed attempts.
✦ Your tutors will automatically have access to all your homework codes, see § 0.2.4.2 below.
y
Note that CODE EXPERT will also be used for the coding problems of the main examination.
y
§0.2.4.4 (CODE EXPERT synchronization with a local folder) If you prefer to use your own editor locally on your computer, synchronization between the online CODE EXPERT repository and your local folder is available via the Code Expert Sync tool. Follow the instructions here.
The working pipeline is:
sync from the CODE EXPERT platform −→ edit locally −→ sync with CODE EXPERT and run/test −→ continue editing locally . . .
§0.2.5.1 (Examinations during the teaching period) From the ETH course directory:
An optional 30-minute mid-term exam and an optional 30-minute end-term exam will be held during the teaching period. The grades of these interim examinations will be taken into account through a BONUS of up to 30% for the final grade.
The term exams will be conducted as closed-book examinations on paper. The dates of the exams will be communicated at the beginning of the term and published on the course webpage. The term exams can neither be repeated nor be taken remotely.
The final grade is computed according to the formula
G := 0.25 · ⌈4 · max{ Gs , 0.85 Gs + 0.15 gm , 0.85 Gs + 0.15 ge , 0.7 Gs + 0.15 gm + 0.15 ge }⌉ ,   (0.2.5.2)
where Gs =̂ grade in main exam, gm =̂ mid-term grade, ge =̂ end-term grade.
• All topics that have been addressed in a video listed on the course Moodle page or in any
assigned homework problem
The lecture document contains much more material than covered in class. All these extra topics are
not relevant for the exam.
✦ The lecture document (as PDF), the EIGEN documentation, and the online C++ REFERENCE PAGES will be available during the examination. The corresponding final version of the lecture document will be made available at least two weeks before the exam.
✦ No other materials may be used during the exam.
✦ The homework problem collection cannot be accessed during the exam.
✦ The exam questions will be asked in English.
✦ If, a few weeks before the main exam, you come to the conclusion that you have too little time to prepare, consider withdrawing in order not to squander an attempt.
y
0.2.5.2 For the Course 401-0663-00L Numerical Methods for CS (BSc Informatik)
§0.2.5.5 (Homework bonus) During the teaching period, quizzes and exercises similar to those that will appear in the final exam are published every week on the Moodle page of the lecture. These are open for answers for about one week, and students are expected to answer them within this time. Answering them later is not possible. Correct answers are awarded “semester points” that are defined for each question. Hence, each student has the possibility to accumulate such points during the semester.
Grade bonus
The grade achieved in the final exam will be raised by 0.25 for all students who have earned at least
75% of the “semester points”.
• Visual Studio Code can be used as an editor during the exam, but only the codes submitted through CODE EXPERT will be saved during the exam and taken into account for grading.
y
However, C++ has become the main language in computational science and engineering and high per-
formance computing. Therefore this course relies on C++ to discuss the implementation of numerical
methods.
Supplementary literature. A popular book for learning C++ that has been upgraded to include
The following sections highlight a few particular aspects of C++ that may be important for code develop-
ment in this course.
The version of the course for BSc students of Computer Science includes a two-week introduction
to C++ in the beginning of the course.
and the compiler selects the function to be used depending on the types of the arguments, following rather sophisticated rules; refer to the overload resolution rules. Complications arise because implicit type conversions have to be taken into account. In case of ambiguity a compile-time error will be triggered. Functions cannot be distinguished by return type! A small sketch of overload resolution is given after the following example.
For member functions (methods) of classes an additional distinction can be introduced by the const specifier:
struct MyClass {
  double f(double);        // use for a mutable object of type MyClass
  double f(double) const;  // use this version for a constant object
  ...
};
The second version of the method f is invoked for constant objects of type MyClass. y
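To illustrate overload resolution for free functions, here is a minimal, self-contained sketch (not taken from the lecture codes; all names are made up): two overloads of f are distinguished by their argument types, and an implicit conversion decides which one is called for an int argument.

#include <iostream>
#include <string>

// Two overloads of the same function name, distinguished by argument type
void f(double x) { std::cout << "f(double): " << x << std::endl; }
void f(const std::string &s) { std::cout << "f(string): " << s << std::endl; }

int main() {
  f(3.14);                   // exact match: calls f(double)
  f(2);                      // implicit conversion int -> double: calls f(double)
  f(std::string("hello"));   // exact match: calls f(string)
  return 0;
}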
§0.3.1.2 (Operator overloading [LLM12, Chapter 14]) In C++ unary and binary operators like =, ==, +, -, *, /, +=, -=, *=, /=, %, &&, ||, <<, >>, etc. are regarded as functions with a fixed number of arguments (one or two). For built-in numeric and logic types they are defined already. They can be extended to any other type, for instance
MyClass operator+(const MyClass &, const MyClass &);
MyClass operator+(const MyClass &, double);
MyClass operator+(const MyClass &); // unary + !
The same selection rules as for function overloading apply. Of course, operators can also be introduced as class member functions.
C++ gives complete freedom to overload operators. However, the semantics of the new operators should be close to the customary use of the operator; a small sketch is given below. y
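As a small illustration (a sketch only, the type Vec2 is hypothetical and not part of the lecture codes), a binary operator+ and an output operator can be overloaded as follows; their semantics stay close to the customary meaning of the operators.

#include <iostream>

struct Vec2 { double x, y; };   // simple 2D vector, used only for this sketch

// Binary operator+ defined as a free function
Vec2 operator+(const Vec2 &a, const Vec2 &b) { return {a.x + b.x, a.y + b.y}; }

// Overloaded output operator
std::ostream &operator<<(std::ostream &os, const Vec2 &v) {
  return os << "[" << v.x << ", " << v.y << "]";
}

int main() {
  const Vec2 a{1.0, 2.0}, b{3.0, 4.0};
  std::cout << a + b << std::endl;   // prints [4, 6]
  return 0;
}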
§0.3.1.3 (Passing arguments by value and by reference [LLM12, Sect. 6.2]) Consider a generic func-
tion declared as follows:
void f(MyClass x); // Argument x passed by value.
When f is invoked, a temporary copy of the argument is created through the copy constructor or the move
constructor of MyClass. The new temporary object is a local variable inside the function body.
If instead the function is declared to take a reference, as in
void f(MyClass &x); // Argument x passed by reference.
then the argument is passed by reference into the scope of the function and can be changed inside the function. No copies are created. If one wants to avoid the creation of temporary objects, which may be costly, but also wants to indicate that the argument will not be modified inside f, then the declaration should read
void f(const MyClass &x); // Argument x passed by constant reference.
In this case, if the scope of the object passed as the argument is merely the function or std::move()
tags it as disposable, the move constructor of MyClass is invoked, which will usually do a shallow copy
only. Refer to Code 0.3.5.10 for an example. y
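The following sketch (the type alias Widget is made up for this illustration) contrasts the three ways of passing arguments discussed above.

#include <iostream>
#include <vector>

using Widget = std::vector<double>;   // stand-in for a type that is expensive to copy

void byValue(Widget w) { w.push_back(1.0); }        // works on a temporary copy
void byReference(Widget &w) { w.push_back(1.0); }   // modifies the caller's object
void byConstRef(const Widget &w) {                  // no copy, no modification allowed
  std::cout << "size = " << w.size() << std::endl;
}

int main() {
  Widget v(5, 0.0);
  byValue(v);      // v is unchanged, a copy was created and destroyed
  byReference(v);  // v now has 6 elements
  byConstRef(v);   // prints size = 6
  return 0;
}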
0.3.2 Templates
§0.3.2.1 (Function templates) The template mechanism supports parameterization of definitions of
classes and functions by type. An example of a function templates is
template <typename ScalarType, typename VectorType>
VectorType saxpy(ScalarType alpha, const VectorType &x, const VectorType &y)
{ return (alpha*x+y); }
Depending on the concrete type of the arguments the compiler will instantiate particular versions of this
function, for instance saxpy<float,double>, when alpha is of type float and both x and y are of
type double. In this case the return type will be double.
For the above example the compiler will be able to deduce the types ScalarType and VectorType
from the arguments. The programmer can also specify the types directly through the < >-syntax as in
saxpy<double, double>(a,x,y);
if an instantiation for all arguments of type double is desired. In case the arguments do not supply enough information about the type parameters, specifying (some of) them through < > is mandatory. y
§0.3.2.2 (Class templates) A class template defines a class depending on one or more type parameters,
for instance
template <typename T>
class MyClsTempl {
public:
  using parm_t = T;                   // T-dependent type
  MyClsTempl(void);                   // Default constructor
  MyClsTempl(const T&);               // Constructor with an argument
  template <typename U>
  T memfn(const T&, const U&) const;  // Templated member function
private:
  T *ptr;                             // Data member, T-pointer
};
Types MyClsTempl<T> for a concrete choice of T are instantiated when a corresponding object is de-
clared, for instance via
double x = 3.14;
MyClass myobj;                        // Default construction of an object
MyClsTempl<double> tinstd;            // Instantiation for T = double
MyClsTempl<MyClass> mytinst(myobj);   // Instantiation for T = MyClass
MyClass ret = mytinst.memfn(myobj,x); // Instantiation of member function for U = double, automatic type deduction
The types spawned by a template for different parameter types have nothing to do with each other. y
The parameter types for a template have to provide all type definitions, member functions, operators, and data needed for the instantiation (“compilation”) of the class or function template.
The evaluation operator can take more than one argument and need not be declared const.
(II) through lambda functions, an “anonymous function” defined as
[<capture list>] (<arguments>) -> <return type> { body; }
where <capture list> is a list of variables from the local scope to be passed to the lambda func-
tion; an & indicates passing by reference,
<arguments> is a comma separated list of function arguments complete with types,
<return type> is an optional return type; often the compiler will be able to deduce the
return type from the definition of the function.
Function classes should be used when the function is needed in different places, whereas lambda functions are preferable for short functions intended for single use.
In this code the lambda function captures the local variable sum by reference, which enables the lambda
function to change its value in the surrounding scope.
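The code referred to is not reproduced above; a minimal sketch of the idea, a lambda that captures a local variable sum by reference and accumulates into it, could read:

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
  std::vector<double> v{1.2, 2.3, 3.4};
  double sum = 0.0;
  // The capture [&sum] passes sum by reference, so the lambda may modify it
  std::for_each(v.begin(), v.end(), [&sum](double x) { sum += x; });
  std::cout << "sum = " << sum << std::endl;   // prints sum = 6.9
  return 0;
}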
§0.3.3.2 (Function type wrappers) The special class std::function provides types for general poly-
morphic function wrappers.
std::function<return type(arg types)>
3 void stdfunctiontest(void) {
4   // Vector of objects of a particular signature
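The listing is reproduced only in fragments above; the following sketch (function names made up) indicates how std::function objects with the same signature can store a free function and a lambda alike.

#include <functional>
#include <iostream>
#include <vector>

double square(double x) { return x * x; }

int main() {
  // Vector of polymorphic function wrappers with signature double(double)
  std::vector<std::function<double(double)>> fns;
  fns.push_back(square);                            // free function
  fns.push_back([](double x) { return 2.0 * x; });  // lambda function
  for (const auto &f : fns) std::cout << f(3.0) << " ";   // prints 9 6
  std::cout << std::endl;
  return 0;
}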
§0.3.3.4 (Recorder objects) In the case of routines that perform some numerical computations we are
often interested in the final result only. Occasionally we may also want to screen intermediate results. The
following example demonstrates the use of an optional object for collecting information while the function
is being executed. If no such object is supplied, an idle lambda function is passed, which incurs absolutely
no runtime overhead.
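The example itself is not reproduced here; the following sketch (the function fixedPoint and its interface are hypothetical) conveys the pattern: the recorder is an optional callable invoked with every intermediate result, and a convenience overload supplies an idle lambda.

#include <cmath>
#include <iostream>

// Fixed-point iteration x <- cos(x); rec is invoked with every iterate
template <typename RECORDER>
double fixedPoint(double x0, unsigned int maxit, RECORDER &&rec) {
  double x = x0;
  for (unsigned int i = 0; i < maxit; ++i) {
    x = std::cos(x);
    rec(x);                         // report intermediate result
  }
  return x;
}
// Convenience overload: passes an idle lambda, which costs nothing at runtime
double fixedPoint(double x0, unsigned int maxit) {
  return fixedPoint(x0, maxit, [](double /*x*/) {});
}

int main() {
  fixedPoint(1.0, 5, [](double x) { std::cout << x << std::endl; });  // screen iterates
  std::cout << "result = " << fixedPoint(1.0, 100) << std::endl;      // final result only
  return 0;
}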
• Putting the name of a variable from the current scope in a lambda’s capture list makes a copy of that variable accessible inside the lambda’s body. These copies are immutable (const) inside the lambda’s body.
• To capture a local variable by (non-const) reference, prepend the variable name with &. That variable’s value can then be changed by the lambda.
• The capture list [=] captures all local variables by value, as immutable copies.
• Conversely, the capture list [&] means that all variables in the local scope are captured by reference and can be changed by the lambda function.
y
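A short sketch illustrating these capture modes (variable names made up):

#include <iostream>

int main() {
  int a = 1, b = 2;
  auto f = [=]() { return a + b; };       // a, b captured by value: immutable copies
  auto g = [&]() { a += 10; b += 10; };   // a, b captured by reference: may be changed
  auto h = [a, &b]() { b += a; };         // mixed: a by value, b by reference
  std::cout << f() << std::endl;          // prints 3
  g(); h();
  std::cout << a << " " << b << std::endl;  // prints 11 13
  return 0;
}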
§0.3.3.8 (Lambda functions inside member functions) To access class methods or data members in a lambda function inside a member function of a class, you have to give the lambda access to the current object by putting this (captures a pointer to the object) or *this (captures a copy, since C++17) in the capture list.
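A minimal sketch (the class Scaler is hypothetical): the lambda passed to std::for_each captures this so that it can read the data member factor_.

#include <algorithm>
#include <iostream>
#include <vector>

class Scaler {
public:
  explicit Scaler(double factor) : factor_(factor) {}
  // Member function whose lambda needs access to the data member factor_
  void scale(std::vector<double> &v) const {
    std::for_each(v.begin(), v.end(),
                  [this](double &x) { x *= factor_; });  // capture the current object
  }
private:
  double factor_;
};

int main() {
  std::vector<double> v{1.0, 2.0, 3.0};
  Scaler(2.5).scale(v);
  for (double x : v) std::cout << x << " ";   // prints 2.5 5 7.5
  std::cout << std::endl;
  return 0;
}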
§0.3.3.10 (Recursions based on lambda functions) Lambda functions offer an elegant way to implement recursive algorithms locally inside a function. Note that
• you have to capture the lambda function itself by reference,
• and that you cannot use auto for automatic compile-time type deduction of that lambda function!
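A sketch of this pattern: a recursive factorial whose lambda is stored in a std::function (not auto!) and captures that std::function object by reference.

#include <functional>
#include <iostream>

int main() {
  std::function<unsigned long(unsigned int)> fact =
      [&fact](unsigned int n) -> unsigned long {
        return (n <= 1) ? 1UL : n * fact(n - 1);   // recursive call through the capture
      };
  std::cout << "10! = " << fact(10) << std::endl;  // prints 3628800
  return 0;
}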
In C++ this is also possible by using the tuple utility. For instance, the following function computes the minimal and maximal element of a vector and also returns its cumulative sums; it returns all these values.
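The listing of extcumsum itself is not reproduced above; a sketch consistent with the calling code below could look as follows (it assumes a non-empty input vector).

#include <algorithm>
#include <tuple>
#include <vector>

// Sketch: returns the minimum, the maximum, and the vector of cumulative sums of v
std::tuple<double, double, std::vector<double>>
extcumsum(const std::vector<double> &v) {
  std::vector<double> cs;   // cumulative sums
  double sum = 0.0;
  for (double x : v) { sum += x; cs.push_back(sum); }
  auto mm = std::minmax_element(v.cbegin(), v.cend());  // iterators to min and max
  return std::make_tuple(*mm.first, *mm.second, cs);
}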
This code snippet shows how to extract the individual components of the tuple returned by the previous
function.
C++ code 0.3.4.2: Calling a function with multiple return values ➺ GITLAB
1  int main() {
2    // initialize a vector from an initializer list
3    std::vector<double> v({1.2, 2.3, 3.4, 4.5, 5.6, 6.7, 7.8});
4    // Variables for return values
5    double minv, maxv;        // Extremal elements
6    std::vector<double> cs;   // Cumulative sums
7    std::tie(minv, maxv, cs) = extcumsum(v);
8    cout << "min = " << minv << ", max = " << maxv << endl;
9    cout << "cs = [ "; for (double x : cs) cout << x << ' '; cout << "]" << endl;
10   return (0);
11 }
Be careful: many temporary objects might be created! A demonstration of this hidden cost is given in
Exp. 0.3.5.27. From C++17 a more compact syntax is available:
C++ code 0.3.4.3: Calling a function with multiple return values ➺ GITLAB
1  int main() {
2    // initialize a vector from an initializer list
3    std::vector<double> v({1.2, 2.3, 3.4, 4.5, 5.6, 6.7, 7.8});
4    // Definition of variables and assignment of return values all at once
5    auto [minv, maxv, cs] = extcumsum(v);
6    cout << "min = " << minv << ", max = " << maxv << endl;
7    cout << "cs = [ "; for (double x : cs) cout << x << ' '; cout << "]" << endl;
8    return (0);
9  }
Remark 0.3.4.4 (“auto” considered harmful) C++ is a strongly typed programming language and every
variable must have a precise type. However, the developer of templated classes and functions may not
know the type of some variables in advance, because it can be deduced only after instantiation through
the compiler. The auto keyword has been introduced to handle this situation.
There is a temptation to use auto profligately, because it is convenient, in particular when using templated
data types. However, this denies a major benefit of types, consistency checking at compile time and, as a
developer, one may eventually lose track of the types completely, which can lead to errors that are hard to
detect.
Thus, the use of auto should be avoided, unless in the following situations:
• for variables inside templated functions or classes, whose precise type will only become clear during
instantiation,
• for lambda functions, see Section 0.3.3,
• for return values of templated library (member) functions, whose type is “impossible to deduce” by
the user. An example is expression templates in E IGEN, refer to Rem. 1.2.1.11 below.
y
55   // Euclidean norm
56   [[nodiscard]] double norm(void) const;
57   // Euclidean inner product
58   double operator*(const MyVector &) const;
59   // Output operator
60   friend std::ostream &
61   operator<<(std::ostream &, const MyVector &mv);
62
Note the use of a public static data member dbg in Line 63 that can be used to control debugging output
by setting MyVector::dbg = true or MyVector::dbg = false.
Remark 0.3.5.2 (Contiguous arrays in C++) The class MyVector uses a C-style array and dynamic
memory management with new and delete to store the vector components. This is for demonstration
purposes only and not recommended.
Arrays in C++
In C++ use the STL container std::vector<T> for storing data in contiguous memory locations.
Exception: use std::array<T>, if the number of elements is known at compile time.
C++ code 0.3.5.5: Constructor for constant vector, also default constructor, see Line 6 in Code 0.3.5.1 ➺ GITLAB
1  MyVector::MyVector(std::size_t _n, double _a) : n(_n), data(nullptr) {
2    if (dbg) cout << "{Constructor MyVector(" << _n
3                  << ") called" << '}' << endl;
4    if (n > 0) data = new double[_n];
5    for (std::size_t l = 0; l < n; ++l) data[l] = _a;
6  }
This constructor can also serve as default constructor (a constructor that can be invoked without any
argument), because defaults are supplied for all its arguments.
The following two constructors initialize a vector from sequential containers according to the conventions
of the STL.
C++ code 0.3.5.6: Templated constructors copying vector entries from an STL container ➺ GITLAB
1  template <typename Container>
2  MyVector::MyVector(const Container &v) : n(v.size()), data(nullptr) {
3    if (dbg) cout << "{MyVector(length " << n
4                  << ") constructed from container" << '}' << endl;
5    if (n > 0) {
6      double *tmp = (data = new double[n]);
7      for (auto i : v) *tmp++ = i;  // foreach loop
8    }
9  }
Note the use of the new C++ 11 facility of a “foreach loop” iterating through a container in Line 7.
C++ code 0.3.5.7: Constructor initializing vector from STL iterator range ➺ GITLAB
1  template <typename Iterator>
2  MyVector::MyVector(Iterator first, Iterator last) : n(0), data(nullptr) {
3    n = std::distance(first, last);
4    if (dbg) cout << "{MyVector(length " << n
5                  << ") constructed from range" << '}' << endl;
6    if (n > 0) {
7      data = new double[n];
8      std::copy(first, last, data);
9    }
10 }
C++ code 0.3.5.8: Initialization of a MyVector object from an STL vector ➺ GITLAB
1  int main() {
2    myvec::MyVector::dbg = true;
3    std::vector<int> ivec = {1, 2, 3, 5, 7, 11, 13};  // initializer list
4    myvec::MyVector v1(ivec.cbegin(), ivec.cend());
5    myvec::MyVector v2(ivec);
6    myvec::MyVector vr(ivec.crbegin(), ivec.crend());
7    cout << "v1 = " << v1 << endl;
8    cout << "v2 = " << v2 << endl;
9    cout << "vr = " << vr << endl;
10   return (0);
11 }
The copy constructor listed next relies on the STL algorithm std::copy to copy the elements of an
existing object into a newly created object. This takes n operations.
An important new feature of C++11 is move semantics, which helps avoid expensive copy operations. The implementation (a sketch is given below) just performs a shallow copy of pointers and, thus, for large n is much cheaper than a call to the copy constructor from Code 0.3.5.9. The source vector is left in an empty state.
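The listing of the move constructor is not reproduced here; its essence is a shallow copy of size and data pointer followed by resetting the source object, roughly as in this sketch (consistent with the data members n, data, and dbg of MyVector shown above).

// Sketch of a move constructor for MyVector: shallow copy, O(1) cost
MyVector::MyVector(MyVector &&mv) : n(mv.n), data(mv.data) {
  if (dbg) std::cout << "{Move construction of MyVector(length " << n << ")}"
                     << std::endl;
  mv.n = 0;            // leave the source in a valid, empty state
  mv.data = nullptr;   // so that its destructor does not free the stolen data
}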
The following code demonstrates the use of std::move() to mark a vector object as disposable and
allow the compiler the use of the move constructor. The code also uses left multiplication with a scalar,
see Code 0.3.5.23.
This code produces the following output. We observe that v1 is empty after its data have been “stolen” by
v2.
{MyVector(length 8) constructed from container}
{operator a*, MyVector of length 8}
{Copy construction of MyVector(length 8)}
{operator *=, MyVector of length 8}
{Move construction of MyVector(length 8)}
{Destructor for MyVector(length = 0)}
{Move construction of MyVector(length 8)}
v1 = [ ]
v2 = [2.4, 4.6, 6.8, 9, 11.2, 13.4, 15.6, 17.8]
v3 = [1.2, 2.3, 3.4, 4.5, 5.6, 6.7, 7.8, 8.9]
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 0)}
We observe that the object v1 is reset after having been moved to v3.
Use std::move only for special purposes like above and only if an object has a move con-
structor. Otherwise a ’move’ will trigger a plain copy operation. In particular, do not use
! std::move on objects at the end of their scope, e.g., within return statements.
The next operator effects copy assignment of an rvalue MyVector object to an lvalue MyVector. This
involves O(n) operations.
C++ code 0.3.5.15: Type conversion operator: copies contents of vector into STL vector ➺ GITLAB
1  MyVector::operator std::vector<double>() const {
2    if (dbg) cout << "{Conversion to std::vector, length = " << n << '}' << endl;
3    return std::vector<double>(data, data + n);
4  }
The bracket operator [] can be used to fetch and set vector components. Note that index range checking
is performed; an exception is thrown for invalid indices. The following code also gives an example of
operator overloading as discussed in § 0.3.1.2.
Componentwise direct comparison of vectors. This can be dangerous in numerical codes, cf. Rem. 1.5.3.15.
The transform method applies a function to every vector component and overwrites it with the value
returned by the function. The function is passed as an object of a type providing a ()-operator that accepts
a single argument convertible to double and returns a value convertible to double.
The code (not reproduced here) demonstrates the use of the transform method in combination with
1. a function object of the type sketched below.
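Since the listing is not reproduced here, the following sketch (class name AddOffset made up) shows what such a function object might look like: it adds a fixed offset to its argument and counts how often its ()-operator has been invoked, which matches the “8 operations” reported in the output below.

// Sketch of a function object usable with MyVector::transform
class AddOffset {
public:
  explicit AddOffset(double offset) : offset_(offset), cnt_(0) {}
  double operator()(double x) { ++cnt_; return x + offset_; }   // called per component
  unsigned int count() const { return cnt_; }                   // number of invocations
private:
  double offset_;      // value added to every vector component
  unsigned int cnt_;   // invocation counter
};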
The output is
8 operations, mv transformed = [3.2, 4.3, 5.4, 6.5, 7.6, 8.7, 9.8, 10.9]
8 operations, mv transformed = [5.2, 6.3, 7.4, 8.5, 9.6, 10.7, 11.8, 12.9]
Final vector = [1.2, 2.3, 3.4, 4.5, 5.6, 6.7, 7.8, 8.9]
Operator overloading provides the “natural” vector operations in R n both in place and with a new vector
created for the result.
C++ code 0.3.5.23: Non-member function for left multiplication with a scalar ➺ GITLAB
1  MyVector operator*(double alpha, const MyVector &mv) {
2    if (MyVector::dbg) cout << "{operator a*, MyVector of length "
3                            << mv.n << '}' << endl;
4    MyVector tmp(mv); tmp *= alpha;
5    return (tmp);
6  }
Adopting the notation in some linear algebra texts, the operator * has been chosen to designate the
Euclidean inner product:
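The definition itself is not shown above; following the declaration double operator*(const MyVector &) const in Code 0.3.5.1, a sketch of the Euclidean inner product could read (it assumes both vectors have the same length n):

// Sketch: Euclidean inner product of two MyVector objects
double MyVector::operator*(const MyVector &mv) const {
  if (dbg) std::cout << "{dot *, MyVector of length " << n << '}' << std::endl;
  double s = 0.0;
  for (std::size_t l = 0; l < n; ++l) s += data[l] * mv.data[l];  // assumes mv.n == n
  return s;
}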
At least for debugging purposes every reasonably complex class should be equipped with output function-
ality.
EXPERIMENT 0.3.5.27 (“Behind the scenes” of MyVector arithmetic) The following code highlights
the use of operator overloading to obtain readable and compact expressions for vector arithmetic.
We run the code and trace calls. This is printed to the console:
{MyVector(length 8) constructed from container}
{MyVector(length 8) constructed from container}
{dot *, MyVector of length 8}
{operator a*, MyVector of length 8}
{Copy construction of MyVector(length 8)}
{operator *=, MyVector of length 8}
{operator +, MyVector of length 8}
{operator +=, MyVector of length 8}
{Move construction of MyVector(length 8)}
{operator a*, MyVector of length 8}
{Copy construction of MyVector(length 8)}
{operator *=, MyVector of length 8}
{operator −, MyVector of length 8}
{Copy construction of MyVector(length 8)}
{operator −=, MyVector of length 8}
{norm: MyVector of length 8}
{operator /, MyVector of length 8}
{Copy construction of MyVector(length 8)}
{operator /=, MyVector of length 8}
{operator +, MyVector of length 8}
{operator +=, MyVector of length 8}
{Move construction of MyVector(length 8)}
{Destructor for MyVector(length = 0)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 0)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 8)}
Several temporary objects are created and destroyed and quite a few copy operations take place. The situation would be even worse without move semantics: if we had not supplied a move constructor, a few more copy operations would have been triggered. Even worse, the frequent copying of data runs a high risk of cache misses. This is certainly not an efficient way to do elementary vector operations, though it looks elegant at first glance. y
Here we use this simple algorithm from linear algebra to demonstrate the use of the vector class MyVector
defined in Code 0.3.5.1.
The templated function gramschmidt takes a sequence of vectors stored in a std::vector object. The
actual vector type is passed as a template parameter. It has to supply size() and norm() member
functions as well as in place arithmetic operations -=, / and =. Note the use of the highlighted methods
of the std::vector class.
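The listing of gramschmidt is not reproduced here; a sketch matching the description (Gram-Schmidt orthonormalization of vectors stored in a std::vector, assuming the vector type supplies a dot product via operator*, left multiplication by a scalar, -=, /, and norm()) might read:

#include <vector>

// Sketch: Gram-Schmidt orthonormalization of a sequence of vectors
template <typename Vec>
std::vector<Vec> gramschmidt(const std::vector<Vec> &A, double eps = 1E-14) {
  std::vector<Vec> Q;
  for (const Vec &a : A) {
    Vec q = a;
    for (const Vec &qj : Q) q -= (qj * a) * qj;   // remove components along previous q's
    const double nrm = q.norm();
    if (nrm < eps * a.norm()) break;              // stop: (numerically) linearly dependent
    Q.push_back(q / nrm);                         // normalize and store
  }
  return Q;
}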
This driver program calls a function that initializes a sequence of vectors and then orthonormalizes them
by means of the Gram-Schmidt algorithm. Eventually orthonormality of the computed vectors is tested.
Please pay attention to
• the use of auto to avoid cumbersome type declarations,
• the for loops following the “foreach” syntax.
• automatic indirect template type deduction for the templated function gramschmidt from its argu-
ment. In Line 6 the function gramschmidt<MyVector> is instantiated.
C++ code 0.3.5.32: Initialization of a set of vectors through a functor with two arguments
1  template <typename Functor>
2  std::vector<myvec::MyVector>
3  initvectors(std::size_t n, std::size_t k, Functor &&f) {
4    std::vector<MyVector> A{};
5    for (int j = 0; j < k; ++j) {
6      A.push_back(MyVector(n));
7      for (int i = 0; i < n; ++i)
8        (A.back())[i] = f(i, j);
9    }
10   return (A);
11 }
where the template argument T must be a floating point type like double or float. The type complex<T>
• supports all basic arithmetic operations +, −, ∗, /,
• provides the member functions real() and imag() for extracting real and imaginary parts,
• and can be passed to std::abs() and std::arg() to get the modulus |z| and the argument ϕ ∈ [−π, π] of the complex number z = |z| exp(iϕ).
Complex conjugation can be done by calling std::conj() for a complex number. y
§0.3.6.2 (Initialization of complex numbers) The value of a variable of type std::complex<double> can
be initialized
• by calling the standard constructor and supplying real and imaginary part: x =
std::complex<double>(x,y), where x,y are of a numeric type that can be converted to
double. If the second argument is omitted, the imaginary part is set to zero.
• by providing a complex literal, x = 1.0+1.0i. This entails the directive using namespace
std::complex_literals.
• by specifying the modulus r ≥ 0 and argument ϕ ∈ R and calling std::polar(): x =
std::polar(r,phi). Arguments are always given in radians.
y
§0.3.6.3 (Functions with complex arguments) All standard mathematical functions like exp, sin, cos,
sinh, and cosh can be supplied with complex arguments.
Note that the definition of log and of square roots for complex argument entails specifying a branch
cut. The default choice for the built-in functions is the negative real line. For instance this means that
std::sqrt(z) for a complex number z will always have non-negative real part. y
C++ code 0.3.6.4: Data types and operations for complex numbers ➺ GITLAB
2   #include <complex>
3   #include <iostream>
4   #include <numbers>
5   using complex = std::complex<double>;
6   using namespace std::complex_literals;
7   int main(int /*argc*/, char ** /*argv*/) {
8     std::cout << "Demo: Complex numbers in C++" << std::endl;
9     // This initialization requires std::complex_literals
10    complex z = 0.5;  // Std constructor, real part only
11    z += 0.5 + 1.0i;
12    // Various elementary operations, see
13    // https://en.cppreference.com/w/cpp/numeric/complex
14    std::cout << "z = " << z << ", Re(z) = " << z.real()
15              << ", Im(z) = " << z.imag() << " |z| = " << std::abs(z)
16              << ", arg(z) = " << std::arg(z) << ", conj(z) = " << std::conj(z)
17              << std::endl;
18    complex w = std::polar(1.0, std::numbers::pi / 4.0);
19    std::cout << "w = " << w << std::endl;
20    std::cout << "exp(z) = " << std::exp(z)
21              << ", abs(exp(z)) = " << std::abs(std::exp(z)) << " = "
22              << std::exp(z.real()) << std::endl;
23    std::cout << "sqrt(z) = " << std::sqrt(z)
24              << ", arg(sqrt(z)) = " << std::arg(std::sqrt(z)) << std::endl;
25
26    return 0;
27  }
Terminal output:
1  Demo: Complex numbers in C++
2  z = (1,1), Re(z) = 1, Im(z) = 1 |z| = 1.41421, arg(z) = 0.785398, conj(z) = (1,-1)
3  w = (0.707107,0.707107)
4  exp(z) = (1.46869,2.28736), abs(exp(z)) = 2.71828 = 2.71828
5  sqrt(z) = (1.09868,0.45509), arg(sqrt(z)) = 0.392699
x^0 := 1 ,   x^(m+n) = x^m x^n ,   x^(−n) = 1/x^n   ∀ x ∈ R \ {0} , m, n ∈ Z ,   (0.4.1.2)
x^(a+b) = x^a x^b ,   x^(ab) = (x^a)^b   ∀ x > 0 , a, b ∈ R ,   (0.4.1.3)
d/dx {x ↦ x^n} = n x^(n−1) , x ≠ 0, n ∈ N ,   d/dx {x ↦ x^a} = a x^(a−1) , x > 0, a ∈ R ,   (0.4.1.4)
∫ x^a dx = x^(a+1)/(a+1) + C , a ∈ R \ {−1} ,   (0.4.1.5)
where the last integral can only cover subsets of R^+ unless a ∈ N. The notation in (0.4.1.5) expresses that x ↦ x^(a+1)/(a+1) is the primitive (ger.: Stammfunktion) of x ↦ x^a. y
§0.4.1.6 (Exponential functions and logarithms) In this course log always stands for the logarithm to base e = 2.71828 . . ..
exp(x + y) = exp(x) exp(y) ,   exp(−x) = 1/exp(x)   ∀ x, y ∈ R ,
log(xy) = log(x) + log(y) ,   log(x/y) = log(x) − log(y)   ∀ x, y > 0 ,   (0.4.1.8)
exp(nx) = exp(x)^n ,   exp(ax) = exp(x)^a   ∀ x ∈ R, n ∈ Z, a > 0 .   (0.4.1.9)
d/dx {x ↦ exp(x)} = exp(x) , x ∈ R ,   d/dx {x ↦ log(x)} = 1/x , x > 0 ,   (0.4.1.10)
∫ exp(x) dx = exp(x) + C ,   ∫ log(x) dx = x log(x) − x + C ,   (0.4.1.11)
1D product rule:   d/dx {x ↦ f(x) g(x)} = f′(x) g(x) + f(x) g′(x) ,   (0.4.1.13)
1D chain rule:   d/dx {x ↦ f(g(x))} = f′(g(x)) g′(x) .   (0.4.1.14)
This implies the following standard integration techniques:
Integration by substitution:   ∫_a^b f(g(x)) g′(x) dx = ∫_{g(a)}^{g(b)} f(y) dy ,   (0.4.1.15)
Integration by parts:   ∫ f(x) g′(x) dx = − ∫ f′(x) g(x) dx + f(x) g(x) .   (0.4.1.16)
Taylor expansion formula in one dimension for a function f that is m + 1 times continuously differentiable in a neighborhood of x0:
f(x0 + h) = Σ_{k=0}^{m} (1/k!) f^(k)(x0) h^k + R_m(x0, h) ,   R_m(x0, h) = (1/(m+1)!) f^(m+1)(ξ) h^(m+1) ,   (1.5.4.28)
for some ξ ∈ [min{x0, x0 + h}, max{x0, x0 + h}], and for all sufficiently small |h|. y
Some parts of this course will rely on sophisticated results from complex analysis, that is, the field of mathematics studying functions C → C. These results will be recalled when needed.
This implies
sin²z + cos²z = 1   ∀ z ∈ C ,    1 + cot²z = 1/sin²z   ∀ z ∈ C \ πZ .
Addition formulas:
sin(z ± w) = sin z cos w ± cos z sin w , cos(z ± w) = cos z cos w ∓ sin z sin w ∀z, w ∈ C ,
[AG11] Uri M. Ascher and Chen Greif. A first course in numerical methods. Vol. 7. Computational
Science & Engineering. Society for Industrial and Applied Mathematics (SIAM), Philadelphia,
PA, 2011, pp. xxii+552. DOI: 10.1137/1.9780898719987 (cit. on p. 11).
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on p. 11).
[DH03] P. Deuflhard and A. Hohmann. Numerical Analysis in Modern Scientific Computing. Vol. 43.
Texts in Applied Mathematics. Springer, 2003 (cit. on p. 12).
[Fri19] F. Friedrich. Datenstrukturen und Algorithmen. Lecture slides. 2019 (cit. on p. 37).
[GGK14] W. Gander, M.J. Gander, and F. Kwok. Scientific Computing. Vol. 11. Texts in Computational
Science and Engineering. Heidelberg: Springer, 2014 (cit. on p. 12).
[Gut09] M.H. Gutknecht. Lineare Algebra. Lecture Notes. SAM, ETH Zürich, 2009 (cit. on p. 12).
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on p. 11).
[Jos12] N.M. Josuttis. The C++ Standard Library. Boston, MA: Addison-Wesley, 2012 (cit. on p. 30).
[LLM12] S. Lippman, J. Lajoie, and B. Moo. C++ Primer. 5th. Boston: Addison-Wesley, 2012 (cit. on
pp. 30, 31).
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
p. 12).
[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Math-
ematics. New York: Springer, 2000 (cit. on p. 11).
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on p. 12).
Chapter 1
Computing with Matrices and Vectors
§1.0.0.1 (Prerequisite knowledge for Chapter 1) The reader must master the basics of linear vector
and matrix calculus as covered in every introductory course on linear algebra [NS02, Ch. 2].
On a few occasions we will also need results of 1D real calculus like Taylor’s formula [Str09, Sect. 5.5]. y
§1.0.0.2 (Levels of operations in simulation codes) The lowest level of real arithmetic available on
computers is that of the elementary operations “+”, “−”, “∗”, “/”, “^”, usually implemented in hardware. The next
level comprises computations on finite arrays of real numbers, the elementary linear algebra operations
(BLAS). On top of them we build complex algorithms involving iterations and approximations.
[Schematic: elementary operations in R → elementary linear algebra operations (BLAS) → complex algorithms.]
Hardly anyone will ever contemplate implementing elementary operations on binary data formats; similarly,
well-tested and optimised code libraries should be used for all elementary linear algebra operations in
simulation codes. This chapter will introduce you to such libraries and how to use them smartly. y
Contents
1.1 Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
1.1.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
1.1.2 Classes of Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
1.2 Software and Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
1.2.1 E IGEN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
1.2.2 P YTHON . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
1.2.3 (Dense) Matrix Storage Formats . . . . . . . . . . . . . . . . . . . . . . . . . 65
1.3 Basic Linear Algebra Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
1.3.1 Elementary Matrix-Vector Calculus . . . . . . . . . . . . . . . . . . . . . . . 70
1.3.2 BLAS – Basic Linear Algebra Subprograms . . . . . . . . . . . . . . . . . . . 76
1.4 Computational Effort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
1.4.1 (Asymptotic) Computational Complexity . . . . . . . . . . . . . . . . . . . . 83
1.4.2 Cost of Basic Linear-Algebra Operations . . . . . . . . . . . . . . . . . . . . 84
1.4.3 Improving Complexity in Numerical Linear Algebra: Some Tricks . . . . . 86
1.5 Machine Arithmetic and Consequences . . . . . . . . . . . . . . . . . . . . . . . . 91
1.5.1 Experiment: Loss of Orthogonality . . . . . . . . . . . . . . . . . . . . . . . . 91
1.5.2 Machine Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
1.5.3 Roundoff Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
1.5.4 Cancellation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
1.1 Fundamentals
1.1.1 Notations
Video tutorial for Section 1.1.1 “Notations and Classes of Matrices”: (7 minutes)
Download link, tablet notes
K^n ≙ vector space of column vectors with n components in K.
“Linear algebra convention”: Unless stated otherwise, in mathematical formulas vector components are indexed from 1!
✎ two notations:  x = [x1, . . . , xn]⊤ → xi , i = 1, . . . , n ;   x ∈ K^n → (x)i , i = 1, . . . , n .
✦ Selecting sub-vectors:
✎ notation:  x = [x1 . . . xn]⊤ ➣ (x)_{k:l} = [xk, . . . , xl]⊤ , 1 ≤ k ≤ l ≤ n .
✎ notations like 1 ≤ k, ℓ ≤ n, where it is clear from the context that k and ℓ designate integer indices, mean “for all k, ℓ ∈ {1, . . . , n}”.
✦ j-th unit vector:  e_j = [0, . . . , 1, . . . , 0]⊤ , (e_j)i = δij , i, j = 1, . . . , n.
✎ notation: Kronecker symbol, also called “Kronecker delta”, defined as δij := 1 if i = j, δij := 0 if i ≠ j.
y
✦ Transposed matrix:  for A = [aij] ∈ K^{n,m},
A⊤ := [aji] ∈ K^{m,n} ,  i.e. (A⊤)ij = (A)ji .
✦ Adjoint matrix (Hermitian transpose):  A^H := Ā⊤ ∈ K^{m,n} , (A^H)ij = ā_ji .
✎ notation: āij = Re(aij) − i Im(aij) ≙ complex conjugate of aij. Of course, for A ∈ R^{n,m} we
have A^H = A⊤ .
y
§1.1.2.1 (Special matrices) Terminology and notations for a few very special matrices:
Identity matrix:  I := In ∈ K^{n,n} ,  (I)ij = δij ,
Zero matrix:  O := O_{n,m} ∈ K^{n,m} ,  all entries equal to 0 ,
Diagonal matrix:  diag(d1, . . . , dn) ∈ K^{n,n} ,  with entries d1, . . . , dn ∈ K on the main diagonal and zeros elsewhere, dj ∈ K , j = 1, . . . , n .
The creation of special matrices can usually be done by special commands or functions in the various
languages or libraries dedicated to numerical linear algebra, see § 1.2.1.3. y
§1.1.2.2 (Diagonal and triangular matrices) A little terminology to quickly refer to matrices whose non-
zero entries occupy special locations:
[Sketches of diagonal, upper triangular, and lower triangular matrices; the corresponding definitions are not reproduced here.]
Definition 1.1.2.6. Symmetric positive definite (s.p.d.) matrices → [DR08, Def. 3.31],
[QSS00, Def. 1.22]
M = M^H  and  ∀ x ∈ K^n : x^H M x > 0 ⇔ x ≠ 0 .
Lemma 1.1.2.7. Necessary conditions for s.p.d. → [DR08, Satz 3.33], [QSS00, Prop. 1.18]
Remark 1.1.2.8 (S.p.d. Hessians) Recall from analysis: in an isolated local minimum x∗ of a C²-function
f : R^n → R ➤ the Hessian D²f(x∗) is s.p.d. (see Def. 8.5.1.18 for the definition of the Hessian).
To compute the minimum of a C²-function iteratively by means of Newton’s method (→ Sect. 8.5), a linear
system of equations with the s.p.d. Hessian as system matrix has to be solved in each step.
The solution of many equations in science and engineering boils down to finding the minimum of some
(energy, entropy, etc.) function, which accounts for the prominent role of s.p.d. linear systems in applica-
tions. y
Review question(s) 1.1.2.9 (Notations, matrix-vector calculus, and special matrices)
(Q1.1.2.9.A) Give a compact notation for the row vector containing the diagonal entries of a square matrix
S ∈ R n,n , n ∈ N.
(Q1.1.2.9.B) How can you write down the s × s-submatrix, s ∈ N, in the upper right corner of C ∈ R n,m ,
n, m ≥ s.
(Q1.1.2.9.C) We consider two matrices A, B ∈ R n,m , both with at most N ∈ N non-zero entries. What
is the maximal number of non-zero entries of A + B?
(Q1.1.2.9.D) A matrix A ∈ R n,m enjoys the following property (banded matrix):
1.2.1 E IGEN
Video tutorial for Section 1.2.1 "E IGEN ": (11 minutes) Download link, tablet notes
E IGEN is a header-only C++ template library designed to enable easy, natural and efficient numerical
linear algebra: it provides data structures and a wide range of operations for matrices and vectors, see
below. E IGEN also implements many more fundamental algorithms (see the documentation page or the discussion
below).
E IGEN relies on expression templates to allow the efficient evaluation of complex expressions involving
matrices and vectors. Refer to the example given in the E IGEN documentation for details.
§1.2.1.1 (Matrix and vector data types in E IGEN) A generic matrix data type is given by the templated
class
Eigen::Matrix< typename Scalar,
i n t RowsAtCompileTime, i n t ColsAtCompileTime>
Here Scalar is the underlying scalar type of the matrix entries, which must support the usual operations
’+’, ’-’, ’*’, ’/’, and ’+=’, ’*=’, ’==’, etc. Usually the scalar type will be either double, float, or complex<>. The
cardinal template arguments RowsAtCompileTime and ColsAtCompileTime can pass a fixed size
of the matrix, if it is known at compile time. There is a specialization selected by the template argument
Eigen::Dynamic supporting variable size “dynamic” matrices.
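The following minimal sketch (not one of the original listings) illustrates the use of these template arguments; the variable names are chosen for illustration only.
C++ code (illustrative sketch):
#include <Eigen/Dense>

int main() {
  // Fixed-size 3x3 matrix: dimensions known at compile time
  Eigen::Matrix<double, 3, 3> A;
  A.setIdentity();
  // Dynamically sized matrix: dimensions fixed only at runtime
  Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic> B(4, 7);
  B.setZero();
  // A (dynamic) column vector is just a matrix with one column
  Eigen::Matrix<double, Eigen::Dynamic, 1> v(5);
  v.setOnes();
  return 0;
}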
C++ code 1.2.1.2: Vector type and their use in E IGEN ➺ GITLAB
1 # include <Eigen / Dense >
2
Note that in Line 24 we could have relied on automatic type deduction via auto vectprod = ....
However, as argued in Rem. 0.3.4.4, it is often safer to forgo this option and to specify the type directly.
The following convenience data types are provided by E IGEN, see E IGEN documentation:
• MatrixXd ≙ generic variable size matrix with double precision entries
• VectorXd, RowVectorXd ≙ dynamic column and row vectors
(= dynamic matrices with one dimension equal to 1)
• MatrixNd with N = 2, 3, 4 for small fixed size square N × N-matrices (type double)
• VectorNd with N = 2, 3, 4 for small column vectors with fixed length N.
The d in the type name may be replaced with i (for int), f (for float), and cd (for
complex<double>) to select another basic scalar type.
All matrix types feature the methods cols(), rows(), and size(), which return the number of columns, the number of rows,
and the total number of entries.
Access to individual matrix entries and vector components, both as Rvalue and Lvalue, is possible through
the ()-operator taking two arguments of type index_t. If only one argument is supplied, the matrix is
accessed as a linear array according to its memory layout. For vectors, that is, matrices where one
dimension is fixed to 1, the []-operator can replace () with one argument, see Line 22 of Code 1.2.1.2.
y
§1.2.1.3 (Initialization of dense matrices in E IGEN, E IGEN documentation) The entry access oper-
ator (int i,int j) allows the most direct setting of matrix entries; there is hardly any runtime penalty.
Of course, in E IGEN dedicated functions take care of the initialization of the special matrices introduced in
§ 1.1.2.1:
Eigen::MatrixXd I = Eigen::MatrixXd::Identity(n,n);
Eigen::MatrixXd O = Eigen::MatrixXd::Zero(n,m);
Eigen::MatrixXd D = d_vector.asDiagonal();
A versatile way to initialize a matrix relies on a combination of the operators << and ,, which allows the
construction of a matrix from blocks, see ➺ GITLAB, function blockinit().
MatrixXd mat3(6,6);
mat3 <<
MatrixXd::Constant(4,2,1.5), // top row, first block
MatrixXd::Constant(4,3,3.5), // top row, second block
MatrixXd::Constant(4,1,7.5), // top row, third block
MatrixXd::Constant(2,4,2.5), // bottom row, left block
MatrixXd::Constant(2,2,4.5); // bottom row, right block
The matrix is filled top to bottom left to right, block dimensions have to match (like in MATLAB). y
§1.2.1.5 (Access to submatrices in E IGEN, E IGEN documentation) The method block(int i,int
j,int p,int q) returns a reference to the submatrix with upper left corner at position (i, j) and size
p × q.
The methods row(int i) and col(int j) provide a reference to the corresponding row and column of
the matrix. Even more specialised access methods are
topLeftCorner(p,q), bottomLeftCorner(p,q),
topRightCorner(p,q), bottomRightCorner(p,q),
topRows(q), bottomRows(q),
leftCols(p), and rightCols(q),
with obvious purposes.
C++ code 1.2.1.6: Demonstration code for access to matrix blocks in E IGEN ➺ GITLAB
2 template <typename MatType>
3 void blockAccess(Eigen::MatrixBase<MatType> &M)
4 {
5   using index_t = typename Eigen::MatrixBase<MatType>::Index;
6   const index_t nrows(M.rows()); // No. of rows
7   const index_t ncols(M.cols()); // No. of columns
8
9   cout << "Matrix M = " << endl << M << endl; // Print matrix
10  // Block size half the size of the matrix
11  const index_t p = nrows / 2;
12  const index_t q = ncols / 2;
13  // Output submatrix with left upper entry at position (i,i)
14  for (index_t i = 0; i < std::min(p, q); i++) {
15    cout << "Block (" << i << ',' << i << ',' << p << ',' << q
16         << ") = " << M.block(i, i, p, q) << endl;
17  }
18  // l-value access: modify sub-matrix by adding a constant
19  M.block(1, 1, p, q) += Eigen::MatrixBase<MatType>::Constant(p, q, 1.0);
20  cout << "M = " << endl << M << endl;
21  // r-value access: extract sub-matrix
22  const MatrixXd B = M.block(1, 1, p, q);
23  cout << "Isolated modified block = " << endl << B << endl;
24  // Special sub-matrices
25  cout << p << " top rows of m = " << M.topRows(p) << endl;
26  cout << p << " bottom rows of m = " << M.bottomRows(p) << endl;
27  cout << q << " left cols of m = " << M.leftCols(q) << endl;
28  cout << q << " right cols of m = " << M.rightCols(q) << endl;
29  // r-value access to upper triangular part
30  const MatrixXd T = M.template triangularView<Upper>(); //
31  cout << "Upper triangular part = " << endl << T << endl;
32  // l-value access to lower triangular part
33  M.template triangularView<Lower>() *= -1.5; //
34  cout << "Matrix M = " << endl << M << endl;
35 }
• Note that the function blockAccess() is templated and that the matrix argument passed through
M has a type derived from Eigen::MatrixBase. The deeper reason for this alien looking signature of
blockAccess() is explained in E IGEN documentation.
• E IGEN offers views for access to triangular parts of a matrix, see Line 30 and Line 33, according to
M.triangularView<XX>()
where XX can stand for one of the following: Upper, Lower, StrictlyUpper, StrictlyLower, UnitUpper,
UnitLower, see E IGEN documentation.
• For column and row vectors, references to sub-vectors can be obtained by the methods head(int
length), tail(int length), and segment(int pos,int length); a small usage sketch is given below.
Note: Unless the preprocessor switch NDEBUG is set, E IGEN performs range checks on all indices. y
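The following small sketch (not from the original listings) demonstrates head(), tail(), and segment(); the concrete numbers are chosen for illustration only.
C++ code (illustrative sketch):
#include <iostream>
#include <Eigen/Dense>

int main() {
  Eigen::VectorXd v = Eigen::VectorXd::LinSpaced(6, 1.0, 6.0); // [1,2,3,4,5,6]
  std::cout << "head(3) = " << v.head(3).transpose() << std::endl;
  std::cout << "tail(2) = " << v.tail(2).transpose() << std::endl;
  // segment(pos, length): 3 components starting at index 1
  std::cout << "segment(1,3) = " << v.segment(1, 3).transpose() << std::endl;
  // l-value access: sub-vector references can also be modified
  v.segment(1, 3).setZero();
  std::cout << "modified v = " << v.transpose() << std::endl;
  return 0;
}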
§1.2.1.7 (Componentwise operations in E IGEN) Running out of overloadable operators, E IGEN uses the
Array concept to furnish entry-wise operations on matrices. An E IGEN-Array contains the same data as a
matrix, supports the same methods for initialisation and access, but replaces the operators of matrix arith-
metic with entry-wise actions. Matrices and arrays can be converted into each other by the array() and
matrix() methods, see E IGEN documentation for details. Information about functions that enable
entry-wise operation is available in the E IGEN documentation.
The application of a functor (→ Section 0.3.3) to all entries of a matrix can also be done via the
unaryExpr() method of a matrix:
// Apply a lambda function to all entries of a matrix
auto fnct = [](double x) { return (x + 1.0 / x); };
cout << "f(m1) = " << endl << m1.unaryExpr(fnct) << endl;
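As a further illustration (a sketch not contained in the original notes), entry-wise arithmetic can be expressed through temporary conversion to Array objects; the matrices m1, m2 here are placeholders.
C++ code (illustrative sketch):
#include <iostream>
#include <Eigen/Dense>

int main() {
  Eigen::MatrixXd m1 = Eigen::MatrixXd::Constant(2, 2, 2.0);
  Eigen::MatrixXd m2 = Eigen::MatrixXd::Constant(2, 2, 3.0);
  // Entry-wise product and quotient via the Array view
  Eigen::MatrixXd p = (m1.array() * m2.array()).matrix(); // same as m1.cwiseProduct(m2)
  Eigen::MatrixXd q = (m1.array() / m2.array()).matrix();
  // Entry-wise standard functions are available on Arrays
  Eigen::MatrixXd e = m1.array().exp().matrix();
  std::cout << p << std::endl << q << std::endl << e << std::endl;
  return 0;
}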
§1.2.1.9 (Reduction operations in E IGEN) According to E IGEN’s terminology, reductions are operations
that access all entries of a matrix and accumulate some information in the process, see the
E IGEN documentation. A typical example is the summation of the entries.
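A small sketch (not from the original notes) of typical reduction operations:
C++ code (illustrative sketch):
#include <iostream>
#include <Eigen/Dense>

int main() {
  Eigen::MatrixXd M(2, 3);
  M << 1, 2, 3,
       4, 5, 6;
  std::cout << "sum of entries = " << M.sum() << std::endl;      // 21
  std::cout << "largest entry  = " << M.maxCoeff() << std::endl; // 6
  std::cout << "smallest entry = " << M.minCoeff() << std::endl; // 1
  // Partial reductions act column-wise or row-wise and return a vector
  std::cout << "column sums = " << M.colwise().sum() << std::endl;
  std::cout << "row sums    = " << M.rowwise().sum().transpose() << std::endl;
  return 0;
}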
Remark 1.2.1.11 (’auto’ in E IGEN codes) The expression template programming model (→ explanations
in the E IGEN documentation) relies on complex intermediary data types hidden from the user. They support
the efficient evaluation of complex expressions, see the E IGEN documentation. Let us look at the following two
code snippets that assume that both M and R are of type Eigen::MatrixXd.
Code I:
au to D = M.diagonal().asDiagonal(); R = D.inverse();
Code II:
Eigen::MatrixXd D = M.diagonal().asDiagonal(); R = D.inverse();
Code I is usually much more efficient than Code II. The reason is that in Code I D is of a complex type that preserves the information that the matrix is diagonal.
Of course, inverting a diagonal matrix is cheap. Conversely forcing D to be of type Eigen::MatrixXd loses
this information and the expensive invert() method for a generic densely populated matrix is invoked.
This is one of the exceptions to Rem. 0.3.4.4: for variables holding the result of E IGEN expressions auto
is recommended. y
Remark 1.2.1.12 (E IGEN-based code: debug mode and release mode) If you want a C++ code built
using the E IGEN library to run fast, for instance for large computations or runtime measurements, you should
compile in release mode, that is, with the compiler switches -O2 -DNDEBUG (for gcc or clang). In a
cmake-based build system you can achieve this by setting the flag CMAKE_BUILD_TYPE to “Release”.
The default setting for E IGEN is debug mode, which makes E IGEN do a lot of consistency
checking and considerably slows down execution of a code.
!
For “production runs” E IGEN-based codes must be compiled in release mode!
y
(Q1.2.1.14.A) Sketch a C++ code snippet (based on E IGEN) that checks whether a given matrix is an n × n-matrix with even
n ∈ N and then replaces its upper right n/2 × n/2-block with an identity matrix. Do not use any C++ loops.
(Q1.2.1.14.B) Given an Eigen::VectorXd object v (↔ v ∈ R^n), sketch a C++ code snippet that replaces
it with the vector ṽ defined by
(ṽ)i := (v)n  for i = 1 ,   (ṽ)i := (v)i−1  for i = 2, . . . , n .
1.2.2 P YTHON
P YTHON is a widely used general-purpose and open source programming language. Together with
packages like N UM P Y and MATPLOTLIB it delivers functionality similar to M ATLAB for free. For interactive
computing IP YTHON can be used. All those packages belong to the S CI P Y ecosystem.
P YTHON features good documentation and several scientific distributions are available (e.g. Anaconda,
Enthought), which contain the most important packages. On most Linux distributions the S CI P Y ecosystem
is also available in the software repository, as are many other packages, including for example the
Spyder IDE delivered with Anaconda.
A good introductory tutorial to numerical P YTHON are the S CI P Y lectures. The full documentation of
N UM P Y and S CI P Y can be found here. For former M ATLAB users there is also a guide. The scripts in these
lecture notes follow the official P YTHON style guide.
Note that in P YTHON we have to import the numerical packages explicitly before use. This is normally done
at the beginning of the file with lines like import numpy as np and from matplotlib import
pyplot as plt. These import statements are often skipped in these lecture notes to focus on the actual
computations. But you can always assume the import statements as given here, e.g. np.ravel(A) is
a call to a N UM P Y function and plt.loglog(x, y) is a call to a MATPLOTLIB pyplot function.
P YTHON is not used in the current version of the lecture. Nevertheless a few P YTHON codes are supplied
in order to convey similarities and differences to implementations in M ATLAB and C++.
§1.2.2.1 (Matrices and Vectors in P YTHON) The basic numeric data type in P YTHON are N UM P Y’s n-
dimensional arrays. Vectors are normally implemented as 1D arrays and no distinction is made between
row and column vectors. Matrices are represented as 2D arrays.
☞ v = np.array([1, 2, 3]) creates a 1D array with the three elements 1, 2 and 3.
☞ A = np.array([[1, 2], [3, 4]]) creates a 2D array.
☞ A.shape gives the n-dimensional size of an array.
☞ A.size gives the total number of entries in an array.
Note: There is also a matrix class in N UM P Y with different semantics, but its use is officially discouraged
and it might even be removed in a future release.
y
§1.2.2.2 (Manipulating arrays in P YTHON) There are many possibilities listed in the documentation how
to create, index and manipulate arrays.
An important difference to M ATLAB is that all arithmetic operations are normally performed element-wise,
e.g. A * B is not the matrix-matrix product but the element-wise multiplication (in M ATLAB: A.*B). Also A
* v does a broadcasted element-wise product. For the matrix product one has to use np.dot(A, B)
or A.dot(B) explicitly. y
Video tutorial for Section 1.2.3 "(Dense) Matrix Storage Formats": (10 minutes)
Download link, tablet notes
Two natural options for the “vectorisation” of a matrix A ∈ K^{m,n}:
row major:  (A)ij ↔ A_arr(n*(i-1)+(j-1)) ,
column major:  (A)ij ↔ A_arr(m*(j-1)+(i-1)) .
EXAMPLE 1.2.3.1 (Accessing matrix data as a vector) In E IGEN the single index access operator relies
on the linear data layout:
In E IGEN the data layout can be controlled by a template argument; default is column major.
C++ code 1.2.3.2: Single index access of matrix entries in E IGEN ➺ GITLAB
2 void storageOrder(int nrows = 6, int ncols = 7)
3 {
4   cout << "Different matrix storage layouts in Eigen" << endl;
5   // Template parameter ColMajor selects column major data layout
6   Matrix<double, Dynamic, Dynamic, ColMajor> mcm(nrows, ncols);
7   // Template parameter RowMajor selects row major data layout
8   Matrix<double, Dynamic, Dynamic, RowMajor> mrm(nrows, ncols);
16  cout << "Matrix mrm = " << endl << mrm << endl;
17  cout << "mcm linear = ";
18  for (int l = 0; l < mcm.size(); l++) {
19    cout << mcm(l) << ',';
20  }
21  cout << endl;
22
The function call storageOrder(3,3), cf. Code 1.2.3.2, yields the output
1 Different matrix storage layouts in Eigen
2 Matrix mrm =
3 1 2 3
4 4 5 6
5 7 8 9
6 mcm linear = 1,4,7,2,5,8,3,6,9,
7 mrm linear = 1,2,3,4,5,6,7,8,9,
In P YTHON the default data layout is row major, but it can be explicitly set. Further, array transposition
does not change any data, but only the memory order and array shape.
Remark 1.2.3.4 (Vectorisation of a matrix) Mapping a column-major matrix to a column vector with the
same number of entries is called vectorization or linearization in numerical linear algebra, in symbols
vec : K^{n,m} → K^{n·m} ,   vec(A) := [ (A):,1 ; (A):,2 ; . . . ; (A):,m ] ∈ K^{n·m} .   (1.2.3.5)
y
Remark 1.2.3.6 (Reshaping matrices in E IGEN) If you need a reshaped view of a matrix’ data in E IGEN
you can obtain it via the raw data vector belonging to the matrix. Then use this information to create a
matrix view by means of Map → documentation.
18  cout << "Matrix M = " << endl << M << endl;
19  cout << "reshaped to " << R.rows() << 'x' << R.cols()
20       << " = " << endl << R << endl;
21  // Modifying R affects M, because they share the data space!
22  R *= -1.5;
23  cout << "Scaled (!) matrix M = " << endl << M << endl;
24  // Matrix S is not affected, because of deep copy
25  cout << "Matrix S = " << endl << S << endl;
26 }
27 }
This function has to be called with a mutable (l-value) matrix type object. A sample output is printed next:
1 Matrix M =
2 0 −1 −2 −3 −4 −5 −6
3 1 0 −1 −2 −3 −4 −5
4 2 1 0 −1 −2 −3 −4
5 3 2 1 0 −1 −2 −3
6 4 3 2 1 0 −1 −2
7 5 4 3 2 1 0 −1
8 reshaped t o 2x21 =
9 0 2 4 −1 1 3 −2 0 2 −3 −1 1 −4 −2 0 −5 −3 −1 −6 −4 −2
10 1 3 5 0 2 4 −1 1 3 −2 0 2 −3 −1 1 −4 −2 0 −5 −3 −1
11 Scaled ( ! ) matrix M =
12 −0 1.5 3 4.5 6 7.5 9
y
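Since the listing above is only partially reproduced, here is a self-contained minimal sketch (not from the original notes) of reshaping via Eigen::Map; the sizes are chosen for illustration only.
C++ code (illustrative sketch):
#include <iostream>
#include <Eigen/Dense>

int main() {
  Eigen::MatrixXd M(2, 6);
  M << 1, 2, 3, 4,  5,  6,
       7, 8, 9, 10, 11, 12;
  // Reinterpret the (column major) data array of M as a 3x4 matrix.
  // R is a view: it shares the memory of M, no data is copied.
  Eigen::Map<Eigen::MatrixXd> R(M.data(), 3, 4);
  std::cout << "R = " << std::endl << R << std::endl;
  R *= 2.0; // also modifies M, because the data space is shared!
  std::cout << "M = " << std::endl << M << std::endl;
  // Assigning to a MatrixXd makes a deep copy and decouples the data
  Eigen::MatrixXd S = Eigen::Map<Eigen::MatrixXd>(M.data(), 4, 3);
  return 0;
}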
Remark 1.2.3.8 (N UM P Y function reshape) N UM P Y offers the function np.reshape for changing the
dimensions of a matrix A ∈ K^{m,n}:
# read elements of A in row major order (default)
B = np.reshape(A, (k, l))  # error, in case kl ≠ mn
B = np.reshape(A, (k, l), order='C')  # same as above
# read elements of A in column major order
B = np.reshape(A, (k, l), order='F')
# read elements of A as stored in memory
B = np.reshape(A, (k, l), order='A')
This command will create a k × l array by reinterpreting the array of entries of A as data for an array
with k rows and l columns. The order in which the elements of A are read can be set by the order
argument to row major (default, 'C'), column major ('F'), or A's internal storage order, i.e. row major if
A is row major or column major if A is column major ('A'). y
EXPERIMENT 1.2.3.9 (Impact of matrix data access patterns on runtime) Modern CPUs feature several
levels of memory (registers, L1 cache, L2 cache, . . ., main memory) of different latency, bandwidth, and
size. Frequently accessing memory locations with widely different addresses results in many cache misses
and will considerably slow down the CPU.
The following C++ code sequentially runs through the entries of a column major matrix (E IGEN’s de-
fault) in two ways and measures the (average) time required for the loops to complete. It relies on the
std::chrono library C++ reference.
C++ code 1.2.3.10: Timing for row and column oriented matrix access for E IGEN ➺ GITLAB
2 void rowcolaccesstiming()
3 {
4   constexpr size_t K = 3;       // Number of repetitions
5   constexpr index_t N_min = 5;  // Smallest matrix size 32
6   constexpr index_t N_max = 13; // Scan until matrix size of 8192
7   index_t n = (1UL << static_cast<size_t>(N_min));
8   Eigen::MatrixXd times(N_max - N_min + 1, 3);
9
[Fig.: plot of the average runtimes [s] versus the matrix size as measured with Code 1.2.3.10, for the column-oriented operations A(:,j+1) = A(:,j+1) - A(:,j) (“eigen column access”) and the row-oriented operations A(i+1,:) = A(i+1,:) - A(i,:) (“eigen row access”). Platform: ubuntu 14.04 LTS, i7-3517U CPU @ 1.90GHz.]
We observe a blatant discrepancy of the CPU time required for accessing the entries of a matrix in row-wise or
column-wise fashion. This reflects the impact of features of the underlying hardware architecture, like cache
size and memory bandwidth:
Interpretation of timings: Since standard matrices in E IGEN are stored column major, all the matrix el-
ements in a column occupy contiguous memory locations, which will all reside in the cache together.
Hence, column oriented access will mainly operate on data in the cache even for large matrices. Con-
versely, row oriented access addresses matrix entries that are stored in distant memory locations, which
incurs frequent cache misses (“cache thrashing”).
The impact of hardware architecture on the performance of algorithms will not be taken into account in
this course, because hardware features tend to be both intricate and ephemeral. However, for modern
high performance computing it is essential to adapt implementations to the hardware on which the code is
supposed to run. y
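To make the two access patterns explicit, here is a minimal sketch (not the original timing code) of column-oriented versus row-oriented traversal of a column major Eigen::MatrixXd:
C++ code (illustrative sketch):
#include <Eigen/Dense>

// Column-wise traversal: the inner loop runs over contiguous memory
// locations (cache friendly for Eigen's default column major layout).
void addOneColwise(Eigen::MatrixXd &A) {
  for (int j = 0; j < A.cols(); ++j)
    for (int i = 0; i < A.rows(); ++i)
      A(i, j) += 1.0;
}

// Row-wise traversal: the inner loop jumps through memory with a
// stride equal to the number of rows, provoking cache misses.
void addOneRowwise(Eigen::MatrixXd &A) {
  for (int i = 0; i < A.rows(); ++i)
    for (int j = 0; j < A.cols(); ++j)
      A(i, j) += 1.0;
}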
Review question(s) 1.2.3.11 (Dense matrix storage formats)
(Q1.2.3.11.A) Write efficient elementary C++ loops that realize the matrix×vector product Mx,
M ∈ R^{m,n}, x ∈ R^n, where M is stored in an Eigen::MatrixXd object M and x is given as an
Eigen::VectorXd object x. Assume the default (column major) memory layout for M. Discuss the
memory access pattern.
(Q1.2.3.11.B) A black-box function has the following signature:
template <typename Vector>
double processVector(const Eigen::DenseBase<Vector> &v);
Given a matrix A ∈ R n,m stored in an Eigen::MatrixXd object A (column major memory layout), how
can you efficiently realize the following function calls in C++:
• processVector(vec(A)) ,
• processVector(vec(A⊤ )) ?
△
" #
n
A ∈ K m,n , B ∈ K n,k : AB = ∑ aij bjl ∈ K m,k . (1.3.1.1)
j =1 i =1,...,m
l =1,...,k
Recall from linear algebra basic properties of the matrix product: for all K-matrices A, B, C (of suitable
sizes), α, β ∈ K
associative:
(AB)C = A(BC) ,
bi-linear: (αA + βB)C = α(AC) + β(BC) , C(αA + βB) = α(CA) + β(CB) ,
non-commutative: AB ≠ BA in general .
[Fig. 10: schematic visualisation of the dimensions in the matrix product of an m × n and an n × k matrix.]
Remark 1.3.1.3 (Row-wise & column-wise view of matrix product) To understand what is going on
when forming a matrix product, it is often useful to decompose it into matrix×vector operations in one of
the following two ways:
A ∈ K^{m,n} , B ∈ K^{n,k} :
AB = [ A(B):,1  . . .  A(B):,k ]   (matrix assembled from columns) ,
AB = [ (A)1,: B ; . . . ; (A)m,: B ]   (matrix assembled from rows) .   (1.3.1.4)
Remark 1.3.1.5 (Understanding the structure of product matrices) A “mental image” of matrix multi-
plication is useful for telling special properties of product matrices.
For instance, zero blocks of the product matrix can be predicted easily in the following situations using the
idea explained in Rem. 1.3.1.3 (try to understand how):
[Fig. 11, Fig. 12: schematic illustrations of how zero blocks of the matrix factors lead to predictable zero blocks of the product matrix.]
A clear understanding of matrix multiplication enables you to “see”, which parts of a matrix factor matter
in a product:
[Fig. 13: schematic illustration of which parts of a matrix factor matter in the product when the other factor contains a zero block.]
“Seeing” the structure/pattern of a matrix product:
[Spy plots of the arrow matrices A, B and of their products C = A·A and D = A·B, as generated by the code below.]
These nice renderings of the so-called patterns of matrices, that is, the distribution of their non-zero entries
have been created by a special plotting command spy() of matplotlibcpp.
16 int main() {
17   int n = 100;
18   MatrixXd A(n, n), B(n, n);
19   A.setZero();
20   B.setZero();
21   // Initialize matrices, see Fig. 13
22   A.diagonal() = VectorXd::LinSpaced(n, 1, n);
23   A.col(n - 1) = VectorXd::LinSpaced(n, 1, n);
24   A.row(n - 1) = RowVectorXd::LinSpaced(n, 1, n);
25   B = A.colwise().reverse();
26   // Matrix products
27   MatrixXd C = A * A, D = A * B;
28   spy(A, "Aspy_cpp.eps"); // Sparse arrow matrix
29   spy(B, "Bspy_cpp.eps"); // Sparse arrow matrix
30   spy(C, "Cspy_cpp.eps"); // Fully populated matrix
31   spy(D, "Dspy_cpp.eps"); // Sparse "framed" matrix
32   return 0;
33 }
This code also demonstrates the use of diagonal(), col(), row() for L-value access to parts of a
matrix.
P YTHON/MATPLOTLIB-command for visualizing the structure of a matrix: plt.spy(M)
2 A = np.diag(np.mgrid[:n])
3 A[:, -1] = A[-1, :] = np.mgrid[:n]
4 plt.spy(A)
5 plt.spy(A[::-1, :])
6 plt.spy(np.dot(A, A))
7 plt.spy(np.dot(A, B))
Remark 1.3.1.8 (Multiplying triangular matrices) The following result is useful when dealing with matrix
decompositions that often involve triangular matrices.
y
EXPERIMENT 1.3.1.10 (Scaling a matrix) Scaling = multiplication with diagonal matrices (with non-zero
diagonal entries):
It is important to know the different effect of multiplying with a diagonal matrix from left or right:
[Fig. 14: scaling effect of multiplication with a diagonal matrix from the left (scales the rows) and from the right (scales the columns).]
C++ code 1.3.1.11: Timing multiplication with scaling matrix in E IGEN ➺ GITLAB
2   int nruns = 3, minExp = 2, maxExp = 14;
3   MatrixXd tms(maxExp - minExp + 1, 4);
4   for (int i = 0; i <= maxExp - minExp; ++i) {
5     Timer tbad, tgood, topt; // timer class
6     int n = std::pow(2, minExp + i);
7     VectorXd d = VectorXd::Random(n, 1), x = VectorXd::Random(n, 1), y(n);
8     for (int j = 0; j < nruns; ++j) {
9       MatrixXd D = d.asDiagonal(); //
10      // matrix vector multiplication
11      tbad.start(); y = D * x; tbad.stop(); //
12      // componentwise multiplication
13      tgood.start(); y = d.cwiseProduct(x); tgood.stop(); //
14      // matrix multiplication optimized by Eigen
15      topt.start(); y = d.asDiagonal() * x; topt.stop(); //
16    }
17    tms(i, 0) = n;
18    tms(i, 1) = tgood.min(); tms(i, 2) = tbad.min(); tms(i, 3) = topt.min();
19  }
Hardly surprising, the component-wise multiplication of the two vectors is way faster than the intermediate
initialisation of a diagonal matrix (mainly populated by zeros) followed by the computation of a matrix×vector
product. Nevertheless, such blunders keep on haunting numerical codes. Do not rely solely on E IGEN
optimizations! y
Remark 1.3.1.12 (Row and column transformations) Simple operations on rows/columns of matrices,
cf. what was done in Exp. 1.2.3.9, can often be expressed as multiplication with special matrices: For
instance, given A ∈ K n,m we obtain B by adding row (A) j,: to row (A) j+1,: , 1 ≤ j < n.
Realisation through a matrix product:
B = ( I + e_{j+1} e_j⊤ ) A ,
where e_i denotes the i-th unit vector.
The matrix multiplying A from the left is a specimen of a transformation matrix, a matrix that coincides
with the identity matrix I except for a single off-diagonal entry.
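A minimal sketch (not from the original notes) contrasting the two realisations of such a row transformation; indices are 0-based as usual in C++:
C++ code (illustrative sketch):
#include <iostream>
#include <Eigen/Dense>

int main() {
  const int n = 4;
  const int j = 1; // add row j to row j+1
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
  // Variant 1: multiplication with the transformation matrix I + e_{j+1} e_j^T
  Eigen::MatrixXd T = Eigen::MatrixXd::Identity(n, n);
  T(j + 1, j) = 1.0;
  Eigen::MatrixXd B1 = T * A; // O(n^3) if the product is formed naively
  // Variant 2: direct row operation, cost O(n)
  Eigen::MatrixXd B2 = A;
  B2.row(j + 1) += B2.row(j);
  std::cout << (B1 - B2).norm() << std::endl; // ~0 up to roundoff
  return 0;
}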
§1.3.1.13 (Block matrix product) Given matrix dimensions M, N, K ∈ N and block sizes 1 ≤ n < N
(n′ := N − n), 1 ≤ m < M (m′ := M − m), 1 ≤ k < K (k′ := K − k), we start from the following
matrices:
A11 ∈ K^{m,n} , A12 ∈ K^{m,n′} , A21 ∈ K^{m′,n} , A22 ∈ K^{m′,n′} ,
B11 ∈ K^{n,k} , B12 ∈ K^{n,k′} , B21 ∈ K^{n′,k} , B22 ∈ K^{n′,k′} .
These matrices serve as sub-matrices or matrix blocks and are assembled into the larger matrices
A = [ A11 A12 ; A21 A22 ] ∈ K^{M,N} ,   B = [ B11 B12 ; B21 B22 ] ∈ K^{N,K} .
It turns out that the matrix product AB can be computed by the same formula as the product of simple
2 × 2-matrices:
[ A11 A12 ; A21 A22 ] [ B11 B12 ; B21 B22 ] = [ A11B11 + A12B21   A11B12 + A12B22 ; A21B11 + A22B21   A21B12 + A22B22 ] .   (1.3.1.14)
[Fig. 15: schematic visualisation of the block partitioning of A ∈ K^{M,N}, B ∈ K^{N,K} and of the product AB in (1.3.1.14).]
Bottom line: one can compute with block-structured matrices in almost (∗) the same ways as with matrices
with real/complex entries, see [QSS00, Sect. 1.3.3].
(∗): you must not use the commutativity of multiplication (because matrix multiplication is not
! commutative).
y
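The block product formula (1.3.1.14) can be checked directly with E IGEN’s block() method; the following sketch (not from the original notes) uses arbitrary small dimensions.
C++ code (illustrative sketch):
#include <iostream>
#include <Eigen/Dense>

int main() {
  const int M = 5, N = 6, K = 4; // full matrix dimensions
  const int m = 2, n = 3, k = 2; // block sizes; m' = M-m, n' = N-n, k' = K-k
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(M, N);
  Eigen::MatrixXd B = Eigen::MatrixXd::Random(N, K);
  Eigen::MatrixXd C(M, K); // assemble AB block by block as in (1.3.1.14)
  C.block(0, 0, m, k) = A.block(0, 0, m, n) * B.block(0, 0, n, k)
                      + A.block(0, n, m, N - n) * B.block(n, 0, N - n, k);
  C.block(0, k, m, K - k) = A.block(0, 0, m, n) * B.block(0, k, n, K - k)
                          + A.block(0, n, m, N - n) * B.block(n, k, N - n, K - k);
  C.block(m, 0, M - m, k) = A.block(m, 0, M - m, n) * B.block(0, 0, n, k)
                          + A.block(m, n, M - m, N - n) * B.block(n, 0, N - n, k);
  C.block(m, k, M - m, K - k) = A.block(m, 0, M - m, n) * B.block(0, k, n, K - k)
                              + A.block(m, n, M - m, N - n) * B.block(n, k, N - n, K - k);
  std::cout << (C - A * B).norm() << std::endl; // ~0 up to roundoff
  return 0;
}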
The BLAS API is standardised by the BLAS technical forum and, due to its history dating back to the 70s,
follows conventions of FORTRAN 77, see the Quick Reference Guide for examples. However, wrappers for
other programming languages are available. CPU manufacturers and/or developers of operating systems
usually supply highly optimised implementations:
• OpenBLAS: open source implementation with some general optimisations, available under BSD
license.
• ATLAS (Automatically Tuned Linear Algebra Software): open source BLAS implementation with
auto-tuning capabilities. Comes with C and FORTRAN interfaces and is included in Linux distribu-
tions.
• Intel MKL (Math Kernel Library): commercial highly optimised BLAS implementation available for all
Intel CPUs. Used by most proprietary simulation software and also M ATLAB.
EXPERIMENT 1.3.2.1 (Multiplying matrices in E IGEN)
The following E IGEN-based C++ code performs a multiplication of densely populated matrices in three
different ways:
1. Direct implementation of three nested loops
2. Realization by matrix×vector products
3. Use of the built-in matrix multiplication of E IGEN
36    timings(p, 1) = t1.min();
37    timings(p, 2) = t2.min();
38    timings(p, 3) = t3.min();
39    timings(p, 4) = t4.min();
40  }
41  std::cout << std::scientific << std::setprecision(3) << timings << std::endl;
[Fig. 16: doubly logarithmic plot of the measured times [s] versus the matrix size n for the dot-product implementation, the matrix-vector implementation, and the E IGEN matrix product. Platform: ubuntu 14.04 LTS, i7-3517U CPU @ 1.90GHz, L1 32 KB, L2 256 KB, L3 4096 KB, 8 GB main memory, gcc 4.8.4 with -O3. Boxed annotation (partially recovered): “simple loops almost competitive”.]
y
BLAS routines are grouped into “levels” according to the amount of data and computation involved (asymptotic complexity, see Section 1.4.1 and [GV89, Sect. 1.1.12]):
• Level 1: vector operations such as scalar products and vector norms.
  asymptotic complexity O(n) (with n ≙ vector length), e.g. dot product: ρ = x⊤y
• Level 2: vector-matrix operations such as matrix-vector multiplications.
  asymptotic complexity O(mn) (with (m, n) ≙ matrix size), e.g. matrix×vector multiplication: y = αAx + βy
• Level 3: matrix-matrix operations such as matrix additions or multiplications.
  asymptotic complexity often O(nmk) (with (n, m, k) ≙ matrix sizes), e.g. matrix product: C = AB
Syntax of BLAS calls:
The functions have been implemented for different scalar types and are distinguished by the first letter of the
function name. E.g. sdot is the dot product implementation for single precision and ddot for double
precision.
xDOT(N,X,INCX,Y,INCY)
– x ∈ {S, D}, scalar type: S ≙ type float, D ≙ type double
– N ≙ length of vector (modulo stride INCX)
– X ≙ vector x: array of type x
– INCX ≙ stride for traversing vector X
– Y ≙ vector y: array of type x
– INCY ≙ stride for traversing vector Y
xAXPY(N,ALPHA,X,INCX,Y,INCY)
– x ∈ {S, D, C, Z}, scalar type: S ≙ type float, D ≙ type double, C ≙ type complex
– N ≙ length of vector (modulo stride INCX)
– ALPHA ≙ scalar α
– X ≙ vector x: array of type x
– INCX ≙ stride for traversing vector X
– Y ≙ vector y: array of type x
– INCY ≙ stride for traversing vector Y
✦ BLAS LEVEL 2: matrix-vector operations, asymptotic complexity O(mn), (m, n) ≙ matrix size
xGEMV(TRANS,M,N,ALPHA,A,LDA,X,INCX,BETA,Y,INCY)
– x ∈ {S, D, C, Z}, scalar type: S ≙ type float, D ≙ type double, C ≙ type complex
– M, N ≙ size of matrix A
– ALPHA ≙ scalar parameter α
– A ≙ matrix A stored in a linear array of length M · N (column major arrangement),
  (A)i,j = A[M·(j − 1) + i] .
✦ BLAS LEVEL 3: matrix-matrix operations, e.g.
xGEMM(TRANSA,TRANSB,M,N,K,ALPHA,A,LDA,B,LDB,BETA,C,LDC)
(☞ meaning of the arguments as above)
Remark 1.3.2.3 (BLAS calling conventions) The BLAS calling syntax seems queer in light of modern
object oriented programming paradigms, but it is a legacy of FORTRAN77, which was (and partly still is)
the programming language, in which the BLAS routines were coded.
It is a very common situation in scientific computing that one has to rely on old codes and libraries imple-
mented in an old-fashioned style. y
EXAMPLE 1.3.2.4 (Calling BLAS routines from C/C++) When calling BLAS library functions from C,
all arguments have to be passed by reference (as pointers), in order to comply with the argument passing
mechanism of FORTRAN77, which is the model followed by BLAS.
16 int main() {
17   cout << "Demo code for NumCSE course: call basic BLAS routines from C++"
18        << endl;
19   const int n = 5;          // length of vector
20   const int incx = 1;       // stride
21   const int incy = 1;       // stride
22   const double alpha = 2.5; // scaling factor
23
28   for (size_t i = 0; i < n; i++) {
29     x[i] = 3.1415 * static_cast<double>(i);
30     y[i] = 1.0 / static_cast<double>(i + 1);
31   }
32
41   }
42   cout << "]" << endl;
43
When using E IGEN in a mode that includes an external BLAS library, all this calls are wrapped into E IGEN
methods. y
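As the listing above is reproduced only in part, the following self-contained sketch (not from the original notes) shows one possible way to call the BLAS routine DDOT directly from C++. It assumes a Fortran BLAS library with the common trailing-underscore name mangling (symbol ddot_) and linking with -lblas; the exact symbol name and linking details are platform dependent, and with a CBLAS header one would call cblas_ddot instead.
C++ code (illustrative sketch, platform-dependent assumptions):
#include <iostream>
#include <vector>

// Fortran BLAS routine DDOT; all arguments are passed by pointer,
// in line with the FORTRAN77 argument passing convention.
extern "C" double ddot_(const int *n, const double *x, const int *incx,
                        const double *y, const int *incy);

int main() {
  const int n = 5, incx = 1, incy = 1;
  std::vector<double> x(n), y(n);
  for (int i = 0; i < n; ++i) { x[i] = i + 1.0; y[i] = 1.0; }
  const double s = ddot_(&n, x.data(), &incx, y.data(), &incy);
  std::cout << "x . y = " << s << std::endl; // expected: 15
  return 0;
}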
EXAMPLE 1.3.2.6 (Using Intel Math Kernel Library (Intel MKL) from E IGEN) The
Intel Math Kernel Library is a highly optimized math library for Intel processors and can be called
directly from E IGEN, see E IGEN documentation on “Using Intel® Math Kernel Library from Eigen”.
C++-code 1.3.2.7: Timing of matrix multiplication in E IGEN for MKL comparison ➺ GITLAB
2 //! script for timing different implementations of matrix multiplications
3 void mmeigenmkl() {
4   int nruns = 3, minExp = 6, maxExp = 13;
5   MatrixXd timings(maxExp - minExp + 1, 2);
6   for (int p = 0; p <= maxExp - minExp; ++p) {
7     Timer t1; // timer class
8     int n = std::pow(2, minExp + p);
9     MatrixXd A = MatrixXd::Random(n, n);
10    MatrixXd B = MatrixXd::Random(n, n);
11    MatrixXd C = MatrixXd::Zero(n, n);
12    for (int q = 0; q < nruns; ++q) {
13      t1.start();
14      C = A * B;
15      t1.stop();
16    }
17    timings(p, 0) = n; timings(p, 1) = t1.min();
18  }
19  std::cout << std::scientific << std::setprecision(3) << timings << std::endl;
20 }
Timing results:
n E IGEN sequential [s] E IGEN parallel [s] MKL sequential [s] MKL parallel [s]
64 1.318e-04 1.304e-04 6.442e-05 2.401e-05
128 7.168e-04 2.490e-04 4.386e-04 1.336e-04
256 6.641e-03 1.987e-03 3.000e-03 1.041e-03
512 2.609e-02 1.410e-02 1.356e-02 8.243e-03
1024 1.952e-01 1.069e-01 1.020e-01 5.728e-02
2048 1.531e+00 8.477e-01 8.581e-01 4.729e-01
4096 1.212e+01 6.635e+00 7.075e+00 3.827e+00
8192 9.801e+01 6.426e+01 5.731e+01 3.598e+01
[Fig. 17, Fig. 18: doubly logarithmic plots of the timings from the table above versus the matrix size n: absolute runtimes [s] (Fig. 17) and scaled runtimes (Fig. 18), for E IGEN sequential, E IGEN parallel, MKL sequential, and MKL parallel.]
Video tutorial for Section 1.4 "Computational Effort": (29 minutes) Download link, tablet notes
The following definition encapsulates what is regarded as a measure for the “cost” of an algorithm in
computational mathematics.
§1.4.0.2 (What computational effort does not tell us) Fifty years ago counting elementary operations
provided good predictions of runtimes, but nowadays this is no longer true.
The computational effort involved in a run of a numerical code is only loosely related
! to overall execution time on modern computers.
This is conspicuous in Exp. 1.2.3.9, where algorithms incurring exactly the same computational effort took
different times to execute.
The reason is that on today’s computers a key bottleneck for fast execution is latency and bandwidth of
memory, cf. the discussion at the end of Exp. 1.2.3.9 and [KW03]. Thus, concepts like I/O-complexity
[AV88; GJ10] might be more appropriate for gauging the efficiency of a code, because they take into
account the pattern of memory access. y
• Problem size parameters in numerical linear algebra usually are the lengths and dimensions of the
vectors and matrices that an algorithm takes as inputs.
• Worst case indicates that the maximum effort over a set of admissible data is taken into account.
We write F (n) = O( G (n)) for two functions F, G : N → R, if there exists a constant C > 0 and
n∗ ∈ N such that
F (n) ≤ C G (n) ∀n ≥ n∗ .
Remark 1.4.1.3 (Meaningful “O-bounds” for complexity) Of course, the definition of the Landau symbol
leaves ample freedom for stating meaningless bounds; an algorithm that runs with linear complexity O(n)
can be correctly labelled as possessing O(exp(n)) complexity.
Yet, whenever the Landau notation is used to describe asymptotic complexities, the bounds have to be
sharp in the sense that no function with slower asymptotic growth will be possible inside the O. To make
this precise we stipulate the following.
Whenever the asymptotic complexity of an algorithm is stated as O(nα logβ n exp(γnδ )) with non-
negative parameters α, β, γ, δ ≥ 0 in terms of the problem size parameter n, we take for granted
that choosing a smaller value for any of the parameters will no longer yield a valid (or provable)
asymptotic bound.
In particular
✦ complexity O(n) means that the complexity is not O(nα ) for any α < 1,
✦ complexity O(exp(n)) excludes asymptotic complexity O(n p ) for any p ∈ R.
Terminology: If the asymptotic complexity of an algorithm is O(n p ) with p = 1, 2, 3 we say that it is of
“linear”, “quadratic”, and “cubic” complexity, respectively.
y
Remark 1.4.1.5 (Relevance of asymptotic complexity) § 8.4.3.14 warned us that computational effort
and, thus, asymptotic complexity, of an algorithm for a concrete problem on a particular platform may
not have much to do with the actual runtime (the blame goes to memory hierarchies, internal pipelining,
vectorisation, etc.).
To a certain extent, the asymptotic complexity allows us to predict the dependence of the runtime of a
particular implementation of an algorithm on the problem size (for large problems).
For instance, an algorithm with asymptotic complexity O(n2 ) is likely to take 4× as much time when the
problem size is doubled. y
If the conjecture holds true, then the points (ni, ti) will approximately lie on a straight
line with slope α in a doubly logarithmic plot (which can be created in P YTHON by the
matplotlib.pyplot.loglog plotting command).
➣ Offers a quick “visual test” of conjectured asymptotic complexity
More rigorous: Perform linear regression on (log ni , log ti ), i = 1, . . . , N (→ Chapter 3)
y
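A crude quantitative variant of this “visual test” (a sketch, not from the original notes) estimates the exponent α from just two measurements (n1, t1), (n2, t2) via α ≈ log(t2/t1)/log(n2/n1):
C++ code (illustrative sketch):
#include <cmath>
#include <iostream>

// Estimate the exponent alpha in the model t(n) = C * n^alpha
// from two runtime measurements (n1, t1) and (n2, t2).
double estimateExponent(double n1, double t1, double n2, double t2) {
  return std::log(t2 / t1) / std::log(n2 / n1);
}

int main() {
  // Hypothetical measurements: doubling n increases t by a factor of about 4
  std::cout << estimateExponent(1024, 0.05, 2048, 0.21) << std::endl; // ~2.07
  return 0;
}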
C11 = Q0 + Q3 − Q4 + Q6 ,
C21 = Q1 + Q3 ,
C12 = Q2 + Q4 ,
C22 = Q0 + Q2 − Q1 + Q5 ,
where the Qk ∈ K ℓ,ℓ , k = 0, . . . , 6 are obtained from
A refined algorithm of this type can achieve complexity O(n2.36 ), see [CW90]. y
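The products Q0, . . . , Q6 are not reproduced above; the following recursive sketch (not from the original notes) uses the standard Strassen products, which are consistent with the combinations C11 = Q0 + Q3 − Q4 + Q6, C21 = Q1 + Q3, C12 = Q2 + Q4, C22 = Q0 + Q2 − Q1 + Q5 stated above. It assumes square matrices whose size is a power of 2.
C++ code (illustrative sketch):
#include <Eigen/Dense>
using Eigen::MatrixXd;

// Recursive Strassen multiplication; only 7 (instead of 8) half-sized
// products are needed per recursion level.
MatrixXd strassen(const MatrixXd &A, const MatrixXd &B) {
  const int n = A.rows();
  if (n <= 64) return A * B; // fall back to the standard product for small blocks
  const int l = n / 2;
  const MatrixXd A11 = A.topLeftCorner(l, l), A12 = A.topRightCorner(l, l),
                 A21 = A.bottomLeftCorner(l, l), A22 = A.bottomRightCorner(l, l);
  const MatrixXd B11 = B.topLeftCorner(l, l), B12 = B.topRightCorner(l, l),
                 B21 = B.bottomLeftCorner(l, l), B22 = B.bottomRightCorner(l, l);
  const MatrixXd Q0 = strassen(A11 + A22, B11 + B22);
  const MatrixXd Q1 = strassen(A21 + A22, B11);
  const MatrixXd Q2 = strassen(A11, B12 - B22);
  const MatrixXd Q3 = strassen(A22, B21 - B11);
  const MatrixXd Q4 = strassen(A11 + A12, B22);
  const MatrixXd Q5 = strassen(A21 - A11, B11 + B12);
  const MatrixXd Q6 = strassen(A12 - A22, B21 + B22);
  MatrixXd C(n, n);
  C.topLeftCorner(l, l)     = Q0 + Q3 - Q4 + Q6;
  C.topRightCorner(l, l)    = Q2 + Q4;
  C.bottomLeftCorner(l, l)  = Q1 + Q3;
  C.bottomRightCorner(l, l) = Q0 + Q2 - Q1 + Q5;
  return C;
}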
(1.4.3.2):  y = (a b⊤) x      vs.      (1.4.3.3):  y = a (b⊤ x) .
E IGEN:  y = (a*b.transpose())*x;      vs.      y = a*b.dot(x);
➤ complexity O(mn)      vs.      ➤ complexity O(n + m) (“linear complexity”)
[Fig. 20: doubly logarithmic plot of the measured runtimes of the two variants versus the problem size n. Platform: ubuntu 14.04 LTS, i7-3517U CPU @ 1.90GHz, L1 32 KB, L2 256 KB, L3 4096 KB, 8 GB main memory, gcc 4.8.4 with -O3.]
The asymptotic complexity of a naive implementation of y = triu(AB⊤)x, when supplied with two low-rank matrices A, B ∈ K^{n,p}, p ≪ n, in terms of n → ∞ obviously is O(n²),
because an intermediate n × n-matrix AB⊤ is built.
First, consider the case of a tensor product (= rank-1) matrix, that is, p = 1, A ↔ a = [a1, . . . , an]⊤ ∈ K^n,
B ↔ b = [b1, . . . , bn]⊤ ∈ K^n. Then the matrix triu(ab⊤) has the entries a_i b_j for j ≥ i and 0 for j < i:
triu(ab⊤) =
[ a1b1  a1b2  . . .  a1bn ]
[  0    a2b2  . . .  a2bn ]
[  ⋮          ⋱       ⋮   ]
[  0    . . .   0    anbn ]
 = diag(a1, . . . , an) · T · diag(b1, . . . , bn) ,   T :=
[ 1  1  . . .  1 ]
[ 0  1  . . .  1 ]
[ ⋮       ⋱    ⋮ ]
[ 0  . . .  0  1 ]   (upper triangular matrix of ones) ,
so that
y = triu(ab⊤) x = diag(a1, . . . , an) · ( T · ( diag(b1, . . . , bn) · x ) ) .
The brackets indicate the order of the matrix×vector multiplications. Thus, the core problem is
the fast multiplication of a vector with an upper triangular matrix T described in E IGEN syntax by
Eigen::MatrixXd::Ones(n,n).triangularView<Eigen::Upper>(). Note that multipli-
cation of a vector x with T yields a vector of partial sums of the components of x, starting from the last
component:
T [ v1, v2, . . . , vn ]⊤ = [ sn, sn−1, . . . , s1 ]⊤ ,   s_j := ∑_{k=n−j+1}^{n} v_k ,
that is, (Tv)_i = v_i + v_{i+1} + · · · + v_n .
This can be achieved by invoking the special C++ command std::partial_sum from the C++ stan-
dard library (documentation). We also observe that
AB⊤ = ∑_{ℓ=1}^{p} (A):,ℓ ((B):,ℓ)⊤ ,
so that the computations for the special case p = 1 discussed above can simply be reused p times!
C++ code 1.4.3.6: Efficient multiplication with the upper diagonal part of a rank- p-matrix in
E IGEN ➺ GITLAB
2 //! Computation of y = triu(AB^T)x
3 //! Efficient implementation with backward cumulative sum
4 //! (partial_sum)
5 template <class Vec, class Mat>
6 void lrtrimulteff(const Mat &A, const Mat &B, const Vec &x, Vec &y) {
7   const int n = A.rows();
8   const int p = A.cols();
9   assert(n == B.rows() && p == B.cols()); // size mismatch
10  for (int l = 0; l < p; ++l) {
11    Vec tmp = (B.col(l).array() * x.array()).matrix().reverse();
12    std::partial_sum(tmp.begin(), tmp.end(), tmp.begin());
13    y += (A.col(l).array() * tmp.reverse().array()).matrix();
14  }
15 }
This code enjoys the obvious complexity of O( pn) for p, n → ∞, p < n. The code offers an example of a
function templated with its argument types, see § 0.3.2.1. The types Vec and Mat must fit the concept of
E IGEN vectors/matrices. y
The next concept from linear algebra is important in the context of computing with multi-dimensional arrays.
EXAMPLE 1.4.3.8 (Multiplication of Kronecker product with vector) A naive implementation of x ↦ (A ⊗ B)x, when
invoked with two matrices A ∈ K^{m,n} and B ∈ K^{l,k} and a vector x ∈ K^{nk}, will suffer an asymptotic
complexity of O(m · n · l · k), determined by the size of the intermediate dense matrix A ⊗ B ∈ K^{ml,nk}.
The idea is to form the products Bx j , j = 1, . . . , n, once, and then combine them linearly with coefficients
given by the entries in the rows of A:
C++ code 1.4.3.9: Efficient multiplication of Kronecker product with vector in E IGEN
➺ GITLAB
2 template <class Matrix, class Vector>
3 Vector kronmultv(const Matrix &A, const Matrix &B, const Vector &x) {
4   const size_t m = A.rows();
5   const size_t n = A.cols();
6   const size_t l = B.rows();
7   const size_t k = B.cols();
8   // 1st matrix mult. computes the products Bx_j
9   // 2nd matrix mult. combines them linearly with the coefficients of A
10  Matrix t = B * Matrix::Map(x.data(), k, n) * A.transpose(); //
11  return Matrix::Map(t.data(), m * l, 1);
12 }
Recall the reshaping of a matrix in E IGEN in order to understand this code: Rem. 1.2.3.6. Note a new
twist: Here the Map() member function of an E IGEN data type is used, where X::Map(<args>) is
roughly equivalent to Eigen::Map<X>(<args>).
The asymptotic complexity of this code is determined by the two matrix multiplications in Line 10. This
yields the asymptotic complexity O(lkn + mnl ) for l, k, m, n → ∞.
Note that different reshaping is used in the P YTHON code due to the default row major storage order. y
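A possible way to test kronmultv() (a usage sketch, not from the original notes) is to compare with the explicitly assembled Kronecker product from E IGEN’s unsupported KroneckerProduct module:
C++ code (illustrative sketch):
#include <iostream>
#include <Eigen/Dense>
#include <unsupported/Eigen/KroneckerProduct>

// kronmultv() as defined in Code 1.4.3.9 is assumed to be available here.

int main() {
  const int m = 3, n = 4, l = 2, k = 5;
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(m, n);
  Eigen::MatrixXd B = Eigen::MatrixXd::Random(l, k);
  Eigen::VectorXd x = Eigen::VectorXd::Random(n * k);
  // Reference: assemble the Kronecker product explicitly (memory cost O(m*n*l*k)!)
  Eigen::MatrixXd Kr = Eigen::kroneckerProduct(A, B);
  Eigen::VectorXd y_ref = Kr * x;
  Eigen::VectorXd y = kronmultv(A, B, x);
  std::cout << (y - y_ref).norm() << std::endl; // ~0 up to roundoff
  return 0;
}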
Review question(s) 1.4.3.11 (Computational effort)
(Q1.4.3.11.A) Explain why the classical concept of “computational effort” (= computational cost) is only
loosely related to the runtime of a concrete implementation of an algorithm.
(Q1.4.3.11.B) We are given two dense matrices A, B ∈ R^{n,p}, n, p ∈ N, p < n fixed. What is the asymp-
totic complexity of each of the following two lines of code in terms of n, p → ∞?
Eigen::MatrixXd AB = A*B.transpose();
y = AB.triangularView<Eigen::Upper>()*x;
Outline an efficient algorithm for computing the matrix-vector product Ax, x ∈ R n . What is its asymp-
totic complexity for n → ∞?
(Q1.4.3.11.D) [Matrix×vector multiplication involving Kronecker product, cf. Ex. 1.4.3.8] How do you
have to modify the implementation of the function kronmultv() from Ex. 1.4.3.8 to ensure that it is
still efficient even when called with long column or row vectors as arguments A and B, in particular in
the cases A ∈ R n,1 , B ∈ R k,1 or A ∈ R1,n , B ∈ R1,k , n, k ≫ 1?
C++ code 1.4.3.9: (Potentially not so) efficient multiplication of Kronecker product with
vector in E IGEN ➺ GITLAB
2 template <class Matrix, class Vector>
3 Vector kronmultv(const Matrix &A, const Matrix &B, const Vector &x) {
4   const size_t m = A.rows();
5   const size_t n = A.cols();
6   const size_t l = B.rows();
7   const size_t k = B.cols();
8   // 1st matrix mult. computes the products Bx_j
9   // 2nd matrix mult. combines them linearly with the coefficients of A
10  Matrix t = B * Matrix::Map(x.data(), k, n) * A.transpose(); //
11  return Matrix::Map(t.data(), m * l, 1);
12 }
(Q1.4.3.11.E) [Multiplication with a “kernel matrix”] Ex. 1.4.3.5 gave us efficient functions
Eigen::VectorXd mvtriutp( const Eigen::Vector &a,
const Eigen::Vector &b, const Eigen::Vector &x);
for evaluating
(Q1.4.3.11.F) [Initialization of a “kernel matrix”] A call to the black-box function double f(unsigned
int i) might be expensive. Outline an efficient C++ code for the initialization of the matrix
Video tutorial for Section 1.5 "Machine Arithmetic and Consequences": (16 minutes)
Download link, tablet notes
We will soon learn the rationale behind the odd test in Line 13.
In P YTHON the same algorithm can be implemented as follows:
Note the different loop range due to the zero-based indexing in P YTHON.
y
EXPERIMENT 1.5.1.5 (Unstable Gram-Schmidt orthonormalization) If {a1, . . . , ak} are linearly inde-
pendent, we expect the output vectors q1, . . . , qk to be orthonormal:
q_i⊤ q_j = δij ,  i, j = 1, . . . , k .   (1.5.1.6)
This property can easily be tested numerically, for instance by computing Q⊤Q for the matrix Q =
[q1, . . . , qk] ∈ R^{n,k}.
C++ code 1.5.1.7: Wrong result from Gram-Schmidt orthogonalisation E IGEN ➺ GITLAB
2 void gsroundoff(MatrixXd &A) {
3   // Gram-Schmidt orthogonalization of columns of A, see Code 1.5.1.3
4   MatrixXd Q = gramschmidt(A);
5   // Test orthonormality of columns of Q, which should be an
6   // orthogonal matrix according to theory
We test the orthonormality of the output vectors of Gram-Schmidt orthogonalization for a special matrix
A ∈ R^{10,10}, a so-called Hilbert matrix, defined by (A)i,j = (i + j − 1)⁻¹. Then Code 1.5.1.7 produces
the following output:
I =
1.0000 0.0000 -0.0000 0.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000
0.0000 1.0000 -0.0000 0.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000
-0.0000 -0.0000 1.0000 0.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000
0.0000 0.0000 0.0000 1.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000
-0.0000 -0.0000 -0.0000 -0.0000 1.0000 0.0000 -0.0008 -0.0007 -0.0007 -0.0006
0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 -0.0540 -0.0430 -0.0360 -0.0289
-0.0000 -0.0000 -0.0000 -0.0000 -0.0008 -0.0540 1.0000 0.9999 0.9998 0.9996
-0.0000 -0.0000 -0.0000 -0.0000 -0.0007 -0.0430 0.9999 1.0000 1.0000 0.9999
-0.0000 -0.0000 -0.0000 -0.0000 -0.0007 -0.0360 0.9998 1.0000 1.0000 1.0000
-0.0000 -0.0000 -0.0000 -0.0000 -0.0006 -0.0289 0.9996 0.9999 1.0000 1.0000
Obviously, the vectors produced by the function gramschmidt fail to be orthonormal, contrary to the
predictions of rigorous results from linear algebra!
However, Line 11, Line 12 of Code 1.5.1.7 demonstrate another way to orthonormalize the columns of a
matrix using E IGEN’s built-in class template HouseholderQR (more details in Section 3.3.3).
I1 =
1.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 0.0000
-0.0000 1.0000 -0.0000 0.0000 0.0000 -0.0000 -0.0000 -0.0000 0.0000 -0.0000
0.0000 -0.0000 1.0000 -0.0000 -0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
-0.0000 0.0000 -0.0000 1.0000 0.0000 -0.0000 -0.0000 0.0000 0.0000 0.0000
-0.0000 0.0000 -0.0000 0.0000 1.0000 -0.0000 0.0000 -0.0000 -0.0000 0.0000
-0.0000 -0.0000 0.0000 -0.0000 -0.0000 1.0000 -0.0000 -0.0000 0.0000 -0.0000
-0.0000 -0.0000 0.0000 -0.0000 0.0000 -0.0000 1.0000 0.0000 0.0000 -0.0000
-0.0000 -0.0000 0.0000 0.0000 -0.0000 -0.0000 0.0000 1.0000 -0.0000 0.0000
-0.0000 0.0000 0.0000 0.0000 -0.0000 0.0000 0.0000 -0.0000 1.0000 -0.0000
0.0000 -0.0000 0.0000 0.0000 0.0000 -0.0000 -0.0000 0.0000 -0.0000 1.0000
Now we observe apparently perfect orthogonality (1.5.1.6) of the columns of the matrix Q1 in Code 1.5.1.7.
Obviously, there is another algorithm that reliably yields the theoretical output of Gram-Schmidt orthogo-
nalization. There is no denying that it is possible to compute Gram-Schmidt orthonormalization in a “clean”
way. y
Computers cannot compute “properly” in R: numerical computations may not respect the laws of
analysis and linear algebra!
Remark 1.5.1.9 (Stable orthonormalization by QR-decomposition) In Code 1.5.1.7 we saw the use of
the E IGEN class HouseholderQR<MatrixType> for the purpose of Gram-Schmidt orthogonalisation.
The underlying theory and algorithms will be explained later in Section 3.3.3. There we will have the
following insight:
➣ Up to signs the columns of the matrix Q available from the QR-decomposition of A are the same
vectors as produced by the Gram-Schmidt orthogonalisation of the columns of A.
Code 1.5.1.7 demonstrates a case where a desired result can be obtained by two algebraically
equivalent computations, that is, they yield the same result in a mathematical sense. Yet, when
! implemented on a computer, the results can be vastly different. One algorithm may produce junk
(“unstable algorithm”), whereas the other lives up to the expectations (“stable algorithm”)
Supplement to Exp. 1.5.1.5: despite its ability to produce orthonormal vectors, we get as output for
D=A-Q1*R1 in Code 1.5.1.7:
D =
2.2204e-16 3.3307e-16 3.3307e-16 1.9429e-16 1.9429e-16 5.5511e-17 1.3878e-16 6.9389e-17 8.3267e-17 9.7145e-17
0.0000e+00 1.1102e-16 8.3267e-17 5.5511e-17 0.0000e+00 5.5511e-17 -2.7756e-17 0.0000e+00 0.0000e+00 4.1633e-17
-5.5511e-17 5.5511e-17 2.7756e-17 5.5511e-17 0.0000e+00 0.0000e+00 0.0000e+00 -1.3878e-17 1.3878e-17 1.3878e-17
0.0000e+00 5.5511e-17 2.7756e-17 2.7756e-17 0.0000e+00 1.3878e-17 -1.3878e-17 0.0000e+00 1.3878e-17 2.7756e-17
0.0000e+00 2.7756e-17 0.0000e+00 1.3878e-17 1.3878e-17 1.3878e-17 0.0000e+00 1.3878e-17 1.3878e-17 4.1633e-17
-2.7756e-17 2.7756e-17 1.3878e-17 4.1633e-17 2.7756e-17 1.3878e-17 0.0000e+00 -1.3878e-17 2.7756e-17 2.7756e-17
0.0000e+00 2.7756e-17 0.0000e+00 2.7756e-17 2.7756e-17 1.3878e-17 0.0000e+00 1.3878e-17 2.7756e-17 2.0817e-17
0.0000e+00 2.7756e-17 2.7756e-17 1.3878e-17 1.3878e-17 1.3878e-17 0.0000e+00 1.3878e-17 2.0817e-17 2.7756e-17
1.3878e-17 1.3878e-17 1.3878e-17 2.7756e-17 1.3878e-17 0.0000e+00 -1.3878e-17 6.9389e-18 -6.9389e-18 1.3878e-17
0.0000e+00 2.7756e-17 1.3878e-17 1.3878e-17 1.3878e-17 0.0000e+00 0.0000e+00 0.0000e+00 1.3878e-17 1.3878e-17
➥ The computed QR-decomposition apparently fails to meet the exact algebraic requirements stipulated
by Thm. 3.3.3.4. However, note the tiny size of the “defect”. y
The set of machine numbers M cannot be closed under elementary arithmetic operations
+, −, ·, /, that is, when adding, multiplying, etc., two machine numbers the result may not belong
to M.
The results of elementary operations with operands in M have to be mapped back to M, an oper-
ation called rounding.
The impact of roundoff means that mathematical identities may not carry over to the computational realm.
As we have seen above in Exp. 1.5.1.5
Computers cannot compute “properly” !
numerical computations ≠ analysis / linear algebra
This introduces a new and important aspect in the study of numerical algorithms!
§1.5.2.2 (Internal representation of machine numbers) Now we give a brief sketch of the internal
structure of machine numbers ∈ M. The main insight will be that machine numbers are floating point numbers, characterized by a finite number of digits for the mantissa and a bounded range for the exponent.
EXAMPLE 1.5.2.3 (Decimal floating point numbers) Some 3-digit normalized decimal floating point
numbers:
valid: 0.723 · 10^2 , 0.100 · 10^{−20} , −0.801 · 10^5
invalid: 0.033 · 10^2 , 1.333 · 10^{−4} , −0.002 · 10^3
General form of an m-digit normalized decimal floating point number with base B:
machine number ∈ M :   x = ± 0.d_1 d_2 … d_m · B^E ,  d_1 ≠ 0 ,
with m digits d_1, …, d_m for the mantissa (the leading digit is never 0) and a fixed number of digits for the exponent E.
Remark 1.5.2.5 (Extremal numbers in M) Clearly, there is a largest element of M and two that are
closest to zero. These are mainly determined by the range for the exponent E, cf. Def. 1.5.2.4.
Largest machine number (in modulus) : x_max = max |M| = (1 − B^{−m}) · B^{e_max}
Smallest machine number (in modulus) : x_min = min |M| = B^{−1} · B^{e_min}
In C++ these extremal values can be queried by calling std::numeric_limits<double>::max()
and std::numeric_limits<double>::min().
Other properties of arithmetic types can be queried accordingly through the facilities of the <limits> header.
y
Remark 1.5.2.6 (Distribution of machine numbers) From Def. 1.5.2.4 it is clear that there are equi-
spaced sections of M and that the gaps between machine numbers are bigger for larger numbers, see
also [AG11, Fig. 2.3].
Near B^{e_min−1} the spacing of the machine numbers is B^{e_min−m}; in the subsequent sections it grows to B^{e_min−m+1}, B^{e_min−m+2}, and so on. The gap around 0 is partly filled with non-normalized numbers.
Non-normalized numbers violate the lower bound for the mantissa in Def. 1.5.2.4. y
§1.5.2.7 (IEEE standard 754 for machine numbers → [Ove01], [AG11, Sect. 2.4], → link) No sur-
prise: for modern computers B = 2 (binary system), the other parameters of the universally implemented
machine number system are
single precision : m = 24∗ ,E ∈ {−125, . . . , 128} ➣ 4 bytes
double precision : m = 53∗ ,E ∈ {−1021, . . . , 1024} ➣ 8 bytes
∗: including bit indicating sign
The standardisation of machine numbers is important, because it ensures that the same numerical algo-
rithm, executed on different computers will nevertheless produce the same result. y
A small experiment (its code is not reproduced here) produced the output: inf, 0, −nan. These special values are encoded through the exponent bits E and mantissa bits M:
E = e_max , M ≠ 0 ≙ NaN = Not a number → exception
E = e_max , M = 0 ≙ Inf = Infinity → overflow
E = 0 ≙ non-normalized numbers → underflow
E = 0 , M = 0 ≙ number 0
In C++ these special values can be tested with the functions std::isnan() and std::isinf()
(see the C++ reference documentation). y
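The following minimal sketch (not one of the lecture codes) shows how Inf and NaN arise from overflow and invalid operations and how they can be detected:

#include <cmath>
#include <iostream>
#include <limits>

int main() {
  const double x = std::numeric_limits<double>::max();
  const double y = x * 2.0;  // overflow of an elementary operation -> inf
  const double z = y - y;    // inf - inf is undefined -> nan
  std::cout << y << " " << z << std::endl;
  std::cout << std::boolalpha
            << std::isinf(y) << " "    // prints: true
            << std::isnan(z) << std::endl;  // prints: true
  return 0;
}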
int main() {
  cout << numeric_limits<double>::is_iec559 << endl
       << std::defaultfloat << numeric_limits<double>::min() << endl
       << std::hexfloat << numeric_limits<double>::min() << endl
       << std::defaultfloat << numeric_limits<double>::max() << endl
       << std::hexfloat << numeric_limits<double>::max() << endl;
}
Output:
1 true
2 2.22507e−308
3 0010000000000000
4 1.79769e+308
5 7fefffffffffffff
y
Output:
1 2.22044604925031e−16
2 6.75015598972095e−14
3 −1.60798663273454e−09
Can you devise a similar calculation, whose result is even farther off zero? Apparently the rounding that
inevitably accompanies arithmetic operations in M can lead to results that are far away from the true
result. y
Definition 1.5.3.3. Absolute and relative error
Let x̃ ∈ K be an approximation of x ∈ K. Its absolute error and relative error are
ε_abs := |x − x̃| ,   ε_rel := |x − x̃| / |x| .
Remark 1.5.3.4 (Relative error and number of correct digits) The number of correct (significant, valid)
digits of an approximation x̃ of x ∈ K is defined through the relative error:
If ε_rel := |x − x̃| / |x| ≤ 10^{−ℓ} , then x̃ has ℓ correct digits, ℓ ∈ ℕ_0 .
To see this, write x and x̃ as base-10 floating point numbers that agree in their leading m digits,
x = d_1.d_2 d_3 … d_m d_{m+1} … d_n · 10^e ,  x̃ = d_1.d_2 d_3 … d_m d̃_{m+1} … d̃_n · 10^e ,  d_1 ≠ 0 .
We compute the relative error
ε = |x − x̃| / |x| = (0.0…0 δ_{m+1} … δ_n) / (d_1.d_2 d_3 … d_m d̃_{m+1} … d̃_n) ,  δ_{m+1} ≠ 0 ,
with m zeros behind the decimal point in the numerator, so that ε ≤ 10^{−m}. This means that x̃ has m correct digits.
§1.5.3.5 (Floating point operations) We may think of the elementary binary operations +, −, ∗, / in M
comprising two steps:
➊ Compute the exact result of the operation.
➋ Perform rounding of the result of ➊ to map it back to M.
Definition 1.5.3.6. Correct rounding
Correct rounding is the map rd : ℝ → M with rd(x) ∈ argmin_{x̃∈M} |x − x̃| for all x ∈ ℝ (ties are broken by a fixed convention, e.g. “round to even”).
(Recall that argmin_x F(x) is the set of arguments of a real valued function F that makes it attain its (global)
minimum.)
Of course, ➊ above is not possible in a strict sense, but the effect of both steps can be realised and yields the machine realization x ⋆̃ y := rd(x ⋆ y) of the elementary operation ⋆.
§1.5.3.8 (Estimating roundoff errors → [AG11, p. 23]) Let us denote by EPS the largest relative error
(→ Def. 1.5.3.3) incurred through rounding:
EPS := max_{x∈I} |rd(x) − x| / |x| ,   (1.5.3.9)
where I := [min |M|, max |M|] ⊂ ℝ covers the range of the positive machine numbers.
For machine numbers according to Def. 1.5.2.4 EPS can be computed from the defining parameters B
(base) and m (length of mantissa) [AG11, p. 24]:
However, when studying roundoff errors, we do not want to delve into the intricacies of the internal repre-
sentation of machine numbers. This can be avoided by just using a single bound for the relative error due
to rounding, and, thus, also for the relative error potentially suffered in each elementary operation.
There is a small positive number EPS, the machine precision, such that for the elementary arithmetic
operations ⋆ ∈ {+, −, ·, /} and “hard-wired” functions∗ f ∈ {exp, sin, cos, log, . . .} holds
x ⋆̃ y = (x ⋆ y)(1 + δ) ,  f̃(x) = f(x)(1 + δ)   ∀ x, y ∈ M ,  with |δ| ≤ EPS .
EXAMPLE 1.5.3.12 (Machine precision for IEEE standard) In C++ the machine precision can be queried as follows:
int main() {
  std::cout.precision(15);
  std::cout << std::numeric_limits<double>::epsilon() << std::endl;
}
Output:
1 2.22044604925031e−16
Knowing the machine precision can be important for checking the validity of computations or coding ter-
mination conditions for iterative approximations. y
In fact, the following “definition” of EPS is sometimes used: EPS is the smallest positive number ∈ M for which 1 +̃ EPS ≠ 1 (in M). This is illustrated by the code

cout.precision(25);
const double eps = std::numeric_limits<double>::epsilon();
cout << std::fixed << 1.0 + 0.5*eps << endl
     << 1.0 - 0.5*eps << endl
     << (1.0 + 2/eps) - 2/eps << endl;

Output:
1 1.0000000000000000000000000
2 0.9999999999999998889776975
3 0.0000000000000000000000000
We find that 1 +̃ EPS = 1 actually complies with the “axiom” of roundoff error analysis, Ass. 1.5.3.11:
1 = (1 + EPS)(1 + δ)  ⇒  |δ| = EPS / (1 + EPS) < EPS ,
2/EPS = (1 + 2/EPS)(1 + δ)  ⇒  |δ| = EPS / (2 + EPS) < EPS .
y
!
Do we have to worry about these tiny roundoff errors ?
To motivate this rule, think of a code where the number stored in x is a length. One may have chosen µm
as reference unit and then x=0.1. Another user prefers km as length unit, which means that x=1.0E-10
in this case. Just comparing x to a fixed threshold will lead to a different behavior of the code depending on
the choice of physical units, which is certainly not desirable. In general numerical codes should largely be
insensitive to the choice of physical units, a property called scaling invariance. From these considerations
we also conclude that a guideline for choosing the comparison variable s is that it should represent a
quantity with the same physical units. y
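A minimal sketch of such a scaling-invariant smallness test is given below; the function name, the choice of the reference quantity s, and the tolerance are illustrative assumptions, not prescriptions from the text:

#include <cmath>
#include <limits>

// Scaling-invariant test: |x| is judged "negligible" relative to a reference
// quantity s that carries the same physical units as x.
bool isNegligible(double x, double s,
                  double rtol = 1e3 * std::numeric_limits<double>::epsilon()) {
  return std::abs(x) <= rtol * std::abs(s);
}

With this test, rescaling both x and s (e.g. switching from micrometers to kilometers) leaves the outcome unchanged.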
Remark 1.5.3.16 (Overflow and underflow) Since the set of machine numbers M is a finite set, the
result of an arithmetic operation can lie outside the range covered by it. In this case we have to deal with
overflow ≙ |result of an elementary operation| > max{M}   (IEEE standard ⇒ Inf)
underflow ≙ 0 < |result of an elementary operation| < min{|M \ {0}|}   (IEEE standard ⇒ use of non-normalized numbers (!))
The Axiom of roundoff analysis Ass. 1.5.3.11 does not hold once non-normalized numbers are encoun-
tered:
int main() {
  cout.precision(15);
  const double min = numeric_limits<double>::min();
  const double res1 = M_PI * min / 123456789101112;
  const double res2 = res1 * 123456789101112 / min;
  cout << res1 << endl << res2 << endl;
}

Output:
1 5.68175492717434e−322
2 3.15248510554597
A simple example teaching how to avoid overflow during the computation of the norm of a 2D vector [AG11,
Ex. 2.9]:
Straightforward evaluation r = √(x² + y²): overflow when |x| > √(max |M|) or |y| > √(max |M|).
Instead compute
r = |x| √(1 + (y/x)²) , if |x| ≥ |y| ,
r = |y| √(1 + (x/y)²) , if |y| > |x| ➢ no overflow!
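A minimal C++ realisation of this rescaled formula could look as follows (the standard library function std::hypot serves the same purpose):

#include <cmath>

// Overflow-safe Euclidean norm of a 2D vector (x,y), using the rescaled
// formula above: the argument of sqrt never exceeds 2.
double norm2d(double x, double y) {
  const double ax = std::abs(x), ay = std::abs(y);
  if (ax == 0.0 && ay == 0.0) return 0.0;
  if (ax >= ay) {
    const double q = ay / ax;
    return ax * std::sqrt(1.0 + q * q);
  }
  const double q = ax / ay;
  return ay * std::sqrt(1.0 + q * q);
}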
y
Review question(s) 1.5.3.18 (Machine arithmetic)
(Q1.5.3.18.A) What is the order of magnitude of the machine precision EPS for the double floating point
type?
(Q1.5.3.18.B) What is the “Axiom of roundoff analysis”?
1.5.4 Cancellation
Video tutorial for Section 1.5.4 "Cancellation": (22 minutes) Download link, tablet notes
EXAMPLE 1.5.4.1 (Computing the zeros of a quadratic polynomial) The following simple E IGEN code
computes the real roots of a quadratic polynomial p(ξ ) = ξ 2 + αξ + β by the discriminant formula
p(ξ_1) = p(ξ_2) = 0 ,  ξ_{1/2} = ½ (−α ± √D) ,  if D := α² − 4β ≥ 0 .   (1.5.4.2)
C++ code 1.5.4.3: Discriminant formula for the real roots of p(ξ) = ξ² + αξ + β ➺ GITLAB
//! C++ function computing the zeros of a quadratic polynomial
//! ξ ↦ ξ² + αξ + β by means of the familiar discriminant
//! formula ξ_{1,2} = ½(−α ± √(α² − 4β)). However
//! this implementation is vulnerable to round-off! The zeros are
//! returned in a column vector
inline Vector2d zerosquadpol(double alpha, double beta) {
  Vector2d z;
  const double D = std::pow(alpha, 2) - 4 * beta;  // discriminant
  if (D >= 0) {
    // The famous discriminant formula
    const double wD = std::sqrt(D);
    z << (-alpha - wD) / 2, (-alpha + wD) / 2;
  } else {
    throw std::runtime_error("no real zeros");
  }
  return z;
}
This formula is applied to the quadratic polynomial p(ξ) = (ξ − γ)(ξ − 1/γ) after its coefficients α, β have
been computed from γ, which will have introduced small relative roundoff errors (of size EPS).
C++ code 1.5.4.4: Testing the accuracy of computed roots of a quadratic polynomial ➺ GITLAB
//! Eigen function for testing the computation of the zeros of a parabola
void compzeros() {
  int n = 100;
  MatrixXd res(n, 4);
  VectorXd gamma = VectorXd::LinSpaced(n, 2, 992);
  for (int i = 0; i < n; ++i) {
    double alpha = -(gamma(i) + 1. / gamma(i));
    double beta = 1.;
    Vector2d z1 = zerosquadpol(alpha, beta);
    Vector2d z2 = zerosquadpolstab(alpha, beta);
    double ztrue = 1. / gamma(i), z2true = gamma(i);
    res(i, 0) = gamma(i);
    res(i, 1) = std::abs((z1(0) - ztrue) / ztrue);
    res(i, 2) = std::abs((z2(0) - ztrue) / ztrue);
    res(i, 3) = std::abs((z1(1) - z2true) / z2true);
  }
  // ... (output/plotting of res omitted in this excerpt)
}
(Plot: relative errors in ξ_1, ξ_2 as functions of γ, obtained from Code 1.5.4.4.)
Observation: roundoff in the computation of α and β leads to “wrong” roots. For large γ the computed small root may be fairly inaccurate.
In order to understand why the small root is much more severely affected by roundoff, note that its com-
putation involves the subtraction of two large numbers, if γ is large. This is the typical situation, in which
cancellation occurs. y
§1.5.4.5 (Visualisation of cancellation effect) We look at the exact subtraction of two almost equal
positive numbers both of which have small relative errors (red boxes) with respect to some desired exact
value (indicated by blue boxes). The result of the subtraction will be small, but the errors may add up
during the subtraction, ultimately constituting a large fraction of the result.
(Fig. 22 ✁ visualises the (absolute) errors before and after the subtraction: cancellation. The roundoff error introduced by the subtraction itself is negligible.)
y
EXAMPLE 1.5.4.6 (Cancellation in decimal system) We consider two positive numbers x, y
of about the same size afflicted with relative errors ≈ 10−7 . This means that their sev-
enth decimal digits are perturbed, here indicated by ∗. When we subtract the two numbers
the perturbed digits are shifted to the left, resulting in a possible relative error of ≈ 10^{−3}; the digits freed at the right end of the mantissa are padded with zeroes.
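As a concrete instance of this pattern (the numbers are made up for illustration), consider
x = 0.123456∗… ,  y = 0.123444∗… ,  x − y = 0.000012∗… = 0.12∗… · 10^{−4} .
The perturbed digit ∗, originally worth a relative error of roughly 10^{−6}–10^{−7} in x and y, now affects the third significant digit of the normalized difference, so the relative error of x − y is amplified by a factor of about 10^4.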
Again, this example demonstrates that cancellation wreaks havoc through error amplification, not through
the roundoff error due to the subtraction. y
EXAMPLE 1.5.4.7 (Cancellation when evaluating difference quotients → [DR08, Sect. 8.2.6], [AG11,
Ex. 1.3]) From analysis we know that the derivative of a differentiable function f : I ⊂ R → R at a point
x ∈ I is given by
f′(x) = lim_{h→0} (f(x + h) − f(x)) / h .
This suggests the following approximation of the derivative by a difference quotient with small but finite
h>0
f′(x) ≈ (f(x + h) − f(x)) / h   for |h| ≪ 1 .
Results from analysis tell us that the approximation error should tend to zero for h → 0. More precise
quantitative information is provided by the Taylor formula for a twice continuously differentiable function
[AG11, p. 5]
(f(x + h) − f(x)) / h − f′(x) = ½ h f″(ξ)   for some ξ = ξ(x, h) ∈ [min{x, x + h}, max{x, x + h}] .   (1.5.4.9)
We investigate the approximation of the derivative by difference quotients for f = exp, x = 0, and different
values of h > 0:
Fig. 23
Obvious culprit for what we see in Fig. 23: cancellation when computing the numerator of the
difference quotient for small | h| leads to a strong amplification of inevitable errors introduced by
the evaluation of the transcendent exponential function.
We witness the competition of two opposite effects: Smaller h results in a better approximation of the
derivative by the difference quotient, but the impact of cancellation is the stronger the smaller | h|.
Approximation error f′(x) − (f(x + h) − f(x))/h → 0 , impact of roundoff → ∞ , as h → 0 .
In order to provide a rigorous underpinning for our conjecture, in this example we embark on our first
roundoff error analysis merely based on the “Axiom of roundoff analysis” Ass. 1.5.3.11: As in the compu-
tational example above we study the approximation of f ′ ( x ) = e x for f = exp, x ∈ R.
(Note that the estimate for the term (e^h − 1)/h is a particular case of (1.5.4.9).)
relative error:  |e^x − df| / e^x ≈ h + 2·eps/h  →  min   for h = √(2·eps) .
In double precision: √(2·eps) = 2.107342425544702 · 10^{−8} y
Remark 1.5.4.11 (Cancellation during the computation of relative errors) In the numerical experiment
of Ex. 1.5.4.7 we computed the relative error of the result by subtraction, see Code 1.5.4.10. Of course,
massive cancellation will occur! Do we have to worry?
In this case cancellation can be tolerated, because we are interested only in the magnitude of the relative
error. Even if it was affected itself by a large relative error, this information is still not compromised.
For example, if the relative error has the exact value 10^{−8}, but can be computed only with a huge relative
error of 10%, then the perturbed value would still be in the range [0.9 · 10^{−8}, 1.1 · 10^{−8}]. Therefore it will
still have the correct magnitude and still permit us to conclude the number of valid digits correctly. y
Remark 1.5.4.12 (Cancellation in Gram-Schmidt orthogonalisation of Exp. 1.5.1.5) The Hilbert matrix
A ∈ ℝ^{10,10}, (A)_{i,j} = (i + j − 1)^{−1}, considered in Exp. 1.5.1.5 has columns that are almost linearly
dependent.
EXAMPLE 1.5.4.13 (Cancellation: roundoff error analysis) We consider a simple arithmetic expression
written in two ways:
a² − b² = (a + b)(a − b) ,  a, b ∈ ℝ .
We evaluate this term by means of two algebraically equivalent algorithms for the input data a = 1.3,
b = 1.2 in 2-digit decimal arithmetic with standard rounding. (“Algebraically equivalent” means that the two
algorithms produce the same results in the absence of roundoff errors.)
Algorithm A:  x := a · a = 1.7 (rounded) ,  y := b · b = 1.4 (rounded) ,  x − y = 0.30 (exact)
Algorithm B:  x := a + b = 2.5 (exact) ,  y := a − b = 0.1 (exact) ,  x · y = 0.25 (exact)
Algorithm B produces the exact result, whereas Algorithm A fails to do so. Is this pure coincidence or an
indication of the superiority of algorithm B? This question can be answered by roundoff error analysis. We
demonstrate the approach for the two algorithms A & B and general input a, b ∈ R.
Roundoff error analysis heavily relies on Ass. 1.5.3.11 and on dropping terms of “higher order” in the machine
precision, that is, terms that behave like O(EPS^q), q > 1. It involves introducing the relative roundoff error
for every elementary operation through a factor (1 + δ), |δ| ≤ EPS.
Algorithm A:
x̃ = a²(1 + δ_1) ,  ỹ = b²(1 + δ_2) ,
f̃ = (a²(1 + δ_1) − b²(1 + δ_2))(1 + δ_3) = f + a²δ_1 − b²δ_2 + (a² − b²)δ_3 + O(EPS²) ,
|f̃ − f| / |f| ≤ EPS (a² + b² + |a² − b²|) / |a² − b²| + O(EPS²) = EPS (1 + (a² + b²)/|a² − b²|) + O(EPS²) .   (1.5.4.14)
(The O(EPS²) terms will be neglected.)
For a ≈ b the relative error of the result of Algorithm A will be much larger than the machine
precision EPS. This reflects cancellation in the last subtraction step.
Algorithm B:
x̃ = (a + b)(1 + δ_1) ,  ỹ = (a − b)(1 + δ_2) ,
f̃ = (a + b)(a − b)(1 + δ_1)(1 + δ_2)(1 + δ_3) = f + (a² − b²)(δ_1 + δ_2 + δ_3) + O(EPS²) ,
|f̃ − f| / |f| ≤ |δ_1 + δ_2 + δ_3| + O(EPS²) ≤ 3 EPS + O(EPS²) .   (1.5.4.15)
The reason is that input data and initial intermediate results are usually not as much tainted by roundoff
errors as numbers computed after many steps. y
§1.5.4.16 (Avoiding disastrous cancellation) The following examples demonstrate a few fundamental
techniques for steering clear of cancellation by using alternative formulas that yield the same value (in
exact arithmetic), but do not entail subtracting two numbers of almost equal size.
EXAMPLE 1.5.4.17 (Stable discriminant formula → Ex. 1.5.4.1, [AG11, Ex. 2.10]) If ξ 1 and ξ 2 are
the two roots of the quadratic polynomial p(ξ ) = ξ 2 + αξ + β, then ξ 1 · ξ 2 = β (Vieta’s formula). Thus
once we have computed a root, we can obtain the other by simple division.
Idea:
➊ Depending on the sign of α compute “stable root” without cancellation.
➋ Compute other root from Vieta’s formula (avoiding subtraction)
C++ code 1.5.4.18: Stable computation of real roots of a quadratic polynomial ➺ GITLAB
2   //! C++ function computing the zeros of a quadratic polynomial
3   //! ξ ↦ ξ² + αξ + β by means of the familiar discriminant
4   //! formula ξ_{1,2} = ½(−α ± √(α² − 4β)).
5   //! This is a stable implementation based on Vieta’s theorem.
6   //! The zeros are returned in a column vector
7   Eigen::VectorXd zerosquadpolstab(double alpha, double beta) {
8     Eigen::Vector2d z(2);
9     const double D = std::pow(alpha, 2) - 4 * beta;  // discriminant
10    if (D >= 0) {
11      const double wD = std::sqrt(D);
12      // Use discriminant formula only for zero far away from 0
13      // in order to avoid cancellation. For the other zero
14      // use Vieta's formula.
15      if (alpha >= 0) {
16        const double t = 0.5 * (-alpha - wD);  //
17        z << t, beta / t;
18      } else {
19        const double t = 0.5 * (-alpha + wD);  //
20        z << beta / t, t;
21      }
22    }
23    else {
24      throw std::runtime_error("no real zeros");
25    }
26    return z;
27  }
➥ Invariably, we add numbers with the same sign in Line 16 and Line 19.
(Fig. 25 “Roundoff in the computation of zeros of a parabola”: relative error in ξ_1 versus γ ∈ [0, 1000], error scale ×10^{−11}, for the unstable and the stable implementation; data produced with Code 1.5.4.4.)
Observation: the new code can also compute the small root of the polynomial p(ξ) = (ξ − γ)(ξ − 1/γ) (in expanded form) accurately, even for large γ.
y
EXAMPLE 1.5.4.19 (Exploiting trigonometric identities to avoid cancellation) The task is to evaluate
the integral
∫_0^x sin t dt = 1 − cos x  (I)  = 2 sin²(x/2)  (II)   for 0 < x ≪ 1 ,   (1.5.4.20)
(Fig. 26: comparison of the two expressions I and II for small x; x-axis x ∈ [10^{−7}, 10^0], logarithmic scales.)
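A minimal experiment (not the lecture code behind Fig. 26) comparing the two algebraically equivalent expressions could look as follows:

#include <cmath>
#include <iostream>

int main() {
  for (double x = 1e-1; x > 0.5e-8; x *= 1e-1) {
    const double I = 1.0 - std::cos(x);                      // cancellation for x << 1
    const double II = 2.0 * std::pow(std::sin(0.5 * x), 2);  // no cancellation
    std::cout << x << " " << I << " " << II << " "
              << std::abs(I - II) / II << std::endl;         // relative discrepancy
  }
  return 0;
}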
y
Analytic manipulations offer ample opportunity to rewrite expressions in equivalent form immune to
cancellation.
We focus on the unit circle (Fig. 27: inscribed regular n-gon, half-angle α_n/2 with sin(α_n/2) and cos(α_n/2) as the legs of the elementary triangle). The area of the inscribed n-gon is
A_n = n cos(α_n/2) sin(α_n/2) = (n/2) sin α_n = (n/2) sin(2π/n) ,  α_n := 2π/n .
Recursion formula for A_n derived from Fig. 28 (half-angle identity):
sin(α_n/2) = √((1 − cos α_n)/2) = √((1 − √(1 − sin² α_n))/2) ,
Initial approximation: A_6 = (3/2)√3 .
double s = sqrt(3) / 2.;
double An = 3. * s;  // initialization (hexagon case)
unsigned int n = 6;
unsigned int it = 0;
MatrixXd res(maxIt, 4);  // matrix for storing results
res(it, 0) = n; res(it, 1) = An;
res(it, 2) = An - M_PI; res(it, 3) = s;
while (it < maxIt && s > tol) {  // terminate when s is 'small enough'
  s = sqrt((1. - sqrt(1. - s * s)) / 2.);  // recursion for area
  n *= 2; An = n / 2. * s;  // new estimate for circumference
  ++it;
  res(it, 0) = n; res(it, 1) = An;  // store results and (absolute) error
  res(it, 2) = An - M_PI; res(it, 3) = s;
}
return res.topRows(it);
}
The approximation deteriorates after applying the recursion formula many times:
n An An − π sin αn
6 2.598076211353316 -0.543516442236477 0.866025403784439
12 3.000000000000000 -0.141592653589794 0.500000000000000
24 3.105828541230250 -0.035764112359543 0.258819045102521
48 3.132628613281237 -0.008964040308556 0.130526192220052
96 3.139350203046872 -0.002242450542921 0.065403129230143
192 3.141031950890530 -0.000560702699263 0.032719082821776
384 3.141452472285344 -0.000140181304449 0.016361731626486
768 3.141557607911622 -0.000035045678171 0.008181139603937
1536 3.141583892148936 -0.000008761440857 0.004090604026236
3072 3.141590463236762 -0.000002190353031 0.002045306291170
6144 3.141592106043048 -0.000000547546745 0.001022653680353
12288 3.141592516588155 -0.000000137001638 0.000511326906997
24576 3.141592618640789 -0.000000034949004 0.000255663461803
49152 3.141592645321216 -0.000000008268577 0.000127831731987
98304 3.141592645321216 -0.000000008268577 0.000063915865994
196608 3.141592645321216 -0.000000008268577 0.000031957932997
393216 3.141592645321216 -0.000000008268577 0.000015978966498
786432 3.141593669849427 0.000001016259634 0.000007989485855
1572864 3.141592303811738 -0.000000349778055 0.000003994741190
3145728 3.141608696224804 0.000016042635011 0.000001997381017
6291456 3.141586839655041 -0.000005813934752 0.000000998683561
12582912 3.141674265021758 0.000081611431964 0.000000499355676
25165824 3.141674265021758 0.000081611431964 0.000000249677838
50331648 3.143072740170040 0.001480086580246 0.000000124894489
100663296 3.159806164941135 0.018213511351342 0.000000062779708
201326592 3.181980515339464 0.040387861749671 0.000000031610136
402653184 3.354101966249685 0.212509312659892 0.000000016660005
805306368 4.242640687119286 1.101048033529493 0.000000010536712
1610612736 6.000000000000000 2.858407346410207 0.000000007450581
sin(α_n/2) = √((1 − cos α_n)/2) = √((1 − √(1 − sin² α_n))/2) .
For α_n ≪ 1: √(1 − sin² α_n) ≈ 1 ➤ cancellation here!
We arrive at an equivalent formula not vulnerable to cancellation essentially using the identity ( a + b)( a −
b) = a2 − b2 in order to eliminate the difference of square roots in the numerator.
sin(α_n/2) = √((1 − √(1 − sin² α_n))/2)
           = √( (1 − √(1 − sin² α_n)) (1 + √(1 − sin² α_n)) / (2 (1 + √(1 − sin² α_n))) )
           = √( (1 − (1 − sin² α_n)) / (2 (1 + √(1 − sin² α_n))) )
           = sin α_n / √(2 (1 + √(1 − sin² α_n))) .
C++ code 1.5.4.23: Stable recursion for area of regular n-gon ➺ GITLAB
//! Approximation of Pi by approximating the circumference of a
//! regular polygon
MatrixXd apprpistable(double tol = 1e-8, unsigned int maxIt = 50) {
  double s = sqrt(3) / 2.; double An = 3. * s;  // initialization (hexagon case)
  unsigned int n = 6;
  unsigned int it = 0;
  MatrixXd res(maxIt, 4);  // matrix for storing results
  res(it, 0) = n; res(it, 1) = An;
  res(it, 2) = An - M_PI; res(it, 3) = s;
  while (it < maxIt && s > tol) {  // terminate when s is 'small enough'
    s = s / sqrt(2 * (1 + sqrt((1 + s) * (1 - s))));  // stable recursion without cancellation
    n *= 2; An = n / 2. * s;  // new estimate for circumference
    ++it;
    res(it, 0) = n; res(it, 1) = An;  // store results and (absolute) error
    res(it, 2) = An - M_PI; res(it, 3) = s;
  }
  return res.topRows(it);
}
Using the stable recursion, we observe better approximation for polygons with more corners:
n An An − π sin αn
6 2.598076211353316 -0.543516442236477 0.866025403784439
12 3.000000000000000 -0.141592653589793 0.500000000000000
24 3.105828541230249 -0.035764112359544 0.258819045102521
48 3.132628613281238 -0.008964040308555 0.130526192220052
96 3.139350203046867 -0.002242450542926 0.065403129230143
192 3.141031950890509 -0.000560702699284 0.032719082821776
384 3.141452472285462 -0.000140181304332 0.016361731626487
768 3.141557607911857 -0.000035045677936 0.008181139603937
1536 3.141583892148318 -0.000008761441475 0.004090604026235
3072 3.141590463228050 -0.000002190361744 0.002045306291164
6144 3.141592105999271 -0.000000547590522 0.001022653680338
12288 3.141592516692156 -0.000000136897637 0.000511326907014
24576 3.141592619365383 -0.000000034224410 0.000255663461862
49152 3.141592645033690 -0.000000008556103 0.000127831731976
98304 3.141592651450766 -0.000000002139027 0.000063915866118
196608 3.141592653055036 -0.000000000534757 0.000031957933076
393216 3.141592653456104 -0.000000000133690 0.000015978966540
786432 3.141592653556371 -0.000000000033422 0.000007989483270
1572864 3.141592653581438 -0.000000000008355 0.000003994741635
3145728 3.141592653587705 -0.000000000002089 0.000001997370818
6291456 3.141592653589271 -0.000000000000522 0.000000998685409
12582912 3.141592653589663 -0.000000000000130 0.000000499342704
25165824 3.141592653589761 -0.000000000000032 0.000000249671352
50331648 3.141592653589786 -0.000000000000008 0.000000124835676
100663296 3.141592653589791 -0.000000000000002 0.000000062417838
201326592 3.141592653589794 0.000000000000000 0.000000031208919
402653184 3.141592653589794 0.000000000000001 0.000000015604460
805306368 3.141592653589794 0.000000000000001 0.000000007802230
1610612736 3.141592653589794 0.000000000000001 0.000000003901115
(Plot “Recursion for the area of a regular n-gon”: approximation error versus the number of corners n, logarithmic scale.)
(Fig. 30: size of the summands plotted against the index k of the summand, k = 0, …, 50.)
Observation: for x < 0 the summands of the exponential series alternate in sign, so that summing them up incurs cancellation. The remedy is to use
exp(x) = 1 / exp(−x) ,  if x < 0 .
y
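The following sketch (an assumption about the experiment behind Fig. 30, not the original code) contrasts the naive truncated series with the stable variant for a negative argument:

#include <cmath>
#include <iostream>

// Truncated Taylor series of exp(x); for x < 0 the summands alternate in
// sign and are much larger than the result -> cancellation.
double expTaylor(double x, unsigned int N = 100) {
  double term = 1.0, sum = 1.0;
  for (unsigned int k = 1; k <= N; ++k) {
    term *= x / k;
    sum += term;
  }
  return sum;
}

int main() {
  const double x = -20.0;
  std::cout << expTaylor(x) << std::endl;         // ruined by cancellation
  std::cout << 1.0 / expTaylor(-x) << std::endl;  // accurate
  std::cout << std::exp(x) << std::endl;          // reference value
  return 0;
}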
EXAMPLE 1.5.4.26 (Trade cancellation for approximation) In a computer code we have to provide a
routine for the evaluation of the “hidden difference quotient”
I(a) := ∫_0^1 e^{at} dt = (exp(a) − 1) / a   for any a > 0 ,   (1.5.4.27)
cf. the discussion of cancellation in the context of numerical differentiation in Ex. 1.5.4.7. There we
observed massive cancellation.
Trick. Recall the Taylor expansion formula in one dimension for a function that is m + 1 times continu-
ously differentiable in a neighborhood of x0 [Str09, Satz 5.5.1]
f(x_0 + h) = Σ_{k=0}^{m} (1/k!) f^{(k)}(x_0) h^k + R_m(x_0, h) ,  R_m(x_0, h) = (1/(m+1)!) f^{(m+1)}(ξ) h^{m+1} ,   (1.5.4.28)
for some ξ ∈ [min{ x0 , x0 + h}, max{ x0 , x0 + h}], and for all sufficiently small | h|. Here R( x0 , h) is
called the remainder term and f (k) denotes the k-th derivative of f .
Cancellation in (1.5.4.27) can be avoided by replacing exp(a), a > 0, with a suitable Taylor expansion of
a ↦ e^a around a = 0 and then dividing by a:
(exp(a) − 1)/a = Σ_{k=0}^{m−1} (1/(k+1)!) a^k + R_m(a) ,  R_m(a) = (1/(m+1)!) exp(ξ) a^m   for some 0 ≤ ξ ≤ a .
The truncated expansion Ĩ_m(a) := Σ_{k=0}^{m−1} a^k/(k+1)! is used only for small a, below some threshold. To estimate the relative approximation error, we use the expression for the remainder together with
the simple estimate (exp(a) − 1)/a > 1 for all a > 0:
rel. err. = |I(a) − Ĩ_m(a)| / |I(a)| = |(e^a − 1)/a − Σ_{k=0}^{m−1} a^k/(k+1)!| / ((e^a − 1)/a)
          ≤ (1/(m+1)!) exp(ξ) a^m ≤ (1/(m+1)!) exp(a) a^m .
The corresponding code fragment (MATLAB syntax; the test on the size of a is not shown) uses the truncated Taylor expansion for small a:
  v = 1.0 + (1.0/2 + 1.0/6*a)*a;
else
  v = (exp(a)-1.0)/a;
end
(Accompanying plot: relative error of this evaluation, roughly between 10^{−13} and 10^{−10}.)
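One possible cancellation-free C++ realisation (not the code used for the plot) relies on the standard library function std::expm1, which evaluates exp(a) − 1 without cancellation for small a:

#include <cmath>

// Cancellation-free evaluation of I(a) = (exp(a)-1)/a.
double intExp(double a) {
  if (a == 0.0) return 1.0;   // limit value of I(a) for a -> 0
  return std::expm1(a) / a;   // accurate also for |a| << 1
}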
Note that f is infinitely many times differentiable in a neighborhood of x_0 and that its derivatives satisfy
f^{(n)}(x_0) = n! a_n ∈ ℝ, n ∈ ℕ_0.
A power series like in (6.2.2.36) also makes sense for x ∈ C, | x − x0 | < ρ! Thus, for 0 < h < ρ we can
approximate f in a neighborhood of x0 by means of a complex Taylor polynomial
Trick. Take the imaginary part on both sides of (1.5.4.31) using that all derivatives are real:
Im f(x_0 + ıh) = h f′(x_0) + O(h³)   for h ∈ ℝ → 0 ,
f′(x_0) = Im f(x_0 + ıh) / h + O(h²)   for h ∈ ℝ → 0 ,
f′(x_0) ≈ Im f(x_0 + ıh) / h
for h ≈ √EPS to compute the derivative of f in x_0. y
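A minimal sketch of this “complex-step” trick, for a function f that can be evaluated for complex arguments (the choice of f here is purely illustrative):

#include <cmath>
#include <complex>
#include <iostream>
#include <limits>

int main() {
  auto f = [](std::complex<double> z) { return std::exp(z) * std::sin(z); };
  const double x0 = 1.0;
  const double h = std::sqrt(std::numeric_limits<double>::epsilon());
  // Complex-step approximation of f'(x0): no subtraction, no cancellation.
  const double df = std::imag(f(std::complex<double>(x0, h))) / h;
  const double exact = std::exp(x0) * (std::sin(x0) + std::cos(x0));
  std::cout << df << " vs " << exact << std::endl;
  return 0;
}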
Remark 1.5.4.32 (A broader view of cancellation) Cancellation can be viewed as a particular case of a
situation, in which severe amplification of relative errors is possible. Consider a smooth function F : ℝⁿ → ℝ evaluated at a point (x_1, …, x_n) where
F(x_1, …, x_n) ≠ 0 ,  F(x_1, …, x_n) ≈ 0 ,  grad F(x_1, …, x_n) ≠ 0 .
We supply arguments xei ∈ R with small relative errors ǫi , xei = xi (1 + ǫi ), i = 1, . . . , n, and study the
resulting relative error δ of the result
δ := |F(x̃_1, …, x̃_n) − F(x_1, …, x_n)| / |F(x_1, …, x_n)| .
Thanks to the smoothness of F, we can employ multi-dimensional Taylor approximation
F(x_1(1 + ε_1), …, x_n(1 + ε_n)) = F(x_1, …, x_n) + grad F(x_1, …, x_n)ᵀ [ε_1 x_1, …, ε_n x_n]ᵀ + R(x, ε) ,
with remainder R(x, ε) = O(ε_1² + ⋯ + ε_n²) for ε_i → 0 .
This yields
δ = |grad F(x_1, …, x_n)ᵀ [ε_1 x_1, …, ε_n x_n]ᵀ + R(x, ε)| / |F(x_1, …, x_n)| .
If |ε_i| ≪ 1, we can neglect the remainder term and obtain the first-order approximation (indicated by ≐)
δ ≐ (1 / |F(x_1, …, x_n)|) |grad F(x_1, …, x_n)ᵀ [ε_1 x_1, …, ε_n x_n]ᵀ| .
In case |(grad F ( x1 , . . . , xn ))i | ≫ | F (x)|, | xi | ≫ 0 for some i ∈ {1, . . . , n}, we can thus encounter
δ ≫ max j |ǫ j |, which indicates a potentially massive amplification of relative errors.
• “Classical cancellation” as discussed in § 1.5.4.5 fits this setting and corresponds to the special
choice F : R2 → R, F ( x1 , x2 ) := x1 − x2 .
• The effect found above can be observed for the simple trigonometric functions sin and cos!
y
I(a, b) := ∫_a^b 1/(1 + x²) dx .
Hints.
d/dx {x ↦ arctan(x)} = 1/(1 + x²) ,
tan(α − β) = (tan α − tan β) / (1 + tan α tan β) .
where x is of type double? Rewrite this line of code into an algebraically equivalent one so that the problem
no longer occurs.
(Q1.5.4.33.D) [Harmless cancellation] Discuss the impact of round-off errors and cancellation for the
C++ expression
f = x + s t d ::sqrt(1-x*x);
where x, y ∈ ℝ designate the “exact values”. Give a bound for the relative error of the product x̃ · ỹ and
discuss whether it can be much larger than EPS for particular values of x and y.
△
1.5.5 Numerical Stability
Video tutorial for Section 1.5.5 "Numerical Stability": (17 minutes) Download link, tablet notes
We have seen that a particular “problem” can be tackled by different “algorithms”, which produce different
results due to roundoff errors. This section will clarify what distinguishes a “good” algorithm from a rather
abstract point of view.
Note: In this course, both the data space X and the result space Y will always be subsets of finite dimen-
sional vector spaces.
§1.5.5.3 (Norms on spaces of vectors and matrices) Norms provide tools for measuring errors. Recall
from linear algebra and calculus [NS02, Sect. 4.3], [Gut09, Sect. 6.1]: a norm on a K-vector space V is a mapping ‖·‖ : V → ℝ that is definite (‖v‖ = 0 ⇔ v = 0), homogeneous (‖λv‖ = |λ| ‖v‖), and satisfies the triangle inequality (‖v + w‖ ≤ ‖v‖ + ‖w‖). The most important examples on Kⁿ are the Euclidean norm ‖x‖_2 := (Σ_i |x_i|²)^{1/2}, the 1-norm ‖x‖_1 := Σ_i |x_i|, and the maximum norm ‖x‖_∞ := max_i |x_i|.
Remark 1.5.5.5 (Inequalities between vector norms) All norms on the vector space K n , n ∈ N, are
equivalent in the sense that for arbitrary two norms k·k1 and k·k2 we can always find a constant C > 0
such that
k v k1 ≤ C k v k2 ∀ v ∈ K n . (1.5.5.6)
Of course, the constant C will usually depend on n and the norms under consideration.
For the vector norms introduced above, explicit expressions for the constants “C” are available: for all
x∈ Kn
√
k x k2 ≤ k x k1 ≤ n k x k2 , (1.5.5.7)
√
k x k ∞ ≤ k x k2 ≤ n k x k ∞ , (1.5.5.8)
k x k ∞ ≤ k x k1 ≤ n k x k ∞ . (1.5.5.9)
The matrix space K m,n is a vector space, of course, and can also be equipped with various norms. Of
particular importance are norms induced by vector norms on K n and K m .
Given vector norms ‖·‖_x and ‖·‖_y on Kⁿ and Kᵐ, respectively, the associated matrix norm is
defined by
M ∈ ℝ^{m,n} :  ‖M‖ := sup_{x∈ℝⁿ\{0}} ‖Mx‖_y / ‖x‖_x .
By virtue of definition the matrix norms enjoy an important property, they are sub-multiplicative: ‖AB‖ ≤ ‖A‖ ‖B‖ for matrices A, B of matching sizes (with compatible choices of vector norms).
✎ notations for matrix norms for quadratic matrices associated with standard vector norms:
k x k2 → k M k2 , k x k1 → k M k1 , k x k ∞ → k M k ∞
EXAMPLE 1.5.5.12 (Matrix norm associated with ∞-norm and 1-norm) Rather simple formulas are
available for the matrix norms induced by the vector norms k·k∞ and k·k1
e.g. for M = [m_{11}, m_{12}; m_{21}, m_{22}] ∈ K^{2,2} :
‖Mx‖_∞ = max{|m_{11}x_1 + m_{12}x_2|, |m_{21}x_1 + m_{22}x_2|} ≤ max{|m_{11}| + |m_{12}|, |m_{21}| + |m_{22}|} ‖x‖_∞ ,
‖Mx‖_1 = |m_{11}x_1 + m_{12}x_2| + |m_{21}x_1 + m_{22}x_2| ≤ max{|m_{11}| + |m_{21}|, |m_{12}| + |m_{22}|} (|x_1| + |x_2|) .
For general M = (m_{ij}) ∈ K^{m,n}:
➢ matrix norm ↔ ‖·‖_∞ = row sum norm:  ‖M‖_∞ := max_{i=1,…,m} Σ_{j=1}^{n} |m_{ij}| ,   (1.5.5.13)
➢ matrix norm ↔ ‖·‖_1 = column sum norm:  ‖M‖_1 := max_{j=1,…,n} Σ_{i=1}^{m} |m_{ij}| .   (1.5.5.14)
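A small EIGEN sketch evaluating the row sum norm (1.5.5.13) and the column sum norm (1.5.5.14) via reductions (the test matrix is arbitrary):

#include <Eigen/Dense>
#include <iostream>

int main() {
  Eigen::MatrixXd M(2, 3);
  M << 1, -2, 3,
      -4, 5, -6;
  const double norm_inf = M.cwiseAbs().rowwise().sum().maxCoeff();  // row sum norm
  const double norm_1 = M.cwiseAbs().colwise().sum().maxCoeff();    // column sum norm
  std::cout << norm_inf << " " << norm_1 << std::endl;  // prints 15 and 9
  return 0;
}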
Sometimes special formulas for the Euclidean matrix norm come handy [GV89, Sect. 2.3.3]:
A ∈ K^{n,n} , A = Aᴴ  ⇒  ‖A‖_2 = max_{x≠0} |xᴴAx| / ‖x‖_2² .
Proof. Recall from linear algebra: Hermitian matrices (a special class of normal matrices) enjoy unitary
similarity to diagonal matrices:
Since multiplication with an unitary matrix preserves the 2-norm of a vector, we conclude
Hence, both expressions in the statement of the lemma agree with the largest modulus of eigenvalues of
A.
✷
For A ∈ K m,n the Euclidean matrix norm kAk2 is the square root of the largest (in modulus)
eigenvalue of A H A.
For a normal matrix A ∈ K n,n (that is, A satisfies AH A = AAH ) the Euclidean matrix norm agrees
with the modulus of the largest eigenvalue.
§1.5.5.17 ((Numerical) algorithm) When we talk about an “algorithm” we have in mind a concrete code
function in M ATLAB or C++; the only way to describe an algorithm is through a piece of code. We assume
that this function defines another mapping F̃ : X → Y on the data space of the problem. Of course,
we can only feed data to the MATLAB/C++-function, if they can be represented in the set M of machine
numbers. Hence, implicit in the definition of F̃ is the assumption that input data are subject to rounding
before passing them to the code function proper.
Problem:  F : X ⊂ ℝⁿ → Y ⊂ ℝᵐ      Algorithm:  F̃ : X̃ → Y ,  X̃ ⊂ M
y
We write w(x), x ∈ X , for the computational effort (→ Def. 1.4.0.1, “number of elementary operations”)
required by the algorithm for input x.
Definition 1.5.5.19.
An algorithm F̃ for solving a problem F : X ↦ Y is numerically stable if for all x ∈ X its result F̃(x)
(possibly affected by roundoff) is the exact result for “slightly perturbed” data:
∃ C ≈ 1:  ∀ x ∈ X:  ∃ x̃ ∈ X:  ‖x − x̃‖_X ≤ C w(x) EPS ‖x‖_X  ∧  F̃(x) = F(x̃) .
Here EPS should be read as machine precision according to the “Axiom” of roundoff analysis Ass. 1.5.3.11.
(Fig. 33 ✄ illustrates Def. 1.5.5.19 in the spaces (X, ‖·‖_X) and (Y, ‖·‖_Y): y ≙ F(x) is the exact result for the exact data x, while the computed result F̃(x) agrees with F(x̃) for some nearby perturbed data x̃.)
Terminology: Def. 1.5.5.19 introduces stability in the sense of backward error analysis.
Sloppily speaking, the impact of roundoff (∗) on a stable algorithm is of the same order of magnitude
as the effect of the inevitable perturbations due to rounding of the input data.
(∗) In some cases the definition of Fe will also involve some approximations as in Ex. 1.5.4.26. Then the
above statement also includes approximation errors. y
EXAMPLE 1.5.5.20 (Testing stability of matrix×vector multiplication) Assume you are given a black
box implementation of a function
VectorXd mvmult(const MatrixXd &A, const VectorXd &x)
that purports to provide a stable implementation of Ax for A ∈ K^{m,n}, x ∈ Kⁿ, cf. Ex. 1.5.5.2. How can
we verify this claim for particular data? Both K^{m,n} and Kⁿ are equipped with the Euclidean norm.
The task is, given y ∈ Kᵐ as returned by the function, to find conditions on y that ensure the existence of
a Ã ∈ K^{m,n} such that
Ãx = y  and  ‖Ã − A‖_2 ≤ C m n EPS ‖A‖_2 ,   (1.5.5.21)
A suitable candidate is
Ã := A + (z xᵀ) / ‖x‖_2² ,  z := y − Ax ∈ Kᵐ ,
and we find
‖Ã − A‖_2 = ‖z xᵀ‖_2 / ‖x‖_2² = (1/‖x‖_2²) sup_{w∈Kⁿ\{0}} ‖z (xᵀw)‖_2 / ‖w‖_2 ≤ ‖z‖_2 ‖x‖_2 / ‖x‖_2² = ‖y − Ax‖_2 / ‖x‖_2 .
Hence, in principle stability of an algorithm for computing Ax is confirmed, if for every x ∈ ℝⁿ the
computed result y = mvmult(A, x) satisfies
‖y − Ax‖_2 ≤ C m n EPS ‖A‖_2 ‖x‖_2
with a small constant C > 0 independent of data and problem size.
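A minimal sketch of such a check with EIGEN follows; the value of C is a tuning parameter of this sketch, not prescribed by the theory:

#include <Eigen/Dense>
#include <limits>

// Accept the computed result y of mvmult(A,x) if the residual obeys the
// backward-error bound ||y - A*x||_2 <= C*m*n*EPS*||A||_2*||x||_2.
bool stableMvmultResult(const Eigen::MatrixXd &A, const Eigen::VectorXd &x,
                        const Eigen::VectorXd &y, double C = 10.0) {
  const double EPS = std::numeric_limits<double>::epsilon();
  const double normA2 = A.jacobiSvd().singularValues()(0);  // = ||A||_2
  return (y - A * x).norm() <=
         C * A.rows() * A.cols() * EPS * normA2 * x.norm();
}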
y
An algorithm F̃ for solving a problem F : X ↦ Y is numerically stable if for all x ∈ X its result
F̃(x) (possibly affected by roundoff) is the exact result for “slightly perturbed” data:
∃ C ≈ 1:  ∀ ___ :  ∃ ___ :  ‖ ___ ‖_X ≤ C w(x) EPS ‖x‖_X  ∧  F̃( ___ ) = F( ___ ) .
Now consider an implementation of a function add() that just adds the two numbers given as arguments. Derive conditions on the returned result that, when
satisfied, imply the stability of the implementation of add(). Of course, the norm on ℝ is just |·|.
△
Learning Outcomes
Principal take-home knowledge and skills from this chapter:
• Learning by doing: Knowledge about the syntax of fundamental operations on matrices and vectors
in E IGEN.
• Understanding of the concepts of computational effort/cost and asymptotic complexity in numerics.
• Awareness of the asymptotic complexity of basic linear algebra operations
• Ability to determine the (asymptotic) computational effort for a concrete (numerical linear algebra)
algorithm.
• Ability to manipulate simple expressions involving matrices and vectors in order to reduce the com-
putational cost for their evaluation.
• Knowledge about round-off and machine precision.
• Familiarity with the phenomenon of “cancellation”: cause, effect, remedies, and tricks
Bibliography
[AV88] A. Aggarwal and J.S. Vitter. “The input/output complexity of sorting and related problems”. In:
Communications of the ACM 31.9 (1988), pp. 1116–1127 (cit. on p. 83).
[AG11] Uri M. Ascher and Chen Greif. A first course in numerical methods. Vol. 7. Computational
Science & Engineering. Society for Industrial and Applied Mathematics (SIAM), Philadelphia,
PA, 2011, pp. xxii+552. DOI: 10.1137/1.9780898719987 (cit. on pp. 56, 83, 95, 96, 98,
99, 101, 104, 105, 108, 115, 121).
[CW90] D. Coppersmith and S. Winograd. “Matrix multiplication via arithmetic progression”. In: J. Symbolic Computation 9.3 (1990), pp. 251–280 (cit. on p. 86).
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 57, 104).
[GV89] G.H. Golub and C.F. Van Loan. Matrix computations. 2nd. Baltimore, London: John Hopkins
University Press, 1989 (cit. on pp. 78, 120).
[GJ10] Gero Greiner and Riko Jacob. “The I/O Complexity of Sparse Matrix Dense Matrix Multiplication”. In: LATIN 2010: Theoretical Informatics. Ed. by A. Lopez-Ortiz. Vol. 6034. Lecture Notes in Computer Science. 2010, pp. 143–156. DOI: 10.1007/978-3-642-12200-2_14 (cit. on p. 83).
[Gut09] M.H. Gutknecht. Lineare Algebra. Lecture Notes. SAM, ETH Zürich, 2009 (cit. on p. 119).
[KW03] M. Kowarschik and C. Weiss. “An Overview of Cache Optimization Techniques and Cache-
Aware Numerical Algorithms”. In: Algorithms for Memory Hierarchies. Vol. 2625. Lecture Notes
in Computer Science. Heidelberg: Springer, 2003, pp. 213–232 (cit. on p. 83).
[LM67] J. N. Lyness and C. B. Moler. “Numerical differentiation of analytic functions”. In: SIAM J.
Numer. Anal. 4 (1967), pp. 202–210 (cit. on p. 116).
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
pp. 53, 70, 91, 119).
[Ove01] M.L. Overton. Numerical Computing with IEEE Floating Point Arithmetic. Philadelphia, PA:
SIAM, 2001 (cit. on p. 96).
[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Math-
ematics. New York: Springer, 2000 (cit. on pp. 57, 76).
[Str69] V. Strassen. “Gaussian elimination is not optimal”. In: Numer. Math. 13 (1969), pp. 354–356
(cit. on p. 85).
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on pp. 53, 54, 115).
[Van00] Charles F. Van Loan. “The ubiquitous Kronecker product”. In: J. Comput. Appl. Math. 123.1-2
(2000), pp. 85–100. DOI: 10.1016/S0377-0427(00)00393-9.
Chapter 2
Direct Methods for (Square) Linear Systems of Equations
§2.0.0.1 (Required prior knowledge for Chapter 2) Also this chapter heavily relies on concepts and
techniques from linear algebra as taught in the 1st semester introductory course. Knowledge of the fol-
lowing topics from linear algebra will be taken for granted and they should be refreshed in case of gaps:
• Operations involving matrices and vectors [NS02, Ch. 2], already covered in Chapter 1
• Computations with block-structured matrices, cf. § 1.3.1.13
• Linear systems of equations: existence and uniqueness of solutions [NS02, Sects. 1.2, 3.3]
• Gaussian elimination [NS02, Ch. 2]
• LU-decomposition and its connection with Gaussian elimination [NS02, Sect. 2.4]
y
Contents
2.1 Introduction: Linear Systems of Equations (LSE) . . . . . . . . . . . . . . . . . . . 127
2.2 Theory: Linear Systems of Equations (LSE) . . . . . . . . . . . . . . . . . . . . . . 130
2.2.1 LSE: Existence and Uniqueness of Solutions . . . . . . . . . . . . . . . . . . 130
2.2.2 Sensitivity/Conditioning of Linear Systems . . . . . . . . . . . . . . . . . . 131
2.3 Gaussian Elimination (GE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
2.3.1 Basic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
2.3.2 LU-Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
2.3.3 Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
2.4 Stability of Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
2.5 Survey: Elimination Solvers for Linear Systems of Equations . . . . . . . . . . . 165
2.6 Exploiting Structure when Solving Linear Systems . . . . . . . . . . . . . . . . . . 170
2.7 Sparse Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
2.7.1 Sparse Matrix Storage Formats . . . . . . . . . . . . . . . . . . . . . . . . . . 179
2.7.2 Sparse Matrices in E IGEN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
2.7.3 Direct Solution of Sparse Linear Systems of Equations . . . . . . . . . . . . 190
2.7.4 LU-Factorization of Sparse Matrices . . . . . . . . . . . . . . . . . . . . . . . 193
2.7.5 Banded Matrices [DR08, Sect. 3.7] . . . . . . . . . . . . . . . . . . . . . . . . 199
2.8 Stable Gaussian Elimination Without Pivoting . . . . . . . . . . . . . . . . . . . . 206
(Terminology: A ≙ system matrix / coefficient matrix, b ≙ right-hand-side vector)
Linear systems with rectangular system matrices A ∈ K m,n , called “overdetermined” for m > n, and
“underdetermined” for m < n will be treated in Chapter 3. y
Remark 2.1.0.2 (LSE: key components of mathematical models in many fields) Linear systems of
equations are ubiquitous in computational science: they are encountered
• with discrete linear models in network theory (see Ex. 2.1.0.3), control, statistics;
• in the case of discretized boundary value problems for ordinary and partial differential equations (→
course “Numerical methods for partial differential equations”, 4th semester);
• as a result of linearization (e.g., “Newton’s method” → Section 8.5).
y
EXAMPLE 2.1.0.3 (Nodal analysis of (linear) electric circuit [QSS00, Sect. 4.7.1])
Now we study a very important application of numerical simulation, where (large, sparse) linear systems
of equations play a central role: Numerical circuit analysis. We begin with linear circuits in the frequency
domain, which are directly modelled by complex linear systems of equations. In later chapters we will
tackle circuits with non-linear elements, see Ex. 8.1.0.1, and, finally, will learn about numerical methods
for computing the transient (time-dependent) behavior of circuits, see Ex. 11.1.2.11.
Modeling of simple linear circuits takes only elementary physical laws as covered in any introductory
course of physics (or even in secondary school physics). There is no sophisticated physics or mathematics
involved. Circuits are composed of so-called circuit elements connected by (ideal) wires.
A circuit diagram is shown in Fig. 36 ✄; it involves a voltage source U, resistors R_1, …, R_5, capacitors C_1, C_2, and a coil L, connected at the numbered nodes ➀–➅ (• ≙ nodes, that is, junctions of wires).
We number the nodes 1, …, n and write I_{kj} (physical units [I_{kj}] = 1 A) for the electric current flowing
from node k to node j. Currents have a sign:
I_{kj} = −I_{jk} .
The most fundamental relationship is the Kirchhoff current law (KCL) that demands that the sum of the node
currents vanishes in every node k:
Σ_j I_{kj} = 0 ,   (2.1.0.4)
where the sum runs over all nodes j connected to node k by a circuit element.
The unknowns of the model are the nodal potentials Uk , k = 1, . . . , n. (Some of them may be known, for
instance those for grounded nodes: ➅ in Fig. 36, and nodes connected to voltage sources: ➀ in Fig. 36.)
The difference of the nodal potentials of two connected nodes is called the branch voltage.
The circuit elements are characterized by current-voltage relationships, so-called constitutive relations,
here given in frequency domain for angular frequency ω > 0 (physical units [ω ] = 1s−1 ). We consider
only the following simple circuit elements:
• Ohmic resistor:  I = U/R ,  resistance [R] = 1 VA⁻¹ ➤ I_{kj} = R⁻¹ (U_k − U_j) ,
• capacitor:  I = ıωCU ,  capacitance [C] = 1 AsV⁻¹ ➤ I_{kj} = ıωC (U_k − U_j) ,
• coil/inductor:  I = U/(ıωL) ,  inductance [L] = 1 VsA⁻¹ ➤ I_{kj} = −ıω⁻¹L⁻¹ (U_k − U_j) .
✎ notation: ı ≙ imaginary unit, ı := √−1 , ı = exp(ıπ/2) , ı² = −1
Here we face the special case of a linear circuit: all relationships between branch currents and voltages
are of the form
I_{kj} = α_{kj} (U_k − U_j)  with a coefficient α_{kj} ∈ ℂ .
The concrete value of α_{kj} is determined by the circuit element connecting node k and node j.
These constitutive relations are derived by assuming a harmonic time-dependence of all quantities, which
is termed circuit analysis in the frequency domain (AC-mode):
u(t) = Re{U exp(ıωt)} ,  i(t) = Re{I exp(ıωt)} .   (2.1.0.6)
Here U, I ∈ ℂ are called complex amplitudes. This implies for temporal derivatives:
du/dt (t) = Re{ıωU exp(ıωt)} ,  di/dt (t) = Re{ıωI exp(ıωt)} .   (2.1.0.7)
For a capacitor the total charge is proportional to the applied voltage:
q(t) = C u(t)  together with  i(t) = dq/dt (t)  ⇒  i(t) = C du/dt (t) .
For a coil the voltage is proportional to the rate of change of current: u(t) = L di/dt (t). Combined with
(2.1.0.6) and (2.1.0.7) this leads to the above constitutive relations.
Now we combine the constitutive relations with the Kirchhoff current law (2.1.0.4). We end up with a linear
system of equations!
U1 = U , U6 = 0 .
We do not get equations for the nodes ➀ and ➅, because these nodes are connected to the “outside
world” so that the Kirchhoff current law (2.1.0.4) does not hold (from a local perspective). This is fitting,
because the voltages in these nodes are known anyway.
[ ıωC_1 + 1/R_1 − ı/(ωL) + 1/R_2     −1/R_1            ı/(ωL)                      −1/R_2                       ] [U_2]   [ ıωC_1 U ]
[ −1/R_1                             1/R_1 + ıωC_2     0                           −ıωC_2                       ] [U_3] = [    0    ]
[ ı/(ωL)                             0                 1/R_5 − ı/(ωL) + 1/R_4      −1/R_4                       ] [U_4]   [  U/R_5  ]
[ −1/R_2                             −ıωC_2            −1/R_4                      1/R_2 + ıωC_2 + 1/R_4 + 1/R_3 ] [U_5]   [    0    ]
This is a linear system of equations with complex coefficients: A ∈ ℂ^{4,4}, b ∈ ℂ⁴. For the algorithms to
be discussed below this does not matter, because they work alike for real and complex numbers. y
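The following EIGEN sketch assembles and solves this 4×4 complex system; all parameter values (resistances, capacitances, inductance, frequency, source voltage) are made up for illustration only:

#include <Eigen/Dense>
#include <cmath>
#include <complex>
#include <iostream>

int main() {
  using Comp = std::complex<double>;
  const Comp i(0.0, 1.0);                                // imaginary unit
  const double R1 = 1, R2 = 2, R3 = 3, R4 = 4, R5 = 5;   // [Ohm], assumed
  const double C1 = 1e-6, C2 = 2e-6, L = 1e-3, U = 1.0;  // assumed values
  const double w = 2 * M_PI * 50;                        // angular frequency

  Eigen::Matrix<Comp, 4, 4> A;
  A << i * w * C1 + 1. / R1 - i / (w * L) + 1. / R2, -1. / R1, i / (w * L), -1. / R2,
       -1. / R1, 1. / R1 + i * w * C2, 0., -i * w * C2,
       i / (w * L), 0., 1. / R5 - i / (w * L) + 1. / R4, -1. / R4,
       -1. / R2, -i * w * C2, -1. / R4, 1. / R2 + i * w * C2 + 1. / R4 + 1. / R3;
  Eigen::Vector4cd b;
  b << i * w * C1 * U, 0., U / R5, 0.;

  const Eigen::Vector4cd u = A.lu().solve(b);  // nodal potentials U2,...,U5
  std::cout << u << std::endl;
  return 0;
}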
Review question(s) 2.1.0.8 (Nodal analysis of linear electric circuits)
(Q2.1.0.8.A) [A simple resistive circuit] (The circuit diagram, Fig. 37, with numbered nodes is not reproduced here.)
(Q2.1.0.8.B) [Current source] The voltage source with strength U in Fig. 37 is replaced with a current
source, which drives a known current I through the circuit branch it is attached to.
Which linear system of equations has to be solved in order to determine the unknown nodal potentials
U1 , U2 , U3 , U4 ?
(Q2.1.0.8.C) A linear mapping L : R n → R n is represented by the matrix A ∈ R n,n with respect to the
standard basis of R n comprising Cartesian coordinate vectors eℓ , ℓ = 1, . . . , n.
Explain, how one can compute the matrix representation of L with respect to the basis
b_1 = [2, 1, 0, …, 0]ᵀ ,  b_2 = [1, 2, 1, 0, …, 0]ᵀ ,  b_3 = [0, 1, 2, 1, 0, …, 0]ᵀ ,  … ,  b_n = [0, …, 0, 1, 2]ᵀ ,
that is, the columns of the tridiagonal matrix with entries 2 on the diagonal and 1 on the sub- and superdiagonal,
by merely solving n linear systems of equations and forming matrix products.
△
Now, recall a few notions from linear algebra needed to state criteria for the invertibility of a matrix.
Given A ∈ K m,n , the range/image (space) of A is the subspace of K m spanned by the columns of
A
R(A) := {Ax, x ∈ K n } ⊂ K m .
The kernel/nullspace of A is
N (A) := {z ∈ K n : Az = 0} .
Definition 2.2.1.3. Rank of a matrix → [NS02, Sect. 2.4], [QSS00, Sect. 1.5]
The rank of a matrix M ∈ K m,n , denoted by rank(M), is the maximal number of linearly indepen-
dent rows/columns of M. Equivalently, rank(A) = dim R(A).
Theorem 2.2.1.4. Criteria for invertibility of a matrix → [NS02, Sect. 2.3 & Cor. 3.8]
A square matrix A ∈ K^{n,n} is invertible/regular if one of the following equivalent conditions is satisfied:
1. ∃ B ∈ K^{n,n}: BA = AB = I,
2. x ↦ Ax defines a bijective endomorphism (automorphism) of Kⁿ,
3. the columns of A are linearly independent (full column rank),
4. the rows of A are linearly independent (full row rank),
5. det A ≠ 0 (non-vanishing determinant),
6. rank(A) = n (full rank).
§2.2.1.5 (Solution of a LSE as a “problem”, recall § 2.1.0.1) Linear algebra gives us a formal way to
denote the solution of an LSE by means of the inverse matrix: Ax = b ⇒ x = A⁻¹b for regular A.
Now recall our notion of “problem” from § 1.5.5.1 as a function F mapping data in a data space X to a
result in a result space Y. Concretely, for n × n linear systems of equations:
F : X := K^{n,n}_* × Kⁿ → Y := Kⁿ ,  (A, b) ↦ A⁻¹b ,
where K^{n,n}_* denotes the set of regular n × n matrices.
Before we examine sensitivity for linear systems of equations, we look at the simpler problem of
matrix×vector multiplication.
EXAMPLE 2.2.2.1 (Sensitivity of linear mappings) For a fixed given regular A ∈ K n,n we study the
problem map
F : K n → K n , x 7→ Ax ,
that is, now we consider only the vector x as data.
Goal: Estimate relative perturbations in F (x) due to relative perturbations in x.
We assume that K n is equipped with some vector norm (→ Def. 1.5.5.4) and we use the induced matrix
norm (→ Def. 1.5.5.10) on K^{n,n}. Using linearity and the elementary estimate ‖Mx‖ ≤ ‖M‖ ‖x‖, which
is a direct consequence of the definition of an induced matrix norm, we obtain
‖A(x + ∆x) − Ax‖ / ‖Ax‖ = ‖A∆x‖ / ‖Ax‖ ≤ ‖A‖ ‖A⁻¹‖ ‖∆x‖ / ‖x‖ ,
since ‖x‖ = ‖A⁻¹Ax‖ ≤ ‖A⁻¹‖ ‖Ax‖.
Now we study the sensitivity of the problem of finding the solution of a linear system of equations Ax = b,
A ∈ ℝ^{n,n} regular, b ∈ ℝⁿ, see § 2.1.0.1. We write x̃ for the solution of the perturbed linear system.
(normwise) relative error:  ε_r := ‖x − x̃‖ / ‖x‖ = ?
(‖·‖ ≙ suitable vector norm, e.g., maximum norm ‖·‖_∞)
Ax = b  ↔  (A + ∆A) x̃ = b + ∆b  ⇒  (A + ∆A)(x̃ − x) = ∆b − ∆A x .   (2.2.2.3)
Theorem 2.2.2.4. Conditioning of LSEs → [QSS00, Thm. 3.1], [GGK14, Thm 3.5]
If A is regular, ‖∆A‖ < ‖A⁻¹‖⁻¹, and (2.2.2.3) holds, then
(i) A + ∆A is regular/invertible,
(ii) if Ax = b and (A + ∆A) x̃ = b + ∆b, then
‖x − x̃‖ / ‖x‖ ≤ (‖A⁻¹‖ ‖A‖) / (1 − ‖A⁻¹‖ ‖A‖ ‖∆A‖/‖A‖) · (‖∆b‖/‖b‖ + ‖∆A‖/‖A‖) .
(The left-hand side is the relative error of the result; the bracket on the right collects the relative perturbations of the data.)
We conclude that I + B must have a trivial kernel N(I + B) = {0}, which implies that the square matrix
I + B is regular. We continue using this fact, the definition of the matrix norm, and (2.2.2.6):
‖(I + B)⁻¹‖ = sup_{x∈ℝⁿ\{0}} ‖(I + B)⁻¹x‖ / ‖x‖ = sup_{y∈ℝⁿ\{0}} ‖y‖ / ‖(I + B)y‖ ≤ 1 / (1 − ‖B‖) .
✷
Proof. (of Thm. 2.2.2.4) We use a slightly generalized version of Lemma 2.2.2.5, which gives us
‖(A + ∆A)⁻¹‖ ≤ ‖A⁻¹‖ / (1 − ‖A⁻¹∆A‖) .
We combine this estimate with (2.2.2.3):
‖∆x‖ ≤ (‖A⁻¹‖ / (1 − ‖A⁻¹∆A‖)) (‖∆b‖ + ‖∆A x‖) ≤ (‖A⁻¹‖ ‖A‖ / (1 − ‖A⁻¹‖ ‖∆A‖)) (‖∆b‖/(‖A‖ ‖x‖) + ‖∆A‖/‖A‖) ‖x‖ .
✷
Note that the term ‖A‖·‖A⁻¹‖ occurs frequently. Therefore it has been given a special name: the condition number of A, cond(A) := ‖A⁻¹‖·‖A‖. In particular, for perturbations of the matrix only (Δb = 0) the estimate of Thm. 2.2.2.4 becomes
\[ \epsilon_r := \frac{\|x-\tilde{x}\|}{\|x\|} \le \frac{\operatorname{cond}(A)\,\delta_A}{1-\operatorname{cond}(A)\,\delta_A}\;, \qquad \delta_A := \frac{\|\Delta A\|}{\|A\|}\;. \]  (2.2.2.8)
From (2.2.2.8) we conclude important messages about cond(A):
✦ If cond(A) ≫ 1, small perturbations in A can lead to large relative errors in the solution of the LSE.
✦ If cond(A) ≫ 1, even a stable algorithm (→ Def. 1.5.5.19) can produce solutions with large relative error!
Recall Thm. 2.2.2.4: for regular A ∈ K^{n,n}, small ΔA, and a generic vector/matrix norm ‖·‖,
\[ Ax = b,\;\; (A+\Delta A)\tilde{x} = b+\Delta b \quad\Rightarrow\quad \frac{\|x-\tilde{x}\|}{\|x\|} \le \frac{\operatorname{cond}(A)}{1-\operatorname{cond}(A)\,\|\Delta A\|/\|A\|}\left(\frac{\|\Delta b\|}{\|b\|}+\frac{\|\Delta A\|}{\|A\|}\right). \]  (2.2.2.9)
cond(A) ≫ 1 ➣ small relative changes of the data A, b may effect huge relative changes in the solution.
Terminology:
Small changes of data ⇒ small perturbations of result : well-conditioned problem
Small changes of data ⇒ large perturbations of result : ill-conditioned problem
EXAMPLE 2.2.2.10 (Intersection of lines in 2D) Solving a 2 × 2 linear system of equations amounts to
finding the intersection of two lines in the coordinate plane: This relationship allows a geometric view of
“sensitivity of a linear system”, when using the distance metric (Euclidean vector norm).
Remember the Hessian normal form of a straight line in the plane. We are given the Hessian normal
forms of two lines L1 and L2 and want to compute the coordinate vector x ∈ R2 of the point in which they
intersect:
Li = {x ∈ R2 : x T ni = di } , ni ∈ R2 , di ∈ R , i = 1, 2 .
LSE for finding the intersection:
\[ \underbrace{\begin{bmatrix} n_1^{\top} \\ n_2^{\top} \end{bmatrix}}_{=:A} x \;=\; \underbrace{\begin{bmatrix} d_1 \\ d_2 \end{bmatrix}}_{=:b}\;, \]
where the n_i are the (unit) normal vectors of the lines and the d_i ∈ R give their (signed) distance to the origin.
Now we perturb the right-hand side vector b and wonder how this will impact the intersection points. The
situation is illustrated by the following two pictures, in which the original and perturbed lines are drawn in
black and red, respectively.
Obviously, if the lines are almost parallel, a small shift in their position will lead to a big shift of the inter-
section point.
The following EIGEN-based C++ code investigates condition numbers for the matrix
\[ A = \begin{bmatrix} 1 & \cos\varphi \\ 0 & \sin\varphi \end{bmatrix} \]
that can arise when computing the intersection of two lines enclosing the angle ϕ. As usual, the directive using namespace Eigen; is assumed at the beginning of the file.
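The listing itself is not reproduced in this copy; the following is a minimal sketch of what such an investigation could look like (the original code on GITLAB may differ in details and variable names):

#include <Eigen/Dense>
#include <cmath>
#include <iostream>

int main() {
  for (double phi = 0.1; phi < M_PI / 2.0; phi += 0.1) {
    Eigen::Matrix2d A;
    A << 1.0, std::cos(phi), 0.0, std::sin(phi);
    // condition number w.r.t. the Euclidean norm from extremal singular values
    const Eigen::JacobiSVD<Eigen::Matrix2d> svd(A);
    const double cond2 = svd.singularValues()(0) / svd.singularValues()(1);
    // condition number w.r.t. the maximum norm (row-sum norms), cf. Ex. 1.5.5.12
    const double condInf = A.cwiseAbs().rowwise().sum().maxCoeff() *
                           A.inverse().cwiseAbs().rowwise().sum().maxCoeff();
    std::cout << phi << "  " << cond2 << "  " << condInf << std::endl;
  }
  return 0;
}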
In such a code the condition number of A with respect to the Euclidean vector norm can be computed from the extremal singular values delivered by EIGEN's built-in SVD, while the condition number for the maximum norm is evaluated via the row-sum norms of A and A⁻¹, recall Ex. 1.5.5.12.
Observation (Fig. 38, condition numbers for the 2-norm and the maximum norm plotted against the angle between n₁ and n₂): both condition numbers blow up (with respect to the Euclidean vector norms) as the angle enclosed by the two lines shrinks. This corresponds to a large sensitivity of the location of the intersection point in the case of glancing incidence.
Supplementary literature. In case you cannot remember the main facts about Gaussian elimination, please review them in your introductory linear algebra course material.
Ax = b ⇒ A′ x = b′ , if A′ = TA, b′ = Tb .
So we may try to convert the linear system of equations to a form that can be solved more easily by
multiplying with regular matrices from left, which boils down to applying row transformations. A suitable
target format is a diagonal linear system of equations, for which all equations are completely decoupled.
This is the gist of Gaussian elimination.
EXAMPLE 2.3.1.1 (Gaussian elimination)
Stage ➀ (Forward) elimination:
\[ \begin{bmatrix} 1 & 1 & 0 \\ 2 & 1 & -1 \\ 3 & -1 & -1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 4 \\ 1 \\ -3 \end{bmatrix} \quad\longleftrightarrow\quad \begin{aligned} x_1 + x_2 &= 4\\ 2x_1 + x_2 - x_3 &= 1\\ 3x_1 - x_2 - x_3 &= -3 \end{aligned} \]
Working on the augmented matrix [A | b]:
\[ \left[\begin{array}{ccc|c} 1 & 1 & 0 & 4 \\ 2 & 1 & -1 & 1 \\ 3 & -1 & -1 & -3 \end{array}\right] \;\longrightarrow\; \left[\begin{array}{ccc|c} 1 & 1 & 0 & 4 \\ 0 & -1 & -1 & -7 \\ 3 & -1 & -1 & -3 \end{array}\right] \;\longrightarrow\; \left[\begin{array}{ccc|c} 1 & 1 & 0 & 4 \\ 0 & -1 & -1 & -7 \\ 0 & -4 & -1 & -15 \end{array}\right] \;\longrightarrow\; \left[\begin{array}{ccc|c} 1 & 1 & 0 & 4 \\ 0 & -1 & -1 & -7 \\ 0 & 0 & 3 & 13 \end{array}\right] =: [\,U \,|\, \tilde{b}\,] \]
(pivot row highlighted and pivot element printed in bold in the original)
We have transformed the LSE to upper triangular form.
Stage ➁ Back substitution:
\[ \begin{aligned} x_1 + x_2 &= 4 \\ -x_2 - x_3 &= -7 \\ 3x_3 &= 13 \end{aligned} \qquad\Rightarrow\qquad \begin{aligned} x_3 &= \tfrac{13}{3}\;, \\ x_2 &= 7 - \tfrac{13}{3} = \tfrac{8}{3}\;, \\ x_1 &= 4 - \tfrac{8}{3} = \tfrac{4}{3}\;. \end{aligned} \]
More detailed examples are given in [Gut09, Sect. 1.1], [NS02, Sect. 1.1]. y
More generally:
(Schematic: in the k-th elimination step the pivot element ∗ on the diagonal is used to annihilate all entries of its column below the diagonal; after n − 1 such steps the matrix is in upper triangular form.)
transformation: Ax = b ➤ A′x = b′ with
\[ a'_{ij} := \begin{cases} a_{ij} - \dfrac{a_{ik}}{a_{kk}}\,a_{kj} & \text{for } k < i,j \le n,\\[1mm] 0 & \text{for } k < i \le n,\; j = k,\\[1mm] a_{ij} & \text{else}, \end{cases} \qquad b'_i := \begin{cases} b_i - \dfrac{a_{ik}}{a_{kk}}\,b_k & \text{for } k < i \le n,\\[1mm] b_i & \text{else}. \end{cases} \]  (2.3.1.2)
The quotients a_{ik}/a_{kk} are the multipliers l_{ik}.
§2.3.1.3 (Gaussian elimination: algorithm) Here we give a direct E IGEN implementation of Gaussian
elimination for LSE Ax = b (grossly inefficient!).
(only the last lines of C++ code 2.3.1.4 are reproduced here)
24      Ab(i, n) /= Ab(i, i);
25    }
26    x = Ab.rightCols(1);   // Solution in rightmost column!
27  }
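Since only the tail of Code 2.3.1.4 survives in this copy, here is a self-contained sketch of such a routine (without pivoting, for illustration only; the line numbers referred to below belong to the original listing on GITLAB and need not match this sketch):

#include <Eigen/Dense>

// Solve Ax = b by Gaussian elimination without pivoting (grossly inefficient!)
void gausselimsolve(const Eigen::MatrixXd &A, const Eigen::VectorXd &b,
                    Eigen::VectorXd &x) {
  const Eigen::Index n = A.rows();
  Eigen::MatrixXd Ab(n, n + 1);          // augmented matrix [A, b]
  Ab << A, b;
  // Forward elimination according to (2.3.1.2)
  for (Eigen::Index i = 0; i < n - 1; ++i) {
    const double pivot = Ab(i, i);
    for (Eigen::Index k = i + 1; k < n; ++k) {
      const double fac = Ab(k, i) / pivot;           // multiplier l_{ki}
      Ab.block(k, i + 1, 1, n - i) -= fac * Ab.block(i, i + 1, 1, n - i);
    }
  }
  // Back substitution
  Ab(n - 1, n) /= Ab(n - 1, n - 1);
  for (Eigen::Index i = n - 2; i >= 0; --i) {
    for (Eigen::Index l = i + 1; l < n; ++l) {
      Ab(i, n) -= Ab(l, n) * Ab(i, l);
    }
    Ab(i, n) /= Ab(i, i);
  }
  x = Ab.rightCols(1);                   // solution in rightmost column
}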
• In Line 9 the right-hand-side vector is set as the last column of the matrix, which facilitates simultaneous row transformations of matrix and r.h.s.
• In Line 14 the variable fac is the multiplier from (2.3.1.2).
• In Line 26 we extract the solution from the last column of the transformed matrix.
y
§2.3.1.5 (Computational effort of Gaussian elimination) We examine Code 2.3.1.4.
• Forward elimination involves three nested loops (note that the compact vector operation in Line 15
involves another loop from i + 1 to m).
• Back substitution can be done with two nested loops.
computational cost (↔ number of elementary operations) of Gaussian elimination [NS02, Sect. 1.3]:
\[ \text{forward elimination:}\quad \sum_{i=1}^{n-1}(n-i)\bigl(2(n-i)+3\bigr) = n(n-1)\bigl(\tfrac{2}{3}n+\tfrac{7}{6}\bigr)\ \text{Ops.}, \]  (2.3.1.6)
\[ \text{back substitution:}\quad \sum_{i=1}^{n}\bigl(2(n-i)+1\bigr) = n^2\ \text{Ops.} \]
asymptotic complexity (→ Section 1.4) of Gaussian elimination (without pivoting) for a generic LSE Ax = b, A ∈ R^{n,n}:  = (2/3)n³ + O(n²) = O(n³)
y
EXPERIMENT 2.3.1.7 (Runtime of Gaussian elimination) In this experiment we compare the efficiency
of our hand-coded Gaussian elimination with that of library functions.
C++ code 2.3.1.8: Measuring runtimes of Code 2.3.1.4 vs. E IGEN lu()-operator vs. MKL
➺ GITLAB
2   //! Eigen code for timing numerical solution of linear systems
3   MatrixXd gausstiming() {
4     std::vector<int> n = {8,16,32,64,128,256,512,1024,2048,4096,8192};
5     int nruns = 3;
6     MatrixXd times(n.size(), 3);
7     for (int i = 0; i < n.size(); ++i) {
8       Timer t1, t2;  // timer class
9       MatrixXd A = MatrixXd::Random(n[i], n[i]) + n[i] * MatrixXd::Identity(n[i], n[i]);
10      VectorXd b = VectorXd::Random(n[i]);
11      VectorXd x(n[i]);
12      for (int j = 0; j < nruns; ++j) {
13        t1.start(); x = A.lu().solve(b); t1.stop();  // Eigen implementation
14  #ifndef EIGEN_USE_MKL_ALL  // only test own algorithm without MKL
15        if (n[i] <= 4096) {   // prevent long runs
16          t2.start(); gausselimsolve(A, b, x); t2.stop();  // own Gauss elimination
          }
17  #endif
18      }
19      times(i, 0) = n[i]; times(i, 1) = t1.min(); times(i, 2) = t2.min();
20    }
21    return times;
22  }
Fig. 39 (measured runtimes [s] vs. matrix size n, doubly logarithmic): curves for the Eigen lu() solver, gausselimsolve, and the MKL solver (sequential and parallel), together with an O(n³) reference slope.
n Code 2.3.1.4 [s] E IGEN lu() [s] MKL sequential [s] MKL parallel [s]
8 6.340e-07 1.140e-06 3.615e-06 2.273e-06
16 2.662e-06 3.203e-06 9.603e-06 1.408e-05
32 1.617e-05 1.331e-05 1.603e-05 2.495e-05
64 1.214e-04 5.836e-05 5.142e-05 7.416e-05
128 2.126e-03 3.180e-04 2.041e-04 3.176e-04
256 3.464e-02 2.093e-03 1.178e-03 1.221e-03
512 3.954e-01 1.326e-02 7.724e-03 8.175e-03
1024 4.822e+00 9.073e-02 4.457e-02 4.864e-02
2048 5.741e+01 6.260e-01 3.347e-01 3.378e-01
4096 5.727e+02 4.531e+00 2.644e+00 1.619e+00
8192 - 3.510e+01 2.064e+01 1.360e+01
y
Never implement Gaussian elimination yourself !
A concise list of libraries for numerical linear algebra and related problems can be found here.
Remark 2.3.1.9 (Gaussian elimination for non-square matrices) In Code 2.3.1.4: the right hand side
vector b was first appended to matrix A as rightmost column, and then forward elimination and back
substitution were carried out on the resulting matrix. This can be generalized to a Gaussian elimination for
rectangular matrices A ∈ K n,n+1 !
Consider a “fat matrix” A ∈ K n,m , m>n:
(Schematic: forward elimination and back substitution applied to the augmented n × m matrix turn its left n × n block into the identity; the remaining columns then hold the solution vectors.)
Usually library functions meant to solve LSEs also accept a matrix instead of a right-hand-side vector and
then return a matrix of solution vectors. For instance, in E IGEN the following function call accomplishes
this:
Eigen::MatrixXd X = A.lu().solve(B);
C++ code 2.3.1.10: Gaussian elimination with multiple r.h.s. → Code 2.3.1.4 ➺ GITLAB
2   //! Gauss elimination without pivoting, X = A^{-1} B
3   //! A must be an n x n matrix, B an n x m matrix
4   //! Result is returned in matrix X
5   void gausselimsolvemult(const MatrixXd &A, const MatrixXd &B,
6                           MatrixXd &X) {
7     const Eigen::Index n = A.rows();
8     const Eigen::Index m = B.cols();
9     MatrixXd AB(n, n + m);   // Augmented matrix [A, B]
10    AB << A, B;
11    // Forward elimination, do not forget the B part of the matrix
12    for (Eigen::Index i = 0; i < n - 1; ++i) {
13      const double pivot = AB(i, i);
14      for (Eigen::Index k = i + 1; k < n; ++k) {
15        const double fac = AB(k, i) / pivot;
16        AB.block(k, i + 1, 1, m + n - i - 1) -= fac * AB.block(i, i + 1, 1, m + n - i - 1);
17      }
18    }
19    // Back substitution
20    AB.block(n - 1, n, 1, m) /= AB(n - 1, n - 1);
21    for (Eigen::Index i = n - 2; i >= 0; --i) {
22      for (Eigen::Index l = i + 1; l < n; ++l) {
23        AB.block(i, n, 1, m) -= AB.block(l, n, 1, m) * AB(i, l);
24      }
25      AB.block(i, n, 1, m) /= AB(i, i);
26    }
27    X = AB.rightCols(m);
28  }
y
Concerning the next two remarks: For understanding or analyzing special variants of Gaussian elimination,
it is useful to be aware of
• the effects of elimination steps on the level of matrix blocks, cf. § 1.3.1.13,
• and of the recursive nature of Gaussian elimination.
Remark 2.3.1.11 (Gaussian elimination via rank-1 modifications) We can view Gaussian elimination from the perspective of matrix block operations. The first step of Gaussian elimination (with pivot α ≠ 0), cf. (2.3.1.2), can be expressed as
\[ A := \begin{bmatrix} \alpha & \mathbf{c}^{\top} \\ \mathbf{d} & C \end{bmatrix} \;\longrightarrow\; A' := \begin{bmatrix} \alpha & \mathbf{c}^{\top} \\ \mathbf{0} & C' \end{bmatrix}, \qquad C' := C - \frac{\mathbf{d}\,\mathbf{c}^{\top}}{\alpha}\;. \]  (2.3.1.12)
The update C → C − dc⊤/α is a rank-1 modification of C.
Terminology: Adding a tensor product of two vectors to a matrix is called a rank-1 modification of that
matrix, see also § 2.6.0.12 below.
Notice that the transformation (2.3.1.12) is applied to the resulting lower-right block C′ in the next elimina-
tion step. Thus Gaussian elimination can be realized by successive rank-1 modifications applied to smaller
and smaller lower-right blocks of the matrix. An implementation in this spirit is given in Code 2.3.1.13.
In this code the Gaussian elimination is carried out in situ: the matrix A is replaced with the transformed
matrices during elimination. If the matrix is not needed later this offers maximum efficiency. An in-situ
LU-decomposition as described in Rem. 2.3.2.11 could also be performed by Code 2.3.1.13 after a modi-
fication of Line 10. y
Remark 2.3.1.14 (Block Gaussian elimination) Recall the “principle” from § 1.3.1.13: deal with block
matrices (“matrices of matrices”) like regular matrices (except for commutativity of multiplication!). This
suggests a block view of Gaussian elimination:
2.3.2 LU-Decomposition
A matrix factorization (ger. Matrixzerlegung) expresses a general matrix A as product of two special (fac-
tor) matrices. Requirements for these special matrices define the matrix factorization. Matrix factorizations
come with the mathematical issue of existence & uniqueness, and pose the numerical challenge of finding
algorithms for computing the factor matrices (efficiently and stably).
Matrix factorizations
☞ often capture the essence of algorithms in compact form (here: Gaussian elimination),
☞ are important building blocks for complex algorithms,
☞ are key theoretical tools for algorithm analysis.
In this section the forward elimination step of Gaussian elimination will be related to a special matrix
factorization, the so-called LU-decomposition or LU-factorization.
Supplementary literature. The LU-factorization should be well known from the introductory
linear algebra course. In case you need to refresh your knowledge, please consult one of the
following:
• textbook by Nipp & Stoffer [NS02, Sect. 2.4],
• book by M. Hanke-Bourgeois [Han02, p. II.4],
• linear algebra lecture notes by M. Gutknecht [Gut09, Sect. 3.1],
• textbook by Quarteroni et al. [QSS00, Sect.3.3.1],
• Sect. 3.5 of the book by Dahmen & Reusken,
Here: row transformation ≙ adding a multiple of a matrix row to another row, multiplying a row with a non-zero scalar (number), or swapping two rows (more special row transformations are discussed in Rem. 1.3.1.12).
Note: Row transformations preserve the regularity of a matrix and, thus, are suitable for transforming linear systems of equations: they will not affect the solution when applied to both the coefficient matrix and the right-hand-side vector.
Rem. 1.3.1.12: row transformations can be realized by multiplication from the left with suitable transformation matrices. By multiplying these transformation matrices together we can emulate the effect of successive row transformations through left multiplication with a single matrix T:
A —(row transformations)→ A′  ⇔  TA = A′ .
Now we want to determine the matrix T belonging to the forward elimination step of Gaussian elimination.
EXAMPLE 2.3.2.1 (Gaussian elimination and LU-factorization → [NS02, Sect. 2.4], [Han02, Sect. II.4], [Gut09, Sect. 3.1]) We revisit the LSE from Ex. 2.3.1.1 and carry out (forward) Gaussian elimination:
\[ \begin{bmatrix} 1 & 1 & 0 \\ 2 & 1 & -1 \\ 3 & -1 & -1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 4 \\ 1 \\ -3 \end{bmatrix} \quad\longleftrightarrow\quad \begin{aligned} x_1 + x_2 &= 4\\ 2x_1 + x_2 - x_3 &= 1\\ 3x_1 - x_2 - x_3 &= -3 \end{aligned} \]
We perform the same elimination steps as in Ex. 2.3.1.1, but this time we record the multipliers in a second matrix, in the positions of the entries that are made to vanish:
\[ \begin{bmatrix} 1 & & \\ & 1 & \\ & & 1 \end{bmatrix},\; \left[\begin{array}{ccc|c} 1 & 1 & 0 & 4\\ 2 & 1 & -1 & 1\\ 3 & -1 & -1 & -3 \end{array}\right] \;\longrightarrow\; \begin{bmatrix} 1 & & \\ 2 & 1 & \\ 3 & & 1 \end{bmatrix},\; \left[\begin{array}{ccc|c} 1 & 1 & 0 & 4\\ 0 & -1 & -1 & -7\\ 0 & -4 & -1 & -15 \end{array}\right] \;\longrightarrow\; \underbrace{\begin{bmatrix} 1 & & \\ 2 & 1 & \\ 3 & 4 & 1 \end{bmatrix}}_{=:L},\; \underbrace{\left[\begin{array}{ccc|c} 1 & 1 & 0 & 4\\ 0 & -1 & -1 & -7\\ 0 & 0 & 3 & 13 \end{array}\right]}_{=:U\ (\text{augmented by r.h.s.})} \]
As before, the pivot rows are highlighted and the pivot elements written in bold in the original; the negative multipliers take the places of the matrix entries made to vanish (colored red in the original).
After this replacement we make the "surprising" observation that A = LU! y
The link between Gaussian elimination and matrix factorization, and an explanation for the observation made in Ex. 2.3.2.1, becomes clear by recalling that row transformations result from multiplications with elimination matrices:
\[ a_1 \ne 0:\qquad \begin{bmatrix} 1 & 0 & \cdots & \cdots & 0 \\ -\frac{a_2}{a_1} & 1 & 0 & & \vdots \\ -\frac{a_3}{a_1} & 0 & 1 & & \\ \vdots & & & \ddots & \\ -\frac{a_n}{a_1} & 0 & \cdots & & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_n \end{bmatrix} = \begin{bmatrix} a_1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} . \]  (2.3.2.2)
n − 1 steps of Gaussian forward elimination immediately give rise to a matrix factorization (non-zero pivot elements assumed)
\[ A = L_1 \cdot \dots \cdot L_{n-1}\, U, \qquad \text{with elimination matrices } L_i,\; i = 1,\dots,n-1, \text{ and an upper triangular matrix } U \in \mathbb{R}^{n,n}. \]
\[ \begin{bmatrix} 1 & 0 & \cdots & \cdots & 0\\ l_2 & 1 & 0 & & \vdots\\ l_3 & 0 & 1 & & \\ \vdots & & & \ddots & \\ l_n & 0 & \cdots & & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & \cdots & \cdots & 0\\ 0 & 1 & 0 & & \vdots\\ 0 & h_3 & 1 & & \\ \vdots & & & \ddots & \\ 0 & h_n & 0 & & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots & \cdots & 0\\ l_2 & 1 & 0 & & \vdots\\ l_3 & h_3 & 1 & & \\ \vdots & & & \ddots & \\ l_n & h_n & 0 & & 1 \end{bmatrix} \]
The matrix product L₁ · … · L_{n−1} therefore yields a normalized lower triangular matrix, whose non-trivial entries are the multipliers a_{ik}/a_{kk} from (2.3.1.2) → Ex. 2.3.1.1.
The matrix factorization that “automatically” emerges during Gaussian forward elimination has a special
name:
Given a square matrix A ∈ K^{n,n}, an upper triangular matrix U ∈ K^{n,n} and a normalized lower triangular matrix L ∈ K^{n,n} (→ Def. 1.1.2.3) form an LU-decomposition/LU-factorization of A, if A = LU.
(Schematic: A = L · U with L a normalized lower triangular matrix, i.e. with unit diagonal, and U an upper triangular matrix.)
A = L · U.
Using this notion we can summarize what we have learned from studying elimination matrices:
The (forward) Gaussian elimination (without pivoting) for Ax = b, A ∈ R^{n,n}, if possible, is algebraically equivalent to an LU-factorization/LU-decomposition A = LU of A into a normalized lower triangular matrix L and an upper triangular matrix U, [DR08, Thm. 3.2.1], [NS02, Thm. 2.10], [Gut09, Sect. 3.1].
Algebraically equivalent = ˆ when carrying out the forward elimination in situ as in Code 2.3.1.4 and storing
the multipliers in a lower triangular matrix as in Ex. 2.3.2.1, then the latter will contain the L-factor and the
original matrix will be replaced with the U-factor.
Proof. We adopt a block matrix perspective (→ § 1.3.1.13) and employ induction w.r.t. n:
n = 1: the assertion is trivial.
n − 1 → n: the induction hypothesis ensures the existence of a normalized lower triangular matrix L̃ and a regular upper triangular matrix Ũ such that Ã = L̃Ũ, where Ã is the upper left (n − 1) × (n − 1) block of A:
\[ A = \begin{bmatrix} \tilde{A} & \mathbf{b} \\ \mathbf{a}^{\top} & \alpha \end{bmatrix} = \begin{bmatrix} \tilde{L} & \mathbf{0} \\ \mathbf{x}^{\top} & 1 \end{bmatrix} \begin{bmatrix} \tilde{U} & \mathbf{y} \\ \mathbf{0} & \xi \end{bmatrix} =: LU . \]
Then solve
➊ L̃y = b → provides y ∈ K^{n−1},
➋ x⊤Ũ = a⊤ → provides x ∈ K^{n−1},
➌ x⊤y + ξ = α → provides ξ ∈ K.
Uniqueness: if L₁U₁ = L₂U₂ are two LU-factorizations of a regular A, then L₂⁻¹L₁ = U₂U₁⁻¹ is both a normalized lower triangular and an upper triangular matrix, hence L₂⁻¹L₁ = U₂U₁⁻¹ = I.
This reveals how to compute the entries of L and U sequentially. We start with the top row of U, which agrees with that of A, and then work our way towards the bottom right corner:
(Fig. 40, schematic: rows of U and columns of L are computed alternately from the defining relation A = LU.)
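Code 2.3.2.8 is not reproduced in this copy; the following is a minimal sketch of such a sequential LU-factorization (Doolittle-type ordering: in step k the k-th row of U and the k-th column of L are computed). The actual listing on GITLAB may differ in details.

#include <Eigen/Dense>
#include <utility>

// LU-factorization without pivoting: returns the pair (L, U) with A = L*U
std::pair<Eigen::MatrixXd, Eigen::MatrixXd> lufak(const Eigen::MatrixXd &A) {
  const Eigen::Index n = A.rows();
  Eigen::MatrixXd L = Eigen::MatrixXd::Identity(n, n);
  Eigen::MatrixXd U = Eigen::MatrixXd::Zero(n, n);
  for (Eigen::Index k = 0; k < n; ++k) {
    // k-th row of U: u_{kj} = a_{kj} - sum_{m<k} l_{km} u_{mj}
    for (Eigen::Index j = k; j < n; ++j) {
      double s = A(k, j);
      for (Eigen::Index m = 0; m < k; ++m) s -= L(k, m) * U(m, j);
      U(k, j) = s;
    }
    // k-th column of L: l_{ik} = (a_{ik} - sum_{m<k} l_{im} u_{mk}) / u_{kk}
    for (Eigen::Index i = k + 1; i < n; ++i) {
      double s = A(i, k);
      for (Eigen::Index m = 0; m < k; ++m) s -= L(i, m) * U(m, k);
      L(i, k) = s / U(k, k);
    }
  }
  return {L, U};
}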
It is instructive to compare this code with a simple implementation of the matrix product of a normalized
lower triangular and an upper triangular matrix. From this perspective the LU-factorization looks like the
“inversion” of matrix multiplication:
16    return A;
17  }
(tail of the listing for the multiplication of a normalized lower triangular and an upper triangular matrix, Code 2.3.2.9)
Observe: Solving for entries L(i,k) of L and U(k,j) of U in the multiplication of an upper triangular
and normalized lower triangular matrix (→ Code 2.3.2.9) yields the algorithm for LU-factorization (→
Code 2.3.2.8). y
The computational cost of LU-factorization is immediate from Code 2.3.2.8 and the same as for Gaussian
elimination, cf. § 2.3.1.5:
Asymptotic complexity of the LU-factorization of A ∈ R^{n,n}: O(n³), with cost ≈ (2/3)n³ elementary operations as for Gaussian forward elimination, cf. (2.3.1.6).  (2.3.2.10)
Remark 2.3.2.11 (In-situ LU-decomposition) “In situ” is Latin and means “in place”. Many library
routines provide routines that overwrite the matrix A with its LU-factors in order to save memory when
the original matrix is no longer needed. This is possible because the number of unknown entries of the
LU-factors combined exactly agrees with the number of entries of A. The convention is to replace the
strict lower-triangular part of A with L, and the upper triangular part with U:
(Schematic: A is overwritten; the upper triangular part, including the diagonal, holds U, the strictly lower triangular part holds the subdiagonal entries of L, whose unit diagonal need not be stored.)
y
Remark 2.3.2.12 (Recursive LU-factorization) Recall Rem. 2.3.1.11 and the recursive view of Gaussian elimination it suggests, because in (2.3.1.12) an analogous row transformation can be applied to the remaining lower-right block C′.
In light of the close relationship between Gaussian elimination and LU-factorization there will also be a
recursive version of LU-factorization.
The following code implements the recursive in situ (in place) LU-decomposition of A ∈ R n,n (without
pivoting). It is closely related to Code 2.3.1.13, but now both L and U are stored in place of A:
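Code 2.3.2.13 is not reproduced in this copy; the following is a minimal sketch of such a recursive in-situ routine (the line numbers referred to below belong to the original listing on GITLAB and need not match this sketch):

#include <Eigen/Dense>

// Recursive in-situ LU-factorization (no pivoting): the returned matrix holds U
// in its upper triangular part and the subdiagonal entries of L below the diagonal.
Eigen::MatrixXd lurec(const Eigen::MatrixXd &A) {
  const Eigen::Index n = A.rows();
  Eigen::MatrixXd result(n, n);
  if (n > 1) {
    // multipliers = first column of L (below the diagonal)
    const Eigen::VectorXd fac = A.col(0).tail(n - 1) / A(0, 0);
    // rank-1 modification (2.3.1.12) of the lower-right block, then recurse
    result.bottomRightCorner(n - 1, n - 1) =
        lurec(A.bottomRightCorner(n - 1, n - 1) - fac * A.row(0).tail(n - 1));
    result.row(0) = A.row(0);          // first row of U
    result.col(0).tail(n - 1) = fac;   // first column of L
    return result;
  }
  return A;
}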
Refer to (2.3.1.12) to understand lurec: the rank-1 modification of the lower (n − 1) × (n − 1)-block of
the matrix is done in Line 7-Line 8 of the code.
C++ code 2.3.2.14: Driver for recursive LU-factorization of Code 2.3.2.13 ➺ GITLAB
2   //! post-processing: extract L and U
3   void lurecdriver(const MatrixXd &A, MatrixXd &L, MatrixXd &U) {
4     const MatrixXd A_dec = lurec(A);
5     // post-processing:
6     // extract L and U
7     U = A_dec.triangularView<Upper>();
8     L.setIdentity();
9     L += A_dec.triangularView<StrictlyLower>();
10  }
y
§2.3.2.15 (Using LU-factorization to solve a linear system of equations) An intermediate LU-factorization paves the way for a three-stage procedure for solving an n × n linear system of equations Ax = b:
① LU-decomposition A = LU, #elementary operations = ⅓ n(n−1)(n+1),
② forward substitution: solve Lz = b, #elementary operations = ½ n(n−1),
③ backward substitution: solve Ux = z, #elementary operations = ½ n(n+1).
➣ The asymptotic complexity of the complete three-stage algorithm is (in leading order) the same as for Gaussian elimination (the bulk of the computational cost is incurred in the factorization step ①).
However, the perspective of LU-factorization reveals that the solution of linear systems of equations can be split into two separate phases with different asymptotic complexity in terms of the number n of unknowns:
setup phase (factorization), cost O(n³)  +  elimination phase (forward/backward substitution), cost O(n²).
y
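A hedged EIGEN sketch of stages ②/③, assuming LU-factors L and U of A have already been computed (e.g. by a routine such as the lufak sketch above; no pivoting assumed):

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Solve Ax = b given the LU-factors L (normalized lower triangular) and U
VectorXd solveViaLU(const MatrixXd &L, const MatrixXd &U, const VectorXd &b) {
  // ② forward substitution Lz = b, O(n^2)
  const VectorXd z = L.triangularView<Eigen::UnitLower>().solve(b);
  // ③ backward substitution Ux = z, O(n^2)
  return U.triangularView<Eigen::Upper>().solve(z);
}

Only stage ① costs O(n³); once the factors are available, every further right-hand side can be processed at O(n²) cost.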
Remark 2.3.2.16 (Rationale for using LU-decomposition in algorithms) Gauss elimination and
LU-factorization for the solution of a linear system of equations (→ § 2.3.2.15) are equivalent and only
differ in the ordering of the steps.
(Schematic (2.3.2.18): writing A = LU in 2 × 2 block form, with L lower block triangular and U upper block triangular, the (1,1) blocks satisfy A₁₁ = L₁₁U₁₁.)
The left-upper blocks of both L and U in the LU-factorization of A depend only on the corresponding
left-upper block of A! y
Remark 2.3.2.19 (Block LU-factorization) In the spirit of § 1.3.1.13 we can also adopt a matrix-block
perspective of LU-factorization. This is a natural idea in light of the close connection between matrix multi-
plication and matrix factorization, cf. the relationship between matrix factorization and matrix multiplication
found in § 2.3.2.6:
Block matrix multiplication (1.3.1.14) ∼
= block LU -decomposition:
We consider a block-partitioned matrix
\[ A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}, \qquad A_{11} \in \mathbb{K}^{n,n} \text{ regular},\; A_{12} \in \mathbb{K}^{n,m},\; A_{21} \in \mathbb{K}^{m,n},\; A_{22} \in \mathbb{K}^{m,m}. \]
The block LU-decomposition arises from the block Gaussian forward elimination of Rem. 2.3.1.14 in the same way as the standard LU-decomposition is spawned by the entry-wise Gaussian elimination. With the Schur complement S := A₂₂ − A₂₁A₁₁⁻¹A₁₂ it reads
\[ A = \begin{bmatrix} I & 0 \\ A_{21}A_{11}^{-1} & I \end{bmatrix}\begin{bmatrix} A_{11} & A_{12} \\ 0 & S \end{bmatrix}. \]
Under the assumption that A₁₁ is invertible, the Schur complement matrix S is invertible if and only if this holds for A. y
Review question(s) 2.3.2.21 (Gaussian elimination and LU-decomposition)
(Q2.3.2.21.A) Performing Gaussian elimination by hand, compute the solution of the following 4 × 4 linear system of equations:
\[ \begin{bmatrix} 2 & -1 & 0 & 0 \\ -1 & 2 & -1 & 0 \\ 0 & -1 & 2 & -1 \\ 0 & 0 & -1 & 2 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}. \]
(Q2.3.2.21.B) Give an example of a 2 × 2-matrix, for which there does not exist an LU-decomposition.
(Q2.3.2.21.C) Assume that one of the LU-factors of a square matrix A ∈ R n,n is diagonal. What proper-
ties of A can you infer?
(Q2.3.2.21.D) Suppose the LU-factors L, U ∈ R n,n of a square matrix A ∈ R n,n exist and have been
computed already. Sketch an algorithm for computing the determinant det A.
From linear algebra remember that the determinant of the product of two square matrices is the product
of the determinants of its factors.
(Q2.3.2.21.E) Compute the block LU-decomposition of the partitioned matrix
\[ A = \begin{bmatrix} I_k & B^{\top} \\ B & O \end{bmatrix} \in \mathbb{R}^{n,n}, \qquad B \in \mathbb{R}^{n-k,k},\; k \in \{1,\dots,n-1\}, \]
in terms of n, k → ∞.
(Q2.3.2.21.G) What is the inverse of the block matrix
\[ \begin{bmatrix} O & A \\ A^{\top} & O \end{bmatrix} \in \mathbb{R}^{2n,2n}, \qquad A \in \mathbb{R}^{n,n} \text{ regular}\,? \]
well-defined.
Show that A is singular if and only if S is singular.
Hint.
• First show that if [x; y] ∈ N(A), then Sy = 0.
• For y ∈ R^m such that Sy = 0, consider the vector [−A₁₁⁻¹A₁₂y; y].
y
2.3.3 Pivoting
We know from linear algebra [NS02, Sect. 1.1] that sometimes we have to swap rows of a linear system
of equations in order to carry out Gaussian elimination without encountering a division by zero. Here is a
2 × 2 example:
\[ \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} \quad\leftrightarrow\quad \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} b_2 \\ b_1 \end{bmatrix} \]
Remedy (in linear algebra): Pivoting ≙ avoid zero pivot elements by swapping rows.
EXAMPLE 2.3.3.1 (Pivoting and numerical stability → [DR08, Example 3.2.3]) Gaussian elimination
for the 2 × 2 linear system of equations studied in this example will never lead to a division by zero.
Nevertheless, Gaussian elimination runs into problems.
2   MatrixXd A(2, 2);
3   A << 5.0e-17, 1.0, 1.0, 1.0;
4   VectorXd b(2);
5   VectorXd x2(2);
6   b << 1.0, 2.0;
7   const VectorXd x1 = A.fullPivLu().solve(b);
8   gausselimsolve::gausselimsolve(A, b, x2);   // see Code 2.3.1.10
9   const auto [L, U] = lufak::lufak(A);        // see Code 2.3.2.8
10  const VectorXd z = L.lu().solve(b);
11  const VectorXd x3 = U.lu().solve(z);
12  std::cout << "x1 = \n"
13            << x1 << "\nx2 = \n"
14            << x2 << "\nx3 = \n"
15            << x3 << std::endl;

Output:
1  x1 =
2  1
3  1
4  x2 =
5  0
6  1
7  x3 =
8  0
9  1
We get different results from the EIGEN built-in linear solver and our hand-crafted Gaussian elimination! Let us see what we should expect as the "exact solution":
\[ A = \begin{bmatrix} \epsilon & 1 \\ 1 & 1 \end{bmatrix}, \quad b = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \quad\Rightarrow\quad x = \frac{1}{1-\epsilon}\begin{bmatrix} 1 \\ 1-2\epsilon \end{bmatrix} \approx \begin{bmatrix} 1 \\ 1 \end{bmatrix} \quad\text{for } |\epsilon| \ll 1 \;. \]
What is wrong with E IGEN? To make sense of our observations we have to rely on our insights into
roundoff errors gained in Section 1.5.3. Armed with knowledge about the behavior of machine numbers
and roundoff errors we can understand what is going on in this example:
➊ We "simulate" floating point arithmetic for straightforward LU-factorization: if ε ≤ ½ EPS, EPS ≙ machine precision,
\[ L = \begin{bmatrix} 1 & 0 \\ \epsilon^{-1} & 1 \end{bmatrix}, \qquad U = \begin{bmatrix} \epsilon & 1 \\ 0 & 1-\epsilon^{-1} \end{bmatrix} \;\overset{(*)}{=}\; \begin{bmatrix} \epsilon & 1 \\ 0 & -\epsilon^{-1} \end{bmatrix} =: \tilde{U} \quad\text{in } \mathbb{M}\,! \]  (2.3.3.2)
((∗): in machine arithmetic 1 − ε⁻¹ is rounded to −ε⁻¹ when ε ≤ ½ EPS.)
➋ If, instead, the LU-factorization is carried out after swapping the rows (partial pivoting), see (2.3.3.3), then the solution of L̃Ũx̃ = b computed in M is x̃ = [1 + 2ε, 1 − 2ε]^⊤, which is a sufficiently accurate result!
From Section 1.5.5, Def. 1.5.5.19 remember the concept of numerical stability, see also [DR08, Sect. 2.3].
An LU-decomposition computed in M is stable, if it is the exact LU-decomposition of a slightly perturbed
matrix. Is this satisfied for the LU-decompositions obtained in ➊ and ➋?
➊ No row swapping, → (2.3.3.2): L̃Ũ = A + E with E = [0 0; 0 −1] ⇒ unstable!
➋ After row swapping, → (2.3.3.3): L̃Ũ = Ã + E with E = [0 0; 0 ε] ⇒ stable!
Clearly, swapping rows is necessary for being able to stably compute the LU-decomposition in floating
point arithmetic. y
The main rationale behind pivoting in numerical linear algebra is not to steer clear of division by
zero, but to ensure numerical stability of Gaussian elimination.
§2.3.3.4 (Partial pivoting) In linear algebra it was easy to decide when pivoting should be done. We just had to check whether a potential pivot element was equal to zero. The situation is murky in numerical linear algebra, because
(i) a test == 0.0 is meaningless in floating point computations, cf. Rem. 1.5.3.15,
(ii) the goal of numerical stability is hard to quantify.
Nevertheless there is a very successful strategy, known as partial pivoting: writing a_{i,j}, i, j ∈ {k, …, n}, for the elements of the intermediate matrix obtained after k < n steps of Gaussian elimination applied to an n × n LSE, we choose the index j of the next pivot row as follows:
\[ j \in \{k,\dots,n\} \quad\text{such that}\quad \frac{|a_{j,k}|}{\max\{|a_{j,l}| : l = k,\dots,n\}} \;\to\; \max \]  (2.3.3.5)
In a sense, we choose the relatively largest pivot element compared to the other entries in the same row [NS02, Sect. 2.5]. y
EXAMPLE 2.3.3.6 (Gaussian elimination with pivoting for a 3 × 3 matrix) The following sequence of matrices is produced by Gaussian elimination with partial pivoting:
\[ A = \begin{bmatrix} 1 & 2 & 2 \\ 2 & -3 & 2 \\ 1 & 24 & 0 \end{bmatrix} \overset{➊}{\longrightarrow} \begin{bmatrix} 2 & -3 & 2 \\ 1 & 2 & 2 \\ 1 & 24 & 0 \end{bmatrix} \overset{➋}{\longrightarrow} \begin{bmatrix} 2 & -3 & 2 \\ 0 & 3.5 & 1 \\ 0 & 25.5 & -1 \end{bmatrix} \overset{➌}{\longrightarrow} \begin{bmatrix} 2 & -3 & 2 \\ 0 & 25.5 & -1 \\ 0 & 3.5 & 1 \end{bmatrix} \overset{➍}{\longrightarrow} \begin{bmatrix} 2 & -3 & 2 \\ 0 & 25.5 & -1 \\ 0 & 0 & 1.1373 \end{bmatrix} \]
C++ code 2.3.3.8: Gaussian elimination with pivoting: extension of Code 2.3.1.4 ➺ GITLAB
2   //! Solving an LSE Ax = b by Gaussian elimination with partial pivoting
3   //! A must be an n x n matrix, b an n-vector
4   void gepiv(const MatrixXd &A, const VectorXd &b, VectorXd &x) {
5     const Eigen::Index n = A.rows();
6     MatrixXd Ab(n, n + 1);
7     Ab << A, b;  //
8     // Forward elimination by rank-1 modification, see Rem. 2.3.1.11
9     for (Eigen::Index k = 0; k < n - 1; ++k) {
10      Eigen::Index j = -1;  // j = pivot row index
11      // p = relatively largest pivot
12      const double p = (Ab.col(k).tail(n - k).cwiseAbs().cwiseQuotient(
            Ab.block(k, k, n - k, n - k).cwiseAbs().rowwise().maxCoeff())).maxCoeff(&j);  //
13      if (p < std::numeric_limits<double>::epsilon() * Ab.block(k, k, n - k, n - k).norm()) {
14        throw std::logic_error("nearly singular");  //
15      }
16      Ab.row(k).tail(n - k + 1).swap(Ab.row(k + j).tail(n - k + 1));  //
17      Ab.bottomRightCorner(n - k - 1, n - k) -=
            Ab.col(k).tail(n - k - 1) * Ab.row(k).tail(n - k) / Ab(k, k);  //
18    }
19    // Back substitution (same as in Code 2.3.1.4)
20    Ab(n - 1, n) = Ab(n - 1, n) / Ab(n - 1, n - 1);
21    for (Eigen::Index i = n - 2; i >= 0; --i) {
22      for (Eigen::Index l = i + 1; l < n; ++l) {
23        Ab(i, n) -= Ab(l, n) * Ab(i, l);
24      }
25      Ab(i, n) /= Ab(i, i);
26    }
27    x = Ab.rightCols(1);  //
28  }
➣ LU-factorization with pivoting? Of course, just by rearranging the operations of Gaussian forward elim-
ination with pivoting.
Line 8: If the pivot element is still very small relative to the norm of the matrix, then we have encountered
an entire column that is close to zero. The matrix is (close to) singular and LU-factorization does
not exist.
Line 11: Swap the first and the j-th row of the matrix.
Line 13: Call the routine for the lower right (n − 1) × (n − 1)-block of the matrix after subtracting suitable
multiples of the first row from the other rows, cf. Rem. 2.3.1.11 and Rem. 2.3.2.12.
Line 14: Reassemble the parts of the LU-factors. The vector of multipliers yields a column of L, see
Ex. 2.3.2.1.
y
Remark 2.3.3.11 (Rationale for partial pivoting policy (2.3.3.5) → [NS02, Page 47]) Why do we
select the relatively largest pivot element in (2.3.3.5)? Because we aim for an algorithm for Gaussian
elimination/LU-decomposition that possesses the highly desirable scale-invariance property. Loosely
speaking, the algorithm should not make different decisions on pivoting when we multiply the LSE with
a regular diagonal matrix from the left. Let us take a closer look at a 2 × 2 example:
Scale the linear system of equations from Ex. 2.3.3.1:
\[ \begin{bmatrix} 2/\epsilon & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} \epsilon & 1 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 2 & 2/\epsilon \\ 1 & 1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 2/\epsilon \\ 2 \end{bmatrix} =: \tilde{b} \]
No row swapping would be triggered, if the absolutely largest pivot element were used to select the pivot row:
\[ \begin{bmatrix} 2 & 2/\epsilon \\ 1 & 1 \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 0 \\ 1/2 & 1 \end{bmatrix}}_{\tilde{L}} \underbrace{\begin{bmatrix} 2 & 2/\epsilon \\ 0 & 1 - 1/\epsilon \end{bmatrix}}_{\tilde{U}} \;=\; \begin{bmatrix} 1 & 0 \\ 1/2 & 1 \end{bmatrix}\begin{bmatrix} 2 & 2/\epsilon \\ 0 & -1/\epsilon \end{bmatrix} \quad\text{in } \mathbb{M}\;, \]
i.e. we end up with the same unstable factorization as in ➊ of Ex. 2.3.3.1, whereas the relative pivoting rule (2.3.3.5) is invariant under such row scalings and still triggers the row swap.
§2.3.3.12 (Theory of pivoting) We view pivoting from the perspective of matrix operations and start with a matrix view of row swapping.
Example: the permutation (1, 2, 3, 4) ↦ (1, 3, 2, 4) corresponds to the permutation matrix
\[ P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}. \]
Lemma 2.3.3.14. Existence of LU-factorization with pivoting → [DR08, Thm. 3.25], [Han02,
Thm. 4.4]
For any regular A ∈ K n,n there is a permutation matrix (→ Def. 2.3.3.13) P ∈ K n,n , a normalized
lower triangular matrix L ∈ K n,n , and a regular upper triangular matrix U ∈ K n,n (→ Def. 1.1.2.3),
such that PA = LU .
Every regular matrix A ∈ K n,n admits a row permutation encoded by the permutation matrix P ∈ K n,n ,
such that A′ := (A)1:n−1,1:n−1 is regular (why ?).
By the induction assumption there is a permutation matrix P′ ∈ K^{n−1,n−1} such that P′A′ possesses an LU-factorization P′A′ = L′U′. There are x, y ∈ K^{n−1} and γ ∈ K such that
\[ PA = \begin{bmatrix} P' & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} A' & \mathbf{x} \\ \mathbf{y}^{\top} & \gamma \end{bmatrix} = \begin{bmatrix} L'U' & P'\mathbf{x} \\ \mathbf{y}^{\top} & \gamma \end{bmatrix} = \begin{bmatrix} L' & 0 \\ \mathbf{c}^{\top} & 1 \end{bmatrix}\begin{bmatrix} U' & \mathbf{d} \\ 0 & \alpha \end{bmatrix}, \]
if we choose
d = (L′)⁻¹P′x , c = (U′)⁻ᵀy , α = γ − c⊤d ,
which is always possible. ✷ y
EXAMPLE 2.3.3.15 (Ex. 2.3.3.6 cnt'd) Let us illustrate the assertion of Lemma 2.3.3.14 for the small 3 × 3 LSE from Ex. 2.3.3.6. The elimination sequence recorded there yields
\[ U = \begin{bmatrix} 2 & -3 & 2 \\ 0 & 25.5 & -1 \\ 0 & 0 & 1.1373 \end{bmatrix}, \qquad L = \begin{bmatrix} 1 & 0 & 0 \\ 0.5 & 1 & 0 \\ 0.5 & 0.1373 & 1 \end{bmatrix}, \qquad P = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}. \]
Two permutations were involved: in step ➊ swap rows #1 and #2, in step ➌ swap rows #2 and #3. Apply these swaps to the identity matrix and you recover P. See also [DR08, Ex. 3.30]. y
§2.3.3.16 (LU-decomposition in E IGEN) E IGEN provides various functions for computing the LU-
decomposition of a given matrix. They all perform the factorization in-situ → Rem. 2.3.2.11:
(In-situ storage: A is overwritten with the packed LU-factors as described in Rem. 2.3.2.11.)
The resulting matrix can be retrieved and used to recover the LU-factors, as demonstrated in the next code
snippet. Note that the method matrixLU returns just a single matrix, from which the LU-factors have to
be extracted using special view methods.
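The snippet referred to is not reproduced in this copy; the following hedged sketch (cf. also Code 2.5.0.17 below) shows how the factors might be recovered from the decomposition object behind A.lu():

#include <Eigen/Dense>
using Eigen::MatrixXd;

// In-situ LU-decomposition in Eigen and extraction of the factors; because of
// partial pivoting we obtain P*A = L*U, see Lemma 2.3.3.14.
void lu_factors_demo(const MatrixXd &A) {
  Eigen::PartialPivLU<MatrixXd> lu(A);           // decomposition object
  const MatrixXd packed = lu.matrixLU();         // L and U packed into one matrix
  const MatrixXd L = packed.triangularView<Eigen::UnitLower>();  // unit diagonal
  const MatrixXd U = packed.triangularView<Eigen::Upper>();
  const MatrixXd P = lu.permutationP();          // row permutation matrix
  const double err = (P * A - L * U).norm();     // should be of size roundoff
  (void)err;
}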
Note that for solving a linear system of equations by means of LU-decomposition (the standard algorithm)
we never have to extract the LU-factors. y
Remark 2.3.3.18 (Row swapping commutes with forward elimination) Any kind of pivoting only in-
volves comparisons and row/column permutations, but no arithmetic operations on the matrix entries.
This makes the following observation plausible:
The LU-factorization of A ∈ K n,n with partial pivoting by § 2.3.3.9 is numerically equivalent to the LU-
factorization of PA without pivoting (→ Code in § 2.3.2.6), when P is a permutation matrix gathering
the row swaps entailed by partial pivoting.
The above statement means that whenever we study the impact of roundoff errors on LU-
factorization it is safe to consider only the basic version without pivoting, because we can always
assume that row swaps have been conducted beforehand.
y
The residual (→ Def. 2.4.0.1) of an approximate solution x̃ of the LSE Ax = b is the vector r := b − Ax̃.
§2.4.0.2 (Probing stability of a direct solver for LSE) Assume that you have downloaded a direct solver for a general (dense) linear system of equations Ax = b, A ∈ K^{n,n} regular, b ∈ K^n. When given the data A and b it returns the perturbed solution x̃. How can we tell that x̃ is the exact solution of a linear system with slightly perturbed data (in the sense of a tiny relative error of size ≈ EPS, EPS the machine precision, see § 1.5.3.8)? That is, how can we tell that x̃ is an acceptable solution in the sense of backward error analysis, cf. Def. 1.5.5.19:
An algorithm F̃ for solving a problem F : X ↦ Y is numerically stable if for all x ∈ X its result F̃(x) (possibly affected by roundoff) is the exact result for "slightly perturbed" data.
A question similar to the one we ask now for Gaussian elimination was answered in Ex. 1.5.5.20 for the operation of matrix×vector multiplication.
We can alter either side of the linear system of equations in order to restore x̃ as a solution:
➊ The perturbation x − x̃ is accounted for by a perturbation of the right-hand side:
Ax = b , Ax̃ = b + Δb ⇒ Δb = Ax̃ − b =: −r (residual, Def. 2.4.0.1) .
Hence, x̃ can be accepted as a solution if ‖r‖/‖b‖ ≤ C n³ · EPS for some small constant C ≈ 1, see Def. 1.5.5.19. Here, ‖·‖ can be any vector norm on K^n.
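A minimal sketch of this a-posteriori acceptance test (the constant C ≈ 1 is a heuristic choice, as in the criterion above):

#include <Eigen/Dense>
#include <limits>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Accept xt as solution of Ax = b if the relative residual is <= C * n^3 * EPS
bool acceptable(const MatrixXd &A, const VectorXd &b, const VectorXd &xt,
                double C = 1.0) {
  const double n = static_cast<double>(A.rows());
  const VectorXd r = b - A * xt;                         // residual
  const double eps = std::numeric_limits<double>::epsilon();
  return r.norm() <= C * n * n * n * eps * b.norm();
}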
➋ The perturbation can also be accounted for by a perturbation of the system matrix:
Ax = b , (A + ΔA)x̃ = b   [try a perturbation of the form ΔA = u x̃ᴴ, u ∈ K^n]
\[ \mathbf{u} = \frac{\mathbf{r}}{\|\tilde{x}\|_2^2} \quad\Rightarrow\quad \Delta A = \frac{\mathbf{r}\,\tilde{x}^{\mathsf{H}}}{\|\tilde{x}\|_2^2}\;. \]
As in Ex. 1.5.5.20 we find
\[ \frac{\|\Delta A\|_2}{\|A\|_2} = \frac{\|r\|_2}{\|A\|_2\,\|\tilde{x}\|_2} \le \frac{\|r\|_2}{\|A\tilde{x}\|_2}\;. \]  (2.4.0.3)
Thus, x̃ is ok in the sense of backward error analysis, if ‖r‖/‖Ax̃‖ ≤ C n³ · EPS.
Now that we know when to accept a vector as solution of a linear system of equations, we can explore
whether an implementation of Gaussian elimination (with some pivoting strategy) in floating point arith-
metic actually delivers acceptable solutions. Given the several levels of nested loops occurring in algo-
rithms for Gaussian elimination, it is not surprising that the roundoff error analysis of Gaussian elimination
based on Ass. 1.5.3.11 is rather involved. Here we merely summarise the results:
The analysis can be simplified by using the fact that equivalence of Gaussian elimination and LU-
factorization extends to machine arithmetic, cf. Section 2.3.2
In Rem. 2.3.3.18 we learned that pivoting can be taken into account by a prior permutation of the rows of
the linear system of equations. Since permutations do not introduce any roundoff errors, it is thus sufficient
to consider LU-factorization without pivoting.
A profound roundoff analysis of Gaussian elimination/LU-factorization can be found in [GV89, Sect. 3.3 &
3.5] and [Hig02, Sect. 9.3]. A less rigorous, but more lucid discussion is given in [TB97, Lecture 22]. Here
we only quote a result due to Wilkinson, [Hig02, Thm. 9.5]:
Let A ∈ R^{n,n} be regular and A^{(k)} ∈ R^{n,n}, k = 1, …, n − 1, denote the intermediate matrices arising in the k-th step of § 2.3.3.9 (Gaussian elimination with partial pivoting) when carried out with exact arithmetic.
For the approximate solution x̃ ∈ R^n of the LSE Ax = b, b ∈ R^n, computed as in § 2.3.3.9 (based on machine arithmetic with machine precision EPS, → Ass. 1.5.3.11) there is a ΔA ∈ R^{n,n} with (A + ΔA)x̃ = b and ‖ΔA‖_∞ bounded by a small multiple of n³ · EPS · ρ · ‖A‖_∞, where ρ := max_{i,j,k} |(A^{(k)})_{i,j}| / max_{i,j} |(A)_{i,j}| is the growth factor of the elimination.
If ρ is "small", the computed solution of an LSE can be regarded as the exact solution of an LSE with "slightly perturbed" system matrix (perturbations of size O(n³ EPS)).
EXAMPLE 2.4.0.6 (Wilkinson's counterexample) We confirm the bad news by means of a famous example, based on the so-called Wilkinson matrix A ∈ R^{n,n} with entries
\[ a_{ij} := \begin{cases} 1 & \text{if } i = j \text{ or } j = n,\\ -1 & \text{if } i > j,\\ 0 & \text{else}, \end{cases} \]
i.e. for n = 10 the matrix has ones on the diagonal and in the last column, −1 everywhere below the diagonal, and zeros elsewhere.
Partial pivoting does not trigger any row permutations! The LU-factors are
\[ A = LU, \qquad l_{ij} = \begin{cases} 1 & \text{if } i = j,\\ -1 & \text{if } i > j,\\ 0 & \text{else}, \end{cases} \qquad u_{ij} = \begin{cases} 1 & \text{if } i = j,\\ 2^{\,i-1} & \text{if } j = n,\\ 0 & \text{else}. \end{cases} \]
C++ code 2.4.0.7: Gaussian elimination for “Wilkinson system” in E IGEN ➺ GITLAB
2   MatrixXd res(100, 2);
The measured relative errors are displayed in the following plots alongside the Euclidean condition num-
bers of the Wilkinson matrices.
Fig. 41/Fig. 42 (both plotted against the matrix size n ≤ 1000): the Euclidean condition numbers of the Wilkinson matrices, which grow only moderately (to a few hundred), and the relative residual norms obtained with Gaussian elimination and with QR-decomposition; the residuals of Gaussian elimination deteriorate dramatically with increasing n, in contrast to the QR-based solver.
Observation: In practice, ρ (almost) always grows only mildly (like O(√n)) with n.
Discussion in [TB97, Lecture 22]: growth factors larger than the order O(√n) are exponentially rare in certain relevant classes of random matrices.
EXAMPLE 2.4.0.8 (Stability by small random perturbations) Spielman and Teng [ST96] discovered
that a tiny relative random perturbation of the Wilkinson matrix on the scale of the machine precision EPS
(→ § 1.5.3.8) already remedies the instability of Gaussian elimination.
C++ code 2.4.0.9: Stabilization of Gaussian elimination with partial pivoting by small random
perturbations ➺ GITLAB
2   //! Curing Wilkinson's counterexample by random perturbation
3   MatrixXd res(20, 3);
4   mt19937 gen(42);  // seed
5   // normal distribution, mean = 0.0, stddev = 1.0
6   std::normal_distribution<> bellcurve;
7   for (int n = 10; n <= 10 * 20; n += 10) {
8     // Build Wilkinson matrix
9     MatrixXd A(n, n); A.setIdentity();
10    A.triangularView<StrictlyLower>().setConstant(-1);
11    A.rightCols<1>().setOnes();
12    // imposed solution
13    VectorXd x = VectorXd::Constant(n, -1).binaryExpr(
14        VectorXd::LinSpaced(n, 1, n),
15        [](double x, double y) { return pow(x, y); });
16    double relerr = (A.lu().solve(A * x) - x).norm() / x.norm();
17    // Randomly perturbed Wilkinson matrix by matrix with iid
18    // N(0, eps) distributed entries
19    MatrixXd Ap = A.unaryExpr([&](double x) {
20      return x + numeric_limits<double>::epsilon() * bellcurve(gen);
21    });
22    double relerrp = (Ap.lu().solve(Ap * x) - x).norm() / x.norm();
23    res(n / 10 - 1, 0) = n;
24    res(n / 10 - 1, 1) = relerr;
25    res(n / 10 - 1, 2) = relerrp;
26  }
Fig. 43 (relative error vs. matrix size n, curves for the unperturbed and the randomly perturbed matrix): the relative error for the unperturbed Wilkinson matrix grows rapidly with n, while it remains at the level of machine precision for the perturbed matrix.
Recall the statement made above about the "improbability" of matrices for which Gaussian elimination with partial pivoting is unstable. This is now matched by the observation that a tiny random perturbation of the matrix (almost certainly) cures the problem. This is investigated by the brand-new field of smoothed analysis of numerical algorithms, see [SST06].
y
Hence, for an ill-conditioned linear system, whose system matrix has a huge condition number, (stable)
Gaussian elimination may return “solutions” with large errors. This will be demonstrated in this experiment.
The test matrix is the nearly singular matrix
\[ A = \mathbf{u}\mathbf{v}^{\top} + \epsilon I , \qquad \mathbf{u} = \tfrac{1}{3}(1,2,3,\dots,10)^{\top}, \quad \mathbf{v} = \bigl(-1, \tfrac12, -\tfrac13, \tfrac14, \dots, \tfrac{1}{10}\bigr)^{\top} . \]
Fig. 44 (cond(A) and relative error of the computed solution plotted against ε, ε ∈ [10⁻¹⁴, 10⁻⁵]): as ε decreases, cond(A) grows like 1/ε, and the relative error of the solution computed by Gaussian elimination grows accordingly.
y
The practical stability of Gaussian elimination for Ax = b is reflected by the size of a particular vector, the residual r := b − Ax̃, x̃ the computed solution, which can easily be computed after the elimination solver has finished:
In practice, Gaussian elimination/LU-factorization with partial pivoting produces "relatively small residual vectors".
Indeed, (A + ΔA)x̃ = b ⇒ r = b − Ax̃ = ΔAx̃ ⇒ ‖r‖ ≤ ‖ΔA‖‖x̃‖ for any vector norm ‖·‖. This means that, if a direct solver for an LSE is stable in the sense of backward error analysis, that is, the perturbed solution could be obtained as the exact solution for a slightly relatively perturbed system matrix, then the residual will be (relatively) small.
EXPERIMENT 2.4.0.11 (Small residuals by Gaussian elimination) Gaussian elimination works miracles in terms of delivering small residuals! To demonstrate this we study a numerical experiment with a nearly singular matrix,
\[ A = \mathbf{u}\mathbf{v}^{\top} + \epsilon I , \qquad \mathbf{u} = \tfrac{1}{3}(1,2,3,\dots,10)^{\top}, \quad \mathbf{v} = \bigl(-1, \tfrac12, -\tfrac13, \tfrac14, \dots, \tfrac{1}{10}\bigr)^{\top} , \]
a perturbation of size ε of a singular rank-1 matrix.
Fig. 45 (relative error and relative residual plotted against ε). Observations (w.r.t. the ‖·‖∞-norm):
✦ for ε ≪ 1 we observe a large relative error in the computed solution x̃,
✦ but small relative residuals for all ε.
How can a large relative error be reconciled with a small relative residual? We continue a discussion that we already started in Ex. 2.4.0.6. From Ax = b ↔ Ax̃ ≈ b:
\[ A(x-\tilde{x}) = r \;\Rightarrow\; \|x-\tilde{x}\| \le \|A^{-1}\|\,\|r\| , \qquad Ax = b \;\Rightarrow\; \|b\| \le \|A\|\,\|x\| \quad\Rightarrow\quad \frac{\|x-\tilde{x}\|}{\|x\|} \le \|A\|\,\|A^{-1}\|\,\frac{\|r\|}{\|b\|}\;. \]  (2.4.0.13)
➣ If cond(A) := ‖A‖‖A⁻¹‖ ≫ 1, then a small relative residual may not imply a small relative error.
Also recall the discussion in Exp. 2.4.0.10. y
EXPERIMENT 2.4.0.14 (Instability of multiplication with the inverse) An important justification for Rem. 2.2.1.6, which advised us not to compute the inverse of a matrix in order to solve a linear system of equations, is conveyed by this experiment. We again consider the nearly singular matrix from Exp. 2.4.0.11,
\[ A = \mathbf{u}\mathbf{v}^{\top} + \epsilon I , \qquad \mathbf{u} = \tfrac{1}{3}(1,2,3,\dots,10)^{\top}, \quad \mathbf{v} = \bigl(-1, \tfrac12, -\tfrac13, \tfrac14, \dots, \tfrac{1}{10}\bigr)^{\top} , \]
a perturbation of a singular rank-1 matrix.
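The experiment code is not reproduced in this copy; a hedged sketch of the comparison (solving via LU versus multiplying with the computed inverse) could look like this:

#include <Eigen/Dense>
#include <cmath>
#include <iostream>
using Eigen::MatrixXd; using Eigen::VectorXd;

int main() {
  const int n = 10;
  VectorXd u = VectorXd::LinSpaced(n, 1, n) / 3.0;
  VectorXd v(n);
  for (int k = 0; k < n; ++k) v(k) = ((k % 2 == 0) ? -1.0 : 1.0) / (k + 1);
  const VectorXd b = VectorXd::Random(n);
  for (int e = 5; e <= 14; ++e) {
    const double eps = std::pow(10.0, -e);
    const MatrixXd A = u * v.transpose() + eps * MatrixXd::Identity(n, n);
    const VectorXd x_lu = A.lu().solve(b);     // Gaussian elimination / LU
    const VectorXd x_inv = A.inverse() * b;    // multiplication with the inverse
    std::cout << eps << "  "
              << (b - A * x_lu).norm() / b.norm() << "  "
              << (b - A * x_inv).norm() / b.norm() << std::endl;
  }
  return 0;
}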
Fig. 46 (relative residual plotted against ε, curves for Gaussian elimination, multiplication with the inverse, and the computed inverse itself): multiplication with the computed inverse is also affected by roundoff errors, but does not benefit from the same favorable cancellation of roundoff errors as Gaussian elimination; its residuals are much larger for small ε.
y
(∗): a direct solver terminates after a predictable finite number of elementary operations for every admis-
sible input.
Therefore, familiarity with details of Gaussian elimination is not required, but one must know when and
how to use the library functions and one must be able to assess the computational effort they involve.
§2.5.0.1 (Computational effort for direct elimination) We repeat the reasoning of § 2.3.1.5: Gaus-
sian elimination for a general (dense) matrix invariably involves three nested loops of length n, see
Code 2.3.1.4, Code 2.3.3.8.
The constant hidden in the Landau symbol can be expected to be rather small (≈ 1) as is clear from
(2.3.1.6).
The cost for solving is substantially lower if certain properties of the matrix A are known. This is clear if A is diagonal or orthogonal/unitary. It is also true for triangular matrices (→ Def. 1.1.2.3), because the corresponding LSEs can be solved by simple backward/forward substitution. We recall the observation made in § 2.3.2.15.
y
§2.5.0.4 (Direct solution of linear systems of equations in E IGEN) E IGEN supplies a rich suite of
functions for matrix decompositions and solving LSEs, see E IGEN documentation. The default solver is
Gaussian elimination with partial pivoting, accessible through the methods lu() and solve()of dense
matrix types:
Given: system/coefficient matrix A ∈ K n,n regular ↔ A (n × n E IGEN matrix)
right hand side vectors B ∈ K n,ℓ ↔ B (n × ℓ E IGEN matrix)
(corresponds to multiple right hand sides, cf. Code 2.3.1.10)
linear algebra: X = A⁻¹B = [A⁻¹(B)_{:,1}, …, A⁻¹(B)_{:,ℓ}]   ↔   EIGEN: X = A.lu().solve(B)
2   // A is lower triangular
3   x = A.triangularView<Eigen::Lower>().solve(b);
4   // A is upper triangular
5   x = A.triangularView<Eigen::Upper>().solve(b);
6   // A is hermitian / self-adjoint and positive definite
7   x = A.selfadjointView<Eigen::Upper>().llt().solve(b);
8   // A is hermitian / self-adjoint (positive or negative semidefinite)
9   x = A.selfadjointView<Eigen::Upper>().ldlt().solve(b);
The methods llt() and ldlt() rely on special factorizations for symmetric matrices, see § 2.8.0.13
below. y
EXPERIMENT 2.5.0.6 (Standard EIGEN lu() operator versus triangularView()) In this numerical experiment we study the gain in efficiency achievable by making the direct solver aware of important matrix properties.
C++ code 2.5.0.7: Direct solver applied to a upper triangular matrix ➺ GITLAB
2   //! Eigen code: assessing the gain from using special properties
3   //! of system matrices in Eigen
4   MatrixXd timing() {
5     std::vector<int> n = {16,32,64,128,256,512,1024,2048,4096,8192};
6     const int nruns = 3;
7     MatrixXd times(n.size(), 3);
8     for (unsigned int i = 0; i < n.size(); ++i) {
9       Timer t1;
10      Timer t2;  // timer class
11      MatrixXd A = VectorXd::LinSpaced(n[i], 1, n[i]).asDiagonal();
12      A += MatrixXd::Ones(n[i], n[i]).triangularView<Upper>();
13      const VectorXd b = VectorXd::Random(n[i]);
14      VectorXd x1(n[i]);
15      VectorXd x2(n[i]);
16      for (int j = 0; j < nruns; ++j) {
17        t1.start(); x1 = A.lu().solve(b); t1.stop();
18        t2.start(); x2 = A.triangularView<Upper>().solve(b); t2.stop();
19      }
20      times(i, 0) = n[i]; times(i, 1) = t1.min(); times(i, 2) = t2.min();
21    }
22    return times;
23  }
Observation (Fig. 47, runtime for the direct solver [s] plotted against the matrix size n, curves for the plain lu() solve and the triangularView-based solve): exploiting the triangular structure reduces the runtime dramatically, in accordance with the O(n²) versus O(n³) asymptotic complexity.
y
§2.5.0.8 (Direct solvers for LSE in E IGEN) Invocation of direct solvers in E IGEN is a two stage process:
➊ Request a decomposition (LU,QR,LDLT) of the matrix and store it in a temporary “decomposition
object”.
➋ Perform backward & forward substitutions by calling the solve() method of the decomposition
object.
The general format for invoking linear solvers in E IGEN is as follows:
Eigen::SolverType<Eigen::MatrixXd> solver(A);
Eigen::VectorXd x = solver.solve(b);
This can be reduced to one line, as the solvers can also be used as methods acting on matrices:
Eigen::VectorXd x = A.solverType().solve(b);
A full list of solvers can be found in the E IGEN documentation. The next code demonstrates a few of
the available decompositions that can serve as the basis for a linear solver:
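The corresponding listing is not reproduced in this copy; the following hedged sketch lists some decomposition classes documented for EIGEN (the selection in the original code may differ):

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// A few of Eigen's decomposition-based solvers
void solver_zoo(const MatrixXd &A, const VectorXd &b) {
  VectorXd x;
  x = Eigen::PartialPivLU<MatrixXd>(A).solve(b);          // LU with partial pivoting
  x = Eigen::FullPivLU<MatrixXd>(A).solve(b);             // LU with full pivoting
  x = Eigen::HouseholderQR<MatrixXd>(A).solve(b);         // QR decomposition
  x = Eigen::ColPivHouseholderQR<MatrixXd>(A).solve(b);   // column-pivoted QR
  x = Eigen::LLT<MatrixXd>(A).solve(b);                   // Cholesky (A s.p.d.!)
  x = Eigen::LDLT<MatrixXd>(A).solve(b);                  // LDL^T (A symmetric)
  (void)x;
}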
The different decompositions trade speed for stability and accuracy: fully pivoted and QR-based decompositions also work for nearly singular matrices, for which the standard LU-factorization may no longer be reliable. y
Remark 2.5.0.10 (Many sequential solutions of LSE) As we have seen in Code 2.5.0.9, E IGEN provides
functions that return decompositions of matrices. For instance, we can get an object “containing” the
LU-decomposition (→ Section 2.3.2) of a matrix by the following commands:
Eigen::MatrixXd A(n,n); // A dense square matrix object
......
auto ludec = A.lu(); // Perform the LU-decomposition and store the factors.
Based on the precomputed decompositions, a linear system of equations with coefficient matrix A ∈ K n,n
can be solved with asymptotic computational effort O(n2 ), cf. § 2.3.2.15.
The following example illustrates a special situation, in which matrix decompositions can curb computa-
tional cost:
\[ x^{*} := A^{-1}x^{(k)} , \qquad x^{(k+1)} := \frac{x^{*}}{\|x^{*}\|_2} , \qquad k = 0, 1, 2, \dots \]  (2.5.0.14)
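The corresponding listing from the notes is not reproduced here; a hedged sketch of how (2.5.0.14) can be implemented so that the expensive factorization is reused (function name and loop structure are illustrative):

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Iteration (2.5.0.14) with the LU-decomposition of A computed only once;
// every step then costs only O(n^2) for the forward/backward substitutions.
VectorXd inverse_iteration(const MatrixXd &A, VectorXd x, int nsteps) {
  const auto ludec = A.partialPivLu();   // setup phase, O(n^3), done once
  for (int k = 0; k < nsteps; ++k) {
    x = ludec.solve(x);                  // x* = A^{-1} x^{(k)}, O(n^2)
    x /= x.norm();                       // normalization
  }
  return x;
}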
This is necessary, because A+B will spawn an auxiliary object of a “strange” type determined by the
expression template mechanism. y
Remark 2.5.0.16 (Access to LU-factors in E IGEN) LU-decomposition objects available in E IGEN provide
access to the computed LU-factors L and U through a member function matrixLU(). This returns a
matrix object with L stored in its strictly lower triangular part, and U in its upper triangular part.
However note that E IGEN’s algorithms for LU-factorization invariably employ (partial) pivoting for the sake
of numerical stability, see Section 2.3.3 for a discussion. This has the effect that the LU-factors of a matrix
A ∈ R n,n are actually those for a matrix PA, where P is a permutation matrix as stated in Lemma 2.3.3.14.
Thus matrixLU() provides the LU-factorization of A after some row permutation.
C++ code 2.5.0.17: Retrieving the LU-factors from an E IGEN lu object ➺ GITLAB
2   std::pair<Eigen::MatrixXd, Eigen::MatrixXd>
3   lufak_eigen(const Eigen::MatrixXd &A) {
4     // Compute LU decomposition
5     auto ludec = A.lu();
6     // The LU-factors are computed by in-situ LU-decomposition,
7     // see Rem. 2.3.2.11, and are stored in a dense matrix of
8     // the same size as A
9     Eigen::MatrixXd L{ludec.matrixLU().triangularView<Eigen::UnitLower>()};
10    const Eigen::MatrixXd U{ludec.matrixLU().triangularView<Eigen::Upper>()};
11    // EIGEN employs partial pivoting, see § 2.3.3.7, which can be viewed
12    // as a prior permutation of the rows of A. We apply the inverse of this
13    // permutation to the L-factor in order to achieve A = LU.
14    L.applyOnTheLeft(ludec.permutationP().inverse());
15    // Return LU-factors as members of a 2-tuple.
16    return {L, U};
17  }
§2.6.0.1 (Triangular linear systems) Triangular linear systems are linear systems of equations whose
system matrix is a triangular matrix (→ Def. 1.1.2.3).
Thm. 2.5.0.3 tells us that (dense) triangular linear systems can be solved by backward/forward substitution with O(n²) asymptotic computational effort (n ≙ number of unknowns), compared to an asymptotic complexity of O(n³) for solving a generic (dense) linear system of equations (→ Thm. 2.5.0.2, Exp. 2.5.0.6).
This is the simplest case where exploiting the special structure of the system matrix leads to faster algorithms for the solution of a special class of linear systems. y
§2.6.0.2 (Block elimination) Remember that thanks to the possibility to compute the matrix product in a
block-wise fashion (→ § 1.3.1.13), Gaussian elimination can be conducted on the level of matrix blocks.
Using block matrix multiplication (applied to the matrix×vector product in (2.6.0.3)) we find an equivalent
way to write the block partitioned linear system of equations:
A11 x1 + A12 x2 = b1 ,
(2.6.0.4)
A21 x1 + A22 x2 = b2 .
We assume that A₁₁ is regular (invertible), so that we can solve the first equation for x₁: x₁ = A₁₁⁻¹(b₁ − A₁₂x₂). Inserting this into the second equation eliminates x₁ and leaves
(A₂₂ − A₂₁A₁₁⁻¹A₁₂) x₂ = b₂ − A₂₁A₁₁⁻¹b₁ .
The resulting ℓ × ℓ linear system of equations for the unknown vector x₂ is called the Schur complement system for (2.6.0.3).
Unless A has a special structure that allows the efficient solution of linear systems with system matrix
A11 , the Schur complement system is mainly of theoretical interest. y
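Spelled out (a short derivation sketch): solving the first equation of (2.6.0.4) for x₁ and inserting the result into the second equation yields
$$\mathbf{x}_1 = \mathbf{A}_{11}^{-1}(\mathbf{b}_1 - \mathbf{A}_{12}\mathbf{x}_2)\,,\qquad \bigl(\mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12}\bigr)\mathbf{x}_2 = \mathbf{b}_2 - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{b}_1\,,$$
the second equation being the Schur complement system mentioned above.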
EXAMPLE 2.6.0.5 (Linear systems with arrow matrices) From n ∈ N, a diagonal matrix D ∈ K^{n,n}, c ∈ K^n, b ∈ K^n, and α ∈ K, we can build an (n + 1) × (n + 1) arrow matrix
$$\mathbf{A} = \begin{bmatrix} \mathbf{D} & \mathbf{c} \\ \mathbf{b}^{\top} & \alpha \end{bmatrix}\,. \qquad (2.6.0.6)$$
Fig. 48: spy plot of an arrow matrix (nz = 31).
We can apply the block partitioning (2.6.0.3) with k = n and ℓ = 1 to a linear system Ax = y with system matrix A and obtain A₁₁ = D, which can be inverted easily, provided that all diagonal entries of D are different from zero. In this case
$$\mathbf{A}\mathbf{x} = \begin{bmatrix} \mathbf{D} & \mathbf{c}\\ \mathbf{b}^{\top} & \alpha \end{bmatrix}\begin{bmatrix}\mathbf{x}_1\\ \xi\end{bmatrix} = \mathbf{y} := \begin{bmatrix}\mathbf{y}_1\\ \eta\end{bmatrix}\,, \qquad (2.6.0.7)$$
$$\xi = \frac{\eta - \mathbf{b}^{\top}\mathbf{D}^{-1}\mathbf{y}_1}{\alpha - \mathbf{b}^{\top}\mathbf{D}^{-1}\mathbf{c}}\,,\qquad \mathbf{x}_1 = \mathbf{D}^{-1}(\mathbf{y}_1 - \xi\,\mathbf{c})\,. \qquad (2.6.0.8)$$
These formulas make sense, if D is regular and α − b⊤D⁻¹c ≠ 0, which is another condition for the invertibility of A.
Using the formula (2.6.0.8) we can solve the linear system (2.6.0.7) with an asymptotic complexity of O(n)! This superior speed compared to Gaussian elimination applied to the (dense) linear system is evident in runtime measurements.
C++ code 2.6.0.9: Dense Gaussian elimination applied to arrow system ➺ GITLAB

VectorXd arrowsys_slow(const VectorXd &d,
                       const VectorXd &c,
                       const VectorXd &b, double alpha,
                       const VectorXd &y) {
  const Eigen::Index n = d.size();
  MatrixXd A(n + 1, n + 1);  // Empty dense matrix
  A.setZero();               // Initialize with all zeros.
  A.diagonal().head(n) = d;  // Initialize matrix diagonal from a vector.
  A.col(n).head(n) = c;      // Set rightmost column c.
  A.row(n).head(n) = b;      // Set bottom row b^T.
  A(n, n) = alpha;           // Set bottom-right entry alpha.
  return A.lu().solve(y);    // Gaussian elimination
}
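The fast counterpart arrowsys_fast (Code 2.6.0.10, available from ➺ GITLAB) is not reproduced here; the following is only a sketch of how an O(n) implementation based on (2.6.0.8) might look, assuming that all entries of d and the denominator α − b⊤D⁻¹c are non-zero:

#include <Eigen/Dense>
using Eigen::VectorXd;

VectorXd arrowsys_fast(const VectorXd &d, const VectorXd &c,
                       const VectorXd &b, double alpha, const VectorXd &y) {
  const Eigen::Index n = d.size();
  const VectorXd z = c.cwiseQuotient(d);          // z = D^{-1}*c
  const VectorXd w = y.head(n).cwiseQuotient(d);  // w = D^{-1}*y1
  const double xi = (y(n) - b.dot(w)) / (alpha - b.dot(z));  // (2.6.0.8)
  VectorXd x(n + 1);
  x.head(n) = w - xi * z;  // x1 = D^{-1}*(y1 - xi*c)
  x(n) = xi;
  return x;
}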
Fig. 49: runtime [s] vs. matrix size n (doubly logarithmic) for arrowsys_slow and arrowsys_fast.
Code for runtime measurements can be obtained from ➺ GITLAB.
(Intel i7-3517U CPU @ 1.90GHz, 64-bit, Ubuntu Linux 14.04 LTS, gcc 4.8.4, -O3)
No comment! ✄
Remark 2.6.0.11 (Sacrificing numerical stability for efficiency) The vector based implementation of
the solver of Code 2.6.0.10 can be vulnerable to roundoff errors, because, upon closer inspection, the
algorithm turns out to be equivalent to Gaussian elimination without pivoting, cf. Section 2.3.3, Ex. 2.3.3.1.
§2.6.0.12 (Solving LSE subject to low-rank modification of system matrix) Given a regular matrix A ∈ K^{n,n}, let us assume that at some point in a code we are in a position to solve any linear system Ax = b "fast", because
✦ either A has a favorable structure, e.g. triangular, see § 2.6.0.1,
✦ or an LU-decomposition of A is already available, see § 2.3.2.15.
Now a matrix Ã is obtained by changing a single entry of A:
$$\tilde{\mathbf{A}} \in \mathbb{K}^{n,n}:\quad \tilde{a}_{ij} = \begin{cases} a_{ij}\,, & \text{if } (i,j) \neq (i^*,j^*)\,,\\ z + a_{ij}\,, & \text{if } (i,j) = (i^*,j^*)\,,\end{cases}\qquad i^*, j^* \in \{1,\dots,n\}\,. \qquad (2.6.0.13)$$
$$\tilde{\mathbf{A}} = \mathbf{A} + z\,\mathbf{e}_{i^*}\mathbf{e}_{j^*}^{\top}\,. \qquad (2.6.0.14)$$
(Recall: e_i ≙ i-th unit vector.) The question is whether we can reuse some of the computations spent on solving Ax = b in order to solve Ãx̃ = b with less effort than entailed by a direct Gaussian elimination from scratch.
We may also consider a matrix modification affecting a single row: given z ∈ K^n, changing a single row means
$$\tilde{\mathbf{A}} \in \mathbb{K}^{n,n}:\quad \tilde{a}_{ij} = \begin{cases} a_{ij}\,, & \text{if } i \neq i^*\,,\\ (\mathbf{z})_j + a_{ij}\,, & \text{if } i = i^*\,,\end{cases}\qquad i^* \in \{1,\dots,n\}\,.$$
$$\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{e}_{i^*}\mathbf{z}^{\top}\,. \qquad (2.6.0.15)$$
$$\mathbf{A} \in \mathbb{K}^{n,n} \;\mapsto\; \tilde{\mathbf{A}} := \mathbf{A} + \underbrace{\mathbf{u}\mathbf{v}^{\mathsf{H}}}_{\text{general rank-1 matrix}}\,,\qquad \mathbf{u},\mathbf{v} \in \mathbb{K}^n\,. \qquad (2.6.0.16)$$
As in Ex. 2.3.1.1 we carry out Gaussian elimination for the first column of the block-partitioned linear system
$$\begin{bmatrix} -1 & \mathbf{v}^{\mathsf{H}} \\ \mathbf{u} & \mathbf{A} \end{bmatrix}\begin{bmatrix} \xi \\ \tilde{\mathbf{x}} \end{bmatrix} = \begin{bmatrix} 0 \\ \mathbf{b} \end{bmatrix} \qquad (2.6.0.17)$$
[Gaussian elimination on first column] ➤
$$\begin{bmatrix} -1 & \mathbf{v}^{\mathsf{H}} \\ \mathbf{0} & \mathbf{A} + \mathbf{u}\mathbf{v}^{\mathsf{H}} \end{bmatrix}\begin{bmatrix} \xi \\ \tilde{\mathbf{x}} \end{bmatrix} = \begin{bmatrix} 0 \\ \mathbf{b} \end{bmatrix} \quad\Rightarrow\quad \tilde{\mathbf{A}}\tilde{\mathbf{x}} = \mathbf{b}\;! \qquad (2.6.0.18)$$
Hence, we have solved the modified LSE, once we have found the component x̃ of the solution of the linear system (2.6.0.17).
Now we swap (block) rows and columns and consider the block-partitioned linear system
$$\begin{bmatrix} \mathbf{A} & \mathbf{u} \\ \mathbf{v}^{\mathsf{H}} & -1 \end{bmatrix}\begin{bmatrix} \tilde{\mathbf{x}} \\ \xi \end{bmatrix} = \begin{bmatrix} \mathbf{b} \\ 0 \end{bmatrix}\,.$$
We do (block) Gaussian elimination on the first (block) column again, which yields the Schur complement system
$$(1 + \mathbf{v}^{\mathsf{H}}\mathbf{A}^{-1}\mathbf{u})\,\xi = \mathbf{v}^{\mathsf{H}}\mathbf{A}^{-1}\mathbf{b}\,, \qquad (2.6.0.19)$$
$$\mathbf{A}\tilde{\mathbf{x}} = \mathbf{b} - \frac{\mathbf{u}\mathbf{v}^{\mathsf{H}}\mathbf{A}^{-1}}{1 + \mathbf{v}^{\mathsf{H}}\mathbf{A}^{-1}\mathbf{u}}\,\mathbf{b}\,. \qquad (2.6.0.20)$$
The generalization of this formula to rank-k-perturbations is given in the following lemma:

Lemma 2.6.0.21 (Sherman–Morrison–Woodbury formula). For a regular matrix A ∈ K^{n,n} and U, V ∈ K^{n,k}, k ≤ n, the solution of (A + UV^H)x̃ = b is given by
$$\tilde{\mathbf{x}} = \mathbf{A}^{-1}\mathbf{b} - \mathbf{A}^{-1}\mathbf{U}\bigl(\mathbf{I} + \mathbf{V}^{\mathsf{H}}\mathbf{A}^{-1}\mathbf{U}\bigr)^{-1}\mathbf{V}^{\mathsf{H}}\mathbf{A}^{-1}\mathbf{b}\,,$$
if I + V^H A⁻¹U is regular.

We use this result to solve Ãx̃ = b with Ã from (2.6.0.16) more efficiently than straightforward elimination could deliver, provided that the LU-factorisation A = LU is already known. We apply Lemma 2.6.0.21 for
k = 1 and get
$$\tilde{\mathbf{x}} = \mathbf{A}^{-1}\mathbf{b} - \mathbf{A}^{-1}\mathbf{u}\,\frac{\mathbf{v}^{\mathsf{H}}\mathbf{A}^{-1}\mathbf{b}}{1 + \mathbf{v}^{\mathsf{H}}\mathbf{A}^{-1}\mathbf{u}}\,.$$
We have to solve two linear systems of equations with system matrix A, which is "cheap" provided that the LU-decomposition of A is available. This is another case where precomputing the LU-decomposition pays off.
Assuming that lu passes an object that contains an LU-decomposition of A ∈ R^{n,n}, the following code demonstrates an efficient implementation with asymptotic complexity O(n²) for n → ∞, due to the backward/forward substitutions in Lines 7-8.
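That code is not listed here; the following sketch (an assumption about its structure, with lu an Eigen::PartialPivLU<Eigen::MatrixXd> object holding the LU-decomposition of A) conveys the idea and contains the line alpha = 1.0 + v.dot(w) discussed next:

#include <Eigen/Dense>
#include <cmath>
#include <limits>
#include <stdexcept>
using Eigen::MatrixXd;
using Eigen::VectorXd;

// Solve (A + u*v^T)*x = b via Lemma 2.6.0.21 for k = 1, reusing the
// precomputed LU-decomposition of A stored in lu; cost O(n^2).
VectorXd smw(const Eigen::PartialPivLU<MatrixXd> &lu, const VectorXd &u,
             const VectorXd &v, const VectorXd &b) {
  const VectorXd z = lu.solve(b);  // z = A^{-1}*b (forward/backward substitution)
  const VectorXd w = lu.solve(u);  // w = A^{-1}*u (forward/backward substitution)
  const double alpha = 1.0 + v.dot(w);
  // Absolute test for a (nearly) vanishing denominator, see the discussion below.
  if (std::abs(alpha) < std::numeric_limits<double>::epsilon()) {
    throw std::runtime_error("modified matrix nearly singular");
  }
  return z - w * (v.dot(z) / alpha);
}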
In Rem. 1.5.3.15 we were told that the test whether a numerical result is zero should be done by comparing with another quantity. Then why is this advice not heeded in the above code? The reason is the line alpha = 1.0 + v.dot(w). Imagine this results in a value α = 10·EPS. In this case the computation of α would involve massive cancellation, see Section 1.5.4, and the result would probably have a huge relative error ≈ 0.1. This would completely destroy the accuracy of the final result, regardless of the size of any other quantity computed in the code. Therefore, it is advisable to check the absolute size of α. y
EXAMPLE 2.6.0.24 (Resistance to currents map) Many linear systems with system matrices that differ in a single entry only have to be solved when we want to determine the dependence of the total impedance of a circuit on the value of a single variable resistance.
Fig. 50: large (linear) electric circuit (modelling → Ex. 2.1.0.3) built from resistors R1–R4, capacitors C1, C2, coils L, a voltage source U, and a variable resistor Rx.
Sought: dependence of (certain) branch currents on the "continuously varying" resistance Rx
(➣ currents for many different values of Rx)
Only a few entries of the nodal analysis matrix A (→ Ex. 2.1.0.3) are affected by variation of R x !
(If R x connects nodes i & j ⇒ only entries aii , a jj , aij , a ji of A depend on R x )
y
Review question(s) 2.6.0.25 (Exploiting structure when solving linear systems of equations)
(Q2.6.0.25.A) Compute the block LU-decomposition of the arrow matrix
$$\mathbf{A} = \begin{bmatrix} \mathbf{D} & \mathbf{c} \\ \mathbf{b}^{\top} & \alpha \end{bmatrix}\,,\qquad \mathbf{D} \in \mathbb{R}^{n,n} \text{ regular, diagonal}\,,\quad \mathbf{c},\mathbf{b} \in \mathbb{R}^n\,,\quad \alpha \in \mathbb{R}\,,$$
according to the indicated (and natural) partitioning of the matrix.
(Q2.6.0.25.B) Sketch an efficient algorithm for solving the LSE
$$\begin{bmatrix} \mathbf{I}_{n-1} & \mathbf{c} \\ \mathbf{0}^{\top} & \alpha \end{bmatrix}\mathbf{x} = \mathbf{b}\,,\qquad \mathbf{b} \in \mathbb{R}^n\,,\ \mathbf{c} \in \mathbb{R}^{n-1}\,,\ \alpha > 0\,.$$
(Q2.6.0.25.C) Given a matrix A ∈ R n,n find rank-1 modifications that replace its i-th row or column with
a given vector w ∈ R n .
(Q2.6.0.25.D) Given a regular matrix A ∈ R^{n,n} and b ∈ R^n, we want to solve many linear systems of the form Ã(ξ)x = b, where Ã(ξ) is obtained by adding ξ ∈ R to every entry of A. Sketch an efficient algorithm that serves this purpose. Do not forget to test for near-singularity of the matrix Ã(ξ)!
(Q2.6.0.25.E) ["Z-shaped" matrix] Let A ∈ R^{n,n} be "Z-shaped".
Use these formulas to compute the solution of the 2 × 2 linear system of equations
$$\begin{bmatrix} \delta & 1 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} x_1 \\ \xi \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}\,,$$
(Q2.6.0.25.G) [A banded linear system] Sketch an efficient algorithm for the solution of the n × n linear system of equations
$$\begin{bmatrix}
1 & 0 & \cdots & \cdots & 0 & 1\\
1 & 1 & 0 & & \cdots & 0\\
0 & 1 & 1 & \ddots & & \vdots\\
\vdots & \ddots & \ddots & \ddots & \ddots & \vdots\\
\vdots & & \ddots & 1 & 1 & 0\\
0 & \cdots & \cdots & 0 & 1 & 1
\end{bmatrix}\mathbf{x} = \mathbf{b} \in \mathbb{R}^{n}\,.$$
Hint. You may first perform Gaussian elimination for n = 7 and n = 8 or use the LU-decomposition
$$\begin{bmatrix}
1 & 0 & \cdots & \cdots & 0 & 1\\
1 & 1 & 0 & & \cdots & 0\\
0 & 1 & 1 & \ddots & & \vdots\\
\vdots & \ddots & \ddots & \ddots & \ddots & \vdots\\
\vdots & & \ddots & 1 & 1 & 0\\
0 & \cdots & \cdots & 0 & 1 & 1
\end{bmatrix}
=
\begin{bmatrix}
1 & 0 & \cdots & \cdots & 0 & 0\\
1 & 1 & 0 & & \cdots & 0\\
0 & 1 & 1 & \ddots & & \vdots\\
\vdots & \ddots & \ddots & \ddots & \ddots & \vdots\\
\vdots & & \ddots & 1 & 1 & 0\\
0 & \cdots & \cdots & 0 & 1 & 1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & \cdots & \cdots & 0 & 1\\
0 & 1 & \ddots & & \vdots & -1\\
\vdots & \ddots & \ddots & \ddots & & 1\\
\vdots & & \ddots & \ddots & 0 & \vdots\\
\vdots & & & \ddots & 1 & (-1)^{n-2}\\
0 & \cdots & \cdots & \cdots & 0 & 1 + (-1)^{n-1}
\end{bmatrix}.$$
A ∈ K^{m,n}, m, n ∈ N, is sparse, if nnz(A) ≪ mn.
Sloppy parlance: matrix sparse :⇔ "almost all" entries = 0 / "only a few percent of" entries ≠ 0
A matrix with enough zeros that it pays to take advantage of them should be treated as sparse.
For a sequence of matrices A^(l) ∈ K^{m_l,n_l} this can be expressed as
$$\lim_{l\to\infty} \frac{\operatorname{nnz}(\mathbf{A}^{(l)})}{n_l\,m_l} = 0\,.$$
EXAMPLE 2.7.0.4 (Sparse LSE in circuit modelling) See Ex. 2.1.0.3 for the description of a linear
electric circuit by means of a linear system of equations for nodal voltages. For large circuits the system
matrices will invariably be huge and sparse.
Remark 2.7.0.5 (Sparse matrices from the discretization of linear partial differential equations)
Another important context in which sparse matrices usually arise:
☛ spatial discretization of linear boundary value problems for partial differential equations by means
of finite element (FE), finite volume (FV), or finite difference (FD) methods (→ 4th semester course
“Numerical methods for PDEs”).
y
§2.7.1.1 (Triplet/coordinate list (COO) format) In the case of a sparse matrix A ∈ K m,n , this format
stores triplets (i, j, αi,j ), 1 ≤ i ≤ m, 1 ≤ j ≤ n:
struct Triplet {
  size_t i;    // row index
  size_t j;    // column index
  scalar_t a;  // additive contribution to matrix entry
};
using TripletMatrix = std::vector<Triplet>;
Here scalar_t is the underlying scalar type, either float, double, or std::complex<double>. The vector of triplets in a TripletMatrix has size ≥ nnz(A). We write "≥", because repetitions of index pairs (i, j) are allowed. The matrix entry (A)_{i,j} is defined to be the sum of all values α_{i,j} associated with the index pair (i, j). The next code demonstrates this summation.
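Code 2.7.1.2 itself is not reproduced here; a minimal sketch of the triplet-based matrix×vector multiplication it performs (assuming scalar_t = double and 0-based indices) could read as follows.

#include <cstddef>
#include <vector>

struct Triplet {
  std::size_t i;  // row index
  std::size_t j;  // column index
  double a;       // additive contribution to matrix entry
};
using TripletMatrix = std::vector<Triplet>;

// y += contribution of every triplet; repeated index pairs add up,
// which realizes the summation described above.
void multTriplMatVec(const TripletMatrix &A, const std::vector<double> &x,
                     std::vector<double> &y) {
  for (const Triplet &t : A) {
    y[t.i] += t.a * x[t.j];  // no index checks!
  }
}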
Note that this code assumes that the result vector y has the appropriate length; no index checks are performed.
For Code 2.7.1.2 the computational effort is proportional to the number of triplets. (This might be much larger than nnz(A) in case of many repetitions of triplets.) y
Remark 2.7.1.3 (The zoo of sparse matrix formats) Special sparse matrix storage formats store only
non-zero entries:
• Compressed Row Storage (CRS)
• Compressed Column Storage (CCS) → used by MATLAB
• Block Compressed Row Storage (BCRS)
• Compressed Diagonal Storage (CDS)
• Jagged Diagonal Storage (JDS)
• Skyline Storage (SKS)
All of these formats achieve the two objectives stated above. Some have been designed for sparse matri-
ces with additional structure or for seamless cooperation with direct elimination algorithms (JDS,SKS). y
§2.7.1.4 (Compressed row-storage (CRS) format) The CRS format for a sparse matrix A = (a_ij) ∈ K^{n,n} keeps the data in three contiguous arrays:
  std::vector<scalar_t> val,     size ≥ nnz(A) := #{(i, j) ∈ {1, . . . , n}², a_ij ≠ 0}
  std::vector<size_t>   col_ind, size = val.size()
  std::vector<size_t>   row_ptr, size n + 1 with row_ptr[n + 1] = val.size() (sentinel value)
As above we write nnz(A) ≙ number of non-zeros of A.
Example (val holds the stored non-zero entries a_ij, col_ind holds their column indices j, and row_ptr marks where each row i begins):
$$\mathbf{A} = \begin{bmatrix}
10 & 0 & 0 & 0 & -2 & 0\\
3 & 9 & 0 & 0 & 0 & 3\\
0 & 7 & 8 & 7 & 0 & 0\\
3 & 0 & 8 & 7 & 5 & 0\\
0 & 8 & 0 & 9 & 9 & 13\\
0 & 4 & 0 & 0 & 2 & -1
\end{bmatrix}$$
val-array:     10 -2 3 9 3 7 8 7 3 ... 9 13 4 2 -1
col_ind-array: 1 5 1 2 6 2 3 4 1 ... 5 6 2 5 6
row_ptr-array: 1 3 6 9 13 17 20
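For illustration, a sketch (not one of the lecture codes) of matrix×vector multiplication in CRS format; note that the example above uses 1-based indices, while the sketch assumes 0-based indices. Every non-zero entry is visited exactly once.

#include <cstddef>
#include <vector>

struct CRSMatrix {
  std::vector<double> val;
  std::vector<std::size_t> col_ind;
  std::vector<std::size_t> row_ptr;  // size n+1, row_ptr[n] = val.size()
};

std::vector<double> crsMatVec(const CRSMatrix &A, const std::vector<double> &x) {
  const std::size_t n = A.row_ptr.size() - 1;
  std::vector<double> y(n, 0.0);
  for (std::size_t i = 0; i < n; ++i) {
    // entries of row i occupy positions row_ptr[i], ..., row_ptr[i+1]-1
    for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) {
      y[i] += A.val[k] * x[A.col_ind[k]];
    }
  }
  return y;
}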
Sketch an efficient function, based on the CRS format, that computes y := Ãx, where Ã ∈ R^{n,n} is defined as
$$\tilde{\mathbf{A}}_{i,j} = \begin{cases} (\mathbf{A})_{i,j}\,, & \text{if } |i - j| \leq 1\,,\\ 0 & \text{else,}\end{cases}\qquad i, j \in \{1,\dots,n\}\,.$$
(Q2.7.1.5.C) Let a matrix A ∈ R^{n,n} be given in COO/triplet format and by a TripletMatrix object A:
struct Triplet {
  size_t i;  // row index
(Q2.7.1.5.D) Assume that a sparse matrix in CRS format is represented by an object of the type
struct CRSMatrix {
  std::vector<double> val;
  std::vector<std::size_t> col_ind;
  std::vector<std::size_t> row_ptr;
};
Sketch a function whose arguments supply the three vectors defining the matrix A in CRS format and which overwrites them with the corresponding vectors of the CRS-format description of WA.
Usually sparse matrices in CRS/CCS format must not be filled by setting entries through index-pair access,
because this would entail frequently moving big chunks of memory. The matrix should first be assembled
in triplet format (→ E IGEN documentation), from which a sparse matrix is built. E IGEN offers special
data types and facilities for handling triplets.
As shown, a Triplet object offers the access member functions row(), col(), and value() to fetch
the row index, column index, and scalar value stored in a Triplet.
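A minimal sketch of triplet-based assembly of a sparse matrix in EIGEN (the tridiagonal model matrix is only an illustration):

#include <Eigen/SparseCore>
#include <vector>

Eigen::SparseMatrix<double> buildFromTriplets(int n) {
  std::vector<Eigen::Triplet<double>> triplets;
  triplets.reserve(3 * n);
  for (int i = 0; i < n; ++i) {
    triplets.emplace_back(i, i, 2.0);                      // diagonal
    if (i + 1 < n) triplets.emplace_back(i, i + 1, -1.0);  // super-diagonal
    if (i > 0) triplets.emplace_back(i, i - 1, -1.0);      // sub-diagonal
  }
  Eigen::SparseMatrix<double> A(n, n);
  // Repeated (i,j) pairs are summed up, as in the COO format above.
  A.setFromTriplets(triplets.begin(), triplets.end());
  A.makeCompressed();  // switch to compressed (CCS) storage
  return A;
}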
The statement that entry-wise initialization of sparse matrices is not efficient has to be qualified for EIGEN: entries can be set directly, provided that enough space for each row (in RowMajor format) is reserved in advance. This is done by the reserve() method, which takes an integer vector of maximal expected numbers of non-zero entries per row:
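A sketch of this reserve-then-insert initialization (again for a tridiagonal model matrix, which is only an illustration):

#include <Eigen/SparseCore>

Eigen::SparseMatrix<double, Eigen::RowMajor> buildByInsert(int n) {
  Eigen::SparseMatrix<double, Eigen::RowMajor> A(n, n);
  A.reserve(Eigen::VectorXi::Constant(n, 3));  // expect at most 3 non-zeros per row
  for (int i = 0; i < n; ++i) {
    A.insert(i, i) = 2.0;
    if (i + 1 < n) A.insert(i, i + 1) = -1.0;
    if (i > 0) A.insert(i, i - 1) = -1.0;
  }
  A.makeCompressed();
  return A;
}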
insert(i,j) sets an entry of the sparse matrix, which is rather efficient, provided that enough space has been reserved. coeffRef(i,j) gives l-value and r-value access to any matrix entry, creating the entry if it does not exist yet.
The usual matrix operations are supported for sparse matrices; addition and subtraction may involve only
sparse matrices stored in the same format. These operations may incur large hidden costs and have to
be used with care!
EXPERIMENT 2.7.2.2 (Initialization of sparse matrices in Eigen) We study the runtime behavior of the
initialization of a sparse matrix in Eigen. We use the methods described above. The code is available from
➺ GITLAB.
Fig.: time in milliseconds for the different ways to initialize a sparse matrix in EIGEN (triplet-based initialization vs. direct entry-wise initialization).
Observation: insufficient advance allocation of memory massively slows down the set-up of a sparse matrix in the case of direct entry-wise initialization.
Reason: massive internal copying of data is required to create space for "unexpected" entries. y
EXAMPLE 2.7.2.5 (Smoothing of a triangulation) This example demonstrates that sparse linear sys-
tems of equations naturally arise in the handling of triangulations.
The points in N are also called the nodes of the mesh, the triangles the cells, and all line segments
connecting two nodes and occurring as a side of a triangle form the set of edges. We always assume a
consecutive numbering of the nodes and cells of the triangulation (starting from 1, M ATLAB’s convention).
Fig. 53 Fig. 54
Common data structure for describing a triangulation with N nodes and M cells:
• column vector x ∈ R N : x-coordinates of nodes
• column vector y ∈ R N : y-coordinates of nodes
• M × 3-matrix T whose rows contain the index numbers of the vertices of the cells.
(This matrix is a so-called triangle-node incidence matrix.)
Fig. 55
The cells of a mesh may be rather distorted triangles (with very large and/or small angles), which is usually
not desirable. We study an algorithm for smoothing a mesh without changing the planar domain covered
by it.
Every edge that is adjacent to only one cell is a boundary edge of the triangulation. Nodes that are
endpoints of boundary edges are boundary nodes.
✎ Notation: Γ ⊂ {1, . . . , N } =
ˆ set of indices of boundary nodes.
✎ Notation: p_i = [p_i^1, p_i^2]^⊤ ∈ R² ≙ coordinate vector of node ♯i, i = 1, . . . , N
We define
S(i ) := { j ∈ {1, . . . , N } : nodes i and j are connected by an edge} , (2.7.2.8)
as the set of node indices of the “neighbours” of the node with index number i.
$$\mathbf{p}_i = \frac{1}{\sharp S(i)} \sum_{j \in S(i)} \mathbf{p}_j \qquad (2.7.2.10)$$
$$\Updownarrow$$
$$\sharp S(i)\, p_i^d = \sum_{j \in S(i)} p_j^d\,,\quad d = 1,2\,,\quad \text{for all } i \in \{1,\dots,N\}\setminus\Gamma\,,$$
that is, every interior node is located in the center of gravity of its neighbours.
The relations (2.7.2.10) correspond to the lines of a sparse linear system of equations! In order to state it, we insert the coordinates of all nodes into a column vector z ∈ K^{2N}, according to
$$z_i = \begin{cases} p_i^1\,, & \text{if } 1 \leq i \leq N\,,\\ p_{i-N}^2\,, & \text{if } N+1 \leq i \leq 2N\,. \end{cases} \qquad (2.7.2.11)$$
For the sake of ease of presentation, in the sequel we assume (which is not the case in usual triangulation
data) that interior nodes have index numbers smaller than that of boundary nodes.
From (2.7.2.8) we infer that the system matrix C ∈ R^{2n,2N}, n := N − ♯Γ, of that linear system has the following structure:
$$\mathbf{C} = \begin{bmatrix} \mathbf{A} & \mathbf{O} \\ \mathbf{O} & \mathbf{A} \end{bmatrix}\,,\qquad (\mathbf{A})_{i,j} = \begin{cases} \sharp S(i)\,, & \text{if } i = j\,,\\ -1\,, & \text{if } j \in S(i)\,,\\ 0 & \text{else,}\end{cases}\qquad i \in \{1,\dots,n\}\,,\ j \in \{1,\dots,N\}\,. \qquad (2.7.2.12)$$
$$(2.7.2.10) \;\Leftrightarrow\; \mathbf{C}\mathbf{z} = \mathbf{0}\,. \qquad (2.7.2.13)$$
➣ nnz(A) ≤ number of edges of M + number of interior nodes of M.
➣ The matrix C associated with M according to (2.7.2.12) is clearly sparse.
➣ The sum of the entries in every row of C vanishes.
We partition the vector z into coordinates of nodes in the interior and of nodes on the boundary:
$$\mathbf{z} = \begin{bmatrix} \mathbf{z}_1^{\text{int}} \\ \mathbf{z}_1^{\text{bd}} \\ \mathbf{z}_2^{\text{int}} \\ \mathbf{z}_2^{\text{bd}} \end{bmatrix} := \bigl[\,z_1,\dots,z_n,\; z_{n+1},\dots,z_N,\; z_{N+1},\dots,z_{N+n},\; z_{N+n+1},\dots,z_{2N}\,\bigr]^{\top}\,.$$
This induces the following block partitioning of the linear system (2.7.2.13):
$$\begin{bmatrix} \mathbf{A}_{\text{int}} & \mathbf{A}_{\text{bd}} & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \mathbf{A}_{\text{int}} & \mathbf{A}_{\text{bd}} \end{bmatrix} \begin{bmatrix} \mathbf{z}_1^{\text{int}} \\ \mathbf{z}_1^{\text{bd}} \\ \mathbf{z}_2^{\text{int}} \\ \mathbf{z}_2^{\text{bd}} \end{bmatrix} = \mathbf{0}\,,\qquad \mathbf{A}_{\text{int}} \in \mathbb{R}^{n,n}\,,\ \mathbf{A}_{\text{bd}} \in \mathbb{R}^{n,N-n}\,.$$
$$\Updownarrow$$
$$\begin{bmatrix}\mathbf{A}_{\text{int}} & \mathbf{A}_{\text{bd}}\end{bmatrix}\begin{bmatrix}\mathbf{z}_1^{\text{int}}\\ \mathbf{z}_1^{\text{bd}}\end{bmatrix} = \mathbf{0}\,,\qquad \begin{bmatrix}\mathbf{A}_{\text{int}} & \mathbf{A}_{\text{bd}}\end{bmatrix}\begin{bmatrix}\mathbf{z}_2^{\text{int}}\\ \mathbf{z}_2^{\text{bd}}\end{bmatrix} = \mathbf{0}\,. \qquad (2.7.2.14)$$
The linear system (2.7.2.14) holds the key to the algorithmic realization of mesh smoothing; when smooth-
ing the mesh
(i) the node coordinates belonging to interior nodes have to be adjusted to satisfy the equilibrium con-
dition (2.7.2.10), they are unknowns,
(ii) the coordinates of nodes located on the boundary are fixed, that is, their values are known.
unknown: z_1^int, z_2^int (yellow in (2.7.2.14));  known: z_1^bd, z_2^bd (pink in (2.7.2.14))
$$(2.7.2.13)/(2.7.2.14) \;\Leftrightarrow\; \mathbf{A}_{\text{int}}\mathbf{z}_1^{\text{int}} = -\mathbf{A}_{\text{bd}}\mathbf{z}_1^{\text{bd}}\,,\qquad \mathbf{A}_{\text{int}}\mathbf{z}_2^{\text{int}} = -\mathbf{A}_{\text{bd}}\mathbf{z}_2^{\text{bd}}\,. \qquad (2.7.2.15)$$
This is a square linear system with an n × n system matrix, to be solved for two different right hand side
vectors. The matrix Aint is also known as the matrix of the combinatorial graph Laplacian.
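A sketch of how the blocks [A_int, A_bd] of (2.7.2.12) could be assembled from the edge list of the triangulation via triplets (function name and argument conventions are illustrative assumptions, not lecture code; interior nodes are assumed to carry the indices 0, ..., n−1):

#include <Eigen/SparseCore>
#include <utility>
#include <vector>

Eigen::SparseMatrix<double>
assembleGraphLaplacian(int n, int N,
                       const std::vector<std::pair<int, int>> &edges) {
  std::vector<Eigen::Triplet<double>> trp;
  for (const auto &[i, j] : edges) {
    // every edge adds 1 to the degree entry (i,i) and -1 to the coupling (i,j)
    if (i < n) { trp.emplace_back(i, i, 1.0); trp.emplace_back(i, j, -1.0); }
    if (j < n) { trp.emplace_back(j, j, 1.0); trp.emplace_back(j, i, -1.0); }
  }
  Eigen::SparseMatrix<double> C(n, N);
  C.setFromTriplets(trp.begin(), trp.end());  // repeated (i,i) entries add up
  return C;
}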
We examine the sparsity pattern of the system matrices Aint for a sequence of triangulations created by
regular refinement.
We start from the triangulation of Fig. 55 and in turns perform regular refinement and smoothing (left ↔
Below we give spy plots of the system matrices Aint for the first three triangulations of the sequence:
C++ code 2.7.3.1: Function for solving a sparse LSE with EIGEN ➺ GITLAB

using SparseMatrix = Eigen::SparseMatrix<double>;
// Perform sparse elimination
inline void sparse_solve(const SparseMatrix &A, const VectorXd &b, VectorXd &x) {
  const Eigen::SparseLU<SparseMatrix> solver(A);
  if (solver.info() != Eigen::Success) {
    throw std::runtime_error("Matrix factorization failed");
  }
  x = solver.solve(b);
}
The constructor of the solver object builds the actual sparse LU-decomposition. The solve method
then does forward and backward elimination, cf. § 2.3.2.15. It can be called multiple times, see
Rem. 2.5.0.10. For more sample codes see ➺ GITLAB.
EXPERIMENT 2.7.3.2 (Sparse elimination for arrow matrix) In Ex. 2.6.0.5 we saw that applying the
standard lu() solver to a sparse arrow matrix results in an extreme waste of computational resources.
Yet, EIGEN can do much better! The main mistake was the creation of a dense matrix instead of storing the arrow matrix in sparse format. There are EIGEN solvers which rely on particular sparse elimination techniques. They still rely on Gaussian elimination with (partial) pivoting (→ Code 2.3.3.8), but take pains to operate on non-zero entries only. This can greatly boost the speed of the elimination.
C++ code 2.7.3.3: Invoking sparse elimination solver for arrow matrix ➺ GITLAB

template <class solver_t>
VectorXd arrowsys_sparse(const VectorXd &d,
                         const VectorXd &c,
                         const VectorXd &b, double alpha,
                         const VectorXd &y) {
  const Eigen::Index n = d.size();
  SparseMatrix<double> A(n + 1, n + 1);                // default: column-major
  VectorXi reserveVec = VectorXi::Constant(n + 1, 2);  // nnz per col
  reserveVec(n) = static_cast<int>(n + 1);             // last full col
  A.reserve(reserveVec);
  for (int j = 0; j < n; ++j) {  // initialize along cols for efficiency
    A.insert(j, j) = d(j);       // diagonal entries
    A.insert(n, j) = b(j);       // bottom row entries
  }
  for (int i = 0; i < n; ++i) {
    A.insert(i, n) = c(i);       // last col
  }
  A.insert(n, n) = alpha;        // bottom-right entry
  A.makeCompressed();
  return solver_t(A).solve(y);
}
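A usage sketch (assuming that the vectors d, c, b, the scalar alpha, and the right-hand side y are already defined):

#include <Eigen/SparseLU>
using SpMat = Eigen::SparseMatrix<double>;
const Eigen::VectorXd x =
    arrowsys_sparse<Eigen::SparseLU<SpMat>>(d, c, b, alpha, y);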
Fig. 60: runtime [s] vs. matrix size n for arrowsys_slow, arrowsys_fast, arrowsys_SparseLU, and an iterative solver (arrowsys_iterative).
Observation: the sparse elimination solver beats Gaussian elimination operating on a dense matrix by a wide margin. The sparse solver is still slower than the dedicated solver arrowsys_fast, which can dispense with pivoting. y
EXPERIMENT 2.7.3.4 (Timing sparse elimination for the combinatorial graph Laplacian) We consider a sequence of planar triangulations created by successive regular refinement (→ Def. 2.7.2.16) of the planar triangulation of Fig. 55, see Ex. 2.7.2.5. We use different EIGEN and MKL sparse solvers for the linear system of equations (2.7.2.15) associated with each mesh.
Fig. 62: timing results for Eigen SparseLU, Eigen SimplicialLDLT, Eigen ConjugateGradient, MKL PardisoLU, and MKL PardisoLDLT; reference slope O(n^1.5).
When solving linear systems of equations directly, dedicated sparse elimination solvers from numerical libraries have to be used!
System matrices are passed to these algorithms in sparse storage formats (→ Section 2.7.1) to convey information about zero entries.
STOP Never ever even think about implementing a general sparse elimination solver by yourself! Use established libraries instead:
→ SuperLU (http://www.cs.berkeley.edu/~demmel/SuperLU.html),
→ UMFPACK (https://en.wikipedia.org/wiki/UMFPACK), used by MATLAB's \,
→ PARDISO [SG04] (http://www.pardiso-project.org/), incorporated into MKL
C++-code 2.7.3.6: Example code demonstrating the use of PARDISO with EIGEN ➺ GITLAB

void solveSparsePardiso(size_t n) {
  using SpMat = Eigen::SparseMatrix<double>;
  // Initialize a sparse matrix
  const SpMat M = initSparseMatrix<SpMat>(n);
  const Eigen::VectorXd b = Eigen::VectorXd::Random(n);
  Eigen::VectorXd x(n);
  // Initialization of the sparse direct solver based on the PARDISO library,
  // directly passing the matrix M to the solver. PARDISO is part of the
  // Intel MKL library, see also Ex. 1.3.2.6
  Eigen::PardisoLU<SpMat> solver(M);
  // The checks of Code 2.7.3.1 are omitted
  // solve the LSE
  x = solver.solve(b);
}
# Intel(R) MKL 11.3.2, Linux, None, GNU C/C++, Intel(R) 64, Static, LP64, Sequential
FLAGS_LINK = -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a \
  ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_sequential.a \
  -Wl,--end-group -lpthread -lm -ldl

all: main.cpp
	$(COMPILER) $(FLAGS) -DEIGEN_USE_MKL_ALL $< -o main $(FLAGS_LINK)
y
Review question(s) 2.7.3.7 (Direct solution of sparse linear systems of equations)
(Q2.7.3.7.A) In Code 2.7.3.1 we checked (solver.info() != Eigen::Success), where solver was of type Eigen::SparseLU<Eigen::SparseMatrix>. Can you explain why it is a good idea to include this test?
(Q2.7.3.7.B) What are the benefits of storing a matrix in CRS or CCS format?
△
However, simple examples show that the product of sparse matrices need not be sparse, which means
that the multiplication of large sparse matrices will usually require an effort way bigger than the sum of the
numbers of their non-zero entries.
What is the situation concerning the solution of square linear systems of equations with sparse system
matrices? Generically, we have to brace for a computational effort O(n3 ) for matrix size n → ∞. Yet
Section 2.7.3 sends the message that a better asymptotic complexity can often be achieved, if the sparse
matrix has a particular structure and sophisticated library routines are used. In this section, we examine
some aspects of Gaussian elimination ↔ LU-factorisation when applied in a sparse-matrix context.
EXAMPLE 2.7.4.1 ( LU -factorization of sparse matrices) We examine the following “sparse” matrix with
a typical structure and inspect the pattern of the LU-factors returned by E IGEN, see Code 2.7.4.2.
$$\mathbf{A} = \begin{bmatrix}
3 & -1 & & & -1 & &\\
-1 & 3 & \ddots & & & \ddots &\\
 & \ddots & \ddots & -1 & & & -1\\
 & & -1 & 3 & -1 & &\\
-1 & & & -1 & 3 & \ddots &\\
 & \ddots & & & \ddots & \ddots & -1\\
 & & -1 & & & -1 & 3
\end{bmatrix} \in \mathbb{R}^{n,n}\,,\quad n \in \mathbb{N}$$
Spy plots (patterns of non-zero entries) of A and of its LU-factors L and U.
Of course, in case the LU-factors of a sparse matrix possess many more non-zero entries than the matrix
itself, the effort for solving a linear system with direct elimination will increase significantly. This can be
quantified by means of the following concept:
EXAMPLE 2.7.4.4 (Sparse LU-factors) Ex. 2.7.4.1 ➣ massive fill-in can occur for sparse matrices.
This example demonstrates that fill-in can largely be avoided, if the matrix has a favorable structure. In this case an LSE with this particular system matrix A can be solved efficiently, that is, with a computational effort O(nnz(A)), by Gaussian elimination.
A is called an “arrow matrix”, see the pattern of non-zero entries below and Ex. 2.6.0.5.
Recalling Rem. 2.3.2.17 it is easy to see that the LU-factors of A will be sparse and that their sparsity
patterns will be as depicted below. Observe that despite sparse LU-factors, A−1 will be densely populated.
Spy plots: pattern of A (nz = 31), pattern of A⁻¹ (nz = 121), pattern of L (nz = 21), pattern of U (nz = 21).
EXAMPLE 2.7.4.6 (LU-decomposition of flipped "arrow matrix") Recall the discussion in Ex. 2.6.0.5. Here we look at an arrow matrix in a slightly different form:
$$\mathbf{M} = \begin{bmatrix} \alpha & \mathbf{b}^{\top} \\ \mathbf{c} & \mathbf{D} \end{bmatrix}\,,\qquad \alpha \in \mathbb{R}\,,\quad \mathbf{b},\mathbf{c} \in \mathbb{R}^{n-1}\,,\quad \mathbf{D} \in \mathbb{R}^{n-1,n-1} \text{ regular diagonal matrix } (\to \text{Def. 1.1.2.3})\,. \qquad (2.7.4.7)$$
Output of modified Code 2.7.4.5: spy plots of the LU-factors L (nz = 65) and U (nz = 65) of M; obvious fill-in (→ Def. 2.7.4.3).
Now it comes as a surprise that the arrow matrix A from Ex. 2.6.0.5, (2.6.0.6), has sparse LU-factors:
$$\mathbf{A} = \begin{bmatrix} \mathbf{D} & \mathbf{c} \\ \mathbf{b}^{\top} & \alpha \end{bmatrix} = \underbrace{\begin{bmatrix} \mathbf{I} & \mathbf{0} \\ \mathbf{b}^{\top}\mathbf{D}^{-1} & 1 \end{bmatrix}}_{=:\mathbf{L}} \cdot \underbrace{\begin{bmatrix} \mathbf{D} & \mathbf{c} \\ \mathbf{0} & \sigma \end{bmatrix}}_{=:\mathbf{U}}\,,\qquad \sigma := \alpha - \mathbf{b}^{\top}\mathbf{D}^{-1}\mathbf{c}\,.$$
Idea: transform M into the arrow matrix A from (2.6.0.6) by (cyclic) row and column permutations before performing the LU-decomposition.
Fig. 66, Fig. 67: spy plots (nz = 31 each) of the matrix before and after the permutation.
➣ Then LU-factorization (without pivoting) of the resulting matrix requires O(n) operations.
C++ code 2.7.4.8: Permuting arrow matrix, see Fig. 66, Fig. 67 ➺ GITLAB

MatrixXd A(11, 11);
A.setIdentity();
A.col(0).setOnes();
A.row(0) = RowVectorXd::LinSpaced(11, 11, 1);
EXAMPLE 2.7.4.9 (Pivoting destroys sparsity) In Ex. 2.7.4.6 we found that permuting a matrix can make
it amenable to Gaussian elimination/LU-decomposition with much less fill-in (→ Def. 2.7.4.3). However,
recall from Section 2.3.3 that pivoting, which may be essential for achieving numerical stability, amounts to
permuting the rows (or even columns) of the matrix. Thus, we may face the awkward situation that pivoting
tries to reverse the very permutation we applied to minimize fill-in! The next example shows that this can
happen for an arrow matrix.
$$\mathbf{A} = \begin{bmatrix}
\frac{1}{2} & & & & 2\\
 & \frac{1}{2^{2}} & & & 2\\
 & & \ddots & & \vdots\\
 & & & \frac{1}{2^{10}} & 2\\
2 & 2 & \cdots & 2 & 2
\end{bmatrix} \quad\to\ \text{arrow matrix, Ex. 2.7.4.4}$$
The distributions of non-zero entries of the computed LU-factors ("spy-plots") are as follows:
Spy plots: A (nz = 31), L (nz = 21), U (nz = 66).
In this case the solution of an LSE with system matrix A ∈ R^{n,n} of the above type by means of Gaussian elimination with partial pivoting would incur costs of O(n³). y
Fig.: an m × n matrix with its diagonal, super-diagonals, and sub-diagonals marked; in the depicted example the two bandwidth numbers are 3 and 2 (numbers of super-/sub-diagonals).
We now examine a generalization of the concept of a banded matrix that is particularly useful in the context
of Gaussian elimination:
Example (row bandwidths and envelope; ∗ ≙ non-zero matrix entry a_ij ≠ 0, env(A) = red entries):
$$\mathbf{A} = \begin{bmatrix}
* & 0 & * & 0 & 0 & 0 & 0\\
0 & * & 0 & 0 & * & 0 & 0\\
* & 0 & * & 0 & 0 & 0 & *\\
0 & 0 & 0 & * & * & 0 & *\\
0 & * & 0 & * & * & * & 0\\
0 & 0 & 0 & 0 & * & * & 0\\
0 & 0 & * & * & 0 & 0 & *
\end{bmatrix}
\qquad
\begin{aligned}
\mathrm{bw}_1^R(\mathbf{A}) &= 0\\
\mathrm{bw}_2^R(\mathbf{A}) &= 0\\
\mathrm{bw}_3^R(\mathbf{A}) &= 2\\
\mathrm{bw}_4^R(\mathbf{A}) &= 0\\
\mathrm{bw}_5^R(\mathbf{A}) &= 3\\
\mathrm{bw}_6^R(\mathbf{A}) &= 1\\
\mathrm{bw}_7^R(\mathbf{A}) &= 4
\end{aligned}$$
Fig. 68 (nz = 138) and Fig. 69 (nz = 121): spy plots (20 × 20).
Note: the envelope of the arrow matrix from Ex. 2.7.4.4 is just the set of index pairs of its non-zero entries.
Hence, the following theorem provides another reason for the sparsity of the LU-factors in that example. y
Proof. (by induction, version I) Examine the first step of Gaussian elimination without pivoting, a₁₁ ≠ 0:
$$\mathbf{A} = \begin{bmatrix} a_{11} & \mathbf{b}^{\top}\\ \mathbf{c} & \tilde{\mathbf{A}} \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & \mathbf{0}\\ \frac{\mathbf{c}}{a_{11}} & \mathbf{I} \end{bmatrix}}_{\mathbf{L}^{(1)}} \underbrace{\begin{bmatrix} a_{11} & \mathbf{b}^{\top}\\ \mathbf{0} & \tilde{\mathbf{A}} - \frac{\mathbf{c}\mathbf{b}^{\top}}{a_{11}} \end{bmatrix}}_{\mathbf{U}^{(1)}}$$
$$\text{If } (i,j) \notin \operatorname{env}(\mathbf{A}) \;\Rightarrow\; \begin{cases} c_{i-1} = 0\,, & \text{if } i > j\,,\\ b_{j-1} = 0\,, & \text{if } i < j\,,\end{cases}$$
$$\Rightarrow\quad \operatorname{env}(\mathbf{L}^{(1)}) \subset \operatorname{env}(\mathbf{A})\,,\quad \operatorname{env}(\mathbf{U}^{(1)}) \subset \operatorname{env}(\mathbf{A})\,.$$
Moreover, env(Ã − cb⊤/a₁₁) = env((A)_{2:n,2:n}). ✷
Proof. (by induction, version II) Use block-LU-factorization, cf. Rem. 2.3.2.19 and the proof of Lemma 2.3.2.4:
$$\begin{bmatrix} \tilde{\mathbf{A}} & \mathbf{b}\\ \mathbf{c}^{\top} & \alpha \end{bmatrix} = \begin{bmatrix} \tilde{\mathbf{L}} & \mathbf{0}\\ \mathbf{l}^{\top} & 1 \end{bmatrix}\begin{bmatrix} \tilde{\mathbf{U}} & \mathbf{u}\\ \mathbf{0} & \xi \end{bmatrix} \quad\Rightarrow\quad \tilde{\mathbf{U}}^{\top}\mathbf{l} = \mathbf{c}\,,\quad \tilde{\mathbf{L}}\mathbf{u} = \mathbf{b}\,. \qquad (2.7.5.5)$$
If m_n^C(A) = m, then b₁, . . . , b_{n−m} = 0 (entries of b), and hence u₁, . . . , u_{n−m} = 0 from (2.7.5.5).
Thm. 2.7.5.4 immediately suggests a policy for saving computational effort when solving linear systems whose system matrix A ∈ K^{n,n} is sparse due to a small envelope, ♯ env(A) ≪ n²:
Policy: Confine elimination to the envelope!
Envelope-aware LU-factorization:
      m(it.row()) =
          std::max<VectorXi::Scalar>(m(it.row()), it.row() - it.col());
    }
  }
  return m;
}
//! computes row bandwidth numbers m_i^R(A) of A (dense
//! matrix) according to Def. 2.7.5.2
template <class Derived>
VectorXi rowbandwidth(const MatrixBase<Derived> &A) {
  VectorXi m = VectorXi::Zero(A.rows());
  for (int i = 1; i < A.rows(); ++i) {
    for (int j = 0; j < i; ++j) {
      if (A(i, j) != 0) {
        m(i) = i - j;
        break;
      }
    }
  }
  return m;
}
The asymptotic complexity of envelope-aware forward substitution, cf. § 2.3.2.15, for Lx = y, with L ∈ K^{n,n} a regular lower triangular matrix, is O(♯ env(L))!
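A sketch of envelope-aware forward substitution, using the row bandwidth numbers m_i(L) as computed by rowbandwidth() above (illustrative, not the lecture implementation):

#include <Eigen/Dense>

// Solve L*x = y for a regular lower triangular L; only entries inside the
// envelope of L are touched, hence cost O(#env(L)).
Eigen::VectorXd envFwdSubst(const Eigen::MatrixXd &L, const Eigen::VectorXi &m,
                            const Eigen::VectorXd &y) {
  const Eigen::Index n = L.rows();
  Eigen::VectorXd x(n);
  for (Eigen::Index i = 0; i < n; ++i) {
    double s = y(i);
    for (Eigen::Index j = i - m(i); j < i; ++j) {  // columns inside the envelope
      s -= L(i, j) * x(j);
    }
    x(i) = s / L(i, i);
  }
  return x;
}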
Since by Thm. 2.7.5.4 fill-in is confined to the envelope, we need to store only the matrix entries a_ij, (i, j) ∈ env(A), when computing the (in situ) LU-factorization of a structurally symmetric A ∈ K^{n,n}.
EXAMPLE 2.7.5.12 (Envelope oriented matrix storage) Linear envelope oriented matrix storage of
symmetric A = A⊤ ∈ R n,n :
Two arrays:
  scalar_t* val, size P;    size_t* dptr, size n + 1,
with
$$P := n + \sum_{i=1}^{n} m_i(\mathbf{A})\,. \qquad (2.7.5.13)$$
The stored entries of each row (those inside the envelope) are kept contiguously in val, ending with the diagonal entry. Indexing rule: dptr[j] = k means that val[k] is the first stored entry of row j + 1; thus row i occupies val[dptr[i−1]], . . . , val[dptr[i] − 1] and val[dptr[i] − 1] = a_ii.
For the symmetric 7 × 7 matrix A with the pattern shown above (P = 17):
index  0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
val    a11 a22 a31 a32 a33 a44 a52 a53 a54 a55 a65 a66 a73 a74 a75 a76 a77
dptr   0   1   2   5   6   10  12  17
y
Minimizing bandwidth/envelope:
Goal: minimize m_i(A), A = (a_ij) ∈ R^{N,N}, by permuting the rows/columns of A.
EXAMPLE 2.7.5.14 (Reducing bandwidth by row/column permutations) Recall the cyclic permutation of rows/columns of the arrow matrix applied in Ex. 2.7.4.6. This can be viewed as a drastic shrinking of the envelope:
Spy plots: envelope of the arrow matrix before and after the cyclic permutation (nz = 31 each).
It would be very desirable to have a priori criteria, when Gaussian elimination/LU-factorization re-
mains stable even without pivoting. This can help avoid the extra work for partial pivoting and makes
it possible to exploit structure without worrying about stability.
This section will introduce classes of matrices that allow Gaussian elimination without pivoting. Fortunately,
linear systems of equations featuring system matrices from these classes are very common in applications.
EXAMPLE 2.8.0.1 (Diagonally dominant matrices from nodal analysis → Ex. 2.1.0.3)
(Circuit sketch: nodes ➀–➅ connected by the resistances R12, R23, R24, R25, R35, R14, R45, R56; the voltage U is applied at node ➀ and node ➅ is grounded.)
Consider the nodal equations:
➁: R12⁻¹(U2 − U1) + R23⁻¹(U2 − U3) + R24⁻¹(U2 − U4) + R25⁻¹(U2 − U5) = 0 ,
➂: R23⁻¹(U3 − U2) + R35⁻¹(U3 − U5) = 0 ,
➃: R14⁻¹(U4 − U1) + R24⁻¹(U4 − U2) + R45⁻¹(U4 − U5) = 0 ,
➄: R25⁻¹(U5 − U2) + R35⁻¹(U5 − U3) + R45⁻¹(U5 − U4) + R56⁻¹(U5 − U6) = 0 ,
U1 = U , U6 = 0 .
$$\begin{bmatrix}
\frac{1}{R_{12}} + \frac{1}{R_{23}} + \frac{1}{R_{24}} + \frac{1}{R_{25}} & -\frac{1}{R_{23}} & -\frac{1}{R_{24}} & -\frac{1}{R_{25}}\\[1mm]
-\frac{1}{R_{23}} & \frac{1}{R_{23}} + \frac{1}{R_{35}} & 0 & -\frac{1}{R_{35}}\\[1mm]
-\frac{1}{R_{24}} & 0 & \frac{1}{R_{14}} + \frac{1}{R_{24}} + \frac{1}{R_{45}} & -\frac{1}{R_{45}}\\[1mm]
-\frac{1}{R_{25}} & -\frac{1}{R_{35}} & -\frac{1}{R_{45}} & \frac{1}{R_{25}} + \frac{1}{R_{35}} + \frac{1}{R_{45}} + \frac{1}{R_{56}}
\end{bmatrix}
\begin{bmatrix} U_2 \\ U_3 \\ U_4 \\ U_5 \end{bmatrix}
=
\begin{bmatrix} \frac{U}{R_{12}} \\ 0 \\ \frac{U}{R_{14}} \\ 0 \end{bmatrix}$$
All these properties are obvious except for the fact that A is regular.
Proof of (2.8.0.4): By Thm. 2.2.1.4 it suffices to show that the nullspace of A is trivial: Ax = 0 ⇒ x = 0. So we pick x ∈ R^n with Ax = 0 and denote by i ∈ {1,...,n} the index such that
$$|x_i| = \max\{|x_j|,\ j = 1,\dots,n\}\,.$$
$$\mathbf{A}\mathbf{x} = \mathbf{0} \;\Rightarrow\; x_i = -\sum_{j \neq i} \frac{a_{ij}}{a_{ii}}\,x_j \;\Rightarrow\; |x_i| \leq \sum_{j \neq i} \frac{|a_{ij}|}{|a_{ii}|}\,|x_j|\,. \qquad (2.8.0.5)$$
$$\sum_{j \neq i} \frac{|a_{ij}|}{|a_{ii}|} \leq 1\,. \qquad (2.8.0.6)$$
Hence, (2.8.0.6) combined with the above estimate (2.8.0.5), which tells us that the maximum is smaller than or equal to a mean, implies |x_j| = |x_i| for all j = 1, . . . , n. Finally, the sign condition a_kj ≤ 0 for k ≠ j enforces the same sign for all x_i. Thus we conclude, w.l.o.g., x₁ = x₂ = · · · = x_n. As
$$\exists\, i \in \{1,\dots,n\}:\quad \sum_{j=1}^{n} a_{ij} > 0 \quad\text{(strict inequality)}\,,$$
the i-th row of Ax = 0 then forces x₁ = · · · = x_n = 0. ✷
A regular, diagonally dominant with positive diagonal ⇒ A has an LU-factorization ⇕ Gaussian elimination feasible without pivoting (∗)
(∗): In fact, when we apply partial pivoting to a diagonally dominant matrix it will not trigger a single row permutation, because (2.3.3.5) will always be satisfied for j = k!
➣ We can dispense with pivoting without compromising stability.
It is clear that partial pivoting in the first step selects a₁₁ as the pivot element, cf. (2.3.3.5). Thus after the 1st step of elimination we obtain the modified entries
$$a_{ij}^{(1)} = a_{ij} - \frac{a_{i1}}{a_{11}}\,a_{1j}\,,\quad i,j = 2,\dots,n \quad\Rightarrow\quad a_{ii}^{(1)} > 0\,,$$
which we conclude from diagonal dominance. That also permits us to infer
$$\begin{aligned}
|a_{ii}^{(1)}| - \sum_{\substack{j=2\\ j\neq i}}^{n} |a_{ij}^{(1)}|
&= a_{ii} - \frac{a_{i1}}{a_{11}}a_{1i} - \sum_{\substack{j=2\\ j\neq i}}^{n} \Bigl|a_{ij} - \frac{a_{i1}}{a_{11}}a_{1j}\Bigr| \\
&\geq a_{ii} - \frac{|a_{i1}|\,|a_{1i}|}{a_{11}} - \sum_{\substack{j=2\\ j\neq i}}^{n} |a_{ij}| - \frac{|a_{i1}|}{a_{11}} \sum_{\substack{j=2\\ j\neq i}}^{n} |a_{1j}| \\
&\geq a_{ii} - \frac{|a_{i1}|\,|a_{1i}|}{a_{11}} - \sum_{\substack{j=2\\ j\neq i}}^{n} |a_{ij}| - |a_{i1}|\,\frac{a_{11} - |a_{1i}|}{a_{11}}
\;\geq\; a_{ii} - \sum_{\substack{j=1\\ j\neq i}}^{n} |a_{ij}| \;\geq\; 0\,.
\end{aligned}$$
A regular, diagonally dominant ⇒ partial pivoting according to (2.3.3.5) selects i-th row in i-th step. y
§2.8.0.10 (Gaussian elimination for symmetric positive definite (s.p.d.) matrices) The class of symmetric positive definite (s.p.d.) matrices has been defined in Def. 1.1.2.6. They permit stable Gaussian elimination without pivoting:
Every symmetric/Hermitian positive definite matrix (s.p.d. → Def. 1.1.2.6) possesses an LU-
decomposition (→ Section 2.3.2).
Equivalent to the assertion of the theorem is the assertion that for s.p.d. matrices Gaussian elimination is
feasible without pivoting.
In fact, this theorem is a corollary of Lemma 2.3.2.4, because all principal minors of an s.p.d. matrix are
s.p.d. themselves. However, we outline an alternative self-contained proof:
Proof. (of Thm. 2.8.0.11) We pursue a proof by induction with respect to the matrix size n. The assertion in the case n = 1 is obviously true.
For the induction argument n − 1 ⇒ n consider the first step of the elimination algorithm:
$$\mathbf{A} = \begin{bmatrix} a_{11} & \mathbf{b}^{\top}\\ \mathbf{b} & \tilde{\mathbf{A}} \end{bmatrix} \;\xrightarrow[\text{Gaussian elimination}]{\text{1. step}}\; \begin{bmatrix} a_{11} & \mathbf{b}^{\top}\\ \mathbf{0} & \tilde{\mathbf{A}} - \frac{\mathbf{b}\mathbf{b}^{\top}}{a_{11}} \end{bmatrix}\,.$$
This step poses no problem, because all diagonal entries of an s.p.d. matrix are strictly positive.
The induction requires us to show that the right-lower block Ã − bb⊤/a₁₁ ∈ R^{n−1,n−1} is also symmetric and positive definite. Its symmetry is evident, but the demonstration of the s.p.d. property relies on a trick: as A is s.p.d. (→ Def. 1.1.2.6), for every y ∈ R^{n−1} \ {0}
$$0 < \begin{bmatrix} -\frac{\mathbf{b}^{\top}\mathbf{y}}{a_{11}} \\ \mathbf{y} \end{bmatrix}^{\top} \begin{bmatrix} a_{11} & \mathbf{b}^{\top} \\ \mathbf{b} & \tilde{\mathbf{A}} \end{bmatrix} \begin{bmatrix} -\frac{\mathbf{b}^{\top}\mathbf{y}}{a_{11}} \\ \mathbf{y} \end{bmatrix} = \mathbf{y}^{\top}\Bigl(\tilde{\mathbf{A}} - \frac{\mathbf{b}\mathbf{b}^{\top}}{a_{11}}\Bigr)\mathbf{y}\,.$$
We conclude that Ã − bb⊤/a₁₁ is positive definite. Thus, according to the induction hypothesis, Gaussian elimination without pivoting can now be applied to that right-lower block.
✷
The proof can also be based on the identities
$$\begin{bmatrix} (\mathbf{A})_{1:n-1,1:n-1} & (\mathbf{A})_{1:n-1,n} \\ (\mathbf{A})_{n,1:n-1} & (\mathbf{A})_{n,n} \end{bmatrix} = \begin{bmatrix} \mathbf{L}_1 & \mathbf{0} \\ \mathbf{l}^{\top} & 1 \end{bmatrix}\begin{bmatrix} \mathbf{U}_1 & \mathbf{u} \\ \mathbf{0} & \gamma \end{bmatrix}\,, \qquad (2.7.5.8)$$
$$\Rightarrow\quad (\mathbf{A})_{1:n-1,1:n-1} = \mathbf{L}_1\mathbf{U}_1\,,\quad \mathbf{L}_1\mathbf{u} = (\mathbf{A})_{1:n-1,n}\,,\quad \mathbf{U}_1^{\top}\mathbf{l} = (\mathbf{A})_{n,1:n-1}^{\top}\,,\quad \mathbf{l}^{\top}\mathbf{u} + \gamma = (\mathbf{A})_{n,n}\,,$$
noticing that the principal minor (A)_{1:n−1,1:n−1} is also s.p.d. This allows a simple induction argument.
The next result gives a useful criterion for telling whether a given symmetric/Hermitian matrix is s.p.d.: a diagonally dominant Hermitian/symmetric matrix with non-negative diagonal entries is positive semi-definite.
Proof. For A = A^H diagonally dominant, use the inequality between the arithmetic and the geometric mean (AGM), ab ≤ ½(a² + b²):
$$\begin{aligned}
\mathbf{x}^{\mathsf H}\mathbf{A}\mathbf{x} &= \sum_{i=1}^{n} a_{ii}|x_i|^2 + \sum_{i\neq j} a_{ij}\bar{x}_i x_j \;\ge\; \sum_{i=1}^{n} a_{ii}|x_i|^2 - \sum_{i\neq j} |a_{ij}|\,|x_i|\,|x_j|\\
&\overset{\text{AGM}}{\ge} \sum_{i=1}^{n} a_{ii}|x_i|^2 - \tfrac12\sum_{i\neq j} |a_{ij}|\bigl(|x_i|^2 + |x_j|^2\bigr)\\
&\ge \tfrac12\sum_{i=1}^{n}\Bigl\{a_{ii}|x_i|^2 - \sum_{j\neq i}|a_{ij}|\,|x_i|^2\Bigr\} + \tfrac12\sum_{j=1}^{n}\Bigl\{a_{jj}|x_j|^2 - \sum_{i\neq j}|a_{ij}|\,|x_j|^2\Bigr\}\\
&\ge \sum_{i=1}^{n} |x_i|^2\Bigl(a_{ii} - \sum_{j\neq i}|a_{ij}|\Bigr) \;\ge\; 0\,.
\end{aligned}$$
Lemma 2.8.0.14. Cholesky decomposition for s.p.d. matrices → [Gut09, Sect. 3.4], [Han02,
Sect. II.5], [QSS00, Thm. 3.6]
For any s.p.d. A ∈ K n,n , n ∈ N, there is a unique upper triangular matrix R ∈ K n,n with rii > 0,
i = 1, . . . , n, such that A = RH R (Cholesky decomposition).
Proof. Thm. 2.8.0.11 and Lemma 2.3.2.4 ensure the existence of a unique LU-decomposition of A: A = LU, which we can rewrite as follows:
$$\mathbf{A} = \mathbf{L}\mathbf{D}\tilde{\mathbf{U}}\,,\qquad \mathbf{D} \mathrel{\hat{=}} \text{diagonal of }\mathbf{U}\,,\quad \tilde{\mathbf{U}} \mathrel{\hat{=}} \text{normalized upper triangular matrix} \;(\to \text{Def. 1.1.2.3})\,,$$
$$\mathbf{A} = \mathbf{A}^{\top} \;\Rightarrow\; \mathbf{U} = \mathbf{D}\mathbf{L}^{\top} \;\Rightarrow\; \mathbf{A} = \mathbf{L}\mathbf{D}\mathbf{L}^{\top}\,,$$
$$\mathbf{x}^{\top}\mathbf{A}\mathbf{x} > 0\ \ \forall \mathbf{x}\neq\mathbf{0} \;\Rightarrow\; \mathbf{y}^{\top}\mathbf{D}\mathbf{y} > 0\ \ \forall \mathbf{y}\neq\mathbf{0}\,.$$
➤ The diagonal matrix D has a positive diagonal and, hence, we can take its "square root" and choose R := √D L⊤.
✷
We find formulas analogous to (2.3.2.7):
$$\mathbf{R}^{\mathsf{H}}\mathbf{R} = \mathbf{A} \;\Rightarrow\; a_{ik} = \sum_{j=1}^{\min\{i,k\}} \bar{r}_{ji}\, r_{jk} = \begin{cases} \displaystyle\sum_{j=1}^{i-1} \bar{r}_{ji}\, r_{jk} + r_{ii}\, r_{ik}\,, & \text{if } i < k\,,\\[2mm] \displaystyle\sum_{j=1}^{i-1} |r_{ji}|^2 + r_{ii}^2\,, & \text{if } i = k\,. \end{cases} \qquad (2.8.0.15)$$
The asymptotic computational cost (number of elementary arithmetic operations) of computing the Cholesky decomposition of an n × n s.p.d. matrix is ⅙n³ + O(n²) for matrix size n → ∞.
This is "half the cost" of computing a general LU-factorization, cf. the code in § 2.3.2.6, but it does not mean "twice as fast" in a concrete implementation, because memory access patterns will have a crucial impact, see Rem. 1.4.1.5.
Gains of efficiency hardly justify the use of Cholesky decomposition in modern numerical algorithms.
Savings in memory compared to standard LU-factorization (only one factor R has to be stored) offer a
stronger reason to prefer the Cholesky decomposition. y
§2.8.0.18 (Cholesky-type decompositions in EIGEN) Hardly surprising, EIGEN provides library routines for the computation of the (generalized) Cholesky decomposition of a symmetric (positive definite) matrix. For dense or sparse matrices these are the methods (→ EIGEN documentation)
• LLT() for computing a genuine Cholesky decomposition,
• LDLT() for computing a factorization A = LDL⊤ with a normalized lower-triangular matrix L and a diagonal matrix D.
These methods are invoked like all other matrix decomposition methods, refer to § 2.5.0.8, where
solverType is to be replaced with either LLT or LDLT. Rem. 2.5.0.10 also applies. The LDLT-
decomposition can be attempted for any symmetric matrix, but need not exist. y
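A minimal sketch of how these methods are invoked for a dense s.p.d. matrix:

#include <Eigen/Dense>
#include <stdexcept>

Eigen::VectorXd choleskySolve(const Eigen::MatrixXd &A, const Eigen::VectorXd &b) {
  Eigen::LLT<Eigen::MatrixXd> llt(A);  // Cholesky decomposition A = R^H*R
  if (llt.info() != Eigen::Success) {
    throw std::runtime_error("matrix not (numerically) s.p.d.");
  }
  return llt.solve(b);  // forward/backward substitution, cost O(n^2)
}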
Gaussian elimination for s.p.d. matrices is numerically stable (→ Def. 1.5.5.19).
⇕
Gaussian elimination without pivoting is a numerically stable way to solve LSEs with s.p.d. system matrix.
Learning Outcomes
Principal take-home knowledge and skills from this chapter:
• A clear understanding of the algorithm of Gaussian elimination with and without pivoting (prerequisite
knowledge from linear algebra)
• Insight into the relationship between Gaussian elimination and LU-decomposition and the algorith-
mic relevance of LU-decomposition
• Awareness of the asymptotic complexity of dense Gaussian elimination, LU-decomposition, and
elimination for special matrices
• Familiarity with “sparse matrices”: notion, data structures, initialization, benefits
• Insight into the reduced computational complexity of the direct solution of sparse linear systems of
equations with special structural properties.
Bibliography
[AG11] Uri M. Ascher and Chen Greif. A first course in numerical methods. Vol. 7. Computational
Science & Engineering. Society for Industrial and Applied Mathematics (SIAM), Philadelphia,
PA, 2011, pp. xxii+552. DOI: 10.1137/1.9780898719987 (cit. on pp. 136, 144, 205).
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 145, 152, 153, 156, 157, 199).
[GGK14] W. Gander, M.J. Gander, and F. Kwok. Scientific Computing. Vol. 11. Texts in Computational
Science and Engineering. Heidelberg: Springer, 2014 (cit. on p. 132).
[GV89] G.H. Golub and C.F. Van Loan. Matrix computations. 2nd. Baltimore, London: Johns Hopkins University Press, 1989 (cit. on p. 159).
[Gut09] M.H. Gutknecht. Lineare Algebra. Lecture Notes. SAM, ETH Zürich, 2009 (cit. on pp. 130, 136,
137, 143–147, 209).
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 143, 144, 156,
209).
[Hig02] N.J. Higham. Accuracy and Stability of Numerical Algorithms. 2nd ed. Philadelphia, PA: SIAM,
2002 (cit. on p. 159).
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
pp. 126, 130, 136, 137, 139, 143–145, 151, 153, 155).
[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Math-
ematics. New York: Springer, 2000 (cit. on pp. 127, 130, 132, 136, 143, 146, 200, 207, 209).
[SST06] A. Sankar, D.A. Spielman, and S.-H. Teng. “Smoothed analysis of the condition numbers and
growth factors of matrices”. In: SIAM J. Matrix Anal. Appl. 28.2 (2006), pp. 446–476 (cit. on
p. 162).
[SG04] O. Schenk and K. Gärtner. “Solving Unsymmetric Sparse Systems of Linear Equations with
PARDISO”. In: J. Future Generation Computer Systems 20.3 (2004), pp. 475–487 (cit. on
p. 192).
[ST96] D.A. Spielman and Shang-Hua Teng. “Spectral partitioning works: planar graphs and finite el-
ement meshes”. In: Foundations of Computer Science, 1996. Proceedings., 37th Annual Sym-
posium on. Oct. 1996, pp. 96–105. DOI: 10.1109/SFCS.1996.548468 (cit. on p. 161).
[TB97] L.N. Trefethen and D. Bau. Numerical Linear Algebra. Philadelphia, PA: SIAM, 1997 (cit. on
pp. 159, 161).
Chapter 3
Direct Methods for Linear Least Squares Problems
In this chapter we study numerical methods for overdetermined (OD) linear systems of equations, that is, linear systems with a "tall" rectangular system matrix:
$$\mathbf{x} \in \mathbb{R}^n:\quad \text{“}\mathbf{A}\mathbf{x} = \mathbf{b}\text{”}\,,\qquad \mathbf{b} \in \mathbb{R}^m\,,\ \mathbf{A} \in \mathbb{R}^{m,n}\,,\ m \ge n\,. \qquad (3.0.0.1)$$
We point out that, in contrast to Chapter 1 and Chapter 2, we will restrict ourselves to real linear systems in this chapter.
Note that the quotation marks in (3.0.0.1) indicate that this is not a well-defined problem in the sense of § 1.5.5.1; Ax = b does not define a mapping (A, b) ↦ x, because
• such a vector x ∈ R^n may not exist,
• and, even if it exists, it may not be unique.
Therefore, we first have to establish a crisp concept of what we mean by a "solution" of (3.0.0.1).
Contents
3.0.1 Overdetermined Linear Systems of Equations: Examples . . . . . . . . . . . 214
3.1 Least Squares Solution Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
3.1.1 Least Squares Solutions: Definition . . . . . . . . . . . . . . . . . . . . . . . . 218
3.1.2 Normal Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
3.1.3 Moore-Penrose Pseudoinverse . . . . . . . . . . . . . . . . . . . . . . . . . . 227
3.1.4 Sensitivity of Least Squares Problems . . . . . . . . . . . . . . . . . . . . . . 229
3.2 Normal Equation Methods [DR08, Sect. 4.2], [Han02, Ch. 11] . . . . . . . . . . . . 230
3.3 Orthogonal Transformation Methods [DR08, Sect. 4.4.2] . . . . . . . . . . . . . . . 234
3.3.1 Transformation Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
3.3.2 Orthogonal/Unitary Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
3.3.3 QR-Decomposition [Han02, Sect. 13], [Gut09, Sect. 7.3] . . . . . . . . . . . . 236
3.3.4 QR-Based Solver for Linear Least Squares Problems . . . . . . . . . . . . . . 252
3.3.5 Modification Techniques for QR-Decomposition . . . . . . . . . . . . . . . . 257
Video tutorial for Section 3.0.1 "Overdetermined Linear Systems of Equations: Examples":
(12 minutes) Download link, tablet notes
You may think that overdetermined linear systems of equations are exotic, but this is not true. Rather they
are very common in mathematical models.
EXAMPLE 3.0.1.1 (Linear parameter estimation in 1D) From first principles it is known that two physical quantities x ∈ R and y ∈ R (e.g., pressure and density of an ideal gas) are related by a linear relationship y = αx + β with unknown coefficients α, β ∈ R. Plugging in measured values (x_i, y_i), i = 1, . . . , m, yields the overdetermined linear system (3.0.1.3) for α and β.
In practice inevitable ("random") measurement errors will affect the y_i's, push the vector b out of the range/image R(A) of A (→ Def. 2.2.1.2), and thwart the solvability of (3.0.1.3). Assuming independent and randomly distributed measurement errors in the y_i, for m > 2 the probability that a solution [α, β]^⊤ exists is actually zero, see Rem. 3.1.0.2. y
EXAMPLE 3.0.1.4 (Linear regression: parameter estimation for a linear model) Ex. 3.0.1.1 can be generalized to higher dimensions:
Known: without measurement errors the data would satisfy an affine linear relationship y = a⊤x + β, for some a ∈ R^n, β ∈ R.
Plugging in the measured quantities gives y_i = a⊤x_i + β, i = 1, . . . , m, a linear system of equations of
the form
$$\begin{bmatrix} \mathbf{x}_1^{\top} & 1 \\ \vdots & \vdots \\ \mathbf{x}_m^{\top} & 1 \end{bmatrix}\begin{bmatrix} \mathbf{a} \\ \beta \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix} \;\leftrightarrow\; \mathbf{A}\mathbf{x} = \mathbf{b}\,,\quad \mathbf{A} \in \mathbb{R}^{m,n+1}\,,\ \mathbf{b} \in \mathbb{R}^m\,,\ \mathbf{x} \in \mathbb{R}^{n+1}\,, \qquad (3.0.1.5)$$
which is an overdetermined LSE, in case m > n + 1. y
EXAMPLE 3.0.1.6 (Measuring the angles of a triangle [NS02, Sect. 5.1]) We measure the angles of a planar triangle and obtain α̃, β̃, γ̃ (in radians). In the case of perfect measurements the true angles α, β, γ would satisfy
$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix}\begin{bmatrix} \alpha \\ \beta \\ \gamma \end{bmatrix} = \begin{bmatrix} \tilde{\alpha} \\ \tilde{\beta} \\ \tilde{\gamma} \\ \pi \end{bmatrix}\,. \qquad (3.0.1.7)$$
Measurement errors will inevitably make the measured angles fail to add up to π, so that (3.0.1.7) will not have a solution [α, β, γ]^⊤.
Then, why should we add this last equation? This is suggested by a tenet of data science that reads “You
cannot afford not to use any piece of information available”. It turns out that solving (3.0.1.7) “in a suitable
way” as discussed below in Section 3.1.1 enhances cancellation of measurement errors and gives better
estimates for the angles. We will not discuss this here and refer to statistics for an explanation.
In a numerical experiment the exact angles are α = π/2, β = π/3, γ = π/6. Synthetic "measurement errors" are introduced by adding a normally distributed random perturbation with mean 0 and standard deviation π/50 to the exact values of the angles, yielding α̃, β̃, and γ̃. For 100 "measurements" we compute the variance of the raw angles and that of the estimates obtained by solving (3.0.1.7) in least squares sense (→ Section 3.1.1). These variances are plotted for many different "runs".
Fig. 71: variance of the angle estimates vs. variance of the raw measurements (100 angle measurements, perturbation standard deviation π/50).
We observe that in most runs the variance of the estimates from (3.0.1.7) are smaller than those of the
raw data. y
EXAMPLE 3.0.1.8 (Angles in a triangulation)
Gauss had already developed the foundations of his method in 1795, at the age of 18. It was based on an idea of Pierre-Simon Laplace to sum up the absolute values of the errors in such a way that the errors add up to zero. Gauss instead took the squares of the errors and could drop this artificial additional requirement on the errors.
Gauss then used the method intensively in his survey of the Kingdom of Hannover by triangulation. In 1821 and 1823 the two-part treatise appeared, followed in 1826 by a supplement, Theoria combinationis observationum erroribus minimis obnoxiae (theory of the combination of observations subject to the smallest errors), in which Gauss was able to justify why his method was so successful compared to the others: the method of least squares is optimal in a broad sense, that is, better than other methods.
We now extend Ex. 3.0.1.6 to planar triangulations, for which measured values for all internal angles are
available. We obtain an overdetermined system of equations by combining the following linear relations:
1. each angle is supposed to be equal to its measured value,
2. the sum of interior angles is π for every triangle,
3. the sum of the angles at an interior node is 2π .
If the planar triangulation has N0 interior vertices and M cells, then we end up with 4M + N0 equations
for the 3M unknown angles. y
EXAMPLE 3.0.1.9 ((Relative) point locations from distances [GGK14, Sect. 6.1]) Consider n points located on the real axis at unknown locations x_i ∈ R, i = 1, . . . , n. At least we know that x_i < x_{i+1}, i = 1, . . . , n − 1.
We measure the m := \binom{n}{2} = ½ n(n − 1) pairwise distances d_ij := |x_i − x_j|, i, j ∈ {1, . . . , n}, i ≠ j.
They are connected to the point positions by the overdetermined linear system of equations
$$x_i - x_j = d_{ij}\,,\ 1 \le j < i \le n \quad\leftrightarrow\quad
\underbrace{\begin{bmatrix}
-1 & 1 & 0 & \cdots & \cdots & 0\\
-1 & 0 & 1 & & & \vdots\\
\vdots & \vdots & & \ddots & & \vdots\\
-1 & 0 & \cdots & \cdots & 0 & 1\\
0 & -1 & 1 & & & 0\\
\vdots & & \ddots & \ddots & & \vdots\\
0 & \cdots & \cdots & 0 & -1 & 1
\end{bmatrix}}_{=\mathbf{A}}
\begin{bmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{bmatrix}
=
\underbrace{\begin{bmatrix} d_{12}\\ d_{13}\\ \vdots\\ d_{1n}\\ d_{23}\\ \vdots\\ d_{n-1,n} \end{bmatrix}}_{=\mathbf{b}} \qquad (3.0.1.10)$$
Note that we can never expect a unique solution for x ∈ R^n, because adding a multiple of [1, 1, . . . , 1]^⊤ to any solution will again yield a solution: A has a non-trivial kernel, N(A) = span{[1, 1, . . . , 1]^⊤}.
Non-uniqueness can be cured by setting x1 := 0, thus removing one component of x.
If the measurements were perfect, we could then find x2 , . . . , xn from di−1,i , i = 2, . . . , n by solving a
standard (square) linear system of equations. However, as in Ex. 3.0.1.6, using much more information
through the overdetermined system (3.0.1.10) helps curb measurement errors. y
Review question(s) 3.0.1.11 (Overdetermined Linear Systems of Equations: Examples)
(Q3.0.1.11.A) The mass of three different items is measured in all possible combinations. Find the overdetermined linear system of equations (LSE) that has to be "solved" to obtain estimates for the three masses. What would be the size of the corresponding overdetermined LSE for m ∈ N masses?
Video tutorial for Section 3.1.1 "Least Squares Solutions": (9 minutes) Download link,
tablet notes
Remark 3.1.0.2 (Consistent right hand side vectors are highly improbable) If R(A) ≠ R^m, then
“almost all” perturbations of b (e.g., due to measurement errors) will destroy b ∈ R(A), because R(A)
is a “set of measure zero” in R^m. y
For given A ∈ R^{m,n}, b ∈ R^m the vector x ∈ R^n is a least squares solution of the linear system of
equations Ax = b, if
$$x \in \operatorname*{argmin}_{y\in\mathbb{R}^n}\|Ay-b\|_2^2 ,$$
that is,
$$\|Ax-b\|_2^2 = \min_{y\in\mathbb{R}^n}\|Ay-b\|_2^2 = \min_{y_1,\dots,y_n\in\mathbb{R}}\sum_{i=1}^{m}\Big(\sum_{j=1}^{n}(A)_{i,j}\,y_j - (b)_i\Big)^2 .$$
➨ A least squares solution is any vector x that minimizes the Euclidean norm of the residual r =
b − Ax, see Def. 2.4.0.1.
We write lsq(A, b) for the set of least squares solutions of the linear system of equations Ax = b,
A ∈ R^{m,n}, b ∈ R^m:
$$\operatorname{lsq}(A,b) := \{\, x \in \mathbb{R}^n : x \text{ is a least squares solution of } Ax = b \,\} \subset \mathbb{R}^n .$$
§3.1.1.3 (Least squares solutions and “true” solutions of LSEs) The concept of least squares solutions
is a genuine generalization of what is regarded as a solution of a linear system of equations in linear algebra:
Clearly, for a square linear system of equations with a regular system matrix the least squares solution
agrees with the “true” solution: lsq(A, b) = {A^{-1}b}.
Also for A ∈ R^{m,n}, m > n, the set lsq(A, b) contains only “true” solutions of Ax = b if b ∈ R(A),
because in this case the smallest residual is 0. y
Known: without measurement errors data would satisfy affine linear relationship
y = a⊤ x + β , (3.1.1.6)
Proof. The function F : R^n → R, F(x) := ‖b − Ax‖₂², is continuous, bounded from below by 0, and
F(x) → ∞ for ‖x‖ → ∞. Hence, there must be an x* ∈ R^n for which it attains its minimum.
✷
§3.1.1.10 (Least squares solution as maximum-likelihood estimator → [DR08, Sect. 4.5]) Extending
the considerations of Ex. 3.0.1.4, a generic linear parameter estimation problem seeks to determine the
unknown parameter vector x ∈ R^n from the linear relationship Ax = y, where A ∈ R^{m,n} is known and
y ∈ R^m is accessible through measurements.
Unfortunately, y is affected by measurement errors. Thus we model it as a random vector y = y(ω),
ω ∈ Ω, Ω the set of outcomes of a probability space.
The measurement errors in different components of y are supposed to be unbiased (expectation = 0), inde-
pendent, and identically normally distributed with variance σ², σ > 0, which means
$$L(x;y) = \prod_{\ell=1}^{m}\exp\Big(-\tfrac12\Big(\frac{y_\ell - (Ax)_\ell}{\sigma}\Big)^2\Big) = \exp\Big(-\frac{1}{2\sigma^2}\|y-Ax\|_2^2\Big) . \qquad (3.1.1.13)$$
The last identity follows from exp( x ) exp(y) = exp( x + y). This probability density function y 7→ L(x; y)
is called the likelihood of y, and the notation emphasizes its dependence on the parameters x.
Assume that we are given a measurement (realization/sample) b of y. The maximum likelihood principle
then suggests that we choose the parameters so that the probability density of y becomes maximal at b:
$$x^* \in \mathbb{R}^n \text{ such that } x^* = \operatorname*{argmax}_{x\in\mathbb{R}^n} L(x;b) = \operatorname*{argmax}_{x\in\mathbb{R}^n}\exp\Big(-\frac{1}{2\sigma^2}\|b-Ax\|_2^2\Big) .$$
Since exp is strictly increasing, maximizing L(·; b) is equivalent to minimizing ‖b − Ax‖₂²: the maximum
likelihood estimator x* is exactly a least squares solution of Ax = b.
y
Review question(s) 3.1.1.14 (Least squares solution: Definition)
(Q3.1.1.14.A) Describe A ∈ R2,2 and b ∈ R2 so that lsq(A, b) contains more than a single vector.
(Q3.1.1.14.B) What is lsq(A, 0) for A ∈ R m,n ?
(Q3.1.1.14.C) Given a matrix B ∈ R m,n , a vector c ∈ R m , and λ > 0, define
Video tutorial for Section 3.1.2 "Normal Equations": (16 minutes) Download link, tablet notes
Appealing to the geometric intuition gleaned from Fig. 74 we infer the orthogonality of b − Ax, x a least
squares solution of the overdetermined linear system of equations Ax = b, to all columns of A, that is,
A⊤(b − Ax) = 0.
Surprisingly, we have found a square linear system of equations satisfied by the least squares solution.
The next theorem gives the formal statement of this discovery. It also completely characterizes lsq(A, b)
and reveals a way to compute this set.
The vector x ∈ R n is a least squares solution (→ Def. 3.1.1.1) of the linear system of equations
Ax = b, A ∈ R m,n , b ∈ R m , if and only if it solves the normal equations (NEQ)
A⊤ Ax = A⊤ b . (3.1.2.2)
Note that the normal equations (3.1.2.2) are an n × n square linear system of equations with a symmetric
positive semi-definite coefficient matrix:
" # " # " #
A⊤ A x = A⊤ b,
" #" # " #
⇔ A⊤ A x = A⊤ b.
$$\frac{d\varphi_d}{d\tau}\Big|_{\tau=0} = 2\,d^\top A^\top(Ax-b) = 0 .$$
Since this holds for any vector d ≠ 0, we conclude (set d equal to all the Euclidean unit vectors in R^n)
$$A^\top(Ax-b) = 0 ,$$
➋: Let x be a solution of the normal equations. Then we find by tedious but straightforward computations
Since this holds for any y ∈ R^n, x must be a global minimizer of y ↦ ‖Ay − b‖₂!
✷
EXAMPLE 3.1.2.4 (Normal equations for some examples from Section 3.0.1) Given A and b it takes
only elementary linear algebra operations to form the normal equations
A⊤ Ax = A⊤ b . (3.1.2.2)
As above, using elementary identities for the Euclidean inner product on R^m, J can be recast as
$$J(y) = y^\top A^\top A y - 2 b^\top A y + b^\top b = \sum_{i=1}^{n}\sum_{j=1}^{n}(A^\top A)_{ij}\,y_i y_j - 2\sum_{i=1}^{m}\sum_{j=1}^{n} b_i (A)_{ij}\,y_j + \sum_{i=1}^{m} b_i^2 ,\qquad y = [y_1,\dots,y_n]^\top \in \mathbb{R}^n .$$
This formula for the gradient of J can easily be confirmed by computing the partial derivatives ∂J/∂y_i from the
above explicit formula. Observe that (3.1.2.7) is equivalent to the normal equations (3.1.2.2). y
§3.1.2.8 (The linear least squares problem (→ § 1.5.5.1)) Thm. 3.1.2.1 together with Thm. 3.1.1.9
already confirms that the normal equations will always have a solution and that lsq(A, b) is a subspace
of R n parallel to N (A⊤ A). The next theorem gives even more detailed information.
$$V^\perp := \{\, x \in \mathbb{K}^k : x^H y = 0\ \ \forall\, y \in V \,\} .$$
$$z \in N(A^\top A) \Leftrightarrow A^\top A z = 0 \;\Rightarrow\; z^\top A^\top A z = \|Az\|_2^2 = 0 \Leftrightarrow Az = 0 ,$$
$$Az = 0 \;\Rightarrow\; A^\top A z = 0 \Leftrightarrow z \in N(A^\top A) .$$
If m ≥ n and N(A) = {0}, then the linear system of equations Ax = b, A ∈ R^{m,n}, b ∈ R^m, has
a unique least squares solution (→ Def. 3.1.1.1)
$$x = (A^\top A)^{-1} A^\top b , \qquad (3.1.2.14)$$
Remark 3.1.2.15 (Full-rank condition (→ Def. 2.2.1.3)) For a matrix A ∈ R^{m,n} with m ≥ n the following
equivalence holds:
N (A) = {0} ⇐⇒ rank(A) = n . (3.1.2.16)
Hence the assumption N (A) = {0} of Cor. 3.1.2.13 is also called a full-rank condition (FRC), because
the rank of A is maximal. y
EXAMPLE 3.1.2.17 (Meaning of full-rank condition for linear models) We revisit the parameter esti-
mation problem for a linear model.
• For Ex. 3.0.1.1 and A ∈ R^{m,2} given in (3.0.1.3) it is easy to see
$$\operatorname{rank}\begin{bmatrix} x_1 & 1\\ x_2 & 1\\ \vdots & \vdots\\ x_m & 1\end{bmatrix} = 2 \;\Longleftrightarrow\; \exists\, i, j \in \{1,\dots,m\}:\ x_i \neq x_j ,$$
that is, the manifest condition that not all points (x_i, y_i) have the same x-coordinate.
• In the case of Ex. 3.0.1.4 and the overdetermined m × (n + 1) linear system (3.0.1.5), we find
$$\operatorname{rank}\begin{bmatrix} x_1^\top & 1\\ \vdots & \vdots\\ x_m^\top & 1\end{bmatrix} = n+1 \;\Longleftrightarrow\; \text{there is a subset of } n+1 \text{ points } x_{i_1},\dots,x_{i_{n+1}} \text{ such that } \{x_{i_1},\dots,x_{i_{n+1}}\} \text{ spans a non-degenerate } n\text{-simplex.}$$
Remark 3.1.2.18 (Rank defect in linear least squares problems) In case the system matrix A ∈ R^{m,n},
m ≥ n, of an overdetermined linear system arising from a mathematical model fails to have full rank, this
hints at inadequate modelling:
In this case parameters are redundant, because different sets of parameters yield the same output quan-
tities: the parameters are not “observable”. y
Remark 3.1.2.19 (Hesse matrix of least squares functional) For the least squares functional
and its explicit form as polynomial in the vector components y j we find the Hessian (→ Def. 8.5.1.18,
[Str09, Satz 7.5.3]) of J :
" #n
∂2 J
H J (y) = (y) = 2A⊤ A . (3.1.2.20)
∂yi ∂y j
i,k=1
Thm. 3.1.2.9 implies that A⊤ A is positive definite (→ Def. 1.1.2.6) if and only if N (A) = {0}.
Therefore, by [Str09, Satz 7.5.3], under the full-rank condition J has a positive definite Hessian everywhere,
and a minimum at every stationary point of its gradient, that is, at every solution of the normal equations.
y
Remark 3.1.2.21 (Convex least squares functional) Another result from analysis tells us that real-valued
C1 -functions on R n whose Hessian has positive eigenvalues uniformly bounded away from zero are strictly
convex. Hence, if A has full rank, the least squares functional J from (3.1.2.6) is a strictly convex function.
y
Now we are in a position to state precisely what we mean by solving an overdetermined (m ≥ n!) linear
system of equations Ax = b, A ∈ R m,n , b ∈ R m , provided that A has full (maximal) rank, cf. (3.1.2.16).
✎ A sloppy notation for the minimization problem (3.1.2.22) is ‖Ax − b‖₂ → min. y
Review question(s) 3.1.2.23 (Normal equations)
(Q3.1.2.23.A) Compute the system matrix and the right-hand side vector for the normal equations for 1D
linear regression, which led to the overdetermined linear system of equations
$$\begin{bmatrix} x_1 & 1\\ x_2 & 1\\ \vdots & \vdots\\ x_m & 1\end{bmatrix}\begin{bmatrix} \alpha\\ \beta\end{bmatrix} = \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_m\end{bmatrix} ,$$
(Q3.1.2.23.B) Let {v1 , . . . , vk } ⊂ R n , k < n, be a basis of a subspace V ⊂ R n . Give a formula for the
point x ∈ V with smallest Euclidean distance from a given point p ∈ R n . Why is the basis property of
{v1 , . . . , vk } ⊂ R n important?
(Q3.1.2.23.C) Characterize the set of least squares solutions lsq(A, b), if A ∈ R m,n , m ≥ n, has or-
thonormal columns and b ∈ R m is an arbitrary vector.
(Q3.1.2.23.D) Let A ∈ R m,n , m ≥ n, have full rank: rank(A) = n. Show that the mapping
P : R m → R m , P ( y ) : = A ( A ⊤ A ) −1 A ⊤ y , y ∈ Rm ,
Hint. Permuting the rows of an LSE amounts to left-multiplication of the system matrix and the right-
hand-side vector with a permutation matrix.
△
As we have seen in Ex. 3.0.1.9, there can be many least squares solutions of Ax = b in case N(A) ≠
{0}. We can impose another condition to single out a unique element of lsq(A, b):
➨ The generalized solution is the least squares solution with minimal norm.
§3.1.3.3 (Reduced normal equations) Elementary geometry teaches that the minimal norm element of
an affine subspace L (a plane) in Euclidean space is the orthogonal projection of 0 onto L.
(Fig. 78, visualization: the set lsq(A, b) is an affine subspace of R^n parallel to N(A); its minimal-norm
element x† is the unique point of lsq(A, b) lying in the subspace N(A)^⊥ orthogonal to lsq(A, b).)
Since the space of least squares solutions of Ax = b is an affine subspace parallel to N(A),
the generalized solution x† of Ax = b according to Def. 3.1.3.1 is contained in N(A)^⊥. Therefore, given
a basis {v1, …, vk} ⊂ R^n of N(A)^⊥, k := dim N(A)^⊥ = n − dim N(A), we can find y ∈ R^k such
that x† = Vy, where V := [v1, …, vk] ∈ R^{n,k}.
Plugging this representation into the normal equations and multiplying with V⊤ yields the reduced normal
equations
V⊤ A⊤ AV y = V⊤ A⊤ b (3.1.3.5)
" #
V⊤ A⊤ y =
A
V
V⊤
A⊤ b .
The very construction of V ensures N (AV) = {0} so that, by Thm. 3.1.2.9 the k × k linear system of
equations (3.1.3.5) has a unique solution. The next theorem summarizes our insights:
✎ notation: A† ∈ R^{n,m} ≙ pseudoinverse of A ∈ R^{m,n}
Note that the Moore-Penrose pseudoinverse does not depend on the choice of V. y
Armed with the concept of generalized solution and the knowledge about its existence and uniqueness we
can state the most general linear least squares problem:
given: A ∈ R^{m,n}, m, n ∈ N, b ∈ R^m,
find: x† ∈ R^n such that
(i) ‖Ax† − b‖₂ = min{ ‖Ay − b‖₂ : y ∈ R^n }, (3.1.3.7)
(ii) ‖x†‖₂ is minimal under the condition (i).
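The reduced normal equations (3.1.3.5) translate directly into code. The following sketch is our own illustration (not one of the lecture's GITLAB listings); it assumes that a matrix V whose columns form a basis of N(A)^⊥ is already available and computes the generalized solution x† = Vy:

#include <Eigen/Dense>

// Generalized (minimal-norm least squares) solution x† of Ax = b via the
// reduced normal equations (3.1.3.5): V^T A^T A V y = V^T A^T b, x† = V y.
// Assumption: the columns of V span N(A)^⊥ (k = number of columns of V).
Eigen::VectorXd lsqGeneralizedSolution(const Eigen::MatrixXd &A,
                                       const Eigen::MatrixXd &V,
                                       const Eigen::VectorXd &b) {
  const Eigen::MatrixXd AV = A * V;   // m x k matrix with trivial kernel
  // (AV)^T (AV) is s.p.d., hence Cholesky factorization can be used
  const Eigen::VectorXd y =
      (AV.transpose() * AV).llt().solve(AV.transpose() * b);
  return V * y;                        // x† ∈ R^n
}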
(Q3.1.3.8.B) Given A ∈ R m,n and a basis {v1 , . . . , vk }, k ≤ n, of the orthogonal complement N (A)⊥ ,
show that the Moore-Penrose pseudoinverse
Recall Section 2.2.2, where we discussed the sensitivity of solutions of square linear systems, that is,
the impact of perturbations in the problem data on the result. Now we study how (small) changes in A
and b affect the unique (→ Cor. 3.1.2.13) least squares solution x of Ax = b in the case of A with full
rank (⇔ N(A) = {0}).
Note: If the matrix A ∈ R m,n , m ≥ n, has full rank, then there is a c > 0 such that A + ∆A still has
full rank for all ∆A ∈ R m,n with k∆Ak2 < c. Hence, “sufficiently small” perturbations will not destroy the
full-rank property of A. This is a generalization of the Perturbation Lemma 2.2.2.5.
For square linear systems the condition number of the system matrix (→ Def. 2.2.2.7) provided the key
gauge of sensitivity. To express the sensitivity of linear least squares problems we also generalize this
concept:
For a square regular matrix this agrees with its condition number according to Def. 2.2.2.7, which follows
For m ≥ n, A ∈ R^{m,n}, rank(A) = n, let x ∈ R^n be the solution of the least squares problem
‖Ax − b‖ → min and x̂ the solution of the perturbed least squares problem ‖(A + ∆A)x̂ − b‖ → min.
Then
$$\frac{\|x - \hat{x}\|_2}{\|x\|_2} \;\le\; \Big(2\,\operatorname{cond}_2(A) + \operatorname{cond}_2^2(A)\,\frac{\|r\|_2}{\|A\|_2\,\|x\|_2}\Big)\cdot\frac{\|\Delta A\|_2}{\|A\|_2} ,$$
where r = b − Ax is the residual. This means:
if ‖r‖₂ ≪ 1 ➤ condition of the least squares problem ≈ cond₂(A),
if ‖r‖₂ “large” ➤ condition of the least squares problem ≈ cond₂²(A).
For instance, in a linear parameter estimation problem (→ Ex. 3.0.1.4) a small residual will be the conse-
quence of small measurement errors.
3.2 Normal Equation Methods [DR08, Sect. 4.2], [Han02, Ch. 11]
Video tutorial for Section 3.2 "Normal Equation Methods": (12 minutes) Download link,
tablet notes
Definition 1.1.2.6. Symmetric positive definite (s.p.d.) matrices → [DR08, Def. 3.31],
[QSS00, Def. 1.22]
M = M^H and ∀x ∈ K^n: x^H M x > 0 ⇔ x ≠ 0 .
C++ code 3.2.0.1: Solving a linear least squares problem via normal equations ➺ GITLAB
//! Solving the overdetermined linear system of equations
//! Ax = b by solving the normal equations (3.1.2.2)
//! The least squares solution is returned by value
VectorXd normeqsolve(const MatrixXd &A, const VectorXd &b) {
  if (b.size() != A.rows()) {
    throw runtime_error("Dimension mismatch");
  }
  // Use Cholesky factorization for the s.p.d. system matrix, § 2.8.0.13
  VectorXd x = (A.transpose() * A).llt().solve(A.transpose() * b);
  return x;
}
By Thm. 2.8.0.11, for the s.p.d. matrix A⊤ A Gaussian elimination remains stable even without pivot-
ing. This is taken into account by requesting the Cholesky decomposition of A⊤ A by calling the method
llt().
§3.2.0.2 (Asymptotic complexity of normal equation method) The problem size parameters for the
linear least squares problem (3.1.2.22) are the matrix dimensions m, n ∈ N, where n small & fixed,
n ≪ m, is common.
In Section 1.4.2 and Thm. 2.5.0.2 we discussed the asymptotic complexity of the operations involved in
steps ➊–➌ of the normal equation method:
step ➊ (form A⊤A): cost O(mn²),
step ➋ (form A⊤b): cost O(nm),
step ➌ (solve the n × n system): cost O(n³),
which adds up to a total cost of O(n²m + n³) for m, n → ∞.
Note that for small fixed n, n ≪ m, m → ∞ the computational effort scales linearly with m. y
EXAMPLE 3.2.0.4 (Roundoff effects in normal equations → [DR08, Ex. 4.12]) In this example we
witness loss of information in the computation of AH A.
$$A = \begin{bmatrix} 1 & 1\\ \delta & 0\\ 0 & \delta\end{bmatrix} \quad\Rightarrow\quad A^\top A = \begin{bmatrix} 1+\delta^2 & 1\\ 1 & 1+\delta^2\end{bmatrix} .$$
Exp. 1.5.3.14: If δ ≈ √EPS, then 1 + δ² = 1 in M (set of machine numbers, see Def. 1.5.2.4).
Hence the computed A⊤A will fail to be regular, though rank(A) = 2 and cond₂(A) ≈ 1/√EPS.
Output:
1 Rank of A     : 2
2 Rank of A^T*A : 1
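The quoted output can be reproduced with a few lines of E IGEN code; the following snippet is our own sketch of such an experiment (the reported ranks rely on rank-revealing QR decompositions and E IGEN's default rank threshold):

#include <Eigen/Dense>
#include <cmath>
#include <iostream>
#include <limits>

int main() {
  const double delta = std::sqrt(std::numeric_limits<double>::epsilon());
  Eigen::MatrixXd A(3, 2);
  A << 1, 1,
       delta, 0,
       0, delta;
  const Eigen::MatrixXd ATA = A.transpose() * A;  // 1 + delta^2 rounds to ~1
  // rank() of a rank-revealing (column-pivoted) QR decomposition
  std::cout << "Rank of A     : " << A.colPivHouseholderQr().rank() << '\n'
            << "Rank of A^T*A : " << ATA.colPivHouseholderQr().rank() << '\n';
  return 0;
}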
Remark 3.2.0.6 (Loss of sparsity when forming normal equations) Another reason not to compute
A^H A when both m, n are large:
A sparse ⇏ A⊤A sparse .
A remedy is to avoid forming A⊤A altogether: introducing the residual q := Ax − b as an extra unknown
turns the normal equations
A⊤Ax = A⊤b , (3.1.2.2)
into the equivalent extended (augmented) linear system
$$\begin{bmatrix} -I & A\\ A^\top & 0\end{bmatrix}\begin{bmatrix} q\\ x\end{bmatrix} = \begin{bmatrix} b\\ 0\end{bmatrix} . \qquad (3.2.0.8)$$
The benefit of using (3.2.0.8) instead of the standard normal equations (3.1.2.2) is that sparsity is pre-
served. However, the conditioning of the system matrix in (3.2.0.8) is not better than that of A⊤ A.
A more general substitution q := α^{-1}(Ax − b) with α > 0 may even improve the conditioning for a
suitably chosen parameter α > 0:
$$A^\top A x = A^\top b \quad\Longleftrightarrow\quad B_\alpha\begin{bmatrix} q\\ x\end{bmatrix} := \begin{bmatrix} -\alpha I & A\\ A^\top & 0\end{bmatrix}\begin{bmatrix} q\\ x\end{bmatrix} = \begin{bmatrix} b\\ 0\end{bmatrix} . \qquad (3.2.0.9)$$
For m, n ≫ 1, A sparse, both (3.2.0.8) and (3.2.0.9) lead to large sparse linear systems of equations,
amenable to sparse direct elimination techniques, see Section 2.7.4. y
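For sparse A the extended system (3.2.0.9) can be assembled and solved with E IGEN's sparse facilities. The following sketch is our own illustration; the function name, the choice of SparseLU as the sparse direct solver, and the parameter α are assumptions on our part:

#include <Eigen/Dense>
#include <Eigen/Sparse>
#include <vector>

// Solve the extended normal equations (3.2.0.9) for sparse A:
// [ -alpha*I  A ] [q]   [b]
// [   A^T     0 ] [x] = [0]
Eigen::VectorXd lsqExtendedNormalEquations(const Eigen::SparseMatrix<double> &A,
                                           const Eigen::VectorXd &b, double alpha) {
  const int m = A.rows(), n = A.cols();
  std::vector<Eigen::Triplet<double>> trp;
  trp.reserve(m + 2 * A.nonZeros());
  for (int i = 0; i < m; ++i) trp.emplace_back(i, i, -alpha);  // -alpha*I block
  for (int k = 0; k < A.outerSize(); ++k)
    for (Eigen::SparseMatrix<double>::InnerIterator it(A, k); it; ++it) {
      trp.emplace_back(it.row(), m + it.col(), it.value());    // A block
      trp.emplace_back(m + it.col(), it.row(), it.value());    // A^T block
    }
  Eigen::SparseMatrix<double> B(m + n, m + n);
  B.setFromTriplets(trp.begin(), trp.end());
  Eigen::VectorXd rhs = Eigen::VectorXd::Zero(m + n);
  rhs.head(m) = b;
  Eigen::SparseLU<Eigen::SparseMatrix<double>> solver(B);      // sparse direct solver
  const Eigen::VectorXd z = solver.solve(rhs);
  return z.tail(n);                                            // return x only
}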
In this example we explore empirically how the Euclidean condition number of the extended normal
equations (3.2.0.9) is influenced by the choice of α. Consider (3.2.0.8), (3.2.0.9) for
$$A = \begin{bmatrix} 1+\epsilon & 1\\ 1-\epsilon & 1\\ \epsilon & \epsilon\end{bmatrix}$$
in dependence on ε (here α = ε‖A‖₂/√2).
(Fig. 79: log–log plot of cond₂(A), cond₂(AᴴA), cond₂(B), and cond₂(B_α) versus ε ∈ [10⁻⁵, 1].)
y
Review question(s) 3.2.0.11 (Normal equation methods)
(Q3.2.0.11.A) We consider the overdetermined linear system of equations
Ax = b , A ∈ R^{m,n}, m ≥ n, b ∈ R^m . (3.2.0.12)
We augment it by another equation and get another overdetermined linear system of equations
$$\begin{bmatrix} A\\ v^\top\end{bmatrix}\tilde{x} = \begin{bmatrix} b\\ \beta\end{bmatrix} ,\quad v \in \mathbb{R}^n,\ \beta \in \mathbb{R} . \qquad (3.2.0.13)$$
Consider further how the overdetermined linear systems of equations
Ax = b , A ∈ R^{m,n}, m ≥ n, b ∈ R^m , (3.2.0.14)
(A + uv^⊤)x̃ = b , u ∈ R^m , v ∈ R^n , (3.2.0.15)
are related.
△
Video tutorial for Section 3.3 "Orthogonal Transformation Methods": (10 minutes)
Download link, tablet notes
§3.3.1.1 (Generalizing the policy underlying Gaussian elimination) Recall the rationale behind Gaus-
sian elimination (→ Section 2.3, Ex. 2.3.1.1)
➥ By row transformations convert the LSE Ax = b into an equivalent (in terms of the set of solutions)
LSE Ux = b̃, which is easier to solve because it has triangular form.
How to adapt this policy to linear least squares problem (3.1.2.22) ?
Two questions: ➊ What linear least squares problems are “easy to solve” ?
➋ How can we arrive at them by equivalent transformations of (3.1.2.22) ?
Here we call two overdetermined linear systems Ax = b and Ãx = b̃ equivalent in the sense of
(3.1.2.22), if both have the same set of least squares solutions: lsq(A, b) = lsq(Ã, b̃), see (3.1.1.2).
y
Answer to question ➊: Overdetermined linear systems whose coefficient matrix A is “upper triangular”,
i.e., A = [R; 0] with R ∈ R^{n,n} upper triangular, are easy to solve:
$$\left\|\begin{bmatrix} R\\ 0\end{bmatrix}\begin{bmatrix} x_1\\ \vdots\\ x_n\end{bmatrix} - \begin{bmatrix} b_1\\ \vdots\\ b_n\\ \vdots\\ b_m\end{bmatrix}\right\|_2 \to \min \quad\overset{(*)}{\Longrightarrow}\quad x = R^{-1}\begin{bmatrix} b_1\\ \vdots\\ b_n\end{bmatrix} \;\hat{=}\; \text{least squares solution.}$$
How can we draw the conclusion (∗)? Obviously, the components n + 1, . . . , m of the vector inside the
norm are fixed and do not depend on x. All we can do is to make the first components 1, . . . , n vanish, by
choosing a suitable x, see [DR08, Thm. 4.13]. Obviously, x = R−1 (b)1:n accomplishes this.
Note: since A has full rank n, the upper triangular part R ∈ R n,n of A is regular! y
Answer to question ➋:
Idea: If we have a (transformation) matrix T ∈ R^{m,m} that preserves the Euclidean norm, ‖Ty‖₂ = ‖y‖₂
for all y ∈ R^m, then ‖Ãx − b̃‖₂ = ‖T(Ax − b)‖₂ = ‖Ax − b‖₂ for all x ∈ R^n, so that the least squares
problems for Ax = b and Ãx = b̃ are equivalent, where Ã = TA and b̃ = Tb.
The next section will characterize the class of eligible transformation matrices T.
From Thm. 3.3.2.2 we immediately conclude that, if a matrix Q ∈ K n,n is unitary/orthogonal, then
(Q3.3.2.3.C) Based on the result of Question (Q3.3.2.3.A) find an orthogonal matrix Q ∈ R^{2,2} such that
$$Q\begin{bmatrix} 0 & a_{12}\\ 1 & a_{22}\end{bmatrix} = \begin{bmatrix} * & *\\ 0 & *\end{bmatrix} ,\qquad a_{12}, a_{22} \in \mathbb{R} .$$
Video tutorial for Section 3.3.3.1 "QR-Decomposition: Theory": (11 minutes) Download link,
tablet notes
The span property (1.5.1.2) can be made more explicit in terms of the existence of linear combinations
$$\begin{aligned} q_1 &= t_{11}a_1\\ q_2 &= t_{12}a_1 + t_{22}a_2\\ q_3 &= t_{13}a_1 + t_{23}a_2 + t_{33}a_3\\ &\;\;\vdots\\ q_n &= t_{1n}a_1 + t_{2n}a_2 + \dots + t_{nn}a_n \end{aligned}\qquad\Longleftrightarrow\qquad \exists\, T \in \mathbb{R}^{n,n} \text{ upper triangular}:\ Q = AT , \qquad (3.3.3.3)$$
where Q = [q1, …, qn] ∈ R^{m,n} (with orthonormal columns) and A = [a1, …, an] ∈ R^{m,n}. Note that,
thanks to the linear independence of {a1, …, ak} and {q1, …, qk}, each leading block T_k = (t_{ij})_{i,j=1}^{k} ∈ R^{k,k}
is regular (“non-existent” t_{ij} are set to zero, of course).
Recall from Lemma 1.3.1.9 that inverses of regular upper triangular matrices are upper triangular
again.
Thus, by (3.3.3.3), we have found an upper triangular R := T^{-1} ∈ R^{n,n} such that
A = QR (visualized as the m × n matrix A written as the product of the m × n matrix Q with orthonormal
columns and the n × n upper triangular factor R).
Next, “augmentation by zero”: add m − n zero rows at the bottom of R and complement the columns of Q to
an orthonormal basis of R^m, which yields an orthogonal matrix Q̃ ∈ R^{m,m}:
$$A = \tilde{Q}\begin{bmatrix} R\\ 0\end{bmatrix} \quad\Longleftrightarrow\quad \tilde{Q}^\top A = \begin{bmatrix} R\\ 0\end{bmatrix} .$$
y
Thus the algorithm of Gram-Schmidt orthonormalization “proves” the following theorem.
A = Q0 · R0 (“economical” QR-decomposition) ,
(ii) a unitary matrix Q ∈ K^{n,n} and a unique upper triangular R ∈ K^{n,k} with (R)_{i,i} > 0, i ∈
{1, …, k}, such that
A = QR , Q ∈ K^{n,n} , R ∈ K^{n,k} . (3.3.3.6)
Proof. We observe that R is regular, if A has full rank n. Since the regular upper triangular matrices form
a group under multiplication:
(Q3.3.3.8.C) What is the R-factor in the full QR-decomposition A = QR of a tensor product matrix A = uv⊤,
u ∈ R^m, v ∈ R^n, m, n ∈ N, m ≥ n?
Hint. rank(A) = rank(R).
(Q3.3.3.8.D) Explain why the full QR-decomposition/QR-factorization of A ∈ R m,n , m > n, cannot be
unique, even if we demand (R)ii > 0, i = 1, . . . , n.
△
Video tutorial for Section 3.3.3.2 & Section 3.3.3.4 "Computation of QR-Decomposition, QR-
Decomposition in E IGEN ": (32 minutes) Download link, tablet notes
In theory, Gram-Schmidt orthogonalization (GS) can be used to compute the QR-factorization of a matrix
A ∈ R m,n , m ≥ n, rank(A) = n. However, as we saw in Exp. 1.5.1.5, Gram-Schmidt orthogonalization
in the form of Code 1.5.1.3 is not a stable algorithm.
There is a stable way to compute QR-decompositions, based on the accumulation of orthogonal transfor-
mations.
The product of two orthogonal/unitary matrices of the same size is again orthogonal/unitary.
Idea: find simple orthogonal (row) transformations rendering certain matrix elements zero: multiplication
with a suitable Q, Q⊤ = Q^{-1}, from the left annihilates prescribed entries of the matrix.
Recall that this “annihilation of column entries” is the key operation in Gaussian forward elimination, where
it is achieved by means of non-unitary row transformations, see Sect. 2.3.2. Now we want to find a
counterpart of Gaussian elimination based on unitary row transformations, for the sake of numerical stability.
EXAMPLE 3.3.3.10 (“Annihilating” orthogonal transformations in 2D) In 2D there are two possible
orthogonal transformations that make the second component of a vector a ∈ R² vanish, which, in geometric
terms, amounts to mapping the vector onto the x₁-axis:
• a reflection at the line bisecting the angle between a and the x₁-axis (Fig. 80), and
• a rotation by the angle −φ enclosed by a and the x₁-axis (Fig. 81), with rotation matrix
$$Q = \begin{bmatrix} \cos\varphi & \sin\varphi\\ -\sin\varphi & \cos\varphi\end{bmatrix} .$$
Note that in each case we have two different length-preserving linear mappings at our disposal. This
flexibility will be important for curbing the impact of roundoff. y
Both reflections and rotations are actually used in library routines and both are discussed in the sequel:
§3.3.3.11 (Householder reflections → [GV13, Sect. 5.1.2]) The following so-called Householder
matrices (HHM) effect the reflection of a vector into a multiple of the first unit vector with the same length:
$$Q = H(v) := I - 2\,\frac{vv^\top}{v^\top v} \quad\text{with}\quad v = a \pm \|a\|_2\, e_1 , \qquad (3.3.3.12)$$
where e1 is the first Cartesian basis vector. Orthogonality of these matrices can be established by direct
computation.
Fig. 82 depicts a “geometric derivation” of Householder reflections mapping a → b, assuming
kak2 = kbk2 . We accomplish this by a reflection at the hyperplane with normal vector b − a.
$$b = a - (a - b) = a - \frac{v^\top v}{v^\top v}\,v = a - 2\,\frac{v^\top a}{v^\top v}\,v = a - 2\,\frac{vv^\top}{v^\top v}\,a = H(v)\,a ,$$
(Fig. 82: reflection of a onto b at the hyperplane with normal vector v = a − b.)
Hence, suitable successive Householder transformations determined by the leftmost column (“target col-
umn”) of shrinking bottom right matrix blocks can be used to achieve upper triangular form R. The following
series of figures visualizes the gradual annihilation of the lower triangular matrix part for a square matrix:
(Matrix diagrams: in each step the entries below the diagonal of the current target column are annihilated,
the zero lower-triangular block growing from left to right until upper triangular form is reached.)
$$Q_{n-1}\,Q_{n-2}\cdots Q_1\,A = R ,$$
QR-decomposition (QR-factorization) of A ∈ C^{n,n}: A = QR, with Q := Q_1^⊤ ⋯ Q_{n-1}^⊤ an orthogonal
matrix and R an upper triangular matrix.
y
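For illustration, here is a compact and deliberately naive E IGEN sketch of QR-decomposition by successive Householder reflections; it is our own demonstration code (the function name qrHouseholder is hypothetical) and it accumulates Q as a dense matrix, which Rem. 3.3.3.22 advises against in production code:

#include <Eigen/Dense>
#include <algorithm>
#include <utility>

// Naive QR-decomposition A = Q*R by successive Householder reflections
// (3.3.3.12); for demonstration only.
std::pair<Eigen::MatrixXd, Eigen::MatrixXd> qrHouseholder(Eigen::MatrixXd A) {
  const int m = A.rows(), n = A.cols();
  Eigen::MatrixXd Q = Eigen::MatrixXd::Identity(m, m);
  for (int k = 0; k < std::min(n, m - 1); ++k) {
    const Eigen::VectorXd a = A.col(k).tail(m - k);   // current target column
    Eigen::VectorXd v = a;
    // sign choice avoiding cancellation, cf. the discussion after (3.3.3.12)
    v(0) += (a(0) >= 0 ? 1.0 : -1.0) * a.norm();
    const double vtv = v.squaredNorm();
    if (vtv == 0.0) continue;                          // column already annihilated
    // apply H(v) = I - 2*v*v^T/(v^T v) to the trailing block of A
    const Eigen::RowVectorXd va = v.transpose() * A.bottomRightCorner(m - k, n - k);
    A.bottomRightCorner(m - k, n - k) -= (2.0 / vtv) * v * va;
    // accumulate Q <- Q * diag(I, H(v))
    const Eigen::VectorXd qv = Q.rightCols(m - k) * v;
    Q.rightCols(m - k) -= (2.0 / vtv) * qv * v.transpose();
  }
  return {Q, A};   // A has been overwritten with the upper triangular factor R
}

Calling qrHouseholder on a tall matrix returns Q ∈ R^{m,m} and an m × n matrix whose top n rows form R₀ of the economical QR-decomposition.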
Remark 3.3.3.13 (QR-decomposition of “fat” matrices) We can also apply successive Householder
transformations as outlined in § 3.3.3.11 to a matrix A ∈ R^{m,n} with m < n. If the first m columns of A are
linearly independent, we obtain another variant of the QR-decomposition:
A = QR , Q ∈ R^{m,m} orthogonal , R ∈ R^{m,n} upper triangular .
$$H(v) := I - 2\,\frac{vv^\top}{v^\top v} ,$$
v is normalized to unit length (division by ‖v‖₂²), and then a large absolute error might result.
Fortunately, two choices for v are possible in (3.3.3.12) and at most one can be affected by cancellation.
The right choice is
$$v = \begin{cases} a + \|a\|_2\, e_1 , & \text{if } a_1 > 0 ,\\ a - \|a\|_2\, e_1 , & \text{if } a_1 \le 0 . \end{cases}$$
See [Hig02, Sect. 19.1] and [GV13, Sect. 5.1.3] for a discussion. y
§3.3.3.15 (Givens rotations → [Han02, Sect. 14], [GV13, Sect. 5.1.8]) The 2D rotation displayed in
Fig. 81 can be embedded in an identity matrix. Thus, the following orthogonal transformation, a Givens
rotation, annihilates the k-th component of a vector a = [ a1 , . . . , an ]⊤ ∈ R n . Here γ stands for cos( ϕ)
and σ for sin( ϕ), ϕ the angle of rotation, see Fig. 81.
$$G_{1k}(a_1,a_k)\,a :=
\begin{bmatrix}
\gamma & \cdots & \sigma & \cdots & 0\\
\vdots & \ddots & \vdots & & \vdots\\
-\sigma & \cdots & \gamma & \cdots & 0\\
\vdots & & \vdots & \ddots & \vdots\\
0 & \cdots & 0 & \cdots & 1
\end{bmatrix}
\begin{bmatrix} a_1\\ \vdots\\ a_k\\ \vdots\\ a_n \end{bmatrix}
=
\begin{bmatrix} a_1^{(1)}\\ \vdots\\ 0\\ \vdots\\ a_n \end{bmatrix}
\quad\text{for}\quad
\gamma = \frac{a_1}{\sqrt{|a_1|^2+|a_k|^2}},\;\;
\sigma = \frac{a_k}{\sqrt{|a_1|^2+|a_k|^2}} . \qquad (3.3.3.16)$$
Orthogonality (→ Def. 6.3.1.2) of G1k ( a1 , ak ) is verified immediately. Again, we have two options for an
annihilating rotation, see Ex. 3.3.3.10. It will always be possible to choose one that avoids overflow [GV13,
Sect. 5.1.8], see Code 3.3.3.17 for details.
C++ code 3.3.3.17: Stable Givens rotation of a 2D vector, [GV13, Alg. 5.1.3] ➺ GITLAB
2  // plane (2D) Givens rotation avoiding cancellation
3  // Computes orthogonal G ∈ R^{2,2} with G^T a = [r, 0]^T =: x, r = ±‖a‖₂
4  void planerot(const Eigen::Vector2d& a, Eigen::Matrix2d& G,
5                Eigen::Vector2d& x) {
6    int sign{1};
7    const double anorm = a.norm();
8    if (anorm != 0.0) {
9      double s;  // s ↔ σ
10     double c;  // c ↔ γ
11     if (std::abs(a[1]) > std::abs(a[0])) {  // Avoid overflow
12       const double t = -a[0] / a[1];
13       s = 1.0 / std::sqrt(1.0 + t * t);
14       c = s * t;
15       sign = -1;
16     } else {
17       const double t = -a[1] / a[0];
18       c = 1.0 / std::sqrt(1.0 + t * t);
19       s = c * t;
20     }
21     G << c, s, -s, c;  // Form 2 × 2 Givens rotation matrix
22   } else {
23     G.setIdentity();
24   }
25   x << (sign * anorm), 0.0;
26 }
• Case |a₁| ≥ |a₀|: t = −a₀/a₁, s = 1/√(1+t²), c = s·t, and
$$\begin{bmatrix} c & s\\ -s & c\end{bmatrix}^\top\begin{bmatrix} a_0\\ a_1\end{bmatrix}
= \begin{bmatrix} ca_0 - sa_1\\ sa_0 + ca_1\end{bmatrix}
= \begin{bmatrix} sta_0 - sa_1\\ sa_0 + sta_1\end{bmatrix}
= \frac{1}{\sqrt{1+t^2}}\begin{bmatrix} ta_0 - a_1\\ a_0 + ta_1\end{bmatrix}
= \frac{|a_1|}{\|a\|_2}\begin{bmatrix} -\tfrac{a_0^2 + a_1^2}{a_1}\\ a_0 - a_0\end{bmatrix}
= \begin{bmatrix} -\operatorname{sgn}(a_1)\,\|a\|_2\\ 0\end{bmatrix} .$$
So far, we know how to annihilate a single component of a vector by means of a Givens rotation that targets
that component and some other (the first in (3.3.3.16)). However, for the sake of QR-decomposition we
aim to map all components to zero except for the first.
☞ This can be achieved by n − 1 successive Givens rotations, see also Code 3.3.3.19
$$\begin{bmatrix} a_1\\ a_2\\ a_3\\ \vdots\\ a_n \end{bmatrix}
\xrightarrow{G_{12}(a_1,a_2)}
\begin{bmatrix} a_1^{(1)}\\ 0\\ a_3\\ \vdots\\ a_n \end{bmatrix}
\xrightarrow{G_{13}(a_1^{(1)},a_3)}
\begin{bmatrix} a_1^{(2)}\\ 0\\ 0\\ a_4\\ \vdots\\ a_n \end{bmatrix}
\xrightarrow{G_{14}(a_1^{(2)},a_4)} \cdots
\xrightarrow{G_{1n}(a_1^{(n-2)},a_n)}
\begin{bmatrix} a_1^{(n-1)}\\ 0\\ \vdots\\ \vdots\\ 0 \end{bmatrix} . \qquad (3.3.3.18)$$
✎ Notation: G_{ij}(a_1, a_2) ≙ Givens rotation (3.3.3.16) modifying rows i and j of the matrix.
C++11 code 3.3.3.19: Rotating a vector onto the x₁-axis by successive Givens transformations
➺ GITLAB
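The body of this listing is not reproduced above. A minimal sketch of such a routine, built on planerot() from Code 3.3.3.17 and following (3.3.3.18), could look as follows (the name rotateOntoX1 is our own, hypothetical choice; only the vector is transformed, the rotations themselves are not returned):

#include <Eigen/Dense>

// Maps a ∈ R^n onto a multiple of the first unit vector by n-1 successive
// Givens rotations G_{1,k}, cf. (3.3.3.18); relies on planerot() (Code 3.3.3.17).
void rotateOntoX1(Eigen::VectorXd &a) {
  const int n = a.size();
  Eigen::Matrix2d G;
  Eigen::Vector2d atmp, xtmp;
  for (int k = 1; k < n; ++k) {     // 0-based index k corresponds to G_{1,k+1}
    atmp << a(0), a(k);             // the two components targeted by the rotation
    planerot(atmp, G, xtmp);        // 2D rotation annihilating the 2nd component
    a(0) = xtmp(0);                 // updated first component (= ±norm so far)
    a(k) = 0.0;                     // annihilated component
  }
}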
Armed with these compound Givens rotations we can proceed as in the case of Householder reflections
to accomplish the orthogonal transformation of a full-rank matrix to upper triangular form, see Code 3.3.3.20.
Remark 3.3.3.21 (Testing != 0.0 in Code 3.3.3.17) In light of the guideline “do not test floating point
numbers for exact equality” from Rem. 1.5.3.15, the test if (anorm != 0.0) in Line 8 looks inappropriate.
However, its sole purpose is to avoid division by zero and the code will work well even if anorm ≈ 0. y
Remark 3.3.3.22 (Storing orthogonal transformations) When doing successive orthogonal transforma-
tions as in the case of QR-decomposition by means of Householder reflections (→ § 3.3.3.11) or Givens
rotations (→ § 3.3.3.15) it would be prohibitively expensive to assemble and even multiply the transforma-
tion matrices!
The matrices for the orthogonal transformation are never built in codes!
The transformations are stored in a compressed format.
Therefore, we stress that Code 3.3.3.20 is meant for demonstration purposes only, because the construc-
tion of the Q-factor matrix would never be done in this way in a well-designed numerical code.
(Two figure panels, for the cases m > n and m < n, illustrate the compressed storage format.)
➋ In the case of Givens rotations, for a single rotation G_{i,j}(a₁, a₂) we need to store only the row indices (i, j)
and the rotation angle [Ste76], [GV13, Sect. 5.1.11]. The latter is subject to a particular encoding scheme:
$$\text{for } G = \begin{bmatrix} \gamma & \sigma\\ -\sigma & \gamma\end{bmatrix} \quad\Rightarrow\quad \text{store } \rho :=
\begin{cases} 1_M , & \text{if } \gamma = 0 ,\\ \tfrac12\operatorname{sign}(\gamma)\,\sigma , & \text{if } |\sigma| < |\gamma| ,\\ 2\operatorname{sign}(\sigma)/\gamma , & \text{if } |\sigma| \ge |\gamma| , \end{cases} \qquad (3.3.3.23)$$
which means
$$\begin{aligned} \rho = 1_M &\;\Rightarrow\; \gamma = 0 ,\ \sigma = 1 ,\\ |\rho| < 1 &\;\Rightarrow\; \sigma = 2\rho ,\ \gamma = \sqrt{1-\sigma^2} ,\\ |\rho| > 1 &\;\Rightarrow\; \gamma = 2/\rho ,\ \sigma = \sqrt{1-\gamma^2} . \end{aligned} \qquad (3.3.3.24)$$
Here 1_M alludes to the fact that the number 1.0 can be represented exactly in machine number systems.
Then store G_{ij}(a, b) as the triple (i, j, ρ). The parameter ρ forgets the sign of the matrix G_{ij}, so the signs of
the corresponding rows in the transformed matrix R have to be changed accordingly. The rationale behind
the above convention is to curb the impact of roundoff errors, because when we recover γ, σ by taking the
square root of a difference we never subtract two numbers of equal size; cancellation is avoided.
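The encoding (3.3.3.23) and decoding (3.3.3.24) rules can be written down directly in code; the following small helper functions are our own sketch (not taken from the lecture's GITLAB repository):

#include <cmath>

// Encode a Givens rotation [γ σ; -σ γ] into a single number ρ, cf. (3.3.3.23)
double givensEncode(double gamma, double sigma) {
  if (gamma == 0.0) return 1.0;                       // ρ = 1_M
  if (std::abs(sigma) < std::abs(gamma))
    return 0.5 * ((gamma > 0) ? sigma : -sigma);      // ρ = ½ sign(γ) σ
  return 2.0 * ((sigma > 0) ? 1.0 : -1.0) / gamma;    // ρ = 2 sign(σ)/γ
}

// Recover (γ, σ) from ρ, cf. (3.3.3.24); the sign of the rotation is "forgotten"
void givensDecode(double rho, double &gamma, double &sigma) {
  if (rho == 1.0) { gamma = 0.0; sigma = 1.0; }
  else if (std::abs(rho) < 1.0) { sigma = 2.0 * rho; gamma = std::sqrt(1.0 - sigma * sigma); }
  else { gamma = 2.0 / rho; sigma = std::sqrt(1.0 - gamma * gamma); }
}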
Note that the multiplication of a vector w ∈ R^m with a Householder matrix H(v) := I − 2vv⊤, v ∈ R^m,
‖v‖₂ = 1, takes only 2m operations, cf. Ex. 1.4.3.1.
Next, we examine the elementary matrix×vector operations involved in the orthogonal transformation of A into
an upper triangular matrix R ∈ R^{m,n}:
Step ➊: Householder matrix × n − 1 remaining matrix columns of size m, cost = 2m(n − 1),
Step ➋: Householder matrix × n − 2 remaining matrix columns of size m − 1, cost = 2(m − 1)(n − 2),
Step ➌: Householder matrix × n − 3 remaining matrix columns of size m − 2, cost = 2(m − 2)(n − 3),
⋮
We see that the combined number of entries of the shaded matrix blocks in the above figures is propor-
tional to the total work:
$$\operatorname{cost}(\text{R-factor of } A \text{ by Householder trf.}) = \sum_{k=1}^{n-1} 2(m-k+1)(n-k) = O(mn^2) \quad\text{for } m,n\to\infty . \qquad (3.3.3.26)$$
Remark 3.3.3.27 (QR-decomposition of banded matrices) The advantage of Givens rotations is their
selectivity, which can be exploited for banded matrices, see Section 2.7.5.
Specific case: orthogonal transformation of an n × n tridiagonal matrix to upper triangular form, that is,
the annihilation of the sub-diagonal, by means of successive Givens rotations:
(Matrix diagrams: the rotation G₁₂ annihilates the subdiagonal entry in column 1 and creates one new
non-zero entry, “fill-in” (→ Def. 2.7.4.3), in position (1, 3); the subsequent rotations G₂₃, …, G_{n−1,n} act
analogously, so that the resulting upper triangular factor has two non-zero superdiagonals.)
This is a manifestation of a more general result, see Def. 2.7.5.1 for notations:
Studying the algorithms sketched above for tridiagonal matrices, we find that a total of at most n · bw(A)
Givens rotations is required for computing the QR-decomposition. Each of them acts on O(bw(A)) non-
zero entries of the matrix, which leads to an asymptotic total computational effort of O(n · bw(A)²) for
n → ∞. y
Review question(s) 3.3.3.29 (Computation of QR-decompositions)
(Q3.3.3.29.A) Let A ∈ R n,n be “Z-shaped”
1. Give a sequence of Givens rotations that convert A into upper triangular form.
2. Think about an efficient way to deploy orthogonal transformation techniques for the efficient solu-
tion of a linear system of equations Ax = b, b ∈ R n .
(Q3.3.3.29.B) The matrix A ∈ R n,n , n ∈ N, is upper triangular except for a single non-zero entry in
position (n, 1):
Which sequence of Givens rotations (of minimal length) can be used to compute the QR-decomposition
of A?
(Q3.3.3.29.C) [Householder matrices] What is a Householder matrix and what are its properties
(regularity, orthogonality, symmetry, rank, kernel, range)?
△
In numerical linear algebra orthogonal transformation methods usually give rise to reliable algorithms,
thanks to the norm-preserving property of orthogonal transformations.
We are interested in the sensitivity of F, that is, the impact of relative errors in the data vector x on the
output vector y := F (x).
We study the output for a perturbed input vector:
$$\left.\begin{aligned} Qx &= y &&\Rightarrow\ \|x\|_2 = \|y\|_2\\ Q(x+\Delta x) &= y + \Delta y &&\Rightarrow\ Q\Delta x = \Delta y \;\Rightarrow\; \|\Delta y\|_2 = \|\Delta x\|_2\end{aligned}\right\}\quad\Longrightarrow\quad \frac{\|\Delta y\|_2}{\|y\|_2} = \frac{\|\Delta x\|_2}{\|x\|_2} .$$
We conclude, that unitary/orthogonal transformations do not cause any amplification of relative errors in
the data vectors.
Of course, this also applies to the “solution” of square linear systems with orthogonal coefficient matrix
Q ∈ R n,n , which, by Def. 6.3.1.2, boils down to multiplication of the right hand side vector with QH . y
However, general (non-orthogonal) row transformations T, like those used in Gaussian elimination, can lead
to a massive amplification of relative errors, which, by virtue of Ex. 2.2.2.1, can be linked to large condition
numbers of T.
This accounts for the fact that the computation of LU-decompositions by means of Gaussian elimination might
not be stable, see Ex. 2.4.0.6. y
Study in 2D: condition numbers of the elimination matrices of Gaussian elimination,
$$T(\mu) = \begin{bmatrix} 1 & 0\\ \mu & 1\end{bmatrix} .$$
(Plot: the condition number cond₂(T(µ)), which grows without bound as |µ| increases.)
The perfect conditioning of orthogonal transformation prevents the destructive build-up of roundoff errors.
E IGEN offers several classes dedicated to computing QR-type decompositions of matrices, for instance
HouseholderQR. Internally the QR-decomposition is stored in compressed format as explained in
Rem. 3.3.3.22. Its computation is triggered by the constructor.
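The E IGEN listing discussed in the next paragraph is not reproduced here. A minimal usage sketch along the same lines is given below; it is our own illustration (cf. the qr_decomp_eco() used later in Code 3.4.2.1), and the line numbers referred to in the following paragraph belong to the original listing, not to this sketch:

#include <Eigen/Dense>
#include <utility>

// Economical QR-decomposition A = Q0*R0 of A ∈ R^{m,n}, m >= n, via HouseholderQR;
// the Q-factor is kept internally as a product of Householder reflections.
std::pair<Eigen::MatrixXd, Eigen::MatrixXd> qr_decomp_eco(const Eigen::MatrixXd &A) {
  const int m = A.rows(), n = A.cols();
  Eigen::HouseholderQR<Eigen::MatrixXd> qr(A);   // decomposition done in the constructor
  // Multiplying with a "thin" identity extracts the first n columns of Q (= Q0)
  const Eigen::MatrixXd Q0 = qr.householderQ() * Eigen::MatrixXd::Identity(m, n);
  // The upper triangular part of the compressed storage provides R0
  const Eigen::MatrixXd R0 =
      qr.matrixQR().topRows(n).triangularView<Eigen::Upper>();
  return {Q0, R0};
}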
Note that the method householderQ returns the Q-factor in compressed format, refer to Rem. 3.3.3.22.
Assignment to a matrix will convert it into a (dense) matrix format, see Line 8; only then the actual com-
putation of the matrix entries is performed. It can also be multiplied with another matrix of suitable size,
which is used in Line 20 to extract the Q-factor Q0 ∈ R m,n of the economical QR-decomposition (3.3.3.1).
The matrix returned by the method matrixQR() gives access to a matrix storing the QR-factors in
compressed form. Its upper triangular part provides R, see Line 21.
§3.3.3.37 (Economical versus full QR-decomposition) The distinction of Thm. 3.3.3.4 between eco-
nomical and full QR-decompositions of a “tall” matrix A ∈ R m,n , m > n, becomes blurred on the algo-
rithmic level. If all we want is a representation of the Q-factor as a product of orthogonal transformations
as discussed in Rem. 3.3.3.22, exactly the same computations give us both types of QR-decompositions,
because, of course, the bottom zero block of R need not be stored.
The same computations yield both full and economical QR-decompositions with Q-factors in product
form.
This is clearly reflected in Code 3.4.2.1. Thus, in the derivation of algorithms we choose either type of
QR-decomposition, whichever is easier to understand. y
§3.3.3.38 (Cost of QR-decomposition in E IGEN) A close inspection of the algorithm for the computation
of QR-decompositions of A ∈ R m,n by successive Householder reflections (→ § 3.3.3.11) reveals, that n
transformations costing ∼ mn operations each are required.
  for (int j = 0; j < nruns; ++j) {
    // plain QR-factorization in the constructor
    t1.start(); HouseholderQR<MatrixXd> qr(A); t1.stop();
    // full decomposition
    t2.start(); std::pair<MatrixXd, MatrixXd> QR2 = qr_decomp_full(A); t2.stop();
    // economic decomposition
    t3.start(); std::pair<MatrixXd, MatrixXd> QR3 = qr_decomp_eco(A); t3.stop();
  }
  tms(i, 0) = n;
  tms(i, 1) = t1.min(); tms(i, 2) = t2.min(); tms(i, 3) = t3.min();
}  // end of excerpt: loop over matrix sizes
(Figure: measured runtimes [s] versus matrix size n; among the timed variants is the call to
qr_decomp_eco() from Code 3.4.2.1. Platform: ubuntu 14.04 LTS.)
The runtimes for the QR-factorization of A ∈ R^{n,n} behave like O(n² · n) = O(n³) for large n. y
Video tutorial for Section 3.3.4 "QR-Based Solver for Linear Least Squares Problems": (9
minutes) Download link, tablet notes
The QR-decomposition introduced in Section 3.3.3, Thm. 3.3.3.4, paves the way for the practical algo-
rithmic realization of the “equivalent orthonormal transformation to upper triangular form”-idea from Sec-
tion 3.3.1.
We consider the full-rank linear least squares problem Eq. (3.1.2.22): Given A ∈ R m,n , m ≥ n,
rank(A) = n,
$$\|Ax - b\|_2 \to \min \quad\Longleftrightarrow\quad \left\|\begin{bmatrix} R_0\\ 0\end{bmatrix}\begin{bmatrix} x_1\\ \vdots\\ x_n\end{bmatrix} - \begin{bmatrix} \tilde b_1\\ \vdots\\ \tilde b_m\end{bmatrix}\right\|_2 \to \min ,\qquad \tilde b := Q^\top b .$$
Hence the least squares solution and the residual are
$$x = R_0^{-1}\begin{bmatrix} \tilde b_1\\ \vdots\\ \tilde b_n\end{bmatrix} ,\qquad r = Q\begin{bmatrix} 0\\ \vdots\\ 0\\ \tilde b_{n+1}\\ \vdots\\ \tilde b_m\end{bmatrix} .$$
Note: by Thm. 3.3.2.2 the norm of the residual is readily available: $\|r\|_2 = \sqrt{\tilde b_{n+1}^2 + \dots + \tilde b_m^2}$.
C++-code 3.3.4.1: QR-based solver for full rank linear least squares problem (3.1.2.22) ➺ GITLAB
2  // Solution of linear least squares problem (3.1.2.22) by means of QR-decomposition
3  // Note: A ∈ R^{m,n} with m > n, rank(A) = n is assumed
4  // Least squares solution returned in x, residual norm as return value
5  double qrlsqsolve(const MatrixXd& A, const VectorXd& b,
6                    VectorXd& x) {
7    const unsigned m = A.rows();
8    const unsigned n = A.cols();
9
• The computational cost of this function when called for an m × n matrix is, asymptotically for m, n → ∞,
O(n²m).
• Line 10: We perform the QR-decomposition of the extended matrix [A, b] with b as rightmost col-
umn. Thus, the orthogonal transformations are automatically applied to b; the augmented matrix is
converted into [R, Q⊤ b], the data of the equivalent upper triangular linear least squares problem.
Thus, actually, no information about Q needs to be stored, if one is interested in the least squares
solution x only.
The idea is borrowed from Gaussian elimination, see Code 2.3.1.4, Line 9.
• Line 14: matrixQR() returns the compressed QR-factorization as a matrix, where the R-factor
R is contained in the upper triangular part, whose top n rows give R₀ from (3.3.3.1).
• Line 19: the components (b)_{n+2:m} of the vector b (treated as the rightmost column of the augmented
matrix) are annihilated when computing the QR-decomposition (by the final Householder reflection):
(Q⊤[A, b])_{n+2:m, n+1} = 0. Hence |(Q⊤[A, b])_{n+1, n+1}| = ‖(b̃)_{n+1:m}‖₂, which gives the norm of the
residual.
➤ A QR-based algorithm is implemented in the solve() method available for E IGEN’s QR-
decomposition, see Code 3.3.4.2.
C++ code 3.3.4.2: E IGEN’s built-in QR-based linear least squares solver ➺ GITLAB
// Solving a full-rank least squares problem ‖Ax − b‖₂ → min in E IGEN
double lsqsolve_eigen(const MatrixXd& A, const VectorXd& b,
                      VectorXd& x) {
  x = A.householderQr().solve(b);
  return ((A * x - b).norm());
}
Remark 3.3.4.3 (QR-based solution of linear systems of equations) Applying the QR-based algorithm
for full-rank linear least squares problems in the case m = n, that is, to a square linear system of equations
Ax = b with a regular coefficient matrix, will compute the solution x = A⁻¹b. In a sense, the QR-
decomposition offers an alternative to Gaussian elimination/LU-decomposition discussed in § 2.3.2.15.
The steps for solving a linear system of equations Ax = b by means of QR-decomposition are as follows:
① QR-decomposition A = QR, computational cost ⅔n³ + O(n²)
(about twice as expensive as LU-decomposition without pivoting),
② orthogonal transformation z = Q⊤b, computational cost 4n² + O(n)
(in the case of compact storage of reflections/rotations),
③ backward substitution, solve Rx = z, computational cost ½n(n + 1).
Benefit: we can utterly dispense with any kind of pivoting:
✌ Computing the generalized QR-decomposition A = QR by means of Householder reflections
or Givens rotations is numerically stable for any A ∈ C^{m,n}.
✌ For any regular system matrix an LSE can be solved by means of
QR-decomposition + orthogonal transformation + backward substitution
in a stable manner.
Drawback: QR-decomposition can hardly ever avoid massive fill-in (→ Def. 2.7.4.3), even in situations
where LU-factorization greatly benefits from Thm. 2.7.5.4. y
Remark 3.3.4.4 (QR-based solution of banded LSE) From Rem. 3.3.3.27 and Thm. 3.3.3.28 we know that
the particular situation in which QR-decomposition can avoid fill-in (→ Def. 2.7.4.3) is the case of banded
matrices, see Def. 2.7.5.1. For a banded n × n linear system of equations with small fixed bandwidth
bw(A) ≤ O(1) we incur an
➣ asymptotic computational effort: O(n) for n → ∞
The following code uses a QR-decomposition computed by means of selective Givens rotations (→
§ 3.3.3.15) to solve a tridiagonal linear system of equations Ax = b with
$$A = \begin{bmatrix} d_1 & c_1 & 0 & \cdots & 0\\ e_1 & d_2 & c_2 & & \vdots\\ 0 & e_2 & d_3 & c_3 & \\ \vdots & \ddots & \ddots & \ddots & c_{n-1}\\ 0 & \cdots & 0 & e_{n-1} & d_n\end{bmatrix} .$$
The matrix is passed in the form of three vectors e, c, d giving the entries in the non-zero bands.
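The lecture's GITLAB code for this remark is not reproduced here; the following is our own sketch of such a tridiagonal QR solver (assuming A is regular). Successive Givens rotations annihilate the subdiagonal; the resulting R has two superdiagonals and is solved by back substitution:

#include <Eigen/Dense>
#include <cmath>

// QR-based solver for a tridiagonal system: subdiagonal e (n-1), diagonal d (n),
// superdiagonal c (n-1), right-hand side b (n). Cost O(n).
Eigen::VectorXd tridiagQRSolve(const Eigen::VectorXd &e, const Eigen::VectorXd &d,
                               const Eigen::VectorXd &c, Eigen::VectorXd b) {
  const int n = d.size();
  Eigen::VectorXd r0 = d;                          // main diagonal of R (being built)
  Eigen::VectorXd r1 = c;                          // first superdiagonal of R
  Eigen::VectorXd r2 = Eigen::VectorXd::Zero(n);   // second superdiagonal (fill-in)
  for (int k = 0; k < n - 1; ++k) {
    // Givens rotation acting on rows k and k+1, annihilating e(k)
    const double rho = std::hypot(r0(k), e(k));
    const double gamma = r0(k) / rho, sigma = e(k) / rho;
    const double t1 = r1(k);                       // old entry (k, k+1)
    r0(k) = rho;
    r1(k) = gamma * t1 + sigma * d(k + 1);
    r0(k + 1) = -sigma * t1 + gamma * d(k + 1);
    if (k < n - 2) {                               // column k+2 exists
      r2(k) = sigma * c(k + 1);                    // fill-in in row k
      r1(k + 1) = gamma * c(k + 1);
    }
    const double bk = b(k);                        // rotate the right-hand side
    b(k) = gamma * bk + sigma * b(k + 1);
    b(k + 1) = -sigma * bk + gamma * b(k + 1);
  }
  // back substitution with the banded upper triangular factor R
  Eigen::VectorXd x(n);
  for (int i = n - 1; i >= 0; --i) {
    double s = b(i);
    if (i + 1 < n) s -= r1(i) * x(i + 1);
    if (i + 2 < n) s -= r2(i) * x(i + 2);
    x(i) = s / r0(i);
  }
  return x;
}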
EXAMPLE 3.3.4.6 (Stable solution of LSE by means of QR-decomposition) Aiming to confirm the
claim of superior stability of QR-based approaches (→ Rem. 3.3.4.3, § 3.3.3.30) we revisit Wilkinson’s
counterexample from Ex. 2.4.0.6 for which Gaussian elimination with partial pivoting does not yield an
acceptable solution.
$$(A)_{i,j} := \begin{cases} 1 & \text{for } i = j ,\\ 1 & \text{for } j = n ,\\ -1 & \text{for } i > j,\ j < n ,\\ 0 & \text{for } i < j,\ j < n . \end{cases}$$
(Fig. 85: relative residual norms of the computed solutions versus matrix size n (up to n = 1000), for
Gaussian elimination and for QR-decomposition.)
y
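The experiment behind Fig. 85 can be rerun with a short program; the following is our own sketch (random right-hand sides, so the exact numbers vary, and the set of matrix sizes is our choice):

#include <Eigen/Dense>
#include <iostream>

// Build the matrix of Ex. 3.3.4.6 and compare relative residual norms of the
// LU-based (partial pivoting) and QR-based solvers.
int main() {
  for (int n : {10, 100, 500, 1000}) {
    Eigen::MatrixXd A(n, n);
    for (int i = 0; i < n; ++i)
      for (int j = 0; j < n; ++j)
        A(i, j) = (i == j || j == n - 1) ? 1.0 : (i > j ? -1.0 : 0.0);
    const Eigen::VectorXd x_exact = Eigen::VectorXd::Random(n);
    const Eigen::VectorXd b = A * x_exact;
    const Eigen::VectorXd x_lu = A.partialPivLu().solve(b);   // Gaussian elimination
    const Eigen::VectorXd x_qr = A.householderQr().solve(b);  // QR-decomposition
    std::cout << "n = " << n
              << ", GE rel. residual: " << (A * x_lu - b).norm() / b.norm()
              << ", QR rel. residual: " << (A * x_qr - b).norm() / b.norm() << '\n';
  }
  return 0;
}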
Let us summarize the pros and cons of orthogonal transformation techniques for linear least squares
problems:
Use orthogonal transformation methods for least squares problems (3.1.3.7) whenever
A ∈ R^{m,n} is dense and n is small.
Use the normal equations in the expanded form (3.2.0.8)/(3.2.0.9) when A ∈ R^{m,n} is sparse (→
Notion 2.7.0.1) and m, n are big.
[A b] = Q̃ R̃ , Q̃ ∈ R^{m,m} orthogonal , R̃ ∈ R^{m,n+1} upper triangular ,
• How can you compute the unique least-squares solution x* ∈ R^n of Ax = b using Q̃ and R̃?
Video tutorial for Section 3.3.5 "Modification Techniques for QR-Decomposition": (25 minutes)
Download link, tablet notes
In § 2.6.0.12 we faced the task of solving a square linear system of equations Ãx = b efficiently, whose
coefficient matrix à was a (rank-1) perturbation of A, for which an LU-decomposition was available.
Lemma 2.6.0.21 showed a way to reuse the information contained in the LU-decomposition.
A similar task can be posed for the QR-decomposition: Assume that a QR-decomposition (→
Thm. 3.3.3.4) of a matrix A ∈ R^{m,n}, m ≥ n, has already been computed. However, now we have to
solve a full-rank linear least squares problem ‖Ãx − b‖₂ → min with à ∈ R^{m,n}, which is a “slight”
perturbation of A. If we aim to use orthogonalization techniques it would be desirable to compute a
QR-decomposition of à with recourse to the QR-decomposition of A.
Remark 3.3.5.1 (Economical vs. full QR-decomposition) We recall § 3.3.3.37: the precise type of
QR-decomposition, whether full or economical, does not matter, since all algorithms will store the Q-factors
as products of orthogonal transformations.
Thus, below we will select that type of QR-decomposition, which allows an easier derivation of an algo-
rithm, which will be the full QR-decomposition. y
For A ∈ R^{m,n}, m ≥ n, rank(A) = n, we consider the rank-1 modification, cf. Eq. (2.6.0.16),
$$A \;\longrightarrow\; \tilde{A} := A + uv^\top ,\quad u \in \mathbb{R}^m ,\ v \in \mathbb{R}^n . \qquad (3.3.5.2)$$
Remember from § 2.6.0.12, (2.6.0.13), (2.6.0.15) that changing a single entry, row, or column of A can be
achieved through special rank-1 perturbations.
Given a full QR-decomposition according to Thm. 3.3.3.4, A = QR = Q [R₀; O], Q ∈ R^{m,m} orthogonal
(stored in some implicit format as a product of orthogonal transformations, see Rem. 3.3.3.22), R ∈ R^{m,n}
and R₀ ∈ R^{n,n} upper triangular, the goal is to find an efficient algorithm that yields a QR-decomposition
of Ã: Ã = Q̃R̃, Q̃ ∈ R^{m,m} a product of orthogonal transformations, R̃ ∈ R^{m,n} upper triangular.
Step ➊: compute w = Q⊤ u ∈ R m .
➣ Computational effort = O(mn), if Q stored in suitable (compressed) format, cf. Rem. 3.3.3.22.
Step ➋: annihilate the components 2, …, m of w. This can be done by applying m − 1 Givens rotations,
to be employed in the following order:
$$w = \begin{bmatrix} *\\ *\\ \vdots\\ *\\ *\\ *\end{bmatrix} \xrightarrow{G_{m-1,m}} \begin{bmatrix} *\\ *\\ \vdots\\ *\\ *\\ 0\end{bmatrix} \xrightarrow{G_{m-2,m-1}} \begin{bmatrix} *\\ *\\ \vdots\\ *\\ 0\\ 0\end{bmatrix} \xrightarrow{G_{m-3,m-2}} \cdots \xrightarrow{G_{12}} \begin{bmatrix} *\\ 0\\ \vdots\\ 0\\ 0\\ 0\end{bmatrix}$$
Of course, these transformations also have to act on R ∈ R m,n and they will affect R by creating a single
non-zero subdiagonal by linearly combining pairs of adjacent rows from bottom to top:
(Matrix diagrams: the rotations G_{n,n+1}, G_{n−1,n}, …, G_{1,2} applied to R linearly combine pairs of
adjacent rows from bottom to top; each step creates one new non-zero entry directly below the diagonal,
so that the result, denoted by R₁, has a single non-zero subdiagonal.)
We see (R₁)_{i,j} = 0 if i > j + 1. It is a so-called upper Hessenberg matrix. This is also true of
R₁ + ‖w‖₂ e₁v⊤ ∈ R^{n,n}, because only the top row of the matrix e₁v⊤ is non-zero. Therefore, if Q₁ ∈
R^{n,n} collects all m − 1 orthogonal transformations used in Step ➋, then
Step ➌: Convert R₁ + ‖w‖₂ e₁v⊤ ∈ R^{n,n} into upper triangular form by n − 1 successive Givens rotations
(Matrix diagrams: successive Givens rotations G_{n−1,n}, G_{n,n+1}, … annihilate the non-zero subdiagonal
entries of the upper Hessenberg matrix one after the other; the result is an upper triangular matrix,
denoted by R̃.)
$$\big(G_{n,n+1}\cdots G_{23}\,G_{12}\big)\big(R_1 + \|w\|_2\, e_1 v^\top\big) = \tilde{R} \quad\text{(upper triangular!)} . \qquad (3.3.5.3)$$
$$\tilde{A} = A + uv^\top = \tilde{Q}\tilde{R} \quad\text{with}\quad \tilde{Q} = Q\,Q_1^\top\,G_{12}^\top\,G_{23}^\top\cdots G_{n-1,n}^\top\,G_{n,n+1}^\top .$$
On the level of matrix-vector arithmetic the following explanations are easier for the full QR-
decompositions, cf. Rem. 3.3.5.1.
As preparation we point out that left-multiplication of a matrix with another matrix can be understood as
forming multiple matrix×vector products:
$$A = QR \quad\Longleftrightarrow\quad Q^\top A = \big[\,Q^\top a_1, \dots, Q^\top a_n\,\big] = R =: \begin{bmatrix} R_0\\ O\end{bmatrix} ,\quad R_0 \in \mathbb{R}^{n,n} ,$$
that is, (Q⊤a_j)_ℓ = 0 for ℓ > j. We immediately infer
$$Q^\top\tilde{A} = \big[\,Q^\top a_1, \dots, Q^\top a_{k-1},\ Q^\top v,\ Q^\top a_k, \dots, Q^\top a_n\,\big] =: W \in \mathbb{R}^{m,n+1} ,$$
Step ➊: compute w = Q⊤ v ∈ R m .
This can be done by m − n − 1 Givens rotations targeting adjacent rows of W bottom → top:
$$W \;\xrightarrow{G_{m-1,m}}\; \cdots \;\xrightarrow{G_{n+1,n+2}}\; =: T .$$
(Matrix diagrams omitted: the rotations, acting on adjacent rows from the bottom up, annihilate the entries
of the inserted column below row n + 1.)
Writing Q₂⊤ := G_{n+1,n+2} ⋯ G_{m−1,m} ∈ R^{m,m} for the orthogonal matrix representing the product of
Givens rotations, we find
$$Q_2^\top Q^\top \tilde{A} = T .$$
We accomplish this by applying n + 1 − k successive Givens rotations from bottom to top in the following
fashion.
(Matrix diagrams: the n + 1 − k Givens rotations G_{n,n+1}, …, G_{k,k+1}, applied from bottom to top,
annihilate one by one the non-zero entries that violate the upper triangular structure of T.)
Again, the perspective of the full QR-decomposition is preferred for didactic reasons, cf. Rem. 3.3.5.1.
We are given a matrix A ∈ R m,n of which a full QR-decomposition (→ Thm. 3.3.3.4) A = QR, Q ∈
R m,m orthogonal, R ∈ R m,n upper triangular, is already available, maybe only in encoded form (→
Rem. 3.3.3.22).
We add another row to the matrix A in arbitrary position k ∈ {1, …, m} and obtain
$$A \in \mathbb{R}^{m,n} \;\mapsto\; \tilde{A} = \begin{bmatrix} (A)_{1,:}\\ \vdots\\ (A)_{k-1,:}\\ v^\top\\ (A)_{k,:}\\ \vdots\\ (A)_{m,:}\end{bmatrix} ,\quad\text{with given } v \in \mathbb{R}^n . \qquad (3.3.5.5)$$
Task: Find an algorithm for the efficient computation of the QR-decomposition Ã = Q̃R̃ of Ã from
(3.3.5.5), Q̃ ∈ R^{m+1,m+1} orthogonal (as a product of orthogonal transformations), R̃ ∈ K^{m+1,n}
upper triangular.
Step ①: Move new row to the bottom.
Employ a partial cyclic permutation of the rows of Ã:
$$P\tilde{A} = \begin{bmatrix} A\\ v^\top\end{bmatrix} ,\qquad \begin{bmatrix} Q^\top & 0\\ 0 & 1\end{bmatrix} P\tilde{A} = \begin{bmatrix} Q^\top & 0\\ 0 & 1\end{bmatrix}\begin{bmatrix} A\\ v^\top\end{bmatrix} = \begin{bmatrix} R\\ v^\top\end{bmatrix} =: T \in \mathbb{R}^{m+1,n} .$$
This step is a mere bookkeeping operation and does not involve any computations.
The product of orthogonal matrices is again orthogonal. Of course, we know that Q̃ is never
formed explicitly in an algorithm but is kept as a sequence of orthogonal transformations.
Review question(s) 3.3.5.7 (Modification techniques for QR-decompositions)
(Q3.3.5.7.A) Explain why, as far as the use of the QR-decomposition in numerical methods is concerned,
the distinction between full and economical versions does not matter.
Exception. You may look at the lecture notes to answer this question.
△
Video tutorial for Section 3.4.1 "Singular Value Decomposition: Definition and Theory": (13
minutes) Download link, tablet notes
Theorem 3.4.1.1. Singular value decomposition → [NS02, Thm. 9.6], [Gut09, Thm. 11.1]
For any A ∈ K m,n there are unitary/ orthogonal matrices U ∈ K m,m , V ∈ K n,n and a (generalized)
diagonal (∗) matrix Σ = diag(σ1 , . . . , σp ) ∈ R m,n , p := min{m, n}, σ1 ≥ σ2 ≥ · · · ≥ σp ≥ 0
such that
A = UΣVH .
iH h hi yH Ax yH AV e
H
H e A xV e = σ w
U AV = y U e H Ax U e = 0 B
e H AV = : A1 .
U
For the induction argument we have to show that w = 0. Since
2 2
σ σ 2 + wH w
A1 = = (σ2 + wH w)2 + kBwk22 ≥ (σ2 + wH w)2 ,
w 2
Bw 2
we conclude
σ 2
kA1 xk22 A1 ( w ) ( σ 2 + wH w )2
kA1 k22 = sup ≥ 2
≥ = σ 2 + wH w . (3.4.1.2)
06 = x ∈K n kxk22 σ
(w )
2
2
2
σ +w w H
We exploit that multiplication with orthogonal matrices either from right or left does not affect the Euclidean
matrix norm:
$$ \sigma^2 = \|A\|_2^2 = \left\| U^H A V \right\|_2^2 = \|A_1\|_2^2 \overset{(3.4.1.2)}{\ge} \sigma^2 + \|w\|_2^2 \quad\Rightarrow\quad w = 0 , \qquad\text{hence}\qquad A_1 = \begin{bmatrix} \sigma & 0 \\ 0 & B \end{bmatrix} . $$
Then apply the induction argument to B.
✷
The decomposition A = UΣVH of Thm. 3.4.1.1 is called singular value decomposition (SVD) of
A. The diagonal entries σi of Σ are the singular values of A. The columns of U/V are the left/right
singular vectors of A.
Next, we visualize the structure of the singular value decomposition of a matrix $A \in \mathbb{K}^{m,n}$.
[Visualization: block structure of the full SVD $A = U \Sigma V^H$.]
§3.4.1.4 (Economical singular value decomposition) As in the case of the QR-decomposition, compare (3.3.3.1), we can also drop the bottom zero rows of Σ and the corresponding columns of U in the case of m > n. Thus we end up with an “economical” singular value decomposition of $A \in \mathbb{K}^{m,n}$:
$$ A = U \Sigma V^H , \quad U \in \mathbb{K}^{m,n} ,\; \Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_n) \in \mathbb{R}^{n,n} ,\; V \in \mathbb{K}^{n,n} \qquad (m > n) , \tag{3.4.1.5} $$
with a true diagonal matrix Σ, whose diagonal contains the singular values of A.
[Visualization of the economical SVD for m > n: a “tall” factor U, square Σ and $V^H$.]
The economical SVD is also called thin SVD in literature [GV13, Sect. 2.3.4]. y
An alternative motivation and derivation of the SVD is based on diagonalizing the Hermitian matrices
AAH ∈ R m,m and AH A ∈ R n,n . The relationship is made explicit in the next lemma.
Lemma 3.4.1.6.
The squares σi2 of the non-zero singular values of A are the non-zero eigenvalues of AH A, AAH
with associated eigenvectors (V):,1 , . . . , (V):,p , (U):,1 , . . . , (U):,p , respectively.
Proof. $AA^H$ and $A^H A$ are similar (→ Lemma 9.1.0.6) to diagonal matrices with non-zero diagonal entries $\sigma_i^2$ ($\sigma_i \ne 0$).
✷
Remark 3.4.1.7 (SVD and additive rank-1 decomposition → [Gut09, Cor. 11.2], [NS02, Thm. 9.8])
Recall from linear algebra that rank-1 matrices coincide with tensor products of vectors:
because rank(A) = 1 means that $Ax = \mu(x)\, u$ for some $u \in \mathbb{K}^m$ and a linear form $x \in \mathbb{K}^n \mapsto \mu(x) \in \mathbb{K}$. By the Riesz representation theorem the latter can be written as $\mu(x) = v^H x$.
The singular value decomposition provides an additive decomposition into rank-1 matrices:
$$ A = U \Sigma V^H = \sum_{j=1}^{p} \sigma_j\, (U)_{:,j} (V)_{:,j}^H . \tag{3.4.1.9} $$
y
Remark 3.4.1.11 (Uniqueness of SVD)
The SVD from Def. 3.4.1.3 is not (necessarily) unique, but the singular values are.
Proof. Proof by contradiction: assume that A has two singular value decompositions
$$ A = U_1 \Sigma_1 V_1^H = U_2 \Sigma_2 V_2^H \quad\Rightarrow\quad U_1 \underbrace{\Sigma_1 \Sigma_1^H}_{=\operatorname{diag}(\sigma_1^2, \ldots, \sigma_m^2)} U_1^H = A A^H = U_2 \underbrace{\Sigma_2 \Sigma_2^H}_{=\operatorname{diag}(\sigma_1^2, \ldots, \sigma_m^2)} U_2^H . $$
The two diagonal matrices are similar, which implies that they have the same eigenvalues, which agree
with their diagonal entries. Since the latter are sorted, the diagonals must agree.
✷ y
§3.4.1.12 (SVD, nullspace, and image space) The SVD gives complete information about all crucial subspaces associated with a matrix:
Let $A = U\Sigma V^H$ be the SVD of $A \in \mathbb{K}^{m,n}$ according to Thm. 3.4.1.1. If, for some $1 \le r \le p := \min\{m,n\}$, the singular values of A satisfy
$$ \sigma_1 \ge \cdots \ge \sigma_r > \sigma_{r+1} = \cdots = \sigma_p = 0 , $$
then
$$ \operatorname{rank}(A) = r , \quad \mathcal{N}(A) = \operatorname{Span}\{(V)_{:,r+1}, \ldots, (V)_{:,n}\} , \quad \mathcal{R}(A) = \operatorname{Span}\{(U)_{:,1}, \ldots, (U)_{:,r}\} . \tag{3.4.1.14} $$
y
Review question(s) 3.4.1.15 (SVD: Definition and theory)
(Q3.4.1.15.A) If a square matrix A ∈ R n,n is given as A = QDQ⊤ with an orthogonal matrix Q ∈ R n,n
and a diagonal matrix D ∈ R n,n , then what is a singular value decomposition of A?
(Q3.4.1.15.B) What is a full singular value decomposition of A = uv⊤ , u ∈ R m , v ∈ R n ?
(Q3.4.1.15.C) Based on the SVD give a proof of the fundamental dimension theorem from linear algebra:
Here X ⊥ designates the orthogonal complement of a subspace X ⊂ K d with respect to the Euclidean
inner product:
X ⊥ : = { v ∈ K d : xH v = 0 ∀ x ∈ X } .
(Q3.4.1.15.E) Use the SVD to show that every regular square matrix A ∈ R n,n can be factorized as
(Q3.4.1.15.G) [Completion to regular matrix] Given m, n ∈ N, m < n, and a matrix $T \in \mathbb{R}^{n,m}$ with full rank m, sketch an algorithm that computes a matrix $X \in \mathbb{R}^{n,n-m}$ such that $[T\; X] \in \mathbb{R}^{n,n}$ is regular/invertible.
The following result can be a starting point:
Let $A = U\Sigma V^H$ be the SVD of $A \in \mathbb{K}^{m,n}$ according to Thm. 3.4.1.1. If, for some $1 \le r \le p := \min\{m,n\}$, the singular values of A satisfy
$$ \sigma_1 \ge \cdots \ge \sigma_r > \sigma_{r+1} = \cdots = \sigma_p = 0 , $$
then $\operatorname{rank}(A) = r$, $\mathcal{N}(A) = \operatorname{Span}\{(V)_{:,r+1}, \ldots, (V)_{:,n}\}$, and $\mathcal{R}(A) = \operatorname{Span}\{(U)_{:,1}, \ldots, (U)_{:,r}\}$.
Video tutorial for Section 3.4.2 "SVD in E IGEN ": (9 minutes) Download link, tablet notes
The EIGEN class JacobiSVD is constructed from a matrix data type, computes the SVD of its argument during construction, and offers the access methods matrixU(), singularValues(), and matrixV() to request the SVD-factors and singular values.
The second argument in the constructor of JacobiSVD determines whether the methods matrixU() and matrixV() return the factors of the full SVD of Def. 3.4.1.3 or of the economical (thin) SVD (3.4.1.5): Eigen::ComputeFull* selects the full versions, whereas Eigen::ComputeThin* picks the economical versions → documentation.
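A minimal usage sketch (not taken from the lecture material) illustrating these points; the matrix A and the choice of the thin SVD are arbitrary:

#include <Eigen/Dense>
#include <iostream>

int main() {
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(9, 4); // some m x n matrix, m > n
  // Thin SVD: U is m x n, V is n x n, singular values sorted decreasingly
  Eigen::JacobiSVD<Eigen::MatrixXd> svd(A, Eigen::ComputeThinU | Eigen::ComputeThinV);
  const Eigen::MatrixXd U = svd.matrixU();
  const Eigen::VectorXd sigma = svd.singularValues();
  const Eigen::MatrixXd V = svd.matrixV();
  // Reconstruction error should be of the order of machine precision
  std::cout << (A - U * sigma.asDiagonal() * V.transpose()).norm() << std::endl;
  return 0;
}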
Internally, the computation of the SVD is done by a sophisticated algorithm, for which key steps rely on
orthogonal/unitary transformations. Also there we reap the benefit of the exceptional stability brought
about by norm-preserving transformations → § 3.3.3.30.
§3.4.2.2 (Computational cost of computing the SVD) According to E IGEN’s documentation the SVD of
a general dense matrix involves the following asymptotic complexity:
EXAMPLE 3.4.2.3 (SVD-based computation of the rank of a matrix) Based on Lemma 3.4.1.13, the SVD is the main tool for the stable computation of the rank of a matrix (→ Def. 2.2.1.3).
However, theory as reflected in Lemma 3.4.1.13 entails identifying zero singular values, which must rely on a threshold condition in a numerical code, recall Rem. 1.5.3.15. Given the SVD $A = U\Sigma V^H$, $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_{\min\{m,n\}})$, of a matrix $A \in \mathbb{K}^{m,n}$, $A \ne 0$, and a tolerance tol > 0, we define the numerical rank
$$ r := \#\big\{ \sigma_i : |\sigma_i| \ge \text{tol} \cdot \max_j \{|\sigma_j|\} \big\} . \tag{3.4.2.4} $$
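A hedged sketch (not the original listing) of how (3.4.2.4) can be evaluated with EIGEN; the function name numericalRank is our choice:

#include <Eigen/Dense>

// Numerical rank (3.4.2.4); assumes A != 0
Eigen::Index numericalRank(const Eigen::MatrixXd &A, double tol) {
  const Eigen::JacobiSVD<Eigen::MatrixXd> svd(A); // singular values only
  const Eigen::VectorXd &sv = svd.singularValues(); // sorted, sv(0) is largest
  Eigen::Index r = 0;
  for (Eigen::Index i = 0; i < sv.size(); ++i) {
    if (sv(i) >= tol * sv(0)) { ++r; } // count singular values above threshold
  }
  return r;
}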
E IGEN offers an equivalent built-in method rank() for objects representing singular value decomposi-
tions:
EXAMPLE 3.4.2.7 (Computation of nullspace and image space of matrices) “Computing” a subspace
of R k amounts to making available a (stable) basis of that subspace, ideally an orthonormal basis.
Lemma 3.4.1.13 taught us how to glean orthonormal bases of N (A) and R(A) from the SVD of a matrix
A. This immediately gives a numerical method and its implementation is given in the next two codes.
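Since the original listings are not reproduced here, the following sketch indicates how orthonormal bases of $\mathcal{R}(A)$ and $\mathcal{N}(A)$ can be extracted from the SVD factors in EIGEN; the function name imageAndKernel and the default tolerance are our assumptions:

#include <Eigen/Dense>
#include <utility>

// Returns {ONB of R(A), ONB of N(A)} as columns of two matrices
std::pair<Eigen::MatrixXd, Eigen::MatrixXd>
imageAndKernel(const Eigen::MatrixXd &A, double tol = 1e-12) {
  Eigen::JacobiSVD<Eigen::MatrixXd> svd(A, Eigen::ComputeFullU | Eigen::ComputeFullV);
  const Eigen::VectorXd &sv = svd.singularValues();
  Eigen::Index r = 0; // numerical rank, cf. (3.4.2.4)
  if (sv.size() > 0 && sv(0) > 0.0) {
    while (r < sv.size() && sv(r) >= tol * sv(0)) { ++r; }
  }
  const Eigen::MatrixXd onbImage = svd.matrixU().leftCols(r);
  const Eigen::MatrixXd onbKernel = svd.matrixV().rightCols(A.cols() - r);
  return {onbImage, onbKernel};
}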
y
Review question(s) 3.4.2.10 (SVD in E IGEN)
(Q3.4.2.10.A) Please examine Code 3.4.2.9 and detect a potentially serious loss of efficiency. In which
situations will this have an impact?
△
Video tutorial for Section 3.4.3 "Solving General Least-Squares Problems by SVD": (14
minutes) Download link, tablet notes
In a similar fashion as explained for the QR-decomposition in Section 3.3.4, the singular value decomposition (SVD, → Def. 3.4.1.3) can be used to transform general linear least squares problems (3.1.3.7) into a simpler form. In the case of SVD-based orthogonal transformation methods this simpler form involves merely a diagonal matrix.
Here we consider the most general setting
In particular, we drop the assumption of full rank of A. This means that the minimum norm condition (ii) in
the definition (3.1.3.7) of a linear least squares problem may be required for singling out a unique solution.
We recall the (full) SVD of $A \in \mathbb{R}^{m,n}$:
$$ A = [\,U_1 \;\; U_2\,] \begin{bmatrix} \Sigma_r & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} V_1^\top \\ V_2^\top \end{bmatrix} , \qquad \begin{aligned} &U_1 \in \mathbb{R}^{m,r} ,\; U_2 \in \mathbb{R}^{m,m-r} ,\; \Sigma_r = \operatorname{diag}(\sigma_1, \ldots, \sigma_r) \in \mathbb{R}^{r,r} ,\\ &V_1 \in \mathbb{R}^{n,r} ,\; V_2 \in \mathbb{R}^{n,n-r} , \end{aligned} \tag{3.4.3.1} $$
where $r = \operatorname{rank}(A)$. Since $[\,U_1 \; U_2\,]$ is orthogonal,
$$ [\,U_1 \;\; U_2\,] \begin{bmatrix} U_1^\top \\ U_2^\top \end{bmatrix} = I , $$
and multiplication with an orthogonal matrix leaves the 2-norm unchanged:
$$ \| A x - b \|_2 = \left\| [\,U_1\;U_2\,] \begin{bmatrix} \Sigma_r & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} V_1^\top \\ V_2^\top \end{bmatrix} x - b \right\|_2 = \left\| \begin{bmatrix} \Sigma_r V_1^\top x \\ 0 \end{bmatrix} - \begin{bmatrix} U_1^\top b \\ U_2^\top b \end{bmatrix} \right\|_2 . \tag{3.4.3.2} $$
To fix a unique solution in the case r < n we appeal to the minimal norm condition in (3.1.3.7): appealing to the considerations of § 3.1.3.3, the solution x of (3.4.3.3) is unique up to contributions from $\mathcal{N}(A) = \mathcal{R}(V_2)$.
Since V is unitary, the minimal norm solution is obtained by setting contributions from R(V2 ) to zero,
which amounts to choosing x ∈ R(V1 ). This converts (3.4.3.3) into
Approach ➋: From Thm. 3.1.2.1 we know that the generalized least-squares solution x† of Ax = b solves
the normal equations (3.1.2.2), and in § 3.1.3.3 we saw that x† lies in the orthogonal complement of
N ( A ):
By Lemma 3.4.1.13 and using the notations from (3.4.3.1) together with the fact that the columns of V
form an orthonormal basis of R n :
Hence, we can write x† = V1 y for some y ∈ Rr . We plug this representation into the normal equations
and also multiply with V1⊤ , similar to what we did in § 3.1.3.3:
$$ V_1^\top V \Sigma^\top \underbrace{U^\top U}_{=I} \Sigma V^\top V_1\, y = V_1^\top V \Sigma^\top U^\top b . \tag{3.4.3.9} $$
Remark 3.4.3.15 (Pseudoinverse and SVD → [Han02, Ch. 12], [DR08, Sect. 4.7]) From Thm. 3.1.3.6
we could conclude a general formula for the Moore-Penrose pseudoinverse of any matrix A ∈ R m,n . Now,
the solution formula (3.4.3.5) directly yields a concrete incarnation of the pseudoinverse A+ .
If $A \in \mathbb{K}^{m,n}$ with rank(A) = r has the full singular value decomposition $A = U\Sigma V^H$ (→ Thm. 3.4.1.1) partitioned as in (3.4.3.1), then its Moore-Penrose pseudoinverse (→ Thm. 3.1.3.6) is given by $A^\dagger = V_1 \Sigma_r^{-1} U_1^H$.
y
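A sketch (ours, not from the lecture notes) of how the formula $A^\dagger = V_1 \Sigma_r^{-1} U_1^H$ translates into an EIGEN routine for the general (minimum-norm) least-squares solution; the function name lsqSvd and the truncation tolerance tol used to determine the numerical rank are assumptions:

#include <Eigen/Dense>

Eigen::VectorXd lsqSvd(const Eigen::MatrixXd &A, const Eigen::VectorXd &b,
                       double tol = 1e-12) {
  Eigen::JacobiSVD<Eigen::MatrixXd> svd(A, Eigen::ComputeThinU | Eigen::ComputeThinV);
  const Eigen::VectorXd &sv = svd.singularValues();
  Eigen::Index r = 0; // numerical rank, cf. (3.4.2.4)
  while (r < sv.size() && sv(r) > tol * sv(0)) { ++r; }
  const Eigen::MatrixXd U1 = svd.matrixU().leftCols(r); // U_1
  const Eigen::MatrixXd V1 = svd.matrixV().leftCols(r); // V_1
  // x = V_1 * Sigma_r^{-1} * U_1^T * b
  return V1 * (sv.head(r).cwiseInverse().asDiagonal() * (U1.transpose() * b));
}

EIGEN's SVD classes also provide a solve() method that returns a least-squares solution directly.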
Review question(s) 3.4.3.17 (Solving general least-squares problems by SVD)
(Q3.4.3.17.A) Discuss the efficient implementation of a C++ function
Eigen::VectorXd solveRankOneLsq( const Eigen::VectorXd &u,
const Eigen::VectorXd &v, const Eigen::VectorXd &b);
that returns the general least squares solution of Ax = b for the rank-1 matrix A := uv⊤ , u ∈ R m ,
v ∈ R n , m ≥ n.
△
Video tutorial for Section 3.4.4.1 "Norm-Constrained Extrema of Quadratic Forms": (11 min-
utes) Download link, tablet notes
We consider the following problem of finding the extrema of quadratic forms on the Euclidean unit sphere
{ x ∈ K n : k x k2 = 1}:
Use that multiplication with orthogonal/unitary matrices preserves the 2-norm (→ Thm. 3.3.2.2) and resort
to the (full) singular value decomposition A = UΣVH (→ Def. 3.4.1.3):
$$ \min_{\|x\|_2 = 1} \|Ax\|_2^2 = \min_{\|x\|_2 = 1} \left\| U\Sigma V^H x \right\|_2^2 = \min_{\|V^H x\|_2 = 1} \left\| \Sigma (V^H x) \right\|_2^2 = \min_{\|y\|_2 = 1} \|\Sigma y\|_2^2 = \min_{\|y\|_2 = 1} \sum_{j=1}^{n} \sigma_j^2\, y_j^2 . $$
Since the singular values are assumed to be sorted as $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n$, the minimum with value $\sigma_n^2$ is attained for $y_n^2 = 1$ and $y_1 = \cdots = y_{n-1} = 0$, that is, $V^H x = y = e_n$ ($\hat{=}$ n-th Cartesian basis vector in $\mathbb{R}^n$). ⇒ minimizer $x^* = V e_n = (V)_{:,n}$, minimal value $\|Ax^*\|_2 = \sigma_n$.
By similar arguments we can solve the corresponding norm constrained maximization problem
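A sketch (ours, with assumed function name extremaOnSphere) of how minimizer and maximizer of $\|Ax\|_2$ on the unit sphere are read off from the right singular vectors:

#include <Eigen/Dense>
#include <utility>

// Returns {minimizer, maximizer} of ||A x||_2 over the Euclidean unit sphere
std::pair<Eigen::VectorXd, Eigen::VectorXd>
extremaOnSphere(const Eigen::MatrixXd &A) {
  Eigen::JacobiSVD<Eigen::MatrixXd> svd(A, Eigen::ComputeFullV);
  const Eigen::MatrixXd &V = svd.matrixV();
  const Eigen::VectorXd maximizer = V.col(0);            // ||A x||_2 = sigma_1
  const Eigen::VectorXd minimizer = V.col(V.cols() - 1); // smallest value of ||A x||_2
  return {minimizer, maximizer};
}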
Recall: The Euclidean matrix norm (2-norm) of the matrix A (→ Def. 1.5.5.10) is defined as the maximum
in (3.4.4.3). Thus we have proved the following theorem:
EXAMPLE 3.4.4.5 (Fit of hyperplanes) For an important application from computational geometry, this
example studies the power and versatility of orthogonal transformations in the context of (generalized)
least squares minimization problems.
From school recall the Hesse normal form of a hyperplane H (= affine subspace of dimension d − 1) in
Rd :
H = { x ∈ R d : c + n ⊤ x = 0} , n ∈ R d , k n k2 = 1 . (3.4.4.6)
where n is the unit normal to H and |c| gives the distance of H from 0. The Hesse normal form is
convenient for computing the distance of points from H, because the
Note that (3.4.4.8) is not a linear least squares problem due to the constraint knk2 = 1. However, it turns
out to be a minimization problem with almost the structure of (3.4.4.1) (yk,ℓ := (yk )ℓ ):
$$ (3.4.4.8) \;\Leftrightarrow\; \left\| \underbrace{\begin{bmatrix} 1 & y_{1,1} & \cdots & y_{1,d} \\ 1 & y_{2,1} & \cdots & y_{2,d} \\ \vdots & \vdots & & \vdots \\ 1 & y_{m,1} & \cdots & y_{m,d} \end{bmatrix}}_{=:A} \underbrace{\begin{bmatrix} c \\ n_1 \\ \vdots \\ n_d \end{bmatrix}}_{=:x} \right\|_2 \to \min \quad \text{under the constraint } \|n\|_2 = 1 . $$
Note that the solution component c is not subject to the constraint. One is tempted to use this freedom to
make one component of Ax vanish, but which one is not clear. This is why we need another preparatory
step.
Step ➊: To convert the minimization problem into the form (3.4.4.1) we start with a QR-decomposition
(→ Section 3.3.3)
$$ A := \begin{bmatrix} 1 & y_{1,1} & \cdots & y_{1,d} \\ 1 & y_{2,1} & \cdots & y_{2,d} \\ \vdots & \vdots & & \vdots \\ 1 & y_{m,1} & \cdots & y_{m,d} \end{bmatrix} = QR , \qquad R := \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1,d+1} \\ 0 & r_{22} & \cdots & r_{2,d+1} \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & r_{d+1,d+1} \\ 0 & \cdots & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & \cdots & \cdots & 0 \end{bmatrix} \in \mathbb{R}^{m,d+1} . $$
$$ \|Ax\|_2 \to \min \quad\Leftrightarrow\quad \|Rx\|_2 = \left\| \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1,d+1} \\ 0 & r_{22} & \cdots & r_{2,d+1} \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & r_{d+1,d+1} \\ 0 & \cdots & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & \cdots & \cdots & 0 \end{bmatrix} \begin{bmatrix} c \\ n_1 \\ \vdots \\ n_d \end{bmatrix} \right\|_2 \to \min . \tag{3.4.4.9} $$
Note: Since $r_{11} = \|(A)_{:,1}\|_2 = \sqrt{m} \ne 0$, $c = -r_{11}^{-1} \sum_{j=1}^{d} r_{1,j+1}\, n_j$ can always be computed.
This algorithm is implemented as case p==dim+1 in the following code, making heavy use of E IGEN’s
block access operations and the built-in QR-decomposition and SVD factorization.
Note that Code 3.4.4.11 solves the general problem: For $A \in \mathbb{K}^{m,n}$ find $n \in \mathbb{R}^d$, $c \in \mathbb{R}^{n-d}$ such that
$$ \left\| A \begin{bmatrix} c \\ n \end{bmatrix} \right\|_2 \to \min \quad \text{with constraint } \|n\|_2 = 1 . \tag{3.4.4.12} $$
y
Review question(s) 3.4.4.13 (Norm-Constrained Extrema of Quadratic Forms)
(Q3.4.4.13.A) Let M ∈ R n,n be symmetric and positive definite (s.p.d.) and A ∈ R m,n . Devise an algo-
rithm for computing
$$ \operatorname*{argmax}_{x \in B} \|Ax\|_2 , \qquad B := \{ x \in \mathbb{R}^n : x^\top M x = 1 \} , $$
Video tutorial for Section 3.4.4.2 "Best Low-Rank Approximation": (13 minutes)
Download link, tablet notes
§3.4.4.14 (Low-rank matrix compression) Matrix compression addresses the problem of approximating a given “generic” matrix (of a certain class) by a matrix whose “information content”, that is, the number of reals needed to store it, is significantly lower than the information content of the original matrix.
Sparse matrices (→ Notion 2.7.0.1) are a prominent class of matrices with “low information content”.
Unfortunately, they cannot approximate dense matrices very well. Another type of matrices that enjoy “low
information content”, also called data sparse, are low-rank matrices.
Lemma 3.4.4.15.
If A ∈ R m,n has rank r ≤ min{m, n} (→ Def. 2.2.1.3), then there exist X ∈ R m,r and Y ∈ R n,r ,
such that A = XY⊤ .
Proof. The lemma is a straightforward consequence of Lemma 3.4.1.13 and (3.4.1.14): If A = UΣV⊤ is
the SVD of A, then choose
None of the columns of U and V can vanish. Hence, in addition, we may assume that the columns of U
are normalized: (U):,j 2 = 1, j = 1, . . . , r.
Thus approximating a given matrix A ∈ R m,n with a rank-r matrix, r ≪ min{m, n}, can be regarded as
an instance of matrix compression. The approximation error with respect to some matrix norm k·k will be
minimal if we choose the best approximation
y
Here we explore low-rank best approximation of general matrices with respect to the Euclidean matrix
norm k·k2 induced by the 2-norm for vectors (→ Def. 1.5.5.10), and the Frobenius norm k·k F .
It should be obvious that $\|A\|_F$ is invariant under orthogonal/unitary transformations of A. Thus the Frobenius norm of a matrix A, rank(A) = r, can be expressed through its singular values $\sigma_j$:
$$ \|A\|_F^2 = \sum_{j=1}^{r} \sigma_j^2 . $$
The next profound result links best approximation in Rr (m, n) and the singular value decomposition (→
Def. 3.4.1.3).
Let $A = U\Sigma V^H$ be the SVD of $A \in \mathbb{K}^{m,n}$ (→ Thm. 3.4.1.1). For $1 \le k \le \operatorname{rank}(A)$ set
$$ A_k := U_k \Sigma_k V_k^H = \sum_{\ell=1}^{k} \sigma_\ell\, (U)_{:,\ell} (V)_{:,\ell}^H \quad\text{with}\quad \begin{aligned} U_k &:= \big[ (U)_{:,1}, \ldots, (U)_{:,k} \big] \in \mathbb{K}^{m,k} ,\\ V_k &:= \big[ (V)_{:,1}, \ldots, (V)_{:,k} \big] \in \mathbb{K}^{n,k} ,\\ \Sigma_k &:= \operatorname{diag}(\sigma_1, \ldots, \sigma_k) \in \mathbb{K}^{k,k} . \end{aligned} $$
Then, for both $\|\cdot\| = \|\cdot\|_F$ and $\|\cdot\| = \|\cdot\|_2$,
$$ \| A - A_k \| \le \| A - F \| \quad \forall\, F \in \mathcal{R}_k(m,n) , $$
that is, $A_k$ is the rank-k best approximation of A in the matrix norms $\|\cdot\|_F$ and $\|\cdot\|_2$.
This theorem teaches us that the rank-k matrix that is closest to A (rank-k best approximation) in both the Euclidean matrix norm and the Frobenius norm (→ Def. 3.4.4.17) can be obtained by truncating the rank-1 sum expansion (3.4.1.9) obtained from the SVD of A after k terms.
and both matrix norms are invariant under multiplication with orthogonal matrices, we conclude
$$ \operatorname{rank} A_k = k \quad\text{and}\quad \|A - A_k\| = \|\Sigma - \Sigma_k\| = \begin{cases} \sigma_{k+1} , & \text{for } \|\cdot\| = \|\cdot\|_2 ,\\ \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2} , & \text{for } \|\cdot\| = \|\cdot\|_F . \end{cases} $$
➊ First we tackle the Euclidean matrix norm $\|\cdot\| = \|\cdot\|_2$. For the sake of brevity we write $v_j := (V)_{:,j}$, $u_j := (U)_{:,j}$ for the columns of the SVD-factors V and U, respectively. Pick $B \in \mathbb{K}^{m,n}$, $\operatorname{rank} B = k$.
because $\sum_{j=1}^{k+1} (v_j^H x)^2 = \|x\|_2^2 = 1$.
➋ Now we turn to the Frobenius norm k·k F . We assume that B ∈ K m,n , rank(B) = k < min{m, n},
minimizes A − F among all rank-k-matrices F ∈ K m,n . We have to show that B coincides with the trun-
cated SVD of A: B = Ak .
The trick is to consider the full SVD of B:
$$ B = U_B \begin{bmatrix} \Sigma_B & O \\ O & O \end{bmatrix} V_B^H , \qquad U_B \in \mathbb{K}^{m,m} \text{ unitary} ,\; \Sigma_B \in \mathbb{R}^{k,k} \text{ diagonal} ,\; V_B \in \mathbb{K}^{n,n} \text{ unitary} . $$
with some matrices X12 ∈ K k,n−k , X21 ∈ K m−k,k , X22 ∈ K m−k,n−k . Then we introduce two m × n rank-
k-matrices:
$$ C_1 := U_B \begin{bmatrix} L + \Sigma_B + R & X_{12} \\ O & O \end{bmatrix} V_B^H , \qquad C_2 := U_B \begin{bmatrix} L + \Sigma_B + R & O \\ X_{21} & O \end{bmatrix} V_B^H . $$
Since kA − Bk F is minimal, we conclude from the invariance of the Frobenius norm under orthogonal
transformations
$$ U_B^H A V_B = \begin{bmatrix} \operatorname{diag}(\sigma_1, \ldots, \sigma_k) & O \\ O & X_{22} \end{bmatrix} . $$
This is possible only, if the k leftmost columns of both U B and V B agree with those of the corresponding
SVD-factors of A, which means B = Ak .
✷
The following code computes the low-rank best approximation of a dense matrix in E IGEN.
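The original listing is not reproduced here; the following sketch (ours, with assumed function name lowRankBestApprox) shows the core of such a routine, truncating the thin SVD after k terms as prescribed by Thm. 3.4.4.19:

#include <Eigen/Dense>
#include <algorithm>

// Rank-k best approximation A_k of a dense matrix A
Eigen::MatrixXd lowRankBestApprox(const Eigen::MatrixXd &A, Eigen::Index k) {
  Eigen::JacobiSVD<Eigen::MatrixXd> svd(A, Eigen::ComputeThinU | Eigen::ComputeThinV);
  k = std::min(k, svd.singularValues().size());
  return svd.matrixU().leftCols(k) *
         svd.singularValues().head(k).asDiagonal() *
         svd.matrixV().leftCols(k).transpose();
}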
§3.4.4.21 (Error of low-rank best approximation of a matrix) Since the matrix norms $\|\cdot\|_2$ and $\|\cdot\|_F$ are
invariant under multiplication with orthogonal (unitary) matrices, we immediately obtain expressions for the
norms of the best approximation error:
This provides precise information about the best approximation error for rank-k matrices. In particular, the
decay of the singular values of the matrix governs the convergence of the rank-k best approximation error
as k increases. y
EXAMPLE 3.4.4.24 (Image compression) A rectangular greyscale image composed of m × n pixels
(greyscale, BMP format) can be regarded as a matrix A ∈ R m,n , (A)i,j ∈ {0, . . . , 255}, cf. Ex. 9.3.2.1.
Thus low-rank approximation of the image matrix is a way to compress the image.
Thm. 3.4.4.19 ➣ best rank-k approximation of the image: $\widetilde{A} = U_k \Sigma_k V_k^\top$.
Of course, the matrices $U_k$, $V_k$, and $\Sigma_k$ are available from the economical (thin) SVD (3.4.1.5) of A.
[Figures: view of the ETH Zurich main building (original image); compressed image using 40 singular values; difference image |original − approximated|; singular values of the ETH view on a logarithmic scale, with k = 40 (≈ 0.08 of the memory) marked.]
Note that there are better and faster ways to compress images than SVD (JPEG, Wavelets, etc.) y
Review question(s) 3.4.4.25 (Best low-rank approximation)
(Q3.4.4.25.A) Show that for A ∈ R m,n and any orthogonal Q ∈ R m,m
kQAk F = kAk F ,
(Q3.4.4.25.B) Show that for any $A \in \mathbb{R}^{m,n}$ with singular values $\sigma_j$, $j = 1, \ldots, p := \min\{m,n\}$, it holds that $\|A\|_F^2 = \sum_{j=1}^{p} \sigma_j^2$.
$$ \big\| A - \widetilde{A} \big\|_2 \le \text{tol} \cdot \|A\|_2 . $$
The function should return $\widetilde{A}$ in factorized form $\widetilde{A} = X Y^\top$ as a tuple of matrix factors $X \in \mathbb{R}^{m,r}$, $Y \in \mathbb{R}^{n,r}$.
write a C++ code snippet that computes the sum of the squares of the singular values of M.
Hint. Remember
(Q3.4.4.25.E) [Question (Q3.4.4.25.D) cnt’d: Sum of $\sigma_i^4$] Given a matrix $M \in \mathbb{R}^{n,m}$, m, n ∈ N, represented by a C++ object M with an entry access operator
double operator()(unsigned int i, unsigned int j) const;
Video tutorial for Section 3.4.4.3 "Principal Component Data Analysis (PCA)": (28 minutes)
Download link, tablet notes
EXAMPLE 3.4.4.26 (Trend analysis) The objective is to extract information in the form of hidden “trends”
from data.
[Fig. 86: XETRA DAX end-of-day stock prices (in EUR, logarithmic scale), 1.1.2008 – 29.10.2010, plotted over days in the past; one curve per stock (ADS, ALV, BAYN, …, VOW3).]
We are given time series data: (end-of-day) stock prices, that is, n data vectors ∈ $\mathbb{R}^m$.
Rephrased in the language of linear algebra: Are there underlying governing trends? That is, are there a few vectors $u_1, \ldots, u_p$, $p \ll n$, such that, approximately, all other data vectors ∈ $\operatorname{Span}\{u_1, \ldots, u_p\}$?
y
EXAMPLE 3.4.4.27 (Classification from measured data) Data vectors belong to different classes,
where those in the same class are “qualitatively similar” in the sense that they are small (random) per-
turbations of a typical data vector. The task is to tease out the typical data patterns and tell which class
every data vector belongs to.
[Fig. 87: sketch illustrating measurement errors and manufacturing tolerances.]
The following plots display possible (“synthetic”) measured data for two types of diodes; measurement er-
rors and manufacturing tolerances taken into account by additive (Gaussian) random perturbations (noise).
[Figs. 88, 89: measured U–I characteristics (current I versus voltage U) for some diodes and for all diodes, respectively.]
y
Ex. 3.4.4.26 and Ex. 3.4.4.27 present typical tasks that can be tackled by principal component analysis.
Now we give an abstract description as a problem of linear algebra.
Given: n data points a j ∈ R m , j = 1, . . . , n, in m-dimensional (feature) space
(e.g., a j may represent a finite time series or a measured relationship of physical quantities)
↔ a j ∈ Span{u} ∀ j = 1, . . . , n ,
for a trend vector u ∈ R m , kuk2 = 1.
↔ a j ∈ Span{u1 , . . . , u p } ∀ j = 1, . . . , n , (3.4.4.28)
Now singular value decomposition (SVD) according to Def. 3.4.1.3 comes into play, because
Lemma 3.4.1.13 tells us that it can supply an orthonormal basis of the image space of a matrix, cf.
Code 3.4.2.9.
This already captures the case (3.4.4.28) and we see that the columns of U supply the trend vectors we
are looking for!
➊ no perturbations:
The j-th row of V (up to the p-th component) gives the weights with which the p identified trends
contribute to data set j.
EXAMPLE 3.4.4.31 (PCA of stock prices → Ex. 3.4.4.26cnt’d) Stock prices are given as a large
matrix A ∈ R m,n :
[Figs. 90, 91: the stock prices plotted over days in the past, and the singular values of A plotted against their index.]
We observe a pronounced decay of the singular values of A. The plot of Fig. 91 is given in linear-
logarithmic scale. The neat alignment of larger singular values indicates approximate exponential decay
of the singular values.
➣ a few trends (corresponding to a few of the largest singular values) govern the time series.
[Figs. 92, 93: the five most important stock price trends, normalized (columns U(:,1)–U(:,5)) and weighted with the corresponding singular values (U*S(:,1)–U*S(:,5)), plotted over days in the past.]
Columns of U (→ Fig. 92) in SVD A = UΣV⊤ provide trend vectors, cf. Ex. 3.4.4.26 & Ex. 3.4.4.32.
When weighted with the corresponding singular value, the importance of a trend contribution emerges,
see Fig. 93
[Figs. 94, 95: relative strength of the contributions of the first five singular vectors (trends) to the BMW and Daimler stock prices, 1.1.2008 – 29.10.2010.]
Stocks of companies from the same sector of the economy should display similar contributions of major trend vectors, because their prices can be expected to be more closely correlated than stock prices in general. This is evident in Fig. 94 and Fig. 95 for two car makers.
y
EXAMPLE 3.4.4.32 (Principal component analysis for data classification → Ex. 3.4.4.27 cnt’d)
Given: measured U - I characteristics of n = 20 unknown diodes, I (U ) available for m = 50 voltages.
Sought: Number of different types of diodes in batch and reconstructed U - I characteristic for each type.
[Figs. 96, 97: measured U–I characteristics (current I versus voltage U) for some diodes and for all diodes, respectively.]
[Fig. 98: distribution of the singular values $\sigma_i$ of the measurement matrix, plotted against their index – two dominant singular values!]
[Fig. 99: strengths of the contributions of the two leading singular components (strength of singular component #2 versus #1) for each measurement.]
[Fig. 100: the dominant and the second principal component (trend vectors) for the diode measurements, plotted against the voltage U.]
Observations:
✦ The first two rows of the V-matrix specify the strength of the contribution of the two leading principal components to each measurement.
➣ The points $((V)_{i,1}, (V)_{i,2})$, which correspond to the different diodes, are neatly clustered in $\mathbb{R}^2$. To determine the type of diode i, we have to identify the cluster to which the point $((V)_{i,1}, (V)_{i,2})$ belongs (→ cluster analysis, course “machine learning”, see Rem. 3.4.4.43 below).
✦ The principal components themselves do not carry much useful information in this example.
y
EXAMPLE 3.4.4.33 (Data points (almost) confined to a subspace) More abstractly, above we tried to
identify a subspace to which all data points ai were “close”. We saw that the SVD of
[Fig. 101: 3D scatter plot of the data points $a_1, \ldots, a_n$, which lie close to a plane.]
Non-zero singular values of $A = [a_1, \ldots, a_n]$: 3.1378, 1.8092, 0.1792.
The third singular value is much smaller, which hints that the data points approximately lie in a 2D subspace spanned by the first two singular vectors of A.
y
§3.4.4.34 (Proper orthogonal decomposition (POD)) In the previous Ex. 3.4.4.33 we saw that the
singular values of a matrix whose columns represent data points ∈ R m tell us whether these points are all
“approximately located” in a lower-dimensional subspace V ⊂ R m . This is linked to the following problem:
that is, we seek that k-dimensional subspace Uk of R m for which the sum of squared dis-
tances of the data points to Uk is minimal.
We have already seen a similar problem in Ex. 3.4.4.5. For m = 2 we want to point out the difference to
linear regression:
[Figs. 102, 103: sketches of the two fitting problems in 2D.]
Linear regression: minimize the sum of squares of vertical distances. POD: minimize the sum of squares of (minimal) distances.
By finding a k-dimensional subspace we mean finding a, preferably orthonormal, basis of that subspace.
Let us assume that {w1 , . . . , wk } is an orthonormal basis (ONB) of a k-dimensional subspace W ⊂ R m .
Then the orthogonal projection PW x of a point x ∈ R m onto W is given by
$$ P_{\mathcal{W}}\, x = \sum_{j=1}^{k} (w_j^\top x)\, w_j = W W^\top x , \tag{3.4.4.36} $$
where W = [w1 , . . . , wk ] ∈ R m,k . This formula is closely related to the normal equations for a linear
least squares problem, see Thm. 3.1.2.1 and § 3.1.1.8 for a visualization of an orthogonal projection.
kx − PW xk2 is the (minimal) distance of x to W .
Hence, again writing W ∈ R m,k for the matrix whose columns form an ONB (⇒ W⊤ W = I) of W ⊂ R m ,
we have
$$ \sum_{j=1}^{n} \inf_{w \in \mathcal{W}} \big\| a_j - w \big\|_2^2 = \sum_{j=1}^{n} \big\| a_j - WW^\top a_j \big\|_2^2 = \big\| A - WW^\top A \big\|_F^2 , \tag{3.4.4.37} $$
where k·k F denotes the Frobenius norm of a matrix, see Def. 3.4.4.17, and A = [a1 , . . . , an ] ∈ R m,n .
Note that rank(WW⊤ A) ≤ k. Let us write A = UΣV⊤ for the SVD of A, and Uk ∈ R m,k , Vk ∈ R n,k ,
and Σk ∈ R k,k for the truncated SVD-factors of A as introduced in Thm. 3.4.4.19. Then the rank-k best
approximation result of that theorem implies
$$ \big\| A - WW^\top A \big\|_F \ge \big\| A - U_k \Sigma_k V_k^\top \big\|_F \quad \forall\, W \in \mathbb{R}^{m,k} ,\; W^\top W = I . $$
In fact, we can find W ∈ R m,k with orthonormal columns that realizes the minimum: just choose W := Uk
and verify
$$ WW^\top A = U_k U_k^\top U \Sigma V^\top = U_k\, [\, I_k \;\; O \,]\, \Sigma V^\top = U_k \Sigma_k V_k^\top . $$
The subspace Uk spanned by the first k left singular vectors of A = [a1 , . . . , an ] ∈ R m,n solves the
POD problem (3.4.4.35):
Uk = R (U):,1:k with A = UΣV⊤ the SVD of A.
Appealing to (3.4.4.23), the sum of the squared distances can be obtained as the sum of the squares of the remaining singular values $\sigma_{k+1}, \ldots, \sigma_p$, $p := \min\{m,n\}$, of A:
$$ \sum_{j=1}^{n} \inf_{w \in \mathcal{U}_k} \big\| a_j - w \big\|_2^2 = \sum_{\ell=k+1}^{p} \sigma_\ell^2 . \tag{3.4.4.39} $$
As a consequence, the decay of the singular values again predicts how close the data points are to the
POD subspaces Uk , k = 1, . . . , p − 1. y
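A sketch (ours, with assumed function name podBasis) of how an ONB of the POD subspace $\mathcal{U}_k$ is obtained from the data matrix $A = [a_1, \ldots, a_n]$:

#include <Eigen/Dense>
#include <algorithm>

// Columns of the returned matrix form an ONB of the POD subspace U_k
Eigen::MatrixXd podBasis(const Eigen::MatrixXd &A, Eigen::Index k) {
  Eigen::JacobiSVD<Eigen::MatrixXd> svd(A, Eigen::ComputeThinU);
  return svd.matrixU().leftCols(std::min(k, svd.matrixU().cols()));
}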
EXAMPLE 3.4.4.40 (Principal axis of a point cloud) Given m > 2 points x j ∈ R k , j = 1, . . . , m, in
k-dimensional space, we ask what is the “longest” and “shortest” diameter d+ and d− . This question can
be stated rigorously in several different ways: here we ask for directions for which the point cloud will have
maximal/minimal variance, when projected onto that direction:
$$ d_+ := \operatorname*{argmax}_{\|v\|=1} Q(v) , \quad d_- := \operatorname*{argmin}_{\|v\|=1} Q(v) , \qquad Q(v) := \sum_{i=1}^{m} \big| (x_i - c)^\top v \big|^2 , \quad c := \frac{1}{m} \sum_{j=1}^{m} x_j . \tag{3.4.4.41} $$
[Fig. 104: a planar point cloud together with its major and minor axes.]
The directions $d_+$, $d_-$ are called the principal axes of the point cloud, a term borrowed from mechanics.
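A sketch (ours) for (3.4.4.41), assuming the points $x_j$ are stored as the columns of the matrix X; after centering, $Q(v) = \|Mv\|_2^2$ with the rows of M being $(x_j - c)^\top$, so the principal axes are extremal right singular vectors of M:

#include <Eigen/Dense>
#include <utility>

// Returns {d_+, d_-} for the point cloud given by the columns of X
std::pair<Eigen::VectorXd, Eigen::VectorXd>
principalAxes(const Eigen::MatrixXd &X) {
  const Eigen::VectorXd c = X.rowwise().mean();            // center of gravity
  const Eigen::MatrixXd M = (X.colwise() - c).transpose(); // rows: (x_j - c)^T
  Eigen::JacobiSVD<Eigen::MatrixXd> svd(M, Eigen::ComputeFullV);
  const Eigen::MatrixXd &V = svd.matrixV();
  return {V.col(0), V.col(V.cols() - 1)};                  // d_+ , d_-
}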
The subsets {xi : i ∈ Il } are called the clusters. The points ml are their centers of gravity.
that is, we assign each point to the nearest center of gravity, see Code 3.4.4.49.
We start with a single cluster, and then do repeated splitting (➊) and cluster rearrangement (➋) until we
have reached the desired final number n of clusters, see Code 3.4.4.50.
// (Fragment of Code 3.4.4.49; assumed context: d is a matrix of squared
// distances of every point (column j) to every cluster center (row).)
for (int j = 0; j < d.cols(); ++j) {
  // mx(j) tells the minimal squared distance of point j to the nearest cluster
  // idx(j) tells to which cluster point j belongs
  mx(j) = d.col(j).minCoeff(&idx(j));
}
const double sumd = mx.sum(); // sum of all squared distances
// Compute sum of squared distances within each cluster
VectorXd cds(C.cols());
cds.setZero();
for (int j = 0; j < idx.size(); ++j) { // loop over all points
  cds(idx(j)) += mx(j);
}
return std::make_tuple(sumd, idx, cds);
}
  }
  return std::make_pair(C, idx);
}
How much do you have to increase the rank of the best low-rank approximation of A with respect to the
Euclidean matrix norm in order to reduce the approximation error by a factor of 2?
Can you also answer this question for the Frobenius matrix norm?
△
☞ least squares problem “turned upside down”: now we are allowed to tamper with system matrix and
right hand side vector!
$$ \hat{b} \in \mathcal{R}(\hat{A}) \;\Rightarrow\; \operatorname{rank}\big( [\hat{A}\;\hat{b}] \big) = n , \qquad (3.5.0.1) \;\Rightarrow\; [\hat{A}\;\hat{b}] = \operatorname*{argmin}_{\operatorname{rank}(\hat{X}) = n} \big\| [A\;b] - \hat{X} \big\|_F . $$
☞ $[\hat{A}\;\hat{b}]$ is the rank-n best approximation of $[A\;b]$!
We face the problem to compute the best rank-n approximation of the given matrix [A b], a problem
already treated in Section 3.4.4.2: Thm. 3.4.4.19 tells us how to use the SVD of [A b]
$$ [A\;b] = U \Sigma V^\top = \sum_{j=1}^{n+1} \sigma_j\, (U)_{:,j} (V)_{:,j}^\top \;\overset{\text{Thm. 3.4.4.19}}{\Longrightarrow}\; [\hat{A}\;\hat{b}] = \sum_{j=1}^{n} \sigma_j\, (U)_{:,j} (V)_{:,j}^\top . \tag{3.5.0.3} $$
Since V is orthogonal,
$$ [\hat{A}\;\hat{b}]\, (V)_{:,n+1} = \hat{A}\, (V)_{1:n,n+1} + \hat{b}\, (V)_{n+1,n+1} = 0 . \tag{3.5.0.4} $$
(3.5.0.4) also provides the solution x of $\hat{A} x = \hat{b}$:
$$ x := \hat{A}^{-1} \hat{b} = -(V)_{1:n,n+1} / (V)_{n+1,n+1} , \tag{3.5.0.5} $$
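A sketch (ours, function name tlsSolve assumed) of the resulting total least-squares algorithm: form [A b], compute its SVD, and evaluate (3.5.0.5); it presupposes m ≥ n + 1 and $(V)_{n+1,n+1} \ne 0$:

#include <Eigen/Dense>

Eigen::VectorXd tlsSolve(const Eigen::MatrixXd &A, const Eigen::VectorXd &b) {
  const Eigen::Index n = A.cols();
  Eigen::MatrixXd Ab(A.rows(), n + 1);
  Ab << A, b;                                     // form the extended matrix [A b]
  Eigen::JacobiSVD<Eigen::MatrixXd> svd(Ab, Eigen::ComputeFullV);
  const Eigen::VectorXd v = svd.matrixV().col(n); // last right singular vector
  return -v.head(n) / v(n);                       // formula (3.5.0.5)
}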
Video tutorial for Section 3.6 "Constrained Least Squares": (23 minutes) Download link,
tablet notes
In the examples of Section 3.0.1 we expected all components of the right hand side vectors to be possibly
affected by measurement errors. However, it might happen that some data are very reliable and in this
case we would like the corresponding equation to be satisfied exactly.
linear least squares problem with linear constraint defined as follows:
Linear constraint
Here the constraint matrix C collects all the coefficients of those p equations that are to be satisfied exactly,
and the vector d the corresponding components of the right hand side vector. Conversely, the m equations
of the (overdetermined) LSE Ax = b cannot be satisfied and are treated in a least squares sense.
§3.6.1.1 (A saddle point problem) Recall important technique from multidimensional calculus for tackling
constrained minimization problems: Lagrange multipliers, see [Str09, Sect. 7.9].
L as defined in (3.6.1.3) is called a Lagrange function or, in short, Lagrangian. The simple heuristics
behind Lagrange multipliers is the observation:
[Fig. 105: saddle point of $F(x, m) = x^2 - 2xm$, plotted over the state x and the multiplier m.]
In a saddle point both the derivative with respect to the state x and the derivative with respect to the multiplier m vanish.
y
§3.6.1.4 (Augmented normal equations) In a saddle point the Lagrange function is “flat”, that is, all its
partial derivatives have to vanish there. This yields the following necessary (and sufficient) conditions for
the solution x of (3.6.1.2) and a saddle point in x, q: (For a similar technique employing multi-dimensional
calculus see Rem. 3.1.2.5)
$$ \frac{\partial L}{\partial x}(x, q) = A^\top (Ax - b) + C^\top q \overset{!}{=} 0 , \tag{3.6.1.5a} $$
$$ \frac{\partial L}{\partial q}(x, q) = Cx - d \overset{!}{=} 0 . \tag{3.6.1.5b} $$
This is an (n + p) × (n + p) square linear system of equations, known as augmented normal equa-
tions:
$$ \begin{bmatrix} A^\top A & C^\top \\ C & 0 \end{bmatrix} \begin{bmatrix} x \\ q \end{bmatrix} = \begin{bmatrix} A^\top b \\ d \end{bmatrix} . \tag{3.6.1.6} $$
It belongs to the class of saddle-point type LSEs, that is, LSEs with a symmetric coefficient matrix with a
zero right-lower square block. In the case p = 0, in the absence of a linear constraint, (3.6.1.6) collapses
to the usual normal equations (3.1.2.2), A⊤ Ax = A⊤ b for the overdetermined linear system of equations
Ax = b.
As we know, a direct elimination solution algorithm for (3.6.1.6) amounts to finding an LU-decomposition of
the coefficient matrix. Here we opt for its symmetric variant, the Cholesky decomposition, see Section 2.8.
On the block-matrix level it can be found by considering the equation
A⊤ A C⊤ R⊤ 0 R G⊤ R, S ∈ R n,n upper triangular matrices,
= ,
C 0 G −S⊤ 0 S G ∈ R p,n .
Thus the blocks of the Cholesky factors of the coefficient matrix of the linear system (3.6.1.6) can be
determined in four steps.
➀ Compute R from R⊤ R = A⊤ A → Cholesky decomposition → Section 2.8,
➁ Compute G from R⊤ G⊤ = C⊤ → n forward substitutions → Section 2.3.2,
➂ Compute S from S⊤ S = GG⊤ → Cholesky decomposition → Section 2.8.
y
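For illustration (not from the lecture notes), the following sketch simply assembles and solves the augmented normal equations (3.6.1.6) with a dense LU factorization instead of the block Cholesky steps described above; the function name lsqConstrained is ours:

#include <Eigen/Dense>

// Solves (3.6.1.6) for x; A is m x n, C is p x n with full rank p
Eigen::VectorXd lsqConstrained(const Eigen::MatrixXd &A, const Eigen::VectorXd &b,
                               const Eigen::MatrixXd &C, const Eigen::VectorXd &d) {
  const Eigen::Index n = A.cols(), p = C.rows();
  Eigen::MatrixXd M = Eigen::MatrixXd::Zero(n + p, n + p);
  M.topLeftCorner(n, n) = A.transpose() * A;
  M.topRightCorner(n, p) = C.transpose();
  M.bottomLeftCorner(p, n) = C;
  Eigen::VectorXd rhs(n + p);
  rhs << A.transpose() * b, d;
  const Eigen::VectorXd xq = M.fullPivLu().solve(rhs); // [x; q], q = multipliers
  return xq.head(n);                                   // return x only
}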
§3.6.1.7 (Extended augmented normal equations) The same caveats as those discussed for the regular
normal equations in Rem. 3.2.0.3, Ex. 3.2.0.4, and Rem. 3.2.0.6, apply to the direct use of the augmented
normal equations (3.6.1.6):
1. their condition number can be much bigger than that of the matrix A,
2. forming A⊤ A may be vulnerable to roundoff,
3. the matrix A⊤ A may not be sparse, though A is.
As in § 3.2.0.7 also in the case of the augmented normal equations (3.6.1.6) switching to an extended
version by introducing the residual r = Ax − b as a new unknown is a remedy, cf. (3.2.0.8). This leads to
the following linear system of equations.
$$ \begin{bmatrix} -I & A & 0 \\ A^\top & 0 & C^\top \\ 0 & C & 0 \end{bmatrix} \begin{bmatrix} r \\ x \\ m \end{bmatrix} = \begin{bmatrix} b \\ 0 \\ d \end{bmatrix} \qquad \hat{=}\;\text{extended augmented normal equations.} \tag{3.6.1.8} $$
y
Idea: Identify the subspace in which the solution can vary without violating the constraint.
Since C has full rank, this subspace agrees with the nullspace/kernel of C.
From Lemma 3.4.1.13 and Ex. 3.4.2.7 we have learned that the SVD can be used to compute (an orthonormal basis of) the nullspace $\mathcal{N}(C)$. This suggests the following method for solving the constrained linear least squares problem (3.6.0.1).
➀ Compute an orthonormal basis of $\mathcal{N}(C)$ using the SVD (→ Lemma 3.4.1.13, (3.4.3.1)):
$$ C = U\, [\,\Sigma \; 0\,] \begin{bmatrix} V_1^\top \\ V_2^\top \end{bmatrix} , \quad U \in \mathbb{R}^{p,p} ,\; \Sigma \in \mathbb{R}^{p,p} ,\; V_1 \in \mathbb{R}^{n,p} ,\; V_2 \in \mathbb{R}^{n,n-p} \quad\Longrightarrow\quad \mathcal{N}(C) = \mathcal{R}(V_2) . $$
A particular solution of the constraint equation Cx = d is
$$ x_0 := V_1 \Sigma^{-1} U^\top d , $$
and every x satisfying the constraint can be written as
$$ x = x_0 + V_2\, y , \quad y \in \mathbb{R}^{n-p} . $$
➁ Insert this representation into (3.6.0.1). This yields a standard linear least squares problem with coefficient matrix $A V_2 \in \mathbb{R}^{m,n-p}$ and right hand side vector $b - A x_0 \in \mathbb{R}^m$:
$$ y^* = \operatorname*{argmin}_{y \in \mathbb{R}^{n-p}} \big\| (A V_2)\, y - (b - A x_0) \big\|_2 , \qquad x^* = x_0 + V_2\, y^* . $$
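A sketch (ours, with assumed function name lsqNullspace) of steps ➀ and ➁, assuming $C \in \mathbb{R}^{p,n}$ has full rank p < n:

#include <Eigen/Dense>

Eigen::VectorXd lsqNullspace(const Eigen::MatrixXd &A, const Eigen::VectorXd &b,
                             const Eigen::MatrixXd &C, const Eigen::VectorXd &d) {
  const Eigen::Index n = C.cols(), p = C.rows();
  Eigen::JacobiSVD<Eigen::MatrixXd> svd(C, Eigen::ComputeFullU | Eigen::ComputeFullV);
  const Eigen::MatrixXd V1 = svd.matrixV().leftCols(p);
  const Eigen::MatrixXd V2 = svd.matrixV().rightCols(n - p); // ONB of N(C)
  // Particular solution x0 of C x = d
  const Eigen::VectorXd x0 =
      V1 * (svd.singularValues().head(p).cwiseInverse().asDiagonal() *
            (svd.matrixU().transpose() * d));
  // Reduced least-squares problem for y: ||(A V2) y - (b - A x0)|| -> min
  const Eigen::VectorXd y = (A * V2).colPivHouseholderQr().solve(b - A * x0);
  return x0 + V2 * y;
}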
for given measured values $\tilde{\alpha}$, $\tilde{\beta}$, and $\tilde{\gamma}$.
Find the least-squares solution, if the bottom equation has to be satisfied exactly. First recast into a linearly constrained least-squares problem
$$ \| A x - b \| \to \min , \qquad C x = d . $$
$$ x^* = \operatorname*{argmax}_{x \in \mathcal{C}} \| A x \|_2 , \qquad \mathcal{C} := \{ x \in \mathbb{R}^n : \|x\|_2 = 1 ,\; C x = 0 \} . $$
Learning Outcomes
After having studied the contents of this chapter you should be able to
• give a rigorous definition of the least squares solution of an (overdetermined) linear system of equa-
tions,
• state the (extended) normal equations for any overdetermined linear system of equations,
• tell conditions for uniqueness and existence of solutions of the normal equations,
• define (economical) QR-decomposition and SVD of a matrix,
• know the asymptotic computational effort of computing economical QR and SVD factorizations,
• explain the use of QR-decomposition and, in particular, Givens rotations, for solving (overdeter-
mined) linear systems of equations (in least squares sense),
• use SVD to solve least squares, (constrained) optimization, and low-rank best approximation prob-
lems
• explain the ideas underlying principal component analysis (PCA) and proper orthogonal decompo-
sition (POD),
• formulate the augmented (extended) normal equations for a linearly constrained least squares prob-
lem.
Bibliography
[Bra06] Matthew Brand. “Fast low-rank modifications of the thin singular value decomposition”. In:
Linear Algebra Appl. 415.1 (2006), pp. 20–30. DOI: 10.1016/j.laa.2005.07.021.
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 218, 219, 226, 230–263, 274).
[GGK14] W. Gander, M.J. Gander, and F. Kwok. Scientific Computing. Vol. 11. Texts in Computational
Science and Engineering. Heidelberg: Springer, 2014 (cit. on p. 216).
[GV13] Gene H. Golub and Charles F. Van Loan. Matrix computations. Fourth. Johns Hopkins Stud-
ies in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD, 2013,
pp. xiv+756 (cit. on pp. 241, 243, 247, 266).
[Gut07] M. Gutknecht. “Linear Algebra”. 2007 (cit. on p. 239).
[Gut09] M.H. Gutknecht. Lineare Algebra. Lecture Notes. SAM, ETH Zürich, 2009 (cit. on pp. 236, 264,
266, 279).
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 230–233, 236,
243, 274).
[HRS16] J.S. Hesthaven, G. Rozza, and B. Stamm. Certified Reduced Basis Methods for Parametrized
Partial Differential Equations. BCAM Springer Briefs. Cham: Springer, 2016.
[Hig02] N.J. Higham. Accuracy and Stability of Numerical Algorithms. 2nd ed. Philadelphia, PA: SIAM,
2002 (cit. on pp. 243, 250).
[Kal96] D. Kalman. “A singularly valuable decomposition: The SVD of a matrix”. In: The College Math-
ematics Journal 27 (1996), pp. 2–23 (cit. on p. 264).
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
pp. 215, 239, 264, 266, 267).
[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Math-
ematics. New York: Springer, 2000 (cit. on p. 231).
[QMN16] Alfio Quarteroni, Andrea Manzoni, and Federico Negri. Reduced basis methods for partial
differential equations. Vol. 92. Unitext. Springer, Cham, 2016, pp. xi+296.
[Ste76] G. W. Stewart. “The economical storage of plane rotations”. In: Numer. Math. 25.2 (1976),
pp. 137–138. DOI: 10.1007/BF01462266 (cit. on p. 247).
[Str19] G. Strang. Linear Algebra and Learning from Data. Cambridge University Press, 2019.
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on pp. 223, 225,
264, 298).
[SCF18] Jan Svoboda, Thomas Cashman, and Andrew Fitzgibbon. QRkit: Sparse, Composable QR
Decompositions for Efficient and Stable Solutions to Problems in Computer Vision. 2018.
[Vol08] S. Volkwein. Model reduction using proper orthogonal decomposition. Lecture notes. Graz,
Austria: TU Graz, 2008.
Chapter 4
Filtering Algorithms
This chapter continues the theme of numerical linear algebra, also covered in Chapter 1, 2, 10. We will
come across very special linear transformations (↔ matrices) and related algorithms. Surprisingly, these
form the basis of a host of very important numerical methods for signal processing.
§4.0.0.1 (Time-discrete signals and sampling) From the perspective of signal processing we can iden-
tify
vector x ∈ R n ↔ finite discrete (= sampled) signal.
Sampling converts a time-continuous signal, represented by some real-valued physical quantity (pressure, voltage, power, etc.), into a time-discrete signal:
$$ X = X(t) \;\hat{=}\; \text{time-continuous signal, } 0 \le t \le T , $$
$$ \text{“sampling”: } x_j = X(j\,\Delta t) , \quad j = 0, \ldots, n-1 , \quad n \in \mathbb{N} ,\; n\,\Delta t \le T , $$
where $\Delta t > 0$ $\hat{=}$ the time between samples.
[Fig. 106: a time-continuous signal X(t) sampled at the instants $t_0, t_1, \ldots, t_{n-1}$, yielding the values $x_0, \ldots, x_{n-1}$.]
As already indicated by the indexing, the sampled values can be arranged in a vector $x = [x_0, \ldots, x_{n-1}]^\top \in \mathbb{R}^n$.
Note that in this chapter, as is customary in signal processing, we adopt a C++-style indexing from 0: the
components of a vector with length n carry indices ∈ {0, . . . , n − 1}.
As an idealization one sometimes considers a signal of infinite duration X = X (t), −∞ < t < ∞. In this
case sampling yields a bi-infinite time-discrete signal, represented by a sequence ( xk )k∈Z ∈ RZ . If this
sequence has a finite number of non-zero terms only, then we write (0, . . . , xℓ , xℓ+1 , . . . , xn−1 , xn , 0, . . .).
y
EXAMPLE 4.0.0.2 (Sampled audio signals) An important class of time-discrete signals are digital audio
signals. Those may be obtained from sampling and analog-to-digital conversion of air pressure recorded
by a microphone.
[Fig. 107: the sampled audio signal stored in hello.wav, plotted over time t in seconds.]
Output of the file command for this WAV file:
  Filtering> file hello.wav
  hello.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 44100 Hz
• In this chapter we neglect the additional “quantization”, that is, the fact that, in practice, the values
of a time-discrete signal ( xk )k∈Z are again discrete, e.g., 16-bit integers for the WAV file format.
Throughout, we consider only xk ∈ R.
• C++ codes handling standard file formats usually rely on dedicated libraries, for instance the library
AudioFile for the WAV format.
y
Contents
4.1 Filters and Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
4.1.1 Discrete Finite Linear Time-Invariant Causal Channels/Filters . . . . . . . . 304
4.1.2 LT-FIR Linear Mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
4.1.3 Discrete Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
4.1.4 Periodic Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
4.2 Discrete Fourier Transform (DFT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
4.2.1 Diagonalizing Circulant Matrices . . . . . . . . . . . . . . . . . . . . . . . . . 319
4.2.2 Discrete Convolution via Discrete Fourier Transform . . . . . . . . . . . . . 326
4.2.3 Frequency filtering via DFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
4.2.4 Real DFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
4.2.5 Two-dimensional DFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
4.2.6 Semi-discrete Fourier Transform [QSS00, Sect. 10.11] . . . . . . . . . . . . . 344
4.3 Fast Fourier Transform (FFT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
4.4 Trigonometric Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
4.4.1 Sine transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
4.4.2 Cosine transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
4.5 Toeplitz Matrix Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
4.5.1 Matrices with Constant Diagonals . . . . . . . . . . . . . . . . . . . . . . . . 371
4.5.2 Toeplitz Matrix Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
4.5.3 The Levinson Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
In this section we study a finite linear time-invariant causal channel/filter, which is a widely used model
for digital communication channels, e.g. in wireless communication theory. We adopt a mathematical
perspective harnessing the toolbox of linear algebra as is common in modern engineering.
Mathematically speaking, a (discrete) channel/filter is a function/mapping F : ℓ∞ (Z ) → ℓ∞ (Z ) from the
vector space ℓ∞ (Z ) of bounded input sequences { x j } j∈Z ,
$$ \ell^\infty(\mathbb{Z}) := \Big\{ (x_j)_{j\in\mathbb{Z}} : \sup_{j\in\mathbb{Z}} |x_j| < \infty \Big\} , $$
to bounded output sequences $(y_j)_{j\in\mathbb{Z}}$.
[Fig. 108: an input signal $(x_k)$ enters the channel, an output signal $(y_k)$ leaves it.]
$$ \text{Channel/filter:}\quad F : \ell^\infty(\mathbb{Z}) \to \ell^\infty(\mathbb{Z}) , \qquad (y_j)_{j\in\mathbb{Z}} = F\big( (x_j)_{j\in\mathbb{Z}} \big) . \tag{4.1.1.1} $$
In order to link (discrete) filters to linear algebra, we have to assume certain properties that are indicated
by the attributes “finite ”, “linear”, “time-invariant” and “causal”:
It is natural to assume that it should not matter when exactly signal is fed into the channel. To express this
intuition more rigorously we introduce the time shift operator for signals: for m ∈ Z
$$ S_m : \ell^\infty(\mathbb{Z}) \to \ell^\infty(\mathbb{Z}) , \qquad S_m\big( (x_j)_{j\in\mathbb{Z}} \big) = (x_{j-m})_{j\in\mathbb{Z}} . \tag{4.1.1.4} $$
Hence, by applying Sm we advance (m < 0) or delay (m > 0) a signal by |m|∆t. For a time-invariant filter
time-shifts of the input propagate to the output unchanged.
A filter F : ℓ∞ (Z ) → ℓ∞ (Z ) is called time-invariant (TI), if shifting the input in time leads to the
same output shifted in time by the same amount; it commutes with the time shift operator from
(4.1.1.4):
$$ \forall (x_j)_{j\in\mathbb{Z}} \in \ell^\infty(\mathbb{Z}) ,\; \forall m \in \mathbb{Z} : \quad F\big( S_m( (x_j)_{j\in\mathbb{Z}} ) \big) = S_m\big( F( (x_j)_{j\in\mathbb{Z}} ) \big) . \tag{4.1.1.6} $$
Since a channel/filter is a mapping between vector spaces, it makes sense to talk about “linearity of F”.
Of course, a signal should not trigger an output before it arrives at the filter; output may depend only on
past and present inputs, not on the future.
A filter F : ℓ∞ (Z ) → ℓ∞ (Z ) is called causal (or physical, or nonanticipative), if the output does not
start before the input
$$ \forall M \in \mathbb{N} : \quad (x_j)_{j\in\mathbb{Z}} \in \ell^\infty(\mathbb{Z}) ,\; x_j = 0 \;\; \forall j \le M \quad\Rightarrow\quad \big( F( (x_j)_{j\in\mathbb{Z}} ) \big)_k = 0 \;\; \forall k \le M . \tag{4.1.1.10} $$
Now we have collected all the properties of the class of filters in the focus of this section, called LT-FIR
filters.
Acronym: LT-FIR =ˆ finite (→ Def. 4.1.1.2), linear (→ Def. 4.1.1.7), time-invariant (→ Def. 4.1.1.5), and
causal (→ Def. 4.1.1.9) filter F : ℓ∞ (Z ) → ℓ∞ (Z )
§4.1.1.11 (Impulse response) For the description of filters we rely on special input signals, analogous to
the description of a linear mapping R n 7→ R m through a matrix, that is, its action on “coordinate vectors”.
The “coordinate vectors” in signal space ℓ∞ (Z ) are so-called impulses, signals that attain the value +1
for a single sampling point in time and are “mute” for all other times.
The impulse response (IR) of a channel/filter is the output for a single unit impulse at t = 0 as input, that is, the input signal is $x_j = \delta_{j,0} := \begin{cases} 1 , & \text{if } j = 0 ,\\ 0 , & \text{else} \end{cases}$ (Kronecker symbol).
The impulse response of a finite filter can be described by a vector h of finite length n. In particular, the
impulse response of a finite and causal filter is a sequence of the form (. . . , 0, h0 , h1 , . . . , hn−1 , 0, . . .),
n ∈ N. Such an impulse response is depicted in Fig. 110.
[Fig. 110: an impulse response with values $h_0, h_1, \ldots, h_{n-1}$.]
(Q4.1.1.13.A) What is the output of an LT-FIR filter with impulse response $(\ldots, 0, h_0, h_1, \ldots, h_{n-1}, 0, \ldots)$, n ∈ N, if the input is a constant signal $(x_j)_{j\in\mathbb{Z}}$, $x_j = a$, $a \in \mathbb{R}$?
Video tutorial for Section 4.1.2 "LT-FIR Linear Mappings": (12 minutes) Download link,
tablet notes
We aim for a precise mathematical description of the impact of a finite, time-invariant, linear, causal filter
on an input signal: Let (. . . , 0, h0 , h1 , . . . , hn−1 , 0, . . .), n ∈ N, be the impulse response (→ 4.1.1.12)
of that finite (→ Def. 4.1.1.2), linear (→ Def. 4.1.1.7), time-invariant (→ Def. 4.1.1.5), and causal (→
Def. 4.1.1.9) filter (LT-FIR) F : ℓ∞ (Z ) → ℓ∞ (Z ):
$$ F\big( (\delta_{j,0})_{j\in\mathbb{Z}} \big) = (\ldots, 0, h_0, h_1, \ldots, h_{n-1}, 0, \ldots) . $$
Every input signal of finite duration, $x_j = 0$ for $j < 0$ and $j \ge m$, can be written as a superposition of shifted unit impulses:
$$ (x_j)_{j\in\mathbb{Z}} = \sum_{k=0}^{m-1} x_k\, (\delta_{j,k})_{j\in\mathbb{Z}} = \sum_{k=0}^{m-1} x_k\, S_k\big( (\delta_{j,0})_{j\in\mathbb{Z}} \big) , \tag{4.1.2.1} $$
where $S_k$ is the time-shift operator from (4.1.1.4). Applying the filter on both sides of this equation and using linearity and time-invariance we obtain
$$ F\big( (x_j)_{j\in\mathbb{Z}} \big) = \sum_{k=0}^{m-1} x_k\, S_k\big( F( (\delta_{j,0})_{j\in\mathbb{Z}} ) \big) . $$
This leads to a fairly explicit formula for the output signal $(y_j)_{j\in\mathbb{Z}} := F\big( (x_j)_{j\in\mathbb{Z}} \big)$:
:
$$ \begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_n \\ \vdots \\ y_{m+n-3} \\ y_{m+n-2} \end{bmatrix} = x_0 \begin{bmatrix} h_0 \\ h_1 \\ \vdots \\ h_{n-1} \\ 0 \\ \vdots \\ 0 \end{bmatrix} + x_1 \begin{bmatrix} 0 \\ h_0 \\ h_1 \\ \vdots \\ h_{n-1} \\ \vdots \\ 0 \end{bmatrix} + x_2 \begin{bmatrix} 0 \\ 0 \\ h_0 \\ \vdots \\ \vdots \\ h_{n-1} \\ 0 \end{bmatrix} + \cdots + x_{m-1} \begin{bmatrix} 0 \\ \vdots \\ 0 \\ h_0 \\ h_1 \\ \vdots \\ h_{n-1} \end{bmatrix} . \tag{4.1.2.3} $$
Thus, in compact notation we can write the non-zero components of the output signal $(y_j)_{j\in\mathbb{Z}}$ as (the channel is causal and finite!)
$$ y_k = \big( F( (x_j)_{j\in\mathbb{Z}} ) \big)_k = \sum_{j=0}^{m-1} h_{k-j}\, x_j , \quad k = 0, \ldots, m+n-2 \qquad (h_j := 0 \text{ for } j < 0 \text{ and } j \ge n) . \tag{4.1.2.4} $$
The output (. . . , 0, y0 , y1 , y2 , . . .) of a finite, time-invariant, linear, and causal channel for finite length
input x = (. . . , 0, x0 , . . . , xm−1 , 0, . . .) ∈ ℓ∞ (Z ) is a superposition of x j -weighted j∆t time-shifted
impulse responses.
EXAMPLE 4.1.2.6 (Visualization: superposition of impulse responses) The following diagrams give
a visual display of the above considerations, namely of the superposition of impulse responses for a
particular finite, time-invariant, linear, and causal filter (LT-FIR), and an input signal of duration 3∆t, ∆t =
ˆ
time between samples. We see the case m = 4, n = 5.
[Figs. 112, 113: the input signal x (m = 4 samples) and the impulse response h (n = 5 samples), plotted over the index i of the sampling instance $t_i$.]
This reflects the fact that the output is a linear superposition of impulse responses:
[Figs. 114–117: the responses to $x_0$, $x_1$, $x_2$, $x_3$, that is, scaled and time-shifted copies of the impulse response, which are added up.]
[Figs. 118, 119: the resulting output signal, plotted over the index i of the sampling instance $t_i$.]
y
The formula (4.1.2.4) characterizing the output sequence (yk )k∈Z is a special case of a fundamental
bilinear operation on pairs of sequences, not necessarily finite.
Given two sequences $(h_k)_{k\in\mathbb{Z}}$, $(x_k)_{k\in\mathbb{Z}}$, at least one of which is finite or decays sufficiently fast, their convolution is another sequence $(y_k)_{k\in\mathbb{Z}}$, defined as
$$ y_k := \sum_{j\in\mathbb{Z}} h_{k-j}\, x_j = \sum_{j\in\mathbb{Z}} x_{k-j}\, h_j , \quad k \in \mathbb{Z} . $$
✎ Notation: For the sequence arising from convolving two sequences $(h_k)_{k\in\mathbb{Z}}$ and $(x_k)_{k\in\mathbb{Z}}$ we write $(x_k) * (h_k)$.
( xk ) ∗ ( hk ) = ( hk ) ∗ ( xk ) .
Computers can deal only with finite amounts of data. So algorithms can operate only on finite signals,
which will be in the focus of this section. This continues the considerations undertaken in the beginning of
Section 4.1.2, now with an emphasis on recasting operations in the language of linear algebra.
Remark 4.1.3.1 (The case of finite signals and filters) Again we consider a finite (→ Def. 4.1.1.2), linear
(→ Def. 4.1.1.7), time-invariant (→ Def. 4.1.1.5), and causal (→ Def. 4.1.1.9) filter (LT-FIR) with impulse
response (. . . , 0, h0 , h1 , . . . , hn−1 , 0, . . .), n ∈ N. From (4.1.2.4) we learn that
We have seen this in (4.1.2.4), where an input signal with m pulses (duration (m − 1)∆t) and an impulse
response with n pulses (duration (n − 1)∆t) spawned an output signal with m + n − 1 pulses (duration
(m − 1 + n − 1)∆t).
Therefore, if we know that all input signals have a duration of at most (m − 1)∆t, which means they are of the form $(\ldots, x_0, x_1, \ldots, x_{m-1}, 0, \ldots)$, we can model them as vectors $x = [x_0, \ldots, x_{m-1}]^\top \in \mathbb{R}^m$, cf. § 4.0.0.1, and the filter can be viewed as a linear mapping $F : \mathbb{R}^m \to \mathbb{R}^{m+n-1}$, which takes us to the realm of linear algebra.
Thus, for the linear filter we have a matrix representation of (4.1.2.4). Let us first look at the special case
m = 4, n = 5 presented in Ex. 4.1.2.6:
\[
\begin{bmatrix} y_0\\ y_1\\ y_2\\ y_3\\ y_4\\ y_5\\ y_6\\ y_7 \end{bmatrix}
= x_0\begin{bmatrix} h_0\\ h_1\\ h_2\\ h_3\\ h_4\\ 0\\ 0\\ 0 \end{bmatrix}
+ x_1\begin{bmatrix} 0\\ h_0\\ h_1\\ h_2\\ h_3\\ h_4\\ 0\\ 0 \end{bmatrix}
+ x_2\begin{bmatrix} 0\\ 0\\ h_0\\ h_1\\ h_2\\ h_3\\ h_4\\ 0 \end{bmatrix}
+ x_3\begin{bmatrix} 0\\ 0\\ 0\\ h_0\\ h_1\\ h_2\\ h_3\\ h_4 \end{bmatrix}.
\]
Here, we have already replaced the sequences with finite-length vectors. Translating this relationship into
matrix-vector notation is easy:
\[
\begin{bmatrix} y_0\\ y_1\\ y_2\\ y_3\\ y_4\\ y_5\\ y_6\\ y_7 \end{bmatrix}
=
\begin{bmatrix}
h_0 & 0 & 0 & 0\\
h_1 & h_0 & 0 & 0\\
h_2 & h_1 & h_0 & 0\\
h_3 & h_2 & h_1 & h_0\\
h_4 & h_3 & h_2 & h_1\\
0 & h_4 & h_3 & h_2\\
0 & 0 & h_4 & h_3\\
0 & 0 & 0 & h_4
\end{bmatrix}
\begin{bmatrix} x_0\\ x_1\\ x_2\\ x_3 \end{bmatrix}.
\]
Writing $\mathbf{y} = [y_0,\dots,y_{m+n-2}]^{\top}\in\mathbb{R}^{m+n-1}$ for the vector of the output signal, we find for the general case the following matrix$\times$vector representation of the action of the filter on the signal:
\[
\mathbf{y} :=
\begin{bmatrix} y_0\\ \vdots\\ \vdots\\ \vdots\\ \vdots\\ y_{m+n-2} \end{bmatrix}
=
\begin{bmatrix}
h_0 & & & 0\\
h_1 & h_0 & & \\
\vdots & h_1 & \ddots & \\
h_{n-1} & \vdots & \ddots & h_0\\
 & h_{n-1} & & h_1\\
 & & \ddots & \vdots\\
0 & & & h_{n-1}
\end{bmatrix}
\begin{bmatrix} x_0\\ \vdots\\ x_{m-1} \end{bmatrix}
=: \mathbf{C}\mathbf{x}\,.
\tag{4.1.3.2}
\]
Note that the $(i+1)$-th column of the matrix $\mathbf{C}\in\mathbb{R}^{m+n-1,m}$ is obtained by cyclically permuting the $i$-th column (its last entry, a zero, moves to the top), $i=1,\dots,m-1$. y
Recall the formula
\[
y_k = \sum_{j=0}^{m-1} h_{k-j}\,x_j\,,\quad k=0,\dots,m+n-2\qquad (h_j := 0 \text{ for } j<0 \text{ and } j\geq n)\,, \tag{4.1.2.4}
\]
supplying the non-zero terms of the convolution of two finite sequences ( hk )k and ( xk )k of length n and
m, respectively. Both can be identified with vectors [ x0 , . . . , xm−1 ]⊤ ∈ K m , [ h0 , . . . , hn−1 ]⊤ ∈ K n , and,
since (4.1.2.4) is a special case of the convolution of sequences introduced in Def. 4.1.2.7, we might call
it a convolution of vectors. It represents a fundamental operation in signal theory.
\[
y_k = \sum_{j=0}^{m-1} h_{k-j}\,x_j\,,\quad k=0,\dots,m+n-2\,, \tag{4.1.3.4}
\]
Remark 4.1.3.5 (Commutativity of discrete convolution) Discrete convolution (4.1.3.4) of two vectors is
a commutative operation mirroring the result from Thm. 4.1.2.9 without the implicit assumptions required
there.
Using the notations of Def. 4.1.3.3 we embed the two vectors into bi-infinite sequences $(\widetilde{x}_j)_{j\in\mathbb{Z}}$, $(\widetilde{h}_j)_{j\in\mathbb{Z}}$ by zero-padding:
\[
\widetilde{x}_j := \begin{cases} x_j & \text{for } j\in\{0,\dots,m-1\}\,,\\ 0 & \text{else}\,,\end{cases}
\qquad
\widetilde{h}_j := \begin{cases} h_j & \text{for } j\in\{0,\dots,n-1\}\,,\\ 0 & \text{else}\,,\end{cases}
\qquad j\in\mathbb{Z}\,.
\]
Thm. 4.1.2.9 confirms $(\widetilde{x}_j)\ast(\widetilde{h}_j) = (\widetilde{h}_j)\ast(\widetilde{x}_j)$, that is, the two sequences have exactly the same terms. By the definition of $\widetilde{x}_j$ and $\widetilde{h}_j$ we see
\[
(\mathbf{h}\ast\mathbf{x})_k = \sum_{j=0}^{m-1}\widetilde{h}_{k-j}\,x_j = \sum_{j\in\mathbb{Z}}\widetilde{h}_{k-j}\,\widetilde{x}_j
= \bigl((\widetilde{x}_j)\ast(\widetilde{h}_j)\bigr)_k = \bigl((\widetilde{h}_j)\ast(\widetilde{x}_j)\bigr)_k
= \sum_{j=0}^{n-1}\widetilde{x}_{k-j}\,h_j = (\mathbf{x}\ast\mathbf{h})_k\,.
\]
The discrete convolution of vectors is commutative. Phrased in signal-processing terminology, filter and signal can be “swapped”: feeding the input $x_0,\dots,x_{n-1}$ into an LT-FIR filter with impulse response $h_0,\dots,h_{n-1}$ yields the same output $y$ as feeding $h_0,\dots,h_{n-1}$ into an LT-FIR filter with impulse response $x_0,\dots,x_{n-1}$.
§4.1.3.6 (Multiplication of polynomials) The formula (4.1.3.4) for the discrete convolution also occurs in
a context completely detached from signal processing. “Surprisingly” the bilinear operation (4.1.2.4) (for
m = n) that takes two input n-vectors and produces an output 2n − 1-vector also provides the coefficients
of the product polynomial.
Concretely, consider two polynomials in t of degree n − 1, n ∈ N, with real or complex coefficients,
\[
p(t) := \sum_{j=0}^{n-1} a_j t^j\,,\qquad q(t) := \sum_{j=0}^{n-1} b_j t^j\,,\qquad a_j, b_j\in\mathbb{K}\,.
\]
Their product is a polynomial of degree $2n-2$:
\[
(pq)(t) = \sum_{k=0}^{2n-2} c_k t^k\,,\qquad c_k := \sum_{\ell=\max\{0,\,k-(n-1)\}}^{\min\{k,\,n-1\}} a_{\ell}\,b_{k-\ell}\,,\quad k=0,\dots,2n-2\,. \tag{4.1.3.7}
\]
Let us introduce dummy coefficients $a_j, b_j$, $j = n,\dots,2n-2$, for $p(t)$ and $q(t)$, all set to $0$. This can easily be done in a computer code by resizing the coefficient vectors of $p$ and $q$ and filling the new entries with zeros (“zero padding”). The above formula for $c_k$ can then be rewritten as
\[
c_j = \sum_{\ell=0}^{j} a_{\ell}\,b_{j-\ell}\,,\quad j=0,\dots,2n-2\,. \tag{4.1.3.8}
\]
Hence, the coefficients of the product polynomial can be obtained as the discrete convolution of the coef-
ficient vectors of p and q:
Moreover, this provides another proof for the commutativity of discrete convolution. y
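As a small illustration of (4.1.3.8), here is a loop-based C++/EIGEN sketch of polynomial multiplication via the discrete convolution of the coefficient vectors. The function name polyMult() and its interface are our choices for illustration, not part of the lecture codes.

#include <Eigen/Dense>
#include <algorithm>
#include <cassert>

// Coefficients of the product polynomial (pq)(t) from the coefficient
// vectors a, b of p and q, according to (4.1.3.7)/(4.1.3.8).
Eigen::VectorXd polyMult(const Eigen::VectorXd &a, const Eigen::VectorXd &b) {
  const Eigen::Index n = a.size();   // both polynomials of degree n-1
  assert(b.size() == n);
  Eigen::VectorXd c = Eigen::VectorXd::Zero(2 * n - 1);
  for (Eigen::Index j = 0; j < 2 * n - 1; ++j) {
    const Eigen::Index l_min = std::max<Eigen::Index>(0, j - (n - 1));
    const Eigen::Index l_max = std::min<Eigen::Index>(j, n - 1);
    for (Eigen::Index l = l_min; l <= l_max; ++l) {
      c(j) += a(l) * b(j - l);       // discrete convolution of coefficient vectors
    }
  }
  return c;
}

For example, $p(t) = 1+2t$ and $q(t) = 3+4t$ give the coefficient vector $[3,10,8]^{\top}$ of $(pq)(t) = 3+10t+8t^2$.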
Remark 4.1.3.10 (Convolution of causal sequences) The notion of a discrete convolution of Def. 4.1.3.3
naturally extends to so-called causal sequences ∈ ℓ∞ (N0 ), that is, bounded mappings N0 7→ K: the
(discrete) convolution of two sequences ( x j ) j∈N0 , (y j ) j∈N0 is the sequence (z j ) j∈N0 defined by
\[
z_k := \sum_{j=0}^{k} x_{k-j}\,y_j = \sum_{j=0}^{k} x_j\,y_{k-j}\,,\quad k\in\mathbb{N}_0\,. \tag{4.1.2.8}
\]
In this context recall the product formula for power series, Cauchy product, which can be viewed as a
multiplication rule for “infinite polynomials” = power series. y
Review question(s) 4.1.3.11 (Discrete convolutions)
(Q4.1.3.11.A) [Calculus of discrete convolutions] Let $\ast:\mathbb{R}^n\times\mathbb{R}^m\to\mathbb{R}^{n+m-1}$ denote the discrete convolution of two vectors as in Def. 4.1.3.3, defined as (C++ indexing, $\mathbf{x} = [x_0,\dots,x_{n-1}]^{\top}$, $\mathbf{h} = [h_0,\dots,h_{m-1}]^{\top}$)
\[
(\mathbf{h}\ast\mathbf{x})_k := \sum_{j=0}^{m-1} h_{k-j}\,x_j\,,\quad k=0,\dots,m+n-2\,. \tag{4.1.3.4}
\]
(Q4.1.3.11.C) [Sum of discrete random variables] Consider two discrete independent random variables
X, Y : Ω → Z, Ω a probability space, and write for the probabilities
Understanding how periodic signals interact with finite, linear, time-invariant, and causal (LT-FIR) filters is an important stepping stone for developing algorithms for more general situations. A signal $(x_j)_{j\in\mathbb{Z}}$ is $n$-periodic, $n\in\mathbb{N}$, if
\[
x_{j+n} = x_j \quad \forall j\in\mathbb{Z}\,.
\]
➣ Though infinite, an n-periodic signal ( x j ) j∈Z is uniquely determined by the finitely many values
x0 , . . . , xn−1 and can be associated with a vector x = [ x0 , . . . , xn−1 ]⊤ ∈ R n .
§4.1.4.2 (Linear filtering of periodic signals) Whenever the input signal of a finite, linear, causal,
time-invariant filter (LT-FIR) F : ℓ∞ (Z ) → ℓ∞ (Z ) with impulse response (. . . , 0, h0 , . . . , hn−1 , 0, . . .)
is n-periodic, so will be the output signal. To elaborate this we start from the convolution for-
mula for sequences from Def. 4.1.2.7 and take into account the n-periodicity to compute the output
(yk )k∈Z := F (( xk )k∈Z ):
\[
y_k = \sum_{j\in\mathbb{Z}} h_{k-j}\,x_j
\overset{\text{Thm. 4.1.2.9}}{=} \sum_{j\in\mathbb{Z}} x_{k-j}\,h_j
\overset{j\leftarrow\nu+\ell n}{=} \sum_{\nu=0}^{n-1}\sum_{\ell\in\mathbb{Z}} x_{k-\nu-\ell n}\,h_{\nu+\ell n}
\overset{\text{periodicity}}{=} \sum_{\nu=0}^{n-1}\Bigl(\sum_{\ell\in\mathbb{Z}} h_{\nu+\ell n}\Bigr) x_{k-\nu}\,,\quad k\in\mathbb{Z}\,. \tag{4.1.4.3}
\]
From the n-periodicity of x j j∈Z we conclude that yk = yk+n for all k ∈ Z. Thus, in the n-periodic setting,
a causal, linear, and time-invariant filter (LT-FIR) will give rise to a linear mapping R n 7→ R n according
to
\[
y_k = \sum_{j=0}^{n-1} p_j\,x_{k-j} = \sum_{j=0}^{n-1} p_{k-j}\,x_j\,. \tag{4.1.4.4}
\]
From (4.1.4.3) we see that the defining terms of the $n$-periodic sequence $(p_k)_{k\in\mathbb{Z}}$ can be computed according to
\[
p_k := \sum_{\ell\in\mathbb{Z}} h_{k+\ell n}\,,\quad k\in\mathbb{Z}\,.
\]
This sequence can be regarded as a periodic impulse response, namely the output generated by the $n$-periodic input sequence $\bigl(\sum_{k\in\mathbb{Z}}\delta_{nk,j}\bigr)_{j\in\mathbb{Z}}$. It must not be mixed up with the impulse response (→ Def. 4.1.1.12) of the filter.
In matrix notation (4.1.4.4) reads
\[
\begin{bmatrix} y_0\\ \vdots\\ \vdots\\ y_{n-1} \end{bmatrix}
=
\underbrace{\begin{bmatrix}
p_0 & p_{n-1} & p_{n-2} & \cdots & \cdots & p_1\\
p_1 & p_0 & p_{n-1} & \ddots & & \vdots\\
p_2 & p_1 & p_0 & \ddots & \ddots & \vdots\\
\vdots & \ddots & \ddots & \ddots & \ddots & \vdots\\
\vdots & & \ddots & \ddots & \ddots & p_{n-1}\\
p_{n-1} & \cdots & \cdots & p_2 & p_1 & p_0
\end{bmatrix}}_{=:\,\mathbf{P}}
\begin{bmatrix} x_0\\ \vdots\\ \vdots\\ x_{n-1} \end{bmatrix}. \tag{4.1.4.6}
\]
The following special variant of a discrete convolution operation is motivated by the preceding § 4.1.4.2.
The discrete periodic convolution of two n-periodic sequences ( pk )k∈Z , ( xk )k∈Z yields the n-
periodic sequence
\[
(y_k) := (p_k)\ast_n(x_k)\,,\qquad y_k := \sum_{j=0}^{n-1} p_{k-j}\,x_j = \sum_{j=0}^{n-1} x_{k-j}\,p_j\,,\quad k\in\mathbb{Z}\,.
\]
The identity claimed in (4.1.4.4) and in Def. 4.1.4.7 can be established by a simple index transformation
ℓ := k − j and subsequent shifting of the sum, which does not change the value thanks to periodicity.
\[
\sum_{j=0}^{n-1} p_{k-j}\,x_j = \sum_{\ell=k-n+1}^{k} \underbrace{p_{\ell}\,x_{k-\ell}}_{n\text{-periodic in }\ell} = \sum_{\ell=0}^{n-1} p_{\ell}\,x_{k-\ell}\,.
\]
This means that the discrete periodic convolution of two sequences commutes.
Since n-periodic sequences can be identified with vectors in K n (see above), we can also introduce the
discrete periodic convolution of vectors:
Def. 4.1.4.7 ➣ discrete periodic convolution of vectors: y = p ∗n x ∈ K n , p, x ∈ K n .
EXAMPLE 4.1.4.8 (Radiative heat transfer) Beyond signal processing discrete periodic convolutions
occur in many mathematical models:
An engineering problem:
✦ cylindrical pipe,
✦ heated on part $\Gamma_H$ of its perimeter (→ prescribed heat flux),
✦ cooled on the remaining perimeter $\Gamma_K$ (→ constant heat flux).
Task: compute local heat fluxes.
Modeling (discretization):
• approximation of the perimeter by a regular $n$-polygon with edges $\Gamma_j$,
• isotropic radiation of each edge $\Gamma_j$ (power $I_j$),
• radiative heat flow $\Gamma_j\to\Gamma_i$: $P_{ji} := \dfrac{\alpha_{ij}}{\pi} I_j$, opening angle $\alpha_{ij} = \pi\gamma_{|i-j|}$, $1\leq i,j\leq n$,
• power balance:
\[
\underbrace{\sum_{i=1,\,i\neq j}^{n} P_{ji}}_{=\,I_j} \;-\; \sum_{i=1,\,i\neq j}^{n} P_{ij} = Q_j\,. \tag{4.1.4.9}
\]
\[
(4.1.4.9)\ \Rightarrow\ \text{LSE:}\quad I_j - \sum_{i=1,\,i\neq j}^{n}\frac{\alpha_{ij}}{\pi}\,I_i = Q_j\,,\quad j=1,\dots,n\,.
\]
e.g., for $n=8$:
\[
\begin{bmatrix}
1 & -\gamma_1 & -\gamma_2 & -\gamma_3 & -\gamma_4 & -\gamma_3 & -\gamma_2 & -\gamma_1\\
-\gamma_1 & 1 & -\gamma_1 & -\gamma_2 & -\gamma_3 & -\gamma_4 & -\gamma_3 & -\gamma_2\\
-\gamma_2 & -\gamma_1 & 1 & -\gamma_1 & -\gamma_2 & -\gamma_3 & -\gamma_4 & -\gamma_3\\
-\gamma_3 & -\gamma_2 & -\gamma_1 & 1 & -\gamma_1 & -\gamma_2 & -\gamma_3 & -\gamma_4\\
-\gamma_4 & -\gamma_3 & -\gamma_2 & -\gamma_1 & 1 & -\gamma_1 & -\gamma_2 & -\gamma_3\\
-\gamma_3 & -\gamma_4 & -\gamma_3 & -\gamma_2 & -\gamma_1 & 1 & -\gamma_1 & -\gamma_2\\
-\gamma_2 & -\gamma_3 & -\gamma_4 & -\gamma_3 & -\gamma_2 & -\gamma_1 & 1 & -\gamma_1\\
-\gamma_1 & -\gamma_2 & -\gamma_3 & -\gamma_4 & -\gamma_3 & -\gamma_2 & -\gamma_1 & 1
\end{bmatrix}
\begin{bmatrix} I_1\\ I_2\\ I_3\\ I_4\\ I_5\\ I_6\\ I_7\\ I_8 \end{bmatrix}
=
\begin{bmatrix} Q_1\\ Q_2\\ Q_3\\ Q_4\\ Q_5\\ Q_6\\ Q_7\\ Q_8 \end{bmatrix}. \tag{4.1.4.10}
\]
This is a linear system of equations with symmetric, singular, and (by Lemma 9.1.0.5, ∑ γi ≤ 1) positive
semidefinite (→ Def. 1.1.2.6) system matrix.
Note that the matrices from (4.1.4.6) and (4.1.4.10) have the same structure!
Also observe that the LSE from (4.1.4.10) can be written by means of the discrete periodic convolution
(→ Def. 4.1.4.7) of vectors y = (1, −γ1 , −γ2 , −γ3 , −γ4 , −γ3 , −γ2 , −γ1 ), x = ( I1 , . . . , I8 )
(4.1.4.10) ↔ y ∗8 x = [ Q1 , . . . , Q8 ] ⊤ .
§4.1.4.11 (Circulant matrices) In Ex. 4.1.4.8 we have already seen a matrix of a special form, the matrix
P in
\[
\begin{bmatrix} y_0\\ \vdots\\ \vdots\\ y_{n-1} \end{bmatrix}
=
\underbrace{\begin{bmatrix}
p_0 & p_{n-1} & p_{n-2} & \cdots & \cdots & p_1\\
p_1 & p_0 & p_{n-1} & \ddots & & \vdots\\
p_2 & p_1 & p_0 & \ddots & \ddots & \vdots\\
\vdots & \ddots & \ddots & \ddots & \ddots & \vdots\\
\vdots & & \ddots & \ddots & \ddots & p_{n-1}\\
p_{n-1} & \cdots & \cdots & p_2 & p_1 & p_0
\end{bmatrix}}_{=:\,\mathbf{P}}
\begin{bmatrix} x_0\\ \vdots\\ \vdots\\ x_{n-1} \end{bmatrix}. \tag{4.1.4.6}
\]
Matrices with this particular structure are so common that they have been given a special name.
✎ Notation: We write circul(p) ∈ K n,n for the circulant matrix generated by the periodic sequence/vector
p = [ p 0 , . . . , p n −1 ] ⊤ ∈ K n
☞ A circulant matrix has constant (main, sub- and super-) diagonals (for which indices j − i = const.).
☞ columns/rows arise by cyclic permutation of the first column/row.
Similar to the case of banded matrices (→ Section 2.7.5) we note that the “information content” of a circulant matrix $\mathbf{C}\in\mathbb{K}^{n,n}$ is just $n$ numbers $\in\mathbb{K}$ (obviously, one vector $\mathbf{u}\in\mathbb{K}^n$ is enough to define a circulant matrix $\mathbf{C}\in\mathbb{K}^{n,n}$).
y
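A minimal sketch of how those $n$ numbers can be expanded into the full dense circulant matrix; the helper name circulant() is our choice (it is not a library function), and the indexing follows $(\mathbf{C})_{ij} = p_{(i-j)\bmod n}$.

#include <Eigen/Dense>

// Dense circulant matrix circul(p) generated by the vector p.
Eigen::MatrixXcd circulant(const Eigen::VectorXcd &p) {
  const Eigen::Index n = p.size();
  Eigen::MatrixXcd C(n, n);
  for (Eigen::Index i = 0; i < n; ++i) {
    for (Eigen::Index j = 0; j < n; ++j) {
      C(i, j) = p(((i - j) % n + n) % n);  // (C)_{ij} = p_{(i-j) mod n}
    }
  }
  return C;
}

Each column is the cyclic downward shift of the previous one, in agreement with the two observations above.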
Supplement 4.1.4.13. Write $\mathbf{Z}((u_k))\in\mathbb{K}^{n,n}$ for the circulant matrix generated by the $n$-periodic sequence $(u_k)_{k\in\mathbb{Z}}$. Denote by $\mathbf{y} := [y_0,\dots,y_{n-1}]^{\top}$, $\mathbf{x} = [x_0,\dots,x_{n-1}]^{\top}$ the vectors associated with $n$-periodic sequences. Then the commutativity of the discrete periodic convolution (→ Def. 4.1.4.7) implies $\mathbf{Z}((y_k))\,\mathbf{x} = \mathbf{Z}((x_k))\,\mathbf{y}$.
Remark 4.1.4.15 (Reduction of discrete convolution to periodic convolution) Recall the discrete convolution (→ Def. 4.1.3.3) of two vectors $\mathbf{a} = [a_0,\dots,a_{n-1}]^{\top}\in\mathbb{K}^n$, $\mathbf{b} = [b_0,\dots,b_{n-1}]^{\top}\in\mathbb{K}^n$:
\[
z_k := (\mathbf{a}\ast\mathbf{b})_k = \sum_{j=0}^{n-1} a_j\,b_{k-j}\,,\quad k=0,\dots,2n-2\qquad (b_k := 0\ \text{for } k<0,\ k\geq n)\,.
\]
Extend both vectors to $(2n-1)$-periodic sequences $(x_k)_{k\in\mathbb{Z}}$, $(y_k)_{k\in\mathbb{Z}}$ by zero padding, $x_j := a_j$, $y_j := b_j$ for $j=0,\dots,n-1$, $x_j := y_j := 0$ for $j=n,\dots,2n-2$, and periodic continuation (Fig. 120 sketches the periodically repeated, zero-padded sequences). The zero components prevent interaction of different periods:
\begin{align*}
z_k &= \sum_{j=0}^{k} a_j\,b_{k-j} = \sum_{j=0}^{k} x_j\,y_{k-j} + \sum_{j=k+1}^{n-1} x_j\underbrace{y_{2n-1+k-j}}_{=0} + \sum_{j=n}^{2n-2}\underbrace{x_j}_{=0}\,y_{2n-1+k-j}\,, & k&=0,\dots,n-1\,,\\
z_k &= \sum_{j=k-n+1}^{n-1} a_j\,b_{k-j} = \sum_{j=0}^{k-n} x_j\underbrace{y_{k-j}}_{=0} + \sum_{j=k-n+1}^{n-1} x_j\,y_{k-j} + \sum_{j=n}^{2n-2}\underbrace{x_j}_{=0}\,y_{k-j}\,, & k&=n,\dots,2n-2\,.
\end{align*}
This makes periodic and non-periodic discrete convolutions coincide. Writing $\mathbf{x},\mathbf{y}\in\mathbb{K}^{2n-1}$ for the defining vectors of $(x_k)_{k\in\mathbb{Z}}$ and $(y_k)_{k\in\mathbb{Z}}$ we find
\[
(\mathbf{a}\ast\mathbf{b})_k = (\mathbf{x}\ast_{2n-1}\mathbf{y})_k\,,\quad k=0,\dots,2n-2\,. \tag{4.1.4.17}
\]
In the spirit of (4.1.3.2) we can switch to a matrix view of the reduction to periodic convolution:
\[
\begin{bmatrix} z_0\\ \vdots\\ \vdots\\ \vdots\\ z_{2n-2} \end{bmatrix}
=
\begin{bmatrix}
b_0 & 0 & \cdots & 0 & b_{n-1} & \cdots & b_1\\
b_1 & b_0 & \ddots & & \ddots & \ddots & \vdots\\
\vdots & \ddots & \ddots & \ddots & & \ddots & b_{n-1}\\
b_{n-1} & & \ddots & \ddots & \ddots & & 0\\
0 & \ddots & & \ddots & \ddots & \ddots & \vdots\\
\vdots & \ddots & \ddots & & \ddots & \ddots & 0\\
0 & \cdots & 0 & b_{n-1} & \cdots & b_1 & b_0
\end{bmatrix}
\begin{bmatrix} a_0\\ \vdots\\ a_{n-1}\\ 0\\ \vdots\\ 0 \end{bmatrix}, \tag{4.1.4.18}
\]
where the system matrix is a $(2n-1)\times(2n-1)$ circulant matrix!
y
Review question(s) 4.1.4.19 (Periodic Convolutions)
(Q4.1.4.19.A) Let $(y_k)$ be the (finite) output signal obtained from an LT-FIR channel $F$ with impulse response $(\dots,0,h_0,h_1,\dots,h_{n-1},0,\dots)$ for a finite input signal $(x_k)$ with duration $(m-1)\Delta t$. For what $p\in\mathbb{N}$ do we get
\[
\sum_{\ell\in\mathbb{Z}} S_{\ell p}\bigl((y_k)\bigr) = F\Bigl(\sum_{\ell\in\mathbb{Z}} S_{\ell p}\bigl((x_k)\bigr)\Bigr)\,,
\]
Video tutorial for Section 4.2.1 "Diagonalizing Circulant Matrices": (17 minutes)
Download link, tablet notes
Algorithms dealing with circulant matrices make use of their very special spectral properties. Full un-
derstanding requires familiarity with the theory of eigenvalues and eigenvectors of matrices from linear
algebra, see [NS02, Ch. 7], [Gut09, Ch. 9].
EXPERIMENT 4.2.1.1 (Eigenvectors of circulant matrices) Now we are about to discover a very deep
truth . . .
We compute eigenvalues and eigenvectors of two circulant matrices $\mathbf{C}_1, \mathbf{C}_2\in\mathbb{R}^{n,n}$ whose generating vectors were chosen randomly (VectorXd::Random(n)).
[Fig.: real and imaginary parts of the eigenvalues of $\mathbf{C}_1$ and $\mathbf{C}_2$] Little relationship between the (complex!) eigenvalues can be observed, as can be expected from random matrices.
[Fig.: eigenvectors 1–8 of circulant matrix 1 and eigenvectors 1–8 of circulant matrix 2] In contrast, the plotted eigenvectors of the two unrelated circulant matrices look essentially the same.
Remark 4.2.1.2 (Eigenvectors of commuting matrices) An abstract result from linear algebra puts the
surprising observation made in Exp. 4.2.1.1 in a wider context.
If A, B ∈ K n,n commute, that is, AB = BA, and A has n distinct eigenvalues, then the
eigenspaces of A and B coincide.
\[
(\mathbf{A}-\lambda\mathbf{I})\mathbf{v} = 0 \;\Rightarrow\; \mathbf{B}(\mathbf{A}-\lambda\mathbf{I})\mathbf{v} = 0 \;\overset{\mathbf{BA}=\mathbf{AB}}{\Rightarrow}\; (\mathbf{A}-\lambda\mathbf{I})\mathbf{B}\mathbf{v} = 0\,.
\]
Since in the case of n distinct eigenvalues dim N (A − λI) = 1, we conclude that there is ξ ∈ K:
Bv = ξv, v is an eigenvector of B. Since the eigenvectors of A span K n , there cannot be eigenvectors
of B that are not eigenvectors of A.
Moreover, there is a basis of K n consisting of eigenvectors of B; B can be diagonalized.
✷
Next, by straightforward calculation one verifies that every circulant matrix commutes with the unitary and
circulant cyclic permutation matrix
\[
\mathbf{S} =
\begin{bmatrix}
0 & 0 & \cdots & \cdots & 0 & 1\\
1 & 0 & & & & 0\\
0 & 1 & 0 & & & \vdots\\
\vdots & \ddots & \ddots & \ddots & & \vdots\\
\vdots & & \ddots & \ddots & \ddots & \vdots\\
0 & \cdots & \cdots & 0 & 1 & 0
\end{bmatrix}. \tag{4.2.1.4}
\]
As a unitary matrix $\mathbf{S}$ can be diagonalized. Observe that $\mathbf{S}^n - \mathbf{I} = \mathbf{O}$, that is, the minimal polynomial of $\mathbf{S}$ is $\xi\mapsto\xi^n-1$, which has $n$ distinct roots (the $n$-th roots of unity). Therefore $\mathbf{S}$ has $n$ distinct eigenvalues and, by Thm. 4.2.1.3, every eigenvector of $\mathbf{S}$ is also an eigenvector of any circulant matrix.
By elementary means we can compute the eigenvectors of $\mathbf{S}$: Assume that $\mathbf{v} = [v_0,\dots,v_{n-1}]^{\top}\in\mathbb{C}^n\setminus\{0\}$ satisfies $\mathbf{S}\mathbf{v} = \lambda\mathbf{v}$ for some $\lambda\in\mathbb{C}$. By the structure (4.2.1.4) of $\mathbf{S}$ this implies for its components the relationships
\[
\lambda v_0 = v_{n-1}\,,\qquad \lambda v_j = v_{j-1}\,,\quad j=1,\dots,n-1\,,
\]
from which $\lambda^n v_0 = v_0$ follows, so $\lambda$ must be an $n$-th root of unity.
Remark 4.2.1.5 (Why using K = C?) In Exp. 4.2.1.1 we saw that we get complex eigenvalues/eigen-
vectors for general circulant matrices. More generally, in many cases real matrices can be diagonalized
only in C, which is the ultimate reason for the importance of complex numbers.
Complex numbers also allow an elegant handling of trigonometric functions: recall from analysis the unified treatment of trigonometric functions via the complex exponential function $t\mapsto\exp(\imath t) = \cos t + \imath\sin t$.
The field of complex numbers C is the natural framework for the analysis of linear, time-invariant
C! filters, and the development of algorithms for circulant matrices.
y
§4.2.1.6 (Eigenvectors of circulant matrices) Now we verify by direct computations that circulant matri-
ces all have a particular set of eigenvectors. This will entail computing in C, cf. Rem. 4.2.1.15.
✎ notation: nth root of unity ωn := exp(−2πı/n) = cos(2π/n) − ı sin(2π/n), n ∈ N
Recall the geometric sum formula
\[
\sum_{k=0}^{n-1} q^k = \frac{1-q^n}{1-q}\quad\forall q\in\mathbb{C}\setminus\{1\}\,,\ n\in\mathbb{N}\,. \tag{4.2.1.9}
\]
Applied to $q = \omega_n^{\,j}$, $j\in\mathbb{Z}\setminus n\mathbb{Z}$, it yields
\[
\sum_{k=0}^{n-1}\omega_n^{kj} = \frac{1-\omega_n^{nj}}{1-\omega_n^{\,j}} = \frac{1-\exp(-2\pi\imath j)}{1-\exp(-2\pi\imath j/n)} = 0\,,
\]
because $\exp(-2\pi\imath j) = \omega_n^{nj} = (\omega_n^{n})^j = 1$ for all $j\in\mathbb{Z}$.
In expressions like ωnkl the term “kl ” will always designate an exponent and will never play
! the role of a superscript.
Now we want to confirm the conjecture gleaned from Exp. 4.2.1.1 that vectors with powers of roots of unity
are eigenvectors for any circulant matrix. We do this by simple and straightforward computations:
We consider a general circulant matrix $\mathbf{C}\in\mathbb{C}^{n,n}$ (→ Def. 4.1.4.12), with $c_{ij} := (\mathbf{C})_{i,j} = u_{i-j}$, for an $n$-periodic sequence $(u_k)_{k\in\mathbb{Z}}$, $u_k\in\mathbb{C}$. We “guess” an eigenvector,
\[
\mathbf{v}_k\in\mathbb{C}^n:\quad \mathbf{v}_k := \bigl[\omega_n^{-jk}\bigr]_{j=0}^{n-1}\,,\quad k\in\{0,\dots,n-1\}\,,
\]
and compute
\[
(\mathbf{C}\mathbf{v}_k)_j = \sum_{l=0}^{n-1} u_{j-l}\,\omega_n^{-lk} = \sum_{l=0}^{n-1} u_l\,\omega_n^{-(j-l)k} = \omega_n^{-jk}\sum_{l=0}^{n-1} u_l\,\omega_n^{lk} = \lambda_k\cdot\omega_n^{-jk} = \lambda_k\cdot(\mathbf{v}_k)_j\,. \tag{4.2.1.10}
\]
Hence $\mathbf{v}_k$ is an eigenvector of $\mathbf{C}$ for the eigenvalue $\displaystyle\lambda_k = \sum_{l=0}^{n-1} u_l\,\omega_n^{lk}$.
The set $\{\mathbf{v}_0,\dots,\mathbf{v}_{n-1}\}\subset\mathbb{C}^n$ provides the so-called orthogonal trigonometric basis of $\mathbb{C}^n$, a basis of eigenvectors shared by all circulant matrices. From (4.2.1.8) we can conclude orthogonality of the basis vectors by straightforward computations:
\[
\mathbf{v}_k := \bigl[\omega_n^{-jk}\bigr]_{j=0}^{n-1}\in\mathbb{C}^n:\qquad
\mathbf{v}_k^{\mathsf H}\mathbf{v}_m = \sum_{j=0}^{n-1}\omega_n^{jk}\,\omega_n^{-jm} = \sum_{j=0}^{n-1}\omega_n^{(k-m)j} \overset{(4.2.1.8)}{=} 0\,,\quad\text{if } k\neq m\,. \tag{4.2.1.12}
\]
The matrix effecting the change of basis from the trigonometric basis to the standard basis is called the Fourier matrix
\[
\mathbf{F}_n =
\begin{bmatrix}
\omega_n^{0} & \omega_n^{0} & \cdots & \omega_n^{0}\\
\omega_n^{0} & \omega_n^{1} & \cdots & \omega_n^{n-1}\\
\omega_n^{0} & \omega_n^{2} & \cdots & \omega_n^{2n-2}\\
\vdots & \vdots & & \vdots\\
\omega_n^{0} & \omega_n^{n-1} & \cdots & \omega_n^{(n-1)^2}
\end{bmatrix}
= \bigl[\omega_n^{\ell j}\bigr]_{\ell,j=0}^{n-1}\in\mathbb{C}^{n,n}\,. \tag{4.2.1.13}
\]
The orthogonality (4.2.1.12) translates into the assertion of Lemma 4.2.1.14, namely that $\frac{1}{\sqrt{n}}\mathbf{F}_n$ is unitary: for $1\leq l,j\leq n$,
\[
\bigl(\mathbf{F}_n\mathbf{F}_n^{\mathsf H}\bigr)_{l,j}
= \sum_{k=0}^{n-1}\omega_n^{(l-1)k}\,\overline{\omega_n^{(j-1)k}}
= \sum_{k=0}^{n-1}\omega_n^{(l-1)k}\,\omega_n^{-(j-1)k}
= \sum_{k=0}^{n-1}\omega_n^{k(l-j)}
= \begin{cases} n\,, & l=j\,,\\ 0\,, & l\neq j\,.\end{cases}
\]
✷ y
✷ y
Remark 4.2.1.15 (Spectrum of Fourier matrix) We draw a conclusion from the properties stated in
Lemma 4.2.1.14:
\[
\tfrac{1}{n^2}\mathbf{F}_n^4 = \mathbf{I} \;\Rightarrow\; \sigma\bigl(\tfrac{1}{\sqrt n}\mathbf{F}_n\bigr)\subset\{1,-1,\imath,-\imath\}\,,
\]
because the eigenvalues of a matrix whose fourth power is the identity must be fourth roots of unity.
Lemma 4.2.1.16. For any circulant matrix $\mathbf{C}\in\mathbb{K}^{n,n}$, $c_{ij} = u_{i-j}$, $(u_k)_{k\in\mathbb{Z}}$ an $n$-periodic sequence, the trigonometric basis vectors $\mathbf{v}_k$ from § 4.2.1.6 satisfy
\[
\mathbf{C}\mathbf{v}_k = \mathbf{v}_k\sum_{\ell=0}^{n-1} u_{\ell}\,\omega_n^{\ell k} = \mathbf{v}_k\sum_{\ell=0}^{n-1} u_{\ell}\,(\mathbf{F}_n)_{k,\ell} = \mathbf{v}_k\,(\mathbf{F}_n\mathbf{u})_k\,.
\]
In other words, the change of basis to the trigonometric basis diagonalizes $\mathbf{C}$:
\[
\mathbf{C} = \mathbf{F}_n^{-1}\operatorname{diag}(d_0,\dots,d_{n-1})\,\mathbf{F}_n\,,\qquad [d_0,\dots,d_{n-1}]^{\top} = \mathbf{F}_n[u_0,\dots,u_{n-1}]^{\top}\,. \tag{4.2.1.17}
\]
As a consequence of Lemma 4.2.1.16 and (4.2.1.17), multiplication with the Fourier matrix will be a crucial operation in algorithms for circulant matrices and discrete convolutions. Therefore this operation has been given a special name:
The linear map $\mathsf{DFT}_n:\mathbb{C}^n\to\mathbb{C}^n$, $\mathsf{DFT}_n(\mathbf{y}) := \mathbf{F}_n\mathbf{y}$, $\mathbf{y}\in\mathbb{C}^n$, is called discrete Fourier transform (DFT), i.e., for $[c_0,\dots,c_{n-1}]^{\top} := \mathsf{DFT}_n(\mathbf{y})$,
\[
c_k = \sum_{j=0}^{n-1} y_j\,\omega_n^{kj} = \sum_{j=0}^{n-1} y_j\exp\Bigl(-2\pi\imath\frac{kj}{n}\Bigr)\,,\quad k=0,\dots,n-1\,. \tag{4.2.1.19}
\]
Recall the convention also adopted for the discussion of the DFT: vector indexes range from 0 to n − 1!
Terminology: The result of DFT, c = DFTn (y) = Fn y, is also called the (discrete) Fourier transform of
y.
From $\mathbf{F}_n^{-1} = \frac{1}{n}\overline{\mathbf{F}}_n$ (→ Lemma 4.2.1.14) we find the inverse discrete Fourier transform:
\[
c_k = \sum_{j=0}^{n-1} y_j\,\omega_n^{kj} \quad\Longleftrightarrow\quad y_j = \frac{1}{n}\sum_{k=0}^{n-1} c_k\,\omega_n^{-kj}\,. \tag{4.2.1.20}
\]
#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>
#include <complex>
#include <iostream>

int main() {
  using Comp = std::complex<double>;
  const int n = 5;
  Eigen::VectorXcd y(n);
  Eigen::VectorXcd c(n);
  Eigen::VectorXcd x(n);
  y << Comp(1, 0), Comp(2, 1), Comp(3, 2), Comp(4, 3), Comp(5, 4);
  Eigen::FFT<double> fft;  // DFT transform object
  c = fft.fwd(y);          // DFT of y, see Def. 4.2.1.18
  x = fft.inv(c);          // inverse DFT of c, see (4.2.1.20)
  std::cout << "y = " << y.transpose() << std::endl
            << "c = " << c.transpose() << std::endl
            << "x = " << x.transpose() << std::endl;
  return 0;
}
• PYTHON functions for the discrete Fourier transform (and its inverse) are provided by the package scipy.fft:
DFT: c = scipy.fft.fft(y) ↔ inverse DFT: y = scipy.fft.ifft(c),
where y and c are numpy arrays.
y
Recall that for any circulant matrix $\mathbf{C}\in\mathbb{K}^{n,n}$, $c_{ij} = u_{i-j}$, $(u_k)_{k\in\mathbb{Z}}$ an $n$-periodic sequence, the diagonalization (4.2.1.17) holds true. Use it to outline an efficient algorithm for computing the singular-value decomposition (SVD) of a circulant matrix.
Hint.
• Every z ∈ C can be written as z = rz0 with r ≥ 0 and |z0 | = 1.
• For matrices A ∈ C m,n the full SVD reads A = UΣVH and involves unitary factors U ∈ C m,m
and V ∈ C n,n .
(Q4.2.1.23.F) [Diagonal circulant matrices] Characterize the set of all complex diagonal circulant
n × n-matrices, n ∈ N. Is this set a subspace of R n,n ?
△
Coding the formula for the discrete periodic convolution of two periodic sequences from Def. 4.1.4.7,
\[
(y_k) := (u_k)\ast_n(x_k)\,,\qquad y_k := \sum_{j=0}^{n-1} u_{k-j}\,x_j = \sum_{j=0}^{n-1} x_{k-j}\,u_j\,,\quad k\in\{0,\dots,n-1\}\,,
\]
one could do this in a straightforward manner using two nested loops as in the following code, with an asymptotic computational effort of $O(n^2)$ for $n\to\infty$.

Eigen::VectorXcd pconv(const Eigen::VectorXcd &u, const Eigen::VectorXcd &x) {
  const Eigen::Index n = x.size();
  Eigen::VectorXcd z = Eigen::VectorXcd::Zero(n);
  // double loop: direct evaluation of the periodic convolution formula below
  for (Eigen::Index k = 0; k < n; ++k) {
    for (Eigen::Index j = 0; j <= k; ++j) {
      z(k) += u(k - j) * x(j);      // terms with k - j >= 0
    }
    for (Eigen::Index j = k + 1; j < n; ++j) {
      z(k) += u(n + k - j) * x(j);  // periodic wrap-around: u_{k-j} = u_{n+k-j}
    }
  }
  return z;
}
This code relies on the associated vectors $\mathbf{u} = [u_0,\dots,u_{n-1}]^{\top}\in\mathbb{C}^n$ and $\mathbf{x} = [x_0,\dots,x_{n-1}]^{\top}\in\mathbb{C}^n$ for the sequences $(u_k)$ and $(x_k)$, respectively. Using these vectors, indexed from $0$, the periodic convolution formula becomes
\[
y_k = \sum_{j=0}^{k}(\mathbf{u})_{k-j}(\mathbf{x})_j + \sum_{j=k+1}^{n-1}(\mathbf{u})_{n+k-j}(\mathbf{x})_j\,.
\]
Let us assume that a “magic” very efficient implementation of the discrete Fourier transform (DFT) is
available (→ Section 4.3). Then a much faster implementation of pconv() is possible and it is based
on the link with the periodic discrete convolution of Def. 4.1.4.7. In § 4.1.4.11 we have seen that periodic
convolution amounts to multiplication with a circulant matrix. In addition, (4.2.1.17) reduces multiplication
with a circulant matrix to two multiplications with the Fourier matrix Fn (= DFT) and (componentwise)
scaling operations. This suggests how to exploit the equivalence
\[
\text{discrete periodic convolution}\quad z_k = \sum_{j=0}^{n-1} u_{k-j}\,x_j\ (\to\text{Def. 4.1.4.7}),\ k=0,\dots,n-1
\quad\Updownarrow\quad
\text{multiplication with a circulant matrix}\ (\to\text{Def. 4.1.4.12})\quad \mathbf{z} = \mathbf{C}\mathbf{x}\,,\quad \mathbf{C} := \bigl[u_{i-j}\bigr]_{i,j=1}^{n}\,.
\]
Idea: (4.2.1.17) ➣ $\mathbf{z} = \mathbf{F}_n^{-1}\bigl(\operatorname{diag}(\mathbf{F}_n\mathbf{u})\,\mathbf{F}_n\mathbf{x}\bigr)$
Cast into a C++ function computing the periodic discrete convolution of two vectors, the convolution theorem reads as follows:
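The lecture's Code 4.2.2.4 (pconvfft()) is not reproduced in this excerpt; the following is a minimal sketch of such a DFT-based periodic convolution, a direct transcription of $\mathbf{z} = \mathbf{F}_n^{-1}\bigl(\operatorname{diag}(\mathbf{F}_n\mathbf{u})\,\mathbf{F}_n\mathbf{x}\bigr)$, which the fastconv() code below relies on.

Eigen::VectorXcd pconvfft(const Eigen::VectorXcd &u, const Eigen::VectorXcd &x) {
  Eigen::FFT<double> fft;
  const Eigen::VectorXcd uh = fft.fwd(u);             // F_n u
  const Eigen::VectorXcd xh = fft.fwd(x);             // F_n x
  const Eigen::VectorXcd prod = uh.cwiseProduct(xh);  // diag(F_n u) F_n x
  return fft.inv(prod);                               // multiplication with F_n^{-1}
}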
In Rem. 4.1.4.15 we learned that the discrete convolution of n-vectors (→ Def. 4.1.3.3) can be
accomplished by the periodic discrete convolution of 2n − 1-vectors (obtained by zero padding, see
Rem. 4.1.4.15):
\[
\mathbf{a},\mathbf{b}\in\mathbb{C}^n:\qquad \mathbf{a}\ast\mathbf{b} = \begin{bmatrix}\mathbf{a}\\ \mathbf{0}\end{bmatrix}\ast_{2n-1}\begin{bmatrix}\mathbf{b}\\ \mathbf{0}\end{bmatrix}\in\mathbb{C}^{2n-1}\,.
\]
This idea underlies the following C++ implementation of the discrete convolution of two vectors.
C++ code 4.2.2.5: Implementation of discrete convolution (→ Def. 4.1.3.3) based on periodic
discrete convolution ➺ GITLAB
Eigen::VectorXcd fastconv(const Eigen::VectorXcd &h, const Eigen::VectorXcd &x) {
  assert(x.size() == h.size());
  const Eigen::Index n = h.size();
  // Zero padding, cf. (4.1.4.16), and periodic discrete convolution
  // of length 2n-1, Code 4.2.2.4
  return pconvfft(
      (Eigen::VectorXcd(2 * n - 1) << h, Eigen::VectorXcd::Zero(n - 1)).finished(),
      (Eigen::VectorXcd(2 * n - 1) << x, Eigen::VectorXcd::Zero(n - 1)).finished());
}
Here, pconvfft() implements the DFT-based periodic discrete convolution. How do you have to change the code so that it can compute the discrete convolution of two vectors $\mathbf{h}\in\mathbb{R}^n$, $\mathbf{x}\in\mathbb{R}^m$ for general $n,m\in\mathbb{N}$?
△
Video tutorial for Section 4.2.3 "Frequency filtering via DFT": (20 minutes) Download link,
tablet notes
The trigonometric basis vectors, i.e. the columns of the Fourier matrix $\mathbf{F}_n$, when interpreted as time-periodic signals, represent harmonic oscillations. This is illustrated when plotting some vectors of the trigonometric basis ($n=16$):
[Fig.: Fourier-basis vectors for $n=16$, $j = 1, 7, 15$ (value vs. component index)]
The discrete Fourier transform and its inverse,
\[
c_k = \sum_{j=0}^{n-1} y_j\,\omega_n^{kj} \quad\Longleftrightarrow\quad y_j = \frac{1}{n}\sum_{k=0}^{n-1} c_k\,\omega_n^{-kj}\,, \tag{4.2.1.20}
\]
effect the transformation from the “pulse basis” into a (scaled) trigonometric basis and vice versa. Thus, they convert the time-domain and the frequency-domain representation of a signal into each other. To see this more clearly, we examine a real-valued signal of length $n = 2m+1$, $m\in\mathbb{N}$: $y_k\in\mathbb{R}$. Its DFT yields $c_0,\dots,c_{n-1}$, and these coefficients satisfy $c_k = \overline{c_{n-k}}$, because $\omega_n^{kj} = \overline{\omega_n^{(n-k)j}}$. Using this relationship we can write the original signal as a linear combination of sampled trigonometric functions with “frequencies” $k=0,\dots,m$:
\begin{align*}
n\,y_j &= c_0 + \sum_{k=1}^{m} c_k\,\omega_n^{-kj} + \sum_{k=m+1}^{2m} c_k\,\omega_n^{-kj}
       = c_0 + \sum_{k=1}^{m}\bigl(c_k\,\omega_n^{-kj} + c_{n-k}\,\omega_n^{(k-n)j}\bigr)\\
      &= c_0 + 2\sum_{k=1}^{m}\bigl[\operatorname{Re}(c_k)\cos(2\pi\,kj/n) - \operatorname{Im}(c_k)\sin(2\pi\,kj/n)\bigr]\,,\quad j=0,\dots,n-1\,.
\end{align*}
EXAMPLE 4.2.3.3 (Frequency identification with DFT) The following C++ code generates a periodic signal composed of two base frequencies and distorts it by adding large random noise:
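The original listing is not contained in this excerpt; here is a hedged sketch of how such a test signal and its power spectrum might be produced with EIGEN. The two base frequencies (5 and 13), the length n = 64, and the noise amplitude are illustrative choices of ours, not taken from the lecture code.

#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>
#include <cmath>
#include <complex>
#include <iostream>

int main() {
  const int n = 64;                        // number of sampling points
  Eigen::VectorXd y(n);
  for (int j = 0; j < n; ++j) {            // two harmonic components ...
    const double t = static_cast<double>(j) / n;
    y(j) = std::sin(2 * M_PI * 5 * t) + 0.5 * std::cos(2 * M_PI * 13 * t);
  }
  y += 0.8 * Eigen::VectorXd::Random(n);   // ... plus additive random "noise"
  Eigen::VectorXcd yc = y.cast<std::complex<double>>();
  Eigen::FFT<double> fft;
  Eigen::VectorXcd c = fft.fwd(yc);        // DFT of the signal
  Eigen::VectorXd p = c.cwiseAbs2();       // power spectrum |c_k|^2
  // the two base frequencies appear as pronounced peaks (plus their mirror images)
  std::cout << p.transpose() << std::endl;
  return 0;
}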
Fig. 123: the noisy signal in time domain (signal value vs. sampling points); Fig. 124: squared moduli $|c_k|^2$ of its Fourier coefficients vs. coefficient index $k$.
Looking at the time-domain plot of the signal given in Fig. 123 (C++ code ➺ GITLAB) it is hard to discern the underlying base frequencies. However, the frequencies present in the unperturbed signal are clearly visible in the frequency-domain representation after DFT. y
§4.2.3.6 (“Low” and “high” frequencies) Again, look at plots of the real parts of the trigonometric basis vectors $(\mathbf{F}_n)_{:,j}$ (= columns of the Fourier matrix), $n = 16$ (Fig. 125).
[Fig.: real parts of the trigonometric basis vectors for $n=16$, $j = 0,\dots,8$, plotted against the vector component index]
(Here we adopt C++ indexing and count the columns of the matrix from 0.)
Visually, the different basis vectors represent oscillatory signals with different frequencies; the indices $j\approx n/2$ belong to the fastest oscillations.
The task of frequency filtering is to suppress or enhance a predefined range of frequencies contained in a
signal.
➊ Perform DFT of the signal: $(y_0,\dots,y_{n-1}) \overset{\mathsf{DFT}}{\longmapsto} (c_0,\dots,c_{n-1})$
➋ Operate on the Fourier coefficients: $(c_0,\dots,c_{n-1}) \mapsto (\widetilde{c}_0,\dots,\widetilde{c}_{n-1})$
➌ Obtain the filtered signal by inverse DFT: $(\widetilde{c}_0,\dots,\widetilde{c}_{n-1}) \overset{\mathsf{DFT}^{-1}}{\longmapsto} (\widetilde{y}_0,\dots,\widetilde{y}_{n-1})$
The following code does digital low-pass and high-pass filtering of a signal based on DFT and inverse
DFT. It sets the obtained Fourier coefficients corresponding to high/low frequencies to zero and afterwards
transforms back to time domain.
VectorXcd clow = c;
// Set high frequency coefficients to zero, Fig. 127
for (int j = -k; j <= +k; ++j) {
  clow(m + j) = 0;
}
// (Complementary) vector of high frequency coefficients
const VectorXcd chigh = c - clow;
[Fig.: the signal and its low-pass/high-pass filtered versions in time domain (left); moduli $|c_k|$ of the Fourier coefficients vs. coefficient index (right)]
Low pass filtering can be used for denoising, that is, the removal of high frequency perturbations of a
signal. y
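Since the listing above is only an excerpt, here is a self-contained sketch of DFT-based low-pass filtering along the lines of steps ➊–➌. The function name lowpass() and the convention of zeroing the $2k+1$ coefficients around the middle index (the highest frequencies, cf. § 4.2.3.6) are our choices and need not match the lecture's code.

#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>
#include <complex>

// Suppress the 2*k+1 Fourier coefficients around the middle index and transform back.
Eigen::VectorXd lowpass(const Eigen::VectorXd &y, int k) {
  const int n = y.size();
  const int m = n / 2;                        // middle index ~ highest frequency
  Eigen::FFT<double> fft;
  Eigen::VectorXcd yc = y.cast<std::complex<double>>();
  Eigen::VectorXcd c = fft.fwd(yc);           // step 1: DFT
  for (int j = -k; j <= k; ++j) {             // step 2: zero high-frequency coefficients
    c((m + j + n) % n) = 0;
  }
  Eigen::VectorXcd yf = fft.inv(c);           // step 3: inverse DFT
  return yf.real();                           // filtered (real-valued) signal
}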
EXAMPLE 4.2.3.9 (Sound filtering by DFT) Frequency filtering is ubiquitous in sound processing.
Here we demonstrate it in P YTHON ➺ GITLAB, which offers tools for audio processing through the
sounddevice module.
Fig. 128: the audio signal in time domain (sound pressure vs. time [s]); Fig. 129: its power spectrum $|c_k|^2$ vs. index $k$ of the Fourier coefficients $c_k$, with the Nyquist frequency marked.
The audio signal (duration ≈ 1.5s) of a human voice is plotted in time domain (vector y ∈ R n , n = 63274)
and as a power spectrum in frequency domain. The power spectrum of a signal $\mathbf{y}\in\mathbb{C}^n$ is the vector $\bigl(|c_j|^2\bigr)_{j=0}^{n-1}$, where $\mathbf{c} = \mathsf{DFT}_n\mathbf{y} = \mathbf{F}_n\mathbf{y}$ is the discrete Fourier transform of $\mathbf{y}$.
We see that the bulk of the signal’s power $\|\mathbf{y}\|_2^2$ is contained in the low-frequency components. This paves the way for compressing the signal by low-pass filtering, that is, by dropping its high-frequency components and storing or transmitting only the discrete Fourier coefficients belonging to low frequencies.
Refer to § 4.2.3.6 for precise information about the association of Fourier coefficients c j with low and high
frequencies.
Below we plot the squared moduli |c j |2 of the Fourier coefficients belonging to low frequencies and the
low-pass filtered sound signals for different cut-off frequencies. Taking into account only low-frequency
discrete Fourier coefficients does not severely distort the sound signal.
Fig. 130: squared moduli $|c_k|^2$ of the Fourier coefficients belonging to low frequencies; Fig. 131: low-pass filtered sound signal (sound pressure vs. time [s]) for different cut-off frequencies.
y
Remark 4.2.3.10 (Linear Filtering) Low-pass and high-pass filtering via DFT implement a very general
policy underlying general linear filtering of finite, time-discrete signals, represented by vectors y ∈ C n
➊ Compute the coefficients of an alternative basis representation of $\mathbf{y}$.
➋ Apply some linear mapping to the coefficient vector $\mathbf{c}\in\mathbb{C}^n$ obtained in ➊, yielding $\widetilde{\mathbf{c}}$.
➌ Recover the representation of $\widetilde{\mathbf{c}}$ in the standard basis of $\mathbb{C}^n$.
y
Review question(s) 4.2.3.11 (Frequency Filtering via DFT)
(Q4.2.3.11.A) Let y ∈ R n , n = 2k , k ∈ N, be a vector describing an analog time-discrete finite signal.
Denote by c ∈ C n its discrete Fourier transform. Which components of c are related to “low-frequency
content”, and which to “high-frequency content”?
△
The DFT of a real vector $\mathbf{y}\in\mathbb{R}^n$, $n = 2m$, can be obtained from a single complex DFT of length $m$ applied to the vector $(y_{2j} + \imath y_{2j+1})_{j=0}^{m-1}$, whose DFT $(h_k)_{k=0}^{m-1}$ satisfies (with $h_m := h_0$)
\begin{align}
h_k &= \sum_{j=0}^{m-1}(y_{2j} + \imath y_{2j+1})\,\omega_m^{jk}
     = \sum_{j=0}^{m-1} y_{2j}\,\omega_m^{jk} + \imath\cdot\sum_{j=0}^{m-1} y_{2j+1}\,\omega_m^{jk}\,, \tag{4.2.4.1}\\
\overline{h_{m-k}} &= \sum_{j=0}^{m-1}\overline{(y_{2j} + \imath y_{2j+1})\,\omega_m^{j(m-k)}}
     = \sum_{j=0}^{m-1} y_{2j}\,\omega_m^{jk} - \imath\cdot\sum_{j=0}^{m-1} y_{2j+1}\,\omega_m^{jk}\,. \tag{4.2.4.2}
\end{align}
Thus, we can recover the framed sums from suitable combinations of the discrete Fourier coefficients $h_k\in\mathbb{C}$, $k=0,\dots,m-1$:
\[
\sum_{j=0}^{m-1} y_{2j}\,\omega_m^{jk} = \tfrac12\bigl(h_k + \overline{h_{m-k}}\bigr)\,,\qquad
\sum_{j=0}^{m-1} y_{2j+1}\,\omega_m^{jk} = -\tfrac12\imath\bigl(h_k - \overline{h_{m-k}}\bigr)\,.
\]
Use simple identities for roots of unity to split the DFT of $\mathbf{y}$ into two sums:
\[
c_k = \sum_{j=0}^{n-1} y_j\,\omega_n^{jk} = \sum_{j=0}^{m-1} y_{2j}\,\omega_m^{jk} + \omega_n^{k}\cdot\sum_{j=0}^{m-1} y_{2j+1}\,\omega_m^{jk}\,. \tag{4.2.4.3}
\]
\[
\begin{cases}
c_k = \tfrac12\bigl(h_k + \overline{h_{m-k}}\bigr) - \tfrac12\imath\,\omega_n^{k}\bigl(h_k - \overline{h_{m-k}}\bigr)\,, & k=0,\dots,m-1\,,\\
c_m = \operatorname{Re}\{h_0\} - \operatorname{Im}\{h_0\}\,, &\\
c_k = \overline{c_{n-k}}\,, & k=m+1,\dots,n-1\,.
\end{cases} \tag{4.2.4.4}
\]
Eigen::FFT<double> fft;
VectorXcd d = fft.fwd(yc);
VectorXcd h(m + 1);
h << d, d(0);

c.resize(n);
// Step II: implementation of (4.2.4.4)
for (Eigen::Index k = 0; k < m; ++k) {
  c(k) = (h(k) + std::conj(h(m - k))) / 2. -
         i / 2. *
             std::exp(-2. * static_cast<double>(k) / static_cast<double>(n) * M_PI * i) *
             (h(k) - std::conj(h(m - k)));
}
c(m) = std::real(h(0)) - std::imag(h(0));
for (Eigen::Index k = m + 1; k < n; ++k) {
  c(k) = std::conj(c(n - k));
}
}
Review question(s) 4.2.4.6 (Frequency Filtering via DFT and real DFT)
(Q4.2.4.6.A) For $\mathbf{y}\in\mathbb{R}^n$, what is the result of the linear mapping
\[
\mathbf{y}\mapsto \operatorname{Re}\Bigl\{\mathsf{DFT}_n^{-1}\bigl[(\mathsf{DFT}_n\mathbf{y})_1, 0,\dots,0\bigr]^{\top}\Bigr\}\,?
\]
for 0 ≤ tol < 1. Discuss to what extent this function can be used for the compression of a sound
signal.
(Q4.2.4.6.C) How would you implement a C++ function
Eigen::VectorXd reconstructFromFrequencies(
    const std::vector<std::pair<int, std::complex<double>>> &f);
that takes the output of selectDominantFrequencies() from Question (Q4.2.4.6.B) and returns
the compressed signal in time domain?
Take into account that selectDominantFrequencies() merely looks at the first half of the dis-
crete Fourier coefficients.
△
Finite time-discrete signals are naturally described by vectors, recall § 4.0.0.1. They can be regarded as
one-dimensional, and typical specimens are audio data given in WAV (Waveform Audio) format. Other
types of data also have to be sent through channels, most importantly, images that can be viewed as two-
dimensional data. The natural linear-algebra style representation of an image is a matrix, see Ex. 3.4.4.24.
In this section we study the frequency decomposition of matrices. Due to the natural analogy
one-dimensional data (“audio signal”) ←→ vector y ∈ C n ,
§4.2.5.1 (Matrix Fourier modes) The (inverse) discrete Fourier transform of a vector computes the coefficients of its representation in the basis provided by the columns of the Fourier matrix $\mathbf{F}_n$. The $k$-th column can be obtained by sampling a harmonic oscillation of frequency $k$:
\[
(\mathbf{F}_n)_{:,k} = \bigl[\cos(2\pi k t_j)\bigr]_{j=0}^{n-1} - \imath\bigl[\sin(2\pi k t_j)\bigr]_{j=0}^{n-1}\,,\quad t_j := \tfrac{j}{n}\,,\quad k=0,\dots,n-1\,.
\]
What are the 2D counterparts of these vectors? The matrices obtained by sampling products of trigonometric functions, that is, the tensor-product matrices $(\mathbf{F}_m)_{:,j_1}(\mathbf{F}_n)_{:,j_2}^{\top}\in\mathbb{C}^{m,n}$, $0\leq j_1<m$, $0\leq j_2<n$.
Let a matrix $\mathbf{C}\in\mathbb{C}^{m,n}$ be given as a linear combination of these basis matrices with coefficients $y_{j_1,j_2}\in\mathbb{C}$, $0\leq j_1<m$, $0\leq j_2<n$:
\[
\mathbf{C} = \sum_{j_1=0}^{m-1}\sum_{j_2=0}^{n-1} y_{j_1,j_2}\,(\mathbf{F}_m)_{:,j_1}(\mathbf{F}_n)_{:,j_2}^{\top}\,. \tag{4.2.5.3}
\]
Then the entries of $\mathbf{C}$ can be computed by two nested discrete Fourier transforms:
\[
(\mathbf{C})_{k_1,k_2} = \sum_{j_1=0}^{m-1}\sum_{j_2=0}^{n-1} y_{j_1,j_2}\,\omega_m^{j_1 k_1}\,\omega_n^{j_2 k_2}
= \sum_{j_1=0}^{m-1}\omega_m^{j_1 k_1}\Bigl(\sum_{j_2=0}^{n-1}\omega_n^{j_2 k_2}\,y_{j_1,j_2}\Bigr)\,,\quad 0\leq k_1<m\,,\ 0\leq k_2<n\,.
\]
The coefficients $y_{j_1,j_2}\in\mathbb{C}$, $0\leq j_1<m$, $0\leq j_2<n$, can also be regarded as the entries of a matrix $\mathbf{Y}\in\mathbb{C}^{m,n}$. Thus we can rewrite the above expression: for all $0\leq k_1<m$, $0\leq k_2<n$,
\[
(\mathbf{C})_{k_1,k_2} = \sum_{j_1=0}^{m-1}\bigl(\mathbf{F}_n(\mathbf{Y})_{j_1,:}^{\top}\bigr)_{k_2}\,\omega_m^{j_1 k_1}
\quad\Longrightarrow\quad
\mathbf{C} = \mathbf{F}_m\bigl(\mathbf{F}_n\mathbf{Y}^{\top}\bigr)^{\top} = \mathbf{F}_m\mathbf{Y}\mathbf{F}_n\,, \tag{4.2.5.4}
\]
because $\mathbf{F}_n^{\top} = \mathbf{F}_n$. This formula defines the two-dimensional discrete Fourier transform of the matrix $\mathbf{Y}\in\mathbb{C}^{m,n}$. We abbreviate it by $\mathsf{DFT}_{m,n}:\mathbb{C}^{m,n}\to\mathbb{C}^{m,n}$. Conversely,
\[
\mathbf{C} = \sum_{j_1=0}^{m-1}\sum_{j_2=0}^{n-1} y_{j_1,j_2}\,(\mathbf{F}_m)_{:,j_1}(\mathbf{F}_n)_{:,j_2}^{\top}
\;\Rightarrow\;
\mathbf{Y} = \mathbf{F}_m^{-1}\mathbf{C}\,\mathbf{F}_n^{-1} = \tfrac{1}{mn}\,\overline{\mathbf{F}}_m\mathbf{C}\,\overline{\mathbf{F}}_n\,. \tag{4.2.5.5}
\]
The following two codes implement (4.2.5.4) and (4.2.5.5) using the DFT facilities of E IGEN.
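Those two codes are not part of this excerpt. As a minimal sketch of the forward transform (4.2.5.4), realized by one-dimensional DFTs of all rows followed by all columns, one might write the following value-returning variant (the name fft2 also appears, with a different signature, in the review questions of Section 4.3; the body here is our reconstruction, not the lecture's Code 4.2.5.6).

#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>

// Two-dimensional DFT according to (4.2.5.4): C = F_m * Y * F_n.
Eigen::MatrixXcd fft2(const Eigen::MatrixXcd &Y) {
  const Eigen::Index m = Y.rows(), n = Y.cols();
  Eigen::MatrixXcd tmp(m, n), C(m, n);
  Eigen::FFT<double> fft;
  for (Eigen::Index r = 0; r < m; ++r) {   // DFT of length n applied to every row
    Eigen::VectorXcd row = Y.row(r).transpose();
    Eigen::VectorXcd tr = fft.fwd(row);
    tmp.row(r) = tr.transpose();
  }
  for (Eigen::Index j = 0; j < n; ++j) {   // DFT of length m applied to every column
    Eigen::VectorXcd col = tmp.col(j);
    C.col(j) = fft.fwd(col);
  }
  return C;
}

The inverse transform (4.2.5.5) can be obtained from the same routine via conjugation and scaling, see the sketch further below.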
Remark 4.2.5.8 (Two-dimensional DFT in PYTHON) The two-dimensional DFT is provided by the PYTHON function numpy.fft.fft2(Y).
In Section 4.2.2 we established the close connection between discrete (periodic) convolutions and discrete Fourier transforms. This can also be done in two dimensions.
We consider the following bilinear mapping B : C m,n × C m,n → C m,n :
\[
\bigl(B(\mathbf{X},\mathbf{Y})\bigr)_{k,\ell} = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(\mathbf{X})_{i,j}\,(\mathbf{Y})_{(k-i)\bmod m,\,(\ell-j)\bmod n}\,,\quad
\begin{array}{l} k=0,\dots,m-1\,,\\ \ell=0,\dots,n-1\,.\end{array} \tag{4.2.5.11}
\]
Here, as in (4.2.5.10), $\bmod$ designates the remainder of integer division, like the % operator in C++, and is applied to indices of matrix entries. The formula (4.2.5.11) defines the two-dimensional discrete periodic convolution, cf. Def. 4.1.4.7. Generalizing the notation for the 1D discrete periodic convolution (4.2.5.10) we also write
\[
\mathbf{X}\ast_{m,n}\mathbf{Y} := B(\mathbf{X},\mathbf{Y}) = \Bigl[\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(\mathbf{X})_{i,j}\,(\mathbf{Y})_{(k-i)\bmod m,\,(\ell-j)\bmod n}\Bigr]_{\substack{k=0,\dots,m-1\\ \ell=0,\dots,n-1}}\,,\quad \mathbf{X},\mathbf{Y}\in\mathbb{C}^{m,n}\,.
\]
A direct loop-based implementation of the formula (4.2.5.11) involves an asymptotic computational effort
of O(m2 n2 ) for m, n → ∞.
The key discovery of Section 4.2.1 about the diagonalization of the discrete periodic convolution operation
in the Fourier basis carries over to two dimensions, because 2D discrete periodic convolution admits a
diagonalization by switching to the trigonometric basis of C m,n , analogous to (4.2.1.17).
In (4.2.5.11) set $\mathbf{Y} = (\mathbf{F}_m)_{:,r}(\mathbf{F}_n)_{s,:}\in\mathbb{C}^{m,n}$, i.e. $(\mathbf{Y})_{i,j} = \omega_m^{ri}\,\omega_n^{sj}$, $0\leq i<m$, $0\leq j<n$:
\begin{align*}
\bigl(B(\mathbf{X},\mathbf{Y})\bigr)_{k,\ell}
&= \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(\mathbf{X})_{i,j}\,(\mathbf{Y})_{(k-i)\bmod m,\,(\ell-j)\bmod n}
 = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(\mathbf{X})_{i,j}\,\omega_m^{r(k-i)}\,\omega_n^{s(\ell-j)}\\
&= \Bigl(\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(\mathbf{X})_{i,j}\,\omega_m^{-ri}\,\omega_n^{-sj}\Bigr)\cdot\omega_m^{rk}\,\omega_n^{s\ell}\,.
\end{align*}
Hence
\[
B\bigl(\mathbf{X},\underbrace{(\mathbf{F}_m)_{:,r}(\mathbf{F}_n)_{s,:}}_{\text{“eigenvector”}}\bigr)
= \underbrace{\Bigl(\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(\mathbf{X})_{i,j}\,\omega_m^{-ri}\,\omega_n^{-sj}\Bigr)}_{\text{“eigenvalue”, see Eq. (4.2.5.3)}}\,(\mathbf{F}_m)_{:,r}(\mathbf{F}_n)_{s,:}\,. \tag{4.2.5.13}
\]
Hence, the (complex conjugated) two-dimensional discrete Fourier transform of X according to (4.2.5.3)
provides the eigenvalues of the anti-linear mapping Y 7→ B(X, Y), X ∈ C m,n fixed. Thus we have arrived
at a 2D version of the convolution theorem Thm. 4.2.2.2.
\[
\mathbf{X}\ast_{m,n}\mathbf{Y} = \mathsf{DFT}_{m,n}^{-1}\bigl(\mathsf{DFT}_{m,n}(\mathbf{X})\odot\mathsf{DFT}_{m,n}(\mathbf{Y})\bigr)\,,
\]
where $\odot$ denotes the componentwise (Hadamard) product.
This suggests the following DFT-based algorithm for evaluating the periodic convolution of matrices:
➊ Compute Ŷ by 2D DFT of Y, see Code 4.2.5.7
➋ Compute X̂ by 2D DFT of X, see Code 4.2.5.6.
➌ Component-wise multiplication of X̂ and Ŷ: Ẑ = X̂. ∗ Ŷ.
➍ Compute Z through inverse 2D DFT of Ẑ.
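A minimal sketch of these four steps, reusing the fft2() routine sketched in §4.2.5.1 above; the helper ifft2() (based on the identity $\mathsf{DFT}^{-1}(\mathbf{Z}) = \overline{\mathsf{DFT}(\overline{\mathbf{Z}})}/(mn)$, cf. (4.2.5.5)) and the function name pconv2() are our choices for illustration.

// Inverse 2D DFT via conjugation, cf. (4.2.5.5).
Eigen::MatrixXcd ifft2(const Eigen::MatrixXcd &Z) {
  const double N = static_cast<double>(Z.rows() * Z.cols());
  return fft2(Z.conjugate()).conjugate() / N;
}

// 2D discrete periodic convolution via the 2D convolution theorem:
// Z = ifft2( fft2(X) .* fft2(Y) ), componentwise product in Fourier domain.
Eigen::MatrixXcd pconv2(const Eigen::MatrixXcd &X, const Eigen::MatrixXcd &Y) {
  const Eigen::MatrixXcd Yh = fft2(Y);              // step 1: 2D DFT of Y
  const Eigen::MatrixXcd Xh = fft2(X);              // step 2: 2D DFT of X
  const Eigen::MatrixXcd Zh = Xh.cwiseProduct(Yh);  // step 3: componentwise multiplication
  return ifft2(Zh);                                 // step 4: inverse 2D DFT
}

Carried out with FFT-based 1D transforms, this brings the cost down from $O(m^2 n^2)$ for the direct loops to $O(mn\log(mn))$.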
EXAMPLE 4.2.5.16 (Deblurring by DFT) 2D discrete convolutions are important for image processing.
Let a Gray-scale pixel image be stored in the matrix P ∈ R m,n , actually P ∈ {0, . . . , 255}m,n , see also
Ex. 3.4.4.24.
Write ( pl,k )l,k∈Z for the periodically extended image:
Blurring is a technical term for undesirable cross-talk between neighboring pixels: pixel values get replaced
by weighted averages of near-by pixel values. This is a good model approximation of the effect of distortion
in optical transmission systems like lenses. Blurring can be described by a small matrix called the point-
spread function (PSF):
\[
c_{l,j} = \sum_{k=-L}^{L}\sum_{q=-L}^{L} s_{k,q}\,p_{l+k,\,j+q}\,,\quad
\begin{array}{l} 0\leq l<m\,,\\ 0\leq j<n\,,\end{array}
\qquad L\in\{1,\dots,\min\{m,n\}\}\,. \tag{4.2.5.17}
\]
Here the entries of the PSF are referenced as $s_{k,q}$ also with negative indices. We also point out that usually $L$ will be small compared to $m$ and $n$, and that $s_{k,q}\geq 0$ and $\sum_{k=-L}^{L}\sum_{q=-L}^{L} s_{k,q} = 1$. Hence blurring amounts to a local averaging of pixel values. You may also want to look at this YouTube video about “Convolution in Image Processing”.
In the experiments reported below we used $L=5$ and the PSF
\[
s_{k,q} = \frac{1}{1+k^2+q^2}\,,\quad |k|,|q|\leq L\,,
\]
normalized to entry sum $=1$.
MatrixXd blur(const MatrixXd &P, const MatrixXd &S) {
  // P: grey-scale image, S: (2L+1) x (2L+1) point-spread function
  // (function head reconstructed; only the loop body is contained in the source excerpt)
  using index_t = Eigen::Index;
  const index_t m = P.rows(), n = P.cols();
  const index_t M = S.rows(), N = S.cols();
  const index_t L = (M - 1) / 2;
  if (M != N) {
    std::cout << "Error: S not quadratic!\n";
  }
  MatrixXd C(m, n);
  for (index_t l = 1; l <= m; ++l) {
    for (index_t j = 1; j <= n; ++j) {
      double s = 0;
      for (index_t k = 1; k <= (2 * L + 1); ++k) {
        for (index_t q = 1; q <= (2 * L + 1); ++q) {
          index_t kl = l + k - L - 1;
          if (kl < 1) {
            kl += m;
          } else if (kl > m) {
            kl -= m;
          }
          index_t jm = j + q - L - 1;
          if (jm < 1) {
            jm += n;
          } else if (jm > n) {
            jm -= n;
          }
          s += P(kl - 1, jm - 1) * S(k - 1, q - 1);
        }
      }
      C(l - 1, j - 1) = s;
    }
  }
  return C;
}
Yet, does (4.2.5.17) ring a bell? Hidden in (4.2.5.17) is a 2D discrete periodic convolution, see Eq. (4.2.5.11)!
\begin{align*}
c_{l,j} &= \sum_{k=-L}^{L}\sum_{q=-L}^{L} s_{k,q}\,(\mathbf{P})_{(l+k)\bmod m,\,(j+q)\bmod n}
        = \sum_{k=-L}^{L}\sum_{q=-L}^{L} s_{-k,-q}\,(\mathbf{P})_{(l-k)\bmod m,\,(j-q)\bmod n}\\
&= \sum_{k=0}^{L}\sum_{q=0}^{L} s_{-k,-q}\,(\mathbf{P})_{(l-k)\bmod m,\,(j-q)\bmod n}
 + \sum_{k=0}^{L}\sum_{q=-L}^{-1} s_{-k,-q}\,(\mathbf{P})_{(l-k)\bmod m,\,(j-(q+n))\bmod n}\\
&\quad + \sum_{k=-L}^{-1}\sum_{q=0}^{L} s_{-k,-q}\,(\mathbf{P})_{(l-(k+m))\bmod m,\,(j-q)\bmod n}
 + \sum_{k=-L}^{-1}\sum_{q=-L}^{-1} s_{-k,-q}\,(\mathbf{P})_{(l-(k+m))\bmod m,\,(j-(q+n))\bmod n}\\
&= \sum_{k=0}^{L}\sum_{q=0}^{L} s_{-k,-q}\,(\mathbf{P})_{(l-k)\bmod m,\,(j-q)\bmod n}
 + \sum_{k=0}^{L}\sum_{q=n-L}^{n-1} s_{-k,-q+n}\,(\mathbf{P})_{(l-k)\bmod m,\,(j-q)\bmod n}\\
&\quad + \sum_{k=m-L}^{m-1}\sum_{q=0}^{L} s_{-k+m,-q}\,(\mathbf{P})_{(l-k)\bmod m,\,(j-q)\bmod n}
 + \sum_{k=m-L}^{m-1}\sum_{q=n-L}^{n-1} s_{-k+m,-q+n}\,(\mathbf{P})_{(l-k)\bmod m,\,(j-q)\bmod n}\,.
\end{align*}
Hence $c_{l,j} = (\widetilde{\mathbf{S}}\ast_{m,n}\mathbf{P})_{l,j}$ with an $m\times n$ matrix $\widetilde{\mathbf{S}}$ obtained by periodically wrapping the (reflected) point-spread function into the index ranges $\{0,\dots,m-1\}\times\{0,\dots,n-1\}$.
\[
B\bigl(\mathbf{V}_{\nu,\mu}\bigr) = \lambda_{\nu,\mu}\,\mathbf{V}_{\nu,\mu}\,,\qquad
\text{eigenvalue}\quad \lambda_{\nu,\mu} = \underbrace{\sum_{k=-L}^{L}\sum_{q=-L}^{L} s_{k,q}\,\omega_m^{\nu k}\,\omega_n^{\mu q}}_{\text{2-dimensional DFT of the point-spread function!}}\,. \tag{4.2.5.21}
\]
Thus the inversion of the blurring operator boils down to componentwise scaling in “Fourier domain”, see
See also Code 4.2.5.15 for the same idea.
Note that this code checks whether deblurring is possible, that is, whether the blurring operator is really
invertible. A near singular blurring operator manifests itself through entries of its DFT close to zero. y
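The deblurring code itself is not included in this excerpt; a hedged sketch of the idea, reusing fft2()/ifft2() from the sketches above, could look as follows. The interface (the blurred image C, the periodically placed point-spread matrix S as derived above, and the tolerance tol) is our choice for illustration.

#include <cassert>

// Deblurring by componentwise division in Fourier domain.
// S: m x n matrix containing the periodically placed (reflected) PSF.
Eigen::MatrixXd deblur(const Eigen::MatrixXd &C, const Eigen::MatrixXcd &S,
                       double tol = 1.0e-10) {
  const Eigen::MatrixXcd Sh = fft2(S);   // 2D DFT of the PSF -> eigenvalues (4.2.5.21)
  // guard against a (nearly) singular blurring operator: DFT entries close to zero
  assert(Sh.cwiseAbs().minCoeff() > tol && "blurring operator (nearly) singular");
  const Eigen::MatrixXcd Ch = fft2(C.cast<std::complex<double>>());
  return ifft2(Ch.cwiseQuotient(Sh)).real();  // componentwise division, back-transform
}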
Review question(s) 4.2.5.23 (Two-dimensional DFT)
(Q4.2.5.23.A) Describe a function $(t,s)\mapsto g(t,s)$ such that
\[
\operatorname{Re}\Bigl\{(\mathbf{F}_m)_{:,j}(\mathbf{F}_n)_{:,\ell}^{\top}\Bigr\} = \bigl[g(t_j, s_r)\bigr]_{\substack{j=0,\dots,m-1\\ r=0,\dots,n-1}}\,,\qquad t_j := \frac{j}{m}\,,\quad s_r := \frac{r}{n}\,,\quad m,n\in\mathbb{N}\,.
\]
Here $\mathbf{F}_n$, $n\in\mathbb{N}$, is the Fourier matrix
\[
\mathbf{F}_n = \bigl[\omega_n^{lj}\bigr]_{l,j=0}^{n-1}\in\mathbb{C}^{n,n}\,. \tag{4.2.1.13}
\]
(Q4.2.5.23.B) How would you compute the discrete Fourier transform of a tensor product matrix
X = uvH , u, v ∈ C n .
△
§4.2.6.1 (“Squeezing” the DFT) In this section we are concerned with non-periodic signals of infinite
duration as introduced in § 4.0.0.1.
Next, we associate a point $t_k\in[0,1[$ with each index $k$ of the DFT $(c_k)_{k=0}^{n-1}$:
\[
k\in\{0,\dots,n-1\} \;\longleftrightarrow\; t_k := \frac{k}{n}\,. \tag{4.2.6.3}
\]
Thus we can view $(c_k)_{k=0}^{n-1}$ as the heights of $n$ pulses evenly spaced in the interval $[0,1[$.
[Fig. 134: “squeezing” a vector $\in\mathbb{R}^n$ into $[0,1[$; the values $c_k$ are plotted at the points $t_k = k/n$, $k=0,\dots,n-1$.]
We identify $c_k\leftrightarrow c(t_k)$, which makes it possible to pass from a discrete finite signal to a continuous signal: the notation indicates that we read $c_k$ as the value of a function $c:[0,1[\,\to\mathbb{C}$ at the argument $t_k$. y
EXAMPLE 4.2.6.5 (“Squeezed” DFT of a periodically truncated signal) We consider the bi-infinite discrete signal $(y_j)_{j\in\mathbb{Z}}$, “concentrated around $0$”,
\[
y_j = \frac{1}{1+j^2}\,,\quad j\in\mathbb{Z}\,.
\]
We examine the DFT of the $(2m+1)$-periodic signal obtained by periodic extension of $(y_k)_{k=-m}^{m}$, C++ code ➺ GITLAB.
The visual impression is that the values c(tk ) “converge” to those of a function c : [0, 1[7→ R in the
sampling points tk .
y
§4.2.6.6 (Fourier series) Now we pass to the limit $m\to\infty$ in (4.2.6.4) and keep the “sampling a function” perspective: $c_k = c(t_k)$. Note that passing to the limit amounts to dropping the assumption of periodicity! We obtain
\[
c(t) = \sum_{k\in\mathbb{Z}} y_k\exp(-2\pi\imath kt)\,,\quad t\in[0,1[\,. \tag{4.2.6.7}
\]
Terminology: The series (= infinite sum) on the right hand side of (4.2.6.7) is called a Fourier series
(link).
The function c : [0, 1[7→ C defined by (4.2.6.7) is called the Fourier transform of the
sequence (yk )k∈Z (, if the series converges).
[Fig. 143: Fourier transform of $(1/(1+k^2))_{k\in\mathbb{Z}}$ over $t\in[0,1]$.] Thus, the limit we “saw” in Ex. 4.2.6.5 is actually the Fourier transform of the sequence $(y_k)_{k\in\mathbb{Z}}$, namely the Fourier transform of $y_k := \frac{1}{1+k^2}$.
For this particular sequence the Fourier series can be summed in closed form:
\[
c(t) = \sum_{k\in\mathbb{Z}}\frac{1}{1+k^2}\exp(-2\pi\imath kt) = \frac{\pi}{e^{\pi}-e^{-\pi}}\cdot\bigl(e^{\pi-2\pi t} + e^{2\pi t-\pi}\bigr)\in C^{\infty}([0,1])\,.
\]
Note that when considered as a 1-periodic function on $\mathbb{R}$, this $c(t)$ is merely continuous. y
Remark 4.2.6.9 (Decay conditions for bi-infinite signals) The considerations above were based on
✦ truncation of $(y_k)_{k\in\mathbb{Z}}$ to $(y_k)_{k=-m}^{m}$, and
✦ periodic continuation to a $(2m+1)$-periodic signal.
Obviously, only if the signal is concentrated around $k=0$ will this procedure not lose essential information contained in the signal, which suggests decay conditions for the coefficients of Fourier series.
The summability condition
\[
\sum_{k\in\mathbb{Z}}|y_k| < \infty \tag{4.2.6.11}
\]
implies (4.2.6.10). Moreover, (4.2.6.11) ensures that the Fourier series (4.2.6.7) converges uniformly [Str09, Def. 4.8.1], because the exponentials are all bounded by $1$ in modulus. From [Str09, Thm. 4.8.1] we learn that uniformly convergent series of continuous functions possess a continuous limit. As a consequence, $c:[0,1[\,\to\mathbb{C}$ is continuous if (4.2.6.11) holds. y
EXAMPLE 4.2.6.12 (Convergence of Fourier sums) We consider the following infinite signal, satisfying the summability condition (4.2.6.11): $y_k = \frac{1}{1+k^2}$, $k\in\mathbb{Z}$, see Ex. 4.2.6.5. We monitored the approximation of the Fourier transform $c(t)$ by the Fourier sums $c_m(t)$, see (4.2.6.14).
[Fig. 144: Fourier transform $c(t)$ of $(1/(1+k^2))_k$; Fig. 145: Fourier sum approximations $c_m(t)$ with $2m+1$ terms for $m = 2, 4, 8, 16, 32$.]
We observe convergence of the Fourier sums in “eyeball norm”. Quantitative estimates can be deduced from decay properties of the sequence $(y_k)_{k\in\mathbb{Z}}$. If it is summable according to (4.2.6.11), then
\[
\Bigl|\sum_{k\in\mathbb{Z}} y_k\exp(-2\pi\imath kt) - \sum_{k=-M}^{M} y_k\exp(-2\pi\imath kt)\Bigr| \leq \sum_{|k|>M}|y_k| \to 0\quad\text{for } M\to\infty\,.
\]
Further quantitative statements about convergence can be deduced from Thm. 4.2.6.33 below. y
Task: Approximate evaluation of $c(t)$ at $N$ equidistant points $t_j := \frac{j}{N}$, $j=0,\dots,N-1$ (e.g., for plotting it):
\[
c(t_j) = \lim_{M\to\infty}\sum_{k=-M}^{M} y_k\exp(-2\pi\imath k t_j) \approx \sum_{k=-M}^{M} y_k\exp\Bigl(-2\pi\imath\frac{kj}{N}\Bigr)\,,\quad j=0,\dots,N-1\,. \tag{4.2.6.15}
\]
Note that in the case N = M (4.2.6.15) coincides with a discrete Fourier transform (DFT, → Def. 4.2.1.18).
The following code demonstrates the evaluation of a Fourier series at equidistant points using DFT.
C++ code 4.2.6.16: DFT-based evaluation of Fourier sum at equidistant points ➺ GITLAB

// DFT-based approximate evaluation of a Fourier series.
// signal is a functor providing the y_k,
// M specifies the truncation of the series according to (4.2.6.14),
// N is the number of equidistant evaluation points for c in [0,1[.
// (Only the first lines of this listing survive in the source; the remainder
//  below is a reconstruction assuming N >= 2*M+1.)
template <class Function>
VectorXcd foursum(const Function &signal, int M, int N) {
  const int m = 2 * M + 1;  // length of the truncated signal
  // sample signal: y_k, k = -M,...,M
  VectorXcd y(m);
  for (int k = -M; k <= M; ++k) y(k + M) = signal(k);
  // wrap the samples into a vector of length N: index k >= 0 goes to slot k,
  // index k < 0 goes to slot N + k (using omega_N^{(N+k)j} = omega_N^{kj})
  VectorXcd z = VectorXcd::Zero(N);
  for (int k = 0; k <= M; ++k) z(k) = y(k + M);
  for (int k = -M; k < 0; ++k) z(N + k) = y(k + M);
  // one DFT of length N yields c(t_j), t_j = j/N, cf. (4.2.6.15)
  Eigen::FFT<double> fft;
  return fft.fwd(z);
}
§4.2.6.17 (Inverting the Fourier transform) Now we perform a similar passage to the limit as above for the inverse DFT (4.2.1.20), $n = 2m+1$:
\[
y_j = \frac{1}{n}\sum_{k=0}^{n-1} c_k\exp\Bigl(2\pi\imath\frac{jk}{n}\Bigr)\,,\quad j=-m,\dots,m\,. \tag{4.2.6.18}
\]
With $c_k = c(t_k)$, $t_k = k/n$, this becomes
\[
y_j = \frac{1}{n}\sum_{k=0}^{n-1} c(t_k)\exp(2\pi\imath j t_k)\,,\quad j=-m,\dots,m\,. \tag{4.2.6.19}
\]
Insight: The right-hand side of (4.2.6.19) is a Riemann sum, cf. [Str09, Sect. 6.2], which for $n\to\infty$ tends to
\[
y_j = \int_{0}^{1} c(t)\exp(2\pi\imath jt)\,\mathrm{d}t\,,\quad j\in\mathbb{Z}\,. \tag{4.2.6.20}
\]
For a sequence $(y_k)_{k\in\mathbb{Z}}$ satisfying the summability condition (4.2.6.11) we can swap integration and summation and directly compute
\[
\int_{0}^{1} c(t)\exp(2\pi\imath jt)\,\mathrm{d}t
= \int_{0}^{1}\Bigl(\sum_{k\in\mathbb{Z}} y_k\exp(-2\pi\imath kt)\Bigr)\exp(2\pi\imath jt)\,\mathrm{d}t
= \sum_{k\in\mathbb{Z}} y_k\int_{0}^{1}\exp\bigl(2\pi\imath(j-k)t\bigr)\,\mathrm{d}t = y_j\,,
\]
because
\[
\int_{0}^{1}\exp(2\pi\imath nt)\,\mathrm{d}t = \begin{cases} 1\,, & \text{if } n=0\,,\\ 0\,, & \text{if } n\neq 0\,.\end{cases}
\]
The formula (4.2.6.20) thus allows us to recover the signal $(y_k)_{k\in\mathbb{Z}}$ from its Fourier transform $c(t)$.
§4.2.6.21 (Fourier transform as linear mapping) Assuming sufficiently fast decay of the infinite sequence
(yk )k∈Z ∈ CZ , combining (4.2.6.7) and (4.2.6.20) we have found the relationship
\[
\text{(4.2.6.7):}\quad c(t) = \sum_{k\in\mathbb{Z}} y_k\exp(-2\pi\imath kt)
\qquad\longleftrightarrow\qquad
\text{(4.2.6.20):}\quad y_k = \int_{0}^{1} c(t)\exp(2\pi\imath kt)\,\mathrm{d}t\,.
\]
Terminology: $y_j$ from (4.2.6.20) is called the $j$-th Fourier coefficient of the function $c$.
✎ Notation: $\widehat{c}_j := y_j$ with $y_j$ defined by (4.2.6.20) $\hat{=}$ $j$-th Fourier coefficient of $c:[0,1[\,\to\mathbb{C}$.
In a sense, Fourier-series summation maps a sequence to a 1-periodic function, and Fourier-coefficient extraction maps a 1-periodic function to a sequence:
\[
\text{sequence}\in\mathbb{C}^{\mathbb{Z}} \;\xrightarrow{\ \text{Fourier series}\ }\; \text{function } [0,1[\,\to\mathbb{C}\,,\qquad\text{and back via Fourier coefficients.}
\]
Both the space $\mathbb{C}^{\mathbb{Z}}$ of bi-infinite sequences and the space of functions $[0,1[\,\to\mathbb{C}$ are vector spaces equipped with “termwise/pointwise” addition and scalar multiplication. Then it is clear that
• the series-summation mapping $(\widehat{c}_k)_{k\in\mathbb{Z}}\mapsto c$ from (4.2.6.7),
• and the Fourier-coefficient extraction mapping $c\mapsto(\widehat{c}_k)_{k\in\mathbb{Z}}$ from (4.2.6.20)
are linear! (Recall the concept of a linear mapping as explained in [NS02, Ch. 6].)
\[
\underbrace{c:[0,1[\,\to\mathbb{C}}_{\text{(continuous) function}}
\quad
\begin{array}{c}
\xrightarrow{\ \widehat{c}_j = \int_{0}^{1} c(t)\exp(2\pi\imath jt)\,\mathrm{d}t\ }\\[1mm]
\xleftarrow[\ c(t) = \sum_{k\in\mathbb{Z}}\widehat{c}_k\exp(-2\pi\imath kt)\ ]{}
\end{array}
\quad
\underbrace{(\widehat{c}_j)_{j\in\mathbb{Z}}}_{\text{(bi-infinite) sequence}}
\]
(Fourier-coefficient extraction in one direction, Fourier transform/series summation in the other.) y
Remark 4.2.6.22 (Filtering in Fourier domain) What happens to the Fourier transform of a bi-infinite
signal, if it passes through a channel?
Consider a (bi-)infinite signal $(x_k)_{k\in\mathbb{Z}}$ sent through a finite (→ Def. 4.1.1.2), linear (→ Def. 4.1.1.7), time-invariant (→ Def. 4.1.1.5), and causal (→ Def. 4.1.1.9) channel with impulse response $(\dots,0,h_0,\dots,h_{n-1},0,\dots)$ (→ Def. 4.1.1.12). By (4.1.2.4) this results in the output signal
\[
y_k = \sum_{j=0}^{n-1} h_j\,x_{k-j}\,,\quad k\in\mathbb{Z}\,. \tag{4.2.6.23}
\]
§4.2.6.27 (Fourier transform and convolution) In fact, the observation made in Rem. 4.2.6.22 is a spe-
cial case of a more general result that provides a version of the convolution theorem Thm. 4.2.2.2 for the
Fourier transform.
Let $t\mapsto c(t)$ and $t\mapsto b(t)$ be the Fourier transforms of the two summable bi-infinite sequences $(y_k)_{k\in\mathbb{Z}}$ and $(x_k)_{k\in\mathbb{Z}}$, respectively. Then the pointwise product $t\mapsto c(t)b(t)$ is the Fourier transform of the convolution (→ Def. 4.1.2.7)
\[
(x_k)\ast(y_k) := \Bigl(\ell\in\mathbb{Z}\mapsto \sum_{k\in\mathbb{Z}} x_k\,y_{\ell-k}\Bigr)\in\mathbb{C}^{\mathbb{Z}}\,.
\]
Proof. (formal) Ignoring issues of convergence, we may just multiply the two Fourier series and sort the resulting terms:
\[
\Bigl(\sum_{k\in\mathbb{Z}} y_k\exp(-2\pi\imath kt)\Bigr)\cdot\Bigl(\sum_{j\in\mathbb{Z}} x_j\exp(-2\pi\imath jt)\Bigr)
= \sum_{\ell\in\mathbb{Z}}\underbrace{\Bigl(\sum_{k\in\mathbb{Z}} y_k\,x_{\ell-k}\Bigr)}_{=\,((y_k)\ast(x_k))_{\ell}}\exp(-2\pi\imath\ell t)\,.
\]
✷ y
§4.2.6.29 (Isometry property of Fourier transform) We will find a conservation of power through Fourier
transform. This is related to the assertion of Lemma 4.2.1.14 for the Fourier matrix Fn , see (4.2.1.13),
namely that √1n Fn is unitary (→ Def. 6.3.1.2), which implies
\[
\Bigl\|\tfrac{1}{\sqrt n}\mathbf{F}_n\mathbf{y}\Bigr\|_2 \overset{\text{Thm. 3.3.2.2}}{=} \|\mathbf{y}\|_2\,. \tag{4.2.6.30}
\]
Since the DFT boils down to multiplication with $\mathbf{F}_n$ (→ Def. 4.2.1.18), we conclude from (4.2.6.30), with $c_k$ from (4.2.6.2),
\[
\frac{1}{n}\sum_{k=0}^{n-1}|c_k|^2 = \sum_{j=-m}^{m}|y_j|^2\,. \tag{4.2.6.31}
\]
Now we adopt the function perspective again and associate $c_k\leftrightarrow c(t_k)$. Then we pass to the limit $m\to\infty$, appeal to Riemann summation (see above), and conclude
\[
(4.2.6.31) \;\overset{m\to\infty}{\Longrightarrow}\; \int_{0}^{1}|c(t)|^2\,\mathrm{d}t = \sum_{j\in\mathbb{Z}}|y_j|^2\,. \tag{4.2.6.32}
\]
Theorem 4.2.6.33. If the Fourier coefficients satisfy $\sum_{j\in\mathbb{Z}}|\widehat{c}_j|^2 < \infty$, then the Fourier series converges (in the $L^2$-sense) to a function $c$ on $[0,1[$ and $\int_{0}^{1}|c(t)|^2\,\mathrm{d}t = \sum_{j\in\mathbb{Z}}|\widehat{c}_j|^2$.
Recalling the concept of the $L^2$-norm of a function, see (5.2.4.6), the theorem can be stated as follows:
Thm. 4.2.6.33 ↔ The $L^2$-norm of a Fourier transform agrees with the Euclidean norm of the corresponding sequence.
Here the Euclidean norm of a sequence is understood as $\bigl\|(y_k)_{k\in\mathbb{Z}}\bigr\|_2^2 := \sum_{k\in\mathbb{Z}}|y_k|^2$.
From Thm. 4.2.6.33 we can also conclude that the Fourier transform is injective: $c$ vanishes if and only if all its Fourier coefficients are zero. y
Review question(s) 4.2.6.34 (Semi-discrete Fourier transform)
Let $t\mapsto c(t)$ be the Fourier transform of a summable sequence $(y_k)_{k\in\mathbb{Z}}$,
\[
c(t) = \sum_{k\in\mathbb{Z}} y_k\exp(-2\pi\imath kt)\,.
\]
What is the relationship between $t\mapsto c(t)$ and the Fourier transform of the shifted sequence $S_m\bigl((y_k)_{k\in\mathbb{Z}}\bigr)$, $m\in\mathbb{Z}$?
△
Video tutorial for Section 4.3 "Fast Fourier Transform (FFT)": (16 minutes) Download link,
tablet notes
You might have been wondering why the reduction to DFTs has received so much attention in Sec-
tion 4.2.2. An explanation is given now.
At first glance, the DFT in $\mathbb{C}^n$,
\[
c_k = \sum_{j=0}^{n-1} y_j\,\omega_n^{kj}\,,\quad k=0,\dots,n-1\,, \tag{4.2.1.19}
\]
seems to require an asymptotic computational effort of $O(n^2)$ (matrix$\times$vector multiplication with the dense Fourier matrix).
EXPERIMENT 4.3.0.2 (Runtimes of DFT implementations) We examine the runtimes of calls to built-in DFT functions in both MATLAB and EIGEN ➺ GITLAB. Three implementations are compared:
1. loop-based summation of (4.2.1.19),
2. multiplication with the dense Fourier matrix (4.2.1.13),
3. the built-in fft() function (MATLAB fft(), respectively numpy.fft()).
[Fig.: measured runtimes vs. vector length n]
§4.3.0.3 (FFT algorithm: derivation and complexity) To understand how the discrete Fourier transform
of n-vectors can be implemented with an asymptotic computational effort smaller than O(n2 ) we start with
an elementary manipulation of (4.2.1.19) for n = 2m, m ∈ N:
\begin{align}
c_k &= \sum_{j=0}^{n-1} y_j\,e^{-\frac{2\pi\imath}{n}jk}
    = \sum_{j=0}^{m-1} y_{2j}\,e^{-\frac{2\pi\imath}{n}2jk} + \sum_{j=0}^{m-1} y_{2j+1}\,e^{-\frac{2\pi\imath}{n}(2j+1)k}\notag\\
   &= \underbrace{\sum_{j=0}^{m-1} y_{2j}\,\underbrace{e^{-\frac{2\pi\imath}{m}jk}}_{=\,\omega_m^{jk}}}_{=:\ \widetilde{c}^{\,\mathrm{even}}_k}
    \;+\; e^{-\frac{2\pi\imath}{n}k}\cdot\underbrace{\sum_{j=0}^{m-1} y_{2j+1}\,\underbrace{e^{-\frac{2\pi\imath}{m}jk}}_{=\,\omega_m^{jk}}}_{=:\ \widetilde{c}^{\,\mathrm{odd}}_k}\,,\quad k\in\mathbb{Z}\,. \tag{4.3.0.4}
\end{align}
This means that for even n we can compute DFTn (y) from two DFTs of half the length plus ∼ n additions
and multiplications.
✞ ☎
✝ ✆
(4.3.0.4): DFT of length 2m = 2× DFT of length m + 2m additions & multiplications
FFT-algorithm
The following code shows an EIGEN-based recursive FFT implementation for a DFT of length $n = 2^L$ (see the sketch after this paragraph). The recursion descends from $1$ DFT of length $2^L$ down to $2^L$ DFTs of length $1$, so it has $L$ levels; each level of the recursion requires $O(2^L)$ elementary operations, which amounts to a total asymptotic effort of $O(L\,2^L) = O(n\log_2 n)$.
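Code 4.3.0.5 itself is not reproduced in this excerpt; the following is a minimal sketch of a recursive radix-2 DFT in the spirit of (4.3.0.4). The function name fftrec() and the (unoptimized) vector splitting are our choices, not the lecture's implementation.

#include <Eigen/Dense>
#include <cassert>
#include <cmath>
#include <complex>

// Recursive radix-2 DFT for vectors of length n = 2^L, see (4.3.0.4).
Eigen::VectorXcd fftrec(const Eigen::VectorXcd &y) {
  using Comp = std::complex<double>;
  const Eigen::Index n = y.size();
  if (n == 1) return y;                  // a DFT of length 1 is the identity
  assert(n % 2 == 0 && "vector length must be a power of 2");
  const Eigen::Index m = n / 2;
  Eigen::VectorXcd y_even(m), y_odd(m);
  for (Eigen::Index j = 0; j < m; ++j) { // split into even/odd samples
    y_even(j) = y(2 * j);
    y_odd(j) = y(2 * j + 1);
  }
  const Eigen::VectorXcd c_even = fftrec(y_even);  // two DFTs of half length
  const Eigen::VectorXcd c_odd = fftrec(y_odd);
  Eigen::VectorXcd c(n);
  for (Eigen::Index k = 0; k < n; ++k) { // combine according to (4.3.0.4)
    const Comp omega = std::exp(Comp(0.0, -2.0 * M_PI * k / n));  // omega_n^k
    c(k) = c_even(k % m) + omega * c_odd(k % m);
  }
  return c;
}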
Remark 4.3.0.7 (FFT algorithm by matrix factorization) The FFT algorithm can also be analyzed on the level of matrix-vector calculus. For $n = 2m$, $m\in\mathbb{N}$, sorting the even-indexed rows of $\mathbf{F}_n$ before the odd-indexed ones by means of the permutation matrix $\mathbf{P}_m^{\mathrm{OE}}$ yields
\[
\mathbf{P}_m^{\mathrm{OE}}\mathbf{F}_n =
\begin{bmatrix}
\mathbf{F}_m & \mathbf{F}_m\\[1mm]
\mathbf{F}_m\operatorname{diag}\bigl(\omega_n^{0},\dots,\omega_n^{n/2-1}\bigr) & \mathbf{F}_m\operatorname{diag}\bigl(\omega_n^{n/2},\dots,\omega_n^{n-1}\bigr)
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{F}_m & \\
 & \mathbf{F}_m
\end{bmatrix}
\begin{bmatrix}
\mathbf{I} & \mathbf{I}\\[1mm]
\operatorname{diag}\bigl(\omega_n^{0},\dots,\omega_n^{n/2-1}\bigr) & -\operatorname{diag}\bigl(\omega_n^{0},\dots,\omega_n^{n/2-1}\bigr)
\end{bmatrix}.
\]
For example, for $n = 10$ and $\omega := \omega_{10}$,
\[
\mathbf{P}_5^{\mathrm{OE}}\mathbf{F}_{10} =
\begin{bmatrix}
\omega^0 & \omega^0 & \omega^0 & \omega^0 & \omega^0 & \omega^0 & \omega^0 & \omega^0 & \omega^0 & \omega^0\\
\omega^0 & \omega^2 & \omega^4 & \omega^6 & \omega^8 & \omega^0 & \omega^2 & \omega^4 & \omega^6 & \omega^8\\
\omega^0 & \omega^4 & \omega^8 & \omega^2 & \omega^6 & \omega^0 & \omega^4 & \omega^8 & \omega^2 & \omega^6\\
\omega^0 & \omega^6 & \omega^2 & \omega^8 & \omega^4 & \omega^0 & \omega^6 & \omega^2 & \omega^8 & \omega^4\\
\omega^0 & \omega^8 & \omega^6 & \omega^4 & \omega^2 & \omega^0 & \omega^8 & \omega^6 & \omega^4 & \omega^2\\
\omega^0 & \omega^1 & \omega^2 & \omega^3 & \omega^4 & \omega^5 & \omega^6 & \omega^7 & \omega^8 & \omega^9\\
\omega^0 & \omega^3 & \omega^6 & \omega^9 & \omega^2 & \omega^5 & \omega^8 & \omega^1 & \omega^4 & \omega^7\\
\omega^0 & \omega^5 & \omega^0 & \omega^5 & \omega^0 & \omega^5 & \omega^0 & \omega^5 & \omega^0 & \omega^5\\
\omega^0 & \omega^7 & \omega^4 & \omega^1 & \omega^8 & \omega^5 & \omega^2 & \omega^9 & \omega^6 & \omega^3\\
\omega^0 & \omega^9 & \omega^8 & \omega^7 & \omega^6 & \omega^5 & \omega^4 & \omega^3 & \omega^2 & \omega^1
\end{bmatrix}.
\]
y
To compute an n-point DFT when n is composite (that is, when n = pq), the FFTW library decomposes the
problem using the Cooley-Tukey algorithm, which first computes p transforms of size q, and then computes
q transforms of size p. The decomposition is applied recursively to both the p- and q-point DFTs until the
problem can be solved using one of several machine-generated fixed-size "codelets." The codelets in turn
use several algorithms in combination, including a variation of Cooley-Tukey, a prime factor algorithm, and
a split-radix algorithm. The particular factorization of n is chosen heuristically.
The execution time for fft depends on the length of the transform. It is fastest for powers of two. It is
almost as fast for lengths that have only small prime factors. It is typically several times slower for
lengths that are prime or which have large prime factors → Ex. 4.3.0.12.
Remark 4.3.0.8 (FFT based on general factorization) We motivate the fast Fourier transform algorithm for a DFT of length $n = pq$, $p,q\in\mathbb{N}$ (Cooley–Tukey algorithm). Again, we start with re-indexing in the DFT formula for a vector $\mathbf{y} = [y_0,\dots,y_{n-1}]\in\mathbb{C}^n$:
\[
c_k = \sum_{j=0}^{n-1} y_j\,\omega_n^{jk}
\overset{[j =: lp+m]}{=} \sum_{m=0}^{p-1}\sum_{l=0}^{q-1} y_{lp+m}\,e^{-\frac{2\pi\imath}{pq}(lp+m)k}
= \sum_{m=0}^{p-1}\omega_n^{mk}\sum_{l=0}^{q-1} y_{lp+m}\,\omega_q^{l(k\bmod q)}\,. \tag{4.3.0.9}
\]
Step I: perform $p$ DFTs of length $q$, $\mathbf{z}_m = \mathsf{DFT}_q\bigl([y_{lp+m}]_{l=0}^{q-1}\bigr)$:
\[
(\mathbf{z}_m)_k = z_{m,k} := \sum_{l=0}^{q-1} y_{lp+m}\,\omega_q^{lk}\,,\quad 0\leq m<p\,,\ 0\leq k<q\,.
\]
Step II: for every $r\in\{0,\dots,q-1\}$ and $k = r+sq$, $0\leq s<p$,
\[
c_{r+sq} = \sum_{m=0}^{p-1}\bigl(\omega_n^{mr} z_{m,r}\bigr)\,\omega_p^{ms}\,,
\]
which amounts to $q$ DFTs of length $p$ after $n$ multiplications with the “twiddle factors” $\omega_n^{mr}$. This gives all components $c_k$ of $\mathsf{DFT}_n\mathbf{y}$.
[Fig.: the data rearranged in a $p\times q$ array; Step I operates along one direction, Step II along the other.]
In fact, the above considerations are the same as those elaborated in Section 4.2.5 that showed that a
two-dimensional DFT of Y ∈ C m,n can be done by carrying out m one-dimensional DFTs of length n plus
n one-dimensional DFTs of length m, see (4.2.5.4) and Code 4.2.5.6. y
Remark 4.3.0.10 (FFT for prime n) When $n\neq 2^L$, even the Cooley–Tukey algorithm of Rem. 4.3.0.8 will eventually lead to a DFT for a vector with prime length.
Quoted from the M ATLAB manual:
When n is a prime number, the FFTW library first decomposes an n-point problem into three (n − 1)-point
problems using Rader’s algorithm [Rad68]. It then uses the Cooley-Tukey decomposition described above
to compute the (n − 1)-point DFTs.
For the Fourier matrix $\mathbf{F}_p = (f_{ij})_{i,j=1}^{p}$ the permuted block $\mathbf{P}_{p-1}\,\mathbf{P}_{p,g}\,(f_{ij})_{i,j=2}^{p}\,\mathbf{P}_{p,g}^{\top}$ is circulant. For $p = 13$, $\omega := \omega_{13}$, the permuted Fourier matrix looks as follows:
\[
\mathbf{F}_{13} \longrightarrow
\begin{bmatrix}
\omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} & \omega^{0}\\
\omega^{0} & \omega^{2} & \omega^{4} & \omega^{8} & \omega^{3} & \omega^{6} & \omega^{12} & \omega^{11} & \omega^{9} & \omega^{5} & \omega^{10} & \omega^{7} & \omega^{1}\\
\omega^{0} & \omega^{1} & \omega^{2} & \omega^{4} & \omega^{8} & \omega^{3} & \omega^{6} & \omega^{12} & \omega^{11} & \omega^{9} & \omega^{5} & \omega^{10} & \omega^{7}\\
\omega^{0} & \omega^{7} & \omega^{1} & \omega^{2} & \omega^{4} & \omega^{8} & \omega^{3} & \omega^{6} & \omega^{12} & \omega^{11} & \omega^{9} & \omega^{5} & \omega^{10}\\
\omega^{0} & \omega^{10} & \omega^{7} & \omega^{1} & \omega^{2} & \omega^{4} & \omega^{8} & \omega^{3} & \omega^{6} & \omega^{12} & \omega^{11} & \omega^{9} & \omega^{5}\\
\omega^{0} & \omega^{5} & \omega^{10} & \omega^{7} & \omega^{1} & \omega^{2} & \omega^{4} & \omega^{8} & \omega^{3} & \omega^{6} & \omega^{12} & \omega^{11} & \omega^{9}\\
\omega^{0} & \omega^{9} & \omega^{5} & \omega^{10} & \omega^{7} & \omega^{1} & \omega^{2} & \omega^{4} & \omega^{8} & \omega^{3} & \omega^{6} & \omega^{12} & \omega^{11}\\
\omega^{0} & \omega^{11} & \omega^{9} & \omega^{5} & \omega^{10} & \omega^{7} & \omega^{1} & \omega^{2} & \omega^{4} & \omega^{8} & \omega^{3} & \omega^{6} & \omega^{12}\\
\omega^{0} & \omega^{12} & \omega^{11} & \omega^{9} & \omega^{5} & \omega^{10} & \omega^{7} & \omega^{1} & \omega^{2} & \omega^{4} & \omega^{8} & \omega^{3} & \omega^{6}\\
\omega^{0} & \omega^{6} & \omega^{12} & \omega^{11} & \omega^{9} & \omega^{5} & \omega^{10} & \omega^{7} & \omega^{1} & \omega^{2} & \omega^{4} & \omega^{8} & \omega^{3}\\
\omega^{0} & \omega^{3} & \omega^{6} & \omega^{12} & \omega^{11} & \omega^{9} & \omega^{5} & \omega^{10} & \omega^{7} & \omega^{1} & \omega^{2} & \omega^{4} & \omega^{8}\\
\omega^{0} & \omega^{8} & \omega^{3} & \omega^{6} & \omega^{12} & \omega^{11} & \omega^{9} & \omega^{5} & \omega^{10} & \omega^{7} & \omega^{1} & \omega^{2} & \omega^{4}\\
\omega^{0} & \omega^{4} & \omega^{8} & \omega^{3} & \omega^{6} & \omega^{12} & \omega^{11} & \omega^{9} & \omega^{5} & \omega^{10} & \omega^{7} & \omega^{1} & \omega^{2}
\end{bmatrix}.
\]
Then apply fast algorithms for multiplication with circulant matrices (= discrete periodic convolution, see
§ 4.1.4.11) to right lower (n − 1) × (n − 1) block of permuted Fourier matrix. These fast algorithms rely
on DFTs of length n − 1, see Code 4.2.2.4. y
Since in Section 4.2 we could implement important operations based on the discrete Fourier transform, we can now reap the fruits of the availability of a fast implementation of the DFT (← Section 4.2.2):
Asymptotic complexity of the discrete periodic convolution, see Code 4.2.2.4:
Cost(pconvfft(u,x), $\mathbf{u},\mathbf{x}\in\mathbb{C}^n$) $= O(n\log n)$.
The warning issued in Exp. 2.3.1.7 carries over to numerical methods for signal processing:
Never implement DFT/FFT by yourself!
! Under all circumstances use high-quality numerical libraries!
From FFTW homepage: FFTW is a C subroutine library for computing the discrete Fourier transform (DFT)
in one or more dimensions, of arbitrary input size, and of both real and complex data.
FFTW will perform well on most architectures without modification. Hence the name, "FFTW," which
stands for the somewhat whimsical title of “Fastest Fourier Transform in the West.”
implementation of the FFTW library (version 3.x). This paper also conveys the many tricks it takes
to achieve satisfactory performance for DFTs of arbitrary length.
FFTW can be installed from source following the instructions from the installation page after downloading
the source code of FFTW 3.3.8 from the download page. Precompiled binaries for various linux distribu-
tions are available in their main package repositories:
• Ubuntu/Debian: apt-get install fftw3 fftw3-dev
• Fedora: dnf install fftw fftw-devel
E IGEN’s FFT module can use different backend implementations, one of which is the FFTW library. The
backend may be enabled by defining the preprocessor directive Eigen_FFTW_DEFAULT (prior to inclu-
sion of unsupported/Eigen/FFT) and linking with the FFTW library (-lfftw3). This setup pro-
cedure may be handled automatically by a build system like CMake (see set_eigen_fft_backend
macro on ➺ GITLAB). y
EXAMPLE 4.3.0.12 (Efficiency of FFT for different backend implementations) We measure the runtimes of FFT in EIGEN linked with different backend libraries, for vector lengths $n = 2^L$.
Platform:
✦ Linux (Ubuntu 16.04 64bit)
✦ Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz
✦ L2 256KB, L3 4 MB, 8 GB DDR3 @ 1.60GHz
✦ Clang 3.8.0, -O3
For reasonably large input sizes the FFTW backend gives, compared to EIGEN’s default backend (Kiss FFT), a speedup of 2–4x. y
Details about FFT algorithms can be found, for instance, in [DR08, Sect. 8.7.3], [Han02, Sect. 53], [QSS00, Sect. 10.9.2].
There are thousands of online tutorials on the FFT, for instance
• The Fast Fourier Transform (FFT): Most Ingenious Algorithm Ever? (Offers an unconventional
perspective based on polynomial multiplication.)
Review question(s) 4.3.0.13 (The Fast Fourier Transform (FFT))
(Q4.3.0.13.A) What is the asymptotic complexity for m, n → ∞ of the two-dimensional DFT of a matrix
Y ∈ C m,n carried out with the following code:
void fft2(Eigen::MatrixXcd &C, const Eigen::MatrixBase<Scalar> &Y) {
  using idx_t = Eigen::MatrixXcd::Index;
  const idx_t m = Y.rows();
  const idx_t n = Y.cols();
  C.resize(m, n);
  Eigen::MatrixXcd tmp(m, n);
  // ...
(Q4.3.0.13.B) Assume that an FFT implementation is available only for vectors of length n = 2^L, L ∈ N.
How do you have to modify the following C++ function for the discrete convolution of two vectors
h, x ∈ C^n to ensure that it still enjoys an asymptotic complexity of O(n log n) for n → ∞?
(Q4.3.0.13.C) Again assume that the FFT implementation of EIGEN is available only for vectors of length
n = 2^L, L ∈ N. Propose changes to the following C++ function for the discrete periodic convolution of
two vectors u, x ∈ C^n that preserve the asymptotic complexity of O(n log n) for n → ∞.
Devise a recursive algorithm for computing the matrix×vector product Hm x, x ∈ R n and determine its
asymptotic complexity in terms of n := 2m → ∞.
△
Supplementary literature. [Han02, Sect. 55], see also [Str99] for an excellent presentation of the discrete cosine transform.
Basis transform matrix (sine basis → standard basis):  S_n := ( sin(jkπ/n) )_{j,k=1}^{n−1} ∈ R^{n−1,n−1} .

Sine transform of y = [y_1, . . . , y_{n−1}]^⊤ ∈ R^{n−1}:   s_k = ∑_{j=1}^{n−1} y_j sin(πjk/n) ,   k = 1, . . . , n−1 .      (4.4.1.2)
By an elementary consideration we can devise a DFT-based algorithm for the sine transform (=ˆ multiplication S_n × vector).

Tool: “wrap around” of the coefficient vector into an odd vector ỹ ∈ R^{2n}:

ỹ_j = { y_j          if j = 1, . . . , n−1 ,
        0            if j = 0, n ,
        −y_{2n−j}    if j = n+1, . . . , 2n−1 .      (ỹ “odd”)
Fig.: the data values y_j, j = 0, . . . , n (left), and their odd “wrapped around” extension ỹ_j, j = 0, . . . , 2n−1 (right).
Next we use sin(x) = (1/2ı)(exp(ıx) − exp(−ıx)) to identify the DFT of a wrapped-around vector as a sine transform:

(F_{2n} ỹ)_k  (4.2.1.19)=  ∑_{j=1}^{2n−1} ỹ_j e^{−(2πı/2n) kj}  =  ∑_{j=1}^{n−1} y_j e^{−(πı/n) kj}  −  ∑_{j=n+1}^{2n−1} y_{2n−j} e^{−(πı/n) kj}
            =  ∑_{j=1}^{n−1} y_j ( e^{−(πı/n) kj} − e^{(πı/n) kj} )  =  −2ı (S_n y)_k ,   k = 1, . . . , n−1 .
Eigen::VectorXcd ct;
Eigen::FFT<double> fft;   // DFT helper class
fft.SetFlag(Eigen::FFT<double>::Flag::Unscaled);
fft.fwd(ct, yt);
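For illustration, here is a minimal self-contained sketch of this “wrap around” idea; the helper name sinetransform_wrap and its interface (y holds y_1, . . . , y_{n−1}) are assumptions, not the course's official Code 4.4.1.3:

#include <unsupported/Eigen/FFT>
#include <Eigen/Dense>

// Sine transform s_k = sum_{j=1}^{n-1} y_j sin(pi*j*k/n) via a DFT of length 2n.
Eigen::VectorXd sinetransform_wrap(const Eigen::VectorXd &y) {
  const int n = y.size() + 1;                    // y = [y_1, ..., y_{n-1}]
  Eigen::VectorXcd yt = Eigen::VectorXcd::Zero(2 * n);
  for (int j = 1; j < n; ++j) {
    yt(j) = y(j - 1);                            // y_j
    yt(2 * n - j) = -y(j - 1);                   // odd extension: -y_{2n-j}
  }
  Eigen::FFT<double> fft;
  Eigen::VectorXcd c;
  fft.fwd(c, yt);                                // DFT of length 2n
  Eigen::VectorXd s(n - 1);
  for (int k = 1; k < n; ++k)
    s(k - 1) = -0.5 * c(k).imag();               // (F_{2n} yt)_k = -2i (S_n y)_k
  return s;
}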
Remark 4.4.1.4 (Sine transform via DFT of half length) The simple Code 4.4.1.3 relies on a DFT for
vectors of length 2n, which may be a waste of computational resources in some applications. A DFT of
length n is sufficient as demonstrated by the following manipulations.
Step ➀: transform of the coefficients (cf. the code below): set

ỹ_j := sin(πj/n)(y_j + y_{n−j}) + ½(y_j − y_{n−j}) ,   j = 0, . . . , n−1   (with y_0 := y_n := 0, so ỹ_0 = 0).

Step ➁: real DFT (→ Section 4.2.4) of (ỹ_0, . . . , ỹ_{n−1}) ∈ R^n:   c_k := ∑_{j=0}^{n−1} ỹ_j e^{−(2πı/n) jk} .

Hence

Re{c_k} = ∑_{j=0}^{n−1} ỹ_j cos(2πjk/n) = ∑_{j=1}^{n−1} (y_j + y_{n−j}) sin(πj/n) cos(2πjk/n)
        = ∑_{j=1}^{n−1} 2 y_j sin(πj/n) cos(2πjk/n) = ∑_{j=1}^{n−1} y_j ( sin((2k+1)πj/n) − sin((2k−1)πj/n) )
        = s_{2k+1} − s_{2k−1} ,

Im{c_k} = ∑_{j=0}^{n−1} ỹ_j sin(−2πjk/n) = − ∑_{j=1}^{n−1} ½ (y_j − y_{n−j}) sin(2πjk/n) = − ∑_{j=1}^{n−1} y_j sin(2πjk/n)
        = −s_{2k} .

Step ➂: extraction of the s_k:

s_{2k+1} ,  k = 0, . . . , n/2 − 1  ➤  from the recursion s_{2k+1} − s_{2k−1} = Re{c_k} ,  with  s_1 = ∑_{j=1}^{n−1} y_j sin(πj/n) ,
s_{2k} ,    k = 1, . . . , n/2 − 1  ➤  s_{2k} = −Im{c_k} .
Implementation (via an FFT of length n/2):

// Transform coefficients
Eigen::VectorXd yt(n);
yt(0) = 0;
yt.tail(n - 1) = sinevals.array() * (y + y.reverse()).array() +
                 0.5 * (y - y.reverse()).array();

// FFT
Eigen::VectorXcd c;
Eigen::FFT<double> fft;
fft.fwd(c, yt);

s.resize(n);
s(0) = sinevals.dot(y);
// ... (loop recovering the remaining entries s(j); only its final branch is shown)
    else {
      s(j) = s(j - 2) + c((k - 1) / 2).real();
    }
  }
}
EXAMPLE 4.4.1.6 (Diagonalization of local translation invariant linear grid operators) We consider
a so-called 5-points-stencil operator on R^{n,n}, n ∈ N, defined as follows:

T : R^{n,n} → R^{n,n} ,   X ↦ T(X) ,   (T(X))_{ij} := c x_{ij} + c_y x_{i,j+1} + c_y x_{i,j−1} + c_x x_{i+1,j} + c_x x_{i−1,j} .      (4.4.1.7)
A matrix X ∈ R^{n,n} can be identified with a grid function {1, . . . , n}^2 → R.
Fig.: visualization of a grid function on an n × n grid.
The identification R^{n,n} ≅ R^{n²}, x_{ij} ↔ x̃_{(j−1)n+i} (row-wise numbering), gives a matrix representation T ∈ R^{n²,n²} of T:

T =
[  C      c_y·I   0       ···     0     ]
[ c_y·I    C      c_y·I           ⋮     ]
[  0       ⋱       ⋱       ⋱      0     ]   ∈ R^{n²,n²} ,
[  ⋮              c_y·I    C     c_y·I  ]
[  0      ···      0      c_y·I   C     ]

C =
[  c     c_x    0      ···    0    ]
[ c_x     c     c_x           ⋮    ]
[  0      ⋱      ⋱      ⋱     0    ]   ∈ R^{n,n} .
[  ⋮            c_x     c    c_x   ]
[  0     ···     0     c_x    c    ]

(The stencil weight c_x couples horizontally neighbouring grid points, c_y vertically neighbouring ones; see the grid sketch above.)
The key observation is that the elements of the sine basis are eigenvectors of T:

(T(B^{kl}))_{ij} = c sin(π/(n+1) ki) sin(π/(n+1) lj)
                 + c_y sin(π/(n+1) ki) ( sin(π/(n+1) l(j−1)) + sin(π/(n+1) l(j+1)) )
                 + c_x sin(π/(n+1) lj) ( sin(π/(n+1) k(i−1)) + sin(π/(n+1) k(i+1)) )
                 = sin(π/(n+1) ki) sin(π/(n+1) lj) ( c + 2c_y cos(π/(n+1) l) + 2c_x cos(π/(n+1) k) ) .

Hence B^{kl} is an eigenvector of T (or of T after row-wise renumbering) and the corresponding eigenvalue is
c + 2c_y cos(π/(n+1) l) + 2c_x cos(π/(n+1) k). Recall very similar considerations for discrete (periodic)
convolutions in 1D (→ § 4.2.1.6) and 2D (→ § 4.2.5.9).
The basis transform can be implemented efficiently based on the 1D sine transform:

X = ∑_{k=1}^{n} ∑_{l=1}^{n} y_{kl} B^{kl}   ⇒   x_{ij} = ∑_{k=1}^{n} sin(π/(n+1) ki) ∑_{l=1}^{n} y_{kl} sin(π/(n+1) lj) .

Hence we apply nested sine transforms (→ Section 4.2.5) to the rows/columns of Y = (y_{kl})_{k,l=1}^{n}.
Eigen::VectorXcd c;
Eigen::FFT<double> fft;
const std::complex<double> i(0, 1);
// ... (intermediate steps of the 2D sine transform not reproduced here)
S = (i * C2.middleRows(1, n).transpose() / 2.).real();
}
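As a complement, here is a compact sketch of how such a 2D sine transform can be composed from a 1D routine. The names sinetransform1d and sinetransform2d_sketch are assumptions (the course's own sinetransform2d is on GITLAB); sinetransform1d can be, e.g., the wrap-around sketch shown earlier:

#include <Eigen/Dense>

// 1D sine transform of a vector of length L with respect to sin(pi*j*k/(L+1))
Eigen::VectorXd sinetransform1d(const Eigen::VectorXd &y);

// 2D sine transform: apply the 1D transform to every column, then to every row.
Eigen::MatrixXd sinetransform2d_sketch(const Eigen::MatrixXd &Y) {
  Eigen::MatrixXd tmp(Y.rows(), Y.cols());
  for (int j = 0; j < Y.cols(); ++j)
    tmp.col(j) = sinetransform1d(Y.col(j));
  Eigen::MatrixXd S(Y.rows(), Y.cols());
  for (int i = 0; i < Y.rows(); ++i)
    S.row(i) = sinetransform1d(tmp.row(i).transpose()).transpose();
  return S;
}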
C++ code 4.4.1.10: FFT-based solution of local translation invariant linear operators
➺ GITLAB
void fftbasedsolutionlocal(const Eigen::MatrixXd &B,
                           double c, double cx, double cy, Eigen::MatrixXd &X) {
  const Eigen::Index m = B.rows();
  const Eigen::Index n = B.cols();

  // Eigen's substitute for a meshgrid
  const Eigen::MatrixXd I =
      Eigen::RowVectorXd::LinSpaced(n, 1, static_cast<double>(n)).replicate(m, 1);
  const Eigen::MatrixXd J =
      Eigen::VectorXd::LinSpaced(m, 1, static_cast<double>(m)).replicate(1, n);

  // 2D sine transform of the right-hand side
  Eigen::MatrixXd X_;
  sinetransform2d(B, X_);

  // Componentwise division by the eigenvalues c + 2*cx*cos(...) + 2*cy*cos(...)
  Eigen::MatrixXd T =
      (c + 2 * cx * (M_PI / (static_cast<double>(n) + 1) * I).array().cos() +
           2 * cy * (M_PI / (static_cast<double>(m) + 1) * J).array().cos())
          .matrix();
  X_ = X_.cwiseQuotient(T);

  // Back-transformation and scaling
  sinetransform2d(X_, X);
  X = 4 * X / ((m + 1) * (n + 1));
}
Thus the diagonalization of T via the 2D sine transform yields an efficient algorithm for solving linear systems
of equations T(X) = B: computational cost O(n² log n). y
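For orientation, a hedged usage sketch of the routine above; the stencil values c = 4, c_x = c_y = −1 are those of the experiment below, while the right-hand side chosen here is arbitrary:

#include <Eigen/Dense>

// from Code 4.4.1.10
void fftbasedsolutionlocal(const Eigen::MatrixXd &B, double c, double cx, double cy,
                           Eigen::MatrixXd &X);

int main() {
  const int n = 64;
  const Eigen::MatrixXd B = Eigen::MatrixXd::Ones(n, n);   // some right-hand side
  Eigen::MatrixXd X;
  fftbasedsolutionlocal(B, 4.0, -1.0, -1.0, X);             // solve T(X) = B
  return 0;
}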
EXPERIMENT 4.4.1.11 (Efficiency of FFT-based solver) In the experiment we test the gain in runtime
obtained by using DFT-based algorithms for solving linear systems of equations with coefficient matrix T
induced by the operator T from (4.4.1.7) with the values
c=4 , c x = c y = −1 .
This means

T :=
[  C   −I    0    ···   0  ]
[ −I    C   −I          ⋮  ]
[  0    ⋱    ⋱    ⋱     0  ]   ∈ R^{n²,n²} ,
[  ⋮        −I    C    −I  ]
[  0   ···   0   −I     C  ]

C :=
[  4   −1    0    ···   0  ]
[ −1    4   −1          ⋮  ]
[  0    ⋱    ⋱    ⋱     0  ]   ∈ R^{n,n} .
[  ⋮        −1    4    −1  ]
[  0   ···   0   −1     4  ]
Fig.: runtime [s] as a function of n for the FFT-based solver (“FFT-Loeser”) and MATLAB's backslash solver (“Backslash-Loeser”); tic-toc timing, MATLAB V7, Linux, Intel Pentium 4 Mobile CPU 1.80GHz. y
Cosine transform of y = [y_0, . . . , y_{n−1}]^⊤:

c_k = ∑_{j=0}^{n−1} y_j cos( k (2j+1)/(2n) π ) ,   k = 1, . . . , n−1 ,      c_0 = (1/√2) ∑_{j=0}^{n−1} y_j .      (4.4.2.2)
Eigen::VectorXd y_(2 * n);
y_.head(n) = y;
y_.tail(n) = y.reverse();

// FFT
Eigen::VectorXcd z;
Eigen::FFT<double> fft;
fft.fwd(z, y_);
Implementation of C_n^{−1} y (“wrapping” technique):

// FFT
Eigen::VectorXd z;
Eigen::FFT<double> fft;
fft.inv(z, c_2);
// ...
y = 2 * y_.head(n);
}
Video tutorial for Section 4.5 "Toeplitz Matrix Techniques": (20 minutes) Download link,
tablet notes
This section examines FFT-based algorithms for more general problems in numerical linear algebra. It
connects to the matrix perspective of DFT and linear filters that was adopted occasionally in Section 4.1
and Section 4.2.
This task reminds us of the parameter estimation problem from Ex. 3.0.1.4, which we tackled with least
squares techniques. We employ similar ideas for the current problem.
Fig.: input signal x_k and (measured) output signal y_k plotted over time.
If the yk were exact, we could retrieve h0 , . . . , hn−1 by examining only y0 , . . . , yn−1 and inverting the
discrete periodic convolution (→ Def. 4.1.4.7) using (4.2.1.17).
However, in case the y_k are affected by measurement errors it is advisable to use all available y_k for a
least squares estimate of the impulse response.
We can now formulate the least squares parameter identification problem: seek h = [h_0, . . . , h_{n−1}]^⊤ ∈ R^n with

‖Ah − y‖_2 → min ,

where y := [y_0, . . . , y_{m−1}]^⊤ and the coefficient matrix A ∈ R^{m,n} is given by

A =
[ x_0       x_{−1}    ···    ···    x_{1−n}  ]
[ x_1       x_0       x_{−1}           ⋮     ]
[ ⋮         x_1       x_0       ⋱      ⋮     ]
[ ⋮             ⋱        ⋱      ⋱   x_{−1}   ]
[ x_{n−1}             ···      x_1   x_0     ]
[ x_n       x_{n−1}          ···     x_1     ]
[ ⋮                                   ⋮      ]
[ x_{m−1}   ···         ···        x_{m−n}   ] .
This is a linear least squares problem as introduced in Chapter 3 with a coefficient matrix A that enjoys
the property that (A)ij = xi− j , which means that all its diagonals have constant entries.
The coefficient matrix for the normal equations (→ Section 3.1.2, Thm. 3.1.2.1) corresponding to the
above linear least squares problem is

M := A^⊤A ,   (M)_{ij} = ∑_{k=0}^{m−1} x_{k−i} x_{k−j} =: z_{i−j}

for some m-periodic sequence (z_k)_{k∈Z}, due to the m-periodicity of (x_k)_{k∈Z}.
➣ M ∈ R n,n is a matrix with constant diagonals & symmetric positive semi-definite (→ Def. 1.1.2.6)
(“constant diagonals” ⇔ (M)i,j depends only on i − j)
y
EXAMPLE 4.5.1.3 (Linear regression for stationary Markov chains) We consider a sequence of scalar
random variables: (Yk )k∈Z , a so-called Markov chain. These can be thought of as values for a random
quantity sampled at equidistant points in time.
We assume stationary (time-independent) correlations, that is, with (A, Ω, dP) denoting the underlying
probability space,

E(Y_{i−j} Y_{i−k}) = ∫_Ω Y_{i−j}(ω) Y_{i−k}(ω) dP(ω) = u_{k−j}   ∀ i, j, k ∈ Z ,   u_i = u_{−i} .
The trick is to use the linearity of the expectation, which makes it possible to convert (4.5.1.4) into

x = [x_1, . . . , x_n]^⊤ ∈ R^n :   E|Y_i|² − 2 ∑_{j=1}^{n} x_j u_j + ∑_{k,j=1}^{n} x_k x_j u_{k−j}  →  min ,

that is,

x^⊤Ax − 2b^⊤x → min   with   b = [u_k]_{k=1}^{n} ,   A = ( u_{i−j} )_{i,j=1}^{n} .      (4.5.1.5)
By definition A is a so-called covariance matrix and, as such, has to be symmetric and positive definite
(→ Def. 1.1.2.6). By its very definition it has constant diagonals. Also note that

x^⊤Ax − 2b^⊤x = (x − x^∗)^⊤A(x − x^∗) − (x^∗)^⊤Ax^∗   with   x^∗ = A^{−1}b .

Therefore x^∗ is the unique minimizer of x^⊤Ax − 2b^⊤x. The problem is reduced to
solving the linear system of equations Ax = b (Yule-Walker equation, see below). y
Matrices with constant diagonals occur frequently in mathematical models, see Ex. 4.5.1.1, Ex. 4.5.1.3.
They generalize circulant matrices (→ Def. 4.1.4.12).
Definition 4.5.1.7. Toeplitz matrix
T = (t_ij)_{i,j=1}^{m,n} ∈ K^{m,n} is a Toeplitz matrix, if there is a vector u = [u_{−m+1}, . . . , u_{n−1}] ∈ K^{m+n−1}
such that t_ij = u_{j−i}, 1 ≤ i ≤ m, 1 ≤ j ≤ n, that is,

T =
[ u_0      u_1    ···    ···   u_{n−1} ]
[ u_{−1}   u_0    u_1            ⋮     ]
[ ⋮         ⋱      ⋱      ⋱      ⋮     ]
[ ⋮                ⋱      ⋱     u_1    ]
[ u_{1−m}  ···    ···   u_{−1}  u_0    ] .
Note: The “information content” of a matrix M ∈ K m,n with constant diagonals, that is, (M)i,j = mi− j ,
is m + n − 1 numbers ∈ K.
Hence, though potentially densely populated, m × n Toeplitz matrices are data-sparse with infor-
mation content ≪ mn.
To motivate the approach we realize that we have already encountered Toeplitz matrices in the convolution
of finite signals discussed in Rem. 4.1.3.1, see (4.1.3.2). The trick introduced in Rem. 4.1.4.15 was to
extend u to the generating vector of a circulant (m+n) × (m+n) matrix C:

c_j = { u_j   for j = −m+1, . . . , n−1 ,
        0     for j = n ,                       + periodic extension.

The upper left m × n block of C contains T.
Recall from 4.3 that the multiplication with a circulant (m + n) × (m + n)-matrix (= discrete periodic
convolution → Def. 4.1.4.7) can be carried out by means of DFT/FFT with an asymptotic computational
effort of O((m + n) log(m + n)) for m, n → ∞, see Code 4.2.2.4.
From (4.5.2.1) it is clear how to implement the matrix×vector product for the Toeplitz matrix T: pad x with zeros and multiply by C,

C · [ x ; 0 ] = [ T x ; ∗ ]      (zero padding).
Therefore the asymptotic computational effort for computing Tx is O((n + m) log(m + n)) for m, n → ∞,
provided that an FFT-based algorithm for discrete periodic convolution is used, see Code 4.2.2.4. This
complexity is almost optimal in light of the data complexity O(m + n) of the Toeplitz matrix.
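A minimal sketch of this circulant embedding is given below. It assumes that the generating vector is stored as u = [u_{−m+1}, . . . , u_{n−1}] (as in Def. 4.5.1.7) and that pconvfft is an FFT-based periodic convolution in the spirit of Code 4.2.2.4; the function name toepmatvec is hypothetical.

#include <unsupported/Eigen/FFT>
#include <Eigen/Dense>

// FFT-based discrete periodic convolution, e.g. the sketch given earlier in this chapter
Eigen::VectorXcd pconvfft(const Eigen::VectorXcd &u, const Eigen::VectorXcd &x);

// y = T*x for the m x n Toeplitz matrix T with (T)_{ij} = u_{j-i},
// via embedding into an (m+n) x (m+n) circulant matrix.
Eigen::VectorXcd toepmatvec(const Eigen::VectorXcd &u, const Eigen::VectorXcd &x,
                            int m, int n) {
  const int N = m + n;
  Eigen::VectorXcd chat = Eigen::VectorXcd::Zero(N);
  for (int k = 0; k < m; ++k) chat(k) = u(m - 1 - k);          // u_0, u_{-1}, ..., u_{1-m}
  for (int r = 1; r < n; ++r) chat(m + r) = u(m - 1 + n - r);  // 0, then u_{n-1}, ..., u_1
  Eigen::VectorXcd xpad = Eigen::VectorXcd::Zero(N);
  xpad.head(n) = x;                                            // zero padding of x
  return pconvfft(chat, xpad).head(m);                         // first m entries give T*x
}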
Note that the symmetry of a Toeplitz matrix is induced by the property u−k = uk of its generating vector.
Task: Find an efficient solution algorithm for the LSE Tx = b = [b_1, . . . , b_n]^⊤, b ∈ R^n, the Yule-Walker problem from Ex. 4.5.1.3.

Define:
✦ T_k := ( u_{j−i} )_{i,j=1}^{k} ∈ K^{k,k} (left upper block of T)  ➣  T_k is an s.p.d. Toeplitz matrix,
✦ x^k ∈ K^k :  T_k x^k = b^k := [b_1, . . . , b_k]^⊤  ⇔  x^k = T_k^{−1} b^k ,
✦ u^k := (u_1, . . . , u_k)^⊤ ∈ R^k .
We block-partition the linear system of equations T_{k+1} x^{k+1} = b^{k+1}, k < n:

T_{k+1} x^{k+1}  =  [      T_k         [u_k, . . . , u_1]^⊤ ]  [   x̃^{k+1}     ]   =   [   b^k    ]      (4.5.3.1)
                    [ u_k  ···  u_1            1            ]  [ x^{k+1}_{k+1} ]       [ b_{k+1}  ]

Now recall block Gaussian elimination/block-LU decomposition from Rem. 2.3.1.14, Rem. 2.3.2.19. They
teach us how to eliminate x̃^{k+1} and obtain an expression for x^{k+1}_{k+1}.
To state the formulas concisely, we introduce reversing permutations. For a vector they can be realized by
E IGEN’s reverse() method.
§4.5.3.5 (asymptotic complexity) Obviously, given xk and yk , the evaluations involved in (4.5.3.4) take
O(k ) operations for k → ∞, in order to get xk+1 .
It seems that two recursive calls are necessary in order to obtain yk and xk , which enter
(4.5.3.4): this is too expensive!
If bk = Pk uk , then xk = yk
Hence, yk can be computed with an asymptotic cost of O(k2 ) for k → ∞. Once the yk are available,
another simple linear recursion gives us xk with a cost of O(k2 ) for k → ∞.
Cost for solving Tx = b = O(k2 ) for k → ∞. y
Below we give a C++ implementation of the Levinson algorithm for the solution of the Yule-Walker problem
Tx = b with an s.p.d. Toeplitz matrix described by its generating vector u (recursive implementation, xk ,
yk computed simultaneously, un+1 not used!)
Note that this implementation of the Levinson algorithm employs a simple linear recursion with computa-
tional cost ∼ (n − k ) on level k, k = 0, . . . , n − 1, which results in an overall asymptotic complexity of
O(n2 ) for n → ∞, as already discussed in § 4.5.3.5.
Remark 4.5.3.7 (Fast Toeplitz solvers) Meanwhile researchers have found better methods [Ste03]:
now there are FFT-based algorithms for solving Tx = b, T a Toeplitz matrix, with asymptotic complexity
O(n log3 n)! y
Supplementary literature. [DR08, Sect. 8.5]: a very detailed and elementary presentation, which, however,
introduces the discrete Fourier transform through trigonometric interpolation, a topic not covered in this
chapter. It hardly addresses discrete convolution.
[Han02, Ch. IX] presents the topic from a mathematical point of view stressing approximation and
trigonometric interpolation. Good reference for algorithms for circulant and Toeplitz matrices.
[Sau06, Ch. 10] also discusses the discrete Fourier transform with emphasis on interpolation and
(least squares) approximation. The presentation of signal processing differs from that of the course.
There is a vast number of books and survey papers dedicated to discrete Fourier transforms, see,
for instance, [Bri88; DV90]. Issues and technical details way beyond the scope of the course are
discussed in these monographs.
Review question(s) 4.5.3.8 (Toeplitz matrix techniques)
(Q4.5.3.8.A) Give an example of a Toeplitz matrix T ∈ R n,n , n > 2, with rank(T) = 1.
(Q4.5.3.8.B) Show that the product of two lower triangular Toeplitz matrices is a Toeplitz matrix again.
△
[Bri88] E.O. Brigham. The Fast Fourier Transform and Its Applications. Englewood Cliffs, NJ: Prentice-
Hall, 1988 (cit. on p. 377).
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 361, 377).
[DV90] P. Duhamel and M. Vetterli. “Fast Fourier transforms: a tutorial review and a state of the art”.
In: Signal Processing 19 (1990), pp. 259–299 (cit. on pp. 355, 377).
[FJ05] M. Frigo and S. G. Johnson. “The Design and Implementation of FFTW3”. In: Proceedings
of the IEEE 93.2 (Feb. 2005), pp. 216–231. DOI: 10.1109/JPROC.2004.840301 (cit. on
p. 361).
[Gut09] M.H. Gutknecht. Lineare Algebra. Lecture Notes. SAM, ETH Zürich, 2009 (cit. on p. 319).
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 317, 361, 363,
377).
[HR11] Georg Heinig and Karla Rost. “Fast algorithms for Toeplitz and Hankel matrices”. In: Linear
Algebra Appl. 435.1 (2011), pp. 1–59.
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
pp. 319, 351).
[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Math-
ematics. New York: Springer, 2000 (cit. on pp. 344, 361).
[Rad68] C.M. Rader. “Discrete Fourier Transforms when the Number of Data Samples Is Prime”. In:
Proceedings of the IEEE 56 (1968), pp. 1107–1108 (cit. on p. 359).
[Sau06] T. Sauer. Numerical analysis. Boston: Addison Wesley, 2006 (cit. on p. 377).
[Ste03] M. Stewart. “A Superfast Toeplitz Solver with Improved Numerical Stability”. In: SIAM J. Matrix
Analysis Appl. 25.3 (2003), pp. 669–693 (cit. on p. 376).
[Str99] Gilbert Strang. “The Discrete Cosine Transform”. In: SIAM Review 41.1 (1999), pp. 135–147.
DOI: 10.1137/S0036144598336745 (cit. on p. 363).
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on pp. 348, 350).
Chapter 5
Data Interpolation and Data Fitting in 1D
Contents
5.1 Abstract Interpolation (AI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
5.2 Global Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
5.2.1 Uni-Variate Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
5.2.2 Polynomial Interpolation: Theory . . . . . . . . . . . . . . . . . . . . . . . . 389
5.2.3 Polynomial Interpolation: Algorithms . . . . . . . . . . . . . . . . . . . . . . 393
5.2.4 Polynomial Interpolation: Sensitivity . . . . . . . . . . . . . . . . . . . . . . 409
5.3 Shape-Preserving Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
5.3.1 Shape Properties of Functions and Data . . . . . . . . . . . . . . . . . . . . . 414
5.3.2 Piecewise Linear Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . 416
5.3.3 Cubic Hermite Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
5.4 Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
5.4.1 Spline Function Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
5.4.2 Cubic-Spline Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
5.4.3 Structural Properties of Cubic Spline Interpolants . . . . . . . . . . . . . . . 431
5.4.4 Shape Preserving Spline Interpolation . . . . . . . . . . . . . . . . . . . . . . 435
5.5 Algorithms for Curve Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
5.5.1 CAD Task: Curves from Control Points . . . . . . . . . . . . . . . . . . . . . 440
5.5.2 Bezier Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
5.5.3 Spline Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
5.6 Trigonometric Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
5.6.1 Trigonometric Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
5.6.2 Reduction to Lagrange Interpolation . . . . . . . . . . . . . . . . . . . . . . . 452
5.6.3 Equidistant Trigonometric Interpolation . . . . . . . . . . . . . . . . . . . . . 454
5.7 Least Squares Data Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
One-dimensional interpolation
f(t_i) = y_i ,   i = 0, . . . , n .      (5.1.0.2)

Fig.: data points (t_i, y_i) in the (t, y)-plane.

Parlance: the numbers t_i ∈ R are called nodes, the y_i ∈ R are the (data) values.

Of course, a necessary requirement on the data is that the t_i are pairwise distinct: t_i ≠ t_j for i ≠ j.
Remark 5.1.0.3 (Generalization of data) In (supervised) machine learning this task is called the gener-
alization of the data, because we aim for the creation of a model in the form of the function f : I → R
that permits us to generate new data points based on what we have “learned” from the provided data. y
For ease of presentation we will usually assume that the nodes are ordered: t_0 < t_1 < · · · < t_n and
[t_0, t_n] ⊂ I. However, algorithms often must not take sorted nodes for granted.
Remark 5.1.0.4 (Interpolation of vector-valued data) A natural generalization is data interpolation with
vector-valued data values, seeking a function f : I → R d , d ∈ N, such that, for given data points (ti , yi ),
ti ∈ I mutually different, yi ∈ R d , it satisfies the interpolation conditions f(ti ) = yi , i = 0, . . . , n.
In this case all methods available for scalar data can be applied component-wise.
An important application is curve reconstruction, that is, the interpolation of points y_0, . . . , y_n ∈ R² in the
plane, see Fig. 147. A particular aspect of this problem is that the nodes t_i also have to be found, usually
from the location of the y_i in a preprocessing step.

Fig. 147: points y_0, . . . , y_5 in the (x_1, x_2)-plane connected by an interpolating curve. y
Remark 5.1.0.5 (Multi-dimensional data interpolation) In many applications (computer graphics, com-
puter vision, numerical method for partial differential equations, remote sensing, geodesy, etc.) one has
to reconstruct functions of several variables.
This leads to the task of multi-dimensional data interpolation:
Significant additional challenges arise in a genuine multidimensional setting. A treatment is beyond the
scope of this course. However, the one-dimensional techniques presented in this chapter are relevant
even for multi-dimensional data interpolation, if the points xi ∈ R m are points of a finite lattice also called
tensor product grid.
For instance, for m = 2 this is the case, if
{x_i}_i = { [t_k, s_l]^⊤ ∈ R² :  k ∈ {0, . . . , K} ,  l ∈ {0, . . . , L} } ,      (5.1.0.6)
§5.1.0.7 (Interpolation schemes) When we talk about “interpolation schemes” in 1D, we mean a mapping

I :  R^{n+1} × R^{n+1} → { f : I → R } ,   ( [t_i]_{i=0}^{n} , [y_i]_{i=0}^{n} ) ↦ interpolant .
Once the function space to which the interpolant belongs is specified, an interpolation scheme defines
an “interpolation problem” in the sense of § 1.5.5.1. Sometimes, only the data values y_i are considered input
data, whereas the dependence of the interpolant on the nodes t_i is suppressed, see Section 5.2.4. y
Interpolants can have vastly different properties. This will become apparent in the course of this chapter,
when different methods to build interpolants are discussed.

Fig. 148: different interpolants (piecewise linear, polynomial “poly”, spline, pchip) through the same data points.
EXAMPLE 5.1.0.8 (Constitutive relations from measurements) This example addresses an important
application of data interpolation in 1D.
In this context: t, y =ˆ two state variables of a physical system, where t determines y: a functional
dependence y = y(t) is assumed.

Examples:      t: voltage U          y: current I
               t: pressure p         y: density ρ
               t: magnetic field H   y: magnetic flux B
               · · ·                 · · ·

Known: several accurate (∗) measurements   (t_i, y_i) ,  i = 1, . . . , m .
Why do we need to extract the constitutive relations as a function? Imagine that t, y correspond to the
voltage U and current I measured for a 2-port non-linear circuit element (like a diode). This element will
be part of a circuit, which we want to simulate based on nodal analysis as in Ex. 8.1.0.1. In order to solve
the resulting non-linear system of equations F (u) = 0 for the nodal potentials (collected in the vector
u) by means of Newton’s method (→ Section 8.5) we need the voltage-current relationship for the circuit
element as a continuously differentiable function I = f (U ).
(∗) Meaning of attribute “accurate”: justification for interpolation. If measured values yi were affected by
considerable errors, one would not impose the interpolation conditions (5.1.0.2), but opt for data fitting (→
Section 5.7). y
Remark 5.1.0.9 (Mathematical functions in a numerical code) What does it mean to “represent” or
“make available” a function f : I ⊂ R 7→ R in a computer code?
In the context of numerical methods, “function” should rather be read as “subroutine”, a piece of code that
can, for any x ∈ I , compute f ( x ) in finite time. Even this has to be qualified, because we can only pass
machine numbers x ∈ I ∩ M (→ § 1.5.2.1) and, of course, in most cases, f ( x ) will be an approximation.
In a C++ code a simple real valued function can be incarnated through a function object of a type as given
in Code 5.1.0.10, see also Section 0.3.3.
Remark 5.1.0.11 (A data type designed for interpolation problems) If a constitutive relationship for a
circuit element is needed in a C++ simulation code (→ Ex. 5.1.0.8), the following specialized Function
class could be used to represent it. It demonstrates the concrete object oriented implementation of an
interpolant.
class Interpolant {
 private:
  // Various internal data describing f
  // Can be the coefficients of a basis representation (5.1.0.14)
 public:
  // Constructor: computation of the coefficients c_j of the representation (5.1.0.14)
  Interpolant(const vector<double> &t, const vector<double> &y);
  // Evaluation operator for the interpolant f
  double operator()(double t) const;
};
Of course, the basis functions b j should be “simple” in the sense that b j ( x ) can be computed efficiently for
every x ∈ I and every j = 0, . . . , m.
Note that the basis functions may depend on the nodes ti , but they must not depend on the values yi .
➙ The internal representation of f (in the data member section of the class Function from
Code 5.1.0.10) will then boil down to storing the coefficients/parameters c j , j = 0, . . . , m.
Note: The focus in this chapter will be on the special case that the data interpolants belong to a finite-
dimensional space of functions spanned by “simple” basis functions.
y
EXAMPLE 5.1.0.15 (Piecewise linear interpolation, see also Section 5.3.2) Recall: A linear function
in 1D is a function of the form x 7→ a + bx, a, b ∈ R (polynomial of degree 1).
Piecewise linear interpolation  ➣  interpolating polygonal line.

Fig. 149: piecewise linear interpolant (polygonal line) through data points over the nodes t_0, . . . , t_4.
What is the space V of functions from which we select the interpolant? Remember that a linear function
R → R always can be written as t 7→ α + βt with suitable coefficients α, β ∈ R. We can use this formula
locally on every interval between two nodes. Assuming sorted nodes, t0 < t1 < · · · < tn , this leads to
the mathematical definition
V := { f ∈ C⁰(I) :  f(t) = α_i + β_i t  for  t ∈ [t_i, t_{i+1}] ,  i = 0, . . . , n−1 } .      (5.1.0.16)
Fig. 150: the “tent function” basis b_0, b_1, . . . , b_n of V over the nodes t_0, . . . , t_n; each basis function attains the value 1 at exactly one node.
Note: in Fig. 150 the basis functions have to be extended by zero outside the t-range where they are
drawn.
Explicit formulas for these basis functions can be given and bear out that they are really “simple”:

b_0(t) = { 1 − (t − t_0)/(t_1 − t_0)        for t_0 ≤ t < t_1 ,
           0                                for t ≥ t_1 ,

b_j(t) = { 1 − (t_j − t)/(t_j − t_{j−1})    for t_{j−1} ≤ t < t_j ,
           1 − (t − t_j)/(t_{j+1} − t_j)    for t_j ≤ t < t_{j+1} ,       j = 1, . . . , n−1 ,      (5.1.0.17)
           0                                elsewhere in [t_0, t_n] ,

b_n(t) = { 1 − (t_n − t)/(t_n − t_{n−1})    for t_{n−1} ≤ t < t_n ,
           0                                for t < t_{n−1} .
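These formulas translate directly into code. The following is a minimal sketch (not the course's official implementation; the name pwlinintp is an assumption) of evaluating the piecewise linear interpolant at a single point, assuming sorted nodes:

#include <Eigen/Dense>
#include <cassert>

// Evaluate the piecewise linear interpolant through (t_i, y_i) at x, t sorted.
double pwlinintp(const Eigen::VectorXd &t, const Eigen::VectorXd &y, double x) {
  const int n = t.size() - 1;
  assert(y.size() == n + 1 && x >= t(0) && x <= t(n));
  // locate the interval [t_i, t_{i+1}] containing x (linear search for clarity)
  int i = 0;
  while (i < n - 1 && x > t(i + 1)) ++i;
  // local linear interpolation f(x) = y_i + (y_{i+1} - y_i) (x - t_i)/(t_{i+1} - t_i)
  return y(i) + (y(i + 1) - y(i)) * (x - t(i)) / (t(i + 1) - t(i));
}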
The property b_j(t_i) = δ_ij, i, j = 0, . . . , n, of the tent function basis is so important that it has been given a
special name:
§5.1.0.21 (Interpolation as a linear mapping) We consider the setting for interpolation where the interpolant belongs to a finite-dimensional space V_m of functions spanned by basis functions b_0, . . . , b_m, see
Rem. 5.1.0.9. Then the interpolation conditions imply that the basis expansion coefficients satisfy a linear
system of equations:

(5.1.0.2) & (5.1.0.14)  ⇒   f(t_i) = ∑_{j=0}^{m} c_j b_j(t_i) = y_i ,   i = 0, . . . , n ,      (5.1.0.22)

that is,

Ac :=
[ b_0(t_0)  . . .  b_m(t_0) ]   [ c_0 ]     [ y_0 ]
[    ⋮        ⋱       ⋮     ] · [  ⋮  ]  =  [  ⋮  ]  =: y .      (5.1.0.23)
[ b_0(t_n)  . . .  b_m(t_n) ]   [ c_m ]     [ y_n ]
The interpolation problem in Vm and the linear system (5.1.0.23) are really equivalent in the sense that
(unique) solvability of one implies (unique) solvability of the other.
If m = n and A from (5.1.0.23) is regular (→ Def. 2.2.1.1), then for any values y_j, j = 0, . . . , n, we can find
coefficients c_j, j = 0, . . . , n, and from them build the interpolant according to (5.1.0.14):

f = ∑_{j=0}^{n} ( A^{−1}y )_j  b_j .      (5.1.0.24)
For fixed nodes t_i the interpolation problem (5.1.0.22) defines a linear mapping

I :  R^{n+1} → V_n ,   y ↦ f .

An interpolation operator I : R^{n+1} → C⁰([t_0, t_n]) for the given nodes t_0 < t_1 < · · · < t_n is called
linear, if I(αy + βz) = αI(y) + βI(z) for all y, z ∈ R^{n+1} and all α, β ∈ R.

✎ Notation: C⁰([t_0, t_n]) =ˆ vector space of continuous functions on [t_0, t_n]. y
Review question(s) 5.1.0.27 (Abstract Interpolation)
(Q5.1.0.27.A) Let {b_0, . . . , b_n} be a basis of a subspace V of the space C⁰(I) of continuous functions
I ⊂ R → R. Which linear system has to be solved to determine the basis expansion coefficients of
the interpolant f ∈ V satisfying the interpolation conditions f(t_i) = y_i for a given node set {t_0, t_1, . . . , t_n}
and values y_i ∈ R?
How does reordering the nodes affect the coefficient matrix of that linear system?
(Q5.1.0.27.B) Given I ⊂ R and the node set {t0 , t1 , . . . , tn } ⊂ I , the ReLU basis of the space V of
piecewise linear continuous functions on that node set is comprised of the functions
r_0(t) := 1 ,      r_i(t) := { 0              for t < t_{i−1} ,
                               t − t_{i−1}    for t ≥ t_{i−1} ,       i ∈ {1, . . . , n} ,  t ∈ I .
• Show that this set of functions {r0 , r1 , . . . , rn } is really a basis of V .
• Assuming that the nodes are sorted, t0 < t1 < · · · < tn , describe the structure of the coefficient
matrix of that linear system that has to be solved to determine the ReLU basis coefficients of an
interpolant.
△
P_k := { t ↦ α_k t^k + α_{k−1} t^{k−1} + · · · + α_1 t + α_0 · 1 ,  α_j ∈ R } ,      (5.2.1.1)

with α_k the leading coefficient.
Terminology: The functions t 7→ tk , k ∈ N0 , are called monomials and the formula t 7→ αk tk +
αk−1 tk−1 + · · · + α0 is the monomial representation of a polynomial.
Obviously, Pk is a vector space, see [NS02, Sect. 4.2, Bsp. 4]. What is its dimension?
dim Pk = k + 1 and P k ⊂ C ∞ (R ).
§5.2.1.3 (The charms of polynomials) Why are polynomials important in computational mathematics?
Remark 5.2.1.4 (Monomial representation) Polynomials (of degree k) in monomial representation are
stored as a vector of their coefficients a j , j = 0, . . . , k. A convention for the ordering has to be fixed.
For instance, the N UM P Y module of P YTHON stores the coefficients of the monomial representation in an
array in descending order :
P YTHON: p(t) := αk tk + αk−1 tk−1 + · · · + α0 ➙ array (αk , αk−1 , . . . , α0 ) (ordered!).
Thus the evaluation of a polynomial given through an array of monomial coefficients reads as:
In [8]: numpy.polyval([3, 0, 1], 5)   # 3*5^2 + 0*5^1 + 1
Out[8]: 76
§5.2.1.5 (Horner scheme → [DR08, Bem. 8.11]) Efficient evaluation of a polynomial in monomial
representation is possible with the Horner scheme, as indicated by the nested representation

p(t) = ( · · · ( (α_k t + α_{k−1}) t + α_{k−2} ) t + · · · + α_1 ) t + α_0 .

The following code gives an implementation based on vector data types of EIGEN. The function is vectorized in the sense that many evaluation points are processed in parallel.
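Since the course's listing is not reproduced here, the following lines are a minimal sketch of such a vectorized Horner evaluation (the function name polyval and the descending coefficient ordering, as in the NumPy convention above, are assumptions):

#include <Eigen/Dense>

// Evaluate p(t) = a(0)*t^k + a(1)*t^(k-1) + ... + a(k) at all points in x
// by the Horner scheme, vectorized over the evaluation points.
Eigen::VectorXd polyval(const Eigen::VectorXd &a, const Eigen::VectorXd &x) {
  Eigen::VectorXd p = Eigen::VectorXd::Constant(x.size(), a(0));
  for (Eigen::Index j = 1; j < a.size(); ++j)
    p = (x.array() * p.array() + a(j)).matrix();   // p <- p.*x + a_j
  return p;
}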
(Q5.2.1.8.C) For given k ∈ N we store the monomial coefficients of the polynomial p(t) := αk tk +
αk−1 tk−1 + · · · + α0 in a vector a := [αk , . . . , α0 ] ∈ R k+1 . Find a matrix D ∈ R k,k+1 such that
Da ∈ R k provides the monomial coefficients of the derivative p′ .
(Q5.2.1.8.D) The mapping

Φ :  P_k → R ,   p ↦ ∫_0^1 p(t) dt ,
is obviously linear and, therefore, has a matrix representation with respect to the monomial basis
{t 7→ 1, t 7→ t, t 7→ t2 , . . . , t 7→ tk } of Pk . Find that matrix.
(Q5.2.1.8.E) A problem from linear algebra: prove that the functions of the monomial basis

{ t ↦ t^ℓ }_{ℓ=0}^{n}  ⊂  P_n ,   n ∈ N ,

are linearly independent and, thus, form a basis of P_n.
Hint. Differentiate several times!
p(t) = γ_0 (t − γ_1) · · · (t − γ_n) ,   γ_i ∈ R ,  i = 0, . . . , n .      (5.2.1.9)

Somebody proposes to represent generic polynomials used in a numerical code in factorized form
through the vectors [γ_0, γ_1, . . . , γ_n] ∈ R^{n+1} of coefficients. Discuss the pros and cons.
△
Supplementary literature. This topic is also presented in [DR08, Sect. 8.2.1] and [QSS00].
Now we consider the interpolation problem introduced in Section 5.1 for the special case that the sought
interpolant belongs to the polynomial space Pk (with suitable degree k).
Is this a well-defined problem? Obviously, it fits the framework developed in Rem. 5.1.0.9 and § 5.1.0.21,
because Pn is a finite-dimensional space of functions, for which we already know a basis, the monomials.
Thus, in principle, we could examine the matrix A from (5.1.0.23) to decide, whether the polynomial
interpolant exists and is unique. However, there is a shorter way.
§5.2.2.3 (Lagrange polynomials) For a given set {t_0, t_1, . . . , t_n} ⊂ R of nodes consider the

Lagrange polynomials      L_i(t) := ∏_{j=0, j≠i}^{n}  (t − t_j)/(t_i − t_j) ,   i = 0, . . . , n .      (5.2.2.4)

➙ Evidently, the Lagrange polynomials satisfy L_i ∈ P_n and L_i(t_j) = δ_ij := { 1 if i = j ,  0 else } .
From this relationship we infer that the Lagrange polynomials are linearly independent. Since there are
n + 1 = dim Pn different Lagrange polynomials, we conclude that they form a basis of Pn , which is a
cardinal basis for the node set {ti }in=0 . y
Consider the equidistant nodes in [−1, 1]:   T := { t_j = −1 + (2/n) j } ,   j = 0, . . . , n .

Fig. 151: Lagrange polynomials L_0, L_2, and L_5 for equidistant nodes in [−1, 1]; they attain the value 1 at t_0, t_2, and t_5, respectively.
y
The Lagrange polynomial interpolant p for data points (t_i, y_i)_{i=0}^{n} allows a straightforward representation
with respect to the basis of Lagrange polynomials for the node set {t_i}_{i=0}^{n}:

p(t) = ∑_{i=0}^{n} y_i L_i(t)   ⇔   p ∈ P_n  and  p(t_i) = y_i .      (5.2.2.6)
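Formula (5.2.2.6) can be used directly, at a cost of O(n²) per evaluation point. The following lines are a minimal sketch of this naive evaluation (the name lagrangeEval is an assumption; more efficient approaches follow in Section 5.2.3):

#include <Eigen/Dense>

// Direct evaluation of p(x) = sum_i y_i L_i(x) from (5.2.2.6).
double lagrangeEval(const Eigen::VectorXd &t, const Eigen::VectorXd &y, double x) {
  const Eigen::Index n = t.size() - 1;
  double p = 0.0;
  for (Eigen::Index i = 0; i <= n; ++i) {
    double L = 1.0;   // L_i(x) = prod_{j != i} (x - t_j)/(t_i - t_j)
    for (Eigen::Index j = 0; j <= n; ++j)
      if (j != i) L *= (x - t(j)) / (t(i) - t(j));
    p += y(i) * L;
  }
  return p;
}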
The general Lagrange polynomial interpolation problem admits a unique solution p ∈ P_n for any
set of data points {(t_i, y_i)}_{i=0}^{n}, n ∈ N, with pairwise distinct interpolation nodes t_i ∈ R (i ≠ j ⇒
t_i ≠ t_j).
Known from linear algebra: for a linear mapping T : V 7→ W between finite-dimensional vector spaces
with dim V = dim W holds the equivalence
T surjective ⇔ T bijective ⇔ T injective.
Applying this equivalence to evalT yields the assertion of the theorem
✷
Remark 5.2.2.10 (Vandermonde matrix) Lagrangian polynomial interpolation also leads to a linear system of
equations for the representation coefficients of the polynomial interpolant in the monomial basis, see
§ 5.1.0.21:

p(t_j) = y_j   ⇐⇒   ∑_{i=0}^{n} a_i t_j^i = y_j ,   j = 0, . . . , n
         ⇐⇒   solution of the (n+1) × (n+1) linear system Va = y with the matrix

V =
[ 1   t_0   t_0²   · · ·   t_0ⁿ ]
[ 1   t_1   t_1²   · · ·   t_1ⁿ ]
[ 1   t_2   t_2²   · · ·   t_2ⁿ ]   ∈ R^{n+1,n+1} .      (5.2.2.11)
[ ⋮    ⋮     ⋮       ⋱      ⋮   ]
[ 1   t_n   t_n²   · · ·   t_nⁿ ]
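A minimal sketch of this route (set up V from (5.2.2.11) and solve Va = y; the function name monomialCoeffs is an assumption, and the solver choice is illustrative only):

#include <Eigen/Dense>

// Monomial coefficients a_0, ..., a_n (ascending order) of the interpolant through (t_i, y_i).
Eigen::VectorXd monomialCoeffs(const Eigen::VectorXd &t, const Eigen::VectorXd &y) {
  const Eigen::Index n = t.size() - 1;
  Eigen::MatrixXd V(n + 1, n + 1);
  for (Eigen::Index j = 0; j <= n; ++j)
    V.col(j) = t.array().pow(static_cast<double>(j)).matrix();  // column of powers t_i^j
  return V.fullPivLu().solve(y);                                 // solve V a = y
}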
Remark 5.2.2.13 (Matrix representation of interpolation operator) In the case of Lagrange interpola-
tion:
• if Lagrange polynomials are chosen as basis for Pn , then IT is represented by the identity matrix;
• if monomials are chosen as basis for Pn , then IT is represented by the inverse of the Vandermonde
matrix V, see Eq. (5.2.2.11).
y
Remark 5.2.2.14 (Generalized polynomial interpolation → [DR08, Sect. 8.2.7], [QSS00, Sect. 8.4])
The following generalization of Lagrange interpolation is possible: We still seek a polynomial interpolant,
but beside function values also prescribe derivatives up to a certain order for interpolating polynomial at
given nodes.
Convention: indicate occurrence of derivatives as interpolation conditions by multiple nodes.
Generalized polynomial interpolation problem
Given the (possibly multiple) nodes t_0, . . . , t_n, n ∈ N, −∞ < t_0 ≤ t_1 ≤ · · · ≤ t_n < ∞, and the values
y_0, . . . , y_n ∈ R, compute p ∈ P_n such that

(d^k/dt^k) p(t_j) = y_j   for k = 0, . . . , ℓ_j  and  j = 0, . . . , n ,      (5.2.2.15)

where ℓ_j := max{ i − i′ : t_j = t_i = t_{i′} , i, i′ = 0, . . . , n } , and ℓ_j + 1 is the multiplicity of the node t_j.
Admittedly, the statement of the generalized polynomial interpolation problem is hard to decipher. Let us
look at a simple special case, which is also the most important case of generalized Lagrange interpolation.
It is the case when all the multiplicities are equal to 2. It is called Hermite interpolation (or osculatory
interpolation) and the generalized interpolation conditions read for nodes t0 = t1 < t2 = t3 < · · · <
tn−1 = tn (note the double nodes!) [QSS00, Ex. 8.6]:
The generalized Lagrange polynomials for the nodes T = {t j }nj=0 ⊂ R (multiple nodes allowed)
are defined as Li := IT (ei+1 ), i = 0, . . . , n, where ei = (0, . . . , 0, 1, 0, . . . , 0) T ∈ R n+1 are the
unit vectors.
Note: The linear interpolation operator IT in this definition refers to generalized Lagrangian interpolation.
Its existence is guaranteed by Thm. 5.2.2.16.
T = { t_0 = 0, t_1 = 0, t_2 = 1, t_3 = 1 } .

Fig.: the cubic Hermite polynomials (generalized Lagrange polynomials) for this node set.
More details are given in Section 5.3.3. For explicit formulas for the polynomials see (5.3.3.5). y
Review question(s) 5.2.2.19 (Polynomial interpolation: theory)
(Q5.2.2.19.A) For a set {t_0, t_1, . . . , t_n} ⊂ R of nodes the associated Lagrange polynomials are

L_i(t) := ∏_{j=0, j≠i}^{n}  (t − t_j)/(t_i − t_j) ,   i = 0, . . . , n .
Write down the Lagrange polynomials L0 , L1 , L2 , L3 for the node set {0, 1, 2, 3}.
(Q5.2.2.19.B) Denote by Li , i = 0, . . . , n, n ∈ N, the Lagrange polynomials for the node set
{t0 , t1 , . . . , tn } ⊂ R that is assumed to be sorted t0 < t1 < · · · < tn .
What can you say about the sign of p(t), where p(t) = Lk (t) Lm (t), t ∈ R?
(Q5.2.2.19.C) For a given node set {t_0, t_1, . . . , t_n} ⊂ R the associated Vandermonde matrix reads

V =
[ 1   t_0   t_0²   · · ·   t_0ⁿ ]
[ 1   t_1   t_1²   · · ·   t_1ⁿ ]
[ 1   t_2   t_2²   · · ·   t_2ⁿ ]   ∈ R^{n+1,n+1} .
[ ⋮    ⋮     ⋮       ⋱      ⋮   ]
[ 1   t_n   t_n²   · · ·   t_nⁿ ]
Sketch an efficient implementation of the C++ function
Eigen::VectorXd vanderMult( const Eigen::VectorXd &t,
const Eigen::VectorXd &x);
The member function eval(y,x) expects n data values in y and (any number of) evaluation points in
x (↔ [ x1 , . . . , x N ]⊤ ) and returns the vector [ p( x1 ), . . . , p( x N )]⊤ , where p is the Lagrange polynomial
interpolant.
An implementation directly based on the evaluation of Lagrange polynomials (5.2.2.4) and (5.2.2.6) would
incur an asymptotic computational effort of O(n2 N ) for every single invocation of eval and large n, N .
By means of pre-computing parts of the Lagrange polynomials Li the asymptotic effort for
eval can be reduced substantially.
p(t) = ∑_{i=0}^{n} λ_i y_i ∏_{j≠i} (t − t_j) ,

where λ_i = 1 / ( (t_i − t_0) · · · (t_i − t_{i−1})(t_i − t_{i+1}) · · · (t_i − t_n) ) , i = 0, . . . , n: independent of the y_i!
From this we obtain the barycentric interpolation formula

p(t) = ( ∑_{i=0}^{n} (λ_i/(t − t_i)) y_i )  /  ( ∑_{i=0}^{n} λ_i/(t − t_i) ) ,      (5.2.3.3)

with the same λ_i, i = 0, . . . , n, independent of t and of the y_i [Tre13, Thm. 5.1]. Hence, the values λ_i can be precomputed!
The use of (5.2.3.3) involves
✦ computation of weights λi , i = 0, . . . , n: cost O(n2 ) (only once!),
The following C++ class demonstrates the use of the barycentric interpolation formula for efficient multiple-point evaluation of a Lagrange interpolation polynomial:
public:
  // Constructor taking the node vector [t_0, ..., t_n]^T as argument
  explicit BarycPolyInterp(const nodeVec_t &_t);
  // The interpolation nodes may also be passed in an STL container
  template <typename SeqContainer>
  explicit BarycPolyInterp(const SeqContainer &v);
  // Computation of p(x_k) for data values (y_0, ..., y_n) and evaluation points x_k
  template <typename RESVEC, typename DATAVEC>
  RESVEC eval(const DATAVEC &y, const nodeVec_t &x) const;

private:
  void init_lambda();
};
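To make the formula concrete, here is a minimal standalone sketch of the barycentric formula (5.2.3.3); it is not the course's BarycPolyInterp class, and in practice the O(n²) weight computation would of course be done once and reused (which is exactly what the class above is for):

#include <Eigen/Dense>

double barycentricEval(const Eigen::VectorXd &t, const Eigen::VectorXd &y, double x) {
  const Eigen::Index n = t.size() - 1;
  // weights lambda_i = prod_{j != i} 1/(t_i - t_j), O(n^2)
  Eigen::VectorXd lambda = Eigen::VectorXd::Ones(n + 1);
  for (Eigen::Index i = 0; i <= n; ++i)
    for (Eigen::Index j = 0; j <= n; ++j)
      if (j != i) lambda(i) /= (t(i) - t(j));
  // evaluation of (5.2.3.3), O(n) once the weights are known
  double num = 0.0, den = 0.0;
  for (Eigen::Index i = 0; i <= n; ++i) {
    if (x == t(i)) return y(i);              // avoid division by zero at a node
    const double w = lambda(i) / (x - t(i));
    num += w * y(i);
    den += w;
  }
  return num / den;
}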
Task: Given a set of interpolation points (t j , y j ), j = 0, . . . , n, with pairwise different interpolation nodes
t j , perform a single point evaluation of the Lagrange polynomial interpolant p at x ∈ R.
We discuss the efficient implementation of the following function for n ≫ 1. It is meant for a single
evaluation of a Lagrange interpolant.
double eval( const Eigen::VectorXd &t, const Eigen::VectorXd &y,
double x);
§5.2.3.8 (Aitken-Neville scheme) The starting point is a recursion formula for partial Lagrange interpolants. For 0 ≤ k ≤ ℓ ≤ n let p_{k,ℓ} ∈ P_{ℓ−k} denote the unique polynomial interpolating the data points
(t_j, y_j), j = k, . . . , ℓ. Then

p_{k,ℓ}(x) = ( (x − t_k) p_{k+1,ℓ}(x) − (x − t_ℓ) p_{k,ℓ−1}(x) ) / (t_ℓ − t_k) ,      (5.2.3.9)

because the left and right hand sides represent polynomials of degree ℓ − k through the points (t_j, y_j),
j = k, . . . , ℓ:

( (x − t_k) p_{k+1,ℓ}(x) − (x − t_ℓ) p_{k,ℓ−1}(x) ) / (t_ℓ − t_k)  =  { y_k   for x = t_k   [p_{k,ℓ−1}(t_k) = y_k] ,
                                                                        y_j   for x = t_j ,  k < j < ℓ ,
                                                                        y_ℓ   for x = t_ℓ   [p_{k+1,ℓ}(t_ℓ) = y_ℓ] .
Thus the values of the partial Lagrange interpolants can be computed sequentially and their dependencies
can be expressed by the following so-called Aitken-Neville scheme:
ℓ−k = 0 1 2 3
t0 y0 =: p0,0 ( x ) → p0,1 ( x ) → p0,2 ( x ) → p0,3 ( x )
ր ր ր
t1 y1 =: p1,1 ( x ) → p1,2 ( x ) → p1,3 ( x ) (ANS)
ր ր
t2 y2 =: p2,2 ( x ) → p2,3 ( x )
ր
t3 y3 =: p3,3 ( x )
Here, the arrows indicate contributions to the convex linear combinations of (5.2.3.9). The computation
can advance from left to right, which is done in following C++ code.
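The course's listing (Code 5.2.3.10) is not reproduced here; the following lines are a minimal sketch of what such a single-point Aitken-Neville evaluation can look like (it processes the data points one by one and overwrites the vector y in place, consistent with the table below; the name ANipoeval is taken from the discussion that follows):

#include <Eigen/Dense>

// Value p(x) of the Lagrange interpolant through (t_i, y_i), i = 0,...,n.
double ANipoeval(const Eigen::VectorXd &t, Eigen::VectorXd y, double x) {
  const Eigen::Index n = y.size() - 1;
  for (Eigen::Index l = 1; l <= n; ++l)         // incorporate data point (t_l, y_l)
    for (Eigen::Index j = l - 1; j >= 0; --j)   // update p_{j,l}(x), j = l-1,...,0
      y(j) = ((x - t(j)) * y(j + 1) - (x - t(l)) * y(j)) / (t(l) - t(j));
  return y(0);                                  // p_{0,n}(x)
}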
The vector y contains the diagonals (from bottom left to top right) of the above triangular tableau:

 i  |  y[0]       |  y[1]       |  y[2]       |  y[3]
 0  |  y_0        |  y_1        |  y_2        |  y_3
 1  |  p_{0,1}(x) |  y_1        |  y_2        |  y_3
 2  |  p_{0,2}(x) |  p_{1,2}(x) |  y_2        |  y_3
 3  |  p_{0,3}(x) |  p_{1,3}(x) |  p_{2,3}(x) |  y_3

Note that the algorithm merely needs to store that single vector, which translates into O(n) required
memory for n → ∞.
The asymptotic complexity of ANipoeval in terms of the number of data points is O(n²) (two nested loops). This
is the same as for evaluation based on the barycentric formula, but the Aitken-Neville scheme has a key advantage
discussed in the next §. y
§5.2.3.11 (Polynomial interpolation with data updates) The Aitken-Neville algorithm has another inter-
esting feature, when we run through the Aitken-Neville scheme from the top left corner:
n= 0 1 2 3
t0 y0 =: p0,0 ( x ) → p0,1 ( x ) → p0,2 ( x ) → p0,3 ( x )
ր ր ր
t1 y1 =: p1,1 ( x ) → p1,2 ( x ) → p1,3 ( x )
ր ր
t2 y2 =: p2,2 ( x ) → p2,3 ( x )
ր
t3 y3 =: p3,3 ( x )
Thus, the values of partial polynomial interpolants at x can be computed before all data points are even
processed. This results in an “update-friendly” algorithm that can efficiently supply the point values p0,k ( x ),
k = 0, . . . , n, while being supplied with the data points (ti , yi ). It can be used for the efficient implemen-
tation of the following interpolator class:
Fig. 153
This uses functions given in Code 5.2.3.7, Code 5.2.3.10 and the function polyfit() (with a clearly
greater computational effort !). polyfit() is the equivalent to P YTHON’s/M ATLAB’s built-in polyfit.
The implementation can be found on GitLab.
y
Review question(s) 5.2.3.14 (Polynomial Interpolation: Algorithms)
(Q5.2.3.14.A) The Aitken-Neville scheme was introduced as
n= 0 1 2 3
t0 y0 =: p0,0 ( x ) → p0,1 ( x ) → p0,2 ( x ) → p0,3 ( x )
ր ր ր
t1 y1 =: p1,1 ( x ) → p1,2 ( x ) → p1,3 ( x ) (ANS)
ր ր
t2 y2 =: p2,2 ( x ) → p2,3 ( x )
ր
t3 y3 =: p3,3 ( x )
Give an interpretation of the quantities pk,ℓ occurring in (ANS).
(Q5.2.3.14.B) Describe a scenario for the evaluation of degree-n Lagrange polynomial interpolants in a
single point x ∈ R where the use of the barycentric interpolation formula

p(t) = ( ∑_{i=0}^{n} (λ_i/(t − t_i)) y_i )  /  ( ∑_{i=0}^{n} λ_i/(t − t_i) ) ,   λ_i := ∏_{j=0, j≠i}^{n}  1/(t_i − t_j) ,      (5.2.3.3)
Extrapolation is interpolation with the evaluation point t outside the interval [inf j=0,...,n t j , sup j=0,...,n t j ].
In the sequel we assume t = 0, ti > 0. Of course, Lagrangian polynomial interpolation can also be used
for extrapolation. In this section we give a very important application of this “Lagrangian extrapolation”.
Task: compute the limit lim_{h→0} ψ(h) with prescribed accuracy, even though the evaluation of the function
ψ = ψ(h) (maybe given in procedural form only) for very small arguments |h| ≪ 1 is difficult,
usually because of numerical instability (→ Section 1.5.5).
f ( h) = f (0) + A1 h2 + A2 h4 + · · · + An h2n + R( h) , Ak ∈ R ,
Fig.: data points (h_i, ψ(h_i)) and interpolating polynomials of degree 1, 2, 3 used for extrapolation to h = 0.
§5.2.3.16 (Numerical differentiation through extrapolation) In Ex. 1.5.4.7 we have already seen a situ-
ation, where we wanted to compute the limit of a function ψ( h) for h → 0, but could not do it with sufficient
accuracy. In this case ψ( h) was a one-sided difference quotient with span h, meant to approximate f ′ ( x )
for a differentiable function f . The cause of numerical difficulties was cancellation → § 1.5.4.5.
Now we will see how to dodge cancellation in difference quotients and how to use extrapolation to zero to
compute derivatives with high accuracy:
Given: smooth function f : I ⊂ R 7→ R in procedural form: function y = f(x)
df/dx (x)  ≈  ( f(x + h) − f(x − h) ) / (2h) .      (5.2.3.17)

A straightforward implementation fails due to cancellation in the numerator, see also Ex. 1.5.4.7.
This is apparent in the following approximation error tables for three simple functions, f(x) = arctan(x), f(x) = √x, and f(x) = exp(x), evaluated at x = 1.1; each pair of columns lists h and the relative error of (5.2.3.17):

        f(x) = arctan(x)            f(x) = √x                   f(x) = exp(x)
 h       Relative error      h       Relative error      h       Relative error
 2^{-1}  0.20786640808609    2^{-1}  0.09340033543136    2^{-1}  0.29744254140026
 2^{-6}  0.00773341103991    2^{-6}  0.00352613693103    2^{-6}  0.00785334954789
 2^{-11} 0.00024299312415    2^{-11} 0.00011094838842    2^{-11} 0.00024418036620
 2^{-16} 0.00000759482296    2^{-16} 0.00000346787667    2^{-16} 0.00000762943394
 2^{-21} 0.00000023712637    2^{-21} 0.00000010812198    2^{-21} 0.00000023835113
 2^{-26} 0.00000001020730    2^{-26} 0.00000001923506    2^{-26} 0.00000000429331
 2^{-31} 0.00000005960464    2^{-31} 0.00000001202188    2^{-31} 0.00000012467100
 2^{-36} 0.00000679016113    2^{-36} 0.00000198842224    2^{-36} 0.00000495453865
Recall the considerations elaborated in Ex. 1.5.4.7. Owing to the impact of roundoff errors amplified by
cancellation, h → 0 does not achieve arbitrarily high accuracy. Rather, we observe fewer correct digits for
very small h!
Extrapolation offers a numerically stable (→ Def. 1.5.5.19) alternative, because for a 2(n+1)-times continuously differentiable function f : I ⊂ R → R, x ∈ I, the symmetric difference quotient behaves like a polynomial in h² in the vicinity of h = 0. Indeed, the Taylor sum of f in x with Lagrange remainder term yields

ψ(h) := ( f(x + h) − f(x − h) ) / (2h)  ∼  f′(x) + ∑_{k=1}^{n}  ( f^{(2k+1)}(x)/(2k+1)! ) h^{2k}  +  O(h^{2n+1})   for h → 0 ,

with the remainder controlled by f^{(2n+2)}(ξ(x))/(2n+2)! for some ξ(x) between x − h and x + h.

Since lim_{h→0} ψ(h) = f′(x), we can approximate f′(x) by extrapolating the data points (h_i, ψ(h_i)) to h = 0, which avoids evaluating ψ for prohibitively small h.
The following C++ function diffex() implements extrapolation to zero of symmetric difference quo-
tients relying on the update-friendly version of the Aitken-Neville algorithm as presented in § 5.2.3.11,
Code 5.2.3.12. Note that the extrapolated value taking into account all available difference quotients al-
ways resides in y[0].
While the extrapolation table (→ § 5.2.3.11) is computed, more and more accurate approximations of
f ′ ( x ) become available. Thus, the difference between the two last approximations (stored in y[0] and
y[1] in Code 5.2.3.19) can be used to gauge the error of the current approximation, it provides an error
indicator, which can be used to decide when the level of extrapolation is sufficient, see Line 27.
(Q5.2.3.20.A) We consider a convergent sequence (α_n)_{n∈N} of real numbers, whose terms can be obtained as output of a black-box function double alpha(unsigned int n). Function calls become the
more expensive the larger n is. How might extrapolation to zero be employed to compute the limit
lim_{n→∞} α_n?
(Q5.2.3.20.B) Explain how you can use an object of this type to perform multiple evaluations of the unique even
polynomial interpolant p ∈ P_{2n} of the data points (t_j, y_j), j = 0, . . . , n, t_j > 0.
Supplementary literature. We also refer to [DR08, Sect. 8.2.4], [QSS00, Sect. 8.2].
In § 5.2.3.8 we have seen a method to evaluate partial polynomial interpolants for a single or a few
evaluation points efficiently. Now we want to do this for many evaluation points that may not be known
when we receive information about the first interpolation points.
The challenge: Both addPoint() and the evaluation operator operator () may be called many times
and the implementation has to remain efficient under these circumstances.
Why not use the techniques from § 5.2.3.2? Drawback of the Lagrange basis or barycentric formula:
adding another data point affects all basis polynomials/all precomputed values!
The Newton basis of P_n associated with the nodes t_0, . . . , t_n consists of the polynomials N_0(t) := 1,
N_j(t) := ∏_{i=0}^{j−1} (t − t_i), j = 1, . . . , n, cf. (5.2.3.23). Note that, clearly, N_n ∈ P_n with leading coefficient 1.
This implies the linear independence of {N_0, . . . , N_n} and, in light of dim P_n = n + 1 by Thm. 5.2.1.2, gives
us the basis property of that subset of P_n.
The abstract considerations of § 5.1.0.21 still apply and we get an (n + 1) × (n + 1) linear system of
equations for the coefficients a j , j = 0, . . . , n, of the polynomial interpolant in Newton basis:
a j ∈ R: a0 N0 (t j ) + a1 N1 (t j ) + · · · + an Nn (t j ) = y j , j = 0, . . . , n . (5.2.3.24)
a_0 = y_0 ,

a_1 = (y_1 − a_0)/(t_1 − t_0) = (y_1 − y_0)/(t_1 − t_0) ,

a_2 = ( y_2 − a_0 − (t_2 − t_0) a_1 ) / ( (t_2 − t_0)(t_2 − t_1) )
    = ( (y_2 − y_0)/(t_2 − t_0) − (y_1 − y_0)/(t_1 − t_0) ) / (t_2 − t_1) ,
  ⋮
We observe that in the course of forward substitution the same quantities are computed again and again. This
suggests that a more efficient implementation is possible. y
§5.2.3.26 (Divided differences) In order to reveal the pattern, we turn to a new interpretation of the
coefficients a j ∈ R of the interpolating polynomials
§5.2.3.29 (Efficient computation of divided differences) Divided differences can be computed by the
divided differences scheme, which is closely related to the Aitken-Neville scheme from Code 5.2.3.10:
t0 y [ t0 ]
> y [ t0 , t1 ]
t1 y [ t1 ] > y [ t0 , t1 , t2 ]
> y [ t1 , t2 ] > y [ t0 , t1 , t2 , t3 ], (5.2.3.30)
t2 y [ t2 ] > y [ t1 , t2 , t3 ]
> y [ t2 , t3 ]
t3 y [ t3 ]
The elements can be computed from left to right, every “>” indicates the evaluation of the recursion
formula (5.2.3.28).
However, we can again resort to the idea of § 5.2.3.11 and traverse (5.2.3.30) along the diagonals from
left bottom to right top: If a new datum (t0 , y0 ) is added, it is enough to compute the n + 1 new divided
differences
y [ t0 ] , y [ t0 , t1 ] , y [ t0 , t1 , t2 ] , . . . , y [ t0 , . . . , t n ] .
The C++ function divdiff() listed in Code 5.2.3.31 computes divided differences for data points
(ti , yi ), i = 0, . . . , n, in this fashion. For n = 3 the values of the outer loop variable l in the different
combinations are as follows:
t0 y [ t0 ]
l = 3 > y [ t0 , t1 ]
t1 y [ t1 ] l=3> y [ t0 , t1 , t2 ]
l = 2 > y [ t1 , t2 ] l = 3 > y [ t0 , t1 , t2 , t3 ], (5.2.3.30)
t2 y [ t2 ] l=2> y [ t1 , t2 , t3 ]
l = 1 > y [ t2 , t3 ]
t3 y [ t3 ]
In divdiff() the divided differences y[t0 ], y[t0 , t1 ], . . . , y[t0 , . . . , tn ] overwrite the original data values
y j in the vector y (in-situ computation).
Thus, divdiff() from Code 5.2.3.31 computes the coefficients a j , j = 0, . . . , n, of the polynomial
interpolant with respect to the Newton basis. It uses only the first j + 1 data points to find a j . y
§5.2.3.33 (Efficient evaluation of a polynomial in Newton form) Let a polynomial be given in “Newton
form”, that is, as a linear combination of the Newton basis polynomials introduced in (5.2.3.23),

p(t) = a_0 N_0(t) + a_1 N_1(t) + · · · + a_n N_n(t) ,

with known coefficients a_j, j = 0, . . . , n, e.g., available as the components of a vector. Embark on “associative rewriting”,
which reveals how we can perform the “backward evaluation” of p(t) in the spirit of Horner’s scheme (→
§ 5.2.1.5, [DR08, Alg. 8.20]):
p ← an , p ← ( t − t n −1 ) p + a n −1 , p ← ( t − t n −2 ) p + a n −2 , ....
A C++ implementation of this idea is given next.
C++ code 5.2.3.34: Divided differences evaluation by modified Horner scheme ➺ GITLAB
// Evaluation of a polynomial in Newton form, that is, represented through the
// vector of its basis expansion coefficients with respect to the Newton basis (5.2.3.23).
Eigen::VectorXd evalNewtonForm(const Eigen::VectorXd &t,
                               const Eigen::VectorXd &a,
                               const Eigen::VectorXd &x) {
  const Eigen::Index n = a.size() - 1;
  const Eigen::VectorXd ones = Eigen::VectorXd::Ones(x.size());
  Eigen::VectorXd p{a[n] * ones};
  for (Eigen::Index j = n - 1; j >= 0; --j) {
    p = (x - t[j] * ones).cwiseProduct(p) + a[j] * ones;
  }
  return p;
}
EXAMPLE 5.2.3.35 (Class PolyEval) We show the implementation of a C++ class supporting the
efficient update and evaluation of an interpolating polynomial making use of
• the representation of the Lagrange polynomial interpolants in the Newton basis (5.2.3.23),
• the computation of representation coefficients through a divided difference scheme (5.2.3.30), see
Code 5.2.3.31,
• and point evaluations of the polynomial interpolants by means of Horner-like scheme as introduced
in Code 5.2.3.34.
To understand the code return to the triangular linear system for the Newton basis expansion coefficients
a j of a Lagrange polynomial interpolant of degree n through (ti , yi ), i = 0, . . . , n:
\begin{bmatrix}
 1 & 0 & & \cdots & 0 \\
 1 & (t_1 - t_0) & \ddots & & \vdots \\
 \vdots & \vdots & \ddots & \ddots & 0 \\
 1 & (t_n - t_0) & \cdots & & \prod_{i=0}^{n-1}(t_n - t_i)
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{bmatrix}
=
\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_n \end{bmatrix} .   (5.2.3.25)

Given a_0, . . . , a_{n−1} we can thus compute a_n from

   a_n = ( ∏_{i=0}^{n−1} (t_n − t_i) )^{−1} ( y_n − ∑_{k=0}^{n−1} ( ∏_{i=0}^{k−1} (t_n − t_i) ) a_k )
       = ( ∏_{i=0}^{n−1} (t_n − t_i) )^{−1} y_n − ∑_{k=0}^{n−1} ( ∏_{i=k}^{n−1} (t_n − t_i) )^{−1} a_k
       = ( . . . (((y_n − a_0)/(t_n − t_0) − a_1)/(t_n − t_1) − a_2)/ · · · − a_{n−1} ) / (t_n − t_{n−1}) .
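A minimal sketch of such a class (our own illustration, with a hypothetical name; the class shown in the lecture codes may differ in interface and implementation details):

#include <vector>

// Sketch: maintains nodes and Newton-form coefficients of the interpolant.
class PolyEvalSketch {
 public:
  // Add a data point (t,y); O(n) cost via the update formula derived above.
  void addPoint(double t, double y) {
    t_.push_back(t);
    double a = y;
    for (std::size_t k = 0; k < coeffs_.size(); ++k)
      a = (a - coeffs_[k]) / (t - t_[k]);  // (...((y_n - a_0)/(t_n - t_0) - a_1)/...)
    coeffs_.push_back(a);
  }
  // Evaluate the current interpolant at x by the Horner-like backward scheme.
  double operator()(double x) const {
    if (coeffs_.empty()) return 0.0;
    double p = coeffs_.back();
    for (int j = static_cast<int>(coeffs_.size()) - 2; j >= 0; --j)
      p = (x - t_[j]) * p + coeffs_[j];
    return p;
  }
 private:
  std::vector<double> t_, coeffs_;  // nodes t_j and Newton coefficients a_j
};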
Remark 5.2.3.38 (Divided differences and derivatives) If y0 , . . . , yn are the values of a smooth function
f in the points t0 , . . . , tn , that is, y j := f (t j ), then
   y[t_i , . . . , t_{i+k}] = f^{(k)}(ξ) / k!
for a certain ξ ∈ [ti , ti+k ], see [DR08, Thm. 8.21]. y
Review question(s) 5.2.3.39 (Newton basis and divided differences)
(Q5.2.3.39.A) Given a node set {t0 , t1 , . . . , tn } let a0 , . . . , an ∈ R be the coefficients of a polynomial p
in the associated Newton basis { N0 , . . . , Nn }. Outline an efficient algorithm for computing the basis
expansion coefficients of p with respect to the basis { L0 , . . . , Ln } of Lagrange polynomials for given
node set.
(Q5.2.3.39.B) Given the value vector y ∈ R^{n+1} and the node set {t_0, . . . , t_n} ⊂ R, remember the notation y[t_k, t_ℓ] for divided differences: y[t_k, t_ℓ] is the leading coefficient of the unique polynomial interpolating the data points (t_j, (y)_j), j = k, . . . , ℓ, 0 ≤ k, ℓ ≤ n (C++ indexing).
What can you conclude from y[t_0, t_j] = 0 for all j ∈ {m, . . . , n} for some m ∈ {1, . . . , n}?
△
This section addresses a major shortcoming of polynomial interpolation in case the interpolation knots ti
are imposed, which is usually the case when given data points have to be interpolated, cf. Ex. 5.1.0.8.
This liability has to do with the sensitivity of the Lagrange polynomial interpolation problem. From Sec-
tion 2.2.2 remember that the sensitivity/conditioning of a problem provides a measure for the propaga-
tion of perturbations in the data/inputs to the results/outputs.
§5.2.4.1 (The Lagrange polynomial interpolation problem) As explained in § 1.5.5.1 a “problem” in the
sense of numerical analysis is a mapping/function from a data/input set X into a set Y of results/outputs.
Owing to the existence and uniqueness of the polynomial interpolant as asserted in Thm. 5.2.2.7, the
Lagrange polynomial interpolation problem (LIP) as introduced in Section 5.2.2 describes a mapping
from sets of n + 1 data points, n ∈ N0 to polynomials of degree n. Hence, LIP maps a finite sequence
of numbers to a function, and both the data/input set and result/output set have the structure of vector
spaces.
A more restricted view considers the linear interpolation operator from Cor. 5.2.2.8

   I_T : R^{n+1} → P_n ,   (y_0, . . . , y_n)^⊤ ↦ interpolating polynomial p ,   (5.2.2.9)
and identifies the Lagrange polynomial interpolation problem with IT , that is, with the mapping taking only
data values to a polynomial. The interpolation nodes are treated as parameters and not considered data.
For the sake of simplicity we adopt this view in the sequel. y
Consider, for instance, the node set and data values

   T := { −5 + (10/n) j }_{j=0}^{n} ,   y_j = 1 / (1 + t_j²) ,   j = 0, . . . , n .

(Figure: data points (t_j, y_j) and an interpolating polynomial p; vertical axis y, p(t).)
In fuzzy terms, what we have observed is “high sensitivity” of polynomial interpolation with respect to
perturbations in the data values: small perturbations in the data can cause big variations of the polynomial
interpolants in certain points, which is clearly undesirable. y
§5.2.4.4 (Norms on spaces of functions) For measuring the size of perturbations we need norms (→ Def. 1.5.5.4) on data and result spaces. For the value vectors y := [y_0, . . . , y_n]^⊤ ∈ R^{n+1} we can use any vector norm, see § 1.5.5.3, for instance the maximum norm ‖y‖_∞.
However, the result space is a vector space of functions I → R, I ⊂ R, and so we also need norms on the vector space of continuous functions C^0(I), I ⊂ R. The following norms are the most relevant:
§5.2.4.8 (Sensitivity of linear problem maps) In § 5.1.0.21 we have learned that (polynomial) interpola-
tion gives rise to a linear problem map, see Def. 5.1.0.25. For this class of problem maps the investigation
of sensitivity has to study operator norms, a generalization of matrix norms (→ Def. 1.5.5.10).
Let L : X → Y be a linear problem map between two normed spaces, the data space X (with norm k·k X )
and the result space Y (with norm k·kY ). Thanks to linearity, perturbations of the result y := L(x) for the
input x ∈ X can be expressed as follows:

   L(x + δx) − L(x) = L(δx)   for every perturbation δx ∈ X .
Hence, the sensitivity (in terms of propagation of absolute errors) can be measured by the operator norm
   ‖L‖_{X→Y} := sup_{δx ∈ X\{0}} ‖L(δx)‖_Y / ‖δx‖_X .   (5.2.4.9)
This can be read as the “matrix norm of L”, cf. Def. 1.5.5.10. y
It seems challenging to compute the operator norm (5.2.4.9) for L = IT (IT the Lagrange interpolation
operator for node set T ⊂ I ), X = R n+1 (equipped with a vector norm), and Y = C ( I ) (endowed with a
norm from § 5.2.4.4). The next lemma will provide surprisingly simple concrete formulas.
Lemma 5.2.4.10.

   ‖I_T‖_{∞→∞} := sup_{y ∈ R^{n+1}\{0}} ‖I_T(y)‖_{L∞(I)} / ‖y‖_∞ = ‖ ∑_{i=0}^{n} |L_i| ‖_{L∞(I)} ,   (5.2.4.11)

   ‖I_T‖_{2→2} := sup_{y ∈ R^{n+1}\{0}} ‖I_T(y)‖_{L2(I)} / ‖y‖_2 ≤ ( ∑_{i=0}^{n} ‖L_i‖²_{L2(I)} )^{1/2} .   (5.2.4.12)

Proof. (for the L∞-norm) For any y ∈ R^{n+1},

   ‖I_T(y)‖_{L∞(I)} = ‖ ∑_{j=0}^{n} y_j L_j ‖_{L∞(I)} ≤ sup_{t∈I} ∑_{j=0}^{n} |y_j| |L_j(t)| ≤ ‖y‖_∞ ‖ ∑_{i=0}^{n} |L_i| ‖_{L∞(I)} ,

which establishes the bound in (5.2.4.11).
Proof. (for the L2-norm) By the triangle inequality and the Cauchy–Schwarz inequality in R^{n+1},

   ∑_{j=0}^{n} a_j b_j ≤ ( ∑_{j=0}^{n} |a_j|² )^{1/2} ( ∑_{j=0}^{n} |b_j|² )^{1/2}   ∀ a_j, b_j ∈ R ,

we can estimate

   ‖I_T(y)‖_{L2(I)} ≤ ∑_{j=0}^{n} |y_j| ‖L_j‖_{L2(I)} ≤ ( ∑_{j=0}^{n} |y_j|² )^{1/2} ( ∑_{j=0}^{n} ‖L_j‖²_{L2(I)} )^{1/2} .
✷
Terminology: the Lebesgue constant of T is   λ_T := ‖ ∑_{i=0}^{n} |L_i| ‖_{L∞(I)} = ‖I_T‖_{∞→∞} .
Remark 5.2.4.13 (Lebesgue constant for equidistant nodes) We consider Lagrange interpolation for the special setting

   I = [−1, 1] ,   T = { −1 + 2k/n }_{k=0}^{n}   (uniformly spaced nodes).

Asymptotic estimate (with (5.2.2.4) and the Stirling formula): for n = 2m,

   |L_m(1 − 1/n)| = (2n)! / ( (n − 1) · 2^{2n} · ((n/2)!)² · n! ) ∼ 2^{n+3/2} / ( π (n − 1) n ) .
(Fig. 156: numerically computed values of the Lebesgue constant λ_T for different families of interpolation nodes — Chebychev nodes and equidistant nodes — plotted against the polynomial degree n on a logarithmic scale.)

Sophisticated theory [CR92] gives a lower bound for the Lebesgue constant for uniformly spaced nodes:

   λ_T ≥ C e^{n/2}   with C > 0 independent of n.

We can also perform a numerical evaluation of the expression

   λ_T = ‖ ∑_{i=0}^{n} |L_i| ‖_{L∞(I)}

for the Lebesgue constant of polynomial interpolation, see Lemma 5.2.4.10. The following code demonstrates this:
      } else {
        tmp << t.head(k), t.tail(n - (k + 1));
      }
      const double v = (x - tmp.array()).prod() / den(k);
      s += std::abs(v);   // sum over modulus of the polynomials
    }
    l = std::max(l, s);   // maximum of sampled values
  }
  return l;
}
Note: In Code 5.2.4.14 the norm k Li k L∞ ( I ) can be computed only approximately by taking the maximum
modulus of function values in many sampling points. y
!  Due to potentially “high sensitivity”, interpolation with global polynomials of high degree is not suitable for data interpolation.
y
Hint. The Lagrange polynomials for a node set {t_0, . . . , t_n} ⊂ R are given by

   L_i(t) := ∏_{j=0, j≠i}^{n} (t − t_j)/(t_i − t_j) ,   i = 0, . . . , n .   (5.2.2.4)
(Fig. 157)
The data (t_j, y_j) are called convex (concave) if

   ∆_j ≤ ∆_{j+1}  (≥) ,  j = 1, . . . , n − 1 ,   where   ∆_j := (y_j − y_{j−1})/(t_j − t_{j−1}) ,  j = 1, . . . , n ,

equivalently,

   y_i ≤ ( (t_{i+1} − t_i) y_{i−1} + (t_i − t_{i−1}) y_{i+1} ) / (t_{i+1} − t_{i−1})   ∀ i = 1, . . . , n − 1 ,

i.e., each data point lies below the line segment connecting the other data, cf. definition of convexity of a function [Str09, Def. 5.5.2].
(Fig. 158: convex data.   Fig. 159: convex function.)
§5.3.1.5 ((Local) shape preservation) Now we consider the interpolation problem of building an interpolant f with special properties inherited from the given data (t_i, y_i), i = 0, . . . , n.

Goal: shape-preserving interpolation.

More ambitious goal: local shape-preserving interpolation: for each subinterval I = [t_i, t_{i+j}]
(Figure: measurement points and the interpolating polynomial of degree 10 for the data of Exp. 5.3.1.6.)

We observe
• no locality,
• no positivity,
• no monotonicity,
• no local conservation of the curvature,
in the case of global polynomial interpolation.
y
Then the piecewise linear interpolant s : [t_0, t_n] → R is defined as, cf. Ex. 5.1.0.15:

   s(t) = ( (t_{i+1} − t) y_i + (t − t_i) y_{i+1} ) / (t_{i+1} − t_i)   for t ∈ [t_i, t_{i+1}] .   (5.3.2.1)

(Fig. 160: piecewise linear interpolant through data points with nodes t_0, t_1, t_2, t_3, t_4.)
Piecewise linear interpolation means simply “connect the data points in R2 using straight lines”.
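For illustration, a minimal sketch of evaluating (5.3.2.1) at a single point (a hypothetical helper, not one of the lecture codes; t is assumed to be sorted):

#include <Eigen/Dense>
#include <algorithm>

// Sketch: value of the piecewise linear interpolant (5.3.2.1) at x.
double pwlinearEval(const Eigen::VectorXd &t, const Eigen::VectorXd &y, double x) {
  const Eigen::Index n = t.size() - 1;
  // locate the interval [t_i, t_{i+1}] containing x (clamped to [t_0, t_n])
  Eigen::Index i = std::upper_bound(t.data(), t.data() + n, x) - t.data();
  i = std::max<Eigen::Index>(1, std::min(i, n)) - 1;
  return ((t[i + 1] - x) * y[i] + (x - t[i]) * y[i + 1]) / (t[i + 1] - t[i]);
}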
Obvious: linear interpolation is linear (as mapping y 7→ s, see Def. 5.1.0.25) and local in the following
sense:
As obvious are the properties asserted in the following theorem. The local preservation of curvature is a
Bad news: none of these properties carries over to local polynomial interpolation of higher polynomial degree
d > 1.
EXAMPLE 5.3.2.4 (Piecewise quadratic interpolation) We consider the following generalization of
piecewise linear interpolation of data points (t j , y j ) ∈ R × R, j = 0, . . . , n.
From Thm. 5.2.2.7 we know that a parabola (polynomial of degree 2) is uniquely determined by 3 data
points. Thus, the idea is to form groups of three adjacent data points and interpolate each of these triplets
by a 2nd-degree polynomial (parabola).
Assume: n = 2m even
piecewise quadratic interpolant q : [min{ti }, max{ti }] 7→ R is defined by
(Fig. 161: nodes, piecewise linear interpolant, and piecewise quadratic interpolant for the nodes/data as in Exp. 5.3.1.6.)

No shape preservation for the piecewise quadratic interpolant.
y
Given data points (t j , y j ) ∈ R × R, j = 0, . . . , n, with pairwise distinct ordered nodes t j , and slopes
c j ∈ R, the piecewise cubic Hermite interpolant s : [t0 , tn ] → R is defined by the requirements
s|[ti−1 ,ti ] ∈ P3 , i = 1, . . . , n , s ( ti ) = yi , s ′ ( ti ) = ci , i = 0, . . . , n .
Piecewise cubic Hermite interpolants are continuously differentiable on their interval of definition.
Proof. The assertion of the corollary follows from the agreement of function values and first derivative
values on nodes shared by two intervals, on each of which the piecewise cubic Hermite interpolant is a
polynomial of degree 3.
✷
§5.3.3.3 (Local representation of piecewise cubic Hermite interpolant) Locally, we can write a piece-
wise cubic Hermite interpolant as a linear combination of generalized cardinal basis functions with coeffi-
cients supplied by the data values y j and the slopes c j :
   H_1(t) := φ( (t_i − t)/h_i ) ,          (5.3.3.5a)
   H_2(t) := φ( (t − t_{i−1})/h_i ) ,      (5.3.3.5b)
   H_3(t) := −h_i ψ( (t_i − t)/h_i ) ,     (5.3.3.5c)
   H_4(t) := h_i ψ( (t − t_{i−1})/h_i ) ,  (5.3.3.5d)
   h_i := t_i − t_{i−1} ,                  (5.3.3.5e)
   φ(τ) := 3τ² − 2τ³ ,                     (5.3.3.5f)
   ψ(τ) := τ³ − τ² .                       (5.3.3.5g)

(Fig. 162: the local basis polynomials H_1, H_2, H_3, H_4 on [0, 1].)
By tedious, but straightforward computations using the chain rule we find the following values for Hk and
Hk′ at the endpoints of the interval [ti−1 , ti ].
H ( t i −1 ) H ( ti ) H ′ ( t i −1 ) H ′ ( ti )
H1 1 0 0 0
H2 0 1 0 0
H3 0 0 1 0
H4 0 0 0 1
This amounts to a proof for (5.3.3.4) (why?).
The formula (5.3.3.4) is handy for the local evaluation of piecewise cubic Hermite interpolants. The function hermloceval() in Code 5.3.3.6 performs the efficient evaluation (in multiple points) of the cubic polynomial s on [t_1, t_2] uniquely defined by the constraints s(t_1) = y_1, s(t_2) = y_2, s′(t_1) = c_1, s′(t_2) = c_2.
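A minimal sketch of such a local evaluation (our own illustration, not necessarily identical to the GITLAB version of Code 5.3.3.6):

#include <Eigen/Dense>

// Sketch: evaluate the cubic s with s(t1)=y1, s(t2)=y2, s'(t1)=c1, s'(t2)=c2
// at all points in x, via the local representation (5.3.3.4)/(5.3.3.5).
Eigen::VectorXd hermloceval_sketch(const Eigen::VectorXd &x, double t1, double t2,
                                   double y1, double y2, double c1, double c2) {
  const double h = t2 - t1;
  Eigen::VectorXd s(x.size());
  for (Eigen::Index k = 0; k < x.size(); ++k) {
    const double tau = (x[k] - t1) / h;     // local coordinate, tau in [0,1]
    const double tau2 = tau * tau, tau3 = tau2 * tau;
    s[k] = y1 * (1 - 3 * tau2 + 2 * tau3)    // = y1 * H1
         + y2 * (3 * tau2 - 2 * tau3)        // = y2 * H2
         + h * c1 * (tau - 2 * tau2 + tau3)  // = c1 * H3
         + h * c2 * (tau3 - tau2);           // = c2 * H4
  }
  return s;
}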
§5.3.3.7 (Linear Hermite interpolation) However, the data for an interpolation problem (→ Section 5.1) are merely the interpolation points (t_j, y_j), j = 0, . . . , n, but not the slopes of the interpolant at the nodes. Thus, in order to define an interpolation operator into the space of piecewise cubic Hermite functions, we have to supply a mapping R^{n+1} × R^{n+1} → R^{n+1} computing the slopes c_j from the data points.
Since this mapping should be local it is natural to rely on (weighted) averages of the local slopes ∆_j (→ Def. 5.3.1.3) of the data, for instance

   c_i = ∆_1   for i = 0 ,
   c_i = ∆_n   for i = n ,
   c_i = ( (t_{i+1} − t_i) ∆_i + (t_i − t_{i−1}) ∆_{i+1} ) / (t_{i+1} − t_{i−1})   if 1 ≤ i < n ,

with ∆_j := (y_j − y_{j−1})/(t_j − t_{j−1}) , j = 1, . . . , n .   (5.3.3.8)
“Local” means, that, if the values y j are non-zero for only a few adjacent data points with indices j =
k, . . . , k + m, m ∈ N small, then the Hermite interpolant s is supported on [tk−ℓ , tk+m+ℓ ] for small ℓ ∈ N
independent of k and m. y
Data points:
✦ 11 equispaced nodes t_j = −1 + 0.2 j, j = 0, . . . , 10,
✦ data values y_j = f(t_j) with f(x) := sin(5x) e^x.
(Fig. 163)
From Ex. 5.3.3.9 we learn that, if the slopes are chosen according to Eq. (5.3.3.8), then the resulting
Hermite interpolation does not preserve monotonicity.
y
Consider the situation sketched on the right ✄
The red circles (•) represent data points, the blue line
(—) the piecewise linear interpolant → Section 5.3.2.
§5.3.3.10 (Limiting of local slopes) From the discussion of Fig. 164 and Fig. 165 it is clear that local
monotonicity preservation entails that the local slopes ci of a cubic Hermite interpolant (→ Def. 5.3.3.1)
have to fulfill
   c_i = 0 ,  if sgn(∆_i) ≠ sgn(∆_{i+1}) ,
   c_i = some “average” of ∆_i, ∆_{i+1} ,  otherwise ,
   i = 1, . . . , n − 1 .   (5.3.3.11)

✎ notation: sign function   sgn(ξ) = 1 if ξ > 0 ,   sgn(ξ) = 0 if ξ = 0 ,   sgn(ξ) = −1 if ξ < 0 .
A slope selection rule that enforces (5.3.3.11) is called a limiter.
Of course, testing for equality with zero does not make sense for data that may be affected by measure-
ment or roundoff errors. Thus, the “average” in (5.3.3.11) must be close to zero already when either ∆i ≈ 0
or ∆i+1 ≈ 0. This is satisfied by the weighted harmonic mean
   c_i = 1 / ( w_a/∆_i + w_b/∆_{i+1} ) ,   (5.3.3.12)
(Fig. 166: the weighted harmonic mean as a function of a and b for w_a = w_b = 1/2: it acts as a “smoothed min(·, ·)-function”; if one of the arguments tends to 0, then c_i → 0.)
A good choice of the weights is

   w_a = (2 h_{i+1} + h_i) / (3 (h_{i+1} + h_i)) ,   w_b = (h_{i+1} + 2 h_i) / (3 (h_{i+1} + h_i)) .

This yields the following local slopes, unless (5.3.3.11) enforces c_i = 0:

   c_i = ∆_1 ,  if i = 0 ,
   c_i = 3 (h_{i+1} + h_i) / ( (2 h_{i+1} + h_i)/∆_i + (h_i + 2 h_{i+1})/∆_{i+1} )   for i ∈ {1, . . . , n − 1} with sgn(∆_i) = sgn(∆_{i+1}) ,
   c_i = ∆_n ,  if i = n ,                            h_i := t_i − t_{i−1} .   (5.3.3.13)
(Figure: data points from Exp. 5.3.1.6 and the piecewise cubic interpolant s(t). Plot created with the MATLAB function call v = pchip(t,y,x), where t holds the data nodes t_j, x the evaluation points x_i, and v the values s(x_i).)
Remark 5.3.3.15 (Non-linear cubic Hermite interpolation) Note that the mapping y := [y_0, . . . , y_n]^⊤ ↦ (c_0, . . . , c_n) defined by (5.3.3.11) and (5.3.3.13) is not linear.
➣ The “pchip interpolation operator” does not provide a linear mapping from data space R^{n+1} into C^1([t_0, t_n]) (in the sense of Def. 5.1.0.25).
In fact, the non-linearity of the piecewise cubic Hermite interpolation operator is necessary for (only global)
monotonicity preservation:
If, for fixed node set {t j }nj=0 , n ≥ 2, an interpolation scheme I : R n+1 → C1 ( I ) is linear
as a mapping from data values to continuous functions on the interval covered by the nodes
(→ Def. 5.1.0.25), and monotonicity preserving, then I(y)′ (t j ) = 0 for all y ∈ R n+1 and
j = 1, . . . , n − 1.
Of course, an interpolant that is flat in all data points, as stipulated by Thm. 5.3.3.16 for a linear, monotonicity-preserving, C1-smooth interpolation scheme, does not make much sense.
At least, the piecewise cubic Hermite interpolation operator is local (in the sense discussed in § 5.3.3.7).
y
The cubic Hermite interpolation polynomial with slopes as in Eq. (5.3.3.13) provides a local
monotonicity-preserving C1 -interpolant.
Proof. See F. FRITSCH AND R. CARLSON, Monotone piecewise cubic interpolation, SIAM J. Numer. Anal., 17 (1980), pp. 238–246.
✷
The next code demonstrates the calculation of the slopes ci in M ATLAB’s pchip (details in [FC80]):
namespace pchipslopes {
  ...
    }
  }
  // Special slopes at endpoints, beyond (5.3.3.13)
  c(0) = pchipend(h(0), h(1), delta(0), delta(1));
  c(n - 1) = pchipend(h(n - 2), h(n - 3), delta(n - 2), delta(n - 3));
}

inline double pchipend(const double h1, const double h2, const double del1,
                       const double del2) {
  // Non-centered, shape-preserving, three-point formula
  double d = ((2 * h1 + h2) * del1 - h1 * del2) / (h1 + h2);
  if (d * del1 < 0) {
    d = 0;
  } else if (del1 * del2 < 0 && std::abs(d) > std::abs(3 * del1)) {
    d = 3 * del1;
  }
  return d;
}

}  // namespace pchipslopes
(Q5.3.3.19.C) Show by counterexample that a locally convexity preserving interpolation scheme can gen-
erate an interpolant with negative function values even if the data values are all positive.
(Q5.3.3.19.D) Given data points (t_i, y_i), i = 0, . . . , n, t_{i−1} < t_i, i = 1, . . . , n, we define

   f(t) = y_0 + ∫_{t_0}^{t} p(τ) dτ ,
(Q5.3.3.19.F) Given an ordered node set t_0 < t_1 < · · · < t_n and associated data values y_i, i = 0, . . . , n, the pchip interpolant is a piecewise cubic Hermite interpolant, with slopes chosen according to the formula

   c_i = ∆_1 ,  if i = 0 ,
   c_i = 3 (h_{i+1} + h_i) / ( (2 h_{i+1} + h_i)/∆_i + (h_i + 2 h_{i+1})/∆_{i+1} )   for i ∈ {1, . . . , n − 1} with sgn(∆_i) = sgn(∆_{i+1}) (c_i = 0 otherwise),
   c_i = ∆_n ,  if i = n ,
   h_i := t_i − t_{i−1} ,   ∆_i := (y_i − y_{i−1})/(t_i − t_{i−1}) .   (5.3.3.13)

(i) Determine the supports supp b_i ⊂ R of the functions b_i, i = 0, . . . , n, where b_i is the pchip interpolant for the data values y_0 = 0, . . . , y_{i−1} = 0, y_i = 1, y_{i+1} = 0, . . . , y_n = 0.
(ii) Denote by p(y) the pchip interpolant for the data vector y := [y_0, . . . , y_n]^⊤ ∈ R^{n+1}. Can we write p as

   p(y)(t) = ∑_{i=0}^{n} y_i b_i(t) ,   t_0 ≤ t ≤ t_n ?
5.4 Splines
Piecewise cubic Hermite Interpolation presented in Section 5.3.3 entailed determining reconstruction
slopes ci . Now we learn about a way how to do piecewise polynomial interpolation, which results in
C k -interpolants, k > 0, and dispenses with auxiliary slopes. The idea is to obtain the missing conditions
implicitly from extra continuity conditions, built into spaces of so-called splines. These are of fundamental
importance for modern computer-aided geometric design (CAGD).
Do not mix up “knots” (= “breakpoints”) of a spline function and “nodes”, the first components of the data tuples (t_i, y_i) for 1D interpolation. In the case of spline interpolation, knots may serve as nodes, but not necessarily.
Let’s make explicit the spline spaces of the lowest degrees:
• d = 0 : M-piecewise constant discontinuous functions
• d = 1 : M-piecewise linear continuous functions
• d = 2 : continuously differentiable M-piecewise quadratic functions
The dimension of spline space can be found by a counting argument (heuristic): We count the number
of “degrees of freedom” (d.o.f.s) possessed by a M-piecewise polynomial of degree d, and subtract the
number of linear constraints implicitly contained in Def. 5.4.1.1:
dim Sd,M = n + d .
Remark 5.4.1.3 (Differentiating and integrating splines) Obviously, spline spaces are mapped onto
each other by differentiation & integration:
   s ∈ S_{d,M}   ⇒   s′ ∈ S_{d−1,M}   ∧   t ↦ ∫_a^t s(τ) dτ ∈ S_{d+1,M} .   (5.4.1.4)
y
Review question(s) 5.4.1.5 (Spline function spaces)
(Q5.4.1.5.A) Given an (ordered) knot set M := {t0 < t1 < · · · < tn } ⊂ R use a counting argument to
determine the dimension of the space of piecewise polynomials
(Q5.4.1.5.B) Consider the knot set M := {0, 1/2, 1}. For arbitrary numbers y_0, y_1, c_0, c_1 does there exist
s ∈ S2,M such that
Is s unique?
△
Supplementary literature. More details can be found in [Han02, pp. XIII, 46], [QSS00,
Sect. 8.6.1].
Remark 5.4.2.1 (Perceived smoothness of cubic splines) Cognitive psychology teaches us that the human eye perceives a C2-function as “smooth”, while it can still spot the abrupt change of curvature at the possible discontinuities of the second derivative of a cubic Hermite interpolant (→ Def. 5.3.3.1).
For this reason the simplest spline functions featuring C2 -smoothness are of great importance in computer
aided design (CAD). They are the cubic splines, M-piecewise polynomials of degree 3 contained in S3,M
(→ Def. 5.4.1.1). y
§5.4.2.2 (Cubic spline interpolants) The definition of a cubic spline interpolant is straightforward and
matches the abstract concept of an interpolant introduced in Section 5.1. Also note the relationship with
Hermite interpolation discussed in Section 5.3.3.
Given a node set/knot set M := {t0 < t1 < · · · < tn }, n ∈ N, and data values y0 , . . . , yn ∈ R,
an associated cubic spline interpolant is a function s ∈ S3,M that complies with the interpolation
conditions
s(t j ) = y j , j = 0, . . . , n . (5.4.2.4)
Note that in the case of cubic spline interpolation the spline knots and interpolation nodes coincide.
From dimensional considerations it is clear that the interpolation conditions will fail to fix the interpolating cubic spline uniquely: by Thm. 5.4.1.2 we have dim S_{3,M} = n + 3, whereas (5.4.2.4) supplies only n + 1 conditions.
Obviously, “two conditions are missing”, which means that the interpolation problem for cubic splines is
not well-defined by (5.4.2.4). We have to impose two additional conditions. Different ways to do this will
lead to different cubic spline interpolants for the same set of data points. y
§5.4.2.5 (Computing cubic spline interpolants) We opt for a linear interpolation scheme (→
Def. 5.1.0.25) into the spline space S3,M , which means that the two additional conditions must depend
linearly on the data values. As explained in § 5.1.0.21, a linear interpolation scheme will lead to a linear
system of equations for expansion coefficients with respect to a suitable basis.
We reuse the local representation of a cubic spline through cubic Hermite cardinal basis polynomials from
(5.3.3.5):
   s|_{[t_{j−1},t_j]}(t) = s(t_{j−1}) · (1 − 3τ² + 2τ³) + s(t_j) · (3τ² − 2τ³)
                          + h_j s′(t_{j−1}) · (τ − 2τ² + τ³) + h_j s′(t_j) · (−τ² + τ³) ,   (5.4.2.6)

   cf. (5.3.3.4), with the local coordinate τ := (t − t_{j−1})/h_j and h_j := t_j − t_{j−1}.
Once these slopes are known, the efficient local evaluation of a cubic spline function can be done in the
same way as for a cubic Hermite interpolant, see Section 5.3.3.1, Code 5.3.3.6.
Note: if s(t j ), s′ (t j ), j = 0, . . . , n, are fixed, then the representation Eq. (5.4.2.6) already guarantees
s ∈ C1 ([t0 , tn ]), cf. the discussion for cubic Hermite interpolation, Section 5.3.3.
However, do the second derivatives of the local polynomials also match at the interior knots? Enforcing s″((t_j)^−) = s″((t_j)^+) for j = 1, . . . , n − 1 yields the additional conditions (5.4.2.7).
Inserting these formulas into Eq. (5.4.2.7) leads to n − 1 linear equations for the n + 1 unknown slopes
c j := s′ (t j ). Taking into account the (known) interpolation conditions s(ti ) = yi , we get
   (1/h_j) c_{j−1} + 2 (1/h_j + 1/h_{j+1}) c_j + (1/h_{j+1}) c_{j+1} = 3 ( (y_j − y_{j−1})/h_j² + (y_{j+1} − y_j)/h_{j+1}² ) ,   (5.4.2.9)

for j = 1, . . . , n − 1.
Actually Eq. (5.4.2.9) amounts to an (n − 1) × (n + 1), that is, underdetermined, linear system of equa-
tions. The dimensions make perfect sense, because
n − 1 equations =
ˆ number of interpolation conditions
n + 1 unknowns= ˆ dimension of cubic spline space on knot set {t0 < t1 < · · · < tn }
This linear system of equations can be written in matrix-vector form:
b0 a1 b1 0 ··· ··· 0 c0 y1 − y0 y2 − y1
3 +
0 b1 a2 b2 c1 h21 h22
.. .. .. ..
0 . . . . . ..
. .. .. ..
.. = . . (5.4.2.10)
.. . . .
..
. a n − 2 bn − 2 0 c y n − y n −1
n −1 y −y
3 n−h12 n−2 +
0 ··· ··· 0 bn − 2 a n − 1 bn − 1 h2n
cn n −1
with
1 2 2 i = 0, 1, . . . , n − 1 ,
bi : = , ai : = + ,
h i +1 h i h i +1 [ bi , a i > 0 , a i = 2 ( bi + bi − 1 ) ] .
➙ two additional constraints are required, as already noted in § 5.4.2.2. y
§5.4.2.11 (Types of cubic spline interpolants) To saturate the remaining two degrees of freedom the following three approaches are popular. All of them involve conditions linear in the data values.

➀ Complete cubic spline interpolation: the endpoint slopes are prescribed, e.g., as fixed linear functions of the data values,

   c_0 := q^⊤ y ,   c_n := p^⊤ y ,

for given vectors p, q ∈ R^{n+1} and y := [y_0, . . . , y_n]^⊤ ∈ R^{n+1} the vector of data values. Then the first and last column can be removed from the system matrix of (5.4.2.10). Their products with c_0 and c_n, respectively, have to be subtracted from the right-hand side of (5.4.2.10), which leads to the (n − 1) × (n − 1) linear system of equations

\begin{bmatrix}
 a_1 & b_1 & 0 & \cdots & \cdots & 0 \\
 b_1 & a_2 & b_2 & & & \vdots \\
 0 & \ddots & \ddots & \ddots & & \vdots \\
 \vdots & & \ddots & \ddots & \ddots & 0 \\
 0 & \cdots & \cdots & 0 & b_{n-2} & a_{n-1}
\end{bmatrix}
\begin{bmatrix} c_1 \\ \vdots \\ \vdots \\ c_{n-1} \end{bmatrix}
=
\begin{bmatrix}
 3 \left( \frac{y_1-y_0}{h_1^2} + \frac{y_2-y_1}{h_2^2} \right) - c_0 b_0 \\
 \vdots \\
 3 \left( \frac{y_{n-1}-y_{n-2}}{h_{n-1}^2} + \frac{y_n-y_{n-1}}{h_n^2} \right) - c_n b_{n-1}
\end{bmatrix} .   (5.4.2.12)
➁ Natural cubic spline interpolation: s″(t_0) = 0, s″(t_n) = 0, which, expressed through the slopes, amounts to the two extra equations

   (2/h_1) c_0 + (1/h_1) c_1 = 3 (y_1 − y_0)/h_1² ,   (1/h_n) c_{n−1} + (2/h_n) c_n = 3 (y_n − y_{n−1})/h_n² .

Combining these two extra equations with (5.4.2.10), we arrive at a linear system of equations with tridiagonal s.p.d. (→ Def. 1.1.2.6, Lemma 2.8.0.12) system matrix and unknowns c_0, . . . , c_n:

\begin{bmatrix}
 2 & 1 & 0 & \cdots & \cdots & 0 \\
 b_0 & a_1 & b_1 & 0 & \cdots & 0 \\
 0 & b_1 & a_2 & b_2 & & \vdots \\
 \vdots & & \ddots & \ddots & \ddots & \vdots \\
 \vdots & & & b_{n-2} & a_{n-1} & b_{n-1} \\
 0 & \cdots & \cdots & 0 & 1 & 2
\end{bmatrix}
\begin{bmatrix} c_0 \\ \vdots \\ \vdots \\ c_n \end{bmatrix}
=
\begin{bmatrix}
 3 \frac{y_1-y_0}{h_1} \\
 3 \left( \frac{y_1-y_0}{h_1^2} + \frac{y_2-y_1}{h_2^2} \right) \\
 \vdots \\
 3 \left( \frac{y_{n-1}-y_{n-2}}{h_{n-1}^2} + \frac{y_n-y_{n-1}}{h_n^2} \right) \\
 3 \frac{y_n-y_{n-1}}{h_n}
\end{bmatrix} .   (5.4.2.13)

Owing to Thm. 2.7.5.4 this linear system of equations can be solved with an asymptotic computational effort of O(n) for n → ∞.
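For a concrete impression, here is a minimal sketch (our own illustration, not one of the lecture codes) of how the slopes of the natural cubic spline interpolant could be computed from (5.4.2.13) with Eigen; a hand-coded tridiagonal (Thomas) solver would attain the O(n) bound as well:

#include <Eigen/Dense>
#include <Eigen/Sparse>
#include <vector>

// Sketch: slopes c_0,...,c_n of the natural cubic spline interpolant of
// (t_i, y_i), i = 0,...,n, by assembling and solving (5.4.2.13).
Eigen::VectorXd naturalSplineSlopes(const Eigen::VectorXd &t, const Eigen::VectorXd &y) {
  const Eigen::Index n = t.size() - 1;
  Eigen::VectorXd h(n);  // h[i-1] corresponds to h_i = t_i - t_{i-1}
  for (Eigen::Index i = 0; i < n; ++i) h[i] = t[i + 1] - t[i];
  std::vector<Eigen::Triplet<double>> trp;
  Eigen::VectorXd rhs(n + 1);
  // first and last row: natural end conditions, scaled by h_1 and h_n
  trp.emplace_back(0, 0, 2.0); trp.emplace_back(0, 1, 1.0);
  rhs[0] = 3.0 * (y[1] - y[0]) / h[0];
  trp.emplace_back(n, n - 1, 1.0); trp.emplace_back(n, n, 2.0);
  rhs[n] = 3.0 * (y[n] - y[n - 1]) / h[n - 1];
  // interior rows j = 1,...,n-1: the equations (5.4.2.9)
  for (Eigen::Index j = 1; j < n; ++j) {
    trp.emplace_back(j, j - 1, 1.0 / h[j - 1]);
    trp.emplace_back(j, j, 2.0 / h[j - 1] + 2.0 / h[j]);
    trp.emplace_back(j, j + 1, 1.0 / h[j]);
    rhs[j] = 3.0 * ((y[j] - y[j - 1]) / (h[j - 1] * h[j - 1]) +
                    (y[j + 1] - y[j]) / (h[j] * h[j]));
  }
  Eigen::SparseMatrix<double> A(n + 1, n + 1);
  A.setFromTriplets(trp.begin(), trp.end());
  Eigen::SparseLU<Eigen::SparseMatrix<double>> solver;
  solver.compute(A);
  return solver.solve(rhs);
}

With the slopes at hand, the spline can be evaluated locally as for a cubic Hermite interpolant, see Code 5.3.3.6.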
➂ Periodic cubic spline interpolation: s′(t_0) = s′(t_n) (➣ c_0 = c_n), s″(t_0) = s″(t_n).
This removes one unknown and adds another equation so that we end up with an n × n linear system with s.p.d. (→ Def. 1.1.2.6) system matrix

A := \begin{bmatrix}
 a_1 & b_1 & 0 & \cdots & 0 & b_0 \\
 b_1 & a_2 & b_2 & & & 0 \\
 0 & \ddots & \ddots & \ddots & & \vdots \\
 \vdots & & \ddots & \ddots & \ddots & 0 \\
 0 & & & \ddots & a_{n-1} & b_{n-1} \\
 b_0 & 0 & \cdots & 0 & b_{n-1} & a_0
\end{bmatrix} ,
   b_i := 1/h_{i+1} ,   a_i := 2/h_i + 2/h_{i+1} ,   i = 0, 1, . . . , n − 1 .

This linear system can be solved with rank-1-modification techniques (see § 2.6.0.12, Lemma 2.6.0.21) + tridiagonal elimination: asymptotic computational effort O(n).
Remark 5.4.2.14 (Piecewise cubic interpolation schemes) Let us review three different classes of in-
terpolation schemes relying on piecewise cubic polynomials with respect to a prescribed node set:
✦ Piecewise cubic local Lagrange interpolation
➣ Extra degrees of freedom fixed by putting four nodes in one interval
➥ yields merely C0 -interpolant; perfectly local.
✦ Cubic Hermite interpolation
➣ Extra degrees of freedom fixed by locally reconstructed slopes, e.g. (5.3.3.13)
➥ yields C1 -interpolant; still local.
✦ Cubic spline interpolation
➣ Extra degrees of freedom fixed by C2 -smoothness, complete/natural/periodic constraint.
➥ yields C2 -interpolant; non-local.
y
(Fig. 168)
The path of the tip of the robotic arm can conveniently be described by the componentwise complete cubic spline interpolant s : [t_0, t_n] → R³ satisfying

   s′(t_0) = α_0 y_0 + α_1 y_1 + α_2 y_2 + α_3 y_3 ,   s′(t_n) = β_0 y_n + β_1 y_{n−1} + β_2 y_{n−2} + β_3 y_{n−3} ,
(Q5.4.2.16.F) For the computation of a cubic spline interpolant in § 5.4.2.5 we used a local, that is, inside
the knot intervals, representation of the spline function by means of cardinal basis polynomials for
Hermite interpolation
instead of (5.4.2.17) as starting point for the computation of a cubic spline interpolant not a good idea?
△
models the elastic bending energy of a rod, whose shape is described by the graph of f (Soundness
check: zero bending energy for straight rod). We will show that cubic spline interpolants have minimal
bending energy among all C2 -smooth interpolating functions.
Given a knot/nodes set M := { a = t0 < t1 < · · · < tn = b} in the interval [ a, b], let s ∈ S3,M be
the natural cubic spline interpolant of the data points (ti , yi ) ∈ R2 , i = 0, . . . , n.
Then s minimizes the elastic bending energy among all interpolating functions in C2 ([ a, b]), that is
We show that any small perturbation of s such that the perturbed spline still satisfies the interpolation
conditions leads to an increase in elastic energy.
   E_bd(s + k) = ½ ∫_a^b |s″ + k″|² dt = E_bd(s) + ∫_a^b s″(t) k″(t) dt + ½ ∫_a^b |k″|² dt ,   (5.4.3.4)
                                              (=: I)                       (≥ 0)

We scrutinize I, split it into the contributions of individual knot intervals, integrate by parts twice, and use s^{(4)} ≡ 0. By s‴(t_j^±) we denote the limits of the discontinuous third derivative t ↦ s‴(t) from the left (−) and right (+) of the knot t_j.

   I = ∑_{j=1}^{n} ∫_{t_{j−1}}^{t_j} s″(t) k″(t) dt
     = ∑_{j=1}^{n} { − ∫_{t_{j−1}}^{t_j} s‴(t) k′(t) dt + s″(t_j) k′(t_j) − s″(t_{j−1}) k′(t_{j−1}) }
     = ∑_{j=1}^{n} { ∫_{t_{j−1}}^{t_j} s^{(4)}(t) k(t) dt + s″(t_j) k′(t_j) − s″(t_{j−1}) k′(t_{j−1}) − s‴(t_j^−) k(t_j) + s‴(t_{j−1}^+) k(t_{j−1}) }
     = − ∑_{j=1}^{n} ( s‴(t_j^−) k(t_j) − s‴(t_{j−1}^+) k(t_{j−1}) ) + s″(t_n) k′(t_n) − s″(t_0) k′(t_0) = 0 ,

since k vanishes in all interpolation nodes (the perturbed function still interpolates) and s″(t_0) = s″(t_n) = 0 for the natural cubic spline.
In light of (5.4.3.4): no perturbation k compatible with interpolation conditions can make the bending
energy of s decrease! y
Nature: a thin elastic rod fixed at certain points attains a shape that minimizes its potential bending energy (virtual work principle of statics).
y
Remark 5.4.3.6 (Shape preservation of cubic spline interpolation)
Endpoint slopes chosen as

   s′(t_0) = c_0 := (y_1 − y_0)/(t_1 − t_0) ,   s′(t_n) = c_n := (y_n − y_{n−1})/(t_n − t_{n−1}) .

(Figure: data points and the resulting cubic spline interpolant s(t).)
§5.4.3.7 (Weak locality of an interpolation scheme) In terms of locality of interpolation schemes, in the sense of § 5.3.3.7, we have seen:
• Piecewise linear interpolation (→ Section 5.3.2) is strictly local: Changing a single data value y j
affects the interpolant only on the interval ]t j−1 , t j+1 [.
• Monotonicity preserving piecewise cubic Hermite interpolation (→ Section 5.3.3.2) is still local, be-
cause changing y j will lead to a change in the interpolant only in ]t j−2 , t j+2 [ (the remote intervals
are affected through the averaging of local slopes).
• Polynomial Lagrange interpolation is highly non-local, see Ex. 5.2.4.3.
We can weaken the notion of locality of an interpolation scheme on an ordered node set {t_i}_{i=0}^{n}:
➣ (weak) locality measures the impact of a perturbation of a data value y_j at points t ∈ [t_0, t_n] as a function of |t − t_j|;
➣ an interpolation scheme is weakly local, if the impact of the perturbation of y_j displays a rapid (e.g. exponential) decay as |t − t_j| increases.
For a linear interpolation scheme (→ § 5.1.0.21) locality can be deduced from the decay of the cardinal
interpolants/cardinal basis functions (→ Lagrange polynomials of § 5.2.2.3), that is, the functions b j :=
I(e j ), where e j is the j-th unit vector, and I the interpolation operator. Then weak locality can be quantified
as
Remember:
• Lagrange polynomials satisfying (5.2.2.4) provide cardinal interpolants for polynomial interpolation
→ § 5.2.2.3. As is clear from Fig. 151, they do not display any decay away from their “base node”.
Rather, they grow strongly. Hence, there is no locality in global polynomial interpolation.
• Tent functions (→ Fig. 150) are the cardinal basis functions for piecewise linear interpolation, see
Ex. 5.1.0.15. Hence, this scheme is perfectly local, see (5.3.2.2).
Given a knot/node set M := {t0 < t1 < · · · < tn } the ith natural cubic cardinal spline is defined by the
conditions
These functions will supply a cardinal basis of S3,M and, according to (5.1.0.18), for natural cubic spline
interpolants we have the formula
   Natural spline interpolant:   s(t) = ∑_{j=0}^{n} y_j L_j(t) .
This means that t 7→ Li (t) completely characterizes the impact that a change of the value yi has on s. A
very rapid decay of Li means that the value yi does not influence s much outside a small neighborhood of
ti .
Fast (exponential) decay of Li ↔ weak locality of natural cubic spline interpolation.
y
EXPERIMENT 5.4.3.10 (Decay of cardinal basis functions for natural cubic spline interpolation) We
examine the cardinal basis of S3,M (“cardinal splines”) associated with natural cubic spline interpolation
on an equidistant knot set M := { t_j = j − 8 }_{j=0}^{16} :

(Figure, left panel: “Cardinal cubic spline function” on [−8, 8]; right panel: “Cardinal cubic spline in middle points of the intervals”, values of the cardinal cubic splines on a logarithmic scale.)
We observe a rapid exponential decay of the cardinal splines, which is expressed by the statement that
“cubic spline interpolation is weakly local”. y
Review question(s) 5.4.3.11 (Structural properties of cubic spline interpolants)
(Q5.4.3.11.A) Given data points (t_i, y_i) ∈ R², t_0 < t_1 < · · · < t_n, n ∈ N, show that the piecewise linear interpolant p ∈ C^0([t_0, t_n]) minimizes the elastic energy

   E_el(f) := ∑_{j=1}^{n} ∫_{t_{j−1}}^{t_j} |f′(t)|² dt

   f^* := argmin_{f ∈ C²([t_0,t_n])} E_m(f) ,   E_m(f) := ∑_{i=0}^{n} |f(t_i) − y_i|² + ∫_{t_0}^{t_n} |f″(t)|² dt ,

is a cubic spline (with respect to the knot set M := {t_0 < t_1 < · · · < t_n}), which satisfies s″(t_0) = s″(t_n) = 0.
Of course, you should invoke the result about the optimality of the natural cubic spline interpolant.
Given a knot/nodes set M := { a = t0 < t1 < · · · < tn = b} in the interval [ a, b], let s ∈ S3,M
be the natural cubic spline interpolant of the data points (ti , yi ) ∈ R2 , i = 0, . . . , n.
Then s minimizes the elastic bending energy among all interpolating functions in C2 ([ a, b]), that
is
This section presents a non-linear quadratic spline (→ Def. 5.4.1.1, C1 -functions) based interpolation
scheme that preserves both monotonicity and curvature of data even in a local sense, cf. Section 5.3,
§ 5.3.1.5. The construction was presented in [Sch83], is fairly intricate and will be presented step by step.
As with all 1D data interpolation tasks we are given data points (ti , yi ) ∈ R2 , i = 0, . . . , n, and we
assume that their first coordinates are ordered: t0 < t1 < · · · < tn .
In order to obtain extra flexibility required for shape preservation, the key idea of [Sch83] is
to use an extended knot set M ⊂ [t_0, t_n] containing n additional knots besides {t_j}_{j=0}^{n} .
Then we construct an interpolating quadratic spline function that satisfies s ∈ S2,M , s(ti ) = yi , i =
0, . . . , n and locally preserves the “shape” of the data in the sense of § 5.3.1.1.
We stress that unlike in the case of cubic spline interpolation the knot set will not agree with the node set, in this case M ≠ {t_j}_{j=0}^{n}: the interpolant s interpolates the data in the points t_i but is piecewise polynomial with respect to M!
(Figures: three local data configurations (t_{i−1}, y_{i−1}), (t_i, y_i), (t_{i+1}, y_{i+1}) with slopes chosen according to the limited harmonic mean formula.)
Rule: Let T_i be the unique straight line through (t_i, y_i) with slope c_i (— in the figure ✄). If the intersection of T_{i−1} and T_i is non-empty and has a t-coordinate ∈ ]t_{i−1}, t_i],
We choose L to be the linear (d = 1) spline (polygonal line) on the knot set M′ of midpoints of the knot intervals of M from (5.4.4.2),

   M′ = { t_0 < ½(t_0 + p_1) < ½(p_1 + t_1) < ½(t_1 + p_2) < · · · < ½(t_{n−1} + p_n) < ½(p_n + t_n) < t_n } ,

   L(t_i) = y_i ,   L′(t_i) = c_i .   (5.4.4.4)
In each interval ( 21 ( p j + t j ), 12 (t j + p j+1 )) the linear spline L corresponds to the line segment of
slope c j passing through the data point (t j , y j ).
In each interval ( 12 (t j + p j+1 ), 21 ( p j+1 + t j+1 )) the linear spline L corresponds to the line segment
connecting the ones on the other knot intervals of M′ , see Fig. 170.
Given the choice of slopes according to (5.4.4.1), a detailed analysis shows the following shape-preserving property of L.
L “inherits” local monotonicity and curvature from the data.
EXAMPLE 5.4.4.5 (Auxiliary construction for shape preserving quadratic spline interpolation) We verify the above statements about the polygonal line L for the data points (j, cos(j))_{j=0}^{12}, which are marked as • in the left plot.

(Fig. 171: data points and local slopes c_i, i = 0, . . . , n.   Fig. 172: auxiliary linear spline L.)
The reader is encouraged to check by inspection of the plots that the piecewise linear function L locally
preserves monotonicity and curvature. y
Lemma 5.4.4.6.
If g is a linear spline (polygonal line) through the three points

   (a, y_a) ,   ( ½(a + b), w ) ,   (b, y_b)   with a < b , y_a, y_b, w ∈ R ,
then the parabola
satisfies
1. p( a) = y a , p(b) = yb , p′ ( a) = g′ ( a) , p′ (b) = g′ (b),
2. g monotonic increasing / decreasing ⇒ p monotonic increasing / decreasing,
3. g convex / concave ⇒ p convex / concave.
The proof boils down to discussing many cases as indicated in the following plots:

(Figures: linear spline and parabola p for different configurations of y_a, w, y_b.)
We use the same data points as in Ex. 5.4.4.5 and build a quadratic spline s ∈ S_{2,M}, M the knot set from (5.4.4.2), based on the formula suggested by Lemma 5.4.4.6.

(Fig. 176: — ≙ interpolating quadratic spline s.)

However, since s ∉ C²([t_0, t_n]) in general, we see “kinks”, cf. Rem. 5.4.2.1.
y
EXAMPLE 5.4.4.8 (Cardinal shape preserving quadratic spline) We examine the shape preserving
quadratic spline that interpolates data values y j = 0, j 6= i, yi = 1, i ∈ {0, . . . , n}, on an equidistant
node set.
(Fig. 177: data and slopes.   Fig. 178: linear auxiliary spline L.   Fig. 179: quadratic spline.)
Shape preserving quadratic spline interpolation is a local, but not a linear interpolation scheme.
y
Data from [MR81]:

   t_i :  0   1    2    3    4    5    6    7   8   9   10   11   12
   y_i :  0   0.3  0.5  0.2  0.6  1.2  1.3   1   1   1    1    0   −1

(Fig. 183, Fig. 184, Fig. 185: resulting interpolants for these data.)
In all cases we observe excellent local shape preservation. The implementation can be found in
shapepresintp.hpp ➺ GITLAB. y
Review question(s) 5.4.4.10 (Shape-preserving spline interpolation)
(Q5.4.4.10.A) Argue why an interpolant of the three data points (0, 0), (1, 0), (2, 1), (3, 1) in S2,M ,
M := {0, 1, 2, 3}, cannot be monotonicity preserving.
(Q5.4.4.10.B) Let S : I^{n+1} × R^{n+1} → C^1(I), I ⊂ R a bounded interval, be an interpolation scheme, that is, given data points (t_i, y_i), t_i ∈ I, t_i ≠ t_j for i ≠ j, i, j = 0, . . . , n, f := S([t_i]_{i=0}^{n}, [y_i]_{i=0}^{n}) is a continuously differentiable function f : I → R satisfying the interpolation conditions f(t_i) = y_i, i = 0, . . . , n. Assume that S is locally monotonicity-preserving.
What is the support of S([t_i]_{i=0}^{n}, e_k), if the nodes t_i are sorted, t_0 < t_1 < t_2 < · · · < t_n, and e_k stands for the Cartesian coordinate vector [δ_{kj}]_{j=0}^{n} ∈ R^{n+1}?
(Fig. 186: an ordered sequence of 9 control points p_0, . . . , p_8 in the (x_1, x_2)-plane.)

Task: Find an algorithm that “builds” a smooth curve, whose shape “is inspired” by the locations of the control points.
Let us give a more concrete meaning to some aspects of this rather vague specification.
§5.5.1.2 (Parameterized curves)
   C = { c(t) : t ∈ [a, b] } ,

(Fig. 187: a planar curve in the (x_1, x_2)-plane.)
y
Thus, “building” a curve amounts to finding a parameterization. In algorithms, we have to choose the parameterization from sets of simple functions, whose elements can be characterized by finitely many, more precisely, a few, real numbers (unfortunately, often also called “parameters”).
EXAMPLE 5.5.1.5 (Polynomial planar curves) We call a planar curve (→ Def. 5.5.1.3) polynomial of
degree d ∈ N0 , if it has a parameterization c : [ a, b] → R2 of the form
   c(t) = ∑_{k=0}^{d} a_k t^k   with vectors a_k ∈ R² , k ∈ {0, . . . , d} .   (5.5.1.6)
We also write c ∈ (Pd )2 to indicate that c is polynomial of degree d. For d = 1 we recover a line segment
with endpoints c( a) and c(b).
(Figure: a parabola over [−1, 1] as a polynomial curve of degree 2 in the (x_1, x_2)-plane.)
§5.5.1.7 (Smooth curves) The “smoothness” of a curve is directly related to the differentiability class of its
parameterization. As remarked in the beginning of Section 5.4.2, a curve is “visually smooth”, if its parame-
terization c : [ a, b] → R2 is twice continuously differentiable: c ∈ C2 ([ a, b]) × C2 ([ a, b]) =: (C2 ([ a, b]))2 .
Hence, the polygon connecting the control points is not an acceptable curve, because it is merely contin-
uous and has kinks. y
The meaning of “shape fidelity” cannot be captured rigorously, but some aspects of it will be discussed
below.
EXAMPLE 5.5.1.8 (Curves from interpolation) After what we have learned in Section 5.2, Section 5.3,
and Section 5.3.3, it is a natural idea to try and construct the parameterization of a curve by interpolating
the control points pℓ , ℓ ∈ {0, . . . , n} using an established scheme for 1D interpolation of the data points
(tℓ , pℓ ) with mutually distinct nodes tℓ ∈ R. However how should we choose the tℓ ?
The idea is to extract the nodes tℓ from the accumulated length of the segments of the
polygon with corners pℓ :
   t_0 := 0 ,   t_ℓ := ∑_{k=1}^{ℓ} ‖p_k − p_{k−1}‖ ,   ℓ ∈ {1, . . . , n} .   (5.5.1.9)
This choice of the nodes together with three different interpolation schemes was used to generate curves
based on a “car-shaped” sequence of control points displayed in Fig. 186.
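A minimal sketch (a hypothetical helper, not one of the lecture codes) of how the nodes (5.5.1.9) can be computed from control points stored as the columns of a 2 × (n + 1) matrix:

#include <Eigen/Dense>

// Sketch: nodes t_l of (5.5.1.9), i.e. accumulated polygon segment lengths.
Eigen::VectorXd chordalNodes(const Eigen::MatrixXd &p) {
  const Eigen::Index n = p.cols() - 1;
  Eigen::VectorXd t(n + 1);
  t[0] = 0.0;                                // t_0 := 0
  for (Eigen::Index l = 1; l <= n; ++l)      // accumulate segment lengths
    t[l] = t[l - 1] + (p.col(l) - p.col(l - 1)).norm();
  return t;
}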
(Fig. 189: control polygon, polynomial interpolant, C² cubic spline interpolant, and C¹ Hermite interpolant for the control points of Fig. 186.)
   ξ_1 v_1 + ξ_2 v_2 + · · · + ξ_m v_m ∈ V ,   0 ≤ ξ_k ≤ 1 , k = 1, . . . , m ,   ξ_1 + ξ_2 + · · · + ξ_m = 1 .
Finally, we set
A recursive definition suggests proofs by induction and a simple one yields the insight that Bezier curves
connect the first and last control point.
Lemma 5.5.2.3.
Given mutually distinct points p0 , . . . , pn , the functions bik , k ∈ {1, . . . , n}, i ∈ {0, . . . , n − k }, de-
fined by (5.5.2.2) satisfy
(Fig. 191)
The Bezier curves bik , k ∈ {1, . . . , n}, i ∈ {0, . . . , n − k }, defined by (5.5.2.2) are confined to the
convex hulls of their underlying control points pi , . . . , pi+k :
Proof. The result follows by straightforward induction since every convex combination of two points in the
convex hull again belongs to it.
✷
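Evaluating b^n(t) through the recursion (5.5.2.2) is the de Casteljau algorithm: repeated two-term convex combinations of neighbouring points. A minimal sketch (assuming (5.5.2.2) is the standard recursion b_i^0 := p_i, b_i^k(t) = (1 − t) b_i^{k−1}(t) + t b_{i+1}^{k−1}(t); not one of the lecture codes):

#include <Eigen/Dense>

// Sketch: evaluate the Bezier curve b^n at one parameter value t in [0,1];
// the control points p_0,...,p_n are the columns of the 2 x (n+1) matrix p.
Eigen::Vector2d deCasteljau(Eigen::MatrixXd p, double t) {
  const Eigen::Index n = p.cols() - 1;
  for (Eigen::Index k = 1; k <= n; ++k)        // recursion level k
    for (Eigen::Index i = 0; i + k <= n; ++i)  // combine neighbouring points
      p.col(i) = (1.0 - t) * p.col(i) + t * p.col(i + 1);
  return p.col(0);                             // = b_0^n(t)
}

This costs O(n²) operations per parameter value; the modified Horner scheme of Rem. 5.5.2.13 below needs only O(n).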
The next theorem states a deep result also connected with shape-fidelity. Its proof is beyond the scope of
this course:
No straight line intersects a Bezier curve bik more times than it intersects the control polygon
formed by its underlying control points pi , . . . , pi+k .
(Fig. 192)
The Bezier curve bn : [0, 1] → R2 as defined in (5.5.2.2) based on the control points p0 , . . . , pn ,
n ∈ N, can be written as
   b^n(t) = ∑_{i=0}^{n} B_i^n(t) p_i ,   (5.5.2.9)
Proof. Of course, we use induction in the polynomial degree based on (5.5.2.2) and the Pascal triangle
for binomial coefficients.
✷
(Figure: all Bernstein polynomials B_0^8, . . . , B_8^8 of degree n = 8 on [0, 1].)

Lemma 5.5.2.11. Basis property of Bernstein polynomials

We can verify by direct computation based on the multinomial theorem that the Bernstein polynomials satisfy

   0 ≤ B_i^n(t) ≤ 1 ,   ∑_{i=0}^{n} B_i^n(t) = 1   ∀ t ∈ [0, 1] .   (5.5.2.12)
Remark 5.5.2.13 (Modified Horner scheme for evaluation of Bezier polynomials) The representa-
tion (5.5.2.9) together with (6.2.1.3) paves the way for an efficient evaluation of Bezier curves for many
parameter values. For example, this is important for fast graphical rendering of Bezier curves.
The algorithm is similar to that from § 5.2.3.33 and based on rewriting
   b^n(t) = ∑_{i=0}^{n} B_i^n(t) p_i
          = ( · · · ( ( (1 − t) p_0 + (n/1) t p_1 ) (1 − t) + (n(n−1)/(1·2)) t² p_2 ) (1 − t)
                + (n(n−1)(n−2)/(1·2·3)) t³ p_3 ) (1 − t) + · · · ) (1 − t) + t^n p_n
and evaluating the terms in brackets from the innermost to the outer.
C++ code 5.5.2.14: Evaluation of Bezier polynomial curve for many parameter arguments ➺ GITLAB

Eigen::MatrixXd evalBezier(const Eigen::MatrixXd &nodes,
                           const Eigen::RowVectorXd &t) {
  assert((nodes.rows() == 2) && "Nodes must have two coordinates");
  const Eigen::Index n = nodes.cols();  // Number of control points
  const Eigen::Index d = n - 1;         // Polynomial degree
  const Eigen::Index N = t.size();      // No. of evaluation points
  // Vector containing 1-t ("one minus t")
  const auto oml{Eigen::RowVectorXd::Constant(N, 1.0) - t};
  // Modified Horner scheme for polynomial in Bezier form
  // Vector for returning result, initialized with p[0]*(1-t)
  Eigen::MatrixXd res = nodes.col(0) * oml;
  double binom_val = 1.0;  // Accumulate binomial coefficients
  // Powers of argument values
  Eigen::RowVectorXd t_pow{Eigen::RowVectorXd::Constant(N, 1.0)};
  for (int i = 1; i < d; ++i) {
    t_pow.array() *= t.array();
    binom_val *= (static_cast<double>(d - i) + 1.0) / i;
    res += binom_val * nodes.col(i) * t_pow;
    res.row(0).array() *= oml.array();
    res.row(1).array() *= oml.array();
  }
  res += nodes.col(d) * (t.array() * t_pow.array()).matrix();
  return res;
}
The asymptotic computational effort for the evaluation for N ∈ N parameter values is O( Nn) for
n, N → ∞. y
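A small usage sketch for Code 5.5.2.14 (the control points are made up for illustration):

#include <Eigen/Dense>
#include <iostream>

int main() {
  Eigen::MatrixXd nodes(2, 3);   // a quadratic Bezier curve (3 control points)
  nodes << 0.0, 1.0, 2.0,
           0.0, 2.0, 0.0;
  Eigen::RowVectorXd t = Eigen::RowVectorXd::LinSpaced(5, 0.0, 1.0);
  std::cout << evalBezier(nodes, t) << std::endl;  // 2 x 5 matrix of curve points
  return 0;
}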
Given an interval I := [ a, b] ⊂ R and a knot sequence M := { a = t0 < t1 < . . . < tm−1 <
tm = b}, m ∈ N, the vector space Sd,M of the spline functions of degree d (or order d + 1) is
defined by
It goes without saying that a spline curve of degree d ∈ N is a curve s : [0, 1] → R2 that possesses a
parameterization that is component-wise a spline function of degree d with respect to a knot sequence
M := {t0 = 0 < t1 < . . . < tm−1 < tm = 1} for the interval [0, 1]: s ∈ (Sd,M )2 !
This does not yet bring us closer to the goal of building a shape-aware spline curve from n + 1 given control points p_0, . . . , p_n, p_ℓ ∈ R², n ∈ N. To begin with, recall
A first consideration realizes that n + 1 control points ∈ R² correspond to 2(n + 1) degrees of freedom, and matching that with the dimension of the space of spline curves we see that we should choose

   m := n + 1 − d .
§5.5.3.2 (B-splines) Now we take the cue from the representation of Bezier curves by means of Bernstein polynomials as established in Thm. 5.5.2.8 and aim for the construction of spline counterparts of the Bezier polynomials.
For an elegant presentation we admit generalized knot sequences that may contain multiple knots
U : = {0 = : u 0 ≤ u 1 ≤ u 2 ≤ · · · ≤ u m −1 ≤ u m : = 1 } , m∈N. (5.5.3.3)
The number of times a uk -value occurs in the sequence is called its multiplicity
The following definition can be motivated by a recursion satisfied by the Bernstein polynomials from
(6.2.1.3):
In particular, we notice that B_i^k is a two-term convex combination of B_i^{k−1} and B_{i−1}^{k−1}. We pursue something similar with splines:
Note that also N_i^k is a two-term convex combination of N_i^{k−1} and N_{i+1}^{k−1}. From this we can deduce the following properties by simple induction:
The B-splines based on the generalized knot sequence U and defined by (5.5.3.6) satisfy
• N_i^k is a U-piecewise polynomial of degree ≤ k,
• 0 ≤ N_i^k(t) ≤ 1 for all 0 ≤ t ≤ 1,
• N_i^k vanishes outside the interval [u_i, u_{i+k+1}]: supp(N_i^k) = [u_i, u_{i+k+1}],
• ∑_{j=i−k}^{i} N_j^k(t) = 1 for all t ∈ [u_i, u_{i+1}[, i = k, . . . , ℓ − k − 1.
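For illustration, a minimal sketch of a pointwise B-spline evaluation (assuming that (5.5.3.6) is the Cox–de Boor recursion with the usual convention 0/0 := 0; this is not one of the lecture codes):

#include <vector>

// Sketch: value of N_i^k(t) for a generalized knot sequence u (u[i+k+1] must exist).
double Nik(const std::vector<double> &u, int i, int k, double t) {
  if (k == 0)  // characteristic function of [u_i, u_{i+1}[
    return (u[i] <= t && t < u[i + 1]) ? 1.0 : 0.0;
  double val = 0.0;
  const double d1 = u[i + k] - u[i];
  if (d1 > 0.0) val += (t - u[i]) / d1 * Nik(u, i, k - 1, t);
  const double d2 = u[i + k + 1] - u[i + 1];
  if (d2 > 0.0) val += (u[i + k + 1] - t) / d2 * Nik(u, i + 1, k - 1, t);
  return val;
}

With these values a spline curve as in (5.5.3.9) below can be sampled as s(t) = ∑_j N_j^d(t) p_j.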
The main “magic” result about B-splines concerns their smoothness properties that are by no means
obvious from the recursive definition (5.5.3.6). To make a concise statement we build special generalized
knot sequences from a (standard) knot sequence by copying endpoints d times: Given a degree d ∈ N0
and
These plots show B-splines for special generalized knot sequences of the type U_M^d :
(Fig. 194: the 9 B-splines N_i^2 of degree 2 on 12 knots.   Fig. 195: the 9 B-splines N_i^3 of degree 3 on 13 knots.)
We notice that on that generalized knot sequence (5.5.3.6) provides ℓ − d = m + d B-spline functions
Nid , i = 0, . . . , m + d − 1. In light of Thm. 5.4.1.2 this is necessary for the following much more general
assertion to hold, whose proof, unfortunately, is very technical and will be skipped.
In particular, we conclude that Nid is d − 1-times continuously differentiable: Nid ∈ C d−1 ([0, 1])! So,
already for d = 3, the case of cubic splines we obtain curves that look perfectly smooth.
Inspired by Thm. 5.5.2.8 we propose the following recipe for constructing a degree-d spline curve from
control points p0 , . . . , pn :
➋ Parameterize the curve by a linear combination of B-splines based on U_M^d :

   s(t) := ∑_{j=0}^{n} N_j^d(t) p_j ,   t ∈ [0, 1] .   (5.5.3.9)
y
EXPERIMENT 5.5.3.10 (Curve design based on B-splines) We display the spline curves induced by
the “car-shaped” set of control points using equidistant knots in [0, 1]:
(Fig. 196, Fig. 197: control polygon and resulting spline curves for the 9 control points, e.g. “Spline curve, degree 2, 9 control points”.)
We assume the period T > 0 to be known and ti ∈ [0, T [ for all interpolation nodes ti , i = 0, . . . , n.
Task: Given T > 0 and data points (ti , yi ), yi ∈ K, ti ∈ [0, T [, find a T -periodic function f : R → K (the
interpolant), f (t + T ) = f (t) ∀t ∈ R, that satisfies the interpolation conditions
f (ti ) = yi , i = 0, . . . , n . (5.6.0.2)
The terminology is natural after recalling expressions for trigonometric functions via complex exponentials (“Euler’s formula”)

   e^{ıt} = cos t + ı sin t   ⇒   cos t = ½ (e^{ıt} + e^{−ıt}) ,   sin t = (1/2ı) (e^{ıt} − e^{−ıt}) .   (5.6.1.2)
Thus we can rewrite q ∈ P_{2n}^T given in the form (5.6.1.3):

   q(t) = α_0 + ½ ∑_{j=1}^{n} { (α_j − ıβ_j) e^{2πıjt} + (α_j + ıβ_j) e^{−2πıjt} }
        = α_0 + ½ ∑_{j=−n}^{−1} (α_{−j} + ıβ_{−j}) e^{2πıjt} + ½ ∑_{j=1}^{n} (α_j − ıβ_j) e^{2πıjt}
        = e^{−2πınt} ∑_{j=0}^{2n} γ_j e^{2πıjt} ,   (5.6.1.4)

   with   γ_j = ½ (α_{n−j} + ıβ_{n−j})   for j = 0, . . . , n − 1 ,
          γ_n = α_0 ,
          γ_j = ½ (α_{j−n} − ıβ_{j−n})   for j = n + 1, . . . , 2n .
(After scaling) a trigonometric polynomial of degree 2n is a regular polynomial ∈ P2n (in C) re-
stricted to the unit circle S1 ⊂ C.
Corollary 5.6.1.6. Dimension of P_{2n}^T

The vector space P_{2n}^T has dimension dim P_{2n}^T = 2n + 1.

(Fig. 198)
All theoretical results and algorithms from polynomial interpolation carry over to trigonometric
interpolation
✦ Existence and uniqueness of trigonometric interpolation polynomial, see Thm. 5.2.2.7,
✦ Concept of Lagrange polynomials, see (5.2.2.4),
✦ the algorithms and representations discussed in Section 5.2.3.
The next code demonstrates the use of standard routines for polynomial interpolation provided by
BarycPolyInterp (→ Code 5.2.3.7) for trigonometric interpolation.
The next code finds the coefficients α j , β j ∈ R of a trigonometric interpolation polynomial in the real-valued
representation (5.6.1.3) for real-valued data y j ∈ R by simply solving the linear system of equations arising
from the interpolation conditions (5.6.0.2).
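A minimal sketch of such a routine (our own illustration, not the GITLAB listing; the name trigpolycoeff_sketch and the ordering of the unknowns are our choices):

#include <Eigen/Dense>
#include <cmath>
#include <utility>

// Sketch: coefficients alpha_j, beta_j of (5.6.1.3) through (t_k, y_k),
// k = 0,...,2n, by solving the dense (2n+1) x (2n+1) system of interpolation
// conditions with Gaussian elimination.
std::pair<Eigen::VectorXd, Eigen::VectorXd>
trigpolycoeff_sketch(const Eigen::VectorXd &t, const Eigen::VectorXd &y) {
  const Eigen::Index N = t.size(), n = (N - 1) / 2;
  const double PI = std::acos(-1.0);
  Eigen::MatrixXd A(N, N);
  for (Eigen::Index k = 0; k < N; ++k) {
    A(k, 0) = 1.0;
    for (Eigen::Index j = 1; j <= n; ++j) {
      A(k, j) = std::cos(2 * PI * j * t[k]);      // coefficient of alpha_j
      A(k, n + j) = std::sin(2 * PI * j * t[k]);  // coefficient of beta_j
    }
  }
  const Eigen::VectorXd c = A.lu().solve(y);      // Gaussian elimination, O(n^3)
  Eigen::VectorXd alpha = c.head(n + 1);
  Eigen::VectorXd beta = c.tail(n);
  return {alpha, beta};
}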
The asymptotic computational effort of this implementation is dominated by the cost for Gaussian elimina-
tion applied to a fully populated (dense) matrix, see Thm. 2.5.0.2: O(n3 ) for n → ∞.
$$q \in \mathcal{P}_{2n}^T \;\Rightarrow\; q(t) = e^{-2\pi\imath nt}\cdot p\bigl(e^{2\pi\imath t}\bigr) \quad\text{with}\quad p(z) = \sum_{j=0}^{2n}\gamma_j z^j \in \mathcal{P}_{2n}\;, \qquad (5.6.1.5)$$
to arrive at the following $(2n+1)\times(2n+1)$ linear system of equations for computing the unknown coefficients $\gamma_j$:
$$\sum_{j=0}^{2n}\gamma_j\exp\Bigl(2\pi\imath\frac{jk}{2n+1}\Bigr) = (\mathbf{b})_k := \exp\Bigl(2\pi\imath\frac{nk}{2n+1}\Bigr)\,y_k\;,\quad k = 0,\dots,2n\;. \qquad (5.6.3.2)$$
$$\mathbf{F}_{2n+1}\,\mathbf{c} = \mathbf{b}\;,\quad \mathbf{c} = [\gamma_0,\dots,\gamma_{2n}]^\top \quad\overset{\text{Lemma 4.2.1.14}}{\Longrightarrow}\quad \mathbf{c} = \frac{1}{2n+1}\,\overline{\mathbf{F}}_{2n+1}\,\mathbf{b}\;. \qquad (5.6.3.3)$$
  }
  return {alpha.real(), beta.real()};
}
[Plot of tic-toc timings: runtime [s] versus problem size; platform: MATLAB 7.10.0.499 (R2010a), Intel Core i7 CPU, 2.66 GHz, 2 cores, 256 KB L2 cache, 4 MB L3 cache]
We make the same observation as in Ex. 4.3.0.12: massive gain in efficiency through relying on FFT. y
Evaluation at equidistant points $\tfrac{k}{N}$, $N > 2n$, $k = 0,\dots,N-1$:
$$(5.6.1.4) \;\Rightarrow\; q(k/N) = e^{-2\pi\imath nk/N}\sum_{j=0}^{2n}\gamma_j\exp\Bigl(2\pi\imath\frac{kj}{N}\Bigr)\;,\quad k = 0,\dots,N-1\;,$$
$$\text{that is,}\quad q(k/N) = e^{-2\pi\imath kn/N}\,(\mathbf{v})_k \quad\text{with}\quad \mathbf{v} = \mathbf{F}_N\,\widetilde{\mathbf{c}}\;, \qquad (5.6.3.7)$$
The FFT-based implementation is realized in the following code, which accepts the coefficients α j , β j as
components of the vectors a and b, and returns q(k/N ), k = 0, . . . , N − 1, in the vector y.
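That listing is not reproduced in this excerpt. Below is a minimal sketch of such an FFT-based evaluation; the function name trigpolyvalfft, the layout of alpha ($\alpha_0,\dots,\alpha_n$) and beta ($\beta_1,\dots,\beta_n$), and the use of Eigen's unsupported FFT module are assumptions, not the original code.

#include <cassert>
#include <cmath>
#include <complex>
#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>

// Sketch: evaluate q(t) = alpha_0 + sum_j alpha_j cos(2*pi*j*t) + beta_j sin(2*pi*j*t)
// at t = k/N, k = 0,...,N-1, N > 2n, via (5.6.1.4) and (5.6.3.7).
Eigen::VectorXcd trigpolyvalfft(const Eigen::VectorXd &alpha,
                                const Eigen::VectorXd &beta, int N) {
  using namespace std::complex_literals;
  const int n = alpha.size() - 1;
  assert(N > 2 * n);
  // Coefficients gamma_j according to (5.6.1.4); beta stores beta_1,...,beta_n
  Eigen::VectorXcd gamma(2 * n + 1);
  for (int j = 0; j < n; ++j) gamma(j) = 0.5 * (alpha(n - j) + 1i * beta(n - j - 1));
  gamma(n) = alpha(0);
  for (int j = n + 1; j <= 2 * n; ++j) gamma(j) = 0.5 * (alpha(j - n) - 1i * beta(j - n - 1));
  // Zero padding to length N
  Eigen::VectorXcd ch = Eigen::VectorXcd::Zero(N);
  ch.head(2 * n + 1) = gamma;
  // v_k = sum_j gamma_j exp(+2*pi*i*k*j/N): inverse DFT; Eigen's inv() scales by 1/N
  Eigen::FFT<double> fft;
  Eigen::VectorXcd v;
  fft.inv(v, ch);
  v *= std::complex<double>(N, 0.0);
  // Apply the modulating factor exp(-2*pi*i*k*n/N), see (5.6.3.7)
  const double pi = std::acos(-1.0);
  Eigen::VectorXcd q(N);
  for (int k = 0; k < N; ++k)
    q(k) = std::exp(-2.0 * pi * 1i * (double(k) * double(n) / double(N))) * v(k);
  return q;
}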
The next code merges the steps of computing the coefficient of the trigonometric interpolation polynomial
in equidistant points and its evaluation in another set of M equidistant points.
C++ code 5.6.3.9: Equidistant points: fast on-the-fly evaluation of trigonometric interpolation polynomial ➺ GITLAB
// Evaluation of the trigonometric interpolation polynomial through (j/(2n+1), y_j),
// j = 0,...,2n, in the equidistant points k/M, k = 0,...,M-1
// IN: y = vector of values to be interpolated
// q (COMPLEX!) will be used to save the return values
void trigpolyvalequid(const VectorXd y, const int M, VectorXcd& q) {
  const int N = y.size();
  if (N % 2 == 0) {
    std::cerr << "Number of points must be odd!\n";
    return;
  }
  const int n = (N - 1) / 2;
  // computing coefficients gamma_j, see (5.6.3.3)
  VectorXcd a, b;
  trigipequid(y, a, b);
  // ... (assembly of the vector gamma from a, b not reproduced in this excerpt) ...
  // zero padding
  VectorXcd ch(M); ch << gamma, VectorXcd::Zero(M - (2 * n + 1));
  // ...
y
Review question(s) 5.6.3.10 (Trigonometric interpolation)
(Q5.6.3.10.A) You have at your disposal the function
std::pair<Eigen::VectorXd, Eigen::VectorXd>
trigpolycoeff(const Eigen::VectorXd &t, const Eigen::VectorXd &y);
that computes the coefficients $\alpha_j$, $\beta_j$ of the 1-periodic interpolating trigonometric polynomial $q \in \mathcal{P}_{2n}^T$,
$$q(t) = \alpha_0 + \sum_{j=1}^{n}\alpha_j\cos(2\pi jt) + \beta_j\sin(2\pi jt)\;,$$
for the data points (tk , yk ) ∈ R2 , k = 0, . . . , 2n, passed in the vectors t and y of equal length.
However, now we have to process data points that are obtained by sampling a T -periodic function,
T > 0 known. Describe, how you can use trigpolycoeff() to obtain a T -periodic interpolant
p : R → R.
(Q5.6.3.10.B) Let $q \in \mathcal{P}_{2n}^T$ be the 1-periodic trigonometric interpolant of $(t_k, y_k)$, $k = 0,\dots,2n$, $0 < t_k < 1$:
$$q(t) = \alpha_0 + \sum_{j=1}^{n}\alpha_j\cos(2\pi jt) + \beta_j\sin(2\pi jt)\;.$$
What are the basis expansion coefficients $\widetilde{\alpha}_j$, $\widetilde{\beta}_j$ for the trigonometric interpolant $\widetilde{q} \in \mathcal{P}_{2n}^T$ through
using only a single call of a mathematical function of the C++ standard library. The coefficients are
passed through the vectors alpha and beta, the evaluation point t in t.
△
Video tutorial for Section 5.7 "Least Squares Data Fitting": (13 minutes) Download link,
tablet notes
As remarked in Ex. 5.1.0.8, the basic assumption underlying the reconstruction of the functional dependence of two quantities by means of interpolation is that of accurate data. In the case of data uncertainty or measurement errors the exact satisfaction of interpolation conditions ceases to make sense, and we are better off reconstructing a fitting function that is merely “close to the data” in a sense to be made precise next.
The most general task of (multidimensional, vector-valued) least squares data fitting can be described
as follows:
Such a function f is called a (best) least squares fit for the data in S.
$$k = 1\,,\ d = 1:\qquad (5.7.0.2) \;\Leftrightarrow\; f \in \operatorname*{argmin}_{g\in S}\sum_{i=1}^{m}|g(t_i) - y_i|^2\;. \qquad (5.7.0.3)$$
§5.7.0.4 (Linear spaces of admissible functions) Consider a special variant of the general least squares
data fitting problem: The set S of admissible continuous functions is now chosen as a finite-dimensional
vector space Vn ⊂ C0 ( D ), dim Vn = n ∈ N, cf. the discussion in § 5.1.0.21 for interpolation.
Choose a basis of $V_n$: $V_n = \operatorname{Span}\{b_1,\dots,b_n\}$, $b_j: D \to \mathbb{R}^d$ continuous.
The best least squares fit $\mathbf{f} \in V_n$ can be represented by a finite linear combination of the basis functions $b_j$:
$$\mathbf{f}(t) = \sum_{j=1}^{n} x_j\,b_j(t)\;,\quad x_j \in \mathbb{R}\;. \qquad (5.7.0.5)$$
$$V_n = \underbrace{W\times\cdots\times W}_{d\ \text{factors}}\;,\qquad \dim V_n = d\cdot\dim W\;. \qquad (5.7.0.6)$$
is a basis of $V_n$ ($\mathbf{e}_i$ ≙ $i$-th unit vector). y
§5.7.0.8 (Linear data fitting → [DR08, Sect. 4.1]) We adopt the setting of § 5.7.0.4 of an n-
dimensional space Vn of admissible functions with basis {b1 , . . . , bn }. Then the least squares data
fitting problem can be recast as follows.
$$\mathbf{x} := [x_1,\dots,x_n]^\top := \operatorname*{argmin}_{z_j\in\mathbb{R}}\sum_{i=1}^{m}\Bigl\|\sum_{j=1}^{n} z_j\,\mathbf{b}_j(t_i) - \mathbf{y}_i\Bigr\|_2^2\;. \qquad (5.7.0.9)$$
Special cases:
• For k = 1, d = 1, data points (ti , yi ) ∈ R × R (scalar, one-dimensional setting), and Vn =
Span{b1 , . . . , bn }, we seek coefficients x j ∈ R, j = 1, . . . , n, as the components of a vector
x = [ x1 , . . . , xn ]⊤ ∈ R n satisfying
$$\mathbf{x} = \operatorname*{argmin}_{\mathbf{z}\in\mathbb{R}^n}\sum_{i=1}^{m}\Bigl|\sum_{j=1}^{n}(\mathbf{z})_j\,b_j(t_i) - y_i\Bigr|^2\;. \qquad (5.7.0.10)$$
• If Vn is a product space according to (5.7.0.6) with basis (5.7.0.7), then (5.7.0.9) amounts to finding
vectors x j ∈ R d , j = 1, . . . , ℓ, with
$$(\mathbf{x}_1,\dots,\mathbf{x}_\ell) = \operatorname*{argmin}_{\mathbf{z}_j\in\mathbb{R}^d}\sum_{i=1}^{m}\Bigl\|\sum_{j=1}^{\ell}\mathbf{z}_j\,q_j(t_i) - \mathbf{y}_i\Bigr\|_2^2\;. \qquad (5.7.0.11)$$
EXAMPLE 5.7.0.12 (Linear parameter estimation = linear data fitting → Ex. 3.0.1.4, Ex. 3.1.1.5)
The linear parameter estimation/linear regression problem presented in Ex. 3.0.1.4 can be recast as a
linear data fitting problem with
• $k = n$, $d = 1$, data points $(\mathbf{x}_i, y_i) \in \mathbb{R}^k\times\mathbb{R}$,
• a $(k+1)$-dimensional space $V_n = \{\mathbf{x}\mapsto \mathbf{a}^\top\mathbf{x} + \beta,\ \mathbf{a}\in\mathbb{R}^k,\ \beta\in\mathbb{R}\}$ of affine linear admissible functions,
• the choice of basis $\{\mathbf{x}\mapsto(\mathbf{x})_1,\dots,\mathbf{x}\mapsto(\mathbf{x})_k,\ \mathbf{x}\mapsto 1\}$.
y
§5.7.0.13 (Linear data fitting as a linear least squares problem) Linear (least squares) data fitting leads to an overdetermined linear system of equations for which we seek a least squares solution (→ Def. 3.1.1.1) as in Section 3.1.1. To see this rewrite
$$\sum_{i=1}^{m}\Bigl\|\sum_{j=1}^{n} z_j\,\mathbf{b}_j(t_i) - \mathbf{y}_i\Bigr\|_2^2 = \sum_{i=1}^{m}\sum_{r=1}^{d}\Bigl(\sum_{j=1}^{n}\bigl(\mathbf{b}_j(t_i)\bigr)_r z_j - (\mathbf{y}_i)_r\Bigr)^{\!2}\;.$$
In the one-dimensional, scalar case (d = 1) of (5.7.0.10) the related overdetermined linear system of
equations is
$$\begin{bmatrix} b_1(t_1) & \dots & b_n(t_1)\\ \vdots & & \vdots\\ b_1(t_m) & \dots & b_n(t_m)\end{bmatrix}\mathbf{x} = \begin{bmatrix} y_1\\ \vdots\\ y_m\end{bmatrix}\;. \qquad (5.7.0.16)$$
Obviously, for m = n we recover the 1D data interpolation problem from § 5.1.0.21. For d > 1 the
overdetermined linear system of equations (5.7.0.15) can be thought of as d systems of the type (5.7.0.16)
stacked on top of each other, one for every component of the data vectors.
Having reduced the linear least squares data fitting problem to finding the least squares solution of an
overdetermined linear system of equations, we can now apply theoretical results about least squares
solutions, for instance, Cor. 3.1.2.13. The key issue is, whether the coefficient matrix of (5.7.0.16) has full
rank n. Of course, this will depend on the location of the ti .
The scalar one-dimensional linear least squares fitting problem (5.7.0.10) with dim Vn = n, Vn the
vector space of admissible functions, has a unique solution, if and only if there are ti1 , . . . , tin such
that
$$\begin{bmatrix} b_1(t_{i_1}) & \dots & b_n(t_{i_1})\\ \vdots & & \vdots\\ b_1(t_{i_n}) & \dots & b_n(t_{i_n})\end{bmatrix} \in \mathbb{R}^{n,n} \quad\text{is invertible,} \qquad (5.7.0.18)$$
Equivalent to (5.7.0.18) is the requirement that there is an $n$-subset of $\{t_1,\dots,t_m\}$ such that the corresponding interpolation problem for $V_n$ has a unique solution for any data values $y_i$. y
EXAMPLE 5.7.0.19 (Polynomial fitting)
Special variant of scalar (d = 1), one-dimensional k = 1 linear data fitting (→ § 5.7.0.8): we choose the
space of admissible functions as polynomials of degree n − 1,
Vn = Pn−1 , e.g. with basis b j (t) = t j−1 (monomial basis, Section 5.2.1) .
The corresponding overdetermined linear system of equations (5.7.0.16) now reads:
$$\begin{bmatrix} 1 & t_1 & t_1^2 & \dots & t_1^{n-1}\\ 1 & t_2 & t_2^2 & \dots & t_2^{n-1}\\ \vdots & \vdots & \vdots & & \vdots\\ 1 & t_m & t_m^2 & \dots & t_m^{n-1}\end{bmatrix}\mathbf{x} = \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_m\end{bmatrix}\;, \qquad (5.7.0.20)$$
which, for $m \ge n$, has full rank, because it contains invertible Vandermonde matrices (5.2.2.11), Rem. 5.2.2.10.
The next code demonstrates the computation of the fitting polynomial with respect to the monomial basis
of Pn−1 :
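The listing is not reproduced in this excerpt. A minimal sketch of such a routine is given below; it follows the coefficient ordering stated right after, but the implementation (QR-based least squares) and the exact signature are assumptions, not the original code.

#include <Eigen/Dense>

// Sketch: least squares fit of a polynomial of degree n-1 in the monomial basis;
// coefficients are returned in the order of (5.7.0.22), i.e.
// p(t) = x_1 t^{n-1} + ... + x_{n-1} t + x_n.
Eigen::VectorXd polyfit(const Eigen::VectorXd &t, const Eigen::VectorXd &y,
                        unsigned int degree) {
  const int n = degree + 1; // number of monomial basis functions
  Eigen::MatrixXd A = Eigen::MatrixXd::Ones(t.size(), n);
  // Vandermonde-type matrix (5.7.0.20): last column = 1, first column = t^{n-1}
  for (int j = n - 2; j >= 0; --j) A.col(j) = A.col(j + 1).cwiseProduct(t);
  // Least squares solution of the overdetermined system via QR decomposition
  return A.colPivHouseholderQr().solve(y);
}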
The function polyfit returns a vector [ x1 , x2 , . . . , xn ]⊤ describing the fitting polynomial according to the
convention
$$p(t) = x_1 t^{n-1} + x_2 t^{n-2} + \cdots + x_{n-1} t + x_n\;. \qquad (5.7.0.22)$$
✦ polynomial degree d = 10,
✦ interpolation through data points $(t_j, f(t_j))$,
y
Fitting helps curb oscillations that haunt polynomial interpolation: in terms of “shape preservation” the
fitted polynomial is clearly superior to the interpolating polynomial. y
Remark 5.7.0.24 (“Overfitting”) In the previous example we saw that the degree-10 polynomial fitted to
the data in 31 points rather well reflects the “shape of the data”, see Fig. 200. If we had fit a polynomial of
degree 30 to the data instead, the result would have been a function with humongous oscillations, because
we would have recovered the interpolating polynomial at equidistant nodes, cf. Ex. 5.2.4.3.
Thus, in Exp. 5.7.0.23 we see a manifestation of a widely observed phenomenon, which plays a big role
in machine learning:
Overfitting
Fitting data with functions from a large space often produces poorer results (in terms of “shape
preservation” and “generalization error”) than the use of a smaller subspace for fitting.
§5.7.0.26 (Data fitting with regularization) Let us recall two observations made in the context of one-
dimensional interpolation of scalar data:
• High-degree global polynomial interpolants usually suffer from massive oscillations.
• The “nice” cubic-spline interpolant $s$ on $[t_0, t_n]$ could also be characterized as the $C^2$-interpolant with minimal bending energy $\int_{t_0}^{t_n}|s''(t)|^2\,dt$, see Thm. 5.4.3.3.
This motivates augmenting the fitting problem with another term depending on inner product norms of
derivatives of the fitting functions.
Concretely, given data points (ti , yi ) ∈ R2 , i = 0, . . . , m, ti ∈ [ a, b], and seeking the fitting function f in
the n-dimensional space
Vn := Span{b1 , . . . , bn } ⊂ C2 ([ a, b]) ,
we determine it as
$$f \in \operatorname*{argmin}_{g\in V_n}\Bigl\{\sum_{i=0}^{m}|g(t_i) - y_i|^2 + \alpha\int_a^b|g''(t)|^2\,dt\Bigr\}\;, \qquad (5.7.0.27)$$
with a regularization parameter $\alpha > 0$. Since $\int_a^b|g''(t)|^2\,dt$ will be large if $g$ oscillates wildly, for large $\alpha$ the extra regularization term will serve to suppress oscillations.
Plugging a basis expansion $g = \sum_{j=1}^{n} x_j b_j$ into (5.7.0.27), we arrive at the quadratic minimization problem
$$\mathbf{x} = \operatorname*{argmin}_{\mathbf{z}\in\mathbb{R}^n}\bigl\{\|\mathbf{A}\mathbf{z} - \mathbf{y}\|_2^2 + \alpha\,\mathbf{z}^\top\mathbf{M}\mathbf{z}\bigr\}\;,\quad (\mathbf{A})_{ij} = b_j(t_i)\,,\ (\mathbf{M})_{jk} = \int_a^b b_j''(t)\,b_k''(t)\,dt\;.$$
For this minimization problem we can find a linear system of equations for minimizers by setting $\operatorname{grad}\varphi(\mathbf{x}) \overset{!}{=} 0$ for $\varphi(\mathbf{x}) := \|\mathbf{A}\mathbf{x} - \mathbf{y}\|_2^2 + \alpha\,\mathbf{x}^\top\mathbf{M}\mathbf{x}$. As in § 3.6.1.4 we find the generalized normal equations
$$\bigl(\mathbf{A}^\top\mathbf{A} + \alpha\,\mathbf{M}\bigr)\mathbf{x} = \mathbf{A}^\top\mathbf{y}\;.$$
This is an n × n linear system of equations with symmetric positive semi-definite coefficient matrix. y
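A minimal sketch (not taken from the notes) of how these generalized normal equations can be solved with Eigen, assuming the matrices A (entries $b_j(t_i)$) and M (entries $\int_a^b b_j'' b_k''\,dt$) have already been assembled by the caller:

#include <Eigen/Dense>

// Sketch: solve (A^T A + alpha*M) x = A^T y; the coefficient matrix is
// symmetric positive semi-definite, so an LDL^T factorization is used.
Eigen::VectorXd regularizedFit(const Eigen::MatrixXd &A, const Eigen::MatrixXd &M,
                               const Eigen::VectorXd &y, double alpha) {
  const Eigen::MatrixXd S = A.transpose() * A + alpha * M;
  return S.ldlt().solve(A.transpose() * y);
}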
Review question(s) 5.7.0.31 (Least squares data fitting)
(Q5.7.0.31.A) Given data points $(t_i, y_i) \in \mathbb{R}^2$, $i = 0,\dots,m$, $t_i \in I \subset \mathbb{R}$, let $\{b_1,\dots,b_n\} \subset C^0(I)$ be a basis of the space $V$ of admissible fitting functions. A least squares fit can be computed as
$$f = \sum_{j=1}^{n}(\mathbf{x})_j\,b_j \quad\text{where}\quad \mathbf{x} \in \operatorname*{argmin}_{\mathbf{z}\in\mathbb{R}^n}\sum_{i=0}^{m}\Bigl|\sum_{j=1}^{n}(\mathbf{z})_j\,b_j(t_i) - y_i\Bigr|^2\;.$$
where the knot set M is given by the node set: M := {t0 < t1 < · · · < tn }. Using the “cardinal basis”
of S1,M comprised of “tent functions”, derive the linear system that has to be solved to find f .
△
Learning outcomes
After you have studied this chapter you should
• understand the use of basis functions for representing functions on a computer,
• know the concept of an interpolation operator and what its linearity means,
• know the connection between linear interpolation operators and linear systems of equations,
• be familiar with efficient algorithms for polynomial interpolation in different settings,
• know the meaning and significance of “sensitivity” in the context of interpolation,
• be familiar with the notions of “shape preservation” for an interpolation scheme and its different aspects (monotonicity, curvature),
• know the details of cubic Hermite interpolation and how to ensure that it is monotonicity preserving.
• know what splines are and how cubic spline interpolation with different endpoint constraints works.
Bibliography
[Aki70] H. Akima. “A New Method of Interpolation and Smooth Curve Fitting Based on Local Proce-
dures”. In: J. ACM 17.4 (1970), pp. 589–602.
[AG11] Uri M. Ascher and Chen Greif. A first course in numerical methods. Vol. 7. Computational
Science & Engineering. Society for Industrial and Applied Mathematics (SIAM), Philadelphia,
PA, 2011, pp. xxii+552. DOI: 10.1137/1.9780898719987 (cit. on p. 389).
[Bei+16] Lourenço Beirão da Veiga, Annalisa Buffa, Giancarlo Sangalli, and Rafael Vázquez. “An intro-
duction to the numerical analysis of isogeometric methods”. In: Numerical simulation in physics
and engineering. Vol. 9. SEMA SIMAI Springer Ser. Springer, [Cham], 2016, pp. 3–69.
[BT04] Jean-Paul Berrut and Lloyd N. Trefethen. “Barycentric Lagrange interpolation”. In: SIAM Rev.
46.3 (2004), 501–517 (electronic). DOI: 10.1137/S0036144502417715.
[CR92] D. Coppersmith and T.J. Rivlin. “The growth of polynomials bounded at equally spaced points”.
In: SIAM J. Math. Anal. 23.4 (1992), pp. 970–983 (cit. on p. 412).
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 388–391, 396, 403, 407–409, 425, 451, 460).
[FC80] F.N. Fritsch and R.E. Carlson. “Monotone Piecewise Cubic Interpolation”. In: SIAM J. Numer.
Anal. 17.2 (1980), pp. 238–246 (cit. on p. 423).
[Gou05] E. Gourgoulhon. An introduction to polynomial interpolation. Slides for School on spectral
methods. 2005.
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 426, 430).
[Kva14] Boris I. Kvasov. “Monotone and convex interpolation by weighted quadratic splines”. In: Adv.
Comput. Math. 40.1 (2014), pp. 91–116. DOI: 10.1007/s10444-013-9300-9.
[MR81] D. McAllister and J. Roulier. “An algorithm for computing a shape-preserving osculatory
quadratic spline”. In: ACM Trans. Math. Software 7.3 (1981), pp. 331–347 (cit. on pp. 436,
439).
[Moh15] A. Mohsen. “Correcting the function derivative estimation using Lagrangian interpolation”. In: ZAMP (2015).
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
p. 387).
[QQY01] A.L. Dontchev, H.-D. Qi, L.-Q. Qi, and H.-X. Yin. “Convergence of Newton’s method for
convex best interpolation”. In: Numer. Math. 87.3 (2001), pp. 435–456 (cit. on p. 436).
[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Math-
ematics. New York: Springer, 2000 (cit. on pp. 389–392, 403, 409, 425, 426, 431).
[Sch83] Larry L. Schumaker. “On shape preserving quadratic spline interpolation”. In: SIAM J. Numer.
Anal. 20.4 (1983), pp. 854–864. DOI: 10.1137/0720057 (cit. on p. 435).
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on pp. 414, 415).
[Tre13] Lloyd N. Trefethen. Approximation theory and approximation practice. Society for Industrial
and Applied Mathematics (SIAM), Philadelphia, PA, 2013, viii+305 pp.+back matter (cit. on
p. 394).
[Tre14] N. Trefethen. Six Myths of Polynomial Interpolation and Quadrature. Slides. 2014.
[WX12] Haiyong Wang and Shuhuang Xiang. “On the convergence rates of Legendre approximation”. In: Math. Comp. 81.278 (2012), pp. 861–877. DOI: 10.1090/S0025-5718-2011-02549-4.
Chapter 6
Approximation of Functions in 1D
6.1 Introduction
Video tutorial for Section 6.1 "Approximation of Functions in 1D: Introduction": (7 minutes)
Download link, tablet notes
In Chapter 5 we aimed to fill the gaps between given data points by constructing a function connecting
them. In this chapter the function is available already, at least in principle, but its evaluation is too costly,
which forces us to replace it with a “cheaper” or “simpler” function, sometimes called a surrogate function.
§6.1.0.1 (General function approximation problem)
The function $\tilde f$ can be encoded by a small amount of information and is easy to evaluate. For instance, this is the case for polynomial or piecewise polynomial $\tilde f$.
$f - \tilde f$ is small for some norm $\|\cdot\|$ on the space $C^0(D)$ of (piecewise) continuous functions.
Below we consider only the case n = d = 1: approximation of scalar valued functions defined on an
interval. The techniques can be applied componentwise in order to cope with the case of vector-valued functions ($d > 1$). y
A faster alternative is the advance approximation of the function $U \mapsto I(U)$ based on a few computed values $I(U_i)$, $i = 0,\dots,n$, followed by the fast evaluation of the approximant $U \mapsto \tilde I(U)$ during actual circuit simulations. This is an example of model reduction by approximation of functions: a complex subsystem in a mathematical model is replaced by a surrogate function.
In this example we also encounter a typical situation: we have nothing at our disposal but, possibly expensive, point evaluations of the function $U \mapsto I(U)$ ($U \mapsto I(U)$ in “procedural form”, see Rem. 5.1.0.9). The number of evaluations of $I(U)$ will largely determine the cost of building $\tilde I$.
This application displays a fundamental difference compared to the reconstruction of constitutive relationships from a priori measurements → Ex. 5.1.0.8: Now we are free to choose the number and location of the data points, because we can simply evaluate the function $U \mapsto I(U)$ for any $U$ and as often as needed.
C++ code 6.1.0.4: Class describing a 2-port circuit element for circuit simulation
class CircuitElement {
 private:
  // internal data describing U ↦ Ĩ(U)
 public:
  // Constructor taking some parameters and building Ĩ
  CircuitElement(const Parameters &P);
  // Point evaluation operators for Ĩ and dĨ/dU
  double I(double U) const;
  double dIdU(double U) const;
};
§6.1.0.5 (Approximation schemes) We define an abstract concept for the sake of clarity: When in this chapter we talk about an “approximation scheme” (in 1D) we refer to a mapping $\mathsf{A}: X \to V$, where $X$ and $V$ are spaces of functions $I \to \mathbb{K}$, $I \subset \mathbb{R}$ an interval.
Examples are
• $X = C^k(I)$, the space of functions $I \to \mathbb{K}$ that are $k$ times continuously differentiable, $k \in \mathbb{N}$,
• $V = \mathcal{P}_m(I)$, the space of polynomials of degree $\le m$, see Section 5.2.1,
• $V = \mathcal{S}_{d,\mathcal{M}}$, the space of splines of degree $d$ on the knot set $\mathcal{M} \subset I$, see Def. 5.4.1.1,
• $V = \mathcal{P}_{2n}^T$, the space of trigonometric polynomials of degree $2n$, see Def. 5.6.1.1.
y
$$f: I \subset \mathbb{R} \to \mathbb{K} \;\xrightarrow{\ \text{sampling}\ }\; \bigl(t_i,\ y_i := f(t_i)\bigr)_{i=0}^{m} \;\xrightarrow{\ \text{interpolation}\ }\; \tilde f := \mathrm{I}_{\mathcal{T}}\mathbf{y}\quad (\tilde f(t_i) = y_i)\;.$$
In this chapter we will mainly study approximation by interpolation relying on the interpolation schemes
(→ § 5.1.0.7) introduced in Section 5.2, Section 5.3.3, and Section 5.4.
There is additional freedom compared to data interpolation: we can choose the interpolation nodes in
a smart way in order to obtain an accurate interpolant $\tilde f$.
y
Remark 6.1.0.8 (Interpolation and approximation: enabling technologies) Approximation and interpo-
lation (→ Chapter 5) are key components of many numerical methods, like for integration, differentiation
and computation of the solutions of differential equations, as well as for computer graphics and generation
of smooth curves and surfaces.
$f \not\equiv 0 \;\Rightarrow\; \mathsf{A}(f) \not\equiv 0\,$?
Video tutorial for Section 6.2 "Polynomial Approximation: Theory": (13 minutes)
Download link, tablet notes
The space Pk of polynomials of degree ≤ k has been introduced in Section 5.2.1. For reasons listed
in § 5.2.1.3 polynomials are the most important theoretical and practical tool for the approximation of
functions. The next example presents an important case of approximation by polynomials.
EXAMPLE 6.2.0.1 (Taylor approximation → [Str09, Sect. 5.5]) The local approximation of sufficiently
smooth functions by polynomials is a key idea in calculus, which manifests itself in the importance of
approximation by Taylor polynomials: For f ∈ C k ( I ), k ∈ N, I ⊂ R an interval, we approximate
$$f(t) \approx f(t_0) + f'(t_0)(t-t_0) + \frac{f^{(2)}(t_0)}{2}(t-t_0)^2 + \cdots + \frac{f^{(k)}(t_0)}{k!}(t-t_0)^k =: T_k(t) \in \mathcal{P}_k\;,\quad\text{for some } t_0 \in I\;.$$
✎ Notation: f (k) =
ˆ k-th derivative of function f : I ⊂ R → K
$$f(t) - T_k(t) = \int_{t_0}^{t} f^{(k+1)}(\tau)\,\frac{(t-\tau)^k}{k!}\,d\tau \qquad (6.2.0.2a)$$
$$\phantom{f(t) - T_k(t)} = f^{(k+1)}(\xi)\,\frac{(t-t_0)^{k+1}}{(k+1)!}\;,\quad \xi = \xi(t,t_0) \in\, ]\min(t,t_0),\max(t,t_0)[\;, \qquad (6.2.0.2b)$$
which shows that for f ∈ C k+1 ( I ) the Taylor polynomial Tk is pointwise close to f ∈ C k+1 ( I ), if the
interval I is small and f (k+1) is bounded pointwise.
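A small illustration (added here, not taken from the text): for $f(t) = e^t$, $t_0 = 0$ and $I = [0, \tfrac12]$, (6.2.0.2b) gives
$$|e^t - T_k(t)| = \frac{t^{k+1}}{(k+1)!}\,e^{\xi} \le \frac{(1/2)^{k+1}}{(k+1)!}\,e^{1/2}\;,\quad t \in [0,\tfrac12]\;,$$
so already $k = 4$ yields a pointwise error below $5\cdot 10^{-4}$.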
Approximation by Taylor polynomials is easy and direct but inefficient: a polynomial of lower degree often
gives the same accuracy. Moreover, when f is available only in procedural form as double f(double),
(approximations of) higher order derivatives are difficult to obtain. y
§6.2.0.3 (Nested approximation spaces of polynomials) Obviously, for every interval I ⊂ R, the
spaces of polynomials are nested in the following sense:
P0 ⊂ P1 ⊂ · · · ⊂ P m ⊂ P m +1 ⊂ · · · ⊂ C ∞ ( I ) , (6.2.0.4)
With this family of nested spaces of polynomials at our disposal, it is natural to study associated families
of approximation schemes, one for each degree, mapping into Pm , m ∈ N0 . y
The question is, whether polynomials still offer uniform approximation on arbitrary bounded closed inter-
vals and for functions that are merely continuous, but not any smoother. The answer is YES and this
profound result is known as the Weierstrass Approximation Theorem. Here we give an extended version
with a concrete formula due to Bernstein, see [Dav75, Section 6.2].
✎ Notation: g(k) =
ˆ k-th derivative of a function g : I ⊂ R → K.
[Fig. 202: plots of the Bernstein polynomials $B_j^n$, $j = 0,\dots,7$, of degree $n = 7$ on $[0,1]$]
[Fig. 203: Bernstein polynomials $B_j^n$ for increasing degrees (deg = 2, 5, 8, …, 29) with $j/n$ roughly fixed]
Since $\frac{d}{dt}B_j^n(t) = B_j^n(t)\bigl(\frac{j}{t} - \frac{n-j}{1-t}\bigr)$, $B_j^n$ has its unique local maximum in $[0,1]$ at the site $t_{\max} := \frac{j}{n}$. As $n \to \infty$ the Bernstein polynomials become more and more concentrated around the maximum.
Proof. (of Thm. 6.2.1.2, first part) Fix $t \in [0,1]$. Using the notations from (6.2.1.3) and the identity (6.2.1.5) we find
$$f(t) - p_n(t) = \sum_{j=0}^{n}\bigl(f(t) - f(j/n)\bigr)\,B_j^n(t)\;. \qquad (6.2.1.7)$$
As we see from Fig. 203, for large $n$ the bulk of the sum will be contributed by Bernstein polynomials with index $j/n \approx t$, because for every $\delta > 0$
$$\sum_{|j/n-t|>\delta} B_j^n(t) \;\le\; \frac{1}{\delta^2}\sum_{|j/n-t|>\delta}(j/n-t)^2 B_j^n(t) \;\le\; \frac{1}{\delta^2}\sum_{j=0}^{n}(j/n-t)^2 B_j^n(t) \;\overset{(*)}{=}\; \frac{n\,t(1-t)}{\delta^2 n^2} \;\le\; \frac{1}{4n\delta^2}\;.$$
Here $\sum_{|j/n-t|>\delta}$ means summation over $j \in \mathbb{N}_0$ with summation indices confined to the set $\{j : |j/n - t| > \delta\}$.
The identity (∗) can be established by direct but tedious computations [Dav75, Formulas (6.2.8)].
Combining this estimate with (6.2.1.6) and (6.2.1.7) we arrive at
$$|f(t) - p_n(t)| \;\le \sum_{|j/n-t|>\delta}|f(t) - f(j/n)|\,B_j^n(t) \;+ \sum_{|j/n-t|\le\delta}|f(t) - f(j/n)|\,B_j^n(t)\;,$$
in which the first sum is controlled by the factor $\frac{1}{4n\delta^2}$ obtained above.
Since $f$ is uniformly continuous on $[0,1]$, given $\epsilon > 0$ we can choose $\delta > 0$ independently of $t$ such that $|f(s) - f(t)| < \epsilon$ if $|s - t| < \delta$. Then, if we choose $n > (\epsilon\delta^2)^{-1}$, we can bound
EXPERIMENT 6.2.1.8 (Bernstein approximants) We compute and plot pn , n = 1, . . . , 25, for two
functions
$$f_1(t) := \begin{cases} 0 & \text{if } |2t-1| > \tfrac12\;,\\ \tfrac12\bigl(1 + \cos(2\pi(2t-1))\bigr) & \text{else,}\end{cases} \qquad f_2(t) := \frac{1}{1 + e^{-12(t-1/2)}}\;.$$
The following plots display the sequences of the polynomials pn for n = 2, . . . , 25.
[Fig. 204: Bernstein approximants $p_n$, $n = 2,\dots,25$, for $f = f_1$;  Fig. 205: Bernstein approximants $p_n$, $n = 2,\dots,25$, for $f = f_2$]
We see that the Bernstein approximants “slowly” edge closer and closer to f . Apparently it takes a very
large degree to get really close to f . y
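A minimal sketch (not the code used for this experiment) of how a Bernstein approximant $p_n(t) = \sum_{j=0}^{n} f(j/n)\,B_j^n(t)$ can be evaluated at a single point; the in-place recursion for the Bernstein polynomials is one possible choice.

#include <functional>
#include <Eigen/Dense>

// Sketch: evaluate p_n(t) = sum_{j=0}^{n} f(j/n) B_j^n(t), B_j^n(t) = C(n,j) t^j (1-t)^{n-j},
// using the stable recursion B_j^k = (1-t) B_j^{k-1} + t B_{j-1}^{k-1}.
double bernsteinApprox(const std::function<double(double)> &f, unsigned int n,
                       double t) {
  Eigen::VectorXd B = Eigen::VectorXd::Zero(n + 1);
  B(0) = 1.0; // B_0^0 = 1
  for (unsigned int k = 1; k <= n; ++k)  // raise the degree step by step
    for (int j = k; j >= 0; --j)         // update in place, back to front
      B(j) = (1.0 - t) * B(j) + t * (j > 0 ? B(j - 1) : 0.0);
  double p = 0.0;
  for (unsigned int j = 0; j <= n; ++j)
    p += f(static_cast<double>(j) / n) * B(j);
  return p;
}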
§6.2.1.9 (Best approximation) Now we introduce a concept needed to gauge how close an approximation
scheme gets to the best possible performance.
$$\operatorname{dist}_{\|\cdot\|}(f, \mathcal{P}_k) := \inf_{p\in\mathcal{P}_k}\|f - p\|\;.$$
The notation $\operatorname{dist}_{\|\cdot\|}$ is motivated by the notion of “distance” as distance to the nearest point in a set.
For the $L^2$-norm $\|\cdot\|_2$ and the supremum norm $\|\cdot\|_\infty$ the best approximation error is well defined for $C = C^0(I)$.
The polynomial realizing best approximation w.r.t. k·k may neither be unique nor computable with reason-
able effort. Often one is content with rather sharp upper bounds like those asserted in the next theorem,
due to Jackson [Dav75, Thm. 13.3.7].
If $f \in C^r([-1,1])$ ($r$ times continuously differentiable), $r \in \mathbb{N}$, then, for any polynomial degree $n \ge r$,
$$\inf_{p\in\mathcal{P}_n}\|f - p\|_{L^\infty([-1,1])} \le \bigl(1 + \tfrac{\pi^2}{2}\bigr)^r\,\frac{(n-r)!}{n!}\,\bigl\|f^{(r)}\bigr\|_{L^\infty([-1,1])}\;,$$
As above, f (r) stands for the r-th derivative of f . Using Stirling’s formula
$$\sqrt{2\pi}\;n^{n+1/2}e^{-n} \le n! \le e\;n^{n+1/2}e^{-n}\quad\forall n\in\mathbb{N}\;, \qquad (6.2.1.12)$$
with C (r ) dependent on r, but independent of f and, in particular, the polynomial degree n. Using the
Landau symbol from Def. 1.4.1.2 we can rewrite the statement of (6.2.1.13) in asymptotic form
Assume that an interval $[a,b] \subset \mathbb{R}$, $a < b$, and a polynomial approximation scheme $\widehat{\mathsf{A}}: C^0([-1,1]) \to \mathcal{P}_n$ are given. Based on the affine linear mapping
$$\hat t \;\mapsto\; t := \Phi(\hat t) := \tfrac12(1 - \hat t)\,a + \tfrac12(\hat t + 1)\,b\;, \qquad (6.2.1.15)$$
[Fig. 206: $\Phi$ maps the reference interval $[-1,1]$ ($\hat t$-axis) onto $[a,b]$ ($t$-axis)]
we define the affine pullback
$$(\Phi^* f)(\hat t) := f\bigl(\Phi(\hat t)\bigr)\;,\quad \hat t \in [-1,1]\;. \qquad (6.2.1.16)$$
We add the important observations that affine pullbacks are linear and bijective, they are isomorphisms of
the involved vector spaces of functions (what is the inverse?).
If Φ∗ : C0 ([ a, b]) → C0 ([−1, 1]) is an affine pullback according to (6.2.1.15) and (6.2.1.16), then
Φ∗ : Pn → Pn is a bijective linear mapping for any n ∈ N0 .
Proof. This is a consequence of the fact that translations and dilations take polynomials to polynomials of
the same degree: for monomials we find
The lemma tells us that the spaces of polynomials of some maximal degree are invariant under affine
pullback. Thus, we can define a polynomial approximation scheme A on C0 ([ a, b]) by
$$\mathsf{A}: C^0([a,b]) \to \mathcal{P}_n\;,\qquad \mathsf{A} := (\Phi^*)^{-1}\circ\widehat{\mathsf{A}}\circ\Phi^*\;, \qquad (6.2.1.18)$$
§6.2.1.19 (Transforming approximation error estimates) Thm. 6.2.1.11 targets only the special interval
[−1, 1]. What does it imply for polynomial best approximation on a general interval [ a, b]? To answer this
question we apply techniques from § 6.2.1.14, in particular the pullback (6.2.1.16).
We first have to study the change of norms of functions under the action of affine pullbacks:
Proof. The first estimate should be evident, and the second is a consequence of the transformation
formula for integrals [Str09, Satz 6.1.5]
$$\int_a^b f(t)\,dt = \frac{b-a}{2}\int_{-1}^{1}(\Phi^* f)(\hat t)\,d\hat t\;, \qquad (6.2.1.22)$$
Thus, for norms of the approximation errors of polynomial approximation schemes defined by affine transformation (6.2.1.18) we get
$$\|f - \mathsf{A}f\|_{L^\infty([a,b])} = \bigl\|\Phi^*f - \widehat{\mathsf{A}}(\Phi^*f)\bigr\|_{L^\infty([-1,1])}\;,\qquad \|f - \mathsf{A}f\|_{L^2([a,b])} = \sqrt{\tfrac{|b-a|}{2}}\,\bigl\|\Phi^*f - \widehat{\mathsf{A}}(\Phi^*f)\bigr\|_{L^2([-1,1])}\;,\quad\forall f \in C^0([a,b])\;. \qquad (6.2.1.23)$$
Equipped with approximation error estimates for $\widehat{\mathsf{A}}$, we can infer corresponding estimates for $\mathsf{A}$.
The bounds for approximation errors often involve norms of derivatives as in Thm. 6.2.1.11. Hence, it is
important to understand the interplay of pullback and differentiation: By the 1D chain rule
$$\frac{d}{d\hat t}(\Phi^* f)(\hat t) = \frac{df}{dt}\bigl(\Phi(\hat t)\bigr)\cdot\frac{d\Phi}{d\hat t}(\hat t) = \frac{df}{dt}\bigl(\Phi(\hat t)\bigr)\cdot\tfrac12(b-a)\;,$$
which implies a simple scaling rule for derivatives of arbitrary order $r \in \mathbb{N}_0$:
$$(\Phi^* f)^{(r)} = \Bigl(\frac{b-a}{2}\Bigr)^{r}\,\Phi^*\bigl(f^{(r)}\bigr)\;. \qquad (6.2.1.24)$$
Together with Lemma 6.2.1.20 this yields
$$\bigl\|(\Phi^* f)^{(r)}\bigr\|_{L^\infty([-1,1])} = \Bigl(\frac{b-a}{2}\Bigr)^{r}\,\bigl\|f^{(r)}\bigr\|_{L^\infty([a,b])}\;,\quad f \in C^r([a,b])\,,\ r \in \mathbb{N}_0\;. \qquad (6.2.1.25)$$
§6.2.1.26 (Polynomial best approximation on general intervals) The estimate (6.2.1.24) together with
Thm. 6.2.1.11 paves the way for bounding the polynomial best approximation error on arbitrary intervals
[ a, b], a, b ∈ R. Based on the affine mapping Φ : [−1, 1] → [ a, b] from (6.2.1.15) and writing Φ∗ for the
pullback according to (6.2.1.16) we can chain estimates. If f ∈ Cr ([ a, b]) and n ≥ r, then
$$\inf_{p\in\mathcal{P}_n}\|f - p\|_{L^\infty([a,b])} \overset{(*)}{=} \inf_{p\in\mathcal{P}_n}\|\Phi^*f - p\|_{L^\infty([-1,1])} \overset{\text{Thm. 6.2.1.11}}{\le} \bigl(1 + \tfrac{\pi^2}{2}\bigr)^r\,\frac{(n-r)!}{n!}\,\bigl\|(\Phi^*f)^{(r)}\bigr\|_{L^\infty([-1,1])} \overset{(6.2.1.24)}{=} \bigl(1 + \tfrac{\pi^2}{2}\bigr)^r\,\frac{(n-r)!}{n!}\,\Bigl(\frac{b-a}{2}\Bigr)^{r}\,\bigl\|f^{(r)}\bigr\|_{L^\infty([a,b])}\;.$$
In step (∗) we used the result of Lemma 6.2.1.17 that Φ∗ p ∈ Pn for all p ∈ Pn . Invoking the arguments
that gave us (6.2.1.13), we end up with the simpler bound
$$f \in C^r([a,b]) \;\Rightarrow\; \inf_{p\in\mathcal{P}_n}\|f - p\|_{L^\infty([a,b])} \le C(r)\,\Bigl(\frac{b-a}{n}\Bigr)^{r}\,\bigl\|f^{(r)}\bigr\|_{L^\infty([a,b])}\;. \qquad (6.2.1.27)$$
Observe that the length of the interval enters the bound in r-th power. y
Review question(s) 6.2.1.28 (Polynomial Approximation: Theory)
(Q6.2.1.28.A) What does Jackson’s theorem Thm. 6.2.1.11 tell us about the polynomial best approxima-
tion error
If $f \in C^r([-1,1])$ ($r$ times continuously differentiable), $r \in \mathbb{N}$, then, for any polynomial degree $n \ge r$,
$$\inf_{p\in\mathcal{P}_n}\|f - p\|_{L^\infty([-1,1])} \le \bigl(1 + \tfrac{\pi^2}{2}\bigr)^r\,\frac{(n-r)!}{n!}\,\bigl\|f^{(r)}\bigr\|_{L^\infty([-1,1])}\;,$$
Video tutorial for Section 6.2.2 "Error Estimates for Polynomial Interpolation": (12 minutes)
Download link, tablet notes
We start with an abstract discussion of “convergence of errors/error norms” in order to create awareness
of what behaviors or phenomena we should be looking for and how we can detect them. You may read
Section 6.2.2.2 first, if you prefer to see concrete cases and examples before adopting a higher-level
perspective.
An example for such a family of node sets on $I := [a,b]$ are the equidistant or equispaced nodes
$$\mathcal{T}_n := \Bigl\{t_j^{(n)} := a + \frac{j}{n}(b-a)\ :\ j = 0,\dots,n\Bigr\} \subset I\;. \qquad (6.2.2.3)$$
For families of Lagrange interpolation schemes $\{\mathrm{L}_{\mathcal{T}_n}\}_{n\in\mathbb{N}_0}$ we can shift the focus onto estimating the asymptotic behavior of the norm of the interpolation error for $n \to \infty$. y
In the numerical experiment the norms of the interpolation errors can be computed only approximately as follows.
• $L^\infty$-norm: approximated by sampling on a grid of meshsize $\pi/1000$.
• $L^2$-norm: numerical quadrature (→ Chapter 7) with trapezoidal rule (7.5.0.4) on a grid of meshsize $\pi/1000$.
[Fig. 207: approximate error norms $\|f - \mathrm{L}_{\mathcal{T}_n}f\|_\infty$ and $\|f - \mathrm{L}_{\mathcal{T}_n}f\|_2$ versus polynomial degree $n$, semi-logarithmic plot]
§6.2.2.5 (Classification of asymptotic behavior of norms of the interpolation error) In the previous
experiment we observed a clearly visible regular behavior of k f − LTn f k as we increased the polyno-
mial degree n. The prediction of the decay law for k f − LTn f k for n → ∞ is one goal in the study of
interpolation errors.
Often this goal can be achieved, even if a rigorous quantitative bound for a norm of the interpolation error
remains elusive. In other words, in many cases
No quantitative bound for $\|f - \mathrm{L}_{\mathcal{T}_n}f\|$ can usually be given, but the decay of this norm of the interpolation error for increasing $n$ can often be described precisely.
Now we introduce some important terminology for the qualitative description of the behavior of k f − LTn f k
as a function of the polynomial degree n. We assume that
Writing T (n) for the bound of the norm of the interpolation error according to (6.2.2.6) we distinguish
the following types of asymptotic behavior :
The bounds are assumed to be sharp in the sense, that no bounds with larger rate p (for algebraic
convergence) or smaller q (for exponential convergence) can be found.
Convergence behavior of norms of the interpolation error is often expressed by means of the Landau-O-
notation, cf. Def. 1.4.1.2:
Algebraic convergence: $\|f - \mathrm{I}_{\mathcal{T}}f\| = O(n^{-p})$,  Exponential convergence: $\|f - \mathrm{I}_{\mathcal{T}}f\| = O(q^n)$,  for $n \to \infty$ (“asymptotic!”).
y
Remark 6.2.2.8 (Different meanings of “convergence”) Unfortunately, as in many other fields of math-
ematics and beyond, also in numerical analysis the meaning of terms is context-dependent:
§6.2.2.9 (Determining the type of convergence in numerical experiments → § 1.4.1.6) Given pairs $(n_i, \epsilon_i)$, $i = 1,2,3,\dots$, $n_i$ ≙ polynomial degrees, $\epsilon_i$ ≙ (measured) norms of interpolation errors, how can we tease out the likely type of convergence according to Def. 6.2.2.7? A similar task was already encountered in § 1.4.1.6, where we had to extract information about asymptotic complexity from runtime measurements. We assume that there is a $C \ne C(n)$ such that $\epsilon_i \approx C\,T(n_i)$ for all $i$.
➊ Conjectured algebraic convergence: $\epsilon_i \approx C\,n_i^{-p}$. The slope of a line approximating the points $(\log n_i, \log\epsilon_i)$ predicts the rate of algebraic convergence: apply linear regression as explained in Ex. 3.1.1.5 to the data points $(\log n_i, \log\epsilon_i)$ ➣ least squares estimate for the rate $p$; a sketch of this regression step is given after this paragraph.
➋ Conjectured exponential convergence: $\epsilon_i \approx C\,q^{n_i}$, i.e. $\log\epsilon_i \approx \log C - \beta\,n_i$ with $q = \exp(-\beta)$: apply linear regression (→ Ex. 3.1.1.5) to the points $(n_i, \log\epsilon_i)$ ➣ estimate for $q := \exp(-\beta)$.
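A minimal sketch (not from the notes; names and data layout are assumptions) of the regression step for conjectured algebraic convergence:

#include <cmath>
#include <vector>
#include <Eigen/Dense>

// Sketch: estimate the rate p of eps_i ≈ C * n_i^{-p} by least squares fitting
// a line to the points (log n_i, log eps_i), cf. Ex. 3.1.1.5.
double fitAlgebraicRate(const std::vector<double> &n, const std::vector<double> &eps) {
  const int m = n.size();
  Eigen::MatrixXd A(m, 2);
  Eigen::VectorXd b(m);
  for (int i = 0; i < m; ++i) {
    A(i, 0) = 1.0;            // offset log C
    A(i, 1) = std::log(n[i]); // slope = -p
    b(i) = std::log(eps[i]);
  }
  const Eigen::VectorXd c = A.colPivHouseholderQr().solve(b); // least squares fit
  return -c(1); // estimated algebraic rate p
}

For exponential convergence one would regress $\log\epsilon_i$ on $n_i$ instead and return $q := \exp(-\beta)$ from the fitted slope $-\beta$.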
EXAMPLE 6.2.2.11 (Runge’s example → Ex. 5.2.4.3) We examine the polynomial interpolant of $f(t) = \frac{1}{1+t^2}$ for equispaced nodes:
$$\mathcal{T}_n := \Bigl\{t_j := -5 + \frac{10}{n}\,j\Bigr\}_{j=0}^{n} \quad➣\quad y_j = \frac{1}{1+t_j^2}\;.$$
We rely on an approximate computation of the supremum norm of the interpolation error by means of
sampling as in Exp. 6.2.2.4; here we used 1000 equidistant sampling points, see Code 6.2.2.12.
C++ code 6.2.2.12: Computing the interpolation error for Runge’s example ➺ GITLAB
// Note: “quick & dirty” implementation!
// Lambda function representing x ↦ (1 + x^2)^{-1}
auto f = [](double x) { return 1. / (1 + x * x); };
// 1000 sampling points for approximate maximum norm
const VectorXd x = VectorXd::LinSpaced(1000, -5, 5);
// Sample function
const VectorXd fx = x.unaryExpr(f); // evaluate f at x
// (loop over the polynomial degrees d; loop header not reproduced in this excerpt)
  // Interpolation nodes
  const VectorXd t = Eigen::VectorXd::LinSpaced(d + 1, -5, 5);
  // Interpolation data values
  const VectorXd ft = feval(f, t);
  // Compute interpolating polynomial in monomial representation
  const VectorXd p = polyfit(t, ft, d);
  // Evaluate polynomial interpolant in sampling points
  const VectorXd y = polyval(p, x);
  // Approximate supremum norm of interpolation error
  err.push_back((y - fx).cwiseAbs().maxCoeff());
}
Here, polyfit() computes the monomial coefficients of a polynomial interpolant, while polyval()
uses the vectorized Horner scheme of Code 5.2.1.7 to evaluate the polynomial in given points. The names
of the functions are borrowed from P YTHON, see numpy.poly1d.
[Fig. 208, Fig. 209: $f(t) = 1/(1+t^2)$ together with its interpolating polynomial in equidistant nodes on $[-5,5]$]
Observation: Strong oscillations of $\mathrm{I}_{\mathcal{T}}f$ near the endpoints of the interval, which seem to cause
$$\|f - \mathrm{L}_{\mathcal{T}}f\|_{L^\infty(]-5,5[)} \xrightarrow{\ n\to\infty\ } \infty\;.$$
Though polynomials possess great power to approximate functions, see Thm. 6.2.1.11 and Thm. 6.2.1.2,
here polynomial interpolants fail completely. Approximation theorists even discovered the following “nega-
tive result”:
y
Review question(s) 6.2.2.14 (Convergence of interpolation errors)
(Q6.2.2.14.A) Assume that the interpolation error for some family (LTn )n∈N , ♯Tn = n + 1, of Lagrangian
polynomial interpolation schemes and some function f ∈ C0 ( I ), I ⊂ R, converges algebraically ac-
cording to
How do you have to raise the polynomial degree n in order to reduce the maximum norm of the interpo-
lation error approximately by a factor of 2?
Writing T (n) for the bound of the norm of the interpolation error according to (6.2.2.6) we distin-
guish the following types of asymptotic behavior :
The bounds are assumed to be sharp in the sense, that no bounds with larger rate p (for algebraic
convergence) or smaller q (for exponential convergence) can be found.
(Q6.2.2.14.B) Let f ∈ C0 ( I ), I ⊂ R, be given along with a family (LTn )n∈N , ♯Tn = n + 1, of Lagrangian
polynomial interpolation schemes. You know that the maximum norm of the interpolation error for f
enjoys exponential convergence of the form
How do you have to increase n in order to halve the maximum norm of the interpolation error?
(Q6.2.2.14.C) Discuss the statement:
Exponential convergence is always faster than algebraic convergence.
△
Video tutorial for Section 6.2.2.2 "Error Estimates for Polynomial Interpolation: Interpolands
of Finite Smoothness": (17 minutes) Download link, tablet notes
Now we aim to establish bounds for the supremum norm of the interpolation error of Lagrangian interpo-
lation similar to the result of Jackson’s best approximation theorem.
If $f \in C^r([-1,1])$ ($r$ times continuously differentiable), $r \in \mathbb{N}$, then, for any polynomial degree $n \ge r$,
$$\inf_{p\in\mathcal{P}_n}\|f - p\|_{L^\infty([-1,1])} \le \bigl(1 + \tfrac{\pi^2}{2}\bigr)^r\,\frac{(n-r)!}{n!}\,\bigl\|f^{(r)}\bigr\|_{L^\infty([-1,1])}\;,$$
It states a result for at least continuously differentiable functions, and its bound for the polynomial best approximation error involves norms of certain derivatives of the function $f$ to be approximated. Thus some smoothness of $f$ is required, but only a few derivatives need exist. Thus we say that Thm. 6.2.1.11 deals with functions of finite smoothness, that is, $f \in C^k$ for some $k \in \mathbb{N}$. In this section we aim to bound polynomial interpolation errors for such functions.
We consider $f \in C^{n+1}(I)$ and the Lagrangian interpolation approximation scheme (→ Def. 6.2.2.1) for a node set $\mathcal{T} := \{t_0,\dots,t_n\} \subset I$. Then, for every $t \in I$ there exists a $\tau_t \in I$ such that
$$f(t) - \mathrm{L}_{\mathcal{T}}(f)(t) = \frac{f^{(n+1)}(\tau_t)}{(n+1)!}\cdot\prod_{j=0}^{n}(t - t_j)\;. \qquad (6.2.2.16)$$
Proof. Write $w_{\mathcal{T}}(t) := \prod_{j=0}^{n}(t - t_j) \in \mathcal{P}_{n+1}$ and fix $t \in I\setminus\mathcal{T}$. Consider the auxiliary function
$$\varphi(x) := f(x) - \mathrm{L}_{\mathcal{T}}(f)(x) - c\,w_{\mathcal{T}}(x)\;,$$
with $c$ chosen such that $\varphi(t) = 0$; it belongs to $C^{n+1}(I)$ and has $n+2$ distinct zeros $t_0,\dots,t_n,t$.
[Fig. 210: sketch of $\varphi$ and its zeros $t_0, t_1, \dots, t_n, t$]
By iterated application of the mean value theorem [Str09, Thm. 5.2.1] / Rolle’s theorem
$$\Rightarrow\;\exists\,\tau_t \in I:\quad \varphi^{(n+1)}(\tau_t) = f^{(n+1)}(\tau_t) - c\,(n+1)! = 0\;.$$
This fixes the value of $c = \frac{f^{(n+1)}(\tau_t)}{(n+1)!}$ and by (6.2.2.17) this amounts to the assertion of the theorem. ✷
Remark 6.2.2.19 (Explicit representation of error of polynomial interpolation) The previous theorem
can be refined:
For f ∈ C n+1 ( I ) let IT ∈ Pn stand for the unique Lagrange interpolant (→ Thm. 5.2.2.7) of f in
the node set T := {t0 , . . . , tn } ⊂ I . Then for all t ∈ I the interpolation error is
$$f(t) - \mathrm{I}_{\mathcal{T}}(f)(t) = \int_0^1\!\!\int_0^{\tau_1}\!\!\cdots\!\!\int_0^{\tau_{n-1}}\!\!\int_0^{\tau_n} f^{(n+1)}(\dots)\,d\tau\,d\tau_n\cdots d\tau_1\;\cdot\;\prod_{j=0}^{n}(t - t_j)\;.$$
The proof relies on induction on n, use (5.2.3.9) and the fundamental theorem of calculus, see [Ran00,
Sect. 3.1]. y
Remark 6.2.2.21 (Error representation for generalized Lagrangian interpolation) A result analogous
to Lemma 6.2.2.20 holds also for general polynomial interpolation with multiple nodes as defined in
(5.2.2.15). y
Lemma 6.2.2.20 provides an exact formula (6.5.2.27) for the interpolation error. From it and also from
Thm. 6.2.2.15 we can derive estimates for the supremum norm of the interpolation error on the interval I
as follows:
➊ first bound the right-hand side via $\bigl|f^{(n+1)}(\tau_t)\bigr| \le \bigl\|f^{(n+1)}\bigr\|_{L^\infty(I)}$,
➋ then increase the right hand side further by switching to the maximum (in modulus) w.r.t. t (the
resulting bound does no longer depend on t!),
➌ and, finally, take the maximum w.r.t. t on the left of ≤.
This yields the following interpolation error estimate for degree-n Lagrange interpolation on the node set
{ t0 , . . . , t n }:
$$\text{Thm. 6.2.2.15} \;\Rightarrow\; \|f - \mathrm{L}_{\mathcal{T}}f\|_{L^\infty(I)} \le \frac{\bigl\|f^{(n+1)}\bigr\|_{L^\infty(I)}}{(n+1)!}\,\max_{t\in I}\,|(t-t_0)\cdot\dots\cdot(t-t_n)|\;. \qquad (6.2.2.22)$$
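As a quick illustration (a standard consequence, stated here without proof): for equidistant nodes $t_j = a + jh$, $h = \frac{b-a}{n}$, one can show $\max_{t\in[a,b]}|(t-t_0)\cdots(t-t_n)| \le \tfrac14\,n!\,h^{n+1}$, so that (6.2.2.22) turns into
$$\|f - \mathrm{L}_{\mathcal{T}}f\|_{L^\infty([a,b])} \le \frac{h^{n+1}}{4(n+1)}\,\bigl\|f^{(n+1)}\bigr\|_{L^\infty([a,b])}\;.$$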
This reflects a general truth about estimates of norms of the interpolation error:
EXAMPLE 6.2.2.24 (Error of polynomial interpolation Exp. 6.2.2.4 cnt’d) Now we are in a position to
give a theoretical explanation for exponential convergence observed for polynomial interpolation of f (t) =
EXAMPLE 6.2.2.25 (Runge’s example Ex. 6.2.2.11 cnt’d) How can the blow-up of the interpolation
error observed in Ex. 6.2.2.11 be reconciled with Lemma 6.2.2.20 ?
Here $f(t) = \frac{1}{1+t^2}$ allows only to conclude $|f^{(n)}(t)| = 2^n\,n!\cdot O(|t|^{-2-n})$ for $n \to \infty$.
➙ Possible blow-up of error bound from Thm. 6.2.2.15 →∞ for n → ∞. y
Remark 6.2.2.26 ( L2 -error estimates for polynomial interpolation) Thm. 6.2.2.15 gives error estimates
for the L∞ -norm. What about other norms?
From Lemma 6.2.2.20 we know the error representation
$$f(t) - \mathrm{L}_{\mathcal{T}}(f)(t) = \int_0^1\!\!\int_0^{\tau_1}\!\!\cdots\!\!\int_0^{\tau_{n-1}}\!\!\int_0^{\tau_n} f^{(n+1)}(\dots)\,d\tau\,d\tau_n\cdots d\tau_1\;\cdot\;\prod_{j=0}^{n}(t - t_j)\;,$$
and combine it with the Cauchy–Schwarz inequality
$$\Bigl|\int_a^b f(t)g(t)\,dt\Bigr|^2 \le \int_a^b|f(t)|^2\,dt\cdot\int_a^b|g(t)|^2\,dt\;,\quad\forall f,g\in C^0([a,b])\;. \qquad (6.2.2.27)$$
$$\|f - \mathrm{L}_{\mathcal{T}}(f)\|^2_{L^2(I)} = \int_I\Bigl|\int_0^1\!\!\int_0^{\tau_1}\!\!\cdots\!\!\int_0^{\tau_{n-1}}\!\!\int_0^{\tau_n} f^{(n+1)}(\dots)\,d\tau\,d\tau_n\cdots d\tau_1\cdot\underbrace{\prod_{j=0}^{n}(t-t_j)}_{|t-t_j|\le|I|}\Bigr|^2 dt$$
$$\le |I|^{2n+2}\int_I\underbrace{\operatorname{vol}^{(n+1)}(S_{n+1})}_{=1/(n+1)!}\int_{S_{n+1}}|f^{(n+1)}(\dots)|^2\,d\boldsymbol{\tau}\,dt$$
$$= \frac{|I|^{2n+2}}{(n+1)!}\int_I\int_I\underbrace{\operatorname{vol}^{(n)}(C_{t,\tau})}_{\le 2^{(n-1)/2}/n!}\,|f^{(n+1)}(\tau)|^2\,d\tau\,dt\;,$$
where
Remark 6.2.2.29 (Interpolation error estimates and the Lebesgue constant [Tre13, Thm. 15.1]) The
sensitivity of a (polynomial) interpolation scheme IT : K n+1 → C0 ( I ), T ⊂ I a node set, as introduced in
Section 5.2.4 and expressed by the Lebesgue constant (→ Lemma 5.2.4.10)
$$\lambda_{\mathcal{T}} := \|\mathrm{I}_{\mathcal{T}}\|_{\infty\to\infty} := \sup_{\mathbf{y}\in\mathbb{R}^{n+1}\setminus\{0\}}\frac{\|\mathrm{I}_{\mathcal{T}}(\mathbf{y})\|_{L^\infty(I)}}{\|\mathbf{y}\|_\infty}\;,$$
establishes an important connection between the norms of the interpolation error and of the best approxi-
mation error.
We first observe that the polynomial approximation scheme LT induced by IT preserves polynomials of
degree ≤ n := ♯T − 1:
$$\mathrm{L}_{\mathcal{T}}\,p = \mathrm{I}_{\mathcal{T}}\bigl([p(t)]_{t\in\mathcal{T}}\bigr) = p \quad\forall p \in \mathcal{P}_n\;. \qquad (6.2.2.30)$$
Thus, by the triangle inequality, for a generic norm on C0 ( I ) and kLT k designating the associated operator
norm of the linear mapping LT , cf. (5.2.4.9),
$$\|f - \mathrm{L}_{\mathcal{T}}f\| \overset{(6.2.2.30)}{=} \|(f-p) - \mathrm{L}_{\mathcal{T}}(f-p)\| \le (1 + \|\mathrm{L}_{\mathcal{T}}\|)\,\|f - p\| \quad\forall p \in \mathcal{P}_n\;,$$
Note that for $\|\cdot\| = \|\cdot\|_{L^\infty(I)}$, since $\bigl\|[f(t)]_{t\in\mathcal{T}}\bigr\|_\infty \le \|f\|_{L^\infty(I)}$, we can estimate the operator norm, cf. (5.2.4.9), and obtain
$$\|f - \mathrm{L}_{\mathcal{T}}f\|_{L^\infty(I)} \le (1 + \lambda_{\mathcal{T}})\inf_{p\in\mathcal{P}_n}\|f - p\|_{L^\infty(I)} \quad\forall f \in C^0(I)\;. \qquad (6.2.2.33)$$
Hence, if a bound for λT is available, the best approximation error estimate of Thm. 6.2.1.11 immediately
yields interpolation error estimates. y
Review question(s) 6.2.2.34 (Interpolands of finite smoothness)
(Q6.2.2.34.A) For the interpolation error observed in Exp. 6.2.2.4 we found the bound
$$\|f - \mathrm{L}_{\mathcal{T}_n}f\|_{L^\infty(I)} \le \frac{1}{n+1}\Bigl(\frac{\pi}{n}\Bigr)^{n+1}\quad\forall n \in \mathbb{N}\;.$$
Explain why this behavior of the maximum norm of the interpolation error is also called “superexponential
convergence”.
(Q6.2.2.34.B) For Lagrange polynomial interpolation we have seen the following error representation.
$$f(t) - \mathrm{L}_{\mathcal{T}}(f)(t) = \frac{f^{(n+1)}(\tau_t)}{(n+1)!}\cdot\prod_{j=0}^{n}(t - t_j)\;. \qquad (6.5.2.27)$$
This was obtained by “counting the zeros” of derivatives of the auxiliary function
ϕ( x ) := f ( x ) − LT ( f )( x ) − cwT ( x ) , a≤x≤b,
where wT was the nodal polynomial belonging to the node set T and c ∈ R was chosen to ensure
ϕ(t) = 0 for a fixed t ∈ [ a, b] \ T .
Now we consider the cubic Hermite interpolation operator
$$\mathrm{H}: C^1([a,b]) \to \mathcal{P}_3\;,\quad p := \mathrm{H}(f) \ \text{ such that }\ p(a) = f(a)\,,\ p(b) = f(b)\,,\ p'(a) = f'(a)\,,\ p'(b) = f'(b)\;.$$
Here $p'$, $f'$ stand for the derivative. Fixing $t \in\,]a,b]$ and using the auxiliary function
If $f \in C^r([-1,1])$ ($r$ times continuously differentiable), $r \in \mathbb{N}$, then, for any polynomial degree $n \ge r$,
$$\inf_{p\in\mathcal{P}_n}\|f - p\|_{L^\infty([-1,1])} \le \bigl(1 + \tfrac{\pi^2}{2}\bigr)^r\,\frac{(n-r)!}{n!}\,\bigl\|f^{(r)}\bigr\|_{L^\infty([-1,1])}\;,$$
Video tutorial for Section 6.2.2.3 "Error Estimates for Polynomial Interpolation: Analytic Inter-
polands": (27 minutes) Download link, tablet notes
We have seen that for some Lagrangian approximation schemes applied to certain functions we can
observe exponential convergence (→ Def. 6.2.2.7) of the approximation error for increasing polynomial
degree. This section presents a class of interpolands, which often enable this convergence.
We may say that an analytic function locally agrees with a “polynomial of degree ∞”, because this is
exactly what a convergent power series
$$f(t) = \sum_{k=0}^{\infty} a_k(t - t_0)^k\;,\quad a_k \in \mathbb{R}\;, \qquad (6.2.2.36)$$
represents. Note that $f$ need not be given by its Taylor power series on the whole interval $I$. Those may converge only on small sub-intervals. Def. 6.2.2.35 merely tells us that $I$ can be covered by such sub-intervals.
Consider again the function
$$f(t) := \frac{1}{1+t^2}\;,\quad t \in \mathbb{R}\;,$$
on the interval $I = [-5,5]$. By the geometric sum formula we have as Taylor series at $t_0 = 0$:
$$f(t) = \sum_{k=0}^{\infty}(-1)^k t^{2k}\;,\quad |t| < 1\;,$$
whose radius of convergence is $1$. More generally, deeper theory tells us that the Taylor series at $t_0$ has radius of convergence $\sqrt{t_0^2 + 1}$. Thus $f$ cannot be represented in $[-5,5]$ by a single power series, though it is a perfectly smooth function. y
EXAMPLE 6.2.2.38 (Square root function) We consider the function $f(t) := \sqrt{t}$ on $I := ]0,1[$. From calculus we know the power series
$$\sqrt{1+x} = 1 + \sum_{k=1}^{\infty}(-1)^k\Bigl(\prod_{j=0}^{k-1}\frac{j - \tfrac12}{j+1}\Bigr)x^k\;,\quad |x| < 1\;. \qquad (6.2.2.39)$$
It converges for all $x$ with $|x| < 1$ [Str09, Satz 3.7.2]. Using (6.2.2.39) we get the Taylor series for $f$ at $t_0 \in I$:
$$\sqrt{t} = \sqrt{t_0}\,\sqrt{1 + \frac{t - t_0}{t_0}} = \sqrt{t_0} + \sqrt{t_0}\sum_{k=1}^{\infty}(-1)^k\Bigl(\prod_{j=0}^{k-1}\frac{j - \tfrac12}{j+1}\Bigr)\Bigl(\frac{t - t_0}{t_0}\Bigr)^{k}\;,\quad \Bigl|\frac{t - t_0}{t_0}\Bigr| < 1\;.$$
This series converges only in the open ball around t0 of radius |t0 |. The closer we get to the “singular
point” t = 0, the smaller the radius of convergence. y
Remark 6.2.2.40 (Analytic functions everywhere) Analyticity of a function seems to be a very special property confined to functions that are given through simple formulas. Yet, this is not true:
[Circuit diagram: linear RLC-circuit with a variable resistor $R_x$, see also Ex. 2.6.0.24]
For the shown linear electric circuit all branch currents and voltages will depend analytically on the resistance $R_x$ of the variable resistor. This is a consequence of the fact that these quantities can be written in the form
$$t \mapsto \mathbf{v}(t)^\top\mathbf{A}(t)^{-1}\mathbf{u}(t)\;,\quad t \in I \subset \mathbb{R}\;,$$
This remark remains true for many output quantities of physical models considered as functions of model parameters or input quantities. y
This remark remains true for many output quantities of physical models considered as functions of model
parameters or input quantities. y
§6.2.2.41 (Approximation by truncated power series) A first glimpse of the relevance of analyticity for polynomial approximation: Let $I \subset \mathbb{R}$ be a closed interval and $f$ real-analytic on $I$ according to Def. 6.2.2.35.
We add a stronger assumption: There is a $t_0 \in I$ and $\rho > 0$ such that
• the Taylor series of $f$ at $t_0$,
$$f(t) = \sum_{k=0}^{\infty} a_k(t - t_0)^k\;,\quad a_k \in \mathbb{R}\;, \qquad (6.2.2.42)$$
$$I \subset \{t \in \mathbb{R} : |t - t_0| \le r\}\;. \qquad (6.2.2.44)$$
The convergence theory of power series [Str09, Sect. 3.7] ensures that for any $r < R < \rho$
$$\sum_{k=0}^{\infty}|a_k|\,R^k =: C < \infty\;. \qquad (6.2.2.45)$$
This confirms exponential convergence of the approximation error incurred by truncating the Taylor series.
y
The previous § heavily relied on the assumption that $f$ possesses a power series representation that converges on the entire interval $I$ and even beyond. As we see from Ex. 6.2.2.37 and Ex. 6.2.2.38 this will usually not be the case.
This is why we continue with a key observation: a power series makes perfect sense for complex arguments. Any function $f \in C^\infty(I)$ that, locally, for $t_0 \in I$, can be written as
$$f(t) = \sum_{n=0}^{\infty} a_n(t - t_0)^n\;,\quad a_n \in \mathbb{R}\;,\quad \forall t \in \mathbb{R}: |t - t_0| < \rho(t_0)\;,$$
can therefore also be evaluated for complex arguments $z$ with $|z - t_0| < \rho(t_0)$. For this reason no distinction between real and complex analyticity has to be made. For the sake of completeness, the definition of an analytic function on $D \subset \mathbb{C}$ is given nevertheless.
Be aware that this definition, too, asserts the existence of a local power series representation of $f$ only.
§6.2.2.49 (Residue calculus) Why go $\mathbb{C}$? The reason is that this permits us to harness powerful tools from complex analysis (→ course in the BSc program CSE), a field of mathematics which studies analytic functions. One of these tools is the residue theorem.
• Note that the integral $\int_\gamma$ in Thm. 6.2.2.50 is a path integral in the complex plane (“contour integral”): If the path of integration $\gamma$ is described by a parameterization $\tau \in J \mapsto \gamma(\tau) \in \mathbb{C}$, $J \subset \mathbb{R}$, then
$$\int_\gamma f(z)\,dz := \int_J f(\gamma(\tau))\cdot\dot\gamma(\tau)\,d\tau\;, \qquad (6.2.2.51)$$
where $\dot\gamma$ designates the derivative of $\gamma$ with respect to the parameter, and $\cdot$ indicates multiplication in $\mathbb{C}$. For contour integrals we have the estimate
$$\Bigl|\int_\gamma f(z)\,dz\Bigr| \le |\gamma|\,\max_{z\in\gamma}|f(z)|\;. \qquad (6.2.2.52)$$
• Π often stands for the set of poles of f , that is, points where “ f attains the value ∞”.
The residue theorem is very useful, because there are simple formulas for $\operatorname{res}_p f$: let $g$ and $h$ be complex-valued functions that are both analytic in a neighborhood of $p \in \mathbb{C}$, and satisfy $h(p) = 0$, $h'(p) \ne 0$. Then
$$\operatorname{res}_p\frac{g}{h} = \frac{g(p)}{h'(p)}\;.$$
§6.2.2.54 (Residue remainder formula for Lagrange interpolation) Now we consider a polynomial
Lagrangian approximation scheme on the interval I := [ a, b] ⊂ R, based on the node set T :=
{ t0 , . . . , t n } ⊂ I .
Assumption 6.2.2.55. Analyticity of interpoland
We assume that the interpoland $f: [a,b] \to \mathbb{C}$ can be extended to a function $f: D \subset \mathbb{C} \to \mathbb{C}$, which is analytic (→ Def. 6.2.2.48) on the open set $D \subset \mathbb{C}$ with $[a,b] \subset D$.
[Fig. 213: the interval $[a,b]$ with nodes $t_0, t_1, t_2, \dots$ on the real axis, the domain of analyticity $D$, and an integration contour $\gamma$ winding around $[a,b]$]
Key is the following representation of the Lagrange polynomials (5.2.2.4) for the node set $\mathcal{T} = \{t_0,\dots,t_n\}$:
$$L_j(t) = \prod_{k=0,\,k\ne j}^{n}\frac{t - t_k}{t_j - t_k} = \frac{w(t)}{(t - t_j)\prod_{k=0,\,k\ne j}^{n}(t_j - t_k)} = \frac{w(t)}{(t - t_j)\,w'(t_j)}\;, \qquad (6.2.2.56)$$
where $w(t) = (t - t_0)\cdots(t - t_n) \in \mathcal{P}_{n+1}$.
Consider the following parameter-dependent function $g_t$, whose set of poles in $D$ is $\Pi = \{t, t_0,\dots,t_n\}$:
$$g_t(z) := \frac{f(z)}{(z - t)\,w(z)}\;,\quad z \in \mathbb{C}\setminus\Pi\;,\quad t \in [a,b]\setminus\{t_0,\dots,t_n\}\;.$$
Apply the residue theorem Thm. 6.2.2.50 to $g_t$ and a closed path of integration $\gamma \subset D$ winding once around $[a,b]$, such that its interior is simply connected, see the magenta curve in Fig. 213:
$$\frac{1}{2\pi\imath}\int_\gamma g_t(z)\,dz = \operatorname{res}_t g_t + \sum_{j=0}^{n}\operatorname{res}_{t_j} g_t \overset{\text{Lemma 6.2.2.53}}{=} \frac{f(t)}{w(t)} + \sum_{j=0}^{n}\frac{f(t_j)}{(t_j - t)\,w'(t_j)}$$
$$\Rightarrow\quad f(t) = \underbrace{-\sum_{j=0}^{n} f(t_j)\,\overbrace{\frac{w(t)}{(t_j - t)\,w'(t_j)}}^{-\,\text{Lagrange polynomial}}}_{\text{polynomial interpolant}} \;+\; \underbrace{\frac{w(t)}{2\pi\imath}\int_\gamma g_t(z)\,dz}_{\text{interpolation error}}\;. \qquad (6.2.2.57)$$
This is the famous Hermite integral formula [Tre13, Ch. 11], a representation formula for the interpolation
error, an alternative to that of Thm. 6.2.2.15 and Lemma 6.2.2.20. We conclude that for all t ∈ [ a, b]
In a concrete setting, in order to exploit the estimate (6.2.2.58) to study the $n$-dependence of the supremum norm of the interpolation error, we need to know
• an upper bound for $|w(t)|$ for $a \le t \le b$,
• a lower bound for $|w(z)|$, $z \in \gamma$, for a suitable path of integration $\gamma \subset D$,
• a lower bound for the distance of the path $\gamma$ and the interval $[a,b]$ in the complex plane.
y
Remark 6.2.2.59 (Frobenius’ derivation of the Hermite integral formula [Boo05, Sect. 9]) We give a more elementary alternative derivation of the interpolation error formula in (6.2.2.57), which does not rely on the residue theorem. We retain the node set $\mathcal{T} := \{t_0,\dots,t_n\} \subset [a,b]$ and Ass. 6.2.2.55. We write
$$w_j(t) := (t - t_0)\cdots(t - t_{j-1})\;,\quad j \in \{1,\dots,n+1\}\;,\quad w_0 := 1\;,$$
and note the identity $t\,w_{j-1}(t) = w_j(t) + t_{j-1}\,w_{j-1}(t)$ (6.2.2.60). We pick $z \in \mathbb{C}\setminus\mathcal{T}$, replace $t \to z$ in (6.2.2.60), and divide by $w_{j-1}(z)\,w_j(z)$, which yields
$$\frac{z}{w_j(z)} = \frac{1}{w_{j-1}(z)} + \frac{t_{j-1}}{w_j(z)}\;,\quad z \in \mathbb{C}\setminus\mathcal{T}\;,\quad j = 1,\dots,n+1\;. \qquad (6.2.2.61)$$
From the two identities (6.2.2.60) and (6.2.2.61) we obtain by summation
$$z\cdot\sum_{j=1}^{n+1}\frac{w_{j-1}(t)}{w_j(z)} = \sum_{j=1}^{n+1}\Bigl\{\frac{w_{j-1}(t)}{w_{j-1}(z)} + \frac{t_{j-1}\,w_{j-1}(t)}{w_j(z)}\Bigr\}\;, \qquad (6.2.2.62)$$
$$t\cdot\sum_{j=1}^{n+1}\frac{w_{j-1}(t)}{w_j(z)} = \sum_{j=1}^{n+1}\Bigl\{\frac{w_j(t)}{w_j(z)} + \frac{t_{j-1}\,w_{j-1}(t)}{w_j(z)}\Bigr\}\;. \qquad (6.2.2.63)$$
where γ ⊂ D is a curve enclosing [ a, b] as drawn in Fig. 213. We can rewrite (6.2.2.65) using (6.2.2.64),
$$f(t) = \underbrace{\sum_{j=1}^{n+1}\Bigl\{\frac{1}{2\pi\imath}\int_\gamma\frac{f(z)}{w_j(z)}\,dz\Bigr\}\cdot w_{j-1}(t)}_{=:p(t)} \;+\; \frac{1}{2\pi\imath}\int_\gamma\frac{f(z)}{(z - t)\,w_{n+1}(z)}\,dz\cdot w_{n+1}(t)\;. \qquad (6.2.2.66)$$
By the definition of w j the function t 7→ p(t) is a polynomial of degree n that interpolates f in T , the set
of zeros of wn+1 . Hence p is the unique Lagrange polynomial interpolant of f with respect to the node
set T . Thus the second term in (6.2.2.66) represents the interpolation error and obviously agrees with the
formula found in (6.2.2.57). y
Remark 6.2.2.67 (Determining the domain of analyticity) The subset of $\mathbb{C}$ where a function $f$ given by a formula is analytic can often be determined, without computing derivatives, using the following consequence of the chain rule:
(Q6.2.2.69.A) We consider the Lagrange polynomial interpolation of the entire function $f(t) = e^t$ on $I = [0,1]$ with equidistant nodes $\mathcal{T} := \bigl\{t_j = \tfrac{j}{n}\bigr\}_{j=0}^{n}$, $n \in \mathbb{N}$. As integration path in the residue remainder estimate
Let $\mathcal{T}_n := \{t_0,\dots,t_n\} \subset [-1,1]$ be a sequence of sets of interpolation nodes and $\mathrm{I}_n: C^0([-1,1]) \to \mathcal{P}_n$ the associated family of Lagrangian polynomial interpolation operators. Based on (6.2.2.58) show that the supremum norm of the interpolation error $\|f - \mathrm{I}_n f\|_{\infty,[-1,1]}$ converges to zero exponentially in the degree $n$ as $n \to \infty$.
(Q6.2.2.69.C) For Lagrange polynomial interpolation of $f(t) = \frac{1}{1+t^2}$ in the nodes $\mathcal{T} := \bigl\{-5 + \tfrac{10}{n}j\bigr\}_{j=0}^{n}$, sketch a valid integration path $\gamma \subset \mathbb{C}$ for the estimate (6.2.2.58).
(Q6.2.2.69.D) What is the problem if you want to apply (6.2.2.58) to estimate the error of Lagrange polynomial interpolation of $t \mapsto \sqrt{t}$ in equidistant nodes in $[0,1]$?
(Q6.2.2.69.E) Find the largest subset of the complex plane $\mathbb{C}$ to which the logistic curve function
$$f(t) := \frac{1}{1 + \exp(-t)}\;,\quad t \in \mathbb{R}\;,$$
can be extended as an analytic function.
Video tutorial for Section 6.2.3.1 "Chebychev Interpolation: Motivation and Definition": (21
minutes) Download link, tablet notes
Recall Thm. 6.2.2.15: $\quad\|f - \mathrm{L}_{\mathcal{T}}f\|_{L^\infty(I)} \le \frac{1}{(n+1)!}\,\bigl\|f^{(n+1)}\bigr\|_{L^\infty(I)}\,\|w\|_{L^\infty(I)}\;,$
Remark 6.2.3.2 (A priori and a posteriori choice of optimal interpolation nodes) We stress that we
aim for an “optimal” a priori choice of interpolation nodes, a choice that is made before any information
about the interpoland becomes available.
Of course, an a posteriori choice based on information gleaned from evaluations of the interpoland f may
yield much better interpolants (in the sense of smaller norm of the interpolation error). Many modern
algorithms employ this a posteriori adaptive approximation policy, but this chapter will not cover them.
However, see Section 7.6 for the discussion of an a posteriori adaptive approach for the numerical approx-
imation of definite integrals. y
Are there polynomials satisfying these requirements? If so, do they allow a simple characterization?
Proof. Just use the trigonometric identity $\cos((n+1)x) = 2\cos(nx)\cos x - \cos((n-1)x)$ with $\cos x = t$. ✷
The theorem implies: • $T_n \in \mathcal{P}_n$,
• their leading coefficients are equal to $2^{n-1}$,
• the $T_n$ are linearly independent,
• $\{T_j\}_{j=0}^{n}$ is a basis of $\mathcal{P}_n = \operatorname{Span}\{T_0,\dots,T_n\}$, $n \in \mathbb{N}_0$.
See Code 6.2.3.6 for algorithmic use of the 3-term recursion (6.2.3.5).
[Fig. 215: Chebychev polynomials $T_n(t)$ on $[-1,1]$ for $n = 0,\dots,4$;  Fig. 216: Chebychev polynomials $T_n(t)$ for $n = 5,\dots,9$]
  V = MatrixXd::Ones(d + 1, n); // T_0 ≡ 1
  if (d == 0) return;
  V.block(1, 0, 1, n) = x; // T_1(x) = x
  if (d == 1) return;
  for (unsigned int k = 1; k < d; ++k) {
    const RowVectorXd p = V.block(k, 0, 1, n);     // p = T_k
    const RowVectorXd q = V.block(k - 1, 0, 1, n); // q = T_{k-1}
    V.block(k + 1, 0, 1, n) = 2 * x.cwiseProduct(p) - q; // 3-term recursion
  }
}
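As a companion illustration (not part of the original listing), the same 3-term recursion $T_{k+1}(x) = 2xT_k(x) - T_{k-1}(x)$ can also be used to evaluate a Chebychev expansion $p(x) = \sum_{k=0}^{d} c_k T_k(x)$ at a single point:

#include <Eigen/Dense>

// Sketch: evaluate p(x) = sum_{k=0}^{d} c_k T_k(x) for x in [-1,1]
// via the 3-term recursion (6.2.3.5).
double chebEval(const Eigen::VectorXd &c, double x) {
  const int d = c.size() - 1;
  double Tkm1 = 1.0;          // T_0(x)
  double p = c(0) * Tkm1;
  if (d == 0) return p;
  double Tk = x;              // T_1(x)
  p += c(1) * Tk;
  for (int k = 1; k < d; ++k) {
    const double Tkp1 = 2.0 * x * Tk - Tkm1; // 3-term recursion
    p += c(k + 1) * Tkp1;
    Tkm1 = Tk;
    Tk = Tkp1;
  }
  return p;
}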
From Def. 6.2.3.3 we conclude that Tn attains the values ±1 in its extrema with alternating signs, thus
matching our heuristic demands:
|T_n(t_k)| = 1 ⇔ ∃ k = 0, ..., n: t_k = cos(kπ/n) , ‖T_n‖_{L∞([−1,1])} = 1 . (6.2.3.7)
What is still open is the validity of the heuristics guiding the choice of the optimal nodes. The next funda-
mental theorem will demonstrate that, after scaling, the Tn really supply polynomials on [−1, 1] with fixed
leading coefficient and minimal supremum norm.
Theorem 6.2.3.8. Minimax property of the Chebychev polynomials [DH03, Section 7.1.4.], [Han02, Thm. 32.2]
The polynomials T_n from Def. 6.2.3.3 minimize the supremum norm in the following sense:
‖T_n‖_{L∞([−1,1])} ≤ ‖p‖_{L∞([−1,1])} for every p ∈ P_n with leading coefficient 2^{n−1} , n ∈ N .
Proof (indirect). Assume that there exists a q ∈ P_n with the same leading coefficient 2^{n−1} as T_n but strictly smaller supremum norm on [−1,1] (these are the properties (6.2.3.9) referred to below). Then (T_n − q)(x) > 0 in the local maxima of T_n, and (T_n − q)(x) < 0 in all local minima of T_n.
From our knowledge about the n+1 local extrema of T_n in [−1,1] (they have alternating signs!), see (6.2.3.7), we conclude that T_n − q changes sign at least n times. This implies that T_n − q has at least n zeros. As a consequence, T_n − q ≡ 0, because T_n − q ∈ P_{n−1} (same leading coefficient!).
This cannot be reconciled with the properties (6.2.3.9) of q and, thus, leads to a contradiction.
✷
The zeros of T_n are t_k = cos( (2k+1)/(2n) · π ) , k = 0, ..., n−1 . (6.2.3.10)
Remark 6.2.3.11 (Chebychev nodes on arbitrary interval) Following the recipe of § 6.2.1.14 Chebychev
interpolation on an arbitrary interval [ a, b] can immediately be defined. The same polynomial Lagrangian
approximation scheme is obtained by transforming the Chebychev nodes (6.2.3.10) from [−1, 1] to [ a, b]
using the unique affine transformation (6.2.1.15):
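(The affine transformation in question maps [−1,1] onto [a,b]; it is the same map τ ↦ ½(1−τ)a + ½(τ+1)b that reappears in Rem. 7.2.0.4.) A small sketch generating the transformed Chebychev nodes for degree-n interpolation on [a,b]; the function name chebNodes is an illustrative assumption.
#include <Eigen/Dense>
#include <cmath>

// Chebychev nodes for degree-n interpolation, mapped from [-1,1] to [a,b].
Eigen::VectorXd chebNodes(unsigned int n, double a, double b) {
  const double pi = std::acos(-1.0);
  Eigen::VectorXd t(n + 1);
  for (unsigned int k = 0; k <= n; ++k) {
    const double th = std::cos((2.0 * k + 1.0) / (2.0 * (n + 1.0)) * pi);  // node in [-1,1]
    t(k) = a + 0.5 * (b - a) * (th + 1.0);                                 // affine map to [a,b]
  }
  return t;
}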
Video tutorial for Section 6.2.3.2 "Chebychev Interpolation Error Estimates": (14 minutes)
Download link, tablet notes
[Fig. 219: the function 1/(1+x²) and its interpolating polynomial on [−5,5]; Fig. 220: the function f and its Chebychev interpolation polynomial on [−5,5].]
§6.2.3.15 (Finite-smoothness error estimates for Chebychev interpolation) Note the following features
of Chebychev interpolation on the interval [−1, 1]:
• Use of "optimal" interpolation nodes T = { t̂_k := cos( (2k+1)/(2(n+1)) · π ) , k = 0, ..., n } ,
We consider f ∈ C n+1 ( I ) and the Lagrangian interpolation approximation scheme (→ Def. 6.2.2.1)
for a node set T := {t0 , . . . , tn } ⊂ I . Then,
f(t) − L_T(f)(t) = f^{(n+1)}(τ_t)/(n+1)! · ∏_{j=0}^{n} (t − t_j) . (6.2.3.16)
we immediately get an interpolation error estimate for Chebychev interpolation of f ∈ C n+1 ([−1, 1]):
‖f − I_T(f)‖_{L∞([−1,1])} ≤ 2^{−n}/(n+1)! · ‖f^{(n+1)}‖_{L∞([−1,1])} . (6.2.3.17)
Estimates for the Chebychev interpolation error on [a,b] are easily derived from (6.2.3.17) via the affine pullback introduced and discussed in § 6.2.1.14. For instance, repeated application of the chain rule yields the formula
d^n f̂/dt̂^n (t̂) = (½|I|)^n · d^n f/dt^n (t) , t = Φ(t̂) ,
and therefore
‖f − I_T(f)‖_{L∞(I)} = ‖f̂ − I_{T̂}(f̂)‖_{L∞([−1,1])} ≤ 2^{−n}/(n+1)! · ‖d^{n+1}f̂/dt̂^{n+1}‖_{L∞([−1,1])}
 ≤ 2^{−2n−1}/(n+1)! · |I|^{n+1} · ‖f^{(n+1)}‖_{L∞(I)} . (6.2.3.18)
Remark 6.2.3.19 (Lebesgue Constant for Chebychev nodes [Tre13, Thm 15.2]) We saw in Sec-
tion 5.2.4 and, in particular, in Rem. 5.2.4.13 that the Lebesgue constant λT that measures the sensitivity
of a polynomial interpolation scheme, blows up exponentially with increasing number of equispaced inter-
polation nodes. In stark contrast λT grows only logarithmically in the number of Chebychev nodes.
λ_T ≤ 2/π · log(1 + n) + 1 . (6.2.3.20)
[Figure: measured Lebesgue constant λ_T for Chebychev nodes, based on approximate evaluation of (5.2.4.11) by sampling, plotted against the polynomial degree n = 0, ..., 25. ✄]
Combining the general bound
‖f − L_T f‖_{L∞(I)} ≤ (1 + λ_T) · inf_{p∈P_n} ‖f − p‖_{L∞(I)} ∀ f ∈ C^0(I) , (6.2.2.33)
and the bound for the best approximation error by polynomials for f ∈ C^r([−1,1]) from Thm. 6.2.1.11,
inf_{p∈P_n} ‖f − p‖_{L∞([−1,1])} ≤ (1 + π²/2)^r · (n−r)!/n! · ‖f^{(r)}‖_{L∞([−1,1])} ,
we end up with a bound for the supremum norm of the interpolation error in the case of Chebychev
interpolation on [−1, 1]
‖f − L_T f‖_{L∞([−1,1])} ≤ (2/π · log(1+n) + 2) · (1 + π²/2)^r · (n−r)!/n! · ‖f^{(r)}‖_{L∞([−1,1])} . (6.2.3.21)
Emphasizing the asymptotic behavior of the maximum norm of the interpolation error, we can infer for
Chebychev interpolation
f ∈ C^r([−1,1]) ⇒ ‖f − L_T f‖_{L∞([−1,1])} = O( log n / n^r ) for n → ∞ , (6.2.3.22)
which could be dubbed “almost algebraic convergence” with rate r. This guarantees convergence, if only
f ∈ C1 ([−1, 1])! y
EXPERIMENT 6.2.3.23 (Chebychev interpolation errors) Now we empirically investigate the behavior
of norms of the interpolation error for Chebychev interpolation and functions with different (smoothness)
properties as we increase the number of interpolation nodes.
➀ f(t) = 1/(1+t²), I = [−5,5] ("Runge function"):
[Fig. 221: f and its Chebychev interpolation polynomial on [−5,5]; right panel: error norms ‖f−p_n‖_∞ and ‖f−p_n‖_2 versus polynomial degree n.]
p_n → f , ‖f − I_n f‖_{L∞([−5,5])} ≈ 0.8^n .
➁ f(t) = max{1 − |t|, 0}, I = [−2,2], n = 10 nodes (plot on the left).
Now f ∈ C^0(I) but f ∉ C^1(I).
[Fig. 222: f and its Chebychev interpolation polynomial on [−2,2]; Fig. 223, Fig. 224: error norms ‖f−p_n‖_∞ and ‖f−p_n‖_2 versus polynomial degree n (lin-log and log-log scale).]
Observations:
• no exponential convergence,
• algebraic convergence (?).
➂ f(t) = { ½(1 + cos πt) for |t| < 1 ; 0 for 1 ≤ |t| ≤ 2 } , I = [−2,2], n = 10 (plot on the left).
[Plots: f and its Chebychev interpolation polynomial on [−2,2]; error norms ‖f−p_n‖_∞ and ‖f−p_n‖_2 versus polynomial degree n (log-log scale).]
Remark 6.2.3.26 (Chebychev interpolation of analytic functions [Tre13, Ch. 8]) Assuming that the
interpoland f possesses an analytic extension to a complex neighborhood D of [−1, 1], we now apply
the theory of Section 6.2.2.3 to bound the supremum norm of the Chebychev interpolation error of f on
[−1, 1].
To convert the abstract estimate (6.2.2.58), as obtained in Section 6.2.2.3, into a more concrete estimate, we have to study the behavior of
w_n(t) = (t − t_0)(t − t_1) ⋯ (t − t_n) , t_k = cos( (2k+1)/(2n+2) · π ) , k = 0, ..., n ,
where the t_k are the Chebychev nodes according to (6.2.3.12). They are the zeros of the Chebychev polynomial (→ Def. 6.2.3.3) of degree n+1. Since w_n has leading coefficient 1, we conclude w_n = 2^{−n} T_{n+1}.
Thus, we see that γ is an ellipse with foci ±1, large semi-axis ½(ρ + ρ^{−1}) > 1 and small semi-axis ½(ρ − ρ^{−1}) > 0.
[Fig. 227: elliptical integration contours γ in the complex plane (Re/Im axes) for different values ρ = 1, 1.2, 1.4, 1.6, 1.8, 2. ✄]
Appealing to geometric evidence, we find dist(γ, [−1,1]) = ½(ρ + ρ^{−1}) − 1, which gives another term in (6.2.2.58).
The rationale for choosing this particular integration contour is that the cos in its definition nicely cancels the arccos in the formula for the Chebychev polynomials. This lets us compute
for all 0 ≤ θ ≤ 2π , which provides a lower bound for |wn | on γ. Plugging all these estimates into
(6.2.2.58) we arrive at
‖f − L_T f‖_{L∞([−1,1])} ≤ 2|γ|/π · 1/((ρ^{n+1} − 1)(ρ + ρ^{−1} − 2)) · max_{z∈γ} |f(z)| . (6.2.3.28)
Note that instead of the nodal polynomial w_n we have inserted T_{n+1} into (6.2.2.58), which is a simple multiple of it. The factor cancels.
The supremum norm of the interpolation error converges exponentially (ρ > 1!):
y
EXPERIMENT 6.2.3.29 (Chebychev interpolation of analytic function → Exp. 6.2.3.23 cnt’d)
[Fig. 228: error norms ‖f−p_n‖_∞ and ‖f−p_n‖_2 versus polynomial degree n.]
(Faster) exponential convergence than on the interval I = ]−5,5[:
‖f − I_n f‖_{L²([−1,1])} ≈ 0.42^n .
Explanation, cf. Rem. 6.2.3.26: for I = [−1, 1] the poles ±i of f are farther away relative to the size of
the interval than for I = [−5, 5]. y
Review question(s) 6.2.3.30 (Chebychev interpolation error estimates)
(Q6.2.3.30.A) Plotting the norms ‖f−p_n‖_∞ and ‖f−p_n‖_2 of the Chebychev interpolation error for the "Runge function"
f(t) := 1/(1 + t²)
against the polynomial degree n yields Fig. 229. Guess what could be the cause of the behavior observed there.
Video tutorial for Section 6.2.3.3 "Chebychev Interpolation: Computational Aspects": (11
minutes) Download link, tablet notes
§6.2.3.32 (Fast evaluation of Chebychev expansion → [Han02, Alg. 32.1]) Let us assume that the Chebychev expansion coefficients α_j in (6.2.3.3) are given and wonder how we can efficiently compute p(x) for some x ∈ R:
Task: Given n ∈ N, x ∈ R, and the Chebychev expansion coefficients α_j ∈ R, j = 0, ..., n, compute p(x) with
p(x) = ∑_{j=0}^{n} α_j T_j(x) , α_j ∈ R . (6.2.3.3)
p(x) = ∑_{j=0}^{n−1} α_j T_j(x) + α_n T_n(x)
  (6.2.3.5)= ∑_{j=0}^{n−1} α_j T_j(x) + α_n (2x T_{n−1}(x) − T_{n−2}(x))
  = ∑_{j=0}^{n−3} α_j T_j(x) + (α_{n−2} − α_n) T_{n−2}(x) + (α_{n−1} + 2x α_n) T_{n−1}(x) .
We recover the point value p(x) as the point value of another polynomial of degree n−1 with known Chebychev expansion:
p(x) = ∑_{j=0}^{n−1} α̃_j T_j(x) with α̃_j = { α_j + 2x α_{j+1} , if j = n−1 ; α_j − α_{j+2} , if j = n−2 ; α_j , else } . (6.2.3.33)
This inspires the recursive algorithm of Code 6.2.3.34. A loop-based implementation without recursive
function calls is also possible and given in Code 6.2.3.35.
C++ code 6.2.3.35: Clenshaw algorithm for evaluation of Chebychev expansion (6.2.3.3) ➺ GITLAB
// Clenshaw algorithm for evaluating p = ∑_{j=1}^{n+1} a_j T_{j−1}
// at the points passed in the vector x
// IN : a = [α_j], coefficients for p = ∑_{j=1}^{n+1} α_j T_{j−1}
//      x = (many) evaluation points
// OUT: values p(x_j) for all j
VectorXd clenshaw(const VectorXd &a, const VectorXd &x) {
  const int n = a.size() - 1;      // degree of polynomial
  MatrixXd d(n + 1, x.size());     // temporary storage for intermediate values
  for (int c = 0; c < x.size(); ++c) d.col(c) = a;
  for (int j = n - 1; j > 0; --j) {
    d.row(j) += 2 * x.transpose().cwiseProduct(d.row(j + 1));  // see (6.2.3.33)
    d.row(j - 1) -= d.row(j + 1);
  }
  return d.row(0) + x.transpose().cwiseProduct(d.row(1));
}
Task: Efficiently compute the Chebychev expansion coefficients α j in (6.2.3.3) from the interpolation
conditions
p(t_k) = f(t_k) , k = 0, ..., n , for the Chebychev nodes t_k := cos( (2k+1)/(2(n+1)) · π ) . (6.2.3.37)
Trick: Transform p into a 1-periodic function q, which turns out to be a trigonometric polynomial according to Def. 4.2.6.25. Using the definition of the Chebychev polynomials and Euler's formula we get
q(s) := p(cos 2πs) (Def. 6.2.3.3)= ∑_{j=0}^{n} α_j T_j(cos 2πs) = ∑_{j=0}^{n} α_j cos(2πjs)
  = ∑_{j=0}^{n} ½ α_j ( exp(2πıjs) + exp(−2πıjs) )    [ by cos z = ½(e^{ız} + e^{−ız}) ]
  = ∑_{j=−n}^{n+1} β_j exp(−2πıjs) , with β_j := { 0 , for j = n+1 ; ½ α_j , for j = 1, ..., n ; α_0 , for j = 0 ; ½ α_{−j} , for j = −n, ..., −1 } .   (6.2.3.38)
The interpolation conditions (6.2.3.37) for p become interpolation conditions for q in transformed nodes,
which turn out to be equidistant:
t = cos(2πs)  ⟹(6.2.3.37)  q( (2k+1)/(4(n+1)) ) = y_k := f(t_k) , k = 0, ..., n . (6.2.3.39)
This amounts to a Lagrange polynomial interpolation problem for equidistant points on the unit circle as
we have seen them in Section 5.6.3.
q(s) = q(1 − s)
⟹ (with (6.2.3.39))
q( 1 − (2k+1)/(4(n+1)) ) = y_k , k = 0, ..., n .
Thanks to the symmetry of q, see Fig. 231, we can augment the interpolation conditions (6.2.3.39) and
demand
q( k/(2(n+1)) + 1/(4(n+1)) ) = z_k := { y_k , for k = 0, ..., n ; y_{2n+1−k} , for k = n+1, ..., 2n+1 } . (6.2.3.40)
[Fig. 231: the 1-periodic function q on [0,1]; its graph is symmetric with respect to s = 1/2.]
Trigonometric interpolation at equidistant points can be done very efficiently by means of FFT-based algo-
rithms, see Code 5.6.3.4. We can also apply these for the computation of Chebychev expansion coeffi-
cients.
From (6.2.3.40) we can derive a 2(n+1) × 2(n+1) square linear system of equations for the unknown coefficients β_j.
q( k/(2(n+1)) + 1/(4(n+1)) ) = ∑_{j=−n}^{n+1} β_j exp(−2πı j/(4(n+1))) · exp(−2πı kj/(2(n+1))) = z_k
⇕
∑_{j=0}^{2n+1} β_{j−n} exp(−2πı (j−n)/(4(n+1))) · ω_{2(n+1)}^{kj} = exp(−πı nk/(n+1)) · z_k , k = 0, ..., 2n+1 , with ω_{2(n+1)} := exp(−2πı/(2(n+1))) ,
⇕
F_{2(n+1)} c = b with c = [ β_{j−n} exp(−2πı (j−n)/(4(n+1))) ]_{j=0}^{2n+1} , b = [ exp(−πı nk/(n+1)) z_k ]_{k=0}^{2n+1} . (6.2.3.41)
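The index bookkeeping behind (6.2.3.38)–(6.2.3.41) can be made concrete by the following sketch, which assembles and solves the system (6.2.3.41) with a dense Fourier matrix instead of an FFT (so it costs O(n³) and is for illustration only); the function name chebexp_dense and the direct linear solve are assumptions, not the course's GITLAB implementation.
#include <Eigen/Dense>
#include <cmath>
#include <complex>
#include <functional>

// Compute the Chebychev expansion coefficients alpha_0,...,alpha_n of the
// degree-n Chebychev interpolant of f by solving (6.2.3.41) directly.
Eigen::VectorXd chebexp_dense(const std::function<double(double)> &f, int n) {
  using namespace std::complex_literals;
  const double pi = std::acos(-1.0);
  const int N = 2 * (n + 1);  // size of the linear system
  // sample f at the Chebychev nodes (6.2.3.37) and build z_k as in (6.2.3.40)
  Eigen::VectorXd y(n + 1);
  for (int k = 0; k <= n; ++k)
    y(k) = f(std::cos((2.0 * k + 1.0) / (2.0 * (n + 1)) * pi));
  Eigen::VectorXcd z(N), b(N);
  for (int k = 0; k < N; ++k) z(k) = (k <= n) ? y(k) : y(2 * n + 1 - k);
  for (int k = 0; k < N; ++k)
    b(k) = std::exp(-1i * pi * (double)(n * k) / (double)(n + 1)) * z(k);
  // Fourier matrix F_{2(n+1)} with entries omega^{kj}, omega = exp(-2*pi*i/N)
  Eigen::MatrixXcd F(N, N);
  for (int k = 0; k < N; ++k)
    for (int j = 0; j < N; ++j)
      F(k, j) = std::exp(-2.0i * pi * (double)(k * j) / (double)N);
  const Eigen::VectorXcd c = F.lu().solve(b);  // c_j = beta_{j-n}*exp(-2*pi*i*(j-n)/(4(n+1)))
  // undo the phase factor and read off alpha_j from beta_j, cf. (6.2.3.38)
  Eigen::VectorXd alpha(n + 1);
  for (int j = 0; j <= n; ++j) {
    const std::complex<double> beta_j =
        c(j + n) * std::exp(2.0i * pi * (double)j / (double)(4 * (n + 1)));
    alpha(j) = (j == 0) ? beta_j.real() : 2.0 * beta_j.real();
  }
  return alpha;
}
In practice the solve is replaced by an FFT of length 2(n+1), which is exactly what makes this approach efficient.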
of the Lagrange polynomial interpolant of f ∈ C0 ([−1, 1]) for an arbitrary node set T = {t0 , . . . , tn },
n ∈ N.
(Q6.2.3.44.B) Devise an efficient algorithm for evaluating a polynomial of degree n ∈ N given through its Chebychev expansion
p(t) = ∑_{k=0}^{n} a_k T_k(t) , t ∈ R , a_k ∈ R ,
at the points x_k := cos(kπ/m) , k = 0, ..., m , m ∈ N .
Mean square norms generalize the Euclidean norm on K n , see [NS02, Sect. 4.4]. In a sense, they endow
a vector space with a geometry and give a meaning to concepts like “orthogonality”.
Let V be a vector space over the field K. A mapping b : V × V → K is called an inner product on V, if it satisfies
(i) b is linear in the first argument: b(αv + βw, u) = α b(v,u) + β b(w,u) for all α, β ∈ K, u, v, w ∈ V,
(ii) b is (anti-)symmetric: b(v,w) = conj(b(w,v)) (conj =̂ complex conjugation),
(iii) b is positive definite: v ≠ 0 ⇔ b(v,v) > 0.
b is a semi-inner product, if it still complies with (i) and (ii), but is only positive semi-definite: b(v,v) ≥ 0 for all v ∈ V.
✎ notation: usually we write (·, ·)V for an inner product on the vector space V .
Let V be a vector space equipped with a (semi-)inner product (·, ·)V . Any two elements v and w of
V are called orthogonal, if (v, w)V = 0. We write v ⊥ w.
If (·,·)_V is a (semi-)inner product (→ Def. 6.3.1.1) on the vector space V, then
‖v‖_V := √( (v,v)_V )
defines a (semi-)norm (→ Def. 1.5.5.4) on V, the mean square (semi-)norm / inner product (semi-)norm induced by (·,·)_V.
From § 3.1.1.8 we know that in Euclidean space K n the best approximation of vector x ∈ K n in a subspace
V ⊂ K n is unique and given by the orthogonal projection of x onto V . Now we generalize this to vector
spaces equipped with inner products.
X =̂ a vector space over K = R, equipped with a mean square semi-norm ‖·‖_X induced by a semi-inner product (·,·)_X, see Thm. 6.3.1.3.
It can be an infinite dimensional function space, e.g., X = C0 ([ a, b]).
Assumption 6.3.1.6.
The semi-inner product (·, ·) X is a genuine inner product (→ Def. 6.3.1.1) on V , that is, it is positive
definite: (v, v) X > 0 ∀v ∈ V \ {0}.
Now we give a formula for the element q of V which is nearest to a given element f of X with respect to the (semi-)norm ‖·‖_X.
Theorem 6.3.1.7. Mean square norm best approximation through normal equations
Let {b_1, ..., b_N}, N := dim V, be a basis of V ⊂ X and let Ass. 6.3.1.6 hold. If the coefficient vector c = [γ_j]_{j=1}^{N} ∈ K^N solves the normal equations
M c = b , M := [ (b_k, b_j)_X ]_{k,j=1}^{N} , b := [ (f, b_j)_X ]_{j=1}^{N} , (6.3.1.8)
then q := ∑_{j=1}^{N} γ_j b_j is the unique best approximant of f in V:
‖f − q‖_X = inf_{p∈V} ‖f − p‖_X .
Proof. (inspired by Rem. 3.1.2.5) We first show that M is s.p.d. (→ Def. 1.1.2.6). Symmetry is clear from
the definition and the symmetry of (·, ·) X . That M is even positive definite follows from
x^H M x = ∑_{k=1}^{N} ∑_{j=1}^{N} ξ_k ξ_j (b_k, b_j)_X = ‖ ∑_{j=1}^{N} ξ_j b_j ‖_X² > 0 , (6.3.1.9)
if x := [ξ_j]_{j=1}^{N} ≠ 0 ⇔ ∑_{j=1}^{N} ξ_j b_j ≠ 0 , since ‖·‖_X is a norm on V by Ass. 6.3.1.6.
Now, writing c := [γ_j]_{j=1}^{N} ∈ K^N, b := [ (f, b_j)_X ]_{j=1}^{N} ∈ K^N, and using the basis representation
q = ∑_{j=1}^{N} γ_j b_j ,
we find
Since M is s.p.d., the unique solution of grad Φ(c) = Mc − b = 0 yields the unique global minimizer of
Φ; the Hessian 2M is s.p.d. everywhere!
✷
The message of Cor. 6.3.1.12: (f − q, p)_X = 0 ∀ p ∈ V ⇔ f − q ⊥ V , i.e., the best approximation error is orthogonal to the approximation space V.
Remark 6.3.1.13 (Connection with linear least squares problems, Chapter 3) In Section 3.1.1 we introduced the concept of least squares solutions of overdetermined linear systems of equations Ax = b, A ∈ R^{m,n}, m > n, see Def. 3.1.1.1. Thm. 3.1.2.1 taught that the normal equations A^⊤A x = A^⊤ b give the least squares solution, if rank(A) = n.
In fact, Thm. 3.1.2.1 and the above Thm. 6.3.1.7 agree if X = K n (Euclidean space) and V =
Span{a1 , . . . , an }, where a j ∈ R m are the columns of A and N = n. y
In the setting of Section 6.3.1.2 we may ask: Which choice of basis B = {b_1, ..., b_N} of V ⊂ X renders the normal equations (6.3.1.8) particularly simple? Answer: a basis B for which (b_k, b_j)_X = δ_{kj} (δ_{kj} the Kronecker symbol), because this will imply M = I for the coefficient matrix of the normal equations. Then
q = ∑_{j=1}^{N} (f, b_j)_X · b_j . (6.3.1.16)
§6.3.1.17 (Gram-Schmidt orthonormalization) From Section 1.5.1 we already know how to compute
orthonormal bases: The algorithm from § 1.5.1.1 can be run in the framework of any vector space V
endowed with an inner product (·, ·)V and induced mean square norm k·kV .
7:  else { b_j ← b_j / ‖b_j‖_V }
(The algorithm guarantees Span{b_1, ..., b_ℓ} = Span{p_1, ..., p_ℓ} for all ℓ ∈ {1, ..., k}.)
This suggests the following alternative approach to the computation of the mean square best approximant
q in V of f ∈ X :
➊ Orthonormalize a basis {b1 , . . . , b N } of V , N := dim V , using Gram-Schmidt algorithm (6.3.1.18).
➋ Compute q according to (6.3.1.16).
Number of inner products to be evaluated: O( N 2 ) for N → ∞.
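A minimal sketch of step ➊, assuming the elements of V are represented by coefficient vectors and the inner product is supplied as a callable on such vectors; all names are illustrative and not the course's reference implementation.
#include <Eigen/Dense>
#include <cmath>
#include <functional>

using InnerProduct = std::function<double(const Eigen::VectorXd &, const Eigen::VectorXd &)>;

// Orthonormalize the columns of B with respect to the inner product ip
// (classical Gram-Schmidt as in (6.3.1.18); assumes the columns are linearly independent).
Eigen::MatrixXd gramschmidt(const Eigen::MatrixXd &B, const InnerProduct &ip) {
  Eigen::MatrixXd Q = B;
  for (int j = 0; j < Q.cols(); ++j) {
    for (int l = 0; l < j; ++l)                     // project out previous directions
      Q.col(j) -= ip(Q.col(l), Q.col(j)) * Q.col(l);
    const double nrm = std::sqrt(ip(Q.col(j), Q.col(j)));
    Q.col(j) /= nrm;                                // normalize (requires nrm > 0)
  }
  return Q;
}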
Review question(s) 6.3.1.20 (Mean-square best approximation: abstract theory)
(Q6.3.1.20.A) Let V be a finite-dimensional vector space with inner product (·, ·)V . What is an orthonor-
mal basis (ONB) of V ?
(Q6.3.1.20.B) Let X be a finite-dimensional real vector space with inner product (·, ·) X and equipped with
a basis {b1 , . . . , b N }, N := dim X .
Show that the coefficient matrix
M := [ (b_k, b_j)_X ]_{k,j=1}^{N} ∈ R^{N,N}
is symmetric positive definite, i.e.,
M = M^H and ∀ x ∈ K^N : x^H M x > 0 ⇔ x ≠ 0 .
Remark 6.3.2.1 (Inner products on spaces Pm of polynomials) To match the abstract framework of
Section 6.3.1 we need to find (semi-)inner products on C0 ([ a, b]) that supply positive definite inner prod-
ucts on Pm . The following options are commonly considered:
✦ On any interval [a,b] we can use the L²([a,b])-inner product (·,·)_{L²([a,b])}, defined in (6.3.1.5).
✦ Given a positive integrable weight function
w : [a,b] → R , w(t) > 0 for all t ∈ [a,b] , ∫_a^b |w(t)| dt < ∞ , (6.3.2.2)
we can use the corresponding weighted L²-inner product (f, g)_w := ∫_a^b w(t) f(t) g(t) dt, cf. (6.3.2.3).
✦ For n ≥ m and n+1 distinct points collected in the set T := {t_0, t_1, ..., t_n} ⊂ [a,b] we can use the discrete L²-inner product
(f, g)_T := ∑_{j=0}^{n} f(t_j) g(t_j) . (6.3.2.4)
that is, multiplication with the independent variable can be shifted to the other function inside the inner
product.
✎ notation: Note that we have to plug a function into the slots of the inner products; this is indicated by
the notation {t 7→ . . .}.
The ideas of Section 6.3.1.3 that center around the use of orthonormal bases can also be applied to
polynomials.
The sequence of orthonormal polynomials from Def. 6.3.2.7 is unique up to signs, supplies an
(·, ·) X -orthonormal basis (→ Def. 6.3.1.14) of Pm , and satisfies
Proof. Comparing Def. 6.3.1.14 and (6.3.2.8) the ONB-property of {r0 , . . . , rm } is immediate. Then
(6.3.2.8) follows from dimensional considerations.
P_{k−1} ⊂ P_k has co-dimension 1, so that there is a unit "vector" in P_k which is orthogonal to P_{k−1} and unique up to sign.
✷
Hence s_k(t) := t · r_k(t) is a polynomial of degree k+1 with leading coefficient ≠ 0, that is, s_k ∈ P_{k+1} \ P_k. Therefore, r_{k+1} can be obtained by orthogonally projecting s_k onto P_k plus normalization, cf. Lines 4–5 of Algorithm (6.3.1.18):
r_{k+1} = ± r̃_{k+1} / ‖r̃_{k+1}‖_X , r̃_{k+1} = s_k − ∑_{j=0}^{k} (s_k, r_j)_X r_j . (6.3.2.12)
The sum in (6.3.2.12) collapses to two terms! In fact, since (rk , q) X = 0 for all q ∈ Pk−1 , by Ass. 6.3.2.6
(s_k, r_j)_X (6.3.2.5)= ( {t ↦ t r_k(t)}, r_j )_X = ( r_k, {t ↦ t r_j(t)} )_X = 0 , if j < k − 1 ,
because in this case {t ↦ t r_j(t)} ∈ P_{k−1}. As a consequence (6.3.2.12) reduces to the 3-term recursion
r_{k+1} = ± r̃_{k+1} / ‖r̃_{k+1}‖_X , r̃_{k+1} = s_k − ({t ↦ t r_k(t)}, r_k)_X r_k − ({t ↦ t r_k(t)}, r_{k−1})_X r_{k−1} , k = 1, ..., m−1 . (6.3.2.13)
The 3-term recursion (6.3.2.13) can be recast in various ways. Forgoing normalization the next theorem
presents one of them.
Proof. (by rather straightforward induction) We first confirm, thanks to the definition of α1 ,
For the induction step we assume that the assertion is true for p0 , . . . , pk and observe that for pk+1
according to (6.3.2.15) we have
This amounts to the assertion of orthogonality for k + 1. Above, several inner product vanish because of
the induction hypothesis!
✷
Remark 6.3.2.16 (L²([−1,1])-orthogonal polynomials) An important inner product on C^0([−1,1]) is the L²-inner product, see (6.3.1.5),
(f, g)_{L²([−1,1])} := ∫_{−1}^{1} f(τ) g(τ) dτ , f, g ∈ C^0([−1,1]) .
It is a natural question what is the unique sequence of L2 ([−1, 1])-orthonormal polynomials. Their rather
simple characterization will be discussed in the sequel.
Legendre polynomials
The Legendre polynomials P_n can be defined by the 3-term recursion
P_{n+1}(t) := (2n+1)/(n+1) · t·P_n(t) − n/(n+1) · P_{n−1}(t) , P_0 := 1 , P_1(t) := t . (7.4.2.21)
[Fig. 233: graphs of the Legendre polynomials P_n(t), n = 0, ..., 5, on [−1,1].]
§6.3.2.20 (Discrete orthogonal polynomials) Since they involve integrals, weighted L²-inner products (6.3.2.3) are not accessible computationally, unless one resorts to approximation, see Chapter 7 for the corresponding theory and techniques.
Therefore, given a point set T := {t_0, t_1, ..., t_n}, we focus on the associated discrete L²-inner product
(f, g)_X := (f, g)_T := ∑_{j=0}^{n} f(t_j) g(t_j) , f, g ∈ C^0([a,b]) , (6.3.2.4)
The polynomials pk generated by the 3-term recursion (6.3.2.15) from Thm. 6.3.2.14 are then called
discrete orthogonal polynomials. The following C++ code computes the recursion coefficients αk and β k ,
k = 1, . . . , n − 1.
C++ code 6.3.2.21: Computation of weights in 3-term recursion for discrete orthogonal polynomials ➺ GITLAB
// Computation of coefficients α, β from Thm. 6.3.2.14
// IN : t = points in the definition of the discrete L2-inner product
//      n = maximal index desired
//      alpha, beta are used to save the coefficients of the recursion
void coeffortho(const VectorXd &t, const Index n, VectorXd &alpha, VectorXd &beta) {
  const Index m = t.size();  // maximal degree of orthogonal polynomial
  // ... (remainder of the listing not reproduced here; see GITLAB)
}
§6.3.2.22 (Polynomial fitting) Given a point set T := {t0 , t1 , . . . , tn } ⊂ [ a, b], and a function f : [ a, b] →
K, we may seek to approximate f by its polynomial best approximant with respect to the discrete L2 -norm
k·kT induced by the discrete L2 -inner product (6.3.2.4).
The stable and efficient computation of fitting polynomials can rely on combining Thm. 6.3.2.14 with
Cor. 6.3.1.15:
➊ (Pre-)compute the weights αℓ and β ℓ for the 3-term recursion (6.3.2.15).
➋ (Pre-)compute the values of the orthogonal polynomials pk at desired evaluation points xi ∈ R,
i = 1, . . . , N .
➌ Compute the inner products (f, p_ℓ)_X, ℓ = 0, ..., k, and use (6.3.1.16) to linearly combine the vectors [p_ℓ(x_i)]_{i=1}^{N}, ℓ = 0, ..., k.
This yields [q(x_i)]_{i=1}^{N}, where q ∈ P_k is the fitting polynomial; a sketch illustrating these steps follows below. y
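The following sketch makes the three steps concrete; it replaces the 3-term recursion by a QR-based orthonormalization of the monomial basis with respect to (6.3.2.4), which is equivalent for moderate degrees but less stable and efficient than the course's recursion-based approach. The function name polyfit_eval and the concrete interface are assumptions.
#include <Eigen/Dense>
#include <functional>

// Evaluate the degree-k fitting polynomial of f (w.r.t. the discrete L2-inner
// product (6.3.2.4) on the points t) at the evaluation points x.
Eigen::VectorXd polyfit_eval(const std::function<double(double)> &f,
                             const Eigen::VectorXd &t, const Eigen::VectorXd &x, int k) {
  const Eigen::Index n = t.size();
  // values of the monomial basis 1, s, ..., s^k at the points t_j and x_i
  Eigen::MatrixXd Vt = Eigen::MatrixXd::Ones(n, k + 1),
                  Vx = Eigen::MatrixXd::Ones(x.size(), k + 1);
  for (int l = 1; l <= k; ++l) {
    Vt.col(l) = Vt.col(l - 1).cwiseProduct(t);
    Vx.col(l) = Vx.col(l - 1).cwiseProduct(x);
  }
  // step 1: orthonormalize w.r.t. (6.3.2.4) = Euclidean inner product of value vectors
  Eigen::HouseholderQR<Eigen::MatrixXd> qr(Vt);
  Eigen::MatrixXd Q = qr.householderQ() * Eigen::MatrixXd::Identity(n, k + 1);
  Eigen::MatrixXd R = qr.matrixQR().topRows(k + 1).triangularView<Eigen::Upper>();
  // step 2: values of the orthonormal polynomials p_l at the evaluation points
  Eigen::MatrixXd Px = Vx * R.inverse();
  // step 3: coefficients (f, p_l)_T and linear combination (6.3.1.16)
  Eigen::VectorXd fvals(n);
  for (Eigen::Index j = 0; j < n; ++j) fvals(j) = f(t(j));
  return Px * (Q.transpose() * fvals);
}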
EXAMPLE 6.3.2.24 (Approximation by discrete polynomial fitting) We use equidistant points T :=
{tk = −1 + k m2 , k = 0, . . . , m} ⊂ [−1, 1], m ∈ N to compute fitting polynomials (→ Def. 6.3.2.23) for
two different functions.
We monitor the L²-norm and L∞-norm of the approximation error, both norms approximated by sampling in ξ_j = −1 + j/500, j = 0, ..., 1000.
➀ f(t) = (1 + (5t)²)^{−1}, I = [−1,1] → Ex. 6.2.2.11, analytic in a complex neighborhood of [−1,1]:
[Fig. 234: f and its fitting polynomials on [−1,1]; Fig. 235: error norms versus polynomial degree n (lin-log scale).]
➣ We observe exponential convergence (→ Def. 6.2.2.7) in the polynomial degree n.
[Fig. 236: second test function ➁ and its fitting polynomials on [−1,1]; Fig. 237: error norms versus polynomial degree n (log-log scale).]
➣ We observe only algebraic convergence (→ Def. 6.2.2.7) in the polynomial degree n (for n ≪ m!).
➂ "bump function"
f(t) = max{ cos(4π|t + ¼|), 0 } .
[Fig. 238: L∞- and L²-norms of the approximation error versus polynomial degree n (log-log scale), for m = 50, 100, 200, 400 sampling points.]
➣ We observe only algebraic convergence.
y
Review question(s) 6.3.2.25 (Polynomial mean-square best approximation)
(Q6.3.2.25.A) Given a, b ∈ R and a positive continuous weight function w ∈ C^0([a,b]), w(t) > 0 for all t ∈ [a,b], write P_0, P_1, P_2, ..., P_j ∈ P_j, j ∈ N_0, for the sequence of orthonormal polynomials with respect to the corresponding weighted inner product.
Give an indirect proof that P_j must have j distinct zeros in ]a,b[. To that end assume that P_j has only ℓ < j zeros z_1, ..., z_ℓ in ]a,b[ at which it changes sign, and consider the polynomial
q(t) := (t − z_1) ⋯ (t − z_ℓ) , q ∈ P_ℓ ,
q ∈ argmin_{p∈P_n} ‖f − p‖_{L∞(I)} .
The results of Section 6.3.1 cannot be applied because the supremum norm is not induced by an inner
product on Pn .
Theory provides us with surprisingly precise necessary and sufficient conditions to be satisfied by the
polynomial L∞ ([ a, b])-best approximant q.
q = argmin_{p∈P_n} ‖f − p‖_{L∞(I)}
if and only if there exist n+2 points a ≤ ξ_0 < ξ_1 < ⋯ < ξ_{n+1} ≤ b such that
f(ξ_k) − q(ξ_k) = σ·(−1)^k·‖f − q‖_{L∞(I)} , k = 0, ..., n+1 , for σ = 1 or σ = −1 (alternation condition).
§6.4.0.3 (Remez algorithm) The widely used iterative algorithm (Remez algorithm) for finding an L∞ -
best approximant is motivated by the alternation theorem. The idea is to determine successively better
approximations of the set of alternants: A(0) → A(1) → . . ., ♯A(l ) = n + 2.
Key is the observation that, due to the alternation theorem, the polynomial L∞ ([ a, b])-best approximant q
will satisfy (one of the) interpolation conditions
➁ Given approximate alternants A^{(l)} := { ξ_0^{(l)} < ξ_1^{(l)} < ⋯ < ξ_n^{(l)} < ξ_{n+1}^{(l)} } ⊂ [a,b] determine q ∈ P_n and a deviation δ ∈ R satisfying the extended interpolation conditions
q(ξ_k^{(l)}) + (−1)^k δ = f(ξ_k^{(l)}) , k = 0, ..., n+1 . (6.4.0.6)
After choosing a basis for P_n, this is an (n+2) × (n+2) linear system of equations, cf. § 5.1.0.21.
➂ Choose A^{(l+1)} as the set of extremal points of f − q, truncated in case more than n+2 of these exist.
These extrema can be located approximately by sampling on a fine grid covering [a,b]. If the derivative of f ∈ C¹([a,b]) is available, too, then search for zeros of (f − q)′ using the secant method from § 8.4.2.28.
➃ If k f − qk L∞ ([a,b]) | ≤ TOL · kdk L∞ ([a,b]) STOP, else GOTO ➁.
(TOL is a prescribed relative tolerance.)
C++ code 6.4.0.7: Remez algorithm for uniform polynomial approximation on an interval ➺ GITLAB
// IN : f = handle to the function, point evaluation
// ... (intermediate lines of the listing not reproduced here) ...
// error
VectorXd F0 = polyval(cd, xx0) - feval(df, xx0);
// initial guesses from shifting sampling points
VectorXd xx1 = xx0 + (b - a) / (2 * n) * VectorXd::Ones(xx0.size());
VectorXd F1 = polyval(cd, xx1) - feval(df, xx1);
// Main loop of the secant method
while (F1.cwiseAbs().minCoeff() > 1e-12) {
  const VectorXd xx2 = xx1 - (F1.cwiseQuotient(F1 - F0)).cwiseProduct(xx1 - xx0);
  xx0 = xx1;
  xx1 = xx2;
  F0 = F1;
  F1 = polyval(cd, xx1) - feval(df, xx1);
}
EXPERIMENT 6.4.0.8 (Convergence of Remez algorithm) We examine the convergence of the Remez
algorithm from Code 6.4.0.7 for two different functions:
[Fig. 240, Fig. 241: maximum norm of the approximation error versus the step of the Remez algorithm (semi-logarithmic scale), for polynomial degrees n = 3, 5, 7, 9, 11 and the two test functions.]
Convergence in both cases; faster convergence observed for smooth function, for which machine precision
is reached after a few steps. y
Review question(s) 6.4.0.9 (Uniform best approximation)
△
f ∈ C 0 (R ) , f ( t + 1) = f ( t ) ∀ t ∈ R .
Policy: In the interest of "structure preservation" approximate f in a space of functions with "built-in" 1-periodicity. This already rules out approximation on [0,1] by global polynomials, because those can never be extended to globally 1-periodic functions.
The natural space for approximating generic periodic functions is a space of trigonometric polyno-
mials with the same period.
Terminology: P_{2n}^T =̂ space of trigonometric polynomials of degree 2n.
From Section 5.6 remember a few more facts about trigonometric polynomials and trigonometric interpo-
lation:
✦ Cor. 5.6.1.6: dimension of the space of trigonometric polynomials: dim P_{2n}^T = 2n+1.
✦ Trigonometric interpolation can be reduced to polynomial interpolation on the unit circle S1 ⊂ C in
the complex plane, see (5.6.1.5).
existence & uniqueness of trigonometric interpolant q satisfying (6.5.1.2) and (6.5.1.3)
✦ There are very efficient FFT-based algorithms for trigonometric interpolation in equidistant nodes
tk = 2nk+1 , k = 0, . . . , 2n, see Code 5.6.3.4.
The relationship of trigonometric interpolation and polynomial interpolation on the unit circle suggests a
uniform distribution of nodes for general trigonometric interpolation.
✎ notation: trigonometric interpolation operator in the 2n+1 equidistant nodes t_k = k/(2n+1), k = 0, ..., 2n:
T_n : C^0([0,1[) → P_{2n}^T , T_n(f)(t_k) = f(t_k) ∀ k ∈ {0, ..., 2n} . (6.5.1.5)
(Q6.5.1.6.A) State the matrix C ∈ C^{2n+1,2n+1} converting a representation in the trigonometric basis into a representation with respect to the exponential basis.
(Q6.5.1.6.B) When can a function f 0 ∈ C m ([0, 1]) be extended to a 1-periodic function f ∈ C m (R )?
That f is an extension of f 0 means that f |[0,1] ≡ f 0 .
Video tutorial for Section 6.5.2 "Trigonometric Interpolation Error Estimates": (14 minutes)
Download link, tablet notes
From (6.5.1.5) we use the notation T_n for trigonometric interpolation in the 2n+1 equidistant nodes t_k := k/(2n+1). Our focus will be on the asymptotic behavior of the norms of the interpolation error f − T_n(f) for functions f : [0,1[ → C with different smoothness properties. To begin with we report an empiric study.
EXPERIMENT 6.5.2.1 (Interpolation error: trigonometric interpolation) Now we study the asymptotic
behavior of the error of equidistant trigonometric interpolation as n → ∞ in a numerical experiment for
functions with different smoothness properties.
#1 Step function: f(t) = 0 for |t − ½| > ¼ , f(t) = 1 for |t − ½| ≤ ¼ .
#2 C^∞ periodic function: f(t) = 1/√(1 + ½ sin(2πt)) .
[Fig. 243: maximum norm, Fig. 244: L²([0,1])-norm of the interpolation error versus n = 2, 4, 8, ..., 128 (log-log scale), for the functions #1, #2, #3.]
Observations: Function #1: no convergence in L∞ -norm, algebraic convergence in L2 -norm
Function #3: algebraic convergence in both norms
Function #2: exponential convergence in both norms
We conclude that in this experiment higher smoothness of f leads to faster convergence of the trigono-
metric interplant. y
EXPERIMENT 6.5.2.2 (Gibbs phenomenon) Of course the smooth trigonometric interpolants of the
step function must fail to converge in the L∞ -norm in Exp. 6.5.2.1. Moreover, they will not even converge
“visually” to the step function, which becomes manifest by a closer inspection of the interpolants.
1.2 1.2
f f
p p
1 1
0.8 0.8
0.6 0.6
0.4 0.4
0.2 0.2
0 0
−0.2 −0.2
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
t t
n = 16 n = 128
We observe massive “overshooting oscillations” in a neighborhood of the discontinuity. This is the notori-
ous Gibbs phenomenon affecting approximation schemes relying on trigonometric polynomials. y
§6.5.2.3 (Fourier series → (4.2.6.7)) From (6.5.0.1b) we know that the complex vector space of trigonometric polynomials P_{2n}^T is spanned by the 2n+1 Fourier modes t ↦ exp(2πıkt) of "lowest frequencies", |k| ≤ n.
Now let us make a connection: In Section 4.2.6 we learned that every function f : [0, 1[→ C with finite
L²([0,1])-norm,
‖f‖²_{L²([0,1])} := ∫_0^1 |f(t)|² dt < ∞ ,
can be expanded into a Fourier series
f(t) = ∑_{k=−∞}^{∞} f̂_k exp(−2πıkt) in L²([0,1]) , f̂_k := ∫_0^1 f(t) exp(2πıkt) dt . (6.5.2.5)
We add that a limit in L²([0,1]) means that ‖ f − ∑_{k=−M}^{M} f̂_k exp(−2πık·) ‖_{L²([0,1])} → 0 for M → ∞. Also note the customary notation f̂_k for the Fourier coefficients.
Seeing (6.5.2.4) and (6.5.2.5) side-by-side and understanding that trigonometric polynomials are finite
Fourier series suggests that we investigate the approximation of Fourier series by trigonometric inter-
polants.
Remark 6.5.2.6 ( L2 (]0, 1[): Natural setting for trigonometric interpolation) A fundamental result about
functions given through Fourier series was the following fundamental isometry property of the mapping
taking a function to the sequence of its Fourier coefficients.
If the Fourier coefficients satisfy ∑_{k∈Z} |ĉ_k|² < ∞, then the Fourier series converges in L²([0,1]).
This paves the way for estimating the L²([0,1])-norm of interpolation/approximation errors, once we have information about the decay of their Fourier coefficients.
The L²([0,1])-norm is also a highly relevant quantity in engineering applications, where t ↦ c(t) is regarded as a time-dependent signal. In this case ‖c‖_{L²([0,1])} is the root mean square (RMS) power of the signal. y
§6.5.2.7 (Aliasing) Guided by the insights from § 6.5.2.3, we study the action of the trigonometric
interpolation operator T_n from (6.5.1.5) on individual Fourier modes μ_k(t) := exp(−2πıkt). For the equidistant nodes t_j = j/(2n+1) we find
μ_k(t_j) = exp(−2πık · j/(2n+1)) = exp(−2πı(k − ℓ(2n+1)) · j/(2n+1)) = μ_{k−ℓ(2n+1)}(t_j) ∀ ℓ ∈ Z .
When sampled on the node set Tn := {t0 , . . . , t2n } all the Fourier modes µk−ℓ(2n+1) , ℓ ∈ Z, yield
the same values. Thus trigonometric interpolation cannot distinguish them! This phenomenon is called
aliasing.
Aliasing demonstrated for f (t) = sin(2π · 19t) = Im(exp(2πı19t)) for different node sets.
[Three panels: f(t) = sin(2π·19t) and its trigonometric interpolants p on [0,1] for different node sets.]
T_n μ_k = μ_{k̃} , k̃ ∈ {−n, ..., n} , k − k̃ ∈ (2n+1)Z [ k̃ := k mod (2n+1) ] . (6.5.2.8)
For instance, ñ = n, whereas n+1 is mapped to −n, −n−1 to n, and 2n to −1, etc.
Trigonometric interpolation by Tn maps all Fourier modes (“frequencies”) to another single Fourier
mode in the finite range {−n, . . . , n}.
y
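As a tiny illustration of the folding k ↦ k̃ in (6.5.2.8), the following helper computes the representative frequency in {−n, ..., n}; the function name is made up.
// Map a frequency k to the representative k_tilde in {-n,...,n} that the
// trigonometric interpolation operator T_n cannot distinguish from k, cf. (6.5.2.8).
int aliasedFrequency(int k, int n) {
  const int N = 2 * n + 1;
  const int r = ((k % N) + N) % N;  // k mod (2n+1), in {0,...,2n}
  return (r <= n) ? r : r - N;      // shift into {-n,...,n}
}
For example, aliasedFrequency(2*n, n) returns −1 and aliasedFrequency(n+1, n) returns −n, matching the instances listed above.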
We can read the trigonometric polynomial T_n f ∈ P_{2n}^T as a Fourier series with non-zero coefficients only in the index range {−n, ..., n}. Thus, for the Fourier coefficients Ê_j of the trigonometric interpolation error E(t) := f(t) − T_n f(t) we find from (6.5.2.10)
Ê_j = { −∑_{ℓ∈Z\{0}} f̂_{j+ℓ(2n+1)} , if j ∈ {−n, ..., n} ; f̂_j , if |j| > n } , j ∈ Z . (6.5.2.11)
we conclude from the isometry property asserted in Thm. 4.2.6.33 and the triangle inequality
‖f − T_n f‖²_{L²(]0,1[)} = ∑_{j=−n}^{n} | ∑_{ℓ∈Z\{0}} f̂_{j+ℓ(2n+1)} |² + ∑_{|j|>n} |f̂_j|² , (6.5.2.13)
‖f − T_n f‖_{L∞(]0,1[)} ≤ ∑_{j=−n}^{n} ∑_{ℓ∈Z\{0}} |f̂_{j+ℓ(2n+1)}| + ∑_{|j|>n} |f̂_j| . (6.5.2.14)
In order to estimate these norms of the trigonometric interpolation error we need quantitative information
about the decay of the Fourier coefficients fbj as | j| → ∞. y
For the Fourier coefficients of the derivatives of a 1-periodic function f ∈ C^{k−1}(R), k ∈ N, with integrable k-th derivative f^{(k)} it holds that
(f^{(k)})̂_j = (−2πıj)^k · f̂_j , j ∈ Z .
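A one-line justification (sketch): for k = 1, integration by parts and 1-periodicity give
(f′)̂_j = ∫_0^1 f′(t) e^{2πıjt} dt = [ f(t) e^{2πıjt} ]_0^1 − 2πıj ∫_0^1 f(t) e^{2πıjt} dt = −2πıj f̂_j ,
since the boundary terms cancel; iterating this k times yields the stated factor (−2πıj)^k.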
§6.5.2.18 (Fourier coefficients and smoothness) From Lemma 6.5.2.17 and the trivial estimate (|exp(2πıjt)| = 1)
|f̂_j| ≤ ∫_0^1 |f(t)| dt ≤ ‖f‖_{L¹(]0,1[)} ∀ j ∈ Z , (6.5.2.19)
we conclude that the sequence ( (2π|j|)^m f̂_j )_{j∈Z}, m ∈ N, is bounded, provided that f ∈ C^{m−1}(R) with integrable m-th derivative.
If f ∈ C k−1 (R ) with integrable k-th derivative, then fbj = O(| j|−k ) for | j| → ∞
The smoother a periodic function the faster the decay of its Fourier coefficients
The isometry property of Thm. 4.2.6.33 also yields for f ∈ C^{k−1}(R) with f^{(k)} ∈ L²(]0,1[) that
‖f^{(k)}‖²_{L²(]0,1[)} = (2π)^{2k} ∑_{j=−∞}^{∞} |j|^{2k} |f̂_j|² . (6.5.2.22)
We can now combine the identity (6.5.2.22) with (6.5.2.13) and obtain an interpolation error estimate in
L2 (]0, 1[)-norm.
with c_k = 2 ∑_{ℓ=1}^{∞} (2ℓ − 1)^{−2k} < ∞.
As a tool we will need the Cauchy-Schwarz inequality for square-summable sequences:
| ∑_{ℓ=1}^{∞} a_ℓ b_ℓ |² ≤ ∑_{ℓ=1}^{∞} |a_ℓ|² · ∑_{ℓ=1}^{∞} |b_ℓ|² ∀ (a_ℓ), (b_ℓ) ∈ C^N . (6.5.2.25)
We start from
‖f − T_n f‖²_{L²(]0,1[)} = ∑_{j=−n}^{n} | ∑_{ℓ∈Z\{0}} f̂_{j+ℓ(2n+1)} |² + ∑_{|j|>n} |f̂_j|² , (6.5.2.13)
which yields
‖f − T_n f‖²_{L²(]0,1[)} ≤ c_k (2πn)^{−2k} ∑_{j=−n}^{n} ∑_{|ℓ|≥1} |2π(j + ℓ(2n+1))|^{2k} |f̂_{j+ℓ(2n+1)}|² + …
✷
Thm. 6.5.2.23 confirms algebraic convergence of the L2 -norm of the trigonometric interpolation error for
functions with limited smoothness. Higher rates can be expected for smoother functions, which we have
also found in cases #1 and #3 in Exp. 6.5.2.1.
f(t) − L_T(f)(t) = f^{(n+1)}(τ_t)/(n+1)! · ∏_{j=0}^{n} (t − t_j) . (6.5.2.27)
Video tutorial for Section 6.5.3 "Trigonometric Interpolation of Analytic Periodic Functions":
(16 minutes) Download link, tablet notes
In Section 6.2.2.3 we saw that we can expect exponential decay of the maximum norm of polynomial
interpolation errors in the case of “very smooth” interpolands. To capture this property of functions we
resorted to the notion of analytic functions, as defined in Def. 6.2.2.48. Since trigonometric interpolation is
closely connected to polynomial interpolation (on the unit circle S1 , see Section 5.6.2), it is not surprising
that analyticity of interpolands will also involve exponential convergence of trigonometric interpolants. This
result will be established in this section.
In case #2 of Exp. 6.5.2.1 we already saw an instance of exponential convergence for an analytic interpoland. A more detailed study follows.
f(t) = 1/√(1 − α sin(2πt)) on I = [0,1] . (6.5.3.2)
[Plot (→ Exp. 6.5.3.1): L∞- and L²-norms of the trigonometric interpolation error versus n for different values of α, e.g. α = 0.5.]
§6.5.3.3 (Analytic periodic functions) Assume that a 1-periodic function f : R → R possesses an ana-
lytic extension to an open domain D ⊂ C beyond the interval [0, 1]: [0, 1] ⊂ D.
[Fig. 246: the domain D ⊃ [0,1] and its 1-periodic translates D + k, k ∈ Z, in the complex plane (Re/Im axes).]
Then, thanks to 1-periodicity, f will also have an analytic extension to D_per := ⋃_{k∈Z} (D + k). That domain will contain a strip parallel to the real axis, see Fig. 246:
§6.5.3.4 (Decay of Fourier coefficients of analytic functions) Lemma 6.5.2.20 asserts algebraic decay of the Fourier coefficients of functions with limited smoothness. As analytic 1-periodic functions are "infinitely smooth" — they always belong to C^∞(R) — we expect a stronger result in this case. In fact, we can conclude exponential decay of the Fourier coefficients.
Proof. ➊: A first variant of the proof uses techniques from complex analysis:
[Fig. 247: the strip S := {z ∈ C : −η ≤ Im z ≤ η}, η > 0, in the z-plane.]
U + := {z = ξ + ıη, 0 ≤ ξ ≤ 1, 0 ≤ η ≤ r } for k ≥ 0 ,
U − := {z = ξ + ıη, 0 ≤ ξ ≤ 1, 0 ≤ η ≤ r } for k < 0 ,
and note that the contributions of the sides parallel to the imaginary axis cancel thanks to 1-periodicity
(and their opposite orientation).
[Fig. 248: the integration contour ∂U^+ in the z-plane.]
Thus we compute the Fourier coefficients f̂_k of f by a different integral. For k > 0 we get
f̂_k = ∫_0^1 f(t) e^{2πıkt} dt = ∫_{∂U^+ \ R} f(z) e^{2πıkz} dz = ∫_0^1 f(s + ıη) e^{2πık(s+ıη)} ds = e^{−2πηk} ∫_0^1 f(s + ıη) e^{2πıks} ds ,
This implies |c_n(y)| η^n ≤ C(y) and, since [0,1] is compact and y ↦ C(y) is continuous,
|f̂_k| ≤ C · n! / (2π|k|η)^n ∀ n ∈ N, k ∈ Z . (6.5.3.12)
Next, we use Stirling's formula (6.2.1.12) in the form
n! ≤ e · n^{n+1/2} · e^{−n} , n ∈ N ,
which gives
|f̂_k| ≤ C e · n^{n+1/2} e^{−n} / (2π|k|η)^n ∀ n ∈ N, k ∈ Z .
We can also "interpolate" and replace n with a real number:
|f̂_k| ≤ C e · r^{r+1/2} e^{−r} / (2π|k|η)^r ∀ r ≥ 1, k ∈ Z .
Finally, we set r := 2π|k|η and arrive at
|f̂_k| ≤ C e · √(2π|k|η) · q^{|k|} , k ∈ Z , with q := exp(−2πη) .
Knowing exponential decay of the Fourier coefficients, the geometric sum formula can be used to ex-
tract estimates for the trigonometric interpolation operator Tn (for equispaced nodes) from (6.5.2.13) and
(6.5.2.14):
Lemma 6.5.3.13. Interpolation error estimates for exponentially decaying Fourier coefficients
This estimate can be combined with the result of Thm. 6.5.3.5 and gives the main result of this section:
The speed of exponential convergence clearly depends on the width η of the “strip of analyticity” S̄.
§6.5.3.16 (Convergence of trigonometric interpolation for analytic interpolands) We can now give
a precise explanation of the observations made in Exp. 6.5.3.1, cf. Rem. 6.2.3.26. Similar to Chebychev
interpolants, also trigonometric interpolants converge exponentially fast, if the interpoland f is 1-periodic
analytic (→ Def. 6.2.2.48) in a strip around the real axis in C, see Thm. 6.5.3.14 for details.
Thus we have to determine the maximal open subset D of C to which the function
f(t) = 1/√(1 − α sin(2πt)) , t ∈ [0,1] , 0 < α < 1 , (6.5.3.2)
possesses an analytic extension. Usually it is easier to determine the complement P := C \ D, the
“domain of singularity”. We start with a result from complex analysis.
[Fig. 249: domain of singularity of z ↦ √z (principal branch): the negative real axis R_0^−, shown in the Re/Im plane. ✁]
1 + α sin(2πz) ∈ R_0^−
⇕
sin(2πz) = sin(2πx) cosh(2πy) + ı cos(2πx) sinh(2πy) ∈ ]−∞, −1 − α^{−1}]
⇕
sin(2πx) cosh(2πy) ≤ −1 − α^{−1} and cos(2πx) sinh(2πy) = 0 .
Note that y = 0 is not possible, because this would imply | sin(2πz)| ≤ 1. Hence, the term
x 7→ cos(2πx ) must make the imaginary part vanish, which means
2πx ∈ (2Z + 1) · π/2 ⇔ x ∈ ½Z + ¼ .
Thus we have sin(2πx) = ±1. As cosh(2πy) > 0, the sine has to be negative, which leaves as the only remaining choice for the real part
x ∈ Z + ¾ ⇔ sin(2πx) = −1 .
As ξ ↦ cosh(ξ) is a positive even function, we find the following domain of analyticity of f:
C \ ⋃_{k∈Z} ( k + ¾ + i·(R \ ]−ζ, ζ[) ) , ζ ∈ R_+ , cosh(2πζ) = 1 + α^{−1} .
[Fig. 250: the set of singularities of f (vertical slits at Re z ∈ Z + ¾, |Im z| ≥ ζ); Fig. 251: graph of cosh.]
➣ f is analytic in the strip S := {z ∈ C : −ζ < Im(z) < ζ}.
➣ As α decreases the strip of analyticity becomes wider, since x ↦ cosh(x) is increasing for x > 0. y
Assume that all the functions f_k : D → C, k ∈ N, are analytic on the open set D ⊂ C and
lim_{n→∞} sup_{z∈D} ∑_{k=n}^{∞} |f_k(z)| = 0 .
Show that
§6.6.0.1 (Grid/mesh) The attribute “piecewise” refers to partitioning of the interval on which we aim to
approximate. In the case of data interpolation the natural choice was to use intervals defined by interpo-
lation nodes. Yet we already saw exceptions in the case of shape-preserving interpolation by means of
quadratic splines, see Section 5.4.4.
In the case of function approximation based on an interpolation scheme the additional freedom to choose
the interpolation nodes suggests that those be decoupled from the partitioning.
Idea: use piecewise polynomials with respect to a grid/mesh
Borrowing from terminology for splines, cf. Def. 5.4.1.1, the underlying mesh for piecewise polynomial
approximation is sometimes called the “knot set”.
Terminology:
✦ x_j =̂ nodes of the mesh M,
✦ [x_{j−1}, x_j[ =̂ intervals/cells of the mesh,
✦ h_M := max_j |x_j − x_{j−1}| =̂ mesh width,
✦ if x_j = a + j·h =̂ equidistant (uniform) mesh with meshwidth h > 0.
[Sketch: mesh a = x_0 < x_1 < ⋯ < x_14 = b on the interval [a,b].]
y
Remark 6.6.0.3 (Local approximation by piecewise polynomials) We will see that most approximation
schemes relying on piecewise polynomials are local in the sense that finding the approximant on a cell of
the mesh relies only on a fixed number of function evaluations in a neighborhood of the cell.
Video tutorial for Section 6.6.1 "Piecewise Polynomial Lagrange Interpolation": (17 minutes)
Download link, tablet notes
Recall theory of polynomial interpolation → Section 5.2.2: n + 1 data points needed to fix interpolating
polynomial, see Thm. 5.2.2.7.
Obviously, IM depends on M, the local degrees n j , and the sets T j of local interpolation points (the latter
two are suppressed in notation).
then the piecewise polynomial Lagrange interpolant according to (6.6.1.2) is continuous on [ a, b]:
s ∈ C0 ([ a, b]).
‖f − I_M f‖ ≤ C T(N) for N → ∞ , (6.6.1.6)
where N := ∑_{j=1}^{m} (n_j + 1).
But why do we choose this strange number N as parameter when investigating the approximation error?
Because, by Thm. 5.2.1.2, it agrees with the dimension of the space of discontinuous, piecewise polyno-
mials functions
{q : [ a, b] → R: q| Ij ∈ Pn j ∀ j = 1, . . . , m} !
This dimension tells us the number of real parameters we need to describe the interpolant s, that is, the
“information cost” of s. N is also proportional to the number of interpolation conditions, which agrees with
the number of f -evaluations needed to compute s (why only proportional in general?).
Special case: uniform polynomial degree n j = n for all j = 1, . . . , m.
Compare Exp. 5.3.1.6: f(t) = arctan t, I = [−5,5], grid M := {−5, −5/2, 0, 5/2, 5}, local interpolation nodes equidistant in I_j, endpoints included.
[Fig. 252: arctan(t) together with its piecewise linear, quadratic, and cubic polynomial interpolants on [−5,5]. ✄]
✦ Sequence of (equidistant) meshes: M_i := { −5 + j·2^{−i}·10 }_{j=0}^{2^i}, i = 1, ..., 6.
✦ Equidistant local interpolation nodes (endpoints of grid intervals included).
Monitored: interpolation error in (approximate) L∞- and L²-norms, see (6.2.3.25), (6.2.3.24),
‖g‖_{L∞([−5,5])} ≈ max_{j=0,...,1000} |g(−5 + j/100)| ,
‖g‖_{L²([−5,5])} ≈ ( 1/1000 · ( ½ g(−5)² + ∑_{j=1}^{999} |g(−5 + j/100)|² + ½ g(5)² ) )^{1/2} .
[Fig. 253: L∞-norm, Fig. 254: L²-norm of the interpolation error versus mesh width h (log-log scale), for local polynomial degrees 1, ..., 6.]
(nearly linear error norm graphs in doubly logarithmic scale, see § 6.2.2.9)
n               1        2        3        4        5        6
w.r.t. L²-norm  1.9957   2.9747   4.0256   4.8070   6.0013   5.2012
w.r.t. L∞-norm  1.9529   2.8989   3.9712   4.7057   5.9801   4.9228
➣ Higher polynomial degree provides faster algebraic decrease of interpolation error norms. Empiric
evidence for rates α = p + 1
Here: rates estimated by linear regression (→ Ex. 3.1.1.5) based on PYTHON's polyfit and the interpolation errors for meshwidth h ≤ 10·2^{−5}. This was done in order to avoid the erratic "preasymptotic" behavior of the error for large meshwidth h.
The bad rates for n = 6 are probably due to the impact of roundoff, because the norms of the interpolation
error had dropped below machine precision, see Fig. 253, 254. y
§6.6.1.8 (Approximation error estimates for piecewise polynomial Lagrange interpolation) The ob-
servations made in Ex. 6.6.1.7 are easily explained by applying the polynomial interpolation error estimates
of Section 6.2.2, for instance
‖f − L_T f‖_{L∞(I)} ≤ ‖f^{(n+1)}‖_{L∞(I)} / (n+1)! · max_{t∈I} |(t − t_0) ⋯ (t − t_n)| . (6.2.2.22)
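A short sketch of why this yields the observed rates: on each mesh cell I_j = [x_{j−1}, x_j] of length h_j ≤ h_M the n+1 local interpolation nodes lie in I_j, so each factor satisfies |t − t_i| ≤ h_j for t ∈ I_j, and applying (6.2.2.22) cell by cell gives
‖f − I_M f‖_{L∞([a,b])} ≤ max_j ( h_j^{n+1}/(n+1)! · ‖f^{(n+1)}‖_{L∞(I_j)} ) ≤ h_M^{n+1}/(n+1)! · ‖f^{(n+1)}‖_{L∞([a,b])} = O(h_M^{n+1}) ,
which matches the empirical rates α ≈ n+1 from Ex. 6.6.1.7.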
[Fig. 255: L∞-norm, Fig. 256: L²-norm of the interpolation error versus the local polynomial degree 1, ..., 6 (semi-logarithmic scale), for mesh widths h = 5, 2.5, 1.25, 0.625, 0.3125.]
In this example we deal with an analytic function, see Rem. 6.2.3.26. Though equidistant local interpolation nodes are used, cf. Ex. 6.2.2.11, the mesh intervals seem to be small enough that even in this case exponential convergence prevails. y
Video tutorial for Section 6.6.2 "Cubic Hermite and Spline Interpolation: Error Estimates": (10
minutes) Download link, tablet notes
See Section 5.3.3 for definition and algorithms for cubic Hermite interpolation of data points, with a focus
on shape preservation, however. If the derivative f ′ of the interpoland f is available (in procedural form),
then it can be used to fix local cubic polynomials by prescribing point values and derivative values in the
endpoints of grid intervals.
Definition 6.6.2.1. Piecewise cubic Hermite interpolant (with exact slopes) → Def. 5.3.3.1
Given f ∈ C1 ([ a, b]) and a mesh M := { a = x0 < x1 < . . . < xm−1 < xm = b} the piecewise
cubic Hermite interpolant (with exact slopes) s : [ a, b] → R is defined as
s|[ x j−1 ,x j ] ∈ P3 , j = 1, . . . , m , s( x j ) = f ( x j ) , s′ ( x j ) = f ′ ( x j ) , j = 0, . . . , m .
Clearly, the piecewise cubic Hermite interpolant is continuously differentiable: s ∈ C1 ([ a, b]), cf.
Cor. 5.3.3.2.
EXPERIMENT 6.6.2.2 (Convergence of Hermite interpolation with exact slopes) In this experiment
✦ domain: I = (−5, 5)
[Fig. 257: error norms versus meshwidth h (log-log scale).]
We observe algebraic convergence O(h⁴).
The observation made in Exp. 6.6.2.2 matches the theoretical prediction of the rate of algebraic conver-
gence for cubic Hermite interpolation with exact slopes for a smooth function.
Let s be the cubic Hermite interpolant of f ∈ C4 ([ a, b]) on a mesh M := { a = x0 < x1 < . . . <
xm−1 < xm = b} according to Def. 6.6.2.1. Then
‖f − s‖_{L∞([a,b])} ≤ 1/4! · h_M⁴ · ‖f^{(4)}‖_{L∞([a,b])} ,
In Section 5.3.3.2 we saw variants of cubic Hermite interpolation, for which the slopes c j = s′ ( x j ) were
computed from the values y j in preprocessing step. Now we study the use of such a scheme for approxi-
mation.
[Plot: sup-norm and L²-norm of the approximation error versus meshwidth, for piecewise cubic Hermite interpolation with averaged slopes.]
We observe a lower rate of algebraic convergence compared to the use of exact slopes, due to the averaging (5.3.3.8). From the plot we deduce O(h³) asymptotic decay of the L²- and L∞-norms of the approximation error for meshwidth h → 0.
We have seen three main classes of cubic spline interpolants s ∈ S3,M of data points with node set
M = { a = t0 < t1 < · · · < tn = b} § 5.4.2.11: the complete cubic spline (s′ prescribed at endpoints),
the natural cubic spline (s′′ ( a) = s′′ (b) = 0), and the periodic cubic spline (s′ ( a) = s′ (b), s′′ ( a) = s′′ (b)).
Obviously, both the natural and periodic cubic spline do not make much sense for approximating a generic
continuous function f ∈ C0 ([ a, b]). So we focus on complete cubic spline interpolants with endpoint
slopes inherited from the interpoland:
Given f ∈ C1 ([ a, b]) and a mesh (= knot set) M := { a = x0 < x1 < . . . < xm−1 < xm = b} the
complete cubic spline Hermite interpolant s is defined by the conditions
In § 5.4.2.5 and § 5.4.2.11 we found that the interpolation conditions at the knots plus fixing the derivatives at the endpoints uniquely determine a cubic spline function. Hence, the above definition is valid.
We study h-convergence of complete (→ § 5.4.2.11) cubic spline interpolation according to Def. 6.6.3.1,
where the slopes at the endpoints of the interval are made to agree with the derivatives of the interpoland
[Fig. 259, Fig. 260: L∞- and L²-norms of the interpolation error ‖s − f‖ versus meshwidth h (log-log scale).]
We remark that there is the following theoretical result [HM76], [DR08, Rem. 9.2]:
f ∈ C⁴([t_0, t_n]) ⟹ ‖f − s‖_{L∞([t_0,t_n])} ≤ 5/384 · h⁴ · ‖f^{(4)}‖_{L∞([t_0,t_n])} . (6.6.3.3)
[Boo05] Carl de Boor. “Divided differences”. In: Surv. Approx. Theory 1 (2005), pp. 46–69 (cit. on
p. 492).
[Bör21] Steffen Börm. On iterated interpolation. 2021.
[CB95] Q. Chen and I. Babuska. “Approximate optimal points for polynomial interpolation of real func-
tions in an interval and in a triangle”. In: Comp. Meth. Appl. Mech. Engr. 128 (1995), pp. 405–
417 (cit. on p. 500).
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 483, 547).
[Dav75] P.J. Davis. Interpolation and Approximation. New York: Dover, 1975 (cit. on pp. 472–474).
[DY10] L. Demanet and L. Ying. On Chebyshev interpolation of analytic functions. Online notes. 2010.
[DH03] P. Deuflhard and A. Hohmann. Numerical Analysis in Modern Scientific Computing. Vol. 43.
Texts in Applied Mathematics. Springer, 2003 (cit. on p. 497).
[EZ66] H. Ehlich and K. Zeller. “Auswertung der Normen von Interpolationsoperatoren”. In: Math. Ann.
164 (1966), pp. 105–112. DOI: 10.1007/BF01429047.
[HM76] C.A. Hall and W.W. Meyer. “Optimal error bounds for cubic spline interpolation”. In: J. Approx.
Theory 16 (1976), pp. 105–122 (cit. on p. 547).
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 483, 496, 497,
505, 546).
[JWZ19] Peter Jantsch, Clayton G. Webster, and Guannan Zhang. “On the Lebesgue constant of
weighted Leja points for Lagrange interpolation on unbounded domains”. In: IMA J. Numer.
Anal. 39.2 (2019), pp. 1039–1057. DOI: 10.1093/imanum/dry002.
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
p. 510).
[Ran00] R. Rannacher. Einführung in die Numerische Mathematik. Vorlesungsskriptum Universität Hei-
delberg. 2000 (cit. on p. 484).
[Rem84] R. Remmert. Funktionentheorie I. Grundwissen Mathematik 5. Berlin: Springer, 1984 (cit. on
p. 490).
[Rem02] R. Remmert. Funktionentheorie I. Grundwissen Mathematik 5. Berlin: Springer, 2002 (cit. on
p. 535).
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on pp. 471, 476,
483, 488, 489).
[Tad86] Eitan Tadmor. “The Exponential Accuracy of Fourier and Chebyshev Differencing Methods”.
In: SIAM Journal on Numerical Analysis 23.1 (1986), pp. 1–10. DOI: 10.1137/0723001.
[TT10] Rodney Taylor and Vilmos Totik. “Lebesgue constants for Leja points”. In: IMA J. Numer. Anal.
30.2 (2010), pp. 462–486. DOI: 10.1093/imanum/drn082.
[Tre13] Lloyd N. Trefethen. Approximation theory and approximation practice. Society for Industrial
and Applied Mathematics (SIAM), Philadelphia, PA, 2013, viii+305 pp.+back matter (cit. on
pp. 486, 492, 500, 502).
[Tre] N. Trefethen. Six myths of polynomial interpolation and quadrature. Slides, University of Ox-
ford.
[Ver86] P. Vertesi. “On the optimal Lebesgue constants for polynomial interpolation”. In: Acta Math.
Hungaria 47.1-2 (1986), pp. 165–178 (cit. on p. 500).
[Ver90] P. Vertesi. “Optimal Lebesgue constant for Lagrange interpolation”. In: SIAM J. Numer. Anal.
27.5 (1990), pp. 1322–1331 (cit. on p. 500).
Chapter 7
Numerical Quadrature
7.1 Introduction
Video tutorial for Section 7.1 "Numerical Quadrature: Introduction": (4 minutes)
Download link, tablet notes
Numerical quadrature deals with the approximate numerical evaluation of integrals ∫_Ω f(x) dx for a given (closed) integration domain Ω ⊂ R^d. Thus, the underlying problem in the sense of § 1.5.5.1 is the mapping
I : C^0(Ω) → R , f ↦ ∫_Ω f(x) dx . (7.1.0.1)
General methods for numerical quadrature should rely only on finitely many point evaluations of
the integrand.
y
☞ Numerical quadrature methods are key building blocks for so-called variational methods for the nu-
merical treatment of partial differential equations. A prominent example is the finite element method.
They are also a pivotal device for the numerical solution of integral equations and in computational
statistics.
y
[Figure: the integral ∫_a^b f(t) dt visualized as the area between the graph of f and the t-axis.]
EXAMPLE 7.1.0.4 (Heating production in electrical circuits) In Ex. 2.1.0.3 we learned about the nodal
analysis of electrical circuits. Its application to a non-linear circuit will be discussed in Ex. 8.1.0.1, which will
reveal that every computation of currents and voltages can be rather time-consuming. In this example we
consider a non-linear circuit in quasi-stationary operation (capacities and inductances are ignored). Then
the computation of branch currents and nodal voltages entails solving a non-linear system of equations.
The goal is to compute the energy dissipated by the circuit, which is equal to the energy injected by the
voltage source. This energy can be obtained by integrating the power P(t) = U(t)·I(t) over the period [0,T]:
W_therm = ∫_0^T U(t) I(t) dt , where I = I(U) .
double I(double U) involves solving non-linear system of equations, see Ex. 8.1.0.1!
This is a typical example where “point evaluation” by solving the non-linear circuit equations is the only
way to gather information about the integrand. y
Quadrature formulas realize the approximation of an integral through finitely many point evaluations of the
integrand.
Terminology:
w_j^{(n)} : quadrature weights ∈ R ,
c_j^{(n)} : quadrature nodes ∈ [a,b] (also called quadrature points).
Obviously (7.2.0.2) is compatible with integrands f given in procedural form as double f(double t),
compare § 7.1.0.2.
A single invocation costs n point evaluations of the integrand plus n additions and multiplications.
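For concreteness, a minimal sketch of evaluating (7.2.0.2) for an integrand given in procedural form; the function name applyQuadRule is an illustrative assumption.
#include <Eigen/Dense>
#include <functional>

// Evaluate Q_n(f) = sum_j w_j * f(c_j) for given nodes c and weights w.
double applyQuadRule(const Eigen::VectorXd &c,   // quadrature nodes
                     const Eigen::VectorXd &w,   // quadrature weights
                     const std::function<double(double)> &f) {
  double s = 0.0;
  for (int j = 0; j < c.size(); ++j) s += w(j) * f(c(j));  // n point evaluations of f
  return s;
}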
Remark 7.2.0.4 (Transformation of quadrature rules) In the setting of function approximation by poly-
nomials we learned in § 6.2.1.14 that an approximation schemes for any interval could be obtained from
an approximation scheme on a single reference interval ([−1, 1] in § 6.2.1.14) by means of affine pullback,
see (6.2.1.18). A similar affine transformation technique makes it possible to derive quadrature formula for
an arbitrary interval from a single quadrature formula on a reference interval.
Given: quadrature formula { (ĉ_j, ŵ_j) }_{j=1}^{n} on the reference interval [−1,1].
[Fig. 264: affine mapping of [−1,1] onto [a,b].]
τ ↦ t := Φ(τ) := ½(1−τ)·a + ½(τ+1)·b
∫_a^b f(t) dt ≈ ½(b−a) ∑_{j=1}^{n} ŵ_j f̂(ĉ_j) = ∑_{j=1}^{n} w_j f(c_j) with c_j = ½(1−ĉ_j)·a + ½(1+ĉ_j)·b , w_j = ½(b−a)·ŵ_j (f̂ := f ∘ Φ).
In words, the nodes are just mapped through the affine transformation c j = Φ(cbj ), the weights are scaled
by the ratio of lengths of [ a, b] and [−1, 1].
Another common choice for the reference interval: [0, 1], pay attention! y
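The transformation rule can be coded in a few lines; the following sketch (an assumed helper, not one of the lecture codes) maps a rule given on the reference interval [−1, 1] to [a, b]:

#include <Eigen/Dense>
// Nodes: c_j = 0.5*(1 - chat_j)*a + 0.5*(1 + chat_j)*b, weights: w_j = 0.5*(b - a)*what_j
void transformQuadRule(const Eigen::VectorXd &chat, const Eigen::VectorXd &what,
                       double a, double b, Eigen::VectorXd &c, Eigen::VectorXd &w) {
  c = 0.5 * ((1.0 - chat.array()) * a + (1.0 + chat.array()) * b).matrix();
  w = 0.5 * (b - a) * what;
}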
Remark 7.2.0.6 (Tables of quadrature rules) In many codes families of quadrature rules are used to
(n) (n)
control the quadrature error. Usually, suitable sequences of weights w j and nodes c j are precomputed
and stored in tables up to sufficiently large values of n. A possible interface could be the following:
struct QuadTab {
  template <typename VecType>
  static void getrule(int n, VecType &c, VecType &w,
                      double a = -1.0, double b = 1.0);
};
Calling the method getrule() fills the vectors c and w with the nodes and the weights for a desired
n-point quadrature on [ a, b] with [−1, 1] being the default reference interval. For VecType we may assume
the basic functionality of Eigen::VectorXd. y
As explained in § 6.1.0.6 every interpolation scheme IT : R n+1 → V based on the node set T =
{t0 , t1 , . . . , tn } ⊂ [ a, b] (→ § 5.1.0.7) induces an approximation scheme, and, hence, also a quadrature
scheme on [ a, b]:
    ∫_a^b f(t) dt ≈ ∫_a^b ( I_T [f(t_0), . . . , f(t_n)]^⊤ )(t) dt .   (7.2.0.9)
Every linear interpolation operator IT according to Def. 5.1.0.25 spawns a quadrature formula (→
Def. 7.2.0.1) by (7.2.0.9).
Proof. Writing e j for the j-th unit vector of R n+1 , j = 0, . . . , n, we have by linearity
    ∫_a^b ( I_T [f(t_0), . . . , f(t_n)]^⊤ )(t) dt = Σ_{j=0}^n f(t_j) ∫_a^b (I_T(e_j))(t) dt ,   (7.2.0.11)
with the j-th integral on the right being the weight w_j.
Hence, we have arrived at an n + 1-point quadrature formula with nodes t j , whose weights are the
integrals of the cardinal interpolants for the interpolation scheme T .
✷
Summing up, we have found:
    interpolation schemes  −→  approximation schemes  −→  quadrature schemes
§7.2.0.12 (Convergence of numerical quadrature) In general the quadrature formula (7.2.0.2) will only
provide an approximate value for the integral.
    quadrature error   E_n(f) := ∫_a^b f(t) dt − Q_n(f) .
As in the case of function approximation by interpolation (Section 6.2.2), our focus will be on the asymptotic
behavior of the quadrature error as a function of the number n of point evaluations of the integrand.
§7.2.0.13 (Quadrature error from approximation error) Bounds for the maximum norm of the approx-
imation error of an approximation scheme directly translate into estimates of the quadrature error of the
induced quadrature scheme (7.2.0.8):
    | ∫_a^b f(t) dt − Q_A(f) | ≤ ∫_a^b | f(t) − A(f)(t) | dt ≤ |b − a| ‖f − A(f)‖_{L∞([a,b])} .   (7.2.0.14)
Hence, the various estimates derived in Section 6.2.2 and Section 6.2.3.2 give us quadrature error esti-
mates “for free”. More details will be given in the next section. y
Review question(s) 7.2.0.15 (Quadrature formulas)
(Q7.2.0.15.A) Explain the structure of a quadrature formula/rule for the approximation of the integral
∫_a^b f(t) dt, a, b ∈ R.
(Q7.2.0.15.B) The integral satisfies
    g, f ∈ C^0([a, b]) ,  g ≤ f on [a, b]   ⇒   ∫_a^b g(t) dt ≤ ∫_a^b f(t) dt .
Formulate necessary and sufficient conditions on an n-point quadrature rule Qn such that
g, f ∈ C0 ([ a, b]) , g ≤ f on [ a, b] ⇒ Qn ( g) ≤ Qn ( f ) .
(Q7.2.0.15.C) The documentation of the C++ function
Eigen::Matrix<double, Eigen::Dynamic, 2> getQuadRule(unsigned int n);
claims that it provides a family of quadrature rules on the interval [−1, 1], and that n passes the number
of quadrature nodes/points.
Why does this claim make sense and which piece of information is missing? How could you retrieve
that missing piece, if you can call the function getQuadRule()?
Hint. The C++ class std::pair is an abstract container for two objects of different types that can be
accessed via data members first and second.
(Q7.2.0.15.E) An integration path γ in the complex plane C is usually given by its parameterization
γ : [a, b] → C, a, b ∈ R. If f : D ⊂ C → C is continuous and γ([a, b]) ⊂ D, then the path integral
of f along γ is defined as
    ∫_γ f(z) dz := ∫_a^b f(γ(τ)) · γ̇(τ) dτ ,   (6.2.2.51)
where γ̇ designates the derivative of γ with respect to the parameter, and · indicates multiplication in C.
What do quadrature formulas look like that can be used for the approximate computation of
    ∫_{∂D} f(z) dz ,   D := { z ∈ C : |z| ≤ 1 } ?
7.3 Polynomial Quadrature Formulas
Video tutorial for Section 7.3 "Polynomial Quadrature Formulas": (9 minutes) Download link,
tablet notes
Now we specialize the general recipe of § 7.2.0.7 for approximation schemes based on global polynomials,
the Lagrange approximation scheme as introduced in Section 6.2, Def. 6.2.2.1.
The cardinal interpolants for Lagrange interpolation are the Lagrange polynomials (5.2.2.4)
    L_i(t) := Π_{j=0, j≠i}^{n−1} (t − t_j)/(t_i − t_j) ,   i = 0, . . . , n − 1 ,   p_{n−1}(t) =(5.2.2.6)= Σ_{i=0}^{n−1} f(t_i) L_i(t) .
    ∫_a^b p_{n−1}(t) dt = Σ_{i=0}^{n−1} f(t_i) ∫_a^b L_i(t) dt    ⇒    nodes c_i = t_{i−1} ,   weights w_i := ∫_a^b L_{i−1}(t) dt .   (7.3.0.2)
The simplest representative is the midpoint rule
    ∫_a^b f(t) dt ≈ Q_mp(f) = (b − a) f(½(a + b)) .
[Fig. 265: the area under the graph of f is approximated by the area of a rectangle of height f(½(a + b)).]
y
EXAMPLE 7.3.0.4 (Newton-Cotes formulas → [Han02, Ch. 38]) The n := m + 1-point Newton-Cotes
formulas arise from Lagrange interpolation in equidistant nodes (6.2.2.3) in the integration interval [ a, b]:
Equidistant quadrature nodes t_j := a + h j ,   h := (b − a)/m ,   j = 0, . . . , m .
The weights for the interval [0, 1] can be found, e.g., by symbolic computation using MAPLE: the following
MAPLE function expects the polynomial degree as input argument and computes the weights for the
interval [0, 1]:
> newtoncotes := n -> factor(int(interp([seq(i/n, i=0..n)],
                  [seq(f(i/n), i=0..n)], z), z=0..1)):
Weights on general intervals [ a, b] can then be deduced by the affine transformation rule as explained in
Rem. 7.2.0.4.
• n = 2: Trapezoidal rule
> trapez := newtoncotes(1);
    Q̂_trp(f) := ½ ( f(0) + f(1) )   (7.3.0.5)
    ∫_a^b f(t) dt ≈ (b − a)/2 · ( f(a) + f(b) )
[Fig. 266: the area under the graph of f is approximated by the area of a trapezoid.]
• n = 3: Simpson rule
> simpson := newtoncotes(2);
    1/6 ( f(0) + 4 f(½) + f(1) )        ∫_a^b f(t) dt ≈ (b − a)/6 · ( f(a) + 4 f((a + b)/2) + f(b) )   (7.3.0.6)
• n = 5: Milne rule
> milne := newtoncotes(4);
    1/90 ( 7 f(0) + 32 f(¼) + 12 f(½) + 32 f(¾) + 7 f(1) )
    ∫_a^b f(t) dt ≈ (b − a)/90 · ( 7 f(a) + 32 f(a + (b−a)/4) + 12 f(a + (b−a)/2) + 32 f(a + 3(b−a)/4) + 7 f(b) )
• n = 7: Weddle rule
> weddle := newtoncotes(6);
    1/840 ( 41 f(0) + 216 f(1/6) + 27 f(1/3) + 272 f(½) + 27 f(2/3) + 216 f(5/6) + 41 f(1) )
• n = 9:
> newtoncotes(8);
    1/28350 ( 989 f(0) + 5888 f(1/8) − 928 f(¼) + 10496 f(3/8)
              − 4540 f(½) + 10496 f(5/8) − 928 f(¾) + 5888 f(7/8) + 989 f(1) )
y
From Ex. 6.2.2.11 we know that the approximation error incurred by Lagrange interpolation
in equidistant nodes can blow up even for analytic functions. This blow-up also infects
! the quadrature error of Newton-Cotes formulas for large n, which renders them essentially
useless. In addition they are marred by large (in modulus) and negative weights, which
compromises numerical stability (→ Def. 1.5.5.19).
No negative weights!
Quadrature formulas with negative weights should not be used, not even considered!
Remark 7.3.0.8 (Clenshaw-Curtis quadrature rules [Tre08]) The considerations of Section 6.2.3 con-
firmed the superiority of the “optimal” Chebychev nodes (6.2.3.10) for globally polynomial Lagrange in-
terpolation. This suggests that we use these nodes also for numerical quadrature with weights given by
(7.3.0.2). This yields the so-called Clenshaw-Curtis rules with the following rather desirable property:
The weights w_j^(n), j = 1, . . . , n, for every n-point Clenshaw-Curtis rule are positive.
The weights of any n-point Clenshaw-Curtis rule can be computed with a computational effort of
O(n log n) using FFT [Wal06], [Tre08, Sect. 2]. y
    ‖f − L_T f‖_{L∞(I)} ≤ ( ‖f^(n+1)‖_{L∞(I)} / (n + 1)! ) · max_{t∈I} |(t − t_0) · · · · · (t − t_n)| .   (6.2.2.22)
Much sharper estimates for Clenshaw-Curtis rules (→ Rem. 7.3.0.8) can be inferred from the interpolation
error estimate (6.2.3.18) for Chebychev interpolation. For functions with limited smoothness algebraic con-
vergence of the quadrature error for Clenshaw-Curtis quadrature follows from (6.2.3.21). For integrands
that possess an analytic extension to the complex plane in a neighborhood of [ a, b], we can conclude
exponential convergence from (6.2.3.28). y
7.4 Gauss Quadrature   [DR08, Sect. 10.3]
How to gauge the “quality” of an n-point quadrature formula Qn without testing it for specific integrands?
The next definition gives a classical answer: the order of a quadrature rule is defined as the maximal
degree + 1 of polynomials for which the quadrature rule is guaranteed to be exact.
§7.4.1.3 (Invariance of order under (affine) transformation) First we note a simple consequence of the
invariance of the polynomial space Pn under affine pullback, see Lemma 6.2.1.17.
An affine transformation of a quadrature rule according to Rem. 7.2.0.4 does not change its order.
§7.4.1.5 (Order of polynomial quadrature rules) Further, by construction all polynomial n-point quadra-
ture rules possess order at least n. More precisely (Thm. 7.4.1.6), an n-point quadrature rule on [a, b]
with nodes c_1, . . . , c_n has order at least n, if and only if its weights are chosen as
    w_j = ∫_a^b L_{j−1}(t) dt ,   j = 1, . . . , n ,
where L_k, k = 0, . . . , n − 1, is the k-th Lagrange polynomial (5.2.2.4) associated with the ordered
node set {c_1, c_2, . . . , c_n}.
Proof. The conclusion of the theorem is a direct consequence of the construction (7.3.0.2): polynomial
n-point quadrature formulas (7.3.0.1) are exact for all f ∈ P_{n−1}, hence an n-point polynomial quadrature
formula has at least order n. y
Remark 7.4.1.7 (Linear system for quadrature weights) Thm. 7.4.1.6 provides a concrete formula for
the quadrature weights, which guarantees order n for an n-point quadrature formula. Yet evaluating integrals
of Lagrange polynomials may be cumbersome. Here we give a general recipe for finding the weights w_j
according to Thm. 7.4.1.6 without dealing with Lagrange polynomials: demand exactness on a basis of
P_{n−1}. For instance, one may choose the monomial basis p_j(t) = t^j.
y
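A possible realization of this recipe with EIGEN is sketched below (an illustrative helper, not one of the lecture codes): the exactness conditions for the monomials t^k, k = 0, . . . , n − 1, yield a (transposed) Vandermonde system for the weights.

#include <Eigen/Dense>
#include <cmath>
// Weights making the rule with given nodes c on [a,b] exact on P_{n-1}:
//   sum_j c_j^k w_j = (b^{k+1} - a^{k+1}) / (k+1),  k = 0,...,n-1.
Eigen::VectorXd weightsFromNodes(const Eigen::VectorXd &c, double a, double b) {
  const Eigen::Index n = c.size();
  Eigen::MatrixXd V(n, n);  // V(k,j) = c_j^k
  Eigen::VectorXd rhs(n);   // exact moments of the monomials
  for (Eigen::Index k = 0; k < n; ++k) {
    for (Eigen::Index j = 0; j < n; ++j) V(k, j) = std::pow(c(j), k);
    rhs(k) = (std::pow(b, k + 1) - std::pow(a, k + 1)) / (k + 1);
  }
  return V.fullPivLu().solve(rhs);  // caution: ill-conditioned for large n
}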
EXAMPLE 7.4.1.10 (Orders of simple polynomial quadrature formulas) From the order rule for poly-
nomial quadrature rule we immediately conclude the orders of simple representatives.
    n   quadrature rule              order
    1   midpoint rule                2
    2   trapezoidal rule (7.3.0.5)   2
    3   Simpson rule (7.3.0.6)       4
    4   3/8-rule                     4
    5   Milne rule                   6
The orders for odd n surpass the predictions of Thm. 7.4.1.6 by 1, which can be verified by straightforward
computations; following Def. 7.4.1.1 check the exactness of the quadrature rule on [0, 1] (this is sufficient
→ Cor. 7.4.1.4) for the monomials {t ↦ t^k}, k = 0, . . . , q − 1, which form a basis of P_{q−1}, where q is the
order that is to be confirmed: essentially one has to show
    Q({t ↦ t^k}) = Σ_{j=1}^n w_j c_j^k = 1/(k + 1) ,   k = 0, . . . , q − 1 .   (7.4.1.11)
For the Simpson rule (7.3.0.6) we can also confirm order 4 with symbolic calculations in MAPLE:
> rule := 1/3*h*(f(2*h)+4*f(h)+f(0))
> err := taylor(rule - int(f(x),x=0..2*h),h=0,6);
    err := 1/90 · D^(4)(f)(0) · h^5 + O(h^6)
➣ The Simpson rule possesses order 4, indeed! y
A natural question is whether an n-point quadrature formula can achieve an order > n. A negative result
limits the maximal order that can be achieved:
q(t) := (t − c1 )2 · · · · · (t − cn )2 ∈ P2n .
EXAMPLE 7.4.2.2 (2-point quadrature rule of order 4) Necessary & sufficient conditions for order 4, cf.
(7.4.1.9), integrate the functions of the monomial basis of P3 exactly:
    Q_n(p) = ∫_a^b p(t) dt  ∀ p ∈ P_3   ⇔   Q_n({t ↦ t^q}) = (b^{q+1} − a^{q+1}) / (q + 1) ,   q = 0, 1, 2, 3 .
4 equations for weights w j and nodes c j , j = 1, 2 ( a = −1, b = 1), cf. Rem. 7.4.1.7
    ∫_{−1}^1 1 dt = 2 = 1·w_1 + 1·w_2 ,          ∫_{−1}^1 t dt = 0 = c_1 w_1 + c_2 w_2 ,
                                                                                           (7.4.2.3)
    ∫_{−1}^1 t² dt = 2/3 = c_1² w_1 + c_2² w_2 ,   ∫_{−1}^1 t³ dt = 0 = c_1³ w_1 + c_2³ w_2 .
Solve using MAPLE:
> eqns := {seq(int(x^k, x=-1..1) = w[1]*xi[1]^k+w[2]*xi[2]^k,k=0..3)};
> sols := solve(eqns, indets(eqns, name)):
> convert(sols, radical);
➣ weights & nodes:   { w_2 = 1, w_1 = 1, c_1 = ⅓√3, c_2 = −⅓√3 }
    quadrature formula (order 4):   ∫_{−1}^1 f(x) dx ≈ f(1/√3) + f(−1/√3)   (7.4.2.4)
y
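A quick numerical check (illustrative code, not part of the lecture codes) confirms that (7.4.2.4) integrates the monomials t^q, q = 0, . . . , 3, over [−1, 1] exactly, which establishes order 4:

#include <cmath>
#include <cstdio>
int main() {
  const double c = 1.0 / std::sqrt(3.0);
  for (int q = 0; q <= 3; ++q) {
    const double Q = std::pow(c, q) + std::pow(-c, q);  // w1 = w2 = 1
    const double exact = (std::pow(1.0, q + 1) - std::pow(-1.0, q + 1)) / (q + 1);
    std::printf("q=%d: Q = %g, exact = %g\n", q, Q, exact);
  }
  return 0;
}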
§7.4.2.5 (Construction of n-point quadrature rules with maximal order 2n) First we search for neces-
sary conditions that have to be met by the nodes, if an n-point quadrature rule has order 2n.
    Q_n(f) := Σ_{j=1}^n w_j^(n) f(c_j^(n)) ≈ ∫_{−1}^1 f(t) dt ,   w_j^(n) ∈ R , n ∈ N ,
    of order 2n   ⇔   exact for polynomials ∈ P_{2n−1} .   (7.4.2.6)
Define   P̄_n(t) := (t − c_1^(n)) · · · · · (t − c_n^(n)) ,  t ∈ R   ⇒   P̄_n ∈ P_n .
Note: P̄_n has leading coefficient = 1.
By assumption on the order of Q_n we know that for any q ∈ P_{n−1} (so that q · P̄_n ∈ P_{2n−1})
    ∫_{−1}^1 q(t) P̄_n(t) dt  =(7.4.2.6)=  Σ_{j=1}^n w_j^(n) q(c_j^(n)) P̄_n(c_j^(n)) = 0 ,
since P̄_n vanishes at all quadrature nodes.
We conclude L²([−1, 1])-orthogonality:   ∫_{−1}^1 q(t) P̄_n(t) dt = 0   ∀ q ∈ P_{n−1} .   (7.4.2.7)
P_n equipped with the inner product (p, q) ↦ ∫_{−1}^1 p(t) q(t) dt can be viewed as a Euclidean space.
P_{n−1} ⊂ P_n is a subspace of co-dimension 1. Hence, P_{n−1} has a 1-dimensional orthogonal complement in
P_n. By (7.4.2.7) P̄_n belongs to that complement: P̄_n ⊥ P_{n−1}. It takes one additional condition to fix P̄_n,
and the requirement that its leading coefficient be = 1 is that condition.
[Fig. 267: P̄_n spanning the orthogonal complement of P_{n−1} in P_n]
We can also give algebraic arguments for existence and uniqueness of P̄_n. Switching to a monomial
representation of P̄_n, the orthogonality conditions (7.4.2.7) become a linear system of equations
A (α_j)_{j=0}^{n−1} = b with a symmetric, positive definite (→ Def. 1.1.2.6) coefficient matrix A ∈ R^{n,n}.
That A is positive definite can be concluded from
    x^⊤ A x = ∫_{−1}^1 ( Σ_{j=0}^{n−1} (x)_j t^j )² dt > 0 ,   if x ≠ 0 .
Hence, A is regular and the coefficients α j are uniquely determined. Thus there is only one n-point
quadrature rule of order 2n.
The nodes of an n-point quadrature formula of order 2n, if it exists, must coincide with the unique zeros
of the polynomials P̄n ∈ Pn \ {0} satisfying (7.4.2.7).
➣ As we have seen in Section 6.3.2, abstract techniques for vector spaces with inner product can be
applied to polynomials, for instance Gram-Schmidt orthogonalization, cf. § 6.3.1.17, [NS02, Thm. 4.8],
[Gut09, Alg. 6.1].
Now carry out the abstract Gram-Schmidt orthogonalization according to Algorithm (6.3.1.18) and recall
Thm. 6.3.1.19: in a vector space V with inner product (·, ·)V orthogonal vectors q0 , q1 , . . . spanning the
same subspaces as the linearly independent vectors v0 , v1 , . . . are constructed recursively via
    q_n := v_n − Σ_{k=0}^{n−1} ( (v_n, q_k)_V / (q_k, q_k)_V ) q_k ,   n = 1, 2, . . . ,   q_0 := v_0 .   (7.4.2.9)
Note: P̄n has leading coefficient = 1 ⇒ P̄n uniquely defined (up to sign) by (7.4.2.10).
y
The considerations so far only reveal necessary conditions on the nodes of an n-point quadrature rule of
order 2n:
They do by no means confirm the existence of such rules, but offer a clear hint on how to construct them:
Proof. Conclude from the orthogonality of the P̄_n that {P̄_k}_{k=0}^n is a basis of P_n and
    ∫_{−1}^1 h(t) P̄_n(t) dt = 0   ∀ h ∈ P_{n−1} .   (7.4.2.12)
Recall division of polynomials with remainder (Euclid’s algorithm → Course “Diskrete Mathematik”): for
any p ∈ P2n−1
p(t) = h(t) P̄n (t) + r (t) , for some h ∈ Pn−1 , r ∈ Pn−1 . (7.4.2.13)
    ∫_{−1}^1 p(t) dt = ∫_{−1}^1 h(t) P̄_n(t) dt + ∫_{−1}^1 r(t) dt  =(∗)=  Σ_{j=1}^n w_j^(n) r(c_j^(n)) ,   (7.4.2.14)
where the first integral on the right vanishes by (7.4.2.12).
(∗): by choice of weights according to Rem. 7.4.1.7 Qn is exact for polynomials of degree ≤ n − 1!
    Σ_{j=1}^n w_j^(n) p(c_j^(n))  =(7.4.2.13)=  Σ_{j=1}^n w_j^(n) h(c_j^(n)) P̄_n(c_j^(n)) + Σ_{j=1}^n w_j^(n) r(c_j^(n))  =(7.4.2.14)=  ∫_{−1}^1 p(t) dt ,
since P̄_n(c_j^(n)) = 0 for all j.
Legendre polynomials
The L²(]−1, 1[)-orthogonal polynomials are those already discussed in Rem. 6.3.2.16:
Definition 7.4.2.16. Legendre polynomials
    The n-th Legendre polynomial P_n is defined by
    • P_n ∈ P_n ,
    • ∫_{−1}^1 P_n(t) q(t) dt = 0   for all q ∈ P_{n−1} ,
    • P_n(1) = 1 .
[Fig. 268: the Legendre polynomials P_0, . . . , P_5 on [−1, 1]]
Notice: the polynomials P̄n defined by (7.4.2.10) and the Legendre polynomials Pn of Def. 7.4.2.16 (merely)
differ by a constant factor!
Gauss points ξ_j^(n) = zeros of the Legendre polynomial P_n
Note: the above considerations, recall (7.4.2.7), show that the nodes of an n-point quadrature formula of
order 2n on [−1, 1] must agree with the zeros of L²(]−1, 1[)-orthogonal polynomials.
    n-point quadrature formulas of order 2n are unique.
We are not done yet: the zeros of P̄_n from (7.4.2.10) may lie outside [−1, 1].
! In principle P̄_n could also have fewer than n real zeros.
Lemma 7.4.2.17. Zeros of Legendre polynomials
    P_n has n distinct zeros in ]−1, 1[.
Zeros of Legendre polynomials = Gauss points
[Fig. 269: the zeros of the Legendre polynomials (= Gauss points) in [−1, 1] for n = 1, . . . , 18]
Proof. (indirect) Assume that P_n has only m < n zeros ζ_1, . . . , ζ_m in ]−1, 1[ at which it changes sign.
Define
    q(t) := Π_{j=1}^m (t − ζ_j)   ⇒   q P_n ≥ 0  or  q P_n ≤ 0 ,
    ⇒   ∫_{−1}^1 q(t) P_n(t) dt ≠ 0 .
This contradicts the orthogonality of P_n to P_{n−1}, because q ∈ P_m ⊂ P_{n−1}.   ✷
Definition 7.4.2.18. Gauss-Legendre quadrature formulas
The n-point quadrature formulas whose nodes, the Gauss points, are given by the zeros of the n-th
Legendre polynomial (→ Def. 7.4.2.16), and whose weights are chosen according to Thm. 7.4.1.6,
are called Gauss-Legendre quadrature formulas.
The last part of this section examines the non-trivial question of how to compute the Gauss points given as
the zeros of Legendre polynomials. Many different algorithms have been devised for this purpose and we
focus on one that employs tools from numerical linear algebra and relies on particular algebraic properties
of the Legendre polynomials.
Remark 7.4.2.19 (3-term recursion for Legendre polynomials) From Thm. 6.3.2.14 we learn that or-
thogonal polynomials satisfy the 3-term recursion (6.3.2.15), see also (7.4.2.21). To keep this chapter
self-contained we derive it independently for Legendre polynomials.
Note: the polynomials P̄n from (7.4.2.10) are uniquely characterized by the two properties (try a proof!)
(i) P̄_n ∈ P_n with leading coefficient 1: P̄_n(t) = t^n + . . . ,
(ii) ∫_{−1}^1 P̄_k(t) P̄_j(t) dt = 0, if j ≠ k   (L²(]−1, 1[)-orthogonality).
➣ we get the same polynomials P̄n by another Gram-Schmidt orthogonalization procedure, cf. (7.4.2.9)
and § 6.3.2.11:
    P̄_{n+1}(t) = t P̄_n(t) − Σ_{k=0}^n ( ∫_{−1}^1 τ P̄_n(τ) P̄_k(τ) dτ / ∫_{−1}^1 P̄_k(τ)² dτ ) · P̄_k(t) .
Since τ P̄_k ∈ P_{k+1} is orthogonal to P̄_n if k + 1 < n, only the last two terms of the sum survive:
    P̄_{n+1}(t) = t P̄_n(t) − ( ∫_{−1}^1 τ P̄_n(τ)² dτ / ∫_{−1}^1 P̄_n(τ)² dτ ) · P̄_n(t)
                           − ( ∫_{−1}^1 τ P̄_n(τ) P̄_{n−1}(τ) dτ / ∫_{−1}^1 P̄_{n−1}(τ)² dτ ) · P̄_{n−1}(t) .   (7.4.2.20)
After rescaling (tedious!) we obtain the famous 3-term recursion for Legendre polynomials
    P_{n+1}(t) := ((2n + 1)/(n + 1)) · t P_n(t) − (n/(n + 1)) · P_{n−1}(t) ,   P_0 := 1 ,   P_1(t) := t .   (7.4.2.21)
In Section 6.2.3.1 we discovered a similar 3-term recursion (6.2.3.5) for Chebychev polynomials. Coinci-
dence? Of course not, nothing in mathematics holds “by accident”. By Thm. 6.3.2.14 3-term recursions
are a distinguishing feature of so-called families of orthogonal polynomials, to which the Chebychev poly-
nomials belong as well, spawned by Gram-Schmidt orthogonalization with respect to a weighted L2 -inner
product, however, see [Han02, p. VI].
➤ Efficient and stable evaluation of Legendre polynomials by means of the 3-term recursion (7.4.2.21), cf.
the analogous algorithm for Chebychev polynomials given in Code 6.2.3.6.
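A possible C++ realization of such an evaluation (a sketch in the spirit of the Chebychev code just mentioned; name and interface are assumptions) is:

#include <Eigen/Dense>
// Row k of the returned matrix holds the values of P_k at the points in x,
// computed via the 3-term recursion (7.4.2.21).
Eigen::MatrixXd legendreVals(unsigned int n, const Eigen::VectorXd &x) {
  Eigen::MatrixXd V(n + 1, x.size());
  V.row(0).setOnes();                     // P_0 = 1
  if (n >= 1) V.row(1) = x.transpose();   // P_1(t) = t
  for (unsigned int k = 1; k < n; ++k) {  // (k+1) P_{k+1} = (2k+1) t P_k - k P_{k-1}
    V.row(k + 1) =
        ((2.0 * k + 1.0) / (k + 1.0)) * x.transpose().cwiseProduct(V.row(k)) -
        (static_cast<double>(k) / (k + 1.0)) * V.row(k - 1);
  }
  return V;
}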
In codes Gauss nodes and weights are usually retrieved from tables, cf. Rem. 7.2.0.6.
Justification: rewrite the 3-term recursion (7.4.2.21) for the scaled Legendre polynomials P̃_n := (n + ½)^{−1/2} P_n:
    t P̃_n(t) = β_n P̃_{n−1}(t) + β_{n+1} P̃_{n+1}(t) ,   β_n := n/√(4n² − 1) ,   β_{n+1} := (n + 1)/√(4(n + 1)² − 1) .   (7.4.2.25)
The zeros of P_n can be obtained as the n real eigenvalues of the symmetric tridiagonal matrix J_n ∈ R^{n,n}
with zero diagonal and the coefficients β_1, . . . , β_{n−1} from (7.4.2.25) on its off-diagonals!
This matrix J_n is initialized in Line 10–Line 13 of Code 7.4.2.24. The computation of the weights in Line 18
of Code 7.4.2.24 is explained in [Gan+05, Sect. 3.5.4]. y
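The following sketch conveys the idea without reproducing Code 7.4.2.24 (interface and details are assumptions): the Jacobi matrix J_n built from the coefficients β_k of (7.4.2.25) is diagonalized with EIGEN's SelfAdjointEigenSolver; its eigenvalues are the Gauss points and, following the Golub-Welsch approach described in [Gan+05, Sect. 3.5.4], the weights are obtained from the first components of the normalized eigenvectors.

#include <Eigen/Dense>
#include <cmath>
#include <utility>
// Returns (nodes, weights) of the n-point Gauss-Legendre rule on [-1,1].
std::pair<Eigen::VectorXd, Eigen::VectorXd> gaussLegendre(unsigned int n) {
  Eigen::MatrixXd J = Eigen::MatrixXd::Zero(n, n);  // symmetric tridiagonal, zero diagonal
  for (unsigned int k = 1; k < n; ++k) {
    const double beta = k / std::sqrt(4.0 * k * k - 1.0);  // beta_k from (7.4.2.25)
    J(k, k - 1) = beta;
    J(k - 1, k) = beta;
  }
  Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> es(J);  // eigenvalues sorted ascending
  Eigen::VectorXd nodes = es.eigenvalues();
  Eigen::VectorXd weights =
      2.0 * es.eigenvectors().row(0).transpose().array().square().matrix();
  return {nodes, weights};
}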
Remark 7.4.2.26 (Asymptotic methods for the computation of Gauss-Legendre quadrature rules)
The fastest methods for the computation of nodes for p-point Gauss-Legendre quadrature rules make use
of asymptotic formulas for the zeros of special functions and achieve an asymptotic complexity of O( p) for
p → ∞, see article and [Bog14]. y
The Gauss-Legendre quadrature formulas do not only enjoy maximal order, but also another key property
that can be regarded as essential for viable families of quadrature rules: all their weights are positive.
[Fig. 270]
Proof. Writing ξ_j^(n), j = 1, . . . , n, for the nodes (Gauss points) of the n-point Gauss-Legendre quadrature
formula, n ∈ N, we define
    q_k(t) := Π_{j=1, j≠k}^n (t − ξ_j^(n))²   ⇒   q_k ∈ P_{2n−2} .
This polynomial is integrated exactly by the quadrature rule; since q_k(ξ_j^(n)) = 0 for j ≠ k,
    0 < ∫_{−1}^1 q_k(t) dt = w_k^(n) q_k(ξ_k^(n))   with   q_k(ξ_k^(n)) > 0 ,
where the w_j^(n) are the quadrature weights.
✷
§7.4.3.2 (Quadrature error and best approximation error) The positivity of the weights w_j^(n) for all
n-point Gauss-Legendre and Clenshaw-Curtis quadrature rules has important consequences.
Theorem 7.4.3.3. Quadrature error estimate for quadrature rules with positive weights
Proof. The proof runs parallel to the derivation of (6.2.2.33). Writing E_n(f) for the quadrature error, the
left-hand side of (7.4.3.4), we find by the definition Def. 7.4.1.1 of the order of a quadrature rule that for all
p ∈ P_{q−1}
    E_n(f) = |E_n(f − p)| ≤ | ∫_a^b (f − p)(t) dt | + | Σ_{j=1}^n w_j (f − p)(c_j) |   (7.4.3.5)
           ≤ |b − a| ‖f − p‖_{L∞([a,b])} + Σ_{j=1}^n |w_j| ‖f − p‖_{L∞([a,b])} .
Since all weights are positive and the rule integrates constants exactly (order ≥ 1),
    Σ_{j=1}^n |w_j| = Σ_{j=1}^n w_j = |b − a| ,
and the assertion follows by taking the infimum over all p ∈ P_{q−1}.
Appealing to (6.2.1.27) and (6.2.2.22), the dependence of the constants on the length of the integration
interval can be quantified for integrands with limited smoothness.
Proof. The first estimate (7.4.3.7) is immediate from (6.2.1.27). The second bound (7.4.3.8) is obtained
by combining (7.4.3.4) and (6.2.2.22).
✷
Please note the different estimates depending on whether the smoothness of f (as described by r) or the
order of the quadrature rule is the “limiting factor”. y
[Fig. 271, Fig. 272: |quadrature error| vs. number of quadrature nodes for equidistant Newton-Cotes
quadrature, Chebychev (Clenshaw-Curtis) quadrature, and Gauss quadrature;
left: f_1(t) := 1/(1 + (15t)²) on [0, 1], right: f_2(t) := √t on [0, 1].]
Asymptotic behavior of the quadrature error ǫ_n := | ∫_0^1 f(t) dt − Q_n(f) | for "n → ∞":
➣ exponential convergence ǫ_n ≈ O(q^n), 0 < q < 1, for the C∞-integrand f_1: Newton-Cotes quadra-
  ture: q ≈ 0.61, Clenshaw-Curtis quadrature: q ≈ 0.40, Gauss-Legendre quadrature: q ≈ 0.27;
➣ algebraic convergence ǫ_n ≈ O(n^{−α}), α > 0, for the integrand f_2 with a singularity at t = 0: Newton-
  Cotes quadrature: α ≈ 1.8, Clenshaw-Curtis quadrature: α ≈ 2.5, Gauss-Legendre quadrature:
  α ≈ 2.7.
y
Remark 7.4.3.10 (Removing a singularity by transformation) Ex. 7.4.3.9 teaches us that a lack of
smoothness of the integrand can thwart exponential convergence and severely limits the rate of algebraic
convergence of a global quadrature rule for n → ∞.
Here is an example:
For a general but smooth f ∈ C∞([0, b]) compute ∫_0^b √t f(t) dt via a quadrature rule, e.g., n-point
Gauss-Legendre quadrature on [0, b]. Due to the presence of a square-root singularity at t = 0 the direct
application of n-point Gauss-Legendre quadrature will result in a rather slow algebraic convergence of the
quadrature error as n → ∞, see Ex. 7.4.3.9.
    Substitution s = √t :   ∫_0^b √t f(t) dt = ∫_0^{√b} 2 s² f(s²) ds .   (7.4.3.11)
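The transformed integrand s ↦ 2 s² f(s²) is smooth, so the fast convergence of Gauss-Legendre quadrature is restored (exponential if f can be extended analytically). A minimal sketch (assuming nodes c and weights w of a quadrature rule on [0, √b] are already available, e.g. from an affinely transformed Gauss-Legendre rule):

#include <Eigen/Dense>
// Approximate int_0^b sqrt(t) f(t) dt via the substituted integral (7.4.3.11).
template <typename Functor>
double quadSqrtWeight(const Eigen::VectorXd &c, const Eigen::VectorXd &w, Functor &&f) {
  double Q = 0.0;
  for (Eigen::Index j = 0; j < c.size(); ++j) {
    const double s = c(j);
    Q += w(j) * 2.0 * s * s * f(s * s);  // integrand of the transformed integral
  }
  return Q;
}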
Remark 7.4.3.12 (The message of asymptotic estimates) There is one blot on most n-asymptotic esti-
mates obtained from Thm. 7.4.3.3: the bounds usually involve quantities like norms of higher derivatives
of the interpoland that are elusive in general, in particular for integrands given only in procedural form,
see § 7.1.0.2. Such unknown quantities are often hidden in “generic constants C”. Can we extract useful
information from estimates marred by the presence of such constants?
For fixed integrand f let us assume sharp algebraic convergence (in n) with rate r ∈ N of the quadrature
error En ( f ) for a family of n-point quadrature rules:
    E_n(f) = O(n^{−r}) (sharp)   ⇒   E_n(f) ≈ C n^{−r} ,   (7.4.3.13)
In the case of algebraic convergence with rate r ∈ R a reduction of the quadrature error by a factor
of ρ is bought by an increase of the number of quadrature points by a factor of ρ^{1/r}.
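For instance, with the rate α ≈ 2.7 observed for Gauss-Legendre quadrature applied to f_2 in Ex. 7.4.3.9, reducing the quadrature error by a factor of ρ = 100 requires roughly 100^{1/2.7} ≈ 5.5 times as many quadrature points.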
Now assume sharp exponential convergence (in n) of the quadrature error En ( f ) for a family of n-point
quadrature rules, 0 ≤ q < 1:
    E_n(f) = O(q^n) (sharp)   ⇒   E_n(f) ≈ C q^n ,   (7.4.3.15)
    C q^{n_old} / C q^{n_new} = ρ   ⇔   n_new − n_old = − log ρ / log q .
In the case of exponential convergence (7.4.3.15) a fixed increase of the number of quadrature
points by − log ρ / log q results in a reduction of the quadrature error by a factor of ρ > 1.
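For instance, with q ≈ 0.27 as observed for Gauss-Legendre quadrature applied to f_1 in Ex. 7.4.3.9, an error reduction by a factor of ρ = 100 costs only log 100 / log(1/0.27) ≈ 4 additional quadrature points.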
y
Review question(s) 7.4.3.16 (Quadrature error estimates)
(Q7.4.3.16.A) By the substitution s = √t we could transform
    ∫_0^b √t f(t) dt = ∫_0^{√b} 2 s² f(s²) ds .
Describe the weights w_j^(n) and nodes c_j^(n), j = 1, . . . , n, n ∈ N, of a family of quadrature rules satisfy-
ing
    ∫_0^b √t f(t) dt ≈ Σ_{j=1}^n w_j^(n) f(c_j^(n)) ,
and enjoying n-asymptotic exponential convergence, if f possesses an analytic extension beyond [0, b].
(Q7.4.3.16.B) Let (Q_n)_{n∈N} denote a family of quadrature rules with n the number of quadrature points.
For the approximate evaluation of ∫_a^b f(t) dt the following adaptive algorithm is employed:
    n := 0;
    do
      n := n + 1;
    while ( |Q_{n+1}(f) − Q_n(f)| > tol · |Q_{n+1}(f)| );
    return Q_{n+1}(f);
Assuming a sharp asymptotic behavior like E_n(f) = O(n^{−r}), r ∈ N, (algebraic convergence with rate
r) for n → ∞ of the quadrature error E_n(f), how much extra work may be incurred when reducing the
tolerance by a factor of 10?
(Q7.4.3.16.C) [Improper integral with logarithmic weight] We consider the improper integral
    I(f) := ∫_0^1 log(t) f(t) dt ,   (7.4.3.17)
When applying the n-point Gauss-Legendre quadrature formula on [0, 1] to the transformed integral,
how will the quadrature error behave for n → ∞?
    (f g)^(k)(τ) = Σ_{j=0}^k (k choose j) f^(j)(τ) g^(k−j)(τ) ,   f, g ∈ C^k .   (7.4.3.18)
Hint. How do τ ↦ ϕ(τ) and its derivatives behave as τ → 0, i.e., ϕ^(k)(τ) = O(τ^?) for τ → 0?
△
7.5 Composite Quadrature
Video tutorial for Section 7.5 "Composite Quadrature": (18 minutes) Download link,
tablet notes
Composite quadrature starts from a mesh (partition) M := {a = x_0 < x_1 < · · · < x_m = b} of the
integration interval and splits the integral accordingly:
    ∫_a^b f(t) dt = Σ_{j=1}^m ∫_{x_{j−1}}^{x_j} f(t) dt .   (7.5.0.1)
On each mesh interval [x_{j−1}, x_j] we then use a local quadrature rule, which may be one of the polynomial
quadrature formulas from Section 7.3.
Composite trapezoidal rule:
    ∫_a^b f(t) dt ≈ ½ (x_1 − x_0) f(a)
                    + Σ_{j=1}^{m−1} ½ (x_{j+1} − x_{j−1}) f(x_j)   (7.5.0.4)
                    + ½ (x_m − x_{m−1}) f(b) .
[Fig. 273: composite trapezoidal rule = exact integration of the piecewise linear interpolant of f on M]
Composite Simpson rule:
    ∫_a^b f(t) dt ≈ ⅙ (x_1 − x_0) f(a)
                    + Σ_{j=1}^{m−1} ⅙ (x_{j+1} − x_{j−1}) f(x_j)   (7.5.0.5)
                    + Σ_{j=1}^m ⅔ (x_j − x_{j−1}) f(½(x_j + x_{j−1}))
                    + ⅙ (x_m − x_{m−1}) f(b) .
[Fig. 274: composite Simpson rule = exact integration of the piecewise quadratic interpolant of f on M]
Formulas (7.5.0.4), (7.5.0.5) directly suggest an efficient implementation with a minimal number of f-
evaluations.
// Composite trapezoidal rule (7.5.0.4) on an equidistant mesh with N intervals
// (signature and setup code assumed; only the loop body is from the original listing)
template <class Function>
double trapezoidal(Function &&f, const double a, const double b, const unsigned N) {
  double I = 0.0;
  const double h = (b - a) / N;  // meshwidth
  for (unsigned i = 0; i < N; ++i) {
    // rule: T = (b - a)/2 * (f(a) + f(b)),
    // apply on N intervals: [a + i*h, a + (i+1)*h], i=0..(N-1)
    I += h / 2 * (f(a + i * h) + f(a + (i + 1) * h));
  }
  return I;
}
// Composite Simpson rule (7.5.0.5) on an equidistant mesh with N intervals
// (signature and setup code assumed; only the loop body is from the original listing)
template <class Function>
double simpson(Function &&f, const double a, const double b, const unsigned N) {
  double I = 0.0;
  const double h = (b - a) / N;  // meshwidth
  for (unsigned i = 0; i < N; ++i) {
    // rule: S = (b - a)/6*( f(a) + 4*f(0.5*(a + b)) + f(b) )
    // apply on [a + i*h, a + (i+1)*h]
    I += h / 6 * (f(a + i * h) + 4 * f(a + (i + 0.5) * h) + f(a + (i + 1) * h));
  }
  return I;
}
In both cases the function object passed in f must provide an evaluation operator double operator
(double)const. y
Remark 7.5.0.8 (Composite quadrature and piecewise polynomial interpolation) Composite quadra-
ture schemes based on local polynomial quadrature can usually be understood as "quadrature by approxi-
mation schemes" as explained in § 7.2.0.7. The underlying approximation schemes belong to the class of
general local Lagrangian interpolation schemes introduced in Section 6.6.1.
In other words, many composite quadrature schemes arise from replacing the integrand by a piecewise
interpolating polynomial, see Fig. 273 and Fig. 274 and compare with Fig. 252. y
To see the main rationale behind the use of composite quadrature rules recall Lemma 7.4.3.6: for a
polynomial quadrature rule (7.3.0.1) of order q with positive weights and f ∈ Cr ([ a, b]) the quadrature
error shrinks with the min{r, q} + 1-st power of the length |b − a| of the integration domain! Hence,
applying polynomial quadrature rules to small mesh intervals should lead to a small overall quadrature
error.
§7.5.0.9 (Quadrature error estimate for composite polynomial quadrature rules) Assume a com-
posite quadrature rule Q on [x_0, x_m] = [a, b], b > a, based on n_j-point local quadrature rules Q^j_{n_j} with
positive weights (e.g. local Gauss-Legendre quadrature rules or local Clenshaw-Curtis quadrature rules)
and of fixed orders q_j ∈ N on each mesh interval [x_{j−1}, x_j]. From Lemma 7.4.3.6 recall the estimate for
f ∈ C^r([x_{j−1}, x_j])
    | ∫_{x_{j−1}}^{x_j} f(t) dt − Q^j_{n_j}(f) | ≤ C |x_j − x_{j−1}|^{min{r,q_j}+1} ‖f^{(min{r,q_j})}‖_{L∞([x_{j−1},x_j])} ,   (7.3.0.11)
with C > 0 independent of f and j. For f ∈ Cr ([ a, b]), summing up these bounds we get for the global
quadrature error
    | ∫_{x_0}^{x_m} f(t) dt − Q(f) | ≤ C Σ_{j=1}^m h_j^{min{r,q_j}+1} ‖f^{(min{r,q_j})}‖_{L∞([x_{j−1},x_j])} ,
§7.5.0.11 (Constructing families of composite quadrature rules) As with polynomial quadrature rules,
we study the asymptotic behavior of the quadrature error for families of composite quadrature rules as a
function of the total number n of function evaluations.
As in the case of M-piecewise polynomial approximation of functions (→ Section 6.6.1) families of com-
posite quadrature rules can be generated in two different ways:
(I) use a sequence of successively refined meshes M_k = {x_j^k}_j, k ∈ N, with ♯M_k = m(k) + 1 and
    m(k) → ∞ for k → ∞, combined with the same (transformed, → Rem. 7.2.0.4) local quadrature
    rule on all mesh intervals [x^k_{j−1}, x^k_j]. Examples are the composite trapezoidal rule and composite
    Simpson rule from Ex. 7.5.0.3 on sequences of equidistant meshes.
    ➣ h-convergence
(II) on a fixed mesh M = {x_j}_{j=0}^m, on each cell use the same (transformed) local quadrature rule
    taken from a sequence of polynomial quadrature rules of increasing order.
    ➣ p-convergence
➣ p-convergence
y
EXPERIMENT 7.5.0.12 (Quadrature errors for composite quadrature rules) Composite quadrature
rules based on
• trapezoidal rule (7.3.0.5) ➣ local order 2 (exact for linear functions, see Ex. 7.4.1.10),
• Simpson rule (7.3.0.6) ➣ local order 4 (exact for cubic polynomials, see Ex. 7.4.1.10)
on the equidistant mesh M := {jh}_{j=0}^n, h = 1/n, n ∈ N.
[Fig. 275, Fig. 276: |quadrature error| vs. meshwidth (doubly logarithmic) for the composite trapezoidal
rule and the composite Simpson rule, with reference slopes O(h²) and O(h⁴);
left: f_1(t) := 1/(1 + (5t)²) on [0, 1], right: f_2(t) := √t on [0, 1].]
Asymptotic behavior of the quadrature error E(n) := | ∫_0^1 f(t) dt − Q_n(f) | for meshwidth "h → 0":
☛ Throughout we observe algebraic convergence E(n) = O(h^α) with some rate α > 0 for h = n^{−1} → 0.
➣ Sufficiently smooth integrand f_1: trapezoidal rule → α = 2, Simpson rule → α = 4 !?
Remark 7.5.0.13 (Composite quadrature rules vs. global quadrature rules) For a fixed integrand
f ∈ C^r([a, b]) of limited smoothness on an interval [a, b] we compare
• a family of composite quadrature rules based on a single local ℓ-point rule (with positive weights) of
  order q on a sequence of equidistant meshes (M_k = {x_j^k}_j)_{k∈N},
• the family of Gauss-Legendre quadrature rules from Def. 7.4.2.18.
We study the asymptotic dependence of the quadrature error on the number n of function evaluations.
The quadrature errors EnGL ( f ) of the n-point Gauss-Legendre quadrature rules are given in
Lemma 7.4.3.6, (7.4.3.7):
Gauss-Legendre quadrature converges at least as fast as fixed-order composite quadrature on equidistant
meshes.
Moreover, Gauss-Legendre quadrature “automatically detects” the smoothness of the integrand, and en-
joys fast exponential convergence for analytic integrands.
Use Gauss-Legendre quadrature instead of fixed-order composite quadrature on equidistant meshes.
y
EXPERIMENT 7.5.0.16 (Empiric convergence of equidistant trapezoidal rule) Sometimes there are
surprises: Now we will witness a convergence behavior of a composite quadrature rule that is much better
than predicted by the order of the local quadrature formula.
We consider the equidistant trapezoidal rule (order 2), see (7.5.0.4), Code 7.5.0.6
    ∫_a^b f(t) dt ≈ T_m(f) := h ( ½ f(a) + Σ_{k=1}^{m−1} f(a + kh) + ½ f(b) ) ,   h := (b − a)/m ,   (7.5.0.17)
m
1
f (t) = p , 0<a<1.
1 − a sin(2πt − 1)
(As “exact value of integral” we use T500 in the computation or quadrature errors.)
[Fig. 277, Fig. 278: |quadrature error| of the equidistant trapezoidal rule vs. number of quadrature nodes
for a = 0.5, 0.9, 0.95, 0.99; left: the 1-periodic integrand from above, right: a non-periodic function.]
§7.5.0.18 (The magic of the equidistant trapezoidal rule (for periodic integrands))
In this § we use I := [0, 1[ as a reference interval, cf. Exp. 7.5.0.16. We rely on similar techniques as in
Section 5.6, Section 5.6.2. Again, a key tool will be the bijective mapping, see Fig. 198,
ΦS1 : I → S1 := {z ∈ C : |z| = 1} , t 7→ z := exp(2πıt) , (5.6.2.1)
which induces the general pullback, cf. (6.2.1.16),
    (Φ_{S¹}^{−1})* : C⁰([0, 1[) → C⁰(S¹) ,   ((Φ_{S¹}^{−1})* f)(z) := f(Φ_{S¹}^{−1}(z)) ,   z ∈ S¹ .
If f ∈ C^r(R) and 1-periodic, then (Φ_{S¹}^{−1})* f ∈ C^r(S¹). Further, Φ_{S¹} maps equidistant nodes on I := [0, 1]
to equispaced nodes on S¹, which are roots of unity:
    Φ_{S¹}(j/n) = exp(2πı j/n)   [ (exp(2πı j/n))^n = 1 ] .   (7.5.0.19)
Now consider an n-point polynomial quadrature rule on S¹ based on the set of equidistant nodes Z :=
{ z_j := exp(2πı (j − 1)/n) , j = 1, . . . , n } and defined as
    Q_n^{S¹}(g) := ∫_{S¹} (L_Z g)(τ) dS(τ) = Σ_{j=1}^n w_j^{S¹} g(z_j) ,   (7.5.0.20)
where LZ is the Lagrange interpolation operator (→ Def. 6.2.2.1). This means that the weights obey
Thm. 7.4.1.6, where the definition (5.2.2.4) of Lagrange polynomials remains the same for complex nodes.
By sheer symmetry, all the weights have to be the same, which, since the rule will be at least of order 1,
means
    w_j^{S¹} = 2π/n ,   j = 1, . . . , n .
Moreover, the quadrature rule Q_n^{S¹} will be of order n, see Def. 7.4.1.1, that is, it will integrate polynomials
of degree ≤ n − 1 exactly.
By transformation (→ Rem. 7.2.0.4) and pullback (7.5.0.18), Q_n^{S¹} induces a quadrature rule on I := [0, 1]
by
    Q_n^I(f) := (1/2π) Q_n^{S¹}( (Φ_{S¹}^{−1})* f ) = (1/2π) Σ_{j=1}^n w_j^{S¹} f(Φ_{S¹}^{−1}(z_j)) = Σ_{j=1}^n (1/n) f((j − 1)/n) .   (7.5.0.21)
This is exactly the equidistant trapezoidal rule (7.5.0.17), if f is 1-periodic, f(0) = f(1): Q_n^I = T_n. Hence
we arrive at the following estimate for the quadrature error
    E_n(f) := | ∫_0^1 f(t) dt − T_n(f) | ≤ 2π max_{z∈S¹} | ((Φ_{S¹}^{−1})* f)(z) − (L_Z (Φ_{S¹}^{−1})* f)(z) | .
Equivalently, one can show that Tn integrates trigonometric polynomials up to degree 2n − 1 exactly.
Remember from Section 5.6.1 that the 2n + 1-dimensional space of 1-periodic trigonometric polynomials
of degree 2n can be defined as
    P^T_{2n} := Span{ t ↦ exp(2πıjt) : j = −n, . . . , n } .
Since the weights of the equidistant trapezoidal rule are clearly positive, by Thm. 7.4.3.3 the asymp-
totic behavior of the quadrature error can directly be inferred from estimates for equidistant trigono-
metric interpolation. Such estimates are given, e.g., in Thm. 6.5.3.14 and they confirm exponential
convergence for periodic integrands that allow analytic extension, in agreement with the observa-
tions made in Exp. 7.5.0.16.
Use the equidistant trapezoidal rule for the numerical quadrature of a periodic integrand over its
interval of periodicity.
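For a 1-periodic integrand with f(0) = f(1) the rule (7.5.0.17) collapses to a plain mean of m samples, so an implementation is essentially a one-liner (illustrative sketch, not one of the lecture codes):

// Equidistant trapezoidal rule on [0,1] for a 1-periodic integrand f
template <typename Functor>
double trapezoidalPeriodic(Functor &&f, unsigned int m) {
  double s = 0.0;
  for (unsigned int k = 0; k < m; ++k) s += f(static_cast<double>(k) / m);
  return s / m;  // h = 1/m, all weights equal to h
}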
    y_j = ∫_0^1 c(t) exp(2πıjt) dt .   (4.2.6.20)
where
    B_r(z) := { w ∈ C : |w − z| ≤ r } ,   r > 0 ,
is the closed disk with radius r > 0 around z and r is so small that Br (z) ⊂ D.
Using the parameterization
Hint. Recall the definition of a path integral in the complex plane ("contour integral"): If the path of
integration γ is described by a parameterization τ ∈ J ↦ γ(τ) ∈ C, J ⊂ R, then
    ∫_γ f(z) dz := ∫_J f(γ(τ)) · γ̇(τ) dτ .   (6.2.2.51)
7.6 Adaptive Quadrature
Video tutorial for Section 7.6 "Adaptive Quadrature": (13 minutes) Download link, tablet notes
Hitherto, we have just "blindly" applied quadrature rules for the approximate evaluation of ∫_a^b f(t) dt, obliv-
ious of any properties of the integrand f. This led us to the conclusion of Rem. 7.5.0.13 that Gauss-
Legendre quadrature (→ Def. 7.4.2.18) should be preferred to composite quadrature rules (→ Section 7.5)
in general. Now the composite quadrature rules will partly be rehabilitated, because they offer the flexibility
to adjust the quadrature rule to the integrand, a policy known as adaptive quadrature.
We distinguish
(I) a priori adaptive quadrature: the nodes are fixed before the evaluation of the quadrature
formula, taking into account external information about f , and
(II) a posteriori adaptive quadrature: the node positions are chosen or improved based on infor-
mation gleaned during the computation inside a loop. It terminates when sufficient accuracy
has been reached.
In this section we will chiefly discuss a posteriori adaptive quadrature for composite quadrature rules (→
Section 7.5) based on a single local quadrature rule (and its transformation).
EXAMPLE 7.6.0.2 (Rationale for adaptive quadrature) This example presents an extreme case. We
consider the composite trapezoidal rule (7.5.0.4) on a mesh M := {a = x_0 < x_1 < · · · < x_m = b} and
the integrand f(t) = 1/(10^{−4} + t²) on [−1, 1], which has a very sharp peak at t = 0.
[Fig. 279: graph of f; f(0) = 10^4, while f(±1) ≈ 1.]
A quantitative justification can appeal to (7.3.0.11) and the resulting bound for the local quadrature error
(for f ∈ C²([a, b])):
    | ∫_{x_{k−1}}^{x_k} f(t) dt − (h_k/2)( f(x_{k−1}) + f(x_k) ) | ≤ ⅛ h_k³ ‖f″‖_{L∞([x_{k−1},x_k])} ,   h_k := x_k − x_{k−1} .   (7.6.0.3)
§7.6.0.4 (Goal: equidistribution of errors) The ultimate but elusive goal is to find a mesh with a minimal
number of cells that just delivers a quadrature error below a prescribed threshold. A more practical goal is
to adjust the local meshwidths hk := xk − xk−1 in order to achieve a minimal sum of local error bounds.
This leads to the constrained minimization problem:
    Σ_{k=1}^m h_k³ ‖f″‖_{L∞([x_{k−1},x_k])} → min   s.t.   Σ_{k=1}^m h_k = b − a .   (7.6.0.5)
Lemma 7.6.0.6.
Let f : R₀⁺ → R₀⁺ be a convex function with f(0) = 0 and x > 0. Then the constrained
minimization problem: seek ζ_1, . . . , ζ_m ∈ R₀⁺ such that
    Σ_{k=1}^m f(ζ_k) → min   and   Σ_{k=1}^m ζ_k = x ,   (7.6.0.7)
has the solution ζ_1 = ζ_2 = · · · = ζ_m = x/m .
This means that we should strive for equal bounds h_k³ ‖f″‖_{L∞([x_{k−1},x_k])} for all mesh cells.
The mesh for a posteriori adaptive composite numerical quadrature should be chosen to achieve
equal contributions of all mesh intervals to the quadrature error.
As indicated above, guided by the equidistribution principle, the improvement of the mesh will be done
gradually in an iteration. The change of the mesh in each step is called mesh adaptation and there are
two fundamentally different ways to do it:
(I) by moving nodes, keeping their total number, but making them cluster where mesh intervals should
be small, or
(II) by adding nodes, where mesh intervals should be small (mesh refinement).
Algorithms for a posteriori adaptive quadrature based on mesh refinement usually have the following
structure:
(1) ESTIMATE: based on available information compute an approximation for the quadrature error
on every mesh interval.
(2) CHECK TERMINATION: if total error sufficient small → STOP
(3) MARK: single out mesh intervals with the largest or above average error contributions.
(4) REFINE: add node(s) inside the marked mesh intervals. GOTO (1)
§7.6.0.10 (Adaptive multilevel quadrature) We now see a concrete algorithm based on the two com-
posite quadrature rules introduced in Ex. 7.5.0.3.
Idea: local error estimation by comparing local results of two quadrature formu-
las Q1 , Q2 of different order → local error estimates
❶ (Error estimation) With midpoints p_k := ½(x_{k−1} + x_k) and local meshwidths h_k := x_k − x_{k−1},
    EST_k := | (h_k/6)( f(x_{k−1}) + 4 f(p_k) + f(x_k) ) − (h_k/4)( f(x_{k−1}) + 2 f(p_k) + f(x_k) ) | .   (7.6.0.11)
    (first term: Simpson rule, second term: trapezoidal rule on the split mesh interval)
❷ (Check termination)
    Simpson rule on M   ⇒   intermediate approximation I ≈ ∫_a^b f(t) dt
    If  Σ_{k=1}^m EST_k ≤ RTOL · I   (RTOL := prescribed relative tolerance)   ⇒   STOP   (7.6.0.12)
❸ (Marking)
    Marked intervals:   S := { k ∈ {1, . . . , m} : EST_k ≥ η · (1/m) Σ_{j=1}^m EST_j } ,   η ≈ 0.9 .   (7.6.0.13)
    new mesh:   M* := M ∪ { p_k := ½(x_{k−1} + x_k) : k ∈ S } .   (7.6.0.14)
Then continue with step ❶ and mesh M ← M∗ .
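The steps ❶–❹ can be condensed into the following rough C++ sketch (the interface mimics the call to adaptquad() in the main() below, but the termination with rtol/abstol, the marking threshold and all other details are illustrative assumptions, not a reproduction of Code 7.6.0.15):

#include <Eigen/Dense>
#include <algorithm>
#include <cmath>
#include <vector>
template <typename Functor>
double adaptquadsketch(Functor &&f, Eigen::VectorXd M, double rtol, double abstol) {
  for (;;) {
    const Eigen::Index m = M.size() - 1;
    Eigen::VectorXd est(m);
    double I = 0.0;
    for (Eigen::Index k = 0; k < m; ++k) {  // (1) local error estimates (7.6.0.11)
      const double h = M(k + 1) - M(k), p = 0.5 * (M(k) + M(k + 1));
      const double simp = h / 6.0 * (f(M(k)) + 4.0 * f(p) + f(M(k + 1)));
      const double trap = h / 4.0 * (f(M(k)) + 2.0 * f(p) + f(M(k + 1)));
      est(k) = std::abs(simp - trap);
      I += simp;  // Simpson value serves as intermediate approximation
    }
    // (2) termination based on the estimated total error, cf. (7.6.0.12)
    if (est.sum() <= rtol * std::abs(I) || est.sum() <= abstol) return I;
    // (3) mark intervals with above-average error contribution (7.6.0.13)
    // (4) refine them by adding their midpoints (7.6.0.14)
    std::vector<double> newM(M.data(), M.data() + M.size());
    const double thr = 0.9 * est.sum() / m;
    for (Eigen::Index k = 0; k < m; ++k)
      if (est(k) >= thr) newM.push_back(0.5 * (M(k) + M(k + 1)));
    std::sort(newM.begin(), newM.end());
    M = Eigen::Map<Eigen::VectorXd>(newM.data(), newM.size());
  }
}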
int main() {
  auto f = [](double x) { return std::exp(-x * x); };
  VectorXd M(4);
  M << -100, 0.1, 0.5, 100;
  std::cout << "Sqrt(Pi) - Int_{-100}^{100} exp(-x*x) dx = ";
  std::cout << adaptquad::adaptquad(f, M, 1e-10, 1e-12) - std::sqrt(M_PI) << "\n";
  return 0;
}
Remark 7.6.0.17 (Estimation of “wrong quadrature error”?) In Code 7.6.0.15 we use the higher order
quadrature rule, the Simpson rule of order 4, to compute an approximate value for the integral. This is
reasonable, because it would be foolish not to use this information after we have collected it for the sake
of error estimation.
Yet, according to our heuristics, what est_loc and est_tot give us are estimates for the error of the
second-order trapezoidal rule, which we do not use for the actual computations.
EXPERIMENT 7.6.0.18 (h-adaptive numerical quadrature) In this numerical test we investigate whether
the adaptive technique from § 7.6.0.10 produces an appropriate distribution of integration nodes. We run
the adaptive quadrature algorithm of Code 7.6.0.15 with tolerances rtol = 10^{−6}, abstol = 10^{−10}.
We monitor the distribution of quadrature points during the adaptive quadrature and the true and esti-
mated quadrature errors. The “exact” value for the integral is computed by composite Simpson rule on an
equidistant mesh with 107 intervals.
[Fig. 280: distribution of the quadrature points over the quadrature levels, together with the integrand f;
Fig. 281: exact and estimated quadrature errors vs. number of quadrature points.]
✦ approximate ∫_0^1 min{ exp(6 sin(2πt)), 100 } dt, initial mesh as above
[Fig. 282: distribution of the quadrature points over the quadrature levels, together with the integrand f;
Fig. 283: exact and estimated quadrature errors vs. number of quadrature points.]
Observation:
• Adaptive quadrature locally decreases meshwidth where integrand features variations or kinks.
• Trend for estimated error mirrors behavior of true error.
• Overestimation may be due to taking the modulus in (7.6.0.11)
However, the important piece of information we want to extract from ESTk is about the distribution of the
quadrature error.
y
Review question(s) 7.6.0.20
(Q7.6.0.20.A) Recall from (7.6.0.3) the bound for the local quadrature error of the trapezoidal rule:
    | ∫_{x_{k−1}}^{x_k} f(t) dt − (h_k/2)( f(x_{k−1}) + f(x_k) ) | ≤ ⅛ h_k³ ‖f″‖_{L∞([x_{k−1},x_k])} ,   h_k := x_k − x_{k−1} .   (7.6.0.3)
We consider the singular integrand f(t) = √t on [0, 1]. What mesh M has to be chosen to ensure the
equidistribution of the error bounds from (7.6.0.3)?
(Q7.6.0.20.B) For a posteriori adaptive mesh refinement for the approximation of ∫_a^b f(t) dt
we employed the following estimate of the local quadrature error on the mesh
M := { a = x_0 < x_1 < · · · < x_m := b } :
    EST_k := | (h_k/6)( f(x_{k−1}) + 4 f(p_k) + f(x_k) ) − (h_k/4)( f(x_{k−1}) + 2 f(p_k) + f(x_k) ) | .   (7.6.0.11)
    (first term: Simpson rule, second term: trapezoidal rule on the split mesh interval)
We could also have used the two lowest-order Gauss-Legendre quadrature rules for that purpose:
• the 1-point midpoint rule, defined on [−1, 1] by the node c_1 := 0 and the weight w_1 := 2,
• the 2-point Gauss-Legendre rule from Ex. 7.4.2.2, defined on [−1, 1] by the weights/nodes
  { w_2 = 1, w_1 = 1, c_1 = ⅓√3, c_2 = −⅓√3 } .
Write down the formula for the resulting estimator EST_k and compare it with the choice (7.6.0.11) in
terms of the number of required f-evaluations.
△
Learning Outcomes
✦ You should know what a quadrature formula is and the terminology connected with it,
✦ You should be able to transform quadrature formulas to arbitrary intervals.
✦ You should understand how interpolation and approximation schemes spawn quadrature formulas
and how quadrature errors are connected to interpolation/approximation errors.
✦ You should be able to compute the weights of polynomial quadrature formulas.
✦ You should know the concept of order of a quadrature rule and why it is invariant under (affine)
transformation
✦ You should remember the maximal and minimal order of polynomial quadrature rules.
✦ You should know the order of the n-point Gauss-Legendre quadrature rule.
✦ You should understand why Gauss-Legendre quadrature converges exponentially for integrands that
can be extended analytically, and only algebraically for integrands with limited smoothness.
✦ You should be able to apply regularizing transformations to integrals with non-smooth integrands.
✦ You should know about asymptotic convergence of the h-version of composite quadrature.
✦ You should know the principles of adaptive composite quadrature.
[BL19] L. Banjai and M. López-Fernández. “Efficient high order algorithms for fractional integrals
and fractional differential equations”. In: Numer. Math. 141.2 (2019), pp. 289–317. DOI:
10.1007/s00211-018-1004-0.
[Bog14] I. Bogaert. “Iteration-free computation of Gauss-Legendre quadrature nodes and weights”.
In: SIAM J. Sci. Comput. 36.3 (2014), A1008–A1026. DOI: 10.1137/140954969 (cit. on
p. 570).
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 552, 556, 559).
[DH03] P. Deuflhard and A. Hohmann. Numerical Analysis in Modern Scientific Computing. Vol. 43.
Texts in Applied Mathematics. Springer, 2003 (cit. on p. 584).
[Fej33] L. Fejér. “Mechanische Quadraturen mit positiven Cotesschen Zahlen”. In: Math. Z. 37.1
(1933), pp. 287–309. DOI: 10.1007/BF01474575 (cit. on p. 559).
[Gan+05] M. Gander, W. Gander, G. Golub, and D. Gruntz. Scientific Computing: An introduction using
MATLAB. Springer, 2005 (cit. on p. 569).
[GLR07] Andreas Glaser, Xiangtao Liu, and Vladimir Rokhlin. “A fast algorithm for the calculation of
the roots of special functions”. In: SIAM J. Sci. Comput. 29.4 (2007), pp. 1420–1438. DOI:
10.1137/06067016X.
[Gut09] M.H. Gutknecht. Lineare Algebra. Lecture Notes. SAM, ETH Zürich, 2009 (cit. on pp. 564,
565).
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 552, 557, 559,
568).
[Joh08] S.G. Johnson. Notes on the convergence of trapezoidal-rule quadrature. MIT online course
notes, http://math.mit.edu/~stevenj/trapezoidal.pdf. 2008.
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
pp. 564, 565).
[Tre08] Lloyd N. Trefethen. “Is Gauss quadrature better than Clenshaw-Curtis?” In: SIAM Rev. 50.1
(2008), pp. 67–87. DOI: 10.1137/060659831 (cit. on pp. 559, 569).
[TWXX] Lloyd N. Trefethen and J. A. C. Weideman. The exponentially convergent trapezoidal rule. XX.
[Wal06] Jörg Waldvogel. “Fast construction of the Fejér and Clenshaw-Curtis quadrature rules”. In: BIT
46.1 (2006), pp. 195–202. DOI: 10.1007/s10543-006-0045-4 (cit. on p. 559).
[Wal11] Jörg Waldvogel. “Towards a general error theory of the trapezoidal rule”. In: Approximation
and computation. Vol. 42. Springer Optim. Appl. Springer, New York, 2011, pp. 267–282. DOI:
10.1007/978-1-4419-6594-3_17.
Chapter 8
Iterative Methods for Non-Linear Systems of Equations
8.1 Introduction
Video tutorial for Section 8.1 "Iterative Methods for Non-Linear Systems of Equations: Intro-
duction": (6 minutes) Download link, tablet notes
[Fig. 284: non-linear electric circuit containing a transistor (ports: emitter, collector, base)]
A transistor has three ports: emitter, collector, and base. Transistor models give the port currents as
functions of the applied voltages, for instance the Ebers-Moll model (large signal approximation):
    I_C = I_S ( e^{U_BE/U_T} − e^{U_BC/U_T} ) − (I_S/β_R) ( e^{U_BC/U_T} − 1 ) = I_C(U_BE, U_BC) ,
    I_B = (I_S/β_F) ( e^{U_BE/U_T} − 1 ) + (I_S/β_R) ( e^{U_BC/U_T} − 1 ) = I_B(U_BE, U_BC) ,   (8.1.0.2)
    I_E = I_S ( e^{U_BE/U_T} − e^{U_BC/U_T} ) + (I_S/β_F) ( e^{U_BE/U_T} − 1 ) = I_E(U_BE, U_BC) .
IC , IB , IE : current in collector/base/emitter,
UBE , UBC : potential drop between base-emitter, base-collector.
The parameters have the following meanings: β F is the forward common emitter current gain (20 to 500),
β R is the reverse common emitter current gain (0 to 20), IS is the reverse saturation current (on the order
of 10−15 to 10−12 amperes), UT is the thermal voltage (approximately 26 mV at 300 K).
The circuit of Fig. 284 has 5 nodes ➀–➄ with unknown nodal potentials. Kirchhoff's law (2.1.0.4) plus the
constitutive relations give an equation for each of them.
Non-linear system of equations from nodal analysis, static case (→ Ex. 2.1.0.3):
5 equations ↔ 5 unknowns U1 , U2 , U3 , U4 , U5
Remark 8.1.0.4 (General non-linear systems of equations) A non-linear system of equations is a con-
cept almost too abstract to be useful, because it covers an extremely wide variety of problems. Never-
theless in this chapter we will mainly look at “generic” methods for such systems. This means that every
method discussed may take a good deal of fine-tuning before it will really perform satisfactorily for a given
non-linear system of equations. y
§8.1.0.5 (Generic/general non-linear system of equations) Let us try to describe the “problem” of
having to solve a non-linear system of equations, where the concept of a “problem” was first introduced in
§ 1.5.5.1.
Given:   a function F : D ⊂ R^n ↦ R^n ,   n ∈ N .
Possible meanings:   ☞ F is known as an analytic expression.
                     ☞ F is merely available in procedural form allowing point evaluations.
In contrast to the situation for linear systems of equations (→ Thm. 2.2.1.4), the class of non-linear sys-
tems is far too big to allow a general theory:
There are no general results on the existence & uniqueness of solutions of a “generic” non-
linear system of equations F (x) = 0.
y
Review question(s) 8.1.0.6 (Iterative Methods for Non-Linear Systems of Equations: Introduction)
(Q8.1.0.6.A) State, in the form F(x) = 0 with a function F : R^n → R^n, the non-linear system of equations
whose solution answers the following question:
    How does the diagonal of the given matrix A ∈ R^{n,n} have to be modified (yielding a matrix
    Ã) so that the linear system of equations Ãx = b, b ∈ R^n given, has a prescribed solution
    x∗ ∈ R^n ?
When does the non-linear system of equations have a unique solution and what is it?
(Q8.1.0.6.B) Which non-linear system of equations is solved by every global minimizer of a continuously
differentiable function f : R^n → R?
Hint. From your analysis course remember necessary conditions for a global minimum of a continuously
differentiable function.
(Q8.1.0.6.C) A diode is a non-linear circuit element that yields vastly different currents depending on
the polarity of the applied voltage. Quantitatively, its current voltage relationship is described by the
Shockley diode equation
    I(U) = I_S ( exp(U/U_T) − 1 ) ,
Fig. 286
Hint. Nodal analysis of electric circuits is explained in Ex. 2.1.0.3. You may look up that example.
(Q8.1.0.6.D) [Inverse function] From analysis you know that every monotonic function f : I ⊂ R → R
can be inverted on its range f ( I ). Reformulate the task of evaluating f −1 (y) for y ∈ f ( I ) as a non-
linear equation in the standard form F ( x ) = 0 for a suitable function F.
△
8.2 Iterative Methods
8.2.1 Fundamental Concepts
Video tutorial for Section 8.2.1 "Iterative Methods: Fundamental Concepts": (6 minutes)
Download link, tablet notes
In general, iterative methods yield only approximate solutions whenever they terminate after finite time.
An iterative method for (approximately) solving the non-linear equation F(x) = 0 is an algorithm gen-
erating an arbitrarily long sequence (x^(k))_k of approximate solutions.
    x^(k) = k-th iterate ,   x^(0) = initial guess .
[Fig. 287: iterates x^(0), x^(1), x^(2), . . . in D approaching the solution x∗]
y
§8.2.1.2 (Key issues with iterative methods) When applying an iterative method to solve a non-linear
system of equations F (x) = 0, the following issues arise:
✦ Speed of convergence: How "fast" does ‖x^(k) − x∗‖ (‖·‖ a suitable norm on R^n) decrease for
increasing k?
More formal definitions can be given:
An iterative method converges (for fixed initial guess(es))   :⇔   x^(k) → x∗ for k → ∞ and F(x∗) = 0 .
§8.2.1.4 ((Stationary) m-point iterative method) All the iterative methods discussed below fall in the
class of (stationary) m-point, m ∈ N, iterative methods, for which the iterate x^(k+1) depends on F and the
m most recent iterates x^(k), . . . , x^(k−m+1):
    x^(k+1) := Φ_F(x^(k), . . . , x^(k−m+1)) .
[Fig. 288]
y
A stationary m-point iterative method is consistent with the non-linear system of equations F(x) = 0, if
and only if
    Φ_F(x∗, . . . , x∗) = x∗   ⇐⇒   F(x∗) = 0 .
For a consistent method with continuous iteration function Φ_F, convergence of the iterates implies that
their limit is a solution:
    x∗ := lim_{k→∞} x^(k)   ⇒   F(x∗) = 0 .
Proof. The very definition of continuity means that limits can be "pulled into a function". ✷
For a consistent stationary iterative method we can study the error of the iterates x^(k) defined as:
    e^(k) := x^(k) − x∗ .
§8.2.1.9 (Local convergence of iterative methods) Unfortunately, convergence may critically depend on
the choice of initial guesses. The property defined next weakens this dependence:
[Fig. 289]
y
Our goal: Given a non-linear system of equations, find iterative methods that converge (locally) to a
solution of F (x) = 0.
Two general questions: How to measure, describe, and predict the speed of convergence?
When to terminate the iteration?
Review question(s) 8.2.1.11
(Q8.2.1.11.A) How can the general stationary m-point iterative method
    x^(k+1) = Φ_F(x^(k), . . . , x^(k−m+1)) ,   Φ_F : R^n × · · · × R^n (m factors) → R^n ,
be rewritten as a 1-point iteration (also called a fixed-point iteration)? What does consistency mean for that
1-point iteration?
(Q8.2.1.11.B) When is the following 1-point iterative method
8.2.2 Speed of Convergence
Video tutorial for Section 8.2.2 "Iterative Methods: Speed of Convergence": (15 minutes)
Download link, tablet notes
Here and in the sequel, k·k designates a generic vector norm on R n , see Def. 1.5.5.4. Any occurring
matrix norm is induced by this vector norm, see Def. 1.5.5.10.
It is important to be aware which statements depend on the choice of norm and which do not!
“Speed of convergence” measures the decrease of a norm (see Def. 1.5.5.4) of the iteration error
Terminology: The least upper bound for L gives the rate of convergence:
    rate := sup_{k∈N₀} ‖x^(k+1) − x∗‖ / ‖x^(k) − x∗‖ ,   x∗ := lim_{k→∞} x^(k) .   (8.2.2.2)
If dim V < ∞ all norms (→ Def. 1.5.5.4) on V are equivalent (→ Def. 8.2.2.4).
Remark 8.2.2.6 (Detecting linear convergence) Often we will study the behavior of a consistent iterative
method for a model problem in a numerical experiments and measure the norms of the iteration errors
e(k) := x(k) − x∗ . How can we tell that the method enjoys linear convergence?
log e(k)
norms of iteration errors
l
∼ on straight line in lin-log plot
e ( k ) ≤ L k e (0) ,
ˆ linear cvg., • =
•= ˆ faster than linear cvg. ✄
Fig. 290
1 2 3 4 5 6 7 8 k
Let us abbreviate the error norm in step k by ǫk := x(k) − x∗ . In the case of linear convergence (see
Def. 8.2.2.1) assume (with 0 < L < 1)
ǫk+1 ≈ Lǫk ⇒ log ǫk+1 ≈ log L + log ǫk ⇒ log ǫk ≈ k log L + log ǫ0 . (8.2.2.7)
We conclude that log L < 0 determines the slope of the graph in lin-log error chart.
Related: guessing time complexity O(nα ) of an algorithm from measurements, see § 1.4.1.6.
Note the green dots • in Fig. 290: Any “faster” convergence also qualifies as linear convergence in the strict
sense of the definition. However, whenever this term is used, we tacitly imply, that no “faster convergence”
prevails and that the estimates in (8.2.2.7) are sharp. y
cos( x (k) ) + 1
x ( k +1) = x ( k ) + .
sin( x (k) )
In the C++ code Code 8.2.2.9 x has to be initialized with the different values for x0 .
Note: The final iterate x (15) replaces the exact solution x ∗ in the computation of the rate of convergence.
8. Iterative Methods for Non-Linear Systems of Equations, 8.2. Iterative Methods 601
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
8 f o r ( i n t i =0; i <N; ++ i ) {
9 x = x + ( cos ( x ) +1) / s i n ( x ) ;
10 y( i ) = x;
11 }
12 e r r . r e s i z e (N) ; r a t e s . r e s i z e (N) ;
13 e r r = y−VectorXd : : Constant (N, x ) ;
14 r a t e s = e r r . bottomRows (N−1) . cwiseQuotient ( e r r . topRows (N−1) ) ;
15 }
−1
10
m 10
→ Rem. 8.2.2.6
−4
10
1 2 3 4 5 6 7 8 9 10
Fig. 291 index of iterate
There are notions of convergence that guarantee a much faster (asymptotic) decay of the norms of the
iteration errors than linear convergence from Def. 8.2.2.1.
Definition 8.2.2.10. Order of convergence → [Han02, Sect. 17.2], [DR08, Def. 5.14], [QSS00,
Def. 6.1]
Of course, the order p of convergence of an iterative method refers to the largest possible p in the def-
8. Iterative Methods for Non-Linear Systems of Equations, 8.2. Iterative Methods 602
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
inition, that is, the error estimate will in general not hold, if p is replaced with p + ǫ for any ǫ > 0, cf.
Rem. 1.4.1.3.
0
10
−2
10
−4
10
iteration error
In the case of convergence of order p ( p > 1) according to Def. 8.2.2.10 and assuming sharpness of the
error bound we obtain for the error norms ǫk := x(k) − x∗ :
k
p
ǫk+1 ≈ Cǫk ⇒ log ǫk+1 = log C + p log ǫk ⇒ log ǫk+1 = log C ∑ pl + pk+1 log ǫ0
l =0
log C log C
⇒ log ǫk+1 = − + + log ǫ0 pk+1 .
p−1 p−1
In this case, the error graph of the function k 7→ log ǫk is a concave (“downward bent”) power curve (for
sufficiently small ǫ0 !)
Remark 8.2.2.12 (Detecting order p > 1 of convergence) How can we guess the order of convergence
(→ Def. 8.2.2.10) from tabulated error norms measured in a numerical experiment?
➣ monitor the quotients (log ǫk+1 − log ǫk )/(log ǫk − log ǫk−1 ) over several steps of the iteration. y
EXAMPLE 8.2.2.13 (quadratic convergence = convergence √ of order 2) From your analysis course
[Str09, Bsp. 3.3.2(iii)] recall the famous iteration for computing a, a > 0:
1 (k) a √ 1 √
x ( k +1) = ( x + ( k ) ) ⇒ | x ( k +1) − a | = ( k ) | x ( k ) − a | 2 . (8.2.2.14)
2 x 2x
√ √
By the arithmetic-geometric mean inequality (AGM) ab ≤ 12 ( a + b) we conclude: x (k) > a for
k ≥√1. Therefore estimate from (8.2.2.14) means that the sequence from (8.2.2.14) converges with order
2 to a.
Note: x (k+1) < x (k) for all k ≥ 2 ➣ ( x (k) )k∈N0 converges as a decreasing sequence that is bounded
from below (→ analysis course)
8. Iterative Methods for Non-Linear Systems of Equations, 8.2. Iterative Methods 603
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
Note the doubling of the number of correct digits in each step ! [impact of roundoff !]
The doubling of the number of significant digits for the iterates holds true for any quadratically convergent
iteration:
Recall from Rem. 1.5.3.4 that the relative error (→ Def. 1.5.3.3) tells the number of significant digits.
Indeed, denoting the relative error in step k by δk , we have in the case of quadratic convergence.
x (k) = x ∗ (1 + δk ) ⇒ x (k) − x ∗ = δk x ∗ .
⇒| x ∗ δk+1 | = | x (k+1) − x ∗ | ≤ C | x (k) − x ∗ |2 = C | x ∗ δk |2
⇒ |δk+1 | ≤ C | x ∗ |δk2 . (8.2.2.15)
8. Iterative Methods for Non-Linear Systems of Equations, 8.2. Iterative Methods 604
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
What is lim x(k) ? Does this iteration generate linearly convergent sequences?
k→∞
Give a sharp criterion for the initial guess x(0) ∈ R n that guarantees convergence of the resulting se-
quence.
Hint. When will the sequence x (k) − x ∗ be decreasing?
k ∈N 0
△
Video tutorial for Section 8.2.3 "Iterative Methods: Termination Criteria/Stopping Rules": (14
minutes) Download link, tablet notes
As remarked above, usually (even without roundoff errors) an iteration will never arrive at an/the exact
solution x∗ after finitely many steps. Thus, we can only hope to compute an approximate solution by
accepting x(K ) as result for some K ∈ N0 . Termination criteria (stopping rules) are used to determine a
suitable value for K .
For the sake of efficiency ✄ stop iteration when iteration error is just “small enough”
(“small enough” depends on the concrete problem and user demands.)
§8.2.3.1 (Classification of termination criteria (stopping rules) for iterative solvers for non-linear
✎ ☞
systems of equations)
A termination criterion (stopping rule) is an algorithm deciding in each step of an iterative method
✍ ✌
whether to STOP or to CONTINUE.
We can distinguish two types of stopping rules:
Decision to stop based on information Beside x(0) and F, also current and
about F and x(0) , made before start- past iterates are used to decide about
ing the iteration. termination.
A termination criterion for a convergent iteration is deemed reliable, if it lets the iteration CONTINUE, until
the iteration error e(k) := x(k) − x∗ , x∗ the limit value, satisfies certain conditions (usually imposed before
the start of the iteration). y
§8.2.3.2 (Ideal termination) Writing x∗ for the desired solution, termination criteria are usually meant to
ensure accuracy of the final iterate x(K ) in the following sense:
8. Iterative Methods for Non-Linear Systems of Equations, 8.2. Iterative Methods 605
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
As pointed out before, the comparison x(k) − x∗ ≤ τabs is necessary to ensure termination when
x∗ = 0 can happen.
Obviously, (8.2.3.3) achieves the optimum in terms of efficiency and reliability. Obviously, this termination
criterion is not practical, because x∗ is not known. Algorithmic feasible stopping rules have to replace
x(k) − x∗ and kx∗ k with (upper/lower) bounds or estimates.
§8.2.3.4 (Practical termination criteria for iterations) The following termination criteria are commonly
used in numerical codes:
➀ A priori termination: stop iteration after fixed number of steps (possibly depending on x(0) ).
(A priori =
ˆ without actually taking into account the computed iterates, see § 8.2.3.1)
Invoking additional properties of either the non-linear system of equations F (x) = 0 or the iteration
it is sometimes possible to tell that for sure x(k) − x∗ ≤ τ for all k ≥ K, though this K may be
(significantly) larger than the optimal termination index from (8.2.3.3), see § 8.2.3.7.
➁ Residual based termination: STOP convergent iteration {x(k) }k∈N0 , when
no guaranteed accuracy
8. Iterative Methods for Non-Linear Systems of Equations, 8.2. Iterative Methods 606
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
F(x) F(x)
x x
Also for this criterion, we have no guarantee that (8.2.3.3) will be satisfied only remotely.
y
Remark 8.2.3.5 (STOP, when stationary in M) A special variant of correction based termination exploits
that M is finite! (→ Section 1.5.3)
3 {
stationary in the discrete set M of ma- 4 double x _ o l d = −1;
chine numbers! 5 double x = a ;
while ( x _ o l d ! = x ) {
y
possibly grossly inefficient ! 6
7 x_old = x ;
(always computes “up to
8 x = 0 . 5 * ( x+a / x ) ;
machine precision”) 9 }
10 return x ;
11 }
The following simple manipulations give an a posteriori termination criterion (for linearly convergent itera-
8. Iterative Methods for Non-Linear Systems of Equations, 8.2. Iterative Methods 607
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
This suggests that we take the right hand side of (8.2.3.8) as a posteriori error bound and use it instead
of the inaccessible x(k+1) − x∗ for checking absolute and relative accuracy in (8.2.3.3). The resulting
termination criterium will be reliable (→ § 8.2.3.1), since we will certainly have achieved the desired
accuracy when we stop the iteration.
(Using e
L > L in (8.2.3.8) still yields a valid upper bound for x(k) − x∗ . Hence, the result can
be trusted, though we might have wasted computational resources by needlessly carrying on with the
iteration.) y
EXAMPLE 8.2.3.9 (A posteriori error bound for linearly convergent iteration) We revisit the iteration
of Ex. 8.2.2.8:
cos x (k) + 1
x ( k +1) = x ( k ) + ⇒ x (k) → π for x (0) close to π .
sin x (k)
Observed rate of convergence: L = 1/2
Error and error bound for x (0) = 0.4:
k | x (k) − π | L
1− L | x
(k) − x ( k −1) | slack of bound
8. Iterative Methods for Non-Linear Systems of Equations, 8.2. Iterative Methods 608
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
f ( x0 )
∆s :=
f ′ ( x0 )
L
STOP, as soon as x(k+1) − x(k) ≤ τabs ,
1−L
where τabs > 0 is a user-supplied threshold. Assume that the true (“sharp”) rate of linear convergence
is L ∈ [0, 1[, that is,
x ( k +1) − x ∗ ≈ L x ( k ) − x ∗ ∀k ∈ N0 ,
but for stopping rule a larger value L, L < L < 1 is used. Discuss what this means for the number of
steps until termination and the absolute accuracy of the returned approximation.
△
Supplementary literature. The contents of this section are also treated in [DR08, Sect. 5.3],
1-point stationary iterative methods, see (8.2.1.5), for F (x) = 0 are also called fixed point iterations.
iteration function Φ : U ⊂ R n 7→ R n
➣ iterates (x(k) )k∈N0 : x ( k +1) : = Φ ( x ( k ) ) .
initial guess x (0) ∈ U
| {z }
→ 1-point method, cf. (8.2.1.5)
Note that the sequence of iterates need not be well defined: x(k) 6∈ U possible !
8. Iterative Methods for Non-Linear Systems of Equations, 8.3. Fixed-Point Iterations 609
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
A fixed point iteration x(k+1) = Φ(x(k) ) is consistent with F (x) = 0, if, for x ∈ U ∩ D,
F (x) = 0 ⇔ Φ(x) = x .
This is an immediate consequence that for a continuous function limits and function evaluations commute
[Str09, Sect. 4.1].
x ( k +1) : = Φ ( x ( k ) ) . (8.3.1.2)
Note: there are many ways to transform F (x) = 0 into a fixed point form !
EXPERIMENT 8.3.1.3 (Many choices for consistent fixed point iterations) In this example we con-
struct three different consistent fixed point iteration for a single scalar (n = 1) non-linear equation
F ( x ) = 0. In numerical experiments we will see that they behave very differently.
2
1.5
F ( x ) = xe x − 1 , x ∈ [0, 1] .
1
Different fixed-point forms:
F(x)
Φ1 ( x ) = e − x ,
0.5
1+x
Φ2 ( x ) = , 0
1 + ex
Φ3 ( x ) = x + 1 − xe x . −0.5
−1
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
x
8. Iterative Methods for Non-Linear Systems of Equations, 8.3. Fixed-Point Iterations 610
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
1 1 1
Φ
Φ
0.4 0.4 0.4
0 0 0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
x x x
8. Iterative Methods for Non-Linear Systems of Equations, 8.3. Fixed-Point Iterations 611
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
In Exp. 8.3.1.3 we observed vastly different behavior of different fixed point iterations for n = 1. Is it
possible to predict this from the shape of the graph of the iteration functions?
1 1 1
Φ
0.4 0.4 0.4
0 0 0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
x x x
x x
x x
It seems that the slope of the iteration function Φ in the fixed point, that is, in the point where it intersects
8. Iterative Methods for Non-Linear Systems of Equations, 8.3. Fixed-Point Iterations 612
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
Now we investigate rigorously, when a fixed point iteration will lead to a convergent iteration with a partic-
ular qualitative kind of convergence according to Def. 8.2.2.10.
A simple consideration: if Φ(x∗ ) = x∗ (fixed point), then a fixed point iteration induced by a contractive
mapping Φ satisfies
(8.3.2.4)
x ( k +1) − x ∗ = Φ ( x ( k ) ) − Φ ( x ∗ ) ≤ L x(k) − x∗ ,
that is, the iteration converges (at least) linearly (→ Def. 8.2.2.1).
then there is a unique fixed point x∗ ∈ D, Φ(x∗ ) = x∗ , which is the limit of the sequence of iterates
x(k+1) := Φ( x (k) ) for any x(0) ∈ D.
Lk
≤ x (1) − x (0) k→∞
−−−→ 0 .
1−L
8. Iterative Methods for Non-Linear Systems of Equations, 8.3. Fixed-Point Iterations 613
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
Lemma 8.3.2.7. Sufficient condition for local linear convergence of fixed point iteration →
[Han02, Thm. 17.2], [DR08, Cor. 5.12]
x ( k +1) : = Φ ( x ( k ) ) , (8.3.1.2)
converges locally and at least linearly that is matrix norm, Def. 1.5.5.10 !
∃0 ≤ L < 1: x ( k +1) − x ∗ ≤ L x ( k ) − x ∗ ∀k ∈ N0 ,
∂Φ ∂Φ1 ∂Φ1
1
∂x1 (x) ∂x2 ( x ) ··· ··· ∂xn ( x )
" #n ∂Φ2 ∂Φ2
∂Φi (x)
∂xn ( x )
D Φ(x) = (x) =
∂x1
.. .. . (8.3.2.8)
∂x j . .
i,j=1
∂Φn ∂Φn ∂Φn
∂x1 ( x ) ∂x2 ( x ) ··· ··· ∂xn ( x )
A “visualization” of the statement of Lemma 8.3.2.7 has been provided in Rem. 8.3.2.2: The iteration
converges locally, if Φ is flat in a neighborhood of x ∗ , it will diverge, if Φ is steep there.
if x(k) − x∗ < δ.
✷
8. Iterative Methods for Non-Linear Systems of Equations, 8.3. Fixed-Point Iterations 614
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
Lemma 8.3.2.9. Sufficient condition for linear convergence of fixed point iteration
If Φ(x∗ ) = x∗ for some interior point x∗ ∈ U , then the fixed point iteration x(k+1) = Φ(x(k) ) with
x(0) ∈ U converges to x∗ at least linearly with rate L.
We find that Φ is contractive on U with unique fixed point x∗ , to which x(k) converges linearly for k → ∞.
✷
Remark 8.3.2.10 (Bound for asymptotic rate of linear convergence) By asymptotic rate of a linearly
converging iteration we mean the contraction factor for the norm of the iteration error that we can expect,
when we are already very close to the limit x∗ .
If 0 < k DΦ(x∗ )k < 1, x(k) ≈ x∗ then the (worst) asymptotic rate of linear convergence is L =
k DΦ( x ∗ )k y
EXAMPLE 8.3.2.11 (Multidimensional fixed point iteration) In this example we encounter the first
genuine system of non-linear equations and apply Lemma 8.3.2.9 to it.
What about higher order convergence (→ Def. 8.2.2.10, cf. Φ2 in Ex. 8.3.1.3)? Also in this case we should
study the derivatives of the iteration functions in the fixed point (limit point).
We give a refined convergence result only for n = 1 (scalar case, Φ : dom(Φ) ⊂ R 7→ R):
8. Iterative Methods for Non-Linear Systems of Equations, 8.3. Fixed-Point Iterations 615
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
Here we used the Landau symbol O(·) to describe the local behavior of a remainder term in the vicinity of
x∗
Lemma 8.3.2.15. Higher order local convergence of fixed point iterations
EXPERIMENT 8.3.2.16 (Exp. 8.3.2.1 continued) Now, Lemma 8.3.2.9 and Lemma 8.3.2.15 permit us a
precise prediction of the (asymptotic) convergence we can expect from the different fixed point iterations
studied in Exp. 8.3.1.3.
1 1 1
Φ
Φ
0 0 0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
x x x
1 − xe x
Φ2′ ( x ) = = 0 , if xe x − 1 = 0 hence quadratic convergence ! .
(1 + e x )2
∗
Since x ∗ e x − 1 = 0, simple computations yield
8. Iterative Methods for Non-Linear Systems of Equations, 8.3. Fixed-Point Iterations 616
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
1
Φ3′ ( x ) = 1 − xe x − e x ⇒ Φ3′ ( x ∗ ) = − ≈ −1.79 hence no convergence .
x∗
y
△-ineq. k+m−1 k + m −1
x(k+m) − x(k) ≤ ∑ x ( j +1) − x ( j ) ≤ ∑ L j − k x ( k +1) − x ( k )
j=k j=k
1− Lm ( k +1) (k) 1 − L m k − l ( l +1)
= x −x ≤ L x − x(l ) .
1−L 1−L
Lk−l
x∗ − x(k) ≤ x ( l +1) − x ( l ) . (8.3.2.18)
1−L
Lk L
x∗ − x(k) ≤ x (1) − x (0) (8.3.2.19) x∗ − x(k) ≤ x ( k ) − x ( k −1)
1−L 1−L
(8.3.2.20)
With the same arguments as in § 8.2.3.7 we see that overestimating L, that is, using a value for L that is
larger than the true value, still gives reliable termination criteria.
However, whereas overestimating L in (8.3.2.20) will not lead to a severe deterioration of the bound, unless
L ≈ 1, using a pessimistic value for L in (8.3.2.19) will result in a bound way bigger than the true bound, if
k ≫ 1. Then the a priori termination criterion (8.3.2.19) will recommend termination many iterations after
the accuracy requirements have already been met. This will thwart the efficiency of the method. y
Review question(s) 8.3.2.21 (Fixed-point iterations)
(Q8.3.2.21.A) Let x(k) , k ∈ N0 , be the iterates produced by a fixed-point iteration x(k+1) = Φ(x(k) ),
Φ : R n → R n . Formulate a 2-point iteration
8. Iterative Methods for Non-Linear Systems of Equations, 8.3. Fixed-Point Iterations 617
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
x ( k +1) = Φ F ( x ( k ) , . . . , x ( k − m +1) ) ,
(Q8.3.2.21.B)
√ Given a > 0 the following iteration functions spawn fixed-point iterations for the computation
of a:
ϕ1 ( x ) : = a + x − x 2 ,
ϕ2 ( x ) := a/x ,
1
ϕ3 ( x ) : = 1 + x − x 2 ,
a
1
ϕ4 ( x ) := ( x + a/x ) .
2
Predict the behavior and the type of convergence
√ of the induced fixed-point iterations when started with
an initial guess “sufficiently close” to a.
△
Supplementary literature. [AG11, Ch. 3] is also devoted to this topic. The algorithm of “bisec-
tion” discussed in the next subsection, is treated in [DR08, Sect. 5.5.1] and [AG11, Sect. 3.2].
Sought: x∗ ∈ I : F( x∗ ) = 0
8.4.1 Bisection
Video tutorial for Section 8.4.1 "Finding Zeros of Scalar Functions: Bisection": (7 minutes)
Download link, tablet notes
Idea: use ordering of real numbers & intermediate value theorem [Str09, Sect. 4.6]
F(x)
Input: a, b ∈ I such that
x∗ x
∃ x ∗ ∈] min{ a, b}, max{ a, b}[: a b
F( x∗ ) = 0 ,
8. Iterative Methods for Non-Linear Systems of Equations, 8.4. Finding Zeros of Scalar Functions 618
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
Find a sequence of intervals with geometrically decreasing lengths, in each of which F will change
sign.
Such a sequence can easily be found by testing the sign of F at the midpoint of the current interval, see
Code 8.4.1.2.
§8.4.1.1 (Bisection method) The following C++ code implements the bisection method for finding the
zeros of a function passed through the function handle F in the interval [ a, b] with absolute tolerance
tol.
Line 18: the test ((a<x)&& (x<b)) offers a safeguard against an infinite loop in case tol < resolution
of M at zero x ∗ (cf. “M-based termination criterion”).
This is also an example for an algorithm that (in the case of tol=0) uses the properties of machine
arithmetic to define an a posteriori termination criterion, see Section 8.2.3. The iteration will terminate,
when, e.g., a+e 12 (b − a) = a (+
e is the floating point realization of addition), which, by the Ass. 1.5.3.11
can only happen, when
| 21 (b − a)| ≤ EPS · | a| .
Since the exact zero is located between a and b, this condition implies a relative error ≤ EPS of the
computed zero.
• “foolproof”, robust: will always terminate with a zero of requested accuracy,
Advantages: • requires only point evaluations of F,
• works with any continuous function F, no derivatives needed.
8. Iterative Methods for Non-Linear Systems of Equations, 8.4. Finding Zeros of Scalar Functions 619
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
Merely “linear-type”
(∗)convergence: | x ( k ) − x ∗ | ≤ 2− k | b − a |
Drawbacks: |b − a|
log2 steps necessary
tol
(∗): the convergence of a bisection algorithm is not linear in the sense of Def. 8.2.2.1, because the
condition x (k+1) − x ∗ ≤ L x (k) − x ∗ might be violated at any step of the iteration.
Remark 8.4.1.3 (Generalized bisection methods) It is straightforward to combine the bisection idea
with more elaborate “model function methods” as they will be discussed in the next section: Instead of
stubbornly choosing the midpoint of the probing interval [ a, b] (→ Code 8.4.1.2) as next iterate, one may
use a refined guess for the location of a zero of F in [ a, b].
A method of this type is used by M ATLAB’s fzero function for root finding in 1D [QSS00, Sect. 6.2.3]. y
Review question(s) 8.4.1.4 (Finding Zeros of Scalar Functions: Bisection)
(Q8.4.1.4.A) We use the bisection method to find a zero of f : [0.5, 2] → R with f (1) < 0 and f (2) > 0.
Find an a priori bound for the number of steps needed to determine a zero with a guaranteed relative
error of 10−6 .
(Q8.4.1.4.B) What prevents us from using bisection to find zeros of a function f : D ⊂ C → C?
△
one-point methods : x (k+1) = Φ F ( x (k) ), k ∈ N (e.g., fixed point iteration → Section 8.3)
multi-point methods : x (k+1) = Φ F ( x (k) , x (k−1) , . . . , x (k−m) ), k ∈ N, m = 2, 3, . . ..
Video tutorial for Section 8.4.2.1 "Newton Method in the Scalar Case": (20 minutes)
Download link, tablet notes
Again we consider the problem of finding zeros of the function F : I ⊂ R → R defined on an interval I :
we seek x ∗ ∈ I such that F ( x ∗ ) = 0. Now we impose stricter smoothness requirements and we assume
that F : I ⊂ R 7→ R is continuously differentiable, which means that both F and its derivative F ′ have to
be continuous on I .
8. Iterative Methods for Non-Linear Systems of Equations, 8.4. Finding Zeros of Scalar Functions 620
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
F ( x (k) )
x ( k +1) : = x ( k ) − , (8.4.2.1)
F ′ ( x (k) ) F x ( k +1) x ( k )
Fig. 295
The following C++ code snippet implements a generic Newton method for zero finding.
• The types FuncType and DervType must be functor types and provide an evaluation operator
Scalar operator (Scalar)const.
• The arguments F and DF must provide functors for F and F ′ .
This code implements a correction-based termination criterion as introduced in § 8.2.3.4, see also § 8.2.3.2
for a discussion of absolute and relative tolerances.
EXAMPLE 8.4.2.3 (Square root iteration as a Newton iteration) In Ex. 8.2.2.13 we learned about the
quadratically convergent fixed point iteration (8.2.2.14) for the approximate computation of the square root
of a positive number. It can be derived as a Newton iteration (8.4.2.1)!
For F ( x ) = x2 − a, a > 0, we find F ′ ( x ) = 2x, and, thus, the Newton iteration for finding zeros of F
reads:
( x ( k ) )2 − a a
x ( k +1) = x ( k ) − = 1
2 x (k) + ,
2x (k) x (k)
which is exactly (8.2.2.14). Thus, for this F Newton’s method converges globally with order p = 2. y
EXAMPLE 8.4.2.4 (Newton method in 1D (→ Exp. 8.3.1.3)) Newton iterations for two different scalar
non-linear equations F ( x ) = 0 with the same solution sets:
(k) (k)
x (k) e x
−1 ( x ( k ) )2 + e − x
F ( x ) = xe x − 1 ⇒ F ′ ( x ) = e x (1 + x ) ⇒ x (k+1) = x (k) − (k)
=
e x (1 + x ( k ) ) 1 + x (k)
8. Iterative Methods for Non-Linear Systems of Equations, 8.4. Finding Zeros of Scalar Functions 621
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
(k)
−x ′ −x ( k +1) (k) x (k) − e− x 1 + x (k)
F(x) = x − e ⇒ F (x) = 1 + e ⇒ x =x − (k)
= (k)
.
1 + e− x 1 + ex
Exp. 8.3.1.3 confirms quadratic convergence in both cases! (→ Def. 8.2.2.10)
Note that for the computation of its zeros, the function F in this example can be recast in different forms!
y
F ( x (k) )
x ( k +1) : = x ( k ) − , (8.4.2.1)
F ′ ( x (k) )
F(x)
Φ( x ) := x − (8.4.2.1) ⇔ x ( k +1) = Φ ( x ( k ) ) . (8.4.2.6)
F′ ( x)
′ F ( x ) F ′′ ( x )
Φ (x) = ⇒ Φ′ ( x ∗ ) = 0 , if F ( x ∗ ) = 0, F ′ ( x ∗ ) 6= 0 . (8.4.2.7)
( F ′ ( x ))2
U R R R R R R R
Fig. 296
How do we have to choose the leak resistance R > 0 in the linear circuit displayed in Fig. 296 in order to
achieve a prescribed potential at one of the nodes?
The circuit displayed in Fig. 296 is composed of linear resistors only. Thus we can use the nodal analysis of
the circuit introduced in Ex. 2.1.0.3 in order to derive a linear system of equations for the nodal potentials u j
8. Iterative Methods for Non-Linear Systems of Equations, 8.4. Finding Zeros of Scalar Functions 622
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
in the nodes represented by • in Fig. 296. Kirchhoff’s current law (2.1.0.4) plus the constitutive relationship
I = U/R for a resistor with resistance R give
1 1 1
Node 1: ( u1 − U ) + u1 + ( u1 − u2 ) = 0 ,
R1 R R2
1 1 1
Node j: ( u − u j −1 ) + u j + ( u j − u j +1 ) = 0 , j = 2, . . . , n − 1 , (8.4.2.10)
R j −1 j R Rj
1 1
Node n: ( u n − u n −1 ) + u n = 0 .
Rn R
n
These n equations are equivalent to a linear system of equations for the vector u = u j j=1 ∈ R n , which
reads in compact notation
1
A+ ·I u = b , (8.4.2.11)
R
1
+ R12R1 − R12
U
− R12 1
R2 + R3
1
− R13 R1
0
− R13 1
R3 + 1
R4 − R14
.
.. .. .. ..
A= . . . , b=
.
1
− R n −2 1
+ 1
− Rn1−1 ..
R n −2 R n −1 .
− Rn1−1 1 1 1
R n + R n −1 − R n 0
− R1n 1
Rn
Thus the current problem can be formulated as: find x ∈ R, x := R−1 > 0, such that
R → R
F ( x ) = 0 with F : , (8.4.2.12)
x 7→ w⊤ (A + xI)−1 b − 1
where A ∈ R n,n is a symmetric, tridiagonal, diagonally dominant matrix, w ∈ R n is a unit vector singling
out the node of interest, and b takes into account the exciting voltage U .
In order to apply Newton’s method to (8.4.2.12), we have to determine the derivative F ′ ( x ) and so by
implicit differentiation [Str09, Sect. 7.8], first rewriting (u( x ) =
ˆ vector of nodal potentials as a function of
x=R ) − 1
F ( x ) = w⊤ u( x ) − 1 , (A + xI)u( x ) = b .
Then we differentiate the linear system of equations defining u( x ) on both sides with respect to x using
the product rule (8.5.1.17)
d
dx
(A + xI)u( x ) = b =⇒ (A + xI)u′ ( x ) + u( x ) = 0 .
u′ ( x ) = −(A + xI)−1 u( x ) . (8.4.2.13)
F ′ ( x ) = w⊤ u′ ( x ) = −w⊤ (A + xI)−1 u( x ) . (8.4.2.14)
w⊤ u ( x (k) ) − 1
x ( k +1) = x ( k ) + F ′ ( x ( k ) ) −1 F ( x ( k ) ) = , (8.4.2.15)
w⊤ z ( x(k) )
with ( A + x (k) I ) u ( x (k) ) = b ,
( A + x (k) I ) z ( x(k) ) = u ( x (k) ) .
8. Iterative Methods for Non-Linear Systems of Equations, 8.4. Finding Zeros of Scalar Functions 623
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
In each step of the iteration we have to solve two linear systems of equations, which can be done with
asymptotic effort O(n) in this case, because A + x (k) I is tridiagonal.
Note that in a practical application one must demand x > 0, in addition, because the solution must provide
a meaningful conductance (= inverse resistance.)
Also note that bisection (→ 8.4.1) is a viable alternative to using Newton’s method in this case. y
ˆ Boltzmann constant,
kB =
2hν3 1
B(ν, T ) = , h =ˆ Planck constant,
c2 exp hν − 1
k TB
c =ˆ speed of light ,
Useful, if a priori knowledge about the structure of F (e.g. about F being a rational function, see below) is
available. This is often the case, because many problems of 1D zero finding are posed for functions given
in analytic form with a few parameters.
EXAMPLE 8.4.2.17 (Halley’s iteration → [Han02, Sect. 18.3]) This example demonstrates that non-
polynomial model functions can offer excellent approximation of F. In this example the model function is
chosen as a quotient of two linear function, that is, from the simplest class of true rational functions.
8. Iterative Methods for Non-Linear Systems of Equations, 8.4. Finding Zeros of Scalar Functions 624
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
Of course, that this function provides a good model function is merely “a matter of luck”, unless you have
some more information about F. Such information might be available from the application context.
a a ′ (k) 2a
( k )
+ c = F ( x (k)
) , − ( k ) 2
= F ( x ) , ( k ) 3
= F ′′ ( x (k) ) .
x +b ( x + b) ( x + b)
F ( x (k) ) 1
x ( k +1) = x ( k ) − · .
F ′ ( x (k) ) 1 − 1 F ( x (k) ) F ′′ ( x (k) )
2 F ′ ( x ( k ) )2
1 1
Halley’s iteration for F(x) = 2
+ − 1 , x > 0 : and x (0) = 0
( x + 1) ( x + 0.1)2
k x (k) F ( x (k) ) x ( k ) − x ( k −1) x (k) − x ∗
1 0.19865959351191 10.90706835180178 -0.19865959351191 -0.84754290138257
2 0.69096314049024 0.94813655914799 -0.49230354697833 -0.35523935440424
3 1.02335017694603 0.03670912956750 -0.33238703645579 -0.02285231794846
4 1.04604398836483 0.00024757037430 -0.02269381141880 -0.00015850652965
5 1.04620248685303 0.00000001255745 -0.00015849848821 -0.00000000804145
Compare with Newton method (8.4.2.1) for the same problem:
! Newton method converges more slowly, but also needs less effort per step (→ Section 8.4.3) y
§8.4.2.18 (Preconditioning of Newton’s method) In the previous example Newton’s method performed
rather poorly. Often its convergence can be boosted by converting the non-linear equation to an equivalent
one (that is, one with the same solutions) for another function g, which is “closer to a linear function”:
b, where Fb is (locally) invertible with an inverse Fb−1 that can be evaluated with
Assume that (locally) F ≈ F
little effort.
g( x ) := Fb−1 ( F ( x )) ≈ x .
8. Iterative Methods for Non-Linear Systems of Equations, 8.4. Finding Zeros of Scalar Functions 625
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
b−1 (0), using the formula for the derivative of the inverse
Then apply Newton’s method to G ( x ) := g( x ) − F
of a function
d b−1 1 1
( F )(y) = ⇒ g′ ( x ) = · F′ ( x) .
dy Fb ( Fb 1 (y))
′ − Fb′ ( g( x ))
Since G is “almost linear” this Newton iteration can be expected to enjoy quadratic convergence for initial
b−1 (0).
guesses x (0) from a large set. A good initial guess is x (0) := F y
1 1
F(x) = 2
+ −1 , x > 0 ,
( x + 1) ( x + 0.1)2
and try to find its zeros.
10
F(x)
9 g(x)
7
Observation:
6
F ( x ) + 1 ≈ 2x −2 for x ≫ 1
5
1 4
and so g( x ) := p is “almost” linear for
F(x) + 1 3
x ≫ 1. 2
0
0 0.5 1 1.5 2 2.5 3 3.5 4
x
! !
Idea: instead of F ( x ) = 0 tackle g( x ) = 1 with Newton’s method (8.4.2.1).
( k )
g( x ) − 1 1 2( F ( x (k) ) + 1)3/2
x ( k +1) (k)
=x − =x + q(k) −1
g′ ( x (k) ) ( k )
F(x ) + 1 F ′ ( x (k) )
q
( k )
2( F ( x ) + 1)(1 − F ( x (k) ) + 1)
(k)
=x + .
F ′ ( x (k) )
8. Iterative Methods for Non-Linear Systems of Equations, 8.4. Finding Zeros of Scalar Functions 626
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
For zero finding there is wealth of iterative methods that offer higher order of convergence. One class is
discussed next.
§8.4.2.21 (Modified Newton methods) Taking the cue from the iteration function of Newton’s method
(8.4.2.1), we extend it by introducing an extra function H :
F(x)
new fixed point iteration : Φ( x ) = x − H ( x ) with “proper” H : I 7→ R .
F′ ( x)
Still, every zero of F is a fixed point of this Φ,that is, the fixed point iteration is still consistent (→
Def. 8.3.1.1).
Aim: find H such that the method is of p-th order. The main tool is Lemma 8.3.2.15, which tells us that we
have to ensure Φ(ℓ) ( x ∗ ) = 0, 1 ≤ ℓ ≤ p − 1, guarantees local convergence of order p.
F ′′ ( x ∗ )
Φ′ ( x ∗ ) = 1 − H ( x ∗ ) , Φ′′ ( x ∗ ) = H ( x ∗ ) − 2H ′ ( x ∗ ) . (8.4.2.22)
F′ ( x∗ )
Lemma 8.3.2.15 ➢ Necessary conditions for local convergence of order p:
p = 2 (quadratical convergence): H ( x∗ ) = 1 ,
1 F ′′ ( x ∗ )
p = 3 (cubic convergence): H ( x∗ ) = 1 ∧ H ′ ( x∗ ) = .
2 F′ ( x∗ )
If F ∈ C2 ( I ), F ( x ∗ ) = 0, F ′ ( x ∗ ) 6= 0, G ∈ C2 (U ) in a neighbourhood U of 0, G (0) = 1,
G ′ (0) = 12 , then the fixed point iteration (8.4.2.23) converge locally cubically to x ∗ .
Proof. We apply Lemma 8.3.2.15, which tells us that both derivatives from (8.4.2.22) have to vanish.
Using the definition of H we find.
F ′′ ( x ∗ )
H ( x ∗ ) = G (0) , H ′ ( x ∗ ) = − G ′ (0)u′′ ( x ∗ ) = G ′ (0) .
F′ ( x∗ )
8. Iterative Methods for Non-Linear Systems of Equations, 8.4. Finding Zeros of Scalar Functions 627
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
k e(k) : = x (k) − x ∗
Halley Euler Quad. Inv.
1 2.81548211105635 3.57571385244736 2.03843730027891
Numerical experiment: 2 1.37597082614957 2.76924150041340 1.02137913293045
3 0.34002908011728 1.95675490333756 0.28835890388161
F ( x ) = xe x − 1 ,
4 0.00951600547085 1.25252187565405 0.01497518178983
x (0) = 5 5 0.00000024995484 0.51609312477451 0.00000315361454
6 0.14709716035310
7 0.00109463314926
8 0.00000000107549
y
Review question(s) 8.4.2.26 (Special 1-point iterative methods for root finding)
(Q8.4.2.26.A) The generic iteration for a modified Newton method for solving the scalar zero-finding prob-
lem F ( x ) = 0 is
!
F ( x (k) ) F ( x (k) ) F ′′ ( x (k) )
x ( k +1) = x (k) − ′ (k) G , (8.4.2.23)
F (x ) ( F ′ ( x (k) ))2
Video tutorial for Section 8.4.2.3 "Multi-Point Methods": (12 minutes) Download link,
tablet notes
Supplementary literature. The secant method is presented in [Han02, Sect. 18.2], [DR08,
8. Iterative Methods for Non-Linear Systems of Equations, 8.4. Finding Zeros of Scalar Functions 628
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
The figure illustrates the geometric idea underlying x ( k −1) x ( k +1) x (k)
this 2-point method for zero finding:
F ( x ( k ) ) − F ( x ( k −1) )
s ( x ) = F ( x (k) ) + ( x − x (k) ) , (8.4.2.29)
x ( k ) − x ( k −1)
F ( x (k) )( x (k) − x (k−1) )
x ( k +1) = x (k) − . (8.4.2.30)
F ( x ( k ) ) − F ( x ( k −1) )
The following C++ code snippet demonstrates the implementation of the abstract secant method for finding
the zeros of a function passed through the functor F.
Remember: F ( x ) may only be available as output of a (complicated) procedure. In this case it is difficult
to find a procedure that evaluates F ′ ( x ). Thus the significance of methods that do not involve evaluations
of derivatives. y
8. Iterative Methods for Non-Linear Systems of Equations, 8.4. Finding Zeros of Scalar Functions 629
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
A startling observation: the method seems to have a fractional (!) order of convergence, see Def. 8.2.2.10.
y
Remark 8.4.2.33 (Fractional order of convergence of secant method) Indeed, a fractional order of
convergence can be proved for the secant method, see [Han02, Sect. 18.2]. Here we give an asymptotic
argument that holds, if the iterates are already very close to the zero x ∗ of F.
F ( x )( x − y)
x (k+1) = Φ( x (k) , x (k−1) ) with Φ( x, y) = Φ( x, y) := x − . (8.4.2.34)
F ( x ) − F (y)
Thanks to the asymptotic perspective we may assume that |e(k) |, |e(k−1) | ≪ 1 so that we can rely on
two-dimensional Taylor expansion around ( x ∗ , x (∗) ), cf. [Str09, Satz 7.5.2]:
∂Φ ∗ ∗ ∂Φ ∗ ∗
Φ( x ∗ + h, x ∗ + k ) = Φ( x ∗ , x ∗ ) + ( x , x )h + ( x , x )k+
∂x ∂y
2 (8.4.2.36)
1∂ Φ ∗ ∗ 2 ∂2 Φ ∗ ∗ 2
1∂ Φ ∗ ∗ 2 ∗
2 ∂2 x ( x , x ) h + ∂x∂y ( x , x ) hk + 2 ∂2 y ( x , x ) k + R ( x , h, k ) ,
8. Iterative Methods for Non-Linear Systems of Equations, 8.4. Finding Zeros of Scalar Functions 630
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
> e2 = normal(mtaylor(Phi(s+e1,s+e0)-s,[e0,e1],4));
➣ truncated error propagation formula (products of three or more error terms ignored)
. 1 F ′′ ( x ∗ ) (k) (k−1)
e ( k +1) = 2 F′ ( x∗ ) e e = Ce(k) e(k−1) . (8.4.2.37)
How can we deduce the order of converge from this recursion formula? We try e ( k ) = K ( e ( k −1) ) p
inspired by the estimate in Def. 8.2.2.10:
2
⇒ e ( k +1) = K p +1 ( e ( k −1) ) p
2 − p −1 √
⇒ ( e ( k −1) ) p = K − p C ⇒ p2 − p − 1 = 0 ⇒ p = 21 (1 ± 5) .
The second implication is clear after realizing that that the equation has to be satisfied for all k and that
the right-hand (k)
1
√ side does not depend on k. As e → 0 for k → ∞ we get the order of convergence
p = 2 (1 + 5) ≈ 1.62 (see Exp. 8.4.2.32 !) y
F ( x ) = arctan( x )
6
✄
( x (0) , x (1) ) ∈ R2+ of initial guesses. 4
0
0 1 2 3 4 5 6 7 8 9 10
(0)
Fig. 298 x
y
F ( x ∗ ) = 0 ⇒ F −1 (0 ) = x ∗ .
p( F ( x (k− j) )) = x (k− j) , j = 0, . . . , m − 1 .
New approximate zero x (k+1) := p(0)
8. Iterative Methods for Non-Linear Systems of Equations, 8.4. Finding Zeros of Scalar Functions 631
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
F −1
F
The graph of F −1 can be obtained by reflecting the
graph of F at the angular bisector. ✄
F ( x ∗ ) = 0 ⇔ F −1 (0 ) = x ∗
Fig. 299
F −1
x∗
Case m = 2 (2-point method) F
➢ secant method
x∗
The interpolation polynomial is a line. In this case
we do not get a new method, because the inverse
function of a linear function (polynomial of degree 1)
is again a polynomial of degree 1.
Fig. 300
Case m = 3: quadratic inverse interpolation, a 3-point method, see [Mol04, Sect. 4.5]
We interpolate the points ( F ( x (k) ), x (k) ), ( F ( x (k−1) ), x (k−1) ), ( F ( x (k−2) ), x (k−2) ) with a parabola
(polynomial of degree 2). Note the importance of monotonicity of F, which ensures that
( k )
F ( x ), F ( x ( k − 1 ) ), F ( x ( k − 2 ) ) are mutually different.
EXPERIMENT 8.4.2.40 (Convergence of quadratic inverse interpolation) We test the method for the
model problem/initial guesses F ( x ) := xe x − 1 = 0 , x (0) = 0 , x (1) = 2.5 , x (2) = 5 .
8. Iterative Methods for Non-Linear Systems of Equations, 8.4. Finding Zeros of Scalar Functions 632
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
Video tutorial for Section 8.4.3 "Asymptotic Efficiency of Iterative Methods for Zero Finding":
(10 minutes) Download link, tablet notes
§8.4.3.1 (Efficiency) Efficiency is measured by forming the ratio of gain and the effort required to achieve
it:
gain
Efficiency = .
effort
For iterative methods for solving F (x) = 0, F : D ⊂ R n → R n , this means the following:
#{evaluations of D F } #{evaluations of F ′ }
e.g, W≈ +n· +··· .
step step
8. Iterative Methods for Non-Linear Systems of Equations, 8.4. Finding Zeros of Scalar Functions 633
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
Ingredient ➋: Number of steps k = k (ρ) to achieve a relative reduction of the error by a factor of ρ (=
gain),
Here, e(k) stands for the iteration error in the k-th step.
Notice: | log ρ| ↔ Gain in no. of significant digits of x (k) [ log = log10 ]
(8.4.3.3)
y
§8.4.3.4 (Minimal number of iteration steps) Let us consider an iterative method generating a sequence
of approximate solutions the converges with order p ≥ 1 (→ Def. 8.2.2.10). From its error recursion
we want to derive an estimate for the minimal number k (ρ) ∈ N of iteration steps required to achieve
(8.4.3.2).
➊ Case p = 1, linearly convergent iteration:
Definition 8.2.2.1. Linear convergence
This implies the recursion and estimate for the error norms:
e ( k ) ≤ C k e (0) ∀k ∈ N0 . (8.4.3.6)
! log ρ
e ( k ) ≤ ρ e (0) takes k(ρ) ≥ steps . (8.4.3.7)
log C
8. Iterative Methods for Non-Linear Systems of Equations, 8.4. Finding Zeros of Scalar Functions 634
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
p p2 2 p3
e ( k ) ≤ C e ( k −1) ≤ C 1+ p e ( k −2) ≤ C 1+ p + p e ( k −3) ≤ ...
2 + p3 +···+ pk −1 pk p k −1 p k −1
≤ C 1+ p + p e (0) =C p −1 e (0) e (0) , k∈N,
for some constant C > 0. Here, we use the geometric sum formula
pk − 1
1 + p 2 + p 3 + · · · + p k −1 = , k∈N.
p−1
This permits us to estimate the minimal number of steps we have to execute to guarantee a reduction
of the error by a factor of ρ
log ρ
log(1 + log L0 )
, L0 := C /p−1 e(0) < 1 by (8.4.3.9) .
1
k(ρ) ≥ (8.4.3.10)
log p
Summing up, (8.4.3.7) and (8.4.3.10) give us explicit formulas for k (ρ) as a function of ρ. y
§8.4.3.11 (Asymptotic efficiency) Now we adopt an asymptotic perspective and ask for a large reduction
of the error, that is ρ ≪ 1.
If ρ ≪ 1, then (log ρ, log L0 < 0 !)
log ρ
log(1 + ) ≈ log | log ρ| − log | log L0 | ≈ log | log ρ| .
log L0
log C
− , if p=1,
Efficiency|ρ≪1 = log W
p | log ρ| (8.4.3.12)
· , if p>1.
W log(| log ρ|)
We conclude that
• when requiring high accuracy, linearly convergent iterations should not be used, because their effi-
ciency does not increase for ρ → 0,
log p
• for method of order p > 1, the factor W offers a gauge for efficiency.
8. Iterative Methods for Non-Linear Systems of Equations, 8.4. Finding Zeros of Scalar Functions 635
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
y
EXAMPLE 8.4.3.13 (Efficiency of iterative methods) We “simulate” iterations to explore the quantitative
dependence of efficiency on the order of the methods and the target accuracy.
10
C = 0.5
9 C = 1.0
C = 1.5
6
The plot displays the number of iteration steps ac- 5
cording to (8.4.3.10).
4
0
1 1.5 2 2.5
Fig. 301 p
7
Newton method
secant method
6
We compare
• Newton’s method from Section 8.4.2.1 and the 5
4
in terms of number of steps required for a prescribed
guaranteed error reduction, assuming C = 1 in both 3
cases and for e(0) = 0.1.
2
We observe that Newton’s method requires only
marginally fewer steps than the secant method. 1
0
0 2 4 6 8 10
Fig. 302 −log (ρ)
10
y
§8.4.3.14 (Comparison of Newton’s method and of the secant method) We draw conclusions from the
discussion above and (8.4.3.12):
We set the effort for a step of Newton’s method to twice that for a step of the secant method from
Code 8.4.2.31, because we need an additional evaluation of the derivative F ′ in Newton’s method.
8. Iterative Methods for Non-Linear Systems of Equations, 8.4. Finding Zeros of Scalar Functions 636
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
Video tutorial for Section 8.5.1 "The Newton Iteration in R n (II)": (15 minutes) Download link,
tablet notes
F ( x (k) )
x ( k +1) : = x ( k ) − , (8.4.2.1)
F x ( k +1) x ( k ) F ′ ( x (k) )
From another perspective, F e is just the Taylor expansion of F around x (k) truncated after the linear term.
Thus, if F is twice continuous differentiable, F ∈ C2 ( I ), by Taylor’s formula [Str09, Satz 5.5.1] the tangent
satisfies
that is the tangent provides a local quadratic approximation of F around x (k) . This directly gives us the
local quadratic convergence of the 1D Newton method, recall § 8.4.2.5.
We know an analogous construction for general twice continuously differentiable F : D ⊂ R n 7→ R n . We
define the affine linear function
8. Iterative Methods for Non-Linear Systems of Equations, 8.5. Newton’s Method in R n 637
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
⊤
where D F (z) ∈ R n,n is the Jacobian of F = [ F1 , . . . , Fn ] in z ∈ D:
∂F ∂F1 ∂F1
1
∂x1 (z) ∂x2 ( z ) ··· ··· ∂xn ( z )
" #n ∂F2 ∂F2
∂Fi (z) ( z )
D F (z) = (z) =
∂x1
..
∂xn
..
. (8.3.2.8)
∂x j . .
i,j=1
∂Fn ∂Fn ∂Fn
∂x1 ( z ) ∂x2 ( z ) ··· ··· ∂xn ( z )
This is the multi-dimensional generalization of a truncated Taylor expansion. From analysis we know that
e with an error that quadratically depends
also in this case we have a local affine linear approximation by F
(
on the distance to x : k )
2
F (x) − Fe(x) = F (x) − F (x(k) ) − D F (x(k) )(x − x(k) ) = O( x − x(k) ) for x → x(k) .
(8.5.1.5)
y
Idea (→ Section 8.4.2.1): local linearization:
Given x(k) ∈ D ➣ x(k+1) as zero of affine linear model function
L1 := {x ∈ R2 : Fek,1 (x) = 0} ,
L2 := {x ∈ R2 : Fek,2 (x) = 0} , Fe2 (x) = 0
§8.5.1.8 (Generic Newton method in C++) The following C++ implementation of the skeleton for New-
ton’s method uses a correction based a posteriori termination criterion for the Newton iteration. It stops the
8. Iterative Methods for Non-Linear Systems of Equations, 8.5. Newton’s Method in R n 638
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
iteration if the relative size of the Newton correction drops below the prescribed relative tolerance rtol.
If x∗ ≈ 0 also the absolute size of the Newton correction has to be tested against an absolute tolerance
atol in order to avoid non-termination despite convergence of the iteration,
that computes the Newton correction, that is it returns the solution of a linear system with system
matrix D F (x) (x ↔ x) and right hand side f ↔ f.
☞ The function returns the computed approximate solution of the non-linear system.
The next code demonstrates the invocation of newton for a 2 × 2 non-linear system from a code relying
on E IGEN. It also demonstrates the use of fixed size eigen matrices and vectors.
8. Iterative Methods for Non-Linear Systems of Equations, 8.5. Newton’s Method in R n 639
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
Remark 8.5.1.11 (Affine invariance of Newton method) An important property of the Newton iteration
(8.5.1.6) is its affine invariance → [Deu11, Sect .1.2.2]
☛ ✟
Affine invariance: The Newton iterations for GA (x) = 0 are the same for all regular A !
✡ ✠
This is confirmed by a simple computation:
Why is this an interesting property? Affine invariance should be used as a guideline for
• convergence theory for Newton’s method: assumptions and results should be affine invariant, too.
• modifying and extending Newton’s method: resulting schemes should preserve affine invariance.
In particular, termination criteria for Newton’s method should also be affine invariant in the sense that,
when applied for GA they STOP the iteration at exactly the same step for any choice of the regular matrix
A. y
The function F : R n → R n defining the non-linear system of equations may be given in various formats,
as explicit expression or rather implicitly. In most cases, D F has to be computed symbolically in order to
obtain concrete formulas for the Newton iteration. We now learn how these symbolic computations can be
carried out harnessing advanced techniques of multi-variate calculus.
§8.5.1.12 (Derivatives and their role in Newton’s method) The reader will probably agree that the
derivative of a function F : I ⊂ R → R in x ∈ I is a number F ′ ( x ) ∈ R, the derivative of a function
F : D ⊂ R n → R m , in x ∈ D a matrix D F (x) ∈ R m,n . However, the nature of a derivative in a point is
8. Iterative Methods for Non-Linear Systems of Equations, 8.5. Newton’s Method in R n 640
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
☞ Note that D F (x)h ∈ W is the vector returned by the linear mapping D F (x) when applied to h ∈ V .
☞ In Def. 8.5.1.13 k·k can be any norm on V (→ Def. 1.5.5.4).
☞ A common shorthand notation for (8.5.1.14) relies on the “little-o” Landau symbol:
In the context of the Newton iteration (8.5.1.6) the computation of the Newton correction s in the k + 1-th
step amounts to solving a linear system of equations:
s = − D F ( x ( k ) ) −1 F ( x ( k ) ) ⇔ D F ( x ( k ) ) s = − F ( x ( k ) ) .
Matching this with Def. 8.5.1.13 we see that we need only determine expressions for D F (x(k) )h, h ∈ V ,
in order to state the LSE yielding the Newton correction. This will become important when applying the
“compact” differentiation rules discussed next. y
Video tutorial for § 8.5.1.15 "Multi-dimensional Differentiation": (20 minutes) Download link,
tablet notes
Stating the Newton iteration (8.5.1.6) for F : R n 7→ R n through an analytic formula entails computing the
Jacobian D F. The safe, but tedious way is to use the definition (8.3.2.8) directly and compute the partial
derivatives.
To avoid cumbersome component-oriented considerations, it is sometimes useful to know the rules of
multidimensional differentiation:
Immediate from Def. 8.5.1.13 are the following differentiation rules (V, W, U, Z are finite-dimensional
normed vector spaces, all functions assumed to be differentiable):
• For F : V 7→ W linear, we have D F (x) = F for all x ∈ V
8. Iterative Methods for Non-Linear Systems of Equations, 8.5. Newton’s Method in R n 641
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
T (x) := b( F (x), G (x)) ⇒ D T (x)h = b(D F (x)h, G (x)) + b( F (x), D G (x)h) , (8.5.1.17)
h ∈ V, x ∈ D .
We see this by formal computations, making heavy use of the bilinearity of b in step (∗):
T (x + h) = b( F (x + h), G (x + h))
= b( F (x) + D F (x)h + o (khk), G (x) + D G (x)h + o (khk))
(∗)
= b( F (x), G (x)) + b(D F (x)h, G (x)) + b( F (x), D G (x)h) +o (khk) ,
| {z }
=D T ( x ) h
2
where the term b(D F (x)h, D G (x)h) is obviously O(khk ) for h → 0 and, therefore, can be
“thrown into the garbage bin” of o (khk).
The first and second derivatives of real-valued functions occur frequently and have special names, see
[Str09, Def. 7.3.2] and [Str09, Satz 7.5.3].
In other words,
• the gradient grad F (x) is the vector representative of the linear mapping D F (x) : R n → R,
• and the Hessian H F (x) is the matrix representative of the bilinear mapping
D{z 7→ D F (z)}(x) : R n × R n → R.
y
Ψ : R n 7→ R , Ψ(x) := x⊤ Ax , A ∈ R n,n .
8. Iterative Methods for Non-Linear Systems of Equations, 8.5. Newton’s Method in R n 642
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
This is the general matrix representation of a bilinear form on R n . We want to compute the gradient of Ψ.
We do this in two ways:
➊ “High level differentiation”: We apply the product rule (8.5.1.17) with D = V = W = U = R n ,
Z = R, F, G = Id, which means D F (x) = D G (x) = I, and the bilinear form b(x, y) :=
x T Ay:
D Ψ(x)h = h⊤ Ax + x⊤ Ah = (Ax)⊤ h = x⊤ Ah = x⊤ A⊤ + x⊤ A h ,
| {z }
=(grad Ψ(x))⊤
n n n n n n
Ψ(x) = ∑ ∑ (A)k,j xk x j = (A)i,i xi2 + ∑ (A)i,j xi x j + ∑ (A)k,i xk xi + ∑ ∑ (A)k,j xk x j .
k =1 j =1 j =1 k =1 k =1 j =1
j6=i k 6=i k 6=i j6=i
⊤
=(Ax + A x)i , i = 1, . . . , n .
This provides the components of the gradient, since i ∈ {1, . . . , n} was arbitrary.
Of course, the results obtained by both methods must agree! y
EXAMPLE 8.5.1.20 (Derivative of Euclidean norm) We seek the derivative of the Euclidean norm, that
is, of the function F (x) := kxk2 , x ∈ R n \ {0} ( F is defined but not differentiable in x = 0, just look at
the case n = 1!)
➊ “High level differentiation”: We can write F as the composition of two functions F = G ◦ H with
p
G : R + → R + , G (ξ ) := ξ ,
H : Rn → R , H (x) := x⊤ x .
Using the rule for the differentiation of bilinear forms from Ex. 8.5.1.19 for the case A = I and basic
calculus, we find
D H (x)h = 2x⊤ h , x, h ∈ R n ,
ζ
D G (ξ )ζ = √ , ξ > 0, ζ ∈ R .
2 ξ
Finally, the chain rule (8.5.1.16) gives
2x⊤ h x⊤
D F (x)h = D G ( H (x))(D H (x)h) = √ = ·h . (8.5.1.21)
2 x⊤ x k x k2
x
Def. 8.5.1.18 ⇒ grad F (x) = .
k x k2
8. Iterative Methods for Non-Linear Systems of Equations, 8.5. Newton’s Method in R n 643
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
§8.5.1.22 (Newton iteration via product rule) This paragraph explains the use of the general product
rule (8.5.1.17) to derive the linear system solved by the Newton correction. It implements the insights from
§ 8.5.1.12.
We seek solutions of F (x) = 0 with F (x) := b( G (x), H (x)), where
✦ V, W are some vector spaces (finite- or even infinite-dimensional),
✦ G : D → V , H : D → W , D ⊂ R n , are continuously differentiable in the sense of Def. 8.5.1.13,
✦ b : V × W 7→ R n is bilinear (linear in each argument).
According to the general product rule (8.5.1.17) we have
This already defines the linear system of equations to be solved to compute the Newton correction s
b(D G (x(k) )s, H (x(k) )) + b( G (x(k) ), D H (x(k) )s) = −b( G (x(k) ), H (x(k) )) . (8.5.1.24)
Since the left-hand side is linear in s, this really represents a square linear system of n equations. The
next example will present a concrete case. y
For many quasi-linear systems, for which there exist solutions, the fixed point iteration (→ Section 8.3)
D F ( x ) h = (D A ( x ) h ) x + A ( x ) h , h ∈ R n . (8.5.1.28)
8. Iterative Methods for Non-Linear Systems of Equations, 8.5. Newton’s Method in R n 644
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
Note that D A(x(k) ) is a mapping from R n into R n,n , which gets h as an argument. Then the Newton
iteration reads
x ( k +1) = x ( k ) − s , D F ( x ( k ) ) s = (D A ( x ( k ) ) s ) x ( k ) + A ( x ( k ) ) s = A ( x ( k ) ) x ( k ) − b . (8.5.1.29)
EXAMPLE 8.5.1.30 (A special quasi-linear system of equations) We consider the quasi-linear system
of equations
γ(x) 1
1 γ(x) 1
.. .. ..
. . .
A(x)x = b , A(x) := . . . ∈ R n×n , (8.5.1.31)
.. .. ..
1 γ(x) 1
1 γ(x)
where γ(x) := 3 + kxk2 (Euclidean vector norm), the right hand side vector b ∈ R n is given and x ∈ R n
is unknown.
The “pedestrian” approach to the second term starts with writing it explicitly in components as
q
(xkxk)i = xi x12 + · · · + xn2 , i = 1, . . . , n .
Then we can compute the Jacobian according to (8.3.2.8) by taking partial derivatives:
q
∂ xi
(xkxk)i = x12 + · · · + xn2 + xi q ,
∂xi x12 +···+ xn2
∂ xj
(xkxk)i = xi q , j 6= i .
∂x j x2 + · · · + x2
1 n
For the “high level” treatment of the second term x 7→ xkxk2 we apply the product rule (8.5.1.17), together
with (8.5.1.21):
x⊤ h xx⊤
D F (x)h = Th + kxk2 h + x = A(x) + h.
k x k2 k x k2
Thus, in concrete terms the Newton iteration (8.5.1.29) becomes
x ( k ) ( x ( k ) ) ⊤ −1
x ( k +1) = x ( k ) − A ( x ( k ) ) + ( A ( x(k) ) x(k) − b ) .
k x k2
8. Iterative Methods for Non-Linear Systems of Equations, 8.5. Newton’s Method in R n 645
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
Note that the coefficient matrix of the linear system to be solved in each step is a rank-1-modification
(2.6.0.16) of the symmetric positive definite tridiagonal matrix A(x(k) ), cf. Lemma 2.8.0.12. Thus the
Sherman-Morrison-Woodbury formula from Lemma 2.6.0.21 can be used to solve it efficiently. y
This relationship will provide a valid definition of F in a neighborhood of x0 ∈ W , if we assume that there
is x0 , z0 ∈ W such that b( G (x0 ), z0 ) = b, and that the linear mapping z 7→ b( G (x0 ), z) is invertible.
Then, for x close to x0 , F (x) can be computed by solving a square linear system of equations in W . In
Ex. 8.4.2.9 we already saw an example of an implicitly defined F for W = R.
We want to solve F (x) = 0 for this implicitly defined F by means of Newton’s method. In order to determine
the derivative of F we resort to implicit differentiation [Str09, Sect. 7.8] of the defining equation (8.5.1.33)
by means of the general product rule (8.5.1.17). We formally differentiate both sides of (8.5.1.33):
and find that the Newton correction s in the k + 1-th Newton step can be computed as follows:
which constitutes an dim W × dim W linear system of equations. The next example discusses a concrete
application of implicit differentiation with W = R n,n . y
EXAMPLE 8.5.1.35 (Derivative of matrix inversion) We consider matrix inversion as a mapping and
(formally) compute its derivative, that is, the derivative of function
R ∗n,n → R n,n
inv : ,
X 7 → X −1
where R n,n
∗ denotes the (open) set of invertible n × n-matrices, n ∈ N .
inv(X) · X = I , X ∈ R n,n
∗ . (8.5.1.36)
Differentiation on both sides of (8.5.1.36) by means of the product rule (8.5.1.17) yields
For n = 1 we get D inv( x ) h = − xh2 , which recovers the well-known derivative of the function x → x −1 .
y
8. Iterative Methods for Non-Linear Systems of Equations, 8.5. Newton’s Method in R n 646
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024
EXAMPLE 8.5.1.38 (Matrix inversion by means of Newton’s method → [Ale12; PS91]) Surprisingly,
it is possible to obtain the inverse of a matrix as a the solution of a non-linear system of equations. Thus
it can be computed using Newton’s method.
Given a regular matrix A ∈ R n,n , its inverse can be defined as the unique zero of a function:
−1 R n,n
∗ → R n,n
X=A ⇐⇒ F (X) = O for F : .
X 7 → A − X −1
n,n
Using (8.5.1.37) we find for the derivative of F in X ∈ R ∗
X ( k +1) = X ( k ) − S , S : = D F ( X ( k ) ) −1 F ( X ( k ) ) . (8.5.1.40)
The Newton correction S in the k-th step solves the linear system of equations
(8.5.1.39) −1
−1 −1
D F ( X(k) ) S = X(k)
S X(k) = F ( X(k) ) = A − X(k) .
−1
S = X(k) ( A − X(k) )X(k) = X(k) AX(k) − X(k) . (8.5.1.41)
in (8.5.1.40)
X(k+1) = X(k) − X(k) AX(k) − X(k) = X(k) 2I − AX(k) . (8.5.1.42)
This is the Newton iteration (8.5.1.6) for F (X) = O that we expect to converge locally to X∗ := A−1 . y
Remark 8.5.1.43 (Simplified Newton method [DR08, Sect. 5.6.2]) Computing the Newton correction
can be expensive owing to the O(n3 ) asymptotic cost (→ § 2.5.0.4) of solving a different large n × n
linear system of equations in every step of the Newton iteration (8.5.1.6).
We know that the cost of a linear solve can be reduced to O(n^2) if the coefficient matrix is available in LU-
or QR-factorized form, see, e.g., § 2.3.2.15. This motivates the attempt to “freeze” the Jacobian
in the Newton iteration and use D F(x^{(0)}) throughout, which leads to the simplified Newton iteration:

x^{(k+1)} = x^{(k)} − D F(x^{(0)})^{-1} F(x^{(k)}) ,  k = 0, 1, . . . .
The following C++ function implements a template for this simplified Newton Method and uses the same
Jacobian D F (x(0) ) for all steps, which makes it possible to reuse an LU-decomposition, cf. Rem. 2.5.0.10.
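The original listing did not survive the extraction; a minimal sketch of what such a simplified Newton function could look like, assuming Eigen vector types and the same calling conventions as Code 8.5.3.6 below (function name and interface are illustrative), is:

// Sketch of a simplified Newton iteration: the Jacobian DF(x^{(0)}) is
// LU-factorized once and reused in every step.
template <typename FuncType, typename JacType, typename VecType>
void simpnewton(const FuncType &F, const JacType &DF, VecType &x, double rtol,
                double atol) {
  auto jacfac = DF(x).lu();  // single LU-factorization of DF(x^{(0)})
  typename VecType::Scalar sn;
  do {
    const VecType s = jacfac.solve(F(x)); // correction DF(x^{(0)})^{-1} F(x^{(k)})
    x -= s;                               // next iterate
    sn = s.norm();
  } while ((sn > rtol * x.norm()) && (sn > atol)); // correction-based termination
}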
Drawback: Switching to the simplified Newton method usually sacrifices the asymptotic
quadratic convergence of the Newton method: merely linear convergence can be expected.
y
Remark 8.5.1.45 (Numerical Differentiation for computation of Jacobian) If D F(x) is not available
(e.g. when F(x) is given only as a procedure) we may resort to approximation by difference quotients, for instance forward differences

∂F_j/∂x_i (x) ≈ ( F_j(x + h e_i) − F_j(x) ) / h ,  i, j = 1, . . . , n ,

with a small step size h > 0 and the i-th unit vector e_i.
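A possible realization of such a difference-quotient Jacobian with Eigen is sketched below; the column-wise loop and the step-size heuristic √eps·max(1,|x_i|) are our own choices, not prescribed by the text.

#include <Eigen/Dense>
#include <cmath>
#include <limits>

// Sketch: approximate DF(x) column by column with forward difference quotients.
template <typename Functor>
Eigen::MatrixXd numJacobian(Functor &&F, const Eigen::VectorXd &x) {
  const Eigen::VectorXd fx = F(x);
  const int n = x.size();
  Eigen::MatrixXd J(fx.size(), n);
  const double eps = std::sqrt(std::numeric_limits<double>::epsilon());
  for (int i = 0; i < n; ++i) {
    Eigen::VectorXd xp = x;
    const double h = eps * std::max(1.0, std::abs(x(i))); // heuristic step size
    xp(i) += h;
    J.col(i) = (F(xp) - fx) / h; // i-th column: difference quotient in direction e_i
  }
  return J;
}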
(Q8.5.1.46.A) Consider the non-linear system of equations

w_1 + w_2 = 2 ,
c_1 w_1 + c_2 w_2 = 0 ,
c_1^2 w_1 + c_2^2 w_2 = 2/3 ,
c_1^3 w_1 + c_2^3 w_2 = 0 .
Write down a function F : R4 → R4 such that this non-linear system corresponds to F (x) = 0,
x = [w1 , w2 , c1 , c2 ]⊤ and then derive the corresponding Newton iteration.
(Q8.5.1.46.B) Consider the following non-linear interpolation problem for a twice continuously differen-
tiable function f : [−1, 1] → R. Seek a node set {t0 , . . . , tn } ⊂ [−1, 1] and a polynomial p ∈ Pn such
that
Recast this problem as a non-linear system of equations for the nodes tk , k = 0, . . . , n, and the n + 1
monomial coefficients of p and then state the corresponding Newton iteration.
(Q8.5.1.46.C) For a symmetric positive definite matrix A = A⊤ ∈ R n,n derive the Newton iteration for
solving F (X) = O, where
Notice that the Newton iteration (8.5.1.6) is a fixed point iteration (→ Section 8.3) with iteration function
Φ ( x ) = x − D F ( x ) −1 F ( x ) .
F (x∗ ) = 0 ⇒ D Φ(x∗ ) = O ,
that is, the derivative (Jacobian) of the iteration function of the Newton fixed point iteration vanishes in the
limit point. Thus from Lemma 8.3.2.15 we draw the same conclusion as in the scalar case n = 1, cf.
Section 8.4.2.1.
This can easily be seen by the following formal argument, valid for F ∈ C^2(D) with D F(x^*) regular.
x^{(k+1)} − x^* = Φ(x^{(k)}) − Φ(x^*) = D Φ(x^*)(x^{(k)} − x^*) + O(‖x^{(k)} − x^*‖^2) = O(‖x^{(k)} − x^*‖^2)  for x^{(k)} ≈ x^* .
EXPERIMENT 8.5.2.1 (Convergence of Newton’s method in 2D) We study the convergence of Newton’s
method empirically for n = 2 for
F(x) = [ x_1^2 − x_2^4 ; x_1 − x_2^3 ] ,  x = [x_1 ; x_2] ∈ R^2 ,  with solution F([1; 1]) = 0 .   (8.5.2.2)

Jacobian (analytic computation):
D F(x) = [ ∂_{x_1}F_1(x)  ∂_{x_2}F_1(x) ; ∂_{x_1}F_2(x)  ∂_{x_2}F_2(x) ] = [ 2x_1  −4x_2^3 ; 1  −3x_2^2 ] .
1. Solve the linear system D F(x^{(k)}) ∆x^{(k)} = −F(x^{(k)}), where x^{(k)} = [x_1, x_2]^T.
2. Set x^{(k+1)} = x^{(k)} + ∆x^{(k)}.
Monitoring the iteration we obtain the following iterates/error norms:

 k | x^{(k)}                                   | ε_k := ‖x^* − x^{(k)}‖_2 | (log ε_{k+1} − log ε_k)/(log ε_k − log ε_{k−1})
 0 | [0.7, 0.7]^T                              | 4.24e-01                 |
 1 | [0.87850000000000, 1.064285714285714]^T   | 1.37e-01                 | 1.69
 2 | [1.01815943274188, 1.00914882463936]^T    | 2.03e-02                 | 2.23
 3 | [1.00023355916300, 1.00015913936075]^T    | 2.83e-04                 | 2.15
 4 | [1.00000000583852, 1.00000002726552]^T    | 2.79e-08                 | 1.77
 5 | [0.999999999999998, 1.000000000000000]^T  | 2.11e-15                 |
 6 | [1, 1]^T                                  |                          |
☞ (Some) evidence of quadratic convergence, see Rem. 8.2.2.12. y
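The numbers of the above table can be reproduced with a few lines of Eigen code along the following lines (our own sketch; output handling and the fixed number of steps are arbitrary choices):

#include <Eigen/Dense>
#include <cmath>
#include <iostream>

// Sketch: Newton iteration for F from (8.5.2.2), started at x^{(0)} = [0.7, 0.7]^T.
int main() {
  Eigen::Vector2d x(0.7, 0.7);
  const Eigen::Vector2d xstar(1.0, 1.0);
  for (int k = 0; k < 7; ++k) {
    Eigen::Vector2d F(x(0) * x(0) - std::pow(x(1), 4), x(0) - std::pow(x(1), 3));
    Eigen::Matrix2d DF;
    DF << 2 * x(0), -4 * std::pow(x(1), 3), 1, -3 * x(1) * x(1);
    x -= DF.lu().solve(F);  // Newton step
    std::cout << k + 1 << ": error norm = " << (x - xstar).norm() << std::endl;
  }
}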
EXAMPLE 8.5.2.3 (Convergence of Newton’s method for matrix inversion → [Ale12; PS91]) In
Ex. 8.5.1.38 we derived the Newton iteration for computing the inverse of a regular matrix A ∈ R^{n,n}:

X^{(k+1)} = X^{(k)} − ( X^{(k)} A X^{(k)} − X^{(k)} ) = X^{(k)} ( 2I − A X^{(k)} ) .   (8.5.1.42)
Now we study the local convergence of this iteration by direct estimates. To that end we first derive a
recursion for the iteration errors E(k) := X(k) − A−1 :
E^{(k+1)} = X^{(k+1)} − A^{-1}
          = X^{(k)} (2I − A X^{(k)}) − A^{-1}                         [by (8.5.1.42)]
          = (E^{(k)} + A^{-1}) ( 2I − A (E^{(k)} + A^{-1}) ) − A^{-1}
          = (E^{(k)} + A^{-1}) ( I − A E^{(k)} ) − A^{-1} = −E^{(k)} A E^{(k)} .
For the norm of the iteration error (a matrix norm → Def. 1.5.5.10) we conclude from submultiplicativity
(1.5.5.11) a recursive estimate
‖E^{(k+1)}‖ ≤ ‖E^{(k)}‖^2 ‖A‖ .   (8.5.2.4)
This holds for any matrix norm according to Def. 1.5.5.10, which is induced by a vector norm. For the
relative iteration error we obtain
‖E^{(k+1)}‖ / ‖A^{-1}‖ ≤ ‖A‖ ‖A^{-1}‖ ( ‖E^{(k)}‖ / ‖A^{-1}‖ )^2 ,   (8.5.2.5)

where ‖E^{(k)}‖/‖A^{-1}‖ is the relative error and ‖A‖ ‖A^{-1}‖ = cond(A).
From (8.5.2.4) we conclude that the iteration will converge (lim_{k→∞} E^{(k)} = 0), if ‖E^{(0)}‖ ‖A‖ < 1,
which gives a condition on the initial guess X^{(0)}. Now let us consider the Euclidean matrix norm ‖·‖_2,
which can be expressed in terms of eigenvalues, see Cor. 1.5.5.16. Motivated by this relationship, we use
the initial guess X^{(0)} = αA^⊤ with α > 0 still to be determined:

‖X^{(0)} A − I‖_2 = ‖αA^⊤ A − I‖_2 < 1  ⇔  α < 2 / ‖A‖_2^2 ,

which is a sufficient condition on the initial guess X^{(0)} = αA^⊤ in order to make (8.5.1.42) converge. In
this case we infer quadratic convergence from both (8.5.2.4) and (8.5.2.5). y
There is a sophisticated theory about the convergence of Newton’s method. For example one can find the
following theorem in [DH03, Thm. 4.10], [Deu11, Sect. 2.1]:
If:
(A) D ⊂ R^n open and convex,
(B) F : D → R^n continuously differentiable,
(C) D F(x) regular ∀ x ∈ D,
(D) ∃ L ≥ 0:  ‖D F(x)^{-1} (D F(x + v) − D F(x))‖_2 ≤ L ‖v‖_2  ∀ v ∈ R^n with v + x ∈ D, ∀ x ∈ D,
(E) ∃ x^*:  F(x^*) = 0  (existence of a solution in D),
(F) the initial guess x^{(0)} ∈ D satisfies  ρ := ‖x^* − x^{(0)}‖_2 < 2/L  ∧  B_ρ(x^*) ⊂ D,
then the Newton iteration (8.5.1.6) satisfies:
(i) x^{(k)} ∈ B_ρ(x^*) := {y ∈ R^n : ‖y − x^*‖ < ρ} for all k ∈ N,
(ii) lim_{k→∞} x^{(k)} = x^*,
(iii) ‖x^{(k+1)} − x^*‖_2 ≤ (L/2) ‖x^{(k)} − x^*‖_2^2  (local quadratic convergence).
Usually, it is hardly possible to verify the assumptions of the theorem for a concrete non-linear
system of equations, because neither L nor x ∗ are known.
x^{(k+1)} = x^{(k)} − D F(x^{(k)})^{-1} F(x^{(k)})   (8.5.1.6)
An abstract discussion of ways to stop iterations for solving F (x) = 0 was presented in Section 8.2.3, with
“ideal termination” (→ § 8.2.3.2) as ultimate, but unfeasible, goal.
Yet, in Section 8.5.2 we saw that Newton’s method enjoys (asymptotic) quadratic convergence, which means rapid
decrease of the relative error of the iterates, once we are close to the solution, which is exactly the point,
when we want to STOP. As a consequence, asymptotically, the Newton correction (difference of two
consecutive iterates) yields rather precise information about the size of the error:
→ uneconomical: one needless update, because x(k) would already be accurate enough.
Remark 8.5.3.3 (Newton’s iteration; computational effort and termination) Some facts about the New-
ton method for solving large (n ≫ 1) non-linear systems of equations:
☛ Solving the linear system to compute the Newton correction may be expensive (asymptotic compu-
tational effort O(n3 ) for direct elimination → § 2.3.1.5) and accounts for the bulk of numerical cost
of a single step of the iteration.
☛ In applications only very few steps of the iteration will be needed to achieve the desired accuracy
due to fast quadratic convergence.
✄ The termination criterion (8.5.3.2) computes the last Newton correction ∆x^{(k)} needlessly, because
x^{(k)} is already accurate enough!
Therefore we would like to use an a-posteriori termination criterion that dispenses with computing (and
“inverting”) another Jacobian D F (x(k) ) just to tell us that x(k) is already accurate enough. y
§8.5.3.4 (Termination of Newton iteration based on simplified Newton correction) Due to fast asymptotic
quadratic convergence, we can expect D F(x^{(k−1)}) ≈ D F(x^{(k)}) during the final steps of the iteration.
This suggests basing termination on the simplified Newton correction ∆x̄^{(k)} := D F(x^{(k−1)})^{-1} F(x^{(k)}),
stopping as soon as ‖∆x̄^{(k)}‖ ≤ rtol·‖x^{(k)}‖ or ‖∆x̄^{(k)}‖ ≤ atol, cf. criterion (8.5.3.5).
Effort: reuse of the LU-factorization (→ Rem. 2.5.0.10) of D F(x^{(k−1)}) ➤ ∆x̄^{(k)} available with O(n^2) operations.
C++11 code 8.5.3.6: Generic Newton iteration with termination criterion (8.5.3.5) ➺ GITLAB
template <typename FuncType, typename JacType, typename VecType>
void newton_stc(const FuncType &F, const JacType &DF, VecType &x, double rtol,
                double atol) {
  using scalar_t = typename VecType::Scalar;
  scalar_t sn;
  do {
    auto jacfac = DF(x).lu();  // LU-factorize Jacobian
    x -= jacfac.solve(F(x));   // Compute next iterate
    // Compute norm of simplified Newton correction
    sn = jacfac.solve(F(x)).norm();
  }
  // Termination based on simplified Newton correction
  while ((sn > rtol * x.norm()) && (sn > atol));
}
Remark 8.5.3.7 (Residual based termination of Newton’s method) If we used the residual based ter-
mination criterion
‖F(x^{(k)})‖ ≤ τ ,
then the resulting algorithm would not be affine invariant, because for F (x) = 0 and AF (x) = 0, A ∈
R n,n regular, the Newton iteration might terminate with different iterates. y
The Newton iteration converges asymptotically very fast: doubling of the number of significant digits in each step.
Recall that an implementation of Newton’s method (including stopping rules) for solving GA (x) = 0,
GA (x) := AF (x), F : D ⊂ R n → R n , is called affine invariant, if the same sequence of iterates is
produced for every regular matrix A ∈ R n,n .
△
Video tutorial for Section 8.5.4 "Damped Newton Method": (11 minutes) Download link,
tablet notes
Potentially big problem: Newton method converges quadratically, but only locally , which may render
it useless, if convergence is guaranteed only for initial guesses very close to exact solution, see also
Ex. 8.4.2.38.
In this section we study a method to enlarge the region of convergence, at the expense of quadratic
convergence, of course.
EXAMPLE 8.5.4.1 (Local convergence of Newton’s method) The dark side of local convergence (→
Def. 8.2.1.10): for many initial guesses x(0) Newton’s method will not converge!
Item ➊:  F(x) = x e^x − 1  ⇒  F′(−1) = 0 ,
x^{(0)} < −1 ⇒ x^{(k)} → −∞ ,
x^{(0)} > −1 ⇒ x^{(k)} → x^* ,
because all Newton corrections for x^{(k)} < −1 make the iterates decrease even further.
[Figure: graph of x ↦ x e^x − 1.]

Item ➋:  F(x) = arctan(a x) ,  a > 0 ,  with zero x^* = 0 .
[Fig. 306: diverging Newton iteration for F(x) = arctan x; the iterates x^{(k−1)}, x^{(k)}, x^{(k+1)} alternate in sign with increasing modulus.]
[Fig. 307: parameter a versus x^{(0)}.]
In Fig. 307 the red zone = {x^{(0)} ∈ R : x^{(k)} → 0} is the domain of initial guesses for which Newton’s
method converges.
y
If the Newton correction points in the wrong direction (Item ➊), no general remedy is available. If the
Newton correction is too large (Item ➋), there is an effective cure:
With λ(k) > 0: x(k+1) := x(k) − λ(k) D F (x(k) )−1 F (x(k) ) . (8.5.4.2)
Choice of damping factor: affine invariant natural monotonicity test (NMT) [Deu11, Ch. 3]:

choose the “maximal” 0 < λ^{(k)} ≤ 1 such that  ‖∆x(λ^{(k)})‖_2 ≤ (1 − λ^{(k)}/2) ‖∆x^{(k)}‖_2 ,   (8.5.4.4)

where ∆x^{(k)} := D F(x^{(k)})^{-1} F(x^{(k)}) is the Newton correction and
∆x(λ^{(k)}) := D F(x^{(k)})^{-1} F(x^{(k)} − λ^{(k)} ∆x^{(k)}) is the simplified Newton correction at the tentative iterate.
C++ code 8.5.4.5: Generic damped Newton method based on natural monotonicity test
➺ GITLAB
1  template <typename FuncType, typename JacType, typename VecType>
2  void dampnewton(const FuncType &F, const JacType &DF,
3                  VecType &x, double rtol, double atol)
4  {
5    using index_t = typename VecType::Index;
6    using scalar_t = typename VecType::Scalar;
7    const index_t n = x.size();       // No. of unknowns
8    const scalar_t lmin = 1E-3;       // Minimal damping factor
9    scalar_t lambda = 1.0;            // Initial and actual damping factor
10   VecType s(n), st(n);              // Newton corrections
11   VecType xn(n);                    // Tentative new iterate
12   scalar_t sn, stn;                 // Norms of Newton corrections
13
14   do {
15     auto jacfac = DF(x).lu();       // LU-factorize Jacobian
16     s = jacfac.solve(F(x));         // Newton correction
17     sn = s.norm();                  // Norm of Newton correction
18     lambda *= 2.0;
19     do {
20       lambda /= 2;                  // Reduce damping factor
21       if (lambda < lmin) throw "No convergence: lambda -> 0";
22       xn = x - lambda * s;          // Tentative next iterate
23       st = jacfac.solve(F(xn));     // Simplified Newton correction
24       stn = st.norm();
25     }
26     while (stn > (1 - lambda / 2) * sn); // Natural monotonicity test
27     x = xn;                         // Now: xn accepted as new iterate
28     lambda = std::min(2.0 * lambda, 1.0); // Try to mitigate damping
29   }
30   // Termination based on simplified Newton correction
31   while ((stn > rtol * x.norm()) && (stn > atol));
32 }
The arguments for Code 8.5.4.5 are the same as for Code 8.5.3.6. As termination criterion it uses
(8.5.3.5). Note that all calls to solve boil down to forward/backward elimination for triangular matrices
and incur a cost of O(n^2) only.
Note: The LU-factorization of the Jacobi matrix D F (x(k) ) is done once per successful iteration step and
reused for the computation of the simplified Newton correction in Line 23 of the above C++ code.
EXPERIMENT 8.5.4.6 (Damped Newton method) We test the damped Newton method for Item ➋ of
Ex. 8.5.4.1, where excessive Newton corrections made Newton’s method fail.
Setting:  F(x) = arctan(x) ,  x^{(0)} = 20 ,  q = 1/2 ,  LMIN = 0.001 .

 k | λ^{(k)}  | x^{(k)}             | F(x^{(k)})
 1 | 0.03125  |  0.94199967624205   |  0.75554074974604
 2 | 0.06250  |  0.85287592931991   |  0.70616132170387
 3 | 0.12500  |  0.70039827977515   |  0.61099321623952
 4 | 0.25000  |  0.47271811131169   |  0.44158487422833
 5 | 0.50000  |  0.20258686348037   |  0.19988168667351
 6 | 1.00000  | -0.00549825489514   | -0.00549819949059
 7 | 1.00000  |  0.00000011081045   |  0.00000011081045
 8 | 1.00000  | -0.00000000000001   | -0.00000000000001

We observe that damping is effective and asymptotic quadratic convergence is recovered.
y
EXPERIMENT 8.5.4.7 (Failure of damped Newton method) We examine the effect of damping in the
case of Item ➊ of Ex. 8.5.4.1.
✦ As in Ex. 8.5.4.1:  F(x) = x e^x − 1 .
This time the initial guess is to the left of the global minimum of the function.
[Figure: graph of x ↦ x e^x − 1.]
We start with the following question: How can we solve the non-linear system of equations F(x) = 0,
F : D ⊂ R^n → R^n, iteratively, in case D F(x) is not available and numerical differentiation (see
Rem. 8.5.1.45) is too expensive?
In 1D (n = 1) we can choose among many derivative-free methods that rely on F-evaluations alone, for
instance the secant method (8.4.2.30) from Section 8.4.2.3:
x^{(k+1)} = x^{(k)} − F(x^{(k)}) (x^{(k)} − x^{(k−1)}) / ( F(x^{(k)}) − F(x^{(k−1)}) )
          = x^{(k)} − ( ( F(x^{(k)}) − F(x^{(k−1)}) ) / ( x^{(k)} − x^{(k−1)} ) )^{-1} F(x^{(k)}) .   (8.4.2.30)
Recall from Rem. 8.4.2.33 that the secant method converges locally with order p ≈ 1.6 and beats
Newton’s method in terms of efficiency (→ Section 8.4.3).
Compare (8.4.2.30) with Newton’s method in 1D for solving F(x) = 0:

x^{(k+1)} = x^{(k)} − F′(x^{(k)})^{-1} F(x^{(k)}) .   (8.4.2.1)

The secant method simply replaces the derivative by a difference quotient built from already computed values (→ cheap):

F′(x^{(k)}) ≈ ( F(x^{(k)}) − F(x^{(k−1)}) ) / ( x^{(k)} − x^{(k−1)} ) .   (8.6.0.1)
Reasoning: If we assume that Jk is a good approximation of D F (x(k) ), then it would be foolish not to use
the information contained in Jk for the construction of Jk+1 .
What can “small modification” mean? Demand that J_k acts like J_{k−1} on the orthogonal complement of the
one-dimensional subspace of R^n generated by the vector ∆x^{(k−1)} := x^{(k)} − x^{(k−1)}! Together with the secant condition this is expressed by the rank-1 update

J_k := J_{k−1} + ( F(x^{(k)}) (∆x^{(k−1)})^⊤ ) / ‖∆x^{(k−1)}‖_2^2 .   (8.6.0.5)

Note that the update formula (8.6.0.5) means that J_k is spawned by a rank-1-modification of J_{k−1}. We
have arrived at a well-defined iterative method.
To start the iteration we have to initialize J0 , e.g. with the exact Jacobi matrix D F (x(0) ).
Let x(k) and Jk be the iterates and matrices, respectively, from Broyden’s method (8.6.0.6), and let J ∈ R n,n
satisfy the same secant condition (8.6.0.2) as Jk+1 :
(I − J_k^{-1} J)(x^{(k+1)} − x^{(k)}) = −J_k^{-1} F(x^{(k)}) − J_k^{-1} ( F(x^{(k+1)}) − F(x^{(k)}) ) = −J_k^{-1} F(x^{(k+1)}) .   (8.6.0.9)
Using the submultiplicative property (1.5.5.11) of the Euclidean matrix norm, we conclude
which we saw in Ex. 1.5.5.20. This estimate holds for all matrices J satisfying (8.6.0.8).
We may read this as follows: (8.6.0.5) gives the k·k2 -minimal relative correction of Jk−1 , such that the
secant condition (8.6.0.2) holds. y
and take x(0) = [0.7, 0.7] T . As starting value for the matrix iteration we use J0 = D F (x(0) ).
[Fig. 309: iteration step versus ‖F(x^{(k)})‖ and error norms for Broyden’s method and the Newton iteration (8.5.1.6), which converges quadratically.]
y
Remark 8.6.0.11 (Convergence monitors) In general, the convergence of any iterative method for non-linear systems of
equations can fail, that is, it may stall or even diverge.
Demand on good numerical software: Algorithms should warn users of impending failure. For iterative
methods this is the task of convergence monitors, that is, conditions, cheaply verifiable a posteriori during
the iteration, that indicate stalled convergence or divergence.
For the damped Newton’s method this role can be played by the natural monotonicity test, see
Code 8.5.4.5; if it fails repeatedly, then the iteration should terminate with an error status.
For Broyden’s quasi-Newton method, a similar strategy can rely on the relative size of the “simplified
Broyden correction” J_{k−1}^{-1} F(x^{(k)}):

Convergence monitor for (8.6.0.6):  µ := ‖J_{k−1}^{-1} F(x^{(k)})‖ / ‖∆x^{(k−1)}‖ < 1 ?   (8.6.0.12)
y
We rely on the setting of Exp. 8.6.0.10. We track
1. the Euclidean norm of the iteration error,
2. and the value of the convergence monitor from (8.6.0.12).
[Figure: error norm and convergence monitor versus iteration step.]
Remark 8.6.0.14 (Damped Broyden method) Option to improve robustness (increase region of local
convergence):
damped Broyden method (cf. same idea for Newton’s method, Section 8.5.4)
y
which can be expected to hold, if the method converges and the initial guess is sufficiently close to x∗ .
Note that the simplified quasi-Newton correction is also needed for the convergence monitor (8.6.0.12).
The iterated application of (8.6.0.16) pays off if the iteration terminates after only a few steps. In particular,
for large n ≫ 1 it is not advisable to form the matrices J_k^{-1} (which will usually be dense in contrast to J_k),
because we can employ fast successive multiplications with rank-1 matrices (→ Ex. 1.4.3.1) to apply J_k^{-1} to a vector.
Except for a single solution of a linear system this can be implemented with simple vector arithmetic:
t_0 ∈ R^n :  J_0 t_0 = F(x^{(k)}) ,
t_{ℓ+1} := t_ℓ − ∆x^{(ℓ+1)} ( (∆x^{(ℓ)})^⊤ t_ℓ ) / ( ‖∆x^{(ℓ)}‖_2^2 + (∆x^{(ℓ)})^⊤ ∆x^{(ℓ+1)} ) ,  ℓ = 0, . . . , k − 1 ,   (8.6.0.18)
∆x^{(k)} := −t_k ,
with a computational effort O(n3 + nk ) for n → ∞, assuming that a standard elimination-based direct
solver is used to get t0 , recall Thm. 2.5.0.2. Based on the next iterate x(k+1) := x(k) + ∆x(k) , we obtain
∆x^{(k+1)} := J_k^{-1} F(x^{(k+1)}) in a similar fashion:

∆x^{(k+1)} = ∏_{ℓ=0}^{k−1} ( I − ∆x^{(ℓ+1)} (∆x^{(ℓ)})^⊤ / ( ‖∆x^{(ℓ)}‖_2^2 + (∆x^{(ℓ)})^⊤ ∆x^{(ℓ+1)} ) ) J_0^{-1} F(x^{(k+1)}) .   (8.6.0.19)
Thus, the cost for N steps of Broyden’s quasi-Newton algorithm is asymptotically O(n3 + n2 N + nN 2 )
for n, N → ∞, because the expensive LU-factorization of J0 will be carried out only once.
This is implemented in the following function upbroyd(), whose arguments are
• a functor object F implementing F : R n → R n ,
• the initial guess x(0) ∈ R n in x,
• another functor object J providing the Jacobian D F (x) ∈ R n,n ,
• relative and absolute tolerance reltol and abstol for correction based termination as discussed
in Section 8.5.3,
• the maximal number of iterations maxit,
• and an optional monitor object for tracking the progress of the iteration, see the explanations con-
cerning recorder objects in § 0.3.3.4.
The implementation makes use of the following type definition:
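The type definition itself did not survive the extraction; judging from the code below, it is presumably an Eigen column-vector alias of the following kind:

#include <Eigen/Dense>

// Presumed alias used in Code 8.6.0.20: a (fixed- or dynamic-size) column vector.
template <typename SCALAR, int N = Eigen::Dynamic>
using Vector = Eigen::Matrix<SCALAR, N, 1>;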
The function is templated to allow its use for both fixed-size and variable size vector types of E IGEN.
C++ code 8.6.0.20: Implementation of quasi-Newton method with recursive update of ap-
proximate Jacobians. ➺ GITLAB
template <typename FUNCTION, typename JACOBIAN, typename SCALAR,
          int N = Eigen::Dynamic,
          typename MONITOR =
              std::function<void(unsigned int, Vector<SCALAR, N>,
                                 Vector<SCALAR, N>, Vector<SCALAR, N>)>>
Vector<SCALAR, N> upbroyd(
    FUNCTION &&F, Vector<SCALAR, N> x, JACOBIAN &&J, SCALAR reltol,
    SCALAR abstol, unsigned int maxit = 20,
    MONITOR &&monitor = [](unsigned int /*itnum*/,
                           const Vector<SCALAR, N> & /*x*/,
                           const Vector<SCALAR, N> & /*fx*/,
                           const Vector<SCALAR, N> & /*dx*/) {}) {
  // Calculate LU factorization of initial Jacobian once, cf. Rem. 2.5.0.10
  auto fac = J.lu();
  // First quasi-Newton correction dx(0) := -J0^{-1} F(x(0))
  Vector<SCALAR, N> s = -fac.solve(F(x));
  // Store the first quasi-Newton correction dx(0)
  std::vector<Vector<SCALAR, N>> dx{s};
  x += s;         // x(1) := x(0) + dx(0)
  auto f = F(x);  // Here = F(x(1))
  // Array storing simplified quasi-Newton corrections dx(l)
  std::vector<Vector<SCALAR, N>> dxs{};
  // Array of denominators ||dx(l)||_2^2 + dx(l)^T dx(l+1)
  std::vector<SCALAR> den{};
  monitor(0, x, f, s);  // Record start of iteration
  // Main loop with correction based termination control
  for (unsigned int k = 1;
       ((s.norm() >= reltol * x.norm()) && (s.norm() >= abstol) && (k < maxit));
       ++k) {
    // Compute J0^{-1} F(x(k)), needed for both recursions
    s = fac.solve(f);
    // (8.6.0.19): recursion for next simplified quasi-Newton correction
    Vector<SCALAR, N> ss = s;
    for (unsigned int l = 1; l < k; ++l) {
      ss -= dxs[l - 1] * (dx[l - 1].dot(ss)) / den[l - 1];
    }
    // Store next denominator ||dx(k-1)||_2^2 + dx(k-1)^T dx(k)
    den.push_back(dx[k - 1].squaredNorm() + dx[k - 1].dot(ss));
    // Store current simplified quasi-Newton correction dx(k)
    dxs.push_back(ss);
    // (8.6.0.18): Compute next quasi-Newton correction recursively
    for (unsigned int l = 0; l < k; ++l) {
      s -= dxs[l] * (dx[l].dot(s)) / den[l];
    }
    s *= (-1.0);  // Comply with sign convention
    dx.push_back(s);
    // Compute next iterate x(k+1) and F(x(k+1))
    x += s;
    f = F(x);
    monitor(k, x, f, s);  // Record progress
  }
  return x;
}
Computational cost (N steps):
✦ O(N^2 · n) operations with vectors (Level I),
✦ 1 LU-decomposition of J_0 and N solutions of LSEs, see Section 2.3.2,
✦ N evaluations of F!
Memory cost (N steps):
✦ LU-factors of J_0 + auxiliary vectors ∈ R^n,
✦ 2N vectors ∆x^{(k)}, ∆x̄^{(k)} ∈ R^n.
y
b = [1, 2, . . . , n]^⊤ ∈ R^n ,  A = I + a a^⊤ ∈ R^{n,n} ,  a = (b − 1) / √(1·b − 1) .

Initial guess: h = 2/n; x0 = (2:h:4-h)';
The results resemble those of Exp. 8.6.0.10 ✄
[Fig. 311: iteration step versus ‖F(x^{(k)})‖ and error norms for Broyden’s method, Newton’s method and the simplified Newton method.]
[Fig. 312: number of steps (“Anzahl Schritte”) versus n for Broyden’s method (“Broyden-Verfahren”) and Newton’s method (“Newton-Verfahren”). Fig. 313: runtime in seconds (“Laufzeit [s]”) versus n for both methods.]
☞ In conclusion,
the Broyden method is worthwhile for dimensions n ≫ 1 and low accuracy requirements.
y
ods and generalizations in [Deu11]. The multi-dimensional Newton method is also presented in
affine invariant?
Remember that an iterative method for solving F (x) = 0 is called affine invariant, if it produces the
same sequence of iterates when applied (with the same initial guess) to AF (x) = 0 with any regular
matrix A ∈ R n,n .
(Q8.6.0.22.B) Show that the matrices J_k from Broyden’s quasi-Newton method (8.6.0.6) satisfy the secant
condition
J_k (x^{(k)} − x^{(k−1)}) = F(x^{(k)}) − F(x^{(k−1)}) .
Video tutorial for Section 8.7 "Non-linear Least Squares": (7 minutes) Download link,
tablet notes
So far we have studied non-linear systems of equations F (x) = 0 with the same number n ∈ N of un-
knowns and equations: F : D ⊂ R n → R n . This generalizes square linear systems of equations whose
numerical treatment was the subject of Chapter 2. Then, in Chapter 3, we turned our attention to overde-
termined linear systems of equations Ax = b with A ∈ R m,n , m > n. Now we take the same step for
non-linear systems of equations F (x) = 0 and admit non-linear functions F : D ⊂ R n → R m , m > n.
For overdetermined linear systems of equations we had to introduce the concept of a least-squares solu-
tion in Section 3.1, Def. 3.1.1.1. The same concept will apply in the non-linear case.
EXAMPLE 8.7.0.1 (Least squares data fitting) In Section 5.1 we discussed the reconstruction of a
parameterized function f(x_1, . . . , x_n; ·) : D ⊂ R → R from data points (t_i, y_i), i = 1, . . . , n, by imposing
interpolation conditions. We demanded that the number n of parameters agreed with the number of data
points. Thus, in the case of a general dependence of f on the parameters, the interpolation conditions
(5.1.0.2) yield a non-linear system of equations.
The interpolation approach is justified in the case of highly accurate data. However, we frequently en-
countered inaccurate data, for instance, due to measurement errors. As we discussed in Section 5.7 this
renders the interpolation approach dubious, also in light of the impact of “outliers”.
Guided by this idea we arrive at a particular version of the least squares data fitting problem from Sec-
tion 5.7, cf. (5.7.0.2).
(x_1^*, . . . , x_n^*) = argmin_{x ∈ R^n} ∑_{i=1}^m | f(x_1, . . . , x_n; t_i) − y_i |^2 .   (8.7.0.3)

As we did in § 5.7.0.13 for the linear case, we can rewrite (8.7.0.3) by introducing

F(x) := [ f(x_1, . . . , x_n; t_1) − y_1 ; . . . ; f(x_1, . . . , x_n; t_m) − y_m ] ,  x = [x_1, . . . , x_n]^⊤ .   (8.7.0.4)
y
The previous example motivates the following definition generalizing Def. 3.1.1.1.
x^* ∈ argmin_{x∈D} ‖F(x)‖_2^2 .

The search for such non-linear least-squares solutions is our current concern.

Given:  F : D ⊂ R^n → R^m ,  m, n ∈ N ,  m > n .
Find:  x^* ∈ D :  x^* = argmin_{x∈D} Φ(x) ,  Φ(x) := 1/2 ‖F(x)‖_2^2 .   (8.7.0.7)
As in the case of linear least squares problems (→ Section 3.1.1): a non-linear least squares problem is
related to an overdetermined non-linear system of equations F (x) = 0.
As for non-linear systems of equations discussed in the beginning of this chapter, existence and unique-
ness of x∗ in (8.7.0.7) has to be established in each concrete case!
Remark 8.7.0.8 (“Full-rank condition”) Recall from Rem. 3.1.2.15, Ex. 3.1.2.17, Rem. 3.1.2.18 that for
a linear least-squares problem kAx − bk → min full rank of the matrix A ∈ R m,n was linked to a “good
model”, in which every parameter had an influence independently of the others.
Also in the non-linear setting we require “independence for each parameter”:
∃ neighbourhood U (x∗ ) such that D F (x) has full rank n ∀ x ∈ U (x∗ ) . (8.7.0.9)
This means that the columns of the Jacobi matrix DF (x) must be linearly independent.
If (8.7.0.9) is not satisfied, then the parameters are redundant in the sense that fewer parameters would
be enough to model the same dependence (locally at x∗ ), cf. Rem. 3.1.2.18. y
Review question(s) 8.7.0.10 (Non-linear least squares)
(Q8.7.0.10.A) The one-dimensional non-linear least-squares data fitting problem for a sequence
(ti , yi ) ∈ R2 , i = 1, . . . , m of data points relies on families of parameterized functions
f ( x1 , . . . , xn ; ·) : D ⊂ R 7→ R , n<m,
and seeks

(x_1^*, . . . , x_n^*) = argmin_{x ∈ R^n} ∑_{i=1}^m | f(x_1, . . . , x_n; t_i) − y_i |^2 .   (8.7.0.3)

f_X(x) = (1/(σ√(2π))) e^{−(1/2)((x−µ)/σ)^2} ,  x ∈ R ,  σ > 0 .

C_X : {x_1, . . . , x_m} → R ,  C_X(x_i) := (1/m) ♯{ j : x_j ≤ x_i } ,  i ∈ {1, . . . , m} ,
to formulate an overdetermined non-linear system of equations.
(Q8.7.0.10.D) A scientist proposes that you fit time series data (t_i, y_i) ∈ R^2, i = 1, . . . , n, by linear
combinations of m shifted exponentials t 7→ exp(λ(t − c j )), j = 1, . . . , m, with unknown shifts c j ∈ R.
Is this a good idea? Justify your judgment by examining the associated non-linear system of equations
F (x) = 0 and D F (x).
△
Video tutorial for Section 8.7.1 "Non-linear Least Squares: (Damped) Newton Method": (13
minutes) Download link, tablet notes
We examine a first, natural approach to solving the non-linear least squares problem (8.7.0.7).
§8.7.1.1 (The Newton iteration) Note that grad Φ : D ⊂ R n 7→ R n . The simple idea is to use Newton’s
method (→ Section 8.5) to solve the non-linear n × n system of equations grad Φ(x) = 0.
The Newton iteration (8.5.1.6) for the non-linear system of equations grad Φ(x) = 0 reads

x^{(k+1)} = x^{(k)} − H Φ(x^{(k)})^{-1} grad Φ(x^{(k)}) ,   (8.7.1.2)

with the Hessian matrix H Φ(x) ∈ R^{n,n}.
Using the definition Φ(x) := 21 k F (x)k we can express grad Φ and H Φ in terms of F : R n 7→ R n . First,
2
since Φ(x) = ( G ◦ F )(x) with G : R m → R, G (z) := 21 kzk , the chain rule
gives
In a second step, we can apply the product rule (8.5.1.17) to x 7→ D F (x) T F (x) and get
H Φ(x) := D(grad Φ)(x) = D F(x)^⊤ D F(x) + ∑_{j=1}^m F_j(x) D^2 F_j(x) ,

(H Φ(x))_{i,k} = ∑_{j=1}^m { ∂^2 F_j/(∂x_i ∂x_k)(x) F_j(x) + ∂F_j/∂x_k(x) ∂F_j/∂x_i(x) } ,  i, k = 1, . . . , n .   (8.7.1.5)
We make the recommendation, cf. § 8.5.1.15, that when in doubt, the reader should differentiate com-
ponents of matrices and vectors! Let us pursue this “pedestrian option” also in this case. To begin
with, we recall that the derivative of the i-th component of grad Φ yields the i-th row of the Jacobian
(D grad Φ)(x) = H Φ(x) ∈ R n,n . So for some i ∈ {1, . . . , n} we abbreviate
g(x) := (grad Φ(x))_i = ((D F(x))_{:,i})^⊤ F(x)   (by (8.7.1.4)) .
We compute the components of the gradient of g, which give entries of H Φ:
(H Φ(x))_{i,k} = ∂g/∂x_k(x) = ∂/∂x_k { ∑_{ℓ=1}^m ∂F_ℓ/∂x_i(x) F_ℓ(x) }
              = ∑_{ℓ=1}^m { ∂^2 F_ℓ/(∂x_i ∂x_k)(x) F_ℓ(x) + ∂F_ℓ/∂x_i(x) ∂F_ℓ/∂x_k(x) } ,  k = 1, . . . , n .
Of course, we end up with the same formula as in (8.7.1.5).
The above derivative formulas permit us to rewrite (8.7.1.2) in concrete terms. We obtain the
Newton correction s ∈ R n to the current Newton iterate x(k) by solving the n × n linear system of equa-
tions
( D F(x^{(k)})^⊤ D F(x^{(k)}) + ∑_{j=1}^m F_j(x^{(k)}) D^2 F_j(x^{(k)}) ) s = − D F(x^{(k)})^⊤ F(x^{(k)}) ,   (8.7.1.6)

where the matrix in brackets is H Φ(x^{(k)}) and the right-hand side equals −grad Φ(x^{(k)}).
All the techniques presented in Section 8.5 (damping, termination) can now be applied to the particular
Newton iteration for grad Φ = 0. We refer to that section. y
Remark 8.7.1.7 (Newton method and minimization of quadratic functional) Newton’s method (8.7.1.2)
for (8.7.0.7) can be read as successive minimization of a local quadratic approximation of Φ:
Φ(x) ≈ Q(s) := Φ(x^{(k)}) + grad Φ(x^{(k)})^⊤ s + 1/2 s^⊤ H Φ(x^{(k)}) s ,  s := x − x^{(k)} ,   (8.7.1.8)

grad Q(s) = 0  ⇔  H Φ(x^{(k)}) s + grad Φ(x^{(k)}) = 0  ⇔  (8.7.1.6) .
➣ So we deal with yet another model function method (→ Section 8.4.2) with quadratic model function
Q for Φ.
y
Review question(s) 8.7.1.9 (Non-linear Least Squares: (Damped) Newton Method)
(Q8.7.1.9.A) For Φ(x) := 1/2 ‖F(x)‖_2^2 with F = [F_1, . . . , F_m]^⊤ : R^n → R^m twice continuously differentiable, compute

grad Φ(x) := [ ∂Φ/∂x_1(x), . . . , ∂Φ/∂x_n(x) ]^⊤ ∈ R^n ,  H Φ(x) = [ ∂^2Φ/(∂x_i ∂x_j)(x) ]_{i,j=1}^n ∈ R^{n,n} .
The Newton method derived in Section 8.7.1 hinges on the availability of second derivatives of F. This
compounds difficulties of implementation, in particular, if F is given only implicitly or in procedural form.
Now we will learn about a method for the non-linear least-squares problem (8.7.0.7) that avoids second derivatives of F.
Idea: Local linearization of F (here at y):  F(x) ≈ F(y) + D F(y)(x − y) ,
which, for y := x_0, leads to the linear least-squares problem

x^* ≈ argmin_{x∈R^n} ‖F(x_0) + D F(x_0)(x − x_0)‖_2 ,   (♠)
where x0 is an approximation of the solution x∗ of (8.7.0.7). This is a linear least squares problem in the
standard form given in Def. 3.1.1.1.
(♠) ⇔ argmin_{x∈R^n} ‖Ax − b‖_2  with  A := D F(x_0) ∈ R^{m,n} ,  b := −F(x_0) + D F(x_0) x_0 ∈ R^m .
The idea of local linearization leads to the Gauss-Newton iteration, if we make the full-rank assumption
(8.7.0.9), which, thanks to Cor. 3.1.2.13 , guarantees uniqueness of solutions of the linear least-squares
problem (♠). Making the substitutions x_0 := x^{(k)} and s := x − x^{(k)}, we arrive at the following iterative
method (Gauss-Newton iteration):

x^{(k+1)} := argmin_{x∈R^n} ‖F(x^{(k)}) + D F(x^{(k)})(x − x^{(k)})‖_2 ,  k = 0, 1, 2, . . . .
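The listing Code 8.7.2.2 itself is not reproduced here; a minimal Eigen sketch of a Gauss-Newton iteration with correction-based termination (interface and tolerances are our own choices) might look like:

#include <Eigen/Dense>

// Sketch: Gauss-Newton iteration; in every step the linear least-squares problem
// argmin_s ||F(x) + DF(x) s||_2 is solved via a Householder QR decomposition.
template <typename FuncType, typename JacType>
Eigen::VectorXd gaussnewton(const FuncType &F, const JacType &DF,
                            Eigen::VectorXd x, double rtol, double atol) {
  Eigen::VectorXd s;
  do {
    s = DF(x).householderQr().solve(-F(x)); // Gauss-Newton correction
    x += s;
  } while ((s.norm() > rtol * x.norm()) && (s.norm() > atol));
  return x;
}

For m = n the least-squares solve degenerates into an ordinary linear solve, which is why such a function also covers Newton's method, as remarked next.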
Note that the function of Code 8.7.2.2 also implements Newton’s method (→ Section 8.5.1) in the case
m = n!
Remark 8.7.2.3 (Gauss-Newton versus Newton) Let us summarize the pros and cons of using the
Gauss-Newton approach:
EXAMPLE 8.7.2.4 (Non-linear fitting of data (II) → Ex. 8.7.0.1) Given data points (ti , yi ),
i = 1, . . . , m, we consider the non-linear data fitting problem (8.7.0.5) for the parameterized function
f ( x1 , x2 , x3 ; t) := x1 + x2 exp(− x3 t) .
[Fig. 314, Fig. 315: value of ‖F(x^{(k)})‖_2^2 and norm of grad Φ(x^{(k)}) versus the number of steps of the undamped Newton method, for two different initial guesses.]

Concerning the convergence behaviour of the plain Newton method we observe that
• for initial value (1.8, 1.8, 0.1)^T (red curve) ➤ the Newton method is caught in a local minimum,
• for initial value (1.5, 1.5, 0.1)^T (cyan curve) ➤ fast (locally quadratic) convergence.

[Fig. 316, Fig. 317: value of ‖F(x^{(k)})‖_2^2 and norm of grad Φ(x^{(k)}) versus the number of steps of the damped Newton method.]

[Fig. 318, Fig. 319: value of ‖F(x^{(k)})‖_2^2 and norm of grad Φ(x^{(k)}) versus the number of steps of the Gauss-Newton method.]
For the Gauss-Newton method we observe linear convergence for both initial values (Refer to Def. 8.2.2.1,
Rem. 8.2.2.6 for “linear convergence” and how to see it in error plots).
In this experiment the convergence of the Gauss-Newton method is asymptotically clearly slower than that
of the Newton method, but less dependent on the choice of good initial guesses. This matches what is
often observed in practical non-linear fitting. y
is close to a standard linear least-squares problem and its solution s can be obtained by solving a linear
system of equations, which we get by setting the gradient of
Ψ(z) := ‖F(x^{(k)}) + D F(x^{(k)}) z‖_2^2 + λ‖z‖_2^2
      = z^⊤ ( D F(x^{(k)})^⊤ D F(x^{(k)}) + λI ) z + 2 F(x^{(k)})^⊤ D F(x^{(k)}) z + ‖F(x^{(k)})‖_2^2 ,

grad Ψ(z) = 2 ( D F(x^{(k)})^⊤ D F(x^{(k)}) + λI ) z + 2 D F(x^{(k)})^⊤ F(x^{(k)}) ,
to zero. This leads to the following n × n linear system of normal equations for damped Gauss-Newton
correction s in the k-th step, see Thm. 3.1.2.1:
( D F(x^{(k)})^⊤ D F(x^{(k)}) + λI ) s = − D F(x^{(k)})^⊤ F(x^{(k)}) .   (8.7.3.2)
x^{(k+1)} := argmin_{x∈R^n} ‖F(x^{(k)}) + D F(x^{(k)})(x − x^{(k)})‖_2^2 ,
to an n × n non-linear system of equations, that is, F : D ⊂ R n → R n . You may assume that D F (x(k) )
always has full rank.
△
Learning Outcomes
• Knowledge about concepts related to the speed of convergence of an iteration for solving a non-
linear system of equations.
• Ability to estimate type and orders of convergence from empiric data.
• Ability to predict asymptotic linear, quadratic and cubic convergence by inspection of the iteration
function.
• Familiarity with (damped) Newton’s method for general non-linear systems of equations and with the
secant method in 1D.
• Ability to derive the Newton iteration for an (implicitly) given non-linear system of equations.
• Knowledge about quasi-Newton method as multi-dimensional generalizations of the secant method.
Bibliography
[Ale12] A. Alexanderian. A basic note on iterative matrix inversion. Online document. 2012 (cit. on
pp. 647, 650).
[AG11] Uri M. Ascher and Chen Greif. A first course in numerical methods. Vol. 7. Computational
Science & Engineering. Society for Industrial and Applied Mathematics (SIAM), Philadelphia,
PA, 2011, pp. xxii+552. DOI: 10.1137/1.9780898719987 (cit. on pp. 605, 609, 618, 624,
628, 665).
[BC17] Heinz H. Bauschke and Patrick L. Combettes. Convex analysis and monotone operator theory
in Hilbert spaces. Second. CMS Books in Mathematics/Ouvrages de Mathématiques de la
SMC. Springer, Cham, 2017, pp. xix+619. DOI: 10.1007/978-3-319-48311-5.
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 602, 607, 609, 613, 614, 618, 624, 628, 647, 665–675).
[DH03] P. Deuflhard and A. Hohmann. Numerical Analysis in Modern Scientific Computing. Vol. 43.
Texts in Applied Mathematics. Springer, 2003 (cit. on p. 651).
[Deu11] Peter Deuflhard. Newton methods for nonlinear problems. Vol. 35. Springer Se-
ries in Computational Mathematics. Heidelberg: Springer, 2011, pp. xii+424. DOI:
10.1007/978-3-642-23899-4 (cit. on pp. 640, 651, 655, 658, 664).
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 599, 602, 614,
624, 628, 630, 665).
[Mol04] C. Moler. Numerical Computing with MATLAB. Philadelphia, PA: SIAM, 2004 (cit. on p. 632).
[PS91] Victor Pan and Robert Schreiber. “An Improved Newton Iteration for the Generalized Inverse
of a Matrix, with Applications”. In: SIAM Journal on Scientific and Statistical Computing 12.5
(1991), pp. 1109–1130. DOI: 10.1137/0912058 (cit. on pp. 647, 650).
[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Math-
ematics. New York: Springer, 2000 (cit. on pp. 602, 609, 620, 665).
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on pp. 600, 601,
603, 610, 613, 614, 616, 618, 623, 630, 637, 642, 646).
[Wer92] J. Werner. Numerische Mathematik I. Lineare und nichtlineare Gleichungssysteme, Interpola-
tion, numerische Integration. vieweg studium. Aufbaukurs Mathematik. Braunschweig: Vieweg,
1992 (cit. on p. 665).
Chapter 9
Ex. 2.1.0.3: nodal analysis of linear (↔ composed of resistors, inductors, capacitors) electric circuit in
frequency domain (at angular frequency ω > 0) , see (2.1.0.6)
➣ linear system of equations for nodal potentials with complex system matrix A
For circuit of Fig. 320: three unknown nodal potentials
➣ system matrix from nodal analysis at angular frequency ω > 0:
A = [ ıωC + 1/(ıωL)     −1/(ıωL)                 0             ;
      −1/(ıωL)          ıωC + 1/R + 2/(ıωL)      −1/(ıωL)       ;
      0                 −1/(ıωL)                 ıωC + 1/(ıωL)  ]

  = [ 0 0 0 ; 0 1/R 0 ; 0 0 0 ] + ıω [ C 0 0 ; 0 C 0 ; 0 0 C ] + (1/ıω) [ 1/L −1/L 0 ; −1/L 2/L −1/L ; 0 −1/L 1/L ] .
[Fig. 321: maximum nodal potentials |u_1|, |u_2|, |u_3| versus the angular frequency ω of the source voltage U, for R = 1, C = 1, L = 1.]
Blow-up of some nodal potentials for certain ω!
resonant frequencies = {ω ∈ R : A(ω) singular}
If the circuit is operated at a real resonant frequency, the circuit equations will not possess a solution. Of
course, the real circuit will always behave in a well-defined way, but the linear model will break down due
to extremely large currents and voltages. In an experiment this breakdown manifests itself as a rather
explosive meltdown of circuits components. Hence, it is vital to determine resonant frequencies of circuits
in order to avoid their destruction.
A(ω)x = ( W + ıωC + (1/ıω) S ) x = 0 .   (9.0.0.3)

Substitution y = (1/ıω) x ↔ x = ıωy [TM01, Sect. 3.4]:

(9.0.0.3) ⇔ [ W S ; I 0 ] [ x ; y ] = ω [ −ıC 0 ; 0 −ıI ] [ x ; y ] ,   (9.0.0.4)
with M := [ W S ; I 0 ] , z := [ x ; y ] , B := [ −ıC 0 ; 0 −ıI ] .
➣ generalized linear eigenvalue problem of the form: find ω ∈ C, z ∈ C2n \ {0} such that
Mz = ωBz . (9.0.0.5)
In this example one is mainly interested in the eigenvalues ω , whereas the eigenvectors z usually need
not be computed.
[Fig. 322: resonant frequencies of the circuit from Fig. 320 in the complex plane (Re(ω) versus Im(ω)), for R = 1, C = 1, L = 1.]
y
ẏ = Ay ,  A ∈ C^{n,n} .   (9.0.0.7)

A = S diag(λ_1, . . . , λ_n) S^{-1} =: S D S^{-1} ,  S ∈ C^{n,n} regular  ⇒  ẏ = Ay ⟷ ż = Dz ,  z := S^{-1} y .

The initial value problem for the decoupled homogeneous linear ODE ż = Dz has a simple analytic solution

z_i(t) = exp(λ_i t) (z_0)_i = exp(λ_i t) (S^{-1})_{i,:} y_0 .
In order to find the transformation matrix S all non-zero solution vectors (= eigenvectors) x ∈ C n of the
linear eigenvalue problem
Ax = λx
have to be found.
y
Supplementary literature. [NS02, Ch. 7], [Gut09, Ch. 9], [QSS00, Sect. 1.7]
Definition 9.1.0.1. Eigenvalues and eigenvectors → [NS02, Sects. 7.1, 7.2], [Gut09,
Sect. 9.1]
For any matrix norm ‖·‖ induced by a vector norm (→ Def. 1.5.5.10):  ρ(A) ≤ ‖A‖ .
Proof. Let z ∈ C n \ {0} be an eigenvector to the largest (in modulus) eigenvalue λ of A ∈ C n,n . Then
‖A‖ := sup_{x ∈ C^n \ {0}} ‖Ax‖/‖x‖ ≥ ‖Az‖/‖z‖ = |λ| = ρ(A) .
Lemma 9.1.0.5. Gershgorin circle theorem → [DR08, Thm. 7.13], [Han02, Thm. 32.1],
[QSS00, Sect. 5.1]
Lemma 9.1.0.6. Similarity and spectrum → [Gut09, Thm. 9.7], [DR08, Lemma 7.6], [NS02,
Thm. 7.2]
Lemma 9.1.0.7.
Existence of a one-dimensional invariant subspace
x =̂ generalized eigenvector, λ =̂ generalized eigenvalue.
If B is regular:  Ax = λBx ⇔ B^{-1}A x = λx .
However, usually it is not advisable to use this equivalence for numerical purposes!
Remark 9.1.0.11 (Generalized eigenvalue problems and Cholesky factorization)
If B = B^H is s.p.d. (→ Def. 1.1.2.6) with Cholesky factorization B = R^H R, then

Ax = λBx ⇔ Ãy = λy ,  where à := R^{-H} A R^{-1} ,  y := Rx .
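With Eigen this transformation can be sketched as follows for dense, symmetric A (our own sketch; Ã is formed explicitly, which is fine for moderate n):

#include <Eigen/Dense>

// Sketch: reduce the generalized EVP A x = lambda B x (A = A^T, B s.p.d.) to a
// standard symmetric EVP via the Cholesky factorization B = L L^T (R := L^T).
Eigen::VectorXd generalizedEVs(const Eigen::MatrixXd &A, const Eigen::MatrixXd &B) {
  const Eigen::LLT<Eigen::MatrixXd> llt(B);
  const Eigen::MatrixXd L = llt.matrixL();
  // Atilde = L^{-1} A L^{-T} (= R^{-H} A R^{-1})
  const Eigen::MatrixXd Atilde =
      L.triangularView<Eigen::Lower>().solve(
          L.triangularView<Eigen::Lower>().solve(A.transpose()).transpose());
  Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> es(Atilde);
  return es.eigenvalues();  // generalized eigenvalues of (A, B)
}

Eigen also provides GeneralizedSelfAdjointEigenSolver, which performs essentially this reduction internally.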
MATLAB-function: eig
Remark 9.2.0.1 (QR-Algorithm → [GV89, Sect. 7.5], [NS02, Sect. 10.3],[Han02, Ch. 26],[QSS00,
Sect. 5.5-5.7])
Note: All “direct” eigensolvers are iterative methods
Idea: Iteration based on successive unitary similarity transformations
A = A^{(0)} → A^{(1)} → . . . → { diagonal matrix, if A = A^H ; upper triangular matrix, else }  (→ Thm. 9.1.0.8).
(The use of unitary similarity transformations makes these eigensolvers numerically stable, → Def. 1.5.5.19.)
y
(➞ =̂ affected rows/columns, =̂ targeted vector)
[Diagram: successive unitary similarity transformations annihilate the entries outside the three central diagonals.]
➥ transformation to tridiagonal form! (for general matrices a similar strategy can achieve a similarity
transformation to upper Hessenberg form)

MATLAB-code 9.2.0.6:
A = rand(500,500); B = A'*A; C = gallery('tridiag',500,1,3,1);
[Fig. 323–326: runtime (time [s]) versus matrix size n for various eig calls applied to the matrices A, B, C.]
☛ For the sake of efficiency: think about which information you really need when computing eigenvalues/eigenvectors of dense matrices.
Potentially more efficient methods for sparse matrices will be introduced below in Sections 9.3, 9.4.
y
Supplementary literature. [DR08, Sect. 7.5], [QSS00, Sect. 5.3.1], [QSS00, Sect. 5.3]
(G)ij = 1 ⇒ link j → i ,
[Fig. 327, Fig. 328: page rank (relative visit times of the stochastic simulation) versus no. of page for the harvard500 example.]
Observation: relative visit times stabilize as the number of hops in the stochastic simulation → ∞.
The limit distribution is called stationary distribution/invariant measure of the Markov chain. This is what
we seek.
✦ Numbering of pages 1, . . . , N;  ℓ_i =̂ number of links from page i
✦ N × N-matrix of transition probabilities page j → page i:  A = (a_{ij})_{i,j=1}^N ∈ R^{N,N}

A matrix A ∈ [0, 1]^{N,N} with the property (9.3.1.3) is called a (column) stochastic matrix.
“Meaning” of A: given x ∈ [0, 1]^N, ‖x‖_1 = 1, where x_i is the probability of the surfer to visit page i,
i = 1, . . . , N, at an instance t in time, y = Ax satisfies

y_j ≥ 0 ,  ∑_{j=1}^N y_j = ∑_{j=1}^N ∑_{i=1}^N a_{ji} x_i = ∑_{i=1}^N x_i ∑_{j=1}^N a_{ji} = ∑_{i=1}^N x_i = 1 .
Thought experiment: Instead of a single random surfer we may consider m ∈ N, m ≫ 1, of them who
visit pages independently. The fraction of time m · T they all together spend on page i will obviously be
the same for T → ∞ as that for a single random surfer.
Instead of counting the surfers we watch the proportions of them visiting particular web pages at an
(k)
instance of time. Thus, after the k-th hop we can assign a number xi ∈ [0, 1] to web page i, which gives
(k)
(k) ni (k)
the proportion of surfers currently on that page: xi := m , where ni ∈ N0 designates the number of
surfers on page i after the k-th hop.
Now consider m → ∞. The law of law of large numbers suggests that the (“infinitely many”) surfers visiting
page j will move on to other pages proportional to the transistion probabilities aij : in terms of proportions,
for m → ∞ the stochastic evolution becomes a deterministic discrete dynamical system and we find
N
( k +1) (k)
xi = ∑ aij x j , (9.3.1.6)
j =1
that is, the proportion of surfers ending up on page i equals the sum of the proportions on the “source
pages” weighted with the transition probabilities.
Notice that (9.3.1.6) amounts to matrix×vector. Thus, writing x(0) ∈ [0, 1] N , x (0) = 1 for the initial
distribution of the surfers on the net we find
x ( k ) = A k x (0)
will be their mass distribution after k hops. If the limit exists, the i-th component of x∗ := lim x(k) tells us
k→∞
which fraction of the (infinitely many) surfers will be visiting page i most of the time. Thus, x∗ yields the
stationary distribution of the Markov chain.
[Fig. 329, Fig. 330: page rank versus no. of page for harvard500 after 5 and 15 hops of the stochastic simulation (“step 5”, “step 15”).]
Comparison:
[Fig. 331: page rank versus no. of page, “harvard500: 1000000 hops”. Fig. 332: page rank versus no. of page, “step 5”.]
➣ Ax∗ = x∗ ⇒ x∗ ∈ EigA1 .
Does A possess an eigenvalue = 1? Does the associated eigenvector really provide a probability distri-
bution (after scaling), that is, are all of its entries non-negative? Is this probability distribution unique? To
answer these questions we have to study the matrix A:
where ρ(A) is the spectral radius of the matrix A, see Section 9.1.
For r ∈ Eig_A 1, that is, Ar = r, denote by |r| the vector (|r_i|)_{i=1}^N. Since all entries of A are non-negative,
we conclude by the triangle inequality that ‖Ar‖_1 ≤ ‖A|r|‖_1.
Hence, different components of r cannot have opposite sign, which means, that r can be chosen to have
non-negative entries, if the entries of A are strictly positive, which is the case for A from (9.3.1.4). After
normalization krk1 = 1 the eigenvector can be regarded as a probability distribution on {1, . . . , N }.
Sorting the pages according to the size of the corresponding entries in r yields the famous “page rank”.
MATLAB-code 9.3.1.9: computing the page rank vector r via eig (inefficient implementation!)
Plot of the entries of the unique vector r ∈ R^N with 0 ≤ (r)_i ≤ 1, ‖r‖_1 = 1, Ar = r: ✄
[Fig. 333: page rank versus no. of page, “harvard500: 1000000 hops”. Fig. 334: entry of r-vector versus no. of page, “harvard500: Perron-Frobenius vector”.]
The possibility to compute the stationary probability distribution of a Markov chain through an eigenvector
of the transition probability matrix is due to a property of stationary Markov chains called ergodicity.
Errors ‖A^k x_0 − r‖_1, plotted over the iteration steps ✄
[Fig. 335: error 1-norm versus iteration step.]
y
The computation of page rank amounts to finding the eigenvector of the matrix A of transition probabilities
that belongs to its largest eigenvalue 1. This is addressed by an important class of practical eigenvalue
problems:
MATLAB-code 9.3.1.11:
d = (1:10)'; n = length(d);
S = triu(diag(n:-1:1,0) + ones(n,n));
A = S*diag(d,0)*inv(S);

✁ error norm ‖z^{(k)}/‖z^{(k)}‖ − (S)_{:,10}‖ for z^{(0)} = random vector
(Note: (S)_{:,10} =̂ eigenvector for eigenvalue 10)
[Fig. 336: errors versus iteration step k.]
Observation: linear convergence of (normalized) eigenvectors!
Suggests the direct power method (ger.: Potenzmethode), an iterative method (→ Section 8.2):

w := A z^{(k−1)} ,  z^{(k)} := w / ‖w‖_2 ,  k = 1, 2, . . . .   (9.3.1.12)

Note: the “normalization” of the iterates in (9.3.1.12) does not change anything (in exact arithmetic) and
helps avoid overflow in floating point arithmetic.
Due to (9.3.1.13), for large k ≫ 1 (⇒ |λ_n^k| ≫ |λ_j^k| for j ≠ n) the contribution of v_n (size ζ_n λ_n^k) in
the eigenvector expansion (9.3.1.15) will be much larger than the contribution (size ζ_j λ_j^k) of any other
eigenvector (if ζ_n ≠ 0): the eigenvector for λ_n will swamp all others for k → ∞.
Further, (9.3.1.15) nurtures the expectation that v_n will become dominant in z^{(k)} the faster, the better |λ_n| is
separated from |λ_{n−1}|; see Thm. 9.3.1.21 for a rigorous statement.
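A direct power iteration as in (9.3.1.12) can be sketched with Eigen like this (termination handling deliberately kept to a fixed step count):

#include <Eigen/Dense>

// Sketch: direct power method; returns the normalized iterate z^{(k)} after a
// fixed number of steps (no sophisticated termination criterion).
Eigen::VectorXd powerIteration(const Eigen::MatrixXd &A, Eigen::VectorXd z,
                               unsigned int maxit = 100) {
  z.normalize();
  for (unsigned int k = 0; k < maxit; ++k) {
    z = A * z;      // apply A
    z.normalize();  // normalization avoids overflow
  }
  return z;
}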
When (9.3.1.12) has converged, two common ways to recover λmax → [DR08, Alg. 7.20]
➊ Az^{(k)} ≈ λ_max z^{(k)}  ➣  |λ_n| ≈ ‖Az^{(k)}‖ / ‖z^{(k)}‖  (modulus only!)

➋ λ_max ≈ argmin_{θ∈R} ‖Az^{(k)} − θ z^{(k)}‖_2^2  ➤  λ_max ≈ (z^{(k)})^H A z^{(k)} / ‖z^{(k)}‖_2^2 .

This latter formula is extremely useful, which has earned it a special name:

Definition 9.3.1.16.
For A ∈ K^{n,n}, u ∈ K^n the Rayleigh quotient is defined by

ρ_A(u) := (u^H A u) / (u^H u) .
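In code the Rayleigh quotient is a one-liner, e.g. with Eigen (the function name is ours):

#include <Eigen/Dense>
#include <complex>

// Rayleigh quotient rho_A(u) = (u^H A u)/(u^H u) for complex matrices/vectors.
std::complex<double> rayleighQuotient(const Eigen::MatrixXcd &A,
                                      const Eigen::VectorXcd &u) {
  return u.dot(A * u) / u.dot(u);  // Eigen's dot() conjugates its first argument
}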
MATLAB-code 9.3.1.19:
d = (1:10)'; n = length(d);
S = triu(diag(n:-1:1,0)+ones(n,n)); A = S*diag(d,0)*inv(S);

Test matrices:
① d = (1:10)';               ➣ |λ_{n−1}| : |λ_n| = 0.9
② d = [ones(9,1); 2];        ➣ |λ_{n−1}| : |λ_n| = 0.5
③ d = 1-2.^(-(1:0.5:5)');    ➣ |λ_{n−1}| : |λ_n| = 0.9866

Monitored error ratios:
ρ_EV^{(k)} := ‖z^{(k)} − s_{·,n}‖ / ‖z^{(k−1)} − s_{·,n}‖ ,   ρ_EW^{(k)} := |ρ_A(z^{(k)}) − λ_n| / |ρ_A(z^{(k−1)}) − λ_n| .

       ①                     ②                     ③
 k  | ρ_EV^{(k)} ρ_EW^{(k)} | ρ_EV^{(k)} ρ_EW^{(k)} | ρ_EV^{(k)} ρ_EW^{(k)}
 22 | 0.9102    0.9007     | 0.5000    0.5000     | 0.9900    0.9781
 23 | 0.9092    0.9004     | 0.5000    0.5000     | 0.9900    0.9791
 24 | 0.9083    0.9001     | 0.5000    0.5000     | 0.9901    0.9800
 25 | 0.9075    0.9000     | 0.5000    0.5000     | 0.9901    0.9809
 26 | 0.9068    0.8998     | 0.5000    0.5000     | 0.9901    0.9817
 27 | 0.9061    0.8997     | 0.5000    0.5000     | 0.9901    0.9825
 28 | 0.9055    0.8997     | 0.5000    0.5000     | 0.9901    0.9832
 29 | 0.9049    0.8996     | 0.5000    0.5000     | 0.9901    0.9839
 30 | 0.9045    0.8996     | 0.5000    0.5000     | 0.9901    0.9844

Observation: linear convergence. y
‖Az^{(k)}‖_2 → λ_n ,  z^{(k)} → ±v linearly with rate |λ_{n−1}| / |λ_n| ,
where z^{(k)} are the iterates of the direct power iteration and y^H z^{(0)} ≠ 0 is assumed.
Remark 9.3.1.23 (Termination criterion for direct power iteration) (→ Section 8.2.3)
Adaptation of a posteriori termination criterion (8.3.2.20)
“relative change” ≤ tol:

min ‖z^{(k)} ± z^{(k−1)}‖ ≤ (1/L − 1) tol ,

| ‖Az^{(k)}‖/‖z^{(k)}‖ − ‖Az^{(k−1)}‖/‖z^{(k−1)}‖ | ≤ (1/L − 1) tol ,  see (8.2.3.8) .
More general segmentation problem (non-local): identify parts of the image, not necessarily connected,
with the same texture.
Local similarity matrix:

W ∈ R^{N,N} ,  N := mn ,   (9.3.2.3)
(W)_{ij} = { 0, if pixels i, j not adjacent; 0, if i = j; σ(p_i, p_j), if pixels i, j adjacent } .

Similarity function, e.g., with α > 0:  σ(x, y) := exp(−α(x − y)^2) ,  x, y ∈ R .

[Fig. 338: m × n pixel grid with lexicographic numbering 1, 2, 3, . . . , n (first row), n+1, n+2, . . . , 2n (second row), . . . , (m−1)n+1, . . . , mn (last row); ↔ =̂ adjacent pixels.]
The entries of the matrix W measure the “similarity” of neighboring pixels: if (W)ij is large, they encode
(almost) the same intensity, if (W)ij is close to zero, then they belong to parts of the picture with very
different brightness. In the latter case, the boundary of the segment may separate the two pixels.
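Assembling the sparse matrix W from (9.3.2.3) for an m×n grayscale image can be sketched as follows with Eigen; the row-wise storage of the pixel intensities, the 4-neighbour notion of adjacency and the parameter α are our reading of the text, not guaranteed to match the original code.

#include <Eigen/Sparse>
#include <cmath>
#include <vector>

// Sketch: sparse similarity matrix W (9.3.2.3) for an m x n image with
// sigma(x,y) = exp(-alpha*(x-y)^2) and 4-neighbour adjacency.
Eigen::SparseMatrix<double> similarityMatrix(const Eigen::VectorXd &p, int m,
                                             int n, double alpha) {
  const int N = m * n;
  std::vector<Eigen::Triplet<double>> trip;
  auto idx = [n](int row, int col) { return row * n + col; };  // lexicographic index
  auto sigma = [alpha](double x, double y) {
    return std::exp(-alpha * (x - y) * (x - y));
  };
  for (int r = 0; r < m; ++r) {
    for (int c = 0; c < n; ++c) {
      const int i = idx(r, c);
      if (c + 1 < n) {  // right neighbour
        const int j = idx(r, c + 1);
        const double w = sigma(p(i), p(j));
        trip.emplace_back(i, j, w); trip.emplace_back(j, i, w);
      }
      if (r + 1 < m) {  // lower neighbour
        const int j = idx(r + 1, c);
        const double w = sigma(p(i), p(j));
        trip.emplace_back(i, j, w); trip.emplace_back(j, i, w);
      }
    }
  }
  Eigen::SparseMatrix<double> W(N, N);
  W.setFromTriplets(trip.begin(), trip.end());
  return W;
}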
Ncut(X) := cut(X)/weight(X) + cut(X)/weight(V \ X) ,
with cut(X) := ∑_{i∈X, j∉X} w_{ij} ,  weight(X) := ∑_{i∈X, j∈X} w_{ij} .
[Fig. 339–342: pixel images illustrating cuts and Ncut values for a small test image.]
△ Ncut(X) for pixel subsets X defined by sliding rectangles, see Fig. 340.

Equivalent reformulation:

indicator function:  z : {1, . . . , N} → {−1, 1} ,  z_i := z(i) = { 1, if i ∈ X ; −1, if i ∉ X } .   (9.3.2.6)

Ncut(X) = ( ∑_{z_i>0, z_j<0} −w_{ij} z_i z_j ) / ( ∑_{z_i>0} d_i ) + ( ∑_{z_i>0, z_j<0} −w_{ij} z_i z_j ) / ( ∑_{z_i<0} d_i ) ,   (9.3.2.7)

d_i = ∑_{j∈V} w_{ij} = weight({i}) .   (9.3.2.8)
j∈V
Sparse matrices:

Ncut(X) = (y^⊤ A y) / (y^⊤ D y) ,  y := (1 + z) − β(1 − z) ,  β := ( ∑_{z_i>0} d_i ) / ( ∑_{z_i<0} d_i ) .
✦ (9.3.2.10) ⇒ 1 ∈ EigA0
✦ Lemma 2.8.0.12: A diagonally dominant =⇒ A is positive semidefinite (→ Def. 1.1.2.6)
Ncut(X ) ≥ 0 and 0 is the smallest eigenvalue of A.
However, we are by no means interested in a minimizer y ∈ Span{1} (with constant entries) that does
not provide a meaningful segmentation.
y ⊥ D1 ⇔ 1⊤ Dy = 0 . (9.3.2.13)
still NP-hard
➣ Minimizing Ncut(X ) amounts to minimizing a (generalized) Rayleigh quotient (→ Def. 9.3.1.16) over
a discrete set of vectors, which is still an NP-hard problem.
Idea: Relaxation
Task: (9.3.2.15) ⇔ find the minimizer of a (generalized) Rayleigh quotient under a linear constraint.
Let λ_1 < λ_2 < · · · < λ_m, m ≤ n, be the sorted sequence of all (real!) eigenvalues of A = A^H ∈ C^{n,n}. Then

min_{x∈C^n\{0}} ρ_A(x) = λ_1  and  max_{x∈C^n\{0}} ρ_A(x) = λ_m .

Let λ_1 < λ_2 < · · · < λ_m, m ≤ n, be the sorted sequence of the (real!) eigenvalues of A = A^H ∈ C^{n,n}. Write

U_0 = {0} ,  U_ℓ := ∑_{j=1}^ℓ Eig_A λ_j ,  ℓ = 1, . . . , m ,  and  U_ℓ^⊥ := {x ∈ C^n : u^H x = 0 ∀ u ∈ U_ℓ} .

Then

min{ ρ_A(x) : x ∈ U_{ℓ−1}^⊥ \ {0} } = λ_ℓ ,  ℓ = 1, . . . , m .
Proof. For diagonal A ∈ R n,n the assertion of the theorem is obvious. Thus, Cor. 9.1.0.9 settles
everything.
✷
Well, in Lemma 9.3.2.12 we encounter a generalized Rayleigh quotient ρA,D (y)! How can Thm. 9.3.2.16
be applied to it?
    (9.3.2.15) ⇔ argmin_{1⊤Dy=0} ρA,D(y) = argmin_{1⊤D^{1/2}z=0} ρA,D(D^{−1/2}z)        [z = D^{1/2}y]
                                         = argmin_{1⊤D^{1/2}z=0} ρÃ(z)   with  Ã := D^{−1/2} A D^{−1/2} .   (9.3.2.21)
Related: transformation of a generalized eigenvalue problem into a standard eigenvalue problem accord-
ing to

    Ax = λBx   ⇒   B^{−1/2} A B^{−1/2} z = λz ,   z := B^{1/2} x ,                    (9.3.2.22)

B^{1/2} =̂ square root of the s.p.d. matrix B → Rem. 10.3.0.2.

For the segmentation problem: B = D is diagonal with positive diagonal entries, see (9.3.2.9),
➥ D^{−1/2} = diag(d1^{−1/2}, . . . , dN^{−1/2}) and Ã := D^{−1/2} A D^{−1/2} can easily be computed.

How to deal with the constraint 1⊤ D^{1/2} z = 0 ?
Idea: Penalization
      Add a term P(z) to ρÃ(z) that becomes “sufficiently large” in case the constraint is violated.

    z∗ = argmin_{z∈R^N\{0}}  ρÃ(z) + P(z) = argmin_{z∈R^N\{0}}  ρÃ(z) + ( z⊤(D^{1/2}11⊤D^{1/2})z ) / ( z⊤z )
                                          = argmin_{z∈R^N\{0}}  ρÂ(z)   with   Â := Ã + D^{1/2}11⊤D^{1/2} .   (9.3.2.24)
(9.3.2.10) ⇒ A1 = 0 ⇒ Ã(D^{1/2}1) = 0 ⇔ D^{1/2}1 ∈ EigÃ(0) .

Cor. 9.1.0.9 ➤ The orthogonal complement of an eigenvector of a symmetric matrix is spanned by the
other eigenvectors (orthonormalization of eigenvectors belonging to the same eigenvalue is assumed).

(9.3.2.25)  The minimizer of (9.3.2.23) will be one of the other eigenvectors of Ã that belongs to
            the smallest eigenvalue.

Note: This eigenvector z∗ will be orthogonal to D^{1/2}1, it satisfies the constraint, and, thus, P(z∗) = 0!
Note: The eigenspaces of Ã and Â agree.
Note: Lemma 2.8.0.12 ⇒ Ã is positive semidefinite (→ Def. 1.1.2.6) with smallest eigenvalue 0, and by
(1.5.5.13)

    µ = ‖Ã‖∞ = 2 .                                                                    (9.3.2.26)
∞
By Thm. 9.3.2.16:

    z∗ = eigenvector belonging to the minimal eigenvalue of Â
       ⇕
    z∗ = eigenvector ⊥ D^{1/2}1 belonging to the minimal eigenvalue of Ã
       ⇕
    D^{−1/2} z∗ = minimizer for (9.3.2.15) .
§9.3.2.28 (Algorithm outline: Binary grayscale image segmentation)
➊ Given similarity function σ compute (sparse!) matrices W, D, A ∈ R N,N , see (9.3.2.3), (9.3.2.9).
➋ Compute y∗, ‖y∗‖2 = 1, as eigenvector belonging to the smallest eigenvalue of
   Â := D^{−1/2} A D^{−1/2} + 2 (D^{1/2}1)(D^{1/2}1)⊤ , and set x∗ := D^{−1/2} y∗ (cf. the chain of
   equivalences above).

➌ Obtain the segment as

    X := { i ∈ {1, . . . , N} :  xi∗ > (1/N) ∑_{i=1}^{N} xi∗ } .                      (9.3.2.29)
y
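A minimal MATLAB sketch of how the outline of §9.3.2.28 could be realized (the function name, the
4-neighbour adjacency, the choice A = D − W suggested by A1 = 0, and the dense eigensolver in the
spirit of Code 9.3.2.30 are assumptions of this sketch; instead of the penalized matrix Â it extracts the
eigenvector of the second-smallest eigenvalue of Ã, which is equivalent by (9.3.2.25)):

function X = segmentsketch(P, alpha)
  [m,n] = size(P); N = m*n; p = double(P(:));     % lexicographic pixel numbering
  idx = reshape(1:N, m, n);
  I = [reshape(idx(1:m-1,:),[],1); reshape(idx(:,1:n-1),[],1)];  % adjacent pixel
  J = [reshape(idx(2:m  ,:),[],1); reshape(idx(:,2:n  ),[],1)];  % index pairs
  s = exp(-alpha*(p(I)-p(J)).^2);                 % similarity sigma(p_i,p_j)
  W = sparse([I;J],[J;I],[s;s],N,N);              % similarity matrix (9.3.2.3)
  d = full(sum(W,2));                             % d_i = weight({i}), (9.3.2.8)
  A = spdiags(d,0,N,N) - W;                       % assumed form of A (satisfies A*1 = 0)
  c = 1./sqrt(d);
  At = spdiags(c,0,N,N)*A*spdiags(c,0,N,N);       % A-tilde = D^{-1/2} A D^{-1/2}
  [V,L] = eig(full(At));                          % dense solve as in Code 9.3.2.30
  [lam,ord] = sort(diag(L));
  xs = c.*V(:,ord(2));                            % x* = D^{-1/2} y*, 2nd-smallest eigenvalue
  X = find(xs > mean(xs));                        % indicator set, cf. (9.3.2.29)
end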
1st stage of the segmentation of a 31 × 25 grayscale pixel image (root.pbm, red pixels =̂ X,
σ(x, y) = exp(−((x − y)/10)²)).
(Fig. 343: original image; Fig. 344: segments. Fig. 345: the eigenvector x∗ for the image from Fig. 343,
plotted on the pixel grid.)
To identify more segments, the same algorithm is recursively applied to segment parts of the image
already determined.
Practical segmentation algorithms rely on many more steps, of which the above algorithm is only one, pre-
ceded by substantial preprocessing. Moreover, they dispense with the strictly local perspective adopted
above and take into account more distant connections between image parts, often in a randomized fashion
[SM00].
The image segmentation problem falls into the wider class of graph partitioning problems. Methods based
on (a few of) the eigenvectors of the connectivity matrix belonging to the smallest eigenvalues are known
as spectral partitioning methods. The eigenvector belonging to the smallest non-zero eigenvalue that we
computed above is usually called the Fiedler vector of the graph, see [AKY99; ST96]. y
The solution of the image segmentation problem by means of eig in Code 9.3.2.30 amounts to a tremendous
waste of computational resources: we compute all eigenvalues/eigenvectors of dense matrices, though
only a single eigenvector associated with the smallest eigenvalue is of interest.
This motivates the quest to find efficient numerical methods for the following task.
Task: Given a regular matrix A ∈ K^{n,n}, find its smallest (in modulus) eigenvalue
      and (an) associated eigenvector.

If A ∈ K^{n,n} is regular:

    smallest (in modulus) eigenvalue of A = ( largest (in modulus) eigenvalue of A^{−1} )^{−1} .
M ATLAB-code 9.3.2.31: inverse iteration for computing λmin (A) and associated eigenvector
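The body of this listing is not shown above. A minimal sketch of an inverse iteration in this spirit (the
function name, the random initial vector, and the termination rule are assumptions of this sketch):

function [lmin,z] = invit(A, tol, maxit)
  [L,U,P] = lu(A);                      % single LU-factorization, reused in every step
  z = rand(size(A,1),1); z = z/norm(z);
  lold = 0;
  for k = 1:maxit
    w = U\(L\(P*z));                    % solve A*w = z via the LU-factors
    z = w/norm(w);
    lmin = z'*A*z;                      % Rayleigh quotient as eigenvalue approximation
    if abs(lmin-lold) < tol*abs(lmin), break; end
    lold = lmin;
  end
end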
where: (A − αI)−1 z(k−1) = ˆ solve (A − αI)w = z(k−1) based on Gaussian elimination (↔ a single
LU-factorization of A − αI as in Code 9.3.2.31).
y
Stability of Gaussian elimination/LU-factorization (→ ??) will ensure that “w from (9.3.2.33) points in
the right direction”
In other words, roundoff errors may badly affect the length of the solution w, but not its direction.
Practice [GT08]: If, in the course of Gaussian elimination/LU-factorization a zero pivot element is really
encountered, then we just replace it with eps, in order to avoid inf values!
    |λj − α| / min{ |λi − α| : i ≠ j }   with   λj ∈ σ(A) ,  |α − λj| ≤ |α − λ|  ∀ λ ∈ σ(A) .
MATLAB-code 9.3.2.38:

d = (1:10)';
n = length(d);
Z = diag(sqrt(1:n),0) + ones(n,n);
[Q,R] = qr(Z);
A = Q*diag(d,0)*Q';

(Figure: semi-logarithmic error plot over the iteration steps k = 1, . . . , 10;
 o: |λmin − ρA(z(k))| ,   ∗: ‖z(k) − xj‖ ,  λmin = λj ,  xj ∈ EigA(λj) ,  ‖xj‖2 = 1.)
Task: Given a regular matrix A ∈ K^{n,n}, find its smallest (in modulus) eigenvalue
      and (an) associated eigenvector.
Options: inverse iteration (→ Code 9.3.2.31) and Rayleigh quotient iteration (9.3.2.36).
We expect that an approximate solution of the linear systems of equations encountered during
inverse iteration should be sufficient, because we are dealing with approximate eigenvectors anyway.
Thus, iterative solvers for solving Aw = z(k−1) may be considered, see Chapter 10. However, the required
accuracy is not clear a priori. Here we examine an approach that completely dispenses with an iterative
solver and uses a preconditioner (→ Notion 10.3.0.3) instead.
MATLAB-code 9.3.3.2: preconditioned inverse iteration (9.3.3.1)

Computational effort per step: 1 matrix×vector product, 1 evaluation of the preconditioner,
a few AXPY-operations.
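Since the listing of Code 9.3.3.2 is not reproduced above, here is a minimal sketch of a preconditioned
inverse iteration in the spirit of (9.3.3.1); the interface with function handles evalA, invB (as in Code
9.3.3.4) and the stopping rule are assumptions of this sketch:

function [lmin,z] = pinvit(evalA, invB, n, tol, maxit)
  z = rand(n,1); z = z/norm(z); lmin = 0;
  for k = 1:maxit
    v = evalA(z); rho = z'*v;            % Rayleigh quotient rho_A(z): 1 matrix x vector
    w = z - invB(v - rho*z);             % preconditioned residual correction: 1 invB call
    z = w/norm(w);
    if abs(rho-lmin) < tol*abs(rho), break; end
    lmin = rho;
  end
end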
MATLAB-code 9.3.3.4:
1 A = spdiags(repmat([1/n,-1,2*(1+1/n),-1,1/n],n,1),[-n/2,-1,0,1,n/2],n,n);
2 evalA = @(x) A*x;
3 % inverse iteration
4 invB = @(x) A\x;
5 % tridiagonal preconditioning
6 B = spdiags(spdiags(A,[-1,0,1]),[-1,0,1],n,n); invB = @(x) B\x;
Monitored: error decay during iteration of Code 9.3.3.2: |ρA (z(k) ) − λmin (A)|
(Fig. 346: error |ρA(z(k)) − λmin(A)| versus the iteration step for INVIT and PINVIT, n = 50, 100, 200,
 logarithmic scale; Fig. 347: number of iteration steps needed to reach tolerance = 0.0001 versus the
 matrix size n, INVIT vs. PINVIT.)
ÿ + λ2 y = cos(ωt) , (9.3.4.2)
ÿ + Ay = b cos(ωt) , (9.3.4.4)
with symmetric, positive (semi)definite matrix A ∈ R n,n , b ∈ R n . By Cor. 9.1.0.9 there is an orthogonal
matrix Q ∈ R n,n such that
Q⊤ AQ = D := diag(λ1 , . . . , λn ) .
With z := Q⊤y, (9.3.4.4) becomes   z̈ + Dz = Q⊤b cos(ωt) .
We have obtained decoupled linear 2nd-order scalar ODEs of the type (9.3.4.2).
    (9.3.4.4) can have solutions that grow (linearly) with time, if ω = √λi for some i = 1, . . . , n.

If ω = √λj for one j ∈ {1, . . . , n}, then the solution of the initial value problem for (9.3.4.4) with
y(0) = ẏ(0) = 0 (↔ z(0) = ż(0) = 0) satisfies

    z(t) ∼ (t/(2ω)) sin(ωt) ej + bounded oscillations
       ⇕
    y(t) ∼ (t/(2ω)) sin(ωt) (Q):,j + bounded oscillations ,

where (Q):,j is the j-th eigenvector of A.

    Eigenvectors of A ↔ excitable states                                              y
EXAMPLE 9.3.4.5 (Vibrations of a truss structure cf. [Han02, Sect. 3], M ATLAB’s truss demo)
(Fig. 348: the truss structure with point masses, coordinate range roughly 0 ≤ x ≤ 5, −1.5 ≤ y ≤ 2.5.)
Assumptions: ✦ Truss in static equilibrium (perfect balance of forces at each point mass).
             ✦ Rods are perfectly elastic (i.e., frictionless).

Hooke’s law holds for the force in the direction of a rod:

    F = α Δl / l ,                                                                    (9.3.4.7)

✁ deformed truss:

    lij := ‖Δp^{ji}‖2 ,   Δp^{ji} := p^j − p^i ,                                      (9.3.4.8)

    Δlij(t) := ‖Δp^{ji} + Δu^{ji}(t)‖2 − lij ,   Δu^{ji}(t) := u^j(t) − u^i(t) ,      (9.3.4.9)

    Fij(t) = −αij (Δlij / lij) · (Δp^{ji} + Δu^{ji}(t)) / ‖Δp^{ji} + Δu^{ji}(t)‖2 .   (9.3.4.10)
Assumption: Small displacements

Possibility of linearization by neglecting terms of order ‖u^i‖2²:

    Fij(t) = αij ( 1/‖Δp^{ji} + Δu^{ji}(t)‖2 − 1/‖Δp^{ji}‖2 ) · (Δp^{ji} + Δu^{ji}(t)) ,   (9.3.4.11)

using (9.3.4.8) and (9.3.4.9).

Lemma 9.3.4.12:    1/‖x + y‖2 = 1/‖x‖2 − (x · y)/‖x‖2³ + O(‖y‖2²) .

Proof. Simple Taylor expansion up to the linear term for f(x) = (x1² + · · · + xd²)^{−1/2}:
f(x + y) = f(x) + grad f(x) · y + O(‖y‖2²).                                               ✷

Linearization of force: apply Lemma 9.3.4.12 to (9.3.4.11) and drop terms O(‖Δu^{ji}‖2²):

    Fij(t) ≈ −αij (Δp^{ji} · Δu^{ji}(t)) / lij³ · (Δp^{ji} + Δu^{ji}(t))
           ≈ −αij (Δp^{ji} · Δu^{ji}(t)) / lij³ · Δp^{ji} .                               (9.3.4.13)
Newton’s second law of motion for the point masses gives

    mi (d²/dt²) u^i(t) = Fi = ∑_{j=1, j≠i}^{n} −Fij(t) ,                                          (9.3.4.14)

    mi (d²/dt²) u^i(t) = ∑_{j=1, j≠i}^{n} (αij / lij³) Δp^{ji} (Δp^{ji})⊤ (u^j(t) − u^i(t)) .      (9.3.4.15)

Compact notation: collect all displacements into one vector u(t) = (u^i(t))_{i=1}^{n} ∈ R^{2n}:

    (9.3.4.15)   ⇒   M (d²u/dt²)(t) + A u(t) = f(t) .                                              (9.3.4.16)
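For concreteness, a small MATLAB sketch of how the stiffness matrix A in (9.3.4.16) could be assembled
from (9.3.4.15) for a planar truss; the data layout (pos: point coordinates, edges: index pairs, alpha:
elastic constants) and the function name are assumptions of this sketch:

function A = trussstiffness(pos, edges, alpha)
  k = size(pos,1); A = zeros(2*k);            % 2 displacement components per point
  for e = 1:size(edges,1)
    i = edges(e,1); j = edges(e,2);
    dp = (pos(j,:) - pos(i,:))';              % Delta p^{ji}, see (9.3.4.8)
    K = alpha(e)/norm(dp)^3 * (dp*dp');       % 2x2 block alpha_ij/l_ij^3 * dp*dp'
    ii = 2*i-1:2*i; jj = 2*j-1:2*j;
    A(ii,ii) = A(ii,ii) + K;  A(jj,jj) = A(jj,jj) + K;   % diagonal blocks
    A(ii,jj) = A(ii,jj) - K;  A(jj,ii) = A(jj,ii) - K;   % coupling blocks
  end
end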
Rem. 9.3.4.1: if periodic external forces f(t) = cos(ωt) f, f ∈ R^{2n} (wind, earthquake), act on the
truss, they can excite vibrations of (linearly in time) growing amplitude, if ω coincides with √λj for an
eigenvalue λj of A.
Excited vibrations can lead to the collapse of a truss structure, cf. the notorious
Tacoma-Narrows bridge disaster.
It is essential to know whether eigenvalues of a truss structure fall into a range that can be excited
by external forces.
These will typically(∗) be the low modes ↔ a few of the smallest eigenvalues.
((∗) Reason: fast oscillations will quickly be damped due to friction, which was neglected in our model.)
The stiffness matrix will always possess three zero eigenvalues corresponding to rigid body modes
(= displacements without change of length of the rods).

(Fig. 350: the computed eigenvalues plotted against their number.)
(Fig. 351–Fig. 354: visualization of selected modes of the truss, plotted over −1 ≤ x ≤ 6.)
y
To compute a few of a truss’s lowest resonant frequencies and excitable modes, we need efficient numerical
methods for the following tasks. Obviously, Code 9.3.4.18 cannot be used for large trusses, because eig
invariably operates on dense matrices and will be prohibitively slow and gobble up huge amounts of
memory; also recall the discussion of Code 9.3.2.30.
Of course, we aim to tackle this task by iterative methods generalizing power iteration (→ Section 9.3.1)
and inverse iteration (→ Section 9.3.2).
9.3.4.1 Orthogonalization
If we just carry out the direct power iteration (9.3.1.12) for two vectors, both sequences will converge to
an eigenvector belonging to the largest (in modulus) eigenvalue. However, we recall that the eigenvectors
of a symmetric matrix are mutually orthogonal. This suggests that we orthogonalize the iterates of the
second power iteration (that is to yield the eigenvector for the second largest eigenvalue) with respect to
those of the first. This idea spawns the following iteration, cf. Gram-Schmidt orthogonalization in (10.2.2.4):
    w − ( (w · v)/‖v‖2² ) v

✁ Orthogonalization of two vectors (see Line 4 of Code 9.3.4.19); Fig. 355 visualizes the orthogonal
projection of w onto the direction of v.
Analysis through eigenvector expansions (v, w ∈ R^n, ‖v‖2 = ‖w‖2 = 1):

    v = ∑_{j=1}^{n} αj uj ,   w = ∑_{j=1}^{n} βj uj
    ⇒   Av = ∑_{j=1}^{n} λj αj uj ,   Aw = ∑_{j=1}^{n} λj βj uj ,

    v0 := Av/‖Av‖2 = ( ∑_{j=1}^{n} λj² αj² )^{−1/2} ∑_{j=1}^{n} λj αj uj ,

    Aw − (v0⊤ Aw) v0 = ∑_{j=1}^{n} ( βj − ( ∑_{k=1}^{n} λk² αk βk / ∑_{k=1}^{n} λk² αk² ) αj ) λj uj .

We notice that v is just mapped to the next iterate in the regular direct power iteration (9.3.1.12). After
many steps, it will be very close to un, and, therefore, we may now assume v = un ⇔ αj = δj,n
(Kronecker symbol). Then

    z := Aw − (v0⊤ Aw) v0 = 0 · un + ∑_{j=1}^{n−1} λj βj uj ,

    w(new) := z/‖z‖2 = ( ∑_{j=1}^{n−1} λj² βj² )^{−1/2} ∑_{j=1}^{n−1} λj βj uj .

The sequence w(k) produced by repeated application of the mapping given by Code 9.3.4.19 asymp-
totically (that is, when v(k) has already converged to un) agrees with the sequence produced by the
direct power method for Ã := U diag(λ1, . . . , λn−1, 0) U⊤. Its convergence will be governed by the relative
gap λn−2/λn−1, see Thm. 9.3.1.21.
However: if v(k) itself converges slowly, this reasoning does not apply.
However: if v(k) itself converges slowly, this reasoning does not apply.
M ATLAB-code 9.3.4.21: power iteration with orthogonal projection for two vectors
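The body of Code 9.3.4.21 is not shown above; a minimal sketch of a power iteration with orthogonal
projection for two vectors (the function name, random initial vectors, and the fixed number of steps are
assumptions of this sketch):

function [l1,l2,v,w] = pi2(A, maxit)
  n = size(A,1); v = rand(n,1); w = rand(n,1);
  for k = 1:maxit
    v = A*v; v = v/norm(v);            % power step for the dominant eigenvector
    w = A*w; w = w - (v'*w)*v;         % power step plus projection orthogonal to v
    w = w/norm(w);
  end
  l1 = v'*A*v; l2 = w'*A*w;            % Rayleigh quotients as eigenvalue estimates
end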
(Fig. 356–Fig. 359: errors in λn, λn−1 and in the eigenvector approximations v, w, together with the
corresponding error quotients, plotted against the power iteration step.)
Nothing new: Gram-Schmidt orthonormalization
(→ [NS02, Thm. 4.8], [Gut09, Alg. 6.1], [QSS00, Sect. 3.4.3])

    ➊ ql⊤ qk = δlk   (orthonormality) ,                                               (9.3.4.22)
    ➋ Span{q1, . . . , qk} = Span{v1, . . . , vk}   for all k = 1, . . . , m .         (9.3.4.23)
    z1 = v1 ,
    z2 = v2 − (v2⊤ z1)/(z1⊤ z1) z1 ,
    z3 = v3 − (v3⊤ z1)/(z1⊤ z1) z1 − (v3⊤ z2)/(z2⊤ z2) z2 ,                            (9.3.4.24)
    ...

    + normalization   qk = zk/‖zk‖2 ,   k = 1, . . . , m .                             (9.3.4.25)

Easy computation: the vectors q1, . . . , qm produced by (9.3.4.24) satisfy (9.3.4.22) and (9.3.4.23).
M ATLAB-code 9.3.4.30: General subspace power iteration step with qr based orthonormal-
ization
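The listing itself is not reproduced above; one step of such a subspace power iteration might look as
follows (a sketch; the function name is an assumption):

function V = sspowitstep(A, V)
  V = A*V;             % power iteration applied to each column of V
  [Q,R] = qr(V,0);     % orthonormalize the columns (economy-size QR, cf. Section 9.3.4.1)
  V = Q;
end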
Since the columns of V span a subspace V of R^n, this idea can be recast as the following task: find λ
such that A has an eigenvector in V,

    ⇔ ∃ w ∈ V \ {0}:  Aw = λw
    ⇔ ∃ u ∈ K^m \ {0}:  AVu = λVu
    ⇒ ∃ u ∈ K^m \ {0}:  VH A V u = λ VH V u ,                                          (9.3.4.31)
If our initial assumption holds true, u solves (9.3.4.32), and λ is a simple eigenvalue, then a corresponding
x ∈ EigA(λ) can be recovered as x = Vu.
Note: If V is unitary (→ Def. 6.3.1.2), then the generalized eigenvalue problem (9.3.4.32) will become a
standard linear eigenvalue problem.
Remark 9.3.4.33 (Justification of Ritz projection by min-max theorem)
We revisit m = 2, see Code 9.3.4.19, and recall the min-max theorem Thm. 9.3.2.18.
Idea: maximize the Rayleigh quotient over Span{v, w}, where v, w are the output of Code 9.3.4.19. This leads
to the optimization problem
    (α∗, β∗) := argmax_{α,β∈R, α²+β²=1} ρA(αv + βw) = argmax_{α,β∈R, α²+β²=1} ρ_{(v,w)⊤A(v,w)}( (α, β)⊤ ) ,   (9.3.4.35)

    v∗ := α∗ v + β∗ w .

Note that ‖v∗‖2 = 1, if both v and w are normalized, which is guaranteed in Code 9.3.4.19.
Then, orthogonalizing w w.r.t. v∗ will produce a new iterate w∗.
Again the min-max theorem Thm. 9.3.2.18 tells us that we can find (α∗, β∗)⊤ as eigenvector to the largest
eigenvalue of

    (v, w)⊤ A (v, w) (α, β)⊤ = λ (α, β)⊤ .                                                                     (9.3.4.36)
M ATLAB-code 9.3.4.37: one step of subspace power iteration with Ritz projection, matrix ver-
sion
1 function [V,ev] = sspowitsteprp(A,V)
2 V = A*V; % power iteration applied to columns of V
3 [Q,R] = qr(V,0); % orthonormalization, see Section 9.3.4.1
4 [U,D] = eig(Q'*A*Q); % solve Ritz projected m × m eigenvalue problem
5 V = Q*U; % recover approximate eigenvectors
6 ev = diag(D); % approximate eigenvalues
Note that the orthogonalization step in Code 9.3.4.37 is actually redundant, if exact arithmetic could be
employed, because the Ritz projection could also be realized by solving the generalized eigenvalue prob-
lem.
However, prior orthogonalization is essential for numerical stability (→ Def. 1.5.5.19), cf. the discussion in
Section 3.3.3.
EXAMPLE 9.3.4.38 (Power iteration with Ritz projection)
Matrix as in Ex. 9.3.4.20, σ (A) = {0.5, 1, . . . , 4, 9.5, 10}:
d = [0.5*(1:8),9.5,10] d = [0.5*(1:8),9.5,10]
(Fig. 360, Fig. 361: errors in λn, λn−1 and in the eigenvector approximations v, w versus the power
iteration step, logarithmic scale; Fig. 362, Fig. 363: the corresponding error quotients.)
S.p.d. test matrix:   aij := min{i/j, j/i} ,
    n=200; A = gallery('lehmer',n);
“Initial eigenvector guesses”:   V = eye(n,m);

• Observation: linear convergence of the eigenvalues.
• The choice m > k boosts the convergence of the eigenvalues.

(Fig. 364: errors in the eigenvalue approximations for λ1, λ2, λ3 with m = 3 and m = 6, plotted against
the iteration step, logarithmic scale.)
y
All power methods (→ Section 9.3) for the eigenvalue problem (EVP) Ax = λx only rely on the last iterate
to determine the next one (1-point methods, cf. (8.2.1.5))
“Memory for power iterations”: pursue same idea that led from the gradient method, § 10.1.3.3, to the
conjugate gradient method, § 10.2.2.10: use information from previous iterates to achieve efficient mini-
mization over larger and larger subspaces.
We recall
✦ the direct power method (9.3.1.12) from Section 9.3.1
✦ and the inverse iteration from Section 9.3.2
and how they produce sequences (z(k) )k∈N0 of vectors that are supposed to converge to a vector ∈
EigAλ1 or ∈ EigAλn , respectively.
Intuition: If un (u1 ) “well captured” by V (that is, the angle between the vector and the space V is
small), then we can expect that the largest (smallest) eigenvalue of (9.4.0.1) is a good approximation
for λmax (A)(λmin (A)), and that, assuming normalization
Vw ≈ u1 (or Vw ≈ un ) ,
    V = Span{z(0), Az(0), . . . , A^k z(0)} = K_{k+1}(A, z(0)) ,  a Krylov space, → Def. 10.2.1.1 .   (9.4.0.2)
MATLAB-code 9.4.0.3: Ritz projections onto Krylov space
✁ direct power method with Ritz projection onto the Krylov space from (9.4.0.2), cf. § 9.3.4.39.
Note: implementation for demonstration purposes only (inefficient for sparse matrix A!)
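The listing is not reproduced above; a sketch in its spirit (the function name and the explicit, inefficient
construction of the Krylov basis are assumptions of this sketch):

function [ev,V] = krylovritz(A, z0, k)
  Z = z0;                                % columns span K_{k+1}(A,z0), cf. (9.4.0.2)
  for j = 1:k, Z = [Z, A*Z(:,end)]; end
  [Q,R] = qr(Z,0);                       % orthonormal basis of the Krylov space
  [U,D] = eig(Q'*A*Q);                   % Ritz projection, cf. § 9.3.4.39
  ev = diag(D); V = Q*U;                 % Ritz values and Ritz vectors
end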
MATLAB-code 9.4.0.5:
1 n=100;
2 M=gallery('tridiag',-0.5*ones(n-1,1),2*ones(n,1),-1.5*ones(n-1,1));
3 [Q,R]=qr(M); A=Q'*diag(1:n)*Q; % synthetic matrix, σ(A) = {1, 2, 3, . . . , 100}
(Fig. 365: the three largest Ritz values µm, µm−1, µm−2 versus the dimension m of the Krylov space;
 Fig. 366: the errors |λm − µm|, |λm−1 − µm−1|, |λm−2 − µm−2| on a logarithmic scale.)
Observation: “vaguely linear” convergence of largest Ritz values (notation µi ) to largest eigenvalues.
Fastest convergence of largest Ritz value → largest eigenvalue of A
(Fig. 367: the three smallest Ritz values µ1, µ2, µ3 versus the dimension m of the Krylov space;
 Fig. 368: the corresponding errors |λ1 − µ1|, |λ2 − µ2|, |λ3 − µ3| on a logarithmic scale.)
Observation: Also the smallest Ritz values converge “vaguely linearly” to the smallest eigenvalues of A.
Fastest convergence of smallest Ritz value → smallest eigenvalue of A. y
    z̃(0) arbitrary ,   z̃(k+1) = (νI − A) z̃(k) / ‖(νI − A) z̃(k)‖2 .                   (9.4.0.6)
➣ u1 can also be expected to be “well captured” by Kk(A, x) and the smallest Ritz value should provide
a good approximation for λmin(A).
Proof. Lemma 10.2.2.5: {r0, . . . , rℓ−1} is an orthogonal basis of Kℓ(A, r0), if all the residuals are non-
zero. As AKℓ−1(A, r0) ⊂ Kℓ(A, r0), we conclude the orthogonality rm⊤ A rj = 0 for all j = 0, . . . , m − 2.
Since

    ( Vm⊤ A Vm )ij = ri−1⊤ A rj−1 ,   1 ≤ i, j ≤ m ,
    Vl^H A Vl =: Tl ∈ K^{l,l}   is a tridiagonal matrix with diagonal entries α1, . . . , αl and
    sub-/superdiagonal entries β1, . . . , βl−1 .                                                  (9.4.0.11)
MATLAB-code 9.4.0.12: Lanczos process, cf. Code 10.2.2.11

Computational effort per step: 1 A×vector product, 2 dot products, 2 AXPY-operations, 1 division.
Closely related to the CG iteration, § 10.2.2.10, Code 10.2.2.11.
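The listing of Code 9.4.0.12 is not reproduced above; a textbook-style sketch of the Lanczos process
producing Vl and the tridiagonal matrix Tl from (9.4.0.11) (no reorthogonalization; the simple handling of
a breakdown βk = 0 is an assumption of this sketch):

function [V,T] = lanczos(A, z0, l)
  n = length(z0); V = zeros(n,l); T = zeros(l,l);
  v = z0/norm(z0); vold = zeros(n,1); beta = 0;
  for k = 1:l
    V(:,k) = v;
    w = A*v - beta*vold;              % 1 A x vector product, 1 AXPY
    alpha = v'*w; w = w - alpha*v;    % 1 dot product, 1 AXPY
    T(k,k) = alpha;
    if k < l
      beta = norm(w);                 % second "dot product"
      if beta == 0, V = V(:,1:k); T = T(1:k,1:k); return; end  % breakdown
      T(k,k+1) = beta; T(k+1,k) = beta;
      vold = v; v = w/beta;           % 1 division (normalization)
    end
  end
end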
Total computational effort for l steps of Lanczos process, if A has at most k non-zero entries per row:
O(nkl )
Note: Code 9.4.0.12 assumes that no residual vanishes. This could happen, if z0 exactly belonged to
the span of a few eigenvectors. However, in practical computations inevitable round-off errors will always
ensure that the iterates do not stay in an invariant subspace of A, cf. Rem. 9.3.1.22.
Convergence (what we expect from the above considerations), → [DH03, Sect. 8.5]:

In the l-th step:   λn ≈ µl^{(l)} ,   λn−1 ≈ µ_{l−1}^{(l)} ,  . . . ,  λ1 ≈ µ1^{(l)} ,

    σ(Tl) = {µ1^{(l)}, . . . , µl^{(l)}} ,   µ1^{(l)} ≤ µ2^{(l)} ≤ · · · ≤ µl^{(l)} .
(Fig. 369, Fig. 370: errors |Ritz value − eigenvalue| for λn, λn−1, λn−2, λn−3 versus the step of the
Lanczos process, logarithmic scale.)
σ(A) = {0.255680, 0.273787, 0.307979, 0.366209, 0.465233, 0.643104, 1.000000, 1.873023, 5.048917, 44.766069}
σ(T) = {0.263867, 0.303001, 0.365376, 0.465199, 0.643104, 1.000000, 1.873023, 5.048917, 44.765976, 44.766069}

 l   σ(Tl)
 2   3.392123  44.750734
10   0.263867  0.303001  0.365376  0.465199  0.643104  1.000000  1.873023  5.048917  44.765976  44.766069
Idea:  ✦ do not rely on the orthogonality relations of Lemma 10.2.2.5
       ✦ use explicit Gram-Schmidt orthogonalization [NS02, Thm. 4.8], [Gut09, Alg. 6.1]

    ṽl+1 := Avl − ∑_{j=1}^{l} (vj^H A vl) vj ,   vl+1 := ṽl+1 / ‖ṽl+1‖2   ⇒   vl+1 ⊥ Kl(A, z) .   (9.4.0.15)
➣ Computational cost for l steps, if at most k non-zero entries in each row of A: O(nkl 2 )
➣ The Arnoldi process computes an orthonormal basis (ONB) {v1, . . . , vk+1} of Kk+1(A, v0) for a
general A ∈ C^{n,n}.
function [dn,V,Ht] = arnoldieig(A,v0,k,tol)
  n = size(A,1); V = [v0/norm(v0)];
  Ht = zeros(1,0); dn = zeros(k,1);
  for l=1:n
    d = dn;
    Ht = [Ht, zeros(l,1); zeros(1,l)];
    vt = A*V(:,l);
    for j=1:l
      Ht(j,l) = dot(V(:,j),vt);
      vt = vt - Ht(j,l)*V(:,j);
    end
    ev = sort(eig(Ht(1:l,1:l)));
    dn(1:min(l,k)) = ev(end:-1:end-min(l,k)+1);
    if (norm(d-dn) < tol*norm(dn)), break; end
    Ht(l+1,l) = norm(vt);
    V = [V, vt/Ht(l+1,l)];
  end
Heuristic termination criterion.
Arnoldi process for computing the k largest (in modulus) eigenvalues of A ∈ C^{n,n}:
1 A×vector product per step (➣ attractive for sparse matrices).
(Fig. 371: errors in the Ritz values for λn, λn−1, λn−2, λn−3 versus the step of the Lanczos process;
 Fig. 372: the same for the Arnoldi process; logarithmic scale.)
 l   σ(Hl)
 1   38.500000
 2   3.392123  44.750734
10   0.255680  0.273787  0.307979  0.366209  0.465233  0.643104  1.000000  1.873023  5.048917  44.766069
M ATLAB-code 9.4.0.23:
(Fig. 373, Fig. 374: approximation of the largest eigenvalues λn, λn−1, λn−2 by the Arnoldi process:
 Ritz values and their errors versus the step; Fig. 375, Fig. 376: the same for the smallest eigenvalues
 λ1, λ2, λ3.)
Observation: “vaguely linear” convergence of largest and smallest eigenvalues, cf. Ex. 9.4.0.4. y
Krylov subspace iteration methods (= Arnoldi process, Lanczos process) attractive for computing a
few of the largest/smallest eigenvalues and associated eigenvectors of large sparse matrices.
MATLAB-function: eigs (Krylov-subspace-based computation of a few eigenvalues and eigenvectors of large sparse matrices).
[AKY99] Charles J Alpert, Andrew B Kahng, and So-Zen Yao. “Spectral partitioning with mul-
tiple eigenvectors”. In: Discrete Applied Mathematics 90.1-3 (1999), pp. 3–26. DOI:
10.1016/S0166-218X(98)00083-3 (cit. on p. 700).
[Bai+00] Z.-J. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst. Templates for the Solution
of Algebraic Eigenvalue Problems. Philadelphia, PA: SIAM, 2000 (cit. on p. 677).
[BF06] Yuri Boykov and Gareth Funka-Lea. “Graph Cuts and Efficient N-D Image Segmenta-
tion”. In: International Journal of Computer Vision 70.2 (Nov. 2006), pp. 109–131. DOI:
10.1007/s11263-006-7934-5.
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 679, 681, 685, 690, 692, 701).
[DH03] P. Deuflhard and A. Hohmann. Numerical Analysis in Modern Scientific Computing. Vol. 43.
Texts in Applied Mathematics. Springer, 2003 (cit. on p. 720).
[Gle15] David F. Gleich. “PageRank Beyond the Web”. In: SIAM Review 57.3 (2015), pp. 321–363.
DOI: 10.1137/140976649 (cit. on p. 685).
[GV89] G.H. Golub and C.F. Van Loan. Matrix computations. 2nd. Baltimore, London: John Hopkins
University Press, 1989 (cit. on pp. 682, 683, 696, 721).
[GT08] Craig Gotsman and Sivan Toledo. “On the computation of null spaces of sparse rect-
angular matrices”. In: SIAM J. Matrix Anal. Appl. 30.2 (2008), pp. 445–463. DOI:
10.1137/050638369 (cit. on p. 701).
[Gut09] M.H. Gutknecht. Lineare Algebra. Lecture Notes. SAM, ETH Zürich, 2009 (cit. on pp. 679–681,
711, 721).
[Hac94] Wolfgang Hackbusch. Iterative solution of large sparse systems of equations. Vol. 95. Applied
Mathematical Sciences. New York: Springer-Verlag, 1994, pp. xxii+429 (cit. on p. 681).
[Hal70] K.M. Hall. “An r-dimensional quadratic placement algorithm”. In: Management Science 17.3
(1970), pp. 219–229.
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 681, 682, 702,
705, 716).
[LM06] A.N. Langville and C.D. Meyer. Google’s PageRank and Beyond: The Science of Search En-
gine Rankings. Princeton, NJ: Princeton University Press, 2006 (cit. on p. 685).
[LMM19] Anna Little, Mauro Maggioni, and James M. Murphy. Path-Based Spectral Clustering: Guaran-
tees, Robustness to Outliers, and Fast Algorithms. 2019.
[Lux07] Ulrike von Luxburg. “A tutorial on spectral clustering”. In: Stat. Comput. 17.4 (2007), pp. 395–
416. DOI: 10.1007/s11222-007-9033-z.
[Ney99a] K. Neymeyr. A geometric theory for preconditioned inverse iteration applied to a subspace.
Tech. rep. 130. Tübingen, Germany: SFB 382, Universität Tübingen, Nov. 1999 (cit. on p. 704).
[Ney99b] K. Neymeyr. A geometric theory for preconditioned inverse iteration: III. Sharp convergence
estimates. Tech. rep. 130. Tübingen, Germany: SFB 382, Universität Tübingen, Nov. 1999 (cit.
on p. 704).
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
pp. 679–682, 711, 721).
[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Math-
ematics. New York: Springer, 2000 (cit. on pp. 680–682, 685, 692, 711).
[SM00] J.-B. Shi and J. Malik. “Normalized cuts and image segmentation”. In: IEEE Trans. Pattern
Analysis and Machine Intelligence 22.8 (2000), pp. 888–905 (cit. on pp. 693, 695, 700).
[ST96] D.A. Spielman and Shang-Hua Teng. “Spectral partitioning works: planar graphs and finite el-
ement meshes”. In: Foundations of Computer Science, 1996. Proceedings., 37th Annual Sym-
posium on. Oct. 1996, pp. 96–105. DOI: 10.1109/SFCS.1996.548468 (cit. on p. 700).
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on p. 679).
[TM01] F. Tisseur and K. Meerbergen. “The quadratic eigenvalue problem”. In: SIAM Review 43.2
(2001), pp. 235–286 (cit. on p. 678).
Supplementary literature. There is a wealth of literature on iterative methods for the solution of
linear systems of equations: The two books [Hac94] and [Saa03] offer a comprehensive treatment
of the topic (the latter is available online for ETH students and staff).
Concise presentations can be found in [QSS00, Ch. 4] and [DR08, Ch. 13].
Learning outcomes:
• Understanding when and why iterative solution of linear systems of equations may be preferred to
direct solvers based on Gaussian elimination.

Krylov subspace methods =̂ a class of iterative methods (→ Section 8.2) for the approximate solution of
large linear systems of equations Ax = b, A ∈ K^{n,n}.
BUT, we have reliable direct methods (Gauss elimination → Section 2.3, LU-factorization →
§ 2.3.2.15, QR-factorization → ??) that provide an (apart from roundoff errors) exact solution with a
finite number of elementary operations!
Alas, direct elimination may not be feasible, or may be grossly inefficient, because
• it may be too expensive (e.g. for A too large, sparse), → (2.3.2.10),
• inevitable fill-in may exhaust main memory,
• the system matrix may be available only as procedure y=evalA(x) ↔ y = Ax
Contents
10.1 Descent Methods [QSS00, Sect. 4.3.3] . . . . . . . . . . . . . . . . . . . . . . . . . . 729
10.1.1 Quadratic minimization context . . . . . . . . . . . . . . . . . . . . . . . . . 729
10.1.2 Abstract steepest descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 730
10.1.3 Gradient method for s.p.d. linear system of equations . . . . . . . . . . . . . 731
10.1.4 Convergence of the gradient method . . . . . . . . . . . . . . . . . . . . . . . 732
10.2 Conjugate gradient method (CG) [Han02, Ch. 9], [DR08, Sect. 13.4], [QSS00, Sect. 4.3.4]736
10.2.1 Krylov spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 737
10.2.2 Implementation of CG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 738
10.2.3 Convergence of CG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 741
10.3 Preconditioning [DR08, Sect. 13.5], [Han02, Ch. 10], [QSS00, Sect. 4.3.5] . . . . . . 745
10.4 Survey of Krylov Subspace Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 751
Remark 10.1.0.2 (Krylov methods for complex s.p.d. system matrices) In this chapter, for the sake of
simplicity, we restrict ourselves to K = R.
However, the (conjugate) gradient methods introduced below also work for LSE Ax = b with A ∈ C n,n ,
A = A H s.p.d. when ⊤ is replaced with H (Hermitian transposed). Then, all theoretical statements remain
valid unaltered for K = C. y
A quadratic functional
= 21 kx − x∗ k2A .
Then the assertion follows from the properties of the energy norm.
✷
EXAMPLE 10.1.1.4 (Quadratic functional in 2D)  Plot of J from (10.1.1.2) for A = [2 1; 1 2], b = [1; 1].
(Fig. 377: surface plot of J(x1, x2); Fig. 378: level lines of J in the (x1, x2)-plane, −2 ≤ x1, x2 ≤ 2.)
Level lines of quadratic functionals with s.p.d. A are (hyper)ellipses y
Note that a minimizer need not exist, if F is not bounded from below (e.g., F(x) = x³, x ∈ R, or
F(x) = log x, x > 0), or if D is open (e.g., F(x) = √x, x > 0).
The existence of a minimizer is guaranteed if F is bounded from below and D is closed (→ Analysis).
Fig. 379
(“Geometric intuition”, see Fig. 377: quadratic functional J with s.p.d. A has unique global minimum,
grad J 6= 0 away from minimum, pointing towards it.)
Adaptation: steepest descent algorithm § 10.1.2.1 for the quadratic minimization problem (10.1.1.2), see
[QSS00, Sect. 7.2.4]:

    dφ/dt (t∗) = 0   ⇔   t∗ = (dk⊤ dk) / (dk⊤ A dk)   (unique minimizer) .             (10.1.3.2)
One step of the gradient method involves
✦ a single matrix×vector product with A,
✦ 2 AXPY-operations (→ Section 1.3.2) on vectors of length n,
✦ 2 dot products in R^n.
Computational cost (per step) = cost(matrix×vector) + O(n)
➣ If A ∈ R n,n is a sparse matrix (→ ??) with “O(n) nonzero entries”, and the data structures allow
to perform the matrix×vector product with a computational effort O(n), then a single step of the
gradient method costs O(n) elementary operations.
➣ Gradient method of § 10.1.3.3 only needs A×vector in procedural form y = evalA(x).
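A minimal sketch of such a gradient method in procedural form (Code 10.1.3.4 itself is not reproduced
here; the argument names and the residual-based stopping rule are assumptions of this sketch):

function x = gradit(evalA, b, x, tol, maxit)
  r = b - evalA(x);                      % initial residual
  for k = 1:maxit
    p = evalA(r);                        % the single matrix x vector product per step
    t = (r'*r)/(r'*p);                   % optimal step size, cf. (10.1.3.2)
    x = x + t*r;                         % AXPY no. 1
    r = r - t*p;                         % AXPY no. 2, updated residual
    if norm(r) <= tol*norm(b), break; end
  end
end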
(Fig. 380: iterates x(0), x(1), x(2), x(3) of the gradient method plotted on the level lines of J,
 0 ≤ x1 ≤ 10; Fig. 381: zoom, 0 ≤ x1 ≤ 4.)
    J(Qŷ) = ½ ŷ⊤ D ŷ − (Q⊤b)⊤ ŷ = ∑_{i=1}^{n} ( ½ di ŷi² − b̂i ŷi ) ,   b̂ := Q⊤b .

Hence, a rigid transformation (rotation, reflection) maps the level surfaces of J from (10.1.1.2) to ellipses
with principal axes di. As A is s.p.d., di > 0 is guaranteed.
Observations:
• Larger spread of spectrum leads to more elongated ellipses as level lines ➣ slower convergence
of gradient method, see Fig. 381.
• Orthogonality of successive residuals rk , rk+1 .
Clear from the definition of § 10.1.3.3:

    rk⊤ rk+1 = rk⊤ rk − ( (rk⊤ rk)/(rk⊤ A rk) ) rk⊤ A rk = 0 .                         (10.1.4.3)
y
(Fig. 382: error norms, Fig. 383: residual norms of the gradient method versus the iteration step k for
 A = diag(1:0.01:2), A = diag(1:0.1:11), A = diag(1:1:101); logarithmic scale.)
Observation:
✦ linear convergence (→ Def. 8.2.2.1), see also Rem. 8.2.2.6,
✦ the rate of convergence increases (↔ the speed of convergence decreases) with the spread of the
  spectrum of A.
Impact of distribution of diagonal entries (↔ eigenvalues) of (diagonal matrix) A
(b = x∗ = 0, x0 = cos((1:n)’);)
Test matrix #1: A=diag(d); d = (1:100);
Test matrix #2: A=diag(d); d = [1+(0:97)/97 , 50 , 100];
Test matrix #3: A=diag(d); d = [1+(0:49)*0.05, 100-(0:49)*0.05];
Test matrix #4: eigenvalues exponentially dense at 1
(Fig. 384, left: diagonal entries (↔ eigenvalues) of the four test matrices; right: error norms and residual
 norms of the gradient method versus the iteration step k for test matrices #1–#4, logarithmic scale.)
Observation: Matrices #1, #2 & #4 ➣ little impact of distribution of eigenvalues on asymptotic con-
vergence (exception: matrix #2)
y
    ‖x(k+1) − x∗‖A ≤ L ‖x(k) − x∗‖A ,   L := (cond2(A) − 1) / (cond2(A) + 1) ,

that is, the iteration converges at least linearly (→ Def. 8.2.2.1) w.r.t. the energy norm (→ Def. 10.1.0.1).
Remark 10.1.4.6 (2-norm from eigenvalues → [Gut09, Sect. 10.6], [NS02, Sect. 7.4])
For A = A⊤ ∈ R^{n,n}:   ‖A‖2 = max(|σ(A)|) ,   ‖A^{−1}‖2 = min(|σ(A)|)^{−1} , if A is regular.

    ➥ cond2(A) = λmax(A)/λmin(A) ;   ✎ other notation: κ(A) =̂ spectral condition number of A
      (for general A: λmax(A)/λmin(A) denotes the largest/smallest eigenvalue in modulus).

These results are an immediate consequence of (10.1.4.2), Cor. 9.1.0.9, [NS02, Thm. 7.8], [Gut09,
Satz 9.15].
Please note that for general regular M ∈ R^{n,n} we cannot expect cond2(M) = κ(M).                 y
The 1D line search in § 10.1.3.3 is oblivious of former line searches, which rules out reuse of information
gained in previous steps of the iteration. This is a typical drawback of 1-point iterative methods.

Idea: Replace the 1D line search with a subspace correction.

Given: ✦ initial guess x(0)
       ✦ nested subspaces U1 ⊂ U2 ⊂ U3 ⊂ · · · ⊂ Un = R^n ,  dim Uk = k
Lemma 10.2.0.3 (rk ⊥ Uk).
With x(k) according to (10.2.0.1) and Uk from (10.2.0.2), the residual rk := b − Ax(k) satisfies

    rk⊤ u = 0   ∀ u ∈ Uk   (“rk ⊥ Uk”).
Geometric consideration: since x(k) is the minimizer of J over the affine space Uk + x(0) , the projection of
the steepest descent direction grad J (x(k) ) onto Uk has to vanish:
Proof. Consider

    ψ(t) = J(x(k) + tu) ,   u ∈ Uk ,  t ∈ R .

By (10.2.0.1), t ↦ ψ(t) has a global minimum in t = 0, which implies

    dψ/dt (0) = grad J(x(k))⊤ u = (Ax(k) − b)⊤ u = 0 .

Since u ∈ Uk was arbitrary, the lemma is proved.                                                   ✷
Corollary 10.2.0.5.
Lemma 10.2.0.3 also implies that, if U0 = {0}, then dim Uk = k as long as x(k) 6= x∗ , that is, before we
have converged to the exact solution.
(10.2.0.1) and (10.2.0.2) define the conjugate gradient method (CG) for the iterative solution of
Ax = b
(hailed as a “top ten algorithm” of the 20th century, SIAM News, 33(4))
Lemma 10.2.1.2.
The subspaces Uk ⊂ R^n, k ≥ 1, defined by (10.2.0.1) and (10.2.0.2) satisfy

    Uk = Kk(A, r0) := Span{r0, Ar0, . . . , A^{k−1}r0} ,   r0 := b − Ax(0) .

Proof. Since Uk+1 = Span{Uk, rk}, we obtain Uk+1 ⊂ Kk+1(A, r0). Dimensional considerations based on
Lemma 10.2.0.3 finish the proof.                                                                   ✷
10.2.2 Implementation of CG
Assume: basis {p1 , . . . , pl }, l = 1, . . . , n, of Kl (A, r) available
    (10.2.0.1) ⇔ ∂ψ/∂γj = 0 ,   j = 1, . . . , l .

This leads to a linear system of equations by which the coefficients γj can be computed:

    ⎡ p1⊤Ap1  · · ·  p1⊤Apl ⎤ ⎡ γ1 ⎤   ⎡ p1⊤r ⎤
    ⎢   ...    ...     ...  ⎥ ⎢ ... ⎥ = ⎢  ... ⎥ ,   r := b − Ax(0) .                  (10.2.2.1)
    ⎣ pl⊤Ap1  · · ·  pl⊤Apl ⎦ ⎣ γl ⎦   ⎣ pl⊤r ⎦
Recall: s.p.d. A induces an inner product ➣ concept of orthogonality [NS02, Sect. 4.4], [Gut09,
Sect. 6.2]. “A-geometry” like standard Euclidean space.
Assume: A-orthogonal basis {p1 , . . . , pn } of R n available, such that
Span{p1 , . . . , pl } = Kl (A, r) .
(Efficient) successive computation of x(l ) becomes possible, see [DR08, Lemma 13.24]
(LSE (10.2.2.1) becomes diagonal !)
    r0 := b − Ax(0) ;
    for j = 1 to l do {  x(j) := x(j−1) + ( pj⊤ r0 / pj⊤ A pj ) pj  }                  (10.2.2.2)
From linear algebra we already know a way to construct orthogonal basis vectors:
(10.2.2.3) ⇒ Idea:  Gram-Schmidt orthogonalization [NS02, Thm. 4.8], [Gut09, Alg. 6.1]
                    of the residuals rj := b − Ax(j) w.r.t. the A-inner product:

    p1 := r0 ,   pj+1 := (b − Ax(j)) − ∑_{k=1}^{j} ( pk⊤ A rj / pk⊤ A pk ) pk ,   j = 1, . . . , l − 1 .   (10.2.2.4)
(Figure: geometric interpretation of (10.2.2.4): the residual rj and the space Kj(A, r0).)

    (10.2.2.2) & (10.2.2.4) ⇒ pj+1 = r0 − ∑_{k=1}^{j} ( pk⊤ r0 / pk⊤ A pk ) Apk − ∑_{k=1}^{j} ( pk⊤ A rj / pk⊤ A pk ) pk
                            ⇒ pj+1 ∈ Span{r0, p1, . . . , pj, Ap1, . . . , Apj} .                           ✷
Orthogonalities from Lemma 10.2.2.5 ➤ short recursions for pk, rk, x(k)!

    (10.2.2.3) ⇒ (10.2.2.4) collapses to   pj+1 := rj − ( pj⊤ A rj / pj⊤ A pj ) pj ,   j = 1, . . . , l ,

    (10.2.2.2) ⇒   rj = rj−1 − ( pj⊤ r0 / pj⊤ A pj ) Apj .
    Lemma 10.2.2.5, (i) ⇒   rj−1⊤ pj = ( r0 + ∑_{k=1}^{j−1} ( r0⊤ pk / pk⊤ A pk ) Apk )⊤ pj = r0⊤ pj .   (10.2.2.9)

The orthogonality (10.2.2.9) together with (10.2.2.8) permits us to replace r0 with rj−1 in the actual imple-
mentation.                                                                                                y
In the CG algorithm rj = b − Ax(j) agrees with the residual associated with the current iterate (in exact
arithmetic, cf. Ex. 10.2.3.1), but the computation through the short recursion is more efficient.
➣ We find that the CG method possesses all the algorithmic advantages of the gradient method, cf. the
discussion in Section 10.1.3.

    1 matrix×vector product, 3 dot products, 3 AXPY-operations per step:
    if A is sparse, nnz(A) ∼ n ➤ computational effort O(n) per step.

MATLAB-function: pcg (conjugate gradient solver, also usable without a preconditioner).
For any vector norm and associated matrix norm (→ Def. 1.5.5.10) we have (with residual rl := b − Ax(l))

    (1/cond(A)) ‖rl‖/‖r0‖  ≤  ‖x(l) − x∗‖ / ‖x(0) − x∗‖  ≤  cond(A) ‖rl‖/‖r0‖ .        (10.2.2.13)

(10.2.2.13) can easily be deduced from the error equation A(x(k) − x∗) = rk, see Def. 2.4.0.1 and
(2.4.0.13).                                                                                                y
10.2.3 Convergence of CG
Note: CG is a direct solver, because (in exact arithmetic) x(k) = x∗ for some k ≤ n
Residual norms during the CG iteration ✄: R = [r0, . . . , r(10)]
(Fig. 386: ‖rk‖ versus the iteration step k, logarithmic scale.)
R⊤R = (rounded) identity in the leading 6×6 block, but with off-diagonal entries of size up to ≈ 0.8 in
the last four rows and columns (e.g. (R⊤R)1,8 ≈ −0.80, (R⊤R)2,8 ≈ 0.60).
➣ Roundoff
✦ destroys orthogonality of residuals
✦ prevents computation of exact solution after n steps.
Numerical instability (→ Def. 1.5.5.19) ➣ pointless to (try to) use CG as direct solver! y
Practice: CG used for large n as iterative solver : x(k) for some k ≪ n is expected to provide good
approximation for x∗
EXAMPLE 10.2.3.2 (Convergence of CG as iterative solver) CG (Code 10.2.2.11) & gradient method
(Code 10.1.3.4) for LSE with sparse s.p.d. “Poisson matrix”
A = gallery(’poisson’,m); x0 = (1:n)’; b = zeros(n,1);
➣ A ∈ R^{m²,m²} (Poisson matrix; m = 10 shown below).
(Fig. 387: sparsity pattern (spy plot) of the Poisson matrix, nz = 460; Fig. 388: its eigenvalues,
 logarithmic scale.)
(Plots: normalized error and residual norms of CG and of the gradient method versus the iteration step,
 logarithmic scale.)
(10.2.3.4)
Bound this minimum for λ ∈ [λmin (A), λmax (A)] by using suitable “polynomial candidates”
Tool: Chebychev polynomials (→ Section 6.2.3.1) ➣ lead to the following estimate [Hac91,
Satz 9.4.2], [DR08, Satz 13.29]
The iterates of the CG method for solving Ax = b (see Code 10.2.2.11) with A = A⊤ s.p.d. satisfy

    ‖x − x(l)‖A ≤ [ 2 (1 − 1/√κ(A))^l / ( (1 + 1/√κ(A))^{2l} + (1 − 1/√κ(A))^{2l} ) ] ‖x − x(0)‖A
                ≤ 2 ( (√κ(A) − 1)/(√κ(A) + 1) )^l ‖x − x(0)‖A .

The estimate of this theorem confirms asymptotic linear convergence of the CG method (→ Def. 8.2.2.1)
with a rate of (√κ(A) − 1)/(√κ(A) + 1).
Plots of bounds for error reduction (in energy norm) during CG iteration from Thm. 10.2.3.5:
(Surface and contour plots of the error-reduction bound as a function of κ(A)^{1/2} and the CG step l.)
Measurement of the rate of (linear) convergence:

    rate ≈ ( ‖r30‖2 / ‖r20‖2 )^{1/10} .                                                (10.2.3.9)
(Plots: measured convergence rate of CG versus cond2(A).)
Justification for estimating the rate of linear convergence (→ Def. 8.2.2.1) of krk k2 → 0:
k r k +1 k2 ≈ L k r k k2 ⇒ k r k + m k2 ≈ L m k r k k2 .
(Left: diagonal entries of the test matrices #1–#4; right: error norms and residual norms of CG versus
 the number of CG steps, logarithmic scale.)
    CG convergence is boosted by clustering of the eigenvalues.
                                                                                       y
Idea: Preconditioning
      Apply the CG method to the transformed linear system

    Ãx̃ = b̃ ,   Ã := B^{−1/2} A B^{−1/2} ,   x̃ := B^{1/2} x ,   b̃ := B^{−1/2} b ,    (10.3.0.1)

with “small” κ(Ã);  B = B⊤ ∈ R^{N,N} s.p.d. =̂ preconditioner.

Recall (10.1.4.2): for every B ∈ R^{n,n} with B⊤ = B there is an orthogonal matrix Q ∈ R^{n,n} such that
B = Q⊤DQ with a diagonal matrix D (→ Cor. 9.1.0.9, [NS02, Thm. 7.8], [LS09, Satz 9.15]). If B is s.p.d.
this is generalized to

    B^{1/2} := Q⊤ D^{1/2} Q ,

and one easily verifies, using Q⊤ = Q^{−1}, that (B^{1/2})² = B and that B^{1/2} is s.p.d. In fact, these two
requirements already determine B^{1/2} uniquely.

2. The evaluation of B^{−1}x is about as expensive (in terms of elementary operations) as the
   matrix×vector multiplication Ax, x ∈ R^n.
Recall: spectral condition number κ(A) := λmax(A)/λmin(A), see (10.1.4.8).

There are several equivalent ways to express that κ(B^{−1/2} A B^{−1/2}) is “small”:
• κ(B^{−1}A) is “small”, because the spectra agree, σ(B^{−1}A) = σ(B^{−1/2}AB^{−1/2}), due to similarity
  (→ Lemma 9.1.0.6).

    S.p.d. B preconditioner :⇔ B^{−1} = cheap approximate inverse of A
Problem: B^{1/2}, which occurs prominently in (10.3.0.1), is usually not available with acceptable computa-
tional costs.
However, if one formally applies § 10.2.2.10 to the transformed system

    Ãx̃ := B^{−1/2} A B^{−1/2} (B^{1/2} x) = b̃ := B^{−1/2} b

from (10.3.0.1), it becomes apparent that, after a suitable transformation of the iteration variables pj and rj,
B^{1/2} and B^{−1/2} invariably occur in the products B^{−1/2}B^{−1/2} = B^{−1} and B^{1/2}B^{−1/2} = I. Thus, thanks to this
intrinsic transformation, square roots of B are not required for the implementation!
CG for Ãx̃ = b̃ and the equivalent iteration in the transformed variables pj := B^{−1/2}p̃j with residuals
B^{1/2}r̃j:

    Input:  initial guess x̃(0) ∈ R^n ;   Output: approximate solution x̃(l) ∈ R^n
    p̃1 := r̃0 := b̃ − B^{−1/2}AB^{−1/2} x̃(0) ;
    for j = 1 to l do {
        α := ( p̃j⊤ r̃j−1 ) / ( p̃j⊤ B^{−1/2}AB^{−1/2} p̃j ) ;
        x̃(j) := x̃(j−1) + α p̃j ;
        r̃j := r̃j−1 − α B^{−1/2}AB^{−1/2} p̃j ;
        p̃j+1 := r̃j − ( (B^{−1/2}AB^{−1/2} p̃j)⊤ r̃j / p̃j⊤ B^{−1/2}AB^{−1/2} p̃j ) p̃j ;
    }

Rewriting every line in terms of pj = B^{−1/2}p̃j, x(j) = B^{−1/2}x̃(j) and rj = B^{1/2}r̃j, only the combinations
B^{−1} (= B^{−1/2}B^{−1/2}) and I (= B^{1/2}B^{−1/2}) remain, which leads to the preconditioned CG method
stated next.
§10.3.0.5 (Preconditioned CG method (PCG) [DR08, Alg. 13.32], [Han02, Alg. 10.1])

Input:  initial guess x ∈ R^n =̂ x(0) ∈ R^n, tolerance τ > 0
Output: approximate solution x =̂ x(l)

    p := r := b − Ax;  p := B^{−1}r;  q := p;  τ0 := p⊤r;
    for l = 1 to lmax do {
        β := r⊤q;  h := Ap;  α := β/(p⊤h);
        x := x + αp;
        r := r − αh;                                                                   (10.3.0.6)
        q := B^{−1}r;  β := (r⊤q)/β;
        if |q⊤r| ≤ τ · τ0 then stop;
        p := q + βp;
    }
Computational effort per step: 1 evaluation A×vector, 1 evaluation B^{−1}×vector,
3 dot products, 3 AXPY-operations
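A minimal sketch implementing the pseudocode of § 10.3.0.5 (the interface with function handles evalA
and invB, as used later in Code 10.3.0.12, and the exact stopping behaviour are assumptions of this
sketch):

function x = pcgit(evalA, invB, b, x, tol, maxit)
  r = b - evalA(x); q = invB(r); p = q; tau0 = q'*r;
  for l = 1:maxit
    beta = r'*q; h = evalA(p); alpha = beta/(p'*h);
    x = x + alpha*p;
    r = r - alpha*h;
    q = invB(r); rq = r'*q;              % note: q'*r = r'*q
    if abs(rq) <= tol*tau0, break; end
    p = q + (rq/beta)*p;
  end
end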
Remark 10.3.0.8 (Convergence theory for PCG) The assertions of Thm. 10.2.3.5 remain valid with κ(A)
replaced by κ(B^{−1}A) and the energy norm based on Ã instead of A.                                 y
    x̃ := x̃ + triu(A)^{−1} (b − Ax̃) ,

    x = ( LA^{−1} + UA^{−1} − UA^{−1} A LA^{−1} ) b   ➤   B^{−1} = LA^{−1} + UA^{−1} − UA^{−1} A LA^{−1} .   (10.3.0.10)
For all these approaches the evaluation of B−1 r can be done with effort of O(n) in the case of a sparse
matrix A (e.g. with O(1) non-zero entries per row). However, there is absolutely no guarantee that
κ (B−1 A) will be reasonably small. It will crucially depend on A, if this can be expected. y
The Code 10.3.0.12 highlights the use of a preconditioner in the context of the PCG method; it only takes
a function that realizes the application of B−1 to a vector. In Line 10 of the code this function is passed as
function handle invB.
(Fig. 391: A-norm of the error, Fig. 392: B^{−1}-norm of the residuals, for CG and PCG with n = 50, 100, 200,
 versus the (P)CG step, logarithmic scale.)

Number of (P)CG steps required versus the problem size n (Fig. 393):

      n    CG   PCG
     16     8     3
     32    16     3
     64    25     4
    128    38     4
    256    66     4
    512   106     4
   1024   149     4
   2048   211     4
   4096   298     3
   8192   421     3
  16384   595     3
  32768   841     3
Clearly in this example the tridiagonal part of the matrix is dominant for large n. In addition, its condition
number grows ∼ n2 as is revealed by a closer inspection of the spectrum.
Preconditioning with the tridiagonal part manages to suppress this growth of the condition number of
B−1 A and ensures fast convergence of the preconditioned CG method y
    (1/cond(A)) ‖rl‖/‖r0‖  ≤  ‖x(l) − x∗‖ / ‖x(0) − x∗‖  ≤  cond(A) ‖rl‖/‖r0‖ .        (10.2.2.13)
(10.2.2.13) yields an estimate for the 2-norm of the transformed iteration errors: ‖ẽ(l)‖2² = (e(l))⊤ B e(l).
Analogous to (10.2.2.13), estimates for the energy norm (→ Def. 10.1.0.1) of the error e(l) := x∗ − x(l),
x∗ := A^{−1}b, use the error equation Ae(l) = rl:

    rl⊤ B^{−1} rl = (B^{−1}Ae(l))⊤ Ae(l) ≤ λmax(B^{−1}A) ‖e(l)‖A² ,

    ‖e(l)‖A² = (Ae(l))⊤ e(l) = rl⊤ A^{−1} rl = (B^{−1}rl)⊤ B A^{−1} rl ≤ λmax(BA^{−1}) (B^{−1}rl)⊤ rl .

    ➥   (1/κ(B^{−1}A)) ‖e(l)‖A²/‖e(0)‖A²  ≤  (B^{−1}rl)⊤rl / (B^{−1}r0)⊤r0  ≤  κ(B^{−1}A) ‖e(l)‖A²/‖e(0)‖A² .   (10.3.0.14)
Theorem 10.4.1.1.
Note: a similar formula for the (linear) rate of convergence as for CG, see Thm. 10.2.3.5, but with √κ(A)
replaced by κ(A)!
Computational costs : 1 A×vector, 1 B−1 ×vector per step, a few dot products & SAXPYs
Memory requirement: a few vectors ∈ R n
➤ GMRES method for general matrices A ∈ R n,n → [Han02, Ch. 16], [QSS00, Sect. 4.4.2]
M ATLAB-function: • [x,flag,relr,it,rv] = gmres(A,b,rs,tol,maxit,B,[],x0);
• [. . .] = gmres(Afun,b,rs,tol,maxit,Binvfun,[],x0);
Remark 10.4.1.2 (Restarted GMRES) After many steps of GMRES we face considerable computational
costs and memory requirements for every further step. Thus, the iteration may be restarted with the
current iterate x(l ) as initial guess → rs-parameter triggers restart after every rs steps (Danger: failure
to converge). y
Zoo of methods with short recursions (i.e. constant effort per step)
MATLAB-function: • [x,flag,r,it,rv] = bicgstab(A,b,tol,maxit,B,[],x0)
• [. . .] = bicgstab(Afun,b,tol,maxit,Binvfun,[],x0);
Computational costs : 2 A×vector, 2 B−1 ×vector, 4 dot products, 6 SAXPYs per step
Memory requirements: 8 vectors ∈ R n
Computational costs : 2 A×vector, 2 B−1 ×vector, 2 dot products, 12 SAXPYs per step
Memory requirements: 10 vectors ∈ R n
    TRY & PRAY
EXAMPLE 10.4.2.2 (Convergence of Krylov subspace methods for non-symmetric system matrix)

1 A = gallery('tridiag',-0.5*ones(n-1,1),2*ones(n,1),-1.5*ones(n-1,1));
2 B = gallery('tridiag',0.5*ones(n-1,1),2*ones(n,1),1.5*ones(n-1,1));

Plotted:  ‖rl‖2 : ‖r0‖2
(Two plots: relative 2-norm of the residual for bicgstab and qmr versus the iteration step, for the two
 system matrices, logarithmic scale.)
Advantages of Krylov methods vs. direct elimination (IF they converge at all/sufficiently fast).
• They require system matrix A in procedural form y=evalA(x) ↔ y = Ax only.
• They can perfectly exploit sparsity of system matrix.
• They can cash in on low accuracy requirements (IF viable termination criterion available).
• They can benefit from a good initial guess.
Bibliography
[AK07] Owe Axelsson and János Karátson. “Mesh independent superlinear PCG rates via
compact-equivalent operators”. In: SIAM J. Numer. Anal. 45.4 (2007), pp. 1495–1516. DOI:
10.1137/06066391X.
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 728, 729, 736–750).
[Gut09] M.H. Gutknecht. Lineare Algebra. Lecture Notes. SAM, ETH Zürich, 2009 (cit. on pp. 733–735,
738, 739).
[Hac91] W. Hackbusch. Iterative Lösung großer linearer Gleichungssysteme. B.G. Teubner–Verlag,
Stuttgart, 1991 (cit. on pp. 735, 743, 751).
[Hac94] Wolfgang Hackbusch. Iterative solution of large sparse systems of equations. Vol. 95. Applied
Mathematical Sciences. New York: Springer-Verlag, 1994, pp. xxii+429 (cit. on pp. 728, 742).
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 729, 736–750,
752).
[IM97] I.C.F. Ipsen and C.D. Meyer. The idea behind Krylov methods. Technical Report 97-3. Raleigh,
NC: Math. Dep., North Carolina State University, Jan. 1997.
[LS09] A.R. Laliena and F.-J. Sayas. “Theoretical aspects of the application of convolution quadrature
to scattering of acoustic waves”. In: Numer. Math. 112.4 (2009), pp. 637–678 (cit. on p. 745).
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
pp. 733–735, 738, 739, 745).
[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Math-
ematics. New York: Springer, 2000 (cit. on pp. 728–750, 752).
[Saa03] Yousef Saad. Iterative methods for sparse linear systems. Second. Philadelphia, PA: Society
for Industrial and Applied Mathematics, 2003, pp. xviii+528 (cit. on p. 728).
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on p. 731).
[Win80] Ragnar Winther. “Some superlinear convergence results for the conjugate gradient method”.
In: SIAM J. Numer. Anal. 17.1 (1980), pp. 14–17.
Chapter 11

Numerical Integration – Single Step Methods
For historical reasons the approximate solution of initial value problems for ordinary differential equations is
called “Numerical Integration”. This chapter will introduce the most important class of numerical methods
for that purpose.
Contents
11.1 Initial-Value Problems (IVPs) for Ordinary Differential Equations (ODEs) . . . 757
11.1.1 Ordinary Differential Equations (ODEs) . . . . . . . . . . . . . . . . . . . . . 757
11.1.2 Mathematical Modeling with Ordinary Differential Equations: Examples . 759
11.1.3 Theory of Initial-Value-Problems (IVPs) . . . . . . . . . . . . . . . . . . . . . 764
11.1.4 Evolution Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 768
11.2 Introduction: Polygonal Approximation Methods . . . . . . . . . . . . . . . . . . 771
11.2.1 Explicit Euler method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 772
11.2.2 Implicit Euler method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 774
11.2.3 Implicit midpoint method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 776
11.3 General Single-Step Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 778
11.3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 778
11.3.2 (Asymptotic) Convergence of Single-Step Methods . . . . . . . . . . . . . . 782
11.4 Explicit Runge-Kutta Single-Step Methods (RKSSMs) . . . . . . . . . . . . . . . . 791
11.5 Adaptive Stepsize Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 798
11.5.1 The Need for Timestep Adaptation . . . . . . . . . . . . . . . . . . . . . . . . 798
11.5.2 Local-in-Time Stepsize Control . . . . . . . . . . . . . . . . . . . . . . . . . . 800
11.5.3 Embedded Runge-Kutta Methods . . . . . . . . . . . . . . . . . . . . . . . . 807
Supplementary literature. Some grasp of the meaning and theory of ordinary differential
equations (ODEs) is indispensable for understanding the construction and properties of numerical
methods. Relevant information can be found in [Str09, Sect. 5.6, 5.7, 6.5].
ẏ := dy/dt (t) = f(t, y(t)) ,   (ODE)
with
☞ a (continuous) right-hand-side function (r.h.s.) f : I × D → R^N of time t ∈ R and state y ∈ R^N,
☞ defined on a (finite) time interval I ⊂ R, and state space D, which is some subset of R^N: D ⊂ R^N, N ∈ N.
An ODE is called autonomous, if the right-hand-side function f does not depend on time: f = f(y), see
Def. 11.1.2.4 below.
In the context of mathematical modeling the state vector y ∈ R N is supposed to provide a complete (in the
sense of the model) description of a system. Then (ODE) models a finite-dimensional dynamical system.
Examples will be provided below, see Ex. 11.1.2.1, Ex. 11.1.2.5, and Ex. 11.1.2.7.
A solution of the ODE ẏ = f(t, y) with continuous right hand side function f is a continuously differentiable function “of time t” y : J ⊂ I → D, defined on an open interval J, for which ẏ(t) = f(t, y(t)) holds for all t ∈ J (≙ “pointwise”).
A solution describes a continuous trajectory in state space, a one-parameter family of states, parameter-
ized by time.
It goes without saying that smoothness of the right hand side function f is inherited by solutions of the
ODE:
§11.1.1.4 (Scalar autonomous ODE: solution via a primitive) We consider scalar ODEs, namely (ODE) in the case N = 1, and, in particular, ẏ = f(y) with f : D ⊂ R → R, D an interval.
We embark on formal calculations. Assume that f is continuous and f(y) ≠ 0 for all y ∈ D. Further, suppose that we know a primitive F : D → R of 1/f, that is, a function y ↦ F(y) satisfying dF/dy = 1/f on D. Then, by the chain rule, every solution y : I ⊂ R → R of ẏ = f(y) also solves
d/dt F(y(t)) = (1/f(y(t))) ẏ(t) = 1 ,  t ∈ I   ⇔   F(y(t)) = t − t0 for some t0 ∈ R .   (11.1.1.5)
We also know that F is strictly monotone and, thus, possesses an inverse function F⁻¹. Applying F⁻¹ to (11.1.1.5), we find y(t) = F⁻¹(t − t0).
This formula describes a one-parameter family of functions (t0 is the parameter), all of which provide a
solution of ẏ = f (y) on a suitable interval.
A particularly simple case is f(y) = λy + c, λ, c ∈ R, that is, the scalar ODE ẏ = λy + c. Following the steps outlined above, we calculate the solution
F(y) = (1/λ) log(λy + c)   ⇒   y(t) = (1/λ) ( e^{λ(t − t0)} − c ) ,   t ∈ R .   (11.1.1.7)
y
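A quick check by differentiation confirms (11.1.1.7): for y(t) = (e^{λ(t − t0)} − c)/λ we get
ẏ(t) = e^{λ(t − t0)} = λ y(t) + c = f(y(t)) .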
§11.1.1.8 (Linear ordinary differential equations) Now we take a look at the simplest class of ODEs,
which is also the most important.
Proof. We have to show that, if y, z : I → R N are two solutions of (11.1.1.10), then so are y + z and
αy for all α ∈ R. This is an immediate consequence of the linearity of the operations of differentiation and
matrix×vector multiplication.
✷
For the scalar case N = 1, (11.1.1.10) can be written as ẏ = a(t)y with a continuous function a : I → R. In this case, the chain rule immediately verifies that for fixed t0 ∈ I every function
y(t) = C exp( ∫_{t0}^{t} a(τ) dτ ) ,   C ∈ R ,   (11.1.1.12)
is a solution.
If the matrix A ∈ R^{N,N} does not depend on time, (11.1.1.10) is known as a linear ODE with constant coefficients: ẏ = Ay. In this case we can choose I = R, and the ODE can be solved by a diagonalization technique [Str09, Bemerkung 5.6.1], [NS02, Sect. 8.1]: if A = S diag(λ1, . . . , λN) S⁻¹ with a regular matrix S, we can rewrite ẏ = Ay in terms of z(t) := S⁻¹ y(t) as ż = diag(λ1, . . . , λN) z.
We get N decoupled scalar linear equations żℓ = λℓ zℓ, ℓ = 1, . . . , N. Returning to y we find that every solution y : R → R^N of ẏ = Ay can be written as
y(t) = S z(t) = S diag( e^{λ1 t}, . . . , e^{λN t} ) S⁻¹ w   for some w ∈ R^N .   (11.1.1.14)
y(t) = α y0 / ( β y0 + (α − β y0) exp(−α t) )   (11.1.2.3)
for all t ∈ R.
Note that by fixing the initial value y(0) we can single out a unique representative from the family of
solutions. This will turn out to be a general principle, see Section 11.1.3. y
An ODE of the form ẏ = f(y), that is, with a right hand side function that does not depend on time,
but only on state, is called autonomous.
For an autonomous ODE the right hand side function defines a vector field (“velocity field”) y 7→ f(y) on
state space.
EXAMPLE 11.1.2.5 (Predator-prey model [Ama83, Sect. 1.1],[HLW06, Sect. 1.1.1],[Han02, Ch. 60],
[DR08, Ex. 11.3]) We consider the following model from population dynamics:
Predators and prey coexist in an ecosystem. Without predators the population of prey would be gov-
erned by a simple exponential growth law. However, the growth rate of prey will decrease with increasing
numbers of predators and, eventually, become negative. Similar considerations apply to the predator
population and lead to an ODE model.
ODE-based model: autonomous Lotka-Volterra ODE:
u̇ = (α − βv) u ,   v̇ = (δu − γ) v   ↔   ẏ = f(y)   with   y = [u; v] ,   f(y) = [ (α − βv) u ; (δu − γ) v ] ,   (11.1.2.6)
(Fig. 395: vector field f of (11.1.2.6) in the (u, v)-plane; states are carried along by the velocity field f; the stationary point lies at u = γ/δ, v = α/β. Fig. 396: solution components u(t) = y1(t), v(t) = y2(t) for the initial value y0 := [u(0); v(0)] = [4; 2]. Fig. 397: solution curves of (11.1.2.6) in the (u, v)-phase plane with the stationary point marked. Parameter values for Figs. 395–397: α = 2, β = 1, δ = 1, γ = 1.)
y
EXAMPLE 11.1.2.7 (Heartbeat model → [Dea80, p. 655]) This example deals with a phenomenolog-
ical model from physiology. A model is called phenomenological, if it is entirely motivated by observations
without appealing to underlying mechanisms or first principles.
State of the heart described by the quantities:
l = l(t) ≙ length of muscle fiber ,
p = p(t) ≙ electro-chemical potential .
Phenomenological model:
l̇ = −( l³ − α l + p ) ,   ṗ = β l ,   (11.1.2.8)
Plots of vector fields for (11.1.2.8) and solutions for different choices of parameters are given next:
(Fig. 398, 399: phase flow and heartbeat functions l(t), p(t) according to the Zeeman model (11.1.2.8) for α = 3, β = 0.1. Fig. 400, 401: the same for α = 0.5, β = 0.1.)
EXAMPLE 11.1.2.9 (SIR model for spread of local epidemic [Het00]) The field of epidemiology tries to
understand the spread of contagious diseases in populations. It heavily relies on ODEs in its mathematical
modeling. This example presents a particularly simple model for an epidemic in a large, stable, isolated,
and vigorously mixing homogeneous population.
With respect to the disease we partition the population into three different groups and introduce time-
dependent variables for their fractions ∈ [0, 1]:
(I) S = S(t) ≙ fraction of susceptible persons, who can still contract the disease,
(II) I = I(t) ≙ fraction of infected/infectious persons, who can pass on the disease,
(III) R = R(t) ≙ fraction of recovered/removed persons, who are immune or have died.
These three quantities enter the SIR model named after the groups it considers. Besides, the model
involves two crucial model parameters, which have to be determined from data:
1. A parameter β > 0, whose value expresses the probability of transmission, and
2. a parameter r > 0, taking into account how quickly sick people recover or die.
With this notation the ODE underlying the SIR model can be stated as
Ṡ(t) = − βS(t) I (t) , İ (t) = βS(t) I (t) − rI (t) , Ṙ(t) = rI (t) . (11.1.2.10)
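In code, a right-hand side like (11.1.2.10) is typically supplied in procedural form. A minimal EIGEN-based sketch (type and variable names are our own choice, not taken from the lecture codes):

#include <Eigen/Dense>
// Sketch: right-hand side of the SIR ODE (11.1.2.10) as a functor;
// the state vector is y = [S, I, R]^T, and beta, r are the model parameters.
struct SIRRhs {
  double beta, r;
  Eigen::Vector3d operator()(const Eigen::Vector3d &y) const {
    const double S = y(0), I = y(1);
    return Eigen::Vector3d(-beta * S * I, beta * S * I - r * I, r * I);
  }
};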
(Fig. 402: evolution of an epidemic according to the SIR model (11.1.2.10) for β = 0.3, r = 0.1 and initial fractions S(0) = 0.99, I(0) = 0.01, R(0) = 0, plotted over non-dimensionalized time t.)
Note that in this case not all people end up infected: lim_{t→∞} S(t) > 0!
y
EXAMPLE 11.1.2.11 (Transient circuit simulation [Han02, Ch. 64]) Chapter 1 and Chapter 8 discuss
circuit analysis as a source of linear and non-linear systems of equations, see Ex. 2.1.0.3 and Ex. 8.1.0.1.
The former example admitted time-dependent currents and potentials, but dependence on time was con-
fined to be “sinusoidal”. This enabled us to switch to frequency domain, see (2.1.0.6), which gave us a
complex linear system of equations for the complex nodal potentials. Yet, this trick is only possible for
linear circuits. In the general case, circuits have to be modelled by ODEs connecting time-dependent
potentials and currents. This will be briefly explained now.
The approach is transient nodal analysis, cf. Ex. 2.1.0.3, based on the Kirchhoff current law (2.1.0.4),
which reads for the node • of the simple circuit drawn in Fig. 403
i_R(t) − i_L(t) − i_C(t) = 0 .   (11.1.2.12)
In addition we rely on known transient constitutive relations for the basic linear circuit elements:
resistor:   i_R(t) = R⁻¹ u_R(t) ,   (11.1.2.13)
capacitor:  i_C(t) = C du_C/dt (t) ,   (11.1.2.14)
coil:       u_L(t) = L di_L/dt (t) .   (11.1.2.15)
(Fig. 403: simple circuit with resistor R, capacitor C, coil L, voltage source U_s(t), and the node with unknown potential u(t).)
We assume that the source voltage U_s(t) is given. To apply nodal analysis to the circuit of Fig. 403 we differentiate (11.1.2.12) w.r.t. t,
di_R/dt (t) − di_L/dt (t) − di_C/dt (t) = 0 ,
and plug in the above constitutive relations for the circuit elements:
R⁻¹ du_R/dt (t) − L⁻¹ u_L(t) − C d²u_C/dt² (t) = 0 .
We continue following the policy of nodal analysis and express all voltages by potential differences between
nodes of the circuit.
For this simple circuit there is only one node with unknown potential, see Fig. 403. Its time-dependent potential will be denoted by u(t), and this is the unknown of the model, a function of time satisfying the ordinary differential equation
R⁻¹ ( U̇_s(t) − u̇(t) ) − L⁻¹ u(t) − C d²u/dt² (t) = 0 .
This is a 2nd-order ordinary differential equation (non-autonomous, since the given source voltage U_s(t) enters as an explicit function of time):
the attribute “2nd-order” refers to the occurrence of a second derivative with respect to time.
y
A generic initial value problem (IVP) for a first-order ordinary differential equation (ODE) (→ [Str09, Sect. 5.6], [DR08, Sect. 11.1]) can be stated as:
ẏ = f(t, y) ,   y(t0) = y0 .   (11.1.3.2)
The time interval I may be finite or infinite. Frequently, the extended state space is not specified, but as-
sumed to coincide with the maximal domain of definition of f. Sometimes, the model suggests constraints
on D, for instance, positivity of certain components that represent a density. y
§11.1.3.3 (IVPs for autonomous ODEs) Recall Def. 11.1.2.4: for an autonomous ODE ẏ = f(y), that is, the right-hand side f does not depend on time t.
Hence, for autonomous ODEs we have I = R and the right hand side function y 7→ f(y) can be regarded
as a stationary vector field (velocity field), see Fig. 395 or Fig. 398.
An important observation: If t 7→ y(t) is a solution of an autonomous ODE, then, for any τ ∈ R, also the
shifted function t 7→ y(t − τ ) is a solution.
Autonomous ODEs naturally arise when modeling time-invariant systems or phenomena. All examples for
Section 11.1.2 belong to this class. y
§11.1.3.5 (Autonomization: Conversion into autonomous ODE) In fact, autonomous ODEs already
represent the general case, because any ODE can be converted into an autonomous one:
The idea is to include time as an extra (N+1)-st component of an extended state vector z(t). This solution component has to grow linearly ⇔ its temporal derivative must be ≡ 1:
z(t) := [ y(t) ; t ] =: [ z′ ; z_{N+1} ] :   ẏ = f(t, y)   ↔   ż = g(z) ,   g(z) := [ f(z_{N+1}, z′) ; 1 ] .
This means ż_{N+1} = 1 and implies z_{N+1}(t) = t + t0, if t0 stands for the initial time in the original non-autonomous IVP.
➣ We restrict ourselves to autonomous ODEs in the remainder of this chapter. y
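This conversion can be written down directly in code. The following EIGEN-based helper is our own sketch (the name autonomize and the use of std::function are not taken from the lecture codes); it wraps a non-autonomous right-hand side f(t, y) into the autonomous right-hand side g(z) for the extended state z = [y; t]:

#include <Eigen/Dense>
#include <functional>
using Vec = Eigen::VectorXd;
// Wrap f(t, y) into g(z) with z = [y; t], cf. the formula above.
std::function<Vec(const Vec &)> autonomize(std::function<Vec(double, const Vec &)> f) {
  return [f](const Vec &z) -> Vec {
    const Eigen::Index N = z.size() - 1;   // last component of z stores the time
    Vec g(N + 1);
    g.head(N) = f(z(N), z.head(N));        // g' = f(z_{N+1}, z')
    g(N) = 1.0;                            // dz_{N+1}/dt = 1
    return g;
  };
}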
Remark 11.1.3.6 (From higher-order ODEs to first-order systems [DR08, Sect. 11.2])
An ordinary differential equation of order n ∈ N has the form y^(n) = f(t, y, ẏ, . . . , y^(n−1)). It can be recast as a first-order system for the extended state z(t) := [ y(t) ; ẏ(t) ; . . . ; y^(n−1)(t) ], namely ż = [ z2 ; . . . ; zn ; f(t, z1, . . . , zn) ].
Note that the extended system requires initial values y(t0), ẏ(t0), . . . , y^(n−1)(t0):
For ODEs of order n ∈ N well-posed initial value problems need to specify initial values for the first n − 1 derivatives.
§11.1.3.10 (Smoothness classes for right-hand side functions) Now we review results about existence
and uniqueness of solutions of initial value problems for first-order ODEs. These are surprisingly general
and do not impose severe constraints on right hand side functions. Some kind of smoothness of the right-hand side function f is required, nevertheless, and the following definitions describe it in detail.
The property of local Lipschitz continuity means that the function (t, y) 7→ f(t, y) has “locally finite slope”
in y. y
EXAMPLE 11.1.3.15 (A function that is not locally Lipschitz continuous [Str09, Bsp. 6.5.3]) The
meaning of local Lipschitz continuity is best explained by giving an example of a function that fails to
possess this property.
Consider the square root function t ↦ √t on the closed interval [0, 1]. Its slope at t = 0 is infinite and so it is not locally Lipschitz continuous on [0, 1].
However, if we consider the square root on the open interval ]0, 1[, then it is locally Lipschitz continuous
there. y
The next lemma gives a simple criterion for local Lipschitz continuity, which can be proved by the mean
value theorem, cf. the proof of Lemma 8.3.2.9.
If f and Dy f are continuous on the extended state space Ω, then f is locally Lipschitz continuous
(→ Def. 11.1.3.13).
✎ Notation: D_y f ≙ the derivative of f w.r.t. the state variable y, a Jacobian matrix ∈ R^{N,N} as defined in (8.3.2.8).
The following is the most important mathematical result in the theory of initial-value problems for ODEs:
Theorem 11.1.3.17. Theorem of Peano & Picard-Lindelöf [Ama83, Satz II(7.6)], [Str09,
Satz 6.5.1], [DR08, Thm. 11.10], [Han02, Thm. 73.1]
If the right hand side function f : Ω 7→ R N is locally Lipschitz continuous (→ Def. 11.1.3.13) then
for all initial conditions (t0 , y0 ) ∈ Ω the IVP
! Also note that the domain of definition/domain of existence J(t0, y0) of the solution usually depends on the initial values (t0, y0)!
Terminology: if J (t0 , y0 ) = I , I the maximal temporal domain of definition of f, we say that the solution
y : I 7→ RN is global.
Notation: For autonomous ODE we always have t0 = 0, and therefore we write J (y0 ) := J (0, y0 ). y
EXAMPLE 11.1.3.20 (“Explosion equation”: finite-time blow-up) Let us explain the still mysterious
“maximal domain of definition” in statement of Thm. 11.1.3.17. It is related to the fact that every solution
of an initial value problem (11.1.3.18) has its own largest possible time interval J (y0 ) ⊂ R on which it is
defined naturally.
As an example we consider the autonomous scalar (d = 1) initial value problem, modeling “explosive
growth” with a growth rate increasing linearly with the density:
ẏ = y2 , y(0) = y0 ∈ R . (11.1.3.21)
Its solution is y(t) = y0/(1 − y0 t), with maximal temporal domain of definition
J(y0) = ]−∞, 1/y0[ , if y0 > 0 ;   R , if y0 = 0 ;   ]1/y0, ∞[ , if y0 < 0 .
(Fig. 404: solutions y(t) of (11.1.3.21) for several initial values y0, e.g., y0 = −0.5 and y0 = −1.)
In this example, for y0 > 0 the solution experiences a blow-up in finite time and ceases to exist afterwards.
y
Supplementary literature. For other concise summaries of the theory of IVPs for ODEs refer
with locally Lipschitz continuous (→ Def. 11.1.3.13) right hand side f : D ⊂ R N → R N , N ∈ N, and
make the following assumption. A more general treatment is given in [DB02].
Now we return to the study of a generic ODE (ODE) instead of an IVP (11.1.3.2). We do this by temporarily
changing the perspective: we fix a “time of interest” t ∈ R \ {0} and follow all trajectories for the duration
t. This induces a mapping of points in state space:
➣ mapping Φ^t : D → D ,  y0 ↦ y(t) ,  where t ↦ y(t) is the solution of the IVP (11.1.3.18) .   (11.1.4.2)
This is a well-defined mapping of the state space into itself, by Thm. 11.1.3.17 and Ass. 11.1.4.1.
Now we may also let t vary, which spawns a family of mappings {Φ^t}_{t∈R} of the state space D into itself. However, it can also be viewed as a mapping with two arguments, a duration t and an initial state value y0!
Definition 11.1.4.3. Evolution operator/mapping
The mapping Φ : R × D → D, (t, y0) ↦ Φ^t y0 := y(t), where t ↦ y(t) ∈ C¹(R, R^N) is the unique (global) solution of the IVP ẏ = f(y), y(0) = y0, is the evolution operator/mapping for the autonomous ODE ẏ = f(y).
Note that t 7→ Φt y0 describes the solution of ẏ = f(y) for y(0) = y0 (a trajectory). Therefore, by virtue
of definition, we have
∂Φ/∂t (t, y) = f(Φ^t y) .   (11.1.4.4)
Let us repeat the different kinds of information contained in an evolution operator when viewed from differ-
ent angles:
t ↦ Φ^t y0 ,  y0 ∈ D fixed   ≙   a trajectory = solution of an IVP ,
y ↦ Φ^t y ,  t ∈ R fixed   ≙   a mapping of the state space onto itself .
EXAMPLE 11.1.4.5 (Evolution operator for Lotka-Volterra ODE (11.1.2.6)) For N = 2 the action of an
evolution operator can be visualized by tracking the movement of point sets in state space. Here this is
done for the Lotka-Volterra ODE
u̇ = (α − βv) u ,   v̇ = (δu − γ) v   ↔   ẏ = f(y)   with   y = [u; v] ,   f(y) = [ (α − βv) u ; (δu − γ) v ] ,   (11.1.2.6)
with positive model parameters α, β, γ, δ > 0.
(Fig. 405: trajectories t ↦ Φ^t y0 in the (u = y1, v = y2)-phase plane. Fig. 406: flow map for the Lotka-Volterra system with α = 2, β = γ = δ = 1; the state mapping y ↦ Φ^t y is visualized by the images of a patch X of initial states at times t = 0, 0.5, 1, 1.5, 2, 3; axes u (prey) and v (predator).)
Think of y ∈ R2 7→ f(y) ∈ R2 as the velocity of the surface of a fluid. Specks of floating dust will be
carried along by the fluid, patches of dust covering parts of the surface will move and deform over time.
This can serve as a “mental image” of Φ. y
Given an evolution operator, we can recover the right-hand side function f of the underlying autonomous ODE as f(y) = ∂Φ/∂t (0, y): there is a one-to-one relationship between ODEs and their evolution operators, and the latter are the key objects behind an ODE.
Understanding the concept of evolution operators is indispensable for numerical integration, that is, the construction of numerical methods for the solution of IVPs for ODEs:
Remark 11.1.4.6 (Group property of autonomous evolutions) Under Ass. 11.1.4.1 the
evolution operator gives rise to a group of mappings D 7→ D:
Φ^s ∘ Φ^t = Φ^{s+t} ,   Φ^{−t} ∘ Φ^t = Id   ∀ s, t ∈ R .   (11.1.4.7)
This is a consequence of the uniqueness theorem Thm. 11.1.3.17. It is also intuitive: following an evolution
up to time t and then for some more time s leads us to the same final state as observing it for the whole
time s + t. y
Hint. d/dy { y ↦ arctan(y) } = 1/(1 + y²)
has at least two solutions in the state space R0+ according to the following definition.
A solution of the ODE ẏ = f(t, y) with continuous right hand side function f is a continuously differentiable function “of time t” y : J ⊂ I → D, defined on an open interval J, for which ẏ(t) = f(t, y(t)) holds for all t ∈ J (≙ “pointwise”).
How can this be reconciled with the assertion of the main theorem?
If the right hand side function f : Ω̂ 7→ R N is locally Lipschitz continuous (→ Def. 11.1.3.13)
then for all initial conditions (t0 , y0 ) ∈ Ω̂ the IVP
Hint. Consider the function y(t) = ½ t².
(Q11.1.4.8.E) For the autonomous scalar ODE ẏ = sin(1/y) − 2 answer the following questions
Hint. Make use of the geometrically intuitive statement: If a differentiable function f : [t0, T] → R satisfies ḟ(t) ≤ C for all t0 ≤ t ≤ T, then f(t) ≤ f(t0) + C(t − t0).
(Q11.1.4.8.F) Rewrite the matrix differential equation Ẏ(t) = AY(t) for Y : R → R n,n , n ∈ N, in the
standard form ẏ = f(y) with right-hand-side function f : R N → R N and suitable N ∈ N.
(Q11.1.4.8.G) What “ingredients” does it take to define an initial value problem for an ODE?
△
Video tutorial for Section 11.2: Introduction: Polygonal Approximation Methods: (17 minutes)
Download link, tablet notes
In this section we will see the first simple methods for the numerical integration (= solution) of initial-value
problems (IVPs). We target an initial value problem (11.1.3.2) for a first-order ordinary differential equation
As usual, the right hand side function f : D ⊂ R^N → R^N, N ∈ N, may be given only in procedural form, for instance, in a C++ code as a functor object providing an evaluation operator
Eigen::VectorXd operator()(double t, const Eigen::VectorXd &y) const;
cf. Rem. 5.1.0.9. Occasionally the evaluation of f may involve costly computations.
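For instance, such a functor could look as follows; this is a toy right-hand side of our own choosing (f(t, y) = cos(t)·y), not an example from the lecture codes:

#include <cmath>
#include <Eigen/Dense>
// Minimal illustration of a right-hand side in procedural form, with exactly
// the evaluation operator shown above.
struct CosineDecayRhs {
  Eigen::VectorXd operator()(double t, const Eigen::VectorXd &y) const {
    return std::cos(t) * y;   // in general this evaluation may be costly
  }
};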
§11.2.0.1 (Objectives of numerical integration) Two basic tasks can be identified in the field of nu-
merical integration, which aims for the approximate solution of initial value problems for ODEs (Please
distinguish from “numerical quadrature”, see Chapter 7.):
(I) Given initial time t0, final time T, and initial state y0, compute an approximation of y(T), where t ↦ y(t) is the solution of (11.1.3.2). A corresponding function in C++ could look like
State solveivp(double t0, double T, State y0);
Here State is a type providing a fixed-size or variable-size vector ∈ R^N, e.g.,
using State = Eigen::Matrix<double, N, 1>;
(II) Output an approximate solution t ↦ yh(t) of (11.1.3.2) on [t0, T] up to final time T ≠ t0 for “all times” t ∈ [t0, T] (in practice, of course, only for finitely many times t0 < t1 < t2 < · · · < t_{M−1} < t_M = T, M ∈ N, consecutively):
std::vector<State> solveivp(State y0, const std::vector<double> &tvec);
This is the “plot solution” task, because we need to know y(t) for many times, if we want to create
a faithful plot of t 7→ y(t).
(Fig. 407: y versus t over the mesh points t0, t1, t2, t3, t4.)
§11.2.0.2 (Temporal mesh) As in Section 6.6.1 the polygonal approximation in this section will be based on a (temporal) mesh with M + 1 mesh points (→ § 6.6.0.1)
M := { t0 < t1 < t2 < · · · < t_{M−1} < t_M := T } ,
covering the time interval of interest between initial time t0 and final time T > t0. We assume that the interval of interest is contained in the domain of definition of the solution of the IVP: [t0, T] ⊂ J(t0, y0). y
The next three sections will derive three simple mesh-based numerical integration methods, each in two
ways:
(i) Based on geometric reasoning we interpret ẏ as the slope/direction of a tangent line.
(ii) In the spirit of numerical differentiation § 5.2.3.16, we replace the derivative ẏ with a mesh-based
difference quotient.
ẏ = f(t, y) := y² + t²   ➤   N = 1 ,  I, D = R⁺ .   (11.2.1.2)
(Fig. 408: tangent field (t, y) ↦ (1/√(f(t, y)² + 1)) [1, f(t, y)]ᵀ of (11.2.1.2). Fig. 409: solution curves.)
The solution curves run tangentially to the tangent field in each point of the extended state space. y
We use the temporal mesh
M := { t_j := j/5 : j = 0, . . . , 5 } ,
and solve an IVP for the Riccati differential equation, see Ex. 11.2.1.1,
ẏ = y² + t² .   (11.2.1.2)
Here: y0 = 1/2, t0 = 0, T = 1.
(Fig. 410: — ≙ “Euler polygon” for uniform timestep h = 0.2 in the (t, y)-plane.)
§11.2.1.4 (Recursion for the explicit Euler method) We translate the graphical construction of Fig. 410 into a formula. Given a temporal mesh M := {t0 < t1 < t2 < · · · < t_{M−1} < t_M} and applied to a general IVP (11.1.3.2), the explicit Euler method generates a sequence (y_k)_{k=0}^{M} of states by the recursion
y_{k+1} = y_k + h_k f(t_k, y_k) ,   h_k := t_{k+1} − t_k ,   k = 0, . . . , M − 1 .   (11.2.1.5)
The state yk is supposed to approximate y(tk ), where t 7→ y(t) is the exact solution of the IVP (11.1.3.2).
y
ẏ(t_k) ≈ ( y(t_k + h_k) − y(t_k) ) / h_k

ẏ = f(t, y)   ←→   ( y_{k+1} − y_k ) / h_k = f(t_k, y_k) ,   k = 0, . . . , M − 1 .   (11.2.1.7)

Why a “forward difference quotient”? Because the difference quotient in (11.2.1.7) relies on the “future state” y_{k+1} ≈ y(t_{k+1}) to approximate ẏ(t_k).
In general, difference schemes follow a simple policy for the discretization of differential equations: replace all derivatives by difference quotients connecting solution values on a set of discrete points (the mesh). y
Remark 11.2.1.8 (Output of explicit Euler method) To begin with, the explicit Euler recursion (11.2.1.5) produces a sequence y0, . . . , y_M of states. How does it deliver on the tasks (I) and (II) stated in § 11.2.0.1? By “geometric insight” we expect
y_k ≈ y(t_k) .
(As usual, we use the notation t ↦ y(t) for the exact solution of an IVP.)
Now let us discuss to what extent the explicit Euler method delivers on the tasks formulated in § 11.2.0.1.
Task (I): Easy, because y_M already provides an approximation of y(T).
Task (II): The trajectory t ↦ y(t) is approximated by the piecewise linear function (“Euler polygon”)
y_h : [t0, t_M] → R^N ,   y_h(t) := ( (t_{k+1} − t)/(t_{k+1} − t_k) ) y_k + ( (t − t_k)/(t_{k+1} − t_k) ) y_{k+1}   for t ∈ [t_k, t_{k+1}] ,   (11.2.1.9)
see Fig. 411. This function can easily be sampled on any grid of [t0, t_M]. In fact, it is the M-piecewise linear interpolant of the data points (t_k, y_k), k = 0, . . . , M, see Section 5.3.2.
The same considerations apply to the methods discussed in the next two sections and will not be repeated
there. y
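The explicit Euler recursion (11.2.1.5) can be coded in a few lines. The following EIGEN-based integrator is a minimal sketch of our own (function and type names are not taken from the lecture codes):

#include <vector>
#include <Eigen/Dense>
// Explicit Euler integrator on a given temporal mesh t[0] < ... < t[M].
// The functor f must provide Eigen::VectorXd operator()(double, const Eigen::VectorXd&) const.
template <typename Functor>
std::vector<Eigen::VectorXd> explicitEuler(Functor &&f, const std::vector<double> &t,
                                           Eigen::VectorXd y0) {
  std::vector<Eigen::VectorXd> y{y0};
  for (std::size_t k = 0; k + 1 < t.size(); ++k) {
    const double h = t[k + 1] - t[k];               // local timestep h_k
    y.push_back(y.back() + h * f(t[k], y.back()));  // recursion (11.2.1.5)
  }
  return y;  // states y_0, ..., y_M approximating y(t_0), ..., y(t_M)
}

The returned sequence of states can then be post-processed into the Euler polygon (11.2.1.9) if the whole trajectory is needed.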
ẏ = f(t, y)   ←→   ( y_{k+1} − y_k ) / h_k = f(t_{k+1}, y_{k+1}) ,   k = 0, . . . , M − 1 .   (11.2.2.1)
(backward difference quotient)
Note: (11.2.2.2) requires solving a (possibly non-linear) system of equations to obtain yk+1 !
(➤ Terminology “implicit”)
y
Geometry of the implicit Euler method (Fig. 412): the solution through (t0, y0) is approximated on [t0, t1] by
• a straight line through (t0, y0)
• with slope f(t1, y1).
(In Fig. 412: — ≙ trajectory through (t0, y0), — ≙ trajectory through (t1, y1), — ≙ tangent to the latter at (t1, y1).)
Remark 11.2.2.3 (Feasibility of implicit Euler timestepping) The issue is whether (11.2.2.2) is well defined, that is, whether we can solve it for y_{k+1} and whether this solution is unique. The intuition is that for small timestep size h > 0 the right hand side of (11.2.2.2) is a “small perturbation of the identity”.
Let us give a formal argument. Consider an autonomous ODE ẏ = f(y), assume a continuously differ-
entiable right hand side function f, f ∈ C1 ( D, R N ), and regard (11.2.2.2) as an h-dependent non-linear
system of equations:
To investigate the solvability of this non-linear equation we start with an observation about a partial deriva-
tive of G:
dG/dz (h, z) = I − h D_y f(t_{k+1}, z)   ⇒   dG/dz (0, z) = I .
In addition, G (0, yk ) = 0. Next, recall the implicit function theorem [Str09, Thm. 7.8.1]:
If the Jacobian ∂G/∂y (p0) ∈ R^{ℓ,ℓ} is invertible, then there is an open neighborhood U of x0 ∈ R^k and a continuously differentiable function g : U → R^ℓ such that g(x0) = y0 and G(x, g(x)) = 0 for all x ∈ U.
For sufficiently small | h| it permits us to conclude that the equation G ( h, z) = 0 defines a continuous
function g = g( h) with g(0) = yk .
For sufficiently small h > 0 the equation (11.2.2.2) has a unique solution yk+1 .
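In code, one implicit Euler step amounts to an (approximate) solution of this fixed-point equation. The following sketch is our own simplification (not the lecture code): it uses the plain fixed-point iteration y1 ← y0 + h·f(y1), which converges only if h times the Lipschitz constant of f is below 1; in practice, and in particular for stiff problems, Newton's method would be used instead.

#include <utility>
#include <Eigen/Dense>
// One implicit Euler step (11.2.2.2) for an autonomous ODE: solve y1 = y0 + h*f(y1).
template <typename Functor>
Eigen::VectorXd implicitEulerStep(Functor &&f, double h, const Eigen::VectorXd &y0,
                                  double rtol = 1e-10, int maxit = 100) {
  Eigen::VectorXd y1 = y0;                       // initial guess: previous state
  for (int i = 0; i < maxit; ++i) {
    Eigen::VectorXd y_new = y0 + h * f(y1);      // fixed-point map
    const bool done = (y_new - y1).norm() <= rtol * y_new.norm();
    y1 = std::move(y_new);
    if (done) break;                             // fixed point reached up to rtol
  }
  return y1;                                     // may not have converged for large h
}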
ẏ(t) ≈ ( y(t + h) − y(t − h) ) / (2h) ,   h > 0 .   (11.2.3.1)
The idea is to apply this formula in t = ½(t_k + t_{k+1}) with h = h_k/2, which transforms the ODE into
ẏ = f(t, y)   ←→   ( y_{k+1} − y_k ) / h_k = f( ½(t_k + t_{k+1}) , y_h(½(t_k + t_{k+1})) ) ,   k = 0, . . . , M − 1 .   (11.2.3.2)
The trouble is that the value y_h(½(t_k + t_{k+1})) does not seem to be available, unless we recall that the approximate trajectory t ↦ y_h(t) is supposed to be piecewise linear, which implies y_h(½(t_k + t_{k+1})) = ½( y_h(t_k) + y_h(t_{k+1}) ). This gives the recursion formula for the implicit midpoint method in analogy to (11.2.1.5) and (11.2.2.2):
y_{k+1} = y_k + h_k f( ½(t_k + t_{k+1}) , ½(y_k + y_{k+1}) ) ,   k = 0, . . . , M − 1 ,   (11.2.3.3)
ẏ = λy , y(0) = 1
on the interval [0, 1]. We use M ∈ N equidistant steps of the explicit Euler method to compute an
approximation y M for y(1).
we know that
∑_{ℓ=1}^{N} f_ℓ(y) = 0   ∀ y ∈ D .
• Show that the sum of the components of every solution t 7→ y(t) is constant in time.
• Show that the sums of the components of the vectors y0 , y1 , y2 , . . . generated by either the explicit
Euler method, the implicit Euler method, or the implicit midpoint method, all applied to solve some
IVP for (*), are the same for all vectors yk .
(Q11.2.3.4.C) We consider the implicit Euler method for the scalar autonomous “explosion ODE” ẏ = y². Give an explicit formula for y_{k+1} in terms of y_k and the timestep size h_k > 0. Specify potentially necessary constraints on the size of h_k.
The defining equation for the recursion of the implicit Euler method (on some temporal mesh) applied to the ODE ẏ = f(t, y) is
y_{k+1} :   y_{k+1} = y_k + h_k f(t_{k+1}, y_{k+1}) .   (11.2.2.2)
(Q11.2.3.4.D) The recursion of the implicit midpoint rule for the ODE ẏ = f(t, y) is
y_{k+1} :   y_{k+1} = y_k + h_k f( ½(t_k + t_{k+1}) , ½(y_k + y_{k+1}) ) .
Give an explicit form of this recursion for the linear ODE ẏ = A(t)y, where A : R → R N,N is a matrix-
valued function. When will this recursion break down?
(Q11.2.3.4.E) For a twice continuously differentiable function f : I ⊂ R → R^N we can use the second symmetric difference quotient as an approximation of the second derivative f″(x), x ∈ I:
( f(x + h) − 2 f(x) + f(x − h) ) / h²  ≈  f″(x)   for |h| ≪ 1 .
Based on this approximation propose an explicit finite-difference timestepping scheme on a uniform temporal mesh for the second-order ODE ÿ = f(y).
(Q11.2.3.4.F) Formulate the equation that defines that single-step method for the IVP ẏ = f(t, y),
y(t0 ) = y0 , that arises from the difference quotient approximation
ẏ = f(t, y)   →   ( y_{k+1} − y_k ) / h_k ≈ ½ ( f(t_k, y(t_k)) + f(t_{k+1}, y(t_{k+1})) ) ,   h_k := t_{k+1} − t_k .
A temporal mesh M := {t0 < t1 < t2 < · · · < t M−1 < t M := T } can be taken for granted.
△
Video tutorial for Section 11.3: General Single-Step Methods: (14 minutes) Download link,
tablet notes
Now we fit the numerical schemes introduced in the previous section into a more general class of methods
for the solution of (autonomous) initial value problems (11.1.3.18) for ODEs. Throughout we assume that
all times considered belong to the domain of definition of the unique solution t → y(t) of (11.1.3.18), that
is, for T > 0 we take for granted [0, T ] ⊂ J (y0 ) (temporal domain of definition of the solution of an IVP is
explained in § 11.1.3.19).
11.3.1 Definition
§11.3.1.1 (Discrete evolution operators) From Section 11.2.1 and Section 11.2.2 recall the two Euler
methods for an autonomous ODE ẏ = f(y):
explicit Euler:   y_{k+1} = y_k + h_k f(y_k) ,
implicit Euler:   y_{k+1} :  y_{k+1} = y_k + h_k f(y_{k+1}) ,
with h_k := t_{k+1} − t_k .
If y0 is the initial value, then y1 := Ψ( h, y0 ) can be regarded as an approximation of y( h), the value
returned by the evolution operator Φ (→ Def. 11.1.4.3) for ẏ = f(y) applied to y0 over the period h.
y_1 = Ψ(h, y_0)   ←→   y(h) = Φ^h y_0   ➣   Ψ(h, y) ≈ Φ^h y ,   (11.3.1.3)
In a sense the polygonal approximation methods are based on approximations for the evolution operator
associated with the ODE.
This is what every single step method does: it tries to approximate the evolution operator Φ for an ODE
by a mapping Ψ of the kind as described in (11.3.1.2).
Remark 11.3.1.4 (Discretization) The adjective “discrete” used above designates (components of) meth-
ods that attempt to approximate the solution of an IVP by a sequence of finitely many states. “Discretiza-
tion” is the process of converting an ODE into a discrete model. This parlance is adopted for all procedures
that reduce a “continuous model” involving ordinary or partial differential equations to a form with a finite
number of unknowns. y
Above we identified the discrete evolutions underlying the polygonal approximation methods. Vice versa,
Definition 11.3.1.5. Single step method (for autonomous ODE) → [QSS00, Def. 11.2]
Given a discrete evolution Ψ : I × D → R^N, the recursion
y_{k+1} := Ψ(h_k, y_k) ,   h_k := t_{k+1} − t_k ,   k = 0, . . . , M − 1 ,
on a temporal mesh M := {0 = t0 < t1 < · · · < t_M = T} defines a single-step method (SSM) for the autonomous IVP ẏ = f(y), y(0) = y0 on the interval [0, T].
☞ In a sense, a single step method defined through its associated discrete evolution does not ap-
proximate a concrete initial value problem, but tries to approximate an ODE in the form of its
evolution operator.
In C++ a discrete evolution operator can be incarnated by a functor type offering an evaluation operator
State operator()(double h, const State &y) const;
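As a concrete instance, the discrete evolution of the explicit Euler method for an autonomous ODE can be wrapped in such a functor. The following is a minimal sketch of our own (names not taken from the lecture codes), with State simply taken as Eigen::VectorXd:

#include <Eigen/Dense>
using State = Eigen::VectorXd;
// Discrete evolution Psi of the explicit Euler method for y' = f(y).
template <typename Functor>
struct ExplicitEulerEvolution {
  Functor f;   // right-hand side y -> f(y)
  State operator()(double h, const State &y) const { return y + h * f(y); }
};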
§11.3.1.8 (Consistent single step methods) Now we state a first quantification of the goal that the
“discrete evolution should be an approximation of the evolution operator”: Ψ ≈ Φ, cf. (11.3.1.3). We want
the discrete evolution Ψ to inherit key properties of the evolution operator Φ. One such property is
d/dt Φ^t y |_{t=0} = f(y)   ∀ y ∈ D .   (11.3.1.9)
Compliance of Ψ with (11.3.1.9) is expressed through the property of consistency, which, roughly speaking, demands that a viable discrete evolution operator is structurally similar to that of the explicit Euler method (11.2.1.5):
Ψ^h y = y + h ψ(h, y)   with   ψ : I × D → R^N continuous ,   ψ(0, y) = f(y) .   (11.3.1.11)
Differentiating h 7→ Ψh y relying on the product rule confirms that (11.3.1.9) remains true for Ψ instead
of Φ.
A single step method according to Def. 11.3.1.5 based on a discrete evolution of the form (11.3.1.11)
is called consistent with the ODE ẏ = f(y).
EXAMPLE 11.3.1.13 (Consistency of implicit midpoint method) The discrete evolution Ψ and, hence,
the function ψ = ψ( h, y) for the implicit midpoint method are defined only implicitly, of course. Thus,
consistency cannot immediately be seen from a formula for ψ.
We examine consistency of the implicit midpoint method for the autonomous ODE ẏ = f(y). The corresponding discrete evolution Ψ is defined by
Ψ^h y = y + h f( ½ (y + Ψ^h y) ) ,   h ∈ R, |h| “sufficiently small”,  y ∈ D .   (11.3.1.14)
Assume that
• the right hand side function f : D ⊂ R N → R N is locally Lipschitz continuous, f ∈ C0 ( D ),
• and that |h| is “sufficiently small” to guarantee the existence of a solution Ψ^h y of (11.3.1.14) as explained in Rem. 11.2.2.3.
Then we infer from the implicit function theorem Thm. 11.2.2.4 that the solution Ψ^h y of (11.3.1.14) will continuously depend on h: h ↦ Ψ^h y ∈ C⁰(]−δ, δ[, R^N) for small δ > 0. Knowing this, we plug (11.3.1.14) into itself and obtain
Ψ^h y = y + h f( ½(y + Ψ^h y) ) = y + h f( y + ½ h f( ½(y + Ψ^h y) ) ) =: y + h ψ(h, y) .
We repeat that by the implicit function theorem Thm. 11.2.2.4, Ψ^h y depends continuously on h and y. This means that ψ(h, y) has the desired properties; in particular, ψ(0, y) = f(y) is clear. y
Remark 11.3.1.15 (Notation for single step methods) Many authors specify a single step method by
writing down the first step for a general stepsize h
y_1 = y_0 + h f( ½(y_0 + y_1) ) .
Actually, this fixes the underlying discrete evolution. Also this course will sometimes adopt this practice. y
§11.3.1.16 (Output of single step methods) Here we resume and continue the discussion of
Rem. 11.2.1.8 for general single step methods according to Def. 11.3.1.5. Assuming unique solvability
of the systems of equations faced in each step of an implicit method, every single step method based on
a mesh M = {0 = t0 < t1 < · · · < t M := T } produces a finite sequence (y0 , y1 , . . . , y M ) of states,
where the first agrees with the initial state y0 .
We expect that the states provide a pointwise approximation of the solution trajectory t → y(t):
yk ≈ y(tk ) , k = 1, . . . , M .
Thus task (I) from § 11.2.0.1, computing an approximation for y( T ), is again easy: output y M as an
approximation of y( T ).
Task (II) from § 11.2.0.1, computing the solution trajectory, requires interpolation of the data points (tk , yk )
using some of the techniques presented in Chapter 5. The natural option is M-piecewise polynomial
interpolation, generalizing the polygonal approximation (11.2.1.9) used in Section 11.2.
Note that from the ODE ẏ = f(y) the derivatives ẏh (tk ) = f(yk ) are available without any further
approximation. This facilitates cubic Hermite interpolation (→ Def. 5.3.3.1), which yields
y_h ∈ C¹([0, T]) :   y_h|_{[t_{k−1}, t_k]} ∈ P_3 ,   y_h(t_k) = y_k ,   dy_h/dt (t_k) = f(y_k) .
Summing up, an approximate trajectory t 7→ yh (t) is built in two stages:
(i) Compute sequence (yk )k by running the single step method.
(ii) Post-process the obtained sequence, usually by applying interpolation, to get yh .
y
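Stage (i) can be realized by one generic driver that works for any discrete evolution with the interface shown above; the following is a minimal sketch of our own (names not taken from the lecture codes):

#include <vector>
#include <Eigen/Dense>
using State = Eigen::VectorXd;
// Generic single-step timestepping on a mesh {t_0 < ... < t_M}, driven by a
// discrete evolution Psi with interface State(double h, const State&).
template <typename DiscEvl>
std::vector<State> timestep(DiscEvl &&Psi, const std::vector<double> &t, State y0) {
  std::vector<State> y{y0};
  for (std::size_t k = 0; k + 1 < t.size(); ++k)
    y.push_back(Psi(t[k + 1] - t[k], y.back()));   // y_{k+1} = Psi(h_k, y_k)
  return y;   // stage (ii), interpolation, can then be applied to (t_k, y_k)
}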
Review question(s) 11.3.1.17 (General single-step methods)
(Q11.3.1.17.A) Explain the concepts
• evolution operator and
• discrete evolution operator
in connection with the numerical integration of initial-value problems for the ODE ẏ = f(y),
f : D ⊂ R N 7→ R N .
(Q11.3.1.17.B) [Single-step methods and numerical quadrature] There is a connection between numer-
ical integration (the design and analysis of numerical methods for the solution of initial-value problems
for ODEs) and numerical quadrature (study of numerical methods for the evaluation of integrals).
• Explain, how a class of single-step methods for the solution of scalar initial-value problems
ẏ = f (t, y) , y(t0 ) = y0 ∈ R ,
Rb
can be used for the approximate evaluation of integrals a ϕ(τ ) dτ , ϕ : [ a, b] → R.
• If the considered single-step methods are of order p, what does this mean for the induced quadrature method?
• Which quadrature formula does the implicit midpoint method yield?
(Q11.3.1.17.C) [Adjoint single-step method] Let a single-step method for the autonomous ODE
ẏ = f(y), f : D ⊂ R^N → R^N, be defined by its discrete evolution operator Ψ : I × D ↦ D. Then the adjoint single-step method is spawned by the discrete evolution operator Ψ̃ : I × D ↦ D defined according to
Ψ̃^h y := ( Ψ^{−h} )^{−1} y ,   y ∈ D ,  h ∈ R sufficiently small .
y1 = y0 + hf(y0 ) .
y1 : y1 = y0 + hf(y1 ) .
y1 :  y1 = y0 + h f( ½(y0 + y1) ) .
For which methods does the associated discrete evolution operator Ψ : [−δ, δ] × D → D, δ > 0 suffi-
ciently small, satisfy
Try to find a simple (scalar) counterexample, if you think that a method does not have property
(11.3.1.18).
△
Of course, the accuracy of the solution sequence (yk )k obtained by a particular single-step method (→
Def. 11.3.1.5) is a central concern. This motivates studying the dependence of suitable norms of the
so-called discretization error on the choice of the temporal mesh M := {0 = t0 < t1 < · · · < t M = T }.
§11.3.2.1 (Discretization error of single step methods) Approximation errors in numerical integration
are also called discretization errors, cf. Rem. 11.3.1.4.
Depending on the objective of numerical integration as stated in § 11.2.0.1 different (norms of) discretiza-
tion errors are of interest:
(I) If only the solution at final time T is sought, the relevant norm of the discretization error is
ε_M := ‖y(T) − y_M‖ ,
(III) Between (I) and (II) is the pointwise discretization error, which is the sequence (a so-called grid function)
e : M → D ,   e_k := y(t_k) − y_k ,   k = 0, . . . , M .   (11.3.2.2)
In this case one usually examines the maximum error in the mesh points, max_{k=0,...,M} ‖e_k‖, where ‖·‖ is a suitable vector norm on R^N, customarily the Euclidean vector norm.
y
§11.3.2.3 (Asymptotic convergence of single step methods) Once the discrete evolution Ψ associated with the ODE ẏ = f(y) is specified, the single step method according to Def. 11.3.1.5 is fixed: the only way to control the accuracy of the solution y_M or t ↦ y_h(t) is through the selection of the mesh M = {0 = t0 < t1 < · · · < t_M = T}.
Hence we study convergence of single step methods for families of meshes {Mℓ } and track the decay of
(a norm) of the discretization error (→ § 11.3.2.1) as a function of the number M := ♯M of mesh points.
In other words, we examine h-convergence. Convergence through mesh refinement is discussed for
piecewise polynomial interpolation in Section 6.6.1 and for composite numerical quadrature in Section 7.5.
When investigating asymptotic convergence of single step methods we often resort to families of equidis-
tant meshes of [0, T ]:
M_M := { t_k := (k/M) · T : k = 0, . . . , M } .   (11.3.2.4)
We also call this the use of uniform timesteps of size h := T/M. y
✦ We apply explicit and implicit Euler methods (11.2.1.5)/(11.2.2.2) with uniform timestep h = 1/M,
M ∈ {5, 10, 20, 40, 80, 160, 320, 640}.
✦ Monitored: Error at final time E( h) := |y(1) − y M |
We are mainly interested in the qualitative nature of the asymptotic convergence as h → 0 in the sense
of the types of convergence introduced in Def. 6.2.2.7 with N there replaced with h⁻¹. Abbreviating some error norm with E = E(h), recall the classification of asymptotic convergence from Def. 6.2.2.7:
(Fig. 414, 415: error (Euclidean norm) at final time versus timestep h for the explicit and the implicit Euler method, λ = 1, 3, 6, 9; in both cases the error behaves like O(h).)
This matches our expectations, because, as we see from (11.2.1.7) and (11.2.2.1), both Euler methods
can be introduced via an approximation of ẏ by a one-sided difference quotient, which offers an O( h)
approximation of the derivative as h → 0.
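This O(h) behavior can be reproduced in a few lines of self-contained code. The following check is ours (the experiment above may use a different IVP): for the linear model problem ẏ = λy, y(0) = 1 on [0, 1], the explicit Euler method with uniform timestep h = 1/M gives y_M = (1 + λ/M)^M, and the empirical rate log2(E(h)/E(h/2)) tends to 1.

#include <cmath>
#include <iostream>
int main() {
  const double lambda = 1.0, yT = std::exp(lambda);   // exact value y(1)
  double prev_err = 0.0;
  for (int M = 20; M <= 1280; M *= 2) {
    const double err = std::abs(yT - std::pow(1.0 + lambda / M, M));  // E(h), h = 1/M
    if (prev_err > 0.0)
      std::cout << "M = " << M << "   rate ~ " << std::log2(prev_err / err) << "\n";
    prev_err = err;
  }
  return 0;
}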
However, polygonal approximation methods can do better: we study the convergence of the implicit midpoint method (11.2.3.3) in the above setting and observe algebraic convergence O(h²), that is, with order/rate 2 for h → 0.
(Fig. 416: error (Euclidean norm) versus timestep h for the implicit midpoint method, λ = 1, 2, 5, 10, compared with O(h²).)
Also this is expected, because symmetric difference quotients of width h offer an O(h²)-approximation of the derivative for h → 0.
Parlance: Based on the observed rate of algebraic convergence, the two Euler methods are said to “con-
verge with first order”, whereas the implicit midpoint method is called “second-order convergent”.
y
The observations made for polygonal timestepping methods reflect a general pattern:
Then conventional single step methods (→ Def. 11.3.1.5) will enjoy asymptotic algebraic convergence:
there is a p ∈ N such that the sequence (y_k)_k generated by the single step method for ẏ = f(t, y) on a mesh M := {t0 < t1 < · · · < t_M = T} satisfies
max_{k} ‖y_k − y(t_k)‖ ≤ C h^p   for   h := max_{k} h_k → 0 ,   (11.3.2.7)
with C > 0 independent of M.
The maximal integer p ∈ N for which (11.3.2.7) holds for a single step method when applied to an
ODE with (sufficiently) smooth right hand side, is called the order of the method.
As in the case of quadrature rules (→ Def. 7.4.1.1) their order is the principal intrinsic indicator for the
“quality” of a single step method.
§11.3.2.9 (Convergence analysis for the explicit Euler method [Han02, Ch. 74]) We consider the
simplest single-step method, namely the explicit Euler method (11.2.1.5) on a mesh M := {0 = t0 <
t1 < · · · < t M = T } for a generic autonomous IVP
ẏ = f(y) , y(0) = y0 ∈ D ,
cf. Def. 11.1.3.13, and C1 exact solution t 7→ y(t). Throughout we assume that solutions of ẏ = f(y) are
defined on [0, T ] for all initial states y0 ∈ D.
y_{k+1} = y_k + h_k f(y_k) ,   h_k := t_{k+1} − t_k ,   k = 0, . . . , M − 1 .   (11.2.1.5)
(Fig. 417: exact solution trajectory t ↦ y(t) in the state space D.)
In numerical analysis one studies the behavior of the error e_k := y(t_k) − y_k, cf. (11.3.2.2).
The approach to estimate kek k follows a fundamental policy that comprises three key steps. To explain
them we rely on the abstract concepts of the
• evolution operator Φ associated with the ODE ẏ = f(y) (→ Def. 11.1.4.3) and
• discrete evolution operator Ψ defining the explicit Euler single step method, see Def. 11.3.1.5:
We argue that in this context abstraction pays off, because it helps elucidate a general technique for the
convergence analysis of single step methods.
τ(h, y) := Ψ^h y − Φ^h y .   (11.3.2.13)
Geometric considerations: distance of a smooth curve and its tangent shrinks as the square of the distance
to the intersection point (curve locally looks like a parabola in the ξ − η coordinate system, see Fig. 420).
(Fig. 419, 420: the one-step error τ(h, y_k) is the gap between Φ^h y(t_k) and Ψ^h y(t_k); in local (ξ, η)-coordinates the solution curve locally looks like a parabola.)
The geometric considerations can be made rigorous by analysis: recall Taylor’s formula for the function
y(t + h) − y(t) = ∑_{j=1}^{K} (h^j / j!) y^{(j)}(t) + ∫_{t}^{t+h} ( (t + h − τ)^K / K! ) y^{(K+1)}(τ) dτ ,   (11.3.2.14)
where the remainder integral equals ( y^{(K+1)}(ξ) / (K+1)! ) h^{K+1} for some ξ ∈ [t, t + h]. We conclude that, if y ∈ C²([0, T]), which is ensured for smooth f, see Lemma 11.1.1.3, then
Thus we obtain a recursion for the error norms ε_k := ‖e_k‖ by simply applying the △-inequality.
Use the elementary estimate (1 + L h_j) ≤ exp(L h_j) (by convexity of the exponential function):
(11.3.2.18)  ⇒  ε_k ≤ ∑_{l=1}^{k} ( ∏_{j=1}^{l−1} exp(L h_j) ) ρ_l = ∑_{l=1}^{k} exp( L ∑_{j=1}^{l−1} h_j ) ρ_l .
Note that ∑_{j=1}^{l−1} h_j ≤ T for the final time T and conclude
ε_k ≤ exp(LT) ∑_{l=1}^{k} ρ_l ≤ exp(LT) · max_{l=1,...,k} (ρ_l / h_l) · ∑_{l=1}^{k} h_l ≤ T exp(LT) · max_{l=1,...,k} h_l · max_{t0 ≤ τ ≤ t_k} ‖ÿ(τ)‖ .
We can summarize the insight gleaned through this theoretical analysis as follows:
✦ and that the error bound grows exponentially with the length T of the integration interval.
y
§11.3.2.20 (One-step error and order of a single step method) In the analysis of the global discretization error of the explicit Euler method in § 11.3.2.9 a one-step error of size O(h_k²) led to a total error of O(h) through the effect of error accumulation over M ≈ h⁻¹ steps. This relationship remains valid for almost all single step methods [DB02, Theorem 4.10]:
Consider an IVP (11.1.3.2) with solution t 7→ y(t) and a single step method defined by the
discrete evolution Ψ (→ Def. 11.3.1.5). If the one-step error along the solution trajectory satisfies
(Φ is the evolution map associated with the ODE, see Def. 11.1.4.3)
with C > 0 independent of the temporal mesh M: The (pointwise) discretization error converges
algebraically with order/rate p.
A rigorous statement as a theorem would involve some particular assumptions on Ψ, which we do not
want to give here. These assumptions are satisfied, for instance, for all the methods presented in the
sequel. You may refer to [DB02, Sect. 4.1] for further information.
In fact, it is remarkable that a local condition like (11.3.2.22) permits us to make a quantitative prediction
of global convergence. This close relationship has made researchers introduce “order” also as a property
of discrete evolutions.
Let Ψ : I × D ↦ R^N be a discrete evolution for the autonomous ODE ẏ = f(y) (with associated evolution operator Φ : I × D ↦ R^N → Def. 11.1.4.3). The largest integer q ∈ N0 such that the one-step error satisfies
Ψ^h y − Φ^h y = O(h^{q+1})   for h → 0   (11.3.2.24)
is called the order of the discrete evolution Ψ.
A single-step method (SSM, Def. 11.3.1.5) based on the discrete evolution Ψ satisfies:
Ψ of order q ∈ N   ➣   the SSM converges algebraically with order q.
EXAMPLE 11.3.2.25 (Orders of finite-difference single-step methods) Let us determine orders of the
discrete evolutions for the three simple single-step methods introduced in Section 11.2, here listed with
their corresponding discrete evolution operators Ψ (→ § 11.3.1.1) when applied to an autonomous ODE
ẏ = f(y): for y0 ∈ D ⊂ R N ,
The computation of their orders will rely on a fundamental technique for establishing (11.3.2.24) based on Taylor expansion, which asserts that for a function g ∈ C^{m+1}(]t0 − δ, t0 + δ[, R^N), δ > 0,
g(t0 + τ) = ∑_{k=0}^{m} (1/k!) g^{(k)}(t0) τ^k + O(τ^{m+1})   for τ → 0 .   (11.3.2.29)
Of course the arguments hinge on the smoothness of the vector field f = f(y), which will ensure smoothness of solutions of the associated ODE ẏ = f(y). Thus, we make the following simplifying assumption:
Let Φ = Φ(t, y) denote the evolution operator (→ Def. 11.1.4.3) induced by ẏ = f(y), which, by definition, satisfies
∂Φ/∂t (t, y0) = f(Φ^t y0)   ∀ y0 ∈ D, t ∈ J(y0) .   (11.1.4.4)
Setting v(τ) := Φ^τ y0, which is a solution of the initial-value problem ẏ = f(y), y(0) = y0, we find for small τ, appealing to the one-dimensional chain rule and (11.1.4.4),
dv/dτ (τ) = f(v(τ)) ,   d²v/dτ² (τ) = (∂f/∂y)(v(τ)) · dv/dτ (τ) .   (11.3.2.31)
Hence, by Taylor expansion,
Φ^τ y0 = v(τ) = v(0) + τ dv/dτ (0) + ½ τ² d²v/dτ² (0) + O(τ³)
        = y0 + τ f(y0) + ½ τ² D f(y0) f(y0) + O(τ³)   (11.3.2.32)
for τ → 0.
for τ → 0. Note that the derivative D f(y0 ) is an N × N Jacobi matrix. Explicit expressions for the
remainder term involve second derivatives of f.
➊ For the explicit Euler method (11.3.2.26) we immediately have from (11.3.2.32)
This gives
w(τ) = y0 + τ f( ½(y0 + w(τ)) ) = y0 + τ f( y0 + ½ τ f( ½(y0 + w(τ)) ) )
     = y0 + τ f( y0 + ½ τ f(y0 + O(τ)) )   for τ → 0 .
Then we resort to the truncated Taylor expansion (11.3.2.33) and get for τ → 0
w(τ) = y0 + τ ( f(y0) + D f(y0) ½ τ f(y0 + O(τ)) ) + O(τ³)
     = y0 + τ f(y0) + ½ τ² D f(y0) ( f(y0) + O(τ) ) + O(τ³) .
Matching with (11.3.2.32) shows w(τ) − Φ^τ y0 = O(τ³), where the “O” just comprises continuous higher-order derivatives of f.
τ(h, y) = Ψ^h y − Φ^h y ,   y ∈ D ,   h “sufficiently small” ,
for a consistent single-step method defined by the discrete evolution operator Ψ satisfies
A single step method according to Def. 11.3.1.5 based on a discrete evolution of the form
Ψ^h y = y + h ψ(h, y)   with   ψ : I × D → R^N continuous ,   ψ(0, y) = f(y) .   (11.3.1.11)
(Q11.3.2.34.B) Let t ∈ I 7→ y(t), I ⊂ R an interval containing 0, denote the solution of the au-
tonomous IVP
ẏ = f(y) , y(0) = y0 .
Ψ^h y := y + h f(y) + ½ h² D f(y) f(y) ,
Video tutorial for Section 11.4: Explicit Runge-Kutta Single-Step Methods (RKSSMs): (27
minutes) Download link, tablet notes
So far we only know first and second order methods from 11.2: the explicit and implicit Euler method
(11.2.1.5) and (11.2.2.2), respectively, are of first order, the implicit midpoint rule of second order. We
observed this in Exp. 11.3.2.5 and it can be proved rigorously for all three methods adapting the arguments
of § 11.3.2.9.
Thus, barring the impact of roundoff, the low-order polygonal approximation methods are guaranteed to
achieve any prescribed accuracy provided that the mesh is fine enough. Why should we need any other
timestepping schemes?
Remark 11.4.0.1 (Rationale for high-order single step methods cf. [DR08, Sect. 11.5.3]) We argue
that the use of higher-order timestepping methods is highly advisable for the sake of efficiency. The
reasoning is very similar to that of Rem. 7.4.3.12, when we considered numerical quadrature. The reader
is advised to study that remark again.
As we saw in § 11.3.2.3 error bounds for single step methods for the solution of IVPs will inevitably feature
unknown constants “C > 0”. Thus they do not give useful information about the discretization error for
a concrete IVP and mesh. Hence, it is too ambitious to ask how many timesteps are needed so that
ky( T ) − y N k stays below a prescribed bound.
However, an easier question can be answered by asymptotic estimates like (11.3.2.7), and this question reads:
Computational effort ∼ total number of f-evaluations for approximately solving the IVP,
Now, let us consider a single step method of order p ∈ N, employed with a uniform timestep hold .
We focus on the maximal discretization error in the mesh points, see § 11.3.2.1. We make the crucial
assumption that the asymptotic error bounds are sharp:
Goal:   err(h_new) / err(h_old) = 1/ρ   for a reduction factor ρ > 1 .
(11.3.2.7)  ⇒  ( h_new / h_old )^p = 1/ρ   ⇔   h_new = ρ^{−1/p} h_old .   (11.4.0.3)
(Fig. 421: plots of ρ^{1/p} versus ρ for p = 1, 2, 3, 4.)
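For instance, to reduce the error by a factor of ρ = 10, a first-order method (p = 1) requires h_new = h_old/10, that is, ten times as many timesteps, whereas a fourth-order method (p = 4) only needs h_new = 10^{−1/4} h_old ≈ 0.56 h_old, that is, fewer than twice as many timesteps.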
We remark that another (minor) rationale for using higher-order methods is to curb impact of roundoff
errors (→ Section 1.5.3) accumulating during timestepping [DR08, Sect. 11.5.3]. y
§11.4.0.4 (Bootstrap construction of explicit single step methods) Now we will build a class of methods that are explicit and achieve orders p > 2. The starting point is a simple integral equation satisfied by any solution t ↦ y(t) of an initial value problem for the general ODE ẏ = f(t, y):

IVP:   ẏ(t) = f(t, y(t)) ,   y(t0) = y0    ⇒    y(t1) = y0 + ∫_{t0}^{t1} f(τ, y(τ)) dτ
What error can we afford in the approximation of y(t0 + ci h) (under the assumption that f is Lipschitz
continuous)? We take the cue from the considerations in § 11.3.2.9.
Note that there is a factor h in front of the quadrature sum in (11.4.0.5). Thus, our goal can already be
achieved, if only
y(t0 + ci h) is approximated up to an error O( h p ),
again, because in (11.4.0.5) a factor of size h multiplies f(t0 + ci h, y(t0 + ci h)).
This is accomplished by a less accurate discrete evolution than the one we are about to build. Thus,
we can construct discrete evolutions of higher and higher order, in turns, starting with the explicit Euler
method. All these methods will be explicit, that is, y1 can be computed directly from point values of f. y
EXAMPLE 11.4.0.6 (Simple Runge-Kutta methods by quadrature & bootstrapping) Now we apply the bootstrapping idea outlined above. We write k_ℓ ∈ R^N for the approximations of y(t0 + c_ℓ h).
• Quadrature formula = trapezoidal rule (7.3.0.5):

   Q(f) = ½(f(0) + f(1))   ↔   s = 2:  c1 = 0, c2 = 1 ,  b1 = b2 = ½ ,   (11.4.0.7)

and y(t1) approximated by an explicit Euler step (11.2.1.5). This yields

   k1 = f(t0, y0) ,   k2 = f(t0 + h, y0 + h k1) ,   y1 = y0 + (h/2)(k1 + k2) ,   (11.4.0.8)

the explicit trapezoidal rule (for numerical integration of ODEs).
• Quadrature formula = midpoint rule, with y(t0 + h/2) approximated by an explicit Euler step:

   k1 = f(t0, y0) ,   k2 = f(t0 + h/2, y0 + (h/2) k1) ,   y1 = y0 + h k2 .   (11.4.0.9)

(11.4.0.9) = explicit midpoint method (for numerical integration of ODEs) [DR08, Alg. 11.18].
y
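The explicit trapezoidal step (11.4.0.8) is a one-liner in EIGEN. The following is a minimal sketch with our own function name, not one of the lecture codes:

#include <Eigen/Dense>

// One step of the explicit trapezoidal rule (11.4.0.8) for y' = f(t, y).
template <class Functor>
Eigen::VectorXd trapezoidalStep(Functor &&f, double t0,
                                const Eigen::VectorXd &y0, double h) {
  const Eigen::VectorXd k1 = f(t0, y0);               // first increment (Euler predictor)
  const Eigen::VectorXd k2 = f(t0 + h, y0 + h * k1);  // second increment
  return y0 + 0.5 * h * (k1 + k2);
}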
✦ IVP: ẏ = 10y(1 − y) (scalar logistic ODE (11.1.2.2)), initial value y(0) = 0.01, final time T = 1,
✦ Explicit single step methods, uniform timestep h.
Fig. 422: exact solution y(t) and approximations obtained with the explicit Euler, explicit trapezoidal, and explicit midpoint methods; Fig. 423: error |y_h(1) − y(1)| vs. stepsize h for s = 1 (explicit Euler), s = 2 (explicit trapezoidal rule), s = 2 (explicit midpoint rule), with an O(h²) reference line.
k_i := f(t0 + c_i h, y0 + h Σ_{j=1}^{i−1} a_ij k_j) ,   i = 1, …, s ,    y1 := y0 + h Σ_{i=1}^{s} b_i k_i .

The vectors k_i ∈ R^N, i = 1, …, s, are called increments; h > 0 is the size of the timestep.
Recall Rem. 11.3.1.15 to understand how the discrete evolution for an explicit Runge-Kutta method is
specified in this definition by giving the formulas for the first step. This is a convention widely adopted in
the literature about numerical methods for ODEs. Of course, the increments ki have to be computed anew
in each timestep.
The implementation of an s-stage explicit Runge-Kutta single step method according to Def. 11.4.0.11 is
straightforward: The increments ki ∈ R N are computed successively, starting from k1 = f(t0 + c1 h, y0 ).
Only s f-evaluations and AXPY operations (→ Section 1.3.2) are required to compute the next state
vector from the current.
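The implementation remark above translates almost literally into code. The following minimal sketch (our own helper, not one of the lecture codes) performs a single step of a generic s-stage explicit RK-SSM, with the Butcher data passed as EIGEN objects A, b, c:

#include <Eigen/Dense>
#include <vector>

// One step of an s-stage *explicit* Runge-Kutta method given by its Butcher
// scheme (A strictly lower triangular); the increments are computed
// successively, one f-evaluation per stage.
template <class Functor>
Eigen::VectorXd explicitRKStep(Functor &&f, double t0,
                               const Eigen::VectorXd &y0, double h,
                               const Eigen::MatrixXd &A,
                               const Eigen::VectorXd &b,
                               const Eigen::VectorXd &c) {
  const int s = static_cast<int>(b.size());  // number of stages
  std::vector<Eigen::VectorXd> k(s);         // increments k_1, ..., k_s
  Eigen::VectorXd y1 = y0;
  for (int i = 0; i < s; ++i) {
    Eigen::VectorXd arg = y0;
    for (int j = 0; j < i; ++j) arg += h * A(i, j) * k[j];  // only j < i enter
    k[i] = f(t0 + c(i) * h, arg);            // stage evaluation
    y1 += h * b(i) * k[i];                   // AXPY-type update
  }
  return y1;
}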
In books and research articles a particular way to write down the coefficients characterizing RK-SSMs is
widely used:
Shorthand notation for (explicit) Runge-Kutta methods [DR08, (11.75)]: the Butcher scheme

   c | A            c1 |  0    …            0
   --+---    :=      c2 | a21   ⋱            ⋮
     | b^T           ⋮  |  ⋮    ⋱   ⋱        ⋮
                     cs | as1   …   as,s−1   0
                     ---+----------------------
                        | b1    …   bs−1    bs        (11.4.0.13)

(Note that A is a strictly lower triangular s × s matrix.)
Now we restrict ourselves to the case of an autonomous ODE ẏ = f(y). Matching Def. 11.4.0.11 and
Def. 11.3.1.5, we see that the discrete evolution induced by an explicit Runge-Kutta single-step method
is
Ψ^h y = y + h Σ_{i=1}^{s} b_i k_i ,   h ∈ R ,  y ∈ D ,   (11.4.0.14)

Is this discrete evolution consistent in the sense of § 11.3.1.8, that is, does ψ(0, y) = f(y) hold? If h = 0, the increment equations yield

h = 0   ⇒   k1 = · · · = ks = f(y)   ⇒   ψ(0, y) = Σ_{i=1}^{s} b_i f(y) .
A Runge-Kutta single step method according to Def. 11.4.0.11 is consistent (→ Def. 11.3.1.12) with
the ODE ẏ = f(t, y), if and only if
Σ_{i=1}^{s} b_i = 1 .
Remark 11.4.0.16 (RK-SSM and quadrature rules) Note that in Def. 11.4.0.11 the coefficients c_i and b_i, i ∈ {1, …, s}, can be regarded as nodes and weights of a quadrature formula (→ Def. 7.2.0.1) on [0, 1]: apply the explicit Runge-Kutta single step method to the “ODE” ẏ = f(t), f ∈ C⁰([0, 1]), on [0, 1] with timestep h = 1 and initial value y(0) = 0, with exact solution

ẏ(t) = f(t) ,  y(0) = 0   ⇒   y(t) = ∫_0^t f(τ) dτ .
Recall that the quadrature rule with these weights and nodes c j will have order ≥ 1 (→ Def. 7.4.1.1), if
the weights add up to 1! y
EXAMPLE 11.4.0.17 (Butcher schemes for some explicit RK-SSM [DR08, Sect. 11.6.1]) The following explicit Runge-Kutta single step methods are often mentioned in the literature.

• Explicit Euler method (11.2.1.5):            ➣ order = 1
    0 | 0
    --+---
      | 1

• Explicit trapezoidal method (11.4.0.8):      ➣ order = 2
    0 | 0   0
    1 | 1   0
    --+--------
      | 1/2 1/2

• Explicit midpoint method (11.4.0.9):         ➣ order = 2
    0   | 0   0
    1/2 | 1/2 0
    ----+--------
        | 0   1

• Classical 4th-order RK-SSM:                  ➣ order = 4
    0   | 0   0   0   0
    1/2 | 1/2 0   0   0
    1/2 | 0   1/2 0   0
    1   | 0   0   1   0
    ----+----------------
        | 1/6 2/6 2/6 1/6

• Kutta’s 3/8-method:                          ➣ order = 4
    0   |  0    0   0   0
    1/3 | 1/3   0   0   0
    2/3 | −1/3  1   0   0
    1   |  1   −1   1   0
    ----+------------------
        | 1/8  3/8 3/8 1/8
Hosts of (explicit) Runge-Kutta methods can be found in the literature, see for example the Wikipedia page.
They are stated in the form of Butcher schemes (11.4.0.13) most of the time. y
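For concreteness, here is the classical 4th-order RK-SSM from the list above written out directly from its Butcher scheme, as a minimal, self-contained sketch for an autonomous ODE (names are ours, this is not one of the lecture codes):

#include <Eigen/Dense>

// One step of the classical 4th-order Runge-Kutta method for y' = f(y).
template <class Functor>
Eigen::VectorXd rk4Step(Functor &&f, const Eigen::VectorXd &y0, double h) {
  const Eigen::VectorXd k1 = f(y0);
  const Eigen::VectorXd k2 = f(y0 + 0.5 * h * k1);
  const Eigen::VectorXd k3 = f(y0 + 0.5 * h * k2);
  const Eigen::VectorXd k4 = f(y0 + h * k3);
  return y0 + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4);
}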
Remark 11.4.0.18 (Construction of higher order Runge-Kutta single step methods) Runge-Kutta sin-
gle step methods of order p > 2 are not found by bootstrapping as in Ex. 11.4.0.6, because the resulting
methods would have quite a lot of stages compared to their order.
Rather one derives order conditions yielding large non-linear systems of equations for the coefficients
aij and bi in Def. 11.4.0.11, see [DB02, Sect .4.2.3] and [HLW06, Ch. III]. This approach is similar to
the construction of a Gauss quadrature rule in Ex. 7.4.2.2. Unfortunately, the systems of equations are
very difficult to solve and no universal recipe is available. Nevertheless, through massive use of symbolic
computation, explicit Runge-Kutta methods of order up to 19 have been constructed in this way. y
Remark 11.4.0.19 (“Butcher barriers” for explicit RK-SSM) The following table gives lower bounds for
the number of stages needed to achieve order p for an explicit Runge-Kutta method.
   order p                   | 1  2  3  4  5  6  7  8  | ≥ 9
   minimal no. s of stages   | 1  2  3  4  6  7  9  11 | ≥ p + 3

No general formula has been discovered. What is known is that for explicit Runge-Kutta single step methods according to Def. 11.4.0.11

   order p ≤ number s of stages of the RK-SSM.
y
Supplementary literature on Runge-Kutta methods for the numerical integration of ODEs: [DR08, Sect. 11.6], [Han02, Ch. 76], [QSS00, Sect. 11.8].
Recall the definition of an s-stage explicit Runge-Kutta method (Def. 11.4.0.11):

k_i := f(t0 + c_i h, y0 + h Σ_{j=1}^{i−1} a_ij k_j) ,  i = 1, …, s ,   y1 := y0 + h Σ_{i=1}^{s} b_i k_i ,

with increments k_i ∈ R^N, i = 1, …, s, and timestep size h > 0. A (non-autonomous) initial value problem (11.4.0.21) with f : I × D → R^N can be converted into the equivalent IVP for the extended state z = [z1, …, zN, zN+1]^⊤ := [y; t]^⊤ ∈ R^{N+1}:

ż = g(z) ,   g(z) := [ f(zN+1, [z1, …, zN]^⊤) ; 1 ] ,   z(0) = [ y0 ; t0 ] .   (11.4.0.22)
Let us apply the same 2-stage explicit Runge-Kutta method to (11.4.0.21) and (11.4.0.22). When will
both approaches produce the same sequence of states yk ∈ D?
(Q11.4.0.20.C) Formulate a generic 2-stage explicit Runge-Kutta method for the autonomous second-
order ODE ÿ = f(y), f : D ⊂ R N → R N .
Hint. Apply a standard 2-stage explicit Runge-Kutta method after transformation to an equivalent first-
order ODE.
△
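The autonomization (11.4.0.22) used in the questions above is easily realized in code by appending the time as an extra state component. A minimal sketch (our own helper, not a lecture code):

#include <Eigen/Dense>
#include <functional>

// Wrap a non-autonomous right-hand side f(t, y) into the autonomous
// right-hand side g(z) of (11.4.0.22) acting on the extended state z = [y; t].
std::function<Eigen::VectorXd(const Eigen::VectorXd &)>
autonomize(std::function<Eigen::VectorXd(double, const Eigen::VectorXd &)> f) {
  return [f](const Eigen::VectorXd &z) {
    const int N = static_cast<int>(z.size()) - 1;  // original state dimension
    Eigen::VectorXd gz(N + 1);
    gz.head(N) = f(z(N), z.head(N));  // f(z_{N+1}, [z_1,...,z_N])
    gz(N) = 1.0;                      // dt/dt = 1
    return gz;
  };
}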
Video tutorial for Section 11.5: Adaptive Stepsize Control: (32 minutes) Download link,
tablet notes
Section 7.6, in the context of numerical quadrature, teaches an a-posteriori way to adjust the mesh under-
lying a composite quadrature rule to the integrand: During the computation we estimate the local quadra-
ture error by comparing the approximations obtained by using quadrature formulas of different order. The
same policy for adapting the integration mesh is very popular in the context of numerical integration, too.
Since the size hk := tk+1 − tk of the cells of the temporal mesh is also called the timestep size, this kind
of a-posteriori mesh adaptation is also known as stepsize control.
By the laws of reaction kinetics of physical chemistry from (11.5.1.2) we can extract the following (system
of) ordinary differential equation(s) for the concentrations of the different compounds:
y1 := c(BrO3⁻):   ẏ1 = −k1 y1 y2 − k3 y1 y3 ,
y2 := c(Br⁻):     ẏ2 = −k1 y1 y2 − k2 y2 y3 + k5 y5 ,
y3 := c(HBrO2):   ẏ3 = k1 y1 y2 − k2 y2 y3 + k3 y1 y3 − 2 k4 y3² ,   (11.5.1.3)
y4 := c(Org):     ẏ4 = k2 y2 y3 + k4 y3² ,
y5 := c(Ce(IV)):  ẏ5 = k3 y1 y3 − k5 y5 ,
Fig. 424, Fig. 425: concentrations c(t) vs. time t (logarithmic scale) for the reaction model (11.5.1.3): after a short initial phase the solution varies only slowly.
This is very common with evolutions arising from practical models (circuit models, chemical reaction mod-
els, mechanical systems)
y
As an example, consider the scalar IVP

ẏ = y² ,   y(0) = y0 > 0 ,   with solution   y(t) = y0 / (1 − y0 t) ,

which blows up at t = 1/y0, i.e. J(y0) = ]−∞, 1/y0[.
Fig. 426: solutions y(t) for several initial values y0.
How to choose temporal mesh {t0 < t1 < · · · < t N −1 < t N } for single step method in case J (y0 ) is not
known, even worse, if it is not clear a priori that a blow up will happen?
Just imagine: what will result from equidistant explicit Euler integration (11.2.1.5) applied to the above
IVP?
Fig. 427: solutions computed by ode45 for the initial values y0 = 1, y0 = 0.5, and y0 = 2.
y
Why do we embrace local-in-time timestep control (based on estimating only the one-step error)? One
could raise a serious objection: If a small time-local error in a single timestep leads to large error
kyk − y(tk )k at later times, then local-in-time timestep control is powerless about it and will not even
notice!
Nevertheless, local-in-time timestep control is used almost exclusively,
☞ because we do not want to discard past timesteps, which could amount to tremendous waste of
computational resources,
☞ because it is inexpensive and it works for many practical problems,
☞ because there is no reliable method that can deliver guaranteed accuracy for general IVP.
§11.5.2.2 (Local-in-time error estimation) We “recycle” heuristics already employed for adaptive quadra-
ture, see Section 7.6, § 7.6.0.10. There we tried to get an idea of the local quadrature error by comparing
two approximations of different order. Now we pursue a similar idea over a single timestep.
Idea: Estimation of the one-step error.
Compare the results of two discrete evolutions Ψ^h and Ψ̃^h of different order over the current timestep h:

   Φ^h y_k − Ψ^h y_k   ≈   EST_k := Ψ̃^h y_k − Ψ^h y_k .   (11.5.2.3)
   (one-step error)
§11.5.2.4 ((Crude) local timestep control) We take for granted the availability of a local error estimate EST_k that we have computed for the current stepsize h. We specify target values ATOL > 0, RTOL > 0 of absolute and relative tolerances to be met by the local error and implement the following policy:

   Compare   EST_k ↔ ATOL   (absolute tolerance)   and   EST_k ↔ RTOL · ‖y_k‖   (relative tolerance)
   ➣ reject/accept the current step.   (11.5.2.5)
Both tolerances RTOL > 0 and ATOL > 0 have to be supplied by the user of the adaptive algorithm. The
absolute tolerance is usually chosen significantly smaller than the relative tolerance and merely serves as
a safeguard against non-termination in case yk ≈ 0. For a similar use of absolute and relative tolerances
see Section 8.2.3, which deals with termination criteria for iterations, in particular (8.2.3.3).
The rationale behind the adjustment of the timestep size in (∗) is the following: if the current stepsize
guarantees sufficiently small one-step error, then it might be possible to obtain a still acceptable one-
step error with a larger timestep, which would enhance efficiency (fewer timesteps for total numerical
integration). This should be tried, since timestep control will usually provide a safeguard against undue
loss of accuracy.
The following C++ code implements a wrapper function odeintadapt() for a general adaptive single-
step method according to the policy outlined above. The arguments are meant to pass the following
information:
• Psilow, Psihigh: functors passing discrete evolution operators for autonomous ODE of different
order, type @(y,h), expecting a state (usually a column vector) as first argument, and a stepsize
as second,
C++ code 11.5.2.6: Simple local stepsize control for single step methods ➺ GITLAB
2  // Auxiliary function: default norm for an EIGEN vector type
3  template <class State>
4  double _norm(const State &y) { return y.norm(); }
5
• line 24: make comparison (11.5.2.5) to decide whether to accept or reject local step.
• line 27, 28: step accepted, update state and current time and suggest 1.1 times the current stepsize
for next step.
• line 30 step rejected, try again with half the stepsize.
• Return value is a vector of pairs consisting of
– times t ↔ temporal mesh t0 < t1 < t2 < . . . < t N < T , where t N < T indicates
premature termination (collapse, blow-up),
– states y ↔ sequence (yk )kN=0 .
y
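Since only the first lines of Code 11.5.2.6 are reproduced above, here is a minimal sketch of the accept/reject loop that the comments describe (comparison (11.5.2.5), stepsize × 1.1 after an accepted step, stepsize halved after a rejected one). The interface and names are our own simplification and may differ from the GITLAB code; the state type is assumed to be an EIGEN vector so that .norm() is available.

#include <Eigen/Dense>
#include <algorithm>
#include <utility>
#include <vector>

// Sketch of a simple adaptive single-step integrator; Psilow/Psihigh are
// discrete evolutions of different order, called as Psi(y, h).
template <class DiscEvolOp, class State>
std::vector<std::pair<double, State>>
odeintadapt_sketch(DiscEvolOp &&Psilow, DiscEvolOp &&Psihigh, State y0,
                   double T, double h0, double reltol, double abstol,
                   double hmin) {
  double t = 0.0, h = h0;
  State y = y0;
  std::vector<std::pair<double, State>> states{{t, y}};
  while ((t < T) && (h >= hmin)) {
    const double hk = std::min(h, T - t);
    const State yl = Psilow(y, hk);                    // low-order step
    const State yh = Psihigh(y, hk);                   // high-order step
    const double est = (yh - yl).norm();               // EST_k, cf. (11.5.2.3)
    if (est < std::max(reltol * y.norm(), abstol)) {   // comparison (11.5.2.5)
      t += hk;
      y = yh;                   // use the better value, cf. Rem. 11.5.2.8
      states.emplace_back(t, y);
      h *= 1.1;                 // accepted: suggest a slightly larger stepsize
    } else {
      h /= 2.0;                 // rejected: retry with half the stepsize
    }
  }
  return states;
}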
Remark 11.5.2.7 (Choice of norm) In Code 11.5.2.6 the norm underlying timestep control is passed
through a functor. This is related to the important fact that this norm has to be chosen in a problem-
dependent way. For instance, in the case of systems of ODEs different components of the state vector y
may correspond to different physical quantities. Hence, if we used the Euclidean norm k·k = k·k2 , the
choice of physical units might have a strong impact on the selection of timesteps, which is clearly not
desirable and destroys the scale-invariance of the algorithm, cf. Rem. 2.3.3.11. y
Remark 11.5.2.8 (Estimation of “wrong” error?) We face the same conundrum as in the case of adap-
tive numerical quadrature, see Rem. 7.6.0.17:
!  By the heuristic considerations, see (11.5.2.3), it seems that EST_k measures the one-step error for the low-order method Ψ and that we should use y_{k+1} = Ψ^{h_k} y_k, if the timestep is accepted.

However, it would be foolish not to use the better value y_{k+1} = Ψ̃^{h_k} y_k, since it is available for free. This is what is done in every implementation of adaptive methods, also in Code 11.5.2.6, and this choice can be justified by control-theoretic arguments [DB02, Sect. 5.2]. y
EXPERIMENT 11.5.2.9 (Simple adaptive stepsize control) We test the adaptive timestepping routine from Code 11.5.2.6 for a scalar IVP and compare the estimated local error and the true local error.
✦ IVP for the ODE ẏ = cos²(αy), α > 0, with solution y(t) = arctan(α(t − c))/α for y(0) ∈ ]−π/2, π/2[
✦ Simple adaptive timestepping based on the explicit Euler method (11.2.1.5) and the explicit trapezoidal rule (11.4.0.8)
Adaptive timestepping, rtol = 0.01, atol = 0.0001, α = 20.
Fig. 428: solution y(t), approximations y_k, and rejected steps; Fig. 429: true error |y(t_k) − y_k| and estimated error EST_k.
Observations:
☞ Adaptive timestepping resolves the local features of the solution y(t) at t ≈ 1 well
☞ Estimated error (an estimate for the one-step error) and true error are not related! To understand
this recall Rem. 11.5.2.8.
y
EXPERIMENT 11.5.2.10 (Gain through adaptivity → Exp. 11.5.2.9) In this experiment we want to
explore whether adaptive timestepping is worthwhile as regards reducing the computational effort without
sacrificing accuracy.
We retain the simple adaptive timestepping from previous experiment Exp. 11.5.2.9 and also study the
same IVP.
New: initial state y(0) = 0!
Now we examine the dependence of the maximal discretization error in mesh points on the computational
effort. The latter is proportional to the number of timesteps.
Fig. 430: solutions (y_k)_k for different values of rtol (rtol ∈ {0.4, 0.2, 0.1, 0.05, 0.025, 0.0125, 0.00625}); Fig. 431: maximal error max_k |y(t_k) − y_k| vs. number N of timesteps for uniform and adaptive timestepping (ẏ = a cos²(y), a = 40).
Observations:
☞ Adaptive timestepping achieves much better accuracy for a fixed computational effort.
y
Fig. 432: solutions (y_k)_k for different values of rtol; Fig. 433: error vs. number N of timesteps for uniform and adaptive timestepping (ẏ = a cos²(y), a = 40), now for a modified initial state (see the explanation below).
Observations:
☞ Adaptive timestepping leads to larger errors at the same computational cost as uniform timestep-
ping!
Explanation: the position of the steep step of the solution depends very sensitively on the initial value if y(0) ≈ π/(2α):

   y(t) = (1/α) arctan(α(t + tan(y0/α))) ,   steep step located at t ≈ −tan(y0/α) .
Hence, small local errors in the initial timesteps will lead to large errors at around time t ≈ 1. The stepsize
control is mistaken in condoning these small one-step errors in the first few steps and, therefore, incurs
huge errors later.
However, the perspective of backward error analysis (→ § 1.5.5.18) rehabilitates adaptive stepsize control
in this case: it gives us a numerical solution that is very close to the exact solution of the ODE with slightly
perturbed initial state y0 . y
§11.5.2.12 (Refined local stepsize control → [DR08, Sect. 11.7]) The above algorithm (Code 11.5.2.6) is simple, but its rule for increasing/shrinking the timestep “squanders” the information contained in EST_k.

More ambitious goal:   When EST_k > TOL : stepsize adjustment — which better h_k = ?
                       When EST_k < TOL : stepsize prediction — which good h_{k+1} = ?

Asymptotically, the one-step errors of the two discrete evolutions behave like

   Ψ^{h_k} y(t_k) − Φ^{h_k} y(t_k) = c h_k^{p+1} + O(h_k^{p+2}) ,
   Ψ̃^{h_k} y(t_k) − Φ^{h_k} y(t_k) = O(h_k^{p+2}) ,          (11.5.2.13)
with some (unknown) constant c > 0. Why h^{p+1}? Remember the estimate (11.3.2.15) from the error analysis of the explicit Euler method: there we also found O(h_k²) for the one-step error of a single step method of order 1.

Heuristic reasoning: the timestep h_k is small ➣ the “higher order terms” O(h_k^{p+2}) can be ignored:

   Ψ^{h_k} y(t_k) − Φ^{h_k} y(t_k) ≐ c h_k^{p+1} ,
   Ψ̃^{h_k} y(t_k) − Φ^{h_k} y(t_k) ≐ O(h_k^{p+2})
   ⇒   EST_k ≐ c h_k^{p+1} .    (11.5.2.14)

✎ notation: ≐ means equality up to higher order terms in h_k.

   EST_k ≐ c h_k^{p+1}   ⇒   c ≐ EST_k / h_k^{p+1} .   (11.5.2.15)
For the sake of accuracy (demands “EST_k < TOL”) and efficiency (favors “EST_k > TOL”) we aim for

   EST_k = TOL := max{ATOL, ‖y_k‖ · RTOL} .   (11.5.2.16)

What timestep h_* can actually achieve (11.5.2.16), if we “believe” (heuristics!) in (11.5.2.14) (and, therefore, in (11.5.2.15))?

   (11.5.2.15) & (11.5.2.16)   ⇒   TOL = (EST_k / h_k^{p+1}) · h_*^{p+1} .   (11.5.2.17)

   “Optimal timestep” (stepsize prediction):   h_* = h · (TOL / EST_k)^{1/(p+1)} .   (11.5.2.18)

(Acceptance of current timestep): If EST_k ≤ TOL ➣ use h_* as stepsize for the next step.
C++ code 11.5.2.19: Refined local stepsize control for single step methods ➺ GITLAB
2  // Auxiliary function: default norm for an EIGEN vector type
3  template <class State>
4  double defaultnorm(const State &y) { return y.norm(); }
5  // Auxiliary struct to hold user options
6  struct Odeintssctrl_options {
7    double T;       // terminal time
8    double h0;      // initial time step
9    double reltol;  // norm-relative error tolerance
10   double abstol;  // absolute error tolerance
11   double hmin;    // smallest allowed time step
12 } __attribute__((aligned(64)));
13 // Adaptive single-step integrator
14 template <class DiscEvolOp, class State,
15           class NormFunc = decltype(defaultnorm<State>)>
16 std::vector<std::pair<double, State>> odeintssctrl(
17     DiscEvolOp &&Psilow, unsigned int p, DiscEvolOp &&Psihigh, const State &y0,
18     const Odeintssctrl_options &opt,
Comments on Code 11.5.2.19 (see comments on Code 11.5.2.6 for more explanations):
• Input arguments as for Code 11.5.2.6, except for p = order of the lower-order discrete evolution.
• line 38: compute presumably better local stepsize according to (11.5.2.18),
• line 33: decide whether to repeat the step or advance,
• line 33: extend output arrays if current step has not been rejected.
y
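Because the body of Code 11.5.2.19 is again only summarized by the comments above, the following is a minimal sketch of the loop they describe; the essential difference to the simple control is that the retry stepsize and the predicted next stepsize both come from (11.5.2.18). The interface and names are ours and need not match the GITLAB code.

#include <Eigen/Dense>
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// Sketch of refined local stepsize control: the stepsize for the next
// (or the repeated) step is the "optimal" h* from (11.5.2.18).
template <class DiscEvolOp, class State>
std::vector<std::pair<double, State>>
odeintssctrl_sketch(DiscEvolOp &&Psilow, unsigned int p, DiscEvolOp &&Psihigh,
                    State y0, double T, double h0, double reltol,
                    double abstol, double hmin) {
  double t = 0.0, h = h0;
  State y = y0;
  std::vector<std::pair<double, State>> states{{t, y}};
  while ((t < T) && (h >= hmin)) {
    const double hk = std::min(h, T - t);
    const State yl = Psilow(y, hk);                         // order p
    const State yh = Psihigh(y, hk);                        // order > p
    const double est = (yh - yl).norm();                    // EST_k
    const double tol = std::max(reltol * y.norm(), abstol); // cf. (11.5.2.16)
    h = hk * std::pow(tol / est, 1.0 / (p + 1));            // h* from (11.5.2.18)
    if (est <= tol) {                                       // step accepted
      t += hk;
      y = yh;
      states.emplace_back(t, y);
    }  // otherwise: repeat the step with the new, smaller h
  }
  return states;
}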
Use two RK-SSMs based on the same increments, that is, built with the same coefficients aij , but
different weights bi , see Def. 11.4.0.11 for the formulas, and different orders p and p + 1.
EXAMPLE 11.5.3.2 (Commonly used embedded explicit Runge-Kutta methods) The following two embedded RK-SSMs, presented in the form of their extended Butcher schemes, provide single step methods of orders 4 & 5 [HNW93, Tables 4.1 & 4.2].

    0   |
   1/3  | 1/3
   1/3  | 1/6   1/6
   1/2  | 1/8    0    3/8
    1   | 1/2    0   −3/2    2
   -----+------------------------------
    y1  | 1/6    0     0    2/3    1/6
    ŷ1  | 1/10   0   3/10   2/5    1/5

    0   |
   1/2  | 1/2
   1/2  |  0    1/2
    1   |  0     0      1
   3/4  | 5/32  7/32  13/32  −1/32
   -----+------------------------------
    y1  | 1/6   1/3   1/3    1/6     0
    ŷ1  | −1/2  7/3   7/3   13/6  −16/3
(i) StateType: type for vectors in the state space V, e.g. a fixed-size vector type of EIGEN such as Eigen::Matrix<double,N,1>, where N is an integer constant, cf. § 11.2.0.1.
(ii) RhsType: a functor type, see Section 0.3.3, for the right hand side function f; it must match StateType, and a default type is provided.
The functor for the right hand side f : D ⊂ V → V of the ODE ẏ = f(y) is specified as an argument of the constructor. The single-step numerical integrator is invoked by the templated method solve().
The arguments of solve() are not sufficient to control the behavior of the adaptive integrator. In addition,
one can set data members of the data structure Ode45.options to configure an instance ode45obj of
Ode45:
ode45obj.options.<option_you_want_to_set> = <value>;
y
Remark 11.5.3.5 (Tolerances and accuracy) As we have learned in § 11.5.3.3 for objects of the class
Ode45 tolerances for the refined local stepsize control of § 11.5.2.12 can be specified by setting the
member variables options.rtol and options.atol.
The possibility of passing tolerances to numerical integrators based on adaptive timestepping may tempt the user into believing that they control the accuracy of the computed solutions. However, as is clear from § 11.5.2.12, these tolerances are applied solely to local error estimates and, inherently, have nothing to do with global discretization errors, see Exp. 11.5.2.9.

The absolute/relative tolerances imposed for local-in-time adaptive timestepping do not permit prediction of the accuracy of the computed solution!
EXAMPLE 11.5.3.7 (Adaptive timestepping for mechanical problem) We test the effect of adaptive
stepsize control in M ATLAB for the equations of motion describing the planar movement of a point mass in
a conservative force field x ∈ R2 7→ F (x) ∈ R2 : Let t 7→ y(t) ∈ R2 be the trajectory of point mass (in
the plane).
From Newton’s law:   ÿ = F(y) := − 2y / ‖y‖₂²   (acceleration = force).   (11.5.3.8)

As in Rem. 11.1.3.6 we can convert the second-order ODE (11.5.3.8) into an equivalent first-order ODE by introducing the velocity v := ẏ as an extra solution component:

   (11.5.3.8)   ⇒   [ ẏ ; v̇ ] = [ v ; −2y/‖y‖₂² ] .   (11.5.3.9)
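In code, the right hand side of the first-order system (11.5.3.9) is a single EIGEN lambda acting on the extended state [y; v] ∈ R⁴. A minimal sketch (the variable names are ours):

#include <Eigen/Dense>

// Right-hand side of (11.5.3.9): state z = [y_1, y_2, v_1, v_2]^T in R^4.
auto gravityRhs = [](const Eigen::Vector4d &z) -> Eigen::Vector4d {
  const Eigen::Vector2d y = z.head<2>();  // position
  const Eigen::Vector2d v = z.tail<2>();  // velocity
  Eigen::Vector4d dz;
  dz.head<2>() = v;                           // dy/dt = v
  dz.tail<2>() = -2.0 * y / y.squaredNorm();  // dv/dt = -2y / ||y||_2^2
  return dz;
};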
Adaptive numerical integration with adaptive numerical integrator Ode45 (to § 11.5.3.3) with
➊ options.rtol = 0.001, options.atol = 1.0E-5,
➋ options.rtol = 0.01, options.atol = 1.0E-3,
Plots for tolerance setting ➊ (top) and ➋ (bottom): solution components y_i(t) and timestep sizes over t (left); exact trajectory vs. numerical approximation in the (y1, y2)-plane (right).
Observations:
☞ Fast changes in solution components captured by adaptive approach through very small timesteps.
☞ Completely wrong solution, if tolerance reduced slightly.
In this example we face a rather sensitive dependence of the trajectories on initial states or intermediate states. Small perturbations at one instant in time can have a massive impact on the solution at later times.
Bibliography
[Ama83] H. Amann. Gewöhnliche Differentialgleichungen. 1st. Berlin: Walter de Gruyter, 1983 (cit. on
pp. 759, 760, 766).
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 760, 764–767, 773, 785, 791–793, 795–797, 805).
[Dea80] M.A.B. Deakin. “Applied catastrophe theory in the social and biological sciences”. In: Bulletin
of Mathematical Biology 42.5 (1980), pp. 647–679 (cit. on p. 761).
[DB02] P. Deuflhard and F. Bornemann. Scientific Computing with Ordinary Differential Equations.
2nd ed. Vol. 42. Texts in Applied Mathematics. New York: Springer, 2002 (cit. on pp. 756, 768,
788, 796, 803).
[Gra02] C.R. Gray. “An analysis of the Belousov-Zhabotinski reaction”. In: Rose-Hulman Undergradu-
ate Math Journal 3.1 (2002) (cit. on p. 798).
[HLW06] E. Hairer, C. Lubich, and G. Wanner. Geometric numerical integration. 2nd ed. Vol. 31.
Springer Series in Computational Mathematics. Heidelberg: Springer, 2006 (cit. on pp. 756,
760, 796).
[HNW93] E. Hairer, S.P. Norsett, and G. Wanner. Solving Ordinary Differential Equations I. Nonstiff
Problems. 2nd ed. Berlin, Heidelberg, New York: Springer-Verlag, 1993 (cit. on pp. 756, 808).
[HW11] E. Hairer and G. Wanner. Solving Ordinary Differential Equations II. Stiff and Differential-
Algebraic Problems. Vol. 14. Springer Series in Computational Mathematics. Berlin: Springer-
Verlag, 2011 (cit. on p. 756).
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 759, 760, 763,
766, 773, 785, 797, 798).
[Het00] Herbert W. Hethcote. “The mathematics of infectious diseases”. In: SIAM Rev. 42.4 (2000),
pp. 599–653. DOI: 10.1137/S0036144500371907 (cit. on p. 762).
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
p. 759).
[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Math-
ematics. New York: Springer, 2000 (cit. on pp. 767, 779, 797).
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on pp. 756, 759,
764–766, 775, 787).
Chapter 12
Single-Step Methods for Stiff Initial-Value Problems
Explicit Runge-Kutta methods with stepsize control (→ Section 11.5) seem to be able to provide approx-
imate solutions for any IVP with good accuracy provided that tolerances are set appropriately. Does this
mean that everything is settled about numerical integration?
EXAMPLE 12.0.0.1 (Explicit adaptive RK-SSM for stiff IVP) In this example we will witness the near
failure of a high-order adaptive explicit Runge-Kutta method for a simple scalar autonomous ODE.
We consider the scalar autonomous IVP

   ẏ = f(y) := λ y²(1 − y) ,   y(0) = 0.01 ,   λ := 500 .   (12.0.0.2)

This is a logistic ODE as introduced in Ex. 11.1.2.1. We try to solve it by means of an explicit adaptive
embedded Runge-Kutta-Fehlberg method (→ Section 11.5.3) using the embedded Runge-Kutta single-
step method offered by Ode45 as explained in § 11.5.3.3 (Preprocessor switch MATLABCOEFF activated).
C++ code 12.0.0.3: Solving (12.0.0.2) with the Ode45 numerical integrator ➺ GITLAB
// Types to be used for a scalar ODE with state space R
using StateType = double;
using RhsType = std::function<StateType(StateType)>;
// Logistic differential equation (11.1.2.2)
const double lambda = 500.0;
const RhsType f = [lambda](StateType y) { return lambda * y * y * (1 - y); };
const StateType y0 = 0.01;  // Initial value, will create a STIFF IVP
// State space R, simple modulus supplies norm
const auto normFunc = [](StateType x) { return fabs(x); };
Statistics of the integrator run:
   number of steps: 183,  number of rejected steps: 185,  function calls: 1302.

Fig. 434: exact solution y(t) and ode45 approximation for ẏ = 500 y²(1 − y); Fig. 435: stepsize used by the integrator as a function of t.

Stepsize control of Ode45 running amok!
? The solution is virtually constant for t > 0.2 and, nevertheless, the integrator uses tiny timesteps until the end of the integration interval. Why this crazy behavior?
y
Contents
12.1 Model Problem Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 815
12.2 Stiff Initial-Value Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 829
12.3 Implicit Runge-Kutta Single-Step Methods . . . . . . . . . . . . . . . . . . . . . . 835
12.3.1 The Implicit Euler Method for Stiff IVPs . . . . . . . . . . . . . . . . . . . . . 835
12.3.2 Collocation Single-Step Methods . . . . . . . . . . . . . . . . . . . . . . . . . 836
12.3.3 General Implicit Runge-Kutta Single-Step Methods (RK-SSMs) . . . . . . . 840
12.3.4 Model Problem Analysis for Implicit Runge-Kutta Single-Step Methods (IRK-SSMs) . . 842
12.4 Semi-Implicit Runge-Kutta Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 850
12.5 Splitting Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 853
Video tutorial for Section 12.1: Model Problem Analysis: (40 minutes) Download link,
tablet notes
Fortunately, full insight into the observations made in Ex. 12.0.0.1 can already be gleaned from a scalar
linear model problem that is extremely easy to analyze.
EXPERIMENT 12.1.0.1 (Adaptive explicit RK-SSM for scalar linear decay ODE) To rule out that what
we observed in Ex. 12.0.0.1 might have been a quirk of the IVP (12.0.0.2) we conduct the same investiga-
tions for the simple linear, scalar (N = 1), autonomous IVP

   ẏ = λ y ,   y(0) = 1 ,   λ := −80 .   (12.1.0.2)
We use the adaptive integrator of Ode45 (→ § 11.5.3.3) to solve (12.1.0.2) with the same parameters as in Code 11.5.3.4. Statistics of the integrator run:
   number of steps: 33,  number of rejected steps: 32,  function calls: 231.
Fig. 436: exact solution y(t) and ode45 approximation for ẏ = −80 y; Fig. 437: stepsize as a function of t.
Observation: Though y(t) ≈ 0 for t > 0.1, the integrator keeps on using “unreasonably small” timesteps
even then. y
In this section we will discover a simple explanation for the startling behavior of the adaptive timestepping
Ode45 in Ex. 12.0.0.1.
EXAMPLE 12.1.0.3 (Blow-up of explicit Euler method) The simplest explicit RK-SSM is the explicit
Euler method, see Section 11.2.1. We know that it should converge like O( h) for meshwidth h → 0. In
this example we will see that this may be true only for sufficiently small h, which may be extremely small.
ẏ = f (y) := λy , λ ≪ 0 , y (0) = 1 .
✦ We apply the explicit Euler method (11.2.1.5) with uniform timestep h = 1/M, M ∈
{5, 10, 20, 40, 80, 160, 320, 640}.
Fig. 438: error at final time T = 1 (Euclidean norm) vs. timestep h for λ ∈ {−10, −30, −60, −90}, with an O(h) reference line; Fig. 439: exact solution and explicit Euler approximation for a large timestep (oscillatory blow-up).
✦ Now we look at an IVP for the logistic ODE, see Ex. 11.1.2.1:
✦ As before, we apply the explicit Euler method (11.2.1.5) with uniform timestep h = 1/M, M ∈
{5, 10, 20, 40, 80, 160, 320, 640}.
Fig. 440: error (Euclidean norm) vs. timestep h for λ ∈ {10, 30, 60, 90}; Fig. 441: exact solution and explicit Euler approximation for the logistic ODE.
For large timesteps h we also observe oscillatory blow-up of the sequence (yk )k .
Deeper analysis:
For y ≈ 1: f(y) ≈ λ(1 − y) ➣ if y(t0) ≈ 1, then the solution of the IVP will behave like the solution of ẏ = λ(1 − y), which is a linear ODE. Similarly, z(t) := 1 − y(t) will behave like the solution of the “decay equation” ż = −λz. Thus, around the stationary point y = 1 the explicit Euler method behaves like it did for ẏ = λy in the vicinity of the stationary point y = 0; it grossly overshoots. y
§12.1.0.4 (Linear model problem analysis: explicit Euler method) The phenomenon observed in the two previous examples is accessible to a remarkably simple rigorous analysis: motivated by the considerations in Ex. 12.1.0.3 we study the explicit Euler method (11.2.1.5) for the linear model problem

   ẏ = λ y ,   y(0) = y0 ,   λ < 0 .   (12.1.0.5)

Recall the recursion of the explicit Euler method with uniform timestep h > 0 for (12.1.0.5):

   y_{k+1} = y_k + h λ y_k = (1 + λh) y_k   ⇒   y_k = (1 + λh)^k y0 .

Only if |λ| h < 2 do we obtain a decaying solution with the explicit Euler method!
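The threshold |λ|h < 2 is easy to verify experimentally. A minimal sketch (not a lecture code) running the explicit Euler recursion for one stepsize below and one above 2/|λ|:

#include <cmath>
#include <initializer_list>
#include <iostream>

// Explicit Euler for y' = lambda*y, y(0) = 1: decay iff |1 + lambda*h| < 1.
int main() {
  const double lambda = -80.0;
  for (double h : {0.02, 0.03}) {  // 2/|lambda| = 0.025 separates the regimes
    double y = 1.0;
    const int N = static_cast<int>(std::round(1.0 / h));
    for (int k = 0; k < N; ++k) y *= (1.0 + lambda * h);  // Euler recursion
    std::cout << "h = " << h << ": |y_N| = " << std::abs(y) << std::endl;
  }
  return 0;
}

For h = 0.02 the approximation decays (|1 + λh| = 0.6), whereas for h = 0.03 it blows up (|1 + λh| = 1.4).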
Could it be that the timestep control is desperately trying to enforce the qualitatively correct behavior of
the numerical solution in Ex. 12.1.0.3? Let us examine how the simple stepsize control of Code 11.5.2.6
fares for model problem (12.1.0.5):
EXPERIMENT 12.1.0.8 (Simple adaptive timestepping for fast decay) In this example we let a trans-
parent adaptive timestep struggle with “overshooting”:
✦ “Linear model problem IVP”: ẏ = λy, y(0) = 1, λ = −100
✦ Simple adaptive timestepping method as in Exp. 11.5.2.9, see Code 11.5.2.6. Timestep control
based on the pair of 1st-order explicit Euler method and 2nd-order explicit trapezoidal method.
Fig. 442: solution y(t), approximations y_k, and rejected steps for the decay equation (rtol = 0.01, atol = 0.0001, λ = −100); Fig. 443: true error |y(t_k) − y_k| and estimated error EST_k.
Observation: in fact, stepsize control enforces small timesteps even if y(t) ≈ 0 and persistently triggers
rejections of timesteps. This is necessary to prevent overshooting in the Euler method, which contributes
to the estimate of the one-step error.
We see the purpose of stepsize control thwarted, because after only a very short time the solution is
almost zero and then, in fact, large timesteps should be chosen. y
Are these observations a particular “flaw” of the explicit Euler method? Let us study the behavior of another
simple explicit Runge-Kutta method applied to the linear model problem.
EXAMPLE 12.1.0.9 (Explicit trapezoidal method for decay equation → [DR08, Ex. 11.29])
The explicit trapezoidal method is a 2-stage explicit Runge-Kutta method, whose Butcher scheme is given in Ex. 11.4.0.17 and which was derived in Ex. 11.4.0.6. We state its recursion for the ODE ẏ = f(t, y) in terms of the first step y0 → y1:

   k1 = f(t0, y0) ,   k2 = f(t0 + h, y0 + h k1) ,   y1 = y0 + (h/2)(k1 + k2) .   (11.4.0.8)

Apply it to the model problem (12.1.0.5), that is, the scalar autonomous ODE with right hand side function f(y) = λy, λ < 0:

   k1 = λ y0 ,   k2 = λ(y0 + h k1)   ⇒   y1 = (1 + λh + ½(λh)²) y0 =: S(hλ) · y0 .   (12.1.0.10)
The sequence of approximations generated by the explicit trapezoidal rule can be expressed in closed form as

   y_k = S(hλ)^k y0 ,   k = 0, …, N .   (12.1.0.11)

(Plot: graph of z ↦ 1 − z + ½z², i.e. of S evaluated at −z.)

   |S(hλ)| < 1   ⇔   −2 < hλ < 0 .

Qualitatively correct decay behavior of (y_k)_k is obtained only under the timestep constraint

   h ≤ 2/|λ| .   (12.1.0.12)
§12.1.0.13 (Model problem analysis for general explicit Runge-Kutta single step methods) We generalize the approach taken in Ex. 12.1.0.9 and apply an explicit s-stage Runge-Kutta method (→ Def. 11.4.0.11), encoded by the Butcher scheme [c | A; b^⊤] with A ∈ R^{s,s} strictly lower-triangular, to the autonomous scalar linear ODE (12.1.0.5) (ẏ = λy). We write down the equations for the increments and y1 from Def. 11.4.0.11 for f(y) := λy and then convert the resulting system of equations into matrix form:

   k_i = λ(y0 + h Σ_{j=1}^{i−1} a_ij k_j) ,  i = 1, …, s ,
   y1 = y0 + h Σ_{i=1}^{s} b_i k_i
   ⇒   [ I − zA   0 ] [ k  ]  =  y0 [ 1 ]
        [ −z b^⊤   1 ] [ y1 ]       [ 1 ] ,    (12.1.0.14)

where k ∈ R^s denotes the vector [k1, …, ks]^⊤/λ of increments, 1 := [1, …, 1]^⊤ ∈ R^s, and z := λh. Next we apply block Gaussian elimination (→ Rem. 2.3.1.11) to solve for y1 and obtain

   y1 = (1 + z b^⊤ (I − zA)^{−1} 1) y0 ,

and note that A is a strictly lower triangular matrix, which means that det(I − zA) = 1. Thus we have proved the following theorem.
Theorem 12.1.0.17. The discrete evolution Ψ^{λh} of an explicit s-stage Runge-Kutta single step method (→ Def. 11.4.0.11) with Butcher scheme [c | A; b^⊤] (see (11.4.0.13)) for the ODE ẏ = λy amounts to a multiplication with the number

   S(z) := 1 + z b^⊤ (I − zA)^{−1} 1 = det(I − zA + z 1 b^⊤) ,   z := λh ,   1 := [1, …, 1]^⊤ ∈ R^s ,

that is, Ψ^{λh} y = S(λh) y.
EXAMPLE 12.1.0.19 (Stability functions of explicit Runge-Kutta single step methods) From Thm. 12.1.0.17 and their Butcher schemes we can instantly compute the stability functions of explicit RK-SSMs. We do this for a few methods whose Butcher schemes were listed in Ex. 11.4.0.17:

• Explicit Euler method (11.2.1.5):       S(z) = 1 + z .
• Explicit trapezoidal method (11.4.0.8): S(z) = 1 + z + ½ z² .
• Classical RK4 method:                   S(z) = 1 + z + ½ z² + (1/6) z³ + (1/24) z⁴ .
These examples confirm an immediate consequence of the determinant formula for the stability function
S ( z ).
For a consistent (→ Def. 11.3.1.12) s-stage explicit Runge-Kutta single step method according to
Def. 11.4.0.11 the stability function S defined by (12.1.0.56) is a non-constant polynomial of degree
≤ s, that is, S ∈ Ps .
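The determinant formula also makes it easy to evaluate S(z) numerically for any RK-SSM given by its Butcher scheme. A minimal sketch with EIGEN (the function name is ours, not from the lecture codes):

#include <Eigen/Dense>
#include <complex>

// Evaluate the stability function S(z) = det(I - z*A + z*1*b^T) of an
// RK-SSM with Butcher matrix A and weight vector b.
std::complex<double> stabFn(const Eigen::MatrixXd &A, const Eigen::VectorXd &b,
                            std::complex<double> z) {
  const int s = static_cast<int>(b.size());
  const Eigen::MatrixXcd I = Eigen::MatrixXcd::Identity(s, s);
  const Eigen::VectorXcd ones = Eigen::VectorXcd::Ones(s);
  return (I - z * A.cast<std::complex<double>>() +
          z * ones * b.cast<std::complex<double>>().transpose())
      .determinant();
}

For the explicit Euler method (A = [0], b = [1]) this returns 1 + z, in agreement with Ex. 12.1.0.19.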
Remark 12.1.0.21 (Stability function and exponential function) Let us compare the two evolution op-
erators:
• Φ = the evolution operator (→ Def. 11.1.4.3) for ẏ = λy,
• Ψ = the discrete evolution operator (→ § 11.3.1.1) for an s-stage Runge-Kutta single step method.

   Φ^h y = e^{λh} y   ←→   Ψ^h y = S(λh) y .

Let S denote the stability function of an s-stage explicit Runge-Kutta single step method of order q ∈ N. Then

   S(z) − e^z = O(z^{q+1})   for z → 0 .

This means that the lowest q + 1 coefficients of S(z) must be equal to the first coefficients of the exponential series:

   S(z) = Σ_{j=0}^{q} z^j / j!  +  z^{q+1} p(z)   with some p ∈ P_{s−q−1} .

In order to match the first q terms of the exponential series, we need at least S ∈ P_q, which entails a minimum of q stages.
§12.1.0.26 (Stability induced timestep constraint) In § 12.1.0.13 we established that for the sequence (y_k)_{k=0}^∞ produced by an explicit Runge-Kutta single step method applied to the linear scalar model ODE ẏ = λy, λ ∈ R, with uniform timestep h > 0 the following holds:

   (y_k)_{k=0}^∞ non-increasing            ⇔   |S(λh)| ≤ 1 ,
   (y_k)_{k=0}^∞ exponentially increasing  ⇔   |S(λh)| > 1 .       (12.1.0.27)

So, for any λ ≠ 0 there will be a threshold h_max > 0 so that |y_k| → ∞ whenever h > h_max.
Reversing the argument we arrive at a timestep constraint, as already observed for the explicit Euler
methods in § 12.1.0.4.
Only if one ensures that |λh| is sufficiently small can one avoid exponentially increasing approximations y_k (qualitatively wrong for λ < 0) when applying an explicit RK-SSM to the model problem (12.1.0.5) with uniform timestep h > 0.

For λ ≪ 0 this stability-induced timestep constraint may force h to be much smaller than required by the demands on accuracy: in this case timestepping becomes inefficient. y
Remark 12.1.0.29 (Stepsize control detects instability) Ex. 12.0.0.1, Exp. 12.1.0.8 send the message
that local-in-time stepsize control as discussed in Section 11.5 selects timesteps that avoid blow-up, with
a hefty price tag however in terms of computational cost and poor accuracy. y
Objection: simple linear scalar IVP (12.1.0.5) may be an oddity rather than a model problem: the weakness
of explicit Runge-Kutta methods discussed above may be just a peculiar response to an unusual situation.
Let us extend our investigations to systems of linear ODEs of dimension N > 1 of the state space.
§12.1.0.30 (Systems of linear ordinary differential equations, § 11.1.1.8 revisited) A generic linear ordinary differential equation with constant coefficients on the state space R^N has the form

   ẏ = M y   with a matrix   M ∈ R^{N,N} .   (12.1.0.31)

As explained in [NS02, Sect. 8.1], (12.1.0.31) can be solved by diagonalization: if we can find a regular matrix V ∈ C^{N,N} such that

   M V = V D   with a diagonal matrix   D = diag(λ1, …, λN) ∈ C^{N,N} ,   (12.1.0.32)

then the columns of V are a basis of eigenvectors of M, and the λ_j ∈ C, j = 1, …, N, are the associated eigenvalues of M, see Def. 9.1.0.1.

The idea behind diagonalization is the transformation of (12.1.0.31) into N decoupled scalar linear ODEs:

   ẏ = M y   —(z(t) := V^{−1} y(t))→   ż = D z   ↔   ż_j = λ_j z_j ,  j = 1, …, N ,   since M = V D V^{−1} .
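Numerically, this decoupling can be carried out with EIGEN's EigenSolver. The following sketch (our own helper, assuming M is diagonalizable) evaluates the exact solution of ẏ = My at time t via diagonalization:

#include <Eigen/Dense>
#include <Eigen/Eigenvalues>
#include <complex>

// Solve y' = M*y, y(0) = y0, at time t by diagonalization M = V*D*V^{-1}:
// y(t) = V * diag(exp(lambda_i * t)) * V^{-1} * y0.
Eigen::VectorXcd evolveLinear(const Eigen::MatrixXd &M,
                              const Eigen::VectorXd &y0, double t) {
  Eigen::EigenSolver<Eigen::MatrixXd> es(M);   // eigenvalues & eigenvectors
  const Eigen::VectorXcd lambda = es.eigenvalues();
  const Eigen::MatrixXcd V = es.eigenvectors();
  Eigen::VectorXcd z = V.partialPivLu().solve(y0.cast<std::complex<double>>());
  for (int i = 0; i < z.size(); ++i) z(i) *= std::exp(lambda(i) * t);  // decoupled scalar ODEs
  return V * z;                                // transform back
}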
EXAMPLE 12.1.0.35 (Transient circuit simulation)   Fig. 445: RCL circuit driven by a time-dependent voltage source Us(t).
We integrate IVPs for this ODE by means of the adaptive integrator Ode45 from § 11.5.3.3.
Parameters: R = 100 Ω, L = 1 H, C = 1 µF, Us(t) = 1 V · sin(t), u(0) = v(0) = 0 (“switch on”).
Ode45 statistics: 17897 successful steps, 1090 failed attempts, 113923 function evaluations.
Fig.: computed u(t) and v(t)/100 over time t.
Maybe the time-dependent right hand side due to the time-harmonic excitation severely affects ode45? Let us try a constant exciting voltage: R = 100 Ω, L = 1 H, C = 1 µF, Us(t) = 1 V, u(0) = v(0) = 0 (“switch on”).
Fig.: computed u(t) and v(t)/100 for constant excitation.
We make the same observation as in Ex. 12.0.0.1 and Exp. 12.1.0.8: the local-in-time stepsize control of ode45 (→ Section 11.5) enforces extremely small timesteps although the solution is almost constant except near t = 0.
To understand the structure of the solutions for this transient circuit example, let us apply the diagonalization technique from § 12.1.0.30 to the linear ODE

   ẏ = [ 0   1 ; −β  −α ] y =: M y ,   y(0) = y0 ∈ R² .   (12.1.0.36)

We can obtain the general solution of ẏ = My, M ∈ R^{2,2}, by diagonalization of M (if possible):

   M V = M(v1, v2) = (v1, v2) diag(λ1, λ2) ,   (12.1.0.37)

where v1, v2 ∈ R² \ {0} are the eigenvectors of M and λ1, λ2 are the eigenvalues of M, see Def. 9.1.0.1. The latter are the roots of the characteristic polynomial t ↦ χ(t) := t² + αt + β in C, and we find

   λ_{1/2} = ½(−α ± D) ,   D := √(α² − 4β)  if α² ≥ 4β ,   D := ı √(4β − α²)  if α² < 4β .
Note that the eigenvalues have a large (in modulus) negative real part and a non-vanishing imaginary part
in the setting of the experiment.
§12.1.0.40 (“Diagonalization” of the explicit Euler method) Recall the discrete evolution of the explicit Euler method (11.2.1.5) for the linear ODE ẏ = My, M ∈ R^{N,N}:

   y_{k+1} = y_k + h M y_k .

As in § 12.1.0.30 we assume that M can be diagonalized, that is, (12.1.0.32) holds: V^{−1}MV = D with a diagonal matrix D ∈ C^{N,N} containing the eigenvalues of M on its diagonal. Next, apply the decoupling-by-diagonalization idea to the recursion of the explicit Euler method: with z_k := V^{−1} y_k,

   V^{−1} y_{k+1} = V^{−1} y_k + h V^{−1} M V (V^{−1} y_k)   ⇔   (z_{k+1})_i = (z_k)_i + h λ_i (z_k)_i ,   (12.1.0.41)

that is, each component performs an explicit Euler step for ż_i = λ_i z_i.
So far we conducted the model problem analysis under the premise λ < 0. However, in Ex. 12.1.0.35 we face λ_{1/2} = ½(−α ± ı √(4β − α²)) (complex eigenvalues!). Let us now examine how the explicit Euler method, and even general explicit RK-methods, respond to them.
Remark 12.1.0.42 (Explicit Euler method for damped oscillations) Consider the linear model IVP (12.1.0.5) for λ ∈ C:

   ẏ = λ y ,   y(0) = y0 ,   λ ∈ C .

The model problem analysis from Ex. 12.1.0.3 and Ex. 12.1.0.9 can be extended verbatim to the case λ ∈ C. It yields the following insight for the explicit Euler method and λ ∈ C: the sequence generated by the explicit Euler method (11.2.1.5) for the model problem (12.1.0.5) satisfies

   |y_{k+1}| = |1 + hλ| · |y_k| ,   hence   y_k → 0 for k → ∞   ⇔   |1 + hλ| < 1 .

Fig. 448: the set {z ∈ C : |1 + z| < 1}; the green region of the complex plane marks the values of λh for which the explicit Euler method produces exponentially decaying solutions.
Now we can conjecture what happens in Ex. 12.1.0.35: the eigenvalues λ_{1/2} = −½α ± ı √(β − ¼α²) of M have a very large (in modulus) negative real part. Since the integrator of Ode45 can be expected to behave as if it integrated ż = λ₂ z, it faces a severe timestep constraint if exponential blow-up is to be avoided, see Ex. 12.1.0.3. Thus stepsize control must resort to tiny timesteps. y
§12.1.0.43 (Extended model problem analysis for explicit Runge-Kutta single step methods) Recall the definition of a generic explicit RK-SSM for the ODE ẏ = f(t, y):

   k_i := f(t0 + c_i h, y0 + h Σ_{j=1}^{i−1} a_ij k_j) ,  i = 1, …, s ,   y1 := y0 + h Σ_{i=1}^{s} b_i k_i .

The vectors k_i ∈ R^N, i = 1, …, s, are called increments; h > 0 is the size of the timestep.
We apply such an explicit s-stage RK-SSM described by the Butcher scheme [c | A; b^⊤] to the autonomous linear ODE ẏ = My, M ∈ C^{N,N}, and obtain (for the first step with timestep size h > 0)

   k_ℓ = M(y0 + h Σ_{j=1}^{ℓ−1} a_{ℓj} k_j) ,  ℓ = 1, …, s ,   y1 = y0 + h Σ_{ℓ=1}^{s} b_ℓ k_ℓ .   (12.1.0.44)

Now assume that M can be diagonalized, that is, (12.1.0.32) holds: V^{−1}MV = D with a diagonal matrix D ∈ C^{N,N} containing the eigenvalues λ1, …, λN ∈ C of M on its diagonal. Then apply the substitutions

   k̂_ℓ := V^{−1} k_ℓ ,  ℓ = 1, …, s ,   ŷ_k := V^{−1} y_k ,  k = 0, 1 ,

to (12.1.0.44).
We infer that, if (y_k)_k is the sequence produced by an explicit RK-SSM applied to ẏ = My, then

   y_k = V [ y_k^{[1]}, …, y_k^{[N]} ]^⊤ ,

where (y_k^{[i]})_k is the sequence generated by the same RK-SSM with the same sequence of timesteps for the IVP ẏ = λ_i y, y(0) = (V^{−1} y0)_i.
The RK-SSM generates uniformly bounded solution sequences (y_k)_{k=0}^∞ for the ODE ẏ = My with a diagonalizable matrix M ∈ R^{N,N} with eigenvalues λ1, …, λN, if and only if it generates uniformly bounded sequences for all the scalar ODEs ż = λ_i z, i = 1, …, N.
Understanding the behavior of RK-SSM for autonomous scalar linear ODEs ẏ = λy with λ ∈ C is
enough to predict their behavior for general autonomous linear systems of ODEs.
Theorem 12.1.0.48. (Absolute) stability of explicit RK-SSM for linear systems of ODEs
The sequence (yk )k of approximations generated by an explicit RK-SSM (→ Def. 11.4.0.11) with
stability function S (defined in (12.1.0.56)) applied to the linear autonomous ODE ẏ = My, M ∈
C N,N , with uniform timestep h > 0 decays exponentially for every initial state y0 ∈ C N , if and only
if |S(λi h)| < 1 for all eigenvalues λi of M.
for any solution of ẏ = My. This is obvious from the representation formula (12.1.0.33). y
Hence, the modulus |S(λh)| tells us for which combinations of λ and stepsize h we achieve exponential decay y_k → 0 for k → ∞, which is the desirable behavior of the approximations for Re λ < 0.
Let the discrete evolution Ψ for a single step method applied to the scalar linear ODE ẏ = λy, λ ∈ C, be of the form

   Ψ^h y = S(λh) y

for some function S : C → C. Then the region of (absolute) stability of the single step method is given by

   S_Ψ := { z ∈ C : |S(z)| < 1 } ⊂ C .
Of course, by Thm. 12.1.0.17, in the case of explicit RK-SSM the function S will coincide with their
stability function from (12.1.0.56).
We can easily combine the statement of Thm. 12.1.0.48 with the concept of a region of stability and conclude that an explicit RK-SSM will generate exponentially decaying solutions for the linear ODE ẏ = My, M ∈ C^{N,N}, for every initial state y0 ∈ C^N, if and only if λ_i h ∈ S_Ψ for all eigenvalues λ_i of M.
EXAMPLE 12.1.0.53 (Regions of stability of some explicit RK-SSM) The green domains ⊂ C depict
the bounded regions of stability for some RK-SSM from Ex. 11.4.0.17.
Plots: the bounded regions of stability (green) in the complex plane for some explicit RK-SSMs from Ex. 11.4.0.17.
In general we have for a consistent RK-SSM (→ Def. 11.3.1.12) that its stability function satisfies S(z) = 1 + z + O(z²) for z → 0. Therefore S_Ψ ≠ ∅, and the imaginary axis is tangent to S_Ψ at z = 0. y
The discrete evolution Ψ^{λh} of an explicit s-stage Runge-Kutta single step method (→ Def. 11.4.0.11) with Butcher scheme [c | A; b^⊤] (see (11.4.0.13)) for the ODE ẏ = λy amounts to a multiplication with the number

   S(λh) = 1 + λh b^⊤ (I − λh A)^{−1} 1 = det(I − λh A + λh 1 b^⊤) .
(Q12.1.0.54.E) Compute the stability function for the three-stage Runge-Kutta single step method defined through the Butcher scheme

    0  |  0    0    0
   1/3 | 1/3   0    0
   2/3 |  0   2/3   0
   ----+---------------
       | 1/4   0   3/4 .
(Q12.1.0.54.F) What is the stability-induced timestep constraint for the classical 4-stage explicit Runge-Kutta single step method of order 4, when applied to the ODE

   ẏ(t) = M y(t)   with   M := [ 0  −1 ; 1  0 ] ?
Supplementary literature. Related to this section are [Han02, Ch. 77] and [QSS00,
Sect. 11.3.3].
12.2 Stiff Initial-Value Problems

This section will reveal that the behavior observed in Ex. 12.0.0.1 and Ex. 12.1.0.3 is typical for a large
class of problems and that the model problem (12.1.0.5) really represents a “generic case”. This justifies
the attention paid to linear model problem analysis in Section 12.1.
EXAMPLE 12.2.0.1 (Kinetics of chemical reactions → [Han02, Ch. 62]) In Ex. 11.5.1.1 we already saw an ODE model for the dynamics of a chemical reaction. Now we study an abstract reaction:

   A + B —k1→ C ,  C —k2→ A + B   (fast reaction) ,
   A + C —k3→ D ,  D —k4→ A + C   (slow reaction) .   (12.2.0.2)
If cA(0) > cB(0) ➢ the 2nd reaction determines the overall long-term reaction dynamics.

Mathematical model: a non-linear ODE involving the concentrations y(t) = [cA(t), cB(t), cC(t), cD(t)]^⊤:

   ẏ = d/dt [cA; cB; cC; cD] = f(y) := [ −k1 cA cB + k2 cC − k3 cA cC + k4 cD ;
                                          −k1 cA cB + k2 cC ;
                                           k1 cA cB − k2 cC − k3 cA cC + k4 cD ;
                                           k3 cA cC − k4 cD ] .   (12.2.0.3)
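The right hand side (12.2.0.3) is straightforward to code as an EIGEN functor, which can then be fed to any of the single step methods discussed above. A minimal sketch; the numerical values of the rate constants below are placeholders, not values from the lecture:

#include <Eigen/Dense>

// Right-hand side f(y) of (12.2.0.3); y = [c_A, c_B, c_C, c_D]^T.
struct ReactionRhs {
  double k1 = 1e4, k2 = 1.0, k3 = 1.0, k4 = 1.0;  // fast: k1, k2; slow: k3, k4 (placeholders)
  Eigen::Vector4d operator()(const Eigen::Vector4d &y) const {
    const double cA = y(0), cB = y(1), cC = y(2), cD = y(3);
    Eigen::Vector4d f;
    f(0) = -k1 * cA * cB + k2 * cC - k3 * cA * cC + k4 * cD;
    f(1) = -k1 * cA * cB + k2 * cC;
    f(2) =  k1 * cA * cB - k2 * cC - k3 * cA * cC + k4 * cD;
    f(3) =  k3 * cA * cC - k4 * cD;
    return f;
  }
};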
Fig. 449: concentrations cA(t), cC(t) and their ode45 approximations; Fig. 450: timestep size used by the adaptive integrator.
Observations: After a fast initial transient phase, the solution shows only slow dynamics. Nevertheless,
the explicit adaptive integrator used for this simulation insists on using a tiny timestep. It behaves very
much like Ode45 in Ex. 12.0.0.1. y
EXAMPLE 12.2.0.4 (Strongly attractive limit cycle) We consider the non-linear autonomous ODE ẏ = f(y) with

   f(y) := [ 0  −1 ; 1  0 ] y + λ (1 − ‖y‖₂) y .   (12.2.0.5)

(12.2.0.6) provides a solution even for λ ≠ 0 if ‖y(0)‖₂ = 1, because in this case the term λ(1 − ‖y‖₂) y vanishes along the whole solution trajectory.
Fig. 451, Fig. 452: solution trajectories in the (y1, y2)-plane (the unit circle is the limit cycle); Fig. 453, Fig. 454: solution components y_i(t) and timestep sizes used by the adaptive integrator.
We want to find criteria that allow us to predict the massive problems haunting explicit single step methods in the case of the non-linear IVPs of Ex. 12.0.0.1, Ex. 12.2.0.1, and Ex. 12.2.0.4. Recall that for linear IVPs of the form ẏ = My, y(0) = y0, the model problem analysis of Section 12.1 tells us that, given knowledge of the region of stability of the timestepping scheme, the eigenvalues of the matrix M ∈ C^{N,N} provide full information about the timestep constraint we are going to face. Refer to Thm. 12.1.0.48 and § 12.1.0.49.
The ODEs we saw in Ex. 12.2.0.1 and Ex. 12.2.0.4 are non-linear . Yet, the entire stability analysis of
Section 12.1 was based on linear ODEs. Thus, we need to extend the stability analysis to non-linear
ODEs.
We start with a “phenomenological notion”, just a keyword to refer to the kind of difficulties presented by
the IVPs of Ex. 12.0.0.1, Ex. 12.2.0.1, Exp. 12.1.0.8, and Ex. 12.2.0.4.
§12.2.0.8 (Linearization of ODEs) Linear ODEs, though very special, are highly relevant as "local models" for general ODEs: We consider a general autonomous ODE
$$ \dot{y} = f(y) , \qquad f : D \subset \mathbb{R}^N \to \mathbb{R}^N . $$
As usual, we assume f to be C²-smooth and locally Lipschitz continuous (→ Def. 11.1.3.13) on D, so that unique solvability of IVPs is guaranteed by Thm. 11.1.3.17.
We fix a state y* ∈ D, D the state space, and write t ↦ y(t) for the solution with y(0) = y*. We set z(t) := y(t) − y*, which satisfies
$$ \dot{z}(t) = \dot{y}(t) = f(y^* + z(t)) = f(y^*) + D f(y^*)\, z(t) + O(\|z(t)\|^2) \quad\text{for } z(t) \to 0 . $$
This is obtained by Taylor expansion of f at y*, see [Str09, Satz 7.5.2]. Hence, in a neighborhood of a state y* on a solution trajectory t ↦ y(t), the deviation z(t) = y(t) − y* approximately satisfies the affine-linear ODE
$$ \dot{z} = f(y^*) + D f(y^*)\, z . \tag{12.2.0.10} $$
In this sense the short-time evolution of y with y(0) = y* is approximately governed by the affine-linear ODE (12.2.0.10).
§12.2.0.11 (Linearization of explicit Runge-Kutta single step methods) We consider one step of a general s-stage RK-SSM according to Def. 11.4.0.11 for the autonomous ODE ẏ = f(y), with smooth right hand side function f : D ⊂ R^N → R^N:
$$ k_i = f\Big(y_0 + h \sum_{j=1}^{i-1} a_{ij} k_j\Big) , \quad i = 1, \dots, s , \qquad y_1 = y_0 + h \sum_{i=1}^{s} b_i k_i . $$
We perform linearization at y* := y₀ and ignore all terms at least quadratic in the timestep size h (this is indicated by the ≈ symbol):
$$ k_i \approx f(y^*) + D f(y^*)\, h \sum_{j=1}^{i-1} a_{ij} k_j , \quad i = 1, \dots, s , \qquad y_1 = y_0 + h \sum_{i=1}^{s} b_i k_i . $$
Abbreviating b := f(y*) and M := D f(y*) ∈ R^{N,N}, these become
$$ k_i \approx b + \mathbf{M}\, h \sum_{j=1}^{i-1} a_{ij} k_j , \quad i = 1, \dots, s , \qquad y_1 = y_0 + h \sum_{i=1}^{s} b_i k_i , $$
which are exactly the increment equations of the same RK-SSM applied to the affine-linear ODE (12.2.0.10). Thus, up to terms of order h², the discrete evolution of the RK-SSM for ẏ = f(y) in the state y* is close to the discrete evolution of the same RK-SSM applied to the linearization (12.2.0.10) of the ODE in y*.
$$ w_k = y_k - y_0 + \mathbf{M}^{-1} b . $$
➣ The analysis of the behavior of an RK-SSM for an autonomous affine-linear ODE can be reduced to
understanding its behavior for an autonomous linear ODE with the same matrix.
the behavior of an explicit Runge-Kutta single-step method applied to ẏ = f(y) close to the
state y∗ is determined by the eigenvalues of the Jacobian D f(y∗ ).
In particular, if D f(y∗ ) has at least one eigenvalue whose modulus is large, then an exponential drift-off
of the approximate states yk away from y∗ can only be avoided for sufficiently small timestep, again a
timestep constraint.
An initial value problem for an autonomous ODE ẏ = f(y) will probably be stiff, if, for substantial periods of time,
$$ \min\{\operatorname{Re}\lambda : \lambda \in \sigma(D f(y(t)))\} \ll 0 , \tag{12.2.0.13} $$
$$ \max\{\operatorname{Re}\lambda : \lambda \in \sigma(D f(y(t)))\} \approx 0 , \tag{12.2.0.14} $$
where t ↦ y(t) is the solution trajectory and σ(M) is the spectrum of the matrix M, see Def. 9.1.0.1.
The condition (12.2.0.14) has to be read as "the real parts of all eigenvalues are below a bound with small modulus". If this is not the case, then the exact solution will experience blow-up: it will change drastically over very short periods of time, and small timesteps will be required anyway in order to resolve this. y
➊ For the scalar IVP of Ex. 12.0.0.1 we find that the derivative of the right hand side at the stationary state y = 1 is ≈ −λ.
Hence, in case λ ≫ 1 as in Fig. 435, we face a stiff problem close to the stationary state y = 1. The observations made in Fig. 435 exactly match this prediction.
➋ The solution of the IVP from Ex. 12.2.0.4,
$$ \dot{y} = f(y) := \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} y + \lambda\,(1 - \|y\|_2^2)\, y , \qquad \|y_0\|_2 = 1 , \tag{12.2.0.5} $$
satisfies ‖y(t)‖₂ = 1 for all times. Using the product rule (8.5.1.17) of multi-dimensional differential calculus, we find
$$ D f(y) = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} + \lambda \left( -2\, y y^{\top} + (1 - \|y\|_2^2)\,\mathbf{I} \right) , $$
$$ \sigma(D f(y)) = \left\{ -\lambda - \sqrt{\lambda^2 - 1} ,\; -\lambda + \sqrt{\lambda^2 - 1} \right\} , \qquad \text{if } \|y\|_2 = 1 . $$
Thus, for λ ≫ 1, D f(y(t)) will always have an eigenvalue with large negative real part, whereas the other eigenvalue is close to zero: the IVP is stiff.
y
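The criteria (12.2.0.13)/(12.2.0.14) can also be checked numerically at states along an (approximate) trajectory. The following is a minimal C++/Eigen sketch, not part of the original notes: the helper names, the finite-difference increment, and the threshold value are ad-hoc choices of this illustration, and the limit-cycle ODE (12.2.0.5) with λ = 1000 is used only as a test case.
C++ code (sketch):
#include <Eigen/Dense>
#include <Eigen/Eigenvalues>
#include <cmath>
#include <functional>
#include <iostream>
#include <limits>

// Approximate the Jacobian Df(y) by one-sided finite differences
// (hypothetical helper; the increment sqrt(eps) is a common ad-hoc choice).
Eigen::MatrixXd approxJacobian(
    const std::function<Eigen::VectorXd(const Eigen::VectorXd &)> &f,
    const Eigen::VectorXd &y) {
  const double delta = std::sqrt(std::numeric_limits<double>::epsilon());
  const Eigen::VectorXd f0 = f(y);
  Eigen::MatrixXd J(f0.size(), y.size());
  for (int j = 0; j < y.size(); ++j) {
    Eigen::VectorXd yp = y;
    yp(j) += delta;
    J.col(j) = (f(yp) - f0) / delta;
  }
  return J;
}

// Check the stiffness indicators (12.2.0.13)/(12.2.0.14) at a state y:
// some eigenvalue of Df(y) has a large negative real part, while no
// eigenvalue has a significantly positive one. The threshold is arbitrary.
bool probablyStiffAt(
    const std::function<Eigen::VectorXd(const Eigen::VectorXd &)> &f,
    const Eigen::VectorXd &y, double threshold = 100.0) {
  const Eigen::VectorXcd ev = approxJacobian(f, y).eigenvalues();
  return (ev.real().minCoeff() < -threshold) &&
         (ev.real().maxCoeff() < 0.1 * threshold);
}

int main() {
  const double lambda = 1000.0;  // limit-cycle ODE (12.2.0.5) as a test case
  auto f = [lambda](const Eigen::VectorXd &y) -> Eigen::VectorXd {
    Eigen::VectorXd fy(2);
    fy << -y(1), y(0);
    return fy + lambda * (1.0 - y.squaredNorm()) * y;
  };
  Eigen::VectorXd y(2);
  y << 1.0, 0.0;
  std::cout << "probably stiff: " << probablyStiffAt(f, y) << std::endl;
}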
Remark 12.2.0.16 (Characteristics of stiff IVPs) Often one can already tell from the expected behavior
of the solution of an IVP, which is often clear from the modeling context, that one has to brace for stiffness.
An initial value problem for an autonomous ODE ẏ = f(y) will probably be stiff, if, for substantial periods of time,
$$ \min\{\operatorname{Re}\lambda : \lambda \in \sigma(D f(y(t)))\} \ll 0 \quad\text{and}\quad \max\{\operatorname{Re}\lambda : \lambda \in \sigma(D f(y(t)))\} \approx 0 , $$
where t ↦ y(t) is the solution trajectory and σ(M) is the spectrum of the matrix M, see Def. 9.1.0.1.
$$ \ddot{w} = -\sin w - \lambda\, \dot{w} . $$
where t ↦ g(t) is a given smooth excitation. Find out whether initial-value problems for (12.2.0.19) whose solution satisfies
$$ y(t) \approx \begin{bmatrix} -\beta^{-1} g(t) \\ 0 \end{bmatrix} $$
are stiff.
Hint. The eigenvalues of the matrix $\begin{bmatrix} 0 & 1 \\ -\beta & -\alpha \end{bmatrix}$ are λ₁ = ½(−α + ı√(4β − α²)) and λ₂ = ½(−α − ı√(4β − α²)).
△
Video tutorial for Section 12.3: Implicit Runge-Kutta Single-Step Methods: (50 minutes)
Download link, tablet notes
Explicit Runge-Kutta single step methods cannot escape the tight timestep constraints for stiff IVPs, which may render them inefficient, see § 12.1.0.49. In this section we are going to augment the class of Runge-Kutta methods by timestepping schemes that can cope well with stiff IVPs.
ẏ = λy , y(0) = 1 , λ < 0 .
We apply both the explicit Euler method (11.2.1.5) and the implicit Euler method (11.2.2.2) with uniform
timesteps h = 1/N , N ∈ {5, 10, 20, 40, 80, 160, 320, 640} and monitor the error at final time T = 1 for
different values of λ.
Fig. 455, Fig. 456: error at final time T = 1 (Euclidean norm) versus timestep h (doubly logarithmic) for the explicit Euler method (11.2.1.5) and the implicit Euler method (11.2.2.2) applied to the scalar model problem, for λ ∈ {−10, −30, −60, −90}, with reference slope O(h).
λ large: blow-up of yk for large timestep h λ large: stable for all timesteps h > 0 !
We observe onset of convergence of the implicit Euler method already for large timesteps h. y
§12.3.1.2 (Linear model problem analysis: implicit Euler method) We follow the considerations of § 12.1.0.4 and consider the implicit Euler method (11.2.2.2) for the linear model problem ẏ = λy, y(0) = y₀, λ ∈ C:
$$ y_{k+1} = y_k + h\lambda\, y_{k+1} \quad\Longleftrightarrow\quad y_{k+1} = \frac{1}{1 - \lambda h}\, y_k \quad\Longrightarrow\quad |y_{k+1}| < |y_k| \;\text{ whenever } \operatorname{Re}\lambda < 0 . $$
Without any timestep constraint we obtain the qualitatively correct behavior of (yk )k for Re λ < 0 and any
h > 0!
As in § 12.1.0.40 this analysis can be extended to linear systems of ODEs ẏ = My, M ∈ C N,N , by
means of diagonalization.
As in § 12.1.0.30 and § 12.1.0.40 we assume that M can be diagonalized, that is (12.1.0.32) holds:
V−1 MV = D with a regular matrix V ∈ C N,N and a diagonal matrix D ∈ C N,N containing the eigenval-
ues λ1 , . . . , λ N of M on its diagonal. Next, apply the decoupling by diagonalization idea to the recursion
of the implicit Euler method.
$$ \mathbf{V}^{-1} y_{k+1} = \mathbf{V}^{-1} y_k + h\, \underbrace{\mathbf{V}^{-1} \mathbf{M} \mathbf{V}}_{=\mathbf{D}} (\mathbf{V}^{-1} y_{k+1}) \quad\overset{z_k := \mathbf{V}^{-1} y_k}{\Longleftrightarrow}\quad (z_{k+1})_i = \frac{1}{1 - \lambda_i h}\, (z_k)_i , \tag{12.3.1.6} $$
which is an implicit Euler step for each of the decoupled scalar ODEs ż_i = λ_i z_i.
Crucial insight:
For any timestep h > 0, the implicit Euler method generates exponentially decaying solution sequences (y_k)_{k=0}^∞ for ẏ = My with diagonalizable matrix M ∈ R^{N,N} with eigenvalues λ₁, ..., λ_N, if Re λ_i < 0 for all i = 1, ..., N.
Thus we expect that the implicit Euler method will not face stability induced timestep constraints for stiff
problems (→ Notion 12.2.0.7). y
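As an illustration (not taken from the notes), here is a minimal C++/Eigen sketch of implicit Euler timestepping for a linear ODE ẏ = My: every step amounts to solving (I − hM) y_{k+1} = y_k, and for uniform timesteps a single LU factorization can be reused.
C++ code (sketch):
#include <Eigen/Dense>
#include <vector>

// Implicit Euler with uniform timestep h = T/n for the linear ODE y' = M*y:
// (I - h*M) * y_{k+1} = y_k. The LU factorization of I - h*M is computed
// once and reused in every step.
std::vector<Eigen::VectorXd> implicitEulerLinear(const Eigen::MatrixXd &M,
                                                 const Eigen::VectorXd &y0,
                                                 double T, unsigned int n) {
  const double h = T / n;
  const Eigen::MatrixXd A =
      Eigen::MatrixXd::Identity(M.rows(), M.cols()) - h * M;
  const Eigen::PartialPivLU<Eigen::MatrixXd> lu(A);  // factorize once
  std::vector<Eigen::VectorXd> y;
  y.reserve(n + 1);
  y.push_back(y0);
  for (unsigned int k = 0; k < n; ++k) y.push_back(lu.solve(y.back()));
  return y;
}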
Setting: We consider the general ordinary differential equation ẏ = f(t, y), f : I × D → R N locally
Lipschitz continuous, which guarantees the local existence of unique solutions of initial value problems,
see Thm. 11.1.3.17.
We define the single step method by specifying the first step y₀ = y(t₀) → y₁ ≈ y(t₁), where y₀ ∈ D is the initial state at initial time t₀ ∈ I, cf. Rem. 11.3.1.15. We assume that the exact solution trajectory t ↦ y(t) exists on [t₀, t₁]. Its use as a timestepping scheme on a temporal mesh (→ § 11.2.0.2) in the sense of Def. 11.3.1.5 is then straightforward.
➊ Pick a (finite-dimensional) trial space V of continuously differentiable functions [t₀, t₁] → R^N.
➋ Determine y_h ∈ V from the conditions
$$ y_h(t_0) = y_0 , \qquad \dot{y}_h(\tau_j) = f(\tau_j, y_h(\tau_j)) , \quad j = 1, \dots, s , \tag{12.3.2.3} $$
imposed at s distinct collocation points τ₁, ..., τ_s ∈ [t₀, t₁].
➌ Choose y₁ := y_h(t₁).
y
§12.3.2.4 (Polynomial collocation) Existence of the function yh : [t0 , t1 ] → R N satisfying (12.3.2.3) and
the possibility to compute it efficiently will crucially depend on the choice of the trial space V .
Our choice (the "standard option"): (componentwise) polynomial trial space V = (P_s)^N.
Recalling dim P_s = s + 1 from Thm. 5.2.1.2, we see that our choice makes the number N(s + 1) of conditions (N from the initial condition plus s·N collocation conditions) match the dimension of the trial space V.
Now we want to derive a concrete representation for the polynomial y_h. We draw on concepts introduced in Section 5.2.2. We define the collocation points as
$$ \tau_j := t_0 + c_j h , \quad j = 1, \dots, s , \qquad 0 \le c_1 < c_2 < \dots < c_s \le 1 , \quad h := t_1 - t_0 . $$
In each of its N components, the derivative ẏ_h is a polynomial of degree s − 1: ẏ_h ∈ (P_{s−1})^N. Hence, it has the following representation, compare (5.2.2.6):
$$ \dot{y}_h(t_0 + \xi h) = \sum_{j=1}^{s} \dot{y}_h(t_0 + c_j h)\, L_j(\xi) , \qquad 0 \le \xi \le 1 . \tag{12.3.2.5} $$
As τ_j = t₀ + c_j h, the collocation conditions (12.3.2.3) make it possible to replace ẏ_h(t₀ + c_j h) with an expression involving f:
$$ \dot{y}_h(t_0 + \xi h) \overset{(12.3.2.3)}{=} \sum_{j=1}^{s} k_j\, L_j(\xi) \quad\text{with "coefficients"}\quad k_j := f(t_0 + c_j h,\, y_h(t_0 + c_j h)) . $$
This yields the following formulas for the computation of y₁, which characterize the s-stage collocation single step method induced by the (normalized) collocation points c_j ∈ [0, 1], j = 1, ..., s:
$$ k_i = f\Big(t_0 + c_i h,\; y_0 + h \sum_{j=1}^{s} a_{ij} k_j\Big) , \qquad y_1 := y_h(t_1) = y_0 + h \sum_{i=1}^{s} b_i k_i , \quad\text{where}\quad a_{ij} := \int_0^{c_i} L_j(\tau)\, d\tau , \;\; b_i := \int_0^{1} L_i(\tau)\, d\tau . \tag{12.3.2.6} $$
Note that, since arbitrary y0 ∈ D, t0 , t1 ∈ I were admitted, this defines a discrete evolution Ψ : I × I ×
D → R N by Ψt0 ,t1 y0 := yh (t1 ). y
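For concrete collocation points the coefficients a_{ij}, b_i of (12.3.2.6) can be computed once and for all. The following small C++/Eigen sketch (my own helper, not from the notes) represents each Lagrange polynomial L_j in the monomial basis via an inverse Vandermonde matrix and integrates term by term; this simple approach is fine for the small values of s used in practice, although Vandermonde matrices become ill-conditioned for large s.
C++ code (sketch):
#include <Eigen/Dense>
#include <cmath>
#include <iostream>
#include <utility>

// Butcher coefficients (A, b) of the collocation single step method (12.3.2.6)
// for given normalized collocation points c_1,...,c_s:
//   a_ij = int_0^{c_i} L_j(tau) dtau,   b_j = int_0^1 L_j(tau) dtau.
// The monomial coefficients of L_j form the j-th column of the inverse of the
// Vandermonde matrix V(i,k) = c_i^k.
std::pair<Eigen::MatrixXd, Eigen::VectorXd>
collocationButcher(const Eigen::VectorXd &c) {
  const int s = c.size();
  Eigen::MatrixXd V(s, s);
  for (int i = 0; i < s; ++i)
    for (int k = 0; k < s; ++k) V(i, k) = std::pow(c(i), k);
  const Eigen::MatrixXd C = V.inverse();  // C(k,j) = monomial coeffs of L_j
  Eigen::MatrixXd A(s, s);
  Eigen::VectorXd b(s);
  for (int j = 0; j < s; ++j) {
    b(j) = 0.0;
    for (int k = 0; k < s; ++k) b(j) += C(k, j) / (k + 1);  // int_0^1 tau^k
    for (int i = 0; i < s; ++i) {
      A(i, j) = 0.0;
      for (int k = 0; k < s; ++k)
        A(i, j) += C(k, j) * std::pow(c(i), k + 1) / (k + 1);  // int_0^{c_i}
    }
  }
  return {A, b};
}

int main() {
  // Implicit midpoint method: s = 1, c_1 = 1/2  ->  A = [1/2], b = [1]
  Eigen::VectorXd c(1);
  c << 0.5;
  auto [A, b] = collocationButcher(c);
  std::cout << "A =\n" << A << "\nb = " << b.transpose() << std::endl;
}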
Remark 12.3.2.7 (Implicit nature of collocation single step methods) Note that (12.3.2.6) represents
a generically non-linear system of s · N equations for the s · N components of the vectors ki , i = 1, . . . , s.
Usually, it will not be possible to obtain the increments ki ∈ R N by a fixed number of evaluations of f. For
this reason the single step methods defined by (12.3.2.6) are called implicit.
With similar arguments as in Rem. 11.2.2.3 one can prove that for sufficiently small |t1 − t0 | a unique set
of solution vectors k1 , . . . , ks can be found. y
§12.3.2.8 (Collocation single step methods and quadrature) Clearly, in the case N = 1, f(t, y) = f(t), y₀ = 0, the computation of y₁ boils down to the evaluation of a quadrature formula on [t₀, t₁], because from (12.3.2.6) we get
$$ y_1 = h \sum_{i=1}^{s} b_i\, f(t_0 + c_i h) , \qquad b_i := \int_0^{1} L_i(\tau)\, d\tau , \tag{12.3.2.9} $$
which is a polynomial quadrature formula (7.3.0.2) on [0, 1] with nodes c_j transformed to [t₀, t₁] according to (7.2.0.5). y
EXPERIMENT 12.3.2.10 (Empiric Convergence of collocation single step methods) We consider the
initial value problem for the scalar logistic ODE
We perform numerical integration by timestepping with uniform timestep h based on collocation single step methods for different choices of collocation points.
➊ Equidistant collocation points, c_j = j/(s+1), j = 1, ..., s.
We observe algebraic convergence of the error max_k |y_h(t_k) − y(t_k)| with the empiric rates
s = 1 : p = 1.96 ,  s = 2 : p = 2.03 ,  s = 3 : p = 4.00 ,  s = 4 : p = 4.04 .
Fig. 457: error max_k |y_h(t_k) − y(t_k)| versus timestep h (doubly logarithmic) for s = 1, 2, 3, 4.
In this case we conclude the following (empiric) order (→ Def. 11.3.2.8) of the collocation single step method:
$$ \text{(empiric) order} = \begin{cases} s & \text{for even } s , \\ s + 1 & \text{for odd } s . \end{cases} $$
Next, we recall from § 7.4.2.15 an exceptional set of quadrature points, the Gauss points, provided by the
zeros of the L2 ([−1, 1])-orthogonal Legendre polynomials, see Fig. 269.
➋ Collocation points chosen as the s Gauss points on [0, 1]. Now the empiric rates are
s = 1 : p = 1.96 ,  s = 2 : p = 4.01 ,  s = 3 : p = 6.00 ,  s = 4 : p = 8.02 .
Fig. 458: error versus timestep h (doubly logarithmic) for Gauss collocation with s = 1, 2, 3, 4.
Obviously, the (empiric) order (→ Def. 11.3.2.8) of the Gauss collocation single step method satisfies
(empiric) order = 2s .
Note that the 1-stage Gauss collocation single step method is the implicit midpoint method from Sec-
tion 11.2.3. y
§12.3.2.11 (Order of collocation single step method) What we have observed in Exp. 12.3.2.10 reflects a general pattern:
Theorem 12.3.2.12. Order of collocation single step method [DB02, Satz 6.40]
Provided that f ∈ C p ( I × D ), the order (→ Def. 11.3.2.8) of an s-stage collocation single step
method according to (12.3.2.6) agrees with the order (→ Def. 7.4.1.1) of the quadrature formula on
[0, 1] with nodes c j and weights b j , j = 1, . . . , s.
This also explains the surprisingly high order of the Gauss collocation single-step methods: for s-point Gauss-Legendre quadrature, the family of quadrature rules with the Gauss points as nodes, Section 7.4.2 established order 2s.
➣ By Thm. 7.4.2.11 the s-stage Gauss collocation single step method whose nodes c j are chosen as the
s Gauss points on [0, 1] is of order 2s.
y
Definition 12.3.3.1. General Runge-Kutta single step method (cf. Def. 11.4.0.11)
For coefficients b_i, a_{ij} ∈ R and c_i := Σ_{j=1}^{s} a_{ij}, i, j = 1, ..., s, s ∈ N, the recursion
$$ k_i := f\Big(t_0 + c_i h,\; y_0 + h \sum_{j=1}^{s} a_{ij} k_j\Big) , \quad i = 1, \dots, s , \qquad y_1 := y_0 + h \sum_{i=1}^{s} b_i k_i , $$
defines an s-stage Runge-Kutta single step method (RK-SSM) for the ODE ẏ = f(t, y); the k_i ∈ R^N are called increments.
Note that the computation of the increments ki may now require the solution of (non-linear) systems of
equations of size s · N . In this case we speak about an “implicit” method, cf. Rem. 12.3.2.7.
The Butcher scheme notation introduced in (11.4.0.13) can easily be adapted to the case of general RK-SSMs by dropping the requirement that the Butcher matrix be strictly lower triangular.
Many of the techniques and much of the theory discussed for explicit RK-SSMs carry over to general
(implicit) Runge-Kutta single step methods:
• Sufficient condition for consistency from Cor. 11.4.0.15
• Algebraic convergence for meshwidth h → 0 and the related concept of order (→ Def. 11.3.2.8)
• Embedded methods and algorithms for adaptive stepsize control from Section 11.5
§12.3.3.4 (Butcher schemes for Gauss collocation RK-SSMs) As in (11.4.0.13) we can arrange the coefficients of Gauss collocation single-step methods in the form of a Butcher scheme and get
$$ \text{for } s = 1: \quad \begin{array}{c|c} 1/2 & 1/2 \\ \hline & 1 \end{array} , \tag{12.3.3.5a} $$
$$ \text{for } s = 2: \quad \begin{array}{c|cc} 1/2 - \sqrt{3}/6 & 1/4 & 1/4 - \sqrt{3}/6 \\ 1/2 + \sqrt{3}/6 & 1/4 + \sqrt{3}/6 & 1/4 \\ \hline & 1/2 & 1/2 \end{array} , \tag{12.3.3.5b} $$
$$ \text{for } s = 3: \quad \begin{array}{c|ccc} 1/2 - \sqrt{15}/10 & 5/36 & 2/9 - \sqrt{15}/15 & 5/36 - \sqrt{15}/30 \\ 1/2 & 5/36 + \sqrt{15}/24 & 2/9 & 5/36 - \sqrt{15}/24 \\ 1/2 + \sqrt{15}/10 & 5/36 + \sqrt{15}/30 & 2/9 + \sqrt{15}/15 & 5/36 \\ \hline & 5/18 & 4/9 & 5/18 \end{array} . \tag{12.3.3.5c} $$
y
Remark 12.3.3.6 (Stage form equations for increments) In Def. 12.3.3.1, instead of the increments we can consider as unknowns the so-called stages
$$ g_i := h \sum_{j=1}^{s} a_{ij} k_j \in \mathbb{R}^N , \quad i = 1, \dots, s \qquad\Longleftrightarrow\qquad k_i = f(t_0 + c_i h,\, y_0 + g_i) . \tag{12.3.3.7} $$
This leads to the equivalent defining equations in "stage form" for an implicit RK-SSM:
$$ k_i := f\Big(t_0 + c_i h,\; y_0 + h \sum_{j=1}^{s} a_{ij} k_j\Big) , \quad i = 1, \dots, s \qquad\Downarrow $$
$$ g_i = h \sum_{j=1}^{s} a_{ij}\, f(t_0 + c_j h,\, y_0 + g_j) , \qquad y_1 = y_0 + h \sum_{i=1}^{s} b_i\, f(t_0 + c_i h,\, y_0 + g_i) . \tag{12.3.3.8} $$
In terms of implementation there is no difference: Also the stage equations (12.3.3.8) are usually solved
by means of Newton’s method, see next remark. y
Remark 12.3.3.9 (Solving the stage equations for implicit RK-SSMs) We reformulate the increment equations in stage form (12.3.3.8) as a non-linear system of equations in standard form F(x) = 0: collecting the stages in g := [g₁; ...; g_s] ∈ R^{sN}, we set
$$ F(g) := g - h\, (\mathbf{A} \otimes \mathbf{I}_N) \begin{bmatrix} f(t_0 + c_1 h,\, y_0 + g_1) \\ \vdots \\ f(t_0 + c_s h,\, y_0 + g_s) \end{bmatrix} , $$
where I_N is the N × N identity matrix and ⊗ designates the Kronecker product introduced in Def. 1.4.3.7.
We compute an approximate solution of F(g) = 0 iteratively by means of the simplified Newton method presented in Rem. 8.5.1.43. This is a Newton method with "frozen Jacobian". As g → 0 for h → 0, we choose zero as initial guess:
$$ g^{(k+1)} := g^{(k)} - D F(0)^{-1} F(g^{(k)}) , \quad k = 0, 1, 2, \dots , \qquad g^{(0)} := 0 . $$
Obviously, D F(0) → I for h → 0. Thus, D F(0) will be regular for sufficiently small h.
In each step of the simplified Newton method we have to solve a linear system of equations with coefficient
matrix D F (0). If s · N is large, an efficient implementation has to reuse the LU-decomposition of D F (0),
see Code 8.5.1.44 and Rem. 2.5.0.10. y
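The following C++/Eigen sketch (my own, not from the notes) implements one step of an implicit RK-SSM for an autonomous ODE along these lines: the Jacobian is frozen at y₀, the matrix DF(0) = I − h(A ⊗ J) is assembled blockwise, and a single LU factorization is reused in all simplified-Newton iterations. For simplicity a fixed number of iterations is performed instead of testing a stopping criterion.
C++ code (sketch):
#include <Eigen/Dense>
#include <functional>

// One step y0 -> y1 of an s-stage implicit RK-SSM (Butcher matrix A, weights b)
// for the autonomous ODE y' = f(y). The stage equations (12.3.3.8) are solved
// by the simplified Newton method with frozen Jacobian J = Df(y0).
Eigen::VectorXd implicitRKStep(
    const std::function<Eigen::VectorXd(const Eigen::VectorXd &)> &f,
    const std::function<Eigen::MatrixXd(const Eigen::VectorXd &)> &df,
    const Eigen::MatrixXd &A, const Eigen::VectorXd &b,
    const Eigen::VectorXd &y0, double h, unsigned int maxit = 10) {
  const int s = b.size(), N = y0.size();
  const Eigen::MatrixXd J = df(y0);
  // Assemble DF(0) = I_{sN} - h * (A (x) J) blockwise (Kronecker product).
  Eigen::MatrixXd DF = Eigen::MatrixXd::Identity(s * N, s * N);
  for (int i = 0; i < s; ++i)
    for (int j = 0; j < s; ++j) DF.block(i * N, j * N, N, N) -= h * A(i, j) * J;
  const Eigen::PartialPivLU<Eigen::MatrixXd> lu(DF);  // factorize once

  Eigen::VectorXd g = Eigen::VectorXd::Zero(s * N);   // initial guess g = 0
  for (unsigned int it = 0; it < maxit; ++it) {
    // F(g) = g - h * (A (x) I_N) * [f(y0 + g_1); ... ; f(y0 + g_s)]
    Eigen::MatrixXd fg(N, s);
    for (int j = 0; j < s; ++j) fg.col(j) = f(y0 + g.segment(j * N, N));
    Eigen::VectorXd F = g;
    for (int i = 0; i < s; ++i)
      for (int j = 0; j < s; ++j) F.segment(i * N, N) -= h * A(i, j) * fg.col(j);
    g -= lu.solve(F);  // simplified Newton update with frozen Jacobian
  }
  // y1 = y0 + h * sum_i b_i * f(y0 + g_i)
  Eigen::VectorXd y1 = y0;
  for (int i = 0; i < s; ++i) y1 += h * b(i) * f(y0 + g.segment(i * N, N));
  return y1;
}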
Theorem 12.3.4.4. Stability function of Runge-Kutta single step methods
The discrete evolution Ψ_λ^h of an s-stage Runge-Kutta single step method (→ Def. 12.3.3.1) with Butcher scheme $\begin{array}{c|c} c & \mathbf{A} \\ \hline & b^{\top} \end{array}$ (see (12.3.3.3)) for the scalar ODE ẏ = λy is given by multiplication with the stability function
$$ S(z) := 1 + z\, b^{\top} (\mathbf{I} - z\mathbf{A})^{-1} \mathbf{1} = \frac{\det(\mathbf{I} - z\mathbf{A} + z\,\mathbf{1} b^{\top})}{\det(\mathbf{I} - z\mathbf{A})} , \qquad z := \lambda h , \quad \mathbf{1} = [1, \dots, 1]^{\top} \in \mathbb{R}^{s} . $$
EXAMPLE 12.3.4.5 (Regions of stability for simple implicit RK-SSMs) We determine the Butcher schemes (12.3.3.3) for simple implicit RK-SSMs and apply the formula from Thm. 12.3.4.4 to compute their stability functions.
• Implicit Euler method: Butcher scheme $\begin{array}{c|c} 1 & 1 \\ \hline & 1 \end{array}$ ➣ S(z) = 1/(1 − z).
• Implicit midpoint method: Butcher scheme $\begin{array}{c|c} 1/2 & 1/2 \\ \hline & 1 \end{array}$ ➣ S(z) = (1 + z/2)/(1 − z/2).
Their regions of stability S_Ψ := {z ∈ C : |S(z)| < 1} ⊂ C are displayed in the following figures.
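The formula of Thm. 12.3.4.4 can also be evaluated numerically, for instance to sample |S(z)| on a grid when plotting regions of stability. A tiny C++/Eigen sketch (my own helper, not from the notes):
C++ code (sketch):
#include <Eigen/Dense>
#include <complex>
#include <iostream>

// Evaluate the stability function S(z) = 1 + z*b^T*(I - z*A)^{-1}*1 of an
// RK-SSM with Butcher matrix A and weight vector b at a complex argument z.
std::complex<double> stabilityFunction(const Eigen::MatrixXd &A,
                                       const Eigen::VectorXd &b,
                                       std::complex<double> z) {
  const int s = b.size();
  const Eigen::MatrixXcd M =
      Eigen::MatrixXcd::Identity(s, s) - z * A.cast<std::complex<double>>();
  const Eigen::VectorXcd w = M.lu().solve(Eigen::VectorXcd::Ones(s));
  return 1.0 + z * b.cast<std::complex<double>>().dot(w);
}

int main() {
  // Implicit Euler: A = [1], b = [1]  ->  S(z) = 1/(1-z), so S(-2) = 1/3
  Eigen::MatrixXd A(1, 1); A << 1.0;
  Eigen::VectorXd b(1);    b << 1.0;
  std::cout << stabilityFunction(A, b, {-2.0, 0.0}) << std::endl;
}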
Fig. 459: region of stability S_Ψ of the implicit Euler method (11.2.2.2), the exterior of the disk {|z − 1| ≤ 1}. Fig. 460: region of stability S_Ψ of the implicit midpoint method (11.2.3.3), the left half-plane.
From the determinant formula for the stability function S(z) we can conclude a generalization of
Cor. 12.1.0.20.
For a consistent (→ Def. 11.3.1.12) s-stage general Runge-Kutta single step method according to Def. 12.3.3.1 the stability function S is a non-constant rational function of the form S(z) = P(z)/Q(z) with polynomials P ∈ P_s, Q ∈ P_s.
Of course, a rational function z ↦ S(z) can satisfy lim_{|z|→∞} |S(z)| < 1, as we have seen in Ex. 12.3.4.5. As a consequence, the region of stability for implicit RK-SSMs need not be bounded.
§12.3.4.7 (A-stability) A general RK-SSM with stability function S applied to the scalar linear IVP ẏ = λy, y(0) = y₀ ∈ C, λ ∈ C, with uniform timestep h > 0 will yield the sequence (y_k)_{k=0}^∞ defined by
$$ y_k = S(z)^k\, y_0 , \qquad z = \lambda h . \tag{12.3.4.8} $$
Hence, the next property of a RK-SSM guarantees that the sequence of approximations decays exponen-
tially whenever the exact solution of the model problem IVP (12.1.0.5) does so.
A Runge-Kutta single step method is called A-stable, if its region of stability S_Ψ (→ Def. 12.1.0.51) satisfies
$$ \mathbb{C}^- := \{ z \in \mathbb{C} : \operatorname{Re} z < 0 \} \subset S_\Psi . $$
From Ex. 12.3.4.5 we conclude that both the implicit Euler method and the implicit midpoint method are
A-stable.
A-stable Runge-Kutta single step methods will not be affected by stability induced timestep constraints
when applied to stiff IVP (→ Notion 12.2.0.7).
§12.3.4.10 ("Ideal" region of stability) In order to reproduce the qualitative behavior of the exact solution, a single step method, when applied to the scalar linear IVP ẏ = λy, y(0) = y₀ ∈ C, λ ∈ C, with uniform timestep h > 0,
• should yield an exponentially decaying sequence (y_k)_{k=0}^∞ whenever Re λ < 0,
• should produce an exponentially increasing sequence (y_k)_{k=0}^∞ whenever Re λ > 0.
Thus, in light of (12.3.4.8), we agree that the "ideal" region of stability is S_Ψ = C⁻, cf. (12.3.4.11).
Regions of stability of Gauss collocation single step methods, see Exp. 12.3.2.10:
Fig. 461, Fig. 462, Fig. 463: level lines of |S(z)| in the complex plane for the Gauss collocation single step methods with s = 1, 2, 3.
Theorem 12.3.4.12. Region of stability of Gauss collocation single step methods [DB02,
Satz 6.44]
s-stage Gauss collocation single step methods defined by (12.3.2.6), with the nodes c_j given by the s Gauss points on [0, 1], feature the "ideal" stability domain:
$$ S_\Psi = \mathbb{C}^- . \tag{12.3.4.11} $$
EXPERIMENT 12.3.4.13 (Implicit RK-SSMs for stiff IVP) We consider the stiff IVP
whose solution essentially is the smooth function t 7→ sin(2πt). Applying the criteria (12.2.0.13) and
(12.2.0.14) we immediately see that this IVP is extremely stiff.
We solve it with different implicit RK-SSMs on [0, 1] with the large uniform timestep h = 1/20.
Fig. 464: exact solution y(t) and approximate solutions on [0, 1] obtained with the implicit Euler method and Gauss collocation RK-SSMs with s = 1, 2, 3, 4. Fig. 465: Re(S(z)) for z ∈ [−1000, 0] for the implicit Euler method and the Gauss collocation RK-SSMs with s = 1, 2, 3, 4, compared with exp(z).
We observe that Gauss collocation RK-SSMs incur a huge discretization error, whereas the simple implicit
Euler method provides a perfect approximation!
The reason is that the stability functions of the Gauss collocation RK-SSMs satisfy
$$ \lim_{|z| \to \infty} |S(z)| = 1 . $$
Hence, when they are applied to ẏ = λy with extremely large (in modulus) λ < 0, they will produce sequences that decay only very slowly or even oscillate, which misses the very rapid decay of the exact solution. The stability function of the implicit Euler method is S(z) = (1 − z)^{-1} and satisfies lim_{|z|→∞} S(z) = 0, which means a fast exponential decay of the y_k. y
§12.3.4.14 (L-stability) In light of what we learned in the previous experiment we can now state what we
expect from the stability function of a Runge-Kutta method that is suitable for stiff IVP (→ Notion 12.2.0.7):
A Runge-Kutta single step method with stability function S is called L-stable (→ Def. 12.3.4.15), if it is A-stable and, in addition, S(−∞) := lim_{z→−∞} S(z) = 0.
For a rational function S(z) = P(z)/Q(z) the limit for |z| → ∞ exists and can easily be expressed by the leading coefficients of the polynomials P and Q. In particular,
$$ \text{if } b^{\top} = (\mathbf{A})_{s,:} \;\text{ (the last row of } \mathbf{A}\text{)} \quad\Longrightarrow\quad S(-\infty) = 0 . \tag{12.3.4.20} $$
A closer look at the coefficient formulas of (12.3.2.6) reveals that the algebraic condition (12.3.4.20) will automatically be satisfied for a collocation single step method with c_s = 1! y
EXAMPLE 12.3.4.21 (L-stable implicit Runge-Kutta methods) There is a family of s-point quadrature formulas on [0, 1] with a node located in 1 and (maximal) order 2s − 1: the Gauss-Radau formulas. They induce the L-stable Gauss-Radau collocation single step methods of order 2s − 1 according to Thm. 12.3.2.12. Their Butcher schemes for s = 1, 2, 3 read
$$ \begin{array}{c|c} 1 & 1 \\ \hline & 1 \end{array} , \qquad \begin{array}{c|cc} 1/3 & 5/12 & -1/12 \\ 1 & 3/4 & 1/4 \\ \hline & 3/4 & 1/4 \end{array} , \qquad \begin{array}{c|ccc} \frac{4-\sqrt{6}}{10} & \frac{88-7\sqrt{6}}{360} & \frac{296-169\sqrt{6}}{1800} & \frac{-2+3\sqrt{6}}{225} \\ \frac{4+\sqrt{6}}{10} & \frac{296+169\sqrt{6}}{1800} & \frac{88+7\sqrt{6}}{360} & \frac{-2-3\sqrt{6}}{225} \\ 1 & \frac{16-\sqrt{6}}{36} & \frac{16+\sqrt{6}}{36} & \frac{1}{9} \\ \hline & \frac{16-\sqrt{6}}{36} & \frac{16+\sqrt{6}}{36} & \frac{1}{9} \end{array} . $$
Their stability functions are rational functions of the form
$$ S(z) = \frac{P(z)}{Q(z)} , \qquad P \in \mathcal{P}_{s-1} , \; Q \in \mathcal{P}_s , $$
so that S(z) → 0 for |z| → ∞.
Level lines of |S(z)| in the complex plane for the Gauss-Radau collocation single step methods with s = 1, 2, 3.
EXPERIMENT 12.3.4.22 (Gauss-Radau collocation SSM for stiff IVP) We revisit the stiff IVP from
Ex. 12.0.0.1
We compare the sequences generated by 1-stage and 2-stage Gauss collocation and Gauss-Radau col-
location SSMs, respectively (uniform timestep).
Two plots (uniform mesh, h = 1/60): exact solution y(t) and approximate solutions on [0, 1] obtained with Gauss collocation SSMs (s = 1, 2, left) and Gauss-Radau collocation SSMs (s = 1, 2, right).
The 2nd-order Gauss collocation SSM (implicit midpoint method) suffers from spurious oscillations when
homing in on the stable stationary state y = 1. The explanation from Exp. 12.3.4.13 also applies to this
example.
The fourth-order Gauss method is already so accurate that potential overshoots when approaching y = 1
are damped fast enough. y
(Q12.3.4.23.A) Show that the stability function of the implicit Euler method, defined by the recursion
$$ y_{k+1} = y_k + h_k\, f(t_{k+1}, y_{k+1}) , \qquad h_k := t_{k+1} - t_k , $$
is S(z) = 1/(1 − z).
When will one observe a totally wrong qualitative behavior of the sequence (yk ) of states generated by
the implicit Euler method applied to the scalar growth ODE ẏ = λy, λ > 0?
Hint. What kind of function is the stability function for an implicit RK-SSM?
(Q12.3.4.23.C) We apply a general Runge-Kutta single-step method to the autonomous affine-linear ODE
ẏ = My + b, M ∈ R N,N , b ∈ R N , N ∈ N. Describe the linear system of equations that has to be
solved in every timestep.
The defining equations of a general Runge-Kutta single-step method applied to the ODE ẏ = f(t, y) are those of Def. 12.3.3.1.
(Q12.3.4.23.D) Show that the single-step methods arising from the polynomial collocation approach with
s ∈ N collocation points will always be consistent.
Hint. The general formulas for a single-step method constructed via the polynomial collocation approach
with normalized collocation points c1 , c2 , . . . , cs are
$$ k_i = f\Big(t_0 + c_i h,\; y_0 + h \sum_{j=1}^{s} a_{ij} k_j\Big) , \qquad y_1 := y_h(t_1) = y_0 + h \sum_{i=1}^{s} b_i k_i , \quad\text{where}\quad a_{ij} := \int_0^{c_i} L_j(\tau)\, d\tau , \;\; b_i := \int_0^{1} L_i(\tau)\, d\tau , \tag{12.3.2.6} $$
where { L1 , . . . , Ls } ⊂ Ps−1 are the Lagrange polynomials associated with the node set {c1 , c2 , . . . , cs }
on [0, 1].
Also remember that a single-step method for the ODE ẏ = f(y) is consistent if and only if its associated discrete evolution is of the form
$$ \Psi^h y = y + h\, \psi(h, y) \quad\text{with}\quad \psi : I \times D \to \mathbb{R}^N \text{ continuous}, \quad \psi(0, y) = f(y) . \tag{11.3.1.11} $$
(Q12.3.4.23.E) Let $\begin{array}{c|c} c & \mathbf{A} \\ \hline & b^{\top} \end{array}$ be the Butcher scheme for an s-stage collocation single-step method. Show that
$$ (\mathbf{A})_{s,:} = b^{\top} , $$
which, for an A-stable method is a sufficient condition for L-stability.
△
Video tutorial for Section 12.4: Semi-Implicit Runge-Kutta Methods: (13 minutes)
Download link, tablet notes
From Section 12.3.3 recall the formulas for general/implicit Runge-Kutta single-step methods for the ODE
ẏ = f(t, y):
Definition 12.3.3.1. General Runge-Kutta single-step method
Remember that we compute approximate solutions anyway, and the increments are weighted with the
stepsize h ≪ 1, see Def. 12.3.3.1. So there is no point in determining them with high accuracy!
Idea: Use only a fixed small number of Newton steps to solve for the ki , i = 1, . . . , s.
EXAMPLE 12.4.0.1 (Semi-implicit Euler single-step method) We apply the above idea to the implicit
Euler method introduced in Section 11.2.2. For the sake of simplicity we consider the autonomous ODE
ẏ = f(y), f : D ⊂ R N → R N .
The recursion for the implicit Euler method with (local) stepsize h > 0 is
$$ y_{k+1} = y_k + h\, f(y_{k+1}) \quad\Longleftrightarrow\quad F(y_{k+1}) = 0 \;\text{ for }\; F(y) := y - h\, f(y) - y_k . $$
A single Newton step (8.5.1.6) applied to F(y) = 0 with the natural initial guess y_k yields
$$ y_{k+1} = y_k + (\mathbf{I} - h\, D f(y_k))^{-1}\, h\, f(y_k) . \tag{12.4.0.2} $$
Note that for a linear ODE with f(y) = My, M ∈ R^{N,N}, we recover the original implicit Euler method! y
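A minimal C++/Eigen sketch of the recursion (12.4.0.2), not taken from the notes: one linear solve with the matrix I − h Df(y_k) per step replaces the full Newton iteration of the implicit Euler method.
C++ code (sketch):
#include <Eigen/Dense>
#include <functional>

// One step of the semi-implicit (linearly implicit) Euler method (12.4.0.2)
// for the autonomous ODE y' = f(y):
//   y_{k+1} = y_k + (I - h*Df(y_k))^{-1} * h * f(y_k),
// i.e. a single Newton step for the implicit Euler equation with initial
// guess y_k; only one linear solve is required per timestep.
Eigen::VectorXd semiImplicitEulerStep(
    const std::function<Eigen::VectorXd(const Eigen::VectorXd &)> &f,
    const std::function<Eigen::MatrixXd(const Eigen::VectorXd &)> &df,
    const Eigen::VectorXd &y, double h) {
  const Eigen::MatrixXd A =
      Eigen::MatrixXd::Identity(y.size(), y.size()) - h * df(y);
  return y + A.lu().solve(h * f(y));
}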
EXPERIMENT 12.4.0.3 (Empiric convergence of semi-implicit Euler single-step method)
✦ We consider an initial value problem for the logistic ODE, see Ex. 11.1.2.1.
✦ We run the implicit Euler method (11.2.2.2) and the semi-implicit Euler method (12.4.0.2) with uniform timestep h = 1/n, n ∈ {5, 8, 11, 17, 25, 38, 57, 85, 128, 192, 288, 432, 649, 973, 1460, 2189, 3284, 4926, 7389}.
Fig. 470: error versus timestep h (doubly logarithmic) for the implicit Euler and the semi-implicit Euler method, with reference slope O(h).
We observe that the approximate solution of the defining equation for y_{k+1} by a single Newton step preserves the first-order convergence of the implicit Euler method: the semi-implicit Euler method is also of first order. y
EXPERIMENT 12.4.0.4 (Convergence of semi-implicit midpoint method) Again, we tackle the IVP from Exp. 12.4.0.3 (logistic ODE, y₀ = 0.1, λ = 5).
✦ Now we use the implicit midpoint method (11.2.3.3) and its semi-implicit counterpart, obtained by replacing the increment equations of an implicit RK-SSM with their linearizations at y₀:
$$ k_i := f\Big(y_0 + h \sum_{j=1}^{s} a_{ij} k_j\Big) , \; i = 1, \dots, s \qquad\leadsto\qquad k_i = f(y_0) + h\, D f(y_0) \Big( \sum_{j=1}^{s} a_{ij} k_j \Big) , \; i = 1, \dots, s . \tag{12.4.0.5} $$
The good news is that all results about stability derived from model problem analysis (→ Section 12.1)
remain valid despite linearization of the increment equations:
Linearization does nothing for linear ODEs ➢ stability function (→ Thm. 12.3.4.4) not affected!
The bad news is that the preservation of the order observed in Exp. 12.4.0.3 will no longer hold in the
general case.
✦ Increments computed from the linearized equations (12.4.0.5) for the 2-stage RADAU method ("semi-implicit RADAU").
✦ We monitor the error through err = max_{j=1,...,n} |y_j − y(t_j)|.
Fig. 472: error versus timestep h (doubly logarithmic) for RADAU (s = 2) and semi-implicit RADAU, with reference slopes O(h³) and O(h²): the linearization reduces the observed order from 3 to 2.
§12.4.0.8 (Rosenbrock-Wanner methods) We have just seen that the simple linearization according to
(12.4.0.5) will degrade the order of implicit RK-SSMs and leads to a substantial loss of accuracy. This is
not an option.
Yet, the idea behind (12.4.0.5) has been refined. One does not start from a known RK-SSM, but introduces
general coefficients for structurally linear increment equations.
$$ (\mathbf{I} - h\, a_{ii}\, \mathbf{J})\, k_i = f\Big(y_0 + h \sum_{j=1}^{i-1} (a_{ij} + d_{ij})\, k_j\Big) - h\, \mathbf{J} \sum_{j=1}^{i-1} d_{ij} k_j , \qquad \mathbf{J} = D f(y_0) , \qquad y_1 := y_0 + h \sum_{j=1}^{s} b_j k_j . \tag{12.4.0.9} $$
Then the coefficients aij , dij , and bi are determined from order conditions by solving large non-linear
systems of equations.
In each step s linear systems with coefficient matrices I − haii J have to be solved. For methods used in
practice one often demands that aii = γ for all i = 1, . . . , s. As a consequence, we have to solve s linear
systems with the same coefficient matrix I − hγJ ∈ R N,N , which permits us to reuse LU-factorizations,
see Rem. 2.5.0.10. y
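A compact C++/Eigen sketch of one Rosenbrock-Wanner step (12.4.0.9), not from the notes: it assumes a_{ii} = γ for all stages so that a single LU factorization of I − hγJ can be reused, and it expects the coefficient arrays A (strictly lower triangular part), D, and b to be supplied by the caller; no concrete ROW scheme is hard-coded.
C++ code (sketch):
#include <Eigen/Dense>
#include <functional>
#include <vector>

// One step y0 -> y1 of a Rosenbrock-Wanner method (12.4.0.9) for the
// autonomous ODE y' = f(y), with constant diagonal a_ii = gamma.
Eigen::VectorXd rosenbrockWannerStep(
    const std::function<Eigen::VectorXd(const Eigen::VectorXd &)> &f,
    const std::function<Eigen::MatrixXd(const Eigen::VectorXd &)> &df,
    const Eigen::MatrixXd &A, const Eigen::MatrixXd &D,
    const Eigen::VectorXd &b, double gamma,
    const Eigen::VectorXd &y0, double h) {
  const int s = b.size(), N = y0.size();
  const Eigen::MatrixXd J = df(y0);
  const Eigen::PartialPivLU<Eigen::MatrixXd> lu(
      Eigen::MatrixXd::Identity(N, N) - h * gamma * J);  // factorize once
  std::vector<Eigen::VectorXd> k(s);
  for (int i = 0; i < s; ++i) {
    Eigen::VectorXd arg = y0, corr = Eigen::VectorXd::Zero(N);
    for (int j = 0; j < i; ++j) {
      arg += h * (A(i, j) + D(i, j)) * k[j];
      corr += D(i, j) * k[j];
    }
    // Solve (I - h*gamma*J) k_i = f(arg) - h*J*corr, cf. (12.4.0.9)
    k[i] = lu.solve(f(arg) - h * J * corr);
  }
  Eigen::VectorXd y1 = y0;
  for (int j = 0; j < s; ++j) y1 += h * b(j) * k[j];
  return y1;
}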
Derive the defining equation of the semi-implicit variant, which arises from solving the defining equation
for yk+1 by a single Newton step with initial guess yk .
(Q12.4.0.10.B) [Stability function of ROW-SSM] A Rosenbrock-Wanner (ROW) single-step method for
the autonomous ODE ẏ = f(y) can be defined by
$$ (\mathbf{I} - h\, a_{ii}\, \mathbf{J})\, k_i = f\Big(y_0 + h \sum_{j=1}^{i-1} (a_{ij} + d_{ij})\, k_j\Big) - h\, \mathbf{J} \sum_{j=1}^{i-1} d_{ij} k_j , \qquad \mathbf{J} = D f(y_0) , \qquad y_1 := y_0 + h \sum_{j=1}^{s} b_j k_j . \tag{12.4.0.9} $$
Video tutorial for Section 12.5: Splitting Methods: (21 minutes) Download link, tablet notes
§12.5.0.1 (Splitting idea: composition of partial evolutions) Many relevant ordinary differential equations feature a right hand side function that is the sum of two (or more) terms. Consider an autonomous IVP with a right hand side function that can be split in an additive fashion:
$$ \dot{y} = f(y) + g(y) , \qquad y(0) = y_0 . $$
Let us introduce the evolution operators (→ Def. 11.1.4.3) for both summands:
$$ \Phi_f^t \;\leftrightarrow\; \text{ODE } \dot{y} = f(y) , \qquad \Phi_g^t \;\leftrightarrow\; \text{ODE } \dot{y} = g(y) . $$
Temporarily we assume that both Φ_f^t, Φ_g^t are available in the form of analytic formulas or highly accurate
approximations.
Idea: Build single step methods (→ Def. 11.3.1.5) based on the following discrete evolutions:
$$ \text{Lie-Trotter splitting:}\quad \Psi^h y_0 := \Phi_g^h \circ \Phi_f^h\, y_0 , \tag{12.5.0.3} $$
$$ \text{Strang splitting:}\quad \Psi^h y_0 := \Phi_f^{h/2} \circ \Phi_g^h \circ \Phi_f^{h/2}\, y_0 . \tag{12.5.0.4} $$
(Fig. 473, Fig. 474: sketches of the two compositions.)
Note that over many timesteps the Strang splitting approach is not more expensive than Lie-Trotter splitting, because the actual implementation of (12.5.0.4) should be done as follows:
$$ y_{1/2} := \Phi_f^{h/2} y_0 , \quad y_1 := \Phi_g^h y_{1/2} , \quad y_{3/2} := \Phi_f^h y_1 , \quad y_2 := \Phi_g^h y_{3/2} , \quad y_{5/2} := \Phi_f^h y_2 , \quad y_3 := \Phi_g^h y_{5/2} , \quad \dots , $$
because Φ_f^{h/2} ∘ Φ_f^{h/2} = Φ_f^h. This means that a Strang splitting SSM differs from a Lie-Trotter splitting SSM essentially only in the first and the last timestep.
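The two compositions are easy to code once the (exact or approximate) evolution maps are available as callables. A small C++ sketch (my own, not from the notes); the Strang variant uses the merged implementation just described:
C++ code (sketch):
#include <functional>

// Evolution map type: (stepsize, state) -> state
template <typename State>
using Evolution = std::function<State(double, const State &)>;

// Lie-Trotter splitting (12.5.0.3): Psi^h = Phi_g^h o Phi_f^h, n uniform steps.
template <typename State>
State lieTrotter(const Evolution<State> &PhiF, const Evolution<State> &PhiG,
                 State y, double T, unsigned int n) {
  const double h = T / n;
  for (unsigned int k = 0; k < n; ++k) y = PhiG(h, PhiF(h, y));
  return y;
}

// Strang splitting (12.5.0.4): Psi^h = Phi_f^{h/2} o Phi_g^h o Phi_f^{h/2};
// adjacent half-steps of Phi_f are merged into full steps.
template <typename State>
State strang(const Evolution<State> &PhiF, const Evolution<State> &PhiG,
             State y, double T, unsigned int n) {
  const double h = T / n;
  y = PhiF(h / 2, y);                                   // initial half-step
  for (unsigned int k = 0; k + 1 < n; ++k) y = PhiF(h, PhiG(h, y));
  return PhiF(h / 2, PhiG(h, y));                       // final step + half-step
}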
EXPERIMENT 12.5.0.5 (Convergence of simple splitting methods) We consider the following IVP whose right hand side function is the sum of two functions for which the ODEs can be solved analytically:
$$ \dot{y} = \underbrace{\lambda y (1 - y)}_{=: f(y)} + \underbrace{\sqrt{1 - y^2}}_{=: g(y)} , \qquad y(0) = 0 , $$
$$ \Phi_f^t y = \frac{1}{1 + (y^{-1} - 1) e^{-\lambda t}} , \quad t > 0 , \; y \in\, ]0, 1] \quad\text{(logistic ODE (11.1.2.2))}, $$
$$ \Phi_g^t y = \begin{cases} \sin(t + \arcsin(y)) , & \text{if } t + \arcsin(y) < \tfrac{\pi}{2} , \\ 1 , & \text{else,} \end{cases} \qquad t > 0 , \; y \in [0, 1] . $$
Numerical experiment: For T = 1, λ = 1, we compare the two splitting methods for uniform timesteps with a very accurate reference solution computed by MATLAB's ode45 with tight tolerances:
f=@(t,x) lambda*x*(1-x)+sqrt(1-x^2);
options=odeset('reltol',1.0e-10,'abstol',1.0e-12);
[t,yex]=ode45(f,[0,1],y0,options);
Fig. 475: error |y(T) − y_h(T)| at final time T = 1 versus timestep h for Lie-Trotter splitting and Strang splitting, with reference slopes O(h) and O(h²).
We observe algebraic convergence of the two splitting methods, of order 1 for (12.5.0.3) and order 2 for (12.5.0.4).
y
The single step methods defined by (12.5.0.3) and (12.5.0.4) are of order (→ Def. 11.3.2.8) 1 and 2, respectively.
§12.5.0.7 (Inexact splitting methods) Of course, the assumption that ẏ = f(y) and ẏ = g(y) can be solved exactly will hardly ever be met. However, it should be clear that a "sufficiently accurate" approximation of the evolution maps Φ_g^h and Φ_f^h is all we need.
EXPERIMENT 12.5.0.8 (Convergence of inexact simple splitting methods) Again we consider the IVP of Exp. 12.5.0.5 and inexact splitting methods based on different single step methods for the two ODEs corresponding to the summands.
LTS-Eul: explicit Euler method (11.2.1.5) → Ψ_g^h, Ψ_f^h + Lie-Trotter splitting (12.5.0.3)
SS-Eul: explicit Euler method (11.2.1.5) → Ψ_g^h, Ψ_f^h + Strang splitting (12.5.0.4)
SS-EuEI: Strang splitting (12.5.0.4): explicit Euler method (11.2.1.5) ∘ exact evolution Φ_g^h ∘ implicit Euler method (11.2.2.2)
LTS-EMP: explicit midpoint method (11.4.0.9) → Ψ_g^h, Ψ_f^h + Lie-Trotter splitting (12.5.0.3)
SS-EMP: explicit midpoint method (11.4.0.9) → Ψ_g^h, Ψ_f^h + Strang splitting (12.5.0.4)
Fig. 476: error |y(T) − y_h(T)| at final time T = 1 versus timestep h for these inexact splitting methods.
☞ The order of splitting methods may be (but need not be) limited by the order of the SSMs used for
Φhf , Φhg .
y
§12.5.0.9 (Application of splitting methods) In the following situation the use of splitting methods seems advisable:
"Splittable" ODEs
EXPERIMENT 12.5.0.11 (Splitting off stiff components) Recall Ex. 12.0.0.1 and the IVP studied there.
Fig. 477: solution y(t) and timestep sizes chosen by ode45, see Ex. 12.0.0.1. Fig. 478: solutions (y_k) produced by the inexact splitting methods LT-Eulex (h = 0.04 and h = 0.02) and ST-MPRexpl (h = 0.05).
Total number of timesteps: ode45: 152; LT-Eulex, h = 0.04: 25; LT-Eulex, h = 0.02: 50; ST-MPRexpl, h = 0.05: 20.
Details of the methods:
LT-Eulex: ẏ = λy(1 − y) → exact evolution, ẏ = α sin y → expl. Euler (11.2.1.5) & Lie-Trotter
splitting (12.5.0.3)
ST-MPRexpl: ẏ = λy(1 − y) → exact evolution, ẏ = α sin y → expl. midpoint rule (11.4.0.9) & Strang
splitting (12.5.0.4)
We observe that this splitting scheme can cope well with the stiffness of the problem, because the stiff
term on the right hand side is integrated exactly. y
EXAMPLE 12.5.0.12 (Splitting linear and decoupled terms) In the numerical treatment of partial differential equations one commonly encounters ODEs of the form
$$ \dot{y} = f(y) := -\mathbf{A} y + \begin{bmatrix} g(y_1) \\ \vdots \\ g(y_N) \end{bmatrix} , \qquad \mathbf{A} = \mathbf{A}^{\top} \in \mathbb{R}^{N,N} \;\text{positive definite (→ Def. 1.1.2.6)} , \tag{12.5.0.13} $$
with state space D = R^N, where λ_min(A) ≈ 1, λ_max(A) ≈ N², and the derivative of g : R → R is bounded. Then IVPs for (12.5.0.13) will be stiff, since the Jacobian
$$ D f(y) = -\mathbf{A} + \begin{bmatrix} g'(y_1) & & \\ & \ddots & \\ & & g'(y_N) \end{bmatrix} \in \mathbb{R}^{N,N} $$
will have eigenvalues "close to zero" and others that are large (in modulus) and negative. Hence, D f(y) will satisfy the criteria (12.2.0.13) and (12.2.0.14) for any state y ∈ R^N.
• For the linear ODE ẏ = −Ay we have to use an L-stable (→ Def. 12.3.4.15) single step method, for instance a second-order implicit Runge-Kutta method. Its increments can be obtained by solving a linear system of equations, whose coefficient matrix will be the same for every step, if uniform timesteps are used.
• The ODE ẏ = [g(y₁), ..., g(y_N)]^⊤ boils down to decoupled scalar ODEs ẏ_j = g(y_j), j = 1, ..., N. For them we can use an inexpensive explicit RK-SSM like the explicit trapezoidal method (11.4.0.8). According to our assumptions on g these ODEs are not haunted by stiffness.
y
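The following C++/Eigen sketch (my own, not from the notes) illustrates one Strang splitting step for (12.5.0.13). Deviating from the recommendation above, it uses the implicit midpoint rule (A-stable, but not L-stable) for the linear part, because this keeps the sketch to a single sparse Cholesky solve per step; the explicit trapezoidal rule (11.4.0.8) handles the decoupled scalar ODEs. The class name and interface are ad-hoc choices of this illustration.
C++ code (sketch):
#include <Eigen/Sparse>
#include <Eigen/SparseCholesky>
#include <functional>
#include <utility>

// Strang splitting step for y' = -A*y + [g(y_1),...,g(y_N)]^T, cf. (12.5.0.13):
// half-step for the scalar ODEs y_j' = g(y_j) (explicit trapezoidal rule),
// full step for y' = -A*y (implicit midpoint rule), another scalar half-step.
// For uniform timesteps the factorization of I + (h/2)*A is computed once.
class SplitStep {
 public:
  SplitStep(const Eigen::SparseMatrix<double> &A, double h,
            std::function<double(double)> g)
      : h_(h), g_(std::move(g)) {
    Eigen::SparseMatrix<double> I(A.rows(), A.cols());
    I.setIdentity();
    Aminus_ = I - 0.5 * h * A;
    solver_.compute(Eigen::SparseMatrix<double>(I + 0.5 * h * A));
  }
  Eigen::VectorXd operator()(Eigen::VectorXd y) const {
    y = scalarHalfStep(y);                    // ~ Phi_g^{h/2}
    const Eigen::VectorXd rhs = Aminus_ * y;  // implicit midpoint for y' = -A*y
    y = solver_.solve(rhs);
    return scalarHalfStep(y);                 // ~ Phi_g^{h/2}
  }

 private:
  // Explicit trapezoidal rule for the decoupled scalar ODEs, stepsize h/2.
  Eigen::VectorXd scalarHalfStep(Eigen::VectorXd y) const {
    const double tau = 0.5 * h_;
    for (int j = 0; j < y.size(); ++j) {
      const double k1 = g_(y(j));
      const double k2 = g_(y(j) + tau * k1);
      y(j) += 0.5 * tau * (k1 + k2);
    }
    return y;
  }
  double h_;
  std::function<double(double)> g_;
  Eigen::SparseMatrix<double> Aminus_;
  Eigen::SimplicialLDLT<Eigen::SparseMatrix<double>> solver_;
};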
with state space R2 . Derive formulas for the Strang-splitting single step method applied to the math-
ematical pendulum equation and using the additive decomposition of the right-hand-side vectorfield
suggested in (??). Distinguish the initial step, a regular step, and the final step.
Hint. What is the analytic solution of the ODE
$$ \begin{bmatrix} \dot{x} \\ \dot{y} \end{bmatrix} = \begin{bmatrix} \varphi(y) \\ 0 \end{bmatrix} $$
for an arbitrary function ϕ : R → R?
(Q12.5.0.14.B) Elaborate the extension of the Strang splitting single step method to an ODE ẏ = f(y) + g(y) + r(y), whose right hand side is the sum of three terms. Develop formulas relying on the exact evolutions Φ_f^h, Φ_g^h, and Φ_r^h for the ODEs ẏ = f(y), ẏ = g(y), and ẏ = r(y).
(Q12.5.0.14.C) For a symmetric positive definite matrix A ∈ R N,N consider the autonomous ODE on
state space R N :
$$ \dot{y} = -\mathbf{A} y + \big[ \sin(\pi y_j) \big]_{j=1}^{N} , \qquad y = \big[ y_j \big]_{j=1}^{N} . \tag{12.5.0.16} $$
We know that the smallest eigenvalue λmin (A) of A is 1 and the largest λmax (A) can be as big as 109 .
(i) Based on the Strang splitting, propose an efficient second-order single-step method for (12.5.0.16).
(ii) What is the computational effort for every regular timestep of your method, asymptotically for
N → ∞?
△
Bibliography
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 815, 818, 849).
[DB02] P. Deuflhard and F. Bornemann. Scientific Computing with Ordinary Differential Equations.
2nd ed. Vol. 42. Texts in Applied Mathematics. New York: Springer, 2002 (cit. on pp. 840, 845).
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 820, 829, 846,
848, 853).
[MQ02] R.I. McLachlan and G.R.W. Quispel. "Splitting methods". In: Acta Numerica 11 (2002) (cit. on p. 858).
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
p. 822).
[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Math-
ematics. New York: Springer, 2000 (cit. on pp. 820, 829, 835, 849).
[Ran15] Joachim Rang. “Improved traditional Rosenbrock-Wanner methods for stiff
ODEs and DAEs”. In: J. Comput. Appl. Math. 286 (2015), pp. 128–144. DOI:
10.1016/j.cam.2015.03.010.
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on pp. 824, 832).