

ETH Lectures 401-0663-00L Numerical Methods for Computer Science


401-2673-00L Numerical Methods for CSE

Numerical Methods for


Computational Science and Engineering

Prof. R. Hiptmair, SAM, ETH Zurich


(with contributions from Prof. P. Arbenz and Dr. V. Gradinaru)

Autumn Term 2024, Version of February 5, 2025


(C) Seminar für Angewandte Mathematik, ETH Zürich

Link to the current version of this lecture document

Always under construction!


The online version will always be work in progress and subject to change.
(Nevertheless, structure and main contents can be expected to be stable.)

Do not print before the end of term!

Contents

0 Introduction 8
0.1 Course Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
0.1.1 Focus of this Course . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
0.1.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
0.1.3 Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
0.2 Teaching Style and Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
0.2.1 Flipped Classroom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
0.2.1.1 Course Videos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
0.2.1.2 Following the Course . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
0.2.2 Clarifications and Frank Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
0.2.3 Requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
0.2.4 Assignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
0.2.5 Information on Examinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
0.2.5.1 For the Course 401-2673-00L Numerical Methods for CSE (BSc CSE) . . 28
0.2.5.2 For the Course 401-0663-00L Numerical Methods for CS (BSc Informatik) 29
0.3 Programming in C++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
0.3.1 Function Arguments and Overloading . . . . . . . . . . . . . . . . . . . . . . . . . 30
0.3.2 Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
0.3.3 Function Objects and Lambda Functions . . . . . . . . . . . . . . . . . . . . . . . 33
0.3.4 Multiple Return Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
0.3.5 A Vector Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
0.3.6 Complex numbers in C++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
0.4 Prerequisite Mathematical Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
0.4.1 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
0.4.2 Complex Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
0.4.3 Trigonometric Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
0.4.4 Linear Algebra and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

1 Computing with Matrices and Vectors 53


1.1 Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
1.1.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
1.1.2 Classes of Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
1.2 Software and Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
1.2.1 EIGEN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
1.2.2 PYTHON . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
1.2.3 (Dense) Matrix Storage Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
1.3 Basic Linear Algebra Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
1.3.1 Elementary Matrix-Vector Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . 70
1.3.2 BLAS – Basic Linear Algebra Subprograms . . . . . . . . . . . . . . . . . . . . . 76
1.4 Computational Effort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82


1.4.1 (Asymptotic) Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . 83


1.4.2 Cost of Basic Linear-Algebra Operations . . . . . . . . . . . . . . . . . . . . . . . 84
1.4.3 Improving Complexity in Numerical Linear Algebra: Some Tricks . . . . . . . . . . . 86
1.5 Machine Arithmetic and Consequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
1.5.1 Experiment: Loss of Orthogonality . . . . . . . . . . . . . . . . . . . . . . . . . . 91
1.5.2 Machine Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
1.5.3 Roundoff Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
1.5.4 Cancellation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
1.5.5 Numerical Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

2 Direct Methods for (Square) Linear Systems of Equations 126


2.1 Introduction: Linear Systems of Equations (LSE) . . . . . . . . . . . . . . . . . . . . . . . 127
2.2 Theory: Linear Systems of Equations (LSE) . . . . . . . . . . . . . . . . . . . . . . . . . 130
2.2.1 LSE: Existence and Uniqueness of Solutions . . . . . . . . . . . . . . . . . . . . . 130
2.2.2 Sensitivity/Conditioning of Linear Systems . . . . . . . . . . . . . . . . . . . . . . 131
2.3 Gaussian Elimination (GE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
2.3.1 Basic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
2.3.2 LU-Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
2.3.3 Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
2.4 Stability of Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
2.5 Survey: Elimination Solvers for Linear Systems of Equations . . . . . . . . . . . . . . . . 165
2.6 Exploiting Structure when Solving Linear Systems . . . . . . . . . . . . . . . . . . . . . . 170
2.7 Sparse Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
2.7.1 Sparse Matrix Storage Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
2.7.2 Sparse Matrices in EIGEN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
2.7.3 Direct Solution of Sparse Linear Systems of Equations . . . . . . . . . . . . . . . . 190
2.7.4 LU-Factorization of Sparse Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 193
2.7.5 Banded Matrices [DR08, Sect. 3.7] . . . . . . . . . . . . . . . . . . . . . . . . . . 199
2.8 Stable Gaussian Elimination Without Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . 206

3 Direct Methods for Linear Least Squares Problems 213


3.0.1 Overdetermined Linear Systems of Equations: Examples . . . . . . . . . . . . . . 214
3.1 Least Squares Solution Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
3.1.1 Least Squares Solutions: Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 218
3.1.2 Normal Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
3.1.3 Moore-Penrose Pseudoinverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
3.1.4 Sensitivity of Least Squares Problems . . . . . . . . . . . . . . . . . . . . . . . . . 229
3.2 Normal Equation Methods [DR08, Sect. 4.2], [Han02, Ch. 11] . . . . . . . . . . . . . . . . 230
3.3 Orthogonal Transformation Methods [DR08, Sect. 4.4.2] . . . . . . . . . . . . . . . . . . . 234
3.3.1 Transformation Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
3.3.2 Orthogonal/Unitary Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
3.3.3 QR-Decomposition [Han02, Sect. 13], [Gut09, Sect. 7.3] . . . . . . . . . . . . . . . 236
3.3.3.1 QR-Decomposition: Theory . . . . . . . . . . . . . . . . . . . . . . . . . 237
3.3.3.2 Computation of QR-Decomposition . . . . . . . . . . . . . . . . . . . . . 240
3.3.3.3 QR-Decomposition: Stability . . . . . . . . . . . . . . . . . . . . . . . . . 249
3.3.3.4 QR-Decomposition in EIGEN . . . . . . . . . . . . . . . . . . . . . . . . 250
3.3.4 QR-Based Solver for Linear Least Squares Problems . . . . . . . . . . . . . . . . 252
3.3.5 Modification Techniques for QR-Decomposition . . . . . . . . . . . . . . . . . . . . 257
3.3.5.1 Rank-1 Modifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
3.3.5.2 Adding a Column . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
3.3.5.3 Adding a Row . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
3.4 Singular Value Decomposition (SVD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264


3.4.1 SVD: Definition and Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264


3.4.2 SVD in EIGEN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
3.4.3 Solving General Least-Squares Problems by SVD . . . . . . . . . . . . . . . . . . 272
3.4.4 SVD-Based Optimization and Approximation . . . . . . . . . . . . . . . . . . . . . 275
3.4.4.1 Norm-Constrained Extrema of Quadratic Forms . . . . . . . . . . . . . . 275
3.4.4.2 Best Low-Rank Approximation . . . . . . . . . . . . . . . . . . . . . . . . 278
3.4.4.3 Principal Component Data Analysis (PCA) . . . . . . . . . . . . . . . . . 284
3.5 Total Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
3.6 Constrained Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
3.6.1 Solution via Lagrangian Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . 298
3.6.2 Solution via SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300

4 Filtering Algorithms 303


4.1 Filters and Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
4.1.1 Discrete Finite Linear Time-Invariant Causal Channels/Filters . . . . . . . . . . . . 304
4.1.2 LT-FIR Linear Mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
4.1.3 Discrete Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
4.1.4 Periodic Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
4.2 Discrete Fourier Transform (DFT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
4.2.1 Diagonalizing Circulant Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
4.2.2 Discrete Convolution via Discrete Fourier Transform . . . . . . . . . . . . . . . . . 326
4.2.3 Frequency filtering via DFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
4.2.4 Real DFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
4.2.5 Two-dimensional DFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
4.2.6 Semi-discrete Fourier Transform [QSS00, Sect. 10.11] . . . . . . . . . . . . . . . . 344
4.3 Fast Fourier Transform (FFT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
4.4 Trigonometric Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
4.4.1 Sine transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
4.4.2 Cosine transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
4.5 Toeplitz Matrix Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
4.5.1 Matrices with Constant Diagonals . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
4.5.2 Toeplitz Matrix Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
4.5.3 The Levinson Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374

5 Data Interpolation and Data Fitting in 1D 379


5.1 Abstract Interpolation (AI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
5.2 Global Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
5.2.1 Uni-Variate Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
5.2.2 Polynomial Interpolation: Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
5.2.3 Polynomial Interpolation: Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 393
5.2.3.1 Multiple evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
5.2.3.2 Single evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
5.2.3.3 Extrapolation to Zero . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
5.2.3.4 Newton Basis and Divided Differences . . . . . . . . . . . . . . . . . . . 403
5.2.4 Polynomial Interpolation: Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . 409
5.3 Shape-Preserving Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
5.3.1 Shape Properties of Functions and Data . . . . . . . . . . . . . . . . . . . . . . . 414
5.3.2 Piecewise Linear Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
5.3.3 Cubic Hermite Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
5.3.3.1 Definition and Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 418
5.3.3.2 Local Monotonicity-Preserving Hermite Interpolation . . . . . . . . . . . . 421
5.4 Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425


5.4.1 Spline Function Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425


5.4.2 Cubic-Spline Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
5.4.3 Structural Properties of Cubic Spline Interpolants . . . . . . . . . . . . . . . . . . . 431
5.4.4 Shape Preserving Spline Interpolation . . . . . . . . . . . . . . . . . . . . . . . . 435
5.5 Algorithms for Curve Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
5.5.1 CAD Task: Curves from Control Points . . . . . . . . . . . . . . . . . . . . . . . . 440
5.5.2 Bezier Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
5.5.3 Spline Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
5.6 Trigonometric Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
5.6.1 Trigonometric Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
5.6.2 Reduction to Lagrange Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . 452
5.6.3 Equidistant Trigonometric Interpolation . . . . . . . . . . . . . . . . . . . . . . . . 454
5.7 Least Squares Data Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459

6 Approximation of Functions in 1D 468


6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
6.2 Approximation by Global Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
6.2.1 Polynomial Approximation: Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 472
6.2.2 Error Estimates for Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . 478
6.2.2.1 Convergence of Interpolation Errors . . . . . . . . . . . . . . . . . . . . . 478
6.2.2.2 Interpolands of Finite Smoothness . . . . . . . . . . . . . . . . . . . . . 482
6.2.2.3 Analytic Interpolands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
6.2.3 Chebychev Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
6.2.3.1 Motivation and Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
6.2.3.2 Chebychev Interpolation Error Estimates . . . . . . . . . . . . . . . . . . 498
6.2.3.3 Chebychev Interpolation: Computational Aspects . . . . . . . . . . . . . . 505
6.3 Mean Square Best Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
6.3.1 Abstract Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
6.3.1.1 Mean Square Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
6.3.1.2 Normal Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
6.3.1.3 Orthonormal Bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
6.3.2 Polynomial Mean Square Best Approximation . . . . . . . . . . . . . . . . . . . . . 515
6.4 Uniform Best Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
6.5 Approximation by Trigonometric Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . 525
6.5.1 Approximation by Trigonometric Interpolation . . . . . . . . . . . . . . . . . . . . . 526
6.5.2 Trigonometric Interpolation Error Estimates . . . . . . . . . . . . . . . . . . . . . . 527
6.5.3 Trigonometric Interpolation of Analytic Periodic Functions . . . . . . . . . . . . . . 534
6.6 Approximation by Piecewise Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
6.6.1 Piecewise Polynomial Lagrange Interpolation . . . . . . . . . . . . . . . . . . . . . 541
6.6.2 Cubic Hermite Interpolation: Error Estimates . . . . . . . . . . . . . . . . . . . . . 544
6.6.3 Cubic Spline Interpolation: Error Estimates [Han02, Ch. 47] . . . . . . . . . . . . . 546

7 Numerical Quadrature 550


7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550
7.2 Quadrature Formulas – Quadrature Rules . . . . . . . . . . . . . . . . . . . . . . . . . . 552
7.3 Polynomial Quadrature Formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
7.4 Gauss Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
7.4.1 Order of a Quadrature Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
7.4.2 Maximal-Order Quadrature Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
7.4.3 Quadrature Error Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 570
7.5 Composite Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
7.6 Adaptive Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583


8 Iterative Methods for Non-Linear Systems of Equations 593


8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593
8.2 Iterative Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
8.2.1 Fundamental Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
8.2.2 Speed of Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 600
8.2.3 Termination Criteria/Stopping Rules . . . . . . . . . . . . . . . . . . . . . . . . . . 605
8.3 Fixed-Point Iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
8.3.1 Consistent Fixed-Point Iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . 610
8.3.2 Convergence of Fixed-Point Iterations . . . . . . . . . . . . . . . . . . . . . . . . . 611
8.4 Finding Zeros of Scalar Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 618
8.4.1 Bisection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 618
8.4.2 Model Function Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620
8.4.2.1 Newton Method in the Scalar Case . . . . . . . . . . . . . . . . . . . . . 620
8.4.2.2 Special One-Point Methods . . . . . . . . . . . . . . . . . . . . . . . . . 624
8.4.2.3 Multi-Point Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 628
8.4.3 Asymptotic Efficiency of Iterative Methods for Zero Finding . . . . . . . . . . . . . . 633
8.5 Newton’s Method in R n . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637
8.5.1 The Newton Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637
8.5.2 Convergence of Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 649
8.5.3 Termination of Newton Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652
8.5.4 Damped Newton Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654
8.6 Quasi-Newton Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 658
8.7 Non-linear Least Squares [DR08, Ch. 6] . . . . . . . . . . . . . . . . . . . . . . . . . . . 665
8.7.1 (Damped) Newton Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
8.7.2 Gauss-Newton Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 670
8.7.3 Trust Region Method (Levenberg-Marquardt Method) . . . . . . . . . . . . . . . . . 674

9 Computation of Eigenvalues and Eigenvectors 677


9.1 Theory of eigenvalue problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
9.2 “Direct” Eigensolvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 682
9.3 Power Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685
9.3.1 Direct power method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685
9.3.2 Inverse Iteration [DR08, Sect. 7.6], [QSS00, Sect. 5.3.2] . . . . . . . . . . . . . . . 692
9.3.3 Preconditioned inverse iteration (PINVIT) . . . . . . . . . . . . . . . . . . . . . . . 702
9.3.4 Subspace iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 705
9.3.4.1 Orthogonalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 709
9.3.4.2 Ritz projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 712
9.4 Krylov Subspace Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 716

10 Krylov Methods for Linear Systems of Equations 728


10.1 Descent Methods [QSS00, Sect. 4.3.3] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 729
10.1.1 Quadratic minimization context . . . . . . . . . . . . . . . . . . . . . . . . . . . . 729
10.1.2 Abstract steepest descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 730
10.1.3 Gradient method for s.p.d. linear system of equations . . . . . . . . . . . . . . . . 731
10.1.4 Convergence of the gradient method . . . . . . . . . . . . . . . . . . . . . . . . . 732
10.2 Conjugate gradient method (CG) [Han02, Ch. 9], [DR08, Sect. 13.4], [QSS00, Sect. 4.3.4] . 736
10.2.1 Krylov spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 737
10.2.2 Implementation of CG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 738
10.2.3 Convergence of CG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 741
10.3 Preconditioning [DR08, Sect. 13.5], [Han02, Ch. 10], [QSS00, Sect. 4.3.5] . . . . . . . . . 745
10.4 Survey of Krylov Subspace Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 751
10.4.1 Minimal residual methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 751


10.4.2 Iterations with short recursions [QSS00, Sect. 4.5] . . . . . . . . . . . . . . . . . . 752

11 Numerical Integration – Single Step Methods 756


11.1 Initial-Value Problems (IVPs) for Ordinary Differential Equations (ODEs) . . . . . . . . . . 757
11.1.1 Ordinary Differential Equations (ODEs) . . . . . . . . . . . . . . . . . . . . . . . . 757
11.1.2 Mathematical Modeling with Ordinary Differential Equations: Examples . . . . . . . 759
11.1.3 Theory of Initial-Value-Problems (IVPs) . . . . . . . . . . . . . . . . . . . . . . . . 764
11.1.4 Evolution Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 768
11.2 Introduction: Polygonal Approximation Methods . . . . . . . . . . . . . . . . . . . . . . . 771
11.2.1 Explicit Euler method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 772
11.2.2 Implicit Euler method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 774
11.2.3 Implicit midpoint method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 776
11.3 General Single-Step Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 778
11.3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 778
11.3.2 (Asymptotic) Convergence of Single-Step Methods . . . . . . . . . . . . . . . . . . 782
11.4 Explicit Runge-Kutta Single-Step Methods (RKSSMs) . . . . . . . . . . . . . . . . . . . . 791
11.5 Adaptive Stepsize Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 798
11.5.1 The Need for Timestep Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . 798
11.5.2 Local-in-Time Stepsize Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . 800
11.5.3 Embedded Runge-Kutta Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 807

12 Single-Step Methods for Stiff Initial-Value Problems 814


12.1 Model Problem Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 815
12.2 Stiff Initial-Value Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 829
12.3 Implicit Runge-Kutta Single-Step Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 835
12.3.1 The Implicit Euler Method for Stiff IVPs . . . . . . . . . . . . . . . . . . . . . . . . 835
12.3.2 Collocation Single-Step Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 836
12.3.3 General Implicit Runge-Kutta Single-Step Methods (RK-SSMs) . . . . . . . . . . . 840
12.3.4 Model Problem Analysis for Implicit Runge-Kutta Single-Step Methods (IRK-SSMs) 842
12.4 Semi-Implicit Runge-Kutta Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 850
12.5 Splitting Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 853

Index 860
Symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 874
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 876
Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 884

Chapter 0

Introduction

0.1 Course Fundamentals


0.1.1 Focus of this Course
Emphasis is put
✄ on algorithms (principles, computational cost, scope, and limitations),

✄ on (efficient and stable) implementation in C++ based on the numerical linear algebra template library
EIGEN, a Domain Specific Language (DSL) embedded into C++ (see the short sketch after this list).

✄ on numerical experiments (design and interpretation).
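
To give a first impression of what this looks like in practice, here is a minimal illustrative C++/EIGEN sketch. It is not taken from the course codes, and the concrete matrix and numbers are made up for illustration: it assembles a small dense linear system and solves it with a pivoted LU decomposition.

#include <iostream>
#include <Eigen/Dense>

int main() {
  // Define a small 3x3 matrix and a right-hand-side vector
  Eigen::Matrix3d A;
  A << 4, 1, 0,
       1, 3, 1,
       0, 1, 2;
  Eigen::Vector3d b(1.0, 2.0, 3.0);
  // Solve Ax = b using a partial-pivoting LU decomposition
  Eigen::Vector3d x = A.partialPivLu().solve(b);
  std::cout << "x = " << x.transpose() << std::endl;
  return 0;
}

EIGEN's expression templates keep such code close to the underlying linear-algebra notation while still compiling to efficient machine code; the library is introduced properly in Section 1.2.1.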

§0.1.1.1 (Aspects outside the scope of this course) No emphasis will be put on
• theory and proofs (unless essential for derivation and understanding of algorithms).
☞ 401-3651-00L Numerical Methods for Elliptic and Parabolic Partial Differential Equations
401-3652-00L Numerical Methods for Hyperbolic Partial Differential Equations
(both courses offered in BSc Mathematics)
• hardware aware implementation (cache hierarchies, CPU pipelining, vectorization, etc.)
☞ 263-0007-00L Advanced System Lab (How To Write Fast Numerical Code, Prof. M. Püschel,
D-INFK)
• issues of high-performance computing (HPC, shared and distributed memory parallelisation, vectorization)
☞ 151-0107-20L High Performance Computing for Science and Engineering (HPCSE,
Prof. P. Koumoutsakos, D-MAVT)
263-2800-00L Design of Parallel and High-Performance Computing (Prof. T. Höfler, D-INFK)
However, note that these other courses partly rely on knowledge of elementary numerical methods, which
is covered in this course. y


§0.1.1.2 (Prerequisites) This course will take for granted basic knowledge of linear algebra, calculus, and
programming that you should have acquired during your first year at ETH.

[Figure: the topics of this course (linear systems of equations, least squares problems, filtering & FFT, interpolation and approximation, numerical quadrature for approximating integrals, and numerically solving ordinary differential equations) rest on the prerequisites Analysis, Linear Algebra, and Programming (in C++).]


y

§0.1.1.3 (Numerical methods: A motley toolbox)

This course discusses elementary numerical methods and techniques.

They are vastly different in terms of ideas, design, analysis, and scope of application. They are the
items in a toolbox, some only loosely related by the common purpose of being building blocks for
codes for numerical simulation.

Do not expect much coherence between the chapters of this course!

A purpose-oriented notion of “Numerical methods for CSE”:

A: “Stop putting a hammer, a level, and duct tape in one box! They have nothing to do with each other!”
B: “I might need any of these tools when fixing something about the house.”

Fig. 1
y

§0.1.1.4 (Dependencies of topics) Despite the diverse nature of the individual topics covered in this
course, some depend on others for providing essential building blocks. The following directed graph tries
to capture these relationships. The arrows have to be read as “uses results or algorithms of”.


[Dependency graph of topics, arrows read as “uses results or algorithms of”: numerical integration ẏ = f(t, y) (Chapter 11); quadrature ∫ f(x) dx (Chapter 7); eigenvalues Ax = λx (Chapter 9); Krylov methods (Chapter 10); least squares ‖Ax − b‖ → min (Chapter 3); function approximation (Chapter 6); non-linear least squares ‖F(x)‖ → min (Section 8.7); interpolation ∑_i α_i b(x_i) = f(x_i) (Chapter 5); linear systems Ax = b (Chapter 2); non-linear systems F(x) = 0 (Chapter 8), including zero finding f(x) = 0; filtering (Chapter 4); sparse matrices (Section 2.7); all building on computing with matrices and vectors (Chapter 1).]
y

Any one-semester course “Numerical methods for CSE” will cover only selected chapters and sec-
tions of this document. Only topics addressed in class or in homework problems will be relevant
for exams!

§0.1.1.5 (Relevance of this course) I am a student of computer science. After the exam, may I safely
forget everything I have learned in this mandatory “numerical methods” course? No, because it is highly
likely that other courses or projects will rely on the contents of this course:


✦ singular value decomposition, least squares → computational statistics, machine learning
✦ function approximation, numerical quadrature, numerical integration → machine learning, numerical methods for PDEs
✦ interpolation, least squares → computer graphics
✦ eigensolvers, sparse linear systems → graph theoretic algorithms
✦ numerical integration → computer animation, robotics

and many more applications of fundamental numerical methods ...

Hardly anyone will need everything covered in this course, but most of you will need something.

0.1.2 Goals
This course is meant to impart
✦ knowledge of some fundamental algorithms forming the basis of numerical simulations,
✦ familiarity with essential terms in numerical mathematics and the techniques used for the analysis
of numerical algorithms
✦ the skill to choose the appropriate numerical methods for concrete problems,
✦ the ability to interpret numerical results,
✦ proficiency in implementing numerical algorithms efficiently in C++, using numerical libraries.

Indispensable: Learning by doing (➔ exercises)

0.1.3 Literature
Parts of the following textbooks may be used as supplementary reading for this course. References to
relevant sections will be provided in the course material.

Studying extra literature is not important for following this course!

✦ [AG11] U. ASCHER AND C. GREIF, A First Course in Numerical Methods, SIAM, Philadelphia, 2011.

Comprehensive introduction to numerical methods with an algorithmic focus based on MATLAB.


(Target audience: students of engineering subjects)
✦ [DR08] W. DAHMEN AND A. REUSKEN, Numerik für Ingenieure und Naturwissenschaftler, Springer,
Heidelberg, 2006.

Good reference for large parts of this course; provides a lot of simple examples and lucid explana-
tions, but also rigorous mathematical treatment.
(Target audience: undergraduate students in science and engineering)
Available for download as PDF
✦ [Han02] M. HANKE-BOURGEOIS, Grundlagen der Numerischen Mathematik und des Wissenschaftlichen Rechnens, Mathematische Leitfäden, B.G. Teubner, Stuttgart, 2002.
Gives detailed description and mathematical analysis of algorithms and relies on MATLAB. Profound
treatment of theory way beyond the scope of this course. (Target audience: undergraduates in
mathematics)
✦ [QSS00] A. QUARTERONI, R. SACCO, AND F. SALERI, Numerical mathematics, vol. 37 of Texts in
Applied Mathematics, Springer, New York, 2000.


Classical introductory numerical analysis text with many examples and detailed discussion of algo-
rithms. (Target audience: undergraduates in mathematics and engineering)
Can be obtained from website.
✦ [DH03] P. DEUFLHARD AND A. HOHMANN, Numerische Mathematik. Eine algorithmisch orientierte
Einführung, DeGruyter, Berlin, 1 ed., 1991.
Modern discussion of numerical methods with profound treatment of theoretical aspects (Target
audience: undergraduate students in mathematics).
✦ [GGK14] W. GANDER, M.J. GANDER, AND F. KWOK, Scientific Computing, Texts in Computational
Science and Engineering, Springer, 2014.
Comprehensive treatment of elementary numerical methods with an algorithmic focus.
D-INFK maintains a webpage with links to some of these books.

Essential prerequisite for this course is a solid knowledge in linear algebra and calculus. Familiarity with
the topics covered in the first semester courses is taken for granted, see
✦ [NS02] K. NIPP AND D. STOFFER, Lineare Algebra, vdf Hochschulverlag, Zürich, 5 ed., 2002.
✦ [Gut09] M. GUTKNECHT, Lineare Algebra, lecture notes, SAM, ETH Zürich, 2009, available online.
✦ [Str09] M. STRUWE, Analysis für Informatiker. Lecture notes, ETH Zürich, 2009, available online.

0.2 Teaching Style and Model


0.2.1 Flipped Classroom
This course will depart from the usual academic teaching arrangement centering around classes taught
by a lecturer addressing an audience in a lecture hall.

A flipped-classroom course

This course will follow the flipped-classroom paradigm:

Learning by self-study guided by
• instruction videos
• lecture notes
• tablet notes
• interactive homeworks
• Q&A sessions
• tutorial classes

All the course material will be published online through the course Moodle Page. All notes jotted down by
the lecturer during the creation of videos or during the Q&A sessions will be made available as PDF.

0.2.1.1 Course Videos

In the flipped-classroom teaching model regular lectures will be replaced with pre-recorded videos. These
videos are not commercial-grade clips, but resemble video recordings from a standard classroom setting;
they convey the development of the material on a tablet accompanied by the lecturer’s voice.


The videos will be published through
1. the course Moodle Page (see Fig. 2),
2. and as .mp4-files on PolyBox (password required).

Every video comes with a PDF containing the tablet notes taken during the creation of the video. However, the PDF may have been corrected, updated, or supplemented later.

Fig. 2

§0.2.1.2 (“Pause” and “fast forward”) Videos have two big advantages:

You can stop a video at any time, whenever


• you need more time to think,
• you want to look up related information,
• you want to work for yourself.

Make use of this possibility!


Fig. 3

The video portal also allows you to play the videos at 1.5× speed. This can be useful if the current topic
is very clear to you. You can also skip entire parts using the scroll bar. The same functionality (fast playing
and skipping) is offered by most video players, for instance the VLC media player. y

§0.2.1.3 (Review questions) Most lecture units (corresponding to a video) are accompanied by a list of
review questions. Shortly after you have finished studying a unit, you should try to answer them off the top
of your head without consulting any written material.
In case you are utterly clueless about how to approach a review question, you probably need to refresh
some of the unit’s topics.
y

§0.2.1.4 (List of available tutorial videos) This is the list of available video tutorials as of February 5,
2025:

1. Video tutorial for Chapter 0 “Introduction”: (16 minutes) Download link, tablet notes

2. Video tutorial for Section 1.1.1 "Notations and Classes of Matrices": (7 minutes) Download link, tablet notes
→ review questions 1.1.2.9

3. Video tutorial for Section 1.2.1 "EIGEN": (11 minutes) Download link, tablet notes
→ review questions 1.2.1.14

4. Video tutorial for Section 1.2.3 "(Dense) Matrix Storage Formats": (10 minutes) Download link, tablet notes
→ review questions 1.2.3.11

5. Video tutorial for Section 1.4 "Computational Effort": (29 minutes) Download link, tablet notes
→ review questions 1.4.3.11

6. Video tutorial for Section 1.5 "Machine Arithmetic and Consequences": (16 minutes) Download link, tablet notes
→ review questions 1.5.3.18

7. Video tutorial for Section 1.5.4 "Cancellation": (22 minutes) Download link, tablet notes
→ review questions 1.5.4.33

8. Video tutorial for Section 1.5.5 "Numerical Stability": (17 minutes) Download link, tablet notes
→ review questions 1.5.5.23

9. Video tutorial for Section 2.1 & Section 2.2.1 "Introduction and Theory: Linear Systems of Equations (LSEs)": (6 minutes) Download link, tablet notes
→ review questions 2.2.1.7

10. Video tutorial for Ex. 2.1.0.3 "Nodal Analysis of Linear Electric Circuits": (8 minutes) Download link, tablet notes
→ review questions 2.1.0.8

11. Video tutorial for Section 2.2.2 "Sensitivity of Linear Systems": (15 minutes) Download link, tablet notes
→ review questions 2.2.2.12

12. Video tutorial for Section 2.3 & Section 2.5 "Gaussian Elimination": (17 minutes) Download link, tablet notes
→ review questions 2.3.2.21

13. Video tutorial for Section 2.6 "Exploiting Structure when Solving Linear Systems": (17 minutes) Download link, tablet notes
→ review questions 2.6.0.25

14. Video tutorial for Section 2.7.1 "Sparse Matrix Storage Formats": (10 minutes) Download link, tablet notes
→ review questions 2.7.1.5


15. Video tutorial for Section 2.7.2 "Sparse Matrices in EIGEN": (6 minutes) Download link, tablet notes
→ review questions 2.7.2.17

16. Video tutorial for Section 2.7.3 "Direct Solution of Sparse Linear Systems of Equations": (10 minutes) Download link, tablet notes
→ review questions 2.7.3.7

17. Video tutorial for Section 3.0.1 "Overdetermined Linear Systems of Equations: Examples": (12 minutes) Download link, tablet notes
→ review questions 3.0.1.11

18. Video tutorial for Section 3.1.1 "Least Squares Solutions": (9 minutes) Download link, tablet notes
→ review questions 3.1.1.14

19. Video tutorial for Section 3.1.2 "Normal Equations": (16 minutes) Download link, tablet notes
→ review questions 3.1.2.23

20. Video tutorial for Section 3.1.3 "Moore-Penrose Pseudoinverse": (8 minutes) Download link, tablet notes
→ review questions 3.1.3.8

21. Video tutorial for Section 3.2 "Normal Equation Methods": (12 minutes) Download link, tablet notes
→ review questions 3.2.0.11

22. Video tutorial for Section 3.3 "Orthogonal Transformation Methods": (10 minutes) Download link, tablet notes
→ review questions 3.3.2.3

23. Video tutorial for Section 3.3.3.1 "QR-Decomposition: Theory": (11 minutes) Download link, tablet notes
→ review questions 3.3.3.8

24. Video tutorial for Section 3.3.3.2 & Section 3.3.3.4 "Computation of QR-Decomposition, QR-Decomposition in EIGEN": (32 minutes) Download link, tablet notes
→ review questions 3.3.3.29

25. Video tutorial for Section 3.3.4 "QR-Based Solver for Linear Least Squares Problems": (9 minutes) Download link, tablet notes
→ review questions 3.3.4.8

26. Video tutorial for Section 3.3.5 "Modification Techniques for QR-Decomposition": (25 minutes) Download link, tablet notes
→ review questions 3.3.5.7


27. Video tutorial for Section 3.4.1 "Singular Value Decomposition: Definition and Theory": (13 minutes) Download link, tablet notes
→ review questions 3.4.1.15

28. Video tutorial for Section 3.4.2 "SVD in EIGEN": (9 minutes) Download link, tablet notes
→ review questions 3.4.2.10

29. Video tutorial for Section 3.4.3 "Solving General Least-Squares Problems by SVD": (14 minutes) Download link, tablet notes
→ review questions 3.4.3.17

30. Video tutorial for Section 3.4.4.1 "Norm-Constrained Extrema of Quadratic Forms": (11 minutes) Download link, tablet notes
→ review questions 3.4.4.13

31. Video tutorial for Section 3.4.4.2 "Best Low-Rank Approximation": (13 minutes) Download link, tablet notes
→ review questions 3.4.4.25

32. Video tutorial for Section 3.4.4.3 "Principal Component Data Analysis (PCA)": (28 minutes) Download link, tablet notes
→ review questions 3.4.4.51

33. Video tutorial for Section 3.6 "Constrained Least Squares": (23 minutes) Download link, tablet notes
→ review questions 3.6.2.1

34. Video tutorial for Section 4.1.1 "Discrete Finite Linear Time-Invariant Causal Channels/Filters": (11 minutes) Download link, tablet notes
→ review questions 4.1.1.13

35. Video tutorial for Section 4.1.2 "LT-FIR Linear Mappings": (12 minutes) Download link, tablet notes
→ review questions 4.1.2.10

36. Video tutorial for Section 4.1.3 "Discrete Convolutions": (9 minutes) Download link, tablet notes
→ review questions 4.1.3.11

37. Video tutorial for Section 4.1.4 "Periodic Convolutions": (12 minutes) Download link, tablet notes
→ review questions 4.1.4.19


38. Video tutorial for Section 4.2.1 "Diagonalizing Circulant Matrices": (17 minutes) Download link, tablet notes
→ review questions 4.2.1.23

39. Video tutorial for Section 4.2.2 "Discrete Convolution via DFT": (7 minutes) Download link, tablet notes
→ review questions 4.2.2.6

40. Video tutorial for Section 4.2.3 "Frequency filtering via DFT": (20 minutes) Download link, tablet notes
→ review questions 4.2.3.11

41. Video tutorial for Section 4.2.5 "Two-Dimensional DFT": (20 minutes) Download link, tablet notes
→ review questions 4.2.5.23

42. Video tutorial for Section 4.3 "Fast Fourier Transform (FFT)": (16 minutes) Download link, tablet notes
→ review questions 4.3.0.13

43. Video tutorial for Section 4.5 "Toeplitz Matrix Techniques": (20 minutes) Download link, tablet notes
→ review questions 4.5.3.8

44. Video tutorial for Section 5.1 "Abstract Interpolation": (16 minutes) Download link, tablet notes
→ review questions 5.1.0.27

45. Video tutorial for Section 5.2.1 "Uni-Variate Polynomials": (7 minutes) Download link, tablet notes
→ review questions 5.2.1.8

46. Video tutorial for Section 5.2.2 "Polynomial Interpolation: Theory": (6 minutes) Download link, tablet notes
→ review questions 5.2.2.19

47. Video tutorial for Section 5.2.3 "Polynomial Interpolation: Algorithms": (18 minutes) Download link, tablet notes
→ review questions 5.2.3.14

48. Video tutorial for Section 5.2.3.3 "Extrapolation to Zero": (12 minutes) Download link, tablet notes
→ review questions 5.2.3.20

49. Video tutorial for Section 5.2.3.4 "Newton Basis and Divided Differences": (17 minutes) Download link, tablet notes
→ review questions 5.2.3.39

50. Video tutorial for Section 5.2.4 "Polynomial Interpolation: Sensitivity": (13 minutes) Download link, tablet notes
→ review questions 5.2.4.16

51. Video tutorial for Section 5.3 "Shape-Preserving Interpolation": (23 minutes) Download link, tablet notes
→ review questions 5.3.3.19

52. Video tutorial for Section 5.4.1 "Spline Function Spaces": (9 minutes) Download link, tablet notes
→ review questions 5.4.1.5

53. Video tutorial for Section 5.4.2 "Cubic Spline Interpolation": (14 minutes) Download link, tablet notes
→ review questions 5.4.2.16

54. Video tutorial for Section 5.4.3 "Structural Properties of Cubic Spline Interpolants": (12 minutes) Download link, tablet notes
→ review questions 5.4.3.11

55. Video tutorial for Section 5.6 "Trigonometric Interpolation": (14 minutes) Download link, tablet notes
→ review questions 5.6.3.10

56. Video tutorial for Section 5.7 "Least Squares Data Fitting": (13 minutes) Download link, tablet notes
→ review questions 5.7.0.31

57. Video tutorial for Section 6.1 "Approximation of Functions in 1D: Introduction": (7 minutes) Download link, tablet notes
→ review questions 6.1.0.9

58. Video tutorial for Section 6.2 "Polynomial Approximation: Theory": (13 minutes) Download link, tablet notes
→ review questions 6.2.1.28

59. Video tutorial for Section 6.2.2 "Error Estimates for Polynomial Interpolation": (12 minutes) Download link, tablet notes
→ review questions 6.2.2.14

60. Video tutorial for Section 6.2.2.2 "Error Estimates for Polynomial Interpolation: Interpolands of Finite Smoothness": (17 minutes) Download link, tablet notes
→ review questions 6.2.2.34


61. Video tutorial for Section 6.2.2.3 "Error Estimates for Polynomial Interpolation: Analytic Interpolands": (27 minutes) Download link, tablet notes
→ review questions 6.2.2.69

62. Video tutorial for Section 6.2.3.1 "Chebychev Interpolation: Motivation and Definition": (21 minutes) Download link, tablet notes
→ review questions 6.2.3.13

63. Video tutorial for Section 6.2.3.2 "Chebychev Interpolation Error Estimates": (14 minutes) Download link, tablet notes
→ review questions 6.2.3.30

64. Video tutorial for Section 6.2.3.3 "Chebychev Interpolation: Computational Aspects": (11 minutes) Download link, tablet notes
→ review questions 6.2.3.44

65. Video tutorial for Section 6.5.1 "Approximation by Trigonometric Interpolation": (5 minutes) Download link, tablet notes
→ review questions 6.5.1.6

66. Video tutorial for Section 6.5.2 "Trigonometric Interpolation Error Estimates": (14 minutes) Download link, tablet notes
→ review questions 6.5.2.26

67. Video tutorial for Section 6.5.3 "Trigonometric Interpolation of Analytic Periodic Functions": (16 minutes) Download link, tablet notes
→ review questions 6.5.3.18

68. Video tutorial for Section 6.6.1 "Piecewise Polynomial Lagrange Interpolation": (17 minutes) Download link, tablet notes

69. Video tutorial for Section 6.6.2 "Cubic Hermite and Spline Interpolation: Error Estimates": (10 minutes) Download link, tablet notes

70. Video tutorial for Section 7.1 "Numerical Quadrature: Introduction": (4 minutes) Download link, tablet notes
→ review questions 7.1.0.5

71. Video tutorial for Section 7.2 "Quadrature Formulas/Rules": (13 minutes) Download link, tablet notes
→ review questions 7.2.0.15

72. Video tutorial for Section 7.3 "Polynomial Quadrature Formulas": (9 minutes) Download link, tablet notes
→ review questions 7.3.0.12


73. Video tutorial for Section 7.4.1 "Order of a Quadrature Rule": (9 minutes) Download link, tablet notes
→ review questions 7.4.1.12

74. Video tutorial for Section 7.4.2 "Maximal-Order Quadrature Rules": (16 minutes) Download link, tablet notes
→ review questions 7.4.2.27

75. Video tutorial for Section 7.4.3 "(Gauss-Legendre) Quadrature Error Estimates": (18 minutes) Download link, tablet notes
→ review questions 7.4.3.16

76. Video tutorial for Section 7.5 "Composite Quadrature": (18 minutes) Download link, tablet notes
→ review questions 7.5.0.26

77. Video tutorial for Section 7.6 "Adaptive Quadrature": (13 minutes) Download link, tablet notes
→ review questions 7.6.0.20

78. Video tutorial for Section 8.1 "Iterative Methods for Non-Linear Systems of Equations: Introduction": (6 minutes) Download link, tablet notes
→ review questions 8.1.0.6

79. Video tutorial for Section 8.2.1 "Iterative Methods: Fundamental Concepts": (6 minutes) Download link, tablet notes
→ review questions 8.2.1.11

80. Video tutorial for Section 8.2.2 "Iterative Methods: Speed of Convergence": (15 minutes) Download link, tablet notes
→ review questions 8.2.2.16

81. Video tutorial for Section 8.2.3 "Iterative Methods: Termination Criteria/Stopping Rules": (14 minutes) Download link, tablet notes
→ review questions 8.2.3.10

82. Video tutorial for Section 8.3 "Fixed-Point Iterations": (12 minutes) Download link, tablet notes
→ review questions 8.3.2.21

83. Video tutorial for Section 8.4.1 "Finding Zeros of Scalar Functions: Bisection": (7 minutes) Download link, tablet notes
→ review questions 8.4.1.4

84. Video tutorial for Section 8.4.2.1 "Newton Method in the Scalar Case": (20 minutes) Download link, tablet notes
→ review questions 8.4.2.16

85. Video tutorial for Section 8.4.2.3 "Multi-Point Methods": (12 minutes) Download link, tablet notes
→ review questions 8.4.2.41

86. Video tutorial for Section 8.4.3 "Asymptotic Efficiency of Iterative Methods for Zero Finding": (10 minutes) Download link, tablet notes
→ review questions 8.4.3.15

87. Video tutorial for Section 8.5.1 "The Newton Iteration in Rⁿ (I)": (10 minutes) Download link, tablet notes
→ review questions 8.5.1.46

88. Video tutorial for § 8.5.1.15 "Multi-dimensional Differentiation": (20 minutes) Download link, tablet notes

89. Video tutorial for Section 8.5.1 "The Newton Iteration in Rⁿ (II)": (15 minutes) Download link, tablet notes

90. Video tutorial for Section 8.5.2 "Convergence of Newton’s Method": (9 minutes) Download link, tablet notes
→ review questions 8.5.2.8

91. Video tutorial for Section 8.5.3 "Termination of Newton Iteration": (7 minutes) Download link, tablet notes
→ review questions 8.5.3.9

92. Video tutorial for Section 8.5.4 "Damped Newton Method": (11 minutes) Download link, tablet notes
→ review questions 8.5.4.8

93. Video tutorial for Section 8.6 "Quasi-Newton Method": (15 minutes) Download link, tablet notes
→ review questions 8.6.0.22

94. Video tutorial for Section 8.7 "Non-linear Least Squares": (7 minutes) Download link, tablet notes
→ review questions 8.7.0.10

95. Video tutorial for Section 8.7.1 "Non-linear Least Squares: (Damped) Newton Method": (13 minutes) Download link, tablet notes
→ review questions 8.7.1.9

96. Video tutorial for Section 8.7.2 "(Trust-region) Gauss-Newton Method": (13 minutes) Download link, tablet notes
→ review questions 8.7.3.3


97. Video tutorial for Section 11.1: Initial-Value Problems (IVPs) for Ordinary Differential Equations (ODEs): (35 minutes) Download link, tablet notes
→ review questions 11.1.4.8

98. Video tutorial for Section 11.2: Introduction: Polygonal Approximation Methods: (17 minutes) Download link, tablet notes
→ review questions 11.2.3.4

99. Video tutorial for Section 11.3: General Single-Step Methods: (14 minutes) Download link, tablet notes
→ review questions 11.3.1.17

100. Video tutorial for Section 11.3.2: (Asymptotic) Convergence of Single-Step Methods: (20 minutes) Download link, tablet notes
→ review questions 11.3.2.34

101. Video tutorial for Section 11.4: Explicit Runge-Kutta Single-Step Methods (RKSSMs): (27 minutes) Download link, tablet notes
→ review questions 11.4.0.20

102. Video tutorial for Section 11.5: Adaptive Stepsize Control: (32 minutes) Download link, tablet notes
→ review questions 11.5.3.10

103. Video tutorial for Section 12.1: Model Problem Analysis: (40 minutes) Download link, tablet notes
→ review questions 12.1.0.54

104. Video tutorial for Section 12.2: Stiff Initial-Value Problems: (24 minutes) Download link, tablet notes
→ review questions 12.2.0.17

105. Video tutorial for Section 12.3: Implicit Runge-Kutta Single-Step Methods: (50 minutes) Download link, tablet notes
→ review questions 12.3.4.23

106. Video tutorial for Section 12.4: Semi-Implicit Runge-Kutta Methods: (13 minutes) Download link, tablet notes
→ review questions 12.4.0.10

107. Video tutorial for Section 12.5: Splitting Methods: (21 minutes) Download link, tablet notes
→ review questions 12.5.0.14


y


! Necessary corrections and updates of the lecture document will sometimes lead to changes in the
numbering of paragraphs and formulas, which, of course, cannot be applied to the recorded videos.
However, these changes will be incorporated into the tablet notes supplied for every video.

0.2.1.2 Following the Course

Weekly study assignments

• For every week there is a list of course units and associated videos published on the course
Moodle Page.
• The corresponding contents must be studied in that same week.

§0.2.1.6 (How to organize your learning)


☛ Develop a routine: Plan fixed slots, with a total duration of four hours, for studying the course
material in your weekly calendar. This does not include homework.
☛ Choose a stable setting in which you can really concentrate (quiet area, headphones, coffee, etc.).
☛ Take breaks when concentration declines, usually after 20 to 45 minutes, but avoid online distractions during breaks.
y
You must not procrastinate!

! Do not put off studying for this course. Dependencies between the topics will make it very
hard to catch up.

§0.2.1.7 (“Personalized learning”) The flipped classroom model allows students to pursue their preferred
ways of studying. The following approaches can be tried.
• Traditional: You watch the assigned videos similar to attending a conventional classroom lecture.
Afterwards digest the material based on the tablet notes and/or the lecture document. Finally, answer
the review questions and look up more information in the lecture document.
• Reading-centered: You work through the unit reading the tablet notes and, sometimes, related
sections of the lecture document. You occasionally watch parts of the videos, in case some
considerations and arguments have not yet become clear to you.

Collaborative studying is encouraged:

• You may watch course videos together with classmates.
• You may meet to discuss course units.
• You may solve homework problems in a group, assigning different parts to different members.

☛ Explaining to others is a great way to deepen understanding.
☛ It is easy to sustain motivation and avoid distraction in a peer study group.

(Fig. 4)
y


§0.2.1.8 (Question and Answer (Q&A) sessions) The lecturer will offer a two-hour so-called Q&A ses-
sion almost every week during the teaching period, but not in the weeks in which term exams will be held.
These Q&A sessions will be devoted to
• discussing and answering questions asked by the participants of the course,
• presenting solutions of review questions, and
• offering additional explanations for some parts of the course.
Questions can be asked right during the Q&A session, but participants of the course are encouraged to
submit general or specific questions or comments beforehand.
Questions/comments can be posted in dedicated DISCUNA chat channels (folder “Q&A Channels”,
community “NumCSE Autumn <YEAR>”), which will be set up for each week in which a regular Q&A
session will take place.

It is highly desirable that questions are submitted at least a few hours before the start of the Q&A session
so that the lecturer has the opportunity to structure his or her answer.
Tablet notes of the Q&A sessions will be made available for download. y

0.2.2 Clarifications and Frank Words


§0.2.2.1 (“Lecture notes”)
The PDF you are reading is referred to as lecture document and is an important source of information, but

this course document is neither a textbook nor comprehensive lecture notes.


It is meant to supplement and be supplemented by the explanations given in the videos.

Some pieces of advice:


✦ The lecture document is only partly designed to be self-contained and can/should be studied in parts
in addition to watching the course videos and/or reading the tablet notes.
✦ This text is not meant for mere reading, but for working with,
✦ Turn pages all the time and follow the numerous cross-references,
✦ study the relevant section of the course material when doing homework problems,
✦ You may study referenced literature to refresh prerequisite knowledge and for alternative presen-
tation of the material (from a different angle, maybe), but be careful about not getting confused or
distracted by information overload.
y

§0.2.2.2 (Comprehension is a process . . .)


✦ This course will require
hard work – perseverance – patience
✦ Do not expect to understand everything at once. Most students will
• understand about one third of the material when watching videos and studying the course
material
• understand another third when making a serious effort to solve the homework problems,


• hopefully understand the remaining third when studying for the main examination after the end
of the course.
Perseverance will be rewarded!

§0.2.2.3 (Expected workload)


(I) You are a student in the BSc/MSc programme of Computational Science and Engineering (CSE) or
    another programme. Then you are taking the full version of the course (401-2673-*), which is
    endowed with 9 ECTS credits, roughly corresponding to a total workload of 270 hours:

        270 hours = 180 hours (during term) + 90 hours (exam preparation) ,

    which amounts to a massive

        average workload ≈ 10 − 14 hours per week.
(II) If you are a student in the BSc Computer Science programme, you are offered a trimmed version of
     the course, which is worth 7 ECTS credits. Though this is only a loose relationship, it roughly
     indicates a total workload of 180 hours:

        180 hours = 110 hours (during term) + 70 hours (exam preparation) .

     This indicates that you should brace for an

        average workload ≈ 7 − 9 hours per week.
For both versions of the course your efforts have to be split between
• watching videos and/or studying the course material,
• solving homework problems, and
• attending Q&A sessions and tutorials,
where homework alone may keep you busy for 5 − 7 hours every week for the full version.
Of course, all these are averages and the workload may vary between different weeks. y

0.2.3 Requests
The lecturers very much welcome and, putting it even more strongly, rather depend on feedback and
suggestions of the students taking the course for continuous improvement of the course contents and
presentation. Therefore all participants are strongly encouraged to get involved actively and contribute in
the following ways:
§0.2.3.1 (Reporting errors) As the documents for this course will always be in a state of flux, they will
inevitably and invariably teem with small errors, mainly typos and omissions.

For error reporting we use the DISCUNA online collaboration platform that
runs in the browser.

DISCUNA allows you to attach various types of annotations to shared PDF documents; see the instruction video.


Please report errors in the lecture material through the DISCUNA NumCSE Community to which
various course-related documents have already been uploaded.

In the beginning of the teaching period you receive a join link of the form
https://app.discuna.com/<JOIN CODE>. Open the link in a web browser and it will take
you to the DISCUNA community page.
To report an error,
1. select the corresponding PDF document (chapter of the lecture document or homework problem) in
the left sidebar,
2. press the prominent white-on-blue +-button in the right sidebar,
3. click on the displayed PDF where the error is located,
4. then in the pop-up window choose the “Error” category,
5. and add a title and,
6. if the title does not tell everything, a short description.
In case you cannot or do not want to link an error to a particular point in the PDF, you may just click on the
title page of the respective chapter. Then, please precisely specify the concerned section and the number
of the paragraph, remark, equation etc. Do not give page numbers as they may change with updates to
the documents.
Note that chapter PDFs and homework problem files will gradually be added to the DISCUNA NumCSE
community. Hence, the final chapters will not be accessible in the beginning of the course. y

§0.2.3.2 (Pointing out technical problems) The DISCUNA NumCSE Community is equipped with a chat
channel “Technical Problems”. In case you encounter a problem affecting the videos, the course web
pages, or the PDF documents supplied online, say, severely distorted or missing audio tracks or a faulty
link, instantly post a comment to this channel with a short description of the problem. You can do this after
clicking on the channel name in the left sidebar of the community. y

§0.2.3.3 (Providing comments and suggestions) The chat channel “General Comments” of the
DISCUNA NumCSE Community is meant for letting the lecturer know about weaknesses of the contents,
structure, and presentation of the course and how they can be remedied. Your statements should be
constructive and address specific parts or aspects of the course.
Regularly, students attending the course remark that they have found online resources like instruction
videos that they think present some of the course material in a much clearer and better structured way. It
is important that you tell the lecturer about those online resources so that he can include pointers to them
and get inspiration. Use the “General Comments” channel also for this purpose. State clearly, which part
of the course you are referring to, and briefly explain why the online resource is superior or a valuable
supplement. y

§0.2.3.4 (Asking/posting questions) Whenever a question comes up while you are studying for the
course or trying to solve homework problems and that question lingers, it is probably connected to an
issue that also bothers other students. What to do, in case you are not able to attend the Q&A session?

Please post arising questions to the DISCUNA Q&A channels even if you do not attend the Q&A
session! See also § 0.2.1.8.

This has the benefit of
• initiating a discussion of the question that may also be relevant for other students, and
• making it possible for you to find an answer in the Q&A tablet notes.
Tongue in cheek:

There is no question too stupid to be worth asking!

The most stupid practice is to hesitate to ask questions!


y

0.2.4 Assignments

A steady and persistent effort spent on homework problems is essential for success in this course.

You should expect to spend 3-5 hours per week on trying to solve the homework problems. Since many
involve small coding projects, the time it will take an individual student to arrive at a solution is hard to
predict.

For the sake of efficiency:


Avoid coding errors (bugs) in your homework coding projects!

The problems are published online together with plenty of hints. A master solution will also be made
available, but it is foolish to read the master solution in parallel with working on a problem sheet, because
trying to find the solution on one’s own is essential for developing problem-solving skills, though it may
occasionally be frustrating.

§0.2.4.1 (Homeworks and tutors’ corrections)


✦ The weekly assignments will be a few problems from the NCSE Problem Collection available on-
line as PDF, see course Moodle page for the link. The particular problems to be solved will be
communicated through that Moodle page every week.

Please note that this problem collection is being extended throughout the semester. Thus, make
sure that you obtain the most current version every week. A polybox link will also be distributed;
if you install the Polybox Client, the most current version of all course documents will always be
synchronized to your machine.
✦ Some or all of the problems of an assignment sheet will be discussed in the tutorial classes at least
one week after the problems have been assigned.

✦ Your tutors are happy to examine your solutions and give you feedback: You may either hand
them your solution papers during the tutorial session (put your name on every sheet and clearly
mark the problems you want to be inspected) or upload a scan/photo through the CODE EXPERT up-
load interface, see § 0.2.4.2 below. You are encouraged to hand in incomplete and wrong solutions,
so that you can receive valuable feedback even on failed attempts.
✦ Your tutors will automatically have access to all your homework codes, see § 0.2.4.2 below.
y

§0.2.4.2 (CODE EXPERT C++ online IDE and testing environment)


CODE EXPERT has been developed at ETH as an online IDE for small programming assignments and
coding homework. It will be used in this course for all C++ homework problems.

Please study the documentation!


CODE EXPERT also offers the possibility of uploading any files to a private area (connected with a home-
work problem) that, besides you, only the tutor in charge of your exercise group can access. This is the
preferred option for sharing (scans/photos of) your solutions of homework problems with your tutor.

Note that CODE EXPERT will also be used for the coding problems of the main examination.

y
§0.2.4.4 (CODE EXPERT synchronization with local folder) If you prefer to use your own editor locally
on your computer, synchronization between the online CODE EXPERT repository and your local folder is
available via the Code Expert Sync tool. Follow the instructions here.
The working pipeline is:

Sync from the CODE EXPERT platform −→ Edit locally −→ Sync with CODE EXPERT and run/test −→
continue editing locally . . .

0.2.5 Information on Examinations


0.2.5.1 For the Course 401-2673-00L Numerical Methods for CSE (BSc CSE)

§0.2.5.1 (Examinations during the teaching period) From the ETH course directory:
An optional 30-minutes mid-term exam and an optional 30-minutes end-term exam will be
held during the teaching period. The grades of these interim examinations will be taken into
account through a BONUS of up to 30% for the final grade.
The term exams will be conducted as closed-book examinations on paper. The dates of the exams will be
communicated in the beginning of the term and published on the course webpage. The term exams can
neither be repeated nor be taken remotely.
The final grade is computed according to the formula

G := 0.25 · ⌈4 · max{ Gs , 0.85Gs + 0.15gm , 0.85Gs + 0.15ge , 0.7Gs + 0.15gm + 0.15ge }⌉ , (0.2.5.2)

Gs =̂ grade in main exam, gm =̂ mid-term grade, ge =̂ end-term grade,

where ⌈ x ⌉ designates the smallest integer ≥ x. y
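For illustration (with hypothetical marks, not an official example): if Gs = 4.5, gm = 5.5, and ge = 6.0, the
maximum in (0.2.5.2) equals max{4.5, 4.65, 4.725, 4.875} = 4.875, so that
G = 0.25 · ⌈4 · 4.875⌉ = 0.25 · ⌈19.5⌉ = 0.25 · 20 = 5.0. Since Gs itself is always contained in the maximum,
the term-exam bonus can never lower the final grade.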

§0.2.5.3 (Main examination during exam session)


✦ Three-hour written examination involving coding problems to be done at the computer. The date of
the exam will be set and communicated by the ETH exam office, and will also be published on the
course webpage.
✦ The coding part of the exam has to be done using CODE EXPERT.
✦ Subjects of examination:


• All topics that have been addressed in a video listed on the course Moodle page or in any
assigned homework problem
The lecture document contains much more material than covered in class. All these extra topics are
not relevant for the exam.
✦ The lecture document (as PDF), the EIGEN documentation, and the online C++ reference pages will
be available during the examination. The corresponding final version of the lecture document
will be made available at least two weeks before the exam.
✦ No other materials may be used during the exam.
✦ The homework problem collection cannot be accessed during the exam.
✦ The exam questions will be asked in English.
✦ In case you come to the conclusion that you have too little time to prepare for the main exam a few
weeks before the exam, contemplate withdrawing in order not to squander an attempt.
y

§0.2.5.4 (Repeating the main exam)


• Bonus points earned in term exams in last year’s course can be taken into account for this course’s
main exam.
• If you want to take this option, please declare this intention by email to the course organizers before
the mid-term exam. Otherwise, your bonus will be based on the results of this year’s term exams.
y

0.2.5.2 For the Course 401-0663-00L Numerical Methods for CS (BSc Informatik)

§0.2.5.5 (Homework bonus) During the teaching period, quizzes and exercises similar to those that will
appear in the final exam are published every week on the Moodle page of the lecture. These are open
for answers for about one week and students are expected to answer them within this time. Answering
them later is not possible. Correct answers are awarded “semester points” that are defined for each
question. Hence, each student has the possibility to accumulate such points during the semester.

Grade bonus
The grade achieved in the final exam will be raised by 0.25 for all students who have earned at least
75% of the “semester points”.

§0.2.5.7 (Main (session) examination)


• Most of the exam questions are quizzes and exercises similar to those assigned as homework.
• The exam will mainly comprise either multiple choice tasks or tasks where you have to type the
answer in an answer box.
• The exam may contain questions addressing coding. You may be asked to find and correct errors in
code snippets or supplement missing parts of a code.
• Some tasks may require programming in order to be answered, though these codes will NOT be
checked, but only the correctness of the final result will matter.


• Visual Studio Code can be used as an editor during the exam, but only the code submitted through
CODE EXPERT will be saved during the exam and taken into account for grading.
y

0.3 Programming in C++


C++20 is the current ANSI/ISO standard for the programming language C++. On the one hand, it offers
a wealth of features and possibilities. On the other hand, this can be confusing and even prone to
inconsistencies. A major cause of inconsistent design is the requirement of backward compatibility with
the C programming language and the earlier standard C++98.

However, C++ has become the main language in computational science and engineering and high per-
formance computing. Therefore this course relies on C++ to discuss the implementation of numerical
methods.

In fact C++ is a blend of different programming paradigms:


• an object oriented core providing classes, inheritance, and runtime polymorphism,
• a powerful template mechanism for parametric types and partial specialization, enabling template
meta-programming and compile-time polymorphism,
• a collection of abstract data containers and basic algorithms provided by the Standard Template
Library (STL).

Supplementary literature. A popular book for learning C++ that has been upgraded to include

the C++11 standard is [LLM12].


The book [Jos12] gives a comprehensive presentation of the new features of C++11 compared to
earlier versions of C++.
There are plenty of online reference pages for C++, for instance
http://en.cppreference.com and http://www.cplusplus.com/.

The following sections highlight a few particular aspects of C++ that may be important for code develop-
ment in this course.

The version of the course for BSc students of Computer Science includes a two-week introduction
to C++ in the beginning of the course.

0.3.1 Function Arguments and Overloading


§0.3.1.1 (Function overloading, [LLM12, Sect. 6.4]) Argument types are an integral part of a function
declaration in C++. Hence the following functions are different
int *f(int);             // use this in the case of a single numeric argument
double f(int *);         // use only if a pointer to an integer is given
void f(const MyClass &); // use when called for a MyClass object


and the compiler selects the function to be used depending on the type of the arguments following rather
sophisticated rules, refer to overload resolution rules. Complications arise, because implicit type conver-
sions have to be taken into account. In case of ambiguity a compile-time error will be triggered. Functions
cannot be distinguished by return type!

For member functions (methods) of classes an additional distinction can be introduced by the const spec-
ifier:
struct MyClass {
  double f(double);       // use for a mutable object of type MyClass
  double f(double) const; // use this version for a constant object
  ...
};

The second version of the method f is invoked for constant objects of type MyClass. y
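The following minimal sketch (a hypothetical class, not taken from the course codes) shows which overload is selected:

#include <iostream>

struct MyClass {
  double f(double x) { return 2.0 * x; }        // selected for mutable objects
  double f(double x) const { return 3.0 * x; }  // selected for const objects
};

int main() {
  MyClass a;          // mutable object
  const MyClass b{};  // constant object
  std::cout << a.f(1.0) << std::endl;  // calls the non-const version, prints 2
  std::cout << b.f(1.0) << std::endl;  // calls the const version, prints 3
  return 0;
}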

§0.3.1.2 (Operator overloading [LLM12, Chapter 14]) In C++ unary and binary operators like =, ==, +,
-, *, /, +=, -=, *=, /=, %, &&, ||, <<, >>, etc. are regarded as functions with a fixed number of arguments
(one or two). For built-in numeric and logic types they are defined already. They can be extended to any
other type, for instance
MyClass operator+(const MyClass &, const MyClass &);
MyClass operator+(const MyClass &, double);
MyClass operator+(const MyClass &);  // unary + !

The same selection rules as for function overloading apply. Of course, operators can also be introduced
as class member functions.

C++ gives complete freedom to overload operators. However, the semantics of the new operators should
be close to the customary use of the operator. y
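As a minimal sketch (for a hypothetical 2D-vector struct Vec2, not part of the course codes), a binary + and an output operator could be defined and used as follows:

#include <iostream>

struct Vec2 {
  double x, y;
};

// Binary + implemented as a non-member function
Vec2 operator+(const Vec2 &a, const Vec2 &b) {
  return {a.x + b.x, a.y + b.y};
}

// Output operator, overloaded in the same fashion
std::ostream &operator<<(std::ostream &os, const Vec2 &v) {
  return os << '[' << v.x << ',' << v.y << ']';
}

int main() {
  Vec2 u{1.0, 2.0}, v{3.0, 4.0};
  std::cout << u + v << std::endl;  // prints [4,6]
  return 0;
}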

§0.3.1.3 (Passing arguments by value and by reference [LLM12, Sect. 6.2]) Consider a generic func-
tion declared as follows:
void f(MyClass x); // Argument x passed by value.

When f is invoked, a temporary copy of the argument is created through the copy constructor or the move
constructor of MyClass. The new temporary object is a local variable inside the function body.

When a function is declared as follows


void f(MyClass &x); // Argument x passed by reference.

then the argument is passed to the scope of the function and can be changed inside the function. No
copies are created. If one wants to avoid the creation of temporary objects, which may be costly, but also
wants to indicate that the argument will not be modified inside f, then the declaration should read
void f(const MyClass &x); // Argument x passed by constant reference.

New in C++11 is move semantics, enabled in the following definition


void f(MyClass &&x); // Optional shallow copy (move semantics)

In this case, if the scope of the object passed as the argument is merely the function or std::move()
tags it as disposable, the move constructor of MyClass is invoked, which will usually do a shallow copy
only. Refer to Code 0.3.5.10 for an example. y
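The following sketch (hypothetical functions, for illustration only) contrasts the most common ways of passing arguments:

#include <iostream>
#include <vector>

// Pass by value: the caller's vector is copied; changes stay local
void byValue(std::vector<double> v) { v.push_back(0.0); }

// Pass by non-const reference: the caller's vector is modified
void byReference(std::vector<double> &v) { v.push_back(0.0); }

// Pass by const reference: no copy is made, no modification is possible
double sum(const std::vector<double> &v) {
  double s = 0.0;
  for (double x : v) s += x;
  return s;
}

int main() {
  std::vector<double> v{1.0, 2.0, 3.0};
  byValue(v);      // v still has 3 elements
  byReference(v);  // v now has 4 elements
  std::cout << v.size() << ", sum = " << sum(v) << std::endl;  // prints 4, sum = 6
  return 0;
}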


0.3.2 Templates
§0.3.2.1 (Function templates) The template mechanism supports parameterization of definitions of
classes and functions by type. An example of a function templates is
template <typename ScalarType, typename VectorType>
VectorType saxpy(ScalarType alpha, const VectorType &x, const VectorType &y)
{ return (alpha*x+y); }

Depending on the concrete type of the arguments the compiler will instantiate particular versions of this
function, for instance saxpy<float,double>, when alpha is of type float and both x and y are of
type double. In this case the return type will be double.

For the above example the compiler will be able to deduce the types ScalarType and VectorType
from the arguments. The programmer can also specify the types directly through the < >-syntax as in
saxpy<double, double>(a, x, y);

if an instantiation for all arguments of type double is desired. In case the arguments do not supply
enough information about the type parameters, specifying (some of) them through < > is mandatory. y

§0.3.2.2 (Class templates) A class template defines a class depending on one or more type parameters,
for instance
template <typename T>
class MyClsTempl {
public:
  using parm_t = T;       // T-dependent type
  MyClsTempl(void);       // Default constructor
  MyClsTempl(const T &);  // Constructor with an argument
  template <typename U>
  T memfn(const T &, const U &) const; // Templated member function
private:
  T *ptr;                 // Data member, T-pointer
};

Types MyClsTempl<T> for a concrete choice of T are instantiated when a corresponding object is de-
clared, for instance via
double x = 3.14;
MyClass myobj;                         // Default construction of an object
MyClsTempl<double> tinstd;             // Instantiation for T = double
MyClsTempl<MyClass> mytinst(myobj);    // Instantiation for T = MyClass
MyClass ret = mytinst.memfn(myobj, x); // Instantiation of member function
                                       // for U = double, automatic type deduction

The types spawned by a template for different parameter types have nothing to do with each other. y

Requirements on parameter types

The parameter types for a template have to provide all type definitions, member functions, operators,
and data needed to make possible the instantiation (“compilation”) of the class or function template.
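For instance (a sketch reusing the saxpy template from above), instantiation succeeds only if the deduced types supply the operations used in the function body:

#include <string>

template <typename ScalarType, typename VectorType>
VectorType saxpy(ScalarType alpha, const VectorType &x, const VectorType &y) {
  return (alpha * x + y);
}

int main() {
  // Fine: double provides both operator* and operator+
  double r = saxpy(2.0, 3.0, 4.0);  // r = 10
  // Compile-time error if uncommented: no operator* for double * std::string
  // auto s = saxpy(2.0, std::string("a"), std::string("b"));
  return (r > 0.0) ? 0 : 1;
}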


0.3.3 Function Objects and Lambda Functions


A function object is an object of a type that provides an overloaded “function call” operator (). Function
objects can be implemented in two different ways:
(I) through special classes like the following one, which realizes a function R → R:
class MyFun {
public:
  ...
  double operator()(double x) const; // Evaluation operator
  ...
};

The evaluation operator can take more than one argument and need not be declared const.
(II) through lambda functions, an “anonymous function” defined as
[<capture list>] (<arguments>) -> <return type> { body; }

where <capture list> is a list of variables from the local scope to be passed to the lambda function;
                     an & indicates passing by reference,
      <arguments>    is a comma-separated list of function arguments complete with types,
      <return type>  is an optional return type; often the compiler will be able to deduce the
                     return type from the definition of the function.
Function classes should be used, when the function is needed in different places, whereas lambda func-
tions for short functions intended for single use.

C++ code 0.3.3.1: Demonstration of use of lambda function ➺ GITLAB


1   int main() {
2     // initialize a vector from an initializer list
3     std::vector<double> v({1.2, 2.3, 3.4, 4.5, 5.6, 6.7, 7.8});
4     // A vector of the same length
5     std::vector<double> w(v.size());
6     // Do cumulative summation of v and store result in w
7     double sum = 0;
8     std::transform(v.begin(), v.end(), w.begin(),
9                    [&sum](double x) { sum += x; return sum; });
10    cout << "sum = " << sum << ", w = [ ";
11    for (auto x : w) cout << x << ' '; cout << ']' << endl;
12    return (0);
13  }

In this code the lambda function captures the local variable sum by reference, which enables the lambda
function to change its value in the surrounding scope.
§0.3.3.2 (Function type wrappers) The special class std::function provides types for general poly-
morphic function wrappers.
std::function<return type(arg types)>

C++ code 0.3.3.3: Use of std::function ➺ GITLAB


1   double binop(double arg1, double arg2) { return (arg1 / arg2); }
2
3   void stdfunctiontest(void) {
4     // Vector of objects of a particular signature
5     std::vector<std::function<double(double, double)>> fnvec;
6     // Store reference to a regular function
7     fnvec.push_back(binop);
8     // Store a lambda function
9     fnvec.push_back([](double x, double y) -> double { return y / x; });
10    for (auto fn : fnvec) { std::cout << fn(3, 2) << std::endl; }
11  }

In this example an object of type std::function<double(double,double)> can hold a regular func-


tion taking two double arguments and returning another double or a lambda function with the same
signature. Guess the output of stdfunctiontest! y

§0.3.3.4 (Recorder objects) In the case of routines that perform some numerical computations we are
often interested in the final result only. Occasionally we may also want to screen intermediate results. The
following example demonstrates the use of an optional object for collecting information while the function
is being executed. If no such object is supplied, an idle lambda function is passed, which incurs absolutely
no runtime overhead.

C++ code 0.3.3.5: An example of a function taking a recorder object. ➺ GITLAB


2   template <typename RECORDER = std::function<void(int, int)>>
3   unsigned int myloopfunction(
4       unsigned int n, unsigned int val = 1,
5       RECORDER &&rec = [](int, int) -> void {}) {
6     for (unsigned int i = 0; i < n; ++i) {
7       rec(i, val); // Removed by the compiler for the default argument
8       if (val % 2 == 0) {
9         val /= 2;
10      } else {
11        val *= 3;
12        val++;
13      }
14    }
15    rec(n, val);
16    return val;
17  }

C++ code 0.3.3.6: Calling myloopfunction() ➺ GITLAB


2   std::cout << "myloopfunction(10, 1) = " << myloopfunction(10, 1) << std::endl;
3   // Run with recorder
4   std::vector<std::pair<int, int>> store{};
5   std::cout << "myloopfunction(10, 1) = "
6             << myloopfunction(10, 1,
7                               [&store](int n, int val) -> void {
8                                 store.emplace_back(n, val);
9                               })
10            << std::endl;
11  std::cout << "History: " << std::endl;
12  for (const auto& i : store) {
13    std::cout << i.first << " -> " << i.second << std::endl;
14  }

§0.3.3.7 (Captures of lambda functions)


• Putting the name of a variable from the current scope in a lambda’s capture list makes a copy of that
variable accessible inside the lambda’s body. These copies are immutable inside the lambda’s body
(unless the lambda is declared mutable).
• To capture a local variable by (non-const) reference, prepend the variable name with &. That variable’s
value can then be changed by the lambda.
• The capture list [=] captures all local variables by value, i.e., as immutable copies.
• Conversely, the capture list [&] means that all variables in the local scope are captured by non-const
reference and can be changed by the lambda function.
y
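A minimal sketch (variable names chosen arbitrarily) illustrating these capture modes:

#include <iostream>

int main() {
  int a = 1, b = 2;
  // a captured by value (immutable copy), b captured by reference
  auto f = [a, &b]() { b += a; return a + b; };
  std::cout << f() << std::endl;  // b becomes 3, prints 1 + 3 = 4
  a = 10;                         // does not affect the copy of a stored in f
  std::cout << f() << std::endl;  // b becomes 4, prints 1 + 4 = 5
  // [&] captures everything by reference; the lambda may modify a and b
  auto g = [&]() { ++a; ++b; };
  g();
  std::cout << a << ' ' << b << std::endl;  // prints 11 5
  return 0;
}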

§0.3.3.8 (Lambda functions inside member functions) To access class methods or class variables in
a lambda function inside a member function of a class you have to capture the current object by putting
this (capture of the object’s address) or *this (capture of a copy of the object, since C++17) in the capture list.

C++ code 0.3.3.9: A lambda function inside a member function. ➺ GITLAB


2   struct X {
3     explicit X(int N) : N_(N) {}
4     [[nodiscard]] unsigned int mod(unsigned int n) const { return N_ % n; }
5     bool modmatch(const std::vector<int> &nums, unsigned int n);
6   private:
7     int N_;
8   };
9
10  bool X::modmatch(const std::vector<int> &nums, unsigned int n) {
11    auto it = std::find_if(nums.begin(), nums.end(), [this, n](int k) -> bool {
12      N_++; return (N_ != 0) and ((k % n) == mod(n));
13    });
14    return (it != nums.end());
15  }

§0.3.3.10 (Recursions based on lambda functions) Lambda functions offer an elegant way to implement
recursive algorithms locally inside a function. Note that
• you have to capture the lambda function itself by reference,
• and that you cannot use auto for automatic compile-time type deduction of that lambda function!

C++ code 0.3.3.11: Recursively calling a lambda function ➺ GITLAB


2   int main() {
3     int n_calls = 0;
4     std::function<int(int)> factorial = [&factorial, &n_calls](int n) -> int {
5       n_calls++;
6       if (n == 0) {
7         return 1;
8       }
9       return n * factorial(n - 1);
10    };
11    std::cout << "10! = " << factorial(10) << std::endl;
12    return 0;
13  }


0.3.4 Multiple Return Values


In P YTHON it is customary to return several variables from a function call, which, in fact, amounts to
returning a tuple of mixed-type objects:
1   def f(a, b):
2       return min(a, b), max(a, b), (a + b) / 2
3   x, y, z = f(1, 2)

In C++ this is also possible by using the tuple utility. For instance, the following function computes the
minimal and maximal element of a vector and also its cumulative sums. It returns all these values.

C++ code 0.3.4.1: Function with multiple return values ➺ GITLAB


1   template <typename T>
2   std::tuple<T, T, std::vector<T>> extcumsum(const std::vector<T> &v) {
3     // Local summation variable captured by reference by lambda function
4     T sum{};
5     // temporary vector for returning cumulative sum
6     std::vector<T> w{};
7     // cumulative summation
8     std::transform(v.cbegin(), v.cend(), back_inserter(w),
9                    [&sum](T x) { sum += x; return (sum); });
10    return (std::make_tuple(*std::min_element(v.cbegin(), v.cend()),
11                            *std::max_element(v.cbegin(), v.cend()),
12                            std::move(w)));
13  }

This code snippet shows how to extract the individual components of the tuple returned by the previous
function.

C++ code 0.3.4.2: Calling a function with multiple return values ➺ GITLAB
1   int main() {
2     // initialize a vector from an initializer list
3     std::vector<double> v({1.2, 2.3, 3.4, 4.5, 5.6, 6.7, 7.8});
4     // Variables for return values
5     double minv, maxv;      // Extremal elements
6     std::vector<double> cs; // Cumulative sums
7     std::tie(minv, maxv, cs) = extcumsum(v);
8     cout << "min = " << minv << ", max = " << maxv << endl;
9     cout << "cs = [ "; for (double x : cs) cout << x << ' '; cout << "]" << endl;
10    return (0);
11  }

Be careful: many temporary objects might be created! A demonstration of this hidden cost is given in
Exp. 0.3.5.27. From C++17 a more compact syntax is available:

C++ code 0.3.4.3: Calling a function with multiple return values ➺ GITLAB
1   int main() {
2     // initialize a vector from an initializer list
3     std::vector<double> v({1.2, 2.3, 3.4, 4.5, 5.6, 6.7, 7.8});
4     // Definition of variables and assignment of return values all at once
5     auto [minv, maxv, cs] = extcumsum(v);
6     cout << "min = " << minv << ", max = " << maxv << endl;
7     cout << "cs = [ "; for (double x : cs) cout << x << ' '; cout << "]" << endl;
8     return (0);
9   }

Remark 0.3.4.4 (“auto” considered harmful) C++ is a strongly typed programming language and every
variable must have a precise type. However, the developer of templated classes and functions may not
know the type of some variables in advance, because it can be deduced only after instantiation through
the compiler. The auto keyword has been introduced to handle this situation.
There is a temptation to use auto profligately, because it is convenient, in particular when using templated
data types. However, this denies a major benefit of types, consistency checking at compile time and, as a
developer, one may eventually lose track of the types completely, which can lead to errors that are hard to
detect.
Thus, the use of auto should be avoided, except in the following situations:
• for variables inside templated functions or classes, whose precise type will only become clear during
instantiation,
• for lambda functions, see Section 0.3.3,
• for return values of templated library (member) functions, whose type is “impossible to deduce” by
the user. An example is expression templates in EIGEN, refer to Rem. 1.2.1.11 below.
y
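A small sketch (hypothetical variables, standard library only; the EIGEN case is covered in Rem. 1.2.1.11) of discouraged and legitimate uses of auto along these lines:

#include <vector>

void autodemo() {
  // Discouraged: the type is simple and known, so spell it out
  double x = 3.0;  // clearer than: auto x = 3.0;
  // Legitimate: a lambda function has a compiler-generated type
  auto square = [](double t) { return t * t; };
  // Legitimate: verbose iterator types of (templated) containers
  std::vector<double> v{1.0, 2.0, 3.0};
  auto it = v.cbegin();
  x = square(*it);
  (void)x;  // suppress unused-variable warnings
}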

0.3.5 A Vector Class


Since C++ is an object oriented programming language, datatypes defined by classes play a pivotal role in
every C++ program. Here, we demonstrate the main ingredients of a class definition and other important
facilities of C++ for the class MyVector meant for objects representing vectors from R n . The codes can
be found in ➺ GITLAB. A similar vector class is presented in [Fri19, Ch. 6].

C++ 11 class 0.3.5.1: Definition of a simple vector class MyVector ➺ GITLAB


1 namespace myvec {
2 class MyVector {
3 public :
4 using v a l u e _ t = double ;
5 // Constructor creating constant vector, also default constructor
6 e x p l i c i t MyVector ( std : : s i z e _ t n = 0 , double v a l = 0 . 0 ) ;
7 // Constructor: initialization from an STL container
8 template <typename Container > MyVector ( const C o n t a i n e r &v ) ;
9 // Constructor: initialization from an STL iterator range
10 template <typename I t e r a t o r > MyVector ( I t e r a t o r f i r s t , I t e r a t o r l a s t ) ;
11 // Copy constructor, computational cost O(n)
12 MyVector ( const MyVector &mv) ;
13 // Move constructor, computational cost O(1)
14 MyVector ( MyVector &&mv) ;
15 // Copy assignment operator, computational cost O(n)
16 MyVector &operator = ( const MyVector &mv) ;
17 // Move assignment operator, computational cost O(1)
18 MyVector &operator = ( MyVector &&mv) ;
19 // Destructor
20 v i r t u a l ~MyVector ( void ) ;
21 // Type conversion to STL vector


22 operator std : : vector <double> ( ) const ;


23

24 // Returns length of vector


25 std : : s i z e _ t s i z e ( void ) const { r e t u r n n ; }
26 // Access operators: rvalue & lvalue, with range check
27 double operator [ ] ( std : : s i z e _ t i ) const ;
28 double &operator [ ] ( std : : s i z e _ t i ) ;
29 // Comparison operators
30 bool operator == ( const MyVector &mv) const ;
31 bool operator ! = ( const MyVector &mv) const ;
32 // Transformation of a vector by a function R → R
33 template <typename Functor >
34 MyVector &t r a n s f o r m ( F u n c t o r && f ) ;
35

36 // Overloaded arithmetic operators


37 // In place vector addition: x += y;
38 MyVector &operator +=( const MyVector &mv) ;
39 // In place vector subtraction: x-= y;
40 MyVector &operator −=( const MyVector &mv) ;
41 // In place scalar multiplication: x *= a;
42 MyVector &operator * = ( double alpha ) ;
43 // In place scalar division: x /= a;
44 MyVector &operator / = ( double alpha ) ;
45 // Vector addition
46 MyVector operator + ( MyVector mv) const ;
47 // Vector subtraction
48 MyVector operator − ( const MyVector &mv) const ;
49 // Scalar multiplication from right and left: x = a*y; x = y*a
50 MyVector operator * ( double alpha ) const ;
51 f r i e n d MyVector operator * ( double alpha , const MyVector &) ;
52 // Scalar divsion: x = y/a;
53 MyVector operator / ( double alpha ) const ;
54

55 // Euclidean norm
56 [ [ n o d i s c a r d ] ] double norm ( void ) const ;
57 // Euclidean inner product
58 double operator * ( const MyVector &) const ;
59 // Output operator
60 f r i e n d std : : ostream &
61 operator << ( std : : ostream & , const MyVector &mv) ;
62

63 s t a t i c bool dbg ; // Flag for verbose output


64 // Non-const static class variables deprecated by C++ core guidelines!
65 private :
66 std : : s i z e _ t n { 0 } ; // Length of vector
67 double * data { n u l l p t r } ; // data array (standard C array)
68 };
69 } // namespace myvec

Note the use of a public static data member dbg in Line 63 that can be used to control debugging output
by setting MyVector::dbg = true or MyVector::dbg = false.

Remark 0.3.5.2 (Contiguous arrays in C++) The class MyVector uses a C-style array and dynamic
memory management with new and delete to store the vector components. This is for demonstration
purposes only and not recommended.


Arrays in C++

In C++ use the STL container std::vector<T> for storing data in contiguous memory locations.
Exception: use std::array<T>, if the number of elements is known at compile time.
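A minimal sketch (sizes chosen arbitrarily) contrasting the two recommended containers:

#include <array>
#include <cstddef>
#include <vector>

void containers() {
  // Number of elements fixed at compile time -> std::array (no heap allocation)
  std::array<double, 3> a{1.0, 2.0, 3.0};
  // Number of elements known only at runtime -> std::vector (contiguous heap storage)
  std::size_t n = 10;
  std::vector<double> v(n, 0.0);  // n entries, all initialized to 0.0
  v.push_back(a[0]);              // a std::vector can grow dynamically
}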

§0.3.5.4 (Member and friend functions of MyVector ➺ GITLAB)

C++ code 0.3.5.5: Constructor for constant vector, also default constructor, see Line 6 in
Code 0.3.5.1 ➺ GITLAB
1 MyVector : : MyVector ( std : : s i z e _ t _n , double _a ) : n ( _n ) , data ( n u l l p t r ) {
2 i f ( dbg ) cout << " { Constructor MyVector ( " << _n
3 << " ) called " << ’ } ’ << endl ;
4 i f ( n > 0 ) data = new double [ _n ] ;
5 f o r ( std : : s i z e _ t l =0; l <n ;++ l ) data [ l ] = _a ;
6 }

This constructor can also serve as default constructor (a constructor that can be invoked without any
argument), because defaults are supplied for all its arguments.

The following two constructors initialize a vector from sequential containers according to the conventions
of the STL.

C++ code 0.3.5.6: Templated constructors copying vector entries from an STL container
➺ GITLAB
1 template <typename Container >
2 MyVector : : MyVector ( const C o n t a i n e r &v ) : n ( v . s i z e ( ) ) , data ( n u l l p t r ) {
3 i f ( dbg ) cout << " { MyVector ( length " << n
4 << " ) constructed from container " << ’ } ’ << endl ;
5 i f ( n > 0) {
6 double * tmp = ( data = new double [ n ] ) ;
7 f o r ( auto i : v ) * tmp++ = i ; // foreach loop
8 }
9 }

Note the use of the new C++ 11 facility of a “foreach loop” iterating through a container in Line 7.

C++ code 0.3.5.7: Constructor initializing vector from STL iterator range ➺ GITLAB
1 template <typename I t e r a t o r >
2 MyVector : : MyVector ( I t e r a t o r f i r s t , I t e r a t o r l a s t ) : n ( 0 ) , data ( n u l l p t r ) {
3 n = std : : d i s t a n c e ( f i r s t , l a s t ) ;
4 i f ( dbg ) cout << " { MyVector ( length " << n
5 << " ) constructed from range " << ’ } ’ << endl ;
6 i f ( n > 0) {
7 data = new double [ n ] ;
8 std : : copy ( f i r s t , l a s t , data ) ;
9 }
10 }

The use of these constructors is demonstrated in the following code


C++ code 0.3.5.8: Initialization of a MyVector object from an STL vector ➺ GITLAB
1 i n t main ( ) {
2 myvec : : MyVector : : dbg = t r u e ;
3 std : : vector < i n t > i v e c = { 1 , 2 , 3 , 5 , 7 , 1 1 , 1 3 } ; // initializer list
4 myvec : : MyVector v1 ( i v e c . cbegin ( ) , i v e c . cend ( ) ) ;
5 myvec : : MyVector v2 ( i v e c ) ;
6 myvec : : MyVector v r ( i v e c . crbegin ( ) , i v e c . crend ( ) ) ;
7 cout << " v1 = " << v1 << endl ;
8 cout << " v2 = " << v2 << endl ;
9 cout << " vr = " << v r << endl ;
10 return ( 0 ) ;
11 }

The following output is produced:


{ MyVector ( l e n g t h 7 ) c o n s t r u c t e d from range }
{ MyVector ( l e n g t h 7 ) c o n s t r u c t e d from c o n t a i n e r }
{ MyVector ( l e n g t h 7 ) c o n s t r u c t e d from range }
v1 = [ 1 , 2 , 3 , 5 , 7 , 1 1 , 1 3 ]
v2 = [ 1 , 2 , 3 , 5 , 7 , 1 1 , 1 3 ]
vr = [ 13 ,11 ,7 ,5 ,3 ,2 ,1 ]
{ D e s t r u c t o r f o r MyVector ( l e n g t h = 7)}
{ D e s t r u c t o r f o r MyVector ( l e n g t h = 7)}
{ D e s t r u c t o r f o r MyVector ( l e n g t h = 7)}

The copy constructor listed next relies on the STL algorithm std::copy to copy the elements of an
existing object into a newly created object. This takes n operations.

C++ code 0.3.5.9: Copy constructor ➺ GITLAB


1 MyVector : : MyVector ( const MyVector &mv) : n (mv . n ) , data ( n u l l p t r ) {
2 i f ( dbg ) cout << " {Copy construction of MyVector ( length "
3 << n << " ) " << ’ } ’ << endl ;
4 i f ( n > 0) {
5 data = new double [ n ] ;
6 std : : copy_n (mv . data , n , data ) ;
7 }
8 }

An important new feature of C++11 is move semantics which helps avoid expensive copy operations. The
following implementation just performs a shallow copy of pointers and, thus, for large n is much cheaper
than a call to the copy constructor from Code 0.3.5.9. The source vector is left in an empty vector state.

C++ code 0.3.5.10: Move constructor ➺ GITLAB


1 MyVector : : MyVector ( MyVector &&mv) : n (mv . n ) , data (mv . data ) {
2 i f ( dbg ) cout << " {Move construction of MyVector ( length "
3 << n << " ) " << ’ } ’ << endl ;
4 mv . data = n u l l p t r ; mv . n = 0 ; // Reset victim of data theft
5 }

The following code demonstrates the use of std::move() to mark a vector object as disposable and
allow the compiler the use of the move constructor. The code also uses left multiplication with a scalar,
see Code 0.3.5.23.


C++ code 0.3.5.11: Invocation of copy and move constructors ➺ GITLAB


1 i n t main ( ) {
2 myvec : : MyVector : : dbg = t r u e ;
3 myvec : : MyVector v1 ( std : : vector <double >(
4 {1.2 ,2.3 ,3.4 ,4.5 ,5.6 ,6.7 ,7.8 ,8.9}) ) ;
5 myvec : : MyVector v2 ( 2 . 0 * v1 ) ; // Scalar multiplication
6 myvec : : MyVector v3 ( std : : move( v1 ) ) ;
7 cout << " v1 = " << v1 << endl ;
8 cout << " v2 = " << v2 << endl ;
9 cout << " v3 = " << v3 << endl ;
10 return ( 0 ) ;
11 }

This code produces the following output. We observe that v1 is empty after its data have been “stolen” by
v3.
{ MyVector ( l e n g t h 8 ) c o n s t r u c t e d from c o n t a i n e r }
{ o p e r a t o r a * , MyVector o f l e n g t h 8 }
{ Copy c o n s t r u c t i o n o f MyVector ( l e n g t h 8 ) }
{ o p e r a t o r * = , MyVector o f l e n g t h 8 }
{ Move c o n s t r u c t i o n o f MyVector ( l e n g t h 8 ) }
{ D e s t r u c t o r f o r MyVector ( l e n g t h = 0 ) }
{ Move c o n s t r u c t i o n o f MyVector ( l e n g t h 8 ) }
v1 = [ ]
v2 = [ 2 . 4 , 4 . 6 , 6 . 8 , 9 , 1 1 . 2 , 1 3 . 4 , 1 5 . 6 , 1 7 . 8 ]
v3 = [ 1 . 2 , 2 . 3 , 3 . 4 , 4 . 5 , 5 . 6 , 6 . 7 , 7 . 8 , 8 . 9 ]
{ D e s t r u c t o r f o r MyVector ( l e n g t h = 8 ) }
{ D e s t r u c t o r f o r MyVector ( l e n g t h = 8 ) }
{ D e s t r u c t o r f o r MyVector ( l e n g t h = 0 ) }

We observe that the object v1 is reset after having been moved to v3.
Use std::move only for special purposes like above and only if an object has a move con-
structor. Otherwise a ’move’ will trigger a plain copy operation. In particular, do not use
! std::move on objects at the end of their scope, e.g., within return statements.

The next operator effects copy assignment of an rvalue MyVector object to an lvalue MyVector. This
involves O(n) operations.

C++ code 0.3.5.12: Copy assignment operator ➺ GITLAB


1 MyVector &MyVector : : operator = ( const MyVector &mv) {
2 i f ( dbg ) cout << " {Copy assignment of MyVector ( length "
3 << n << "<−" << mv . n << " ) " << ’ } ’ << endl ;
4 i f ( t h i s == &mv) r e t u r n ( * t h i s ) ;
5 i f ( n ! = mv . n ) {
6 n = mv . n ;
7 i f ( data ! = n u l l p t r ) d e l e t e [ ] data ;
8 i f ( n > 0 ) data = new double [ n ] ; else data = n u l l p t r ;
9 }
10 i f ( n > 0 ) std : : copy_n (mv . data , n , data ) ;
11 return ( * this ) ;
12 }

The move semantics is realized by an assignment operator relying on shallow copying.


C++ code 0.3.5.13: Move assignment operator ➺ GITLAB


1 MyVector &MyVector : : operator = ( MyVector &&mv) {
2 i f ( dbg ) cout << " {Move assignment of MyVector ( length "
3 << n << "<−" << mv . n << " ) " << ’ } ’ << endl ;
4 i f ( data ! = n u l l p t r ) d e l e t e [ ] data ;
5 n = mv . n ; data = mv . data ;
6 mv . n = 0 ; mv . data = n u l l p t r ;
7 return ( * this ) ;
8 }

The destructor releases memory allocated by new during construction or assignment.

C++ code 0.3.5.14: Destructor: releases allocated memory ➺ GITLAB


1 MyVector : : ~ MyVector ( void ) {
2 i f ( dbg ) cout << " { Destructor f o r MyVector ( length = "
3 << n << " ) " << ’ } ’ << endl ;
4 i f ( data ! = n u l l p t r ) d e l e t e [ ] data ;
5 }

The operator keyword is also used to define implicit type conversions.

C++ code 0.3.5.15: Type conversion operator: copies contents of vector into STL vector
➺ GITLAB
1 MyVector : : operator std : : vector <double> ( ) const {
2 i f ( dbg ) cout << " { Conversion to std : : vector , length = " << n << ’ } ’ << endl ;
3 r e t u r n std : : vector <double >( data , data+n ) ;
4 }

The bracket operator [] can be used to fetch and set vector components. Note that index range checking
is performed; an exception is thrown for invalid indices. The following code also gives an example of
operator overloading as discussed in § 0.3.1.2.

C++ code 0.3.5.16: rvalue and lvalue access operators ➺ GITLAB


1 double MyVector : : operator [ ] ( std : : s i z e _ t i ) const {
2 i f ( i >= n ) throw ( std : : l o g i c _ e r r o r ( " [ ] out of range " ) ) ;
3 r e t u r n data [ i ] ;
4 }
5

6 double &MyVector : : operator [ ] ( std : : s i z e _ t i ) {


7 i f ( i >= n ) throw ( std : : l o g i c _ e r r o r ( " [ ] out of range " ) ) ;
8 r e t u r n data [ i ] ;
9 }

Componentwise direct comparison of vectors. Can be dangerous in numerical codes, cf. Rem. 1.5.3.15.

C++ code 0.3.5.17: Comparison operators ➺ GITLAB


1 bool MyVector : : operator == ( const MyVector &mv) const
2 {
3 i f ( dbg ) cout << " { Comparison ==: " << n << " <−> " << mv . n << ’ } ’ << endl ;
4 i f ( n ! = mv . n ) r e t u r n ( f a l s e ) ;
5 else {


6 f o r ( std : : s i z e _ t l =0; l <n ;++ l )


7 i f ( data [ l ] ! = mv . data [ l ] ) r e t u r n ( f a l s e ) ;
8 }
9 return ( true ) ;
10 }
11

12 bool MyVector : : operator ! = ( const MyVector &mv) const {


13 r e t u r n ! ( * t h i s == mv) ;
14 }

The transform method applies a function to every vector component and overwrites it with the value
returned by the function. The function is passed as an object of a type providing a ()-operator that accepts
a single argument convertible to double and returns a value convertible to double.

C++ code 0.3.5.18: Transformation of a vector through a functor double → double


➺ GITLAB
1 template <typename Functor >
2 MyVector &MyVector : : t r a n s f o r m ( F u n c t o r && f ) {
3 f o r ( std : : s i z e _ t l =0; l <n ;++ l ) data [ l ] = f ( data [ l ] ) ;
4 return ( * this ) ;
5 }

The following code demonstrates the use of the transform method in combination with
1. a function object of the following type

C++ code 0.3.5.19: A functor type


1 s t r u c t Si mpl eFunct i on {
2 Si mpl eFunct i on ( double _a = 1 . 0 ) : c n t ( 0 ) , a ( _a ) { }
3 double operator ( ) ( double x ) { c n t ++; r e t u r n ( x+a ) ; }
4 int cnt ; // internal counter
5 const double a ; // increment value
6 };

2. a lambda function defined directly inside the call to transform.

C++ code 0.3.5.20: transformation of a vector via a functor object


1 i n t main ( ) {
2 myvec : : MyVector : : dbg = f a l s e ;
3 double a = 2 . 0 ; // increment
4 int cnt = 0; // external counter used by lambda function
5 myvec : : MyVector mv( std : : vector <double >(
6 {1.2 ,2.3 ,3.4 ,4.5 ,5.6 ,6.7 ,7.8 ,8.9}) ) ;
7 mv . transform ( [ a ,& c n t ] ( double x ) { c n t ++; r e t u r n ( x+a ) ; } ) ;
8 cout << c n t << " operations , mv transformed = " << mv << endl ;
9 SimpleFunction t r f ( a ) ; mv . transform ( t r f ) ;
10 cout << t r f . c n t << " operations , mv transformed = " << mv << endl ;
11 mv . transform ( SimpleFunction ( − 4 . 0 ) ) ;
12 cout << " Final vector = " << mv << endl ;
13 return ( 0 ) ;
14 }

The output is


8 o p e r a t i o n s , mv t r a n s f o r m e d = [ 3 . 2 , 4 . 3 , 5 . 4 , 6 . 5 , 7 . 6 , 8 . 7 , 9 . 8 , 1 0 . 9 ]
8 o p e r a t i o n s , mv t r a n s f o r m e d = [ 5 . 2 , 6 . 3 , 7 . 4 , 8 . 5 , 9 . 6 , 1 0 . 7 , 1 1 . 8 , 1 2 . 9 ]
Final vector = [ 1.2 ,2.3 ,3.4 ,4.5 ,5.6 ,6.7 ,7.8 ,8.9 ]

Operator overloading provides the “natural” vector operations in R n both in place and with a new vector
created for the result.

C++ code 0.3.5.21: In place arithmetic operations (one argument) ➺ GITLAB


1 MyVector &MyVector : : operator +=( const MyVector &mv) {
2 i f ( dbg ) cout << " { operator +=, MyVector of length "
3 << n << ’ } ’ << endl ;
4 i f ( n ! = mv . n ) throw ( std : : l o g i c _ e r r o r ( " +=: vector size mismatch " ) ) ;
5 f o r ( std : : s i z e _ t l =0; l <n ;++ l ) data [ l ] += mv . data [ l ] ;
6 return ( * this ) ;
7 }
8

9 MyVector &MyVector : : operator −=( const MyVector &mv) {


10 i f ( dbg ) cout << " { operator −=, MyVector of length "
11 << n << ’ } ’ << endl ;
12 i f ( n ! = mv . n ) throw ( std : : l o g i c _ e r r o r ( " −=: vector size mismatch " ) ) ;
13 f o r ( std : : s i z e _ t l =0; l <n ;++ l ) data [ l ] −= mv . data [ l ] ;
14 return ( * this ) ;
15 }
16

17 MyVector &MyVector : : operator * = ( double alpha ) {


18 i f ( dbg ) cout << " { operator * = , MyVector of length "
19 << n << ’ } ’ << endl ;
20 f o r ( std : : s i z e _ t l =0; l <n ;++ l ) data [ l ] * = alpha ;
21 return ( * this ) ;
22 }
23

24 MyVector &MyVector : : operator / = ( double alpha ) {


25 i f ( dbg ) cout << " { operator / = , MyVector of length "
26 << n << ’ } ’ << endl ;
27 f o r ( std : : s i z e _ t l =0; l <n ;++ l ) data [ l ] / = alpha ;
28 return ( * this ) ;
29 }

C++ code 0.3.5.22: Binary arithmetic operators (two arguments) ➺ GITLAB


1 MyVector MyVector : : operator + ( MyVector mv) const {
2 i f ( dbg ) cout << " { operator + , MyVector of length "
3 << n << ’ } ’ << endl ;
4 i f ( n ! = mv . n ) throw ( std : : l o g i c _ e r r o r ( " +: vector size mismatch " ) ) ;
5 mv += * t h i s ;
6 r e t u r n (mv) ;
7 }
8

9 MyVector MyVector : : operator − ( const MyVector &mv) const {


10 i f ( dbg ) cout << " { operator −, MyVector of length "
11 << n << ’ } ’ << endl ;
12 i f ( n ! = mv . n ) throw ( std : : l o g i c _ e r r o r ( " +: vector size mismatch " ) ) ;
13 MyVector tmp ( * t h i s ) ; tmp −= mv ;
14 r e t u r n ( tmp ) ;
15 }
16

17 MyVector MyVector : : operator * ( double alpha ) const {


18 i f ( dbg ) cout << " { operator * a , MyVector of length "


19 << n << ’ } ’ << endl ;


20 MyVector tmp ( * t h i s ) ; tmp * = alpha ;
21 r e t u r n ( tmp ) ;
22 }
23

24 MyVector MyVector : : operator / ( double alpha ) const {


25 i f ( dbg ) cout << " { operator / , MyVector of length " << n << ’ } ’ << endl ;
26 MyVector tmp ( * t h i s ) ; tmp / = alpha ;
27 r e t u r n ( tmp ) ;
28 }

C++ code 0.3.5.23: Non-member function for left multiplication with a scalar ➺ GITLAB
1 MyVector operator * ( double alpha , const MyVector &mv) {
2 i f ( MyVector : : dbg ) cout << " { operator a * , MyVector of length "
3 << mv . n << ’ } ’ << endl ;
4 MyVector tmp (mv) ; tmp * = alpha ;
5 r e t u r n ( tmp ) ;
6 }

C++ code 0.3.5.24: Euclidean norm ➺ GITLAB


1 double MyVector : : norm ( void ) const {
2 i f ( dbg ) cout << " {norm : MyVector of length " << n << ’ } ’ << endl ;
3 double s = 0 ;
4 f o r ( std : : s i z e _ t l =0; l <n ;++ l ) s += ( data [ l ] * data [ l ] ) ;
5 r e t u r n ( std : : s q r t ( s ) ) ;
6 }

Adopting the notation in some linear algebra texts, the operator * has been chosen to designate the
Euclidean inner product:

C++ code 0.3.5.25: Euclidean inner product ➺ GITLAB


1 double MyVector : : operator * ( const MyVector &mv) const {
2 i f ( dbg ) cout << " { dot * , MyVector of length " << n << ’ } ’ << endl ;
3 i f ( n ! = mv . n ) throw ( std : : l o g i c _ e r r o r ( " dot : vector size mismatch " ) ) ;
4 double s = 0 ;
5 f o r ( std : : s i z e _ t l =0; l <n ;++ l ) s += ( data [ l ] * mv . data [ l ] ) ;
6 return ( s ) ;
7 }

At least for debugging purposes every reasonably complex class should be equipped with output function-
ality.

C++ code 0.3.5.26: Non-member function output operator ➺ GITLAB


1 std : : ostream &operator << ( std : : ostream &o , const MyVector &mv) {
2 o << " [ " ;
3 f o r ( std : : s i z e _ t l =0; l <mv . n ;++ l )
4 o << mv . data [ l ] << ( l ==mv . n−1? ’ ’ : ’ , ’ ) ;
5 r e t u r n ( o << " ] " ) ;
6 }


EXPERIMENT 0.3.5.27 (“Behind the scenes” of MyVector arithmetic) The following code highlights
the use of operator overloading to obtain readable and compact expressions for vector arithmetic.

C++ code 0.3.5.28:


1 i n t main ( ) {
2 myvec : : MyVector : : dbg = t r u e ;
3 myvec : : MyVector x ( std : : vector <double > ( { 1 . 2 , 2 . 3 , 3 . 4 , 4 . 5 , 5 . 6 , 6 . 7 , 7 . 8 , 8 . 9 } ) ) ;
4 myvec : : MyVector y ( std : : vector <double > ( { 2 . 1 , 3 . 2 , 4 . 3 , 5 . 4 , 6 . 5 , 7 . 6 , 8 . 7 , 9 . 8 } ) ) ;
5 auto z = x +( x * y ) * x + 2 . 0 * y / ( x−y ) . norm ( ) ;
6 }

We run the code and trace calls. This is printed to the console:
{MyVector(length 8) constructed from container}
{MyVector(length 8) constructed from container}
{dot *, MyVector of length 8}
{operator a*, MyVector of length 8}
{Copy construction of MyVector(length 8)}
{operator *=, MyVector of length 8}
{operator +, MyVector of length 8}
{operator +=, MyVector of length 8}
{Move construction of MyVector(length 8)}
{operator a*, MyVector of length 8}
{Copy construction of MyVector(length 8)}
{operator *=, MyVector of length 8}
{operator -, MyVector of length 8}
{Copy construction of MyVector(length 8)}
{operator -=, MyVector of length 8}
{norm: MyVector of length 8}
{operator /, MyVector of length 8}
{Copy construction of MyVector(length 8)}
{operator /=, MyVector of length 8}
{operator +, MyVector of length 8}
{operator +=, MyVector of length 8}
{Move construction of MyVector(length 8)}
{Destructor for MyVector(length = 0)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 0)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 8)}
{Destructor for MyVector(length = 8)}

Several temporary objects are created and destroyed, and quite a few copy operations take place. The
situation would be even worse without move semantics: had we not supplied a move constructor, a few
more copy operations would have been triggered. Even worse, the frequent copying of data runs a high
risk of cache misses. Hence, this is certainly not an efficient way to carry out elementary vector operations,
even though the code looks elegant at first glance. y

EXAMPLE 0.3.5.29 (Gram-Schmidt orthonormalization based on MyVector implementation) Gram-Schmidt
orthonormalization has been taught in linear algebra and its theory will be revisited in § 1.5.1.1.


Here we use this simple algorithm from linear algebra to demonstrate the use of the vector class MyVector
defined in Code 0.3.5.1.
The templated function gramschmidt takes a sequence of vectors stored in a std::vector object. The
actual vector type is passed as a template parameter. It has to supply size() and norm() member
functions as well as the arithmetic operators -=, /=, /, and * (inner product and scaling with a scalar).
Note the use of the highlighted methods of the std::vector class.

C++ code 0.3.5.30: templated function for Gram-Schmidt orthonormalization ➺ GITLAB


template <typename Vec>
std::vector<Vec> gramschmidt(const std::vector<Vec> &A, double eps = 1E-14) {
  const int k = A.size();    // no. of vectors to orthogonalize
  const int n = A[0].size(); // length of vectors
  cout << "gramschmidt orthogonalization for " << k << ' ' << n << "-vectors" << endl;
  std::vector<Vec> Q({A[0] / A[0].norm()}); // output vectors
  for (int j = 1; (j < k) && (j < n); ++j) {
    Q.push_back(A[j]);
    for (int l = 0; l < j; ++l) Q.back() -= (A[j] * Q[l]) * Q[l];
    if (Q.back().norm() < eps * A[j].norm()) { // premature termination?
      Q.pop_back(); break;
    }
    Q.back() /= Q.back().norm(); // normalization
  }
  return (Q); // return at end of local scope
}

This driver program calls a function that initializes a sequence of vectors and then orthonormalizes them
by means of the Gram-Schmidt algorithm. Finally, orthonormality of the computed vectors is tested.
Please pay attention to
• the use of auto to avoid cumbersome type declarations,
• the for loops following the “foreach” syntax,
• the automatic template argument deduction for the templated function gramschmidt from its argu-
  ment: in Line 6 the function gramschmidt<MyVector> is instantiated.

C++ code 0.3.5.31: Driver code for Gram-Schmidt orthonormalization


1  int main() {
2    myvec::MyVector::dbg = false;
3    const int n = 7; const int k = 7;
4    auto A(initvectors(n, k, [](int i, int j)
5                       { return std::min(i + 1, j + 1); }));
6    auto Q(gramschmidt(A)); // instantiate template for MyVector
7    cout << "Set of vectors to be orthonormalized:" << endl;
8    for (const auto &a : A) { cout << a << endl; }
9    cout << "Output of Gram-Schmidt orthonormalization:" << endl;
10   for (const auto &q : Q) { cout << q << endl; }
11   cout << "Testing orthogonality:" << endl;
12   for (const auto &qi : Q) {
13     for (const auto &qj : Q)
14       cout << std::setprecision(3) << std::setw(9) << qi * qj << ' ';
15     cout << endl; }
16   return (0);
17 }


This initialization function takes a functor argument as discussed in Section 0.3.3.

C++ code 0.3.5.32: Initialization of a set of vectors through a functor with two arguments
template <typename Functor>
std::vector<myvec::MyVector>
initvectors(std::size_t n, std::size_t k, Functor &&f) {
  std::vector<MyVector> A{};
  for (int j = 0; j < k; ++j) {
    A.push_back(MyVector(n));
    for (int i = 0; i < n; ++i)
      (A.back())[i] = f(i, j);
  }
  return (A);
}

0.3.6 Complex numbers in C++


§0.3.6.1 (Data types for complex numbers) The fundamental data type for complex numbers is
using complex = std::complex<T>;

where the template argument T must be a floating point type like double or float. The type complex
• supports all basic arithmetic operations +, −, ∗, /,
• provides the member functions real() and imag() for extracting real and imaginary parts,
• and can be passed to std::abs() and std::arg() to get the modulus |z| and the argument
ϕ ∈ [−π, π ] of the complex number z = |z| exp(ıϕ).
Complex conjugation can be done by calling std::conj() for a complex number. y

§0.3.6.2 (Initialization of complex numbers) The value of a variable of type std::complex<double> can
be initialized
• by calling the standard constructor and supplying real and imaginary part: x =
std::complex<double>(x,y), where x,y are of a numeric type that can be converted to
double. If the second argument is omitted, the imaginary part is set to zero.
• by providing a complex literal, x = 1.0+1.0i. This entails the directive using namespace
std::complex_literals.
• by specifying the modulus r ≥ 0 and argument ϕ ∈ R and calling std::polar(): x =
std::polar(r,phi). Arguments are always given in radians.
y
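The following minimal sketch (added for illustration; it only uses the standard library facilities listed above)
shows the three initialization options side by side:

#include <complex>
#include <iostream>
#include <numbers>
using namespace std::complex_literals;

int main() {
  // (i) standard constructor: real and imaginary part
  const std::complex<double> a(1.0, 2.0);
  // (ii) complex literal; requires std::complex_literals
  const std::complex<double> b = 1.0 + 1.0i;
  // (iii) from modulus r >= 0 and argument phi (in radians)
  const std::complex<double> c = std::polar(2.0, std::numbers::pi / 2.0);
  std::cout << a << ' ' << b << ' ' << c << std::endl;
  return 0;
}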

§0.3.6.3 (Functions with complex arguments) All standard mathematical functions like exp, sin, cos,
sinh, and cosh can be supplied with complex arguments.
Note that the definition of log and of square roots for complex argument entails specifying a branch
cut. The default choice for the built-in functions is the negative real line. For instance this means that
std::sqrt(z) for a complex number z will always have non-negative real part. y

The following code demonstrates the handling of complex numbers.


C++ code 0.3.6.4: Data types and operations for complex numbers ➺ GITLAB
#include <complex>
#include <iostream>
#include <numbers>
using complex = std::complex<double>;
using namespace std::complex_literals;
int main(int /*argc*/, char **/*argv*/) {
  std::cout << "Demo: Complex numbers in C++" << std::endl;
  // This initialization requires std::complex_literals
  complex z = 0.5; // Std constructor, real part only
  z += 0.5 + 1.0i;
  // Various elementary operations, see
  // https://en.cppreference.com/w/cpp/numeric/complex
  std::cout << "z = " << z << ", Re(z) = " << z.real()
            << ", Im(z) = " << z.imag() << " |z| = " << std::abs(z)
            << ", arg(z) = " << std::arg(z) << ", conj(z) = " << std::conj(z)
            << std::endl;
  complex w = std::polar(1.0, std::numbers::pi / 4.0);
  std::cout << "w = " << w << std::endl;
  std::cout << "exp(z) = " << std::exp(z)
            << ", abs(exp(z)) = " << std::abs(std::exp(z)) << " = "
            << std::exp(z.real()) << std::endl;
  std::cout << "sqrt(z) = " << std::sqrt(z)
            << ", arg(sqrt(z)) = " << std::arg(std::sqrt(z)) << std::endl;

  return 0;
}

Terminal output:
Demo: Complex numbers in C++
z = (1,1), Re(z) = 1, Im(z) = 1 |z| = 1.41421, arg(z) = 0.785398, conj(z) = (1,-1)
w = (0.707107,0.707107)
exp(z) = (1.46869,2.28736), abs(exp(z)) = 2.71828 = 2.71828
sqrt(z) = (1.09868,0.45509), arg(sqrt(z)) = 0.392699

0.4 Prerequisite Mathematical Knowledge


0.4.1 Basics
In school you should have learned basic facts of real analysis of one variable, in particular, about differen-
tiation, integration, and fundamental special functions.
§0.4.1.1 (Power functions and roots)

    x^0 := 1 ,   x^(m+n) = x^m x^n ,   x^(−n) = 1/x^n   ∀ x ∈ R \ {0} , m, n ∈ Z ,           (0.4.1.2)
    x^(a+b) = x^a x^b ,   x^(ab) = (x^a)^b   ∀ x > 0 , a, b ∈ R ,                            (0.4.1.3)
    d/dx {x ↦ x^n} = n x^(n−1) , x ≠ 0, n ∈ N ,   d/dx {x ↦ x^a} = a x^(a−1) , x > 0, a ∈ R , (0.4.1.4)
    ∫ x^a dx = x^(a+1)/(a+1) + C ,   a ∈ R \ {−1} ,                                          (0.4.1.5)

where the last integral can only cover subsets of R⁺ unless a ∈ N. The notation in (0.4.1.5) expresses
that x ↦ x^(a+1)/(a+1) is a primitive (ger.: Stammfunktion) of x ↦ x^a. y
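For illustration (worked example added here): choosing a = 1/2 in (0.4.1.4) and (0.4.1.5) gives, for x > 0,
d/dx {x ↦ √x} = d/dx {x ↦ x^(1/2)} = (1/2) x^(−1/2) = 1/(2√x) and ∫ x^(1/2) dx = (2/3) x^(3/2) + C.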
§0.4.1.6 (Exponential functions and logarithms) In this course log always stands for the logarithm with
respect to base e = 2.71828 . . ..

    exp(x) = e^x , x ∈ R ,   log(exp(x)) = x ∀ x ∈ R ,   a^x := exp(x log(a)) , x ∈ R, a > 0 .   (0.4.1.7)

Calculus of exponential functions and logarithms:

    exp(x + y) = exp(x) exp(y) ,   exp(−x) = 1/exp(x)           ∀ x, y ∈ R ,
    log(xy) = log(x) + log(y) ,    log(x/y) = log(x) − log(y)   ∀ x, y > 0 ,                     (0.4.1.8)
    exp(nx) = exp(x)^n ,           exp(ax) = exp(x)^a           ∀ x ∈ R, n ∈ Z, a > 0 .          (0.4.1.9)

Differentiation and integration:

    d/dx {x ↦ exp(x)} = exp(x) , x ∈ R ,     d/dx {x ↦ log(x)} = 1/x , x > 0 ,                   (0.4.1.10)
    ∫ exp(x) dx = exp(x) + C ,               ∫ log(x) dx = x log(x) − x + C ,                    (0.4.1.11)

where, of course, the logarithm can only be integrated over subsets of R⁺. y


§0.4.1.12 (Rules for differentiation and integration) Assuming sufficient smoothness of the involved
functions f : I ⊂ R → R and g : D ⊂ R → R and that products and compositions are well-defined, we
have, writing f′(x) := df/dx (x), g′(x) := dg/dx (x),

    1D product rule:   d/dx {x ↦ f(x) g(x)} = f′(x) g(x) + f(x) g′(x) ,                 (0.4.1.13)
    1D chain rule:     d/dx {x ↦ f(g(x))} = f′(g(x)) g′(x) .                            (0.4.1.14)

This implies the following standard integration techniques:

    Integration by substitution:   ∫_a^b f(g(x)) g′(x) dx = ∫_{g(a)}^{g(b)} f(y) dy ,    (0.4.1.15)
    integration by parts:          ∫ f(x) g′(x) dx = − ∫ f′(x) g(x) dx + f(x) g(x) .     (0.4.1.16)

Taylor expansion formula in one dimension for a function that is m + 1 times continuously differentiable
in a neighborhood of x0:

    f(x0 + h) = Σ_{k=0}^{m} f^(k)(x0)/k! · h^k + R_m(x0, h) ,
    R_m(x0, h) = f^(m+1)(ξ)/(m+1)! · h^(m+1) ,                                           (1.5.4.28)

for some ξ ∈ [min{x0, x0 + h}, max{x0, x0 + h}], and for all sufficiently small |h|. y
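As a quick illustration (worked example added here): for f = exp, x0 = 0, and m = 2 the formula reads
exp(h) = 1 + h + h^2/2 + R_2(0, h) with R_2(0, h) = exp(ξ) h^3/6 for some ξ between 0 and h. For
h = 0.1 the error of the quadratic approximation is therefore at most exp(0.1) · 10^(−3)/6 ≈ 1.8 · 10^(−4).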


0.4.2 Complex Numbers


We write ı for the imaginary unit: Re ı = 0, Im ı = 1, ı² = −1. Then every complex number z ∈ C can
be identified with a pair (x, y) ∈ R² of real numbers via z = x + ıy.

    Multiplication:         (x + ıy)(u + ıv) = (xu − yv) + ı(xv + yu)   ∀ x, y, u, v ∈ R ,
    Complex conjugation:    z̄ := x − ıy for z = x + ıy                  ∀ x, y ∈ R ,
    Modulus:                |z|² := z z̄ ,   |zw| = |z||w|               ∀ z, w ∈ C ,
    Division:               w/z = w z̄ / |z|²                            ∀ w, z ∈ C, z ≠ 0 .

Many mathematical functions can be extended to complex arguments and many calculus rules will remain
valid for them, in particular the formulas involving the functions exp, sin, cos.

    Euler's formula:   exp(ıt) = cos(t) + ı sin(t)   ∀ t ∈ R .   (0.4.2.1)

Some parts of this course will rely on sophisticated results from complex analysis, that is, the field of
mathematics studying functions C → C. These results will be recalled when needed.

0.4.3 Trigonometric Functions


Trigonometric functions can be defined via the complex exponential function:

    cos(z) = (1/2) (exp(ız) + exp(−ız)) ,   sin(z) = 1/(2ı) (exp(ız) − exp(−ız)) ,   z ∈ C .   (0.4.3.1)

This implies

    sin² z + cos² z = 1   ∀ z ∈ C ,      1 + tan² z = 1/cos² z   ∀ z ∈ C \ (π/2 + πZ) .

Addition formulas:

    sin(z ± w) = sin z cos w ± cos z sin w ,   cos(z ± w) = cos z cos w ∓ sin z sin w   ∀ z, w ∈ C ,

and, whenever defined,

    tan(z ± w) = (tan z ± tan w)/(1 ∓ tan z tan w) ,   arctan z ± arctan w = arctan( (z ± w)/(1 ∓ zw) ) .
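To see how such identities follow from (0.4.3.1) (short derivation added for clarity):

    sin z cos w + cos z sin w
      = 1/(4ı) [ (exp(ız) − exp(−ız))(exp(ıw) + exp(−ıw)) + (exp(ız) + exp(−ız))(exp(ıw) − exp(−ıw)) ]
      = 1/(2ı) ( exp(ı(z + w)) − exp(−ı(z + w)) ) = sin(z + w) ,

which is the first addition formula with the “+” sign.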

0.4.4 Linear Algebra and Analysis


This course takes for granted that participants have been educated in the foundations of linear algebra by
attending corresponding first-year introductory courses, for instance
• 401-0151-00L Lineare Algebra
• 401-0231-10L Analysis 1
• 401-0232-10L Analysis 2
Quite a few concepts and techniques introduced in those courses will be needed for and will be taken for
granted in the current course.



Bibliography

[AG11] Uri M. Ascher and Chen Greif. A first course in numerical methods. Vol. 7. Computational
Science & Engineering. Society for Industrial and Applied Mathematics (SIAM), Philadelphia,
PA, 2011, pp. xxii+552. DOI: 10.1137/1.9780898719987 (cit. on p. 11).
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on p. 11).
[DH03] P. Deuflhard and A. Hohmann. Numerical Analysis in Modern Scientific Computing. Vol. 43.
Texts in Applied Mathematics. Springer, 2003 (cit. on p. 12).
[Fri19] F. Friedrich. Datenstrukturen und Algorithmen. Lecture slides. 2019 (cit. on p. 37).
[GGK14] W. Gander, M.J. Gander, and F. Kwok. Scientific Computing. Vol. 11. Texts in Computational
Science and Engineering. Heidelberg: Springer, 2014 (cit. on p. 12).
[Gut09] M.H. Gutknecht. Lineare Algebra. Lecture Notes. SAM, ETH Zürich, 2009 (cit. on p. 12).
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on p. 11).
[Jos12] N.M. Josuttis. The C++ Standard Library. Boston, MA: Addison-Wesley, 2012 (cit. on p. 30).
[LLM12] S. Lippman, J. Lajoie, and B. Moo. C++ Primer. 5th. Boston: Addison-Wesley, 2012 (cit. on
pp. 30, 31).
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
p. 12).
[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Math-
ematics. New York: Springer, 2000 (cit. on p. 11).
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on p. 12).

Chapter 1

Computing with Matrices and Vectors

§1.0.0.1 (Prerequisite knowledge for Chapter 1) The reader must master the basics of linear vector
and matrix calculus as covered in every introductory course on linear algebra [NS02, Ch. 2].
On a few occasions we will also need results of 1D real calculus like Taylor’s formula [Str09, Sect. 5.5]. y

§1.0.0.2 (Levels of operations in simulation codes) The lowest level of real arithmetic available on
computers is given by the elementary operations “+”, “−”, “∗”, “/”, “^”, usually implemented in hardware. The next
level comprises computations on finite arrays of real numbers, the elementary linear algebra operations
(BLAS). On top of them we build complex algorithms involving iterations and approximations.

Complex iterative/recursive/approximative algorithms

Linear algebra operations on arrays (BLAS)

Elementary operations in R

Hardly anyone will ever contemplate implementing elementary operations on binary data formats; similarly,
well-tested and optimised code libraries should be used for all elementary linear algebra operations in
simulation codes. This chapter will introduce you to such libraries and how to use them smartly. y

Contents
1.1 Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
1.1.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
1.1.2 Classes of Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
1.2 Software and Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
1.2.1 E IGEN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
1.2.2 P YTHON . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
1.2.3 (Dense) Matrix Storage Formats . . . . . . . . . . . . . . . . . . . . . . . . . 65
1.3 Basic Linear Algebra Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
1.3.1 Elementary Matrix-Vector Calculus . . . . . . . . . . . . . . . . . . . . . . . 70
1.3.2 BLAS – Basic Linear Algebra Subprograms . . . . . . . . . . . . . . . . . . . 76
1.4 Computational Effort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
1.4.1 (Asymptotic) Computational Complexity . . . . . . . . . . . . . . . . . . . . 83
1.4.2 Cost of Basic Linear-Algebra Operations . . . . . . . . . . . . . . . . . . . . 84
1.4.3 Improving Complexity in Numerical Linear Algebra: Some Tricks . . . . . 86
1.5 Machine Arithmetic and Consequences . . . . . . . . . . . . . . . . . . . . . . . . 91
1.5.1 Experiment: Loss of Orthogonality . . . . . . . . . . . . . . . . . . . . . . . . 91
1.5.2 Machine Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
1.5.3 Roundoff Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
1.5.4 Cancellation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102


1.5.5 Numerical Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

1.1 Fundamentals
1.1.1 Notations
Video tutorial for Section 1.1.1 “Notations and Classes of Matrices”: (7 minutes)
Download link, tablet notes

→ review questions 1.1.2.9


The notations in this course try to adhere to established conventions. Since these may not be universal,
idiosyncrasies cannot be avoided completely. Notations in textbooks may be different, beware!
Many considerations apply to real (field R) and complex (field C) numbers alike. Therefore we adopt the
notation K for a generic field of numbers. Thus, in this course, K will designate either R (real numbers) or
C (complex numbers); complex arithmetic [Str09, Sect. 2.5] plays a crucial role in many applications, for
instance in signal processing.
§1.1.1.1 (Notations for vectors)

✦ Vectors are n-tuples (n ∈ N) with components ∈ K.
      vector = one-dimensional array (of real/complex numbers)

✦ Default in this lecture: vectors = column vectors

      [ x1 , . . . , xn ]⊤ ∈ K^n  (column vector) ,      [ x1 · · · xn ] ∈ K^{1,n}  (row vector) .

  K^n =ˆ vector space of column vectors with n components in K.

      “Linear algebra convention”: Unless stated otherwise, in mathematical formulas vector
      components are indexed from 1!

  ✎ notation for column vectors: bold small roman letters, e.g. x, y, z

✦ Transposing: column vector ↦ row vector, row vector ↦ column vector:

      [ x1 , . . . , xn ]⊤ = [ x1 · · · xn ] ,      [ x1 · · · xn ]⊤ = [ x1 , . . . , xn ] .

  ✎ Notation for row vectors: x⊤ , y⊤ , z⊤

✦ Addressing vector components:

  ✎ two notations:   x = [ x1 , . . . , xn ]⊤ → xi , i = 1, . . . , n ;      x ∈ K^n → (x)i , i = 1, . . . , n .

✦ Selecting sub-vectors:

  ✎ notation:   x = [ x1 . . . xn ]⊤  ➣  (x)k:l = [ xk , . . . , xl ]⊤ , 1 ≤ k ≤ l ≤ n .

  ✎ notations like 1 ≤ k, ℓ ≤ n, where it is clear from the context that k and ℓ designate integer
    indices, mean “for all k, ℓ ∈ {1, . . . , n}”.

✦ j-th unit vector:   ej = [ 0, . . . , 1, . . . , 0 ]⊤ ,   (ej)i = δij , i, j = 1, . . . , n.

  ✎ notation: Kronecker symbol, also called “Kronecker delta”, defined as δij := 1 if i = j, and
    δij := 0 if i ≠ j.
y

§1.1.1.2 (Notations and notions for matrices)

✦ Matrices = two-dimensional arrays of real/complex numbers

      A := [ a11 . . . a1m ; . . . ; an1 . . . anm ] ∈ K^{n,m} ,   n, m ∈ N .

  K^{n,m} =ˆ vector space of n × m-matrices (n =ˆ number of rows, m =ˆ number of columns)

  ✎ notation for matrices: bold CAPITAL roman letters, e.g., A, S, Y

  Special cases: K^{n,1} ↔ column vectors, K^{1,n} ↔ row vectors

✦ Writing a matrix as a tuple of its columns or rows:

      ci ∈ K^n , i = 1, . . . , m   ✄   A := [ c1, c2, . . . , cm ] ∈ K^{n,m} ,
      ri ∈ K^m , i = 1, . . . , n   ✄   A := [ r1⊤ ; . . . ; rn⊤ ] ∈ K^{n,m} .

✦ Addressing matrix entries & sub-matrices (✎ notations):

      → matrix entry/matrix element:   (A)i,j := aij , 1 ≤ i ≤ n, 1 ≤ j ≤ m ,
      → i-th row, 1 ≤ i ≤ n:           (A)i,: := [ ai,1 , . . . , ai,m ] ,
      → j-th column, 1 ≤ j ≤ m:        (A):,j := [ a1,j , . . . , an,j ]⊤ ,
      → matrix block (sub-matrix):     (A)k:ℓ,r:s := [ aij ] i=k,...,ℓ, j=r,...,s ,   1 ≤ k ≤ ℓ ≤ n , 1 ≤ r ≤ s ≤ m .

  (Fig. 5: the sub-matrix (A)k:ℓ,r:s occupies the rows k, . . . , ℓ and the columns r, . . . , s of A.)

  The colon (:) range notation is inspired by MATLAB's/PYTHON's matrix addressing conventions.
  (A)k:l,r:s is a matrix of size (l − k + 1) × (s − r + 1).

  !  Note that in PYTHON the : notation describes slightly different ranges: the end value is excluded.

✦ Transposed matrix:   A⊤ ∈ K^{m,n} with (A⊤)ij := (A)ji = aji , i = 1, . . . , m, j = 1, . . . , n .

✦ Adjoint matrix (Hermitian transposed):   A^H ∈ K^{m,n} with (A^H)ij := āji , i = 1, . . . , m, j = 1, . . . , n .

  ✎ notation: āij = Re(aij) − ıIm(aij) =ˆ complex conjugate of aij. Of course, for A ∈ R^{n,m} we
    have A^H = A⊤.
y

1.1.2 Classes of Matrices


Most matrices occurring in mathematical modelling have a special structure. This section presents a few
of these. More will come up throughout the remainder of this chapter; see also [AG11, Sect. 4.3].

§1.1.2.1 (Special matrices) Terminology and notations for a few very special matrices:

    Identity matrix:    I := In ∈ K^{n,n} : ones on the main diagonal, zeros everywhere else,
    Zero matrix:        O := On,m ∈ K^{n,m} : all entries equal to zero,
    Diagonal matrix:    diag(d1, . . . , dn) ∈ K^{n,n} , dj ∈ K, j = 1, . . . , n : entries d1, . . . , dn on the
                        main diagonal, zeros everywhere else.

The creation of special matrices can usually be done by special commands or functions in the various
languages or libraries dedicated to numerical linear algebra, see § 1.2.1.3. y
§1.1.2.2 (Diagonal and triangular matrices) A little terminology to quickly refer to matrices whose non-
zero entries occupy special locations:

Definition 1.1.2.3. Types of matrices


 
A matrix A = (aij) ∈ K^{m,n} is a
• diagonal matrix, if aij = 0 for i 6= j,
• upper triangular matrix, if aij = 0 for i > j,
• lower triangular matrix, if aij = 0 for i < j.
A triangular matrix is normalized, if aii = 1, i = 1, . . . , min{m, n}.

(Schematic sketches: diagonal matrix — non-zero entries only on the diagonal; upper triangular matrix —
non-zeros only on and above the diagonal; lower triangular matrix — non-zeros only on and below the
diagonal.)
y

§1.1.2.4 (Symmetric matrices)

Definition 1.1.2.5. Hermitian/symmetric matrices

A matrix M ∈ K n,n , n ∈ N, is Hermitian, if MH = M. If K = R, the matrix is called symmetric.

Definition 1.1.2.6. Symmetric positive definite (s.p.d.) matrices → [DR08, Def. 3.31],
[QSS00, Def. 1.22]

M ∈ K n,n , n ∈ N, is symmetric (Hermitian) positive definite (s.p.d.), if

M = M^H and ∀ x ∈ K^n : x^H M x > 0 ⇔ x ≠ 0 .

If xH Mx ≥ 0 for all x ∈ K n ✄ M positive semi-definite.

Lemma 1.1.2.7. Necessary conditions for s.p.d. → [DR08, Satz 3.33], [QSS00, Prop. 1.18]

For a symmetric/Hermitian positive definite matrix M = M^H ∈ K^{n,n} the following holds true:


1. mii > 0, i = 1, . . . , n,
2. mii m jj − |mij |2 > 0 ∀1 ≤ i < j ≤ n,
3. all eigenvalues of M are positive. (← also sufficient for symmetric/Hermitian M)
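
For illustration (worked example added here), consider the 2 × 2 matrix M with rows [2 1] and [1 2]: its
diagonal entries are positive, m11 m22 − |m12|^2 = 4 − 1 = 3 > 0, and its eigenvalues 1 and 3 are both
positive. Indeed, M is s.p.d., since x⊤Mx = 2x1^2 + 2x1x2 + 2x2^2 = x1^2 + x2^2 + (x1 + x2)^2 > 0 for
every x ≠ 0.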

Remark 1.1.2.8 (S.p.d. Hessians) Recall from analysis: in an isolated local minimum x ∗ of a C2 -function
f : R n 7→ R ➤ Hessian D2 f ( x ∗ ) s.p.d. (see Def. 8.5.1.18 for the definition of the Hessian)
To compute the minimum of a C2 -function iteratively by means of Newton’s method (→ Sect. 8.5) a linear
system of equations with the s.p.d. Hessian as system matrix has to be solved in each step.

The solution of many equations in science and engineering boils down to finding the minimum of some
(energy, entropy, etc.) function, which accounts for the prominent role of s.p.d. linear systems in applica-
tions. y
Review question(s) 1.1.2.9 (Notations, matrix-vector calculus, and special matrices)
(Q1.1.2.9.A) Give a compact notation for the row vector containing the diagonal entries of a square matrix
S ∈ R n,n , n ∈ N.
(Q1.1.2.9.B) How can you write down the s × s-submatrix, s ∈ N, in the upper right corner of C ∈ R^{n,m},
n, m ≥ s?


(Q1.1.2.9.C) We consider two matrices A, B ∈ R n,m , both with at most N ∈ N non-zero entries. What
is the maximal number of non-zero entries of A + B?
(Q1.1.2.9.D) A matrix A ∈ R n,m enjoys the following property (banded matrix):

    i ∈ {1, . . . , n}, j ∈ {1, . . . , m}, i − j ∉ {−B−, . . . , B+} ⇒ (A)ij = 0 ,

for given B−, B+ ∈ N0. What is the maximal number of non-zero entries of A?


(Q1.1.2.9.E) A matrix A with real entries is known to be skew-symmetric: A⊤ = −A. What does this tell
us about A and its entries?
(Q1.1.2.9.F) What is the dimension of the vector space of
• symmetric matrices ∈ R n,n ?
• skew-symmetric matrices ∈ R n,n , see Question (Q1.1.2.9.E)?
• lower triangular matrices ∈ R n,n ?
• diagonal matrices ∈ R n,n ?

1.2 Software and Libraries


Whenever algorithms involve matrices and vectors (in the sense of linear algebra) it is advisable to rely on
suitable code libraries or numerical programming environments.

1.2.1 E IGEN

Video tutorial for Section 1.2.1 "E IGEN ": (11 minutes) Download link, tablet notes

→ review questions 1.2.1.14


Currently, the most widely used programming language for the development of new simulation software
in scientific and industrial high-performance computing is C++. In this course we are going to use and
discuss E IGEN as an example for a C++ library for numerical linear algebra (“embedded” domain specific
language: DSL).

E IGEN is a header-only C++ template library designed to enable easy, natural and efficient numerical
linear algebra: it provides data structures and a wide range of operations for matrices and vectors, see
below. E IGEN also implements many more fundamental algorithms documentation page or the discussion
below).

E IGEN relies on expression templates to allow the efficient evaluation of complex expressions involving
matrices and vectors. Refer to the example given in the E IGEN documentation for details.

➥ Link to an “E IGEN Cheat Sheet” (quick reference relating to M ATLAB commands)

§1.2.1.1 (Matrix and vector data types in E IGEN) A generic matrix data type is given by the templated
class
Eigen::Matrix<typename Scalar, int RowsAtCompileTime, int ColsAtCompileTime>


Here Scalar is the underlying scalar type of the matrix entries, which must support the usual operations
’+’, ’-’, ’*’, ’/’, and ’+=’, ’*=’, etc. Usually the scalar type will be either double, float, or complex<>. The
cardinal template arguments RowsAtCompileTime and ColsAtCompileTime can pass a fixed size
of the matrix, if it is known at compile time. There is a specialization selected by the template argument
Eigen::Dynamic supporting variable size “dynamic” matrices.

C++ code 1.2.1.2: Vector type and their use in E IGEN ➺ GITLAB
1  #include <Eigen/Dense>
2
3  template <typename Scalar>
4  void eigenTypeDemo(unsigned int dim)
5  {
6    // General dynamic (variable size) matrices
7    using dynMat_t = Eigen::Matrix<Scalar, Eigen::Dynamic, Eigen::Dynamic>;
8    // Dynamic (variable size) column vectors
9    using dynColVec_t = Eigen::Matrix<Scalar, Eigen::Dynamic, 1>;
10   // Dynamic (variable size) row vectors
11   using dynRowVec_t = Eigen::Matrix<Scalar, 1, Eigen::Dynamic>;
12   using index_t = typename dynMat_t::Index;
13   using entry_t = typename dynMat_t::Scalar;
14
15   // Declare vectors of size 'dim'; not yet initialized
16   dynColVec_t colvec(dim);
17   dynRowVec_t rowvec(dim);
18   // Initialisation through component access
19   for (index_t i = 0; i < colvec.size(); ++i) colvec[i] = static_cast<Scalar>(i);
20   for (index_t i = 0; i < rowvec.size(); ++i) rowvec[i] = static_cast<Scalar>(1) / (i + 1);
21   colvec[0] = static_cast<Scalar>(3.14);
22   rowvec[dim - 1] = static_cast<Scalar>(2.718);
23   // Form tensor product, a matrix, see Section 1.3.1
24   dynMat_t vecprod = colvec * rowvec;
25   const int nrows = vecprod.rows();
26   const int ncols = vecprod.cols();
27 }

Note that in Line 24 we could have relied on automatic type deduction via auto vecprod = ....
However, as argued in Rem. 0.3.4.4, it is often safer to forgo this option and specify the type directly.
The following convenience data types are provided by EIGEN, see EIGEN documentation:
• MatrixXd =ˆ generic variable size matrix with double precision entries
• VectorXd, RowVectorXd =ˆ dynamic column and row vectors
  (= dynamic matrices with one dimension equal to 1)
• MatrixNd with N = 2, 3, 4 for small fixed size square N × N-matrices (type double)
• VectorNd with N = 2, 3, 4 for small column vectors with fixed length N.
The d in the type name may be replaced with i (for int), f (for float), and cd (for
complex<double>) to select another basic scalar type.

All matrix types feature the methods cols(), rows(), and size() telling the number of columns, rows,
and total number of entries.
Access to individual matrix entries and vector components, both as rvalue and lvalue, is possible through
the ()-operator taking two arguments of type index_t. If only one argument is supplied, the matrix is
accessed as a linear array according to its memory layout. For vectors, that is, matrices where one
dimension is fixed to 1, the []-operator can replace () with one argument, see Line 22 of Code 1.2.1.2.
y
§1.2.1.3 (Initialization of dense matrices in E IGEN, E IGEN documentation) The entry access oper-
ator (int i,int j) allows the most direct setting of matrix entries; there is hardly any runtime penalty.

Of course, in E IGEN dedicated functions take care of the initialization of the special matrices introduced in
§ 1.1.2.1:
Eigen::MatrixXd I = Eigen::MatrixXd::Identity(n,n);
Eigen::MatrixXd O = Eigen::MatrixXd::Zero(n,m);
Eigen::MatrixXd D = d_vector.asDiagonal();

C++ code 1.2.1.4: Initializing special matrices in E IGEN, ➺ GITLAB


#include <Eigen/Dense>
// Just allocate space for matrix, no initialisation
Eigen::MatrixXd A(rows, cols);
// Zero matrix. Similar to matlab command zeros(rows, cols);
Eigen::MatrixXd B = MatrixXd::Zero(rows, cols);
// Ones matrix. Similar to matlab command ones(rows, cols);
Eigen::MatrixXd C = MatrixXd::Ones(rows, cols);
// Matrix with all entries same as value.
Eigen::MatrixXd D = MatrixXd::Constant(rows, cols, value);
// Random matrix, entries uniformly distributed in [0, 1]
Eigen::MatrixXd E = MatrixXd::Random(rows, cols);
// (Generalized) identity matrix, 1 on main diagonal
Eigen::MatrixXd I = MatrixXd::Identity(rows, cols);
std::cout << "size of A = (" << A.rows() << ',' << A.cols() << ')' << std::endl;

A versatile way to initialize a matrix relies on a combination of the operators << and , (comma), which allows the
construction of a matrix from blocks, see ➺ GITLAB, function blockinit().
MatrixXd mat3(6,6);
mat3 <<
  MatrixXd::Constant(4,2,1.5), // top row, first block
  MatrixXd::Constant(4,3,3.5), // top row, second block
  MatrixXd::Constant(4,1,7.5), // top row, third block
  MatrixXd::Constant(2,4,2.5), // bottom row, left block
  MatrixXd::Constant(2,2,4.5); // bottom row, right block

The matrix is filled top to bottom, left to right; block dimensions have to match (like in MATLAB). y

§1.2.1.5 (Access to submatrices in E IGEN, E IGEN documentation) The method block(int i,int
j,int p,int q) returns a reference to the submatrix with upper left corner at position (i, j) and size
p × q.

The methods row(int i) and col(int j) provide a reference to the corresponding row and column of
the matrix. Even more specialised access methods are
topLeftCorner(p,q), bottomLeftCorner(p,q),
topRightCorner(p,q), bottomRightCorner(p,q),
topRows(q), bottomRows(q),
leftCols(p), and rightCols(q),
with obvious purposes.


C++ code 1.2.1.6: Demonstration code for access to matrix blocks in E IGEN ➺ GITLAB
2  template <typename MatType>
3  void blockAccess(Eigen::MatrixBase<MatType> &M)
4  {
5    using index_t = typename Eigen::MatrixBase<MatType>::Index;
6    const index_t nrows(M.rows()); // No. of rows
7    const index_t ncols(M.cols()); // No. of columns
8
9    cout << "Matrix M = " << endl << M << endl; // Print matrix
10   // Block size half the size of the matrix
11   const index_t p = nrows / 2;
12   const index_t q = ncols / 2;
13   // Output submatrix with left upper entry at position (i,i)
14   for (index_t i = 0; i < std::min(p, q); i++) {
15     cout << "Block (" << i << ',' << i << ',' << p << ',' << q
16          << ") = " << M.block(i, i, p, q) << endl;
17   }
18   // l-value access: modify sub-matrix by adding a constant
19   M.block(1, 1, p, q) += Eigen::MatrixBase<MatType>::Constant(p, q, 1.0);
20   cout << "M = " << endl << M << endl;
21   // r-value access: extract sub-matrix
22   const MatrixXd B = M.block(1, 1, p, q);
23   cout << "Isolated modified block = " << endl << B << endl;
24   // Special sub-matrices
25   cout << p << " top rows of m = " << M.topRows(p) << endl;
26   cout << p << " bottom rows of m = " << M.bottomRows(p) << endl;
27   cout << q << " left cols of m = " << M.leftCols(q) << endl;
28   cout << q << " right cols of m = " << M.rightCols(p) << endl;
29   // r-value access to upper triangular part
30   const MatrixXd T = M.template triangularView<Upper>(); //
31   cout << "Upper triangular part = " << endl << T << endl;
32   // l-value access to lower triangular part
33   M.template triangularView<Lower>() *= -1.5; //
34   cout << "Matrix M = " << endl << M << endl;
35 }

• Note that the function blockAccess() is templated and that the matrix argument passed through
M has a type derived from Eigen::MatrixBase. The deeper reason for this alien looking signature of
blockAccess() is explained in E IGEN documentation.
• E IGEN offers views for access to triangular parts of a matrix, see Line 30 and Line 33, according to
M.triangularView<XX>()
where XX can stand for one of the following: Upper, Lower, StrictlyUpper, StrictlyLower, UnitUpper,
UnitLower, see E IGEN documentation.
• For column and row vectors references to sub-vectors can be obtained by the methods head(int
length), tail(int length), and segment(int pos,int length).

Note: Unless the preprocessor switch NDEBUG is set, E IGEN performs range checks on all indices. y
§1.2.1.7 (Componentwise operations in E IGEN) Running out of overloadable operators, E IGEN uses the
Array concept to furnish entry-wise operations on matrices. An E IGEN-Array contains the same data as a
matrix, supports the same methods for initialisation and access, but replaces the operators of matrix arith-
metic with entry-wise actions. Matrices and arrays can be converted into each other by the array() and
matrix() methods, see E IGEN documentation for details. Information about functions that enable
entry-wise operation is available in the E IGEN documentation.


C++ code 1.2.1.8: Using Array in E IGEN ➺ GITLAB


void inline matArray(int nrows, int ncols) {
  Eigen::MatrixXd m1(nrows, ncols);
  Eigen::MatrixXd m2(nrows, ncols);
  for (int i = 0; i < m1.rows(); i++) {
    for (int j = 0; j < m1.cols(); j++) {
      m1(i, j) = static_cast<double>((i + 1)) / (j + 1);
      m2(i, j) = static_cast<double>((j + 1)) / (i + 1);
    }
  }
  // Entry-wise product, not a matrix product
  const Eigen::MatrixXd m3 = (m1.array() * m2.array()).matrix();
  // Explicit entry-wise operations on matrices are possible
  const Eigen::MatrixXd m4(m1.cwiseProduct(m2));
  // Entry-wise logarithm
  cout << "Log(m1) = " << endl << log(m1.array()) << endl;
  // Entry-wise boolean expression, true cases counted
  cout << (m1.array() > 3).count() << " entries of m1 > 3" << endl;
}

The application of a functor (→ Section 0.3.3) to all entries of a matrix can also be done via the
unaryExpr() method of a matrix:
// Apply a lambda function to all entries of a matrix
auto fnct = [](double x) { return (x + 1.0/x); };
cout << "f(m1) = " << endl << m1.unaryExpr(fnct) << endl;

§1.2.1.9 (Reduction operations in EIGEN) According to EIGEN's terminology, reductions are op-
erations that access all entries of a matrix and accumulate some information in the process, see the
EIGEN documentation. A typical example is the summation of the entries.

C++ code 1.2.1.10: Summation reduction in E IGEN ➺ GITLAB


template <class Matrix>
void sumEntries(Eigen::MatrixBase<Matrix> &M) {
  using Scalar = typename Eigen::MatrixBase<Matrix>::Scalar;
  // Compute sum of all entries
  const Scalar s = M.sum();
  // Row-wise and column-wise sums of entries: results are vectors
  const Eigen::Matrix<Scalar, 1, Eigen::Dynamic> colsums{M.colwise().sum()};
  const Eigen::Matrix<Scalar, Eigen::Dynamic, 1> rowsums{M.rowwise().sum()};
  std::cout << M.rows() << 'x' << M.cols() << "-matrix: " << colsums.sum()
            << " = " << rowsums.sum() << " = " << s << std::endl;
}
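
Beyond summation, EIGEN provides further reduction methods. A small sketch (added for illustration; all
methods used are standard EIGEN API):

#include <Eigen/Dense>
#include <iostream>

int main() {
  const Eigen::MatrixXd M = Eigen::MatrixXd::Random(4, 5);
  std::cout << "min entry      = " << M.minCoeff() << std::endl;
  std::cout << "max entry      = " << M.maxCoeff() << std::endl;
  std::cout << "product        = " << M.prod() << std::endl;
  std::cout << "Frobenius norm = " << M.norm() << std::endl;
  // Reductions can also report the location of an extremal entry
  Eigen::Index i, j;
  M.maxCoeff(&i, &j);
  std::cout << "maximum located at (" << i << ',' << j << ')' << std::endl;
  return 0;
}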

Remark 1.2.1.11 (’auto’ in EIGEN codes) The expression template programming model (→ explanations
in the EIGEN documentation) relies on complex intermediary data types hidden from the user. They support
the efficient evaluation of complex expressions, see the EIGEN documentation. Let us look at the following two
code snippets that assume that both M and R are of type Eigen::MatrixXd.
Code I:
auto D = M.diagonal().asDiagonal(); R = D.inverse();


Code II:
Eigen::MatrixXd D = M.diagonal().asDiagonal(); R = D.inverse();

(Fig. 6: inversion of diagonal matrices, performance vs. matrix size; runtimes in µs of the “auto” and
“explicit” variants plotted against the matrix size n, together with an O(n³) reference line. Platform:
Quad-Core Intel Core i7 @ 3.1 GHz, L2 256 KB, L3 8 MB, Mem 16 GB, macOS 10.15.5, clang 11.0.3,
-O2, NDEBUG.)

We observe that for large matrices Code I (“auto”, values ≤ 0.5 µs suppressed) runs much faster than
Code II (“explicit”), though they are “algebraically equivalent”.

The reason is that in Code I the variable D is of a complex type that preserves the information that the matrix is diagonal.
Of course, inverting a diagonal matrix is cheap. Conversely, forcing D to be of type Eigen::MatrixXd loses
this information, and the expensive inverse() method for a generic densely populated matrix is invoked.
This is one of the exceptions to Rem. 0.3.4.4: for variables holding the result of EIGEN expressions auto
is recommended. y

Remark 1.2.1.12 (EIGEN-based code: debug mode and release mode) If you want a C++ code built
using the EIGEN library to run fast, for instance for large computations or runtime measurements, you should
compile in release mode, that is, with the compiler switches -O2 -DNDEBUG (for gcc or clang). In a
cmake-based build system you can achieve this by setting the flag CMAKE_BUILD_TYPE to “Release”.
The default setting for E IGEN is debug mode, which makes E IGEN do a lot of consistency
checking and considerably slows down execution of a code.
!
For “production runs” E IGEN-based codes must be compiled in release mode!
y

Remark 1.2.1.13 (E IGEN in use)


☞ E IGEN is used as one of the base libraries for the Robot Operating System (ROS), an open source
project with strong ETH participation.
☞ The geometry processing library libigl uses E IGEN as its basic linear algebra engine. It is being used
and developed at ETH Zurich, at the Interactive Geometry Lab and Advanced Technologies Lab.
y
Review question(s) 1.2.1.14 (E IGEN)
For the following questions you may consult the E IGEN documentation.
(Q1.2.1.14.A) Outline a C++ function
template <typename Matrix>
void replaceWithId(Eigen::DenseBase<Matrix> &M);

that checks whether the matrix is an n × n-matrix with even n ∈ N and then replaces its upper right
n/2 × n/2-block with an identity matrix. Do not use any C++ loops.


(Q1.2.1.14.B) Given an Eigen::VectorXd object v (↔ v ∈ R^n), sketch a C++ code snippet that replaces
it with a vector ṽ defined by

    (ṽ)i := (v)n       for i = 1 ,
    (ṽ)i := (v)i−1     for i = 2, . . . , n .

Do not use C++ loops. Can you see a problem?


(Q1.2.1.14.C) Given a matrix M ∈ R^{m,n} stored in an Eigen::MatrixXd object M, write down a C++ code
snippet that initializes another variable Mext of type Eigen::MatrixXd corresponding to

    M̃ := [ M  0 ; 0⊤  1 ] ∈ R^{m+1,n+1}

using EIGEN's << matrix construction operator, see the EIGEN documentation.


(Q1.2.1.14.D) Learn about the methods head() and tail() from the E IGEN documentation and ex-
press them by means of the block() method applied to the same variable.

1.2.2 PYTHON
PYTHON is a widely used general-purpose and open source programming language. Together with
packages like NUMPY and MATPLOTLIB it delivers functionality similar to MATLAB for free. For interactive
computing IPYTHON can be used. All those packages belong to the SCIPY ecosystem.
PYTHON features good documentation, and several scientific distributions are available (e.g. Anaconda,
Enthought) which contain the most important packages. On most Linux distributions the SCIPY ecosystem
is also available in the software repository, as well as many other packages, including for example the
Spyder IDE delivered with Anaconda.
A good introductory tutorial to numerical PYTHON are the SCIPY lectures. The full documentation of
NUMPY and SCIPY can be found here. For former MATLAB users there is also a guide. The scripts in these
lecture notes follow the official PYTHON style guide.
Note that in PYTHON we have to import the numerical packages explicitly before use. This is normally done
at the beginning of the file with lines like import numpy as np and from matplotlib import
pyplot as plt. Those import statements are often skipped in these lecture notes to focus on the actual
computations. But you can always assume the import statements as given here, e.g. np.ravel(A) is
a call to a NUMPY function and plt.loglog(x, y) is a call to a MATPLOTLIB pyplot function.
PYTHON is not used in the current version of the lecture. Nevertheless, a few PYTHON codes are supplied
in order to convey similarities and differences to implementations in MATLAB and C++.
§1.2.2.1 (Matrices and Vectors in PYTHON) The basic numeric data type in PYTHON is NUMPY's n-
dimensional array. Vectors are normally implemented as 1D arrays and no distinction is made between
row and column vectors. Matrices are represented as 2D arrays.
☞ v = np.array([1, 2, 3]) creates a 1D array with the three elements 1, 2 and 3.
☞ A = np.array([[1, 2], [3, 4]]) creates a 2D array.
☞ A.shape gives the n-dimensional size of an array.
☞ A.size gives the total number of entries in an array.


Note: There is also a matrix class in NUMPY with different semantics, but its use is officially discouraged
and it might even be removed in a future release.
y

§1.2.2.2 (Manipulating arrays in PYTHON) The documentation lists many possibilities for creating,
indexing, and manipulating arrays.
An important difference to MATLAB is that all arithmetic operations are normally performed element-wise,
e.g. A * B is not the matrix-matrix product but element-wise multiplication (in MATLAB: A.*B). Also A
* v does a broadcast element-wise product. For the matrix product one has to use np.dot(A, B)
or A.dot(B) explicitly. y

1.2.3 (Dense) Matrix Storage Formats

Video tutorial for Section 1.2.3 "(Dense) Matrix Storage Formats": (10 minutes)
Download link, tablet notes

→ review questions 1.2.3.11


All numerical libraries store the entries of a (generic = dense) matrix A ∈ K m,n in a linear array of length
mn (or longer). Accessing entries entails suitable index computations.

Two natural options for “vectorisation” of a matrix: row major, column major

For the matrix A with rows [1 2 3], [4 5 6], [7 8 9] the two layouts are:

    Row major (C-arrays, bitmaps, PYTHON):     A_arr = 1 2 3 4 5 6 7 8 9
    Column major (Fortran, MATLAB, EIGEN):     A_arr = 1 4 7 2 5 8 3 6 9

Access to entry (A)ij of A ∈ K^{n,m}, i = 1, . . . , n, j = 1, . . . , m:

    row major:      (A)ij ↔ A_arr(m*(i-1)+(j-1)) ,
    column major:   (A)ij ↔ A_arr(n*(j-1)+(i-1)) .

(Fig. 7, Fig. 8: illustrations of row major and column major memory layout.)
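These index formulas can be spelled out directly in C++. The following small sketch (illustrative only; the
two helper functions are not part of any library) checks them for the 3 × 3 example matrix above:

#include <cassert>
#include <vector>

// Map the entry index (i,j), 1-based, to the position in the linear array
int idxRowMajor(int i, int j, int m) { return m * (i - 1) + (j - 1); }
int idxColMajor(int i, int j, int n) { return n * (j - 1) + (i - 1); }

int main() {
  const int n = 3, m = 3; // dimensions of the example matrix
  const std::vector<double> rm = {1, 2, 3, 4, 5, 6, 7, 8, 9}; // row major
  const std::vector<double> cm = {1, 4, 7, 2, 5, 8, 3, 6, 9}; // column major
  // Both layouts must yield the same entry (A)_{2,3} = 6
  assert(rm[idxRowMajor(2, 3, m)] == 6.0);
  assert(cm[idxColMajor(2, 3, n)] == 6.0);
  return 0;
}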

EXAMPLE 1.2.3.1 (Accessing matrix data as a vector) In E IGEN the single index access operator relies
on the linear data layout:
In E IGEN the data layout can be controlled by a template argument; default is column major.

C++ code 1.2.3.2: Single index access of matrix entries in E IGEN ➺ GITLAB
void storageOrder(int nrows = 6, int ncols = 7)
{
  cout << "Different matrix storage layouts in Eigen" << endl;
  // Template parameter ColMajor selects column major data layout
  Matrix<double, Dynamic, Dynamic, ColMajor> mcm(nrows, ncols);
  // Template parameter RowMajor selects row major data layout
  Matrix<double, Dynamic, Dynamic, RowMajor> mrm(nrows, ncols);
  // Direct initialization; lazy option: use int as index type
  for (int l = 1, i = 0; i < nrows; i++) {
    for (int j = 0; j < ncols; j++, l++) {
      mcm(i, j) = mrm(i, j) = l;
    }
  }

  cout << "Matrix mrm = " << endl << mrm << endl;
  cout << "mcm linear = ";
  for (int l = 0; l < mcm.size(); l++) {
    cout << mcm(l) << ',';
  }
  cout << endl;

  cout << "mrm linear = ";
  for (int l = 0; l < mrm.size(); l++) {
    cout << mrm(l) << ',';
  }
  cout << endl;
}

The function call storageOrder(3,3), cf. Code 1.2.3.2, yields the output:
Different matrix storage layouts in Eigen
Matrix mrm =
1 2 3
4 5 6
7 8 9
mcm linear = 1,4,7,2,5,8,3,6,9,
mrm linear = 1,2,3,4,5,6,7,8,9,

In P YTHON the default data layout is row major, but it can be explicitly set. Further, array transposition
does not change any data, but only the memory order and array shape.

P YTHON-code 1.2.3.3: Storage order in P YTHON


# array creation
A = np.array([[1, 2], [3, 4]])             # default (row major) storage
B = np.array([[1, 2], [3, 4]], order='F')  # column major storage

# show internal storage
np.ravel(A, 'K')  # array elements as stored in memory: [1, 2, 3, 4]
np.ravel(B, 'K')  # array elements as stored in memory: [1, 3, 2, 4]

# nothing happens to the data on transpose, just the storage order changes
np.ravel(A.T, 'K')  # array elements as stored in memory: [1, 2, 3, 4]
np.ravel(B.T, 'K')  # array elements as stored in memory: [1, 3, 2, 4]

# storage order can be accessed by checking the array's flags
A.flags['C_CONTIGUOUS']    # True
B.flags['F_CONTIGUOUS']    # True
A.T.flags['F_CONTIGUOUS']  # True
B.T.flags['C_CONTIGUOUS']  # True


Remark 1.2.3.4 (Vectorisation of a matrix) Mapping a column-major matrix to a column vector with the
same number of entries is called vectorization or linearization in numerical linear algebra, in symbols

    vec : K^{n,m} → K^{n·m} ,   vec(A) := [ (A):,1 ; (A):,2 ; . . . ; (A):,m ] ∈ K^{n·m} .   (1.2.3.5)

y
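In EIGEN, vec(A) of a column-major matrix can be obtained without copying any data by means of
Eigen::Map; a minimal sketch (added for illustration):

#include <Eigen/Dense>
#include <iostream>

int main() {
  Eigen::MatrixXd A(2, 3);
  A << 1, 2, 3,
       4, 5, 6;
  // Reinterpret the column-major data array as a vector: [1 4 2 5 3 6]^T
  const Eigen::Map<const Eigen::VectorXd> vecA(A.data(), A.size());
  std::cout << vecA.transpose() << std::endl;
  return 0;
}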

Remark 1.2.3.6 (Reshaping matrices in E IGEN) If you need a reshaped view of a matrix’ data in E IGEN
you can obtain it via the raw data vector belonging to the matrix. Then use this information to create a
matrix view by means of Map → documentation.

C++ code 1.2.3.7: Demonstration on how reshape a matrix in E IGEN ➺ GITLAB


template <typename MatType>
void reshapetest(MatType &M)
{
  using index_t = typename MatType::Index;
  using entry_t = typename MatType::Scalar;
  const index_t nsize(M.size());

  // reshaping possible only for matrices with non-prime dimensions
  if ((nsize % 2) == 0) {
    entry_t *Mdat = M.data(); // raw data array for M
    // Reinterpretation of data of M
    Map<Eigen::Matrix<entry_t, Dynamic, Dynamic>> R(Mdat, 2, nsize / 2);
    // (Deep) copy data of M into matrix of different size
    const Eigen::Matrix<entry_t, Dynamic, Dynamic> S =
        Map<Eigen::Matrix<entry_t, Dynamic, Dynamic>>(Mdat, 2, nsize / 2);

    cout << "Matrix M = " << endl << M << endl;
    cout << "reshaped to " << R.rows() << 'x' << R.cols()
         << " = " << endl << R << endl;
    // Modifying R affects M, because they share the data space!
    R *= -1.5;
    cout << "Scaled (!) matrix M = " << endl << M << endl;
    // Matrix S is not affected, because of deep copy
    cout << "Matrix S = " << endl << S << endl;
  }
}

This function has to be called with a mutable (l-value) matrix type object. A sample output is printed next:
Matrix M =
 0 -1 -2 -3 -4 -5 -6
 1  0 -1 -2 -3 -4 -5
 2  1  0 -1 -2 -3 -4
 3  2  1  0 -1 -2 -3
 4  3  2  1  0 -1 -2
 5  4  3  2  1  0 -1
reshaped to 2x21 =
 0  2  4 -1  1  3 -2  0  2 -3 -1  1 -4 -2  0 -5 -3 -1 -6 -4 -2
 1  3  5  0  2  4 -1  1  3 -2  0  2 -3 -1  1 -4 -2  0 -5 -3 -1
Scaled (!) matrix M =
  -0  1.5    3  4.5    6  7.5    9
-1.5   -0  1.5    3  4.5    6  7.5
  -3 -1.5   -0  1.5    3  4.5    6
-4.5   -3 -1.5   -0  1.5    3  4.5
  -6 -4.5   -3 -1.5   -0  1.5    3
-7.5   -6 -4.5   -3 -1.5   -0  1.5
Matrix S =
 0  2  4 -1  1  3 -2  0  2 -3 -1  1 -4 -2  0 -5 -3 -1 -6 -4 -2
 1  3  5  0  2  4 -1  1  3 -2  0  2 -3 -1  1 -4 -2  0 -5 -3 -1

y
Remark 1.2.3.8 (N UM P Y function reshape) N UM P Y offers the function np.reshape for changing the
dimensions of a matrix A ∈ K m,n :
# read elements of A in row major order (default)
B = np.reshape(A, (k, l)) # error, in case kl ≠ mn
B = np.reshape(A, (k, l), order=’C’) # same as above
# read elements of A in column major order
B = np.reshape(A, (k, l), order=’F’)
# read elements of A as stored in memory
B = np.reshape(A, (k, l), order=’A’)

This command will create a k × l array by reinterpreting the array of entries of A as data for an array
with k rows and l columns. The order in which the elements of A are read can be set by the order
argument to row major (default, ’C’), column major (’F’), or A's internal storage order, i.e. row major if
A is row major or column major if A is column major (’A’). y

EXPERIMENT 1.2.3.9 (Impact of matrix data access patterns on runtime) Modern CPUs feature several
levels of memory (registers, L1 cache, L2 cache, . . ., main memory) of different latency, bandwidth, and
size. Frequently accessing memory locations with widely different addresses results in many cache misses
and will considerably slow down the CPU.
The following C++ code runs through the entries of a column major matrix (EIGEN's default) in two ways
and measures the (minimal) time required for the loops to complete. It relies on the
std::chrono library, see the C++ reference.

C++ code 1.2.3.10: Timing for row and column oriented matrix access for E IGEN ➺ GITLAB
void rowcolaccesstiming()
{
  constexpr size_t K = 3;        // Number of repetitions
  constexpr index_t N_min = 5;   // Smallest matrix size 32
  constexpr index_t N_max = 13;  // Scan until matrix size of 8192
  index_t n = (1UL << static_cast<size_t>(N_min));
  Eigen::MatrixXd times(N_max - N_min + 1, 3);

  for (index_t l = N_min; l <= N_max; l++, n *= 2) {
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
    value_t t1 = 1000.0;
    for (size_t k = 0; k < K; k++) {
      auto tic = high_resolution_clock::now();
      for (index_t j = 0; j < n - 1; j++) {
        A.row(j + 1) -= A.row(j); // row access
      }
      auto toc = high_resolution_clock::now();
      const value_t t =
          static_cast<value_t>(duration_cast<microseconds>(toc - tic).count()) / 1E6;
      t1 = std::min(t1, t);
    }
    value_t t2 = 1000.0;
    for (size_t k = 0; k < K; k++) {
      auto tic = high_resolution_clock::now();
      for (index_t j = 0; j < n - 1; j++) {
        A.col(j + 1) -= A.col(j); // column access
      }
      auto toc = high_resolution_clock::now();
      const value_t t =
          static_cast<value_t>(duration_cast<microseconds>(toc - tic).count()) / 1E6;
      t2 = std::min(t2, t);
    }
    times(l - N_min, 0) = static_cast<value_t>(n);
    times(l - N_min, 1) = t1;
    times(l - N_min, 2) = t2;
  }
  std::cout << times << std::endl;
}

(Fig. 9: runtimes [s] versus matrix size n, as measured with Code 1.2.3.10, for EIGEN row access
(A(i+1,:) = A(i+1,:) - A(i,:)) and EIGEN column access (A(:,j+1) = A(:,j+1) - A(:,j)).)

Platform:
✦ ubuntu 14.04 LTS
✦ i7-3517U CPU @ 1.90GHz
✦ L1 32 KB, L2 256 KB, L3 4096 KB, Mem 8 GB
✦ gcc 4.8.4, -O3, -DNDEBUG

The compiler flags -O3 and -DNDEBUG are essential. The C++ code would be significantly slower if the
default compiler options were used!

We observe a blatant discrepancy in the CPU time required for accessing the entries of a matrix in row-wise
or column-wise fashion. This reflects the impact of features of the underlying hardware architecture, like
cache size and memory bandwidth:

Interpretation of timings: Since standard matrices in EIGEN are stored column major, all the matrix el-
ements in a column occupy contiguous memory locations, which will all reside in the cache together.
Hence, column oriented access will mainly operate on data in the cache even for large matrices. Con-
versely, row oriented access addresses matrix entries that are stored in distant memory locations, which
incurs frequent cache misses (cache thrashing).
The impact of hardware architecture on the performance of algorithms will not be taken into account in
this course, because hardware features tend to be both intricate and ephemeral. However, for modern
high performance computing it is essential to adapt implementations to the hardware on which the code is
supposed to run. y
Review question(s) 1.2.3.11 (Dense matrix storage formats)

(Q1.2.3.11.A) Write efficient elementary C++ loops that realize the matrix×vector product Mx,
M ∈ R^{m,n}, x ∈ R^n, where M is stored in an Eigen::MatrixXd object M and x is given as an
Eigen::VectorXd object x. Assume the default (column major) memory layout for M. Discuss the
memory access pattern.
(Q1.2.3.11.B) A black-box function has the following signature:

  template <typename Vector>
  double processVector(const Eigen::DenseBase<Vector> &v);

It is known that it accesses each vector entry only once.


Recall that the vectorisation of a matrix A ∈ K^{n,m} is defined as

  vec : K^{n,m} → K^{n·m} ,   vec(A) := [ (A)_{:,1} ; (A)_{:,2} ; … ; (A)_{:,m} ] ∈ K^{n·m} .   (1.2.3.5)

Given a matrix A ∈ R n,m stored in an Eigen::MatrixXd object A (column major memory layout), how
can you efficiently realize the following function calls in C++:
• processVector(vec(A)) ,
• processVector(vec(A⊤ )) ?

1.3 Basic Linear Algebra Operations


First we refresh the basic rules of vector and matrix calculus. Then we will learn about a very old program-
ming interface for simple dense linear algebra operations.

1.3.1 Elementary Matrix-Vector Calculus


What you should know from linear algebra [NS02, Sect. 2.2]:
✦ vector space operations in matrix space K m,n (addition, multiplication with scalars)
✦ dot product: x, y ∈ K^n, n ∈ N:  x·y := x^H y = ∑_{i=1}^{n} x̄_i y_i ∈ K
  (in EIGEN: x.dot(y) or x.adjoint()*y, x, y =̂ column vectors)
✦ tensor product: x ∈ K^m, y ∈ K^n, n ∈ N:  x y^H = ( x_i ȳ_j )_{i=1,…,m; j=1,…,n} ∈ K^{m,n}
  (in EIGEN: x*y.adjoint(), x, y =̂ column vectors)
✦ All are special cases of the matrix product:

  A ∈ K^{m,n}, B ∈ K^{n,k} :   AB = [ ∑_{j=1}^{n} a_{ij} b_{jl} ]_{i=1,…,m; l=1,…,k} ∈ K^{m,k} .   (1.3.1.1)
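For completeness, a small self-contained EIGEN snippet (written ad hoc for illustration, not one of the lecture codes) exercising these three operations:

  #include <Eigen/Dense>
  #include <complex>
  #include <iostream>

  int main() {
    using Eigen::VectorXcd; using Eigen::MatrixXcd;
    const VectorXcd x = VectorXcd::Random(4), y = VectorXcd::Random(4);
    // dot product x·y = x^H y (the first argument is conjugated)
    const std::complex<double> d = x.dot(y);       // equivalent to x.adjoint()*y
    // tensor product x y^H, a rank-1 4x4 matrix
    const MatrixXcd T = x * y.adjoint();
    // general matrix product of a 3x4 and a 4x2 matrix
    const MatrixXcd A = MatrixXcd::Random(3, 4), B = MatrixXcd::Random(4, 2);
    const MatrixXcd P = A * B;                     // 3x2 product matrix
    std::cout << d << std::endl << T << std::endl << P << std::endl;
    return 0;
  }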


Recall from linear algebra basic properties of the matrix product: for all K-matrices A, B, C (of suitable
sizes), α, β ∈ K

associative:       (AB)C = A(BC) ,
bi-linear:         (αA + βB)C = α(AC) + β(BC) ,   C(αA + βB) = α(CA) + β(CB) ,
non-commutative:   AB ≠ BA in general .

§1.3.1.2 (Visualisation of (special) matrix products) Dependency of an entry of a product matrix:

[Fig. 10: schematic of the matrix product — each entry of the (m × k) product matrix depends on one row of the (m × n) factor and one column of the (n × k) factor; the dot product and the tensor product appear as special cases.]


y

Remark 1.3.1.3 (Row-wise & column-wise view of matrix product) To understand what is going on
when forming a matrix product, it is often useful to decompose it into matrix×vector operations in one of
the following two ways:

A ∈ K^{m,n}, B ∈ K^{n,k} :

  AB = [ A(B)_{:,1}  …  A(B)_{:,k} ]        (matrix assembled from columns),
  AB = [ (A)_{1,:}B ; … ; (A)_{m,:}B ]      (matrix assembled from rows).        (1.3.1.4)

For notations refer to Sect. 1.1.1. y
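A minimal EIGEN sketch (freely written for illustration, not one of the lecture codes) that evaluates a matrix product in the two ways of (1.3.1.4) and checks that both agree with EIGEN's built-in product:

  #include <Eigen/Dense>
  #include <cassert>

  int main() {
    const int m = 4, n = 3, k = 5;
    const Eigen::MatrixXd A = Eigen::MatrixXd::Random(m, n);
    const Eigen::MatrixXd B = Eigen::MatrixXd::Random(n, k);
    Eigen::MatrixXd C_cols(m, k), C_rows(m, k);
    for (int j = 0; j < k; ++j) C_cols.col(j) = A * B.col(j);  // column-wise view
    for (int i = 0; i < m; ++i) C_rows.row(i) = A.row(i) * B;  // row-wise view
    assert((C_cols - A * B).norm() < 1e-12 && (C_rows - A * B).norm() < 1e-12);
    return 0;
  }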

Remark 1.3.1.5 (Understanding the structure of product matrices) A “mental image” of matrix multi-
plication is useful for telling special properties of product matrices.

For instance, zero blocks of the product matrix can be predicted easily in the following situations using the
idea explained in Rem. 1.3.1.3 (try to understand how):


[Fig. 11, Fig. 12: if one factor contains a zero block of rows or columns, a corresponding zero block appears in the product matrix.]
A clear understanding of matrix multiplication enables you to “see”, which parts of a matrix factor matter
in a product:

[Fig. 13: entries of one factor that only ever get multiplied with a zero block of the other factor are irrelevant for the product.]
“Seeing” the structure/pattern of a matrix product:

[Illustration: spy plots of two matrix products — the product of the sparse “arrow matrix” A from Code 1.3.1.6 with itself is a fully populated matrix, whereas the product of A with its flipped counterpart B is a sparse “framed” matrix.]

These nice renderings of the so-called patterns of matrices, that is, the distribution of their non-zero entries
have been created by a special plotting command spy() of matplotlibcpp.

C++ code 1.3.1.6: Visualizing the structure of matrices in EIGEN ➺ GITLAB

#include "matplotlibcpp.h"  // Tools for plotting, see https://github.com/lava/matplotlib-cpp
#include <Eigen/Dense>
#include <string>
namespace plt = matplotlibcpp;
using namespace Eigen;

// Produce spy-plot of a dense Eigen matrix.
void spy(const Eigen::MatrixXd &M, const std::string &fname) {
  plt::figure();
  plt::spy(M, {{"marker", "o"}, {"markersize", "2"}, {"color", "b"}});
  plt::title("nnz = " + std::to_string(M.nonZeros()));
  plt::savefig(fname);
}

int main() {
  int n = 100;
  MatrixXd A(n, n), B(n, n);
  A.setZero();
  B.setZero();
  // Initialize matrices, see Fig. 13
  A.diagonal() = VectorXd::LinSpaced(n, 1, n);
  A.col(n - 1) = VectorXd::LinSpaced(n, 1, n);
  A.row(n - 1) = RowVectorXd::LinSpaced(n, 1, n);
  B = A.colwise().reverse();
  // Matrix products
  MatrixXd C = A * A, D = A * B;
  spy(A, "Aspy_cpp.eps");  // Sparse arrow matrix
  spy(B, "Bspy_cpp.eps");  // Sparse arrow matrix
  spy(C, "Cspy_cpp.eps");  // Fully populated matrix
  spy(D, "Dspy_cpp.eps");  // Sparse "framed" matrix
  return 0;
}

This code also demonstrates the use of diagonal(), col(), row() for L-value access to parts of a
matrix.
P YTHON/MATPLOTLIB-command for visualizing the structure of a matrix: plt.spy(M)

PYTHON-code 1.3.1.7: Visualizing the structure of matrices in PYTHON

import numpy as np
import matplotlib.pyplot as plt

n = 100
A = np.diag(np.mgrid[:n])
A[:, -1] = A[-1, :] = np.mgrid[:n]
B = A[::-1, :]              # the arrow matrix flipped upside down
plt.spy(A)
plt.spy(B)
plt.spy(np.dot(A, A))
plt.spy(np.dot(A, B))

Remark 1.3.1.8 (Multiplying triangular matrices) The following result is useful when dealing with matrix
decompositions that often involve triangular matrices.

Lemma 1.3.1.9. Group of regular diagonal/triangular matrices

If A and B are both diagonal / both upper triangular / both lower triangular, then AB and A^{−1}
(the latter assuming that A is regular) are diagonal / upper triangular / lower triangular, respectively.

“Proof by visualization” → Rem. 1.3.1.5

[Sketch: the product of two upper triangular matrices (all entries below the diagonal = 0) is again upper triangular.]

y
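A quick numerical check of Lemma 1.3.1.9 (a throwaway sketch, not one of the lecture codes): the product of two upper triangular matrices and the inverse of a regular upper triangular matrix have no entries below the diagonal.

  #include <Eigen/Dense>
  #include <iostream>

  int main() {
    const int n = 5;
    const Eigen::MatrixXd R1 = Eigen::MatrixXd::Random(n, n);
    const Eigen::MatrixXd R2 = Eigen::MatrixXd::Random(n, n);
    // Upper triangular test matrices; the identity is added to make A regular
    const Eigen::MatrixXd A =
        Eigen::MatrixXd(R1.triangularView<Eigen::Upper>()) + Eigen::MatrixXd::Identity(n, n);
    const Eigen::MatrixXd B = R2.triangularView<Eigen::Upper>();
    const Eigen::MatrixXd P = A * B;           // product of upper triangular matrices
    const Eigen::MatrixXd Ainv = A.inverse();  // inverse of a regular upper triangular matrix
    // Norms of the strictly lower triangular parts: should be (close to) zero
    std::cout << Eigen::MatrixXd(P.triangularView<Eigen::StrictlyLower>()).norm() << std::endl
              << Eigen::MatrixXd(Ainv.triangularView<Eigen::StrictlyLower>()).norm() << std::endl;
    return 0;
  }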
EXPERIMENT 1.3.1.10 (Scaling a matrix) Scaling = multiplication with diagonal matrices (with non-zero
diagonal entries):

It is important to know the different effect of multiplying with a diagonal matrix from left or right:

✦ multiplication with diagonal matrix from left ➤ row scaling


    
  diag(d_1, …, d_n) · A
    = [ d_1 a_{11}  d_1 a_{12} … d_1 a_{1m} ; d_2 a_{21}  d_2 a_{22} … d_2 a_{2m} ; … ; d_n a_{n1}  d_n a_{n2} … d_n a_{nm} ]
    = [ d_1 (A)_{1,:} ; … ; d_n (A)_{n,:} ] ,

  that is, the i-th row of A gets multiplied by d_i.

✦ multiplication with diagonal matrix from right ➤ column scaling


    
  A · diag(d_1, …, d_m)
    = [ d_1 a_{11}  d_2 a_{12} … d_m a_{1m} ; d_1 a_{21}  d_2 a_{22} … d_m a_{2m} ; … ; d_1 a_{n1}  d_2 a_{n2} … d_m a_{nm} ]
    = [ d_1 (A)_{:,1}  …  d_m (A)_{:,m} ] ,

  that is, the j-th column of A gets multiplied by d_j.


Multiplication with a scaling matrix D = diag(d_1, …, d_n) ∈ R^{n,n} in EIGEN can be realised in three
ways, see Code 1.3.1.11, Line 9–Line 11, Line 13, and Line 15.

Measured runtimes are shown in Fig. 14 (platform: Ubuntu Linux 14.04 LTS, Intel Core(TM) i7-3517U CPU
@ 1.90GHz × 4, 64-bit, gcc 4.8.4, -O3 -DNDEBUG).

The code will be slowed down massively in case a temporary dense matrix is created inadvertently.
Notice that EIGEN's expression templates avoid this pointless effort, see Line 15.

C++ code 1.3.1.11: Timing multiplication with scaling matrix in EIGEN ➺ GITLAB
2   int nruns = 3, minExp = 2, maxExp = 14;
3   MatrixXd tms(maxExp - minExp + 1, 4);
4   for (int i = 0; i <= maxExp - minExp; ++i) {
5     Timer tbad, tgood, topt;  // timer class
6     int n = std::pow(2, minExp + i);
7     VectorXd d = VectorXd::Random(n, 1), x = VectorXd::Random(n, 1), y(n);
8     for (int j = 0; j < nruns; ++j) {
9       MatrixXd D = d.asDiagonal();  //
10      // matrix vector multiplication
11      tbad.start(); y = D * x; tbad.stop();  //
12      // componentwise multiplication
13      tgood.start(); y = d.cwiseProduct(x); tgood.stop();  //
14      // matrix multiplication optimized by Eigen
15      topt.start(); y = d.asDiagonal() * x; topt.stop();  //
16    }
17    tms(i, 0) = n;
18    tms(i, 1) = tgood.min(); tms(i, 2) = tbad.min(); tms(i, 3) = topt.min();
19  }

Hardly surprising, the component-wise multiplication of the two vectors is way faster than the intermediate
initialisation of a diagonal matrix (mainly populated by zeros) and the computation of a matrix×vector
product. Nevertheless, such blunders keep on haunting numerical codes. Do not rely solely on EIGEN
optimizations! y

Remark 1.3.1.12 (Row and column transformations) Simple operations on rows/columns of matrices,
cf. what was done in Exp. 1.2.3.9, can often be expressed as multiplication with special matrices: For
instance, given A ∈ K n,m we obtain B by adding row (A) j,: to row (A) j+1,: , 1 ≤ j < n.
 
Realisation through a matrix product:  B = T A ,  where the matrix T ∈ K^{n,n} coincides with the
identity matrix except for one additional entry 1 in position (j + 1, j).


The matrix multiplying A from the left is a specimen of a transformation matrix, a matrix that coincides
with the identity matrix I except for a single off-diagonal entry.

left-multiplication with transformation matrices   ➙   row transformations
right-multiplication with transformation matrices  ➙   column transformations

Row/column transformations will play a central role in Section 2.3. y
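A small EIGEN sketch (written ad hoc for illustration; all names chosen freely) confirming that left-multiplication with such a transformation matrix adds row j to row j + 1:

  #include <Eigen/Dense>
  #include <iostream>

  int main() {
    const int n = 5, j = 2;  // add row j to row j+1 (0-based indices here)
    const Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
    // Transformation matrix: identity plus a single off-diagonal unit entry
    Eigen::MatrixXd T = Eigen::MatrixXd::Identity(n, n);
    T(j + 1, j) = 1.0;
    // Direct row transformation for comparison
    Eigen::MatrixXd B = A;
    B.row(j + 1) += B.row(j);
    std::cout << (T * A - B).norm() << std::endl;  // ~0: both give the same result
    return 0;
  }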

§1.3.1.13 (Block matrix product) Given matrix dimensions M, N, K ∈ N and block sizes 1 ≤ n < N
(n′ := N − n), 1 ≤ m < M (m′ := M − m), 1 ≤ k < K (k′ := K − k), we start from the following
matrices:

  A_11 ∈ K^{m,n} ,  A_12 ∈ K^{m,n′} ,  A_21 ∈ K^{m′,n} ,  A_22 ∈ K^{m′,n′} ,
  B_11 ∈ K^{n,k} ,  B_12 ∈ K^{n,k′} ,  B_21 ∈ K^{n′,k} ,  B_22 ∈ K^{n′,k′} .

These matrices serve as sub-matrices or matrix blocks and are assembled into larger matrices

  A = [ A_11  A_12 ; A_21  A_22 ] ∈ K^{M,N} ,   B = [ B_11  B_12 ; B_21  B_22 ] ∈ K^{N,K} .

It turns out that the matrix product AB can be computed by the same formula as the product of simple
2 × 2-matrices:

  [ A_11  A_12 ; A_21  A_22 ] [ B_11  B_12 ; B_21  B_22 ]
      = [ A_11B_11 + A_12B_21 ,  A_11B_12 + A_12B_22 ; A_21B_11 + A_22B_21 ,  A_21B_12 + A_22B_22 ] .   (1.3.1.14)

[Fig. 15: block partitioning of the factors (M × N and N × K) and of the product matrix (M × K).]
Bottom line: one can compute with block-structured matrices in almost (∗) the same way as with matrices
with real/complex entries, see [QSS00, Sect. 1.3.3].
(∗): you must not use the commutativity of multiplication (because matrix multiplication is not commutative).
y
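A quick numerical sanity check of (1.3.1.14) with EIGEN's corner/block views (a throwaway sketch; the block sizes are chosen arbitrarily):

  #include <Eigen/Dense>
  #include <iostream>

  int main() {
    const int M = 6, N = 5, K = 7, m = 2, n = 3, k = 4;  // block sizes 1 <= m < M, etc.
    const Eigen::MatrixXd A = Eigen::MatrixXd::Random(M, N);
    const Eigen::MatrixXd B = Eigen::MatrixXd::Random(N, K);
    const Eigen::MatrixXd A11 = A.topLeftCorner(m, n), A12 = A.topRightCorner(m, N - n),
                          A21 = A.bottomLeftCorner(M - m, n), A22 = A.bottomRightCorner(M - m, N - n);
    const Eigen::MatrixXd B11 = B.topLeftCorner(n, k), B12 = B.topRightCorner(n, K - k),
                          B21 = B.bottomLeftCorner(N - n, k), B22 = B.bottomRightCorner(N - n, K - k);
    Eigen::MatrixXd C(M, K);  // assemble AB block by block according to (1.3.1.14)
    C.topLeftCorner(m, k) = A11 * B11 + A12 * B21;
    C.topRightCorner(m, K - k) = A11 * B12 + A12 * B22;
    C.bottomLeftCorner(M - m, k) = A21 * B11 + A22 * B21;
    C.bottomRightCorner(M - m, K - k) = A21 * B12 + A22 * B22;
    std::cout << (C - A * B).norm() << std::endl;  // ~0
    return 0;
  }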

1.3.2 BLAS – Basic Linear Algebra Subprograms


BLAS (Basic Linear Algebra Subprograms) is a specification (API) that prescribes a set of low-level rou-
tines for performing common linear algebra operations such as vector addition, scalar multiplication, dot
products, linear combinations, and matrix multiplication. They are the de facto low-level routines for linear
algebra libraries (Wikipedia).

The BLAS API is standardised by the BLAS technical forum and, due to its history dating back to the 70s,
follows conventions of FORTRAN 77, see the Quick Reference Guide for examples. However, wrappers for

other programming languages are available. CPU manufacturers and/or developers of operating systems
usually supply highly optimised implementations:
• OpenBLAS: open source implementation with some general optimisations, available under BSD
license.
• ATLAS (Automatically Tuned Linear Algebra Software): open source BLAS implementation with
auto-tuning capabilities. Comes with C and FORTRAN interfaces and is included in Linux distribu-
tions.
• Intel MKL (Math Kernel Library): commercial, highly optimised BLAS implementation available for all
Intel CPUs. Used by most proprietary simulation software and also MATLAB.
EXPERIMENT 1.3.2.1 (Multiplying matrices in E IGEN)
The following E IGEN-based C++ code performs a multiplication of densely populated matrices in three
different ways:
1. Direct implementation of three nested loops
2. Realization by matrix×vector products
3. Use of built-in matrix multiplication of EIGEN

C++ code 1.3.2.2: Timing different implementations of matrix multiplication in EIGEN ➺ GITLAB

void mmtiming() {
  int nruns = 3, minExp = 2, maxExp = 10;
  MatrixXd timings(maxExp - minExp + 1, 5);
  for (int p = 0; p <= maxExp - minExp; ++p) {
    Timer t1, t2, t3, t4;  // timer class
    int n = std::pow(2, minExp + p);
    MatrixXd A = MatrixXd::Random(n, n);
    MatrixXd B = MatrixXd::Random(n, n);
    MatrixXd C = MatrixXd::Zero(n, n);
    for (int q = 0; q < nruns; ++q) {
      // Loop based implementation no template magic
      t1.start();
      for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
          for (int k = 0; k < n; ++k)
            C(i, j) += A(i, k) * B(k, j);
      t1.stop();
      // dot product based implementation little template magic
      t2.start();
      for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
          C(i, j) = A.row(i).dot(B.col(j));
      t2.stop();
      // matrix-vector based implementation middle template magic
      t3.start();
      for (int j = 0; j < n; ++j)
        C.col(j) = A * B.col(j);
      t3.stop();
      // Eigen matrix multiplication template magic optimized
      t4.start();
      C = A * B;
      t4.stop();
    }
    timings(p, 0) = n;
    timings(p, 1) = t1.min();
    timings(p, 2) = t2.min();
    timings(p, 3) = t3.min();
    timings(p, 4) = t4.min();
  }
  std::cout << std::scientific << std::setprecision(3) << timings << std::endl;
}

Timings: Different implementations of matrix multiplication

[Fig. 16: time [s] vs. matrix size n for the loop implementation, the dot-product implementation, the matrix-vector implementation, and the EIGEN matrix product.]

Platform:
✦ ubuntu 14.04 LTS
✦ i7-3517U CPU @ 1.90GHz
✦ L1 32 KB, L2 256 KB, L3 4096 KB, Mem 8 GB
✦ gcc 4.8.4, -O3

In EIGEN we can achieve some gain in execution speed by relying on compact matrix/vector operations
that invoke efficient EIGEN built-in functions. However, compiler optimizations make plain and simple
loops almost competitive.

y
BLAS routines are grouped into “levels” according to the amount of data and computation involved (asymp-
totic complexity, see Section 1.4.1 and [GV89, Sect. 1.1.12]):
• Level 1: vector operations such as scalar products and vector norms.
asymptotic complexity O(n), (with n =ˆ vector length),
e.g.: dot product: ρ = x⊤ y
• Level 2: vector-matrix operations such as matrix-vector multiplications.
asymptotic complexity O(mn),(with (m, n) = ˆ matrix size),
e.g.: matrix×vector multiplication: y = αAx + βy
• Level 3: matrix-matrix operations such as matrix additions or multiplications.
asymptotic complexity often O(nmk ),(with (n, m, k ) =
ˆ matrix sizes),
e.g.: matrix product: C = AB
Syntax of BLAS calls:
The functions have been implemented for different types, and are distinguished by the first letter of the
function name. E.g. sdot is the dot product implementation for single precision and ddot for double
precision.

✦ BLAS LEVEL 1: vector operations, asymptotic complexity O(n), n =̂ vector length
  • dot product ρ = x⊤y
    xDOT(N,X,INCX,Y,INCY)
    – x ∈ {S, D}, scalar type: S =̂ type float, D =̂ type double
    – N =̂ length of vector (modulo stride INCX)
    – X =̂ vector x: array of type x
    – INCX =̂ stride for traversing vector X
    – Y =̂ vector y: array of type x
    – INCY =̂ stride for traversing vector Y
  • vector operations y = αx + y
    xAXPY(N,ALPHA,X,INCX,Y,INCY)
    – x ∈ {S, D, C, Z}, S =̂ type float, D =̂ type double, C =̂ type complex
    – N =̂ length of vector (modulo stride INCX)
    – ALPHA =̂ scalar α
    – X =̂ vector x: array of type x
    – INCX =̂ stride for traversing vector X
    – Y =̂ vector y: array of type x
    – INCY =̂ stride for traversing vector Y
✦ BLAS LEVEL 2: matrix-vector operations, asymptotic complexity O(mn), (m, n) =̂ matrix size
  • matrix×vector multiplication y = αAx + βy
    xGEMV(TRANS,M,N,ALPHA,A,LDA,X,INCX,BETA,Y,INCY)
    – x ∈ {S, D, C, Z}, scalar type: S =̂ type float, D =̂ type double, C =̂ type complex
    – M, N =̂ size of matrix A
    – ALPHA =̂ scalar parameter α
    – A =̂ matrix A stored in linear array of length M · N (column major arrangement),
          (A)_{i,j} = A[N ∗ (j − 1) + i] .
    – LDA =̂ “leading dimension” of A ∈ K^{n,m}, that is, the number n of rows
    – X =̂ vector x: array of type x
    – INCX =̂ stride for traversing vector X
    – BETA =̂ scalar parameter β
    – Y =̂ vector y: array of type x
    – INCY =̂ stride for traversing vector Y
✦ BLAS LEVEL 3: matrix-matrix operations, asymptotic complexity O(mnk), (m, n, k) =̂ matrix sizes
  • matrix×matrix multiplication C = αAB + βC
    xGEMM(TRANSA,TRANSB,M,N,K,ALPHA,A,LDA,B,LDB,BETA,C,LDC)
    (☞ meaning of arguments as above)
Remark 1.3.2.3 (BLAS calling conventions) The BLAS calling syntax seems queer in light of modern
object oriented programming paradigms, but it is a legacy of FORTRAN77, which was (and partly still is)
the programming language, in which the BLAS routines were coded.

It is a very common situation in scientific computing that one has to rely on old codes and libraries imple-
mented in an old-fashioned style. y

EXAMPLE 1.3.2.4 (Calling BLAS routines from C/C++) When calling BLAS library functions from C,
all arguments have to be passed by reference (as pointers), in order to comply with the argument passing
mechanism of FORTRAN77, which is the model followed by BLAS.

C++-code 1.3.2.5: BLAS-based SAXPY operation in C++

#define daxpy_ cblas_daxpy
#include <iostream>
#include <vector>

// Definition of the required BLAS function. This is usually done
// in a header file like blas.h that is included in the EIGEN3
// distribution
extern "C" {
int daxpy_(const int *n, const double *da, const double *dx, const int *incx,
           double *dy, const int *incy);
}

using std::cout;
using std::endl;

int main() {
  cout << "Demo code for NumCSE course: call basic BLAS routines from C++"
       << endl;
  const int n = 5;           // length of vector
  const int incx = 1;        // stride
  const int incy = 1;        // stride
  const double alpha = 2.5;  // scaling factor

  // Allocated raw arrays of doubles
  std::vector<double> x(n);
  std::vector<double> y(n);

  for (size_t i = 0; i < n; i++) {
    x[i] = 3.1415 * static_cast<double>(i);
    y[i] = 1.0 / static_cast<double>(i + 1);
  }

  cout << "x=[";
  for (size_t i = 0; i < n; i++) {
    cout << x[i] << ' ';
  }
  cout << "]" << endl;
  cout << "y=[";
  for (size_t i = 0; i < n; i++) {
    cout << y[i] << ' ';
  }
  cout << "]" << endl;

  // Call the BLAS library function passing pointers to all arguments
  // (Necessary when calling FORTRAN routines from C)
  daxpy_(&n, &alpha, x.data(), &incx, y.data(), &incy);

  cout << "y = " << alpha << "*x+y = [";
  for (int i = 0; i < n; i++) {
    cout << y[i] << ' ';
  }
  cout << "]" << endl;
  return (0);
}

When using EIGEN in a mode that includes an external BLAS library, all these calls are wrapped into EIGEN
methods. y
EXAMPLE 1.3.2.6 (Using Intel Math Kernel Library (Intel MKL) from E IGEN) The
Intel Math Kernel Library is a highly optimized math library for Intel processors and can be called
directly from E IGEN, see E IGEN documentation on “Using Intel® Math Kernel Library from Eigen”.

C++-code 1.3.2.7: Timing of matrix multiplication in EIGEN for MKL comparison ➺ GITLAB

//! script for timing different implementations of matrix multiplications
void mmeigenmkl() {
  int nruns = 3, minExp = 6, maxExp = 13;
  MatrixXd timings(maxExp - minExp + 1, 2);
  for (int p = 0; p <= maxExp - minExp; ++p) {
    Timer t1;  // timer class
    int n = std::pow(2, minExp + p);
    MatrixXd A = MatrixXd::Random(n, n);
    MatrixXd B = MatrixXd::Random(n, n);
    MatrixXd C = MatrixXd::Zero(n, n);
    for (int q = 0; q < nruns; ++q) {
      t1.start();
      C = A * B;
      t1.stop();
    }
    timings(p, 0) = n;
    timings(p, 1) = t1.min();
  }
  std::cout << std::scientific << std::setprecision(3) << timings << std::endl;
}

Timing results:
n E IGEN sequential [s] E IGEN parallel [s] MKL sequential [s] MKL parallel [s]
64 1.318e-04 1.304e-04 6.442e-05 2.401e-05
128 7.168e-04 2.490e-04 4.386e-04 1.336e-04
256 6.641e-03 1.987e-03 3.000e-03 1.041e-03
512 2.609e-02 1.410e-02 1.356e-02 8.243e-03
1024 1.952e-01 1.069e-01 1.020e-01 5.728e-02
2048 1.531e+00 8.477e-01 8.581e-01 4.729e-01
4096 1.212e+01 6.635e+00 7.075e+00 3.827e+00
8192 9.801e+01 6.426e+01 5.731e+01 3.598e+01


[Fig. 17: execution time [s] vs. matrix size n; Fig. 18: execution time divided by n^3 vs. matrix size n — both for EIGEN sequential, EIGEN parallel, MKL sequential, and MKL parallel.]

Timing environment:
✦ ubuntu 14.04 LTS
✦ i7-3517U CPU @ 1.90GHz
✦ L1 32 KB, L2 256 KB, L3 4096 KB, Mem 8 GB
✦ gcc 4.8.4, -O3
y

1.4 Computational Effort

Video tutorial for Section 1.4 "Computational Effort": (29 minutes) Download link, tablet notes

→ review questions 1.4.3.11


Large scale numerical computations require immense resources and the execution time of numerical codes
often becomes a central concern. Therefore, much emphasis has to be put on
1. designing algorithms that produce a desired result with (nearly) minimal computational effort (defined precisely below),
2. exploiting possibilities for parallel and vectorised execution,
3. organising algorithms in order to make them fit memory hierarchies (“cache-aware”),
4. implementing codes that make optimal use of hardware resources and capabilities.
While Item 2–Item 4 are beyond the scope of this course and will be treated in more advanced lectures,
Item 1 will be a recurring theme.

The following definition encapsulates what is regarded as a measure for the “cost” of an algorithm in
computational mathematics.

Definition 1.4.0.1. Computational effort

The computational effort/computational cost required/incurred by a numerical code amounts to
the number of elementary operations (additions, subtractions, multiplications, divisions, square roots)
executed in a run.

§1.4.0.2 (What computational effort does not tell us) Fifty years ago counting elementary operations
provided good predictions of runtimes, but nowadays this is no longer true.


“Computational effort ≁ runtime”

The computational effort involved in a run of a numerical code is only loosely related
to overall execution time on modern computers.

This is conspicuous in Exp. 1.2.3.9, where algorithms incurring exactly the same computational effort took
different times to execute.

The reason is that on today’s computers a key bottleneck for fast execution is latency and bandwidth of
memory, cf. the discussion at the end of Exp. 1.2.3.9 and [KW03]. Thus, concepts like I/O-complexity
[AV88; GJ10] might be more appropriate for gauging the efficiency of a code, because they take into
account the pattern of memory access. y

1.4.1 (Asymptotic) Computational Complexity


The concept of computational effort from Def. 1.4.0.1 is still useful in a particular context:

Definition 1.4.1.1. (Asymptotic) complexity

The asymptotic (computational) complexity of an algorithm characterises the worst-case depen-


dence of its computational effort (→ Def. 1.4.0.1) on one or more problem size parameter(s) when
these tend to ∞.

• Problem size parameters in numerical linear algebra usually are the lengths and dimensions of the
vectors and matrices that an algorithm takes as inputs.
• Worst case indicates that the maximum effort over a set of admissible data is taken into account.

When dealing with asymptotic complexities a mathematical formalism comes handy:

Definition 1.4.1.2. Landau symbol [AG11, p. 7]

We write F (n) = O( G (n)) for two functions F, G : N → R, if there exists a constant C > 0 and
n∗ ∈ N such that

F (n) ≤ C G (n) ∀n ≥ n∗ .

More generally, F (n1 , . . . , nk ) = O( G (n1 , . . . , nk )) for two functions F, G : N k → R implies the


existence of a constant C > 0 and a threshold value n∗ ∈ N such that

F (n1 , . . . , nk ) ≤ CG (n1 , . . . , nk ) ∀n1 , . . . , nk ∈ N , nℓ ≥ n∗ , ℓ = 1, . . . , k .

Remark 1.4.1.3 (Meaningful “O-bounds” for complexity) Of course, the definition of the Landau symbol
leaves ample freedom for stating meaningless bounds; an algorithm that runs with linear complexity O(n)
can be correctly labelled as possessing O(exp(n)) complexity.
Yet, whenever the Landau notation is used to describe asymptotic complexities, the bounds have to be
sharp in the sense that no function with slower asymptotic growth will be possible inside the O. To make
this precise we stipulate the following.


Sharpness of a complexity bound

Whenever the asymptotic complexity of an algorithm is stated as O(n^α log^β(n) exp(γn^δ)) with non-
negative parameters α, β, γ, δ ≥ 0 in terms of the problem size parameter n, we take for granted
that choosing a smaller value for any of the parameters will no longer yield a valid (or provable)
asymptotic bound.

In particular
✦ complexity O(n) means that the complexity is not O(n^α) for any α < 1,
✦ complexity O(exp(n)) excludes asymptotic complexity O(n^p) for any p ∈ R.
Terminology: If the asymptotic complexity of an algorithm is O(n^p) with p = 1, 2, 3 we say that it is of
“linear”, “quadratic”, and “cubic” complexity, respectively.
y

Remark 1.4.1.5 (Relevance of asymptotic complexity) § 8.4.3.14 warned us that computational effort
and, thus, asymptotic complexity, of an algorithm for a concrete problem on a particular platform may
not have much to do with the actual runtime (the blame goes to memory hierarchies, internal pipelining,
vectorisation, etc.).

Then, why do we pay so much attention to asymptotic complexity in this course?

To a certain extent, the asymptotic complexity allows us to predict the dependence of the runtime of a
particular implementation of an algorithm on the problem size (for large problems).

For instance, an algorithm with asymptotic complexity O(n2 ) is likely to take 4× as much time when the
problem size is doubled. y

§1.4.1.6 (Concluding polynomial complexity from runtime measurements)


Available:    “Measured runtimes” t_i = t_i(n_i) for different values n_1, n_2, . . . , n_N, n_i ∈ N,
              of the problem size parameter

Conjectured:  power law dependence t_i ≈ C n_i^α (also “algebraic dependence”), α ∈ R

How can we glean evidence that supports or refutes our conjecture from the data? Look at the data in
doubly logarithmic scale!

  t_i = C n_i^α   ⇒   log(t_i) ≈ log C + α log(n_i) ,   i = 1, . . . , N .

If the conjecture holds true, then the points (n_i, t_i) will approximately lie on a straight
line with slope α in a doubly logarithmic plot (which can be created in PYTHON by the
matplotlib.pyplot.loglog plotting command).
➣ Offers a quick “visual test” of a conjectured asymptotic complexity
More rigorous: Perform linear regression on (log n_i, log t_i), i = 1, . . . , N (→ Chapter 3)
y
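As an illustration of the “more rigorous” approach, here is a small ad-hoc EIGEN sketch (not a lecture code, and the runtime data are made up for demonstration) that estimates the exponent α by a least-squares fit of a straight line to the points (log n_i, log t_i):

  #include <Eigen/Dense>
  #include <cmath>
  #include <iostream>

  int main() {
    // Hypothetical measured runtimes t_i for problem sizes n_i (fabricated data ~ C*n^2)
    Eigen::VectorXd n(5), t(5);
    n << 64, 128, 256, 512, 1024;
    t << 1.1e-4, 4.3e-4, 1.8e-3, 7.0e-3, 2.9e-2;
    // Linear regression: log(t) ≈ log(C) + alpha*log(n), solved in the least-squares sense
    Eigen::MatrixXd X(n.size(), 2);
    X.col(0).setOnes();
    X.col(1) = n.array().log().matrix();
    const Eigen::VectorXd coeffs =
        X.colPivHouseholderQr().solve(t.array().log().matrix());
    std::cout << "estimated exponent alpha = " << coeffs(1) << std::endl;  // ~2 here
    return 0;
  }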

1.4.2 Cost of Basic Linear-Algebra Operations


Performing elementary BLAS-type operations through simple (nested) loops, we arrive at the following
obvious complexity bounds:


operation          | description                              | #mul/div | #add/sub  | asymp. complexity
dot product        | (x ∈ R^n, y ∈ R^n) ↦ x^H y               | n        | n − 1     | O(n)
tensor product     | (x ∈ R^m, y ∈ R^n) ↦ x y^H               | nm       | 0         | O(mn)
matrix×vector      | (x ∈ R^n, A ∈ R^{m,n}) ↦ Ax              | mn       | (n − 1)m  | O(mn)
matrix product(∗)  | (A ∈ R^{m,n}, B ∈ R^{n,k}) ↦ AB          | mnk      | mk(n − 1) | O(mnk)

EXPERIMENT 1.4.2.1 (Runtimes of elementary linear algebra operations in EIGEN)

Measured runtimes, code eigenopstiming.cpp

[Fig. 19: runtime (microseconds) vs. problem size parameter n for the matrix-vector product and the matrix-matrix product, with O(n^2) and O(n^3) reference lines.]

• Timing code ➺ GITLAB, m = n
• Linux kernel 5.5.9-100.fc30.x86_64, Fedora 30
• gcc 9.3.1 -O2, Release mode
• Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz

The runtime data points approximately lie on straight lines in the doubly logarithmic plot, see § 1.4.1.6:
the asymptotic behavior of the measured runtimes matches the predictions of the above table for large n.

Remark 1.4.2.2 (“Fast” matrix multiplication)

(∗): The O(mnk) complexity bound applies to “straightforward” matrix multiplication according to (1.3.1.1).
For m = n = k there are (sophisticated) variants with better asymptotic complexity, e.g., the divide-and-
conquer Strassen algorithm [Str69] with asymptotic complexity O(n^{log2 7}):
Start from A, B ∈ K^{n,n} with n = 2ℓ, ℓ ∈ N. The idea relies on the block matrix product (1.3.1.14) with
A_ij, B_ij ∈ K^{ℓ,ℓ}, i, j ∈ {1, 2}. Let C := AB be partitioned accordingly: C = [ C_11  C_12 ; C_21  C_22 ].
Then tedious elementary computations reveal

  C_11 = Q_0 + Q_3 − Q_4 + Q_6 ,
  C_21 = Q_1 + Q_3 ,
  C_12 = Q_2 + Q_4 ,
  C_22 = Q_0 + Q_2 − Q_1 + Q_5 ,

where the Q_k ∈ K^{ℓ,ℓ}, k = 0, . . . , 6, are obtained from

  Q_0 = (A_11 + A_22) ∗ (B_11 + B_22) ,
  Q_1 = (A_21 + A_22) ∗ B_11 ,
  Q_2 = A_11 ∗ (B_12 − B_22) ,
  Q_3 = A_22 ∗ (−B_11 + B_21) ,
  Q_4 = (A_11 + A_12) ∗ B_22 ,
  Q_5 = (−A_11 + A_21) ∗ (B_11 + B_12) ,
  Q_6 = (A_12 − A_22) ∗ (B_21 + B_22) .

Beside a considerable number of matrix additions (computational effort O(n^2)) it takes only 7 multiplica-
tions of matrices of size n/2 to compute C! Strassen's algorithm boils down to the recursive application
of these formulas for n = 2^k, k ∈ N. The asymptotic complexity of O(n^{log2 7}) for n → ∞ then follows from
the master theorem on divide-and-conquer algorithms.


A refined algorithm of this type can achieve complexity O(n^{2.36}), see [CW90]. y
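For concreteness, a compact recursive sketch of Strassen's algorithm based on the formulas above (an illustrative implementation written for this remark, not one of the lecture codes; the fallback threshold 64 is an arbitrary choice). It assumes square matrices whose size is a power of 2:

  #include <Eigen/Dense>
  #include <cassert>

  Eigen::MatrixXd strassen(const Eigen::MatrixXd &A, const Eigen::MatrixXd &B) {
    const Eigen::Index n = A.rows();
    assert(n == A.cols() && n == B.rows() && n == B.cols());
    if (n <= 64) return A * B;     // recursion base: use the standard product
    const Eigen::Index l = n / 2;  // block size (n assumed to be a power of 2)
    const Eigen::MatrixXd A11 = A.topLeftCorner(l, l), A12 = A.topRightCorner(l, l),
                          A21 = A.bottomLeftCorner(l, l), A22 = A.bottomRightCorner(l, l);
    const Eigen::MatrixXd B11 = B.topLeftCorner(l, l), B12 = B.topRightCorner(l, l),
                          B21 = B.bottomLeftCorner(l, l), B22 = B.bottomRightCorner(l, l);
    // Only 7 recursive multiplications of half-size matrices
    const Eigen::MatrixXd Q0 = strassen(A11 + A22, B11 + B22);
    const Eigen::MatrixXd Q1 = strassen(A21 + A22, B11);
    const Eigen::MatrixXd Q2 = strassen(A11, B12 - B22);
    const Eigen::MatrixXd Q3 = strassen(A22, -B11 + B21);
    const Eigen::MatrixXd Q4 = strassen(A11 + A12, B22);
    const Eigen::MatrixXd Q5 = strassen(-A11 + A21, B11 + B12);
    const Eigen::MatrixXd Q6 = strassen(A12 - A22, B21 + B22);
    Eigen::MatrixXd C(n, n);  // assemble the blocks of C = AB
    C.topLeftCorner(l, l) = Q0 + Q3 - Q4 + Q6;
    C.topRightCorner(l, l) = Q2 + Q4;
    C.bottomLeftCorner(l, l) = Q1 + Q3;
    C.bottomRightCorner(l, l) = Q0 + Q2 - Q1 + Q5;
    return C;
  }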

1.4.3 Improving Complexity in Numerical Linear Algebra: Some Tricks


In computations involving matrices and vectors the complexity of algorithms can often be reduced by performing
the operations in a particular order:
EXAMPLE 1.4.3.1 (Efficient associative matrix multiplication) We consider the multiplication with a
rank-1-matrix. Matrices with rank 1 can always be obtained as the tensor product of two vectors, that is,
the matrix product of a column vector and a row vector. Given a ∈ K m , b ∈ K n , x ∈ K n we may compute
the vector y = ab⊤ x in two ways:

   
(1.4.3.2):  y = (ab^⊤) x ,   in EIGEN:  y = (a*b.transpose())*x;   ➤ complexity O(mn)
(1.4.3.3):  y = a (b^⊤x) ,   in EIGEN:  y = a*b.dot(x);            ➤ complexity O(n + m)  (“linear complexity”)

[Visualisation of (1.4.3.2): first the m × n matrix ab^⊤ is formed, then it is multiplied with the vector x.
 Visualisation of (1.4.3.3): first the scalar b^⊤x is computed, then the vector a is scaled with it.]

Timings for rank-1 matrix-vector multiplications

[Fig. 20: average runtime (s) vs. problem size n for the slow evaluation (1.4.3.2) and the efficient evaluation (1.4.3.3), with O(n) and O(n^2) reference lines; see § 1.4.1.6 for the rationale behind choosing a doubly logarithmic plot.]

Platform:
✦ ubuntu 14.04 LTS
✦ i7-3517U CPU @ 1.90GHz
✦ L1 32 KB, L2 256 KB, L3 4096 KB
✦ 8 GB main memory
✦ gcc 4.8.4, -O3


C++ code 1.4.3.4: EIGEN code for Ex. 1.4.3.1 ➺ GITLAB

//! This function compares the runtimes for the multiplication
//! of a vector with a rank-1 matrix ab⊤, a, b ∈ R^n
//! using different associative evaluations.
//! Runtime measurements consider minimal time for
//! several (nruns) runs
MatrixXd dottenstiming() {
  const int nruns = 3, minExp = 2, maxExp = 13;
  // Matrix for storing recorded runtimes
  MatrixXd timings(maxExp - minExp + 1, 3);
  for (int i = 0; i <= maxExp - minExp; ++i) {
    Timer tfool, tsmart;  // Timer objects
    const int n = std::pow(2, minExp + i);
    VectorXd a = VectorXd::LinSpaced(n, 1, n);
    VectorXd b = VectorXd::LinSpaced(n, 1, n).reverse();
    VectorXd x = VectorXd::Random(n, 1), y(n);
    for (int j = 0; j < nruns; ++j) {
      // Grossly wasteful evaluation
      tfool.start(); y = (a * b.transpose()) * x; tfool.stop();
      // Efficient implementation
      tsmart.start(); y = a * b.dot(x); tsmart.stop();
    }
    timings(i, 0) = n;
    timings(i, 1) = tsmart.min(); timings(i, 2) = tfool.min();
  }
  return timings;
}

Complexity can sometimes be reduced by reusing intermediate results.


EXAMPLE 1.4.3.5 (Hidden summation) The asymptotic complexity of the EIGEN code
Eigen::MatrixXd AB = A*B.transpose();
y = AB.triangularView<Eigen::Upper>()*x;

when supplied with two low-rank matrices A, B ∈ K^{n,p}, p ≪ n, in terms of n → ∞ obviously is O(n^2),
because an intermediate n × n-matrix AB^⊤ is built.

First, consider the case of a tensor product (= rank-1) matrix, that is, p = 1, A ↔ a = [a_1, . . . , a_n]^⊤ ∈ K^n,
B ↔ b = [b_1, . . . , b_n]^⊤ ∈ K^n. Then

  y = triu(ab^⊤)x ,   where  (triu(ab^⊤))_{i,j} = a_i b_j  for j ≥ i  and  = 0 otherwise ,

which can be factorised as

  y = diag(a_1, . . . , a_n) · T · diag(b_1, . . . , b_n) · x ,

with T ∈ K^{n,n} the upper triangular matrix all of whose entries on and above the diagonal are equal to 1.

The brackets indicate the order of the matrix×vector multiplications. Thus, the core problem is
the fast multiplication of a vector with an upper triangular matrix T described in E IGEN syntax by
Eigen::MatrixXd::Ones(n,n).triangularView<Eigen::Upper>(). Note that multipli-
cation of a vector x with T yields a vector of partial sums of components of x starting from last compo-
nent:
    
  T [ v_1 ; v_2 ; … ; v_n ] = [ s_n ; s_{n−1} ; … ; s_1 ] ,   s_j := ∑_{k=n−j+1}^{n} v_k .

This can be achieved by invoking the special C++ command std::partial_sum from the C++ stan-
dard library (documentation). We also observe that
  AB^⊤ = ∑_{ℓ=1}^{p} (A)_{:,ℓ} ((B)_{:,ℓ})^⊤ ,

so that the computations for the special case p = 1 discussed above can simply be reused p times!

C++ code 1.4.3.6: Efficient multiplication with the upper triangular part of a rank-p matrix in EIGEN ➺ GITLAB

//! Computation of y = triu(AB^T)x
//! Efficient implementation with backward cumulative sum
//! (partial_sum)
template <class Vec, class Mat>
void lrtrimulteff(const Mat &A, const Mat &B, const Vec &x, Vec &y) {
  const int n = A.rows();
  const int p = A.cols();
  assert(n == B.rows() && p == B.cols());  // size mismatch
  for (int l = 0; l < p; ++l) {
    Vec tmp = (B.col(l).array() * x.array()).matrix().reverse();
    std::partial_sum(tmp.begin(), tmp.end(), tmp.begin());
    y += (A.col(l).array() * tmp.reverse().array()).matrix();
  }
}

This code enjoys the obvious complexity of O( pn) for p, n → ∞, p < n. The code offers an example of a
function templated with its argument types, see § 0.3.2.1. The types Vec and Mat must fit the concept of
E IGEN vectors/matrices. y

The next concept from linear algebra is important in the context of computing with multi-dimensional arrays.


Definition 1.4.3.7. Kronecker product

The Kronecker product A ⊗ B of two matrices A ∈ K^{m,n} and B ∈ K^{l,k}, m, n, l, k ∈ N, is the
(ml) × (nk)-matrix

  A ⊗ B := [ (A)_{1,1}B  (A)_{1,2}B  …  (A)_{1,n}B
             (A)_{2,1}B  (A)_{2,2}B  …  (A)_{2,n}B
                 ⋮            ⋮               ⋮
             (A)_{m,1}B  (A)_{m,2}B  …  (A)_{m,n}B ]  ∈ K^{ml,nk} .
EXAMPLE 1.4.3.8 (Multiplication of Kronecker product with vector) The straightforward evaluation of
(A ⊗ B)x for two matrices A ∈ K^{m,n} and B ∈ K^{l,k} and a vector x ∈ K^{nk} suffers an asymptotic
complexity of O(m · n · l · k), determined by the size of the intermediate dense matrix A ⊗ B ∈ K^{ml,nk}.

Using the partitioning of the vector x into n equally long sub-vectors

  x = [ x_1 ; x_2 ; … ; x_n ] ,   x_j ∈ K^k ,

we find the representation

  (A ⊗ B)x = [ (A)_{1,1}Bx_1 + (A)_{1,2}Bx_2 + · · · + (A)_{1,n}Bx_n ;
               (A)_{2,1}Bx_1 + (A)_{2,2}Bx_2 + · · · + (A)_{2,n}Bx_n ;
                 ⋮ ;
               (A)_{m,1}Bx_1 + (A)_{m,2}Bx_2 + · · · + (A)_{m,n}Bx_n ] .

The idea is to form the products Bx j , j = 1, . . . , n, once, and then combine them linearly with coefficients
given by the entries in the rows of A:

C++ code 1.4.3.9: Efficient multiplication of Kronecker product with vector in EIGEN ➺ GITLAB
2   template <class Matrix, class Vector>
3   Vector kronmultv(const Matrix &A, const Matrix &B, const Vector &x) {
4     const size_t m = A.rows();
5     const size_t n = A.cols();
6     const size_t l = B.rows();
7     const size_t k = B.cols();
8     // 1st matrix mult. computes the products Bx_j
9     // 2nd matrix mult. combines them linearly with the coefficients of A
10    Matrix t = B * Matrix::Map(x.data(), k, n) * A.transpose();  //
11    return Matrix::Map(t.data(), m * l, 1);
12  }


The asymptotic complexity of this code is determined by the two matrix multiplications in Line 10. This
yields the asymptotic complexity O(lkn + mnl ) for l, k, m, n → ∞.

PYTHON-code 1.4.3.10: Efficient multiplication of Kronecker product with vector in PYTHON

import numpy as np

def kronmultv(A, B, x):
    n, k = A.shape[1], B.shape[1]
    assert x.size == n * k, 'size mismatch'
    xx = np.reshape(x, (n, k))
    Z = np.dot(xx, B.T)
    yy = np.dot(A, Z)
    return np.ravel(yy)

Note that different reshaping is used in the P YTHON code due to the default row major storage order. y
Review question(s) 1.4.3.11 (Computational effort)
(Q1.4.3.11.A) Explain why the classical concept of “computational effort” (= computational cost) is only
loosely related to the runtime of a concrete implementation of an algorithm.
(Q1.4.3.11.B) We are given two dense matrices A, B ∈ R^{n,p}, n, p ∈ N, p < n fixed. What is the asymptotic
complexity of each of the following two lines of code in terms of n, p → ∞?
Eigen::MatrixXd AB = A*B.transpose();
y = AB.triangularView<Eigen::Upper>()*x;

(Q1.4.3.11.C) Given a vector u ∈ R n , n ∈ N, we consider the matrix

A ∈ R n,n : (A)i,j = ui + u j + ui u j , i, j ∈ {1, . . . , n} .

Outline an efficient algorithm for computing the matrix-vector product Ax, x ∈ R^n. What is its asymptotic
complexity for n → ∞?
(Q1.4.3.11.D) [Matrix×vector multiplication involving Kronecker product, cf. Ex. 1.4.3.8] How do you
have to modify the implementation of the function kronmultv() from Ex. 1.4.3.8 to ensure that it is
still efficient even when called with long column or row vectors as arguments A and B, in particular in
the cases A ∈ R n,1 , B ∈ R k,1 or A ∈ R1,n , B ∈ R1,k , n, k ≫ 1?

C++ code 1.4.3.9: (Potentially not so) efficient multiplication of Kronecker product with vector in EIGEN ➺ GITLAB
2   template <class Matrix, class Vector>
3   Vector kronmultv(const Matrix &A, const Matrix &B, const Vector &x) {
4     const size_t m = A.rows();
5     const size_t n = A.cols();
6     const size_t l = B.rows();
7     const size_t k = B.cols();
8     // 1st matrix mult. computes the products Bx_j
9     // 2nd matrix mult. combines them linearly with the coefficients of A
10    Matrix t = B * Matrix::Map(x.data(), k, n) * A.transpose();  //
11    return Matrix::Map(t.data(), m * l, 1);
12  }

(Q1.4.3.11.E) [Multiplication with a “kernel matrix”] Ex. 1.4.3.5 gave us efficient functions

  Eigen::VectorXd mvtriutp(const Eigen::Vector &a,
                           const Eigen::Vector &b, const Eigen::Vector &x);
  Eigen::VectorXd mvtriltp(const Eigen::Vector &a,
                           const Eigen::Vector &b, const Eigen::Vector &x);

for evaluating

  y = triu(ab^⊤)x   and   y = tril(ab^⊤)x ,   a, b, x ∈ R^n ,

for large n ∈ N. Here, triu and tril extract the upper and lower triangular part of a matrix, respectively.
Explain how these functions can be used for the efficient computation of the matrix×vector product

  [ e^{α|k−j|} ]_{k,j=1}^{n} x ,   x ∈ R^n ,  α ∈ R .

(Q1.4.3.11.F) [Initialization of a “kernel matrix”] A call to the black-box function double f(unsigned
int i) might be expensive. Outline an efficient C++ code for the initialization of the matrix

  [ f(|k − j|) ]_{k,j=1}^{n} ∈ R^{n,n} .

1.5 Machine Arithmetic and Consequences

Video tutorial for Section 1.5 "Machine Arithmetic and Consequences": (16 minutes)
Download link, tablet notes

→ review questions 1.5.3.18

1.5.1 Experiment: Loss of Orthogonality


§1.5.1.1 (Gram-Schmidt orthogonalisation) From linear algebra [NS02, Sect. 4.4] or Ex. 0.3.5.29 we
recall the fundamental algorithm of Gram-Schmidt orthogonalisation of an ordered finite set {a1 , . . . , ak },
k ∈ N, of vectors aℓ ∈ K n :
Input: {a_1, . . . , a_k} ⊂ K^n
1:  q_1 := a_1 / ‖a_1‖_2                        % 1st output vector
2:  for j = 2, . . . , k do {                   % Orthogonal projection
3:    q_j := a_j
4:    for ℓ = 1, 2, . . . , j − 1 do            (GS)
5:      { q_j ← q_j − (a_j · q_ℓ) q_ℓ }
6:    if ( q_j = 0 ) then STOP
7:    else { q_j ← q_j / ‖q_j‖_2 }
8:  }
Output: {q_1, . . . , q_j}

In linear algebra we have learnt that, if it does not STOP prematurely, this algorithm will compute
orthonormal vectors q_1, . . . , q_k satisfying

  Span{q_1, . . . , q_ℓ} = Span{a_1, . . . , a_ℓ} ,   (1.5.1.2)

for all ℓ ∈ {1, . . . , k}. More precisely, if a_1, . . . , a_ℓ, ℓ ≤ k, are linearly independent, then the
Gram-Schmidt algorithm will not terminate before the ℓ + 1-th step.
✎ Notation: ‖·‖_2 =̂ Euclidean norm of a vector ∈ K^n
The following code implements the Gram-Schmidt orthonormalization of a set of vectors passed as the
columns of a matrix A ∈ R^{n,k}. The template parameter Matrix should match the concept of a matrix type
in EIGEN, like Eigen::MatrixXd.


C++ code 1.5.1.3: Gram-Schmidt orthogonalisation in EIGEN ➺ GITLAB
2   template <class Matrix> Matrix gramschmidt(const Matrix &A) {
3     Matrix Q = A;
4     // First vector just gets normalized, Line 1 of (GS)
5     Q.col(0).normalize();
6     for (unsigned int j = 1; j < A.cols(); ++j) {
7       // Replace inner loop over each previous vector in Q with fast
8       // matrix-vector multiplication (Lines 4, 5 of (GS))
9       Q.col(j) -= Q.leftCols(j) *
10                  (Q.leftCols(j).adjoint() * A.col(j));  //
11      // Normalize vector, if possible.
12      // Otherwise columns of A must have been linearly dependent
13      if (Q.col(j).norm() > 10e-9 * A.col(j).norm()) {  //
14        Q.col(j).normalize();  // Line 7 of (GS)
15      } else {
16        std::cerr << "Gram-Schmidt failed: A has lin. dep columns." << std::endl;
17        break;
18      }
19    }
20    return Q;
21  }

We will soon learn the rationale behind the odd test in Line 13.
In P YTHON the same algorithm can be implemented as follows:

PYTHON-code 1.5.1.4: Gram-Schmidt orthogonalisation in PYTHON

import numpy as np

def gramschmidt(A):
    _, k = A.shape
    Q = A[:, [0]] / np.linalg.norm(A[:, 0])
    for j in range(1, k):
        q = A[:, j] - np.dot(Q, np.dot(Q.T, A[:, j]))
        nq = np.linalg.norm(q)
        if nq < 1e-9 * np.linalg.norm(A[:, j]):
            break
        Q = np.column_stack([Q, q / nq])
    return Q

Note the different loop range due to the zero-based indexing in P YTHON.
y
EXPERIMENT 1.5.1.5 (Unstable Gram-Schmidt orthonormalization) If {a1 , . . . , ak } are linearly inde-
pendent we expect the output vectors q1 , . . . , qk to be orthonormal:

(qℓ )⊤ qm = δℓ,m , ℓ, m ∈ {1, . . . , k} . (1.5.1.6)

This property can easily be tested numerically, for instance by computing Q^⊤Q for a matrix
Q = [q_1, . . . , q_k] ∈ R^{n,k}.

C++ code 1.5.1.7: Wrong result from Gram-Schmidt orthogonalisation in EIGEN ➺ GITLAB
2   void gsroundoff(MatrixXd &A) {
3     // Gram-Schmidt orthogonalization of columns of A, see Code 1.5.1.3
4     MatrixXd Q = gramschmidt(A);
5     // Test orthonormality of columns of Q, which should be an
6     // orthogonal matrix according to theory
7     cout << setprecision(4) << fixed << "I = "
8          << endl << Q.transpose() * Q << endl;
9     // EIGEN's stable internal Gram-Schmidt orthogonalization by
10    // QR-decomposition, see Rem. 1.5.1.9 below
11    HouseholderQR<MatrixXd> qr(A.rows(), A.cols());  //
12    qr.compute(A); MatrixXd Q1 = qr.householderQ();  //
13    // Test orthonormality
14    cout << "I1 = " << endl << Q1.transpose() * Q1 << endl;
15    // Check orthonormality and span property (1.5.1.2)
16    const MatrixXd R1 = qr.matrixQR().triangularView<Upper>();
17    cout << scientific << "A-Q1*R1 = " << endl << A - Q1 * R1 << endl;
18  }

We test the orthonormality of the output vectors of Gram-Schmidt orthogonalization for a special matrix
A ∈ R^{10,10}, a so-called Hilbert matrix, defined by (A)_{i,j} = (i + j − 1)^{−1}. Then Code 1.5.1.7 produces
the following output:
I =
1.0000 0.0000 -0.0000 0.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000
0.0000 1.0000 -0.0000 0.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000
-0.0000 -0.0000 1.0000 0.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000
0.0000 0.0000 0.0000 1.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000
-0.0000 -0.0000 -0.0000 -0.0000 1.0000 0.0000 -0.0008 -0.0007 -0.0007 -0.0006
0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 -0.0540 -0.0430 -0.0360 -0.0289
-0.0000 -0.0000 -0.0000 -0.0000 -0.0008 -0.0540 1.0000 0.9999 0.9998 0.9996
-0.0000 -0.0000 -0.0000 -0.0000 -0.0007 -0.0430 0.9999 1.0000 1.0000 0.9999
-0.0000 -0.0000 -0.0000 -0.0000 -0.0007 -0.0360 0.9998 1.0000 1.0000 1.0000
-0.0000 -0.0000 -0.0000 -0.0000 -0.0006 -0.0289 0.9996 0.9999 1.0000 1.0000

Obviously, the vectors produced by the function gramschmidt fail to be orthonormal, contrary to the
predictions of rigorous results from linear algebra!

However, Line 11, Line 12 of Code 1.5.1.7 demonstrate another way to orthonormalize the columns of a
matrix using E IGEN’s built-in class template HouseholderQR (more details in Section 3.3.3).
I1 =
1.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 0.0000
-0.0000 1.0000 -0.0000 0.0000 0.0000 -0.0000 -0.0000 -0.0000 0.0000 -0.0000
0.0000 -0.0000 1.0000 -0.0000 -0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
-0.0000 0.0000 -0.0000 1.0000 0.0000 -0.0000 -0.0000 0.0000 0.0000 0.0000
-0.0000 0.0000 -0.0000 0.0000 1.0000 -0.0000 0.0000 -0.0000 -0.0000 0.0000
-0.0000 -0.0000 0.0000 -0.0000 -0.0000 1.0000 -0.0000 -0.0000 0.0000 -0.0000
-0.0000 -0.0000 0.0000 -0.0000 0.0000 -0.0000 1.0000 0.0000 0.0000 -0.0000
-0.0000 -0.0000 0.0000 0.0000 -0.0000 -0.0000 0.0000 1.0000 -0.0000 0.0000
-0.0000 0.0000 0.0000 0.0000 -0.0000 0.0000 0.0000 -0.0000 1.0000 -0.0000
0.0000 -0.0000 0.0000 0.0000 0.0000 -0.0000 -0.0000 0.0000 -0.0000 1.0000

Now we observe apparently perfect orthogonality (1.5.1.6) of the columns of the matrix Q1 in Code 1.5.1.7.
Obviously, there is another algorithm that reliably yields the theoretical output of Gram-Schmidt orthogo-
nalization. There is no denying that it is possible to compute Gram-Schmidt orthonormalization in a “clean”
way. y

“Computers cannot compute”

Computers cannot compute “properly” in R: numerical computations may not respect the laws of
analysis and linear algebra!

This introduces an important new aspect in the study of numerical algorithms.

Remark 1.5.1.9 (Stable orthonormalization by QR-decomposition) In Code 1.5.1.7 we saw the use of
the EIGEN class HouseholderQR<MatrixType> for the purpose of Gram-Schmidt orthogonalisation.
1. Computing with Matrices and Vectors, 1.5. Machine Arithmetic and Consequences 93
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024

The underlying theory and algorithms will be explained later in Section 3.3.3. There we will have the
following insight:
➣ Up to signs the columns of the matrix Q available from the QR-decomposition of A are the same
vectors as produced by the Gram-Schmidt orthogonalisation of the columns of A.

Code 1.5.1.7 demonstrates a case where a desired result can be obtained by two algebraically
equivalent computations, that is, they yield the same result in a mathematical sense. Yet, when
! implemented on a computer, the results can be vastly different. One algorithm may produce junk
(“unstable algorithm”), whereas the other lives up to the expectations (“stable algorithm”)

Supplement to Exp. 1.5.1.5: despite its ability to produce orthonormal vectors, we get as output for
D=A-Q1*R1 in Code 1.5.1.7:
D =
2.2204e-16 3.3307e-16 3.3307e-16 1.9429e-16 1.9429e-16 5.5511e-17 1.3878e-16 6.9389e-17 8.3267e-17 9.7145e-17
0.0000e+00 1.1102e-16 8.3267e-17 5.5511e-17 0.0000e+00 5.5511e-17 -2.7756e-17 0.0000e+00 0.0000e+00 4.1633e-17
-5.5511e-17 5.5511e-17 2.7756e-17 5.5511e-17 0.0000e+00 0.0000e+00 0.0000e+00 -1.3878e-17 1.3878e-17 1.3878e-17
0.0000e+00 5.5511e-17 2.7756e-17 2.7756e-17 0.0000e+00 1.3878e-17 -1.3878e-17 0.0000e+00 1.3878e-17 2.7756e-17
0.0000e+00 2.7756e-17 0.0000e+00 1.3878e-17 1.3878e-17 1.3878e-17 0.0000e+00 1.3878e-17 1.3878e-17 4.1633e-17
-2.7756e-17 2.7756e-17 1.3878e-17 4.1633e-17 2.7756e-17 1.3878e-17 0.0000e+00 -1.3878e-17 2.7756e-17 2.7756e-17
0.0000e+00 2.7756e-17 0.0000e+00 2.7756e-17 2.7756e-17 1.3878e-17 0.0000e+00 1.3878e-17 2.7756e-17 2.0817e-17
0.0000e+00 2.7756e-17 2.7756e-17 1.3878e-17 1.3878e-17 1.3878e-17 0.0000e+00 1.3878e-17 2.0817e-17 2.7756e-17
1.3878e-17 1.3878e-17 1.3878e-17 2.7756e-17 1.3878e-17 0.0000e+00 -1.3878e-17 6.9389e-18 -6.9389e-18 1.3878e-17
0.0000e+00 2.7756e-17 1.3878e-17 1.3878e-17 1.3878e-17 0.0000e+00 0.0000e+00 0.0000e+00 1.3878e-17 1.3878e-17

➥ The computed QR-decomposition apparently fails to meet the exact algebraic requirements stipulated
by Thm. 3.3.3.4. However, note the tiny size of the “defect”. y

1.5.2 Machine Numbers


§1.5.2.1 (The finite and discrete set of machine numbers) The reason why computers must fail to
execute exact computations with real numbers is clear:

  Computer = finite automaton   ➢   can handle only finitely many numbers, not R:
  the machine numbers, set M

Essential property: M is a finite, discrete subset of R (its numbers separated by gaps)

The set of machine numbers M cannot be closed under elementary arithmetic operations
+, −, ·, /, that is, when adding, multiplying, etc., two machine numbers the result may not belong
to M.

The results of elementary operations with operands in M have to be mapped back to M, an oper-
ation called rounding.

roundoff errors (ger.: Rundungsfehler) are inevitable

The impact of roundoff means that mathematical identities may not carry over to the computational realm.
As we have seen above in Exp. 1.5.1.5:

  Computers cannot compute “properly”!


  numerical computations  ≠  analysis / linear algebra

This introduces a new and important aspect in the study of numerical algorithms!

§1.5.2.2 (Internal representation of machine numbers) Now we give a brief sketch of the internal
structure of machine numbers ∈ M. The main insight will be that

“Computers use floating point numbers (scientific notation)”

EXAMPLE 1.5.2.3 (Decimal floating point numbers) Some 3-digit normalized decimal floating point
numbers:

  valid:    0.723 · 10^2 ,  0.100 · 10^{−20} ,  −0.801 · 10^5
  invalid:  0.033 · 10^2 ,  1.333 · 10^{−4} ,   −0.002 · 10^3

General form of an m-digit normalized decimal floating point number:

  x = ± 0.d_1 d_2 … d_m · 10^E ,   with m mantissa digits d_1, …, d_m (leading digit never = 0!)
  and an exponent E ∈ Z.

Of course, computers are restricted to a finite range of exponents:

Definition 1.5.2.4. Machine numbers/floating point numbers → [AG11, Sect. 2.2]

Given ☞ basis B ∈ N \ {1},


☞ exponent range {emin , . . . , emax }, emin , emax ∈ Z, emin < emax ,
☞ number m ∈ N of digits (for mantissa),
the corresponding set of machine numbers is

M := { d · B^E :  d = i · B^{-m} ,  i = B^{m-1}, …, B^m − 1 ,  E ∈ {e_min, …, e_max} }

machine number ∈ M:   x = ± 0.d_1 d_2 … d_m · B^E ,  leading mantissa digit ≠ 0 (never = 0!), m digits for the mantissa, a fixed number of digits for the exponent.

Remark 1.5.2.5 (Extremal numbers in M) Clearly, there is a largest element of M and two that are
closest to zero. These are mainly determined by the range for the exponent E, cf. Def. 1.5.2.4.
Largest machine number (in modulus):   x_max = max |M| = (1 − B^{-m}) · B^{e_max} ,
Smallest machine number (in modulus):  x_min = min |M| = B^{-1} · B^{e_min} .

In C++ these extremal machine numbers are accessible through the


functions std::numeric_limits<double>::max() and std::numeric_limits<double>::min(). Other properties of arithmetic types can be queried similarly via std::numeric_limits, provided by the header <limits>.
y

Remark 1.5.2.6 (Distribution of machine numbers) From Def. 1.5.2.4 it is clear that there are equi-
spaced sections of M and that the gaps between machine numbers are bigger for larger numbers, see
also [AG11, Fig. 2.3].
Between successive powers of B the machine numbers are equispaced: starting at B^{e_min−1} the spacing is B^{e_min−m}, on the next section it is B^{e_min−m+1}, then B^{e_min−m+2}, and so on. The gap between 0 and the smallest normalized machine number B^{e_min−1} is partly filled with non-normalized numbers.
Non-normalized numbers violate the lower bound for the mantissa i in Def. 1.5.2.4. y

§1.5.2.7 (IEEE standard 754 for machine numbers → [Ove01], [AG11, Sect. 2.4], → link) No sur-
prise: for modern computers B = 2 (binary system), the other parameters of the universally implemented
machine number system are
single precision : m = 24∗ ,E ∈ {−125, . . . , 128} ➣ 4 bytes
double precision : m = 53∗ ,E ∈ {−1021, . . . , 1024} ➣ 8 bytes
∗: including bit indicating sign

The standardisation of machine numbers is important, because it ensures that the same numerical algo-
rithm, executed on different computers will nevertheless produce the same result. y

Remark 1.5.2.8 (Special cases in IEEE standard)


The IEEE standard makes provisions for exceptions triggered by overflow or invalid floating point opera-
tions. The following C++ code snippet shows cases when these “flags” are raised.

const double x = exp(1000);
const double y = 3/x;
const double z = x*sin(M_PI);
const double w = x*log(1);
cout << x << endl << y << endl << z << endl << w << endl;

Output:
inf
0
inf
-nan

E = e_max, M ≠ 0  ≙  NaN = Not a Number → exception
E = e_max, M = 0  ≙  Inf = Infinity → overflow
E = 0             ≙  non-normalized numbers → underflow
E = 0, M = 0      ≙  number 0
In C++ these flags can be tested with the functions std::isnan() C++ reference and
std::isinf() C++ reference. y
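The following minimal snippet (a sketch for illustration, not one of the lecture codes) shows how such special values can be detected at run time:

#include <cmath>
#include <iostream>

int main() {
  const double x = std::exp(1000);    // overflow: x becomes inf
  const double w = x * std::log(1.0); // inf * 0 yields nan
  std::cout << std::isinf(x) << ' ' << std::isnan(w) << std::endl; // prints "1 1"
}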


§1.5.2.9 (Characteristic parameters of IEEE floating point numbers (double precision))


☞ C++ does not guarantee that its floating point types fulfill the requirements of the IEEE 754 standard; compliance of a type T can be checked with std::numeric_limits<T>::is_iec559.

C++ code 1.5.2.10: Querying characteristics of double numbers ➺ GITLAB


#include <iomanip>
#include <iostream>
#include <limits>

using std::cout;
using std::endl;
using std::numeric_limits;

int main() {
  cout << numeric_limits<double>::is_iec559 << endl
       << std::defaultfloat << numeric_limits<double>::min() << endl
       << std::hexfloat << numeric_limits<double>::min() << endl
       << std::defaultfloat << numeric_limits<double>::max() << endl
       << std::hexfloat << numeric_limits<double>::max() << endl;
}

Output:
true
2.22507e-308
0010000000000000
1.79769e+308
7fefffffffffffff
y

1.5.3 Roundoff Errors


EXPERIMENT 1.5.3.1 (Input errors and roundoff errors) The following computations would always
result in 0, if done in exact arithmetic.

C++ code 1.5.3.2: Demonstration of roundoff errors ➺ GITLAB


#include <iostream>
int main() {
  std::cout.precision(15);
  double a = 4.0 / 3.0;
  double b = a - 1;
  double c = 3 * b;
  double e = 1 - c;
  std::cout << e << std::endl;
  a = 1012.0 / 113.0; b = a - 9; c = 113 * b; e = 5 + c;
  std::cout << e << std::endl;
  a = 83810206.0 / 6789.0; b = a - 12345; c = 6789 * b; e = c - 1;
  std::cout << e << std::endl;
}

Output:
2.22044604925031e-16
6.75015598972095e-14
-1.60798663273454e-09


Can you devise a similar calculation, whose result is even farther off zero? Apparently the rounding that
inevitably accompanies arithmetic operations in M can lead to results that are far away from the true
result. y

For the discussion of errors introduced by rounding we need important notions.

Definition 1.5.3.3. Absolute and relative error → [AG11, Sect. 1.2]


Let x̃ ∈ K be an approximation of x ∈ K. Then its absolute error is given by

ε_abs := |x − x̃| ,

and its relative error is defined as

ε_rel := |x − x̃| / |x| .

Remark 1.5.3.4 (Relative error and number of correct digits) The number of correct (significant, valid) digits of an approximation x̃ of x ∈ K is defined through the relative error:

If ε_rel := |x − x̃| / |x| ≤ 10^{-ℓ}, then x̃ has ℓ correct digits, ℓ ∈ N_0.

To see this, write x and x̃ as base-10 floating point numbers

x = d_1.d_2 d_3 … d_m d_{m+1} … d_n · 10^E ,
x̃ = d_1.d_2 d_3 … d_m d̃_{m+1} … d̃_n · 10^E ,   d_i, d̃_i ∈ {0, …, 9} , d_1 ≠ 0 , d_{m+1} ≠ d̃_{m+1} .

This means that x̃ has m correct digits. We compute the relative error

ε = |x − x̃| / |x| = 0.0…0 δ_{m+1} … δ_n / (d_1.d_2 d_3 … d_n)  with m leading zeros and δ_{m+1} ≠ 0 .

Obviously we have ε ≈ 10^{-m}. y

§1.5.3.5 (Floating point operations) We may think of the elementary binary operations +, −, ∗, / in M
comprising two steps:
➊ Compute the exact result of the operation.
➋ Perform rounding of the result of ➊ to map it back to M.
Definition 1.5.3.6. Correct rounding

Correct rounding ("rounding up" in case of a tie) is given by the function

rd: R → M ,   x ↦ max argmin_{x̃ ∈ M} |x − x̃| .

(Recall that argmin_x F(x) is the set of arguments of a real-valued function F at which it attains its (global) minimum.)

Of course, ➊ above is not possible in a strict sense, but the effect of both steps can be realised and yields


a floating point realization of ⋆ ∈ {+, −, ·, /}.

✎ Notation: we write ⋆̃ for the floating point realization of ⋆ ∈ {+, −, ·, /}.

Then ➊ and ➋ may be summed up into

For ⋆ ∈ {+, −, ·, /}:   x ⋆̃ y := rd(x ⋆ y) .

Remark 1.5.3.7 (Breakdown of associativity) As a consequence of rounding, addition +̃ and multiplication ·̃ as implemented on computers fail to be associative. They will usually be commutative, though this is not guaranteed. y
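A minimal demonstration of this loss of associativity (a sketch; the particular numbers are chosen here only to make the effect visible in double precision):

#include <iostream>

int main() {
  const double a = 1.0, b = 1e-16, c = 1e-16;
  std::cout.precision(17);
  // a+b rounds back to 1, and so does the subsequent addition of c:
  std::cout << ((a + b) + c) << std::endl;  // 1
  // b+c = 2e-16 is large enough to survive the addition to 1:
  std::cout << (a + (b + c)) << std::endl;  // 1.0000000000000002
}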

§1.5.3.8 (Estimating roundoff errors → [AG11, p. 23]) Let us denote by EPS the largest relative error (→ Def. 1.5.3.3) incurred through rounding:

EPS := max_{x ∈ I} |rd(x) − x| / |x| ,   (1.5.3.9)

where I := [min |M|, max |M|] ⊂ R is the real interval covered by the positive machine numbers.

For machine numbers according to Def. 1.5.2.4, EPS can be computed from the defining parameters B (base) and m (length of mantissa) [AG11, p. 24]:

EPS = ½ B^{1−m} .   (1.5.3.10)

However, when studying roundoff errors, we do not want to delve into the intricacies of the internal repre-
sentation of machine numbers. This can be avoided by just using a single bound for the relative error due
to rounding, and, thus, also for the relative error potentially suffered in each elementary operation.

Assumption 1.5.3.11. “Axiom” of roundoff analysis

There is a small positive number EPS, the machine precision, such that for the elementary arithmetic
operations ⋆ ∈ {+, −, ·, /} and “hard-wired” functions∗ f ∈ {exp, sin, cos, log, . . .} holds

x ⋆̃ y = (x ⋆ y)(1 + δ) ,   f̃(x) = f(x)(1 + δ)   ∀ x, y ∈ M ,

with |δ| < EPS.

∗: this is an ideal, which may not be accomplished even by modern CPUs.

relative roundoff errors of elementary steps in a program bounded by machine precision !

EXAMPLE 1.5.3.12 (Machine precision for IEEE standard) In C++ the machine precision can be queried as follows:

C++ code 1.5.3.13: Finding out EPS in C++ ➺ GITLAB


#include <iostream>
#include <limits> // get various properties of arithmetic types

int main() {
  std::cout.precision(15);
  std::cout << std::numeric_limits<double>::epsilon() << std::endl;
}

Output:
2.22044604925031e-16

Knowing the machine precision can be important for checking the validity of computations or coding ter-
mination conditions for iterative approximations. y

EXPERIMENT 1.5.3.14 (Adding EPS to 1)

cout.precision(25);
const double eps = std::numeric_limits<double>::epsilon();
cout << std::fixed << 1.0 + 0.5*eps << endl
     << 1.0 - 0.5*eps << endl
     << (1.0 + 2/eps) - 2/eps << endl;

Output:
1.0000000000000000000000000
0.9999999999999998889776975
0.0000000000000000000000000

In fact, the following "definition" of EPS is sometimes used: EPS is the smallest positive number ∈ M for which 1 +̃ EPS ≠ 1 (in M).

We find that 1 +̃ EPS = 1 actually complies with the "axiom" of roundoff error analysis, Ass. 1.5.3.11:

1 = (1 + EPS)(1 + δ)        ⇒  |δ| = EPS/(1 + EPS) < EPS ,
2/EPS = (1 + 2/EPS)(1 + δ)  ⇒  |δ| = EPS/(2 + EPS) < EPS .
y

! Do we have to worry about these tiny roundoff errors?

YES (→ Exp. 1.5.1.5):  • accumulation of roundoff errors
                       • amplification of roundoff errors

Remark 1.5.3.15 (Testing equality with zero)


Since results of numerical computations are almost always polluted by roundoff errors:
Tests like if (x == 0) are pointless and even dangerous if x contains the result of a numerical computation.

! Remedy: test if (abs(x) < tol*s) ...,
  s   ≙ positive number, compared to which |x| should be small,
  tol ≙ "suitable" tolerance, often tol ≈ EPS.

We saw a first example of this practice in Code 1.5.1.3, Line 13.
To motivate this rule, think of a code where the number stored in x is a length. One may have chosen µm
as reference unit and then x=0.1. Another user prefers km as length unit, which means that x=1.0E-10


in this case. Just comparing x to a fixed threshold will lead to a different behavior of the code depending on
the choice of physical units, which is certainly not desirable. In general numerical codes should largely be
insensitive to the choice of physical units, a property called scaling invariance. From these considerations
we also conclude that a guideline for choosing the comparison variable s is that it should represent a
quantity with the same physical units. y
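A possible realization of this scale-aware test is sketched below (the helper name, the default tolerance, and the way the scale s is supplied are illustrative assumptions, not prescribed by the lecture):

#include <cmath>
#include <limits>

// true if |x| is negligible compared to the reference scale s (same physical units as x)
bool isNegligible(double x, double s,
                  double tol = std::numeric_limits<double>::epsilon()) {
  return std::abs(x) < tol * s;
}

For a length stored in x one would pass, e.g., the diameter of the computational domain, expressed in the same units, as the scale s.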

Remark 1.5.3.16 (Overflow and underflow) Since the set of machine numbers M is a finite set, the
result of an arithmetic operation can lie outside the range covered by it. In this case we have to deal with
overflow  ≙ |result of an elementary operation| > max{M}
          ➣ IEEE standard ⇒ Inf
underflow ≙ 0 < |result of an elementary operation| < min{|M \ {0}|}
          ➣ IEEE standard ⇒ use non-normalized numbers (!)
The Axiom of roundoff analysis Ass. 1.5.3.11 does not hold once non-normalized numbers are encoun-
tered:

C++ code 1.5.3.17: Demonstration of over-/underflow ➺ GITLAB


#include <cmath> // define _USE_MATH_DEFINES to access M_PI
#include <iostream>
#include <limits>

using std::cout;
using std::numeric_limits;
using std::endl;

int main() {
  cout.precision(15);
  const double min = numeric_limits<double>::min();
  const double res1 = M_PI * min / 123456789101112;
  const double res2 = res1 * 123456789101112 / min;
  cout << res1 << endl << res2 << endl;
}

Output:
5.68175492717434e-322
3.15248510554597

Try to avoid underflow and overflow

A simple example teaching how to avoid overflow during the computation of the norm of a 2D vector [AG11, Ex. 2.9]: the straightforward evaluation

r = sqrt(x² + y²)

overflows when |x| > sqrt(max |M|) or |y| > sqrt(max |M|), whereas the equivalent evaluation

r = |x| · sqrt(1 + (y/x)²) , if |x| ≥ |y| ,
r = |y| · sqrt(1 + (x/y)²) , if |y| > |x| ,

➢ causes no overflow!
y
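A direct transcription of this formula into C++ might look as follows (a sketch; note that the standard library function std::hypot serves the same purpose):

#include <cmath>

// overflow-safe Euclidean norm of the 2D vector (x,y)
double norm2d(double x, double y) {
  const double ax = std::abs(x), ay = std::abs(y);
  if (ax == 0.0 && ay == 0.0) return 0.0;
  if (ax >= ay) { const double q = ay / ax; return ax * std::sqrt(1.0 + q * q); }
  const double q = ax / ay;
  return ay * std::sqrt(1.0 + q * q);
}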
Review question(s) 1.5.3.18 (Machine arithmetic)
(Q1.5.3.18.A) What is the order of magnitude of the machine precision EPS for the double floating point
type?
(Q1.5.3.18.B) What is the “Axiom of roundoff analysis”?


(Q1.5.3.18.C) The set of two-digit decimal numbers is

D_2 := { 0.xy · 10^e : x, y ∈ {0, …, 9}, x ≠ 0, e ∈ Z } .


Give a sharp bound for the relative error of (correct) rounding in D2 .
(Q1.5.3.18.D) Given two variables of type Eigen::VectorXd, how can you safely check in a C++ code
that the vectors they describe are
• linearly dependent,
• orthogonal?
(Q1.5.3.18.E) [An obfuscated C++ function] What is the purpose of the following function? What is the idea behind Line 19?

C++ code 1.5.3.19: ➺ GITLAB


2   double fun(double x) {
3     bool neg = false;
4     if (x < 0) {
5       neg = true;
6       x = -x;
7     }
8     unsigned int f = 0;
9     while (x > 1) {
10      f++;
11      x /= 2.0;
12    }
13    double v = 1.0 + x;
14    double num = x;
15    double den = 1.0;
16    for (int i = 2; true; ++i) {
17      const double s = (num *= x) / (den *= i);
18      if (s == 0.0) {
19        break; //
20      }
21      v += s;
22    }
23    while (f-- > 0) {
24      v = v * v;
25    }
26    if (neg) {
27      v = 1.0 / v;
28    }
29    return v;
30  }

1.5.4 Cancellation

Video tutorial for Section 1.5.4 "Cancellation": (22 minutes) Download link, tablet notes

→ review questions 1.5.4.33


In general, predicting the impact of roundoff errors on the result of a multi-stage computation is very diffi-
cult, if possible at all. However, there is a constellation that is particularly prone to dangerous amplification
of roundoff errors and still can be detected easily.


EXAMPLE 1.5.4.1 (Computing the zeros of a quadratic polynomial) The following simple EIGEN code computes the real roots of a quadratic polynomial p(ξ) = ξ² + αξ + β by the discriminant formula

p(ξ_1) = p(ξ_2) = 0 ,   ξ_{1/2} = ½ (−α ± sqrt(D)) ,   if D := α² − 4β ≥ 0 .   (1.5.4.2)

C++ code 1.5.4.3: Discriminant formula for the real roots of p(ξ) = ξ² + αξ + β ➺ GITLAB

//! C++ function computing the zeros of a quadratic polynomial
//! ξ → ξ² + αξ + β by means of the familiar discriminant
//! formula ξ_{1,2} = ½(−α ± sqrt(α² − 4β)). However
//! this implementation is vulnerable to round-off! The zeros are
//! returned in a column vector
inline Vector2d zerosquadpol(double alpha, double beta) {
  Vector2d z;
  const double D = std::pow(alpha, 2) - 4 * beta; // discriminant
  if (D >= 0) {
    // The famous discriminant formula
    const double wD = std::sqrt(D);
    z << (-alpha - wD) / 2, (-alpha + wD) / 2; //
  } else {
    throw std::runtime_error("no real zeros");
  }
  return z;
}

This formula is applied to the quadratic polynomial p(ξ) = (ξ − γ)(ξ − 1/γ) after its coefficients α, β have been computed from γ, which will have introduced small relative roundoff errors (of size EPS).

C++ code 1.5.4.4: Testing the accuracy of computed roots of a quadratic polynomial ➺ GITLAB

//! Eigen function for testing the computation of the zeros of a parabola
void compzeros() {
  int n = 100;
  MatrixXd res(n, 4);
  VectorXd gamma = VectorXd::LinSpaced(n, 2, 992);
  for (int i = 0; i < n; ++i) {
    double alpha = -(gamma(i) + 1. / gamma(i));
    double beta = 1.;
    Vector2d z1 = zerosquadpol(alpha, beta);
    Vector2d z2 = zerosquadpolstab(alpha, beta);
    double ztrue = 1. / gamma(i), z2true = gamma(i);
    res(i, 0) = gamma(i);
    res(i, 1) = std::abs((z1(0) - ztrue) / ztrue);
    res(i, 2) = std::abs((z2(0) - ztrue) / ztrue);
    res(i, 3) = std::abs((z1(1) - z2true) / z2true);
  }
  // ... (output of res omitted in this excerpt)
}


[Fig. 21: "Roots of a parabola computed in an unstable manner" — relative errors of the small and the large root versus γ]

Plot of relative errors (→ Def. 1.5.3.3): We observe that roundoff incurred during the computation of α and β leads to "wrong" roots. For large γ the computed small root may be fairly inaccurate as regards its relative error, which can be several orders of magnitude larger than the machine precision EPS. The large root always enjoys a small relative error of about the size of EPS.

In order to understand why the small root is much more severely affected by roundoff, note that its com-
putation involves the subtraction of two large numbers, if γ is large. This is the typical situation, in which
cancellation occurs. y
§1.5.4.5 (Visualisation of cancellation effect) We look at the exact subtraction of two almost equal
positive numbers both of which have small relative errors (red boxes) with respect to some desired exact
value (indicated by blue boxes). The result of the subtraction will be small, but the errors may add up
during the subtraction, ultimately constituting a large fraction of the result.
Cancellation ≙ subtraction of almost equal numbers (➤ extreme amplification of relative errors).

[Fig. 22: the (absolute) errors of the two operands survive the subtraction and make up a large fraction of the small result.]

(The roundoff error introduced by the subtraction itself is negligible.)
y
EXAMPLE 1.5.4.6 (Cancellation in decimal system) We consider two positive numbers x, y
of about the same size afflicted with relative errors ≈ 10−7 . This means that their sev-
enth decimal digits are perturbed, here indicated by ∗. When we subtract the two numbers
the perturbed digits are shifted to the left, resulting in a possible relative error of ≈ 10−3 :

x     = 0.123467∗                        ← 7th digit perturbed
y     = 0.123456∗                        ← 7th digit perturbed
x − y = 0.000011∗ = 0.11∗000 · 10^{-4}   ← 3rd digit perturbed (trailing zeroes padded)

Again, this example demonstrates that cancellation wreaks havoc through error amplification, not through
the roundoff error due to the subtraction. y

EXAMPLE 1.5.4.7 (Cancellation when evaluating difference quotients → [DR08, Sect. 8.2.6], [AG11,
Ex. 1.3]) From analysis we know that the derivative of a differentiable function f : I ⊂ R → R at a point


x ∈ I is the limit of a difference quotient

f'(x) = lim_{h→0} (f(x + h) − f(x)) / h .

This suggests the following approximation of the derivative by a difference quotient with small but finite h > 0:

f'(x) ≈ (f(x + h) − f(x)) / h   for |h| ≪ 1 .
Results from analysis tell us that the approximation error should tend to zero for h → 0. More precise
quantitative information is provided by the Taylor formula for a twice continuously differentiable function
[AG11, p. 5]

f(x + h) = f(x) + f'(x) h + ½ f''(ξ) h²   for some ξ = ξ(x, h) ∈ [min{x, x + h}, max{x, x + h}] ,   (1.5.4.8)

from which we infer

(f(x + h) − f(x)) / h − f'(x) = ½ h f''(ξ)   for some ξ = ξ(x, h) ∈ [min{x, x + h}, max{x, x + h}] .   (1.5.4.9)

We investigate the approximation of the derivative by difference quotients for f = exp, x = 0, and different
values of h > 0:

C++ code 1.5.4.10: Difference quotient approximation of the derivative of exp ➺ GITLAB

// Difference quotient approximation
// of the derivative of exp
void diffq() {
  double h = 0.1;
  const double x = 0.0;
  for (int i = 1; i <= 16; ++i) {
    const double df = (exp(x + h) - exp(x)) / h;
    cout << setprecision(14) << fixed;
    cout << setw(5) << -i << setw(20) << abs(df - 1) << endl;
    h /= 10;
  }
}

Measured relative errors:

log10(h)   relative error
-1    0.05170918075648
-2    0.00501670841679
-3    0.00050016670838
-4    0.00005000166714
-5    0.00000500000696
-6    0.00000049996218
-7    0.00000004943368
-8    0.00000000607747
-9    0.00000008274037
-10   0.00000008274037
-11   0.00000008274037
-12   0.00008890058234
-13   0.00079927783736
-14   0.00079927783736
-15   0.11022302462516
-16   1.00000000000000

We observe an initial decrease of the relative approximation error followed by a steep increase when h drops below 10^{-8}.


That the observed errors are really due to round-off errors is confirmed by the numerical results reported in Fig. 23, which were obtained using a variable-precision floating point module of EIGEN, the MPFRC++ Support module (no longer available). The C++ code used to generate these results can be found in ➺ GITLAB.

[Fig. 23]

Obvious culprit for what we see in Fig. 23: cancellation when computing the numerator of the difference quotient for small |h| leads to a strong amplification of the inevitable errors introduced by the evaluation of the transcendental exponential function.

We witness the competition of two opposite effects: a smaller h results in a better approximation of the derivative by the difference quotient, but the impact of cancellation is the stronger the smaller |h|:

approximation error  f'(x) − (f(x + h) − f(x))/h  → 0   as h → 0 ,
impact of roundoff                                → ∞   as h → 0 .

In order to provide a rigorous underpinning for our conjecture, in this example we embark on our first
roundoff error analysis merely based on the “Axiom of roundoff analysis” Ass. 1.5.3.11: As in the compu-
tational example above we study the approximation of f ′ ( x ) = e x for f = exp, x ∈ R.

Correction factors (1 + δ_1) and (1 + δ_2), |δ_1|, |δ_2| ≤ eps, take the roundoff in the evaluations of exp into account (→ "axiom of roundoff analysis", Ass. 1.5.3.11):

df = ( e^{x+h}(1 + δ_1) − e^x(1 + δ_2) ) / h
   = e^x ( (e^h − 1)/h + (δ_1 e^h − δ_2)/h ) ,

where (e^h − 1)/h = 1 + O(h) and (δ_1 e^h − δ_2)/h = O(h^{-1}) for h → 0. Hence

|df| ≤ e^x ( (e^h − 1)/h + eps (1 + e^h)/h ) .

(Note that the estimate for the term (e^h − 1)/h is a particular case of (1.5.4.9).)

relative error:   |e^x − df| / e^x ≈ h + 2 eps / h   →  min   for   h = sqrt(2 eps) .



In double precision: sqrt(2 eps) = 2.107342425544702 · 10^{-8}. y

Remark 1.5.4.11 (Cancellation during the computation of relative errors) In the numerical experiment
of Ex. 1.5.4.7 we computed the relative error of the result by subtraction, see Code 1.5.4.10. Of course,
massive cancellation will occur! Do we have to worry?

In this case cancellation can be tolerated, because we are interested only in the magnitude of the relative
error. Even if it was affected itself by a large relative error, this information is still not compromised.

For example, if the relative error has the exact value 10−8 , but can be computed only with a huge relative
error of 10%, then the perturbed value would still be in the range [0.9 · 10−8 , 1.1 · 10−8 ]. Therefore it will
still have the correct magnitude and still permit us to conclude the number of valid digits correctly. y

Remark 1.5.4.12 (Cancellation in Gram-Schmidt orthogonalisation of Exp. 1.5.1.5) The Hilbert matrix A ∈ R^{10,10}, (A)_{i,j} = (i + j − 1)^{-1}, considered in Exp. 1.5.1.5 has columns that are almost linearly dependent.

Cancellation occurs when computing the orthogonal projection of a vector a onto the space spanned by a vector b (Fig. 24):

p = a − (a·b)/(b·b) · b .

If a, b point in almost the same direction, ‖p‖ ≪ ‖a‖, ‖b‖, so that a "tiny" vector p is obtained by subtracting two "long" vectors, which implies cancellation. This can happen in Line 10 of Code 1.5.1.3.
y

EXAMPLE 1.5.4.13 (Cancellation: roundoff error analysis) We consider a simple arithmetic expression written in two ways:

a² − b² = (a + b)(a − b) ,   a, b ∈ R .

We evaluate this term by means of two algebraically equivalent algorithms for the input data a = 1.3, b = 1.2 in 2-digit decimal arithmetic with standard rounding. ("Algebraically equivalent" means that the two algorithms produce the same results in the absence of roundoff errors.)

Algorithm A:                     Algorithm B:
x := a ·̃ a = 1.7 (rounded)       x := a +̃ b = 2.5 (exact)
y := b ·̃ b = 1.4 (rounded)       y := a −̃ b = 0.1 (exact)
x −̃ y = 0.30 (exact)             x ·̃ y = 0.25 (exact)
Algorithm B produces the exact result, whereas Algorithm A fails to do so. Is this pure coincidence or an
indication of the superiority of algorithm B? This question can be answered by roundoff error analysis. We
demonstrate the approach for the two algorithms A & B and general input a, b ∈ R.
Roundoff error analysis heavily relies on Ass. 1.5.3.11 and on dropping terms of "higher order" in the machine precision, that is terms that behave like O(EPS^q), q > 1. It involves introducing the relative roundoff error for every elementary operation through a factor (1 + δ), |δ| ≤ EPS.
Algorithm A:
x = a²(1 + δ_1) ,   y = b²(1 + δ_2) ,


f̃ = (a²(1 + δ_1) − b²(1 + δ_2))(1 + δ_3) = f + a²δ_1 − b²δ_2 + (a² − b²)δ_3 + O(EPS²) ,

|f̃ − f| / |f| ≤ EPS (a² + b² + |a² − b²|) / |a² − b²| + O(EPS²) = EPS (1 + (a² + b²)/|a² − b²|) + O(EPS²) ,   (1.5.4.14)

where the O(EPS²) terms will be neglected.

For a ≈ b the relative error of the result of Algorithm A will be much larger than the machine precision EPS. This reflects cancellation in the last subtraction step.

Algorithm B:

x = (a + b)(1 + δ_1) ,   y = (a − b)(1 + δ_2) ,
f̃ = (a + b)(a − b)(1 + δ_1)(1 + δ_2)(1 + δ_3) = f + (a² − b²)(δ_1 + δ_2 + δ_3) + O(EPS²) ,

|f̃ − f| / |f| ≤ |δ_1 + δ_2 + δ_3| + O(EPS²) ≤ 3 EPS + O(EPS²) .   (1.5.4.15)

The relative error of the result of Algorithm B is always ≈ EPS!

In this example we see a general guideline at work:

If inevitable, subtractions prone to cancellation should be done as early as possible.

The reason is that input data and initial intermediate results are usually not as much tainted by roundoff errors as numbers computed after many steps. y
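The superiority of Algorithm B can also be observed in double precision, e.g. with the following small experiment (a sketch; the numbers are picked so that a ≈ b relative to the size of their squares):

#include <iostream>

int main() {
  const double a = 1.0e8 + 1.0, b = 1.0e8; // exact value of a*a - b*b is 200000001
  std::cout.precision(17);
  std::cout << (a * a - b * b) << std::endl;     // Algorithm A: 200000000, last digit lost
  std::cout << ((a + b) * (a - b)) << std::endl; // Algorithm B: 200000001, exact
}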

§1.5.4.16 (Avoiding disastrous cancellation) The following examples demonstrate a few fundamental
techniques for steering clear of cancellation by using alternative formulas that yield the same value (in
exact arithmetic), but do not entail subtracting two numbers of almost equal size.

EXAMPLE 1.5.4.17 (Stable discriminant formula → Ex. 1.5.4.1, [AG11, Ex. 2.10]) If ξ_1 and ξ_2 are the two roots of the quadratic polynomial p(ξ) = ξ² + αξ + β, then ξ_1 · ξ_2 = β (Vieta's formula). Thus, once we have computed a root, we can obtain the other by simple division.

Idea:
➊ Depending on the sign of α compute “stable root” without cancellation.
➋ Compute other root from Vieta’s formula (avoiding subtraction)

C++ code 1.5.4.18: Stable computation of real roots of a quadratic polynomial ➺ GITLAB

2   //! C++ function computing the zeros of a quadratic polynomial
3   //! ξ → ξ² + αξ + β by means of the familiar discriminant
4   //! formula ξ_{1,2} = ½(−α ± sqrt(α² − 4β)).
5   //! This is a stable implementation based on Vieta's theorem.
6   //! The zeros are returned in a column vector
7   Eigen::VectorXd zerosquadpolstab(double alpha, double beta) {
8     Eigen::Vector2d z(2);
9     const double D = std::pow(alpha, 2) - 4 * beta; // discriminant
10    if (D >= 0) {
11      const double wD = std::sqrt(D);
12      // Use discriminant formula only for zero far away from 0
13      // in order to avoid cancellation. For the other zero
14      // use Vieta's formula.
15      if (alpha >= 0) {
16        const double t = 0.5 * (-alpha - wD); //
17        z << t, beta / t;
18      } else {
19        const double t = 0.5 * (-alpha + wD); //
20        z << beta / t, t;
21      }
22    }
23    else {
24      throw std::runtime_error("no real zeros");
25    }
26    return z;
27  }

➥ Invariably, we add numbers with the same sign in Line 16 and Line 19.
[Fig. 25: "Roundoff in the computation of zeros of a parabola" — relative error in ξ_1 versus γ for the unstable and the stable implementation]

Numerical experiment based on the driver code Code 1.5.4.4.

Observation: The new code can also compute the small root of the polynomial p(ξ) = (ξ − γ)(ξ − 1/γ) (expanded in monomials) with a relative error ≈ EPS.
y

EXAMPLE 1.5.4.19 (Exploiting trigonometric identities to avoid cancellation) The task is to evaluate
the integral
∫_0^x sin t dt = 1 − cos x  (expression I)  = 2 sin²(x/2)  (expression II)   for 0 < x ≪ 1 ,   (1.5.4.20)

and this can be done by the two different formulas I and II.


[Fig. 26: "Unstable computation of 1-cos(x)" — relative error of 1-cos(x) versus x]

Relative error of expression I (1-cos(x)) with respect to the equivalent expression II (2*sin(x/2)^2): Expression I is affected by cancellation for |x| ≪ 1, since then cos x ≈ 1, whereas expression II can be evaluated with a relative error ≈ EPS for all x.
y
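The effect can be reproduced with a few lines of C++ (a sketch for illustration; expression II serves as the reference value here):

#include <cmath>
#include <iostream>

int main() {
  for (double x = 1e-1; x > 1e-7; x /= 10.0) {
    const double I = 1.0 - std::cos(x);                            // prone to cancellation
    const double II = 2.0 * std::sin(x / 2.0) * std::sin(x / 2.0); // stable
    std::cout << x << " " << std::abs(I - II) / II << std::endl;   // relative deviation grows as x -> 0
  }
}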

Analytic manipulations offer ample opportunity to rewrite expressions in equivalent form immune to
cancellation.

EXAMPLE 1.5.4.21 (Switching to equivalent formulas to avoid cancellation)

Now we see an example of a computation allegedly dating back to


Archimedes, who tried to approximate the area of a circle by the areas
of inscribed regular polygons.

Approximation of a circle by a regular n-gon, n ∈ N ✄

Fig. 27

We focus on the unit circle. With α_n := 2π/n the area of the inscribed n-gon is (see Fig. 28)

A_n = n cos(α_n/2) sin(α_n/2) = (n/2) sin α_n = (n/2) sin(2π/n) .

Recursion formula for A_n derived from Fig. 28:

sin(α_n/2) = sqrt( (1 − cos α_n)/2 ) = sqrt( (1 − sqrt(1 − sin² α_n))/2 ) ,

Initial approximation: A_6 = (3/2) sqrt(3) .

C++ code 1.5.4.22: Tentative computation of circumference of regular polygon ➺ GITLAB

2   //! Approximation of Pi by approximating the circumference of a
3   //! regular polygon
4   MatrixXd ApproxPIinstable(double tol = 1e-8, unsigned int maxIt = 50) {
5     double s = sqrt(3) / 2.;
6     double An = 3. * s; // initialization (hexagon case)
7     unsigned int n = 6;
8     unsigned int it = 0;
9     MatrixXd res(maxIt, 4); // matrix for storing results
10    res(it, 0) = n; res(it, 1) = An;
11    res(it, 2) = An - M_PI; res(it, 3) = s;
12    while (it < maxIt && s > tol) { // terminate when s is 'small enough'
13      s = sqrt((1. - sqrt(1. - s * s)) / 2.); // recursion for area
14      n *= 2; An = n / 2. * s; // new estimate for circumference
15      ++it;
16      res(it, 0) = n; res(it, 1) = An; // store results and (absolute) error
17      res(it, 2) = An - M_PI; res(it, 3) = s;
18    }
19    return res.topRows(it);
20  }

The approximation deteriorates after applying the recursion formula many times:
n An An − π sin αn
6 2.598076211353316 -0.543516442236477 0.866025403784439
12 3.000000000000000 -0.141592653589794 0.500000000000000
24 3.105828541230250 -0.035764112359543 0.258819045102521
48 3.132628613281237 -0.008964040308556 0.130526192220052
96 3.139350203046872 -0.002242450542921 0.065403129230143
192 3.141031950890530 -0.000560702699263 0.032719082821776
384 3.141452472285344 -0.000140181304449 0.016361731626486
768 3.141557607911622 -0.000035045678171 0.008181139603937
1536 3.141583892148936 -0.000008761440857 0.004090604026236
3072 3.141590463236762 -0.000002190353031 0.002045306291170
6144 3.141592106043048 -0.000000547546745 0.001022653680353
12288 3.141592516588155 -0.000000137001638 0.000511326906997
24576 3.141592618640789 -0.000000034949004 0.000255663461803
49152 3.141592645321216 -0.000000008268577 0.000127831731987
98304 3.141592645321216 -0.000000008268577 0.000063915865994
196608 3.141592645321216 -0.000000008268577 0.000031957932997
393216 3.141592645321216 -0.000000008268577 0.000015978966498
786432 3.141593669849427 0.000001016259634 0.000007989485855
1572864 3.141592303811738 -0.000000349778055 0.000003994741190
3145728 3.141608696224804 0.000016042635011 0.000001997381017
6291456 3.141586839655041 -0.000005813934752 0.000000998683561
12582912 3.141674265021758 0.000081611431964 0.000000499355676
25165824 3.141674265021758 0.000081611431964 0.000000249677838
50331648 3.143072740170040 0.001480086580246 0.000000124894489
100663296 3.159806164941135 0.018213511351342 0.000000062779708
201326592 3.181980515339464 0.040387861749671 0.000000031610136
402653184 3.354101966249685 0.212509312659892 0.000000016660005
805306368 4.242640687119286 1.101048033529493 0.000000010536712
1610612736 6.000000000000000 2.858407346410207 0.000000007450581

Where does cancellation occur in Line 13 of Code 1.5.4.22? Since s ≪ 1, computing 1 − s² will not trigger cancellation. However, the subtraction 1 − sqrt(1 − s²) will, because sqrt(1 − s²) ≈ 1 for s ≪ 1:


sin(α_n/2) = sqrt( (1 − cos α_n)/2 ) = sqrt( (1 − sqrt(1 − sin² α_n))/2 ) ,   and for α_n ≪ 1:  sqrt(1 − sin² α_n) ≈ 1  ➙ cancellation here!

We arrive at an equivalent formula not vulnerable to cancellation, essentially using the identity (a + b)(a − b) = a² − b² in order to eliminate the difference of square roots in the numerator:

sin(α_n/2) = sqrt( (1 − sqrt(1 − sin² α_n))/2 · (1 + sqrt(1 − sin² α_n))/(1 + sqrt(1 − sin² α_n)) )
           = sqrt( (1 − (1 − sin² α_n)) / (2 (1 + sqrt(1 − sin² α_n))) )
           = sin α_n / sqrt( 2 (1 + sqrt(1 − sin² α_n)) ) .

C++ code 1.5.4.23: Stable recursion for area of regular n-gon ➺ GITLAB

//! Approximation of Pi by approximating the circumference of a
//! regular polygon
MatrixXd apprpistable(double tol = 1e-8, unsigned int maxIt = 50) {
  double s = sqrt(3) / 2.; double An = 3. * s; // initialization (hexagon case)
  unsigned int n = 6;
  unsigned int it = 0;
  MatrixXd res(maxIt, 4); // matrix for storing results
  res(it, 0) = n; res(it, 1) = An;
  res(it, 2) = An - M_PI; res(it, 3) = s;
  while (it < maxIt && s > tol) { // terminate when s is 'small enough'
    s = s / sqrt(2 * (1 + sqrt((1 + s) * (1 - s)))); // stable recursion without cancellation
    n *= 2; An = n / 2. * s; // new estimate for circumference
    ++it;
    res(it, 0) = n; res(it, 1) = An; // store results and (absolute) error
    res(it, 2) = An - M_PI; res(it, 3) = s;
  }
  return res.topRows(it);
}

Using the stable recursion, we observe better approximation for polygons with more corners:


n An An − π sin αn
6 2.598076211353316 -0.543516442236477 0.866025403784439
12 3.000000000000000 -0.141592653589793 0.500000000000000
24 3.105828541230249 -0.035764112359544 0.258819045102521
48 3.132628613281238 -0.008964040308555 0.130526192220052
96 3.139350203046867 -0.002242450542926 0.065403129230143
192 3.141031950890509 -0.000560702699284 0.032719082821776
384 3.141452472285462 -0.000140181304332 0.016361731626487
768 3.141557607911857 -0.000035045677936 0.008181139603937
1536 3.141583892148318 -0.000008761441475 0.004090604026235
3072 3.141590463228050 -0.000002190361744 0.002045306291164
6144 3.141592105999271 -0.000000547590522 0.001022653680338
12288 3.141592516692156 -0.000000136897637 0.000511326907014
24576 3.141592619365383 -0.000000034224410 0.000255663461862
49152 3.141592645033690 -0.000000008556103 0.000127831731976
98304 3.141592651450766 -0.000000002139027 0.000063915866118
196608 3.141592653055036 -0.000000000534757 0.000031957933076
393216 3.141592653456104 -0.000000000133690 0.000015978966540
786432 3.141592653556371 -0.000000000033422 0.000007989483270
1572864 3.141592653581438 -0.000000000008355 0.000003994741635
3145728 3.141592653587705 -0.000000000002089 0.000001997370818
6291456 3.141592653589271 -0.000000000000522 0.000000998685409
12582912 3.141592653589663 -0.000000000000130 0.000000499342704
25165824 3.141592653589761 -0.000000000000032 0.000000249671352
50331648 3.141592653589786 -0.000000000000008 0.000000124835676
100663296 3.141592653589791 -0.000000000000002 0.000000062417838
201326592 3.141592653589794 0.000000000000000 0.000000031208919
402653184 3.141592653589794 0.000000000000001 0.000000015604460
805306368 3.141592653589794 0.000000000000001 0.000000007802230
1610612736 3.141592653589794 0.000000000000001 0.000000003901115
[Fig. 29: "Recursion for the area of a regular n-gon" — approximation error versus n for the unstable and the stable recursion]

Plot of errors for the approximations of π as computed by the two algebraically equivalent recursion formulas.

Observation, cf. Ex. 1.5.4.7: Amplified roundoff errors due to cancellation supersede the approximation error for n ≥ 10^5. Roundoff errors remain merely of magnitude EPS in the case of the stable recursion.
y

EXAMPLE 1.5.4.24 (Summation of exponential series)


In principle, the function value exp(x) can be approximated up to any accuracy by summing sufficiently many terms of the globally convergent exponential series

exp(x) = Σ_{k=0}^∞ x^k / k! = 1 + x + x²/2 + x³/6 + x⁴/24 + … .

C++ code 1.5.4.25: Summation of exponential series ➺ GITLAB

double expeval(double x, double tol = 1e-8) {
  // Initialization
  double y = 1.0;
  double term = 1.0;
  uint64_t k = 1;
  // Termination criterion
  while (abs(term) > tol * y) {
    term *= x / static_cast<double>(k); // next summand
    y += term; // summation
    ++k;
  }
  return y;
}

Results for tol = 10^{-8}; e͠xp(x) designates the approximate value for exp(x) returned by the function from Code 1.5.4.25. The rightmost column lists the relative errors, which tell us the number of valid digits in the approximate result.

x      approximation e͠xp(x)      exp(x)      |exp(x) − e͠xp(x)| / exp(x)
-20 6.1475618242e-09 2.0611536224e-09 1.982583033727893
-18 1.5983720359e-08 1.5229979745e-08 0.049490585500089
-16 1.1247503300e-07 1.1253517472e-07 0.000534425951530
-14 8.3154417874e-07 8.3152871910e-07 0.000018591829627
-12 6.1442105142e-06 6.1442123533e-06 0.000000299321453
-10 4.5399929604e-05 4.5399929762e-05 0.000000003501044
-8 3.3546262812e-04 3.3546262790e-04 0.000000000662004
-6 2.4787521758e-03 2.4787521767e-03 0.000000000332519
-4 1.8315638879e-02 1.8315638889e-02 0.000000000530724
-2 1.3533528320e-01 1.3533528324e-01 0.000000000273603
0 1.0000000000e+00 1.0000000000e+00 0.000000000000000
2 7.3890560954e+00 7.3890560989e+00 0.000000000479969
4 5.4598149928e+01 5.4598150033e+01 0.000000001923058
6 4.0342879295e+02 4.0342879349e+02 0.000000001344248
8 2.9809579808e+03 2.9809579870e+03 0.000000002102584
10 2.2026465748e+04 2.2026465795e+04 0.000000002143799
12 1.6275479114e+05 1.6275479142e+05 0.000000001723845
14 1.2026042798e+06 1.2026042842e+06 0.000000003634135
16 8.8861105010e+06 8.8861105205e+06 0.000000002197990
18 6.5659968911e+07 6.5659969137e+07 0.000000003450972
20 4.8516519307e+08 4.8516519541e+08 0.000000004828737


[Fig. 30: "Terms in exponential sum for x = -20" — value of the k-th summand versus the index k of the summand]

Observation: Large relative approximation errors for x ≪ 0. For x ≪ 0 we have exp(x) ≪ 1, but this value is computed by summing large numbers of opposite sign; Fig. 30 shows the terms summed up for x = −20.

Remedy: Cancellation can be avoided by using the identity

exp(x) = 1 / exp(−x) ,   if x < 0 .
y
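A hedged sketch of how this remedy could be wrapped around the series summation of Code 1.5.4.25 (the wrapper name is an illustrative choice):

// cancellation-free evaluation of exp(x) based on expeval() from Code 1.5.4.25
double expevalstable(double x, double tol = 1e-8) {
  if (x < 0) {
    return 1.0 / expeval(-x, tol); // series for -x > 0 has only positive terms
  }
  return expeval(x, tol);
}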

EXAMPLE 1.5.4.26 (Trade cancellation for approximation) In a computer code we have to provide a
routine for the evaluation of the “hidden difference quotient”
I(a) := ∫_0^1 e^{at} dt = (exp(a) − 1) / a   for any a > 0 ,   (1.5.4.27)
cf. the discussion of cancellation in the context of numerical differentiation in Ex. 1.5.4.7. There we
observed massive cancellation.

Trick. Recall the Taylor expansion formula in one dimension for a function that is m + 1 times continuously differentiable in a neighborhood of x_0 [Str09, Satz 5.5.1]:

f(x_0 + h) = Σ_{k=0}^{m} (1/k!) f^{(k)}(x_0) h^k + R_m(x_0, h) ,   R_m(x_0, h) = (1/(m+1)!) f^{(m+1)}(ξ) h^{m+1} ,   (1.5.4.28)

for some ξ ∈ [min{x_0, x_0 + h}, max{x_0, x_0 + h}], and for all sufficiently small |h|. Here R_m(x_0, h) is called the remainder term and f^{(k)} denotes the k-th derivative of f.

Cancellation in (1.5.4.27) can be avoided by replacing exp(a), a > 0, with a suitable Taylor expansion of a ↦ e^a around a = 0 and then dividing by a:

(exp(a) − 1) / a = Σ_{k=0}^{m} a^k / (k+1)! + R_m(a) ,   R_m(a) = (1/(m+1)!) exp(ξ) a^m   for some 0 ≤ ξ ≤ a .

Then use as an approximation the point value of the Taylor polynomial

I(a) ≈ Ĩ_m(a) := Σ_{k=0}^{m} a^k / (k+1)! .

For a similar discussion see [AG11, Ex. 2.12].


Issue: A finite Taylor sum usually offers only an approximation and we incur an approximation error. This
begs the question how to choose the number m of terms to be retained in the Taylor expansion. We have
to pick m large enough such that the relative approximation error remains below a prescribed threshold


tol. To estimate the relative approximation error, we use the expression for the remainder together with
the simple estimate (exp( a) − 1)/a > 1 for all a > 0:
rel. err. = |I(a) − Ĩ_m(a)| / |I(a)| = | (e^a − 1)/a − Σ_{k=0}^{m} a^k/(k+1)! | / ((e^a − 1)/a)
          ≤ (1/(m+1)!) exp(ξ) a^m ≤ (1/(m+1)!) exp(a) a^m .

For a = 10^{-3} we get the error bounds

m:      1           2           3           4           5
bound:  1.0010e-03  5.0050e-07  1.6683e-10  4.1708e-14  8.3417e-18

Hence, keeping m = 3 terms is enough for achieving about 10 valid digits.
[Fig. 31: relative error versus argument a for the unstable formula (exp(a)-1.0)/a and for the Taylor-stabilized evaluation]

Relative error of the unstable formula (exp(a)-1.0)/a and relative error when using a Taylor expansion approximation for small a:

if (abs(a) < 1E-3)
  v = 1.0 + (1.0/2 + 1.0/6*a)*a;
else
  v = (exp(a)-1.0)/a;
end

The error is computed by comparison with the PYTHON library function numpy.expm1() that provides a stable implementation of exp(x) − 1.
y
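In C++ the cancellation can also be sidestepped with the standard library function std::expm1(), which evaluates exp(a) − 1 accurately for small arguments; a minimal sketch (the function name is an illustrative choice):

#include <cmath>

// stable evaluation of I(a) = (exp(a)-1)/a for a > 0
double I_eval(double a) { return std::expm1(a) / a; }

The Taylor-polynomial branch shown above achieves comparable accuracy for |a| below the chosen threshold without relying on expm1.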
EXAMPLE 1.5.4.29 (Complex step differentiation [LM67]) This is a technique from complex analysis
that can be applied to real-valued analytic functions. Let f : I → R, I ⊂ R an interval, be analytic in a
neighborhood of x0 ∈ I , which means that it can be written as a convergent power series there, see also
Def. 6.2.2.48:

f(x) = Σ_{j=0}^{∞} a_j (x − x_0)^j   ∀x: |x − x_0| < ρ and some ρ > 0, a_j ∈ R .   (1.5.4.30)

Note that f is infinitely many times differentiable in a neighborhood of x0 and that its derivatives satisfy
f (n) ( x0 ) = n!an ∈ R, n ∈ N0 .
A power series like in (6.2.2.36) also makes sense for x ∈ C, | x − x0 | < ρ! Thus, for 0 < h < ρ we can
approximate f in a neighborhood of x0 by means of a complex Taylor polynomial

f(x_0 + ıh) = f(x_0) + f'(x_0) ıh − ½ f''(x_0) h² + O(h³)   for h ∈ R → 0 .   (1.5.4.31)

Trick. Take the imaginary part on both sides of (1.5.4.31) using that all derivatives are real:

Im f(x_0 + ıh) = h f'(x_0) + O(h³)   for h ∈ R → 0 .

As a consequence we obtain the approximation

f'(x_0) = Im f(x_0 + ıh) / h + O(h²)   for h ∈ R → 0 ,


which suggests that we may rely on the cancellation-free expression

f'(x_0) ≈ Im f(x_0 + ıh) / h

for h ≈ EPS to compute the derivative of f in x_0. y
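A minimal sketch of the complex-step trick in C++ (the sample function f = exp and the step size are illustrative choices):

#include <complex>
#include <iostream>

int main() {
  auto f = [](std::complex<double> z) { return std::exp(z); };
  const double x0 = 0.0, h = 1e-100; // h may be chosen far below sqrt(EPS)
  const double df = std::imag(f(std::complex<double>(x0, h))) / h; // approximates f'(x0)
  std::cout.precision(16);
  std::cout << df << std::endl; // prints 1 up to machine precision
}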

Remark 1.5.4.32 (A broader view of cancellation) Cancellation can be viewed as a particular case of a
situation, in which severe amplification of relative errors is possible. Consider a function

F : R n → R of class C2 , twice continuously differentiable,



and an argument vector x := [x_1, …, x_n] ≠ 0, for which

F(x_1, …, x_n) ≠ 0 ,   F(x_1, …, x_n) ≈ 0 ,   grad F(x_1, …, x_n) ≠ 0 .

We supply arguments x̃_i ∈ R with small relative errors ε_i, x̃_i = x_i(1 + ε_i), i = 1, …, n, and study the resulting relative error δ of the result

δ := |F(x̃_1, …, x̃_n) − F(x_1, …, x_n)| / |F(x_1, …, x_n)| .

Thanks to the smoothness of F, we can employ multi-dimensional Taylor approximation

F(x_1(1 + ε_1), …, x_n(1 + ε_n)) = F(x_1, …, x_n) + grad F(x_1, …, x_n)^T [ε_1 x_1, …, ε_n x_n]^T + R(x, ε) ,
with remainder R(x, ε) = O(ε_1² + ⋯ + ε_n²) for ε_i → 0 .

This yields

δ = | grad F(x_1, …, x_n)^T [ε_1 x_1, …, ε_n x_n]^T + R(x, ε) | / |F(x_1, …, x_n)| .

If |ε_i| ≪ 1, we can neglect the remainder term and obtain the first-order approximation (indicated by ≐)

δ ≐ (1/|F(x_1, …, x_n)|) · | grad F(x_1, …, x_n)^T [ε_1 x_1, …, ε_n x_n]^T | .

In case |(grad F(x_1, …, x_n))_i| ≫ |F(x)|, |x_i| ≫ 0 for some i ∈ {1, …, n}, we can thus encounter δ ≫ max_j |ε_j|, which indicates a potentially massive amplification of relative errors.
• “Classical cancellation” as discussed in § 1.5.4.5 fits this setting and corresponds to the special
choice F : R2 → R, F ( x1 , x2 ) := x1 − x2 .
• The effect found above can be observed for the simple trigonometric functions sin and cos!
y

Review question(s) 1.5.4.33.


(Q1.5.4.33.A) Give an expression for


I(a, b) := ∫_a^b 1/x dx

that allows cancellation-free evaluation for integration bounds 1 ≪ a ≈ b.


(Q1.5.4.33.B) For integration bounds a, b > 1, a ≈ b, propose a numerically sound way of computing

I(a, b) := ∫_a^b 1/(1 + x²) dx .

Hints.
d/dx { x ↦ arctan(x) } = 1/(1 + x²) ,
tan(α − β) = (tan α − tan β) / (1 + tan α tan β) .

(Q1.5.4.33.C) What is the problem with the C++ expression


y = s t d ::log( s t d ::cosh(x));

where x is of type double? Rewrite this line of code into an algebraically equivalent one so that the problem no longer occurs.
(Q1.5.4.33.D) [Harmless cancellation] Discuss the impact of round-off errors and cancellation for the
C++ expression
f = x + s t d ::sqrt(1-x*x);

where x is of type double and in the interval [−1, 1].


(Q1.5.4.33.E) [Error amplification without subtraction] For what values of 0 < x < 2 may the following
expression yield a result with a huge relative error, even if we assume that the variable x contains an
“exact value”?
f = s t d ::cos(M_PI/2.0*x)/ s t d ::log(x);

Suggest a way to rewrite the expression that avoids this instability.


(Q1.5.4.33.F) [Error amplification by multiplication] We multiply two quantities x̃ ∈ R and ỹ ∈ R, which carry relative errors bounded by machine precision, that is

x̃ = x(1 + δ_x) ,   ỹ = y(1 + δ_y) ,   |δ_x|, |δ_y| ≤ EPS ,

where x, y ∈ R designate the "exact values". Give a bound for the relative error of the product x̃ · ỹ and discuss whether it can be much larger than EPS for particular values of x and y.

1.5.5 Numerical Stability

Video tutorial for Section 1.5.5 "Numerical Stability": (17 minutes) Download link, tablet notes

→ review questions 1.5.5.23


We have seen that a particular “problem” can be tackled by different “algorithms”, which produce different
results due to roundoff errors. This section will clarify what distinguishes a “good” algorithm from a rather
abstract point of view.

§1.5.5.1 (The “problem”)

A mathematical notion of "problem" (illustrated in Fig. 32):
✦ data space X, usually X ⊂ R^n
✦ result space Y, usually Y ⊂ R^m
✦ mapping (problem function) F: X ↦ Y

A problem is a well-defined function that assigns to each datum a result.

Note: In this course, both the data space X and the result space Y will always be subsets of finite dimen-
sional vector spaces.

EXAMPLE 1.5.5.2 (The "matrix×vector-multiplication problem") We consider the "problem" of computing the product Ax for a given matrix A ∈ K^{m,n} and a given vector x ∈ K^n.
➣ • Data space X = K^{m,n} × K^n (input is a matrix and a vector)
• Result space Y = K^m (space of column vectors)
• Problem function F: X → Y, F(A, x) := Ax
y

§1.5.5.3 (Norms on spaces of vectors and matrices) Norms provide tools for measuring errors. Recall
from linear algebra and calculus [NS02, Sect. 4.3], [Gut09, Sect. 6.1]:

Definition 1.5.5.4. Norm

X = vector space over the field K, K = C, R. A map ‖·‖: X ↦ R_0^+ is a norm on X, if it satisfies
(i) ∀x ∈ X: x ≠ 0 ⇔ ‖x‖ > 0 (definite),
(ii) ‖λx‖ = |λ| ‖x‖ ∀x ∈ X, λ ∈ K (homogeneous),
(iii) ‖x + y‖ ≤ ‖x‖ + ‖y‖ ∀x, y ∈ X (triangle inequality).

Examples (for vector space K^n, vector x = (x_1, x_2, …, x_n)^T ∈ K^n):

Euclidean norm:     ‖x‖_2 := sqrt(|x_1|² + ⋯ + |x_n|²)   — EIGEN: x.norm()
1-norm:             ‖x‖_1 := |x_1| + ⋯ + |x_n|           — EIGEN: x.lpNorm<1>()
∞-norm (max norm):  ‖x‖_∞ := max{|x_1|, …, |x_n|}        — EIGEN: x.lpNorm<Eigen::Infinity>()

Remark 1.5.5.5 (Inequalities between vector norms) All norms on the vector space K^n, n ∈ N, are equivalent in the sense that for any two norms ‖·‖_1 and ‖·‖_2 we can always find a constant C > 0 such that

‖v‖_1 ≤ C ‖v‖_2   ∀v ∈ K^n .   (1.5.5.6)


Of course, the constant C will usually depend on n and the norms under consideration.

For the vector norms introduced above, explicit expressions for the constants "C" are available: for all x ∈ K^n

‖x‖_2 ≤ ‖x‖_1 ≤ sqrt(n) ‖x‖_2 ,   (1.5.5.7)
‖x‖_∞ ≤ ‖x‖_2 ≤ sqrt(n) ‖x‖_∞ ,   (1.5.5.8)
‖x‖_∞ ≤ ‖x‖_1 ≤ n ‖x‖_∞ .         (1.5.5.9)

The matrix space K m,n is a vector space, of course, and can also be equipped with various norms. Of
particular importance are norms induced by vector norms on K n and K m .

Definition 1.5.5.10. Matrix norm

Given vector norms ‖·‖_x and ‖·‖_y on K^n and K^m, respectively, the associated matrix norm is defined by

M ∈ R^{m,n}:   ‖M‖ := sup_{x ∈ R^n \ {0}} ‖Mx‖_y / ‖x‖_x .

By virtue of this definition the matrix norms enjoy an important property, they are sub-multiplicative:

∀A ∈ K^{n,m}, B ∈ K^{m,k}:   ‖AB‖ ≤ ‖A‖ ‖B‖ .   (1.5.5.11)

✎ Notations for matrix norms of square matrices associated with the standard vector norms:

‖x‖_2 → ‖M‖_2 ,   ‖x‖_1 → ‖M‖_1 ,   ‖x‖_∞ → ‖M‖_∞

EXAMPLE 1.5.5.12 (Matrix norm associated with ∞-norm and 1-norm) Rather simple formulas are available for the matrix norms induced by the vector norms ‖·‖_∞ and ‖·‖_1, e.g. for M = [m_11 m_12; m_21 m_22] ∈ K^{2,2}:

‖Mx‖_∞ = max{|m_11 x_1 + m_12 x_2|, |m_21 x_1 + m_22 x_2|} ≤ max{|m_11| + |m_12|, |m_21| + |m_22|} ‖x‖_∞ ,
‖Mx‖_1 = |m_11 x_1 + m_12 x_2| + |m_21 x_1 + m_22 x_2| ≤ max{|m_11| + |m_21|, |m_12| + |m_22|} (|x_1| + |x_2|) .

For general M = (m_ij) ∈ K^{m,n}:

➢ matrix norm ↔ ‖·‖_∞ = row sum norm:      ‖M‖_∞ := max_{i=1,…,m} Σ_{j=1}^{n} |m_ij| ,   (1.5.5.13)
➢ matrix norm ↔ ‖·‖_1 = column sum norm:   ‖M‖_1 := max_{j=1,…,n} Σ_{i=1}^{m} |m_ij| .   (1.5.5.14)
Sometimes special formulas for the Euclidean matrix norm come handy [GV89, Sect. 2.3.3]:


Lemma 1.5.5.15. Formula for Euclidean norm of a Hermitian matrix

A ∈ K^{n,n}, A = A^H   ⇒   ‖A‖_2 = max_{x ≠ 0} |x^H A x| / ‖x‖_2² .

Proof. Recall from linear algebra: Hermitian matrices (a special class of normal matrices) enjoy unitary similarity to diagonal matrices:

∃ U ∈ K^{n,n}, diagonal D ∈ R^{n,n}:   U^{-1} = U^H and A = U^H D U .

Since multiplication with a unitary matrix preserves the 2-norm of a vector, we conclude

‖A‖_2 = ‖U^H D U‖_2 = ‖D‖_2 = max_{i=1,…,n} |d_i| ,   D = diag(d_1, …, d_n) .

On the other hand, for the same reason:

max_{‖x‖_2 = 1} |x^H A x| = max_{‖x‖_2 = 1} |(Ux)^H D (Ux)| = max_{‖y‖_2 = 1} |y^H D y| = max_{i=1,…,n} |d_i| .

Hence, both expressions in the statement of the lemma agree with the largest modulus of the eigenvalues of A.

Corollary 1.5.5.16. Euclidean matrix norm and eigenvalues

For A ∈ K m,n the Euclidean matrix norm kAk2 is the square root of the largest (in modulus)
eigenvalue of A H A.

For a normal matrix A ∈ K^{n,n} (that is, A satisfies A^H A = A A^H) the Euclidean matrix norm agrees with the largest modulus of its eigenvalues.

§1.5.5.17 ((Numerical) algorithm) When we talk about an “algorithm” we have in mind a concrete code
function in M ATLAB or C++; the only way to describe an algorithm is through a piece of code. We assume
that this function defines another mapping F̃: X → Y on the data space of the problem. Of course, we can only feed data to the MATLAB/C++ function if they can be represented in the set M of machine numbers. Hence, implicit in the definition of F̃ is the assumption that input data are subject to rounding before being passed to the code function proper.

Problem:   F: X ⊂ R^n → Y ⊂ R^m        Algorithm:   F̃: X → Y, evaluated with machine numbers M
y

§1.5.5.18 (Stable algorithm → [AG11, Sect. 1.3])


✦ We study a problem (→ § 1.5.5.1) F : X → Y on data space X into result space Y .
✦ We assume that both X and Y are equipped with norms k·k X and k·kY , respectively (→
Def. 1.5.5.4).
✦ We consider a concrete algorithm Fe : X → Y according to § 1.5.5.17.


We write w(x), x ∈ X , for the computational effort (→ Def. 1.4.0.1, “number of elementary operations”)
required by the algorithm for input x.

Definition 1.5.5.19. Stable algorithm

An algorithm F̃ for solving a problem F: X ↦ Y is numerically stable if for all x ∈ X its result F̃(x) (possibly affected by roundoff) is the exact result for "slightly perturbed" data:

∃ C ≈ 1:  ∀x ∈ X:  ∃ x̃ ∈ X:   ‖x − x̃‖_X ≤ C w(x) EPS ‖x‖_X   ∧   F̃(x) = F(x̃) .

Here EPS should be read as machine precision according to the “Axiom” of roundoff analysis Ass. 1.5.3.11.

[Fig. 33: illustration of Def. 1.5.5.19 — y ≙ exact result F(x) for exact data x ∈ (X, ‖·‖_X), while F̃(x) ∈ (Y, ‖·‖_Y) equals F(x̃) for a slightly perturbed x̃]

Terminology: Def. 1.5.5.19 introduces stability in the sense of backward error analysis.

Sloppily speaking, the impact of roundoff (∗) on a stable algorithm is of the same order of magnitude
as the effect of the inevitable perturbations due to rounding of the input data.

➣ For stable algorithms roundoff errors are “harmless”.

(∗) In some cases the definition of F̃ will also involve some approximations, as in Ex. 1.5.4.26. Then the above statement also includes approximation errors. y

EXAMPLE 1.5.5.20 (Testing stability of matrix×vector multiplication) Assume you are given a black-box implementation of a function

VectorXd mvmult( const MatrixXd &A, const VectorXd &x)

that purports to provide a stable implementation of Ax for A ∈ K^{m,n}, x ∈ K^n, cf. Ex. 1.5.5.2. How can we verify this claim for particular data? Both K^{m,n} and K^n are equipped with the Euclidean norm.

The task is, given y ∈ K^m as returned by the function, to find conditions on y that ensure the existence of
a matrix Ã ∈ K^{m,n} such that

   Ãx = y   and   ‖Ã − A‖_2 ≤ C mn EPS ‖A‖_2 ,      (1.5.5.21)

for a small constant C ≈ 1.

In fact we can choose (easy computation)

   Ã = A + (z xᵀ)/‖x‖_2² ,   z := y − Ax ∈ K^m ,


and we find

   ‖Ã − A‖_2 = ‖z xᵀ‖_2 / ‖x‖_2² = sup_{w ∈ K^n \ {0}} ‖z (xᵀw)‖_2 / (‖x‖_2² ‖w‖_2) ≤ ‖z‖_2 ‖x‖_2 / ‖x‖_2² = ‖y − Ax‖_2 / ‖x‖_2 .

Hence, in principle stability of an algorithm for computing Ax is confirmed, if for every x ∈ R^n the
computed result y = mvmult(A, x) satisfies

   ‖y − Ax‖_2 ≤ C mn EPS ‖x‖_2 ‖A‖_2 ,

with a small constant C > 0 independent of data and problem size.
y
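
In practice such a test can be coded directly. The following sketch checks the above residual bound for given data A, x and a returned vector y; the function name and the choice of the constant C are illustrative assumptions, not part of any library:

#include <Eigen/Dense>
#include <limits>

// Returns true if y passes the a-posteriori stability test derived above.
bool passesStabilityTest(const Eigen::MatrixXd &A, const Eigen::VectorXd &x,
                         const Eigen::VectorXd &y, double C = 10.0) {
  const double eps = std::numeric_limits<double>::epsilon();  // machine precision EPS
  const double m = static_cast<double>(A.rows());
  const double n = static_cast<double>(A.cols());
  // ||A||_2 = largest singular value of A
  const double normA = Eigen::JacobiSVD<Eigen::MatrixXd>(A).singularValues()(0);
  return (y - A * x).norm() <= C * m * n * eps * x.norm() * normA;
}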

Remark 1.5.5.22 (Numerical stability and sensitive dependence on data)


A problem shows sensitive dependence on the data, if small perturbations of the input data lead to large
perturbations of the output. Such problems are also called ill-conditioned. For such problems stability of
an algorithm is easily accomplished.

“Mental image” (Fig. 34): ill-conditioned problem: slightly different data (w.r.t. ‖·‖_X) yield vastly different
results (‖y − ỹ‖_Y large).

Example (Fig. 35): The problem is the prediction of the position of a billiard ball after ten bounces, given
the initial position, velocity, and spin. It is well known that tiny changes of the initial conditions can shift
the final location of the ball to virtually any point on the table: the billiard problem is chaotic. Hence, a
stable algorithm for its solution may just output a fixed or random position without even using the initial
conditions!
y
Review question(s) 1.5.5.23 (Numerical stability)
(Q1.5.5.23.A) We consider the problem of multiplying the Kronecker product of two real n × n matrices
with a vector. Give the formula for the problem mapping and characterize the (largest possible) data
space and result space.
(Q1.5.5.23.B) Fill in the blanks in the following definition of a stable algorithm:


Definition 1.5.5.19. Stable algorithm

An algorithm F̃ for solving a problem F : X ↦ Y is numerically stable if for all x ∈ X its result
F̃(x) (possibly affected by roundoff) is the exact result for “slightly perturbed” data:

   ∃ C ≈ 1:  ∀ ____ :  ∃ ____ :   ‖ ____ ‖_X ≤ C w(x) EPS ‖x‖_X   ∧   F̃( ____ ) = F( ____ ) .
(Q1.5.5.23.C) Suppose you have to examine a black-box function


double add( double x, double y);

that just adds the two numbers given as arguments. Derive conditions on the returned result that, when
satisfied, imply the stability of the implementation of add(). Of course, the norm on R is just |·|.

Learning Outcomes
Principal take-home knowledge and skills from this chapter:
• Learning by doing: Knowledge about the syntax of fundamental operations on matrices and vectors
in E IGEN.
• Understanding of the concepts of computational effort/cost and asymptotic complexity in numerics.
• Awareness of the asymptotic complexity of basic linear algebra operations
• Ability to determine the (asymptotic) computational effort for a concrete (numerical linear algebra)
algorithm.
• Ability to manipulate simple expressions involving matrices and vectors in order to reduce the com-
putational cost for their evaluation.
• Knowledge about round-off and machine precision.
• Familiarity with the phenomenon of “cancellation”: cause, effect, remedies, and tricks

Bibliography

[AV88] A. Aggarwal and J.S. Vitter. “The input/output complexity of sorting and related problems”. In:
Communications of the ACM 31.9 (1988), pp. 1116–1127 (cit. on p. 83).
[AG11] Uri M. Ascher and Chen Greif. A first course in numerical methods. Vol. 7. Computational
Science & Engineering. Society for Industrial and Applied Mathematics (SIAM), Philadelphia,
PA, 2011, pp. xxii+552. DOI: 10.1137/1.9780898719987 (cit. on pp. 56, 83, 95, 96, 98,
99, 101, 104, 105, 108, 115, 121).
[CW90] D. Coppersmith and S. Winograd. “Matrix multiplication via arithmetic progressions”. In: J. Sym-
bolic Computation 9.3 (1990), pp. 251–280 (cit. on p. 86).
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 57, 104).
[GV89] G.H. Golub and C.F. Van Loan. Matrix computations. 2nd. Baltimore, London: Johns Hopkins
University Press, 1989 (cit. on pp. 78, 120).
[GJ10] Gero Greiner and Riko Jacob. “The I/O Complexity of Sparse Matrix Dense Matrix Multipli-
cation”. In: LATIN 2010: THEORETICAL INFORMATICS. Ed. by A. López-Ortiz. Vol. 6034.
Lecture Notes in Computer Science. 2010, pp. 143–156. DOI: 10.1007/978-3-642-12200-2_14
(cit. on p. 83).
[Gut09] M.H. Gutknecht. Lineare Algebra. Lecture Notes. SAM, ETH Zürich, 2009 (cit. on p. 119).
[KW03] M. Kowarschik and C. Weiss. “An Overview of Cache Optimization Techniques and Cache-
Aware Numerical Algorithms”. In: Algorithms for Memory Hierarchies. Vol. 2625. Lecture Notes
in Computer Science. Heidelberg: Springer, 2003, pp. 213–232 (cit. on p. 83).
[LM67] J. N. Lyness and C. B. Moler. “Numerical differentiation of analytic functions”. In: SIAM J.
Numer. Anal. 4 (1967), pp. 202–210 (cit. on p. 116).
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
pp. 53, 70, 91, 119).
[Ove01] M.L. Overton. Numerical Computing with IEEE Floating Point Arithmetic. Philadelphia, PA:
SIAM, 2001 (cit. on p. 96).
[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Math-
ematics. New York: Springer, 2000 (cit. on pp. 57, 76).
[Str69] V. Strassen. “Gaussian elimination is not optimal”. In: Numer. Math. 13 (1969), pp. 354–356
(cit. on p. 85).
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on pp. 53, 54, 115).
[Van00] Charles F. Van Loan. “The ubiquitous Kronecker product”. In: J. Comput. Appl. Math. 123.1-2
(2000), pp. 85–100. DOI: 10.1016/S0377-0427(00)00393-9.

Chapter 2

Direct Methods for (Square) Linear Systems of Equations

§2.0.0.1 (Required prior knowledge for Chapter 2) Also this chapter heavily relies on concepts and
techniques from linear algebra as taught in the 1st semester introductory course. Knowledge of the fol-
lowing topics from linear algebra will be taken for granted and they should be refreshed in case of gaps:
• Operations involving matrices and vectors [NS02, Ch. 2], already covered in Chapter 1
• Computations with block-structured matrices, cf. § 1.3.1.13
• Linear systems of equations: existence and uniqueness of solutions [NS02, Sects. 1.2, 3.3]
• Gaussian elimination [NS02, Ch. 2]
• LU-decomposition and its connection with Gaussian elimination [NS02, Sect. 2.4]
y
Contents
2.1 Introduction: Linear Systems of Equations (LSE) . . . . . . . . . . . . . . . . . . . 127
2.2 Theory: Linear Systems of Equations (LSE) . . . . . . . . . . . . . . . . . . . . . . 130
2.2.1 LSE: Existence and Uniqueness of Solutions . . . . . . . . . . . . . . . . . . 130
2.2.2 Sensitivity/Conditioning of Linear Systems . . . . . . . . . . . . . . . . . . 131
2.3 Gaussian Elimination (GE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
2.3.1 Basic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
2.3.2 LU-Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
2.3.3 Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
2.4 Stability of Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
2.5 Survey: Elimination Solvers for Linear Systems of Equations . . . . . . . . . . . 165
2.6 Exploiting Structure when Solving Linear Systems . . . . . . . . . . . . . . . . . . 170
2.7 Sparse Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
2.7.1 Sparse Matrix Storage Formats . . . . . . . . . . . . . . . . . . . . . . . . . . 179
2.7.2 Sparse Matrices in E IGEN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
2.7.3 Direct Solution of Sparse Linear Systems of Equations . . . . . . . . . . . . 190
2.7.4 LU-Factorization of Sparse Matrices . . . . . . . . . . . . . . . . . . . . . . . 193
2.7.5 Banded Matrices [DR08, Sect. 3.7] . . . . . . . . . . . . . . . . . . . . . . . . 199
2.8 Stable Gaussian Elimination Without Pivoting . . . . . . . . . . . . . . . . . . . . 206


2.1 Introduction: Linear Systems of Equations (LSE)


§2.1.0.1 (The problem: solving a linear system) What is “the problem” considered in this chapter,
when we apply the notion of “problem” introduced in § 1.5.5.1? That is, which functions do “Direct Methods
for Linear Systems of Equations (LSE)” attempt to evaluate, and what are suitable data spaces X and result
spaces Y?
Input/data : square matrix A ∈ K n,n , vector b ∈ K n , n ∈ N ➣ data space X = K n,n × K n
Output/result : solution vector x ∈ K n : Ax = b ← (square) linear system of equations (LSE)
➣ result space Y = K n

(Terminology: A =̂ system matrix/coefficient matrix, b =̂ right-hand-side vector)

Linear systems with rectangular system matrices A ∈ K m,n , called “overdetermined” for m > n, and
“underdetermined” for m < n will be treated in Chapter 3. y

Remark 2.1.0.2 (LSE: key components of mathematical models in many fields) Linear systems of
equations are ubiquitous in computational science: they are encountered
• with discrete linear models in network theory (see Ex. 2.1.0.3), control, statistics;
• in the case of discretized boundary value problems for ordinary and partial differential equations (→
course “Numerical methods for partial differential equations”, 4th semester);
• as a result of linearization (e.g., “Newton’s method” → Section 8.5).
y

EXAMPLE 2.1.0.3 (Nodal analysis of (linear) electric circuit [QSS00, Sect. 4.7.1])
Now we study a very important application of numerical simulation, where (large, sparse) linear systems
of equations play a central role: Numerical circuit analysis. We begin with linear circuits in the frequency
domain, which are directly modelled by complex linear systems of equations. In later chapters we will
tackle circuits with non-linear elements, see Ex. 8.1.0.1, and, finally, will learn about numerical methods
for computing the transient (time-dependent) behavior of circuits, see Ex. 11.1.2.11.

Modeling of simple linear circuits takes only elementary physical laws as covered in any introductory
course of physics (or even in secondary school physics). There is no sophisticated physics or mathematics
involved. Circuits are composed of so-called circuit elements connected by (ideal) wires.
A circuit diagram is shown in Fig. 36: nodes ➀–➅ connected by the circuit elements R1–R5, C1, C2, L,
and a voltage source U.
• : Nodes, that is, junctions of wires.
We number the nodes 1, . . . , n and write I_kj (physical units [I_kj] = 1A) for the electric current flowing
from node k to node j. Currents have a sign: I_kj = −I_jk.


The most fundamental relationship is the Kirchhoff current law (KCL) that demands that the sum of node
currents vanishes:

   ∀ k ∈ {1, . . . , n}:   Σ_{j=1}^{n} I_kj = 0 .      (2.1.0.4)

The unknowns of the model are the nodal potentials Uk , k = 1, . . . , n. (Some of them may be known, for
instance those for grounded nodes: ➅ in Fig. 36, and nodes connected to voltage sources: ➀ in Fig. 36.)
The difference of the nodal potentials of two connected nodes is called the branch voltage.

The circuit elements are characterized by current-voltage relationships, so-called constitutive relations,
here given in frequency domain for angular frequency ω > 0 (physical units [ω ] = 1s−1 ). We consider
only the following simple circuit elements:

• Ohmic resistor:  I = U/R ,       resistance  [R] = 1 VA⁻¹    ➤   I_kj = R⁻¹ (U_k − U_j) ,
• capacitor:       I = ıωC U ,     capacitance [C] = 1 AsV⁻¹   ➤   I_kj = ıωC (U_k − U_j) ,
• coil/inductor:   I = U/(ıωL) ,   inductance  [L] = 1 VsA⁻¹   ➤   I_kj = −ıω⁻¹ L⁻¹ (U_k − U_j) .

✎ notation: ı =̂ imaginary unit, “ı := √−1”, ı = exp(ıπ/2), ı² = −1
Here we face the special case of a linear circuit: all relationships between branch currents and voltages
are of the form

Ikj = αkj (Uk − Uj ) with αkj ∈ C . (2.1.0.5)

The concrete value of αkj is determined by the circuit element connecting node k and node j.

These constitutive relations are derived by assuming a harmonic time-dependence of all quantities, which
is termed circuit analysis in the frequency domain (AC-mode).

voltage: u(t) = Re{U exp(ıωt)} , current: i (t) = Re{ I exp(ıωt)} . (2.1.0.6)

Here U, I ∈ C are called complex amplitudes. This implies for temporal derivatives:

   du/dt (t) = Re{ıωU exp(ıωt)} ,   di/dt (t) = Re{ıωI exp(ıωt)} .      (2.1.0.7)
For a capacitor the total charge is proportional to the applied voltage, q(t) = C u(t), and the current is the
rate of change of the charge, i(t) = dq/dt (t), so that

   i(t) = C du/dt (t) .

For a coil the voltage is proportional to the rate of change of the current: u(t) = L di/dt (t). Combined with
(2.1.0.6) and (2.1.0.7) this leads to the above constitutive relations.

Now we combine the constitutive relations with the Kirchhoff current law (2.1.0.4). We end up with a linear
system of equations!

➁ :  ıωC1 (U2 − U1) + R1⁻¹ (U2 − U3) − ıω⁻¹L⁻¹ (U2 − U4) + R2⁻¹ (U2 − U5) = 0 ,
➂ :  R1⁻¹ (U3 − U2) + ıωC2 (U3 − U5) = 0 ,
➃ :  R5⁻¹ (U4 − U1) − ıω⁻¹L⁻¹ (U4 − U2) + R4⁻¹ (U4 − U5) = 0 ,
➄ :  R2⁻¹ (U5 − U2) + ıωC2 (U5 − U3) + R4⁻¹ (U5 − U4) + R3⁻¹ (U5 − U6) = 0 ,


U1 = U , U6 = 0 .
We do not get equations for the nodes ➀ and ➅, because these nodes are connected to the “outside
world” so that the Kirchhoff current law (2.1.0.4) does not hold (from a local perspective). This is fitting,
because the voltages in these nodes are known anyway.

   [ ıωC1 + R1⁻¹ − ıω⁻¹L⁻¹ + R2⁻¹    −R1⁻¹           ıω⁻¹L⁻¹                    −R2⁻¹                      ] [U2]   [ ıωC1·U ]
   [ −R1⁻¹                           R1⁻¹ + ıωC2     0                          −ıωC2                      ] [U3]   [   0    ]
   [ ıω⁻¹L⁻¹                         0               R5⁻¹ − ıω⁻¹L⁻¹ + R4⁻¹      −R4⁻¹                      ] [U4] = [ R5⁻¹·U ]
   [ −R2⁻¹                           −ıωC2           −R4⁻¹                      R2⁻¹ + ıωC2 + R4⁻¹ + R3⁻¹  ] [U5]   [   0    ]

This is a linear system of equations with complex coefficients: A ∈ C4,4 , b ∈ C4 . For the algorithms to
be discussed below this does not matter, because they work alike for real and complex numbers. y
Review question(s) 2.1.0.8 (Nodal analysis of linear electric circuits)
(Q2.1.0.8.A) [A simple resistive circuit]
In the electric circuit drawn in Fig. 37 all resistors have the same resistance R > 0, and a voltage source
U is attached.

Which nodal potentials are already known?

Derive the linear system of equations for the remaining unknown nodal potentials.

(Q2.1.0.8.B) [Current source] The voltage source with strength U in Fig. 37 is replaced with a current
source, which drives a known current I through the circuit branch it is attached to.
Which linear system of equations has to be solved in order to determine the unknown nodal potentials
U1 , U2 , U3 , U4 ?
(Q2.1.0.8.C) A linear mapping L : R n → R n is represented by the matrix A ∈ R n,n with respect to the
standard basis of R n comprising Cartesian coordinate vectors eℓ , ℓ = 1, . . . , n.
Explain, how one can compute the matrix representation of L with respect to the basis
   
     

 2 1 0 0 0  

   .  .. 


1  2
     1   .
. 
 . 


  
0 1 2    .   .. 
  

 
.      ..   .  
  
 ..  
 0  1     
.      .   .. 
 .   .
 . ,  .. , 0, . . .  . ,  . 

 .   . 0  

  .   ..   .     ...  

  ..   .   .  1  







  .  
 .
.

   
 0   


 ..  ..  .   

  .  2  1  


 0 

0 0 1 2
by merely solving n linear systems of equations and forming matrix products.


2.2 Theory: Linear Systems of Equations (LSE)


2.2.1 LSE: Existence and Uniqueness of Solutions
The following concepts and results are known from linear algebra [NS02, Sect. 1.2], [Gut09, Sect. 1.3]:

Definition 2.2.1.1. Invertible matrix → [NS02, Sect. 2.3]

A ∈ K^{n,n} invertible/regular :⇔ ∃! B ∈ K^{n,n}: AB = BA = I .

B is called the inverse of A (✎ notation: B = A⁻¹).

Now, recall a few notions from linear algebra needed to state criteria for the invertibility of a matrix.

Definition 2.2.1.2. Image space and kernel of a matrix

Given A ∈ K m,n , the range/image (space) of A is the subspace of K m spanned by the columns of
A

R(A) := {Ax, x ∈ K n } ⊂ K m .

The kernel/nullspace of A is

N (A) := {z ∈ K n : Az = 0} .

Definition 2.2.1.3. Rank of a matrix → [NS02, Sect. 2.4], [QSS00, Sect. 1.5]
The rank of a matrix M ∈ K m,n , denoted by rank(M), is the maximal number of linearly indepen-
dent rows/columns of M. Equivalently, rank(A) = dim R(A).

Theorem 2.2.1.4. Criteria for invertibility of matrix → [NS02, Sect. 2.3 & Cor. 3.8]
A square matrix A ∈ K n,n is invertible/regular if one of the following equivalent conditions is satis-
fied:
1. ∃B ∈ K n,n : BA = AB = I,
2. x ↦ Ax defines a bijective endomorphism (automorphism) of K^n,
3. the columns of A are linearly independent (full column rank),
4. the rows of A are linearly independent (full row rank),
5. det A ≠ 0 (non-vanishing determinant),
6. rank(A) = n (full rank).

§2.2.1.5 (Solution of a LSE as a “problem”, recall § 2.1.0.1) Linear algebra give us a formal way to
denote solution of LSE:

   A ∈ K^{n,n} regular  &  Ax = b   ⇒   x = A⁻¹b   (A⁻¹ =̂ inverse matrix).

Now recall our notion of “problem” from § 1.5.5.1 as a function F mapping data in a data space X to a
result in a result space Y . Concretely, for n × n linear systems of equations:

   F :  X := K^{n,n}_* × K^n → Y := K^n ,   F(A, b) := A⁻¹b .


✎ notation: (open) set of regular matrices ⊂ K^{n,n}:   K^{n,n}_* := {A ∈ K^{n,n} : A regular/invertible → Def. 2.2.1.1} .
y
Remark 2.2.1.6 (The inverse matrix and solution of a LSE) In principle, in E IGEN the inverse of a matrix
A is available through the member function inverse() of matrix type, see E IGEN documentation.
However, there are only a few cases, typically involving small fixed-size matrices, where the actual
computation of the inverse of a matrix is warranted. The general advice is the following:
Avoid computing the inverse of a matrix (which can almost always be avoided)!
In particular, never ever even contemplate using x = A.inverse()*b to solve the
! linear system of equations Ax = b, cf. Exp. 2.4.0.14. The next sections present a sound way
to do this.
Another reason for this advice is given in Exp. 2.4.0.14. y
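
A minimal sketch of the recommended usage, relying on a decomposition-based solver instead of the inverse:

#include <Eigen/Dense>
#include <iostream>

int main() {
  const int n = 100;
  const Eigen::MatrixXd A =
      Eigen::MatrixXd::Random(n, n) + n * Eigen::MatrixXd::Identity(n, n);
  const Eigen::VectorXd b = Eigen::VectorXd::Random(n);
  const Eigen::VectorXd x = A.lu().solve(b);         // do this
  // const Eigen::VectorXd x_bad = A.inverse() * b;  // avoid this!
  std::cout << "residual norm = " << (A * x - b).norm() << std::endl;
  return 0;
}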
Review question(s) 2.2.1.7 (LSE: Existence and Uniqueness of Solutions)
(Q2.2.1.7.A) [Diagonal linear systems of equations] How can you tell that a square linear system of
equations with a diagonal system matrix has a unique solution?
(Q2.2.1.7.B) Outline a practical algorithm that checks whether an upper triangular matrix A ∈ R n,n ,
n ∈ N, is invertible.
(Q2.2.1.7.C) A square matrix A ∈ R n,n has the following entries

1 , if j ≥ i ,

(A)i,j = α , if i = n, j = 1 , i, j ∈ {1, . . . , n} , α∈R.


0 elsewhere,
For what values of α is A regular?

2.2.2 Sensitivity/Conditioning of Linear Systems


The sensitivity/conditioning of a problem (for given data) gauges
the impact of small perturbations of the data on the result.

Before we examine sensitivity for linear systems of equations, we look at the simpler problem of
matrix×vector multiplication.

EXAMPLE 2.2.2.1 (Sensitivity of linear mappings) For a fixed given regular A ∈ K n,n we study the
problem map
F : K n → K n , x 7→ Ax ,
that is, now we consider only the vector x as data.
Goal: Estimate relative perturbations in F (x) due to relative perturbations in x.
We assume that K n is equipped with some vector norm (→ Def. 1.5.5.4) and we use the induced matrix
norm (→ Def. 1.5.5.10) on K n,n . Using linearity and the elementary estimate kMxk ≤ kMkkxk, which
is a direct consequence of the definition of an induced matrix norm, we obtain

   Ax = y   ⇒   ‖x‖ ≤ ‖A⁻¹‖ ‖y‖ ,
   A(x + Δx) = y + Δy   ⇒   AΔx = Δy   ⇒   ‖Δy‖ ≤ ‖A‖ ‖Δx‖ ,


   ⇒   ‖Δy‖/‖y‖ ≤ ‖A‖ ‖Δx‖ / (‖A⁻¹‖⁻¹ ‖x‖) = ‖A‖ ‖A⁻¹‖ · ‖Δx‖/‖x‖ .      (2.2.2.2)

(relative perturbation in the result bounded by ‖A‖‖A⁻¹‖ times the relative perturbation in the data)


We have found that the quantity kAk A−1 bounds amplification of relative errors in the argument vector
in a matrix×vector-multiplication with the matrix A. y
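
The amplification bound can be observed numerically. In the following sketch the 2×2 matrix, the vector x and its perturbation are arbitrary choices; x is taken (almost) in the direction that is strongly damped by A, so that the measured amplification comes close to the bound ‖A‖_2‖A⁻¹‖_2:

#include <Eigen/Dense>
#include <iostream>

int main() {
  Eigen::Matrix2d A;
  A << 1.0, 1.0,
      1.0, 1.0001;                                                // nearly singular matrix
  const Eigen::Vector2d x(1.0, -1.0);                             // nearly the "small" singular direction
  const Eigen::Vector2d dx = 1e-6 * Eigen::Vector2d(1.0, 1.0);    // perturbation of the data
  const Eigen::Vector2d y = A * x;
  const Eigen::Vector2d dy = A * dx;                              // resulting perturbation of the result
  const double amplification = (dy.norm() / y.norm()) / (dx.norm() / x.norm());
  // ||A||_2 * ||A^{-1}||_2 = ratio of extremal singular values, cf. Code 2.2.2.11
  const Eigen::JacobiSVD<Eigen::Matrix2d> svd(A);
  const double bound = svd.singularValues()(0) / svd.singularValues()(1);
  std::cout << amplification << " <= " << bound << std::endl;
  return 0;
}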

Now we study the sensitivity of the problem of finding the solution of a linear system of equations Ax = b,
A ∈ R^{n,n} regular, b ∈ R^n, see § 2.1.0.1. We write x̃ for the solution of the perturbed linear system.

Question: To what extent do perturbations in the data A, b cause a

   (normwise) relative error:   ε_r := ‖x − x̃‖ / ‖x‖ ?

(‖·‖ =̂ suitable vector norm, e.g., maximum norm ‖·‖_∞)

Perturbed linear system:

   Ax = b  ↔  (A + ΔA)x̃ = b + Δb   ⇒   (A + ΔA)(x̃ − x) = Δb − ΔAx .      (2.2.2.3)

Theorem 2.2.2.4. Conditioning of LSEs → [QSS00, Thm. 3.1], [GGK14, Thm 3.5]
If A is regular, ‖ΔA‖ < ‖A⁻¹‖⁻¹, and (2.2.2.3) holds, then
(i) A + ΔA is regular/invertible,
(ii) if Ax = b and (A + ΔA)x̃ = b + Δb, then

   ‖x − x̃‖/‖x‖ ≤ (‖A⁻¹‖‖A‖) / (1 − ‖A⁻¹‖‖A‖‖ΔA‖/‖A‖) · ( ‖Δb‖/‖b‖ + ‖ΔA‖/‖A‖ ) .

(relative error of the result in terms of the relative perturbations of the data)

The proof is based on the following fundamental result:

Lemma 2.2.2.5. Perturbation lemma → [QSS00, Thm. 1.5]


   B ∈ R^{n,n} ,  ‖B‖ < 1   ⇒   I + B regular  ∧  ‖(I + B)⁻¹‖ ≤ 1 / (1 − ‖B‖) .

Proof. We start with the △-inequality

   ‖(I + B)x‖ ≥ ‖x‖ − ‖Bx‖ ≥ (1 − ‖B‖)‖x‖   ∀ x ∈ R^n ,      (2.2.2.6)

where the right-hand side is ≠ 0 for x ≠ 0.

We conclude that I + B must have trivial kernel N (I + B) = {0}, which implies that the square matrix
I + B is regular. We continue using this fact, the definition of the matrix norm, and (2.2.2.6):

   ‖(I + B)⁻¹‖ = sup_{x ∈ R^n \ {0}} ‖(I + B)⁻¹x‖ / ‖x‖ = sup_{y ∈ R^n \ {0}} ‖y‖ / ‖(I + B)y‖ ≤ 1 / (1 − ‖B‖) .



Proof. (of Thm. 2.2.2.4) We use a slightly generalized version of Lemma 2.2.2.5, which gives us

   ‖(A + ΔA)⁻¹‖ ≤ ‖A⁻¹‖ / (1 − ‖A⁻¹ΔA‖) .

We combine this estimate with (2.2.2.3):

   ‖Δx‖ ≤ ‖A⁻¹‖ / (1 − ‖A⁻¹ΔA‖) · (‖Δb‖ + ‖ΔA x‖) ≤ ‖A⁻¹‖‖A‖ / (1 − ‖A⁻¹‖‖ΔA‖) · ( ‖Δb‖/(‖A‖‖x‖) + ‖ΔA‖/‖A‖ ) ‖x‖ .


Note that the term ‖A‖‖A⁻¹‖ occurs frequently. Therefore it has been given a special name:

Definition 2.2.2.7. Condition (number) of a matrix

Condition (number) of a matrix A ∈ R^{n,n}:   cond(A) := ‖A⁻¹‖ ‖A‖ .

Note: cond(A) depends on the matrix norm k·k !

Rewriting the estimate of Thm. 2.2.2.4 with Δb = 0:

   ε_r := ‖x − x̃‖/‖x‖ ≤ cond(A) δ_A / (1 − cond(A) δ_A) ,   δ_A := ‖ΔA‖/‖A‖ .      (2.2.2.8)

From (2.2.2.8) we conclude important messages about cond(A):

✦ If cond(A) ≫ 1, small perturbations in A can lead to large relative errors in the solution of the LSE.
✦ If cond(A) ≫ 1, a stable algorithm (→ Def. 1.5.5.19) can produce solutions with large relative error!

Recall Thm. 2.2.2.4: for regular A ∈ K^{n,n}, small ΔA, and a generic vector/matrix norm ‖·‖:

   Ax = b ,  (A + ΔA)x̃ = b + Δb   ⇒   ‖x − x̃‖/‖x‖ ≤ cond(A) / (1 − cond(A)‖ΔA‖/‖A‖) · ( ‖Δb‖/‖b‖ + ‖ΔA‖/‖A‖ ) .      (2.2.2.9)

cond(A) ≫ 1 ➣ small relative changes of the data A, b may effect huge relative changes in the solution.

cond(A) indicates the sensitivity of the “LSE problem” (A, b) ↦ x = A⁻¹b
(as “amplification factor” of (worst-case) relative perturbations in the data A, b).

Terminology:
Small changes of data ⇒ small perturbations of result : well-conditioned problem
Small changes of data ⇒ large perturbations of result : ill-conditioned problem


Note: sensitivity gauge depends on the chosen norm !

EXAMPLE 2.2.2.10 (Intersection of lines in 2D) Solving a 2 × 2 linear system of equations amounts to
finding the intersection of two lines in the coordinate plane: This relationship allows a geometric view of
“sensitivity of a linear system”, when using the distance metric (Euclidean vector norm).
Remember the Hessian normal form of a straight line in the plane. We are given the Hessian normal
forms of two lines L1 and L2 and want to compute the coordinate vector x ∈ R2 of the point in which they
intersect:

   L_i = {x ∈ R² : xᵀn_i = d_i} ,   n_i ∈ R², d_i ∈ R ,   i = 1, 2 .

   LSE for finding the intersection:   A x = b ,   A := [ n_1ᵀ ; n_2ᵀ ] ∈ R^{2,2} ,   b := (d_1, d_2)ᵀ ,

where the n_i are unit normal vectors of the lines, and the d_i ∈ R give the (signed) distance to the origin.
Now we perturb the right-hand side vector b and wonder how this will impact the intersection points. The
situation is illustrated by the following two pictures, in which the original and perturbed lines are drawn in
black and red, respectively.

Nearly orthogonal intersection: well-conditioned. Glancing intersection: ill-conditioned.

Obviously, if the lines are almost parallel, a small shift in their position will lead to a big shift of the inter-
section point.
 
The following E IGEN-based C++ code investigates condition numbers for the matrix A = [ 1, cos ϕ ; 0, sin ϕ ]
that can arise when computing the intersection of two lines enclosing the angle ϕ. As usual the directive
using namespace Eigen; was given in the beginning of the file.

C++-code 2.2.2.11: condition numbers of 2 × 2 matrices ➺ GITLAB


2    VectorXd phi = VectorXd::LinSpaced(50, M_PI / 200, M_PI / 2);
3    MatrixXd res(phi.size(), 3);
4    Matrix2d A;
5    A(0, 0) = 1;
6    A(1, 0) = 0;
7    for (int i = 0; i < phi.size(); ++i) {
8      A(0, 1) = std::cos(phi(i));
9      A(1, 1) = std::sin(phi(i));
10     // L2 condition number is the quotient of the maximal
11     // and minimal singular value of A
12     JacobiSVD<MatrixXd> svd(A);
13     double C2 = svd.singularValues()(0) /  //
14                 svd.singularValues()(svd.singularValues().size() - 1);
15     // L-infinity condition number
16     double Cinf =
17         A.inverse().cwiseAbs().rowwise().sum().maxCoeff() *
18         A.cwiseAbs().rowwise().sum().maxCoeff();  //
19     res(i, 0) = phi(i);
20     res(i, 1) = C2;
21     res(i, 2) = Cinf;
22   }

In Line 13 we compute the condition number of A with respect to the Euclidean vector norm using special
E IGEN built-in functions.
Line 18 evaluated the condition number of a matrix for the maximum norm, recall Ex. 1.5.5.12.

We clearly observe a blow-up of cond(A) (with respect to the Euclidean vector norms) as the angle
enclosed by the two lines shrinks. This corresponds to a large sensitivity of the location of the intersection
point in the case of glancing incidence.

[Fig. 38: condition numbers (2-norm and max-norm) plotted versus the angle of n1, n2]
Fig. 38 angle of n1, n2

Heuristics for predicting large cond(A)


cond(A) ≫ 1 ↔ columns/rows of A “almost linearly dependent”

Review question(s) 2.2.2.12 (Sensitivity of linear systems)


(Q2.2.2.12.A) Analyze the sensitivity of a linear system of equations with diagonal system matrix relying
on the Euclidean vector and matrix norms. Consider perturbations of the right-hand side vector and of
the diagonal elements and investigate the amplification of relative errors.

2.3 Gaussian Elimination (GE)


2.3.1 Basic Algorithm
The problem of solving a linear system of equations is rather special compared to many other numerical
tasks:
!  An exceptional feature of linear systems of equations (LSE) is that their “exact” solution is computable
   with finitely many elementary operations.
The algorithm is Gaussian elimination (GE) (→ secondary school, linear algebra).
Familiarity with the algorithm of Gaussian elimination for a square linear system of equations will be taken
for granted.


Supplementary literature. In case you cannot remember the main facts about Gaussian

elimination, very detailed accounts and examples can be found in


• M. Gutknecht’s lecture notes [Gut09, Ch. 1],
• the textbook by Nipp & Stoffer [NS02, Ch. 1],
• the numerical analysis text by Quarteroni et al. [QSS00, Sects. 3.2 & 3.3],
• the textbook by Ascher & Greif [AG11, Sect. 5.1],
and, to some extent, below, see Ex. 2.3.1.1.
Wikipedia: Although the method is named after mathematician Carl Friedrich Gauss, the earliest pre-
sentation of it can be found in the important Chinese mathematical text Jiuzhang suanshu or
The Nine Chapters on the Mathematical Art, dated approximately 150 B.C., and commented
on by Liu Hui in the 3rd century.

The idea of Gaussian elimination is the transformation of a linear system of equa-


tions into a “simpler”, but equivalent LSE by means of successive (invertible) row
transformations.

Rem. 1.3.1.12: row transformations ↔ left-multiplication with transformation matrix


Obviously, left multiplication with a regular matrix does not affect the solution of an LSE: for any regular
T ∈ K n,n

Ax = b ⇒ A′ x = b′ , if A′ = TA, b′ = Tb .

So we may try to convert the linear system of equations to a form that can be solved more easily by
multiplying with regular matrices from left, which boils down to applying row transformations. A suitable
target format is a diagonal linear system of equations, for which all equations are completely decoupled.
This is the gist of Gaussian elimination.
EXAMPLE 2.3.1.1 (Gaussian elimination)
Stage ➀ (Forward) elimination:
    
   [ 1   1   0 ] [x1]   [ 4 ]            x1 +  x2       =  4
   [ 2   1  -1 ] [x2] = [ 1 ]    ←→    2x1 +  x2 - x3  =  1
   [ 3  -1  -1 ] [x3]   [-3 ]           3x1 -  x2 - x3  = -3

Forward elimination on the augmented matrix (in each step the pivot row is highlighted and the pivot
element is set in bold in the original layout):

   [ 1   1   0 |  4 ]        [ 1   1   0 |  4 ]        [ 1   1   0 |  4 ]        [ 1   1   0 |  4 ]
   [ 2   1  -1 |  1 ]   ➤   [ 0  -1  -1 | -7 ]   ➤   [ 0  -1  -1 | -7 ]   ➤   [ 0  -1  -1 | -7 ]
   [ 3  -1  -1 | -3 ]        [ 3  -1  -1 | -3 ]        [ 0  -4  -1 |-15 ]        [ 0   0   3 | 13 ]

The coefficient part of the final augmented matrix is the upper triangular matrix U.
We have transformed the LSE to upper triangular form


Stage ➁ Solve by back substitution:

   x1 + x2       =  4              x3 = 13/3
       -x2 - x3  = -7      ⇒      x2 = 7 - 13/3 = 8/3
             3x3 = 13              x1 = 4 - 8/3 = 4/3 .

More detailed examples are given in [Gut09, Sect. 1.1], [NS02, Sect. 1.1]. y
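
The result of this small example can be cross-checked with EIGEN's built-in LU-based solver; the following minimal sketch should reproduce x = (4/3, 8/3, 13/3)ᵀ:

#include <Eigen/Dense>
#include <iostream>

int main() {
  Eigen::Matrix3d A;
  A << 1, 1, 0,
      2, 1, -1,
      3, -1, -1;
  const Eigen::Vector3d b(4, 1, -3);
  const Eigen::Vector3d x = A.lu().solve(b);
  std::cout << x.transpose() << std::endl;   // approx. 1.33333 2.66667 4.33333
  return 0;
}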

More generally, for the system

   a_11 x_1 + a_12 x_2 + ··· + a_1n x_n = b_1
   a_21 x_1 + a_22 x_2 + ··· + a_2n x_n = b_2
      ...
   a_n1 x_1 + a_n2 x_2 + ··· + a_nn x_n = b_n

• i-th row − l_i1 · 1st row (pivot row), l_i1 := a_i1/a_11, i = 2, . . . , n:

   a_11 x_1 + a_12 x_2 + ··· + a_1n x_n = b_1
              a_22^(1) x_2 + ··· + a_2n^(1) x_n = b_2^(1)       with  a_ij^(1) = a_ij − a_1j l_i1 ,  i, j = 2, . . . , n ,
                 ...                                                  b_i^(1)  = b_i − b_1 l_i1 ,   i = 2, . . . , n .
              a_n2^(1) x_2 + ··· + a_nn^(1) x_n = b_n^(1)

• i-th row − l_i2 · 2nd row (pivot row), l_i2 := a_i2^(1)/a_22^(1), i = 3, . . . , n:

   a_11 x_1 + a_12 x_2 + a_13 x_3 + ··· + a_1n x_n = b_1
              a_22^(1) x_2 + a_23^(1) x_3 + ··· + a_2n^(1) x_n = b_2^(1)
                             a_33^(2) x_3 + ··· + a_3n^(2) x_n = b_3^(2)
                                ...
                             a_n3^(2) x_3 + ··· + a_nn^(2) x_n = b_n^(2)

After n − 1 steps: linear system of equations in upper triangular form

   a_11 x_1 + a_12 x_2 + a_13 x_3 + ··· + a_1n x_n = b_1
              a_22^(1) x_2 + a_23^(1) x_3 + ··· + a_2n^(1) x_n = b_2^(1)
                             a_33^(2) x_3 + ··· + a_3n^(2) x_n = b_3^(2)
                                ...
                                            a_nn^(n−1) x_n = b_n^(n−1)

Terminology: a_11, a_22^(1), a_33^(2), . . . , a_{n−1,n−1}^(n−2) = pivots/pivot elements
Graphical depiction:



In each step the entries below the pivot (∗ =̂ the pivot entry, necessarily ≠ 0, which we assume here;
the pivot row is highlighted) are turned into zeros, until an upper triangular matrix remains.

In the k-th step (starting from A ∈ K^{n,n}, 1 ≤ k < n, pivot row a_k·ᵀ):
k· ):

transformation: Ax = b ➤ A′ x = b′ .
with

   a′_ij := { a_ij − (a_ik/a_kk) a_kj   for k < i, j ≤ n ,
            { 0                         for k < i ≤ n, j = k ,
            { a_ij                      else,
                                                                 (2.3.1.2)
   b′_i  := { b_i − (a_ik/a_kk) b_k     for k < i ≤ n ,
            { b_i                       else.

The quotients a_ik/a_kk are the multipliers l_ik.
§2.3.1.3 (Gaussian elimination: algorithm) Here we give a direct E IGEN implementation of Gaussian
elimination for LSE Ax = b (grossly inefficient!).

C++ code 2.3.1.4: Solving LSE Ax = b with Gaussian elimination ➺ GITLAB


2    //! Gauss elimination without pivoting, x = A\b
3    //! A must be an n x n-matrix, b an n-vector
4    //! The result is returned in x
5    void gausselimsolve(const MatrixXd &A, const VectorXd &b,
6                        VectorXd &x) {
7      const Index n = A.rows();
8      MatrixXd Ab(n, n + 1);  // Augmented matrix [A, b]
9      Ab << A, b;  //
10     // Forward elimination (cf. step ➀ in Ex. 2.3.1.1)
11     for (Index i = 0; i < n - 1; ++i) {
12       const double pivot = Ab(i, i);
13       for (Index k = i + 1; k < n; ++k) {
14         const double fac = Ab(k, i) / pivot;  // the multiplier
15         Ab.block(k, i + 1, 1, n - i) -= fac * Ab.block(i, i + 1, 1, n - i);  //
16       }
17     }
18     // Back substitution (cf. step ➁ in Ex. 2.3.1.1)
19     Ab(n - 1, n) = Ab(n - 1, n) / Ab(n - 1, n - 1);
20     for (Index i = n - 2; i >= 0; --i) {
21       for (Index l = i + 1; l < n; ++l) {
22         Ab(i, n) -= Ab(l, n) * Ab(i, l);
23       }
24       Ab(i, n) /= Ab(i, i);
25     }
26     x = Ab.rightCols(1);  // Solution in rightmost column!
27   }

• In Line 9 the right-hand-side vector is set as the last column of the matrix, which facilitates simultaneous
row transformations of the matrix and the r.h.s.
• In Line 14 the variable fac is the multiplier from (2.3.1.2).
• In Line 26 we extract solution from last column of the transformed matrix.
y
§2.3.1.5 (Computational effort of Gaussian elimination) We examine Code 2.3.1.4.
• Forward elimination involves three nested loops (note that the compact vector operation in Line 15
involves another loop from i + 1 to m).
• Back substitution can be done with two nested loops.
computational cost (↔ number of elementary operations) of Gaussian elimination [NS02, Sect. 1.3]:
n −1
forward elimination : ∑i=1 (n − i)(2(n − i) + 3) = n(n − 1)( 32 n + 76 ) Ops. , (2.3.1.6)
n
back substitution : ∑i=1 2(n − i) + 1 = n2 Ops. .
✎ ☞
asymptotic complexity (→ Section 1.4) of Gaussian elimination 2 3
= 3n + O ( n2 ) = O ( n3 )
✍ ✌
(without pivoting) for generic LSE Ax = b, A ∈ R n,n
y

EXPERIMENT 2.3.1.7 (Runtime of Gaussian elimination) In this experiment we compare the efficiency
of our hand-coded Gaussian elimination with that of library functions.

C++ code 2.3.1.8: Measuring runtimes of Code 2.3.1.4 vs. E IGEN lu()-operator vs. MKL
➺ GITLAB
2    //! Eigen code for timing numerical solution of linear systems
3    MatrixXd gausstiming() {
4      std::vector<int> n = {8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192};
5      int nruns = 3;
6      MatrixXd times(n.size(), 3);
7      for (int i = 0; i < n.size(); ++i) {
8        Timer t1, t2;  // timer class
9        MatrixXd A = MatrixXd::Random(n[i], n[i]) + n[i] * MatrixXd::Identity(n[i], n[i]);
10       VectorXd b = VectorXd::Random(n[i]);
11       VectorXd x(n[i]);
12       for (int j = 0; j < nruns; ++j) {
13         t1.start(); x = A.lu().solve(b); t1.stop();  // Eigen implementation
14   #ifndef EIGEN_USE_MKL_ALL  // only test own algorithm without MKL
15         if (n[i] <= 4096) {  // Prevent long runs
16           t2.start(); gausselimsolve(A, b, x); t2.stop();  // own gauss elimination
17         }
18   #endif
19       }
20       times(i, 0) = n[i]; times(i, 1) = t1.min(); times(i, 2) = t2.min();
21     }
22     return times;
23   }


Platform:
✦ ubuntu 14.04 LTS
✦ i7-3517U CPU @ 1.90GHz × 4
✦ L1 32 KB, L2 256 KB, L3 4096 KB, Mem 8 GB
✦ gcc 4.8.4, -O3

E IGEN is about two orders of magnitude faster than a direct implementation; MKL is even faster.

[Fig. 39: execution time [s] versus matrix size n (log-log scale) for the Eigen lu() solver, gausselimsolve,
and the MKL solver (sequential/parallel), together with an O(n³) reference line.]

n Code 2.3.1.4 [s] E IGEN lu() [s] MKL sequential [s] MKL parallel [s]
8 6.340e-07 1.140e-06 3.615e-06 2.273e-06
16 2.662e-06 3.203e-06 9.603e-06 1.408e-05
32 1.617e-05 1.331e-05 1.603e-05 2.495e-05
64 1.214e-04 5.836e-05 5.142e-05 7.416e-05
128 2.126e-03 3.180e-04 2.041e-04 3.176e-04
256 3.464e-02 2.093e-03 1.178e-03 1.221e-03
512 3.954e-01 1.326e-02 7.724e-03 8.175e-03
1024 4.822e+00 9.073e-02 4.457e-02 4.864e-02
2048 5.741e+01 6.260e-01 3.347e-01 3.378e-01
4096 5.727e+02 4.531e+00 2.644e+00 1.619e+00
8192 - 3.510e+01 2.064e+01 1.360e+01
y
Never implement Gaussian elimination yourself !

use numerical libraries (LAPACK/MKL) or E IGEN !

A concise list of libraries for numerical linear algebra and related problems can be found here.

Remark 2.3.1.9 (Gaussian elimination for non-square matrices) In Code 2.3.1.4: the right hand side
vector b was first appended to matrix A as rightmost column, and then forward elimination and back
substitution were carried out on the resulting matrix. This can be generalized to a Gaussian elimination for
rectangular matrices A ∈ K n,n+1 !
Consider a “fat matrix” A ∈ K n,m , m>n:


     
[Schematic: forward elimination brings the left n × n block to upper triangular form; back substitution then
reduces it to the identity, so that the remaining columns contain the solution(s).]


Recall Code 2.3.1.4 (m = n + 1): the solution vector x = A−1 b was recovered as the rightmost column
of the augmented matrix (A, b) after forward elimination and back substitution. In the above cartoon it
would be contained in the yellow part of the matrix on the right.
With this technique we have an efficient way of simultaneously solving LSEs with multiple right-hand sides.
These multiple right-hand sides can be passed as the columns of a matrix B, and the problem of solving an
LSE for several right-hand-side vectors can be stated as follows:
Given a regular A ∈ K^{n,n} and B ∈ K^{n,k}, seek X ∈ K^{n,k} such that
   AX = B ⇔ X = A⁻¹B .

Usually library functions meant to solve LSEs also accept a matrix instead of a right-hand-side vector and
then return a matrix of solution vectors. For instance, in E IGEN the following function call accomplishes
this:
Eigen::MatrixXd X = A.lu().solve(B);

Its asymptotic complexity is O(n²(n + k)) for n, k → ∞.

C++ code 2.3.1.10: Gaussian elimination with multiple r.h.s. → Code 2.3.1.4 ➺ GITLAB
2    //! Gauss elimination without pivoting, X = A^{-1} B
3    //! A must be an n x n-matrix, B an n x m-matrix
4    //! Result is returned in matrix X
5    void gausselimsolvemult(const MatrixXd &A, const MatrixXd &B,
6                            MatrixXd &X) {
7      const Eigen::Index n = A.rows();
8      const Eigen::Index m = B.cols();
9      MatrixXd AB(n, n + m);  // Augmented matrix [A, B]
10     AB << A, B;
11     // Forward elimination, do not forget the B part of the Matrix
12     for (Eigen::Index i = 0; i < n - 1; ++i) {
13       const double pivot = AB(i, i);
14       for (Eigen::Index k = i + 1; k < n; ++k) {
15         const double fac = AB(k, i) / pivot;
16         AB.block(k, i + 1, 1, m + n - i - 1) -= fac * AB.block(i, i + 1, 1, m + n - i - 1);
17       }
18     }
19     // Back substitution
20     AB.block(n - 1, n, 1, m) /= AB(n - 1, n - 1);
21     for (Eigen::Index i = n - 2; i >= 0; --i) {
22       for (Eigen::Index l = i + 1; l < n; ++l) {
23         AB.block(i, n, 1, m) -= AB.block(l, n, 1, m) * AB(i, l);
24       }
25       AB.block(i, n, 1, m) /= AB(i, i);
26     }
27     X = AB.rightCols(m);
28   }


y
Concerning the next two remarks: For understanding or analyzing special variants of Gaussian elimination,
it is useful to be aware of
• the effects of elimination steps on the level of matrix blocks, cf. § 1.3.1.13,
• and of the recursive nature of Gaussian elimination.
Remark 2.3.1.11 (Gaussian elimination via rank-1 modifications) We can view Gaus-
sian elimination from the perspective of matrix block operations: Then the first step
of Gaussian elimination with pivot α ≠ 0, cf. (2.3.1.2), can be expressed as

   A := [ α    cᵀ ]          A′ := [ α    cᵀ                  ]
        [ d    C  ]   →            [ 0    C′ := C − (d cᵀ)/α  ] .      (2.3.1.12)

The update of the lower-right block, C′ = C − (d cᵀ)/α, is a rank-1 modification of C.
Terminology: Adding a tensor product of two vectors to a matrix is called a rank-1 modification of that
matrix, see also § 2.6.0.12 below.
Notice that the transformation (2.3.1.12) is applied to the resulting lower-right block C′ in the next elimina-
tion step. Thus Gaussian elimination can be realized by successive rank-1 modifications applied to smaller
and smaller lower-right blocks of the matrix. An implementation in this spirit is given in Code 2.3.1.13.

C++ code 2.3.1.13: GE by rank-1 modification ➺ GITLAB

2    //! in-situ Gaussian elimination, no pivoting
3    //! right hand side in rightmost column of A
4    //! back substitution is not done in this code!
5    void blockgs(Eigen::MatrixXd &A) {
6      const Eigen::Index n = A.rows();
7      for (Eigen::Index i = 1; i < n; ++i) {
8        // rank-1 modification of C
9        A.bottomRightCorner(n - i, n - i + 1) -=
             A.col(i - 1).tail(n - i) * A.row(i - 1).tail(n - i + 1) / A(i - 1, i - 1);
10       A.col(i - 1).tail(n - i).setZero();  // set d = 0
11     }
12   }

(The right-hand side b is stored in the rightmost column, b ∼ A(:, end).)

In this code the Gaussian elimination is carried out in situ: the matrix A is replaced with the transformed
matrices during elimination. If the matrix is not needed later this offers maximum efficiency. An in-situ
LU-decomposition as described in Rem. 2.3.2.11 could also be performed by Code 2.3.1.13 after a modi-
fication of Line 10. y
Remark 2.3.1.14 (Block Gaussian elimination) Recall the “principle” from § 1.3.1.13: deal with block
matrices (“matrices of matrices”) like regular matrices (except for commutativity of multiplication!). This
suggests a block view of Gaussian elimination:


Given: regular matrix A ∈ K n,n with sub-matrices/blocks


A11 := (A)1:k,1:k , A12 = (A)1:k,k+1,n ,
k < n,
A21 := (A)k+1:n,1:k , A22 = (A)k+1:n,k+1,n ,
and a right-hand-side vector b ∈ K n split into b1 = (b)1:k , b2 = (b)k+1:n
We apply the usual row transformations from (2.3.1.2) on the level of matrix block to this block-partitioned
linear system using A11 as pivot block. Of course, we have to assume that A11 is invertible, generalizing
the assumption that eligible pivot elements must not be zero. Again, the manipulations can be broken
down into an elimination step ❶ and a backsubstitution step ❷.
   
   [ A11   A12 | b1 ]   ❶    [ A11   A12                 | b1                 ]
   [ A21   A22 | b2 ]  −→    [ 0     A22 − A21 A11⁻¹ A12 | b2 − A21 A11⁻¹ b1  ]

                        ❷    [ I     0 | A11⁻¹ (b1 − A12 S⁻¹ bS) ]
                       −→    [ 0     I | S⁻¹ bS                  ] ,

where we abbreviated S := A22 − A21 A11⁻¹ A12, a matrix known as the Schur complement, see
Rem. 2.3.2.19, and bS := b2 − A21 A11⁻¹ b1.
We can read off the solution of the block-partitioned linear system from the above Gaussian elimina-
tion:
    
   [ A11   A12 ] [ x1 ]   [ b1 ]           x2 = S⁻¹ bS ,
   [ A21   A22 ] [ x2 ] = [ b2 ]    ⇒     x1 = A11⁻¹ (b1 − A12 S⁻¹ bS) .      (2.3.1.15)
y
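
The block solution formula (2.3.1.15) translates directly into code. The following EIGEN sketch assumes that the top-left k × k block A11 is invertible; the function name and interface are ours:

#include <Eigen/Dense>

// Solve Ax = b via the Schur complement S of the k x k block A11, cf. (2.3.1.15).
Eigen::VectorXd blockSolve(const Eigen::MatrixXd &A, const Eigen::VectorXd &b,
                           Eigen::Index k) {
  const Eigen::Index n = A.rows();
  const Eigen::MatrixXd A11 = A.topLeftCorner(k, k);
  const Eigen::MatrixXd A12 = A.topRightCorner(k, n - k);
  const Eigen::MatrixXd A21 = A.bottomLeftCorner(n - k, k);
  const Eigen::MatrixXd A22 = A.bottomRightCorner(n - k, n - k);
  const Eigen::PartialPivLU<Eigen::MatrixXd> lu11(A11);     // factorize A11 once
  const Eigen::MatrixXd S = A22 - A21 * lu11.solve(A12);    // Schur complement
  const Eigen::VectorXd bS = b.tail(n - k) - A21 * lu11.solve(b.head(k));
  Eigen::VectorXd x(n);
  x.tail(n - k) = S.lu().solve(bS);                         // x2 = S^{-1} bS
  x.head(k) = lu11.solve(b.head(k) - A12 * x.tail(n - k));  // x1 = A11^{-1}(b1 - A12 x2)
  return x;
}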

2.3.2 LU-Decomposition
A matrix factorization (ger. Matrixzerlegung) expresses a general matrix A as product of two special (fac-
tor) matrices. Requirements for these special matrices define the matrix factorization. Matrix factorizations
come with the mathematical issue of existence & uniqueness, and pose the numerical challenge of finding
algorithms for computing the factor matrices (efficiently and stably).
Matrix factorizations
☞ often capture the essence of algorithms in compact form (here: Gaussian elimination),
☞ are important building blocks for complex algorithms,
☞ are key theoretical tools for algorithm analysis.
In this section the forward elimination step of Gaussian elimination will be related to a special matrix
factorization, the so-called LU-decomposition or LU-factorization.

Supplementary literature. The LU-factorization should be well known from the introductory

linear algebra course. In case you need to refresh your knowledge, please consult one of the
following:
• textbook by Nipp & Stoffer [NS02, Sect. 2.4],
• book by M. Hanke-Bourgeois [Han02, p. II.4],
• linear algebra lecture notes by M. Gutknecht [Gut09, Sect. 3.1],
• textbook by Quarteroni et al. [QSS00, Sect.3.3.1],
• Sect. 3.5 of the book by Dahmen & Reusken,


• Sect. 5.1 of the textbook by Ascher & Greif [AG11].


See also (2.3.2.1) below.
Recall the gist of Gaussian elimination split into the two steps of forward elimination and backsubstitution
with (multiple) right-hand-side vectors appended to the coefficient matrix as rightmost columns:

     
[Schematic: augmented matrix → (row transformations) → upper triangular left block → (row transformations)
→ identity left block.]

Here: row transformation =̂ adding a multiple of a matrix row to another row, multiplying a row with a
non-zero scalar (number), or swapping two rows (more special row transformations are discussed in
Rem. 1.3.1.12).
Note: Row transformations preserve regularity of a matrix and, thus, are suitable for transforming linear
systems of equations: they will not affect the solution when applied to both the coefficient matrix
and right-hand-side vector.
Rem. 1.3.1.12: row transformations can be realized by multiplication from the left with suitable transformation
matrices. By multiplying these transformation matrices we can emulate the effect of successive row
transformations through left multiplication with a matrix T:

   A  −→ (row transformations) −→  A′   ⇔   TA = A′ .
Now we want to determine the T for the forward elimination step of Gaussian elimination.
EXAMPLE 2.3.2.1 (Gaussian elimination and LU-factorization → [NS02, Sect. 2.4], [Han02, p. II.4],
[Gut09, Sect. 3.1]) We revisit the LSE from Ex. 2.3.1.1 and carry out (forward) Gaussian elimination:
    
   [ 1   1   0 ] [x1]   [ 4 ]            x1 +  x2       =  4
   [ 2   1  -1 ] [x2] = [ 1 ]    ←→    2x1 +  x2 - x3  =  1
   [ 3  -1  -1 ] [x3]   [-3 ]           3x1 -  x2 - x3  = -3

   [ 1      ]  [ 1   1   0 |  4 ]        [ 1      ]  [ 1   1   0 |  4 ]
   [    1   ]  [ 2   1  -1 |  1 ]   ➤   [ 2  1   ]  [ 0  -1  -1 | -7 ]   ➤
   [       1]  [ 3  -1  -1 | -3 ]        [ 0     1]  [ 3  -1  -1 | -3 ]

   [ 1      ]  [ 1   1   0 |  4 ]        [ 1      ]  [ 1   1   0 |  4 ]
   [ 2  1   ]  [ 0  -1  -1 | -7 ]   ➤   [ 2  1   ]  [ 0  -1  -1 | -7 ]
   [ 3  0  1]  [ 0  -4  -1 |-15 ]        [ 3  4  1]  [ 0   0   3 | 13 ]
                                          =: L        =: U (augmented with the transformed r.h.s.)
As before we highlight the pivot rows and write the pivot element in bold. In addition, we let the
the negative multipliers take the places of matrix entries made to vanish; we color these entries red.
After this replacement we make the “surprising” observation that A = L U! y
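
The observation is easy to verify in EIGEN; a minimal sketch with the matrices of this example:

#include <Eigen/Dense>
#include <iostream>

int main() {
  Eigen::Matrix3d A, L, U;
  A << 1, 1, 0, 2, 1, -1, 3, -1, -1;   // system matrix of Ex. 2.3.1.1
  L << 1, 0, 0, 2, 1, 0, 3, 4, 1;      // recorded multipliers, unit diagonal
  U << 1, 1, 0, 0, -1, -1, 0, 0, 3;    // result of forward elimination
  std::cout << "||A - L*U|| = " << (A - L * U).norm() << std::endl;   // prints 0
  return 0;
}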


The link between Gaussian elimination and matrix factorization, an explanation for the observation made
in Ex. 2.3.2.1, becomes clear by recalling that row transformations result from multiplications with
elimination matrices:
    
              [ 1         0  ···  0 ] [ a1 ]   [ a1 ]
              [ -a2/a1    1       0 ] [ a2 ]   [ 0  ]
   a1 ≠ 0 :   [ -a3/a1        1     ] [ a3 ] = [ 0  ]      (2.3.2.2)
              [   ⋮               ⋱ ] [ ⋮  ]   [ ⋮  ]
              [ -an/a1    0       1 ] [ an ]   [ 0  ]

n − 1 steps of Gaussian forward elimination immediately give rise to a matrix factorization (non-zero
pivot elements assumed)

   A = L_1 · ··· · L_{n−1} U ,   with elimination matrices L_i, i = 1, . . . , n−1, and an upper triangular matrix U ∈ R^{n,n}.

    
   [ 1             ]   [ 1              ]     [ 1              ]
   [ l2  1         ]   [ 0   1          ]     [ l2  1          ]
   [ l3      1     ] · [ 0   h3  1      ]  =  [ l3  h3  1      ]
   [ ⋮          ⋱  ]   [ ⋮   ⋮       ⋱  ]     [ ⋮   ⋮       ⋱  ]
   [ ln  0       1 ]   [ 0   hn  0    1 ]     [ ln  hn  0    1 ]
The matrix products L_1 · ··· · L_{n−1} yield normalized lower triangular matrices,
whose entries are the multipliers a_ik/a_kk from (2.3.1.2) → Ex. 2.3.1.1.

The matrix factorization that “automatically” emerges during Gaussian forward elimination has a special
name:

Definition 2.3.2.3. LU-decomposition/LU-factorization

Given a square matrix A ∈ K^{n,n}, an upper triangular matrix U ∈ K^{n,n} and a normalized lower
triangular matrix L ∈ K^{n,n} (→ Def. 1.1.2.3) form an LU-decomposition/LU-factorization of A, if A = LU.

   1   






1
1
1
0 





   1   
   1   
  =  1  ·  ,
   1
  
     






1
1



 0 

1
1
A = L · U.

Using this notion we can summarize what we have learned from studying elimination matrices:
✤ ✜
The (forward) Gaussian elimination (without pivoting), for Ax = b, A ∈ R n,n ,
if possible, is alge-
braically equivalent to an LU-factorization/LU-decomposition A = LU of A into a normalized lower
triangular matrix L and an upper triangular matrix U, [DR08, Thm. 3.2.1], [NS02, Thm. 2.10], [Gut09,

✣ ✢
Sect. 3.1].


Algebraically equivalent = ˆ when carrying out the forward elimination in situ as in Code 2.3.1.4 and storing
the multipliers in a lower triangular matrix as in Ex. 2.3.2.1, then the latter will contain the L-factor and the
original matrix will be replaced with the U-factor.

Lemma 2.3.2.4. Existence of LU -decomposition

The LU -decomposition of A ∈ K n,n exists, if all submatrices (A)1:k,1:k , 1 ≤ k ≤ n, are regular.

Proof. We adopt a block matrix perspective (→ § 1.3.1.13) and employ induction w.r.t. n:
n = 1: assertion trivial
n − 1 → n: The induction hypothesis ensures the existence of a normalized lower triangular matrix L̃ and
a regular upper triangular matrix Ũ such that Ã = L̃Ũ, where Ã is the upper left (n − 1) × (n − 1) block of A:

   A = [ Ã    b ]   =   [ L̃    0 ] [ Ũ    y ]   =:  L U .
       [ aᵀ   α ]       [ xᵀ   1 ] [ 0    ξ ]

Then solve

➊  L̃y = b         → provides y ∈ K^{n−1} ,
➋  xᵀŨ = aᵀ       → provides x ∈ K^{n−1} ,
➌  xᵀy + ξ = α    → provides ξ ∈ K .

Regularity of A implies ξ ≠ 0 (why?), so that U will be regular, too.



§2.3.2.5 (Uniqueness of LU -decomposition) Regular upper triangular matrices and normalized lower
triangular matrices form matrix groups (→ Lemma 1.3.1.9). Their only common element is the identity
matrix.

L1 U1 = L2 U2 ⇒ L2−1 L1 = U2 U1−1 = I .

Since inverses of matrices are unique, so are the LU-factors: U1 = U2 and L1 = L2 . y


§2.3.2.6 (Basic algorithm for computing LU-decomposition) There are direct ways to determine the
factor matrices of the LU -decomposition [Gut09, Sect. 3.1], [QSS00, Sect. 3.3.3] and, of course, they are
closely related to forward Gaussian elimination. To derive the algorithm we study the entries of the product
of a normalized lower triangular and an upper triangular matrix, see Def. 1.1.2.3:
     
 0     
     
     
     
 · = 
     
     
     
     

Taking into account the zero entries known a priori, we arrive at



   LU = A   ⇒   a_ik = Σ_{j=1}^{min{i,k}} l_ij u_jk = { Σ_{j=1}^{i−1} l_ij u_jk + 1 · u_ik ,   if i ≤ k ,
                                                      { Σ_{j=1}^{k−1} l_ij u_jk + l_ik u_kk ,  if i > k .      (2.3.2.7)

This reveals how to compute the entries of L and U sequentially. We start with the top row of U, which
agrees with that of A, and then work our way towards the bottom right corner:


➤ • row by row computation of U,
   • column by column computation of L.

Entries of A can be replaced with those of L, U! (so-called in situ/in place computation;
Crout’s algorithm, [Gut09, Alg. 3.1])

[Fig. 40: alternating computation of the rows of U and the columns of L]

The following code follows this sequential computation scheme:

C++ code 2.3.2.8: LU-factorization ➺ GITLAB


2    //! Algorithm of Crout: LU-factorization of A in K^{n,n}
3    std::pair<MatrixXd, MatrixXd> lufak(const MatrixXd &A) {
4      const Index n = A.rows();
5      assert(n == A.cols());  // Ensure matrix is square
6      MatrixXd L{MatrixXd::Identity(n, n)};
7      MatrixXd U{MatrixXd::Zero(n, n)};
8      for (Index k = 0; k < n; ++k) {
9        // Compute row of U
10       for (Index j = k; j < n; ++j) {
11         U(k, j) = A(k, j) - (L.block(k, 0, 1, k) * U.block(0, j, k, 1))(0, 0);
12       }
13       // Compute column of L
14       for (Index i = k + 1; i < n; ++i) {
15         L(i, k) = (A(i, k) - (L.block(i, 0, 1, k) * U.block(0, k, k, 1))(0, 0)) /
16                   U(k, k);
17       }
18     }
19     return {L, U};
20   }

It is instructive to compare this code with a simple implementation of the matrix product of a normalized
lower triangular and an upper triangular matrix. From this perspective the LU-factorization looks like the
“inversion” of matrix multiplication:

C++ code 2.3.2.9: matrix multiplication L · U ➺ GITLAB


2    //! Multiplication of normalized lower/upper triangular matrices
3    MatrixXd lumult(const MatrixXd &L, const MatrixXd &U) {
4      const Eigen::Index n = L.rows();
5      assert(n == L.cols() && n == U.cols() && n == U.rows());
6      MatrixXd A{MatrixXd::Zero(n, n)};
7      for (Eigen::Index k = 0; k < n; ++k) {
8        for (Eigen::Index j = k; j < n; ++j) {
9          A(k, j) = U(k, j) + (L.block(k, 0, 1, k) * U.block(0, j, k, 1))(0, 0);
10       }
11       for (Eigen::Index i = k + 1; i < n; ++i) {
12         A(i, k) =
13             (L.block(i, 0, 1, k) * U.block(0, k, k, 1))(0, 0) + L(i, k) * U(k, k);
14       }
15     }
16     return A;
17   }

Observe: Solving for entries L(i,k) of L and U(k,j) of U in the multiplication of an upper triangular
and normalized lower triangular matrix (→ Code 2.3.2.9) yields the algorithm for LU-factorization (→
Code 2.3.2.8). y

The computational cost of LU-factorization is immediate from Code 2.3.2.8 and the same as for Gaussian
elimination, cf. § 2.3.1.5:
   Asymptotic complexity of LU-factorization of A ∈ R^{n,n} = (2/3)n³ + O(n²) = O(n³) for n → ∞ .      (2.3.2.10)

Remark 2.3.2.11 (In-situ LU-decomposition) “In situ” is Latin and means “in place”. Many library
routines provide routines that overwrite the matrix A with its LU-factors in order to save memory when
the original matrix is no longer needed. This is possible because the number of unknown entries of the
LU-factors combined exactly agrees with the number of entries of A. The convention is to replace the
strict lower-triangular part of A with L, and the upper triangular part with U:

   A  −→  [ U in the upper triangular part; L (without its unit diagonal) in the strict lower triangular part ]

y
Remark 2.3.2.12 (Recursive LU-factorization) Recall Rem. 2.3.1.11 and the recursive view of Gaussian
elimination it suggests, because in (2.3.1.12) an analoguous row transformation can be applied to the
remaining right-lower block C′ .
In light of the close relationship between Gaussian elimination and LU-factorization there will also be a
recursive version of LU-factorization.
The following code implements the recursive in situ (in place) LU-decomposition of A ∈ R n,n (without
pivoting). It is closely related to Code 2.3.1.13, but now both L and U are stored in place of A:

C++ code 2.3.2.13: Recursive LU-factorization ➺ GITLAB

2   //! in situ recursive LU-factorization
3   MatrixXd lurec(const MatrixXd &A) {
4     const Eigen::Index n = A.rows();
5     MatrixXd result(n, n);
6     if (n > 1) {
7       const VectorXd fac = A.col(0).tail(n - 1) / A(0, 0);
8       result.bottomRightCorner(n - 1, n - 1) =
          lurec(A.bottomRightCorner(n - 1, n - 1) - fac * A.row(0).tail(n - 1));
9       result.row(0) = A.row(0); result.col(0).tail(n - 1) = fac;
10      return result;
11    }
12    return A;
13  }


Refer to (2.3.1.12) to understand lurec: the rank-1 modification of the lower (n − 1) × (n − 1)-block of
the matrix is done in Line 7-Line 8 of the code.

C++ code 2.3.2.14: Driver for recursive LU-factorization of Code 2.3.2.13 ➺ GITLAB

//! post-processing: extract L and U
void lurecdriver(const MatrixXd &A, MatrixXd &L, MatrixXd &U) {
  const MatrixXd A_dec = lurec(A);
  // post-processing: extract L and U
  U = A_dec.triangularView<Upper>();
  L.setIdentity();
  L += A_dec.triangularView<StrictlyLower>();
}

y
§2.3.2.15 (Using LU-factorization to solve a linear system of equations) An intermediate
LU-factorization paves the way for a three-stage procedure for solving an n × n linear system of equations.
Solving Ax = b:  ① LU-decomposition A = LU,                 #elementary operations = (1/3)·n(n − 1)(n + 1)
                 ② forward substitution, solve Lz = b,      #elementary operations = (1/2)·n(n − 1)
                 ③ backward substitution, solve Ux = z,     #elementary operations = (1/2)·n(n + 1)
➣ The asymptotic complexity of the complete three-stage algorithm is (in leading order) the same as for
Gaussian elimination (The bulk of computational cost is incurred in the factorization step ①).
However, the perspective of LU-factorization reveals that the solution of linear systems of equations can be
split into two separate phases with different asymptotic complexity in terms of the number n of unknowns:
    setup phase (factorization)        +        elimination phase (forward/backward substitution)
    Cost: O(n³)                                  Cost: O(n²)
y
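For illustration, the three stages can be strung together in a few lines of Eigen code; this is only a sketch (assuming lufak() from Code 2.3.2.8 is in scope), not a library-grade solver:

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Solve Ax = b via the three-stage procedure of § 2.3.2.15.
VectorXd lusolve(const MatrixXd &A, const VectorXd &b) {
  const auto [L, U] = lufak(A);                                   // ① factorization, O(n^3)
  const VectorXd z = L.triangularView<Eigen::Lower>().solve(b);   // ② forward substitution, O(n^2)
  return U.triangularView<Eigen::Upper>().solve(z);               // ③ backward substitution, O(n^2)
}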

Remark 2.3.2.16 (Rationale for using LU-decomposition in algorithms) Gauss elimination and
LU-factorization for the solution of a linear system of equations (→ § 2.3.2.15) are equivalent and only
differ in the ordering of the steps.

Then, why is it important to know about LU-factorization?


Because in the case of LU-factorization the expensive forward elimination and the less expensive (for-
ward/backward) substitutions are separated, which sometimes can be exploited to reduce computational
cost, as highlighted in Rem. 2.5.0.10 below. y
Remark 2.3.2.17 (“Partial LU-decompositions” of principal minors) The algorithm from § 2.3.2.6 reveals that the computation of the LU-decomposition of a matrix proceeds from top-left to bottom-right. This implies the locality property discussed in this remark. To understand its heading we recall that a principal minor refers to the left upper block of a matrix.
The following “visual rule” helps identify the structure of the LU-factors of a matrix:

    [ A11  A12 ]   [ L11   O  ]   [ U11  U12 ]
    [ A21  A22 ] = [ L21  L22 ] · [  O   U22 ] ,   in particular   A11 = L11·U11 .        (2.3.2.18)

The left-upper blocks of both L and U in the LU-factorization of A depend only on the corresponding
left-upper block of A! y
Remark 2.3.2.19 (Block LU-factorization) In the spirit of § 1.3.1.13 we can also adopt a matrix-block perspective of LU-factorization. This is a natural idea in light of the close connection between matrix multiplication and matrix factorization observed in § 2.3.2.6:

    Block matrix multiplication (1.3.1.14)  ∼=  block LU-decomposition.

We consider a block-partitioned matrix

    A = [ A11  A12 ; A21  A22 ] ,    A11 ∈ K^{n,n} regular , A12 ∈ K^{n,m} , A21 ∈ K^{m,n} , A22 ∈ K^{m,m} .

The block LU-decomposition arises from the block Gaussian forward elimination of Rem. 2.3.1.14 in the same way as the standard LU-decomposition is spawned by the entry-wise Gaussian elimination:

    [ A11  A12 ; A21  A22 ] = [ I  O ; A21·A11^{-1}  I ] · [ A11  A12 ; O  S ]        (2.3.2.20)
    (block LU-factorization)   with Schur complement   S := A22 − A21·A11^{-1}·A12 .

Under the assumption that A11 is invertible, the Schur complement matrix S is invertible, if and only if this
holds for A. y
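A possible realization of (2.3.2.20) in Eigen is sketched below; the function name blocklu is made up for this illustration, and the code assumes that A11 is regular:

#include <Eigen/Dense>
#include <utility>
using Eigen::MatrixXd;

// Block LU-factorization (2.3.2.20) for A = [A11, A12; A21, A22], A11 of size n x n.
std::pair<MatrixXd, MatrixXd> blocklu(const MatrixXd &A, Eigen::Index n) {
  const Eigen::Index m = A.rows() - n;
  const MatrixXd A11 = A.topLeftCorner(n, n);
  const MatrixXd A12 = A.topRightCorner(n, m);
  const MatrixXd A21 = A.bottomLeftCorner(m, n);
  const MatrixXd A22 = A.bottomRightCorner(m, m);
  const auto lu11 = A11.lu();                       // assumes A11 regular
  const MatrixXd X = lu11.solve(A12);               // X = A11^{-1} A12
  const MatrixXd S = A22 - A21 * X;                 // Schur complement
  // Y = A21 A11^{-1}, obtained from A11^T Y^T = A21^T
  const MatrixXd Y = A11.transpose().lu().solve(A21.transpose()).transpose();
  MatrixXd Lb = MatrixXd::Identity(n + m, n + m);
  Lb.bottomLeftCorner(m, n) = Y;                    // block lower factor [I, O; A21 A11^{-1}, I]
  MatrixXd Ub = MatrixXd::Zero(n + m, n + m);
  Ub.topLeftCorner(n, n) = A11;
  Ub.topRightCorner(n, m) = A12;
  Ub.bottomRightCorner(m, m) = S;                   // block upper factor [A11, A12; O, S]
  return {Lb, Ub};
}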
Review question(s) 2.3.2.21 (Gaussian elimination and LU-decomposition)
(Q2.3.2.21.A) Performing Gaussian elimination by hand compute the solution of the following 4 × 4 linear
system of equations
    
    [  2  −1   0   0 ] [x1]   [0]
    [ −1   2  −1   0 ] [x2]   [0]
    [  0  −1   2  −1 ] [x3] = [0] .
    [  0   0  −1   2 ] [x4]   [1]

(Q2.3.2.21.B) Give an example of a 2 × 2-matrix, for which there does not exist an LU-decomposition.
(Q2.3.2.21.C) Assume that one of the LU-factors of a square matrix A ∈ R n,n is diagonal. What proper-
ties of A can you infer?


(Q2.3.2.21.D) Suppose the LU-factors L, U ∈ R n,n of a square matrix A ∈ R n,n exist and have been
computed already. Sketch an algorithm for computing the determinant det A.
From linear algebra remember that the determinant of the product of two square matrices is the product
of the determinants of its factors.
(Q2.3.2.21.E) Compute the block LU-decomposition of the partitioned matrix
 
    A = [ Ik  B⊤ ; B  O ] ∈ R^{n,n} ,   B ∈ R^{n−k,k} ,  k ∈ {1, . . . , n − 1} .

When is this matrix regular?


(Q2.3.2.21.F) Predict the asymptotic computational effort of an efficient algorithm for computing the block
LU-decomposition of
 
    A = [ Ik  B⊤ ; B  O ] ∈ R^{n,n} ,   B ∈ R^{n−k,k} ,  k ∈ {1, . . . , n − 1} ,

in terms of n, k → ∞.
(Q2.3.2.21.G) What is the inverse of the block matrix
 
    [ O  A ; A⊤  O ] ∈ R^{2n,2n} ,   A ∈ R^{n,n} regular ?

Use block Gaussian elimination to find it and express it in terms of A−1 .


(Q2.3.2.21.H) [Schur complement] Consider the following block partitioning of a matrix
A ∈ R^{n+m,n+m}:

    A = [ A11  A12 ; A21  A22 ] ,    A11 ∈ R^{n,n} , A12 ∈ R^{n,m} , A21 ∈ R^{m,n} , A22 ∈ R^{m,m} .

We assume that A11 is regular, which renders the Schur complement

    S := A22 − A21 A11^{-1} A12 ∈ R^{m,m}

well-defined.
Show that A is singular, if and only if S is singular :

    N(A) ≠ {0}   ⟺   N(S) ≠ {0} .

Hint.
• First show that if [x ; y] ∈ N(A), then Sy = 0.
• For y ∈ R^m such that Sy = 0 consider the vector [ −A11^{-1} A12 y ; y ].
y

2.3.3 Pivoting
We know from linear algebra [NS02, Sect. 1.1] that sometimes we have to swap rows of a linear system
of equations in order to carry out Gaussian elimination without encountering a division by zero. Here is a


2 × 2 example:
         
    [ 0  1 ] [x1]   [b1]                        [ 1  0 ] [x1]   [b2]
    [ 1  0 ] [x2] = [b2]                        [ 0  1 ] [x2] = [b1]

    breakdown of Gaussian elimination           Gaussian elimination feasible
    (pivot element = 0)

Remedy (known from linear algebra):   pivoting =ˆ avoiding zero pivot elements by swapping rows.

EXAMPLE 2.3.3.1 (Pivoting and numerical stability → [DR08, Example 3.2.3]) Gaussian elimination
for the 2 × 2 linear system of equations studied in this example will never lead to a division by zero.
Nevertheless, Gaussian elimination runs into problems.
MatrixXd A(2, 2);
A << 5.0e-17, 1.0, 1.0, 1.0;
VectorXd b(2);
VectorXd x2(2);
b << 1.0, 2.0;
const VectorXd x1 = A.fullPivLu().solve(b);
gausselimsolve::gausselimsolve(A, b, x2);   // see Code 2.3.1.10
const auto [L, U] = lufak::lufak(A);        // see Code 2.3.2.8
const VectorXd z = L.lu().solve(b);
const VectorXd x3 = U.lu().solve(z);
std::cout << "x1 = \n" << x1 << "\nx2 = \n" << x2
          << "\nx3 = \n" << x3 << std::endl;

Output:
x1 =
1
1
x2 =
0
1
x3 =
0
1

We get different results from Eigen’s built-in linear solver and our hand-crafted Gaussian elimination! Let’s see what we should expect as an “exact solution”:

    A = [ ε  1 ; 1  1 ] ,   b = [ 1 ; 2 ]   ⇒   x = [ 1/(1−ε) ; (1−2ε)/(1−ε) ] ≈ [ 1 ; 1 ]   for |ε| ≪ 1 .
What is wrong with Eigen? To make sense of our observations we have to rely on our insights into roundoff errors gained in Section 1.5.3. Armed with knowledge about the behavior of machine numbers and roundoff errors we can understand what is going on in this example:

➊ We “simulate” floating point arithmetic for straightforward LU-factorization: if ε ≤ (1/2)·EPS, EPS =ˆ machine precision,

    L = [ 1  0 ; ε^{-1}  1 ] ,   U = [ ε  1 ; 0  1 − ε^{-1} ]  (∗)=  [ ε  1 ; 0  −ε^{-1} ] =: Ũ   in M !        (2.3.3.2)

(∗): because 1 +̃ 2/EPS = 2/EPS, see Exp. 1.5.3.14.


   
The solution of LŨx = b is x̃ = [ 2ε ; 1 − 2ε ] ≈ [ 0 ; 1 ], which is a meaningless result!
➋ Let’s conduct an LU-factorization in M after swapping rows:

    A = [ 1  1 ; ε  1 ]   ⇒   L = [ 1  0 ; ε  1 ] ,   U = [ 1  1 ; 0  1 − ε ]  =  Ũ := [ 1  1 ; 0  1 ]   in M .        (2.3.3.3)

The solution of LŨx = b is x̃ = [ 1 + 2ε ; 1 − 2ε ], which is a sufficiently accurate result!
From Section 1.5.5, Def. 1.5.5.19 remember the concept of numerical stability, see also [DR08, Sect. 2.3].
An LU-decomposition computed in M is stable, if it is the exact LU-decomposition of a slightly perturbed
matrix. Is this satisfied for the LU-decompositions obtained in ➊ and ➋?
 
➊, no row swapping, → (2.3.3.2):    LŨ = A + E   with   E = [ 0  0 ; 0  −1 ]    ➞  unstable !

➋, after row swapping, → (2.3.3.3):  LŨ = Ã + E   with   E = [ 0  0 ; 0  ε ]    ➞  stable !
Clearly, swapping rows is necessary for being able to stably compute the LU-decomposition in floating
point arithmetic. y

Suitable pivoting is essential for controlling the impact of roundoff errors


on Gaussian elimination (→ Section 1.5.5, [NS02, Sect. 2.5])

The main rationale behind pivoting in numerical linear algebra is not to steer clear of division by
zero, but to ensure numerical stability of Gaussian elimination.

§2.3.3.4 (Partial pivoting) In linear algebra it was easy to decide when pivoting should be done. We just had to check whether a potential pivot element was equal to zero. The situation is murky in numerical linear algebra because
(i) a test == 0.0 is meaningless in floating point computations, see Rem. 1.5.3.15,
(ii) the goal of numerical stability is hard to quantify.
Nevertheless there is a very successful strategy and it is known as partial pivoting: Writing a_{i,j}, i, j ∈ {k, . . . , n}, for the elements of the intermediate matrix obtained after k < n steps of Gaussian elimination applied to an n × n LSE, we choose the index j of the next pivot row as follows:

    j ∈ {k, . . . , n}   such that    |a_{j,k}| / max{ |a_{j,l}| , l = k, . . . , n }   →   max .        (2.3.3.5)

In a sense, we choose the relatively largest pivot element compared to the other entries in the same row
[NS02, Sect. 2.5]. y
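In isolation, the selection rule (2.3.3.5) can be coded in a few lines of Eigen; the following sketch (the function name pivot_row is made up, and the near-singularity safeguard of Code 2.3.3.8 below is omitted) returns the row index of the chosen pivot row for elimination step k:

#include <Eigen/Dense>
using Eigen::MatrixXd;

// Row index of the pivot row according to (2.3.3.5), for elimination step k
// applied to the intermediate matrix A.
Eigen::Index pivot_row(const MatrixXd &A, Eigen::Index k) {
  const Eigen::Index n = A.rows();
  Eigen::Index j = 0;
  // |a_{j,k}| divided by the largest modulus in row j (columns k,...,n-1)
  (A.col(k).tail(n - k).cwiseAbs().cwiseQuotient(
       A.block(k, k, n - k, n - k).cwiseAbs().rowwise().maxCoeff()))
      .maxCoeff(&j);
  return k + j;   // j is relative to the sub-block starting at row k
}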
EXAMPLE 2.3.3.6 (Gaussian elimination with pivoting for 3 × 3-matrix) The following sequence of
matrices is produced by Gaussian elimination with partial pivoting:
         
A = [1 2 2; 2 −3 2; 1 24 0]  →➊  [2 −3 2; 1 2 2; 1 24 0]  →➋  [2 −3 2; 0 3.5 1; 0 25.5 −1]  →➌  [2 −3 2; 0 25.5 −1; 0 3.5 1]  →➍  [2 −3 2; 0 25.5 −1; 0 0 1.1373]

➊: swap rows 1 & 2.


➋: elimination with top row as pivot row
➌: swap rows 2 & 3
➍: elimination with 2nd row as pivot row y

§2.3.3.7 (Algorithm: Gaussian elimination with partial pivoting)


C++ code 2.3.3.8: Gaussian elimination with pivoting: extension of Code 2.3.1.4 ➺ GITLAB

2   //! Solving an LSE Ax = b by Gaussian elimination with partial pivoting
3   //! A must be an n × n-matrix, b an n-vector
4   void gepiv(const MatrixXd &A, const VectorXd &b, VectorXd &x) {
5     const Eigen::Index n = A.rows();
6     MatrixXd Ab(n, n + 1);
7     Ab << A, b;
8     // Forward elimination by rank-1 modification, see Rem. 2.3.1.11
9     for (Eigen::Index k = 0; k < n - 1; ++k) {
10      Eigen::Index j = -1; // j = pivot row index
11      // p = relatively largest pivot
12      const double p = (Ab.col(k).tail(n - k).cwiseAbs().cwiseQuotient(
          Ab.block(k, k, n - k, n - k).cwiseAbs().rowwise().maxCoeff())).maxCoeff(&j);
13      if (p < std::numeric_limits<double>::epsilon() * Ab.block(k, k, n - k, n - k).norm()) {
14        throw std::logic_error("nearly singular");
15      }
16      Ab.row(k).tail(n - k + 1).swap(Ab.row(k + j).tail(n - k + 1));
17      Ab.bottomRightCorner(n - k - 1, n - k) -=
          Ab.col(k).tail(n - k - 1) * Ab.row(k).tail(n - k) / Ab(k, k);
18    }
19    // Back substitution (same as in Code 2.3.1.4)
20    Ab(n - 1, n) = Ab(n - 1, n) / Ab(n - 1, n - 1);
21    for (Eigen::Index i = n - 2; i >= 0; --i) {
22      for (Eigen::Index l = i + 1; l < n; ++l) {
23        Ab(i, n) -= Ab(l, n) * Ab(i, l);
24      }
25      Ab(i, n) /= Ab(i, i);
26    }
27    x = Ab.rightCols(1);
28  }

(The pivot row index j is chosen in Line 12 of the code.)


Explanations to Code 2.3.3.8:
Line 7: Augment matrix A by right hand side vector b, see comments on Code 2.3.1.4 for explanations.
Line 12: Select index j for pivot row according to the recipe of partial pivoting, see (2.3.3.5).

Note: Inefficient implementation above (too many comparisons)! Try to do better!


Line 14: If the pivot element is still very small relative to the norm of the matrix, then we have encountered
an entire column that is close to zero. Gaussian elimination may not be possible in a stable
fashion for this matrix; warn user and terminate.
Line 16: A way to swap rows of a matrix in E IGEN.
Line 17: Forward elimination by means of rank-1-update, see (2.3.1.12).
Line 27: As in Code 2.3.1.4: after back substitution last column of augmented matrix supplies solution
x = A −1 b .
y
§2.3.3.9 (Algorithm: LU-factorization with pivoting) Recall the close relationship between
Gaussian elimination and LU-factorization


➣ LU-factorization with pivoting? Of course, just by rearranging the operations of Gaussian forward elim-
ination with pivoting.

E IGEN-based code for in place LU-factorization of A ∈ R n,n with partial pivoting:

C++ code 2.3.3.10: LU-factorization with partial pivoting ➺ GITLAB

2   void lupiv(MatrixXd &A) { // in situ
3     const Eigen::Index n = A.rows();
4     for (int k = 0; k < n - 1; ++k) {
5       int j = -1; // j = pivot row index
6       // p = relatively largest pivot
7       const double p = (A.col(k).tail(n - k).cwiseAbs().cwiseQuotient(
          A.block(k, k, n - k, n - k).cwiseAbs().rowwise().maxCoeff())).maxCoeff(&j);
8       if (p < std::numeric_limits<double>::epsilon() *
              A.block(k, k, n - k, n - k).norm()) {
9         throw std::logic_error("nearly singular");
10      }
11      A.row(k).tail(n - k - 1).swap(A.row(k + j).tail(n - k - 1));
12      const VectorXd fac = A.col(k).tail(n - k - 1) / A(k, k);
13      A.bottomRightCorner(n - k - 1, n - k - 1) -= fac * A.row(k).tail(n - k - 1);
14      A.col(k).tail(n - k - 1) = fac;
15    }
16  }

Notice that the recursive call is omitted as in Rem. 2.3.1.11.


Explanations to Code 2.3.3.10:
Line 7: Find the relatively largest pivot element p and the index j of the corresponding row of the matrix,
see (2.3.3.5)

Line 8: If the pivot element is still very small relative to the norm of the matrix, then we have encountered
an entire column that is close to zero. The matrix is (close to) singular and LU-factorization does
not exist.
Line 11: Swap the first and the j-th row of the matrix.

Line 12: Initialize the vector of multipliers.

Line 13: Update the lower right (n − k − 1) × (n − k − 1)-block of the matrix by subtracting suitable multiples of the pivot row from the rows below, cf. Rem. 2.3.1.11 and Rem. 2.3.2.12.

Line 14: Reassemble the parts of the LU-factors. The vector of multipliers yields a column of L, see
Ex. 2.3.2.1.
y
Remark 2.3.3.11 (Rationale for partial pivoting policy (2.3.3.5) → [NS02, Page 47]) Why do we
select the relatively largest pivot element in (2.3.3.5)? Because we aim for an algorithm for Gaussian
elimination/LU-decomposition that possesses the highly desirable scale-invariance property. Loosely
speaking, the algorithm should not make different decisions on pivoting when we multiply the LSE with
a regular diagonal matrix from the left. Let us take a closer look at a 2 × 2 example:
Scale linear system of equations from Ex. 2.3.3.1:

    [ 2/ε  0 ; 0  1 ] · [ ε  1 ; 1  1 ] · [ x1 ; x2 ] = [ 2  2/ε ; 1  1 ] · [ x1 ; x2 ] = [ 2/ε ; 2 ] =: b̃

No row swapping would be triggered, if the absolutely largest pivot element were used to select the pivot row:

    [ 2  2/ε ; 1  1 ] = [ 1  0 ; 1/2  1 ] · [ 2  2/ε ; 0  1 − 1/ε ]  ≐  [ 1  0 ; 1/2  1 ] · [ 2  2/ε ; 0  −1/ε ] =: L̃·Ũ   in M .

Using the rules of arithmetic in M (→ Exp. 1.5.3.14), we find

    Ũ^{-1}( L̃^{-1} b̃ ) = [ 0 ; 1 ] ,

which is not an acceptable result.
y

§2.3.3.12 (Theory of pivoting) We view pivoting from the perspective of matrix operations and start with
a matrix view of row swapping.

Definition 2.3.3.13. Permutation matrix

An n-permutation, n ∈ N, is a bijective mapping π : {1, . . . , n} 7→ {1, . . . , n}. The corresponding


permutation matrix Pπ ∈ K n,n is defined by
    (Pπ)_{ij} = { 1 , if j = π(i) ,
                { 0   else.

Example: the permutation (1, 2, 3, 4) ↦ (1, 3, 2, 4) corresponds to

    P = [ 1 0 0 0 ; 0 0 1 0 ; 0 1 0 0 ; 0 0 0 1 ] .

Note:  ✦ P⊤ = P−1 for any permutation matrix P (→ permutation matrices are orthogonal/unitary),
       ✦ left-multiplication Pπ A effects a π-permutation of the rows of A ∈ K^{n,m},
       ✦ right-multiplication APπ effects a π-permutation of the columns of A ∈ K^{m,n}.
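These properties are easy to verify numerically. The following sketch (the helper permMatrix is made up for this illustration and uses 0-based indices) builds Pπ according to Def. 2.3.3.13 and checks them for the permutation of the example above:

#include <Eigen/Dense>
#include <iostream>
#include <vector>

// Build the permutation matrix P_pi of Def. 2.3.3.13 from a 0-based permutation pi
Eigen::MatrixXd permMatrix(const std::vector<Eigen::Index> &pi) {
  const Eigen::Index n = pi.size();
  Eigen::MatrixXd P = Eigen::MatrixXd::Zero(n, n);
  for (Eigen::Index i = 0; i < n; ++i) P(i, pi[i]) = 1.0;  // (P)_{i, pi(i)} = 1
  return P;
}

int main() {
  const Eigen::MatrixXd P = permMatrix({0, 2, 1, 3});   // permutation of the example above
  std::cout << P * P.transpose() << std::endl;          // = identity, so P^T = P^{-1}
  const Eigen::MatrixXd A = Eigen::MatrixXd::Random(4, 3);
  std::cout << P * A << std::endl;                      // rows of A permuted
}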

Lemma 2.3.3.14. Existence of LU-factorization with pivoting → [DR08, Thm. 3.25], [Han02,
Thm. 4.4]

For any regular A ∈ K n,n there is a permutation matrix (→ Def. 2.3.3.13) P ∈ K n,n , a normalized
lower triangular matrix L ∈ K n,n , and a regular upper triangular matrix U ∈ K n,n (→ Def. 1.1.2.3),
such that PA = LU .

Proof. (by induction)

Every regular matrix A ∈ K^{n,n} admits a row permutation, encoded by a permutation matrix P̃ ∈ K^{n,n}, such that A′ := (P̃A)_{1:n−1,1:n−1} is regular (why?).
By the induction assumption there is a permutation matrix P′ ∈ K^{n−1,n−1} such that P′A′ possesses an LU-factorization P′A′ = L′U′. There are x, y ∈ K^{n−1}, γ ∈ K such that

    [ P′  0 ; 0  1 ] P̃ A = [ P′  0 ; 0  1 ] · [ A′  x ; y⊤  γ ] = [ L′U′  P′x ; y⊤  γ ] = [ L′  0 ; c⊤  1 ] · [ U′  d ; 0  α ] ,

if we choose

    d = (L′)^{-1} P′x ,   c = (U′)^{-⊤} y ,   α = γ − c⊤ d ,

which is always possible. Hence PA = LU with the permutation matrix P := [ P′  0 ; 0  1 ] P̃.   ✷ y
EXAMPLE 2.3.3.15 (Ex. 2.3.3.6 cnt’d) Let us illustrate the assertion of Lemma 2.3.3.14 for the small
3 × 3 LSE from Ex. 2.3.3.6:
         
A = [1 2 2; 2 −3 2; 1 24 0]  →➊  [2 −3 2; 1 2 2; 1 24 0]  →➋  [2 −3 2; 0 3.5 1; 0 25.5 −1]  →➌  [2 −3 2; 0 25.5 −1; 0 3.5 1]  →➍  [2 −3 2; 0 25.5 −1; 0 0 1.1373]

    U = [ 2  −3  2 ; 0  25.5  −1 ; 0  0  1.1373 ] ,   L = [ 1  0  0 ; 0.5  1  0 ; 0.5  0.1373  1 ] ,   P = [ 0  1  0 ; 0  0  1 ; 1  0  0 ] .
Two permutations: in step ➊ swap rows #1 and #2, in step ➌ swap rows #2 and #3. Apply these swaps to
the identity matrix and you will recover P. See also [DR08, Ex. 3.30]. y

§2.3.3.16 (LU-decomposition in E IGEN) E IGEN provides various functions for computing the LU-
decomposition of a given matrix. They all perform the factorization in-situ → Rem. 2.3.2.11:

    A  −→  (in-situ storage of the LU-factors: U in the upper triangular part, strict lower triangular part of L below the diagonal)

The resulting matrix can be retrieved and used to recover the LU-factors, as demonstrated in the next code
snippet. Note that the method matrixLU returns just a single matrix, from which the LU-factors have to
be extracted using special view methods.

C++ code 2.3.3.17: Performing explicit LU-factorization in Eigen ➺ GITLAB

const Eigen::MatrixXd::Index n = A.cols();
assert(n == A.rows()); // ensure square matrix
const Eigen::PartialPivLU<MatrixXd> lu(A);
// Normalized lower-triangular factor
MatrixXd L = MatrixXd::Identity(n, n);
L.triangularView<StrictlyLower>() += lu.matrixLU();
// Upper triangular factor
const MatrixXd U = lu.matrixLU().triangularView<Upper>();
// Permutation matrix, see Def. 2.3.3.13
const MatrixXd P = lu.permutationP();

Note that for solving a linear system of equations by means of LU-decomposition (the standard algorithm)
we never have to extract the LU-factors. y
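Nevertheless, as a sanity check one may verify numerically that the extracted factors satisfy PA = LU; a small sketch using the 3 × 3 matrix of Ex. 2.3.3.6:

#include <Eigen/Dense>
#include <iostream>
using Eigen::MatrixXd;

int main() {
  MatrixXd A(3, 3);
  A << 1, 2, 2, 2, -3, 2, 1, 24, 0;               // matrix of Ex. 2.3.3.6
  const Eigen::PartialPivLU<MatrixXd> lu(A);
  MatrixXd L = MatrixXd::Identity(3, 3);
  L.triangularView<Eigen::StrictlyLower>() += lu.matrixLU();
  const MatrixXd U = lu.matrixLU().triangularView<Eigen::Upper>();
  const MatrixXd P = lu.permutationP();
  // Should be of the order of the machine precision
  std::cout << "|| P*A - L*U || = " << (P * A - L * U).norm() << std::endl;
}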

Remark 2.3.3.18 (Row swapping commutes with forward elimination) Any kind of pivoting only in-
volves comparisons and row/column permutations, but no arithmetic operations on the matrix entries.
This makes the following observation plausible:


The LU-factorization of A ∈ K n,n with partial pivoting by § 2.3.3.9 is numerically equivalent to the LU-
factorization of PA without pivoting (→ Code in § 2.3.2.6), when P is a permutation matrix gathering
the row swaps entailed by partial pivoting.

    numerically equivalent  =ˆ  same result when executed with the same machine arithmetic

The above statement means that whenever we study the impact of roundoff errors on LU-
factorization it is safe to consider only the basic version without pivoting, because we can always
assume that row swaps have been conducted beforehand.
y

2.4 Stability of Gaussian Elimination


It will turn out that when investigating the stability of algorithms meant to solve linear systems of equations,
a key quantity is the residual.

Definition 2.4.0.1. Residual

Given an approximate solution x̃ ∈ K^n of the LSE Ax = b (A ∈ K^{n,n}, b ∈ K^n), its residual is the vector

    r = b − Ax̃ .

§2.4.0.2 (Probing stability of a direct solver for LSE) Assume that you have downloaded a direct solver for a general (dense) linear system of equations Ax = b, A ∈ K^{n,n} regular, b ∈ K^n. When given the data A and b it returns the perturbed solution x̃. How can we tell that x̃ is the exact solution of a linear system with slightly perturbed data (in the sense of a tiny relative error of size ≈ EPS, EPS the machine precision, see § 1.5.3.8)? That is, how can we tell that x̃ is an acceptable solution in the sense of backward error analysis, cf. Def. 1.5.5.19:

Definition 1.5.5.19. Stable algorithm

An algorithm F̃ for solving a problem F : X ↦ Y is numerically stable if for all x ∈ X its result F̃(x) (possibly affected by roundoff) is the exact result for “slightly perturbed” data:

    ∃ C ≈ 1:  ∀ x ∈ X:  ∃ x̃ ∈ X:   ‖x − x̃‖_X ≤ C·w(x)·EPS·‖x‖_X   ∧   F̃(x) = F(x̃) .

A question similar to the one we ask now for Gaussian elimination was answered in Ex. 1.5.5.20 for the
operation of matrix×vector multiplication.

We can alter either side of the linear system of equations in order to restore x̃ as a solution:

➊ x − x̃ accounted for by a perturbation of the right hand side:

    Ax = b ,   Ax̃ = b + ∆b   ⇒   ∆b = Ax̃ − b =: −r   (residual, Def. 2.4.0.1) .

Hence, x̃ can be accepted as a solution, if ‖r‖/‖b‖ ≤ C·n³·EPS for some small constant C ≈ 1, see Def. 1.5.5.19. Here, ‖·‖ can be any vector norm on K^n.


➋ x − x̃ accounted for by a perturbation of the system matrix:

    Ax = b ,   (A + ∆A)x̃ = b        [ try the perturbation ∆A = u x̃^H ,  u ∈ K^n ]

    u = r / ‖x̃‖₂²   ⇒   ∆A = r x̃^H / ‖x̃‖₂² .

As in Ex. 1.5.5.20 we find

    ‖∆A‖₂ / ‖A‖₂ = ‖r‖₂ / (‖A‖₂ ‖x̃‖₂) ≤ ‖r‖₂ / ‖Ax̃‖₂ .        (2.4.0.3)

Thus, x̃ is ok in the sense of backward error analysis, if ‖r‖ / ‖Ax̃‖ ≤ C·n³·EPS.

    A stable algorithm for solving an LSE yields a residual r := b − Ax̃ that is small (in norm) relative to b.
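In code, this acceptance test boils down to a few lines; the following sketch uses the criterion ‖r‖ ≤ C·n³·EPS·‖Ax̃‖ from above (the constant C and the function name acceptable are assumptions made for this illustration):

#include <Eigen/Dense>
#include <limits>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Backward-error based acceptance test for an approximate solution xt of Ax = b:
// accept, if the relative residual is of the order of the machine precision EPS.
bool acceptable(const MatrixXd &A, const VectorXd &b, const VectorXd &xt,
                double C = 10.0) {
  const double n = static_cast<double>(A.rows());
  const VectorXd r = b - A * xt;   // residual, Def. 2.4.0.1
  const double eps = std::numeric_limits<double>::epsilon();
  return r.norm() <= C * n * n * n * eps * (A * xt).norm();
}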

Now that we know when to accept a vector as solution of a linear system of equations, we can explore
whether an implementation of Gaussian elimination (with some pivoting strategy) in floating point arith-
metic actually delivers acceptable solutions. Given the several levels of nested loops occurring in algo-
rithms for Gaussian elimination, it is not surprising that the roundoff error analysis of Gaussian elimination
based on Ass. 1.5.3.11 is rather involved. Here we merely summarise the results:

The analysis can be simplified by using the fact that equivalence of Gaussian elimination and LU-
factorization extends to machine arithmetic, cf. Section 2.3.2

Lemma 2.4.0.4. Equivalence of Gaussian elimination and LU-factorization

The following algorithms for solving the LSE Ax = b (A ∈ K n,n , b ∈ K n ) are


numerically equivalent:
❶ Gaussian elimination (forward elimination and back substitution) without pivoting, see Algo-
rithm 2.3.1.3.
❷ LU-factorization of A (→ Code in § 2.3.2.6) followed by forward and backward substitution,
see Algorithm 2.3.2.15.

In Rem. 2.3.3.18 we learned that pivoting can be taken into account by a prior permutation of the rows of
the linear system of equations. Since permutations do not introduce any roundoff errors, it is thus sufficient
to consider LU-factorization without pivoting.
A profound roundoff analysis of Gaussian elimination/LU-factorization can be found in [GV89, Sect. 3.3 &
3.5] and [Hig02, Sect. 9.3]. A less rigorous, but more lucid discussion is given in [TB97, Lecture 22]. Here
we only quote a result due to Wilkinson, [Hig02, Thm. 9.5]:


Theorem 2.4.0.5. Stability of Gaussian elimination with partial pivoting

Let A ∈ R n,n be regular and A(k) ∈ R n,n , k = 1, . . . , n − 1, denote the intermediate matrix arising
in the k-th step of § 2.3.3.9 (Gaussian elimination with partial pivoting) when carried out with exact
arithmetic.
For the approximate solution x̃ ∈ R^n of the LSE Ax = b, b ∈ R^n, computed as in § 2.3.3.9 (based on machine arithmetic with machine precision EPS, → Ass. 1.5.3.11) there is ∆A ∈ R^{n,n} with

    ‖∆A‖_∞ ≤ n³ · (3·EPS)/(1 − 3n·EPS) · ρ · ‖A‖_∞ ,    ρ := max_{i,j,k} |(A^{(k)})_{ij}| / max_{i,j} |(A)_{ij}| ,

such that (A + ∆A)x̃ = b .

ρ “small” ➥ Gaussian elimination with partial pivoting is stable (→ Def. 1.5.5.19)

If ρ is “small”, the computed solution of a LSE can be regarded as the exact solution of a LSE with “slightly
perturbed” system matrix (perturbations of size O(n3 EPS)).

Bad news: exponential growth ρ ∼ 2^n is possible !

EXAMPLE 2.4.0.6 (Wilkinson’s counterexample) We confirm the bad news by means of a famous
example, known as the so-called Wilkinson matrix.
 
    a_{ij} = {  1 ,  if i = j or j = n ,
             { −1 ,  if i > j ,
             {  0   else,

    n = 10:   A = ⎡  1   0   0   0   0   0   0   0   0   1 ⎤
                  ⎢ −1   1   0   0   0   0   0   0   0   1 ⎥
                  ⎢ −1  −1   1   0   0   0   0   0   0   1 ⎥
                  ⎢ −1  −1  −1   1   0   0   0   0   0   1 ⎥
                  ⎢ −1  −1  −1  −1   1   0   0   0   0   1 ⎥
                  ⎢ −1  −1  −1  −1  −1   1   0   0   0   1 ⎥
                  ⎢ −1  −1  −1  −1  −1  −1   1   0   0   1 ⎥
                  ⎢ −1  −1  −1  −1  −1  −1  −1   1   0   1 ⎥
                  ⎢ −1  −1  −1  −1  −1  −1  −1  −1   1   1 ⎥
                  ⎣ −1  −1  −1  −1  −1  −1  −1  −1  −1   1 ⎦

Partial pivoting does not trigger row permutations !

    A = LU ,    l_{ij} = { 1, if i = j ;  −1, if i > j ;  0 else } ,    u_{ij} = { 1, if i = j ;  2^{i−1}, if j = n ;  0 else } .

We find an exponential blow-up of entries of U !


The following code relies on E IGEN’s Gaussian elimination solver and applies it to the Wilkinson matrix:

C++ code 2.4.0.7: Gaussian elimination for “Wilkinson system” in Eigen ➺ GITLAB

MatrixXd res(100, 2);
for (int n = 10; n <= 100 * 10; n += 10) {
  // Build Wilkinson matrix of size n
  MatrixXd A(n, n);
  A.setIdentity();
  A.triangularView<StrictlyLower>().setConstant(-1);
  A.rightCols<1>().setOnes();
  // Imposed exact solution
  VectorXd x = VectorXd::Constant(n, -1).binaryExpr(
      VectorXd::LinSpaced(n, 1, n),
      [](double x, double y) { return pow(x, y); });
  double relerr = (A.lu().solve(A * x) - x).norm() / x.norm();
  res(n / 10 - 1, 0) = n;
  res(n / 10 - 1, 1) = relerr;
}
// ... different solver (e.g. colPivHouseholderQr()), plotting

The measured relative errors are displayed in the following plots alongside the Euclidean condition num-
bers of the Wilkinson matrices.

Fig. 41: Euclidean condition numbers cond₂(A) of the Wilkinson matrices as a function of n.
Fig. 42: relative error (Euclidean norm) of the computed solution for Gaussian elimination and QR-decomposition, and the relative residual norm, as functions of n.

Large relative errors in the solution, though cond₂(A) is small!   ⟷   (∗) evidence of instability of Gaussian elimination!

(∗) If cond₂(A) were huge, then big errors in the solution of a linear system can be caused by small per-
(∗) If cond2 (A) was huge, then big errors in the solution of a linear system can be caused by small per-
turbations of either the system matrix or the right hand side vector, see (2.4.0.13) and the message
of Thm. 2.2.2.4, (2.2.2.8). In this case, a stable algorithm can obviously produce a grossly “wrong”
solution, as was already explained after (2.2.2.8).
Hence, lack of stability of Gaussian elimination will only become apparent for linear systems with
well-conditioned system matrices.
The observations made in this example match Thm. 2.4.0.5, because due to the exponential blow-up of the
entries of U with increasing matrix size we encounter an exponential growth of ρ = ρ(n), see Ex. 2.4.0.6.
y


Observation: In practice ρ (almost) always grows only mildly (like O(√n)) with n.

Discussion in [TB97, Lecture 22]: growth factors larger than the order O(√n) are exponentially rare in certain relevant classes of random matrices.
EXAMPLE 2.4.0.8 (Stability by small random perturbations) Spielman and Teng [ST96] discovered
that a tiny relative random perturbation of the Wilkinson matrix on the scale of the machine precision EPS
(→ § 1.5.3.8) already remedies the instability of Gaussian elimination.


C++ code 2.4.0.9: Stabilization of Gaussian elimination with partial pivoting by small random perturbations ➺ GITLAB

//! Curing Wilkinson's counterexample by random perturbation
MatrixXd res(20, 3);
mt19937 gen(42); // seed
// normal distribution, mean = 0.0, stddev = 1.0
std::normal_distribution<> bellcurve;
for (int n = 10; n <= 10 * 20; n += 10) {
  // Build Wilkinson matrix
  MatrixXd A(n, n);
  A.setIdentity();
  A.triangularView<StrictlyLower>().setConstant(-1);
  A.rightCols<1>().setOnes();
  // imposed solution
  VectorXd x = VectorXd::Constant(n, -1).binaryExpr(
      VectorXd::LinSpaced(n, 1, n),
      [](double x, double y) { return pow(x, y); });
  double relerr = (A.lu().solve(A * x) - x).norm() / x.norm();
  // Randomly perturbed Wilkinson matrix by matrix with iid
  // N(0, eps) distributed entries
  MatrixXd Ap = A.unaryExpr([&](double x) {
    return x + numeric_limits<double>::epsilon() * bellcurve(gen);
  });
  double relerrp = (Ap.lu().solve(Ap * x) - x).norm() / x.norm();
  res(n / 10 - 1, 0) = n;
  res(n / 10 - 1, 1) = relerr;
  res(n / 10 - 1, 2) = relerrp;
}

Fig. 43: relative error for the unperturbed and the randomly perturbed Wilkinson matrix as a function of the matrix size n.

Recall the statement made above about the “improbability” of matrices for which Gaussian elimination with partial pivoting is unstable. This is now matched by the observation that a tiny random perturbation of the matrix (almost certainly) cures the problem. This is investigated by the brand-new field of smoothed analysis of numerical algorithms, see [SST06].
y

Gaussian elimination/LU-factorization with partial pivoting is stable (∗)


(for all practical purposes) !

(∗): stability refers to maximum norm k·k∞ .


EXPERIMENT 2.4.0.10 (Conditioning and relative error → Exp. 2.4.0.11 cnt’d)
In the discussion of numerical stability (→ Def. 1.5.5.19, Rem. 1.5.5.22) we have seen that a stable algo-
rithm may produce results with large errors for ill-conditioned problems. The conditioning of the problem of
solving a linear system of equations is determined by the condition number (→ Def. 2.2.2.7) of the system


matrix, see Thm. 2.2.2.4.

Hence, for an ill-conditioned linear system, whose system matrix has a huge condition number, (stable)
Gaussian elimination may return “solutions” with large errors. This will be demonstrated in this experiment.
Numerical experiment with the nearly singular matrix from Exp. 2.4.0.11:

    A = uv^T + εI ,    u = (1/3)·(1, 2, 3, . . . , 10)^T ,    v = (−1, 1/2, −1/3, 1/4, . . . , 1/10)^T .

Fig. 44: condition number cond(A) and relative error of the computed solution as functions of ε.
y
The practical stability of Gaussian elimination for Ax = b is reflected by the size of a particular vector, the residual r := b − Ax̃, x̃ the computed solution, which can easily be computed after the elimination solver has finished:
has finished:
In practice Gaussian elimination/LU-factorization with partial pivoting
produces “relatively small residual vectors”

This is confirmed by a simple consideration analogous to § 2.4.0.2:

    (A + ∆A)x̃ = b   ⇒   r = b − Ax̃ = ∆Ax̃   ⇒   ‖r‖ ≤ ‖∆A‖·‖x̃‖ ,
for any vector norm k·k. This means that, if a direct solver for an LSE is stable in the sense of backward
error analysis, that is, the perturbed solution could be obtained as the exact solution for a slightly relatively
perturbed system matrix, then the residual will be (relatively) small.
EXPERIMENT 2.4.0.11 (Small residuals by Gaussian elimination) Gaussian elimination works mira-
cles in terms of delivering small residuals! To demonstrate this we study a numerical experiment with
nearly singular matrix.

    A = uv^T + εI    with    u = (1/3)·(1, 2, 3, . . . , 10)^T ,   v = (−1, 1/2, −1/3, 1/4, . . . , 1/10)^T ,
    uv^T =ˆ singular rank-1 matrix.

C++ code 2.4.0.12: Small residuals for Gauss elimination ➺ GITLAB

int n = 10;
// Initialize vectors u and v
VectorXd u = VectorXd::LinSpaced(n, 1, n) / 3.0;
VectorXd v = u.cwiseInverse().array() *
             VectorXd::LinSpaced(n, 1, n)
                 .unaryExpr([](double x) { return pow(-1, x); })
                 .array();
VectorXd x = VectorXd::Ones(n);
double nx = x.lpNorm<Infinity>();
VectorXd expo = VectorXd::LinSpaced(19, -5, -14);
Eigen::MatrixXd res(expo.size(), 4);
for (int i = 0; i < expo.size(); ++i) {
  // Build coefficient matrix A
  double epsilon = std::pow(10, expo(i));
  MatrixXd A = u * v.transpose() + epsilon * MatrixXd::Identity(n, n);
  VectorXd b = A * x;                // right-hand-side vector
  double nb = b.lpNorm<Infinity>();  // maximum norm
  VectorXd xt = A.lu().solve(b);     // Gaussian elimination
  VectorXd r = b - A * xt;           // residual vector
  res(i, 0) = epsilon;
  res(i, 1) = (x - xt).lpNorm<Infinity>() / nx;
  res(i, 2) = r.lpNorm<Infinity>() / nb;
  // L-infinity condition number
  res(i, 3) = A.inverse().cwiseAbs().rowwise().sum().maxCoeff() *
              A.cwiseAbs().rowwise().sum().maxCoeff();
}

Observations (w.r.t. the ‖·‖_∞-norm):
✦ for ε ≪ 1 large relative error in the computed solution x̃,
✦ small residuals for any ε.

Fig. 45: relative error and relative residual of the computed solution as functions of ε.

How can a large relative error be reconciled with a small relative residual? We continue a discussion that
we already started in Ex. 2.4.0.6:
    Ax = b  ↔  Ax̃ ≈ b :

    A(x − x̃) = r  ⇒  ‖x − x̃‖ ≤ ‖A^{-1}‖·‖r‖    and    Ax = b  ⇒  ‖b‖ ≤ ‖A‖·‖x‖

    ⇒    ‖x − x̃‖ / ‖x‖  ≤  ‖A‖·‖A^{-1}‖ · ‖r‖ / ‖b‖ .        (2.4.0.13)

➣ If cond(A) := ‖A‖·‖A^{-1}‖ ≫ 1, then a small relative residual may not imply a small relative error.
Also recall the discussion in Exp. 2.4.0.10. y
EXPERIMENT 2.4.0.14 (Instability of multiplication with inverse) An important justification for
Rem. 2.2.1.6 that advised us not to compute the inverse of a matrix in order to solve a linear system of
equations is conveyed by this experiment. We again consider the nearly singular matrix from Exp. 2.4.0.11.

    A = uv^T + εI    with    u = (1/3)·(1, 2, 3, . . . , 10)^T ,   v = (−1, 1/2, −1/3, 1/4, . . . , 1/10)^T ,
    uv^T =ˆ singular rank-1 matrix.

C++ code 2.4.0.15: Instability of multiplication with inverse ➺ GITLAB

int n = 10;
VectorXd u = VectorXd::LinSpaced(n, 1, n) / 3.0;
VectorXd v = u.cwiseInverse().array() *
             VectorXd::LinSpaced(n, 1, n)
                 .unaryExpr([](double x) { return pow(-1, x); })
                 .array();
VectorXd x = VectorXd::Ones(n);
VectorXd expo = VectorXd::LinSpaced(19, -5, -14);
MatrixXd res(expo.size(), 4);
for (int i = 0; i < expo.size(); ++i) {
  double epsilon = std::pow(10, expo(i));
  MatrixXd A = u * v.transpose() +
               epsilon * (MatrixXd::Random(n, n) + MatrixXd::Ones(n, n)) / 2;
  VectorXd b = A * x;
  double nb = b.lpNorm<Infinity>();
  VectorXd xt = A.lu().solve(b);  // stable solving
  VectorXd r = b - A * xt;        // residual
  MatrixXd B = A.inverse();
  VectorXd xi = B * b;            // solving via inverse
  VectorXd ri = b - A * xi;       // residual
  MatrixXd R = MatrixXd::Identity(n, n) - A * B; // residual of the inverse
  res(i, 0) = epsilon;
  res(i, 1) = r.lpNorm<Infinity>() / nb;
  res(i, 2) = ri.lpNorm<Infinity>() / nb;
  // relative residual of the computed inverse
  res(i, 3) = R.lpNorm<Infinity>() / B.lpNorm<Infinity>();
}

Fig. 46: relative residuals for Gaussian elimination, for multiplication with the inverse, and for the inverse itself, as functions of ε.

The computation of the inverse is affected by roundoff errors, but does not benefit from the same favorable cancellation of roundoff errors as Gaussian elimination.
y

2.5 Survey: Elimination Solvers for Linear Systems of Equations


All direct (∗) solver algorithms for square linear systems of equations Ax = b with given matrix A ∈
K n,n , right hand side vector b ∈ K n and unknown x ∈ K n rely on variants of Gaussian elimination
with pivoting, see Section 2.3.3. Sophisticated, optimised and verified implementations are available in
numerical libraries like LAPACK/MKL.

(∗): a direct solver terminates after a predictable finite number of elementary operations for every admis-
sible input.

Never contemplate implementing a general solver for linear systems of equations!

If possible, use algorithms from numerical libraries! (→ Exp. 2.3.1.7)


Therefore, familiarity with details of Gaussian elimination is not required, but one must know when and
how to use the library functions and one must be able to assess the computational effort they involve.

§2.5.0.1 (Computational effort for direct elimination) We repeat the reasoning of § 2.3.1.5: Gaus-
sian elimination for a general (dense) matrix invariably involves three nested loops of length n, see
Code 2.3.1.4, Code 2.3.3.8.

Theorem 2.5.0.2. Cost of Gaussian elimination → § 2.3.1.5


Given a linear system of equations Ax = b, A ∈ K n,n regular, b ∈ K n , n ∈ N, the asymptotic
computational effort (→ Def. 1.4.0.1) for its direct solution by means of Gaussian elimination in
terms of the problem size parameter n is O(n3 ) for n → ∞.

The constant hidden in the Landau symbol can be expected to be rather small (≈ 1) as is clear from
(2.3.1.6).

The cost for solving is substantially lower, if certain properties of the matrix A are known. This is clear, if A is diagonal or orthogonal/unitary. It is also true for triangular matrices (→ Def. 1.1.2.3), because they can be solved by simple back substitution or forward elimination. We recall the observation made in § 2.3.2.15.

Theorem 2.5.0.3. Cost for solving triangular systems → § 2.3.1.5


In the setting of Thm. 2.5.0.2, the asymptotic computational effort for solving a triangular linear
system of equations is O(n2 ) for n → ∞.

y
§2.5.0.4 (Direct solution of linear systems of equations in E IGEN) E IGEN supplies a rich suite of
functions for matrix decompositions and solving LSEs, see E IGEN documentation. The default solver is
Gaussian elimination with partial pivoting, accessible through the methods lu() and solve()of dense
matrix types:
Given: system/coefficient matrix A ∈ K n,n regular ↔ A (n × n E IGEN matrix)
right hand side vectors B ∈ K n,ℓ ↔ B (n × ℓ E IGEN matrix)
(corresponds to multiple right hand sides, cf. Code 2.3.1.10)
    linear algebra:   X = A^{-1}B = [ A^{-1}(B)_{:,1} , . . . , A^{-1}(B)_{:,ℓ} ]      ⟷      Eigen:   X = A.lu().solve(B)

Summarizing the detailed information given in § 2.3.2.15:

    cost(X = A.lu().solve(B)) = O(n³ + ℓn²)   for n, ℓ → ∞ .
y
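A minimal usage sketch for ℓ right-hand-side vectors collected in the columns of a matrix B (the wrapper solveMany is made up for this illustration):

#include <Eigen/Dense>
using Eigen::MatrixXd;

// Solve AX = B for several right-hand sides at once: X = A^{-1}B.
// Cost O(n^3 + l*n^2), since the LU-factorization of A is computed only once.
MatrixXd solveMany(const MatrixXd &A, const MatrixXd &B) {
  return A.lu().solve(B);  // Gaussian elimination with partial pivoting
}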
Remark 2.5.0.5 (Communicating special properties of system matrices in E IGEN) Sometimes, the
coefficient matrix of a linear system of equations is known to have certain analytic properties that a direct
solver can exploit to perform elimination more efficiently. These properties may even be impossible to
detect by an algorithm, because matrix entries that should vanish exactly might have been perturbed due
to roundoff.

Thus one needs to pass this information to Eigen as follows:

// A is lower triangular
x = A.triangularView<Eigen::Lower>().solve(b);
// A is upper triangular
x = A.triangularView<Eigen::Upper>().solve(b);
// A is Hermitian / self-adjoint and positive definite
x = A.selfadjointView<Eigen::Upper>().llt().solve(b);
// A is Hermitian / self-adjoint (positive or negative semidefinite)
x = A.selfadjointView<Eigen::Upper>().ldlt().solve(b);

The methods llt() and ldlt() rely on special factorizations for symmetric matrices, see § 2.8.0.13
below. y
EXPERIMENT 2.5.0.6 (Standard Eigen lu() operator versus triangularView()) In this numerical experiment we study the gain in efficiency achievable by making the direct solver aware of important matrix properties.

C++ code 2.5.0.7: Direct solver applied to an upper triangular matrix ➺ GITLAB

//! Eigen code: assessing the gain from using special properties
//! of system matrices in Eigen
MatrixXd timing() {
  std::vector<int> n = {16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192};
  const int nruns = 3;
  MatrixXd times(n.size(), 3);
  for (unsigned int i = 0; i < n.size(); ++i) {
    Timer t1;
    Timer t2; // timer class
    MatrixXd A = VectorXd::LinSpaced(n[i], 1, n[i]).asDiagonal();
    A += MatrixXd::Ones(n[i], n[i]).triangularView<Upper>();
    const VectorXd b = VectorXd::Random(n[i]);
    VectorXd x1(n[i]);
    VectorXd x2(n[i]);
    for (int j = 0; j < nruns; ++j) {
      t1.start(); x1 = A.lu().solve(b); t1.stop();
      t2.start(); x2 = A.triangularView<Upper>().solve(b); t2.stop();
    }
    times(i, 0) = n[i]; times(i, 1) = t1.min(); times(i, 2) = t2.min();
  }
  return times;
}

Observation:

Being told that only the upper triangular part of the matrix needs to be taken into account, Gaussian elimination reduces to cheap backward substitution, which is much faster than full elimination, cf. Thm. 2.5.0.2 vs. Thm. 2.5.0.3.

Fig. 47: runtime of the direct solver [s] as a function of the matrix size n, for plain lu() and for triangularView-based solving.
y


§2.5.0.8 (Direct solvers for LSE in E IGEN) Invocation of direct solvers in E IGEN is a two stage process:
➊ Request a decomposition (LU,QR,LDLT) of the matrix and store it in a temporary “decomposition
object”.
➋ Perform backward & forward substitutions by calling the solve() method of the decomposition
object.
The general format for invoking linear solvers in E IGEN is as follows:
Eigen::SolverType<Eigen::MatrixXd> solver(A);
Eigen::VectorXd x = solver.solve(b);

This can be reduced to one line, as the solvers can also be used as methods acting on matrices:
Eigen::VectorXd x = A.solverType().solve(b);

A full list of solvers can be found in the E IGEN documentation. The next code demonstrates a few of
the available decompositions that can serve as the basis for a linear solver:

C++-code 2.5.0.9: Eigen based function solving a LSE ➺ GITLAB

// Gaussian elimination with partial pivoting, Code 2.3.3.8
inline void lu_solve(const MatrixXd &A, const VectorXd &b, VectorXd &x) {
  x = A.lu().solve(b); // 'lu()' is short for 'partialPivLu()'
}

// Gaussian elimination with total pivoting
inline void fullpivlu_solve(const MatrixXd &A, const VectorXd &b, VectorXd &x) {
  x = A.fullPivLu().solve(b); // total pivoting
}

// An elimination solver based on Householder transformations
inline void qr_solve(const MatrixXd &A, const VectorXd &b, VectorXd &x) {
  const Eigen::HouseholderQR<MatrixXd> solver(A); // see Section 3.3.3
  x = solver.solve(b);
}

// Use singular value decomposition (SVD)
inline void svd_solve(const MatrixXd &A, const VectorXd &b, VectorXd &x) {
  // SVD based solvers, see Section 3.4
  x = A.jacobiSvd(Eigen::ComputeThinU | Eigen::ComputeThinV).solve(b);
}

The different decompositions trade speed for stability and accuracy: fully pivoted and QR-based decompositions also work for nearly singular matrices, for which the standard LU-factorization may no longer be reliable. y

Remark 2.5.0.10 (Many sequential solutions of LSE) As we have seen in Code 2.5.0.9, E IGEN provides
functions that return decompositions of matrices. For instance, we can get an object “containing” the
LU-decomposition (→ Section 2.3.2) of a matrix by the following commands:
Eigen::MatrixXd A(n,n); // A dense square matrix object
......
auto ludec = A.lu();    // Perform LU-decomposition and store the factors.

Based on the precomputed decompositions, a linear system of equations with coefficient matrix A ∈ K n,n
can be solved with asymptotic computational effort O(n2 ), cf. § 2.3.2.15.


The following example illustrates a special situation, in which matrix decompositions can curb computa-
tional cost:

C++ code 2.5.0.11: Wasteful approach! ➺ GITLAB

// Setting: N ≫ 1, large matrix A ∈ K^{n,n}
for (int j = 0; j < N; ++j) {
  x = A.lu().solve(b);
  b = some_function(x);
}

computational effort O(N·n³)

C++ code 2.5.0.12: Smart approach! ➺ GITLAB

// Setting: N ≫ 1, large matrix A ∈ K^{n,n}
auto A_lu_dec = A.lu();
for (int j = 0; j < N; ++j) {
  x = A_lu_dec.solve(b);
  b = some_function(x);
}

computational effort O(n³ + N·n²)
y

EXAMPLE 2.5.0.13 (Reuse of LU-decomposition in inverse power iteration) A concrete example is


the so-called inverse power iteration, see Chapter 9, for which a skeleton code is given next. It computes
the iterates

    x* := A^{-1} x^{(k)} ,    x^{(k+1)} := x* / ‖x*‖₂ ,    k = 0, 1, 2, . . . ,        (2.5.0.14)

C++-code 2.5.0.15: Efficient implementation of inverse power method in Eigen ➺ GITLAB

template <class VecType, class MatType>
VecType invpowit(const Eigen::MatrixBase<MatType> &A, double tol) {
  using index_t = typename MatType::Index;
  // Make sure that the function is called with a square matrix
  const index_t n = A.cols();
  const index_t m = A.rows();
  eigen_assert(n == m);
  // Request LU-decomposition
  auto A_lu_dec = A.lu();
  // Initial guess for inverse power iteration
  VecType xo = VecType::Zero(n);
  VecType xn = VecType::Random(n);
  // Normalize vector
  xn /= xn.norm();
  // Terminate if relative (normwise) change below threshold
  while ((xo - xn).norm() > xn.norm() * tol) {
    xo = xn;
    xn = A_lu_dec.solve(xo);
    xn /= xn.norm();
  }
  return (xn);
}

The use of Eigen::MatrixBase<MatType> makes it possible to call invpowit with an expression


argument:
const MatrixXd A = MatrixXd::Random(n, n);
const MatrixXd B = MatrixXd::Random(n, n);
const auto ev = invpowit<VectorXd>(A + B, tol);


This is necessary, because A+B will spawn an auxiliary object of a “strange” type determined by the
expression template mechanism. y

Remark 2.5.0.16 (Access to LU-factors in E IGEN) LU-decomposition objects available in E IGEN provide
access to the computed LU-factors L and U through a member function matrixLU(). This returns a
matrix object with L stored in its strictly lower triangular part, and U in its upper triangular part.
However note that E IGEN’s algorithms for LU-factorization invariably employ (partial) pivoting for the sake
of numerical stability, see Section 2.3.3 for a discussion. This has the effect that the LU-factors of a matrix
A ∈ R n,n are actually those for a matrix PA, where P is a permutation matrix as stated in Lemma 2.3.3.14.
Thus matrixLU() provides the LU-factorization of A after some row permutation.

C++ code 2.5.0.17: Retrieving the LU-factors from an Eigen lu object ➺ GITLAB

std::pair<Eigen::MatrixXd, Eigen::MatrixXd>
lufak_eigen(const Eigen::MatrixXd &A) {
  // Compute LU decomposition
  auto ludec = A.lu();
  // The LU-factors are computed by in-situ LU-decomposition,
  // see Rem. 2.3.2.11, and are stored in a dense matrix of
  // the same size as A
  Eigen::MatrixXd L{ludec.matrixLU().triangularView<Eigen::UnitLower>()};
  const Eigen::MatrixXd U{ludec.matrixLU().triangularView<Eigen::Upper>()};
  // Eigen employs partial pivoting, see § 2.3.3.7, which can be viewed
  // as a prior permutation of the rows of A. We apply the inverse of this
  // permutation to the L-factor in order to achieve A = LU.
  L.applyOnTheLeft(ludec.permutationP().inverse());
  // Return LU-factors as members of a 2-tuple.
  return {L, U};
}

2.6 Exploiting Structure when Solving Linear Systems


By “structure” of a linear system we mean prior knowledge that
✦ either certain entries of the system matrix vanish,
✦ or the system matrix is generated by a particular formula.

§2.6.0.1 (Triangular linear systems) Triangular linear systems are linear systems of equations whose
system matrix is a triangular matrix (→ Def. 1.1.2.3).
Thm. 2.5.0.3 tells us that (dense) triangular linear systems can be solved by backward/forward elimination with O(n²) asymptotic computational effort (n =ˆ number of unknowns), compared to an asymptotic complexity of O(n³) for solving a generic (dense) linear system of equations (→ Thm. 2.5.0.2, Exp. 2.5.0.6).
This is the simplest case where exploiting special structure of the system matrix leads to faster algorithms
for the solution of a special class of linear systems. y
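For instance, forward substitution for a lower triangular system Lx = b is just a double loop, which makes the O(n²) effort evident; a minimal sketch (no checks for zero diagonal entries):

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Solve Lx = b for a regular lower triangular matrix L by forward substitution.
VectorXd forwardSubst(const MatrixXd &L, const VectorXd &b) {
  const Eigen::Index n = L.rows();
  VectorXd x(n);
  for (Eigen::Index i = 0; i < n; ++i) {
    // x_i = (b_i - sum_{j<i} l_{ij} x_j) / l_{ii}:  i-th step costs O(i), O(n^2) in total
    x(i) = (b(i) - L.row(i).head(i).dot(x.head(i))) / L(i, i);
  }
  return x;
}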

§2.6.0.2 (Block elimination) Remember that thanks to the possibility to compute the matrix product in a
block-wise fashion (→ § 1.3.1.13), Gaussian elimination can be conducted on the level of matrix blocks.


We recall Rem. 2.3.1.14 and Rem. 2.3.2.19.

For k, ℓ ∈ N consider the block partitioned square n × n, n := k + ℓ, linear system


    
A11 A12 x1 b A11 ∈ K k,k , A12 ∈ K k,ℓ ,A21 ∈ K ℓ,k ,A22 ∈ K ℓ,ℓ ,
= 1 , (2.6.0.3)
A21 A22 x2 b2 x1 ∈ K k , x2 ∈ K ℓ , b1 ∈ K k , b2 ∈ K ℓ .

Using block matrix multiplication (applied to the matrix×vector product in (2.6.0.3)) we find an equivalent
way to write the block partitioned linear system of equations:

A11 x1 + A12 x2 = b1 ,
(2.6.0.4)
A21 x1 + A22 x2 = b2 .

We assume that A11 is regular (invertible) so that we can solve for x1 from the first equation.

➣ By elementary algebraic manipulations (“block Gaussian elimination”) we find


    x1 = A11^{-1} (b1 − A12 x2) ,
    S x2 = b2 − A21 A11^{-1} b1    with the Schur complement   S := A22 − A21 A11^{-1} A12 .

The resulting ℓ × ℓ linear system of equations for the unknown vector x2 is called the Schur complement
system for (2.6.0.3).
Unless A has a special structure that allows the efficient solution of linear systems with system matrix
A11 , the Schur complement system is mainly of theoretical interest. y
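Nevertheless, for illustration, here is a sketch of how (2.6.0.3) could be solved via the Schur complement system in Eigen (the function name schurSolve and the partitioning argument k are made up for this example; both A11 and S are assumed to be regular):

#include <Eigen/Dense>
using Eigen::MatrixXd; using Eigen::VectorXd;

// Solve the block-partitioned system (2.6.0.3) by block Gaussian elimination,
// k = size of the (1,1)-block A11.
VectorXd schurSolve(const MatrixXd &A, const VectorXd &b, Eigen::Index k) {
  const Eigen::Index l = A.rows() - k;
  const auto lu11 = A.topLeftCorner(k, k).lu();
  const MatrixXd A12 = A.topRightCorner(k, l);
  const MatrixXd A21 = A.bottomLeftCorner(l, k);
  // Schur complement S = A22 - A21 A11^{-1} A12
  const MatrixXd S = A.bottomRightCorner(l, l) - A21 * lu11.solve(A12);
  // Solve the Schur complement system for x2, then recover x1
  const VectorXd x2 = S.lu().solve(b.tail(l) - A21 * lu11.solve(b.head(k)));
  const VectorXd x1 = lu11.solve(b.head(k) - A12 * x2);
  return (VectorXd(k + l) << x1, x2).finished();
}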

EXAMPLE 2.6.0.5 (Linear systems with arrow matrices) From n ∈ N, a diagonal matrix D ∈ K n,n ,
c ∈ K n , b ∈ K n , and α ∈ K, we can build an (n + 1) × (n + 1) arrow matrix.
    A = [ D   c ; b⊤  α ]        (2.6.0.6)

Fig. 48: sparsity pattern (“spy plot”) of an arrow matrix, n = 10, nz = 31.

We can apply the block partitioning (2.6.0.3) with k = n and ℓ = 1 to a linear system Ax = y with system
matrix A and obtain A11 = D, which can be inverted easily, provided that all diagonal entries of D are
different from zero. In this case
    
    Ax = [ D   c ; b⊤  α ] · [ x1 ; ξ ] = y := [ y1 ; η ] ,        (2.6.0.7)

    ξ = (η − b⊤ D^{-1} y1) / (α − b⊤ D^{-1} c) ,    x1 = D^{-1} (y1 − ξ c) .        (2.6.0.8)


These formulas make sense, if D is regular and α − b⊤ D^{-1} c ≠ 0, which is another condition for the invertibility of A.

Using the formula (2.6.0.8) we can solve the linear system (2.6.0.7) with an asymptotic complexity O(n)!
This superior speed compared to Gaussian elimination applied to the (dense) linear system is evident in
runtime measurements.

C++ code 2.6.0.9: Dense Gaussian elimination applied to arrow system ➺ GITLAB
VectorXd arrowsys_slow(const VectorXd &d, const VectorXd &c,
                       const VectorXd &b, double alpha,
                       const VectorXd &y) {
  const Eigen::Index n = d.size();
  MatrixXd A(n + 1, n + 1);    // Empty dense matrix
  A.setZero();                 // Initialize with all zeros.
  A.diagonal().head(n) = d;    // Initialize matrix diagonal from a vector.
  A.col(n).head(n) = c;        // Set rightmost column c.
  A.row(n).head(n) = b;        // Set bottom row b^T.
  A(n, n) = alpha;             // Set bottom-right entry alpha.
  return A.lu().solve(y);      // Gaussian elimination
}

Asymptotic complexity O(n³)!

(Due to the serious blunder of accidentally creating a matrix full of zeros, cf. Exp. 1.3.1.10.)

C++ code 2.6.0.10: Solving an arrow system according to (2.6.0.8) ➺ GITLAB


VectorXd arrowsys_fast(const VectorXd &d, const VectorXd &c,
                       const VectorXd &b, double alpha,
                       const VectorXd &y) {
  const Eigen::Index n = d.size();
  const VectorXd z = c.array() / d.array();           // z = D^{-1} c
  const VectorXd w = y.head(n).array() / d.array();   // w = D^{-1} y_1
  const double den = alpha - b.dot(z);                // denominator in (2.6.0.8)
  // Check for (relatively!) small denominator
  if (std::abs(den) < std::numeric_limits<double>::epsilon() *
                          (std::abs(b.dot(z)) + std::abs(alpha))) {
    throw std::runtime_error("Nearly singular system");
  }
  const double xi = (y(n) - b.dot(w)) / den;
  return (Eigen::VectorXd(n + 1) << w - xi * z, xi).finished();
}

Asymptotic complexity O(n) for n → ∞!


Code for runtime measurements can be obtained from ➺ GITLAB.
(Intel i7-3517U CPU @ 1.90GHz, 64-bit, Ubuntu Linux 14.04 LTS, gcc 4.8.4, -O3)

No comment! ✄

Fig. 49: runtime [s] vs. matrix size n for arrowsys_slow and arrowsys_fast

Remark 2.6.0.11 (Sacrificing numerical stability for efficiency) The vector based implementation of
the solver of Code 2.6.0.10 can be vulnerable to roundoff errors, because, upon closer inspection, the
algorithm turns out to be equivalent to Gaussian elimination without pivoting, cf. Section 2.3.3, Ex. 2.3.3.1.

!   Caution: stability at risk in Code 2.6.0.10

Yet, there are classes of matrices for which Gaussian elimination without pivoting is guaranteed to be
stable. For such matrices algorithms like that of Code 2.6.0.10 are safe. y

§2.6.0.12 (Solving LSE subject to low-rank modification of system matrix) Given a regular matrix
A ∈ K n,n , let us assume that at some point in a code we are in a position to solve any linear system
Ax = b “fast”, because
✦ either A has a favorable structure, eg. triangular, see § 2.6.0.1,
✦ or an LU-decomposition of A is already available, see § 2.3.2.15.
Now, a matrix à is obtained by changing a single entry of A:

\[
\mathbf{A},\tilde{\mathbf{A}} \in \mathbb{K}^{n,n}: \quad
\tilde{a}_{ij} =
\begin{cases}
a_{ij} , & \text{if } (i,j) \neq (i^*,j^*) ,\\
z + a_{ij} , & \text{if } (i,j) = (i^*,j^*) ,
\end{cases}
\qquad i^*, j^* \in \{1,\dots,n\} .
\tag{2.6.0.13}
\]
\[
\tilde{\mathbf{A}} = \mathbf{A} + z \cdot \mathbf{e}_{i^*}\mathbf{e}_{j^*}^{\top} .
\tag{2.6.0.14}
\]

(Recall: ei ≙ i-th unit vector.) The question is whether we can reuse some of the computations spent on
solving Ax = b in order to solve Ãx̃ = b with less effort than entailed by a direct Gaussian elimination
from scratch.

We may also consider a matrix modification affecting a single row: given z ∈ K^n,

\[
\mathbf{A},\tilde{\mathbf{A}} \in \mathbb{K}^{n,n}: \quad
\tilde{a}_{ij} =
\begin{cases}
a_{ij} , & \text{if } i \neq i^* ,\\
(\mathbf{z})_j + a_{ij} , & \text{if } i = i^* ,
\end{cases}
\qquad i^* \in \{1,\dots,n\} .
\]
\[
\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{e}_{i^*}\mathbf{z}^{\top} .
\tag{2.6.0.15}
\]


Both matrix modifications (2.6.0.13) and (2.6.0.15) represent rank-1-modifications of A. A generic
rank-1-modification reads

\[
\mathbf{A} \in \mathbb{K}^{n,n} \;\mapsto\; \tilde{\mathbf{A}} := \mathbf{A} + \underbrace{\mathbf{u}\mathbf{v}^{\mathsf{H}}}_{\text{general rank-1-matrix}} ,
\qquad \mathbf{u},\mathbf{v} \in \mathbb{K}^{n} .
\tag{2.6.0.16}
\]

Trick: Block elimination for an extended linear system, see § 2.6.0.2

As in Ex. 2.3.1.1 we carry out Gaussian elimination for the first column of the block-partitioned linear
system

\[
\begin{bmatrix} -1 & \mathbf{v}^{\mathsf{H}} \\ \mathbf{u} & \mathbf{A} \end{bmatrix}
\begin{bmatrix} \xi \\ \tilde{\mathbf{x}} \end{bmatrix}
=
\begin{bmatrix} 0 \\ \mathbf{b} \end{bmatrix}
\tag{2.6.0.17}
\]
\[
\text{[G.e. on first column]} \;➤\;
\begin{bmatrix} -1 & \mathbf{v}^{\mathsf{H}} \\ \mathbf{0} & \mathbf{A} + \mathbf{u}\mathbf{v}^{\mathsf{H}} \end{bmatrix}
\begin{bmatrix} \xi \\ \tilde{\mathbf{x}} \end{bmatrix}
=
\begin{bmatrix} 0 \\ \mathbf{b} \end{bmatrix}
\;\Longrightarrow\;
\tilde{\mathbf{A}}\tilde{\mathbf{x}} = \mathbf{b} \;!
\tag{2.6.0.18}
\]

Hence, we have solved the modified LSE, once we have found the component x̃ of the solution of the
linear system (2.6.0.17).

Now we swap (block) rows and columns and consider the block-partitioned linear system

\[
\begin{bmatrix} \mathbf{A} & \mathbf{u} \\ \mathbf{v}^{\mathsf{H}} & -1 \end{bmatrix}
\begin{bmatrix} \tilde{\mathbf{x}} \\ \xi \end{bmatrix}
=
\begin{bmatrix} \mathbf{b} \\ 0 \end{bmatrix} .
\]

We do (block) Gaussian elimination on the first (block) column again, which yields the Schur complement
system

\[
(1 + \mathbf{v}^{\mathsf{H}}\mathbf{A}^{-1}\mathbf{u})\,\xi = \mathbf{v}^{\mathsf{H}}\mathbf{A}^{-1}\mathbf{b} ,
\tag{2.6.0.19}
\]
\[
\tilde{\mathbf{A}}\tilde{\mathbf{x}} = \mathbf{b} - \frac{\mathbf{u}\mathbf{v}^{\mathsf{H}}\mathbf{A}^{-1}}{1 + \mathbf{v}^{\mathsf{H}}\mathbf{A}^{-1}\mathbf{u}}\,\mathbf{b} .
\tag{2.6.0.20}
\]
The generalization of this formula to rank-k-perturbations is given in the following lemma:

Lemma 2.6.0.21. Sherman-Morrison-Woodbury formula

For regular A ∈ K n,n , and U, V ∈ K n,k , n, k ∈ N, k ≤ n, holds

(A + UV H )−1 = A−1 − A−1 U(I + V H A−1 U)−1 V H A−1 ,

if I + V H A−1 U is regular.

Proof. Straightforward algebra:

\[
\left(\mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{U}(\mathbf{I} + \mathbf{V}^{\mathsf{H}}\mathbf{A}^{-1}\mathbf{U})^{-1}\mathbf{V}^{\mathsf{H}}\mathbf{A}^{-1}\right)(\mathbf{A} + \mathbf{U}\mathbf{V}^{\mathsf{H}}) =
\mathbf{I} - \mathbf{A}^{-1}\mathbf{U}\underbrace{(\mathbf{I} + \mathbf{V}^{\mathsf{H}}\mathbf{A}^{-1}\mathbf{U})^{-1}(\mathbf{I} + \mathbf{V}^{\mathsf{H}}\mathbf{A}^{-1}\mathbf{U})}_{=\mathbf{I}}\mathbf{V}^{\mathsf{H}} + \mathbf{A}^{-1}\mathbf{U}\mathbf{V}^{\mathsf{H}} = \mathbf{I} .
\]

Uniqueness of the inverse settles the case.


We use this result to solve Ãx̃ = b with à from (2.6.0.16) more efficiently than straightforward elimination
could deliver, provided that the LU-factorisation A = LU is already known. We apply Lemma 2.6.0.21 for


k = 1 and get

\[
\tilde{\mathbf{x}} = \mathbf{A}^{-1}\mathbf{b} - \frac{\mathbf{A}^{-1}\mathbf{u}\,\bigl(\mathbf{v}^{\mathsf{H}}(\mathbf{A}^{-1}\mathbf{b})\bigr)}{1 + \mathbf{v}^{\mathsf{H}}(\mathbf{A}^{-1}\mathbf{u})} .
\tag{2.6.0.22}
\]

We have to solve two linear systems of equations with system matrix A, which is "cheap" provided that
the LU-decomposition of A is available. This is another case, where precomputing the LU-decomposition
pays off.

Assuming that lu passes an object that contains an LU-decomposition of A ∈ R n,n , the following code
demonstrates an efficient implementation with asymptotic complexity O(n2 ) for n → ∞ due to the back-
ward/forward substitutions in Lines 7-8.

C++ code 2.6.0.23: Solving a rank-1 modified LSE ➺ GITLAB


2   // Solving rank-1 updated LSE based on (2.6.0.22)
3   template <class LUDec>
4   Eigen::VectorXd smw(const Eigen::VectorXd &u, const Eigen::VectorXd &v,
5                       const LUDec &lu, const Eigen::VectorXd &b) {
6     const double singfac = 1.0E6;           // Do not lose more than 10 digits
7     const Eigen::VectorXd z = lu.solve(b);  // z = A^{-1} b
8     const Eigen::VectorXd w = lu.solve(u);  // w = A^{-1} u
9     const double alpha = 1.0 + v.dot(w);    // Compute denominator of (2.6.0.22)
10    const double beta = v.dot(z);           // Factor for numerator of (2.6.0.22)
11    if (std::abs(alpha) <= singfac * std::numeric_limits<double>::epsilon()) {
12      throw std::runtime_error("A nearly singular");
13    }
14    return (z - w * beta / alpha);           // see (2.6.0.22)
15  }

In Rem. 1.5.3.15 we were told that the test whether a numerical result is zero should be done by
comparing with another quantity. Then why is this advice not heeded in the above code? The reason
is the line alpha = 1.0 + v.dot(w). Imagine that this results in a value α = 10 · EPS. In this case
the computation of α would involve massive cancellation, see Section 1.5.4, and the result would probably
have a huge relative error ≈ 0.1. This would completely destroy the accuracy of the final result, regardless
of the size of any other quantity computed in the code. Therefore, it is advisable to check the absolute size
of α. y

EXAMPLE 2.6.0.24 (Resistance to currents map) Many linear systems with system matrices that differ
in a single entry only have to be solved when we want to determine the dependence of the total impedance
of a (linear) circuit from the parameters of a single component.

Large (linear) electric circuit (modelling → Ex. 2.1.0.3), see Fig. 50.

Sought: dependence of (certain) branch currents on a “continuously varying” resistance R_x
(➣ currents for many different values of R_x)

Fig. 50: circuit with a variable resistance R_x (components R_1–R_4, C_1, C_2, L, source U)

Only a few entries of the nodal analysis matrix A (→ Ex. 2.1.0.3) are affected by variation of R_x!
(If R_x connects nodes i & j ⇒ only entries a_ii, a_jj, a_ij, a_ji of A depend on R_x)
y
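
The following sketch (my own, not one of the lecture codes) indicates how the smw() function from
Code 2.6.0.23 could be reused here, under the assumption that changing R_x amounts to a rank-1
modification A → A + z·(e_i − e_j)(e_i − e_j)⊤ of the nodal analysis matrix; the function name, the
dense-matrix setting, and the precise form of the update are assumptions of this sketch.

#include <Eigen/Dense>
#include <vector>

// Sketch (assumption): varying R_x leads to a rank-1 update
// A + z*(e_i - e_j)*(e_i - e_j)^T of the nodal analysis matrix A.
// The LU-decomposition of A is computed once and reused for all values of z
// via the smw() function of Code 2.6.0.23 (assumed to be in scope).
std::vector<Eigen::VectorXd> sweepResistance(const Eigen::MatrixXd &A,
                                             const Eigen::VectorXd &b,
                                             int i, int j,
                                             const std::vector<double> &zvals) {
  const auto lu = A.lu();                  // precompute LU-decomposition
  Eigen::VectorXd u = Eigen::VectorXd::Zero(A.rows());
  u(i) = 1.0; u(j) = -1.0;                 // u = e_i - e_j
  std::vector<Eigen::VectorXd> sols;
  for (double z : zvals) {
    sols.push_back(smw(z * u, u, lu, b));  // solve (A + z*u*u^T) x = b
  }
  return sols;
}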
Review question(s) 2.6.0.25 (Exploiting structure when solving linear systems of equations)
(Q2.6.0.25.A) Compute the block LU-decomposition for the arrow matrix

\[
\mathbf{A} =
\begin{bmatrix} \mathbf{D} & \mathbf{c} \\ \mathbf{b}^{\top} & \alpha \end{bmatrix} ,
\qquad
\mathbf{D} \in \mathbb{R}^{n,n} \text{ regular, diagonal} ,\quad
\mathbf{c},\mathbf{b} \in \mathbb{R}^{n} ,\quad \alpha \in \mathbb{R} ,
\]

according to the indicated (and natural) partitioning of the matrix.
(Q2.6.0.25.B) Sketch an efficient algorithm for solving the LSE

\[
\begin{bmatrix} \mathbf{I}_{n-1} & \mathbf{c} \\ \mathbf{0}^{\top} & \alpha \end{bmatrix}
\mathbf{x} = \mathbf{b} ,
\qquad \mathbf{b} \in \mathbb{R}^{n} ,\ \mathbf{c} \in \mathbb{R}^{n-1} ,\ \alpha > 0 .
\]
(Q2.6.0.25.C) Given a matrix A ∈ R n,n find rank-1 modifications that replace its i-th row or column with
a given vector w ∈ R n .


(Q2.6.0.25.D) Given a regular matrix A ∈ R^{n,n} and b ∈ R^n, we want to solve many linear systems of the
form Ã(ξ)x = b, where Ã(ξ) is obtained by adding ξ ∈ R to every entry of A.

Propose an efficient implementation for a C++ class

class ConstModMatSolveLSE {
 public:
  ConstModMatSolveLSE(const Eigen::MatrixXd &A,
                      const Eigen::VectorXd &b);
  Eigen::VectorXd solvemod(double xi) const;
 private:
  ...
};

that serves this purpose. Do not forget to test for near-singularity of the matrix Ã(ξ)!
(Q2.6.0.25.E) [“Z-shaped” matrix] Let A ∈ R^{n,n} be “Z-shaped”:

\[
(\mathbf{A})_{i,j} = 0 , \quad \text{if } i \in \{2,\dots,n-1\} ,\ i + j \neq n + 1 ,
\]

e.g.

\[
\mathbf{A} =
\begin{bmatrix}
\ast & \ast & \ast & \ast & \ast & \ast & \ast & \ast & \ast \\
 & & & & & & & \ast & \\
 & & & & & & \ast & & \\
 & & & & & \ast & & & \\
 & & & & \ast & & & & \\
 & & & \ast & & & & & \\
 & & \ast & & & & & & \\
 & \ast & & & & & & & \\
\ast & \ast & \ast & \ast & \ast & \ast & \ast & \ast & \ast
\end{bmatrix} .
\]
1. Outline an efficient algorithm for solving a linear system of equations Ax = b, b ∈ R n,n .
2. Give a sufficient and necessary condition for A being regular/invertible.
(Q2.6.0.25.F) [Loss of stability] By direct block-wise Gaussian elimination we found the following
solution formulas for a block-partitioned linear system of equations with D ∈ R^{n,n}, c, b ∈ R^n, α ∈ R,
y ∈ R^{n+1}:

\[
\mathbf{A}\mathbf{x} =
\begin{bmatrix} \mathbf{D} & \mathbf{c} \\ \mathbf{b}^{\top} & \alpha \end{bmatrix}
\begin{bmatrix} \mathbf{x}_1 \\ \xi \end{bmatrix}
= \mathbf{y} :=
\begin{bmatrix} \mathbf{y}_1 \\ \eta \end{bmatrix} ,
\tag{2.6.0.7}
\]
\[
\xi = \frac{\eta - \mathbf{b}^{\top}\mathbf{D}^{-1}\mathbf{y}_1}{\alpha - \mathbf{b}^{\top}\mathbf{D}^{-1}\mathbf{c}} ,
\qquad
\mathbf{x}_1 = \mathbf{D}^{-1}(\mathbf{y}_1 - \xi\mathbf{c}) .
\tag{2.6.0.8}
\]

Use these formulas to compute the solution of the 2 × 2 linear system of equations

\[
\begin{bmatrix} \delta & 1 \\ 1 & 1 \end{bmatrix}
\begin{bmatrix} x_1 \\ \xi \end{bmatrix}
=
\begin{bmatrix} 1 \\ 2 \end{bmatrix} ,
\]

assuming |δ| < ½·EPS and using floating point arithmetic.

Hint. Remember that, if |δ| < ½·EPS, in floating point arithmetic

1 +̃ δ = 1   and   2 +̃ δ⁻¹ = δ⁻¹ .

This is compatible with the “Axiom” of roundoff analysis Ass. 1.5.3.11.


(Q2.6.0.25.G) [A banded linear system] Sketch an efficient algorithm for the solution of the n × n linear
system of equations

\[
\begin{bmatrix}
1 & 0 & \dots & \dots & 0 & 1 \\
1 & 1 & 0 & & \dots & 0 \\
0 & \ddots & \ddots & \ddots & & \vdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\
\vdots & & \ddots & 1 & 1 & 0 \\
0 & \dots & \dots & 0 & 1 & 1
\end{bmatrix}
\mathbf{x} = \mathbf{b} \in \mathbb{R}^{n} .
\]

When will a solution exist for every right-hand side vector b?

Hint. You may first perform Gaussian elimination for n = 7 and n = 8 or use the LU-decomposition

\[
\begin{bmatrix}
1 & 0 & \dots & \dots & 0 & 1 \\
1 & 1 & 0 & & \dots & 0 \\
0 & \ddots & \ddots & \ddots & & \vdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\
\vdots & & \ddots & 1 & 1 & 0 \\
0 & \dots & \dots & 0 & 1 & 1
\end{bmatrix}
=
\begin{bmatrix}
1 & 0 & \dots & \dots & 0 & 0 \\
1 & 1 & 0 & & \dots & 0 \\
0 & \ddots & \ddots & \ddots & & \vdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\
\vdots & & \ddots & 1 & 1 & 0 \\
0 & \dots & \dots & 0 & 1 & 1
\end{bmatrix}
\cdot
\begin{bmatrix}
1 & 0 & \dots & \dots & 0 & 1 \\
0 & 1 & \ddots & & & -1 \\
\vdots & \ddots & \ddots & \ddots & & 1 \\
\vdots & & \ddots & \ddots & \ddots & \vdots \\
\vdots & & & \ddots & 1 & (-1)^{n-2} \\
0 & \dots & \dots & & 0 & 1 + (-1)^{n-1}
\end{bmatrix} .
\]

2.7 Sparse Linear Systems


We start with a (rather fuzzy) classification of matrices according to their numbers of zeros:

dense(ly populated) matrices  ↔  sparse(ly populated) matrices

Notion 2.7.0.1. Sparse matrix

A ∈ K^{m,n}, m, n ∈ N, is sparse, if

nnz(A) := #{(i, j) ∈ {1, . . . , m} × {1, . . . , n}: a_ij ≠ 0} ≪ mn .

Sloppy parlance: matrix sparse :⇔ “almost all” entries = 0 / “only a few percent of” entries ≠ 0

J.H. Wilkinson’s informal working definition for a developer of simulation codes:


Notion 2.7.0.2. Sparse matrix

A matrix with enough zeros that it pays to take advantage of them should be treated as sparse.

A more rigorous “mathematical” definition:

Definition 2.7.0.3. Families of sparse matrices

Given strictly increasing sequences m : N → N, n : N → N, a family (A^(l))_{l∈N} of matrices
with A^(l) ∈ K^{m_l,n_l} is sparse (opposite: dense), if

\[
\lim_{l\to\infty} \frac{\operatorname{nnz}(\mathbf{A}^{(l)})}{n_l\, m_l} = 0 .
\]

Simple example: families of diagonal matrices (→ Def. 1.1.2.3) of increasing size.

EXAMPLE 2.7.0.4 (Sparse LSE in circuit modelling) See Ex. 2.1.0.3 for the description of a linear
electric circuit by means of a linear system of equations for nodal voltages. For large circuits the system
matrices will invariably be huge and sparse.

Modern electric circuits (VLSI chips, Fig. 51) comprise 10⁵ − 10⁷ circuit elements.
• Each element is connected to only a few nodes
• Each node is connected to only a few elements

[In the case of a linear circuit] nodal analysis ➤ sparse circuit matrix
(Impossible to even store as dense matrices) y

Remark 2.7.0.5 (Sparse matrices from the discretization of linear partial differential equations)
Another important context in which sparse matrices usually arise:

☛ spatial discretization of linear boundary value problems for partial differential equations by means
of finite element (FE), finite volume (FV), or finite difference (FD) methods (→ 4th semester course
“Numerical methods for PDEs”).
y

2.7.1 Sparse Matrix Storage Formats


Sparse matrix storage formats for storing a “sparse matrix” A ∈ K m,n are designed to achieve two objec-
tives:
➊ Amount of memory required is only slightly more than nnz(A) scalars.
➋ Computational effort for matrix×vector multiplication is proportional to nnz(A).
In this section we see a few schemes used by numerical libraries.

§2.7.1.1 (Triplet/coordinate list (COO) format) In the case of a sparse matrix A ∈ K m,n , this format
stores triplets (i, j, αi,j ), 1 ≤ i ≤ m, 1 ≤ j ≤ n:


struct Triplet {
  size_t i;    // row index
  size_t j;    // column index
  scalar_t a;  // additive contribution to matrix entry
};
using TripletMatrix = std::vector<Triplet>;

Here scalar_t is the underlying scalar type, either float, double, or std::complex<double>.
The vector of triplets in a TripletMatrix has size ≥ nnz(A). We write “≥”, because repetitions of index
pairs (i, j) are allowed. The matrix entry (A)_{i,j} is defined to be the sum of all values α_{i,j} associated with
the index pair (i, j). The next code clearly demonstrates this summation.

C++-code 2.7.1.2: Matrix×vector product y = Ax+y in triplet format


void multTriplMatvec(const TripletMatrix &A,
                     const std::vector<scalar_t> &x,
                     std::vector<scalar_t> &y) {
  for (std::size_t k = 0; k < A.size(); ++k) {
    y[A[k].i] += A[k].a * x[A[k].j];
  }
}

Note that this code assumes that the result vector y has the appropriate length; no index checks are
performed.

Code 2.7.1.2: computational effort is proportional to the number of triplets. (This might be much larger
than nnz(A) in case of many repetitions of triplets.) y

Remark 2.7.1.3 (The zoo of sparse matrix formats) Special sparse matrix storage formats store only
non-zero entries:
• Compressed Row Storage (CRS)
• Compressed Column Storage (CCS) → used by MATLAB
• Block Compressed Row Storage (BCRS)
• Compressed Diagonal Storage (CDS)
• Jagged Diagonal Storage (JDS)
• Skyline Storage (SKS)
All of these formats achieve the two objectives stated above. Some have been designed for sparse matri-
ces with additional structure or for seamless cooperation with direct elimination algorithms (JDS,SKS). y

 
§2.7.1.4 (Compressed row-storage (CRS) format) The CRS format for a sparse matrix A = [a_ij] ∈
K^{n,n} keeps the data in three contiguous arrays:

std::vector<scalar_t>  val      size ≥ nnz(A) := #{(i, j) ∈ {1, . . . , n}², a_ij ≠ 0}
std::vector<size_t>    col_ind  size = val.size()
std::vector<size_t>    row_ptr  size n + 1, with row_ptr[n + 1] = val.size() (sentinel value)

As above we write nnz(A) ≙ (number of nonzeros) of A.


Access to matrix entry a_ij ≠ 0, 1 ≤ i, j ≤ n (“mathematical indexing”):

val[k] = a_ij  ⇔  col_ind[k] = j  and  row_ptr[i] ≤ k < row_ptr[i + 1] ,   1 ≤ k ≤ nnz(A) .

val[k]     ↦ a_ij
col_ind[k] ↦ j
row_ptr[i] ↦ beginning of data for i-th row

\[
\mathbf{A} =
\begin{bmatrix}
10 & 0 & 0 & 0 & -2 & 0 \\
 3 & 9 & 0 & 0 &  0 & 3 \\
 0 & 7 & 8 & 7 &  0 & 0 \\
 3 & 0 & 8 & 7 &  5 & 0 \\
 0 & 8 & 0 & 9 &  9 & 13 \\
 0 & 4 & 0 & 0 &  2 & -1
\end{bmatrix}
\]

val-array:     10 -2 3 9 3 7 8 7 3 ... 9 13 4 2 -1
col_ind-array:  1  5 1 2 6 2 3 4 1 ... 5  6 2 5  6
row_ptr-array:  1 3 6 9 13 17 20

Variant: diagonal CRS format (matrix diagonal stored in separate array)


The CCS format is equivalent to CRS format for the transposed matrix. y
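
To make the access rule concrete, here is a minimal sketch (my own, not one of the lecture codes) of the
matrix×vector product y = Ax for a matrix stored in CRS format; it uses 0-based C++ indexing, a sentinel
entry row_ptr[n] = nnz(A), and a struct name that merely mirrors the one used in the review questions
below.

#include <cstddef>
#include <vector>

struct CRSMatrix {
  std::vector<double> val;
  std::vector<std::size_t> col_ind;
  std::vector<std::size_t> row_ptr;  // size n+1, row_ptr[n] = val.size()
};

// Sketch: y = A*x for A in CRS format (0-based indices), effort O(nnz(A)).
std::vector<double> crsMatVec(const CRSMatrix &A, const std::vector<double> &x) {
  const std::size_t n = A.row_ptr.size() - 1;
  std::vector<double> y(n, 0.0);
  for (std::size_t i = 0; i < n; ++i) {
    // Non-zero entries of row i occupy positions row_ptr[i], ..., row_ptr[i+1]-1
    for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) {
      y[i] += A.val[k] * x[A.col_ind[k]];
    }
  }
  return y;
}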
Review question(s) 2.7.1.5 (Sparse Matrix Storage Formats)
(Q2.7.1.5.A) Explain why access to the source code of a function that computes the matrix×vector prod-
uct for a particular sparse-matrix storage format (encapsulated in a C++ class) already gives you full
information about that format.
(Q2.7.1.5.B) Let a matrix A ∈ R^{n,n} be given in COO/triplet format and by a TripletMatrix object A:

struct Triplet {
  size_t i;    // row index
  size_t j;    // column index
  scalar_t a;  // additive contribution to matrix entry
};
using TripletMatrix = std::vector<Triplet>;

Outline the implementation of a function

Eigen::VectorXd mvTridiPart(const TripletMatrix &A,
                            const Eigen::VectorXd &x);

that computes y := Ãx, where à ∈ R^{n,n} is defined as

\[
(\tilde{\mathbf{A}})_{i,j} =
\begin{cases}
(\mathbf{A})_{i,j} , & \text{if } |i - j| \leq 1 ,\\
0 & \text{else,}
\end{cases}
\qquad i, j \in \{1,\dots,n\} .
\]

(Q2.7.1.5.C) Let a matrix A ∈ R^{n,n} be given in COO/triplet format and by a TripletMatrix object A:

struct Triplet {
  size_t i;    // row index
  size_t j;    // column index
  scalar_t a;  // additive contribution to matrix entry
};
using TripletMatrix = std::vector<Triplet>;

Sketch a code that builds a TripletMatrix object corresponding to the matrix

\[
\mathbf{B} :=
\begin{bmatrix} \mathbf{I}_n & \mathbf{A} \\ \mathbf{A}^{\top} & \mathbf{O} \end{bmatrix}
\in \mathbb{R}^{2n,2n} .
\]

(Q2.7.1.5.D) Assume that a sparse matrix in CRS format is represented by an object of the type

struct CRSMatrix {
  std::vector<double> val;
  std::vector<std::size_t> col_ind;
  std::vector<std::size_t> row_ptr;
};

Describe the implementation of a C++ function

CRSMatrix makeSecondDiffMat(unsigned int n);

that creates the CRSMatrix object for A ∈ R^{n,n} defined as

\[
(\mathbf{A})_{i,j} =
\begin{cases}
2 , & \text{if } i = j ,\\
-1 , & \text{if } |i - j| = 1 ,\\
0 & \text{else,}
\end{cases}
\qquad i, j \in \{1,\dots,n\} .
\]

(Q2.7.1.5.E) For a given matrix A ∈ R^{m,n}, m, n ∈ N, we define the square matrix

\[
\mathbf{W}_{\mathbf{A}} :=
\begin{bmatrix} \mathbf{O}_{m,m} & \mathbf{A} \\ \mathbf{A}^{\top} & \mathbf{O}_{n,n} \end{bmatrix}
\in \mathbb{R}^{m+n,m+n} .
\]

Outline the implementation of an efficient C++ function

void crsAtoW(std::vector<double> &val,
             std::vector<unsigned int> &col_ind,
             std::vector<unsigned int> &row_ptr);

whose arguments supply the three vectors defining the matrix A in CRS format and which overwrites
them with the corresponding vectors of the CRS-format description of W_A.

Remember that the CRS format of a matrix A ∈ R^{m,n} is defined by

val[k] = (A)_{i,j}  ⇔  col_ind[k] = j  and  row_ptr[i] ≤ k < row_ptr[i + 1] ,   1 ≤ k ≤ nnz(A) .

It may be convenient to use std::vector::resize(n), which resizes a vector so that it contains n
elements. If n is smaller than the current container size, the content is reduced to its first n elements,
removing those beyond (and destroying them). If n is greater than the current container size, the content
is expanded by inserting at the end as many elements as needed to reach a size of n using their default
value.


2.7.2 Sparse Matrices in E IGEN


Eigen can handle sparse matrices in the standard Compressed Row Storage (CRS) and Compressed
Column Storage (CCS) format, see § 2.7.1.4 and the E IGEN documentation:
#include <Eigen/Sparse>
Eigen::SparseMatrix<int, Eigen::ColMajor> Asp(rows, cols);     // CCS format
Eigen::SparseMatrix<double, Eigen::RowMajor> Bsp(rows, cols);  // CRS format

Usually sparse matrices in CRS/CCS format must not be filled by setting entries through index-pair access,
because this would entail frequently moving big chunks of memory. The matrix should first be assembled
in triplet format (→ E IGEN documentation), from which a sparse matrix is built. E IGEN offers special
data types and facilities for handling triplets.

std::vector<Eigen::Triplet<double>> triplets;
// .. fill the std::vector triplets ..
Eigen::SparseMatrix<double, Eigen::RowMajor> spMat(rows, cols);
spMat.setFromTriplets(triplets.begin(), triplets.end());

A triplet object can be initialized as demonstrated in the following example:


unsigned int row_idx = 2;
unsigned int col_idx = 4;
double value = 2.5;
Eigen::Triplet<double> triplet(row_idx, col_idx, value);
std::cout << '(' << triplet.row() << ',' << triplet.col()
          << ',' << triplet.value() << ')' << std::endl;

As shown, a Triplet object offers the access member functions row(), col(), and value() to fetch
the row index, column index, and scalar value stored in a Triplet.
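
As a small usage example (my own sketch, not one of the lecture codes), the following function assembles
the n × n tridiagonal matrix with 2 on the diagonal and −1 on the off-diagonals from a triplet list; the
function name is an assumption.

#include <Eigen/Sparse>
#include <vector>

// Sketch: assemble a sparse tridiagonal matrix from triplets.
Eigen::SparseMatrix<double> secondDiffMatrix(int n) {
  std::vector<Eigen::Triplet<double>> triplets;
  triplets.reserve(3 * n);
  for (int i = 0; i < n; ++i) {
    triplets.emplace_back(i, i, 2.0);
    if (i > 0) triplets.emplace_back(i, i - 1, -1.0);
    if (i < n - 1) triplets.emplace_back(i, i + 1, -1.0);
  }
  Eigen::SparseMatrix<double> A(n, n);
  A.setFromTriplets(triplets.begin(), triplets.end());  // builds compressed format
  return A;
}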

The statement that entry-wise initialization of sparse matrices is not efficient has to be qualified in Eigen.
Entries can be set, provided that enough space for each row (in RowMajor format) is reserved in advance.
This is done by the reserve() method that takes an integer vector of maximal expected numbers
of non-zero entries per row:

C++-code 2.7.2.1: Accessing entries of a sparse matrix: potentially inefficient! ➺ GITLAB


unsigned int rows, cols, max_no_nnz_per_row;
// .....
Eigen::SparseMatrix<double, Eigen::RowMajor> mat(rows, cols);
mat.reserve(Eigen::RowVectorXi::Constant(rows, max_no_nnz_per_row));
// do many (incremental) initializations
for (int i = 0; i < rows; ++i) {
  mat.insert(i, i) = -1.0;        // only for matrix entries not yet set!
  mat.insert(i, i + 1) = 1.0;
  mat.coeffRef(i, 2 * i) -= 1.0;  // access entry possibly not set yet
}
mat.makeCompressed();             // squeeze out zeros

insert(i,j) sets an entry of the sparse matrix, which is rather efficient, provided that enough space
has been reserved. coeffRef(i,j) gives l-value and r-value access to any matrix entry, creating a


non-zero entry, if needed: costly!

The usual matrix operations are supported for sparse matrices; addition and subtraction may involve only
sparse matrices stored in the same format. These operations may incur large hidden costs and have to
be used with care!

EXPERIMENT 2.7.2.2 (Initialization of sparse matrices in Eigen) We study the runtime behavior of the
initialization of a sparse matrix in Eigen. We use the methods described above. The code is available from
➺ GITLAB.

Fig. 52 reports runtimes (in ms) for the initialization of a banded matrix (with 5 non-zero diagonals, that is,
a maximum of 5 non-zero entries per row) using different techniques in Eigen: triplets, coeffRef with space
reserved, and coeffRef without space reserved.

Green line: timing for entry-wise initialization with only 4 non-zero entries per row reserved in advance.

(OS: Ubuntu Linux 14.04, CPU: Intel i7, Compiler: g++-4.8.2, -O2)

Fig. 52: time in milliseconds vs. size of matrix

Observation: insufficient advance allocation of memory massively slows down the set-up of a sparse
matrix in the case of direct entry-wise initialization.
Reason: massive internal copying of data is required to create space for “unexpected” entries. y

Remark 2.7.2.3 (Extracting triplets from Eigen::SparseMatrix) Given an Eigen::SparseMatrix object


A describing a generic sparse matrix A ∈ R n,n we have to create another Eigen::SparseMatrix object B
for a matrix B ∈ R n,n , which agrees with A except that the entries in its first sub-diagonal are equal to the
entries of a vector v ∈ R n−1 :
(
(v) j , if i = j + 1 ,
(B)i,j = i, j ∈ {1, . . . , n} .
(A)i,j else,
How can this be done efficiently? By temporarily creating a triplet representation of A, manipulating a few
triplets, and then using makeCompressed() to obtain B.
Thus we need a way to extract a triplet vector from an Eigen::SparseMatrix object. This is done by the
following fairly complicated code (→ E IGEN documentation):

C++ code 2.7.2.4: Extracting triplets from a Eigen::SparseMatrix ➺ GITLAB


template <typename Scalar>
std::vector<Eigen::Triplet<Scalar>>
convertToTriplets(Eigen::SparseMatrix<Scalar> &A) {
  // Empty vector of triplets to be grown in the following loop
  std::vector<Eigen::Triplet<Scalar>> triplets{};
  // Loop over rows/columns (depending on column/row major format)
  for (int k = 0; k < A.outerSize(); ++k) {
    // Loop over inner dimension and obtain triplets corresponding
    // to non-zero entries.
    for (typename Eigen::SparseMatrix<Scalar>::InnerIterator it(A, k); it;
         ++it) {
      // Retrieve triplet data from iterator
      triplets.emplace_back(it.row(), it.col(), it.value());
    }
  }
  return triplets;
}
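
One possible way to use this helper for the task described above is sketched here (my own sketch, not one
of the lecture codes); it assumes that convertToTriplets() from Code 2.7.2.4 is in scope and that 0-based
indexing is used, so that the first sub-diagonal consists of the entries (j+1, j), j = 0, ..., n−2.

#include <Eigen/Sparse>
#include <vector>

// Sketch: build B from A by replacing the first sub-diagonal with the
// entries of the vector v.
Eigen::SparseMatrix<double> replaceSubDiagonal(Eigen::SparseMatrix<double> &A,
                                               const Eigen::VectorXd &v) {
  const auto n = A.rows();
  std::vector<Eigen::Triplet<double>> triplets;
  for (const auto &t : convertToTriplets(A)) {
    if (t.row() != t.col() + 1) {    // keep everything off the sub-diagonal
      triplets.push_back(t);
    }
  }
  for (int j = 0; j < n - 1; ++j) {  // new sub-diagonal entries from v
    triplets.emplace_back(j + 1, j, v(j));
  }
  Eigen::SparseMatrix<double> B(n, n);
  B.setFromTriplets(triplets.begin(), triplets.end());
  return B;
}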

EXAMPLE 2.7.2.5 (Smoothing of a triangulation) This example demonstrates that sparse linear sys-
tems of equations naturally arise in the handling of triangulations.

Definition 2.7.2.6. Planar triangulation

A planar triangulation (mesh) M consists of a set N of N ∈ N distinct points ∈ R2 and a set T


of triangles with vertices in N , such that the following two conditions are satisfied:
1. the interiors of the triangles are mutually disjoint (“no overlap”),
2. for every two closed distinct triangles ∈ T their intersection satisfies exactly one of the fol-
lowing conditions:
(a) it is empty
(b) it is exactly one vertex from N ,
(c) it is a common edge of both triangles

The points in N are also called the nodes of the mesh, the triangles the cells, and all line segments
connecting two nodes and occurring as a side of a triangle form the set of edges. We always assume a
consecutive numbering of the nodes and cells of the triangulation (starting from 1, M ATLAB’s convention).

Fig. 53: valid planar triangulation        Fig. 54: mesh with “illegal” hanging nodes

Triangulations are of fundamental importance for computer graphics, landscape models, geodesy, and
numerical methods. They need not be planar, but the algorithmic issues remain the same.


Common data structure for describing a triangulation with N nodes and M cells:
• column vector x ∈ R N : x-coordinates of nodes
• column vector y ∈ R N : y-coordinates of nodes
• M × 3-matrix T whose rows contain the index numbers of the vertices of the cells.
(This matrix is a so-called triangle-node incidence matrix.)

Python’s visualization add-on provides the function matplotlib.pyplot.triplot, which can be
used to draw planar triangulations.

Fig. 55: example of a planar triangulation

The cells of a mesh may be rather distorted triangles (with very large and/or small angles), which is usually
not desirable. We study an algorithm for smoothing a mesh without changing the planar domain covered
by it.

Definition 2.7.2.7. Boundary edge

Every edge that is adjacent to only one cell is a boundary edge of the triangulation. Nodes that are
endpoints of boundary edges are boundary nodes.

✎ Notation: Γ ⊂ {1, . . . , N} ≙ set of indices of boundary nodes.


 
✎ Notation: p_i = [p_i^1, p_i^2]^⊤ ∈ R² ≙ coordinate vector of node ♯i, i = 1, . . . , N

We define

S(i) := { j ∈ {1, . . . , N} : nodes i and j are connected by an edge} ,   (2.7.2.8)

as the set of node indices of the “neighbours” of the node with index number i.

Definition 2.7.2.9. Smoothed triangulation

A triangulation is called smoothed, if

\[
\mathbf{p}_i = \frac{1}{\sharp S(i)} \sum_{j \in S(i)} \mathbf{p}_j
\quad\Longleftrightarrow\quad
\sharp S(i)\,\bigl(\mathbf{p}_i\bigr)_d = \sum_{j \in S(i)} \bigl(\mathbf{p}_j\bigr)_d ,\ d = 1,2 ,
\qquad \text{for all } i \in \{1,\dots,N\} \setminus \Gamma ,
\tag{2.7.2.10}
\]

that is, every interior node is located in the center of gravity of its neighbours.

The relations (2.7.2.10) correspond to the lines of a sparse linear system of equations! In order to state it,
we insert the coordinates of all nodes into a column vector z ∈ K^{2N}, according to

\[
z_i =
\begin{cases}
p_i^1 , & \text{if } 1 \leq i \leq N ,\\
p_{i-N}^2 , & \text{if } N + 1 \leq i \leq 2N .
\end{cases}
\tag{2.7.2.11}
\]
For the sake of ease of presentation, in the sequel we assume (which is not the case in usual triangulation
data) that interior nodes have index numbers smaller than that of boundary nodes.

From (2.7.2.8) we infer that the system matrix C ∈ R^{2n,2N}, n := N − ♯Γ, of that linear system has the
following structure:

\[
\mathbf{C} =
\begin{bmatrix} \mathbf{A} & \mathbf{O} \\ \mathbf{O} & \mathbf{A} \end{bmatrix} ,
\qquad
(\mathbf{A})_{i,j} =
\begin{cases}
\sharp S(i) , & \text{if } i = j ,\\
-1 , & \text{if } j \in S(i) ,\\
0 & \text{else,}
\end{cases}
\qquad
\begin{aligned}
&i \in \{1,\dots,n\} ,\\
&j \in \{1,\dots,N\} ,
\end{aligned}
\tag{2.7.2.12}
\]
\[
\text{(2.7.2.10)} \;\Longleftrightarrow\; \mathbf{C}\mathbf{z} = \mathbf{0} .
\tag{2.7.2.13}
\]
➣ nnz(A) ≤ number of edges of M + number of interior nodes of M.
➣ The matrix C associated with M according to (2.7.2.12) is clearly sparse.
➣ The sum of the entries in every row of C vanishes.

We partition the vector z into coordinates of nodes in the interior and of nodes on the boundary

\[
\mathbf{z} =
\begin{bmatrix} \mathbf{z}_1^{\mathrm{int}} \\ \mathbf{z}_1^{\mathrm{bd}} \\ \mathbf{z}_2^{\mathrm{int}} \\ \mathbf{z}_2^{\mathrm{bd}} \end{bmatrix}
:=
\bigl[ z_1, \dots, z_n ,\ z_{n+1}, \dots, z_N ,\ z_{N+1}, \dots, z_{N+n} ,\ z_{N+n+1}, \dots, z_{2N} \bigr]^{\top} .
\]

This induces the following block partitioning of the linear system (2.7.2.13):

\[
\begin{bmatrix} \mathbf{A}_{\mathrm{int}} & \mathbf{A}_{\mathrm{bd}} & \mathbf{O} & \mathbf{O} \\ \mathbf{O} & \mathbf{O} & \mathbf{A}_{\mathrm{int}} & \mathbf{A}_{\mathrm{bd}} \end{bmatrix}
\begin{bmatrix} \mathbf{z}_1^{\mathrm{int}} \\ \mathbf{z}_1^{\mathrm{bd}} \\ \mathbf{z}_2^{\mathrm{int}} \\ \mathbf{z}_2^{\mathrm{bd}} \end{bmatrix}
= \mathbf{0} ,
\qquad
\mathbf{A}_{\mathrm{int}} \in \mathbb{R}^{n,n} ,\;
\mathbf{A}_{\mathrm{bd}} \in \mathbb{R}^{n,N-n} .
\]


⇕

\[
\begin{bmatrix} \mathbf{A}_{\mathrm{int}} & \mathbf{A}_{\mathrm{bd}} & & \\ & & \mathbf{A}_{\mathrm{int}} & \mathbf{A}_{\mathrm{bd}} \end{bmatrix}
\begin{bmatrix} \mathbf{z}_1^{\mathrm{int}} \\ \mathbf{z}_1^{\mathrm{bd}} \\ \mathbf{z}_2^{\mathrm{int}} \\ \mathbf{z}_2^{\mathrm{bd}} \end{bmatrix}
= \mathbf{0} .
\tag{2.7.2.14}
\]
The linear system (2.7.2.14) holds the key to the algorithmic realization of mesh smoothing; when smooth-
ing the mesh
(i) the node coordinates belonging to interior nodes have to be adjusted to satisfy the equilibrium con-
dition (2.7.2.10), they are unknowns,
(ii) the coordinates of nodes located on the boundary are fixed, that is, their values are known.
unknown: z_1^int, z_2^int (yellow in (2.7.2.14)),   known: z_1^bd, z_2^bd (pink in (2.7.2.14))

\[
\text{(2.7.2.13)/(2.7.2.14)} \;\Longleftrightarrow\;
\mathbf{A}_{\mathrm{int}} \bigl[ \mathbf{z}_1^{\mathrm{int}} \;\; \mathbf{z}_2^{\mathrm{int}} \bigr]
= - \bigl[ \mathbf{A}_{\mathrm{bd}}\mathbf{z}_1^{\mathrm{bd}} \;\; \mathbf{A}_{\mathrm{bd}}\mathbf{z}_2^{\mathrm{bd}} \bigr] .
\tag{2.7.2.15}
\]

This is a square linear system with an n × n system matrix, to be solved for two different right hand side
vectors. The matrix Aint is also known as the matrix of the combinatorial graph Laplacian.
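
The following sketch (my own, not one of the lecture codes) shows how (2.7.2.15) could be assembled with
triplets and solved with Eigen's SparseLU for both right-hand sides at once. It assumes that the interior
nodes are numbered 0, ..., n−1 and the boundary nodes n, ..., N−1 (0-based), that S[i] holds the
neighbour indices of node i, and that bdCoords contains the two coordinate columns of the boundary
nodes; all names are assumptions.

#include <Eigen/Sparse>
#include <set>
#include <vector>

// Sketch: set up and solve (2.7.2.15) for the interior node coordinates.
Eigen::MatrixXd smoothInteriorNodes(const std::vector<std::set<int>> &S,
                                    const Eigen::MatrixXd &bdCoords, int n) {
  const int N = static_cast<int>(S.size());
  std::vector<Eigen::Triplet<double>> trpInt, trpBd;
  for (int i = 0; i < n; ++i) {
    trpInt.emplace_back(i, i, static_cast<double>(S[i].size()));
    for (int j : S[i]) {
      if (j < n) trpInt.emplace_back(i, j, -1.0);     // interior neighbour
      else       trpBd.emplace_back(i, j - n, -1.0);  // boundary neighbour
    }
  }
  Eigen::SparseMatrix<double> Aint(n, n), Abd(n, N - n);
  Aint.setFromTriplets(trpInt.begin(), trpInt.end());
  Abd.setFromTriplets(trpBd.begin(), trpBd.end());
  // Two right-hand sides: x- and y-coordinates of the boundary nodes
  const Eigen::MatrixXd rhs = -(Abd * bdCoords);      // bdCoords is (N-n) x 2
  Eigen::SparseLU<Eigen::SparseMatrix<double>> solver(Aint);
  return solver.solve(rhs);                           // n x 2 interior coordinates
}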

We examine the sparsity pattern of the system matrices Aint for a sequence of triangulations created by
regular refinement.

Definition 2.7.2.16. Regular refinement of a planar triangulation

The planar triangulation with cells obtained by splitting all cells of a planar triangulation M
into four congruent triangles is called the regular refinement of M. (Fig. 56)

We start from the triangulation of Fig. 55 and in turns perform regular refinement and smoothing (left ↔
after refinement, right ↔ after smoothing): refined/smoothed mesh on levels 1, 2, and 3.

Below we give spy plots of the system matrices A_int for the first three triangulations of the sequence
(Fig. 57, Fig. 58, Fig. 59).
y
Review question(s) 2.7.2.17 (Sparse matrices in E IGEN)
(Q2.7.2.17.A) How would you implement the method setFromTriplets() of
Eigen::SparseMatrix<double> in order to achieve an asymptotic complexity O(♯triplets)?

2.7.3 Direct Solution of Sparse Linear Systems of Equations


Efficient Gaussian elimination for sparse matrices requires sophisticated algorithms that are encapsulated
in special types of solvers in E IGEN. Their calling syntax remains unchanged, however:
Eigen::SolverType<Eigen::SparseMatrix< double >> solver(A);
Eigen::VectorXd x = solver.solve(b);

The standard sparse solver is SparseLU.

C++ code 2.7.3.1: Function for solving a sparse LSE with E IGEN ➺ GITLAB
using SparseMatrix = Eigen::SparseMatrix<double>;
// Perform sparse elimination
inline void sparse_solve(const SparseMatrix &A, const VectorXd &b, VectorXd &x) {
  const Eigen::SparseLU<SparseMatrix> solver(A);
  if (solver.info() != Eigen::Success) {
    throw std::runtime_error("Matrix factorization failed");
  }
  x = solver.solve(b);
}

The constructor of the solver object builds the actual sparse LU-decomposition. The solve method
then does forward and backward elimination, cf. § 2.3.2.15. It can be called multiple times, see
Rem. 2.5.0.10. For more sample codes see ➺ GITLAB.
EXPERIMENT 2.7.3.2 (Sparse elimination for arrow matrix) In Ex. 2.6.0.5 we saw that applying the
standard lu() solver to a sparse arrow matrix results in an extreme waste of computational resources.

Yet, Eigen can do much better! The main mistake was the creation of a dense matrix instead of storing
the arrow matrix in sparse format. There are Eigen solvers which rely on particular sparse elimination
techniques. They still rely on Gaussian elimination with (partial) pivoting (→ Code 2.3.3.8), but take pains
to operate on non-zero entries only. This can greatly boost the speed of the elimination.


C++ code 2.7.3.3: Invoking sparse elimination solver for arrow matrix ➺ GITLAB
template <class solver_t>
VectorXd arrowsys_sparse(const VectorXd &d, const VectorXd &c,
                         const VectorXd &b, double alpha,
                         const VectorXd &y) {
  const Eigen::Index n = d.size();
  SparseMatrix<double> A(n + 1, n + 1);                // default: column-major
  VectorXi reserveVec = VectorXi::Constant(n + 1, 2);  // nnz per col
  reserveVec(n) = static_cast<int>(n + 1);             // last full col
  A.reserve(reserveVec);
  for (int j = 0; j < n; ++j) {  // initialize along cols for efficiency
    A.insert(j, j) = d(j);       // diagonal entries
    A.insert(n, j) = b(j);       // bottom row entries
  }
  for (int i = 0; i < n; ++i) {
    A.insert(i, n) = c(i);       // last col
  }
  A.insert(n, n) = alpha;        // bottom-right entry alpha
  A.makeCompressed();
  return solver_t(A).solve(y);
}

Observation:
The sparse elimination solver is several orders of magnitude faster than lu() operating on a
dense matrix.
The sparse solver is still slower than Code 2.6.0.10. The reason is that it is a general algorithm that
has to keep track of non-zero entries and has to be prepared to do pivoting. y

Fig. 60: runtime [s] vs. matrix size n for arrowsys_slow, arrowsys_fast, arrowsys_SparseLU, and
arrowsys_iterative

EXPERIMENT 2.7.3.4 (Timing sparse elimination for the combinatorial graph Laplacian) We consider
a sequence of planar triangulations created by successive regular refinement (→ Def. 2.7.2.16) of
the planar triangulation of Fig. 55, see Ex. 2.7.2.5. We use different Eigen and MKL sparse solvers for the
linear system of equations (2.7.2.15) associated with each mesh.


Timing results are shown in Fig. 61 (solution time [s] vs. size of matrix A_int) for Eigen SparseLU,
Eigen SimplicialLDLT, Eigen ConjugateGradient, MKL PardisoLU, and MKL PardisoLDLT; a reference
slope O(n^1.5) is indicated.

Platform:
✦ ubuntu 14.04 LTS
✦ i7-3517U CPU @ 1.90GHz
✦ L1 32 KB, L2 256 KB, L3 4096 KB, Mem 8 GB
✦ gcc 4.8.4, -O3

We observe an empirical asymptotic complexity (→ Def. 1.4.1.1) of O(n^1.5), way better than the asymptotic
complexity of O(n³) expected for Gaussian elimination in the case of dense matrices.

When solving linear systems of equations directly, dedicated sparse elimination solvers from
numerical libraries have to be used!

System matrices are passed to these algorithms in sparse storage formats (→ Section 2.7.1) to
convey information about zero entries.

STOP Never ever even think about implementing a general sparse elimination solver by yourself!

For a survey of sparse solvers available in Eigen see the Eigen documentation.


§2.7.3.5 (Implementations of sparse solvers) Widely used implementations of sparse solvers are:

→ SuperLU (http://www.cs.berkeley.edu/~demmel/SuperLU.html),
→ UMFPACK (https://en.wikipedia.org/wiki/UMFPACK), used by M ATLAB’s \,
→ PARDISO [SG04] (http://www.pardiso-project.org/), incorporated into MKL

Fig. 62: fill-in (→ Def. 2.7.4.3) during sparse elimination with PARDISO

PARDISO has been developed by Prof. O. Schenk and his group (formerly University of Basel, now USI
Lugano).


C++-code 2.7.3.6: Example code demonstrating the use of PARDISO with E IGEN ➺ GITLAB
void solveSparsePardiso(size_t n) {
  using SpMat = Eigen::SparseMatrix<double>;
  // Initialize a sparse matrix
  const SpMat M = initSparseMatrix<SpMat>(n);
  const Eigen::VectorXd b = Eigen::VectorXd::Random(n);
  Eigen::VectorXd x(n);
  // Initialization of the sparse direct solver based on the Pardiso library,
  // directly passing the matrix M to the solver. Pardiso is part of the
  // Intel MKL library, see also Ex. 1.3.2.6
  Eigen::PardisoLU<SpMat> solver(M);
  // The checks of Code 2.7.3.1 are omitted
  // solve the LSE
  x = solver.solve(b);
}

Required is #include <Eigen/PardisoSupport>, the compilation flag -DEIGEN_USE_MKL_ALL,
and the inclusion of MKL libraries during the linking phase.

COMPILER = clang++
FLAGS = -std=c++11 -m64 -I/usr/include/eigen3 -I${MKLROOT}/include -O3 -DNDEBUG

# Intel(R) MKL 11.3.2, Linux, None, GNU C/C++, Intel(R) 64, Static, LP64, Sequential
FLAGS_LINK = -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a \
  ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_sequential.a \
  -Wl,--end-group -lpthread -lm -ldl

all: main.cpp
	$(COMPILER) $(FLAGS) -DEIGEN_USE_MKL_ALL $< -o main $(FLAGS_LINK)

y
Review question(s) 2.7.3.7 (Direct solution of sparse linear systems of equations)
(Q2.7.3.7.A) In Code 2.7.3.1 we checked (solver.info() != Eigen::Success), where
solver was of type Eigen::SparseLU<Eigen::SparseMatrix>. Can you explain why it
is a good idea to include this test?
(Q2.7.3.7.B) What are the benefits of storing a matrix in CRS or CCS format?

2.7.4 LU-Factorization of Sparse Matrices


In Section 2.7.1 we have seen, how sparse matrices can be stored requiring O(nnz(A)) memory.

However, simple examples show that the product of sparse matrices need not be sparse, which means
that the multiplication of large sparse matrices will usually require an effort way bigger than the sum of the
numbers of their non-zero entries.
What is the situation concerning the solution of square linear systems of equations with sparse system
matrices? Generically, we have to brace for a computational effort O(n3 ) for matrix size n → ∞. Yet
Section 2.7.3 sends the message that a better asymptotic complexity can often be achieved, if the sparse


matrix has a particular structure and sophisticated library routines are used. In this section, we examine
some aspects of Gaussian elimination ↔ LU-factorisation when applied in a sparse-matrix context.
EXAMPLE 2.7.4.1 ( LU -factorization of sparse matrices) We examine the following “sparse” matrix with
a typical structure and inspect the pattern of the LU-factors returned by E IGEN, see Code 2.7.4.2.
 
\[
\mathbf{A} =
\begin{bmatrix}
 3 & -1 &        &    & -1 &        &        &    \\
-1 & \ddots & \ddots &    &    & \ddots &        &    \\
   & \ddots & \ddots & -1 &    &        & \ddots &    \\
   &        & -1     &  3 &    &        &        & -1 \\
-1 &        &        &    &  3 & -1     &        &    \\
   & \ddots &        &    & -1 & \ddots & \ddots &    \\
   &        & \ddots &    &    & \ddots & \ddots & -1 \\
   &        &        & -1 &    &        & -1     &  3
\end{bmatrix}
\in \mathbb{R}^{n,n} ,\ n \in \mathbb{N}
\]

C++ code 2.7.4.2: Visualizing LU-factors of a sparse matrix ➺ GITLAB


// Build matrix
const Eigen::Index n = 100;
RowVectorXd diag_el(5);
diag_el << -1, -1, 3, -1, -1;
VectorXi diag_no(5);
diag_no << -n, -1, 0, 1, n;
MatrixXd B = diag_el.replicate(2 * n, 1);
B(n - 1, 1) = 0;
B(n, 3) = 0;  // delete elements
// A custom function from the Utils folder
const SparseMatrix<double> A = spdiags(B, diag_no, 2 * n, 2 * n);
// It is not possible to access the LU-factors in the case of
// EIGEN's LU-decomposition for sparse matrices.
// Therefore we have to resort to the dense version.
auto solver = MatrixXd(A).lu();
MatrixXd L = MatrixXd::Identity(2 * n, 2 * n);
L += solver.matrixLU().triangularView<StrictlyLower>();
const MatrixXd U = solver.matrixLU().triangularView<Upper>();
// Plotting
spy(A, "Sparse matrix", "sparseA_cpp.eps");
spy(L, "Sparse matrix: L factor", "sparseL_cpp.eps");
spy(U, "Sparse matrix: U factor", "sparseU_cpp.eps");

Fig. 63: sparse matrix A (nz = 796)   Fig. 64: L factor (nz = 10299)   Fig. 65: U factor (nz = 10299)
y

Observation: A sparse ⇏ LU-factors sparse


Of course, in case the LU-factors of a sparse matrix possess many more non-zero entries than the matrix
itself, the effort for solving a linear system with direct elimination will increase significantly. This can be
quantified by means of the following concept:

Definition 2.7.4.3. Fill-in

Let A = LU be an LU-factorization (→ Section 2.3.2) of A ∈ K^{n,n}. If l_ij ≠ 0 or u_ij ≠ 0 though
a_ij = 0, then we encounter fill-in at position (i, j).

EXAMPLE 2.7.4.4 (Sparse LU -factors) Ex. 2.7.4.1 ➣ massive fill-in can occur for sparse matrices
This example demonstrates that fill-in can largely be avoided, if the matrix has favorable structure. In this
case a LSE with this particular system matrix A can be solved efficiently, that is, with a computational
effort O(nnz(A)) by Gaussian elimination.

C++ code 2.7.4.5: LU-factorization of sparse matrix ➺ GITLAB


// Build matrix
MatrixXd A(11, 11);
A.setIdentity();
A.col(10).setOnes();
A.row(10).setOnes();
// A.reverseInPlace(); // used in Ex. 2.7.4.6
auto solver = A.lu();
MatrixXd L = MatrixXd::Identity(11, 11);
L += solver.matrixLU().triangularView<StrictlyLower>();
const MatrixXd U = solver.matrixLU().triangularView<Upper>();
const MatrixXd Ainv = A.inverse();
// Plotting
spy(A, "Pattern of A", "Apat_cpp.eps");
spy(L, "Pattern of L", "Lpat_cpp.eps");
spy(U, "Pattern of U", "Upat_cpp.eps");
spy(Ainv, "Pattern of A^{-1}", "Ainvpat_cpp.eps");

A is called an “arrow matrix”, see the pattern of non-zero entries below and Ex. 2.6.0.5.
Recalling Rem. 2.3.2.17 it is easy to see that the LU-factors of A will be sparse and that their sparsity
patterns will be as depicted below. Observe that despite sparse LU-factors, A−1 will be densely populated.
Spy plots: pattern of A (nz = 31), pattern of A⁻¹ (nz = 121), pattern of L (nz = 21), pattern of U (nz = 21).

L, U sparse ⇏ A⁻¹ sparse !

Besides stability and efficiency issues, see Exp. 2.4.0.11, this is another reason why using
x = A.inverse()*y instead of x = A.lu().solve(y) is usually a major blunder. y


EXAMPLE 2.7.4.6 (LU-decomposition of flipped “arrow matrix”) Recall the discussion in Ex. 2.6.0.5.
Here we look at an arrow matrix in a slightly different form:

\[
\mathbf{M} =
\begin{bmatrix} \alpha & \mathbf{b}^{\top} \\ \mathbf{c} & \mathbf{D} \end{bmatrix} ,
\qquad
\begin{aligned}
&\alpha \in \mathbb{R} ,\\
&\mathbf{b}, \mathbf{c} \in \mathbb{R}^{n-1} ,\\
&\mathbf{D} \in \mathbb{R}^{n-1,n-1} \text{ regular diagonal matrix, } \rightarrow \text{Def. 1.1.2.3.}
\end{aligned}
\tag{2.7.4.7}
\]

Run the algorithm from § 2.3.2.6 (LU-decomposition without pivoting):

✦ LU-decomposition yields dense factor matrices with O(n²) non-zero entries.
✦ asymptotic computational cost: O(n³)

Output of modified Code 2.7.4.5: spy plots of the factors L and U (nz = 65 each).
Obvious fill-in (→ Def. 2.7.4.3)

 
 
 
 
 
 
Now it comes as a surprise that the arrow matrix A from Ex. 2.6.0.5, (2.6.0.6),

\[
\mathbf{A} =
\begin{bmatrix} \mathbf{D} & \mathbf{c} \\ \mathbf{b}^{\top} & \alpha \end{bmatrix} ,
\]

has sparse LU-factors!


   
   
   
   
   
   
   
   
 I 0   D c 
A=

·
 
 , σ : = α − b ⊤ D −1 c .

   
   
   
   
   
   
   
b ⊤ D −1 1 0 σ
| {z } | {z }
=:L =:U

➣ In this case LU-factorisation is possible without fill-in, cost merely O(n)!

Idea: Transform the flipped arrow matrix M into the arrow form A by row and column permutations before
performing the LU-decomposition.

Details: Apply a cyclic permutation of rows/columns:

• 1st row/column → n-th row/column


• i-th row/column → i − 1-th row/column, i = 2, . . . , n

Fig. 66: pattern of A (nz = 31)      Fig. 67: pattern of the permuted matrix (nz = 31)

➣ Then LU-factorization (without pivoting) of the resulting matrix requires O(n) operations.

C++ code 2.7.4.8: Permuting arrow matrix, see Fig. 66, Fig. 67 ➺ GITLAB
MatrixXd A(11, 11);
A.setIdentity();
A.col(0).setOnes();
A.row(0) = RowVectorXd::LinSpaced(11, 11, 1);
// Permutation matrix (→ Def. 2.3.3.13) encoding cyclic
// permutation
MatrixXd P(11, 11);
P.setZero();
P.topRightCorner(10, 10).setIdentity();
P(10, 0) = 1;
spy(A, "A", "InvArrowSpy_cpp.eps");
spy(P * A * P.transpose(), "permuted A", "ArrowSpy_cpp.eps");
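
To actually exploit such a permutation when solving a linear system one can proceed as in the following
sketch (my own, not one of the lecture codes): apply the cyclic permutation, solve the permuted system,
and undo the permutation on the solution. The use of Eigen::PermutationMatrix and the dense lu() solver
here are choices of this sketch; in the fill-in-free setting described above one would combine the idea with
an elimination that does not pivot.

#include <Eigen/Dense>

// Sketch: solve M*x = y for the flipped arrow matrix M by cyclically
// permuting rows and columns into arrow form. The permutation maps
// index 0 -> n-1 and i -> i-1 for i = 1, ..., n-1 (0-based).
Eigen::VectorXd solveFlippedArrow(const Eigen::MatrixXd &M,
                                  const Eigen::VectorXd &y) {
  const Eigen::Index n = M.rows();
  Eigen::PermutationMatrix<Eigen::Dynamic> P(n);
  for (Eigen::Index i = 0; i < n; ++i) {
    P.indices()(i) = static_cast<int>((i + n - 1) % n);  // cyclic shift
  }
  // Permuted system: (P*M*P^T) (P*x) = P*y
  const Eigen::MatrixXd A = P * M * P.transpose();
  const Eigen::VectorXd z = A.lu().solve(P * y);
  return P.transpose() * z;  // undo the permutation of the unknowns
}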

EXAMPLE 2.7.4.9 (Pivoting destroys sparsity) In Ex. 2.7.4.6 we found that permuting a matrix can make
it amenable to Gaussian elimination/LU-decomposition with much less fill-in (→ Def. 2.7.4.3). However,
recall from Section 2.3.3 that pivoting, which may be essential for achieving numerical stability, amounts to
permuting the rows (or even columns) of the matrix. Thus, we may face the awkward situation that pivoting
tries to reverse the very permutation we applied to minimize fill-in! The next example shows that this can
happen for an arrow matrix.

C++ code 2.7.4.10: fill-in due to pivoting ➺ GITLAB


// Study of fill-in with LU-factorization due to pivoting
Eigen::MatrixXd A(11, 11);
A.setZero();
A.diagonal() = Eigen::VectorXd::LinSpaced(11, 1, 11).cwiseInverse();
A.col(10).setConstant(2);
A.row(10).setConstant(2);
auto solver = A.lu();
Eigen::MatrixXd L = Eigen::MatrixXd::Identity(11, 11);
L += solver.matrixLU().triangularView<Eigen::StrictlyLower>();
const Eigen::MatrixXd U = solver.matrixLU().triangularView<Eigen::Upper>();
// Plotting
spy(A, "Arrow matrix A", "fillinpivotA.eps");
spy(L, "L factor", "fillinpivotL.eps");
spy(U, "U factor", "fillinpivotU.eps");
std::cout << A << std::endl;

 
\[
\mathbf{A} =
\begin{bmatrix}
1 &        &        &              & 2 \\
  & \tfrac{1}{2} &        &        & 2 \\
  &        & \ddots &              & \vdots \\
  &        &        & \tfrac{1}{10} & 2 \\
2 & 2      & \dots  & 2            & 2
\end{bmatrix}
\quad \rightarrow \text{ arrow matrix, Ex. 2.7.4.4}
\]
The distributions of non-zero entries of the computed LU-factors (“spy-plots”) are as follows:


Spy plots: arrow matrix A (nz = 31), L factor (nz = 21), U factor (nz = 66).

In this case the solution of a LSE with system matrix A ∈ R^{n,n} of the above type by means of Gaussian
elimination with partial pivoting would incur costs of O(n³). y

2.7.5 Banded Matrices [DR08, Sect. 3.7]


Banded matrices are a special class of sparse matrices (→ Notion 2.7.0.1) with extra structure:

Definition 2.7.5.1. Bandwidth

For A = (a_ij)_{i,j} ∈ K^{m,n} we call

\[
\begin{aligned}
\overline{\operatorname{bw}}(\mathbf{A}) &:= \min\{k \in \mathbb{N}: j - i > k \Rightarrow a_{ij} = 0\} && \text{upper bandwidth ,}\\
\underline{\operatorname{bw}}(\mathbf{A}) &:= \min\{k \in \mathbb{N}: i - j > k \Rightarrow a_{ij} = 0\} && \text{lower bandwidth .}
\end{aligned}
\]
\[
\operatorname{bw}(\mathbf{A}) := \overline{\operatorname{bw}}(\mathbf{A}) + \underline{\operatorname{bw}}(\mathbf{A}) + 1 \;=\; \text{bandwidth of } \mathbf{A} .
\]

• bw(A) = 1 ✄ A diagonal matrix, → Def. 1.1.2.3
• \overline{bw}(A) = \underline{bw}(A) = 1 ✄ A tridiagonal matrix
• More general: A ∈ R^{n,n} with bw(A) ≪ n ≙ banded matrix

(Sketch of a banded m × n matrix with diagonal, super-diagonals, and sub-diagonals marked;
here \overline{bw}(A) = 3, \underline{bw}(A) = 2.)

For a banded matrix A ∈ K^{m,n}:  nnz(A) ≤ min{m, n} · bw(A)

We now examine a generalization of the concept of a banded matrix that is particularly useful in the context
of Gaussian elimination:


Definition 2.7.5.2. Matrix envelope

For A ∈ K^{n,n} define

\[
\begin{aligned}
\text{row bandwidth} &\quad \operatorname{bw}_i^R(\mathbf{A}) := \max\{0,\, i - j : a_{ij} \neq 0,\ 1 \leq j \leq n\} , && i \in \{1,\dots,n\} ,\\
\text{column bandwidth} &\quad \operatorname{bw}_j^C(\mathbf{A}) := \max\{0,\, j - i : a_{ij} \neq 0,\ 1 \leq i \leq n\} , && j \in \{1,\dots,n\} ,\\
\text{envelope} &\quad \operatorname{env}(\mathbf{A}) := \left\{ (i,j) \in \{1,\dots,n\}^2 :
\begin{array}{l} i - \operatorname{bw}_i^R(\mathbf{A}) \leq j \leq i \ \text{ or}\\ j - \operatorname{bw}_j^C(\mathbf{A}) \leq i \leq j \end{array}
\right\} .
\end{aligned}
\]
EXAMPLE 2.7.5.3 (Envelope of a matrix) We give an example illustrating Def. 2.7.5.2.

\[
\mathbf{A} =
\begin{bmatrix}
\ast & 0 & \ast & 0 & 0 & 0 & 0 \\
0 & \ast & 0 & 0 & \ast & 0 & 0 \\
\ast & 0 & \ast & 0 & 0 & 0 & \ast \\
0 & 0 & 0 & \ast & \ast & 0 & \ast \\
0 & \ast & 0 & \ast & \ast & \ast & 0 \\
0 & 0 & 0 & 0 & \ast & \ast & 0 \\
0 & 0 & \ast & \ast & 0 & 0 & \ast
\end{bmatrix}
\qquad
\begin{aligned}
\operatorname{bw}_1^R(\mathbf{A}) &= 0 ,\quad \operatorname{bw}_2^R(\mathbf{A}) = 0 ,\quad \operatorname{bw}_3^R(\mathbf{A}) = 2 ,\quad \operatorname{bw}_4^R(\mathbf{A}) = 0 ,\\
\operatorname{bw}_5^R(\mathbf{A}) &= 3 ,\quad \operatorname{bw}_6^R(\mathbf{A}) = 1 ,\quad \operatorname{bw}_7^R(\mathbf{A}) = 4 .
\end{aligned}
\]

env(A) = red entries,   ∗ ≙ non-zero matrix entry a_ij ≠ 0


0 0

2 2

4 4

6 6

8 8

10 10

12 12

14 14

16 16

18 18

20 20

0 2 4 6 8 10 12 14 16 18 20 0 2 4 6 8 10 12 14 16 18 20
Fig. 68 nz = 138 Fig. 69 nz = 121

Note: the envelope of the arrow matrix from Ex. 2.7.4.4 is just the set of index pairs of its non-zero entries.
Hence, the following theorem provides another reason for the sparsity of the LU-factors in that example. y
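
As a small illustration (my own sketch, not one of the lecture codes), the size of env(A) for a matrix with
symmetric non-zero pattern (→ Def. 2.7.5.11 below) can be computed from the row bandwidths of
Def. 2.7.5.2; this is also the amount of storage needed by the envelope-oriented format discussed at the
end of this section.

#include <Eigen/Dense>

// Sketch: size of env(A) for a matrix with symmetric non-zero pattern,
// computed from the row bandwidths bw_i^R(A) of Def. 2.7.5.2.
// In this case #env(A) = n + 2*sum_i bw_i^R(A).
long envelopeSize(const Eigen::MatrixXd &A) {
  const Eigen::Index n = A.rows();
  long size = static_cast<long>(n);  // the diagonal always belongs to env(A)
  for (Eigen::Index i = 0; i < n; ++i) {
    Eigen::Index bw = 0;
    for (Eigen::Index j = 0; j < i; ++j) {
      if (A(i, j) != 0) { bw = i - j; break; }  // leftmost non-zero in row i
    }
    size += 2 * static_cast<long>(bw);  // row part and, by symmetry, column part
  }
  return size;
}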

Theorem 2.7.5.4. Envelope and fill-in → [QSS00, Sect. 3.9]


If A ∈ K n,n is regular with LU-factorization A = LU, then fill-in (→ Def. 2.7.4.3) is confined to
env(A).

Gaussian elimination without pivoting


Proof. (by induction, version I) Examine the first step of Gaussian elimination without pivoting, a_11 ≠ 0:

\[
\mathbf{A} =
\begin{bmatrix} a_{11} & \mathbf{b}^{\top} \\ \mathbf{c} & \tilde{\mathbf{A}} \end{bmatrix}
=
\underbrace{\begin{bmatrix} 1 & \mathbf{0} \\ \tfrac{\mathbf{c}}{a_{11}} & \mathbf{I} \end{bmatrix}}_{\mathbf{L}^{(1)}}
\underbrace{\begin{bmatrix} a_{11} & \mathbf{b}^{\top} \\ \mathbf{0} & \tilde{\mathbf{A}} - \tfrac{\mathbf{c}\mathbf{b}^{\top}}{a_{11}} \end{bmatrix}}_{\mathbf{U}^{(1)}}
\]

\[
\text{If } (i,j) \notin \operatorname{env}(\mathbf{A}) \;\Rightarrow\;
\begin{cases} c_{i-1} = 0 , & \text{if } i > j ,\\ b_{j-1} = 0 , & \text{if } i < j , \end{cases}
\qquad\Rightarrow\quad
\operatorname{env}(\mathbf{L}^{(1)}) \subset \operatorname{env}(\mathbf{A}) ,\
\operatorname{env}(\mathbf{U}^{(1)}) \subset \operatorname{env}(\mathbf{A}) .
\]

Moreover, env(Ã − cb⊤/a_11) = env((A)_{2:n,2:n}). ✷

Proof. (by induction, version II) Use block-LU-factorization, cf. Rem. 2.3.2.19 and proof of
Lemma 2.3.2.4:

\[
\begin{bmatrix} \tilde{\mathbf{A}} & \mathbf{b} \\ \mathbf{c}^{\top} & \alpha \end{bmatrix}
=
\begin{bmatrix} \tilde{\mathbf{L}} & \mathbf{0} \\ \mathbf{l}^{\top} & 1 \end{bmatrix}
\begin{bmatrix} \tilde{\mathbf{U}} & \mathbf{u} \\ \mathbf{0} & \xi \end{bmatrix}
\quad\Rightarrow\quad
\tilde{\mathbf{U}}^{\top}\mathbf{l} = \mathbf{c} ,\
\tilde{\mathbf{L}}\mathbf{u} = \mathbf{b} .
\tag{2.7.5.5}
\]

From Def. 2.7.5.2:

If m_n^R(A) = m, then c_1, . . . , c_{n−m} = 0 (entries of c from (2.7.5.5)).
If m_n^C(A) = m, then b_1, . . . , b_{n−m} = 0 (entries of b from (2.7.5.5)).

For a lower triangular LSE (cf. Fig. 70):
If c_1, . . . , c_k = 0, then l_1, . . . , l_k = 0.
If b_1, . . . , b_k = 0, then u_1, . . . , u_k = 0.

⇓
assertion of the theorem ✷

Thm. 2.7.5.4 immediately suggests a policy for saving computational effort when solving linear systems
whose system matrix A ∈ K^{n,n} is sparse due to a small envelope:

♯ env(A) ≪ n²:   Policy: Confine elimination to the envelope!

Details will be given now:

Envelope-aware LU-factorization:

C++ code 2.7.5.6: Computing row bandwidths, → Def. 2.7.5.2 ➺ GITLAB


//! computes row bandwidth numbers m_i^R(A) of A (sparse
//! matrix) according to Def. 2.7.5.2
template <class numeric_t>
VectorXi rowbandwidth(const SparseMatrix<numeric_t> &A) {
  VectorXi m = VectorXi::Zero(A.rows());
  for (int k = 0; k < A.outerSize(); ++k) {
    for (typename SparseMatrix<numeric_t>::InnerIterator it(A, k); it; ++it) {
      m(it.row()) =
          std::max<VectorXi::Scalar>(m(it.row()), it.row() - it.col());
    }
  }
  return m;
}
//! computes row bandwidth numbers m_i^R(A) of A (dense
//! matrix) according to Def. 2.7.5.2
template <class Derived>
VectorXi rowbandwidth(const MatrixBase<Derived> &A) {
  VectorXi m = VectorXi::Zero(A.rows());
  for (int i = 1; i < A.rows(); ++i) {
    for (int j = 0; j < i; ++j) {
      if (A(i, j) != 0) {
        m(i) = i - j;
        break;
      }
    }
  }
  return m;
}

C++ code 2.7.5.7: Envelope aware forward substitution ➺ GITLAB


//! envelope aware forward substitution for Lx = y
//! (L = lower triangular matrix)
//! argument mr: row bandwidth vector
VectorXd substenv(const MatrixXd &L, const VectorXd &y, const VectorXi &mr) {
  const Eigen::Index n = L.cols();
  VectorXd x(n);
  x(0) = y(0) / L(0, 0);
  for (Eigen::Index i = 1; i < n; ++i) {
    if (mr(i) > 0) {
      const double zeta =
          L.row(i).segment(i - mr(i), mr(i)) * x.segment(i - mr(i), mr(i));
      x(i) = (y(i) - zeta) / L(i, i);
    } else {
      x(i) = y(i) / L(i, i);
    }
  }
  return x;
}

Asymptotic complexity of envelope aware forward substitution, cf. § 2.3.2.15, for Lx = y, L ∈ K n,n
regular lower triangular matrix is

O(# env(L)) !
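The two routines above can be combined directly. The following minimal usage sketch is our own illustration (it is not one of the official lecture codes): it assumes that rowbandwidth() from Code 2.7.5.6 and substenv() from Code 2.7.5.7 are available in the current translation unit and that `using namespace Eigen;` is in effect, as in the lecture codes.

#include <Eigen/Dense>
#include <iostream>
using namespace Eigen;

int main() {
  // Small lower-triangular test matrix with a single sub-diagonal
  const int n = 5;
  MatrixXd L = MatrixXd::Identity(n, n);
  for (int i = 1; i < n; ++i) L(i, i - 1) = -0.5;
  const VectorXd y = VectorXd::LinSpaced(n, 1.0, 5.0);
  const VectorXi mr = rowbandwidth(L);    // row bandwidths (Code 2.7.5.6)
  const VectorXd x = substenv(L, y, mr);  // envelope-aware forward substitution
  std::cout << "residual norm = " << (L * x - y).norm() << std::endl;
  return 0;
}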

By block LU-factorization (→ Rem. 2.3.2.19) we find


    
$$
\begin{bmatrix} (\mathbf{A})_{1:n-1,1:n-1} & (\mathbf{A})_{1:n-1,n}\\ (\mathbf{A})_{n,1:n-1} & (\mathbf{A})_{n,n} \end{bmatrix}
=
\begin{bmatrix} \mathbf{L}_1 & \mathbf{0}\\ \mathbf{l}^{\top} & 1 \end{bmatrix}
\begin{bmatrix} \mathbf{U}_1 & \mathbf{u}\\ \mathbf{0} & \gamma \end{bmatrix},
\tag{2.7.5.8}
$$
$$
\Rightarrow\quad
(\mathbf{A})_{1:n-1,1:n-1} = \mathbf{L}_1\mathbf{U}_1\;,\quad
\mathbf{L}_1\mathbf{u} = (\mathbf{A})_{1:n-1,n}\;,\quad
\mathbf{U}_1^{\top}\mathbf{l} = (\mathbf{A})_{n,1:n-1}^{\top}\;,\quad
\mathbf{l}^{\top}\mathbf{u} + \gamma = (\mathbf{A})_{n,n}\;.
\tag{2.7.5.9}
$$


C++ code 2.7.5.10: Envelope aware recursive LU-factorization ➺ GITLAB


//! envelope-aware recursive LU-factorization
//! of a structurally symmetric matrix
void luenv(const MatrixXd &A, MatrixXd &L, MatrixXd &U) {
  const Eigen::Index n = A.cols();
  assert(n == A.rows() && "A must be square");
  if (n == 1) {
    L.setIdentity();
    U = A;
  } else {
    VectorXi mr = rowbandwidth(A);  // = colbandwidth thanks to symmetry
    MatrixXd L1(n - 1, n - 1);
    MatrixXd U1(n - 1, n - 1);
    luenv(A.topLeftCorner(n - 1, n - 1), L1, U1);
    VectorXd u = substenv(L1, A.col(n - 1).head(n - 1), mr);
    VectorXd l =
        substenv(U1.transpose(), A.row(n - 1).head(n - 1).transpose(), mr);
    const double gamma = mr(n - 1) > 0 ?
        A(n - 1, n - 1) - l.tail(mr(n - 1)).dot(u.tail(mr(n - 1))) :
        A(n - 1, n - 1);
    L.topLeftCorner(n - 1, n - 1) = L1;
    L.col(n - 1).setZero();
    L.row(n - 1).head(n - 1) = l.transpose();
    L(n - 1, n - 1) = 1;
    U.topLeftCorner(n - 1, n - 1) = U1;
    U.col(n - 1).head(n - 1) = u;
    U.row(n - 1).setZero();
    U(n - 1, n - 1) = gamma;
  }
}

Implementation of envelope aware recursive LU-factorization (no pivoting !)

Assumption: A ∈ K n,n is structurally symmetric


Asymptotic complexity (A ∈ K n,n ) O(n · # env(A)) for n → ∞ .
Definition 2.7.5.11. Structurally symmetric matrix

A ∈ K n,n is structurally symmetric, if

$(\mathbf{A})_{i,j} \neq 0 \;\Leftrightarrow\; (\mathbf{A})_{j,i} \neq 0 \qquad \forall\, i,j \in \{1, \ldots, n\}\;.$

Since by Thm. 2.7.5.4 fill-in is confined to the envelope, we need to store only the matrix entries $a_{ij}$, $(i,j) \in \operatorname{env}(\mathbf{A})$, when computing the (in situ) LU-factorization of a structurally symmetric $\mathbf{A} \in \mathbb{K}^{n,n}$.

➤ Storage required: $n + 2\sum_{i=1}^{n} m_i(\mathbf{A})$ floating point numbers


➤ terminology: envelope oriented matrix storage

EXAMPLE 2.7.5.12 (Envelope oriented matrix storage) Linear envelope oriented matrix storage of
symmetric A = A⊤ ∈ R n,n :


Two arrays are used:
• scalar_t *val, of size $P := n + \sum_{i=1}^{n} m_i(\mathbf{A})$  (2.7.5.13), holding the entries inside the envelope row by row, each row from its leftmost envelope position up to and including the diagonal,
• size_t *dptr, with the indexing rule dptr[j] = k ⇔ val[k] = a_jj (the diagonal entries mark the end of each row's envelope section).

7×7 example matrix ($\ast$ = non-zero entry, symmetric):
$$
\mathbf{A} =
\begin{bmatrix}
\ast & 0 & \ast & 0 & 0 & 0 & 0\\
0 & \ast & 0 & 0 & \ast & 0 & 0\\
\ast & 0 & \ast & 0 & 0 & 0 & \ast\\
0 & 0 & 0 & \ast & \ast & 0 & \ast\\
0 & \ast & 0 & \ast & \ast & \ast & 0\\
0 & 0 & 0 & 0 & \ast & \ast & 0\\
0 & 0 & \ast & \ast & 0 & 0 & \ast
\end{bmatrix}
$$
val (P = 17 entries, numbered k = 1, ..., 17):
  a11 a22 a31 a32 a33 a44 a52 a53 a54 a55 a65 a66 a73 a74 a75 a76 a77
dptr: 0 1 2 5 6 10 12 17
(the entries 1, 2, 5, 6, 10, 12, 17 point to the positions of the diagonal entries a_11, ..., a_77 in val)
y
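The required storage P is easily determined programmatically. The sketch below is our own helper (not one of the official lecture codes); it assumes that the sparse-matrix version of rowbandwidth() from Code 2.7.5.6 is in scope and that `using namespace Eigen;` is in effect.

#include <Eigen/Sparse>
#include <iostream>
#include <vector>
using namespace Eigen;

// Number of scalars P = n + sum_i m_i(A) needed for linear envelope-oriented
// storage, cf. (2.7.5.13); assumes rowbandwidth() from Code 2.7.5.6 is in scope.
template <class numeric_t>
Index envelopeStorageSize(const SparseMatrix<numeric_t> &A) {
  return A.rows() + rowbandwidth(A).sum();
}

int main() {
  // symmetric tridiagonal test matrix, n = 5
  const int n = 5;
  std::vector<Triplet<double>> trip;
  for (int i = 0; i < n; ++i) {
    trip.emplace_back(i, i, 2.0);
    if (i > 0) {
      trip.emplace_back(i, i - 1, -1.0);
      trip.emplace_back(i - 1, i, -1.0);
    }
  }
  SparseMatrix<double> A(n, n);
  A.setFromTriplets(trip.begin(), trip.end());
  std::cout << "P = " << envelopeStorageSize(A) << std::endl;  // here 5 + 4 = 9
  return 0;
}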

Minimizing bandwidth/envelope:
Goal: Minimize $m_i(\mathbf{A})$, $\mathbf{A} = (a_{ij}) \in \mathbb{R}^{N,N}$, by permuting rows/columns of $\mathbf{A}$.
EXAMPLE 2.7.5.14 (Reducing bandwidth by row/column permutations) Recall: cyclic permutation
of rows/columns of arrow matrix applied in Ex. 2.7.4.6. This can be viewed as a drastic shrinking of the
envelope:
(Spy plots: envelope of the arrow matrix before and after the cyclic permutation of rows/columns; nz = 31 in both cases, but the envelope shrinks dramatically.)

Another example: reflection at the cross (anti-)diagonal, $i \leftarrow N + 1 - i$, reduces # env(A):
$$
\begin{bmatrix}
\ast & 0 & 0 & \ast & \ast & \ast\\
0 & \ast & 0 & 0 & 0 & 0\\
0 & 0 & \ast & 0 & 0 & 0\\
\ast & 0 & 0 & \ast & \ast & \ast\\
\ast & 0 & 0 & \ast & \ast & \ast\\
\ast & 0 & 0 & \ast & \ast & \ast
\end{bmatrix}
\;\longrightarrow\;
\begin{bmatrix}
\ast & \ast & \ast & 0 & 0 & \ast\\
\ast & \ast & \ast & 0 & 0 & \ast\\
\ast & \ast & \ast & 0 & 0 & \ast\\
0 & 0 & 0 & \ast & 0 & 0\\
0 & 0 & 0 & 0 & \ast & 0\\
\ast & \ast & \ast & 0 & 0 & \ast
\end{bmatrix}
$$
# env(A) = 30  →  # env(A) = 22
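The effect of such a reordering can be checked numerically. The following sketch is our own illustration (not an official lecture code): it builds a dense matrix with the pattern of the left matrix above, applies the reflection i ← N+1−i to rows and columns via Eigen's reverse(), and compares the sums of the row bandwidths as a crude measure of the envelope size. It assumes the dense rowbandwidth() helper from Code 2.7.5.6 is in scope.

#include <Eigen/Dense>
#include <iostream>
using namespace Eigen;

int main() {
  const int N = 6;
  MatrixXd A = MatrixXd::Zero(N, N);
  A.diagonal().setOnes();
  for (int i = 3; i < N; ++i) {
    A(i, 0) = A(0, i) = 1.0;                     // coupling with first row/column
    for (int j = 3; j < N; ++j) A(i, j) = 1.0;   // dense lower-right block
  }
  const MatrixXd B = A.reverse();                // i <- N+1-i for rows AND columns
  std::cout << "sum of row bandwidths: original " << rowbandwidth(A).sum()
            << ", reflected " << rowbandwidth(B).sum() << std::endl;
  return 0;
}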

EXAMPLE 2.7.5.15 (Reducing fill-in by reordering)


Envelope reducing permutations are at the heart of all modern sparse solvers (→ § 2.7.3.5). They employ elaborate algorithms for the analysis of matrix graphs, that is, the connections between components of the vector of unknowns defined by non-zero entries of the matrix. For further discussion see [AG11, Sect. 5.7].

E IGEN supplies a few ordering methods for sparse matrices.


These methods use permutations to aim for minimal bandwidth/envelope of a given sparse matrix. We study an example with a 347×347 matrix M originating in the numerical solution of partial differential equations, cf. Rem. 2.7.0.5.
(Figure: sparsity pattern of M.)
(Here: no row swaps from pivoting!)

C++ code 2.7.5.16: preordering in E IGEN ➺ GITLAB


// L and U cannot be extracted from SparseLU -> LDLT
const SimplicialLDLT<SpMat_t, Lower, AMDOrdering<int>> solver1(M);
const SimplicialLDLT<SpMat_t, Lower, NaturalOrdering<int>> solver2(M);
const MatrixXd U1 =
    solver1.matrixU() *
    MatrixXd::Identity(M.rows(), M.cols());  // explicit conversion fixes occasional segfault
const MatrixXd U2 = MatrixXd(solver2.matrixU());
// Plotting
spy(M, "Sparse matrix M", "MSpy.eps");
spy(U1, "U factor (approximate minimum degree)", "AMDUSpy.eps");
spy(U2, "U factor (no reordering)", "NaturalUSpy.eps");

Examine patterns of LU-factors (→ Section 2.3.2) after reordering:

(Spy plots of the U-factors: no reordering vs. approximate minimum degree ordering.)


2.8 Stable Gaussian Elimination Without Pivoting


Recall some insights gained, and examples and experiments seen so far in this chapter:
• Thm. 2.7.5.4 ➣ special structure of the matrix helps avoid fill-in in Gaussian elimination/LU-
factorization without pivoting.
• Ex. 2.7.4.9 ➣ pivoting can trigger huge fill-in that would not occur without it.
• Ex. 2.7.5.15 ➣ fill-in reducing effect of reordering can be thwarted by later row swapping in the
course of pivoting.
• BUT pivoting is essential for stability of Gaussian elimination/LU-factorization → Ex. 2.3.3.1.

It would be very desirable to have a priori criteria, when Gaussian elimination/LU-factorization re-
mains stable even without pivoting. This can help avoid the extra work for partial pivoting and makes
it possible to exploit structure without worrying about stability.

This section will introduce classes of matrices that allow Gaussian elimination without pivoting. Fortunately,
linear systems of equations featuring system matrices from these classes are very common in applications.

EXAMPLE 2.8.0.1 (Diagonally dominant matrices from nodal analysis → Ex. 2.1.0.3)
Consider an electrical circuit entirely composed of Ohmic resistors (nodes ➀–➅, resistors R12, R23, R24, R25, R14, R35, R45, R56, and a voltage source U between node ➀ and ground node ➅).
Circuit equations from nodal analysis, see Ex. 2.1.0.3:

$$
\begin{aligned}
➁:&\quad R_{12}^{-1}(U_2 - U_1) + R_{23}^{-1}(U_2 - U_3) + R_{24}^{-1}(U_2 - U_4) + R_{25}^{-1}(U_2 - U_5) = 0\;,\\
➂:&\quad R_{23}^{-1}(U_3 - U_2) + R_{35}^{-1}(U_3 - U_5) = 0\;,\\
➃:&\quad R_{14}^{-1}(U_4 - U_1) + R_{24}^{-1}(U_4 - U_2) + R_{45}^{-1}(U_4 - U_5) = 0\;,\\
➄:&\quad R_{25}^{-1}(U_5 - U_2) + R_{35}^{-1}(U_5 - U_3) + R_{45}^{-1}(U_5 - U_4) + R_{56}^{-1}(U_5 - U_6) = 0\;,\\
&\quad U_1 = U\;,\quad U_6 = 0\;.
\end{aligned}
$$

 1 1    
R12 + R23+ R124 + 1
R25 − R123 − R124 − R125 U2
1
R12
 − R123 1 1
− R135    
 R23 + R35 0 U3   0 
   =  1 U
 − R124 0 1
R24 + R45
1
− R145  U4  R14 
− R125 − R135 − R145 1
+ R135 + R145 + 1 U5 0
R22 R56

➣ The matrix $\mathbf{A} \in \mathbb{R}^{n,n}$ arising from nodal analysis satisfies
• $\mathbf{A} = \mathbf{A}^{\top}$, $a_{kk} > 0$, $a_{kj} \leq 0$ for $k \neq j$,   (2.8.0.2)
• $\sum_{j=1}^{n} a_{kj} \geq 0$, $k = 1, \ldots, n$,   (2.8.0.3)
• $\mathbf{A}$ is regular.   (2.8.0.4)


All these properties are obvious except for the fact that A is regular.
Proof of (2.8.0.4): By Thm. 2.2.1.4 it suffices to show that the nullspace of A is trivial: Ax = 0 ⇒ x=
0. So we pick x ∈ R n , Ax = 0, and denote by i ∈ {1, . . . , n} the index such that

| xi | = max{| x j |, j = 1, . . . , n} .

Intermediate goal: show that all entries of x are the same:
$$
\mathbf{A}\mathbf{x} = 0 \;\Rightarrow\; x_i = -\sum_{j \neq i} \frac{a_{ij}}{a_{ii}}\, x_j
\;\Rightarrow\; |x_i| \leq \sum_{j \neq i} \frac{|a_{ij}|}{|a_{ii}|}\, |x_j|\;. \tag{2.8.0.5}
$$
By (2.8.0.3) and the sign condition from (2.8.0.2) we conclude
$$
\sum_{j \neq i} \frac{|a_{ij}|}{|a_{ii}|} \leq 1\;. \tag{2.8.0.6}
$$
Hence, (2.8.0.6) combined with the estimate (2.8.0.5), which says that the maximal modulus $|x_i|$ is bounded by a weighted mean of the other moduli, implies $|x_j| = |x_i|$ for all $j = 1, \ldots, n$. Finally, the sign condition $a_{kj} \leq 0$ for $k \neq j$ enforces the same sign for all $x_i$. Thus we conclude, w.l.o.g., $x_1 = x_2 = \cdots = x_n$. Since
$$
\exists\, i \in \{1, \ldots, n\}:\quad \sum_{j=1}^{n} a_{ij} > 0 \quad \text{(strict inequality)}\;,
$$
$\mathbf{A}\mathbf{x} = 0$ is only possible for $\mathbf{x} = 0$. y

§2.8.0.7 (Diagonally dominant matrices)

Definition 2.8.0.8. Diagonally dominant matrix → [QSS00, Def. 1.24]


A ∈ K n,n is diagonally dominant, if

$\forall\, k \in \{1, \ldots, n\}:\quad \sum_{j \neq k} |a_{kj}| \leq |a_{kk}|\;.$


The matrix A is called strictly diagonally dominant, if

$\forall\, k \in \{1, \ldots, n\}:\quad \sum_{j \neq k} |a_{kj}| < |a_{kk}|\;.$
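Checking Def. 2.8.0.8 for a concrete matrix is a one-loop task. The following small helper is our own sketch (not an official lecture code) of such a test:

#include <Eigen/Dense>
#include <cassert>
#include <cmath>
#include <iostream>
using namespace Eigen;

// Checks (strict) diagonal dominance in the sense of Def. 2.8.0.8
bool isDiagonallyDominant(const MatrixXd &A, bool strict = false) {
  assert(A.rows() == A.cols() && "A must be square");
  for (Index k = 0; k < A.rows(); ++k) {
    const double offdiag = A.row(k).cwiseAbs().sum() - std::abs(A(k, k));
    if (strict ? (offdiag >= std::abs(A(k, k))) : (offdiag > std::abs(A(k, k))))
      return false;
  }
  return true;
}

int main() {
  MatrixXd A(3, 3);
  A << 4, -1, -1, -1, 4, -1, -1, -1, 4;  // strictly diagonally dominant
  std::cout << std::boolalpha << isDiagonallyDominant(A, true) << std::endl;
  return 0;
}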

Lemma 2.8.0.9. LU-factorization of diagonally dominant matrices


A regular, diagonally dominant, with positive diagonal
⇒   A has an LU-factorization   ⇔   Gaussian elimination is feasible without pivoting (∗)

(∗): In fact, when we apply partial pivoting to a diagonally dominant matrix it will trigger not a single row
permutation, because (2.3.3.5) will always be satisfied for j = k!
➣ We can dispense with pivoting without compromising stability.

Proof.(of Lemma 2.8.0.9). Appealing to (2.3.1.12) we rely on induction w.r.t. n:


It is clear that partial pivoting in the first step selects a11 as pivot element, cf. (2.3.3.5). Thus after the 1st
step of elimination we obtain the modified entries
$$
a_{ij}^{(1)} = a_{ij} - \frac{a_{i1}}{a_{11}}\, a_{1j}\;,\quad i,j = 2, \ldots, n
\qquad\Rightarrow\qquad a_{ii}^{(1)} > 0\;,
$$
which we conclude from diagonal dominance. That also permits us to infer
$$
\begin{aligned}
|a_{ii}^{(1)}| - \sum_{\substack{j=2\\ j\neq i}}^{n} |a_{ij}^{(1)}|
&= a_{ii} - \frac{a_{i1}}{a_{11}}\, a_{1i} - \sum_{\substack{j=2\\ j\neq i}}^{n} \Bigl| a_{ij} - \frac{a_{i1}}{a_{11}}\, a_{1j} \Bigr|\\
&\geq a_{ii} - \frac{|a_{i1}||a_{1i}|}{a_{11}} - \sum_{\substack{j=2\\ j\neq i}}^{n} |a_{ij}| - \frac{|a_{i1}|}{a_{11}} \sum_{\substack{j=2\\ j\neq i}}^{n} |a_{1j}|\\
&\geq a_{ii} - \frac{|a_{i1}||a_{1i}|}{a_{11}} - \sum_{\substack{j=2\\ j\neq i}}^{n} |a_{ij}| - |a_{i1}|\, \frac{a_{11} - |a_{1i}|}{a_{11}}
\;\geq\; a_{ii} - \sum_{\substack{j=1\\ j\neq i}}^{n} |a_{ij}| \;\geq\; 0\;.
\end{aligned}
$$

A regular, diagonally dominant ⇒ partial pivoting according to (2.3.3.5) selects i-th row in i-th step. y

§2.8.0.10 (Gaussian elimination for symmetric positive definite (s.p.d.) matrices) The class of symmetric positive definite (s.p.d.) matrices has been defined in Def. 1.1.2.6. They permit stable Gaussian elimination without pivoting:

Theorem 2.8.0.11. Gaussian elimination for s.p.d. matrices

Every symmetric/Hermitian positive definite matrix (s.p.d. → Def. 1.1.2.6) possesses an LU-
decomposition (→ Section 2.3.2).

Equivalent to the assertion of the theorem is the assertion that for s.p.d. matrices Gaussian elimination is
feasible without pivoting.
In fact, this theorem is a corollary of Lemma 2.3.2.4, because all principal minors of an s.p.d. matrix are
s.p.d. themselves. However, we outline an alternative self-contained proof:

Proof. (of Thm. 2.8.0.11) We pursue a proof by induction with respect to the matrix size n. The assertion in the case n = 1 is obviously true.
For the induction argument n − 1 ⇒ n consider the first step of the elimination algorithm
$$
\mathbf{A} = \begin{bmatrix} a_{11} & \mathbf{b}^{\top}\\ \mathbf{b} & \tilde{\mathbf{A}} \end{bmatrix}
\;\xrightarrow[\text{Gaussian elimination}]{\text{1. step}}\;
\begin{bmatrix} a_{11} & \mathbf{b}^{\top}\\ \mathbf{0} & \tilde{\mathbf{A}} - \tfrac{\mathbf{b}\mathbf{b}^{\top}}{a_{11}} \end{bmatrix}.
$$
This step poses no problem, because all diagonal entries of an s.p.d. matrix are strictly positive.
The induction requires us to show that the lower-right block $\tilde{\mathbf{A}} - \tfrac{\mathbf{b}\mathbf{b}^{\top}}{a_{11}} \in \mathbb{R}^{n-1,n-1}$ is also symmetric and positive definite. Its symmetry is evident, but the demonstration of the s.p.d. property relies on a trick: as $\mathbf{A}$ is s.p.d. (→ Def. 1.1.2.6), for every $\mathbf{y} \in \mathbb{R}^{n-1}\setminus\{0\}$
$$
0 <
\begin{bmatrix} -\tfrac{\mathbf{b}^{\top}\mathbf{y}}{a_{11}}\\ \mathbf{y} \end{bmatrix}^{\top}
\begin{bmatrix} a_{11} & \mathbf{b}^{\top}\\ \mathbf{b} & \tilde{\mathbf{A}} \end{bmatrix}
\begin{bmatrix} -\tfrac{\mathbf{b}^{\top}\mathbf{y}}{a_{11}}\\ \mathbf{y} \end{bmatrix}
= \mathbf{y}^{\top}\Bigl(\tilde{\mathbf{A}} - \tfrac{\mathbf{b}\mathbf{b}^{\top}}{a_{11}}\Bigr)\mathbf{y}\;.
$$
We conclude that $\tilde{\mathbf{A}} - \tfrac{\mathbf{b}\mathbf{b}^{\top}}{a_{11}}$ is positive definite. Thus, according to the induction hypothesis, Gaussian elimination without pivoting can now be applied to that lower-right block. ✷



The proof can also be based on the identities
$$
\begin{bmatrix} (\mathbf{A})_{1:n-1,1:n-1} & (\mathbf{A})_{1:n-1,n}\\ (\mathbf{A})_{n,1:n-1} & (\mathbf{A})_{n,n} \end{bmatrix}
=
\begin{bmatrix} \mathbf{L}_1 & \mathbf{0}\\ \mathbf{l}^{\top} & 1 \end{bmatrix}
\begin{bmatrix} \mathbf{U}_1 & \mathbf{u}\\ \mathbf{0} & \gamma \end{bmatrix},
\tag{2.7.5.8}
$$
$$
\Rightarrow\quad (\mathbf{A})_{1:n-1,1:n-1} = \mathbf{L}_1\mathbf{U}_1\;,\quad
\mathbf{L}_1\mathbf{u} = (\mathbf{A})_{1:n-1,n}\;,\quad
\mathbf{U}_1^{\top}\mathbf{l} = (\mathbf{A})_{n,1:n-1}^{\top}\;,\quad
\mathbf{l}^{\top}\mathbf{u} + \gamma = (\mathbf{A})_{n,n}\;,
$$
noticing that the principal minor $(\mathbf{A})_{1:n-1,1:n-1}$ is also s.p.d. This allows a simple induction argument.

Note: no pivoting required (→ Section 2.3.3)


(partial pivoting always picks current pivot row)

The next result gives a useful criterion for telling whether a given symmetric/Hermitian matrix is s.p.d.:

Lemma 2.8.0.12. Diagonal dominance and definiteness

A diagonally dominant Hermitian/symmetric matrix with non-negative diagonal entries is positive


semi-definite.
A strictly diagonally dominant Hermitian/symmetric matrix with positive diagonal entries is positive
definite.

Proof. For $\mathbf{A} = \mathbf{A}^{\mathrm{H}}$ diagonally dominant, use the inequality between the arithmetic and geometric mean (AGM), $ab \leq \tfrac{1}{2}(a^2 + b^2)$:
$$
\begin{aligned}
\mathbf{x}^{\mathrm{H}}\mathbf{A}\mathbf{x}
&= \sum_{i=1}^{n} a_{ii}|x_i|^2 + \sum_{i \neq j} a_{ij}\,\bar{x}_i x_j
\;\geq\; \sum_{i=1}^{n} a_{ii}|x_i|^2 - \sum_{i \neq j} |a_{ij}||x_i||x_j|\\
&\overset{\text{AGM}}{\geq} \sum_{i=1}^{n} a_{ii}|x_i|^2 - \tfrac{1}{2}\sum_{i \neq j} |a_{ij}|\bigl(|x_i|^2 + |x_j|^2\bigr)\\
&\geq \tfrac{1}{2}\Bigl(\sum_{i=1}^{n}\bigl\{a_{ii}|x_i|^2 - \sum_{j \neq i}|a_{ij}||x_i|^2\bigr\}\Bigr)
 + \tfrac{1}{2}\Bigl(\sum_{j=1}^{n}\bigl\{a_{jj}|x_j|^2 - \sum_{i \neq j}|a_{ij}||x_j|^2\bigr\}\Bigr)\\
&\geq \sum_{i=1}^{n} |x_i|^2\Bigl(a_{ii} - \sum_{j \neq i}|a_{ij}|\Bigr) \;\geq\; 0\;. \qquad ✷
\end{aligned}
$$

§2.8.0.13 (Cholesky decomposition)

Lemma 2.8.0.14. Cholesky decomposition for s.p.d. matrices → [Gut09, Sect. 3.4], [Han02,
Sect. II.5], [QSS00, Thm. 3.6]

For any s.p.d. A ∈ K n,n , n ∈ N, there is a unique upper triangular matrix R ∈ K n,n with rii > 0,
i = 1, . . . , n, such that A = RH R (Cholesky decomposition).

Proof. Thm. 2.8.0.11 and Lemma 2.3.2.4 ensure the existence of a unique LU-decomposition of $\mathbf{A}$: $\mathbf{A} = \mathbf{L}\mathbf{U}$, which we can rewrite as
$$
\mathbf{A} = \mathbf{L}\mathbf{D}\tilde{\mathbf{U}}\;,\quad
\mathbf{D} \;\hat{=}\; \text{diagonal of } \mathbf{U}\;,\quad
\tilde{\mathbf{U}} \;\hat{=}\; \text{normalized upper triangular matrix} \to \text{Def. 1.1.2.3}\;.
$$
Due to the uniqueness of the LU-decomposition we infer
$$
\mathbf{A} = \mathbf{A}^{\top} \;\Rightarrow\; \mathbf{U} = \mathbf{D}\mathbf{L}^{\top} \;\Rightarrow\; \mathbf{A} = \mathbf{L}\mathbf{D}\mathbf{L}^{\top}\;,
$$
with unique $\mathbf{L}$, $\mathbf{D}$ ($\mathbf{D}$ a diagonal matrix). Inserting $\mathbf{x} = \mathbf{L}^{-\top}\mathbf{y}$ shows
$$
\mathbf{x}^{\top}\mathbf{A}\mathbf{x} > 0 \quad \forall\, \mathbf{x} \neq 0
\qquad\Rightarrow\qquad
\mathbf{y}^{\top}\mathbf{D}\mathbf{y} > 0 \quad \forall\, \mathbf{y} \neq 0\;.
$$
➤ The diagonal matrix $\mathbf{D}$ has a positive diagonal and, hence, we can take its “square root” and choose $\mathbf{R} := \sqrt{\mathbf{D}}\,\mathbf{L}^{\top}$. ✷

We find formulas analogous to (2.3.2.7)

$$
\mathbf{R}^{\mathrm{H}}\mathbf{R} = \mathbf{A}
\;\Rightarrow\;
a_{ik} = \sum_{j=1}^{\min\{i,k\}} \bar{r}_{ji}\, r_{jk}
=
\begin{cases}
\displaystyle\sum_{j=1}^{i-1} \bar{r}_{ji}\, r_{jk} + r_{ii}\, r_{ik}\;, & \text{if } i < k\;,\\[2ex]
\displaystyle\sum_{j=1}^{i-1} |r_{ji}|^2 + r_{ii}^2\;, & \text{if } i = k\;.
\end{cases}
\tag{2.8.0.15}
$$

C++ code 2.8.0.16: Simple Cholesky factorization ➺ GITLAB


//! simple Cholesky factorization
void cholfac(const MatrixXd &A, MatrixXd &R) {
  const Eigen::Index n = A.rows();
  R = A;
  for (Eigen::Index k = 0; k < n; ++k) {
    for (Eigen::Index j = k + 1; j < n; ++j) {
      R.row(j).tail(n - j) -= R.row(k).tail(n - j) * R(k, j) / R(k, k);
    }
    R.row(k).tail(n - k) /= std::sqrt(R(k, k));
  }
  R.triangularView<StrictlyLower>().setZero();
}
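A quick way to convince oneself of the correctness of this routine is to apply it to a synthetic s.p.d. matrix and check the factorization residual. The sketch below is our own test harness (not an official lecture code); it assumes cholfac() from Code 2.8.0.16 is in scope and relies on the uniqueness of the Cholesky factor (Lemma 2.8.0.14) when comparing with Eigen's built-in LLT.

#include <Eigen/Dense>
#include <iostream>
using namespace Eigen;

int main() {
  const int n = 6;
  const MatrixXd M = MatrixXd::Random(n, n);
  const MatrixXd A = M.transpose() * M + n * MatrixXd::Identity(n, n);  // s.p.d.
  MatrixXd R(n, n);
  cholfac(A, R);                         // upper triangular factor, A = R^T R
  const LLT<MatrixXd> llt(A);
  const MatrixXd U_eigen = llt.matrixU();
  std::cout << "factorization error = " << (R.transpose() * R - A).norm()
            << ", difference to Eigen LLT = " << (U_eigen - R).norm() << std::endl;
  return 0;
}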

Cost of Cholesky decomposition

The asymptotic computational cost (number of elementary arithmetic operations) of computing the Cholesky decomposition of an $n \times n$ s.p.d. matrix is $\tfrac{1}{6}n^3 + O(n^2)$ for matrix size $n \to \infty$.

This is “half the costs” of computing a general LU-factorization, cf. Code in § 2.3.2.6, but this does not
mean “twice as fast” in a concrete implementation, because memory access patterns will have a crucial
impact, see Rem. 1.4.1.5.
Gains of efficiency hardly justify the use of Cholesky decomposition in modern numerical algorithms.
Savings in memory compared to standard LU-factorization (only one factor R has to be stored) offer a
stronger reason to prefer the Cholesky decomposition. y

§2.8.0.18 (Cholesky-type decompositions in E IGEN) Hardly surprising, E IGEN provides library routines for the computation of the (generalized) Cholesky decomposition of a symmetric (positive definite) matrix.
For dense or sparse matrices these are the methods (→ E IGEN documentation)
• LLT() for computing a genuine Cholesky decomposition,
• LDLT() for computing a factorization A = LDL⊤ with a normalized lower-triangular matrix L and
a diagonal matrix D.


These methods are invoked like all other matrix decomposition methods, refer to § 2.5.0.8, where
solverType is to be replaced with either LLT or LDLT. Rem. 2.5.0.10 also applies. The LDLT-
decomposition can be attempted for any symmetric matrix, but need not exist. y
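As an illustration, here is a minimal sketch (our own, not an official lecture code) of solving an s.p.d. linear system with Eigen's LLT class; the matrix is a synthetic s.p.d. example and the variable names are ours:

#include <Eigen/Dense>
#include <iostream>
using namespace Eigen;

int main() {
  const int n = 5;
  const MatrixXd M = MatrixXd::Random(n, n);
  const MatrixXd A = M.transpose() * M + n * MatrixXd::Identity(n, n);  // s.p.d.
  const VectorXd b = VectorXd::Ones(n);
  // Cholesky-based solver; use .ldlt()/LDLT for the factorization A = L*D*L^T
  const LLT<MatrixXd> llt(A);
  if (llt.info() != Success) {
    std::cerr << "factorization failed" << std::endl;
    return 1;
  }
  const VectorXd x = llt.solve(b);
  std::cout << "residual = " << (A * x - b).norm() << std::endl;
  return 0;
}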

§2.8.0.19 (Numerical stability of Cholesky decomposition) The computation of Cholesky-factorization


by means of the algorithm of Code 2.8.0.16 is numerically stable (→ Def. 1.5.5.19)!
To understand this recall Thm. 2.4.0.5: Numerical instability of Gaussian elimination (with any kind of piv-
oting) manifests itself in massive growth of the entries of the intermediate matrices A(k) arising during
elimination. Then use the relationship between LU-factorization and Cholesky decomposition, which tells
us that we only have to monitor the growth of entries of intermediate upper triangular “Cholesky factoriza-
tion matrices” A = (R(k) )H R(k) . We consider the Euclidean vector norm/matrix norm (→ Def. 1.5.5.10)
$\|\cdot\|_2$:
$$
\mathbf{A} = \mathbf{R}^{\mathrm{H}}\mathbf{R}
\;\Rightarrow\;
\|\mathbf{A}\|_2 = \sup_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\mathrm{H}}\mathbf{R}^{\mathrm{H}}\mathbf{R}\mathbf{x}
= \sup_{\|\mathbf{x}\|_2 = 1} (\mathbf{R}\mathbf{x})^{\mathrm{H}}(\mathbf{R}\mathbf{x})
= \|\mathbf{R}\|_2^2\;.
$$
➤ For all intermediate Cholesky factorization matrices holds $\bigl\|(\mathbf{R}^{(k)})^{\mathrm{H}}\bigr\|_2 = \bigl\|\mathbf{R}^{(k)}\bigr\|_2 = \|\mathbf{A}\|_2^{1/2}$! Of course, this rules out a blowup of entries of the $\mathbf{R}^{(k)}$.
Computation of the Cholesky decomposition largely agrees with the computation of LU-factorization (with-
out pivoting). Using the latter together with forward and backward substitution (→ Section 2.3.2) to solve
a linear system of equations is algebraically and numerically equivalent to using Gaussian elimination
without pivoting. From these equivalences we conclude:
Solving an LSE with s.p.d. system matrix via Cholesky decomposition + forward & backward substitution is numerically stable (→ Def. 1.5.5.19).
⇕
Gaussian elimination for s.p.d. matrices: Gaussian elimination without pivoting is a numerically stable way to solve LSEs with s.p.d. system matrix.

Learning Outcomes
Principal take-home knowledge and skills from this chapter:
• A clear understanding of the algorithm of Gaussian elimination with and without pivoting (prerequisite
knowledge from linear algebra)
• Insight into the relationship between Gaussian elimination and LU-decomposition and the algorith-
mic relevance of LU-decomposition
• Awareness of the asymptotic complexity of dense Gaussian elimination, LU-decomposition, and
elimination for special matrices
• Familiarity with “sparse matrices”: notion, data structures, initialization, benefits
• Insight into the reduced computational complexity of the direct solution of sparse linear systems of
equations with special structural properties.

Bibliography

[AG11] Uri M. Ascher and Chen Greif. A first course in numerical methods. Vol. 7. Computational
Science & Engineering. Society for Industrial and Applied Mathematics (SIAM), Philadelphia,
PA, 2011, pp. xxii+552. DOI: 10.1137/1.9780898719987 (cit. on pp. 136, 144, 205).
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 145, 152, 153, 156, 157, 199).
[GGK14] W. Gander, M.J. Gander, and F. Kwok. Scientific Computing. Vol. 11. Texts in Computational
Science and Engineering. Heidelberg: Springer, 2014 (cit. on p. 132).
[GV89] G.H. Golub and C.F. Van Loan. Matrix computations. 2nd. Baltimore, London: John Hopkins
University Press, 1989 (cit. on p. 159).
[Gut09] M.H. Gutknecht. Lineare Algebra. Lecture Notes. SAM, ETH Zürich, 2009 (cit. on pp. 130, 136,
137, 143–147, 209).
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 143, 144, 156,
209).
[Hig02] N.J. Higham. Accuracy and Stability of Numerical Algorithms. 2nd ed. Philadelphia, PA: SIAM,
2002 (cit. on p. 159).
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
pp. 126, 130, 136, 137, 139, 143–145, 151, 153, 155).
[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Math-
ematics. New York: Springer, 2000 (cit. on pp. 127, 130, 132, 136, 143, 146, 200, 207, 209).
[SST06] A. Sankar, D.A. Spielman, and S.-H. Teng. “Smoothed analysis of the condition numbers and
growth factors of matrices”. In: SIAM J. Matrix Anal. Appl. 28.2 (2006), pp. 446–476 (cit. on
p. 162).
[SG04] O. Schenk and K. Gärtner. “Solving Unsymmetric Sparse Systems of Linear Equations with
PARDISO”. In: J. Future Generation Computer Systems 20.3 (2004), pp. 475–487 (cit. on
p. 192).
[ST96] D.A. Spielman and Shang-Hua Teng. “Spectral partitioning works: planar graphs and finite el-
ement meshes”. In: Foundations of Computer Science, 1996. Proceedings., 37th Annual Sym-
posium on. Oct. 1996, pp. 96–105. DOI: 10.1109/SFCS.1996.548468 (cit. on p. 161).
[TB97] L.N. Trefethen and D. Bau. Numerical Linear Algebra. Philadelphia, PA: SIAM, 1997 (cit. on
pp. 159, 161).

Chapter 3

Direct Methods for Linear Least Squares


Problems

In this chapter we study numerical methods for overdetermined (OD) linear systems of equations, that
is, linear systems with a “tall” rectangular system matrix
   
   
    
   
$$
\mathbf{x} \in \mathbb{R}^n:\quad \text{“}\mathbf{A}\mathbf{x} = \mathbf{b}\text{”}\;,\qquad
\mathbf{b} \in \mathbb{R}^m\;,\; \mathbf{A} \in \mathbb{R}^{m,n}\;,\; m \geq n\;.
\tag{3.0.0.1}
$$
(Sketch: a “tall” matrix $\mathbf{A}$ applied to a short vector $\mathbf{x}$ is supposed to match a long vector $\mathbf{b}$.)

We point out that, in contrast to Chapter 1, Chapter 2, we will restrict ourselves to real linear systems in
this chapter.
Note that the quotation marks in (3.0.0.1) indicate that this is not a well-defined problem in the sense of § 1.5.5.1; Ax = b does not define a mapping (A, b) ↦ x, because
• such a vector x ∈ R^n may not exist,
• and, even if it exists, it may not be unique.
Therefore, we first have to establish a crisp concept of what we mean by a “solution” of (3.0.0.1).

Contents
3.0.1 Overdetermined Linear Systems of Equations: Examples . . . . . . . . . . . 214
3.1 Least Squares Solution Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
3.1.1 Least Squares Solutions: Definition . . . . . . . . . . . . . . . . . . . . . . . . 218
3.1.2 Normal Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
3.1.3 Moore-Penrose Pseudoinverse . . . . . . . . . . . . . . . . . . . . . . . . . . 227
3.1.4 Sensitivity of Least Squares Problems . . . . . . . . . . . . . . . . . . . . . . 229
3.2 Normal Equation Methods [DR08, Sect. 4.2], [Han02, Ch. 11] . . . . . . . . . . . . 230
3.3 Orthogonal Transformation Methods [DR08, Sect. 4.4.2] . . . . . . . . . . . . . . . 234
3.3.1 Transformation Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
3.3.2 Orthogonal/Unitary Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
3.3.3 QR-Decomposition [Han02, Sect. 13], [Gut09, Sect. 7.3] . . . . . . . . . . . . 236
3.3.4 QR-Based Solver for Linear Least Squares Problems . . . . . . . . . . . . . . 252
3.3.5 Modification Techniques for QR-Decomposition . . . . . . . . . . . . . . . . 257


3.4 Singular Value Decomposition (SVD) . . . . . . . . . . . . . . . . . . . . . . . . . . 264


3.4.1 SVD: Definition and Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
3.4.2 SVD in E IGEN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
3.4.3 Solving General Least-Squares Problems by SVD . . . . . . . . . . . . . . . 272
3.4.4 SVD-Based Optimization and Approximation . . . . . . . . . . . . . . . . . 275
3.5 Total Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
3.6 Constrained Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
3.6.1 Solution via Lagrangian Multipliers . . . . . . . . . . . . . . . . . . . . . . . 298
3.6.2 Solution via SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300

3.0.1 Overdetermined Linear Systems of Equations: Examples

Video tutorial for Section 3.0.1 "Overdetermined Linear Systems of Equations: Examples":
(12 minutes) Download link, tablet notes

→ review questions 3.0.1.11

You may think that overdetermined linear systems of equations are exotic, but this is not true. Rather they
are very common in mathematical models.
EXAMPLE 3.0.1.1 (Linear parameter estimation in 1D) From first principles it is known that two physical
quantities x ∈ R and y ∈ R (e.g., pressure and density of an ideal gas) are related by a linear relationship

y = αx + β for some unknown coefficients/parameters α, β ∈ R . (3.0.1.2)

We carry out m ∈ N measurements that yield pairs ( xi , yi ) ∈ R2 , i = 1, . . . , m, m ≥ 2. If the measure-


ments were perfect, we could expect that there exist α, β ∈ R such that yi = αxi + β for all i = 1, . . . , m.
This is an overdetermined linear system of equations of the form (3.0.0.1):
   
$$
\begin{bmatrix}
x_1 & 1\\ x_2 & 1\\ \vdots & \vdots\\ x_m & 1
\end{bmatrix}
\begin{bmatrix} \alpha\\ \beta \end{bmatrix}
=
\begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_m \end{bmatrix}
\quad\leftrightarrow\quad
\mathbf{A}\mathbf{x} = \mathbf{b}\;,\quad \mathbf{A} \in \mathbb{R}^{m,2},\; \mathbf{b} \in \mathbb{R}^{m},\; \mathbf{x} \in \mathbb{R}^{2}\;.
\tag{3.0.1.3}
$$

In practice inevitable (“random”) measurement errors will affect the $y_i$s, push the vector b out of the range/image R(A) of A (→ Def. 2.2.1.2), and thwart the solvability of (3.0.1.3). Assuming independent and randomly distributed measurement errors in the $y_i$, for m > 2 the probability that a solution $[\alpha, \beta]^{\top}$ exists is actually zero, see Rem. 3.1.0.2. y

EXAMPLE 3.0.1.4 (Linear regression: Parameter estimation for a linear model) Ex. 3.0.1.1 can be
generalized to higher dimensions:

Given: measured data points (xi , yi ), xi ∈ R n , yi ∈ R, i = 1, . . . , m, m ≥ n + 1


(yi , xi affected by measurement errors).

Known: without measurement errors the data would satisfy an affine linear relationship y = a^⊤x + β, for some a ∈ R^n, β ∈ R.
Plugging in the measured quantities gives yi = a⊤ xi + β, i = 1, . . . , m, a linear system of equations of


the form
$$
\begin{bmatrix}
\mathbf{x}_1^{\top} & 1\\ \vdots & \vdots\\ \mathbf{x}_m^{\top} & 1
\end{bmatrix}
\begin{bmatrix} \mathbf{a}\\ \beta \end{bmatrix}
=
\begin{bmatrix} y_1\\ \vdots\\ y_m \end{bmatrix}
\quad\leftrightarrow\quad
\mathbf{A}\mathbf{x} = \mathbf{b}\;,\quad \mathbf{A} \in \mathbb{R}^{m,n+1},\; \mathbf{b} \in \mathbb{R}^{m},\; \mathbf{x} \in \mathbb{R}^{n+1}\;,
\tag{3.0.1.5}
$$
which is an overdetermined LSE in case m > n + 1. y

EXAMPLE 3.0.1.6 (Measuring the angles of a triangle [NS02, Sect. 5.1]) We measure the angles of a planar triangle and obtain $\tilde{\alpha}, \tilde{\beta}, \tilde{\gamma}$ (in radians). In the case of perfect measurements the true angles $\alpha, \beta, \gamma$ would satisfy
$$
\begin{bmatrix}
1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\\ 1 & 1 & 1
\end{bmatrix}
\begin{bmatrix} \alpha\\ \beta\\ \gamma \end{bmatrix}
=
\begin{bmatrix} \tilde{\alpha}\\ \tilde{\beta}\\ \tilde{\gamma}\\ \pi \end{bmatrix}.
\tag{3.0.1.7}
$$
Measurement errors will inevitably make the measured angles fail to add up to π so that (3.0.1.7) will not

have a solution [α, β, γ] .
Then, why should we add this last equation? This is suggested by a tenet of data science that reads “You
cannot afford not to use any piece of information available”. It turns out that solving (3.0.1.7) “in a suitable
way” as discussed below in Section 3.1.1 enhances cancellation of measurement errors and gives better
estimates for the angles. We will not discuss this here and refer to statistics for an explanation.
Here we just report the results of a numerical experiment: We consider the triangle with angles π/2, π/3, and π/6. Synthetic “measurement errors” are introduced by adding a normally distributed random perturbation with mean 0 and standard deviation π/50 to the exact values of the angles, yielding $\tilde{\alpha}$, $\tilde{\beta}$, and $\tilde{\gamma}$. For 100 “measurements” we compute the variance of the raw angles and that of the estimates obtained by solving (3.0.1.7) in least squares sense (→ Section 3.1.1). These variances are plotted for many different “runs”.
(Fig. 71: variance of the least squares estimates plotted against the variance of the measurements, with separate markers for the angles π/2, π/3, and π/6.)

We observe that in most runs the variances of the estimates from (3.0.1.7) are smaller than those of the raw data. y
EXAMPLE 3.0.1.8 (Angles in a triangulation)
In Ex. 2.7.2.5 we learned about the concept of and data structures for planar triangulations → Def. 2.7.2.6. Such triangulations have been and continue to be of fundamental importance for geodesy. In particular, before distances could be measured accurately by means of lasers, triangulations were indispensable, because angles could already be determined with high precision. C.F. Gauss pioneered both the use of triangulations in geodesy and the use of the least squares method to deal with measurement errors → Wikipedia.
(Fig. 72)


(Translation of the German original:)
Gauss had already developed the foundations of his method in 1795, at the age of 18. It was based on an idea of Pierre-Simon Laplace to sum up the absolute values of the errors in such a way that the errors add up to zero. Gauss instead used the squares of the errors and could dispense with this artificial additional requirement on the errors.
Gauss then used the method intensively in his survey of the Kingdom of Hannover by triangulation. The two-part work appeared in 1821 and 1823, followed in 1826 by a supplement to the Theoria combinationis observationum erroribus minimis obnoxiae (theory of the combination of observations subject to the smallest errors), in which Gauss was able to give a justification of why his method was so successful compared to the others: the method of least squares is optimal in a broad sense, that is, better than other methods.
We now extend Ex. 3.0.1.6 to planar triangulations, for which measured values for all internal angles are
available. We obtain an overdetermined system of equations by combining the following linear relations:
1. each angle is supposed to be equal to its measured value,
2. the sum of interior angles is π for every triangle,
3. the sum of the angles at an interior node is 2π .
If the planar triangulation has N0 interior vertices and M cells, then we end up with 4M + N0 equations
for the 3M unknown angles. y

EXAMPLE 3.0.1.9 ((Relative) point locations from distances [GGK14, Sect. 6.1]) Consider n points
located on the real axis at unknown locations xi ∈ R, i = 1, . . . , n. At least we know that xi < xi+1 ,
i = 1, . . . , n − 1.
We measure the $m := \binom{n}{2} = \tfrac{1}{2}n(n-1)$ pairwise distances $d_{ij} := |x_i - x_j|$, $i,j \in \{1, \ldots, n\}$, $i \neq j$. They are connected to the point positions by the overdetermined linear system of equations
$$
x_i - x_j = d_{ij}\;,\quad 1 \leq j < i \leq n
\qquad\leftrightarrow\qquad
\begin{bmatrix}
-1 & 1 & 0 & \cdots & \cdots & 0\\
-1 & 0 & 1 & & & 0\\
\vdots & & & \ddots & & \vdots\\
-1 & \cdots & & & 0 & 1\\
0 & -1 & 1 & & & 0\\
\vdots & & \ddots & \ddots & & \vdots\\
0 & \cdots & & & -1 & 1
\end{bmatrix}
\begin{bmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{bmatrix}
=
\begin{bmatrix} d_{12}\\ d_{13}\\ \vdots\\ d_{1n}\\ d_{23}\\ \vdots\\ d_{n-1,n} \end{bmatrix}
\;\Leftrightarrow\;
\mathbf{A}\mathbf{x} = \mathbf{b}\;.
\tag{3.0.1.10}
$$

Note that we can never expect a unique solution for $\mathbf{x} \in \mathbb{R}^n$, because adding a multiple of $[1, 1, \ldots, 1]^{\top}$ to any solution will again yield a solution: $\mathbf{A}$ has a non-trivial kernel, $\mathcal{N}(\mathbf{A}) = \operatorname{Span}\{[1, 1, \ldots, 1]^{\top}\}$.
Non-uniqueness can be cured by setting x1 := 0, thus removing one component of x.
If the measurements were perfect, we could then find x2 , . . . , xn from di−1,i , i = 2, . . . , n by solving a
standard (square) linear system of equations. However, as in Ex. 3.0.1.6, using much more information
through the overdetermined system (3.0.1.10) helps curb measurement errors. y
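For concreteness, the matrix A and the right-hand-side vector of (3.0.1.10) can be assembled with a few lines of code. The following sketch is our own helper (not an official lecture code) and fills them from exact point positions:

#include <Eigen/Dense>
#include <cmath>
#include <iostream>
using namespace Eigen;

int main() {
  const int n = 4;
  const VectorXd x = VectorXd::LinSpaced(n, 0.0, 3.0);  // true point positions
  const int m = n * (n - 1) / 2;                        // number of pairwise distances
  MatrixXd A = MatrixXd::Zero(m, n);
  VectorXd d(m);
  int row = 0;
  // 0-based indices; one equation x_i - x_j = d_ij per pair j < i,
  // ordered d_12, d_13, ..., d_1n, d_23, ... as in (3.0.1.10)
  for (int j = 0; j < n; ++j) {
    for (int i = j + 1; i < n; ++i, ++row) {
      A(row, j) = -1.0;
      A(row, i) = 1.0;
      d(row) = std::abs(x(i) - x(j));
    }
  }
  std::cout << "A =\n" << A << "\nd = " << d.transpose() << std::endl;
  return 0;
}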
Review question(s) 3.0.1.11 (Overdetermined Linear Systems of Equations: Examples)
(Q3.0.1.11.A) The mass of three different items is measures in all possible combinations. Find the overde-
termined linear system of equations (LSE) that has to be “solved” to obtain estimates for the three
masses. What would be the size of the corresponding overdetermined LSE for m ∈ N masses?


(Q3.0.1.11.B) A time-harmonic current with frequency f > 0 can be written as
$$
I(t) = a\cos(2\pi f t) + b\sin(2\pi f t)\;,\quad t \in \mathbb{R}\;,\quad a, b \in \mathbb{R}\;.
$$
It is measured at times $t_k = \tfrac{k}{m f}$, k = 0, ..., m − 1, m > 2, and we obtain values $I_k \approx I(t_k)$. Which overdetermined linear system of equations can be used to estimate the coefficients a, b?
(Q3.0.1.11.C) [A $\int_0^1 e^x\,\mathrm{d}x$-type problem] We know the solution $\mathbf{x} \in \mathbb{R}^n$ and the right-hand-side vector $\mathbf{b} \in \mathbb{R}^n$ of the $n \times n$ (Toeplitz) tridiagonal linear system of equations
$$
\begin{bmatrix}
\alpha & \beta & 0 & \cdots & \cdots & 0\\
\beta & \alpha & \beta & & & \vdots\\
0 & \beta & \alpha & \ddots & &\\
\vdots & & \ddots & \ddots & \ddots & \vdots\\
 & & & \ddots & \alpha & \beta\\
0 & \cdots & & 0 & \beta & \alpha
\end{bmatrix}
\mathbf{x} = \mathbf{b}\;.
$$
Which overdetermined linear system of equations of maximal size has the vector $[\alpha, \beta]^{\top} \in \mathbb{R}^2$ as its solution?

3.1 Least Squares Solution Concepts

Video tutorial for Section 3.1.1 "Least Squares Solutions": (9 minutes) Download link,
tablet notes

→ review questions 3.1.1.14

Throughout we consider the (possibly overdetermined) linear system of equations


x ∈ R n : “Ax = b” , b ∈ R m , A ∈ R m,n , m≥n. (3.0.0.1)
Recall from linear algebra that Ax = b has a solution, if and only if the right hand side vector b lies in the
image (range space, → Def. 2.2.1.2) of the matrix A:
∃x ∈ R n : Ax = b ⇔ b ∈ R(A) . (3.1.0.1)
✎ Notation for important subspaces associated with a matrix A ∈ K m,n (→ Def. 2.2.1.2)
image/range: R(A) := {Ax, x ∈ K n } ⊂ K m ,
kernel/nullspace: N (A) := {x ∈ K n : Ax = 0} .

Remark 3.1.0.2 (Consistent right hand side vectors are highly improbable) If R(A) 6= R m , then
“almost all” perturbations of b (e.g., due to measurement errors) will destroy b ∈ R(A), because R(A)
is a “set of measure zero” in R m . y


3.1.1 Least Squares Solutions: Definition


Definition 3.1.1.1. Least squares solution

For given A ∈ R m,n , b ∈ R m the vector x ∈ R n is a least squares solution of the linear system of
equations Ax = b, if

$$
\mathbf{x} \in \operatorname*{argmin}_{\mathbf{y} \in \mathbb{R}^n} \|\mathbf{A}\mathbf{y} - \mathbf{b}\|_2^2
\quad\Longleftrightarrow\quad
\|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2
= \min_{\mathbf{y} \in \mathbb{R}^n} \|\mathbf{A}\mathbf{y} - \mathbf{b}\|_2^2
= \min_{y_1, \ldots, y_n \in \mathbb{R}} \sum_{i=1}^{m}\Bigl(\sum_{j=1}^{n} (\mathbf{A})_{i,j}\, y_j - (\mathbf{b})_i\Bigr)^2\;.
$$
➨ A least squares solution is any vector x that minimizes the Euclidean norm of the residual r =
b − Ax, see Def. 2.4.0.1.

We write lsq(A, b) for the set of least squares solutions of the linear system of equations Ax = b,
A ∈ R m,n , b ∈ R m :

lsq(A, b) := {x ∈ R n : x is a least squares solution of Ax = b} ⊂ R n . (3.1.1.2)

§3.1.1.3 (Least squares solutions and “ true” solutions of LSE) The concept of least squares solutions
is a genuine generalization of what is regarded as a solution of linear system of equations in linear algebra:
Clearly, for a square linear system of equations with regular system matrix the least squares solution
agrees with the “true” solution:

A ∈ R n,n regular ⇒ lsq(A, b) = {A−1 b} ∀b ∈ R n . (3.1.1.4)

Also for A ∈ R m,n , m > n, the set lsq(A, b) contains only “true” solutions of Ax = b, if b ∈ R(A),
because in this case the smallest residual is 0. y

Next, we examine least squares solutions from a geometric perspective.


EXAMPLE 3.1.1.5 (linear regression → [DR08, Ex. 4.1]) We consider the problem of parameter esti-
mation for a linear model from Ex. 3.0.1.4:
Given: measured data points (xi , yi ), xi ∈ R n , yi ∈ R, i = 1, . . . , m, m ≥ n + 1
(yi , xi affected by measurement errors).

Known: without measurement errors data would satisfy affine linear relationship

$$
y = \mathbf{a}^{\top}\mathbf{x} + \beta\;, \tag{3.1.1.6}
$$
for some parameters $\mathbf{a} \in \mathbb{R}^n$, $\beta \in \mathbb{R}$.
Solving the overdetermined linear system of equations in least squares sense we obtain a least squares estimate for the parameters a and β:
$$
(\mathbf{a}, \beta) = \operatorname*{argmin}_{\mathbf{a} \in \mathbb{R}^n,\, \beta \in \mathbb{R}} \sum_{i=1}^{m} \bigl|y_i - \mathbf{a}^{\top}\mathbf{x}_i - \beta\bigr|^2\;. \tag{3.1.1.7}
$$


In statistics, solving (3.1.1.7) is known as linear regression.
(Fig. 73: linear regression for n = 1, m = 8 — “fitting” a regression line to data points in the x-y-plane.)
y
§3.1.1.8 (The geometry of least squares problems) A geometric “proof” for the existence of least squares solutions ($\mathbb{K} = \mathbb{R}$):
(Fig. 74: the vector b, the subspace $\{\mathbf{A}\mathbf{x},\, \mathbf{x} \in \mathbb{R}^n\}$, and the orthogonal projection Ax of b onto it.)
For a least squares solution $\mathbf{x} \in \mathbb{R}^n$ the vector $\mathbf{A}\mathbf{x} \in \mathbb{R}^m$ is the unique orthogonal projection of b onto
$$
\mathcal{R}(\mathbf{A}) = \operatorname{Span}\{(\mathbf{A})_{:,1}, \ldots, (\mathbf{A})_{:,n}\}\;,
$$
because the orthogonal projection provides the nearest (w.r.t. the Euclidean distance) point to b in the subspace (hyperplane) $\mathcal{R}(\mathbf{A})$.
From this geometric consideration we conclude that lsq(A, b) is the space of solutions of Ax = b∗ ,
where b∗ is the orthogonal projection of b onto R(A). Since the set of solutions of a linear system of
equations invariably is an affine space, this argument teaches that lsq(A, b) is an affine subspace of R n !
y
Geometric intuition yields the following insight:

Theorem 3.1.1.9. Existence of least squares solutions

For any A ∈ R m,n , b ∈ R m a least squares solution of Ax = b (→ Def. 3.1.1.1) exists.

Proof. The function $F : \mathbb{R}^n \to \mathbb{R}$, $F(\mathbf{x}) := \|\mathbf{b} - \mathbf{A}\mathbf{x}\|_2^2$ is continuous, bounded from below by 0 and
F (x) → ∞ for kxk → ∞. Hence, there must be an x∗ ∈ R n for which it attains its minimum.

§3.1.1.10 (Least squares solution as maximum-likelihood estimator → [DR08, Sect. 4.5]) Extending
the considerations of Ex. 3.0.1.4, a generic linear parameter estimation problem seeks to determine the
unknown parameter vector x ∈ R n from the linear relationship Ax = y, where A ∈ R m is known and
y ∈ R m is accessible through measurements.
Unfortunately, y is affected by measurement errors, Thus we model it as a random vector y = y(ω ),
ω ∈ Ω, Ω the set of outcomes from a probability space.
The measurement errors in different components of y are supposed be unbiased (expectation = 0), inde-
pendent, and identically normally distributed with variance σ2 , σ > 0, which means

y(ω ) = Ax + e(ω ) , ω∈Ω, (3.1.1.11)

where the probability distribution of e satisfies


$$
\mathbb{P}\bigl((\mathbf{e})_\ell \in I_\ell,\; \ell = 1, \ldots, m\bigr)
= \prod_{\ell=1}^{m} \frac{1}{\sigma\sqrt{2\pi}} \int_{I_\ell} \exp\Bigl(-\frac{1}{2}\frac{z^2}{\sigma^2}\Bigr)\,\mathrm{d}z\;,\quad I_\ell \subset \mathbb{R}\;. \tag{3.1.1.12}
$$


From (3.1.1.11) we infer that the probability density of y is, up to a normalization factor independent of x,
$$
L(\mathbf{x}; \mathbf{y}) = \prod_{\ell=1}^{m} \exp\Bigl(-\frac{1}{2}\Bigl(\frac{y_\ell - (\mathbf{A}\mathbf{x})_\ell}{\sigma}\Bigr)^{2}\Bigr)
= \exp\Bigl(-\frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2\Bigr)\;. \tag{3.1.1.13}
$$

The last identity follows from exp( x ) exp(y) = exp( x + y). This probability density function y 7→ L(x; y)
is called the likelihood of y, and the notation emphasizes its dependence on the parameters x.
Assume that we are given a measurement (realization/sample) b of y. The maximum likelihood principle
then suggests that we choose the parameters so that the probability density of y becomes maximal at b:

$$
\mathbf{x}^* \in \mathbb{R}^n \text{ such that }
\mathbf{x}^* = \operatorname*{argmax}_{\mathbf{x} \in \mathbb{R}^n} L(\mathbf{x}; \mathbf{b})
= \operatorname*{argmax}_{\mathbf{x} \in \mathbb{R}^n} \exp\Bigl(-\frac{1}{2\sigma^2}\|\mathbf{b} - \mathbf{A}\mathbf{x}\|_2^2\Bigr)\;.
$$

Obviously, due to the monotonicity of ξ 7→ exp(ξ ), x∗ is a least squares solution of Ax = b according to


Def. 3.1.1.1:

$$
\mathbf{x}^* \in \mathbb{R}^n \text{ such that }
\mathbf{x}^* = \operatorname*{argmin}_{\mathbf{x} \in \mathbb{R}^n} \|\mathbf{b} - \mathbf{A}\mathbf{x}\|_2^2\;.
$$

y
Review question(s) 3.1.1.14 (Least squares solution: Definition)
(Q3.1.1.14.A) Describe A ∈ R2,2 and b ∈ R2 so that lsq(A, b) contains more than a single vector.
(Q3.1.1.14.B) What is lsq(A, 0) for A ∈ R m,n ?
(Q3.1.1.14.C) Given a matrix B ∈ R m,n , a vector c ∈ R m , and λ > 0, define

{x∗ } := argminkBx − ck22 + λkxk22 ⊂ R n .


x ∈R n

State an overdetermined linear system of equations Ax = b, of which x∗ is a least-squares solution.


(Q3.1.1.14.D) [The geometry of least-squares problems]
(Fig. 75: a vector b, the subspace $\{\mathbf{A}\mathbf{x},\, \mathbf{x} \in \mathbb{R}^n\}$, and the point Ax in it closest to b.)
Explain why the figure illustrates the concept of a least-squares solution of an overdetermined linear system of equations Ax = b, A ∈ R^{m,n}, m > n.

3.1.2 Normal Equations

Video tutorial for Section 3.1.2 "Normal Equations": (16 minutes) Download link, tablet notes

→ review questions 3.1.2.23

Appealing to the geometric intuition gleaned from Fig. 74 we infer the orthogonality of b − Ax, x a least
squares solution of the overdetermined linear systems of equations Ax = b, to all columns of A:

b − Ax ⊥ R(A) ⇔ b − Ax ⊥ (A):,j , j = 1, . . . , n ⇔ A⊤ (b − Ax) = 0 .


Surprisingly, we have found a square linear system of equations satisfied by the least squares solution.
The next theorem gives the formal statement is this discovery. It also completely characterizes lsq(A, b)
and reveals a way to compute this set.

Theorem 3.1.2.1. Obtaining least squares solutions by solving normal equations

The vector x ∈ R n is a least squares solution (→ Def. 3.1.1.1) of the linear system of equations
Ax = b, A ∈ R m,n , b ∈ R m , if and only if it solves the normal equations (NEQ)

A⊤ Ax = A⊤ b . (3.1.2.2)

Note that the normal equations (3.1.2.2) are an n × n square linear system of equations with a symmetric
positive semi-definite coefficient matrix:
   
   
   
" # " # " # 
   
   
A⊤  A  x = A⊤  b,
   
   
   
   

 
 
 
" #" # " # 
 
 
⇔ A⊤ A x = A⊤  b.
 
 
 
 

Proof. (of Thm. 3.1.2.1)


➊: We first show that a least squares solution satisfies the normal equations. Let x ∈ R^n be a least squares solution according to Def. 3.1.1.1. Pick an arbitrary d ∈ R^n \ {0} and define the function

ϕd : R → R , ϕd (τ ) := kA(x + τd) − bk22 . (3.1.2.3)

We find the equivalent expression

ϕd (τ ) = τ 2 d⊤ A⊤ Ad + 2τd⊤ A⊤ (Ax − b) + kAx − bk22 ,

which shows that τ 7→ ϕd (τ ) is a smooth (C ∞ ) function.


Moreover, since every $\mathbf{x} \in \operatorname{lsq}(\mathbf{A}, \mathbf{b})$ is a minimizer of $\mathbf{y} \mapsto \|\mathbf{A}\mathbf{y} - \mathbf{b}\|_2^2$, we conclude that $\tau \mapsto \varphi_{\mathbf{d}}(\tau)$ has a global minimum in τ = 0. Necessarily,
$$
\frac{\mathrm{d}\varphi_{\mathbf{d}}}{\mathrm{d}\tau}\Big|_{\tau=0} = 2\,\mathbf{d}^{\top}\mathbf{A}^{\top}(\mathbf{A}\mathbf{x} - \mathbf{b}) = 0\;.
$$

Since this holds for any vector d 6= 0, we conclude (set d equal to all the Euclidean unit vectors in R n )

A⊤ (Ax − b) = 0 ,


which is equivalent to the normal equations (3.1.2.2).

➋: Let x be a solution of the normal equations. Then we find by tedious but straightforward computations

$$
\begin{aligned}
\|\mathbf{A}\mathbf{y} - \mathbf{b}\|_2^2 - \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2
&= \mathbf{y}^{\top}\mathbf{A}^{\top}\mathbf{A}\mathbf{y} - 2\mathbf{y}^{\top}\mathbf{A}^{\top}\mathbf{b} + \mathbf{b}^{\top}\mathbf{b}
 - \mathbf{x}^{\top}\mathbf{A}^{\top}\mathbf{A}\mathbf{x} + 2\mathbf{x}^{\top}\mathbf{A}^{\top}\mathbf{b} - \mathbf{b}^{\top}\mathbf{b}\\
&\overset{(3.1.2.2)}{=} \mathbf{y}^{\top}\mathbf{A}^{\top}\mathbf{A}\mathbf{y} - 2\mathbf{y}^{\top}\mathbf{A}^{\top}\mathbf{A}\mathbf{x} + \mathbf{x}^{\top}\mathbf{A}^{\top}\mathbf{A}\mathbf{x}\\
&= (\mathbf{y} - \mathbf{x})^{\top}\mathbf{A}^{\top}\mathbf{A}(\mathbf{y} - \mathbf{x}) = \|\mathbf{A}(\mathbf{x} - \mathbf{y})\|_2^2 \;\geq\; 0\;.
\end{aligned}
\;\Longrightarrow\; \|\mathbf{A}\mathbf{y} - \mathbf{b}\| \geq \|\mathbf{A}\mathbf{x} - \mathbf{b}\|\;.
$$

Since this holds for any y ∈ R n , x must be a global minimizer of y 7→ kAy − bk!

EXAMPLE 3.1.2.4 (Normal equations for some examples from Section 3.0.1) Given A and b it takes
only elementary linear algebra operations to form the normal equations

A⊤ Ax = A⊤ b . (3.1.2.2)

• For Ex. 3.0.1.1 (1D linear regression), with A ∈ R^{m,2} given in (3.0.1.3), we obtain the normal equations
$$
\begin{bmatrix}
x_1 & x_2 & \cdots & x_m\\
1 & 1 & \cdots & 1
\end{bmatrix}
\begin{bmatrix}
x_1 & 1\\ x_2 & 1\\ \vdots & \vdots\\ x_m & 1
\end{bmatrix}
\begin{bmatrix} \alpha\\ \beta \end{bmatrix}
=
\begin{bmatrix}
\|\mathbf{x}\|_2^2 & \mathbf{1}^{\top}\mathbf{x}\\
\mathbf{1}^{\top}\mathbf{x} & m
\end{bmatrix}
\begin{bmatrix} \alpha\\ \beta \end{bmatrix}
=
\begin{bmatrix} \mathbf{x}^{\top}\mathbf{y}\\ \mathbf{1}^{\top}\mathbf{y} \end{bmatrix},
$$
with $\mathbf{1} = [1, \ldots, 1]^{\top}$, $\mathbf{x} = [x_1, \ldots, x_m]^{\top}$, $\mathbf{y} = [y_1, \ldots, y_m]^{\top}$.
• In the case of Ex. 3.0.1.4 (multi-dimensional linear regression) and the overdetermined m × (n+1) linear system (3.0.1.5), the normal equations read
$$
\begin{bmatrix}
\mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_m\\
1 & 1 & \cdots & 1
\end{bmatrix}
\begin{bmatrix}
\mathbf{x}_1^{\top} & 1\\ \vdots & \vdots\\ \mathbf{x}_m^{\top} & 1
\end{bmatrix}
\begin{bmatrix} \mathbf{a}\\ \beta \end{bmatrix}
=
\begin{bmatrix}
\mathbf{X}\mathbf{X}^{\top} & \mathbf{X}\mathbf{1}\\
\mathbf{1}^{\top}\mathbf{X}^{\top} & m
\end{bmatrix}
\begin{bmatrix} \mathbf{a}\\ \beta \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}\mathbf{y}\\ \mathbf{1}^{\top}\mathbf{y} \end{bmatrix},
$$
where $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_m] \in \mathbb{R}^{n,m}$.
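The normal-equations approach can be tried out directly in Eigen. The following sketch is our own illustration (not an official lecture code): it fits a regression line to synthetic data by assembling the system of (3.0.1.3) and solving the normal equations (3.1.2.2) with a Cholesky-type solver.

#include <Eigen/Dense>
#include <iostream>
using namespace Eigen;

int main() {
  const int m = 50;
  const VectorXd x = VectorXd::LinSpaced(m, 0.0, 1.0);
  // synthetic data y = 2x + 1 plus small perturbations
  const VectorXd y = 2.0 * x + VectorXd::Ones(m) + 0.01 * VectorXd::Random(m);
  MatrixXd A(m, 2);
  A.col(0) = x;
  A.col(1) = VectorXd::Ones(m);
  // normal equations A^T A [alpha; beta] = A^T y, solved via LDLT
  const Vector2d ab = (A.transpose() * A).ldlt().solve(A.transpose() * y);
  std::cout << "alpha = " << ab(0) << ", beta = " << ab(1) << std::endl;
  return 0;
}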


Remark 3.1.2.5 (Normal equations from gradient) We consider the function

J : R n → R , J (y) := kAy − bk22 . (3.1.2.6)

As above, using elementary identities for the Euclidean inner product on R m , J can be recast as

$$
J(\mathbf{y}) = \mathbf{y}^{\top}\mathbf{A}^{\top}\mathbf{A}\mathbf{y} - 2\mathbf{b}^{\top}\mathbf{A}\mathbf{y} + \mathbf{b}^{\top}\mathbf{b}
= \sum_{i=1}^{n}\sum_{j=1}^{n} (\mathbf{A}^{\top}\mathbf{A})_{ij}\, y_i y_j
- 2\sum_{i=1}^{m}\sum_{j=1}^{n} b_i (\mathbf{A})_{ij}\, y_j
+ \sum_{i=1}^{m} b_i^2\;,\qquad \mathbf{y} = [y_1, \ldots, y_n]^{\top} \in \mathbb{R}^n\;.
$$

Obviously, J is a multivariate polynomial in y1 , . . . , yn . As such, J is an infinitely differentiable function,


J ∈ C ∞ (R n , R ), see [Str09, Bsp. 7.1.5]. The gradient of J vanishes where J attains extremal values
[Str09, Satz 7.5.3]. Thus, x ∈ R n is a minimizer of J only if

grad J (x) = 2A⊤ Ax − 2A⊤ b= 0 . (3.1.2.7)

This formula for the gradient of J can easily be confirmed by computing the partial derivatives $\frac{\partial J}{\partial y_i}$ from the above explicit formula. Observe that (3.1.2.7) is equivalent to the normal equations (3.1.2.2). y

§3.1.2.8 (The linear least squares problem (→ § 1.5.5.1)) Thm. 3.1.2.1 together with Thm. 3.1.1.9
already confirms that the normal equations will always have a solution and that lsq(A, b) is a subspace
of R n parallel to N (A⊤ A). The next theorem gives even more detailed information.

Theorem 3.1.2.9. Kernel and range of A⊤ A

For A ∈ R m,n , m ≥ n, holds

N (A⊤ A) = N (A) , (3.1.2.10)


R(A⊤ A) = R(A⊤ ) . (3.1.2.11)

For the proof we need an basic result from linear algebra:

Lemma 3.1.2.12. Kernel and range of (Hermitian) transposed matrices

For any matrix A ∈ K m,n holds

N (A) = R(AH )⊥ , N (A)⊥ = R(AH ) .

✎ Notation: Orthogonal complement of a subspace V ⊂ K k :

V ⊥ : = { x ∈ K k : xH y = 0 ∀ y ∈ V } .

Proof. (of Thm. 3.1.2.9)


➊: We first show (3.1.2.10)

z ∈ N (A⊤ A) ⇔ A⊤ Az = 0 ⇒ z⊤ A⊤ Az = kAzk22 = 0 ⇔ Az = 0 ,
Az = 0 ⇒ A⊤ Az = 0 ⇔ z ∈ N (A⊤ A) .


➋: The relationship (3.1.2.11) follows from (3.1.2.10) and Lemma 3.1.2.12:


Lemma 3.1.2.12 (3.1.2.10) Lemma 3.1.2.12
R(A⊤ ) = N (A)⊥ = N (A⊤ A)⊥ = R(A⊤ A) .

Corollary 3.1.2.13. Uniqueness of least squares solutions

If m ≥ n and N (A) = {0}, then the linear system of equations Ax = b, A ∈ R m,n , b ∈ R m , has
a unique least squares solution (→ 3.1.1.1)

x = ( A ⊤ A ) −1 A ⊤ b , (3.1.2.14)

that can be obtained by solving the normal equations (3.1.2.2).

Note that A⊤ A is symmetric positive definite (→ Def. 1.1.2.6), if N (A) = {0}.

Remark 3.1.2.15 (Full-rank condition (→ Def. 2.2.1.3)) For a matrix A ∈ R m,n with m ≥ n is equiva-
lent
N (A) = {0} ⇐⇒ rank(A) = n . (3.1.2.16)
Hence the assumption N (A) = {0} of Cor. 3.1.2.13 is also called a full-rank condition (FRC), because
the rank of A is maximal. y

EXAMPLE 3.1.2.17 (Meaning of full-rank condition for linear models) We revisit the parameter esti-
mation problem for a linear model.
• For Ex. 3.0.1.1, with A ∈ R^{m,2} given in (3.0.1.3), it is easy to see that
$$
\operatorname{rank}
\begin{bmatrix}
x_1 & 1\\ x_2 & 1\\ \vdots & \vdots\\ x_m & 1
\end{bmatrix}
= 2
\quad\Longleftrightarrow\quad
\exists\, i, j \in \{1, \ldots, m\}:\; x_i \neq x_j\;,
$$
that is, the manifest condition that not all points $(x_i, y_i)$ have the same x-coordinate.
y

(Fig. 76: data points lying on a vertical line in the x-y-plane.)
1D linear regression fails in case all data points lie on a vertical line in the x-y-plane. It goes without saying that no meaningful regression line can be found in this case.
• In the case of Ex. 3.0.1.4 and the overdetermined m × (n + 1) linear system (3.0.1.5), we find
$$
\operatorname{rank}
\begin{bmatrix}
\mathbf{x}_1^{\top} & 1\\ \vdots & \vdots\\ \mathbf{x}_m^{\top} & 1
\end{bmatrix}
= n + 1
\quad\Longleftrightarrow\quad
\begin{array}{l}
\text{there is a subset of } n+1 \text{ points } \mathbf{x}_{i_1}, \ldots, \mathbf{x}_{i_{n+1}}\\
\text{such that } \{\mathbf{x}_{i_1}, \ldots, \mathbf{x}_{i_{n+1}}\} \text{ spans a non-degenerate } n\text{-simplex.}
\end{array}
$$


Remark 3.1.2.18 (Rank defect in linear least squares problems) In case the system matrix A ∈ R^{m,n}, m ≥ n, of an overdetermined linear system arising from a mathematical model fails to have full rank, this hints at inadequate modelling: in this case parameters are redundant, because different sets of parameters yield the same output quantities; the parameters are not “observable”. y

Remark 3.1.2.19 (Hesse matrix of least squares functional) For the least squares functional

J : R n → R , J (y) := kAy − bk22 . (3.1.2.6)

and its explicit form as polynomial in the vector components y j we find the Hessian (→ Def. 8.5.1.18,
[Str09, Satz 7.5.3]) of J :
" #n
∂2 J
H J (y) = (y) = 2A⊤ A . (3.1.2.20)
∂yi ∂y j
i,k=1

Thm. 3.1.2.9 implies that A⊤ A is positive definite (→ Def. 1.1.2.6) if and only if N (A) = {0}.
Therefore, by [Str09, Satz 7.5.3], under the full-rank condition J has a positive definite Hessian everywhere,
and a minimum at every stationary point of its gradient, that is, at every solution of the normal equations.
y

Remark 3.1.2.21 (Convex least squares functional) Another result from analysis tells us that real-valued
C1 -functions on R n whose Hessian has positive eigenvalues uniformly bounded away from zero are strictly
convex. Hence, if A has full rank, the least squares functional J from (3.1.2.6) is a strictly convex function.

(Fig. 77: visualization of a least squares functional $J : \mathbb{R}^2 \to \mathbb{R}$ for n = 2. Under the full-rank condition the graph of J is a paraboloid with $J(\mathbf{y}) \to \infty$ for $\|\mathbf{y}\| \to \infty$.)
y

Now we are in a position to state precisely what we mean by solving an overdetermined (m ≥ n!) linear
system of equations Ax = b, A ∈ R m,n , b ∈ R m , provided that A has full (maximal) rank, cf. (3.1.2.16).


(Full rank linear) least squares problem: [DR08, Sect. 4.2]
given: $\mathbf{A} \in \mathbb{R}^{m,n}$, $m, n \in \mathbb{N}$, $m \geq n$, $\operatorname{rank}(\mathbf{A}) = n$, $\mathbf{b} \in \mathbb{R}^m$,
find: the unique $\mathbf{x} \in \mathbb{R}^n$ such that
$$
\|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2 = \min\{\|\mathbf{A}\mathbf{y} - \mathbf{b}\|_2 : \mathbf{y} \in \mathbb{R}^n\}
\quad\Longleftrightarrow\quad
\mathbf{x} = \operatorname*{argmin}_{\mathbf{y} \in \mathbb{R}^n}\|\mathbf{A}\mathbf{y} - \mathbf{b}\|_2\;. \tag{3.1.2.22}
$$
✎ A sloppy notation for the minimization problem (3.1.2.22) is $\|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2 \to \min$. y
Review question(s) 3.1.2.23 (Normal equations)
(Q3.1.2.23.A) Compute the system matrix and the right-hand-side vector of the normal equations for 1D linear regression, which led to the overdetermined linear system of equations
$$
\begin{bmatrix}
x_1 & 1\\ x_2 & 1\\ \vdots & \vdots\\ x_m & 1
\end{bmatrix}
\begin{bmatrix} \alpha\\ \beta \end{bmatrix}
=
\begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_m \end{bmatrix}.
$$

(Q3.1.2.23.B) Let {v1 , . . . , vk } ⊂ R n , k < n, be a basis of a subspace V ⊂ R n . Give a formula for the
point x ∈ V with smallest Euclidean distance from a given point p ∈ R n . Why is the basis property of
{v1 , . . . , vk } ⊂ R n important?
(Q3.1.2.23.C) Characterize the set of least squares solutions lsq(A, b), if A ∈ R m,n , m ≥ n, has or-
thonormal columns and b ∈ R m is an arbitrary vector.
(Q3.1.2.23.D) Let A ∈ R m,n , m ≥ n, have full rank: rank(A) = n. Show that the mapping

P : R m → R m , P ( y ) : = A ( A ⊤ A ) −1 A ⊤ y , y ∈ Rm ,

is an orthogonal projection onto R(A). This entails proving two properties


(I) P ◦ P = P (projection property),
(II) P(y) − y ⊥ R(A) for all y ∈ R m .
(Q3.1.2.23.E) [Ridge regression] For a given matrix A ∈ R m,n , m > n, and vector b ∈ R m , derive a
linear system of equations satisfied by any

$$
\mathbf{x}^* \in \operatorname*{argmin}_{\mathbf{x} \in \mathbb{R}^n} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2 + \lambda\|\mathbf{x}\|_2^2\;,
$$

where λ > 0 is a fixed constant.


Show that the linear system you have obtained will always possess a unique solution for any A ∈ R m,n .
(Q3.1.2.23.F) How do the normal equations belonging to an overdetermined linear system of equations
Ax = b, A ∈ R m,n , m > n, change, when we permute the rows?

Hint. Permuting the rows of an LSE amounts to left-multiplication of the system matrix and the right-
hand-side vector with a permutation matrix.


3.1.3 Moore-Penrose Pseudoinverse


Video tutorial for Section 3.1.3 "Moore-Penrose Pseudoinverse": (8 minutes) Download link,
tablet notes

→ review questions 3.1.3.8

As we have seen in Ex. 3.0.1.9, there can be many least squares solutions of Ax = b, in case N (A) 6=
{0}. We can impose another condition to single out a unique element of lsq(A, b):

Definition 3.1.3.1. Generalized solution of a linear system of equations

The generalized solution x† ∈ R n of a linear system of equations Ax = b, A ∈ R m,n , b ∈ R m , is


defined as

x† := argmin{kxk2 : x ∈ lsq(A, b)} . (3.1.3.2)

➨ The generalized solution is the least squares solution with minimal norm.

§3.1.3.3 (Reduced normal equations) Elementary geometry teaches that the minimal norm element of
an affine subspace L (a plane) in Euclidean space is the orthogonal projection of 0 onto L.

(Fig. 78: the affine space lsq(A, b), parallel to N(A), its minimal-norm element x†, the orthogonal complement N(A)⊥, and the origin 0.)
Visualization: the minimal norm element $\mathbf{x}^{\dagger}$ of the affine space $\operatorname{lsq}(\mathbf{A}, \mathbf{b}) \subset \mathbb{R}^n$ belongs to the subspace of $\mathbb{R}^n$ that is orthogonal to $\operatorname{lsq}(\mathbf{A}, \mathbf{b})$.

Since the space of least squares solutions of Ax = b is an affine subspace parallel to N (A),

lsq(A, b) = x0 + N (A) , x0 solves normal equations, (3.1.3.4)

the generalized solution x† of Ax = b according to Def. 3.1.3.1 is contained in N (A)⊥ . Therefore, given
a basis {v1 , . . . , vk } ⊂ R n of N (A)⊥ , k := dim N (A)⊥ = n − dim N (A), we can find y ∈ R k such
that

x† = Vy with V := [v1 , . . . , vk ] ∈ R n,k .

Plugging this representation into the normal equations and multiplying with V⊤ yields the reduced normal
equations

V⊤ A⊤ AV y = V⊤ A⊤ b (3.1.3.5)
(Schematically: inserting $\mathbf{x} = \mathbf{V}\mathbf{y}$ into the normal equations and multiplying with $\mathbf{V}^{\top}$ from the left compresses them into the small square $k \times k$ linear system (3.1.3.5) for the coefficient vector $\mathbf{y} \in \mathbb{R}^{k}$.)
The very construction of V ensures N (AV) = {0} so that, by Thm. 3.1.2.9 the k × k linear system of
equations (3.1.3.5) has a unique solution. The next theorem summarizes our insights:

Theorem 3.1.3.6. Formula for generalized solution

Given A ∈ R m,n , b ∈ R m , the generalized solution x† of the linear system of equations Ax = b is


given by

x† = V(V⊤ A⊤ AV)−1 (V⊤ A⊤ b) ,

where V is any matrix whose columns form a basis of N (A)⊥ .

Terminology: The matrix

A† := V(V⊤ A⊤ AV)−1 V⊤ A⊤ ∈ R n,m

is called the Moore-Penrose pseudoinverse of A. If $\mathcal{N}(\mathbf{A}) = \{0\}$, then the formula simplifies to
$$
\mathbf{A}^{\dagger} = (\mathbf{A}^{\top}\mathbf{A})^{-1}\mathbf{A}^{\top}\;.
$$

✎ notation: A† ∈ R n,m =
ˆ pseudoinverse of A ∈ R m,n

Note that the Moore-Penrose pseudoinverse does not depend on the choice of V. y
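For a concrete rank-deficient example the formula of Thm. 3.1.3.6 can be evaluated directly in Eigen. The sketch below is our own illustration (not an official lecture code): it takes A = [1 1; 1 1; 1 1], for which N(A)⊥ is spanned by [1, 1]^⊤, and compares the result with Eigen's CompleteOrthogonalDecomposition, used here purely as a black-box reference for the minimum-norm least-squares solution.

#include <Eigen/Dense>
#include <iostream>
using namespace Eigen;

int main() {
  MatrixXd A(3, 2);
  A << 1, 1, 1, 1, 1, 1;          // rank(A) = 1, N(A) = span{[1,-1]^T}
  MatrixXd V(2, 1);
  V << 1, 1;                      // basis of the orthogonal complement N(A)⊥
  VectorXd b(3);
  b << 1, 2, 3;
  // generalized solution x† = V (V^T A^T A V)^{-1} V^T A^T b  (Thm. 3.1.3.6)
  const VectorXd xdag =
      V * (V.transpose() * A.transpose() * A * V)
              .ldlt()
              .solve(V.transpose() * A.transpose() * b);
  // reference: minimum-norm least-squares solution from Eigen
  const VectorXd xref = A.completeOrthogonalDecomposition().solve(b);
  std::cout << "x_dagger = " << xdag.transpose()
            << ", reference = " << xref.transpose() << std::endl;
  return 0;
}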

Armed with the concept of generalized solution and the knowledge about its existence and uniqueness we
can state the most general linear least squares problem:

(General linear) least squares problem:
given: $\mathbf{A} \in \mathbb{R}^{m,n}$, $m, n \in \mathbb{N}$, $\mathbf{b} \in \mathbb{R}^m$,
find: $\mathbf{x}^{\dagger} \in \mathbb{R}^n$ such that
(i) $\bigl\|\mathbf{A}\mathbf{x}^{\dagger} - \mathbf{b}\bigr\|_2 = \min\{\|\mathbf{A}\mathbf{y} - \mathbf{b}\|_2 : \mathbf{y} \in \mathbb{R}^n\}$,   (3.1.3.7)
(ii) $\bigl\|\mathbf{x}^{\dagger}\bigr\|_2$ is minimal under the condition (i).

Review question(s) 3.1.3.8 (Moore-Penrose pseudoinverse)


(Q3.1.3.8.A) Let A ∈ R m,n with non-trivial nullspace N (A) 6= {0}, k := dim N (A). The columns of
the matrix V ∈ R n,n−k provide a basis for the orthogonal complement N (A)⊥ . Show that the matrix
V⊤ A⊤ AV ∈ R n−k,n−k is regular.


(Q3.1.3.8.B) Given A ∈ R m,n and a basis {v1 , . . . , vk }, k ≤ n, of the orthogonal complement N (A)⊥ ,
show that the Moore-Penrose pseudoinverse

A† := V(V⊤ A⊤ AV)−1 V⊤ A⊤ , V := [v1 , . . . , vk ] ∈ R n,k

does not depend on the choice of the basis vectors vℓ .


(Q3.1.3.8.C) [Pseudoinverse of a vector]   What is the pseudoinverse of a non-zero column vector
a ∈ R^m, when it is regarded as an m × 1-matrix, m ∈ N?
What is the pseudoinverse of a non-zero row vector a⊤, a ∈ R^n, when it is regarded as a 1 × n-matrix,
n ∈ N?

3.1.4 Sensitivity of Least Squares Problems


Consider the full-rank linear least squares problem introduced in (3.1.2.22):
➣ data (A, b) ∈ R^{m,n} × R^m,   result x = argmin_{y∈R^n} ‖Ay − b‖₂ ∈ R^n
On data space and result space we use the Euclidean norm (2-norm ‖·‖₂) and the associated matrix
norm, see § 1.5.5.3.

Recall Section 2.2.2, where we discussed the sensitivity of solutions of square linear systems, that is,
the impact of perturbations in the problem data on the result. Now we study how (small) changes in A
and b affect the unique (→ Cor. 3.1.2.13) least squares solution x of Ax = b in the case of A with full
rank (⇔ N(A) = {0}).

Note: If the matrix A ∈ R^{m,n}, m ≥ n, has full rank, then there is a c > 0 such that A + ∆A still has
full rank for all ∆A ∈ R^{m,n} with ‖∆A‖₂ < c. Hence, “sufficiently small” perturbations will not destroy the
full-rank property of A. This is a generalization of the Perturbation Lemma 2.2.2.5.

For square linear systems the condition number of the system matrix (→ Def. 2.2.2.7) provided the key
gauge of sensitivity. To express the sensitivity of linear least squares problems we also generalize this
concept:

Definition 3.1.4.1. Generalized condition number of a matrix

Given A ∈ K^{m,n}, m ≥ n, rank(A) = n, we define its generalized (Euclidean) condition number
as

    cond₂(A) := √( λmax(A^H A) / λmin(A^H A) ) .

✎ notation:  λmin(A) =ˆ smallest (in modulus) eigenvalue of matrix A
             λmax(A) =ˆ largest (in modulus) eigenvalue of matrix A

For a square regular matrix this agrees with its condition number according to Def. 2.2.2.7, which follows
from Cor. 1.5.5.16.
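
A minimal sketch (not one of the lecture codes; the helper name cond2 is made up) of how this quantity
could be evaluated in EIGEN: the eigenvalues of A^H A are the squares of the singular values of A, so
cond₂(A) equals the ratio of the extremal singular values.

  #include <Eigen/SVD>

  // Sketch: generalized Euclidean condition number of A ∈ R^{m,n}, m ≥ n,
  // rank(A) = n, as ratio of extremal singular values of A.
  double cond2(const Eigen::MatrixXd &A) {
    const Eigen::VectorXd sv = A.jacobiSvd().singularValues();
    return sv(0) / sv(sv.size() - 1); // singular values are sorted decreasingly
  }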

Theorem 3.1.4.2. Sensitivity of full-rank linear least squares problem

For m ≥ n, A ∈ R^{m,n}, rank(A) = n, let x ∈ R^n be the solution of the least squares problem
‖Ax − b‖ → min and x̂ the solution of the perturbed least squares problem ‖(A + ∆A)x̂ − b‖ → min.
Then

    ‖x − x̂‖₂ / ‖x‖₂  ≤  ( 2 cond₂(A) + cond₂²(A) · ‖r‖₂/(‖A‖₂‖x‖₂) ) · ‖∆A‖₂/‖A‖₂

holds, where r = Ax − b is the residual.

This means:  if ‖r‖₂ ≪ 1       ➤ condition of the least squares problem ≈ cond₂(A) ,
             if ‖r‖₂ “large”   ➤ condition of the least squares problem ≈ cond₂²(A) .
For instance, in a linear parameter estimation problem (→ Ex. 3.0.1.4) a small residual will be the conse-
quence of small measurement errors.

Review question(s) 3.1.4.3 (Sensitivity of least-squares problems)


(Q3.1.4.3.A) Interpret the assertion of the theorem

    Theorem 3.1.4.2. Sensitivity of full-rank linear least squares problem

    For m ≥ n, A ∈ R^{m,n}, rank(A) = n, let x ∈ R^n be the solution of the least squares
    problem ‖Ax − b‖ → min and x̂ the solution of the perturbed least squares problem
    ‖(A + ∆A)x̂ − b‖ → min. Then

        ‖x − x̂‖₂ / ‖x‖₂  ≤  ( 2 cond₂(A) + cond₂²(A) · ‖r‖₂/(‖A‖₂‖x‖₂) ) · ‖∆A‖₂/‖A‖₂

    holds, where r = Ax − b is the residual.

for the case n = 1.

Hint. Recall the definition

    Definition 3.1.4.1. Generalized condition number of a matrix

    Given A ∈ K^{m,n}, m ≥ n, rank(A) = n, we define its generalized (Euclidean) condition number
    as

        cond₂(A) := √( λmax(A^H A) / λmin(A^H A) ) .

3.2 Normal Equation Methods [DR08, Sect. 4.2], [Han02, Ch. 11]

Video tutorial for Section 3.2 "Normal Equation Methods": (12 minutes) Download link,
tablet notes


→ review questions 3.2.0.11

Given A ∈ R^{m,n}, m ≥ n, rank(A) = n, b ∈ R^m, we introduce a first practical numerical method
to determine the unique least squares solution (→ Def. 3.1.1.1) of the overdetermined linear system of
equations Ax = b.
In fact, Cor. 3.1.2.13 suggests a simple algorithm for solving linear least squares problems of the form
(3.1.2.22) satisfying the full (maximal) rank condition rank(A) = n: it boils down to solving the normal
equations (3.1.2.2):

Algorithm: Normal equation method to solve full-rank least squares problem Ax = b

➊ Compute regular matrix C := A⊤A ∈ R^{n,n}.
➋ Compute right-hand-side vector c := A⊤b.
➌ Solve s.p.d. (→ Def. 1.1.2.6) linear system of equations Cx = c → § 2.8.0.13

Definition 1.1.2.6. Symmetric positive definite (s.p.d.) matrices → [DR08, Def. 3.31],
[QSS00, Def. 1.22]

M ∈ K^{n,n}, n ∈ N, is symmetric (Hermitian) positive definite (s.p.d.), if

    M = M^H   and   ∀x ∈ K^n :  x^H M x > 0  ⇔  x ≠ 0 .

If x^H M x ≥ 0 for all x ∈ K^n ✄ M positive semi-definite.

The s.p.d. property of C is an immediate consequence of the equivalence rank(A) = n ⇔ N(A) = {0}:

    x⊤Cx = x⊤A⊤Ax = (Ax)⊤(Ax) = ‖Ax‖₂² > 0  ⇔  x ≠ 0 .

The above algorithm can be realized in EIGEN in a single line of code:

C++ code 3.2.0.1: Solving a linear least squares problem via normal equations ➺ GITLAB

  //! Solving the overdetermined linear system of equations
  //! Ax = b by solving normal equations (3.1.2.2)
  //! The least squares solution is returned by value
  VectorXd normeqsolve(const MatrixXd &A, const VectorXd &b) {
    if (b.size() != A.rows()) {
      throw runtime_error("Dimension mismatch");
    }
    // Use Cholesky factorization for s.p.d. system matrix, § 2.8.0.13
    VectorXd x = (A.transpose() * A).llt().solve(A.transpose() * b);
    return x;
  }

By Thm. 2.8.0.11, for the s.p.d. matrix A⊤ A Gaussian elimination remains stable even without pivot-
ing. This is taken into account by requesting the Cholesky decomposition of A⊤ A by calling the method
llt().
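
As a usage illustration (not from the lecture document, with made-up data values), the following sketch
fits a straight line t ↦ α + βt to measurement points by calling normeqsolve from Code 3.2.0.1:

  #include <Eigen/Dense>
  #include <iostream>
  using Eigen::MatrixXd;
  using Eigen::VectorXd;

  int main() {
    // Fit the line t -> alpha + beta*t to four (made-up) data points (t_i, y_i)
    VectorXd t(4), y(4);
    t << 0.0, 1.0, 2.0, 3.0;
    y << 1.1, 1.9, 3.1, 3.9;
    MatrixXd A(4, 2);
    A.col(0) = VectorXd::Ones(4); // column belonging to the coefficient alpha
    A.col(1) = t;                 // column belonging to the coefficient beta
    const VectorXd x = normeqsolve(A, y); // least squares solution [alpha, beta]
    std::cout << "alpha = " << x(0) << ", beta = " << x(1) << std::endl;
    return 0;
  }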

§3.2.0.2 (Asymptotic complexity of normal equation method) The problem size parameters for the
linear least squares problem (3.1.2.22) are the matrix dimensions m, n ∈ N, where n small & fixed,
n ≪ m, is common.


In Section 1.4.2 and Thm. 2.5.0.2 we discussed the asymptotic complexity of the operations involved in
steps ➊–➌ of the normal equation method:

    step ➊: cost O(mn²) ,   step ➋: cost O(nm) ,   step ➌: cost O(n³)
    ➤ total cost O(n²m + n³) for m, n → ∞ .

Note that for small fixed n, n ≪ m, m → ∞ the computational effort scales linearly with m.    y

Remark 3.2.0.3 (Conditioning of normal equations [DR08, pp. 128])

!   The solution of least squares problems via the normal equation method is vulnerable
    to instability; immediate from Def. 3.1.4.1:

        cond₂(A^H A) = cond₂(A)² .

Recall from Thm. 2.2.2.4: cond₂(A^H A) governs the amplification of (roundoff) errors in
A⊤A and A⊤b when solving the normal equations (3.1.2.2).
➣ For fairly ill-conditioned A using the normal equations (3.1.2.2) to solve the linear least squares prob-
lem from Def. 3.1.1.1 numerically may run the risk of huge amplification of roundoff errors incurred
during the computation of the right hand side A^H b: potential instability (→ Def. 1.5.5.19) of the normal
equation approach.
    y

EXAMPLE 3.2.0.4 (Roundoff effects in normal equations → [DR08, Ex. 4.12]) In this example we
witness loss of information in the computation of A^H A.

    A = [ 1  1 ; δ  0 ; 0  δ ] ∈ R^{3,2}   ⇒   A⊤A = [ 1 + δ²   1 ; 1   1 + δ² ] .

!   Exp. 1.5.3.14: If δ ≈ √EPS, then 1 + δ² = 1 in M (set of machine numbers, see Def. 1.5.2.4).
    Hence the computed A⊤A will fail to be regular, though rank(A) = 2 and cond₂(A) ≈ √(2/EPS).

C++-code 3.2.0.5: Computation of numerical rank of a matrix ➺ GITLAB

  int main() {
    MatrixXd A(3, 2);
    // Inquire about machine precision → Ex. 1.5.3.12
    const double eps = std::numeric_limits<double>::epsilon();
    // Initialization of matrix → § 1.2.1.3
    A << 1, 1, sqrt(eps), 0, 0, sqrt(eps);
    // Output rank of A and of A⊤A
    std::cout << "Rank of A: " << A.fullPivLu().rank() << std::endl
              << "Rank of A^TA: "
              << (A.transpose() * A).fullPivLu().rank() << std::endl;
    return 0;
  }

Output:
    Rank of A: 2
    Rank of A^T*A: 1


Remark 3.2.0.6 (Loss of sparsity when forming normal equations) Another reason not to compute
A^H A, when both m, n are large:

    A sparse  ⇏  A⊤A sparse

Example from Rem. 1.3.1.5: for an “arrow matrix” A the product A⊤A is, in general, densely populated.

Consequences for the normal equation method, if both m, n are large:

✦ Potential memory overflow when computing A⊤A
✦ Squanders the possibility to use efficient sparse direct elimination techniques, see Section 2.7.4

This situation is faced in Ex. 3.0.1.9, Ex. 3.0.1.8.    y

§3.2.0.7 (Extended normal equations)

There is a way to avoid the computation of the system matrix A⊤A ∈ R^{n,n} of the normal equations for the
overdetermined linear system of equations Ax = b, A ∈ R^{m,n}, m ≥ n. The trick is to extend the normal
equations,

    A⊤Ax = A⊤b ,    (3.1.2.2)

by introducing the residual r := Ax − b ∈ R^m as a new (auxiliary) unknown:

    A⊤Ax = A⊤b   ⇔   B [ r ; x ] := [ −I_m   A ; A⊤   O ] [ r ; x ] = [ b ; 0 ] .    (3.2.0.8)

The benefit of using (3.2.0.8) instead of the standard normal equations (3.1.2.2) is that sparsity is pre-
served. However, the conditioning of the system matrix in (3.2.0.8) is not better than that of A⊤A.

A more general substitution q := α⁻¹(Ax − b) with α > 0 may even improve the conditioning for a suitably
chosen parameter α > 0:

    A⊤Ax = A⊤b   ⇔   B_α [ q ; x ] := [ −αI_m   A ; A⊤   O ] [ q ; x ] = [ b ; 0 ] .    (3.2.0.9)

For m, n ≫ 1, A sparse, both (3.2.0.8) and (3.2.0.9) lead to large sparse linear systems of equations,
amenable to sparse direct elimination techniques, see Section 2.7.4. y
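
The following sketch (not part of the lecture codes; the function name extendednormeqsolve is made up
and A is assumed to be given as an EIGEN sparse matrix) shows how the extended system (3.2.0.8) could
be assembled from triplets and handed to a sparse direct solver:

  #include <Eigen/Sparse>
  #include <vector>

  // Sketch: solve the extended normal equations (3.2.0.8) for a sparse A,
  // assembling the block matrix B from triplets and using sparse LU.
  Eigen::VectorXd extendednormeqsolve(const Eigen::SparseMatrix<double> &A,
                                      const Eigen::VectorXd &b) {
    const Eigen::Index m = A.rows(), n = A.cols();
    std::vector<Eigen::Triplet<double>> trp;
    for (Eigen::Index i = 0; i < m; ++i) trp.emplace_back(i, i, -1.0); // -I_m block
    for (Eigen::Index k = 0; k < A.outerSize(); ++k)
      for (Eigen::SparseMatrix<double>::InnerIterator it(A, k); it; ++it) {
        trp.emplace_back(it.row(), m + it.col(), it.value()); // A block
        trp.emplace_back(m + it.col(), it.row(), it.value()); // A^T block
      }
    Eigen::SparseMatrix<double> B(m + n, m + n);
    B.setFromTriplets(trp.begin(), trp.end());
    Eigen::VectorXd rhs = Eigen::VectorXd::Zero(m + n);
    rhs.head(m) = b;
    Eigen::SparseLU<Eigen::SparseMatrix<double>> solver(B);
    const Eigen::VectorXd sol = solver.solve(rhs);
    return sol.tail(n); // discard the auxiliary residual unknown r
  }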

EXAMPLE 3.2.0.10 (Conditioning of the extended normal equations)

In this example we explore empirically how the Euclidean condition number of the extended normal
equations (3.2.0.9) is influenced by the choice of α. We consider (3.2.0.8), (3.2.0.9) for

    A = [ 1+ǫ  1 ; 1−ǫ  1 ; ǫ  ǫ ] ∈ R^{3,2} ,   here with α = ǫ‖A‖₂/√2 .

[Fig. 79: plot of the condition numbers cond₂(A), cond₂(A^H A), cond₂(B), and cond₂(B_α) in
dependence on ǫ ∈ [10⁻⁵, 1].]
    y
Review question(s) 3.2.0.11 (Normal equation methods)
(Q3.2.0.11.A) We consider the overdetermined linear system of equations

    Ax = b ,   A ∈ R^{m,n} ,  m ≥ n ,  b ∈ R^m .    (3.2.0.12)

We augment it by another equation and get another overdetermined linear system of equations

    [ A ; v⊤ ] x̃ = [ b ; β ] ,   v ∈ R^n ,  β ∈ R .    (3.2.0.13)

How are the normal equations of (3.2.0.12) and (3.2.0.13) related?

(Q3.2.0.11.B) Discuss how the coefficient matrices and right-hand side vectors of the normal equations
belonging to the two overdetermined linear systems of equations

    Ax = b ,            A ∈ R^{m,n} ,  m ≥ n ,  b ∈ R^m ,    (3.2.0.14)
    (A + uv⊤) x̃ = b ,   u ∈ R^m ,  v ∈ R^n ,                 (3.2.0.15)

are related.

3.3 Orthogonal Transformation Methods [DR08, Sect. 4.4.2]

Video tutorial for Section 3.3 "Orthogonal Transformation Methods": (10 minutes)
Download link, tablet notes

→ review questions 3.3.2.3

3.3.1 Transformation Idea


We consider the full-rank linear least squares problem (3.1.2.22)

    given A ∈ R^{m,n}, b ∈ R^m   find   x = argmin_{y∈R^n} ‖Ay − b‖₂ .    (3.1.2.22)

Setting: m ≥ n and A has full (maximum) rank: rank(A) = n.


§3.3.1.1 (Generalizing the policy underlying Gaussian elimination) Recall the rationale behind Gaus-
sian elimination (→ Section 2.3, Ex. 2.3.1.1):
➥ By row transformations convert the LSE Ax = b to an equivalent (in terms of set of solutions) LSE
Ux = b̃, which is easier to solve because it has triangular form.
How to adapt this policy to the linear least squares problem (3.1.2.22)?
Two questions:  ➊ What linear least squares problems are “easy to solve”?
                ➋ How can we arrive at them by equivalent transformations of (3.1.2.22)?
Here we call two overdetermined linear systems Ax = b and Ãx = b̃ equivalent in the sense of
(3.1.2.22), if both have the same set of least squares solutions: lsq(A, b) = lsq(Ã, b̃), see (3.1.1.2).
    y

§3.3.1.2 (Triangular linear least squares problems)

The answer to question ➊ is the same as for LSE/Gaussian elimination:

    Linear least squares problems (3.1.2.22) with upper triangular A are easy to solve!

If A = [ R ; 0 ] with an upper triangular block R ∈ R^{n,n} on top of m − n zero rows, then

    ‖Ax − b‖₂ → min   =⇒ (∗)   x = R⁻¹ (b)₁:n   =ˆ least squares solution.

How can we draw the conclusion (∗)? Obviously, the components n + 1, . . . , m of the vector inside the
norm are fixed and do not depend on x. All we can do is to make the first components 1, . . . , n vanish, by
choosing a suitable x, see [DR08, Thm. 4.13]. Obviously, x = R⁻¹(b)₁:n accomplishes this.

Note: since A has full rank n, the upper triangular part R ∈ R n,n of A is regular! y
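
In EIGEN such a triangular least squares problem could be solved along the following lines (a minimal
sketch, not one of the lecture codes; the function name is made up):

  #include <Eigen/Dense>

  // Sketch: least squares solution for A = [R; 0] with regular upper triangular
  // R ∈ R^{n,n}; x = R^{-1} (b)_{1:n}, the last m-n components of b form the
  // residual and cannot be influenced by x.
  Eigen::VectorXd lsqsolve_triangular(const Eigen::MatrixXd &A,
                                      const Eigen::VectorXd &b) {
    const Eigen::Index n = A.cols();
    return A.topRows(n).triangularView<Eigen::Upper>().solve(b.head(n));
  }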

Answer to question ➋:
Idea:  If we have a (transformation) matrix T ∈ R^{m,m} satisfying

    ‖Ty‖₂ = ‖y‖₂   ∀y ∈ R^m ,    (3.3.1.3)

then

    argmin_{y∈R^n} ‖Ay − b‖₂ = argmin_{y∈R^n} ‖Ãy − b̃‖₂ ,

where Ã = TA and b̃ = Tb.

The next section will characterize the class of eligible transformation matrices T.


3.3.2 Orthogonal/Unitary Matrices


Definition 3.3.2.1. Unitary and orthogonal matrices → [Gut09, Sect. 2.8]

• Q ∈ K n,n , n ∈ N, is unitary, if Q−1 = Q H .


• Q ∈ R n,n , n ∈ N, is orthogonal, if Q−1 = Q T .

Theorem 3.3.2.2. Preservation of Euclidean norm


A matrix is unitary/orthogonal, if and only if the associated linear mapping preserves the 2-norm:

Q ∈ K n,n unitary ⇔ kQxk2 = kxk2 ∀x ∈ K n .

From Thm. 3.3.2.2 we immediately conclude that, if a matrix Q ∈ K^{n,n} is unitary/orthogonal, then

✦ all rows/columns (regarded as vectors ∈ K^n) have Euclidean norm = 1,
✦ all rows/columns are pairwise orthogonal (w.r.t. the Euclidean inner product),
✦ |det Q| = 1, ‖Q‖₂ = 1, and all eigenvalues ∈ {z ∈ C : |z| = 1},
✦ ‖QA‖₂ = ‖A‖₂ for any matrix A ∈ K^{n,m}.
Review question(s) 3.3.2.3 (Orthogonal transformations)
(Q3.3.2.3.A) [Orthogonal matrices in R²]   Give a full characterization of all orthogonal matrices
Q ∈ R^{2,2}.
Hint. The subspace of R² spanned by all vectors orthogonal (w.r.t. the Euclidean inner product) to a
given vector u = [u1, u2]⊤ is spanned by [−u2, u1]⊤.
(Q3.3.2.3.B) [Polarization identity]   Prove the following variant of a polarization identity:

    x⊤y = ¼ ( ‖x + y‖₂² − ‖x − y‖₂² )   ∀x, y ∈ R² .

(Q3.3.2.3.C) Based on the result of Question (Q3.3.2.3.A) find an orthogonal matrix Q ∈ R^{2,2}, such that

    Q [ 0  a12 ; 1  a22 ] = [ ∗  ∗ ; 0  ∗ ] ,   a12, a22 ∈ R .

Here ∗ stands for an arbitrary matrix entry.


3.3.3 QR-Decomposition [Han02, Sect. 13], [Gut09, Sect. 7.3]


This section will answer the question whether and how it is possible to find orthogonal transformations that
convert any given matrix A ∈ R m,n , m ≥ n, rank(A) = n, to upper triangular form, as required for the
application of the “equivalence transformation idea” to full-rank linear least squares problems.


3.3.3.1 QR-Decomposition: Theory

Video tutorial for Section 3.3.3.1 "QR-Decomposition: Theory": (11 minutes) Download link,
tablet notes

→ review questions 3.3.3.8

§3.3.3.1 (Gram-Schmidt orthogonalisation recalled → § 1.5.1.1)

Input:  {a1, . . . , an} ⊂ K^m
Output: {q1, . . . , qn} (assuming no premature termination!)

    1:  q1 := a1/‖a1‖₂ ;   % 1st output vector
    2:  for j = 2, . . . , n do
        {  % Orthogonal projection
    3:    qj := aj ;
    4:    for ℓ = 1, 2, . . . , j − 1 do                         (GS)
    5:      { qj ← qj − ⟨aj , qℓ⟩ qℓ ; }
    6:    if ( qj = 0 ) then STOP
    7:    else { qj ← qj/‖qj‖₂ ; }
    8:  }

Theorem 3.3.3.2. Span property of G.S. vectors

If {a1, . . . , an} ⊂ R^m is linearly independent, then Algorithm (GS) computes orthonormal vectors
q1, . . . , qn ∈ R^m satisfying

    Span{q1, . . . , qℓ} = Span{a1, . . . , aℓ} ,    (1.5.1.2)

for all ℓ ∈ {1, . . . , n}.

The span property (1.5.1.2) can be made more explicit in terms of the existence of linear combinations

    q1 = t11 a1
    q2 = t12 a1 + t22 a2
    q3 = t13 a1 + t23 a2 + t33 a3          ∃T ∈ R^{n,n} upper triangular:  Q = AT ,    (3.3.3.3)
      ⋮
    qn = t1n a1 + t2n a2 + · · · + tnn an ,

where Q = [q1, . . . , qn] ∈ R^{m,n} (with orthonormal columns), A = [a1, . . . , an] ∈ R^{m,n}. Note that thanks
to the linear independence of {a1, . . . , ak} and {q1, . . . , qk}, the matrix T = (tij)_{i,j=1}^{k} ∈ R^{k,k} is regular
(“non-existent” tij are set to zero, of course).
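
As an illustration (not one of the lecture codes; the function name gramschmidt_qr is made up), the
relation Q = AT, i.e. A = QR with R = T⁻¹, can be obtained directly from (GS) by accumulating the
triangular factor on the fly; recall from Exp. 1.5.1.5, however, that this procedure is not numerically stable.

  #include <Eigen/Dense>
  #include <utility>

  // Sketch: economical QR-decomposition A = Q*R by Gram-Schmidt
  // orthonormalization (GS); unstable in floating-point arithmetic,
  // for illustration of (3.3.3.3) only.
  std::pair<Eigen::MatrixXd, Eigen::MatrixXd>
  gramschmidt_qr(const Eigen::MatrixXd &A) {
    const Eigen::Index m = A.rows(), n = A.cols();
    Eigen::MatrixXd Q(m, n);
    Eigen::MatrixXd R = Eigen::MatrixXd::Zero(n, n);
    for (Eigen::Index j = 0; j < n; ++j) {
      Eigen::VectorXd q = A.col(j);
      for (Eigen::Index l = 0; l < j; ++l) {
        R(l, j) = Q.col(l).dot(A.col(j)); // projection coefficient
        q -= R(l, j) * Q.col(l);          // orthogonal projection
      }
      R(j, j) = q.norm();
      // R(j,j) == 0 would indicate linearly dependent columns of A
      Q.col(j) = q / R(j, j);
    }
    return {Q, R}; // A = Q*R, Q with orthonormal columns, R upper triangular
  }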

Recall from Lemma 1.3.1.9 that inverses of regular upper triangular matrices are upper triangular
again.

Thus, by (3.3.3.3), we have found an upper triangular R := T⁻¹ ∈ R^{n,n} such that

    A = QR ,   Q ∈ R^{m,n} with orthonormal columns ,   R ∈ R^{n,n} upper triangular .

Next “augmentation by zero”: add m − n zero rows at the bottom of R and complement the columns of Q
to an orthonormal basis of R^m, which yields an orthogonal matrix Q̃ ∈ R^{m,m}:

    A = Q̃ [ R ; 0 ]   ⇔   Q̃⊤ A = [ R ; 0 ] .
    y
Thus the algorithm of Gram-Schmidt orthonormalization “proves” the following theorem.

Theorem 3.3.3.4. QR-decomposition → [NS02, Satz 5.2], [Gut07, Sect. 7.3]

For any matrix A ∈ K^{n,k} with rank(A) = k there exists
(i)  a unique matrix Q0 ∈ K^{n,k} that satisfies Q0^H Q0 = I_k, and a unique upper triangular matrix
     R0 ∈ K^{k,k} with (R0)i,i > 0, i ∈ {1, . . . , k}, such that

         A = Q0 · R0   (“economical” QR-decomposition) ,

(ii) a unitary matrix Q ∈ K^{n,n} and a unique upper triangular R ∈ K^{n,k} with (R)i,i > 0, i ∈
     {1, . . . , n}, such that

         A = Q · R   (full QR-decomposition) .

If K = R all matrices will be real and Q is then orthogonal.

Visualisation: “economical” QR-decomposition, Q0^H Q0 = I_k (orthonormal columns):

    A = Q0 R0 ,   Q0 ∈ K^{n,k} ,   R0 ∈ K^{k,k} upper triangular .    (3.3.3.5)

Visualisation: full QR-decomposition, Q^H Q = Q Q^H = I_n (orthogonal/unitary matrix):

    A = QR ,   Q ∈ K^{n,n} ,   R ∈ K^{n,k} upper triangular .    (3.3.3.6)

For square A, that is, n = k, both QR-decompositions coincide.


Corollary 3.3.3.7. Uniqueness of QR-factorization

The “economical” QR-factorization (3.3.3.1) of A ∈ K^{m,n}, m ≥ n, with rank(A) = n is unique, if
we demand (R0)ii > 0, i = 1, . . . , n.

Proof. We observe that R is regular, if A has full rank n. Since the regular upper triangular matrices form
a group under multiplication:

    Q1 R1 = Q2 R2   ⇒   Q1 = Q2 R   with upper triangular R := R2 R1⁻¹ ,

    I = Q1^H Q1 = R^H Q2^H Q2 R = R^H R .

The assertion follows by uniqueness of the Cholesky decomposition, Lemma 2.8.0.14.


Review question(s) 3.3.3.8 (QR-Decomposition: Theory)


(Q3.3.3.8.A) Assume that A ∈ R m,n , m ≥ n, satisfies that A⊤ A is a diagonal matrix with positive diago-
nal entries. Describe the economical QR-decomposition of A.
(Q3.3.3.8.B) What is the QR-decomposition of a “right-lower triangular” square matrix A ∈ R n,n
((A)i,j = 0 for i + j ≤ n)?

(Q3.3.3.8.C) What is the R in the full QR-decomposition A = QR of a tensor product matrix A = uv⊤ ,
u ∈ R m , v ∈ R n , m, n ∈ N, m ≥ n.
Hint. rank(A) = rank(R).
(Q3.3.3.8.D) Explain why the full QR-decomposition/QR-factorization of A ∈ R m,n , m > n, cannot be
unique, even if we demand (R)ii > 0, i = 1, . . . , n.

3.3.3.2 Computation of QR-Decomposition

Video tutorial for Section 3.3.3.2 & Section 3.3.3.4 "Computation of QR-Decomposition, QR-
Decomposition in E IGEN ": (32 minutes) Download link, tablet notes

→ review questions 3.3.3.29

In theory, Gram-Schmidt orthogonalization (GS) can be used to compute the QR-factorization of a matrix
A ∈ R m,n , m ≥ n, rank(A) = n. However, as we saw in Exp. 1.5.1.5, Gram-Schmidt orthogonalization
in the form of Code 1.5.1.3 is not a stable algorithm.
There is a stable way to compute QR-decompositions, based on the accumulation of orthogonal transfor-
mations.

Corollary 3.3.3.9. Composition of orthogonal transformations

The product of two orthogonal/unitary matrices of the same size is again orthogonal/unitary.


Idea:  Find simple orthogonal (row) transformations rendering certain matrix elements zero:

    Q (column of A) = (column with zeros below the first entry) ,   with Q⊤ = Q⁻¹ .

Recall that this “annihilation of column entries” is the key operation in Gaussian forward elimination, where
it is achieved by means of non-unitary row transformations, see Sect. 2.3.2. Now we want to find a
counterpart of Gaussian elimination based on unitary row transformations on behalf of numerical stability.

EXAMPLE 3.3.3.10 (“Annihilating” orthogonal transformations in 2D) In 2D there are two possible
orthogonal transformations that make the 2nd component of a ∈ R² vanish, which, in geometric terms,
amounts to mapping the vector onto the x1-axis:

    reflections at the angle bisector (Fig. 80),
    rotations Q = [ cos ϕ  sin ϕ ; −sin ϕ  cos ϕ ] turning a onto the x1-axis (Fig. 81).

Note that in each case we have two different length-preserving linear mappings at our disposal. This
flexibility will be important for curbing the impact of roundoff.    y

Both reflections and rotations are actually used in library routines and both are discussed in the sequel:

§3.3.3.11 (Householder reflections → [GV13, Sect. 5.1.2]) The following so-called Householder
matrices (HHM) effect the reflection of a vector into a multiple of the first unit vector with the same length:

    Q = H(v) := I − 2 (v v⊤)/(v⊤ v)   with   v = a ± ‖a‖₂ e1 ,    (3.3.3.12)

where e1 is the first Cartesian basis vector. Orthogonality of these matrices can be established by direct
computation.
Fig. 82 depicts a “geometric derivation” of Householder reflections mapping a → b, assuming
‖a‖₂ = ‖b‖₂. We accomplish this by a reflection at the hyperplane with normal vector b − a.


Given a, b ∈ R^n with ‖a‖₂ = ‖b‖₂, the difference vector v := a − b is orthogonal to the bisector
(see Fig. 82), and

    b = a − (a − b) = a − v (v⊤v)/(v⊤v) = a − 2 v (v⊤a)/(v⊤v) = a − 2 (v v⊤)/(v⊤v) a = H(v) a ,

because, due to the orthogonality (a − b) ⊥ (a + b)  (⇔ (a − b)⊤(a + b) = 0),

    v⊤v = (a − b)⊤(a − b) = (a − b)⊤(a − b + a + b) = 2(a − b)⊤a = 2 v⊤a .

As a consequence, we have for the Householder matrix H(v) from (3.3.3.12)

    H(v) a = ± ‖a‖₂ e1 ,   that is,   H(v) [a1, . . . , am]⊤ = [±‖a‖₂, 0, . . . , 0]⊤ .

Hence, suitable successive Householder transformations determined by the leftmost column (“target col-
umn”) of shrinking bottom right matrix blocks can be used to achieve upper triangular form R. The following
series of figures visualizes the gradual annihilation of the lower triangular matrix part for a square matrix:

[Figures: in each step the current target column determines the unitary transformation, which creates
zeros below the diagonal in that column; only the shrinking bottom right block is modified in the course
of the transformations, until an upper triangular matrix remains.]

Writing Qℓ for the Householder matrix used in the ℓ-th factorization step, we arrive at the

    QR-decomposition (QR-factorization) of A ∈ C^{n,n}:   Q_{n−1} Q_{n−2} · · · · · Q1 A = R ,   i.e.
    A = QR ,   Q := Q1⊤ · · · · · Q⊤_{n−1} orthogonal matrix ,   R upper triangular matrix .
y
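
The following is a minimal sketch (not one of the lecture codes; the function name householder_qr_demo
is made up) of this process for a square matrix: it applies each Householder reflection directly to the
trailing block and, only for demonstration, accumulates Q as a dense matrix (cf. Rem. 3.3.3.22 below).
The sign of the Householder vector is chosen as in Rem. 3.3.3.14 below to avoid cancellation.

  #include <Eigen/Dense>

  // Sketch: QR-decomposition of a square matrix A by successive Householder
  // reflections; Q is accumulated as a dense matrix for demonstration only.
  void householder_qr_demo(const Eigen::MatrixXd &A, Eigen::MatrixXd &Q,
                           Eigen::MatrixXd &R) {
    const Eigen::Index n = A.rows();
    R = A;
    Q = Eigen::MatrixXd::Identity(n, n);
    for (Eigen::Index k = 0; k < n - 1; ++k) {
      Eigen::VectorXd v = R.col(k).tail(n - k); // current target (sub-)column
      // Sign chosen to avoid cancellation, cf. Rem. 3.3.3.14
      v(0) += (v(0) >= 0 ? 1.0 : -1.0) * v.norm();
      const double vtv = v.squaredNorm();
      if (vtv == 0.0) continue; // column already annihilated
      // Apply H(v) = I - 2*v*v^T/(v^T v) to the trailing block of R ...
      R.bottomRightCorner(n - k, n - k) -=
          (2.0 / vtv) * v * (v.transpose() * R.bottomRightCorner(n - k, n - k));
      // ... and accumulate it into Q (only columns k..n-1 are affected)
      Q.rightCols(n - k) -=
          (2.0 / vtv) * (Q.rightCols(n - k) * v) * v.transpose();
    }
  }

In production code one would store only the Householder vectors instead of forming Q, see
Rem. 3.3.3.22 below.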

Remark 3.3.3.13 (QR-decomposition of “fat” matrices) We can also apply successive Householder
transformations as outlined in § 3.3.3.11 to a matrix A ∈ R^{m,n} with m < n. If the first m columns of A
are linearly independent, we obtain another variant of the QR-decomposition:

    A = QR ,   Q ∈ R^{m,m} ,   R ∈ R^{m,n} ,

where Q is orthogonal and R is upper triangular, that is, (R)i,j = 0 for i > j.    y

Remark 3.3.3.14 (Stable implementation of Householder reflections) In (3.3.3.12) the computation of
the vector v can be prone to cancellation (→ Section 1.5.4), if the vector a encloses a very small angle
with the first unit vector, because in this case v can be very small and beset with a huge relative error. For
instance, this occurs for

    a = [1, δ, . . . , δ]⊤ ,  δ ≈ 0   ⇒   v = a − ‖a‖₂ e1 = [ξ, δ, . . . , δ]⊤   with ξ ≈ 0 .

This is a concern, because in the formula for the Householder matrix,

    H(v) := I − 2 (v v⊤)/(v⊤ v) ,

v is normalized to unit length (division by ‖v‖₂²), and then a large absolute error might result.
Fortunately, two choices for v are possible in (3.3.3.12) and at most one can be affected by cancellation.
The right choice is

    v = a + ‖a‖₂ e1 ,  if a1 > 0 ,
    v = a − ‖a‖₂ e1 ,  if a1 ≤ 0 .

See [Hig02, Sect. 19.1] and [GV13, Sect. 5.1.3] for a discussion.    y

§3.3.3.15 (Givens rotations → [Han02, Sect. 14], [GV13, Sect. 5.1.8]) The 2D rotation displayed in
Fig. 81 can be embedded in an identity matrix. Thus, the following orthogonal transformation, a Givens
rotation, annihilates the k-th component of a vector a = [a1, . . . , an]⊤ ∈ R^n. Here γ stands for cos(ϕ)
and σ for sin(ϕ), ϕ the angle of rotation, see Fig. 81:

    G1k(a1, ak) a := (identity matrix with entries (1,1) = (k,k) = γ, (1,k) = σ, (k,1) = −σ) a
                  = [a1^{(1)}, a2, . . . , ak−1, 0, ak+1, . . . , an]⊤ ,                    (3.3.3.16)
    with   γ = a1/√(|a1|² + |ak|²) ,   σ = ak/√(|a1|² + |ak|²) .

Orthogonality (→ Def. 6.3.1.2) of G1k(a1, ak) is verified immediately. Again, we have two options for an
annihilating rotation, see Ex. 3.3.3.10. It will always be possible to choose one that avoids overflow [GV13,
Sect. 5.1.8], see Code 3.3.3.17 for details.

C++ code 3.3.3.17: Stable Givens rotation of a 2D vector, [GV13, Alg. 5.1.3] ➺ GITLAB
2   // plane (2D) Givens rotation avoiding cancellation
3   // Computes orthogonal G ∈ R^{2,2} with G⊤a = [r; 0] =: x, r = ±‖a‖₂
4   void planerot(const Eigen::Vector2d &a, Eigen::Matrix2d &G,
5                 Eigen::Vector2d &x) {
6     int sign{1};
7     const double anorm = a.norm();
8     if (anorm != 0.0) {  //
9       double s;  // s ↔ σ
10      double c;  // c ↔ γ
11      if (std::abs(a[1]) > std::abs(a[0])) {  // Avoid overflow
12        const double t = -a[0] / a[1];
13        s = 1.0 / std::sqrt(1.0 + t * t);
14        c = s * t;
15        sign = -1;
16      } else {
17        const double t = -a[1] / a[0];
18        c = 1.0 / std::sqrt(1.0 + t * t);
19        s = c * t;
20      }
21      G << c, s, -s, c;  // Form 2 × 2 Givens rotation matrix
22    } else {
23      G.setIdentity();
24    }
25    x << (sign * anorm), 0.0;
26  }

We validate the implementation by straightforward computations, using the variable names of
Code 3.3.3.17 and a = [a0, a1]⊤:

• Case |a1| ≥ |a0|:  t = −a0/a1 ,  s = 1/√(1 + t²) ,  c = s·t :

    G⊤a = [c·a0 − s·a1 ; s·a0 + c·a1] = [s·t·a0 − s·a1 ; s·a0 + s·t·a1]
        = (1/√(1 + t²)) [t·a0 − a1 ; a0 + t·a1] = (|a1|/‖a‖₂) [−(a0² + a1²)/a1 ; a0 − a0]
        = [−sgn(a1)‖a‖₂ ; 0] .

• Case |a0| > |a1|:  t = −a1/a0 ,  c = 1/√(1 + t²) ,  s = c·t :

    G⊤a = [c·a0 − s·a1 ; s·a0 + c·a1] = (1/√(1 + t²)) [a0 − t·a1 ; t·a0 + a1]
        = (|a0|/‖a‖₂) [(a0² + a1²)/a0 ; −a1 + a1] = [sgn(a0)‖a‖₂ ; 0] .

So far, we know how to annihilate a single component of a vector by means of a Givens rotation that targets
that component and some other (the first in (3.3.3.16)). However, for the sake of QR-decomposition we
aim to map all components to zero except for the first.
☞ This can be achieved by n − 1 successive Givens rotations, see also Code 3.3.3.19:

    [a1, a2, a3, . . . , an]⊤  −G12(a1,a2)→  [a1^{(1)}, 0, a3, . . . , an]⊤  −G13(a1^{(1)},a3)→  [a1^{(2)}, 0, 0, a4, . . . , an]⊤
        −G14(a1^{(2)},a4)→  · · ·  −G1n(a1^{(n−2)},an)→  [a1^{(n−1)}, 0, . . . , 0]⊤ .    (3.3.3.18)

✎ Notation: Gij(a1, a2) =ˆ Givens rotation (3.3.3.16) modifying rows i and j of the matrix.

C++11 code 3.3.3.19: Rotating a vector onto the x1-axis by successive Givens transformations
➺ GITLAB

  // Orthogonal transformation of a (column) vector into a multiple of
  // the first unit vector by successive Givens transformations
  // Note that the output vector could be computed much more efficiently!
  void givenscoltrf(const VectorXd &aIn, MatrixXd &Q, VectorXd &aOut) {
    const Eigen::Index n = aIn.size();
    // Assemble rotations in a dense matrix Q
    // For (more efficient) alternatives see Rem. 3.3.3.22
    Q.setIdentity(); // Start from Q = I
    Matrix2d G;
    aOut = aIn;
    for (Eigen::Index j = 1; j < n; ++j) {
      const double a0 = aOut[0];
      const double a1 = aOut[j];
      // Determine entries of 2D rotation matrix, see Code 3.3.3.17
      double s = NAN; // s ↔ σ
      double c = NAN; // c ↔ γ
      if (a1 != 0.0) {
        if (std::abs(a1) > std::abs(a0)) { // Avoid overflow
          const double t = -a0 / a1;
          s = 1.0 / std::sqrt(1.0 + t * t);
          c = s * t;
        } else {
          const double t = -a1 / a0;
          c = 1.0 / std::sqrt(1.0 + t * t);
          s = c * t;
        }
        G << c, s, -s, c; // Form 2 × 2 Givens rotation matrix
      } else { // No rotation required
        G.setIdentity();
      }
      // select 1st and jth element of aOut and use the Map function
      // to prevent copying; equivalent to aOut([1,j]) in MATLAB
      Map<VectorXd, 0, InnerStride<>> aOutMap(aOut.data(), 2, InnerStride<>(j));
      aOutMap = G.transpose() * aOutMap;
      // select 1st and jth column of Q (Q(:,[1,j]) in MATLAB)
      Map<MatrixXd, 0, OuterStride<>> QMap(Q.data(), n, 2, OuterStride<>(j * n));
      // Accumulate orthogonal transformations in a dense matrix; just done
      // for demonstration purposes! See Rem. 3.3.3.22
      QMap = QMap * G;
    }
  }

Armed with these compound Givens rotations we can proceed as in the case of Householder reflections
to accomplish the orthogonal transformation of a full-rank matrix to upper triangular form, see

C++11 code 3.3.3.20: QR-decomposition by successive Givens rotations ➺ GITLAB

  //! QR decomposition of square matrix A by successive Givens
  //! transformations
  void qrgivens(const MatrixXd &A, MatrixXd &Q, MatrixXd &R) {
    const Eigen::Index n = A.rows();
    // Assemble rotations in a dense matrix.
    // For (more efficient) alternatives see Rem. 3.3.3.22
    Q.setIdentity();
    Matrix2d G;
    Vector2d tmp;
    Vector2d xDummy;
    R = A; // In situ transformation
    for (Eigen::Index i = 0; i < n - 1; ++i) {
      for (Eigen::Index j = n - 1; j > i; --j) {
        tmp(0) = R(j - 1, i);
        tmp(1) = R(j, i);
        planerot(tmp, G, xDummy); // see Code 3.3.3.17
        R.block(j - 1, 0, 2, n) = G.transpose() * R.block(j - 1, 0, 2, n);
        Q.block(0, j - 1, n, 2) = Q.block(0, j - 1, n, 2) * G;
      }
    }
  }

Remark 3.3.3.21 (Testing != 0.0 in Code 3.3.3.17) In light of the guideline “do not test floating point
numbers for exact equality” from Rem. 1.5.3.15 the test if (anorm != 0.0) in Line 8 looks inappropriate.
However, its sole purpose is to avoid division by zero and the code will work well even if anorm ≈ 0.    y

Remark 3.3.3.22 (Storing orthogonal transformations) When doing successive orthogonal transforma-
tions as in the case of QR-decomposition by means of Householder reflections (→ § 3.3.3.11) or Givens
rotations (→ § 3.3.3.15) it would be prohibitively expensive to assemble and even multiply the transforma-
tion matrices!

The matrices for the orthogonal transformation are never built in codes!
The transformations are stored in a compressed format.

Therefore, we stress that Code 3.3.3.20 is meant for demonstration purposes only, because the construc-
tion of the Q-factor matrix would never be done in this way in a well-designed numerical code.

➊ In the case of Householder reflections H(v) ∈ R^{m,m} (3.3.3.12), see [GOV13],

➤ store only the last n − 1 components of the normalized vector v ∈ R^m.

For the QR-decomposition of a matrix A ∈ R^{m,n} by means of successive Householder reflections
H(v1) · · · · · H(vk), k := min{m, n}, we store the bottom parts of the vectors vj ∈ R^{m−j+1}, j = 1, . . . , k,
whose lengths decrease, in place of the “annihilated” lower triangular part of A, which yields an in-situ
QR-factorization.

[Figures: storage layout of the Householder vectors in the lower triangular part of A, shown for the
cases m < n and m > n.]

➋ In the case of Givens rotations, for a single rotation Gi,j(a1, a2) we need to store only the row indices
(i, j) and the rotation angle [Ste76], [GV13, Sect. 5.1.11]. The latter is subject to a particular encoding
scheme:

    for G = [ γ  σ ; −σ  γ ]   ⇒   store  ρ :=  1M ,             if γ = 0 ,
                                                ½ sign(γ) σ ,    if |σ| < |γ| ,    (3.3.3.23)
                                                2 sign(σ)/γ ,    if |σ| ≥ |γ| ,

    which means   ρ = 1M   ⇒  γ = 0 ,   σ = 1 ,
                  |ρ| < 1  ⇒  σ = 2ρ ,  γ = √(1 − σ²) ,                            (3.3.3.24)
                  |ρ| > 1  ⇒  γ = 2/ρ , σ = √(1 − γ²) .

Here 1M alludes to the fact that the number 1.0 can be represented exactly in machine number systems.
Then store Gij(a, b) as the triple (i, j, ρ). The parameter ρ forgets the sign of the matrix Gij, so the signs of
the corresponding rows in the transformed matrix R have to be changed accordingly. The rationale behind
the above convention is to curb the impact of roundoff errors, because when we recover γ, σ by taking the
square root of a difference we never subtract two numbers of equal size; cancellation is avoided.
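
A small sketch (not part of the lecture codes; the function names are made up) of the encoding (3.3.3.23)
and the decoding (3.3.3.24):

  #include <cmath>

  // Sketch of the compact storage scheme (3.3.3.23) for G = [γ, σ; −σ, γ]: encode ...
  double givens_encode(double c /* γ */, double s /* σ */) {
    if (c == 0.0) return 1.0;                                      // ρ = 1M
    if (std::abs(s) < std::abs(c)) return 0.5 * (c > 0 ? s : -s);  // ρ = ½ sign(γ)σ
    return 2.0 * (s > 0 ? 1.0 : -1.0) / c;                         // ρ = 2 sign(σ)/γ
  }

  // ... and decode according to (3.3.3.24); the common sign of γ, σ is lost.
  void givens_decode(double rho, double &c, double &s) {
    if (rho == 1.0)                { c = 0.0;       s = 1.0; }
    else if (std::abs(rho) < 1.0)  { s = 2.0 * rho; c = std::sqrt(1.0 - s * s); }
    else                           { c = 2.0 / rho; s = std::sqrt(1.0 - c * c); }
  }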

Summing up, when asking to “compute the economical/full QR-decomposition” A = QR of a matrix A, we
request the upper triangular matrix R plus Q in a format that permits us to evaluate Q×vector efficiently.
    y

§3.3.3.25 (Computational cost of computing QR-decompositions) How many elementary operations
are asymptotically involved in the computation of the R-factor of the QR-decomposition of a “tall” matrix
A ∈ R^{m,n}, m ≥ n, based on Householder reflections as explained in § 3.3.3.11?
Obviously, the creation of zeros in the lower triangular part of A can be accomplished by n Householder
transformation steps, as is illustrated by the following figure.

[Figures: after the ℓ-th step the first ℓ columns are in upper triangular form; each step applies a
Householder reflection only to the remaining bottom right block.]

Note that the multiplication of a vector w ∈ R^m with a Householder matrix H(v) := I − 2vv⊤, v ∈ R^m,
‖v‖₂ = 1, takes only 2m operations, cf. Ex. 1.4.3.1.
Next, we examine the elementary matrix×vector operations involved in the orthogonal transformation of A
into an upper triangular matrix R ∈ R^{m,n}.
Step ➊: Householder matrix × n − 1 remaining matrix cols. of size m,     cost = 2m(n − 1)
Step ➋: Householder matrix × n − 2 remaining matrix cols. of size m − 1, cost = 2(m − 1)(n − 2)
Step ➌: Householder matrix × n − 3 remaining matrix cols. of size m − 2, cost = 2(m − 2)(n − 3)
  ⋮
We see that the combined number of entries of the matrix blocks still affected in the above steps is propor-
tional to the total work.

    cost(R-factor of A by Householder trf.) = Σ_{k=1}^{n−1} 2(m − k + 1)(n − k)    (3.3.3.26)
                                            = O(mn²)  for m, n → ∞ .


Remark 3.3.3.27 (QR-decomposition of banded matrices) The advantage of Givens rotations is their
selectivity, which can be exploited for banded matrices, see Section 2.7.5.

Definition 2.7.5.1. Bandwidth

For A = (aij)i,j ∈ K^{m,n} we call

    bw⁺(A) := min{k ∈ N : j − i > k ⇒ aij = 0}   the upper bandwidth ,
    bw⁻(A) := min{k ∈ N : i − j > k ⇒ aij = 0}   the lower bandwidth ,

    bw(A) := bw⁺(A) + bw⁻(A) + 1   the bandwidth of A.

Specific case: orthogonal transformation of an n × n tridiagonal matrix to upper triangular form, that is,
the annihilation of the sub-diagonal, by means of the successive Givens rotations G12, G23, . . . , Gn−1,n.
Each rotation sets one sub-diagonal entry to zero and creates at most one new non-zero entry two
positions above the diagonal (“fill-in” → Def. 2.7.4.3), so that the resulting R-factor has upper bandwidth 2.
This is a manifestation of a more general result, see Def. 2.7.5.1 for notations:

Theorem 3.3.3.28. QR-decomposition “preserves bandwidth”

If A = QR is the QR-decomposition of a regular matrix A ∈ R n,n , then bw(R) ≤ bw(A).

Studying the algorithms sketched above for tridiagonal matrices, we find that a total of at most n · bw(A)
Givens rotations is required for computing the QR-decomposition. Each of them acts on O(bw(A)) non-
zero entries of the matrix, which leads to an asymptotic total computational effort of O(n · bw(A)²) for
n → ∞.    y
Review question(s) 3.3.3.29 (Computation of QR-decompositions)
(Q3.3.3.29.A) Let A ∈ R^{n,n} be “Z-shaped”, that is,

    (A)i,j = 0 ,   if i ∈ {2, . . . , n − 1} and i + j ≠ n + 1 ,

so that only the first row, the last row, and the antidiagonal of A may contain non-zero entries.

1. Give a sequence of Givens rotations that convert A into upper triangular form.
2. Think about an efficient way to deploy orthogonal transformation techniques for the efficient solu-
   tion of a linear system of equations Ax = b, b ∈ R^n.
(Q3.3.3.29.B) The matrix A ∈ R^{n,n}, n ∈ N, is upper triangular except for a single non-zero entry in
position (n, 1):

    (A)i,j = 0 ,   if i > j and (i, j) ≠ (n, 1) .

Which sequence of Givens rotations (of minimal length) can be used to compute the QR-decomposition
of A?
(Q3.3.3.29.C) [Householder matrices] What is a Householder matrix and what are its properties
(regularity, orthogonality, symmetry, rank, kernel, range)?

3.3.3.3 QR-Decomposition: Stability

In numerical linear algebra orthogonal transformation methods usually give rise to reliable algorithms,
thanks to the norm-preserving property of orthogonal transformations.

§3.3.3.30 (Stability of unitary/orthogonal transformations) We consider the mapping (the “transfor-
mation” induced by Q)

    F : K^n → K^n ,   F(x) := Qx ,   Q ∈ K^{n,n} unitary/orthogonal .

We are interested in the sensitivity of F, that is, the impact of relative errors in the data vector x on the
output vector y := F(x).
We study the output for a perturbed input vector:

    Qx = y              ⇒  ‖x‖₂ = ‖y‖₂ ,
    Q(x + ∆x) = y + ∆y  ⇒  Q∆x = ∆y  ⇒  ‖∆y‖₂ = ‖∆x‖₂ ,
    hence   ‖∆y‖₂ / ‖y‖₂ = ‖∆x‖₂ / ‖x‖₂ .

We conclude that unitary/orthogonal transformations do not cause any amplification of relative errors in
the data vectors.
Of course, this also applies to the “solution” of square linear systems with orthogonal coefficient matrix
Q ∈ R^{n,n}, which, by Def. 6.3.1.2, boils down to multiplication of the right hand side vector with Q^H.    y

Remark 3.3.3.31 (Conditioning of conventional row transformations) Gaussian elimination as pre-
sented in § 2.3.1.3 converts a matrix to upper triangular form by elementary row transformations. Those
add a scalar multiple of a row of the matrix to another row and amount to left-multiplication with matrices

    T := I_n + µ e_j e_i⊤ ,   µ ∈ K ,  i, j ∈ {1, . . . , n}, i ≠ j .    (3.3.3.32)

However, these transformations can lead to a massive amplification of relative errors, which, by virtue of
Ex. 2.2.2.1, can be linked to large condition numbers of T.
This accounts for the fact that the computation of LU-decompositions by means of Gaussian elimination
might not be stable, see Ex. 2.4.0.6.    y

EXPERIMENT 3.3.3.33 (Conditioning of conventional row transformations, Rem. 3.3.3.31 cnt’d)

Study in 2D: 2 × 2 row transformation matrix (cf. the elimination matrices of Gaussian elimination)

    T(µ) = [ 1  0 ; µ  1 ] .

[Fig. 83: condition numbers of T(µ) (2-norm, maximum norm, 1-norm) plotted against µ ∈ [10⁻⁴, 10⁴];
they grow unboundedly for large |µ|.]
y
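
The essence of this experiment can be reproduced with a few lines of code (a sketch, not one of the
lecture codes), computing cond₂(T(µ)) as the ratio of the singular values of T(µ):

  #include <Eigen/SVD>
  #include <iostream>

  int main() {
    // Euclidean condition numbers of the row transformation matrix T(µ)
    for (double mu : {1.0, 1e1, 1e2, 1e3, 1e4}) {
      Eigen::Matrix2d T;
      T << 1.0, 0.0, mu, 1.0;
      const Eigen::Vector2d sv = T.jacobiSvd().singularValues();
      std::cout << "mu = " << mu << ", cond_2(T) = " << sv(0) / sv(1) << std::endl;
    }
    return 0;
  }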

The perfect conditioning of orthogonal transformations prevents the destructive build-up of roundoff errors.

Theorem 3.3.3.34. Stability of Householder QR [Hig02, Thm. 19.4]

Let R̃ ∈ R^{m,n} be the R-factor of the QR-decomposition of A ∈ R^{m,n} computed by means of
successive Householder reflections (→ § 3.3.3.11). Then there exists an orthogonal Q ∈ R^{m,m}
such that

    A + ∆A = Q R̃   with   ‖∆A‖₂ ≤ ( cmn·EPS / (1 − cmn·EPS) ) ‖A‖₂ ,    (3.3.3.35)

where EPS is the machine precision and c > 0 is a small constant independent of A.

3.3.3.4 QR-Decomposition in EIGEN

EIGEN offers several classes dedicated to computing QR-type decompositions of matrices, for instance
HouseholderQR. Internally the QR-decomposition is stored in compressed format as explained in
Rem. 3.3.3.22. Its computation is triggered by the constructor.

C++-code 3.3.3.36: QR-decompositions in EIGEN ➺ GITLAB

2   #include <Eigen/QR>
3
4   // Computation of full QR-decomposition (3.3.3.1),
5   // dense matrices built for both QR-factors (expensive!)
6   inline std::pair<MatrixXd, MatrixXd> qr_decomp_full(const MatrixXd& A) {
7     const Eigen::HouseholderQR<MatrixXd> qr(A);
8     const MatrixXd Q = qr.householderQ();  //
9     const MatrixXd R = qr.matrixQR().template triangularView<Eigen::Upper>();
10    return {Q, R};
11  }
12
13  // Computation of economical QR-decomposition (3.3.3.1),
14  // dense matrix built for Q-factor (possibly expensive!)
15  inline std::pair<MatrixXd, MatrixXd> qr_decomp_eco(const MatrixXd& A) {
16    using index_t = MatrixXd::Index;
17    const index_t m = A.rows();
18    const index_t n = A.cols();
19    const Eigen::HouseholderQR<MatrixXd> qr(A);
20    const MatrixXd Q = (qr.householderQ() * MatrixXd::Identity(m, n));  //
21    const MatrixXd R = qr.matrixQR().block(0, 0, n, n).template
                         triangularView<Eigen::Upper>();  //
22    return {Q, R};
23  }

Note that the method householderQ returns the Q-factor in compressed format, refer to Rem. 3.3.3.22.
Assignment to a matrix will convert it into a (dense) matrix format, see Line 8; only then the actual com-
putation of the matrix entries is performed. It can also be multiplied with another matrix of suitable size,
which is used in Line 20 to extract the Q-factor Q0 ∈ R m,n of the economical QR-decomposition (3.3.3.1).
The matrix returned by the method matrixQR() gives access to a matrix storing the QR-factors in
compressed form. Its upper triangular part provides R, see Line 21.

§3.3.3.37 (Economical versus full QR-decomposition) The distinction of Thm. 3.3.3.4 between eco-
nomical and full QR-decompositions of a “tall” matrix A ∈ R m,n , m > n, becomes blurred on the algo-
rithmic level. If all we want is a representation of the Q-factor as a product of orthogonal transformations
as discussed in Rem. 3.3.3.22, exactly the same computations give us both types of QR-decompositions,
because, of course, the bottom zero block of R need not be stored.

The same computations yield both full and economical QR-decompositions with Q-factors in product
form.

This is clearly reflected in Code 3.4.2.1. Thus, in the derivation of algorithms we choose either type of
QR-decomposition, whichever is easier to understand. y
§3.3.3.38 (Cost of QR-decomposition in EIGEN) A close inspection of the algorithm for the computation
of QR-decompositions of A ∈ R^{m,n} by successive Householder reflections (→ § 3.3.3.11) reveals that n
transformations costing ∼ mn operations each are required.

Asymptotic complexity of Householder QR-decomposition

The computational effort for HouseholderQR() of A ∈ R m,n , m > n, is O(mn2 ) for m, n → ∞.

EXPERIMENT 3.3.3.40 (Asymptotic complexity of Householder QR-factorization) We empirically


investigate the (asymptotic) complexity of QR-factorization algorithms in E IGEN through runtime measure-
ments.

C++-code 3.3.3.41: timing QR-factorizations in EIGEN ➺ GITLAB

  int nruns = 3, minExp = 2, maxExp = 6;
  MatrixXd tms(maxExp - minExp + 1, 4);
  for (int i = 0; i <= maxExp - minExp; ++i) {
    Timer t1, t2, t3; // timer class
    int n = std::pow(2, minExp + i);
    int m = n * n;
    // Initialization of matrix A
    MatrixXd A(m, n);
    A.setZero();
    A.setIdentity();
    A.block(n, 0, m - n, n).setOnes();
    A += VectorXd::LinSpaced(m, 1, m) * RowVectorXd::LinSpaced(n, 1, n);
    for (int j = 0; j < nruns; ++j) {
      // plain QR-factorization in the constructor
      t1.start(); HouseholderQR<MatrixXd> qr(A); t1.stop();
      // full decomposition
      t2.start(); std::pair<MatrixXd, MatrixXd> QR2 = qr_decomp_full(A); t2.stop();
      // economic decomposition
      t3.start(); std::pair<MatrixXd, MatrixXd> QR3 = qr_decomp_eco(A); t3.stop();
    }
    tms(i, 0) = n;
    tms(i, 1) = t1.min(); tms(i, 2) = t2.min(); tms(i, 3) = t3.min();
  }

Timings for
• plain QR-factorization in the constructor of HouseholderQR,
• invocation of function qr_decomp_full(), see Code 3.4.2.1,
• call to qr_decomp_eco() from Code 3.4.2.1.

Platform:
✦ ubuntu 14.04 LTS
✦ CPU i7-3517U, 1.90GHZ, 4 cores
✦ L1 32 KB, L2 256 KB, L3 4096 KB, Mem 8 GB
✦ gcc 4.8.4, -O3

[Fig. 84: runtimes (in s) of the three variants versus n, together with O(n⁴) and O(n⁶) reference lines.]

The runtimes for the QR-factorization of A ∈ R^{n²,n} behave like O(n² · n²) = O(n⁴) for large n.    y

3.3.4 QR-Based Solver for Linear Least Squares Problems

Video tutorial for Section 3.3.4 "QR-Based Solver for Linear Least Squares Problems": (9
minutes) Download link, tablet notes

→ review questions 3.3.4.8

The QR-decomposition introduced in Section 3.3.3, Thm. 3.3.3.4, paves the way for the practical algo-
rithmic realization of the “equivalent orthonormal transformation to upper triangular form”-idea from Sec-
tion 3.3.1.
We consider the full-rank linear least squares problem Eq. (3.1.2.22): Given A ∈ R^{m,n}, m ≥ n,
rank(A) = n,

    seek x ∈ R^n such that ‖Ax − b‖₂ → min .

We assume that we are given a
QR-decomposition: A = QR, Q ∈ R^{m,m} orthogonal, R ∈ R^{m,n} (regular) upper triangular matrix.
We apply the orthogonal, 2-norm preserving (→ Thm. 3.3.2.2) transformation encoded in Q to Ax − b,
the vector inside the 2-norm to be minimized:

    ‖Ax − b‖₂ = ‖Q(Rx − Q⊤b)‖₂ = ‖Rx − b̃‖₂ ,   b̃ := Q⊤b .
Thus, we have obtained an equivalent triangular linear least squares problem:


 
 
    ‖Ax − b‖₂ → min   ⇔   ‖ [R0 ; 0] x − b̃ ‖₂ → min ,

where R0 ∈ R^{n,n} denotes the regular upper triangular top block of R. Only the first n components of
the vector inside the norm depend on x, so that the least squares solution and the residual are

    x = R0⁻¹ [b̃1, . . . , b̃n]⊤ ,   r = Q [0, . . . , 0, b̃_{n+1}, . . . , b̃_m]⊤ .

Note: by Thm. 3.3.2.2 the norm of the residual is readily available: ‖r‖₂ = √( b̃²_{n+1} + · · · + b̃²_m ).

C++-code 3.3.4.1: QR-based solver for full rank linear least squares problem (3.1.2.22)
➺ GITLAB
2   // Solution of linear least squares problem (3.1.2.22) by means of QR-decomposition
3   // Note: A ∈ R^{m,n} with m > n, rank(A) = n is assumed
4   // Least squares solution returned in x, residual norm as return value
5   double qrlsqsolve(const MatrixXd& A, const VectorXd& b,
6                     VectorXd& x) {
7     const unsigned m = A.rows();
8     const unsigned n = A.cols();
9
10    MatrixXd Ab(m, n + 1); Ab << A, b;  // Form extended matrix [A, b]
11
12    // QR-decomposition of extended matrix automatically transforms b
13    MatrixXd R = Ab.householderQr().matrixQR().template
14                 triangularView<Eigen::Upper>();  //
15
16    MatrixXd R_nn = R.block(0, 0, n, n);  // R-factor R0
17    // Compute least squares solution x = (R)^{-1}_{1:n,1:n} (Q⊤b)_{1:n}
18    x = R_nn.template triangularView<Eigen::Upper>().solve(R.block(0, n, n, 1));
19    return R(n, n);  // residual norm = ‖Ax̂ − b‖₂ (why?)
20  }

Discussion of (some) details of the implementation in Code 3.3.4.1:

• The QR-decomposition is computed in a numerically stable way by means of Householder reflec-
  tions (→ § 3.3.3.11) by EIGEN’s built-in method householderQr() available for matrix types. The
  computational cost of this function when called for an m × n matrix is, asymptotically for m, n → ∞,
  O(n²m).
• Line 10: We perform the QR-decomposition of the extended matrix [A, b] with b as rightmost col-
  umn. Thus, the orthogonal transformations are automatically applied to b; the augmented matrix is
  converted into [R, Q⊤b], the data of the equivalent upper triangular linear least squares problem.
  Thus, actually, no information about Q needs to be stored, if one is interested in the least squares
  solution x only.
  The idea is borrowed from Gaussian elimination, see Code 2.3.1.4, Line 9.
• Line 14: matrixQR() returns the compressed QR-factorization as a matrix, where the R-factor
  R ∈ R^{m,n} is contained in the upper triangular part, whose top n rows give R0 from (3.3.3.1).
• Line 19: the components (b)_{n+2:m} of the vector b (treated as rightmost column of the augmented
  matrix) are annihilated when computing the QR-decomposition (by the final Householder reflection):
  (Q⊤[A, b])_{n+2:m, n+1} = 0. Hence, |(Q⊤[A, b])_{n+1,n+1}| = ‖(b̃)_{n+1:m}‖₂, which gives the norm of the
  residual.
➤ A QR-based algorithm is implemented in the solve() method available for EIGEN’s QR-
decomposition, see Code 3.3.4.2.

C++ code 3.3.4.2: EIGEN’s built-in QR-based linear least squares solver ➺ GITLAB

  // Solving a full-rank least squares problem ‖Ax − b‖₂ → min in EIGEN
  double lsqsolve_eigen(const MatrixXd& A, const VectorXd& b,
                        VectorXd& x) {
    x = A.householderQr().solve(b);
    return ((A * x - b).norm());
  }

Remark 3.3.4.3 (QR-based solution of linear systems of equations) Applying the QR-based algorithm
for full-rank linear least squares problems in the case m = n, that is, to a square linear system of equations
Ax = b with a regular coefficient matrix, will compute the solution x = A⁻¹b. In a sense, the QR-
decomposition offers an alternative to Gaussian elimination/LU-decomposition discussed in § 2.3.2.15.
The steps for solving a linear system of equations Ax = b by means of QR-decomposition are as follows:

    ① QR-decomposition A = QR, computational costs 2/3·n³ + O(n²)
      (about twice as expensive as LU-decomposition without pivoting)
    ② Orthogonal transformation z = Q⊤b, computational costs 4n² + O(n)
      (in the case of compact storage of reflections/rotations)
    ③ Backward substitution, solve Rx = z, computational costs ½·n(n + 1)

Benefit: we can utterly dispense with any kind of pivoting:

✌ Computing the generalized QR-decomposition A = QR by means of Householder reflections
  or Givens rotations is numerically stable for any A ∈ C^{m,n}.
✌ For any regular system matrix an LSE can be solved by means of
      QR-decomposition + orthogonal transformation + backward substitution
  in a stable manner.

Drawback: QR-decomposition can hardly ever avoid massive fill-in (→ Def. 2.7.4.3) also in situations
where LU-factorization greatly benefits from Thm. 2.7.5.4.    y


Remark 3.3.4.4 (QR-based solution of banded LSE) From Rem. 3.3.3.27, Thm. 3.3.3.28, we know that
the particular situation in which QR-decomposition can avoid fill-in (→ Def. 2.7.4.3) is the case of banded
matrices, see Def. 2.7.5.1. For a banded n × n linear system of equations with small fixed bandwidth
bw(A) ≤ O(1) we incur an
➣ asymptotic computational effort: O(n) for n → ∞

The following code uses a QR-decomposition computed by means of selective Givens rotations (→
§ 3.3.3.15) to solve a tridiagonal linear system of equations Ax = b with

    (A)i,i = d_i ,   (A)i+1,i = e_i ,   (A)i,i+1 = c_i ,   all other entries = 0 .

The matrix is passed in the form of three vectors e, c, d giving the entries in the non-zero bands.

C++ code 3.3.4.5: Solving a tridiagonal system by means of QR-decomposition ➺ GITLAB


//! @brief Solves the tridiagonal system Ax = b with QR-decomposition
//! @param[in] d Vector of dim n; the diagonal elements
//! @param[in] c Vector of dim n-1; the upper (super-)diagonal elements
//! @param[in] e Vector of dim n-1; the lower (sub-)diagonal elements
//! @param[in] b Vector of dim n; the rhs.
//! @param[out] x Vector of dim n
VectorXd tridiagqr(VectorXd c, VectorXd d, VectorXd e, VectorXd& b) {
  const Eigen::Index n = d.size();
  // resize the vectors c and e to correct length if needed
  c.conservativeResize(n);
  e.conservativeResize(n);
  const double t = d.norm() + e.norm() + c.norm();
  Matrix2d R;
  Vector2d z;
  Vector2d tmp;
  for (Eigen::Index k = 0; k < n - 1; ++k) {
    tmp(0) = d(k);
    tmp(1) = e(k);
    // Use Givens rotation to set the entries below the diagonal to zero
    planerot(tmp, R, z);  // see Code 3.3.3.17
    if (std::abs(z(0)) / t < std::numeric_limits<double>::epsilon()) {
      throw std::runtime_error("A nearly singular");
    }
    // Update all other entries of the matrix and rhs. which
    // were affected by the Givens rotation
    d(k) = z(0);
    b.segment(k, 2).applyOnTheLeft(R);  // rhs.
    // Block of the matrix affected by the Givens rotation
    Matrix2d Z;
    Z << c(k), 0, d(k + 1), c(k + 1);
    Z.applyOnTheLeft(R);
    // Write the transformed block back to the corresponding places
    c.segment(k, 2) = Z.diagonal();
    d(k + 1) = Z(1, 0);
    e(k) = Z(0, 1);
  }
  // Note that e now holds the second superdiagonal (above d and c)
  // Back substitution acting on an upper triangular matrix
  // with upper bandwidth 2 (stored in vectors).
  VectorXd x(n);
  // last row
  x(n - 1) = b(n - 1) / d(n - 1);
  if (n >= 2) {
    // 2nd-last row
    x(n - 2) = (b(n - 2) - c(n - 2) * x(n - 1)) / d(n - 2);
    // remaining rows
    for (Eigen::Index i = n - 3; i >= 0; --i) {
      x(i) = (b(i) - c(i) * x(i + 1) - e(i) * x(i + 2)) / d(i);
    }
  }
  return x;
}

EXAMPLE 3.3.4.6 (Stable solution of LSE by means of QR-decomposition) Aiming to confirm the
claim of superior stability of QR-based approaches (→ Rem. 3.3.4.3, § 3.3.3.30) we revisit Wilkinson’s
counterexample from Ex. 2.4.0.6 for which Gaussian elimination with partial pivoting does not yield an
acceptable solution.

The Wilkinson matrix A ∈ R^{n,n} is defined by
$$(A)_{i,j} := \begin{cases} 1, & i = j,\\ -1, & i > j,\ j < n,\\ 0, & i < j,\ j < n,\\ 1, & j = n. \end{cases}$$
Fig. 85 plots, versus n (up to n = 1000, logarithmic scale from 10^0 down to 10^{-16}), the relative error
(Euclidean norm) and the relative residual norm for Gaussian elimination and for QR-decomposition: while
Gaussian elimination with partial pivoting fails for larger n, QR-decomposition produces a perfect solution.
y
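
A minimal sketch of such an experiment follows; the choice of exact solution and error measure here is for illustration only and is not the exact driver behind Fig. 85:

#include <Eigen/Dense>
#include <iostream>

// Sketch: compare Gaussian elimination with partial pivoting and
// QR-decomposition on the Wilkinson matrix.
void wilkinson_test(int n) {
  Eigen::MatrixXd A = Eigen::MatrixXd::Identity(n, n);
  A.triangularView<Eigen::StrictlyLower>().setConstant(-1.0);
  A.col(n - 1).setOnes();                                   // last column = 1
  const Eigen::VectorXd x = Eigen::VectorXd::Random(n);     // exact solution
  const Eigen::VectorXd b = A * x;
  const Eigen::VectorXd x_lu = A.partialPivLu().solve(b);   // GE with partial pivoting
  const Eigen::VectorXd x_qr = A.householderQr().solve(b);  // QR-based solver
  std::cout << "rel. error LU = " << (x_lu - x).norm() / x.norm()
            << ", rel. error QR = " << (x_qr - x).norm() / x.norm() << std::endl;
}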

Let us summarize the pros and cons of orthogonal transformation techniques for linear least squares
problems:

Normal equations vs. orthogonal transformation methods

Superior numerical stability (→ Def. 1.5.5.19) of orthogonal transformation methods:

Use orthogonal transformation methods for least squares problems (3.1.3.7) whenever
A ∈ R^{m,n} is dense and n is small.

SVD/QR-factorization cannot exploit sparsity:

Use the normal equations in the expanded form (3.2.0.8)/(3.2.0.9) when A ∈ R^{m,n} is sparse (→
Notion 2.7.0.1) and m, n are big.

Review question(s) 3.3.4.8 (QR-Based Solver for Linear Least-Squares Problems)


(Q3.3.4.8.A) Given A ∈ R^{m,n}, m > n, rank(A) = n, b ∈ R^m, let the full QR-decomposition
$$[A\;b] = \tilde Q\tilde R\,,\qquad \tilde Q\in\mathbb{R}^{m,m}\ \text{orthogonal},\quad \tilde R\in\mathbb{R}^{m,n+1}\ \text{upper triangular},$$
of the augmented matrix [A b] ∈ R^{m,n+1} be given.


• How can you compute the unique least-squares solution x* ∈ R^n of Ax = b using Q̃ and R̃?

• Explain why ‖Ax* − b‖₂ = |(R̃)_{n+1,n+1}|.


(Q3.3.4.8.B) Describe how the QR-decomposition of a tridiagonal matrix A ∈ R n,n can be computed with
an asymptotic effort of O(n) for n → ∞.

3.3.5 Modification Techniques for QR-Decomposition

Video tutorial for Section 3.3.5 "Modification Techniques for QR-Decomposition": (25 minutes)
Download link, tablet notes

→ review questions 3.3.5.7

In § 2.6.0.12 we faced the task of solving a square linear system of equations Ãx = b efficiently, whose
coefficient matrix Ã was a (rank-1) perturbation of A, for which an LU-decomposition was available.
Lemma 2.6.0.21 showed a way to reuse the information contained in the LU-decomposition.

A similar task can be posed for the QR-decomposition: Assume that a QR-decomposition (→
Thm. 3.3.3.4) of a matrix A ∈ R^{m,n}, m ≥ n, has already been computed. However, now we have to
solve a full-rank linear least squares problem ‖Ãx − b‖₂ → min with Ã ∈ R^{m,n}, which is a "slight"
perturbation of A. If we aim to use orthogonalization techniques, it would be desirable to compute a
QR-decomposition of Ã with recourse to the QR-decomposition of A.

Remark 3.3.5.1 (Economical vs. full QR-decomposition) We recall § 3.3.3.37: The precise type of
QR-decomposition, whether full or economical, does not matter, since all algorithms will store the Q-factors
as products of orthogonal transformations.
Thus, below we will select the type of QR-decomposition that allows an easier derivation of the algorithm,
which is the full QR-decomposition. y

3.3.5.1 Rank-1 Modifications

For A ∈ R m,n , m ≥ n, rank(A) = n, we consider the rank-1 modification, cf. Eq. (2.6.0.16),

$$A \;\longrightarrow\; \tilde A := A + uv^\top\,,\qquad u\in\mathbb{R}^m,\ v\in\mathbb{R}^n\,. \tag{3.3.5.2}$$

Remember from § 2.6.0.12, (2.6.0.13), (2.6.0.15) that changing a single entry, row, or column of A can be
achieved through special rank-1 perturbations.
Given a full QR-decomposition according to Thm. 3.3.3.4, $A = QR = Q\begin{bmatrix}R_0\\O\end{bmatrix}$, Q ∈ R^{m,m} orthogonal
(stored in some implicit format as a product of orthogonal transformations, see Rem. 3.3.3.22), R ∈ R^{m,n}
and R₀ ∈ R^{n,n} upper triangular, the goal is to find an efficient algorithm that yields a QR-decomposition
of Ã: Ã = Q̃R̃, Q̃ ∈ R^{m,m} a product of orthogonal transformations, R̃ ∈ R^{m,n} upper triangular.

Step ➊: compute w = Q⊤ u ∈ R m .

Observe that A + uv⊤ = Q(R + wv⊤ ), because Q⊤ Q = Im .


➣ Computational effort = O(mn), if Q stored in suitable (compressed) format, cf. Rem. 3.3.3.22.

Step ➋: Orthogonally transform w → ‖w‖₂e₁, e₁ ∈ R^m ≙ first coordinate vector.

This can be done by applying m − 1 Givens rotations in the following order:
$$w = \begin{bmatrix}*\\ \vdots\\ *\\ *\\ *\\ *\end{bmatrix} \xrightarrow{\,G_{m-1,m}\,} \begin{bmatrix}*\\ \vdots\\ *\\ *\\ *\\ 0\end{bmatrix} \xrightarrow{\,G_{m-2,m-1}\,} \begin{bmatrix}*\\ \vdots\\ *\\ *\\ 0\\ 0\end{bmatrix} \xrightarrow{\,G_{m-3,m-2}\,} \cdots \xrightarrow{\,G_{12}\,} \begin{bmatrix}*\\ 0\\ \vdots\\ 0\\ 0\\ 0\end{bmatrix}\,.$$

Of course, these transformations also have to act on R ∈ R^{m,n}, and they will affect R by creating a single
non-zero subdiagonal through linearly combining pairs of adjacent rows from bottom to top:
$$R = \begin{bmatrix} * & * & \cdots & *\\ 0 & * & \cdots & *\\ \vdots & \ddots & \ddots & \vdots\\ 0 & \cdots & 0 & *\\ 0 & \cdots & 0 & 0\\ \vdots & & & \vdots\\ 0 & \cdots & 0 & 0 \end{bmatrix} \;\xrightarrow{\,G_{n,n+1}\,}\;\xrightarrow{\,G_{n-1,n}\,}\;\cdots\;\xrightarrow{\,G_{1,2}\,}\; \begin{bmatrix} * & * & \cdots & *\\ * & * & \cdots & *\\ 0 & * & \ddots & \vdots\\ \vdots & \ddots & \ddots & *\\ 0 & \cdots & * & *\\ 0 & \cdots & 0 & *\\ \vdots & & & \vdots\\ 0 & \cdots & 0 & 0 \end{bmatrix} =: R_1\,.$$

We see that (R₁)_{i,j} = 0 if i > j + 1; R₁ is a so-called upper Hessenberg matrix. This is also true of
R₁ + ‖w‖₂e₁v^⊤ ∈ R^{m,n}, because only the top row of the matrix e₁v^⊤ is non-zero. Therefore, if Q₁ ∈
R^{m,m} collects all m − 1 orthogonal transformations used in Step ➋, then
$$A + uv^\top = Q\,Q_1^\top\underbrace{\bigl(R_1 + \|w\|_2\,e_1v^\top\bigr)}_{\text{upper Hessenberg matrix}} \qquad\text{with orthogonal}\quad Q_1 := G_{12}\cdot\ldots\cdot G_{m-1,m}\,.$$

➣ Computational effort = O(n + n2 ) = O(n2 ) for n → ∞

Step ➌: Convert R₁ + ‖w‖₂e₁v^⊤ ∈ R^{m,n} into upper triangular form by successive Givens rotations
G_{1,2}, …, G_{n,n+1} applied to pairs of adjacent rows of this matrix, from top to bottom; the rotation
G_{j,j+1} annihilates the subdiagonal entry in position (j + 1, j):
$$\bigl(G_{n,n+1}\,G_{n-1,n}\cdots G_{23}\,G_{12}\bigr)\bigl(R_1 + \|w\|_2\,e_1v^\top\bigr) = \tilde R\quad\text{(upper triangular!)}\,. \tag{3.3.5.3}$$
Since we need O(n) Givens rotations, each acting on matrix rows of length at most n:

➣ Computational effort = O(n²) for n → ∞

$$\tilde A = A + uv^\top = \tilde Q\tilde R \qquad\text{with}\quad \tilde Q := Q\,Q_1^\top\,G_{12}^\top G_{23}^\top\cdots G_{n-1,n}^\top G_{n,n+1}^\top\,.$$

➣ Total asymptotic computational effort = O(mn + n²) for m, n → ∞


For large n this is much cheaper than the cost O(n²m) for computing the QR-decomposition of Ã from
scratch. Moreover, we avoid forming and storing the matrix Ã.
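
The following sketch implements the three steps literally, under the simplifying assumption that the Q-factor is stored as an explicit dense matrix (in practice it would be kept as a product of orthogonal transformations); the function name and the use of EIGEN's JacobiRotation are illustrative choices:

#include <Eigen/Dense>
#include <Eigen/Jacobi>
#include <algorithm>
#include <utility>

// Sketch of the rank-1 update A + u*v^T = Qtilde*Rtilde described above,
// with dense Q in R^{m,m} and R in R^{m,n}.
std::pair<Eigen::MatrixXd, Eigen::MatrixXd> qr_rank1_update(
    const Eigen::MatrixXd &Q, const Eigen::MatrixXd &R,
    const Eigen::VectorXd &u, const Eigen::VectorXd &v) {
  const Eigen::Index m = Q.rows(), n = R.cols();
  Eigen::MatrixXd Qt = Q.transpose();  // accumulates all orthogonal transformations
  Eigen::MatrixXd Rw = R;              // working copy of the R-factor
  Eigen::VectorXd w = Qt * u;          // Step 1: w = Q^T*u
  Eigen::JacobiRotation<double> G;
  // Step 2: w -> ||w||_2*e_1 by Givens rotations on rows (i-1,i), bottom to top;
  // the same rotations act on Rw (creating one subdiagonal) and on Qt.
  for (Eigen::Index i = m - 1; i > 0; --i) {
    G.makeGivens(w(i - 1), w(i));      // rotation annihilating the second entry
    w.applyOnTheLeft(i - 1, i, G.adjoint());
    Rw.applyOnTheLeft(i - 1, i, G.adjoint());
    Qt.applyOnTheLeft(i - 1, i, G.adjoint());
  }
  Rw.row(0) += w(0) * v.transpose();   // R_1 + ||w||_2*e_1*v^T  (w(0) = +/-||w||_2)
  // Step 3: restore upper triangular form, rotations on rows (i,i+1), top to bottom
  for (Eigen::Index i = 0; i < std::min(n, m - 1); ++i) {
    G.makeGivens(Rw(i, i), Rw(i + 1, i));
    Rw.applyOnTheLeft(i, i + 1, G.adjoint());
    Qt.applyOnTheLeft(i, i + 1, G.adjoint());
  }
  return {Qt.transpose(), Rw};         // A + u*v^T = Qt^T * Rw
}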

3.3.5.2 Adding a Column


We obtain an augmented matrix Ã ∈ R^{m,n+1} by inserting a column v into A ∈ R^{m,n} at an arbitrary
location:
$$A \in\mathbb{R}^{m,n} \;\longrightarrow\; \tilde A := [a_1,\ldots,a_{k-1},\,v,\,a_k,\ldots,a_n]\in\mathbb{R}^{m,n+1}\,,\quad v\in\mathbb{R}^m,\ a_j := (A)_{:,j}\,. \tag{3.3.5.4}$$

On the level of matrix-vector arithmetic the following explanations are easier for the full QR-
decomposition, cf. Rem. 3.3.5.1.

Given: full QR-decomposition of A: A = QR, Q ∈ R^{m,m} orthogonal (stored as a product of O(n)
orthogonal transformations), R ∈ R^{m,n} upper triangular.


Sought: full QR-decomposition of Ã from (3.3.5.4): Ã = Q̃R̃, Q̃ ∈ R^{m,m} with orthonormal columns,
stored as a product of O(n) orthogonal transformations, R̃ ∈ R^{m,n+1} upper triangular, computed
efficiently.

As preparation we point out that left-multiplication of a matrix with another matrix can be understood as
forming multiple matrix×vector products:
$$A = QR \;\Longleftrightarrow\; Q^\top A = \bigl[Q^\top a_1,\ldots,Q^\top a_n\bigr] = R = \begin{bmatrix}R_0\\ O\end{bmatrix}\,,\qquad R_0\in\mathbb{R}^{n,n}\ \text{upper triangular},$$
that is, (Q^⊤a_j)_ℓ = 0 for ℓ > j. We immediately infer
$$Q^\top\tilde A = \bigl[Q^\top a_1,\ldots,Q^\top a_{k-1},\,Q^\top v,\,Q^\top a_k,\ldots,Q^\top a_n\bigr] =: W \in\mathbb{R}^{m,n+1}\,,$$


which suggests the following three-step algorithm.

Step ➊: compute w = Q⊤ v ∈ R m .

➣ Computational effort = O(mn) for m, n → ∞, if Q stored in suitable (compressed) format.

Step ➋: Annihilate the bottom m − n − 1 components of w ↔ rows of W (if m > n + 1).

This can be done by m − n − 1 Givens rotations targeting adjacent rows of W, from bottom to top:
$$W \;\xrightarrow{\,G_{m-1,m}\,}\;\cdots\;\xrightarrow{\,G_{n+1,n+2}\,}\; =: T\,,$$
where only the entries of the inserted column k of W extend below row n + 1 and are successively annihilated.

➣ Computational effort = O(m − n) for m, n → ∞.

Writing Q₂^⊤ := G_{n+1,n+2} ⋯ G_{m−1,m} ∈ R^{m,m} for the orthogonal matrix representing the product of
Givens rotations, we find
$$Q_2^\top Q^\top\tilde A = T\,.$$

Step ➌: Transform T to upper triangular form.

We accomplish this by applying n + 1 − k successive Givens rotations from bottom to top in the following
fashion:
$$T \;\xrightarrow{\,G_{n,n+1}\,}\;\cdots\;\xrightarrow{\,G_{k,k+1}\,}\; \text{upper triangular matrix}\,,$$
where the rotation G_{j,j+1} targets rows j and j + 1, annihilates the entry in position (j + 1, k) stemming
from the inserted column, and creates new non-zero entries only on or above the diagonal.

➣ Computational effort for this step = O((n − k)²) for n → ∞

➣ Total asymptotic computational cost = O(n² + m) for m, n → ∞

Again, for large m, n this is significantly cheaper than forming the matrix Ã and then computing its
(economical) QR-decomposition.

3.3.5.3 Adding a Row

Again, the perspective of the full QR-decomposition is preferred for didactic reasons, cf. Rem. 3.3.5.1.
We are given a matrix A ∈ R m,n of which a full QR-decomposition (→ Thm. 3.3.3.4) A = QR, Q ∈
R m,m orthogonal, R ∈ R m,n upper triangular, is already available, maybe only in encoded form (→
Rem. 3.3.3.22).
We add another row to the matrix A at an arbitrary position k ∈ {1, …, m} and obtain
$$A\in\mathbb{R}^{m,n}\;\mapsto\;\tilde A := \begin{bmatrix}(A)_{1,:}\\ \vdots\\ (A)_{k-1,:}\\ v^\top\\ (A)_{k,:}\\ \vdots\\ (A)_{m,:}\end{bmatrix}\,,\qquad\text{with given } v\in\mathbb{R}^{n}\,. \tag{3.3.5.5}$$

Task: Find an algorithm for the efficient computation of the QR-decomposition Ã = Q̃R̃ of Ã from
(3.3.5.5), Q̃ ∈ R^{m+1,m+1} orthogonal (as a product of orthogonal transformations), R̃ ∈ K^{m+1,n}
upper triangular.
Step ①: Move new row to the bottom.

Employ a partial cyclic permutation of the rows of Ã:

row m + 1 ← row k ,  row i ← row i + 1 ,  i = k, …, m ,

which can be achieved by multiplying Ã from the left with an (orthogonal!) permutation matrix
P ∈ R^{m+1,m+1}; its transpose is given entry-wise by
$$(P^\top)_{i,i} = 1\ (1\le i<k)\,,\qquad (P^\top)_{k,m+1} = 1\,,\qquad (P^\top)_{j,j-1} = 1\ (k<j\le m+1)\,,$$
all other entries being zero (row k of P^⊤ has its single 1 in the last column).


 
 
 
 
 
 
 
 
       
Q⊤ 0 e  
e = A⊤
PA
R
PA = ⊤ =  R  =: T ∈ R m+1,n .
v 0 1 v  
 
 
 
 
 
 
 
 
v⊤

This step is a mere bookkeeping operation and does not involve any computations.

Step ②: Restore upper triangular form through Givens rotations (→ § 3.3.3.15)


Successively target the bottom row together with rows 1, 2, …, n (from the top) in order to turn the
leftmost entries of the bottom row into zeros; here demonstrated for the case m = n:
$$T \;\xrightarrow{\,G_{1,m+1}\,}\;\xrightarrow{\,G_{2,m+1}\,}\;\cdots\;\xrightarrow{\,G_{m-1,m+1}\,}\;\xrightarrow{\,G_{m,m+1}\,}\; \tilde R\,, \tag{3.3.5.6}$$
where the rotation G_{j,m+1} combines row j with the bottom row and annihilates the bottom-row entry
in column j.

➣ Computational effort for this step = O(n2 ) for n → ∞

Finally, setting Q₁ := G_{m,m+1} ⋯ G_{1,m+1}, the final QR-decomposition reads
$$\tilde A = P^\top\begin{bmatrix}Q & 0\\ 0 & 1\end{bmatrix}Q_1^\top\tilde R = \tilde Q\tilde R \qquad\text{with orthogonal}\quad \tilde Q := P^\top\begin{bmatrix}Q & 0\\ 0 & 1\end{bmatrix}Q_1^\top\in\mathbb{K}^{m+1,m+1}\,,$$
because the product of orthogonal matrices is again orthogonal. Of course, Q̃ is never formed explicitly
in an algorithm but kept as a sequence of orthogonal transformations.
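
As an illustration, the following sketch covers the special case k = m + 1, i.e., the new row v^⊤ is appended at the bottom, so that the permutation step ① is trivial. Q is handled as a dense matrix for simplicity, and the function name is chosen for illustration only:

#include <Eigen/Dense>
#include <Eigen/Jacobi>
#include <utility>

// Sketch: QR-update when the row v^T is appended at the bottom of A = Q*R.
std::pair<Eigen::MatrixXd, Eigen::MatrixXd> qr_append_row(
    const Eigen::MatrixXd &Q, const Eigen::MatrixXd &R,
    const Eigen::VectorXd &v) {
  const Eigen::Index m = Q.rows(), n = R.cols();
  // Build T = [R; v^T] and the transposed enlarged orthogonal factor diag(Q,1)^T
  Eigen::MatrixXd T(m + 1, n);
  T.topRows(m) = R;
  T.row(m) = v.transpose();
  Eigen::MatrixXd Qet = Eigen::MatrixXd::Identity(m + 1, m + 1);
  Qet.topLeftCorner(m, m) = Q.transpose();
  // Step 2: Givens rotations G_{j,m+1}, j = 1,...,n, kill the bottom row of T
  Eigen::JacobiRotation<double> G;
  for (Eigen::Index j = 0; j < n; ++j) {
    G.makeGivens(T(j, j), T(m, j));
    T.applyOnTheLeft(j, m, G.adjoint());    // annihilates T(m,j)
    Qet.applyOnTheLeft(j, m, G.adjoint());  // accumulate the transformations
  }
  return {Qet.transpose(), T};              // [A; v^T] = Qet^T * T, T upper triangular
}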
Review question(s) 3.3.5.7 (Modification techniques for QR-decompositions)
(Q3.3.5.7.A) Explain why, as far as the use of the QR-decomposition in numerical methods is concerned,
the distinction between full and economical versions does not matter.


(Q3.3.5.7.B) [QR-update after dropping a column] Assume that Ã arises from A ∈ R^{m,n}, m ≥ n, by
dropping the k-th column, k ∈ {1, …, n}. How can a QR-decomposition of Ã be computed based on
a QR-decomposition A = QR of A?

Hint. Examine the structure of Q^⊤Ã.
(Q3.3.5.7.C) [Update of QR-decomposition when modifying a single entry] Assume that the (full)
QR-decomposition A = QR of A ∈ R^{m,n}, m ≥ n, is available. Describe the algorithm for computing
the QR-decomposition of Ã ∈ R^{m,n} that arises from setting a single entry of A at position (ℓ, k),
ℓ ∈ {1, …, m}, k ∈ {1, …, n}, to zero.

Exception. You may look at the lecture notes to answer this question.

3.4 Singular Value Decomposition (SVD)


Beside the QR-decomposition of a matrix A ∈ R m,n there are other factorizations based on orthogonal
transformations. The most important among them is the singular value decomposition (SVD), which can be
used to tackle linear least squares problems and many other optimization problems beyond, see [Kal96].

3.4.1 SVD: Definition and Theory

Video tutorial for Section 3.4.1 "Singular Value Decomposition: Definition and Theory": (13
minutes) Download link, tablet notes

→ review questions 3.4.1.15

Theorem 3.4.1.1. Singular value decomposition → [NS02, Thm. 9.6], [Gut09, Thm. 11.1]
For any A ∈ K m,n there are unitary/ orthogonal matrices U ∈ K m,m , V ∈ K n,n and a (generalized)
diagonal (∗) matrix Σ = diag(σ1 , . . . , σp ) ∈ R m,n , p := min{m, n}, σ1 ≥ σ2 ≥ · · · ≥ σp ≥ 0
such that

A = UΣVH .

Terminology (∗): A matrix Σ is called a generalized diagonal matrix, if (Σ)i,j = 0, if i 6= j, 1 ≤ i ≤ m,


1 ≤ j ≤ n. We still use the diag operator to create it from a vector.
Proof. (of Thm. 3.4.1.1, by induction)
To start the induction note that the assertion of the theorem is immediate for n = 1 or m = 1.
For the induction step (n − 1, m − 1)⇒(m, n) first remember from analysis [Str09, Thm. 4.2.3]: Continu-
ous real-valued functions attain extremal values on compact sets (here the unit ball {x ∈ K n : kxk2 ≤ 1}).
In particular, consider the function v ∈ K^n ↦ ‖Av‖₂ ∈ R. This function will attain its maximal value ‖A‖₂
on {v ∈ K^n : ‖v‖₂ ≤ 1} for at least one vector x, ‖x‖₂ = 1:
$$\exists\, x\in\mathbb{K}^n,\ y\in\mathbb{K}^m,\ \|x\|_2 = \|y\|_2 = 1:\quad Ax = \sigma y\,,\quad \sigma = \|A\|_2\,,$$
where we used the definition of the matrix 2-norm, see Def. 1.5.5.10. By Gram-Schmidt orthogonalization
or a similar procedure we can extend the single unit vectors x and y to orthonormal bases of K^n and K^m,
respectively: ∃ Ṽ ∈ K^{n,n−1}, Ũ ∈ K^{m,m−1} such that
V = [x Ṽ] ∈ K^{n,n}, U = [y Ũ] ∈ K^{m,m} are orthogonal.
$$U^HAV = \begin{bmatrix}y^H\\ \tilde U^H\end{bmatrix}A\begin{bmatrix}x & \tilde V\end{bmatrix} = \begin{bmatrix}y^HAx & y^HA\tilde V\\ \tilde U^HAx & \tilde U^HA\tilde V\end{bmatrix} = \begin{bmatrix}\sigma & w^H\\ 0 & B\end{bmatrix} =: A_1\,.$$
For the induction argument we have to show that w = 0. Since
$$\left\|A_1\begin{bmatrix}\sigma\\ w\end{bmatrix}\right\|_2^2 = \left\|\begin{bmatrix}\sigma^2+w^Hw\\ Bw\end{bmatrix}\right\|_2^2 = (\sigma^2+w^Hw)^2 + \|Bw\|_2^2 \;\ge\; (\sigma^2+w^Hw)^2\,,$$
we conclude
$$\|A_1\|_2^2 = \sup_{0\neq x\in\mathbb{K}^n}\frac{\|A_1x\|_2^2}{\|x\|_2^2} \;\ge\; \frac{\bigl\|A_1\left[\begin{smallmatrix}\sigma\\ w\end{smallmatrix}\right]\bigr\|_2^2}{\bigl\|\left[\begin{smallmatrix}\sigma\\ w\end{smallmatrix}\right]\bigr\|_2^2} \;\ge\; \frac{(\sigma^2+w^Hw)^2}{\sigma^2+w^Hw} = \sigma^2 + w^Hw\,. \tag{3.4.1.2}$$
We exploit that multiplication with orthogonal matrices from the right or left does not affect the Euclidean
matrix norm:
$$\sigma^2 = \|A\|_2^2 = \bigl\|U^HAV\bigr\|_2^2 = \|A_1\|_2^2 \;\overset{(3.4.1.2)}{\Longrightarrow}\; \|A_1\|_2^2 \ge \|A_1\|_2^2 + \|w\|_2^2 \;\Rightarrow\; w = 0\,.$$
Hence,
$$A_1 = \begin{bmatrix}\sigma & 0\\ 0 & B\end{bmatrix}\,.$$
Then apply the induction argument to B.

Definition 3.4.1.3. Singular value decomposition (SVD)

The decomposition A = UΣVH of Thm. 3.4.1.1 is called singular value decomposition (SVD) of
A. The diagonal entries σi of Σ are the singular values of A. The columns of U/V are the left/right
singular vectors of A.

Next, we visualize the structure of the singular value decomposition of a matrix A ∈ K^{m,n}. For m > n
the factor U is a large m × m square matrix, Σ ∈ R^{m,n} consists of an n × n diagonal block on top of
m − n zero rows, and V^H is a small n × n square matrix; for m < n the factor U is a small m × m square
matrix, Σ ∈ R^{m,n} consists of an m × m diagonal block followed by n − m zero columns, and V^H is a
large n × n square matrix.

§3.4.1.4 (Economical singular value decomposition) As in the case of the QR-decomposition, compare
(3.3.3.1) and (3.3.3.1), we can also drop the bottom zero rows of Σ and the corresponding columns of U
in the case of m > n. Thus we end up with an “economical” singular value decomposition of A ∈ K m,n :

$$\begin{aligned} m\ge n:\quad & A = U\Sigma V^H\,,\ U\in\mathbb{K}^{m,n}\,,\ \Sigma\in\mathbb{K}^{n,n}\,,\ V\in\mathbb{K}^{n,n}\,,\ U^HU = I_n\ \text{(U orthonormal columns)},\ V\ \text{unitary},\\ m< n:\quad & A = U\Sigma V^H\,,\ U\in\mathbb{K}^{m,m}\,,\ \Sigma\in\mathbb{K}^{m,m}\,,\ V\in\mathbb{K}^{n,m}\,,\ U\ \text{unitary},\ V^HV = I_m\ \text{(V orthonormal columns)}. \end{aligned} \tag{3.4.1.5}$$

with true diagonal matrices Σ, whose diagonals contain the singular values of A.
Visualization of the economical SVD for m > n: A = UΣV^H with a tall factor U ∈ K^{m,n} (orthonormal
columns), a true diagonal matrix Σ ∈ K^{n,n}, and a square unitary factor V^H ∈ K^{n,n}.

The economical SVD is also called thin SVD in literature [GV13, Sect. 2.3.4]. y

An alternative motivation and derivation of the SVD is based on diagonalizing the Hermitian matrices
AAH ∈ R m,m and AH A ∈ R n,n . The relationship is made explicit in the next lemma.
Lemma 3.4.1.6.

The squares σi2 of the non-zero singular values of A are the non-zero eigenvalues of AH A, AAH
with associated eigenvectors (V):,1 , . . . , (V):,p , (U):,1 , . . . , (U):,p , respectively.

Proof. AA^H and A^HA are similar (→ Lemma 9.1.0.6) to diagonal matrices with non-zero diagonal
entries σ_i² (σ_i ≠ 0), e.g.,
$$AA^H = U\Sigma V^HV\Sigma^HU^H = U\underbrace{\Sigma\Sigma^H}_{\text{diagonal matrix}}U^H\,. \qquad\Box$$
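
A quick numerical illustration of Lemma 3.4.1.6, as a sketch (the test matrix is arbitrary and the function name is illustrative):

#include <Eigen/Dense>
#include <iostream>

// Compare the squared singular values of A with the eigenvalues of A^T*A.
void svd_vs_eigenvalues(const Eigen::MatrixXd &A) {
  const Eigen::JacobiSVD<Eigen::MatrixXd> svd(A);
  const Eigen::VectorXd sv = svd.singularValues();           // sorted decreasingly
  Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> es(A.transpose() * A);
  const Eigen::VectorXd ev = es.eigenvalues();               // sorted increasingly
  std::cout << "sigma_i^2            : " << sv.cwiseAbs2().transpose() << std::endl;
  std::cout << "eigenvalues of A^T*A : " << ev.reverse().transpose() << std::endl;
}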


Remark 3.4.1.7 (SVD and additive rank-1 decomposition → [Gut09, Cor. 11.2], [NS02, Thm. 9.8])
Recall from linear algebra that rank-1 matrices coincide with tensor products of vectors:

A ∈ K m,n and rank(A) = 1 ⇔ ∃u ∈ K m , v ∈ K n : A = uvH , (3.4.1.8)


because rank(A) = 1 means that Ax = µ(x)u for some u ∈ K m and a linear form x ∈ K n 7→ µ(x) ∈
K. By the Riesz representation theorem the latter can be written as µ(x) = vH x.

The singular value decomposition provides an additive decomposition into rank-1 matrices:

$$A = U\Sigma V^H = \sum_{j=1}^{p}\sigma_j\,(U)_{:,j}(V)_{:,j}^H\,. \tag{3.4.1.9}$$

Since the columns of U and V are orthonormal, we immediately conclude:
$$A(V)_{:,j} = \sigma_j(U)_{:,j}\,,\qquad A^H(U)_{:,j} = \sigma_j(V)_{:,j}\,,\qquad j\in\{1,\ldots,p\}\,. \tag{3.4.1.10}$$

y
Remark 3.4.1.11 (Uniqueness of SVD)
The SVD from Def. 3.4.1.3 is not (necessarily) unique, but the singular values are.
Proof. Proof by contradiction: assume that A has two singular value decompositions

$$A = U_1\Sigma_1V_1^H = U_2\Sigma_2V_2^H \;\Rightarrow\; U_1\underbrace{\Sigma_1\Sigma_1^H}_{=\operatorname{diag}(\sigma_1^2,\ldots,\sigma_m^2)}U_1^H = AA^H = U_2\underbrace{\Sigma_2\Sigma_2^H}_{=\operatorname{diag}(\sigma_1^2,\ldots,\sigma_m^2)}U_2^H\,.$$
The two diagonal matrices are similar, which implies that they have the same eigenvalues, which agree
with their diagonal entries. Since the latter are sorted, the diagonals must agree. ✷ y
§3.4.1.12 (SVD, nullspace, and image space) The SVD give complete information about all crucial
subspaces associated with a matrix:

Lemma 3.4.1.13. SVD and rank of a matrix → [NS02, Cor. 9.7]

Let A = UΣVH be the SVD of A ∈ K m,n according to Thm. 3.4.1.1. If, for some 1 ≤ r ≤ p :=
min{m, n}, the singular values of A ∈ K m,n satisfy

σ₁ ≥ ··· ≥ σ_r > σ_{r+1} = ··· = σ_p = 0 ,

then

• rank(A) = r (no. of non-zero singular values) ,


• N (A) = Span{(V):,r+1 , . . . , (V):,n } ,
• R(A) = Span{(U):,1 , . . . , (U):,r } .


Illustration for m > n, cf. (3.4.1.14): in the full SVD A = UΣV^H with
$$\Sigma = \begin{bmatrix}\Sigma_r & 0\\ 0 & 0\end{bmatrix}\,,\qquad \Sigma_r := \operatorname{diag}(\sigma_1,\ldots,\sigma_r)\,, \tag{3.4.1.14}$$
the first r columns of U ∈ K^{m,m} form an ONB of R(A), while the last n − r rows of V^H ∈ K^{n,n}
(i.e., the last n − r columns of V) form an ONB of N(A).

y
Review question(s) 3.4.1.15 (SVD: Definition and theory)
(Q3.4.1.15.A) If a square matrix A ∈ R n,n is given as A = QDQ⊤ with an orthogonal matrix Q ∈ R n,n
and a diagonal matrix D ∈ R n,n , then what is a singular value decomposition of A?
(Q3.4.1.15.B) What is a full singular value decomposition of A = uv⊤ , u ∈ R m , v ∈ R n ?
(Q3.4.1.15.C) Based on the SVD give a proof of the fundamental dimension theorem from linear algebra:

rank(A) + dim N (A) = n ∀A ∈ K m,n .

(Q3.4.1.15.D) Use the SVD of A ∈ K m,n to prove the fundamental relationships

R(AH ) = N (A)⊥ , N (AH ) = R(A)⊥ .

Here X ⊥ designates the orthogonal complement of a subspace X ⊂ K d with respect to the Euclidean
inner product:

X ⊥ : = { v ∈ K d : xH v = 0 ∀ x ∈ X } .

(Q3.4.1.15.E) Use the SVD to show that every regular square matrix A ∈ R n,n can be factorized as

A = QS , Q orthogonal , S symmetric, positive definite ,

which is the so-called polar decomposition of A.


(Q3.4.1.15.F) [SVD of a matrix with vanishing bottom rows] Assume that the bottom m − n rows of
A ∈ R m,n , m > n, vanish, that is, A is of the form
 
$$A = \begin{bmatrix}A_*\\ O_{m-n,n}\end{bmatrix}\qquad\text{with}\quad A_*\in\mathbb{R}^{n,n}\,.$$

Based on a known full singular-value decomposition A_* = U_*Σ_*V_*^⊤ of A_*, state a full SVD of A.

3. Direct Methods for Linear Least Squares Problems, 3.4. Singular Value Decomposition (SVD) 268
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024

(Q3.4.1.15.G) [Completion to regular matrix] Given m, n ∈ N, m < n, and a matrix T ∈ R n,m with full
rank m, sketch an algorithm that computes a matrix X ∈ R n,n−m such that [T X] ∈ R n,n is regular/in-
vertible.
The following result can be a starting point:

Lemma 3.4.1.13. SVD and rank of a matrix

Let A = UΣVH be the SVD of A ∈ K m,n according to Thm. 3.4.1.1. If, for some 1 ≤ r ≤ p :=
min{m, n}, the singular values of A ∈ K m,n satisfy

σ₁ ≥ ··· ≥ σ_r > σ_{r+1} = ··· = σ_p = 0 ,

then

• rank(A) = r (no. of non-zero singular values) ,


• N (A) = Span{(V):,r+1 , . . . , (V):,n } ,
• R(A) = Span{(U):,1 , . . . , (U):,r } .

3.4.2 SVD in E IGEN

Video tutorial for Section 3.4.2 "SVD in E IGEN ": (9 minutes) Download link, tablet notes

→ review questions 3.4.2.10

The E IGEN class JacobiSVD is constructed from a matrix data type, computes the SVD of its argument
during construction and offers the access methods matrixU(), singularValues(), and matrixV()
to request the SVD-factors and singular values.

C++-code 3.4.2.1: Computing SVDs in E IGEN ➺ GITLAB


2 # include <Eigen / SVD>
3 using MatrixXd = Eigen : : MatrixXd ;
4 // Computation of (full) SVD A = UΣVH → Thm. 3.4.1.1
5 // SVD factors are returned as dense matrices in natural order
6 i n l i n e std : : tuple <MatrixXd , MatrixXd , MatrixXd > s v d _ f u l l ( const MatrixXd& A) {
7 const Eigen : : JacobiSVD<MatrixXd > svd ( A , Eigen : : ComputeFullU | Eigen : : ComputeFullV ) ;
8 const MatrixXd & U = svd . matrixU ( ) ; // get unitary (square) matrix U
9 const MatrixXd & V = svd . matrixV ( ) ; // get unitary (square) matrix V
10 const VectorXd & sv = svd . singularValues ( ) ; // get singular values as vector
11 MatrixXd Sigma = MatrixXd : : Zero ( A . rows ( ) , A . cols ( ) ) ;
12 const unsigned p = sv . s i z e ( ) ; // no. of singular values
13 Sigma . block ( 0 , 0 , p , p ) = sv . asDiagonal ( ) ; // set diagonal block of Σ
14 r e t u r n {U, Sigma , V } ;
15 }
16

17 // Computation of economical (thin) SVD A = UΣVH , see (3.4.1.5)


18 // SVD factors are returned as dense matrices in natural order
19 i n l i n e std : : tuple <MatrixXd , MatrixXd , MatrixXd > svd_eco ( const MatrixXd& A) {
20 const Eigen : : JacobiSVD<MatrixXd > svd ( A , Eigen : : ComputeThinU | Eigen : : ComputeThinV ) ;
21 const MatrixXd & U = svd . matrixU ( ) ; // get matrix U with orthonormal
columns
22 const MatrixXd & V = svd . matrixV ( ) ; // get matrix V with orthonormal
columns
23 const VectorXd & sv = svd . singularValues ( ) ; // get singular values as vector

3. Direct Methods for Linear Least Squares Problems, 3.4. Singular Value Decomposition (SVD) 269
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024

24 const MatrixXd Sigma = sv . asDiagonal ( ) ; // build diagonal matrix Σ


25 r e t u r n {U, Sigma , V } ;
26 }

The second argument in the constructor of JacobiSVD determines, whether the methods matrixU()
and matrixV() return the factor for the full SVD of Def. 3.4.1.3 or of the economical (thin) SVD (3.4.1.5):
Eigen::ComputeFull* will select the full versions, whereas Eigen::ComputeThin* picks the
economical versions → documentation.

Internally, the computation of the SVD is done by a sophisticated algorithm, for which key steps rely on
orthogonal/unitary transformations. Also there we reap the benefit of the exceptional stability brought
about by norm-preserving transformations → § 3.3.3.30.

E IGEN’s algorithm for computing SVD is (numerically) stable → Def. 1.5.5.19

§3.4.2.2 (Computational cost of computing the SVD) According to E IGEN’s documentation the SVD of
a general dense matrix involves the following asymptotic complexity:

cost(economical SVD of A ∈ K m,n ) = O(min{m, n}2 max{m, n})

The computational effort is (asymptotically) linear in the larger matrix dimension. y

EXAMPLE 3.4.2.3 (SVD-based computation of the rank of a matrix) Based on Lemma 3.4.1.13, the
SVD is the main tool for the stable computation of the rank of a matrix (→ Def. 2.2.1.3)

However, theory as reflected in Lemma 3.4.1.13 entails identifying zero singular values, which must rely
on a threshold condition in a numerical code, recall Rem. 1.5.3.15. Given the SVD A = UΣV^H,
Σ = diag(σ₁, …, σ_{min{m,n}}), of a matrix A ∈ K^{m,n}, A ≠ 0, and a tolerance tol > 0, we define the
numerical rank
$$r := \#\bigl\{\,\sigma_i : |\sigma_i| \ge \text{tol}\cdot\max_j|\sigma_j|\,\bigr\}\,. \tag{3.4.2.4}$$

The following code implements this rule.

C++-code 3.4.2.5: Computing rank of a matrix through SVD ➺ GITLAB


// Computation of the numerical rank of a non-zero matrix by means of
// singular value decomposition, cf. (3.4.2.4).
Eigen::Index rank_by_svd(const Eigen::MatrixXd &A, double tol = EPS) {
  if (A.norm() == 0) {
    return static_cast<Eigen::Index>(0);
  }
  const Eigen::JacobiSVD<Eigen::MatrixXd> svd(A);
  const Eigen::VectorXd &sv =
      svd.singularValues();  // Get sorted singular values as vector
  const Eigen::Index n = sv.size();
  Eigen::Index r = 0;
  // Test relative size of singular values
  while ((r < n) && (sv(r) >= sv(0) * tol)) {
    r++;
  }
  return r;
}

E IGEN offers an equivalent built-in method rank() for objects representing singular value decomposi-
tions:

C++-code 3.4.2.6: Using rank() in E IGEN ➺ GITLAB


// Computation of the numerical rank of a matrix by means of SVD
Eigen::Index rank_eigen(const Eigen::MatrixXd &A, double tol = EPS) {
  return A.jacobiSvd().setThreshold(tol).rank();
}

The method setThreshold() passes tol from (3.4.2.4) to rank(). y
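
A small usage sketch for the two functions above (assuming they are in scope; the test matrix has exact rank 2, and an explicit tolerance is passed instead of the default EPS):

#include <Eigen/Dense>
#include <iostream>

int main() {
  Eigen::MatrixXd A(4, 3);
  A << 1, 2, 3,
       2, 4, 6,   // multiple of the first row
       0, 1, 1,
       1, 3, 4;   // sum of rows 1 and 3
  std::cout << "rank_by_svd: " << rank_by_svd(A, 1e-12) << std::endl;  // expected: 2
  std::cout << "rank_eigen : " << rank_eigen(A, 1e-12) << std::endl;   // expected: 2
  return 0;
}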

EXAMPLE 3.4.2.7 (Computation of nullspace and image space of matrices) “Computing” a subspace
of R k amounts to making available a (stable) basis of that subspace, ideally an orthonormal basis.
Lemma 3.4.1.13 taught us how to glean orthonormal bases of N (A) and R(A) from the SVD of a matrix
A. This immediately gives a numerical method and its implementation is given in the next two codes.

C++-code 3.4.2.8: ONB of N (A) through SVD ➺ GITLAB


// Computation of an ONB of the kernel of a matrix
Eigen::MatrixXd nullspace(const Eigen::MatrixXd &A, double tol = EPS) {
  using index_t = Eigen::Index;
  Eigen::JacobiSVD<Eigen::MatrixXd> svd(A, Eigen::ComputeFullV);
  const index_t r = svd.setThreshold(tol).rank();
  // Rightmost columns of V provide ONB of N(A)
  Eigen::MatrixXd Z = svd.matrixV().rightCols(A.cols() - r);
  return Z;
}

C++-code 3.4.2.9: ONB of R(A) through SVD ➺ GITLAB


// Computation of an ONB of the image space of a matrix
Eigen::MatrixXd rangespace(const Eigen::MatrixXd &A, double tol = EPS) {
  using index_t = Eigen::Index;
  Eigen::JacobiSVD<Eigen::MatrixXd> svd(A, Eigen::ComputeThinU);
  const index_t r = svd.setThreshold(tol).rank();
  // r leftmost columns of U provide ONB of R(A)
  return svd.matrixU().leftCols(r);
}

y
Review question(s) 3.4.2.10 (SVD in E IGEN)
(Q3.4.2.10.A) Please examine Code 3.4.2.9 and detect a potentially serious loss of efficiency. In which
situations will this have an impact?


3.4.3 Solving General Least-Squares Problems by SVD

Video tutorial for Section 3.4.3 "Solving General Least-Squares Problems by SVD": (14
minutes) Download link, tablet notes

→ review questions 3.4.3.17

In a similar fashion as explained for the QR-decomposition in Section 3.3.4, the singular value decomposition
(SVD, → Def. 3.4.1.3) can be used to transform general linear least squares problems (3.1.3.7) into a
simpler form. In the case of SVD-based orthogonal transformation methods this simpler form involves
merely a diagonal matrix.
Here we consider the most general setting

Ax = b ∈ R m with A ∈ R m,n , rank(A) = r ≤ min{m, n} .

In particular, we drop the assumption of full rank of A. This means that the minimum norm condition (ii) in
the definition (3.1.3.7) of a linear least squares problem may be required for singling out a unique solution.
We recall the (full) SVD of A ∈ R^{m,n}:
$$A = [U_1\;U_2]\begin{bmatrix}\Sigma_r & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix}V_1^\top\\ V_2^\top\end{bmatrix} \tag{3.4.3.1}$$
with U₁ ∈ R^{m,r}, U₂ ∈ R^{m,m−r} with orthonormal columns, U := [U₁, U₂] unitary,
Σ_r = diag(σ₁, …, σ_r) ∈ R^{r,r} (singular values, Def. 3.4.1.3),
V₁ ∈ R^{n,r}, V₂ ∈ R^{n,n−r} with orthonormal columns, V := [V₁, V₂] unitary.
We can proceed in two different ways, both of which we elaborate in the sequel.

Approach ➊: We can use the invariance of the 2-norm of a vector with respect to multiplication with
U := [U₁, U₂], see Thm. 3.3.2.2, together with the fact that U is unitary, U^{-1} = U^⊤, see Def. 6.3.1.2:
$$[U_1,\,U_2]\cdot\begin{bmatrix}U_1^\top\\ U_2^\top\end{bmatrix} = I\,.$$
$$\|Ax-b\|_2 = \left\|[U_1\;U_2]\begin{bmatrix}\Sigma_r & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix}V_1^\top\\ V_2^\top\end{bmatrix}x - b\right\|_2 = \left\|\begin{bmatrix}\Sigma_rV_1^\top x\\ 0\end{bmatrix} - \begin{bmatrix}U_1^\top b\\ U_2^\top b\end{bmatrix}\right\|_2 \tag{3.4.3.2}$$
We follow the same strategy as in the case of QR-based solvers for full-rank linear least squares problems:
we choose x such that the first r components of $\begin{bmatrix}\Sigma_rV_1^\top x\\ 0\end{bmatrix} - \begin{bmatrix}U_1^\top b\\ U_2^\top b\end{bmatrix}$ vanish:

➣ (possibly underdetermined) r × n linear system  Σ_r V₁^⊤x = U₁^⊤b .  (3.4.3.3)

To fix a unique solution in the case r < n we appeal to the minimal norm condition in (3.1.3.7): by the
considerations of § 3.1.3.3, the solution x of (3.4.3.3) is unique up to contributions from
$$\mathcal{N}(V_1^\top) \overset{\text{Lemma 3.1.2.12}}{=} \mathcal{R}(V_1)^\perp \overset{\text{orthonormality}}{=} \mathcal{R}(V_2)\,. \tag{3.4.3.4}$$
Since V is unitary, the minimal norm solution is obtained by setting contributions from R(V₂) to zero,
which amounts to choosing x ∈ R(V₁), that is, x = V₁z for some z ∈ R^r. This converts (3.4.3.3) into
$$\Sigma_r\underbrace{V_1^\top V_1}_{=I}z = U_1^\top b \;\Rightarrow\; z = \Sigma_r^{-1}U_1^\top b\,.$$
generalized solution → Def. 3.1.3.1:
$$x^\dagger = V_1\Sigma_r^{-1}U_1^\top b\,,\qquad \|r\|_2 = \bigl\|U_2^\top b\bigr\|_2\,. \tag{3.4.3.5}$$

Approach ➋: From Thm. 3.1.2.1 we know that the generalized least-squares solution x† of Ax = b solves
the normal equations (3.1.2.2), and in § 3.1.3.3 we saw that x† lies in the orthogonal complement of
N ( A ):

A⊤ Ax† = A⊤ b , x† ∈ N (A)⊥ . (3.4.3.6)

By Lemma 3.4.1.13 and using the notations from (3.4.3.1) together with the fact that the columns of V
form an orthonormal basis of R n :

N (A) = R(V2 ) ⇔ N (A)⊥ = R(V1 ) . (3.4.3.7)

Hence, we can write x† = V1 y for some y ∈ Rr . We plug this representation into the normal equations
and also multiply with V1⊤ , similar to what we did in § 3.1.3.3:

V1⊤ A⊤ AV1 y = V1⊤ A⊤ b . (3.4.3.8)

Next, insert the SVD of A:
$$V_1^\top V\Sigma^\top\underbrace{U^\top U}_{=I}\Sigma V^\top V_1\,y = V_1^\top V\Sigma^\top U^\top b\,. \tag{3.4.3.9}$$
Then we switch to the block-partitioned form as in (3.4.3.1):
$$\underbrace{V_1^\top[V_1,V_2]}_{\in\mathbb{R}^{r,n}}\begin{bmatrix}\Sigma_r & O\\ O & O\end{bmatrix}^\top\begin{bmatrix}\Sigma_r & O\\ O & O\end{bmatrix}\underbrace{\begin{bmatrix}V_1^\top\\ V_2^\top\end{bmatrix}V_1}_{\in\mathbb{R}^{n,r}}\,y = V_1^\top[V_1,V_2]\begin{bmatrix}\Sigma_r & O\\ O & O\end{bmatrix}^\top\begin{bmatrix}U_1^\top\\ U_2^\top\end{bmatrix}b \tag{3.4.3.10}$$
$$\Updownarrow$$
$$[I,\,O]\begin{bmatrix}\Sigma_r^2 & O\\ O & O\end{bmatrix}\begin{bmatrix}I\\ O\end{bmatrix}y = [I,\,O]\begin{bmatrix}\Sigma_r & O\\ O & O\end{bmatrix}\begin{bmatrix}U_1^\top\\ U_2^\top\end{bmatrix}b \tag{3.4.3.11}$$
$$\Updownarrow$$
$$\Sigma_r^2\,y = \Sigma_r U_1^\top b\,. \tag{3.4.3.12}$$

Cancelling the invertible matrix Σr on both sides yields formula (3.4.3.5).


We remind that In a practical implementation, as in Code 3.4.2.5, one has to resort to the numerical rank
from (3.4.2.4):
r = max{i: σi /σ1 > tol} ,
where we have assumed that the singular values σj are sorted according to decreasing modulus.

C++-code 3.4.3.13: Computing generalized solution of Ax = b via SVD ➺ GITLAB


Eigen::VectorXd lsqsvd(const Eigen::MatrixXd &A, const Eigen::VectorXd &b) {
  // Compute economical SVD, compare Code 3.4.2.1
  const Eigen::JacobiSVD<MatrixXd> svd(A, Eigen::ComputeThinU | Eigen::ComputeThinV);
  const Eigen::VectorXd &sv = svd.singularValues();
  const unsigned int r = svd.rank();  // Numerical rank, default tolerance
  const MatrixXd &U = svd.matrixU();
  const MatrixXd &V = svd.matrixV();
  // x^dagger = V_1 * Sigma_r^{-1} * U_1^H * b, see (3.4.3.5)
  return V.leftCols(r) * (sv.head(r).cwiseInverse().asDiagonal() *
                          (U.leftCols(r).adjoint() * b));
}

The solve() method directly returns the generalized solution

C++-code 3.4.3.14: Computing generalized solution of Ax = b via SVD ➺ GITLAB


Eigen::VectorXd lsqsvd_eigen(const Eigen::MatrixXd &A, const Eigen::VectorXd &b) {
  const Eigen::JacobiSVD<MatrixXd> svd(A, Eigen::ComputeThinU | Eigen::ComputeThinV);
  return svd.solve(b);
}

Remark 3.4.3.15 (Pseudoinverse and SVD → [Han02, Ch. 12], [DR08, Sect. 4.7]) From Thm. 3.1.3.6
we could conclude a general formula for the Moore-Penrose pseudoinverse of any matrix A ∈ R m,n . Now,
the solution formula (3.4.3.5) directly yields a concrete incarnation of the pseudoinverse A+ .

Theorem 3.4.3.16. Pseudoinverse and SVD

If A ∈ K m,n with rank(A) = r has the full singular value decomposition A = UΣVH (→
Thm. 3.4.1.1) partitioned as in (3.4.3.1), then its Moore-Penrose pseudoinverse (→ Thm. 3.1.3.6)
is given by A^† = V₁Σ_r^{-1}U₁^H.

y
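
Based on Thm. 3.4.3.16, a dense Moore-Penrose pseudoinverse can be assembled directly from the economical SVD. The following is a sketch; forming A⁺ explicitly is rarely advisable (cf. Code 3.4.3.13), and the function name is an illustrative choice:

#include <Eigen/Dense>

// Sketch: dense Moore-Penrose pseudoinverse A^+ = V_1*Sigma_r^{-1}*U_1^H
Eigen::MatrixXd pseudoinverse(const Eigen::MatrixXd &A) {
  const Eigen::JacobiSVD<Eigen::MatrixXd> svd(
      A, Eigen::ComputeThinU | Eigen::ComputeThinV);
  const Eigen::VectorXd &sv = svd.singularValues();
  const Eigen::Index r = svd.rank();  // numerical rank, default threshold
  return svd.matrixV().leftCols(r) *
         sv.head(r).cwiseInverse().asDiagonal() *
         svd.matrixU().leftCols(r).adjoint();
}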
Review question(s) 3.4.3.17 (Solving general least-squares problems by SVD)
(Q3.4.3.17.A) Discuss the efficient implementation of a C++ function
Eigen::VectorXd solveRankOneLsq( const Eigen::VectorXd &u,
const Eigen::VectorXd &v, const Eigen::VectorXd &b);

that returns the general least squares solution of Ax = b for the rank-1 matrix A := uv⊤ , u ∈ R m ,
v ∈ R n , m ≥ n.


3.4.4 SVD-Based Optimization and Approximation


For the general least squares problem (3.1.3.7) we have seen the use of the SVD for its numerical solution in
Section 3.4.3. There the SVD was a powerful tool for solving a minimization problem for a 2-norm. In many
other contexts the SVD is also a key component in numerical optimization.

3.4.4.1 Norm-Constrained Extrema of Quadratic Forms

Video tutorial for Section 3.4.4.1 "Norm-Constrained Extrema of Quadratic Forms": (11 min-
utes) Download link, tablet notes

→ review questions 3.4.4.13

We consider the following problem of finding the extrema of quadratic forms on the Euclidean unit sphere
{ x ∈ K n : k x k2 = 1}:

given A ∈ K m,n , m ≥ n, find x ∈ K n , kxk2 = 1 , kAxk2 → min . (3.4.4.1)

Use that multiplication with orthogonal/unitary matrices preserves the 2-norm (→ Thm. 3.3.2.2) and resort
to the (full) singular value decomposition A = UΣV^H (→ Def. 3.4.1.3):
$$\min_{\|x\|_2=1}\|Ax\|_2^2 = \min_{\|x\|_2=1}\bigl\|U\Sigma V^Hx\bigr\|_2^2 = \min_{\|V^Hx\|_2=1}\bigl\|U\Sigma(V^Hx)\bigr\|_2^2 \overset{[\,y=V^Hx\,]}{=} \min_{\|y\|_2=1}\|\Sigma y\|_2^2 = \min_{\|y\|_2=1}\bigl(\sigma_1^2y_1^2+\cdots+\sigma_n^2y_n^2\bigr) \ge \sigma_n^2\,.$$
Since the singular values are sorted as σ₁ ≥ σ₂ ≥ ··· ≥ σ_n, the minimum with value σ_n² is attained for
y_n² = 1 and y₁ = ··· = y_{n−1} = 0, that is, V^Hx = y = e_n (≙ n-th Cartesian basis vector of R^n).
⇒ minimizer x* = Ve_n = (V)_{:,n}, minimal value ‖Ax*‖₂ = σ_n.

C++ code 3.4.4.2: Solving (3.4.4.1) with E IGEN ➺ GITLAB


// Eigen-based function for solving (3.4.4.1);
// minimizer returned in x, minimum as return value
double minconst(VectorXd &x, const MatrixXd &A) {
  const Eigen::Index m = A.rows();
  const Eigen::Index n = A.cols();
  if (m < n) {
    throw std::runtime_error("A must be tall matrix");
  }
  // SVD factor U is not computed!
  const Eigen::JacobiSVD<MatrixXd> svd(A, Eigen::ComputeThinV);
  x.resize(n);
  x.setZero();
  x(n - 1) = 1.0;  // e_n
  x = svd.matrixV() * x;
  return (svd.singularValues())(n - 1);
}

By similar arguments we can solve the corresponding norm constrained maximization problem

given A ∈ K m,n , m ≥ n, find x ∈ K n , kxk2 = 1 , kAxk2 → max ,

and obtain the solution based on the SVD A = UΣVH of A:

$$\sigma_1 = \max_{\|x\|_2=1}\|Ax\|_2\,,\qquad (V)_{:,1} = \operatorname*{argmax}_{\|x\|_2=1}\|Ax\|_2\,. \tag{3.4.4.3}$$


Recall: The Euclidean matrix norm (2-norm) of the matrix A (→ Def. 1.5.5.10) is defined as the maximum
in (3.4.4.3). Thus we have proved the following theorem:

Lemma 3.4.4.4. SVD and Euclidean matrix norm

If A ∈ K m,n has singular values σ1 ≥ σ2 ≥ · · · ≥ σp ≥ 0, p := min{m, n}, then its Euclidean


matrix norm is given by kAk2 = σ1 (A).
If m = n and A is regular/invertible, then its 2-norm condition number is cond2 (A) = σ1 /σn .
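
A minimal sketch exploiting Lemma 3.4.4.4 (function name chosen for illustration):

#include <Eigen/Dense>
#include <utility>

// Euclidean matrix norm and 2-norm condition number of a square regular
// matrix, computed from its singular values.
std::pair<double, double> norm2_and_cond2(const Eigen::MatrixXd &A) {
  const Eigen::VectorXd sv = A.jacobiSvd().singularValues();  // sigma_1 >= ... >= sigma_n
  const double norm2 = sv(0);                      // ||A||_2 = sigma_1
  const double cond2 = sv(0) / sv(sv.size() - 1);  // cond_2(A) = sigma_1/sigma_n
  return {norm2, cond2};
}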

EXAMPLE 3.4.4.5 (Fit of hyperplanes) For an important application from computational geometry, this
example studies the power and versatility of orthogonal transformations in the context of (generalized)
least squares minimization problems.
From school recall the Hesse normal form of a hyperplane H (= affine subspace of dimension d − 1) in
Rd :
H = { x ∈ R d : c + n ⊤ x = 0} , n ∈ R d , k n k2 = 1 . (3.4.4.6)

where n is the unit normal to H and |c| gives the distance of H from 0. The Hesse normal form is
convenient for computing the distance of points from H, because the

Euclidean distance of y ∈ R d from the plane is dist(H, y) = |c + n⊤ y| , (3.4.4.7)

Goal: given the point coordinate vectors y₁, …, y_m ∈ R^d, m > d, find H ↔ {c ∈ R, n ∈ R^d,
‖n‖₂ = 1} such that
$$\sum_{j=1}^{m}\operatorname{dist}(H,y_j)^2 = \sum_{j=1}^{m}\bigl|c+n^\top y_j\bigr|^2 \;\to\;\min\,. \tag{3.4.4.8}$$

Note that (3.4.4.8) is not a linear least squares problem due to the constraint ‖n‖₂ = 1. However, it turns
out to be a minimization problem with almost the structure of (3.4.4.1) (with y_{k,ℓ} := (y_k)_ℓ):
$$(3.4.4.8)\;\Longleftrightarrow\;\left\|\underbrace{\begin{bmatrix}1 & y_{1,1} & \cdots & y_{1,d}\\ 1 & y_{2,1} & \cdots & y_{2,d}\\ \vdots & \vdots & & \vdots\\ 1 & y_{m,1} & \cdots & y_{m,d}\end{bmatrix}}_{=:A}\underbrace{\begin{bmatrix}c\\ n_1\\ \vdots\\ n_d\end{bmatrix}}_{=:x}\right\|_2 \to \min\quad\text{under the constraint } \|n\|_2 = 1\,.$$
Note that the solution component c is not subject to the constraint. One is tempted to use this freedom to
make one component of Ax vanish, but which one is not clear. This is why we need another preparatory
step.
Step ➊: To convert the minimization problem into the form (3.4.4.1) we start with a QR-decomposition
(→ Section 3.3.3)
$$A := \begin{bmatrix}1 & y_{1,1} & \cdots & y_{1,d}\\ 1 & y_{2,1} & \cdots & y_{2,d}\\ \vdots & \vdots & & \vdots\\ 1 & y_{m,1} & \cdots & y_{m,d}\end{bmatrix} = QR\,,\qquad R := \begin{bmatrix}r_{11} & r_{12} & \cdots & r_{1,d+1}\\ 0 & r_{22} & \cdots & r_{2,d+1}\\ \vdots & & \ddots & \vdots\\ 0 & \cdots & 0 & r_{d+1,d+1}\\ 0 & \cdots & \cdots & 0\\ \vdots & & & \vdots\\ 0 & \cdots & \cdots & 0\end{bmatrix} \in\mathbb{R}^{m,d+1}\,.$$
$$\|Ax\|_2\to\min \;\Longleftrightarrow\; \|Rx\|_2 = \left\|R\begin{bmatrix}c\\ n_1\\ \vdots\\ n_d\end{bmatrix}\right\|_2 \to\min\,. \tag{3.4.4.9}$$

Step ➋: Note that, if n is the solution of (3.4.4.8), then necessarily (why?)
$$c\cdot r_{11} + n_1\cdot r_{12} + \cdots + n_d\cdot r_{1,d+1} = 0\,.$$
This insight converts (3.4.4.9) to
$$\left\|\begin{bmatrix}r_{22} & r_{23} & \cdots & r_{2,d+1}\\ 0 & r_{33} & \cdots & r_{3,d+1}\\ \vdots & & \ddots & \vdots\\ 0 & \cdots & 0 & r_{d+1,d+1}\end{bmatrix}\begin{bmatrix}n_1\\ \vdots\\ n_d\end{bmatrix}\right\|_2 \to \min\,,\qquad \|n\|_2 = 1\,. \tag{3.4.4.10}$$
(3.4.4.10) is now a problem of type (3.4.4.1), minimization on the Euclidean sphere. Hence, (3.4.4.10)
can be solved using the SVD-based algorithm implemented in Code 3.4.4.2.

Note: Since $r_{11} = \|(A)_{:,1}\|_2 = \sqrt{m} \neq 0$, $c = -r_{11}^{-1}\sum_{j=1}^{d} r_{1,j+1}\,n_j$ can always be computed.

This algorithm is implemented as case p==dim+1 in the following code, making heavy use of E IGEN’s
block access operations and the built-in QR-decomposition and SVD factorization.

C++-code 3.4.4.11: (Generalized) distance fitting of a hyperplane: solution of (3.4.4.12) ➺ GITLAB

// Solves constrained linear least squares problem
// (3.4.4.12) with dim passing d
std::pair<Eigen::VectorXd, Eigen::VectorXd> clsq(const MatrixXd& A,
                                                 const unsigned dim) {
  const unsigned p = A.cols();
  unsigned m = A.rows();
  if (p < dim + 1) {
    throw runtime_error("not enough unknowns");
  }
  if (m < dim) {
    throw runtime_error("not enough equations");
  }
  m = std::min(m, p);  // Number of variables
  // First step: orthogonal transformation, see Code 3.3.4.1
  MatrixXd R = A.householderQr().matrixQR().template triangularView<Eigen::Upper>();
  // compute matrix V from SVD decomposition of R, solve (3.4.4.10)
  MatrixXd V = R.block(p - dim, p - dim, m + dim - p, dim)
                   .jacobiSvd(Eigen::ComputeFullV)
                   .matrixV();
  const VectorXd n = V.col(dim - 1);  // Norm-constrained part of solution vector
  // Compute free part of solution vector
  const auto R_topleft = R.topLeftCorner(p - dim, p - dim);
  // Check for singular matrix
  const auto R_diag = R_topleft.diagonal().cwiseAbs();
  if (R_diag.minCoeff() < (numeric_limits<double>::epsilon()) * R_diag.maxCoeff()) {
    throw runtime_error("Upper left block of R not regular");
  }
  const VectorXd c = -(R_topleft.template triangularView<Eigen::Upper>())
                          .solve(R.block(0, p - dim, p - dim, dim) * n);
  return {c, n};
}

Note that Code 3.4.4.11 solves the general problem: For A ∈ K^{m,n} find n ∈ R^d, c ∈ R^{n−d} such that
$$\left\|A\begin{bmatrix}c\\ n\end{bmatrix}\right\|_2 \to \min\qquad\text{with constraint}\quad \|n\|_2 = 1\,. \tag{3.4.4.12}$$
y
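
A possible usage sketch for Code 3.4.4.11, assuming clsq() is in scope and the m point coordinate vectors are stored in the rows of a matrix Y (with d = Y.cols()):

#include <Eigen/Dense>
#include <iostream>

// Fit a hyperplane {x : c + n^T*x = 0} to the points given by the rows of Y.
void fit_hyperplane(const Eigen::MatrixXd &Y) {
  const unsigned d = Y.cols();
  Eigen::MatrixXd A(Y.rows(), d + 1);
  A.col(0).setOnes();       // leading column of ones, cf. (3.4.4.8)
  A.rightCols(d) = Y;       // point coordinates
  const auto cn = clsq(A, d);  // first: offset c, second: unit normal n
  std::cout << "c = " << cn.first.transpose()
            << ", n = " << cn.second.transpose() << std::endl;
}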
Review question(s) 3.4.4.13 (Norm-Constrained Extrema of Quadratic Forms)
(Q3.4.4.13.A) Let M ∈ R n,n be symmetric and positive definite (s.p.d.) and A ∈ R m,n . Devise an algo-
rithm for computing

$$\operatorname*{argmax}_{x\in B}\|Ax\|\,,\qquad B := \{x\in\mathbb{R}^n : x^\top Mx = 1\}\,,$$
also based on the SVD of M.


3.4.4.2 Best Low-Rank Approximation

Video tutorial for Section 3.4.4.2 "Best Low-Rank Approximation": (13 minutes)
Download link, tablet notes

→ review questions 3.4.4.25

§3.4.4.14 (Low-rank matrix compression) Matrix compression addresses the problem of approximating
a given "generic" matrix (of a certain class) by a matrix whose "information content", that is, the
number of reals needed to store it, is significantly lower than that of the original matrix.
Sparse matrices (→ Notion 2.7.0.1) are a prominent class of matrices with "low information content".
Unfortunately, they cannot approximate dense matrices very well. Another type of matrices that enjoy "low
information content", also called data sparse, are low-rank matrices.

Lemma 3.4.4.15.

If A ∈ R m,n has rank r ≤ min{m, n} (→ Def. 2.2.1.3), then there exist X ∈ R m,r and Y ∈ R n,r ,
such that A = XY⊤ .

Proof. The lemma is a straightforward consequence of Lemma 3.4.1.13 and (3.4.1.14): If A = UΣV⊤ is
the SVD of A, then choose

X := (U):,1:r (Σ)1:r,1:r , Y = (V):,1:r .


None of the columns of U and V can vanish. Hence, in addition, we may assume that the columns of U
are normalized: ‖(U)_{:,j}‖₂ = 1, j = 1, …, r.

It takes only r (m + n − r ) real numbers to store A ∈ R m,n with rank(A) = r.

Thus approximating a given matrix A ∈ R m,n with a rank-r matrix, r ≪ min{m, n}, can be regarded as
an instance of matrix compression. The approximation error with respect to some matrix norm k·k will be
minimal if we choose the best approximation

Ar := argmin{kA − Bk : B ∈ R m,n , rank(B) = r } , 1 ≤ r ≤ min{m, n} . (3.4.4.16)

y
Here we explore low-rank best approximation of general matrices with respect to the Euclidean matrix
norm k·k2 induced by the 2-norm for vectors (→ Def. 1.5.5.10), and the Frobenius norm k·k F .

Definition 3.4.4.17. Frobenius norm


The Frobenius norm of A ∈ K^{m,n} is defined as
$$\|A\|_F^2 := \sum_{i=1}^{m}\sum_{j=1}^{n}|a_{ij}|^2\,.$$

It should be obvious that ‖A‖_F is invariant under orthogonal/unitary transformations of A. Thus the
Frobenius norm of a matrix A, rank(A) = r, can be expressed through its singular values σ_j:

Frobenius norm and SVD:  ‖A‖_F² = ∑_{j=1}^{r} σ_j²   (3.4.4.18)

✎ notation: Rr (m, n) := {A ∈ K m,n : rank(A) ≤ r }, m, n, r ∈ N

The next profound result links best approximation in Rr (m, n) and the singular value decomposition (→
Def. 3.4.1.3).

Theorem 3.4.4.19. Best low rank approximation → [Gut09, Thm. 11.6]

Let A = UΣV^H be the SVD of A ∈ K^{m,n} (→ Thm. 3.4.1.1). For 1 ≤ k ≤ rank(A) set
$$A_k := U_k\Sigma_kV_k^H = \sum_{\ell=1}^{k}\sigma_\ell\,(U)_{:,\ell}(V)_{:,\ell}^H \quad\text{with}\quad \begin{aligned}U_k &:= \bigl[(U)_{:,1},\ldots,(U)_{:,k}\bigr]\in\mathbb{K}^{m,k},\\ V_k &:= \bigl[(V)_{:,1},\ldots,(V)_{:,k}\bigr]\in\mathbb{K}^{n,k},\\ \Sigma_k &:= \operatorname{diag}(\sigma_1,\ldots,\sigma_k)\in\mathbb{K}^{k,k}.\end{aligned}$$
Then, for both ‖·‖ = ‖·‖_F and ‖·‖ = ‖·‖₂, it holds that
$$\|A-A_k\| \le \|A-F\|\qquad \forall\,F\in\mathcal{R}_k(m,n)\,,$$
that is, A_k is the rank-k best approximation of A in the matrix norms ‖·‖_F and ‖·‖₂.

This theorem teaches us that the rank-k matrix that is closest to A (rank-k best approximation) in both
the Euclidean matrix norm and the Frobenius norm (→ Def. 3.4.4.17) can be obtained by truncating the
rank-1 sum expansion (3.4.1.9) obtained from the SVD of A after k terms.


Proof. (of Thm. 3.4.4.19) As in the statement of the theorem write A_k = U_kΣ_kV_k^H for the truncated SVD.
Obviously, since (here shown for m ≥ n)
$$A - A_k = U\begin{bmatrix}O_{k,k} & O_{k,n-k}\\ O_{n-k,k} & \operatorname{diag}(\sigma_{k+1},\ldots,\sigma_n)\\ O_{m-n,k} & O_{m-n,n-k}\end{bmatrix}V^H\,,$$
and both matrix norms are invariant under multiplication with orthogonal matrices, we conclude
$$\operatorname{rank}A_k = k \quad\text{and}\quad \|A-A_k\| = \|\Sigma-\Sigma_k\| = \begin{cases}\sigma_{k+1}, & \text{for } \|\cdot\| = \|\cdot\|_2,\\ \sqrt{\sigma_{k+1}^2+\cdots+\sigma_r^2}, & \text{for } \|\cdot\| = \|\cdot\|_F.\end{cases}$$

➊ First we tackle the Euclidean matrix norm ‖·‖ = ‖·‖₂. For the sake of brevity we write v_j := (V)_{:,j},
u_j := (U)_{:,j} for the columns of the SVD-factors V and U, respectively. Pick B ∈ K^{m,n}, rank B = k. Then
$$\dim\mathcal{N}(B) = n-k \;\Rightarrow\; \mathcal{N}(B)\cap\operatorname{Span}\{v_1,\ldots,v_{k+1}\} \neq \{0\}\,.$$
For x ∈ N(B) ∩ Span{v₁, …, v_{k+1}}, ‖x‖₂ = 1, we have an expansion into columns of V:
$x = \sum_{j=1}^{k+1}(v_j^Hx)\,v_j$. We use this, the fact that x is a unit vector, and the definition of the
Euclidean matrix norm:
$$\|A-B\|_2^2 \ge \|(A-B)x\|_2^2 = \|Ax\|_2^2 = \Bigl\|\sum_{j=1}^{k+1}\sigma_j(v_j^Hx)\,u_j\Bigr\|_2^2 = \sum_{j=1}^{k+1}\sigma_j^2\,|v_j^Hx|^2 \ge \sigma_{k+1}^2\,,$$
because $\sum_{j=1}^{k+1}|v_j^Hx|^2 = \|x\|_2^2 = 1$.
j =1

➋ Now we turn to the Frobenius norm ‖·‖_F. We assume that B ∈ K^{m,n}, rank(B) = k < min{m,n},
minimizes ‖A − F‖_F among all rank-k matrices F ∈ K^{m,n}. We have to show that B coincides with the
truncated SVD of A: B = A_k.
The trick is to consider the full SVD of B:
$$B = U_B\begin{bmatrix}\Sigma_B & O\\ O & O\end{bmatrix}V_B^H\,,\qquad \begin{aligned}&U_B\in\mathbb{K}^{m,m}\ \text{unitary},\\ &\Sigma_B\in\mathbb{R}^{k,k}\ \text{diagonal},\\ &V_B\in\mathbb{K}^{n,n}\ \text{unitary}.\end{aligned}$$
Generically, we can write
$$U_B^HAV_B = \begin{bmatrix}L+D+R & X_{12}\\ X_{21} & X_{22}\end{bmatrix}\,,\qquad \begin{aligned}&L\in\mathbb{K}^{k,k}\ \text{strictly lower triangular},\\ &D\in\mathbb{K}^{k,k}\ \text{diagonal},\\ &R\in\mathbb{K}^{k,k}\ \text{strictly upper triangular},\end{aligned}$$
with some matrices X₁₂ ∈ K^{k,n−k}, X₂₁ ∈ K^{m−k,k}, X₂₂ ∈ K^{m−k,n−k}. Then we introduce two m × n
rank-k matrices:
$$C_1 := U_B\begin{bmatrix}L+\Sigma_B+R & X_{12}\\ O & O\end{bmatrix}V_B^H\,,\qquad C_2 := U_B\begin{bmatrix}L+\Sigma_B+R & O\\ X_{21} & O\end{bmatrix}V_B^H\,.$$


Since ‖A − B‖_F is minimal, we conclude from the invariance of the Frobenius norm under orthogonal
transformations
$$\|A-C_1\|_F^2 \ge \|A-B\|_F^2 = \|A-C_1\|_F^2 + \|L\|_F^2 + \|R\|_F^2 + \|X_{12}\|_F^2\,,$$
$$\|A-C_2\|_F^2 \ge \|A-B\|_F^2 = \|A-C_2\|_F^2 + \|L\|_F^2 + \|R\|_F^2 + \|X_{21}\|_F^2\,.$$
Obviously, this implies L = O, R = O, X₁₂ = O, and X₂₁ = O, which means that
$$A = U_B\begin{bmatrix}D & O\\ O & X_{22}\end{bmatrix}V_B^H\,,\qquad D\in\mathbb{K}^{k,k}\ \text{diagonal}.$$
Write D = diag(d₁, …, d_k). Then we have, with r := rank(A),
$$\|A\|_F^2 = \sigma_1^2+\cdots+\sigma_r^2 = d_1^2+\cdots+d_k^2 + \|X_{22}\|_F^2\,,\qquad \|A-B\|_F^2 = \|D-\Sigma_B\|_F^2 + \|X_{22}\|_F^2\,.$$
Hence, by the minimizer property of B,
1. Σ_B = D, because any other choice makes ‖A − B‖_F bigger,
2. ‖X₂₂‖_F must be minimal, which entails choosing d_j = σ_j, j = 1, …, k.
$$U_B^HAV_B = \begin{bmatrix}\operatorname{diag}(\sigma_1,\ldots,\sigma_k) & O\\ O & X_{22}\end{bmatrix}\,.$$
This is possible only if the k leftmost columns of both U_B and V_B agree with those of the corresponding
SVD-factors of A, which means B = A_k. ✷

The following code computes the low-rank best approximation of a dense matrix in E IGEN.

C++ code 3.4.4.20: SVD-based low-rank matrix compression, ➺ GITLAB


Eigen::MatrixXd lowrankbestapprox(const Eigen::MatrixXd &A, unsigned int k) {
  // Compute economical SVD, compare Code 3.4.2.1
  const Eigen::JacobiSVD<MatrixXd> svd(
      A, Eigen::ComputeThinU | Eigen::ComputeThinV);
  // Form matrix product Uk * Σk * Vk^T.
  // Extract Σk as diagonal matrix of largest k singular
  // values. EIGEN provides singular values in decreasing order!
  return (svd.matrixU().leftCols(k)) *
         (svd.singularValues().head(k).asDiagonal()) *
         (svd.matrixV().leftCols(k).transpose());
}

§3.4.4.21 (Error of low-rank best approximation of a matrix) Since the matrix norms ‖·‖_2 and ‖·‖_F are
invariant under multiplication with orthogonal (unitary) matrices, we immediately obtain expressions for the
norms of the best approximation error:
\[
\text{In Euclidean matrix norm:}\quad \big\| A - U_k \Sigma_k V_k^H \big\|_2 = \sigma_{k+1} , \tag{3.4.4.22}
\]
\[
\text{in Frobenius norm:}\quad \big\| A - U_k \Sigma_k V_k^H \big\|_F^2 = \sum_{j=k+1}^{\min\{m,n\}} \sigma_j^2 . \tag{3.4.4.23}
\]


This provides precise information about the best approximation error for rank-k matrices. In particular, the
decay of the singular values of the matrix governs the convergence of the rank-k best approximation error
as k increases. y
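The identities (3.4.4.22) and (3.4.4.23) can easily be checked numerically. The following snippet is a small sketch (not part of the lecture codes; the function name checkLowRankError is made up for illustration) that compares the error norms of the rank-k best approximation of a random matrix with the corresponding singular values; it assumes k < min{m, n}.

#include <Eigen/Dense>
#include <iostream>

// Sketch: numerically verify (3.4.4.22)/(3.4.4.23) for a random matrix A.
void checkLowRankError(unsigned int m, unsigned int n, unsigned int k) {
  const Eigen::MatrixXd A = Eigen::MatrixXd::Random(m, n);
  const Eigen::JacobiSVD<Eigen::MatrixXd> svd(
      A, Eigen::ComputeThinU | Eigen::ComputeThinV);
  const Eigen::VectorXd sv = svd.singularValues(); // decreasing order
  // Rank-k best approximation A_k = U_k * Sigma_k * V_k^T
  const Eigen::MatrixXd Ak = svd.matrixU().leftCols(k) *
                             sv.head(k).asDiagonal() *
                             svd.matrixV().leftCols(k).transpose();
  const Eigen::MatrixXd E = A - Ak;
  // Euclidean matrix norm of E = its largest singular value
  const double err2 = E.jacobiSvd().singularValues()(0);
  const double errF = E.norm(); // Frobenius norm in EIGEN
  std::cout << "||A-Ak||_2   = " << err2 << "  vs  sigma_{k+1} = " << sv(k) << '\n'
            << "||A-Ak||_F^2 = " << errF * errF << "  vs  sum sigma_j^2 = "
            << sv.tail(sv.size() - k).squaredNorm() << '\n';
}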
EXAMPLE 3.4.4.24 (Image compression) A rectangular greyscale image composed of m × n pixels
(greyscale, BMP format) can be regarded as a matrix A ∈ R m,n , (A)i,j ∈ {0, . . . , 255}, cf. Ex. 9.3.2.1.
Thus low-rank approximation of the image matrix is a way to compress the image.
Thm. 3.4.4.19 ➣ best rank-k approximation of image: Ã = U_k Σ_k V_k^⊤ .

Of course, the matrices U_k, V_k, and Σ_k are available from the economical (thin) SVD (3.4.1.5) of A.
[Figures: view of ETH Zurich main building (original image); compressed image, 40 singular values used; difference image |original − approximated|; singular values of the ETH view (log scale), with annotation "k = 40 (0.08 mem)"]
Note that there are better and faster ways to compress images than SVD (JPEG, Wavelets, etc.) y
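As a sketch of how Thm. 3.4.4.19 translates into image compression, assume the grey values of the image have already been loaded into an Eigen::MatrixXd (any image-I/O library can provide this; the function name compressImage below is chosen for illustration). Only the truncated SVD factors need to be stored, that is, k(m + n + 1) numbers instead of mn pixels; for an image of roughly 800 × 1200 pixels and k = 40 this gives a memory ratio of about 0.08, consistent with the annotation in the singular-value plot above.

#include <Eigen/Dense>
#include <tuple>

// Sketch: SVD-based image compression; the pixel grey values are assumed to
// be available in an Eigen::MatrixXd. Only the factors U_k, sigma_1..sigma_k,
// V_k are returned/stored, which needs k*(m+n+1) numbers instead of m*n.
std::tuple<Eigen::MatrixXd, Eigen::VectorXd, Eigen::MatrixXd>
compressImage(const Eigen::MatrixXd &img, unsigned int k) {
  const Eigen::JacobiSVD<Eigen::MatrixXd> svd(
      img, Eigen::ComputeThinU | Eigen::ComputeThinV);
  return {svd.matrixU().leftCols(k), svd.singularValues().head(k),
          svd.matrixV().leftCols(k)};
}
// To display the compressed image, form U_k * sigma.asDiagonal() * V_k^T and
// round/clamp the entries back to the integer range {0,...,255}.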
Review question(s) 3.4.4.25 (Best low-rank approximation)
(Q3.4.4.25.A) Show that for A ∈ R^{m,n} and any orthogonal Q ∈ R^{m,m}
\[
\| QA \|_F = \| A \|_F ,
\]


where k·k F is the Frobenius norm of a matrix.

Definition 3.4.4.17. Frobenius norm

The Frobenius norm of A ∈ K^{m,n} is defined as
\[
\| A \|_F^2 := \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2 .
\]

(Q3.4.4.25.B) Show that for any A ∈ R^{m,n} with singular values σ_j, j = 1, . . . , p, p := min{m, n}, it holds that
\[
\| A \|_F^2 = \sum_{j=1}^{p} \sigma_j^2 .
\]

(Q3.4.4.25.C) Sketch the implementation of a C++ function

    std::pair<Eigen::MatrixXd, Eigen::MatrixXd>
    lowRankApprox(const Eigen::MatrixXd &A, double tol);

that computes a matrix Ã ∈ R^{m,n} of minimal rank r ∈ {1, . . . , min{m, n}} such that
\[
\| A - \tilde{A} \|_2 \leq \text{tol} \cdot \| A \|_2 .
\]
The function should return Ã in factorized form Ã = XY^⊤ as a tuple of matrix factors X ∈ R^{m,r}, Y ∈ R^{n,r}.

(Q3.4.4.25.D) [Sum of squares of singular values] Given a matrix M ∈ R^{n,m}, m, n ∈ N, represented
by a C++ object M with an entry access operator

    double operator()(unsigned int i, unsigned int j) const;

write a C++ code snippet that computes the sum of the squares of the singular values of M.

Hint. Remember

Frobenius norm and SVD: ‖A‖_F^2 = ∑_{j=1}^{r} σ_j^2   (3.4.4.18)

(Q3.4.4.25.E) [Question (Q3.4.4.25.D) cnt'd: Sum of σ_i^4] Given a matrix M ∈ R^{n,m}, m, n ∈ N, represented by a C++ object M with an entry access operator

    double operator()(unsigned int i, unsigned int j) const;

draft a C++ code snippet that computes
\[
s := \sum_{i=1}^{p} \sigma_i^4 , \qquad p := \min\{m, n\} , \qquad \{\sigma_i\}_{i=1}^{p} \;\hat{=}\; \text{singular values of } M .
\]


3.4.4.3 Principal Component Data Analysis (PCA)

Video tutorial for Section 3.4.4.3 "Principal Component Data Analysis (PCA)": (28 minutes)
Download link, tablet notes

→ review questions 3.4.4.51

EXAMPLE 3.4.4.26 (Trend analysis) The objective is to extract information in the form of hidden “trends”
from data.
[Fig. 86: XETRA DAX 1.1.2008 – 29.10.2010; end-of-day stock prices (EUR) of the DAX constituents (ADS, ALV, BAYN, ..., VOW3) plotted against days in past]

We are given time series data: (end of day) stock prices =ˆ n data vectors ∈ R^m.

Rephrased in the language of linear algebra: Are there underlying governing trends? That is, are there a few vectors u_1, . . . , u_p, p ≪ n, such that, approximately, all other data vectors ∈ Span{u_1, . . . , u_p}?
y
EXAMPLE 3.4.4.27 (Classification from measured data) Data vectors belong to different classes,
where those in the same class are “qualitatively similar” in the sense that they are small (random) per-
turbations of a typical data vector. The task is to tease out the typical data patterns and tell which class
every data vector belongs to.

Given: measured U–I characteristics of n diodes in a box
(data points (U_j, I_j^{(k)}), j = 1, . . . , m, k = 1, . . . , n)

Classification problem: find out
• how many different types of diodes in box,
• the U–I characteristic of each type.

! Measurement errors ! Manufacturing tolerances !

[Fig. 87]

The following plots display possible (“synthetic”) measured data for two types of diodes; measurement er-
rors and manufacturing tolerances taken into account by additive (Gaussian) random perturbations (noise).


[Fig. 88: measured U–I characteristics for some diodes; Fig. 89: measured U–I characteristics for all diodes (current I vs. voltage U)]
y
Ex. 3.4.4.26 and Ex. 3.4.4.27 present typical tasks that can be tackled by principal component analysis.
Now we give an abstract description as a problem of linear algebra.
Given: n data points a j ∈ R m , j = 1, . . . , n, in m-dimensional (feature) space
(e.g., a j may represent a finite time series or a measured relationship of physical quantities)

In Ex. 3.4.4.26: n =ˆ number of stocks,
                 m =ˆ number of days, for which stock prices are recorded

✦ Extreme case: all stocks follow exactly one trend

    ↔ a_j ∈ Span{u} ∀ j = 1, . . . , n ,

  for a trend vector u ∈ R^m, ‖u‖_2 = 1.

✦ Unlikely case: all stock prices are governed by p < n trends:

    ↔ a_j ∈ Span{u_1, . . . , u_p} ∀ j = 1, . . . , n ,   (3.4.4.28)

  with orthonormal trend vectors u_i ∈ R^m, i = 1, . . . , p.

  Why unlikely? Small random fluctuations will be present in each stock price.
  Why orthonormal? Trends should be as "independent as possible" (minimally correlated).

Expressed using the terminology of linear algebra:

    (3.4.4.28) ⇔ rank(A) = p for A := [a_1, . . . , a_n] ∈ R^{m,n}, R(A) = Span{u_1, . . . , u_p} .   (3.4.4.29)

✦ Realistic: stock prices approximately follow a few trends

    a_j ∈ Span{u_1, . . . , u_p} + "small perturbations" ∀ j = 1, . . . , n ,

  with orthonormal trend vectors u_i, i = 1, . . . , p.
with orthonormal trend vectors ui , i = 1, . . . , p.

Task (PCA): determine (minimal) p and orthonormal trend vectors ui , i = 1, . . . , p

Now singular value decomposition (SVD) according to Def. 3.4.1.3 comes into play, because
Lemma 3.4.1.13 tells us that it can supply an orthonormal basis of the image space of a matrix, cf.
Code 3.4.2.9.

Issue: how to deal with (small, random) perturbations ?


Recall Rem. 3.4.1.7, (3.4.1.9): If A = UΣV^⊤ is the SVD of A ∈ R^{m,n}, then (u_j =ˆ columns of U, v_j =ˆ columns of V)
\[
A = \sigma_1\, u_1 v_1^\top + \sigma_2\, u_2 v_2^\top + \ldots
\]

This already captures the case (3.4.4.28) and we see that the columns of U supply the trend vectors we
are looking for!
➊ no perturbations:

SVD: A = UΣV^H satisfies σ_1 ≥ σ_2 ≥ · · · ≥ σ_p > σ_{p+1} = · · · = σ_{min{m,n}} = 0 ,
orthonormal trend vectors (U)_{:,1}, . . . , (U)_{:,p} .

➋ with perturbations:

SVD: A = UΣV^H satisfies σ_1 ≥ σ_2 ≥ · · · ≥ σ_p ≫ σ_{p+1} ≈ · · · ≈ σ_{min{m,n}} ≈ 0 ,
orthonormal trend vectors (U)_{:,1}, . . . , (U)_{:,p} .
If there is a pronounced gap in distribution of the singular values, which separates p large from
min{m, n} − p relatively small singular values, this hints that R(A) has essentially dimension p. It
depends on the application what one accepts as a “pronounced gap”.

Frequently used criterion:


\[
p = \min\Big\{ q : \sum_{j=1}^{q} \sigma_j^2 \geq (1 - \tau) \sum_{j=1}^{\min\{m,n\}} \sigma_j^2 \Big\}
\quad\text{for}\quad \tau \ll 1 . \tag{3.4.4.30}
\]

What is the information carried by V in the PCA context?
\[
A = \sigma_1\, u_1 v_1^\top + \sigma_2\, u_2 v_2^\top + \ldots ,
\]
j-th data set (↔ time series #j) in j-th column of A:
\[
(3.4.1.9) \;\Rightarrow\; (A)_{:,j} = \sigma_1\, u_1 (v_1)_j + \sigma_2\, u_2 (v_2)_j + \ldots
\]
The j-th row of V (up to the p-th component) gives the weights with which the p identified trends
contribute to data set j.
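A possible way to carry out these steps in EIGEN is sketched below (the function name extractTrends and its interface are chosen for illustration only): it computes the SVD of the data matrix, determines p via the criterion (3.4.4.30), and returns the trend vectors together with the weights of each trend in every data set.

#include <Eigen/Dense>
#include <utility>

// Sketch: determine the number p of dominant trends according to (3.4.4.30)
// and return the trend vectors (U)_{:,1:p} together with the weight matrix,
// i.e. the first p columns of V scaled by the singular values.
// Columns of A = data vectors (e.g. time series).
std::pair<Eigen::MatrixXd, Eigen::MatrixXd>
extractTrends(const Eigen::MatrixXd &A, double tau) {
  const Eigen::JacobiSVD<Eigen::MatrixXd> svd(
      A, Eigen::ComputeThinU | Eigen::ComputeThinV);
  const Eigen::VectorXd sv = svd.singularValues();
  const double total = sv.squaredNorm();
  // Smallest p such that sigma_1^2 + ... + sigma_p^2 >= (1-tau)*total
  Eigen::Index p = 0;
  double partial = 0.0;
  while (p < sv.size() && partial < (1.0 - tau) * total) {
    partial += sv(p) * sv(p);
    ++p;
  }
  // Trend vectors and the contributions of each trend to every data set
  return {svd.matrixU().leftCols(p),
          svd.matrixV().leftCols(p) * sv.head(p).asDiagonal()};
}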

EXAMPLE 3.4.4.31 (PCA of stock prices → Ex. 3.4.4.26cnt’d) Stock prices are given as a large
matrix A ∈ R m,n :


columns of A → time series of end of day stock prices of individual stocks


rows of A → closing prices of DAX stocks on a particular day
The data were obtained from Yahoo Finance in 2016:
#!/bin/csh
foreach i (ADS ALV BAYN BEI BMW CBK DAI DBK DB1 LHA DPW DTE EOAN FRE3 \
           FME HEI HEN3 IFX SDF LIN MAN MRK MEO MUV2 RWE SAP SIE TKA VOW3)
  wget -O "$i".csv "http://ichart.finance.yahoo.com/table.csv?s=$i.DE&a=00&b=1&c=2008&d=09&e=30&f=2010&g=d&ignore=.csv"
  sed -i -e 's/-/,/g' "$i".csv
end

[Fig. 90: XETRA DAX stock prices 1.1.2008 – 29.10.2010 (stock price (EUR) vs. days in past); Fig. 91: singular values of the stock price matrix (logarithmic scale)]

We observe a pronounced decay of the singular values of A. The plot of Fig. 91 is given in linear-
logarithmic scale. The neat alignment of larger singular values indicates approximate exponential decay
of the singular values.
➣ a few trends (corresponding to a few of the largest singular values) govern the time series.
[Fig. 92: five most important stock price trends (normalized), i.e. the columns U(:,1), ..., U(:,5); Fig. 93: the same trends weighted with their singular values, U*S(:,1), ..., U*S(:,5) (plotted against days in past)]

Columns of U (→ Fig. 92) in SVD A = UΣV⊤ provide trend vectors, cf. Ex. 3.4.4.26 & Ex. 3.4.4.32.

When weighted with the corresponding singular value, the importance of a trend contribution emerges,
see Fig. 93


[Fig. 94: relative strength of the contributions of the first five singular vectors to the BMW stock, 1.1.2008 – 29.10.2010; Fig. 95: the same for the Daimler stock]

Stocks of companies from the same sector of the economy should display similar contributions of major
trend vectors, because their prices can be expected to be more closely correlated than stock prices in
general. This is evident in Fig. 94 and Fig. 95 for two car makers.
y
EXAMPLE 3.4.4.32 (Principal component analysis for data classification → Ex. 3.4.4.27 cnt’d)
Given: measured U - I characteristics of n = 20 unknown diodes, I (U ) available for m = 50 voltages.
Sought: Number of different types of diodes in batch and reconstructed U - I characteristic for each type.
[Fig. 96: measured U–I characteristics for some diodes; Fig. 97: measured U–I characteristics for all diodes (current I vs. voltage U)]

Data matrix A ∈ R m,n , m ≫ n:


Rows A → series of measurements for different diodes (times/locations etc.),
Columns of A → measured values corresponding to one diode (time/location etc.).
Goal of PCA: detect linear correlations between columns of A


[Fig. 98: singular values σ_i of the diode measurement matrix]

← distribution of singular values of matrix: two dominant singular values!
➣ measurements display linear correlation with two principal components
➣ two types of diodes in batch

[Fig. 99: strengths of contributions of the two leading singular components for each diode (strength of singular component #2 vs. #1); Fig. 100: principal components (trend vectors) for the diode measurements, dominant and second principal component (current I vs. voltage U)]

Observations:
✦ First two rows of V-matrix specify strength of contribution of the two leading principal components
to each measurement

➣ Points (V):,1:2 , which correspond to different diodes are neatly clustered in R2 . To determine
the type of diode i, we have to identify the cluster to which the point ((V)i,1 , Vi,2 ) belongs (→ cluster
analysis, course “machine learning”, see Rem. 3.4.4.43 below).
✦ The principal components themselves do not carry much useful information in this example.
y

EXAMPLE 3.4.4.33 (Data points (almost) confined to a subspace) More abstractly, above we tried to
identify a subspace to which all data points a_i were "close". We saw that the SVD of the matrix A = [a_1, . . . , a_n] built from the data points can reveal such a subspace:


Data points • ↔ a_i ∈ R^3 "almost" located on a plane, see Fig. 101.

Non-zero singular values of A = [a_1, . . . , a_n]:

    3.1378
    1.8092
    0.1792

The third singular value is much smaller, which hints that the data points approximately lie in a 2D subspace spanned by the two first singular vectors of A.

[Fig. 101: point cloud in R^3 that is almost contained in a plane]
y

§3.4.4.34 (Proper orthogonal decomposition (POD)) In the previous Ex. 3.4.4.33 we saw that the
singular values of a matrix whose columns represent data points ∈ R m tell us whether these points are all
“approximately located” in a lower-dimensional subspace V ⊂ R m . This is linked to the following problem:

Problem of proper orthogonal decomposition (POD):

Given: Data points a_1, . . . , a_n ∈ R^m, m, n ∈ N.

Sought: For k ≤ min{m, n}, find a subspace U_k ⊂ R^m such that
\[
U_k = \operatorname*{argmin}_{W \subset \mathbb{R}^m,\ \dim W = k}\ \sum_{j=1}^{n} \inf_{w \in W} \| a_j - w \|_2^2 , \tag{3.4.4.35}
\]
that is, we seek that k-dimensional subspace U_k of R^m for which the sum of squared distances of the data points to U_k is minimal.

We have already seen a similar problem in Ex. 3.4.4.5. For m = 2 we want to point out the difference to
linear regression:
[Fig. 102: linear regression — minimize the sum of squares of vertical distances; Fig. 103: POD — minimize the sum of squares of (minimal) distances]
By finding a k-dimensional subspace we mean finding a, preferably orthonormal, basis of that subspace.
Let us assume that {w1 , . . . , wk } is an orthonormal basis (ONB) of a k-dimensional subspace W ⊂ R m .
Then the orthogonal projection PW x of a point x ∈ R m onto W is given by

\[
P_W x = \sum_{j=1}^{k} (w_j^\top x)\, w_j = W W^\top x , \tag{3.4.4.36}
\]

where W = [w1 , . . . , wk ] ∈ R m,k . This formula is closely related to the normal equations for a linear


least squares problem, see Thm. 3.1.2.1 and § 3.1.1.8 for a visualization of an orthogonal projection.
kx − PW xk2 is the (minimal) distance of x to W .
Hence, again writing W ∈ R m,k for the matrix whose columns form an ONB (⇒ W⊤ W = I) of W ⊂ R m ,
we have
\[
\sum_{j=1}^{n} \inf_{w \in W} \| a_j - w \|_2^2
= \sum_{j=1}^{n} \big\| a_j - W W^\top a_j \big\|_2^2
= \big\| A - W W^\top A \big\|_F^2 , \tag{3.4.4.37}
\]

where k·k F denotes the Frobenius norm of a matrix, see Def. 3.4.4.17, and A = [a1 , . . . , an ] ∈ R m,n .

Note that rank(WW⊤ A) ≤ k. Let us write A = UΣV⊤ for the SVD of A, and Uk ∈ R m,k , Vk ∈ R n,k ,
and Σk ∈ R k,k for the truncated SVD-factors of A as introduced in Thm. 3.4.4.19. Then the rank-k best
approximation result of that theorem implies

\[
\big\| A - W W^\top A \big\|_F \;\geq\; \big\| A - U_k \Sigma_k V_k^\top \big\|_F
\qquad \forall W \in \mathbb{R}^{m,k},\ W^\top W = I .
\]
In fact, we can find W ∈ R^{m,k} with orthonormal columns that realizes the minimum: just choose W := U_k
and verify
\[
W W^\top A = U_k U_k^\top U \Sigma V^\top
= U_k \begin{bmatrix} I_k & O \end{bmatrix} \Sigma V^\top
= U_k \Sigma_k V_k^\top .
\]

Theorem 3.4.4.38. Solution of POD problem

The subspace Uk spanned by the first k left singular vectors of A = [a1 , . . . , an ] ∈ R m,n solves the
POD problem (3.4.4.35):

Uk = R (U):,1:k with A = UΣV⊤ the SVD of A.

Appealing to (3.4.4.23), the sum of the squared distances can be obtained as the sum of the squares of
the remaining singular values σ_{k+1}, . . . , σ_p, p := min{m, n}, of A:
\[
\sum_{j=1}^{n} \inf_{w \in \mathcal{U}_k} \| a_j - w \|_2^2 = \sum_{\ell=k+1}^{p} \sigma_\ell^2 . \tag{3.4.4.39}
\]

As a consequence, the decay of the singular values again predicts how close the data points are to the
POD subspaces Uk , k = 1, . . . , p − 1. y
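Thm. 3.4.4.38 directly suggests the following small sketch of a POD computation in EIGEN (the function name podBasis is made up for illustration; it assumes k ≤ min{m, n}): it returns an orthonormal basis of U_k and the attained minimal sum of squared distances according to (3.4.4.39).

#include <Eigen/Dense>
#include <utility>

// Sketch: compute an orthonormal basis of the POD subspace U_k spanned by the
// first k left singular vectors of A = [a_1,...,a_n] (data points = columns),
// and the attained minimal sum of squared distances, cf. (3.4.4.39).
std::pair<Eigen::MatrixXd, double> podBasis(const Eigen::MatrixXd &A,
                                            unsigned int k) {
  const Eigen::JacobiSVD<Eigen::MatrixXd> svd(A, Eigen::ComputeThinU);
  const Eigen::VectorXd sv = svd.singularValues();
  // Sum of squares of the remaining singular values sigma_{k+1},...,sigma_p
  const double sqdist = sv.tail(sv.size() - k).squaredNorm();
  return {svd.matrixU().leftCols(k), sqdist};
}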
EXAMPLE 3.4.4.40 (Principal axis of a point cloud) Given m > 2 points x j ∈ R k , j = 1, . . . , m, in
k-dimensional space, we ask what is the “longest” and “shortest” diameter d+ and d− . This question can
be stated rigorously in several different ways: here we ask for directions for which the point cloud will have
maximal/minimal variance, when projected onto that direction:

\[
d_+ := \operatorname*{argmax}_{\|v\|=1} Q(v) , \qquad
d_- := \operatorname*{argmin}_{\|v\|=1} Q(v) , \qquad
Q(v) := \sum_{i=1}^{m} |(x_i - c)^\top v|^2 , \quad
c = \frac{1}{m} \sum_{j=1}^{m} x_j . \tag{3.4.4.41}
\]


The directions d_+, d_− are called the principal axes of the point cloud, a term borrowed from mechanics
and connected with the axes of inertia of an assembly of point masses.

[Fig. 104: principal axes (major and minor axis) of a point cloud in 2D]

d_+, d_− can be computed by computing the extremizers of x ↦ ‖Ax‖_2 with
\[
A = \begin{bmatrix} (x_1 - c)^\top \\ \vdots \\ (x_m - c)^\top \end{bmatrix} \in \mathbb{R}^{m,k}
\quad\Rightarrow\quad
d_+ = \operatorname*{argmax} \| A x \|_2 , \quad
d_- = \operatorname*{argmin} \| A x \|_2 , \tag{3.4.4.42}
\]
on {x ∈ R^k : ‖x‖_2 = 1}, using the SVD-based method presented in Section 3.4.4.1. y
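A minimal sketch of this computation in EIGEN is given below (the function name principalAxes is chosen for illustration; it assumes m ≥ k data points, stored as columns of X):

#include <Eigen/Dense>
#include <utility>

// Sketch: compute the principal axes d_+ and d_- of a point cloud whose
// points are the columns of X (k rows = dimension, m columns = points),
// via the SVD of the matrix A from (3.4.4.42).
std::pair<Eigen::VectorXd, Eigen::VectorXd>
principalAxes(const Eigen::MatrixXd &X) {
  const Eigen::Index m = X.cols();
  const Eigen::VectorXd c = X.rowwise().sum() / static_cast<double>(m);
  // Rows of A are the shifted points (x_i - c)^T
  const Eigen::MatrixXd A = (X.colwise() - c).transpose();
  const Eigen::JacobiSVD<Eigen::MatrixXd> svd(A, Eigen::ComputeThinV);
  const Eigen::MatrixXd V = svd.matrixV(); // right singular vectors
  // ||Ax|| is maximal/minimal on the unit sphere for the right singular
  // vectors belonging to the largest/smallest singular value.
  return {V.col(0), V.col(V.cols() - 1)};
}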


Remark 3.4.4.43 (Algorithm for cluster analysis) In data classification as presented in Ex. 3.4.4.32,
after we have identified the p main trends (↔ singular values and left singular vectors) and how much
they contribute to every data point (↔ right singular vectors), the last step is to perform a cluster analysis
based on the right singular vectors v1 , . . . , v p .
Now we study the abstract problem of cluster analysis:
Given: ✦ N data points xi ∈ R k , i = 1, . . . , N ,

✦ Assume: number n of desired clusters is known in advance.


Sought: Partitioning of index set {1, . . . , N } = I1 ∪ · · · ∪ In , achieving minimal mean least squares error
\[
\text{mlse} := \sum_{l=1}^{n} \sum_{i \in I_l} \| x_i - m_l \|_2^2 , \qquad
m_l = \frac{1}{\sharp I_l} \sum_{i \in I_l} x_i . \tag{3.4.4.44}
\]

The subsets {xi : i ∈ Il } are called the clusters. The points ml are their centers of gravity.

The Algorithm involves two components:


➊ Splitting of a cluster by separation along its principal axis, see Ex. 3.4.4.40 and Code 3.4.4.48:
\[
a_l := \operatorname*{argmax}_{\|v\|_2 = 1} \Big\{ \sum_{i \in I_l} |(x_i - m_l)^\top v|^2 \Big\} . \tag{3.4.4.45}
\]

Relies on the algorithm from Ex. 3.4.4.40, see Code 3.4.4.48.


➋ Improvement of clusters using the Lloyd-Max algorithm, see Code 3.4.4.49. It involves two steps in
turns:
(a) Given centers of gravity ml redistribute points according to

\[
I_l := \{ i \in \{1, \ldots, N\} : \| x_i - m_l \|_2 \leq \| x_i - m_k \|_2 \;\; \forall k \neq l \} , \tag{3.4.4.46}
\]

that is, we assign each point to the nearest center of gravity, see Code 3.4.4.49.


(b) Recompute centers of gravity


\[
m_l = \frac{1}{\sharp I_l} \sum_{i \in I_l} x_i . \tag{3.4.4.47}
\]

We start with a single cluster, and then do repeated splitting (➊) and cluster rearrangement (➋) until we
have reached the desired final number n of clusters, see Code 3.4.4.50.

C++-code 3.4.4.48: Principal axis point set separation ➺ GITLAB


2 // Separation of a set of points whose coordinates are stored in the
3 // columns of X according to their location w.r.t. the principal axis
4 std : : pair <VectorXi , VectorXi > p r i n c a x i s s e p ( const MatrixXd & X) {
5 const Eigen : : Index N = X . cols ( ) ; // no. of points
6 const VectorXd g = X . rowwise ( ) .sum ( ) / N; // Center of gravity, cf. (3.4.4.47)
7 MatrixXd Y = X − g . r e p l i c a t e ( 1 ,N) ; // Normalize point coordinates.
8 // Compute principal axes, cf. (3.4.4.45) and (3.4.4.3). Note that the
9 // SVD of a symmetric matrix is available through an orthonormal
10 // basis of eigenvectors.
11 const Eigen : : SelfAdjointEigenSolver <MatrixXd > es ( Y * Y . transpose ( ) ) ;
12 // Major principal axis
13 Eigen : : VectorXd a = es . e i g e n v e c t o r s ( ) . rightCols <1 >() ;
14 // Coordinates of points w.r.t. to major principal axis
15 Eigen : : VectorXd c = a . transpose ( ) * Y ;
16 // Split point set according to locations of projections on principal
axis
17 // std::vector with indices to prevent resizing of matrices
18 std : : vector < i n t > i 1 ;
19 std : : vector < i n t > i 2 ;
20 f o r ( i n t i = 0 ; i < c . s i z e ( ) ; ++ i ) {
21 i f ( c ( i ) >= 0 ) {
22 i 1 . push_back ( i ) ;
23 }
24 else {
25 i 2 . push_back ( i ) ;
26 }
27 }
28 // return the mapped std::vector as Eigen::VectorXd
29 return {
30 VectorXi : : Map( i 1 . data ( ) , s t a t i c _ c a s t <Eigen : : Index >( i 1 . s i z e ( ) ) ) ,
31 VectorXi : : Map( i 2 . data ( ) , s t a t i c _ c a s t <Eigen : : Index >( i 2 . s i z e ( ) ) )
32 };
33 }

C++-code 3.4.4.49: Lloyd-Max algorithm for cluster identification ➺ GITLAB


template <class Derived>
std::tuple<double, VectorXi, VectorXd> distcomp(const MatrixXd &X,
                                                const MatrixBase<Derived> &C) {
  // Compute squared distances
  // d.row(j) = squared distances from all points in X to cluster j
  MatrixXd d(C.cols(), X.cols());
  for (int j = 0; j < C.cols(); ++j) {
    MatrixXd Dv = X - C.col(j).replicate(1, X.cols());
    d.row(j) = Dv.array().square().colwise().sum();
  }
  // Compute minimum distance point association and sum of minimal squared distances
  VectorXi idx(d.cols());
  VectorXd mx(d.cols());


  for (int j = 0; j < d.cols(); ++j) {
    // mx(j) tells the minimal squared distance of point j to the nearest cluster
    // idx(j) tells to which cluster point j belongs
    mx(j) = d.col(j).minCoeff(&idx(j));
  }
  const double sumd = mx.sum(); // sum of all squared distances
  // Compute sum of squared distances within each cluster
  VectorXd cds(C.cols());
  cds.setZero();
  for (int j = 0; j < idx.size(); ++j) { // loop over all points
    cds(idx(j)) += mx(j);
  }
  return std::make_tuple(sumd, idx, cds);
}

// Lloyd-Max iterative vector quantization algorithm for discrete point
// sets; the columns of X contain the points xi, the columns of
// C initial approximations for the centers of the clusters. The final
// centers are returned in C, the index vector idx specifies
// the association of points with centers.
template <class Derived>
void lloydmax(const MatrixXd &X, MatrixBase<Derived> &C, VectorXi &idx,
              VectorXd &cds, const double tol = 0.0001) {
  const Eigen::Index k = X.rows(); // dimension of space
  const Eigen::Index N = X.cols(); // no. of points
  const Eigen::Index n = C.cols(); // no. of clusters
  if (k != C.rows()) {
    throw std::logic_error("dimension mismatch");
  }
  double sd_old = std::numeric_limits<double>::max();
  double sd = NAN;
  std::tie(sd, idx, cds) = distcomp(X, C);
  // Terminate, if sum of squared minimal distances has not changed much
  while ((sd_old - sd) / sd > tol) {
    // Compute new centers of gravity according to (3.4.4.47)
    MatrixXd Ctmp(C.rows(), C.cols());
    Ctmp.setZero();
    // number of points in cluster for normalization
    VectorXi nj(n);
    nj.setZero();
    for (int j = 0; j < N; ++j) { // loop over all points
      Ctmp.col(idx(j)) += X.col(j);
      ++nj(idx(j)); // count associated points for normalization
    }
    for (int i = 0; i < Ctmp.cols(); ++i) {
      if (nj(i) > 0) {
        C.col(i) = Ctmp.col(i) / nj(i); // normalization
      }
    }
    sd_old = sd;
    // Get new minimum association of the points to cluster points
    // for next iteration
    std::tie(sd, idx, cds) = distcomp(X, C);
  }
}
// Note: this function is needed to allow a call with an rvalue
// && stands for an rvalue reference and allows rvalue arguments
// such as C.leftCols(nc) to be passed by reference (C++11 feature)
template <class Derived>
void lloydmax(const MatrixXd &X, MatrixBase<Derived> &&C, VectorXi &idx,
              VectorXd &cds, const double tol = 0.0001) {
  lloydmax(X, C, idx, cds, tol);
}


C++-code 3.4.4.50: Clustering of point set ➺ GITLAB


// n-quantization of point set in k-dimensional space based on
// minimizing the mean square error of Euclidean distances. The
// columns of the matrix X contain the point coordinates, n specifies
// the desired number of clusters.
std::pair<MatrixXd, VectorXi> pointcluster(const MatrixXd &X, const int n) {
  const Eigen::Index N = X.cols(); // no. of points
  const Eigen::Index k = X.rows(); // dimension of space
  // Start with two clusters obtained by principal axis separation
  int nc = 1; // Current number of clusters
  // Initial single cluster encompassing all points
  VectorXi Ibig = VectorXi::LinSpaced(N, 0, static_cast<int>(N - 1));
  int nbig = 0;                     // Index of largest cluster
  MatrixXd C(X.rows(), n);          // matrix for cluster midpoints
  C.col(0) = X.rowwise().sum() / N; // center of gravity
  VectorXi idx(N);
  idx.setOnes();
  // Split largest cluster into two using the principal axis separation
  // algorithm
  while (nc < n) {
    VectorXi i1;
    VectorXi i2;
    MatrixXd Xbig(k, Ibig.size());
    for (int i = 0; i < Ibig.size(); ++i) { // slicing
      Xbig.col(i) = X.col(Ibig(i));
    }
    // separate Xbig into two clusters, i1 and i2 are index vectors
    std::tie(i1, i2) = princaxissep::princaxissep(Xbig);
    // new cluster centers of gravity
    VectorXd c1(k);
    c1.setZero();
    VectorXd c2(k);
    c2.setZero();
    for (int i = 0; i < i1.size(); ++i) {
      c1 += X.col(Ibig(i1(i)));
    }
    for (int i = 0; i < i2.size(); ++i) {
      c2 += X.col(Ibig(i2(i)));
    }
    c1 /= static_cast<double>(i1.size()); // normalization
    c2 /= static_cast<double>(i2.size());
    C.col(nbig) = c1;
    C.col(nbig + 1) = c2;
    ++nc; // Increase number of clusters
    // Improve clusters by Lloyd-Max iteration
    VectorXd cds; // saves mean square error of clusters
    // Note: C.leftCols(nc) is passed as rvalue reference (C++11)
    lloydmax::lloydmax(X, C.leftCols(nc), idx, cds);
    // Identify cluster with biggest contribution to mean square error
    cds.maxCoeff(&nbig);
    int counter = 0;
    // update Ibig with indices of points in cluster with biggest contribution
    for (int i = 0; i < idx.size(); ++i) {
      if (idx(i) == nbig) {
        Ibig(counter) = i;
        ++counter;
      }
    }
    Ibig.conservativeResize(counter);


  }
  return std::make_pair(C, idx);
}

Review question(s) 3.4.4.51 (Principal Component Data Analysis)


(Q3.4.4.51.A) Assume that the (sorted!) singular values σj , j ∈ {1, . . . , min{m, n}}, of A ∈ R m,n ,
m, n ≫ 1, obey the “asymptotic” decay law

σj ≈ C exp(−αj) for some C > 0, α > 0 .

How much do you have to increase the rank of the best low-rank approximation of A with respect to the
Euclidean matrix norm in order to reduce the approximation error by a factor of 2?
Can you also answer this question for the Frobenius matrix norm?

3.5 Total Least Squares


In the examples of Section 3.0.1 we generally considered overdetermined linear systems of equations
Ax = b, for which only the right hand side vector b was affected by measurement errors. However, also
the entries of the coefficient matrix A may have been obtained by measurement. This is the case, for
instance, in the nodal analysis of electric circuits → Ex. 2.1.0.3. Then, it may be legitimate to seek a
“better” matrix based on information contained in the whole linear system. This is the gist of the total least
squares approach.
Given: overdetermined linear system of equations Ax = b, A ∈ R m,n , b ∈ R m , m>n.
Known: LSE solvable ⇔ b ∈ Im(A), if A, b were not perturbed,
but A, b are perturbed (measurement errors).

Sought: Solvable overdetermined system of equations Âx = b̂, Â ∈ R^{m,n}, b̂ ∈ R^m, "nearest" to Ax = b.

☞ least squares problem “turned upside down”: now we are allowed to tamper with system matrix and
right hand side vector!

Total least squares problem:

Given: A ∈ R^{m,n}, m > n, rank(A) = n, b ∈ R^m,
find:  Â ∈ R^{m,n}, b̂ ∈ R^m with
\[
\big\| [A\ b] - [\hat{A}\ \hat{b}] \big\|_F \to \min , \qquad \hat{b} \in \mathcal{R}(\hat{A}) . \tag{3.5.0.1}
\]

\[
\hat{b} \in \mathcal{R}(\hat{A}) \;\Rightarrow\; \operatorname{rank}\big([\hat{A}\ \hat{b}]\big) = n
\quad\overset{(3.5.0.1)}{\Longrightarrow}\quad
[\hat{A}\ \hat{b}] = \operatorname*{argmin}_{\operatorname{rank}(\hat{X}) = n} \big\| [A\ b] - \hat{X} \big\|_F .
\]


☞ [Â b̂] is the rank-n best approximation of [A b]!

We face the problem to compute the best rank-n approximation of the given matrix [A b], a problem
already treated in Section 3.4.4.2: Thm. 3.4.4.19 tells us how to use the SVD of [A b],
\[
[A\ b] = U \Sigma V^\top , \qquad U \in \mathbb{R}^{m,n+1} ,\ \Sigma \in \mathbb{R}^{n+1,n+1} ,\ V \in \mathbb{R}^{n+1,n+1} , \tag{3.5.0.2}
\]
to construct [Â b̂]:
\[
[A\ b] = U \Sigma V^\top = \sum_{j=1}^{n+1} \sigma_j (U)_{:,j} (V)_{:,j}^\top
\quad\overset{\text{Thm. 3.4.4.19}}{\Longrightarrow}\quad
[\hat{A}\ \hat{b}] = \sum_{j=1}^{n} \sigma_j (U)_{:,j} (V)_{:,j}^\top . \tag{3.5.0.3}
\]
Since V is orthogonal,
\[
[\hat{A}\ \hat{b}]\,(V)_{:,n+1} = \hat{A}\,(V)_{1:n,n+1} + \hat{b}\,(V)_{n+1,n+1} = 0 . \tag{3.5.0.4}
\]
(3.5.0.4) also provides the solution x of the consistent system Âx = b̂,
\[
x := -(V)_{1:n,n+1} / (V)_{n+1,n+1} , \tag{3.5.0.5}
\]
if (V)_{n+1,n+1} ≠ 0 (in numerical sense, of course).

C++-code 3.5.0.6: Total least squares via SVD ➺ GITLAB


// computes only solution x of fitted consistent LSE
VectorXd lsqtotal(const MatrixXd &A, const VectorXd &b) {
  const unsigned m = A.rows();
  const unsigned n = A.cols();
  MatrixXd C(m, n + 1);
  C << A, b; // C = [A, b]
  // We need only the SVD-factor V, see (3.5.0.3)
  MatrixXd V = C.jacobiSvd(Eigen::ComputeThinU | Eigen::ComputeThinV).matrixV();

  // Compute solution according to (3.5.0.5)
  const double s = V(n, n);
  if (std::abs(s) < 1.0E-15) {
    cerr << "No solution!\n";
    return {};
  }
  return (-V.col(n).head(n) / s);
}

3.6 Constrained Least Squares

Video tutorial for Section 3.6 "Constrained Least Squares": (23 minutes) Download link,
tablet notes

→ review questions 3.6.2.1

In the examples of Section 3.0.1 we expected all components of the right hand side vectors to be possibly
affected by measurement errors. However, it might happen that some data are very reliable and in this
case we would like the corresponding equation to be satisfied exactly.
This leads to the linear least squares problem with linear constraint defined as follows:


Linear least squares problem with linear constraint:

Given: A ∈ R^{m,n}, m ≥ n, rank(A) = n, b ∈ R^m,
       C ∈ R^{p,n}, p < n, rank(C) = p, d ∈ R^p ,
Find:  x ∈ R^n such that ‖Ax − b‖_2 → min and Cx = d (← linear constraint).   (3.6.0.1)

Here the constraint matrix C collects all the coefficients of those p equations that are to be satisfied exactly,
and the vector d the corresponding components of the right hand side vector. Conversely, the m equations
of the (overdetermined) LSE Ax = b cannot be satisfied and are treated in a least squares sense.

3.6.1 Solution via Lagrangian Multipliers

§3.6.1.1 (A saddle point problem) Recall important technique from multidimensional calculus for tackling
constrained minimization problems: Lagrange multipliers, see [Str09, Sect. 7.9].

Idea: coupling the constraint using the Lagrange multiplier m ∈ R^p,
\[
x = \operatorname*{argmin}_{y \in \mathbb{R}^n}\ \sup_{m \in \mathbb{R}^p}\ L(y, m) , \tag{3.6.1.2}
\]
\[
L(y, m) := \tfrac{1}{2} \| Ay - b \|_2^2 + m^\top (Cy - d) . \tag{3.6.1.3}
\]

L as defined in (3.6.1.3) is called a Lagrange function or, in short, Lagrangian. The simple heuristics
behind Lagrange multipliers is the observation:

\[
\sup_{m \in \mathbb{R}^p} L(y, m) = \infty \quad\text{in case } Cy \neq d !
\]
➥ A minimum in (3.6.1.2) can only be attained if the constraint is satisfied!

(3.6.1.2) is called a saddle point problem.

A solution of a min-max problem like (3.6.1.2) is called a saddle point.

[Fig. 105: saddle point of F(x, m) = x² − 2xm, plotted over the state x and the multiplier m]

Note that the function is "flat" in the saddle point •, that is, both the derivative with respect to x and the
derivative with respect to m have to vanish there.
y


§3.6.1.4 (Augmented normal equations) In a saddle point the Lagrange function is “flat”, that is, all its
partial derivatives have to vanish there. This yields the following necessary (and sufficient) conditions for
the solution x of (3.6.1.2) and a saddle point in x, q: (For a similar technique employing multi-dimensional
calculus see Rem. 3.1.2.5)

\[
\frac{\partial L}{\partial x}(x, q) = A^\top (Ax - b) + C^\top q \overset{!}{=} 0 , \tag{3.6.1.5a}
\]
\[
\frac{\partial L}{\partial m}(x, q) = Cx - d \overset{!}{=} 0 . \tag{3.6.1.5b}
\]
This is an (n + p) × (n + p) square linear system of equations, known as augmented normal equations:
\[
\begin{bmatrix} A^\top A & C^\top \\ C & 0 \end{bmatrix}
\begin{bmatrix} x \\ q \end{bmatrix}
= \begin{bmatrix} A^\top b \\ d \end{bmatrix} . \tag{3.6.1.6}
\]
It belongs to the class of saddle-point type LSEs, that is, LSEs with a symmetric coefficient matrix with a
zero right-lower square block. In the case p = 0, in the absence of a linear constraint, (3.6.1.6) collapses
to the usual normal equations (3.1.2.2), A⊤ Ax = A⊤ b for the overdetermined linear system of equations
Ax = b.
As we know, a direct elimination solution algorithm for (3.6.1.6) amounts to finding an LU-decomposition of
the coefficient matrix. Here we opt for its symmetric variant, the Cholesky decomposition, see Section 2.8.
On the block-matrix level it can be found by considering the equation
\[
\begin{bmatrix} A^\top A & C^\top \\ C & 0 \end{bmatrix}
= \begin{bmatrix} R^\top & 0 \\ G & -S^\top \end{bmatrix}
\begin{bmatrix} R & G^\top \\ 0 & S \end{bmatrix} ,
\qquad
\begin{aligned}
& R \in \mathbb{R}^{n,n} ,\ S \in \mathbb{R}^{p,p} \ \text{upper triangular matrices,}\\
& G \in \mathbb{R}^{p,n} .
\end{aligned}
\]
Thus the blocks of the Cholesky factors of the coefficient matrix of the linear system (3.6.1.6) can be
determined in four steps.
➀ Compute R from R⊤ R = A⊤ A → Cholesky decomposition → Section 2.8,
➁ Compute G from R⊤ G⊤ = C⊤ → n forward substitutions → Section 2.3.2,
➂ Compute S from S⊤ S = GG⊤ → Cholesky decomposition → Section 2.8.
y
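For moderate problem sizes the augmented normal equations (3.6.1.6) can simply be assembled and handed to a general dense solver, as in the following sketch (the function name solveConstrainedLSQ is made up for illustration; this shortcut does not exploit the block structure derived above and inherits the conditioning issues discussed next).

#include <Eigen/Dense>

// Sketch: solve the linearly constrained least-squares problem (3.6.0.1) by
// assembling the augmented normal equations (3.6.1.6) and applying a general
// dense solver (the saddle-point matrix is symmetric but indefinite).
Eigen::VectorXd solveConstrainedLSQ(const Eigen::MatrixXd &A,
                                    const Eigen::VectorXd &b,
                                    const Eigen::MatrixXd &C,
                                    const Eigen::VectorXd &d) {
  const Eigen::Index n = A.cols();
  const Eigen::Index p = C.rows();
  Eigen::MatrixXd M(n + p, n + p);
  M << A.transpose() * A, C.transpose(),
       C,                 Eigen::MatrixXd::Zero(p, p);
  Eigen::VectorXd rhs(n + p);
  rhs << A.transpose() * b, d;
  // Solve for [x; q]; q holds the Lagrange multipliers, which we discard.
  const Eigen::VectorXd xq = M.lu().solve(rhs);
  return xq.head(n);
}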

§3.6.1.7 (Extended augmented normal equations) The same caveats as those discussed for the regular
normal equations in Rem. 3.2.0.3, Ex. 3.2.0.4, and Rem. 3.2.0.6, apply to the direct use of the augmented
normal equations (3.6.1.6):
1. their condition number can be much bigger than that of the matrix A,
2. forming A⊤ A may be vulnerable to roundoff,
3. the matrix A⊤ A may not be sparse, though A is.
As in § 3.2.0.7 also in the case of the augmented normal equations (3.6.1.6) switching to an extended
version by introducing the residual r = Ax − b as a new unknown is a remedy, cf. (3.2.0.8). This leads to
the following linear system of equations.
\[
\begin{bmatrix} -I & A & 0 \\ A^\top & 0 & C^\top \\ 0 & C & 0 \end{bmatrix}
\begin{bmatrix} r \\ x \\ m \end{bmatrix}
= \begin{bmatrix} b \\ 0 \\ d \end{bmatrix}
\quad \hat{=} \quad \text{Extended augmented normal equations.} \tag{3.6.1.8}
\]
y


3.6.2 Solution via SVD

Idea: Identify the subspace in which the solution can vary without violating the constraint.
Since C has full rank, this subspace agrees with the nullspace/kernel of C.

From Lemma 3.4.1.13 and Ex. 3.4.2.7 we have learned that the SVD can be used to compute (an or-
thonormal basis of) the nullspace N(C). This suggests the following method for solving the constrained
linear least squares problem (3.6.0.1).
➀ Compute an orthonormal basis of N(C) using SVD (→ Lemma 3.4.1.13, (3.4.3.1)):
\[
C = U\,[\Sigma\ 0]\begin{bmatrix} V_1^\top \\ V_2^\top \end{bmatrix} ,
\qquad U \in \mathbb{R}^{p,p} ,\ \Sigma \in \mathbb{R}^{p,p} ,\ V_1 \in \mathbb{R}^{n,p} ,\ V_2 \in \mathbb{R}^{n,n-p}
\quad\Rightarrow\quad \mathcal{N}(C) = \mathcal{R}(V_2) ,
\]
and the particular solution x_0 ∈ N(C)^⊥ = R(V_1) of the constraint equation
\[
x_0 := V_1 \Sigma^{-1} U^\top d .
\]
This gives us a representation of the solution x of (3.6.0.1) of the form
\[
x = x_0 + V_2\, y , \qquad y \in \mathbb{R}^{n-p} .
\]

➁ Insert this representation into (3.6.0.1). This yields a standard linear least squares problem with
coefficient matrix AV2 ∈ R m,n− p and right hand side vector b − Ax0 ∈ R m :

\[
\| A(x_0 + V_2 y) - b \|_2 \to \min \quad\Leftrightarrow\quad \| A V_2\, y - (b - A x_0) \|_2 \to \min .
\]
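The two steps can be condensed into a short EIGEN sketch (the function name clsqSVD is chosen for illustration; it assumes rank(C) = p and rank(A) = n as in (3.6.0.1)):

#include <Eigen/Dense>

// Sketch of the SVD-based method: step (1) computes an orthonormal basis of
// N(C) and the particular solution x0, step (2) solves the reduced,
// unconstrained least-squares problem for the coefficients y.
Eigen::VectorXd clsqSVD(const Eigen::MatrixXd &A, const Eigen::VectorXd &b,
                        const Eigen::MatrixXd &C, const Eigen::VectorXd &d) {
  const Eigen::Index n = C.cols();
  const Eigen::Index p = C.rows();
  // Full SVD of C: C = U * [Sigma 0] * V^T, with V = [V1 V2]
  const Eigen::JacobiSVD<Eigen::MatrixXd> svd(
      C, Eigen::ComputeFullU | Eigen::ComputeFullV);
  const Eigen::MatrixXd V1 = svd.matrixV().leftCols(p);
  const Eigen::MatrixXd V2 = svd.matrixV().rightCols(n - p);
  // Particular solution x0 = V1 * Sigma^{-1} * U^T * d
  const Eigen::VectorXd x0 =
      V1 * svd.singularValues().head(p).cwiseInverse().asDiagonal() *
      (svd.matrixU().transpose() * d);
  // Reduced least-squares problem ||A*V2*y - (b - A*x0)||_2 -> min via QR
  const Eigen::VectorXd y = (A * V2).householderQr().solve(b - A * x0);
  return x0 + V2 * y;
}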

Review question(s) 3.6.2.1 (Constrained least-squares problems)


(Q3.6.2.1.A) The angles of a flat triangle can be estimated by solving the linear system of equations
\[
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix}
\begin{bmatrix} \alpha \\ \beta \\ \gamma \end{bmatrix}
= \begin{bmatrix} \tilde{\alpha} \\ \tilde{\beta} \\ \tilde{\gamma} \\ \pi \end{bmatrix}
\]
for given measured values α̃, β̃, and γ̃.
Find the least-squares solution, if the bottom equation has to be satisfied exactly. First recast into a
linearly constrained least-squares problem

kAx − bk → min , Cx = d ,

with A ∈ R m,n , C ∈ R p,n , b ∈ R m , d ∈ R p .


(Q3.6.2.1.B) Given A ∈ R m,n , C ∈ R p,n , m, n, p ∈ N, p < n, we want to solve

\[
x^* = \operatorname*{argmax}_{x \in \mathcal{C}} \| A x \|_2 , \qquad
\mathcal{C} := \{ x \in \mathbb{R}^n : \| x \|_2 = 1,\ C x = 0 \} .
\]

Sketch an SVD-based algorithm for computing x∗ .



Learning Outcomes
After having studied the contents of this chapter you should be able to
• give a rigorous definition of the least squares solution of an (overdetermined) linear system of equa-
tions,
• state the (extended) normal equations for any overdetermined linear system of equations,
• tell conditions for uniqueness and existence of solutions of the normal equations,
• define (economical) QR-decomposition and SVD of a matrix,
• know the asymptotic computational effort of computing economical QR and SVD factorizations,
• explain the use of QR-decomposition and, in particular, Givens rotations, for solving (overdeter-
mined) linear systems of equations (in least squares sense),
• use SVD to solve least squares, (constrained) optimization, and low-rank best approximation prob-
lems
• explain the ideas underlying principal component analysis (PCA) and proper orthogonal decompo-
sition (POD),
• formulate the augmented (extended) normal equations for a linearly constrained least squares prob-
lem.

Bibliography

[Bra06] Matthew Brand. “Fast low-rank modifications of the thin singular value decomposition”. In:
Linear Algebra Appl. 415.1 (2006), pp. 20–30. DOI: 10.1016/j.laa.2005.07.021.
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 218, 219, 226, 230–263, 274).
[GGK14] W. Gander, M.J. Gander, and F. Kwok. Scientific Computing. Vol. 11. Texts in Computational
Science and Engineering. Heidelberg: Springer, 2014 (cit. on p. 216).
[GV13] Gene H. Golub and Charles F. Van Loan. Matrix computations. Fourth. Johns Hopkins Stud-
ies in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD, 2013,
pp. xiv+756 (cit. on pp. 241, 243, 247, 266).
[Gut07] M. Gutknecht. “Linear Algebra”. 2007 (cit. on p. 239).
[Gut09] M.H. Gutknecht. Lineare Algebra. Lecture Notes. SAM, ETH Zürich, 2009 (cit. on pp. 236, 264,
266, 279).
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 230–233, 236,
243, 274).
[HRS16] J.S. Hesthaven, G. Rozza, and B. Stamm. Certified Reduced Basis Methods for Parametrized
Partial Differential Equations. BCAM Springer Briefs. Cham: Springer, 2016.
[Hig02] N.J. Higham. Accuracy and Stability of Numerical Algorithms. 2nd ed. Philadelphia, PA: SIAM,
2002 (cit. on pp. 243, 250).
[Kal96] D. Kalman. “A singularly valuable decomposition: The SVD of a matrix”. In: The College Math-
ematics Journal 27 (1996), pp. 2–23 (cit. on p. 264).
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
pp. 215, 239, 264, 266, 267).
[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Math-
ematics. New York: Springer, 2000 (cit. on p. 231).
[QMN16] Alfio Quarteroni, Andrea Manzoni, and Federico Negri. Reduced basis methods for partial
differential equations. Vol. 92. Unitext. Springer, Cham, 2016, pp. xi+296.
[Ste76] G. W. Stewart. “The economical storage of plane rotations”. In: Numer. Math. 25.2 (1976),
pp. 137–138. DOI: 10.1007/BF01462266 (cit. on p. 247).
[Str19] D. Strang. Linear Algebra and Learning from Data. Cambridge University Press, 2019.
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on pp. 223, 225,
264, 298).
[SCF18] Jan Svoboda, Thomas Cashman, and Andrew Fitzgibbon. QRkit: Sparse, Composable QR
Decompositions for Efficient and Stable Solutions to Problems in Computer Vision. 2018.
[Vol08] S. Volkwein. Model reduction using proper orthogonal decomposition. Lecture notes. Graz,
Austria: TU Graz, 2008.

Chapter 4

Filtering Algorithms

This chapter continues the theme of numerical linear algebra, also covered in Chapter 1, 2, 10. We will
come across very special linear transformations (↔ matrices) and related algorithms. Surprisingly, these
form the basis of a host of very important numerical methods for signal processing.

§4.0.0.1 (Time-discrete signals and sampling) From the perspective of signal processing we can iden-
tify
vector x ∈ R^n ↔ finite discrete (= sampled) signal.

Sampling converts a time-continuous signal, represented by some real-valued physical quantity (pressure, voltage, power, etc.) into a time-discrete signal:

X = X(t) =ˆ time-continuous signal, 0 ≤ t ≤ T ,
"sampling": x_j = X(j∆t) , j = 0, . . . , n − 1 , n ∈ N, n∆t ≤ T ,
∆t > 0 =ˆ time between samples.

[Fig. 106: sampling of a time-continuous signal X(t) at the instances t_0, t_1, . . . , t_{n−1}, yielding the values x_0, x_1, . . . , x_{n−1}]

As already indicated by the indexing the sampled values can be arranged in a vector x = [x_0, . . . , x_{n−1}]^⊤ ∈ R^n.
Note that in this chapter, as is customary in signal processing, we adopt a C++-style indexing from 0: the
components of a vector with length n carry indices ∈ {0, . . . , n − 1}.

As an idealization one sometimes considers a signal of infinite duration X = X (t), −∞ < t < ∞. In this
case sampling yields a bi-infinite time-discrete signal, represented by a sequence ( xk )k∈Z ∈ RZ . If this
sequence has a finite number of non-zero terms only, then we write (0, . . . , xℓ , xℓ+1 , . . . , xn−1 , xn , 0, . . .).
y
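A minimal sketch of the sampling step in C++ (the function name sampleSignal is made up for illustration; the time-continuous signal is modeled as a callable object):

#include <Eigen/Dense>
#include <functional>

// Sketch: sample a time-continuous signal X (given as a callable object)
// at the n equidistant instances t = 0, dt, 2*dt, ..., (n-1)*dt.
Eigen::VectorXd sampleSignal(const std::function<double(double)> &X,
                             double dt, unsigned int n) {
  Eigen::VectorXd x(n);
  for (unsigned int j = 0; j < n; ++j) {
    x(j) = X(j * dt); // x_j = X(j*dt), cf. the sampling rule above
  }
  return x;
}

For instance, calling this function with a 440 Hz sine wave, ∆t = 1/44100 s and n = 44100 would yield one second of a sampled pure tone at the standard audio sampling rate.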
EXAMPLE 4.0.0.2 (Sampled audio signals) An important class of time-discrete signals are digital audio
signals. Those may be obtained from sampling and analog-to-digital conversion of air pressure recorded
by a microphone.


[Fig. 107: sound signal "Hello" (normalized sound pressure vs. time t [s])]

A simple file format for storing digital audio signals is WAV. The audio signal is stored as a sequence of
16-bit integer values with a standard sampling rate of 44.1 kHz (44100 samples/second, sound sampled
every ∆t = 2.2676 · 10^−5 s).

✁ Sampled speech audio data, file hello.wav, ➺ GITLAB

    Filtering > file hello.wav
    hello.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 44100 Hz

• In this chapter we neglect the additional “quantization”, that is, the fact that, in practice, the values
of a time-discrete signal ( xk )k∈Z are again discrete, e.g., 16-bit integers for the WAV file format.
Throughout, we consider only xk ∈ R.
• C++ codes handling standard file formats usually rely on dedicated libraries, for instance the library
AudioFile for the WAV format.
y
Contents
4.1 Filters and Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
4.1.1 Discrete Finite Linear Time-Invariant Causal Channels/Filters . . . . . . . . 304
4.1.2 LT-FIR Linear Mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
4.1.3 Discrete Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
4.1.4 Periodic Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
4.2 Discrete Fourier Transform (DFT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
4.2.1 Diagonalizing Circulant Matrices . . . . . . . . . . . . . . . . . . . . . . . . . 319
4.2.2 Discrete Convolution via Discrete Fourier Transform . . . . . . . . . . . . . 326
4.2.3 Frequency filtering via DFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
4.2.4 Real DFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
4.2.5 Two-dimensional DFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
4.2.6 Semi-discrete Fourier Transform [QSS00, Sect. 10.11] . . . . . . . . . . . . . 344
4.3 Fast Fourier Transform (FFT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
4.4 Trigonometric Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
4.4.1 Sine transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
4.4.2 Cosine transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
4.5 Toeplitz Matrix Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
4.5.1 Matrices with Constant Diagonals . . . . . . . . . . . . . . . . . . . . . . . . 371
4.5.2 Toeplitz Matrix Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
4.5.3 The Levinson Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374

4.1 Filters and Convolutions


4.1.1 Discrete Finite Linear Time-Invariant Causal Channels/Filters
Video tutorial for Section 4.1.1 "Discrete Finite Linear Time-Invariant Causal Channels/Filters":
(11 minutes) Download link, tablet notes


In this section we study a finite linear time-invariant causal channel/filter, which is a widely used model
for digital communication channels, e.g. in wireless communication theory. We adopt a mathematical
perspective harnessing the toolbox of linear algebra as is common in modern engineering.
Mathematically speaking, a (discrete) channel/filter is a function/mapping F : ℓ^∞(Z) → ℓ^∞(Z) from the
vector space ℓ^∞(Z) of bounded input sequences (x_j)_{j∈Z},
\[
\ell^\infty(\mathbb{Z}) := \Big\{ (x_j)_{j\in\mathbb{Z}} : \sup_{j\in\mathbb{Z}} |x_j| < \infty \Big\} ,
\]
to bounded output sequences (y_j)_{j∈Z}.

[Fig. 108: input signal (x_k) → channel → output signal (y_k)]

\[
\text{Channel/filter:}\quad F : \ell^\infty(\mathbb{Z}) \to \ell^\infty(\mathbb{Z}) , \qquad
(y_j)_{j\in\mathbb{Z}} = F\big( (x_j)_{j\in\mathbb{Z}} \big) . \tag{4.1.1.1}
\]

In order to link (discrete) filters to linear algebra, we have to assume certain properties that are indicated
by the attributes “finite ”, “linear”, “time-invariant” and “causal”:

Definition 4.1.1.2. Finite channel/filter

A channel/filter F : ℓ∞ (Z ) → ℓ∞ (Z ) is called finite, if every input signal of finite duration produces


an output signal of finite duration,
\[
\exists M \in \mathbb{N}: \big( |j| > M \Rightarrow x_j = 0 \big)
\;\Rightarrow\;
\exists N \in \mathbb{N}: \big( |k| > N \Rightarrow \big( F((x_j)_{j\in\mathbb{Z}}) \big)_k = 0 \big) , \tag{4.1.1.3}
\]

It is natural to assume that it should not matter when exactly a signal is fed into the channel. To express this
intuition more rigorously we introduce the time shift operator for signals: for m ∈ Z
\[
S_m : \ell^\infty(\mathbb{Z}) \to \ell^\infty(\mathbb{Z}) , \qquad
S_m\big( (x_j)_{j\in\mathbb{Z}} \big) = (x_{j-m})_{j\in\mathbb{Z}} . \tag{4.1.1.4}
\]

Hence, by applying Sm we advance (m < 0) or delay (m > 0) a signal by |m|∆t. For a time-invariant filter
time-shifts of the input propagate to the output unchanged.

Definition 4.1.1.5. Time-invariant channel/filter

A filter F : ℓ∞ (Z ) → ℓ∞ (Z ) is called time-invariant (TI), if shifting the input in time leads to the
same output shifted in time by the same amount; it commutes with the time shift operator from
(4.1.1.4):
\[
\forall (x_j)_{j\in\mathbb{Z}} \in \ell^\infty(\mathbb{Z}),\ \forall m \in \mathbb{Z}:\quad
F\big( S_m((x_j)_{j\in\mathbb{Z}}) \big) = S_m\big( F((x_j)_{j\in\mathbb{Z}}) \big) . \tag{4.1.1.6}
\]

Since a channel/filter is a mapping between vector spaces, it makes sense to talk about "linearity of F".


Definition 4.1.1.7. Linear channel/filter

A filter F : ℓ^∞(Z) → ℓ^∞(Z) is called linear, if F is a linear mapping:
\[
F\big( \alpha\, (x_j)_{j\in\mathbb{Z}} + \beta\, (y_j)_{j\in\mathbb{Z}} \big)
= \alpha\, F\big( (x_j)_{j\in\mathbb{Z}} \big) + \beta\, F\big( (y_j)_{j\in\mathbb{Z}} \big) \tag{4.1.1.8}
\]
for all sequences (x_j)_{j∈Z}, (y_j)_{j∈Z} ∈ ℓ^∞(Z) and real numbers α, β ∈ R.

Slightly rewritten, this means that for all scaling factors α, β ∈ R

output(α · signal 1 + β · signal 2) = α · output(signal 1) + β · output(signal 2) .

Of course, a signal should not trigger an output before it arrives at the filter; output may depend only on
past and present inputs, not on the future.

Definition 4.1.1.9. Causal channel/filter

A filter F : ℓ^∞(Z) → ℓ^∞(Z) is called causal (or physical, or nonanticipative), if the output does not
start before the input:
\[
\forall M \in \mathbb{N}:\quad
(x_j)_{j\in\mathbb{Z}} \in \ell^\infty(\mathbb{Z}),\ x_j = 0\ \forall j \leq M
\;\Rightarrow\;
\big( F((x_j)_{j\in\mathbb{Z}}) \big)_k = 0\ \forall k \leq M . \tag{4.1.1.10}
\]

Now we have collected all the properties of the class of filters in the focus of this section, called LT-FIR
filters.
Acronym: LT-FIR =ˆ finite (→ Def. 4.1.1.2), linear (→ Def. 4.1.1.7), time-invariant (→ Def. 4.1.1.5), and
causal (→ Def. 4.1.1.9) filter F : ℓ∞ (Z ) → ℓ∞ (Z )
§4.1.1.11 (Impulse response) For the description of filters we rely on special input signals, analogous to
the description of a linear mapping R n 7→ R m through a matrix, that is, its action on “coordinate vectors”.
The “coordinate vectors” in signal space ℓ∞ (Z ) are so-called impulses, signals that attain the value +1
for a single sampling point in time and are “mute” for all other times.

Definition 4.1.1.12. Impulse response

The impulse response (IR) of a channel/filter is the output for a single unit impulse at t = 0 as input, that is, the input signal is
\[
x_j = \delta_{j,0} := \begin{cases} 1 , & \text{if } j = 0 ,\\ 0 & \text{else} \end{cases}
\quad \text{(Kronecker symbol).}
\]

The impulse response of a finite filter can be described by a vector h of finite length n. In particular, the
impulse response of a finite and causal filter is a sequence of the form (. . . , 0, h0 , h1 , . . . , hn−1 , 0, . . .),
n ∈ N. Such an impulse response is depicted in Fig. 110.
[Fig. 109: unit impulse at t = 0; Fig. 110: impulse response h_0, h_1, . . . , h_{n−1} at the sampling instances t_0, . . . , t_{n−1}]


Thanks to the special properties of LT-FIR filters, their impulse response offers a complete characterization.
This will be explored in the next section. y
Review question(s) 4.1.1.13 (Discrete finite linear time-invariant causal channels/filters)


(Q4.1.1.13.A) What is the output of an LT-FIR filter with impulse response (. . . , 0, h_0, h_1, . . . , h_{n−1}, 0, . . .),
n ∈ N, if the input is a constant signal (x_j)_{j∈Z}, x_j = a, a ∈ R?

(Q4.1.1.13.B) [Filter with delayed feedback]

A filter setup is defined by feeding back the output signal with a delay ∆t (effected by the shift operator S_1) and damped by a factor of 2, see Fig. 111.

[Fig. 111: block diagram — the input is added (⊕) to the output fed back through ½ S_1]

Judge whether this is a linear, time-invariant and causal filter, and determine its impulse response.

Hint. First determine the output signal, when the impulse δ0,j j∈Z is sent through the filter.

4.1.2 LT-FIR Linear Mappings

Video tutorial for Section 4.1.2 "LT-FIR Linear Mappings": (12 minutes) Download link,
tablet notes

We aim for a precise mathematical description of the impact of a finite, time-invariant, linear, causal filter
on an input signal: Let (. . . , 0, h0 , h1 , . . . , hn−1 , 0, . . .), n ∈ N, be the impulse response (→ 4.1.1.12)
of that finite (→ Def. 4.1.1.2), linear (→ Def. 4.1.1.7), time-invariant (→ Def. 4.1.1.5), and causal (→
Def. 4.1.1.9) filter (LT-FIR) F : ℓ∞ (Z ) → ℓ∞ (Z ):

F ( δj,0 j ∈Z
) = (. . . , 0, h0 , h1 , . . . , hn−1 , 0, . . .) .

Owing to time-invariance we already know the response to a shifted unit pulse:
\[
F\big( (\delta_{j,k})_{j\in\mathbb{Z}} \big) = (h_{j-k})_{j\in\mathbb{Z}}
= \big( \cdots\ 0\ \underbrace{h_0}_{t = k\Delta t}\ h_1\ \cdots\ \underbrace{h_{n-1}}_{t = (k+n-1)\Delta t}\ 0\ \cdots \big) .
\]

Every finite input signal (. . . , 0, x_0, x_1, . . . , x_{m−1}, 0, . . .) ∈ ℓ^∞(Z) can be written as the superposition of
scaled unit impulses, which, in turn, are time-shifted copies of a unit pulse at t = 0:
\[
(x_j)_{j\in\mathbb{Z}} = \sum_{k=0}^{m-1} x_k\, (\delta_{j,k})_{j\in\mathbb{Z}}
= \sum_{k=0}^{m-1} x_k\, S_k\big( (\delta_{j,0})_{j\in\mathbb{Z}} \big) , \tag{4.1.2.1}
\]

where S_k is the time-shift operator from (4.1.1.4). Applying the filter on both sides of this equation and
using linearity and time-invariance we obtain
\[
F\big( (x_j)_{j\in\mathbb{Z}} \big)
\overset{\text{linearity}}{=} \sum_{k=0}^{m-1} x_k\, F\big( S_k((\delta_{j,0})_{j\in\mathbb{Z}}) \big)
\overset{\text{time-invariance}}{=} \sum_{k=0}^{m-1} x_k\, S_k\big( F((\delta_{j,0})_{j\in\mathbb{Z}}) \big) . \tag{4.1.2.2}
\]

This leads to a fairly explicit formula for the output signal (y_j)_{j∈Z} := F((x_j)_{j∈Z}):

$$
\begin{bmatrix} \vdots \\ 0 \\ y_0 \\ y_1 \\ \vdots \\ y_n \\ \vdots \\ y_{m+n-3} \\ y_{m+n-2} \\ 0 \\ \vdots \end{bmatrix}
= x_0 \begin{bmatrix} \vdots \\ 0 \\ h_0 \\ h_1 \\ \vdots \\ h_{n-1} \\ 0 \\ \vdots \\ 0 \\ 0 \\ \vdots \end{bmatrix}
+ x_1 \begin{bmatrix} \vdots \\ 0 \\ 0 \\ h_0 \\ \vdots \\ h_{n-2} \\ h_{n-1} \\ \vdots \\ 0 \\ 0 \\ \vdots \end{bmatrix}
+ x_2 \begin{bmatrix} \vdots \\ 0 \\ 0 \\ 0 \\ h_0 \\ \vdots \\ h_{n-1} \\ \vdots \\ 0 \\ 0 \\ \vdots \end{bmatrix}
+ \cdots +
x_{m-1} \begin{bmatrix} \vdots \\ 0 \\ 0 \\ \vdots \\ 0 \\ h_0 \\ \vdots \\ h_{n-2} \\ h_{n-1} \\ 0 \\ \vdots \end{bmatrix}.
\qquad (4.1.2.3)
$$

Thus, in compact notation, and because the channel is causal and finite, we can write the non-zero components of the output signal (y_j)_{j∈Z} as

$$ y_k = \Bigl(F\bigl((x_j)_{j\in\mathbb{Z}}\bigr)\Bigr)_k = \sum_{j=0}^{m-1} h_{k-j}\,x_j\,, \quad k = 0,\dots,m+n-2 \quad (h_j := 0 \text{ for } j<0 \text{ and } j\ge n)\,. \qquad (4.1.2.4) $$

Summary of the above considerations:

Superposition of impulse responses

The output (. . . , 0, y0, y1, y2, . . .) of a finite, time-invariant, linear, and causal channel for a finite-length input x = (. . . , 0, x0, . . . , xm−1, 0, . . .) ∈ ℓ∞(Z) is a superposition of x_j-weighted impulse responses, time-shifted by j∆t.
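The following short C++ sketch (not part of the official lecture codes; the function name ltfirOutput and the direct double loop are our own choices for illustration) evaluates formula (4.1.2.4) literally and returns the m + n − 1 non-zero output samples. Its cost is O(mn); Section 4.2.2 shows how the same result can be obtained much faster via the DFT.

#include <Eigen/Dense>

// Direct evaluation of (4.1.2.4): output of an LT-FIR filter with impulse
// response h (length n) for a finite input signal x (length m).
// Returns the m+n-1 non-zero components of the output signal.
Eigen::VectorXd ltfirOutput(const Eigen::VectorXd &h, const Eigen::VectorXd &x) {
  const Eigen::Index n = h.size();  // length of impulse response
  const Eigen::Index m = x.size();  // duration of input signal (in samples)
  Eigen::VectorXd y = Eigen::VectorXd::Zero(m + n - 1);
  for (Eigen::Index k = 0; k < m + n - 1; ++k) {
    for (Eigen::Index j = 0; j < m; ++j) {
      const Eigen::Index i = k - j;   // index into impulse response
      if (i >= 0 && i < n) {          // h_i := 0 outside {0,...,n-1}
        y[k] += h[i] * x[j];
      }
    }
  }
  return y;
}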

EXAMPLE 4.1.2.6 (Visualization: superposition of impulse responses) The following diagrams give
a visual display of the above considerations, namely of the superposition of impulse responses for a
particular finite, time-invariant, linear, and causal filter (LT-FIR), and an input signal of duration 3∆t, where ∆t denotes the time between samples. We see the case m = 4, n = 5.
[Fig. 112: the input signal x (values x_i over the index i of the sampling instance t_i); Fig. 113: the impulse response h (values h_i over the index i of the sampling instance t_i).]


In this special case the formula (4.1.2.3) becomes


. . . . .
.. .. .. .. ..
         
0 0 0 0 0
         
 y0   h0  0 0 0
         
 y1   h1   h0  0 0
         
 y2   h2   h1   h0  0
         
 y3   h3   h2   h1  h 
  = x0   + x1   + x2   + x3  0  .
y
 4 h
 4 h
 3 h
 2  h1 
         
 y5  0  h4   h3   h2 
         
 y6  0 0  h4   h3 
         
 y7  0 0 0  h4 
         
0 0 0 0 0
.. .. .. .. ..
. . . . .

This reflects the fact that the output is a linear superposition of impulse responses:
[Fig. 114–117: the individual responses to x_0, x_1, x_2, x_3 (signal strength over the index i); Fig. 118: all responses; Fig. 119: the accumulated responses.]
y

The formula (4.1.2.4) characterizing the output sequence (yk )k∈Z is a special case of a fundamental
bilinear operation on pairs of sequences, not necessarily finite.

Definition 4.1.2.7. Convolution of sequences

Given two sequences ( hk )k∈Z , ( xk )k∈Z , at least one of which is finite or decays sufficiently fast,
their convolution is another sequence (yk )k∈Z , defined as

$$ y_k = \sum_{j\in\mathbb{Z}} h_{k-j}\,x_j\,, \qquad k \in \mathbb{Z}\,. \qquad (4.1.2.8) $$

✎ Notation: For the sequence arising from convolving two sequences ( hk )k∈Z and ( xk )k∈Z we write
( x k ) ∗ ( h k ).


Note that convolution is not well-defined on ℓ∞ (Z ) × ℓ∞ (Z ). A counterexample is provided by constant,


non-zero sequences. However, convolution has another interesting property, which can easily be estab-
lished by re-indexing j ← k − j in (4.1.2.8):

Theorem 4.1.2.9. Convolution of sequences commutes

If well-defined, the convolution of sequences commutes

( xk ) ∗ ( hk ) = ( hk ) ∗ ( xk ) .

Review question(s) 4.1.2.10 (LT-FIR Linear Mappings)


(Q4.1.2.10.A) [Composition of LT-FIR filters] Given two LT-FIR channels with impulse responses
(. . . , 0, h0 , h1 , . . . , hn−1 , 0, . . .), n ∈ N and (. . . , 0, g0 , g1 , . . . , gl , 0, . . . , 0), l ∈ N we build another
channel by composing them. This means we send the output of the first into the second. What is
the impulse response of the composition?
(Q4.1.2.10.B) A simple LT-FIR channel has impulse response (. . . , 0, 0, h0 := 1, h1 := 1, 0, 0, . . .). What is the impulse response of the channel that is constructed by composing N of these simple channels? You should see a familiar pattern!
(Q4.1.2.10.C) An LT-FIR channel has a known impulse response (. . . , 0, h0, h1, . . . , hn−1, 0, . . .), n ∈ N. We know that it received a finite input signal (x_k) of duration (m − 1)∆t and we measure the output signal (y_k) over exactly the same timespan.
Outline how one can compute the input signal. When is this possible?

4.1.3 Discrete Convolutions


Video tutorial for Section 4.1.3 "Discrete Convolutions": (9 minutes) Download link,
tablet notes

☞ You may also watch the explanations of 3Blue1Brown on convolutions here.

Computers can deal only with finite amounts of data, so algorithms can operate only on finite signals, which will be the focus of this section. This continues the considerations undertaken at the beginning of Section 4.1.2, now with an emphasis on recasting operations in the language of linear algebra.
Remark 4.1.3.1 (The case of finite signals and filters) Again we consider a finite (→ Def. 4.1.1.2), linear
(→ Def. 4.1.1.7), time-invariant (→ Def. 4.1.1.5), and causal (→ Def. 4.1.1.9) filter (LT-FIR) with impulse
response (. . . , 0, h0 , h1 , . . . , hn−1 , 0, . . .), n ∈ N. From (4.1.2.4) we learn that

duration(output signal) ≤ duration(input signal) + duration(impulse response) .

We have seen this in (4.1.2.4), where an input signal with m pulses (duration (m − 1)∆t) and an impulse
response with n pulses (duration (n − 1)∆t) spawned an output signal with m + n − 1 pulses (duration
(m − 1 + n − 1)∆t).
Therefore, if we know that all input signals have a duration of at most (m − 1)∆t, which means they are of the form (. . . , x0, x1, . . . , xm−1, 0, . . .), we can model them as vectors x = [x0, . . . , xm−1]⊤ ∈ R^m, cf. § 4.0.0.1, and the filter can be viewed as a linear mapping F : R^m → R^{m+n−1}, which takes us to the realm of linear algebra.


Thus, for the linear filter we have a matrix representation of (4.1.2.4). Let us first look at the special case
m = 4, n = 5 presented in Ex. 4.1.2.6:
         
$$
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{bmatrix}
= x_0 \begin{bmatrix} h_0 \\ h_1 \\ h_2 \\ h_3 \\ h_4 \\ 0 \\ 0 \\ 0 \end{bmatrix}
+ x_1 \begin{bmatrix} 0 \\ h_0 \\ h_1 \\ h_2 \\ h_3 \\ h_4 \\ 0 \\ 0 \end{bmatrix}
+ x_2 \begin{bmatrix} 0 \\ 0 \\ h_0 \\ h_1 \\ h_2 \\ h_3 \\ h_4 \\ 0 \end{bmatrix}
+ x_3 \begin{bmatrix} 0 \\ 0 \\ 0 \\ h_0 \\ h_1 \\ h_2 \\ h_3 \\ h_4 \end{bmatrix}.
$$

Here, we have already replaced the sequences with finite-length vectors. Translating this relationship into
matrix-vector notation is easy:
   
$$
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{bmatrix}
=
\begin{bmatrix}
h_0 & 0 & 0 & 0 \\
h_1 & h_0 & 0 & 0 \\
h_2 & h_1 & h_0 & 0 \\
h_3 & h_2 & h_1 & h_0 \\
h_4 & h_3 & h_2 & h_1 \\
0 & h_4 & h_3 & h_2 \\
0 & 0 & h_4 & h_3 \\
0 & 0 & 0 & h_4
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}.
$$

Writing y = [y0, . . . , ym+n−2]⊤ ∈ R^{m+n−1} for the vector of the output signal, we find for the general case the following matrix×vector representation of the action of the filter on the signal:

$$
\mathbf{y} :=
\begin{bmatrix} y_0 \\ \vdots \\ \\ \\ \\ \vdots \\ y_{m+n-2} \end{bmatrix}
=
\begin{bmatrix}
h_0 & & & 0 \\
h_1 & h_0 & & \\
\vdots & h_1 & \ddots & \\
h_{n-1} & \vdots & \ddots & h_0 \\
0 & h_{n-1} & & h_1 \\
\vdots & & \ddots & \vdots \\
0 & & 0 & h_{n-1}
\end{bmatrix}
\begin{bmatrix} x_0 \\ \vdots \\ x_{m-1} \end{bmatrix}
=: \mathbf{C}\mathbf{x}\,.
\qquad (4.1.3.2)
$$

Note that the (i + 1)-th column of the matrix C ∈ R^{m+n−1,m} is obtained by a downward shift (cyclic permutation) of column i, i = 1, . . . , m − 1. y
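As an illustration of (4.1.3.2), the following sketch (our own helper, assuming Eigen; the name convMatrix is not part of the lecture codes) assembles the matrix C explicitly; each column carries the impulse response shifted down by one more position.

#include <Eigen/Dense>

// Illustration only: assemble the (m+n-1) x m matrix C from (4.1.3.2)
// whose columns are shifted copies of the impulse response h.
Eigen::MatrixXd convMatrix(const Eigen::VectorXd &h, Eigen::Index m) {
  const Eigen::Index n = h.size();
  Eigen::MatrixXd C = Eigen::MatrixXd::Zero(m + n - 1, m);
  for (Eigen::Index j = 0; j < m; ++j) {
    C.col(j).segment(j, n) = h;  // column j carries h shifted down by j positions
  }
  return C;
}
// The filter output is then obtained as y = convMatrix(h, x.size()) * x.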
Recall the formula

$$ y_k = \sum_{j=0}^{m-1} h_{k-j}\,x_j\,, \quad k = 0,\dots,m+n-2 \quad (h_j := 0 \text{ for } j<0 \text{ and } j\ge n)\,, \qquad (4.1.2.4) $$

supplying the non-zero terms of the convolution of two finite sequences ( hk )k and ( xk )k of length n and
m, respectively. Both can be identified with vectors [ x0 , . . . , xm−1 ]⊤ ∈ K m , [ h0 , . . . , hn−1 ]⊤ ∈ K n , and,
since (4.1.2.4) is a special case of the convolution of sequences introduced in Def. 4.1.2.7, we might call
it a convolution of vectors. It represents a fundamental operation in signal theory.


Definition 4.1.3.3. Discrete convolution

Given x = [x0, . . . , xm−1]⊤ ∈ K^m, h = [h0, . . . , hn−1]⊤ ∈ K^n, their discrete convolution (DCONV) is the vector y = [y0, . . . , ym+n−2]⊤ ∈ K^{m+n−1} with components

$$ y_k = \sum_{j=0}^{m-1} h_{k-j}\,x_j\,, \quad k = 0,\dots,m+n-2\,, \qquad (4.1.3.4) $$

where we have adopted the convention h_j := 0 for j < 0 or j ≥ n.

✎ Notation for discrete convolution (4.1.3.4): y = h ∗ x.

Remark 4.1.3.5 (Commutativity of discrete convolution) Discrete convolution (4.1.3.4) of two vectors is
a commutative operation mirroring the result from Thm. 4.1.2.9 without the implicit assumptions required
there.
  
Using the notations of Def. 4.1.3.3 we embed the two vectors into bi-infinite sequences (x̃_j)_{j∈Z}, (h̃_j)_{j∈Z} by zero-padding:

$$ \tilde{x}_j := \begin{cases} x_j & \text{for } j \in \{0,\dots,m-1\}\,, \\ 0 & \text{else}\,, \end{cases} \qquad \tilde{h}_j := \begin{cases} h_j & \text{for } j \in \{0,\dots,n-1\}\,, \\ 0 & \text{else}\,, \end{cases} \qquad j \in \mathbb{Z}\,. $$
     
Thm. 4.1.2.9 confirms (x̃_j) ∗ (h̃_j) = (h̃_j) ∗ (x̃_j), that is, the two sequences have exactly the same terms. By the definition of x̃_j and h̃_j we see

$$ (\mathbf{h} * \mathbf{x})_k = \sum_{j=0}^{m-1} \tilde{h}_{k-j}\,x_j = \sum_{j\in\mathbb{Z}} \tilde{h}_{k-j}\,\tilde{x}_j = \bigl((\tilde{x}_j) * (\tilde{h}_j)\bigr)_k = \bigl((\tilde{h}_j) * (\tilde{x}_j)\bigr)_k = \sum_{j\in\mathbb{Z}} \tilde{x}_{k-j}\,\tilde{h}_j = \sum_{j=0}^{n-1} \tilde{x}_{k-j}\,h_j = (\mathbf{x} * \mathbf{h})_k\,. $$

The discrete convolution of vectors is commutative. Phrased in signal-theory terminology, we have found that filter and signal can be "swapped":

[Diagram: feeding the signal x_0, . . . , x_{n−1} into an LT-FIR filter with impulse response h_0, . . . , h_{n−1} produces the same output y as feeding h_0, . . . , h_{n−1} into an LT-FIR filter with impulse response x_0, . . . , x_{n−1}.]

§4.1.3.6 (Multiplication of polynomials) The formula (4.1.3.4) for the discrete convolution also occurs in
a context completely detached from signal processing. “Surprisingly” the bilinear operation (4.1.2.4) (for
m = n) that takes two input n-vectors and produces an output 2n − 1-vector also provides the coefficients
of the product polynomial.
Concretely, consider two polynomials in t of degree n − 1, n ∈ N, with real or complex coefficients,

$$ p(t) := \sum_{j=0}^{n-1} a_j t^j\,, \qquad q(t) := \sum_{j=0}^{n-1} b_j t^j\,, \qquad a_j, b_j \in \mathbb{K}\,. $$


Their product pq will be a polynomial of degree 2n − 2:

$$ (pq)(t) = \sum_{k=0}^{2n-2} c_k t^k\,, \qquad c_k := \sum_{\ell=\max\{0,\,k-(n-1)\}}^{\min\{k,\,n-1\}} a_\ell\,b_{k-\ell}\,, \quad k = 0,\dots,2n-2\,. \qquad (4.1.3.7) $$

Let us introduce dummy coefficients a_j, b_j, j = n, . . . , 2n − 2, for p(t) and q(t), all set to 0. This can easily be done in a computer code by resizing the coefficient vectors of p and q and filling the new entries with zeros ("zero padding"). The above formula for the coefficients c_j can then be rewritten as

$$ c_j = \sum_{\ell=0}^{j} a_\ell\,b_{j-\ell}\,, \quad j = 0,\dots,2n-2\,. \qquad (4.1.3.8) $$

Hence, the coefficients of the product polynomial can be obtained as the discrete convolution of the coef-
ficient vectors of p and q:

[c0 c1 . . . c2n−2 ]⊤ = a ∗ b ! (4.1.3.9)

Moreover, this provides another proof for the commutativity of discrete convolution. y
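A minimal sketch of (4.1.3.9), assuming Eigen (the function name polyMult is our own, not from the lecture codes): the coefficients of the product polynomial are accumulated exactly as in (4.1.3.8).

#include <Eigen/Dense>

// Coefficients of the product polynomial via discrete convolution, cf. (4.1.3.9).
// a, b hold the coefficients a_0,...,a_{n-1} and b_0,...,b_{m-1}.
Eigen::VectorXd polyMult(const Eigen::VectorXd &a, const Eigen::VectorXd &b) {
  const Eigen::Index n = a.size(), m = b.size();
  Eigen::VectorXd c = Eigen::VectorXd::Zero(n + m - 1);
  for (Eigen::Index i = 0; i < n; ++i) {
    for (Eigen::Index j = 0; j < m; ++j) {
      c[i + j] += a[i] * b[j];  // contribution a_i * b_j to the coefficient of t^{i+j}
    }
  }
  return c;
}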

Remark 4.1.3.10 (Convolution of causal sequences) The notion of a discrete convolution of Def. 4.1.3.3
naturally extends to so-called causal sequences ∈ ℓ∞ (N0 ), that is, bounded mappings N0 7→ K: the
(discrete) convolution of two sequences ( x j ) j∈N0 , (y j ) j∈N0 is the sequence (z j ) j∈N0 defined by

$$ z_k := \sum_{j=0}^{k} x_{k-j}\,y_j = \sum_{j=0}^{k} x_j\,y_{k-j}\,, \quad k \in \mathbb{N}_0\,. \qquad (4.1.2.8) $$

In this context recall the product formula for power series, the Cauchy product, which can be viewed as a multiplication rule for "infinite polynomials", that is, power series. y
Review question(s) 4.1.3.11 (Discrete convolutions)
(Q4.1.3.11.A) [Calculus of discrete convolutions] Let ∗ : R^n × R^m → R^{n+m−1} denote the discrete convolution of two vectors from Def. 4.1.3.3, defined as (C++ indexing, x = [x0, . . . , xn−1]⊤, h = [h0, . . . , hm−1]⊤)

$$ (\mathbf{h} * \mathbf{x})_k := \sum_{j=0}^{m-1} h_{k-j}\,x_j\,, \quad k = 0,\dots,m+n-2\,, \qquad (4.1.3.4) $$

where we set non-existent components of h to zero in the sum.


Which of the following statements are true?
1. h ∗ x = x ∗ h for all x ∈ R n , h ∈ R m , m, n ∈ N,
2. x ∗ x = 0 ⇒ x = 0 for all x ∈ R n
3. (x + y) ∗ h = x ∗ h + y ∗ h for all x, y ∈ R n , h ∈ R m .
(Q4.1.3.11.B) [Repeated convolution] Write ∗ for the convolution of two vectors according to
Def. 4.1.3.3. Give a formula for the entries of the vector
   
$$ \mathbf{b}_n := \underbrace{\begin{bmatrix} 1 \\ 1 \end{bmatrix} * \cdots * \begin{bmatrix} 1 \\ 1 \end{bmatrix}}_{n \text{ times}}\,. $$


(Q4.1.3.11.C) [Sum of discrete random variables] Consider two discrete independent random variables
X, Y : Ω → Z, Ω a probability space, and write for the probabilities

xk := P ( X = k ) ∈ [0, 1] , yk := P (Y = k ) ∈ [0, 1] , k∈Z.

Give a formula for P ( X + Y = k ), k ∈ Z, and relate it to the convolution of sequences.

Definition 4.1.2.7. Convolution of sequences

Given two sequences ( hk )k∈Z , ( xk )k∈Z , at least one of which is finite or decays sufficiently fast,
their convolution is another sequence (yk )k∈Z , defined as

$$ y_k = \sum_{j\in\mathbb{Z}} h_{k-j}\,x_j\,, \qquad k \in \mathbb{Z}\,. \qquad (4.1.2.8) $$

4.1.4 Periodic Convolutions


Video tutorial for Section 4.1.4 "Periodic Convolutions": (12 minutes) Download link,
tablet notes

Understanding how periodic signals interact with finite, linear, time-invariant, causal (LT-FIR) filters is an important stepping stone for developing algorithms for more general situations.

Definition 4.1.4.1. Periodic time-discrete signal



An n-periodic signal, n ∈ N, is a sequence (x_j)_{j∈Z} ∈ ℓ∞(Z) satisfying

$$ x_{j+n} = x_j \quad \forall j \in \mathbb{Z}\,. $$

➣ Though infinite, an n-periodic signal ( x j ) j∈Z is uniquely determined by the finitely many values
x0 , . . . , xn−1 and can be associated with a vector x = [ x0 , . . . , xn−1 ]⊤ ∈ R n .

§4.1.4.2 (Linear filtering of periodic signals) Whenever the input signal of a finite, linear, causal,
time-invariant filter (LT-FIR) F : ℓ∞ (Z ) → ℓ∞ (Z ) with impulse response (. . . , 0, h0 , . . . , hn−1 , 0, . . .)
is n-periodic, so will be the output signal. To elaborate this we start from the convolution for-
mula for sequences from Def. 4.1.2.7 and take into account the n-periodicity to compute the output
(yk )k∈Z := F (( xk )k∈Z ):
$$ y_k = \sum_{j\in\mathbb{Z}} h_{k-j}\,x_j \overset{\text{Thm. 4.1.2.9}}{=} \sum_{j\in\mathbb{Z}} x_{k-j}\,h_j \overset{j\leftarrow\nu+\ell n}{=} \sum_{\nu=0}^{n-1}\sum_{\ell\in\mathbb{Z}} x_{k-\nu-\ell n}\,h_{\nu+\ell n} \overset{\text{periodicity}}{=} \sum_{\nu=0}^{n-1}\Bigl(\sum_{\ell\in\mathbb{Z}} h_{\nu+\ell n}\Bigr) x_{k-\nu}\,, \quad k\in\mathbb{Z}\,. \qquad (4.1.4.3) $$

From the n-periodicity of (x_j)_{j∈Z} we conclude that y_k = y_{k+n} for all k ∈ Z. Thus, in the n-periodic setting, a causal, linear, and time-invariant filter (LT-FIR) gives rise to a linear mapping R^n → R^n according to

$$ y_k = \sum_{j=0}^{n-1} p_j\,x_{k-j} = \sum_{j=0}^{n-1} p_{k-j}\,x_j \qquad (4.1.4.4) $$

for some p0 , . . . , pn−1 ∈ R satisfying pk := pk−n for all k ∈ Z .

From (4.1.4.3) we see that the defining terms of the n-periodic sequence ( pk )k∈Z can be computed
according to

$$ p_j = \sum_{\ell\in\mathbb{Z}} h_{j+\ell n}\,, \quad j \in \{0,\dots,n-1\}\,. \qquad (4.1.4.5) $$

This sequence can be regarded as the periodic impulse response, the output generated by the input sequence (∑_{k∈Z} δ_{nk,j})_{j∈Z}. It must not be mixed up with the impulse response (→ Def. 4.1.1.12) of the filter.
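A small sketch of (4.1.4.5), assuming Eigen (the helper name periodicImpulseResponse is ours): it folds an impulse response of arbitrary finite length L into one period of length n; in the setting above, where L = n, the result simply coincides with h.

#include <Eigen/Dense>

// Periodic impulse response (4.1.4.5): fold an impulse response h of
// finite length L into one period of length n.
Eigen::VectorXd periodicImpulseResponse(const Eigen::VectorXd &h, Eigen::Index n) {
  const Eigen::Index L = h.size();
  Eigen::VectorXd p = Eigen::VectorXd::Zero(n);
  for (Eigen::Index k = 0; k < L; ++k) {
    p[k % n] += h[k];  // h_{j + l*n} contributes to p_j
  }
  return p;
}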
In matrix notation (4.1.4.4) reads
 
$$
\begin{bmatrix} y_0 \\ \vdots \\ \\ \\ \vdots \\ y_{n-1} \end{bmatrix}
=
\underbrace{\begin{bmatrix}
p_0 & p_{n-1} & p_{n-2} & \cdots & \cdots & p_1 \\
p_1 & p_0 & p_{n-1} & & & \vdots \\
p_2 & p_1 & p_0 & \ddots & & \vdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\
\vdots & & \ddots & \ddots & \ddots & p_{n-1} \\
p_{n-1} & \cdots & & p_2 & p_1 & p_0
\end{bmatrix}}_{=:\,\mathbf{P}}
\begin{bmatrix} x_0 \\ \vdots \\ \\ \\ \vdots \\ x_{n-1} \end{bmatrix},
\qquad (4.1.4.6)
$$

where (P)ij = pi− j , 1 ≤ i, j ≤ n, with p j := p j+n for 1 − n ≤ j < 0.


y

The following special variant of a discrete convolution operation is motivated by the preceding § 4.1.4.2.

Definition 4.1.4.7. Discrete periodic convolution

The discrete periodic convolution of two n-periodic sequences ( pk )k∈Z , ( xk )k∈Z yields the n-
periodic sequence

$$ (y_k) := (p_k) *_n (x_k)\,, \qquad y_k := \sum_{j=0}^{n-1} p_{k-j}\,x_j = \sum_{j=0}^{n-1} x_{k-j}\,p_j\,, \quad k\in\mathbb{Z}\,. $$

✎ notation for discrete periodic convolution: ( pk ) ∗n ( xk )

The identity claimed in (4.1.4.4) and in Def. 4.1.4.7 can be established by a simple index transformation
ℓ := k − j and subsequent shifting of the sum, which does not change the value thanks to periodicity.
$$ \sum_{j=0}^{n-1} p_{k-j}\,x_j = \sum_{\ell=k-n+1}^{k} \underbrace{p_\ell\,x_{k-\ell}}_{n\text{-periodic in }\ell} = \sum_{\ell=0}^{n-1} p_\ell\,x_{k-\ell}\,. $$

This means that the discrete periodic convolution of two sequences commutes.

Since n-periodic sequences can be identified with vectors in K n (see above), we can also introduce the
discrete periodic convolution of vectors:
Def. 4.1.4.7 ➣ discrete periodic convolution of vectors: y = p ∗n x ∈ K n , p, x ∈ K n .

EXAMPLE 4.1.4.8 (Radiative heat transfer) Beyond signal processing discrete periodic convolutions
occur in many mathematical models:


[Figure: cross-section of a cylindrical pipe, heated on one part of its perimeter and cooled on the rest.]

An engineering problem:
✦ cylindrical pipe,
✦ heated on part Γ_H of its perimeter (→ prescribed heat flux),
✦ cooled on the remaining perimeter Γ_K (→ constant heat flux).
Task: compute local heat fluxes.

Modeling (discretization):
• approximation by a regular n-polygon, edges Γ_j,
• isotropic radiation of each edge Γ_j (power I_j),

radiative heat flow Γ_j → Γ_i:  P_{ji} := (α_{ij}/π) I_j,
opening angle:  α_{ij} = π γ_{|i−j|},  1 ≤ i, j ≤ n,

power balance:

$$ \underbrace{\sum_{i=1,\,i\neq j}^{n} P_{ji}}_{=\,I_j} \;-\; \sum_{i=1,\,i\neq j}^{n} P_{ij} \;=\; Q_j\,. \qquad (4.1.4.9) $$

Here Q_j denotes the heat flux through Γ_j and satisfies

$$ Q_j := \int_{2\pi(j-1)/n}^{2\pi j/n} q(\varphi)\,\mathrm{d}\varphi\,, \qquad q(\varphi) := \begin{cases} \text{local heating}\,, & \text{if } \varphi \in \Gamma_H\,, \\ -\frac{1}{|\Gamma_K|}\int_{\Gamma_H} q(\varphi)\,\mathrm{d}\varphi \;(\text{const.})\,, & \text{if } \varphi \in \Gamma_K\,. \end{cases} $$

$$ (4.1.4.9) \;\Rightarrow\; \text{LSE:} \quad I_j - \sum_{i=1,\,i\neq j}^{n} \frac{\alpha_{ij}}{\pi}\,I_i = Q_j\,, \quad j = 1,\dots,n\,. $$

    
e.g., for n = 8:

$$
\begin{bmatrix}
1 & -\gamma_1 & -\gamma_2 & -\gamma_3 & -\gamma_4 & -\gamma_3 & -\gamma_2 & -\gamma_1 \\
-\gamma_1 & 1 & -\gamma_1 & -\gamma_2 & -\gamma_3 & -\gamma_4 & -\gamma_3 & -\gamma_2 \\
-\gamma_2 & -\gamma_1 & 1 & -\gamma_1 & -\gamma_2 & -\gamma_3 & -\gamma_4 & -\gamma_3 \\
-\gamma_3 & -\gamma_2 & -\gamma_1 & 1 & -\gamma_1 & -\gamma_2 & -\gamma_3 & -\gamma_4 \\
-\gamma_4 & -\gamma_3 & -\gamma_2 & -\gamma_1 & 1 & -\gamma_1 & -\gamma_2 & -\gamma_3 \\
-\gamma_3 & -\gamma_4 & -\gamma_3 & -\gamma_2 & -\gamma_1 & 1 & -\gamma_1 & -\gamma_2 \\
-\gamma_2 & -\gamma_3 & -\gamma_4 & -\gamma_3 & -\gamma_2 & -\gamma_1 & 1 & -\gamma_1 \\
-\gamma_1 & -\gamma_2 & -\gamma_3 & -\gamma_4 & -\gamma_3 & -\gamma_2 & -\gamma_1 & 1
\end{bmatrix}
\begin{bmatrix} I_1 \\ I_2 \\ I_3 \\ I_4 \\ I_5 \\ I_6 \\ I_7 \\ I_8 \end{bmatrix}
=
\begin{bmatrix} Q_1 \\ Q_2 \\ Q_3 \\ Q_4 \\ Q_5 \\ Q_6 \\ Q_7 \\ Q_8 \end{bmatrix}.
\qquad (4.1.4.10)
$$
This is a linear system of equations with symmetric, singular, and (by Lemma 9.1.0.5, ∑ γi ≤ 1) positive
semidefinite (→ Def. 1.1.2.6) system matrix.

Note that the matrices from (4.1.4.6) and (4.1.4.10) have the same structure!

Also observe that the LSE from (4.1.4.10) can be written by means of the discrete periodic convolution
(→ Def. 4.1.4.7) of vectors y = (1, −γ1 , −γ2 , −γ3 , −γ4 , −γ3 , −γ2 , −γ1 ), x = ( I1 , . . . , I8 )

(4.1.4.10) ↔ y ∗8 x = [ Q1 , . . . , Q8 ] ⊤ .


§4.1.4.11 (Circulant matrices) In Ex. 4.1.4.8 we have already seen a matrix of a special form, the matrix
P in
 
$$
\begin{bmatrix} y_0 \\ \vdots \\ \\ \\ \vdots \\ y_{n-1} \end{bmatrix}
=
\underbrace{\begin{bmatrix}
p_0 & p_{n-1} & p_{n-2} & \cdots & \cdots & p_1 \\
p_1 & p_0 & p_{n-1} & & & \vdots \\
p_2 & p_1 & p_0 & \ddots & & \vdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\
\vdots & & \ddots & \ddots & \ddots & p_{n-1} \\
p_{n-1} & \cdots & & p_2 & p_1 & p_0
\end{bmatrix}}_{=:\,\mathbf{P}}
\begin{bmatrix} x_0 \\ \vdots \\ \\ \\ \vdots \\ x_{n-1} \end{bmatrix}.
\qquad (4.1.4.6)
$$

Matrices with this particular structure are so common that they have been given a special name.

Definition 4.1.4.12. Circulant matrix → [Han02, Sect. 54]

A matrix C = (c_{ij})_{i,j=1}^{n} ∈ K^{n,n} is circulant
:⇔ there exists an n-periodic sequence (p_k)_{k∈Z} such that c_{ij} = p_{i−j}, 1 ≤ i, j ≤ n.

✎ Notation: We write circul(p) ∈ K n,n for the circulant matrix generated by the periodic sequence/vector
p = [ p 0 , . . . , p n −1 ] ⊤ ∈ K n

The structure of a generic circulant matrix (“constant diagonals”) can be visualized as


 
$$
\operatorname{circul}(\mathbf{p}) =
\begin{bmatrix}
p_0 & p_{n-1} & p_{n-2} & \cdots & \cdots & p_1 \\
p_1 & p_0 & p_{n-1} & & & p_2 \\
p_2 & p_1 & \ddots & \ddots & & \vdots \\
\vdots & & \ddots & \ddots & \ddots & \vdots \\
p_{n-2} & & & \ddots & \ddots & p_{n-1} \\
p_{n-1} & p_{n-2} & \cdots & \cdots & p_1 & p_0
\end{bmatrix}
$$

☞ A circulant matrix has constant (main, sub- and super-) diagonals (for which indices j − i = const.).
☞ columns/rows arise by cyclic permutation of the first column/row.

Similar to the case of banded matrices (→ Section 2.7.5) we note that the "information content" of a circulant matrix C ∈ K^{n,n} is just n numbers ∈ K (obviously, a single vector u ∈ K^n is enough to define a circulant matrix C ∈ K^{n,n}). y
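For illustration only (dense storage defeats the purpose for large n), a sketch assembling circul(p) in Eigen; the helper name circul mirrors the notation above but is our own code, not part of the lecture codes.

#include <Eigen/Dense>

// Illustration only: assemble circul(p) from Def. 4.1.4.12 as a dense matrix,
// with entries (C)_{ij} = p_{(i-j) mod n}.
Eigen::MatrixXd circul(const Eigen::VectorXd &p) {
  const Eigen::Index n = p.size();
  Eigen::MatrixXd C(n, n);
  for (Eigen::Index i = 0; i < n; ++i) {
    for (Eigen::Index j = 0; j < n; ++j) {
      C(i, j) = p[((i - j) % n + n) % n];  // periodic index i - j
    }
  }
  return C;
}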

Supplement 4.1.4.13. Write Z((u_k)) ∈ K^{n,n} for the circulant matrix generated by the n-periodic sequence (u_k)_{k∈Z}. Denote by y := [y0, . . . , yn−1]⊤, x := [x0, . . . , xn−1]⊤ the vectors associated with n-periodic sequences. Then the commutativity of the discrete periodic convolution (→ Def. 4.1.4.7) implies
4. Filtering Algorithms, 4.1. Filters and Convolutions 317


NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024

circul(x)y = circul(y)x . (4.1.4.14)

Remark 4.1.4.15 (Reduction of discrete convolution to periodic convolution) Recall the discrete convolution (→ Def. 4.1.3.3) of two vectors a = [a0, . . . , an−1]⊤ ∈ K^n, b = [b0, . . . , bn−1]⊤ ∈ K^n:

$$ z_k := (\mathbf{a} * \mathbf{b})_k = \sum_{j=0}^{n-1} a_j\,b_{k-j}\,, \quad k = 0,\dots,2n-2 \quad (b_k := 0 \text{ for } k < 0,\ k \ge n)\,. $$

Now expand a0, . . . , an−1 and b0, . . . , bn−1 to (2n − 1)-periodic sequences by zero padding,

$$ x_k := \begin{cases} a_k\,, & \text{if } 0 \le k < n\,, \\ 0\,, & \text{if } n \le k < 2n-1\,, \end{cases} \qquad y_k := \begin{cases} b_k\,, & \text{if } 0 \le k < n\,, \\ 0\,, & \text{if } n \le k < 2n-1\,, \end{cases} \qquad (4.1.4.16) $$

and periodic extension: x_k = x_{2n−1+k}, y_k = y_{2n−1+k} for all k ∈ Z.

[Fig. 120: the (2n − 1)-periodic zero-padded sequence, the values a_0, . . . , a_{n−1} followed by n − 1 zeros, repeated over the index marks −n, 0, n, 2n − 1, 3n − 1, 4n − 2.]
The zero components prevent interaction of different periods:

$$ z_k = \sum_{j=0}^{k} a_j b_{k-j} = \sum_{j=0}^{k} x_j y_{k-j} + \sum_{j=k+1}^{n-1} x_j \underbrace{y_{2n-1+k-j}}_{=0} + \sum_{j=n}^{2n-2} \underbrace{x_j}_{=0}\, y_{2n-1+k-j}\,, \quad k = 0,\dots,n-1\,, $$

$$ z_k = \sum_{j=k-n+1}^{n-1} a_j b_{k-j} = \sum_{j=0}^{k-n} x_j \underbrace{y_{k-j}}_{=0} + \sum_{j=k-n+1}^{n-1} x_j y_{k-j} + \sum_{j=n}^{2n-2} \underbrace{x_j}_{=0}\, y_{k-j}\,, \quad k = n,\dots,2n-2\,. $$

This makes periodic and non-periodic discrete convolutions coincide. Writing x, y ∈ K^{2n−1} for the defining vectors of (x_k)_{k∈Z} and (y_k)_{k∈Z} we find

$$ (\mathbf{a} * \mathbf{b})_k = (\mathbf{x} *_{2n-1} \mathbf{y})_k\,, \quad k = 0,\dots,2n-2\,. \qquad (4.1.4.17) $$
In the spirit of (4.1.3.2) we can switch to a matrix view of the reduction to periodic convolution:

$$
\begin{bmatrix} z_0 \\ \vdots \\ \\ \\ \\ \vdots \\ z_{2n-2} \end{bmatrix}
=
\underbrace{\begin{bmatrix}
b_0 & 0 & \cdots & 0 & b_{n-1} & \cdots & b_1 \\
b_1 & b_0 & & & 0 & \ddots & \vdots \\
\vdots & & \ddots & & & \ddots & b_{n-1} \\
b_{n-1} & \cdots & b_1 & b_0 & 0 & \cdots & 0 \\
0 & b_{n-1} & & b_1 & b_0 & & \vdots \\
\vdots & \ddots & \ddots & & & \ddots & 0 \\
0 & \cdots & 0 & b_{n-1} & \cdots & b_1 & b_0
\end{bmatrix}}_{\text{a }(2n-1)\times(2n-1)\text{ circulant matrix!}}
\begin{bmatrix} a_0 \\ \vdots \\ a_{n-1} \\ 0 \\ \vdots \\ 0 \end{bmatrix}.
\qquad (4.1.4.18)
$$


Discrete convolution can be realized by multiplication with a circulant matrix (→ 4.1.4.12)

y
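The following sketch (our own illustration, assuming Eigen; not one of the lecture codes) spells this reduction out: both vectors are zero-padded to length 2n − 1 as in (4.1.4.16), the circulant matrix of (4.1.4.18) is built explicitly, and a single matrix×vector product yields a ∗ b.

#include <Eigen/Dense>

// Illustration of (4.1.4.17)/(4.1.4.18): discrete convolution of a, b in R^n
// realized as multiplication with a (2n-1) x (2n-1) circulant matrix built
// from the zero-padded vector b.
Eigen::VectorXd convByCirculant(const Eigen::VectorXd &a, const Eigen::VectorXd &b) {
  const Eigen::Index n = a.size();
  const Eigen::Index N = 2 * n - 1;  // length after zero padding
  Eigen::VectorXd x = Eigen::VectorXd::Zero(N), y = Eigen::VectorXd::Zero(N);
  x.head(n) = a;  // zero-padded a, see (4.1.4.16)
  y.head(n) = b;  // zero-padded b
  Eigen::MatrixXd C(N, N);
  for (Eigen::Index i = 0; i < N; ++i) {
    for (Eigen::Index j = 0; j < N; ++j) {
      C(i, j) = y[((i - j) % N + N) % N];  // circulant matrix generated by y
    }
  }
  return C * x;  // contains (a*b)_k, k = 0,...,2n-2
}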
Review question(s) 4.1.4.19 (Periodic Convolutions)
(Q4.1.4.19.A) Let (y_k) be the (finite) output signal obtained from an LT-FIR channel F with impulse response (. . . , 0, h0, h1, . . . , hn−1, 0, . . .) for a finite input signal (x_k) with duration (m − 1)∆t. For what p ∈ N do we get

$$ \sum_{\ell\in\mathbb{Z}} S_{\ell p}\bigl((y_k)\bigr) = F\Bigl(\sum_{\ell\in\mathbb{Z}} S_{\ell p}\bigl((x_k)\bigr)\Bigr)\,, $$

where S_ν is the time-shift operator: S_ν((z_k)) := (z_{k−ν})_{k∈Z}?


(Q4.1.4.19.B) [Zero-padding] As in Question (Q4.1.4.19.A) let (y_k)_{k∈Z} be the (finite) output signal obtained from an LT-FIR channel F with impulse response (. . . , 0, h0, h1, . . . , hn−1, 0, . . .), n ∈ N, for a finite input signal (x_k)_{k∈Z} with duration (m − 1)∆t, m ∈ N. For what p ∈ N can we count on

$$ y_k = \Bigl(F\Bigl(\sum_{\ell\in\mathbb{Z}} S_{\ell p}\bigl((x_k)\bigr)\Bigr)\Bigr)_k\,, \quad k = 0,\dots,m-1\ ? $$

4.2 Discrete Fourier Transform (DFT)


4.2.1 Diagonalizing Circulant Matrices

Video tutorial for Section 4.2.1 "Diagonalizing Circulant Matrices": (17 minutes)
Download link, tablet notes

Algorithms dealing with circulant matrices make use of their very special spectral properties. Full un-
derstanding requires familiarity with the theory of eigenvalues and eigenvectors of matrices from linear
algebra, see [NS02, Ch. 7], [Gut09, Ch. 9].

EXPERIMENT 4.2.1.1 (Eigenvectors of circulant matrices) Now we are about to discover a very deep
truth . . .
Experimentally, we examine the eigenvalues and eigenvectors of two random 8 × 8 circulant matrices C1, C2 (→ Def. 4.1.4.12), generated from random vectors with entries evenly distributed in [0, 1], VectorXd::Random(n).

[Fig. 121: real and imaginary parts of the eigenvalues of C1 and C2, plotted over the index of the eigenvalue.]

Little relationship between the (complex!) eigenvalues can be observed, as can be expected from random matrices with entries ∈ [0, 1].


Now: the surprise . . .


Eigenvectors of matrix C1 , visualized through the size of the real and imaginary parts of their components.
[Plots: real and imaginary parts of the components of the eight eigenvectors of the circulant matrix C1, each plotted over the vector component index.]

Eigenvectors of matrix C2
[Plots: real and imaginary parts of the components of the eight eigenvectors of the circulant matrix C2, each plotted over the vector component index.]

Observation: different random circulant matrices have the same eigenvectors!

Eigenvectors of circulant matrix C = circul([1, 2, . . . , 128]⊤ ):


[Plots: real and imaginary parts of selected eigenvectors (numbers 2, 3, 5, 8) of the circulant matrix, each plotted over the vector component index.]

The eigenvectors remind us of sampled trigonometric functions cos(k/n), sin(k/n), k = 0, . . . , n − 1! y

Remark 4.2.1.2 (Eigenvectors of commuting matrices) An abstract result from linear algebra puts the
surprising observation made in Exp. 4.2.1.1 in a wider context.


Theorem 4.2.1.3. Commuting matrices have the same eigenvectors

If A, B ∈ K n,n commute, that is, AB = BA, and A has n distinct eigenvalues, then the
eigenspaces of A and B coincide.

Proof. Let v ∈ K n \ {0} be an eigenvector of A with eigenvalue λ. Then

$$ (\mathbf{A}-\lambda\mathbf{I})\mathbf{v} = 0 \;\Rightarrow\; \mathbf{B}(\mathbf{A}-\lambda\mathbf{I})\mathbf{v} = 0 \;\overset{\mathbf{BA}=\mathbf{AB}}{\Rightarrow}\; (\mathbf{A}-\lambda\mathbf{I})\mathbf{B}\mathbf{v} = 0\,. $$

Since in the case of n distinct eigenvalues dim N (A − λI) = 1, we conclude that there is ξ ∈ K:
Bv = ξv, v is an eigenvector of B. Since the eigenvectors of A span K n , there cannot be eigenvectors
of B that are not eigenvectors of A.
Moreover, there is a basis of K n consisting of eigenvectors of B; B can be diagonalized.

Next, by straightforward calculation one verifies that every circulant matrix commutes with the unitary and
circulant cyclic permutation matrix
 
$$
\mathbf{S} =
\begin{bmatrix}
0 & 0 & \cdots & \cdots & 0 & 1 \\
1 & 0 & & & & 0 \\
0 & 1 & 0 & & & \vdots \\
\vdots & & \ddots & \ddots & & \vdots \\
\vdots & & & \ddots & \ddots & \vdots \\
0 & \cdots & \cdots & 0 & 1 & 0
\end{bmatrix}.
\qquad (4.2.1.4)
$$

As a unitary matrix, S can be diagonalized. Observe that S^n − I = O, while no non-trivial polynomial in S of degree < n annihilates S (the vectors e_1, S e_1, . . . , S^{n−1} e_1 are linearly independent); hence the minimal polynomial of S is ξ ↦ ξ^n − 1, which has n distinct roots (of unity). Therefore S has n distinct eigenvalues and, by Thm. 4.2.1.3, every eigenvector of S is also an eigenvector of any circulant matrix.
By elementary means we can compute the eigenvectors of S: Assume that
v = [v0 , . . . , vn−1 ] ∈ C n \ {0} satisfies Sv = λv for some λ ∈ C. This implies for its components
the following relationships:

$$ v_{j-1} = \lambda v_j\,, \quad j = 1,\dots,n-1\,, \qquad v_{n-1} = \lambda v_0\,, $$
$$ \Rightarrow\quad v_j = \lambda^{\,n-1-j}\,v_{n-1} = \lambda^{\,n-j}\,v_0 \quad\Rightarrow\quad \lambda^n = 1\,. $$

Hence λ is an n-th root of unity and has the representation λ = exp(2πı k/n), k ∈ {0, . . . , n − 1}. Setting v_0 := 1 we find

$$ v_j = \exp\Bigl(2\pi\imath\,\frac{k}{n}\Bigr)^{\!j} = \exp\Bigl(2\pi\imath\,\frac{kj}{n}\Bigr)\,, \quad j = 0,\dots,n-1\,. $$
y

Remark 4.2.1.5 (Why using K = C?) In Exp. 4.2.1.1 we saw that we get complex eigenvalues/eigen-
vectors for general circulant matrices. More generally, in many cases real matrices can be diagonalized
only in C, which is the ultimate reason for the importance of complex numbers.
Complex numbers also allow an elegant handling of trigonometric functions: recall from analysis the uni-
fied treatment of trigonometric functions via the complex exponential function

exp(it) = cos(t) + ı sin(t) , t ∈ R .


The field of complex numbers C is the natural framework for the analysis of linear, time-invariant filters, and for the development of algorithms for circulant matrices. y

§4.2.1.6 (Eigenvectors of circulant matrices) Now we verify by direct computations that circulant matrices all have a particular set of eigenvectors. This will entail computing in C, cf. Rem. 4.2.1.5.

✎ notation: the n-th root of unity ω_n := exp(−2πı/n) = cos(2π/n) − ı sin(2π/n), n ∈ N, satisfies

$$ \bar{\omega}_n = \omega_n^{-1}\,, \quad \omega_n^{\,n} = 1\,, \quad \omega_n^{\,n/2} = -1\,, \quad \omega_n^{\,k} = \omega_n^{\,k+n} \quad \forall k \in \mathbb{Z}\,, \qquad (4.2.1.7) $$

$$ \sum_{k=0}^{n-1} \omega_n^{\,kj} = \sum_{k=0}^{n-1} \bigl(\omega_n^{\,j}\bigr)^k = \begin{cases} n\,, & \text{if } j = 0 \bmod n\,, \\ 0\,, & \text{if } j \neq 0 \bmod n\,. \end{cases} \qquad (4.2.1.8) $$

(4.2.1.8) is a simple consequence of the geometric sum formula

$$ \sum_{k=0}^{n-1} q^k = \frac{1-q^n}{1-q} \quad \forall q \in \mathbb{C}\setminus\{1\}\,,\ n \in \mathbb{N}\,. \qquad (4.2.1.9) $$

$$ \Rightarrow\quad \sum_{k=0}^{n-1} \omega_n^{\,kj} = \frac{1-\omega_n^{\,nj}}{1-\omega_n^{\,j}} = \frac{1-\exp(-2\pi\imath j)}{1-\exp(-2\pi\imath j/n)} = 0 \quad \text{for } j \neq 0 \bmod n\,, $$

because exp(−2πıj) = ω_n^{nj} = (ω_n^n)^j = 1 for all j ∈ Z.

In expressions like ω_n^{kl} the term "kl" will always designate an exponent (the product k·l) and will never play the role of an index.

Now we want to confirm the conjecture gleaned from Exp. 4.2.1.1 that vectors with powers of roots of unity
are eigenvectors for any circulant matrix. We do this by simple and straightforward computations:
We consider a general circulant matrix C ∈ C^{n,n} (→ Def. 4.1.4.12), with c_{ij} := (C)_{i,j} = u_{i−j}, for an n-periodic sequence (u_k)_{k∈Z}, u_k ∈ C. We "guess" an eigenvector,

$$ \mathbf{v}_k \in \mathbb{C}^n:\quad \mathbf{v}_k := \bigl[\omega_n^{-jk}\bigr]_{j=0}^{n-1}\,, \quad k \in \{0,\dots,n-1\}\,, $$

and verify the eigenvector property by direct computation:

$$ (\mathbf{C}\mathbf{v}_k)_j = \sum_{l=0}^{n-1} u_{j-l}\,\omega_n^{-lk} = \sum_{l=0}^{n-1} u_l\,\omega_n^{-(j-l)k} = \omega_n^{-jk}\sum_{l=0}^{n-1} u_l\,\omega_n^{\,lk} = \lambda_k\cdot\omega_n^{-jk} = \lambda_k\cdot(\mathbf{v}_k)_j\,. \qquad (4.2.1.10) $$

(The second equality rests on a change of the summation index; the sum $\sum_{l=0}^{n-1} u_l\,\omega_n^{\,lk}$ is independent of j.)

➣ v_k is an eigenvector of C for the eigenvalue $\lambda_k = \sum_{l=0}^{n-1} u_l\,\omega_n^{\,lk}$.

The set {v_0, . . . , v_{n−1}} ⊂ C^n provides the so-called orthogonal trigonometric basis of C^n = eigenvector basis for circulant matrices:

$$
\{\mathbf{v}_0,\dots,\mathbf{v}_{n-1}\} =
\left\{
\begin{bmatrix} \omega_n^{0} \\ \omega_n^{0} \\ \vdots \\ \vdots \\ \omega_n^{0} \end{bmatrix},
\begin{bmatrix} \omega_n^{0} \\ \omega_n^{-1} \\ \omega_n^{-2} \\ \vdots \\ \omega_n^{1-n} \end{bmatrix},
\cdots,
\begin{bmatrix} \omega_n^{0} \\ \omega_n^{2-n} \\ \omega_n^{2(2-n)} \\ \vdots \\ \omega_n^{-(n-1)(n-2)} \end{bmatrix},
\begin{bmatrix} \omega_n^{0} \\ \omega_n^{1-n} \\ \omega_n^{2(1-n)} \\ \vdots \\ \omega_n^{-(n-1)^2} \end{bmatrix}
\right\}.
\qquad (4.2.1.11)
$$

From (4.2.1.8) we can conclude orthogonality of the basis vectors by straightforward computations:

$$ \mathbf{v}_k := \bigl[\omega_n^{-jk}\bigr]_{j=0}^{n-1} \in \mathbb{C}^n:\qquad \mathbf{v}_k^{\mathsf{H}}\mathbf{v}_m = \sum_{j=0}^{n-1} \omega_n^{\,jk}\,\omega_n^{-jm} = \sum_{j=0}^{n-1} \omega_n^{(k-m)j} \overset{(4.2.1.8)}{=} 0\,, \quad \text{if } k \neq m\,. \qquad (4.2.1.12) $$

The matrix effecting the change of basis from the trigonometric basis to the standard basis is called the Fourier matrix

$$
\mathbf{F}_n =
\begin{bmatrix}
\omega_n^{0} & \omega_n^{0} & \cdots & \omega_n^{0} \\
\omega_n^{0} & \omega_n^{1} & \cdots & \omega_n^{\,n-1} \\
\omega_n^{0} & \omega_n^{2} & \cdots & \omega_n^{\,2n-2} \\
\vdots & \vdots & & \vdots \\
\omega_n^{0} & \omega_n^{\,n-1} & \cdots & \omega_n^{(n-1)^2}
\end{bmatrix}
= \bigl[\omega_n^{\,\ell j}\bigr]_{\ell,j=0}^{n-1} \in \mathbb{C}^{n,n}\,.
\qquad (4.2.1.13)
$$

Lemma 4.2.1.14. Properties of Fourier matrices

The scaled Fourier matrix $\frac{1}{\sqrt{n}}\mathbf{F}_n$ is unitary (→ Def. 6.3.1.2): $\mathbf{F}_n^{-1} = \frac{1}{n}\overline{\mathbf{F}}_n = \frac{1}{n}\mathbf{F}_n^{\mathsf{H}}$.

Proof. The lemma is immediate from (4.2.1.12) and (4.2.1.8), because

$$ \bigl(\mathbf{F}_n\mathbf{F}_n^{\mathsf{H}}\bigr)_{l,j} = \sum_{k=0}^{n-1} \omega_n^{(l-1)k}\,\overline{\omega_n^{(j-1)k}} = \sum_{k=0}^{n-1} \omega_n^{(l-1)k}\,\omega_n^{-(j-1)k} = \sum_{k=0}^{n-1} \omega_n^{\,k(l-j)}\,, \quad 1 \le l, j \le n\,. $$

✷ y
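A small numerical sanity check of Lemma 4.2.1.14 (our own sketch, assuming Eigen; not one of the lecture codes): build F_n entrywise from (4.2.1.13) and verify that (1/n) F_n F_n^H is the identity up to roundoff.

#include <cmath>
#include <complex>
#include <iostream>
#include <Eigen/Dense>

// Build the Fourier matrix F_n from (4.2.1.13)
Eigen::MatrixXcd fourierMatrix(Eigen::Index n) {
  const std::complex<double> i(0.0, 1.0);
  Eigen::MatrixXcd F(n, n);
  for (Eigen::Index l = 0; l < n; ++l) {
    for (Eigen::Index j = 0; j < n; ++j) {
      F(l, j) = std::exp(-2.0 * M_PI * i * static_cast<double>(l * j) /
                         static_cast<double>(n));
    }
  }
  return F;
}

int main() {
  const Eigen::Index n = 8;
  const Eigen::MatrixXcd F = fourierMatrix(n);
  // || (1/n) F F^H - I || should be close to machine precision
  const Eigen::MatrixXcd T = F * F.adjoint() / static_cast<double>(n);
  std::cout << (T - Eigen::MatrixXcd::Identity(n, n)).norm() << std::endl;
  return 0;
}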

Remark 4.2.1.15 (Spectrum of Fourier matrix) We draw a conclusion from the properties stated in
Lemma 4.2.1.14:
$$ \frac{1}{n^2}\,\mathbf{F}_n^4 = \mathbf{I} \quad\Rightarrow\quad \sigma\Bigl(\tfrac{1}{\sqrt{n}}\mathbf{F}_n\Bigr) \subset \{1,-1,i,-i\}\,, $$

because, if λ ∈ C is an eigenvalue of F_n, then there is an eigenvector x ∈ C^n \ {0} such that F_n x = λx, see Def. 9.1.0.1. y

Lemma 4.2.1.16. Diagonalization of circulant matrices (→ Def. 4.1.4.12)

For any circulant matrix C ∈ K^{n,n}, c_{ij} = u_{i−j}, with (u_k)_{k∈Z} an n-periodic sequence, it holds that

$$ \mathbf{C}\mathbf{F}_n = \mathbf{F}_n\operatorname{diag}(d_0,\dots,d_{n-1})\,, \qquad [d_0,\dots,d_{n-1}]^\top = \mathbf{F}_n[u_0,\dots,u_{n-1}]^\top\,, $$

where F_n ∈ C^{n,n} is the Fourier matrix $\mathbf{F}_n = \bigl[\exp(-\tfrac{2\pi\imath}{n}\ell k)\bigr]_{\ell,k=0}^{n-1}$ from (4.2.1.13).


Proof. The computations from (4.2.1.10) established

$$ \mathbf{C}\mathbf{v}_k = \mathbf{C}\bigl(\mathbf{F}_n\bigr)_{:,k} = \bigl(\mathbf{F}_n\bigr)_{:,k}\sum_{\ell=0}^{n-1} u_\ell\,\omega_n^{\,\ell k} = \bigl(\mathbf{F}_n\bigr)_{:,k}\sum_{\ell=0}^{n-1} u_\ell\,(\mathbf{F}_n)_{k,\ell} = \bigl(\mathbf{F}_n\bigr)_{:,k}\,(\mathbf{F}_n\mathbf{u})_k\,. $$

Then invoke the rules of matrix×matrix multiplication.


From this lemma and the fact $\overline{\mathbf{F}}_n = n\,\mathbf{F}_n^{-1}$ we conclude

$$ \mathbf{C} = \mathbf{F}_n^{-1}\operatorname{diag}(d_0,\dots,d_{n-1})\,\mathbf{F}_n\,, \qquad [d_0,\dots,d_{n-1}]^\top = \mathbf{F}_n[u_0,\dots,u_{n-1}]^\top\,. \qquad (4.2.1.17) $$
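The identity (4.2.1.17) can be checked numerically with a few lines of Eigen code (our own sketch, not one of the lecture codes; all helper constructions are done inline): build C = circul(u) and F_n entrywise, compute d = F_n u, and confirm C F_n = F_n diag(d_0, ..., d_{n-1}) up to roundoff.

#include <cmath>
#include <complex>
#include <iostream>
#include <Eigen/Dense>

int main() {
  const Eigen::Index n = 8;
  const std::complex<double> i(0.0, 1.0);
  Eigen::VectorXcd u = Eigen::VectorXcd::Random(n);  // generating vector
  Eigen::MatrixXcd C(n, n), F(n, n);
  for (Eigen::Index k = 0; k < n; ++k) {
    for (Eigen::Index j = 0; j < n; ++j) {
      C(k, j) = u[((k - j) % n + n) % n];  // circulant matrix, (C)_{kj} = u_{k-j}
      F(k, j) = std::exp(-2.0 * M_PI * i * static_cast<double>(k * j) /
                         static_cast<double>(n));
    }
  }
  const Eigen::VectorXcd d = F * u;             // eigenvalues of C, Lemma 4.2.1.16
  const Eigen::MatrixXcd R = F * d.asDiagonal();  // F_n * diag(d)
  std::cout << (C * F - R).norm() << std::endl;   // should be close to zero
  return 0;
}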

As a consequence of Lemma 4.2.1.16 and (4.2.1.17), multiplication with the Fourier matrix will be a crucial operation in algorithms for circulant matrices and discrete convolutions. Therefore this operation has been given a special name:

Definition 4.2.1.18. Discrete Fourier transform (DFT)

The linear map DFTn : C n 7→ C n , DFTn (y) := Fn y, y ∈ C n , is called discrete Fourier transform
(DFT), i.e. for [c0 , . . . , cn−1 ] := DFTn (y)

$$ c_k = \sum_{j=0}^{n-1} y_j\,\omega_n^{\,kj} = \sum_{j=0}^{n-1} y_j\exp\Bigl(-2\pi\imath\,\frac{kj}{n}\Bigr)\,, \quad k = 0,\dots,n-1\,. \qquad (4.2.1.19) $$

Recall the convention also adopted for the discussion of the DFT: vector indexes range from 0 to n − 1!

Terminology: The result of DFT, c = DFTn (y) = Fn y, is also called the (discrete) Fourier transform of
y.

From $\mathbf{F}_n^{-1} = \frac{1}{n}\overline{\mathbf{F}}_n$ (→ Lemma 4.2.1.14) we find the inverse discrete Fourier transform:

$$ c_k = \sum_{j=0}^{n-1} y_j\,\omega_n^{\,kj} \quad\Longleftrightarrow\quad y_j = \frac{1}{n}\sum_{k=0}^{n-1} c_k\,\omega_n^{-kj}\,. \qquad (4.2.1.20) $$

§4.2.1.21 (Discrete Fourier transform in E IGEN and P YTHON)


• E IGEN-functions for discrete Fourier transform (and its inverse):
DFT: c=fft.fwd(y) ↔ inverse DFT: y=fft.inv(c);
Before using fft, remember to:
1. # include <unsupported/Eigen/FFT>
2. Instantiate helper class Eigen::FFT<double> fft;
(The template argument should always be double.)

C++ code 4.2.1.22: Demo: discrete Fourier transform in EIGEN ➺ GITLAB

int main(int /*argc*/, char** /*argv*/) {
  using Comp = std::complex<double>;
  const Eigen::VectorXcd::Index n = 5;
  Eigen::VectorXcd y(n);
  Eigen::VectorXcd c(n);
  Eigen::VectorXcd x(n);
  y << Comp(1, 0), Comp(2, 1), Comp(3, 2), Comp(4, 3), Comp(5, 4);
  Eigen::FFT<double> fft;  // DFT transform object
  c = fft.fwd(y);          // DFT of y, see Def. 4.2.1.18
  x = fft.inv(c);          // inverse DFT of c, see (4.2.1.20)

  std::cout << "y = " << y.transpose() << std::endl
            << "c = " << c.transpose() << std::endl
            << "x = " << x.transpose() << std::endl;
  return 0;
}

• P YTHON-functions for discrete Fourier transform (and its inverse) are provided by the package
scipy.fft
DFT: c=scipy.fft(y) ↔ inverse DFT: y=scipy.ifft(c),
where y and c are numpy-arrays.
y

Review question(s) 4.2.1.23 (Diagonalizing circulant matrices)


(Q4.2.1.23.A) To practice complex arithmetic, compute the discrete Fourier transform of x = [1, 2 − ı, −ı, −1 + 2ı].
(Q4.2.1.23.B) Denote by DFT_n : C^n → C^n the discrete Fourier transform,

$$ (\mathrm{DFT}_n\,\mathbf{y})_k := \sum_{j=0}^{n-1} y_j\,\omega_n^{\,kj}\,, \quad k = 0,\dots,n-1\,, \quad \omega_n = \exp\bigl(-\tfrac{2\pi\imath}{n}\bigr)\,. \qquad (4.2.1.19) $$

Show that

$$ \mathbf{x}^{\mathsf{H}}\mathbf{y} = \tfrac{1}{n}\,\mathrm{DFT}_n(\mathbf{x})^{\mathsf{H}}\,\mathrm{DFT}_n(\mathbf{y})\,. $$

Use the following lemma:

Lemma 4.2.1.14. Properties of Fourier matrix

The scaled Fourier matrix $\frac{1}{\sqrt{n}}\mathbf{F}_n$ is unitary (→ Def. 6.3.1.2): $\mathbf{F}_n^{-1} = \frac{1}{n}\overline{\mathbf{F}}_n = \frac{1}{n}\mathbf{F}_n^{\mathsf{H}}$.

(Q4.2.1.23.C) Explain why the identity

$$ \mathbf{C} = \mathbf{F}_n^{-1}\operatorname{diag}(d_0,\dots,d_{n-1})\,\mathbf{F}_n\,, \qquad [d_0,\dots,d_{n-1}]^\top = \mathbf{F}_n[u_0,\dots,u_{n-1}]^\top\,, \qquad (4.2.1.17) $$

where u := [u0, . . . , un−1]⊤ ∈ C^n is the generating vector of the circulant matrix C according to the formula (C)_{k,ℓ} = (u)_{(ℓ−k) mod n}, still makes sense for the very special "circulant" matrix C = I_n.
(Q4.2.1.23.D) What are the eigenvalues and eigenvectors of the permutation matrix

$$
\mathbf{P} =
\begin{bmatrix}
0 & \cdots & \cdots & 0 & 1 \\
1 & \ddots & & & 0 \\
0 & \ddots & \ddots & & \vdots \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & 1 & 0
\end{bmatrix}
\in \mathbb{C}^{n,n}
$$


that effects a cyclic permutation of the entries of an n-vector?


(Q4.2.1.23.E) [SVD of circulant matrices] Based on the results

Lemma 4.2.1.16. Diagonalization of circulant matrices

For any circulant matrix C ∈ K^{n,n}, c_{ij} = u_{i−j}, with (u_k)_{k∈Z} an n-periodic sequence, it holds that

$$ \mathbf{C}\mathbf{F}_n = \mathbf{F}_n\operatorname{diag}(d_0,\dots,d_{n-1})\,, \qquad [d_0,\dots,d_{n-1}]^\top = \mathbf{F}_n[u_0,\dots,u_{n-1}]^\top\,. $$

and

Lemma 4.2.1.14. Properties of Fourier matrices

The scaled Fourier matrix $\frac{1}{\sqrt{n}}\mathbf{F}_n$ is unitary (→ Def. 6.3.1.2): $\mathbf{F}_n^{-1} = \frac{1}{n}\overline{\mathbf{F}}_n = \frac{1}{n}\mathbf{F}_n^{\mathsf{H}}$.

outline an efficient algorithm for computing the singular-value decomposition (SVD) of a circulant matrix.

Hint.
• Every z ∈ C can be written as z = rz0 with r ≥ 0 and |z0 | = 1.
• For matrices A ∈ C m,n the full SVD reads A = UΣVH and involves unitary factors U ∈ C m,m
and V ∈ C n,n .
(Q4.2.1.23.F) [Diagonal circulant matrices] Characterize the set of all complex diagonal circulant
n × n-matrices, n ∈ N. Is this set a subspace of R n,n ?

4.2.2 Discrete Convolution via Discrete Fourier Transform


Video tutorial for Section 4.2.2 "Discrete Convolution via DFT": (7 minutes) Download link,
tablet notes

Coding the formula for the discrete periodic convolution of two periodic sequences from Def. 4.1.4.7,

$$ (y_k) := (u_k) *_n (x_k)\,, \qquad y_k := \sum_{j=0}^{n-1} u_{k-j}\,x_j = \sum_{j=0}^{n-1} x_{k-j}\,u_j\,, \quad k \in \{0,\dots,n-1\}\,, $$

one could do so in a straightforward manner using two nested loops as in the following code, with an asymptotic computational effort of O(n²) for n → ∞.

C++ code 4.2.2.1: Discrete periodic convolution: straightforward implementation ➺ GITLAB


Eigen::VectorXcd pconv(const Eigen::VectorXcd &u, const Eigen::VectorXcd &x) {
  const Eigen::Index n = x.size();
  Eigen::VectorXcd z = Eigen::VectorXcd::Zero(n);
  // "naive" two-loop implementation of discrete periodic convolution
  for (Eigen::Index k = 0; k < n; ++k) {
    for (Eigen::Index j = 0, l = k; j <= k; ++j, --l) {
      z[k] += u[l] * x[j];
    }
    for (Eigen::Index j = k + 1, l = n - 1; j < n; ++j, --l) {
      z[k] += u[l] * x[j];
    }
  }
  return z;
}

This code relies on the associated vectors u = [u0, . . . , un−1]⊤ ∈ C^n and x = [x0, . . . , xn−1]⊤ ∈ C^n for the sequences (u_k) and (x_k), respectively. Using these vectors, indexed from 0, the periodic convolution formula becomes

$$ y_k = \sum_{j=0}^{k} (\mathbf{u})_{k-j}\,(\mathbf{x})_j + \sum_{j=k+1}^{n-1} (\mathbf{u})_{n+k-j}\,(\mathbf{x})_j\,. $$

Let us assume that a “magic” very efficient implementation of the discrete Fourier transform (DFT) is
available (→ Section 4.3). Then a much faster implementation of pconv() is possible and it is based
on the link with the periodic discrete convolution of Def. 4.1.4.7. In § 4.1.4.11 we have seen that periodic
convolution amounts to multiplication with a circulant matrix. In addition, (4.2.1.17) reduces multiplication
with a circulant matrix to two multiplications with the Fourier matrix Fn (= DFT) and (componentwise)
scaling operations. This suggests how to exploit the equivalence
discrete periodic convolution  $z_k = \sum_{j=0}^{n-1} u_{k-j}\,x_j$  (→ Def. 4.1.4.7),  k = 0, . . . , n − 1

⇕

multiplication with a circulant matrix (→ Def. 4.1.4.12)  $\mathbf{z} = \mathbf{C}\mathbf{x}$,  $\mathbf{C} := \bigl[u_{i-j}\bigr]_{i,j=1}^{n}$.

Idea: (4.2.1.17) ➣ $\mathbf{z} = \mathbf{F}_n^{-1}\operatorname{diag}(\mathbf{F}_n\mathbf{u})\,\mathbf{F}_n\mathbf{x}$

This formula is usually referred to as convolution theorem:

Theorem 4.2.2.2. Convolution theorem


The discrete periodic convolution ∗_n between n-dimensional vectors u and x is equal to the inverse DFT of the componentwise product of the DFTs of u and x; i.e.,

$$ \mathbf{u} *_n \mathbf{x} := \Bigl[\sum_{j=0}^{n-1} u_{(k-j)\bmod n}\,x_j\Bigr]_{k=0}^{n-1} = \mathbf{F}_n^{-1}\Bigl[(\mathbf{F}_n\mathbf{u})_j\,(\mathbf{F}_n\mathbf{x})_j\Bigr]_{j=1}^{n}\,. \qquad (4.2.2.3) $$

Cast in a C++ function computing the periodic discrete convolution of two vectors the convolution theorem
reads:

C++ code 4.2.2.4: Discrete periodic convolution: DFT implementation ➺ GITLAB


Eigen::VectorXcd pconvfft(const Eigen::VectorXcd &u,
                          const Eigen::VectorXcd &x) {
  Eigen::FFT<double> fft;
  return fft.inv(((fft.fwd(u)).cwiseProduct(fft.fwd(x))).eval());
}


In Rem. 4.1.4.15 we learned that the discrete convolution of n-vectors (→ Def. 4.1.3.3) can be
accomplished by the periodic discrete convolution of 2n − 1-vectors (obtained by zero padding, see
Rem. 4.1.4.15):
   
$$ \mathbf{a},\mathbf{b} \in \mathbb{C}^n:\qquad \mathbf{a} * \mathbf{b} = \begin{bmatrix} \mathbf{a} \\ \mathbf{0} \end{bmatrix} *_{2n-1} \begin{bmatrix} \mathbf{b} \\ \mathbf{0} \end{bmatrix} \in \mathbb{C}^{2n-1}\,. $$

This idea underlies the following C++ implementation of the discrete convolution of two vectors.

C++ code 4.2.2.5: Implementation of discrete convolution (→ Def. 4.1.3.3) based on periodic
discrete convolution ➺ GITLAB
Eigen::VectorXcd fastconv(const Eigen::VectorXcd &h,
                          const Eigen::VectorXcd &x) {
  assert(x.size() == h.size());
  const Eigen::Index n = h.size();
  // Zero padding, cf. (4.1.4.16), and periodic discrete convolution
  // of length 2n-1, Code 4.2.2.4
  return pconvfft(
      (Eigen::VectorXcd(2 * n - 1) << h, Eigen::VectorXcd::Zero(n - 1))
          .finished(),
      (Eigen::VectorXcd(2 * n - 1) << x, Eigen::VectorXcd::Zero(n - 1))
          .finished());
}

Review question(s) 4.2.2.6 (Discrete convolution via DFT)


(Q4.2.2.6.A) We saw that the discrete convolution of two vectors h and x of length n can be accomplished
by
y = pconvfft(
(Eigen::VectorXcd(2 * n - 1) << h, Eigen::VectorXcd::Zero(n - 1))
.finished(),
(Eigen::VectorXcd(2 * n - 1) << x, Eigen::VectorXcd::Zero(n - 1))
.finished());

Here, pconv() implements periodic discrete convolution. How do you have to change the code so
that it can compute the discrete convolution of two vectors h ∈ R n , x ∈ R m for general n, m ∈ N?

4.2.3 Frequency filtering via DFT

Video tutorial for Section 4.2.3 "Frequency filtering via DFT": (20 minutes) Download link,
tablet notes

☞ A nice introduction by 3Blue1Brown can be found here.

The trigonometric basis vectors,

$$ \mathbf{v}_k := \bigl[\exp(2\pi\imath jk/n)\bigr]_{j=0}^{n-1} = \Bigl[\cos\Bigl(\tfrac{2\pi jk}{n}\Bigr) + \imath\,\sin\Bigl(\tfrac{2\pi jk}{n}\Bigr)\Bigr]_{j=0}^{n-1} \in \mathbb{C}^n\,, \qquad (4.2.3.1) $$

when interpreted as time-periodic signals, represent harmonic oscillations. This is illustrated when plotting
some vectors of the trigonometric basis (n = 16):


[Fig. 122: real and imaginary parts of the Fourier basis vectors for n = 16 and j = 1, 7, 15, plotted over the vector component k: "slow oscillation/low frequency" (j = 1), "fast oscillation/high frequency" (j = 7), "slow oscillation/low frequency" (j = 15).]


Dominant coefficients of a signal after transformation to the trigonometric basis indicate dominant frequency components.
We say that the coefficients of a signal w.r.t. the trigonometric basis represent the signal in the frequency domain; the original signal, represented in the "pulse basis" of coordinate vectors, is given in the time domain.
§4.2.3.2 (Frequency decomposition of a signal) Since the trigonometric basis vectors form the columns of the (complex conjugate) Fourier matrix, the DFT (4.2.1.19) and the inverse DFT (4.2.1.20),

$$ c_k = \sum_{j=0}^{n-1} y_j\,\omega_n^{\,kj} \quad\Longleftrightarrow\quad y_j = \frac{1}{n}\sum_{k=0}^{n-1} c_k\,\omega_n^{-kj}\,, \qquad (4.2.1.20) $$

effect the transformation from the "pulse basis" into a (scaled) trigonometric basis and vice versa. Thus, they convert the time-domain and frequency-domain representations of a signal into each other. To see this more clearly, we examine a real-valued signal of length n = 2m + 1, m ∈ N: y_k ∈ R. Its DFT yields c_0, . . . , c_{n−1}, and these coefficients satisfy $c_k = \overline{c_{n-k}}$, because $\omega_n^{\,kj} = \overline{\omega_n^{(n-k)j}}$. Using this relationship we can write the original signal as a linear combination of sampled trigonometric functions with "frequencies" k = 0, . . . , m:

$$ n\,y_j = c_0 + \sum_{k=1}^{m} c_k\,\omega_n^{-kj} + \sum_{k=m+1}^{2m} c_k\,\omega_n^{-kj} = c_0 + \sum_{k=1}^{m}\bigl(c_k\,\omega_n^{-kj} + c_{n-k}\,\omega_n^{(k-n)j}\bigr) $$
$$ = c_0 + 2\sum_{k=1}^{m}\Bigl(\operatorname{Re}(c_k)\cos(2\pi kj/n) - \operatorname{Im}(c_k)\sin(2\pi kj/n)\Bigr)\,, \quad j = 0,\dots,n-1\,, $$

since $\omega_n^{\,\ell} = \cos(2\pi\ell/n) - \imath\,\sin(2\pi\ell/n)$.


➣ The moduli |c_k|, |c_{n−k}| of the coefficients obtained by DFT measure the strength with which an oscillation of frequency k is represented in the signal, 0 ≤ k ≤ ⌊n/2⌋.
y
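The decomposition above can be verified numerically. The following sketch (our own, assuming Eigen's unsupported FFT module as in Code 4.2.1.22; variable names are our choices) reconstructs a real signal of odd length n = 2m + 1 from c_0 and the real/imaginary parts of c_1, ..., c_m.

#include <cmath>
#include <iostream>
#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>

int main() {
  const int n = 9, m = (n - 1) / 2;
  Eigen::VectorXd y = Eigen::VectorXd::Random(n);  // real signal of odd length
  Eigen::FFT<double> fft;
  const Eigen::VectorXcd c = fft.fwd(y);           // discrete Fourier transform
  Eigen::VectorXd z(n);                            // reconstruction from c_0,...,c_m
  for (int j = 0; j < n; ++j) {
    double s = c[0].real();
    for (int k = 1; k <= m; ++k) {
      s += 2.0 * (c[k].real() * std::cos(2.0 * M_PI * k * j / n) -
                  c[k].imag() * std::sin(2.0 * M_PI * k * j / n));
    }
    z[j] = s / n;
  }
  std::cout << (y - z).norm() << std::endl;  // close to zero up to roundoff
  return 0;
}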

EXAMPLE 4.2.3.3 (Frequency identification with DFT) The following C++ code generates a periodic signal composed of two base frequencies and distorts it by adding large random noise:
C++ code 4.2.3.4: Generation of noisy sinusoidal signal ➺ GITLAB


VectorXd signalgen() {
  const int N = 64;
  const ArrayXd t = ArrayXd::LinSpaced(N, 0, N);
  const VectorXd x = ((2 * M_PI / N * t).sin() + (14 * M_PI / N * t).sin()).matrix();
  return x + VectorXd::Random(N);
}


[Fig. 123: the noisy signal plotted over the sampling points (time); Fig. 124: the squared moduli |c_k|² of its DFT coefficients plotted over the coefficient index k.]

Looking at the time-domain plot of the signal given in Fig. 123 (C++ code ➺ GITLAB), it is hard to discern the underlying base frequencies. However, the frequencies present in the unperturbed signal are clearly evident in the frequency-domain representation after DFT. y

EXAMPLE 4.2.3.5 (Detecting periodicity in data)

Google provides information about the frequency of certain words in web searches. In this example we study the data for the word "Vorlesungsverzeichnis".

[Fig. 125: raw data, weekly occurrences of "Vorlesungsverzeichnis" in Google web searches.]

Some periodic pattern in the data is conspicuous to the "naked eye". How can an algorithm detect the inherent periodicity of the data and find out the periods? Of course, by means of DFT!


DFT of the signal (y0, . . . , yn−1), corresponding to the number of searches containing "Vorlesungsverzeichnis" in a large number n of consecutive weeks, yields the coefficients c_j, j = 0, . . . , n − 1.

[Fig. 126: the "(Fourier) power spectrum" of the signal, the values |c_j|², 0 ≤ j ≤ ⌊n/2⌋, computed with the C++ code ➺ GITLAB.]

We see that pronounced peaks in the power spectrum point to a periodic structure of the data. The coefficient indices of the peaks of the power spectrum tell us the dominant frequencies present in the data and, after inversion, the lengths of the dominant periods.

We can state the main message of this example as follows:

DFT is a computer’s eye for periodic patterns in data
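For completeness, a minimal sketch (our own helper name powerSpectrum, assuming Eigen's FFT module; not one of the lecture codes) of how the power spectrum used in the two examples above can be computed.

#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>

// Minimal sketch: power spectrum |c_k|^2, k = 0,...,floor(n/2), of a real
// signal y, as used for detecting dominant frequencies/periods above.
Eigen::VectorXd powerSpectrum(const Eigen::VectorXd &y) {
  Eigen::FFT<double> fft;
  const Eigen::VectorXcd c = fft.fwd(y);          // DFT of the signal
  const Eigen::Index m = y.size() / 2;
  Eigen::VectorXd ps = c.head(m + 1).cwiseAbs2(); // squared moduli |c_k|^2
  return ps;
}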

§4.2.3.6 (“Low” and “high” frequencies) Again, look at plots of real parts of trigonometric basis vectors
(Fn ):,j (= columns of Fourier matrix), n = 16.
[Plots: the real parts Re(F₁₆)_{:,j} of the trigonometric basis vectors for j = 0, . . . , 8, each plotted over the vector component index.]


What about the remaining columns? They are just the first ones “wrapped around”:
$$ (\mathbf{F}_n)_{:,n-k} = \bigl[\omega_n^{\,j(n-k)}\bigr]_{j=0}^{n-1} \overset{(4.2.1.7)}{=} \bigl[\omega_n^{-jk}\bigr]_{j=0}^{n-1} = \overline{\bigl(\mathbf{F}_n\bigr)_{:,k}}\,, \quad k = 0,\dots,n-1\,. $$

(Here we adopt C++ indexing and count the columns of the matrix from 0.)
Visually, the different basis vectors represent oscillatory signals with different frequencies. This is also revealed by elementary trigonometric identities:

$$ \operatorname{Re}(\mathbf{F}_n)_{:,j} = \bigl[\operatorname{Re}\,\omega_n^{(j-1)k}\bigr]_{k=0}^{n-1} = \bigl[\operatorname{Re}\exp(-2\pi i(j-1)k/n)\bigr]_{k=0}^{n-1} = \bigl[\cos(2\pi(j-1)x)\bigr]_{x=0,\frac{1}{n},\dots,1-\frac{1}{n}}\,. $$

• Slow oscillations/low frequencies correspond to indices j ≈ 1 and j ≈ n.
• Fast oscillations/high frequencies correspond to indices j ≈ n/2.

[Fig. 127: the n-th roots of unity on the unit circle in the complex plane; roots near the real axis at 1 correspond to low frequencies, roots near −1 to high frequencies.]

The task of frequency filtering is to suppress or enhance a predefined range of frequencies contained in a
signal.

➊ Perform the DFT of the signal: (y0, . . . , yn−1) ↦ (c0, . . . , cn−1).
➋ Operate on the Fourier coefficients: (c0, . . . , cn−1) ↦ (c̃0, . . . , c̃n−1).
➌ Obtain the filtered signal by inverse DFT: (c̃0, . . . , c̃n−1) ↦ (ỹ0, . . . , ỹn−1).

The following code does digital low-pass and high-pass filtering of a signal based on DFT and inverse
DFT. It sets the obtained Fourier coefficients corresponding to high/low frequencies to zero and afterwards
transforms back to time domain.

C++ code 4.2.3.7: DFT-based frequency filtering ➺ GITLAB


inline void freqfilter(const VectorXd &y, int k, VectorXd &low, VectorXd &high) {
  const VectorXd::Index n = y.size();
  if (n % 2 != 0) {
    throw std::runtime_error("Even vector length required!");
  }
  const VectorXd::Index m = y.size() / 2;

  Eigen::FFT<double> fft;          // DFT helper object
  const VectorXcd c = fft.fwd(y);  // Perform DFT of input vector

  VectorXcd clow = c;
  // Set high frequency coefficients to zero, Fig. 127
  for (int j = -k; j <= +k; ++j) {
    clow(m + j) = 0;
  }
  // (Complementary) vector of high frequency coefficients
  const VectorXcd chigh = c - clow;

  // Recover filtered time-domain signals
  low = fft.inv(clow).real();
  high = fft.inv(chigh).real();
}

The code could be optimized by exploiting y_j ∈ R and $c_{n/2-k} = \overline{c_{n/2+k}}$.

Summary:
Map y ↦ low (in Code 4.2.3.7) ≙ low-pass filter.
Map y ↦ high (in Code 4.2.3.7) ≙ high-pass filter.
y

EXAMPLE 4.2.3.8 (Denoising by frequency filtering)


Signal perturbed by “deterministic noise”:
n = 256; y = exp(sin(2*pi*((0:n-1)’)/n)) + 0.5*sin(exp(1:n)’);
We performed frequency filtering by Code 4.2.3.7 with k = 120.
[Plots: left, the noisy signal together with the results of low-pass and high-pass filtering over time; right, the moduli |c_k| of the Fourier coefficients over the coefficient index.]

Low pass filtering can be used for denoising, that is, the removal of high frequency perturbations of a
signal. y

EXAMPLE 4.2.3.9 (Sound filtering by DFT) Frequency filtering is ubiquitous in sound processing.
Here we demonstrate it in P YTHON ➺ GITLAB, which offers tools for audio processing through the
sounddevice module.


[Fig. 128: the sampled sound signal (sound pressure over time in seconds); Fig. 129: the power spectrum |c_k|² of the sound signal over the index k of the Fourier coefficients, with the Nyquist frequency marked.]

The audio signal (duration ≈ 1.5 s) of a human voice is plotted in the time domain (vector y ∈ R^n, n = 63274) and as a power spectrum in the frequency domain. The power spectrum of a signal y ∈ C^n is the vector $\bigl[|c_j|^2\bigr]_{j=0}^{n-1}$, where c = DFT_n y = F_n y is the discrete Fourier transform of y.

We see that the bulk of the signal's power $\|\mathbf{y}\|_2^2$ is contained in the low-frequency components. This paves the way for compressing the signal by low-pass filtering, that is, by dropping its high-frequency components and storing or transmitting only the discrete Fourier coefficients belonging to low frequencies. Refer to § 4.2.3.6 for precise information about the association of Fourier coefficients c_j with low and high frequencies.

Below we plot the squared moduli |c_j|² of the Fourier coefficients belonging to low frequencies and the low-pass filtered sound signals for different cut-off frequencies. Taking into account only low-frequency discrete Fourier coefficients does not severely distort the sound signal.

[Fig. 130: the low-frequency part of the power spectrum, |ck |2 versus index k; Fig. 131: the original sound signal and its low-pass filtered versions for cut-off frequencies 5000, 3000, and 1000, over a short time window.]
y
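The power spectrum itself is cheap to compute with EIGEN's FFT module. The following C++ sketch is my own (the function name powerSpectrum is made up); it returns the vector of squared moduli |c_k|^2 for a real time-discrete signal y.

#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>
// Sketch: power spectrum |c_k|^2 of a real, time-discrete signal y.
Eigen::VectorXd powerSpectrum(const Eigen::VectorXd &y) {
  Eigen::FFT<double> fft;
  const Eigen::VectorXcd c = fft.fwd(y);  // c = DFT_n y
  return c.cwiseAbs2();                   // componentwise |c_k|^2
}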
Remark 4.2.3.10 (Linear Filtering) Low-pass and high-pass filtering via DFT implement a very general
policy underlying general linear filtering of finite, time-discrete signals, represented by vectors y ∈ C n :
➊ Compute the coefficients of an alternative basis representation of y.
➋ Apply some linear mapping to the obtained coefficient vector c ∈ C n , yielding c̃.
➌ Recover the representation of c̃ in the standard basis of C n .
y
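The three-step policy can be wrapped into a small generic helper. The following C++ sketch is my own illustration (not one of the lecture codes): it uses the DFT as the basis transform of steps ➊ and ➌ and takes an arbitrary functor for step ➋; the name linearFilter is made up.

#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>
#include <functional>
// Generic DFT-based linear filter: transform, modify coefficients, transform back.
Eigen::VectorXcd linearFilter(
    const Eigen::VectorXcd &y,
    const std::function<Eigen::VectorXcd(const Eigen::VectorXcd &)> &modify) {
  Eigen::FFT<double> fft;
  const Eigen::VectorXcd c = fft.fwd(y);   // ➊ coefficients in the Fourier basis
  const Eigen::VectorXcd ct = modify(c);   // ➋ linear map acting on the coefficients
  return fft.inv(ct);                      // ➌ back to the standard basis
}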
Review question(s) 4.2.3.11 (Frequency Filtering via DFT)
(Q4.2.3.11.A) Let y ∈ R n , n = 2k , k ∈ N, be a vector describing an analog time-discrete finite signal.


Denote by c ∈ C n its discrete Fourier transform. Which components of c are related to “low-frequency
content”, and which to “high-frequency content”?

4.2.4 Real DFT


Every time-discrete signal obtained from sampling a time-dependent physical quantity will yield a real vector. Of course, a real vector contains only half the information compared to a complex vector of the same length. We aim to exploit this for a more efficient implementation of the DFT.

Task: Efficient implementation of the DFT (Def. 4.2.1.18), $(c_0,\dots,c_{n-1})$, for real coefficients $(y_0,\dots,y_{n-1})^{\top}\in\mathbb R^n$, n = 2m, m ∈ N.

If $y_j\in\mathbb R$ in the DFT formula (4.2.1.19), we obtain redundant output: since $\omega_n^{(n-k)j} = \overline{\omega_n^{kj}}$, k = 0, . . . , n − 1, we conclude the following relationship between the discrete Fourier coefficients of a real-valued signal:

$c_k = \sum_{j=0}^{n-1} y_j\,\omega_n^{kj} = \overline{\sum_{j=0}^{n-1} y_j\,\omega_n^{(n-k)j}} = \overline{c_{n-k}}\,,\quad k = 1,\dots,n-1\,.$

Idea: Map y ∈ R n to a vector in C m and use a DFT of length m on it.

$h_k = \sum_{j=0}^{m-1}(y_{2j}+iy_{2j+1})\,\omega_m^{jk} = \sum_{j=0}^{m-1} y_{2j}\,\omega_m^{jk} + i\cdot\sum_{j=0}^{m-1} y_{2j+1}\,\omega_m^{jk}\,,\qquad(4.2.4.1)$

$\overline{h_{m-k}} = \overline{\sum_{j=0}^{m-1}(y_{2j}+iy_{2j+1})\,\omega_m^{j(m-k)}} = \sum_{j=0}^{m-1} y_{2j}\,\omega_m^{jk} - i\cdot\sum_{j=0}^{m-1} y_{2j+1}\,\omega_m^{jk}\,.\qquad(4.2.4.2)$

Thus, we can recover the two sums above from suitable combinations of the discrete Fourier coefficients $h_k\in\mathbb C$, k = 0, . . . , m − 1:

$\Rightarrow\quad \sum_{j=0}^{m-1} y_{2j}\,\omega_m^{jk} = \tfrac12\big(h_k+\overline{h_{m-k}}\big)\,,\qquad \sum_{j=0}^{m-1} y_{2j+1}\,\omega_m^{jk} = -\tfrac12 i\big(h_k-\overline{h_{m-k}}\big)\,.$

Use simple identities for roots of unity to split the DFT of y into two sums:

$c_k = \sum_{j=0}^{n-1} y_j\,\omega_n^{jk} = \sum_{j=0}^{m-1} y_{2j}\,\omega_m^{jk} + \omega_n^{k}\cdot\sum_{j=0}^{m-1} y_{2j+1}\,\omega_m^{jk}\,.\qquad(4.2.4.3)$

$\begin{cases} c_k = \tfrac12\big(h_k+\overline{h_{m-k}}\big) - \tfrac12 i\,\omega_n^{k}\big(h_k-\overline{h_{m-k}}\big)\,, & k = 0,\dots,m-1\,,\\ c_m = \mathrm{Re}\{h_0\}-\mathrm{Im}\{h_0\}\,,\\ c_k = \overline{c_{n-k}}\,, & k = m+1,\dots,n-1\,. \end{cases}\qquad(4.2.4.4)$

C++ code 4.2.4.5: DFT of real vectors of length n/2 ➺ GITLAB

// Perform fft on a real vector y of even
// length and return (complex) coefficients in c
// Note: EIGEN's DFT method fwd() has this already implemented and
// we could also just call: c = fft.fwd(y);
void fftreal(const VectorXd& y, VectorXcd& c) {
  const Eigen::Index n = y.size();
  const Eigen::Index m = n / 2;
  if (n % 2 != 0) { std::cout << "n must be even!\n"; return; }

  // Step I: compute h from (4.2.4.1), (4.2.4.2)
  const std::complex<double> i(0, 1);  // Imaginary unit
  VectorXcd yc(m);
  for (Eigen::Index j = 0; j < m; ++j) {
    yc(j) = y(2 * j) + i * y(2 * j + 1);
  }

  Eigen::FFT<double> fft;
  VectorXcd d = fft.fwd(yc);
  VectorXcd h(m + 1);
  h << d, d(0);

  c.resize(n);
  // Step II: implementation of (4.2.4.4)
  for (Eigen::Index k = 0; k < m; ++k) {
    c(k) = (h(k) + std::conj(h(m - k))) / 2. -
           i / 2. * std::exp(-2. * static_cast<double>(k) / static_cast<double>(n) * M_PI * i) *
               (h(k) - std::conj(h(m - k)));
  }
  c(m) = std::real(h(0)) - std::imag(h(0));
  for (Eigen::Index k = m + 1; k < n; ++k) {
    c(k) = std::conj(c(n - k));
  }
}

Review question(s) 4.2.4.6 (Frequency Filtering via DFT and real DFT)
(Q4.2.4.6.A) For y ∈ R n , what is the result of the linear mapping

$y \;\mapsto\; \mathrm{Re}\Big\{\mathrm{DFT}_n^{-1}\big[(\mathrm{DFT}_n\,y)_1,\,0,\,\dots,\,0\big]\Big\}$ ?

Here Re extracts the real parts of the components of a complex vector.


(Q4.2.4.6.B) Outline the implementation of a C++ function

std::vector<std::pair<int, std::complex<double>>>
selectDominantFrequencies(const Eigen::VectorXd &y, double tol);

that returns a sequence of pairs

$(j, c_j) \in \mathbb N_0\times\mathbb C\,,\quad j\in J := \operatorname*{argmin}_{J\subset\{0,\dots,\lceil n/2-1\rceil\}}\Big\{\sharp J:\ \sum_{j\in J}|c_j|^2 \ge (1-\mathrm{tol})\sum_{j=0}^{\lceil n/2-1\rceil}|c_j|^2\Big\}\,,$

for 0 ≤ tol < 1. Discuss to what extent this function can be used for the compression of a sound signal.
(Q4.2.4.6.C) How would you implement a C++ function

Eigen::VectorXd reconstructFromFrequencies(
    const std::vector<std::pair<int, std::complex<double>>> &f);

that takes the output of selectDominantFrequencies() from Question (Q4.2.4.6.B) and returns the compressed signal in time domain?


Take into account that selectDominantFrequencies() merely looks at the first half of the discrete Fourier coefficients.

4.2.5 Two-dimensional DFT


Video tutorial for Section 4.2.5 "Two-Dimensional DFT": (20 minutes) Download link,
tablet notes

Finite time-discrete signals are naturally described by vectors, recall § 4.0.0.1. They can be regarded as
one-dimensional, and typical specimens are audio data given in WAV (Waveform Audio) format. Other
types of data also have to be sent through channels, most importantly, images that can be viewed as two-
dimensional data. The natural linear-algebra style representation of an image is a matrix, see Ex. 3.4.4.24.
In this section we study the frequency decomposition of matrices. Due to the natural analogy

     one-dimensional data (“audio signal”) ←→ vector y ∈ C n ,
     two-dimensional data (“image”) ←→ matrix Y ∈ C m,n ,


these techniques are of fundamental importance for image processing.

§4.2.5.1 (Matrix Fourier modes) The (inverse) discrete Fourier transform of a vector computes its coefficients of the representation in a basis provided by the columns of the Fourier matrix Fn . The k-th column can be obtained by sampling harmonic oscillations of frequency k, k = 0, . . . , n − 1:

$(\mathbf F_n)_{:,k} = \big[\cos(2\pi k t_j)\big]_{j=0}^{n-1} + \imath\,\big[\sin(2\pi k t_j)\big]_{j=0}^{n-1}\,,\quad t_j := \tfrac{j}{n}\,,\quad k = 0,\dots,n-1\,.$

What are the 2D counterparts of these vectors? The matrices obtained by sampling products of trigonometric functions, e.g.,

$(t,s)\mapsto \cos(2\pi k t)\cos(2\pi\ell s)\,,\quad k\in\{0,\dots,m-1\},\ \ell\in\{0,\dots,n-1\}\,,\quad m,n\in\mathbb N\,,$

at the points $(t_j := \tfrac{j}{m},\ s_r := \tfrac{r}{n})$, $j\in\{0,\dots,m-1\}$, $r\in\{0,\dots,n-1\}$! Complex versions of such matrices provide a two-dimensional trigonometric basis of $\mathbb C^{m,n}$, whose elements are given by the tensor product matrices

$\big\{(\mathbf F_m)_{:,j}\,\big((\mathbf F_n)_{:,\ell}\big)^{\top},\ 1\le j\le m,\ 1\le\ell\le n\big\}\subset\mathbb C^{m,n}\,.\qquad(4.2.5.2)$

Let a matrix $\mathbf C\in\mathbb C^{m,n}$ be given as a linear combination of these basis matrices with coefficients $y_{j_1,j_2}\in\mathbb C$, $0\le j_1<m$, $0\le j_2<n$:

$\mathbf C = \sum_{j_1=0}^{m-1}\sum_{j_2=0}^{n-1} y_{j_1,j_2}\,(\mathbf F_m)_{:,j_1}\big((\mathbf F_n)_{:,j_2}\big)^{\top}\,.\qquad(4.2.5.3)$

Then the entries of C can be computed by two nested discrete Fourier transforms:

$(\mathbf C)_{k_1,k_2} = \sum_{j_1=0}^{m-1}\sum_{j_2=0}^{n-1} y_{j_1,j_2}\,\omega_m^{j_1k_1}\,\omega_n^{j_2k_2} = \sum_{j_1=0}^{m-1}\omega_m^{j_1k_1}\Big(\sum_{j_2=0}^{n-1}\omega_n^{j_2k_2}\,y_{j_1,j_2}\Big)\,,\quad 0\le k_1<m\,,\ 0\le k_2<n\,.$

Note that C++ indexing is applied throughout.


y


The coefficients $y_{j_1,j_2}\in\mathbb C$, $0\le j_1<m$, $0\le j_2<n$, can also be regarded as entries of a matrix $\mathbf Y\in\mathbb C^{m,n}$. Thus we can rewrite the above expressions: for all $0\le k_1<m$, $0\le k_2<n$,

$(\mathbf C)_{k_1,k_2} = \sum_{j_1=0}^{m-1}\big(\mathbf F_n(\mathbf Y)_{j_1,:}\big)_{k_2}\,\omega_m^{j_1k_1} \quad\Longrightarrow\quad \mathbf C = \mathbf F_m(\mathbf F_n\mathbf Y^{\top})^{\top} = \mathbf F_m\mathbf Y\mathbf F_n\,,\qquad(4.2.5.4)$

because $\mathbf F_n^{\top} = \mathbf F_n$. This formula defines the two-dimensional discrete Fourier transform of the matrix $\mathbf Y\in\mathbb C^{m,n}$. We abbreviate it by $\mathrm{DFT}_{m,n}:\mathbb C^{m,n}\to\mathbb C^{m,n}$.

From Lemma 4.2.1.14 we immediately get the inversion formula:

$\mathbf C = \sum_{j_1=0}^{m-1}\sum_{j_2=0}^{n-1} y_{j_1,j_2}\,(\mathbf F_m)_{:,j_1}\big((\mathbf F_n)_{:,j_2}\big)^{\top} \;\Rightarrow\; \mathbf Y = \mathbf F_m^{-1}\mathbf C\,\mathbf F_n^{-1} = \tfrac{1}{mn}\,\overline{\mathbf F_m}\,\mathbf C\,\overline{\mathbf F_n}\,.\qquad(4.2.5.5)$

The following two codes implement (4.2.5.4) and (4.2.5.5) using the DFT facilities of E IGEN.

C++ code 4.2.5.6: Two-dimensional discrete Fourier transform ➺ GITLAB


template <typename Scalar>
void fft2(Eigen::MatrixXcd &C, const Eigen::MatrixBase<Scalar> &Y) {
  using idx_t = Eigen::MatrixXcd::Index;
  const idx_t m = Y.rows();
  const idx_t n = Y.cols();
  C.resize(m, n);
  Eigen::MatrixXcd tmp(m, n);

  Eigen::FFT<double> fft;  // Helper class for DFT
  // Transform rows of matrix Y
  for (idx_t k = 0; k < m; k++) {
    const Eigen::VectorXcd tv(Y.row(k));
    tmp.row(k) = fft.fwd(tv).transpose();
  }

  // Transform columns of temporary matrix
  for (idx_t k = 0; k < n; k++) {
    const Eigen::VectorXcd tv(tmp.col(k));
    C.col(k) = fft.fwd(tv);
  }
}

C++ code 4.2.5.7: Inverse two-dimensional discrete Fourier transform ➺ GITLAB


template <typename Scalar>
void ifft2(Eigen::MatrixXcd &C, const Eigen::MatrixBase<Scalar> &Y) {
  using idx_t = Eigen::MatrixXcd::Index;
  const idx_t m = Y.rows();
  const idx_t n = Y.cols();
  fft2(C, Y.conjugate());
  C = C.conjugate() / (m * n);
}

Remark 4.2.5.8 (Two-dimensional DFT in PYTHON) The two-dimensional DFT is provided by the PYTHON function numpy.fft.fft2(Y).


§4.2.5.9 (Periodic convolution of matrices) In Section 4.2.2 we linked (periodic) convolutions

$y = p *_n x := \Big[\sum_{j=0}^{n-1} x_j\,p_{(k-j)\bmod n}\Big]_{k=0}^{n-1}\,,\quad p,x\in\mathbb C^n\,,\qquad(4.2.5.10)$

and discrete Fourier transforms. This can also be done in two dimensions.
We consider the following bilinear mapping $B:\mathbb C^{m,n}\times\mathbb C^{m,n}\to\mathbb C^{m,n}$:

$(B(\mathbf X,\mathbf Y))_{k,\ell} = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(\mathbf X)_{i,j}\,(\mathbf Y)_{(k-i)\bmod m,\,(\ell-j)\bmod n}\,,\quad k=0,\dots,m-1\,,\ \ell=0,\dots,n-1\,.\qquad(4.2.5.11)$

Here, as in (4.2.5.10), mod designates the remainder of integer division (like the % operator in C++) and is applied to the indices of matrix entries. The formula (4.2.5.11) defines the two-dimensional discrete periodic convolution, cf. Def. 4.1.4.7. Generalizing the notation for the 1D discrete periodic convolution (4.2.5.10) we also write

$\mathbf X *_{m,n}\mathbf Y := B(\mathbf X,\mathbf Y) = \Big[\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(\mathbf X)_{i,j}\,(\mathbf Y)_{(k-i)\bmod m,\,(\ell-j)\bmod n}\Big]_{\substack{k=0,\dots,m-1\\ \ell=0,\dots,n-1}}\,,\quad \mathbf X,\mathbf Y\in\mathbb C^{m,n}\,.$

A direct loop-based implementation of the formula (4.2.5.11) involves an asymptotic computational effort of $O(m^2n^2)$ for $m,n\to\infty$.

C++ code 4.2.5.12: Straightforward implementation of 2D discrete periodic convolution ➺ GITLAB

// Straightforward implementation of 2D periodic convolution
template <typename Scalar1, typename Scalar2, class EigenMatrix>
void pmconv_basic(const Eigen::MatrixBase<Scalar1> &X,
                  const Eigen::MatrixBase<Scalar2> &Y, EigenMatrix &Z) {
  using idx_t = typename EigenMatrix::Index;
  using val_t = typename EigenMatrix::Scalar;
  const idx_t n = X.cols();
  const idx_t m = X.rows();
  if ((m != Y.rows()) || (n != Y.cols())) {
    throw std::runtime_error("pmconv: size mismatch");
  }
  Z.resize(m, n);  // Ensure right size of output matrix
  // Normalization of indices
  auto idxwrap = [](const idx_t L, int i) {
    return ((i >= L) ? i - L : ((i < 0) ? i + L : i));
  };
  // Implementation of (4.2.5.11)
  for (int i = 0; i < m; i++) {
    for (int j = 0; j < n; j++) {
      val_t s = 0;
      for (int k = 0; k < m; k++) {
        for (int l = 0; l < n; l++) {
          s += X(k, l) * Y(idxwrap(m, i - k), idxwrap(n, j - l));
        }
      }
      Z(i, j) = s;
    }
  }
}


The key discovery of Section 4.2.1 about the diagonalization of the discrete periodic convolution operation
in the Fourier basis carries over to two dimensions, because 2D discrete periodic convolution admits a
diagonalization by switching to the trigonometric basis of C m,n , analogous to (4.2.1.17).
In (4.2.5.11) set $\mathbf Y = (\mathbf F_m)_{:,r}(\mathbf F_n)_{s,:}\in\mathbb C^{m,n}\ \leftrightarrow\ (\mathbf Y)_{i,j} = \omega_m^{ri}\,\omega_n^{sj}$, $0\le i<m$, $0\le j<n$:

$(B(\mathbf X,\mathbf Y))_{k,\ell} = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(\mathbf X)_{i,j}\,(\mathbf Y)_{(k-i)\bmod m,\,(\ell-j)\bmod n} = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(\mathbf X)_{i,j}\,\omega_m^{r(k-i)}\,\omega_n^{s(\ell-j)} = \Big(\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(\mathbf X)_{i,j}\,\overline{\omega_m^{ri}\,\omega_n^{sj}}\Big)\cdot\omega_m^{rk}\,\omega_n^{s\ell}\,.$

$B\big(\mathbf X,\underbrace{(\mathbf F_m)_{:,r}(\mathbf F_n)_{s,:}}_{\text{“eigenvector”}}\big) = \underbrace{\Big(\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}(\mathbf X)_{i,j}\,\overline{\omega_m^{ri}\,\omega_n^{sj}}\Big)}_{\text{“eigenvalue”, see Eq. (4.2.5.3)}}\,(\mathbf F_m)_{:,r}(\mathbf F_n)_{s,:}\,.\qquad(4.2.5.13)$

Hence, the (complex conjugated) two-dimensional discrete Fourier transform of X according to (4.2.5.3)
provides the eigenvalues of the anti-linear mapping Y 7→ B(X, Y), X ∈ C m,n fixed. Thus we have arrived
at a 2D version of the convolution theorem Thm. 4.2.2.2.

Theorem 4.2.5.14. 2D convolution theorem


For any X, Y ∈ C m,n , we have

$\mathbf X *_{m,n}\mathbf Y = \mathrm{DFT}_{m,n}^{-1}\big(\mathrm{DFT}_{m,n}(\mathbf X)\odot\mathrm{DFT}_{m,n}(\mathbf Y)\big)\,,$

where ⊙ stands for the entrywise multiplication of matrices of equal size.

This suggests the following DFT-based algorithm for evaluating the periodic convolution of matrices:
➊ Compute Ŷ by 2D DFT of Y, see Code 4.2.5.7
➋ Compute X̂ by 2D DFT of X, see Code 4.2.5.6.
➌ Component-wise multiplication of X̂ and Ŷ: Ẑ = X̂. ∗ Ŷ.
➍ Compute Z through inverse 2D DFT of Ẑ.

C++ code 4.2.5.15: DFT-based 2D discrete periodic convolution ➺ GITLAB

// DFT based implementation of 2D periodic convolution
template <typename Scalar1, typename Scalar2, class EigenMatrix>
void pmconv(const Eigen::MatrixBase<Scalar1> &X,
            const Eigen::MatrixBase<Scalar2> &Y, EigenMatrix &Z) {
  using Comp = std::complex<double>;
  using idx_t = typename EigenMatrix::Index;
  const idx_t n = X.cols();
  const idx_t m = X.rows();
  if ((m != Y.rows()) || (n != Y.cols())) {
    throw std::runtime_error("pmconv: size mismatch");
  }
  Z.resize(m, n);
  Eigen::MatrixXcd Xh(m, n);
  Eigen::MatrixXcd Yh(m, n);
  // Step ➊: 2D DFT of Y
  fft2(Yh, (Y.template cast<Comp>()));
  // Step ➋: 2D DFT of X
  fft2(Xh, (X.template cast<Comp>()));
  // Steps ➌, ➍: inverse DFT of component-wise product
  ifft2(Z, Xh.cwiseProduct(Yh));
}

EXAMPLE 4.2.5.16 (Deblurring by DFT) 2D discrete convolutions are important for image processing.
Let a Gray-scale pixel image be stored in the matrix P ∈ R m,n , actually P ∈ {0, . . . , 255}m,n , see also
Ex. 3.4.4.24.
Write $(p_{l,j})_{l,j\in\mathbb Z}$ for the periodically extended image:

$p_{l,j} = (\mathbf P)_{l+1,j+1}$ for $1\le l\le m$, $1\le j\le n$, $\quad p_{l,j} = p_{l+m,j+n}\ \forall\, l,j\in\mathbb Z\,.$

Blurring is a technical term for undesirable cross-talk between neighboring pixels: pixel values get replaced by weighted averages of near-by pixel values. This is a good model approximation of the effect of distortion in optical transmission systems like lenses. Blurring can be described by a small matrix called the point-spread function (PSF):

$c_{l,j} = \sum_{k=-L}^{L}\sum_{q=-L}^{L} s_{k,q}\,p_{l+k,\,j+q}\,,\quad 0\le l<m\,,\ 0\le j<n\,,\quad L\in\{1,\dots,\min\{m,n\}\}\,,\qquad(4.2.5.17)$

where $c_{l,j}$ is the blurred image and $s_{k,q}$ the point spread function.

Here the entries of the PSF are referenced as $s_{k,q}$ also with negative indices. We also point out that usually L will be small compared to m and n, and we have $s_{k,q}\ge 0$ and $\sum_{k=-L}^{L}\sum_{q=-L}^{L}s_{k,q} = 1$. Hence blurring amounts to averaging local pixel values. You may also want to look at this YouTube Video about “Convolution in Image Processing”.

In the experiments reported below we used L = 5 and the PSF $s_{k,q} = \frac{1}{1+k^2+q^2}$, $|k|,|q|\le 5$, normalized to entry sum = 1.

C++ code 4.2.5.18: Point spread function (PSF) ➺ GITLAB


void psf(const Eigen::Index L, MatrixXd& S) {
  const VectorXd x = VectorXd::LinSpaced(2 * L + 1, -static_cast<double>(L),
                                         static_cast<double>(L));
  const MatrixXd X = x.replicate(1, x.size());
  const MatrixXd Y = (x.transpose()).replicate(x.size(), 1);
  const MatrixXd E = MatrixXd::Ones(2 * L + 1, 2 * L + 1);
  S = E.cwiseQuotient(E + X.cwiseProduct(X) + Y.cwiseProduct(Y));
  S /= S.sum();
}

This is how this PSF acts on an image according to (4.2.5.17):


[Fig. 132, Fig. 133: a sample image before and after application of the blurring operator.]

Of course, (4.2.5.17) defines a linear operator B : R m,n 7→ R m,n (“blurring operator”).

C++ code 4.2.5.19: Blurring operator ➺ GITLAB


inline MatrixXd blur(const MatrixXd &P, const MatrixXd &S) {
  typedef Eigen::Index index_t;
  auto dimensions = std::make_tuple(P.rows(), P.cols(), S.rows(), S.cols());
  const auto [m, n, M, N] = dimensions;
  const index_t L = (M - 1) / 2;

  if (M != N) {
    std::cout << "Error: S not quadratic!\n";
  }

  MatrixXd C(m, n);
  for (index_t l = 1; l <= m; ++l) {
    for (index_t j = 1; j <= n; ++j) {
      double s = 0;
      for (index_t k = 1; k <= (2 * L + 1); ++k) {
        for (index_t q = 1; q <= (2 * L + 1); ++q) {
          index_t kl = l + k - L - 1;
          if (kl < 1) {
            kl += m;
          } else if (kl > m) {
            kl -= m;
          }
          index_t jm = j + q - L - 1;
          if (jm < 1) {
            jm += n;
          } else if (jm > n) {
            jm -= n;
          }
          s += P(kl - 1, jm - 1) * S(k - 1, q - 1);
        }
      }
      C(l - 1, j - 1) = s;
    }
  }
  return C;
}

Yet, does (4.2.5.17) ring a bell? Hidden in (4.2.5.17) is a 2D discrete periodic convolution, see Eq. (4.2.5.11)!

$\begin{aligned}
c_{l,j} &= \sum_{k=-L}^{L}\sum_{q=-L}^{L} s_{k,q}\,(\mathbf P)_{(l+k)\bmod m,\,(j+q)\bmod n}\\
 &= \sum_{k=-L}^{L}\sum_{q=-L}^{L} s_{-k,-q}\,(\mathbf P)_{(l-k)\bmod m,\,(j-q)\bmod n}\\
 &= \sum_{k=0}^{L}\sum_{q=0}^{L} s_{-k,-q}\,(\mathbf P)_{(l-k)\bmod m,\,(j-q)\bmod n}
  + \sum_{k=0}^{L}\sum_{q=-L}^{-1} s_{-k,-q}\,(\mathbf P)_{(l-k)\bmod m,\,(j-(q+n))\bmod n}\\
 &\quad + \sum_{k=-L}^{-1}\sum_{q=0}^{L} s_{-k,-q}\,(\mathbf P)_{(l-(k+m))\bmod m,\,(j-q)\bmod n}
  + \sum_{k=-L}^{-1}\sum_{q=-L}^{-1} s_{-k,-q}\,(\mathbf P)_{(l-(k+m))\bmod m,\,(j-(q+n))\bmod n}\\
 &= \sum_{k=0}^{L}\sum_{q=0}^{L} s_{-k,-q}\,(\mathbf P)_{(l-k)\bmod m,\,(j-q)\bmod n}
  + \sum_{k=0}^{L}\sum_{q=n-L}^{n-1} s_{-k,-q+n}\,(\mathbf P)_{(l-k)\bmod m,\,(j-q)\bmod n}\\
 &\quad + \sum_{k=m-L}^{m-1}\sum_{q=0}^{L} s_{-k+m,-q}\,(\mathbf P)_{(l-k)\bmod m,\,(j-q)\bmod n}
  + \sum_{k=m-L}^{m-1}\sum_{q=n-L}^{n-1} s_{-k+m,-q+n}\,(\mathbf P)_{(l-k)\bmod m,\,(j-q)\bmod n}\,.
\end{aligned}$

Hence we have that the blurred image is given by the matrix

$\mathbf C = \mathbf S *_{m,n}\mathbf P \quad\text{with}\quad
(\mathbf S)_{k,q} = \begin{cases}
 s_{-k,-q}\,, & \text{for } 0\le k,q\le L\,,\\
 s_{-k+m,-q}\,, & \text{for } m-L\le k<m\,,\ 0\le q\le L\,,\\
 s_{-k,-q+n}\,, & \text{for } 0\le k\le L\,,\ n-L\le q<n\,,\\
 s_{-k+m,-q+n}\,, & \text{for } m-L\le k<m\,,\ n-L\le q<n\,,\\
 0\,, & \text{else.}
\end{cases}\qquad(4.2.5.20)$
In light of Thm. 4.2.5.14 and of the algorithm implemented in Code 4.2.5.15, it now hardly comes as a surprise that DFT comes in handy for reversing the effect of the blurring!
We give an elementary derivation and revisit the considerations of § 4.2.5.1 and recall the derivation of (4.2.1.10) and Lemma 4.2.1.16.

$\Big(B\big(\big[\omega_m^{\nu k}\,\omega_n^{\mu q}\big]_{k,q\in\mathbb Z}\big)\Big)_{l,j} = \sum_{k=-L}^{L}\sum_{q=-L}^{L} s_{k,q}\,\omega_m^{\nu(l+k)}\,\omega_n^{\mu(j+q)} = \omega_m^{\nu l}\,\omega_n^{\mu j}\,\sum_{k=-L}^{L}\sum_{q=-L}^{L} s_{k,q}\,\omega_m^{\nu k}\,\omega_n^{\mu q}\,.$

The matrices $\mathbf V_{\nu,\mu} := \big[\omega_m^{\nu k}\,\omega_n^{\mu q}\big]_{k,q\in\mathbb Z}$, $0\le\nu<m$, $0\le\mu<n$, are the eigenvectors of $B$:

$B\,\mathbf V_{\nu,\mu} = \lambda_{\nu,\mu}\,\mathbf V_{\nu,\mu}\,,\qquad \text{eigenvalue}\quad \lambda_{\nu,\mu} = \underbrace{\sum_{k=-L}^{L}\sum_{q=-L}^{L} s_{k,q}\,\omega_m^{\nu k}\,\omega_n^{\mu q}}_{\text{2-dimensional DFT of the point spread function!}}\,.\qquad(4.2.5.21)$


Thus the inversion of the blurring operator boils down to componentwise scaling in the “Fourier domain”; see also Code 4.2.5.15 for the same idea.

C++ code 4.2.5.22: DFT based deblurring ➺ GITLAB

Eigen::MatrixXd deblur(const Eigen::MatrixXd &C, const Eigen::MatrixXd &S,
                       const double tol = 1e-3) {
  typedef Eigen::Index index_t;
  auto dimensions = std::make_tuple(C.rows(), C.cols(), S.rows(), S.cols());
  const auto [m, n, M, N] = dimensions;
  const index_t L = (M - 1) / 2;
  if (M != N) {
    throw std::runtime_error("Error: S not quadratic!");
  }
  Eigen::MatrixXd Spad = Eigen::MatrixXd::Zero(m, n);
  // Zero padding, see (4.2.5.20)
  Spad.block(0, 0, L + 1, L + 1) = S.block(L, L, L + 1, L + 1);
  Spad.block(m - L, n - L, L, L) = S.block(0, 0, L, L);
  Spad.block(0, n - L, L + 1, L) = S.block(L, 0, L + 1, L);
  Spad.block(m - L, 0, L, L + 1) = S.block(0, L, L, L + 1);
  // Inverse of blurring operator (fft2 expects a complex matrix)
  const Eigen::MatrixXcd SF = fft2(Spad.cast<complex>());
  // Test for invertibility
  if (SF.cwiseAbs().minCoeff() < tol * SF.cwiseAbs().maxCoeff()) {
    std::cerr << "Error: Deblurring impossible!\n";
  }
  // DFT based deblurring
  return ifft2(fft2(C.cast<complex>()).cwiseQuotient(SF)).real();
}

Note that this code checks whether deblurring is possible, that is, whether the blurring operator is really
invertible. A near singular blurring operator manifests itself through entries of its DFT close to zero. y
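As a sanity check one may blur a synthetic image and feed the result back into deblur(). The following C++ sketch is my own driver, assuming that psf(), blur(), and deblur() from Codes 4.2.5.18, 4.2.5.19, and 4.2.5.22 are declared in scope; the random test matrix merely stands in for a real image.

#include <Eigen/Dense>
#include <iostream>
// Hypothetical round-trip test: blur a small synthetic "image" with the PSF
// from Code 4.2.5.18 and recover it with the DFT-based deblur().
int main() {
  using Eigen::MatrixXd;
  const int m = 64, n = 48, L = 5;
  MatrixXd S;
  psf(L, S);                                  // point spread function
  const MatrixXd P = MatrixXd::Random(m, n);  // synthetic gray-scale image
  const MatrixXd C = blur(P, S);              // forward: blurring
  const MatrixXd R = deblur(C, S);            // backward: DFT-based deblurring
  std::cout << "reconstruction error = " << (R - P).norm() << std::endl;
}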
Review question(s) 4.2.5.23 (Two-dimensional DFT)
(Q4.2.5.23.A) Describe a function $(t,s)\mapsto g(t,s)$ such that

$\mathrm{Re}\Big\{(\mathbf F_m)_{:,j}\big((\mathbf F_n)_{:,\ell}\big)^{\top}\Big\} = \big[g(t_j,s_r)\big]_{\substack{j=0,\dots,m-1\\ r=0,\dots,n-1}}\,,\qquad t_j := \tfrac{j}{m}\,,\ s_r := \tfrac{r}{n}\,,\quad m,n\in\mathbb N\,.$

Here $\mathbf F_n$, n ∈ N, is the Fourier matrix

$\mathbf F_n = \begin{bmatrix}
\omega_n^{0} & \omega_n^{0} & \cdots & \omega_n^{0}\\
\omega_n^{0} & \omega_n^{1} & \cdots & \omega_n^{n-1}\\
\omega_n^{0} & \omega_n^{2} & \cdots & \omega_n^{2n-2}\\
\vdots & \vdots & & \vdots\\
\omega_n^{0} & \omega_n^{n-1} & \cdots & \omega_n^{(n-1)^2}
\end{bmatrix} = \Big[\omega_n^{lj}\Big]_{l,j=0}^{n-1}\in\mathbb C^{n,n}\,.\qquad(4.2.1.13)$

(Q4.2.5.23.B) How would you compute the discrete Fourier transform of a tensor product matrix $\mathbf X = uv^{\mathsf H}$, $u,v\in\mathbb C^n$?

4.2.6 Semi-discrete Fourier Transform [QSS00, Sect. 10.11]


Starting from § 4.1.4.2 we mainly looked at time-discrete n-periodic signals, which can be mapped to
vectors ∈ R n . This led to discrete periodic convolution (→ Def. 4.1.4.7) and the discrete Fourier transform
(DFT) (→ Def. 4.2.1.18) as (bi-)linear mappings in C n .


§4.2.6.1 (“Squeezing” the DFT) In this section we are concerned with non-periodic signals of infinite
duration as introduced in § 4.0.0.1.

Idea: Study the limit n → ∞ for DFT in the n-periodic setting

Let $(y_k)_{k\in\mathbb Z}$ be an n-periodic sequence (signal), n = 2m + 1, m ∈ N, with generating vector $y := [y_0,\dots,y_{n-1}]^{\top}$. Thanks to periodicity we can rewrite the DFT c = DFT_n y with a simple change of indexing:

DFT → Def. 4.2.1.18: $(c)_k = c_k := \sum_{j=-m}^{m} y_j\exp\!\big(-2\pi i\,\tfrac{kj}{n}\big)\,,\quad k = 0,\dots,n-1\,.\qquad(4.2.6.2)$

Next, we associate a point $t_k\in[0,1[$ with each index k of the DFT $(c_k)_{k=0}^{n-1}$:

$k\in\{0,\dots,n-1\}\;\longleftrightarrow\; t_k := \tfrac{k}{n}\,.\qquad(4.2.6.3)$

Thus we can view $(c_k)_{k=0}^{n-1}$ as the heights of n pulses evenly spaced in the interval [0, 1[.
[Fig. 134: the values ck drawn as n pulses at the points tk = k/n, i.e. a vector ∈ R n “squeezed” into [0, 1[.]

We can interpret the values $c_k$ as sampled values of a function defined on [0, 1]:

$c_k \leftrightarrow c(t_k)\,,\qquad t_k = \tfrac{k}{n}\,,\quad k = 0,\dots,n-1\,.$

This makes it possible to pass from a discrete finite signal to a continuous signal.

In a sense, formally, we can rewrite (4.2.6.2) as

$\mathrm{DFT}:\quad c(t_k) := c_k = \sum_{j=-m}^{m} y_j\,\exp(-2\pi\imath\,jt_k)\,,\quad k = 0,\dots,n-1\,.\qquad(4.2.6.4)$

The notation indicates that we read $c_k$ as the value of a function $c:[0,1[\,\to\mathbb C$ for argument $t_k$. y

EXAMPLE 4.2.6.5 (“Squeezed” DFT of a periodically truncated signal) We consider the bi-infinite discrete signal $(y_j)_{j\in\mathbb Z}$, “concentrated around 0”,

$y_j = \frac{1}{1+j^2}\,,\quad j\in\mathbb Z\,.$

We examine the DFT of the (2m+1)-periodic signal obtained by periodic extension of $(y_k)_{k=-m}^{m}$, C++ code ➺ GITLAB.


[Figs. 135–142: the values c(tk ) of the “squeezed” DFT for increasing m, plotted over t ∈ [0, 1[.]

The visual impression is that the values c(tk ) “converge” to those of a function c : [0, 1[7→ R in the
sampling points tk .
y
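The experiment behind these plots can be redone in a few lines. The following C++ sketch is my own (not the GITLAB code referenced above); it assembles the (2m+1)-periodic extension by wrapping the components with negative index to the tail of the vector and then applies one DFT, cf. (4.2.6.2).

#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>
// Sketch: values c(t_k), k = 0,...,2m, of the "squeezed" DFT (4.2.6.2)
// for the truncated signal y_j = 1/(1+j^2), |j| <= m.
Eigen::VectorXcd squeezedDFT(int m) {
  const int n = 2 * m + 1;
  Eigen::VectorXd y(n);
  // wrap the components y_{-m},...,y_{-1} around to the tail of the vector
  for (int j = 0; j <= m; ++j) y(j) = 1.0 / (1.0 + j * j);
  for (int j = 1; j <= m; ++j) y(n - j) = 1.0 / (1.0 + j * j);
  Eigen::FFT<double> fft;
  return fft.fwd(y);  // component k = value of c at t_k = k/n
}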

§4.2.6.6 (Fourier series) Now we pass to the limit m → ∞ in (4.2.6.4) and keep the “sampling a function”
perspective: ck = c(tk ). Note that passing to the limit amounts to dropping the assumption of periodicity!

$c(t) = \sum_{k\in\mathbb Z} y_k\exp(-2\pi\imath kt)\,.\qquad(4.2.6.7)$

Terminology: The series (= infinite sum) on the right-hand side of (4.2.6.7) is called a Fourier series (link).
The function $c:[0,1[\,\to\mathbb C$ defined by (4.2.6.7) is called the Fourier transform of the sequence $(y_k)_{k\in\mathbb Z}$ (if the series converges).

Corollary 4.2.6.8. Periodicity of Fourier transforms

Fourier transforms t 7→ c(t) are 1-periodic functions R → C.


[Fig. 143: the Fourier transform of the sequence yk = 1/(1 + k2 ), plotted as a function c(t) over t ∈ [0, 1].]

Thus, the limit we “saw” in Ex. 4.2.6.5 is actually the Fourier transform of the sequence $(y_k)_{k\in\mathbb Z}$!
From (4.2.6.7) we conclude:

     Fourier transform  =  weighted sum of Fourier modes $t\mapsto\exp(-2\pi\imath kt)$, $k\in\mathbb Z$ .

[Five panels: plots of individual Fourier modes over t ∈ [0, 1].]

→ related animation on Wikipedia.


It is possible to derive a closed-form expression for the function displayed in Fig. 144:

$c(t) = \sum_{k\in\mathbb Z}\frac{1}{1+k^2}\exp(-2\pi\imath kt) = \frac{\pi}{e^{\pi}-e^{-\pi}}\cdot\big(e^{\pi-2\pi t}+e^{2\pi t-\pi}\big)\;\in\;C^{\infty}([0,1])\,.$

Note that when considered as a 1-periodic function on R, this c(t) is merely continuous. y

Remark 4.2.6.9 (Decay conditions for bi-infinite signals) The considerations above were based on
✦ truncation of $(y_k)_{k\in\mathbb Z}$ to $(y_k)_{k=-m}^{m}$, and
✦ periodic continuation to a (2m+1)-periodic signal.
Obviously, only if the signal is concentrated around k = 0 will this procedure not lose essential information contained in the signal, which suggests decay conditions for the coefficients of Fourier series.

Minimal requirement:   $\lim_{k\to\infty}|y_k| = 0\,,\qquad(4.2.6.10)$
Stronger requirement:  $\sum_{k\in\mathbb Z}|y_k| < \infty\,.\qquad(4.2.6.11)$

The summability condition (4.2.6.11) implies (4.2.6.10). Moreover, (4.2.6.11) ensures that the Fourier series (4.2.6.7) converges uniformly [Str09, Def. 4.8.1] because the exponentials are all bounded by 1 in modulus. From [Str09, Thm. 4.8.1] we learn that limits of uniformly convergent series of continuous functions possess a continuous limit. As a consequence $c:[0,1[\,\to\mathbb C$ is continuous, if (4.2.6.11) holds. y

EXAMPLE 4.2.6.12 (Convergence of Fourier sums) We consider the following infinite signal, satisfying the summability condition (4.2.6.11): $y_k = \frac{1}{1+k^2}$, k ∈ Z, see Ex. 4.2.6.5. We monitored the approximation of the Fourier transform c(t) by Fourier sums $c_m(t)$, see (4.2.6.14).


[Fig. 144: the Fourier transform c(t) of yk = 1/(1 + k2 ) over t ∈ [0, 1]; Fig. 145: Fourier sum approximations cm (t) with 2m + 1 terms for m = 2, 4, 8, 16, 32.]

We observe convergence of the Fourier sums in “eyeball norm”. Quantitative estimates can be deduced from decay properties of the sequence $(y_k)_{k\in\mathbb Z}$. If it is summable according to (4.2.6.11), then

$\Big|\sum_{k\in\mathbb Z} y_k\exp(-2\pi\imath kt) - \sum_{k=-M}^{M} y_k\exp(-2\pi\imath kt)\Big| \;\le\; \sum_{|k|>M}|y_k| \;\to\; 0 \quad\text{for } M\to\infty\,.$

Further quantitative statements about convergence can be deduced from Thm. 4.2.6.33 below. y

Remark 4.2.6.13 (Numerical summation of Fourier series) Assuming sufficiently fast decay of the signal $(y_k)_{k\in\mathbb Z}$ for k → ∞ (→ Rem. 4.2.6.9), we can approximate the Fourier series (4.2.6.7) by a Fourier sum

$c(t)\approx c_M(t) := \sum_{k=-M}^{M} y_k\exp(-2\pi ikt)\,,\quad M\gg 1\,.\qquad(4.2.6.14)$

Task: Approximate evaluation of c(t) at N equidistant points $t_j := \tfrac{j}{N}$, j = 0, . . . , N − 1 (e.g., for plotting it).

$c(t_j) = \lim_{M\to\infty}\sum_{k=-M}^{M} y_k\exp(-2\pi ikt_j) \approx \sum_{k=-M}^{M} y_k\exp\!\big(-2\pi i\,\tfrac{kj}{N}\big)\,,\qquad(4.2.6.15)$

for j = 0, . . . , N − 1.

Note that in the case N = M (4.2.6.15) coincides with a discrete Fourier transform (DFT, → Def. 4.2.1.18). The following code demonstrates the evaluation of a Fourier series at equidistant points using DFT.

C++ code 4.2.6.16: DFT-based evaluation of Fourier sum at equidistant points ➺ GITLAB

// DFT based approximate evaluation of Fourier series
// signal is a functor providing the y_k
// M specifies truncation of series according to (4.2.6.14)
// N is the number of equidistant evaluation points for c in [0, 1[.
template <class Function>
VectorXcd foursum(const Function &signal, int M, int N) {
  using index_t = Eigen::Index;
  const int m = 2 * M + 1;  // length of the signal
  // sample signal
  VectorXd y = VectorXd::LinSpaced(m, -M, M).unaryExpr(signal);
  // Ensure that there are more sampling points than terms in series
  const int l = static_cast<int>(m > N ? ceil(static_cast<double>(m) / N) : 1);
  N *= l;
  // Zero padding and wrapping of signal, see Code 4.2.3.7
  VectorXd y_ext = VectorXd::Zero(N);
  y_ext.head(M + 1) = y.tail(M + 1);
  y_ext.tail(M) = y.head(M);
  // Perform DFT and decimate output vector
  Eigen::FFT<double> fft;
  const Eigen::VectorXcd k = fft.fwd(y_ext);
  Eigen::VectorXcd c(N / l);
  for (int i = 0; i < N / l; ++i) {
    c(i) = k(static_cast<index_t>(i) * static_cast<index_t>(l));
  }
  return c;
}

§4.2.6.17 (Inverting the Fourier transform) Now we perform a similar passage to the limit as above for the inverse DFT (4.2.1.20), n = 2m + 1,

$y_j = \frac{1}{n}\sum_{k=0}^{n-1} c_k\exp\!\big(2\pi i\,\tfrac{jk}{n}\big)\,,\quad j = -m,\dots,m\,.\qquad(4.2.6.18)$

We adopt a function perspective as before: $c_k\leftrightarrow c(t_k)$, $t_k = \tfrac{k}{n}$, cf. (4.2.6.3), and rewrite

$y_j = \frac{1}{n}\sum_{k=0}^{n-1} c(t_k)\exp(2\pi ijt_k)\,,\quad j = -m,\dots,m\,.\qquad(4.2.6.19)$

Then pass to the limit m → ∞ in (4.2.6.19).

Insight: The right-hand side of (4.2.6.19) is a Riemann sum, cf. [Str09, Sect. 6.2].

In the limit m → ∞ the sum becomes an integral!

$y_j = \int_0^1 c(t)\exp(2\pi ijt)\,\mathrm dt\,,\quad j\in\mathbb Z\,.\qquad(4.2.6.20)$

This formula is the inversion of the summation of a Fourier series (4.2.6.7)!


In fact, this is not surprising, because for a Fourier series

c(t) = ∑ y j exp(−2πıkt) , t∈R,


k ∈Z


satisfying the summability condition (4.2.6.11) we can swap integration and summation and directly compute

$\int_0^1 c(t)\exp(2\pi ijt)\,\mathrm dt = \int_0^1\Big(\sum_{k\in\mathbb Z} y_k\exp(-2\pi\imath kt)\Big)\exp(2\pi ijt)\,\mathrm dt = \sum_{k\in\mathbb Z} y_k\int_0^1\exp(2\pi\imath(j-k)t)\,\mathrm dt = y_j\,,$

because of the “orthogonality relation”

$\int_0^1\exp(2\pi\imath nt)\,\mathrm dt = \begin{cases}1\,, & \text{if } n = 0\,,\\ 0\,, & \text{if } n\neq 0\,.\end{cases}$

The formula (4.2.6.20) allows to recover the signal (yk )k∈Z from its Fourier transform c(t).
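Numerically, (4.2.6.20) can be approximated by the very Riemann sum (4.2.6.19) that motivated it, which is nothing but a scaled inverse DFT of equidistant samples of c. The following C++ sketch is my own illustration (the function name fourierCoeffs is made up); it assumes c is available as a callable and that EIGEN's unsupported FFT module is included.

#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>
#include <functional>
#include <complex>
// Sketch: approximate y_j, j = -m,...,m, from samples c(k/n), n = 2m+1,
// by the Riemann-sum/inverse-DFT approximation of (4.2.6.20).
Eigen::VectorXcd fourierCoeffs(
    const std::function<std::complex<double>(double)> &c, int m) {
  const int n = 2 * m + 1;
  Eigen::VectorXcd samples(n);
  for (int k = 0; k < n; ++k) samples(k) = c(static_cast<double>(k) / n);
  Eigen::FFT<double> fft;
  const Eigen::VectorXcd y = fft.inv(samples);  // includes the 1/n scaling
  // Reorder: components 0..m correspond to y_0,...,y_m,
  // components m+1..n-1 to y_{-m},...,y_{-1} (periodicity in j)
  Eigen::VectorXcd yj(n);
  yj.tail(m + 1) = y.head(m + 1);  // y_0,...,y_m
  yj.head(m) = y.tail(m);          // y_{-m},...,y_{-1}
  return yj;                       // ordered as y_{-m},...,y_m
}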

§4.2.6.21 (Fourier transform as linear mapping) Assuming sufficiently fast decay of the infinite sequence $(y_k)_{k\in\mathbb Z}\in\mathbb C^{\mathbb Z}$, combining (4.2.6.7) and (4.2.6.20) we have found the relationship

(4.2.6.7): $c(t) = \sum_{k\in\mathbb Z} y_k\exp(-2\pi\imath kt)$ ↔ (4.2.6.20): $y_k = \int_0^1 c(t)\exp(2\pi\imath kt)\,\mathrm dt$ .

Terminology: $y_j$ from (4.2.6.20) is called the j-th Fourier coefficient of the function c.
✎ Notation: $\widehat c_j := y_j$ with $y_j$ defined by (4.2.6.20) =ˆ j-th Fourier coefficient of $c:[0,1[\,\to\mathbb C$.

In a sense, Fourier series summation maps a sequence to a 1-periodic function, and Fourier coefficient extraction maps a 1-periodic function to a sequence:

     sequence ∈ C^Z  —— Fourier series ——→  function [0, 1[ → C ,
     sequence ∈ C^Z  ←— Fourier coefficients ——  function [0, 1[ → C .

Both the space C^Z of bi-infinite sequences and the space of functions [0, 1[ → C are vector spaces equipped with “termwise/pointwise” addition and scalar multiplication. Then it is clear that
• the series summation mapping $(\widehat c_k)_{k\in\mathbb Z}\mapsto c$ from (4.2.6.7), and
• the Fourier coefficient extraction mapping $c\mapsto(\widehat c_k)_{k\in\mathbb Z}$ from (4.2.6.20)
are linear! (Recall the concept of a linear mapping as explained in [NS02, Ch. 6].)

Let us summarize the fundamental correspondences:

     (continuous) function $c:[0,1[\,\to\mathbb C$  ⟷  (bi-infinite) sequence $(\widehat c_j)_{j\in\mathbb Z}$ ,
     Fourier coefficients: $\widehat c_j = \int_0^1 c(t)\exp(2\pi\imath jt)\,\mathrm dt$ ,   Fourier transform: $c(t) = \sum_{k\in\mathbb Z}\widehat c_k\exp(-2\pi\imath kt)$ .
y


Remark 4.2.6.22 (Filtering in Fourier domain) What happens to the Fourier transform of a bi-infinite signal if it passes through a channel?
Consider a (bi-)infinite signal $(x_k)_{k\in\mathbb Z}$ sent through a finite (→ Def. 4.1.1.2), linear (→ Def. 4.1.1.7), time-invariant (→ Def. 4.1.1.5), causal (→ Def. 4.1.1.9) channel with impulse response $(\dots,0,h_0,\dots,h_{n-1},0,\dots)$ (→ Def. 4.1.1.12). By (4.1.2.4) this results in the output signal

$y_k = \sum_{j=0}^{n-1} h_j\,x_{k-j}\,,\quad k\in\mathbb Z\,.\qquad(4.2.6.23)$

We introduce the Fourier transforms of the infinite signals:

$(y_k)_{k\in\mathbb Z}\leftrightarrow t\mapsto c(t)\,,\qquad (x_j)_{j\in\mathbb Z}\leftrightarrow t\mapsto b(t)\,.$

We also assume that $(x_k)_{k\in\mathbb Z}$ satisfies the summability condition (4.2.6.11). Then elementary computations establish a relationship between the Fourier transforms:

$c(t) = \sum_{k\in\mathbb Z} y_k\exp(-2\pi\imath kt) = \sum_{k\in\mathbb Z}\sum_{j=0}^{n-1} h_j\,x_{k-j}\exp(-2\pi\imath kt)$
[shift summation index k] $= \sum_{j=0}^{n-1}\sum_{k\in\mathbb Z} h_j\,x_k\exp(-2\pi\imath jt)\exp(-2\pi\imath kt)\qquad(4.2.6.24)$
$= \underbrace{\Big(\sum_{j=0}^{n-1} h_j\exp(-2\pi\imath jt)\Big)}_{\text{trigonometric polynomial of degree } n-1}\,b(t)\,.$

Definition 4.2.6.25. Trigonometric polynomial

A trigonometric polynomial is a function R → C that is a weighted sum of finitely many terms


t → exp(−2πıkt), k ∈ Z.

We summarize the insight gleaned from (4.2.6.24):

Discrete convolution in Fourier domain


The discrete convolution of a signal with finite impulse response amounts to a multiplication of
its Fourier transform with a trigonometric polynomial whose coefficients are given by the impulse
response.

§4.2.6.27 (Fourier transform and convolution) In fact, the observation made in Rem. 4.2.6.22 is a spe-
cial case of a more general result that provides a version of the convolution theorem Thm. 4.2.2.2 for the
Fourier transform.

Theorem 4.2.6.28. Convolution theorem

Let $t\mapsto c(t)$ and $t\mapsto b(t)$ be the Fourier transforms of the two summable bi-infinite sequences $(y_k)_{k\in\mathbb Z}$ and $(x_k)_{k\in\mathbb Z}$, respectively. Then the pointwise product $t\mapsto c(t)b(t)$ is the Fourier transform of the convolution (→ Def. 4.1.2.7)

$(x_k)*(y_k) := \Big(\ell\in\mathbb Z\mapsto\sum_{k\in\mathbb Z} x_k\,y_{\ell-k}\Big)\in\mathbb C^{\mathbb Z}\,.$


Proof. (formal) Ignoring issues of convergence, we may just multiply the two Fourier series and sort the resulting terms:

$\Big(\sum_{k\in\mathbb Z} y_k\exp(-2\pi\imath kt)\Big)\cdot\Big(\sum_{j\in\mathbb Z} x_j\exp(-2\pi\imath jt)\Big) = \sum_{\ell\in\mathbb Z}\underbrace{\Big(\sum_{k\in\mathbb Z} y_k\,x_{\ell-k}\Big)}_{=\,((y_k)*(x_k))_\ell}\exp(-2\pi\imath\ell t)\,.$

✷ y

§4.2.6.29 (Isometry property of the Fourier transform) We will find a conservation of power through the Fourier transform. This is related to the assertion of Lemma 4.2.1.14 for the Fourier matrix $\mathbf F_n$, see (4.2.1.13), namely that $\tfrac{1}{\sqrt n}\mathbf F_n$ is unitary (→ Def. 6.3.1.2), which implies (Thm. 3.3.2.2)

$\tfrac{1}{\sqrt n}\mathbf F_n\ \text{unitary}\quad\Rightarrow\quad \Big\|\tfrac{1}{\sqrt n}\mathbf F_n\,y\Big\|_2 = \|y\|_2\,.\qquad(4.2.6.30)$

Since the DFT boils down to multiplication with $\mathbf F_n$ (→ Def. 4.2.1.18), we conclude from (4.2.6.30)

$c_k\ \text{from (4.2.6.2)}\quad\Rightarrow\quad \frac1n\sum_{k=0}^{n-1}|c_k|^2 = \sum_{j=-m}^{m}|y_j|^2\,.\qquad(4.2.6.31)$

Now we adopt the function perspective again and associate $c_k\leftrightarrow c(t_k)$. Then we pass to the limit m → ∞, appeal to Riemann summation (see above), and conclude

$(4.2.6.31)\ \xRightarrow{\ m\to\infty\ }\ \int_0^1|c(t)|^2\,\mathrm dt = \sum_{j\in\mathbb Z}|y_j|^2\,.\qquad(4.2.6.32)$

Theorem 4.2.6.33. Isometry property of the Fourier transform

If the Fourier coefficients satisfy $\sum_{j\in\mathbb Z}|\widehat c_j|^2<\infty$, then the Fourier series

$c(t) = \sum_{k\in\mathbb Z}\widehat c_k\exp(-2\pi\imath kt)$

yields a function $c\in L^2(]0,1[)$ that satisfies

$\|c\|_{L^2(]0,1[)}^2 := \int_0^1|c(t)|^2\,\mathrm dt = \sum_{j\in\mathbb Z}|\widehat c_j|^2\,.$

Recalling the concept of the $L^2$-norm of a function, see (5.2.4.6), the theorem can be stated as follows:
Thm. 4.2.6.33 ↔ The $L^2$-norm of a Fourier transform agrees with the Euclidean norm of the corresponding sequence.
Here the Euclidean norm of a sequence is understood as $\big\|(y_k)_{k\in\mathbb Z}\big\|_2^2 := \sum_{k\in\mathbb Z}|y_k|^2$.

From Thm. 4.2.6.33 we can also conclude that the Fourier transform is injective: c vanishes if and only if all its Fourier coefficients are zero. y
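The discrete precursor (4.2.6.31) of this isometry is easy to verify numerically. The following C++ sketch is my own; it merely compares $n^{-1}\sum_k|c_k|^2$ with $\sum_j|y_j|^2$ for a random real vector using EIGEN's FFT module.

#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>
#include <iostream>
#include <cmath>
// Sketch: numerical check of the discrete isometry (4.2.6.31),
// (1/n) * sum_k |c_k|^2 == sum_j |y_j|^2 for c = DFT_n y.
int main() {
  const int n = 257;  // n = 2m+1
  const Eigen::VectorXd y = Eigen::VectorXd::Random(n);
  Eigen::FFT<double> fft;
  const Eigen::VectorXcd c = fft.fwd(y);
  std::cout << "difference = "
            << std::abs(c.squaredNorm() / n - y.squaredNorm()) << std::endl;
}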
Review question(s) 4.2.6.34 (Semi-discrete Fourier transform)


(Q4.2.6.34.A) Section 4.1.1 introduced the shift operator

$S_m:\ell^\infty(\mathbb Z)\to\ell^\infty(\mathbb Z)\,,\qquad S_m\big((x_j)_{j\in\mathbb Z}\big) = (x_{j-m})_{j\in\mathbb Z}\,.\qquad(4.1.1.4)$

Let $t\mapsto c(t)$ be the Fourier transform of the sequence $(y_k)_{k\in\mathbb Z}\in\ell^\infty(\mathbb Z)$,

$c(t) = \sum_{k\in\mathbb Z} y_k\exp(-2\pi\imath kt)\,.$

What is the relationship of $t\mapsto c(t)$ to the Fourier transform of the shifted sequence $S_m\big((y_k)_{k\in\mathbb Z}\big)$, m ∈ Z?

4.3 Fast Fourier Transform (FFT)

Video tutorial for Section 4.3 "Fast Fourier Transform (FFT)": (16 minutes) Download link,
tablet notes

You might have been wondering why the reduction to DFTs has received so much attention in Sec-
tion 4.2.2. An explanation is given now.
At first glance, the DFT in C n ,

$c_k = \sum_{j=0}^{n-1} y_j\,\omega_n^{kj}\,,\quad k = 0,\dots,n-1\,,\qquad(4.2.1.19)$

seems to require an asymptotic computational effort of $O(n^2)$ (matrix×vector multiplication with the dense Fourier matrix).

C++-code 4.3.0.1: Double-loop implementation of DFT


// DFT (4.2.1.19) of vector y returned in c
void naivedft(const VectorXcd& y, VectorXcd& c) {
  using idx_t = VectorXcd::Index;
  const idx_t n = y.size();
  const std::complex<double> i(0, 1);
  c.resize(n);
  // root of unity ω_n, w holds its powers
  const std::complex<double> w = std::exp(-2 * M_PI / static_cast<double>(n) * i);
  std::complex<double> s = w;
  c(0) = y.sum();
  for (idx_t j = 1; j < n; ++j) {
    c(j) = y(n - 1);
    for (idx_t k = n - 2; k >= 0; --k) {
      c(j) = c(j) * s + y(k);
    }
    s *= w;
  }
}

EXPERIMENT 4.3.0.2 (Runtimes of DFT implementations) We examine the runtimes of calls to built-in
DFT functions in both M ATLAB and E IGEN ➺ GITLAB.


Timings in MATLAB:
1. Straightforward implementation involving MATLAB loops
2. Multiplication with the dense Fourier matrix (4.2.1.13)
3. MATLAB's built-in function fft()
(MATLAB V6.5, Linux, Mobile Intel Pentium 4-M CPU 2.40GHz, minimum over 5 runs)
Similar runtimes would be obtained for PYTHON's numpy.fft().
[Plot: run time [s] versus vector length n for the three variants: loop-based computation, direct matrix multiplication, MATLAB fft() function.]

DFT runtimes in EIGEN (only for n = 2^L!):
1. Straightforward implementation involving C++ loops, see Code 4.3.0.1
2. Multiplication with the Fourier matrix (4.2.1.13)
3. EIGEN's built-in FFT class, method fwd()

Note: EIGEN uses KISS FFT as the default backend in its FFT module, which falls back to a slow loop-based DFT when used on data sizes that are large primes. A superior alternative is FFTW, see § 4.3.0.11.

The secret of MATLAB's/EIGEN's/PYTHON's fft():

     the Fast Fourier Transform (FFT) algorithm [DV90]

(discovered by C.F. Gauss in 1805, rediscovered by Cooley & Tukey in 1965, one of the “top ten algorithms of the century”).

§4.3.0.3 (FFT algorithm: derivation and complexity) To understand how the discrete Fourier transform of n-vectors can be implemented with an asymptotic computational effort smaller than $O(n^2)$ we start with an elementary manipulation of (4.2.1.19) for n = 2m, m ∈ N:

$c_k = \sum_{j=0}^{n-1} y_j\,e^{-\frac{2\pi i}{n}jk} = \sum_{j=0}^{m-1} y_{2j}\,e^{-\frac{2\pi i}{n}2jk} + \sum_{j=0}^{m-1} y_{2j+1}\,e^{-\frac{2\pi i}{n}(2j+1)k}
 = \underbrace{\sum_{j=0}^{m-1} y_{2j}\,\underbrace{e^{-\frac{2\pi i}{m}jk}}_{=\omega_m^{jk}}}_{=:\widetilde c_k^{\mathrm{even}}}
 + e^{-\frac{2\pi i}{n}k}\cdot\underbrace{\sum_{j=0}^{m-1} y_{2j+1}\,\underbrace{e^{-\frac{2\pi i}{m}jk}}_{=\omega_m^{jk}}}_{=:\widetilde c_k^{\mathrm{odd}}}\,,\quad k\in\mathbb Z\,,\qquad(4.3.0.4)$


and note the m-periodicity: $\widetilde c_k^{\mathrm{even}} = \widetilde c_{k+m}^{\mathrm{even}}$, $\widetilde c_k^{\mathrm{odd}} = \widetilde c_{k+m}^{\mathrm{odd}}$ for all k ∈ Z.
The key observation is that the sequences/m-vectors $\widetilde c_k^{\mathrm{even}}$ and $\widetilde c_k^{\mathrm{odd}}$ can be computed with DFTs of length m!

with $y_{\mathrm{even}} := [y_0,y_2,\dots,y_{n-2}]^{\top}\in\mathbb C^m$:  $\big[\widetilde c_k^{\mathrm{even}}\big]_{k=0}^{m-1} = \mathrm{DFT}_m(y_{\mathrm{even}})$ ,
with $y_{\mathrm{odd}} := [y_1,y_3,\dots,y_{n-1}]^{\top}\in\mathbb C^m$:  $\big[\widetilde c_k^{\mathrm{odd}}\big]_{k=0}^{m-1} = \mathrm{DFT}_m(y_{\mathrm{odd}})$ .

This means that for even n we can compute DFT_n(y) from two DFTs of half the length plus ∼ n additions and multiplications.

(4.3.0.4):  DFT of length 2m = 2× DFT of length m + 2m additions & multiplications

Idea for n = 2^L: divide & conquer recursion ➙ FFT algorithm

The following code shows an EIGEN-based recursive FFT implementation for DFT of length n = 2^L.

C++ code 4.3.0.5: Recursive FFT ➺ GITLAB


// Recursive DFT for vectors of length n = 2^L
VectorXcd fftrec(const VectorXcd &y) {
  const VectorXcd::Index n = y.size();

  // Nothing to do for DFT of length 1
  if (n == 1) {
    return y;
  }
  if (n % 2 != 0) {
    throw std::runtime_error("size(y) must be even!");
  }

  // Even/odd splitting by rearranging the vector components into a n/2 x 2 matrix!
  // See Rem. 1.2.3.6 for use of Eigen::Map
  const Eigen::Map<const Eigen::Matrix<std::complex<double>, Eigen::Dynamic,
                                       Eigen::Dynamic, Eigen::RowMajor>>
      Y(y.data(), n / 2, 2);
  const VectorXcd c1 = fftrec(Y.col(0));
  const VectorXcd c2 = fftrec(Y.col(1));
  // Root of unity ω_n
  const std::complex<double> omega =
      std::exp(-2 * M_PI / static_cast<double>(n) * std::complex<double>(0, 1));
  // Factor in (4.3.0.4)
  std::complex<double> s(1.0, 0.0);
  VectorXcd c(n);
  // Scaling of DFT of odd components plus periodic continuation of c1, c2
  for (Eigen::Index k = 0; k < n; ++k) {
    c(k) = c1(k % (n / 2)) + c2(k % (n / 2)) * s;
    s *= omega;
  }
  return c;
}


Visualization of the computational cost of fftrec() from Code 4.3.0.5:

     1× DFT of length 2^L
     2× DFT of length 2^{L−1}
     4× DFT of length 2^{L−2}
     ...
     2^L× DFT of length 1

We see that in Code 4.3.0.5 each level of the recursion requires O(2^L) elementary operations.

Asymptotic complexity of FFT

     Asymptotic complexity of the FFT algorithm for n = 2^L:  O(L·2^L) = O(n log2 n)

(fft.fwd()/fft.inv() function calls in EIGEN: computational cost is ≈ 5n log2 n).

Remark 4.3.0.7 (FFT algorithm by matrix factorization) The FFT algorithm can also be analyzed on the level of matrix-vector calculus. For n = 2m, m ∈ N, consider the even-odd sorting

$P_m^{\mathrm{OE}}(1,\dots,n) = (1,3,\dots,n-1,2,4,\dots,n)\,.$

Also use the notation $\mathbf P_m^{\mathrm{OE}}$ for the corresponding permutation of the rows of a matrix $\in\mathbb C^{n,n}$, n = 2m. As $\omega_n^{2j} = \omega_m^{j}$ we conclude


 
 
 
 
$\mathbf P_m^{\mathrm{OE}}\mathbf F_n =
\begin{bmatrix}
\mathbf F_m & \mathbf F_m\\
\mathbf F_m\,\mathrm{diag}\big(\omega_n^{0},\dots,\omega_n^{n/2-1}\big) & \mathbf F_m\,\mathrm{diag}\big(\omega_n^{n/2},\dots,\omega_n^{n-1}\big)
\end{bmatrix}
=
\begin{bmatrix}
\mathbf F_m & \\
 & \mathbf F_m
\end{bmatrix}
\begin{bmatrix}
\mathbf I & \mathbf I\\
\mathrm{diag}\big(\omega_n^{0},\dots,\omega_n^{n/2-1}\big) & \mathrm{diag}\big(-\omega_n^{0},\dots,-\omega_n^{n/2-1}\big)
\end{bmatrix}.$
This reveals how to apply a divide-and-conquer idea when evaluating Fn x.


Example: partitioning of the Fourier matrix for n = 10, ω := ω10 :

P5^OE F10 =
  ω0 ω0 ω0 ω0 ω0 ω0 ω0 ω0 ω0 ω0
  ω0 ω2 ω4 ω6 ω8 ω0 ω2 ω4 ω6 ω8
  ω0 ω4 ω8 ω2 ω6 ω0 ω4 ω8 ω2 ω6
  ω0 ω6 ω2 ω8 ω4 ω0 ω6 ω2 ω8 ω4
  ω0 ω8 ω6 ω4 ω2 ω0 ω8 ω6 ω4 ω2
  ω0 ω1 ω2 ω3 ω4 ω5 ω6 ω7 ω8 ω9
  ω0 ω3 ω6 ω9 ω2 ω5 ω8 ω1 ω4 ω7
  ω0 ω5 ω0 ω5 ω0 ω5 ω0 ω5 ω0 ω5
  ω0 ω7 ω4 ω1 ω8 ω5 ω2 ω9 ω6 ω3
  ω0 ω9 ω8 ω7 ω6 ω5 ω4 ω3 ω2 ω1
y

What if n ≠ 2^L? Quoted from the MATLAB manual:

To compute an n-point DFT when n is composite (that is, when n = pq), the FFTW library decomposes the
problem using the Cooley-Tukey algorithm, which first computes p transforms of size q, and then computes
q transforms of size p. The decomposition is applied recursively to both the p- and q-point DFTs until the
problem can be solved using one of several machine-generated fixed-size "codelets." The codelets in turn
use several algorithms in combination, including a variation of Cooley-Tukey, a prime factor algorithm, and
a split-radix algorithm. The particular factorization of n is chosen heuristically.


The execution time for fft depends on the length of the transform. It is fastest for powers of two. It is
almost as fast for lengths that have only small prime factors. It is typically several times slower for
lengths that are prime or which have large prime factors → Ex. 4.3.0.12.

Remark 4.3.0.8 (FFT based on general factorization) We motivate the Fast Fourier transform algorithm for a DFT of length n = pq, p, q ∈ N (Cooley-Tukey algorithm). Again, we start with re-indexing in the DFT formula for a vector $y = [y_0,\dots,y_{n-1}]\in\mathbb C^n$:

$c_k = \sum_{j=0}^{n-1} y_j\,\omega_n^{jk} \overset{[j=:lp+m]}{=} \sum_{m=0}^{p-1}\sum_{l=0}^{q-1} y_{lp+m}\,e^{-\frac{2\pi i}{pq}(lp+m)k} = \sum_{m=0}^{p-1}\omega_n^{mk}\sum_{l=0}^{q-1} y_{lp+m}\,\omega_q^{l(k\bmod q)}\,.\qquad(4.3.0.9)$

Step I: perform p DFTs of length q, $z_m = \mathrm{DFT}_q\big(\big[y_{lp+m}\big]_{l=0}^{q-1}\big)$:

$(z_m)_k = z_{m,k} := \sum_{l=0}^{q-1} y_{lp+m}\,\omega_q^{lk}\,,\quad 0\le m<p\,,\ 0\le k<q\,.$

Step II: for k =: rq + s, 0 ≤ r < p, 0 ≤ s < q compute

$c_{rq+s} = \sum_{m=0}^{p-1} e^{-\frac{2\pi i}{pq}(rq+s)m}\,z_{m,s} = \sum_{m=0}^{p-1}\big(\omega_n^{ms}\,z_{m,s}\big)\,\omega_p^{mr}\,,$

which amounts to q DFTs of length p after n multiplications. This gives all components $c_k$ of $\mathrm{DFT}_n y$.

[Diagram: the data arranged in a p × q array; Step I applies DFTs of length q along one direction, Step II DFTs of length p along the other.]
In fact, the above considerations are the same as those elaborated in Section 4.2.5 that showed that a
two-dimensional DFT of Y ∈ C m,n can be done by carrying out m one-dimensional DFTs of length n plus
n one-dimensional DFTs of length m, see (4.2.5.4) and Code 4.2.5.6. y
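A direct transcription of Step I and Step II into EIGEN code might look as follows. This is my own sketch (not a lecture code; the function name dftFactored is made up), assuming n = p·q; the two families of sub-transforms are delegated to EIGEN's FFT helper.

#include <Eigen/Dense>
#include <unsupported/Eigen/FFT>
#include <complex>
#include <cmath>
// Sketch of the Cooley-Tukey factorization for n = p*q:
// Step I: p DFTs of length q, twiddle factors ω_n^{ms}, Step II: q DFTs of length p,
// cf. (4.3.0.9).
Eigen::VectorXcd dftFactored(const Eigen::VectorXcd &y, int p, int q) {
  using Comp = std::complex<double>;
  const int n = p * q;
  const Comp i(0, 1);
  Eigen::FFT<double> fft;
  // Step I: z_m = DFT_q of the subsampled vectors (y_{l*p+m})_{l=0}^{q-1}
  Eigen::MatrixXcd z(p, q);
  for (int m = 0; m < p; ++m) {
    Eigen::VectorXcd ym(q);
    for (int l = 0; l < q; ++l) ym(l) = y(l * p + m);
    z.row(m) = fft.fwd(ym).transpose();
  }
  // Multiply by the twiddle factors ω_n^{ms}
  for (int m = 0; m < p; ++m)
    for (int s = 0; s < q; ++s)
      z(m, s) *= std::exp(-2.0 * M_PI * m * s / n * i);
  // Step II: q DFTs of length p and reassembly of c_{rq+s}
  Eigen::VectorXcd c(n);
  for (int s = 0; s < q; ++s) {
    const Eigen::VectorXcd zs = z.col(s);
    const Eigen::VectorXcd ws = fft.fwd(zs);  // length-p DFT
    for (int r = 0; r < p; ++r) c(r * q + s) = ws(r);
  }
  return c;
}

For small factors p, q this costs O(n(p + q)) operations instead of O(n^2); applying the splitting recursively to the sub-transforms yields the general Cooley-Tukey FFT.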

Remark 4.3.0.10 (FFT for prime n) When n ≠ 2^L, even the Cooley-Tukey algorithm of Rem. 4.3.0.8 will eventually lead to a DFT for a vector with prime length.
Quoted from the MATLAB manual:
When n is a prime number, the FFTW library first decomposes an n-point problem into three (n − 1)-point problems using Rader’s algorithm [Rad68]. It then uses the Cooley-Tukey decomposition described above to compute the (n − 1)-point DFTs.

Details of Rader’s algorithm: the starting point is a theorem from number theory:

$\forall\,p\in\mathbb N\ \text{prime}\ \exists\,g\in\{1,\dots,p-1\}:\ \{g^k\bmod p:\ k = 1,\dots,p-1\} = \{1,\dots,p-1\}\,,$

permutation $P_{p,g}:\{1,\dots,p-1\}\to\{1,\dots,p-1\}\,,\quad P_{p,g}(k) = g^k\bmod p\,,$

reversing permutation $P_k:\{1,\dots,k\}\to\{1,\dots,k\}\,,\quad P_k(i) = k-i+1\,.$

With these two permutations we can achieve something amazing:

For the Fourier matrix $\mathbf F_p = (f_{ij})_{i,j=1}^{p}$ the permuted block $\mathbf P_{p-1}\mathbf P_{p,g}\,(f_{ij})_{i,j=2}^{p}\,\mathbf P_{p,g}^{\top}$ is circulant.

Example for p = 13: g = 2, permutation (2 4 8 3 6 12 11 9 5 10 7 1)

F13 −→
  ω0 ω0  ω0  ω0  ω0  ω0  ω0  ω0  ω0  ω0  ω0  ω0  ω0
  ω0 ω2  ω4  ω8  ω3  ω6  ω12 ω11 ω9  ω5  ω10 ω7  ω1
  ω0 ω1  ω2  ω4  ω8  ω3  ω6  ω12 ω11 ω9  ω5  ω10 ω7
  ω0 ω7  ω1  ω2  ω4  ω8  ω3  ω6  ω12 ω11 ω9  ω5  ω10
  ω0 ω10 ω7  ω1  ω2  ω4  ω8  ω3  ω6  ω12 ω11 ω9  ω5
  ω0 ω5  ω10 ω7  ω1  ω2  ω4  ω8  ω3  ω6  ω12 ω11 ω9
  ω0 ω9  ω5  ω10 ω7  ω1  ω2  ω4  ω8  ω3  ω6  ω12 ω11
  ω0 ω11 ω9  ω5  ω10 ω7  ω1  ω2  ω4  ω8  ω3  ω6  ω12
  ω0 ω12 ω11 ω9  ω5  ω10 ω7  ω1  ω2  ω4  ω8  ω3  ω6
  ω0 ω6  ω12 ω11 ω9  ω5  ω10 ω7  ω1  ω2  ω4  ω8  ω3
  ω0 ω3  ω6  ω12 ω11 ω9  ω5  ω10 ω7  ω1  ω2  ω4  ω8
  ω0 ω8  ω3  ω6  ω12 ω11 ω9  ω5  ω10 ω7  ω1  ω2  ω4
  ω0 ω4  ω8  ω3  ω6  ω12 ω11 ω9  ω5  ω10 ω7  ω1  ω2

Then apply fast algorithms for multiplication with circulant matrices (= discrete periodic convolution, see § 4.1.4.11) to the right lower (n − 1) × (n − 1) block of the permuted Fourier matrix. These fast algorithms rely on DFTs of length n − 1, see Code 4.2.2.4. y
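The number-theoretic ingredient can be explored with a tiny program. The following C++ sketch is my own; it searches for a generator g by brute force and prints the permutation k ↦ g^k mod p. For p = 13 it reproduces the permutation (2 4 8 3 6 12 11 9 5 10 7 1) quoted above.

#include <vector>
#include <set>
#include <iostream>
// Sketch: find a generator g of the multiplicative group modulo a prime p and
// print the permutation k -> g^k mod p, k = 1,...,p-1 (cf. Rader's algorithm).
int main() {
  const int p = 13;  // prime
  for (int g = 2; g < p; ++g) {
    std::vector<int> perm;
    std::set<int> hit;
    long long pw = 1;
    for (int k = 1; k < p; ++k) {
      pw = (pw * g) % p;
      perm.push_back(static_cast<int>(pw));
      hit.insert(static_cast<int>(pw));
    }
    if (static_cast<int>(hit.size()) == p - 1) {  // g generates {1,...,p-1}
      std::cout << "g = " << g << ", permutation:";
      for (int v : perm) std::cout << ' ' << v;
      std::cout << std::endl;
      break;
    }
  }
}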

Since in Section 4.2 we could implement important operations based on the discrete Fourier transform,
we can now reap the fruits of the availability of a fast implementation of DFT:

Asymptotic complexity of fft.fwd()/fft.inv() for y ∈ C n :  O(n log n).

Asymptotic complexity of discrete periodic convolution (← Section 4.2.2), see Code 4.2.2.4:
     Cost(pconvfft(u,x), u, x ∈ C n ) = O(n log n).

Asymptotic complexity of discrete convolution, see Code 4.2.2.5:
     Cost(myconv(h,x), h, x ∈ C n ) = O(n log n).

The warning issued in Exp. 2.3.1.7 carries over to numerical methods for signal processing:
Never implement DFT/FFT by yourself!
! Under all circumstances use high-quality numerical libraries!

§4.3.0.11 (FFTW - A highly-performing self-tuning library for FFT)

From FFTW homepage: FFTW is a C subroutine library for computing the discrete Fourier transform (DFT)
in one or more dimensions, of arbitrary input size, and of both real and complex data.
FFTW will perform well on most architectures without modification. Hence the name, "FFTW," which
stands for the somewhat whimsical title of “Fastest Fourier Transform in the West.”


Supplementary literature. [FJ05] offers a comprehensive presentation of the design and

implementation of the FFTW library (version 3.x). This paper also conveys the many tricks it takes
to achieve satisfactory performance for DFTs of arbitrary length.
FFTW can be installed from source following the instructions from the installation page after downloading
the source code of FFTW 3.3.8 from the download page. Precompiled binaries for various linux distribu-
tions are available in their main package repositories:
• Ubuntu/Debian: apt-get install fftw3 fftw3-dev
• Fedora: dnf install fftw fftw-devel
EIGEN's FFT module can use different backend implementations, one of which is the FFTW library. The backend may be enabled by defining the preprocessor directive Eigen_FFTW_DEFAULT (prior to inclusion of unsupported/Eigen/FFT) and linking with the FFTW library (-lfftw3). This setup procedure may be handled automatically by a build system like CMake (see set_eigen_fft_backend macro on ➺ GITLAB). y

EXAMPLE 4.3.0.12 (Efficiency of FFT for different backend implementations) We measure the runtimes of FFT in EIGEN linking with different libraries, vector lengths n = 2^L.
Platform:
✦ Linux (Ubuntu 16.04 64bit)
✦ Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz
✦ L2 256KB, L3 4 MB, 8 GB DDR3 @ 1.60GHz
✦ Clang 3.8.0, -O3
For reasonably high input sizes the FFTW backend gives, compared to EIGEN's default backend (Kiss FFT), a speedup of 2-4x. y

Supplementary literature. FFT is covered in almost every textbook on elementary numerical methods, see, for instance, [DR08, Sect. 8.7.3], [Han02, Sect. 53], [QSS00, Sect. 10.9.2].
There are thousands of online tutorials on FFT, for instance
• The Fast Fourier Transform (FFT): Most Ingenious Algorithm Ever? (Offers an unconventional perspective based on polynomial multiplication.)
Review question(s) 4.3.0.13 (The Fast Fourier Transform (FFT))
(Q4.3.0.13.A) What is the asymptotic complexity for m, n → ∞ of the two-dimensional DFT of a matrix
Y ∈ C m,n carried out with the following code:

C++ code 4.2.5.6: Two-dimensional discrete Fourier transform ➺ GITLAB

template <typename Scalar>
void fft2(Eigen::MatrixXcd &C, const Eigen::MatrixBase<Scalar> &Y) {
  using idx_t = Eigen::MatrixXcd::Index;
  const idx_t m = Y.rows();
  const idx_t n = Y.cols();
  C.resize(m, n);
  Eigen::MatrixXcd tmp(m, n);

  Eigen::FFT<double> fft;  // Helper class for DFT
  // Transform rows of matrix Y
  for (idx_t k = 0; k < m; k++) {
    const Eigen::VectorXcd tv(Y.row(k));
    tmp.row(k) = fft.fwd(tv).transpose();
  }

  // Transform columns of temporary matrix
  for (idx_t k = 0; k < n; k++) {
    const Eigen::VectorXcd tv(tmp.col(k));
    C.col(k) = fft.fwd(tv);
  }
}

(Q4.3.0.13.B) Assume that an FFT implementation is available only for vectors of length n = 2 L , L ∈ N.
How do you have to modify the following C++ function for the discrete convolution of two vectors
h, x ∈ C n to ensure that it still enjoys an asymptotic complexity of O(n log n) for n → ∞?

C++ code 4.2.2.5: Implementation of discrete convolution (→ Def. 4.1.3.3) based on periodic discrete convolution ➺ GITLAB

Eigen::VectorXcd fastconv(const Eigen::VectorXcd &h,
                          const Eigen::VectorXcd &x) {
  assert(x.size() == h.size());
  const Eigen::Index n = h.size();
  // Zero padding, cf. (4.1.4.16), and periodic discrete convolution
  // of length 2n - 1, Code 4.2.2.4
  return pconvfft(
      (Eigen::VectorXcd(2 * n - 1) << h, Eigen::VectorXcd::Zero(n - 1))
          .finished(),
      (Eigen::VectorXcd(2 * n - 1) << x, Eigen::VectorXcd::Zero(n - 1))
          .finished());
}

(Q4.3.0.13.C) Again assume that the FFT implementation of E IGEN is available only for vectors of length
n = 2 L , L ∈ N. Propose changes to the following C++ function for the discrete periodic convolution of
two vectors u, x ∈ C n that preserve the asymptotic complexity of O(n log n) for n → ∞.

C++ code (pconvfft): Discrete periodic convolution: DFT implementation ➺ GITLAB

Eigen::VectorXcd pconvfft(const Eigen::VectorXcd &u,
                          const Eigen::VectorXcd &x) {
  Eigen::FFT<double> fft;
  return fft.inv(((fft.fwd(u)).cwiseProduct(fft.fwd(x))).eval());
}

(Q4.3.0.13.D) A family of square matrices $\mathbf H_m\in\mathbb R^{n,n}$, $m\in\mathbb N_0$, $n := 2^m$, is recursively defined as

$\mathbf H_0 := [1]\,,\qquad \mathbf H_m := \begin{bmatrix}\mathbf H_{m-1} & \mathbf H_{m-1}\\ \mathbf H_{m-1} & -\mathbf H_{m-1}\end{bmatrix}\,,\quad m\in\mathbb N\,.$
4. Filtering Algorithms, 4.3. Fast Fourier Transform (FFT) 362


NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024

Devise a recursive algorithm for computing the matrix×vector product Hm x, x ∈ R n and determine its
asymptotic complexity in terms of n := 2m → ∞.

4.4 Trigonometric Transformations

Supplementary literature. [Han02, Sect. 55], see also [Str99] for an excellent presentation of

various variants of the cosine transform.


Keeping in mind exp(2πıx) = cos(2πx) + ı sin(2πx) we may also consider the real/imaginary parts of the Fourier basis vectors (F_n)_{:,j} as bases of R^n and define the corresponding basis transformations. They can all be realized by means of fft with an asymptotic computational effort of O(n log n). These transformations avoid the use of complex numbers.
Details are given in the sequel.

4.4.1 Sine transform


Another trigonometric basis transform in R^{n−1}, n ∈ N: the standard basis of R^{n−1} is traded for the "sine basis", which consists of the vectors

    [ sin(kπ/n), sin(2kπ/n), ..., sin((n−1)kπ/n) ]^T ∈ R^{n−1} ,   k = 1, ..., n−1 .

Basis transform matrix (sine basis → standard basis):   S_n := ( sin(jkπ/n) )_{j,k=1}^{n−1} ∈ R^{n−1,n−1} .

Lemma 4.4.1.1. Properties of the sine matrix

The matrix √(2/n) S_n ∈ R^{n−1,n−1} is real, symmetric and orthogonal (→ Def. 6.3.1.2).

Sine transform of y = [y_1, ..., y_{n−1}]^T ∈ R^{n−1}:

    s_k = Σ_{j=1}^{n−1} y_j sin(πjk/n) ,   k = 1, ..., n−1 .     (4.4.1.2)
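For testing the DFT-based implementations developed below it is convenient to have a direct O(n²) evaluation of (4.4.1.2) as a reference. A minimal sketch (the function sinetrfnaive is not one of the lecture codes):

C++ code (sketch, not one of the lecture codes): direct evaluation of the sine transform (4.4.1.2)

#include <Eigen/Dense>
#include <cmath>

// Direct O(n^2) evaluation of (4.4.1.2); y holds y_1,...,y_{n-1}
Eigen::VectorXd sinetrfnaive(const Eigen::VectorXd &y) {
  const Eigen::Index n = y.size() + 1;
  Eigen::VectorXd s(n - 1);
  for (Eigen::Index k = 1; k < n; ++k) {
    double sum = 0.0;
    for (Eigen::Index j = 1; j < n; ++j) {
      sum += y(j - 1) * std::sin(M_PI * j * k / n); // y_j sin(pi*j*k/n)
    }
    s(k - 1) = sum; // s_k
  }
  return s;
}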

By elementary considerations we can devise a DFT-based algorithm for the sine transform (≙ S_n × vector):

    tool: "wrap around":   ỹ ∈ R^{2n} ,   ỹ_j := { y_j         if j = 1, ..., n−1 ,
                                                   0            if j = 0, n ,              (ỹ "odd")
                                                   −y_{2n−j}    if j = n+1, ..., 2n−1 .


This "wrap around" transformation can be visualized as follows:

[Figure: data values y_j, j = 0, ..., n−1 (left), and the odd, wrapped-around vector ỹ_j, j = 0, ..., 2n−1 (right)]

Next we use sin(x) = (1/(2ı)) (exp(ıx) − exp(−ıx)) to identify the DFT of a wrapped-around vector as a sine transform (in the second sum we substitute j ← 2n − j):

    (F_{2n} ỹ)_k  =(4.2.1.19)=  Σ_{j=1}^{2n−1} ỹ_j e^{−(2πı/2n) kj}
                 = Σ_{j=1}^{n−1} y_j e^{−(πı/n) kj} − Σ_{j=n+1}^{2n−1} y_{2n−j} e^{−(πı/n) kj}
                 = Σ_{j=1}^{n−1} y_j ( e^{−(πı/n) kj} − e^{(πı/n) kj} )
                 = −2ı (S_n y)_k ,   k = 1, ..., n−1 .

C++ code 4.4.1.3: Wrap-around implementation of sine transform ➺ GITLAB

// Simple sine transform of y ∈ R^{n-1} into c ∈ R^{n-1} by (4.4.1.2)
void sinetrfwrap(const VectorXd &y, VectorXd &c) {
  VectorXd::Index n = y.size() + 1;
  // Create wrapped vector ỹ
  VectorXd yt(2 * n);
  yt << 0, y, 0, -y.reverse();

  Eigen::VectorXcd ct;
  Eigen::FFT<double> fft; // DFT helper class
  fft.SetFlag(Eigen::FFT<double>::Flag::Unscaled);
  fft.fwd(ct, yt);

  const std::complex<double> v(0, 2); // factor 2ı
  c = (-ct.middleRows(1, n - 1) / v).real();
}

Remark 4.4.1.4 (Sine transform via DFT of half length) The simple Code 4.4.1.3 relies on a DFT for
vectors of length 2n, which may be a waste of computational resources in some applications. A DFT of
length n is sufficient as demonstrated by the following manipulations.
Step ➀: transform of the coefficients

    ỹ_j = sin(jπ/n) (y_j + y_{n−j}) + ½ (y_j − y_{n−j}) ,   j = 1, ..., n−1 ,   ỹ_0 = 0 .


Step ➁: real DFT (→ Section 4.2.4) of (ỹ_0, ..., ỹ_{n−1}) ∈ R^n:   c_k := Σ_{j=0}^{n−1} ỹ_j e^{−2πı jk/n} .

Hence

    Re{c_k} = Σ_{j=0}^{n−1} ỹ_j cos(2π jk/n) = Σ_{j=1}^{n−1} (y_j + y_{n−j}) sin(πj/n) cos(2π jk/n)
            = Σ_{j=1}^{n−1} 2 y_j sin(πj/n) cos(2π jk/n) = Σ_{j=1}^{n−1} y_j ( sin((2k+1)πj/n) − sin((2k−1)πj/n) )
            = s_{2k+1} − s_{2k−1} ,

    Im{c_k} = −Σ_{j=0}^{n−1} ỹ_j sin(2π jk/n) = −Σ_{j=1}^{n−1} ½ (y_j − y_{n−j}) sin(2π jk/n) = −Σ_{j=1}^{n−1} y_j sin(2π jk/n)
            = −s_{2k} .

Step ➂: extraction of the s_k:

    s_{2k+1} , k = 0, ..., n/2 − 1   ➤  from the recursion s_{2k+1} − s_{2k−1} = Re{c_k}, started with s_1 = Σ_{j=1}^{n−1} y_j sin(πj/n) ,
    s_{2k}   , k = 1, ..., n/2 − 1   ➤  s_{2k} = −Im{c_k} .

Implementation (via an FFT of length n/2):

C++ code 4.4.1.5: Sine transform ➺ GITLAB

void sinetransform(const Eigen::VectorXd &y, Eigen::VectorXd &s) {
  const Eigen::Index n = y.rows() + 1;
  std::complex<double> i(0, 1);

  // Prepare sine terms
  const Eigen::VectorXd x =
      Eigen::VectorXd::LinSpaced(n - 1, 1, static_cast<double>(n - 1));
  const Eigen::VectorXd sinevals = x.unaryExpr([&](double z) {
    return imag(std::pow(std::exp(i * M_PI / static_cast<double>(n)), z));
  });

  // Transform coefficients
  Eigen::VectorXd yt(n);
  yt(0) = 0;
  yt.tail(n - 1) = sinevals.array() * (y + y.reverse()).array() +
                   0.5 * (y - y.reverse()).array();

  // FFT
  Eigen::VectorXcd c;
  Eigen::FFT<double> fft;
  fft.fwd(c, yt);

  s.resize(n);
  s(0) = sinevals.dot(y);

  for (Eigen::Index k = 2; k <= n - 1; ++k) {
    const Eigen::Index j = k - 1; // shift index to consider indices starting from 0
    if (k % 2 == 0) {
      s(j) = -c(k / 2).imag();
    } else {
      s(j) = s(j - 2) + c((k - 1) / 2).real();
    }
  }
}

EXAMPLE 4.4.1.6 (Diagonalization of local translation-invariant linear grid operators) We consider a so-called 5-points-stencil operator on R^{n,n}, n ∈ N, defined as follows:

    T :  R^{n,n} → R^{n,n} ,  X ↦ T(X) ,   (T(X))_{ij} := c x_{ij} + c_y x_{i,j+1} + c_y x_{i,j−1} + c_x x_{i+1,j} + c_x x_{i−1,j}     (4.4.1.7)

with coefficients c, c_y, c_x ∈ R, convention: x_{ij} := 0 for (i, j) ∉ {1, ..., n}².

A matrix can be regarded as a function that assigns values (= matrix entries) to the points of a 2D lattice/grid:

    matrix X ∈ R^{n,n}   ↔   grid function {1, ..., n}² → R .

[Figure: visualization of a grid function]

The identification R^{n,n} ≅ R^{n²}, x_{ij} ~ x̃_{(j−1)n+i} (row-wise numbering), gives a matrix representation T ∈ R^{n²,n²} of T:

 
    T = [  C     c_y I                      ]                  [  c    c_x                    ]
        [ c_y I   C     c_y I               ]                  [ c_x    c    c_x              ]
        [         ⋱      ⋱      ⋱           ]  ∈ R^{n²,n²} ,   [        ⋱     ⋱     ⋱         ]  = C ∈ R^{n,n} ,
        [               c_y I   C    c_y I  ]                  [             c_x    c    c_x  ]
        [                      c_y I   C    ]                  [                   c_x    c   ]

i.e. T is block-tridiagonal with diagonal blocks C and off-diagonal blocks c_y I.

[Figure: the 5-points stencil acting on the grid, weights c_x and c_y attached to the horizontal/vertical neighbours of a grid point]


We already know the sine basis of R^{n,n}:

    B^{kl} = ( sin(π ki/(n+1)) sin(π lj/(n+1)) )_{i,j=1}^n .     (4.4.1.8)

These matrices will also provide a basis of the vector space of grid functions {1, ..., n}² → R.

[Figure: the grid function B^{2,3} for n = 10]

The key observation is that the elements of the sine basis are eigenvectors of T:

    (T(B^{kl}))_{ij} = c sin(π ki/(n+1)) sin(π lj/(n+1))
                      + c_y sin(π ki/(n+1)) ( sin(π l(j−1)/(n+1)) + sin(π l(j+1)/(n+1)) )
                      + c_x sin(π lj/(n+1)) ( sin(π k(i−1)/(n+1)) + sin(π k(i+1)/(n+1)) )
                     = sin(π ki/(n+1)) sin(π lj/(n+1)) ( c + 2 c_y cos(π l/(n+1)) + 2 c_x cos(π k/(n+1)) ) .

Hence B^{kl} is an eigenvector of T (or T after row-wise numbering) and the corresponding eigenvalue is given by c + 2 c_y cos(π l/(n+1)) + 2 c_x cos(π k/(n+1)). Recall very similar considerations for discrete (periodic) convolutions in 1D (→ § 4.2.1.6) and 2D (→ § 4.2.5.9).

The basis transform can be implemented efficiently based on the 1D sine transform:

    X = Σ_{k=1}^n Σ_{l=1}^n y_{kl} B^{kl}   ⇒   x_{ij} = Σ_{k=1}^n sin(π ki/(n+1)) Σ_{l=1}^n y_{kl} sin(π lj/(n+1)) .

Hence nested sine transforms (→ Section 4.2.5) for rows/columns of Y = (y_{kl})_{k,l=1}^n.

Here: implementation of sine transform (4.4.1.2) with “wrapping”-technique.

C++ code 4.4.1.9: 2D sine transform ➺ GITLAB

void sinetransform2d(const Eigen::MatrixXd &Y, Eigen::MatrixXd &S) {
  const Eigen::Index m = Y.rows();
  const Eigen::Index n = Y.cols();

  Eigen::VectorXcd c;
  Eigen::FFT<double> fft;
  const std::complex<double> i(0, 1);

  Eigen::MatrixXcd C(2 * m + 2, n);
  C.row(0) = Eigen::VectorXcd::Zero(n);
  C.middleRows(1, m) = Y.cast<std::complex<double>>();
  C.row(m + 1) = Eigen::VectorXcd::Zero(n);
  C.middleRows(m + 2, m) = -Y.colwise().reverse().cast<std::complex<double>>();

  // FFT on each column of C - Eigen::fft only operates on vectors
  for (int i = 0; i < n; ++i) {
    fft.fwd(c, C.col(i));
    C.col(i) = c;
  }

  C.middleRows(1, m) = i * C.middleRows(1, m) / 2.;

  Eigen::MatrixXcd C2(2 * n + 2, m);
  C2.row(0) = Eigen::VectorXcd::Zero(m);
  C2.middleRows(1, n) = C.middleRows(1, m).transpose();
  C2.row(n + 1) = Eigen::VectorXcd::Zero(m);
  C2.middleRows(n + 2, n) = -C.middleRows(1, m).transpose().colwise().reverse();

  // FFT on each column of C2 - Eigen::fft only operates on vectors
  for (int i = 0; i < m; ++i) {
    fft.fwd(c, C2.col(i));
    C2.col(i) = c;
  }

  S = (i * C2.middleRows(1, n).transpose() / 2.).real();
}

C++ code 4.4.1.10: FFT-based solution of local translation-invariant linear operators ➺ GITLAB

void fftbasedsolutionlocal(const Eigen::MatrixXd &B, double c, double cx,
                           double cy, Eigen::MatrixXd &X) {
  const Eigen::Index m = B.rows();
  const Eigen::Index n = B.cols();

  // Eigen's meshgrid
  const Eigen::MatrixXd I =
      Eigen::RowVectorXd::LinSpaced(n, 1, static_cast<double>(n)).replicate(m, 1);
  const Eigen::MatrixXd J =
      Eigen::VectorXd::LinSpaced(m, 1, static_cast<double>(m)).replicate(1, n);

  // FFT
  Eigen::MatrixXd X_;
  sinetransform2d(B, X_);

  // Translation
  Eigen::MatrixXd T;
  T = c + 2 * cx * (M_PI / (static_cast<double>(n) + 1) * I).array().cos() +
      2 * cy * (M_PI / (static_cast<double>(m) + 1) * J).array().cos();
  X_ = X_.cwiseQuotient(T);

  sinetransform2d(X_, X);
  X = 4 * X / ((m + 1) * (n + 1));
}

Thus the diagonalization of T via the 2D sine transform yields an efficient algorithm for solving linear systems of equations T(X) = B: computational cost O(n² log n). y
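A possible way to use Code 4.4.1.10 and to check it numerically is sketched below. The helper applyT(), which applies the stencil (4.4.1.7) directly (with zero values outside the grid), is not one of the lecture codes; the check assumes that sinetransform2d() and fftbasedsolutionlocal() from above are available in the same translation unit and uses the symmetric choice c_x = c_y on a square grid.

C++ code (sketch, not one of the lecture codes): sanity check of the FFT-based solver

#include <Eigen/Dense>
#include <iostream>

// Direct application of the 5-points-stencil operator (4.4.1.7), square grid
Eigen::MatrixXd applyT(const Eigen::MatrixXd &X, double c, double cx, double cy) {
  const Eigen::Index n = X.rows();
  Eigen::MatrixXd Y = c * X;
  Y.leftCols(n - 1) += cy * X.rightCols(n - 1);  // c_y * x_{i,j+1}
  Y.rightCols(n - 1) += cy * X.leftCols(n - 1);  // c_y * x_{i,j-1}
  Y.topRows(n - 1) += cx * X.bottomRows(n - 1);  // c_x * x_{i+1,j}
  Y.bottomRows(n - 1) += cx * X.topRows(n - 1);  // c_x * x_{i-1,j}
  return Y;
}

int main() {
  const int n = 16;
  const double c = 4.0, cx = -1.0, cy = -1.0;
  const Eigen::MatrixXd B = Eigen::MatrixXd::Random(n, n);
  Eigen::MatrixXd X;
  fftbasedsolutionlocal(B, c, cx, cy, X);
  std::cout << "residual norm = " << (applyT(X, c, cx, cy) - B).norm() << std::endl;
  return 0;
}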
EXPERIMENT 4.4.1.11 (Efficiency of FFT-based solver) In the experiment we test the gain in runtime
obtained by using DFT-based algorithms for solving linear systems of equations with coefficient matrix T
induced by the operator T from (4.4.1.7) with the values

c = 4 ,   c_x = c_y = −1 .


This means

    T := [  C   −I                ]                       [  4   −1                ]
         [ −I    C   −I           ]                       [ −1    4   −1           ]
         [       ⋱    ⋱    ⋱      ]  ∈ R^{n²,n²} ,   C := [       ⋱    ⋱    ⋱      ]  ∈ R^{n,n} ,
         [           −I    C  −I  ]                       [           −1    4  −1  ]
         [               −I    C  ]                       [               −1    4  ]

i.e. T is block-tridiagonal with diagonal blocks C and off-diagonal blocks −I.

[Figure: runtime [s] of the FFT-based solver vs. a direct ("backslash") solver as a function of n]

tic-toc timing (MATLAB V7, Linux, Intel Pentium 4 Mobile CPU 1.80GHz). Similar results would be obtained by an implementation in PYTHON. y

4.4.2 Cosine transform


Another trigonometric basis transform in R^n, n ∈ N: the standard basis of R^n is traded for the "cosine basis", which consists of the vectors

    [ 2^{−1/2}, 2^{−1/2}, ..., 2^{−1/2} ]^T   and   [ cos((2j+1)kπ/(2n)) ]_{j=0}^{n−1} ,   k = 1, ..., n−1 .

Basis transform matrix (cosine basis → standard basis):

    C_n = ( c_{ij} )_{i,j=0}^{n−1} ∈ R^{n,n}   with   c_{ij} = { 2^{−1/2}                  if i = 0 ,
                                                                 cos( i (2j+1)π/(2n) )     if i > 0 .

Lemma 4.4.2.1. Properties of the cosine matrix

The matrix √(2/n) C_n ∈ R^{n,n} is real and orthogonal (→ Def. 6.3.1.2).

Note: C_n is not symmetric.

    Cosine transform of y = [y_0, ..., y_{n−1}]^T :   c_k = Σ_{j=0}^{n−1} y_j cos( k (2j+1)π/(2n) ) ,   k = 1, ..., n−1 ,
                                                      c_0 = (1/√2) Σ_{j=0}^{n−1} y_j .     (4.4.2.2)

Implementation of C_n y using the "wrapping"-technique as in Code 4.4.1.3:

C++ code 4.4.2.3: Cosine transform ➺ GITLAB

inline void cosinetransform(const Eigen::VectorXd &y, Eigen::VectorXd &c) {
  const Eigen::Index n = y.size();

  Eigen::VectorXd y_(2 * n);
  y_.head(n) = y;
  y_.tail(n) = y.reverse();

  // FFT
  Eigen::VectorXcd z;
  Eigen::FFT<double> fft;
  fft.fwd(z, y_);

  const std::complex<double> i(0, 1);
  c.resize(n);
  c(0) = z(0).real() / (2 * sqrt(2));
  for (Eigen::Index j = 1; j < n; ++j) {
    c(j) = (0.5 * pow(exp(-i * M_PI / (2 * static_cast<double>(n))), j) * z(j)).real();
  }
}

Implementation of C_n^{−1} y ("wrapping"-technique):

C++-code 4.4.2.4: Inverse cosine transform ➺ GITLAB

inline void icosinetransform(const Eigen::VectorXd &c, Eigen::VectorXd &y) {
  const Eigen::Index n = c.size();

  const std::complex<double> i(0, 1);
  Eigen::VectorXcd c_1(n);
  c_1(0) = sqrt(2) * c(0);
  for (Eigen::Index j = 1; j < n; ++j) {
    c_1(j) = pow(exp(-i * M_PI / (2 * static_cast<double>(n))), j) * c(j);
  }

  Eigen::VectorXcd c_2(2 * n);
  c_2.head(n) = c_1;
  c_2(n) = 0;
  c_2.tail(n - 1) = c_1.tail(n - 1).reverse().conjugate();

  // FFT
  Eigen::VectorXd z;
  Eigen::FFT<double> fft;
  fft.inv(z, c_2);

  // To obtain the same result as MATLAB,
  // shift the inverse FFT result by 1.
  Eigen::VectorXd y_(2 * n);
  y_.head(2 * n - 1) = z.tail(2 * n - 1);
  y_(2 * n - 1) = z(0);

  y = 2 * y_.head(n);
}
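A quick way to test the pair of codes above is a round trip: applying icosinetransform() to the output of cosinetransform() should return the original vector up to round-off, provided the two functions are mutually inverse as intended. A minimal sketch (not one of the lecture codes):

C++ code (sketch, not one of the lecture codes): round-trip test for the cosine transform codes

#include <Eigen/Dense>
#include <iostream>

int main() {
  const int n = 32;
  const Eigen::VectorXd y = Eigen::VectorXd::Random(n);
  Eigen::VectorXd c, yr;
  cosinetransform(y, c);   // c = C_n y, cf. (4.4.2.2)
  icosinetransform(c, yr); // yr should reproduce y
  std::cout << "round-trip error = " << (y - yr).norm() << std::endl;
  return 0;
}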

Remark 4.4.2.5 (Cosine transforms for compression)

The cosine transforms discussed above are named


DCT-II and DCT-III.
Various cosine transforms arise by imposing
various boundary conditions:
• DCT-II: even around −1/2 and N − 1/2
• DCT-III: even around 0 and odd around N
DCT-II is used in JPEG compression, while a slightly modified DCT-IV forms the main component of the MP3, AAC and WMA formats.
y
Review question(s) 4.4.2.6 (Trigonometric Transformations)

4.5 Toeplitz Matrix Techniques

Video tutorial for Section 4.5 "Toeplitz Matrix Techniques": (20 minutes) Download link,
tablet notes

This section examines FFT-based algorithms for more general problems in numerical linear algebra. It
connects to the matrix perspective of DFT and linear filters that was adopted occasionally in Section 4.1
and Section 4.2.

4.5.1 Matrices with Constant Diagonals


EXAMPLE 4.5.1.1 (Parameter identification for linear time-invariant filters) We want to determine the
impulse response of an LT-FIR channel from (noisy) measurements:
• Given: ( xk )k∈Z m-periodic discrete signal = known input
• Given: (yk )k∈Z m-periodic measured (∗) output signal of a linear time-invariant filter, see Section 4.1.1.
(∗) ➔ inevitably affected by measurement errors!
• Sought: Estimate for the impulse response (→ Def. 4.1.1.12) of the filter

This task reminds us of the parameter estimation problem from Ex. 3.0.1.4, which we tackled with least squares techniques. We employ similar ideas for the current problem.

• Known: the impulse response of the filter has maximal duration n∆t, n ∈ N, n ≤ m

    cf. (4.1.2.4):   ∃ h = [h_0, ..., h_{n−1}]^T ∈ R^n ,  n ≤ m :   y_k = Σ_{j=0}^{n−1} h_j x_{k−j} .     (4.5.1.2)


[Figure: m-periodic input signal x_k and measured output signal y_k plotted over time]

If the yk were exact, we could retrieve h0 , . . . , hn−1 by examining only y0 , . . . , yn−1 and inverting the
discrete periodic convolution (→ Def. 4.1.4.7) using (4.2.1.17).

However, in case the yk are affected by measurements errors it is advisable to use all available yk for a
least squares estimate of the impulse response.

We can now formulate the least squares parameter identification problem: seek h = [h_0, ..., h_{n−1}]^T ∈ R^n such that

    ‖A h − y‖_2 → min ,

where y := [y_0, ..., y_{m−1}]^T and A ∈ R^{m,n} is the matrix

    A = [ x_0      x_{−1}   ...      x_{1−n}
          x_1      x_0      x_{−1}     ⋮
          ⋮        x_1      x_0        ⋱
          ⋮        ⋱        ⋱        x_{−1}
          x_{n−1}  ...      x_1      x_0
          x_n      x_{n−1}           x_1
          ⋮                            ⋮
          x_{m−1}  ...      ...      x_{m−n} ] .

This is a linear least squares problem as introduced in Chapter 3 with a coefficient matrix A that enjoys the property that (A)_{ij} = x_{i−j}, which means that all its diagonals have constant entries.
The coefficient matrix of the normal equations (→ Section 3.1.2, Thm. 3.1.2.1) corresponding to the above linear least squares problem is

    M := A^T A ,   (M)_{ij} = Σ_{k=0}^{m−1} x_{k−i} x_{k−j} =: z_{i−j}

for some m-periodic sequence (z_k)_{k∈Z}, due to the m-periodicity of (x_k)_{k∈Z}.

➣ M ∈ R^{n,n} is a matrix with constant diagonals & symmetric positive semi-definite (→ Def. 1.1.2.6).
  ("constant diagonals" ⇔ (M)_{i,j} depends only on i − j)
y

EXAMPLE 4.5.1.3 (Linear regression for stationary Markov chains) We consider a sequence of scalar random variables (Y_k)_{k∈Z}, a so-called Markov chain. These can be thought of as values for a random quantity sampled at equidistant points in time.
We assume stationary (time-independent) correlations, that is, with (A, Ω, dP) denoting the underlying probability space,

    E(Y_{i−j} Y_{i−k}) = ∫ Y_{i−j}(ω) Y_{i−k}(ω) dP(ω) = u_{k−j}   ∀ i, j, k ∈ Z ,   u_i = u_{−i} .

Here E stands for the expectation of a random variable.

Model: We expect a finite linear dependency of the form

    ∃ x = [x_1, ..., x_n]^T ∈ R^n :   Y_k = Σ_{j=1}^n x_j Y_{k−j}   ∀ k ∈ Z ,


with unknown parameters x_j, j = 1, ..., n. Our task is to estimate the parameters x_1, ..., x_n based on the known correlations u_ℓ. We try to minimize the expectation of the square of the residual. This means that for some fixed i ∈ Z we use the

    estimator:   x = argmin_{x ∈ R^n}  E | Y_i − Σ_{j=1}^n x_j Y_{i−j} |² .     (4.5.1.4)

The trick is to use the linearity of the expectation, which makes it possible to convert (4.5.1.4) into

    x = [x_1, ..., x_n]^T ∈ R^n :   E|Y_i|² − 2 Σ_{j=1}^n x_j u_j + Σ_{k,j=1}^n x_k x_j u_{k−j} → min
    ⟺   x^T A x − 2 b^T x → min   with   b = [u_k]_{k=1}^n ,  A = (u_{i−j})_{i,j=1}^n .     (4.5.1.5)

By definition A is a so-called covariance matrix and, as such, has to be symmetric and positive definite (→ Def. 1.1.2.6). By its very definition it has constant diagonals. Also note that

    x^T A x − 2 b^T x = (x − x*)^T A (x − x*) − (x*)^T A x* ,     (4.5.1.6)

with x* = A^{−1} b. Therefore x* is the unique minimizer of x^T A x − 2 b^T x. The problem is reduced to solving the linear system of equations A x = b (Yule-Walker equation, see below). y
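For moderate n the Yule-Walker system A x = b from (4.5.1.5) can, of course, be set up densely and solved with a standard symmetric factorization; the Toeplitz structure is simply ignored, which entails O(n³) cost. A minimal sketch (not one of the lecture codes), assuming the correlations u_0, ..., u_n are supplied in a vector:

C++ code (sketch, not one of the lecture codes): dense solution of the Yule-Walker system from (4.5.1.5)

#include <Eigen/Dense>
#include <cstdlib>

// u(k) stores the correlation u_k, k = 0,...,n; cost O(n^3)
Eigen::VectorXd yulewalkerdense(const Eigen::VectorXd &u) {
  const Eigen::Index n = u.size() - 1;
  Eigen::MatrixXd A(n, n);
  for (Eigen::Index i = 0; i < n; ++i)
    for (Eigen::Index j = 0; j < n; ++j)
      A(i, j) = u(std::abs(i - j)); // u_{i-j} = u_{|i-j|} by symmetry
  const Eigen::VectorXd b = u.tail(n); // b_k = u_k, k = 1,...,n
  return A.ldlt().solve(b);            // exploit symmetry of A
}

The Levinson algorithm presented in Section 4.5.3 brings this cost down to O(n²) by exploiting the Toeplitz structure.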

Matrices with constant diagonals occur frequently in mathematical models, see Ex. 4.5.1.1, Ex. 4.5.1.3.
They generalize circulant matrices (→ Def. 4.1.4.12).

 
Definition 4.5.1.7. Toeplitz matrix

T = (t_{ij})_{i,j=1}^{m,n} ∈ K^{m,n} is a Toeplitz matrix, if there is a vector u = [u_{−m+1}, ..., u_{n−1}] ∈ K^{m+n−1} such that t_{ij} = u_{j−i}, 1 ≤ i ≤ m, 1 ≤ j ≤ n:

    T = [ u_0      u_1     ...     ...     u_{n−1}
          u_{−1}   u_0     u_1             ⋮
          ⋮        ⋱       ⋱       ⋱       ⋮
          ⋮                ⋱       ⋱       u_1
          u_{1−m}  ...     ...     u_{−1}  u_0    ] .

Note: The "information content" of a matrix M ∈ K^{m,n} with constant diagonals, that is, (M)_{i,j} = m_{i−j}, is m + n − 1 numbers ∈ K.
Hence, though potentially densely populated, m × n Toeplitz matrices are data-sparse with information content ≪ mn.
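A tiny helper that materializes a Toeplitz matrix from its generating vector is often handy for testing the fast algorithms below, even though it discards exactly the data-sparsity just pointed out. The following sketch (not one of the lecture codes) assumes the storage convention u(l) = u_{l−m+1}, l = 0, ..., m+n−2:

C++ code (sketch, not one of the lecture codes): dense Toeplitz matrix from its generating vector (→ Def. 4.5.1.7)

#include <Eigen/Dense>
#include <cassert>

Eigen::MatrixXd toeplitz(const Eigen::VectorXd &u, int m, int n) {
  assert(u.size() == m + n - 1);
  Eigen::MatrixXd T(m, n);
  for (int i = 0; i < m; ++i)          // 0-based row index
    for (int j = 0; j < n; ++j)        // 0-based column index
      T(i, j) = u((j - i) + (m - 1));  // entry u_{j-i}
  return T;
}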

4.5.2 Toeplitz Matrix Arithmetic


 
Given:  T = (u_{j−i}) ∈ K^{m,n}, a Toeplitz matrix with generating vector u = [u_{−m+1}, ..., u_{n−1}]^T ∈ K^{m+n−1}, see Def. 4.5.1.7.

Task:   Efficient evaluation of the matrix×vector product T x, x ∈ K^n.

To motivate the approach we realize that we have already encountered Toeplitz matrices in the convolution
of finite signals discussed in Rem. 4.1.3.1, see (4.1.3.2). The trick introduced in Rem. 4.1.4.15 was to


extend the matrix to a circulant matrix by zero padding, compare (4.1.4.18).

Idea: Extend T ∈ K^{m,n} to a circulant matrix (→ Def. 4.1.4.12) C ∈ K^{m+n,m+n} generated by the (m+n)-periodic sequence (c_j)_{j∈Z} given by

    c_j = { u_j   for j = −m+1, ..., n−1 ,
            0     for j = n ,               + periodic extension.

The upper left m × n block of C contains T:

    (C)_{ij} = c_{i−j} ,  1 ≤ i, j ≤ m+n   ⇒   (C)_{1:m,1:n} = T .     (4.5.2.1)

The following formula demonstrates the structure of C in the case m = n:

    C = [ u_0      u_1      ...      u_{n−1}  0        u_{1−n}  ...      u_{−1}
          u_{−1}   u_0      u_1      ⋱        u_{n−1}  0        ⋱        ⋮
          ⋮        ⋱        ⋱        u_1      ⋮        ⋱        ⋱        u_{1−n}
          u_{1−n}  ...      u_{−1}   u_0      u_1      ...      u_{n−1}  0
          0        u_{1−n}  ...      u_{−1}   u_0      u_1      ...      u_{n−1}
          u_{n−1}  0        ⋱        ⋮        u_{−1}   u_0      u_1      ⋮
          ⋮        ⋱        ⋱        u_{1−n}  ⋮        ⋱        ⋱        u_1
          u_1      ...      u_{n−1}  0        u_{1−n}  ...      u_{−1}   u_0   ] .

Recall from Section 4.3 that the multiplication with a circulant (m+n) × (m+n) matrix (= discrete periodic convolution → Def. 4.1.4.7) can be carried out by means of DFT/FFT with an asymptotic computational effort of O((m+n) log(m+n)) for m, n → ∞, see Code 4.2.2.4.

From (4.5.2.1) it is clear how to implement the matrix×vector product for the Toeplitz matrix T:

    C [ x ; 0 ] = [ T x ; ∗ ]    (zero padding of x ∈ K^n to length m+n).

Therefore the asymptotic computational effort for computing Tx is O((n+m) log(m+n)) for m, n → ∞, provided that an FFT-based algorithm for discrete periodic convolution is used, see Code 4.2.2.4. This complexity is almost optimal in light of the data complexity O(m+n) of the Toeplitz matrix.
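For the square case m = n the zero-padding idea can be condensed into a few lines. The following sketch (not one of the lecture codes) reuses pconvfft() (DFT-based periodic convolution, Code 4.2.2.4) and assumes the storage convention u(l) = u_{l−n+1}, l = 0, ..., 2n−2, for the generating vector. Note that the ordering of the entries of the generating vector passed to pconvfft() reflects the convolution convention pconvfft(a, b)_k = Σ_j a_{(k−j) mod N} b_j realized by that code.

C++ code (sketch, not one of the lecture codes): FFT-based Toeplitz matrix×vector product, cost O(n log n)

#include <Eigen/Dense>
#include <complex>

Eigen::VectorXcd toepmatmult(const Eigen::VectorXcd &u, const Eigen::VectorXcd &x) {
  const Eigen::Index n = x.size(); // u.size() == 2*n - 1 assumed
  // Generating vector of the circulant extension, ordered so that
  // pconvfft(c, .) reproduces T on the first n components:
  // c = [u_0, u_{-1}, ..., u_{1-n}, 0, u_{n-1}, ..., u_1]
  Eigen::VectorXcd c(2 * n);
  c << u.head(n).reverse(), std::complex<double>(0.0), u.tail(n - 1).reverse();
  // Periodic convolution with the zero-padded argument vector
  const Eigen::VectorXcd y = pconvfft(
      c, (Eigen::VectorXcd(2 * n) << x, Eigen::VectorXcd::Zero(n)).finished());
  return y.head(n); // the first n components contain Tx, cf. (4.5.2.1)
}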

4.5.3 The Levinson Algorithm


Given:  Symmetric positive definite (s.p.d.) (→ Def. 1.1.2.6) Toeplitz matrix T = (u_{j−i})_{i,j=1}^n ∈ R^{n,n} with generating vector u = [u_{−n+1}, ..., u_{n−1}] ∈ R^{2n−1}, u_{−k} = u_k.

Note that the symmetry of a Toeplitz matrix is induced by the property u_{−k} = u_k of its generating vector.


Without loss of generality we assume that T has unit diagonal, u_0 = 1.

Task: Find an efficient solution algorithm for the LSE T x = b = [b_1, ..., b_n]^T, b ∈ R^n, the Yule-Walker problem from Ex. 4.5.1.3.

We employ a recursive (inductive) solution strategy.

Define:  ✦ T_k := (u_{j−i})_{i,j=1}^k ∈ K^{k,k} (left upper block of T) ➣ T_k is an s.p.d. Toeplitz matrix,
         ✦ x^k ∈ K^k :  T_k x^k = b^k := [b_1, ..., b_k]^T  ⇔  x^k = T_k^{−1} b^k ,
         ✦ u^k := [u_1, ..., u_k]^T ∈ R^k .

We block-partition the linear system of equations T_{k+1} x^{k+1} = b^{k+1}, k < n:

                      [                    u_k  ]
                      [       T_k          ⋮    ] [  x̃^{k+1}      ]     [  b^k    ]
    T_{k+1} x^{k+1} = [                    u_1  ] [               ]  =  [         ]     (4.5.3.1)
                      [ u_k   ...   u_1     1   ] [ x^{k+1}_{k+1} ]     [ b_{k+1} ]

Now recall block Gaussian elimination/block-LU decomposition from Rem. 2.3.1.14, Rem. 2.3.2.19. They teach us how to eliminate x̃^{k+1} and obtain an expression for x^{k+1}_{k+1}.
To state the formulas concisely, we introduce reversing permutations. For a vector they can be realized by
E IGEN’s reverse() method.

Pk : {1, . . . , k } 7→ {1, . . . , k } , Pk (i ) := k − i + 1 . (4.5.3.2)

Then we carry out block elimination:

    x̃^{k+1}       = T_k^{−1} ( b^k − x^{k+1}_{k+1} P_k u^k ) = x^k − x^{k+1}_{k+1} T_k^{−1} P_k u^k ,
    x^{k+1}_{k+1} = b_{k+1} − (P_k u^k)^T x̃^{k+1} = b_{k+1} − (P_k u^k)^T x^k + x^{k+1}_{k+1} (P_k u^k)^T T_k^{−1} P_k u^k .     (4.5.3.3)

The recursive idea is clear after introducing the auxiliary vectors y^k := T_k^{−1} P_k u^k, which converts (4.5.3.3) into

    x^{k+1} = [ x̃^{k+1} ; x^{k+1}_{k+1} ]   with   x^{k+1}_{k+1} = ( b_{k+1} − (P_k u^k)^T x^k ) / σ_k ,
                                                    x̃^{k+1}      = x^k − x^{k+1}_{k+1} y^k ,
              σ_k := 1 − (P_k u^k)^T y^k .     (4.5.3.4)

Here, the assumption that T is s.p.d. ensures σ_k > 0 (σ_k is a Schur complement of the s.p.d. matrix T_{k+1}).


Therefore, solving T_{k+1} x^{k+1} = b^{k+1} seems to entail the following steps:
➊ Solve T_k y^k = P_k u^k (k × k s.p.d. Toeplitz LSE, recursion).
➋ Solve T_k x^k = b^k (k × k s.p.d. Toeplitz LSE, recursion).
➌ Compute x^{k+1} according to (4.5.3.4).

§4.5.3.5 (Asymptotic complexity) Obviously, given x^k and y^k, the evaluations involved in (4.5.3.4) take O(k) operations for k → ∞ in order to get x^{k+1}.


It seems that two recursive calls are necessary in order to obtain y^k and x^k, which enter (4.5.3.4): this is too expensive!

    If b^k = P_k u^k, then x^k = y^k.

➣ A simple linear recursion is sufficient to compute the y^k.

Hence, the y^k can be computed with an asymptotic cost of O(k²) for k → ∞. Once the y^k are available, another simple linear recursion gives us x^k with a cost of O(k²) for k → ∞.
Total cost for solving T x = b: O(n²) for n → ∞. y

Below we give a C++ implementation of the Levinson algorithm for the solution of the Yule-Walker problem
Tx = b with an s.p.d. Toeplitz matrix described by its generating vector u (recursive implementation, xk ,
yk computed simultaneously, un+1 not used!)

C++ code 4.5.3.6: Levinson algorithm ➺ GITLAB

void levinson(const Eigen::VectorXd &u, const Eigen::VectorXd &b,
              Eigen::VectorXd &x, Eigen::VectorXd &y) {
  const Eigen::Index k = u.size() - 1; // matrix size - 1
  // Trivial case of a 1x1 linear system
  if (k == 0) {
    x.resize(1); x(0) = b(0);
    y.resize(1); y(0) = u(0);
    return;
  }
  // Vectors holding the result of the recursive call
  Eigen::VectorXd xk;
  Eigen::VectorXd yk;
  // Recursive call for computing x^k and y^k
  levinson(u.head(k), b.head(k), xk, yk);
  // Coefficient sigma_k from (4.5.3.4)
  const double sigma = 1 - u.head(k).dot(yk);
  // Update of x according to (4.5.3.4)
  const double t = (b(k) - u.head(k).reverse().dot(xk)) / sigma;
  x = xk - t * yk.head(k).reverse();
  x.conservativeResize(x.size() + 1);
  x(x.size() - 1) = t;
  // Update of the vectors y^k
  const double s = (u(k) - u.head(k).reverse().dot(yk)) / sigma;
  y = yk - s * yk.head(k).reverse();
  y.conservativeResize(y.size() + 1);
  y(y.size() - 1) = s;
}

Note that this implementation of the Levinson algorithm employs a simple linear recursion with computational cost ∼ (n − k) on level k, k = 0, ..., n−1, which results in an overall asymptotic complexity of O(n²) for n → ∞, as already discussed in § 4.5.3.5.

Remark 4.5.3.7 (Fast Toeplitz solvers) Meanwhile researchers have found better methods [Ste03]: now there are FFT-based algorithms for solving Tx = b, T a Toeplitz matrix, with asymptotic complexity O(n log³ n)! y


Supplementary literature. [DR08, Sect. 8.5]: Very detailed and elementary presentation, but it introduces the discrete Fourier transform through trigonometric interpolation, which is not covered in this chapter. Hardly addresses discrete convolution.
[Han02, Ch. IX] presents the topic from a mathematical point of view stressing approximation and
trigonometric interpolation. Good reference for algorithms for circulant and Toeplitz matrices.
[Sau06, Ch. 10] also discusses the discrete Fourier transform with emphasis on interpolation and
(least squares) approximation. The presentation of signal processing differs from that of the course.
There is a vast number of books and survey papers dedicated to discrete Fourier transforms, see,
for instance, [Bri88; DV90]. Issues and technical details way beyond the scope of the course are
discussed in these monographs.
Review question(s) 4.5.3.8 (Toeplitz matrix techniques)
(Q4.5.3.8.A) Give an example of a Toeplitz matrix T ∈ R n,n , n > 2, with rank(T) = 1.
(Q4.5.3.8.B) Show that the product of two lower triangular Toeplitz matrices is a Toeplitz matrix again.



Bibliography

[Bri88] E.O. Brigham. The Fast Fourier Transform and Its Applications. Englewood Cliffs, NJ: Prentice-
Hall, 1988 (cit. on p. 377).
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 361, 377).
[DV90] P. Duhamel and M. Vetterli. “Fast Fourier transforms: a tutorial review and a state of the art”.
In: Signal Processing 19 (1990), pp. 259–299 (cit. on pp. 355, 377).
[FJ05] M. Frigo and S. G. Johnson. “The Design and Implementation of FFTW3”. In: Proceedings
of the IEEE 93.2 (Feb. 2005), pp. 216–231. DOI: 10.1109/JPROC.2004.840301 (cit. on
p. 361).
[Gut09] M.H. Gutknecht. Lineare Algebra. Lecture Notes. SAM, ETH Zürich, 2009 (cit. on p. 319).
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 317, 361, 363,
377).
[HR11] Georg Heinig and Karla Rost. “Fast algorithms for Toeplitz and Hankel matrices”. In: Linear
Algebra Appl. 435.1 (2011), pp. 1–59.
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
pp. 319, 351).
[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Math-
ematics. New York: Springer, 2000 (cit. on pp. 344, 361).
[Rad68] C.M. Rader. “Discrete Fourier Transforms when the Number of Data Samples Is Prime”. In:
Proceedings of the IEEE 56 (1968), pp. 1107–1108 (cit. on p. 359).
[Sau06] T. Sauer. Numerical analysis. Boston: Addison Wesley, 2006 (cit. on p. 377).
[Ste03] M. Stewart. “A Superfast Toeplitz Solver with Improved Numerical Stability”. In: SIAM J. Matrix
Analysis Appl. 25.3 (2003), pp. 669–693 (cit. on p. 376).
[Str99] Gilbert Strang. “The Discrete Cosine Transform”. In: SIAM Review 41.1 (1999), pp. 135–147.
DOI: 10.1137/S0036144598336745 (cit. on p. 363).
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on pp. 348, 350).

Chapter 5

Machine Learning of One-Dimensional Data


(Data Interpolation and Data Fitting in 1D)

Contents
5.1 Abstract Interpolation (AI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
5.2 Global Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
5.2.1 Uni-Variate Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
5.2.2 Polynomial Interpolation: Theory . . . . . . . . . . . . . . . . . . . . . . . . 389
5.2.3 Polynomial Interpolation: Algorithms . . . . . . . . . . . . . . . . . . . . . . 393
5.2.4 Polynomial Interpolation: Sensitivity . . . . . . . . . . . . . . . . . . . . . . 409
5.3 Shape-Preserving Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
5.3.1 Shape Properties of Functions and Data . . . . . . . . . . . . . . . . . . . . . 414
5.3.2 Piecewise Linear Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . 416
5.3.3 Cubic Hermite Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
5.4 Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
5.4.1 Spline Function Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
5.4.2 Cubic-Spline Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
5.4.3 Structural Properties of Cubic Spline Interpolants . . . . . . . . . . . . . . . 431
5.4.4 Shape Preserving Spline Interpolation . . . . . . . . . . . . . . . . . . . . . . 435
5.5 Algorithms for Curve Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
5.5.1 CAD Task: Curves from Control Points . . . . . . . . . . . . . . . . . . . . . 440
5.5.2 Bezier Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
5.5.3 Spline Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
5.6 Trigonometric Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
5.6.1 Trigonometric Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
5.6.2 Reduction to Lagrange Interpolation . . . . . . . . . . . . . . . . . . . . . . . 452
5.6.3 Equidistant Trigonometric Interpolation . . . . . . . . . . . . . . . . . . . . . 454
5.7 Least Squares Data Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459

5.1 Abstract Interpolation (AI)


The task of (one-dimensional, scalar) data interpolation (point interpolation) can be described as follows:

One-dimensional interpolation

Given: data points (ti , yi ), i = 0, . . . , n, n ∈ N, ti ∈ I ⊂ R, yi ∈ R


Objective: Reconstruction of a function f : I 7→ R


• satisfying the n + 1 interpolation conditions (IC)

f ( ti ) = yi , i = 0, . . . , n . (5.1.0.2)

• and belonging to a set V of eligible functions.


n
The function f we find is called the interpolant of the given data set {(ti , yi )}i=0 .

Parlance: The numbers t_i ∈ R are called nodes, the y_i ∈ R are the (data) values.

    • = data points (t_i, y_i) ∈ R² ,   t_i = nodes ,   y_i = values.

The graph of the interpolant f passes through the data points (Fig. 146).
Of course, a necessary requirement on the data is that the t_i are pairwise distinct:

    t_i ≠ t_j , if i ≠ j, for all i, j ∈ {0, ..., n} .

Remark 5.1.0.3 (Generalization of data) In (supervised) machine learning this task is called the gener-
alization of the data, because we aim for the creation of a model in the form of the function f : I → R
that permits us to generate new data points based on what we have “learned” from the provided data. y

For ease of presentation we will usually assume that the nodes are ordered: t0 < t1 < · · · < tn and
[t0 , tn ] ⊂ I . However, algorithms often must not take for granted sorted nodes.

Remark 5.1.0.4 (Interpolation of vector-valued data) A natural generalization is data interpolation with
vector-valued data values, seeking a function f : I → R d , d ∈ N, such that, for given data points (ti , yi ),
ti ∈ I mutually different, yi ∈ R d , it satisfies the interpolation conditions f(ti ) = yi , i = 0, . . . , n.

In this case all methods available for scalar data can be applied component-wise.

An important application is curve reconstruction, that is, the interpolation of points y_0, ..., y_n ∈ R² in the plane. A particular aspect of this problem is that the nodes t_i also have to be found, usually from the location of the y_i in a preprocessing step.

[Fig. 147: points y_0, ..., y_5 in the plane connected by a reconstructed curve]
y


Remark 5.1.0.5 (Multi-dimensional data interpolation) In many applications (computer graphics, com-
puter vision, numerical method for partial differential equations, remote sensing, geodesy, etc.) one has
to reconstruct functions of several variables.
This leads to the task of multi-dimensional data interpolation:

Given: data points (xi , yi ), i = 0, . . . , n, n ∈ N, xi ∈ D ⊂ R m , m > 1, yi ∈ R d


Objective: reconstruction of a (continuous) function f : D 7→ R d satisfying the n + 1

interpolation conditions f(xi ) = yi , i = 0, . . . , n.

Significant additional challenges arise in a genuine multidimensional setting. A treatment is beyond the
scope of this course. However, the one-dimensional techniques presented in this chapter are relevant
even for multi-dimensional data interpolation, if the points xi ∈ R m are points of a finite lattice also called
tensor product grid.
For instance, for m = 2 this is the case, if
n o
{xi }i = [tk , sl ]⊤ ∈ R2 : k ∈ {0, . . . , K }, l ∈ {0, . . . , L} , (5.1.0.6)

where tk ∈ R, k = 0, . . . , K , and sl , l = 0, . . . , L, K, L ∈ N, are pairwise distinct nodes. y

§5.1.0.7 (Interpolation schemes) When we talk about "interpolation schemes" in 1D, we mean a mapping

    I :  R^{n+1} × R^{n+1} → { f : I → R } ,   ( [t_i]_{i=0}^n , [y_i]_{i=0}^n ) ↦ interpolant .

Once the function space to which the interpolant belongs is specified, an interpolation scheme defines an "interpolation problem" in the sense of § 1.5.5.1. Sometimes, only the data values y_i are considered input data, whereas the dependence of the interpolant on the nodes t_i is suppressed, see Section 5.2.4. y

[Figure: interactive demo showing different interpolants (piecewise linear, polynomial, spline, pchip) of the same data points]

There are infinitely many ways to fix an interpolant for given data points.

Interpolants can have vastly different properties. In this chapter we will discuss a few widely used methods to build interpolants, and their different properties will become apparent.

We may (have to!) impose additional requirements on the interpolant:


• minimal smoothness of f, e.g. f ∈ C1 , etc.
• special shape of f (positivity, monotonicity, convexity → Section 5.3 below)

EXAMPLE 5.1.0.8 (Constitutive relations from measurements) This example addresses an important
application of data interpolation in 1D.


In this context: t, y =
ˆ two state variables of a physical system, where t determines y: a functional
dependence y = y(t) is assumed.
t y
voltage U current I
Examples: t and y could be pressure p density ρ
magnetic field H magnetic flux B
··· ···
Known: several accurate (∗) measurements

(ti , yi ) , i = 1, . . . , m

Why do we need to extract the constitutive relations as a function? Imagine that t, y correspond to the
voltage U and current I measured for a 2-port non-linear circuit element (like a diode). This element will
be part of a circuit, which we want to simulate based on nodal analysis as in Ex. 8.1.0.1. In order to solve
the resulting non-linear system of equations F (u) = 0 for the nodal potentials (collected in the vector
u) by means of Newton’s method (→ Section 8.5) we need the voltage-current relationship for the circuit
element as a continuously differentiable function I = f (U ).

(∗) Meaning of attribute “accurate”: justification for interpolation. If measured values yi were affected by
considerable errors, one would not impose the interpolation conditions (5.1.0.2), but opt for data fitting (→
Section 5.7). y

We can distinguish two aspects of the interpolation problem:


➊ Find interpolant f : I ⊂ R → R and store/represent it (internally).
➋ Evaluate f at a few or many evaluation points x ∈ I

Remark 5.1.0.9 (Mathematical functions in a numerical code) What does it mean to “represent” or
“make available” a function f : I ⊂ R 7→ R in a computer code?

! A general “mathematical” function f : I ⊂ R 7→ R d , I an interval, contains an “infinite amount of


information”.

Rather, in the context of numerical methods, “function” should be read as “subroutine”, a piece of code that
can, for any x ∈ I , compute f ( x ) in finite time. Even this has to be qualified, because we can only pass
machine numbers x ∈ I ∩ M (→ § 1.5.2.1) and, of course, in most cases, f ( x ) will be an approximation.
In a C++ code a simple real valued function can be incarnated through a function object of a type as given
in Code 5.1.0.10, see also Section 0.3.3.

C++-code 5.1.0.10: C++ data type representing a real-valued function ➺ GITLAB


1 class Function {
2 private :
3 // various internal data describing f
4 public :
5 // Constructor: expects information for specifying the function
6 Function ( /* ... */ ) ;
7 // Evaluation operator
8 double operator () ( double t ) const ;
9 };


Remark 5.1.0.11 (A data type designed for interpolation problems) If a constitutive relationship for a
circuit element is needed in a C++ simulation code (→ Ex. 5.1.0.8), the following specialized Function
class could be used to represent it. It demonstrates the concrete object oriented implementation of an
interpolant.

C++-code 5.1.0.12: C++ class representing an interpolant in 1D ➺ GITLAB

class Interpolant {
private:
  // Various internal data describing f
  // Can be the coefficients of a basis representation (5.1.0.14)
public:
  // Constructor: computation of coefficients c_j of representation (5.1.0.14)
  Interpolant(const vector<double> &t, const vector<double> &y);
  // Evaluation operator for interpolant f
  double operator()(double t) const;
};

Two main components have to be designed and implemented:


✦ The constructor, which is in charge of “setup”, e.g. building and solving a linear system of equations,
see (5.1.0.23) below.
✦ The evaluation operator operator (), e.g., implemented as evaluation of a linear combination, refer
to (5.1.0.14) below.
Crucial issue: computational effort for evaluation of interpolant at single point: O(1) or O(n) (or in be-
tween)?
y

§5.1.0.13 (Internal representation of classes of mathematical functions)

➙ Idea: parametrization, a finite number of parameters c0 , . . . , cm , m ∈ N, characterizes f .


Special case: Representation with a finite linear combination of basis functions b_j : I ⊂ R → R, j = 0, ..., m:

    f = Σ_{j=0}^m c_j b_j   ⇔   f(t) = Σ_{j=0}^m c_j b_j(t) ,  t ∈ I ,   c_j ∈ R^d .     (5.1.0.14)

➙ f belongs to a finite-dimensional function space

    V_m := Span{ {t ↦ b_0(t)}, ..., {t ↦ b_m(t)} }   [ = Span{b_0, ..., b_m} ] ,

with dim V_m = m + 1, provided that {{t ↦ b_i(t)}}_{i=0}^m is linearly independent, which is already implied by the term "basis functions".

Of course, the basis functions b j should be “simple” in the sense that b j ( x ) can be computed efficiently for
every x ∈ I and every j = 0, . . . , m.

Note that the basis functions may depend on the nodes ti , but they must not depend on the values yi .


➙ The internal representation of f (in the data member section of the class Function from
Code 5.1.0.10) will then boil down to storing the coefficients/parameters c j , j = 0, . . . , m.

Note: The focus in this chapter will be on the special case that the data interpolants belong to a finite-
dimensional space of functions spanned by “simple” basis functions.
y

EXAMPLE 5.1.0.15 (Piecewise linear interpolation, see also Section 5.3.2) Recall: A linear function in 1D is a function of the form x ↦ a + bx, a, b ∈ R (polynomial of degree 1).

Piecewise linear interpolation = connect the data points (t_i, y_i), i = 0, ..., n, t_{i−1} < t_i, by line segments ➣ interpolating polygonal line.

[Fig. 149: piecewise linear interpolant of the data points (t_0, y_0), ..., (t_4, y_4)]
What is the space V of functions from which we select the interpolant? Remember that a linear function
R → R always can be written as t 7→ α + βt with suitable coefficients α, β ∈ R. We can use this formula
locally on every interval between two nodes. Assuming sorted nodes, t0 < t1 < · · · < tn , this leads to
the mathematical definition
n o
V := f ∈ C0 ( I ) : f (t) = αi + β i t for t ∈ [ti , ti+1 ], i = 0, . . . , n − 1 . (5.1.0.16)

Here, C0 ( I ) designates the space of continuous functions I → R, I := [t0 , tn ]. Note that “ f ∈ C0 ( I )” is


necessary to render (5.1.0.16) non-ambiguous.
Now, what could be a convenient set of basis functions {b j }nj=0 for representing the piecewise linear
interpolant through n + 1 data points? A possible choice is the “Tent function” (“hat function”) basis:

[Fig. 150: the tent ("hat") function basis b_0, b_1, ..., b_n on the nodes t_0 < t_1 < ... < t_n]

Note: in Fig. 150 the basis functions have to be extended by zero outside the t-range where they are
drawn.


Explicit formulas for these basis functions can be given and bear out that they are really "simple":

    b_0(t) = { 1 − (t − t_0)/(t_1 − t_0)        for t_0 ≤ t < t_1 ,
               0                                 for t ≥ t_1 ,

    b_j(t) = { 1 − (t_j − t)/(t_j − t_{j−1})     for t_{j−1} ≤ t < t_j ,
               1 − (t − t_j)/(t_{j+1} − t_j)     for t_j ≤ t < t_{j+1} ,      j = 1, ..., n−1 ,     (5.1.0.17)
               0                                 elsewhere in [t_0, t_n] ,

    b_n(t) = { 1 − (t_n − t)/(t_n − t_{n−1})     for t_{n−1} ≤ t < t_n ,
               0                                 for t < t_{n−1} .

Moreover, these basis functions are uniquely determined by the conditions
• b_j is continuous on [t_0, t_n],
• b_j is linear on each interval [t_{i−1}, t_i], i = 1, ..., n,
• b_j(t_i) = δ_ij := { 1 if i = j, 0 else }   ➣ a so-called cardinal basis for the node set {t_i}_{i=0}^n.

This last condition implies a simple basis representation of a (the ?) piecewise linear interpolant of the data points (t_i, y_i), i = 0, ..., n:

    f(t) = Σ_{j=0}^n y_j b_j(t) ,   t_0 ≤ t ≤ t_n ,     (5.1.0.18)

where the b_j are given by (5.1.0.17). y
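The cardinal-basis representation (5.1.0.18) translates directly into an evaluation routine. The following sketch (not one of the lecture codes) assumes sorted nodes and locates the relevant interval by binary search, so that a single evaluation costs O(log n):

C++ code (sketch, not one of the lecture codes): evaluation of the piecewise linear interpolant (5.1.0.18)

#include <Eigen/Dense>
#include <algorithm>
#include <cassert>

// Evaluate the piecewise linear interpolant of (t_i, y_i), i = 0,...,n,
// at a point x in [t_0, t_n]; the nodes in t must be sorted.
double pwlinterp(const Eigen::VectorXd &t, const Eigen::VectorXd &y, double x) {
  const Eigen::Index n = t.size() - 1;
  assert(t.size() == y.size() && x >= t(0) && x <= t(n));
  // Find the interval [t_{i-1}, t_i] containing x
  const double *ub = std::upper_bound(t.data(), t.data() + n + 1, x);
  const Eigen::Index i =
      std::max<Eigen::Index>(1, std::min<Eigen::Index>(n, ub - t.data()));
  // Local linear interpolation on [t_{i-1}, t_i]
  const double lambda = (x - t(i - 1)) / (t(i) - t(i - 1));
  return (1.0 - lambda) * y(i - 1) + lambda * y(i);
}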

The property b_j(t_i) = δ_ij, i, j = 0, ..., n, of the tent function basis is so important that it has been given a special name:

Definition 5.1.0.19. Cardinal basis

A basis {b_0, ..., b_n} of an (n+1)-dimensional vector space of functions f : I ⊂ R → R is a cardinal basis with respect to the set {t_0, ..., t_n} ⊂ I of nodes, if

    b_j(t_i) = δ_ij := { 1 , if i = j ,
                         0   else,          i, j ∈ {0, ..., n} .     (5.1.0.20)

§5.1.0.21 (Interpolation as a linear mapping) We consider the setting for interpolation that the interpolant belongs to a finite-dimensional space V_m of functions spanned by basis functions b_0, ..., b_m, see Rem. 5.1.0.9. Then the interpolation conditions imply that the basis expansion coefficients satisfy a linear system of equations:

    (5.1.0.2) & (5.1.0.14)   ⇒   f(t_i) = Σ_{j=0}^m c_j b_j(t_i) = y_i ,  i = 0, ..., n ,     (5.1.0.22)

    ⇕

    A c := [ b_0(t_0) ... b_m(t_0) ] [ c_0 ]     [ y_0 ]
           [    ⋮           ⋮      ] [  ⋮  ]  =  [  ⋮  ]  =: y .     (5.1.0.23)
           [ b_0(t_n) ... b_m(t_n) ] [ c_m ]     [ y_n ]

This is an (n+1) × (m+1) linear system of equations!

The interpolation problem in V_m and the linear system (5.1.0.23) are really equivalent in the sense that (unique) solvability of one implies (unique) solvability of the other.

    Necessary condition for unique solvability of the interpolation problem (5.1.0.22) for all y:   m = n .

If m = n and A from (5.1.0.23) is regular (→ Def. 2.2.1.1), then for any values y_j, j = 0, ..., n, we can find coefficients c_j, j = 0, ..., n, and, from them, build the interpolant according to (5.1.0.14):

    f = Σ_{j=0}^n ( A^{−1} y )_j b_j .     (5.1.0.24)
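In the square case m = n the setup phase of the Interpolant class from Code 5.1.0.12 thus boils down to assembling A and solving (5.1.0.23). A generic sketch (not one of the lecture codes), with the basis passed as a callable b(j, t):

C++ code (sketch, not one of the lecture codes): computing basis expansion coefficients via (5.1.0.23)

#include <Eigen/Dense>
#include <functional>

// Solve A c = y with (A)_{ij} = b_j(t_i); cost O(n^3) for a generic basis
Eigen::VectorXd interpcoeffs(const std::function<double(int, double)> &b,
                             const Eigen::VectorXd &t, const Eigen::VectorXd &y) {
  const Eigen::Index n = t.size() - 1; // basis functions b_0,...,b_n (m = n)
  Eigen::MatrixXd A(n + 1, n + 1);
  for (Eigen::Index i = 0; i <= n; ++i)
    for (Eigen::Index j = 0; j <= n; ++j)
      A(i, j) = b(static_cast<int>(j), t(i));
  return A.lu().solve(y); // Gaussian elimination with partial pivoting
}

For special bases (e.g. the tent functions from Ex. 5.1.0.15, where A is the identity matrix) this general approach is, of course, wasteful.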

For fixed nodes t_i the interpolation problem (5.1.0.22) defines a linear mapping

    I :  R^{n+1} → V_n ,   y ↦ f ,
         (data space)      (function space).

Beware, "linear" in the statement above has nothing to do with a linear function or piecewise linear interpolation discussed in Ex. 5.1.0.15!

Definition 5.1.0.25. Linear interpolation operator

An interpolation operator I : R^{n+1} → C^0([t_0, t_n]) for the given nodes t_0 < t_1 < ... < t_n is called linear, if

    I(αy + βz) = α I(y) + β I(z)   ∀ y, z ∈ R^{n+1} ,  α, β ∈ R .     (5.1.0.26)

✎ Notation: C^0([t_0, t_n]) ≙ vector space of continuous functions on [t_0, t_n]. y
Review question(s) 5.1.0.27 (Abstract Interpolation)
(Q5.1.0.27.A) Let {b_0, ..., b_n} be a basis of a subspace V of the space C^0(I) of continuous functions I ⊂ R → R. Which linear system has to be solved to determine the basis expansion coefficients of the interpolant f ∈ V satisfying the interpolation conditions f(t_i) = y_i for a given node set {t_0, t_1, ..., t_n} and values y_i ∈ R?

How does reordering the nodes affect the coefficient matrix of that linear system?

(Q5.1.0.27.B) Given I ⊂ R and the node set {t_0, t_1, ..., t_n} ⊂ I, the ReLU basis of the space V of piecewise linear continuous functions on that node set is comprised of the functions

    r_0(t) := 1 ,   r_i(t) := { 0             for t < t_{i−1} ,
                                t − t_{i−1}   for t ≥ t_{i−1} ,      i ∈ {1, ..., n} ,  t ∈ I .

• Show that this set of functions {r_0, r_1, ..., r_n} is really a basis of V.
• Assuming that the nodes are sorted, t_0 < t_1 < ... < t_n, describe the structure of the coefficient matrix of the linear system that has to be solved to determine the ReLU basis coefficients of an interpolant.


5.2 Global Polynomial Interpolation


(Global) polynomial interpolation, that is, interpolation into spaces of functions spanned by polynomials
up to a certain degree, is the simplest interpolation scheme and of great importance as building block for
more complex algorithms.

5.2.1 Uni-Variate Polynomials


Polynomials in a single variable are familiar and simple objects:
Notation: Vector space of the (uni-variate) polynomials of degree ≤ k, k ∈ N:

    P_k := { t ↦ α_k t^k + α_{k−1} t^{k−1} + ... + α_1 t + α_0 · 1 ,  α_j ∈ R } .     (5.2.1.1)

(α_k is called the leading coefficient.)

Terminology: The functions t ↦ t^k, k ∈ N_0, are called monomials, and the formula t ↦ α_k t^k + α_{k−1} t^{k−1} + ... + α_0 is the monomial representation of a polynomial.

Obviously, Pk is a vector space, see [NS02, Sect. 4.2, Bsp. 4]. What is its dimension?

Theorem 5.2.1.2. Dimension of space of polynomials

dim Pk = k + 1 and P k ⊂ C ∞ (R ).

In fact, Pk can be regarded as a finite-dimensional subspace of the space C0 (R ) of continuous functions


R → R. The monomial representation introduced above is a way to write a polynomials as a linear
combination of the special basis functions t 7→ tk , see Rem. 5.1.0.9.
Proof. (of Thm. 5.2.1.2) Dimension formula by linear independence of monomials.

As a consequence of Thm. 5.2.1.2 the monomial representation of a polynomial is unique.

§5.2.1.3 (The charms of polynomials) Why are polynomials important in computational mathematics?

➙ Easy to compute (only elementary operations required), integrate and differentiate


➙ Vector space & algebra
➙ Analysis: Taylor polynomials & power series y

Remark 5.2.1.4 (Monomial representation) Polynomials (of degree k) in monomial representation are
stored as a vector of their coefficients a j , j = 0, . . . , k. A convention for the ordering has to be fixed.
For instance, the N UM P Y module of P YTHON stores the coefficients of the monomial representation in an
array in descending order :
P YTHON: p(t) := αk tk + αk−1 tk−1 + · · · + α0 ➙ array (αk , αk−1 , . . . , α0 ) (ordered!).
Thus the evaluation of a polynomial given through an array of monomial coefficients reads as:
1 I n [ 8 ] : numpy . polyval ( [ 3 , 0 , 1 ] , 5 ) # 3 ∗ 52 + 0 ∗ 51 + 1
2 Out [ 8 ] : 76


§5.2.1.5 (Horner scheme → [DR08, Bem. 8.11]) Efficient evaluation of a polynomial in monomial
representation through Horner scheme as indicated by the following representation:

p(t) = t(· · · t(t(αn t + αn−1 ) + αn−2 ) + · · · + α1 ) + α0 . (5.2.1.6)

The following code gives an implementation based on vector data types of E IGEN. The function is vector-
ized in the sense that many evaluation points are processed in parallel.

C++-code 5.2.1.7: Horner scheme (vectorized version) ➺ GITLAB

// Efficient evaluation of a polynomial in monomial representation
// using the Horner scheme (5.2.1.6)
// IN: p = vector of monomial coefficients, length = degree + 1
//     (leading coefficient in p(0), PYTHON convention Rem. 5.2.1.4)
//     t = vector of evaluation points t_i
// OUT: vector of values: polynomial evaluated at t_i
Eigen::VectorXd horner(const Eigen::VectorXd &p, const Eigen::VectorXd &t) {
  const VectorXd::Index n = t.size();
  Eigen::VectorXd y{p[0] * VectorXd::Ones(n)};
  for (unsigned i = 1; i < p.size(); ++i)
    y = t.cwiseProduct(y) + p[i] * VectorXd::Ones(n);
  return y;
}

Optimal asymptotic complexity: O(n)

The Horner scheme is implemented in P YTHON’s “built-in”-function numpy.polyval(p,x). The argu-


ment x can be a matrix or a vector. In this case the function evaluates the polynomial described by p for
each entry/component. Heed Rem. 5.2.1.4. y

Review question(s) 5.2.1.8 (Polynomials)


(Q5.2.1.8.A) Why are polynomials the most widely used class of functions in numerical computations?
(Q5.2.1.8.B) What are the dimensions of the following two subspaces of the space Pk of polynomials of
degree ≤ k,
• Pkeven := { p ∈ Pk : p(t) = p(−t) ∀t ∈ R },

• Pkodd := { p ∈ Pk : p(t) = − p(−t) ∀t ∈ R }?

(Q5.2.1.8.C) For given k ∈ N we store the monomial coefficients of the polynomial p(t) := αk tk +
αk−1 tk−1 + · · · + α0 in a vector a := [αk , . . . , α0 ] ∈ R k+1 . Find a matrix D ∈ R k,k+1 such that
Da ∈ R k provides the monomial coefficients of the derivative p′ .
(Q5.2.1.8.D) The mapping

    Φ :  P_k → R ,   p ↦ ∫_0^1 p(t) dt

is obviously linear and, therefore, has a matrix representation with respect to the monomial basis {t ↦ 1, t ↦ t, t ↦ t², ..., t ↦ t^k} of P_k. Find that matrix.
(Q5.2.1.8.E) A problem from linear algebra: Prove that the functions of the monomial basis

    { t ↦ t^ℓ }_{ℓ=0}^n ⊂ P_n ,  n ∈ N ,

are linearly independent and, thus, form a basis of P_n.
Hint. Differentiate several times!


(Q5.2.1.8.F) The factorized representation of a polynomial p ∈ Pn , n ∈ N, writes it in the form

p(t) = γ0 (t − γ1 ) · · · · · (t − γn ) , γi ∈ R , i = 0, . . . , n . (5.2.1.9)

Somebody proposes to represent generic polynomials used in a numerical code in factorized form
through the vectors [γ0 , γ1 , . . . , .γn ] ∈ R n+1 of coefficients. Discuss the pros and cons.

5.2.2 Polynomial Interpolation: Theory

Supplementary literature. This topic is also presented in [DR08, Sect. 8.2.1], [QSS00,

Sect. 8.1], [AG11, Ch. 10].

Now we consider the interpolation problem introduced in Section 5.1 for the special case that the sought
interpolant belongs to the polynomial space Pk (with suitable degree k).

Lagrange polynomial interpolation problem (LIP)

Given the set of interpolation nodes {t0 , . . . , tn } ⊂ R, n ∈ N, and the values y0 , . . . , yn ∈ R


compute p ∈ Pn such that it satisfies the interpolation conditions (IC)

p(t j ) = y j for j = 0, . . . , n . (5.2.2.2)

Is this a well-defined problem? Obviously, it fits the framework developed in Rem. 5.1.0.9 and § 5.1.0.21,
because Pn is a finite-dimensional space of functions, for which we already know a basis, the monomials.
Thus, in principle, we could examine the matrix A from (5.1.0.23) to decide, whether the polynomial
interpolant exists and is unique. However, there is a shorter way.

§5.2.2.3 (Lagrange polynomials) For a given set {t0 , t1 , . . . , tn } ⊂ R of nodes consider the
Lagrange polynomials   L_i(t) := ∏_{j=0, j≠i}^{n} (t − t_j)/(t_i − t_j) ,   i = 0, . . . , n .   (5.2.2.4)

➙ Evidently, the Lagrange polynomials satisfy L_i ∈ P_n and L_i(t_j) = δ_ij := 1 if i = j, 0 else.

From this relationship we infer that the Lagrange polynomials are linearly independent. Since there are
n + 1 = dim Pn different Lagrange polynomials, we conclude that they form a basis of Pn , which is a
cardinal basis for the node set {ti }in=0 . y
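To make (5.2.2.4) concrete, a deliberately naive evaluation of a single Lagrange polynomial could look as follows (the function name is ours; more efficient schemes follow in Section 5.2.3):

#include <Eigen/Dense>

// Naive evaluation of the Lagrange polynomial L_i from (5.2.2.4)
// for the nodes in t at a single point x; cost O(n) per call.
double lagrangePoly(const Eigen::VectorXd &t, int i, double x) {
  double L = 1.0;
  for (Eigen::Index j = 0; j < t.size(); ++j) {
    if (j != i) L *= (x - t[j]) / (t[i] - t[j]);
  }
  return L;
}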

EXAMPLE 5.2.2.5 ( Lagrange polynomials for uniformly spaced nodes)


Consider the equidistant nodes in [−1, 1]:

    T := { t_j = −1 + (2/n) j ,  j = 0, . . . , n } .

Fig. 151 shows those Lagrange polynomials for this set of nodes that do not vanish in the nodes t_0, t_2, and t_5, respectively, that is, L_0, L_2, and L_5.

(Fig. 151: Lagrange polynomials L_0, L_2, L_5 for equidistant nodes on [−1, 1].)
y

The Lagrange polynomial interpolant p for data points (ti , yi )in=0 allows a straightforward representation
with respect to the basis of Lagrange polynomials for the node set {ti }in=0 :

n
p ( t ) = ∑ yi Li ( t ) ⇔ p ∈ Pn and p(ti ) = yi . (5.2.2.6)
i =0

Theorem 5.2.2.7. Existence & uniqueness of Lagrange interpolation polynomial → [QSS00,


Thm. 8.1], [DR08, Satz 8.3]

The general Lagrange polynomial interpolation problem admits a unique solution p ∈ P_n for any set of data points {(t_i, y_i)}_{i=0}^{n}, n ∈ N, with pairwise distinct interpolation nodes t_i ∈ R (i ≠ j ⇒ t_i ≠ t_j).

Proof. Consider the linear evaluation operator

    eval_T : P_n → R^{n+1} ,   p ↦ (p(t_i))_{i=0}^{n} ,

which maps between finite-dimensional vector spaces of the same dimension, see Thm. 5.2.1.2.

Representation (5.2.2.6) ⇒ existence of the interpolating polynomial ⇒ eval_T is surjective (“onto”).

Known from linear algebra: for a linear mapping T : V → W between finite-dimensional vector spaces with dim V = dim W holds the equivalence

    T surjective ⇔ T bijective ⇔ T injective.

Applying this equivalence to eval_T yields the assertion of the theorem.

Corollary 5.2.2.8. Lagrange interpolation as linear mapping → § 5.1.0.21

The polynomial interpolation in the nodes T := {t j }nj=0 defines a linear operator


    I_T : R^{n+1} → P_n ,   (y_0, . . . , y_n)^T ↦ interpolating polynomial p .   (5.2.2.9)


Remark 5.2.2.10 (Vandermonde matrix) Lagrangian polynomial interpolation leads to linear systems of
equations also for the representation coefficients of the polynomial interpolant in monomial basis, see
§ 5.1.0.21:
    p(t_j) = y_j   ⇐⇒   ∑_{i=0}^{n} a_i t_j^i = y_j ,   j = 0, . . . , n
           ⇐⇒   solution of the (n + 1) × (n + 1) linear system Va = y with the matrix

    V = [ 1   t_0   t_0²   ···   t_0^n
          1   t_1   t_1²   ···   t_1^n
          1   t_2   t_2²   ···   t_2^n
          ⋮    ⋮     ⋮     ⋱     ⋮
          1   t_n   t_n²   ···   t_n^n ]  ∈ R^{n+1,n+1} .   (5.2.2.11)

A matrix in the form of V is called Vandermonde matrix.


The following code initializes a Vandermonde matrix in E IGEN:

C++ code 5.2.2.12: Initialization of Vandermonde matrix ➺ GITLAB


// Initialization of a Vandermonde matrix (5.2.2.11)
// from interpolation points t_i.
MatrixXd vander(const VectorXd &t) {
  const VectorXd::Index n = t.size();
  MatrixXd V(n, n);
  V.col(0) = VectorXd::Ones(n);
  V.col(1) = t;
  // Store componentwise integer powers of the point coordinate vector
  // into the columns of the Vandermonde matrix
  for (int j = 2; j < n; ++j) V.col(j) = (t.array().pow(j)).matrix();
  return V;
}
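As a small illustration of Rem. 5.2.2.13 below, the monomial coefficients of the interpolant can (in principle) be obtained by solving the Vandermonde system. The helper name monomialCoeffs() is ours, and this is a sketch only; solving Va = y is not the recommended interpolation algorithm (cost O(n³), potential ill-conditioning):

#include <Eigen/Dense>
using Eigen::MatrixXd;
using Eigen::VectorXd;

// Monomial coefficients a_0, ..., a_n (lowest power first, note the ordering!)
// of the Lagrange interpolant through (t_i, y_i), assuming vander() from the
// listing above is in scope.
VectorXd monomialCoeffs(const VectorXd &t, const VectorXd &y) {
  const MatrixXd V = vander(t);  // Vandermonde matrix (5.2.2.11)
  return V.lu().solve(y);        // dense LU solve
}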

Remark 5.2.2.13 (Matrix representation of interpolation operator) In the case of Lagrange interpola-
tion:
• if Lagrange polynomials are chosen as basis for Pn , then IT is represented by the identity matrix;
• if monomials are chosen as basis for Pn , then IT is represented by the inverse of the Vandermonde
matrix V, see Eq. (5.2.2.11).
y
Remark 5.2.2.14 (Generalized polynomial interpolation → [DR08, Sect. 8.2.7], [QSS00, Sect. 8.4])
The following generalization of Lagrange interpolation is possible: We still seek a polynomial interpolant,
but besides function values we also prescribe derivatives up to a certain order for the interpolating polynomial at given nodes.
Convention: indicate occurrence of derivatives as interpolation conditions by multiple nodes.


Generalized polynomial interpolation problem

Given the (possibly multiple) nodes t_0, . . . , t_n, n ∈ N, −∞ < t_0 ≤ t_1 ≤ · · · ≤ t_n < ∞, and the values y_0, . . . , y_n ∈ R compute p ∈ P_n such that

    (d^k/dt^k) p(t_j) = y_j   for k = 0, . . . , ℓ_j and j = 0, . . . , n ,   (5.2.2.15)

where ℓ_j := max{ i − i′ : t_j = t_i = t_{i′}, i, i′ = 0, . . . , n }, and ℓ_j + 1 is the multiplicity of the node t_j.

Admittedly, the statement of the generalized polynomial interpolation problem is hard to decipher. Let us
look at a simple special case, which is also the most important case of generalized Lagrange interpolation.
It is the case when all the multiplicities are equal to 2. It is called Hermite interpolation (or osculatory
interpolation) and the generalized interpolation conditions read for nodes t0 = t1 < t2 = t3 < · · · <
tn−1 = tn (note the double nodes!) [QSS00, Ex. 8.6]:

    p(t_{2j}) = y_{2j} ,   p′(t_{2j}) = y_{2j+1} ,   j = 0, . . . , (n−1)/2 .

Theorem 5.2.2.16. Existence & uniqueness of generalized Lagrange interpolation polynomi-


als
The generalized polynomial interpolation problem Eq. (5.2.2.15) admits a unique solution p ∈ P_n for any data values (t_i, y_i), i = 0, . . . , n.

Definition 5.2.2.17. Generalized Lagrange polynomials

The generalized Lagrange polynomials for the nodes T = {t j }nj=0 ⊂ R (multiple nodes allowed)
are defined as Li := IT (ei+1 ), i = 0, . . . , n, where ei = (0, . . . , 0, 1, 0, . . . , 0) T ∈ R n+1 are the
unit vectors.

Note: The linear interpolation operator IT in this definition refers to generalized Lagrangian interpolation.
Its existence is guaranteed by Thm. 5.2.2.16.

EXAMPLE 5.2.2.18 (Generalized Lagrange polynomials for Hermite Interpolation)


Consider the node set

    T = {t_0 = 0, t_1 = 0, t_2 = 1, t_3 = 1} .

Fig. 152 shows the four unique generalized Lagrange polynomials of degree n = 3 (cubic Hermite polynomials) for these nodes. They satisfy

    p_0(0) = 1 ,   p_0(1) = p_0′(0) = p_0′(1) = 0 ,
    p_1(1) = 1 ,   p_1(0) = p_1′(0) = p_1′(1) = 0 ,
    p_2′(0) = 1 ,  p_2(0) = p_2(1) = p_2′(1) = 0 ,
    p_3′(1) = 1 ,  p_3(0) = p_3(1) = p_3′(0) = 0 .

(Fig. 152: the cubic Hermite polynomials p_0, p_1, p_2, p_3 on [0, 1].)


More details are given in Section 5.3.3. For explicit formulas for the polynomials see (5.3.3.5). y
Review question(s) 5.2.2.19 (Polynomial interpolation: theory)
(Q5.2.2.19.A) For a set {t0 , t1 , . . . , tn } ⊂ R of nodes the associated Lagrange polynomials are
    L_i(t) := ∏_{j=0, j≠i}^{n} (t − t_j)/(t_i − t_j) ,   i = 0, . . . , n .

Write down the Lagrange polynomials L0 , L1 , L2 , L3 for the node set {0, 1, 2, 3}.
(Q5.2.2.19.B) Denote by Li , i = 0, . . . , n, n ∈ N, the Lagrange polynomials for the node set
{t0 , t1 , . . . , tn } ⊂ R that is assumed to be sorted t0 < t1 < · · · < tn .
What can you say about the sign of p(t), where p(t) = Lk (t) Lm (t), t ∈ R?
(Q5.2.2.19.C) For a given node set {t0 , t1 , . . . , tn } ⊂ R the associated Vandermonde matrix reads
 
    V = [ 1   t_0   t_0²   ···   t_0^n
          1   t_1   t_1²   ···   t_1^n
          1   t_2   t_2²   ···   t_2^n
          ⋮    ⋮     ⋮     ⋱     ⋮
          1   t_n   t_n²   ···   t_n^n ]  ∈ R^{n+1,n+1} .
Sketch an efficient implementation of the C++ function
Eigen::VectorXd vanderMult( const Eigen::VectorXd &t,
const Eigen::VectorXd &x);

for computing Vx for some x ∈ R n+1 .


5.2.3 Polynomial Interpolation: Algorithms


Now we consider the algorithmic realization of Lagrange interpolation as introduced in Section 5.2.2.
The goal is to achieve an efficient implementation of the class Interpolant introduced in Code 5.1.0.12
specialized for Lagrange polynomial interpolation. The setting is as follows:
Given: nodes T := {−∞ < t0 < t1 < . . . < tn < ∞},
values y := {y0 , y1 , . . . , yn },
Notation: we write p := IT (y) for the unique Lagrange polynomial interpolant, whose existence is as-
serted by Thm. 5.2.2.7.
When used in a numerical code, different demands can be made for a class that implements Lagrange
interpolants. These demands determine, which algorithm is most suitable for the constructors and the
evaluation operators.

5.2.3.1 Multiple evaluations

Task: For ➊ a fixed set {t0 , . . . , tn } of nodes,


and ➋ many different given data values yi , i = 0, . . . , n
and ➌ many arguments xk , k = 1, . . . , N , N ≫ 1,
efficiently compute all p( xk ) for p ∈ Pn interpolating in (ti , yi ), i = 0, . . . , n.

The definition of a possible interpolator data type could be as follows:


C++ code 5.2.3.1: Polynomial Interpolation ➺ GITLAB


class PolyInterp {
 public:
  // Constructors taking the node vector [t_0, ..., t_n]^T as argument
  PolyInterp(const Eigen::VectorXd &t);
  template <typename SeqContainer>
  PolyInterp(const SeqContainer &v);
  // Evaluation operator for data (y_0, ..., y_n); computes
  // p(x_k) for the x_k passed in x
  Eigen::VectorXd eval(const Eigen::VectorXd &y,
                       const Eigen::VectorXd &x) const;
 private:
  // various internal data describing p
  Eigen::VectorXd _t;
};

The member function eval(y,x) expects the n + 1 data values in y and (any number of) evaluation points in
x (↔ [ x1 , . . . , x N ]⊤ ) and returns the vector [ p( x1 ), . . . , p( x N )]⊤ , where p is the Lagrange polynomial
interpolant.

An implementation directly based on the evaluation of Lagrange polynomials (5.2.2.4) and (5.2.2.6) would
incur an asymptotic computational effort of O(n2 N ) for every single invocation of eval and large n, N .

§5.2.3.2 (Barycentric interpolation formula)

By means of pre-computing parts of the Lagrange polynomials Li the asymptotic effort for
eval can be reduced substantially.

Simple manipulations starting from (5.2.2.6) give an alternative representation of p:

    p(t) = ∑_{i=0}^{n} L_i(t) y_i = ∑_{i=0}^{n} ∏_{j=0, j≠i}^{n} (t − t_j)/(t_i − t_j) · y_i
         = ∑_{i=0}^{n} λ_i ∏_{j=0, j≠i}^{n} (t − t_j) · y_i
         = ∏_{j=0}^{n} (t − t_j) · ∑_{i=0}^{n} λ_i/(t − t_i) · y_i ,

where λ_i = 1 / ( (t_i − t_0) ··· (t_i − t_{i−1})(t_i − t_{i+1}) ··· (t_i − t_n) ), i = 0, . . . , n: independent of the y_i!

From the above formula, with p(t) ≡ 1, y_i = 1:

    1 = ∏_{j=0}^{n} (t − t_j) ∑_{i=0}^{n} λ_i/(t − t_i)   ⇒   ∏_{j=0}^{n} (t − t_j) = 1 / ( ∑_{i=0}^{n} λ_i/(t − t_i) )

    Barycentric interpolation formula   p(t) = ( ∑_{i=0}^{n} (λ_i/(t − t_i)) y_i ) / ( ∑_{i=0}^{n} λ_i/(t − t_i) )   (5.2.3.3)

with λ_i = 1 / ( (t_i − t_0) ··· (t_i − t_{i−1})(t_i − t_{i+1}) ··· (t_i − t_n) ), i = 0, . . . , n, independent of t and the y_i [Tre13, Thm. 5.1]. Hence, the values λ_i can be precomputed!

The use of (5.2.3.3) involves
✦ computation of the weights λ_i, i = 0, . . . , n: cost O(n²) (only once!),
✦ cost O(n) for every subsequent evaluation of p.

⇒ total asymptotic complexity O(Nn) + O(n²)   y

The following C++ class demonstrates the use of the barycentric interpolation formula for efficient multiple
point evaluation of a Lagrange interpolation polynomial:

C++-code 5.2.3.4: Class for multiple data/multiple point evaluations ➺ GITLAB


template <typename NODESCALAR = double>
class BarycPolyInterp {
 private:
  using nodeVec_t = Eigen::Matrix<NODESCALAR, Eigen::Dynamic, 1>;
  using idx_t = typename nodeVec_t::Index;
  // Number n of interpolation points, degree of polynomial + 1
  const idx_t n;
  // Locations of the n interpolation points
  nodeVec_t t;
  // Precomputed values lambda_i, i = 0, ..., n-1
  nodeVec_t lambda;

 public:
  // Constructor taking the node vector [t_0, ..., t_n]^T as argument
  explicit BarycPolyInterp(const nodeVec_t &_t);
  // The interpolation points may also be passed in an STL container
  template <typename SeqContainer>
  explicit BarycPolyInterp(const SeqContainer &v);
  // Computation of p(x_k) for data values (y_0, ..., y_n)
  // and evaluation points x_k
  template <typename RESVEC, typename DATAVEC>
  RESVEC eval(const DATAVEC &y, const nodeVec_t &x) const;

 private:
  void init_lambda();
};

C++-code 5.2.3.5: Interpolation class: constructors ➺ GITLAB


template <typename NODESCALAR>
BarycPolyInterp<NODESCALAR>::BarycPolyInterp(const nodeVec_t &_t)
    : n(_t.size()), t(_t), lambda(n) {
  init_lambda();
}

template <typename NODESCALAR>
template <typename SeqContainer>
BarycPolyInterp<NODESCALAR>::BarycPolyInterp(const SeqContainer &v)
    : n(v.size()), t(n), lambda(n) {
  idx_t ti = 0;
  for (auto tp : v) {
    t(ti++) = tp;
  }
  init_lambda();
}

C++-code 5.2.3.6: Interpolation class: precomputations ➺ GITLAB


template <typename NODESCALAR>
void BarycPolyInterp<NODESCALAR>::init_lambda() {
  // Precompute the weights lambda_i with effort O(n^2)
  for (unsigned k = 0; k < n; ++k) {
    // Little workaround: in EIGEN we cannot subtract a vector
    // from a scalar; multiply the scalar by a vector of ones
    lambda(k) =
        1. / ((t(k) * nodeVec_t::Ones(k) - t.head(k)).prod() *
              (t(k) * nodeVec_t::Ones(n - k - 1) - t.tail(n - k - 1)).prod());
  }
}

C++-code 5.2.3.7: Interpolation class: multiple point evaluations ➺ GITLAB


template <typename NODESCALAR>
template <typename RESVEC, typename DATAVEC>
[[nodiscard]] RESVEC BarycPolyInterp<NODESCALAR>::eval(
    const DATAVEC &y, const nodeVec_t &x) const {
  const idx_t N = x.size();  // No. of evaluation points
  RESVEC p(N);               // Output vector
  // Compute quotient of weighted sums of lambda_i/(t - t_i), effort O(n)
  for (int i = 0; i < N; ++i) {
    const nodeVec_t z = (x[i] * nodeVec_t::Ones(n) - t);

    // Check if we want to evaluate close to a node
    const double tref{z.cwiseAbs().maxCoeff()};  // reference size
    idx_t k;
    if (z.cwiseAbs().minCoeff(&k) <
        tref * std::abs(std::numeric_limits<NODESCALAR>::epsilon())) {
      // evaluation at node t_k
      p[i] = y[k];
    } else {
      const nodeVec_t mu = lambda.cwiseQuotient(z);
      p[i] = (mu.cwiseProduct(y)).sum() / mu.sum();
    }
  }  // end for
  return p;
}

Runtime measurements of direct evaluation of a polynomial in monomial representation vs. barycentric


formula are reported in Exp. 5.2.3.13.

5.2.3.2 Single evaluation

Supplementary literature. This topic is also discussed in [DR08, Sect. 8.2.2].

Task: Given a set of interpolation points (t j , y j ), j = 0, . . . , n, with pairwise different interpolation nodes
t j , perform a single point evaluation of the Lagrange polynomial interpolant p at x ∈ R.

We discuss the efficient implementation of the following function for n ≫ 1. It is meant for a single
evaluation of a Lagrange interpolant.
double eval( const Eigen::VectorXd &t, const Eigen::VectorXd &y,
double x);


§5.2.3.8 (Aitken-Neville scheme) The starting point is a recursion formula for partial Lagrange inter-
polants: For 0 ≤ k ≤ ℓ ≤ n define

pk,ℓ := unique interpolating polynomial of degree ℓ − k through (tk , yk ), . . . , (tℓ , yℓ ),

From the uniqueness of polynomial interpolants (→ Thm. 5.2.2.7) we find

    p_{k,k}(x) ≡ y_k  (“constant polynomial”) ,   k = 0, . . . , n ,

    p_{k,ℓ}(x) = ( (x − t_k) p_{k+1,ℓ}(x) − (x − t_ℓ) p_{k,ℓ−1}(x) ) / (t_ℓ − t_k)                    (5.2.3.9)
               = p_{k+1,ℓ}(x) + (x − t_ℓ)/(t_ℓ − t_k) · ( p_{k+1,ℓ}(x) − p_{k,ℓ−1}(x) ) ,   0 ≤ k < ℓ ≤ n ,

because the left and right hand sides represent polynomials of degree ℓ − k through the points (t_j, y_j), j = k, . . . , ℓ:

    ( (x − t_k) p_{k+1,ℓ}(x) − (x − t_ℓ) p_{k,ℓ−1}(x) ) / (t_ℓ − t_k)
        = y_k  for x = t_k   [ p_{k,ℓ−1}(t_k) = y_k ] ,
        = y_j  for x = t_j ,  k < j < ℓ ,
        = y_ℓ  for x = t_ℓ   [ p_{k+1,ℓ}(t_ℓ) = y_ℓ ] .

Thus the values of the partial Lagrange interpolants can be computed sequentially and their dependencies
can be expressed by the following so-called Aitken-Neville scheme:

ℓ−k = 0 1 2 3
t0 y0 =: p0,0 ( x ) → p0,1 ( x ) → p0,2 ( x ) → p0,3 ( x )
ր ր ր
t1 y1 =: p1,1 ( x ) → p1,2 ( x ) → p1,3 ( x ) (ANS)
ր ր
t2 y2 =: p2,2 ( x ) → p2,3 ( x )
ր
t3 y3 =: p3,3 ( x )

Here, the arrows indicate contributions to the convex linear combinations of (5.2.3.9). The computation
can advance from left to right, which is done in following C++ code.

C++-code 5.2.3.10: Aitken-Neville algorithm ➺ GITLAB


// Aitken-Neville algorithm for the evaluation of the interpolating polynomial
// IN:  t, y: (vectors of) interpolation data points
//      x: (single) evaluation point
// OUT: value of the interpolant at x
double ANipoleval(const Eigen::VectorXd &t, Eigen::VectorXd y, const double x) {
  for (int i = 1; i < y.size(); ++i) {
    // Careful: int loop index required for comparison >= 0 !
    for (int k = i - 1; k >= 0; --k) {
      // Recursion (5.2.3.9)
      y[k] = y[k + 1] + (y[k + 1] - y[k]) * (x - t[i]) / (t[i] - t[k]);
    }
  }
  return y[0];
}


The vector y contains the diagonals (from bottom left to top right) of the above triangular tableau. Note that the algorithm merely needs to store that single vector, which translates into O(n) required memory for n → ∞.

    i   y[0]         y[1]         y[2]         y[3]
    0   y_0          y_1          y_2          y_3
    1   p_{0,1}(x)   y_1          y_2          y_3
    2   p_{0,2}(x)   p_{1,2}(x)   y_2          y_3
    3   p_{0,3}(x)   p_{1,3}(x)   p_{2,3}(x)   y_3

The asymptotic complexity of ANipoleval in terms of the number of data points is O(n²) (two nested loops). This is the same as for evaluation based on the barycentric formula, but the Aitken-Neville scheme has a key advantage discussed in the next §. y

§5.2.3.11 (Polynomial interpolation with data updates) The Aitken-Neville algorithm has another inter-
esting feature, when we run through the Aitken-Neville scheme from the top left corner:
n= 0 1 2 3
t0 y0 =: p0,0 ( x ) → p0,1 ( x ) → p0,2 ( x ) → p0,3 ( x )
ր ր ր
t1 y1 =: p1,1 ( x ) → p1,2 ( x ) → p1,3 ( x )
ր ր
t2 y2 =: p2,2 ( x ) → p2,3 ( x )
ր
t3 y3 =: p3,3 ( x )
Thus, the values of partial polynomial interpolants at x can be computed before all data points are even
processed. This results in an “update-friendly” algorithm that can efficiently supply the point values p0,k ( x ),
k = 0, . . . , n, while being supplied with the data points (ti , yi ). It can be used for the efficient implemen-
tation of the following interpolator class:

C++-code 5.2.3.12: Single point evaluation with data updates ➺ GITLAB


class PolyEval {
 private:
  // evaluation point and various internal data describing the polynomials
 public:
  // Constructor taking the evaluation point as argument
  PolyEval(double x);
  // Add another data point and update internal information
  void addPoint(double t, double y);
  // Value of the current interpolating polynomial at x
  double eval(void) const;
};
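The class is only declared here. A minimal sketch of one possible implementation, traversing the Aitken-Neville scheme diagonal by diagonal as described above; the private data layout is our assumption, not part of the declaration:

#include <vector>

class PolyEval {
 private:
  double x_;               // fixed evaluation point
  std::vector<double> t_;  // nodes seen so far
  std::vector<double> p_;  // current diagonal: p_[k] = p_{k,n}(x_)
 public:
  explicit PolyEval(double x) : x_(x) {}
  void addPoint(double t, double y) {
    t_.push_back(t);
    p_.push_back(y);
    // update the diagonal of the Aitken-Neville scheme, cf. (5.2.3.9)
    const int n = static_cast<int>(p_.size()) - 1;
    for (int k = n - 1; k >= 0; --k) {
      p_[k] = p_[k + 1] + (p_[k + 1] - p_[k]) * (x_ - t_[n]) / (t_[n] - t_[k]);
    }
  }
  double eval() const { return p_.front(); }  // p_{0,n}(x_)
};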

EXPERIMENT 5.2.3.13 (Timing polynomial evaluations)

Comparison of the computational time needed for polynomial interpolation of

    { t_i = i }_{i=1,...,n} ,   { y_i = i }_{i=1,...,n} ,

n = 3, . . . , 200, and evaluation in a single point x ∈ [0, n]; minimum computational time over 100 runs, see Fig. 153. The measurements were carried out with the code polytiming.cpp ➺ GITLAB, gcc with -O3.

(Fig. 153: measured runtimes of the different polynomial evaluation approaches versus the number of data points.)

This uses functions given in Code 5.2.3.7, Code 5.2.3.10 and the function polyfit() (with a clearly
greater computational effort !). polyfit() is the equivalent to P YTHON’s/M ATLAB’s built-in polyfit.
The implementation can be found on GitLab.
y
Review question(s) 5.2.3.14 (Polynomial Interpolation: Algorithms)
(Q5.2.3.14.A) The Aitken-Neville scheme was introduced as

n= 0 1 2 3
t0 y0 =: p0,0 ( x ) → p0,1 ( x ) → p0,2 ( x ) → p0,3 ( x )
ր ր ր
t1 y1 =: p1,1 ( x ) → p1,2 ( x ) → p1,3 ( x ) (ANS)
ր ր
t2 y2 =: p2,2 ( x ) → p2,3 ( x )
ր
t3 y3 =: p3,3 ( x )
Give an interpretation of the quantities pk,ℓ occurring in (ANS).
(Q5.2.3.14.B) Describe a scenario for the evaluation of degree-n Lagrange polynomial interpolants in a
single point x ∈ R where the use of the barycentric interpolation formula
    p(t) = ( ∑_{i=0}^{n} (λ_i/(t − t_i)) y_i ) / ( ∑_{i=0}^{n} λ_i/(t − t_i) ) ,   λ_i := ∏_{j=0, j≠i}^{n} 1/(t_i − t_j) ,   (5.2.3.3)

is more efficient than computations based on the Aitken-Neville scheme.


5.2.3.3 Extrapolation to Zero

Extrapolation is interpolation with the evaluation point t outside the interval [inf j=0,...,n t j , sup j=0,...,n t j ].
In the sequel we assume t = 0, ti > 0. Of course, Lagrangian polynomial interpolation can also be used
for extrapolation. In this section we give a very important application of this “Lagrangian extrapolation”.
Task: compute the limit limh→0 ψ( h) with prescribed accuracy, though the evaluation of the function
ψ = ψ( h) (maybe given in procedural form only) for very small arguments | h| ≪ 1 is difficult,
usually because of numerical instability (→ Section 1.5.5).


The extrapolation technique introduced below works well, if


✦ ψ is an even function of its argument: ψ(t) = ψ(−t),
✦ ψ = ψ( h) behaves “nicely” around h = 0.
Theory: The analysis of extrapolation techniques usually relies on the existence of an asymptotic expan-
sion in h2

f ( h) = f (0) + A1 h2 + A2 h4 + · · · + An h2n + R( h) , Ak ∈ R ,

with remainder estimate | R(h)| = O(h2n+2 ) for h → 0 .


Idea: approximating an inaccessible limit by extrapolation to zero

➀ Pick h_0, . . . , h_n for which ψ can be evaluated “safely”.
➁ Evaluate ψ(h_i) for these different h_i, i = 0, . . . , n, |h_i| > 0.
➂ Approximate ψ(0) ≈ p(0), where p ∈ P_n is the interpolating polynomial with p(h_i) = ψ(h_i).

In this manufactured example we have ψ(t) = arctan(2t), which means ψ(0) = 0. The higher the degree of the extrapolating polynomial p, the closer is p(0) to 0.

(Fig. 154: data points (h_i, ψ(h_i)) and extrapolating polynomials of degrees 1, 2, 3.)
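A minimal sketch of this idea in code, reusing ANipoleval() from Code 5.2.3.10; the choice of sample widths h_i is arbitrary here:

#include <Eigen/Dense>
#include <cmath>
#include <iostream>

int main() {
  const int n = 4;  // polynomial degree used for extrapolation
  Eigen::VectorXd h(n + 1), psi(n + 1);
  for (int i = 0; i <= n; ++i) {
    h[i] = std::pow(2.0, -i);          // "safe" evaluation widths h_i
    psi[i] = std::atan(2.0 * h[i]);    // psi(h_i); the target is psi(0) = 0
  }
  // Extrapolate to h = 0 by evaluating the interpolant at 0
  std::cout << "psi(0) approx = " << ANipoleval(h, psi, 0.0) << std::endl;
  return 0;
}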

§5.2.3.16 (Numerical differentiation through extrapolation) In Ex. 1.5.4.7 we have already seen a situ-
ation, where we wanted to compute the limit of a function ψ( h) for h → 0, but could not do it with sufficient
accuracy. In this case ψ( h) was a one-sided difference quotient with span h, meant to approximate f ′ ( x )
for a differentiable function f . The cause of numerical difficulties was cancellation → § 1.5.4.5.

Now we will see how to dodge cancellation in difference quotients and how to use extrapolation to zero to compute derivatives with high accuracy:
Given: smooth function f : I ⊂ R 7→ R in procedural form: function y = f(x)

Sought: (approximation of) f ′ ( x ), x ∈ I .

Natural idea: approximation of the derivative by the (symmetric) difference quotient

    df/dx (x) ≈ ( f(x + h) − f(x − h) ) / (2h) .   (5.2.3.17)

A straightforward implementation fails due to cancellation in the numerator, see also Ex. 1.5.4.7.


C++-code 5.2.3.18: Numeric differentiation through difference quotients ➺ GITLAB


// Numerical differentiation using the one-sided difference quotient
// f'(x) = lim_{h->0} (f(x+h) - f(x)) / h
// IN:  f (function object) = function to differentiate
//      df = exact derivative (to compute the error),
//      name = string of function name (for the plot filename)
// OUT: plot of the error will be saved as "<name>numdiff.eps"
template <class Function, class Derivative>
void diff(const double x, Function &f, Derivative &df, const std::string name) {
  std::vector<long double> error, h;
  // build vector of widths of difference quotients
  for (int i = -1; i >= -61; i -= 5) h.push_back(std::pow(2, i));
  for (unsigned j = 0; j < h.size(); ++j) {
    // compute approximate derivative using the difference quotient
    double df_approx = (f(x + h[j]) - f(x)) / h[j];
    // compute relative error
    double rel_err = std::abs((df_approx - df(x)) / df(x));
    error.push_back(rel_err);
  }
  // (remainder of the listing, which outputs/plots the errors, is omitted in this excerpt)
}

This is apparent in the following approximation error tables for three simple functions and x = 1.1.

    h        f(x) = arctan(x)     f(x) = sqrt(x)       f(x) = exp(x)
             Relative error       Relative error       Relative error
    2^-1     0.20786640808609     0.09340033543136     0.29744254140026
    2^-6     0.00773341103991     0.00352613693103     0.00785334954789
    2^-11    0.00024299312415     0.00011094838842     0.00024418036620
    2^-16    0.00000759482296     0.00000346787667     0.00000762943394
    2^-21    0.00000023712637     0.00000010812198     0.00000023835113
    2^-26    0.00000001020730     0.00000001923506     0.00000000429331
    2^-31    0.00000005960464     0.00000001202188     0.00000012467100
    2^-36    0.00000679016113     0.00000198842224     0.00000495453865
Recall the considerations elaborated in Ex. 1.5.4.7. Owing to the impact of roundoff errors amplified by
cancellation, h → 0 does not achieve arbitrarily high accuracy. Rather, we observe fewer correct digits for
very small h!

Extrapolation offers a numerically stable (→ Def. 1.5.5.19) alternative, because for a 2(n + 1)-times con-
tinuously differentiable function f : I ⊂ R 7→ R, x ∈ I we find that the symmetric difference quotient
behaves like a polynomial in h2 in the vicinity of h = 0. Consider Taylor sum of f in x with Lagrange
remainder term:
    ψ(h) := ( f(x + h) − f(x − h) ) / (2h)
          ∼ f′(x) + ∑_{k=1}^{n} 1/(2k+1)! · (d^{2k+1}f/dx^{2k+1})(x) · h^{2k} + 1/(2n+2)! · f^{(2n+2)}(ξ(x)) · h^{2n+1} .

Since limh→0 ψ( h) = f ′ ( x )

➙ approximate f ′ ( x ) by interpolation of ψ in points hi .

The following C++ function diffex() implements extrapolation to zero of symmetric difference quo-
tients relying on the update-friendly version of the Aitken-Neville algorithm as presented in § 5.2.3.11,
Code 5.2.3.12. Note that the extrapolated value taking into account all available difference quotients al-
ways resides in y[0].


C++-code 5.2.3.19: Numerical differentiation by adaptive extrapolation to zero ➺ GITLAB


2 // Extrapolation based numerical differentation
3 // with a posteriori error control
4 // f: handle of a function defined in a neighbourhood of x ∈ R
5 // x: point at which approximate derivative is desired
6 // h0: initial distance from x
7 // rtol: relative target tolerance, atol: absolute tolerance
8 template <class Function >
9 double d i f f e x ( F u n c t i o n& f , const double x , const double h0 ,
10 const double r t o l , const double a t o l ) {
11 const unsigned n i t = 1 0 ; // Maximum depth of extrapolation
12 VectorXd h ( n i t ) ; h [ 0 ] = h0 ; // Widths of difference quotients
13 VectorXd y ( n i t ) ; // Approximations returned by difference quotients
14 y [ 0 ] = ( f ( x + h0 ) − f ( x − h0 ) ) / ( 2 * h0 ) ; // Widest difference quotient
15

16 // using Aitken-Neville scheme with x = 0, see Code 5.2.3.10


17 f o r ( unsigned i = 1 ; i < n i t ; ++ i ) {
18 // create data points for extrapolation
19 h [ i ] = h [ i − 1 ] / 2 ; // Next width half as big
20 y [ i ] = ( f ( x + h [ i ] ) − f ( x − h [ i ] ) ) / ( 2 . 0 * h [ i ] ) ; // difference quotient
21 // Aitken-Neville update
22 f o r ( i n t k = s t a t i c _ c a s t < i n t >( i − 1 ) ; k >= 0 ; −−k ) {
23 y [ k ] = y [ k +1] − ( y [ k +1] − y [ k ] ) * h [ i ] / ( h [ i ] − h [ k ] ) ;
24 }
25 // termination of extrapolation when desired tolerance is reached
26 const double e r r e s t = std : : abs ( y [ 1 ] − y [ 0 ] ) ; // error indicator
27 i f ( e r r e s t < r t o l * std : : abs ( y [ 0 ] ) | | e r r e s t < a t o l ) { //
28 break ;
29 }
30 }
31 r e t u r n y [ 0 ] ; // Return value extrapolated from largest number of
difference quotients
32 }

While the extrapolation table (→ § 5.2.3.11) is computed, more and more accurate approximations of
f ′ ( x ) become available. Thus, the difference between the two last approximations (stored in y[0] and
y[1] in Code 5.2.3.19) can be used to gauge the error of the current approximation, it provides an error
indicator, which can be used to decide when the level of extrapolation is sufficient, see Line 27.

auto f = [](double x) {return std::atan(x);};   auto g = [](double x) {return std::sqrt(x);};   auto h = [](double x) {return std::exp(x);};

    Degree   diffex(f,1.1,0.5)     diffex(g,1.1,0.5)     diffex(h,1.1,0.5)
             Relative error        Relative error        Relative error
    0        0.04262829970946      0.02849215135713      0.04219061098749
    1        0.02044767428982      0.01527790811946      0.02129207652215
    2        0.00051308519253      0.00061205284652      0.00011487434095
    3        0.00004087236665      0.00004936258481      0.00000825582406
    4        0.00000048930018      0.00000067201034      0.00000000589624
    5        0.00000000746031      0.00000001253250      0.00000000009546
    6        0.00000000001224      0.00000000004816      0.00000000000002
    7        0.00000000000021

Advantage: guaranteed accuracy ➙ efficiency y


Review question(s) 5.2.3.20 (Extrapolation to zero)


(Q5.2.3.20.A) We consider a convergent sequence (α_n)_{n∈N} of real numbers, whose terms can be obtained as output of a black-box function double alpha(unsigned int n). Function calls become the more expensive the larger n is. How might extrapolation to zero be employed to compute the limit lim_{n→∞} α_n?

Hint. Consider a function ψ : ]0, 1] → R with ψ(h) := α_n, if h = n^{−1}.


(Q5.2.3.20.B) We consider the space of even polynomials

    P_{2n}^{even} := { t ↦ γ_0 + γ_2 t² + γ_4 t⁴ + · · · + γ_{2n} t^{2n} ,  γ_{2j} ∈ R, j = 0, . . . , n } ,   n ∈ N .

• Given data points (t_j, y_j), j = 0, . . . , n, t_j > 0, show that there is a unique p ∈ P_{2n}^{even} satisfying the interpolation conditions p(t_j) = y_j for all j ∈ {0, . . . , n}.
• You have an implementation of the functor class for the evaluation of polynomial interpolants at
your disposal, see Code 5.2.3.1:

C++ code Code 5.2.3.1: Polynomial Interpolation ➺ GITLAB


class PolyInterp {
 public:
  // Constructors taking the node vector [t_0, ..., t_n]^T as argument
  PolyInterp(const Eigen::VectorXd &t);
  template <typename SeqContainer>
  PolyInterp(const SeqContainer &v);
  // Evaluation operator for data (y_0, ..., y_n); computes
  // p(x_k) for the x_k passed in x
  Eigen::VectorXd eval(const Eigen::VectorXd &y,
                       const Eigen::VectorXd &x) const;
 private:
  // various internal data describing p
  Eigen::VectorXd _t;
};

Explain how you can use an object of this type to perform multiple evaluations of the unique even polynomial interpolant p ∈ P_{2n}^{even} of the data points (t_j, y_j), j = 0, . . . , n, t_j > 0.

5.2.3.4 Newton Basis and Divided Differences

Supplementary literature. We also refer to [DR08, Sect. 8.2.4], [QSS00, Sect. 8.2].

In § 5.2.3.8 we have seen a method to evaluate partial polynomial interpolants for a single or a few
evaluation points efficiently. Now we want to do this for many evaluation points that may not be known
when we receive information about the first interpolation points.

C++ code 5.2.3.21: Polynomial evaluation ➺ GITLAB


class PolyEval {
 private:
  // evaluation point and various internal data describing the polynomials
 public:
  // Idle constructor
  PolyEval();
  // Add another data point and update internal information
  void addPoint(double t, double y);
  // Evaluation of the current interpolating polynomial at many points
  Eigen::VectorXd operator()(const Eigen::VectorXd &x) const;
};

The challenge: Both addPoint() and the evaluation operator operator () may be called many times
and the implementation has to remain efficient under these circumstances.

Why not use the techniques from § 5.2.3.2? Drawback of the Lagrange basis or barycentric formula:
adding another data point affects all basis polynomials/all precomputed values!

§5.2.3.22 (Newton basis for Pn )


Our tool now is an “update friendly” representation of the polynomial interpolants in terms of the Newton basis for P_n

    N_0(t) := 1 ,   N_1(t) := (t − t_0) ,   . . . ,   N_n(t) := ∏_{i=0}^{n−1} (t − t_i) .   (5.2.3.23)

Note that, clearly, N_j ∈ P_j with leading coefficient 1, j = 0, . . . , n. Since the degrees are pairwise distinct, this implies the linear independence of {N_0, . . . , N_n} and, in light of dim P_n = n + 1 by Thm. 5.2.1.2, gives us the basis property of that subset of P_n.
The abstract considerations of § 5.1.0.21 still apply and we get an (n + 1) × (n + 1) linear system of
equations for the coefficients a j , j = 0, . . . , n, of the polynomial interpolant in Newton basis:

a j ∈ R: a0 N0 (t j ) + a1 N1 (t j ) + · · · + an Nn (t j ) = y j , j = 0, . . . , n . (5.2.3.24)

⇔ triangular linear system

    [ N_0(t_0)  N_1(t_0)  ···  N_n(t_0) ] [ a_0 ]   [ y_0 ]
    [ N_0(t_1)  N_1(t_1)  ···  N_n(t_1) ] [ a_1 ]   [ y_1 ]
    [    ⋮         ⋮               ⋮    ] [  ⋮  ] = [  ⋮  ]
    [ N_0(t_n)  N_1(t_n)  ···  N_n(t_n) ] [ a_n ]   [ y_n ]

    ⇕

    [ 1      0           ···              0               ] [ a_0 ]   [ y_0 ]
    [ 1   (t_1 − t_0)      ⋱              ⋮               ] [ a_1 ]   [ y_1 ]
    [ ⋮      ⋮             ⋱              0               ] [  ⋮  ] = [  ⋮  ]   (5.2.3.25)
    [ 1   (t_n − t_0)    ···   ∏_{i=0}^{n−1} (t_n − t_i)  ] [ a_n ]   [ y_n ]

This triangular linear system can be solved by simple forward substitution:

    a_0 = y_0 ,
    a_1 = (y_1 − a_0)/(t_1 − t_0) = (y_1 − y_0)/(t_1 − t_0) ,
    a_2 = ( y_2 − a_0 − (t_2 − t_0) a_1 ) / ( (t_2 − t_0)(t_2 − t_1) )
        = ( (y_2 − y_0)/(t_2 − t_0) − (y_1 − y_0)/(t_1 − t_0) ) / (t_2 − t_1) ,
    ⋮


We observe that in the course of forward substitution the same quantities are computed again and again. This suggests that a more efficient implementation is possible (see the sketch right below and the divided-difference algorithm that follows). y
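For concreteness, a direct (non-optimal) sketch that computes the Newton coefficients by forward substitution of the triangular system (5.2.3.25); the function name is ours:

#include <Eigen/Dense>

// Newton-basis coefficients a_0, ..., a_n of the interpolant through (t_i, y_i)
// by naive forward substitution of (5.2.3.25); cost O(n^2).
Eigen::VectorXd newtonCoeffsBySubstitution(const Eigen::VectorXd &t,
                                           const Eigen::VectorXd &y) {
  const Eigen::Index n = t.size();
  Eigen::VectorXd a(n);
  for (Eigen::Index j = 0; j < n; ++j) {
    double s = y[j];          // right-hand side y_j
    double N = 1.0;           // value N_k(t_j), built up incrementally
    for (Eigen::Index k = 0; k < j; ++k) {
      s -= a[k] * N;          // subtract a_k * N_k(t_j)
      N *= (t[j] - t[k]);     // N_{k+1}(t_j) = N_k(t_j) * (t_j - t_k)
    }
    a[j] = s / N;             // divide by N_j(t_j)
  }
  return a;
}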

§5.2.3.26 (Divided differences) In order to reveal the pattern, we turn to a new interpretation of the
coefficients a j ∈ R of the interpolating polynomials

p(t) = a0 N0 (t) + a1 N1 (t) + · · · + an Nn (t) , t∈R,


represented in the Newton basis { N0 , . . . , Nn } from (5.2.3.23). We start with the following observation.
The Newton basis polynomial N_j(t) has degree j and leading coefficient 1
➙ a_j is the leading coefficient of the interpolating polynomial p_{0,j}.
Using the notation pℓ,m for partial polynomial interpolants through the data points (tℓ , yℓ ), . . . , (tm , ym ),
which was introduced in Section 5.2.3.2, see (5.2.3.9) we can state the recursion
    p_{k,k}(x) ≡ y_k  (“constant polynomial”) ,   k = 0, . . . , n ,

    p_{k,ℓ}(x) = ( (x − t_k) p_{k+1,ℓ}(x) − (x − t_ℓ) p_{k,ℓ−1}(x) ) / (t_ℓ − t_k)                    (5.2.3.9)
               = p_{k+1,ℓ}(x) + (x − t_ℓ)/(t_ℓ − t_k) · ( p_{k+1,ℓ}(x) − p_{k,ℓ−1}(x) ) ,   0 ≤ k < ℓ ≤ n .
This implies a recursion for the leading coefficients aℓ,m of the interpolating polynomials pℓ,m ,
    a_{ℓ,m} = ( a_{ℓ+1,m} − a_{ℓ,m−1} ) / (t_m − t_ℓ) ,   0 ≤ ℓ < m ≤ n .   (5.2.3.27)
Hence, instead of using elimination for a triangular linear system, we find a simpler and more efficient
algorithm using the so-called divided differences, which are defined by the recursion
    y[t_i] := y_i ,
    y[t_i, . . . , t_{i+k}] := ( y[t_{i+1}, . . . , t_{i+k}] − y[t_i, . . . , t_{i+k−1}] ) / (t_{i+k} − t_i)   (recursion) .   (5.2.3.28)
We adopt this strange notation for the sake of compatibility with literature. y

§5.2.3.29 (Efficient computation of divided differences) Divided differences can be computed by the
divided differences scheme, which is closely related to the Aitken-Neville scheme from Code 5.2.3.10:
t0 y [ t0 ]
> y [ t0 , t1 ]
t1 y [ t1 ] > y [ t0 , t1 , t2 ]
> y [ t1 , t2 ] > y [ t0 , t1 , t2 , t3 ], (5.2.3.30)
t2 y [ t2 ] > y [ t1 , t2 , t3 ]
> y [ t2 , t3 ]
t3 y [ t3 ]

The elements can be computed from left to right, every “>” indicates the evaluation of the recursion
formula (5.2.3.28).

However, we can again resort to the idea of § 5.2.3.11 and traverse (5.2.3.30) along the diagonals from
left bottom to right top: If a new datum (t0 , y0 ) is added, it is enough to compute the n + 1 new divided
differences

y [ t0 ] , y [ t0 , t1 ] , y [ t0 , t1 , t2 ] , . . . , y [ t0 , . . . , t n ] .


The C++ function divdiff() listed in Code 5.2.3.31 computes divided differences for data points
(ti , yi ), i = 0, . . . , n, in this fashion. For n = 3 the values of the outer loop variable l in the different
combinations are as follows:

t0 y [ t0 ]
l = 3 > y [ t0 , t1 ]
t1 y [ t1 ] l=3> y [ t0 , t1 , t2 ]
l = 2 > y [ t1 , t2 ] l = 3 > y [ t0 , t1 , t2 , t3 ], (5.2.3.30)
t2 y [ t2 ] l=2> y [ t1 , t2 , t3 ]
l = 1 > y [ t2 , t3 ]
t3 y [ t3 ]

In divdiff() the divided differences y[t0 ], y[t0 , t1 ], . . . , y[t0 , . . . , tn ] overwrite the original data values
y j in the vector y (in-situ computation).

C++ code 5.2.3.31: In-situ computation of divided differences ➺ GITLAB


// IN:  t = node set (mutually different)
//      y = nodal values
// OUT: y = coefficients of the polynomial in the Newton basis
void divdiff(const Eigen::VectorXd &t, Eigen::VectorXd &y) {
  const Eigen::Index n = y.size();
  // Follow scheme (5.2.3.30), recursion (5.2.3.27)
  for (Eigen::Index l = 1; l < n; ++l) {
    for (Eigen::Index j = n - l; j < n; ++j) {
      y[j] = (y[j] - y[j - 1]) / (t[j] - t[n - 1 - l]);
    }
  }
}

Computational effort: O(n2 ) for no. of data points n → ∞


By derivation, the computed divided differences are the coefficients of the interpolating polynomial in the Newton basis,

    p(t) = a_0 + a_1 (t − t_0) + a_2 (t − t_0)(t − t_1) + · · · + a_n ∏_{j=0}^{n−1} (t − t_j) ,   (5.2.3.32)
    a_0 = y[t_0] ,   a_1 = y[t_0, t_1] ,   a_2 = y[t_0, t_1, t_2] ,  . . . .

Thus, divdiff() from Code 5.2.3.31 computes the coefficients a j , j = 0, . . . , n, of the polynomial
interpolant with respect to the Newton basis. It uses only the first j + 1 data points to find a j . y

§5.2.3.33 (Efficient evaluation of polynomial in Newton form) Let a polynomial be given in “Newton
form”, that is, as a linear combination of Newton basis polynomials as introduced in (5.2.3.23):

    p(t) = a_0 N_0(t) + a_1 N_1(t) + · · · + a_n N_n(t)
         = a_0 · 1 + a_1 (t − t_0) + a_2 (t − t_0)(t − t_1) + · · · + a_n ∏_{j=0}^{n−1} (t − t_j) ,   t ∈ R ,

(the factors multiplying a_1, a_2, . . . , a_n being N_1(t), N_2(t), . . . , N_n(t)),

with known coefficients a j , j = 0, . . . , n, e.g., available as the components of a vector. Embark on “asso-
ciative rewriting”,

p(t) = (. . . (( an (t − tn−1 ) + an−1 )(t − tn−2 ) + an−2 ) · . . . · + a1 )(t − t0 ) + a0 ,


which reveals how we can perform the “backward evaluation” of p(t) in the spirit of Horner’s scheme (→
§ 5.2.1.5, [DR08, Alg. 8.20]):
p ← an , p ← ( t − t n −1 ) p + a n −1 , p ← ( t − t n −2 ) p + a n −2 , ....
A C++ implementation of this idea is given next.

C++ code 5.2.3.34: Divided differences evaluation by modified Horner scheme ➺ GITLAB
// Evaluation of a polynomial in Newton form, that is, represented through
// the vector of its basis expansion coefficients with respect to the
// Newton basis (5.2.3.23).
Eigen::VectorXd evalNewtonForm(const Eigen::VectorXd &t,
                               const Eigen::VectorXd &a,
                               const Eigen::VectorXd &x) {
  const Eigen::Index n = a.size() - 1;
  const Eigen::VectorXd ones = Eigen::VectorXd::Ones(x.size());
  Eigen::VectorXd p{a[n] * ones};
  for (Eigen::Index j = n - 1; j >= 0; --j) {
    p = (x - t[j] * ones).cwiseProduct(p) + a[j] * ones;
  }
  return p;
}

Computational effort: Asymptotically O(n) for a single evaluation of p(t).


(Can be interleaved with the computation of the a j s, see Code 5.2.3.21.)
y
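A brief usage sketch combining the two listings above; recall that divdiff() overwrites the data values with the Newton-basis coefficients in-situ (the sample data are arbitrary):

#include <Eigen/Dense>
#include <iostream>

int main() {
  Eigen::VectorXd t(4), y(4);
  t << 0.0, 1.0, 2.0, 3.0;   // interpolation nodes
  y << 1.0, 2.0, 0.0, 2.0;   // data values
  divdiff(t, y);             // y now holds the Newton coefficients a_j
  Eigen::VectorXd x(3);
  x << 0.5, 1.5, 2.5;        // evaluation points
  std::cout << evalNewtonForm(t, y, x).transpose() << std::endl;
  return 0;
}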

EXAMPLE 5.2.3.35 (Class PolyEval) We show the implementation of a C++ class supporting the
efficient update and evaluation of an interpolating polynomial making use of
• the representation of the Lagrange polynomial interpolants in the Newton basis (5.2.3.23),
• the computation of representation coefficients through a divided difference scheme (5.2.3.30), see
Code 5.2.3.31,
• and point evaluations of the polynomial interpolants by means of Horner-like scheme as introduced
in Code 5.2.3.34.
To understand the code return to the triangular linear system for the Newton basis expansion coefficients
a j of a Lagrange polynomial interpolant of degree n through (ti , yi ), i = 0, . . . , n:
 
    [ 1      0           ···              0               ] [ a_0 ]   [ y_0 ]
    [ 1   (t_1 − t_0)      ⋱              ⋮               ] [ a_1 ]   [ y_1 ]
    [ ⋮      ⋮             ⋱              0               ] [  ⋮  ] = [  ⋮  ]   (5.2.3.25)
    [ 1   (t_n − t_0)    ···   ∏_{i=0}^{n−1} (t_n − t_i)  ] [ a_n ]   [ y_n ]

Given a_0, . . . , a_{n−1} we can thus compute a_n from

    a_n = ( ∏_{i=0}^{n−1} (t_n − t_i) )^{−1} ( y_n − ∑_{k=0}^{n−1} ∏_{i=0}^{k−1} (t_n − t_i) a_k )
        = ( ∏_{i=0}^{n−1} (t_n − t_i) )^{−1} y_n − ∑_{k=0}^{n−1} ( ∏_{i=k}^{n−1} (t_n − t_i) )^{−1} a_k
        = ( . . . ((( y_n − a_0)/(t_n − t_0) − a_1)/(t_n − t_1) − a_2)/ · · · − a_{n−1} ) / (t_n − t_{n−1}) .


C++-code 5.2.3.36: Definition of a class for “update friendly” polynomial interpolant


➺ GITLAB
class PolyEval {
 private:
  std::vector<double> t;  // Interpolation nodes
  std::vector<double> y;  // Coefficients in Newton representation
 public:
  PolyEval();  // Idle constructor
  void addPoint(double t, double y);  // Add another data point
  // evaluate value of the current interpolating polynomial at x
  double operator()(double x) const;
};

C++-code 5.2.3.37: Implementation of class PolyEval ➺ GITLAB


PolyEval::PolyEval() {}

void PolyEval::addPoint(double td, double yd) {
  t.push_back(td);
  y.push_back(yd);
  int n = t.size();
  for (int j = 0; j < n - 1; j++)
    y[n - 1] = ((y[n - 1] - y[j]) / (t[n - 1] - t[j]));
}

double PolyEval::operator()(double x) const {
  double s = y.back();
  for (int i = y.size() - 2; i >= 0; --i)
    s = s * (x - t[i]) + y[i];
  return s;
}

Remark 5.2.3.38 (Divided differences and derivatives) If y0 , . . . , yn are the values of a smooth function
f in the points t0 , . . . , tn , that is, y j := f (t j ), then

    y[t_i, . . . , t_{i+k}] = f^{(k)}(ξ) / k!
for a certain ξ ∈ [ti , ti+k ], see [DR08, Thm. 8.21]. y
Review question(s) 5.2.3.39 (Newton basis and divided differences)
(Q5.2.3.39.A) Given a node set {t0 , t1 , . . . , tn } let a0 , . . . , an ∈ R be the coefficients of a polynomial p
in the associated Newton basis { N0 , . . . , Nn }. Outline an efficient algorithm for computing the basis
expansion coefficients of p with respect to the basis { L0 , . . . , Ln } of Lagrange polynomials for given
node set.
(Q5.2.3.39.B) Given the value vector y ∈ R^{n+1} and the node set {t_0, . . . , t_n} ⊂ R, remember the notation y[t_k, . . . , t_ℓ] for divided differences: y[t_k, . . . , t_ℓ] is the leading coefficient of the unique polynomial interpolating the data points (t_j, (y)_j), j = k, . . . , ℓ, 0 ≤ k ≤ ℓ ≤ n (C++ indexing).

What can you conclude from y[t_0, . . . , t_j] = 0 for all j ∈ {m, . . . , n} for some m ∈ {1, . . . , n}?


5.2.4 Polynomial Interpolation: Sensitivity

Supplementary literature. For related discussions see [QSS00, Sect. 8.1.3].

This section addresses a major shortcoming of polynomial interpolation in case the interpolation knots ti
are imposed, which is usually the case when given data points have to be interpolated, cf. Ex. 5.1.0.8.

This liability has to do with the sensitivity of the Lagrange polynomial interpolation problem. From Sec-
tion 2.2.2 remember that the sensitivity/conditioning of a problem provides a measure for the propaga-
tion of perturbations in the data/inputs to the results/outputs.

§5.2.4.1 (The Lagrange polynomial interpolation problem) As explained in § 1.5.5.1 a “problem” in the
sense of numerical analysis is a mapping/function from a data/input set X into a set Y of results/outputs.
Owing to the existence and uniqueness of the polynomial interpolant as asserted in Thm. 5.2.2.7, the
Lagrange polynomial interpolation problem (LIP) as introduced in Section 5.2.2 describes a mapping

(R × R )n+1 −→ Pn , ((ti , yi ))in=0 7→ p ∈ Pn : p(ti ) = yi , i = 0, . . . , n . (5.2.4.2)

from sets of n + 1 data points, n ∈ N0 to polynomials of degree n. Hence, LIP maps a finite sequence
of numbers to a function, and both the data/input set and result/output set have the structure of vector
spaces.
A more restricted view considers the linear interpolation operator from Cor. 5.2.2.8
    I_T : R^{n+1} → P_n ,   (y_0, . . . , y_n)^T ↦ interpolating polynomial p .   (5.2.2.9)

and identifies the Lagrange polynomial interpolation problem with IT , that is, with the mapping taking only
data values to a polynomial. The interpolation nodes are treated as parameters and not considered data.
For the sake of simplicity we adopt this view in the sequel. y

EXAMPLE 5.2.4.3 (Oscillating polynomial interpolant (Runge’s counterexample) → [DR08,


Sect. 8.3], [QSS00, Ex. 8.1]) This example offers a glimpse of the problems haunting polynomial in-
terpolation.
We examine the polynomial Lagrange interpolant (→ Section 5.2.2, (5.2.2.2)) for uniformly spaced nodes and the following data:

    T := { t_j = −5 + (10/n) j }_{j=0}^{n} ,   y_j = 1/(1 + t_j²) ,   j = 0, . . . , n .

Fig. 155 shows the interpolant for n = 10 together with the interpolant for which the data value at t = 0 has been perturbed by 0.1. (See also Ex. 6.2.2.11 below.)

(Fig. 155: data points, Lagrange interpolant, and perturbed interpolant.)


!   ☞ Possible strong oscillations of interpolating polynomials of high degree on uniformly spaced nodes!
    ☞ Slight perturbations of data values can engender strong variations of a high-degree Lagrange interpolant “far away”.

In fuzzy terms, what we have observed is “high sensitivity” of polynomial interpolation with respect to
perturbations in the data values: small perturbations in the data can cause big variations of the polynomial
interpolants in certain points, which is clearly undesirable. y
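A small sketch reproducing this experiment with the class BarycPolyInterp from Code 5.2.3.4 (node count and perturbation as in the example; plotting omitted, evaluation points chosen arbitrarily):

#include <Eigen/Dense>
#include <iostream>

int main() {
  const int n = 10;
  Eigen::VectorXd t(n + 1), y(n + 1);
  for (int j = 0; j <= n; ++j) {
    t[j] = -5.0 + 10.0 * j / n;          // uniformly spaced nodes in [-5, 5]
    y[j] = 1.0 / (1.0 + t[j] * t[j]);    // samples of the Runge function
  }
  Eigen::VectorXd yp = y;
  yp[n / 2] += 0.1;                       // perturb the data value at t = 0
  BarycPolyInterp<> ip(t);                // precompute barycentric weights
  const Eigen::VectorXd x = Eigen::VectorXd::LinSpaced(9, -4.5, 4.5);
  std::cout << "unperturbed: "
            << ip.eval<Eigen::VectorXd>(y, x).transpose() << std::endl;
  std::cout << "perturbed  : "
            << ip.eval<Eigen::VectorXd>(yp, x).transpose() << std::endl;
  return 0;
}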

§5.2.4.4 (Norms on spaces of functions) For measuring the size of perturbations we need norms (→

Def. 1.5.5.4) on data and result spaces. For the value vectors y := [y0 , . . . , yn ] ∈ R n+1 we can use any
vector norm, see § 1.5.5.3, for instance the maximum norm kyk∞ .
However the result space is a vector space of functions I ⊂ R → R and so we also need norms on the
vector space of continuous functions C0 ( I ), I ⊂ R. The following norms are the most relevant:

    supremum norm   ‖f‖_{L∞(I)} := sup{ |f(t)| : t ∈ I } ,   (5.2.4.5)
    L²-norm         ‖f‖²_{L²(I)} := ∫_I |f(t)|² dt ,          (5.2.4.6)
    L¹-norm         ‖f‖_{L¹(I)} := ∫_I |f(t)| dt .            (5.2.4.7)

Note the relationship with the vector norms introduced in § 1.5.5.3. y

§5.2.4.8 (Sensitivity of linear problem maps) In § 5.1.0.21 we have learned that (polynomial) interpola-
tion gives rise to a linear problem map, see Def. 5.1.0.25. For this class of problem maps the investigation
of sensitivity has to study operator norms, a generalization of matrix norms (→ Def. 1.5.5.10).

Let L : X → Y be a linear problem map between two normed spaces, the data space X (with norm k·k X )
and the result space Y (with norm k·kY ). Thanks to linearity, perturbations of the result y := L(x) for the
input x ∈ X can be expressed as follows:

L(x + δx) = L(x) + L(δx) = y + L(δx) .

Hence, the sensitivity (in terms of propagation of absolute errors) can be measured by the operator norm

    ‖L‖_{X→Y} := sup_{δx ∈ X\{0}} ‖L(δx)‖_Y / ‖δx‖_X .   (5.2.4.9)

This can be read as the “matrix norm of L”, cf. Def. 1.5.5.10. y

It seems challenging to compute the operator norm (5.2.4.9) for L = IT (IT the Lagrange interpolation
operator for node set T ⊂ I ), X = R n+1 (equipped with a vector norm), and Y = C ( I ) (endowed with a


norm from § 5.2.4.4). The next lemma will provide surprisingly simple concrete formulas.

Lemma 5.2.4.10. Absolute conditioning of polynomial interpolation

Given a mesh T ⊂ R with generalized Lagrange polynomials Li , i = 0, . . . , n, and fixed I ⊂ R,


the norm of the interpolation operator satisfies

    ‖I_T‖_{∞→∞} := sup_{y ∈ R^{n+1}\{0}} ‖I_T(y)‖_{L∞(I)} / ‖y‖_∞ = ‖ ∑_{i=0}^{n} |L_i| ‖_{L∞(I)} ,   (5.2.4.11)

    ‖I_T‖_{2→2} := sup_{y ∈ R^{n+1}\{0}} ‖I_T(y)‖_{L²(I)} / ‖y‖_2 ≤ ( ∑_{i=0}^{n} ‖L_i‖²_{L²(I)} )^{1/2} .   (5.2.4.12)

Proof. (for the L∞-norm) By the △-inequality

    ‖I_T(y)‖_{L∞(I)} = ‖ ∑_{j=0}^{n} y_j L_j ‖_{L∞(I)} ≤ sup_{t∈I} ∑_{j=0}^{n} |y_j| |L_j(t)| ≤ ‖y‖_∞ ‖ ∑_{i=0}^{n} |L_i| ‖_{L∞(I)} ,

with equality in (5.2.4.11) for y := (sgn(L_j(t*)))_{j=0}^{n}, t* := argmax_{t∈I} ∑_{i=0}^{n} |L_i(t)|.

Proof. (for the L²-norm) By the △-inequality and the Cauchy-Schwarz inequality in R^{n+1},

    ∑_{j=0}^{n} a_j b_j ≤ ( ∑_{j=0}^{n} |a_j|² )^{1/2} ( ∑_{j=0}^{n} |b_j|² )^{1/2}   ∀ a_j, b_j ∈ R ,

we can estimate

    ‖I_T(y)‖_{L²(I)} ≤ ∑_{j=0}^{n} |y_j| ‖L_j‖_{L²(I)} ≤ ( ∑_{j=0}^{n} |y_j|² )^{1/2} ( ∑_{j=0}^{n} ‖L_j‖²_{L²(I)} )^{1/2} .

Terminology: Lebesgue constant of T :   λ_T := ‖ ∑_{i=0}^{n} |L_i| ‖_{L∞(I)} = ‖I_T‖_{∞→∞}

Remark 5.2.4.13 (Lebesgue constant for equidistant nodes) We consider Lagrange interpolation for the special setting

    I = [−1, 1] ,   T = { −1 + 2k/n }_{k=0}^{n}   (uniformly spaced nodes).

Asymptotic estimate (with (5.2.2.4) and the Stirling formula): for n = 2m

    |L_m(1 − 1/n)| = ( (1/n)·(3/n) ··· ((n−3)/n)·((n+1)/n) ··· ((2n−1)/n) ) / ( ((2/n)·(4/n) ··· ((n−2)/n)·(n/n))² )
                   = (2n)! / ( (n−1) 2^{2n} ((n/2)!)² n! )  ∼  2^{n+3/2} / ( π (n−1) n ) .


(Fig. 156: numerically computed Lebesgue constants λ_T as a function of the polynomial degree n, for equidistant nodes and for Chebychev nodes.)

The ‖·‖_∞-norm of the sum of Lagrange polynomials was approximated by sampling in 10^5 equidistant points, see Code 5.2.4.14.

Sophisticated theory [CR92] gives a lower bound for the Lebesgue constant for uniformly spaced nodes:

    λ_T ≥ C e^{n/2}   with C > 0 independent of n.
We can also perform a numerical evaluation of the expression

    λ_T = ‖ ∑_{i=0}^{n} |L_i| ‖_{L∞(I)} ,

for the Lebesgue constant of polynomial interpolation, see Lemma 5.2.4.10. The following code demon-
strates this:

C++ code 5.2.4.14: Approximate computation of Lebesgue constants ➺ GITLAB


// Computation of the Lebesgue constant of polynomial interpolation
// with nodes t_i passed in the vector t, based on (5.2.4.11).
// N specifies the number of sampling points for the approximate
// computation of the maximum norm of the Lagrange polynomial
// on the interval [-1, 1].
double lebesgue(const VectorXd &t, const unsigned &N) {
  const unsigned n = t.size();
  // compute denominators of the normalized Lagrange polynomials
  // relative to the nodes t
  VectorXd den(n);
  for (unsigned i = 0; i < n; ++i) {
    VectorXd tmp(n - 1);
    // Note: the comma initializer can't be used with vectors of length 0
    if (i == 0) {
      tmp = t.tail(n - 1);
    } else if (i == n - 1) {
      tmp = t.head(n - 1);
    } else {
      tmp << t.head(i), t.tail(n - (i + 1));
    }
    den(i) = (t(i) - tmp.array()).prod();
  }

  double l = 0;  // return value
  for (unsigned j = 0; j < N; ++j) {
    const double x = -1 + j * (2. / N);  // sampling point for the L^inf([-1,1])-norm
    double s = 0;
    for (unsigned k = 0; k < n; ++k) {
      // v provides the value of the normalized Lagrange polynomial
      VectorXd tmp(n - 1);
      if (k == 0) {
        tmp = t.tail(n - 1);
      } else if (k == n - 1) {
        tmp = t.head(n - 1);
      } else {
        tmp << t.head(k), t.tail(n - (k + 1));
      }
      const double v = (x - tmp.array()).prod() / den(k);
      s += std::abs(v);  // sum over the moduli of the polynomials
    }
    l = std::max(l, s);  // maximum of sampled values
  }
  return l;
}
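A possible invocation of lebesgue() for equidistant nodes (degree and sampling resolution chosen arbitrarily):

#include <Eigen/Dense>
#include <iostream>
using Eigen::VectorXd;

int main() {
  const int n = 10;  // polynomial degree
  // equidistant nodes in [-1, 1]
  const VectorXd t = VectorXd::LinSpaced(n + 1, -1.0, 1.0);
  std::cout << "Lebesgue constant (n = " << n << "): "
            << lebesgue(t, 100000) << std::endl;
  return 0;
}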

Note: In Code 5.2.4.14 the norm ‖L_i‖_{L∞(I)} can be computed only approximately by taking the maximum modulus of function values in many sampling points. y

§5.2.4.15 (Importance of knowing the sensitivity of polynomial interpolation) In Ex. 5.1.0.8 we


learned that interpolation is an important technique for obtaining a mathematical (and algorithmic) descrip-
tion of a constitutive relationship from measured data. If the interpolation operator is poorly conditioned,
tiny measurement errors will lead to big (local) deviations of the interpolant from its “true” form.
Since measurement errors are inevitable, poorly conditioned interpolation procedures are useless for de-
termining constitutive relationships from measurements.

Due to potentially “high sensitivity” interpolation with global polynomials of high degree is
! not suitable for data interpolation.
y

Review question(s) 5.2.4.16 (Polynomial Interpolation: Sensitivity)


(Q5.2.4.16.A) Consider the node set T := {0, 1, . . . , n}, n ∈ N, and the associated linear polynomial interpolation operator I_T : R^{n+1} → P_n. Quantify the sensitivity of the mapping

    R → R ,   y_n ↦ ( I_T( [∗, . . . , ∗, y_n]^T ) )( 1/2 ) ,

where the data values marked ∗ are kept fixed.
Hint. The Lagrange polynomials for a node set {t0 , . . . , tn } ⊂ R are given by
    L_i(t) := ∏_{j=0, j≠i}^{n} (t − t_j)/(t_i − t_j) ,   i = 0, . . . , n .   (5.2.2.4)

5.3 Shape-Preserving Interpolation


When reconstructing a quantitative dependence of quantities from measurements, first principles from
physics often stipulate qualitative constraints, which translate into shape properties of the function f , e.g.,
when modelling the material law for a gas:
ti pressure values, yi densities ➣ f positive & monotonic.


Notation: given data: (ti , yi ) ∈ R2 , i = 0, . . . , n, n ∈ N, t0 < t1 < · · · < tn .

EXAMPLE 5.3.0.1 (Magnetization curves)

For many materials physics stipulates properties of the functional dependence of the magnetic flux B on the magnetic field strength H:

✦ H ↦ B(H) smooth (at least C¹),
✦ H ↦ B(H) monotonic (increasing),
✦ H ↦ B(H) concave.

Fig. 157: a typical magnetization curve.
y

5.3.1 Shape Properties of Functions and Data


§5.3.1.1 (The “shape” of data) The section is about “shape preservation”. In the previous example we
have already seen a few properties that constitute the “shape” of a function: sign, monotonicity and curva-
ture. Now we have to identify analogous properties of data sets in the form of sequences of interpolation
points (t j , y j ), j = 0, . . . , n, t j pairwise distinct.

Definition 5.3.1.2. monotonic data

The data (t_i, y_i) are called monotonic when either y_i ≥ y_{i−1} for all i = 1, . . . , n, or y_i ≤ y_{i−1} for all i = 1, . . . , n.

Definition 5.3.1.3. Convex/concave data


The data {(t_i, y_i)}_{i=0}^{n} are called convex (concave) if
\[
  \Delta_j \;\overset{(\geq)}{\leq}\; \Delta_{j+1}\,,\quad j = 1, \dots, n-1\,,\qquad
  \Delta_j := \frac{y_j - y_{j-1}}{t_j - t_{j-1}}\,,\quad j = 1, \dots, n\,.
\]

Mathematical characterization of convex data:
\[
  y_i \;\leq\; \frac{(t_{i+1} - t_i)\, y_{i-1} + (t_i - t_{i-1})\, y_{i+1}}{t_{i+1} - t_{i-1}}
  \qquad \forall\, i = 1, \dots, n-1\,,
\]
i.e., each data point lies below the line segment connecting its two neighbouring data points, cf. the definition of convexity of a function [Str09, Def. 5.5.2].
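These definitions translate directly into simple tests for data sets. The following sketch checks Def. 5.3.1.2 and Def. 5.3.1.3 for given data vectors; the helper names isMonotonic/isConvex are ad hoc and not part of the lecture codes.

#include <Eigen/Dense>

// Check monotonicity (Def. 5.3.1.2) of the data values y_0, ..., y_n
bool isMonotonic(const Eigen::VectorXd& y) {
  const unsigned n = y.size();
  bool incr = true, decr = true;
  for (unsigned i = 1; i < n; ++i) {
    incr = incr && (y(i) >= y(i - 1));
    decr = decr && (y(i) <= y(i - 1));
  }
  return incr || decr;
}

// Check convexity (Def. 5.3.1.3): difference quotients must be non-decreasing
bool isConvex(const Eigen::VectorXd& t, const Eigen::VectorXd& y) {
  const unsigned n = t.size();
  // difference quotients Delta_j, j = 1, ..., n (stored 0-based)
  Eigen::VectorXd delta =
      (y.tail(n - 1) - y.head(n - 1)).cwiseQuotient(t.tail(n - 1) - t.head(n - 1));
  for (unsigned j = 0; j + 1 < static_cast<unsigned>(delta.size()); ++j) {
    if (delta(j) > delta(j + 1)) return false;  // slopes decrease: not convex
  }
  return true;
}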


Fig. 158: convex data.   Fig. 159: convex function.

Definition 5.3.1.4. Convex/concave function → [Str09, Def. 5.5.2]

f : I ⊂ R → R is convex (concave) :⇔
\[
  f(\lambda x + (1-\lambda)y) \;\leq\; (\geq)\;\; \lambda f(x) + (1-\lambda) f(y)
  \qquad \forall\, 0 \leq \lambda \leq 1\,,\ \forall\, x, y \in I\,.
\]

§5.3.1.5 ((Local) shape preservation) Now we consider the interpolation problem of building an interpolant f with special properties inherited from the given data (t_i, y_i), i = 0, . . . , n.

Goal: shape preserving interpolation:
    positive data    −→ positive interpolant f ,
    monotonic data   −→ monotonic interpolant f ,
    convex data      −→ convex interpolant f .

More ambitious goal: local shape preserving interpolation: for each subinterval I = [ti , ti+ j ]

positive data in I −→ locally positive interpolant f | I ,


monotonic data in I −→ locally monotonic interpolant f | I ,
convex data in I −→ locally convex interpolant f | I .
y

EXPERIMENT 5.3.1.6 (Bad behavior of global polynomial interpolants)


We perform Lagrange interpolation for the following positive and monotonic data:
ti -1.0 -0.640 -0.3600 -0.1600 -0.0400 0.0000 0.0770 0.1918 0.3631 0.6187 1.0
yi 0.0 0.000 0.0039 0.1355 0.2871 0.3455 0.4639 0.6422 0.8678 1.0000 1.0
created by taking points on the graph of
\[
  f(t) = \begin{cases}
    0 & \text{if } t < -\tfrac{2}{5}\,,\\[0.5ex]
    \tfrac{1}{2}\bigl(1 + \cos\bigl(\pi\,(t - \tfrac{3}{5})\bigr)\bigr) & \text{if } -\tfrac{2}{5} < t < \tfrac{3}{5}\,,\\[0.5ex]
    1 & \text{otherwise.}
  \end{cases}
\]


Fig.: interpolating polynomial of degree 10 for the data points (legend: Polynomial, Measure pts., Natural f).

We observe oscillations at the endpoints of the interval (see Fig. 155), which means that, in the case of global polynomial interpolation, we encounter
• no locality,
• no positivity,
• no monotonicity,
• no local conservation of the curvature.
y

5.3.2 Piecewise Linear Interpolation


There is a very simple method of achieving perfect shape preservation by means of a linear (→ § 5.1.0.21)
interpolation operator into the space of continuous functions:
Data: (ti , yi ) ∈ R2 , i = 0, . . . , n, n ∈ N, t0 < t1 < · · · < tn .

Then the piecewise linear interpolant s : [t0 , tn ] → R is defined as, cf. Ex. 5.1.0.15:

\[
  s(t) = \frac{(t_{i+1} - t)\, y_i + (t - t_i)\, y_{i+1}}{t_{i+1} - t_i}
  \qquad \text{for } t \in [t_i, t_{i+1}]\,.   (5.3.2.1)
\]

The piecewise linear interpolant is also called a polygonal curve. It is continuous and consists of n line segments.

Fig. 160: piecewise linear interpolant of the data from Fig. 158 (nodes t_0, t_1, t_2, t_3, t_4).
Piecewise linear interpolation means simply “connect the data points in R2 using straight lines”.
Obvious: linear interpolation is linear (as mapping y 7→ s, see Def. 5.1.0.25) and local in the following
sense:

y j = δij , i, j = 0, . . . , n ⇒ supp(s) ⊂ [ti−1 , ti+1 ] . (5.3.2.2)
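For illustration, here is a minimal sketch of how (5.3.2.1) can be evaluated at a single point x; the function name and the binary interval search are ad-hoc choices made here, not taken from the lecture repository.

#include <Eigen/Dense>
#include <algorithm>
#include <cassert>

// Evaluate the piecewise linear interpolant (5.3.2.1) at a point x,
// with sorted nodes t (t.size() == y.size()) and x in [t(0), t(n)].
double pwlinear_eval(const Eigen::VectorXd& t, const Eigen::VectorXd& y, double x) {
  const int n = t.size() - 1;
  assert(x >= t(0) && x <= t(n));
  // locate the interval [t_i, t_{i+1}] containing x by binary search
  int i = std::upper_bound(t.data(), t.data() + t.size(), x) - t.data() - 1;
  i = std::max(0, std::min(i, n - 1));
  // convex combination of the two neighbouring data values, see (5.3.2.1)
  return ((t(i + 1) - x) * y(i) + (x - t(i)) * y(i + 1)) / (t(i + 1) - t(i));
}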

As obvious are the properties asserted in the following theorem. The local preservation of curvature is a


straightforward consequence of Def. 5.3.1.3.

Theorem 5.3.2.3. Local shape preservation by piecewise linear interpolation

Let s ∈ C([t_0, t_n]) be the piecewise linear interpolant of (t_i, y_i) ∈ R², i = 0, . . . , n. Then for every
subinterval I = [t_j, t_k] ⊂ [t_0, t_n]:

if (ti , yi )| I are positive/negative ⇒ s| I is positive/negative,


if (ti , yi )| I are monotonic (increasing/decreasing) ⇒ s| I is monotonic (increasing/decreasing),
if (ti , yi )| I are convex/concave ⇒ s| I is convex/concave.

Local shape preservation = perfect shape preservation!

Bad news: none of these properties carries over to local polynomial interpolation of higher polynomial degree
d > 1.
EXAMPLE 5.3.2.4 (Piecewise quadratic interpolation) We consider the following generalization of
piecewise linear interpolation of data points (t j , y j ) ∈ R × R, j = 0, . . . , n.

From Thm. 5.2.2.7 we know that a parabola (polynomial of degree 2) is uniquely determined by 3 data
points. Thus, the idea is to form groups of three adjacent data points and interpolate each of these triplets
by a 2nd-degree polynomial (parabola).
Assume: n = 2m even
piecewise quadratic interpolant q : [min{ti }, max{ti }] 7→ R is defined by

q j := q |[t2j−2 ,t2j ] ∈ P2 , q j (ti ) = yi , i = 2j − 2, 2j − 1, 2j , j = 1, . . . , m . (5.3.2.5)

Nodes as in Exp. 5.3.1.6.

Fig. 161: piecewise linear (blue) and piecewise quadratic (red) interpolants.

No shape preservation for the piecewise quadratic interpolant.
y

The “only” drawback of piecewise linear interpolation: the interpolant is merely C⁰ but not C¹ (no continuous derivative).

However, the interpolant usually serves as input for other numerical methods, for instance Newton's method for solving non-linear systems of equations, see Section 8.5, which requires derivatives.


5.3.3 Cubic Hermite Interpolation


Aim: construct a local shape-preserving (→ Section 5.3) (linear?) interpolation operator that fixes the shortcoming of piecewise linear interpolation by ensuring C¹-smoothness of the interpolant.
✎ notation: C¹([a, b]) ≙ space of continuously differentiable functions [a, b] → R.

5.3.3.1 Definition and Algorithms

Given: mesh points (ti , yi ) ∈ R2 , i = 0, . . . , n, t0 < t1 < · · · < t n .

Goal: build function f ∈ C1 ([t0 , tn ]) satisfying the interpolation conditions f (ti ) = yi , i = 0, . . . , n.

Definition 5.3.3.1. Cubic Hermite polynomial interpolant

Given data points (t j , y j ) ∈ R × R, j = 0, . . . , n, with pairwise distinct ordered nodes t j , and slopes
c j ∈ R, the piecewise cubic Hermite interpolant s : [t0 , tn ] → R is defined by the requirements

s|[ti−1 ,ti ] ∈ P3 , i = 1, . . . , n , s ( ti ) = yi , s ′ ( ti ) = ci , i = 0, . . . , n .

Corollary 5.3.3.2. Smoothness of cubic Hermite polynomial interpolant

Piecewise cubic Hermite interpolants are continuously differentiable on their interval of definition.

Proof. The assertion of the corollary follows from the agreement of function values and first derivative
values on nodes shared by two intervals, on each of which the piecewise cubic Hermite interpolant is a
polynomial of degree 3.

§5.3.3.3 (Local representation of piecewise cubic Hermite interpolant) Locally, we can write a piece-
wise cubic Hermite interpolant as a linear combination of generalized cardinal basis functions with coeffi-
cients supplied by the data values y j and the slopes c j :

s(t) = yi−1 H1 (t) + yi H2 (t) + ci−1 H3 (t) + ci H4 (t) , t ∈ [ti−1 , ti ] , (5.3.3.4)

where the basis functions H_k, k = 1, 2, 3, 4, are as follows:
\[
\begin{aligned}
  H_1(t) &:= \phi\Bigl(\frac{t_i - t}{h_i}\Bigr)\,, &&(5.3.3.5a)\\
  H_2(t) &:= \phi\Bigl(\frac{t - t_{i-1}}{h_i}\Bigr)\,, &&(5.3.3.5b)\\
  H_3(t) &:= -h_i\, \psi\Bigl(\frac{t_i - t}{h_i}\Bigr)\,, &&(5.3.3.5c)\\
  H_4(t) &:= h_i\, \psi\Bigl(\frac{t - t_{i-1}}{h_i}\Bigr)\,, &&(5.3.3.5d)\\
  h_i &:= t_i - t_{i-1}\,, &&(5.3.3.5e)\\
  \phi(\tau) &:= 3\tau^2 - 2\tau^3\,, &&(5.3.3.5f)\\
  \psi(\tau) &:= \tau^3 - \tau^2\,. &&(5.3.3.5g)
\end{aligned}
\]

Fig. 162: local basis polynomials H_1, H_2, H_3, H_4 on [0, 1].
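The formulas (5.3.3.5) can be transcribed directly into code. The following sketch (with ad-hoc function names, not taken from the lecture codes) returns the values H_1(t), . . . , H_4(t) on [t_{i−1}, t_i] and can be used to check the endpoint values tabulated below.

#include <Eigen/Dense>

// Cubic Hermite basis functions (5.3.3.5) on [t_{i-1}, t_i]
inline double phi(double tau) { return 3 * tau * tau - 2 * tau * tau * tau; }
inline double psi(double tau) { return tau * tau * tau - tau * tau; }

// Returns [H_1(t), H_2(t), H_3(t), H_4(t)] for t in [tim1, ti]
Eigen::Vector4d hermiteBasis(double t, double tim1, double ti) {
  const double h = ti - tim1;        // h_i = t_i - t_{i-1}
  const double tauL = (ti - t) / h;  // argument of H_1, H_3
  const double tauR = (t - tim1) / h;  // argument of H_2, H_4
  return Eigen::Vector4d(phi(tauL), phi(tauR), -h * psi(tauL), h * psi(tauR));
}

For instance, hermiteBasis(tim1, tim1, ti) yields [1, 0, 0, 0] and hermiteBasis(ti, tim1, ti) yields [0, 1, 0, 0], in agreement with the table of endpoint values below.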


By tedious, but straightforward computations using the chain rule we find the following values for Hk and
Hk′ at the endpoints of the interval [ti−1 , ti ].
        H(t_{i−1})   H(t_i)   H′(t_{i−1})   H′(t_i)
  H_1       1          0          0            0
  H_2       0          1          0            0
  H_3       0          0          1            0
  H_4       0          0          0            1

This amounts to a proof of (5.3.3.4) (why?).
The formula (5.3.3.4) is handy for the local evaluation of piecewise cubic Hermite interpolants. The function
hermloceval() in Code 5.3.3.6 performs the efficient evaluation (in multiple points) of a piecewise
cubic polynomial s on t1 , t2 uniquely defined by the constraints s(t1 ) = y1 , s(t2 ) = y2 , s′ (t1 ) = c1 ,
s ′ ( t2 ) = c2 :

C++ code 5.3.3.6: Local evaluation of cubic Hermite polynomial ➺ GITLAB


// Multiple point evaluation of Hermite polynomial
// y1, y2: data values
// c1, c2: slopes
Eigen::VectorXd hermloceval(Eigen::VectorXd t, double t1, double t2,
                            double y1, double y2, double c1, double c2) {
  const double h = t2 - t1, a1 = y2 - y1, a2 = a1 - h * c1, a3 = h * c2 - a1 - a2;
  t = ((t.array() - t1) / h).matrix();
  return (y1 + (a1 + (a2 + a3 * t.array()) * (t.array() - 1)) * t.array()).matrix();
}
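A possible call of hermloceval(), with made-up values for illustration: the cubic with s(0) = 0, s(1) = 1, s′(0) = s′(1) = 0 is exactly φ(τ) = 3τ² − 2τ³ from (5.3.3.5f), which the following snippet reproduces.

// Evaluate the cubic Hermite segment with y1 = 0, y2 = 1, c1 = c2 = 0 on [0,1]
Eigen::VectorXd x = Eigen::VectorXd::LinSpaced(5, 0.0, 1.0);
Eigen::VectorXd sx = hermloceval(x, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0);
// sx = [0, 0.15625, 0.5, 0.84375, 1], the values of 3*tau^2 - 2*tau^3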

§5.3.3.7 (Linear Hermite interpolation) However, the data for an interpolation problem (→ Section 5.1)
are merely the interpolation points (t j , y j ), j = 0, . . . , n, but not the slopes of the interpolant at the nodes.
Thus, in order to define an interpolation operator into the space of piecewise cubic Hermite functions, we
have to supply a mapping R^{n+1} × R^{n+1} → R^{n+1} computing the slopes c_j from the data points (nodes and values).

Since this mapping should be local it is natural to rely on (weighted) averages of the local slopes ∆_j (→ Def. 5.3.1.3) of the data, for instance
\[
  c_i = \begin{cases}
    \Delta_1 & \text{for } i = 0\,,\\[0.5ex]
    \Delta_n & \text{for } i = n\,,\\[0.5ex]
    \dfrac{t_{i+1} - t_i}{t_{i+1} - t_{i-1}}\,\Delta_i + \dfrac{t_i - t_{i-1}}{t_{i+1} - t_{i-1}}\,\Delta_{i+1} & \text{if } 1 \leq i < n\,,
  \end{cases}
  \qquad
  \Delta_j := \frac{y_j - y_{j-1}}{t_j - t_{j-1}}\,,\quad j = 1, \dots, n\,.   (5.3.3.8)
\]

This leads to a linear (→ Def. 5.1.0.25), local Hermite interpolation operator.

“Local” means that, if the values y_j are non-zero for only a few adjacent data points with indices j = k, . . . , k + m, m ∈ N small, then the Hermite interpolant s is supported on [t_{k−ℓ}, t_{k+m+ℓ}] for small ℓ ∈ N independent of k and m. y
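A direct transcription of (5.3.3.8) might look as follows. This is only a sketch with a function name chosen here; it assumes sorted, pairwise distinct nodes and n ≥ 1.

#include <Eigen/Dense>

// Slopes c_0, ..., c_n by the weighted averages (5.3.3.8)
Eigen::VectorXd averageSlopes(const Eigen::VectorXd& t, const Eigen::VectorXd& y) {
  const unsigned n = t.size() - 1;
  Eigen::VectorXd delta =
      (y.tail(n) - y.head(n)).cwiseQuotient(t.tail(n) - t.head(n));  // Delta_1, ..., Delta_n
  Eigen::VectorXd c(n + 1);
  c(0) = delta(0);      // c_0 = Delta_1
  c(n) = delta(n - 1);  // c_n = Delta_n
  for (unsigned i = 1; i < n; ++i) {
    const double w1 = (t(i + 1) - t(i)) / (t(i + 1) - t(i - 1));
    const double w2 = (t(i) - t(i - 1)) / (t(i + 1) - t(i - 1));
    c(i) = w1 * delta(i - 1) + w2 * delta(i);  // weighted average of Delta_i, Delta_{i+1}
  }
  return c;
}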

EXAMPLE 5.3.3.9 (Average-based piecewise cubic Hermite interpolation)


Data points:
✦ 11 equispaced nodes t_j = −1 + 0.2 j, j = 0, . . . , 10, in the interval I = [−1, 1],
✦ y_i = f(t_i) with f(x) := sin(5x) e^x.

Here we used weighted averages of slopes as in Eq. (5.3.3.8). For details see code hermintp1.hpp ➺ GITLAB.

Fig. 163: cubic Hermite interpolant with averaged slopes.

No strict local/global preservation of monotonicity!


5.3.3.2 Local Monotonicity-Preserving Hermite Interpolation

From Ex. 5.3.3.9 we learn that, if the slopes are chosen according to Eq. (5.3.3.8), then the resulting
Hermite interpolation does not preserve monotonicity.

Consider the situation sketched in Fig. 164. The red circles (•) represent data points, the blue line (—) the piecewise linear interpolant → Section 5.3.2.

In the marked nodes the first derivative of a monotonicity preserving C¹-smooth interpolant must vanish! Otherwise an “overshoot” occurs, see also Fig. 161.

Of course, this will be violated when a (weighted) arithmetic average is used for the computation of the slopes for cubic Hermite interpolation.

Now consider the situation sketched in Fig. 165. Again, the red circles (•) represent data points, the blue line (—) the piecewise linear interpolant → Section 5.3.2.

A local monotonicity preserving C¹-smooth interpolant (→ § 5.3.1.5) s must be flat (= vanishing first derivative) in data points (t_j, y_j), for which

    y_{j−1} < y_j and y_{j+1} < y_j ,   or   y_{j−1} > y_j and y_{j+1} > y_j ,

that is, in “local extrema” of the data set. Otherwise, overshoots or undershoots would destroy local monotonicity on one side of the extremum.

§5.3.3.10 (Limiting of local slopes) From the discussion of Fig. 164 and Fig. 165 it is clear that local
monotonicity preservation entails that the local slopes ci of a cubic Hermite interpolant (→ Def. 5.3.3.1)
have to fulfill
\[
  c_i = \begin{cases}
    0\,, & \text{if } \operatorname{sgn}(\Delta_i) \neq \operatorname{sgn}(\Delta_{i+1})\,,\\
    \text{some ``average'' of } \Delta_i,\ \Delta_{i+1}\,, & \text{otherwise}\,,
  \end{cases}
  \qquad i = 1, \dots, n-1\,.   (5.3.3.11)
\]

✎ notation: sign function
\[
  \operatorname{sgn}(\xi) = \begin{cases} 1\,, & \text{if } \xi > 0\,,\\ 0\,, & \text{if } \xi = 0\,,\\ -1\,, & \text{if } \xi < 0\,. \end{cases}
\]

A slope selection rule that enforces (5.3.3.11) is called a limiter.

Of course, testing for equality with zero does not make sense for data that may be affected by measurement or roundoff errors. Thus, the “average” in (5.3.3.11) must be close to zero already when either ∆_i ≈ 0 or ∆_{i+1} ≈ 0. This is satisfied by the weighted harmonic mean
\[
  c_i = \frac{1}{\dfrac{w_a}{\Delta_i} + \dfrac{w_b}{\Delta_{i+1}}}\,,   (5.3.3.12)
\]


with weights w_a > 0, w_b > 0 (w_a + w_b = 1).

The harmonic mean is a “smoothed min(·, ·)-function”. Obviously, if ∆_i → 0 or ∆_{i+1} → 0 in (5.3.3.12), then c_i → 0.

Fig. 166: contour plot of the harmonic mean of a and b (w_a = w_b = 1/2).
A good choice of the weights is
\[
  w_a = \frac{2h_{i+1} + h_i}{3(h_{i+1} + h_i)}\,, \qquad
  w_b = \frac{h_{i+1} + 2h_i}{3(h_{i+1} + h_i)}\,.
\]
This yields the following local slopes, unless (5.3.3.11) enforces c_i = 0:
\[
  c_i = \begin{cases}
    \Delta_1\,, & \text{if } i = 0\,,\\[1ex]
    \dfrac{3(h_{i+1} + h_i)}{\dfrac{2h_{i+1} + h_i}{\Delta_i} + \dfrac{h_{i+1} + 2h_i}{\Delta_{i+1}}}\,, &
    \text{for } i \in \{1, \dots, n-1\} \text{ and } \operatorname{sgn}(\Delta_i) = \operatorname{sgn}(\Delta_{i+1})\,,\\[2ex]
    \Delta_n\,, & \text{if } i = n\,,
  \end{cases}
  \qquad h_i := t_i - t_{i-1}\,.   (5.3.3.13)
\]

Piecewise cubic Hermite interpolation with local slopes chosen according to


(5.3.3.11) and (5.3.3.13) is available through the M ATLAB function/P YTHON class v =
pchip(t,y,x);/scipy.interpolate.PchipInterpolator. The argument t passes
the interpolation nodes, y the corresponding data values, and x is a vector of evaluation points. y

EXAMPLE 5.3.3.14 (Monotonicity preserving piecewise cubic polynomial interpolation)

Data from Exp. 5.3.1.6. Plot created with the MATLAB function call
  v = pchip(t,y,x);
  t: data nodes t_j
  y: data values y_j
  x: evaluation points x_i
  v: vector of values s(x_i)

We observe perfect local monotonicity preservation, no under- or overshoots at extrema.

Fig. 167: monotonicity-preserving piecewise cubic interpolation polynomial and data points.
y

Remark 5.3.3.15 (Non-linear cubic Hermite interpolation) Note that the mapping y := [y0 , . . . , yn ] →
ci defined by (5.3.3.11) and (5.3.3.13) is not linear.

➣ The “pchip interpolation operator” does not provide a linear mapping from the data space R^{n+1} into
C¹([t_0, t_n]) (in the sense of Def. 5.1.0.25).


In fact, the non-linearity of the piecewise cubic Hermite interpolation operator is necessary for (only global)
monotonicity preservation:

Theorem 5.3.3.16. Property of linear, monotonicity preserving interpolation into C1

If, for fixed node set {t j }nj=0 , n ≥ 2, an interpolation scheme I : R n+1 → C1 ( I ) is linear
as a mapping from data values to continuous functions on the interval covered by the nodes
(→ Def. 5.1.0.25), and monotonicity preserving, then I(y)′ (t j ) = 0 for all y ∈ R n+1 and
j = 1, . . . , n − 1.

Of course, an interpolant that is flat in all data points, as stipulated by Thm. 5.3.3.16 for a linear, monotonicity preserving, C¹-smooth interpolation scheme, does not make much sense.

At least, the piecewise cubic Hermite interpolation operator is local (in the sense discussed in § 5.3.3.7).
y

Theorem 5.3.3.17. Monotonicity preservation of limited cubic Hermite interpolation

The cubic Hermite interpolation polynomial with slopes as in Eq. (5.3.3.13) provides a local
monotonicity-preserving C1 -interpolant.

Proof. See F. Fritsch and R. Carlson, Monotone piecewise cubic interpolation, SIAM J. Numer. Anal., 17 (1980), pp. 238–246.

The next code demonstrates the calculation of the slopes ci in M ATLAB’s pchip (details in [FC80]):

C++ code 5.3.3.18: Monotonicity preserving slopes in pchip ➺ GITLAB


#include <Eigen/Dense>
#include <cmath>

namespace pchipslopes {

using Eigen::VectorXd;

// forward declaration of the function pchipend, implementation below
double pchipend(double h1, double h2, double del1, double del2);

inline void pchipslopes(const VectorXd& t, const VectorXd& y, VectorXd& c) {
  // Calculation of local slopes c_i for shape preserving cubic Hermite
  // interpolation, see (5.3.3.11), (5.3.3.13)
  // t, y are vectors passing the data points
  const unsigned n = t.size();
  const VectorXd h = t.tail(n - 1) - t.head(n - 1);
  const VectorXd delta = (y.tail(n - 1) - y.head(n - 1)).cwiseQuotient(h);  // linear slopes
  c = VectorXd::Zero(n);

  // compute reconstruction slope according to (5.3.3.13)
  for (unsigned i = 0; i < n - 2; ++i) {
    if (delta(i) * delta(i + 1) > 0) {
      const double w1 = 2 * h(i + 1) + h(i);
      const double w2 = h(i + 1) + 2 * h(i);
      c(i + 1) = (w1 + w2) / (w1 / delta(i) + w2 / delta(i + 1));
    }
  }
  // Special slopes at endpoints, beyond (5.3.3.13)
  c(0) = pchipend(h(0), h(1), delta(0), delta(1));
  c(n - 1) = pchipend(h(n - 2), h(n - 3), delta(n - 2), delta(n - 3));
}

inline double pchipend(const double h1, const double h2, const double del1, const double del2) {
  // Non-centered, shape-preserving, three-point formula
  double d = ((2 * h1 + h2) * del1 - h1 * del2) / (h1 + h2);
  if (d * del1 < 0) {
    d = 0;
  } else if (del1 * del2 < 0 && std::abs(d) > std::abs(3 * del1)) {
    d = 3 * del1;
  }
  return d;
}

}  // namespace pchipslopes
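Combining pchipslopes() with hermloceval() from Code 5.3.3.6 yields a complete evaluation routine for the pchip interpolant. The following sketch evaluates the interpolant at a single point x; the helper name pchipeval and the simple linear interval search are additions made here, and it assumes both lecture functions are in scope.

// Evaluate the pchip interpolant for data (t, y) at a single point x
double pchipeval(const Eigen::VectorXd& t, const Eigen::VectorXd& y, double x) {
  Eigen::VectorXd c;
  pchipslopes::pchipslopes(t, y, c);  // shape-preserving slopes c_i
  // locate the knot interval [t_j, t_{j+1}] containing x (ad-hoc linear search)
  int j = 0;
  while (j + 2 < t.size() && x > t(j + 1)) ++j;
  Eigen::VectorXd xv(1);
  xv << x;
  // local cubic Hermite evaluation on [t_j, t_{j+1}]
  return hermloceval(xv, t(j), t(j + 1), y(j), y(j + 1), c(j), c(j + 1))(0);
}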

Review question(s) 5.3.3.19 (Shape-preserving interpolation)


(Q5.3.3.19.A) State three shortcomings of global polynomial interpolation, which are remedied by using
piecewise linear interpolation.
(Q5.3.3.19.B) Given data points (ti , yi ) ∈ R2 and a continuous interpolant f generated by a local mono-
tonicity preserving interpolation scheme, show that
 
f (t), t ∈ [min{ti }, max{ti }] = [min{yi }, max{yi }] .
i i i i

(Q5.3.3.19.C) Show by counterexample that a locally convexity preserving interpolation scheme can gen-
erate an interpolant with negative function values even if the data values are all positive.
(Q5.3.3.19.D) Given data points (t_i, y_i), i = 0, . . . , n, t_{i−1} < t_i, i = 1, . . . , n, we define
\[
  f(t) = y_0 + \int_{t_0}^{t} p(\tau)\, d\tau\,,
\]
where p is the piecewise linear interpolant of the data points
\[
  \Bigl( \frac{t_i + t_{i+1}}{2}\,,\; \frac{y_{i+1} - y_i}{t_{i+1} - t_i} \Bigr)\,,\quad i = 0, \dots, n-1\,.
\]
Is the mapping (t_i, y_i)_i ↦ f globally monotonicity/convexity preserving?


(Q5.3.3.19.E) Given nodes t0 < t1 < · · · < tn let P3 (T ) stand for the function space
\[
  \mathcal{P}_3(\mathcal{T}) := \bigl\{ f \in C^1([t_0, t_n])\colon\; f|_{[t_{i-1}, t_i]} \in \mathcal{P}_3\,,\ i = 1, \dots, n \bigr\}\,.
\]

Propose a linear interpolation operator

FT : R n+1 → P3 (T ) , (FT (y))(t j ) = (y) j , j = 0, . . . , n ,

that is locally monotonicity preserving.


(Q5.3.3.19.F) Given an ordered node set t0 < t1 < · · · < tn and associated data values yi , i = 0, . . . , n,
the pchip interpolant is a piecewise cubic Hermite interpolant, with slopes chosen according to the
formula
\[
  c_i = \begin{cases}
    \Delta_1\,, & \text{if } i = 0\,,\\[1ex]
    \dfrac{3(h_{i+1} + h_i)}{\dfrac{2h_{i+1} + h_i}{\Delta_i} + \dfrac{h_{i+1} + 2h_i}{\Delta_{i+1}}}\,, &
    \text{for } i \in \{1, \dots, n-1\} \text{ and } \operatorname{sgn}(\Delta_i) = \operatorname{sgn}(\Delta_{i+1})\,,\\[2ex]
    \Delta_n\,, & \text{if } i = n\,,
  \end{cases}
  \qquad
  \begin{aligned}
    h_i &:= t_i - t_{i-1}\,,\\
    \Delta_i &:= \frac{y_i - y_{i-1}}{t_i - t_{i-1}}\,.
  \end{aligned}
  \quad (5.3.3.13)
\]

(i) Determine the supports supp b_i ⊂ R of the functions b_i, i = 0, . . . , n, where b_i is the pchip interpolant for the data values y_0 = 0, . . . , y_{i−1} = 0, y_i = 1, y_{i+1} = 0, . . . , y_n = 0.

(ii) Denote by p(y) the pchip interpolant for the data vector y := [y0 , . . . , yn ] ∈ R n+1 . Can we
write p as
\[
  p(\mathbf{y})(t) = \sum_{i=0}^{n} y_i\, b_i(t)\,,\qquad t_0 \leq t \leq t_n\,,
\]

with suitable functions bi ?


5.4 Splines
Piecewise cubic Hermite Interpolation presented in Section 5.3.3 entailed determining reconstruction
slopes ci . Now we learn about a way how to do piecewise polynomial interpolation, which results in
C k -interpolants, k > 0, and dispenses with auxiliary slopes. The idea is to obtain the missing conditions
implicitly from extra continuity conditions, built into spaces of so-called splines. These are of fundamental
importance for modern computer-aided geometric design (CAGD).

Supplementary literature. Splines are also presented in [DR08, Ch. 9].

5.4.1 Spline Function Spaces


Definition 5.4.1.1. Spline space → [QSS00, Def. 8.1]
Given an interval I := [ a, b] ⊂ R and a knot sequence M := { a = t0 < t1 < . . . < tn−1 < tn =
b}, n ∈ N, the vector space Sd,M of the spline functions of degree d (or order d + 1), d ∈ N0 ,
is defined by

Sd,M := {s ∈ C d−1 ( I ): s j := s|[t j−1 ,t j ] ∈ Pd ∀ j = 1, . . . , n} .

(“C^{d−1}(I)”: (d − 1)-times continuously differentiable; “∈ P_d”: locally polynomial of degree d)

Do not mix up “knots” = “breakpoints” of spline functions, and “nodes”, the first entries in the data tuples (t_i, y_i) for 1D interpolation. In the case of spline interpolation, knots may serve as nodes, but not necessarily.
Let’s make explicit the spline spaces of the lowest degrees:
• d = 0 : M-piecewise constant discontinuous functions
• d = 1 : M-piecewise linear continuous functions
• d = 2 : continuously differentiable M-piecewise quadratic functions


The dimension of spline space can be found by a counting argument (heuristic): We count the number
of “degrees of freedom” (d.o.f.s) possessed by a M-piecewise polynomial of degree d, and subtract the
number of linear constraints implicitly contained in Def. 5.4.1.1:

dim Sd,M = n · dim Pd − #{C d−1 continuity constraints} = n · (d + 1) − (n − 1) · d = n + d .

Theorem 5.4.1.2. Dimension of spline space

The space Sd,M from Def. 5.4.1.1 has dimension

dim Sd,M = n + d .

Remark 5.4.1.3 (Differentiating and integrating splines) Obviously, spline spaces are mapped onto
each other by differentiation & integration:
\[
  s \in \mathcal{S}_{d,\mathcal{M}} \;\Rightarrow\; s' \in \mathcal{S}_{d-1,\mathcal{M}}
  \;\wedge\; \Bigl( t \mapsto \int_a^t s(\tau)\, d\tau \Bigr) \in \mathcal{S}_{d+1,\mathcal{M}}\,.   (5.4.1.4)
\]

y
Review question(s) 5.4.1.5 (Spline function spaces)
(Q5.4.1.5.A) Given an (ordered) knot set M := {t0 < t1 < · · · < tn } ⊂ R use a counting argument to
determine the dimension of the space of piecewise polynomials

\[
  \mathcal{S}^{k}_{d,\mathcal{M}} := \{ s \in C^k(I)\colon\; s|_{[t_{j-1}, t_j]} \in \mathcal{P}_d\ \ \forall j = 1, \dots, n \}\,.
\]

(Q5.4.1.5.B) Consider the knot set M := {0, 21 , 1}. For arbitrary numbers y0 , y1 , c0 , c1 does there exist
s ∈ S2,M such that

s (0) = y0 , s (1) = y1 , s ′ (0) = c0 , s ′ (1) = c1 ?

Is s unique?

5.4.2 Cubic-Spline Interpolation


We already know the special case of interpolation in S1,M , when the interpolation nodes are the knots of
M, because this boils down to simple piecewise linear interpolation, see Section 5.3.2. Now we explore
the interpolation by means of splines of degree d = 3.

Supplementary literature. More details can be found in [Han02, pp. XIII, 46], [QSS00,

Sect. 8.6.1].
Remark 5.4.2.1 (Perceived smoothness of cubic splines) Cognitive psychology teaches us that the human eye perceives a C²-function as “smooth”, while it can still spot the abrupt change of curvature at the possible discontinuities of the second derivative of a cubic Hermite interpolant (→ Def. 5.3.3.1).
For this reason the simplest spline functions featuring C2 -smoothness are of great importance in computer
aided design (CAD). They are the cubic splines, M-piecewise polynomials of degree 3 contained in S3,M
(→ Def. 5.4.1.1). y


§5.4.2.2 (Cubic spline interpolants) The definition of a cubic spline interpolant is straightforward and
matches the abstract concept of an interpolant introduced in Section 5.1. Also note the relationship with
Hermite interpolation discussed in Section 5.3.3.

Definition 5.4.2.3. Cubic-spline interpolant

Given a node set/knot set M := {t0 < t1 < · · · < tn }, n ∈ N, and data values y0 , . . . , yn ∈ R,
an associated cubic spline interpolant is a function s ∈ S3,M that complies with the interpolation
conditions

s(t j ) = y j , j = 0, . . . , n . (5.4.2.4)

Note that in the case of cubic spline interpolation the spline knots and interpolation nodes coincide.
From dimensional considerations it is clear that the interpolation conditions will fail to fix the interpolating
cubic spline uniquely:

dim S3,M − #{ interpolation conditions} = (n + 3) − (n + 1) = 2 free d.o.f.

Obviously, “two conditions are missing”, which means that the interpolation problem for cubic splines is
not well-defined by (5.4.2.4). We have to impose two additional conditions. Different ways to do this will
lead to different cubic spline interpolants for the same set of data points. y

§5.4.2.5 (Computing cubic spline interpolants) We opt for a linear interpolation scheme (→
Def. 5.1.0.25) into the spline space S3,M , which means that the two additional conditions must depend
linearly on the data values. As explained in § 5.1.0.21, a linear interpolation scheme will lead to a linear
system of equations for expansion coefficients with respect to a suitable basis.

We reuse the local representation of a cubic spline through cubic Hermite cardinal basis polynomials from
(5.3.3.5):
\[
  s|_{[t_{j-1}, t_j]}(t) \overset{(5.3.3.4)}{=}
    s(t_{j-1}) \cdot (1 - 3\tau^2 + 2\tau^3)
  + s(t_j) \cdot (3\tau^2 - 2\tau^3)
  + h_j\, s'(t_{j-1}) \cdot (\tau - 2\tau^2 + \tau^3)
  + h_j\, s'(t_j) \cdot (-\tau^2 + \tau^3)\,,   (5.4.2.6)
\]
with h_j := t_j − t_{j−1}, τ := (t − t_{j−1})/h_j.


➣ The task of cubic spline interpolation boils down to finding slopes s′ (t j ) in the knots of the mesh M.

Once these slopes are known, the efficient local evaluation of a cubic spline function can be done in the
same way as for a cubic Hermite interpolant, see Section 5.3.3.1, Code 5.3.3.6.

Note: if s(t j ), s′ (t j ), j = 0, . . . , n, are fixed, then the representation Eq. (5.4.2.6) already guarantees
s ∈ C1 ([t0 , tn ]), cf. the discussion for cubic Hermite interpolation, Section 5.3.3.

➣ Only the continuity of s″ has to be enforced by the choice of the slopes s′(t_j); equivalently, this requirement will yield the extra conditions that fix the slopes s′(t_j).

However, do the


✦ interpolation conditions (5.4.2.4) s(t j ) = y j , j = 0, . . . , n, and the


✦ smoothness constraint s ∈ C2 ([t0 , tn ])
uniquely determine the unknown slopes c j := s′ (t j )? In light of the discussion in § 5.4.2.2, probably not,
because we have not yet described the two additional required conditions. Next, we work out the details:
From s ∈ C2 ([t0 , tn ]) we obtain n − 1 continuity constraints for s′′ (t) at the internal nodes
s′′|[t j−1 ,t j ] (t j ) = s′′|[t j ,t j+1 ] (t j ) , j = 1, . . . , n − 1 . (5.4.2.7)

Based on Eq. (5.4.2.6), we express Eq. (5.4.2.7) in concrete terms, using
\[
  s''|_{[t_{j-1}, t_j]}(t) =
    s(t_{j-1})\, h_j^{-2}\, 6(-1 + 2\tau) + s(t_j)\, h_j^{-2}\, 6(1 - 2\tau)
    + h_j^{-1} s'(t_{j-1})(-4 + 6\tau) + h_j^{-1} s'(t_j)(-2 + 6\tau)\,,
    \quad \tau := (t - t_{j-1})/h_j\,,   (5.4.2.8)
\]
which can be obtained by the chain rule and from dτ/dt = h_j^{-1}.
\[
  \text{Eq. (5.4.2.8)} \;\Rightarrow\;
  \begin{aligned}
    s''|_{[t_{j-1}, t_j]}(t_{j-1}) &= -6\, s(t_{j-1})\, h_j^{-2} + 6\, s(t_j)\, h_j^{-2} - 4\, h_j^{-1} s'(t_{j-1}) - 2\, h_j^{-1} s'(t_j)\,,\\
    s''|_{[t_{j-1}, t_j]}(t_j) &= \phantom{-}6\, s(t_{j-1})\, h_j^{-2} - 6\, s(t_j)\, h_j^{-2} + 2\, h_j^{-1} s'(t_{j-1}) + 4\, h_j^{-1} s'(t_j)\,.
  \end{aligned}
\]
In particular, replacing j ← j + 1 in the top equation yields
\[
  s''|_{[t_j, t_{j+1}]}(t_j) = -6\, s(t_j)\, h_{j+1}^{-2} + 6\, s(t_{j+1})\, h_{j+1}^{-2} - 4\, h_{j+1}^{-1} s'(t_j) - 2\, h_{j+1}^{-1} s'(t_{j+1})\,.
\]
Inserting these formulas into Eq. (5.4.2.7) leads to n − 1 linear equations for the n + 1 unknown slopes c_j := s′(t_j). Taking into account the (known) interpolation conditions s(t_i) = y_i, we get
\[
  \frac{1}{h_j}\, c_{j-1} + \Bigl( \frac{2}{h_j} + \frac{2}{h_{j+1}} \Bigr) c_j + \frac{1}{h_{j+1}}\, c_{j+1}
  = 3 \Bigl( \frac{y_j - y_{j-1}}{h_j^2} + \frac{y_{j+1} - y_j}{h_{j+1}^2} \Bigr)\,,   (5.4.2.9)
\]
for j = 1, . . . , n − 1.
Actually Eq. (5.4.2.9) amounts to an (n − 1) × (n + 1), that is, underdetermined, linear system of equa-
tions. The dimensions make perfect sense, because
n − 1 equations =
ˆ number of interpolation conditions
n + 1 unknowns= ˆ dimension of cubic spline space on knot set {t0 < t1 < · · · < tn }
This linear system of equations can be written in matrix-vector form:
\[
  \begin{bmatrix}
    b_0 & a_1 & b_1 & 0 & \cdots & \cdots & 0\\
    0 & b_1 & a_2 & b_2 & & & \vdots\\
    \vdots & 0 & \ddots & \ddots & \ddots & & \vdots\\
    \vdots & & \ddots & \ddots & \ddots & \ddots & \vdots\\
    \vdots & & & & a_{n-2} & b_{n-2} & 0\\
    0 & \cdots & \cdots & 0 & b_{n-2} & a_{n-1} & b_{n-1}
  \end{bmatrix}
  \begin{bmatrix} c_0\\ c_1\\ \vdots\\ \vdots\\ c_{n-1}\\ c_n \end{bmatrix}
  =
  \begin{bmatrix}
    3\Bigl( \dfrac{y_1 - y_0}{h_1^2} + \dfrac{y_2 - y_1}{h_2^2} \Bigr)\\
    \vdots\\
    3\Bigl( \dfrac{y_{n-1} - y_{n-2}}{h_{n-1}^2} + \dfrac{y_n - y_{n-1}}{h_n^2} \Bigr)
  \end{bmatrix}   (5.4.2.10)
\]
with
\[
  b_i := \frac{1}{h_{i+1}}\,,\qquad a_i := \frac{2}{h_i} + \frac{2}{h_{i+1}}\,,\qquad i = 0, 1, \dots, n-1\,,
  \qquad [\, b_i, a_i > 0\,,\ a_i = 2(b_i + b_{i-1}) \,]\,.
\]
➙ two additional constraints are required, as already noted in § 5.4.2.2. y

§5.4.2.11 (Types of cubic spline interpolants) To saturate the remaining two degrees of freedom the
following three approaches are popular. All of them involve conditions linear in the data values.


➀ Complete cubic spline interpolation: s′(t_0) = c_0, s′(t_n) = c_n prescribed according to
\[
  c_0 := \mathbf{q}^{\top}\mathbf{y}\,,\qquad c_n := \mathbf{p}^{\top}\mathbf{y}\,,
\]
for given vectors p, q ∈ R^{n+1} and y := [y_0, . . . , y_n] ∈ R^{n+1} the vector of data values. Then the first and last column can be removed from the system matrix of (5.4.2.10). Their products with c_0 and c_n, respectively, have to be subtracted from the right-hand side of (5.4.2.10), which leads to the (n − 1) × (n − 1) linear system of equations:
\[
  \begin{bmatrix}
    a_1 & b_1 & 0 & \cdots & \cdots & 0\\
    b_1 & a_2 & b_2 & & & \vdots\\
    0 & \ddots & \ddots & \ddots & & \vdots\\
    \vdots & & \ddots & \ddots & \ddots & 0\\
    \vdots & & & \ddots & a_{n-2} & b_{n-2}\\
    0 & \cdots & \cdots & 0 & b_{n-2} & a_{n-1}
  \end{bmatrix}
  \begin{bmatrix} c_1\\ \vdots\\ \vdots\\ c_{n-1} \end{bmatrix}
  =
  \begin{bmatrix}
    3\Bigl( \dfrac{y_1 - y_0}{h_1^2} + \dfrac{y_2 - y_1}{h_2^2} \Bigr) - c_0 b_0\\
    \vdots\\
    3\Bigl( \dfrac{y_{n-1} - y_{n-2}}{h_{n-1}^2} + \dfrac{y_n - y_{n-1}}{h_n^2} \Bigr) - c_n b_{n-1}
  \end{bmatrix}\,.   (5.4.2.12)
\]

➁ Natural cubic spline interpolation: s′′ (t0 ) = s′′ (tn ) = 0

\[
  \frac{2}{h_1}\, c_0 + \frac{1}{h_1}\, c_1 = 3\, \frac{y_1 - y_0}{h_1^2}\,,\qquad
  \frac{1}{h_n}\, c_{n-1} + \frac{2}{h_n}\, c_n = 3\, \frac{y_n - y_{n-1}}{h_n^2}\,.
\]
Combining these two extra equations with (5.4.2.10), we arrive at a linear system of equations with tridiagonal s.p.d. (→ Def. 1.1.2.6, Lemma 2.8.0.12) system matrix and unknowns c_0, . . . , c_n:
\[
  \begin{bmatrix}
    2 & 1 & 0 & \cdots & \cdots & \cdots & 0\\
    b_0 & a_1 & b_1 & 0 & \cdots & \cdots & 0\\
    0 & b_1 & a_2 & b_2 & & & \vdots\\
    \vdots & & \ddots & \ddots & \ddots & & \vdots\\
    \vdots & & & \ddots & \ddots & \ddots & 0\\
    0 & \cdots & \cdots & 0 & b_{n-2} & a_{n-1} & b_{n-1}\\
    0 & \cdots & \cdots & \cdots & 0 & 1 & 2
  \end{bmatrix}
  \begin{bmatrix} c_0\\ c_1\\ \vdots\\ \vdots\\ c_{n-1}\\ c_n \end{bmatrix}
  =
  \begin{bmatrix}
    3\, \dfrac{y_1 - y_0}{h_1}\\[1ex]
    3\Bigl( \dfrac{y_1 - y_0}{h_1^2} + \dfrac{y_2 - y_1}{h_2^2} \Bigr)\\
    \vdots\\
    3\Bigl( \dfrac{y_{n-1} - y_{n-2}}{h_{n-1}^2} + \dfrac{y_n - y_{n-1}}{h_n^2} \Bigr)\\[1ex]
    3\, \dfrac{y_n - y_{n-1}}{h_n}
  \end{bmatrix}\,.   (5.4.2.13)
\]

Owing to Thm. 2.7.5.4 this linear system of equations can be solved with an asymptotic computational
effort of O(n) for n → ∞.
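For illustration, the system (5.4.2.13) for the slopes of the natural cubic spline can be assembled and solved with Eigen as sketched below. The function name is ad hoc; a general sparse LU solver is used only for brevity, whereas a dedicated tridiagonal (Thomas-type) elimination would realize the O(n) effort mentioned above.

#include <Eigen/Dense>
#include <Eigen/Sparse>
#include <vector>

// Slopes c_0, ..., c_n of the natural cubic spline interpolant, system (5.4.2.13)
Eigen::VectorXd naturalSplineSlopes(const Eigen::VectorXd& t, const Eigen::VectorXd& y) {
  const int n = t.size() - 1;                 // number of knot intervals
  Eigen::VectorXd h = t.tail(n) - t.head(n);  // h(i) = h_{i+1} = t_{i+1} - t_i
  std::vector<Eigen::Triplet<double>> trp;
  Eigen::VectorXd rhs(n + 1);
  // first row: (2/h_1) c_0 + (1/h_1) c_1 = 3 (y_1 - y_0)/h_1^2  (natural condition at t_0)
  trp.emplace_back(0, 0, 2.0 / h(0));
  trp.emplace_back(0, 1, 1.0 / h(0));
  rhs(0) = 3.0 * (y(1) - y(0)) / (h(0) * h(0));
  // interior rows, cf. (5.4.2.9)
  for (int j = 1; j < n; ++j) {
    trp.emplace_back(j, j - 1, 1.0 / h(j - 1));
    trp.emplace_back(j, j, 2.0 / h(j - 1) + 2.0 / h(j));
    trp.emplace_back(j, j + 1, 1.0 / h(j));
    rhs(j) = 3.0 * ((y(j) - y(j - 1)) / (h(j - 1) * h(j - 1)) +
                    (y(j + 1) - y(j)) / (h(j) * h(j)));
  }
  // last row: (1/h_n) c_{n-1} + (2/h_n) c_n = 3 (y_n - y_{n-1})/h_n^2  (natural condition at t_n)
  trp.emplace_back(n, n - 1, 1.0 / h(n - 1));
  trp.emplace_back(n, n, 2.0 / h(n - 1));
  rhs(n) = 3.0 * (y(n) - y(n - 1)) / (h(n - 1) * h(n - 1));
  Eigen::SparseMatrix<double> A(n + 1, n + 1);
  A.setFromTriplets(trp.begin(), trp.end());
  Eigen::SparseLU<Eigen::SparseMatrix<double>> solver;
  solver.compute(A);
  return solver.solve(rhs);  // slopes c_0, ..., c_n
}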

➂ Periodic cubic spline interpolation: s′ (t0 ) = s′ (tn ) (➣ c0 = cn ), s′′ (t0 ) = s′′ (tn )

This removes one unknown and adds another equation, so that we end up with an n × n linear system with s.p.d. (→ Def. 1.1.2.6) system matrix
\[
  \mathbf{A} := \begin{bmatrix}
    a_1 & b_1 & 0 & \cdots & 0 & b_0\\
    b_1 & a_2 & b_2 & & & 0\\
    0 & \ddots & \ddots & \ddots & & \vdots\\
    \vdots & & \ddots & \ddots & \ddots & 0\\
    0 & & & \ddots & a_{n-1} & b_{n-1}\\
    b_0 & 0 & \cdots & 0 & b_{n-1} & a_0
  \end{bmatrix},
  \qquad
  \begin{aligned}
    b_i &:= \frac{1}{h_{i+1}}\,, & i &= 0, 1, \dots, n-1\,,\\
    a_i &:= \frac{2}{h_i} + \frac{2}{h_{i+1}}\,, & i &= 0, 1, \dots, n-1\,.
  \end{aligned}
\]
This linear system can be solved with rank-1-modifications techniques (see § 2.6.0.12,
Lemma 2.6.0.21) + tridiagonal elimination: asymptotic computational effort O(n).


Remark 5.4.2.14 (Piecewise cubic interpolation schemes) Let us review three different classes of in-
terpolation schemes relying on piecewise cubic polynomials with respect to a prescribed node set:
✦ Piecewise cubic local Lagrange interpolation
➣ Extra degrees of freedom fixed by putting four nodes in one interval
➥ yields merely C0 -interpolant; perfectly local.
✦ Cubic Hermite interpolation
➣ Extra degrees of freedom fixed by locally reconstructed slopes, e.g. (5.3.3.13)
➥ yields C1 -interpolant; still local.
✦ Cubic spline interpolation
➣ Extra degrees of freedom fixed by C2 -smoothness, complete/natural/periodic constraint.
➥ yields C2 -interpolant; non-local.
y

EXAMPLE 5.4.2.15 (Control of a robotic arm [Han02, Sect. 46])

Given times M := {t0 < t1 < · · · < tn }, n ∈ N,


the tip of a robotic arm is to visit the point in space
yi ∈ R3 at time ti . At initial time t0 and final time tn
the arm is to be at rest.

Fig. 168

The path of the tip of the robotic arm can conveniently be described by the componentwise complete cubic spline interpolant s : [t_0, t_n] → R³ satisfying

(s) j ∈ S3,M , j = 1, 2, 3 , s(ti ) = yi , i = 0, . . . , n , s′ (t0 ) = s′ (tn ) = 0 .


Note that the first derivative t 7→ s′ (t) is the velocity of the tip, while the second derivative t 7→ s′′
describes its acceleration. Hence, s ∈ C2 ([t0 , tn ], R3 ) ensures that the acceleration does not change
abruptly, which is important for curbing wear on the joints of the robotic arm. y

Review question(s) 5.4.2.16 (Cubic spline interpolation)


(Q5.4.2.16.A) What do cubic spline interpolants and piecewise cubic Hermite interpolants have in com-
mon and in which respects are they different?
(Q5.4.2.16.B) Given the node set {t0 < t1 < · · · < tn } ⊂ R assume that the data values are obtained
by sampling a polynomial p : R → R, I := [t0 , tn ]: yi := p(ti ), i = 0, . . . , n. Characterize the maxi-
mal vector space of polynomials p such that p ≡ s, where s is the natural cubic spline interpolant of
(ti , yi )in=0 . What is its dimension?
Remember that, beside the interpolation conditions s(ti ) = yi the natural cubic spline interpolant also
satisfies s′′ (t0 ) = s′′ (tn ) = 0.


(Q5.4.2.16.C) Characterize the rank-1 modification of
\[
  \mathbf{A} := \begin{bmatrix}
    a_1 & b_1 & 0 & \cdots & 0 & b_0\\
    b_1 & a_2 & b_2 & & & 0\\
    0 & \ddots & \ddots & \ddots & & \vdots\\
    \vdots & & \ddots & \ddots & \ddots & 0\\
    0 & & & \ddots & a_{n-1} & b_{n-1}\\
    b_0 & 0 & \cdots & 0 & b_{n-1} & a_0
  \end{bmatrix} \in \mathbb{R}^{n,n}
\]

that turns it into a genuine tri-diagonal matrix.


(Q5.4.2.16.D) The node/knot set M := {t0 < t1 < · · · < tn }, n ≥ 3, is given. Beside the interpolation
conditions s(ti ) = yi for data values yi , i = 0, . . . , n, a complete cubic spline interpolant s ∈ S3,M is
determined by the additional two conditions

s ′ ( t 0 ) = α 0 y 0 + α 1 y 1 + α 2 y 2 + α 3 y 3 , s ′ ( t n ) = β 0 y n + β 1 y n −1 + β 2 y n −2 + β 3 y n −3 ,

for given weights α j , β j ∈ R.


Find weights α_0, . . . , α_3 and β_0, . . . , β_3 such that in the case y_i = p(t_i) for some cubic polynomial p ∈ P_3 one always finds s ≡ p on [t_0, t_n].
(Q5.4.2.16.E) Find cubic polynomials b1 , b2 , b3 , b4 such that

b_1(0) = 1 ,  b_1(1) = 0 ,  b_1″(0) = 0 ,  b_1″(1) = 0 ,
b_2(0) = 0 ,  b_2(1) = 1 ,  b_2″(0) = 0 ,  b_2″(1) = 0 ,
b_3(0) = 0 ,  b_3(1) = 0 ,  b_3″(0) = 1 ,  b_3″(1) = 0 ,
b_4(0) = 0 ,  b_4(1) = 0 ,  b_4″(0) = 0 ,  b_4″(1) = 1 .

(Q5.4.2.16.F) For the computation of a cubic spline interpolant in § 5.4.2.5 we used a local, that is, inside
the knot intervals, representation of the spline function by means of cardinal basis polynomials for
Hermite interpolation

s|[t j−1 ,t j ] (t) = a j (1 − 3τ 2 + 2τ 3 ) + b j (3τ 2 − 2τ 3 ) + c j (τ − 2τ 2 + τ 3 ) + d j (−τ 2 + τ 3 ) , (5.4.2.17)

(h j := t j − t j−1 , τ := (t − t j−1 )/h j ) with coefficients a j , b j , c j , d j ∈ R.


Explain the big benefits derived from using (5.4.2.17).
(Q5.4.2.16.G) This question continues Question (Q5.4.2.16.F). Why is using the local representation

s|[t j−1 ,t j ] (t) = a j t3 + b j t2 + c j t + d j , a j , bj , c j , d j ∈ R ,

instead of (5.4.2.17) as starting point for the computation of a cubic spline interpolant not a good idea?

5.4.3 Structural Properties of Cubic Spline Interpolants


§5.4.3.1 (Extremal properties of natural cubic spline interpolants → [QSS00, Sect. 8.6.1, Prop-
erty 8.2]) Splines are special! For a function f : [ a, b] 7→ R, f ∈ C2 ([ a, b]), the term
\[
  E_{\mathrm{bd}}(f) := \tfrac{1}{2} \int_a^b |f''(t)|^2\, dt\,,   (5.4.3.2)
\]


models the elastic bending energy of a rod, whose shape is described by the graph of f (Soundness
check: zero bending energy for straight rod). We will show that cubic spline interpolants have minimal
bending energy among all C2 -smooth interpolating functions.

Theorem 5.4.3.3. Optimality of natural cubic spline interpolant

Given a knot/nodes set M := { a = t0 < t1 < · · · < tn = b} in the interval [ a, b], let s ∈ S3,M be
the natural cubic spline interpolant of the data points (ti , yi ) ∈ R2 , i = 0, . . . , n.
Then s minimizes the elastic bending energy among all interpolating functions in C2 ([ a, b]), that is

Ebd (s) ≤ Ebd ( f ) ∀ f ∈ C2 ([ a, b]), f (ti ) = yi , i = 0, . . . , n .

Idea of proof: variational calculus

We show that any small perturbation of s such that the perturbed spline still satisfies the interpolation
conditions leads to an increase in elastic energy.

Pick perturbation direction k ∈ C2 ([t0 , tn ]) satisfying k (ti ) = 0, i = 0, . . . , n:

\[
  E_{\mathrm{bd}}(s + k) = \tfrac{1}{2} \int_a^b |s'' + k''|^2\, dt
  = E_{\mathrm{bd}}(s) + \underbrace{\int_a^b s''(t)\, k''(t)\, dt}_{=: I}
  + \underbrace{\tfrac{1}{2} \int_a^b |k''|^2\, dt}_{\geq 0}\,.   (5.4.3.4)
\]
We scrutinize I, split it into the contributions of the individual knot intervals, integrate by parts twice, and use s^{(4)} ≡ 0. By s‴(t_j^±) we denote the limits of the discontinuous third derivative t ↦ s‴(t) from the left (−) and right (+) of the knot t_j.
\[
\begin{aligned}
  I &= \sum_{j=1}^{n} \int_{t_{j-1}}^{t_j} s''(t)\, k''(t)\, dt\\
    &= \sum_{j=1}^{n} \Bigl\{ -\int_{t_{j-1}}^{t_j} s'''(t)\, k'(t)\, dt + s''(t_j) k'(t_j) - s''(t_{j-1}) k'(t_{j-1}) \Bigr\}\\
    &= \sum_{j=1}^{n} \Bigl\{ \int_{t_{j-1}}^{t_j} s^{(4)}(t)\, k(t)\, dt + s''(t_j) k'(t_j) - s''(t_{j-1}) k'(t_{j-1})
       - s'''(t_j^-)\, k(t_j) + s'''(t_{j-1}^+)\, k(t_{j-1}) \Bigr\}\\
    &= -\sum_{j=1}^{n} \Bigl( s'''(t_j^-)\, \underbrace{k(t_j)}_{=0} - s'''(t_{j-1}^+)\, \underbrace{k(t_{j-1})}_{=0} \Bigr)
       + \underbrace{s''(t_n)}_{=0}\, k'(t_n) - \underbrace{s''(t_0)}_{=0}\, k'(t_0) \;=\; 0\,.
\end{aligned}
\]

In light of (5.4.3.4): no perturbation k compatible with interpolation conditions can make the bending
energy of s decrease! y

Remark 5.4.3.5 (Origin of the term “Spline”)


§ 5.4.3.1: the (natural) cubic spline interpolant provides the C²-curve of minimal elastic bending energy that travels through prescribed points.

Nature: a thin elastic rod fixed at certain points attains a shape that minimizes its potential bending energy (virtual work principle of statics).

Cubic spline interpolation approximates the shape of elastic rods. Such rods were in fact used in the manufacturing of ship hulls as “analog computers” for “interpolating points” that were specified by the designer of the ship: cubic spline interpolation before the age of computers.

y
Remark 5.4.3.6 (Shape preservation of cubic spline interpolation)
Data s(t_j) = y_j from Exp. 5.3.1.6 and
\[
  s'(t_0) = c_0 := \frac{y_1 - y_0}{t_1 - t_0}\,,\qquad
  s'(t_n) = c_n := \frac{y_n - y_{n-1}}{t_n - t_{n-1}}\,.
\]

Fig. 169: data points and cubic spline interpolant.

The cubic spline interpolant is neither monotonicity nor curvature preserving. This is not surprising in light of Thm. 5.3.3.16, because cubic spline interpolation is a linear interpolation scheme.
y

§5.4.3.7 (Weak locality of an interpolation scheme) In terms of locality of interpolation schemes, in the sense of § 5.3.3.7, we have seen:
• Piecewise linear interpolation (→ Section 5.3.2) is strictly local: Changing a single data value y j
affects the interpolant only on the interval ]t j−1 , t j+1 [.
• Monotonicity preserving piecewise cubic Hermite interpolation (→ Section 5.3.3.2) is still local, be-
cause changing y j will lead to a change in the interpolant only in ]t j−2 , t j+2 [ (the remote intervals
are affected through the averaging of local slopes).
• Polynomial Lagrange interpolation is highly non-local, see Ex. 5.2.4.3.
We can weaken the notion of locality of an interpolation scheme on an ordered node set {t_i}_{i=0}^{n}:
➣ (weak) locality measures the impact of a perturbation of a data value y_i at points t ∈ [t_0, t_n] as a function of |t − t_i|.


➣ an interpolation scheme is weakly local, if the impact of the perturbation of yi displays a rapid (e.g.
exponential) decay as |t − ti | increases.
For a linear interpolation scheme (→ § 5.1.0.21) locality can be deduced from the decay of the cardinal
interpolants/cardinal basis functions (→ Lagrange polynomials of § 5.2.2.3), that is, the functions b j :=
I(e j ), where e j is the j-th unit vector, and I the interpolation operator. Then weak locality can be quantified
as

\[
  \exists\, \lambda > 0:\quad |b_j(t)| \leq b_j(t_j) \cdot \exp(-\lambda |t - t_j|)\,,\qquad t \in [t_0, t_n]\,.   (5.4.3.8)
\]

Remember:
• Lagrange polynomials satisfying (5.2.2.4) provide cardinal interpolants for polynomial interpolation
→ § 5.2.2.3. As is clear from Fig. 151, they do not display any decay away from their “base node”.
Rather, they grow strongly. Hence, there is no locality in global polynomial interpolation.
• Tent functions (→ Fig. 150) are the cardinal basis functions for piecewise linear interpolation, see
Ex. 5.1.0.15. Hence, this scheme is perfectly local, see (5.3.2.2).
Given a knot/node set M := {t0 < t1 < · · · < tn } the ith natural cubic cardinal spline is defined by the
conditions

Li ∈ S3,M , Li (t j ) = δij , Li′′ (t0 ) = Li′′ (tn ) = 0 . (5.4.3.9)

These functions will supply a cardinal basis of S3,M and, according to (5.1.0.18), for natural cubic spline
interpolants we have the formula
\[
  \text{Natural spline interpolant:}\qquad s(t) = \sum_{j=0}^{n} y_j\, L_j(t)\,.
\]

This means that t 7→ Li (t) completely characterizes the impact that a change of the value yi has on s. A
very rapid decay of Li means that the value yi does not influence s much outside a small neighborhood of
ti .
Fast (exponential) decay of Li ↔ weak locality of natural cubic spline interpolation.
y

EXPERIMENT 5.4.3.10 (Decay of cardinal basis functions for natural cubic spline interpolation) We
examine the cardinal basis of S_{3,M} (“cardinal splines”) associated with natural cubic spline interpolation on an equidistant knot set M := {t_j = j − 8}_{j=0}^{16}:

Fig.: cardinal cubic spline function (left) and values of the cardinal cubic splines in the midpoints of the intervals, logarithmic scale (right).


We observe a rapid exponential decay of the cardinal splines, which is expressed by the statement that
“cubic spline interpolation is weakly local”. y
Review question(s) 5.4.3.11 (Structural properties of cubic spline interpolants)
(Q5.4.3.11.A) Given data points (t_i, y_i) ∈ R², t_0 < t_1 < · · · < t_n, n ∈ N, show that the piecewise linear interpolant p ∈ C⁰([t_0, t_n]) minimizes the elastic energy
\[
  E_{\mathrm{el}}(f) := \sum_{j=1}^{n} \int_{t_{j-1}}^{t_j} |f'(t)|^2\, dt
\]

among all continuous, piecewise continuously differentiable interpolants.


(Q5.4.3.11.B) Given data points (ti , yi ), i = 0, . . . , n, ti−1 < ti , yi ∈ R, show that the solution f ∗ of the
minimization problem

\[
  f^* := \operatorname*{argmin}_{f \in C^2([t_0, t_n])} E_m(f)\,,\qquad
  E_m(f) := \sum_{i=0}^{n} |f(t_i) - y_i|^2 + \int_{t_0}^{t_n} |f''(t)|^2\, dt\,,
\]

is a cubic spline (with respect to the knot set M := {t0 < t1 < · · · < tn }), which satisfies
s′′ (t0 ) = s′′ (tn ) = 0.
Of course, you should invoke the result about the optimality of the natural cubic spline interpolant.

Theorem 5.4.3.3. Optimality of natural cubic spline interpolant

Given a knot/nodes set M := { a = t0 < t1 < · · · < tn = b} in the interval [ a, b], let s ∈ S3,M
be the natural cubic spline interpolant of the data points (ti , yi ) ∈ R2 , i = 0, . . . , n.
Then s minimizes the elastic bending energy among all interpolating functions in C2 ([ a, b]), that
is

Ebd (s) ≤ Ebd ( f ) ∀ f ∈ C2 ([ a, b]), f (ti ) = yi , i = 0, . . . , n .

5.4.4 Shape Preserving Spline Interpolation


According to Rem. 5.4.3.6 cubic spline interpolation is neither monotonicity preserving nor curvature pre-
serving. Necessarily so according to Thm. 5.3.3.16, because it is a linear interpolation scheme producing
a continuously differentiable interpolant with slopes 6= 0 in the nodes.

This section presents a non-linear interpolation scheme based on quadratic splines (→ Def. 5.4.1.1; these are C¹ functions) that preserves both monotonicity and curvature of the data even in a local sense, cf. Section 5.3, § 5.3.1.5. The construction, presented in [Sch83], is fairly intricate and will be developed step by step.
As with all 1D data interpolation tasks we are given data points (ti , yi ) ∈ R2 , i = 0, . . . , n, and we
assume that their first coordinates are ordered: t0 < t1 < · · · < tn .

In order to obtain extra flexibility required for shape preservation, the key idea of [Sch83] is
to use an extended knot set M ⊂ [t_0, t_n] containing n additional knots beside {t_j}_{j=0}^{n}.

Then we construct an interpolating quadratic spline function that satisfies s ∈ S2,M , s(ti ) = yi , i =
0, . . . , n and locally preserves the “shape” of the data in the sense of § 5.3.1.1.


We stress that, unlike in the case of cubic spline interpolation, the knot set will not agree with the node set in this case, M ≠ {t_j}_{j=0}^{n}: the interpolant s interpolates the data in the points t_i but is piecewise polynomial with respect to M!

We proceed in four steps:


➀ Shape preserving choice of slopes ci , i = 0, . . . , n, as in Section 5.3.3.2, [MR81; QQY01]:
Recall Eq. (5.3.3.11) and Eq. (5.3.3.13): we fix the slopes ci in the nodes using the harmonic mean of
data slopes ∆ j , the final interpolant will be tangent to these segments in the points (ti , yi ). If (ti , yi ) is a
local maximum or minimum of the data, then apply slope limiting: c_j is set to zero (→ § 5.3.3.10):

Limiter:
\[
  c_i := \begin{cases}
    \dfrac{2}{\Delta_i^{-1} + \Delta_{i+1}^{-1}}\,, & \text{if } \operatorname{sign}(\Delta_i) = \operatorname{sign}(\Delta_{i+1})\,,\\[1ex]
    0\,, & \text{otherwise}\,,
  \end{cases}
  \quad i = 1, \dots, n-1\,,
  \qquad
  c_0 := 2\Delta_1 - c_1\,,\quad c_n := 2\Delta_n - c_{n-1}\,,   (5.4.4.1)
\]
where ∆_j := (y_j − y_{j−1})/(t_j − t_{j−1}) designates the local data slopes.

Figures: slopes according to the limited harmonic-mean formula.
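A direct transcription of the limiter (5.4.4.1) might look as follows. This is a sketch with an ad-hoc function name, assuming sorted nodes and n ≥ 2; it is not taken from shapepresintp.hpp.

#include <Eigen/Dense>

// Slopes c_0, ..., c_n according to the limiter formula (5.4.4.1)
Eigen::VectorXd limitedSlopes(const Eigen::VectorXd& t, const Eigen::VectorXd& y) {
  const int n = t.size() - 1;  // n >= 2 assumed
  Eigen::VectorXd delta =
      (y.tail(n) - y.head(n)).cwiseQuotient(t.tail(n) - t.head(n));  // Delta_1, ..., Delta_n
  Eigen::VectorXd c = Eigen::VectorXd::Zero(n + 1);
  for (int i = 1; i < n; ++i) {
    // harmonic mean of Delta_i and Delta_{i+1}; zero at local extrema of the data
    if (delta(i - 1) * delta(i) > 0) {
      c(i) = 2.0 / (1.0 / delta(i - 1) + 1.0 / delta(i));
    }
  }
  c(0) = 2.0 * delta(0) - c(1);          // c_0 := 2*Delta_1 - c_1
  c(n) = 2.0 * delta(n - 1) - c(n - 1);  // c_n := 2*Delta_n - c_{n-1}
  return c;
}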

➁ Choice of “extra knots” pi ∈]ti−1 , ti ], i = 1, . . . , n:

Rule: Let T_i be the unique straight line through (t_i, y_i) with slope c_i (— in Fig. 170).
If the intersection of T_{i−1} and T_i is non-empty and has a t-coordinate ∈ ]t_{i−1}, t_i],
☞ then p_i := t-coordinate of T_{i−1} ∩ T_i,
☞ otherwise p_i := ½(t_{i−1} + t_i).

Fig. 170: construction of the extra knot p_i and the midpoints ½(p_i + t_{i−1}), ½(p_i + t_i) in the interval [t_{i−1}, t_i].
These points will be used to build the knot set for the final quadratic spline:

M = { t0 < p1 ≤ t1 < p2 ≤ · · · < p n ≤ t n } . (5.4.4.2)

The following code implements the construction of the auxiliary knots


C++ code 5.4.4.3: Selection of auxiliary knots p j ➺ GITLAB


Eigen::VectorXd extra_knots(const Eigen::VectorXd& t, const Eigen::VectorXd& y,
                            const Eigen::VectorXd& c) {
  const unsigned n = t.size();
  assert((n == y.size()) && (n == c.size()));
  Eigen::VectorXd p = (t[n - 1] + 1.0) * Eigen::VectorXd::Ones(n - 1);
  for (unsigned j = 0; j < n - 1; ++j) {
    // Safe comparison, just to avoid a division-by-zero exception.
    // If both slopes are close, p_j will certainly lie outside the node interval.
    if (c[j] != c[j + 1]) {
      p[j] = (y[j + 1] - y[j] + t[j] * c[j] - t[j + 1] * c[j + 1]) / (c[j + 1] - c[j]);
    }
    if (p[j] < t[j] || p[j] > t[j + 1]) {
      p[j] = 0.5 * (t[j] + t[j + 1]);
    }
  }
  return p;
}

➂ Construction of auxiliary polygonal line:

We choose L to be the linear (d = 1) spline (polygonal line) on the knot set M′ of midpoints of the knot intervals of M from (5.4.4.2),
\[
  \mathcal{M}' = \bigl\{ t_0 < \tfrac{1}{2}(t_0 + p_1) < \tfrac{1}{2}(p_1 + t_1) < \tfrac{1}{2}(t_1 + p_2) < \cdots < \tfrac{1}{2}(t_{n-1} + p_n) < \tfrac{1}{2}(p_n + t_n) < t_n \bigr\}\,,
\]
satisfying the conditions
\[
  L(t_i) = y_i\,,\qquad L'(t_i) = c_i\,.   (5.4.4.4)
\]
In each interval (½(p_j + t_j), ½(t_j + p_{j+1})) the linear spline L corresponds to the line segment of slope c_j passing through the data point (t_j, y_j).

In each interval (½(t_j + p_{j+1}), ½(p_{j+1} + t_{j+1})) the linear spline L corresponds to the line segment connecting the ones on the neighbouring knot intervals of M′, see Fig. 170.

Given the choice of slopes according to (5.4.4.1), a detailed analysis shows the following shape-preserving property of L:

    L “inherits” local monotonicity and curvature from the data.

EXAMPLE 5.4.4.5 (Auxiliary construction for shape preserving quadratic spline interpolation) We
verify the above statements about the polygonal line L for the data points (j, cos(j))_{j=0}^{12}, which are marked as • in the left plot.


Fig. 171: local slopes c_i, i = 0, . . . , n.   Fig. 172: auxiliary linear spline L.
The reader is encouraged to check by inspection of the plots that the piecewise linear function L locally
preserves monotonicity and curvature. y

➃ Local quadratic approximation / interpolation of L:

Tedious, but elementary calculus, confirms the following fact:

Lemma 5.4.4.6.
If g is a linear spline (polygonal line) through the three points

\[
  (a, y_a)\,,\quad \bigl( \tfrac{1}{2}(a + b),\, w \bigr)\,,\quad (b, y_b) \qquad \text{with } a < b\,,\ y_a, y_b, w \in \mathbb{R}\,,
\]
then the parabola

p(t) := (y a (b − t)2 + 2w(t − a)(b − t) + yb (t − a)2 )/(b − a)2 , a ≤ t ≤ b ,

satisfies
1. p( a) = y a , p(b) = yb , p′ ( a) = g′ ( a) , p′ (b) = g′ (b),
2. g monotonic increasing / decreasing ⇒ p monotonic increasing / decreasing,
3. g convex / concave ⇒ p convex / concave.

The proof boils down to discussing many cases as indicated in the following plots:
Fig. 173, Fig. 174, Fig. 175: linear spline l and parabola p on [a, b] for different configurations of y_a, w, y_b.
Lemma 5.4.4.6 implies that the final quadratic spline that passes through the points (t j , y j ) with slopes c j
can be built locally as the parabola p using the linear spline L that plays the role of g in the lemma.
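The local evaluation formula of Lemma 5.4.4.6 is easily coded. The following sketch (function name chosen here, not part of the lecture codes) evaluates the parabola p at a point t ∈ [a, b] from the endpoint values y_a, y_b and the midpoint value w of the auxiliary linear spline.

// Evaluate the parabola p from Lemma 5.4.4.6 at t in [a, b]
double quadSegmentEval(double a, double b, double ya, double w, double yb, double t) {
  const double d = b - a;
  return (ya * (b - t) * (b - t) + 2.0 * w * (t - a) * (b - t) + yb * (t - a) * (t - a)) / (d * d);
}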


EXAMPLE 5.4.4.7 (Ex. 5.4.4.5 cnt’d)


We use the same data points as in Ex. 5.4.4.5 and build a quadratic spline s ∈ S_{2,M}, M the knot set from (5.4.4.2), based on the formula suggested by Lemma 5.4.4.6.

Fig. 176: the interpolating quadratic spline s (—).

Local shape preservation is evident. However, since s ∉ C²([t_0, t_n]) in general, we see “kinks”, cf. Rem. 5.4.2.1.
y

EXAMPLE 5.4.4.8 (Cardinal shape preserving quadratic spline) We examine the shape preserving
quadratic spline that interpolates the data values y_j = 0, j ≠ i, y_i = 1, i ∈ {0, . . . , n}, on an equidistant node set.

Fig. 177: data and slopes.   Fig. 178: linear auxiliary spline L.   Fig. 179: quadratic spline.

Shape preserving quadratic spline interpolation is a local, but not a linear interpolation scheme.
y

EXAMPLE 5.4.4.9 (Shape preserving quadratic spline interpolation)


We reuse the data from Exp. 5.3.1.6:
Fig. 180: data and slopes.   Fig. 181: linear auxiliary spline L.   Fig. 182: quadratic spline.

Data from [MR81]:
  t_i :  0    1    2    3    4    5    6    7    8    9    10   11   12
  y_i :  0   0.3  0.5  0.2  0.6  1.2  1.3   1    1    1    1    0   −1

Fig. 183: data and slopes.   Fig. 184: linear auxiliary spline L.   Fig. 185: quadratic spline.

In all cases we observe excellent local shape preservation. The implementation can be found in
shapepresintp.hpp ➺ GITLAB. y
Review question(s) 5.4.4.10 (Shape-preserving spline interpolation)
(Q5.4.4.10.A) Argue why an interpolant of the data points (0, 0), (1, 0), (2, 1), (3, 1) in S_{2,M}, M := {0, 1, 2, 3}, cannot be monotonicity preserving.
(Q5.4.4.10.B) Let S : I^{n+1} × R^{n+1} → C¹(I), I ⊂ R a bounded interval, be an interpolation scheme, that is, given data points (t_i, y_i), t_i ∈ I, t_i ≠ t_j for i ≠ j, i, j = 0, . . . , n, f := S([t_i]_{i=0}^{n}, [y_i]_{i=0}^{n}) is a continuously differentiable function f : I → R satisfying the interpolation conditions f(t_i) = y_i, i = 0, . . . , n. Assume that S is locally monotonicity-preserving.

What is the support of S([t_i]_{i=0}^{n}, e_k), if the nodes t_i are sorted, t_0 < t_1 < t_2 < · · · < t_n, and e_k stands for the Cartesian coordinate vector (δ_{kj})_{j=0}^{n} ∈ R^{n+1}?

5.5 Algorithms for Curve Design


In this section we study a peculiar task of “data fitting in 1D”, which arises in Computer Aided Design
(CAD).

5.5.1 CAD Task: Curves from Control Points


Data: Using tablet and pen a designer jots down n + 1, n ∈ N, ordered control points in the plane,
whose coordinate vectors we denote by pℓ ∈ R2 , ℓ ∈ {0, . . . , n}.

Fig. 186: a sequence of ordered control points in the plane (coordinates x_1, x_2).

Task: Find an algorithm that “builds” a smooth curve, whose shape “is inspired”
by the locations of the control points.


Let us give a more concrete meaning to some aspects of this rather vague specification.
§5.5.1.2 (Parameterized curves)

Definition 5.5.1.3. Planar curve

A planar curve C is a subset of R2 such that

C = {c(t) : t ∈ [ a, b]} ,

with a continuous function c : [ a, b] → R2 ( a, b ∈ R, b > a), called a parameterization.

Note that the parameterization of a curve is not unique.

EXAMPLE 5.5.1.4 (A parameterized curve)

The heart-shaped curve is described by the parameterization
\[
  c(t) = \begin{bmatrix} 16 \sin^3 t \\ 13 \cos t - 5 \cos(2t) - 2 \cos(3t) - \cos(4t) \end{bmatrix},
  \qquad t \in [0, 2\pi]\,.
\]
This is an example of a parameterization by a trigonometric polynomial.

Fig. 187: the heart-shaped curve (“I love numerics”).
y
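For illustration, the following minimal sketch tabulates this parameterization at N ≥ 2 equidistant parameter values; the function name sampleHeartCurve and the sample count are our own choices, not part of the lecture codes, and the sampled points could be passed to any plotting routine.

    #include <Eigen/Dense>
    #include <cmath>

    // Sample the heart-shaped curve c(t), t in [0, 2*pi], at N (>= 2) equidistant
    // parameter values; returns a 2 x N matrix whose columns are points on the curve.
    Eigen::MatrixXd sampleHeartCurve(unsigned int N) {
      Eigen::MatrixXd pts(2, N);
      for (unsigned int k = 0; k < N; ++k) {
        const double t = 2.0 * M_PI * k / (N - 1);
        pts(0, k) = 16.0 * std::pow(std::sin(t), 3);
        pts(1, k) = 13.0 * std::cos(t) - 5.0 * std::cos(2 * t) -
                    2.0 * std::cos(3 * t) - std::cos(4 * t);
      }
      return pts;
    }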

Thus, “building” a curve amounts to finding a parameterization. In algorithms, we have to choose the
parameterization from sets of simple functions, whose elements can be characterized by finitely many,
more precisely, a few, real numbers (unfortunately, often also called “parameters”).

EXAMPLE 5.5.1.5 (Polynomial planar curves) We call a planar curve (→ Def. 5.5.1.3) polynomial of
degree d ∈ N0 , if it has a parameterization c : [ a, b] → R2 of the form

    c(t) = ∑_{k=0}^{d} a_k t^k   with vectors a_k ∈ R², k ∈ {0, . . . , d} .   (5.5.1.6)

We also write c ∈ (Pd )2 to indicate that c is polynomial of degree d. For d = 1 we recover a line segment
with endpoints c( a) and c(b).


Fig. 188 (axes x1 vs. x2): a parabola over [−1, 1].

Every finite section of a graph of a polynomial can be regarded as a polynomial curve.
y

§5.5.1.7 (Smooth curves) The “smoothness” of a curve is directly related to the differentiability class of its
parameterization. As remarked in the beginning of Section 5.4.2, a curve is “visually smooth”, if its parame-
terization c : [ a, b] → R2 is twice continuously differentiable: c ∈ C2 ([ a, b]) × C2 ([ a, b]) =: (C2 ([ a, b]))2 .
Hence, the polygon connecting the control points is not an acceptable curve, because it is merely contin-
uous and has kinks. y

The meaning of “shape fidelity” cannot be captured rigorously, but some aspects of it will be discussed
below.

EXAMPLE 5.5.1.8 (Curves from interpolation) After what we have learned in Section 5.2, Section 5.3,
and Section 5.3.3, it is a natural idea to try and construct the parameterization of a curve by interpolating
the control points p_ℓ, ℓ ∈ {0, . . . , n}, using an established scheme for 1D interpolation of the data points
(t_ℓ, p_ℓ) with mutually distinct nodes t_ℓ ∈ R. However, how should we choose the t_ℓ?
The idea is to extract the nodes t_ℓ from the accumulated length of the segments of the polygon with
corners p_ℓ:

    t_0 := 0 ,   t_ℓ := ∑_{k=1}^{ℓ} ‖p_k − p_{k−1}‖ ,   ℓ ∈ {1, . . . , n} .   (5.5.1.9)
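The following minimal sketch (function name chordalNodes chosen only for this illustration) computes the nodes (5.5.1.9) from a 2×(n+1) matrix whose columns are the control points; some variants additionally rescale the nodes to [0, 1].

    #include <Eigen/Dense>

    // Nodes t_0,...,t_n from the accumulated lengths of the control polygon,
    // see (5.5.1.9); points = 2 x (n+1) matrix of control point coordinates.
    Eigen::VectorXd chordalNodes(const Eigen::MatrixXd &points) {
      const Eigen::Index n = points.cols() - 1;
      Eigen::VectorXd t(n + 1);
      t(0) = 0.0;
      for (Eigen::Index l = 1; l <= n; ++l) {
        // accumulate the Euclidean length of the l-th polygon segment
        t(l) = t(l - 1) + (points.col(l) - points.col(l - 1)).norm();
      }
      return t;
    }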
This choice of the nodes together with three different interpolation schemes was used to generate curves
based on a “car-shaped” sequence of control points displayed in Fig. 186.

Fig. 189 (axes x1 vs. x2): control polygon, polynomial interpolant, C² cubic spline interpolant, and C¹ Hermite
interpolant through the control points of Fig. 186.


Obviously, this is not what a designer wants to obtain:


• The polynomial interpolant oscillates wildly, a typical observation with high-degree polynomial inter-
polants, see Exp. 5.3.1.6.
• The cubic spline interpolant’s “shape fidelity” is found wanting.
• The Hermite interpolant lacks smoothness; it is merely of class C1 .
y

5.5.2 Bezier Curves


We abandon the idea of interpolating the control points pℓ , ℓ ∈ {0, . . . , n}, but restrict ourselves to
the class of polynomial curves as introduced in Ex. 5.5.1.5. More precisely, we aim for a polynomial
curve of degree n, the same degree as required for a polynomial interpolating n + 1 data points, re-
call Thm. 5.2.2.7. For the polynomial curve to be constructed, the so-called Bezier curve, we write
bk : [0, 1] → R2 . As parameter interval we use [0, 1] throughout in this section.

Idea: Construct bn recursively by convex combination.

Definition 5.5.2.1. Convex combination


A convex combination of elements v1 , . . . , vm , m ∈ N, of a real vector space V is a linear com-
bination

ξ 1 v1 + ξ 2 v2 + · · · + ξ m v m ∈ V ,

with coefficients satisfying

0 ≤ ξk ≤ 1 , k = 1, . . . , m , ξ 1 + ξ 2 + · · · + ξ m = 1 .

To explain the construction we write b_i^k : [0, 1] → R², k = 1, . . . , n, i = 0, . . . , n − k, for the polynomial
curves of degree k that are Bezier curves for the sub-sequence (p_i, . . . , p_{i+k}) of control points. We define
them recursively (w.r.t. the degree k) by two-term convex combinations

    b_i^1(t) := p_i (1 − t) + p_{i+1} t ,                         i = 0, . . . , n − 1 ,
                                                                                          t ∈ [0, 1] .   (5.5.2.2)
    b_i^k(t) := t b_{i+1}^{k−1}(t) + (1 − t) b_i^{k−1}(t) ,       i = 0, . . . , n − k ,  k = 2, . . . , n ,

Finally, we set

    b^n(t) := b_0^n(t) .
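The recursion (5.5.2.2) translates directly into the classical de Casteljau evaluation scheme for a single parameter value; the following sketch (our own illustration, not the lecture's reference code) overwrites an array of points with successive convex combinations.

    #include <Eigen/Dense>

    // Evaluate the Bezier curve b^n(t) defined by the columns of 'nodes'
    // (2 x (n+1) matrix of control points) at a single parameter t in [0,1],
    // using the recursion (5.5.2.2).
    Eigen::Vector2d evalBezierDeCasteljau(const Eigen::MatrixXd &nodes, double t) {
      Eigen::MatrixXd b = nodes;  // work array, initialized with the control points
      const Eigen::Index n = nodes.cols() - 1;
      for (Eigen::Index k = 1; k <= n; ++k) {
        // b.col(i) currently holds b_i^{k-1}(t); overwrite it with b_i^k(t)
        for (Eigen::Index i = 0; i + k <= n; ++i) {
          b.col(i) = (1.0 - t) * b.col(i) + t * b.col(i + 1);
        }
      }
      return b.col(0);  // = b_0^n(t) = b^n(t)
    }

Per evaluation this costs O(n²) operations, which is why Rem. 5.5.2.13 below discusses a Horner-type scheme for evaluation at many parameter values.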


The construction is illustrated for n = 3 in Fig. 190 (“Construction of cubic Bezier curve, 4 control points”,
axes x1 vs. x2; the plot shows the control polygon (d = 1), the intermediate curves b_0^2 and b_1^2 (d = 2),
and the Bezier curve b^3 (d = 3)):

    b_0^1(t) = (1 − t) p_0 + t p_1 ,
    b_1^1(t) = (1 − t) p_1 + t p_2 ,
    b_2^1(t) = (1 − t) p_2 + t p_3 ,

    b_0^2(t) = (1 − t) b_0^1(t) + t b_1^1(t) ,
    b_1^2(t) = (1 − t) b_1^1(t) + t b_2^1(t) ,

    b^3(t) = (1 − t) b_0^2(t) + t b_1^2(t) .

A recursive definition suggests proofs by induction and a simple one yields the insight that Bezier curves
connect the first and last control point.

Lemma 5.5.2.3.

Given mutually distinct points p0 , . . . , pn , the functions bik , k ∈ {1, . . . , n}, i ∈ {0, . . . , n − k }, de-
fined by (5.5.2.2) satisfy

bik (0) = pi , bik (1) = pi+k .

Concerning shape-fidelity the following concept is important:

Definition 5.5.2.4. Convex hull


The convex hull of m ∈ N elements v1 , . . . , vm of a real vector space V is the set of all of their
convex combinations (→ Def. 5.5.2.1):
    convex{v_1, . . . , v_m} := { ∑_{k=1}^{m} ξ_k v_k :  0 ≤ ξ_k ≤ 1,  ∑_{k=1}^{m} ξ_k = 1 } .

Fig. 191: the convex hull of some points in the plane; the highlighted polygon is the boundary of the convex hull.


Theorem 5.5.2.5. Bezier curves stay in convex hull

The Bezier curves bik , k ∈ {1, . . . , n}, i ∈ {0, . . . , n − k }, defined by (5.5.2.2) are confined to the
convex hulls of their underlying control points pi , . . . , pi+k :

bik (t) ∈ convex{pi , . . . , pi+k } ∀t ∈ [0, 1] .

Proof. The result follows by straightforward induction since every convex combination of two points in the
convex hull again belongs to it.

The next theorem states a deep result also connected with shape-fidelity. Its proof is beyond the scope of
this course:

Theorem 5.5.2.6. Variation-diminishing property of Bezier curves

No straight line intersects a Bezier curve bik more times than it intersects the control polygon
formed by its underlying control points pi , . . . , pi+k .

EXPERIMENT 5.5.2.7 (Poor shape-fidelity of high-degree Bezier curves)


Fig. 192 (axes x1 vs. x2): “Bezier curve, 9 control points”; the control polygon and the Bezier curve it generates.

Higher-degree Bezier curves are oblivious of local features!

Nevertheless, in order to motivate the construction of a better method, we continue examining
Bezier curves and study them from a different perspective. From Thm. 5.5.2.5 we know
that b^n(t) ∈ convex{p_0, . . . , p_n} for all t ∈ [0, 1]. Thus, there must exist t-dependent weights
0 ≤ ξ_i(t) ≤ 1 such that

    b^n(t) = ∑_{i=0}^{n} ξ_i(t) p_i   ∀t ∈ [0, 1] .


Theorem 5.5.2.8. Bernstein basis representation of Bezier curves

The Bezier curve bn : [0, 1] → R2 as defined in (5.5.2.2) based on the control points p0 , . . . , pn ,
n ∈ N, can be written as
    b^n(t) = ∑_{i=0}^{n} B_i^n(t) p_i ,   (5.5.2.9)

with the Bernstein polynomials

    B_i^n(t) := \binom{n}{i} t^i (1 − t)^{n−i} ,   i ∈ {0, . . . , n} .   (5.5.2.10)

Proof. Of course, we use induction in the polynomial degree based on (5.5.2.2) and the Pascal triangle
for binomial coefficients.

Fig. 193 (axes t vs. B_i^d(t)): all Bernstein polynomials B_0^8, . . . , B_8^8 of degree n = 8.

Lemma 5.5.2.11. Basis property of Bernstein polynomials

The Bernstein polynomials B_i^n, i = 0, . . . , n, form a basis of the space P_n of polynomials of degree ≤ n.

We can verify by direct computation based on the binomial theorem that the Bernstein polynomials satisfy

    0 ≤ B_i^n(t) ≤ 1 ,   ∑_{i=0}^{n} B_i^n(t) = 1   ∀t ∈ [0, 1] .   (5.5.2.12)

Remark 5.5.2.13 (Modified Horner scheme for evaluation of Bezier polynomials) The representa-
tion (5.5.2.9) together with (6.2.1.3) paves the way for an efficient evaluation of Bezier curves for many
parameter values. For example, this is important for fast graphical rendering of Bezier curves.
The algorithm is similar to that from § 5.2.3.33 and based on rewriting
    b^n(t) = ∑_{i=0}^{n} B_i^n(t) p_i
           = ( ··· ( ( ( (1 − t) p_0 + (n/1) t p_1 ) (1 − t) + (n(n−1)/(1·2)) t² p_2 ) (1 − t)
                     + (n(n−1)(n−2)/(1·2·3)) t³ p_3 ) (1 − t) + ··· ) (1 − t) + t^n p_n ,


and evaluating the terms in brackets from the innermost to the outer.

C++ code 5.5.2.14: Evaluation of Bezier polynomial curve for many parameter arguments ➺ GITLAB

    Eigen::MatrixXd evalBezier(const Eigen::MatrixXd &nodes,
                               const Eigen::RowVectorXd &t) {
      assert((nodes.rows() == 2) && "Nodes must have two coordinates");
      const Eigen::Index n = nodes.cols();  // Number of control points
      const Eigen::Index d = n - 1;         // Polynomial degree
      const Eigen::Index N = t.size();      // No. of evaluation points
      // Vector containing 1-t ("one minus t")
      const auto oml{Eigen::RowVectorXd::Constant(N, 1.0) - t};
      // Modified Horner scheme for polynomial in Bezier form
      // Matrix for returning result, initialized with p[0]*(1-t)
      Eigen::MatrixXd res = nodes.col(0) * oml;
      double binom_val = 1.0;  // Accumulate binomial coefficients
      // Powers of argument values
      Eigen::RowVectorXd t_pow{Eigen::RowVectorXd::Constant(N, 1.0)};
      for (int i = 1; i < d; ++i) {
        t_pow.array() *= t.array();
        binom_val *= (static_cast<double>(d - i) + 1.0) / i;
        res += binom_val * nodes.col(i) * t_pow;
        res.row(0).array() *= oml.array();
        res.row(1).array() *= oml.array();
      }
      res += nodes.col(d) * (t.array() * t_pow.array()).matrix();
      return res;
    }

The asymptotic computational effort for the evaluation for N ∈ N parameter values is O( Nn) for
n, N → ∞. y

5.5.3 Spline Curves


As we have seen in Exp. 5.5.2.7, Bezier curves based on global polynomials cannot represent shapes well,
because they lack locality and flexibility. This agrees with the observations made in Section 5.3. There we
learned that superior shape preservation could be achieved by using piecewise polynomial functions. The
same policy will also be successful for curve design.
Again, without loss of generality we restrict ourselves to the parameter interval [0, 1], cf. Def. 5.5.1.3. We
can conveniently reuse spaces of piecewise polynomials introduced earlier in Section 5.4:

Definition 5.4.1.1. Spline space

Given an interval I := [ a, b] ⊂ R and a knot sequence M := { a = t0 < t1 < . . . < tm−1 <
tm = b}, m ∈ N, the vector space Sd,M of the spline functions of degree d (or order d + 1) is
defined by

Sd,M := {s ∈ C d−1 ( I ): s j := s|[t j−1 ,t j ] ∈ Pd ∀ j = 1, . . . , m} .

d − 1-times continuously differentiable locally polynomial of degree d

It goes without saying that a spline curve of degree d ∈ N is a curve s : [0, 1] → R2 that possesses a
parameterization that is component-wise a spline function of degree d with respect to a knot sequence
M := {t0 = 0 < t1 < . . . < tm−1 < tm = 1} for the interval [0, 1]: s ∈ (Sd,M )2 !


This does not yet bring us closer to the goal of building a shape-aware spline curve from n + 1 given
control points p_0, . . . , p_n, p_ℓ ∈ R², n ∈ N. To begin with, recall

    dim (S_{d,M})² = 2(m + d) .   (5.5.3.1)

Since the n + 1 control points ∈ R² correspond to 2(n + 1) degrees of freedom, matching that number with
the dimension of the space of spline curves we see that we should choose

    m := n + 1 − d

knots in order to facilitate the construction of a unique curve from p_0, . . . , p_n.

§5.5.3.2 (B-splines) Now we take the cue from the representation of Bezier curves by means of Bernstein
polynomials as established in Thm. 5.5.2.8 and aim for the construction of spline counterparts of the
Bernstein polynomials.
For an elegant presentation we admit generalized knot sequences that may contain multiple knots:

    U := {0 =: u_0 ≤ u_1 ≤ u_2 ≤ · · · ≤ u_{m−1} ≤ u_m := 1} ,   m ∈ N .   (5.5.3.3)

The number of times a u_k-value occurs in the sequence is called its multiplicity.
The following definition can be motivated by a recursion satisfied by the Bernstein polynomials from
(6.2.1.3): for

    B_i^k(t) := \binom{k}{i} t^i (1 − t)^{k−i}

we have, for all t ∈ R,

    B_0^k(t) = (1 − t) B_0^{k−1}(t) ,
    B_i^k(t) = t B_{i−1}^{k−1}(t) + (1 − t) B_i^{k−1}(t) ,   i = 1, . . . , k − 1 ,   (5.5.3.4)
    B_k^k(t) = t B_{k−1}^{k−1}(t) .

In particular, we notice that B_i^k is a two-term convex combination of B_{i−1}^{k−1} and B_i^{k−1}. We pursue something
similar with splines:

Definition 5.5.3.5. B-splines

Given a generalized knot sequence U := {0 =: u_0 ≤ u_1 ≤ u_2 ≤ · · · ≤ u_{ℓ−1} ≤ u_ℓ := 1} the
B-splines N_i^k, k ∈ {0, . . . , ℓ − 1}, i ∈ {0, . . . , ℓ − k − 1}, are defined via the recursion

    N_i^0(t) := 1 for u_i ≤ t < u_{i+1},  0 elsewhere ,   i = 0, . . . , ℓ − 1 ,
                                                                                                  (5.5.3.6)
    N_i^k(t) := (t − u_i)/(u_{i+k} − u_i) · N_i^{k−1}(t) + (u_{i+k+1} − t)/(u_{i+k+1} − u_{i+1}) · N_{i+1}^{k−1}(t) ,
                 i = 0, . . . , ℓ − k − 1 ,

where we adopt the convention that “0/0 = 0”.
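As an illustration, the recursion (5.5.3.6) can be evaluated directly; the following minimal sketch (our own helper, not a lecture routine, and with no attempt at efficiency) computes N_i^k(t) for a generalized knot sequence stored in a vector u.

    #include <Eigen/Dense>

    // Point evaluation of the B-spline N_i^k(t) by the recursion (5.5.3.6);
    // u = vector of generalized knots u_0 <= ... <= u_l. The convention
    // "0/0 = 0" is realized by testing the denominators for zero.
    double evalBspline(const Eigen::VectorXd &u, int i, int k, double t) {
      if (k == 0) {
        return (u(i) <= t && t < u(i + 1)) ? 1.0 : 0.0;
      }
      double val = 0.0;
      const double denom1 = u(i + k) - u(i);
      if (denom1 > 0.0) {
        val += (t - u(i)) / denom1 * evalBspline(u, i, k - 1, t);
      }
      const double denom2 = u(i + k + 1) - u(i + 1);
      if (denom2 > 0.0) {
        val += (u(i + k + 1) - t) / denom2 * evalBspline(u, i + 1, k - 1, t);
      }
      return val;
    }

Note that this naive recursion re-evaluates lower-degree B-splines many times; efficient implementations exploit the local support stated in Thm. 5.5.3.7 below and tabulate the recursion.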

Note that also N_i^k is a two-term convex combination of N_i^{k−1} and N_{i+1}^{k−1}. From this we can deduce the
following properties by simple induction:


Theorem 5.5.3.7. Elementary properties of B-splines

The B-splines based on the generalized knot sequence U and defined by (5.5.3.6) satisfy
• Nik is a U -piecewise polynomial of degree ≤ k,
• 0 ≤ Nik (t) ≤ 1 for all 0 ≤ t ≤ 1,
• N_i^k vanishes outside the interval [u_i, u_{i+k+1}]: supp(N_i^k) = [u_i, u_{i+k+1}],
• ∑_{j=i−k}^{i} N_j^k(t) = 1 for all t ∈ [u_i, u_{i+1}[ , i = k, . . . , ℓ − k − 1 .

The main “magic” result about B-splines concerns their smoothness properties that are by no means
obvious from the recursive definition (5.5.3.6). To make a concise statement we build special generalized
knot sequences from a (standard) knot sequence by copying endpoints d times: Given a degree d ∈ N0
and

    M := {0 =: t_0 < t_1 < . . . < t_{m−1} < t_m := 1} ,   m ∈ N ,

we construct a generalized knot sequence with ℓ + 1 := 2d + m + 1 members

    U_M^d := { 0 =: u_0 = · · · = u_d < u_{d+1} := t_1 < . . . < u_{d+m−1} := t_{m−1} < u_{d+m} = · · · = u_{2d+m} := 1 } ,

where each of the endpoints 0 and 1 occurs d + 1 times.

These plots show B-splines for special generalized knot sequences of the type U_M^d:

Fig. 194 (axes t vs. N_i^d(t)): “9 B-splines, degree = 2, 12 knots”, N_0^2, . . . , N_8^2, with knots marked.
Fig. 195 (axes t vs. N_i^d(t)): “9 B-splines, degree = 3, 13 knots”, N_0^3, . . . , N_8^3, with knots marked.

We notice that on that generalized knot sequence (5.5.3.6) provides ℓ − d = m + d B-spline functions
Nid , i = 0, . . . , m + d − 1. In light of Thm. 5.4.1.2 this is necessary for the following much more general
assertion to hold, whose proof, unfortunately, is very technical and will be skipped.

Theorem 5.5.3.8. Basis property of B-splines


The B-splines of degree d on the generalized knot set U_M^d according to Def. 5.5.3.5 satisfy
• N_i^d ∈ S_{d,M} for all i = 0, . . . , m + d − 1,
• {N_0^d, . . . , N_{m+d−1}^d} is a basis of the spline space S_{d,M}.

In particular, we conclude that N_i^d is (d − 1)-times continuously differentiable: N_i^d ∈ C^{d−1}([0, 1])! So,
already for d = 3, the case of cubic splines, we obtain curves that look perfectly smooth.
Inspired by Thm. 5.5.2.8 we propose the following recipe for constructing a degree-d spline curve from
control points p0 , . . . , pn :


➊ Setting m := n + 1 − d to match the numbers of degrees of freedom, choose a knot sequence

    M := {0 =: t_0 < t_1 < . . . < t_{m−1} < t_m := 1} ,   m ∈ N .

➋ Parameterize the curve by a linear combination of B-splines based on U_M^d:

    s(t) := ∑_{j=0}^{n} N_j^d(t) p_j ,   t ∈ [0, 1] .   (5.5.3.9)

y
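A minimal sketch of this recipe, assuming equidistant interior knots and reusing the hypothetical evalBspline() helper from the sketch above (both choices are ours, for illustration only, not the lecture's reference implementation):

    #include <Eigen/Dense>
    #include <algorithm>

    // Evaluate the spline curve (5.5.3.9) of degree d defined by the columns of
    // 'points' (2 x (n+1) control points) at parameter t in [0,1[, using
    // equidistant knots and the generalized knot sequence U_M^d in which the
    // endpoints 0 and 1 have multiplicity d+1.
    Eigen::Vector2d evalSplineCurve(const Eigen::MatrixXd &points, int d, double t) {
      const int n = static_cast<int>(points.cols()) - 1;
      const int m = n + 1 - d;  // number of knot intervals, see step ➊
      // Generalized knot sequence U_M^d with 2d + m + 1 entries
      Eigen::VectorXd u(2 * d + m + 1);
      for (int j = 0; j < static_cast<int>(u.size()); ++j) {
        u(j) = std::min(1.0, std::max(0.0, static_cast<double>(j - d) / m));
      }
      Eigen::Vector2d s = Eigen::Vector2d::Zero();
      for (int j = 0; j <= n; ++j) {
        s += evalBspline(u, j, d, t) * points.col(j);  // N_j^d(t) * p_j
      }
      return s;
    }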
EXPERIMENT 5.5.3.10 (Curve design based on B-splines) We display the spline curves induced by
the “car-shaped” set of control points using equidistant knots in [0, 1]:

Fig. 196 (axes x1 vs. x2): “Spline curve, degree 3, 9 control points”; control polygon and spline curve.
Fig. 197 (axes x1 vs. x2): “Spline curve, degree 2, 9 control points”; control polygon and spline curve.

Superior shape fidelity compared to Bezier curves!

BUT, considerable room for improvement still remains. y

§5.5.3.11 (Node insertion)


y


5.6 Trigonometric Interpolation


We consider time series data (ti , yi ), i = 0, . . . , n, ti , yi ∈ R, obtained by sampling a time-dependent
scalar physical quantity t 7→ ϕ(t). We know that ϕ is a T -periodic function with period T > 0, that is
ϕ(t) = ϕ(t + T ) for all t ∈ R . In the spirit of shape preservation (→ Section 5.3) an interpolant f of the
time series should also be T -periodic: f (t + T ) = f (t) for all t ∈ R.

Assumption 5.6.0.1. Sampling in a period

We assume the period T > 0 to be known and ti ∈ [0, T [ for all interpolation nodes ti , i = 0, . . . , n.

In the sequel, for the sake of simplicity, we consider only T = 1.

Task: Given T > 0 and data points (ti , yi ), yi ∈ K, ti ∈ [0, T [, find a T -periodic function f : R → K (the
interpolant), f (t + T ) = f (t) ∀t ∈ R, that satisfies the interpolation conditions

f (ti ) = yi , i = 0, . . . , n . (5.6.0.2)

Supplementary literature. This topic is also presented in [DR08, Sect. 8.5].

5.6.1 Trigonometric Polynomials


The most fundamental periodic functions are derived from the trigonometric functions sin and cos and
dilations of them (A dilation of a function t 7→ ψ(t) is a function of the form t 7→ ψ(ct) with some c > 0).

Definition 5.6.1.1. Space of trigonometric polynomials

The vector space of 1-periodic trigonometric polynomials of degree 2n, n ∈ N, is given by


    P_{2n}^T := Span{ t ↦ cos(2πjt), t ↦ sin(2πjt) }_{j=0}^{n} ⊂ C^∞(R) .

The terminology is natural after recalling expressions for trigonometric functions via complex exponentials
(“Euler’s formula”)
    e^{ıt} = cos t + ı sin t   ⇒   cos t = ½ (e^{ıt} + e^{−ıt}) ,   sin t = (1/2ı) (e^{ıt} − e^{−ıt}) .   (5.6.1.2)

Thus we can rewrite q ∈ P_{2n}^T given in the form

    q(t) = α_0 + ∑_{j=1}^{n} ( α_j cos(2πjt) + β_j sin(2πjt) ) ,   α_j, β_j ∈ R ,   (5.6.1.3)

by means of complex exponentials making use of Euler’s formula (5.6.1.2):

    q(t) = α_0 + ∑_{j=1}^{n} ( α_j cos(2πjt) + β_j sin(2πjt) )
         = α_0 + ½ ∑_{j=1}^{n} { (α_j − ıβ_j) e^{2πıjt} + (α_j + ıβ_j) e^{−2πıjt} }
         = α_0 + ½ ∑_{j=−n}^{−1} (α_{−j} + ıβ_{−j}) e^{2πıjt} + ½ ∑_{j=1}^{n} (α_j − ıβ_j) e^{2πıjt}
         = e^{−2πınt} ∑_{j=0}^{2n} γ_j e^{2πıjt} ,   (5.6.1.4)

with
    γ_j = ½ (α_{n−j} + ıβ_{n−j}) for j = 0, . . . , n − 1 ,   γ_n = α_0 ,   γ_j = ½ (α_{j−n} − ıβ_{j−n}) for j = n + 1, . . . , 2n .

Note that γ_j ∈ C even if α_j, β_j ∈ R ➣ We mainly work in C in the context of trigonometric interpolation!
Admit y_k ∈ C.

From the above manipulations we conclude


    q ∈ P_{2n}^T   ⇒   q(t) = e^{−2πınt} · p(e^{2πıt})   with the polynomial   p(z) = ∑_{j=0}^{2n} γ_j z^j ∈ P_{2n} ,   (5.6.1.5)

and γ j from (5.6.1.4).

(After scaling) a trigonometric polynomial of degree 2n is a regular polynomial ∈ P2n (in C) re-
stricted to the unit circle S1 ⊂ C.

✎ notation: S1 := {z ∈ C: |z| = 1} is the unit circle in the complex plane.
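For concreteness, the coefficient transformation (α_j, β_j) ↦ γ_j from (5.6.1.4) takes only a few lines; the following helper (our own naming, not one of the lecture codes) performs the kind of conversion used internally by the FFT-based routines in Section 5.6.3.

    #include <Eigen/Dense>
    #include <complex>

    // Coefficients gamma_0,...,gamma_{2n} of the polynomial p in (5.6.1.5),
    // computed from alpha_0,...,alpha_n and beta_1,...,beta_n of (5.6.1.3)
    // according to (5.6.1.4).
    Eigen::VectorXcd trigToPolyCoeffs(const Eigen::VectorXd &alpha,
                                      const Eigen::VectorXd &beta) {
      const int n = static_cast<int>(alpha.size()) - 1;
      const std::complex<double> I(0.0, 1.0);  // imaginary unit
      Eigen::VectorXcd gamma(2 * n + 1);
      gamma(n) = alpha(0);                     // gamma_n = alpha_0
      for (int j = 1; j <= n; ++j) {
        gamma(n - j) = 0.5 * (alpha(j) + I * beta(j - 1));
        gamma(n + j) = 0.5 * (alpha(j) - I * beta(j - 1));
      }
      return gamma;
    }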


The relationship (5.6.1.5) justifies calling P_{2n}^T a space of trigonometric polynomials. It also defines a
bijective linear mapping of vector spaces P_{2n} → P_{2n}^T. This immediately reveals the dimension of P_{2n}^T:

Corollary 5.6.1.6. Dimension of P_{2n}^T

The vector space P_{2n}^T has dimension dim P_{2n}^T = 2n + 1.

5.6.2 Reduction to Lagrange Interpolation


We observed that trigonometric polynomials are standard (complex) polynomials in disguise. Thus we
can relate trigonometric interpolation to well-known standard Lagrangian interpolation discussed in Sec-
tion 5.2.2. In fact, we slightly extend the method, because now we admit interpolation nodes ∈ C. All
results obtained earlier carry over to this setting.
The key tool is a smooth bijective mapping between I := [0, 1[ and S1 defined as

ΦS1 : I → S1 , t 7→ z := exp(2πıt) , (5.6.2.1)


Fig. 198: the mapping Φ_{S¹} sends t ∈ [0, 1[ to the point z = exp(2πıt) on the unit circle in the complex plane
(Re/Im axes).

and the relationship from (5.6.1.5):


    q ∈ P_{2n}^T   ⇔   q(t) = e^{−2πınt} p(e^{2πıt})   for some p ∈ P_{2n} .   (5.6.2.2)


Trigonometric interpolation through data points (t_k, y_k), k = 0, . . . , 2n
    ⟺  Lagrange polynomial interpolation through data points (e^{2πıt_k}, e^{2πınt_k} y_k), k = 0, . . . , 2n.

Trigonometric interpolation ←→ polynomial interpolation on S¹

All theoretical results and algorithms from polynomial interpolation carry over to trigonometric
interpolation
✦ Existence and uniqueness of trigonometric interpolation polynomial, see Thm. 5.2.2.7,
✦ Concept of Lagrange polynomials, see (5.2.2.4),
✦ the algorithms and representations discussed in Section 5.2.3.
The next code demonstrates the use of standard routines for polynomial interpolation provided by
BarycPolyInterp (→ Code 5.2.3.7) for trigonometric interpolation.

C++-code 5.6.2.3: Evaluation of trigonometric interpolation polynomial in many points

    // Evaluation of trigonometric interpolant at numerous points
    // IN : t = vector of nodes t_0, ..., t_n in [0,1[
    //      y = vector of data y_0, ..., y_n
    //      x = vector of evaluation points x_1, ..., x_N
    // OUT: vector of values of the interpolant at points in x
    VectorXd trigpolyval(const VectorXd& t, const VectorXd& y, const VectorXd& x) {
      using idx_t = VectorXd::Index;
      using comp = std::complex<double>;
      const idx_t N = y.size();  // Number of data points
      if (N % 2 == 0) {
        throw std::runtime_error("Number of points must be odd!");
      }
      const auto n = static_cast<double>(N - 1) / 2;
      const std::complex<double> M_I(0, 1);  // imaginary unit
      // interpolation nodes and evaluation points on unit circle
      const VectorXcd tc = (2 * M_PI * M_I * t).array().exp().matrix();
      const VectorXcd xc = (2 * M_PI * M_I * x).array().exp().matrix();
      // Rescaled values, according to q(t) = e^{-2*pi*i*n*t} * p(e^{2*pi*i*t}), see (5.6.1.4)
      const VectorXcd z = ((2 * n * M_PI * M_I * t).array().exp() * y.array()).matrix();
      // Evaluation of interpolating polynomial on unit circle using the
      // barycentric interpolation formula in C, see Code 5.2.3.4
      const BarycPolyInterp<comp> Interpol(tc);
      auto p = Interpol.eval<VectorXcd>(z, xc);
      // Undo the scaling, see (5.6.1.5)
      VectorXcd qc = ((-2 * n * M_PI * M_I * x).array().exp() * p.array()).matrix();
      return (qc.real());  // imaginary part is zero, cut it off
    }

The computational effort is O(n² + Nn), where N is the number of evaluation points, as for barycentric
polynomial interpolation studied in § 5.2.3.2.

The next code finds the coefficients α j , β j ∈ R of a trigonometric interpolation polynomial in the real-valued
representation (5.6.1.3) for real-valued data y j ∈ R by simply solving the linear system of equations arising
from the interpolation conditions (5.6.0.2).


C++-code 5.6.2.4: Computation of coefficients of trigonometric interpolation polynomial, general nodes ➺ GITLAB

    // Computes expansion coefficients of trigonometric polynomials (5.6.1.3)
    // IN : t = vector of nodes t_0, ..., t_n in [0,1[
    //      y = vector of data y_0, ..., y_n
    // OUT: pair of vectors storing the basis expansion coefficients alpha_j, beta_j,
    //      see Def. 5.6.1.1
    std::pair<VectorXd, VectorXd>
    trigpolycoeff(const VectorXd& t, const VectorXd& y) {
      const unsigned N = y.size();
      const unsigned n = (N - 1) / 2;
      if (N % 2 == 0) {
        throw std::logic_error("Number of points must be odd!");
      }
      // build system matrix M
      MatrixXd M(N, N);
      M.col(0) = VectorXd::Ones(N);
      for (unsigned c = 1; c <= n; ++c) {
        M.col(c) = (2 * M_PI * c * t).array().cos().matrix();
        M.col(n + c) = (2 * M_PI * c * t).array().sin().matrix();
      }
      // solve LSE and extract coefficients alpha_j and beta_j
      VectorXd c = M.lu().solve(y);
      return {c.head(n + 1), c.tail(n)};
    }

The asymptotic computational effort of this implementation is dominated by the cost for Gaussian elimina-
tion applied to a fully populated (dense) matrix, see Thm. 2.5.0.2: O(n3 ) for n → ∞.

5.6.3 Equidistant Trigonometric Interpolation


Often time series data for a time-periodic quantity are measured with a constant rhythm over the entire
(known) period of duration T > 0, that is, t_j = j∆t, ∆t = T/(n+1), j = 0, . . . , n. In this case, the formulas
for computing the coefficients of the interpolating trigonometric polynomial (→ Def. 5.6.1.1) become
special versions of the discrete Fourier transform (DFT, see Def. 4.2.1.18) studied in Section 4.2. Efficient
implementation can thus harness the speed of FFT introduced in Section 4.3.

§5.6.3.1 (Computation of equidistant trigonometric interpolants) Now we consider trigonometric
interpolation in the 1-periodic setting with 2n + 1 uniformly distributed interpolation nodes t_k = k/(2n+1),
k = 0, . . . , 2n, and associated data values y_k. Existence and uniqueness of an interpolating trigonometric
polynomial q ∈ P_{2n}^T, q(t_k) = y_k, was established in Section 5.6.2. We rely on the relationship

    q ∈ P_{2n}^T   ⇒   q(t) = e^{−2πınt} · p(e^{2πıt})   with   p(z) = ∑_{j=0}^{2n} γ_j z^j ∈ P_{2n} ,   (5.6.1.5)


to arrive at the following (2n + 1) × (2n + 1) linear system of equations for computing the unknown
coefficients γ_j:

    ∑_{j=0}^{2n} γ_j exp( 2πı jk/(2n+1) ) = (b)_k := exp( 2πı nk/(2n+1) ) y_k ,   k = 0, . . . , 2n .   (5.6.3.2)

    F̄_{2n+1} c = b ,   c = [γ_0, . . . , γ_{2n}]^⊤   ⇒ (Lemma 4.2.1.14)   c = 1/(2n+1) · F_{2n+1} b ,   (5.6.3.3)

with the (2n + 1) × (2n + 1) (conjugate) Fourier matrix F̄_{2n+1}, see (4.2.1.13)!


The linear system of equations (5.6.3.2) is amenable to fast solution by means of FFT with
O(n log n) asymptotic complexity for n → ∞, see Section 4.3.
The following C++ code demonstrates the efficient computation of the coefficients α_j, β_j ∈ R of the
trigonometric polynomial

    q(t) = α_0 + ∑_{j=1}^{n} ( α_j cos(2πjt) + β_j sin(2πjt) ) ,   α_j, β_j ∈ R ,   (5.6.1.3)

interpolating the data points ( k/(2n+1), y_k ), k = 0, . . . , 2n. Only the data values y_k need to be passed.

C++-code 5.6.3.4: Efficient computation of coefficients of trigonometric interpolation polynomial (equidistant nodes) ➺ GITLAB

    // Efficient FFT-based computation of coefficients in expansion (5.6.1.3)
    // for a trigonometric interpolation polynomial in equidistant points
    // (j/(2n+1), y_j), j = 0, ..., 2n.
    // IN : y has to be a row vector of odd length, return values are column vectors
    // OUT: vectors [alpha_j]_j, [beta_j]_j of expansion coefficients
    //      with respect to the trigonometric basis from Def. 5.6.1.1
    std::pair<VectorXd, VectorXd> trigipequid(const VectorXd& y) {
      using index_t = VectorXcd::Index;
      const index_t N = y.size();
      const index_t n = (N - 1) / 2;
      if (N % 2 != 1) {
        throw std::logic_error("Number of points must be odd!");
      }
      // prepare data for fft
      const std::complex<double> M_I(0, 1);  // imaginary unit
      // right hand side vector b from (5.6.3.3)
      VectorXcd b(N);
      for (index_t k = 0; k < N; ++k) {
        b(k) = y(k) * std::exp(2 * M_PI * M_I *
                               (static_cast<double>(n) / static_cast<double>(N) *
                                static_cast<double>(k)));
      }
      Eigen::FFT<double> fft;    // DFT helper class
      VectorXcd c = fft.fwd(b);  // means that "c = fft(b)"

      // By (5.6.1.4) and (5.6.3.3) we can recover (with gamma_j = c_j / N)
      // alpha_0 = gamma_n, alpha_j = gamma_{n-j} + gamma_{n+j},
      // beta_j = -i*(gamma_{n-j} - gamma_{n+j}), j = 1, ..., n.
      VectorXcd alpha(n + 1);
      VectorXcd beta(n);
      alpha(0) = c(n) / static_cast<double>(N);
      for (index_t l = 1; l <= n; ++l) {
        alpha(l) = (c(n - l) + c(n + l)) / static_cast<double>(N);
        beta(l - 1) = -M_I * (c(n - l) - c(n + l)) / static_cast<double>(N);
      }
      return {alpha.real(), beta.real()};
    }

EXAMPLE 5.6.3.5 (Runtime comparison for computation of coefficients of trigonometric interpolation
polynomials)

Runtime measurements (tic-toc timings) for MATLAB equivalents of the codes Code 5.6.2.4 (trigpolycoeff)
and Code 5.6.3.4 (trigipequid); see Fig. 199 (log-log plot of runtime [s] versus n).
Platform: MATLAB 7.10.0.499 (R2010a); CPU: Intel Core i7, 2.66 GHz, 2 cores, L2 256 KB, L3 4 MB;
OS: Mac OS X, Darwin Kernel Version 10.5.0.
(Similar runtimes would also be measured for C++ or PYTHON codes.)

We make the same observation as in Ex. 4.3.0.12: massive gain in efficiency through relying on FFT. y

§5.6.3.6 (Efficient evaluation of trigonometric interpolation polynomials) Imagine we want to plot a
1-periodic real-valued trigonometric polynomial q ∈ P_{2n}^T given through its coefficients α_j, β_j. To that end
we have to evaluate it in many points, which are naturally chosen as uniformly spaced in [0, 1].

Task: Find an efficient way to evaluate the trigonometric polynomial

    q(t) = α_0 + ∑_{j=1}^{n} ( α_j cos(2πjt) + β_j sin(2πjt) ) ,   α_j, β_j ∈ R ,   (5.6.1.3)

at the equidistant points k/N, N > 2n, k = 0, . . . , N − 1.

As in (5.6.1.4) we can rewrite

    q(t) = e^{−2πınt} ∑_{j=0}^{2n} γ_j e^{2πıjt} ,
    with  γ_j = ½ (α_{n−j} + ıβ_{n−j}) for j = 0, . . . , n − 1 ,  γ_n = α_0 ,  γ_j = ½ (α_{j−n} − ıβ_{j−n}) for j = n + 1, . . . , 2n ,

so that

    q(k/N) = e^{−2πınk/N} ∑_{j=0}^{2n} γ_j exp( 2πı kj/N ) ,   k = 0, . . . , N − 1 ,

    q(k/N) = e^{−2πı kn/N} (v)_k   with   v = F̄_N c̃ ,   (5.6.3.7)

where F̄_N is the (conjugate) Fourier matrix, see (4.2.1.13), and c̃ ∈ C^N is obtained by zero padding of
c := (γ_0, . . . , γ_{2n})^T:

    (c̃)_k = γ_k   for k = 0, . . . , 2n ,   (c̃)_k = 0   for k = 2n + 1, . . . , N − 1 .


The FFT-based implementation is realized in the following code, which accepts the coefficients α j , β j as
components of the vectors a and b, and returns q(k/N ), k = 0, . . . , N − 1, in the vector y.

C++-code 5.6.3.8: Fast evaluation of trigonometric polynomial at equidistant points ➺ GITLAB

    void trigipequidcomp(const VectorXcd& a, const VectorXcd& b, const unsigned N,
                         VectorXcd& y) {
      const unsigned n = a.size() - 1;
      if (N < (2 * n - 1)) {
        std::cerr << "N is too small! Must be larger than 2*n";
        return;
      }
      const std::complex<double> iu(0, 1);  // imaginary unit
      // build vector gamma
      VectorXcd gamma(2 * n + 1);
      gamma(n) = a(0);
      for (unsigned k = 0; k < n; ++k) {
        gamma(k) = 0.5 * (a(n - k) + iu * b(n - k - 1));
        gamma(n + k + 1) = 0.5 * (a(k + 1) - iu * b(k));
      }
      // zero padding to obtain the vector c~
      VectorXcd ch(N);
      ch << gamma, VectorXcd::Zero(N - (2 * n + 1));

      // realize multiplication with conjugate Fourier matrix
      Eigen::FFT<double> fft;
      const VectorXcd chCon = ch.conjugate();
      VectorXcd v = fft.fwd(chCon).conjugate();

      // final scaling, implemented without efficiency considerations
      y = VectorXcd(N);
      for (unsigned k = 0; k < N; ++k) {
        y(k) = v(k) * std::exp(-2. * k * n * M_PI / N * iu);
      }
    }

The next code merges the steps of computing the coefficients of the trigonometric interpolation polynomial
in equidistant points and its evaluation in another set of M equidistant points.

C++ code 5.6.3.9: Equidistant points: fast on the fly evaluation of trigonometric interpolation polynomial ➺ GITLAB

    // Evaluation of trigonometric interpolation polynomial through (j/(2n+1), y_j),
    // j = 0, ..., 2n, in the equidistant points k/M, k = 0, ..., M-1
    // IN : y = vector of values to be interpolated
    //      q (COMPLEX!) will be used to save the return values
    void trigpolyvalequid(const VectorXd y, const int M, VectorXd& q) {
      const int N = y.size();
      if (N % 2 == 0) {
        std::cerr << "Number of points must be odd!\n";
        return;
      }
      const int n = (N - 1) / 2;
      // computing coefficients, see (5.6.3.3)
      VectorXcd a, b;
      trigipequid(y, a, b);

      std::complex<double> i(0, 1);
      VectorXcd gamma(2 * n + 1);
      gamma(n) = a(0);
      for (int k = 0; k < n; ++k) {
        gamma(k) = 0.5 * (a(n - k) + i * b(n - k - 1));
        gamma(n + k + 1) = 0.5 * (a(k + 1) - i * b(k));
      }

      // zero padding
      VectorXcd ch(M);
      ch << gamma, VectorXcd::Zero(M - (2 * n + 1));

      // build conjugate Fourier matrix
      Eigen::FFT<double> fft;
      const VectorXcd chCon = ch.conjugate();
      const VectorXcd v = fft.fwd(chCon).conjugate();

      // multiply with conjugate Fourier matrix
      VectorXcd q_complex = VectorXcd(M);
      for (int k = 0; k < M; ++k) {
        q_complex(k) = v(k) * std::exp(-2. * k * n * M_PI / M * i);
      }
      // imaginary part is zero up to machine precision, cut it off!
      q = q_complex.real();
    }

y
Review question(s) 5.6.3.10 (Trigonometric interpolation)
(Q5.6.3.10.A) You have at your disposal the function
    std::pair<Eigen::VectorXd, Eigen::VectorXd>
    trigpolycoeff(const Eigen::VectorXd &t, const Eigen::VectorXd &y);
that computes the coefficients α_j, β_j of the 1-periodic interpolating trigonometric polynomial q ∈ P_{2n}^T,

    q(t) = α_0 + ∑_{j=1}^{n} ( α_j cos(2πjt) + β_j sin(2πjt) ) ,

for the data points (t_k, y_k) ∈ R², k = 0, . . . , 2n, passed in the vectors t and y of equal length.
However, now we have to process data points that are obtained by sampling a T-periodic function,
T > 0 known. Describe how you can use trigpolycoeff() to obtain a T-periodic interpolant p : R → R.
(Q5.6.3.10.B) Let q ∈ P_{2n}^T be the 1-periodic trigonometric interpolant of (t_k, y_k), k = 0, . . . , 2n,
0 < t_k < 1:

    q(t) = α_0 + ∑_{j=1}^{n} ( α_j cos(2πjt) + β_j sin(2πjt) ) .

What are the basis expansion coefficients α̃_j, β̃_j for the trigonometric interpolant q̃ ∈ P_{2n}^T through
(t̃_k, y_k), k = 0, . . . , 2n, with t̃_k = t_k + ∆t, ∆t > 0 given?
(etk , yk ), k = 0, . . . , 2n, with etk = tk + ∆t, ∆t > 0 given?


(Q5.6.3.10.C) Outline an implementation of a C++ function
    double evalTrigPol(const Eigen::VectorXd &alpha, const Eigen::VectorXd &beta, double t);
that computes q(t) for

    q(t) = α_0 + ∑_{j=1}^{n} ( α_j cos(2πjt) + β_j sin(2πjt) ) ,

using only a single call of a mathematical function of the C++ standard library. The coefficients are
passed through the vectors alpha and beta, the evaluation point t in t.

5.7 Least Squares Data Fitting

Video tutorial for Section 5.7 "Least Squares Data Fitting": (13 minutes) Download link,
tablet notes

→ review questions 5.7.0.31

As remarked in Ex. 5.1.0.8 the basic assumption underlying the reconstruction of the functional depen-
dence of two quantities by means of interpolation is that of accurate data. In case of data uncertainty or
measurement errors the exact satisfaction of interpolation conditions ceases to make sense and we are
better off reconstructing a fitting function that is merely “close to the data” in a sense to be made precise
next.

The most general task of (multidimensional, vector-valued) least squares data fitting can be described
as follows:

Least squares data fitting

Given:      data points (t_i, y_i), i ∈ {1, . . . , m}, m ∈ N, t_i ∈ D ⊂ R^k, y_i ∈ R^d, d ∈ N

Objective:  Find a (continuous) function f : D ↦ R^d in some set S ⊂ C0(D)
            of admissible functions satisfying

    f ∈ argmin_{g∈S} ∑_{i=1}^{m} ‖g(t_i) − y_i‖²₂ .   (5.7.0.2)

Such a function f is called a (best) least squares fit for the data in S.

Focus in this section: k = 1, d = 1 (one-dimensional scalar setting)


↔ fitting of scalar data depending on one parameter (t ∈ R),
↔ Data points (ti , yi ), ti ∈ I ⊂ R, yi ∈ R ➣ S ⊂ C0 ( I )
↔ (best) least squares fit f : I ⊂ R → R is a function of one real variable.

    k = 1, d = 1:   (5.7.0.2)  ⇔  f ∈ argmin_{g∈S} ∑_{i=1}^{m} |g(t_i) − y_i|² .   (5.7.0.3)


§5.7.0.4 (Linear spaces of admissible functions) Consider a special variant of the general least squares
data fitting problem: The set S of admissible continuous functions is now chosen as a finite-dimensional
vector space Vn ⊂ C0 ( D ), dim Vn = n ∈ N, cf. the discussion in § 5.1.0.21 for interpolation.
Choose basis of Vn : Vn = Span{b1 , . . . , bn }, b j : D → R d continuous

The best least squares fit f ∈ Vn can be represented by a finite linear combination of the basis
functions b j :

    f(t) = ∑_{j=1}^{n} x_j b_j(t) ,   x_j ∈ R .   (5.7.0.5)

Often, in the case d > 1, Vn is chosen as a product space

    V_n = W × · · · × W   (d factors) ,   (5.7.0.6)

of a space W ⊂ C0 ( D ) of R-valued functions D 7→ R. In this case

dim Vn = d · dim W .

If ℓ := dim W and {q1 , . . . , qℓ } is a basis of W , then



    { e_i q_j : i = 1, . . . , d, j = 1, . . . , ℓ } = { e_1 q_1, e_2 q_1, . . . , e_d q_1, e_1 q_2, . . . , e_d q_2, . . . , e_1 q_ℓ, . . . , e_d q_ℓ }   (5.7.0.7)

is a basis of V_n (e_i ≙ i-th unit vector). y

§5.7.0.8 (Linear data fitting → [DR08, Sect. 4.1]) We adopt the setting of § 5.7.0.4 of an n-
dimensional space Vn of admissible functions with basis {b1 , . . . , bn }. Then the least squares data
fitting problem can be recast as follows.

General Linear least squares fitting problem:

Given: ✦ data points (ti , yi ) ∈ R k × R d , i = 1, . . . , m


✦ basis functions b j : D ⊂ R k 7→ R d , j = 1, . . . , n, n < m
Sought: coefficients x j ∈ R, j = 1, . . . , n, such that

    x := [x_1, . . . , x_n] := argmin_{z_j∈R} ∑_{i=1}^{m} ‖ ∑_{j=1}^{n} z_j b_j(t_i) − y_i ‖²₂ .   (5.7.0.9)

Special cases:
• For k = 1, d = 1, data points (ti , yi ) ∈ R × R (scalar, one-dimensional setting), and Vn =
Span{b1 , . . . , bn }, we seek coefficients x j ∈ R, j = 1, . . . , n, as the components of a vector
x = [ x1 , . . . , xn ]⊤ ∈ R n satisfying
    x = argmin_{z∈R^n} ∑_{i=1}^{m} | ∑_{j=1}^{n} (z)_j b_j(t_i) − y_i |² .   (5.7.0.10)


• If Vn is a product space according to (5.7.0.6) with basis (5.7.0.7), then (5.7.0.9) amounts to finding
vectors x j ∈ R d , j = 1, . . . , ℓ, with

    (x_1, . . . , x_ℓ) = argmin_{z_j∈R^d} ∑_{i=1}^{m} ‖ ∑_{j=1}^{ℓ} z_j q_j(t_i) − y_i ‖²₂ .   (5.7.0.11)

EXAMPLE 5.7.0.12 (Linear parameter estimation = linear data fitting → Ex. 3.0.1.4, Ex. 3.1.1.5)
The linear parameter estimation/linear regression problem presented in Ex. 3.0.1.4 can be recast as a
linear data fitting problem with
• k = n, d = 1, data points (xi , yi ) ∈ R k × R,
• a (k + 1)-dimensional space V_n = { x ↦ a^⊤ x + β : a ∈ R^k, β ∈ R } of affine linear admissible
functions,
• the choice of basis { x ↦ (x)_1, . . . , x ↦ (x)_k, x ↦ 1 }.
y

§5.7.0.13 (Linear data fitting as a linear least squares problem) Linear (least squares) data fitting
leads to an overdetermined linear system of equations for which we seek a least squares solution (→
Def. 3.1.1.1) as in Section 3.1.1. To see this rewrite

    ∑_{i=1}^{m} ‖ ∑_{j=1}^{n} z_j b_j(t_i) − y_i ‖²₂ = ∑_{i=1}^{m} ∑_{r=1}^{d} ( ∑_{j=1}^{n} (b_j(t_i))_r z_j − (y_i)_r )² .

Theorem 5.7.0.14. Least squares solution of data fitting problem



The solution x = [ x1 , . . . , xn ] ∈ R n of the linear least squares fitting problem (5.7.0.9) is the least
squares solution of the overdetermined linear system of equations
   
    [ A_1 ; . . . ; A_d ] x = [ b_1 ; . . . ; b_d ]   (blocks stacked vertically) ,   (5.7.0.15)

with the matrices A_r ∈ R^{m,n}, (A_r)_{i,j} := (b_j(t_i))_r, and vectors b_r := [(y_1)_r, . . . , (y_m)_r]^⊤ ∈ R^m,
r = 1, . . . , d.

In the one-dimensional, scalar case (d = 1) of (5.7.0.10) the related overdetermined linear system of
equations is

    [ b_1(t_1)  . . .  b_n(t_1) ]       [ y_1 ]
    [    ⋮                ⋮     ]  x =  [  ⋮  ] .   (5.7.0.16)
    [ b_1(t_m)  . . .  b_n(t_m) ]       [ y_m ]


Obviously, for m = n we recover the 1D data interpolation problem from § 5.1.0.21. For d > 1 the
overdetermined linear system of equations (5.7.0.15) can be thought of as d systems of the type (5.7.0.16)
stacked on top of each other, one for every component of the data vectors.
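For the scalar case d = 1, assembling and solving (5.7.0.16) takes only a few lines; the following sketch is our own illustration (not a lecture routine), using Eigen's QR-based least squares solver as in Code 3.3.4.2 and accepting the basis functions as callables.

    #include <Eigen/Dense>
    #include <functional>
    #include <vector>

    // Least squares fit (5.7.0.10)/(5.7.0.16) for scalar data (t_i, y_i):
    // assemble the m x n matrix [b_j(t_i)] and solve in the least squares
    // sense with a QR decomposition.
    Eigen::VectorXd lsqFit(const Eigen::VectorXd &t, const Eigen::VectorXd &y,
                           const std::vector<std::function<double(double)>> &basis) {
      const Eigen::Index m = t.size();
      const Eigen::Index n = static_cast<Eigen::Index>(basis.size());
      Eigen::MatrixXd A(m, n);
      for (Eigen::Index i = 0; i < m; ++i) {
        for (Eigen::Index j = 0; j < n; ++j) {
          A(i, j) = basis[j](t(i));  // entry b_j(t_i) of (5.7.0.16)
        }
      }
      return A.householderQr().solve(y);  // least squares solution
    }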

Having reduced the linear least squares data fitting problem to finding the least squares solution of an
overdetermined linear system of equations, we can now apply theoretical results about least squares
solutions, for instance, Cor. 3.1.2.13. The key issue is whether the coefficient matrix of (5.7.0.16) has full
rank n. Of course, this will depend on the locations of the t_i.

Lemma 5.7.0.17. Unique solvability of linear least squares fitting problem

The scalar one-dimensional linear least squares fitting problem (5.7.0.10) with dim Vn = n, Vn the
vector space of admissible functions, has a unique solution, if and only if there are ti1 , . . . , tin such
that
 
    [ b_1(t_{i_1})  . . .  b_n(t_{i_1}) ]
    [      ⋮                    ⋮       ]  ∈ R^{n,n}   is invertible,   (5.7.0.18)
    [ b_1(t_{i_n})  . . .  b_n(t_{i_n}) ]

which is independent of the choice of basis of V_n.

Equivalent to (5.7.0.18) is the requirement that there is an n-subset of {t_1, . . . , t_m} such that the corre-
sponding interpolation problem for V_n has a unique solution for any data values y_i. y
EXAMPLE 5.7.0.19 (Polynomial fitting)
Special variant of scalar (d = 1), one-dimensional (k = 1) linear data fitting (→ § 5.7.0.8): we choose the
space of admissible functions as polynomials of degree n − 1,

    V_n = P_{n−1} ,   e.g. with basis b_j(t) = t^{j−1} (monomial basis, Section 5.2.1) .

The corresponding overdetermined linear system of equations (5.7.0.16) now reads:

    [ 1  t_1  t_1²  . . .  t_1^{n−1} ]       [ y_1 ]
    [ 1  t_2  t_2²  . . .  t_2^{n−1} ]       [ y_2 ]
    [ ⋮   ⋮    ⋮             ⋮       ]  x =  [  ⋮  ] ,   (5.7.0.20)
    [ 1  t_m  t_m²  . . .  t_m^{n−1} ]       [ y_m ]

which, for m ≥ n, has full rank, because it contains invertible Vandermonde matrices (5.2.2.11),
Rem. 5.2.2.10.

The next code demonstrates the computation of the fitting polynomial with respect to the monomial basis
of Pn−1 :

C++ code 5.7.0.21: Polynomial fitting ➺ GITLAB

    // Solver for polynomial (degree passed in 'degree') linear least squares data
    // fitting problem, data points passed in t and y
    VectorXd polyfit(const VectorXd& t, const VectorXd& y, unsigned int degree) {
      // Initialize the coefficient matrix of (5.7.0.20)
      Eigen::MatrixXd A = Eigen::MatrixXd::Ones(t.size(), degree + 1);
      for (unsigned int j = 1; j < degree + 1; ++j)
        A.col(j) = A.col(j - 1).cwiseProduct(t);
      // Use EIGEN's built-in least squares solver, see Code 3.3.4.2
      Eigen::VectorXd coeffs = A.householderQr().solve(y);
      // leading coefficients have low indices.
      return coeffs.reverse();
    }

The function polyfit returns a vector [x_1, x_2, . . . , x_n]^⊤ describing the fitting polynomial according to the
convention

    p(t) = x_1 t^{n−1} + x_2 t^{n−2} + · · · + x_{n−1} t + x_n .   (5.7.0.22)
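For instance, the returned coefficient vector can be evaluated with a Horner scheme respecting the ordering (5.7.0.22); this small helper is our own addition for illustration.

    #include <Eigen/Dense>

    // Evaluate p(t) = x_1 t^{n-1} + ... + x_{n-1} t + x_n, with the coefficients
    // ordered as returned by polyfit(), see (5.7.0.22) (Horner scheme).
    double evalPolyfit(const Eigen::VectorXd &x, double t) {
      double val = 0.0;
      for (Eigen::Index j = 0; j < x.size(); ++j) {
        val = val * t + x(j);  // leading coefficients come first
      }
      return val;
    }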

EXPERIMENT 5.7.0.23 (Polynomial interpolation vs. polynomial fitting)


Fig. 200 (axes t vs. y): function f, interpolating polynomial, and fitting polynomial.

Data from the function f(t) = 1/(1 + t²):
✦ polynomial degree d = 10,
✦ interpolation through data points (t_j, f(t_j)), j = 0, . . . , d, t_j = −5 + j, see Ex. 5.2.4.3,
✦ fitting to data points (t_j, f(t_j)), j = 0, . . . , 3d, t_j = −5 + j/3.
Experiment carried out with the code interpfit.cpp ➺ GITLAB.

Fitting helps curb oscillations that haunt polynomial interpolation: in terms of “shape preservation” the
fitted polynomial is clearly superior to the interpolating polynomial. y

Remark 5.7.0.24 (“Overfitting”) In the previous example we saw that the degree-10 polynomial fitted to
the data in 31 points rather well reflects the “shape of the data”, see Fig. 200. If we had fit a polynomial of
degree 30 to the data instead, the result would have been a function with humongous oscillations, because
we would have recovered the interpolating polynomial at equidistant nodes, cf. Ex. 5.2.4.3.
Thus, in Exp. 5.7.0.23 we see a manifestation of a widely observed phenomenon, which plays a big role
in machine learning:

Overfitting

Fitting data with functions from a large space often produces poorer results (in terms of “shape
preservation” and “generalization error”) than the use of a smaller subspace for fitting.

§5.7.0.26 (Data fitting with regularization) Let us recall two observations made in the context of one-
dimensional interpolation of scalar data:
• High-degree global polynomial interpolants usually suffer from massive oscillations.
• The “nice” cubic-spline interpolant s on [t_0, t_n] could also be characterized as the C²-interpolant with
  minimal bending energy ∫_{t_0}^{t_n} |s″(t)|² dt, see Thm. 5.4.3.3.


This motivates augmenting the fitting problem with another term depending on inner product norms of
derivatives of the fitting functions.
Concretely, given data points (t_i, y_i) ∈ R², i = 0, . . . , m, t_i ∈ [a, b], and seeking the fitting function f in
the n-dimensional space

    V_n := Span{b_1, . . . , b_n} ⊂ C²([a, b]) ,

we determine it as

    f ∈ argmin_{g∈V_n} { ∑_{i=0}^{m} |g(t_i) − y_i|² + α ∫_a^b |g″(t)|² dt } ,   (5.7.0.27)

with a regularization parameter α > 0. Since ∫_a^b |g″(t)|² dt will be large if g oscillates wildly, for large α the
extra regularization term will serve to suppress oscillations.
Plugging a basis expansion g = ∑_{j=1}^{n} x_j b_j into (5.7.0.27), we arrive at the quadratic minimization problem

    ‖Ax − y‖²₂ + α x^⊤ M x → min ,   (5.7.0.28)

    A := [ b_j(t_i) ]_{i=0,...,m; j=1,...,n} ∈ R^{m+1,n} ,   M := [ ∫_a^b b_j″(t) b_k″(t) dt ]_{j,k=1}^{n} ∈ R^{n,n} .   (5.7.0.29)

For this minimization problem we can find a linear system of equations for minimizers by setting
grad φ(x) = 0 for φ(x) := ‖Ax − y‖²₂ + α x^⊤ M x. As in § 3.6.1.4 we find the generalized normal equations

    grad φ(x) = 2A^⊤ A x − 2A^⊤ y + 2α M x = 0
    ⇕
    (A^⊤ A + α M) x = A^⊤ y .   (5.7.0.30)

This is an n × n linear system of equations with a symmetric positive semi-definite coefficient matrix. y
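As a concrete illustration, the following sketch (function name ours) solves the generalized normal equations (5.7.0.30) for given A and M; how M is assembled depends on the chosen basis (for splines its entries can be evaluated analytically), so here both matrices are simply taken as inputs.

    #include <Eigen/Dense>

    // Solve the generalized normal equations (5.7.0.30) of the regularized
    // least squares fitting problem (5.7.0.28):  (A^T A + alpha*M) x = A^T y .
    // A     : (m+1) x n matrix of basis function values, see (5.7.0.29)
    // M     : n x n matrix of inner products of second derivatives, see (5.7.0.29)
    // y     : vector of data values y_0, ..., y_m
    // alpha : regularization parameter > 0
    Eigen::VectorXd regularizedFit(const Eigen::MatrixXd &A, const Eigen::MatrixXd &M,
                                   const Eigen::VectorXd &y, double alpha) {
      const Eigen::MatrixXd S = A.transpose() * A + alpha * M;  // symmetric p.s.d.
      // LDL^T factorization suits the symmetric positive semi-definite matrix S
      return S.ldlt().solve(A.transpose() * y);
    }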
Review question(s) 5.7.0.31 (Least squares data fitting)
(Q5.7.0.31.A) Given data points (t_i, y_i) ∈ R², i = 0, . . . , m, t_i ∈ I ⊂ R, let {b_1, . . . , b_n} ⊂ C0(I)
be a basis of the space V of admissible fitting functions. A least squares fit can be computed as

    f = ∑_{j=1}^{n} (x)_j b_j   where   x ∈ argmin_{z∈R^n} ∑_{i=0}^{m} | ∑_{j=1}^{n} (z)_j b_j(t_i) − y_i |² .

Show that f does not depend on the choice of basis for V.

(Q5.7.0.31.B) Given n + 1 data points (t_i, y_i) ∈ R², t_0 < t_1 < · · · < t_n, we can define a regularized
piecewise linear fit as

    f ∈ argmin_{g∈S_{1,M}} { ∑_{i=0}^{n} |g(t_i) − y_i|² + ∑_{i=1}^{n} ∫_{t_{i−1}}^{t_i} |g′(t)|² dt } ,

where the knot set M is given by the node set: M := {t0 < t1 < · · · < tn }. Using the “cardinal basis”
of S1,M comprised of “tent functions”, derive the linear system that has to be solved to find f .


Learning outcomes
After you have studied this chapter you should
• understand the use of basis functions for representing functions on a computer,
• know the concept of an interpolation operator and what its linearity means,
• know the connection between linear interpolation operators and linear systems of equations,
• be familiar with efficient algorithms for polynomial interpolation in different settings,
• know the meaning and significance of “sensitivity” in the context of interpolation,
• be familiar with the notions of “shape preservation” for an interpolation scheme and its different
aspects (monotonicity, curvature),
• know the details of cubic Hermite interpolation and how to ensure that it is monotonicity preserving.
• know what splines are and how cubic spline interpolation with different endpoint constraints works.

Bibliography

[Aki70] H. Akima. “A New Method of Interpolation and Smooth Curve Fitting Based on Local Proce-
dures”. In: J. ACM 17.4 (1970), pp. 589–602.
[AG11] Uri M. Ascher and Chen Greif. A first course in numerical methods. Vol. 7. Computational
Science & Engineering. Society for Industrial and Applied Mathematics (SIAM), Philadelphia,
PA, 2011, pp. xxii+552. DOI: 10.1137/1.9780898719987 (cit. on p. 389).
[Bei+16] Lourenço Beirão da Veiga, Annalisa Buffa, Giancarlo Sangalli, and Rafael Vázquez. “An intro-
duction to the numerical analysis of isogeometric methods”. In: Numerical simulation in physics
and engineering. Vol. 9. SEMA SIMAI Springer Ser. Springer, [Cham], 2016, pp. 3–69.
[BT04] Jean-Paul Berrut and Lloyd N. Trefethen. “Barycentric Lagrange interpolation”. In: SIAM Rev.
46.3 (2004), 501–517 (electronic). DOI: 10.1137/S0036144502417715.
[CR92] D. Coppersmith and T.J. Rivlin. “The growth of polynomials bounded at equally spaced points”.
In: SIAM J. Math. Anal. 23.4 (1992), pp. 970–983 (cit. on p. 412).
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 388–391, 396, 403, 407–409, 425, 451, 460).
[FC80] F.N. Fritsch and R.E. Carlson. “Monotone Piecewise Cubic Interpolation”. In: SIAM J. Numer.
Anal. 17.2 (1980), pp. 238–246 (cit. on p. 423).
[Gou05] E. Gourgoulhon. An introduction to polynomial interpolation. Slides for School on spectral
methods. 2005.
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 426, 430).
[Kva14] Boris I. Kvasov. “Monotone and convex interpolation by weighted quadratic splines”. In: Adv.
Comput. Math. 40.1 (2014), pp. 91–116. DOI: 10.1007/s10444-013-9300-9.
[MR81] D. McAllister and J. Roulier. “An algorithm for computing a shape-preserving osculatory
quadratic spline”. In: ACM Trans. Math. Software 7.3 (1981), pp. 331–347 (cit. on pp. 436,
439).
[Moh15] A. Mohsen. “Correcting the function derivative estimation using Lagrangian interpolation”. In:
ZAMP (2015).
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
p. 387).
[QQY01] A.L. Dontchev, H.-D. Qi, L.-Q. Qi, and H.-X. Yin. “Convergence of Newton’s method for
convex best interpolation”. In: Numer. Math. 87.3 (2001), pp. 435–456 (cit. on p. 436).
[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Math-
ematics. New York: Springer, 2000 (cit. on pp. 389–392, 403, 409, 425, 426, 431).
[Sch83] Larry L. Schumaker. “On shape preserving quadratic spline interpolation”. In: SIAM J. Numer.
Anal. 20.4 (1983), pp. 854–864. DOI: 10.1137/0720057 (cit. on p. 435).
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on pp. 414, 415).
[Tre13] Lloyd N. Trefethen. Approximation theory and approximation practice. Society for Industrial
and Applied Mathematics (SIAM), Philadelphia, PA, 2013, viii+305 pp.+back matter (cit. on
p. 394).
[Tre14] N. Trefethen. SIX MYTHS OF POLYNOMIAL INTERPOLATION AND QUADRATURE. Slides.
2014.


[WX12] Haiyong Wang and Shuhuang Xiang. “On the convergence rates of Legendre approximation”. In:
Math. Comp. 81.278 (2012), pp. 861–877. DOI: 10.1090/S0025-5718-2011-02549-4.



Chapter 6

Approximation of Functions in 1D

6.1 Introduction
Video tutorial for Section 6.1 "Approximation of Functions in 1D: Introduction": (7 minutes)
Download link, tablet notes

In Chapter 5 we aimed to fill the gaps between given data points by constructing a function connecting
them. In this chapter the function is available already, at least in principle, but its evaluation is too costly,
which forces us to replace it with a “cheaper” or “simpler” function, sometimes called a surrogate function.
§6.1.0.1 (General function approximation problem)

Approximation of functions: Generic view

Given: a function f : D ⊂ R n 7→ R d , n, d ∈ N, often in procedural form, e.g. for n = d = 1, as


double f(double) (→ Rem. 5.1.0.9).
Goal: Find a “simple” function f̃ : D ↦ R^d such that the approximation error f − f̃ is “small”.

Let us clarify the meaning of some terms:

Made more precise: “simple”

The function f̃ can be encoded by a small amount of information and is easy to evaluate. For instance,
this is the case for polynomial or piecewise polynomial f̃.

Made more precise: “small” approximation error

f − f̃ is small for some norm ‖·‖ on the space C0(D) of (piecewise) continuous functions.

The most commonly used norms are


✦ the supremum norm ‖g‖_∞ := ‖g‖_{L∞(D)} := max_{x∈D} |g(x)|, see (5.2.4.5).
  If the approximation error is small with respect to the supremum norm, f̃ is also called a good uniform
  approximant of f.
✦ the L²-norm ‖g‖²₂ := ‖g‖²_{L²(D)} = ∫_D |g(x)|² dx, see (5.2.4.6).

Below we consider only the case n = d = 1: approximation of scalar-valued functions defined on an
interval. The techniques can be applied componentwise in order to cope with the case of vector-valued
functions (d > 1). y

EXAMPLE 6.1.0.3 (Model reduction in circuit simulation)

The non-linear circuit sketched in Fig. 201 (✄| stands for a diode, a circuit element whose resistance
strongly depends on the voltage) has two ports; the figure labels the port voltage U and the port current I.
For the sake of circuit simulation it should be replaced by a non-linear lumped circuit element characterized
by a single voltage-current constitutive relationship I = I(U). For any value of the voltage U the current
I can be computed by solving a (large) non-linear system of equations, see Ex. 8.1.0.1.

A faster alternative is the advance approximation of the function U ↦ I(U) based on a few computed
values I(U_i), i = 0, . . . , n, followed by the fast evaluation of the approximant U ↦ Ĩ(U) during actual
circuit simulations. This is an example of model reduction by approximation of functions: a complex
subsystem in a mathematical model is replaced by a surrogate function.

In this example we also encounter a typical situation: we have nothing at our disposal but, possibly
expensive, point evaluations of the function U ↦ I(U) (U ↦ I(U) in “procedural form”, see Rem. 5.1.0.9).
The number of evaluations of I(U) will largely determine the cost of building Ĩ.

This application displays a fundamental difference compared to the reconstruction of constitutive relation-
ships from a priori measurements → Ex. 5.1.0.8: Now we are free to choose the number and location
of the data points, because we can simply evaluate the function U 7→ I (U ) for any U and as often as
needed.

C++ code 6.1.0.4: Class describing a 2-port circuit element for circuit simulation

class CircuitElement {
 private:
  // internal data describing U ↦ Ĩ(U)
 public:
  // Constructor taking some parameters and building Ĩ
  CircuitElement(const Parameters &P);
  // Point evaluation operators for Ĩ and dĨ/dU
  double I(double U) const;
  double dIdU(double U) const;
};

§6.1.0.5 (Approximation schemes) We define an abstract concept for the sake of clarity: When in this
chapter we talk about an “approximation scheme” (in 1D) we refer to a mapping A : X 7→ V , where X and
V are spaces of functions I 7→ K, I ⊂ R an interval.

Examples are
• X = C^k(I), the spaces of functions I ↦ K that are k times continuously differentiable, k ∈ N.
• V = P_m(I), the space of polynomials of degree ≤ m, see Section 5.2.1

6. Approximation of Functions in 1D, 6.1. Introduction 469


NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024

• V = S_{d,M}, the space of splines of degree d on the knot set M ⊂ I, see Def. 5.4.1.1.
• V = P_{2n}^T, the space of trigonometric polynomials of degree 2n, see Def. 5.6.1.1.
y

§6.1.0.6 (Approximation by interpolation) In Chapter 5 we discussed ways to construct functions whose graph runs through given data points, see Section 5.1. We can hope that the interpolant will approximate the function, if the data points are also located on the graph of that function. Thus every interpolation scheme, see § 5.1.0.7, spawns a corresponding approximation scheme.

Interpolation scheme + sampling → approximation scheme

    f : I ⊂ R → K   —(sampling, free choice of nodes t_i ∈ I)→   (t_i, y_i := f(t_i))_{i=0}^{m}   —(interpolation)→   f̃ := I_T y   ( f̃(t_i) = y_i ) .

In this chapter we will mainly study approximation by interpolation relying on the interpolation schemes
(→ § 5.1.0.7) introduced in Section 5.2, Section 5.3.3, and Section 5.4.

There is additional freedom compared to data interpolation: we can choose the interpolation nodes in a smart way in order to obtain an accurate interpolant f̃.
y
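To make the scheme above concrete, here is a minimal sketch (standard C++ only, not taken from the course codes) of how sampling plus interpolation yields an approximation scheme; the barycentric form of the Lagrange interpolant is used purely for convenience of evaluation.

C++ code (illustrative sketch): approximation scheme induced by Lagrange interpolation

#include <cstddef>
#include <functional>
#include <vector>

// Sample f at freely chosen nodes t_0,...,t_n and return the Lagrange interpolant
// f_tilde as a callable; evaluation uses the barycentric interpolation formula.
std::function<double(double)>
approxByInterpolation(const std::function<double(double)> &f, const std::vector<double> &t) {
  const std::size_t n = t.size();
  std::vector<double> y(n), lambda(n, 1.0); // data values and barycentric weights
  for (std::size_t i = 0; i < n; ++i) {
    y[i] = f(t[i]); // sampling step: the only information about f that is used
    for (std::size_t j = 0; j < n; ++j)
      if (j != i) lambda[i] /= (t[i] - t[j]);
  }
  return [t, y, lambda](double x) {
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < t.size(); ++i) {
      if (x == t[i]) return y[i]; // x coincides with an interpolation node
      const double q = lambda[i] / (x - t[i]);
      num += q * y[i];
      den += q;
    }
    return num / den;
  };
}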

Remark 6.1.0.8 (Interpolation and approximation: enabling technologies) Approximation and interpo-
lation (→ Chapter 5) are key components of many numerical methods, like for integration, differentiation
and computation of the solutions of differential equations, as well as for computer graphics and generation
of smooth curves and surfaces.

This chapter is a “foundations” part of the course


y
Contents
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
6.2 Approximation by Global Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . 471
6.2.1 Polynomial Approximation: Theory . . . . . . . . . . . . . . . . . . . . . . . 472
6.2.2 Error Estimates for Polynomial Interpolation . . . . . . . . . . . . . . . . . . 478
6.2.3 Chebychev Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
6.3 Mean Square Best Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
6.3.1 Abstract Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
6.3.2 Polynomial Mean Square Best Approximation . . . . . . . . . . . . . . . . . 515
6.4 Uniform Best Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
6.5 Approximation by Trigonometric Polynomials . . . . . . . . . . . . . . . . . . . . 525
6.5.1 Approximation by Trigonometric Interpolation . . . . . . . . . . . . . . . . 526
6.5.2 Trigonometric Interpolation Error Estimates . . . . . . . . . . . . . . . . . . 527
6.5.3 Trigonometric Interpolation of Analytic Periodic Functions . . . . . . . . . . 534
6.6 Approximation by Piecewise Polynomials . . . . . . . . . . . . . . . . . . . . . . . 540
6.6.1 Piecewise Polynomial Lagrange Interpolation . . . . . . . . . . . . . . . . . 541
6.6.2 Cubic Hermite Interpolation: Error Estimates . . . . . . . . . . . . . . . . . . 544
6.6.3 Cubic Spline Interpolation: Error Estimates [Han02, Ch. 47] . . . . . . . . . 546


Review question(s) 6.1.0.9 (Approximation of Functions in 1D: Introduction)


(Q6.1.0.9.A) Describe an interpolation-based approximation scheme A : C0 ( I ) → C0 ( I ) for the approx-
imation of functions in C0 ( I ), I ⊂ R a closed interval, such that
• A( f ) is increasing whenever f is increasing,
• for convex f also A( f ) is convex.
(Q6.1.0.9.B) Let A : C0 ( I ) → C0 ( I ), I ⊂ R a closed interval, be an approximation scheme. How can
you ensure that

    f ≢ 0  ⇒  A(f) ≢ 0 ?

6.2 Approximation by Global Polynomials

Video tutorial for Section 6.2 "Polynomial Approximation: Theory": (13 minutes)
Download link, tablet notes

The space Pk of polynomials of degree ≤ k has been introduced in Section 5.2.1. For reasons listed
in § 5.2.1.3 polynomials are the most important theoretical and practical tool for the approximation of
functions. The next example presents an important case of approximation by polynomials.

EXAMPLE 6.2.0.1 (Taylor approximation → [Str09, Sect. 5.5]) The local approximation of sufficiently
smooth functions by polynomials is a key idea in calculus, which manifests itself in the importance of
approximation by Taylor polynomials: For f ∈ C k ( I ), k ∈ N, I ⊂ R an interval, we approximate

    f(t) ≈ f(t_0) + f′(t_0)(t − t_0) + (f^{(2)}(t_0)/2)·(t − t_0)² + · · · + (f^{(k)}(t_0)/k!)·(t − t_0)^k =: T_k(t) ∈ P_k ,  for some t_0 ∈ I .

✎ Notation: f^{(k)} ≙ k-th derivative of function f : I ⊂ R → K

The Taylor polynomial Tk of degree k approximates f in a neighbourhood J ⊂ I of t0 ( J can be small!);


it supplies a local approximation of f . The approximation error can be expressed through remainder
formulas [Str09, Bem. 5.5.1]

    f(t) − T_k(t) = ∫_{t_0}^{t} f^{(k+1)}(τ) ((t − τ)^k / k!) dτ    (6.2.0.2a)
                  = f^{(k+1)}(ξ) (t − t_0)^{k+1} / (k + 1)! ,   ξ = ξ(t, t_0) ∈ ]min(t, t_0), max(t, t_0)[ ,    (6.2.0.2b)

which shows that for f ∈ C k+1 ( I ) the Taylor polynomial Tk is pointwise close to f ∈ C k+1 ( I ), if the
interval I is small and f (k+1) is bounded pointwise.

Approximation by Taylor polynomials is easy and direct but inefficient: a polynomial of lower degree often
gives the same accuracy. Moreover, when f is available only in procedural form as double f(double),
(approximations of) higher order derivatives are difficult to obtain. y
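A tiny numerical illustration of Taylor approximation (the choices f = exp, t_0 = 0, and the sampling of the error in 1001 points are assumptions made only for this sketch): by (6.2.0.2b) the sup-norm of the remainder on [−1, 1] should decay like 1/(k + 1)!.

C++ code (illustrative sketch): error of Taylor polynomial approximation of exp

#include <algorithm>
#include <cmath>
#include <iostream>

int main() {
  for (int k = 1; k <= 10; ++k) {
    double maxerr = 0.0;
    for (int i = 0; i <= 1000; ++i) {
      const double t = -1.0 + 2.0 * i / 1000;
      double Tk = 0.0, term = 1.0; // term = t^j / j!
      for (int j = 0; j <= k; ++j) { Tk += term; term *= t / (j + 1); }
      maxerr = std::max(maxerr, std::abs(std::exp(t) - Tk));
    }
    std::cout << "k = " << k << ", approximate sup-norm error = " << maxerr << "\n";
  }
}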


§6.2.0.3 (Nested approximation spaces of polynomials) Obviously, for every interval I ⊂ R, the
spaces of polynomials are nested in the following sense:

P0 ⊂ P1 ⊂ · · · ⊂ P m ⊂ P m +1 ⊂ · · · ⊂ C ∞ ( I ) , (6.2.0.4)

with finite, but increasing dimensions dim Pm = m + 1 according to Thm. 5.2.1.2.

With this family of nested spaces of polynomials at our disposal, it is natural to study associated families
of approximation schemes, one for each degree, mapping into Pm , m ∈ N0 . y

6.2.1 Polynomial Approximation: Theory


§6.2.1.1 (Scope of polynomial approximation) Sloppily speaking, according to (6.2.0.2b) the Taylor
polynomials from Ex. 6.2.0.1 provide uniform (→ § 6.1.0.1) approximation of a smooth function f in (small)
intervals, provided that its derivatives do not blow up “too fast” (We do not want to make this precise here).

The question is, whether polynomials still offer uniform approximation on arbitrary bounded closed inter-
vals and for functions that are merely continuous, but not any smoother. The answer is YES and this
profound result is known as the Weierstrass Approximation Theorem. Here we give an extended version
with a concrete formula due to Bernstein, see [Dav75, Section 6.2].

Theorem 6.2.1.2. Uniform approximation by polynomials

For f ∈ C⁰([0, 1]), define the n-th Bernstein approximant as

    p_n(t) = ∑_{j=0}^{n} f(j/n) \binom{n}{j} t^j (1 − t)^{n−j} ,   p_n ∈ P_n .    (6.2.1.3)

It satisfies ‖f − p_n‖_∞ → 0 for n → ∞. If f ∈ C^m([0, 1]), then even ‖f^{(k)} − p_n^{(k)}‖_∞ → 0 for n → ∞ and all 0 ≤ k ≤ m.

✎ Notation: g^{(k)} ≙ k-th derivative of a function g : I ⊂ R → K.

In (6.2.1.3) the function f is approximated by a linear combination of Bernstein polynomials of degree n:

    B_j^n(t) := \binom{n}{j} t^j (1 − t)^{n−j} ,   B_j^n ∈ P_n .    (6.2.1.4)

Fig. 202 shows plots of the Bernstein polynomials B_j^n, j = 0, . . . , 7, of degree n = 7 on [0, 1].

Bernstein polynomials provide a positive partition of unity over [0, 1]:

    ∑_{j=0}^{n} B_j^n(t) = 1   ∀ t ∈ R ,    (6.2.1.5)
    0 ≤ B_j^n(t) ≤ 1   ∀ 0 ≤ t ≤ 1 .    (6.2.1.6)


Since (d/dt) B_j^n(t) = B_j^n(t) · ( j/t − (n − j)/(1 − t) ), B_j^n has its unique local maximum in [0, 1] at the site t_max := j/n. As n → ∞ the Bernstein polynomials become more and more concentrated around the maximum.

Fig. 203: selected Bernstein polynomials B_j^n with j ≈ 0.4·n for degrees n = 2, 5, 8, . . . , 29

Proof. (of Thm. 6.2.1.2, first part) Fix t ∈ [0, 1]. Using the notations from (6.2.1.3) and the identity (6.2.1.5)
we find
    f(t) − p_n(t) = ∑_{j=0}^{n} ( f(t) − f(j/n) ) B_j^n(t) .    (6.2.1.7)

As we see from Fig. 203, for large n the bulk of the sum will be contributed by Bernstein polynomials with index j/n ≈ t, because for every δ > 0

    ∑_{|j/n−t|>δ} B_j^n(t) ≤ (1/δ²) ∑_{|j/n−t|>δ} (j/n − t)² B_j^n(t) ≤ (1/δ²) ∑_{j=0}^{n} (j/n − t)² B_j^n(t) =(∗) t(1 − t)/(δ² n) ≤ 1/(4nδ²) .

∑_{|j/n−t|>δ} means summation over j ∈ N_0 with summation indices confined to the set {j : |j/n − t| > δ}.
The identity (∗) can be established by direct but tedious computations [Dav75, Formulas (6.2.8)].
Combining this estimate with (6.2.1.6) and (6.2.1.7) we arrive at

    |f(t) − p_n(t)| ≤ (1/(4nδ²)) ∑_{|j/n−t|>δ} |f(t) − f(j/n)| + ∑_{|j/n−t|≤δ} |f(t) − f(j/n)| .
| j/n−t|>δ | j/n−t|≤δ

Since f is uniformly continuous on [0, 1], given ǫ > 0 we can choose δ > 0 independently of t such that |f(s) − f(t)| < ǫ, if |s − t| < δ. Then, if we choose n > (ǫδ²)^{−1}, we can bound

    |f(t) − p_n(t)| ≤ (‖f‖_∞ + 1) ǫ   ∀ t ∈ [0, 1] .

This means that pn is arbitrarily close to f for sufficiently large n.


✷ y

EXPERIMENT 6.2.1.8 (Bernstein approximants) We compute and plot pn , n = 1, . . . , 25, for two
functions
    f_1(t) := { 0, if |2t − 1| > ½ ;   ½ (1 + cos(2π(2t − 1))), else } ,      f_2(t) := 1/(1 + e^{−12(t − 1/2)}) .

The following plots display the sequences of the polynomials pn for n = 2, . . . , 25.


Fig. 204: Bernstein approximants p_n, n = 2, . . . , 25, together with the function f = f_1.   Fig. 205: the same for f = f_2.
We see that the Bernstein approximants “slowly” edge closer and closer to f . Apparently it takes a very
large degree to get really close to f . y
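For experiments of this kind one has to evaluate the Bernstein approximant p_n from (6.2.1.3). Below is a minimal sketch (not the implementation used for the plots above) that generates the values B_j^n(t) by de-Casteljau-style degree raising and thus avoids large binomial coefficients.

C++ code (illustrative sketch): evaluation of the Bernstein approximant p_n(t)

#include <functional>
#include <vector>

// Evaluate p_n(t) = sum_{j=0}^{n} f(j/n) * B_j^n(t) for t in [0,1]; requires n >= 1.
double bernsteinApprox(const std::function<double(double)> &f, unsigned int n, double t) {
  std::vector<double> B(n + 1, 0.0);
  B[0] = 1.0; // B_0^0(t) = 1
  for (unsigned int deg = 1; deg <= n; ++deg) { // raise the degree step by step
    for (unsigned int j = deg; j >= 1; --j)
      B[j] = (1.0 - t) * B[j] + t * B[j - 1]; // B_j^deg = (1-t) B_j^{deg-1} + t B_{j-1}^{deg-1}
    B[0] *= (1.0 - t);
  }
  double s = 0.0;
  for (unsigned int j = 0; j <= n; ++j) s += f(static_cast<double>(j) / n) * B[j];
  return s;
}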

§6.2.1.9 (Best approximation) Now we introduce a concept needed to gauge how close an approximation
scheme gets to the best possible performance.

Definition 6.2.1.10. (Size of) best approximation error

Let ‖·‖ be a (semi-)norm on a space X of functions I ↦ K, I ⊂ R an interval. The (size of the) best approximation error of f ∈ X in the space P_k of polynomials of degree ≤ k with respect to ‖·‖ is

    dist_{‖·‖}(f, P_k) := inf_{p∈P_k} ‖f − p‖ .

The notation distk·k is motivated by the notation of “distance” as distance to the nearest point in a set.

For the L²-norm ‖·‖_2 and the supremum norm ‖·‖_∞ the best approximation error is well defined for X = C⁰(I).

The polynomial realizing best approximation w.r.t. k·k may neither be unique nor computable with reason-
able effort. Often one is content with rather sharp upper bounds like those asserted in the next theorem,
due to Jackson [Dav75, Thm. 13.3.7].

Theorem 6.2.1.11. L∞ polynomial best approximation estimate

If f ∈ C^r([−1, 1]) (r times continuously differentiable), r ∈ N, then, for any polynomial degree n ≥ r,

    inf_{p∈P_n} ‖f − p‖_{L∞([−1,1])} ≤ (1 + π²/2)^r · ((n − r)!/n!) · ‖f^{(r)}‖_{L∞([−1,1])} ,

where ‖f^{(r)}‖_{L∞([−1,1])} := max_{x∈[−1,1]} |f^{(r)}(x)|.


As above, f^{(r)} stands for the r-th derivative of f. Using Stirling’s formula

    √(2π) n^{n+1/2} e^{−n} ≤ n! ≤ e n^{n+1/2} e^{−n}   ∀ n ∈ N ,    (6.2.1.12)

we can get a looser bound of the form

    inf_{p∈P_n} ‖f − p‖_{L∞([−1,1])} ≤ C(r) n^{−r} ‖f^{(r)}‖_{L∞([−1,1])} ,    (6.2.1.13)

with C(r) dependent on r, but independent of f and, in particular, the polynomial degree n. Using the Landau symbol from Def. 1.4.1.2 we can rewrite the statement of (6.2.1.13) in asymptotic form

    (6.2.1.13)  ⇒  inf_{p∈P_n} ‖f − p‖_{L∞([−1,1])} = O(n^{−r})  for n → ∞ .

§6.2.1.14 (Transformation of polynomial approximation schemes) What if a polynomial approximation


scheme is defined only on a special interval, say [−1, 1]. Then by the following trick it can be transferred
to any interval [ a, b] ⊂ R.

Assume that an interval [a, b] ⊂ R, a < b, and a polynomial approximation scheme Â : C⁰([−1, 1]) → P_n are given. Based on the affine linear mapping

    Φ : [−1, 1] → [a, b] ,   Φ(t̂) := a + ½ (t̂ + 1)(b − a) ,   −1 ≤ t̂ ≤ 1 ,    (6.2.1.15)

we can introduce the affine pullback of functions:

    Φ∗ : C⁰([a, b]) → C⁰([−1, 1]) ,   Φ∗(f)(t̂) := (f ∘ Φ)(t̂) := f(Φ(t̂)) ,   −1 ≤ t̂ ≤ 1 .    (6.2.1.16)

Fig. 206: Φ∗f “lives” on [−1, 1], f “lives” on [a, b];   t̂ ↦ t := Φ(t̂) := ½(1 − t̂)a + ½(t̂ + 1)b
We add the important observation that affine pullbacks are linear and bijective: they are isomorphisms of the involved vector spaces of functions (what is the inverse?).

Lemma 6.2.1.17. Affine pullbacks preserve polynomials

If Φ∗ : C0 ([ a, b]) → C0 ([−1, 1]) is an affine pullback according to (6.2.1.15) and (6.2.1.16), then
Φ∗ : Pn → Pn is a bijective linear mapping for any n ∈ N0 .

Proof. This is a consequence of the fact that translations and dilations take polynomials to polynomials of
the same degree: for monomials we find

    Φ∗{t ↦ t^n} = {t̂ ↦ (a + ½(t̂ + 1)(b − a))^n} ∈ P_n .


The lemma tells us that the spaces of polynomials of some maximal degree are invariant under affine
pullback. Thus, we can define a polynomial approximation scheme A on C0 ([ a, b]) by

    A : C⁰([a, b]) → P_n ,   A := (Φ∗)^{−1} ∘ Â ∘ Φ∗ ,    (6.2.1.18)

whenever Â is a polynomial approximation scheme on [−1, 1]. y
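A minimal functional-style sketch of the construction (6.2.1.18); modeling an approximation scheme simply as a mapping between callables, together with the type aliases below, is an assumption made only for this illustration.

C++ code (illustrative sketch): transporting an approximation scheme from [−1, 1] to [a, b]

#include <functional>

using RealFn = std::function<double(double)>;
using ApproxScheme = std::function<RealFn(const RealFn &)>; // maps C^0 functions to C^0 functions

// Given a scheme A_hat acting on functions on [-1,1], build the transported scheme
// A := (Phi^*)^{-1} o A_hat o Phi^* acting on functions on [a,b], cf. (6.2.1.18).
ApproxScheme transportScheme(const ApproxScheme &A_hat, double a, double b) {
  return [A_hat, a, b](const RealFn &f) -> RealFn {
    // affine pullback Phi^* f, defined on [-1,1]
    RealFn pullback = [f, a, b](double s) { return f(a + 0.5 * (s + 1.0) * (b - a)); };
    RealFn g_hat = A_hat(pullback); // approximate on the reference interval [-1,1]
    // push forward with Phi^{-1}(t) = 2(t-a)/(b-a) - 1
    return [g_hat, a, b](double t) { return g_hat(2.0 * (t - a) / (b - a) - 1.0); };
  };
}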

§6.2.1.19 (Transforming approximation error estimates) Thm. 6.2.1.11 targets only the special interval
[−1, 1]. What does it imply for polynomial best approximation on a general interval [ a, b]? To answer this
question we apply techniques from § 6.2.1.14, in particular the pullback (6.2.1.16).

We first have to study the change of norms of functions under the action of affine pullbacks:

Lemma 6.2.1.20. Transformation of norms under affine pullbacks

For every f ∈ C⁰([a, b]) we have

    ‖f‖_{L∞([a,b])} = ‖Φ∗f‖_{L∞([−1,1])} ,    ‖f‖_{L²([a,b])} = √(|b − a|/2) · ‖Φ∗f‖_{L²([−1,1])} .    (6.2.1.21)

Proof. The first estimate should be evident, and the second is a consequence of the transformation formula for integrals [Str09, Satz 6.1.5]

    (b − a)/2 · ∫_{−1}^{1} (Φ∗f)(t̂) dt̂ = ∫_{a}^{b} f(t) dt ,    (6.2.1.22)

and the definition of the L²-norm from (5.2.4.6).


Thus, for norms of the approximation errors of polynomial approximation schemes defined by affine transformation (6.2.1.18) we get, for all f ∈ C⁰([a, b]),

    ‖f − A f‖_{L∞([a,b])} = ‖Φ∗f − Â(Φ∗f)‖_{L∞([−1,1])} ,
    ‖f − A f‖_{L²([a,b])} = √(|b − a|/2) · ‖Φ∗f − Â(Φ∗f)‖_{L²([−1,1])} .    (6.2.1.23)

Equipped with approximation error estimates for Â, we can infer corresponding estimates for A.

The bounds for approximation errors often involve norms of derivatives as in Thm. 6.2.1.11. Hence, it is important to understand the interplay of pullback and differentiation. By the 1D chain rule

    (d/dt̂)(Φ∗f)(t̂) = (df/dt)(Φ(t̂)) · (dΦ/dt̂)(t̂) = (df/dt)(Φ(t̂)) · ½(b − a) ,

which implies a simple scaling rule for derivatives of arbitrary order r ∈ N_0:

    (Φ∗f)^{(r)} = ((b − a)/2)^r · Φ∗(f^{(r)}) .    (6.2.1.24)

Together with Lemma 6.2.1.20 this yields

    ‖(Φ∗f)^{(r)}‖_{L∞([−1,1])} = ((b − a)/2)^r · ‖f^{(r)}‖_{L∞([a,b])} ,   f ∈ C^r([a, b]), r ∈ N_0 .    (6.2.1.25)


§6.2.1.26 (Polynomial best approximation on general intervals) The estimate (6.2.1.24) together with
Thm. 6.2.1.11 paves the way for bounding the polynomial best approximation error on arbitrary intervals
[ a, b], a, b ∈ R. Based on the affine mapping Φ : [−1, 1] → [ a, b] from (6.2.1.15) and writing Φ∗ for the
pullback according to (6.2.1.16) we can chain estimates. If f ∈ Cr ([ a, b]) and n ≥ r, then

    inf_{p∈P_n} ‖f − p‖_{L∞([a,b])} =(∗) inf_{p∈P_n} ‖Φ∗f − p‖_{L∞([−1,1])}
        ≤ (1 + π²/2)^r ((n − r)!/n!) ‖(Φ∗f)^{(r)}‖_{L∞([−1,1])}        [Thm. 6.2.1.11]
        = (1 + π²/2)^r ((n − r)!/n!) ((b − a)/2)^r ‖f^{(r)}‖_{L∞([a,b])} .     [(6.2.1.24)]

In step (∗) we used the result of Lemma 6.2.1.17 that Φ∗p ∈ P_n for all p ∈ P_n. Invoking the arguments that gave us (6.2.1.13), we end up with the simpler bound

    f ∈ C^r([a, b])  ⇒  inf_{p∈P_n} ‖f − p‖_{L∞([a,b])} ≤ C(r) ((b − a)/n)^r ‖f^{(r)}‖_{L∞([a,b])} .    (6.2.1.27)

Observe that the length of the interval enters the bound in r-th power. y
Review question(s) 6.2.1.28 (Polynomial Approximation: Theory)
(Q6.2.1.28.A) What does Jackson’s theorem Thm. 6.2.1.11 tell us about the polynomial best approximation error

    inf_{p∈P_n} ‖f − p‖_{L∞([−1,1])}   for   f(t) := sin(t) ?

What does it predict for the case r = n?

Theorem 6.2.1.11. L∞ polynomial best approximation estimate

If f ∈ C^r([−1, 1]) (r times continuously differentiable), r ∈ N, then, for any polynomial degree n ≥ r,

    inf_{p∈P_n} ‖f − p‖_{L∞([−1,1])} ≤ (1 + π²/2)^r · ((n − r)!/n!) · ‖f^{(r)}‖_{L∞([−1,1])} ,

where ‖f^{(r)}‖_{L∞([−1,1])} := max_{x∈[−1,1]} |f^{(r)}(x)|.

(Q6.2.1.28.B) Consider the discontinuous step function on [−1, 1]:

    s(t) = 0  for −1 ≤ t < 0 ,    s(t) = 1  for 0 ≤ t ≤ 1 .

What can you say about inf_{p∈P_n} ‖s − p‖_{L∞([−1,1])} ?


6.2.2 Error Estimates for Polynomial Interpolation

Video tutorial for Section 6.2.2 "Error Estimates for Polynomial Interpolation": (12 minutes)
Download link, tablet notes

In Section 5.2.2, Cor. 5.2.2.8, we introduced the Lagrangian polynomial interpolation operator I_T : K^{n+1} → P_n belonging to a node set T = {t_j}_{j=0}^{n}. In the spirit of § 6.1.0.6 it induces an approximation scheme on C⁰(I), I ⊂ R an interval, if T ⊂ I.

Definition 6.2.2.1. Lagrangian (interpolation polynomial) approximation scheme

Given an interval I ⊂ R, n ∈ N, a node set T = {t0 , . . . , tn } ⊂ I , the Lagrangian (interpolation


polynomial) approximation scheme LT : C0 ( I ) → Pn is defined by

LT ( f ) : = IT ( y ) ∈ P n with y := [ f (t0 ), . . . , f (tn )] T ∈ R n+1 ,


IT (y)(t j ) = (y) j , j = 0, . . . , n .

Our goal in this section will be to estimate the norm of the interpolation error ‖f − I_T f‖ (for a relevant norm on C⁰(I)).

6.2.2.1 Convergence of Interpolation Errors

We start with an abstract discussion of “convergence of errors/error norms” in order to create awareness of what behaviors or phenomena we should be looking for and how we can detect them. You may read Section 6.2.2.2 first, if you prefer to see concrete cases and examples before adopting a higher-level perspective.

§6.2.2.2 (Families of Lagrangian interpolation polynomial approximation schemes) Already Thm. 6.2.1.11 considered the size of the best approximation error in P_n as a function of the polynomial degree n. In the same vein, we may study a family of Lagrange interpolation schemes {L_{T_n}}_{n∈N_0} on I ⊂ R induced by a family of node sets {T_n}_{n∈N_0}, T_n ⊂ I, according to Def. 6.2.2.1.

An example of such a family of node sets on I := [a, b] are the equidistant or equispaced nodes

    T_n := { t_j^{(n)} := a + (j/n)(b − a) : j = 0, . . . , n } ⊂ I .    (6.2.2.3)

For families of Lagrange interpolation schemes {L_{T_n}}_{n∈N_0} we can shift the focus onto estimating the asymptotic behavior of the norm of the interpolation error for n → ∞. y

EXPERIMENT 6.2.2.4 (Asymptotic behavior of Lagrange interpolation error) We perform polynomial interpolation of f(t) = sin t on equispaced nodes in I = [0, π]: T_n = {jπ/n}_{j=0}^{n}. Write p_n for the polynomial interpolants: p_n := L_{T_n} f ∈ P_n.


In the numerical experiment the norms of the interpolation errors can be computed only approximately as follows.
• L∞-norm: approximated by sampling on a grid of meshsize π/1000.
• L²-norm: numerical quadrature (→ Chapter 7) with the trapezoidal rule (7.5.0.4) on a grid of meshsize π/1000.

Fig. 207: approximate error norms ‖f − L_{T_n} f‖_∗, ∗ = 2, ∞, versus the polynomial degree n (semi-logarithmic plot). y

§6.2.2.5 (Classification of asymptotic behavior of norms of the interpolation error) In the previous
experiment we observed a clearly visible regular behavior of k f − LTn f k as we increased the polyno-
mial degree n. The prediction of the decay law for k f − LTn f k for n → ∞ is one goal in the study of
interpolation errors.

Often this goal can be achieved, even if a rigorous quantitative bound for a norm of the interpolation error
remains elusive. In other words, in many cases
No quantitative bound for ‖f − L_{T_n} f‖ can usually be given, but the decay of this norm of the interpolation error for increasing n can often be described precisely.

Now we introduce some important terminology for the qualitative description of the behavior of k f − LTn f k
as a function of the polynomial degree n. We assume that

    ∃ C ≠ C(n) > 0:   ‖f − L_{T_n} f‖ ≤ C·T(n)   for n → ∞ .    (6.2.2.6)

Definition 6.2.2.7. Types of asymptotic convergence of approximation schemes

Writing T (n) for the bound of the norm of the interpolation error according to (6.2.2.6) we distinguish
the following types of asymptotic behavior :

    ∃ p > 0:  T(n) ≤ n^{−p}  ∀ n ∈ N :   algebraic convergence, with rate p > 0 ,
    ∃ 0 < q < 1:  T(n) ≤ q^n  ∀ n ∈ N :   exponential convergence .

The bounds are assumed to be sharp in the sense that no bounds with larger rate p (for algebraic convergence) or smaller q (for exponential convergence) can be found.

Convergence behavior of norms of the interpolation error is often expressed by means of the Landau-O-notation, cf. Def. 1.4.1.2:
    algebraic convergence:   ‖f − I_T f‖ = O(n^{−p}) ,
    exponential convergence:   ‖f − I_T f‖ = O(q^n) ,
for n → ∞ (“asymptotic!”). y

Remark 6.2.2.8 (Different meanings of “convergence”) Unfortunately, as in many other fields of math-
ematics and beyond, also in numerical analysis the meaning of terms is context-dependent:


Beware: same concept ↔ different meanings:


• convergence of a sequence (e.g. of iterates x(k) → Section 8.2)
• convergence of an approximation (dependent on an approximation parameter, e.g. n)
y

§6.2.2.9 (Determining the type of convergence in numerical experiments → § 1.4.1.6) Given pairs (n_i, ǫ_i), i = 1, 2, 3, . . ., n_i ≙ polynomial degrees, ǫ_i ≙ (measured) norms of interpolation errors, how can we tease out the likely type of convergence according to Def. 6.2.2.7? A similar task was already encountered in § 1.4.1.6, where we had to extract information about asymptotic complexity from runtime measurements.

Assumption 6.2.2.10. Sharpness of error bounds

We assume that the error bound is (asymptotically) sharp:

    ∃ C ≠ C(n):   ǫ_i ≈ C·T(n_i)   ∀ i .

➊ Conjectured: algebraic convergence:  ǫ_i ≈ C n_i^{−p}

    log(ǫ_i) ≈ log(C) − p log n_i   (affine linear in log-log scale) .

The slope of a line approximating the points (log n_i, log ǫ_i) predicts the rate of algebraic convergence: apply linear regression as explained in Ex. 3.1.1.5 to the data points (log n_i, log ǫ_i) ➣ least squares estimate for the rate p.

➋ Conjectured: exponential convergence:  ǫ_i ≈ C exp(−β n_i)

    log ǫ_i ≈ log(C) − β n_i   (affine linear in lin-log scale) .

Apply linear regression (→ Ex. 3.1.1.5) to the points (n_i, log ǫ_i) ➣ estimate for q := exp(−β).

☞ Fig. 207: we suspect exponential convergence in Exp. 6.2.2.4. y
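Both fits boil down to simple linear regression. Below is a minimal sketch (an assumed helper, not from the course codes; Ex. 3.1.1.5 treats the same task via least squares): applied to the points (log n_i, log ǫ_i) the slope estimates −p, applied to (n_i, log ǫ_i) it estimates −β.

C++ code (illustrative sketch): estimating convergence rates by linear regression

#include <cstddef>
#include <utility>
#include <vector>

// Least-squares fit of a line y ~ alpha + beta * x through the points (x_i, y_i);
// returns the pair (alpha, beta) obtained from the 2x2 normal equations.
std::pair<double, double> fitLine(const std::vector<double> &x, const std::vector<double> &y) {
  const std::size_t m = x.size();
  double sx = 0, sy = 0, sxx = 0, sxy = 0;
  for (std::size_t i = 0; i < m; ++i) {
    sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
  }
  const double beta = (m * sxy - sx * sy) / (m * sxx - sx * sx);
  const double alpha = (sy - beta * sx) / m;
  return {alpha, beta};
}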

EXAMPLE 6.2.2.11 (Runge’s example → Ex. 5.2.4.3) We examine the polynomial interpolant of f(t) = 1/(1 + t²) for equispaced nodes:

    T_n := { t_j := −5 + (10/n)·j }_{j=0}^{n}   ➣   y_j = 1/(1 + t_j²) .

We rely on an approximate computation of the supremum norm of the interpolation error by means of
sampling as in Exp. 6.2.2.4; here we used 1000 equidistant sampling points, see Code 6.2.2.12.

C++ code 6.2.2.12: Computing the interpolation error for Runge’s example ➺ GITLAB

// Note: “quick & dirty” implementation!
// Lambda function representing x ↦ (1 + x²)^{−1}
auto f = [](double x) { return 1. / (1 + x * x); };
// 1000 sampling points for approximate maximum norm
const VectorXd x = VectorXd::LinSpaced(1000, -5, 5);
// Sample function
const VectorXd fx = x.unaryExpr(f); // evaluate f at x

std::vector<double> err; // Accumulate error norms here
for (int d = 1; d <= 20; ++d) {
  // Interpolation nodes
  const VectorXd t = Eigen::VectorXd::LinSpaced(d + 1, -5, 5);
  // Interpolation data values
  const VectorXd ft = feval(f, t);
  // Compute interpolating polynomial in monomial representation
  const VectorXd p = polyfit(t, ft, d);
  // Evaluate polynomial interpolant in sampling points
  const VectorXd y = polyval(p, x);
  // Approximate supremum norm of interpolation error
  err.push_back((y - fx).cwiseAbs().maxCoeff());
}

Here, polyfit() computes the monomial coefficients of a polynomial interpolant, while polyval() uses the vectorized Horner scheme of Code 5.2.1.7 to evaluate the polynomial in given points. The names of the functions are borrowed from PYTHON, see numpy.poly1d.
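The helpers feval(), polyfit(), and polyval() are not reproduced here. The following is merely a plausible sketch of what they might look like (a least-squares fit in the monomial basis and a vectorized Horner scheme, coefficients ordered by descending degree as in numpy.poly1d); it is not taken from the course codes and ignores the conditioning issues of the monomial basis.

C++ code (illustrative sketch): possible stand-ins for feval(), polyfit(), polyval()

#include <Eigen/Dense>
#include <utility>

template <class Function>
Eigen::VectorXd feval(Function &&f, const Eigen::VectorXd &t) {
  return t.unaryExpr(std::forward<Function>(f)); // sample f at the given points
}

// (Least-squares) fit of a polynomial of degree deg in the monomial basis.
Eigen::VectorXd polyfit(const Eigen::VectorXd &t, const Eigen::VectorXd &y, int deg) {
  Eigen::MatrixXd V(t.size(), deg + 1); // Vandermonde-type matrix
  for (Eigen::Index i = 0; i < t.size(); ++i) {
    double power = 1.0;
    for (int j = deg; j >= 0; --j) { V(i, j) = power; power *= t(i); }
  }
  return V.colPivHouseholderQr().solve(y); // monomial coefficients, leading one first
}

// Vectorized Horner scheme for coefficients in descending order of degree.
Eigen::VectorXd polyval(const Eigen::VectorXd &p, const Eigen::VectorXd &x) {
  Eigen::VectorXd y = Eigen::VectorXd::Constant(x.size(), p(0));
  for (Eigen::Index j = 1; j < p.size(); ++j) y = (y.array() * x.array() + p(j)).matrix();
  return y;
}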

Fig. 208: interpolating polynomial for n = 10 together with f(t) = 1/(1 + t²) on [−5, 5].   Fig. 209: approximate ‖f − L_{T_n} f‖_∞ on [−5, 5].

Note: approximation of ‖f − L_{T_n} f‖_∞ by sampling in 1000 equidistant points.

Observation: Strong oscillations of I_T f near the endpoints of the interval, which seem to cause

    ‖f − L_{T_n} f‖_{L∞(]−5,5[)} → ∞   for n → ∞ .

Though polynomials possess great power to approximate functions, see Thm. 6.2.1.11 and Thm. 6.2.1.2,
here polynomial interpolants fail completely. Approximation theorists even discovered the following “nega-
tive result”:

Theorem 6.2.2.13. Divergent polynomial interpolants

Given a sequence of meshes of increasing size {T_n}_{n=1}^{∞}, T_n = {t_0^{(n)}, . . . , t_n^{(n)}} ⊂ [a, b], a ≤ t_0^{(n)} < t_1^{(n)} < · · · < t_n^{(n)} ≤ b, there exists a continuous function f such that the sequence of interpolating polynomials (L_{T_n} f)_{n=1}^{∞} does not converge to f uniformly as n → ∞.

y
Review question(s) 6.2.2.14 (Convergence of interpolation errors)
(Q6.2.2.14.A) Assume that the interpolation error for some family (LTn )n∈N , ♯Tn = n + 1, of Lagrangian
polynomial interpolation schemes and some function f ∈ C0 ( I ), I ⊂ R, converges algebraically ac-


cording to

    ∃ p > 0, C ≠ C(n):   ‖f − L_{T_n} f‖_{L∞(I)} ≈ C n^{−p}   ∀ n ∈ N .

How do you have to raise the polynomial degree n in order to reduce the maximum norm of the interpo-
lation error approximately by a factor of 2?

Definition 6.2.2.7. Types of asymptotic convergence of approximation schemes

Writing T(n) for the bound of the norm of the interpolation error according to (6.2.2.6) we distinguish the following types of asymptotic behavior:

    ∃ p > 0:  T(n) ≤ n^{−p}  ∀ n ∈ N :   algebraic convergence, with rate p > 0 ,
    ∃ 0 < q < 1:  T(n) ≤ q^n  ∀ n ∈ N :   exponential convergence .

The bounds are assumed to be sharp in the sense that no bounds with larger rate p (for algebraic convergence) or smaller q (for exponential convergence) can be found.

(Q6.2.2.14.B) Let f ∈ C⁰(I), I ⊂ R, be given along with a family (L_{T_n})_{n∈N}, ♯T_n = n + 1, of Lagrangian polynomial interpolation schemes. You know that the maximum norm of the interpolation error for f enjoys exponential convergence of the form

    ∃ β > 0 , C ≠ C(n):   ‖f − L_{T_n} f‖_{L∞(I)} ≈ C exp(−β n)   ∀ n ∈ N .

How do you have to increase n in order to halve the maximum norm of the interpolation error?
(Q6.2.2.14.C) Discuss the statement:
Exponential convergence is always faster than algebraic convergence.

6.2.2.2 Interpolands of Finite Smoothness

Video tutorial for Section 6.2.2.2 "Error Estimates for Polynomial Interpolation: Interpolands
of Finite Smoothness": (17 minutes) Download link, tablet notes

Now we aim to establish bounds for the supremum norm of the interpolation error of Lagrangian interpo-
lation similar to the result of Jackson’s best approximation theorem.

Theorem 6.2.1.11. L∞ polynomial best approximation estimate

If f ∈ C^r([−1, 1]) (r times continuously differentiable), r ∈ N, then, for any polynomial degree n ≥ r,

    inf_{p∈P_n} ‖f − p‖_{L∞([−1,1])} ≤ (1 + π²/2)^r · ((n − r)!/n!) · ‖f^{(r)}‖_{L∞([−1,1])} ,

where ‖f^{(r)}‖_{L∞([−1,1])} := max_{x∈[−1,1]} |f^{(r)}(x)|.

It states a result for at least continuously differentiable functions, and its bound for the polynomial best approximation error involves norms of certain derivatives of the function f to be approximated. Thus some smoothness of f is required, but only a few derivatives need to exist. Therefore we say that Thm. 6.2.1.11 deals with functions of finite smoothness, that is, f ∈ C^k for some k ∈ N. In this section we aim to bound polynomial interpolation errors for such functions.

Theorem 6.2.2.15. Representation of interpolation error [DR08, Thm. 8.22], [Han02, Thm. 37.4]

We consider f ∈ C^{n+1}(I) and the Lagrangian interpolation approximation scheme (→ Def. 6.2.2.1) for a node set T := {t_0, . . . , t_n} ⊂ I. Then, for every t ∈ I there exists a τ_t ∈ ]min{t, t_0, . . . , t_n}, max{t, t_0, . . . , t_n}[ such that

    f(t) − L_T(f)(t) = (f^{(n+1)}(τ_t) / (n + 1)!) · ∏_{j=0}^{n} (t − t_j) .    (6.2.2.16)

Proof. Write w_T(t) := ∏_{j=0}^{n} (t − t_j) ∈ P_{n+1} and fix t ∈ I \ T.

    t ≠ t_j   ⇒   w_T(t) ≠ 0   ⇒   ∃ c = c(t) ∈ R:  f(t) − L_T(f)(t) = c(t)·w_T(t) .    (6.2.2.17)

Consider the auxiliary function

    ϕ(x) := f(x) − L_T(f)(x) − c·w_T(x) ,

which belongs to C^{n+1}(I) and has n + 2 distinct zeros t_0, . . . , t_n, t.

Fig. 210: the auxiliary function ϕ with its zeros t_0, t_1, . . . , t_n and t

By iterated application of the mean value theorem [Str09, Thm. 5.2.1]/Rolle’s theorem

    f ∈ C¹([a, b]), f(a) = f(b) = 0   ⇒   ∃ ξ ∈ ]a, b[:  f′(ξ) = 0 ,    (6.2.2.18)

to higher and higher derivatives, we conclude that

    ϕ^{(m)} has n + 2 − m distinct zeros in I ,

    m := n + 1   ⇒   ∃ τ_t ∈ I:  ϕ^{(n+1)}(τ_t) = f^{(n+1)}(τ_t) − c·(n + 1)! = 0 .

This fixes the value of c = f^{(n+1)}(τ_t)/(n + 1)! and by (6.2.2.17) this amounts to the assertion of the theorem.

Remark 6.2.2.19 (Explicit representation of error of polynomial interpolation) The previous theorem
can be refined:


Lemma 6.2.2.20. Error representation for polynomial Lagrange interpolation

For f ∈ C^{n+1}(I) let I_T(f) ∈ P_n stand for the unique Lagrange interpolant (→ Thm. 5.2.2.7) of f in the node set T := {t_0, . . . , t_n} ⊂ I. Then for all t ∈ I the interpolation error is

    f(t) − I_T(f)(t) = ∫_0^1 ∫_0^{τ_1} · · · ∫_0^{τ_{n−1}} ∫_0^{τ_n} f^{(n+1)}(t_0 + τ_1(t_1 − t_0) + · · · + τ_n(t_n − t_{n−1}) + τ(t − t_n)) dτ dτ_n · · · dτ_1 · ∏_{j=0}^{n} (t − t_j) .

The proof relies on induction on n, using (5.2.3.9) and the fundamental theorem of calculus, see [Ran00, Sect. 3.1]. y

Remark 6.2.2.21 (Error representation for generalized Lagrangian interpolation) A result analogous
to Lemma 6.2.2.20 holds also for general polynomial interpolation with multiple nodes as defined in
(5.2.2.15). y

Lemma 6.2.2.20 provides an exact formula (6.5.2.27) for the interpolation error. From it and also from
Thm. 6.2.2.15 we can derive estimates for the supremum norm of the interpolation error on the interval I
as follows:
➊ first bound the right hand side via |f^{(n+1)}(τ_t)| ≤ ‖f^{(n+1)}‖_{L∞(I)} ,
➋ then increase the right hand side further by switching to the maximum (in modulus) w.r.t. t (the resulting bound no longer depends on t!),
➌ and, finally, take the maximum w.r.t. t on the left of ≤.

This yields the following interpolation error estimate for degree-n Lagrange interpolation on the node set {t_0, . . . , t_n}:

    Thm. 6.2.2.15   ⇒   ‖f − L_T f‖_{L∞(I)} ≤ (‖f^{(n+1)}‖_{L∞(I)} / (n + 1)!) · max_{t∈I} |(t − t_0) · · · · · (t − t_n)| .    (6.2.2.22)
t∈ I

Remark 6.2.2.23 (Significance of smoothness of interpoland) The estimate (6.2.2.22) hinges on


bounds for (higher) derivatives of the interpoland f , which, essentially, should belong to C n+1 ( I ). The
same can be said about the estimate of Thm. 6.2.1.11.

This reflects a general truth about estimates of norms of the interpolation error:

Quantitative interpolation error estimates rely on smoothness!

EXAMPLE 6.2.2.24 (Error of polynomial interpolation Exp. 6.2.2.4 cnt’d) Now we are in a position to give a theoretical explanation for exponential convergence observed for polynomial interpolation of f(t) = sin(t) on equidistant nodes: by Lemma 6.2.2.20 and (6.2.2.22), since ‖f^{(k)}‖_{L∞(I)} ≤ 1 for all k ∈ N_0,

    ‖f − p_n‖_{L∞(I)} ≤ (1/(1 + n)!) · max_{t∈I} |(t − 0)(t − π/n)(t − 2π/n) · · · · · (t − π)| ≤ (1/(n + 1)) (π/n)^{n+1} .
➙ Uniform asymptotic (even more than) exponential convergence of the interpolation polynomials
(independently of the set of nodes T . In fact, k f − pk L∞ ( I ) decays even faster than exponential!)
y

EXAMPLE 6.2.2.25 (Runge’s example Ex. 6.2.2.11 cnt’d) How can the blow-up of the interpolation error observed in Ex. 6.2.2.11 be reconciled with Lemma 6.2.2.20?

Here f(t) = 1/(1 + t²) allows only to conclude |f^{(n)}(t)| = 2^n n! · O(|t|^{−2−n}) for n → ∞.
➙ Possible blow-up of the error bound from Thm. 6.2.2.15: → ∞ for n → ∞. y

Remark 6.2.2.26 (L²-error estimates for polynomial interpolation) Thm. 6.2.2.15 gives error estimates for the L∞-norm. What about other norms?

From Lemma 6.2.2.20 we know the error representation

    f(t) − I_T(f)(t) = ∫_0^1 ∫_0^{τ_1} · · · ∫_0^{τ_{n−1}} ∫_0^{τ_n} f^{(n+1)}(t_0 + τ_1(t_1 − t_0) + · · · + τ_n(t_n − t_{n−1}) + τ(t − t_n)) dτ dτ_n · · · dτ_1 · ∏_{j=0}^{n} (t − t_j) .

We also repeatedly use the Cauchy-Schwarz inequality for integrals:

    ( ∫_a^b f(t) g(t) dt )² ≤ ∫_a^b |f(t)|² dt · ∫_a^b |g(t)|² dt   ∀ f, g ∈ C⁰([a, b]) .    (6.2.2.27)

Thus we can estimate, abbreviating (. . .) = t_0 + τ_1(t_1 − t_0) + · · · + τ_n(t_n − t_{n−1}) + τ(t − t_n),

    ‖f − L_T(f)‖²_{L²(I)} = ∫_I | ∫_0^1 ∫_0^{τ_1} · · · ∫_0^{τ_{n−1}} ∫_0^{τ_n} f^{(n+1)}(. . .) dτ dτ_n · · · dτ_1 · ∏_{j=0}^{n} (t − t_j) |² dt      [each |t − t_j| ≤ |I|]
        ≤ |I|^{2n+2} · vol_{(n+1)}(S_{n+1}) · ∫_I ∫_{S_{n+1}} |f^{(n+1)}(. . .)|² dτ dt      [vol_{(n+1)}(S_{n+1}) = 1/(n + 1)!]
        = (|I|^{2n+2}/(n + 1)!) · ∫_I ∫_I vol_{(n)}(C_{t,τ}) |f^{(n+1)}(τ)|² dτ dt ,      [vol_{(n)}(C_{t,τ}) ≤ 2^{(n−1)/2}/n!]
where

Sn+1 := {x ∈ R n+1 : 0 ≤ xn ≤ xn−1 ≤ · · · ≤ x1 ≤ 1} (unit simplex) ,


Ct,τ := {x ∈ Sn+1 : t0 + x1 (t1 − t0 ) + · · · + xn (tn − tn−1 ) + xn+1 (t − tn ) = τ } .


This gives the bound for the L²-norm of the error:

    ‖f − L_T(f)‖_{L²(I)} ≤ (2^{(n−1)/4} |I|^{n+1} / √((n + 1)! n!)) · ( ∫_I |f^{(n+1)}(τ)|² dτ )^{1/2} .    (6.2.2.28)

Notice: f ↦ ‖f^{(n+1)}‖_{L²(I)} defines a seminorm on C^{n+1}(I) (Sobolev-seminorm, a measure of the smoothness of a function).
Estimates like (6.2.2.28) play a key role in the analysis of numerical methods for solving partial differential
equations (→ course “Numerical methods for partial differential equations”). y

Remark 6.2.2.29 (Interpolation error estimates and the Lebesgue constant [Tre13, Thm. 15.1]) The sensitivity of a (polynomial) interpolation scheme I_T : K^{n+1} → C⁰(I), T ⊂ I a node set, as introduced in Section 5.2.4 and expressed by the Lebesgue constant (→ Lemma 5.2.4.10)

    λ_T := ‖I_T‖_{∞→∞} := sup_{y∈R^{n+1}\{0}} ‖I_T(y)‖_{L∞(I)} / ‖y‖_∞ ,

establishes an important connection between the norms of the interpolation error and of the best approximation error.

We first observe that the polynomial approximation scheme LT induced by IT preserves polynomials of
degree ≤ n := ♯T − 1:

LT p = IT [ p(t)]t∈T = p ∀ p ∈ Pn . (6.2.2.30)

Thus, by the triangle inequality, for a generic norm on C⁰(I) and ‖L_T‖ designating the associated operator norm of the linear mapping L_T, cf. (5.2.4.9),

    ‖f − L_T f‖ = ‖(f − p) − L_T(f − p)‖ ≤ (1 + ‖L_T‖)·‖f − p‖   ∀ p ∈ P_n      [by (6.2.2.30)] ,

    ‖f − L_T f‖ ≤ (1 + ‖L_T‖) · inf_{p∈P_n} ‖f − p‖   (best approximation error) .    (6.2.2.31)

Note that for ‖·‖ = ‖·‖_{L∞(I)}, since ‖[f(t)]_{t∈T}‖_∞ ≤ ‖f‖_{L∞(I)}, we can estimate the operator norm, cf. (5.2.4.9),

    ‖L_T‖_{L∞(I)→L∞(I)} ≤ ‖I_T‖_{R^{n+1}→L∞(I)} = λ_T ,    (6.2.2.32)

    ‖f − L_T f‖_{L∞(I)} ≤ (1 + λ_T) · inf_{p∈P_n} ‖f − p‖_{L∞(I)}   ∀ f ∈ C⁰(I) .    (6.2.2.33)

Hence, if a bound for λT is available, the best approximation error estimate of Thm. 6.2.1.11 immediately
yields interpolation error estimates. y
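The Lebesgue constant itself can be estimated numerically via the Lebesgue function t ↦ ∑_j |L_j(t)|. A minimal sketch follows (the sampling grid and the function name are assumptions for illustration; the direct product formula for the Lagrange polynomials is used for simplicity).

C++ code (illustrative sketch): estimating the Lebesgue constant of a node set

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Estimate lambda_T = max_{x in [a,b]} sum_j |L_j(x)| by sampling on N+1 points.
double lebesgueConstant(const std::vector<double> &t, double a, double b, unsigned int N = 10000) {
  double lambda = 0.0;
  for (unsigned int k = 0; k <= N; ++k) {
    const double x = a + (b - a) * k / N;
    double lebesgueFn = 0.0;
    for (std::size_t j = 0; j < t.size(); ++j) {
      double Lj = 1.0; // Lagrange polynomial L_j(x)
      for (std::size_t i = 0; i < t.size(); ++i)
        if (i != j) Lj *= (x - t[i]) / (t[j] - t[i]);
      lebesgueFn += std::abs(Lj);
    }
    lambda = std::max(lambda, lebesgueFn);
  }
  return lambda;
}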
Review question(s) 6.2.2.34 (Interpolands of finite smoothness)
(Q6.2.2.34.A) For the interpolation error observed in Exp. 6.2.2.4 we found the bound

    ‖f − L_{T_n} f‖_{L∞(I)} ≤ (1/(n + 1)) (π/n)^{n+1}   ∀ n ∈ N .
Explain why this behavior of the maximum norm of the interpolation error is also called “superexponential
convergence”.


(Q6.2.2.34.B) For Lagrange polynomial interpolation we have seen the following error representation.

Theorem 6.2.2.15. Representation of interpolation error

We consider f ∈ C^{n+1}(I) and the Lagrangian interpolation approximation scheme (→ Def. 6.2.2.1) for a node set T := {t_0, . . . , t_n} ⊂ I. Then, for every t ∈ I there exists a τ_t ∈ ]min{t, t_0, . . . , t_n}, max{t, t_0, . . . , t_n}[ such that

    f(t) − L_T(f)(t) = (f^{(n+1)}(τ_t) / (n + 1)!) · ∏_{j=0}^{n} (t − t_j) .    (6.5.2.27)

This was obtained by “counting the zeros” of derivatives of the auxiliary function

    ϕ(x) := f(x) − L_T(f)(x) − c·w_T(x) ,   a ≤ x ≤ b ,

where w_T was the nodal polynomial belonging to the node set T and c ∈ R was chosen to ensure ϕ(t) = 0 for a fixed t ∈ [a, b] \ T.

Now we consider the cubic Hermite interpolation operator

    H : C¹([a, b]) → P_3 ,   p := H(f)   such that   p(a) = f(a) , p(b) = f(b) , p′(a) = f′(a) , p′(b) = f′(b) .

Here p′, f′ stands for the derivative. Fixing t ∈ ]a, b[ and using the auxiliary function

    ψ(x) := f(x) − H(f)(x) − c·(x − a)²(x − b)² ,   a ≤ x ≤ b ,

derive an error representation formula for f(t) − H(f)(t).


(Q6.2.2.34.C) Assume that for a sequence of node sets (Tn )n∈N , the associated Lebesgue constants
of Lagrange polynomial interpolation with respect to the maximum norm behave like λTn = O(nq ) for
n → ∞ and some q ∈ N.
What conclusions about the maximum norm of the interpolation error of Lagrange interpolation can then
be drawn from Jackson’s theorem?

Theorem 6.2.1.11. L∞ polynomial best approximation estimate

If f ∈ C^r([−1, 1]) (r times continuously differentiable), r ∈ N, then, for any polynomial degree n ≥ r,

    inf_{p∈P_n} ‖f − p‖_{L∞([−1,1])} ≤ (1 + π²/2)^r · ((n − r)!/n!) · ‖f^{(r)}‖_{L∞([−1,1])} ,

where ‖f^{(r)}‖_{L∞([−1,1])} := max_{x∈[−1,1]} |f^{(r)}(x)|.

6.2.2.3 Analytic Interpolands

Video tutorial for Section 6.2.2.3 "Error Estimates for Polynomial Interpolation: Analytic Inter-
polands": (27 minutes) Download link, tablet notes


We have seen that for some Lagrangian approximation schemes applied to certain functions we can
observe exponential convergence (→ Def. 6.2.2.7) of the approximation error for increasing polynomial
degree. This section presents a class of interpolands, which often enable this convergence.

Definition 6.2.2.35. Real-analytic functions

A function f ∈ C^∞(I) defined on an open interval I ⊂ R is called (real-)analytic on I, if it possesses a convergent Taylor series at every point t_0 ∈ I:

    ∀ t_0 ∈ I:  ∃ ρ = ρ(t_0) > 0:   f(t) = ∑_{k=0}^{∞} (f^{(k)}(t_0)/k!) (t − t_0)^k   ∀ t ∈ I : |t − t_0| < ρ(t_0) .

ρ(t_0) is called the radius of convergence of the Taylor series.

We may say that an analytic function locally agrees with a “polynomial of degree ∞”, because this is exactly what a convergent power series

    f(t) = ∑_{k=0}^{∞} a_k (t − t_0)^k ,   a_k ∈ R ,    (6.2.2.36)

represents. Note that f need not be given by its Taylor power series on the whole interval I. Those may converge only on small sub-intervals. Def. 6.2.2.35 merely tells us that I can be covered by such sub-intervals.

EXAMPLE 6.2.2.37 (Hump function) We consider the C^∞-function

    f(t) := 1/(1 + t²) ,   t ∈ R ,

on the interval I = [−5, 5]. By the geometric sum formula we have as Taylor series at t_0 = 0:

    f(t) = ∑_{k=0}^{∞} (−1)^k t^{2k} ,   |t| < 1 ,

whose radius of convergence is 1. More generally, deeper theory tells us that the Taylor series at t_0 has radius of convergence √(t_0² + 1). Thus, f on [−5, 5] cannot be represented by a single power series, though it is a perfectly smooth function. y

EXAMPLE 6.2.2.38 (Square root function) We consider the function f(t) := √t on I := ]0, 1[. From calculus we know the power series

    √(1 + x) = 1 + ∑_{k=1}^{∞} (−1)^k ∏_{j=0}^{k−1} ((j − ½)/(j + 1)) · x^k ,   |x| < 1 .    (6.2.2.39)

It converges for all x with |x| < 1 [Str09, Satz 3.7.2]. Using (6.2.2.39) we get the Taylor series for f at t_0 ∈ I:

    √t = √(t_0) √(1 + (t − t_0)/t_0) = √(t_0) + √(t_0) ∑_{k=1}^{∞} (−1)^k ∏_{j=0}^{k−1} ((j − ½)/(j + 1)) ((t − t_0)/t_0)^k ,   |(t − t_0)/t_0| < 1 .

This series converges only in the open ball around t0 of radius |t0 |. The closer we get to the “singular
point” t = 0, the smaller the radius of convergence. y


Remark 6.2.2.40 (Analytic functions everywhere) Analyticity of a function seems to be a very special property, confined to functions that are given through simple formulas. Yet, this is not true:
Linear RLC circuit with a variable resistor, see also Ex. 2.6.0.24 (Fig. 211). For the shown linear electric circuit all branch currents and voltages will depend analytically on the resistance R_x of the variable resistor. This is a consequence of the fact that

    t ↦ v(t)^⊤ A(t)^{−1} u(t) ,   t ∈ I ⊂ R ,

is an analytic function (where well-defined), provided that the components of u(t), v(t) ∈ R^n and the entries of A(t) ∈ R^{n,n} are analytic functions of t.

Fig. 211: linear RLC circuit containing the variable resistor R_x and driven by an AC voltage source U

This remark remains true for many output quantities of physical models considered as functions of model
parameters or input quantities. y

§6.2.2.41 (Approximation by truncated power series) A first glimpse of the relevance of analyticity for polynomial approximation: Let I ⊂ R be a closed interval and f real-analytic on I according to Def. 6.2.2.35.

We add a stronger assumption: There is a t_0 ∈ I and ρ > 0 such that
• the Taylor series of f at t_0

    f(t) = ∑_{k=0}^{∞} a_k (t − t_0)^k ,   a_k ∈ R ,    (6.2.2.42)

has radius of convergence ρ,


• I is contained in the ρ-ball around t_0:  I ⊂ {t ∈ R : |t − t_0| < ρ}   (Fig. 212).
We approximate f by its truncated Taylor series

    p_n(t) := ∑_{k=0}^{n} a_k (t − t_0)^k ∈ P_n .    (6.2.2.43)

Since I is closed we can find 0 < r < ρ such that

    I ⊂ {t ∈ R : |t − t_0| ≤ r} .    (6.2.2.44)

The convergence theory of power series [Str09, Sect. 3.7] ensures that for any r < R < ρ

    ∑_{k=0}^{∞} |a_k| R^k =: C < ∞ .    (6.2.2.45)

Now we combine (6.2.2.42), (6.2.2.43), (6.2.2.45): for arbitrary t ∈ I we arrive at

    |f(t) − p_n(t)| = | ∑_{k=n+1}^{∞} a_k (t − t_0)^k | ≤ ∑_{k=n+1}^{∞} |a_k| R^k |(t − t_0)/R|^k ≤ C (r/R)^{n+1} .

    ‖f − p_n‖_{L∞(I)} ≤ C q^{n+1}   ∀ n ∈ N   with   q := r/R < 1 .    (6.2.2.46)


This confirms exponential convergence of the approximation error incurred by truncating the Taylor series.
y
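A tiny numerical illustration of (6.2.2.46) (the concrete choices f(t) = 1/(1 − t), t_0 = 0, I = [−1/2, 1/2], and the sampling of the error are assumptions made only for this sketch): here a_k = 1, ρ = 1, and with r = 1/2 the sampled sup-norm error should decay roughly like (1/2)^{n+1}.

C++ code (illustrative sketch): exponential decay of the truncation error of a power series

#include <algorithm>
#include <cmath>
#include <iostream>

int main() {
  for (int n = 1; n <= 20; ++n) {
    double maxerr = 0.0;
    for (int i = 0; i <= 1000; ++i) {
      const double t = -0.5 + 1.0 * i / 1000;
      double pn = 0.0, tk = 1.0; // truncated series sum_{k=0}^{n} t^k
      for (int k = 0; k <= n; ++k) { pn += tk; tk *= t; }
      maxerr = std::max(maxerr, std::abs(1.0 / (1.0 - t) - pn));
    }
    std::cout << "n = " << n << ", approximate sup-norm error = " << maxerr << "\n";
  }
}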

The previous § heavily relied on the assumption that f possesses a power series representation that converges on the entire interval I and even beyond. As we see from Ex. 6.2.2.37 and Ex. 6.2.2.38 this will usually not be the case.

This is why we continue with a key observation: a power series makes perfect sense for complex arguments. Any function f ∈ C^∞(I) that, locally, for t_0 ∈ I can be written as

    f(t) = ∑_{n=0}^{∞} a_n (t − t_0)^n ,   a_n ∈ R ,   ∀ t ∈ R : |t − t_0| < ρ(t_0) ,

can be extended to a complex-valued function defined on a C-disk around t_0 by

    f(z) = ∑_{n=0}^{∞} a_n (z − t_0)^n ,   ∀ z ∈ C : |z − t_0| < ρ(t_0) .

Real- and complex-analytic functions

Every real-analytic function on I ⊂ R can be extended to a complex-analytic function on some open


set D ⊂ C with I ⊂ D.

For this reason no distinction between real and complex analyticity has to be made. For the sake of completeness, the definition of an analytic function on D ⊂ C is given nevertheless.

Definition 6.2.2.48. Analyticity of a complex-valued function

Let D ⊂ C be an open set in the complex plane. A function f : D → C is called (complex-)analytic/holomorphic in D, if for every point z ∈ D one can find ρ(z) > 0 and a_k ∈ C, k ∈ N_0, such that

    f(w) = ∑_{k=0}^{∞} a_k (w − z)^k   ∀ w ∈ D : |z − w| < ρ(z) .

Be aware that also this definition asserts the existence of a local power series representation of f only.

§6.2.2.49 (Residue calculus) Why go C? The reason is that this permits us to harness powerful tools from complex analysis (→ course in the BSc program CSE), a field of mathematics which studies analytic functions. One of these tools is the residue theorem.

Theorem 6.2.2.50. Residue theorem [Rem84, Ch. 13]

Let D ⊂ C be an open set, G ⊂ D a closed set contained in D, γ := ∂G its (piecewise smooth and oriented) boundary, and Π a finite set contained in the interior of G.
Then for each function f that is analytic in D \ Π it holds that

    (1/(2πı)) ∫_γ f(z) dz = ∑_{p∈Π} res_p f ,

where res_p f is the residue of f at p ∈ C.


• Note that the integral ∫_γ in Thm. 6.2.2.50 is a path integral in the complex plane (“contour integral”): If the path of integration γ is described by a parameterization τ ∈ J ↦ γ(τ) ∈ C, J ⊂ R, then

    ∫_γ f(z) dz := ∫_J f(γ(τ)) · γ̇(τ) dτ ,    (6.2.2.51)

where γ̇ designates the derivative of γ with respect to the parameter, and · indicates multiplication in C. For contour integrals we have the estimate

    | ∫_γ f(z) dz | ≤ |γ| · max_{z∈γ} |f(z)| .    (6.2.2.52)

• Π often stands for the set of poles of f , that is, points where “ f attains the value ∞”.

The residue theorem is very useful, because there are simple formulas for res_p f:

Lemma 6.2.2.53. Residue formula for quotients

Let g and h be complex-valued functions that are both analytic in a neighborhood of p ∈ C, and satisfy h(p) = 0, h′(p) ≠ 0. Then

    res_p (g/h) = g(p)/h′(p) .

§6.2.2.54 (Residue remainder formula for Lagrange interpolation) Now we consider a polynomial
Lagrangian approximation scheme on the interval I := [ a, b] ⊂ R, based on the node set T :=
{ t0 , . . . , t n } ⊂ I .

Assumption 6.2.2.55. Analyticity of interpoland

We assume that the interpoland f : [a, b] → C can be extended to a function f : D ⊂ C → C, which is analytic (→ Def. 6.2.2.48) on the open set D ⊂ C with [a, b] ⊂ D.

Fig. 213: integration contour γ ⊂ D in the complex plane, winding around the interval [a, b] on the real axis which contains the interpolation nodes t_0, t_1, . . . , t_n

Key is the following representation of the Lagrange polynomials (5.2.2.4) for the node set T = {t_0, . . . , t_n}:

    L_j(t) = ∏_{k=0,k≠j}^{n} (t − t_k)/(t_j − t_k) = w(t) / ( (t − t_j) ∏_{k=0,k≠j}^{n} (t_j − t_k) ) = w(t) / ( (t − t_j) w′(t_j) ) ,    (6.2.2.56)

where w(t) = (t − t_0) · · · · · (t − t_n) ∈ P_{n+1}.

Consider the following parameter-dependent function g_t, whose set of poles in D is Π = {t, t_0, . . . , t_n}:

    g_t(z) := f(z) / ( (z − t) w(z) ) ,   z ∈ C \ Π ,   t ∈ [a, b] \ {t_0, . . . , t_n} .


➣ g_t is analytic on D \ {t, t_0, . . . , t_n} (t must be regarded as a parameter!)

Apply the residue theorem Thm. 6.2.2.50 to g_t and a closed path of integration γ ⊂ D winding once around [a, b], such that its interior is simply connected, see Fig. 213:

    (1/(2πı)) ∫_γ g_t(z) dz = res_t g_t + ∑_{j=0}^{n} res_{t_j} g_t = f(t)/w(t) + ∑_{j=0}^{n} f(t_j)/((t_j − t) w′(t_j))      [by Lemma 6.2.2.53] .

This is possible, because all zeros of w are single zeros!

    f(t) = − ∑_{j=0}^{n} f(t_j) · w(t)/((t_j − t) w′(t_j))  +  (w(t)/(2πı)) ∫_γ g_t(z) dz .    (6.2.2.57)

Here the sum is the polynomial interpolant (note that −w(t)/((t_j − t) w′(t_j)) = L_j(t) by (6.2.2.56)), while the remaining contour-integral term represents the interpolation error.

This is the famous Hermite integral formula [Tre13, Ch. 11], a representation formula for the interpolation error, an alternative to that of Thm. 6.2.2.15 and Lemma 6.2.2.20. We conclude that for all t ∈ [a, b]

    | f(t) − L_T f(t) | = | (w(t)/(2πı)) ∫_γ f(z)/((z − t) w(z)) dz |
                        ≤ (|γ|/(2π)) · ( max_{a≤τ≤b} |w(τ)| / min_{z∈γ} |w(z)| ) · ( max_{z∈γ} |f(z)| / dist([a, b], γ) ) .    (6.2.2.58)

In a concrete setting, in order to exploit the estimate (6.2.2.58) to study the n-dependence of the supremum norm of the interpolation error, we need to know
• an upper bound for |w(t)| for a ≤ t ≤ b,
• a lower bound for |w(z)|, z ∈ γ, for a suitable path of integration γ ⊂ D,
• a lower bound for the distance of the path γ and the interval [a, b] in the complex plane.
y

Remark 6.2.2.59 (Frobenius’ derivation of the Hermite integral formula [Boo05, Sect. 9]) We give a more elementary alternative derivation of the interpolation error formula in (6.2.2.57), which does not rely on the residue theorem. We retain the node set T := {t0 , . . . , tn } ⊂ [ a, b] and Ass. 6.2.2.55. We write

w j ( t ) : = ( t − t 0 ) · · · · · ( t − t j −1 ) , j ∈ {1, . . . , n + 1} , w0 := 1 ,

for nodal polynomials: w j ∈ P j . They obviously satisfy

tw j−1 (t) = w j (t) + t j−1 w j−1 (t) , t∈R, j = 1, . . . , n + 1 . (6.2.2.60)

We pick z ∈ C \ T, replace t → z in (6.2.2.60), and divide by w_{j−1}(z) w_j(z), which yields

    z/w_j(z) = 1/w_{j−1}(z) + t_{j−1}/w_j(z) ,   z ∈ C \ T ,   j = 1, . . . , n + 1 .    (6.2.2.61)
From the two identities (6.2.2.60) and (6.2.2.61) we obtain by summation

    z · ∑_{j=1}^{n+1} w_{j−1}(t)/w_j(z) = ∑_{j=1}^{n+1} { w_{j−1}(t)/w_{j−1}(z) + t_{j−1} w_{j−1}(t)/w_j(z) } ,    (6.2.2.62)

    t · ∑_{j=1}^{n+1} w_{j−1}(t)/w_j(z) = ∑_{j=1}^{n+1} { w_j(t)/w_j(z) + t_{j−1} w_{j−1}(t)/w_j(z) } .    (6.2.2.63)

Subtracting the last two formulas we get a telescopic sum:

    (z − t) ∑_{j=1}^{n+1} w_{j−1}(t)/w_j(z) = ∑_{j=1}^{n+1} { w_{j−1}(t)/w_{j−1}(z) − w_j(t)/w_j(z) } = 1 − w_{n+1}(t)/w_{n+1}(z) ,   z ∉ T ,  t ∈ R ,

    1/(z − t) = ∑_{j=1}^{n+1} w_{j−1}(t)/w_j(z) + w_{n+1}(t)/((z − t) w_{n+1}(z)) ,   t ∈ R ,  z ∉ T ∪ {t} .    (6.2.2.64)

Recall the Cauchy integral representation formula

    f(t) = (1/(2πı)) ∫_γ f(z)/(z − t) dz ,   t ∈ [a, b] ,    (6.2.2.65)

where γ ⊂ D is a curve enclosing [a, b] as drawn in Fig. 213. We can rewrite (6.2.2.65) using (6.2.2.64):

    f(t) = ∑_{j=1}^{n+1} { (1/(2πı)) ∫_γ f(z)/w_j(z) dz } · w_{j−1}(t)  +  (1/(2πı)) ∫_γ f(z)/((z − t) w_{n+1}(z)) dz · w_{n+1}(t) ,    (6.2.2.66)

where the first sum defines the polynomial p(t).

By the definition of w j the function t 7→ p(t) is a polynomial of degree n that interpolates f in T , the set
of zeros of wn+1 . Hence p is the unique Lagrange polynomial interpolant of f with respect to the node
set T . Thus the second term in (6.2.2.66) represents the interpolation error and obviously agrees with the
formula found in (6.2.2.57). y

Remark 6.2.2.67 (Determining the domain of analyticity) The subset of C where a function f given by a formula is analytic can often be determined without computing derivatives, using the following consequence of the chain rule:

Theorem 6.2.2.68. Composition and products of analytic functions

If f , h : D ⊂ C → C and g : U ⊂ C → C are analytic in the open sets D and U , respectively, then


(i) the composition f ◦ g is analytic in {z ∈ U : g(z) ∈ D },
(ii) the product f · h is analytic on D.

This can be combined with the following facts:


• Polynomials, exp(z), sin(z), cos(z), sinh(z), cosh(z) are analytic on C (entire functions).
• Rational functions (quotients of polynomials) are analytic everywhere except in the zeros of their
denominator.

• The square root z ↦ √z is analytic in C \ ]−∞, 0].
For example, according to these rules the function f (t) = (1 + t2 )−1 can be extended to an analytic
function on C \ {−ı, ı}. If A ∈ C n,n , b ∈ R n , n ∈ N, then z 7→ (A + zI)−1 b ∈ C n is (componentwise)
analytic in C \ σ (A), where σ (A) denotes the set of eigenvalues of A (the spectrum). y
Review question(s) 6.2.2.69 (Analytic interpolants)


(Q6.2.2.69.A) We consider the Lagrange polynomial interpolation of the entire function f(t) = e^t on I = [0, 1] with equidistant nodes T := {t_j = j/n}_{j=0}^{n}, n ∈ N. As integration path in the residue remainder estimate

    | f(t) − L_T f(t) | = | (w(t)/(2πı)) ∫_γ f(z)/((z − t) w(z)) dz | ≤ (|γ|/(2π)) · ( max_{a≤τ≤b} |w(τ)| / min_{z∈γ} |w(z)| ) · ( max_{z∈γ} |f(z)| / dist([a, b], γ) )    (6.2.2.58)

we choose the square path (endowed with an arbitrary orientation)

    γ := {−1 + tı}_{t=−3}^{3} ∪ {t + 3ı}_{t=−1}^{2} ∪ {2 + tı}_{t=3}^{−3} ∪ {t − 3ı}_{t=2}^{−1} .

Work out the bound from (6.2.2.58) in this case.


(Q6.2.2.69.B) [Interpolands analytic on stadium domains] Assume that f : [−1, 1] → C possesses an analytic extension to an open set D ⊂ C containing the stadium domain

    S := { z ∈ C : min_{−1≤t≤1} |z − t| ≤ 1 } .

Let T_n := {t_0, . . . , t_n} ⊂ [−1, 1] be a sequence of sets of interpolation nodes and I_n : C⁰([−1, 1]) → P_n the associated family of Lagrangian polynomial interpolation operators. Based on (6.2.2.58) show that the supremum norm of the interpolation error ‖f − I_n f‖_{∞,[−1,1]} converges to zero exponentially in the degree n as n → ∞.
(Q6.2.2.69.C) For Lagrange polynomial interpolation of f(t) = 1/(1 + t²) in the nodes T := {−5 + (10/n)·j}_{j=0}^{n}, sketch a valid integration path γ ⊂ C for the estimate (6.2.2.58).
sketch a valid integration path γ ⊂ C for the estimate (6.2.2.58).
(Q6.2.2.69.D) What is the problem,√ if you want to apply (6.2.2.58) to estimate the error of Lagrange
polynomial interpolation of t 7→ t in equidistant nodes in [0, 1]?
(Q6.2.2.69.E) Find the largest subset of the complex plane C to which the logistic curve function

f(t) := 1/(1 + exp(−t)) ,   t ∈ R ,

possesses an analytic extension. Note Euler’s formula

exp(x + ıy) = exp(x)·(cos(y) + ı sin(y)) ,   x, y ∈ R .

6.2.3 Chebychev Interpolation


As pointed out in § 6.1.0.6, when we build approximation schemes from interpolation schemes, we have
the extra freedom to choose the sampling points (= interpolation nodes). Now, based on the insight into
the structure of the interpolation error gained from Thm. 6.2.2.15, we seek to choose “optimal” sampling
points. They will give rise to the so-called Chebychev polynomial approximation schemes, also known as
Chebychev interpolation.


6.2.3.1 Motivation and Definition

Video tutorial for Section 6.2.3.1 "Chebychev Interpolation: Motivation and Definition": (21
minutes) Download link, tablet notes

Setting: ✦ Without loss of generality (→ § 6.2.1.14): I = [−1, 1],


✦ interpoland f : I → R at least continuous, f ∈ C0 ( I ),
✦ set of interpolation nodes T := {−1 ≤ t0 < t1 < · · · < tn−1 < tn ≤ 1}, n ∈ N.

Recall Thm. 6.2.2.15:   ‖f − L_T f‖_{L∞(I)} ≤ 1/(n + 1)! · ‖f^{(n+1)}‖_{L∞(I)} · ‖w‖_{L∞(I)} ,

with nodal polynomial   w(t) := (t − t_0)·…·(t − t_n) .

Optimal choice of interpolation nodes independent of interpoland

Idea: choose nodes t0 , . . . , tn such that kwk L∞ ( I ) is minimal!


This is equivalent to finding a polynomial q ∈ Pn+1
✦ with leading coefficient = 1,
✦ such that it minimizes the norm kqk L∞ ( I ) .
Then choose nodes t0 , . . . , tn as zeros of q.
(caution: all t j must lie in I !)

Remark 6.2.3.2 (A priori and a posteriori choice of optimal interpolation nodes) We stress that we
aim for an “optimal” a priori choice of interpolation nodes, a choice that is made before any information
about the interpoland becomes available.
Of course, an a posteriori choice based on information gleaned from evaluations of the interpoland f may
yield much better interpolants (in the sense of smaller norm of the interpolation error). Many modern
algorithms employ this a posteriori adaptive approximation policy, but this chapter will not cover them.
However, see Section 7.6 for the discussion of an a posteriori adaptive approach for the numerical approx-
imation of definite integrals. y

Requirements on q (by heuristic reasoning)

Optimal polynomials q will exist, but they seem to be elusive. First, we develop some insights into how they must “look like”:
• If t^∗ is an extremal point of q ➙ |q(t^∗)| = ‖q‖_{L∞(I)} ,
• q has n + 1 zeros in I (∗),
• |q(−1)| = |q(1)| = ‖q‖_{L∞(I)} .

➣ q has n + 2 extrema in [−1, 1]

Fig. 214: a polynomial q on I = [−1, 1] attaining the values ±‖q‖_{L∞(I)} alternately at its extrema, with zeros t_0, t_1, t_2, . . . , t_n.

(∗) is motivated by an indirect argument:


If q(t) = (t − t0 ) · · · · · (t − tn+1 ) with t0 < −1, then


| p(t)| := |(t + 1)(t − t1 ) · · · · · (t − tn+1 )|<|q(t)| ∀t ∈ I (why ?) ,
which contradicts the minimality property of q. Same argument for t0 > 1. The reasonings leading to the
above heuristic demands will be elaborated in the proof of Thm. 6.2.3.8.

Are there polynomials satisfying these requirements? If so, do they allow a simple characterization?

Definition 6.2.3.3. Chebychev polynomials → [Han02, Ch. 32]

The nth Chebychev polynomial is Tn (t) := cos(n arccos t), −1 ≤ t ≤ 1, n ∈ N0 .

The next result confirms that the Tn are polynomials, indeed.

Theorem 6.2.3.4. 3-term recursion for Chebychev polynomials → [Han02, (32.2)]


The functions Tn defined in Def. 6.2.3.3 satisfy the 3-term recursion

Tn+1 (t) = 2t Tn (t) − Tn−1 (t) , T0 ≡ 1 , T1 (t) = t , n ∈ N . (6.2.3.5)

Proof. Just use the trigonometric identity cos(n + 1) x = 2 cos nx cos x − cos(n − 1) x with cos x = t.

The theorem implies:
• T_n ∈ P_n ,
• their leading coefficients are equal to 2^{n−1} (for n ≥ 1),
• the T_n are linearly independent,
• {T_j}_{j=0}^{n} is a basis of P_n = Span{T_0, . . . , T_n} , n ∈ N_0 .

See Code 6.2.3.6 for algorithmic use of the 3-term recursion (6.2.3.5).
Fig. 215: Chebychev polynomials T_0, . . . , T_4 on [−1, 1].   Fig. 216: Chebychev polynomials T_5, . . . , T_9 on [−1, 1].

C++ code 6.2.3.6: Efficient evaluation of Chebychev polynomials up to a certain degree ➺ GITLAB

// Computes the values of the Chebychev polynomials T0, . . . , Td
// at points passed in x using the 3-term recursion (6.2.3.5).
// The values Tk(x_j) are returned in row k + 1 of V.
void chebpolmult(const unsigned int d, const RowVectorXd& x, MatrixXd& V) {
  const unsigned int n = x.size();
  V = MatrixXd::Ones(d + 1, n); // T0 ≡ 1
  if (d == 0) return;
  V.block(1, 0, 1, n) = x; // T1(x) = x
  if (d == 1) return;
  for (unsigned int k = 1; k < d; ++k) {
    const RowVectorXd p = V.block(k, 0, 1, n);     // p = Tk
    const RowVectorXd q = V.block(k - 1, 0, 1, n); // q = Tk−1
    V.block(k + 1, 0, 1, n) = 2 * x.cwiseProduct(p) - q; // 3-term recursion
  }
}

From Def. 6.2.3.3 we conclude that Tn attains the values ±1 in its extrema with alternating signs, thus
matching our heuristic demands:

|T_n(t_k)| = 1  ⇔  ∃ k ∈ {0, . . . , n}:  t_k = cos(kπ/n) ,   ‖T_n‖_{L∞([−1,1])} = 1 .   (6.2.3.7)

What is still open is the validity of the heuristics guiding the choice of the optimal nodes. The next funda-
mental theorem will demonstrate that, after scaling, the Tn really supply polynomials on [−1, 1] with fixed
leading coefficient and minimal supremum norm.

Theorem 6.2.3.8. Minimax property of the Chebychev polynomials [DH03, Section 7.1.4.],
[Han02, Thm. 32.2]

The polynomials Tn from Def. 6.2.3.3 minimize the supremum norm in the following sense:

‖T_n‖_{L∞([−1,1])} = inf{ ‖p‖_{L∞([−1,1])} : p ∈ P_n , p(t) = 2^{n−1} t^n + · · · } ,   ∀ n ∈ N .

Proof. (indirect) Assume

∃ q ∈ P_n with leading coefficient 2^{n−1} :   ‖q‖_{L∞([−1,1])} < ‖T_n‖_{L∞([−1,1])} .   (6.2.3.9)

Then (T_n − q)(x) > 0 in the local maxima of T_n, and (T_n − q)(x) < 0 in all local minima of T_n.
Since the n + 1 local extrema of T_n in [−1, 1] have alternating signs, see (6.2.3.7), T_n − q alternates sign at these n + 1 points and, hence, has at least n zeros. As a consequence, T_n − q ≡ 0, because T_n − q ∈ P_{n−1} (same leading coefficient!).
This cannot be reconciled with the properties (6.2.3.9) of q and, thus, leads to a contradiction.

 
The zeros of T_n are   t_k = cos( (2k + 1)/(2n) · π ) ,  k = 0, . . . , n − 1 .   (6.2.3.10)

To see this, notice

T_n(t) = 0  ⇔  n arccos t ∈ (2Z + 1)·π/2    [zeros of cos]
          ⇔  t ∈ { cos( (2k + 1)/n · π/2 ) , k = 0, . . . , n − 1 }    [since arccos ∈ [0, π]] .
Thus, we have identified the tk from (6.2.3.10) as optimal interpolation nodes for a Lagrangian approxima-
tion scheme. The tk are known as Chebychev nodes. Their distribution in [−1, 1] is plotted in Fig. 217, a
geometric construction is indicated in Fig. 218.


Fig. 217: distribution of the Chebychev nodes (6.2.3.10) in [−1, 1] for n = 10, . . . , 20.   Fig. 218: geometric construction of the Chebychev nodes on an interval [a, b].

Remark 6.2.3.11 (Chebychev nodes on arbitrary interval) Following the recipe of § 6.2.1.14 Chebychev interpolation on an arbitrary interval [a, b] can immediately be defined. The same polynomial Lagrangian approximation scheme is obtained by transforming the Chebychev nodes (6.2.3.10) from [−1, 1] to [a, b] using the unique affine transformation (6.2.1.15):

t̂ ∈ [−1, 1]  ↦  t := a + ½·(t̂ + 1)·(b − a) ∈ [a, b] .

The Chebychev nodes in the interval I = [a, b] are

t_k := a + ½·(b − a)·( cos( (2k + 1)/(2(n + 1)) · π ) + 1 ) ,   k = 0, . . . , n .   (6.2.3.12)

y
Parlance: When we use Chebychev nodes for polynomial interpolation we call the resulting Lagrangian
approximation scheme Chebychev interpolation.
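The node formula (6.2.3.12) is straightforward to code. A minimal sketch in EIGEN (the function name chebnodes is our own choice for illustration, not part of the lecture codes on ➺ GITLAB):

#include <Eigen/Dense>
#include <cmath>

// Chebychev nodes (6.2.3.12) on the interval [a, b], returned as a vector of length n+1
Eigen::VectorXd chebnodes(const double a, const double b, const unsigned int n) {
  Eigen::VectorXd t(n + 1);
  for (unsigned int k = 0; k <= n; ++k) {
    t(k) = a + 0.5 * (b - a) *
           (std::cos((2.0 * k + 1.0) / (2.0 * (n + 1)) * M_PI) + 1.0);
  }
  return t;
}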
Review question(s) 6.2.3.13 (Chebychev interpolation: motivation and definition)
(Q6.2.3.13.A) We write Tn ∈ Pn for the n-th Chebychev polynomial:

Definition 6.2.3.3. Chebychev polynomials

The nth Chebychev polynomial is Tn (t) := cos(n arccos t), −1 ≤ t ≤ 1, n ∈ N.

What is the composition T_n ◦ T_m, (T_n ◦ T_m)(t) = T_n(T_m(t)), of two Chebychev polynomials?


(Q6.2.3.13.B) Writing Tn ∈ Pn for the n-th Chebychev polynomial, using a trigonometric identity show
that

2·T_m(t)·T_n(t) = T_{m+n}(t) + T_{m−n}(t)   ∀ m, n ∈ N_0 ,  m ≥ n .

6.2.3.2 Chebychev Interpolation Error Estimates

Video tutorial for Section 6.2.3.2 "Chebychev Interpolation Error Estimates": (14 minutes)
Download link, tablet notes


EXAMPLE 6.2.3.14 (Polynomial interpolation: Chebychev nodes versus equidistant nodes) We consider Runge’s function f(t) = 1/(1 + t²), see Ex. 6.2.2.11, and compare polynomial interpolation based on uniformly spaced nodes and Chebychev nodes in terms of behavior of interpolants.

Fig. 219 (equidistant nodes): f and its interpolating polynomial on [−5, 5].   Fig. 220 (Chebychev nodes): f and its Chebychev interpolation polynomial on [−5, 5].


We observe that the Chebychev nodes cluster at the endpoints of the interval, which successfully sup-
presses the huge oscillations haunting equidistant interpolation there. y

§6.2.3.15 (Finite-smoothness error estimates for Chebychev interpolation) Note the following features of Chebychev interpolation on the interval [−1, 1]:

• Use of “optimal” interpolation nodes T = { t̂_k := cos( (2k + 1)/(2(n + 1)) · π ) , k = 0, . . . , n } ,

• corresponding to the nodal polynomial w(t) = (t − t_0)·…·(t − t_n) = 2^{−n}·T_{n+1}(t) with leading coefficient 1, so that ‖w‖_{L∞(I)} = 2^{−n} .
Then, by

Theorem 6.2.2.15. Representation of interpolation error

We consider f ∈ C^{n+1}(I) and the Lagrangian interpolation approximation scheme (→ Def. 6.2.2.1) for a node set T := {t_0, . . . , t_n} ⊂ I. Then, for every t ∈ I there exists a τ_t ∈ ] min{t, t_0, . . . , t_n}, max{t, t_0, . . . , t_n} [ such that

f(t) − L_T(f)(t) = f^{(n+1)}(τ_t)/(n + 1)! · ∏_{j=0}^{n} (t − t_j) .   (6.2.3.16)

we immediately get an interpolation error estimate for Chebychev interpolation of f ∈ C^{n+1}([−1, 1]):

‖f − I_T(f)‖_{L∞([−1,1])} ≤ 2^{−n}/(n + 1)! · ‖f^{(n+1)}‖_{L∞([−1,1])} .   (6.2.3.17)

Estimates for the Chebychev interpolation error on [a, b] are easily derived from (6.2.3.17):

p ∈ P_n ∧ p(t_j) = f(t_j)   ⇔   p̂ ∈ P_n ∧ p̂(t̂_j) = f̂(t̂_j) ,

with the affine pullback introduced and discussed in § 6.2.1.14. For instance, repeated application of the chain rule yields the formula  d^n f̂/dt̂^n (t̂) = (½|I|)^n · d^n f/dt^n (t).


‖f − I_T(f)‖_{L∞(I)} = ‖f̂ − I_{T̂}(f̂)‖_{L∞([−1,1])} ≤ 2^{−n}/(n + 1)! · ‖ d^{n+1} f̂/dt̂^{n+1} ‖_{L∞([−1,1])}
                   ≤ 2^{−2n−1}/(n + 1)! · |I|^{n+1} · ‖f^{(n+1)}‖_{L∞(I)} .   (6.2.3.18)

Remark 6.2.3.19 (Lebesgue Constant for Chebychev nodes [Tre13, Thm 15.2]) We saw in Sec-
tion 5.2.4 and, in particular, in Rem. 5.2.4.13 that the Lebesgue constant λT that measures the sensitivity
of a polynomial interpolation scheme, blows up exponentially with increasing number of equispaced inter-
polation nodes. In stark contrast λT grows only logarithmically in the number of Chebychev nodes.
More precisely, sophisticated theory [CB95; Ver86; Ver90] supplies the bound

λ_T ≤ 2/π · log(1 + n) + 1 .   (6.2.3.20)

Fig.: measured Lebesgue constant λ_T for Chebychev nodes versus the polynomial degree n (0 ≤ n ≤ 25), based on approximate evaluation of (5.2.4.11) by sampling.

Combining (6.2.3.20) with the general estimate from Rem. 6.2.2.29

‖f − L_T f‖_{L∞(I)} ≤ (1 + λ_T) · inf_{p∈P_n} ‖f − p‖_{L∞(I)}   ∀ f ∈ C^0(I) ,   (6.2.2.33)

and the bound for the best approximation error by polynomials for f ∈ C^r([−1, 1]) from Thm. 6.2.1.11,

inf_{p∈P_n} ‖f − p‖_{L∞([−1,1])} ≤ (1 + π²/2)^r · (n − r)!/n! · ‖f^{(r)}‖_{L∞([−1,1])} ,

we end up with a bound for the supremum norm of the interpolation error in the case of Chebychev interpolation on [−1, 1]

‖f − L_T f‖_{L∞([−1,1])} ≤ ( 2/π · log(1 + n) + 2 ) · (1 + π²/2)^r · (n − r)!/n! · ‖f^{(r)}‖_{L∞([−1,1])} .   (6.2.3.21)

Emphasizing the asymptotic behavior of the maximum norm of the interpolation error, we can infer for Chebychev interpolation

f ∈ C^r([−1, 1])   ⇒   ‖f − L_T f‖_{L∞([−1,1])} = O( log n / n^r )   for n → ∞ ,   (6.2.3.22)

which could be dubbed “almost algebraic convergence” with rate r. This guarantees convergence, if only
f ∈ C1 ([−1, 1])! y
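For the record, the “measured Lebesgue constant” in the figure above can be obtained by brute-force sampling. A minimal sketch, assuming the node set t is passed as an EIGEN vector, using a straightforward O(n²) evaluation of the Lagrange polynomials on a uniform grid; the function name lebesgueconst is our own choice:

// approximate λ_T ≈ max_x ∑_i |L_i(x)| on [−1,1] by sampling, cf. (5.2.4.11)
double lebesgueconst(const Eigen::VectorXd &t, const unsigned int N = 10000) {
  const int n = t.size();
  double lambda = 0.0;
  for (unsigned int l = 0; l <= N; ++l) {
    const double x = -1.0 + 2.0 * l / N; // sampling point
    double s = 0.0;
    for (int i = 0; i < n; ++i) { // sum of |L_i(x)| over all Lagrange polynomials
      double L = 1.0;
      for (int j = 0; j < n; ++j) {
        if (j != i) L *= (x - t(j)) / (t(i) - t(j));
      }
      s += std::abs(L);
    }
    lambda = std::max(lambda, s);
  }
  return lambda;
}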


EXPERIMENT 6.2.3.23 (Chebychev interpolation errors) Now we empirically investigate the behavior
of norms of the interpolation error for Chebychev interpolation and functions with different (smoothness)
properties as we increase the number of interpolation nodes.

In the experiments, for I = [a, b] we set x_l := a + (b − a)/N · l, l = 0, . . . , N, N = 1000, and we approximate the norms of the interpolation error as follows (p =̂ interpolating polynomial):

‖f − p‖_∞ ≈ max_{0≤l≤N} |f(x_l) − p(x_l)|   (6.2.3.24)

‖f − p‖_2^2 ≈ (b − a)/(2N) · ∑_{0≤l<N} ( |f(x_l) − p(x_l)|² + |f(x_{l+1}) − p(x_{l+1})|² )   (6.2.3.25)
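A minimal sketch of this sampling-based error measurement, assuming the interpoland f and the interpolant p are available as function handles (the helper name errnorms is our own choice, not part of the lecture codes):

// sup-norm and L2-norm of f − p on [a,b], approximated as in (6.2.3.24), (6.2.3.25)
std::pair<double, double> errnorms(const std::function<double(double)> &f,
                                   const std::function<double(double)> &p,
                                   const double a, const double b,
                                   const unsigned int N = 1000) {
  double e_prev = std::abs(f(a) - p(a));
  double supn = e_prev, l2n = 0.0;
  for (unsigned int l = 1; l <= N; ++l) {
    const double x = a + (b - a) / N * l;
    const double e = std::abs(f(x) - p(x));
    supn = std::max(supn, e);
    l2n += e_prev * e_prev + e * e; // trapezoidal-type sum as in (6.2.3.25)
    e_prev = e;
  }
  return {supn, std::sqrt((b - a) / (2.0 * N) * l2n)};
}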

➀ f (t) = (1 + t2 )−1 , I = [−5, 5] (see Ex. 6.2.2.11): analytic in a neighborhood of I .


Interpolation with n = 10 Chebychev nodes (plot on the left).
Fig. 221: f(t) = 1/(1 + t²) and its Chebychev interpolation polynomial (n = 10) on [−5, 5] (left); supremum norm ‖f − p_n‖_∞ and L²-norm ‖f − p_n‖_2 of the interpolation error versus polynomial degree n, semi-logarithmic scale (right).

Notice: exponential convergence (→ Def. 6.2.2.7) of the Chebychev interpolation:

p_n → f ,   ‖f − I_n f‖_{L∞([−5,5])} ≈ 0.8^n .

➁ f(t) = max{1 − |t|, 0}, I = [−2, 2], n = 10 nodes (plot on the left).

Now f ∈ C^0(I) but f ∉ C^1(I).

Fig. 222: f and its Chebychev interpolation polynomial on [−2, 2].   Fig. 223: error norms ‖f − p_n‖_∞ and ‖f − p_n‖_2 versus polynomial degree n, semi-logarithmic scale.   Fig. 224: the same error norms, doubly logarithmic scale.

From the doubly logarithmic plot we conclude:
• no exponential convergence,
• algebraic convergence (?)

➂ f(t) = ½·(1 + cos(πt)) for |t| < 1 and f(t) = 0 for 1 ≤ |t| ≤ 2 ,   I = [−2, 2], n = 10 (plot on the left).

Fig. 225: f and its Chebychev interpolation polynomial on [−2, 2].   Fig. 226: error norms ‖f − p_n‖_∞ and ‖f − p_n‖_2 versus polynomial degree n, doubly logarithmic scale.

Notice: only (vaguely) algebraic convergence.

Summary of observations, cf. § 6.2.2.9:

✦ Essential role of smoothness of f: slow convergence of the approximation error of the Chebychev interpolant if f enjoys little smoothness, cf. also (6.2.2.22),

✦ for analytic f ∈ C^∞ (→ Def. 6.2.2.48) the approximation error of the Chebychev interpolant seems to decay to zero exponentially in the polynomial degree n.
y

Remark 6.2.3.26 (Chebychev interpolation of analytic functions [Tre13, Ch. 8]) Assuming that the
interpoland f possesses an analytic extension to a complex neighborhood D of [−1, 1], we now apply
the theory of Section 6.2.2.3 to bound the supremum norm of the Chebychev interpolation error of f on
[−1, 1].
To convert

|f(t) − L_T f(t)| ≤ | w(t)/(2πı) ∫_γ f(z)/((z − t)·w(z)) dz | ≤ |γ|/(2π) · ( max_{a≤τ≤b} |w(τ)| · max_{z∈γ} |f(z)| ) / ( min_{z∈γ} |w(z)| · dist([a, b], γ) ) ,   (6.2.2.58)

as obtained in Section 6.2.2.3, into a more concrete estimate, we have to study the behavior of

w_n(t) = (t − t_0)(t − t_1)·…·(t − t_n) ,   t_k = cos( (2k + 1)/(2n + 2) · π ) , k = 0, . . . , n ,

where the t_k are the Chebychev nodes according to (6.2.3.12). They are the zeros of the Chebychev polynomial (→ Def. 6.2.3.3) of degree n + 1. Since w has leading coefficient 1, we conclude w = 2^{−n}·T_{n+1}, and

max_{−1≤t≤1} |w(t)| ≤ 2^{−n} .   (6.2.3.27)

Next, we fix a suitable path γ ⊂ D for integration: For a constant ρ > 1 we set

γ := { z = cos(θ − ı log ρ) , 0 ≤ θ ≤ 2π }
   = { z = ½·( exp(ı(θ − ı log ρ)) + exp(−ı(θ − ı log ρ)) ) , 0 ≤ θ ≤ 2π }
   = { z = ½·( ρ e^{ıθ} + ρ^{−1} e^{−ıθ} ) , 0 ≤ θ ≤ 2π }
   = { z = ½·(ρ + ρ^{−1})·cos θ + ı·½·(ρ − ρ^{−1})·sin θ , 0 ≤ θ ≤ 2π } .

Thus, we see that γ is an ellipse with foci ±1, large semi-axis ½(ρ + ρ^{−1}) > 1 and small semi-axis ½(ρ − ρ^{−1}) > 0.

Fig. 227: elliptical integration contours γ in the complex plane for ρ = 1, 1.2, 1.4, 1.6, 1.8, 2.

Appealing to geometric evidence, we find dist(γ, [−1, 1]) = ½(ρ + ρ^{−1}) − 1, which gives another term in (6.2.2.58).

The rationale for choosing this particular integration contour is that the cos in its definition nicely cancels the arccos in the formula for the Chebychev polynomials. This lets us compute

|2^n w(γ(θ))|² = |T_{n+1}(cos(θ − ı log ρ))|²
             = |cos((n + 1)(θ − ı log ρ))|²
             = cos((n + 1)(θ − ı log ρ)) · conj( cos((n + 1)(θ − ı log ρ)) )
             = ¼·( ρ^{n+1} e^{ı(n+1)θ} + ρ^{−(n+1)} e^{−ı(n+1)θ} )·( ρ^{n+1} e^{−ı(n+1)θ} + ρ^{−(n+1)} e^{ı(n+1)θ} )
             = ¼·( ρ^{2(n+1)} + ρ^{−2(n+1)} + e^{2ı(n+1)θ} + e^{−2ı(n+1)θ} )
             = ¼·( ρ^{n+1} − ρ^{−(n+1)} )² + ¼·( e^{ı(n+1)θ} + e^{−ı(n+1)θ} )² ≥ ¼·( ρ^{n+1} − 1 )² ,

because ρ^{−(n+1)} < 1 and ¼·(e^{ı(n+1)θ} + e^{−ı(n+1)θ})² = cos²((n + 1)θ) ≥ 0,

for all 0 ≤ θ ≤ 2π, which provides a lower bound for |w_n| on γ. Plugging all these estimates into (6.2.2.58) we arrive at

‖f − L_T f‖_{L∞([−1,1])} ≤ 2|γ|/π · 1/( (ρ^{n+1} − 1)(ρ + ρ^{−1} − 2) ) · max_{z∈γ} |f(z)| .   (6.2.3.28)

Note that instead of the nodal polynomial w we have inserted T_{n+1} into (6.2.2.58), which is a simple multiple. The factor will cancel.

The supremum norm of the interpolation error converges exponentially (ρ > 1!):

‖f − L_T f‖_{L∞([−1,1])} = O(ρ^{−n})   for n → ∞ .

y
EXPERIMENT 6.2.3.29 (Chebychev interpolation of analytic function → Exp. 6.2.3.23 cnt’d)
Modification: the same function f(t) = (1 + t²)^{−1} on a smaller interval I = [−1, 1].

(Faster) exponential convergence than on the interval I = ]−5, 5[:

‖f − I_n f‖_{L²([−1,1])} ≈ 0.42^n .

Fig. 228: supremum norm and L²-norm of the Chebychev interpolation error versus polynomial degree n, semi-logarithmic scale.

Explanation, cf. Rem. 6.2.3.26: for I = [−1, 1] the poles ±i of f are farther away relative to the size of
the interval than for I = [−5, 5]. y
Review question(s) 6.2.3.30 (Chebychev interpolation error estimates)

(Q6.2.3.30.A) Plotting norms of the Chebychev interpolation error for the “Runge function”

f(t) := 1/(1 + t²)

on [−1, 1] we observe a strange staircase pattern (Fig. 229: error norms ‖f − p_n‖_∞ and ‖f − p_n‖_2 versus polynomial degree n, semi-logarithmic scale). Guess what could be the cause of its emergence.


6.2.3.3 Chebychev Interpolation: Computational Aspects

Video tutorial for Section 6.2.3.3 "Chebychev Interpolation: Computational Aspects": (11
minutes) Download link, tablet notes

Task: Given: polynomial degree n ∈ N and a continuous function f : [−1, 1] 7→ R

Sought: efficient representation/evaluation of Chebychev interpolant p ∈ Pn (= polynomial La-


grange interpolant of degree ≤ n in Chebychev nodes (6.2.3.12) on [−1, 1]).

More concretely, this boils down to an implementation of the following class:

C++ code 6.2.3.31: Definition of class for Chebychev interpolation

class ChebInterp {
 private:
  // various internal data describing Chebychev interpolating polynomial p
 public:
  // Constructor taking function f and degree n as arguments
  template <typename Function>
  ChebInterp(const Function &f, unsigned int n);
  // Evaluation operator: y_j = p(x_j), j = 1, . . . , m (m “large”)
  template <typename Vector>
  void eval(const Vector &x, Vector &y) const;
};

Idea: internally represent p as a linear combination of Chebychev polynomials, a Chebychev expansion:

p(t) = ∑_{j=0}^{n} α_j T_j(t) ,   t ∈ R ,  α_j ∈ R ,

where T_j is the Chebychev polynomial of degree j, see Def. 6.2.3.3.

The representation (6.2.3.3) is always possible, because {T_0, . . . , T_n} is a basis of P_n, owing to deg T_n = n. The representation is amenable to efficient evaluation and computation by means of special algorithms.

§6.2.3.32 (Fast evaluation of Chebychev expansion → [Han02, Alg. 32.1]) Let us assume that the
Chebychev expansion coefficients α j in (6.2.3.3) are given and wonder, how we can efficiently compute
p( x ) for some x ∈ R:
Task: Given n ∈ N, x ∈ R, and the Chebychev expansion coefficients α j ∈ R, j = 0, . . . , n, compute
p( x ) with
p(x) = ∑_{j=0}^{n} α_j T_j(x) ,   α_j ∈ R .   (6.2.3.3)

Idea: Use the 3-term recurrence (6.2.3.5)

Tj ( x ) = 2xTj−1 ( x ) − Tj−2 ( x ) , j = 2, 3, . . . , (6.2.3.5)

to design a recursive evaluation scheme.


By means of (6.2.3.5) rewrite (6.2.3.3) as

p(x) = ∑_{j=0}^{n−1} α_j T_j(x) + α_n T_n(x)  =  ∑_{j=0}^{n−1} α_j T_j(x) + α_n·( 2x T_{n−1}(x) − T_{n−2}(x) )
     = ∑_{j=0}^{n−3} α_j T_j(x) + (α_{n−2} − α_n)·T_{n−2}(x) + (α_{n−1} + 2x α_n)·T_{n−1}(x) .

We recover the point value p(x) as the point value of another polynomial of degree n − 1 with known Chebychev expansion:

p(x) = ∑_{j=0}^{n−1} α̃_j T_j(x)   with   α̃_j = { α_j + 2x α_{j+1} , if j = n − 1 ,
                                                α_j − α_{j+2} ,   if j = n − 2 ,    (6.2.3.33)
                                                α_j ,            else.

This inspires the recursive algorithm of Code 6.2.3.34. A loop-based implementation without recursive
function calls is also possible and given in Code 6.2.3.35.

C++ code 6.2.3.34: Recursive evaluation of Chebychev expansion (6.2.3.3) ➺ GITLAB

// Recursive evaluation of a polynomial p = ∑_{j=1}^{n+1} a_j T_{j−1} at point x
// based on (6.2.3.33)
// IN : Vector of coefficients a
//      evaluation point x
// OUT: Value at point x
double recclenshaw(const VectorXd& a, const double x) {
  const VectorXd::Index n = a.size() - 1;
  if (n == 0) { return a(0); }              // Constant polynomial
  if (n == 1) { return (x * a(1) + a(0)); } // Value α1 * x + α0
  VectorXd new_a(n);
  new_a << a.head(n - 2), a(n - 2) - a(n), a(n - 1) + 2 * x * a(n);
  return recclenshaw(new_a, x); // recursion
}

Non-recursive version: Clenshaw algorithm

C++ code 6.2.3.35: Clenshaw algorithm for evaluation of Chebychev expansion (6.2.3.3) ➺ GITLAB

// Clenshaw algorithm for evaluating p = ∑_{j=1}^{n+1} a_j T_{j−1}
// at points passed in vector x
// IN : a = [α_j], coefficients for p = ∑_{j=1}^{n+1} α_j T_{j−1}
//      x = (many) evaluation points
// OUT: values p(x_j) for all j
VectorXd clenshaw(const VectorXd& a, const VectorXd& x) {
  const int n = a.size() - 1;  // degree of polynomial
  MatrixXd d(n + 1, x.size()); // temporary storage for intermediate values
  for (int c = 0; c < x.size(); ++c) d.col(c) = a;
  for (int j = n - 1; j > 0; --j) {
    d.row(j) += 2 * x.transpose().cwiseProduct(d.row(j + 1)); // see (6.2.3.33)
    d.row(j - 1) -= d.row(j + 1);
  }
  return d.row(0) + x.transpose().cwiseProduct(d.row(1));
}


Computational effort : O(nm) for evaluation at m points, m, n → ∞. y

§6.2.3.36 (Computation of Chebychev expansions of interpolants) Chebychev interpolation is a linear interpolation scheme, see § 5.1.0.21. Thus, the expansion coefficients α_j in (6.2.3.3) can be computed by solving a linear system of equations of the form (5.1.0.23). However, for Chebychev interpolation this linear system can be cast into a very special form, which paves the way for its fast direct solution:

Task: Efficiently compute the Chebychev expansion coefficients α_j in (6.2.3.3) from the interpolation conditions

p(t_k) = f(t_k) ,  k = 0, . . . , n ,   for the Chebychev nodes  t_k := cos( (2k + 1)/(2(n + 1)) · π ) .   (6.2.3.37)
Trick: Transform p into a 1-periodic function q, which turns out to be a trigonometric polynomial according to Def. 4.2.6.25. Using the definition of the Chebychev polynomials and Euler’s formula we get

q(s) := p(cos 2πs) = ∑_{j=0}^{n} α_j T_j(cos 2πs) = ∑_{j=0}^{n} α_j cos(2πjs)
      = ∑_{j=0}^{n} ½ α_j ( exp(2πıjs) + exp(−2πıjs) )    [ by cos z = ½(e^{ız} + e^{−ız}) ]
      = ∑_{j=−n}^{n+1} β_j exp(−2πıjs) ,   with   β_j := { 0 ,       for j = n + 1 ,
                                                          ½ α_j ,    for j = 1, . . . , n ,     (6.2.3.38)
                                                          α_0 ,      for j = 0 ,
                                                          ½ α_{−j} , for j = −n, . . . , −1 .

The interpolation conditions (6.2.3.37) for p become interpolation conditions for q in transformed nodes, which turn out to be equidistant:

t = cos(2πs)   ⟹   q( (2k + 1)/(4(n + 1)) ) = y_k := f(t_k) ,  k = 0, . . . , n .   (6.2.3.39)

This amounts to a Lagrange polynomial interpolation problem for equidistant points on the unit circle as we have seen them in Section 5.6.3.

Also observe the following even symmetry with respect to s = ½:

q(s) = q(1 − s)   ⟹ (with (6.2.3.39))   q( 1 − (2k + 1)/(4(n + 1)) ) = y_k ,  k = 0, . . . , n .

It ensures that the coefficients β_j actually satisfy the constraints implied by their relationship with α_j.

Fig. 230: the symmetry of q with respect to s = ½.


Thanks to the symmetry of q, see Fig. 231, we can augment the interpolation conditions (6.2.3.39) and demand

q( k/(2(n + 1)) + 1/(4(n + 1)) ) = z_k := { y_k ,        for k = 0, . . . , n ,
                                           y_{2n+1−k} , for k = n + 1, . . . , 2n + 1 .   (6.2.3.40)

In a sense, we have just reflected the interpolation conditions at s = ½ (Fig. 231: the nodes in [0, 1] and their reflections at s = ½).

Let us summarize our insights gained from switching from p to q:

Chebychev expansion for Chebychev interpolants
⟺
Trigonometric interpolation at equidistant points → Section 5.6.3

Trigonometric interpolation at equidistant points can be done very efficiently by means of FFT-based algorithms, see Code 5.6.3.4. We can also apply these for the computation of Chebychev expansion coefficients.

From (6.2.3.40) we can derive a (2n + 2) × (2n + 2) square linear system of equations for the unknown coefficients β_j.

q( k/(2(n + 1)) + 1/(4(n + 1)) ) = ∑_{j=−n}^{n+1} β_j exp( −2πıj/(4(n + 1)) ) · exp( −2πı/(2(n + 1)) · kj ) = z_k .

⟺

∑_{j=0}^{2n+1} β_{j−n} exp( −2πı(j − n)/(4(n + 1)) ) · exp( −2πı/(2(n + 1)) · kj ) = exp( −πı·nk/(n + 1) )·z_k ,  k = 0, . . . , 2n + 1 ,

where exp( −2πı/(2(n + 1)) · kj ) = ω_{2(n+1)}^{kj} .

⟺

F_{2(n+1)} c = b   with   c = [ β_{j−n} exp( −2πı(j − n)/(4(n + 1)) ) ]_{j=0}^{2n+1} ,   b = [ exp( −πı·nk/(n + 1) )·z_k ]_{k=0}^{2n+1} ,   (6.2.3.41)

where F_{2(n+1)} is the (2n + 2) × (2n + 2) Fourier matrix, see (4.2.1.13).


Thus we can solve (6.2.3.41) by means of the inverse discrete Fourier transform of length 2(n + 1), see Section 4.2. Using the FFT algorithm we can do this with an asymptotic complexity of O(n log n) for n → ∞ as explained in Section 4.3.
Note that by the symmetry of the data vector z implied by (6.2.3.40) we find β_{n+1} = 0, which complies with (6.2.3.38). The following code demonstrates the FFT-based computation of the coefficients of the Chebychev expansion of the Chebychev interpolant on [−1, 1].


C++ code 6.2.3.42: Efficient computation of Chebychev expansion coefficients of Chebychev interpolant ➺ GITLAB

// efficiently compute coefficients α_j in the Chebychev expansion
// p = ∑_{j=0}^{n} α_j T_j of p ∈ P_n based on values y_k,
// k = 0, . . . , n, in Chebychev nodes t_k, k = 0, . . . , n
// IN: values y_k passed in y
// OUT: coefficients α_j
VectorXd chebexp(const VectorXd& y) {
  const Index n = y.size() - 1;          // degree of polynomial
  const std::complex<double> M_I(0, 1);  // imaginary unit
  // create vector z, see (6.2.3.40)
  VectorXcd b(2 * (n + 1));
  const std::complex<double> om =
      -M_I * (M_PI * static_cast<double>(n)) / static_cast<double>(n + 1);
  for (int j = 0; j <= n; ++j) {
    b(j) = std::exp(om * static_cast<double>(j)) * y(j); // this cast to double is necessary!!
    b(2 * n + 1 - j) = std::exp(om * static_cast<double>(2 * n + 1 - j)) * y(j);
  }

  // Solve linear system (6.2.3.41) with effort O(n log n)
  Eigen::FFT<double> fft;    // EIGEN's helper class for DFT
  VectorXcd c = fft.inv(b);  // -> c = ifft(z), inverse fourier transform
  // recover β_j, see (6.2.3.41)
  VectorXd beta(c.size());
  const std::complex<double> sc = M_PI_2 / static_cast<double>(n + 1) * M_I;
  for (unsigned j = 0; j < c.size(); ++j) {
    beta(j) = (std::exp(sc * static_cast<double>(-n + j)) * c[j]).real();
  }
  // recover α_j, see (6.2.3.38): α_0 = β_0, α_j = 2 β_j for j = 1, . . . , n
  VectorXd alpha = 2 * beta.segment(n, n + 1);
  alpha(0) = beta(n);
  return alpha;
}
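Putting the pieces together, the Chebychev interpolant of a function f on [−1, 1] can be built and evaluated as follows. This is only a sketch under our own naming (chebinterpeval is a hypothetical helper, not part of the lecture codes); it relies on chebexp() from Code 6.2.3.42 and clenshaw() from Code 6.2.3.35:

// Sketch: values of the degree-n Chebychev interpolant of f at the points in x
VectorXd chebinterpeval(const std::function<double(double)>& f,
                        const unsigned int n, const VectorXd& x) {
  VectorXd y(n + 1); // values of f in the Chebychev nodes, cf. (6.2.3.37)
  for (unsigned int k = 0; k <= n; ++k) {
    y(k) = f(std::cos((2.0 * k + 1.0) / (2.0 * (n + 1)) * M_PI));
  }
  const VectorXd alpha = chebexp(y); // expansion coefficients, cost O(n log n)
  return clenshaw(alpha, x);         // evaluation, cost O(n·m) for m points
}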

Remark 6.2.3.43 (Chebychev representation of built-in functions) Computers use approximation by sums of Chebychev polynomials in the computation of functions like log, exp, sin, cos, . . .. The evaluation by means of the Clenshaw algorithm according to Code 6.2.3.35 is more efficient and stable than the approximation by Taylor polynomials. y
Review question(s) 6.2.3.44 (Chebychev Interpolation: Computational Aspects)
(Q6.2.3.44.A) Outline an efficient algorithm for computing the Chebychev expansion
p(t) = ∑_{k=0}^{n} a_k T_k(t) ,   t ∈ R ,  a_k ∈ R ,

of the Lagrange polynomial interpolant of f ∈ C0 ([−1, 1]) for an arbitrary node set T = {t0 , . . . , tn },
n ∈ N.
(Q6.2.3.44.B) Devise an efficient algorithm for evaluating a polynomial of degree n ∈ N given through its
Chebychev expansion
p(t) = ∑_{k=0}^{n} a_k T_k(t) ,   t ∈ R ,  a_k ∈ R ,


at all the points

x_k := cos( k/m · π ) ,  k = 0, . . . , m ,   m ∈ N .

6.3 Mean Square Best Approximation


There is a particular family of norms for which the best approximant of a function f in a finite dimensional
function space VN , that is, the element of VN that is closest to f with respect to that particular norm can
actually be computed. It turns out that this computation boils down to solving a kind of least squares
problem, similar to the least squares problems in K n discussed in Chapter 3.

6.3.1 Abstract Theory


Concerning mean square best approximation it is useful to learn an abstract framework first into which the
concrete examples can be fit later.

6.3.1.1 Mean Square Norms

Mean square norms generalize the Euclidean norm on K n , see [NS02, Sect. 4.4]. In a sense, they endow
a vector space with a geometry and give a meaning to concepts like “orthogonality”.

Definition 6.3.1.1. (Semi-)inner product [NS02, Sect. 4.4]

Let V be a vector space over the field K. A mapping b : V × V → K is called an inner product on
V , if it satisfies
(i) b is linear in the first argument: b(αv + βw, u) = αb(v, u) + βb(w, u) for all α, β ∈ K,
u, v, w ∈ V ,
(ii) b is (anti-)symmetric: b(v, w) = conj(b(w, v))  (conj =̂ complex conjugation),
(iii) b is positive definite: v ≠ 0 ⇔ b(v, v) > 0.
b is a semi-inner product, if it still complies with (i) and (ii), but is only positive semi-definite:
b(v, v) ≥ 0 for all v ∈ V .

✎ notation: usually we write (·, ·)V for an inner product on the vector space V .

Definition 6.3.1.2. Orthogonality

Let V be a vector space equipped with a (semi-)inner product (·, ·)V . Any two elements v and w of
V are called orthogonal, if (v, w)V = 0. We write v ⊥ w.

✎ notation: If W ⊂ V is a (closed) subspace: v ⊥ W :⇔ (v, w)_V = 0 ∀ w ∈ W .


Theorem 6.3.1.3. Mean square (semi-)norm/Inner product (semi-)norm

If (·, ·)V is a (semi-)inner product (→ Def. 6.3.1.1) on the vector space V , then
‖v‖_V := √( (v, v)_V )

defines a (semi-)norm (→ Def. 1.5.5.4) on V , the mean square (semi-)norm/ inner product
(semi-)norm induced by (·, ·)V .

§6.3.1.4 (Examples for mean square norms)

✦ The Euclidean norm on K^n induced by the dot product (Euclidean inner product)

(x, y)_{K^n} := ∑_{j=1}^{n} (x)_j (y)_j   [“Mathematical indexing” !]   x, y ∈ K^n .

✦ The L²-norm (5.2.4.6) on C^0([a, b]) induced by the L²([a, b]) inner product

(f, g)_{L²([a,b])} := ∫_a^b f(τ) g(τ) dτ ,   f, g ∈ C^0([a, b]) .   (6.3.1.5)

6.3.1.2 Normal Equations

Mean square best approximation = best approximation in a mean square norm

From § 3.1.1.8 we know that in Euclidean space K n the best approximation of vector x ∈ K n in a subspace
V ⊂ K n is unique and given by the orthogonal projection of x onto V . Now we generalize this to vector
spaces equipped with inner products.

X =̂ a vector space over K = R, equipped with a mean square semi-norm ‖·‖_X induced by a semi-inner product (·, ·)_X, see Thm. 6.3.1.3.
It can be an infinite dimensional function space, e.g., X = C^0([a, b]).

V =̂ a finite-dimensional subspace of X, with basis B_V := {b_1, . . . , b_N} ⊂ V, N := dim V.

Assumption 6.3.1.6.

The semi-inner product (·, ·) X is a genuine inner product (→ Def. 6.3.1.1) on V , that is, it is positive
definite: (v, v) X > 0 ∀v ∈ V \ {0}.

Now we give a formula for the element q of V, which is nearest to a given element f of X with respect to the norm ‖·‖_X. This is a genuine generalization of Thm. 3.1.2.1.

Theorem 6.3.1.7. Mean square norm best approximation through normal equations

Given any f ∈ X there is a unique q ∈ V such that

‖f − q‖_X = inf_{p∈V} ‖f − p‖_X .

Its coefficients γ_j, j = 1, . . . , N, with respect to the basis B_V := {b_1, . . . , b_N} of V (q = ∑_{j=1}^{N} γ_j b_j) are the unique solution of the normal equations

M [γ_j]_{j=1}^{N} = [ (f, b_j)_X ]_{j=1}^{N} ,   M := [ (b_1, b_1)_X  . . .  (b_1, b_N)_X ;  ⋮  ; (b_N, b_1)_X  . . .  (b_N, b_N)_X ] ∈ K^{N,N} ,  i.e.  (M)_{k,j} = (b_k, b_j)_X .   (6.3.1.8)

Proof. (inspired by Rem. 3.1.2.5) We first show that M is s.p.d. (→ Def. 1.1.2.6). Symmetry is clear from the definition and the symmetry of (·, ·)_X. That M is even positive definite follows from

x^H M x = ∑_{k=1}^{N} ∑_{j=1}^{N} ξ_k ξ_j (b_k, b_j)_X = ‖ ∑_{j=1}^{N} ξ_j b_j ‖_X^2 > 0 ,   (6.3.1.9)

if x := [ξ_j]_{j=1}^{N} ≠ 0 ⇔ ∑_{j=1}^{N} ξ_j b_j ≠ 0, since ‖·‖_X is a norm on V by Ass. 6.3.1.6.

Now, writing c := [γ_j]_{j=1}^{N} ∈ K^N, b := [ (f, b_j)_X ]_{j=1}^{N} ∈ K^N, and using the basis representation q = ∑_{j=1}^{N} γ_j b_j, we find

Φ(c) := ‖f − q‖_X^2 = ‖f‖_X^2 − 2 b^⊤ c + c^⊤ M c .

Applying the differentiation rules from Ex. 8.5.1.19 to Φ : K^N → R, c ↦ Φ(c), we obtain

grad Φ(c) = 2(Mc − b) ,   (6.3.1.10)
H Φ(c) = 2M .   (independent of c!)   (6.3.1.11)

Since M is s.p.d., the unique solution c of grad Φ(c) = 0, that is, of Mc = b, yields the unique global minimizer of Φ; the Hessian 2M is s.p.d. everywhere!

The unique q from Thm. 6.3.1.7 is called the best approximant of f in V .

Corollary 6.3.1.12. Best approximant by orthogonal projection

If q is the best approximant of f in V , then f − q is orthogonal to every p ∈ V :

( f − q, p) X = 0 ∀ p ∈ V ⇔ f −q ⊥ V .


The message of Cor. 6.3.1.12: the best approximation error f − q for f ∈ X in V is orthogonal to the subspace V. See § 3.1.1.8 for a related discussion in Euclidean space K^n.

Fig. 232: orthogonal projection of f onto the subspace V yielding the best approximant q.

Remark 6.3.1.13 (Connection with linear least squares problems, Chapter 3) In Section 3.1.1 we introduced the concept of least squares solutions of overdetermined linear systems of equations Ax = b, A ∈ R^{m,n}, m > n, see Def. 3.1.1.1. Thm. 3.1.2.1 taught that the normal equations A^⊤ A x = A^⊤ b give the least squares solution, if rank(A) = n.
In fact, Thm. 3.1.2.1 and the above Thm. 6.3.1.7 agree if X = K^n (Euclidean space) and V = Span{a_1, . . . , a_n}, where a_j ∈ R^m are the columns of A and N = n. y
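As an illustration of Thm. 6.3.1.7, here is a minimal sketch of how the normal equations (6.3.1.8) could be assembled and solved in EIGEN for a discrete inner product of the form (f, g) := ∑_i f(t_i) g(t_i) on given sampling points. The helper name bestapproxcoeffs and the use of std::function are our own choices for illustration:

// coefficients γ_1,...,γ_N of the best approximant of f in V = span{b_1,...,b_N}
// with respect to the discrete inner product on the points stored in t
VectorXd bestapproxcoeffs(const std::vector<std::function<double(double)>> &basis,
                          const std::function<double(double)> &f,
                          const VectorXd &t) {
  const unsigned int N = basis.size();
  MatrixXd M(N, N);
  VectorXd rhs(N);
  for (unsigned int k = 0; k < N; ++k) {
    for (unsigned int j = 0; j < N; ++j) {
      double s = 0.0;
      for (int i = 0; i < t.size(); ++i) s += basis[k](t(i)) * basis[j](t(i));
      M(k, j) = s; // entry (b_{k+1}, b_{j+1}) of the Gram matrix, cf. (6.3.1.8)
    }
    double s = 0.0;
    for (int i = 0; i < t.size(); ++i) s += f(t(i)) * basis[k](t(i));
    rhs(k) = s; // right-hand side entry (f, b_{k+1})
  }
  return M.llt().solve(rhs); // M is s.p.d., see the proof of Thm. 6.3.1.7
}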

6.3.1.3 Orthonormal Bases

In the setting of Section 6.3.1.2 we may ask: Which choice of basis B = {b_1, . . . , b_N} of V ⊂ X renders the normal equations (6.3.1.8) particularly simple? Answer: A basis B, for which (b_k, b_j)_X = δ_{kj} (δ_{kj} the Kronecker symbol), because this will imply M = I for the coefficient matrix of the normal equations.

Definition 6.3.1.14. Orthonormal basis

A subset {b_1, . . . , b_N} of an N-dimensional vector space V with inner product (→ Def. 6.3.1.1) (·, ·)_V is an orthonormal basis (ONB), if (b_k, b_j)_V = δ_{kj}.

A basis {b_1, . . . , b_N} of V is called orthogonal, if (b_k, b_j)_V = 0 for k ≠ j.

Corollary 6.3.1.15. ONB representation of best approximant

If {b_1, . . . , b_N} is an orthonormal basis (→ Def. 6.3.1.14) of V ⊂ X, then the best approximant q := argmin_{p∈V} ‖f − p‖_X of f ∈ X has the representation

q = ∑_{j=1}^{N} (f, b_j)_X · b_j .   (6.3.1.16)

§6.3.1.17 (Gram-Schmidt orthonormalization) From Section 1.5.1 we already know how to compute orthonormal bases: The algorithm from § 1.5.1.1 can be run in the framework of any vector space V endowed with an inner product (·, ·)_V and induced mean square norm ‖·‖_V.

Algorithm (6.3.1.18):
1: b_1 := p_1/‖p_1‖_V   % 1st output vector
2: for j = 2, . . . , k do {   % orthogonal projection
3:   b_j := p_j
4:   for ℓ = 1, 2, . . . , j − 1 do
5:     { b_j ← b_j − (p_j, b_ℓ)_V · b_ℓ }
6:   if ( b_j = 0 ) then STOP
7:   else { b_j ← b_j/‖b_j‖_V }
   }

Theorem 6.3.1.19. Gram-Schmidt orthonormalization

When supplied with k ∈ N linearly independent vectors p_1, . . . , p_k ∈ V in a vector space with inner product (·, ·)_V, Algorithm (6.3.1.18) computes vectors b_1, . . . , b_k with

(b_ℓ, b_j)_V = δ_{ℓj} ,  ℓ, j ∈ {1, . . . , k} ,   Span{b_1, . . . , b_ℓ} = Span{p_1, . . . , p_ℓ}   for all ℓ ∈ {1, . . . , k}.

This suggests the following alternative approach to the computation of the mean square best approximant
q in V of f ∈ X :
➊ Orthonormalize a basis {b1 , . . . , b N } of V , N := dim V , using Gram-Schmidt algorithm (6.3.1.18).
➋ Compute q according to (6.3.1.16).
Number of inner products to be evaluated: O( N 2 ) for N → ∞.
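A minimal sketch of Algorithm (6.3.1.18) for “vectors” stored as EIGEN vectors and an arbitrary user-supplied inner product; the template and the name gramschmidt are our own illustration, not a library facility:

// Gram-Schmidt orthonormalization w.r.t. a generic inner product functor ip(u, v)
template <typename InnerProduct>
std::vector<VectorXd> gramschmidt(const std::vector<VectorXd> &p, InnerProduct ip) {
  std::vector<VectorXd> b;
  for (std::size_t j = 0; j < p.size(); ++j) {
    VectorXd bj = p[j];
    for (const VectorXd &bl : b) bj -= ip(p[j], bl) * bl; // orthogonal projection
    const double nrm = std::sqrt(ip(bj, bj));
    if (nrm == 0.0) break;   // p_j linearly dependent on the previous vectors: STOP
    b.push_back(bj / nrm);   // normalize
  }
  return b;
}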
Review question(s) 6.3.1.20 (Mean-square best approximation: abstract theory)
(Q6.3.1.20.A) Let V be a finite-dimensional vector space with inner product (·, ·)V . What is an orthonor-
mal basis (ONB) of V ?
(Q6.3.1.20.B) Let X be a finite-dimensional real vector space with inner product (·, ·) X and equipped with
a basis {b1 , . . . , b N }, N := dim X .
Show that the coefficient matrix
 
M := [ (b_1, b_1)_X  . . .  (b_1, b_N)_X ;  ⋮  ; (b_N, b_1)_X  . . .  (b_N, b_N)_X ] ∈ R^{N,N} ,  i.e.  (M)_{k,j} = (b_k, b_j)_X ,

is symmetric positive definite.

Definition 1.1.2.6. Symmetric positive definite (s.p.d.) matrices

M ∈ K n,n , n ∈ N, is symmetric (Hermitian) positive definite (s.p.d.), if

M = M^H  and  ∀ x ∈ K^n :  x^H M x > 0  ⇔  x ≠ 0 .

If xH Mx ≥ 0 for all x ∈ K n ✄ M positive semi-definite.


6.3.2 Polynomial Mean Square Best Approximation


Now we apply the results of Section 6.3.1 in the following setting:

X : function space C^0([a, b]), −∞ < a < b < ∞, of R-valued continuous functions,


V : space Pm of polynomials of degree ≤ m.

Remark 6.3.2.1 (Inner products on spaces Pm of polynomials) To match the abstract framework of
Section 6.3.1 we need to find (semi-)inner products on C0 ([ a, b]) that supply positive definite inner prod-
ucts on Pm . The following options are commonly considered:
✦ On any interval [a, b] we can use the L²([a, b])-inner product (·, ·)_{L²([a,b])}, defined in (6.3.1.5).
✦ Given a positive integrable weight function

w : [a, b] → R ,  w(t) > 0 for all t ∈ [a, b] ,  ∫_a^b |w(t)| dt < ∞ ,   (6.3.2.2)

we can consider the weighted L²-inner product on the interval [a, b]

(f, g)_{w,[a,b]} := ∫_a^b w(τ) f(τ) g(τ) dτ .   (6.3.2.3)

✦ For n ≥ m and n + 1 distinct points collected in the set T := {t_0, t_1, . . . , t_n} ⊂ [a, b] we can use the discrete L²-inner product

(f, g)_T := ∑_{j=0}^{n} f(t_j) g(t_j) .   (6.3.2.4)

Since a polynomial of degree ≤ m must be zero everywhere, if it vanishes in at least m + 1 distinct


points, ( f , g)T is positive definite on Pm .
For all these inner products on Pm holds

({t 7→ t f (t)}, g) X = ( f , {t 7→ tg(t)}) X , f , g ∈ C0 ([ a, b]) , (6.3.2.5)

that is, multiplication with the independent variable can be shifted to the other function inside the inner
product.

✎ notation: Note that we have to plug a function into the slots of the inner products; this is indicated by
the notation {t 7→ . . .}.

Assumption 6.3.2.6. Self-adjointness of multiplication operator

We assume the inner product (·, ·) X to satisfy (6.3.2.5).

The ideas of Section 6.3.1.3 that center around the use of orthonormal bases can also be applied to polynomials.

Definition 6.3.2.7. Orthonormal polynomials → Def. 6.3.1.14


Let (·, ·) X be an inner product on Pm . A sequence r0 , r1 , . . . , rm provides orthonormal polynomials
(ONPs) with respect to (·, ·) X , if

rℓ ∈ Pℓ , (rk , rℓ ) X = δkℓ , ℓ, k ∈ {0, . . . , m} . (6.3.2.8)

The polynomials are just orthogonal, if (r_k, r_ℓ)_X = 0 for k ≠ ℓ.

By virtue of Thm. 6.3.1.19 orthonormal polynomials can be generated by applying Gram-Schmidt orthonormalization from § 6.3.1.17 to the ordered basis of monomials {t ↦ t^j}_{j=0}^{m}.

Lemma 6.3.2.9. Uniqueness of orthonormal polynomials

The sequence of orthonormal polynomials from Def. 6.3.2.7 is unique up to signs, supplies an
(·, ·) X -orthonormal basis (→ Def. 6.3.1.14) of Pm , and satisfies

Span{r0 , . . . , rk } = Pk , k ∈ {0, . . . , m} . (6.3.2.10)

Proof. Comparing Def. 6.3.1.14 and (6.3.2.8) the ONB-property of {r_0, . . . , r_m} is immediate. Then (6.3.2.10) follows from dimensional considerations.

r_0 must be a constant, which, up to sign, is fixed by the normalization condition ‖r_0‖_X = 1.

P_{k−1} ⊂ P_k has co-dimension 1 so that there is a unit “vector” in P_k, which is orthogonal to P_{k−1} and unique up to sign.

§6.3.2.11 (Orthonormal polynomials by orthogonal projection) Let r_0, . . . , r_m be a sequence of orthonormal polynomials according to Def. 6.3.2.7. From (6.3.2.8) we conclude that r_k ∈ P_k has leading coefficient ≠ 0.

Hence s_k(t) := t·r_k(t) is a polynomial of degree k + 1 with leading coefficient ≠ 0, that is s_k ∈ P_{k+1} \ P_k. Therefore, r_{k+1} can be obtained by orthogonally projecting s_k onto P_k plus normalization, cf. Lines 4-5 of Algorithm (6.3.1.18):

r_{k+1} = ± r̃_{k+1}/‖r̃_{k+1}‖_X ,   r̃_{k+1} = s_k − ∑_{j=0}^{k} (s_k, r_j)_X · r_j .   (6.3.2.12)

Straightforward computations confirm that r_{k+1} ⊥ P_k.

The sum in (6.3.2.12) collapses to two terms! In fact, since (r_k, q)_X = 0 for all q ∈ P_{k−1}, by Ass. 6.3.2.6

(s_k, r_j)_X = ( {t ↦ t·r_k(t)}, r_j )_X = ( r_k, {t ↦ t·r_j(t)} )_X = 0 ,   if j < k − 1 ,

because in this case {t ↦ t·r_j(t)} ∈ P_{k−1}. As a consequence (6.3.2.12) reduces to the 3-term recursion

r_{k+1} = ± r̃_{k+1}/‖r̃_{k+1}‖_X ,   r̃_{k+1} = s_k − ( {t ↦ t·r_k}, r_k )_X · r_k − ( {t ↦ t·r_k}, r_{k−1} )_X · r_{k−1} ,   k = 0, 1, . . . , m − 1 .   (6.3.2.13)


The recursion starts with r_{−1} := 0, r_0 = {t ↦ 1/‖1‖_X}. y

The 3-term recursion (6.3.2.13) can be recast in various ways. Forgoing normalization the next theorem
presents one of them.

Theorem 6.3.2.14. 3-term recursion for orthogonal polynomials

Given any inner product (·, ·)_X on P_m, m ∈ N, define p_{−1} := 0, p_0 = 1, and

p_{k+1}(t) := (t − α_{k+1})·p_k(t) − β_k·p_{k−1}(t) ,  k = 0, 1, . . . , m − 1 ,
with   α_{k+1} := ( {t ↦ t·p_k(t)}, p_k )_X / ‖p_k‖_X^2 ,   β_k := ‖p_k‖_X^2 / ‖p_{k−1}‖_X^2 .   (6.3.2.15)

Then ✦ pk ∈ Pk has leading coefficient = 1, and


✦ { p0 , p1 , . . . , pm } is an orthogonal basis of Pm .

Proof. (by rather straightforward induction) We first confirm, thanks to the definition of α_1,

(p_0, p_1)_X = ( p_0, {t ↦ (t − α_1)·p_0(t)} )_X = ( p_0, {t ↦ t·p_0(t)} )_X − α_1·(p_0, p_0)_X = 0 .

For the induction step we assume that the assertion is true for p_0, . . . , p_k and observe that for p_{k+1} according to (6.3.2.15) we have

(p_k, p_{k+1})_X = ( p_k, {t ↦ (t − α_{k+1})·p_k(t)} − β_k·p_{k−1} )_X
               = ( p_k, {t ↦ t·p_k(t)} )_X − α_{k+1}·(p_k, p_k)_X − β_k·(p_k, p_{k−1})_X = 0 ,

(p_{k−1}, p_{k+1})_X = ( p_{k−1}, {t ↦ (t − α_{k+1})·p_k(t)} − β_k·p_{k−1} )_X
                   = ( p_{k−1}, {t ↦ t·p_k(t)} )_X − α_{k+1}·(p_{k−1}, p_k)_X − β_k·(p_{k−1}, p_{k−1})_X = 0 ,

(p_ℓ, p_{k+1})_X = ( p_ℓ, {t ↦ (t − α_{k+1})·p_k(t)} − β_k·p_{k−1} )_X
               = ( p_ℓ, {t ↦ t·p_k(t)} )_X − α_{k+1}·(p_ℓ, p_k)_X − β_k·(p_ℓ, p_{k−1})_X = 0 ,   ℓ = 0, . . . , k − 2 .

This amounts to the assertion of orthogonality for k + 1. Above, several inner products vanish because of the induction hypothesis!

Remark 6.3.2.16 (L²([−1, 1])-orthogonal polynomials) An important inner product on C^0([−1, 1]) is the L²-inner product, see (6.3.1.5)

(f, g)_{L²([−1,1])} := ∫_{−1}^{1} f(τ) g(τ) dτ ,   f, g ∈ C^0([−1, 1]) .

It is a natural question what is the unique sequence of L²([−1, 1])-orthonormal polynomials. Their rather simple characterization will be discussed in the sequel.

Legendre polynomials

The Legendre polynomials P_n can be defined by the 3-term recursion

P_{n+1}(t) := (2n + 1)/(n + 1) · t·P_n(t) − n/(n + 1) · P_{n−1}(t) ,   P_0 := 1 ,  P_1(t) := t .   (7.4.2.21)

Fig. 233: Legendre polynomials P_0, . . . , P_5 on [−1, 1].

P_k ∈ P_k is immediate and so is the parity

P_k(t) = P_k(−t) if k is even ,   P_k(t) = −P_k(−t) if k is odd .   (6.3.2.17)

Orthogonality, (P_k, P_m)_{L²([−1,1])} = 0 for m ≠ k, as well as ‖P_k‖_{L²([−1,1])}^2 = 2/(2k + 1), can be proved by induction based on (6.3.2.17). This implies that the L²([−1, 1])-orthonormal polynomials are r_k = √(k + ½) · P_k and their 3-term recursion from Thm. 6.3.2.14 reads

r_{n+1}(t) = √(4(n + 1)² − 1)/(n + 1) · t·r_n(t) − n/(n + 1) · √( (4(n + 1)² − 1)/(4n² − 1) ) · r_{n−1}(t) ,   (6.3.2.18)
r_{−1} := 0 ,  r_0 ≡ ½·√2 .   (6.3.2.19)
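A minimal sketch of how the recursion (7.4.2.21) can be evaluated, in the same spirit as Code 6.2.3.6; the function name legendrepolmult is our own choice, and row k of the returned matrix holds the values P_k(x_j):

// values of the Legendre polynomials P_0, . . . , P_d at the points in x
MatrixXd legendrepolmult(const unsigned int d, const RowVectorXd &x) {
  MatrixXd V = MatrixXd::Ones(d + 1, x.size()); // P_0 ≡ 1
  if (d == 0) return V;
  V.row(1) = x;                                 // P_1(t) = t
  for (unsigned int n = 1; n < d; ++n) {        // 3-term recursion (7.4.2.21)
    V.row(n + 1) = ((2.0 * n + 1.0) / (n + 1.0)) * x.cwiseProduct(V.row(n))
                 - (static_cast<double>(n) / (n + 1.0)) * V.row(n - 1);
  }
  return V;
}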

§6.3.2.20 (Discrete orthogonal polynomials) Since they involve integrals, weighted L2 -inner products
(6.3.2.3) are not accessible computationally, unless one resigns to approximation, see Chapter 7 for cor-
responding theory and techniques.

Therefore, given a point set T := {t0 , t1 , . . . , tn }, we focus on the associated discrete L2 -inner product

(f, g)_X := (f, g)_T := ∑_{j=0}^{n} f(t_j) g(t_j) ,   f, g ∈ C^0([a, b]) ,   (6.3.2.4)

which is positive definite on Pn and satisfies Ass. 6.3.2.6.

The polynomials pk generated by the 3-term recursion (6.3.2.15) from Thm. 6.3.2.14 are then called
discrete orthogonal polynomials. The following C++ code computes the recursion coefficients αk and β k ,
k = 1, . . . , n − 1.

C++ code 6.3.2.21: Computation of weights in 3-term recursion for discrete orthogonal polynomials ➺ GITLAB

// Computation of coefficients α, β from Thm. 6.3.2.14
// IN : t = points in the definition of the discrete L2-inner product
//      n = maximal index desired
//      alpha, beta are used to save coefficients of recursion
void coeffortho(const VectorXd& t, const Index n, VectorXd& alpha, VectorXd& beta) {
  const Index m = t.size(); // maximal degree of orthogonal polynomial
  alpha = VectorXd(std::min(n - 1, m - 2) + 1);
  beta = VectorXd(std::min(n - 1, m - 2) + 1);
  alpha(0) = t.sum() / static_cast<double>(m);
  // initialization of recursion; we store only the values of
  // the polynomials at the points in T
  VectorXd p0;
  VectorXd p1 = VectorXd::Ones(m);
  VectorXd p2 = t - alpha(0) * VectorXd::Ones(m);
  for (Index k = 0; k < std::min(n - 1, m - 2); ++k) {
    p0 = p1; p1 = p2;
    // 3-term recursion (6.3.2.15)
    alpha(k + 1) = p1.dot(t.cwiseProduct(p1)) / p1.squaredNorm();
    beta(k) = p1.squaredNorm() / p0.squaredNorm();
    p2 = (t - alpha(k + 1) * VectorXd::Ones(m)).cwiseProduct(p1) - beta(k) * p0;
  }
}

§6.3.2.22 (Polynomial fitting) Given a point set T := {t_0, t_1, . . . , t_n} ⊂ [a, b], and a function f : [a, b] → K, we may seek to approximate f by its polynomial best approximant with respect to the discrete L²-norm ‖·‖_T induced by the discrete L²-inner product (6.3.2.4).

Definition 6.3.2.23. Fitted polynomial

Given a point set T := {t_0, t_1, . . . , t_n} ⊂ [a, b], and a function f : [a, b] → K we call

q_k := argmin_{p∈P_k} ‖f − p‖_T ,   k ∈ {0, . . . , n} ,

the fitted polynomial to f on T of degree k, k ≤ n.

The stable and efficient computation of fitting polynomials can rely on combining Thm. 6.3.2.14 with Cor. 6.3.1.15 (see the sketch after this list):
➊ (Pre-)compute the weights α_ℓ and β_ℓ for the 3-term recursion (6.3.2.15).
➋ (Pre-)compute the values of the orthogonal polynomials p_k at desired evaluation points x_i ∈ R, i = 1, . . . , N.
➌ Compute the inner products (f, p_ℓ)_X, ℓ = 0, . . . , k, and use (6.3.1.16) to linearly combine the vectors [p_ℓ(x_i)]_{i=1}^{N}, ℓ = 0, . . . , k.
This yields [q(x_i)]_{i=1}^{N} for the fitting polynomial q ∈ P_k. y
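The following is only a sketch of steps ➊-➌ under our own naming (polyfiteval is a hypothetical helper); it uses coeffortho() from Code 6.3.2.21 and, since the p_j from Thm. 6.3.2.14 are merely orthogonal (not orthonormal), it divides by ‖p_j‖_T² instead of working with normalized polynomials:

// values of the fitted polynomial q_k (→ Def. 6.3.2.23) of f on the point set t
// at the evaluation points x
VectorXd polyfiteval(const std::function<double(double)> &f, const VectorXd &t,
                     const unsigned int k, const VectorXd &x) {
  VectorXd alpha, beta;
  coeffortho(t, k + 1, alpha, beta); // ➊ recursion coefficients, Code 6.3.2.21
  VectorXd ft(t.size());
  for (int i = 0; i < t.size(); ++i) ft(i) = f(t(i)); // samples f(t_j)
  // values of p_{j-1}, p_j on the nodes t and the evaluation points x,
  // advanced by the 3-term recursion (6.3.2.15) (step ➋)
  VectorXd pt_old = VectorXd::Zero(t.size()), pt = VectorXd::Ones(t.size());
  VectorXd px_old = VectorXd::Zero(x.size()), px = VectorXd::Ones(x.size());
  VectorXd q = VectorXd::Zero(x.size());
  for (unsigned int j = 0; j <= k; ++j) {
    q += (ft.dot(pt) / pt.squaredNorm()) * px; // ➌ add (f,p_j)_T/(p_j,p_j)_T · p_j(x)
    if (j == k) break;
    VectorXd pt_new = (t.array() - alpha(j)).matrix().cwiseProduct(pt);
    VectorXd px_new = (x.array() - alpha(j)).matrix().cwiseProduct(px);
    if (j > 0) { pt_new -= beta(j - 1) * pt_old; px_new -= beta(j - 1) * px_old; }
    pt_old = pt; pt = pt_new;
    px_old = px; px = px_new;
  }
  return q;
}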
EXAMPLE 6.3.2.24 (Approximation by discrete polynomial fitting) We use equidistant points T := {t_k = −1 + k·2/m, k = 0, . . . , m} ⊂ [−1, 1], m ∈ N, to compute fitting polynomials (→ Def. 6.3.2.23) for different functions.
We monitor the L²-norm and L∞-norm of the approximation error, both norms approximated by sampling in ξ_j = −1 + j/500, j = 0, . . . , 1000.

➀ f (t) = (1 + (5t)2 )−1 , I = [−1, 1] → Ex. 6.2.2.11, analytic in complex neighborhood of [−1, 1]:


Fig. 234: the function f(t) = (1 + (5t)²)^{−1} and its fitting polynomials of degrees n = 0, 2, 4, 6, 8, 10 on [−1, 1].   Fig. 235 (“Polynomial fitting of Runge function: equidistant points”): L∞- and L²-norms of the error for m = 50, . . . , 400 versus polynomial degree n, semi-logarithmic scale.
➣ We observe exponential convergence (→ Def. 6.2.2.7) in the polynomial degree n.

➁ f(t) = max{0, 1 − 2·|t + ¼|}, f only in C^0([−1, 1]):


Fig. 236: the tent function f and its fitting polynomials of degrees n = 0, 2, . . . , 10 on [−1, 1].   Fig. 237 (“Polynomial fitting of tent function: equidistant points”): L∞- and L²-norms of the error for m = 50, . . . , 400 versus polynomial degree n, doubly logarithmic scale.
➣ We observe only algebraic convergence (→ Def. 6.2.2.7) in the polynomial degree n (for n ≪ m!).

➂ “bump function”

f(t) = max{ cos(4π·|t + ¼|), 0 } .

➣ Merely f ∈ C¹([−1, 1]).

Fig. 238 (“Polynomial fitting of cosine bump function: equidistant points”): L∞- and L²-norms of the error for m = 50, . . . , 400 versus polynomial degree n, doubly logarithmic scale. The doubly logarithmic plot suggests “asymptotic” algebraic convergence.
y
Review question(s) 6.3.2.25 (Polynomial mean-square best approximation)
(Q6.3.2.25.A) Given a, b ∈ R and a positive continuous weight function w ∈ C^0([a, b]), w(t) > 0 for all t ∈ [a, b], write {P_0, P_1, P_2, . . .}, P_j ∈ P_j, j ∈ N_0, for the sequence of orthonormal polynomials with respect to the weighted L²-inner product on the interval [a, b]

(f, g)_{w,[a,b]} := ∫_a^b w(τ) f(τ) g(τ) dτ .   (6.3.2.3)

Give an indirect proof that Pj must have j distinct zeros in ] a, b[. To that end assume that Pj has only
ℓ < j zeros z1 , . . . , zℓ in ] a, b[, at which it changes sign and consider the polynomial

q ( t ) : = ( t − z1 ) · · · · · ( t − z ℓ ) , q ∈ P ℓ ,

in order to arrive at a contradiction.


6.4 Uniform Best Approximation


§6.4.0.1 (The alternation theorem) Given an interval [ a, b] we seek a best approximant of a function f ∈
C0 ([ a, b]) in the space Pn of polynomials of degree ≤ n with respect to the supremum norm k·k L∞ ([a,b]) :

q ∈ argmin_{p∈P_n} ‖f − p‖_{L∞(I)} .

The results of Section 6.3.1 cannot be applied because the supremum norm is not induced by an inner
product on Pn .

Theory provides us with surprisingly precise necessary and sufficient conditions to be satisfied by the
polynomial L∞ ([ a, b])-best approximant q.

Theorem 6.4.0.2. Chebychev alternation theorem

Given f ∈ C^0([a, b]), a < b, and a polynomial degree n ∈ N, a polynomial q ∈ P_n satisfies

q = argmin_{p∈P_n} ‖f − p‖_{L∞(I)}

if and only if there exist n + 2 points a ≤ ξ_0 < ξ_1 < · · · < ξ_{n+1} ≤ b such that

|e(ξ_j)| = ‖e‖_{L∞([a,b])} ,  j = 0, . . . , n + 1 ,
e(ξ_j) = −e(ξ_{j+1}) ,  j = 0, . . . , n ,

where e := f − q denotes the approximation error.


Visualization of the behavior of the L∞([a, b])-best approximation error e := f − q according to the Chebychev alternation theorem Thm. 6.4.0.2 (Fig. 239: the error e oscillates between +‖e‖_{L∞([a,b])} and −‖e‖_{L∞([a,b])}, attaining these values alternately at the points ξ_0, ξ_1, ξ_2, . . .).

The extrema of the approximation error are sometimes called alternants.

Compare with the shape of the Chebychev polynomials → Def. 6.2.3.3.

§6.4.0.3 (Remez algorithm) The widely used iterative algorithm (Remez algorithm) for finding an L∞ -
best approximant is motivated by the alternation theorem. The idea is to determine successively better
approximations of the set of alternants: A(0) → A(1) → . . ., ♯A(l ) = n + 2.

Key is the observation that, due to the alternation theorem, the polynomial L∞([a, b])-best approximant q will satisfy (one of the) interpolation conditions

q(ξ_k) ± (−1)^k δ = f(ξ_k) ,  k = 0, . . . , n + 1 ,   δ := ‖f − q‖_{L∞([a,b])} .   (6.4.0.4)

➀ Initial guess A^(0) := {ξ_0^(0) < ξ_1^(0) < · · · < ξ_n^(0) < ξ_{n+1}^(0)} ⊂ [a, b] “arbitrary”, for instance the extremal points of the Chebychev polynomial T_{n+1}, → Def. 6.2.3.3, the so-called Chebychev alternants

ξ_j^(0) = ½(a + b) + ½(b − a)·cos( j/(n + 1) · π ) ,  j = 0, . . . , n + 1 .   (6.4.0.5)

➁ Given approximate alternants A^(l) := {ξ_0^(l) < ξ_1^(l) < · · · < ξ_n^(l) < ξ_{n+1}^(l)} ⊂ [a, b] determine q ∈ P_n and a deviation δ ∈ R satisfying the extended interpolation condition

q(ξ_k^(l)) + (−1)^k δ = f(ξ_k^(l)) ,  k = 0, . . . , n + 1 .   (6.4.0.6)

After choosing a basis for P_n, this is an (n + 2) × (n + 2) linear system of equations, cf. § 5.1.0.21.

➂ Choose A^(l+1) as the set of extremal points of f − q, truncated in case more than n + 2 of these exist.

These extrema can be located approximately by sampling on a fine grid covering [a, b]. If the derivative of f ∈ C¹([a, b]) is available, too, then search for zeros of (f − q)′ using the secant method from § 8.4.2.28.

➃ If ‖f − q‖_{L∞([a,b])} ≤ TOL · ‖d‖_{L∞([a,b])} STOP, else GOTO ➁.
(TOL is a prescribed relative tolerance.)

C++ code 6.4.0.7: Remez algorithm for uniform polynomial approximation on an interval
➺ GITLAB
// IN : f = handle to the function, point evaluation
// df = handle to the derivative of f, point evaluation
// a, b = interval boundaries
// d = degree of polynomial
// c is used to return the coefficients of the interpolant in monomial basis
template <class Function, class Derivative>
void remez(const Function &f, const Derivative &df, const double a,
           const double b, const unsigned d, const double tol, VectorXd &c) {
  const unsigned n = 8 * d;  // number of sampling points
  const VectorXd xtab = VectorXd::LinSpaced(n, a, b);  // points of sampling grid
  const VectorXd ftab = feval(f, xtab);  // function values at sampling grid
  const double fsupn =
      ftab.cwiseAbs().maxCoeff();  // approximate supremum norm of f
  const VectorXd dftab = feval(df, xtab);  // derivative values at sampling grid

  // The vector xe stores the current guess for the alternants;
  // initial guess is the Chebychev alternant (6.4.0.5)
  const double h = M_PI / (d + 1);
  VectorXd xe(d + 2);
  for (unsigned i = 0; i < d + 2; ++i) {
    xe(i) = (a + b) / 2. + (a - b) / 2. * std::cos(h * i);
  }

  VectorXd fxe = feval(f, xe);  // f evaluated at alternants
  const unsigned maxit = 10;
  // Main iteration loop of Remez algorithm
  for (unsigned k = 0; k < maxit; ++k) {
    // Interpolation at d + 2 points xe with deviations ±δ.
    // The algorithm uses the monomial basis, which is not optimal.
    MatrixXd V = vander(xe);
    MatrixXd A(d + 2, d + 2);
    // build matrix A of the linear system of equations
    A.block(0, 0, d + 2, d + 1) = V.block(0, 1, d + 2, d + 1);
    for (unsigned r = 0; r < d + 2; ++r) {
      A(r, d + 1) = std::pow(-1, r);
    }

    c = A.lu().solve(fxe);  // solve for coefficients of polynomial q

    VectorXd cd(d);  // monomial coefficients of the derivative q'
    for (unsigned i = 0; i < d; ++i) {
      cd(i) = (d - i) * c(i);
    }

    // Find initial guesses for the inner extrema by sampling;
    // track sign changes of the derivative of the approximation error
    VectorXd deltab = polyval(cd, xtab) - dftab;
    const VectorXd s = deltab.head(n - 1).cwiseProduct(deltab.tail(n - 1));
    const VectorXd ind = findNegative(s);  // ind = find(s < 0)
    VectorXd xx0 = select(xtab, ind);      // approximate zeros of e'
    const unsigned nx = ind.size();        // number of approximate zeros

    if (nx < d) {  // too few extrema; bail out
      std::cerr << "Too few extrema!\n";
      return;
    }

    // Secant method to determine zeros of the derivative of the
    // approximation error
    VectorXd F0 = polyval(cd, xx0) - feval(df, xx0);
    // initial guesses from shifting sampling points
    VectorXd xx1 = xx0 + (b - a) / (2 * n) * VectorXd::Ones(xx0.size());
    VectorXd F1 = polyval(cd, xx1) - feval(df, xx1);
    // Main loop of the secant method
    while (F1.cwiseAbs().minCoeff() > 1e-12) {
      const VectorXd xx2 = xx1 - (F1.cwiseQuotient(F1 - F0)).cwiseProduct(xx1 - xx0);
      xx0 = xx1;
      xx1 = xx2;
      F0 = F1;
      F1 = polyval(cd, xx1) - feval(df, xx1);
    }

    // Determine new approximation for the alternants; store in xe.
    // If too many zeros of the derivative (f - q)' have been found,
    // select those where the deviation is maximal.
    if (nx == d) {
      xe = VectorXd(xx0.size() + 2);
      xe << a, xx0, b;
    } else if (nx == d + 1) {
      xe = VectorXd(xx0.size() + 1);
      if (xx0.minCoeff() - a > b - xx0.maxCoeff()) {
        xe << a, xx0;
      } else {
        xe << xx0, b;
      }
    } else if (nx == d + 2) {
      xe = xx0;
    } else {
      const VectorXd del = (polyval(c.head(d + 1), xx0) - feval(f, xx0)).cwiseAbs();
      VectorXd ind = sort_indices(del);
      xe = select(xx0, ind.tail(d + 2));
    }

    // Deviation in sampling points and approximate alternants
    fxe = feval(f, xe);
    VectorXd del(xe.size() + 2);
    del << polyval(c.head(d + 1), a * VectorXd::Ones(1)) -
               ftab(0) * VectorXd::Ones(1),
        polyval(c.head(d + 1), xe) - fxe,
        polyval(c.head(d + 1), b * VectorXd::Ones(1)) -
            ftab(ftab.size() - 1) * VectorXd::Ones(1);
    // Approximation of the supremum norm of the approximation error
    const double dev = del.cwiseAbs().maxCoeff();
    // Termination of Remez iteration
    if (dev < tol * fsupn) {
      break;
    }
  }
  const VectorXd tmp = c.head(d + 1);
  c = tmp;
}
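
A minimal usage sketch for the routine above (it assumes that remez() and its helper functions feval, vander, polyval, findNegative, select, sort_indices from the GITLAB codes are available in the current translation unit; in particular, feval is assumed to apply a scalar functor componentwise to a vector):

#include <Eigen/Dense>
#include <cmath>
#include <iostream>

int main() {
  // Runge function f(t) = 1/(1+t^2) and its derivative on [-5,5]
  auto f = [](double t) { return 1.0 / (1.0 + t * t); };
  auto df = [](double t) { return -2.0 * t / ((1.0 + t * t) * (1.0 + t * t)); };
  Eigen::VectorXd c;  // monomial coefficients of the uniform best approximant
  remez(f, df, -5.0, 5.0, 9, 1e-8, c);  // degree-9 approximation, relative tolerance 1e-8
  std::cout << "coefficients: " << c.transpose() << std::endl;
  return 0;
}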

EXPERIMENT 6.4.0.8 (Convergence of Remez algorithm) We examine the convergence of the Remez algorithm from Code 6.4.0.7 for two different functions:

• f(t) = (1 + t^2)^{-1}, I = [−5,5] → Ex. 6.2.2.11,

• f(t) = \tfrac12(1+\cos(2\pi t)) if |t| < \tfrac12, and f(t) = 0 else, I = [−1,1].

Fig. 240, 241: L^∞-norm of the approximation error versus the step of the Remez algorithm, for polynomial degrees n = 3, 5, 7, 9, 11 and the two functions above.

Convergence in both cases; faster convergence is observed for the smooth function, for which machine precision is reached after a few steps. y
Review question(s) 6.4.0.9 (Uniform best approximation)

6.5 Approximation by Trigonometric Polynomials


Now we address the approximation of a continuous 1-periodic function

f ∈ C 0 (R ) , f ( t + 1) = f ( t ) ∀ t ∈ R .

Policy: In the interest of “structure preservation” approximate f in a space of functions with “built-in” 1-periodicity. This already rules out approximation on [0,1] by global polynomials, because those can never be extended to globally 1-periodic functions.

The natural space for approximating generic periodic functions is a space of trigonometric polyno-
mials with the same period.

Remember from Def. 5.6.1.1 the different ways to represent the space P^T_{2n} of 1-periodic trigonometric polynomials of degree 2n, n ∈ N. We can use either a real-valued or a complex-valued basis:

    \mathcal{P}_{2n}^T = \operatorname{Span}\{ t\mapsto 1,\ t\mapsto\sin(2\pi t),\ t\mapsto\cos(2\pi t),\ t\mapsto\sin(4\pi t),\ t\mapsto\cos(4\pi t),\ \dots,\ t\mapsto\sin(2\pi n t),\ t\mapsto\cos(2\pi n t)\} ,    (6.5.0.1a)
    \mathcal{P}_{2n}^T = \operatorname{Span}\{ t\mapsto\exp(-2\pi\imath k t)\,:\, k=-n,\dots,n\} .    (6.5.0.1b)

Both sets of functions provide a basis for the same space P^T_{2n}, when considered as a vector space over C of C-valued functions on R. The complex-valued basis allows simpler manipulations and will mainly be used in the sequel.


6.5.1 Approximation by Trigonometric Interpolation

Video tutorial for Section 6.5.1 "Approximation by Trigonometric Interpolation": (5 minutes)


Download link, tablet notes

Idea: Adapt the policy of approximation by interpolation from § 6.1.0.6.

Here: Employ trigonometric interpolation from Section 5.6 into the space P^T_{2n} of 1-periodic trigonometric polynomials → Def. 5.6.1.1.

Recall: Trigonometric interpolation → Section 5.6

Given nodes t_0 < t_1 < · · · < t_{2n}, t_k ∈ [0,1[, and values y_k ∈ R, k = 0,...,2n, find

    q \in \mathcal{P}_{2n}^T := \operatorname{Span}\{t\mapsto\cos(2\pi j t),\ t\mapsto\sin(2\pi j t)\}_{j=0}^{n} ,    (6.5.1.2)
    with  q(t_k) = y_k  for all  k = 0,\dots,2n .    (6.5.1.3)

Terminology: P^T_{2n} =̂ space of trigonometric polynomials of degree 2n.

From Section 5.6 remember a few more facts about trigonometric polynomials and trigonometric interpolation:
✦ Cor. 5.6.1.6: dimension of the space of trigonometric polynomials: dim P^T_{2n} = 2n + 1.
✦ Trigonometric interpolation can be reduced to polynomial interpolation on the unit circle S^1 ⊂ C in the complex plane, see (5.6.1.5). ➣ existence & uniqueness of the trigonometric interpolant q satisfying (6.5.1.2) and (6.5.1.3).
✦ There are very efficient FFT-based algorithms for trigonometric interpolation in equidistant nodes t_k = k/(2n+1), k = 0,...,2n, see Code 5.6.3.4.
The relationship of trigonometric interpolation and polynomial interpolation on the unit circle suggests a uniform distribution of nodes for general trigonometric interpolation.

Nodes for function approximation by trigonometric interpolation

Trigonometric approximation of generic 1-periodic continuous functions ∈ C^0(R) in P^T_{2n} usually relies on equidistant interpolation nodes t_k = k/(2n+1), k = 0,...,2n.

✎ notation: trigonometric interpolation operator in the 2n+1 equidistant nodes t_k = k/(2n+1), k = 0,...,2n:

    \mathsf{T}_n : C^0([0,1[) \to \mathcal{P}_{2n}^T ,  \mathsf{T}_n(f)(t_k) = f(t_k)  \forall k\in\{0,\dots,2n\} .    (6.5.1.5)
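
The interpolation conditions in the equidistant nodes amount to a discrete Fourier transform. The following self-contained C++ sketch (not one of the lecture codes; the efficient FFT-based realization is Code 5.6.3.4) computes the coefficients γ_j of T_n(f) in the exponential basis t ↦ exp(−2πıjt) by the direct O(n^2) formula γ_j = (2n+1)^{-1} ∑_k f(t_k) exp(2πıj t_k):

#include <Eigen/Dense>
#include <cmath>
#include <complex>

// Coefficients gamma_j, j = -n,...,n, of the equidistant trigonometric
// interpolant T_n(f) in the exponential basis t -> exp(-2*pi*i*j*t),
// stored as gamma(j+n); f is any functor double -> double (or complex).
template <typename Functor>
Eigen::VectorXcd trigInterpCoeffs(Functor &&f, unsigned int n) {
  const unsigned int N = 2 * n + 1;
  const std::complex<double> I(0.0, 1.0);
  Eigen::VectorXcd gamma(N);
  for (int j = -static_cast<int>(n); j <= static_cast<int>(n); ++j) {
    std::complex<double> s(0.0, 0.0);
    for (unsigned int k = 0; k < N; ++k) {
      const double tk = static_cast<double>(k) / N;  // node t_k = k/(2n+1)
      s += f(tk) * std::exp(2.0 * M_PI * I * static_cast<double>(j) * tk);
    }
    gamma(j + n) = s / static_cast<double>(N);
  }
  return gamma;
}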

Note that a function f ∈ C^0([0,1[) can spawn a discontinuous 1-periodic function on R. A prominent example is the “sawtooth function” sketched in Fig. 242.


Review question(s) 6.5.1.6 (Approximation by trigonometric interpolation)


(Q6.5.1.6.A) Two different (ordered) sets of basis functions for the space P^T_{2n} of trigonometric polynomials of degree 2n are widely used:
1. The real-valued trigonometric basis

    \mathcal{P}_{2n}^T = \operatorname{Span}\{ t\mapsto 1,\ t\mapsto\sin(2\pi t),\ t\mapsto\cos(2\pi t),\ t\mapsto\sin(4\pi t),\ t\mapsto\cos(4\pi t),\ \dots,\ t\mapsto\sin(2\pi nt),\ t\mapsto\cos(2\pi nt)\} .

2. The complex-valued exponential basis

    \mathcal{P}_{2n}^T = \operatorname{Span}\{ t\mapsto\exp(-2\pi\imath k t)\,:\, k=-n,\dots,n\} .

State the matrix C ∈ C^{2n+1,2n+1} converting a representation in the trigonometric basis into a representation with respect to the exponential basis.

(Q6.5.1.6.B) When can a function f_0 ∈ C^m([0,1]) be extended to a 1-periodic function f ∈ C^m(R)? That f is an extension of f_0 means that f|_{[0,1]} ≡ f_0.

(Q6.5.1.6.C) Write L^T_T : C^0_{per}([0,1]) → P^T_{2n} for the approximation scheme based on trigonometric interpolation in the node set T := {t_0, t_1, ..., t_{2n}} ⊂ R, n ∈ N.
Show that (L^T_T f)(t) ∈ R for all t ∈ R, if f is 1-periodic and real-valued: f(t) ∈ R for all t ∈ R.

(Q6.5.1.6.D) Let
• L_n : C^0([−1,1]) → P_n denote the family of approximation schemes spawned by Chebychev interpolation on [−1,1],
• T_n : C^0([0,1]) → P^T_{2n} stand for the family of approximation schemes related to trigonometric interpolation in the equidistant nodes t_k := k/(2n+1), k = 0,...,2n.
Express L_n in terms of T_n using
• the Chebychev polynomials T_0,...,T_n as a basis of P_n,
• the complex exponentials t ↦ exp(−2πıkt), k = −n,...,n, as a basis of P^T_{2n}.

6.5.2 Trigonometric Interpolation Error Estimates

Video tutorial for Section 6.5.2 "Trigonometric Interpolation Error Estimates": (14 minutes)
Download link, tablet notes

From (6.5.1.5) we use the notation T_n for trigonometric interpolation in the 2n+1 equidistant nodes t_k := k/(2n+1). Our focus will be on the asymptotic behavior of

    \|f-\mathsf{T}_n f\|_{L^\infty([0,1[)}  and  \|f-\mathsf{T}_n f\|_{L^2([0,1[)}  as  n\to\infty ,

for functions f : [0,1[ → C with different smoothness properties. To begin with we report an empiric study.


EXPERIMENT 6.5.2.1 (Interpolation error: trigonometric interpolation) Now we study the asymptotic behavior of the error of equidistant trigonometric interpolation as n → ∞ in a numerical experiment for functions with different smoothness properties:
#1 step function: f(t) = 0 for |t − 1/2| > 1/4, f(t) = 1 for |t − 1/2| ≤ 1/4,
#2 C^∞ periodic function: f(t) = \bigl(1+\tfrac12\sin(2\pi t)\bigr)^{-1/2},
#3 “wedge function”: f(t) = |t − 1/2|.
Approximate computation of the norms of the interpolation errors on an equidistant grid with 4096 points.
Fig. 243: maximum norm of the interpolation error; Fig. 244: L^2([0,1])-norm of the interpolation error; both plotted versus n for the three functions #1, #2, #3.

Observations: Function #1: no convergence in the L^∞-norm, algebraic convergence in the L^2-norm.
Function #3: algebraic convergence in both norms.
Function #2: exponential convergence in both norms.

We conclude that in this experiment higher smoothness of f leads to faster convergence of the trigonometric interpolant. y

EXPERIMENT 6.5.2.2 (Gibbs phenomenon) Of course the smooth trigonometric interpolants of the
step function must fail to converge in the L∞ -norm in Exp. 6.5.2.1. Moreover, they will not even converge
“visually” to the step function, which becomes manifest by a closer inspection of the interpolants.

(Plots: the step function f and its trigonometric interpolants p for n = 16 and n = 128.)
We observe massive “overshooting oscillations” in a neighborhood of the discontinuity. This is the notori-
ous Gibbs phenomenon affecting approximation schemes relying on trigonometric polynomials. y

§6.5.2.3 (Fourier series → (4.2.6.7)) From (6.5.0.1b) we know that the complex vector space of trigonometric polynomials P^T_{2n} is spanned by the 2n+1 Fourier modes t ↦ exp(−2πıkt) of “lowest frequency”, that is, those for −n ≤ k ≤ n:

    p \in \mathcal{P}_{2n}^T  \Rightarrow  p(t) = \sum_{k=-n}^{n} a_k \exp(-2\pi\imath k t)  \text{for some } a_k\in\mathbb{C} .    (6.5.2.4)

Now let us make a connection: In Section 4.2.6 we learned that every function f : [0,1[ → C with finite L^2([0,1])-norm

    \|f\|_{L^2([0,1])}^2 := \int_0^1 |f(t)|^2\,\mathrm{d}t < \infty ,

can be expanded in a Fourier series (→ Thm. 4.2.6.33)

    f(t) = \sum_{k=-\infty}^{\infty} \hat f_k \exp(-2\pi\imath k t)  \text{in } L^2([0,1]) ,  \hat f_k := \int_0^1 f(t)\exp(2\pi\imath k t)\,\mathrm{d}t .    (6.5.2.5)

We add that a limit in L^2([0,1]) means that \bigl\|f-\sum_{k=-M}^{M}\hat f_k\exp(-2\pi\imath k\cdot)\bigr\|_{L^2([0,1])}\to 0 for M → ∞. Also note the customary notation \hat f_k for the Fourier coefficients.
Seeing (6.5.2.4) and (6.5.2.5) side-by-side and understanding that trigonometric polynomials are finite Fourier series suggests that we investigate the approximation of Fourier series by trigonometric interpolants.

Idea: Study the trigonometric interpolation error for interpolands given in Fourier series representation (6.5.2.5).
y

Remark 6.5.2.6 (L^2(]0,1[): Natural setting for trigonometric interpolation) A fundamental result about functions given through Fourier series was the following isometry property of the mapping taking a function to the sequence of its Fourier coefficients.

Theorem 4.2.6.33. Isometry property of the Fourier transform

If the Fourier coefficients satisfy \sum_{k\in\mathbb{Z}}|\hat c_k|^2 < \infty, then the Fourier series

    c(t) = \sum_{k\in\mathbb{Z}} \hat c_k \exp(-2\pi\imath k t)

yields a function c ∈ L^2([0,1]) that satisfies

    \|c\|_{L^2([0,1])}^2 := \int_0^1 |c(t)|^2\,\mathrm{d}t = \sum_{k\in\mathbb{Z}}|\hat c_k|^2 .

This paves the way for estimating the L^2([0,1])-norm of interpolation/approximation errors, once we have information about the decay of their Fourier coefficients.
The L^2([0,1])-norm is also a highly relevant quantity in engineering applications, when t ↦ c(t) is regarded as a time-dependent signal. In this case \|c\|_{L^2([0,1])} is the root mean square (RMS) power of the signal. y

§6.5.2.7 (Aliasing) Guided by the insights from § 6.5.2.3, we study the action of the trigonometric interpolation operator T_n from (6.5.1.5) on individual Fourier modes

    \mu_k(t) := \exp(-2\pi k\imath t) ,  t\in\mathbb{R} ,  k\in\mathbb{Z} .

Due to the 1-periodicity of t ↦ exp(−2πıt) we find for every node t_j := j/(2n+1), j = 0,...,2n,

    \mu_k(t_j) = \exp\bigl(-2\pi\imath k\tfrac{j}{2n+1}\bigr) = \exp\bigl(-2\pi\imath(k-\ell(2n+1))\tfrac{j}{2n+1}\bigr) = \mu_{k-\ell(2n+1)}(t_j)  \forall\ell\in\mathbb{Z} .

When sampled on the node set T_n := {t_0,...,t_{2n}} all the Fourier modes µ_{k−ℓ(2n+1)}, ℓ ∈ Z, yield the same values. Thus trigonometric interpolation cannot distinguish them! This phenomenon is called aliasing.

Aliasing demonstrated for f(t) = sin(2π·19t) = Im(exp(2πı19t)) for different node sets: the plots (for n = 2, n = 4, n = 8) show f together with its trigonometric interpolant p. The “low-frequency” sine waves plotted in red coincide with f on the node set T_n.

Since T_n µ_k = µ_k for k = −n,...,n, that is, for µ_k ∈ P^T_{2n}, the aliasing effect yields

    \mathsf{T}_n\mu_k = \mu_{\tilde k} ,  \tilde k\in\{-n,\dots,n\} ,  k-\tilde k\in(2n+1)\mathbb{Z}  [\ \tilde k := k \bmod (2n+1)\ ] .    (6.5.2.8)

For instance, we have \tilde n = n, \widetilde{n+1} = -n, \widetilde{-n-1} = n, \widetilde{2n} = -1, etc.


Trigonometric interpolation by Tn maps all Fourier modes (“frequencies”) to another single Fourier
mode in the finite range {−n, . . . , n}.
y
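
The following minimal C++ snippet (a sketch, not one of the lecture codes) verifies the aliasing relation numerically: the modes µ_k and µ_{k−(2n+1)} take identical values in the nodes t_j = j/(2n+1).

#include <algorithm>
#include <cmath>
#include <complex>
#include <iostream>

int main() {
  const int n = 4, N = 2 * n + 1, k = 7;  // k > n: a "high-frequency" mode
  const std::complex<double> I(0.0, 1.0);
  auto mu = [&I](int m, double t) {
    return std::exp(-2.0 * M_PI * I * static_cast<double>(m) * t);
  };
  double maxdiff = 0.0;
  for (int j = 0; j < N; ++j) {
    const double tj = static_cast<double>(j) / N;  // node t_j
    maxdiff = std::max(maxdiff, std::abs(mu(k, tj) - mu(k - N, tj)));
  }
  // prints a value on the order of machine precision
  std::cout << "max |mu_k(t_j) - mu_{k-(2n+1)}(t_j)| = " << maxdiff << std::endl;
  return 0;
}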

§6.5.2.9 (Fourier representation of the trigonometric interpolation error) From (6.5.2.8), by linearity of T_n, we obtain for f : [0,1[ → C in Fourier series representation

    f(t) = \sum_{j=-\infty}^{\infty}\hat f_j\,\mu_j(t)  \Longrightarrow  \mathsf{T}_n(f)(t) = \sum_{j=-n}^{n}\gamma_j\,\mu_j(t) ,  \gamma_j = \sum_{\ell=-\infty}^{\infty}\hat f_{j+\ell(2n+1)} .    (6.5.2.10)

We can read the trigonometric polynomial T_n f ∈ P^T_{2n} as a Fourier series with non-zero coefficients only in the index range {−n,...,n}. Thus, for the Fourier coefficients \hat E_j of the trigonometric interpolation error E(t) := f(t) − T_n f(t) we find from (6.5.2.10)

    \hat E_j = \begin{cases} -\sum_{\ell\in\mathbb{Z}\setminus\{0\}}\hat f_{j+\ell(2n+1)} , & \text{if } j\in\{-n,\dots,n\} ,\\ \hat f_j , & \text{if } |j|>n , \end{cases}  j\in\mathbb{Z} .    (6.5.2.11)

Since we have (sufficient smoothness of f assumed)

    E(t) := f(t) - \mathsf{T}_n f(t) = \sum_{j=-\infty}^{\infty}\hat E_j\,e^{-2\pi\imath j t} ,    (6.5.2.12)

we conclude from the isometry property asserted in Thm. 4.2.6.33 and the triangle inequality

    \|f-\mathsf{T}_n f\|_{L^2(]0,1[)}^2 = \sum_{j=-n}^{n}\Bigl|\sum_{\ell\in\mathbb{Z}\setminus\{0\}}\hat f_{j+\ell(2n+1)}\Bigr|^2 + \sum_{|j|>n}|\hat f_j|^2 ,    (6.5.2.13)
    \|f-\mathsf{T}_n f\|_{L^\infty(]0,1[)} \le \sum_{j=-n}^{n}\sum_{\ell\in\mathbb{Z}\setminus\{0\}}|\hat f_{j+\ell(2n+1)}| + \sum_{|j|>n}|\hat f_j| .    (6.5.2.14)

In order to estimate these norms of the trigonometric interpolation error we need quantitative information about the decay of the Fourier coefficients \hat f_j as |j| → ∞. y

§6.5.2.15 (Fourier expansions of derivatives) For a 1-periodic c ∈ C^0(R) with integrable derivative \dot c := \frac{\mathrm{d}c}{\mathrm{d}t} we find by integration by parts (the boundary terms cancel due to periodicity)

    \widehat{\dot c}_j = \int_0^1 \dot c(t)e^{2\pi\imath j t}\,\mathrm{d}t = -2\pi\imath j\int_0^1 c(t)e^{2\pi\imath j t}\,\mathrm{d}t = (-2\pi\imath j)\,\hat c_j ,  j\in\mathbb{Z} .

We can also arrive at this formula by (formal) term-wise differentiation of the Fourier series:

    c(t) = \sum_{j=-\infty}^{\infty}\hat c_j e^{-2\pi\imath j t}  \Longrightarrow  \dot c(t) = \sum_{j=-\infty}^{\infty}\underbrace{(-2\pi\imath j)\hat c_j}_{=\widehat{\dot c}_j}\, e^{-2\pi\imath j t} .    (6.5.2.16)

These considerations essentially provide a formal proof of the following result.

Lemma 6.5.2.17. Fourier coefficients of derivatives

For the Fourier coefficients of the derivatives of a 1-periodic function f ∈ C^{k−1}(R), k ∈ N, with integrable k-th derivative f^{(k)} holds

    \widehat{(f^{(k)})}_j = (-2\pi\imath j)^k\,\hat f_j ,  j\in\mathbb{Z} .


§6.5.2.18 (Fourier coefficients and smoothness) From Lemma 6.5.2.17 and the trivial estimates (|exp(2πıt)| = 1)

    |\hat f_j| \le \int_0^1 |f(t)|\,\mathrm{d}t \le \|f\|_{L^1(]0,1[)}  \forall j\in\mathbb{Z} ,    (6.5.2.19)

we conclude that the sequence \bigl((2\pi|j|)^m\hat f_j\bigr)_{j\in\mathbb{Z}}, m ∈ N, is bounded, provided that f ∈ C^{m−1}(R) with integrable m-th derivative.

Lemma 6.5.2.20. Decay of Fourier coefficients

If f ∈ C^{k−1}(R) with integrable k-th derivative, then \hat f_j = O(|j|^{-k}) for |j| → ∞.

Decay of Fourier coefficients and smoothness

The smoother a periodic function, the faster the decay of its Fourier coefficients.
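
As a quick sanity check (a worked computation, not part of the original experiment data), consider the 1-periodic “wedge function” f(t) = |t − 1/2| from case #3 of Exp. 6.5.2.1. Elementary integration yields the cosine series

    \bigl|t-\tfrac12\bigr| = \tfrac14 + \sum_{m\ \mathrm{odd}}\frac{2}{\pi^2 m^2}\cos(2\pi m t) ,  \text{i.e.}  \hat f_j = \frac{1}{\pi^2 j^2}\ \text{for odd}\ |j| ,  \hat f_j = 0\ \text{for even}\ j\ne 0 ,

so its Fourier coefficients decay like O(|j|^{-2}), whereas those of the discontinuous step function of case #1 decay only like O(|j|^{-1}); this is in line with the different convergence behavior observed in Exp. 6.5.2.1.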

The isometry property of Thm. 4.2.6.33 also yields for f ∈ C^{k−1}(R) with f^{(k)} ∈ L^2(]0,1[) that

    \bigl\|f^{(k)}\bigr\|_{L^2(]0,1[)}^2 = (2\pi)^{2k}\sum_{j=-\infty}^{\infty}|j|^{2k}\,|\hat f_j|^2 .    (6.5.2.22)

We can now combine the identity (6.5.2.22) with (6.5.2.13) and obtain an interpolation error estimate in the L^2(]0,1[)-norm.

Theorem 6.5.2.23. Finite-smoothness L^2-error estimate for trigonometric interpolation

If f ∈ C^{k−1}(R), k ∈ N, with square-integrable k-th derivative (f^{(k)} ∈ L^2(]0,1[)), then

    \|f-\mathsf{T}_n f\|_{L^2(]0,1[)} \le \sqrt{1+c_k}\,(2\pi n)^{-k}\bigl\|f^{(k)}\bigr\|_{L^2(]0,1[)} ,    (6.5.2.24)

with c_k = 2\sum_{\ell=1}^{\infty}(2\ell-1)^{-2k} < \infty.

Proof. From Thm. 4.2.6.33 and Lemma 6.5.2.17 we infer

    \sum_{j\in\mathbb{Z}}|2\pi j|^{2k}\,|\hat f_j|^2 = \bigl\|f^{(k)}\bigr\|_{L^2([0,1])}^2 .

As a tool we will need the Cauchy-Schwarz inequality for square-summable sequences:

    \Bigl|\sum_{\ell=1}^{\infty}a_\ell b_\ell\Bigr|^2 \le \sum_{\ell=1}^{\infty}|a_\ell|^2\cdot\sum_{\ell=1}^{\infty}|b_\ell|^2  \forall (a_\ell),(b_\ell)\in\mathbb{C}^{\mathbb{N}} .    (6.5.2.25)

We start from

    \|f-\mathsf{T}_n f\|_{L^2(]0,1[)}^2 = \sum_{j=-n}^{n}\Bigl|\sum_{\ell\in\mathbb{Z}\setminus\{0\}}\hat f_{j+\ell(2n+1)}\Bigr|^2 + \sum_{|j|>n}|\hat f_j|^2 ,    (6.5.2.13)

and then use (6.5.2.25):

    \|f-\mathsf{T}_n f\|_{L^2(]0,1[)}^2
    = \sum_{j=-n}^{n}\Bigl|\sum_{|\ell|\ge 1}|2\pi(j+\ell(2n+1))|^{-k}\cdot\bigl(|2\pi(j+\ell(2n+1))|^{k}\hat f_{j+\ell(2n+1)}\bigr)\Bigr|^2 + \sum_{|j|>n}|2\pi j|^{-2k}\,|2\pi j|^{2k}|\hat f_j|^2
    \le \sum_{j=-n}^{n}\Bigl(\sum_{|\ell|\ge 1}|2\pi(j+\ell(2n+1))|^{-2k}\Bigr)\cdot\Bigl(\sum_{|\ell|\ge 1}|2\pi(j+\ell(2n+1))|^{2k}\,|\hat f_{j+\ell(2n+1)}|^2\Bigr) + \sum_{|j|>n}|2\pi j|^{-2k}\,|2\pi j|^{2k}|\hat f_j|^2 .

We plug in the estimate

    \sum_{|\ell|\ge 1}|2\pi(j+\ell(2n+1))|^{-2k} \le 2\sum_{\ell\ge 1}|2\pi(\ell(2n+1)-n)|^{-2k} \le (2\pi n)^{-2k}c_k ,

which yields

    \|f-\mathsf{T}_n f\|_{L^2(]0,1[)}^2 \le c_k(2\pi n)^{-2k}\sum_{j=-n}^{n}\sum_{|\ell|\ge 1}|2\pi(j+\ell(2n+1))|^{2k}\,|\hat f_{j+\ell(2n+1)}|^2 + (2\pi n)^{-2k}\sum_{|j|>n}|2\pi j|^{2k}|\hat f_j|^2
    \le (1+c_k)(2\pi n)^{-2k}\bigl\|f^{(k)}\bigr\|_{L^2([0,1])}^2 .
✷


Thm. 6.5.2.23 confirms algebraic convergence of the L2 -norm of the trigonometric interpolation error for
functions with limited smoothness. Higher rates can be expected for smoother functions, which we have
also found in cases #1 and #3 in Exp. 6.5.2.1.

Review question(s) 6.5.2.26 (Trigonometric Interpolation Error Estimates)


(Q6.5.2.26.A) We know that trigonometric interpolation can be regarded as standard polynomial interpolation with nodes located on the unit circle S^1 ⊂ C.
Nevertheless, it is not possible to apply the following theorem to obtain error estimates for trigonometric interpolation. Why?

Theorem 6.2.2.15. Representation of interpolation error

We consider f ∈ C^{n+1}(I) and the Lagrangian interpolation approximation scheme (→ Def. 6.2.2.1) for a node set T := {t_0,...,t_n} ⊂ I. Then, for every t ∈ I there exists a τ_t ∈ ]min{t,t_0,...,t_n}, max{t,t_0,...,t_n}[ such that

    f(t) - \mathsf{L}_{\mathcal{T}}(f)(t) = \frac{f^{(n+1)}(\tau_t)}{(n+1)!}\cdot\prod_{j=0}^{n}(t-t_j) .    (6.5.2.27)


6.5.3 Trigonometric Interpolation of Analytic Periodic Functions

Video tutorial for Section 6.5.3 "Trigonometric Interpolation of Analytic Periodic Functions":
(16 minutes) Download link, tablet notes

In Section 6.2.2.3 we saw that we can expect exponential decay of the maximum norm of polynomial
interpolation errors in the case of “very smooth” interpolands. To capture this property of functions we
resorted to the notion of analytic functions, as defined in Def. 6.2.2.48. Since trigonometric interpolation is
closely connected to polynomial interpolation (on the unit circle S1 , see Section 5.6.2), it is not surprising
that analyticity of the interpoland also implies exponential convergence of trigonometric interpolants. This result will be established in this section.
In case #2 of Exp. 6.5.2.1 we already saw an instance of exponential convergence for an analytic interpoland. A more detailed study follows.

EXPERIMENT 6.5.3.1 (Trigonometric interpolation of 1-periodic analytic functions)

We study the convergence of equidistant trigonometric interpolation for the 1-periodic interpoland

    f(t) = \frac{1}{\sqrt{1-\alpha\sin(2\pi t)}}  \text{on } I=[0,1] .    (6.5.3.2)

For 0 ≤ α < 1 we have f ∈ C^∞(R), and f is even analytic (→ Def. 6.2.2.48).
Approximate computation of the error norms by “oversampling” in 4096 points.
Fig. 245: L^∞- and L^2-norms of the interpolation error versus the polynomial degree n, for α = 0.5, 0.9, 0.95, 0.99.
➣ Observation: exponential convergence in n, faster for smaller α. y

§6.5.3.3 (Analytic periodic functions) Assume that a 1-periodic function f : R → R possesses an analytic extension to an open domain D ⊂ C beyond the interval [0,1]: [0,1] ⊂ D (see Fig. 246).

Then, thanks to 1-periodicity, f will also have an analytic extension to D_{per} := \bigcup_{k\in\mathbb{Z}}(D+k). That domain will contain a strip parallel to the real axis, see Fig. 246:

    \exists\,\eta^-<0<\eta^+ :  \{z\in\mathbb{C}\,:\,\eta^-\le\operatorname{Im}(z)\le\eta^+\}\subset D_{\mathrm{per}} .

§6.5.3.4 (Decay of Fourier coefficients of analytic functions) Lemma 6.5.2.20 asserts algebraic decay of the Fourier coefficients of functions with limited smoothness. As analytic 1-periodic functions are “infinitely smooth” (they always belong to C^∞(R)), we expect a stronger result in this case. In fact, we can conclude exponential decay of the Fourier coefficients.

Theorem 6.5.3.5. Exponential decay of Fourier coefficients of analytic functions

If f : R → C is 1-periodic and has an analytic extension to the strip

    \bar S := \{z\in\mathbb{C}\,:\,-\eta\le\operatorname{Im}z\le\eta\} ,  \text{for some }\eta>0 ,

then its Fourier coefficients decay according to

    |\hat f_k| \le q^{|k|}\cdot\|f\|_{L^\infty(\bar S)}  \forall k\in\mathbb{Z}  \text{with}  q := \exp(-2\pi\eta)\in\,]0,1[ .    (6.5.3.6)

Proof. ➊: A first variant of the proof uses techniques from complex analysis.

Let f : R → C be 1-periodic with an analytic extension to the (closed) strip

    S := \{z\in\mathbb{C}\,:\,-\eta\le\operatorname{Im}\{z\}\le\eta\} ,  \eta>0 ,

as sketched in Fig. 247.

We recall the fundamental Cauchy integral theorem from complex analysis.

Theorem 6.5.3.7. Cauchy integral theorem [Rem02, Satz 7.1.2]

Let f : D → C be analytic in D ⊂ C and U ⊂ D be simply connected and strictly contained in D.


Then
Z
f (z) dz = 0 .
∂U

We apply this theorem to either of the rectangles

    U^+ := \{z=\xi+\imath\tau\,:\,0\le\xi\le 1,\ 0\le\tau\le\eta\}  \text{for } k\ge 0 ,
    U^- := \{z=\xi-\imath\tau\,:\,0\le\xi\le 1,\ 0\le\tau\le\eta\}  \text{for } k<0 ,

and note that the contributions of the sides parallel to the imaginary axis cancel thanks to 1-periodicity (and their opposite orientation).

Fig. 248 shows the two equivalent integration paths for the computation of the Fourier coefficients \hat f_k, k > 0: the path 0 → 1 on the real line and the path through the upper half of the complex plane; the contributions of the sections parallel to the imaginary axis cancel.

Thus we compute the Fourier coefficients \hat f_k of f by a different integral. For k > 0 we get

    \hat f_k = \int_0^1 f(t)e^{2\pi\imath k t}\,\mathrm{d}t = \int_{\partial U^+\setminus\mathbb{R}} f(z)e^{2\pi\imath k z}\,\mathrm{d}z = \int_0^1 f(s+\imath\eta)e^{2\pi\imath k(s+\imath\eta)}\,\mathrm{d}s = e^{-2\pi\eta k}\int_0^1 f(s+\imath\eta)e^{2\pi\imath k s}\,\mathrm{d}s ,

which leads to the estimate

    |\hat f_k| \le e^{-2\pi\eta k}\max\{|f(t+\imath\eta)| ,\ 0\le t\le 1\} .

Analogously, for k < 0 we compute

    \hat f_k = \int_0^1 f(t)e^{2\pi\imath k t}\,\mathrm{d}t = \int_{\partial U^-\setminus\mathbb{R}} f(z)e^{2\pi\imath k z}\,\mathrm{d}z = \int_0^1 f(s-\imath\eta)e^{2\pi\imath k(s-\imath\eta)}\,\mathrm{d}s = e^{2\pi\eta k}\int_0^1 f(s-\imath\eta)e^{2\pi\imath k s}\,\mathrm{d}s ,

and obtain the bound

    |\hat f_k| \le e^{2\pi\eta k}\max\{|f(t-\imath\eta)| ,\ 0\le t\le 1\} .

The assertion of the theorem follows directly.

➋: A second variant of the proof merely relies on classical calculus.

By the assumptions of the theorem the function f : R → C can be expanded into a power series with radius of convergence η around every y ∈ R:

    \forall y\in\mathbb{R}:\ \exists\,c_n(y)\in\mathbb{C}:\quad f(x)=\sum_{n=0}^{\infty}c_n(y)(x-y)^n  \forall x:\ |x-y|\le\eta .    (6.5.3.8)

This implies |c_n(y)|\eta^n \le C(y) and, since [0,1] is compact and y ↦ C(y) is continuous,

    \exists\,C>0:\quad |c_n(y)|\eta^n\le C  \forall n\in\mathbb{N}_0,\ \forall y\in\mathbb{R} .    (6.5.3.9)

The power series (6.5.3.8) is a Taylor series, which means

    c_n(y)=\frac{1}{n!}f^{(n)}(y)  \overset{(6.5.3.9)}{\Longrightarrow}  |f^{(n)}(y)|\le C\,n!\,\eta^{-n}  \forall n\in\mathbb{N}_0,\ y\in\mathbb{R} .    (6.5.3.10)

We use this estimate to bound the Fourier coefficients of f. Starting from the defining formula (4.2.6.20), we continue with n-fold integration by parts:

    \hat f_k := \int_0^1 f(t)\exp(2\pi\imath k t)\,\mathrm{d}t = \frac{(-1)^n}{(2\pi\imath k)^n}\int_0^1 f^{(n)}(t)\exp(2\pi\imath k t)\,\mathrm{d}t .    (6.5.3.11)


We combine this formula with (6.5.3.10) and obtain

    |\hat f_k| \le C\,\frac{n!}{(2\pi|k|\eta)^n}  \forall n\in\mathbb{N},\ k\in\mathbb{Z} .    (6.5.3.12)

Next, we use Stirling’s formula (6.2.1.12) in the form

    n! \le e\,n^{n+1/2}e^{-n} ,  n\in\mathbb{N} ,

which gives

    |\hat f_k| \le C e\,\frac{n^{n+1/2}e^{-n}}{(2\pi|k|\eta)^n}  \forall n\in\mathbb{N},\ k\in\mathbb{Z} .

We can also “interpolate” and replace n with a real number:

    |\hat f_k| \le C e\,\frac{r^{r+1/2}e^{-r}}{(2\pi|k|\eta)^r}  \forall r\ge 1,\ k\in\mathbb{Z} .

Finally, we set r := 2π|k|η and arrive at

    |\hat f_k| \le C e\,\sqrt{2\pi|k|\eta}\;\underbrace{\exp(-2\pi\eta)}_{=:q}{}^{\,|k|} ,  k\in\mathbb{Z} .

This bound is slightly less explicit than that obtained in ➊.
✷ y

Knowing exponential decay of the Fourier coefficients, the geometric sum formula can be used to extract estimates for the trigonometric interpolation operator T_n (for equidistant nodes) from (6.5.2.13) and (6.5.2.14):

Lemma 6.5.3.13. Interpolation error estimates for exponentially decaying Fourier coefficients

If f : R → C is 1-periodic, f ∈ L^2(]0,1[), and

    \exists\,M>0,\ \rho\in\,]0,1[:\quad |\hat f(k)| \le M\rho^{|k|}  \forall k\in\mathbb{Z} ,

then

    \|f-\mathsf{T}_n(f)\|_{L^2(]0,1[)} \le M\,\frac{\rho^{n/2}}{1-\rho^{n}}\,\frac{2\sqrt{2}}{\sqrt{1-\rho^{2}}} ,    \|f-\mathsf{T}_n(f)\|_{L^\infty(]0,1[)} \le 4M\,\frac{\rho^{n/2}}{1-\rho} .

This estimate can be combined with the result of Thm. 6.5.3.5 and gives the main result of this section:

Theorem 6.5.3.14. Exponential convergence of trigonometric interpolation for analytic interpolands

If f : R → C is 1-periodic and possesses an analytic extension to the strip

    \bar S := \{z\in\mathbb{C}\,:\,-\eta\le\operatorname{Im}z\le\eta\} ,  \text{for some }\eta>0 ,

then there is C_η > 0 depending only on η such that

    \|f-\mathsf{T}_n f\|_{*} \le C_\eta\,e^{-\pi\eta n}\,\|f\|_{L^\infty(\bar S)} ,  n\in\mathbb{N} ,  (\,* = L^2(]0,1[),\ L^\infty(]0,1[)\,) .    (6.5.3.15)

The speed of exponential convergence clearly depends on the width η of the “strip of analyticity” S̄.


§6.5.3.16 (Convergence of trigonometric interpolation for analytic interpolands) We can now give a precise explanation of the observations made in Exp. 6.5.3.1, cf. Rem. 6.2.3.26. Similar to Chebychev interpolants, trigonometric interpolants also converge exponentially fast, if the interpoland f is 1-periodic and analytic (→ Def. 6.2.2.48) in a strip around the real axis in C, see Thm. 6.5.3.14 for details.
Thus we have to determine the maximal open subset D of C to which the function

    f(t) = \frac{1}{\sqrt{1-\alpha\sin(2\pi t)}} ,  t\in[0,1] ,  0<\alpha<1 ,    (6.5.3.2)

possesses an analytic extension. Usually it is easier to determine the complement P := C \ D, the “domain of singularity”. We start with a result from complex analysis.

Lemma 6.5.3.17. Principal branch of the square root

The square root function t ↦ √t, t ≥ 0, can be extended to an analytic function on

    \mathbb{C}\setminus\mathbb{R}_0^- := \{z\in\mathbb{C}\,:\,\operatorname{Re}(z)>0\ \text{or}\ \operatorname{Im}(z)\ne 0\} .

(Fig. 249 sketches the domain of singularity of the principal branch of z ↦ √z: the closed negative real axis.)

As a consequence of this and of

Theorem 6.2.2.68. Composition and products of analytic functions

If f , h : D ⊂ C → C and g : U ⊂ C → C are analytic in the open sets D and U , respectively, then


(i) the composition f ◦ g is analytic in {z ∈ U : g(z) ∈ D },
(ii) the product f · h is analytic on D.

we find for the domain of singularity of f

    P = \{z\in\mathbb{C}\,:\,1-\alpha\sin(2\pi z)\in\mathbb{R}_0^-\} .

Based on the identities

    \sin(\imath y) = \tfrac{1}{2\imath}\bigl(\exp(-y)-\exp(y)\bigr) = \imath\sinh(y) ,    \cos(\imath y) = \tfrac12\bigl(\exp(-y)+\exp(y)\bigr) = \cosh(y)

for y ∈ R, and using addition theorems for trigonometric functions, we find with z = x + ıy, x, y ∈ R,

    1+\alpha\sin(2\pi z)\in\mathbb{R}_0^-
    \Updownarrow
    \sin(2\pi z) = \sin(2\pi x)\cosh(2\pi y) + \imath\cos(2\pi x)\sinh(2\pi y) \in\;]-\infty,\,-1-\tfrac{1}{\alpha}]
    \Updownarrow
    \sin(2\pi x)\cosh(2\pi y)\le -1-\tfrac{1}{\alpha}  \text{and}  \cos(2\pi x)\sinh(2\pi y)=0 .

Note that y = 0 is not possible, because this would imply |sin(2πz)| ≤ 1. Hence, the term x ↦ cos(2πx) must make the imaginary part vanish, which means

    2\pi x\in(2\mathbb{Z}+1)\frac{\pi}{2} \;\Leftrightarrow\; x\in\tfrac12\mathbb{Z}+\tfrac14 .


Thus we have sin(2πx) = ±1. As cosh(2πy) > 0, the sine has to be negative, which leaves as the only remaining choices for the real part

    x\in\mathbb{Z}+\tfrac34 \;\Leftrightarrow\; \sin(2\pi x) = -1 .

As ξ ↦ cosh(ξ) is a positive even function, we find the following domain of analyticity of f:

    \mathbb{C}\setminus\bigcup_{k\in\mathbb{Z}}\bigl(k+\tfrac34+\imath(\mathbb{R}\setminus\,]-\zeta,\zeta[)\bigr) ,  \zeta\in\mathbb{R}^+ ,  \cosh(2\pi\zeta)=1+\tfrac{1}{\alpha} .

(Fig. 250, 251: the function ξ ↦ cosh(ξ) and the resulting strip of analyticity.)
➣ f is analytic in the strip S := {z ∈ C : −ζ < Im(z) < ζ}.
➣ As α decreases the strip of analyticity becomes wider, since x ↦ cosh(x) is increasing for x > 0. y

Review question(s) 6.5.3.18 (Trigonometric interpolation of analytic periodic functions)


(Q6.5.3.18.A) From complex analysis we know the following result.

Theorem 6.5.3.19. Series of analytic functions

Assume that all the functions f_k : D → C, k ∈ N, are analytic on the open set D ⊂ C and

    \lim_{n\to\infty}\ \sup_{z\in D}\ \sum_{k=n}^{\infty}|f_k(z)| = 0 .

Then the series

    F(z) := \sum_{k=1}^{\infty}f_k(z)

converges for all z ∈ D and defines an analytic function F : D → C.

Show that

    f(t) := \sum_{k\in\mathbb{Z}}a_k\exp(2\pi\imath k t) ,  t\in\mathbb{R} ,

can be extended to a 1-periodic analytic function defined on a C-neighborhood of R, if

    \exists\,C>0,\ 0\le q<1:\quad |a_k|\le C q^{|k|}  \forall k\in\mathbb{Z} .

(Q6.5.3.18.B) Determine the maximal open set D ⊂ C, to which the function

    f(t) = \frac{1}{1+\sin^2(2\pi t)} ,  t\in\mathbb{R} ,

can be extended analytically. What is the maximal width of a strip

    S := \{z\in\mathbb{C}\,:\,|\operatorname{Im}(z)|<\eta\}  \text{for some }\eta>0 ,

such that f has an analytic extension to S?
such that f has an analytic extension to S.


6.6 Approximation by Piecewise Polynomials


Recall some alternatives to interpolation by global polynomials discussed in Chapter 5:
✦ piecewise linear/quadratic interpolation → Section 5.3.2, Ex. 5.3.2.4,
✦ cubic Hermite interpolation → Section 5.3.3,
✦ (cubic) spline interpolation → Section 5.4.2.
☞ All these interpolation schemes rely on piecewise polynomials (of different global smoothness)

Focus in this section: function approximation by piecewise polynomial interpolants

§6.6.0.1 (Grid/mesh) The attribute “piecewise” refers to partitioning of the interval on which we aim to
approximate. In the case of data interpolation the natural choice was to use intervals defined by interpo-
lation nodes. Yet we already saw exceptions in the case of shape-preserving interpolation by means of
quadratic splines, see Section 5.4.4.

In the case of function approximation based on an interpolation scheme the additional freedom to choose
the interpolation nodes suggests that those be decoupled from the partitioning.
Idea: use piecewise polynomials with respect to a grid/mesh

    \mathcal{M} := \{a = x_0 < x_1 < \dots < x_{m-1} < x_m = b\}    (6.6.0.2)

to approximate a function f : [a,b] → R, a < b.

Borrowing from terminology for splines, cf. Def. 5.4.1.1, the underlying mesh for piecewise polynomial
approximation is sometimes called the “knot set”.

Terminology:
✦ x_j =̂ nodes of the mesh M,
✦ [x_{j−1}, x_j[ =̂ intervals/cells of the mesh,
✦ h_M := max_j |x_j − x_{j−1}| =̂ mesh width,
✦ if x_j = a + jh =̂ equidistant (uniform) mesh with meshwidth h > 0.
(Sketch: a mesh with nodes x_0, ..., x_14 on [a,b].)
y

Remark 6.6.0.3 (Local approximation by piecewise polynomials) We will see that most approximation
schemes relying on piecewise polynomials are local in the sense that finding the approximant on a cell of
the mesh relies only on a fixed number of function evaluations in a neighborhood of the cell.

➣ O(1) computational effort to find interpolant on [ x j−1 , x j ] (independent of m)


➣ O(m) computational effort to determine the piecewise polynomial approximant for m → ∞ (“fine meshes”).
Contrast this with the computational cost of computing global polynomial interpolants, which will usually be O(n^2) for polynomial degree n → ∞. y

6.6.1 Piecewise Polynomial Lagrange Interpolation

Video tutorial for Section 6.6.1 "Piecewise Polynomial Lagrange Interpolation": (17 minutes)
Download link, tablet notes

Given: interval [ a, b] ⊂ R endowed with mesh M : = { a = x 0 < x 1 < . . . < x m −1 < x m = b } .

Recall theory of polynomial interpolation → Section 5.2.2: n + 1 data points needed to fix interpolating
polynomial, see Thm. 5.2.2.7.

Approach to local Lagrange interpolation (→ (5.2.2.2)) of f ∈ C ( I ) on mesh M

General local Lagrange interpolation on a mesh (PPLIP)

➊ Choose a local degree n_j ∈ N_0 for each cell of the mesh, j = 1,...,m.

➋ Choose a set of local interpolation nodes

    \mathcal{T}^j := \{t_0^j,\dots,t_{n_j}^j\}\subset I_j := [x_{j-1},x_j] ,  j=1,\dots,m ,

for each mesh cell/grid interval I_j.

➌ Define the piecewise polynomial (PP) interpolant s : [x_0, x_m] → K by

    s_j := s|_{I_j}\in\mathcal{P}_{n_j}  \text{and}  s_j(t_i^j) = f(t_i^j) ,  i=0,\dots,n_j ,  j=1,\dots,m .    (6.6.1.2)

Owing to Thm. 5.2.2.7, s_j is well defined.
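
The recipe above can be turned into a short evaluation routine. The following C++ sketch (not a lecture code) uses n+1 equidistant local nodes in each cell (endpoints included, so that Cor. 6.6.1.4 below guarantees continuity) and evaluates the local Lagrange interpolant directly; it assumes n ≥ 1 and t ∈ [x_0, x_m].

#include <Eigen/Dense>

// Point evaluation s(t) of the piecewise degree-n Lagrange interpolant of f
// on the mesh given by the sorted node vector 'mesh' (a sketch; O(n^2) per
// evaluation, which is fine for illustration).
template <typename Functor>
double evalPwLagrange(Functor &&f, const Eigen::VectorXd &mesh, unsigned int n, double t) {
  // locate the cell [x_{j-1}, x_j] containing t (linear search)
  int j = 1;
  while (j < mesh.size() - 1 && t > mesh(j)) ++j;
  const double a = mesh(j - 1), b = mesh(j);
  // local equidistant interpolation nodes and function values (requires n >= 1)
  Eigen::VectorXd tau(n + 1), y(n + 1);
  for (unsigned int i = 0; i <= n; ++i) {
    tau(i) = a + (b - a) * i / n;
    y(i) = f(tau(i));
  }
  // evaluate the local Lagrange interpolant at t
  double s = 0.0;
  for (unsigned int i = 0; i <= n; ++i) {
    double L = 1.0;  // i-th Lagrange polynomial for the local nodes
    for (unsigned int l = 0; l <= n; ++l)
      if (l != i) L *= (t - tau(l)) / (tau(i) - tau(l));
    s += y(i) * L;
  }
  return s;
}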

Corollary 6.6.1.3. Piecewise polynomial Lagrange interpolation operator

The mapping f ↦ s defines a linear operator \mathsf{I}_{\mathcal{M}} : C^0([a,b]) \to C^0_{\mathcal{M},\mathrm{pw}}([a,b]) in the sense of Def. 5.1.0.25.

Obviously, I_M depends on M, the local degrees n_j, and the sets T^j of local interpolation points (the latter two are suppressed in the notation).

Corollary 6.6.1.4. Continuous local Lagrange interpolants

If the local degrees n_j are at least 1 and the local interpolation nodes t_k^j, j = 1,...,m, k = 0,...,n_j, for local Lagrange interpolation satisfy

    t_{n_j}^{j} = t_0^{j+1}  \forall j=1,\dots,m-1 ,    (6.6.1.5)

then the piecewise polynomial Lagrange interpolant according to (6.6.1.2) is continuous on [a,b]: s ∈ C^0([a,b]).


Focus: asymptotic behavior of (some norm of) the interpolation error

    \|f-\mathsf{I}_{\mathcal{M}}f\| \le C\,T(N)  \text{for } N\to\infty ,  \text{where}  N := \sum_{j=1}^{m}(n_j+1) .    (6.6.1.6)

The decay of the bound T(N) will characterize the type of convergence:

☛ algebraic convergence or exponential convergence, see Section 6.2.2, Def. 6.2.2.7.

But why do we choose this strange number N as parameter when investigating the approximation error? Because, by Thm. 5.2.1.2, it agrees with the dimension of the space of discontinuous, piecewise polynomial functions

    \{q:[a,b]\to\mathbb{R}\,:\, q|_{I_j}\in\mathcal{P}_{n_j}\ \forall j=1,\dots,m\} !

This dimension tells us the number of real parameters we need to describe the interpolant s, that is, the “information cost” of s. N is also proportional to the number of interpolation conditions, which agrees with the number of f-evaluations needed to compute s (why only proportional in general?).
Special case: uniform polynomial degree n_j = n for all j = 1,...,m.

Then we may aim for estimates \|f-\mathsf{I}_{\mathcal{M}}f\| \le C\,T(h_{\mathcal{M}}) for h_M → 0 in terms of the meshwidth h_M.

Terminology: investigations of this kind are called the study of h-convergence.

EXAMPLE 6.6.1.7 (h-convergence of piecewise polynomial interpolation)

Compare Exp. 5.3.1.6: f(t) = arctan t, I = [−5,5], grid M := {−5, −5/2, 0, 5/2, 5}; local interpolation nodes equidistant in I_j, endpoints included, so that (6.6.1.5) is satisfied. Fig. 252 shows f together with the piecewise linear, quadratic, and cubic polynomial interpolants on this mesh.

✦ Sequence of (equidistant) meshes: \mathcal{M}_i := \{-5+j\,2^{-i}\,10\}_{j=0}^{2^i}, i = 1,...,6.
✦ Equidistant local interpolation nodes (endpoints of grid intervals included).
Monitored: interpolation error in (approximate) L^∞- and L^2-norms, see (6.2.3.25), (6.2.3.24),

    \|g\|_{L^\infty([-5,5])} \approx \max_{j=0,\dots,1000}|g(-5+j/100)| ,
    \|g\|_{L^2([-5,5])} \approx \Bigl(\tfrac{1}{1000}\bigl(\tfrac12 g^2(-5) + \sum_{j=1}^{999}|g(-5+j/100)|^2 + \tfrac12 g^2(5)\bigr)\Bigr)^{1/2} .

Fig. 253, 254: L^∞- and L^2-norms of the interpolation error versus the meshwidth h (doubly logarithmic scale), for local polynomial degrees 1 to 6.

Observation: Algebraic convergence (→ Def. 6.2.2.7) for meshwidth h_M → 0 (nearly linear error norm graphs in doubly logarithmic scale, see § 6.2.2.9).

Observation: the rate of algebraic convergence increases with the polynomial degree n.

Rates α of algebraic convergence O(h_M^α) of the norms of the interpolation error:

n               | 1      | 2      | 3      | 4      | 5      | 6
w.r.t. L^2-norm | 1.9957 | 2.9747 | 4.0256 | 4.8070 | 6.0013 | 5.2012
w.r.t. L^∞-norm | 1.9529 | 2.8989 | 3.9712 | 4.7057 | 5.9801 | 4.9228

➣ A higher polynomial degree provides faster algebraic decrease of the interpolation error norms. Empiric evidence for rates α = n + 1.

Here the rates were estimated by linear regression (→ Ex. 3.1.1.5) based on Python’s polyfit, using the interpolation errors for meshwidths h ≤ 10·2^{−5}. This was done in order to avoid the erratic “preasymptotic” behavior of the error for large meshwidth h.
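
For completeness, a C++/Eigen sketch of this rate estimation (a hypothetical helper, mimicking a degree-1 polyfit of the logarithmic data in the least-squares sense):

#include <Eigen/Dense>

// Fit log(err) ~ alpha*log(h) + log(C) by linear least squares and return the
// estimated algebraic convergence rate alpha; h and err must have equal size.
double estimateRate(const Eigen::VectorXd &h, const Eigen::VectorXd &err) {
  const Eigen::Index m = h.size();
  Eigen::MatrixXd A(m, 2);
  A.col(0) = h.array().log().matrix();           // column multiplying alpha
  A.col(1) = Eigen::VectorXd::Ones(m);           // column multiplying log(C)
  const Eigen::VectorXd b = err.array().log().matrix();
  // least-squares solution via column-pivoted QR decomposition
  const Eigen::Vector2d coeffs = A.colPivHouseholderQr().solve(b);
  return coeffs(0);
}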

The bad rates for n = 6 are probably due to the impact of roundoff, because the norms of the interpolation
error had dropped below machine precision, see Fig. 253, 254. y

§6.6.1.8 (Approximation error estimates for piecewise polynomial Lagrange interpolation) The observations made in Ex. 6.6.1.7 are easily explained by applying the polynomial interpolation error estimates of Section 6.2.2, for instance

    \|f-\mathsf{L}_{\mathcal{T}}f\|_{L^\infty(I)} \le \frac{\bigl\|f^{(n+1)}\bigr\|_{L^\infty(I)}}{(n+1)!}\,\max_{t\in I}|(t-t_0)\cdot\;\cdots\;\cdot(t-t_n)| ,    (6.2.2.22)

locally on the mesh intervals [x_{j−1}, x_j], j = 1,...,m: for constant polynomial degree n = n_j, j = 1,...,m, we get for f ∈ C^{n+1}([x_0, x_m]) (smoothness requirement)

    (6.2.2.22) \;\Rightarrow\; \|f-s\|_{L^\infty([x_0,x_m])} \le \frac{h_{\mathcal{M}}^{n+1}}{(n+1)!}\,\bigl\|f^{(n+1)}\bigr\|_{L^\infty([x_0,x_m])} ,    (6.6.1.9)

with the mesh width h_M := max{|x_j − x_{j−1}| : j = 1,...,m}. y


Another special case: fixed mesh M, uniform polynomial degree n


Study estimates k f − IM f k ≤ CT (n) for n → ∞.
Terminology: investigation of p-convergence

EXAMPLE 6.6.1.10 (p-convergence of piecewise polynomial interpolation) We study p-convergence in the setting of Ex. 6.6.1.7.

Fig. 255, 256: L^2- and L^∞-norms of the interpolation error versus the local polynomial degree (semi-logarithmic scale), for meshwidths h = 5, 2.5, 1.25, 0.625, 0.3125.

Observation: (apparent) exponential convergence in the polynomial degree.

Note: in the case of p-convergence the situation is the same as for standard polynomial interpolation, see Section 6.2.2.

In this example we deal with an analytic function, see Rem. 6.2.3.26. Though equidistant local interpolation nodes are used, cf. Ex. 6.2.2.11, the mesh intervals seem to be small enough that even in this case exponential convergence prevails. y

6.6.2 Cubic Hermite Interpolation: Error Estimates

Video tutorial for Section 6.6.2 "Cubic Hermite and Spline Interpolation: Error Estimates": (10
minutes) Download link, tablet notes

See Section 5.3.3 for definition and algorithms for cubic Hermite interpolation of data points, with a focus
on shape preservation, however. If the derivative f ′ of the interpoland f is available (in procedural form),
then it can be used to fix local cubic polynomials by prescribing point values and derivative values in the
endpoints of grid intervals.

Definition 6.6.2.1. Piecewise cubic Hermite interpolant (with exact slopes) → Def. 5.3.3.1

Given f ∈ C1 ([ a, b]) and a mesh M := { a = x0 < x1 < . . . < xm−1 < xm = b} the piecewise
cubic Hermite interpolant (with exact slopes) s : [ a, b] → R is defined as

s|[ x j−1 ,x j ] ∈ P3 , j = 1, . . . , m , s( x j ) = f ( x j ) , s′ ( x j ) = f ′ ( x j ) , j = 0, . . . , m .

Clearly, the piecewise cubic Hermite interpolant is continuously differentiable: s ∈ C1 ([ a, b]), cf.
Cor. 5.3.3.2.
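
For illustration, a compact C++ sketch (not a lecture code) of the point evaluation of the piecewise cubic Hermite interpolant with exact slopes, based on the cubic Hermite basis polynomials on the reference interval [0,1]; f and df are assumed to be callable for scalar arguments and t ∈ [x_0, x_m].

#include <Eigen/Dense>

// Evaluate the piecewise cubic Hermite interpolant (Def. 6.6.2.1) at t.
template <typename Functor, typename Derivative>
double evalPwCubicHermite(Functor &&f, Derivative &&df,
                          const Eigen::VectorXd &mesh, double t) {
  // locate the cell [x_{j-1}, x_j] containing t (linear search)
  int j = 1;
  while (j < mesh.size() - 1 && t > mesh(j)) ++j;
  const double x0 = mesh(j - 1), x1 = mesh(j), h = x1 - x0;
  const double tau = (t - x0) / h;  // local coordinate in [0,1]
  // cubic Hermite basis polynomials on [0,1]
  const double H1 = 1.0 - 3.0 * tau * tau + 2.0 * tau * tau * tau;  // weight of f(x0)
  const double H2 = 3.0 * tau * tau - 2.0 * tau * tau * tau;        // weight of f(x1)
  const double H3 = tau * (1.0 - tau) * (1.0 - tau);                // weight of h*f'(x0)
  const double H4 = -tau * tau * (1.0 - tau);                       // weight of h*f'(x1)
  return f(x0) * H1 + f(x1) * H2 + h * (df(x0) * H3 + df(x1) * H4);
}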

EXPERIMENT 6.6.2.2 (Convergence of Hermite interpolation with exact slopes) In this experiment we study the h-convergence of cubic Hermite interpolation for a smooth function.

Piecewise cubic Hermite interpolation of f(x) = arctan(x):
✦ domain: I = (−5,5),
✦ mesh T = {−5 + hj}_{j=0}^{n} ⊂ I, h = 10/n,
✦ exact slopes c_i = f'(t_i), i = 0,...,n.

Fig. 257: sup- and L^2-norms of the interpolation error versus the meshwidth h; we observe algebraic convergence O(h^4).

Approximate computation of the error norms analogous to Ex. 6.6.1.7. y

The observation made in Exp. 6.6.2.2 matches the theoretical prediction of the rate of algebraic convergence of cubic Hermite interpolation with exact slopes for a smooth function.

Theorem 6.6.2.3. Convergence of approximation by cubic Hermite interpolation

Let s be the cubic Hermite interpolant of f ∈ C^4([a,b]) on a mesh M := {a = x_0 < x_1 < ... < x_{m−1} < x_m = b} according to Def. 6.6.2.1. Then

    \|f-s\|_{L^\infty([a,b])} \le \frac{1}{4!}\,h_{\mathcal{M}}^4\,\bigl\|f^{(4)}\bigr\|_{L^\infty([a,b])} ,

with the meshwidth h_M := max_j |x_j − x_{j−1}|.

In Section 5.3.3.2 we saw variants of cubic Hermite interpolation, for which the slopes c_j = s'(x_j) were computed from the values y_j in a preprocessing step. Now we study the use of such a scheme for approximation.

EXPERIMENT 6.6.2.4 (Convergence of Hermite interpolation with averaged slopes)

Piecewise cubic Hermite interpolation of f(x) = arctan(x):
✦ domain: I = (−5,5),
✦ equidistant mesh T in I, see Exp. 6.6.2.2,
✦ averaged local slopes, see (5.3.3.8).

Fig. 258: sup- and L^2-norms of the interpolation error versus the meshwidth; algebraic convergence O(h^3) in the meshwidth. Code ➺ GITLAB

We observe a lower rate of algebraic convergence compared to the use of exact slopes, due to the averaging (5.3.3.8). From the plot we deduce O(h^3) asymptotic decay of the L^2- and L^∞-norms of the approximation error for meshwidth h → 0.

6.6.3 Cubic Spline Interpolation: Error Estimates [Han02, Ch. 47]


Recall concept and algorithms for cubic spline interpolation from Section 5.4.2. As an interpolation scheme
it can also serve as the foundation for an approximation scheme according to § 6.1.0.6: the mesh will
double as knot set, see Def. 5.4.1.1. Cubic spline interpolation is not local as we saw in § 5.4.3.7. Never-
theless, cubic spline interpolants can be computed with an effort of O(m) as elaborated in § 5.4.2.5.

We have seen three main classes of cubic spline interpolants s ∈ S3,M of data points with node set
M = { a = t0 < t1 < · · · < tn = b} § 5.4.2.11: the complete cubic spline (s′ prescribed at endpoints),
the natural cubic spline (s′′ ( a) = s′′ (b) = 0), and the periodic cubic spline (s′ ( a) = s′ (b), s′′ ( a) = s′′ (b)).
Obviously, both the natural and periodic cubic spline do not make much sense for approximating a generic
continuous function f ∈ C0 ([ a, b]). So we focus on complete cubic spline interpolants with endpoint
slopes inherited from the interpoland:

Definition 6.6.3.1. Complete cubic spline interpolant

Given f ∈ C^1([a,b]) and a mesh (= knot set) M := {a = x_0 < x_1 < ... < x_{m−1} < x_m = b}, the complete cubic spline interpolant s is defined by the conditions

    s\in\mathcal{S}_{3,\mathcal{M}} ,  s(x_j)=f(x_j) ,\ j=0,\dots,m ,  s'(a)=f'(a) ,  s'(b)=f'(b) .

In § 5.4.2.5 and § 5.4.2.11 we found that the interpolation conditions at the knots plus fixing the derivatives at the endpoints uniquely determine a cubic spline function. Hence, the above definition is valid.

EXPERIMENT 6.6.3.2 (Approximation by complete cubic spline interpolants) We take I = [−1,1] and rely on an equidistant mesh (knot set) \mathcal{M} := \{-1+\tfrac{2}{n}j\}_{j=0}^{n}, n ∈ N ➙ meshwidth h = 2/n.

We study h-convergence of complete (→ § 5.4.2.11) cubic spline interpolation according to Def. 6.6.3.1, where the slopes at the endpoints of the interval are made to agree with the derivatives of the interpoland at these points. As interpolands we consider

    f_1(t) = \frac{1}{1+e^{-2t}}\ \in C^\infty(I) ,    f_2(t) = \begin{cases} 0 , & \text{if } t<-\tfrac25 ,\\ \tfrac12\bigl(1+\cos(\pi(t-\tfrac35))\bigr) , & \text{if } -\tfrac25<t<\tfrac35 ,\\ 1 & \text{otherwise} , \end{cases}\ \in C^1(I) .

Fig. 259, 260: L^∞- and L^2-norms of the approximation error ‖s − f‖ versus the meshwidth h:

    \|f_1-s\|_{L^\infty([-1,1])} = O(h^4) ,    \|f_2-s\|_{L^\infty([-1,1])} = O(h^2) .

The codes used to run this experiment can be accessed through ➺ GITLAB.
We observe an algebraic order of convergence in h with empiric rate approximately given by min{1 + regularity of f, 4}. y

We remark that there is the following theoretical result [HM76], [DR08, Rem. 9.2]:

    f\in C^4([t_0,t_n]) \;\Rightarrow\; \|f-s\|_{L^\infty([t_0,t_n])} \le \frac{5}{384}\,h^4\,\bigl\|f^{(4)}\bigr\|_{L^\infty([t_0,t_n])} .    (6.6.3.3)

Summary and Learning Outcomes

This chapter is meant to impart the following knowledge and skills.
• You should be able to extract the (asymptotic) convergence of approximation errors from empiric data.
• You should know a few relevant norms on spaces of functions.
• You should be able to construct an approximation scheme on an arbitrary interval from an interpolation scheme on a fixed interval.
• You should recall bounds for the pointwise approximation error of polynomial interpolation for functions in C^r and for analytic functions.
• You should be familiar with Chebychev interpolation: rationale, definition, and algorithms.
• You should know about trigonometric interpolation and the behavior of the associated pointwise approximation errors.
• You should be able to predict the convergence of piecewise polynomial interpolation in terms of meshwidth h → 0.



Bibliography

[Boo05] Carl de Boor. “Divided differences”. In: Surv. Approx. Theory 1 (2005), pp. 46–69 (cit. on
p. 492).
[Bör21] Steffen Börm. On iterated interpolation. 2021.
[CB95] Q. Chen and I. Babuska. “Approximate optimal points for polynomial interpolation of real func-
tions in an interval and in a triangle”. In: Comp. Meth. Appl. Mech. Engr. 128 (1995), pp. 405–
417 (cit. on p. 500).
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 483, 547).
[Dav75] P.J. Davis. Interpolation and Approximation. New York: Dover, 1975 (cit. on pp. 472–474).
[DY10] L. Demanet and L. Ying. On Chebyshev interpolation of analytic functions. Online notes. 2010.
[DH03] P. Deuflhard and A. Hohmann. Numerical Analysis in Modern Scientific Computing. Vol. 43.
Texts in Applied Mathematics. Springer, 2003 (cit. on p. 497).
[EZ66] H. Ehlich and K. Zeller. “Auswertung der Normen von Interpolationsoperatoren”. In: Math. Ann.
164 (1966), pp. 105–112. DOI: 10.1007/BF01429047.
[HM76] C.A. Hall and W.W. Meyer. “Optimal error bounds for cubic spline interpolation”. In: J. Approx.
Theory 16 (1976), pp. 105–122 (cit. on p. 547).
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 483, 496, 497,
505, 546).
[JWZ19] Peter Jantsch, Clayton G. Webster, and Guannan Zhang. “On the Lebesgue constant of
weighted Leja points for Lagrange interpolation on unbounded domains”. In: IMA J. Numer.
Anal. 39.2 (2019), pp. 1039–1057. DOI: 10.1093/imanum/dry002.
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
p. 510).
[Ran00] R. Rannacher. Einführung in die Numerische Mathematik. Vorlesungsskriptum Universität Hei-
delberg. 2000 (cit. on p. 484).
[Rem84] R. Remmert. Funktionentheorie I. Grundwissen Mathematik 5. Berlin: Springer, 1984 (cit. on
p. 490).
[Rem02] R. Remmert. Funktionentheorie I. Grundwissen Mathematik 5. Berlin: Springer, 2002 (cit. on
p. 535).
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on pp. 471, 476,
483, 488, 489).
[Tad86] Eitan Tadmor. “The Exponential Accuracy of Fourier and Chebyshev Differencing Methods”.
In: SIAM Journal on Numerical Analysis 23.1 (1986), pp. 1–10. DOI: 10.1137/0723001.
[TT10] Rodney Taylor and Vilmos Totik. “Lebesgue constants for Leja points”. In: IMA J. Numer. Anal.
30.2 (2010), pp. 462–486. DOI: 10.1093/imanum/drn082.
[Tre13] Lloyd N. Trefethen. Approximation theory and approximation practice. Society for Industrial
and Applied Mathematics (SIAM), Philadelphia, PA, 2013, viii+305 pp.+back matter (cit. on
pp. 486, 492, 500, 502).
[Tre] N. Trefethen. Six myths of polynomial interpolation and quadrature. Slides, University of Ox-
ford.


[Ver86] P. Vertesi. “On the optimal Lebesgue constants for polynomial interpolation”. In: Acta Math.
Hungaria 47.1-2 (1986), pp. 165–178 (cit. on p. 500).
[Ver90] P. Vertesi. “Optimal Lebesgue constant for Lagrange interpolation”. In: SIAM J. Numer. Anal.
27.5 (1990), pp. 1322–1331 (cit. on p. 500).



Chapter 7

Numerical Quadrature

7.1 Introduction
Video tutorial for Section 7.1 "Numerical Quadrature: Introduction": (4 minutes)
Download link, tablet notes

Numerical quadrature deals with the approximate numerical evaluation of integrals \int_\Omega f(\mathbf{x})\,\mathrm{d}\mathbf{x} for a given (closed) integration domain Ω ⊂ R^d. Thus, the underlying problem in the sense of § 1.5.5.1 is the mapping

    \mathsf{I}:\ \begin{cases} C^0(\Omega)\to\mathbb{R}\\ f\mapsto\int_\Omega f(\mathbf{x})\,\mathrm{d}\mathbf{x} \end{cases} ,    (7.1.0.1)

with data space X := C^0(Ω) and result space Y := R.


If f is complex-valued or vector-valued, then so is the integral. The methods presented in this chapter can
immediately be generalized to this case by componentwise application.

§7.1.0.2 (Integrands in procedural form) The integrand f, a continuous function f : Ω ⊂ R^d → R, should not be thought of as given by an analytic expression, but as given in procedural form, cf. Rem. 5.1.0.9.
For instance, in C++ the integrand is provided through a “functor” data type with an evaluation operator double operator()(Point &) or a corresponding member function, see Code 5.1.0.10 for an example, or a lambda function (→ Section 0.3.3).

General methods for numerical quadrature should rely only on finitely many point evaluations of the integrand.
y
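
A minimal C++ illustration of these conventions (a sketch, not a lecture code): the integrand may be passed either as a functor object with an evaluation operator or as a lambda function, and a generic quadrature routine uses nothing but point evaluations.

#include <cmath>
#include <iostream>

// integrand supplied as a functor type with an evaluation operator
struct Integrand {
  double operator()(double t) const { return std::exp(-t * t); }
};

// a generic quadrature routine only relies on point evaluations f(t);
// the one-point midpoint rule is used here merely to demonstrate the interface
template <typename Functor>
double midpointRule(Functor &&f, double a, double b) {
  return (b - a) * f(0.5 * (a + b));
}

int main() {
  std::cout << midpointRule(Integrand(), 0.0, 1.0) << " "
            << midpointRule([](double t) { return 1.0 / (1.0 + t * t); }, 0.0, 1.0)
            << std::endl;
  return 0;
}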

In this chapter the focus is on the special case d = 1: Ω = [ a, b] (an interval).

(Multidimensional numerical quadrature is substantially more difficult, unless Ω is a tensor-product domain, i.e., a multi-dimensional box. Multidimensional numerical quadrature will be treated in the course “Numerical methods for partial differential equations”.)

Remark 7.1.0.3 (Importance of numerical quadrature)


☞ Numerical quadrature methods are key building blocks for so-called variational methods for the nu-
merical treatment of partial differential equations. A prominent example is the finite element method.
They are also a pivotal device for the numerical solution of integral equations and in computational
statistics.
y
For d = 1, from a geometric point of view, methods for numerical quadrature aim at computing

$$ \int_a^b f(t)\,\mathrm{d}t , $$

that is, they seek to approximate the area under the graph of the function f.

Fig. 261: the value of the integral corresponds to the area between the graph of f and the t-axis over [a, b].

EXAMPLE 7.1.0.4 (Heating production in electrical circuits) In Ex. 2.1.0.3 we learned about the nodal
analysis of electrical circuits. Its application to a non-linear circuit will be discussed in Ex. 8.1.0.1, which will
reveal that every computation of currents and voltages can be rather time-consuming. In this example we
consider a non-linear circuit in quasi-stationary operation (capacities and inductances are ignored). Then
the computation of branch currents and nodal voltages entails solving a non-linear system of equations.

Now assume time-harmonic periodic excitation U(t) with period T > 0.

Fig. 262: periodic excitation U(t) over one period T. Fig. 263: non-linear circuit driven by the voltage source U(t) with current I(t), nodes ➀–➃, and resistances R1, R2, R3, R4, Rb, Re, RL.

The goal is to compute the energy dissipated by the circuit, which is equal to the energy injected by the
voltage source. This energy can be obtained by integrating the power P(t) = U (t) I (t) over period [0, T ]:
$$ W_{\mathrm{therm}} = \int_0^T U(t)\, I(t)\,\mathrm{d}t , \quad \text{where } I = I(U) . $$

double I(double U) involves solving non-linear system of equations, see Ex. 8.1.0.1!
This is a typical example where “point evaluation” by solving the non-linear circuit equations is the only
way to gather information about the integrand. y

Contents


7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550


7.2 Quadrature Formulas – Quadrature Rules . . . . . . . . . . . . . . . . . . . . . . . 552
7.3 Polynomial Quadrature Formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
7.4 Gauss Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
7.4.1 Order of a Quadrature Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
7.4.2 Maximal-Order Quadrature Rules . . . . . . . . . . . . . . . . . . . . . . . . 562
7.4.3 Quadrature Error Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 570
7.5 Composite Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
7.6 Adaptive Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583

Supplementary literature. Numerical quadrature is covered in [Han02, p. VII] and [DR08,

Ch. 10].

Review question(s) 7.1.0.5 (Numerical quadrature: introduction)


(Q7.1.0.5.A) Let $\mathbf{A} \in \mathbb{R}^{n,n}$ be a symmetric positive definite (s.p.d.) matrix and define $\widetilde{\mathbf{A}}(\xi) \in \mathbb{R}^{n,n}$ by

$$ \big(\widetilde{\mathbf{A}}(\xi)\big)_{i,j} := \begin{cases} (\mathbf{A})_{i,j} & \text{for } (i,j) \neq (1,1) , \\ (\mathbf{A})_{1,1} + \xi & \text{for } (i,j) = (1,1) . \end{cases} $$

We consider the linear system of equations $\widetilde{\mathbf{A}}(\xi)\mathbf{x} = \mathbf{b}$ for a given right-hand-side vector $\mathbf{b} \in \mathbb{R}^n$.
We assume that $\xi : \Omega \to \mathbb{R}$ is a real-valued continuous random variable that is uniformly distributed in $[0,1]$. Derive integral expressions for the expectation and variance of the random variable $\eta := (\mathbf{x})_1 + \cdots + (\mathbf{x})_n$.

7.2 Quadrature Formulas – Quadrature Rules


Video tutorial for Section 7.2 "Quadrature Formulas/Rules": (13 minutes) Download link,
tablet notes

Quadrature formulas realize the approximation of an integral through finitely many point evaluations of the
integrand.

Definition 7.2.0.1. Quadrature formula/quadrature rule

An n-point quadrature formula/quadrature rule (QR) on [ a, b] provides an approximation of the value


of an integral through a weighted sum of point values of the integrand:
$$ \int_a^b f(t)\,\mathrm{d}t \approx Q_n(f) := \sum_{j=1}^{n} w_j^{(n)} f\big(c_j^{(n)}\big) . \qquad (7.2.0.2) $$

Terminology:  $w_j^{(n)}$ : quadrature weights $\in \mathbb{R}$;  $c_j^{(n)}$ : quadrature nodes $\in [a,b]$ (also called quadrature points).
Obviously (7.2.0.2) is compatible with integrands f given in procedural form as double f(double t),
compare § 7.1.0.2.


C++-code 7.2.0.3: C++ template implementing generic quadrature formula ➺ GITLAB


// Generic numerical quadrature routine implementing (7.2.0.2):
// f is a handle to a function, e.g. a lambda function
// c, w pass quadrature nodes c_j in [a,b] and weights w_j in R as Eigen::VectorXd
template <class Function>
double quadformula(Function &&f, const Eigen::VectorXd &c, const Eigen::VectorXd &w) {
  const Eigen::Index n = c.size();
  double I = 0;
  for (Eigen::Index i = 0; i < n; ++i) { I += w(i) * f(c(i)); }
  return I;
}

A single invocation costs n point evaluations of the integrand plus n additions and multiplications.

Remark 7.2.0.4 (Transformation of quadrature rules) In the setting of function approximation by polynomials we learned in § 6.2.1.14 that an approximation scheme for any interval could be obtained from an approximation scheme on a single reference interval ([−1, 1] in § 6.2.1.14) by means of affine pullback, see (6.2.1.18). A similar affine transformation technique makes it possible to derive a quadrature formula for an arbitrary interval from a single quadrature formula on a reference interval.

Given: quadrature formula $\big\{(\hat c_j, \hat w_j)\big\}_{j=1}^n$ on the reference interval $[-1,1]$.

Idea: use the transformation formula for integrals

$$ \int_a^b f(t)\,\mathrm{d}t = \tfrac12 (b-a) \int_{-1}^1 \hat f(\tau)\,\mathrm{d}\tau , \qquad \hat f(\tau) := f\big(\tfrac12(1-\tau)a + \tfrac12(\tau+1)b\big) . \qquad (7.2.0.5) $$

Fig. 264: affine mapping $\tau \mapsto t := \Phi(\tau) := \tfrac12(1-\tau)a + \tfrac12(\tau+1)b$ of the reference interval $[-1,1]$ onto $[a,b]$.

Note that $\hat f$ is the affine pullback $\Phi^* f$ of f to $[-1,1]$ as defined in Eq. (6.2.1.16).

This yields a quadrature formula for a general interval $[a,b]$, $a,b \in \mathbb{R}$:

$$ \int_a^b f(t)\,\mathrm{d}t \approx \tfrac12(b-a) \sum_{j=1}^n \hat w_j \hat f(\hat c_j) = \sum_{j=1}^n w_j f(c_j) \quad \text{with} \quad c_j = \tfrac12(1-\hat c_j)a + \tfrac12(1+\hat c_j)b , \quad w_j = \tfrac12(b-a)\hat w_j . $$

In words, the nodes are just mapped through the affine transformation $c_j = \Phi(\hat c_j)$, and the weights are scaled by the ratio of the lengths of $[a,b]$ and $[-1,1]$.

A 1D quadrature formula on arbitrary intervals can be specified by providing its weights $\hat w_j$ / nodes $\hat c_j$ for the integration domain $[-1,1]$ (reference interval). Then the above transformation is assumed.

Another common choice for the reference interval: [0, 1], pay attention! y
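The transformation can be carried out mechanically. The following sketch (our own code, assuming nodes and weights of a reference rule on [−1,1] are given as Eigen vectors, in the spirit of Code 7.2.0.3) maps them to an arbitrary interval [a, b]:

#include <Eigen/Dense>
#include <utility>

// Maps a quadrature rule given by nodes chat and weights what on the
// reference interval [-1,1] to an arbitrary interval [a,b], cf. Rem. 7.2.0.4:
//   c_j = (a+b)/2 + (b-a)/2 * chat_j ,   w_j = (b-a)/2 * what_j .
std::pair<Eigen::VectorXd, Eigen::VectorXd> transformQuadRule(
    const Eigen::VectorXd &chat, const Eigen::VectorXd &what,
    double a, double b) {
  Eigen::VectorXd c = Eigen::VectorXd::Constant(chat.size(), 0.5 * (a + b)) +
                      0.5 * (b - a) * chat;
  Eigen::VectorXd w = 0.5 * (b - a) * what;
  return {c, w};
}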


Remark 7.2.0.6 (Tables of quadrature rules) In many codes families of quadrature rules are used to control the quadrature error. Usually, suitable sequences of weights $w_j^{(n)}$ and nodes $c_j^{(n)}$ are precomputed and stored in tables up to sufficiently large values of n. A possible interface could be the following:

struct QuadTab {
  template <typename VecType>
  static void getrule(int n, VecType &c, VecType &w,
                      double a = -1.0, double b = 1.0);
};

Calling the method getrule() fills the vectors c and w with the nodes and the weights for a desired
n-point quadrature on [ a, b] with [−1, 1] being the default reference interval. For VecType we may assume
the basic functionality of Eigen::VectorXd. y

§7.2.0.7 (Quadrature rules by approximation schemes) Every approximation scheme $\mathrm{A} : C^0([a,b]) \to V$, V a space of “simple functions” on $[a,b]$, see § 6.1.0.5, gives rise to a method for numerical quadrature according to

$$ \int_a^b f(t)\,\mathrm{d}t \approx Q_{\mathrm{A}}(f) := \int_a^b (\mathrm{A}f)(t)\,\mathrm{d}t . \qquad (7.2.0.8) $$

As explained in § 6.1.0.6, every interpolation scheme $\mathrm{I}_{\mathcal{T}} : \mathbb{R}^{n+1} \to V$ based on the node set $\mathcal{T} = \{t_0, t_1, \ldots, t_n\} \subset [a,b]$ (→ § 5.1.0.7) induces an approximation scheme, and, hence, also a quadrature scheme on $[a,b]$:

$$ \int_a^b f(t)\,\mathrm{d}t \approx \int_a^b \mathrm{I}_{\mathcal{T}}\big[f(t_0),\ldots,f(t_n)\big]^{\top}(t)\,\mathrm{d}t . \qquad (7.2.0.9) $$

Lemma 7.2.0.10. Quadrature formulas from linear interpolation schemes

Every linear interpolation operator IT according to Def. 5.1.0.25 spawns a quadrature formula (→
Def. 7.2.0.1) by (7.2.0.9).

Proof. Writing $\mathbf{e}_j$ for the j-th unit vector of $\mathbb{R}^{n+1}$, $j = 0,\ldots,n$, we have by linearity

$$ \int_a^b \mathrm{I}_{\mathcal{T}}\big[f(t_0),\ldots,f(t_n)\big](t)\,\mathrm{d}t = \sum_{j=0}^{n} f(t_j) \underbrace{\int_a^b \big(\mathrm{I}_{\mathcal{T}}(\mathbf{e}_j)\big)(t)\,\mathrm{d}t}_{\text{weight } w_j} . \qquad (7.2.0.11) $$

Hence, we have arrived at an (n+1)-point quadrature formula with nodes $t_j$, whose weights are the integrals of the cardinal interpolants for the interpolation scheme $\mathcal{T}$.

Summing up, we have found:

    interpolation schemes  −→  approximation schemes  −→  quadrature schemes

§7.2.0.12 (Convergence of numerical quadrature) In general the quadrature formula (7.2.0.2) will only
provide an approximate value for the integral.

➣ For a generic integrand we will encounter a non-vanishing

$$ \text{quadrature error} \qquad E_n(f) := \int_a^b f(t)\,\mathrm{d}t - Q_n(f) . $$

As in the case of function approximation by interpolation (Section 6.2.2), our focus will be on the asymptotic behavior of the quadrature error as a function of the number n of point evaluations of the integrand.

Therefore consider families of quadrature rules $\{Q_n\}_n$ (→ Def. 7.2.0.1) described by
✦ quadrature weights $\big\{ w_j^{n},\ j = 1,\ldots,n \big\}_{n\in\mathbb{N}}$ and
✦ quadrature nodes $\big\{ c_j^{n},\ j = 1,\ldots,n \big\}_{n\in\mathbb{N}}$ .
We study the asymptotic behavior of the quadrature error $E(n)$ for $n \to \infty$.
As in the case of interpolation errors in § 6.2.2.5 we make the usual qualitative distinction, see Def. 6.2.2.7:
✄ algebraic convergence $E(n) = O(n^{-p})$, rate $p > 0$,
✄ exponential convergence $E(n) = O(q^{n})$, $0 \le q < 1$.
Note that the number n of nodes agrees with the number of f -evaluations required for the evaluation of
the quadrature formula. This is usually used as a measure for the cost of computing Qn ( f ).

Therefore, in the sequel, we consider the quadrature error as a function of n. y

§7.2.0.13 (Quadrature error from approximation error) Bounds for the maximum norm of the approx-
imation error of an approximation scheme directly translate into estimates of the quadrature error of the
induced quadrature scheme (7.2.0.8):
$$ \Big| \int_a^b f(t)\,\mathrm{d}t - Q_{\mathrm{A}}(f) \Big| \le \int_a^b \big| f(t) - \mathrm{A}(f)(t) \big|\,\mathrm{d}t \le |b-a|\,\big\| f - \mathrm{A}(f) \big\|_{L^\infty([a,b])} . \qquad (7.2.0.14) $$

Hence, the various estimates derived in Section 6.2.2 and Section 6.2.3.2 give us quadrature error esti-
mates “for free”. More details will be given in the next section. y
Review question(s) 7.2.0.15 (Quadrature formulas)
(Q7.2.0.15.A) Explain the structure of a quadrature formula/rule for the approximation of the integral $\int_a^b f(t)\,\mathrm{d}t$, $a, b \in \mathbb{R}$.
(Q7.2.0.15.B) The integral satisfies
$$ g, f \in C^0([a,b]) ,\ g \le f \text{ on } [a,b] \quad\Rightarrow\quad \int_a^b g(t)\,\mathrm{d}t \le \int_a^b f(t)\,\mathrm{d}t . $$

Formulate necessary and sufficient conditions on an n-point quadrature rule Qn such that

g, f ∈ C0 ([ a, b]) , g ≤ f on [ a, b] ⇒ Qn ( g) ≤ Qn ( f ) .
(Q7.2.0.15.C) The documentation of the C++ function
Eigen::Matrix<double, Eigen::Dynamic, 2> getQuadRule(unsigned int n);

claims that it provides a family of quadrature rules on the interval [−1, 1], and that n passes the number
of quadrature nodes/points.
Why does this claim make sense and which piece of information is missing? How could you retrieve
that missing piece, if you can call the function getQuadRule().


(Q7.2.0.15.D) Consider the following C++ function for numerical quadrature.


template <typename QUADRULE, typename FUNCTION>
auto integrate(const QUADRULE &qr, FUNCTION &&f)
    -> decltype(qr[0].second * f(qr[0].first)) {
  using Scalar = decltype(qr[0].second * f(qr[0].first));
  Scalar s{0};
  for (auto nw : qr) {
    s += nw.second * f(nw.first);
  }
  return s;
}

Explain the requirements on the types QUADRULE and FUNCTION.

Hint. The C++ class std::pair is an abstract container for two objects of different types that can be
accessed via data members first and second.
(Q7.2.0.15.E) An integration path γ in the complex plane ℂ is usually given by its parameterization $\gamma : [a,b] \to \mathbb{C}$, $a, b \in \mathbb{R}$. If $f : D \subset \mathbb{C} \to \mathbb{C}$ is continuous and $\gamma([a,b]) \subset D$, then the path integral of f along γ is defined as

$$ \int_\gamma f(z)\,\mathrm{d}z := \int_a^b f(\gamma(\tau)) \cdot \dot\gamma(\tau)\,\mathrm{d}\tau , \qquad (6.2.2.51) $$

where $\dot\gamma$ designates the derivative of γ with respect to the parameter and · indicates multiplication in ℂ. What do quadrature formulas look like that can be used for the approximate computation of

$$ \int_{\partial D} f(z)\,\mathrm{d}z , \qquad D := \{ z \in \mathbb{C} : |z| \le 1 \} ? $$

Hint. A natural parameterization of ∂D is provided by the complex exponential.


7.3 Polynomial Quadrature Formulas

Video tutorial for Section 7.3 "Polynomial Quadrature Formulas": (9 minutes) Download link,
tablet notes

Now we specialize the general recipe of § 7.2.0.7 for approximation schemes based on global polynomials,
the Lagrange approximation scheme as introduced in Section 6.2, Def. 6.2.2.1.

Supplementary literature. This topic is discussed in [DR08, Sect. 10.2].

Idea: replace the integrand f with $p_{n-1} := \mathrm{I}_{\mathcal{T}} f \in \mathcal{P}_{n-1}$, the polynomial Lagrange interpolant of f (→ Cor. 5.2.2.8) for a given node set $\mathcal{T} := \{t_0, \ldots, t_{n-1}\} \subset [a,b]$:

$$ \int_a^b f(t)\,\mathrm{d}t \approx Q_n(f) := \int_a^b p_{n-1}(t)\,\mathrm{d}t . \qquad (7.3.0.1) $$


The cardinal interpolants for Lagrange interpolation are the Lagrange polynomials (5.2.2.4)

$$ L_i(t) := \prod_{\substack{j=0 \\ j\neq i}}^{n-1} \frac{t - t_j}{t_i - t_j} , \quad i = 0,\ldots,n-1 , \qquad \overset{(5.2.2.6)}{\leadsto} \quad p_{n-1}(t) = \sum_{i=0}^{n-1} f(t_i)\, L_i(t) . $$

Then (7.2.0.11) amounts to the n-point quadrature formula

$$ \int_a^b p_{n-1}(t)\,\mathrm{d}t = \sum_{i=0}^{n-1} f(t_i) \int_a^b L_i(t)\,\mathrm{d}t \qquad \text{with nodes } c_i = t_{i-1} ,\ \text{weights } w_i := \int_a^b L_{i-1}(t)\,\mathrm{d}t . \qquad (7.3.0.2) $$

EXAMPLE 7.3.0.3 (Midpoint rule)

The midpoint rule is (7.3.0.2) for $n = 1$ and $t_0 = \tfrac12(a+b)$. It leads to the 1-point quadrature formula

$$ \int_a^b f(t)\,\mathrm{d}t \approx Q_{\mathrm{mp}}(f) = (b-a)\, f\big(\tfrac12(a+b)\big) . $$

Fig. 265: the area under the graph of f is approximated by the area of a rectangle of width $b-a$ and height $f(\tfrac12(a+b))$.
y
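A one-line realization of the midpoint rule as a C++ sketch (our own illustration, not one of the lecture codes); the integrand is again passed in procedural form:

#include <cmath>
#include <iostream>

// Midpoint rule: Q_mp(f) = (b-a) * f((a+b)/2), cf. Ex. 7.3.0.3
template <typename FUNCTOR>
double midpointrule(FUNCTOR &&f, double a, double b) {
  return (b - a) * f(0.5 * (a + b));
}

int main() {
  const double pi = std::acos(-1.0);
  // Example: approximate the integral of sin over [0, pi] (exact value 2)
  std::cout << midpointrule([](double t) { return std::sin(t); }, 0.0, pi)
            << std::endl;
  return 0;
}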

EXAMPLE 7.3.0.4 (Newton-Cotes formulas → [Han02, Ch. 38]) The n := m + 1-point Newton-Cotes
formulas arise from Lagrange interpolation in equidistant nodes (6.2.2.3) in the integration interval [ a, b]:

Equidistant quadrature nodes $t_j := a + hj$, $h := \frac{b-a}{m}$, $j = 0,\ldots,m$.
The weights for the interval [0, 1] can be found, e.g., by symbolic computation using MAPLE: the following MAPLE function expects the polynomial degree as input argument and computes the weights for the interval [0, 1]:
> newtoncotes := m -> factor(int(interp([seq(i/m, i=0..m)],
    [seq(f(i/m), i=0..m)], z), z=0..1)):
Weights on general intervals [ a, b] can then be deduced by the affine transformation rule as explained in
Rem. 7.2.0.4.

• n = 2: Trapezoidal rule (integrate linear interpolant of integrand in endpoints)

> trapez := newtoncotes(1);

$$ Q_{\mathrm{trp}}(f) := \tfrac12\big( f(0) + f(1) \big) \qquad (7.3.0.5) $$

$$ \leadsto \qquad \int_a^b f(t)\,\mathrm{d}t \approx \frac{b-a}{2}\big( f(a) + f(b) \big) $$

Fig. 266: the trapezoidal rule approximates the area under the graph of f by the area of the trapezoid spanned by the endpoint values f(a) and f(b).

• n = 3: Simpson rule
> simpson := newtoncotes(2);

$$ \tfrac16\Big( f(0) + 4 f\big(\tfrac12\big) + f(1) \Big) \qquad\leadsto\qquad \int_a^b f(t)\,\mathrm{d}t \approx \frac{b-a}{6}\Big( f(a) + 4 f\Big(\frac{a+b}{2}\Big) + f(b) \Big) \qquad (7.3.0.6) $$
• n = 5: Milne rule
> milne := newtoncotes(4);

$$ \tfrac{1}{90}\Big( 7 f(0) + 32 f\big(\tfrac14\big) + 12 f\big(\tfrac12\big) + 32 f\big(\tfrac34\big) + 7 f(1) \Big) $$

$$ \leadsto \qquad \frac{b-a}{90}\Big( 7 f(a) + 32 f\big(a + \tfrac{b-a}{4}\big) + 12 f\big(a + \tfrac{b-a}{2}\big) + 32 f\big(a + \tfrac{3(b-a)}{4}\big) + 7 f(b) \Big) $$
• n = 7: Weddle rule
> weddle := newtoncotes(6);

$$ \tfrac{1}{840}\Big( 41 f(0) + 216 f\big(\tfrac16\big) + 27 f\big(\tfrac13\big) + 272 f\big(\tfrac12\big) + 27 f\big(\tfrac23\big) + 216 f\big(\tfrac56\big) + 41 f(1) \Big) . $$

• n ≥ 8: quadrature formulas with negative weights


> newtoncotes(8);

$$ \tfrac{1}{28350}\Big( 989 f(0) + 5888 f\big(\tfrac18\big) - 928 f\big(\tfrac14\big) + 10496 f\big(\tfrac38\big) - 4540 f\big(\tfrac12\big) + 10496 f\big(\tfrac58\big) - 928 f\big(\tfrac34\big) + 5888 f\big(\tfrac78\big) + 989 f(1) \Big) $$
y
! From Ex. 6.2.2.11 we know that the approximation error incurred by Lagrange interpolation in equidistant nodes can blow up even for analytic functions. This blow-up can also infect the quadrature error of Newton-Cotes formulas for large n, which renders them essentially useless. In addition, they will be marred by large (in modulus) and negative weights, which compromises numerical stability (→ Def. 1.5.5.19).


No negative weights!

Quadrature formulas with negative weights should not be used, not even considered!

Remark 7.3.0.8 (Clenshaw-Curtis quadrature rules [Tre08]) The considerations of Section 6.2.3 con-
firmed the superiority of the “optimal” Chebychev nodes (6.2.3.10) for globally polynomial Lagrange in-
terpolation. This suggests that we use these nodes also for numerical quadrature with weights given by
(7.3.0.2). This yields the so-called Clenshaw-Curtis rules with the following rather desirable property:

Theorem 7.3.0.9. Positivity of Clenshaw-Curtis weights [Fej33]

The weights $w_j^{(n)}$, $j = 1,\ldots,n$, for every n-point Clenshaw-Curtis rule are positive.

The weights of any n-point Clenshaw-Curtis rule can be computed with a computational effort of
O(n log n) using FFT [Wal06], [Tre08, Sect. 2]. y

§7.3.0.10 (Error estimates for polynomial quadrature) As a concrete application of § 7.2.0.13, (7.2.0.14), we use the $L^\infty$-bound (6.2.2.22) for Lagrange interpolation

$$ \| f - \mathrm{L}_{\mathcal{T}} f \|_{L^\infty(I)} \le \frac{\big\| f^{(n+1)} \big\|_{L^\infty(I)}}{(n+1)!} \max_{t \in I} \big| (t - t_0) \cdots (t - t_n) \big| \qquad (6.2.2.22) $$

to conclude for any n-point quadrature rule based on polynomial interpolation:

$$ f \in C^n([a,b]) \;\Rightarrow\; \Big| \int_a^b f(t)\,\mathrm{d}t - Q_n(f) \Big| \le \frac{1}{n!}\,(b-a)^{n+1}\, \big\| f^{(n)} \big\|_{L^\infty([a,b])} . \qquad (7.3.0.11) $$

Much sharper estimates for Clenshaw-Curtis rules (→ Rem. 7.3.0.8) can be inferred from the interpolation
error estimate (6.2.3.18) for Chebychev interpolation. For functions with limited smoothness algebraic con-
vergence of the quadrature error for Clenshaw-Curtis quadrature follows from (6.2.3.21). For integrands
that possess an analytic extension to the complex plane in a neighborhood of [ a, b], we can conclude
exponential convergence from (6.2.3.28). y

Review question(s) 7.3.0.12 (Polynomial quadrature formulas)


(Q7.3.0.12.A) Given an n-point quadrature formula on [−1, 1] in terms of pairs of weights and nodes
(w j , c j ), j = 1, . . . , n, how can you decide whether it is a “polynomial quadrature formula”, that is, a
quadrature formula that is based on Lagrange polynomial interpolation of the integrand.

7.4 Gauss Quadrature

Supplementary literature. Gauss quadrature is discussed in detail in [Han02, Ch. 40-41],

[DR08, Sect.10.3]

7.4.1 Order of a Quadrature Rule


Video tutorial for Section 7.4.1 "Order of a Quadrature Rule": (9 minutes) Download link,
tablet notes


How to gauge the “quality” of an n-point quadrature formula Qn without testing it for specific integrands?
The next definition gives a classical answer.

Definition 7.4.1.1. Order of a quadrature rule

The order of a quadrature rule Qn : C0 ([ a, b]) → R is defined as


$$ \mathrm{order}(Q_n) := \max\Big\{ m \in \mathbb{N}_0 : Q_n(p) = \int_a^b p(t)\,\mathrm{d}t \ \ \forall p \in \mathcal{P}_m \Big\} + 1 , \qquad (7.4.1.2) $$

that is, as the maximal degree +1 of polynomials for which the quadrature rule is guaranteed to be
exact.

§7.4.1.3 (Invariance of order under (affine) transformation) First we note a simple consequence of the
invariance of the polynomial space Pn under affine pullback, see Lemma 6.2.1.17.

Corollary 7.4.1.4. Invariance of order under affine transformation

An affine transformation of a quadrature rule according to Rem. 7.2.0.4 does not change its order.

§7.4.1.5 (Order of polynomial quadrature rules) Further, by construction all polynomial n-point quadra-
ture rules possess order at least n.

Theorem 7.4.1.6. Sufficient order conditions for quadrature rules

An n-point quadrature rule on $[a,b]$ (→ Def. 7.2.0.1)

$$ Q_n(f) := \sum_{j=1}^n w_j f(c_j) , \qquad f \in C^0([a,b]) , $$

with nodes $c_j \in [a,b]$ and weights $w_j \in \mathbb{R}$, $j = 1,\ldots,n$, has order $\ge n$ if and only if

$$ w_j = \int_a^b L_{j-1}(t)\,\mathrm{d}t , \quad j = 1,\ldots,n , $$

where $L_k$, $k = 0,\ldots,n-1$, is the k-th Lagrange polynomial (5.2.2.4) associated with the ordered node set $\{c_1, c_2, \ldots, c_n\}$.

Proof. The conclusion of the theorem is a direct consequence of the facts that

$\mathcal{P}_{n-1} = \mathrm{Span}\{L_0, \ldots, L_{n-1}\}$ and $L_k(c_j) = \delta_{k+1,j}$, $k+1, j \in \{1,\ldots,n\}$:

just plug a Lagrange polynomial into the quadrature formula.


By construction (7.3.0.2), polynomial n-point quadrature formulas (7.3.0.1) are exact for $f \in \mathcal{P}_{n-1}$ ⇒ every n-point polynomial quadrature formula has at least order n. y

Remark 7.4.1.7 (Linear system for quadrature weights) Thm. 7.4.1.6 provides a concrete formula for quadrature weights, which guarantees order n for an n-point quadrature formula. Yet evaluating integrals


of Lagrange polynomials may be cumbersome. Here we give a general recipe for finding the weights w j
according to Thm. 7.4.1.6 without dealing with Lagrange polynomials.

Given: arbitrary nodes c1 , . . . , cn for n-point (local) quadrature formula on [ a, b]

From Def. 7.4.1.1 we immediately conclude the following procedure: If $p_1, \ldots, p_n$ is any basis of $\mathcal{P}_{n-1}$, then, thanks to the linearity of the integral and of quadrature formulas,

$$ Q_n(p_j) = \int_a^b p_j(t)\,\mathrm{d}t \quad \forall j = 1,\ldots,n \qquad \Longleftrightarrow \qquad Q_n \text{ has order} \ge n . \qquad (7.4.1.8) $$

➣ This is an $n \times n$ linear system of equations, see (7.4.2.3) for an example:

$$ \begin{bmatrix} p_1(c_1) & \ldots & p_1(c_n) \\ \vdots & & \vdots \\ p_n(c_1) & \ldots & p_n(c_n) \end{bmatrix} \begin{bmatrix} w_1 \\ \vdots \\ w_n \end{bmatrix} = \begin{bmatrix} \int_a^b p_1(t)\,\mathrm{d}t \\ \vdots \\ \int_a^b p_n(t)\,\mathrm{d}t \end{bmatrix} . \qquad (7.4.1.9) $$

For instance, for the computation of quadrature weights, one may choose the monomial basis $p_j(t) = t^{j-1}$, $j = 1,\ldots,n$.
y
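A small sketch of this recipe (our own Eigen-based code): given nodes $c_1, \ldots, c_n$ on the reference interval [−1, 1], set up the linear system (7.4.1.9) in the monomial basis $p_j(t) = t^{j-1}$ and solve it for the weights.

#include <Eigen/Dense>
#include <cmath>

// Computes the weights of an order-n n-point quadrature rule on [-1,1]
// with prescribed nodes c by solving (7.4.1.9) for the monomial basis
// p_j(t) = t^{j-1}, j = 1,...,n.
Eigen::VectorXd weightsFromNodes(const Eigen::VectorXd &c) {
  const Eigen::Index n = c.size();
  Eigen::MatrixXd A(n, n);
  Eigen::VectorXd b(n);
  for (Eigen::Index j = 0; j < n; ++j) {
    for (Eigen::Index k = 0; k < n; ++k) {
      A(j, k) = std::pow(c(k), j);  // p_{j+1}(c_{k+1}) = c_k^j
    }
    // right-hand side: \int_{-1}^{1} t^j dt
    b(j) = (j % 2 == 0) ? 2.0 / (j + 1) : 0.0;
  }
  return A.lu().solve(b);  // weights w_1, ..., w_n
}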

EXAMPLE 7.4.1.10 (Orders of simple polynomial quadrature formulas) From the order rule for poly-
nomial quadrature rule we immediately conclude the orders of simple representatives.
 n | rule                       | order
---+----------------------------+------
 1 | midpoint rule              |   2
 2 | trapezoidal rule (7.3.0.5) |   2
 3 | Simpson rule (7.3.0.6)     |   4
 4 | 3/8-rule                   |   4
 5 | Milne rule                 |   6
The orders for odd n surpass the predictions of Thm. 7.4.1.6 by 1, which can be verified by straightforward computations; following Def. 7.4.1.1, check the exactness of the quadrature rule on [0, 1] (this is sufficient → Cor. 7.4.1.4) for the monomials $\{t \mapsto t^k\}$, $k = 0,\ldots,q-1$, which form a basis of $\mathcal{P}_{q-1}$, where q is the order that is to be confirmed: essentially one has to show

$$ Q(\{t \mapsto t^k\}) = \sum_{j=1}^n w_j c_j^k = \frac{1}{k+1} , \quad k = 0,\ldots,q-1 , \qquad (7.4.1.11) $$

where Q is the quadrature rule on [0, 1] given by (7.2.0.2).

For the Simpson rule (7.3.0.6) we can also confirm order 4 with symbolic calculations in MAPLE:
> rule := 1/3*h*(f(2*h)+4*f(h)+f(0))
> err := taylor(rule - int(f(x),x=0..2*h),h=0,6);

$$ err := \frac{1}{90}\,\big(D^{(4)} f\big)(0)\, h^5 + O\big(h^6\big) $$
➣ Composite Simpson rule possesses order 4, indeed ! y
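The check (7.4.1.11) is easy to automate. Here is a small sketch (our own code, not from the lecture repository) that determines the order of a quadrature rule on [0, 1], given as vectors of nodes and weights, by testing monomials until exactness fails:

#include <Eigen/Dense>
#include <cmath>

// Returns the order (Def. 7.4.1.1) of a quadrature rule on [0,1] given by
// nodes c and weights w, determined by testing monomials t^k as in (7.4.1.11).
unsigned int checkOrder(const Eigen::VectorXd &c, const Eigen::VectorXd &w,
                        double tol = 1e-10) {
  for (unsigned int k = 0; k < 64; ++k) {
    double Q = 0.0;
    for (Eigen::Index j = 0; j < c.size(); ++j) Q += w(j) * std::pow(c(j), k);
    // exact value of \int_0^1 t^k dt is 1/(k+1)
    if (std::abs(Q - 1.0 / (k + 1)) > tol) return k;  // order = largest exact degree + 1
  }
  return 64;  // capped; roundoff normally triggers a return much earlier
}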

Review question(s) 7.4.1.12 (Order of a quadrature rule)


(Q7.4.1.12.A) What is the minimal order of an n-point polynomial quadrature rule?


(Q7.4.1.12.B) What is meant by the statement that a linear mapping X → Y , X, Y finite-dimensional


vector spaces, is uniquely determined by its action on the elements of a basis of X ?
(Q7.4.1.12.C) In order to determine the weights of an order-n n-point quadrature rule on [−1, 1] we rely
on the monomials t → tk , k = 0, . . . , n − 1, as basis polynomials. Write down the linear system of
equations for the weights w j , j = 1, . . . , n.
(Q7.4.1.12.D) Discuss the statement:
“A quadrature rule is good, if it yields a small quadrature error.”

7.4.2 Maximal-Order Quadrature Rules


Video tutorial for Section 7.4.2 "Maximal-Order Quadrature Rules": (16 minutes)
Download link, tablet notes

A natural question is whether an n-point quadrature formula achieve an order > n. A negative result limits
the maximal order that can be achieved:

Theorem 7.4.2.1. Maximal order of n-point quadrature rule

The maximal order of an n-point quadrature rule is 2n.

Proof. Consider a generic n-point quadrature rule according to Def. 7.2.0.1,

$$ Q_n(f) := \sum_{j=1}^n w_j^{n} f\big(c_j^{n}\big) . \qquad (7.2.0.2) $$

We build a polynomial of degree 2n that cannot be integrated exactly by $Q_n$: we choose the polynomial

$$ q(t) := (t - c_1)^2 \cdots (t - c_n)^2 \in \mathcal{P}_{2n} . $$

On the one hand, q is strictly positive almost everywhere, which means

$$ \int_a^b q(t)\,\mathrm{d}t > 0 . $$

On the other hand, we find a different value

$$ Q_n(q) = \sum_{j=1}^n w_j^{n} \underbrace{q\big(c_j^{n}\big)}_{=0} = 0 . $$

Can we at least find n-point rules with maximal order 2n?


Heuristics: A quadrature formula already has order m ∈ ℕ, if it is exact for m polynomials $\in \mathcal{P}_{m-1}$ that form a basis of $\mathcal{P}_{m-1}$ (recall Thm. 5.2.1.2).

An n-point quadrature formula has 2n “degrees of freedom” (n node positions, n weights).

It might be possible to achieve order $2n = \dim \mathcal{P}_{2n-1}$ (“No. of equations = No. of unknowns”).


EXAMPLE 7.4.2.2 (2-point quadrature rule of order 4) Necessary & sufficient conditions for order 4, cf. (7.4.1.9): integrate the functions of the monomial basis of $\mathcal{P}_3$ exactly:

$$ Q_n(p) = \int_a^b p(t)\,\mathrm{d}t \ \ \forall p \in \mathcal{P}_3 \quad\Longleftrightarrow\quad Q_n(\{t \mapsto t^q\}) = \frac{1}{q+1}\big(b^{q+1} - a^{q+1}\big) , \quad q = 0,1,2,3 . $$

This gives 4 equations for the weights $w_j$ and nodes $c_j$, $j = 1,2$ ($a = -1$, $b = 1$), cf. Rem. 7.4.1.7:

$$ \int_{-1}^1 1\,\mathrm{d}t = 2 = w_1 + w_2 , \qquad \int_{-1}^1 t\,\mathrm{d}t = 0 = c_1 w_1 + c_2 w_2 , $$
$$ \int_{-1}^1 t^2\,\mathrm{d}t = \tfrac23 = c_1^2 w_1 + c_2^2 w_2 , \qquad \int_{-1}^1 t^3\,\mathrm{d}t = 0 = c_1^3 w_1 + c_2^3 w_2 . \qquad (7.4.2.3) $$

Solve using MAPLE:
> eqns := {seq(int(x^k, x=-1..1) = w[1]*xi[1]^k+w[2]*xi[2]^k,k=0..3)};
> sols := solve(eqns, indets(eqns, name)):
> convert(sols, radical);

➣ weights & nodes: $w_2 = 1$, $w_1 = 1$, $c_1 = \tfrac13\sqrt{3}$, $c_2 = -\tfrac13\sqrt{3}$

$$ \text{quadrature formula (order 4):} \qquad \int_{-1}^1 f(x)\,\mathrm{d}x \approx f\Big(\tfrac{1}{\sqrt{3}}\Big) + f\Big(-\tfrac{1}{\sqrt{3}}\Big) \qquad (7.4.2.4) $$
y
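A quick numerical sanity check of (7.4.2.4) (our own snippet): the rule reproduces $\int_{-1}^{1} t^k\,\mathrm{d}t$ exactly for $k = 0, \ldots, 3$, but not for $k = 4$.

#include <cmath>
#include <iostream>

int main() {
  const double c = 1.0 / std::sqrt(3.0);  // nodes +-1/sqrt(3), weights 1, 1
  for (int k = 0; k <= 4; ++k) {
    const double Q = std::pow(c, k) + std::pow(-c, k);    // rule (7.4.2.4)
    const double I = (k % 2 == 0) ? 2.0 / (k + 1) : 0.0;  // exact integral
    std::cout << "k=" << k << ": |Q - I| = " << std::abs(Q - I) << std::endl;
  }
  return 0;
}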

§7.4.2.5 (Construction of n-point quadrature rules with maximal order 2n) First we search for neces-
sary conditions that have to be met by the nodes, if an n-point quadrature rule has order 2n.

Optimist’s assumption: ∃ family of n-point quadrature formulas on $[-1,1]$

$$ Q_n(f) := \sum_{j=1}^n w_j^{(n)} f\big(c_j^{(n)}\big) \approx \int_{-1}^1 f(t)\,\mathrm{d}t , \quad w_j^{(n)} \in \mathbb{R} ,\ n \in \mathbb{N} , \quad \text{of order } 2n \ \Leftrightarrow\ \text{exact for polynomials} \in \mathcal{P}_{2n-1} . \qquad (7.4.2.6) $$

Define $\bar P_n(t) := \big(t - c_1^{(n)}\big) \cdots \big(t - c_n^{(n)}\big)$, $t \in \mathbb{R}$ ⇒ $\bar P_n \in \mathcal{P}_n$.
Note: $\bar P_n$ has leading coefficient = 1.
By the assumption on the order of $Q_n$ we know that for any $q \in \mathcal{P}_{n-1}$

$$ \int_{-1}^1 \underbrace{q(t)\,\bar P_n(t)}_{\in \mathcal{P}_{2n-1}}\,\mathrm{d}t \overset{(7.4.2.6)}{=} \sum_{j=1}^n w_j^{(n)}\, q\big(c_j^{(n)}\big) \underbrace{\bar P_n\big(c_j^{(n)}\big)}_{=0} = 0 . $$

We conclude the $L^2([-1,1])$-orthogonality

$$ \int_{-1}^1 q(t)\,\bar P_n(t)\,\mathrm{d}t = 0 \quad \forall q \in \mathcal{P}_{n-1} , \qquad (7.4.2.7) $$

i.e., the $L^2(]-1,1[)$-inner product of q and $\bar P_n$ vanishes, see (6.3.1.5).


$\mathcal{P}_n$ equipped with the inner product $(p,q) \mapsto \int_{-1}^1 p(t)q(t)\,\mathrm{d}t$ can be viewed as a Euclidean space: $\bar P_n \perp \mathcal{P}_{n-1}$.

$\mathcal{P}_{n-1} \subset \mathcal{P}_n$ is a subspace of co-dimension 1. Hence, $\mathcal{P}_{n-1}$ has a 1-dimensional orthogonal complement in $\mathcal{P}_n$. By (7.4.2.7) $\bar P_n$ belongs to that complement. It takes one additional condition to fix $\bar P_n$, and the requirement that its leading coefficient be 1 is that condition.

Fig. 267: $\bar P_n$ is orthogonal to the subspace $\mathcal{P}_{n-1}$ inside $\mathcal{P}_n$.

We can also give an algebraic argument for existence and uniqueness of $\bar P_n$. Switching to a monomial representation of $\bar P_n$,

$$ \bar P_n(t) = t^n + \alpha_{n-1} t^{n-1} + \cdots + \alpha_1 t + \alpha_0 , $$

by linearity of the integral, (7.4.2.7) is equivalent to

$$ (7.4.2.7) \;\Leftrightarrow\; \int_{-1}^1 \big( t^n + \alpha_{n-1} t^{n-1} + \cdots + \alpha_1 t + \alpha_0 \big)\, q(t)\,\mathrm{d}t = 0 \quad \forall q \in \mathcal{P}_{n-1} $$
$$ \;\Leftrightarrow\; \sum_{j=0}^{n-1} \alpha_j \int_{-1}^1 t^j t^{\ell}\,\mathrm{d}t = - \int_{-1}^1 t^n t^{\ell}\,\mathrm{d}t , \quad \ell = 0,\ldots,n-1 . $$

This is a linear system of equations $\mathbf{A}\,[\alpha_j]_{j=0}^{n-1} = \mathbf{b}$ with a symmetric, positive definite (→ Def. 1.1.2.6) coefficient matrix $\mathbf{A} \in \mathbb{R}^{n,n}$. That $\mathbf{A}$ is positive definite can be concluded from

$$ \mathbf{x}^{\top} \mathbf{A} \mathbf{x} = \int_{-1}^1 \Big( \sum_{j=0}^{n-1} (\mathbf{x})_j t^j \Big)^2 \mathrm{d}t > 0 , \quad \text{if } \mathbf{x} \neq \mathbf{0} . $$

Hence, A is regular and the coefficients α j are uniquely determined. Thus there is only one n-point
quadrature rule of order 2n.

The nodes of an n-point quadrature formula of order 2n, if it exists, must coincide with the unique zeros
of the polynomials P̄n ∈ Pn \ {0} satisfying (7.4.2.7).

Remark 7.4.2.8 (Gram-Schmidt orthogonalization of polynomials)


Recall: $(f,g) \mapsto \int_a^b f(t)\,g(t)\,\mathrm{d}t$ is an inner product on $C^0([a,b])$, the $L^2$-inner product, see Rem. 6.3.2.1, [NS02, Sect. 4.4, Ex. 2], [Gut09, Ex. 6.5].
➣ Treat space of polynomials Pn as a vector space equipped with an inner product.


➣ As we have seen in Section 6.3.2, abstract techniques for vector spaces with inner product can be
applied to polynomials, for instance Gram-Schmidt orthogonalization, cf. § 6.3.1.17, [NS02, Thm. 4.8],
[Gut09, Alg. 6.1].
Now carry out the abstract Gram-Schmidt orthogonalization according to Algorithm (6.3.1.18) and recall
Thm. 6.3.1.19: in a vector space V with inner product (·, ·)V orthogonal vectors q0 , q1 , . . . spanning the
same subspaces as the linearly independent vectors v0 , v1 , . . . are constructed recursively via
$$ q_n := v_n - \sum_{k=0}^{n-1} \frac{(v_n, q_k)_V}{(q_k, q_k)_V}\, q_k , \quad n = 1,2,\ldots , \qquad q_0 := v_0 . \qquad (7.4.2.9) $$

➣ Construction of $\bar P_n$ by Gram-Schmidt orthogonalization of the monomial basis $\{1, t, t^2, \ldots, t^{n-1}\}$ (the $v_k$'s in (7.4.2.9)!) of $\mathcal{P}_{n-1}$ w.r.t. the $L^2([-1,1])$-inner product through the recursion

$$ \bar P_0(t) := 1 , \qquad \bar P_n(t) = t^n - \sum_{k=0}^{n-1} \frac{\int_{-1}^1 t^n\, \bar P_k(t)\,\mathrm{d}t}{\int_{-1}^1 \bar P_k^2(t)\,\mathrm{d}t} \cdot \bar P_k(t) . \qquad (7.4.2.10) $$

Note: P̄n has leading coefficient = 1 ⇒ P̄n uniquely defined (up to sign) by (7.4.2.10).
y

The considerations so far only reveal necessary conditions on the nodes of an n-point quadrature rule of
order 2n:

They by no means confirm the existence of such rules, but they offer a clear hint on how to construct them:

Theorem 7.4.2.11. Existence of n-point quadrature formulas of order 2n

Let { P̄n }n∈N0 be a family of non-zero polynomials that satisfies


• $\bar P_n \in \mathcal{P}_n$ ,
• $\int_{-1}^1 q(t)\,\bar P_n(t)\,\mathrm{d}t = 0$ for all $q \in \mathcal{P}_{n-1}$ ($L^2([-1,1])$-orthogonality),
• the set $\{c_j^{(n)}\}_{j=1}^m$, $m \le n$, of real zeros of $\bar P_n$ is contained in $[-1,1]$.

Then the quadrature rule (→ Def. 7.2.0.1) $Q_n(f) := \sum_{j=1}^m w_j^{(n)} f\big(c_j^{(n)}\big)$ with weights chosen according to Thm. 7.4.1.6 provides a quadrature formula of order 2n on $[-1,1]$.

Proof. Conclude from the orthogonality of the $\bar P_n$ that $\{\bar P_k\}_{k=0}^{n}$ is a basis of $\mathcal{P}_n$ and

$$ \int_{-1}^1 h(t)\,\bar P_n(t)\,\mathrm{d}t = 0 \quad \forall h \in \mathcal{P}_{n-1} . \qquad (7.4.2.12) $$

Recall division of polynomials with remainder (Euclid's algorithm → course “Diskrete Mathematik”): for any $p \in \mathcal{P}_{2n-1}$

$$ p(t) = h(t)\,\bar P_n(t) + r(t) , \quad \text{for some } h \in \mathcal{P}_{n-1} ,\ r \in \mathcal{P}_{n-1} . \qquad (7.4.2.13) $$

Apply this representation to the integral:

$$ \int_{-1}^1 p(t)\,\mathrm{d}t = \underbrace{\int_{-1}^1 h(t)\,\bar P_n(t)\,\mathrm{d}t}_{=0 \text{ by } (7.4.2.12)} + \int_{-1}^1 r(t)\,\mathrm{d}t \overset{(*)}{=} \sum_{j=1}^m w_j^{(n)} r\big(c_j^{(n)}\big) , \qquad (7.4.2.14) $$


(∗): by choice of weights according to Rem. 7.4.1.7 Qn is exact for polynomials of degree ≤ n − 1!

By the choice of the nodes as zeros of $\bar P_n$, using (7.4.2.12):

$$ \sum_{j=1}^m w_j^{(n)} p\big(c_j^{(n)}\big) \overset{(7.4.2.13)}{=} \sum_{j=1}^m w_j^{(n)} h\big(c_j^{(n)}\big) \underbrace{\bar P_n\big(c_j^{(n)}\big)}_{=0} + \sum_{j=1}^m w_j^{(n)} r\big(c_j^{(n)}\big) \overset{(7.4.2.14)}{=} \int_{-1}^1 p(t)\,\mathrm{d}t . $$

§7.4.2.15 (Legendre polynomials and Gauss-Legendre quadrature) The polynomials $\{\bar P_n\}_{n\in\mathbb{N}_0}$ are so-called orthogonal polynomials w.r.t. the $L^2(]-1,1[)$-inner product, see Def. 6.3.2.7. We have made use of orthogonal polynomials already in Section 6.3.2. $L^2([-1,1])$-orthogonal polynomials play a key role in analysis.

Legendre polynomials

The $L^2(]-1,1[)$-orthogonal polynomials are those already discussed in Rem. 6.3.2.16:

Definition 7.4.2.16. Legendre polynomials

The n-th Legendre polynomial $P_n$ is defined by
• $P_n \in \mathcal{P}_n$ ,
• $\int_{-1}^1 P_n(t)\, q(t)\,\mathrm{d}t = 0 \quad \forall q \in \mathcal{P}_{n-1}$ ,
• $P_n(1) = 1$ .

Fig. 268: graphs of the Legendre polynomials $P_0, \ldots, P_5$ on $[-1,1]$.

Notice: the polynomials P̄n defined by (7.4.2.10) and the Legendre polynomials Pn of Def. 7.4.2.16 (merely)
differ by a constant factor!

Gauss points $\xi_j^{(n)}$ = zeros of the Legendre polynomial $P_n$

Note: the above considerations, recall (7.4.2.7), show that the nodes of an n-point quadrature formula of order 2n on $[-1,1]$ must agree with the zeros of $L^2(]-1,1[)$-orthogonal polynomials.

n-point quadrature formulas of order 2n are unique

This is not surprising in light of “2n equations for 2n degrees of freedom”.

! We are not done yet: the zeros of $\bar P_n$ from (7.4.2.10) may lie outside $[-1,1]$. In principle $\bar P_n$ could also have fewer than n real zeros.

The next lemma shows that all this cannot happen.


Fig. 269: zeros of the Legendre polynomials in $[-1,1]$, plotted against the number n of quadrature nodes.

Obviously:

Lemma 7.4.2.17. Zeros of Legendre polynomials

$P_n$ has n distinct zeros in $]-1,1[$.

Zeros of Legendre polynomials = Gauss points

Proof. (indirect) Assume that Pn has only m < n zeros ζ 1 , . . . , ζ m in ] − 1, 1[ at which it changes sign.
Define
$$ q(t) := \prod_{j=1}^m (t - \zeta_j) \quad\Rightarrow\quad q P_n \ge 0 \ \text{ or } \ q P_n \le 0 \quad\Rightarrow\quad \int_{-1}^1 q(t)\, P_n(t)\,\mathrm{d}t \neq 0 . $$

As q ∈ Pn−1 , this contradicts (7.4.2.12).



Definition 7.4.2.18. Gauss-Legendre quadrature formulas

The n-point Quadrature formulas whose nodes, the Gauss points, are given by the zeros of the n-th
Legendre polynomial (→ Def. 7.4.2.16), and whose weights are chosen according to Thm. 7.4.1.6,
are called Gauss-Legendre quadrature formulas.

The last part of this section examines the non-trivial question of how to compute the Gauss points given as
the zeros of Legendre polynomials. Many different algorithms have been devised for this purpose and we
focus on one that employs tools from numerical linear algebra and relies on particular algebraic properties
of the Legendre polynomials.

Remark 7.4.2.19 (3-term recursion for Legendre polynomials) From Thm. 6.3.2.14 we learn that the orthogonal polynomials satisfy the 3-term recursion (6.3.2.15), see also (7.4.2.21). To keep this chapter self-contained, we derive it independently for Legendre polynomials.

Note: the polynomials $\bar P_n$ from (7.4.2.10) are uniquely characterized by the two properties (try a proof!)
(i) $\bar P_n \in \mathcal{P}_n$ with leading coefficient 1: $\bar P_n(t) = t^n + \ldots$,
(ii) $\int_{-1}^1 \bar P_k(t)\,\bar P_j(t)\,\mathrm{d}t = 0$, if $j \neq k$ ($L^2(]-1,1[)$-orthogonality).


➣ we get the same polynomials P̄n by another Gram-Schmidt orthogonalization procedure, cf. (7.4.2.9)
and § 6.3.2.11:
$$ \bar P_{n+1}(t) = t\,\bar P_n(t) - \sum_{k=0}^{n} \frac{\int_{-1}^1 \tau\,\bar P_n(\tau)\,\bar P_k(\tau)\,\mathrm{d}\tau}{\int_{-1}^1 \bar P_k^2(\tau)\,\mathrm{d}\tau} \cdot \bar P_k(t) . $$

By the orthogonality property (7.4.2.12) the sum collapses, since

$$ \int_{-1}^1 \tau\,\bar P_n(\tau)\,\bar P_k(\tau)\,\mathrm{d}\tau = \int_{-1}^1 \bar P_n(\tau)\, \underbrace{\big(\tau\,\bar P_k(\tau)\big)}_{\in \mathcal{P}_{k+1}}\,\mathrm{d}\tau = 0 , \quad \text{if } k+1 < n : $$

$$ \bar P_{n+1}(t) = t\,\bar P_n(t) - \frac{\int_{-1}^1 \tau\,\bar P_n(\tau)\,\bar P_n(\tau)\,\mathrm{d}\tau}{\int_{-1}^1 \bar P_n^2(\tau)\,\mathrm{d}\tau} \cdot \bar P_n(t) - \frac{\int_{-1}^1 \tau\,\bar P_n(\tau)\,\bar P_{n-1}(\tau)\,\mathrm{d}\tau}{\int_{-1}^1 \bar P_{n-1}^2(\tau)\,\mathrm{d}\tau} \cdot \bar P_{n-1}(t) . \qquad (7.4.2.20) $$

After rescaling (tedious!) we obtain the famous 3-term recursion for Legendre polynomials

$$ P_{n+1}(t) := \frac{2n+1}{n+1}\, t\, P_n(t) - \frac{n}{n+1}\, P_{n-1}(t) , \qquad P_0 := 1 ,\quad P_1(t) := t . \qquad (7.4.2.21) $$

In Section 6.2.3.1 we discovered a similar 3-term recursion (6.2.3.5) for Chebychev polynomials. Coinci-
dence? Of course not, nothing in mathematics holds “by accident”. By Thm. 6.3.2.14 3-term recursions
are a distinguishing feature of so-called families of orthogonal polynomials, to which the Chebychev poly-
nomials belong as well, spawned by Gram-Schmidt orthogonalization with respect to a weighted L2 -inner
product, however, see [Han02, p. VI].

➤ Efficient and stable evaluation of Legendre polynomials by means of the 3-term recursion (7.4.2.21), cf. the analogous algorithm for Chebychev polynomials given in Code 6.2.3.6.

C++-code 7.4.2.22: computing Legendre polynomials

2  // returns the values of the Legendre polynomials P_0, ..., P_{n-1}
3  // at the points in the vector x as columns of the matrix L
4  void legendre(const unsigned n, const VectorXd &x, MatrixXd &L) {
5    L = MatrixXd::Ones(x.size(), n);  // P_0(x) = 1
6    L.col(1) = x;                     // P_1(x) = x
7    for (unsigned j = 1; j < n - 1; ++j) {
8      // P_{j+1}(x) = ((2j+1)/(j+1)) * x * P_j(x) - (j/(j+1)) * P_{j-1}(x), Eq. (7.4.2.21)
9      L.col(j + 1) = (2. * j + 1) / (j + 1.) * L.col(j).cwiseProduct(x)
10                    - j / (j + 1.) * L.col(j - 1);
11   }
12 }

Comments on Code 7.4.2.22:


☛ return value: matrix L with $(\mathbf{L})_{ij} = P_j(x_i)$, i.e., the j-th column holds the values of $P_j$ at the points in x
☛ lines 5-6: take into account initialization of Legendre 3-term recursion (7.4.2.21)
y
Remark 7.4.2.23 (Computing Gauss nodes and weights) There are several efficient ways to find the
Gauss points. Here we discuss an intriguing connection with an eigenvalue problem.


Compute nodes/weights of Gaussian quadrature by solving an eigenvalue problem!


(Golub-Welsch algorithm [Gan+05, Sect. 3.5.4], [Tre08, Sect. 1])

In codes Gauss nodes and weights are usually retrieved from tables, cf. Rem. 7.2.0.6.

C++-code 7.4.2.24: Golub-Welsch algorithm ➺ GITLAB


2   struct QuadRule {
3     Eigen::VectorXd nodes_, weights_;
4   } __attribute__((aligned(32)));
5
6   inline QuadRule gaussquad(const unsigned int n) {
7     QuadRule qr;
8     // Symmetric matrix whose eigenvalues provide Gauss points
9     Eigen::MatrixXd M = Eigen::MatrixXd::Zero(n, n);
10    for (unsigned int i = 1; i < n; ++i) {
11      const double b = i / std::sqrt(4. * i * i - 1.);
12      M(i, i - 1) = M(i - 1, i) = b;
13    }
14    // using Eigen's built-in solver for symmetric eigenvalue problems
15    const Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> eig(M);
16
17    qr.nodes_ = eig.eigenvalues();  // Gauss quadrature nodes as eigenvalues!
18    qr.weights_ = 2 * eig.eigenvectors().topRows<1>().array().pow(2);
19    return qr;
20  }

Justification: rewrite the 3-term recurrence (7.4.2.21) for the scaled Legendre polynomials $\tilde P_n := \frac{1}{\sqrt{n + 1/2}}\, P_n$:

$$ t\,\tilde P_n(t) = \underbrace{\frac{n}{\sqrt{4n^2 - 1}}}_{=:\beta_n}\, \tilde P_{n-1}(t) + \underbrace{\frac{n+1}{\sqrt{4(n+1)^2 - 1}}}_{=:\beta_{n+1}}\, \tilde P_{n+1}(t) . \qquad (7.4.2.25) $$

For fixed $t \in \mathbb{R}$, (7.4.2.25) can be expressed as

$$ t \underbrace{\begin{bmatrix} \tilde P_0(t) \\ \tilde P_1(t) \\ \vdots \\ \tilde P_{n-1}(t) \end{bmatrix}}_{=:\,\mathbf{p}(t)\in\mathbb{R}^n}
 = \underbrace{\begin{bmatrix} 0 & \beta_1 & & & \\ \beta_1 & 0 & \beta_2 & & \\ & \beta_2 & \ddots & \ddots & \\ & & \ddots & 0 & \beta_{n-1} \\ & & & \beta_{n-1} & 0 \end{bmatrix}}_{=:\,\mathbf{J}_n\in\mathbb{R}^{n,n}}
 \begin{bmatrix} \tilde P_0(t) \\ \tilde P_1(t) \\ \vdots \\ \tilde P_{n-1}(t) \end{bmatrix}
 + \begin{bmatrix} 0 \\ \vdots \\ 0 \\ \beta_n \tilde P_n(t) \end{bmatrix} $$

$$ \tilde P_n(\xi) = 0 \quad\Longleftrightarrow\quad \xi\,\mathbf{p}(\xi) = \mathbf{J}_n\,\mathbf{p}(\xi) \quad \text{(an eigenvalue problem!)} . $$

The zeros of Pn can be obtained as the n real eigenvalues of the symmetric tridiagonal matrix Jn ∈
R n,n !
This matrix Jn is initialized in Line 10–Line 13 of Code 7.4.2.24. The computation of the weights in Line 18
of Code 7.4.2.24 is explained in [Gan+05, Sect. 3.5.4]. y
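To illustrate how the pieces fit together, here is a self-contained sketch (our own glue code, not from the lecture repository) that recomputes the Gauss nodes/weights as in Code 7.4.2.24, transforms them to [0, 1] according to Rem. 7.2.0.4, and approximates $\int_0^1 e^t\,\mathrm{d}t$:

#include <Eigen/Dense>
#include <cmath>
#include <iostream>

int main() {
  const unsigned int n = 5;  // number of Gauss points
  // Tridiagonal Jacobi matrix J_n built from the beta_i of (7.4.2.25)
  Eigen::MatrixXd M = Eigen::MatrixXd::Zero(n, n);
  for (unsigned int i = 1; i < n; ++i) {
    M(i, i - 1) = M(i - 1, i) = i / std::sqrt(4.0 * i * i - 1.0);
  }
  const Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> eig(M);
  const Eigen::VectorXd c = eig.eigenvalues();  // Gauss nodes on [-1,1]
  const Eigen::VectorXd w =                     // Gauss weights on [-1,1]
      2 * eig.eigenvectors().row(0).transpose().cwiseAbs2();

  // Affine transformation to [0,1] (Rem. 7.2.0.4) and evaluation of the
  // quadrature sum for the integrand exp(t); exact integral is e - 1.
  const double a = 0.0, b = 1.0;
  double I = 0.0;
  for (unsigned int j = 0; j < n; ++j) {
    const double t = 0.5 * (1.0 - c(j)) * a + 0.5 * (1.0 + c(j)) * b;
    I += 0.5 * (b - a) * w(j) * std::exp(t);
  }
  std::cout << "quadrature error = " << std::abs(I - (std::exp(1.0) - 1.0))
            << std::endl;
  return 0;
}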

Remark 7.4.2.26 (Asymptotic methods for the computation of Gauss-Legendre quadrature rules)
The fastest methods for the computation of nodes for p-point Gauss-Legendre quadrature rules make use


of asymptotic formulas for the zeros of special functions and achieve an asymptotic complexity of O( p) for
p → ∞, see article and [Bog14]. y

Review question(s) 7.4.2.27 (Gauss-Legendre quadrature)


(Q7.4.2.27.A) [Gauss-Radau quadrature] You insist that for a family of p-point quadrature rules on
[−1, 1] the endpoints x = −1 and x = 1 belong to the set of quadrature nodes. What order can you
achieve?

7.4.3 Quadrature Error Estimates


Video tutorial for Section 7.4.3 "(Gauss-Legendre) Quadrature Error Estimates": (18 minutes)
Download link, tablet notes

The Gauss-Legendre quadrature formulas not only enjoy maximal order, but also possess another key property that can be regarded as essential for viable families of quadrature rules.

Obviously:

Lemma 7.4.3.1. Positivity of Gauss-Legendre quadrature weights

The weights of the Gauss-Legendre quadrature formulas are positive.

Fig. 270: Gauss-Legendre weights $w_j$ plotted over the corresponding nodes $t_j \in [-1,1]$ for $n = 2, 4, 6, 8, 10, 12, 14$.

Proof. Writing $\xi_j^{(n)}$, $j = 1,\ldots,n$, for the nodes (Gauss points) of the n-point Gauss-Legendre quadrature formula, $n \in \mathbb{N}$, we define

$$ q_k(t) = \prod_{\substack{j=1 \\ j\neq k}}^{n} \big(t - \xi_j^{(n)}\big)^2 \quad\Rightarrow\quad q_k \in \mathcal{P}_{2n-2} . $$

This polynomial is integrated exactly by the quadrature rule: since $q_k\big(\xi_j^{(n)}\big) = 0$ for $j \neq k$,

$$ 0 < \int_{-1}^1 q_k(t)\,\mathrm{d}t = w_k^{(n)} \underbrace{q_k\big(\xi_k^{(n)}\big)}_{>0} , $$

where $w_j^{(n)}$ are the quadrature weights.

§7.4.3.2 (Quadrature error and best approximation error) The positivity of the weights $w_j^{(n)}$ for all n-point Gauss-Legendre and Clenshaw-Curtis quadrature rules has important consequences.


Theorem 7.4.3.3. Quadrature error estimate for quadrature rules with positive weights

For every n-point quadrature rule $Q_n$ as in (7.2.0.2) of order $q \in \mathbb{N}$ with weights $w_j \ge 0$, $j = 1,\ldots,n$, the quadrature error satisfies

$$ E_n(f) := \Big| \int_a^b f(t)\,\mathrm{d}t - Q_n(f) \Big| \le 2\,|b-a| \underbrace{\inf_{p \in \mathcal{P}_{q-1}} \| f - p \|_{L^\infty([a,b])}}_{\text{best approximation error}} \quad \forall f \in C^0([a,b]) . \qquad (7.4.3.4) $$

Proof. The proof runs parallel to the derivation of (6.2.2.33). Writing $E_n(f)$ for the quadrature error, the left-hand side of (7.4.3.4), we find by the definition Def. 7.4.1.1 of the order of a quadrature rule that for all $p \in \mathcal{P}_{q-1}$

$$ E_n(f) = E_n(f - p) \le \Big| \int_a^b (f-p)(t)\,\mathrm{d}t \Big| + \Big| \sum_{j=1}^n w_j\, (f-p)(c_j) \Big| \le |b-a|\,\| f - p \|_{L^\infty([a,b])} + \Big( \sum_{j=1}^n |w_j| \Big) \| f - p \|_{L^\infty([a,b])} . \qquad (7.4.3.5) $$

Since the quadrature rule is exact for constants and $w_j \ge 0$,

$$ \sum_{j=1}^n |w_j| = \sum_{j=1}^n w_j = |b-a| , $$

which finishes the proof.



Drawing on best approximation estimates from Section 6.2.1 and Rem. 6.2.3.26, we immediately get
results about the asymptotic decay of the quadrature error for n-point Gauss-Legendre and Clenshaw-
Curtis quadrature as n → ∞:
$f \in C^r([a,b])$ ⇒ $E_n(f) \to 0$ algebraically with rate r,
$f \in C^\infty([a,b])$ “analytically extensible” ⇒ $E_n(f) \to 0$ exponentially,
as $n \to \infty$; see Def. 6.2.2.7 for the types of convergence.

Appealing to (6.2.1.27) and (6.2.2.22), the dependence of the constants on the length of the integration
interval can be quantified for integrands with limited smoothness.

Lemma 7.4.3.6. Quadrature error estimates for Cr -integrands

For every n-point quadrature rule $Q_n$ as in (7.2.0.2) of order $q \in \mathbb{N}$ with weights $w_j \ge 0$, $j = 1,\ldots,n$, we find that the quadrature error $E_n(f)$ for an integrand $f \in C^r([a,b])$, $r \in \mathbb{N}_0$, satisfies

$$ \text{in the case } q \ge r: \quad E_n(f) \le C\,(q-1)^{-r}\,|b-a|^{r+1}\, \big\| f^{(r)} \big\|_{L^\infty([a,b])} , \qquad (7.4.3.7) $$
$$ \text{in the case } q < r: \quad E_n(f) \le \frac{|b-a|^{q+1}}{q!}\, \big\| f^{(q)} \big\|_{L^\infty([a,b])} , \qquad (7.4.3.8) $$

with a constant $C > 0$ independent of n, f, and $[a,b]$.

Proof. The first estimate (7.4.3.7) is immediate from (6.2.1.27). The second bound (7.4.3.8) is obtained
by combining (7.4.3.4) and (6.2.2.22).



Please note the different estimates depending on whether the smoothness of f (as described by r) or the
order of the quadrature rule is the “limiting factor”. y

EXAMPLE 7.4.3.9 (Convergence of global quadrature rules)


We examine three families of global polynomial (→ Thm. 7.4.1.6) quadrature rules: Newton-Cotes for-
mulas, Gauss-Legendre rules, and Clenshaw-Curtis rules. We record the convergence of the quadrature
errors for the interval [0, 1] and two different functions
1. $f_1(t) = \frac{1}{1+(5t)^2}$, an analytic function, see Rem. 6.2.2.67,
2. $f_2(t) = \sqrt{t}$, merely continuous, with derivatives singular at t = 0.
Fig. 271: quadrature error vs. number of quadrature nodes (semi-logarithmic scale) for $f_1(t) := \frac{1}{1+(5t)^2}$ on [0, 1]. Fig. 272: quadrature error vs. number of quadrature nodes (doubly logarithmic scale) for $f_2(t) := \sqrt{t}$ on [0, 1]. Both plots compare equidistant Newton-Cotes quadrature, Chebychev (Clenshaw-Curtis) quadrature, and Gauss quadrature.
Asymptotic behavior of the quadrature error $\epsilon_n := \big| \int_0^1 f(t)\,\mathrm{d}t - Q_n(f) \big|$ for $n \to \infty$:


exponential convergence $\epsilon_n \approx O(q^n)$, $0 < q < 1$, for the $C^\infty$-integrand $f_1$ ❀ Newton-Cotes quadrature: q ≈ 0.61, Clenshaw-Curtis quadrature: q ≈ 0.40, Gauss-Legendre quadrature: q ≈ 0.27

algebraic convergence $\epsilon_n \approx O(n^{-\alpha})$, $\alpha > 0$, for the integrand $f_2$ with a singularity at t = 0 ❀ Newton-Cotes quadrature: α ≈ 1.8, Clenshaw-Curtis quadrature: α ≈ 2.5, Gauss-Legendre quadrature: α ≈ 2.7
y

Remark 7.4.3.10 (Removing a singularity by transformation) Ex. 7.4.3.9 teaches us that a lack of smoothness of the integrand can thwart exponential convergence and severely limit the rate of algebraic convergence of a global quadrature rule for n → ∞.

Idea: recover integral with smooth integrand by “analytic preprocessing”

Here is an example:

For a general but smooth $f \in C^\infty([0,b])$ compute $\int_0^b \sqrt{t}\, f(t)\,\mathrm{d}t$ via a quadrature rule, e.g., n-point Gauss-Legendre quadrature on $[0,b]$. Due to the presence of a square-root singularity at t = 0 the direct


application of n-point Gauss-Legendre quadrature will result in a rather slow algebraic convergence of the
quadrature error as n → ∞, see Ex. 7.4.3.9.

Trick: Transformation of integrand by substitution rule:

$$ \text{substitution } s = \sqrt{t} : \qquad \int_0^b \sqrt{t}\, f(t)\,\mathrm{d}t = \int_0^{\sqrt{b}} 2 s^2 f(s^2)\,\mathrm{d}s . \qquad (7.4.3.11) $$

Then: Apply Gauss-Legendre quadrature rule to smooth integrand


y
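A sketch of this trick in code (our own snippet): apply a Gauss-Legendre rule — here simply the hard-coded 2-point rule (7.4.2.4) for brevity — to the transformed integrand $2s^2 f(s^2)$ on $[0,\sqrt{b}]$ instead of $\sqrt{t} f(t)$ on $[0,b]$. In practice one would use the rules from Code 7.4.2.24.

#include <cmath>
#include <iostream>

// Approximates \int_0^b sqrt(t) f(t) dt via the substitution s = sqrt(t),
// see (7.4.3.11): the transformed integrand 2 s^2 f(s^2) is smooth at s = 0.
template <typename FUNCTOR>
double sqrtWeightQuad(FUNCTOR &&f, double b) {
  const double len = std::sqrt(b);  // transformed interval [0, sqrt(b)]
  const double chat[2] = {-1.0 / std::sqrt(3.0), 1.0 / std::sqrt(3.0)};
  double I = 0.0;
  for (double ch : chat) {
    const double s = 0.5 * (1.0 + ch) * len;  // node mapped to [0, sqrt(b)]
    I += 0.5 * len * 2.0 * s * s * f(s * s);  // reference weights are 1
  }
  return I;
}

int main() {
  // Test with f(t) = 1: exact value of \int_0^1 sqrt(t) dt is 2/3
  std::cout << sqrtWeightQuad([](double) { return 1.0; }, 1.0) << std::endl;
  return 0;
}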

Remark 7.4.3.12 (The message of asymptotic estimates) There is one blot on most n-asymptotic estimates obtained from Thm. 7.4.3.3: the bounds usually involve quantities like norms of higher derivatives of the integrand that are elusive in general, in particular for integrands given only in procedural form, see § 7.1.0.2. Such unknown quantities are often hidden in “generic constants C”. Can we extract useful information from estimates marred by the presence of such constants?

For fixed integrand f let us assume sharp algebraic convergence (in n) with rate r ∈ N of the quadrature
error En ( f ) for a family of n-point quadrature rules:

$$ E_n(f) = O(n^{-r}) \quad \overset{\text{sharp}}{\Longrightarrow} \quad E_n(f) \approx C\, n^{-r} , \qquad (7.4.3.13) $$

with a “generic constant C > 0” independent of n.

Goal: Reduction of the quadrature error by a factor of ρ > 1

Which (minimal) increase in the number n of quadrature points accomplishes this?


$$ \frac{C\, n_{\mathrm{old}}^{-r}}{C\, n_{\mathrm{new}}^{-r}} \overset{!}{=} \rho \quad\Longleftrightarrow\quad n_{\mathrm{new}} : n_{\mathrm{old}} = \sqrt[r]{\rho} . \qquad (7.4.3.14) $$

In the case of algebraic convergence with rate $r \in \mathbb{R}$, a reduction of the quadrature error by a factor of ρ is bought by an increase of the number of quadrature points by a factor of $\rho^{1/r}$.

(7.4.3.7) ➣ gains in accuracy are “cheaper” for smoother integrands!

Now assume sharp exponential convergence (in n) of the quadrature error En ( f ) for a family of n-point
quadrature rules, 0 ≤ q < 1:

$$ E_n(f) = O(q^{n}) \quad \overset{\text{sharp}}{\Longrightarrow} \quad E_n(f) \approx C\, q^{n} , \qquad (7.4.3.15) $$

with a “generic constant C > 0” independent of n.

Error reduction by a factor ρ > 1 results from

$$ \frac{C\, q^{n_{\mathrm{old}}}}{C\, q^{n_{\mathrm{new}}}} \overset{!}{=} \rho \quad\Longleftrightarrow\quad n_{\mathrm{new}} - n_{\mathrm{old}} = -\frac{\log\rho}{\log q} . $$


In the case of exponential convergence (7.4.3.15), a fixed increase of the number of quadrature points by $-\log\rho / \log q$ results in a reduction of the quadrature error by a factor of ρ > 1.

y
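A quick worked example of these rules of thumb (numbers chosen by us for illustration, not from the lecture): with algebraic convergence of rate r = 4, reducing the quadrature error by a factor ρ = 16 requires, by (7.4.3.14),

$$ n_{\mathrm{new}} : n_{\mathrm{old}} = \sqrt[4]{16} = 2 , $$

i.e. merely doubling the number of quadrature points, whereas with exponential convergence and q = 0.5 an error reduction by ρ = 10 costs the fixed increment

$$ n_{\mathrm{new}} - n_{\mathrm{old}} = -\frac{\log 10}{\log 0.5} \approx 3.3 , $$

that is, about four additional quadrature points, independently of the current n.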
Review question(s) 7.4.3.16 (Quadrature error estimates)

(Q7.4.3.16.A) By the substitution $s = \sqrt{t}$ we could transform

$$ \int_0^b \sqrt{t}\, f(t)\,\mathrm{d}t = \int_0^{\sqrt{b}} 2 s^2 f(s^2)\,\mathrm{d}s . $$

Describe the weights $w_j^{(n)}$ and nodes $c_j^{(n)}$, $j = 1,\ldots,n$, $n \in \mathbb{N}$, of a family of quadrature rules satisfying

$$ \int_0^b \sqrt{t}\, f(t)\,\mathrm{d}t \approx \sum_{j=1}^n w_j^{(n)} f\big(c_j^{(n)}\big) , $$

and enjoying n-asymptotic exponential convergence, if f possesses an analytic extension beyond $[0,b]$.
(Q7.4.3.16.B) Let $(Q_n)_{n\in\mathbb{N}}$ denote a family of quadrature rules with n the number of quadrature points. For the approximate evaluation of $\int_a^b f(t)\,\mathrm{d}t$ the following adaptive algorithm is employed:
1   n := 0;
2   do
3     n := n + 1;
4   while (|Q_{n+1}(f) − Q_n(f)| > tol · |Q_{n+1}(f)|);
5   return Q_{n+1}(f);

Assuming a sharp asymptotic behavior like $E_n(f) = O(n^{-r})$, $r \in \mathbb{N}$ (algebraic convergence with rate r), for $n \to \infty$ of the quadrature error $E_n(f)$, how much extra work may be incurred when reducing the tolerance by a factor of 10?
(Q7.4.3.16.C) [Improper integral with logarithmic weight] We consider the improper integral
$$ I(f) := \int_0^1 \log(t)\, f(t)\,\mathrm{d}t , \qquad (7.4.3.17) $$

where f : [0, 1] → R is supposed to possess an analytic extension to a C-neighborhood of [0, 1].


• Is it possible to apply the n-point Gauss-Legendre quadrature formula on [0, 1] to (7.4.3.17)?
• What is your prediction for the behavior of the quadrature error of the n-point Gauss-Legendre
quadrature formula on [0, 1] applied to (7.4.3.17) for n → ∞?
(Q7.4.3.16.D) [“Regularization” of integrands by transformation] Transform (7.4.3.17) by the change of
variables (“substitution”)

$$ t = \varphi(\tau) , \qquad \varphi(\tau) := \sin^4\big(\tfrac{\pi}{2}\tau\big) , \quad \tau \in [0,1] . $$

When applying the n-point Gauss-Legendre quadrature formula on [0, 1] to the transformed integral,
how will the quadrature error behave for n → ∞?

Hint. Examine the differentiability properties of the transformed integrand in τ = 0.


Hint. Remember the product rule for the k-th derivative

$$ (fg)^{(k)}(\tau) = \sum_{j=0}^{k} \binom{k}{j} f^{(j)}(\tau)\, g^{(k-j)}(\tau) , \qquad f, g \in C^k . \qquad (7.4.3.18) $$

Hint. How do $\tau \mapsto \varphi(\tau)$ and its derivatives behave as $\tau \to 0$: $\varphi^{(k)}(\tau) = O(\tau^{?})$ for $\tau \to 0$?

7.5 Composite Quadrature

Video tutorial for Section 7.5 "Composite Quadrature": (18 minutes) Download link,
tablet notes

In Chapter 6, Section 6.6.1 we studied approximation by piecewise polynomial interpolants. A similar


idea underlies the so-called composite quadrature rules on an interval [ a, b]. Analogously to piecewise
polynomial techniques they start from a grid/mesh

$$ \mathcal{M} := \{ a = x_0 < x_1 < \ldots < x_{m-1} < x_m = b \} \qquad (6.6.0.2) $$

and appeal to the trivial identity

$$ \int_a^b f(t)\,\mathrm{d}t = \sum_{j=1}^m \int_{x_{j-1}}^{x_j} f(t)\,\mathrm{d}t . \qquad (7.5.0.1) $$

On each mesh interval [ x j−1 , x j ] we then use a local quadrature rule, which may be one of the polynomial
quadrature formulas from 7.3.

General construction of composite quadrature rules

Idea: Partition integration domain [ a, b] by a mesh/grid (→ Section 6.6)


M : = { a = x0 < x1 < . . . < x m = b }
Apply quadrature formulas from Section 7.3, Section 7.4 locally
on mesh intervals Ij := [ x j−1 , x j ], j = 1, . . . , m, and sum up.

composite quadrature rule

EXAMPLE 7.5.0.3 (Simple composite polynomial quadrature rules)

Composite trapezoidal rule, cf. (7.3.0.5):

$$ \int_a^b f(t)\,\mathrm{d}t \approx \tfrac12 (x_1 - x_0)\, f(a) + \sum_{j=1}^{m-1} \tfrac12 (x_{j+1} - x_{j-1})\, f(x_j) + \tfrac12 (x_m - x_{m-1})\, f(b) \qquad (7.5.0.4) $$

➣ arising from piecewise linear interpolation of f (Fig. 273).


Composite Simpson rule, cf. (7.3.0.6):

$$ \int_a^b f(t)\,\mathrm{d}t \approx \tfrac16 (x_1 - x_0)\, f(a) + \sum_{j=1}^{m-1} \tfrac16 (x_{j+1} - x_{j-1})\, f(x_j) + \sum_{j=1}^{m} \tfrac23 (x_j - x_{j-1})\, f\big(\tfrac12(x_j + x_{j-1})\big) + \tfrac16 (x_m - x_{m-1})\, f(b) \qquad (7.5.0.5) $$

related to piecewise quadratic Lagrangian interpolation (Fig. 274).

Formulas (7.5.0.4), (7.5.0.5) directly suggest efficient implementation with minimal number of f -
evaluations.

C++-code 7.5.0.6: Equidistant composite trapezoidal rule (7.5.0.4)


#include <cassert>

// N-interval equidistant trapezoidal rule
template <class Function>
double trapezoidal(Function &f, const double a, const double b,
                   const unsigned N) {
  double I = 0;
  const double h = (b - a) / N;  // interval length

  for (unsigned i = 0; i < N; ++i) {
    // rule: T = (b - a)/2 * (f(a) + f(b)),
    // apply on N intervals: [a + i*h, a + (i+1)*h], i=0..(N-1)
    I += h / 2 * (f(a + i * h) + f(a + (i + 1) * h));
  }
  return I;
}

// Alternative implementation of n-point equidistant trapezoidal rule
template <typename Functor>
double equidTrapezoidalRule(Functor &&f, double a, double b, unsigned int n) {
  assert(n >= 2);
  const double h = (b - a) / (n - 1);
  double t = a + h;
  double s = 0.0;
  for (unsigned int i = 1; i < n - 1; t += h, ++i) {
    s += f(t);
  }
  return (0.5 * h * f(a) + 0.5 * h * f(b) + h * s);
}

C++-code 7.5.0.7: Equidistant composite Simpson rule (7.5.0.5)


template <class Function>
double simpson(Function &f, const double a, const double b, const unsigned N) {
  double I = 0;
  const double h = (b - a) / N;  // interval length

  for (unsigned i = 0; i < N; ++i) {
    // rule: S = (b - a)/6*( f(a) + 4*f(0.5*(a + b)) + f(b) ),
    // apply on [a + i*h, a + (i+1)*h]
    I += h / 6 * (f(a + i * h) + 4 * f(a + (i + 0.5) * h) + f(a + (i + 1) * h));
  }

  return I;
}

In both cases the function object passed in f must provide an evaluation operator double operator()(double) const. y
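Both functions can be called with any such callable integrand, e.g. a lambda; a minimal usage sketch (our own, assuming the two codes above are in scope):

#include <cmath>
#include <iostream>

// trapezoidal() and simpson() from Codes 7.5.0.6 / 7.5.0.7 are assumed to be
// defined above (or included from the corresponding files).
int main() {
  auto f = [](double t) { return std::sqrt(t); };  // integrand with singular derivatives
  const double exact = 2.0 / 3.0;                  // \int_0^1 sqrt(t) dt
  for (unsigned N : {16u, 32u, 64u}) {
    std::cout << "N = " << N
              << ", trapezoidal error = " << std::abs(trapezoidal(f, 0.0, 1.0, N) - exact)
              << ", Simpson error = " << std::abs(simpson(f, 0.0, 1.0, N) - exact)
              << std::endl;
  }
  return 0;
}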

Remark 7.5.0.8 (Composite quadrature and piecewise polynomial interpolation) Composite quadra-
ture scheme based on local polynomial quadrature can usually be understood as “quadrature by approxi-
mation schemes” as explained in § 7.2.0.7. The underlying approximation schemes belong to the class of
general local Lagrangian interpolation schemes introduced in Section 6.6.1.

In other words, many composite quadrature schemes arise from replacing the integrand by a piecewise
interpolating polynomial, see Fig. 273 and Fig. 274 and compare with Fig. 252. y

To see the main rationale behind the use of composite quadrature rules recall Lemma 7.4.3.6: for a
polynomial quadrature rule (7.3.0.1) of order q with positive weights and f ∈ Cr ([ a, b]) the quadrature
error shrinks with the min{r, q} + 1-st power of the length |b − a| of the integration domain! Hence,
applying polynomial quadrature rules to small mesh intervals should lead to a small overall quadrature
error.

§7.5.0.9 (Quadrature error estimate for composite polynomial quadrature rules) Assume a composite quadrature rule Q on $[x_0, x_m] = [a,b]$, $b > a$, based on $n_j$-point local quadrature rules $Q_{n_j}^{j}$ with positive weights (e.g. local Gauss-Legendre quadrature rules or local Clenshaw-Curtis quadrature rules) and of fixed orders $q_j \in \mathbb{N}$ on each mesh interval $[x_{j-1}, x_j]$. From Lemma 7.4.3.6 recall the estimate for $f \in C^r([x_{j-1}, x_j])$

$$ \Big| \int_{x_{j-1}}^{x_j} f(t)\,\mathrm{d}t - Q_{n_j}^{j}(f) \Big| \le C\, |x_j - x_{j-1}|^{\min\{r, q_j\}+1}\, \big\| f^{(\min\{r,q_j\})} \big\|_{L^\infty([x_{j-1},x_j])} , \qquad (7.3.0.11) $$

with $C > 0$ independent of f and j. For $f \in C^r([a,b])$, summing up these bounds, we get for the global quadrature error

$$ \Big| \int_{x_0}^{x_m} f(t)\,\mathrm{d}t - Q(f) \Big| \le C \sum_{j=1}^m h_j^{\min\{r,q_j\}+1}\, \big\| f^{(\min\{r,q_j\})} \big\|_{L^\infty([x_{j-1},x_j])} , $$

with local meshwidths $h_j := x_j - x_{j-1}$. If $q_j = q$, $q \in \mathbb{N}$, for all $j = 1,\ldots,m$, then, as $\sum_j h_j = b - a$,

$$ \Big| \int_{x_0}^{x_m} f(t)\,\mathrm{d}t - Q(f) \Big| \le C\, h_{\mathcal{M}}^{\min\{q,r\}}\, |b-a|\, \big\| f^{(\min\{q,r\})} \big\|_{L^\infty([a,b])} , \qquad (7.5.0.10) $$

with (global) meshwidth $h_{\mathcal{M}} := \max_j h_j$.


(7.5.0.10) ←→ Algebraic convergence in no. of f -evaluations for n → ∞

§7.5.0.11 (Constructing families of composite quadrature rules) As with polynomial quadrature rules, we study the asymptotic behavior of the quadrature error for families of composite quadrature rules as a function of the total number n of function evaluations.
As in the case of M-piecewise polynomial approximation of functions (→ Section 6.6.1) families of composite quadrature rules can be generated in two different ways:
(I) use a sequence of successively refined meshes (M_k = {x_j^k}_j)_{k∈ℕ} with ♯M_k = m(k) + 1, m(k) → ∞ for k → ∞, combined with the same (transformed, → Rem. 7.2.0.4) local quadrature rule on all mesh intervals [x^k_{j−1}, x^k_j]. Examples are the composite trapezoidal rule and composite Simpson rule from Ex. 7.5.0.3 on sequences of equidistant meshes.
➣ h-convergence
(II) On a fixed mesh M = {x_j}_{j=0}^m, on each cell use the same (transformed) local quadrature rule taken from a sequence of polynomial quadrature rules of increasing order.
➣ p-convergence
y

EXPERIMENT 7.5.0.12 (Quadrature errors for composite quadrature rules) Composite quadrature rules based on
• trapezoidal rule (7.3.0.5) ➣ local order 2 (exact for linear functions, see Ex. 7.4.1.10),
• Simpson rule (7.3.0.6) ➣ local order 4 (exact for cubic polynomials, see Ex. 7.4.1.10)
on the equidistant mesh M := {jh}_{j=0}^n, h = 1/n, n ∈ ℕ.

(Fig. 275: |quadrature error| vs. meshwidth for f_1(t) := 1/(1 + (5t)²) on [0, 1]; composite trapezoidal and Simpson rules, with reference lines O(h²) and O(h⁴).)
(Fig. 276: |quadrature error| vs. meshwidth for f_2(t) := √t on [0, 1]; composite trapezoidal and Simpson rules.)

Asymptotic behavior of the quadrature error E(n) := | ∫_0^1 f(t) dt − Q_n(f) | for meshwidth “h → 0”:

☛ Throughout we observe algebraic convergence E(n) = O(h^α) with rate α > 0 for h = n^{−1} → 0
➣ sufficiently smooth integrand f_1: trapezoidal rule → α = 2, Simpson rule → α = 4 !?
➣ singular integrand f_2: α = 3/2 for trapezoidal rule & Simpson rule !

(lack of) smoothness of integrand limits convergence!


y
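The observed rates can be tabulated directly in code. The following is a minimal sketch (not the original experiment code, assuming the functions trapezoidal and simpson from Codes 7.5.0.6/7.5.0.7 are in scope) for the smooth integrand f_1(t) = 1/(1 + (5t)²), whose exact integral over [0, 1] is arctan(5)/5.

#include <cmath>
#include <iostream>

int main() {
  auto f1 = [](double t) { return 1.0 / (1.0 + 25.0 * t * t); };
  // exact value: an antiderivative of f1 is atan(5t)/5
  const double exact = std::atan(5.0) / 5.0;
  double err_trp_old = 0.0, err_simp_old = 0.0;
  for (unsigned n = 4; n <= 1024; n *= 2) {
    const double err_trp = std::abs(trapezoidal(f1, 0.0, 1.0, n) - exact);
    const double err_simp = std::abs(simpson(f1, 0.0, 1.0, n) - exact);
    std::cout << "n = " << n << ": trapezoidal error = " << err_trp
              << ", Simpson error = " << err_simp;
    if (n > 4) {  // estimated algebraic rates from halving the meshwidth
      std::cout << ", rates ~ " << std::log2(err_trp_old / err_trp) << " / "
                << std::log2(err_simp_old / err_simp);
    }
    std::cout << "\n";
    err_trp_old = err_trp;
    err_simp_old = err_simp;
  }
  return 0;
}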

Remark 7.5.0.13 (Composite quadrature rules vs. global quadrature rules) For a fixed integrand f ∈ C^r([a, b]) of limited smoothness on an interval [a, b] we compare
• a family of composite quadrature rules based on a single local ℓ-point rule (with positive weights) of order q on a sequence of equidistant meshes (M_k = {x^k_j}_j)_{k∈ℕ},
• the family of Gauss-Legendre quadrature rules from Def. 7.4.2.18.
We study the asymptotic dependence of the quadrature error on the number n of function evaluations.

For the composite quadrature rules we have n ≈ ℓ · ♯M_k ≈ ℓ · h_M^{−1}. Combined with (7.5.0.10), we find for the quadrature error E_n^{comp}(f) of the composite quadrature rules

  E_n^{comp}(f) ≤ C_1 n^{−min{q,r}} ,   (7.5.0.14)

with C_1 > 0 independent of M = M_k.

The quadrature errors E_n^{GL}(f) of the n-point Gauss-Legendre quadrature rules are given in Lemma 7.4.3.6, (7.4.3.7):

  E_n^{GL}(f) ≤ C_2 n^{−r} ,   (7.5.0.15)

with C_2 > 0 independent of n.

Gauss-Legendre quadrature converges at least as fast as fixed-order composite quadrature on equidistant meshes.

Moreover, Gauss-Legendre quadrature “automatically detects” the smoothness of the integrand, and enjoys fast exponential convergence for analytic integrands.

Use Gauss-Legendre quadrature instead of fixed-order composite quadrature on equidistant meshes.
y

EXPERIMENT 7.5.0.16 (Empiric convergence of equidistant trapezoidal rule) Sometimes there are surprises: now we will witness a convergence behavior of a composite quadrature rule that is much better than predicted by the order of the local quadrature formula.
We consider the equidistant trapezoidal rule (order 2), see (7.5.0.4), Code 7.5.0.6,

  ∫_a^b f(t) dt ≈ T_m(f) := h [ ½ f(a) + ∑_{k=1}^{m−1} f(a + kh) + ½ f(b) ] ,  h := (b − a)/m ,   (7.5.0.17)

and the 1-periodic smooth (analytic) integrand

  f(t) = 1 / √(1 − a sin(2πt − 1)) ,  0 < a < 1 .

(As “exact value of the integral” we use T_500 in the computation of quadrature errors.)

(Fig. 277: |quadrature error| of T_n(f) on [0, 1] vs. number of quadrature nodes, for a = 0.5, 0.9, 0.95, 0.99.)
(Fig. 278: |quadrature error| of T_n(f) on [0, ½] vs. number of quadrature nodes, for a = 0.5, 0.9, 0.95, 0.99.)

quadrature error for T_n(f) on [0, 1]: exponential convergence !!
quadrature error for T_n(f) on [0, ½]: merely algebraic convergence
y

§7.5.0.18 (The magic of the equidistant trapezoidal rule (for periodic integrands))
In this § we use I := [0, 1[ as a reference interval, cf. Exp. 7.5.0.16. We rely on similar techniques as in Section 5.6, Section 5.6.2. Again, a key tool will be the bijective mapping, see Fig. 198,

  Φ_{S¹} : I → S¹ := {z ∈ ℂ : |z| = 1} ,  t ↦ z := exp(2πıt) ,   (5.6.2.1)

which induces the general pullback, cf. (6.2.1.16),

  (Φ_{S¹}^{−1})* : C⁰([0, 1[) → C⁰(S¹) ,  ((Φ_{S¹}^{−1})* f)(z) := f(Φ_{S¹}^{−1}(z)) ,  z ∈ S¹ .

If f ∈ C^r(ℝ) and 1-periodic, then (Φ_{S¹}^{−1})* f ∈ C^r(S¹). Further, Φ_{S¹} maps equidistant nodes on I := [0, 1] to equispaced nodes on S¹, which are the roots of unity:

  Φ_{S¹}(j/n) = exp(2πı j/n)  [ exp(2πı j/n)^n = 1 ] .   (7.5.0.19)

Now consider an n-point polynomial quadrature rule on S¹ based on the set of equidistant nodes Z := {z_j := exp(2πı (j−1)/n), j = 1, …, n} and defined as

  Q_n^{S¹}(g) := ∫_{S¹} (L_Z g)(τ) dS(τ) = ∑_{j=1}^n w_j^{S¹} g(z_j) ,   (7.5.0.20)

where L_Z is the Lagrange interpolation operator (→ Def. 6.2.2.1). This means that the weights obey Thm. 7.4.1.6, where the definition (5.2.2.4) of Lagrange polynomials remains the same for complex nodes. By sheer symmetry, all the weights have to be the same, which, since the rule will be at least of order 1, means

  w_j^{S¹} = 2π/n ,  j = 1, …, n .

Moreover, the quadrature rule Q_n^{S¹} will be of order n, see Def. 7.4.1.1, that is, it will integrate polynomials of degree ≤ n − 1 exactly.

By transformation (→ Rem. 7.2.0.4) and pullback, Q_n^{S¹} induces a quadrature rule on I := [0, 1] by

  Q_n^I(f) := (1/2π) Q_n^{S¹}( (Φ_{S¹}^{−1})* f ) = (1/2π) ∑_{j=1}^n w_j^{S¹} f(Φ_{S¹}^{−1}(z_j)) = ∑_{j=1}^n (1/n) f((j−1)/n) .   (7.5.0.21)

This is exactly the equidistant trapezoidal rule (7.5.0.17), if f is 1-periodic, f(0) = f(1): Q_n^I = T_n. Hence we arrive at the following estimate for the quadrature error

  E_n(f) := | ∫_0^1 f(t) dt − T_n(f) | ≤ 2π max_{z∈S¹} | ((Φ_{S¹}^{−1})* f)(z) − (L_Z (Φ_{S¹}^{−1})* f)(z) | .

Equivalently, one can show that T_n integrates trigonometric polynomials up to degree 2n − 1 exactly. Remember from Section 5.6.1 that the (2n + 1)-dimensional space of 1-periodic trigonometric polynomials of degree 2n can be defined as

  P_{2n}^T := Span{ t ↦ exp(2πıjt) : j = −n, …, n } .

By elementary computations we find for f(t) = e^{2πıkt}:

  ∫_0^1 f(t) dt = 0 if k ≠ 0 ,  = 1 if k = 0 ;

  T_n(f) = (1/n) ∑_{l=0}^{n−1} e^{2πı lk/n} = 0 if k ∉ nℤ ,  = 1 if k ∈ nℤ   (by (4.2.1.8)) .

The second identity is a consequence of the geometric sum formula (4.2.1.9).

Lemma 7.5.0.22. Exact quadrature by equidistant trapezoidal rule

The n-point equidistant trapezoidal quadrature rule on [0, 1]

  ∫_0^1 f(t) dt ≈ T_n(f) := h [ ½ f(0) + ∑_{k=1}^{n−1} f(kh) + ½ f(1) ] ,  h := 1/n ,   (7.5.0.17)

is exact for trigonometric polynomials (→ Section 6.5.1) of degree ≤ 2n − 2.

Since the weights of the equidistant trapezoidal rule are clearly positive, by Thm. 7.4.3.3 the asymp-
totic behavior of the quadrature error can directly be inferred from estimates for equidistant trigono-
metric interpolation. Such estimates are given, e.g., in Thm. 6.5.3.14 and they confirm exponential
convergence for periodic integrands that allow analytic extension, in agreement with the observa-
tions made in Exp. 7.5.0.16.

Numerical Quadrature of periodic integrands

Use the equidistant trapezoidal rule for the numerical quadrature of a periodic integrand over its
interval of periodicity.
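The exponential decay of the error can be observed with a few lines of code. The following is a sketch (not from the lecture codes), assuming the function trapezoidal from Code 7.5.0.6 is in scope; on [0, 1] with N subintervals it coincides with T_N from (7.5.0.17). For the 1-periodic analytic integrand of Exp. 7.5.0.16 the error should drop roughly exponentially in N; as in that experiment, a trapezoidal sum with many nodes serves as surrogate “exact” value.

#include <cmath>
#include <iostream>

int main() {
  const double PI = 3.14159265358979323846;
  const double a = 0.9;  // parameter of the integrand, 0 < a < 1
  auto f = [=](double t) {
    return 1.0 / std::sqrt(1.0 - a * std::sin(2.0 * PI * t - 1.0));
  };
  const double ref = trapezoidal(f, 0.0, 1.0, 500);  // surrogate exact value
  for (unsigned n = 2; n <= 32; n += 2) {
    std::cout << "n = " << n << ", |T_n(f) - ref| = "
              << std::abs(trapezoidal(f, 0.0, 1.0, n) - ref) << "\n";
  }
  return 0;
}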

Remark 7.5.0.24 (Approximate computation of Fourier coefficients)

Recall from Section 4.2.6: recovery of the signal (y_k)_{k∈ℤ} from its Fourier transform c(t):

  y_j = ∫_0^1 c(t) exp(2πıjt) dt .   (4.2.6.20)

Task: approximate computation of y_j

Recall: c(t) is obtained from (y_k)_{k∈ℤ} through the Fourier series

  c(t) = ∑_{k∈ℤ} y_k exp(−2πıkt) .   (4.2.6.7)

➣ c(t) is smooth & 1-periodic for finite/rapidly decaying (y_k)_{k∈ℤ}.

Exp. 7.5.0.16 ➣ use the equidistant trapezoidal rule (7.5.0.17) for the approximate evaluation of the integral in (4.2.6.20).
☞ This boils down to an inverse DFT (4.2.1.20); hardly surprising in light of the derivation of (4.2.6.20) in Section 4.2.6.

C++-code 7.5.0.25: DFT-based approximate computation of Fourier coefficients

#include <complex>
#include <iostream>
#include <unsupported/Eigen/FFT>
#include <vector>

template <class Function>
void fourcoeffcomp(std::vector<std::complex<double>>& y, Function& c,
                   const unsigned m, const unsigned ovsmpl = 2) {
  // Compute the Fourier coefficients y_{-m}, ..., y_m of the function
  // c : [0,1[ -> C using an oversampling factor ovsmpl.
  // c must be a handle to a function @(t), e.g. a lambda function
  const unsigned N = (2 * m + 1) * ovsmpl;  // number of quadrature points
  const double h = 1. / N;

  // evaluate function in N points
  std::vector<std::complex<double>> c_eval(N);
  for (unsigned i = 0; i < N; ++i) {
    c_eval[i] = c(i * h);
  }

  // inverse discrete fourier transformation
  Eigen::FFT<double> fft;
  std::vector<std::complex<double>> z;
  fft.inv(z, c_eval);

  // Undo oversampling and wrapping of Fourier coefficient array
  // -> y contains same values as z but in a different order:
  // y = [z(N-m+1:N), z(1:m+1)]
  y = std::vector<std::complex<double>>();
  y.reserve(N);
  for (unsigned i = N - m; i < N; ++i) {
    y.push_back(z[i]);
  }
  for (unsigned j = 0; j < m + 1; ++j) {
    y.push_back(z[j]);
  }
}
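A minimal usage sketch for the function above (the test function and all values are illustrative, not taken from the lecture codes): approximate the Fourier coefficients y_{−3}, …, y_3 of the smooth 1-periodic function c(t) = exp(cos(2πt)).

#include <complex>
#include <iostream>
#include <vector>

int main() {
  const double PI = 3.14159265358979323846;
  auto c = [=](double t) {
    return std::complex<double>(std::exp(std::cos(2.0 * PI * t)), 0.0);
  };
  std::vector<std::complex<double>> y;
  const unsigned m = 3;
  fourcoeffcomp(y, c, m);  // fills y with approximations of y_{-m},...,y_m
  for (unsigned j = 0; j < y.size(); ++j) {
    std::cout << "y_" << static_cast<int>(j) - static_cast<int>(m)
              << " = " << y[j] << "\n";
  }
  return 0;
}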

Review question(s) 7.5.0.26 (Composite quadrature rules)

(Q7.5.0.26.A) Assume that f : D ⊂ ℂ → ℂ is analytic on the open subset D of the complex plane. By the Cauchy integral formula we have for z ∈ D

  f^(k)(z) = (k! / 2πı) ∫_{∂B_r(z)} f(w) / (w − z)^{k+1} dw ,  k ∈ ℕ_0 ,   (7.5.0.27)

where

  B_r(z) := {w ∈ ℂ : |w − z| ≤ r} ,  r > 0 ,

is the closed disk with radius r > 0 around z and r is so small that B_r(z) ⊂ D.
Using the parameterization

  ∂B_r(z) = {z + r exp(2πıt), t ∈ [0, 1]}

outline how (7.5.0.27) can be approximated by means of numerical quadrature.

Hint. Recall the definition of a path integral in the complex plane (“contour integral”): If the path of integration γ is described by a parameterization τ ∈ J ↦ γ(τ) ∈ ℂ, J ⊂ ℝ, then

  ∫_γ f(z) dz := ∫_J f(γ(τ)) · γ̇(τ) dτ .   (6.2.2.51)

7.6 Adaptive Quadrature

Video tutorial for Section 7.6 "Adaptive Quadrature": (13 minutes) Download link, tablet notes

Hitherto, we have just “blindly” applied quadrature rules for the approximate evaluation of ∫_a^b f(t) dt, oblivious of any properties of the integrand f. This led us to the conclusion of Rem. 7.5.0.13 that Gauss-Legendre quadrature (→ Def. 7.4.2.18) should be preferred to composite quadrature rules (→ Section 7.5) in general. Now composite quadrature rules will partly be rehabilitated, because they offer the flexibility to adjust the quadrature rule to the integrand, a policy known as adaptive quadrature.

Adaptive numerical quadrature


Rb
The policy of adaptive quadrature approximates a f (t) dt by a quadrature formula (7.2.0.2), whose
nodes cnj are chosen depending on the integrand f .

We distinguish
(I) a priori adaptive quadrature: the nodes are fixed before the evaluation of the quadrature
formula, taking into account external information about f , and
(II) a posteriori adaptive quadrature: the node positions are chosen or improved based on infor-
mation gleaned during the computation inside a loop. It terminates when sufficient accuracy
has been reached.

In this section we will chiefly discuss a posteriori adaptive quadrature for composite quadrature rules (→
Section 7.5) based on a single local quadrature rule (and its transformation).


Supplementary literature. [DH03, Sect. 9.7]

EXAMPLE 7.6.0.2 (Rationale for adaptive quadrature) This example presents an extreme case. We consider the composite trapezoidal rule (7.5.0.4) on a mesh M := {a = x_0 < x_1 < ··· < x_m = b} and the integrand f(t) = 1/(10^{−4} + t²) on [−1, 1].

(Fig. 279: plot of the spike-like function f(t) = 1/(10^{−4} + t²) on [−1, 1]; the peak value 10⁴ is attained at t = 0.)

f is a spike-like function.
Intuition: quadrature nodes should cluster around 0, whereas hardly any are needed close to the endpoints of the integration interval, where the function has very small (in modulus) values.
➣ Use a locally refined mesh!

A quantitative justification can appeal to (7.3.0.11) and the resulting bound for the local quadrature error (for f ∈ C²([a, b])):

  | ∫_{x_{k−1}}^{x_k} f(t) dt − (h_k/2)( f(x_{k−1}) + f(x_k) ) | ≤ (1/8) h_k³ ‖f″‖_{L∞([x_{k−1},x_k])} ,  h_k := x_k − x_{k−1} .   (7.6.0.3)

➣ This suggests the use of small mesh intervals where |f″| is large!


y

§7.6.0.4 (Goal: equidistribution of errors) The ultimate but elusive goal is to find a mesh with a minimal number of cells that just delivers a quadrature error below a prescribed threshold. A more practical goal is to adjust the local meshwidths h_k := x_k − x_{k−1} in order to achieve a minimal sum of local error bounds. This leads to the constrained minimization problem:

  ∑_{k=1}^m h_k³ ‖f″‖_{L∞([x_{k−1},x_k])} → min   s.t.   ∑_{k=1}^m h_k = b − a .   (7.6.0.5)

Lemma 7.6.0.6.

Let f : ℝ_0^+ → ℝ_0^+ be a convex function with f(0) = 0 and x > 0. Then the constrained minimization problem: seek ζ_1, …, ζ_m ∈ ℝ_0^+ such that

  ∑_{k=1}^m f(ζ_k) → min   and   ∑_{k=1}^m ζ_k = x ,   (7.6.0.7)

has the solution ζ_1 = ζ_2 = ··· = ζ_m = x/m.

This means that we should strive for equal bounds h_k³ ‖f″‖_{L∞([x_{k−1},x_k])} for all mesh cells.


Error equidistribution principle

The mesh for a posteriori adaptive composite numerical quadrature should be chosen to achieve
equal contributions of all mesh intervals to the quadrature error

As indicated above, guided by the equidistribution principle, the improvement of the mesh will be done
gradually in an iteration. The change of the mesh in each step is called mesh adaptation and there are
two fundamentally different ways to do it:
(I) by moving nodes, keeping their total number, but making them cluster where mesh intervals should
be small, or
(II) by adding nodes, where mesh intervals should be small (mesh refinement).

Algorithms for a posteriori adaptive quadrature based on mesh refinement usually have the following
structure:

Adaptation loop for numerical quadrature

(1) ESTIMATE: based on available information compute an approximation for the quadrature error
on every mesh interval.
(2) CHECK TERMINATION: if total error sufficient small → STOP
(3) MARK: single out mesh intervals with the largest or above average error contributions.
(4) REFINE: add node(s) inside the marked mesh intervals. GOTO (1)

§7.6.0.10 (Adaptive multilevel quadrature) We now see a concrete algorithm based on the two composite quadrature rules introduced in Ex. 7.5.0.3.

Idea: local error estimation by comparing local results of two quadrature formulas Q_1, Q_2 of different order → local error estimates

  heuristics: error(Q_2) ≪ error(Q_1)  ⇒  error(Q_1) ≈ Q_2(f) − Q_1(f) .

Here: Q_1 = trapezoidal rule (order 2) ↔ Q_2 = Simpson rule (order 4)

Given: initial mesh M := {a = x_0 < x_1 < ··· < x_m = b}

❶ (Error estimation)

For I_k = [x_{k−1}, x_k], k = 1, …, m (midpoints p_k := ½(x_{k−1} + x_k))

  EST_k := | (h_k/6)( f(x_{k−1}) + 4 f(p_k) + f(x_k) ) − (h_k/4)( f(x_{k−1}) + 2 f(p_k) + f(x_k) ) | ,   (7.6.0.11)

where the first term is the Simpson rule and the second term the trapezoidal rule on the split mesh interval.

❷ (Check termination)

Simpson rule on M ⇒ intermediate approximation I ≈ ∫_a^b f(t) dt

  If ∑_{k=1}^m EST_k ≤ RTOL · I  (RTOL := prescribed relative tolerance)  ⇒ STOP .   (7.6.0.12)

❸ (Marking)

  Marked intervals: S := { k ∈ {1, …, m} : EST_k ≥ η · (1/m) ∑_{j=1}^m EST_j } ,  η ≈ 0.9 .   (7.6.0.13)

❹ (Local mesh refinement)

  new mesh: M* := M ∪ { p_k := ½(x_{k−1} + x_k) : k ∈ S } .   (7.6.0.14)

Then continue with step ❶ and mesh M ← M*.

The following C++ code gives a (non-optimal) recursive implementation.


C++-code 7.6.0.15: h-adaptive numerical quadrature ➺ GITLAB

2  // Adaptive multilevel quadrature of a function passed in f.
3  // The vector M passes the positions of current quadrature nodes
4  template <class Function>
5  double adaptquad(Function& f, VectorXd& M, double rtol, double atol) {
6    const std::size_t n = M.size();  // number of nodes
7    // distance of quadrature nodes
8    const VectorXd h = M.tail(n - 1) - M.head(n - 1);
9    const VectorXd mp = 0.5 * (M.head(n - 1) + M.tail(n - 1));  // midpoints
10   // Values of integrand at nodes and midpoints
11   VectorXd fx(n);
12   VectorXd fm(n - 1);
13   for (unsigned i = 0; i < n; ++i) {
14     fx(i) = f(M(i));
15   }
16   for (unsigned j = 0; j < n - 1; ++j) {
17     fm(j) = f(mp(j));
18   }
19   // trapezoidal rule (7.5.0.4)
20   const VectorXd trp_loc = 1. / 4 * h.cwiseProduct(fx.head(n - 1) + 2 * fm + fx.tail(n - 1));
21   // Simpson rule (7.5.0.5)
22   const VectorXd simp_loc = 1. / 6 * h.cwiseProduct(fx.head(n - 1) + 4 * fm + fx.tail(n - 1));
23
24   // Simpson approximation for the integral value
25   double I = simp_loc.sum();
26   // local error estimate (7.6.0.11)
27   const VectorXd est_loc = (simp_loc - trp_loc).cwiseAbs();
28   // estimate for quadrature error
29   const double err_tot = est_loc.sum();
30
31   // STOP: Termination based on (7.6.0.12)
32   if (err_tot > rtol * std::abs(I) && err_tot > atol) {
33     // find cells where error is large
34     std::vector<double> new_cells;
35     for (unsigned i = 0; i < est_loc.size(); ++i) {
36       // MARK by criterion (7.6.0.13) & REFINE by (7.6.0.14)
37       if (est_loc(i) > 0.9 / static_cast<double>(n - 1) * err_tot) {
38         // new quadrature point = midpoint of interval with large error
39         new_cells.push_back(mp(i));
40       }}
41
42     // create new set of quadrature nodes
43     // (necessary to convert std::vector to Eigen vector)
44     const Eigen::Map<VectorXd> tmp(new_cells.data(),
45         static_cast<Eigen::Index>(new_cells.size()));
46     VectorXd new_M(M.size() + tmp.size());
47     new_M << M, tmp;  // concatenate old cells and new cells
48     // nodes of a mesh are supposed to be sorted
49     std::sort(new_M.begin(), new_M.end());
50     I = adaptquad(f, new_M, rtol, atol);  // recursion
51   }
52   return I;
53 }

Comments on Code 7.6.0.15:


• Arguments: f = ˆ handle to function f , M =
ˆ initial mesh, rtol =
ˆ relative tolerance for termination,
atol = ˆ absolute tolerance for termination, necessary in case the exact integral value = 0, which
renders a relative tolerance meaningless.


• Line 7: compute lengths of mesh-intervals [x_{j−1}, x_j],
• Line 9: store positions of midpoints p_j,
• Line 10: evaluate function (vector arguments!),
• Line 19: local composite trapezoidal rule (7.5.0.4),
• Line 21: local Simpson rule (7.3.0.6),
• Line 24: value obtained from composite Simpson rule is used as intermediate approximation for the integral value,
• Line 26: difference of values obtained from local composite trapezoidal rule (∼ Q_1) and local Simpson rule (∼ Q_2) is used as an estimate for the local quadrature error,
• Line 28: estimate for global error by summing up moduli of local error contributions,
• Line 32: terminate, once the estimated total error is below the relative or absolute error threshold,
• Line 50: otherwise, add midpoints of mesh intervals with large error contributions according to (7.6.0.14) to the mesh and continue.
C++-code 7.6.0.16: Call of adaptquad():

#include "./adaptquad.hpp"
#include <Eigen/Dense>
#include <cmath>
#include <iostream>

using Eigen::VectorXd;

int main() {
  auto f = [](double x) { return std::exp(-x * x); };
  VectorXd M(4);
  M << -100, 0.1, 0.5, 100;
  std::cout << "Sqrt(Pi) - Int_{-100}^{100} exp(-x*x) dx = ";
  std::cout << adaptquad::adaptquad(f, M, 1e-10, 1e-12) - std::sqrt(M_PI) << "\n";
  return 0;
}

Remark 7.6.0.17 (Estimation of “wrong quadrature error”?) In Code 7.6.0.15 we use the higher order
quadrature rule, the Simpson rule of order 4, to compute an approximate value for the integral. This is
reasonable, because it would be foolish not to use this information after we have collected it for the sake
of error estimation.

Yet, according to our heuristics, what est_loc and err_tot give us are estimates for the error of the second-order trapezoidal rule, which we do not use for the actual computations.

However, experience teaches that


est_loc gives useful (for the sake of mesh refinement) information about the distribution of
the error of the Simpson rule, though it fails to capture its size.
Therefore, the termination criterion of Line 32 may not be appropriate! y

EXPERIMENT 7.6.0.18 (h-adaptive numerical quadrature) In this numerical test we investigate whether the adaptive technique from § 7.6.0.10 produces an appropriate distribution of integration nodes. We do this for different functions.


✦ approximate ∫_0^1 exp(6 sin(2πt)) dt, initial mesh M_0 = {j/10}_{j=0}^{10}

Algorithm: adaptive quadrature, Code 7.6.0.15 with tolerances rtol = 10^{−6}, abstol = 10^{−10}

We monitor the distribution of quadrature points during the adaptive quadrature and the true and estimated quadrature errors. The “exact” value for the integral is computed by the composite Simpson rule on an equidistant mesh with 10^7 intervals.

(Fig. 280: distribution of quadrature points over x for the successive levels of adaptive refinement, together with the graph of the integrand f.)
(Fig. 281: exact and estimated quadrature errors vs. number of quadrature points.)

✦ approximate ∫_0^1 min{exp(6 sin(2πt)), 100} dt, initial mesh as above

(Fig. 282: distribution of quadrature points over x for the successive levels of adaptive refinement, together with the graph of the integrand f.)
(Fig. 283: exact and estimated quadrature errors vs. number of quadrature points.)

Observation:
• Adaptive quadrature locally decreases meshwidth where integrand features variations or kinks.
• Trend for estimated error mirrors behavior of true error.
• Overestimation may be due to taking the modulus in (7.6.0.11)
However, the important piece of information we want to extract from ESTk is about the distribution of the
quadrature error.
y

Remark 7.6.0.19 (Adaptive quadrature in P YTHON)


q = scipy.integrate.quad(fun,a,b,tol): adaptive multigrid quadrature (local low-order quadrature formulas),
q = scipy.integrate.quadrature(fun,a,b,tol): adaptive Gauss-Lobatto quadrature.
y

Review question(s) 7.6.0.20 (Adaptive quadrature)


(Q7.6.0.20.A) For the composite trapezoidal rule applied on a mesh M := {a = x_0 < x_1 < ··· < x_m := b} we have the following estimate for the local quadrature error for an integrand f ∈ C²([a, b]):

  | ∫_{x_{k−1}}^{x_k} f(t) dt − (h_k/2)( f(x_{k−1}) + f(x_k) ) | ≤ (1/8) h_k³ ‖f″‖_{L∞([x_{k−1},x_k])} ,  h_k := x_k − x_{k−1} .   (7.6.0.3)

We consider the singular integrand f(t) = √t on [0, 1]. What mesh M has to be chosen to ensure the equidistribution of the error bounds from (7.6.0.3)?

(Q7.6.0.20.B) For a posteriori adaptive mesh refinement for the approximation of ∫_a^b f(t) dt we employed the following estimate of the local quadrature error on the mesh M := {a = x_0 < x_1 < ··· < x_m := b}:

  EST_k := | (h_k/6)( f(x_{k−1}) + 4 f(p_k) + f(x_k) ) − (h_k/4)( f(x_{k−1}) + 2 f(p_k) + f(x_k) ) | ,   (7.6.0.11)

where the first term is the Simpson rule and the second the trapezoidal rule on the split mesh interval.

We could also have used the two lowest-order Gauss-Legendre quadrature rules for that purpose,
• the 1-point midpoint rule, on [−1, 1] defined by the node c_1 := 0 and weight w_1 := 2,
• the 2-point Gauss-Legendre rule from Ex. 7.4.2.2, on [−1, 1] given by the weights/nodes {w_1 = w_2 = 1, c_1 = ⅓√3, c_2 = −⅓√3}.

Write down the formula for the resulting estimator EST_k and compare it with the choice (7.6.0.11) in terms of the number of required f-evaluations.

Learning Outcomes
✦ You should know what a quadrature formula is and the terminology connected with it.
✦ You should be able to transform quadrature formulas to arbitrary intervals.
✦ You should understand how interpolation and approximation schemes spawn quadrature formulas and how quadrature errors are connected to interpolation/approximation errors.
✦ You should be able to compute the weights of polynomial quadrature formulas.
✦ You should know the concept of order of a quadrature rule and why it is invariant under (affine) transformation.
✦ You should remember the maximal and minimal order of polynomial quadrature rules.
✦ You should know the order of the n-point Gauss-Legendre quadrature rule.
✦ You should understand why Gauss-Legendre quadrature converges exponentially for integrands that can be extended analytically and algebraically for integrands with limited smoothness.
✦ You should be able to apply regularizing transformations to integrals with non-smooth integrands.
✦ You should know about asymptotic convergence of the h-version of composite quadrature.
✦ You should know the principles of adaptive composite quadrature.



Bibliography

[BL19] L. Banjai and M. López-Fernández. “Efficient high order algorithms for fractional integrals
and fractional differential equations”. In: Numer. Math. 141.2 (2019), pp. 289–317. DOI:
10.1007/s00211-018-1004-0.
[Bog14] I. Bogaert. “Iteration-free computation of Gauss-Legendre quadrature nodes and weights”.
In: SIAM J. Sci. Comput. 36.3 (2014), A1008–A1026. DOI: 10.1137/140954969 (cit. on
p. 570).
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 552, 556, 559).
[DH03] P. Deuflhard and A. Hohmann. Numerical Analysis in Modern Scientific Computing. Vol. 43.
Texts in Applied Mathematics. Springer, 2003 (cit. on p. 584).
[Fej33] L. Fejér. “Mechanische Quadraturen mit positiven Cotesschen Zahlen”. In: Math. Z. 37.1
(1933), pp. 287–309. DOI: 10.1007/BF01474575 (cit. on p. 559).
[Gan+05] M. Gander, W. Gander, G. Golub, and D. Gruntz. Scientific Computing: An introduction using
MATLAB. Springer, 2005 (cit. on p. 569).
[GLR07] Andreas Glaser, Xiangtao Liu, and Vladimir Rokhlin. “A fast algorithm for the calculation of
the roots of special functions”. In: SIAM J. Sci. Comput. 29.4 (2007), pp. 1420–1438. DOI:
10.1137/06067016X.
[Gut09] M.H. Gutknecht. Lineare Algebra. Lecture Notes. SAM, ETH Zürich, 2009 (cit. on pp. 564,
565).
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 552, 557, 559,
568).
[Joh08] S.G. Johnson. Notes on the convergence of trapezoidal-rule quadrature. MIT online course
notes, http://math.mit.edu/ stevenj/trapezoidal.pdf. 2008.
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
pp. 564, 565).
[Tre08] Lloyd N. Trefethen. “Is Gauss quadrature better than Clenshaw-Curtis?” In: SIAM Rev. 50.1
(2008), pp. 67–87. DOI: 10.1137/060659831 (cit. on pp. 559, 569).
[TWXX] Lloyd N. Trefethen and J. A. C. Weideman. The exponentially convergent trapezoidal rule. XX.
[Wal06] Jörg Waldvogel. “Fast construction of the Fejér and Clenshaw-Curtis quadrature rules”. In: BIT
46.1 (2006), pp. 195–202. DOI: 10.1007/s10543-006-0045-4 (cit. on p. 559).
[Wal11] Jörg Waldvogel. “Towards a general error theory of the trapezoidal rule”. In: Approximation
and computation. Vol. 42. Springer Optim. Appl. Springer, New York, 2011, pp. 267–282. DOI:
10.1007/978-1-4419-6594-3_17.

Chapter 8

Iterative Methods for Non-Linear Systems of


Equations

8.1 Introduction
Video tutorial for Section 8.1 "Iterative Methods for Non-Linear Systems of Equations: Intro-
duction": (6 minutes) Download link, tablet notes

EXAMPLE 8.1.0.1 (Non-linear electric circuit)


Non-linear systems naturally arise in mathematical models of electrical circuits, once non-linear circuit
elements are introduced. This generalizes Ex. 2.1.0.3, where the current-voltage relationship for all circuit
elements was the simple proportionality (2.1.0.5) (of the complex amplitudes U and I ).
As an example we consider the Schmitt trigger circuit of Fig. 284 (nodes ➀–➄, resistors R_1, R_2, R_3, R_4, R_b, R_e, supply voltage U_+, input voltage U_in, output voltage U_out). Its key non-linear circuit element is the NPN bipolar junction transistor.
A transistor has three ports: emitter, collector, and base. Transistor models give the port currents as
functions of the applied voltages, for instance the Ebers-Moll model (large signal approximation):
  I_C = I_S ( e^(U_BE/U_T) − e^(U_BC/U_T) ) − (I_S/β_R) ( e^(U_BC/U_T) − 1 ) = I_C(U_BE, U_BC) ,
  I_B = (I_S/β_F) ( e^(U_BE/U_T) − 1 ) + (I_S/β_R) ( e^(U_BC/U_T) − 1 ) = I_B(U_BE, U_BC) ,   (8.1.0.2)
  I_E = I_S ( e^(U_BE/U_T) − e^(U_BC/U_T) ) + (I_S/β_F) ( e^(U_BE/U_T) − 1 ) = I_E(U_BE, U_BC) .
IC , IB , IE : current in collector/base/emitter,
UBE , UBC : potential drop between base-emitter, base-collector.


The parameters have the following meanings: β F is the forward common emitter current gain (20 to 500),
β R is the reverse common emitter current gain (0 to 20), IS is the reverse saturation current (on the order
of 10−15 to 10−12 amperes), UT is the thermal voltage (approximately 26 mV at 300 K).

The circuit of Fig. 284 has 5 nodes ➀–➄ with unknown nodal potentials. Kirchhoff's law (2.1.0.4) plus the constitutive relations gives an equation for each of them.

Non-linear system of equations from nodal analysis, static case (→ Ex. 2.1.0.3):

  ➀ : R_3^(−1)(U_1 − U_+) + R_1^(−1)(U_1 − U_3) + I_C(U_5 − U_1, U_5 − U_2) = 0 ,
  ➁ : R_e^(−1) U_2 + I_E(U_5 − U_1, U_5 − U_2) + I_E(U_3 − U_4, U_3 − U_2) = 0 ,
  ➂ : R_1^(−1)(U_3 − U_1) + I_B(U_3 − U_4, U_3 − U_2) = 0 ,   (8.1.0.3)
  ➃ : R_4^(−1)(U_4 − U_+) + I_C(U_3 − U_4, U_3 − U_2) = 0 ,
  ➄ : R_b^(−1)(U_5 − U_in) + I_B(U_5 − U_1, U_5 − U_2) = 0 .

5 equations ↔ 5 unknowns U_1, U_2, U_3, U_4, U_5

Formally: (8.1.0.3) ←→ F(u) = 0 with a function F : ℝ⁵ → ℝ⁵ y
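For later use with iterative solvers it may help to see (8.1.0.3) spelled out as code. The following is a minimal sketch (not from the lecture codes): the function name and all parameter values (resistances, voltages, transistor constants) are illustrative placeholders; the Ebers-Moll currents (8.1.0.2) are implemented as lambdas.

#include <Eigen/Dense>
#include <cmath>

// Sketch: the non-linear system (8.1.0.3) coded as F : R^5 -> R^5.
Eigen::VectorXd schmittTriggerF(const Eigen::VectorXd &U) {
  const double Uplus = 5.0, Uin = 1.0;                     // applied voltages (placeholders)
  const double R1 = 1e3, R3 = 1e3, R4 = 1e3, Rb = 1e3, Re = 1e3;
  const double Is = 1e-12, UT = 0.026, betaF = 100.0, betaR = 10.0;
  // Ebers-Moll port currents (8.1.0.2)
  auto Ic = [&](double Ube, double Ubc) {
    return Is * (std::exp(Ube / UT) - std::exp(Ubc / UT)) -
           Is / betaR * (std::exp(Ubc / UT) - 1.0);
  };
  auto Ib = [&](double Ube, double Ubc) {
    return Is / betaF * (std::exp(Ube / UT) - 1.0) +
           Is / betaR * (std::exp(Ubc / UT) - 1.0);
  };
  auto Ie = [&](double Ube, double Ubc) {
    return Is * (std::exp(Ube / UT) - std::exp(Ubc / UT)) +
           Is / betaF * (std::exp(Ube / UT) - 1.0);
  };
  Eigen::VectorXd F(5);  // U(0),...,U(4) hold the potentials of nodes 1,...,5
  F(0) = (U(0) - Uplus) / R3 + (U(0) - U(2)) / R1 + Ic(U(4) - U(0), U(4) - U(1));
  F(1) = U(1) / Re + Ie(U(4) - U(0), U(4) - U(1)) + Ie(U(2) - U(3), U(2) - U(1));
  F(2) = (U(2) - U(0)) / R1 + Ib(U(2) - U(3), U(2) - U(1));
  F(3) = (U(3) - Uplus) / R4 + Ic(U(2) - U(3), U(2) - U(1));
  F(4) = (U(4) - Uin) / Rb + Ib(U(4) - U(0), U(4) - U(1));
  return F;
}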

Remark 8.1.0.4 (General non-linear systems of equations) A non-linear system of equations is a concept almost too abstract to be useful, because it covers an extremely wide variety of problems. Nevertheless in this chapter we will mainly look at “generic” methods for such systems. This means that every method discussed may take a good deal of fine-tuning before it will really perform satisfactorily for a given non-linear system of equations. y

§8.1.0.5 (Generic/general non-linear system of equations) Let us try to describe the “problem” of having to solve a non-linear system of equations, where the concept of a “problem” was first introduced in § 1.5.5.1.

Given: function F : D ⊂ ℝⁿ ↦ ℝⁿ, n ∈ ℕ

Possible meanings: ☞ F is known as an analytic expression.
                   ☞ F is merely available in procedural form allowing point evaluations.

Here, D is the domain of definition of the function F, which cannot be evaluated for x ∉ D.

Sought: solution(s) x ∈ D of the non-linear equation F(x) = 0

Note: F : D ⊂ ℝⁿ ↦ ℝⁿ ↔ “same number n of equations and unknowns”

In contrast to the situation for linear systems of equations (→ Thm. 2.2.1.4), the class of non-linear systems is far too big to allow a general theory:

There are no general results on the existence & uniqueness of solutions of a “generic” non-linear system of equations F(x) = 0.

y
Contents
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593


8.2 Iterative Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596


8.2.1 Fundamental Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
8.2.2 Speed of Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 600
8.2.3 Termination Criteria/Stopping Rules . . . . . . . . . . . . . . . . . . . . . . 605
8.3 Fixed-Point Iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
8.3.1 Consistent Fixed-Point Iterations . . . . . . . . . . . . . . . . . . . . . . . . . 610
8.3.2 Convergence of Fixed-Point Iterations . . . . . . . . . . . . . . . . . . . . . . 611
8.4 Finding Zeros of Scalar Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 618
8.4.1 Bisection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 618
8.4.2 Model Function Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620
8.4.3 Asymptotic Efficiency of Iterative Methods for Zero Finding . . . . . . . . . 633
8.5 Newton’s Method in R n . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637
8.5.1 The Newton Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637
8.5.2 Convergence of Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . 649
8.5.3 Termination of Newton Iteration . . . . . . . . . . . . . . . . . . . . . . . . . 652
8.5.4 Damped Newton Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654
8.6 Quasi-Newton Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 658
8.7 Non-linear Least Squares [DR08, Ch. 6] . . . . . . . . . . . . . . . . . . . . . . . . . 665
8.7.1 (Damped) Newton Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
8.7.2 Gauss-Newton Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 670
8.7.3 Trust Region Method (Levenberg-Marquardt Method) . . . . . . . . . . . . 674

Review question(s) 8.1.0.6 (Iterative Methods for Non-Linear Systems of Equations: Introduction)
(Q8.1.0.6.A) State that nonlinear system of equations in the form F (x) = 0 with a function F : R n → R n
whose solution answers the following question:
How does the diagonal of the given matrix A ∈ R n,n have to be modified (yielding a matrix
e ) so that the linear system of equations Ax
A e = b, b ∈ R n given, has a prescribed solution
x∗ ∈ R n ?
When does the non-linear system of equations have a unique solution and what is it?
(Q8.1.0.6.B) Which non-linear system of equations is solved by every

x∗ ∈ argmin Φ(x) , Φ : R n → R continuously differentiable?


x ∈R n

Hint. From your analysis course remember necessary conditions for a global minimum of a continuously
differentiable function.
(Q8.1.0.6.C) A diode is a non-linear circuit element that yields vastly different currents depending on
the polarity of the applied voltage. Quantitatively, its current voltage relationship is described by the
Shockley diode equation

  I(U) = I_S ( exp(U/U_T) − 1 ) ,

where I_S is the reverse bias saturation current and U_T the so-called thermal voltage, both known parameters (→ Fig. 285).


The circuit drawn in Fig. 286 is a rectifier. Using nodal circuit analysis find the non-linear system of equations that permits us to compute the output voltage U_out between the two nodes •, when an input voltage U is applied.

Hint. Nodal analysis of electric circuits is explained in Ex. 2.1.0.3. You may look up that example.
(Q8.1.0.6.D) [Inverse function] From analysis you know that every monotonic function f : I ⊂ R → R
can be inverted on its range f ( I ). Reformulate the task of evaluating f −1 (y) for y ∈ f ( I ) as a non-
linear equation in the standard form F ( x ) = 0 for a suitable function F.

8.2 Iterative Methods


Remark 8.2.0.1 (Necessity of iterative approximation) Gaussian elimination (→ Section 2.3) provides
an algorithm that, if carried out in exact arithmetic (no roundoff errors), computes the solution of a
linear system of equations with a finite number of elementary operations. However, linear systems of
equations represent an exceptional case, because it is hardly ever possible to solve general systems of
non-linear equations using only finitely many elementary operations. y

8.2.1 Fundamental Concepts

Video tutorial for Section 8.2.1 "Iterative Methods: Fundamental Concepts": (6 minutes)
Download link, tablet notes

§8.2.1.1 (Generic iterations)


All methods for general non-linear systems of equations are iterative in the sense that they will usually


yield only approximate solutions whenever they terminate after finite time.

An iterative method for (approximately) solving the non-linear equation F(x) = 0 is an algorithm generating an arbitrarily long sequence (x^(k))_k of approximate solutions.

x^(k) =ˆ k-th iterate

(Fig. 287: iterates x^(0), x^(1), x^(2), … in the domain D, starting from the initial guess x^(0) and approaching the solution x*.)
y

§8.2.1.2 (Key issues with iterative methods) When applying an iterative method to solve a non-linear
system of equations F (x) = 0, the following issues arise:

✦ Convergence: Does the sequence (x(k) )k converge to a limit: limk→∞ x(k) = x∗ ?


✦ Consistency: Does the limit, if it exists, provide a solution of the non-linear system of equations:
F (x∗ ) = 0?

✦ Speed of convergence: How “fast” does ‖x^(k) − x*‖ (‖·‖ a suitable norm on ℝⁿ) decrease for increasing k?
More formal definitions can be given:

Definition 8.2.1.3. Convergence of iterative methods

k→∞
An iterative method converges (for fixed initial guess(es)) :⇔ x(k) → x∗ and F (x∗ ) = 0.

§8.2.1.4 ((Stationary) m-point iterative method) All the iterative methods discussed below fall in the class of (stationary) m-point, m ∈ ℕ, iterative methods, for which the iterate x^(k+1) depends on F and the m most recent iterates x^(k), …, x^(k−m+1), e.g.,

  x^(k+1) = Φ_F( x^(k), …, x^(k−m+1) ) ,   (8.2.1.5)

where Φ_F is the iteration function of the m-point method.

Terminology: Φ_F is called the iteration function.

Note: The initial guess(es) x^(0), …, x^(m−1) ∈ ℝⁿ have to be provided.


Visualization of a 1-point iteration (→ Fig. 288):

  x^(k+1) = Φ( x^(k) ) ,   (8.2.1.6)

with an iteration function Φ : D ⊂ ℝⁿ → ℝⁿ.
y

Definition 8.2.1.7. Consistency of iterative methods

A stationary m-point iterative method

  x^(k+1) = Φ_F( x^(k), …, x^(k−m+1) ) ,  m ∈ ℕ ,   (8.2.1.5)

is consistent with the non-linear system of equations F(x) = 0, if and only if

  Φ_F(x*, …, x*) = x*  ⇐⇒  F(x*) = 0 .

Theorem 8.2.1.8. Consistency and convergence

The limit of a convergent sequence (x^(k))_{k∈ℕ₀} generated by an m-point stationary iterative method with a continuous iteration function Φ_F that is consistent with F(x) = 0 is a solution:

  x^(k+1) := Φ_F( x^(k), …, x^(k−m+1) ) ,  x* := lim_{k→∞} x^(k)   ⇒   F(x*) = 0 .

Proof. The very definition of continuity means that limits can be “pulled into a function”:

  x* = lim_{k→∞} x^(k+1) = lim_{k→∞} Φ_F( x^(k), …, x^(k−m+1) )
     = Φ_F( lim_{k→∞} x^(k), …, lim_{k→∞} x^(k−m+1) ) = Φ_F(x*, …, x*) .

Appealing to Def. 8.2.1.7 finishes the proof.


For a consistent stationary iterative method we can study the error of the iterates x(k) defined as:

e(k) : = x(k) − x∗ .

§8.2.1.9 (Local convergence of iterative methods) Unfortunately, convergence may critically depend on


the choice of initial guesses. The property defined next weakens this dependence:

Definition 8.2.1.10. Local and global convergence → [Han02, Def. 17.1]


A stationary m-point iterative method converges locally to x* ∈ ℝⁿ, if there is a neighborhood
U ⊂ D of x∗ , such that

x(0) , . . . , x(m−1) ∈ U ⇒ x(k) well defined ∧ lim x(k) = x∗


k→∞

where (x(k) )k∈N0 is the (infinite) sequence of iterates.


If U = D, the iterative method is globally convergent.

Illustration of local convergence (→ Fig. 289): only initial guesses “sufficiently close” to x* guarantee convergence.

Unfortunately, the neighborhood U is rarely known a priori. It may also be very small.
y

Our goal: Given a non-linear system of equations, find iterative methods that converge (locally) to a
solution of F (x) = 0.
Two general questions: How to measure, describe, and predict the speed of convergence?
When to terminate the iteration?

Review question(s) 8.2.1.11 (Fundamentals of iterative methods)


(Q8.2.1.11.A) Rewrite an m-point iterative method for solving F (x) = 0,

x ( k +1) = Φ F ( x ( k ) , . . . , x ( k − m +1) ) ,

with iteration function

  Φ_F : ℝⁿ × ··· × ℝⁿ → ℝⁿ   (m factors)

as a 1-point iteration (also called a fixed-point iteration). What does consistency mean for that 1-point
iteration.
(Q8.2.1.11.B) When is the following 1-point iterative method

x(k+1) = x(k) + MF (x(k) ) , M ∈ R n,n ,


consistent with the non-linear system of equations F (x) = 0, F : D ⊂ R n → R n .

Definition 8.2.1.7. Consistency of iterative methods

A stationary m-point iterative method is consistent with the non-linear system of equations
F (x) = 0, if and only if

Φ F (x∗ , . . . , x∗ ) = x∗ ⇐⇒ F (x∗ ) = 0 .

8.2.2 Speed of Convergence

Video tutorial for Section 8.2.2 "Iterative Methods: Speed of Convergence": (15 minutes)
Download link, tablet notes

Here and in the sequel, k·k designates a generic vector norm on R n , see Def. 1.5.5.4. Any occurring
matrix norm is induced by this vector norm, see Def. 1.5.5.10.
It is important to be aware which statements depend on the choice of norm and which do not!

“Speed of convergence” measures the decrease of a norm (see Def. 1.5.5.4) of the iteration error

Definition 8.2.2.1. Linear convergence

A sequence x^(k), k = 0, 1, 2, …, in ℝⁿ converges linearly to x* ∈ ℝⁿ, if

  ∃ 0 < L < 1:  ‖x^(k+1) − x*‖ ≤ L ‖x^(k) − x*‖  ∀k ∈ ℕ₀ .

Terminology: The least upper bound for L gives the rate of convergence:

  rate = sup_{k∈ℕ₀} ‖x^(k+1) − x*‖ / ‖x^(k) − x*‖ ,  x* := lim_{k→∞} x^(k) .   (8.2.2.2)

Remark 8.2.2.3 (Impact of choice of norm)


Fact of convergence of an iteration is independent of choice of norm
Fact of linear convergence depends on choice of norm
Rate of linear convergence depends on choice of norm
The first statement is a consequence of the equivalence of all norms on the finite dimensional vector space
Kn :

Definition 8.2.2.4. Equivalence of norms → [Str09, Def. 4.4.2]


Two norms k·k a and k·kb on a vector space V are equivalent if

∃C, C > 0: C kvk a ≤ kvkb ≤ C kvk a ∀v ∈ V .


Theorem 8.2.2.5. Equivalence of all norms on finite-dimensional vector spaces → [Str09,


Satz 4.4.1]

If dim V < ∞ all norms (→ Def. 1.5.5.4) on V are equivalent (→ Def. 8.2.2.4).

Remark 8.2.2.6 (Detecting linear convergence) Often we will study the behavior of a consistent iterative method for a model problem in numerical experiments and measure the norms of the iteration errors e^(k) := x^(k) − x*. How can we tell that the method enjoys linear convergence?

(Fig. 290: norms of iteration errors plotted against the step index k in lin-log scale; • =ˆ linear convergence, • =ˆ faster than linear convergence.)

  norms of iteration errors ∼ on a straight line in lin-log plot
  ⇕
  ‖e^(k)‖ ≤ L^k ‖e^(0)‖ ,  i.e.  log(‖e^(k)‖) ≤ k log(L) + log(‖e^(0)‖) .
Let us abbreviate the error norm in step k by ǫk := x(k) − x∗ . In the case of linear convergence (see
Def. 8.2.2.1) assume (with 0 < L < 1)

ǫk+1 ≈ Lǫk ⇒ log ǫk+1 ≈ log L + log ǫk ⇒ log ǫk ≈ k log L + log ǫ0 . (8.2.2.7)

We conclude that log L < 0 determines the slope of the graph in lin-log error chart.

Related: guessing time complexity O(nα ) of an algorithm from measurements, see § 1.4.1.6.
Note the green dots • in Fig. 290: Any “faster” convergence also qualifies as linear convergence in the strict
sense of the definition. However, whenever this term is used, we tacitly imply, that no “faster convergence”
prevails and that the estimates in (8.2.2.7) are sharp. y

EXAMPLE 8.2.2.8 (Linearly convergent iteration) We consider the iteration (n = 1):

  x^(k+1) = x^(k) + ( cos(x^(k)) + 1 ) / sin(x^(k)) .
In the C++ code Code 8.2.2.9 x has to be initialized with the different values for x0 .
Note: The final iterate x (15) replaces the exact solution x ∗ in the computation of the rate of convergence.

C++ code 8.2.2.9: Simple fixed point iteration in 1D ➺ GITLAB

void fpit(double x0, VectorXd &rates, VectorXd &err) {
  const Eigen::Index N = 15;
  double x = x0;  // initial guess
  VectorXd y(N);

  for (int i = 0; i < N; ++i) {
    x = x + (cos(x) + 1) / sin(x);
    y(i) = x;
  }
  err.resize(N);
  rates.resize(N);
  err = y - VectorXd::Constant(N, x);
  rates = err.bottomRows(N - 1).cwiseQuotient(err.topRows(N - 1));
}

  k   x^(0) = 0.4            x^(0) = 0.6            x^(0) = 1
      x^(k)      ratio       x^(k)      ratio       x^(k)      ratio
  2   3.3887     0.1128      3.4727     0.4791      2.9873     0.4959
  3   3.2645     0.4974      3.3056     0.4953      3.0646     0.4989
  4   3.2030     0.4992      3.2234     0.4988      3.1031     0.4996
  5   3.1723     0.4996      3.1825     0.4995      3.1224     0.4997
  6   3.1569     0.4995      3.1620     0.4994      3.1320     0.4995
  7   3.1493     0.4990      3.1518     0.4990      3.1368     0.4990
  8   3.1454     0.4980      3.1467     0.4980      3.1392     0.4980

  (ratio := |x^(k) − x^(15)| / |x^(k−1) − x^(15)|)

Rate of convergence ≈ 0.5


(Fig. 291: modulus of iteration errors vs. index of iterate, in lin-log scale, for the different initial guesses of the above table.)

Observation:

  linear convergence as in Def. 8.2.2.1
  ⇕
  error graphs = straight lines in lin-log scale → Rem. 8.2.2.6

There are notions of convergence that guarantee a much faster (asymptotic) decay of the norms of the
iteration errors than linear convergence from Def. 8.2.2.1.

Definition 8.2.2.10. Order of convergence → [Han02, Sect. 17.2], [DR08, Def. 5.14], [QSS00, Def. 6.1]

A convergent sequence x^(k), k = 0, 1, 2, …, in ℝⁿ with limit x* ∈ ℝⁿ converges with order p, p ≥ 1, if

  ∃ C > 0:  ‖x^(k+1) − x*‖ ≤ C ‖x^(k) − x*‖^p  ∀k ∈ ℕ₀ ,   (8.2.2.11)

and, in addition, C < 1 in the case p = 1 (linear convergence → Def. 8.2.2.1).

Of course, the order p of convergence of an iterative method refers to the largest possible p in the def-


inition, that is, the error estimate will in general not hold, if p is replaced with p + ǫ for any ǫ > 0, cf.
Rem. 1.4.1.3.

(Qualitative error graphs for convergence of order p = 1.1, 1.2, 1.4, 1.7, 2: lin-log scale, index k of iterates plotted versus log ‖x^(k) − x*‖.)

In the case of convergence of order p (p > 1) according to Def. 8.2.2.10 and assuming sharpness of the error bound we obtain for the error norms ε_k := ‖x^(k) − x*‖:

  ε_{k+1} ≈ C ε_k^p  ⇒  log ε_{k+1} = log C + p log ε_k  ⇒  log ε_{k+1} = log C ∑_{l=0}^k p^l + p^{k+1} log ε_0
  ⇒  log ε_{k+1} = −(log C)/(p − 1) + ( (log C)/(p − 1) + log ε_0 ) p^{k+1} .

In this case, the error graph of the function k ↦ log ε_k is a concave (“downward bent”) power curve (for sufficiently small ε_0!)

Remark 8.2.2.12 (Detecting order p > 1 of convergence) How can we guess the order of convergence (→ Def. 8.2.2.10) from tabulated error norms measured in a numerical experiment?

Abbreviate by ε_k := ‖x^(k) − x*‖ the norm of the iteration error and assume ε_{k+1} ≈ C ε_k^p. Then

  log ε_{k+1} ≈ log C + p log ε_k   ⇒   (log ε_{k+1} − log ε_k) / (log ε_k − log ε_{k−1}) ≈ p .

➣ monitor the quotients (log ε_{k+1} − log ε_k)/(log ε_k − log ε_{k−1}) over several steps of the iteration. y
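The two diagnostics of Rem. 8.2.2.6 and Rem. 8.2.2.12 are easily computed from a tabulated sequence of error norms. The following sketch is illustrative (the function name is not from the lecture codes): the quotients ε_{k+1}/ε_k indicate the rate of linear convergence, while the log-difference quotients estimate the order p.

#include <cmath>
#include <iostream>
#include <vector>

// Print estimated rates (eps_{k+1}/eps_k) and order estimates
// (quotients of consecutive log-differences) from error norms eps.
void convergenceDiagnostics(const std::vector<double> &eps) {
  for (std::size_t k = 1; k + 1 < eps.size(); ++k) {
    const double rate = eps[k + 1] / eps[k];
    const double order = (std::log(eps[k + 1]) - std::log(eps[k])) /
                         (std::log(eps[k]) - std::log(eps[k - 1]));
    std::cout << "k = " << k << ": rate ~ " << rate
              << ", order ~ " << order << "\n";
  }
}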

EXAMPLE 8.2.2.13 (Quadratic convergence = convergence of order 2) From your analysis course [Str09, Bsp. 3.3.2(iii)] recall the famous iteration for computing √a, a > 0:

  x^(k+1) = ½ ( x^(k) + a/x^(k) )   ⇒   |x^(k+1) − √a| = (1/(2 x^(k))) |x^(k) − √a|² .   (8.2.2.14)

By the arithmetic-geometric mean inequality (AGM) √(ab) ≤ ½(a + b) we conclude: x^(k) > √a for k ≥ 1. Therefore the estimate from (8.2.2.14) means that the sequence from (8.2.2.14) converges with order 2 to √a.

Note: x^(k+1) < x^(k) for all k ≥ 2 ➣ (x^(k))_{k∈ℕ₀} converges as a decreasing sequence that is bounded from below (→ analysis course)


Numerical experiment: iterates for a = 2:

  k   x^(k)                   e^(k) := x^(k) − √2       log(|e^(k)|/|e^(k−1)|) : log(|e^(k−1)|/|e^(k−2)|)
  0   2.00000000000000000     0.58578643762690485
  1   1.50000000000000000     0.08578643762690485
  2   1.41666666666666652     0.00245310429357137       1.850
  3   1.41421568627450966     0.00000212390141452       1.984
  4   1.41421356237468987     0.00000000000159472       2.000
  5   1.41421356237309492     0.00000000000000022       0.630

Note the doubling of the number of correct digits in each step! [impact of roundoff!]
The doubling of the number of significant digits for the iterates holds true for any quadratically convergent iteration:

Recall from Rem. 1.5.3.4 that the relative error (→ Def. 1.5.3.3) tells the number of significant digits. Indeed, denoting the relative error in step k by δ_k, we have in the case of quadratic convergence

  x^(k) = x*(1 + δ_k)  ⇒  x^(k) − x* = δ_k x* ,
  ⇒  |x* δ_{k+1}| = |x^(k+1) − x*| ≤ C |x^(k) − x*|² = C |x* δ_k|²
  ⇒  |δ_{k+1}| ≤ C |x*| δ_k² .   (8.2.2.15)

Note: δ_k ≈ 10^{−ℓ} means that x^(k) has ℓ significant digits.
Also note that if C ≈ 1, then δ_k = 10^{−ℓ} and (8.2.2.15) implies δ_{k+1} ≈ 10^{−2ℓ}. y
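The convergence table above can be reproduced with a few lines of code. The following sketch (not the original experiment code) runs the square-root iteration (8.2.2.14) for a = 2 and prints the errors together with the order estimate of Rem. 8.2.2.12.

#include <cmath>
#include <iomanip>
#include <iostream>
#include <vector>

int main() {
  const double a = 2.0;
  const double limit = std::sqrt(a);
  std::vector<double> err;
  double x = 2.0;  // initial guess x^(0)
  err.push_back(std::abs(x - limit));
  for (int k = 1; k <= 5; ++k) {
    x = 0.5 * (x + a / x);  // one step of (8.2.2.14)
    err.push_back(std::abs(x - limit));
  }
  std::cout << std::setprecision(16);
  for (std::size_t k = 0; k < err.size(); ++k) {
    std::cout << "k = " << k << ", |e^(k)| = " << err[k];
    if (k >= 2 && err[k] > 0.0) {  // order estimate from log-difference quotients
      std::cout << ", order estimate = "
                << (std::log(err[k]) - std::log(err[k - 1])) /
                       (std::log(err[k - 1]) - std::log(err[k - 2]));
    }
    std::cout << "\n";
  }
  return 0;
}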
Review question(s) 8.2.2.16 (Iterative methods: speed of convergence)
(Q8.2.2.16.A) You have a table of the iterates x (k) ∈ R, k = 1, . . . , N , N ≫ 1, produced by some con-
vergent iterative method. How do you proceed to produce evidence
1. for linear convergence,
2. for convergence with order p, p > 1.

Definition 8.2.2.1. Linear convergence

A sequence x^(k), k = 0, 1, 2, …, in ℝⁿ converges linearly to x* ∈ ℝⁿ, if

  ∃ 0 < L < 1:  ‖x^(k+1) − x*‖ ≤ L ‖x^(k) − x*‖  ∀k ∈ ℕ₀ .

Definition 8.2.2.10. Order of convergence

A convergent sequence x^(k), k = 0, 1, 2, …, in ℝⁿ with limit x* ∈ ℝⁿ converges with order p, if

  ∃ C > 0:  ‖x^(k+1) − x*‖ ≤ C ‖x^(k) − x*‖^p  ∀k ∈ ℕ₀ ,   (8.2.2.17)

and, in addition, C < 1 in the case p = 1 (linear convergence → Def. 8.2.2.1).

(Q8.2.2.16.B) Consider the following iteration in ℝ²:

  x^(k+1) = [ 1/3   1
              0    1/2 ] x^(k) .   (8.2.2.18)

What is lim_{k→∞} x^(k)? Does this iteration generate linearly convergent sequences?

Hint. The matrix in (8.2.2.18) can be diagonalized.


 
(Q8.2.2.16.C) Assume that all sequences (x^(k))_{k∈ℕ₀} produced by a 1-point iteration x^(k+1) := Φ(x^(k)), with iteration function Φ : ℝⁿ → ℝⁿ, satisfy

  ‖x^(k+1) − x*‖ ≤ C ‖x^(k) − x*‖^p  ∀k ∈ ℕ₀ ,

for some p > 1, C > 0 and x* ∈ ℝⁿ.

Give a sharp criterion for the initial guess x^(0) ∈ ℝⁿ that guarantees convergence of the resulting sequence.

Hint. When will the sequence (‖x^(k) − x*‖)_{k∈ℕ₀} be decreasing?

8.2.3 Termination Criteria/Stopping Rules

Video tutorial for Section 8.2.3 "Iterative Methods: Termination Criteria/Stopping Rules": (14
minutes) Download link, tablet notes

Supplementary literature. Also discussed in [AG11, Sect. 3.1, p. 42].

As remarked above, usually (even without roundoff errors) an iteration will never arrive at an/the exact
solution x∗ after finitely many steps. Thus, we can only hope to compute an approximate solution by
accepting x(K ) as result for some K ∈ N0 . Termination criteria (stopping rules) are used to determine a
suitable value for K .

For the sake of efficiency ✄ stop iteration when iteration error is just “small enough”
(“small enough” depends on the concrete problem and user demands.)

§8.2.3.1 (Classification of termination criteria (stopping rules) for iterative solvers for non-linear systems of equations)

A termination criterion (stopping rule) is an algorithm deciding in each step of an iterative method whether to STOP or to CONTINUE.

We can distinguish two types of stopping rules:
• A priori termination: the decision to stop is based on information about F and x^(0), made before starting the iteration.
• A posteriori termination: besides x^(0) and F, also current and past iterates are used to decide about termination.

A termination criterion for a convergent iteration is deemed reliable, if it lets the iteration CONTINUE, until the iteration error e^(k) := x^(k) − x*, x* the limit value, satisfies certain conditions (usually imposed before the start of the iteration). y

§8.2.3.2 (Ideal termination) Writing x* for the desired solution, termination criteria are usually meant to ensure accuracy of the final iterate x^(K) in the following sense:

  ‖x^(K) − x*‖ ≤ τ_abs ,  τ_abs =ˆ prescribed (absolute) tolerance,
or
  ‖x^(K) − x*‖ ≤ τ_rel ‖x*‖ ,  τ_rel =ˆ prescribed (relative) tolerance.

It seems that the second criterion, asking that the relative (→ Def. 1.5.3.3) iteration error be below a prescribed threshold, alone would suffice, but the absolute tolerance should be checked, if, by “accident”, ‖x*‖ = 0 is possible. Otherwise, the iteration might fail to terminate at all.

Both criteria enter the “ideal (a posteriori) termination rule”:

  STOP at step K = min{ k ∈ ℕ₀ : ‖x^(k) − x*‖ ≤ τ_abs  or  ‖x^(k) − x*‖ ≤ τ_rel ‖x*‖ } .   (8.2.3.3)

As pointed out before, the comparison ‖x^(k) − x*‖ ≤ τ_abs is necessary to ensure termination when x* = 0 can happen.
Obviously, (8.2.3.3) achieves the optimum in terms of efficiency and reliability. However, this termination criterion is not practical, because x* is not known. Algorithmically feasible stopping rules have to replace ‖x^(k) − x*‖ and ‖x*‖ with (upper/lower) bounds or estimates.

§8.2.3.4 (Practical termination criteria for iterations) The following termination criteria are commonly
used in numerical codes:

➀ A priori termination: stop the iteration after a fixed number of steps (possibly depending on x^(0)).

Drawback: hard to ensure a prescribed accuracy!

(A priori =ˆ without actually taking into account the computed iterates, see § 8.2.3.1.)

Invoking additional properties of either the non-linear system of equations F(x) = 0 or the iteration, it is sometimes possible to tell that for sure ∥x^(k) − x∗∥ ≤ τ for all k ≥ K, though this K may be (significantly) larger than the optimal termination index from (8.2.3.3), see § 8.2.3.7.

➁ Residual based termination: STOP the convergent iteration {x^(k)}_{k∈N_0}, when

∥F(x^(k))∥ ≤ τ ,   τ =ˆ prescribed tolerance > 0 .

This gives no guaranteed accuracy: consider the case n = 1. If F : D ⊂ R → R is “flat” in the neighborhood of a zero x∗, then a small value of |F(x)| does not mean that x is close to x∗.


[Fig. 292: F(x^(k)) small ⇏ |x − x∗| small.   Fig. 293: F(x^(k)) small ⇒ |x − x∗| small.]

➂ Correction based termination: STOP the convergent iteration {x^(k)}_{k∈N_0}, when

∥x^(k+1) − x^(k)∥ ≤ τ_abs   or   ∥x^(k+1) − x^(k)∥ ≤ τ_rel ∥x^(k+1)∥ ,

with prescribed tolerances τ_abs > 0 (absolute) and τ_rel > 0 (relative).

Also for this criterion, we have no guarantee that (8.2.3.3) will be even remotely satisfied. y

Remark 8.2.3.5 (STOP, when stationary in M) A special variant of correction based termination exploits that M is finite! (→ Section 1.5.3)

Wait until the (convergent) iteration becomes stationary in the discrete set M of machine numbers! This is possibly grossly inefficient, because it always computes “up to machine precision”.

C++ code 8.2.3.6: Square root iteration → Ex. 8.2.2.13 ➺ GITLAB

double sqrtit(double a) {
  double x_old = -1;
  double x = a;
  while (x_old != x) {
    x_old = x;
    x = 0.5 * (x + a / x);
  }
  return x;
}
y

§8.2.3.7 (A posteriori termination criterion for linearly convergent iterations → [DR08, Lemma 5.17, 5.19]) Let us assume that we know that an iteration is linearly convergent (→ Def. 8.2.2.1) with rate of convergence 0 < L < 1:

Definition 8.2.2.1. Linear convergence

A sequence x^(k), k = 0, 1, 2, . . ., in R^n converges linearly to x∗ ∈ R^n, if

∃ 0 < L < 1:   ∥x^(k+1) − x∗∥ ≤ L ∥x^(k) − x∗∥   ∀k ∈ N_0 .

The following simple manipulations give an a posteriori termination criterion for linearly convergent iterations with rate of convergence 0 < L < 1. By the triangle inequality and linear convergence,

∥x^(k) − x∗∥ ≤ ∥x^(k+1) − x^(k)∥ + ∥x^(k+1) − x∗∥ ≤ ∥x^(k+1) − x^(k)∥ + L ∥x^(k) − x∗∥ .

Solving for ∥x^(k) − x∗∥ and applying linear convergence once more, the iterates satisfy

∥x^(k+1) − x∗∥ ≤ L/(1 − L) ∥x^(k+1) − x^(k)∥ .   (8.2.3.8)

This suggests that we take the right-hand side of (8.2.3.8) as a posteriori error bound and use it instead of the inaccessible ∥x^(k+1) − x∗∥ for checking absolute and relative accuracy in (8.2.3.3). The resulting termination criterion will be reliable (→ § 8.2.3.1), since we will certainly have achieved the desired accuracy when we stop the iteration.

Estimating the rate of convergence L might be difficult.

A pessimistic estimate for L will not compromise reliability: using L̃ > L in (8.2.3.8) still yields a valid upper bound for ∥x^(k) − x∗∥. Hence, the result can be trusted, though we might have wasted computational resources by needlessly carrying on with the iteration. y

EXAMPLE 8.2.3.9 (A posteriori error bound for linearly convergent iteration) We revisit the iteration of Ex. 8.2.2.8:

x^(k+1) = x^(k) + (cos x^(k) + 1) / sin x^(k)   ⇒   x^(k) → π for x^(0) close to π .

Observed rate of convergence: L = 1/2. Error and error bound for x^(0) = 0.4:

k    |x^(k) − π|    L/(1−L) |x^(k) − x^(k−1)|    slack of bound

1 2.191562221997101 4.933154875586894 2.741592653589793


2 0.247139097781070 1.944423124216031 1.697284026434961
3 0.122936737876834 0.124202359904236 0.001265622027401
4 0.061390835206217 0.061545902670618 0.000155067464401
5 0.030685773472263 0.030705061733954 0.000019288261691
6 0.015341682696235 0.015344090776028 0.000002408079792
7 0.007670690889185 0.007670991807050 0.000000300917864
8 0.003835326638666 0.003835364250520 0.000000037611854
9 0.001917660968637 0.001917665670029 0.000000004701392
10 0.000958830190489 0.000958830778147 0.000000000587658
11 0.000479415058549 0.000479415131941 0.000000000073392
12 0.000239707524646 0.000239707533903 0.000000000009257
13 0.000119853761949 0.000119853762696 0.000000000000747
14 0.000059926881308 0.000059926880641 0.000000000000667
15 0.000029963440745 0.000029963440563 0.000000000000181

Hence: the a posteriori error bound is highly accurate in this case!


y
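The table above can be reproduced with a few lines of C++. The following sketch is not one of the lecture codes; it hard-wires the rate estimate L = 1/2 used in the experiment and prints the true error, the a posteriori bound from (8.2.3.8), and the slack.

C++ code (illustrative sketch): a posteriori error bound for the iteration of Ex. 8.2.2.8

#include <cmath>
#include <cstdio>

int main() {
  const double pi = std::acos(-1.0);
  const double L = 0.5;  // observed rate of convergence
  double x = 0.4;        // initial guess x^(0)
  for (int k = 1; k <= 15; ++k) {
    const double x_old = x;
    x = x + (std::cos(x) + 1.0) / std::sin(x);                // one iteration step
    const double err = std::abs(x - pi);                       // true error
    const double bound = L / (1.0 - L) * std::abs(x - x_old);  // bound (8.2.3.8)
    std::printf("%2d  %.15f  %.15f  %.15f\n", k, err, bound, bound - err);
  }
  return 0;
}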
Review question(s) 8.2.3.10 (Iterative methods: termination criteria)


(Q8.2.3.10.A) Let x0 ∈ R be an approximation for a zero of a continuously differentiable function f : R → R, that is, f(x0) ≈ 0. Argue why

Δs := f(x0) / f′(x0)

can be used to estimate the error x0 − x∗, where x∗ is a zero of f close to x0.


 
(Q8.2.3.10.B) For a linearly convergent sequence (x^(k))_{k∈N_0} ⊂ R^n with limit x∗ we found the a posteriori error estimate

∥x^(k+1) − x∗∥ ≤ L/(1 − L) ∥x^(k+1) − x^(k)∥ ,   (8.2.3.8)

where 0 ≤ L < 1 is the rate of (linear) convergence. Based on (8.2.3.8), the following stopping rule is employed:

STOP, as soon as   L/(1 − L) ∥x^(k+1) − x^(k)∥ ≤ τ_abs ,

where τ_abs > 0 is a user-supplied threshold. Assume that the true (“sharp”) rate of linear convergence is L ∈ [0, 1[, that is,

∥x^(k+1) − x∗∥ ≈ L ∥x^(k) − x∗∥   ∀k ∈ N_0 ,

but that in the stopping rule a larger value L̃, L < L̃ < 1, is used. Discuss what this means for the number of steps until termination and the absolute accuracy of the returned approximation.

8.3 Fixed-Point Iterations


Video tutorial for Section 8.3 "Fixed-Point Iterations": (12 minutes) Download link,
tablet notes

Supplementary literature. The contents of this section are also treated in [DR08, Sect. 5.3],

[QSS00, Sect. 6.3], [AG11, Sect. 3.3].


As before we consider a non-linear system of equations F(x) = 0, F : D ⊂ R^n ↦ R^n.

1-point stationary iterative methods, see (8.2.1.5), for F(x) = 0 are also called fixed point iterations.

A fixed point iteration is defined by an iteration function Φ : U ⊂ R^n ↦ R^n:

iteration function Φ : U ⊂ R^n ↦ R^n and initial guess x^(0) ∈ U
➣ iterates (x^(k))_{k∈N_0}:   x^(k+1) := Φ(x^(k))   (→ 1-point method, cf. (8.2.1.5)).

Here, U designates the domain of definition of the iteration function Φ.

Note that the sequence of iterates need not be well defined: x^(k) ∉ U is possible!


8.3.1 Consistent Fixed-Point Iterations


Next, we specialize Def. 8.2.1.7 for fixed point iterations:

Definition 8.3.1.1. Consistency of fixed point iterations, cf. Def. 8.2.1.7

A fixed point iteration x^(k+1) = Φ(x^(k)) is consistent with F(x) = 0, if, for x ∈ U ∩ D,

F(x) = 0   ⇔   Φ(x) = x .

Note:  iteration function Φ continuous  and  fixed point iteration (locally) convergent to x∗ ∈ U   ⇒   x∗ is a fixed point of Φ.

This is an immediate consequence of the fact that for a continuous function limits and function evaluations commute [Str09, Sect. 4.1].

General construction of fixed point iterations that are consistent with F(x) = 0:

➊ Rewrite equivalently F (x) = 0 ⇔ Φ(x) = x and then


➋ use the fixed point iteration

x^(k+1) := Φ(x^(k)) .   (8.3.1.2)

Note: there are many ways to transform F (x) = 0 into a fixed point form !

EXPERIMENT 8.3.1.3 (Many choices for consistent fixed point iterations) In this example we construct three different consistent fixed point iterations for a single scalar (n = 1) non-linear equation F(x) = 0. In numerical experiments we will see that they behave very differently.

F(x) = x e^x − 1 ,   x ∈ [0, 1] .

Different fixed-point forms:

Φ1(x) = e^{−x} ,
Φ2(x) = (1 + x)/(1 + e^x) ,
Φ3(x) = x + 1 − x e^x .

[Plot: graph of F(x) = x e^x − 1 on [0, 1].]


[Three plots: graphs of the iteration functions Φ1 (left), Φ2 (middle), and Φ3 (right) on [0, 1].]


With the same intial guess x (0) = 0.5 for all three fixed point iterations we obtain the following iterates:

k    x^(k+1) := Φ1(x^(k))    x^(k+1) := Φ2(x^(k))    x^(k+1) := Φ3(x^(k))


0 0.500000000000000 0.500000000000000 0.500000000000000
1 0.606530659712633 0.566311003197218 0.675639364649936
2 0.545239211892605 0.567143165034862 0.347812678511202
3 0.579703094878068 0.567143290409781 0.855321409174107
4 0.560064627938902 0.567143290409784 -0.156505955383169
5 0.571172148977215 0.567143290409784 0.977326422747719
6 0.564862946980323 0.567143290409784 -0.619764251895580
7 0.568438047570066 0.567143290409784 0.713713087416146
8 0.566409452746921 0.567143290409784 0.256626649129847
9 0.567559634262242 0.567143290409784 0.924920676910549
10 0.566907212935471 0.567143290409784 -0.407422405542253
We can also tabulate the modulus of the iteration error and mark correct digits with red:

k    |x1^(k) − x∗|    |x2^(k) − x∗|    |x3^(k) − x∗|
0 0.067143290409784 0.067143290409784 0.067143290409784
1 0.039387369302849 0.000832287212566 0.108496074240152
2 0.021904078517179 0.000000125374922 0.219330611898582
3 0.012559804468284 0.000000000000003 0.288178118764323
4 0.007078662470882 0.000000000000000 0.723649245792953
5 0.004028858567431 0.000000000000000 0.410183132337935
6 0.002280343429460 0.000000000000000 1.186907542305364
7 0.001294757160282 0.000000000000000 0.146569797006362
8 0.000733837662863 0.000000000000000 0.310516641279937
9 0.000416343852458 0.000000000000000 0.357777386500765
10 0.000236077474313 0.000000000000000 0.974565695952037
Observed: linear convergence of x1^(k), quadratic convergence of x2^(k), and no convergence (erratic behavior) of x3^(k); xi^(0) = 0.5 in all cases.
y
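The iterates tabulated above are easily generated; the following sketch (not a lecture code) runs all three fixed-point iterations from the common initial guess x^(0) = 0.5.

C++ code (illustrative sketch): three fixed-point iterations of Exp. 8.3.1.3

#include <cmath>
#include <cstdio>

int main() {
  double x1 = 0.5, x2 = 0.5, x3 = 0.5;  // same initial guess for all three
  for (int k = 0; k <= 10; ++k) {
    std::printf("%2d  %.15f  %.15f  %.15f\n", k, x1, x2, x3);
    x1 = std::exp(-x1);                      // Phi_1(x) = exp(-x)
    x2 = (1.0 + x2) / (1.0 + std::exp(x2));  // Phi_2(x) = (1+x)/(1+exp(x))
    x3 = x3 + 1.0 - x3 * std::exp(x3);       // Phi_3(x) = x + 1 - x*exp(x)
  }
  return 0;
}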
Question: Can we explain/forecast the behaviour of a fixed point iteration?

8.3.2 Convergence of Fixed-Point Iterations


In this section we will try to find easily verifiable conditions that ensure convergence (of a certain order) of
fixed point iterations. It will turn out that these conditions are surprisingly simple and general.

EXPERIMENT 8.3.2.1 (Exp. 8.3.1.3 revisited)


In Exp. 8.3.1.3 we observed vastly different behavior of different fixed point iterations for n = 1. Is it
possible to predict this from the shape of the graph of the iteration functions?
[Three plots: graphs of the iteration functions on [0, 1] — Φ1: linear convergence?, Φ2: quadratic convergence?, Φ3: no convergence.]
y

Remark 8.3.2.2 (Visualization of fixed point iterations in 1D)


1D setting (n = 1): Φ : R ↦ R continuously differentiable, Φ(x∗) = x∗; fixed point iteration: x^(k+1) = Φ(x^(k)).


In 1D it is possible to visualize the different convergence behavior of fixed point iterations: In order to
construct x (k+1) from x (k) one moves vertically to ( x (k) , x (k+1) = Φ( x (k) )), then horizontally to the
angular bisector of the first/third quadrant, that is, to the point ( x (k+1) , x (k+1) ). Returning vertically to the
abscissa gives x (k+1) .
[Four sketches of graphs of Φ near the fixed point x∗, with the captions:
−1 < Φ′(x∗) ≤ 0 ➣ convergence;   Φ′(x∗) < −1 ➣ divergence;
0 ≤ Φ′(x∗) < 1 ➣ convergence;   1 < Φ′(x∗) ➣ divergence.]


Numerical examples for iteration functions ➣ Exp. 8.3.1.3, iteration functions Φ1 and Φ3

It seems that the slope of the iteration function Φ in the fixed point, that is, in the point where it intersects


the bisector of the first/third quadrant, is crucial. y

Now we investigate rigorously when a fixed point iteration will lead to a convergent sequence with a particular qualitative kind of convergence according to Def. 8.2.2.10.

Definition 8.3.2.3. Contractive mapping

Φ : U ⊂ R^n ↦ R^n is contractive (w.r.t. a norm ∥·∥ on R^n), if

∃ L < 1:   ∥Φ(x) − Φ(y)∥ ≤ L ∥x − y∥   ∀x, y ∈ U .   (8.3.2.4)

A simple consideration: if Φ(x∗) = x∗ (fixed point), then a fixed point iteration induced by a contractive mapping Φ satisfies

∥x^(k+1) − x∗∥ = ∥Φ(x^(k)) − Φ(x∗)∥ ≤ L ∥x^(k) − x∗∥   by (8.3.2.4),

that is, the iteration converges (at least) linearly (→ Def. 8.2.2.1).

Note that Φ contractive ⇒ Φ has at most one fixed point.


Remark 8.3.2.5 (Banach’s fixed point theorem → [Str09, Satz 6.5.2],[DR08, Satz 5.8]) A key theo-
rem in calculus (also functional analysis):

Theorem 8.3.2.6. Banach’s fixed point theorem

If D ⊂ K^n (K = R, C) is closed and bounded and Φ : D ↦ D satisfies

∃ L < 1:   ∥Φ(x) − Φ(y)∥ ≤ L ∥x − y∥   ∀x, y ∈ D ,

then there is a unique fixed point x∗ ∈ D, Φ(x∗) = x∗, which is the limit of the sequence of iterates x^(k+1) := Φ(x^(k)) for any x^(0) ∈ D.

Proof. The proof is based on the 1-point iteration x^(k) = Φ(x^(k−1)), x^(0) ∈ D:

∥x^(k+N) − x^(k)∥ ≤ Σ_{j=k}^{k+N−1} ∥x^(j+1) − x^(j)∥ ≤ Σ_{j=k}^{k+N−1} L^j ∥x^(1) − x^(0)∥ ≤ L^k/(1 − L) ∥x^(1) − x^(0)∥ → 0 for k → ∞ .

Hence, (x^(k))_{k∈N_0} is a Cauchy sequence ➤ convergent: x^(k) → x∗ for k → ∞. Continuity of Φ ➤ Φ(x∗) = x∗. Uniqueness of the fixed point is evident. ✷ y
A simple criterion for a differentiable Φ to be contractive:


Lemma 8.3.2.7. Sufficient condition for local linear convergence of fixed point iteration → [Han02, Thm. 17.2], [DR08, Cor. 5.12]

If Φ : U ⊂ R^n ↦ R^n, Φ(x∗) = x∗, Φ is differentiable in x∗, and ∥DΦ(x∗)∥ < 1 (matrix norm, Def. 1.5.5.10!), then the fixed point iteration

x^(k+1) := Φ(x^(k)) ,   (8.3.1.2)

converges locally and at least linearly, that is,

∃ 0 ≤ L < 1:   ∥x^(k+1) − x∗∥ ≤ L ∥x^(k) − x∗∥   ∀k ∈ N_0 ,

provided that the initial guess x^(0) belongs to some neighborhood of x∗.

✎ notation: DΦ(x) =ˆ Jacobian (ger.: Jacobi-Matrix) of Φ at x ∈ D → [Str09, Sect. 7.6]

DΦ(x) = [ ∂Φ_i/∂x_j (x) ]_{i,j=1}^{n} =
  [ ∂Φ1/∂x1(x)  ∂Φ1/∂x2(x)  ···  ∂Φ1/∂xn(x) ]
  [ ∂Φ2/∂x1(x)      ···          ∂Φ2/∂xn(x) ]
  [      ⋮                            ⋮      ]
  [ ∂Φn/∂x1(x)  ∂Φn/∂x2(x)  ···  ∂Φn/∂xn(x) ]   (8.3.2.8)

A “visualization” of the statement of Lemma 8.3.2.7 has been provided in Rem. 8.3.2.2: The iteration
converges locally, if Φ is flat in a neighborhood of x ∗ , it will diverge, if Φ is steep there.

Proof. (of Lemma 8.3.2.7) By the definition of the derivative,

∥Φ(y) − Φ(x∗) − DΦ(x∗)(y − x∗)∥ ≤ ψ(∥y − x∗∥) ∥y − x∗∥ ,

with ψ : R0+ ↦ R0+ satisfying lim_{t→0} ψ(t) = 0.

Choose δ > 0 such that

L := ψ(t) + ∥DΦ(x∗)∥ ≤ ½ (1 + ∥DΦ(x∗)∥) < 1   ∀ 0 ≤ t < δ .

By the inverse triangle inequality,

| ∥a∥ − ∥b∥ | ≤ ∥a + b∥   ∀a, b ∈ R^n ,

and thanks to Φ(x∗) = x∗ we obtain for the fixed-point iteration

∥Φ(x) − x∗∥ − ∥DΦ(x∗)(x − x∗)∥ ≤ ψ(∥x − x∗∥) ∥x − x∗∥
⇒   ∥x^(k+1) − x∗∥ ≤ (ψ(t) + ∥DΦ(x∗)∥) ∥x^(k) − x∗∥ ≤ L ∥x^(k) − x∗∥ ,

if ∥x^(k) − x∗∥ < δ.


Lemma 8.3.2.9. Sufficient condition for linear convergence of fixed point iteration

Let U be convex and Φ : U ⊂ R^n ↦ R^n be continuously differentiable with

L := sup_{x∈U} ∥DΦ(x)∥ < 1 .

If Φ(x∗) = x∗ for some interior point x∗ ∈ U, then the fixed point iteration x^(k+1) = Φ(x^(k)) with x^(0) ∈ U converges to x∗ at least linearly with rate L.

Recall: U ⊂ R^n convex :⇔ (t x + (1 − t) y) ∈ U for all x, y ∈ U, 0 ≤ t ≤ 1.

Proof. (of Lemma 8.3.2.9) By the mean value theorem,

Φ(y) − Φ(x) = ∫_0^1 DΦ(x + τ(y − x)) (y − x) dτ   ∀x, y ∈ dom(Φ) .
⇒   ∥Φ(x) − Φ(y)∥ ≤ L ∥y − x∥ ,
⇒   ∥x^(k+1) − x∗∥ ≤ L ∥x^(k) − x∗∥ .

We find that Φ is contractive on U with unique fixed point x∗, to which x^(k) converges linearly for k → ∞.

Remark 8.3.2.10 (Bound for asymptotic rate of linear convergence) By the asymptotic rate of a linearly converging iteration we mean the contraction factor for the norm of the iteration error that we can expect when we are already very close to the limit x∗. If 0 < ∥DΦ(x∗)∥ < 1 and x^(k) ≈ x∗, then the (worst) asymptotic rate of linear convergence is L = ∥DΦ(x∗)∥. y

EXAMPLE 8.3.2.11 (Multidimensional fixed point iteration) In this example we encounter the first genuine system of non-linear equations and apply Lemma 8.3.2.9 to it.

System of equations / fixed point form:

x1 − c(cos x1 − sin x2) = 0          c(cos x1 − sin x2)   = x1
(x1 − x2) − c sin x2     = 0    ⇒    c(cos x1 − 2 sin x2) = x2 .

Define:  Φ(x1, x2) = c [ cos x1 − sin x2 ;  cos x1 − 2 sin x2 ]   ⇒   DΦ(x1, x2) = −c [ sin x1  cos x2 ;  sin x1  2 cos x2 ] .

Choose an appropriate norm: ∥·∥ = ∞-norm ∥·∥_∞ (→ Example 1.5.5.12); then

c < 1/3   ⇒   ∥DΦ(x)∥_∞ < 1   ∀x ∈ R^2 ,

➣ (at least) linear convergence of the fixed point iteration. The existence of a fixed point is also guaranteed, because Φ maps into the closed set [−3, 3]^2. Thus, the Banach fixed point theorem, Thm. 8.3.2.6, can be applied. y
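A possible realization of this two-dimensional fixed-point iteration is sketched below. It is not one of the lecture codes: the value c = 0.3 < 1/3, the zero initial guess, and the tolerance are arbitrary choices for illustration, and the ∞-norm of the correction is used for termination.

C++ code (illustrative sketch): fixed-point iteration for Ex. 8.3.2.11

#include <Eigen/Dense>
#include <cmath>
#include <iostream>

int main() {
  const double c = 0.3;  // c < 1/3 ensures contractivity in the infinity-norm
  auto Phi = [c](const Eigen::Vector2d &x) -> Eigen::Vector2d {
    return c * Eigen::Vector2d(std::cos(x(0)) - std::sin(x(1)),
                               std::cos(x(0)) - 2.0 * std::sin(x(1)));
  };
  Eigen::Vector2d x(0.0, 0.0);  // initial guess
  for (int k = 0; k < 100; ++k) {
    const Eigen::Vector2d x_new = Phi(x);
    const double corr = (x_new - x).lpNorm<Eigen::Infinity>();
    x = x_new;
    if (corr < 1e-12) break;  // correction-based termination
  }
  std::cout << "fixed point approx = " << x.transpose() << std::endl;
  return 0;
}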

What about higher order convergence (→ Def. 8.2.2.10, cf. Φ2 in Ex. 8.3.1.3)? Also in this case we should
study the derivatives of the iteration functions in the fixed point (limit point).

We give a refined convergence result only for n = 1 (scalar case, Φ : dom(Φ) ⊂ R 7→ R):


Theorem 8.3.2.12. Taylor’s formula → [Str09, Sect. 5.5]

If Φ : U ⊂ R ↦ R, U an interval, is m + 1 times continuously differentiable, and x ∈ U, then

Φ(y) − Φ(x) = Σ_{k=1}^{m} 1/k! Φ^(k)(x) (y − x)^k + O(|y − x|^{m+1})   ∀y ∈ U .   (8.3.2.13)

Now apply the Taylor expansion (8.3.2.13) to the iteration function Φ: if Φ(x∗) = x∗ and Φ : dom(Φ) ⊂ R ↦ R is “sufficiently smooth”, it tells us that

x^(k+1) − x∗ = Φ(x^(k)) − Φ(x∗) = Σ_{l=1}^{m} 1/l! Φ^(l)(x∗) (x^(k) − x∗)^l + O(|x^(k) − x∗|^{m+1}) .   (8.3.2.14)

Here we used the Landau symbol O(·) to describe the local behavior of a remainder term in the vicinity of x∗.

Lemma 8.3.2.15. Higher order local convergence of fixed point iterations

If Φ : U ⊂ R ↦ R is m + 1 times continuously differentiable, Φ(x∗) = x∗ for some x∗ in the interior of U, and Φ^(l)(x∗) = 0 for l = 1, . . . , m, m ≥ 1, then the fixed point iteration (8.3.1.2) converges locally to x∗ with order ≥ m + 1 (→ Def. 8.2.2.10).

Proof. For a neighborhood U of x∗,

(8.3.2.14)   ⇒   ∃ C > 0:   |Φ(y) − Φ(x∗)| ≤ C |y − x∗|^{m+1}   ∀y ∈ U .

Choose δ > 0 with δ^m C < 1/2: then |x^(0) − x∗| < δ ⇒ |x^(k) − x∗| < 2^{−k} δ ➣ local convergence. Then appeal to (8.3.2.14).


EXPERIMENT 8.3.2.16 (Exp. 8.3.2.1 continued) Now, Lemma 8.3.2.9 and Lemma 8.3.2.15 permit us a
precise prediction of the (asymptotic) convergence we can expect from the different fixed point iterations
studied in Exp. 8.3.1.3.
[Three plots: graphs of the iteration functions Φ1, Φ2, and Φ3 on [0, 1].]

Since x∗ e^{x∗} − 1 = 0, simple computations yield:

Φ1′(x) = −e^{−x}   ⇒   Φ1′(x∗) = −x∗ ≈ −0.56 ,   hence local linear convergence;
Φ2′(x) = (1 − x e^x)/(1 + e^x)^2 = 0 , if x e^x − 1 = 0 ,   hence quadratic convergence!
Φ3′(x) = 1 − x e^x − e^x   ⇒   Φ3′(x∗) = −1/x∗ ≈ −1.79 ,   hence no convergence.
y

Remark 8.3.2.17 (Termination criterion for contractive fixed point iteration)

We recall the considerations of § 8.2.3.7 about a termination criterion for a contractive fixed point iteration (= linearly convergent fixed point iteration → Def. 8.2.2.1), cf. (8.3.2.4), with contraction factor (= rate of convergence) 0 ≤ L < 1. Repeated use of the triangle inequality gives

∥x^(k+m) − x^(k)∥ ≤ Σ_{j=k}^{k+m−1} ∥x^(j+1) − x^(j)∥ ≤ Σ_{j=k}^{k+m−1} L^{j−k} ∥x^(k+1) − x^(k)∥
                 = (1 − L^m)/(1 − L) ∥x^(k+1) − x^(k)∥ ≤ (1 − L^m)/(1 − L) L^{k−l} ∥x^(l+1) − x^(l)∥ .

Hence, for m → ∞, with x∗ := lim_{k→∞} x^(k) we find the estimate

∥x∗ − x^(k)∥ ≤ L^{k−l}/(1 − L) ∥x^(l+1) − x^(l)∥ .   (8.3.2.18)

Set l = 0 in (8.3.2.18): a priori termination criterion

∥x∗ − x^(k)∥ ≤ L^k/(1 − L) ∥x^(1) − x^(0)∥ .   (8.3.2.19)

Set l = k − 1 in (8.3.2.18): a posteriori termination criterion

∥x∗ − x^(k)∥ ≤ L/(1 − L) ∥x^(k) − x^(k−1)∥ .   (8.3.2.20)
With the same arguments as in § 8.2.3.7 we see that overestimating L, that is, using a value for L that is
larger than the true value, still gives reliable termination criteria.

However, whereas overestimating L in (8.3.2.20) will not lead to a severe deterioration of the bound unless L ≈ 1, using a pessimistic value for L in (8.3.2.19) will result in a bound far larger than the true error if k ≫ 1. Then the a priori termination criterion (8.3.2.19) will recommend termination many iterations after the accuracy requirements have already been met. This will thwart the efficiency of the method. y
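A minimal sketch of how the a posteriori criterion (8.3.2.20) is used in practice is given below; it is not one of the lecture codes, and it assumes that a (possibly pessimistic) upper bound L for the contraction factor is supplied by the caller.

C++ code (illustrative sketch): scalar fixed-point iteration with a posteriori stopping rule

#include <cmath>
#include <functional>

// Fixed-point iteration x^(k+1) = Phi(x^(k)) stopped as soon as the a posteriori
// bound L/(1-L)*|x^(k+1)-x^(k)| from (8.3.2.20) drops below tol; 0 <= L < 1.
double fixedpoint_aposteriori(const std::function<double(double)> &Phi,
                              double x0, double L, double tol,
                              unsigned int maxit = 1000) {
  double x = x0;
  for (unsigned int k = 0; k < maxit; ++k) {
    const double x_new = Phi(x);
    const double errbound = L / (1.0 - L) * std::abs(x_new - x);  // bound for |x_new - x*|
    x = x_new;
    if (errbound <= tol) break;
  }
  return x;
}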
Review question(s) 8.3.2.21 (Fixed-point iterations)
(Q8.3.2.21.A) Let x^(k), k ∈ N_0, be the iterates produced by a fixed-point iteration x^(k+1) = Φ(x^(k)), Φ : R^n → R^n. Formulate a 2-point iteration

s^(k+1) = Ψ(k, s^(k), s^(k−1)) ,   k ∈ N_0 ,

that produces the sequence

s^(k) := 1/(k + 1) Σ_{j=0}^{k} x^(j) .

Is this a stationary 2-point iteration?


Definition § 8.2.1.4. Stationary m-point iterative method

A stationary m-point iterative method produces the sequences {x^(k)}_k ⊂ R^n of iterates according to the formula

x^(k+1) = Φ_F(x^(k), . . . , x^(k−m+1)) ,

where Φ_F : R^n × · · · × R^n → R^n is a given iteration function.

(Q8.3.2.21.B) Given a > 0, the following iteration functions spawn fixed-point iterations for the computation of √a:

ϕ1(x) := a + x − x^2 ,
ϕ2(x) := a/x ,
ϕ3(x) := 1 + x − (1/a) x^2 ,
ϕ4(x) := ½ (x + a/x) .

Predict the behavior and the type of convergence of the induced fixed-point iterations when started with an initial guess “sufficiently close” to √a.

8.4 Finding Zeros of Scalar Functions

Supplementary literature. [AG11, Ch. 3] is also devoted to this topic. The algorithm of “bisec-

tion” discussed in the next subsection, is treated in [DR08, Sect. 5.5.1] and [AG11, Sect. 3.2].

Now we focus on the scalar case n = 1: F : I ⊂ R ↦ R continuous, I an interval.

Sought:  x∗ ∈ I such that F(x∗) = 0.

8.4.1 Bisection
Video tutorial for Section 8.4.1 "Finding Zeros of Scalar Functions: Bisection": (7 minutes)
Download link, tablet notes

Idea: use the ordering of the real numbers & the intermediate value theorem [Str09, Sect. 4.6].

Input: a, b ∈ I such that F(a) F(b) < 0 (different signs!). Then

∃ x∗ ∈ ] min{a, b}, max{a, b} [ :   F(x∗) = 0 ,

as we conclude from the intermediate value theorem.

[Fig. 294: graph of F with a sign change on [a, b] and zero x∗ between a and b.]


Find a sequence of intervals with geometrically decreasing lengths, in each of which F will change
sign.
Such a sequence can easily be found by testing the sign of F at the midpoint of the current interval, see
Code 8.4.1.2.

§8.4.1.1 (Bisection method) The following C++ code implements the bisection method for finding the
zeros of a function passed through the function handle F in the interval [ a, b] with absolute tolerance
tol.

C++ code 8.4.1.2: Bisection method for solving F(x) = 0 on [a, b] ➺ GITLAB

2   // Searching zero of F in [a, b] by bisection
3   template <typename Func, typename Scalar>
4   Scalar bisect(Func &&F, Scalar a, Scalar b, Scalar tol)
5   {
6     if (a > b) {
7       std::swap(a, b); // sort interval bounds
8     }
9     if (F(a) * F(b) > 0) {
10      throw std::logic_error("f(a) and f(b) have same sign");
11    }
12    static_assert(std::is_floating_point<Scalar>::value,
13                  "Scalar must be a floating point type");
14    const int v = F(a) < 0 ? 1 : -1;
15    Scalar x = (a + b) / 2; // determine midpoint
16    // termination, relies on machine arithmetic if tol = 0
17    while (b - a > tol) {
18      assert(a <= x && x <= b); // assert invariant
19      // sgn(f(x)) = sgn(f(b)), then use x as next right boundary
20      if (v * F(x) > 0) {
21        b = x;
22      }
23      // sgn(f(x)) = sgn(f(a)), then use x as next left boundary
24      else {
25        a = x;
26      }
27      x = (a + b) / 2; // determine next midpoint
28    }
29    return x;
30  }

Line 18: the test ((a<x) && (x<b)) offers a safeguard against an infinite loop in case tol < resolution of M at the zero x∗ (cf. “M-based termination criterion”).

This is also an example of an algorithm that (in the case of tol = 0) uses the properties of machine arithmetic to define an a posteriori termination criterion, see Section 8.2.3. The iteration will terminate when, e.g., a +̃ ½(b − a) = a (where +̃ is the floating point realization of addition), which, by Ass. 1.5.3.11, can only happen when

|½ (b − a)| ≤ EPS · |a| .

Since the exact zero is located between a and b, this condition implies a relative error ≤ EPS of the
computed zero.
Advantages: • “foolproof”, robust: will always terminate with a zero of requested accuracy,
            • requires only point evaluations of F,
            • works with any continuous function F, no derivatives needed.


Drawbacks: • merely “linear-type” convergence(∗):  |x^(k) − x∗| ≤ 2^{−k} |b − a| ,
           • log2( |b − a| / tol ) steps are necessary.

(∗): the convergence of the bisection algorithm is not linear in the sense of Def. 8.2.2.1, because the condition |x^(k+1) − x∗| ≤ L |x^(k) − x∗| might be violated at any step of the iteration.
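A typical call of the bisect() template from Code 8.4.1.2 looks as follows. This usage sketch is not part of the lecture codes; it assumes that bisect() is visible in the same translation unit, together with the headers it relies on (<cassert>, <stdexcept>, <type_traits>, <utility>).

C++ code (illustrative sketch): calling bisect() for F(x) = x e^x − 1

#include <cmath>
#include <iostream>

int main() {
  auto F = [](double x) { return x * std::exp(x) - 1.0; };  // sign change on [0, 1]
  const double z = bisect(F, 0.0, 1.0, 1e-12);
  std::cout << "zero ~= " << z << ", F(zero) = " << F(z) << std::endl;
  return 0;
}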

Remark 8.4.1.3 (Generalized bisection methods) It is straightforward to combine the bisection idea
with more elaborate “model function methods” as they will be discussed in the next section: Instead of
stubbornly choosing the midpoint of the probing interval [ a, b] (→ Code 8.4.1.2) as next iterate, one may
use a refined guess for the location of a zero of F in [ a, b].

A method of this type is used by M ATLAB’s fzero function for root finding in 1D [QSS00, Sect. 6.2.3]. y
Review question(s) 8.4.1.4 (Finding Zeros of Scalar Functions: Bisection)
(Q8.4.1.4.A) We use the bisection method to find a zero of f : [0.5, 2] → R with f (1) < 0 and f (2) > 0.
Find an a priori bound for the number of steps needed to determine a zero with a guaranteed relative
error of 10−6 .
(Q8.4.1.4.B) What prevents us from using bisection to find zeros of a function f : D ⊂ C → C?

8.4.2 Model Function Methods


=ˆ class of iterative methods for finding zeros of F: the iterate in step k + 1 is computed according to the following idea:

Idea: Given the recent iterates (approximate zeros) x^(k), x^(k−1), . . . , x^(k−m+1), m ∈ N,
➊ replace F with a k-dependent model function F̃_k (based on the function values F(x^(k)), F(x^(k−1)), . . . , F(x^(k−m+1)) and, possibly, the derivative values F′(x^(k)), F′(x^(k−1)), . . . , F′(x^(k−m+1))),
➋ x^(k+1) := zero of F̃_k:  F̃_k(x^(k+1)) = 0  (has to be readily available ↔ analytic formula).

Distinguish (see § 8.2.1.1 and (8.2.1.5)):

one-point methods:   x^(k+1) = Φ_F(x^(k)), k ∈ N   (e.g., fixed point iteration → Section 8.3),
multi-point methods: x^(k+1) = Φ_F(x^(k), x^(k−1), . . . , x^(k−m)), k ∈ N, m = 2, 3, . . . .

8.4.2.1 Newton Method in the Scalar Case

Video tutorial for Section 8.4.2.1 "Newton Method in the Scalar Case": (20 minutes)
Download link, tablet notes

Again we consider the problem of finding zeros of the function F : I ⊂ R → R defined on an interval I :
we seek x ∗ ∈ I such that F ( x ∗ ) = 0. Now we impose stricter smoothness requirements and we assume
that F : I ⊂ R 7→ R is continuously differentiable, which means that both F and its derivative F ′ have to
be continuous on I .


Now the model function is the tangent to F at x^(k):

F̃_k(x) := F(x^(k)) + F′(x^(k)) (x − x^(k)) ,

and we take x^(k+1) := zero of this tangent. We obtain the Newton iteration (N.I.)

x^(k+1) := x^(k) − F(x^(k)) / F′(x^(k)) ,   (8.4.2.1)

which requires F′(x^(k)) ≠ 0.

[Fig. 295: graph of F, the tangent at x^(k), and the next iterate x^(k+1) as the zero of the tangent.]

The following C++ code snippet implements a generic Newton method for zero finding.
• The types FuncType and DervType must be functor types and provide an evaluation operator Scalar operator()(Scalar) const.
• The arguments F and DF must provide functors for F and F′.

C++11-code 8.4.2.2: Newton method in the scalar case n = 1


template <typename FuncType, typename DervType, typename Scalar>
Scalar newton1D(FuncType &&F, DervType &&DF,
                const Scalar &x0, double rtol, double atol)
{
  Scalar s, x = x0;
  do {
    s = F(x) / DF(x); // compute Newton correction
    x -= s;           // compute next iterate
  }
  // correction based termination (relative and absolute)
  while ((std::abs(s) > rtol * std::abs(x)) && (std::abs(s) > atol));
  return (x);
}

This code implements a correction-based termination criterion as introduced in § 8.2.3.4, see also § 8.2.3.2
for a discussion of absolute and relative tolerances.

EXAMPLE 8.4.2.3 (Square root iteration as a Newton iteration) In Ex. 8.2.2.13 we learned about the quadratically convergent fixed point iteration (8.2.2.14) for the approximate computation of the square root of a positive number. It can be derived as a Newton iteration (8.4.2.1)!

For F(x) = x^2 − a, a > 0, we find F′(x) = 2x, and, thus, the Newton iteration for finding zeros of F reads:

x^(k+1) = x^(k) − ((x^(k))^2 − a) / (2 x^(k)) = ½ ( x^(k) + a/x^(k) ) ,

which is exactly (8.2.2.14). Thus, for this F Newton’s method converges globally with order p = 2. y

EXAMPLE 8.4.2.4 (Newton method in 1D (→ Exp. 8.3.1.3)) Newton iterations for two different scalar non-linear equations F(x) = 0 with the same solution sets:

F(x) = x e^x − 1   ⇒  F′(x) = e^x (1 + x)   ⇒  x^(k+1) = x^(k) − (x^(k) e^{x^(k)} − 1) / (e^{x^(k)} (1 + x^(k))) = ((x^(k))^2 + e^{−x^(k)}) / (1 + x^(k)) ,

F(x) = x − e^{−x}  ⇒  F′(x) = 1 + e^{−x}    ⇒  x^(k+1) = x^(k) − (x^(k) − e^{−x^(k)}) / (1 + e^{−x^(k)}) = (1 + x^(k)) / (1 + e^{x^(k)}) .

Exp. 8.3.1.3 confirms quadratic convergence in both cases! (→ Def. 8.2.2.10)

Note that for the computation of its zeros, the function F in this example can be recast in different forms!
y
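The two Newton iterations of Ex. 8.4.2.4 can be driven by the generic newton1D() from Code 8.4.2.2. The following usage sketch is not a lecture code and assumes that newton1D() is available in the same file.

C++ code (illustrative sketch): calling newton1D() for the two formulations

#include <cmath>
#include <iostream>

int main() {
  auto F1 = [](double x) { return x * std::exp(x) - 1.0; };
  auto DF1 = [](double x) { return std::exp(x) * (1.0 + x); };
  auto F2 = [](double x) { return x - std::exp(-x); };
  auto DF2 = [](double x) { return 1.0 + std::exp(-x); };
  const double z1 = newton1D(F1, DF1, 0.5, 1e-12, 1e-14);  // zero of x*exp(x)-1
  const double z2 = newton1D(F2, DF2, 0.5, 1e-12, 1e-14);  // zero of x-exp(-x)
  std::cout << "z1 = " << z1 << ", z2 = " << z2 << std::endl;
  return 0;
}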

§8.4.2.5 (Convergence of Newton’s method in 1D) In fact, based on Lemma 8.3.2.15,

Lemma 8.3.2.15. Higher order local convergence of fixed point iterations

If Φ : U ⊂ R ↦ R is m + 1 times continuously differentiable, Φ(x∗) = x∗ for some x∗ in the interior of U, and Φ^(l)(x∗) = 0 for l = 1, . . . , m, m ≥ 1, then the fixed point iteration (8.3.1.2) converges locally to x∗ with order ≥ m + 1 (→ Def. 8.2.2.10).

it is straightforward to show local quadratic convergence of Newton’s method to a zero x∗ of F, provided that F′(x∗) ≠ 0. We easily see that the Newton iteration

x^(k+1) := x^(k) − F(x^(k)) / F′(x^(k)) ,   (8.4.2.1)

is a fixed point iteration (→ Section 8.3) with iteration function

Φ(x) := x − F(x)/F′(x) ,   (8.4.2.1) ⇔ x^(k+1) = Φ(x^(k)) .   (8.4.2.6)

Invoking the quotient rule for differentiation we get

Φ′(x) = F(x) F″(x) / (F′(x))^2   ⇒   Φ′(x∗) = 0 , if F(x∗) = 0, F′(x∗) ≠ 0 .   (8.4.2.7)

Thus from Lemma 8.3.2.15 we conclude the following result:

Convergence of Newton’s method in 1D

Newton’s method locally converges quadratically (→ Def. 8.2.2.10) to a zero x∗ of F, if F′(x∗) ≠ 0 and F is three times continuously differentiable in a neighborhood of x∗.

EXAMPLE 8.4.2.9 (Implicit differentiation of F)

[Fig. 296: ladder circuit — a voltage source U feeds a chain of nodes with potentials u1, . . . , un connected by the resistors R1, . . . , Rn; each node is also connected to ground through a leak resistor R.]

How do we have to choose the leak resistance R > 0 in the linear circuit displayed in Fig. 296 in order to
achieve a prescribed potential at one of the nodes?

The circuit displayed in Fig. 296 is composed of linear resistors only. Thus we can use the nodal analysis of
the circuit introduced in Ex. 2.1.0.3 in order to derive a linear system of equations for the nodal potentials u j


in the nodes represented by • in Fig. 296. Kirchhoff’s current law (2.1.0.4) plus the constitutive relationship I = U/R for a resistor with resistance R give

Node 1:  (1/R1)(u1 − U) + (1/R) u1 + (1/R2)(u1 − u2) = 0 ,
Node j:  (1/Rj)(uj − u_{j−1}) + (1/R) uj + (1/R_{j+1})(uj − u_{j+1}) = 0 ,   j = 2, . . . , n − 1 ,   (8.4.2.10)
Node n:  (1/Rn)(un − u_{n−1}) + (1/R) un = 0 .

These n equations are equivalent to a linear system of equations for the vector u = (u_j)_{j=1}^{n} ∈ R^n, which reads in compact notation

( A + (1/R) · I ) u = b ,   (8.4.2.11)

where A ∈ R^{n,n} is the symmetric tridiagonal matrix with diagonal entries
1/R1 + 1/R2 , 1/R2 + 1/R3 , . . . , 1/R_{n−1} + 1/Rn , 1/Rn ,
sub-/super-diagonal entries −1/R2 , −1/R3 , . . . , −1/Rn , and b = ( U/R1, 0, . . . , 0 )^⊤.

Thus the current problem can be formulated as: find x ∈ R, x := R^{−1} > 0, such that

F(x) = 0   with   F : R → R ,   x ↦ w^⊤ (A + x I)^{−1} b − 1 ,   (8.4.2.12)

where A ∈ R^{n,n} is a symmetric, tridiagonal, diagonally dominant matrix, w ∈ R^n is a unit vector singling out the node of interest, and b takes into account the exciting voltage U.

In order to apply Newton’s method to (8.4.2.12), we have to determine the derivative F′(x), and we do so by implicit differentiation [Str09, Sect. 7.8], first rewriting (u(x) =ˆ vector of nodal potentials as a function of x = R^{−1})

F(x) = w^⊤ u(x) − 1 ,   (A + x I) u(x) = b .

Then we differentiate the linear system of equations defining u(x) on both sides with respect to x using the product rule (8.5.1.17):

d/dx [ (A + x I) u(x) ] = d/dx b   ⇒   (A + x I) u′(x) + u(x) = 0 ,
u′(x) = −(A + x I)^{−1} u(x) ,   (8.4.2.13)
F′(x) = w^⊤ u′(x) = −w^⊤ (A + x I)^{−1} u(x) .   (8.4.2.14)

Thus, the Newton iteration for (8.4.2.12) reads:

x^(k+1) = x^(k) − F(x^(k)) / F′(x^(k)) = x^(k) + ( w^⊤ u(x^(k)) − 1 ) / ( w^⊤ z(x^(k)) ) ,   (8.4.2.15)
with   (A + x^(k) I) u(x^(k)) = b ,   (A + x^(k) I) z(x^(k)) = u(x^(k)) .


In each step of the iteration we have to solve two linear systems of equations, which can be done with
asymptotic effort O(n) in this case, because A + x (k) I is tridiagonal.

Note that in a practical application one must demand x > 0, in addition, because the solution must provide
a meaningful conductance (= inverse resistance.)

Also note that bisection (→ 8.4.1) is a viable alternative to using Newton’s method in this case. y
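A possible Eigen-based realization of the Newton iteration (8.4.2.15) is sketched below. It is not one of the lecture codes: the number of nodes n, the resistances R_j, the source voltage U, the monitored node (node 1, with target potential 1), and the initial guess x^(0) = 0 are made-up example data, and a sparse LU factorization stands in for the O(n) tridiagonal solver mentioned above.

C++ code (illustrative sketch): Newton iteration (8.4.2.15) with two linear solves per step

#include <Eigen/Dense>
#include <Eigen/Sparse>
#include <cmath>
#include <iostream>
#include <vector>

int main() {
  const int n = 20;      // number of nodes
  const double U = 5.0;  // exciting voltage
  std::vector<double> R(n + 1);
  for (int j = 1; j <= n; ++j) R[j] = 1.0 + 0.1 * j;  // resistances R_1, ..., R_n
  // Assemble the tridiagonal matrix A from (8.4.2.11) in sparse format
  std::vector<Eigen::Triplet<double>> trp;
  for (int j = 1; j <= n; ++j) {
    const double diag = 1.0 / R[j] + ((j < n) ? 1.0 / R[j + 1] : 0.0);
    trp.emplace_back(j - 1, j - 1, diag);
    if (j < n) {
      trp.emplace_back(j - 1, j, -1.0 / R[j + 1]);
      trp.emplace_back(j, j - 1, -1.0 / R[j + 1]);
    }
  }
  Eigen::SparseMatrix<double> A(n, n);
  A.setFromTriplets(trp.begin(), trp.end());
  Eigen::SparseMatrix<double> Id(n, n);
  Id.setIdentity();
  Eigen::VectorXd b = Eigen::VectorXd::Zero(n);
  b(0) = U / R[1];
  Eigen::VectorXd w = Eigen::VectorXd::Zero(n);
  w(0) = 1.0;  // we monitor the potential of node 1

  double x = 0.0;  // initial guess for the leak conductance 1/R
  for (int k = 0; k < 50; ++k) {
    Eigen::SparseMatrix<double> Ax = A + x * Id;
    Ax.makeCompressed();
    Eigen::SparseLU<Eigen::SparseMatrix<double>> solver(Ax);
    const Eigen::VectorXd u = solver.solve(b);      // (A + xI) u = b
    const Eigen::VectorXd z = solver.solve(u);      // (A + xI) z = u
    const double s = (w.dot(u) - 1.0) / w.dot(z);   // Newton correction
    x += s;
    if (std::abs(s) <= 1e-12 * std::abs(x)) break;  // correction-based termination
  }
  std::cout << "leak conductance x = 1/R ~= " << x << std::endl;
  return 0;
}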

Supplementary literature. Newton’s method in 1D is discussed in [Han02, Sect. 18.1], [DR08,

Sect. 5.5.2], [AG11, Sect. 3.4].

Review question(s) 8.4.2.16 (Newton’s method in one dimension)


(Q8.4.2.16.A) State Newton’s iteration for finding the zero of f ( x ) := αx + β, α, β ∈ R, x ∈ R. What
can you say about its convergence?
(Q8.4.2.16.B) The Lambert W-function W : I ⊂ R → R is implicitly defined through

W ( x ) exp(W ( x )) = x , W ( x ) > 0 for x > 0 .

1. What is the maximal domain of definition I ?


2. State the Newton iteration for computing W ( x ), x ∈ I .
3. Sketch a C++ function that can compute W ( x ) with a relative error equal to the machine precision
EPS.
(Q8.4.2.16.C) Planck’s radiation law describes the spectral emissive power of a black body by the formula

B(ν, T) = (2 h ν^3 / c^2) · 1 / ( exp( hν / (k_B T) ) − 1 ) ,   k_B =ˆ Boltzmann constant,  h =ˆ Planck constant,  c =ˆ speed of light,

as a function of the frequency ν > 0 and of the temperature T > 0.


Outline an algorithm based on Newton’s method that can determine the frequency νmax of the emission
maximum as a function of T .
(Q8.4.2.16.D) What will happen, if one applies Newton’s method to F(x) := x^2 + 1 with x^(0) = 1 or x^(0) = 2?

8.4.2.2 Special One-Point Methods

Idea underlying other one-point methods: non-linear local approximation

Useful, if a priori knowledge about the structure of F (e.g. about F being a rational function, see below) is
available. This is often the case, because many problems of 1D zero finding are posed for functions given
in analytic form with a few parameters.

Prerequisite: Smoothness of F: F ∈ C m ( I ) for some m > 1

EXAMPLE 8.4.2.17 (Halley’s iteration → [Han02, Sect. 18.3]) This example demonstrates that non-
polynomial model functions can offer excellent approximation of F. In this example the model function is
chosen as a quotient of two linear function, that is, from the simplest class of true rational functions.


Of course, the fact that this function provides a good model is merely “a matter of luck”, unless you have some more information about F. Such information might be available from the application context.

Given x^(k) ∈ I, the next iterate is the zero of the model function: h(x^(k+1)) = 0, where

h(x) := a/(x + b) + c   (rational function)   such that   F^(j)(x^(k)) = h^(j)(x^(k)) ,  j = 0, 1, 2 .

a/(x^(k) + b) + c = F(x^(k)) ,   −a/(x^(k) + b)^2 = F′(x^(k)) ,   2a/(x^(k) + b)^3 = F″(x^(k)) .

x^(k+1) = x^(k) − F(x^(k))/F′(x^(k)) · 1 / ( 1 − ½ F(x^(k)) F″(x^(k)) / F′(x^(k))^2 ) .

Halley’s iteration for  F(x) = 1/(x + 1)^2 + 1/(x + 0.1)^2 − 1 ,  x > 0 ,  and x^(0) = 0:

k    x^(k)    F(x^(k))    x^(k) − x^(k−1)    x^(k) − x∗
1 0.19865959351191 10.90706835180178 -0.19865959351191 -0.84754290138257
2 0.69096314049024 0.94813655914799 -0.49230354697833 -0.35523935440424
3 1.02335017694603 0.03670912956750 -0.33238703645579 -0.02285231794846
4 1.04604398836483 0.00024757037430 -0.02269381141880 -0.00015850652965
5 1.04620248685303 0.00000001255745 -0.00015849848821 -0.00000000804145
Compare with Newton method (8.4.2.1) for the same problem:

k    x^(k)    F(x^(k))    x^(k) − x^(k−1)    x^(k) − x∗


1 0.04995004995005 44.38117504792020 -0.04995004995005 -0.99625244494443
2 0.12455117953073 19.62288236082625 -0.07460112958068 -0.92165131536375
3 0.23476467495811 8.57909346342925 -0.11021349542738 -0.81143781993637
4 0.39254785728080 3.63763326452917 -0.15778318232269 -0.65365463761368
5 0.60067545233191 1.42717892023773 -0.20812759505112 -0.44552704256257
6 0.82714994286833 0.46286007749125 -0.22647449053641 -0.21905255202615
7 0.99028203077844 0.09369191826377 -0.16313208791011 -0.05592046411604
8 1.04242438221432 0.00592723560279 -0.05214235143588 -0.00377811268016
9 1.04618505691071 0.00002723158211 -0.00376067469639 -0.00001743798377
10 1.04620249452271 0.00000000058056 -0.00001743761199 -0.00000000037178
Note that Halley’s iteration is superior in this case, since F is a rational function.

! Newton method converges more slowly, but also needs less effort per step (→ Section 8.4.3) y

§8.4.2.18 (Preconditioning of Newton’s method) In the previous example Newton’s method performed rather poorly. Often its convergence can be boosted by converting the non-linear equation to an equivalent one (that is, one with the same solutions) for another function g, which is “closer to a linear function”:

Assume that (locally) F ≈ F̂, where F̂ is (locally) invertible with an inverse F̂^{−1} that can be evaluated with little effort. Then

g(x) := F̂^{−1}(F(x)) ≈ x .

Then apply Newton’s method to G(x) := g(x) − F̂^{−1}(0), using the formula for the derivative of the inverse of a function:

d/dy (F̂^{−1})(y) = 1 / F̂′(F̂^{−1}(y))   ⇒   g′(x) = F′(x) / F̂′(g(x)) .

This results in the Newton iteration

x^(k+1) = x^(k) − G(x^(k)) F̂′(g(x^(k))) / F′(x^(k)) ,   (8.4.2.19)
g(x^(k)) = F̂^{−1}(F(x^(k))) ,   G(x^(k)) = g(x^(k)) − y0 ,   y0 := F̂^{−1}(0) .

Since G is “almost linear”, this Newton iteration can be expected to enjoy quadratic convergence for initial guesses x^(0) from a large set. A good initial guess is x^(0) := F̂^{−1}(0). y

EXAMPLE 8.4.2.20 (Preconditioned Newton method) As in Ex. 8.4.2.17 we consider

F(x) = 1/(x + 1)^2 + 1/(x + 0.1)^2 − 1 ,   x > 0 ,

and try to find its zeros.

Observation: F(x) + 1 ≈ 2 x^{−2} for x ≫ 1, and so g(x) := 1/√(F(x) + 1) is “almost” linear for x ≫ 1.

[Plot: graphs of F(x) and g(x) on [0, 4].]

Idea: instead of F(x) = 0 tackle g(x) = 1 with Newton’s method (8.4.2.1).

x^(k+1) = x^(k) − (g(x^(k)) − 1) / g′(x^(k)) = x^(k) + ( 1/√(F(x^(k)) + 1) − 1 ) · 2 (F(x^(k)) + 1)^{3/2} / F′(x^(k))
        = x^(k) + 2 (F(x^(k)) + 1) (1 − √(F(x^(k)) + 1)) / F′(x^(k)) .

Convergence recorded for x^(0) = 0:

k    x^(k)    F(x^(k))    x^(k) − x^(k−1)    x^(k) − x∗


1 0.91312431341979 0.24747993091128 0.91312431341979 -0.13307818147469
2 1.04517022155323 0.00161402574513 0.13204590813344 -0.00103227334125
3 1.04620244004116 0.00000008565847 0.00103221848793 -0.00000005485332
4 1.04620249489448 0.00000000000000 0.00000005485332 -0.00000000000000
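The iterates of this example can be reproduced with the short sketch below. It is not one of the lecture codes; F and F′ are spelled out explicitly and the update formula derived above is applied verbatim.

C++ code (illustrative sketch): preconditioned Newton iteration of Ex. 8.4.2.20

#include <cmath>
#include <cstdio>

int main() {
  auto F = [](double x) {
    return 1.0 / ((x + 1.0) * (x + 1.0)) + 1.0 / ((x + 0.1) * (x + 0.1)) - 1.0;
  };
  auto dF = [](double x) {
    return -2.0 / std::pow(x + 1.0, 3) - 2.0 / std::pow(x + 0.1, 3);
  };
  double x = 0.0;  // initial guess x^(0)
  for (int k = 1; k <= 6; ++k) {
    const double Fx = F(x);
    // update from the derivation above: x + 2(F+1)(1 - sqrt(F+1))/F'
    x = x + 2.0 * (Fx + 1.0) * (1.0 - std::sqrt(Fx + 1.0)) / dF(x);
    std::printf("%d  x = %.15f  F(x) = %.15e\n", k, x, F(x));
  }
  return 0;
}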


For zero finding there is a wealth of iterative methods that offer a higher order of convergence. One class is discussed next.

§8.4.2.21 (Modified Newton methods) Taking the cue from the iteration function of Newton’s method (8.4.2.1), we extend it by introducing an extra function H:

new fixed point iteration:   Φ(x) = x − H(x) F(x)/F′(x)   with “proper” H : I ↦ R .

Still, every zero of F is a fixed point of this Φ, that is, the fixed point iteration is still consistent (→ Def. 8.3.1.1).

Aim: find H such that the method is of p-th order. The main tool is Lemma 8.3.2.15, which tells us that ensuring Φ^(ℓ)(x∗) = 0 for 1 ≤ ℓ ≤ p − 1 guarantees local convergence of order p.

Assume: F smooth “enough” and ∃ x∗ ∈ I: F(x∗) = 0, F′(x∗) ≠ 0. Then we can compute the derivatives of Φ appealing to the product rule and quotient rule for derivatives:

Φ = x − uH ,   Φ′ = 1 − u′H − uH′ ,   Φ″ = −u″H − 2u′H′ − uH″ ,

with   u = F/F′   ⇒   u′ = 1 − F F″/(F′)^2 ,   u″ = −F″/F′ + 2 F (F″)^2/(F′)^3 − F F‴/(F′)^2 .

F(x∗) = 0   ➤   u(x∗) = 0 ,  u′(x∗) = 1 ,  u″(x∗) = −F″(x∗)/F′(x∗) .

Φ′(x∗) = 1 − H(x∗) ,   Φ″(x∗) = F″(x∗)/F′(x∗) H(x∗) − 2 H′(x∗) .   (8.4.2.22)

Lemma 8.3.2.15 ➢ necessary conditions for local convergence of order p:

p = 2 (quadratic convergence):  H(x∗) = 1 ,
p = 3 (cubic convergence):      H(x∗) = 1  ∧  H′(x∗) = ½ F″(x∗)/F′(x∗) .

Trial expression:  H(x) = G(1 − u′(x))  with “appropriate” G

➣ fixed point iteration

x^(k+1) = x^(k) − F(x^(k))/F′(x^(k)) · G( F(x^(k)) F″(x^(k)) / (F′(x^(k)))^2 ) .   (8.4.2.23)

Lemma 8.4.2.24. Cubic convergence of modified Newton methods

If F ∈ C^2(I), F(x∗) = 0, F′(x∗) ≠ 0, G ∈ C^2(U) in a neighbourhood U of 0, G(0) = 1, G′(0) = ½, then the fixed point iteration (8.4.2.23) converges locally cubically to x∗.

Proof. We apply Lemma 8.3.2.15, which tells us that both derivatives from (8.4.2.22) have to vanish. Using the definition of H we find

H(x∗) = G(0) ,   H′(x∗) = −G′(0) u″(x∗) = G′(0) F″(x∗)/F′(x∗) .

Plugging these expressions into (8.4.2.22) finishes the proof. ✷ y

EXPERIMENT 8.4.2.25 (Application of modified Newton methods)

• G(t) = 1/(1 − ½ t)         ➡ Halley’s iteration (→ Ex. 8.4.2.17)
• G(t) = 2/(1 + √(1 − 2t))   ➡ Euler’s iteration
• G(t) = 1 + ½ t             ➡ quadratic inverse interpolation

Numerical experiment: F(x) = x e^x − 1, x^(0) = 5. Recorded errors e^(k) := x^(k) − x∗:

k    Halley              Euler               Quad. Inv.
1    2.81548211105635    3.57571385244736    2.03843730027891
2    1.37597082614957    2.76924150041340    1.02137913293045
3    0.34002908011728    1.95675490333756    0.28835890388161
4    0.00951600547085    1.25252187565405    0.01497518178983
5    0.00000024995484    0.51609312477451    0.00000315361454
6                        0.14709716035310
7                        0.00109463314926
8                        0.00000000107549
y
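A generic driver for the modified Newton iteration (8.4.2.23) is easy to write. The sketch below is not a lecture code; for illustration it is exercised only with Halley’s choice G(t) = 1/(1 − t/2), applied to F(x) = x e^x − 1 with the initial guess x^(0) = 5 used in the experiment.

C++ code (illustrative sketch): generic modified Newton iteration (8.4.2.23)

#include <cmath>
#include <cstdio>

// Modified Newton iteration (8.4.2.23) with user-supplied F, F', F'' and G.
template <typename FnF, typename FnDF, typename FnD2F, typename FnG>
double modnewton(FnF F, FnDF dF, FnD2F d2F, FnG G, double x, double rtol, int maxit) {
  for (int k = 0; k < maxit; ++k) {
    const double t = F(x) * d2F(x) / (dF(x) * dF(x));
    const double s = F(x) / dF(x) * G(t);  // modified Newton correction
    x -= s;
    if (std::abs(s) <= rtol * std::abs(x)) break;
  }
  return x;
}

int main() {
  auto F = [](double x) { return x * std::exp(x) - 1.0; };
  auto dF = [](double x) { return std::exp(x) * (1.0 + x); };
  auto d2F = [](double x) { return std::exp(x) * (2.0 + x); };
  auto G_halley = [](double t) { return 1.0 / (1.0 - 0.5 * t); };
  const double z = modnewton(F, dF, d2F, G_halley, 5.0, 1e-14, 20);
  std::printf("zero ~= %.15f, F(zero) = %.3e\n", z, F(z));
  return 0;
}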
Review question(s) 8.4.2.26 (Special 1-point iterative methods for root finding)
(Q8.4.2.26.A) The generic iteration for a modified Newton method for solving the scalar zero-finding problem F(x) = 0 is

x^(k+1) = x^(k) − F(x^(k))/F′(x^(k)) · G( F(x^(k)) F″(x^(k)) / (F′(x^(k)))^2 ) ,   (8.4.2.23)

with a C^2-function G : U ⊂ R → R defined in a neighborhood of 0.

Give the explicit formula for the iteration (8.4.2.23) for F(x) = x e^x − 1 and G(t) = (1 − ½ t)^{−1} (Halley’s iteration).

8.4.2.3 Multi-Point Methods

Video tutorial for Section 8.4.2.3 "Multi-Point Methods": (12 minutes) Download link,
tablet notes

Supplementary literature. The secant method is presented in [Han02, Sect. 18.2], [DR08,

Sect. 5.5.3], [AG11, Sect. 3.4].

Construction of multi-point iterations in 1D

Idea: Replace F with an interpolating polynomial, producing interpolatory model function methods.


§8.4.2.28 (The secant method)

The secant method is the simplest representative of model function multi-point methods. The figure illustrates the geometric idea underlying this 2-point method for zero finding: x^(k+1) is the zero of the secant (red line) connecting the points (x^(k−1), F(x^(k−1))) and (x^(k), F(x^(k))) on the graph of F.

[Fig. 297: graph of F with the secant through (x^(k−1), F(x^(k−1))) and (x^(k), F(x^(k))); its zero is x^(k+1).]

The secant line is the graph of the function

s(x) = F(x^(k)) + ( F(x^(k)) − F(x^(k−1)) ) / ( x^(k) − x^(k−1) ) · (x − x^(k)) ,   (8.4.2.29)

x^(k+1) = x^(k) − F(x^(k)) (x^(k) − x^(k−1)) / ( F(x^(k)) − F(x^(k−1)) ) .   (8.4.2.30)

The following C++ code snippet demonstrates the implementation of the abstract secant method for finding
the zeros of a function passed through the functor F.

C++ code 8.4.2.31: Secant method for 1D non-linear equation ➺ GITLAB


2 // Secand method for solving F ( x ) = 0 for F : D ⊂ R → R,
3 // initial guesses x0 , x1 ,
4 // tolerances atol (absolute), rtol (relative)
5 template <typename Func>
6 double secant ( double x0 , double x1 , Func &&F , double r t o l , double a t o l ,
7 unsigned i n t maxIt ) {
8 double f o = F ( x0 ) ;
9 f o r ( unsigned i n t i = 0 ; i < maxIt ; ++ i ) {
10 const double f n = F ( x1 ) ;
11 const double s = f n * ( x1 − x0 ) / ( f n − f o ) ; // secant correction
12 x0 = x1 ;
13 x1 = x1 − s ;
14 // correction based termination (relative and absolute)
15 i f ( abs ( s ) < max ( a t o l , r t o l * min ( abs ( x0 ) , abs ( x1 ) ) ) ) {
16 r e t u r n x1 ;
17 }
18 fo = fn ;
19 }
20 r e t u r n x1 ;
21 }

This code demonstrates several important features of the secant method:


• Only one function evaluation per step
• no derivatives required!
• 2-point method: two initial guesses needed

Remember: F ( x ) may only be available as output of a (complicated) procedure. In this case it is difficult
to find a procedure that evaluates F ′ ( x ). Thus the significance of methods that do not involve evaluations
of derivatives. y


EXPERIMENT 8.4.2.32 (Convergence of secant method) We empirically examine the convergence of the secant method from Code 8.4.2.31 for a model problem: F(x) = x e^x − 1, using the secant method with initial guesses x^(0) = 0, x^(1) = 5.

k    x^(k)    F(x^(k))    e^(k) := x^(k) − x∗    ( log|e^(k+1)| − log|e^(k)| ) / ( log|e^(k)| − log|e^(k−1)| )
2 0.00673794699909 -0.99321649977589 -0.56040534341070
3 0.01342122983571 -0.98639742654892 -0.55372206057408 24.43308649757745
4 0.98017620833821 1.61209684919288 0.41303291792843 2.70802321457994
5 0.38040476787948 -0.44351476841567 -0.18673852253030 1.48753625853887
6 0.50981028847430 -0.15117846201565 -0.05733300193548 1.51452723840131
7 0.57673091089295 0.02670169957932 0.00958762048317 1.70075240166256
8 0.56668541543431 -0.00126473620459 -0.00045787497547 1.59458505614449
9 0.56713970649585 -0.00000990312376 -0.00000358391394 1.62641838319117
10 0.56714329175406 0.00000000371452 0.00000000134427
11 0.56714329040978 -0.00000000000001 -0.00000000000000
Rem. 8.2.2.12: the rightmost column of the table provides an estimate for the order of convergence →
Def. 8.2.2.10. For further explanations see Rem. 8.2.2.12.

A startling observation: the method seems to have a fractional (!) order of convergence, see Def. 8.2.2.10.
y

Remark 8.4.2.33 (Fractional order of convergence of secant method) Indeed, a fractional order of
convergence can be proved for the secant method, see [Han02, Sect. 18.2]. Here we give an asymptotic
argument that holds, if the iterates are already very close to the zero x ∗ of F.

We can write the secant method in the form (8.2.1.5)

x^(k+1) = Φ(x^(k), x^(k−1))   with   Φ(x, y) := x − F(x)(x − y) / ( F(x) − F(y) ) .   (8.4.2.34)

Using Φ we find a recursion for the iteration error e^(k) := x^(k) − x∗:

e^(k+1) = Φ(x∗ + e^(k), x∗ + e^(k−1)) − x∗ .   (8.4.2.35)

Thanks to the asymptotic perspective we may assume that |e^(k)|, |e^(k−1)| ≪ 1 so that we can rely on the two-dimensional Taylor expansion around (x∗, x∗), cf. [Str09, Satz 7.5.2]:

Φ(x∗ + h, x∗ + k) = Φ(x∗, x∗) + ∂Φ/∂x(x∗, x∗) h + ∂Φ/∂y(x∗, x∗) k
  + ½ ∂²Φ/∂x²(x∗, x∗) h² + ∂²Φ/∂x∂y(x∗, x∗) hk + ½ ∂²Φ/∂y²(x∗, x∗) k² + R(x∗, h, k) ,   (8.4.2.36)

with |R| ≤ C (h³ + h²k + hk² + k³) .

Computations invoking the quotient rule and product rule and using F(x∗) = 0 show

Φ(x∗, x∗) = x∗ ,   ∂Φ/∂x(x∗, x∗) = ∂Φ/∂y(x∗, x∗) = ½ ∂²Φ/∂x²(x∗, x∗) = ½ ∂²Φ/∂y²(x∗, x∗) = 0 .
We may also use MAPLE to find the Taylor expansion (assuming F sufficiently smooth):
> Phi := (x,y) -> x-F(x)*(x-y)/(F(x)-F(y));
> F(s) := 0;


> e2 = normal(mtaylor(Phi(s+e1,s+e0)-s,[e0,e1],4));

➣ truncated error propagation formula (products of three or more error terms ignored):

e^(k+1) ≐ ½ F″(x∗)/F′(x∗) e^(k) e^(k−1) = C e^(k) e^(k−1) .   (8.4.2.37)

How can we deduce the order of convergence from this recursion formula? We try e^(k) = K (e^(k−1))^p, inspired by the estimate in Def. 8.2.2.10:

⇒   e^(k+1) = K^{p+1} (e^(k−1))^{p²}
⇒   (e^(k−1))^{p² − p − 1} = K^{−p} C   ⇒   p² − p − 1 = 0   ⇒   p = ½ (1 ± √5) .

The second implication is clear after realizing that the equation has to be satisfied for all k and that the right-hand side does not depend on k. As e^(k) → 0 for k → ∞ we get the order of convergence p = ½ (1 + √5) ≈ 1.62 (see Exp. 8.4.2.32!). y

EXAMPLE 8.4.2.38 (Local convergence of the secant method)

Model problem: find the zero of F(x) = arctan(x).

[Fig. 298: pairs of initial guesses in the (x^(0), x^(1))-plane; a dot · =ˆ the secant method converges for this pair (x^(0), x^(1)) ∈ R_+^2 of initial guesses.]

We observe that the secant method will converge only for initial guesses sufficiently close to 0 ➣ local convergence → Def. 8.2.1.10. y

§8.4.2.39 (Inverse interpolation) Another class of multi-point methods: inverse interpolation.

Assume: F : I ⊂ R ↦ R is one-to-one (monotone). Then F(x∗) = 0 ⇒ F^{−1}(0) = x∗.

Interpolate F^{−1} by a polynomial p of degree m − 1 determined by

p(F(x^(k−j))) = x^(k−j) ,   j = 0, . . . , m − 1 .

New approximate zero: x^(k+1) := p(0).


[Fig. 299: the graph of F^{−1} is obtained by reflecting the graph of F at the angular bisector; F(x∗) = 0 ⇔ F^{−1}(0) = x∗.]

Case m = 2 (2-point method) ➢ secant method. The interpolation polynomial is a line. In this case we do not get a new method, because the inverse function of a linear function (polynomial of degree 1) is again a polynomial of degree 1.

[Fig. 300: graphs of F and F^{−1} near x∗ for the case m = 2.]

Case m = 3: quadratic inverse interpolation, a 3-point method, see [Mol04, Sect. 4.5].

We interpolate the points (F(x^(k)), x^(k)), (F(x^(k−1)), x^(k−1)), (F(x^(k−2)), x^(k−2)) with a parabola (polynomial of degree 2). Note the importance of monotonicity of F, which ensures that F(x^(k)), F(x^(k−1)), F(x^(k−2)) are mutually different.

MAPLE code:  p := x-> a*x^2+b*x+c;
             solve({p(f0)=x0,p(f1)=x1,p(f2)=x2},{a,b,c});
             assign(%); p(0);

x^(k+1) = ( F0²(F1 x2 − F2 x1) + F1²(F2 x0 − F0 x2) + F2²(F0 x1 − F1 x0) ) / ( F0²(F1 − F2) + F1²(F2 − F0) + F2²(F0 − F1) ) .

( F0 := F(x^(k−2)), F1 := F(x^(k−1)), F2 := F(x^(k)), x0 := x^(k−2), x1 := x^(k−1), x2 := x^(k) )

EXPERIMENT 8.4.2.40 (Convergence of quadratic inverse interpolation) We test the method for the model problem/initial guesses F(x) := x e^x − 1 = 0 , x^(0) = 0 , x^(1) = 2.5 , x^(2) = 5.

k    x^(k)    F(x^(k))    e^(k) := x^(k) − x∗    ( log|e^(k+1)| − log|e^(k)| ) / ( log|e^(k)| − log|e^(k−1)| )
3 0.08520390058175 -0.90721814294134 -0.48193938982803
4 0.16009252622586 -0.81211229637354 -0.40705076418392 3.33791154378839
5 0.79879381816390 0.77560534067946 0.23165052775411 2.28740488912208
6 0.63094636752843 0.18579323999999 0.06380307711864 1.82494667289715
7 0.56107750991028 -0.01667806436181 -0.00606578049951 1.87323264214217
8 0.56706941033107 -0.00020413476766 -0.00007388007872 1.79832936980454
9 0.56714331707092 0.00000007367067 0.00000002666114 1.84841261527097
10 0.56714329040980 0.00000000000003 0.00000000000001
Also in this case the numerical experiment hints at a fractional rate of convergence p ≈ 1.8, as in the case
of the secant method, see Rem. 8.4.2.33. y
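The iterates of Exp. 8.4.2.40 can be generated with the closed-form update derived above. The following sketch is not a lecture code; it simply cycles the three most recent iterates.

C++ code (illustrative sketch): quadratic inverse interpolation for F(x) = x e^x − 1

#include <cmath>
#include <cstdio>

int main() {
  auto F = [](double x) { return x * std::exp(x) - 1.0; };
  double x0 = 0.0, x1 = 2.5, x2 = 5.0;  // three initial guesses
  for (int k = 3; k <= 10; ++k) {
    const double F0 = F(x0), F1 = F(x1), F2 = F(x2);
    const double num = F0 * F0 * (F1 * x2 - F2 * x1) + F1 * F1 * (F2 * x0 - F0 * x2) +
                       F2 * F2 * (F0 * x1 - F1 * x0);
    const double den = F0 * F0 * (F1 - F2) + F1 * F1 * (F2 - F0) + F2 * F2 * (F0 - F1);
    const double x_new = num / den;  // p(0) for the parabola interpolating F^{-1}
    x0 = x1; x1 = x2; x2 = x_new;    // cycle the three most recent iterates
    std::printf("%2d  x = %.15f  F(x) = % .3e\n", k, x2, F(x2));
  }
  return 0;
}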

Review question(s) 8.4.2.41 (Multi-point methods)


(Q8.4.2.41.A) The inverse quadratic interpolation method for solving F(x) = 0 reads

x^(k+1) = ( F0²(F1 x2 − F2 x1) + F1²(F2 x0 − F0 x2) + F2²(F0 x1 − F1 x0) ) / ( F0²(F1 − F2) + F1²(F2 − F0) + F2²(F0 − F1) ) .

( F0 := F(x^(k−2)), F1 := F(x^(k−1)), F2 := F(x^(k)), x0 := x^(k−2), x1 := x^(k−1), x2 := x^(k) )
Which evaluations and computations have to be carried out in a single step?
(Q8.4.2.41.B) We apply the secant method with initial guesses x (0) , x (1) ∈ [ a, b] to find the zero of a
function f : [ a, b] → R with the following properties:
• f is increasing,
• f is convex,
• f ( a) < 0 and f (b) > 0.
Give a “geometric proof” that the iteration will converge.

8.4.3 Asymptotic Efficiency of Iterative Methods for Zero Finding

Video tutorial for Section 8.4.3 "Asymptotic Efficiency of Iterative Methods for Zero Finding":
(10 minutes) Download link, tablet notes

§8.4.3.1 (Efficiency) Efficiency is measured by forming the ratio of gain and the effort required to achieve
it:
Efficiency = gain / effort .
For iterative methods for solving F (x) = 0, F : D ⊂ R n → R n , this means the following:

Efficiency of an iterative method (for solving F (x) = 0)   ↔   computational effort to reach a prescribed
number of correct digits in the result.

Ingredient ➊: Computational effort W per step

e.g.,   W ≈ #{evaluations of F}/step + n · #{evaluations of F′}/step + · · · .


Ingredient ➋: Number of steps k = k (ρ) to achieve a relative reduction of the error by a factor of ρ (=
gain),

‖e^(k)‖ ≤ ρ ‖e^(0)‖ ,   0 < ρ < 1 prescribed.   (8.4.3.2)

Here, e(k) stands for the iteration error in the k-th step.
Notice: | log ρ| ↔ Gain in no. of significant digits of x (k) [ log = log10 ]

Measure for efficiency:   Efficiency := (no. of digits gained) / (total work required) = |log ρ| / ( k(ρ) · W )   (8.4.3.3)
y
§8.4.3.4 (Minimal number of iteration steps) Let us consider an iterative method generating a sequence
of approximate solutions that converges with order p ≥ 1 (→ Def. 8.2.2.10). From its error recursion
we want to derive an estimate for the minimal number k (ρ) ∈ N of iteration steps required to achieve
(8.4.3.2).
➊ Case p = 1, linearly convergent iteration:
Definition 8.2.2.1. Linear convergence

A sequence x(k) , k = 0, 1, 2, . . ., in R n converges linearly to x∗ ∈ R n ,

∃ 0 < L < 1:   ‖x^(k+1) − x^∗‖ ≤ L ‖x^(k) − x^∗‖   ∀k ∈ N₀ .

This implies the recursion and estimate for the error norms:

‖e^(k+1)‖ ≤ C ‖e^(k)‖ with a constant 0 ≤ C < 1 ,   (8.4.3.5)

‖e^(k)‖ ≤ C^k ‖e^(0)‖   ∀k ∈ N₀ .   (8.4.3.6)

‖e^(k)‖ ≤ ρ ‖e^(0)‖  takes  k(ρ) ≥ log ρ / log C  steps .   (8.4.3.7)

➋ Case p > 1, order- p convergence:


Definition 8.2.2.10. order of convergence

A convergent sequence x(k) , k = 0, 1, 2, . . ., in R n with limit x∗ ∈ R n converges with order


p, p ≥ 1, if
∃ C > 0:   ‖x^(k+1) − x^∗‖ ≤ C ‖x^(k) − x^∗‖^p   ∀k ∈ N₀ ,   (8.4.3.8)

and, in addition, C < 1 in the case p = 1 (linear convergence → Def. 8.2.2.1).


For p > 1 this results in the error estimate

‖e^(k)‖ ≤ C ‖e^(k−1)‖^p ≤ C^(1+p) ‖e^(k−2)‖^(p²) ≤ C^(1+p+p²) ‖e^(k−3)‖^(p³) ≤ ...
        ≤ C^(1+p+p²+p³+···+p^(k−1)) ‖e^(0)‖^(p^k) = ( C^(1/(p−1)) ‖e^(0)‖ )^(p^k − 1) ‖e^(0)‖ ,   k ∈ N ,

for some constant C > 0. Here, we use the geometric sum formula

1 + p + p² + p³ + · · · + p^(k−1) = ( p^k − 1 ) / ( p − 1 ) ,   k ∈ N .

We make the assumption that this estimate predicts convergence:

‖e^(0)‖ so small that  C^(1/(p−1)) ‖e^(0)‖ < 1 .   (8.4.3.9)

‖e^(k)‖ ≤ ρ ‖e^(0)‖  requires  p^k ≥ 1 + log ρ / ( (log C)/(p−1) + log(‖e^(0)‖) ) .

This permits us to estimate the minimal number of steps we have to execute to guarantee a reduction
of the error by a factor of ρ:

k(ρ) ≥ log( 1 + log ρ / log L₀ ) / log p ,   L₀ := C^(1/(p−1)) ‖e^(0)‖ < 1 by (8.4.3.9) .   (8.4.3.10)

Summing up, (8.4.3.7) and (8.4.3.10) give us explicit formulas for k (ρ) as a function of ρ. y
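The formulas (8.4.3.7) and (8.4.3.10) are straightforward to evaluate numerically, as in the following small C++ sketch; the helper function k_rho is our own name, not part of the lecture codes, and it assumes C < 1 for p = 1 and (8.4.3.9) for p > 1.

C++ code (illustrative sketch): evaluating (8.4.3.7) and (8.4.3.10)
#include <cmath>
double k_rho(double p, double C, double e0, double rho) {
  // minimal number of steps to reduce the error norm by the factor rho
  if (p == 1.0) {
    return std::log(rho) / std::log(C); // (8.4.3.7), requires 0 < C < 1
  }
  const double L0 = std::pow(C, 1.0 / (p - 1.0)) * e0; // L0 < 1 by (8.4.3.9)
  return std::log(1.0 + std::log(rho) / std::log(L0)) / std::log(p); // (8.4.3.10)
}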

§8.4.3.11 (Asymptotic efficiency) Now we adopt an asymptotic perspective and ask for a large reduction
of the error, that is ρ ≪ 1.
If ρ ≪ 1, then (log ρ, log L0 < 0 !)

log( 1 + log ρ / log L₀ ) ≈ log |log ρ| − log |log L₀| ≈ log |log ρ| .

This simplification will be made in the context of asymptotic considerations ρ → 0 below.

asymptotic efficiency for ρ ≪ 1 (➜ |log ρ| → ∞):

Efficiency|_(ρ≪1) =  − (log C)/W ,                              if p = 1 ,
                     (log p)/W · |log ρ| / log(|log ρ|) ,       if p > 1 .        (8.4.3.12)

We conclude that
• when requiring high accuracy, linearly convergent iterations should not be used, because their effi-
ciency does not increase for ρ → 0,
• for a method of order p > 1, the factor (log p)/W offers a gauge for efficiency.


y
EXAMPLE 8.4.3.13 (Efficiency of iterative methods) We “simulate” iterations to explore the quantitative
dependence of efficiency on the order of the methods and the target accuracy.
We choose ‖e^(0)‖ = 0.1, ρ = 10⁻⁸.

The plot in Fig. 301 displays the maximal number of iteration steps according to (8.4.3.10) as a function
of the order p ∈ [1, 2.5], for the constants C = 0.5, 1.0, 1.5, 2.0.

Higher-order methods require substantially fewer steps than low-order methods.

(Fig. 301: max(no. of iterations) for ρ = 1.0e−08 versus p.)

We compare
• Newton’s method from Section 8.4.2.1 and the
• secant method, see § 8.4.2.28,
in terms of the number of steps required for a prescribed guaranteed error reduction, assuming C = 1 in
both cases and ‖e^(0)‖ = 0.1.

We observe that Newton’s method requires only marginally fewer steps than the secant method.

(Fig. 302: no. of iterations versus −log₁₀(ρ) for Newton’s method and the secant method.)
y

§8.4.3.14 (Comparison of Newton’s method and of the secant method) We draw conclusions from the
discussion above and (8.4.3.12):

W_Newton = 2 W_secant ,
p_Newton = 2 ,  p_secant ≈ 1.62      ➣      (log p_Newton)/W_Newton = 0.71 · (log p_secant)/W_secant .

We set the effort for a step of Newton’s method to twice that for a step of the secant method from
Code 8.4.2.31, because we need an additional evaluation of the derivative F ′ in Newton’s method.

➣ The secant method is more efficient than Newton’s method! y

Review question(s) 8.4.3.15 (Asymptotic efficiency of iterative methods)


(Q8.4.3.15.A) How many steps of an iteration that converges linearly with rate L ∈ [0, 1[ are required to
gain 6 digits of accuracy compared to the initial guess?


8.5 Newton’s Method in R n


We return to non-linear systems of n equations with n unknowns, n ∈ N:
for F : D ⊂ R n 7→ R n find x∗ ∈ D: F (x∗ ) = 0.
Throughout this section we assume that F : D ⊂ R^n ↦ R^n is continuously differentiable. Next we
generalize the one-dimensional Newton method introduced in Section 8.4.2.1.

8.5.1 The Newton Iteration


Video tutorial for Section 8.5.1 "The Newton Iteration in R n (I)": (10 minutes) Download link,
tablet notes

Video tutorial for Section 8.5.1 "The Newton Iteration in R n (II)": (15 minutes) Download link,
tablet notes

Recall that the idea underlying the 1D Newton method for generating a sequence ( x^(k) )_(k∈N₀) of
approximate solutions of F ( x ) = 0 is
1. to replace F by its tangent in ( x^(k) , F ( x^(k) ) ),
2. to compute the next iterate x^(k+1) as the zero of that tangent.
This led to the 1D Newton iteration

x^(k+1) := x^(k) − F ( x^(k) ) / F ′( x^(k) ) ,   (8.4.2.1)

well defined provided that F ′( x^(k) ) ≠ 0 for all k.


Remark 8.5.1.1 (Derivative-based local linear approximation of functions) The approximating tangent
of the 1D Newton method for computing a zero of F : I ⊂ R → R is the graph of the function

F̃( x ) := F ( x^(k) ) + F ′( x^(k) )( x − x^(k) ) ,   x ∈ R .   (8.5.1.2)

From another perspective, F̃ is just the Taylor expansion of F around x^(k) truncated after the linear term.
Thus, if F is twice continuously differentiable, F ∈ C²( I ), by Taylor’s formula [Str09, Satz 5.5.1] the tangent
satisfies

| F ( x ) − F̃( x )| = | F ( x ) − F ( x^(k) ) − F ′( x^(k) )( x − x^(k) )| = O(| x − x^(k) |²)   for x → x^(k) ,   (8.5.1.3)

that is, the tangent provides a local quadratic approximation of F around x^(k) . This directly gives us the
local quadratic convergence of the 1D Newton method, recall § 8.4.2.5.
We know an analogous construction for general twice continuously differentiable F : D ⊂ R n 7→ R n . We
define the affine linear function

F̃(x) := F (x^(k) ) + D F (x^(k) )(x − x^(k) ) ,   x ∈ R^n ,   (8.5.1.4)



where D F (z) ∈ R n,n is the Jacobian of F = [ F1 , . . . , Fn ] in z ∈ D:
 ∂F ∂F1 ∂F1

1
∂x1 (z) ∂x2 ( z ) ··· ··· ∂xn ( z )
" #n  ∂F2 ∂F2 
∂Fi  (z) ( z ) 
D F (z) = (z) =

∂x1
..
∂xn
..

. (8.3.2.8)
∂x j  . . 
i,j=1
∂Fn ∂Fn ∂Fn
∂x1 ( z ) ∂x2 ( z ) ··· ··· ∂xn ( z )

This is the multi-dimensional generalization of a truncated Taylor expansion. From analysis we know that
also in this case we have a local affine linear approximation by F̃ with an error that quadratically depends
on the distance to x^(k):

‖F (x) − F̃(x)‖ = ‖F (x) − F (x^(k) ) − D F (x^(k) )(x − x^(k) )‖ = O( ‖x − x^(k)‖² )   for x → x^(k) .   (8.5.1.5)
y
Idea (→ Section 8.4.2.1): local linearization:
Given x(k) ∈ D ➣ x(k+1) as zero of affine linear model function

F (x) ≈ F̃ₖ(x) := F (x^(k) ) + D F (x^(k) )(x − x^(k) ) ,

D F (x) ∈ R^(n,n) = Jacobian,   D F (x) := [ ∂Fⱼ/∂xₖ (x) ]_(j,k=1)^n .

Newton iteration: (generalizes (8.4.2.1) to n > 1)

x ( k +1) : = x ( k ) − D F ( x ( k ) ) −1 F ( x ( k ) ) , [ if D F (x(k) ) regular ] (8.5.1.6)

Terminology: − D F (x(k) )−1 F (x(k) ) is called the Newton correction


EXAMPLE 8.5.1.7 (Visualization: local affine linear approximation for n = 2) We give a graph-
ical illustration of the idea of Newton’s method for n = 2, where we seek x∗ ∈ R2 with F (x∗ ) = 0,
F (x) = [ F1 (x), F2 (x)]⊤ .
Sought: the intersection point x^∗ of the curves F₁(x) = 0 and F₂(x) = 0.

Idea: Choose x^(k+1) as the intersection of the two straight lines (dashed in Fig. 303)

L₁ := { x ∈ R² : F̃_(k,1)(x) = 0 } ,   L₂ := { x ∈ R² : F̃_(k,2)(x) = 0 } ,

which are the zero sets of the components of the affine linear model function F̃ₖ. They are
approximations of the zero curves of the components of F.

(Fig. 303: zero curves of F₁, F₂, their linearizations, the iterates x^(k), x^(k+1), and the solution x^∗.)
y

§8.5.1.8 (Generic Newton method in C++) The following C++ implementation of the skeleton for New-
ton’s method uses a correction based a posteriori termination criterion for the Newton iteration. It stops the


iteration if the relative size of the Newton correction drops below the prescribed relative tolerance rtol.
If x^∗ ≈ 0, also the absolute size of the Newton correction has to be tested against an absolute tolerance
atol in order to avoid non-termination despite convergence of the iteration.

C++11-code 8.5.1.9: Newton’s method in C++ ➺ GITLAB


template <typename FuncType, typename JacType, typename VecType>
VecType newton(FuncType &&F, JacType &&DFinv, VecType x, const double rtol,
               const double atol) {
  // Note that the vector x passes both the initial guess and also
  // contains the iterates
  VecType s(x.size()); // Vector for Newton corrections
  // Main loop
  do {
    s = DFinv(x, F(x)); // compute Newton correction
    x -= s;             // compute next iterate
  }
  // correction based termination (relative and absolute)
  while ((s.norm() > rtol * x.norm()) && (s.norm() > atol));
  return x;
}

☞ Objects of type FuncType must feature

     Eigen::VectorXd operator()(const Eigen::VectorXd &x);

  that evaluates F (x) (x ↔ x).

☞ Objects of type JacType must provide a method

     Eigen::VectorXd operator()(const Eigen::VectorXd &x, const Eigen::VectorXd &f);

  that computes the Newton correction, that is, it returns the solution of a linear system with system
  matrix D F (x) (x ↔ x) and right-hand side f ↔ f.
☞ The function returns the computed approximate solution of the non-linear system.

The next code demonstrates the invocation of newton for a 2 × 2 non-linear system from a code relying
on E IGEN. It also demonstrates the use of fixed size eigen matrices and vectors.

C++11-code 8.5.1.10: Calling newton with E IGEN data types ➺ GITLAB


void newton2Ddriver() {
  // Function F defined through lambda function
  auto F = [](const Eigen::Vector2d &x) {
    Eigen::Vector2d z;
    const double x1 = x(0);
    const double x2 = x(1);
    z << x1 * x1 - 2 * x1 - x2 + 1, x1 * x1 + x2 * x2 - 1;
    return z;
  };
  // Lambda function for the computation of the Newton correction
  auto DFinv = [](const Eigen::Vector2d &x, const Eigen::Vector2d &f) {
    Eigen::Matrix2d J;
    const double x1 = x(0);
    const double x2 = x(1);
    // Jacobian of F
    J << 2 * x1 - 2, -1, 2 * x1, 2 * x2;
    // Solve 2x2 linear system of equations
    Eigen::Vector2d s = J.lu().solve(f);
    return s;
  };
  // initial guess
  const Eigen::Vector2d x0(2., 3.);

  // Invoke Newton's method
  const Eigen::Vector2d x = newton(F, DFinv, x0, 1E-6, 1E-8);
  std::cout << "||F(x)|| = " << F(x).norm() << std::endl;
}

New aspect for n ≫ 1 (compared to n = 1-dimensional case, Section 8.4.2.1):


!  Computation of the Newton correction may be expensive
   (because it involves the solution of an LSE, cf. Thm. 2.5.0.2)

Remark 8.5.1.11 (Affine invariance of Newton method) An important property of the Newton iteration
(8.5.1.6) is its affine invariance → [Deu11, Sect .1.2.2]

Consider GA (x) := AF (x) with regular A ∈ R n,n so that F (x∗ ) = 0 ⇔ GA (x∗ ) = 0 .

Affine invariance: The Newton iterations for GA (x) = 0 are the same for all regular A !
This is confirmed by a simple computation:

D GA (x) = A D F (x) ⇒ D GA (x)−1 GA (x) = D F (x)−1 A−1 AF (x) = D F (x)−1 F (x) .

Why is this an interesting property? Affine invariance should be used as a guideline for
• convergence theory for Newton’s method: assumptions and results should be affine invariant, too.
• modifying and extending Newton’s method: resulting schemes should preserve affine invariance.
In particular, termination criteria for Newton’s method should also be affine invariant in the sense that,
when applied for GA they STOP the iteration at exactly the same step for any choice of the regular matrix
A. y
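Affine invariance can also be checked numerically. The following sketch is our own little experiment (with an arbitrarily chosen regular matrix A): it runs plain Newton steps for F and for GA = AF on the 2 × 2 system from Code 8.5.1.10 and prints the difference of the iterates, which should vanish up to roundoff.

C++ code (illustrative sketch): numerical check of affine invariance
#include <Eigen/Dense>
#include <iostream>
void affineInvarianceCheck() {
  auto F = [](const Eigen::Vector2d &x) {
    return Eigen::Vector2d(x(0) * x(0) - 2 * x(0) - x(1) + 1,
                           x(0) * x(0) + x(1) * x(1) - 1);
  };
  auto DF = [](const Eigen::Vector2d &x) {
    Eigen::Matrix2d J;
    J << 2 * x(0) - 2, -1, 2 * x(0), 2 * x(1);
    return J;
  };
  Eigen::Matrix2d A;
  A << 3, 1, 1, 2; // some regular matrix
  Eigen::Vector2d x1(2., 3.), x2(2., 3.);
  for (int k = 0; k < 5; ++k) {
    x1 -= DF(x1).lu().solve(F(x1));             // Newton step for F
    x2 -= (A * DF(x2)).lu().solve(A * F(x2));   // Newton step for G_A = A*F
    std::cout << (x1 - x2).norm() << std::endl; // ~ 0 up to roundoff
  }
}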

The function F : R n → R n defining the non-linear system of equations may be given in various formats,
as explicit expression or rather implicitly. In most cases, D F has to be computed symbolically in order to
obtain concrete formulas for the Newton iteration. We now learn how these symbolic computations can be
carried out harnessing advanced techniques of multi-variate calculus.

§8.5.1.12 (Derivatives and their role in Newton’s method) The reader will probably agree that the
derivative of a function F : I ⊂ R → R in x ∈ I is a number F ′ ( x ) ∈ R, the derivative of a function
F : D ⊂ R n → R m , in x ∈ D a matrix D F (x) ∈ R m,n . However, the nature of a derivative in a point is


that of a linear mapping that approximates F locally up to second order:

Definition 8.5.1.13. Derivative of functions between vector spaces

Let V , W be finite dimensional vector spaces and F : D ⊂ V 7→ W a sufficiently smooth mapping.


The derivative (differential) D F (x) of F in x ∈ V is the unique linear mapping D F (x) : V 7→ W
such that there is a δ > 0 and a function ǫ : [0, δ] → R + satisfying limξ →0 ǫ(ξ ) = 0 such that

k F (x + h) − F (x) − D F (x)hk = ǫ(khk) ∀h ∈ V , khk < δ . (8.5.1.14)

The vector h in (8.5.1.14) may be called “direction of differentiation”.

☞ Note that D F (x)h ∈ W is the vector returned by the linear mapping D F (x) when applied to h ∈ V .
☞ In Def. 8.5.1.13 ‖·‖ can be any norm on V (→ Def. 1.5.5.4).
☞ A common shorthand notation for (8.5.1.14) relies on the “little-o” Landau symbol:

‖F (x + h) − F (x) − D F (x)h‖ = o(‖h‖)   for h → 0 ,


which designates a remainder term tending to 0 as its argument tends to 0.
☞ Choosing bases of V and W , D F ( x) can be described by the Jacobian (matrix) (8.3.2.8), because
every linear mapping between finite dimensional vector spaces has a matrix representation after
bases have been fixed. Thus, the derivative is usually written as a matrix-valued function on D.

In the context of the Newton iteration (8.5.1.6) the computation of the Newton correction s in the k + 1-th
step amounts to solving a linear system of equations:

s = − D F ( x ( k ) ) −1 F ( x ( k ) ) ⇔ D F ( x ( k ) ) s = − F ( x ( k ) ) .

Matching this with Def. 8.5.1.13 we see that we need only determine expressions for D F (x(k) )h, h ∈ V ,
in order to state the LSE yielding the Newton correction. This will become important when applying the
“compact” differentiation rules discussed next. y

§8.5.1.15 (“High-level” Differentiation rules → Repetition of basic analysis skills)

Video tutorial for § 8.5.1.15 "Multi-dimensional Differentiation": (20 minutes) Download link,
tablet notes

Stating the Newton iteration (8.5.1.6) for F : R n 7→ R n through an analytic formula entails computing the
Jacobian D F. The safe, but tedious way is to use the definition (8.3.2.8) directly and compute the partial
derivatives.
To avoid cumbersome component-oriented considerations, it is sometimes useful to know the rules of
multidimensional differentiation:

Immediate from Def. 8.5.1.13 are the following differentiation rules (V, W, U, Z are finite-dimensional
normed vector spaces, all functions assumed to be differentiable):
• For F : V 7→ W linear, we have D F (x) = F for all x ∈ V

(For instance, if F : R n → R m , F (x) = Ax, A ∈ R m,n , then D F (x) = A for all x ∈ R n .)


• Chain rule: For F : V 7→ W , G : U 7→ V sufficiently smooth

D( F ◦ G )(x)h = D F ( G (x))(D G (x)h) , h ∈ V , x ∈ D . (8.5.1.16)


This can be justified by a formal computation

F ( G (x + h)) = F ( G (x) + D G (x)h + o (khk)) = F ( G (x)) + D F ( G (x)) ◦ D G (x) h + o (khk)) .

• Product rule: F : D ⊂ V ↦ W , G : D ⊂ V ↦ U sufficiently smooth, b : W × U ↦ Z bilinear, i.e.,


linear in each argument:

T (x) := b( F (x), G (x)) ⇒ D T (x)h = b(D F (x)h, G (x)) + b( F (x), D G (x)h) , (8.5.1.17)
h ∈ V, x ∈ D .
We see this by formal computations, making heavy use of the bilinearity of b in step (∗):

T (x + h) = b( F (x + h), G (x + h))
          = b( F (x) + D F (x)h + o(‖h‖), G (x) + D G (x)h + o(‖h‖))
      (∗) = b( F (x), G (x)) + b(D F (x)h, G (x)) + b( F (x), D G (x)h) + o(‖h‖) ,

where the two middle terms constitute D T (x)h, and where the term b(D F (x)h, D G (x)h) is obviously
O(‖h‖²) for h → 0 and, therefore, can be “thrown into the garbage bin” of o(‖h‖).

Advice: If you do not feel comfortable with these


rules of multidimensional differential calculus, please
resort to detailed componentwise/entrywise calcula-
tions according to (8.3.2.8) (“pedestrian differentia-
tion”), though they may be tedious.

The first and second derivatives of real-valued functions occur frequently and have special names, see
[Str09, Def. 7.3.2] and [Str09, Satz 7.5.3].

Definition 8.5.1.18. Gradient and Hessian


For sufficiently smooth F : D ⊂ R n 7→ R the gradient grad F : D 7→ R n , and the Hessian
(matrix) H F (x) : D 7→ R n,n are defined as

(grad F (x))ᵀ h := D F (x)h ,   h₁ᵀ H F (x) h₂ := D( D F (x)(h₁) )(h₂) ,   ∀h, h₁ , h₂ ∈ R^n .

In other words,
• the gradient grad F (x) is the vector representative of the linear mapping D F (x) : R n → R,
• and the Hessian H F (x) is the matrix representative of the bilinear mapping
D{z 7→ D F (z)}(x) : R n × R n → R.
y

EXAMPLE 8.5.1.19 (Derivative of a bilinear form) We start with a simple example:

Ψ : R n 7→ R , Ψ(x) := x⊤ Ax , A ∈ R n,n .


This is the general matrix representation of a bilinear form on R n . We want to compute the gradient of Ψ.
We do this in two ways:
➊ “High level differentiation”: We apply the product rule (8.5.1.17) with D = V = W = U = R n ,
Z = R, F, G = Id, which means D F (x) = D G (x) = I, and the bilinear form b(x, y) :=
x T Ay:

D Ψ(x)h = hᵀAx + xᵀAh = (Ax)ᵀh + xᵀAh = ( xᵀAᵀ + xᵀA ) h = (grad Ψ(x))ᵀ h ,

Hence, grad Ψ(x) = (A + A T )x according to the definition of a gradient (→ Def. 8.5.1.18).

➋ “Low level/pedestrian differentiation”: Using the rules of matrix×vector multiplication, Ψ can be
   written in terms of the vector components xᵢ := (x)ᵢ , i = 1, . . . , n: Fixing i ∈ {1, . . . , n} we find

   Ψ(x) = ∑_(k=1)^n ∑_(j=1)^n (A)_(k,j) xₖ xⱼ
        = (A)_(i,i) xᵢ² + ∑_(j=1, j≠i)^n (A)_(i,j) xᵢ xⱼ + ∑_(k=1, k≠i)^n (A)_(k,i) xₖ xᵢ + ∑_(k=1, k≠i)^n ∑_(j=1, j≠i)^n (A)_(k,j) xₖ xⱼ .

   Now computing the partial derivative with respect to xᵢ is easy:

   ∂Ψ/∂xᵢ (x) = 2 (A)_(i,i) xᵢ + ∑_(j=1, j≠i)^n (A)_(i,j) xⱼ + ∑_(k=1, k≠i)^n (A)_(k,i) xₖ
              = ∑_(j=1)^n (A)_(i,j) xⱼ + ∑_(k=1)^n (A)_(k,i) xₖ = ( Ax + Aᵀx )ᵢ ,   i = 1, . . . , n .

   This provides the components of the gradient, since i ∈ {1, . . . , n} was arbitrary.
Of course, the results obtained by both methods must agree! y
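The agreement can also be verified numerically by comparing grad Ψ(x) = (A + Aᵀ)x with difference quotients. The following sketch is our own check (random A and x, one-sided differences); none of it belongs to the lecture codes.

C++ code (illustrative sketch): numerical check of grad Ψ(x) = (A + Aᵀ)x
#include <Eigen/Dense>
#include <cmath>
#include <iostream>
#include <limits>
void checkGradQuadraticForm() {
  const int n = 5;
  const Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
  const Eigen::VectorXd x = Eigen::VectorXd::Random(n);
  auto Psi = [&A](const Eigen::VectorXd &y) { return y.dot(A * y); };
  const Eigen::VectorXd g = (A + A.transpose()) * x; // gradient from Ex. 8.5.1.19
  const double h = std::sqrt(std::numeric_limits<double>::epsilon());
  Eigen::VectorXd g_fd(n);
  for (int i = 0; i < n; ++i) {
    Eigen::VectorXd xp = x;
    xp(i) += h;                       // perturb i-th component
    g_fd(i) = (Psi(xp) - Psi(x)) / h; // difference quotient
  }
  std::cout << "deviation = " << (g - g_fd).norm() << std::endl; // small
}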

EXAMPLE 8.5.1.20 (Derivative of Euclidean norm) We seek the derivative of the Euclidean norm, that
is, of the function F (x) := kxk2 , x ∈ R n \ {0} ( F is defined but not differentiable in x = 0, just look at
the case n = 1!)
➊ “High level differentiation”: We can write F as the composition of two functions F = G ◦ H with

   G : R₊ → R₊ ,  G(ξ) := √ξ ,
   H : R^n → R ,  H(x) := xᵀx .

   Using the rule for the differentiation of bilinear forms from Ex. 8.5.1.19 for the case A = I and basic
   calculus, we find

   D H (x)h = 2 xᵀh ,   x, h ∈ R^n ,
   D G (ξ)ζ = ζ / ( 2√ξ ) ,   ξ > 0, ζ ∈ R .

   Finally, the chain rule (8.5.1.16) gives

   D F (x)h = D G ( H (x))( D H (x)h ) = 2xᵀh / ( 2√(xᵀx) ) = ( xᵀ/‖x‖₂ ) · h .   (8.5.1.21)

   Def. 8.5.1.18  ⇒  grad F (x) = x / ‖x‖₂ .


➋ “Pedestrian differentiation”: We can explicitly write F (x) as

   F (x) = √( x₁² + x₂² + · · · + xᵢ² + · · · + xₙ² ) ,

   ∂F/∂xᵢ (x) = 1 / ( 2 √( x₁² + x₂² + · · · + xᵢ² + · · · + xₙ² ) ) · 2xᵢ ,   i = 1, . . . , n   [chain rule] ,

   grad F (x) = (1/‖x‖₂) [ x₁ , x₂ , . . . , xₙ ]ᵀ = x / ‖x‖₂ .

§8.5.1.22 (Newton iteration via product rule) This paragraph explains the use of the general product
rule (8.5.1.17) to derive the linear system solved by the Newton correction. It implements the insights from
§ 8.5.1.12.
We seek solutions of F (x) = 0 with F (x) := b( G (x), H (x)), where
✦ V, W are some vector spaces (finite- or even infinite-dimensional),
✦ G : D → V , H : D → W , D ⊂ R n , are continuously differentiable in the sense of Def. 8.5.1.13,
✦ b : V × W 7→ R n is bilinear (linear in each argument).
According to the general product rule (8.5.1.17) we have

D F (x)h = b(D G (x)h, H (x)) + b( G (x), D H (x)h) , h ∈ R n . (8.5.1.23)

This already defines the linear system of equations to be solved to compute the Newton correction s

b(D G (x(k) )s, H (x(k) )) + b( G (x(k) ), D H (x(k) )s) = −b( G (x(k) ), H (x(k) )) . (8.5.1.24)

Since the left-hand side is linear in s, this really represents a square linear system of n equations. The
next example will present a concrete case. y

§8.5.1.25 (A “quasi-linear” system of equations) We call a non-linear equation of the following form a
quasi-linear system of equations:

A(x)x = b with b ∈ R n , (8.5.1.26)

where A : D ⊂ R^n → R^(n,n) is a matrix-valued function. In other words, a quasi-linear system is a “linear


system of equations with solution-dependent system matrix”. It incarnates a zero-finding problem for a
function F : R n → R n with a bilinear structure as introduced in § 8.5.1.22.

For many quasi-linear systems, for which there exist solutions, the fixed point iteration (→ Section 8.3)

x ( k +1) = A ( x ( k ) ) −1 b ⇔ A ( x ( k ) ) x ( k +1) = b , (8.5.1.27)

provides a convergent iteration, provided that a good initial guess is available.

We can also reformulate

(8.5.1.26) ⇔ F (x) = 0 with F (x) = A(x)x − b .

If x 7→ A(x) is differentiable, the product rule (8.5.1.17) yields

D F ( x ) h = (D A ( x ) h ) x + A ( x ) h , h ∈ R n . (8.5.1.28)


Note that D A(x(k) ) is a mapping from R n into R n,n , which gets h as an argument. Then the Newton
iteration reads

x ( k +1) = x ( k ) − s , D F ( x ( k ) ) s = (D A ( x ( k ) ) s ) x ( k ) + A ( x ( k ) ) s = A ( x ( k ) ) x ( k ) − b . (8.5.1.29)

The next example will examine a concrete quasi-linear system of equations. y

EXAMPLE 8.5.1.30 (A special quasi-linear system of equations) We consider the quasi-linear system
of equations
 
A(x)x = b ,   A(x) :=
   ⎡ γ(x)   1                        ⎤
   ⎢  1    γ(x)   1                  ⎥
   ⎢        ⋱      ⋱      ⋱          ⎥   ∈ R^(n×n) ,   (8.5.1.31)
   ⎢              1     γ(x)   1     ⎥
   ⎣                     1     γ(x)  ⎦

where γ(x) := 3 + ‖x‖₂ (Euclidean vector norm), the right-hand side vector b ∈ R^n is given, and x ∈ R^n
is unknown.

In order to compute the derivative of F (x) := A(x)x − b it is advisable to rewrite

A(x)x = Tx + x‖x‖₂ ,   T :=
   ⎡ 3   1              ⎤
   ⎢ 1   3   1          ⎥
   ⎢     ⋱   ⋱   ⋱      ⎥ .
   ⎢         1   3   1  ⎥
   ⎣             1   3  ⎦
The derivative of the first term is straightforward, because it is linear in x, see the discussion following
Def. 8.5.1.13.

The “pedestrian” approach to the second term starts with writing it explicitly in components as

( x‖x‖ )ᵢ = xᵢ √( x₁² + · · · + xₙ² ) ,   i = 1, . . . , n .

Then we can compute the Jacobian according to (8.3.2.8) by taking partial derivatives:

∂/∂xᵢ ( x‖x‖ )ᵢ = √( x₁² + · · · + xₙ² ) + xᵢ · xᵢ / √( x₁² + · · · + xₙ² ) ,
∂/∂xⱼ ( x‖x‖ )ᵢ = xᵢ · xⱼ / √( x₁² + · · · + xₙ² ) ,   j ≠ i .

For the “high level” treatment of the second term x ↦ x‖x‖₂ we apply the product rule (8.5.1.17), together
with (8.5.1.21):

D F (x)h = Th + ‖x‖₂ h + x · ( xᵀh )/‖x‖₂ = ( A(x) + xxᵀ/‖x‖₂ ) h .

Thus, in concrete terms the Newton iteration (8.5.1.29) becomes

x^(k+1) = x^(k) − ( A(x^(k)) + x^(k)(x^(k))ᵀ / ‖x^(k)‖₂ )⁻¹ ( A(x^(k)) x^(k) − b ) .


Note that the coefficient matrix of the linear system to be solved in each step is a rank-1-modification
(2.6.0.16) of the symmetric positive definite tridiagonal matrix A(x(k) ), cf. Lemma 2.8.0.12. Thus the
Sherman-Morrison-Woodbury formula from Lemma 2.6.0.21 can be used to solve it efficiently. y
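A possible implementation of this Newton iteration, reusing the generic newton() template from Code 8.5.1.9, could look as follows. This is only a sketch under our own simplifying assumptions (dense assembly of A(x), b ≠ 0 as initial guess); for large n one would store A(x) as a sparse tridiagonal matrix and exploit the rank-1 structure via the Sherman-Morrison-Woodbury formula instead of a dense LU-decomposition.

C++ code (illustrative sketch): Newton’s method for the quasi-linear system (8.5.1.31)
#include <Eigen/Dense>
Eigen::VectorXd solveQuasiLinear(const Eigen::VectorXd &b) {
  const Eigen::Index n = b.size();
  // assemble the tridiagonal matrix A(x) densely (for simplicity only)
  auto A = [n](const Eigen::VectorXd &x) {
    Eigen::MatrixXd M = Eigen::MatrixXd::Zero(n, n);
    M.diagonal().setConstant(3.0 + x.norm()); // gamma(x) = 3 + ||x||_2
    M.diagonal(1).setConstant(1.0);
    M.diagonal(-1).setConstant(1.0);
    return M;
  };
  auto F = [&](const Eigen::VectorXd &x) { return (A(x) * x - b).eval(); };
  // Newton correction: solve (A(x) + x*x^T/||x||_2) s = f
  auto DFinv = [&](const Eigen::VectorXd &x, const Eigen::VectorXd &f) {
    const Eigen::MatrixXd J = A(x) + x * x.transpose() / x.norm();
    return J.lu().solve(f).eval();
  };
  Eigen::VectorXd x0 = b; // initial guess (assumed to be non-zero)
  return newton(F, DFinv, x0, 1e-8, 1e-10);
}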

§8.5.1.32 (Implicit differentiation of bilinear expressions) Given are


✦ a finite-dimensional vector space W ,
✦ a continuously differentiable function G : D ⊂ R n → V into some vector space V ,
✦ a bilinear mapping b : V × W → W .
Let F : D → W be implicitly defined by

F (x): b( G (x), F (x)) = b ∈ W . (8.5.1.33)

This relationship will provide a valid definition of F in a neighborhood of x₀ ∈ D, if we assume that there
are x₀ ∈ D and z₀ ∈ W such that b( G (x₀), z₀ ) = b, and that the linear mapping z ↦ b( G (x₀), z ) is invertible.
Then, for x close to x0 , F (x) can be computed by solving a square linear system of equations in W . In
Ex. 8.4.2.9 we already saw an example of an implicitly defined F for W = R.
We want to solve F (x) = 0 for this implicitly defined F by means of Newton’s method. In order to determine
the derivative of F we resort to implicit differentiation [Str09, Sect. 7.8] of the defining equation (8.5.1.33)
by means of the general product rule (8.5.1.17). We formally differentiate both sides of (8.5.1.33):

b( D G (x)h, F (x) ) + b( G (x), D F (x)h ) = 0   ∀h ∈ R^n ,   (8.5.1.34)

and find that the Newton correction s in the k + 1-th Newton step can be computed as follows:

D F (x(k) )s = − F (x(k) ) ⇒ b( G (x(k) ), D F (x(k) )s) = −b( G (x(k) ), F (x(k) ))


(8.5.1.34)
⇒ b(D G (x(k) )s, F (x(k) )) = b( G (x(k) ), F (x(k) )) ,

which constitutes a dim W × dim W linear system of equations. The next example discusses a concrete
application of implicit differentiation with W = R n,n . y

EXAMPLE 8.5.1.35 (Derivative of matrix inversion) We consider matrix inversion as a mapping and
(formally) compute its derivative, that is, the derivative of function

inv : R^(n,n)_* → R^(n,n) ,   X ↦ X⁻¹ ,

where R^(n,n)_* denotes the (open) set of invertible n × n-matrices, n ∈ N.

We apply the technique of implicit differentiation from § 8.5.1.32 to the equation

inv(X) · X = I , X ∈ R n,n
∗ . (8.5.1.36)

Differentiation on both sides of (8.5.1.36) by means of the product rule (8.5.1.17) yields

D inv(X)H · X + inv(X) · H = O , H ∈ R n,n ,

D inv(X)H = −X−1 HX−1 , H ∈ R n,n . (8.5.1.37)

For n = 1 we get D inv( x )h = −h/x² , which recovers the well-known derivative of the function x ↦ x⁻¹.
y


EXAMPLE 8.5.1.38 (Matrix inversion by means of Newton’s method → [Ale12; PS91]) Surprisingly,
it is possible to obtain the inverse of a matrix as the solution of a non-linear system of equations. Thus
it can be computed using Newton’s method.
Given a regular matrix A ∈ R^(n,n) , its inverse can be defined as the unique zero of a function:

X = A⁻¹   ⇐⇒   F (X) = O   for   F : R^(n,n)_* → R^(n,n) ,  X ↦ A − X⁻¹ .

Using (8.5.1.37) we find for the derivative of F in X ∈ R^(n,n)_*

D F (X)H = X−1 HX−1 , H ∈ R n,n . (8.5.1.39)

The abstract Newton iteration (8.5.1.6) for F reads

X ( k +1) = X ( k ) − S , S : = D F ( X ( k ) ) −1 F ( X ( k ) ) . (8.5.1.40)

The Newton correction S in the k-th step solves the linear system of equations

D F ( X^(k) ) S = ( X^(k) )⁻¹ S ( X^(k) )⁻¹ = F ( X^(k) ) = A − ( X^(k) )⁻¹   [by (8.5.1.39)]

   ⇒   S = X^(k) ( A − ( X^(k) )⁻¹ ) X^(k) = X^(k) A X^(k) − X^(k) .   (8.5.1.41)

Inserting this in (8.5.1.40) yields

X^(k+1) = X^(k) − ( X^(k) A X^(k) − X^(k) ) = X^(k) ( 2I − A X^(k) ) .   (8.5.1.42)

This is the Newton iteration (8.5.1.6) for F (X) = O that we expect to converge locally to X∗ := A−1 . y
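A compact EIGEN realization of the iteration (8.5.1.42) might look as follows; this is our own sketch, with the initial guess X^(0) = αAᵀ, α < 2/‖A‖₂², anticipated from the convergence analysis in Ex. 8.5.2.3 (here α is chosen via the Frobenius norm, which bounds ‖A‖₂ from above).

C++ code (illustrative sketch): matrix inversion by the Newton iteration (8.5.1.42)
#include <Eigen/Dense>
Eigen::MatrixXd newtonInverse(const Eigen::MatrixXd &A, double rtol = 1e-12) {
  const double alpha = 1.0 / A.squaredNorm(); // <= 1/||A||_2^2 < 2/||A||_2^2
  Eigen::MatrixXd X = alpha * A.transpose();  // initial guess X^(0)
  const Eigen::MatrixXd I = Eigen::MatrixXd::Identity(A.rows(), A.cols());
  for (int k = 0; k < 64; ++k) {
    const Eigen::MatrixXd Xnew = X * (2.0 * I - A * X); // (8.5.1.42)
    if ((Xnew - X).norm() <= rtol * Xnew.norm()) return Xnew;
    X = Xnew;
  }
  return X; // may not have converged within 64 steps
}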

Remark 8.5.1.43 (Simplified Newton method [DR08, Sect. 5.6.2]) Computing the Newton correction
can be expensive owing to the O(n3 ) asymptotic cost (→ § 2.5.0.4) of solving a different large n × n
linear system of equations in every step of the Newton iteration (8.5.1.6).
We know that the cost of a linear solve can be reduced to O(n2 ) if the coefficient matrix is available in LU-
or QR-factorized form, see, e.g., § 2.3.2.15. This motivates the attempt to “freeze” the Jacobian
in the Newton iteration and use D F ( x (0) ) throughout, which leads to the simplified Newton iteration:

x ( k +1) = x ( k ) − D F ( x (0) ) −1 F ( x ( k ) ) , k = 0, 1, . . . .

The following C++ function implements a template for this simplified Newton Method and uses the same
Jacobian D F (x(0) ) for all steps, which makes it possible to reuse an LU-decomposition, cf. Rem. 2.5.0.10.

C++ code 8.5.1.44: Efficient implementation of simplified Newton method ➺ GITLAB


2 // C++ template for simplified Newton method
3 template <typename Func , typename Jac , typename Vec>
4 void simpnewton ( Vec& x , Func F , Jac DF, double r t o l , double a t o l )
5 {
6 auto l u = DF( x ) . l u ( ) ; // do LU decomposition once!
7 Vec s ; // Newton correction
8 double ns = NAN; // auxiliary variables for termination control
9 double nx = NAN;
10 do {
11 s = l u . solve ( F ( x ) ) ;
12 x = x−s ; // new iterate
13 ns = s . norm ( ) ; nx = x . norm ( ) ;
14 }
15 // termination based on relative and absolute tolerance
16 while ( ( ns > r t o l * nx ) && ( ns > a t o l ) ) ;

8. Iterative Methods for Non-Linear Systems of Equations, 8.5. Newton’s Method in R n 647
NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024

17 }

Drawback: Switching to the simplified Newton method usually sacrifices the asymptotic
quadratic convergence of the Newton method: merely linear convergence can be expected.
y

Remark 8.5.1.45 (Numerical Differentiation for computation of Jacobian) If D F (x) is not available
(e.g. when F (x) is given only as a procedure) we may resort to approximation by difference quotients:

Numerical Differentiation:   ∂Fᵢ/∂xⱼ (x) ≈ ( Fᵢ(x + h eⱼ) − Fᵢ(x) ) / h .

Caution: Roundoff errors wreak havoc for small h → Ex. 1.5.4.7 ! Therefore use h ≈ √EPS. y
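A corresponding sketch in EIGEN (the function numJacobian is our own name, not a library routine) reads:

C++ code (illustrative sketch): Jacobian approximation by difference quotients
#include <Eigen/Dense>
#include <cmath>
#include <limits>
template <typename Func>
Eigen::MatrixXd numJacobian(Func &&F, const Eigen::VectorXd &x) {
  const double h = std::sqrt(std::numeric_limits<double>::epsilon()); // h ~ sqrt(EPS)
  const Eigen::VectorXd Fx = F(x);
  Eigen::MatrixXd J(Fx.size(), x.size());
  for (Eigen::Index j = 0; j < x.size(); ++j) {
    Eigen::VectorXd xp = x;
    xp(j) += h;                  // perturb j-th coordinate
    J.col(j) = (F(xp) - Fx) / h; // j-th column of the Jacobian
  }
  return J;
}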

Review question(s) 8.5.1.46 (Newton’s iteration in R n )


(Q8.5.1.46.A) In Ex. 7.4.2.2 we found the following non-linear system of equations for the weights w1 , w2
and nodes c1 , c2 of a quadrature rule of order 4 on [−1, 1]:

w₁ + w₂ = 2 ,
c₁ w₁ + c₂ w₂ = 0 ,
c₁² w₁ + c₂² w₂ = 2/3 ,
c₁³ w₁ + c₂³ w₂ = 0 .

Write down a function F : R4 → R4 such that this non-linear system corresponds to F (x) = 0,
x = [w1 , w2 , c1 , c2 ]⊤ and then derive the corresponding Newton iteration.
(Q8.5.1.46.B) Consider the following non-linear interpolation problem for a twice continuously differen-
tiable function f : [−1, 1] → R. Seek a node set {t0 , . . . , tn } ⊂ [−1, 1] and a polynomial p ∈ Pn such
that

p(tk ) = f (tk ) , p′ (tk ) = f ′ (tk ) ∀k ∈ {0, . . . , n} .

Recast this problem as a non-linear system of equations for the nodes tk , k = 0, . . . , n, and the n + 1
monomial coefficients of p and then state the corresponding Newton iteration.
(Q8.5.1.46.C) For a symmetric positive definite matrix A = A⊤ ∈ R n,n derive the Newton iteration for
solving F (X) = O, where

F : Sym(n) → Sym(n) , F (X) := X2 − A ,

and Sym(n) stands for the vector space of symmetric n × n-matrices.


(Q8.5.1.46.D) In order to find a global minimum of a twice differentiable function f : R n → R, we could
first try determine vectors x∗ ∈ R n such that grad f (x∗ ) = 0. What is the Newton iteration for this
n × n non-linear system of equations?
(Q8.5.1.46.E) Is the stopping rule implemented in the following code affine invariant?

C++11-code 8.5.1.9: Newton’s method in C++ ➺ GITLAB


template <typename FuncType, typename JacType, typename VecType>
VecType newton(FuncType &&F, JacType &&DFinv, VecType x, const double rtol,
               const double atol) {
  // Note that the vector x passes both the initial guess and also
  // contains the iterates
  VecType s(x.size()); // Vector for Newton corrections
  // Main loop
  do {
    s = DFinv(x, F(x)); // compute Newton correction
    x -= s;             // compute next iterate
  }
  // correction based termination (relative and absolute)
  while ((s.norm() > rtol * x.norm()) && (s.norm() > atol));
  return x;
}

8.5.2 Convergence of Newton’s Method

Video tutorial for Section 8.5.2 "Convergence of Newton’s Method": (9 minutes)


Download link, tablet notes

Notice that the Newton iteration (8.5.1.6) is a fixed point iteration (→ Section 8.3) with iteration function

Φ ( x ) = x − D F ( x ) −1 F ( x ) .

[“product rule” (8.5.1.17) : D Φ(x) = I − D({x 7→ D F (x)−1 }) F (x) − D F (x)−1 D F (x) ]

F (x∗ ) = 0 ⇒ D Φ(x∗ ) = O ,

that is, the derivative (Jacobian) of the iteration function of the Newton fixed point iteration vanishes in the
limit point. Thus from Lemma 8.3.2.15 we draw the same conclusion as in the scalar case n = 1, cf.
Section 8.4.2.1.

Local quadratic convergence of Newton’s method, if D F (x∗ ) regular

This can easily be seen by the following formal argument, valid for F ∈ C²( D ) with D F (x^∗) regular:

x^(k+1) − x^∗ = Φ( x^(k) ) − Φ( x^∗ )
             = D Φ(x^∗)( x^(k) − x^∗ ) + O( ‖x^(k) − x^∗‖² ) = O( ‖x^(k) − x^∗‖² )   for x^(k) ≈ x^∗ .

EXPERIMENT 8.5.2.1 (Convergence of Newton’s method in 2D) We study the convergence of Newton’s
method empirically for n = 2 for
    
F (x) = [ x₁² − x₂⁴ ; x₁ − x₂³ ] ,   x = [ x₁ ; x₂ ] ∈ R²   with solution   F( [1 ; 1] ) = 0 .   (8.5.2.2)

Jacobian (analytic computation):   D F (x) = [ ∂_(x₁)F₁(x)  ∂_(x₂)F₁(x) ; ∂_(x₁)F₂(x)  ∂_(x₂)F₂(x) ] = [ 2x₁  −4x₂³ ; 1  −3x₂² ]


Realization of Newton iteration (8.5.1.6):

1. Solve the LSE

   [ 2x₁  −4x₂³ ; 1  −3x₂² ] ∆x^(k) = − [ x₁² − x₂⁴ ; x₁ − x₂³ ] ,

   where x^(k) = [ x₁ , x₂ ]ᵀ .
2. Set x^(k+1) = x^(k) + ∆x^(k) .
Monitoring the iteration we obtain the following iterates/error norms:
 k    x^(k)                                        ε_k := ‖x^∗ − x^(k)‖₂    (log ε_(k+1) − log ε_k)/(log ε_k − log ε_(k−1))
0 [0.7, 0.7] T 4.24e-01
1 [0.87850000000000, 1.064285714285714] T 1.37e-01 1.69
2 [1.01815943274188, 1.00914882463936] T 2.03e-02 2.23
3 [1.00023355916300, 1.00015913936075] T 2.83e-04 2.15
4 [1.00000000583852, 1.00000002726552] T 2.79e-08 1.77
5 [0.999999999999998, 1.000000000000000] T 2.11e-15
6 [1, 1] T
☞ (Some) evidence of quadratic convergence, see Rem. 8.2.2.12. y
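The table can be reproduced with a few lines of EIGEN code; the following sketch (our own driver, with the exact solution x^∗ = [1, 1]ᵀ hard-coded) records the error norms and prints the empirical order estimates of Rem. 8.2.2.12.

C++ code (illustrative sketch): empirical convergence study for (8.5.2.2)
#include <Eigen/Dense>
#include <cmath>
#include <iostream>
#include <vector>
void newton2DConvergenceStudy() {
  Eigen::Vector2d x(0.7, 0.7);           // initial guess
  const Eigen::Vector2d xstar(1.0, 1.0); // exact solution
  std::vector<double> err{(x - xstar).norm()};
  for (int k = 0; k < 5; ++k) {
    Eigen::Matrix2d J;
    J << 2 * x(0), -4 * std::pow(x(1), 3), 1.0, -3 * x(1) * x(1);
    const Eigen::Vector2d F(x(0) * x(0) - std::pow(x(1), 4),
                            x(0) - std::pow(x(1), 3));
    x -= J.lu().solve(F); // Newton step
    err.push_back((x - xstar).norm());
  }
  for (std::size_t k = 2; k < err.size(); ++k) { // empirical order estimates
    std::cout << (std::log(err[k]) - std::log(err[k - 1])) /
                     (std::log(err[k - 1]) - std::log(err[k - 2]))
              << std::endl;
  }
}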

EXAMPLE 8.5.2.3 (Convergence of Newton’s method for matrix inversion → [Ale12; PS91]) In
Ex. 8.5.1.38 we have derived the Newton iteration for

F (X) = O with F : R n,n → R n,n , F ( X ) : = A − X −1 ,

for a given regular matrix A ∈ R n,n :

 
X(k+1) = X(k) − X(k) AX(k) − X(k) = X(k) 2I − AX(k) . (8.5.1.42)

Now we study the local convergence of this iteration by direct estimates. To that end we first derive a
recursion for the iteration errors E^(k) := X^(k) − A⁻¹ :

E^(k+1) = X^(k+1) − A⁻¹
        = X^(k) ( 2I − A X^(k) ) − A⁻¹                              [by (8.5.1.42)]
        = ( E^(k) + A⁻¹ ) ( 2I − A( E^(k) + A⁻¹ ) ) − A⁻¹
        = ( E^(k) + A⁻¹ )( I − A E^(k) ) − A⁻¹ = −E^(k) A E^(k) .

For the norm of the iteration error (a matrix norm → Def. 1.5.5.10) we conclude from submultiplicativity
(1.5.5.11) a recursive estimate

‖E^(k+1)‖ ≤ ‖E^(k)‖² ‖A‖ .   (8.5.2.4)

This holds for any matrix norm according to Def. 1.5.5.10, which is induced by a vector norm. For the
relative iteration error we obtain

‖E^(k+1)‖/‖A⁻¹‖  ≤  ‖A‖ ‖A⁻¹‖ · ( ‖E^(k)‖/‖A⁻¹‖ )² ,   (8.5.2.5)
 (relative error)   (= cond(A))    (relative error)


where the condition number is defined in Def. 2.2.2.7.

From (8.5.2.4) we conclude that the iteration will converge ( lim_(k→∞) E^(k) = 0 ), if

‖E^(0) A‖ = ‖X^(0) A − I‖ < 1 ,   (8.5.2.6)

which gives a condition on the initial guess X^(0) . Now let us consider the Euclidean matrix norm ‖·‖₂ ,
which can be expressed in terms of eigenvalues, see Cor. 1.5.5.16. Motivated by this relationship, we use
the initial guess X^(0) = αAᵀ with α > 0 still to be determined:

‖X^(0) A − I‖₂ = ‖αAᵀA − I‖₂ = α‖A‖₂² − 1  <  1   ⇔   α < 2/‖A‖₂² ,

which is a sufficient condition for the initial guess X(0) = αA⊤ , in order to make (8.5.1.42) converge. In
this case we infer quadratic convergence from both (8.5.2.4) and (8.5.2.5). y

There is a sophisticated theory about the convergence of Newton’s method. For example one can find the
following theorem in [DH03, Thm. 4.10], [Deu11, Sect. 2.1]):

Theorem 8.5.2.7. Local quadratic convergence of Newton’s method

If:
(A) D ⊂ R^n open and convex,
(B) F : D ↦ R^n continuously differentiable,
(C) D F (x) regular ∀x ∈ D,
(D) ∃ L ≥ 0:  ‖D F (x)⁻¹ ( D F (x + v) − D F (x) )‖₂ ≤ L ‖v‖₂   ∀v ∈ R^n , v + x ∈ D,  ∀x ∈ D,
(E) ∃ x^∗ :  F (x^∗) = 0   (existence of solution in D),
(F) the initial guess x^(0) ∈ D satisfies  ρ := ‖x^∗ − x^(0)‖₂ < 2/L  ∧  B_ρ(x^∗) ⊂ D ,
then the Newton iteration (8.5.1.6) satisfies:
(i) x^(k) ∈ B_ρ(x^∗) := { y ∈ R^n : ‖y − x^∗‖ < ρ } for all k ∈ N,
(ii) lim_(k→∞) x^(k) = x^∗ ,
(iii) ‖x^(k+1) − x^∗‖₂ ≤ (L/2) ‖x^(k) − x^∗‖₂²   (local quadratic convergence) .

✎ notation: ball B_ρ(z) := { x ∈ R^n : ‖x − z‖₂ ≤ ρ }

Terminology: (D) =̂ affine invariant Lipschitz condition

Usually, it is hardly possible to verify the assumptions of the theorem for a concrete non-linear
system of equations, because neither L nor x ∗ are known.

In general: a priori estimates as in Thm. 8.5.2.7 are of little practical relevance.


Review question(s) 8.5.2.8 (Convergence of Newton’s method)
(Q8.5.2.8.A) Outline how you would empirically investigate the quadratic convergence of a Newton itera-
tion

x ( k +1) = x ( k ) − D F ( x ( k ) ) −1 F ( x ( k ) ) (8.5.1.6)

for solving F (x) = 0.


(Q8.5.2.8.B) Identifying C ≃ R² we can consider the equation z² = 1 as a 2 × 2 non-linear system of
equations.
Now please take a look at the convergence theorem Thm. 8.5.2.7 for Newton’s method in R^n. Verify the
assumptions of this theorem for D := { z ∈ C : |z − 1| < 1/2 } and the function F : R² → R² for which
solving F (x) = 0 in R² is equivalent to solving z² = 1 in C. Estimate the Lipschitz constant L.

8.5.3 Termination of Newton Iteration


Video tutorial for Section 8.5.3 "Termination of Newton Iteration": (7 minutes) Download link,
tablet notes

An abstract discussion of ways to stop iterations for solving F (x) = 0 was presented in Section 8.2.3, with
“ideal termination” (→ § 8.2.3.2) as ultimate, but unfeasible, goal.

Yet, in 8.5.2 we saw that Newton’s method enjoys (asymptotic) quadratic convergence, which means rapid
decrease of the relative error of the iterates, once we are close to the solution, which is exactly the point,
when we want to STOP. As a consequence, asymptotically, the Newton correction (difference of two
consecutive iterates) yields rather precise information about the size of the error:

‖x^(k+1) − x^∗‖ ≪ ‖x^(k) − x^∗‖   ⇒   ‖x^(k) − x^∗‖ ≈ ‖x^(k+1) − x^(k)‖ .   (8.5.3.1)

This suggests the following correction based termination criterion:

STOP, as soon as  ‖∆x^(k)‖ ≤ τ_rel ‖x^(k)‖  or  ‖∆x^(k)‖ ≤ τ_abs ,   (8.5.3.2)

with Newton correction ∆x^(k) := D F (x^(k))⁻¹ F (x^(k)).

Here, ‖·‖ can be any suitable vector norm, τ_rel =̂ relative tolerance, τ_abs =̂ absolute tolerance, see
§ 8.2.3.2.

➣ quit iterating as soon as  ‖x^(k+1) − x^(k)‖ = ‖D F (x^(k))⁻¹ F (x^(k))‖ < τ ‖x^(k)‖ ,   with τ =̂ tolerance

→ uneconomical: one needless update, because x^(k) would already be accurate enough.

Remark 8.5.3.3 (Newton’s iteration; computational effort and termination) Some facts about the New-
ton method for solving large (n ≫ 1) non-linear systems of equations:
☛ Solving the linear system to compute the Newton correction may be expensive (asymptotic compu-
tational effort O(n3 ) for direct elimination → § 2.3.1.5) and accounts for the bulk of numerical cost
of a single step of the iteration.
☛ In applications only very few steps of the iteration will be needed to achieve the desired accuracy
due to fast quadratic convergence.
✄ The termination criterion (8.5.3.2) computes the last Newton correction ∆x(k) needlessly, because
x(k) already accurate enough!

Therefore we would like to use an a-posteriori termination criterion that dispenses with computing (and
“inverting”) another Jacobian D F (x(k) ) just to tell us that x(k) is already accurate enough. y


§8.5.3.4 (Termination of Newton iteration based on simplified Newton correction) Due to fast asymp-
totic quadratic convergence, we can expect D F (x(k−1) ) ≈ D F (x(k) ) during the final steps of the iteration.

Idea:      Replace D F (x^(k)) with D F (x^(k−1)) in any correction based termination criterion.
Rationale: The LU-decomposition of D F (x^(k−1)) is already available ➤ reduced effort.

Terminology:   ∆x̄^(k) := D F (x^(k−1))⁻¹ F (x^(k))   =̂   simplified Newton correction

Economical correction based termination criterion for Newton’s method:

STOP, as soon as  ‖∆x̄^(k)‖ ≤ τ_rel ‖x^(k)‖  or  ‖∆x̄^(k)‖ ≤ τ_abs ,   (8.5.3.5)

with simplified Newton correction ∆x̄^(k) := D F (x^(k−1))⁻¹ F (x^(k)).
Note that (8.5.3.5) is affine invariant → Rem. 8.5.1.11.

Effort: Reuse of the LU-factorization (→ Rem. 2.5.0.10) of D F (x^(k−1)) ➤ ∆x̄^(k) available with O(n²) operations.

C++11 code 8.5.3.6: Generic Newton iteration with termination criterion (8.5.3.5) ➺ GITLAB
template <typename FuncType, typename JacType, typename VecType>
void newton_stc(const FuncType &F, const JacType &DF, VecType &x, double rtol,
                double atol) {
  using scalar_t = typename VecType::Scalar;
  scalar_t sn;
  do {
    auto jacfac = DF(x).lu(); // LU-factorize Jacobian
    x -= jacfac.solve(F(x));  // Compute next iterate
    // Compute norm of simplified Newton correction
    sn = jacfac.solve(F(x)).norm();
  }
  // Termination based on simplified Newton correction
  while ((sn > rtol * x.norm()) && (sn > atol));
}

Remark 8.5.3.7 (Residual based termination of Newton’s method) If we used the residual based ter-
mination criterion

‖F ( x^(k) )‖ ≤ τ ,

then the resulting algorithm would not be affine invariant, because for F (x) = 0 and AF (x) = 0, A ∈
R n,n regular, the Newton iteration might terminate with different iterates. y


Summary: Newton’s method

• converges asymptotically very fast: doubling of the number of significant digits in each step,
• often has a very small region of convergence, which requires an initial guess rather close to the solution.

Review question(s) 8.5.3.9 (Termination of Newton’s iteration)


(Q8.5.3.9.A) Show that the following stopping rule for Newton’s method,

STOP, as soon as  ‖∆x̄^(k)‖ ≤ τ_rel ‖x^(k)‖  or  ‖∆x̄^(k)‖ ≤ τ_abs ,   (8.5.3.5)

with simplified Newton correction ∆x̄^(k) := D F (x^(k−1))⁻¹ F (x^(k)),
is affine invariant.

Recall that an implementation of Newton’s method (including stopping rules) for solving GA (x) = 0,
GA (x) := AF (x), F : D ⊂ R n → R n , is called affine invariant, if the same sequence of iterates is
produced for every regular matrix A ∈ R n,n .

8.5.4 Damped Newton Method

Video tutorial for Section 8.5.4 "Damped Newton Method": (11 minutes) Download link,
tablet notes

Potentially big problem: Newton method converges quadratically, but only locally , which may render
it useless, if convergence is guaranteed only for initial guesses very close to exact solution, see also
Ex. 8.4.2.38.

In this section we study a method to enlarge the region of convergence, at the expense of quadratic
convergence, of course.

EXAMPLE 8.5.4.1 (Local convergence of Newton’s method) The dark side of local convergence (→
Def. 8.2.1.10): for many initial guesses x(0) Newton’s method will not converge!

In 1D two main causes can be identified:


➊ “Wrong direction” of Newton correction:

F ( x ) = xe^x − 1   ⇒   F ′(−1) = 0 ,

x^(0) < −1  ⇒  x^(k) → −∞ ,
x^(0) > −1  ⇒  x^(k) → x^∗ ,

because all Newton corrections for x^(k) < −1 make the iterates decrease even further.

(Fig. 304: graph of x ↦ xe^x − 1 on [−3, 1].)


➋ Newton correction is too large:

F ( x ) = arctan( ax ) ,   a > 0, x ∈ R ,   with zero x^∗ = 0 .

If x^(k) is located where the function is “flat”, the intersection of the tangents with the x-axis is “far out”,
see Fig. 306.

(Fig. 305: graphs of x ↦ arctan(ax) for a = 10, 1, 0.3.)

(Fig. 306: diverging Newton iteration for F(x) = arctan x, with iterates x^(k−1), x^(k), x^(k+1) “overshooting” further and further.)

In Fig. 307 the red zone = { x^(0) ∈ R : x^(k) → 0 }, the domain of initial guesses for which Newton’s
method converges, is plotted in the (x, a)-plane.
y

If the Newton correction points in the wrong direction (Item ➊), no general remedy is available. If the
Newton correction is too large (Item ➋), there is an effective cure:

we observe “overshooting” of Newton correction

Idea: damping of Newton correction:

With λ(k) > 0: x(k+1) := x(k) − λ(k) D F (x(k) )−1 F (x(k) ) . (8.5.4.2)

Terminology: λ(k) = damping factor, λ ∈]0, 1].

Affine invariant damping strategy

Choice of damping factor: affine invariant natural monotonicity test (NMT) [Deu11, Ch. 3]:

choose “maximal” 0 < λ^(k) ≤ 1:   ‖∆x(λ^(k))‖₂ ≤ ( 1 − λ^(k)/2 ) ‖∆x^(k)‖₂   (8.5.4.4)


where   ∆x^(k)    := D F (x^(k))⁻¹ F (x^(k))                 → current Newton correction ,
        ∆x(λ^(k)) := D F (x^(k))⁻¹ F (x^(k) − λ^(k) ∆x^(k))  → tentative simplified Newton correction .

Heuristics behind control of damping:


✦ When the method converges ⇔ size of Newton correction decreases ⇔ (8.5.4.4) satisfied.
✦ In the case of strong damping (λ^(k) ≪ 1) the size of the Newton correction cannot be expected to
shrink significantly, since iterates do not change much ➣ factor (1 − ½λ^(k)) in (8.5.4.4).

Note: As before, reuse of LU-factorization in the computation of ∆x(k) and ∆x(λ(k) ).


Policy: Reduce the damping factor by a factor q ∈ ]0, 1[ (usually q = ½) until the affine invariant natural
monotonicity test (8.5.4.4) is passed, see Line 20 in the following C++ code.

C++ code 8.5.4.5: Generic damped Newton method based on natural monotonicity test
➺ GITLAB
1  template <typename FuncType, typename JacType, typename VecType>
2  void dampnewton(const FuncType &F, const JacType &DF,
3                  VecType &x, double rtol, double atol)
4  {
5    using index_t = typename VecType::Index;
6    using scalar_t = typename VecType::Scalar;
7    const index_t n = x.size();   // No. of unknowns
8    const scalar_t lmin = 1E-3;   // Minimal damping factor
9    scalar_t lambda = 1.0;        // Initial and actual damping factor
10   VecType s(n), st(n);          // Newton corrections
11   VecType xn(n);                // Tentative new iterate
12   scalar_t sn, stn;             // Norms of Newton corrections
13
14   do {
15     auto jacfac = DF(x).lu();   // LU-factorize Jacobian
16     s = jacfac.solve(F(x));     // Newton correction
17     sn = s.norm();              // Norm of Newton correction
18     lambda *= 2.0;
19     do {
20       lambda /= 2;              // Reduce damping factor
21       if (lambda < lmin) throw "No convergence: lambda -> 0";
22       xn = x - lambda * s;      // Tentative next iterate
23       st = jacfac.solve(F(xn)); // Simplified Newton correction
24       stn = st.norm();
25     }
26     while (stn > (1 - lambda / 2) * sn); // Natural monotonicity test
27     x = xn;                     // Now: xn accepted as new iterate
28     lambda = std::min(2.0 * lambda, 1.0); // Try to mitigate damping
29   }
30   // Termination based on simplified Newton correction
31   while ((stn > rtol * x.norm()) && (stn > atol));
32 }

The arguments for Code 8.5.4.5 are the same as for Code 8.5.3.6. As termination criterion it uses
(8.5.3.5). Note that all calls to solve boil down to forward/backward elimination for triangular matrices
and incur cost of O(n2 ) only.

Note: The LU-factorization of the Jacobi matrix D F (x(k) ) is done once per successful iteration step and


reused for the computation of the simplified Newton correction in Line 23 of the above C++ code.

EXPERIMENT 8.5.4.6 (Damped Newton method) We test the damped Newton method for Item ➋ of
Ex. 8.5.4.1, where excessive Newton corrections made Newton’s method fail.
Setting: F ( x ) = arctan( x ), x^(0) = 20, q = ½, LMIN = 0.001.

  k    λ^(k)      x^(k)                 F ( x^(k) )
  1    0.03125    0.94199967624205      0.75554074974604
  2    0.06250    0.85287592931991      0.70616132170387
  3    0.12500    0.70039827977515      0.61099321623952
  4    0.25000    0.47271811131169      0.44158487422833
  5    0.50000    0.20258686348037      0.19988168667351
  6    1.00000   -0.00549825489514     -0.00549819949059
  7    1.00000    0.00000011081045      0.00000011081045
  8    1.00000   -0.00000000000001     -0.00000000000001

We observe that damping is effective and asymptotic quadratic convergence is recovered.
y
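A possible driver for Code 8.5.4.5 reproducing this experiment (our own sketch; the function name and the use of 1-dimensional EIGEN vectors are our choices) is:

C++ code (illustrative sketch): calling dampnewton for F(x) = arctan(x), x^(0) = 20
#include <Eigen/Dense>
#include <cmath>
#include <iostream>
void dampnewtonArctanDriver() {
  auto F = [](const Eigen::VectorXd &x) -> Eigen::VectorXd {
    return Eigen::VectorXd::Constant(1, std::atan(x(0)));
  };
  auto DF = [](const Eigen::VectorXd &x) -> Eigen::MatrixXd {
    return Eigen::MatrixXd::Constant(1, 1, 1.0 / (1.0 + x(0) * x(0)));
  };
  Eigen::VectorXd x = Eigen::VectorXd::Constant(1, 20.0); // initial guess
  dampnewton(F, DF, x, 1E-6, 1E-8);
  std::cout << "x = " << x(0) << std::endl; // close to the zero x* = 0
}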

EXPERIMENT 8.5.4.7 (Failure of damped Newton method) We examine the effect of damping in the
case of Item ➊ of Ex. 8.5.4.1.
✦ As in Ex. 8.5.4.1:  F ( x ) = xe^x − 1 ,
✦ Initial guess for the damped Newton method: x^(0) = −1.5 .

This time the initial guess is to the left of the global minimum of the function.

(Fig. 308: graph of x ↦ xe^x − 1 on [−3, 1].)

Observation: the Newton correction points in the “wrong direction” ➤ no convergence despite damping.

  k    λ^(k)      x^(k)                  F ( x^(k) )
  1    0.25000    -4.4908445351690       -1.0503476286303
  2    0.06250    -6.1682249558799       -1.0129221310944
  3    0.01562    -7.6300006580712       -1.0037055902301
  4    0.00390    -8.8476436930246       -1.0012715832278
  5    0.00195    -10.5815494437311      -1.0002685596314
Bailed out because of lambda < LMIN !
y
Review question(s) 8.5.4.8 (Damped Newton method)
(Q8.5.4.8.A) Give an example of a one-dimensional zero-finding problem, where a suitable damping strat-
egy can ensure global convergence of Newton’s method.
(Q8.5.4.8.B) Show that the affine invariant damping strategy for Newton’s method


Choice of damping factor: affine invariant natural monotonicity test (NMT) [Deu11, Ch. 3]:

choose “maximal” 0 < λ^(k) ≤ 1:   ‖∆x(λ^(k))‖₂ ≤ ( 1 − λ^(k)/2 ) ‖∆x^(k)‖₂   (8.5.4.4)

where   ∆x^(k)    := D F (x^(k))⁻¹ F (x^(k))                 → current Newton correction ,
        ∆x(λ^(k)) := D F (x^(k))⁻¹ F (x^(k) − λ^(k) ∆x^(k))  → tentative simplified Newton correction .

leads to an affine invariant iteration.


8.6 Quasi-Newton Method


Video tutorial for Section 8.6 "Quasi-Newton Method": (15 minutes) Download link,
tablet notes

We start with the following question: How can we solve the non-linear system of equations F (x) =
0, F : D ⊂ R^n → R^n , iteratively, in case D F (x) is not available and numerical differentiation (see
Rem. 8.5.1.45) is too expensive?

In 1D (n = 1) we can choose among many derivative-free methods that rely on F-evaluations alone, for
instance the secant method (8.4.2.30) from Section 8.4.2.3:
x^(k+1) = x^(k) − F ( x^(k) )( x^(k) − x^(k−1) ) / ( F ( x^(k) ) − F ( x^(k−1) ) )
        = x^(k) − ( ( F ( x^(k) ) − F ( x^(k−1) ) ) / ( x^(k) − x^(k−1) ) )⁻¹ F ( x^(k) ) .   (8.4.2.30)

Recall from Rem. 8.4.2.33 that the secant method converges locally with order p ≈ 1.6 and beats
Newton’s method in terms of efficiency (→ Section 8.4.3).
Compare (8.4.2.30) with Newton’s method in 1D for solving F ( x ) = 0:

x ( k +1) = x ( k ) − F ′ ( x ( k ) ) −1 F ( x ( k ) ) . (8.4.2.1)

We realize that this iteration amounts to a “Newton-type iteration” based on the approximation

F ′( x^(k) ) ≈ ( F ( x^(k) ) − F ( x^(k−1) ) ) / ( x^(k) − x^(k−1) )   ← "difference quotient", already computed! → cheap   (8.6.0.1)

Unfortunately, it is not immediate how to generalize the secant method to n > 1.


Idea: Rewrite (8.6.0.1) as a secant condition for an approximation Jₖ ≈ D F (x^(k)), x^(k) =̂ the current iterate:

Jₖ ( x^(k) − x^(k−1) ) = F ( x^(k) ) − F ( x^(k−1) ) .   (8.6.0.2)

Iteration:   x^(k+1) := x^(k) − Jₖ⁻¹ F ( x^(k) ) .   (8.6.0.3)

Embarrassment of choice: Many different matrices Jk fulfill (8.6.0.2)!

➣ We need extra conditions to fix Jk ∈ R n,n .

Reasoning: If we assume that Jk is a good approximation of D F (x(k) ), then it would be foolish not to use
the information contained in Jk for the construction of Jk+1 .

Guideline: obtain Jk through a “small” modification of Jk−1 compliant with (8.6.0.2)

What can “small modification” mean? Demand that Jk acts like Jk−1 on the orthogonal complement of the
one-dimensional subspace of R n generated by the vector x(k) − x(k−1) ! This is expressed by

Broyden’s conditions: Jk z = Jk−1 z ∀z: z ⊥ (x(k) − x(k−1) ) . (8.6.0.4)

Together with the secant condition (8.6.0.2) this uniquely determines Jk :

Jₖ := Jₖ₋₁ + F (x^(k)) ( x^(k) − x^(k−1) )ᵀ / ‖x^(k) − x^(k−1)‖₂²   [from (8.6.0.2), (8.6.0.4)]   (8.6.0.5)

Note that the update formula (8.6.0.5) means that Jk is spawned by a rank-1-modification of Jk−1 . We
have arrived at a well-defined iterative method.

Final form of Broyden’s quasi-Newton method for solving F (x) = 0:

$$x^{(k+1)} := x^{(k)} + \Delta x^{(k)} , \quad \Delta x^{(k)} := -J_k^{-1} F(x^{(k)}) , \qquad J_{k+1} := J_k + \frac{F(x^{(k+1)})\,(\Delta x^{(k)})^\top}{\|\Delta x^{(k)}\|_2^2} , \qquad k \in \mathbb{N}_0 \;. \qquad (8.6.0.6)$$

To start the iteration we have to initialize J0 , e.g. with the exact Jacobi matrix D F (x(0) ).
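The following minimal sketch (not one of the lecture's listings; the function name broydenSimple and all variable names are chosen ad hoc) shows the plain iteration (8.6.0.6) in EIGEN, with the approximate Jacobian J stored and updated explicitly. It performs an O(n³) LU factorization in every step; the efficient recursive-update variant is developed in § 8.6.0.15 below.

#include <Eigen/Dense>

// Minimal sketch of Broyden's quasi-Newton method (8.6.0.6): the approximate
// Jacobian J is kept as a dense matrix and modified by the rank-1 update.
template <typename FUNCTION>
Eigen::VectorXd broydenSimple(FUNCTION &&F, Eigen::VectorXd x, Eigen::MatrixXd J,
                              double rtol, double atol, unsigned int maxit = 20) {
  for (unsigned int k = 0; k < maxit; ++k) {
    const Eigen::VectorXd dx = -J.lu().solve(F(x));  // quasi-Newton correction ∆x(k)
    x += dx;                                         // next iterate x(k+1)
    J += F(x) * dx.transpose() / dx.squaredNorm();   // rank-1 update of J, cf. (8.6.0.6)
    if ((dx.norm() < rtol * x.norm()) || (dx.norm() < atol)) break;  // correction-based termination
  }
  return x;
}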

Remark 8.6.0.7 (Minimality property of Broyden's rank-1-modification) In another sense Jk is closest to Jk−1 under the constraint of the secant condition (8.6.0.2):

Let x(k) and Jk be the iterates and matrices, respectively, from Broyden’s method (8.6.0.6), and let J ∈ R n,n
satisfy the same secant condition (8.6.0.2) as Jk+1 :

J ( x ( k +1) − x ( k ) ) = F ( x ( k +1) ) − F ( x ( k ) ) . (8.6.0.8)

Then from x^{(k+1)} − x^{(k)} = −J_k^{-1}F(x^{(k)}) we obtain

$$(I - J_k^{-1}J)\,(x^{(k+1)} - x^{(k)}) = -J_k^{-1}F(x^{(k)}) - J_k^{-1}\bigl(F(x^{(k+1)}) - F(x^{(k)})\bigr) = -J_k^{-1}F(x^{(k+1)}) \;. \qquad (8.6.0.9)$$


From this we get the identity

$$I - J_k^{-1}J_{k+1} = I - J_k^{-1}\left(J_k + \frac{F(x^{(k+1)})\,(x^{(k+1)}-x^{(k)})^\top}{\|x^{(k+1)}-x^{(k)}\|_2^2}\right) = -J_k^{-1}F(x^{(k+1)})\,\frac{(x^{(k+1)}-x^{(k)})^\top}{\|x^{(k+1)}-x^{(k)}\|_2^2} \overset{(8.6.0.9)}{=} (I - J_k^{-1}J)\,\frac{(x^{(k+1)}-x^{(k)})(x^{(k+1)}-x^{(k)})^\top}{\|x^{(k+1)}-x^{(k)}\|_2^2} \;.$$

Using the submultiplicative property (1.5.5.11) of the Euclidean matrix norm, we conclude

$$\bigl\|I - J_k^{-1}J_{k+1}\bigr\|_2 \le \bigl\|I - J_k^{-1}J\bigr\|_2 \;, \quad\text{because}\quad \left\|\frac{(x^{(k+1)}-x^{(k)})(x^{(k+1)}-x^{(k)})^\top}{\|x^{(k+1)}-x^{(k)}\|_2^2}\right\|_2 \le 1 \;,$$

which we saw in Ex. 1.5.5.20. This estimate holds for all matrices J satisfying (8.6.0.8).

We may read this as follows: (8.6.0.5) gives the k·k2 -minimal relative correction of Jk−1 , such that the
secant condition (8.6.0.2) holds. y

EXPERIMENT 8.6.0.10 (Broyden's quasi-Newton method: Convergence) We revisit the 2 × 2 non-


linear system of the Exp. 8.5.2.1,
     
$$F(x) = \begin{bmatrix} x_1^2 - x_2^4 \\ x_1 - x_2^3 \end{bmatrix} , \quad x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \in \mathbb{R}^2 \quad\text{with solution}\quad F\!\left(\begin{bmatrix}1\\1\end{bmatrix}\right) = 0 \;, \qquad (8.5.2.2)$$

and take x(0) = [0.7, 0.7] T . As starting value for the matrix iteration we use J0 = D F (x(0) ).
The numerical example shows that, in terms of convergence, the method is
• slower than Newton's method (8.5.1.6),
• faster than the simplified Newton method (see Rem. 8.5.1.43).
In particular, we cannot expect Broyden's quasi-Newton method to converge locally quadratically.

[Fig. 309: Euclidean norms of iteration errors and of the residuals ||F(x(k))|| versus iteration step, for Broyden's method, Newton's method, and the simplified Newton method.]
y
Remark 8.6.0.11 (Convergence monitors) In general, the convergence of any iterative method for non-linear systems of equations can fail, that is, it may stall or even diverge.

Demand on good numerical software: Algorithms should warn users of impending failure. For iterative
methods this is the task of convergence monitors, that is, conditions, cheaply verifiable a posteriori during
the iteration, that indicate stalled convergence or divergence.

For the damped Newton’s method this role can be played by the natural monotonicity test, see
Code 8.5.4.5; if it fails repeatedly, then the iteration should terminate with an error status.


For Broyden's quasi-Newton method, a similar strategy can rely on the relative size of the "simplified Broyden corrections" J_{k-1}^{-1}F(x^{(k)}):

$$\text{Convergence monitor for (8.6.0.6):}\qquad \mu := \frac{\bigl\|J_{k-1}^{-1}F(x^{(k)})\bigr\|}{\bigl\|\Delta x^{(k-1)}\bigr\|} < 1 \;? \qquad (8.6.0.12)$$
y

EXPERIMENT 8.6.0.13 (Monitoring convergence for Broyden’s quasi-Newton method)


We rely on the setting of Exp. 8.6.0.10. We track
1. the Euclidean norm of the iteration error,
2. and the value of the convergence monitor µ from (8.6.0.12).
Decay of the (norm of the) iteration error and of µ are well correlated.

[Fig. 310: error norm and convergence monitor µ versus iteration step.]
y

Remark 8.6.0.14 (Damped Broyden method) Option to improve robustness (increase region of local
convergence):
damped Broyden method (cf. same idea for Newton’s method, Section 8.5.4)
y

§8.6.0.15 (Implementation of Broyden’s quasi-Newton method) As remarked,


$$J_{k+1} := J_k + \frac{F(x^{(k+1)})\,(\Delta x^{(k)})^\top}{\|\Delta x^{(k)}\|_2^2} \qquad (8.6.0.6)$$
represents a rank-1-update of the approximate Jacobians Jk , which are then used as coefficient matrices
for n × n linear systems of equations. In § 2.6.0.12 we already discussed efficient algorithms for updating
the solutions of LSEs in the case of rank-1-modifications of the coefficient matrices. Now these techniques
come handy.
Idea: use the Sherman-Morrison-Woodbury update-formula from Lemma 2.6.0.21,
(A + UVH )−1 = A−1 − A−1 U(I + VH A−1 U)−1 VH A−1 ,
for the special case k = 1, K = R,
$$(A + uv^\top)^{-1} = \left(I - \frac{A^{-1}uv^\top}{1 + v^\top A^{-1}u}\right) A^{-1} \;, \qquad 1 + v^\top A^{-1}u \ne 0 \;,$$
which yields, with u := F(x^{(k+1)}) and v := ∆x^{(k)}/‖∆x^{(k)}‖₂², a recursion for the inverses of the approximate Jacobians,
bians,
$$J_{k+1}^{-1} = \left(I - \frac{J_k^{-1}F(x^{(k+1)})\,(\Delta x^{(k)})^\top}{\|\Delta x^{(k)}\|_2^2 + (\Delta x^{(k)})^\top J_k^{-1}F(x^{(k+1)})}\right) J_k^{-1} = \left(I - \frac{\overline{\Delta x}^{(k+1)}(\Delta x^{(k)})^\top}{\|\Delta x^{(k)}\|_2^2 + (\Delta x^{(k)})^\top \overline{\Delta x}^{(k+1)}}\right) J_k^{-1} \;, \qquad (8.6.0.16)$$


with the "simplified quasi-Newton correction" $\overline{\Delta x}^{(k+1)} := J_k^{-1}F(x^{(k+1)})$. This gives a well-defined J_{k+1}, if

$$\bigl|(\Delta x^{(k)})^\top \overline{\Delta x}^{(k+1)}\bigr| < \bigl\|\Delta x^{(k)}\bigr\|_2^2 \;, \qquad (8.6.0.17)$$

which can be expected to hold, if the method converges and the initial guess is sufficiently close to x∗ .
Note that the simplified quasi-Newton correction is also needed for the convergence monitor (8.6.0.12).

The iterated application of (8.6.0.16) pays off, if the iteration terminates after only a few steps. In particular, for large n ≫ 1 it is not advisable to form the matrices J_k^{-1} (which will usually be dense in contrast to J_k), because we can employ fast successive multiplications with rank-1 matrices (→ Ex. 1.4.3.1) to apply J_k^{-1} to the vector F(x^{(k)}). Let us assume that
• the quasi-Newton corrections ∆x^{(ℓ)}, for ℓ = 0, ..., k − 1,
• and the simplified quasi-Newton corrections $\overline{\Delta x}^{(\ell)} := J_{\ell-1}^{-1}F(x^{(\ell)})$ for ℓ = 1, ..., k
have already been stored in the course of the iteration. Then we can compute

$$\Delta x^{(k)} = -J_k^{-1}F(x^{(k)}) = -\prod_{\ell=0}^{k-1}\left(I - \frac{\overline{\Delta x}^{(\ell+1)}(\Delta x^{(\ell)})^\top}{\|\Delta x^{(\ell)}\|_2^2 + (\Delta x^{(\ell)})^\top\overline{\Delta x}^{(\ell+1)}}\right) J_0^{-1}F(x^{(k)}) \;.$$

Except for a single solution of a linear system this can be implemented with simple vector arithmetic:

$$t_0 \in \mathbb{R}^n :\;\; J_0 t_0 = F(x^{(k)}) \;, \qquad t_{\ell+1} := t_\ell - \overline{\Delta x}^{(\ell+1)}\,\frac{(\Delta x^{(\ell)})^\top t_\ell}{\|\Delta x^{(\ell)}\|_2^2 + (\Delta x^{(\ell)})^\top\overline{\Delta x}^{(\ell+1)}} \;, \quad \ell = 0,\dots,k-1 \;, \qquad \Delta x^{(k)} := -t_k \;, \qquad (8.6.0.18)$$

with a computational effort O(n³ + nk) for n → ∞, assuming that a standard elimination-based direct solver is used to get t₀, recall Thm. 2.5.0.2. Based on the next iterate x^{(k+1)} := x^{(k)} + ∆x^{(k)}, we obtain $\overline{\Delta x}^{(k+1)} := J_k^{-1}F(x^{(k+1)})$ in a similar fashion:

$$\overline{\Delta x}^{(k+1)} = \prod_{\ell=0}^{k-1}\left(I - \frac{\overline{\Delta x}^{(\ell+1)}(\Delta x^{(\ell)})^\top}{\|\Delta x^{(\ell)}\|_2^2 + (\Delta x^{(\ell)})^\top\overline{\Delta x}^{(\ell+1)}}\right) J_0^{-1}F(x^{(k+1)}) \;. \qquad (8.6.0.19)$$

Thus, the cost for N steps of Broyden’s quasi-Newton algorithm is asymptotically O(n3 + n2 N + nN 2 )
for n, N → ∞, because the expensive LU-factorization of J0 will be carried out only once.
This is implemented in the following function upbroyd(), whose arguments are
• a functor object F implementing F : R n → R n ,
• the initial guess x(0) ∈ R n in x,
• another functor object J providing the Jacobian D F (x) ∈ R n,n ,
• relative and absolute tolerance reltol and abstol for correction based termination as discussed
in Section 8.5.3,
• the maximal number of iterations maxit,
• and an optional monitor object for tracking the progress of the iteration, see the explanations con-
cerning recorder objects in § 0.3.3.4.
The implementation makes use of the following type definition:


template <typename T, int N> using Vector = Eigen::Matrix<T, N, 1>;

The function is templated to allow its use for both fixed-size and variable size vector types of E IGEN.

C++ code 8.6.0.20: Implementation of quasi-Newton method with recursive update of ap-
proximate Jacobians. ➺ GITLAB
template <typename FUNCTION, typename JACOBIAN, typename SCALAR,
          int N = Eigen::Dynamic,
          typename MONITOR =
              std::function<void(unsigned int, Vector<SCALAR, N>,
                                 Vector<SCALAR, N>, Vector<SCALAR, N>)>>
Vector<SCALAR, N> upbroyd(
    FUNCTION &&F, Vector<SCALAR, N> x, JACOBIAN &&J, SCALAR reltol,
    SCALAR abstol, unsigned int maxit = 20,
    MONITOR &&monitor = [](unsigned int /*itnum*/,
                           const Vector<SCALAR, N> & /*x*/,
                           const Vector<SCALAR, N> & /*fx*/,
                           const Vector<SCALAR, N> & /*dx*/) {}) {
  // Calculate LU factorization of initial Jacobian once, cf. Rem. 2.5.0.10
  auto fac = J.lu();
  // First quasi-Newton correction ∆x(0) := -J0^{-1} F(x(0))
  Vector<SCALAR, N> s = -fac.solve(F(x));
  // Store the first quasi-Newton correction ∆x(0)
  std::vector<Vector<SCALAR, N>> dx{s};
  x += s;         // x(1) := x(0) + ∆x(0)
  auto f = F(x);  // Here = F(x(1))
  // Array storing simplified quasi-Newton corrections
  std::vector<Vector<SCALAR, N>> dxs{};
  // Array of denominators ||∆x(l)||^2 + (∆x(l))^T * simplified correction
  std::vector<SCALAR> den{};
  monitor(0, x, f, s);  // Record start of iteration
  // Main loop with correction based termination control
  for (unsigned int k = 1;
       ((s.norm() >= reltol * x.norm()) && (s.norm() >= abstol) && (k < maxit));
       ++k) {
    // Compute J0^{-1} F(x(k)), needed for both recursions
    s = fac.solve(f);
    // (8.6.0.19): recursion for next simplified quasi-Newton correction
    Vector<SCALAR, N> ss = s;
    for (unsigned int l = 1; l < k; ++l) {
      ss -= dxs[l - 1] * (dx[l - 1].dot(ss)) / den[l - 1];
    }
    // Store next denominator
    den.push_back(dx[k - 1].squaredNorm() + dx[k - 1].dot(ss));
    // Store current simplified quasi-Newton correction
    dxs.push_back(ss);
    // (8.6.0.18): Compute next quasi-Newton correction recursively
    for (unsigned int l = 0; l < k; ++l) {
      s -= dxs[l] * (dx[l].dot(s)) / den[l];
    }
    s *= (-1.0);  // Comply with sign convention
    dx.push_back(s);
    // Compute next iterate x(k+1) and F(x(k+1))
    x += s;
    f = F(x);
    monitor(k, x, f, s);  // Record progress
  }
  return x;
}

Computational cost (N steps):
✦ O(N² · n) operations with vectors (Level I),
✦ 1 LU-decomposition of J0, N solutions of LSEs, see Section 2.3.2,
✦ N evaluations of F!
Memory cost (N steps):
✦ LU-factors of J0 + auxiliary vectors ∈ Rⁿ,
✦ 2N vectors ∆x(k), simplified corrections ∈ Rⁿ.
y
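For illustration, a possible invocation of upbroyd() for the 2×2 system (8.5.2.2) might look as follows. This is only a sketch: the wrapper name upbroydDemo and the monitor lambda are ad hoc, and, following the J.lu() call in the listing, the J argument is taken to be the Jacobian matrix D F(x(0)) at the initial guess.

#include <Eigen/Dense>
#include <cmath>
#include <iostream>

void upbroydDemo() {
  // F for the 2x2 system (8.5.2.2)
  auto F = [](const Eigen::Vector2d &x) {
    return Eigen::Vector2d(x[0] * x[0] - std::pow(x[1], 4), x[0] - std::pow(x[1], 3));
  };
  const Eigen::Vector2d x0(0.7, 0.7);  // initial guess as in Exp. 8.6.0.10
  // J0 = D F(x(0)), used to initialize the quasi-Newton iteration
  Eigen::Matrix2d J0;
  J0 << 2 * x0[0], -4 * std::pow(x0[1], 3), 1.0, -3 * x0[1] * x0[1];
  // Monitor records the residual norm in every step
  auto monitor = [](unsigned int k, const Eigen::Vector2d &x,
                    const Eigen::Vector2d &fx, const Eigen::Vector2d & /*dx*/) {
    std::cout << "k = " << k << ", |F(x)| = " << fx.norm() << std::endl;
  };
  const Eigen::Vector2d xstar = upbroyd(F, x0, J0, 1e-10, 1e-12, 20, monitor);
  std::cout << "x* = " << xstar.transpose() << std::endl;
}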

EXPERIMENT 8.6.0.21 (Broyden method for a large non-linear system)


The non-linear system is F : Rⁿ → Rⁿ,
$$F(x) = \operatorname{diag}(x)\,Ax - b \;, \quad b = [1, 2, \dots, n]^\top \in \mathbb{R}^n \;, \quad A = I + aa^\top \in \mathbb{R}^{n,n} \;, \quad a = \frac{1}{\sqrt{1\cdot b - 1}}\,(b - 1) \;,$$
with n = 1000 and tolerance tol = 2·10⁻². Initial guess: h = 2/n; x0 = (2:h:4-h)';
The results resemble those of Exp. 8.6.0.10.

[Fig. 311: norms of errors and residuals ||F(x(k))|| versus iteration step for Broyden's method, Newton's method, and the simplified Newton method.]

Efficiency comparison: Broyden method ←→ Newton method:


(in case of dimension n use tolerance tol = 2n · 10−5 , h = 2/n; x0 = (2:h:4-h)’; )

[Fig. 312: number of iteration steps ("Anzahl Schritte") versus n; Fig. 313: runtime in seconds ("Laufzeit [s]") versus n; comparing Broyden's method ("Broyden-Verfahren") and Newton's method ("Newton-Verfahren").]

☞ In conclusion,
the Broyden method is worthwhile for dimensions n ≫ 1 and low accuracy requirements.
y

Supplementary literature. A comprehensive monograph about all aspects of Newton's method and its generalizations is [Deu11]. The multi-dimensional Newton method is also presented in [Han02, Sect. 19], [DR08, Sect. 5.6], [AG11, Sect. 9.1].

For background information about quasi-Newton methods refer to [QSS00, Sect. 7.1.4], [Wer92, Sect. 2.3.2].
Review question(s) 8.6.0.22 (Quasi-Newton methods)
(Q8.6.0.22.A) Under what conditions is Broyden’s quasi-Newton method for solving F (x) = 0,
F : D ⊂ R n → R n , n ∈ N,

$$x^{(k+1)} := x^{(k)} + \Delta x^{(k)} , \quad \Delta x^{(k)} := -J_k^{-1}F(x^{(k)}) , \qquad J_{k+1} := J_k + \frac{F(x^{(k+1)})(\Delta x^{(k)})^\top}{\|\Delta x^{(k)}\|_2^2} , \qquad k \in \mathbb{N}_0 \;, \qquad (8.6.0.6)$$

affine invariant?

Remember that an iterative method for solving F (x) = 0 is called affine invariant, if it produces the
same sequence of iterates when applied (with the same initial guess) to AF (x) = 0 with any regular
matrix A ∈ R n,n .
(Q8.6.0.22.B) Show that the matrices Jk from Broyden’s quasi-Newton method (8.6.0.6) satisfy the secant
condition

J k ( x ( k ) − x ( k −1) ) = F ( x ( k ) ) − F ( x ( k −1) ) . (8.6.0.2)

8.7 Non-linear Least Squares [DR08, Ch. 6]

Video tutorial for Section 8.7 "Non-linear Least Squares": (7 minutes) Download link,
tablet notes

So far we have studied non-linear systems of equations F (x) = 0 with the same number n ∈ N of un-
knowns and equations: F : D ⊂ R n → R n . This generalizes square linear systems of equations whose
numerical treatment was the subjects of Chapter 2. Then, in Chapter 3, we turned our attention to overde-
termined linear systems of equations Ax = b with A ∈ R m,n , m > n. Now we take the same step for
non-linear systems of equations F (x) = 0 and admit non-linear functions F : D ⊂ R n → R m , m > n.
For overdetermined linear systems of equations we had to introduce the concept of a least-squares solu-
tion in Section 3.1, Def. 3.1.1.1. The same concept will apply in the non-linear case.
EXAMPLE 8.7.0.1 (Least squares data fitting) In Section 5.1 we discussed the reconstruction of a
parameterized function f ( x1 , . . . , xn ; ·) : D ⊂ R 7→ R from data points (ti , yi ), i = 1, . . . , n, by imposing
interpolation conditions. We demanded that the number n of parameters agreed with number of data
points. Thus, in the case of a general dependence of f on the parameters, the interpolation conditions
(5.1.0.2) yield a non-linear system of equations.
The interpolation approach is justified in the case of highly accurate data. However, we frequently en-
countered inaccurate data, for instance, due to measurement errors. As we discussed in Section 5.7 this
renders the interpolation approach dubious, also in light of the impact of “outliers”.


Idea: Mitigate the impact of data uncertainty by choosing


fewer parameters than data points.

➣ Thus measurement errors can “average out”.

Guided by this idea we arrive at a particular version of the least squares data fitting problem from Sec-
tion 5.7, cf. (5.7.0.2).

Non-linear least squares fitting problem

Given: ✦ data points (ti , yi ), i = 1, . . . , m


✦ (symbolic formula) for parameterized function
f ( x1 , . . . , xn ; ·) : D ⊂ R 7→ R, n < m
Sought: parameter values x1∗ , . . . , xn∗ ∈ R such that

$$(x_1^*, \dots, x_n^*) = \operatorname*{argmin}_{x\in\mathbb{R}^n} \sum_{i=1}^m \bigl| f(x_1, \dots, x_n; t_i) - y_i \bigr|^2 \;. \qquad (8.7.0.3)$$

As we did in § 5.7.0.13 for the linear case, we can rewrite (8.7.0.3) by introducing
   
$$F(x) := \begin{bmatrix} f(x_1, \dots, x_n, t_1) - y_1 \\ \vdots \\ f(x_1, \dots, x_n, t_m) - y_m \end{bmatrix} , \qquad x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} \;. \qquad (8.7.0.4)$$

With this notation we have the equivalence


$$(8.7.0.3) \;\Leftrightarrow\; x^* = \operatorname*{argmin}_{x\in\mathbb{R}^n} \sum_{i=1}^m \bigl| f(x_1, \dots, x_n, t_i) - y_i \bigr|^2 = \operatorname*{argmin}_{x\in\mathbb{R}^n} \|F(x)\|_2^2 \;. \qquad (8.7.0.5)$$

y
The previous example motivates the following definition generalizing Def. 3.1.1.1.

Definition 8.7.0.6. Non-linear least-squares solution

Given F : D ⊂ R n 7→ R m , m, n ∈ N, m > n, we call x∗ a non-linear least squares solution of


F (x) = 0, if

x∗ ∈ argminx∈ D k F (x)k22 .

The search for such non-linear least-squares solutions is our current concern.

Non-linear least squares problem

Given: F : D ⊂ R n 7→ R m , m, n ∈ N, m > n.
Find: x∗ ∈ D: x∗ = argminx∈ D Φ(x) , Φ(x) := 12 k F (x)k22 . (8.7.0.7)

Terminology: D =ˆ parameter space, x1, . . . , xn =ˆ parameters.


As in the case of linear least squares problems (→ Section 3.1.1): a non-linear least squares problem is
related to an overdetermined non-linear system of equations F (x) = 0.

As for non-linear systems of equations discussed in the beginning of this chapter, existence and unique-
ness of x∗ in (8.7.0.7) has to be established in each concrete case!

Remark 8.7.0.8 (“Full-rank condition”) Recall from Rem. 3.1.2.15, Ex. 3.1.2.17, Rem. 3.1.2.18 that for
a linear least-squares problem kAx − bk → min full rank of the matrix A ∈ R m,n was linked to a “good
model”, in which every parameter had an influence independently of the others.
Also in the non-linear setting we require "independence for each parameter":

∃ neighbourhood U (x∗ ) such that D F (x) has full rank n ∀ x ∈ U (x∗ ) . (8.7.0.9)

This means that the columns of the Jacobi matrix DF (x) must be linearly independent.
If (8.7.0.9) is not satisfied, then the parameters are redundant in the sense that fewer parameters would
be enough to model the same dependence (locally at x∗ ), cf. Rem. 3.1.2.18. y
Review question(s) 8.7.0.10 (Non-linear least squares)
(Q8.7.0.10.A) The one-dimensional non-linear least-squares data fitting problem for a sequence
(ti , yi ) ∈ R2 , i = 1, . . . , m of data points relies on families of parameterized functions

f ( x1 , . . . , xn ; ·) : D ⊂ R 7→ R , n<m,

and seeks to find parameter values x1∗ , . . . , xn∗ ∈ R such that

$$(x_1^*, \dots, x_n^*) = \operatorname*{argmin}_{x\in\mathbb{R}^n} \sum_{i=1}^m \bigl| f(x_1, \dots, x_n; t_i) - y_i \bigr|^2 \;. \qquad (8.7.0.3)$$

Explain, why this generalizes one-dimensional linear least-squares data fitting.


(Q8.7.0.10.B) Given data points (ti , yi ) ∈ R2 , i = 1, . . . , m, are to be fitted in least-squares sense by
linear combinations of two functions t 7→ eλi t , λi ∈ R, i = 1, 2. Bring this problem in the form

1D non-linear least squares fitting problem

Given: ✦ data points (ti , yi ), i = 1, . . . , m


✦ (symbolic formula) for parameterized function
f ( x1 , . . . , xn ; ·) : D ⊂ R 7→ R, n < m
Sought: parameter values x1∗ , . . . , xn∗ ∈ R such that

$$(x_1^*, \dots, x_n^*) = \operatorname*{argmin}_{x\in\mathbb{R}^n} \sum_{i=1}^m \bigl| f(x_1, \dots, x_n; t_i) - y_i \bigr|^2 \;. \qquad (8.7.0.3)$$

by describing a suitable parameterized function f .


(Q8.7.0.10.C) We expect a real-valued random variable X to be normally distributed X ∼ N (σ, µ), that
is, its density is given by the function

$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} , \qquad x \in \mathbb{R} , \;\; \sigma > 0 \;.$$


We collect m ≫ 1 independent samples xi ∈ R of X , and want to use them to estimate σ. Outline


how this can be done using non-linear least-squares fitting.

You may resort to the empiric cumulative distribution function

$$C_X : \{x_1, \dots, x_m\} \to \mathbb{R} , \qquad C_X(x_i) := \frac{1}{m}\,\sharp\{\, j : x_j \le x_i \,\} , \quad i \in \{1, \dots, m\} ,$$
to formulate an overdetermined non-linear system of equations.
(Q8.7.0.10.D) A scientist proposes that you fit times series data (ti , yi ) ∈ R2 , i = 1, . . . , n, by linear
combinations of m shifted exponentials t 7→ exp(λ(t − c j )), j = 1, . . . , m, with unknown shifts c j ∈ R.
Is this a good idea? Justify your judgment by examining the associated non-linear system of equations
F (x) = 0 and D F (x).

8.7.1 (Damped) Newton Method

Video tutorial for Section 8.7.1 "Non-linear Least Squares: (Damped) Newton Method": (13
minutes) Download link, tablet notes

We examine a first, natural approach to solving the non-linear least squares problem

x∗ ∈ D: x∗ = argminx∈ D Φ(x) , Φ(x) := 12 k F (x)k22 . (8.7.0.7)

We assume that F : D ⊂ R n → R m is twice continuously differentiable. Then the non-linear least-


squares solution x∗ has to be a zero of the derivative of x 7→ Φ(x):
$$\Phi(x^*) = \min \;\Rightarrow\; \operatorname{grad}\Phi(x^*) = 0 \;, \quad\text{where}\quad \operatorname{grad}\Phi(x) := \left[\frac{\partial\Phi}{\partial x_1}(x), \dots, \frac{\partial\Phi}{\partial x_n}(x)\right]^\top \in \mathbb{R}^n \;.$$

§8.7.1.1 (The Newton iteration) Note that grad Φ : D ⊂ R n 7→ R n . The simple idea is to use Newton’s
method (→ Section 8.5) to solve the non-linear n × n system of equations grad Φ(x) = 0.
The Newton iteration (8.5.1.6) for non-linear system of equations grad Φ(x) = 0 is

x(k+1) = x(k) − H Φ(x(k) )−1 grad Φ(x(k) ) , (8.7.1.2)

where H Φ(x) ∈ R n,n is the Hessian matrix of Φ, see Def. 8.5.1.18:


" #n
∂2 Φ
H Φ(x) = ∈ R n,n .
∂xi ∂x j
i,j=1

Using the definition Φ(x) := ½‖F(x)‖₂² we can express grad Φ and HΦ in terms of F : Rⁿ → Rᵐ. First, since Φ(x) = (G ◦ F)(x) with G : Rᵐ → R, G(z) := ½‖z‖₂², the chain rule

$$D(H \circ G)(x)h = D H(G(x))\,(D G(x)h) \;, \quad h \in V , \; x \in D \;, \qquad (8.5.1.16)$$

gives

$$D\Phi(x)h = D G(F(x))\,D F(x)h = F(x)^\top D F(x)h \;, \qquad (8.7.1.3)$$

$$\operatorname{grad}\Phi(x) = D F(x)^\top F(x) \;. \qquad (8.7.1.4)$$


Alternatively, we could also have used the product rule

$$T(x) := b(H(x), G(x)) \;\Rightarrow\; D T(x)h = b(D H(x)h, G(x)) + b(H(x), D G(x)h) \;, \quad h \in V , \; x \in D \;, \qquad (8.5.1.17)$$

with b : Rᵐ × Rᵐ → R, b(x, y) := ½x⊤y, and H = G := F, which yields

$$D\Phi(x)h = \tfrac{1}{2}(D F(x)h)^\top F(x) + \tfrac{1}{2}F(x)^\top(D F(x)h) \;, \quad h \in \mathbb{R}^n \;,$$

the same as before, of course.
A third option is direct “pedestrian-style” differentiation using the Taylor approximation
F (x + h) = F (x) + D F (x)h + O(khk22 ) for h → 0 :

$$\begin{aligned} \Phi(x+h) &= \tfrac{1}{2}\|F(x+h)\|_2^2 = \tfrac{1}{2}F(x+h)^\top F(x+h) \\ &= \tfrac{1}{2}\bigl(F(x) + D F(x)h + O(\|h\|_2^2)\bigr)^\top\bigl(F(x) + D F(x)h + O(\|h\|_2^2)\bigr) \\ &= \tfrac{1}{2}F(x)^\top F(x) + \tfrac{1}{2}F(x)^\top D F(x)h + \tfrac{1}{2}(D F(x)h)^\top F(x) + O(\|h\|_2^2) \\ &= \Phi(x) + \underbrace{F(x)^\top D F(x)}_{=\operatorname{grad}\Phi(x)^\top}\,h + O(\|h\|_2^2) \quad\text{for } h \to 0 \;. \end{aligned}$$

In a second step, we can apply the product rule (8.5.1.17) to x 7→ D F (x) T F (x) and get
$$H\Phi(x) := D(\operatorname{grad}\Phi)(x) = D F(x)^\top D F(x) + \sum_{j=1}^m F_j(x)\,D^2 F_j(x) \;, \qquad (H\Phi(x))_{i,k} = \sum_{j=1}^m \left\{ \frac{\partial^2 F_j}{\partial x_i\,\partial x_k}(x)\,F_j(x) + \frac{\partial F_j}{\partial x_k}(x)\,\frac{\partial F_j}{\partial x_i}(x) \right\} \;. \qquad (8.7.1.5)$$

We make the recommendation, cf. § 8.5.1.15, that when in doubt, the reader should differentiate com-
ponents of matrices and vectors! Let us pursue this “pedestrian option” also in this case. To begin
with, we recall that the derivative of the i-th component of grad Φ yields the i-th row of the Jacobian
(D grad Φ)(x) = H Φ(x) ∈ R n,n . So for some i ∈ {1, . . . , n} we abbreviate
$$g(x) := (\operatorname{grad}\Phi(x))_i \overset{(8.7.1.4)}{=} \bigl((D F(x))_{:,i}\bigr)^\top F(x) \;.$$
We compute the components of the gradient of g, which give entries of HΦ:
$$(H\Phi(x))_{i,k} = \frac{\partial g}{\partial x_k}(x) = \frac{\partial}{\partial x_k}\left\{ \sum_{\ell=1}^m \frac{\partial F_\ell}{\partial x_i}(x)\,F_\ell(x) \right\} = \sum_{\ell=1}^m \left( \frac{\partial^2 F_\ell}{\partial x_i\,\partial x_k}(x)\,F_\ell(x) + \frac{\partial F_\ell}{\partial x_i}(x)\,\frac{\partial F_\ell}{\partial x_k}(x) \right) , \quad k = 1, \dots, n \;.$$
Of course, we end up with the same formula as in (8.7.1.5).
The above derivative formulas permit us to rewrite (8.7.1.2) in concrete terms. We obtain the
Newton correction s ∈ R n to the current Newton iterate x(k) by solving the n × n linear system of equa-
tions
$$\underbrace{\left( D F(x^{(k)})^\top D F(x^{(k)}) + \sum_{j=1}^m F_j(x^{(k)})\,D^2 F_j(x^{(k)}) \right)}_{= H\Phi(x^{(k)})} s = - \underbrace{D F(x^{(k)})^\top F(x^{(k)})}_{=\operatorname{grad}\Phi(x^{(k)})} \;. \qquad (8.7.1.6)$$


All the techniques presented in Section 8.5 (damping, termination) can now be applied to the particular
Newton iteration for grad Φ = 0. We refer to that section. y

Remark 8.7.1.7 (Newton method and minimization of quadratic functional) Newton’s method (8.7.1.2)
for (8.7.0.7) can be read as successive minimization of a local quadratic approximation of Φ:
$$\Phi(x) \approx Q(s) := \Phi(x^{(k)}) + \operatorname{grad}\Phi(x^{(k)})^\top s + \tfrac{1}{2}s^\top H\Phi(x^{(k)})\,s \;, \qquad (8.7.1.8)$$
$$\operatorname{grad}Q(s) = 0 \;\Leftrightarrow\; H\Phi(x^{(k)})s + \operatorname{grad}\Phi(x^{(k)}) = 0 \;\Leftrightarrow\; (8.7.1.6) \;.$$
➣ So we deal with yet another model function method (→ Section 8.4.2) with quadratic model function
Q for Φ.
y
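For concreteness, one step of the iteration (8.7.1.2)/(8.7.1.6) could be sketched in EIGEN as follows. This is only an illustration under the assumption that, besides F(x) and D F(x), the Hessians D²F_j(x) are supplied by the caller; the function name newtonLsqStep and all argument names are placeholders.

#include <Eigen/Dense>
#include <vector>

// One Newton step (8.7.1.6) for min 0.5*||F(x)||^2: solve H*s = -grad and update x.
// Fx = F(x) in R^m, Jx = DF(x) in R^{m,n}, Hs[j] = D^2 F_j(x) in R^{n,n}.
Eigen::VectorXd newtonLsqStep(const Eigen::VectorXd &x, const Eigen::VectorXd &Fx,
                              const Eigen::MatrixXd &Jx,
                              const std::vector<Eigen::MatrixXd> &Hs) {
  const Eigen::VectorXd grad = Jx.transpose() * Fx;  // grad Phi(x), cf. (8.7.1.4)
  Eigen::MatrixXd H = Jx.transpose() * Jx;           // Gauss-Newton part of H Phi(x)
  for (std::size_t j = 0; j < Hs.size(); ++j) {
    H += Fx[j] * Hs[j];                              // second-derivative terms, cf. (8.7.1.5)
  }
  const Eigen::VectorXd s = H.lu().solve(-grad);     // Newton correction, cf. (8.7.1.6)
  return x + s;
}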
Review question(s) 8.7.1.9 (Non-linear Least Squares: (Damped) Newton Method)
(Q8.7.1.9.A) For Φ(x) := ½‖F(x)‖₂², with F = [F₁, ..., F_m]⊤ : Rⁿ → Rᵐ twice continuously differentiable, compute
$$\operatorname{grad}\Phi(x) := \left[\frac{\partial\Phi}{\partial x_1}(x), \dots, \frac{\partial\Phi}{\partial x_n}(x)\right]^\top \in \mathbb{R}^n \;, \qquad H\Phi(x) = \left[\frac{\partial^2\Phi}{\partial x_i\,\partial x_j}\right]_{i,j=1}^n \in \mathbb{R}^{n,n} \;,$$

in the “pedestrian way” based on partial derivatives.


(Q8.7.1.9.B) Consider the case F (x) := Ax − b, A ∈ R m,n , b ∈ R m . What does the Newton iteration

x(k+1) = x(k) − H Φ(x(k) )−1 grad Φ(x(k) ) , k ∈ N0 , (8.7.1.2)


boil down to in this case (Φ(x) := ½‖F(x)‖₂²)?

8.7.2 Gauss-Newton Method


Video tutorial for Section 8.7.2 "(Trust-region) Gauss-Newton Method": (13 minutes)
Download link, tablet notes

The Newton method derived in Section 8.7.1 hinges on the availability of second derivatives of F. This
compounds difficulties of implementation, in particular, if F is given only implicitly or in procedural form.
Now we will learn about a method for the non-linear least-squares problem

x∗ ∈ D: x∗ = argminx∈ D Φ(x) , Φ(x) := 12 k F (x)k22 , (8.7.0.7)

that relies on first derivatives of F only.

Idea:
Local linearization of F (here at y):

F (x) ≈ F (y) + D F (y)(x − y) .

➣ Leads to a sequence of linear least squares problems

Details: Employing local linearization we find that the minimization problem

$$\operatorname*{argmin}_{x\in\mathbb{R}^n} \|F(x)\|_2 \quad\text{is approximated by}\quad \underbrace{\operatorname*{argmin}_{x\in\mathbb{R}^n} \|F(x_0) + D F(x_0)(x - x_0)\|_2}_{(\spadesuit)} \;,$$


where x0 is an approximation of the solution x∗ of (8.7.0.7). This is a linear least squares problem in the
standard form given in Def. 3.1.1.1.

$$(\spadesuit) \;\Leftrightarrow\; \operatorname*{argmin}_{x\in\mathbb{R}^n} \|Ax - b\|_2 \quad\text{with}\quad A := D F(x_0) \in \mathbb{R}^{m,n} \;, \quad b := -F(x_0) + D F(x_0)\,x_0 \in \mathbb{R}^m \;.$$

Note that by condition (8.7.0.9) A has full rank, if x0 sufficiently close to x∗ .


Also be aware that this approach is different from the local quadratic approximation of Φ underlying
Newton’s method for (8.7.0.7), see Section 8.7.1, Rem. 8.7.1.7.

The idea of local linearization leads to the Gauss-Newton iteration, if we make the full-rank assumption (8.7.0.9), which, thanks to Cor. 3.1.2.13, guarantees uniqueness of solutions of the linear least-squares problem (♠). Making the substitutions x0 := x(k) and s := x − x(k), we arrive at the following iterative method:

Initial guess x(0) ∈ D,
$$x^{(k+1)} := x^{(k)} + s \;, \qquad s := \operatorname*{argmin}_{s'\in\mathbb{R}^n} \bigl\| F(x^{(k)}) + D F(x^{(k)})\,s' \bigr\|_2 \quad\text{(a linear least squares problem!)} \;. \qquad (8.7.2.1)$$

C++ code 8.7.2.2: Generic algorithm: Gauss-Newton method ➺ GITLAB


2   template <typename FUNCTION, typename JACOBIAN>
3   Eigen::VectorXd gn(const Eigen::VectorXd &init, FUNCTION &&F, JACOBIAN &&J,
4                      double rtol = 1.0E-6, double atol = 1.0E-8) {
5     Eigen::VectorXd x = init; // Vector for iterates x(k)
6     // Vector for Gauss-Newton correction s
7     Eigen::VectorXd s = J(x).householderQr().solve(F(x));
8     x = x - s;
9     // A posteriori termination based on absolute and relative tolerances
10    while ((s.norm() > rtol * x.norm()) && (s.norm() > atol)) {
11      s = J(x).householderQr().solve(F(x));
12      x = x - s;
13    }
14    return x;
15  }

Comments on Code 8.7.2.2:


☞ Recall E IGEN’s built-in functions for solving linear least-squares problems used in Line 7 and Line 11,
see Code 3.3.4.2.
☞ The argument init passes the initial guess x(0) ∈ R n , argument F must be a handle to a function F : R n 7→
R m , argument J provides the Jacobian of F, namely D F : R n 7→ R m,n , arguments rtol and
atol specify the tolerances for termination
☞ Line 10 implements the same correction-based stopping rule as it was proposed for Newton’s
method in Section 8.5.3, (8.5.3.2): The iteration terminates, if the Euclidean norm of the Gauss-
Newton correction s is either small relative to the norm of the current iterate or small in absolute
terms.

Note that the function of Code 8.7.2.2 also implements Newton’s method (→ Section 8.5.1) in the case
m = n!
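For illustration, gn() might be invoked as follows for the exponential model of the subsequent Ex. 8.7.2.4. This is a sketch only: the function name fitExpModel is a placeholder, and the data vectors t and y are assumed to be given.

#include <Eigen/Dense>

// Sketch: least-squares fit of f(x1,x2,x3;t) = x1 + x2*exp(-x3*t) to data (t_i, y_i)
Eigen::VectorXd fitExpModel(const Eigen::VectorXd &t, const Eigen::VectorXd &y) {
  // Residual vector F(x), cf. (8.7.0.4)
  auto F = [&](const Eigen::VectorXd &x) -> Eigen::VectorXd {
    return (x[0] + x[1] * (-x[2] * t.array()).exp() - y.array()).matrix();
  };
  // Jacobian D F(x), columns = partial derivatives w.r.t. x1, x2, x3
  auto J = [&](const Eigen::VectorXd &x) -> Eigen::MatrixXd {
    Eigen::MatrixXd Jx(t.size(), 3);
    Jx.col(0).setOnes();
    Jx.col(1) = (-x[2] * t.array()).exp().matrix();
    Jx.col(2) = (-x[1] * t.array() * (-x[2] * t.array()).exp()).matrix();
    return Jx;
  };
  Eigen::VectorXd x0(3);
  x0 << 1.8, 1.8, 0.1;  // one of the initial guesses used in Ex. 8.7.2.4
  return gn(x0, F, J);
}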


Remark 8.7.2.3 (Gauss-Newton versus Newton) Let us summarize the pros and cons of using the
Gauss-Newton approach:

Advantage of the Gauss-Newton method: second derivative of F not needed.
Drawback of the Gauss-Newton method: no local quadratic convergence.
y

EXAMPLE 8.7.2.4 (Non-linear fitting of data (II) → Ex. 8.7.0.1) Given data points (ti , yi ),
i = 1, . . . , m, we consider the non-linear data fitting problem (8.7.0.5) for the parameterized function

f ( x1 , x2 , x3 ; t) := x1 + x2 exp(− x3 t) .

This means that we face an (overdetermined) non-linear system F (x) = 0 with


 
$$F : \mathbb{R}^3 \to \mathbb{R}^m \;, \qquad F(x) := \begin{bmatrix} x_1 + x_2\exp(-x_3 t_1) - y_1 \\ \vdots \\ x_1 + x_2\exp(-x_3 t_m) - y_m \end{bmatrix} \;.$$

Computing partial derivatives, the Jacobian of F is seen to be


 
$$D F(x) = \begin{bmatrix} 1 & e^{-x_3 t_1} & -x_2 t_1 e^{-x_3 t_1} \\ \vdots & \vdots & \vdots \\ 1 & e^{-x_3 t_m} & -x_2 t_m e^{-x_3 t_m} \end{bmatrix} \;.$$

In this experiment we use data points (t j , y j ), j = 1, . . . , m, m = 21, t j = 1 + 0.3j, j = 1, . . . , 21 and


“random” y j generated by the following C++ code
Eigen::VectorXd gnrandinit(const Eigen::VectorXd &x) {
  std::srand((unsigned int)time(0));
  auto t = Eigen::VectorXd::LinSpaced((7.0 - 1.0) / 0.3 - 1, 1.0, 7.0);
  auto y = x(0) + x(1) * ((-x(2) * t).array().exp());
  return y + 0.1 * (Eigen::VectorXd::Random(y.size()).array() - 0.5);
}

We study the convergence of


• the Newton method from Section 8.7.1,
• its extension, damped Newton method (→ Section 8.5.4),
• and the Gauss-Newton method
for different initial guesses:
✦ initial value (1.8, 1.8, 0.1) T (red curves, blue curves)
✦ initial value (1.5, 1.5, 0.1) T (cyan curves, green curves)
The first experiment investigates the iterative solution of non-linear least squares data fitting problem by
means of the Newton method (8.7.1.6) and the damped Newton method from Code 8.5.4.5.


[Figs. 314, 315: norm of grad Φ(x(k)) and value of ||F(x(k))||₂² versus the number of steps of the undamped Newton method, for both initial guesses.]

Concerning the convergence behaviour of the plain Newton method we observe that
• for initial value (1.8, 1.8, 0.1) T (red curve) ➤ Newton method caught in local minimum,
• for initial value (1.5, 1.5, 0.1) T (cyan curve) ➤ fast (locally quadratic) convergence.
[Figs. 316, 317: norm of grad Φ(x(k)) and value of ||F(x(k))||₂² versus the number of steps of the damped Newton method, for both initial guesses.]

The observed convergence behavior of the damped Newton method is as follows:


• For initial value (1.8, 1.8, 0.1) T (red curve) ➤ fast (locally quadratic) convergence,
• For initial value (1.5, 1.5, 0.1) T (cyan curve) ➤ Newton method caught in local minimum.
The second experiment studies iterative solution of non-linear least squares data fitting problem by means
of the Gauss-Newton method (8.7.2.1), see Code 8.7.2.2.


[Figs. 318, 319: value of ||F(x(k))||₂² and norm of the corrector versus the number of steps of the Gauss-Newton method, for both initial guesses.]

For the Gauss-Newton method we observe linear convergence for both initial values (Refer to Def. 8.2.2.1,
Rem. 8.2.2.6 for “linear convergence” and how to see it in error plots).
In this experiment the convergence of the Gauss-Newton method is asymptotically clearly slower than that
of the Newton method, but less dependent on the choice of good initial guesses. This matches what is
often observed in practical non-linear fitting. y

8.7.3 Trust Region Method (Levenberg-Marquardt Method)


As in the case of Newton’s method for non-linear systems of equations, see Section 8.5.4, often overshoot-
ing of Gauss-Newton corrections occurs. We can resort to a similar remedy as in the case of Newton’s
method: damping of Gauss-Newton corrections.
Idea: damping of the Gauss-Newton correction in (8.7.2.1) using a penalty term:

$$\text{instead of}\quad \bigl\|F(x^{(k)}) + D F(x^{(k)})\,s\bigr\|_2^2 \quad\text{minimize}\quad \bigl\|F(x^{(k)}) + D F(x^{(k)})\,s\bigr\|_2^2 + \lambda\,\|s\|_2^2 \;,$$

where λ > 0 is a penalty parameter.


A central issue is the choice of the penalty parameter λ > 0. Since rigorous rules are elusive, heuristic strategies are used in practice, for instance

$$\lambda = \gamma\,\bigl\|F(x^{(k)})\bigr\|_2 \;, \qquad \gamma := \begin{cases} 10 \;, & \text{if } \|F(x^{(k)})\|_2 \ge 10 \;,\\ 1 \;, & \text{if } 1 < \|F(x^{(k)})\|_2 < 10 \;,\\ 0.01 \;, & \text{if } \|F(x^{(k)})\|_2 \le 1 \;. \end{cases} \qquad (8.7.3.1)$$

The minimization problem

$$s := \operatorname*{argmin}_{z\in\mathbb{R}^n} \bigl\|F(x^{(k)}) + D F(x^{(k)})\,z\bigr\|_2^2 + \lambda\,\|z\|_2^2$$

is close to a standard linear least-squares problem and its solution s can be obtained by solving a linear system of equations, which we get by setting the gradient of

$$\Psi(z) := \bigl\|F(x^{(k)}) + D F(x^{(k)})\,z\bigr\|_2^2 + \lambda\,\|z\|_2^2 = z^\top\bigl(D F(x^{(k)})^\top D F(x^{(k)}) + \lambda I\bigr)z + 2\,F(x^{(k)})^\top D F(x^{(k)})\,z + \bigl\|F(x^{(k)})\bigr\|_2^2 \;,$$
$$\operatorname{grad}\Psi(z) = 2\bigl(D F(x^{(k)})^\top D F(x^{(k)}) + \lambda I\bigr)z + 2\,D F(x^{(k)})^\top F(x^{(k)}) \;,$$

to zero. This leads to the following n × n linear system of normal equations for the damped Gauss-Newton correction s in the k-th step, see Thm. 3.1.2.1:

$$\bigl(D F(x^{(k)})^\top D F(x^{(k)}) + \lambda I\bigr)\,s = -D F(x^{(k)})^\top F(x^{(k)}) \;. \qquad (8.7.3.2)$$
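A single damped correction step according to (8.7.3.1)/(8.7.3.2) could be sketched in EIGEN as follows (the function name dampedGnStep and the argument names are placeholders, not part of the lecture's codes):

#include <Eigen/Dense>

// One damped Gauss-Newton step: solve (J^T J + lambda*I) s = -J^T F, cf. (8.7.3.2),
// with the heuristic choice (8.7.3.1) of the penalty parameter lambda.
Eigen::VectorXd dampedGnStep(const Eigen::VectorXd &x, const Eigen::VectorXd &Fx,
                             const Eigen::MatrixXd &Jx) {
  const double nrm = Fx.norm();
  const double gamma = (nrm >= 10.0) ? 10.0 : ((nrm > 1.0) ? 1.0 : 0.01);
  const double lambda = gamma * nrm;  // heuristic (8.7.3.1)
  const Eigen::Index n = Jx.cols();
  const Eigen::MatrixXd M =
      Jx.transpose() * Jx + lambda * Eigen::MatrixXd::Identity(n, n);
  const Eigen::VectorXd s = M.llt().solve(-Jx.transpose() * Fx);  // s.p.d. system (8.7.3.2)
  return x + s;
}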

Review question(s) 8.7.3.3 (Gauss-Newton method)


(Q8.7.3.3.A) What becomes of the Gauss-Newton method for solving F (x) = 0,

$$x^{(k+1)} := \operatorname*{argmin}_{x\in\mathbb{R}^n} \bigl\| F(x^{(k)}) + D F(x^{(k)})(x - x^{(k)}) \bigr\|_2^2 \;,$$

in the linear case, when F (x) = Ax − b, A ∈ R m,n , b ∈ R m ?


(Q8.7.3.3.B) What will you get when you apply the Gauss-Newton method

$$x^{(k+1)} := x^{(k)} - s \;, \qquad s := \operatorname*{argmin}_{s\in\mathbb{R}^n} \bigl\| -F(x^{(k)}) + D F(x^{(k)})\,s \bigr\|_2 \;.$$

to an n × n non-linear system of equations, that is, F : D ⊂ R n → R n . You may assume that D F (x(k) )
always has full rank.

Learning Outcomes
• Knowledge about concepts related to the speed of convergence of an iteration for solving a non-
linear system of equations.
• Ability to estimate type and orders of convergence from empiric data.
• Ability to predict asymptotic linear, quadratic and cubic convergence by inspection of the iteration
function.
• Familiarity with (damped) Newton’s method for general non-linear systems of equations and with the
secant method in 1D.
• Ability to derive the Newton iteration for an (implicitly) given non-linear system of equations.
• Knowledge about quasi-Newton method as multi-dimensional generalizations of the secant method.

Bibliography

[Ale12] A. Alexanderian. A basic note on iterative matrix inversion. Online document. 2012 (cit. on pp. 647, 650).
[AG11] Uri M. Ascher and Chen Greif. A first course in numerical methods. Vol. 7. Computational
Science & Engineering. Society for Industrial and Applied Mathematics (SIAM), Philadelphia,
PA, 2011, pp. xxii+552. DOI: 10.1137/1.9780898719987 (cit. on pp. 605, 609, 618, 624,
628, 665).
[BC17] Heinz H. Bauschke and Patrick L. Combettes. Convex analysis and monotone operator theory
in Hilbert spaces. Second. CMS Books in Mathematics/Ouvrages de Mathématiques de la
SMC. Springer, Cham, 2017, pp. xix+619. DOI: 10.1007/978-3-319-48311-5.
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 602, 607, 609, 613, 614, 618, 624, 628, 647, 665–675).
[DH03] P. Deuflhard and A. Hohmann. Numerical Analysis in Modern Scientific Computing. Vol. 43.
Texts in Applied Mathematics. Springer, 2003 (cit. on p. 651).
[Deu11] Peter Deuflhard. Newton methods for nonlinear problems. Vol. 35. Springer Se-
ries in Computational Mathematics. Heidelberg: Springer, 2011, pp. xii+424. DOI:
10.1007/978-3-642-23899-4 (cit. on pp. 640, 651, 655, 658, 664).
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 599, 602, 614,
624, 628, 630, 665).
[Mol04] C. Moler. Numerical Computing with MATLAB. Philadelphia, PA: SIAM, 2004 (cit. on p. 632).
[PS91] Victor Pan and Robert Schreiber. “An Improved Newton Iteration for the Generalized Inverse
of a Matrix, with Applications”. In: SIAM Journal on Scientific and Statistical Computing 12.5
(1991), pp. 1109–1130. DOI: 10.1137/0912058 (cit. on pp. 647, 650).
[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Math-
ematics. New York: Springer, 2000 (cit. on pp. 602, 609, 620, 665).
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on pp. 600, 601,
603, 610, 613, 614, 616, 618, 623, 630, 637, 642, 646).
[Wer92] J. Werner. Numerische Mathematik I. Lineare und nichtlineare Gleichungssysteme, Interpola-
tion, numerische Integration. vieweg studium. Aufbaukurs Mathematik. Braunschweig: Vieweg,
1992 (cit. on p. 665).

Chapter 9

Computation of Eigenvalues and Eigenvectors

Supplementary literature. [Bai+00] offers a comprehensive presentation of numerical methods for the solution of eigenvalue problems from an algorithmic point of view.


EXAMPLE 9.0.0.1 (Resonances of linear electric circuits)
Simple electric circuit, cf. Ex. 2.1.0.3 (Fig. 320):
✦ linear components (resistors, coils, capacitors) only,
✦ time-harmonic excitation (alternating voltage/current),
✦ "frequency domain" circuit model.

[Fig. 320: circuit with voltage source U, resistor R, capacitors C, and inductors L connecting the nodes ➀, ➁, ➂.]

Ex. 2.1.0.3: nodal analysis of linear (↔ composed of resistors, inductors, capacitors) electric circuit in
frequency domain (at angular frequency ω > 0) , see (2.1.0.6)

➣ linear system of equations for nodal potentials with complex system matrix A
For circuit of Fig. 320: three unknown nodal potentials
➣ system matrix from nodal analysis at angular frequency ω > 0:
 1 1 
ıωC + ıωL − ıωL 0
A =  − ıωL 1
ıωC + R1 + ıωL
2 1
− ıωL 
1 1
0 − ıωL ıωC + ıωL
     1 
0 0 0 C 0 0 L − L1 0
= 0 R1 0 + ıω  0 C 0  + 1/ıω − L1 L2 − L1  .
0 0 0 0 0 C 0 − L1 1
L

A(ω ) := W + iωC − iω −1 S , W, C, S ∈ R n,n symmetric . (9.0.0.2)


[Fig. 321: maximum nodal potentials |u1|, |u2|, |u3| versus the angular frequency ω of the source voltage U, for R = L = C = 1 (scaled model); some nodal potentials blow up for certain ω.]

resonant frequencies = { ω ∈ R : A(ω) singular }

If the circuit is operated at a real resonant frequency, the circuit equations will not possess a solution. Of
course, the real circuit will always behave in a well-defined way, but the linear model will break down due
to extremely large currents and voltages. In an experiment this breakdown manifests itself as a rather
explosive meltdown of circuits components. Hence, it is vital to determine resonant frequencies of circuits
in order to avoid their destruction.

➥ relevance of numerical methods for solving:


$$\text{Find } \omega \in \mathbb{C}\setminus\{0\}: \quad W + \imath\omega C + \frac{1}{\imath\omega}S \;\text{ singular} \;.$$
This is a quadratic eigenvalue problem: find x ≠ 0, ω ∈ C \ {0},
$$A(\omega)x = \Bigl(W + \imath\omega C + \frac{1}{\imath\omega}S\Bigr)x = 0 \;. \qquad (9.0.0.3)$$
Substitution: y = (1/(ıω)) x ↔ x = ıωy [TM01, Sect. 3.4]:
$$(9.0.0.3) \;\Leftrightarrow\; \underbrace{\begin{bmatrix} W & S \\ I & 0 \end{bmatrix}}_{:=M} \underbrace{\begin{bmatrix} x \\ y \end{bmatrix}}_{:=z} = \omega\,\underbrace{\begin{bmatrix} -\imath C & 0 \\ 0 & -\imath I \end{bmatrix}}_{:=B} \begin{bmatrix} x \\ y \end{bmatrix} \qquad (9.0.0.4)$$

➣ generalized linear eigenvalue problem of the form: find ω ∈ C, z ∈ C2n \ {0} such that

Mz = ωBz . (9.0.0.5)

In this example one is mainly interested in the eigenvalues ω , whereas the eigenvectors z usually need
not be computed.


[Fig. 322: resonant frequencies in the complex plane (Re(ω) versus Im(ω)) for the circuit from Fig. 320, R = C = L = 1, including decaying modes with Im(ω) > 0.]
y

EXAMPLE 9.0.0.6 (Analytic solution of homogeneous linear ordinary differential equations →


[Str09, Remark 5.6.1], [Gut09, Sect. 10.1],[NS02, Sect. 8.1], [DR08, Ex. 7.3])
Autonomous homogeneous linear ordinary differential equation (ODE):

ẏ = Ay , A ∈ C n,n . (9.0.0.7)

 
$$A = S \underbrace{\begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{bmatrix}}_{=:D} S^{-1} \;, \quad S \in \mathbb{C}^{n,n} \text{ regular} \quad \overset{z = S^{-1}y}{\Longrightarrow} \quad \dot{y} = Ay \;\longleftrightarrow\; \dot{z} = Dz \;.$$

➣ solution of initial value problem:

ẏ = Ay , y(0) = y0 ∈ C n ⇒ y(t) = Sz(t) , ż = Dz , z(0) = S−1 y0 .

The initial value problem for the decoupled homogeneous linear ODE ż = Dz has a simple analytic
solution
 
$$z_i(t) = \exp(\lambda_i t)\,(z_0)_i = \exp(\lambda_i t)\,\bigl((S^{-1})_{i,:}\bigr)^T y_0 \;.$$

In light of Rem. 1.3.1.3:


 
$$A = S \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{bmatrix} S^{-1} \quad\Leftrightarrow\quad A\bigl((S)_{:,i}\bigr) = \lambda_i\,\bigl((S)_{:,i}\bigr) \;, \quad i = 1, \dots, n \;. \qquad (9.0.0.8)$$

In order to find the transformation matrix S all non-zero solution vectors (= eigenvectors) x ∈ C n of the
linear eigenvalue problem

Ax = λx

have to be found.
y
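As a hedged illustration of Ex. 9.0.0.6, the following EIGEN sketch evaluates y(t) = S exp(Dt) S⁻¹ y0 numerically, assuming that A is diagonalizable and that EigenSolver succeeds; the function name evolveLinearOde is a placeholder.

#include <Eigen/Dense>
#include <complex>

// Sketch: evaluate the solution y(t) of y' = Ay, y(0) = y0, via diagonalization
// A = S * diag(lambda_1,...,lambda_n) * S^{-1}, assuming A is diagonalizable.
Eigen::VectorXcd evolveLinearOde(const Eigen::MatrixXd &A, const Eigen::VectorXd &y0,
                                 double t) {
  const Eigen::EigenSolver<Eigen::MatrixXd> es(A);  // eigenvalues and eigenvectors
  const Eigen::MatrixXcd S = es.eigenvectors();
  const Eigen::VectorXcd lambda = es.eigenvalues();
  // z(0) = S^{-1} y0
  const Eigen::VectorXcd z0 = S.lu().solve(y0.cast<std::complex<double>>());
  // z_i(t) = exp(lambda_i * t) * z_i(0)
  const std::complex<double> tc(t, 0.0);
  const Eigen::VectorXcd zt = ((lambda.array() * tc).exp() * z0.array()).matrix();
  return S * zt;  // y(t) = S z(t)
}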
Contents


9.1 Theory of eigenvalue problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680


9.2 “Direct” Eigensolvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 682
9.3 Power Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685
9.3.1 Direct power method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685
9.3.2 Inverse Iteration [DR08, Sect. 7.6], [QSS00, Sect. 5.3.2] . . . . . . . . . . . . . 692
9.3.3 Preconditioned inverse iteration (PINVIT) . . . . . . . . . . . . . . . . . . . 702
9.3.4 Subspace iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 705
9.4 Krylov Subspace Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 716

9.1 Theory of eigenvalue problems

Supplementary literature. [NS02, Ch. 7], [Gut09, Ch. 9], [QSS00, Sect. 1.7]

Definition 9.1.0.1. Eigenvalues and eigenvectors → [NS02, Sects. 7.1, 7.2], [Gut09,
Sect. 9.1]

• λ ∈ C eigenvalue (ger.: Eigenwert) of A ∈ K n,n :⇔ det(λI − A) = 0 (λ ↦ det(λI − A) is the characteristic polynomial χ(λ))
• spectrum of A ∈ K n,n : σ(A) := {λ ∈ C : λ eigenvalue of A}
• eigenspace (ger.: Eigenraum) associated with eigenvalue λ ∈ σ(A): EigAλ := N(λI − A)
• x ∈ EigAλ \ {0} ⇒ x is eigenvector
• Geometric multiplicity (ger.: Vielfachheit) of an eigenvalue λ ∈ σ(A): m(λ) := dim EigAλ

Two simple facts:

λ ∈ σ(A) ⇒ dim EigAλ > 0 , (9.1.0.2)


T n,n T
det(A) = det(A ) ∀A ∈ K ⇒ σ(A) = σ(A ) . (9.1.0.3)

✎ notation: ρ(A) := max{|λ| : λ ∈ σ(A)} =ˆ spectral radius of A ∈ K n,n

Theorem 9.1.0.4. Bound for spectral radius

For any matrix norm k·k induced by a vector norm (→ Def. 1.5.5.10)

ρ(A) ≤ kAk .

Proof. Let z ∈ C n \ {0} be an eigenvector to the largest (in modulus) eigenvalue λ of A ∈ C n,n . Then

$$\|A\| := \sup_{x\in\mathbb{C}^n\setminus\{0\}} \frac{\|Ax\|}{\|x\|} \ge \frac{\|Az\|}{\|z\|} = |\lambda| = \rho(A) \;.$$


Lemma 9.1.0.5. Gershgorin circle theorem → [DR08, Thm. 7.13], [Han02, Thm. 32.1],
[QSS00, Sect. 5.1]

For any A ∈ K n,n holds true


$$\sigma(A) \subset \bigcup_{j=1}^n \Bigl\{ z \in \mathbb{C} : |z - a_{jj}| \le \sum\nolimits_{i\ne j} |a_{ji}| \Bigr\} \;.$$

Lemma 9.1.0.6. Similarity and spectrum → [Gut09, Thm. 9.7], [DR08, Lemma 7.6], [NS02,
Thm. 7.2]

The spectrum of a matrix is invariant with respect to similarity transformations:

∀A ∈ K n,n : σ(S−1 AS) = σ(A) ∀ regular S ∈ K n,n .

Lemma 9.1.0.7.
Existence of a one-dimensional invariant subspace

∀C ∈ C n,n : ∃u ∈ C n : C(Span{u}) ⊂ Span{u} .

Theorem 9.1.0.8. Schur normal form → [Hac94, Thm .2.8.1]

∀A ∈ K n,n : ∃U ∈ C n,n unitary: U H AU = T with T ∈ C n,n upper triangular .

Corollary 9.1.0.9. Principal axis transformation

A ∈ K n,n , AA H = A H A: ∃U ∈ C n,n unitary: U H AU = diag(λ1 , . . . , λn ) , λi ∈ C .

A matrix A ∈ K n,n with AA H = A H A is called normal.


Examples of normal matrices are
• Hermitian matrices: A^H = A ➤ σ(A) ⊂ R,
• unitary matrices: A^H = A⁻¹ ➤ |σ(A)| = 1,
• skew-Hermitian matrices: A = −A^H ➤ σ(A) ⊂ iR.

Normal matrices can be diagonalized by unitary similarity transformations

Symmetric real matrices can be diagonalized by orthogonal similarity transformations

In Cor. 9.1.0.9: – λ1 , . . . , λn = eigenvalues of A


– Columns of U = orthonormal basis of eigenvectors of A


Classes of relevant eigenvalue problems (EVP):


➊ Given A ∈ K n,n find all eigenvalues (= spectrum of A).
➋ Given A ∈ K n,n find σ (A) plus all eigenvectors.
➌ Given A ∈ K n,n find a few eigenvalues and associated eigenvectors.

(Linear) generalized eigenvalue problem:

Given A ∈ C n,n , regular B ∈ C n,n , seek x 6= 0, λ ∈ C

Ax = λBx ⇔ B−1 Ax = λx . (9.1.0.10)

ˆ generalized eigenvector, λ =
x= ˆ generalized eigenvalue

Obviously every generalized eigenvalue problem is equivalent to a standard eigenvalue problem

Ax = λBx ⇔ B⁻¹Ax = λx .

However, usually it is not advisable to use this equivalence for numerical purposes!
Remark 9.1.0.11 (Generalized eigenvalue problems and Cholesky factorization)
If B = B H s.p.d. (→ Def. 1.1.2.6) with Cholesky factorization B = R H R

$$Ax = \lambda Bx \quad\Leftrightarrow\quad \widetilde{A}y = \lambda y \;, \quad\text{where}\quad \widetilde{A} := R^{-H}AR^{-1} \;, \quad y := Rx \;.$$

➞ This transformation can be used for efficient computations. y
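A hand-coded EIGEN sketch of this transformation for symmetric A and s.p.d. B is given below (the function name generalizedEigenvalues is ad hoc); Eigen's GeneralizedSelfAdjointEigenSolver provides the same functionality ready-made.

#include <Eigen/Dense>

// Sketch: solve the generalized EVP Ax = lambda*Bx for symmetric A and s.p.d. B
// via the Cholesky factorization B = L L^T (i.e. R = L^T), cf. Rem. 9.1.0.11.
Eigen::VectorXd generalizedEigenvalues(const Eigen::MatrixXd &A, const Eigen::MatrixXd &B) {
  const Eigen::LLT<Eigen::MatrixXd> llt(B);     // B = L * L^T
  const Eigen::MatrixXd L = llt.matrixL();
  const Eigen::MatrixXd X = L.triangularView<Eigen::Lower>().solve(A);  // X = L^{-1} A
  // Atilde = L^{-1} A L^{-T} = R^{-H} A R^{-1}, still symmetric
  const Eigen::MatrixXd Atilde = L.triangularView<Eigen::Lower>().solve(X.transpose());
  const Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> es(Atilde);
  return es.eigenvalues();                      // = generalized eigenvalues lambda
}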

9.2 “Direct” Eigensolvers


Purpose: solution of eigenvalue problems ➊, ➋ for dense matrices “up to machine precision”

M ATLAB-function: eig

d = eig(A) : computes spectrum σ (A) = {d1 , . . . , dn } of A ∈ C n,n


[V,D] = eig(A) : computes V ∈ C n,n , diagonal D ∈ C n,n such that AV = VD

Remark 9.2.0.1 (QR-Algorithm → [GV89, Sect. 7.5], [NS02, Sect. 10.3],[Han02, Ch. 26],[QSS00,
Sect. 5.5-5.7])
Note: All “direct” eigensolvers are iterative methods
Idea: Iteration based on successive unitary similarity transformations


A = A(0) −→ A(1) −→ . . . −→ { diagonal matrix, if A = A^H ; upper triangular matrix, else } (→ Thm. 9.1.0.8)

(superior stability of unitary transformations, see ??)


QR-algorithm (with shift)

✦ in general: quadratic convergence,
✦ cubic convergence for normal matrices (→ [GV89, Sect. 7.5, 8.2]).

MATLAB-code 9.2.0.2: QR-algorithm with shift

Computational cost: O(n³) operations per step of the QR-algorithm

Library implementations of the QR-algorithm provide numerically stable eigensolvers (→ Def. 1.5.5.19).
y
y

Remark 9.2.0.3 (Unitary similarity transformation to tridiagonal form)


Successive Householder similarity transformations of A = A H :

(➞ =ˆ affected rows/columns, =ˆ targeted vector)

[Schematic: the sequence of Householder similarity transformations successively annihilates the entries outside the tridiagonal band, column by column.]

transformation to tridiagonal form ! (for general matrices a similar strategy can achieve a similarity
transformation to upper Hessenberg form)

this transformation is used as a preprocessing step for QR-algorithm ➣ eig. y

Similar functionality for generalized EVP Ax = λBx, A, B ∈ C n,n


d = eig(A,B) : computes all generalized eigenvalues
[V,D] = eig(A,B) : computes V ∈ C n,n , diagonal D ∈ C n,n such that AV = BVD

Note: (Generalized) eigenvectors can be recovered as columns of V:


AV = VD ⇔ A(V):,i = (D)i,i V:,i ,
if D = diag(d1 , . . . , dn ).
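In C++/EIGEN (not used in the MATLAB-centric listings of this section), comparable functionality is provided, e.g., by EigenSolver, SelfAdjointEigenSolver, and GeneralizedSelfAdjointEigenSolver; a brief hedged sketch with ad-hoc variable names:

#include <Eigen/Dense>

// Sketch: EIGEN counterparts of d = eig(A), [V,D] = eig(A), and eig(A,B) (A = A^T, B s.p.d.)
void eigenSolverDemo(const Eigen::MatrixXd &A, const Eigen::MatrixXd &B) {
  Eigen::EigenSolver<Eigen::MatrixXd> es(A);              // general (possibly complex) spectrum
  Eigen::VectorXcd d = es.eigenvalues();                  // ~ d = eig(A)
  Eigen::MatrixXcd V = es.eigenvectors();                 // ~ [V,D] = eig(A)

  Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> saes(A); // for A = A^T: real spectrum
  Eigen::VectorXd dsym = saes.eigenvalues();

  // Generalized EVP Ax = lambda*Bx for A = A^T, B s.p.d. ~ eig(A,B)
  Eigen::GeneralizedSelfAdjointEigenSolver<Eigen::MatrixXd> ges(A, B);
  Eigen::VectorXd dgen = ges.eigenvalues();
}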

Remark 9.2.0.4 (Computational effort for eigenvalue computations)


Computational effort (#elementary operations) for eig():
eigenvalues & eigenvectors of A ∈ K n,n : ∼ 25n³ + O(n²)
only eigenvalues of A ∈ K n,n : ∼ 10n³ + O(n²)
eigenvalues and eigenvectors of A = A^H ∈ K n,n : ∼ 9n³ + O(n²)
only eigenvalues of A = A^H ∈ K n,n : ∼ 4/3 n³ + O(n²)
only eigenvalues of tridiagonal A = A^H ∈ K n,n : ∼ 30n² + O(n)
In all but the tridiagonal case the cost is O(n³)!
Note: eig not available for sparse matrix arguments.
Exception: d=eig(A) for sparse Hermitian matrices.


EXAMPLE 9.2.0.5 (Runtimes of eig)

MATLAB-code 9.2.0.6:
A = rand(500,500); B = A'*A; C = gallery('tridiag',500,1,3,1);

➤ ✦ A generic dense matrix


✦ B symmetric (s.p.d. → Def. 1.1.2.6) matrix
✦ C s.p.d. tridiagonal matrix

M ATLAB-code 9.2.0.7: measuring runtimes of eig

[Figs. 323–326: runtimes of d = eig(·) and [V,D] = eig(·) versus matrix size n, for a generic dense random matrix, the s.p.d. matrix B, a random Hermitian matrix, and a tridiagonal Hermitian matrix; the observed scalings are O(n³) for the dense cases and O(n²) for the tridiagonal Hermitian case.]

For the sake of efficiency: think which information you really need when computing eigenvalues/eigen-

vectors of dense matrices
Potentially more efficient methods for sparse matrices will be introduced below in Section 9.3, 9.4.
y


9.3 Power Methods


9.3.1 Direct power method

Supplementary literature. [DR08, Sect. 7.5], [QSS00, Sect. 5.3.1], [QSS00, Sect. 5.3]

EXAMPLE 9.3.1.1 ((Simplified) Page rank algorithm → [Gle15; LM06])


Model: Random surfer visits a web page, stays there for fixed time ∆t, and then
➊ either follows each of ℓ links on a page with probability 1/ℓ,
➋ or resumes surfing at a randomly (with equal probability) selected page
Option ➋ is chosen with probability d, 0 ≤ d ≤ 1, option ➊ with probability 1 − d.
Stationary Markov chain, state space =
ˆ set of all web pages
Question: Fraction of time spent by random surfer on i-th page (= page rank xi ∈ [0, 1])
This number ∈]0, 1[ can be used to gauge the “importance” of a web page, which, in turns, offers a way
to sort the hits resulting from a keyword query: the GOOGLE idea.
Method: Stochastic simulation ✄

M ATLAB-code 9.3.1.2: stochastic page rank simulation

Explanations Code 9.3.1.2:


✦ ??: rand generates uniformly distributed pseudo-random numbers ∈ [0, 1[
✦ Web graph encoded in G ∈ {0, 1} N,N :

(G)ij = 1 ⇒ link j → i ,

[Figs. 327, 328: page rank (relative visit time) of the harvard500 pages after 100000 and 1000000 hops of the stochastic simulation.]

Observation: relative visit times stabilize as the number of hops in the stochastic simulation → ∞.

The limit distribution is called stationary distribution/invariant measure of the Markov chain. This is what
we seek.

✦ Numbering of pages 1, . . . , N , ℓi =
ˆ number of links from page i


✦ N × N-matrix of transition probabilities page j → page i: A = (a_ij)_{i,j=1}^{N,N} ∈ R^{N,N},
a_ij ∈ [0, 1] =ˆ probability to jump from page j to page i.

$$\Rightarrow\quad \sum_{i=1}^N a_{ij} = 1 \;. \qquad (9.3.1.3)$$

A matrix A ∈ [0, 1] N,N with the property (9.3.1.3) is called a (column) stochastic matrix.

“Meaning” of A: given x ∈ [0, 1] N , kxk1 = 1, where xi is the probability of the surfer to visit page i,
i = 1, . . . , N , at an instance t in time, y = Ax satisfies
$$y_j \ge 0 \;, \qquad \sum_{j=1}^N y_j = \sum_{j=1}^N \sum_{i=1}^N a_{ji}x_i = \sum_{i=1}^N x_i \underbrace{\sum_{j=1}^N a_{ji}}_{=1} = \sum_{i=1}^N x_i = 1 \;,$$

y_j =ˆ probability for visiting page j at time t + ∆t.

Transition probability matrix for page rank

$$(A)_{ij} = \begin{cases} \dfrac{1}{N} \;, & \text{if } (G)_{ij} = 0 \;\;\forall i = 1,\dots,N \quad\text{(random jump to any other page)},\\[1ex] \dfrac{d}{N} + (1-d)\,\dfrac{(G)_{ij}}{\ell_j} \;, & \text{else} \quad\text{(follow link)}. \end{cases} \qquad (9.3.1.4)$$

MATLAB-code 9.3.1.5: transition probability matrix for page rank
Note: special treatment of zero columns of G, cf. (9.3.1.4)!
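While listing 9.3.1.5 is given in MATLAB in the lecture document, the construction (9.3.1.4) can also be sketched in C++/EIGEN as follows (dense matrices for simplicity; the function name pagerankMatrix and variable names are ad hoc):

#include <Eigen/Dense>

// Sketch: build the (column) stochastic page-rank transition matrix A from the
// 0/1 link matrix G according to (9.3.1.4); d = probability of a random jump.
Eigen::MatrixXd pagerankMatrix(const Eigen::MatrixXd &G, double d) {
  const Eigen::Index N = G.rows();
  Eigen::MatrixXd A(N, N);
  for (Eigen::Index j = 0; j < N; ++j) {
    const double lj = G.col(j).sum();     // number of links on page j
    if (lj == 0.0) {
      A.col(j).setConstant(1.0 / N);      // zero column: jump anywhere with equal probability
    } else {
      A.col(j) = Eigen::VectorXd::Constant(N, d / N) + ((1.0 - d) / lj) * G.col(j);
    }
  }
  return A;
}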

Stochastic simulation based on a single surfer is slow. Alternatives?

Thought experiment: Instead of a single random surfer we may consider m ∈ N, m ≫ 1, of them who
visit pages independently. The fraction of time m · T they all together spend on page i will obviously be
the same for T → ∞ as that for a single random surfer.
Instead of counting the surfers we watch the proportions of them visiting particular web pages at an instance of time. Thus, after the k-th hop we can assign a number x_i^{(k)} ∈ [0, 1] to web page i, which gives the proportion of surfers currently on that page: x_i^{(k)} := n_i^{(k)}/m, where n_i^{(k)} ∈ N₀ designates the number of surfers on page i after the k-th hop.
Now consider m → ∞. The law of large numbers suggests that the ("infinitely many") surfers visiting page j will move on to other pages proportional to the transition probabilities a_ij: in terms of proportions,
for m → ∞ the stochastic evolution becomes a deterministic discrete dynamical system and we find

$$x_i^{(k+1)} = \sum_{j=1}^N a_{ij}\,x_j^{(k)} \;, \qquad (9.3.1.6)$$

that is, the proportion of surfers ending up on page i equals the sum of the proportions on the “source
pages” weighted with the transition probabilities.


Notice that (9.3.1.6) amounts to a matrix×vector product. Thus, writing x(0) ∈ [0, 1]^N, ‖x(0)‖₁ = 1, for the initial distribution of the surfers on the net, we find that

x(k) = A^k x(0)

will be their mass distribution after k hops. If the limit exists, the i-th component of x∗ := lim_{k→∞} x(k) tells us
which fraction of the (infinitely many) surfers will be visiting page i most of the time. Thus, x∗ yields the
stationary distribution of the Markov chain.

M ATLAB-code 9.3.1.7: tracking fractions of many surfers

[Figs. 329, 330: page rank distribution x(k) for the harvard500 pages after k = 5 and k = 15 steps of the iteration (9.3.1.6).]

Comparison:
[Fig. 331: single-surfer stochastic simulation (1000000 hops); Fig. 332: power method after 5 steps, Code 9.3.1.7; both for the harvard500 pages.]


Observation: Convergence of the x(k) → x∗ , and the limit must be a fixed point of the iteration function:

➣ Ax∗ = x∗ ⇒ x∗ ∈ EigA1 .

Does A possess an eigenvalue = 1? Does the associated eigenvector really provide a probability distri-
bution (after scaling), that is, are all of its entries non-negative? Is this probability distribution unique? To
answer these questions we have to study the matrix A:


For every stochastic matrix A, by definition (9.3.1.3)


$$A^T\mathbf{1} = \mathbf{1} \;\overset{(9.1.0.3)}{\Rightarrow}\; 1 \in \sigma(A) \;, \qquad (1.5.5.14) \;\Rightarrow\; \|A\|_1 = 1 \;\overset{\text{Thm. 9.1.0.4}}{\Rightarrow}\; \rho(A) = 1 \;,$$

where ρ(A) is the spectral radius of the matrix A, see Section 9.1.

For r ∈ EigA1, that is, Ar = r, denote by |r| the vector (|ri |)iN=1 . Since all entries of A are non-negative,
we conclude by the triangle inequality that kArk1 ≤ kA|r|k1

kAxk1 kA|r|k1 kArk1


⇒ 1 = kAk1 = sup ≥ ≥ =1.
x ∈R N k x k1 k|r|k1 k r k1
if aij >0
⇒ kA|r|k1 = kArk1 ⇒ |r| = ±r .

Hence, different components of r cannot have opposite sign, which means, that r can be chosen to have
non-negative entries, if the entries of A are strictly positive, which is the case for A from (9.3.1.4). After
normalization krk1 = 1 the eigenvector can be regarded as a probability distribution on {1, . . . , N }.

If Ar = r and As = s with (r)i ≥ 0, (s)i ≥ 0, krk1 = ksk1 = 1, then A(r − s) = r − s. Hence,


by the above considerations, also all the entries of r − s are either non-negative or non-positive. By the
assumptions on r and s this is only possible, if r − s = 0. We conclude that

A ∈ ]0, 1] N,N stochastic ⇒ dim EigA1 = 1 . (9.3.1.8)

Sorting the pages according to the size of the corresponding entries in r yields the famous “page rank”.
Plot of the entries of the unique vector r ∈ R^N with 0 ≤ (r)_i ≤ 1, ‖r‖₁ = 1, Ar = r:

MATLAB-code 9.3.1.9: computing page rank vector r via eig (inefficient implementation!)

Fig. 333 (stochastic simulation, harvard500: 1000000 hops) and Fig. 334 (eigenvector computation, harvard500: Perron–Frobenius vector); horizontal axis: no. of page.

The possibility to compute the stationary probability distribution of a Markov chain through an eigenvector
of the transition probability matrix is due to a property of stationary Markov chains called ergodicity.

A =ˆ page rank transition probability matrix, see Code 9.3.1.5, d = 0.15, harvard500 example.

Errors plotted: ‖A^k x₀ − r‖₁, with x₀ = 1/N, N = 500.

We observe linear convergence! (→ Def. 8.2.2.1; iteration error vs. iteration count ≈ straight line in a lin-log plot)

Fig. 335: error 1-norm vs. iteration step.
y

The computation of page rank amounts to finding the eigenvector of the matrix A of transition probabilities
that belongs to its largest eigenvalue 1. This is addressed by an important class of practical eigenvalue
problems:

Task: given A ∈ K n,n , find largest (in modulus) eigenvalue of A


and (an) associated eigenvector.

Idea: (suggested by page rank computation, Code 9.3.1.7)


Iteration: z(k+1) = Az(k) , z(0) arbitrary

EXAMPLE 9.3.1.10 (Power iteration → Ex. 9.3.1.1)


Try the above iteration for general 10 × 10-matrix, largest eigenvalue 10, algebraic multiplicity 1.
MATLAB-code 9.3.1.11:

d = (1:10)'; n = length(d);
S = triu(diag(n:-1:1,0) + ones(n,n));
A = S*diag(d,0)*inv(S);

Fig. 336: error norm ‖z^(k)/‖z^(k)‖ − (S)_{:,10}‖ vs. iteration step k, z^(0) = random vector. (Note: (S)_{:,10} =ˆ eigenvector for eigenvalue 10.)

Observation: linear convergence of (normalized) eigenvectors!

Suggests direct power method (ger.: Potenzmethode): iterative method (→ Section 8.2)

    initial guess:  z^(0) “arbitrary” ,
    next iterate:   w := Az^(k−1) ,  z^(k) := w/‖w‖₂ ,  k = 1, 2, . . . .   (9.3.1.12)

Note: the “normalization” of the iterates in (9.3.1.12) does not change anything (in exact arithmetic) and
helps avoid overflow in floating point arithmetic.

Computational effort: 1× matrix×vector per step ➣ inexpensive for sparse matrices
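A minimal sketch of the iteration (9.3.1.12) (illustration only, not one of the numbered codes of this chapter; the function name is made up, and the Rayleigh quotient from Def. 9.3.1.16 below is used as eigenvalue approximation) might look as follows:

% Sketch of the direct power method (9.3.1.12) (illustration only):
% z0 = initial guess, maxit = number of steps; returns an approximation
% lmax of the dominant eigenvalue (a Rayleigh quotient, cf. Def. 9.3.1.16)
% and an approximate eigenvector z.
function [lmax,z] = powitsketch(A,z0,maxit)
z = z0/norm(z0); lmax = 0;
for k = 1:maxit
    w = A*z;             % 1 matrix x vector product per step
    lmax = dot(z,w);     % Rayleigh quotient rho_A(z), since ||z||_2 = 1
    z = w/norm(w);       % normalization helps avoid overflow
end
end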

A persuasive theoretical justification for the direct power method:

Assume A ∈ K^{n,n} to be diagonalizable:

    ⇔ ∃ basis {v_1, . . . , v_n} of eigenvectors of A: Av_j = λ_j v_j , λ_j ∈ C.

Assume

    |λ_1| ≤ |λ_2| ≤ · · · ≤ |λ_{n−1}| < |λ_n| ,   ‖v_j‖₂ = 1 .   (9.3.1.13)

Key observations for the power iteration (9.3.1.12):

    z^(k) = A^k z^(0)   (→ name “power method”) ,   (9.3.1.14)

    z^(0) = ∑_{j=1}^{n} ζ_j v_j   ⇒   z^(k) = ∑_{j=1}^{n} ζ_j λ_j^k v_j .   (9.3.1.15)

Due to (9.3.1.13), for large k ≫ 1 (⇒ |λ_n^k| ≫ |λ_j^k| for j ≠ n) the contribution of v_n (size ζ_n λ_n^k) in the eigenvector expansion (9.3.1.15) will be much larger than the contribution (size ζ_j λ_j^k) of any other eigenvector (if ζ_n ≠ 0): the eigenvector for λ_n will swamp all others for k → ∞.

Further, (9.3.1.15) nurtures the expectation that v_n will become dominant in z^(k) the faster, the better |λ_n| is separated from |λ_{n−1}|; see Thm. 9.3.1.21 for a rigorous statement.

z(k) → eigenvector, but how do we get the associated eigenvalue λn ?

When (9.3.1.12) has converged, two common ways to recover λ_max → [DR08, Alg. 7.20]:

➊  Az^(k) ≈ λ_max z^(k)   ➣   |λ_n| ≈ ‖Az^(k)‖ / ‖z^(k)‖   (modulus only!)

➋  λ_max ≈ argmin_{θ∈R} ‖Az^(k) − θ z^(k)‖₂²   ➤   λ_max ≈ ( (z^(k))^H A z^(k) ) / ‖z^(k)‖₂² .

This latter formula is extremely useful, which has earned it a special name:

Definition 9.3.1.16.
For A ∈ K^{n,n}, u ∈ K^n the Rayleigh quotient is defined by

    ρ_A(u) := (u^H A u) / (u^H u) .

An immediate consequence of the definitions:

    λ ∈ σ(A) ,  z ∈ EigAλ   ⇒   ρ_A(z) = λ .   (9.3.1.17)

EXAMPLE 9.3.1.18 (Direct power method → Ex. 9.3.1.10 cnt’d)

MATLAB-code 9.3.1.19:

d = (1:10)';
n = length(d); S = triu(diag(n:-1:1,0)+ones(n,n)); A = S*diag(d,0)*inv(S);

Fig. 337: errors vs. iteration step k, z^(0) = random vector;
o : error |λ_n − ρ_A(z^(k))|,
∗ : error norm ‖z^(k) − s_{·,n}‖,
+ : |λ_n − ‖Az^(k−1)‖₂/‖z^(k−1)‖₂| .

Test matrices:
① d=(1:10)’; ➣ |λn−1 | : |λn | = 0.9
② d = [ones(9,1); 2]; ➣ |λn−1 | : |λn | = 0.5
③ d = 1-2.^(-(1:0.5:5)’); ➣ |λn−1 | : |λn | = 0.9866

MATLAB-code 9.3.1.20: Investigating convergence of direct power method

Rate estimates:  ρ_EV^(k) := ‖z^(k) − s_{·,n}‖ / ‖z^(k−1) − s_{·,n}‖ ,   ρ_EW^(k) := |ρ_A(z^(k)) − λ_n| / |ρ_A(z^(k−1)) − λ_n| .

          ①                    ②                    ③
 k    ρ_EV^(k)  ρ_EW^(k)   ρ_EV^(k)  ρ_EW^(k)   ρ_EV^(k)  ρ_EW^(k)
 22   0.9102    0.9007     0.5000    0.5000     0.9900    0.9781
 23   0.9092    0.9004     0.5000    0.5000     0.9900    0.9791
 24   0.9083    0.9001     0.5000    0.5000     0.9901    0.9800
 25   0.9075    0.9000     0.5000    0.5000     0.9901    0.9809
 26   0.9068    0.8998     0.5000    0.5000     0.9901    0.9817
 27   0.9061    0.8997     0.5000    0.5000     0.9901    0.9825
 28   0.9055    0.8997     0.5000    0.5000     0.9901    0.9832
 29   0.9049    0.8996     0.5000    0.5000     0.9901    0.9839
 30   0.9045    0.8996     0.5000    0.5000     0.9901    0.9844

Observation: linear convergence (→ ??)   y

Theorem 9.3.1.21. Convergence of direct power method → [DR08, Thm. 25.1]

Let λ_n > 0 be the largest (in modulus) eigenvalue of A ∈ K^{n,n} and have (algebraic) multiplicity 1. Let v, y be the left and right eigenvectors of A for λ_n normalized according to ‖y‖₂ = ‖v‖₂ = 1. Then there is convergence

    ‖Az^(k)‖₂ → λ_n ,   z^(k) → ±v   linearly with rate   |λ_{n−1}| / |λ_n| ,

where z^(k) are the iterates of the direct power iteration and y^H z^(0) ≠ 0 is assumed.

Remark 9.3.1.22 (Initial guess for power iteration)

Roundoff errors ➤ y^H z^(0) ≠ 0 is always satisfied in practical computations.

Usual (not the best!) choice: z^(0) = random vector.   y

Remark 9.3.1.23 (Termination criterion for direct power iteration)  (→ Section 8.2.3)

Adaptation of the a posteriori termination criterion (8.3.2.20), “relative change” ≤ tol:

    min ‖z^(k) ± z^(k−1)‖ ≤ (1/L − 1)·tol ,
    | ‖Az^(k)‖/‖z^(k)‖ − ‖Az^(k−1)‖/‖z^(k−1)‖ | ≤ (1/L − 1)·tol ,   see (8.2.3.8),

with L =ˆ estimated rate of convergence.   y

9.3.2 Inverse Iteration [DR08, Sect. 7.6], [QSS00, Sect. 5.3.2]


EXAMPLE 9.3.2.1 (Image segmentation)
Given: gray-scale image: intensity matrix P ∈ {0, . . . , 255}^{m,n}, m, n ∈ N ((P)_{ij} ↔ pixel, 0 =ˆ black, 255 =ˆ white).

MATLAB-code 9.3.2.2: loading and displaying an image in MATLAB

(Fuzzy) task: Local segmentation

Find connected patches of the image of the same shade/color.

More general segmentation problem (non-local): identify parts of the image, not necessarily connected, with the same texture.

Next: Statement of the (rigorously defined) problem, cf. ??:

Preparation: Numbering of pixels 1, . . . , mn, e.g., lexicographic numbering:

✦ pixel set V := {1, . . . , nm}
✦ indexing: index(pixel_{i,j}) = (i − 1)n + j
✎ notation: p_k := (P)_{ij}, if k = index(pixel_{i,j}) = (i − 1)n + j, k = 1, . . . , N := mn

Local similarity matrix:

    W ∈ R^{N,N} ,  N := mn ,   (9.3.2.3)
    (W)_{ij} =  0 , if pixels i, j not adjacent,
                0 , if i = j ,
                σ(p_i, p_j) , if pixels i, j adjacent.

Similarity function, e.g., with α > 0:  σ(x, y) := exp(−α(x − y)²) ,  x, y ∈ R .

Fig. 338: lexicographic numbering of the pixels of the m × n grid (bottom row: 1, 2, 3, . . . , n; second row: n+1, n+2, . . . , 2n; top row: (m−1)n+1, . . . , mn); ↔ =ˆ adjacent pixels.
The entries of the matrix W measure the “similarity” of neighboring pixels: if (W)ij is large, they encode
(almost) the same intensity, if (W)ij is close to zero, then they belong to parts of the picture with very
different brightness. In the latter case, the boundary of the segment may separate the two pixels.

Definition 9.3.2.4. Normalized cut (→ [SM00, Sect. 2])

For X ⊂ V define the normalized cut as

    Ncut(X) := cut(X)/weight(X) + cut(X)/weight(V \ X) ,
    with  cut(X) := ∑_{i∈X, j∉X} w_ij ,   weight(X) := ∑_{i∈X, j∈X} w_ij .

In light of local similarity relationship:


• cut(X ) big ➣ substantial similarity of pixels across interface between X and V \ X .
• weight(X ) big ➣ a lot of similarity of adjacent pixels inside X .
Segmentation problem (rigorous statement):

find X ∗ ⊂ V : X ∗ = argmin Ncut(X ) . (9.3.2.5)


X ⊂V

NP-hard combinatorial optimization problem !

Fig. 339 (image), Fig. 340 (scanning rectangles), Fig. 341 (minimum Ncut rectangle); Fig. 342: Ncut(X) for pixel subsets X defined by sliding rectangles, see Fig. 340.

Equivalent reformulation:

    indicator function:  z : {1, . . . , N} → {−1, 1} ,  z_i := z(i) = { 1, if i ∈ X ;  −1, if i ∉ X } ,   (9.3.2.6)

    Ncut(X) = ( ∑_{z_i>0, z_j<0} −w_ij z_i z_j ) / ( ∑_{z_i>0} d_i ) + ( ∑_{z_i>0, z_j<0} −w_ij z_i z_j ) / ( ∑_{z_i<0} d_i ) ,   (9.3.2.7)

    d_i = ∑_{j∈V} w_ij = weight({i}) .   (9.3.2.8)

Sparse matrices:

    D := diag(d_1, . . . , d_N) ∈ R^{N,N} ,   A := D − W = A^⊤ .   (9.3.2.9)

Summary: (obvious) properties of these matrices

✦ A has positive diagonal and non-positive off-diagonal entries.


✦ A is diagonally dominant (→ Def. 2.8.0.8) ➣ A is positive semidefinite by Lemma 2.8.0.12.
✦ A has row sums = 0:
1⊤ A = A1 = 0 . (9.3.2.10)

M ATLAB-code 9.3.2.11: assembly of A, D
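A sketch of what such an assembly routine could look like (illustration only, not the original Code 9.3.2.11; assumptions: 4-neighbor adjacency and the similarity function σ(x, y) = exp(−α(x − y)²) from above; the name imgsegmat matches the call in Code 9.3.2.30, but the body shown here is only a guess):

% Sketch of an assembly routine for W, D, A from (9.3.2.3), (9.3.2.9).
% Assumptions (not from the original listing): 4-neighbor adjacency,
% sigma(x,y) = exp(-alpha*(x-y)^2), default alpha chosen arbitrarily.
function [A,D] = imgsegmat(P,alpha)
if (nargin < 2), alpha = 1/100; end
P = double(P); [m,n] = size(P); N = m*n;
sigma = @(x,y) exp(-alpha*(x-y).^2);
idx = @(i,j) (i-1)*n + j;              % lexicographic numbering as above
ti = []; tj = []; tw = [];             % triplets for the sparse matrix W
for i = 1:m
  for j = 1:n
    if (i < m)                         % vertical neighbor
      w = sigma(P(i,j),P(i+1,j));
      ti = [ti, idx(i,j), idx(i+1,j)]; tj = [tj, idx(i+1,j), idx(i,j)];
      tw = [tw, w, w];
    end
    if (j < n)                         % horizontal neighbor
      w = sigma(P(i,j),P(i,j+1));
      ti = [ti, idx(i,j), idx(i,j+1)]; tj = [tj, idx(i,j+1), idx(i,j)];
      tw = [tw, w, w];
    end
  end
end
W = sparse(ti,tj,tw,N,N);
d = full(sum(W,2));                    % d_i = weight({i}), see (9.3.2.8)
D = spdiags(d,0,N,N);
A = D - W;                             % A = D - W, see (9.3.2.9)
end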

Lemma 9.3.2.12. Ncut and Rayleigh quotient (→ [SM00, Sect. 2])

With z ∈ {−1, 1}^N according to (9.3.2.6) there holds

    Ncut(X) = (y^⊤ A y) / (y^⊤ D y) ,   y := (1 + z) − β(1 − z) ,   β := ( ∑_{z_i>0} d_i ) / ( ∑_{z_i<0} d_i ) ,

i.e., Ncut(X) equals the generalized Rayleigh quotient ρ_{A,D}(y).

Proof. Note that by (9.3.2.6) (1 − z)_i = 0 ⇔ i ∈ X, (1 + z)_i = 0 ⇔ i ∉ X. Hence, since (1 + z)^⊤ D (1 − z) = 0,

    4 Ncut(X) = (1 + z)^⊤ A (1 + z) ( 1/(κ 1^⊤D1) + 1/((1 − κ) 1^⊤D1) ) = (y^⊤ A y) / (β 1^⊤ D 1) ,

where κ := ∑_{z_i>0} d_i / ∑_i d_i = β/(1 + β). Also observe

    y^⊤ D y = (1 + z)^⊤ D (1 + z) + β² (1 − z)^⊤ D (1 − z) = 4( ∑_{z_i>0} d_i + β² ∑_{z_i<0} d_i ) = 4β ∑_i d_i = 4β 1^⊤ D 1 .

This finishes the proof.


✦ (9.3.2.10) ⇒ 1 ∈ EigA0
✦ Lemma 2.8.0.12: A diagonally dominant =⇒ A is positive semidefinite (→ Def. 1.1.2.6)
Ncut(X ) ≥ 0 and 0 is the smallest eigenvalue of A.

However, we are by no means interested in a minimizer y ∈ Span{1} (with constant entries) that does
not provide a meaningful segmentation.

Idea: weed out undesirable constant minimizers by imposing orthogonality


constraint (orthogonality w.r.t. inner product induced by D, cf. Sec-
tion 10.1)

y ⊥ D1 ⇔ 1⊤ Dy = 0 . (9.3.2.13)

    segmentation problem (9.3.2.5)  ⇔  argmin_{y ∈ {2,−2β}^N, 1^⊤Dy=0} ρ_{A,D}(y)   (9.3.2.14)   — still NP-hard!

➣ Minimizing Ncut(X) amounts to minimizing a (generalized) Rayleigh quotient (→ Def. 9.3.1.16) over a discrete set of vectors, which is still an NP-hard problem.

Idea: Relaxation

Discrete optimization problem → continuous optimization problem

    (9.3.2.14)  →  argmin_{y ∈ R^N, y≠0, 1^⊤Dy=0} ρ_{A,D}(y) .   (9.3.2.15)

Task: (9.3.2.15) ⇔ find the minimizer of a (generalized) Rayleigh quotient under a linear constraint.

Here: linear constraint on y: 1^⊤Dy = 0


The next theorem establishes a link between argument vectors that render the Rayleigh quotient extremal
and eigenspace for extremal eigenvalues.

Theorem 9.3.2.16. Rayleigh quotient theorem

Let λ_1 < λ_2 < · · · < λ_m, m ≤ n, be the sorted sequence of all (real!) eigenvalues of A = A^H ∈ C^{n,n}. Then

    EigAλ_1 = argmin_{y∈C^n\{0}} ρ_A(y)   and   EigAλ_m = argmax_{y∈C^n\{0}} ρ_A(y) .

Remark 9.3.2.17 (Min-max theorem)

Thm. 9.3.2.16 is an immediate consequence of the following more general and fundamentally important result.

Theorem 9.3.2.18. Courant-Fischer min-max theorem → [GV89, Thm. 8.1.2]

Let λ_1 < λ_2 < · · · < λ_m, m ≤ n, be the sorted sequence of the (real!) eigenvalues of A = A^H ∈ C^{n,n}. Write

    U_0 := {0} ,   U_ℓ := ∑_{j=1}^{ℓ} EigAλ_j ,  ℓ = 1, . . . , m ,   and   U_ℓ^⊥ := { x ∈ C^n : u^H x = 0  ∀u ∈ U_ℓ } .

Then

    min_{y ∈ U_{ℓ−1}^⊥ \ {0}} ρ_A(y) = λ_ℓ ,  1 ≤ ℓ ≤ m ,    argmin_{y ∈ U_{ℓ−1}^⊥ \ {0}} ρ_A(y) ⊂ EigAλ_ℓ .

Proof. For diagonal A ∈ R^{n,n} the assertion of the theorem is obvious. Thus, Cor. 9.1.0.9 settles everything.

A simple conclusion from Thm. 9.3.2.18: If A = A^⊤ ∈ R^{n,n} with eigenvalues λ_1 ≤ λ_2 ≤ · · · ≤ λ_n, then

    λ_1 = min_{z∈R^n} ρ_A(z) ,   λ_2 = min_{z∈R^n, z⊥v_1} ρ_A(z) ,   (9.3.2.19)

where v_1 ∈ EigAλ_1 \ {0}.   y

Well, in Lemma 9.3.2.12 we encounter a generalized Rayleigh quotient ρ_{A,D}(y)! How can Thm. 9.3.2.16 be applied to it?

Transformation idea:   ρ_{A,D}(D^{−1/2}z) = ρ_{D^{−1/2}AD^{−1/2}}(z) ,  z ∈ R^n .   (9.3.2.20)

Apply Thm. 9.3.2.18 to the transformed matrix Ã := D^{−1/2}AD^{−1/2}. Elementary manipulations show

    (9.3.2.15)  ⇔  argmin_{1^⊤Dy=0} ρ_{A,D}(y)  = [substituting z = D^{1/2}y]  argmin_{1^⊤D^{1/2}z=0} ρ_{A,D}(D^{−1/2}z)
                = argmin_{1^⊤D^{1/2}z=0} ρ_{Ã}(z)   with  Ã := D^{−1/2}AD^{−1/2} .   (9.3.2.21)

Related: transformation of a generalized eigenvalue problem into a standard eigenvalue problem according to

    Ax = λBx   ⇒ [z = B^{1/2}x]   B^{−1/2}AB^{−1/2}z = λz ,   (9.3.2.22)

B^{1/2} =ˆ square root of the s.p.d. matrix B → Rem. 10.3.0.2.

For the segmentation problem: B = D is diagonal with positive diagonal entries, see (9.3.2.9) ➥ D^{−1/2} = diag(d_1^{−1/2}, . . . , d_N^{−1/2}) and Ã := D^{−1/2}AD^{−1/2} can easily be computed.

In the sequel consider the minimization problem/related eigenvalue problem

    z* = argmin_{1^⊤D^{1/2}z=0} ρ_{Ã}(z)   ←→   Ãz = λz .   (9.3.2.23)

Recover the solution y* of (9.3.2.15) as y* = D^{−1/2}z*.

Still, Thm. 9.3.2.16 cannot be applied to (9.3.2.23): how to deal with the constraint 1^⊤D^{1/2}z = 0?

Idea: Penalization

Add a term P(z) to ρ_{Ã}(z) that becomes “sufficiently large” in case the constraint is violated.

z* can only be a minimizer of ρ_{Ã}(z) + P(z) if P(z) = 0.

How to choose the penalty function P(z) for the segmentation problem?

The requirement  { 1^⊤D^{1/2}z ≠ 0  ⇒  P(z) > 0 }  is satisfied for   P(z) = µ |1^⊤D^{1/2}z|² / ‖z‖₂² ,   with penalty parameter µ > 0.

Penalized minimization problem (D^{1/2}11^⊤D^{1/2} is a dense rank-1 matrix):

    z* = argmin_{z∈R^N\{0}} ρ_{Ã}(z) + P(z) = argmin_{z∈R^N\{0}} ρ_{Ã}(z) + µ ( z^⊤(D^{1/2}11^⊤D^{1/2})z ) / (z^⊤z)
       = argmin_{z∈R^N\{0}} ρ_{Â}(z)   with   Â := Ã + µ D^{1/2}11^⊤D^{1/2} .   (9.3.2.24)

How to choose the penalty parameter µ?

In general: finding a “suitable” value for µ may be difficult or even impossible! Here we are lucky:

    (9.3.2.10)  ⇒  A1 = 0  ⇒  Ã(D^{1/2}1) = 0  ⇔  D^{1/2}1 ∈ EigÃ0 .

The constraint in (9.3.2.23) means:

    Minimize over the orthogonal complement of an eigenvector.   (9.3.2.25)

Cor. 9.1.0.9 ➤ The orthogonal complement of an eigenvector of a symmetric matrix is spanned by the other eigenvectors (orthonormalization of eigenvectors belonging to the same eigenvalue is assumed).

(9.3.2.25) ➤ The minimizer of (9.3.2.23) will be one of the other eigenvectors of Ã that belongs to the smallest eigenvalue.

Note: This eigenvector z* will be orthogonal to D^{1/2}1, it satisfies the constraint, and, thus, P(z*) = 0!

Note: The eigenspaces of Ã and Â agree.

Note: Lemma 2.8.0.12 ⇒ Ã is positive semidefinite (→ Def. 1.1.2.6) with smallest eigenvalue 0.

Idea: Choose the penalization parameter µ in (9.3.2.24) such that D^{1/2}1 is guaranteed not to be an eigenvector belonging to the smallest eigenvalue of Â.

Safe choice: choose µ such that D^{1/2}1 will belong to the largest eigenvalue of Â; Thm. 9.1.0.4 suggests

    µ = ‖Ã‖ = 2   (by (1.5.5.13)).   (9.3.2.26)

    z* = argmin_{1^⊤D^{1/2}z=0} ρ_{Ã}(z) = argmin_{z≠0} ρ_{Â}(z) .   (9.3.2.27)

By Thm. 9.3.2.16:

    z* = eigenvector belonging to the minimal eigenvalue of Â
    ⇕
    z* = eigenvector ⊥ D^{1/2}1 belonging to the minimal eigenvalue of Ã
    ⇕
    D^{−1/2}z* = minimizer for (9.3.2.15).
§9.3.2.28 (Algorithm outline: Binary grayscale image segmentation)
➊ Given the similarity function σ compute the (sparse!) matrices W, D, A ∈ R^{N,N}, see (9.3.2.3), (9.3.2.9).
➋ Compute y*, ‖y*‖₂ = 1, as eigenvector belonging to the smallest eigenvalue of Â := D^{−1/2}AD^{−1/2} + 2(D^{1/2}1)(D^{1/2}1)^⊤.
➌ Set x* = D^{−1/2}y* and define the image segment as the pixel set

    X := { i ∈ {1, . . . , N} :  x_i* > (1/N) ∑_{i=1}^{N} x_i* }   (9.3.2.29)

(the threshold is the mean value of the entries of x*).

MATLAB-code 9.3.2.30: 1st stage of segmentation of grayscale image

% Read image and build matrices, see Code 9.3.2.11 and (9.3.2.9)
P = imread('image.pbm'); [m,n] = size(P); [A,D] = imgsegmat(P);
% Build scaling matrices
N = size(A,1); dv = sqrt(spdiags(A,0));
Dm = spdiags(1./dv,[0],N,N); % D^{-1/2}
Dp = spdiags(dv,[0],N,N);    % D^{1/2}
% Build (densely populated!) matrix Ahat
c = Dp*ones(N,1); Ah = Dm*A*Dm + 2*c*c';
% Compute and sort eigenvalues; grossly inefficient!
[W,E] = eig(full(Ah)); [ev,idx] = sort(diag(E)); W(:,idx) = W;
% Obtain eigenvector x* belonging to 2nd smallest generalized
% eigenvalue of A and D
x = W(:,1); x = Dm*x;
% Extract segmented image
xs = reshape(x,m,n); Xidx = find(xs>(sum(sum(xs))/(n*m)));

y
1st stage of segmentation of a 31 × 25 grayscale pixel image (root.pbm, red pixels =ˆ X, σ(x, y) = exp(−((x − y)/10)²))

Fig. 343 (original) and Fig. 344 (segments): result of the 1st segmentation stage on the pixel grid.

Fig. 345: eigenvector x* (size of entries) plotted on the pixel grid for the image from Fig. 343.

To identify more segments, the same algorithm is recursively applied to segment parts of the image
already determined.

Practical segmentation algorithms rely on many more steps, of which the above algorithm is only one, preceded by substantial preprocessing. Moreover, they dispense with the strictly local perspective adopted above and take into account more distant connections between image parts, often in a randomized fashion [SM00].

The image segmentation problem falls into the wider class of graph partitioning problems. Methods based on (a few of) the eigenvectors of the connectivity matrix belonging to the smallest eigenvalues are known as spectral partitioning methods. The eigenvector belonging to the smallest non-zero eigenvalue that we computed above is usually called the Fiedler vector of the graph, see [AKY99; ST96]. y

The solution of the image segmentation problem by means of eig in Code 9.3.2.30 amounts to a tremendous waste of computational resources: we compute all eigenvalues/eigenvectors of dense matrices, though only a single eigenvector associated with the smallest eigenvalue is of interest.

This motivates the quest to find efficient numerical methods for the following task.

Task: given A ∈ K n,n , find smallest (in modulus) eigenvalue of regular A ∈ K n,n
and (an) associated eigenvector.

If A ∈ K^{n,n} is regular:

    smallest (in modulus) EV of A = ( largest (in modulus) EV of A^{−1} )^{−1}

Direct power method (→ Section 9.3.1) for A^{−1} = inverse iteration

M ATLAB-code 9.3.2.31: inverse iteration for computing λmin (A) and associated eigenvector

Note: reuse of LU-factorization, see Rem. 2.5.0.10
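A minimal sketch of inverse iteration with a reused LU-factorization (illustration only, not the original Code 9.3.2.31; the function name is made up) could read:

% Sketch of inverse iteration (illustration only): the LU-factorization of A
% is computed once and reused in every step, cf. Rem. 2.5.0.10.
function [lmin,z] = invitsketch(A,z0,maxit)
[L,U,P] = lu(A);                 % single LU-factorization: P*A = L*U
z = z0/norm(z0);
for k = 1:maxit
    w = U\(L\(P*z));             % solve A w = z by forward/backward substitution
    z = w/norm(w);
    lmin = dot(z,A*z);           % Rayleigh quotient -> smallest eigenvalue
end
end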


Remark 9.3.2.32 (Shifted inverse iteration)
More general task:
For α ∈ C find λ ∈ σ (A) such that |α − λ| = min{|α − µ|, µ ∈ σ (A)}

Shifted inverse iteration: [DR08, Alg. 7.24]

    z^(0) arbitrary ,   w = (A − αI)^{−1} z^(k−1) ,   z^(k) := w/‖w‖₂ ,   k = 1, 2, . . . ,   (9.3.2.33)

where (A − αI)^{−1} z^(k−1) =ˆ solve (A − αI)w = z^(k−1) based on Gaussian elimination (↔ a single LU-factorization of A − αI as in Code 9.3.2.31).
y

Remark 9.3.2.34 ((Nearly) singular LSE in shifted inverse iteration)


What if “by accident” α ∈ σ (A) (⇔ A − αI singular) ?

Stability of Gaussian elimination/LU-factorization (→ ??) will ensure that “w from (9.3.2.33) points in
the right direction”

In other words, roundoff errors may badly affect the length of the solution w, but not its direction.
Practice [GT08]: If, in the course of Gaussian elimination/LU-factorization a zero pivot element is really
encountered, then we just replace it with eps, in order to avoid inf values!

Thm. 9.3.1.21 ➣ Convergence of shifted inverse iteration for A H = A: y


Asymptotic linear convergence, Rayleigh quotient → λ_j with rate

    |λ_j − α| / min{ |λ_i − α| : i ≠ j } ,   where λ_j ∈ σ(A) ,  |α − λ_j| ≤ |α − λ|  ∀λ ∈ σ(A) .

Extremely fast for α ≈ λ_j!

9. Computation of Eigenvalues and Eigenvectors, 9.3. Power Methods 701


NumCS(E), AT’24, Prof. Ralf Hiptmair ©SAM, ETH Zurich, 2024

Idea: A posteriori adaptation of shift


Use α := ρA (z(k−1) ) in k-th step of inverse iteration.

§9.3.2.35 (Rayleigh quotient iteration → [Han02, Alg. 25.2])

M ATLAB-code 9.3.2.36: Rayleigh quotient iteration (for normal A ∈ R n,n )

??: note the use of speye to preserve the sparse matrix data format!


y
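A minimal sketch of the Rayleigh quotient iteration (illustration only, not the original Code 9.3.2.36; the function name is made up) could read:

% Sketch of the Rayleigh quotient iteration (illustration only): the shift is
% updated in every step, so a new factorization is needed each time;
% speye keeps the sparse matrix format.
function [lam,z] = rqisketch(A,z0,maxit)
n = size(A,1);
z = z0/norm(z0);
for k = 1:maxit
    lam = dot(z,A*z);                 % current Rayleigh quotient as shift
    w = (A - lam*speye(n))\z;         % one step of shifted inverse iteration
    z = w/norm(w);
end
end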

✦ Drawback compared with Code 9.3.2.31: reuse of LU-factorization no longer possible.


✦ Even if LSE nearly singular, stability of Gaussian elimination guarantees correct direction of z, see
discussion in Rem. 9.3.2.34.

EXAMPLE 9.3.2.37 (Rayleigh quotient iteration)

Monitored: iterates of the Rayleigh quotient iteration (9.3.2.36) for s.p.d. A ∈ R^{n,n}.

MATLAB-code 9.3.2.38:

d = (1:10)'; n = length(d);
Z = diag(sqrt(1:n),0) + ones(n,n);
[Q,R] = qr(Z);
A = Q*diag(d,0)*Q';

Plotted vs. k, for z^(0) = random vector:  o: |λ_min − ρ_A(z^(k))|,  ∗: ‖z^(k) − x_j‖, where λ_min = λ_j, x_j ∈ EigAλ_j, ‖x_j‖₂ = 1.

 k   |λ_min − ρ_A(z^(k))|   ‖z^(k) − x_j‖
 1   0.09381702342056       0.20748822490698
 2   0.00029035607981       0.01530829569530
 3   0.00000000001783       0.00000411928759
 4   0.00000000000000       0.00000000000000
 5   0.00000000000000       0.00000000000000

Theorem 9.3.2.39. → [Han02, Thm. 25.4]

If A = A^H, then ρ_A(z^(k)) converges locally of order 3 (→ Def. 8.2.2.10) to the smallest eigenvalue (in modulus), when z^(k) are generated by the Rayleigh quotient iteration (9.3.2.36).

y

9.3.3 Preconditioned inverse iteration (PINVIT)

Task: given A ∈ K n,n , find smallest (in modulus) eigenvalue of regular A ∈ K n,n
and (an) associated eigenvector.


Options: inverse iteration (→ Code 9.3.2.31) and Rayleigh quotient iteration (9.3.2.36).

? What if direct solution of Ax = b not feasible ?


This can happen, in case
• for large sparse A the amount of fill-in exhausts memory, despite sparse elimination techniques (→
Section 2.7.4),
• A is available only through a routine evalA(x) providing A×vector.

We expect that an approximate solution of the linear systems of equations encountered during
inverse iteration should be sufficient, because we are dealing with approximate eigenvectors anyway.
Thus, iterative solvers for solving Aw = z(k−1) may be considered, see Chapter 10. However, the required
accuracy is not clear a priori. Here we examine an approach that completely dispenses with an iterative
solver and uses a preconditioner (→ Notion 10.3.0.3) instead.

Idea: (for inverse iteration without shift, A = A H s.p.d.)


Instead of solving Aw = z(k−1) compute w = B−1 z(k−1) with
“inexpensive” s.p.d. approximate inverse B−1 ≈ A−1

➣ B =ˆ preconditioner for A, see Notion 10.3.0.3

Possible to replace A−1 with B−1 in inverse iteration ?

! NO, because we are not interested in smallest eigenvalue of B !

Replacement A−1 → B−1 possible only when applied to residual quantity


residual quantity = quantity that → 0 in the case of convergence to exact
solution

Natural residual quantity for the eigenvalue problem Ax = λx:

    r := Az − ρ_A(z) z ,   ρ_A(z) = Rayleigh quotient → Def. 9.3.1.16 .

Note: only the direction of A^{−1}z matters in inverse iteration (9.3.2.33):
    (A^{−1}z) ∥ ( z − A^{−1}(Az − ρ_A(z)z) )   ⇒   defines the same next iterate!

Preconditioned inverse iteration (PINVIT) for s.p.d. A:

    z^(0) arbitrary,
    w = z^(k−1) − B^{−1}( Az^(k−1) − ρ_A(z^(k−1)) z^(k−1) ) ,
    z^(k) = w/‖w‖₂ ,   k = 1, 2, . . . .   (9.3.3.1)

MATLAB-code 9.3.3.2: preconditioned inverse iteration (9.3.3.1)

Computational effort per step: 1 matrix×vector product, 1 evaluation of the preconditioner, a few AXPY-operations.
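A minimal sketch of (9.3.3.1) (illustration only, not the original Code 9.3.3.2; the function name is made up, and evalA, invB are function handles as in Code 9.3.3.4 below) could read:

% Sketch of PINVIT (9.3.3.1) (illustration only): evalA(x) returns A*x,
% invB(x) applies the preconditioner B^{-1}, z0 = initial guess.
function [lmin,z] = pinvitsketch(evalA,invB,z0,tol,maxit)
z = z0/norm(z0); rho = inf; rhon = inf;
for k = 1:maxit
    v = evalA(z);
    rhon = dot(z,v);               % Rayleigh quotient (||z||_2 = 1)
    r = v - rhon*z;                % residual A z - rho_A(z) z
    z = z - invB(r);               % preconditioned correction, cf. (9.3.3.1)
    z = z/norm(z);                 % normalization
    if (abs(rho-rhon) < tol*abs(rhon)), break; end
    rho = rhon;
end
lmin = rhon;
end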

EXAMPLE 9.3.3.3 (Convergence of PINVIT)

S.p.d. matrix A ∈ R^{n,n}, tridiagonal preconditioner, see Ex. 10.3.0.11.

MATLAB-code 9.3.3.4:

A = spdiags(repmat([1/n,-1,2*(1+1/n),-1,1/n],n,1),[-n/2,-1,0,1,n/2],n,n);
evalA = @(x) A*x;
% inverse iteration
invB = @(x) A\x;
% tridiagonal preconditioning
B = spdiags(spdiags(A,[-1,0,1]),[-1,0,1],n,n); invB = @(x) B\x;

Monitored: error decay during the iteration of Code 9.3.3.2: |ρ_A(z^(k)) − λ_min(A)|.

Fig. 346: error vs. iteration step for INVIT and PINVIT with n = 50, 100, 200. Fig. 347: number of iteration steps needed (tolerance = 0.0001) vs. n for INVIT and PINVIT.

Observation: linear convergence of eigenvectors also for PINVIT.


y

Theory [Ney99a; Ney99b]:


✦ linear convergence of (9.3.3.1)
✦ fast convergence, if spectral condition number κ (B−1 A) small

The theory of PINVIT [Ney99a; Ney99b] is based on the identity


w = ρA (z(k−1) )A−1 z(k−1) + (I − B−1 A)(z(k−1) − ρA (z(k−1) )A−1 z(k−1) ) . (9.3.3.5)
For small residual Az(k−1) − ρA (z(k−1) )z(k−1) PINVIT almost agrees with the regular inverse iteration.

9.3.4 Subspace iterations


Remark 9.3.4.1 (Excited resonances)
Consider the non-autonomous ODE (excited harmonic oscillator)

ÿ + λ2 y = cos(ωt) , (9.3.4.2)

with general solution

    y(t) =  (1/(λ² − ω²)) cos(ωt) + A cos(λt) + B sin(λt) ,   if λ ≠ ω ,
            (t/(2ω)) sin(ωt) + A cos(λt) + B sin(λt) ,         if λ = ω .   (9.3.4.3)

Growing solutions are possible in the resonance case λ = ω!

Now consider harmonically excited vibration modelled by ODE

ÿ + Ay = b cos(ωt) , (9.3.4.4)

with symmetric, positive (semi)definite matrix A ∈ R n,n , b ∈ R n . By Cor. 9.1.0.9 there is an orthogonal
matrix Q ∈ R n,n such that

Q⊤ AQ = D := diag(λ1 , . . . , λn ) .

where the 0 ≤ λ1 < λ2 < · · · < λn are the eigenvalues of A.

Transform ODE as in Ex. 9.0.0.6: with z = Q⊤ y

(9.3.4.4) z̈ + Dz = Q⊤ b cos(ωt) .

We have obtained decoupled linear 2nd-order scalar ODEs of the type (9.3.4.2).
☛ ✟

✡ ✠
(9.3.4.4) can have growing (with time) solutions, if ω = λi for some i = 1, . . . , n
p
If ω = λ j for one j ∈ {1, . . . , n}, then the solution for the initial value problem for (9.3.4.4) with
y(0) = ẏ(0) = 0 (↔ z(0) = ż(0) = 0) is

t
z(t) ∼ sin(ωt)e j + bounded oscillations

m
t
y(t) ∼ sin(ωt)(Q):,j + bounded oscillations .

j-th eigenvector of A
Eigenvectors of A ↔ excitable states y
EXAMPLE 9.3.4.5 (Vibrations of a truss structure cf. [Han02, Sect. 3], M ATLAB’s truss demo)

Fig. 348: a “bridge” truss.

A truss is a structure composed of (massless) rods and point masses; we consider in-plane (2D) trusses.

Encoding: positions of masses + (sparse) connectivity matrix.

M ATLAB-code 9.3.4.6: Data for “bridge truss”

Assumptions: ✦ Truss in static equilibrium (perfect balance of forces at each point mass).
✦ Rods are perfectly elastic (i.e., frictionless).
Hooke's law holds for the force in the direction of a rod:

    F = α · Δl / l ,   (9.3.4.7)

where ✦ l is the equilibrium length of the rod,


✦ ∆l is the elongation of the rod effected by the force F in the direction of the rod
✦ α is a material coefficient (Young’s modulus).
n point masses are numbered 1, . . . , n: pi ∈ R2 =
ˆ position of i-th mass

We consider a swaying truss: description by time-dependent displacements ui (t) ∈ R2 of point masses:

    position of i-th mass at time t = p_i + u_i(t) .

Fig. 349: deformed truss (• =ˆ point masses at positions p_i, → =ˆ displacement vectors u_i, • =ˆ shifted masses at p_i + u_i).
Equilibrium length and (time-dependent) elongation of the rod connecting point masses i and j, i ≠ j:

    l_ij := ‖Δp^{ji}‖₂ ,   Δp^{ji} := p_j − p_i ,   (9.3.4.8)
    Δl_ij(t) := ‖Δp^{ji} + Δu^{ji}(t)‖₂ − l_ij ,   Δu^{ji}(t) := u_j(t) − u_i(t) .   (9.3.4.9)

Extra (reaction) force on masses i and j:

    F_ij(t) = −α_ij (Δl_ij / l_ij) · (Δp^{ji} + Δu^{ji}(t)) / ‖Δp^{ji} + Δu^{ji}(t)‖₂ .   (9.3.4.10)

Assumption: Small displacements

Possibility of linearization by neglecting terms of order ‖u_i‖₂²:

    F_ij(t) = α_ij ( 1/‖Δp^{ji} + Δu^{ji}(t)‖₂ − 1/‖Δp^{ji}‖₂ ) · (Δp^{ji} + Δu^{ji}(t)) ,   (9.3.4.11)

by (9.3.4.8), (9.3.4.9).

Lemma 9.3.4.12. Taylor expansion of inverse distance function

For x ∈ R^d \ {0}, y ∈ R^d, ‖y‖₂ < ‖x‖₂ holds for y → 0

    1/‖x + y‖₂ = 1/‖x‖₂ − (x · y)/‖x‖₂³ + O(‖y‖₂²) .

Proof. Simple Taylor expansion up to the linear term for f(x) = (x_1² + · · · + x_d²)^{−1/2} and f(x + y) = f(x) + grad f(x) · y + O(‖y‖₂²).

Linearization of the force: apply Lemma 9.3.4.12 to (9.3.4.11) and drop terms O(‖Δu^{ji}‖₂²):

    F_ij(t) ≈ −α_ij (Δp^{ji} · Δu^{ji}(t)) / l_ij³ · (Δp^{ji} + Δu^{ji}(t))
            ≈ −α_ij (Δp^{ji} · Δu^{ji}(t)) / l_ij³ · Δp^{ji} .   (9.3.4.13)

Newton's second law of motion (F_i =ˆ total force acting on the i-th mass, m_i =ˆ mass of point mass i):

    m_i (d²u^i/dt²)(t) = F_i = ∑_{j=1, j≠i}^{n} −F_ij(t) ,   (9.3.4.14)

    m_i (d²u^i/dt²)(t) = ∑_{j=1, j≠i}^{n} α_ij (1/l_ij³) Δp^{ji} (Δp^{ji})^⊤ ( u_j(t) − u_i(t) ) .   (9.3.4.15)

Compact notation: collect all displacements into one vector u(t) = ( u_i(t) )_{i=1}^{n} ∈ R^{2n}:

    (9.3.4.15)  ⇒  M (d²u/dt²)(t) + Au(t) = f(t) ,   (9.3.4.16)

with mass matrix M = diag(m_1, m_1, . . . , m_n, m_n)

and stiffness matrix A ∈ R^{2n,2n} with 2 × 2-blocks

    (A)_{2i−1:2i, 2i−1:2i} = ∑_{j=1, j≠i}^{n} α_ij (1/l_ij³) Δp^{ji} (Δp^{ji})^⊤ ,  i = 1, . . . , n ,
    (A)_{2i−1:2i, 2j−1:2j} = −α_ij (1/l_ij³) Δp^{ji} (Δp^{ji})^⊤ ,  i ≠ j ,   (9.3.4.17)

and external forces f(t) = ( f_i(t) )_{i=1}^{n}.

Note: the stiffness matrix A is symmetric, positive semidefinite (→ Def. 1.1.2.6).

Rem. 9.3.4.1: if periodic external forces f(t) = cos(ωt)f, f ∈ R^{2n} (wind, earthquake), act on the truss, they can excite vibrations of (linearly in time) growing amplitude, if ω coincides with √λ_j for an eigenvalue λ_j of A.

Excited vibrations can lead to the collapse of a truss structure, cf. the notorious
Tacoma-Narrows bridge disaster.
It is essential to know whether eigenvalues of a truss structure fall into a range that can be excited
by external forces.
These will typically(∗) be the low modes ↔ a few of the smallest eigenvalues.
((∗) Reason: fast oscillations will quickly be damped due to friction, which was neglected in our model.)

M ATLAB-code 9.3.4.18: Computing resonant frequencies and modes of elastic truss

Fig. 350: resonant frequencies (eigenvalues) of the bridge truss from Fig. 348 (horizontal axis: no. of eigenvalue).

The stiffness matrix will always possess three zero eigenvalues corresponding to rigid body modes (= displacements without change of length of the rods).

Fig. 351–354: four excitable modes of the bridge truss; mode 4: frequency = 6.606390e−02, mode 5: frequency = 3.004450e−01, mode 6: frequency = 3.778801e−01, mode 7: frequency = 4.427214e−01.
y
To compute a few of a truss's lowest resonant frequencies and excitable modes, we need efficient numerical methods for the following task. Obviously, Code 9.3.4.18 cannot be used for large trusses, because eig invariably operates on dense matrices and will be prohibitively slow and gobble up huge amounts of memory; also recall the discussion of Code 9.3.2.30.

Task: Compute m, m ≪ n, of the smallest/largest (in modulus) eigenvalues


of A = AH ∈ C n,n and associated eigenvectors.

Of course, we aim to tackle this task by iterative methods generalizing power iteration (→ Section 9.3.1)
and inverse iteration (→ Section 9.3.2).

9.3.4.1 Orthogonalization

Preliminary considerations (in R, m = 2):

According to Cor. 9.1.0.9: For A = A⊤ ∈ R n,n there is a factorization A = UDU⊤ with D =


diag(λ1 , . . . , λn ), λ j ∈ R, λ1 ≤ λ2 ≤ · · · ≤ λn , and U orthogonal. Thus, u j := (U):,j , j = 1, . . . , n,
are (mutually orthogonal) eigenvectors of A.

Assume 0 ≤ λ1 ≤ · · · ≤ λn−2 <λn−1 <λn (largest eigenvalues are simple).

If we just carry out the direct power iteration (9.3.1.12) for two vectors both sequences will converge to the
largest (in modulus) eigenvector. However, we recall that all eigenvectors are mutually orthogonal. This

suggests that we orthogonalize the iterates of the second power iteration (that is to yield the eigenvector for
the second largest eigenvalue) with respect to those of the first. This idea spawns the following iteration,
cf. Gram-Schmidt orthogonalization in (10.2.2.4):

M ATLAB-code 9.3.4.19: one step of subspace power iteration, m = 2


1 function [v,w] = sspowitstep(A,v,w)
2 v = A*v; w = A*w; % “power iteration”, cf. (9.3.1.12)
3 % orthogonalization, cf. Gram-Schmidt orthogonalization (10.2.2.4)
4 v = v/norm(v); w = w - dot(v,w)*v; w = w/norm(w); % now w ⊥ v

Fig. 355: orthogonalization of two vectors (see Line 4 of Code 9.3.4.19): w is replaced by w − (v · w/‖v‖₂²) v, i.e., its component orthogonal to v.

Analysis through eigenvector expansions (v, w ∈ R^n, ‖v‖₂ = ‖w‖₂ = 1):

    v = ∑_{j=1}^{n} α_j u_j ,   w = ∑_{j=1}^{n} β_j u_j ,
    ⇒  Av = ∑_{j=1}^{n} λ_j α_j u_j ,   Aw = ∑_{j=1}^{n} λ_j β_j u_j ,
    v_0 := Av/‖Av‖₂ = ( ∑_{j=1}^{n} λ_j² α_j² )^{−1/2} ∑_{j=1}^{n} λ_j α_j u_j ,
    Aw − (v_0^⊤ Aw) v_0 = ∑_{j=1}^{n} ( β_j − ( ∑_{i=1}^{n} λ_i² α_i β_i / ∑_{i=1}^{n} λ_i² α_i² ) α_j ) λ_j u_j .

We notice that v is just mapped to the next iterate in the regular direct power iteration (9.3.1.12). After many steps, it will be very close to u_n, and, therefore, we may now assume v = u_n ⇔ α_j = δ_{j,n} (Kronecker symbol). Then

    z := Aw − (v_0^⊤ Aw) v_0 = 0 · u_n + ∑_{j=1}^{n−1} λ_j β_j u_j ,
    w^(new) := z/‖z‖₂ = ( ∑_{j=1}^{n−1} λ_j² β_j² )^{−1/2} ∑_{j=1}^{n−1} λ_j β_j u_j .

The sequence w^(k) produced by repeated application of the mapping given by Code 9.3.4.19 asymptotically (that is, when v^(k) has already converged to u_n) agrees with the sequence produced by the direct power method for Ã := U diag(λ_1, . . . , λ_{n−1}, 0) U^⊤. Its convergence will be governed by the relative gap λ_{n−2}/λ_{n−1}, see Thm. 9.3.1.21.

However: if v^(k) itself converges slowly, this reasoning does not apply.

EXAMPLE 9.3.4.20 (Subspace power iteration with orthogonal projection)

☞ construction of matrix A = A⊤ as in Ex. 9.3.2.37

M ATLAB-code 9.3.4.21: power iteration with orthogonal projection for two vectors

σ (A) = {1, 2, . . . , 10}:


Fig. 356: errors in λ_n, λ_{n−1}, v, w vs. power iteration step; Fig. 357: corresponding error quotients (d = (1:10)).

σ (A) = {0.5, 1, . . . , 4, 9.5, 10}:


Fig. 358: errors in λ_n, λ_{n−1}, v, w vs. power iteration step; Fig. 359: corresponding error quotients (d = [0.5*(1:8),9.5,10]).
y

Issue: generalization of orthogonalization idea to subspaces of dimension > 2

Nothing new:
Gram-Schmidt orthonormalization
(→ [NS02, Thm. 4.8], [Gut09, Alg. 6.1], [QSS00, Sect. 3.4.3])

Given: linearly independent vectors v1 , . . . , vm ∈ R n , m ∈ N


Sought: vectors q1 , . . . , qm ∈ R n such that

➊  q_l^⊤ q_k = δ_{lk}   (orthonormality) ,   (9.3.4.22)
➋  Span{q_1, . . . , q_k} = Span{v_1, . . . , v_k}   for all k = 1, . . . , m .   (9.3.4.23)

Constructive proof & algorithm for Gram-Schmidt orthonormalization:

    z_1 = v_1 ,
    z_2 = v_2 − (v_2^⊤ z_1)/(z_1^⊤ z_1) z_1 ,
    z_3 = v_3 − (v_3^⊤ z_1)/(z_1^⊤ z_1) z_1 − (v_3^⊤ z_2)/(z_2^⊤ z_2) z_2 ,   (9.3.4.24)
    . . .
    + normalization   q_k = z_k / ‖z_k‖₂ ,   k = 1, . . . , m .   (9.3.4.25)
Easy computation: the vectors q1 , . . . , qm produced by (9.3.4.24) satisfy (9.3.4.22) and (9.3.4.23).

M ATLAB-code 9.3.4.26: Gram-Schmidt orthonormalization (do not use, unstable algorithm!)
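A sketch of such a (classical, hence unstable, see the warning below) Gram-Schmidt routine, for illustration only and with a made-up name:

% Sketch of classical Gram-Schmidt orthonormalization (9.3.4.24)/(9.3.4.25)
% (illustration only; unstable, do not use in earnest).
function Q = gramschmidtsketch(V)
[n,m] = size(V); Q = zeros(n,m);
for k = 1:m
    z = V(:,k);
    for j = 1:k-1
        z = z - dot(Q(:,j),V(:,k))*Q(:,j);  % subtract projections, cf. (9.3.4.24)
    end
    Q(:,k) = z/norm(z);                     % normalization (9.3.4.25)
end
end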

Warning! Code 9.3.4.26 provides an unstable implementation of Gram-Schmidt orthonormalization: for large n, m the impact of round-off will destroy the orthogonality of the columns of Q.

A stable implementation of Gram-Schmidt orthogonalization of the columns of a matrix V ∈ K^{n,m}, m ≤ n, is provided by the MATLAB command

    [Q,~] = qr(V,0)    (asymptotic computational cost: O(m²n)) ,

where “~” is a dummy return value (for our purposes) and “0” a dummy argument. A detailed description of the algorithm behind qr and the meaning of the return value R → Section 3.3.3.
EXAMPLE 9.3.4.27 (qr based orthogonalization, m = 2)
The following two M ATLAB code snippets perform the same function, cf. Code 9.3.4.19:

M ATLAB-code 9.3.4.28: sspowitstep1.m M ATLAB-code 9.3.4.29: sspowitstep2.m

Explanation ➣ discussion of Gram-Schmidt orthonormalization. y

M ATLAB-code 9.3.4.30: General subspace power iteration step with qr based orthonormal-
ization
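A minimal sketch of such a step (illustration only, not the original Code 9.3.4.30; the function name is made up; for m = 2 it performs the same function as Code 9.3.4.19) could read:

% Sketch of one general subspace power iteration step with qr-based
% orthonormalization (illustration only); the columns of V are the current iterates.
function V = sspowitstepsketch(A,V)
V = A*V;            % "power iteration" applied to all columns
[V,~] = qr(V,0);    % stable orthonormalization of the columns, see above
end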

9.3.4.2 Ritz projection

Observations on Code 9.3.4.19:


✦ the first column of V, (V):,1 , is a sequence of vectors created by the standard direct power method
(9.3.1.12).
✦ reasoning: the other columns of V, after each multiplication with A can be expected to contain a
significant component in the direction of the eigenvector associated with the eigenvalue of largest
modulus.

Idea: use information in (V):,2 , . . . , (V):,end to accelerate convergence of


(V):,1 .

Since the columns of V span a subspace of R n , this idea can be recast as the following task:

Task: given v1 , . . . , vk ∈ K n , k ≪ n, extract (good approximations of) eigenvectors of


A = AH ∈ K n,n contained in Span{v1 , . . . , vm }.

We take for granted that {v1 , . . . , vm } is linearly independent.

Assumption:  EigAλ ∩ V ≠ {0} ,  V := Span{v_1, . . . , v_m}

    ⇔ V contains an eigenvector of A

⇔ ∃w ∈ V \ {0}: Aw = λw
⇔ ∃u ∈ K m \ {0}: AVu = λVu
⇒ ∃u ∈ K m \ {0}: VH AVu = λVH Vu , (9.3.4.31)

where V := (v1 , . . . , vm ) ∈ K n,m and we used


V = {Vu : u ∈ K m } (linear combinations of the vi ).
(9.3.4.31) ➣ u ∈ K k \ {0} solves m × m generalized eigenvalue problem

(VH AV)u = λ(VH V)u . (9.3.4.32)

Note:  {v_1, . . . , v_m} linearly independent
    ⇕
    V has full rank m (→ Def. 2.2.1.3)
    ⇕
    V^H V is symmetric positive definite (→ Def. 1.1.2.6)

If our initial assumption holds true, u solves (9.3.4.32), and λ is a simple eigenvalue, a corresponding x ∈ EigAλ can be recovered as x = Vu.

Idea: Given a subspace V = Im(V) ⊂ K n , V ∈ K n,m , dim(V ) = m, obtain


approximations of (a few) eigenvalues and eigenvectors x1 , . . . , xm of A
by

➊ solving the generalized eigenvalue problem (9.3.4.32)


➔ eigenvectors u1 , . . . , uk ∈ K m (linearly independent),
➋ and transforming them back according to xk = Vuk , k = 1, . . . , m.
Terminology: (9.3.4.32) is called the Ritz projection of EVP Ax = λx onto V

Terminology: σ (VH AV) =ˆ Ritz values,


eigenvectors of VH AV =ˆ Ritz vectors

Example: Ritz projection of Ax = λx onto Span{v, w}:

    (v, w)^H A (v, w) (α, β)^⊤ = λ (v, w)^H (v, w) (α, β)^⊤ .

Note: If V is unitary (→ Def. 6.3.1.2), then the generalized eigenvalue problem (9.3.4.32) will become a
standard linear eigenvalue problem.
Remark 9.3.4.33 (Justification of Ritz projection by min-max theorem)
We revisit m = 2, see Code 9.3.4.19. Recall that by the min-max theorem Thm. 9.3.2.18

un = argmaxx∈Rn ρA (x) , un−1 = argmaxx∈Rn , x⊥un ρA (x) . (9.3.4.34)

Idea: maximize the Rayleigh quotient over Span{v, w}, where v, w are output by Code 9.3.4.19. This leads to the optimization problem

    (α*, β*) := argmax_{α,β∈R, α²+β²=1} ρ_A(αv + βw) = argmax_{α,β∈R, α²+β²=1} ρ_{(v,w)^⊤A(v,w)}( (α, β)^⊤ ) .   (9.3.4.35)

Then a better approximation for the eigenvector to the largest eigenvalue is

v∗ := α∗ v + β∗ w .

Note that kv∗ k2 = 1, if both v and w are normalized, which is guaranteed in Code 9.3.4.19.
Then, orthogonalizing w w.r.t v∗ will produce a new iterate w∗ .
Again the min-max theorem Thm. 9.3.2.18 tells us that we can find (α*, β*)^⊤ as eigenvector to the largest eigenvalue of

    (v, w)^⊤ A (v, w) (α, β)^⊤ = λ (α, β)^⊤ .   (9.3.4.36)

Since eigenvectors of symmetric matrices are mutually orthogonal, we find w∗ = α2 v + β 2 w, where


(α2 , β 2 )⊤ is the eigenvector of (9.3.4.36) belonging to the smallest eigenvalue. This assumes orthonormal
vectors v, w. y

M ATLAB-code 9.3.4.37: one step of subspace power iteration with Ritz projection, matrix ver-
sion
function V = sspowitsteprp(A,V)
V = A*V; % power iteration applied to columns of V
[Q,R] = qr(V,0); % orthonormalization, see Section 9.3.4.1
[U,D] = eig(Q'*A*Q); % solve Ritz projected m x m eigenvalue problem
V = Q*U; % recover approximate eigenvectors
ev = diag(D); % approximate eigenvalues

Note that the orthogonalization step in Code 9.3.4.37 is actually redundant, if exact arithmetic could be employed, because the Ritz projection could also be realized by solving the generalized eigenvalue problem.

However, prior orthogonalization is essential for numerical stability (→ Def. 1.5.5.19), cf. the discussion in
Section 3.3.3.
EXAMPLE 9.3.4.38 (Power iteration with Ritz projection)
Matrix as in Ex. 9.3.4.20, σ (A) = {0.5, 1, . . . , 4, 9.5, 10}:
Fig. 360 (simple orthonormalization, Ex. 9.3.4.20) vs. Fig. 361 (with Ritz projection, Code 9.3.4.37): errors in λ_n, λ_{n−1}, v, w vs. power iteration step.

Fig. 362 (simple orthonormalization, Ex. 9.3.4.20) vs. Fig. 363 (with Ritz projection, Code 9.3.4.37): corresponding error quotients.


Observation: tremendous acceleration of power iteration through Ritz projection, convergence still linear
but with much better rates.
y
In Code 9.3.4.37: diagonal entries of D provide approximations of eigenvalues. Their (relative) changes
can be used as a termination criterion.

§9.3.4.39 (Subspace variant of direct power method with Ritz projection)

M ATLAB-code 9.3.4.40: Subspace power iteration with Ritz projection
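A minimal sketch of such a driver (illustration only, not the original Code 9.3.4.40; the function name is made up), which simply repeats the step of Code 9.3.4.37 and uses the relative change of the Ritz values as termination criterion, could read:

% Sketch of subspace power iteration with Ritz projection (illustration only):
% V = n x m matrix of initial guesses; returns Ritz values ev and Ritz vectors V.
function [ev,V] = sspowitrpsketch(A,V,tol,maxit)
ev = inf(size(V,2),1);
for k = 1:maxit
    V = A*V;                         % power step for all columns
    [Q,R] = qr(V,0);                 % orthonormalization, see Section 9.3.4.1
    [U,D] = eig(Q'*A*Q);             % Ritz projection, cf. Code 9.3.4.37
    [evn,idx] = sort(diag(D));       % sorted Ritz values
    V = Q*U(:,idx);                  % Ritz vectors = approximate eigenvectors
    if (norm(evn-ev) < tol*norm(evn)), ev = evn; return; end
    ev = evn;
end
end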

EXAMPLE 9.3.4.41 (Convergence of subspace variant of direct power method)

S.p.d. test matrix: a_ij := min{i/j, j/i}, set up via n=200; A = gallery('lehmer',n);. “Initial eigenvector guesses”: V = eye(n,m);.

Fig. 364: error in the eigenvalues λ_1, λ_2, λ_3 vs. iteration step, for m = 3 and m = 6.

• Observation: linear convergence of the eigenvalues.
• The choice m > k boosts the convergence of the eigenvalues.
y

Remark 9.3.4.42 (Subspace power methods)


Analoguous to § 9.3.4.39: construction of subspace variants of inverse iteration (→ Code 9.3.2.31), PIN-
VIT (9.3.3.1), and Rayleigh quotient iteration (9.3.2.36). y

9.4 Krylov Subspace Methods

Supplementary literature. [Han02, Sect. 30]

All power methods (→ Section 9.3) for the eigenvalue problem (EVP) Ax = λx only rely on the last iterate
to determine the next one (1-point methods, cf. (8.2.1.5))

➣ NO MEMORY, cf. discussion in the beginning of Section 10.2.

“Memory for power iterations”: pursue same idea that led from the gradient method, § 10.1.3.3, to the
conjugate gradient method, § 10.2.2.10: use information from previous iterates to achieve efficient mini-
mization over larger and larger subspaces.

Min-max theorem, Thm. 9.3.2.18:  A = A^H  ⇒  EVPs ⇔ finding extrema/stationary points of the Rayleigh quotient (→ Def. 9.3.1.16).

Setting: EVP Ax = λx for a real s.p.d. (→ Def. 1.1.2.6) matrix A = A^⊤ ∈ R^{n,n}.

Notations used below: 0 < λ_1 ≤ λ_2 ≤ · · · ≤ λ_n: eigenvalues of A, counted with multiplicity, see Def. 9.1.0.1; u_1, . . . , u_n =ˆ corresponding orthonormal eigenvectors, cf. Cor. 9.1.0.9:

    AU = UD ,   U = (u_1, . . . , u_n) ∈ R^{n,n} ,   D = diag(λ_1, . . . , λ_n) .

We recall
✦ the direct power method (9.3.1.12) from Section 9.3.1
✦ and the inverse iteration from Section 9.3.2

and how they produce sequences (z(k) )k∈N0 of vectors that are supposed to converge to a vector ∈
EigAλ1 or ∈ EigAλn , respectively.

Idea: Better z(k) from Ritz projection onto V := Span{z(0) , . . . , z(k) }


(= space spanned by previous iterates)

Recall (→ Code 9.3.4.37) Ritz projection of an EVP Ax = λx onto a subspace V := Span{v1 , . . . , vm },


m < n ➡ smaller m × m generalized EVP

    V^⊤AV x = λ V^⊤V x ,   V := (v_1, . . . , v_m) ∈ R^{n,m}   (9.4.0.1)

(the matrix V^⊤AV is abbreviated as H).

From Rayleigh quotient Thm. 9.3.2.16 and considerations in Section 9.3.4.2:

un ∈ V ⇒ largest eigenvalue of (9.4.0.1) = λmax (A) ,


u1 ∈ V ⇒ smallest eigenvalue of (9.4.0.1) = λmin (A) .

Intuition: If un (u1 ) “well captured” by V (that is, the angle between the vector and the space V is
small), then we can expect that the largest (smallest) eigenvalue of (9.4.0.1) is a good approximation
for λmax (A)(λmin (A)), and that, assuming normalization

Vw ≈ u1 (or Vw ≈ un ) ,

where w is the corresponding eigenvector of (9.4.0.1).

For the direct power method (9.3.1.12): z^(k) ∥ A^k z^(0), hence

    V = Span{ z^(0), Az^(0), . . . , A^k z^(0) } = K_{k+1}(A, z^(0)) ,  a Krylov space, → Def. 10.2.1.1 .   (9.4.0.2)


Direct power method with Ritz projection onto the Krylov space from (9.4.0.2), cf. § 9.3.4.39:

MATLAB-code 9.4.0.3: Ritz projections onto Krylov space (9.4.0.2)

Note: implementation for demonstration purposes only (inefficient for sparse matrix A!)
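A minimal sketch in its spirit (illustration only, not the original Code 9.4.0.3; the function name is made up; for demonstration purposes only) could read:

% Sketch: Ritz values of A on the Krylov space K_{m+1}(A,z0) from (9.4.0.2)
% (illustration only; inefficient, for demonstration purposes).
function ev = krylovritzsketch(A,z0,m)
V = z0;                      % Krylov basis, built column by column
for k = 1:m
    V = [V, A*V(:,end)];     % append A^k z0
end
[Q,~] = qr(V,0);             % orthonormal basis of the Krylov space
ev = sort(eig(Q'*A*Q));      % Ritz values, cf. (9.4.0.1) with unitary basis
end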

EXAMPLE 9.4.0.4 (Ritz projections onto Krylov space)

MATLAB-code 9.4.0.5:

n = 100;
M = gallery('tridiag',-0.5*ones(n-1,1),2*ones(n,1),-1.5*ones(n-1,1));
[Q,R] = qr(M); A = Q'*diag(1:n)*Q;  % synthetic matrix, sigma(A) = {1, 2, 3, ..., 100}

Fig. 365: the three largest Ritz values µ_m, µ_{m−1}, µ_{m−2} vs. dimension m of the Krylov space; Fig. 366: their errors (semi-logarithmic).

Observation: “vaguely linear” convergence of the largest Ritz values (notation µ_i) to the largest eigenvalues. Fastest convergence of the largest Ritz value → largest eigenvalue of A.

Fig. 367: the three smallest Ritz values µ_1, µ_2, µ_3 vs. dimension m of the Krylov space; Fig. 368: their errors |λ_1 − µ_1|, |λ_2 − µ_2|, |λ_3 − µ_3| (semi-logarithmic).

Observation: Also the smallest Ritz values converge “vaguely linearly” to the smallest eigenvalues of A. Fastest convergence of the smallest Ritz value → smallest eigenvalue of A.   y

? Why do smallest Ritz values converge to smallest eigenvalues of A?

Consider the direct power method (9.3.1.12) for Ã := νI − A, ν > λ_max(A):

    z̃^(0) arbitrary ,   z̃^(k+1) = (νI − A) z̃^(k) / ‖(νI − A) z̃^(k)‖₂ .   (9.4.0.6)

As σ(νI − A) = ν − σ(A) and the eigenspaces agree, we infer from Thm. 9.3.1.21

    λ_1 < λ_2   ⇒   z̃^(k) → u_1  &  ρ_A(z̃^(k)) → λ_1  linearly for k → ∞ .   (9.4.0.7)

By the binomial theorem (which also applies to matrices, if they commute)

    (νI − A)^k = ∑_{j=0}^{k} (k choose j) ν^{k−j} (−A)^j   ⇒   (νI − A)^k z̃^(0) ∈ K_{k+1}(A, z̃^(0)) ,
    K_k(νI − A, x) = K_k(A, x) .   (9.4.0.8)

➣ u_1 can also be expected to be “well captured” by K_k(A, x) and the smallest Ritz value should provide a good approximation for λ_min(A).

Recall from Section 10.2.2, Lemma 10.2.2.5:

The residuals r_0, . . . , r_{m−1} generated in the CG iteration, § 10.2.2.10, applied to Ax = z with x^(0) = 0, provide an orthogonal basis for K_m(A, z) (if r_k ≠ 0).

Inexpensive Ritz projection of Ax = λx onto K_m(A, z): with the orthogonal matrix

    V_m := ( r_0/‖r_0‖ , . . . , r_{m−1}/‖r_{m−1}‖ ) ∈ R^{n,m} :    V_m^⊤ A V_m x = λ x .   (9.4.0.9)

Recall: the residuals are generated by short recursions, see § 10.2.2.10.

Lemma 9.4.0.10. Tridiagonal Ritz projection from CG residuals

V_m^⊤ A V_m is a tridiagonal matrix.

Proof. Lemma 10.2.2.5: {r_0, . . . , r_{ℓ−1}} is an orthogonal basis of K_ℓ(A, r_0), if all the residuals are non-zero. As A K_{ℓ−1}(A, r_0) ⊂ K_ℓ(A, r_0), we conclude the orthogonality r_m^⊤ A r_j = 0 for all j = 0, . . . , m − 2. Since

    ( V_m^⊤ A V_m )_{ij} = r_{i−1}^⊤ A r_{j−1} ,   1 ≤ i, j ≤ m ,

the assertion of the theorem follows.

    V_l^H A V_l =: T_l ∈ K^{l,l}   [tridiagonal matrix]   (9.4.0.11)

with

    T_l =  [ α_1  β_1
             β_1  α_2  β_2
                  β_2  α_3  ⋱
                       ⋱    ⋱    β_{l−1}
                            β_{l−1}  α_l ] .

Algorithm for computing V_l and T_l: the Lanczos process, closely related to the CG iteration, § 10.2.2.10, Code 10.2.2.11.

MATLAB-code 9.4.0.12: Lanczos process, cf. Code 10.2.2.11

Computational effort per step: 1× A×vector, 2 dot products, 2 AXPY-operations, 1 division.

Total computational effort for l steps of the Lanczos process, if A has at most k non-zero entries per row: O(nkl).
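A minimal sketch of the Lanczos three-term recurrence (illustration only; the call signature mimics the one used in Ex. 9.4.0.14, but the body shown is not the original Code 9.4.0.12) could read:

% Sketch of the Lanczos process for symmetric A (illustration only):
% builds an orthonormal basis V of K_l(A,z0) and the entries alpha, beta
% of the tridiagonal matrix T_l = V'*A*V from (9.4.0.11).
function [V,alpha,beta] = lanczos(A,l,z0)
n = size(A,1);
V = zeros(n,l); alpha = zeros(l,1); beta = zeros(l-1,1);
V(:,1) = z0/norm(z0);
for j = 1:l
    w = A*V(:,j);                            % 1 x A*vector per step
    if (j > 1), w = w - beta(j-1)*V(:,j-1); end
    alpha(j) = dot(V(:,j),w);
    w = w - alpha(j)*V(:,j);                 % orthogonalize against v_j
    if (j < l)
        beta(j) = norm(w);
        V(:,j+1) = w/beta(j);
    end
end
end
% The tridiagonal matrix (9.4.0.11) can then be assembled as
% T = diag(alpha) + diag(beta,1) + diag(beta,-1);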

Note: Code 9.4.0.12 assumes that no residual vanishes. This could happen, if z0 exactly belonged to
the span of a few eigenvectors. However, in practical computations inevitable round-off errors will always
ensure that the iterates do not stay in an invariant subspace of A, cf. Rem. 9.3.1.22.

Convergence (what we expect from the above considerations, → [DH03, Sect. 8.5]): in the l-th step

    λ_n ≈ µ_l^(l) ,  λ_{n−1} ≈ µ_{l−1}^(l) ,  . . . ,  λ_1 ≈ µ_1^(l) ,
    where  σ(T_l) = { µ_1^(l), . . . , µ_l^(l) } ,   µ_1^(l) ≤ µ_2^(l) ≤ · · · ≤ µ_l^(l) .

EXAMPLE 9.4.0.13 (Lanczos process for eigenvalue computation)


Fig. 369: A from Ex. 9.4.0.4; Fig. 370: A = gallery('minij',100);. Plotted: errors |Ritz value − eigenvalue| for the four largest eigenvalues λ_n, λ_{n−1}, λ_{n−2}, λ_{n−3} vs. step of the Lanczos process.

Observation: same as in Ex. 9.4.0.4, linear convergence of the Ritz values to the eigenvalues.

However, for A ∈ R^{10,10}, a_ij = min{i, j}: good initial convergence, but a sudden “jump” of the Ritz values off the eigenvalues!
Conjecture: Impact of roundoff errors, cf. Ex. 10.2.3.1
y

EXAMPLE 9.4.0.14 (Impact of roundoff on Lanczos process)

A ∈ R10,10 , aij = min{i, j} . A = gallery(’minij’,10);

Computed by [V,alpha,beta] = lanczos(A,n,ones(n,1));, see Code 9.4.0.12:


 
T = tridiagonal matrix with
    diagonal:      38.500000, 9.642857, 2.720779, 1.336364, 0.826316, 0.582380, 0.446860, 0.363803, 3.820888, 41.254286
    off-diagonal:  14.813845, 2.062955, 0.776284, 0.385013, 0.215431, 0.126781, 0.074650, 0.043121, 11.991094

σ (A) = {0.255680,0.273787,0.307979,0.366209,0.465233,0.643104,1.000000,1.873023,5.048917,44.766069}
σ (T) = {0.263867,0.303001,0.365376,0.465199,0.643104,1.000000,1.873023,5.048917,44.765976,44.766069}

Uncanny cluster of computed eigenvalues of T (“ghost eigenvalues”, [GV89, Sect. 9.2.5])


 
V^H V =
  [ 1.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000251  0.258801  0.883711
    0.000000  1.000000 −0.000000  0.000000  0.000000  0.000000  0.000000  0.000106  0.109470  0.373799
    0.000000 −0.000000  1.000000  0.000000  0.000000  0.000000  0.000000  0.000005  0.005373  0.018347
    0.000000  0.000000  0.000000  1.000000 −0.000000  0.000000  0.000000  0.000000  0.000096  0.000328
    0.000000  0.000000  0.000000 −0.000000  1.000000  0.000000  0.000000  0.000000  0.000001  0.000003
    0.000000  0.000000  0.000000  0.000000  0.000000  1.000000 −0.000000  0.000000  0.000000  0.000000
    0.000000  0.000000  0.000000  0.000000  0.000000 −0.000000  1.000000 −0.000000  0.000000  0.000000
    0.000251  0.000106  0.000005  0.000000  0.000000  0.000000 −0.000000  1.000000 −0.000000  0.000000
    0.258801  0.109470  0.005373  0.000096  0.000001  0.000000  0.000000 −0.000000  1.000000  0.000000
    0.883711  0.373799  0.018347  0.000328  0.000003  0.000000  0.000000  0.000000  0.000000  1.000000 ]

Loss of orthogonality of residual vectors due to roundoff


(compare: impact of roundoff on the CG iteration, Ex. 10.2.3.1)

 l   σ(T_l)
 1   38.500000
 2   3.392123 44.750734
 3   1.117692 4.979881 44.766064
 4   0.597664 1.788008 5.048259 44.766069
 5   0.415715 0.925441 1.870175 5.048916 44.766069
 6   0.336507 0.588906 0.995299 1.872997 5.048917 44.766069
 7   0.297303 0.431779 0.638542 0.999922 1.873023 5.048917 44.766069
 8   0.276160 0.349724 0.462449 0.643016 1.000000 1.873023 5.048917 44.766069
 9   0.276035 0.349451 0.462320 0.643006 1.000000 1.873023 3.821426 5.048917 44.766069
10   0.263867 0.303001 0.365376 0.465199 0.643104 1.000000 1.873023 5.048917 44.765976 44.766069

y

Idea:
✦ do not rely on the orthogonality relations of Lemma 10.2.2.5
✦ use explicit Gram-Schmidt orthogonalization [NS02, Thm. 4.8], [Gut09, Alg. 6.1]

Details: inductive approach: given an ONB {v_1, ..., v_l} of K_l(A, z),

  ṽ_{l+1} := A v_l − Σ_{j=1}^{l} (v_j^H A v_l) v_j ,   v_{l+1} := ṽ_{l+1} / ‖ṽ_{l+1}‖_2   ⇒   v_{l+1} ⊥ K_l(A, z) .        (9.4.0.15)

(Gram-Schmidt orthogonalization, cf. (10.2.2.4))

Arnoldi process, in step l:  1 A×vector product, l+1 dot products, l AXPY-operations, n divisions.


➣ Computational cost for l steps, if at most k non-zero entries in each row of A: O(nkl²)

MATLAB-code 9.4.0.16: Arnoldi process

If it does not stop prematurely, the Arnoldi process of Code 9.4.0.16 will yield an orthonormal basis (ONB) of K_{k+1}(A, v_0) for a general A ∈ C^{n,n}.
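A minimal MATLAB sketch of the Arnoldi process based on (9.4.0.15) could look as follows; the function name arnoldi_sketch is an illustrative choice and this is not the lecture's Code 9.4.0.16.

  function [V,H] = arnoldi_sketch(A,v0,k)
  % Minimal sketch: k steps of the Arnoldi process started with v0.
  % If there is no premature termination, V has k+1 orthonormal columns (ONB of
  % K_{k+1}(A,v0)) and H is the (k+1) x k upper Hessenberg matrix with A*V(:,1:k) = V*H.
  n = size(A,1); V = zeros(n,k+1); H = zeros(k+1,k);
  V(:,1) = v0/norm(v0);
  for l = 1:k
    vt = A*V(:,l);
    for j = 1:l                        % Gram-Schmidt orthogonalization against v_1,...,v_l
      H(j,l) = V(:,j)'*vt;
      vt = vt - H(j,l)*V(:,j);
    end
    H(l+1,l) = norm(vt);
    if H(l+1,l) == 0                   % premature termination (invariant subspace found)
      V = V(:,1:l); H = H(1:l,1:l); return;
    end
    V(:,l+1) = vt/H(l+1,l);
  end
  end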

Algebraic view of the Arnoldi process of Code 9.4.0.16, meaning of the output H:

  V_l = [v_1, ..., v_l]:   A V_l = V_{l+1} H̃_l ,   H̃_l ∈ K^{l+1,l}   with   h̃_ij = v_i^H A v_j, if i ≤ j ;   h̃_ij = ‖ṽ_i‖_2, if i = j+1 ;   h̃_ij = 0 else.

➡ The matrices H̃_l are non-square upper Hessenberg matrices.

(Schematic: A · [v_1 ··· v_l] = [v_1 ··· v_{l+1}] · H̃_l with H̃_l an (l+1)×l upper Hessenberg matrix.)

Translate Code 9.4.0.16 to matrix calculus:


Lemma 9.4.0.17. Theory of Arnoldi process

For the matrices V_l ∈ K^{n,l}, H̃_l ∈ K^{l+1,l} arising in the l-th step, l ≤ n, of the Arnoldi process holds:
(i)   V_l^H V_l = I (unitary matrix),
(ii)  A V_l = V_{l+1} H̃_l ,  H̃_l is a non-square upper Hessenberg matrix,
(iii) V_l^H A V_l = H_l ∈ K^{l,l} ,  h_ij = h̃_ij for 1 ≤ i, j ≤ l ,
(iv)  if A = A^H, then H_l is tridiagonal (➣ Lanczos process).

Proof. Direct from Gram-Schmidt orthogonalization and inspection of Code 9.4.0.16.


Remark 9.4.0.18 (Arnoldi process and Ritz projection)


Interpretation of Lemma 9.4.0.17 (iii) & (i):
Hl x = λx is a (generalized) Ritz projection of EVP Ax = λx, cf. Section 9.3.4.2.
y
Eigenvalue approximation for general EVP Ax = λx by Arnoldi process:


MATLAB-code 9.4.0.19: Arnoldi eigenvalue approximation

MATLAB-code 9.4.0.20: Arnoldi eigenvalue approximation

function [dn,V,Ht] = arnoldieig(A,v0,k,tol)
  n = size(A,1); V = [v0/norm(v0)];
  Ht = zeros(1,0); dn = zeros(k,1);
  for l=1:n
    d = dn;
    Ht = [Ht, zeros(l,1); zeros(1,l)];
    vt = A*V(:,l);
    for j=1:l
      Ht(j,l) = dot(V(:,j),vt);
      vt = vt - Ht(j,l)*V(:,j);
    end
    ev = sort(eig(Ht(1:l,1:l)));
    dn(1:min(l,k)) = ev(end:-1:end-min(l,k)+1);
    if (norm(d-dn) < tol*norm(dn)), break; end
    Ht(l+1,l) = norm(vt);
    V = [V, vt/Ht(l+1,l)];
  end
Comments on Code 9.4.0.20:
✦ Arnoldi process for computing the k largest (in modulus) eigenvalues of A ∈ C^{n,n}
✦ heuristic termination criterion
✦ 1 A×vector product per step (➣ attractive for sparse matrices)
✦ however: the required storage increases with the number of steps, cf. the situation with GMRES, Section 10.4.1.
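A possible invocation for the test matrix of Ex. 9.4.0.13 (the chosen tolerance 1.0E-8 and the number k = 4 of wanted eigenvalues are illustrative):

  A = gallery('minij',100); n = size(A,1);
  [dn,V,Ht] = arnoldieig(A, ones(n,1), 4, 1.0E-8);   % approximations of the 4 largest (in modulus) eigenvalues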

EXAMPLE 9.4.0.21 (Stability of Arnoldi process)

A ∈ R^{100,100}, a_ij = min{i, j}.    A = gallery('minij',100);


Fig. 371: approximation error of the Ritz values (λ_n, λ_{n−1}, λ_{n−2}, λ_{n−3}) vs. step of the Lanczos process.
Fig. 372: approximation error of the Ritz values vs. step of the Arnoldi process.

(Left: Lanczos process, Ritz values; right: Arnoldi process, Ritz values.)

Ritz values during Arnoldi process for A = gallery(’minij’,10); ↔ Ex. 9.4.0.13

l σ (Hl )
1 38.500000

2 3.392123 44.750734

3 1.117692 4.979881 44.766064

4 0.597664 1.788008 5.048259 44.766069

5 0.415715 0.925441 1.870175 5.048916 44.766069

6 0.336507 0.588906 0.995299 1.872997 5.048917 44.766069

7 0.297303 0.431779 0.638542 0.999922 1.873023 5.048917 44.766069

8 0.276159 0.349722 0.462449 0.643016 1.000000 1.873023 5.048917 44.766069

9 0.263872 0.303009 0.365379 0.465199 0.643104 1.000000 1.873023 5.048917 44.766069

10 0.255680 0.273787 0.307979 0.366209 0.465233 0.643104 1.000000 1.873023 5.048917 44.766069

Observation: almost perfect approximation of the spectrum of A.

For the above examples the Arnoldi process and the Lanczos process are algebraically equivalent, because they are applied to a symmetric matrix A = A^⊤. However, they behave strikingly differently, which indicates that they are not numerically equivalent.
The Arnoldi process is much less affected by roundoff than the Lanczos process, because it does not take the orthogonality of the "residual vector sequence" for granted. Hence, the Arnoldi process enjoys superior numerical stability (→ ??, Def. 1.5.5.19) compared to the Lanczos process. y

EXAMPLE 9.4.0.22 (Eigenvalue computation with Arnoldi process)


Eigenvalue approximation from Arnoldi process for non-symmetric A, initial vector ones(100,1);

M ATLAB-code 9.4.0.23:


Fig. 373: Ritz values approximating the largest eigenvalues (λ_n, λ_{n−1}, λ_{n−2}) vs. step of the Arnoldi process.
Fig. 374: approximation errors of these Ritz values vs. step of the Arnoldi process.
Fig. 375: Ritz values approximating the smallest eigenvalues (λ_1, λ_2, λ_3) vs. step of the Arnoldi process.
Fig. 376: approximation errors of these Ritz values vs. step of the Arnoldi process.

Observation: “vaguely linear” convergence of largest and smallest eigenvalues, cf. Ex. 9.4.0.4. y

Krylov subspace iteration methods (= Arnoldi process, Lanczos process) attractive for computing a
few of the largest/smallest eigenvalues and associated eigenvectors of large sparse matrices.

Remark 9.4.0.24 (Krylov subspace methods for generalized EVP)


Adaptation of Krylov subspace iterative eigensolvers to generalized EVP: Ax = λBx, B s.p.d.: replace
Euclidean inner product with “B-inner product” (x, y) 7→ x H By. y

M ATLAB-functions:

d = eigs(A,k,sigma) : k largest/smallest eigenvalues of A


d = eigs(A,B,k,sigma): k largest/smallest eigenvalues for generalized EVP Ax = λBx,B
s.p.d.
d = eigs(Afun,n,k) : Afun = handle to function providing matrix×vector for A/A−1 /A −
αI/(A − αB)−1 . (Use flags to tell eigs about special properties of
matrix behind Afun.)
eigs just calls routines of the open source ARPACK numerical library.
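For illustration, a call computing the six largest-magnitude eigenvalues of a sparse s.p.d. test matrix might read (the matrix and the number of eigenvalues are arbitrary choices):

  A = gallery('poisson',100);   % sparse s.p.d. test matrix
  d = eigs(A, 6, 'lm');         % 6 eigenvalues of largest magnitude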



Bibliography

[AKY99] Charles J Alpert, Andrew B Kahng, and So-Zen Yao. “Spectral partitioning with mul-
tiple eigenvectors”. In: Discrete Applied Mathematics 90.1-3 (1999), pp. 3–26. DOI:
10.1016/S0166-218X(98)00083-3 (cit. on p. 700).
[Bai+00] Z.-J. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst. Templates for the Solution
of Algebraic Eigenvalue Problems. Philadelphia, PA: SIAM, 2000 (cit. on p. 677).
[BF06] Yuri Boykov and Gareth Funka-Lea. “Graph Cuts and Efficient N-D Image Segmenta-
tion”. In: International Journal of Computer Vision 70.2 (Nov. 2006), pp. 109–131. DOI:
10.1007/s11263-006-7934-5.
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 679, 681, 685, 690, 692, 701).
[DH03] P. Deuflhard and A. Hohmann. Numerical Analysis in Modern Scientific Computing. Vol. 43.
Texts in Applied Mathematics. Springer, 2003 (cit. on p. 720).
[Gle15] David F. Gleich. “PageRank Beyond the Web”. In: SIAM Review 57.3 (2015), pp. 321–363.
DOI: 10.1137/140976649 (cit. on p. 685).
[GV89] G.H. Golub and C.F. Van Loan. Matrix computations. 2nd. Baltimore, London: John Hopkins
University Press, 1989 (cit. on pp. 682, 683, 696, 721).
[GT08] Craig Gotsman and Sivan Toledo. “On the computation of null spaces of sparse rect-
angular matrices”. In: SIAM J. Matrix Anal. Appl. 30.2 (2008), pp. 445–463. DOI:
10.1137/050638369 (cit. on p. 701).
[Gut09] M.H. Gutknecht. Lineare Algebra. Lecture Notes. SAM, ETH Zürich, 2009 (cit. on pp. 679–681,
711, 721).
[Hac94] Wolfgang Hackbusch. Iterative solution of large sparse systems of equations. Vol. 95. Applied
Mathematical Sciences. New York: Springer-Verlag, 1994, pp. xxii+429 (cit. on p. 681).
[Hal70] K.M. Hall. “An r-dimensional quadratic placement algorithm”. In: Management Science 17.3
(1970), pp. 219–229.
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 681, 682, 702,
705, 716).
[LM06] A.N. Lengville and C.D. Meyer. Google’s PageRank and Beyond: The Science of Search En-
gine Rankings. Princeton, NJ: Princeton University Press, 2006 (cit. on p. 685).
[LMM19] Anna Little, Mauro Maggioni, and James M. Murphy. Path-Based Spectral Clustering: Guaran-
tees, Robustness to Outliers, and Fast Algorithms. 2019.
[Lux07] Ulrike von Luxburg. “A tutorial on spectral clustering”. In: Stat. Comput. 17.4 (2007), pp. 395–
416. DOI: 10.1007/s11222-007-9033-z.
[Ney99a] K. Neymeyr. A geometric theory for preconditioned inverse iteration applied to a subspace.
Tech. rep. 130. Tübingen, Germany: SFB 382, Universität Tübingen, Nov. 1999 (cit. on p. 704).
[Ney99b] K. Neymeyr. A geometric theory for preconditioned inverse iteration: III. Sharp convergence
estimates. Tech. rep. 130. Tübingen, Germany: SFB 382, Universität Tübingen, Nov. 1999 (cit.
on p. 704).
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
pp. 679–682, 711, 721).


[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Math-
ematics. New York: Springer, 2000 (cit. on pp. 680–682, 685, 692, 711).
[SM00] J.-B. Shi and J. Malik. “Normalized cuts and image segmentation”. In: IEEE Trans. Pattern
Analysis and Machine Intelligence 22.8 (2000), pp. 888–905 (cit. on pp. 693, 695, 700).
[ST96] D.A. Spielman and Shang-Hua Teng. “Spectral partitioning works: planar graphs and finite el-
ement meshes”. In: Foundations of Computer Science, 1996. Proceedings., 37th Annual Sym-
posium on. Oct. 1996, pp. 96–105. DOI: 10.1109/SFCS.1996.548468 (cit. on p. 700).
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on p. 679).
[TM01] F. Tisseur and K. Meerbergen. “The quadratic eigenvalue problem”. In: SIAM Review 43.2
(2001), pp. 235–286 (cit. on p. 678).



Chapter 10

Krylov Methods for Linear Systems of Equations

Supplementary literature. There is a wealth of literature on iterative methods for the solution of

linear systems of equations: The two books [Hac94] and [Saa03] offer a comprehensive treatment
of the topic (the latter is available online for ETH students and staff).

Concise presentations can be found in [QSS00, Ch. 4] and [DR08, Ch. 13].
Learning outcomes:
• Understanding when and why iterative solution of linear systems of equations may be preferred to
direct solvers based on Gaussian elimination.

Krylov subspace methods = a class of iterative methods (→ Section 8.2) for the approximate solution of large linear systems of equations Ax = b, A ∈ K^{n,n}.

BUT, we have reliable direct methods (Gaussian elimination → Section 2.3, LU-factorization → § 2.3.2.15, QR-factorization → ??) that provide an (apart from roundoff errors) exact solution with a finite number of elementary operations!

Alas, direct elimination may not be feasible, or may be grossly inefficient, because
• it may be too expensive (e.g. for A too large, sparse), → (2.3.2.10),
• inevitable fill-in may exhaust main memory,
• the system matrix may be available only as a procedure y=evalA(x) ↔ y = Ax.

Contents
10.1 Descent Methods [QSS00, Sect. 4.3.3] . . . . . . . . . . . . . . . . . . . . . . . . . . 729
10.1.1 Quadratic minimization context . . . . . . . . . . . . . . . . . . . . . . . . . 729
10.1.2 Abstract steepest descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 730
10.1.3 Gradient method for s.p.d. linear system of equations . . . . . . . . . . . . . 731
10.1.4 Convergence of the gradient method . . . . . . . . . . . . . . . . . . . . . . . 732
10.2 Conjugate gradient method (CG) [Han02, Ch. 9], [DR08, Sect. 13.4], [QSS00, Sect. 4.3.4]736
10.2.1 Krylov spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 737
10.2.2 Implementation of CG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 738
10.2.3 Convergence of CG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 741
10.3 Preconditioning [DR08, Sect. 13.5], [Han02, Ch. 10], [QSS00, Sect. 4.3.5] . . . . . . 745
10.4 Survey of Krylov Subspace Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 751


10.4.1 Minimal residual methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 751


10.4.2 Iterations with short recursions [QSS00, Sect. 4.5] . . . . . . . . . . . . . . . 752

10.1 Descent Methods [QSS00, Sect. 4.3.3]


Focus:
Linear system of equations Ax = b, A ∈ R n,n , b ∈ R n , n ∈ N given,
with symmetric positive definite (s.p.d., → Def. 1.1.2.6) system matrix A

A-inner product (x, y) 7→ x⊤ Ay ⇒ “A-geometry”

Definition 10.1.0.1. Energy norm → [Han02, Def. 9.1]

A s.p.d. matrix A ∈ R^{n,n} induces an energy norm

  ‖x‖_A := (x^⊤ A x)^{1/2} ,   x ∈ R^n .

Remark 10.1.0.2 (Krylov methods for complex s.p.d. system matrices) In this chapter, for the sake of
simplicity, we restrict ourselves to K = R.

However, the (conjugate) gradient methods introduced below also work for LSE Ax = b with A ∈ C n,n ,
A = A H s.p.d. when ⊤ is replaced with H (Hermitian transposed). Then, all theoretical statements remain
valid unaltered for K = C. y

10.1.1 Quadratic minimization context


Lemma 10.1.1.1. S.p.d. LSE and quadratic minimization problem [DR08, (13.37)]

A LSE with A ∈ R^{n,n} s.p.d. and b ∈ R^n is equivalent to a minimization problem:

  Ax = b   ⇔   x = argmin_{y ∈ R^n} J(y) ,   J(y) := ½ y^⊤ A y − b^⊤ y .        (10.1.1.2)

J is called a quadratic functional.

Proof. For x* := A^{−1} b a straightforward computation using A = A^⊤ shows

  J(x) − J(x*) = ½ x^⊤ A x − b^⊤ x − ½ (x*)^⊤ A x* + b^⊤ x*
               = ½ x^⊤ A x − (x*)^⊤ A x + ½ (x*)^⊤ A x*        (using b = A x*)        (10.1.1.3)
               = ½ ‖x − x*‖_A² .

Then the assertion follows from the properties of the energy norm.

   
EXAMPLE 10.1.1.4 (Quadratic functional in 2D)  Plot of J from (10.1.1.2) for A = [2 1; 1 2], b = [1; 1].


Fig. 377: level lines of J(x_1, x_2).    Fig. 378: surface plot of J(x_1, x_2).
Level lines of quadratic functionals with s.p.d. A are (hyper)ellipses y

Algorithmic idea: (Lemma 10.1.1.1 ➣) Solve Ax = b iteratively by successive


solution of simpler minimization problems

10.1.2 Abstract steepest descent

Task: Given a continuously differentiable F : D ⊂ R^n ↦ R, find a minimizer x* ∈ D:   x* = argmin_{x ∈ D} F(x).

Note that a minimizer need not exist, if F is not bounded from below (e.g., F(x) = x³, x ∈ R, or F(x) = log x, x > 0), or if D is open (e.g., F(x) = √x, x > 0).
The existence of a minimizer is guaranteed if F is bounded from below and D is closed (→ Analysis).

The most natural iteration:


§10.1.2.1 (Steepest descent (ger.: steilster Abstieg))

  Initial guess x^(0) ∈ D, k = 0
  repeat
    d_k := −grad F(x^(k))                       (d_k ≙ direction of steepest descent)
    t* := argmin_{t ∈ R} F(x^(k) + t d_k)       (line search ≙ 1D minimization: use Newton's method (→ Section 8.4.2.1) on the derivative)
    x^(k+1) := x^(k) + t* d_k
    k := k + 1
  until ‖x^(k) − x^(k−1)‖ ≤ τ_rel ‖x^(k)‖  or  ‖x^(k) − x^(k−1)‖ ≤ τ_abs

(correction-based a posteriori termination criterion, see Section 8.2.3 for a discussion; τ ≙ prescribed tolerance)
y
y


The gradient (→ [Str09, Kapitel 7])

  grad F(x) = [ ∂F/∂x_1 , ... , ∂F/∂x_n ]^⊤ ∈ R^n        (10.1.2.2)

provides the direction of local steepest ascent/descent of F.

Fig. 379

Of course this very algorithm can encounter plenty of difficulties:


• iteration may get stuck in a local minimum,
• iteration may diverge or lead out of D,
• line search may not be feasible.

10.1.3 Gradient method for s.p.d. linear system of equations


However, for the quadratic minimization problem (10.1.1.2) § 10.1.2.1 will converge:

("Geometric intuition", see Fig. 377: the quadratic functional J with s.p.d. A has a unique global minimum, and grad J ≠ 0 away from the minimum, pointing towards it.)

Adaptation: steepest descent algorithm § 10.1.2.1 for the quadratic minimization problem (10.1.1.2), see [QSS00, Sect. 7.2.4]:

  F(x) := J(x) = ½ x^⊤ A x − b^⊤ x   ⇒   grad J(x) = Ax − b .        (10.1.3.1)

This follows from A = A^⊤, the componentwise expression

  J(x) = ½ Σ_{i,j=1}^{n} a_ij x_i x_j − Σ_{i=1}^{n} b_i x_i

and the definition (10.1.2.2) of the gradient.

➣ For the descent direction in § 10.1.2.1 applied to the minimization of J from (10.1.1.2) holds

  d_k = b − A x^(k) =: r_k ,   the residual (→ Def. 2.4.0.1) for x^(k) .

§ 10.1.2.1 for F = J from (10.1.1.2): the function to be minimized in the line search step is

  φ(t) := J(x^(k) + t d_k) = J(x^(k)) + t d_k^⊤ (A x^(k) − b) + ½ t² d_k^⊤ A d_k   ➙ a parabola !

  dφ/dt (t*) = 0   ⇔   t* = (d_k^⊤ d_k) / (d_k^⊤ A d_k)   (unique minimizer) .        (10.1.3.2)

Note: dk = 0 ⇔ Ax(k) = b (solution found !)


Note: A s.p.d. (→ Def. 1.1.2.6)  ⇒  d_k^⊤ A d_k > 0, if d_k ≠ 0,

so φ(t) is a parabola that is bounded from below (upward opening).

Based on (10.1.3.1) and (10.1.3.2) we obtain the following steepest descent method for the minimization problem (10.1.1.2):

§10.1.3.3 (Gradient method for s.p.d. LSE)  Steepest descent iteration = gradient method for the LSE Ax = b, A ∈ R^{n,n} s.p.d., b ∈ R^n:

  Initial guess x^(0) ∈ R^n, k = 0
  r_0 := b − A x^(0)
  repeat
    t* := (r_k^⊤ r_k) / (r_k^⊤ A r_k)
    x^(k+1) := x^(k) + t* r_k
    r_{k+1} := r_k − t* A r_k
    k := k + 1
  until ‖x^(k) − x^(k−1)‖ ≤ τ_rel ‖x^(k)‖  or  ‖x^(k) − x^(k−1)‖ ≤ τ_abs

MATLAB-code 10.1.3.4: gradient method for Ax = b, A s.p.d.
y
Recursion for residuals, see ?? of Code 10.1.3.4:

rk+1 = b − Ax(k+1) = b − A(x(k) + t∗ rk ) = rk − t∗ Ark . (10.1.3.5)

✬ ✩
One step of gradient method involves
✦ A single matrix×vector product with A ,
✦ 2 AXPY-operations (→ Section 1.3.2) on vectors of length n,
✦ 2 dot products in R n .
✫ ✪
Computational cost (per step) = cost(matrix×vector) + O(n)

➣ If A ∈ R n,n is a sparse matrix (→ ??) with “O(n) nonzero entries”, and the data structures allow
to perform the matrix×vector product with a computational effort O(n), then a single step of the
gradient method costs O(n) elementary operations.
➣ Gradient method of § 10.1.3.3 only needs A×vector in procedural form y = evalA(x).
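The listing Code 10.1.3.4 is not reproduced here; a minimal MATLAB sketch of § 10.1.3.3 with the matrix passed in procedural form could read as follows (function and variable names are illustrative):

  function x = gradit_sketch(evalA,b,x0,rtol,atol,maxit)
  % Minimal sketch of the gradient method of § 10.1.3.3 for Ax = b, A s.p.d.
  % evalA: handle realizing x -> A*x (procedural form of the matrix).
  x = x0; r = b - evalA(x);
  for k = 1:maxit
    p = evalA(r);                 % the single matrix x vector product per step
    ts = (r'*r)/(r'*p);           % optimal step size t* from (10.1.3.2)
    x = x + ts*r;
    cn = ts*norm(r);              % norm of the correction t* r_k
    if (cn < rtol*norm(x)) || (cn < atol), return; end
    r = r - ts*p;                 % residual recursion (10.1.3.5)
  end
  end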

10.1.4 Convergence of the gradient method


EXAMPLE 10.1.4.1 (Gradient method in 2D)  S.p.d. matrices ∈ R^{2,2}:

  A1 = [1.9412  −0.2353; −0.2353  1.0588] ,   A2 = [7.5353  −1.8588; −1.8588  0.5647]

Eigenvalues: σ(A1) = {1, 2}, σ(A2) = {0.1, 8}

✎ notation: spectrum of a matrix M ∈ K^{n,n}:  σ(M) := {λ ∈ C : λ is an eigenvalue of M}


Fig. 380: iterates x^(0), x^(1), x^(2), x^(3) of § 10.1.3.3 for A1 in the (x_1, x_2)-plane.
Fig. 381: iterates of § 10.1.3.3 for A2.


Recall the theorem on principal axis transformation: every real symmetric matrix can be diagonalized by orthogonal similarity transformations, see Cor. 9.1.0.9, [NS02, Thm. 7.8], [Gut09, Satz 9.15]:

  A = A^⊤ ∈ R^{n,n}  ⇒  ∃ Q ∈ R^{n,n} orthogonal:  A = Q D Q^⊤ ,  D = diag(d_1, ..., d_n) ∈ R^{n,n} diagonal .        (10.1.4.2)

  J(Q ŷ) = ½ ŷ^⊤ D ŷ − (Q^⊤ b)^⊤ ŷ = ½ Σ_{i=1}^{n} d_i ŷ_i² − b̂_i ŷ_i ,   b̂ := Q^⊤ b .

Hence, a rigid transformation (rotation, reflection) maps the level surfaces of J from (10.1.1.2) to ellipses with principal axes d_i. As A is s.p.d., d_i > 0 is guaranteed.

Observations:
• A larger spread of the spectrum leads to more elongated ellipses as level lines ➣ slower convergence of the gradient method, see Fig. 381.
• Orthogonality of successive residuals r_k, r_{k+1}: clear from the definition of § 10.1.3.3,

  r_k^⊤ r_{k+1} = r_k^⊤ r_k − (r_k^⊤ r_k)/(r_k^⊤ A r_k) · r_k^⊤ A r_k = 0 .        (10.1.4.3)
y

EXAMPLE 10.1.4.4 (Convergence of gradient method)

Convergence of the gradient method for diagonal matrices, x* = [1, ..., 1]^⊤, x^(0) = 0:

1  d = 1:0.01:2; A1 = diag(d);
2  d = 1:0.1:11; A2 = diag(d);
3  d = 1:1:101;  A3 = diag(d);


Fig. 382 / Fig. 383: energy norm of the error and 2-norm of the residual vs. iteration step k for A = diag(1:0.01:2), diag(1:0.1:11), diag(1:1:101) (semi-logarithmic plots).

Note: To study convergence it is sufficient to consider diagonal matrices, because

1. (10.1.4.2): for every A ∈ R^{n,n} with A^⊤ = A there is an orthogonal matrix Q ∈ R^{n,n} such that A = Q^⊤ D Q with a diagonal matrix D (principal axis transformation), → Cor. 9.1.0.9, [NS02, Thm. 7.8], [Gut09, Satz 9.15],
2. when applying the gradient method § 10.1.3.3 to both Ax = b and D x̃ = b̃, b̃ := Qb, the iterates x^(k) and x̃^(k) are related by Q x^(k) = x̃^(k).

With x̃^(k) := Q x^(k), r̃_k := Q r_k, and using Q^⊤ Q = I, the gradient iteration for Ax = b with A = Q^⊤ D Q,

  initial guess x^(0) ∈ R^n, k = 0;  r_0 := b − Q^⊤ D Q x^(0)
  repeat:  t* := (r_k^⊤ Q^⊤ Q r_k)/(r_k^⊤ Q^⊤ D Q r_k);  x^(k+1) := x^(k) + t* r_k;  r_{k+1} := r_k − t* Q^⊤ D Q r_k;  k := k + 1
  until ‖x^(k) − x^(k−1)‖ ≤ τ ‖x^(k)‖

is equivalent to the gradient iteration for D x̃ = b̃ in the transformed variables:

  initial guess x̃^(0) ∈ R^n, k = 0;  r̃_0 := b̃ − D x̃^(0)
  repeat:  t* := (r̃_k^⊤ r̃_k)/(r̃_k^⊤ D r̃_k);  x̃^(k+1) := x̃^(k) + t* r̃_k;  r̃_{k+1} := r̃_k − t* D r̃_k;  k := k + 1
  until ‖x̃^(k) − x̃^(k−1)‖ ≤ τ ‖x̃^(k)‖

Observation:
✦ linear convergence (→ Def. 8.2.2.1), see also Rem. 8.2.2.6

rate of convergence increases (↔ speed of convergence decreases) with spread of
spectrum of A
Impact of distribution of diagonal entries (↔ eigenvalues) of (diagonal matrix) A
(b = x∗ = 0, x0 = cos((1:n)’);)
Test matrix #1: A=diag(d); d = (1:100);
Test matrix #2: A=diag(d); d = [1+(0:97)/97 , 50 , 100];
Test matrix #3: A=diag(d); d = [1+(0:49)*0.05, 100-(0:49)*0.05];
Test matrix #4: eigenvalues exponentially dense at 1


Left plot: diagonal entries (eigenvalue distributions) of test matrices #1–#4 vs. index.
Right plot: error norms and residual norms (2-norms) vs. iteration step k for test matrices #1–#4 (semi-logarithmic plot).

Observation: Matrices #1, #2 & #4 ➣ little impact of distribution of eigenvalues on asymptotic con-
vergence (exception: matrix #2)
y

Theory [Hac91, Sect. 9.2.2], [QSS00, Sect. 7.2.4]:

Theorem 10.1.4.5. Convergence of gradient method/steepest descent

The iterates of the gradient method of § 10.1.3.3 satisfy

  ‖x^(k+1) − x*‖_A ≤ L ‖x^(k) − x*‖_A ,   L := (cond_2(A) − 1)/(cond_2(A) + 1) ,

that is, the iteration converges at least linearly (→ Def. 8.2.2.1) w.r.t. the energy norm (→ Def. 10.1.0.1).

✎ notation: cond_2(A) ≙ condition number (→ Def. 2.2.2.7) of A induced by the 2-norm

Remark 10.1.4.6 (2-norm from eigenvalues → [Gut09, Sect. 10.6], [NS02, Sect. 7.4])

  A = A^⊤  ⇒  ‖A‖_2 = max(|σ(A)|) ,   ‖A^{−1}‖_2 = min(|σ(A)|)^{−1} , if A is regular.        (10.1.4.7)

  A = A^⊤  ⇒  cond_2(A) = λ_max(A)/λ_min(A) ,  where  λ_max(A) := max(|σ(A)|) ,  λ_min(A) := min(|σ(A)|) .        (10.1.4.8)

✎ other notation:  κ(A) := λ_max(A)/λ_min(A) ≙ spectral condition number of A
(for general A: λ_max(A)/λ_min(A) = largest/smallest eigenvalue in modulus)

These results are an immediate consequence of the fact that

  ∀ A ∈ R^{n,n}, A^⊤ = A  ∃ U ∈ R^{n,n}, U^{−1} = U^⊤ :  U^⊤ A U is diagonal,

see (10.1.4.2), Cor. 9.1.0.9, [NS02, Thm. 7.8], [Gut09, Satz 9.15].

Please note that for a general regular M ∈ R^{n,n} we cannot expect cond_2(M) = κ(M). y


10.2 Conjugate gradient method (CG) [Han02, Ch. 9], [DR08, Sect. 13.4], [QSS00, Sect. 4.3.4]

Again we consider a linear system of equations Ax = b with s.p.d. (→ Def. 1.1.2.6) system matrix A ∈ R^{n,n} and given b ∈ R^n.

Liability of gradient method of Section 10.1.3: NO MEMORY

1D line search in § 10.1.3.3 is oblivious of former line searches, which rules out reuse of information
gained in previous steps of the iteration. This is a typical drawback of 1-point iterative methods.

Idea: Replace the linear search with a subspace correction.

Given:
✦ an initial guess x^(0)
✦ nested subspaces U_1 ⊂ U_2 ⊂ U_3 ⊂ ··· ⊂ U_n = R^n,  dim U_k = k,

  x^(k) := argmin_{x ∈ U_k + x^(0)} J(x) ,        (10.2.0.1)

with the quadratic functional J from (10.1.1.2).

Note: Once the subspaces U_k and x^(0) are fixed, the iteration (10.2.0.1) is well defined, because J restricted to U_k + x^(0) always possesses a unique minimizer.

Obvious (from Lemma 10.1.1.1):   x^(n) = x* = A^{−1} b.

Thanks to (10.1.1.3), definition (10.2.0.1) ensures:   ‖x^(k+1) − x*‖_A ≤ ‖x^(k) − x*‖_A .

How to find suitable subspaces Uk ?

Idea: Uk+1 ← Uk + “ local steepest descent direction”

given by − grad J (x(k) ) = b − Ax(k) = rk (residual → Def. 2.4.0.1)

Uk+1 = Span{Uk , rk } , x(k) from (10.2.0.1). (10.2.0.2)

Obvious: rk = 0 ⇒ x(k) = x∗ := A−1 b done ✔

Lemma 10.2.0.3. rk ⊥ Uk

With x(k) according to (10.2.0.1), Uk from (10.2.0.2) the residual rk := b − Ax(k) satisfies

  r_k^⊤ u = 0   ∀ u ∈ U_k   ("r_k ⊥ U_k").


Geometric consideration: since x(k) is the minimizer of J over the affine space Uk + x(0) , the projection of
the steepest descent direction grad J (x(k) ) onto Uk has to vanish:

  x^(k) := argmin_{x ∈ U_k + x^(0)} J(x)   ⇒   grad J(x^(k)) ⊥ U_k .        (10.2.0.4)

Proof. Consider

  ψ(t) = J(x^(k) + t u) ,   u ∈ U_k ,  t ∈ R .

By (10.2.0.1), t ↦ ψ(t) has a global minimum at t = 0, which implies

  dψ/dt (0) = grad J(x^(k))^⊤ u = (A x^(k) − b)^⊤ u = 0 .

Since u ∈ U_k was arbitrary, the lemma is proved.

Corollary 10.2.0.5.

If rl 6= 0 for l = 0, . . . , k, k ≤ n, then {r0 , . . . , rk } is an orthogonal basis of Uk .

Lemma 10.2.0.3 also implies that, if U0 = {0}, then dim Uk = k as long as x(k) 6= x∗ , that is, before we
have converged to the exact solution.

(10.2.0.1) and (10.2.0.2) define the conjugate gradient method (CG) for the iterative solution of
Ax = b
(hailed as a “top ten algorithm” of the 20th century, SIAM News, 33(4))

10.2.1 Krylov spaces


Definition 10.2.1.1. Krylov space

For A ∈ R n,n , z ∈ R n , z 6= 0, the l -th Krylov space is defined as

Kl (A, z) := Span{z, Az, . . . , Al −1 z} .

Equivalent definition: Kl (A, z) = { p(A)z: p polynomial of degree ≤ l }

Lemma 10.2.1.2.
The subspaces Uk ⊂ R n , k ≥ 1, defined by (10.2.0.1) and (10.2.0.2) satisfy

Uk = Span{r0 , Ar0 , . . . , Ak−1 r0 } = Kk (A, r0 ) ,

where r0 = b − Ax(0) is the initial residual.

Proof. (by induction) Obviously A K_k(A, r_0) ⊂ K_{k+1}(A, r_0). In addition,

  r_k = b − A(x^(0) + z) for some z ∈ U_k   ⇒   r_k = r_0 − A z ,   with r_0 ∈ K_{k+1}(A, r_0) and A z ∈ K_{k+1}(A, r_0) .

Since U_{k+1} = Span{U_k, r_k}, we obtain U_{k+1} ⊂ K_{k+1}(A, r_0). Dimensional considerations based on Lemma 10.2.0.3 finish the proof.


10.2.2 Implementation of CG

Assume: a basis {p_1, ..., p_l}, l = 1, ..., n, of K_l(A, r) is available.

  x^(l) ∈ x^(0) + K_l(A, r_0)   ➣   set x^(l) = x^(0) + γ_1 p_1 + ··· + γ_l p_l .

For ψ(γ_1, ..., γ_l) := J(x^(0) + γ_1 p_1 + ··· + γ_l p_l) holds

  (10.2.0.1)   ⇔   ∂ψ/∂γ_j = 0 ,  j = 1, ..., l .

This leads to a linear system of equations by which the coefficients γ_j can be computed:

  [ p_1^⊤ A p_1  ···  p_1^⊤ A p_l ]   [ γ_1 ]     [ p_1^⊤ r ]
  [      ⋮        ⋱        ⋮      ] · [  ⋮  ]  =  [    ⋮    ] ,   r := b − A x^(0) .        (10.2.2.1)
  [ p_l^⊤ A p_1  ···  p_l^⊤ A p_l ]   [ γ_l ]     [ p_l^⊤ r ]

Great simplification, if {p_1, ..., p_l} is an A-orthogonal basis of K_l(A, r):  p_j^⊤ A p_i = 0 for i ≠ j.

Recall: an s.p.d. A induces an inner product ➣ concept of orthogonality [NS02, Sect. 4.4], [Gut09, Sect. 6.2]; "A-geometry" like standard Euclidean space.

Assume: an A-orthogonal basis {p_1, ..., p_n} of R^n is available, such that

  Span{p_1, ..., p_l} = K_l(A, r) .

(Efficient) successive computation of x^(l) becomes possible, see [DR08, Lemma 13.24] (the LSE (10.2.2.1) becomes diagonal!):

  Input:  initial guess x^(0) ∈ R^n
  Given:  A-orthogonal bases {p_1, ..., p_l} of K_l(A, r_0), l = 1, ..., n
  Output: approximate solution x^(l) ∈ R^n of Ax = b

  r_0 := b − A x^(0);
  for j = 1 to l do {  x^(j) := x^(j−1) + (p_j^⊤ r_0)/(p_j^⊤ A p_j) p_j  }        (10.2.2.2)

Task: Efficient computation of A-orthogonal vectors {p_1, ..., p_l} spanning K_l(A, r_0) during the CG iteration.

A-orthogonalities/orthogonalities ➤ short recursions

Lemma 10.2.0.3 implies the orthogonality p_j ⊥ r_m := b − A x^(m), 1 ≤ j ≤ m ≤ l. Also, by A-orthogonality of the p_k,

  p_j^⊤ (b − A x^(m)) = p_j^⊤ ( b − A x^(0) − Σ_{k=1}^{m} (p_k^⊤ r_0)/(p_k^⊤ A p_k) A p_k ) = 0 ,   r_0 = b − A x^(0) .        (10.2.2.3)

From linear algebra we already know a way to construct orthogonal basis vectors:


(10.2.2.3)  ⇒  Idea: Gram-Schmidt orthogonalization [NS02, Thm. 4.8], [Gut09, Alg. 6.1] of the residuals r_j := b − A x^(j) w.r.t. the A-inner product:

  p_1 := r_0 ,   p_{j+1} := (b − A x^(j)) − Σ_{k=1}^{j} (p_k^⊤ A r_j)/(p_k^⊤ A p_k) p_k ,   j = 1, ..., l − 1 ,        (10.2.2.4)

where b − A x^(j) =: r_j and the sum is denoted (∗).

Geometric interpretation of (10.2.2.4):  (∗) ≙ orthogonal projection of r_j onto the subspace Span{p_1, ..., p_j}  (Fig. 385).

Lemma 10.2.2.5. Bases for Krylov spaces in CG

If they do not vanish, the vectors p_j, 1 ≤ j ≤ l, and r_j := b − A x^(j), 0 ≤ j ≤ l, from (10.2.2.2), (10.2.2.4) satisfy
(i)  {p_1, ..., p_j} is an A-orthogonal basis of K_j(A, r_0),
(ii) {r_0, ..., r_{j−1}} is an orthogonal basis of K_j(A, r_0), cf. Cor. 10.2.0.5.

Proof. A-orthogonality of the p_j holds by construction, study (10.2.2.4).

  (10.2.2.2) & (10.2.2.4)  ⇒  p_{j+1} = r_0 − Σ_{k=1}^{j} (p_k^⊤ r_0)/(p_k^⊤ A p_k) A p_k − Σ_{k=1}^{j} (p_k^⊤ A r_j)/(p_k^⊤ A p_k) p_k
                           ⇒  p_{j+1} ∈ Span{r_0, p_1, ..., p_j, A p_1, ..., A p_j} .

A simple induction argument confirms (i) and

  (10.2.2.4)  ⇒  r_j ∈ Span{p_1, ..., p_{j+1}}  &  p_j ∈ Span{r_0, ..., r_{j−1}} ,        (10.2.2.6)

  Span{p_1, ..., p_j} = Span{r_0, ..., r_{j−1}} = K_j(A, r_0) ,        (10.2.2.7)

  (10.2.2.3)  ⇒  r_j ⊥ Span{p_1, ..., p_j} = Span{r_0, ..., r_{j−1}} .        (10.2.2.8)


The orthogonalities from Lemma 10.2.2.5 ➤ short recursions for p_k, r_k, x^(k)!

  (10.2.2.3)  ⇒  (10.2.2.4) collapses to   p_{j+1} := r_j − (p_j^⊤ A r_j)/(p_j^⊤ A p_j) p_j ,   j = 1, ..., l .

Recursion for the residuals:

  (10.2.2.2)  ⇒   r_j = r_{j−1} − (p_j^⊤ r_0)/(p_j^⊤ A p_j) A p_j .


  Lemma 10.2.2.5, (i)  ⇒   r_{j−1}^⊤ p_j = ( r_0 − Σ_{k=1}^{j−1} (r_0^⊤ p_k)/(p_k^⊤ A p_k) A p_k )^⊤ p_j = r_0^⊤ p_j .        (10.2.2.9)

The orthogonality (10.2.2.9) together with (10.2.2.8) permits us to replace r_0 with r_{j−1} in the actual implementation.

§10.2.2.10 (CG method for solving Ax = b, A s.p.d. → [DR08, Alg. 13.27])

Mathematical formulation:

  Input:  initial guess x^(0) ∈ R^n
  Output: approximate solution x^(l) ∈ R^n

  p_1 = r_0 := b − A x^(0);
  for j = 1 to l do {
    x^(j) := x^(j−1) + (p_j^⊤ r_{j−1})/(p_j^⊤ A p_j) · p_j;
    r_j   = r_{j−1} − (p_j^⊤ r_{j−1})/(p_j^⊤ A p_j) · A p_j;
    p_{j+1} = r_j − ((A p_j)^⊤ r_j)/(p_j^⊤ A p_j) · p_j;
  }

Equivalent implementation-oriented formulation:

  Input:  initial guess x ≙ x^(0) ∈ R^n, tolerance τ > 0
  Output: approximate solution x ≙ x^(l)

  p := r_0 := r := b − A x;
  for j = 1 to l_max do {
    β := r^⊤ r;  h := A p;  α := β/(p^⊤ h);
    x := x + α p;
    r := r − α h;
    if ‖r‖ ≤ τ ‖r_0‖ then stop;
    β := (r^⊤ r)/β;
    p := r + β p;
  }
y

In the CG algorithm r_j = b − A x^(j) agrees with the residual associated with the current iterate (in exact arithmetic, cf. Ex. 10.2.3.1), but computation through the short recursion is more efficient.
➣ We find that the CG method possesses all the algorithmic advantages of the gradient method, cf. the discussion in Section 10.1.3.

1 matrix×vector product, 3 dot products, 3 AXPY-operations per step:
if A is sparse, nnz(A) ∼ n ➤ computational effort O(n) per step

MATLAB-code 10.2.2.11: basic CG iteration for solving Ax = b, § 10.2.2.10
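A minimal MATLAB sketch of such a basic CG iteration, following the implementation-oriented variant of § 10.2.2.10, might look as follows (illustrative names, not the lecture's Code 10.2.2.11):

  function x = cg_sketch(evalA,b,x0,tol,maxit)
  % Minimal sketch of the CG method of § 10.2.2.10 for Ax = b, A s.p.d.
  % evalA: handle realizing x -> A*x.
  x = x0; r = b - evalA(x); p = r;
  nr0 = norm(r);
  for j = 1:maxit
    beta = r'*r;
    h = evalA(p);
    alpha = beta/(p'*h);
    x = x + alpha*p;              % update iterate
    r = r - alpha*h;              % short recursion for the residual
    if norm(r) <= tol*nr0, return; end
    p = r + (r'*r)/beta * p;      % new A-orthogonal search direction
  end
  end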

M ATLAB-function:

x=pcg(A,b,tol,maxit,[],[],x0) : Solve Ax = b with at most maxit CG steps: stop,


when krl k : kr0 k < tol.
x=pcg(Afun,b,tol,maxit,[],[],x0): Afun = handle to function for computing A×vector.
[x,flag,relr,it,resv] = pcg(. . .) : diagnostic information about iteration

Remark 10.2.2.12 (A posteriori termination criterion for plain CG)


For any vector norm and associated matrix norm (→ Def. 1.5.5.10) holds (with residual r_l := b − A x^(l))

  (1/cond(A)) · ‖r_l‖/‖r_0‖   ≤   ‖x^(l) − x*‖ / ‖x^(0) − x*‖   ≤   cond(A) · ‖r_l‖/‖r_0‖ .        (10.2.2.13)

The middle term is the relative decrease of the iteration error. (10.2.2.13) can easily be deduced from the error equation A(x^(k) − x*) = r_k, see Def. 2.4.0.1 and (2.4.0.13). y

10.2.3 Convergence of CG
Note: CG is a direct solver, because (in exact arithmetic) x(k) = x∗ for some k ≤ n

EXAMPLE 10.2.3.1 (Impact of roundoff errors on CG → [QSS00, Rem. 4.3])


Numerical experiment: A = hilb(20), x^(0) = 0, b = [1, ..., 1]^⊤.
The Hilbert matrix is extremely ill-conditioned.

Fig. 386: 2-norm of the residual vs. iteration step k of the CG iteration.

Residual vectors recorded during the CG iteration:   R = [r_0, ..., r_10]

R⊤ R =
 
1.000000 −0.000000 0.000000 −0.000000 0.000000 −0.000000 0.016019 −0.795816 −0.430569 0.348133
−0.000000 1.000000 −0.000000 0.000000 −0.000000 0.000000 −0.012075 0.600068 −0.520610 0.420903
 
 0.000000 −0.000000 1.000000 −0.000000 0.000000 −0.000000 0.001582 −0.078664 0.384453 −0.310577
 
−0.000000 0.000000 −0.000000 1.000000 −0.000000 0.000000 −0.000024 0.001218 −0.024115 0.019394 
 
 0.000000 −0.000000 0.000000 −0.000000 1.000000 −0.000000 0.000000 −0.000002 0.000151 −0.000118
 
−0.000000 0.000000 −0.000000 0.000000 −0.000000 1.000000 −0.000000 0.000000 −0.000000 0.000000 
 
 0.016019 −0.012075 −0.000024 −0.000000 −0.000000 −0.000000 0.000000 
 0.001582 0.000000 1.000000

−0.795816 −0.078664 −0.000002 −0.000000 −0.000000
 0.600068 0.001218 0.000000 1.000000 0.000000

−0.430569 −0.520610 0.384453 −0.024115 0.000151 −0.000000 −0.000000 0.000000 1.000000 0.000000 

0.348133 0.420903 −0.310577 0.019394 −0.000118 0.000000 0.000000 −0.000000 0.000000 1.000000

➣ Roundoff
✦ destroys orthogonality of residuals
✦ prevents computation of exact solution after n steps.

Numerical instability (→ Def. 1.5.5.19) ➣ pointless to (try to) use CG as direct solver! y
Practice: CG used for large n as iterative solver : x(k) for some k ≪ n is expected to provide good
approximation for x∗


EXAMPLE 10.2.3.2 (Convergence of CG as iterative solver) CG (Code 10.2.2.11) & gradient method
(Code 10.1.3.4) for LSE with sparse s.p.d. “Poisson matrix”
A = gallery(’poisson’,m); x0 = (1:n)’; b = zeros(n,1);
➣ A ∈ R^{m²,m²} ("Poisson matrix")

Fig. 387: sparsity pattern (spy plot) of the Poisson matrix for m = 10 (nz = 460).
Fig. 388: eigenvalues of Poisson matrices, m = 10, 20, 30.
Fig. 389: gradient method: normalized error A-norms and residual 2-norms vs. iteration step k for m = 10, 20, 30.
Fig. 390: CG: normalized error norms and residual norms vs. number of CG steps for m = 10, 20, 30.
Observations:
• CG much faster than gradient method (as expected, because it has “memory”)
• Both, CG and gradient method converge more slowly for larger sizes of Poisson matrices.
y
Convergence theory: [Hac94, Sect. 9.4.3]
A simple consequence of (10.1.1.3) and (10.2.0.1):


Corollary 10.2.3.3. "Optimality" of CG iterates

Writing x* ∈ R^n for the exact solution of Ax = b, the CG iterates satisfy

  ‖x* − x^(l)‖_A = min{ ‖y − x*‖_A :  y ∈ x^(0) + K_l(A, r_0) } ,   r_0 := b − A x^(0) .

This paves the way for a quantitative convergence estimate:

  y ∈ x^(0) + K_l(A, r_0)   ⇔   y = x^(0) + A p(A)(x − x^(0)) ,   p = polynomial of degree ≤ l − 1 ,

  x − y = q(A)(x − x^(0)) ,   q = polynomial of degree ≤ l ,  q(0) = 1 ,

  ‖x − x^(l)‖_A ≤ min{ max_{λ ∈ σ(A)} |q(λ)| :  q polynomial of degree ≤ l, q(0) = 1 } · ‖x − x^(0)‖_A .        (10.2.3.4)

Bound this minimum for λ ∈ [λ_min(A), λ_max(A)] by using suitable "polynomial candidates".

Tool: Chebychev polynomials (→ Section 6.2.3.1) ➣ they lead to the following estimate [Hac91, Satz 9.4.2], [DR08, Satz 13.29]:

Theorem 10.2.3.5. Convergence of CG method

The iterates of the CG method for solving Ax = b (see Code 10.2.2.11) with A = A^⊤ s.p.d. satisfy

  ‖x − x^(l)‖_A ≤ 2 (1 − 1/√κ(A))^l / [ (1 + 1/√κ(A))^{2l} + (1 − 1/√κ(A))^{2l} ] · ‖x − x^(0)‖_A
                ≤ 2 ( (√κ(A) − 1)/(√κ(A) + 1) )^l ‖x − x^(0)‖_A .

(recall: κ(A) = spectral condition number of A, κ(A) = cond_2(A))

The estimate of this theorem confirms asymptotic linear convergence of the CG method (→ Def. 8.2.2.1) with a rate of (√κ(A) − 1)/(√κ(A) + 1).

Plots of the bounds for the error reduction (in energy norm) during the CG iteration from Thm. 10.2.3.5:


(Two plots of the error-reduction bound of Thm. 10.2.3.5 as a function of κ(A)^{1/2} and of the CG step l: a surface plot and a contour plot.)

MATLAB-code 10.2.3.6: plotting theoretical bounds for CG convergence rate
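A minimal sketch of such a plotting script, evaluating only the simpler second bound of Thm. 10.2.3.5 for an assumed condition number, could be:

  % Sketch: error reduction bound 2*((sqrt(kappa)-1)/(sqrt(kappa)+1))^l of Thm. 10.2.3.5
  kappa = 100;                        % assumed spectral condition number
  l = 0:25;                           % CG steps
  c = (sqrt(kappa)-1)/(sqrt(kappa)+1);
  bound = 2*c.^l;
  semilogy(l, bound, 'r-+');
  xlabel('CG step l'); ylabel('bound for ||x-x^{(l)}||_A / ||x-x^{(0)}||_A');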

EXAMPLE 10.2.3.7 (Convergence rates for CG method)

MATLAB-code 10.2.3.8: CG for Poisson matrix

1  A = gallery('poisson',m); n = size(A,1);
2  x0 = (1:n)'; b = ones(n,1); maxit = 30; tol = 0;
3  [x,flag,relres,iter,resvec] = pcg(A,b,tol,maxit,[],[],x0);

Measurement of the rate of (linear) convergence:

  rate ≈ ( ‖r_30‖_2 / ‖r_20‖_2 )^{1/10} .        (10.2.3.9)

CG convergence for the Poisson matrix.  Left plot: cond_2(A) vs. m.  Right plot: measured convergence rate of CG and the theoretical bound vs. m.

Justification for estimating the rate of linear convergence (→ Def. 8.2.2.1) of ‖r_k‖_2 → 0:

  ‖r_{k+1}‖_2 ≈ L ‖r_k‖_2   ⇒   ‖r_{k+m}‖_2 ≈ L^m ‖r_k‖_2 .


EXAMPLE 10.2.3.10 (CG convergence and spectrum → Ex. 10.1.4.4)


Test matrix #1: A=diag(d); d = (1:100);
Test matrix #2: A=diag(d); d = [1+(0:97)/97 , 50 , 100];
Test matrix #3: A=diag(d); d = [1+(0:49)*0.05, 100-(0:49)*0.05];
Test matrix #4: eigenvalues exponentially dense at 1
x0 = cos((1:n)’); b = zeros(n,1);
Left plot: diagonal entries (eigenvalue distributions) of test matrices #1–#4 vs. index.
Right plot: error norms and residual norms (2-norms) vs. number of CG steps for test matrices #1–#4 (semi-logarithmic plot).

Observations: The distribution of the eigenvalues has a crucial impact on the convergence of CG.
(This is clear from the convergence theory, because detailed information about the spectrum allows a much better choice of "candidate polynomial" in (10.2.3.4) than merely using Chebychev polynomials.)
➣ Clustering of eigenvalues leads to faster convergence of CG (in stark contrast to the behavior of the gradient method, see Ex. 10.1.4.4).

CG convergence boosted by clustering of eigenvalues
y

10.3 Preconditioning [DR08, Sect. 13.5], [Han02, Ch. 10], [QSS00, Sect. 4.3.5]

Thm. 10.2.3.5 ➣ (potentially) slow convergence of CG in case κ(A) ≫ 1.

Idea: Preconditioning

Apply the CG method to the transformed linear system

  Ã x̃ = b̃ ,   Ã := B^{−1/2} A B^{−1/2} ,   x̃ := B^{1/2} x ,   b̃ := B^{−1/2} b ,        (10.3.0.1)

with "small" κ(Ã),  B = B^⊤ ∈ R^{N,N} s.p.d. ≙ preconditioner.

Remark 10.3.0.2 (Square root of a s.p.d. matrix)

What is meant by the "square root" B^{1/2} of a s.p.d. matrix B?

Recall (10.1.4.2): for every B ∈ R^{n,n} with B^⊤ = B there is an orthogonal matrix Q ∈ R^{n,n} such that B = Q^⊤ D Q with a diagonal matrix D (→ Cor. 9.1.0.9, [NS02, Thm. 7.8], [LS09, Satz 9.15]). If B is s.p.d., the (diagonal) entries of D are strictly positive and we can define

  D = diag(λ_1, ..., λ_n), λ_i > 0   ⇒   D^{1/2} := diag(√λ_1, ..., √λ_n) .

This is generalized to

  B^{1/2} := Q^⊤ D^{1/2} Q ,

and one easily verifies, using Q^⊤ = Q^{−1}, that (B^{1/2})² = B and that B^{1/2} is s.p.d. In fact, these two requirements already determine B^{1/2} uniquely:

  B^{1/2} is the unique s.p.d. matrix such that (B^{1/2})² = B.
y

Notion 10.3.0.3. Preconditioner

A s.p.d. matrix B ∈ R^{n,n} is called a preconditioner (ger.: Vorkonditionierer) for the s.p.d. matrix A ∈ R^{n,n}, if
1. κ(B^{−1/2} A B^{−1/2}) is "small" and
2. the evaluation of B^{−1}x is about as expensive (in terms of elementary operations) as the matrix×vector multiplication Ax, x ∈ R^n.

Recall: spectral condition number κ(A) := λ_max(A)/λ_min(A), see (10.1.4.8).

There are several equivalent ways to express that κ(B^{−1/2} A B^{−1/2}) is "small":
• κ(B^{−1} A) is "small", because the spectra agree, σ(B^{−1}A) = σ(B^{−1/2} A B^{−1/2}), due to similarity (→ Lemma 9.1.0.6);
• ∃ 0 < γ < Γ, Γ/γ "small":  γ (x^⊤ B x) ≤ x^⊤ A x ≤ Γ (x^⊤ B x)  ∀ x ∈ R^n,
  where the equivalence is seen by transforming y := B^{−1/2} x and appealing to the min-max Thm. 9.3.2.18.

"Reader's digest" version of Notion 10.3.0.3:

  S.p.d. B preconditioner  :⇔  B^{−1} = cheap approximate inverse of A

Problem: B^{1/2}, which occurs prominently in (10.3.0.1), is usually not available with acceptable computational cost.

However, if one formally applies § 10.2.2.10 to the transformed system

  Ã x̃ := (B^{−1/2} A B^{−1/2}) (B^{1/2} x) = b̃ := B^{−1/2} b

from (10.3.0.1), it becomes apparent that, after a suitable transformation of the iteration variables p_j and r_j, B^{1/2} and B^{−1/2} invariably occur in the products B^{−1/2} B^{−1/2} = B^{−1} and B^{1/2} B^{−1/2} = I. Thus, thanks to this intrinsic transformation, square roots of B are not required for the implementation!


CG for Ã x̃ = b̃:

  Input:  initial guess x̃^(0) ∈ R^n
  Output: approximate solution x̃^(l) ∈ R^n

  p̃_1 := r̃_0 := b̃ − B^{−1/2} A B^{−1/2} x̃^(0);
  for j = 1 to l do {
    α := (p̃_j^⊤ r̃_{j−1}) / (p̃_j^⊤ B^{−1/2} A B^{−1/2} p̃_j);
    x̃^(j) := x̃^(j−1) + α p̃_j;
    r̃_j := r̃_{j−1} − α B^{−1/2} A B^{−1/2} p̃_j;
    p̃_{j+1} := r̃_j − ((B^{−1/2} A B^{−1/2} p̃_j)^⊤ r̃_j) / (p̃_j^⊤ B^{−1/2} A B^{−1/2} p̃_j) · p̃_j;
  }

Equivalent CG with transformed variables:

  Input:  initial guess x^(0) ∈ R^n
  Output: approximate solution x^(l) ∈ R^n

  B^{1/2} r̃_0 := B^{1/2} b̃ − A B^{−1/2} x̃^(0);
  B^{−1/2} p̃_1 := B^{−1} (B^{1/2} r̃_0);
  for j = 1 to l do {
    α := ((B^{−1/2} p̃_j)^⊤ B^{1/2} r̃_{j−1}) / ((B^{−1/2} p̃_j)^⊤ A B^{−1/2} p̃_j);
    B^{−1/2} x̃^(j) := B^{−1/2} x̃^(j−1) + α B^{−1/2} p̃_j;
    B^{1/2} r̃_j := B^{1/2} r̃_{j−1} − α A B^{−1/2} p̃_j;
    B^{−1/2} p̃_{j+1} := B^{−1} (B^{1/2} r̃_j) − ((B^{−1/2} p̃_j)^⊤ A B^{−1} (B^{1/2} r̃_j)) / ((B^{−1/2} p̃_j)^⊤ A B^{−1/2} p̃_j) · B^{−1/2} p̃_j;
  }

with the transformations

  x̃^(k) = B^{1/2} x^(k) ,   r̃_k = B^{−1/2} r_k ,   p̃_k = B^{1/2} p_k .        (10.3.0.4)

§10.3.0.5 (Preconditioned CG method (PCG) [DR08, Alg. 13.32], [Han02, Alg. 10.1])

  Input:  initial guess x ∈ R^n ≙ x^(0) ∈ R^n, tolerance τ > 0
  Output: approximate solution x ≙ x^(l)

  p := r := b − A x;  p := B^{−1} r;  q := p;  τ_0 := p^⊤ r;
  for l = 1 to l_max do {
    β := r^⊤ q;  h := A p;  α := β/(p^⊤ h);
    x := x + α p;
    r := r − α h;                                                  (10.3.0.6)
    q := B^{−1} r;  β := (r^⊤ q)/β;
    if |q^⊤ r| ≤ τ · τ_0 then stop;
    p := q + β p;
  }

MATLAB-code 10.3.0.7: simple implementation of PCG algorithm § 10.3.0.5
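Code 10.3.0.7 is invoked in Code 10.3.0.12 below as [x,rn] = pcgbase(evalA,b,tol,maxit,invB,x0). A minimal MATLAB sketch with this interface is given here; the returned vector rn of residual measures and all internal details are assumptions, not the lecture's implementation.

  function [x,rn] = pcgbase(evalA,b,tol,maxit,invB,x0)
  % Minimal sketch of the PCG algorithm § 10.3.0.5.
  % evalA: handle for x -> A*x;  invB: handle for r -> B^{-1}*r (preconditioner).
  x = x0; r = b - evalA(x); p = invB(r); q = p;
  tau0 = p'*r; rn = [];
  for l = 1:maxit
    beta = r'*q; h = evalA(p); alpha = beta/(p'*h);
    x = x + alpha*p;
    r = r - alpha*h;
    q = invB(r);
    rho = q'*r;
    rn = [rn; sqrt(abs(rho))];          % B^{-1}-energy norm of the residual, cf. (10.3.0.14)
    if abs(rho) <= tol*tau0, return; end
    p = q + (rho/beta)*p;
  end
  end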

Computational effort per step: 1 evaluation A×vector, 1 evaluation B^{−1}×vector, 3 dot products, 3 AXPY-operations.


Remark 10.3.0.8 (Convergence theory for PCG)  The assertions of Thm. 10.2.3.5 remain valid with κ(A) replaced by κ(B^{−1}A) and the energy norm based on Ã instead of A. y

EXAMPLE 10.3.0.9 (Simple preconditioners)

B = easily invertible "part" of A:

✦ B = diag(A):  Jacobi preconditioner (diagonal scaling)

✦ (B)_ij = (A)_ij if |i − j| ≤ k, and (B)_ij = 0 else, for some k ≪ n

✦ Symmetric Gauss-Seidel preconditioner

  Idea: Solve Ax = b approximately in two stages:
  ➀ Approximation A^{−1} ≈ tril(A)^{−1} (lower triangular part):  x̃ = tril(A)^{−1} b
  ➁ Approximation A^{−1} ≈ triu(A)^{−1} (upper triangular part) and use this to approximately "solve" the error equation A(x − x̃) = r, with residual r := b − A x̃:

    x = x̃ + triu(A)^{−1} (b − A x̃) .

  With L_A := tril(A), U_A := triu(A) one finds

    x = (L_A^{−1} + U_A^{−1} − U_A^{−1} A L_A^{−1}) b   ➤   B^{−1} = L_A^{−1} + U_A^{−1} − U_A^{−1} A L_A^{−1} .        (10.3.0.10)

For all these approaches the evaluation of B−1 r can be done with effort of O(n) in the case of a sparse
matrix A (e.g. with O(1) non-zero entries per row). However, there is absolutely no guarantee that
κ (B−1 A) will be reasonably small. It will crucially depend on A, if this can be expected. y
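As an illustration, the symmetric Gauss-Seidel preconditioner (10.3.0.10) can be applied to a vector by two sparse triangular solves; the following is a minimal sketch (the function name sgs_prec is an illustrative choice):

  function z = sgs_prec(A,r)
  % Sketch: apply the symmetric Gauss-Seidel preconditioner (10.3.0.10), z = B^{-1}*r,
  % via the two-stage interpretation: forward sweep with tril(A), then a correction
  % with triu(A) applied to the residual of the first stage.
  LA = tril(A); UA = triu(A);
  z = LA\r;                 % stage 1: approximate solve with the lower triangular part
  z = z + UA\(r - A*z);     % stage 2: correction with the upper triangular part
  end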

More complicated preconditioning strategies:

✦ Incomplete Cholesky factorization, M ATLAB-ichol, [DR08, Sect. 13.5]]

✦ Sparse approximate inverse preconditioner (SPAI)

EXAMPLE 10.3.0.11 (Tridiagonal preconditioning)

Efficacy of preconditioning of a sparse LSE with its tridiagonal part:

MATLAB-code 10.3.0.12: LSE for Ex. 10.3.0.11

1   A = spdiags(repmat([1/n,-1,2+2/n,-1,1/n],n,1),[-n/2,-1,0,1,n/2],n,n);
2   b = ones(n,1); x0 = ones(n,1); tol = 1.0E-4; maxit = 1000;
3   evalA = @(x) A*x;
4
5   % no preconditioning, see Code 10.3.0.7
6   invB = @(x) x; [x,rn] = pcgbase(evalA,b,tol,maxit,invB,x0);
7
8   % tridiagonal preconditioning, see Code 10.3.0.7
9   B = spdiags(spdiags(A,[-1,0,1]),[-1,0,1],n,n);
10  invB = @(x) B\x; [x,rnpc] = pcgbase(evalA,b,tol,maxit,invB,x0);

The Code 10.3.0.12 highlights the use of a preconditioner in the context of the PCG method; it only takes
a function that realizes the application of B−1 to a vector. In Line 10 of the code this function is passed as
function handle invB.
Fig. 391 / Fig. 392: A-norm of the error and B^{−1}-norm of the residuals vs. number of (P)CG steps, for CG and PCG with n = 50, 100, 200.

Number of (P)CG iterations required for tolerance 0.0001 (cf. Fig. 393):

  n       # CG steps   # PCG steps
  16          8            3
  32         16            3
  64         25            4
  128        38            4
  256        66            4
  512       106            4
  1024      149            4
  2048      211            4
  4096      298            3
  8192      421            3
  16384     595            3
  32768     841            3

Clearly in this example the tridiagonal part of the matrix is dominant for large n. In addition, its condition
number grows ∼ n2 as is revealed by a closer inspection of the spectrum.
Preconditioning with the tridiagonal part manages to suppress this growth of the condition number of
B−1 A and ensures fast convergence of the preconditioned CG method y

Remark 10.3.0.13 (Termination of PCG)  Recall Rem. 10.2.2.12, (10.2.2.13):

  (1/cond(A)) · ‖r_l‖/‖r_0‖   ≤   ‖x^(l) − x*‖ / ‖x^(0) − x*‖   ≤   cond(A) · ‖r_l‖/‖r_0‖ .        (10.2.2.13)

B good preconditioner ➤ cond_2(B^{−1/2} A B^{−1/2}) small (→ Notion 10.3.0.3).

Idea: consider (10.2.2.13) for
✦ the Euclidean norm ‖·‖ = ‖·‖_2 ↔ cond_2,
✦ the transformed quantities x̃, r̃, see (10.3.0.1), (10.3.0.4).

Monitor the 2-norm of the transformed residual:

  r̃ = b̃ − Ã x̃ = B^{−1/2} r   ⇒   ‖r̃‖_2² = r^⊤ B^{−1} r .

(10.2.2.13) ➣ estimate for the 2-norm of the transformed iteration errors:  ‖ẽ^(l)‖_2² = (e^(l))^⊤ B e^(l).

Analogous to (10.2.2.13), estimates for the energy norm (→ Def. 10.1.0.1) of the error e^(l) := x − x^(l), x* := A^{−1} b, use the error equation A e^(l) = r_l:

  r_l^⊤ B^{−1} r_l = (B^{−1} A e^(l))^⊤ A e^(l) ≤ λ_max(B^{−1} A) ‖e^(l)‖_A² ,
  ‖e^(l)‖_A² = (A e^(l))^⊤ e^(l) = r_l^⊤ A^{−1} r_l = (B^{−1} r_l)^⊤ B A^{−1} r_l ≤ λ_max(B A^{−1}) (B^{−1} r_l)^⊤ r_l .

The quantity (B^{−1} r_l)^⊤ r_l is available during the PCG iteration (10.3.0.6):

  (1/κ(B^{−1}A)) · ‖e^(l)‖_A² / ‖e^(0)‖_A²   ≤   ((B^{−1} r_l)^⊤ r_l) / ((B^{−1} r_0)^⊤ r_0)   ≤   κ(B^{−1}A) · ‖e^(l)‖_A² / ‖e^(0)‖_A² .        (10.3.0.14)

κ(B^{−1}A) "small" ➤ B^{−1}-energy norm of the residual ≈ A-norm of the error!
(r_l^⊤ B^{−1} r_l = q^⊤ r in Algorithm (10.3.0.6))
y
y

M ATLAB-function: [x,flag,relr,it,rv] = pcg(A,b,tol,maxit,B,[],x0);


(A, B may be handles to functions providing Ax and B−1 x, resp.)

Remark 10.3.0.15 (Termination criterion in MATLAB-pcg → [QSS00, Sect. 4.6])

Implementation (skeleton) of the MATLAB built-in pcg:

Listing 10.1: MATLAB-code: PCG algorithm

1   function x = pcg(Afun,b,tol,maxit,Binvfun,x0)
2   x = x0; r = b - feval(Afun,x); rho = 1;
3   for i = 1 : maxit
4     y = feval(Binvfun,r);
5     rho1 = rho; rho = r' * y;
6     if (i == 1)
7       p = y;
8     else
9       beta = rho / rho1;
10      p = y + beta * p;
11    end
12    q = feval(Afun,p);
13    alpha = rho / (p' * q);
14    x = x + alpha * p;
15    if (norm(b - feval(Afun,x)) <= tol*norm(b)), return; end
16    r = r - alpha * q;
17  end

Dubious termination criterion!
y

10.4 Survey of Krylov Subspace Methods

10.4.1 Minimal residual methods

Idea: Replace the Euclidean inner product in CG with the A-inner product:

  ‖x^(l) − x‖_A   is replaced with   ‖A(x^(l) − x)‖_2 = ‖r_l‖_2

➤ MINRES method [Hac91, Sect. 9.5.2] (for any symmetric matrix!)

Theorem 10.4.1.1.

For A = A^H ∈ R^{n,n} the residuals r_l generated in the MINRES iteration satisfy

  ‖r_l‖_2 = min{ ‖A y − b‖_2 :  y ∈ x^(0) + K_l(A, r_0) } ,

  ‖r_l‖_2 ≤ 2 (1 − 1/κ(A))^l / [ (1 + 1/κ(A))^{2l} + (1 − 1/κ(A))^{2l} ] · ‖r_0‖_2 .

Note: similar formula for the (linear) rate of convergence as for CG, see Thm. 10.2.3.5, but with √κ(A) replaced by κ(A)!

Iterative solver for Ax = b with symmetric system matrix A:

M ATLAB-functions: • [x,flg,res,it,resv] = minres(A,b,tol,maxit,B,[],x0);


• [. . .] = minres(Afun,b,tol,maxit,Binvfun,[],x0);


Computational costs : 1 A×vector, 1 B−1 ×vector per step, a few dot products & SAXPYs
Memory requirement: a few vectors ∈ R n

Extension to general regular A ∈ R^{n,n}:

Idea: Solve the overdetermined linear system of equations

  x^(l) ∈ x^(0) + K_l(A, r_0):   A x^(l) = b

in the least squares sense, → Chapter 3:

  x^(l) = argmin{ ‖A y − b‖_2 :  y ∈ x^(0) + K_l(A, r_0) } .
➤ GMRES method for general matrices A ∈ R n,n → [Han02, Ch. 16], [QSS00, Sect. 4.4.2]
M ATLAB-function: • [x,flag,relr,it,rv] = gmres(A,b,rs,tol,maxit,B,[],x0);
• [. . .] = gmres(Afun,b,rs,tol,maxit,Binvfun,[],x0);

Computational costs : 1 A×vector, 1 B−1 ×vector per step,


: O(l ) dot products & SAXPYs in l -th step
Memory requirements: O(l ) vectors ∈ K n in l -th step

Remark 10.4.1.2 (Restarted GMRES) After many steps of GMRES we face considerable computational
costs and memory requirements for every further step. Thus, the iteration may be restarted with the
current iterate x(l ) as initial guess → rs-parameter triggers restart after every rs steps (Danger: failure
to converge). y
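For illustration, restarted GMRES with restart length 30 applied to one of the tridiagonal test matrices used below could be invoked as follows (tolerance and iteration counts are arbitrary choices):

  n = 1000;
  A = gallery('tridiag',-0.5*ones(n-1,1),2*ones(n,1),-1.5*ones(n-1,1));  % non-symmetric test matrix
  b = ones(n,1);
  [x,flag,relres,iter] = gmres(A, b, 30, 1e-8, 20);   % restart = 30, tol = 1e-8, at most 20 outer cycles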

10.4.2 Iterations with short recursions [QSS00, Sect. 4.5]


Iterative methods for general regular system matrix A:

Idea: Given x(0) ∈ R n determine (better) approximation x(l ) through Petrov-


Galerkin condition

x(l ) ∈ x(0) + Kl (A, r0 ): p H (b − Ax(l ) ) = 0 ∀p ∈ Wl ,

with suitable test space Wl , dim Wl = l , e.g. Wl := Kl (A H , r0 ) (→


bi-conjugate gradients, BiCG)

Zoo of methods with short recursions (i.e. constant effort per step)
MATLAB-function: • [x,flag,r,it,rv] = bicgstab(A,b,tol,maxit,B,[],x0)
• [. . .] = bicgstab(Afun,b,tol,maxit,Binvfun,[],x0);

Computational costs : 2 A×vector, 2 B−1 ×vector, 4 dot products, 6 SAXPYs per step
Memory requirements: 8 vectors ∈ R n

MATLAB-function: • [x,flag,r,it,rv] = qmr(A,b,tol,maxit,B,[],x0)


• [. . .] = qmr(Afun,b,tol,maxit,Binvfun,[],x0);

Computational costs : 2 A×vector, 2 B−1 ×vector, 2 dot products, 12 SAXPYs per step
Memory requirements: 10 vectors ∈ R n


✦ little (useful) convergence theory available


✦ stagnation & “breakdowns” commonly occur

EXAMPLE 10.4.2.1 (Failure of Krylov iterative solvers)

  A = [ 0  1  0  ···  ···  0
        0  0  1   0        ⋮
        ⋮        ⋱  ⋱      ⋮
        ⋮            ⋱  ⋱  0
        0               0  1
        1  0  ···  ···  ···  0 ] ∈ R^{n,n} ,   b = e_n = [0, ..., 0, 1]^⊤   ⇒   x = e_1 .

x^(0) = 0  ➣  r_0 = e_n  ➣  K_l(A, r_0) = Span{e_n, e_{n−1}, ..., e_{n−l+1}}

  min{ ‖y − x‖_2 :  y ∈ K_l(A, r_0) } = 1 for l < n ,   and = 0 for l = n .
y

  TRY & PRAY

EXAMPLE 10.4.2.2 (Convergence of Krylov subspace methods for non-symmetric system matrix)
1  A = gallery('tridiag',-0.5*ones(n-1,1),2*ones(n,1),-1.5*ones(n-1,1));
2  B = gallery('tridiag',0.5*ones(n-1,1),2*ones(n,1),1.5*ones(n-1,1));

Plotted: ‖r_l‖_2 : ‖r_0‖_2 for bicgstab and qmr vs. iteration step.

(Left plot: tridiagonal matrix A.  Right plot: tridiagonal matrix B.)


y
Summary:


Advantages of Krylov methods vs. direct elimination (IF they converge at all/sufficiently fast).
• They require system matrix A in procedural form y=evalA(x) ↔ y = Ax only.
• They can perfectly exploit sparsity of system matrix.
• They can cash in on low accuracy requirements (IF viable termination criterion available).
• They can benefit from a good initial guess.

Bibliography

[AK07] Owe Axelsson and János Karátson. “Mesh independent superlinear PCG rates via
compact-equivalent operators”. In: SIAM J. Numer. Anal. 45.4 (2007), pp. 1495–1516. DOI:
10.1137/06066391X.
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 728, 729, 736–750).
[Gut09] M.H. Gutknecht. Lineare Algebra. Lecture Notes. SAM, ETH Zürich, 2009 (cit. on pp. 733–735,
738, 739).
[Hac91] W. Hackbusch. Iterative Lösung großer linearer Gleichungssysteme. B.G. Teubner–Verlag,
Stuttgart, 1991 (cit. on pp. 735, 743, 751).
[Hac94] Wolfgang Hackbusch. Iterative solution of large sparse systems of equations. Vol. 95. Applied
Mathematical Sciences. New York: Springer-Verlag, 1994, pp. xxii+429 (cit. on pp. 728, 742).
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 729, 736–750,
752).
[IM97] I.C.F. Ipsen and C.D. Meyer. The idea behind Krylov methods. Technical Report 97-3. Raleigh,
NC: Math. Dep., North Carolina State University, Jan. 1997.
[LS09] A.R. Laliena and F.-J. Sayas. “Theoretical aspects of the application of convolution quadrature
to scattering of acoustic waves”. In: Numer. Math. 112.4 (2009), pp. 637–678 (cit. on p. 745).
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
pp. 733–735, 738, 739, 745).
[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Math-
ematics. New York: Springer, 2000 (cit. on pp. 728–750, 752).
[Saa03] Yousef Saad. Iterative methods for sparse linear systems. Second. Philadelphia, PA: Society
for Industrial and Applied Mathematics, 2003, pp. xviii+528 (cit. on p. 728).
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on p. 731).
[Win80] Ragnar Winther. “Some superlinear convergence results for the conjugate gradient method”.
In: SIAM J. Numer. Anal. 17.1 (1980), pp. 14–17.

Chapter 11

Numerical Integration – Single Step Methods

For historical reasons the approximate solution of initial value problems for ordinary differential equations is
called “Numerical Integration”. This chapter will introduce the most important class of numerical methods
for that purpose.

Contents
11.1 Initial-Value Problems (IVPs) for Ordinary Differential Equations (ODEs) . . . 757
11.1.1 Ordinary Differential Equations (ODEs) . . . . . . . . . . . . . . . . . . . . . 757
11.1.2 Mathematical Modeling with Ordinary Differential Equations: Examples . 759
11.1.3 Theory of Initial-Value-Problems (IVPs) . . . . . . . . . . . . . . . . . . . . . 764
11.1.4 Evolution Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 768
11.2 Introduction: Polygonal Approximation Methods . . . . . . . . . . . . . . . . . . 771
11.2.1 Explicit Euler method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 772
11.2.2 Implicit Euler method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 774
11.2.3 Implicit midpoint method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 776
11.3 General Single-Step Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 778
11.3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 778
11.3.2 (Asymptotic) Convergence of Single-Step Methods . . . . . . . . . . . . . . 782
11.4 Explicit Runge-Kutta Single-Step Methods (RKSSMs) . . . . . . . . . . . . . . . . 791
11.5 Adaptive Stepsize Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 798
11.5.1 The Need for Timestep Adaptation . . . . . . . . . . . . . . . . . . . . . . . . 798
11.5.2 Local-in-Time Stepsize Control . . . . . . . . . . . . . . . . . . . . . . . . . . 800
11.5.3 Embedded Runge-Kutta Methods . . . . . . . . . . . . . . . . . . . . . . . . 807

Supplementary literature. Some grasp of the meaning and theory of ordinary differential

equations (ODEs) is indispensable for understanding the construction and properties of numerical
methods. Relevant information can be found in [Str09, Sect. 5.6, 5.7, 6.5].

Books dedicated to numerical methods for ordinary differential equations:


• [DB02] excellent textbook, but geared to the needs of students of mathematics.
• [HNW93] and [HW11] : the standard reference.
• [HLW06]: wonderful book conveying deep insight, with emphasis on mathematical concepts.


11.1 Initial-Value Problems (IVPs) for Ordinary Differential Equations (ODEs)
Video tutorial for Section 11.1: Initial-Value Problems (IVPs) for Ordinary Differential Equa-
tions (ODEs): (35 minutes) Download link, tablet notes

☞ You may also watch the explanations of 3Blue1Brown here.

The title of this section contains a widely used acronym: ODE =ˆ ordinary differential equation. In this section we present a few examples of models involving ODEs and briefly review the relevant mathematical theory.

11.1.1 Ordinary Differential Equations (ODEs)


§11.1.1.1 (Terminology and notations related to ordinary differential equations) In our parlance,
a (first-order) ordinary differential equation (ODE) is an equation of the form

$$\dot y := \frac{dy}{dt}(t) = f(t, y(t)) , \tag{ODE}$$
with
☞ a (continuous) right-hand-side function (r.h.s) f : I × D → R N of time t ∈ R and state y ∈ R N ,
☞ defined on a (finite) time interval I ⊂ R, and state space D, which is some sub-set of R N :
D ⊂ R N , N ∈ N.

✎ Notation (due to Newton): dot ˙ =ˆ (total) derivative with respect to time t

An ODE is called autonomous, if the right-hand-side function f does not depend on time: f = f(y), see
Def. 11.1.2.4 below.

In the context of mathematical modeling the state vector y ∈ R N is supposed to provide a complete (in the
sense of the model) description of a system. Then (ODE) models a finite-dimensional dynamical system.
Examples will be provided below, see Ex. 11.1.2.1, Ex. 11.1.2.5, and Ex. 11.1.2.7.

For N > 1, ẏ = f(t, y) can be viewed as a system of ordinary differential equations:

$$\dot y = f(t, y) \quad\Longleftrightarrow\quad
\begin{bmatrix} \dot y_1 \\ \vdots \\ \dot y_N \end{bmatrix}
=
\begin{bmatrix} f_1(t, y_1, \ldots, y_N) \\ \vdots \\ f_N(t, y_1, \ldots, y_N) \end{bmatrix} .$$

Definition 11.1.1.2. Solution of an ordinary differential equation

A solution of the ODE ẏ = f(t, y) with continuous right hand side function f is a continuously
differentiable function “of time t” y : J ⊂ I → D, defined on an open interval J , for which ẏ(t) =
f(t, y(t)) holds for all t ∈ J (=
ˆ “pointwise”).


A solution describes a continuous trajectory in state space, a one-parameter family of states, parameter-
ized by time.

It goes without saying that smoothness of the right hand side function f is inherited by solutions of the
ODE:

Lemma 11.1.1.3. Smoothness of solutions of ODEs

Let y : I ⊂ R → D be a solution of the ODE ẏ = f(y) on the time interval I .


If f : D → R N is r-times continuously differentiable with respect to both arguments, r ∈ N0 , then
the trajectory t 7→ y(t) is r + 1-times continuously differentiable in the interior of I .

§11.1.1.4 (Scalar autonomous ODE: solution via antiderivatives) We consider scalar ODEs, namely (ODE) in the case N = 1, and, in particular, ẏ = f(y) with f : D ⊂ R → R, D an interval.
We embark on formal calculations. Assume that f is continuous and f(y) ≠ 0 for all y ∈ D. Further, suppose that we know an antiderivative (primitive) F : D → R of 1/f, that is, a function y ↦ F(y) satisfying dF/dy = 1/f on D.
Then, by the chain rule, every solution y : I ⊂ R → R of ẏ = f(y) also solves

$$\frac{d}{dt} F(y(t)) = \frac{1}{f(y(t))}\,\dot y(t) = 1 ,\quad t \in I
\quad\Longleftrightarrow\quad
F(y(t)) = t - t_0 \ \text{ for some } t_0 \in \mathbb{R} . \tag{11.1.1.5}$$

We also know that F is monotonic and, thus, possesses an inverse function F −1 . Integrating (11.1.1.5)
and applying the fundamental theorem of calculus, we find

y(t) = F −1 (t − t0 ) for some t0 ∈ I . (11.1.1.6)

This formula describes a one-parameter family of functions (t0 is the parameter), all of which provide a
solution of ẏ = f (y) on a suitable interval.
A particularly simple case is f(y) = λy + c, λ, c ∈ R, i.e. the scalar ODE ẏ = λy + c. Following the steps outlined above, we calculate the solution

$$F(y) = \frac{1}{\lambda}\log(\lambda y + c)
\quad\Rightarrow\quad
y(t) = \frac{1}{\lambda}\left(e^{\lambda(t - t_0)} - c\right) , \quad t \in \mathbb{R} . \tag{11.1.1.7}$$
y

§11.1.1.8 (Linear ordinary differential equations) Now we take a look at the simplest class of ODEs,
which is also the most important.

Definition 11.1.1.9. Linear first-order ODE

A first-order ordinary differential equation ẏ = f(t, y), as introduced in § 11.1.1.1 is linear, if

f(t, y) = A(t)y , A : I → R N,N a continuous function . (11.1.1.10)

Lemma 11.1.1.11. Space of solutions of linear ODEs

The set of solutions y : I → R N of (11.1.1.10) is a vector space.


Proof. We have to show that, if y, z : I → R N are two solutions of (11.1.1.10), then so are y + z and
αy for all α ∈ R. This is an immediate consequence of the linearity of the operations of differentiation and
matrix×vector multiplication.

For the scalar case N = 1 (11.1.1.10) can be written as ẏ = a(t)y with a continuous function a : I → R.
In this case, the chain rule immediately verifies that for fixed t0 ∈ I every function
$$y(t) = C \exp\left(\int_{t_0}^{t} a(\tau)\, d\tau\right) , \quad C \in \mathbb{R} , \tag{11.1.1.12}$$

is a solution.
If the matrix A ∈ R N,N does not depend on time, (11.1.1.10) is known as a linear ODE with constant co-
efficients: ẏ = Ay. In this case we can choose I = R, and the ODE can be solved by a diagonalization
technique [Str09, Bemerkung 5.6.1], [NS02, Sect. 8.1]: If

A = SDS−1 , S ∈ R N,N regular , D = diag(λ1 , . . . , λ N ) ∈ R N,N , (11.1.1.13)

we can rewrite

ẏ = SDS−1 y ⇒ ż = Dz with z(t) := S−1 y(t) .

We get N decoupled scalar linear equations żℓ = λℓ zℓ , ℓ = 1, . . . , N . Returning to y we find that every
solution y : R → R N of ẏ = Ay can be written as
 
$$y(t) = S z(t) = S \begin{bmatrix} e^{\lambda_1 t} & & \\ & \ddots & \\ & & e^{\lambda_N t} \end{bmatrix} S^{-1} w
\quad \text{for some } w \in \mathbb{R}^N . \tag{11.1.1.14}$$

This is an N -parameter family of solutions. y
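The diagonalization formula (11.1.1.14) translates directly into code. The following C++/Eigen sketch (not from the original text; the matrix A and vector w are placeholders, and diagonalizability of A is simply assumed) evaluates y(t) = S diag(e^{λ_1 t}, ..., e^{λ_N t}) S^{-1} w, using complex arithmetic because the eigenvalues of a general real matrix may be complex:

  #include <Eigen/Dense>
  #include <complex>
  #include <iostream>

  // Sketch: evaluate y(t) = S * diag(exp(lambda_l*t)) * S^{-1} * w, cf. (11.1.1.14).
  // Assumes A is diagonalizable; no such check is performed here.
  Eigen::VectorXd evolveLinearODE(const Eigen::MatrixXd &A,
                                  const Eigen::VectorXd &w, double t) {
    Eigen::EigenSolver<Eigen::MatrixXd> es(A);          // eigenvalues may be complex
    const Eigen::VectorXcd lambda = es.eigenvalues();
    const Eigen::MatrixXcd S = es.eigenvectors();
    // z = S^{-1} w, computed by solving the linear system S z = w
    const Eigen::VectorXcd z = S.partialPivLu().solve(w.cast<std::complex<double>>());
    const Eigen::VectorXcd expLt =
        (lambda * std::complex<double>(t)).array().exp().matrix();
    const Eigen::VectorXcd y = S * expLt.cwiseProduct(z);
    return y.real();  // imaginary parts cancel up to round-off for real A and w
  }

  int main() {
    Eigen::MatrixXd A(2, 2);
    A << 0.0, 1.0,
        -1.0, 0.0;                // placeholder matrix (harmonic oscillator)
    Eigen::VectorXd w(2);
    w << 1.0, 0.0;
    std::cout << evolveLinearODE(A, w, 1.0).transpose() << std::endl;  // ≈ (cos 1, −sin 1)
  }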

11.1.2 Mathematical Modeling with Ordinary Differential Equations: Examples


Most models of physical systems and phenomena that are continuously changing with time involve ordi-
nary differential equations.
EXAMPLE 11.1.2.1 (Growth with limited resources [Ama83, Sect. 1.1], [Han02, Ch. 60]) This is
an example from population dynamics with a one-dimensional state space D = R0+ , N = 1. The
interpretation of y : [0, T ] 7→ R is that of the population density of bacteria as a function of time. A scaled,
non-dimensional model is assumed, that is by fixing reference units all quantities are regarded as “mere
numbers”.
ODE-based model: autonomous logistic differential equations [Str09, Ex. 5.7.2]

ẏ = f (y) := (α − βy) y (11.1.2.2)

✦ y =ˆ population density, [y] = 1/m²
➣ ẏ =ˆ instantaneous change (growth/decay) of population density
✦ growth rate α − βy with growth coefficients α, β > 0, [α] = 1/s, [β] = m²/s: decreases due to fiercer competition as population density increases.


By the technique from § 11.1.1.4 we can compute a family of solutions of (11.1.2.2) parameterized by the initial value y(0) = y0 > 0:

$$y(t) = \frac{\alpha y_0}{\beta y_0 + (\alpha - \beta y_0)\exp(-\alpha t)} \tag{11.1.2.3}$$

for all t ∈ R.

Note: f(y∗) = 0 for y∗ ∈ {0, α/β}, which are the stationary points for the ODE (11.1.2.2). If y(0) = y∗ the solution will be constant in time.

Fig. 394: solutions of (11.1.2.2) for different initial values y(0) (α, β = 5).

Note that by fixing the initial value y(0) we can single out a unique representative from the family of
solutions. This will turn out to be a general principle, see Section 11.1.3. y

Definition 11.1.2.4. Autonomous ODE

An ODE of the form ẏ = f(y), that is, with a right hand side function that does not depend on time,
but only on state, is called autonomous.

For an autonomous ODE the right hand side function defines a vector field (“velocity field”) y 7→ f(y) on
state space.

EXAMPLE 11.1.2.5 (Predator-prey model [Ama83, Sect. 1.1],[HLW06, Sect. 1.1.1],[Han02, Ch. 60],
[DR08, Ex. 11.3]) We consider the following model from population dynamics:
Predators and prey coexist in an ecosystem. Without predators the population of prey would be gov-
erned by a simple exponential growth law. However, the growth rate of prey will decrease with increasing
numbers of predators and, eventually, become negative. Similar considerations apply to the predator
population and lead to an ODE model.
ODE-based model: autonomous Lotka-Volterra ODE:
   
$$\begin{aligned} \dot u &= (\alpha - \beta v)\, u \\ \dot v &= (\delta u - \gamma)\, v \end{aligned}
\quad\longleftrightarrow\quad
\dot y = f(y) \ \text{ with }\
y = \begin{bmatrix} u \\ v \end{bmatrix} ,\quad
f(y) = \begin{bmatrix} (\alpha - \beta v)\, u \\ (\delta u - \gamma)\, v \end{bmatrix} , \tag{11.1.2.6}$$

with positive model parameters α, β, γ, δ > 0 and population densities

u(t) =ˆ density of prey at time t ,   v(t) =ˆ density of predators at time t .


Fig. 395: plot of the right-hand side vector field f for the Lotka-Volterra ODE in the (u, v)-plane, with the levels γ/δ and α/β marked. Solution curves are trajectories of particles carried along by the velocity field f. (Parameter values for Fig. 395: α = 2, β = 1, δ = 1, γ = 1.)

Fig. 396: solution components u = y1 and v = y2 over time t for the initial value y0 := (u(0), v(0)) = (4, 2).
Fig. 397: solution curves of (11.1.2.6) in the (u, v)-phase plane, circling the stationary point.
(Parameter values for Fig. 396, 397: α = 2, β = 1, δ = 1, γ = 1.)
y
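Anticipating the "procedural form" of right-hand-side functions used for numerical integration later in this chapter (→ Section 11.2), the Lotka-Volterra right-hand side f from (11.1.2.6) can be encoded as a small C++ functor. A sketch (the struct name and the chosen parameter values, matching Fig. 395, are for illustration only):

  #include <Eigen/Dense>

  // Right-hand side f of the Lotka-Volterra ODE (11.1.2.6), state y = [u, v]^T.
  struct LotkaVolterraRHS {
    double alpha, beta, gamma, delta;   // positive model parameters
    Eigen::Vector2d operator()(const Eigen::Vector2d &y) const {
      return Eigen::Vector2d((alpha - beta * y(1)) * y(0),
                             (delta * y(0) - gamma) * y(1));
    }
  };

  // Usage sketch:
  //   LotkaVolterraRHS f{2.0, 1.0, 1.0, 1.0};            // alpha, beta, gamma, delta
  //   Eigen::Vector2d dy = f(Eigen::Vector2d(4.0, 2.0)); // f evaluated at y = (4, 2)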

EXAMPLE 11.1.2.7 (Heartbeat model → [Dea80, p. 655]) This example deals with a phenomenolog-
ical model from physiology. A model is called phenomenological, if it is entirely motivated by observations
without appealing to underlying mechanisms or first principles.
State of heart described by quantities:  l = l(t) =ˆ length of muscle fiber,  p = p(t) =ˆ electro-chemical potential.

Phenomenological model:

$$\dot l = -(l^3 - \alpha l + p) , \qquad \dot p = \beta l , \tag{11.1.2.8}$$

with parameters:  α =ˆ pre-tension of muscle fiber,  β =ˆ (phenomenological) feedback parameter.
This is the so-called Zeeman model: it is a phenomenological model entirely based on macroscopic
observations without relying on knowledge about the underlying molecular mechanisms.

Plots of vector fields for (11.1.2.8) and solutions for different choices of parameters are given next:


Fig. 398/399: phase flow in the (l, p)-plane and heartbeat signals l(t), p(t) over time t for the Zeeman model with α = 3, β = 0.1.
Fig. 400/401: the same plots for α = 0.5, β = 0.1.

Observation: α ≪ 1 (bottom plots) ➤ ventricular fibrillation, a life-threatening condition. y

EXAMPLE 11.1.2.9 (SIR model for spread of local epidemic [Het00]) The field of epidemiology tries to
understand the spread of contagious diseases in populations. It heavily relies on ODEs in its mathematical
modeling. This example presents a particularly simple model for an epidemic in a large, stable, isolated,
and vigorously mixing homogeneous population.
With respect to the disease we partition the population into three different groups and introduce time-
dependent variables for their fractions ∈ [0, 1]:
(I)   S = S(t) =ˆ fraction of susceptible persons, who can still contract the disease,
(II)  I = I(t) =ˆ fraction of infected/infectious persons, who can pass on the disease,
(III) R = R(t) =ˆ fraction of recovered/removed persons, who are immune or have died.
These three quantities enter the SIR model named after the groups it considers. Besides, the model
involves two crucial model parameters, which have to be determined from data:
1. A parameter β > 0, whose value expresses the probability of transmission, and
2. a parameter r > 0, taking into account how quickly sick people recover or die.
With these notation the ODE underlying the SIR model can be stated as

Ṡ(t) = − βS(t) I (t) , İ (t) = βS(t) I (t) − rI (t) , Ṙ(t) = rI (t) . (11.1.2.10)


Fig. 402: evolution of an epidemic according to the SIR model (11.1.2.10) for β = 0.3, r = 0.1 and initial values S(0) = 0.99, I(0) = 0.01, R(0) = 0 (non-dimensionalized time); plotted are the fractions of individuals S(t), I(t), R(t) versus time t.

Note that in this case not all people end up infected: lim_{t→∞} S(t) > 0!
y

EXAMPLE 11.1.2.11 (Transient circuit simulation [Han02, Ch. 64]) Chapter 1 and Chapter 8 discuss
circuit analysis as a source of linear and non-linear systems of equations, see Ex. 2.1.0.3 and Ex. 8.1.0.1.
The former example admitted time-dependent currents and potentials, but dependence on time was con-
fined to be “sinusoidal”. This enabled us to switch to frequency domain, see (2.1.0.6), which gave us a
complex linear system of equations for the complex nodal potentials. Yet, this trick is only possible for
linear circuits. In the general case, circuits have to be modelled by ODEs connecting time-dependent
potentials and currents. This will be briefly explained now.
The approach is transient nodal analysis, cf. Ex. 2.1.0.3, based on the Kirchhoff current law (2.1.0.4),
which reads for the node • of the simple circuit drawn in Fig. 403

i R ( t ) − i L ( t ) − iC ( t ) = 0 . (11.1.2.12)

In addition we rely on known transient constitutive relations for basic linear circuit elements:

$$\text{resistor:}\quad i_R(t) = R^{-1} u_R(t) , \tag{11.1.2.13}$$
$$\text{capacitor:}\quad i_C(t) = C\,\frac{du_C}{dt}(t) , \tag{11.1.2.14}$$
$$\text{coil:}\quad u_L(t) = L\,\frac{di_L}{dt}(t) . \tag{11.1.2.15}$$

Fig. 403: simple circuit with resistor R, capacitor C, coil L, time-dependent voltage source U_s(t), and the single node • with unknown potential u(t).

We assume that the source voltage Us (t) is given. To apply nodal analysis to the circuit of Fig. 403 we
differentiate (11.1.2.12) w.r.t. t
$$\frac{di_R}{dt}(t) - \frac{di_L}{dt}(t) - \frac{di_C}{dt}(t) = 0 ,$$

and plug in the above constitutive relations for circuit elements:

$$R^{-1}\,\frac{du_R}{dt}(t) - L^{-1} u_L(t) - C\,\frac{d^2 u_C}{dt^2}(t) = 0 .$$
We continue following the policy of nodal analysis and express all voltages by potential differences between
nodes of the circuit.

u R (t) = Us (t) − u(t) , uC (t) = u(t) − 0 , u L (t) = u(t) − 0 .


For this simple circuit there is only one node with unknown potential, see Fig. 403. Its time-dependent
potential will be denoted by u(t) and this is the unknown of the model, a function of time satisfying the
ordinary differential equation

$$R^{-1}\bigl(\dot U_s(t) - \dot u(t)\bigr) - L^{-1} u(t) - C\,\frac{d^2 u}{dt^2}(t) = 0 .$$
This is an autonomous 2nd-order ordinary differential equation:

C ü + R−1 u̇ + L−1 u = R−1 U̇s . (11.1.2.16)

The attribute “2nd-order” refers to the occurrence of a second derivative with respect to time.
y

11.1.3 Theory of Initial-Value-Problems (IVPs)


§11.1.3.1 (Initial value problems) We start with an abstract mathematical description that also introduces
key terminology:

A generic Initial value problem (IVP) for a first-order ordinary differential equation (ODE) (→ [Str09,
Sect. 5.6], [DR08, Sect. 11.1]) can be stated as:

Find a function y : I → D that satisfies, cf. Def. 11.1.1.2,

ẏ = f(t, y) , y(t0 ) = y0 . (11.1.3.2)

• f : I × D ↦ R^N =ˆ right hand side (r.h.s.) (N ∈ N),
• I ⊂ R =ˆ (time) interval ↔ "time variable" t,
• D ⊂ R^N =ˆ state space/phase space ↔ "state variable" y,
• Ω := I × D =ˆ extended state space (of tuples (t, y)),
• t0 ∈ I =ˆ initial time, y0 ∈ D =ˆ initial state ➣ initial conditions.

The time interval I may be finite or infinite. Frequently, the extended state space is not specified, but as-
sumed to coincide with the maximal domain of definition of f. Sometimes, the model suggests constraints
on D, for instance, positivity of certain components that represent a density. y
§11.1.3.3 (IVPs for autonomous ODEs) Recall Def. 11.1.2.4: For an autonomous ODE ẏ = f(y), that
is the right hand side f does not depend on time t.

Hence, for autonomous ODEs we have I = R and the right hand side function y 7→ f(y) can be regarded
as a stationary vector field (velocity field), see Fig. 395 or Fig. 398.

An important observation: If t 7→ y(t) is a solution of an autonomous ODE, then, for any τ ∈ R, also the
shifted function t 7→ y(t − τ ) is a solution.

Initial time for autonomous ODEs


For initial value problems for autonomous ODEs the initial time is irrelevant and therefore we can
always make the “canonical” choice t0 = 0.

Autonomous ODEs naturally arise when modeling time-invariant systems or phenomena. All examples for
Section 11.1.2 belong to this class. y


§11.1.3.5 (Autonomization: Conversion into autonomous ODE) In fact, autonomous ODEs already
represent the general case, because any ODE can be converted into an autonomous one:
The idea is to include time as an extra (N + 1)-st component of an extended state vector z(t). This solution component has to grow linearly ⇔ its temporal derivative must be = 1:

$$z(t) := \begin{bmatrix} y(t) \\ t \end{bmatrix} = \begin{bmatrix} z' \\ z_{N+1} \end{bmatrix} :\qquad
\dot y = f(t, y) \;\longleftrightarrow\; \dot z = g(z) , \quad
g(z) := \begin{bmatrix} f(z_{N+1}, z') \\ 1 \end{bmatrix} .$$
This means ż N +1 = 1 and implies z N +1 (t) = t + t0 , if t0 stands for the initial time in the original non-
autonomous IVP.
➣ We restrict ourselves to autonomous ODEs in the remainder of this chapter. y
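A sketch of how this autonomization looks in code, following the functor conventions used later in § 11.2.0.1 (the helper name autonomize and the use of std::function are illustrative choices, not part of the original notes):

  #include <Eigen/Dense>
  #include <functional>

  using State = Eigen::VectorXd;

  // Wrap a non-autonomous right-hand side f(t, y) into the autonomous g(z) of § 11.1.3.5,
  // where z = [y; t] carries the time as additional last component.
  std::function<State(const State &)>
  autonomize(const std::function<State(double, const State &)> &f) {
    return [f](const State &z) -> State {
      const Eigen::Index N = z.size() - 1;    // dimension of the original state y
      const double t = z(N);                  // last component stores the time
      State g(N + 1);
      g.head(N) = f(t, z.head(N));            // g' = f(z_{N+1}, z')
      g(N) = 1.0;                             // d z_{N+1} / dt = 1
      return g;
    };
  }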

Remark 11.1.3.6 (From higher order ODEs to first order systems [DR08, Sect. 11.2])
An ordinary differential equation of order n ∈ N has the form

y(n) = f(t, y, ẏ, . . . , y(n−1) ) . (11.1.3.7)

where, with notations from § 11.1.3.1, f : I × D × · · · × D → R N is a function of time t and n state


arguments.
✎ Notation: superscript (n) =ˆ n-th temporal derivative with respect to time t: d^n/dt^n
No special treatment of higher order ODEs is necessary, because (11.1.3.7) can be turned into a 1st-order
ODE (a system of size nN ) by adding all derivatives up to order n − 1 as additional components to the
state vector. This extended state vector z(t) ∈ R nd is defined as
 
    z2
y(t) z1  
 y (1) ( t )   z 2   z3 
     .. 
z(t) :=  ..  =  ..  ∈ R Nn : (11.1.3.7) ↔ ż = g(z) , g(z) :=  . .
 .  .  
( n − 1 )
 zn 
y (t) zn
f(t, z1 , . . . , zn )
(11.1.3.8)

Note that the extended system requires initial values y(t0 ), ẏ(t0 ), . . . , y(n−1) (t0 ):

For ODEs of order n ∈ N well-posed initial value problems need to specify initial values for the first
n − 1 derivatives.
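As a concrete instance (worked out here for illustration), the second-order circuit equation (11.1.2.16) from Ex. 11.1.2.11 becomes a first-order system of dimension 2 by setting z1 := u, z2 := u̇:

$$z(t) := \begin{bmatrix} u(t) \\ \dot u(t) \end{bmatrix}
\quad\Longrightarrow\quad
\dot z = \begin{bmatrix} z_2 \\ C^{-1}\bigl(R^{-1}\dot U_s(t) - R^{-1} z_2 - L^{-1} z_1\bigr) \end{bmatrix} ,$$

a non-autonomous first-order ODE (because of U_s(t)), whose well-posed initial-value problems require the two initial values u(t0) and u̇(t0).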

§11.1.3.10 (Smoothness classes for right-hand side functions) Now we review results about existence
and uniqueness of solutions of initial value problems for first-order ODEs. These are surprisingly general
and do not impose severe constraints on right hand side functions. Some kind of smoothness of the
right-hand side function f is required, nevertheless and the following definitions describe it in detail.

Definition 11.1.3.11. Lipschitz continuous function (→ [Str09, Def. 4.1.4])

Let Θ := I × D, I ⊂ R an interval, D ⊂ R N , N ∈ N, an open domain. A function f : Θ 7→ R N is


Lipschitz continuous (in the second argument) on Θ, if

∃ L > 0: kf(t, w) − f(t, z)k ≤ Lkw − zk ∀(t, w), (t, z) ∈ Θ . (11.1.3.12)


Definition 11.1.3.13. Local Lipschitz continuity (→ [Str09, Def. 4.1.5])

Let Ω := I × D, I ⊂ R an interval, D ⊂ R^N, N ∈ N, an open domain. A function f : Ω ↦ R^N


is locally Lipschitz continuous, if for every (t, y) ∈ Ω there is a closed box B with (t, y) ∈ B such
that f is Lipschitz continuous on B:

∀(t, y) ∈ Ω: ∃δ > 0, L > 0:


kf(τ, z) − f(τ, w)k ≤ Lkz − wk (11.1.3.14)
∀z, w ∈ D: kz − yk ≤ δ, kw − yk ≤ δ, ∀τ ∈ I: |t − τ | ≤ δ .

The property of local Lipschitz continuity means that the function (t, y) 7→ f(t, y) has “locally finite slope”
in y. y

EXAMPLE 11.1.3.15 (A function that is not locally Lipschitz continuous [Str09, Bsp. 6.5.3]) The
meaning of local Lipschitz continuity is best explained by giving an example of a function that fails to
possess this property.

Consider the square root function t ↦ √t on the closed interval [0, 1]. Its slope in t = 0 is infinite and so it is not locally Lipschitz continuous on [0, 1].

However, if we consider the square root on the open interval ]0, 1[, then it is locally Lipschitz continuous
there. y

The next lemma gives a simple criterion for local Lipschitz continuity, which can be proved by the mean
value theorem, cf. the proof of Lemma 8.3.2.9.

Lemma 11.1.3.16. Criterion for local Lipschitz continuity

If f and Dy f are continuous on the extended state space Ω, then f is locally Lipschitz continuous
(→ Def. 11.1.3.13).

✎ Notation: D_y f =ˆ the derivative of f w.r.t. the state variable y, a Jacobian matrix ∈ R^{N,N} as defined in (8.3.2.8).

The following is the most important mathematical result in the theory of initial-value problems for ODEs:

Theorem 11.1.3.17. Theorem of Peano & Picard-Lindelöf [Ama83, Satz II(7.6)], [Str09,
Satz 6.5.1], [DR08, Thm. 11.10], [Han02, Thm. 73.1]

If the right hand side function f : Ω 7→ R N is locally Lipschitz continuous (→ Def. 11.1.3.13) then
for all initial conditions (t0 , y0 ) ∈ Ω the IVP

ẏ = f(t, y) , y(t0 ) = y0 . (11.1.3.2)

has a solution y ∈ C1 ( J (t0 , y0 ), R N ) with maximal (temporal) domain of definition J (t0 , y0 ) ⊂ R.

In light of § 11.1.3.5 and Thm. 11.1.3.17 henceforth we mainly consider

autonomous IVPs: ẏ = f(y) , y(0) = y0 , (11.1.3.18)


with locally Lipschitz continuous (→ Def. 11.1.3.13) right hand side f.

§11.1.3.19 (Domain of definition of solutions of IVPs) We emphasize a subtle message of


Thm. 11.1.3.17.
Solutions of an IVP have an intrinsic maximal domain of definition

Also note that the domain of definition/domain of existence J(t0, y0) of the solution usually depends
! on the initial values (t0 , y0 ) !
Terminology: if J (t0 , y0 ) = I , I the maximal temporal domain of definition of f, we say that the solution
y : I 7→ RN is global.

Notation: For autonomous ODE we always have t0 = 0, and therefore we write J (y0 ) := J (0, y0 ). y

EXAMPLE 11.1.3.20 (“Explosion equation”: finite-time blow-up) Let us explain the still mysterious
“maximal domain of definition” in statement of Thm. 11.1.3.17. It is related to the fact that every solution
of an initial value problem (11.1.3.18) has its own largest possible time interval J (y0 ) ⊂ R on which it is
defined naturally.

As an example we consider the autonomous scalar (d = 1) initial value problem, modeling “explosive
growth” with a growth rate increasing linearly with the density:

ẏ = y2 , y(0) = y0 ∈ R . (11.1.3.21)

We choose I = D = R. Clearly, y 7→ y2 is locally Lipschitz-continuous, but only locally! Why not


globally?

We find the solutions

$$y(t) =
\begin{cases}
\dfrac{1}{y_0^{-1} - t} , & \text{if } y_0 \neq 0 , \\[1ex]
0 , & \text{if } y_0 = 0 ,
\end{cases} \tag{11.1.3.22}$$

with domains of definition

$$J(y_0) =
\begin{cases}
\left]-\infty, y_0^{-1}\right[ , & \text{if } y_0 > 0 , \\
\mathbb{R} , & \text{if } y_0 = 0 , \\
\left]y_0^{-1}, \infty\right[ , & \text{if } y_0 < 0 .
\end{cases}$$

Fig. 404: plots of t ↦ y(t) for the initial values y0 ∈ {−0.5, −1, 1, 0.5}.
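A quick check by differentiation (added here for clarity) confirms (11.1.3.22) for y0 ≠ 0:

$$\dot y(t) = \frac{d}{dt}\,\frac{1}{y_0^{-1} - t} = \frac{1}{\bigl(y_0^{-1} - t\bigr)^2} = y(t)^2 ,
\qquad y(0) = \frac{1}{y_0^{-1}} = y_0 .$$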

In this example, for y0 > 0 the solution experiences a blow-up in finite time and ceases to exist afterwards.
y

Supplementary literature. For other concise summaries of the theory of IVPs for ODEs refer

to [QSS00, Sect. 11.1], [DR08, Sect. 11.3].


11.1.4 Evolution Operators


Now we examine a difficult but fundamental concept for time-dependent models stated by means of ODEs.
For the sake of simplicity we restrict the discussion to autonomous initial-value problems (IVPs)

ẏ = f(y) , y(0) = y0 , (11.1.3.18)

with locally Lipschitz continuous (→ Def. 11.1.3.13) right hand side f : D ⊂ R N → R N , N ∈ N, and
make the following assumption. A more general treatment is given in [DB02].

Assumption 11.1.4.1. Global solutions

Solutions of (11.1.3.18) are global: J (y0 ) = R for all y0 ∈ D.

Now we return to the study of a generic ODE (ODE) instead of an IVP (11.1.3.2). We do this by temporarily
changing the perspective: we fix a “time of interest” t ∈ R \ {0} and follow all trajectories for the duration
t. This induces a mapping of points in state space:

$$➣ \quad\text{mapping}\quad \Phi^t :
\begin{cases} D \;\mapsto\; D \\ y_0 \;\mapsto\; y(t) \end{cases} ,
\qquad t \mapsto y(t)\ \text{solution of IVP (11.1.3.18)} . \tag{11.1.4.2}$$

This is a well-defined mapping of the state space into itself, by Thm. 11.1.3.17 and Ass. 11.1.4.1.

Now, we may also let t vary, which spawns a family of mappings Φt t∈R of the state space D into itself.
However, it can also be viewed as a mapping with two arguments, a duration t and an initial state value
y0 !

Definition 11.1.4.3. Evolution operator/mapping

Under Ass. 11.1.4.1 the mapping



$$\Phi :
\begin{cases}
\mathbb{R} \times D \;\mapsto\; D \\
(t, y_0) \;\mapsto\; \Phi^t y_0 := y(t) ,
\end{cases}$$

where t 7→ y(t) ∈ C1 (R, R N ) is the unique (global) solution of the IVP ẏ = f(y), y(0) = y0 , is
the evolution operator/mapping for the autonomous ODE ẏ = f(y).

Note that t 7→ Φt y0 describes the solution of ẏ = f(y) for y(0) = y0 (a trajectory). Therefore, by virtue
of definition, we have

$$\frac{\partial \Phi}{\partial t}(t, y) = f(\Phi^t y) . \tag{11.1.4.4}$$
Let us repeat the different kinds of information contained in an evolution operator when viewed from differ-
ent angles:

t 7→ Φt y0 , y0 ∈ D fixed =
ˆ a trajectory = solution of an IVP ,
t
y 7→ Φ y , t ∈ R fixed =
ˆ a mapping of the state space onto itself .


EXAMPLE 11.1.4.5 (Evolution operator for Lotka-Volterra ODE (11.1.2.6)) For N = 2 the action of an
evolution operator can be visualized by tracking the movement of point sets in state space. Here this is
done for the Lotka-Volterra ODE
   
$$\begin{aligned} \dot u &= (\alpha - \beta v)\, u \\ \dot v &= (\delta u - \gamma)\, v \end{aligned}
\quad\longleftrightarrow\quad
\dot y = f(y) \ \text{ with }\
y = \begin{bmatrix} u \\ v \end{bmatrix} ,\quad
f(y) = \begin{bmatrix} (\alpha - \beta v)\, u \\ (\delta u - \gamma)\, v \end{bmatrix} , \tag{11.1.2.6}$$
with positive model parameters α, β, γ, δ > 0.
Fig. 405, 406: trajectories t ↦ Φ^t y0 and the state mappings y ↦ Φ^t y for the Lotka-Volterra system ("flow map" for α = 2, β = γ = δ = 1): images of a point set X under Φ^t at times t = 0, 0.5, 1, 1.5, 2, 3; axes u = y1 (prey) and v = y2 (predator).
Think of y ∈ R2 7→ f(y) ∈ R2 as the velocity of the surface of a fluid. Specks of floating dust will be
carried along by the fluid, patches of dust covering parts of the surface will move and deform over time.
This can serve as a “mental image” of Φ. y

Given an evolution operator, we can recover the right-hand side function f of the underlying autonomous
ODE as f(y) = ∂Φ ∂t (0, y ): There is a one-to-one relationship between ODEs and their evolution operators,
and those are the key objects behind an ODE.

An ODE “encodes” an evolution operator.

Understanding the concept of evolution operators is indispensable for numerical integration, that is, the construction of numerical methods for the solution of IVPs for ODEs:

Numerical integration is concerned with the approximation of evolution operators.

Remark 11.1.4.6 (Group property of autonomous evolutions) Under Ass. 11.1.4.1 the
evolution operator gives rise to a group of mappings D 7→ D:
Φs ◦ Φt = Φs+t , Φ−t ◦ Φt = Id ∀t ∈ R . (11.1.4.7)
This is a consequence of the uniqueness theorem Thm. 11.1.3.17. It is also intuitive: following an evolution
up to time t and then for some more time s leads us to the same final state as observing it for the whole
time s + t. y
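A simple illustration (added here; it uses the scalar linear ODE from § 11.1.1.4 with c = 0): for ẏ = λy the evolution operator is Φ^t y0 = e^{λt} y0, and the group property (11.1.4.7) is just the functional equation of the exponential:

$$\Phi^s\bigl(\Phi^t y_0\bigr) = e^{\lambda s}\, e^{\lambda t}\, y_0 = e^{\lambda(s+t)}\, y_0 = \Phi^{s+t} y_0 .$$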

Review question(s) 11.1.4.8 (IVPs for ODEs)


(Q11.1.4.8.A) A simple model for the spread of a viral epidemic like SARS-Cov2 is the SIR model:
Ṡ(t) = − βI (t)S(t) , İ (t) = βI (t)S(t) − γI (t) , Ṙ(t) = γI (t) , (11.1.4.9)
with parameters β, γ > 0. Here t 7→ S(t) is the fraction of susceptible individuals, t 7→ I (t) that of
infected (and infectious) individuals, and t 7→ R(t) stands for the removed (immune or dead) individuals.


• Write (11.1.4.9) in the form ẏ = f(y).


• What is a meaningful state space for (11.1.4.9).
• Show that S(t) + I (t) + R(t) ≡ const.
• Show that t 7→ R(t) is non-decreasing.
(Q11.1.4.8.B) Determine the one-parameter family of solutions of the scalar autonomous ODE
ẏ = 1 + y2 . Can you expect global solutions defined for all times t ∈ R?

Hint. d/dy {y ↦ arctan(y)} = 1/(1 + y²).

(Q11.1.4.8.C) Consider the autonomous scalar ODE ẏ = cos2 y.


• What are the stationary states, that is the states y∗ that are zeros of the right-hand-side function?
• Compute the (analytic) )solution of a related initial-value problem with y(0) = y0 ∈ R.
• The evolution operator Φ belonging to ẏ = cos2 y will satisfy Φt ◦ Φs = Φt+s . Verify this formula
based on what you found as analytic solution.
Hint. Remember that tan′ = cos−2 .
(Q11.1.4.8.D) Show that the scalar autonomous initial-value problem

$$\dot y = \sqrt{y} , \qquad y(0) = 0 ,$$

has at least two solutions in the state space R0+ according to the following definition.

Definition Def. 11.1.1.2. Solution of an ordinary differential equation

A solution of the ODE ẏ = f(t, y) with continuous right hand side function f is a continuously
differentiable function “of time t” y : J ⊂ I → D, defined on an open interval J , for which
ẏ(t) = f(t, y(t)) holds for all t ∈ J (=
ˆ “pointwise”).

How can this be reconciled with the assertion of the main theorem?

Theorem Thm. 11.1.3.17. Theorem of Peano & Picard-Lindelöf

If the right hand side function f : Ω̂ 7→ R N is locally Lipschitz continuous (→ Def. 11.1.3.13)
then for all initial conditions (t0 , y0 ) ∈ Ω̂ the IVP

ẏ = f(t, y) , y(t0 ) = y0 . (11.1.3.2)

has a solution y ∈ C1 ( J (t0 , y0 ), R N ) with maximal (temporal) domain of definition J (t0 , y0 ) ⊂ R.

Hint. Consider the function $y(t) = \left(\tfrac{1}{2} t\right)^2$.

(Q11.1.4.8.E) For the autonomous scalar ODE ẏ = sin(1/y) − 2 answer the following questions

• What is the maximal state space?


• Which initial values for t0 = 0 will allow a solution on [0, ∞[,
• and for which will the solution be defined for a finite time interval only?


Hint. Make use of the geometrically intuitive statement: If a differentiable function f : [t0 , T ] → R
satisfies f˙(t) ≤ C for all t0 ≤ t ≤ T , then f (t) ≤ f (t0 ) + Ct.
(Q11.1.4.8.F) Rewrite the matrix differential equation Ẏ(t) = AY(t) for Y : R → R n,n , n ∈ N, in the
standard form ẏ = f(y) with right-hand-side function f : R N → R N and suitable N ∈ N.
(Q11.1.4.8.G) What “ingredients” does it take to define an initial value problem for an ODE?

11.2 Introduction: Polygonal Approximation Methods

Video tutorial for Section 11.2: Introduction: Polygonal Approximation Methods: (17 minutes)
Download link, tablet notes

In this section we will see the first simple methods for the numerical integration (= solution) of initial-value
problems (IVPs). We target an initial value problem (11.1.3.2) for a first-order ordinary differential equation

ẏ = f(t, y) , y(t0 ) = y0 . (11.1.3.2)

As usual, the right hand side function f : D ⊂ R N → R N , N ∈ N, may be given only in procedural form,
for instance, in a C++ code as a functor object providing an evaluation operator
Eigen::VectorXd operator()(double t, const Eigen::VectorXd &y) const;

cf. Rem. 5.1.0.9. Occasionally the evaluation of f may involve costly computations.

§11.2.0.1 (Objectives of numerical integration) Two basic tasks can be identified in the field of nu-
merical integration, which aims for the approximate solution of initial value problems for ODEs (Please
distinguish from “numerical quadrature”, see Chapter 7.):
(I) Given initial time t0 , final time T , and initial state y0 compute an approximation of y( T ), where
t 7→ y(t) is the solution of (11.1.3.2). A corresponding function in C++ could look like
State solveivp( double t0, double T, State y0);

Here State is a type providing a fixed size or variable size vector ∈ R N , e.g.,
using State = Eigen::Matrix< double , N, 1>;

(II) Output an approximate solution t → yh (t) of (11.1.3.2) on [t0 , T ] up to final time T 6= t0 for “all
times” t ∈ [t0 , T ] (in practice, of course, only for finitely many times t0 < t1 < t2 < · · · < t M−1 <
t M = T , M ∈ N, consecutively)
std::vector<State> solveivp(State y0, const std::vector<double> &tvec);

This is the “plot solution” task, because we need to know y(t) for many times, if we want to create
a faithful plot of t 7→ y(t).


This section presents three methods that provide a piecewise linear, that is, "polygonal" approximation of solution trajectories t ↦ y(t).

Fig. 407: a piecewise linear function, aka a polygonal curve, approximating a function t ↦ y(t) ∈ R in grid points t0 < t1 < · · · < t4.

§11.2.0.2 (Temporal mesh) As in Section 6.6.1 the polygonal approximation in this section will be based
on a (temporal) mesh with M + 1 mesh points (→ § 6.6.0.1)

M : = { t 0 < t 1 < t 2 < · · · < t M −1 < t M : = T } ⊂ [ t 0 , T ] , (11.2.0.3)

covering the time interval of interest between initial time t0 and final time T > t0 . We assume that the
interval of interest is contained in the domain of definition of the solution of the IVP: [t0 , T ] ⊂ J (t0 , y0 ). y

The next three sections will derive three simple mesh-based numerical integration methods, each in two
ways:
(i) Based on geometric reasoning we interpret ẏ as the slope/direction of a tangent line.
(ii) In the spirit of numerical differentiation § 5.2.3.16, we replace the derivative ẏ with a mesh-based
difference quotient.

11.2.1 Explicit Euler method


EXAMPLE 11.2.1.1 (Tangent field and solution curves) For N = 1 polygonal methods can be con-
structed by geometric considerations in the t − y plane, a model for the extended state space. We explain
this for the Riccati differential equation, a scalar ODE:

ẏ = f (t, y) := y2 + t2 ➤ N = 1, I, D = R + . (11.2.1.2)

Fig. 408: tangent field $(t, y) \mapsto \frac{1}{\sqrt{f^2(t,y)+1}}\begin{bmatrix} 1 \\ f(t,y) \end{bmatrix}$ of the Riccati ODE in the (t, y)-plane.
Fig. 409: solution curves.


The solution curves run tangentially to the tangent field in each point of the extended state space. y

Idea: "follow the tangents over short periods of time"

➊ timestepping: successive approximation of evolution on mesh intervals [t_{k−1}, t_k], k = 1, ..., M, t_M := T,
➋ approximation of solution on [t_{k−1}, t_k] by tangent line to solution trajectory through (t_{k−1}, y_{k−1}).

explicit Euler method (Euler 1768)

Fig. 410: first step of the explicit Euler method (N = 1): starting from y0, follow the straight line with slope of tangent = f(t0, y0) up to t1; the resulting value y1 serves as initial value for the next step! See also [Han02, Ch. 74], [DR08, Alg. 11.4].

EXAMPLE 11.2.1.3 (Visualization of explicit Euler method)

We use the temporal mesh

$$\mathcal{M} := \{ t_j := j/5 :\ j = 0, \ldots, 5 \} ,$$

and solve an IVP for the Riccati differential equation, see Ex. 11.2.1.1,

$$\dot y = y^2 + t^2 . \tag{11.2.1.2}$$

Here: y0 = 1/2, t0 = 0, T = 1.

Fig. 411: exact solution and "Euler polygon" for uniform timestep h = 0.2, drawn on top of the tangent field of the Riccati ODE.

§11.2.1.4 (Recursion for explicit Euler method) We translate the graphical construction of Fig. 410 into
a formula. Given a temporal mesh M := {t0 < t1 < t2 < · · · < t M−1 < t M } and applied to a general
IVP

ẏ = f(t, y) , y(t0 ) = y0 . (11.1.3.2)

the explicit Euler method generates a sequence $(y_k)_{k=0}^{M}$ of states by the recursion

yk+1 = yk + hk f(tk , yk ) , k = 0, . . . , M − 1 , (11.2.1.5)

with local (size of) timestep (stepsize) h k : = t k +1 − t k .


The state yk is supposed to approximate y(tk ), where t 7→ y(t) is the exact solution of the IVP (11.1.3.2).
y
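A minimal C++/Eigen realization of the recursion (11.2.1.5) might look as follows (a sketch following the State conventions of § 11.2.0.1; the right-hand side f is passed as a std::function, the temporal mesh as a vector of times):

  #include <Eigen/Dense>
  #include <functional>
  #include <vector>

  using State = Eigen::VectorXd;

  // Explicit Euler (11.2.1.5) on the temporal mesh t_0 < t_1 < ... < t_M:
  // returns the sequence y_0, y_1, ..., y_M.
  std::vector<State>
  explicitEuler(const std::function<State(double, const State &)> &f,
                const State &y0, const std::vector<double> &mesh) {
    std::vector<State> y{y0};
    for (std::size_t k = 0; k + 1 < mesh.size(); ++k) {
      const double h = mesh[k + 1] - mesh[k];            // local stepsize h_k
      y.push_back(y.back() + h * f(mesh[k], y.back()));  // y_{k+1} = y_k + h_k f(t_k, y_k)
    }
    return y;
  }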

Remark 11.2.1.6 (Explicit Euler method as a difference scheme)


One can obtain (11.2.1.5) by approximating the derivative d/dt by a forward difference quotient on the (temporal) mesh M := {t0, t1, ..., t_M}:

$$\dot y(t_k) \approx \frac{y(t_k + h_k) - y(t_k)}{h_k} :$$

$$\dot y = f(t, y) \quad\longleftrightarrow\quad
\frac{y_{k+1} - y_k}{h_k} = f(t_k, y_k) , \quad k = 0, \ldots, M-1 . \tag{11.2.1.7}$$

Why a "forward difference quotient"? Because the difference quotient in (11.2.1.7) relies on the "future state" y_{k+1} ≈ y(t_{k+1}) to approximate ẏ(t_k).

In general, difference schemes follow a simple policy for the discretization of differential equations: replace all derivatives by difference quotients connecting solution values on a set of discrete points (the mesh). y

Remark 11.2.1.8 (Output of explicit Euler method) To begin with, the explicit Euler recursion (11.2.1.5)
produces a sequence y0 , . . . , y M of states. How does it deliver on the task (I) and (II) stated in § 11.2.0.1?
By “geometric insight” we expect

yk ≈ y(tk ) .

(As usual, we use the notation t 7→ y(t) for the exact solution of an IVP.)

Now let us discuss to what extent the explicit Euler method delivers on the tasks formulated in § 11.2.0.1.

Task (I): Easy, because y_M already provides an approximation of y(T).

Task (II): The trajectory t ↦ y(t) is approximated by the piecewise linear function ("Euler polygon")

$$y_h : [t_0, t_M] \to \mathbb{R}^N , \quad
y_h(t) := \frac{t_{k+1} - t}{t_{k+1} - t_k}\, y_k + \frac{t - t_k}{t_{k+1} - t_k}\, y_{k+1}
\quad\text{for } t \in [t_k, t_{k+1}] , \tag{11.2.1.9}$$

see Fig. 411. This function can easily be sampled on any grid of [t0, t_M]. In fact, it is the M-piecewise linear interpolant of the data points (t_k, y_k), k = 0, ..., M (see Section 5.3.2).

The same considerations apply to the methods discussed in the next two sections and will not be repeated
there. y

11.2.2 Implicit Euler method


Recall the discussions of Rem. 11.2.1.6. Why should we use a forward difference quotient and not a backward difference quotient, which relies on "states in the past"? Let's try!

On the (temporal) mesh M := {t0 < t1 < · · · < t_M} we obtain

$$\dot y = f(t, y) \quad\longleftrightarrow\quad
\frac{y_{k+1} - y_k}{h_k} = f(t_{k+1}, y_{k+1}) , \quad k = 0, \ldots, M-1 , \tag{11.2.2.1}$$

which uses a backward difference quotient.


This leads to another simple timestepping scheme analogous to (11.2.1.5):

yk+1 = yk + hk f(tk+1 , yk+1 ) , k = 0, . . . , M − 1 , (11.2.2.2)

with local timestep (stepsize) h k : = t k +1 − t k .

(11.2.2.2) = implicit Euler method

Note: (11.2.2.2) requires solving a (possibly non-linear) system of equations to obtain yk+1 !
(➤ Terminology “implicit”)

Geometry of implicit Euler method (Fig. 412): approximate the solution through (t0, y0) on [t0, t1] by
• a straight line through (t0, y0)
• with slope f(t1, y1),
where the curves shown are the trajectory through (t0, y0), the trajectory through (t1, y1), and the tangent to the latter in (t1, y1).

Remark 11.2.2.3 (Feasibility of implicit Euler timestepping) The issue is whether (11.2.2.2) is well defined, that is, whether we can solve it for y_{k+1} and whether this solution is unique. The intuition is that for small timestep size h > 0 the right hand side of (11.2.2.2) is a "small perturbation of the identity".
Let us give a formal argument. Consider an autonomous ODE ẏ = f(y), assume a continuously differ-
entiable right hand side function f, f ∈ C1 ( D, R N ), and regard (11.2.2.2) as an h-dependent non-linear
system of equations:

yk+1 = yk + hk f(tk+1 , yk+1 ) ⇔ G ( h, yk+1 ) = 0 with G ( h, z) := z − hf(tk+1 , z) − yk .

To investigate the solvability of this non-linear equation we start with an observation about a partial deriva-
tive of G:
$$\frac{dG}{dz}(h, z) = I - h\, D_y f(t_{k+1}, z)
\quad\Rightarrow\quad
\frac{dG}{dz}(0, z) = I .$$
In addition, G (0, yk ) = 0. Next, recall the implicit function theorem [Str09, Thm. 7.8.1]:

Theorem 11.2.2.4. Implicit function theorem

Let G = G(x, y) be a continuously differentiable function of x ∈ R^k and y ∈ R^ℓ, defined on the open set Ω ⊂ R^k × R^ℓ with values in R^ℓ: G : Ω ⊂ R^k × R^ℓ → R^ℓ.

Assume that G has a zero in z0 := [x0; y0] ∈ Ω, x0 ∈ R^k, y0 ∈ R^ℓ: G(z0) = 0.

If the Jacobian ∂G/∂y (z0) ∈ R^{ℓ,ℓ} is invertible, then there is an open neighborhood U of x0 ∈ R^k and a continuously differentiable function g : U → R^ℓ such that

g(x0 ) = y0 and G (x, g(x)) = 0 ∀x ∈ U .


For sufficiently small | h| it permits us to conclude that the equation G ( h, z) = 0 defines a continuous
function g = g( h) with g(0) = yk .

Corollary 11.2.2.5. Solvability of implicit Euler recursion

For sufficiently small h > 0 the equation (11.2.2.2) has a unique solution yk+1 .
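In code, one way to realize a single implicit Euler step (11.2.2.2) is a simple fixed-point iteration on z ↦ y_k + h f(t_{k+1}, z), which is a contraction for sufficiently small h. A sketch (in practice one would rather use Newton's method; the tolerance and iteration count are arbitrary placeholder values):

  #include <Eigen/Dense>
  #include <functional>

  using State = Eigen::VectorXd;

  // One step of the implicit Euler method (11.2.2.2): solve
  //   y_{k+1} = y_k + h f(t_{k+1}, y_{k+1})
  // by fixed-point iteration; converges only for sufficiently small h.
  State implicitEulerStep(const std::function<State(double, const State &)> &f,
                          const State &yk, double tk1, double h,
                          double rtol = 1e-10, int maxit = 50) {
    State z = yk;                          // initial guess: previous state
    for (int i = 0; i < maxit; ++i) {
      State znew = yk + h * f(tk1, z);
      if ((znew - z).norm() <= rtol * znew.norm()) { return znew; }
      z = znew;
    }
    return z;                              // no convergence: return last iterate
  }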

11.2.3 Implicit midpoint method


Beside using forward or backward difference quotients, the derivative ẏ can also be approximated by the symmetric difference quotient, see also (5.2.3.17),

$$\dot y(t) \approx \frac{y(t + h) - y(t - h)}{2h} , \quad h > 0 . \tag{11.2.3.1}$$

The idea is to apply this formula in t = 1/2 (t_k + t_{k+1}) with h = h_k/2, which transforms the ODE into

$$\dot y = f(t, y) \quad\longleftrightarrow\quad
\frac{y_{k+1} - y_k}{h_k} = f\bigl(\tfrac{1}{2}(t_k + t_{k+1}),\, y_h(\tfrac{1}{2}(t_k + t_{k+1}))\bigr) , \quad k = 0, \ldots, M-1 . \tag{11.2.3.2}$$

The trouble is that the value y_h(1/2 (t_k + t_{k+1})) does not seem to be available, unless we recall that the approximate trajectory t ↦ y_h(t) is supposed to be piecewise linear, which implies y_h(1/2 (t_k + t_{k+1})) = 1/2 (y_h(t_k) + y_h(t_{k+1})). This gives the recursion formula for the implicit midpoint method in analogy to (11.2.1.5) and (11.2.2.2):

$$y_{k+1} = y_k + h_k\, f\bigl(\tfrac{1}{2}(t_k + t_{k+1}),\, \tfrac{1}{2}(y_k + y_{k+1})\bigr) , \quad k = 0, \ldots, M-1 , \tag{11.2.3.3}$$

with local timestep (stepsize) h_k := t_{k+1} − t_k.

Implicit midpoint method, a geometric view (Fig. 413): approximate the trajectory through (t0, y0) on [t0, t1] by
• a straight line through (t0, y0)
• with slope f(t∗, y∗), where t∗ := 1/2 (t0 + t1), y∗ = 1/2 (y0 + y1);
the curves shown are the trajectory through (t0, y0), the trajectory through (t∗, y∗), and the tangent to the latter in (t∗, y∗).
As in the case of (11.2.2.2), also (11.2.3.3) entails solving a (non-linear) system of equations in order to
obtain yk+1 . Rem. 11.2.2.3 also holds true in this case: for sufficiently small h (11.2.3.3) will have a unique
solution yk+1 , which renders the recursion well defined.
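For instance (a worked special case, not in the original text), applied to the scalar linear ODE ẏ = λy the defining equation (11.2.3.3) can be solved explicitly for y_{k+1}:

$$y_{k+1} = y_k + h_k \lambda\,\tfrac{1}{2}(y_k + y_{k+1})
\quad\Longleftrightarrow\quad
y_{k+1} = \frac{1 + \tfrac{1}{2} h_k \lambda}{1 - \tfrac{1}{2} h_k \lambda}\, y_k ,
\qquad \text{provided } h_k \lambda \neq 2 .$$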
Review question(s) 11.2.3.4 (Polygonal approximation methods)
(Q11.2.3.4.A) We consider the scalar linear IVP

ẏ = λy , y(0) = 1

on the interval [0, 1]. We use M ∈ N equidistant steps of the explicit Euler method to compute an
approximation y M for y(1).


• Derive a formula for y M .


• Which known result from calculus is equivalent to the convergence y M → y(1) for M → ∞?
(Q11.2.3.4.B) For an ODE
 
$$\dot y = f(y) , \quad
f := \begin{bmatrix} f_1(y) \\ \vdots \\ f_N(y) \end{bmatrix} : D \subset \mathbb{R}^N \mapsto \mathbb{R}^N , \tag{*}$$

we know that

$$\sum_{\ell=1}^{N} f_\ell(y) = 0 \quad \forall y \in D .$$

• Show that the sum of the components of every solution t 7→ y(t) is constant in time.
• Show that the sums of the components of the vectors y0 , y1 , y2 , . . . generated by either the explicit
Euler method, the implicit Euler method, or the implicit midpoint method, all applied to solve some
IVP for (*), are the same for all vectors yk .
(Q11.2.3.4.C) We consider the implicit Euler method for the scalar autonomous “explosion ODE” ẏ = y2 .
Give an explicit formula for y_{k+1} in terms of y_k and the timestep size h_k > 0. Specify potentially
necessary constraints on the size of hk .

The defining equation for recursion of the implicit Euler method (on some temporal mesh) applied to the
ODE ẏ = f(t, y) is

y k +1 : y k +1 = y k + h k f ( t k +1 , y k +1 ) . (11.2.2.2)

(Q11.2.3.4.D) The recursion of the implicit midpoint rule for the ODE ẏ = f(t, y) is

$$y_{k+1} :\quad y_{k+1} = y_k + h_k\, f\bigl(\tfrac{1}{2}(t_k + t_{k+1}),\, \tfrac{1}{2}(y_k + y_{k+1})\bigr) .$$
Give an explicit form of this recursion for the linear ODE ẏ = A(t)y, where A : R → R N,N is a matrix-
valued function. When will this recursion break down?
(Q11.2.3.4.E) For a twice continuously differentiable function f : I ⊂ R → R N we can use the second
symmetric difference quotient as an approximation of the second derivative f ′′ ( x ), x ∈ I :

$$\frac{f(x + h) - 2 f(x) + f(x - h)}{h^2} \approx f''(x) \quad\text{for } |h| \ll 1 .$$

Based on this approximation propose an explicit finite-difference timestepping scheme on a uniform temporal mesh for the second-order ODE ÿ = f(y).
(Q11.2.3.4.F) Formulate the equation that defines the single-step method for the IVP ẏ = f(t, y), y(t0) = y0, that arises from the difference quotient approximation

$$\dot y = f(t, y) \;\to\;
\frac{y_{k+1} - y_k}{h_k} \approx \tfrac{1}{2}\bigl(f(t_k, y(t_k)) + f(t_{k+1}, y(t_{k+1}))\bigr) , \quad h_k := t_{k+1} - t_k .$$

A temporal mesh M := {t0 < t1 < t2 < · · · < t M−1 < t M := T } can be taken for granted.


11.3 General Single-Step Methods

Video tutorial for Section 11.3: General Single-Step Methods: (14 minutes) Download link,
tablet notes

Now we fit the numerical schemes introduced in the previous section into a more general class of methods
for the solution of (autonomous) initial value problems (11.1.3.18) for ODEs. Throughout we assume that
all times considered belong to the domain of definition of the unique solution t → y(t) of (11.1.3.18), that
is, for T > 0 we take for granted [0, T ] ⊂ J (y0 ) (temporal domain of definition of the solution of an IVP is
explained in § 11.1.3.19).

11.3.1 Definition
§11.3.1.1 (Discrete evolution operators) From Section 11.2.1 and Section 11.2.2 recall the two Euler
methods for an autonomous ODE ẏ = f(y):

explicit Euler:  y_{k+1} = y_k + h_k f(y_k) ,
implicit Euler:  y_{k+1}:  y_{k+1} = y_k + h_k f(y_{k+1}) ,          h_k := t_{k+1} − t_k .

Both formulas, for sufficiently small h_k (→ Rem. 11.2.2.3), provide a mapping

$$(y_k, h_k) \mapsto \Psi(h_k, y_k) := y_{k+1} . \tag{11.3.1.2}$$

If y0 is the initial value, then y1 := Ψ(h, y0) can be regarded as an approximation of y(h), the value returned by the evolution operator Φ (→ Def. 11.1.4.3) for ẏ = f(y) applied to y0 over the period h:

$$y_1 = \Psi(h, y_0) \;\longleftrightarrow\; y(h) = \Phi^h y_0
\qquad\Longrightarrow\qquad \Psi(h, y) \approx \Phi^h y , \tag{11.3.1.3}$$

In a sense the polygonal approximation methods are based on approximations for the evolution operator
associated with the ODE.
This is what every single step method does: it tries to approximate the evolution operator Φ for an ODE
by a mapping Ψ of the kind as described in (11.3.1.2).

➙ A mapping Ψ as in (11.3.1.2) is called (a) discrete evolution operator.

✎ Notation: In analogy to Φh for discrete evolutions we often write Ψh y := Ψ( h, y) y

Remark 11.3.1.4 (Discretization) The adjective “discrete” used above designates (components of) meth-
ods that attempt to approximate the solution of an IVP by a sequence of finitely many states. “Discretiza-
tion” is the process of converting an ODE into a discrete model. This parlance is adopted for all procedures
that reduce a “continuous model” involving ordinary or partial differential equations to a form with a finite
number of unknowns. y

Above we identified the discrete evolutions underlying the polygonal approximation methods. Vice versa,


a mapping Ψ as given in (11.3.1.2) defines a single step method.

Definition 11.3.1.5. Single step method (for autonomous ODE) → [QSS00, Def. 11.2]

Given a discrete evolution Ψ : Ω ⊂ R × D 7→ R N , an initial state y0 , and a temporal mesh


M := {0 =: t0 < t1 < · · · < t M := T }, M ∈ N, the recursion

yk+1 := Ψ(tk+1 − tk , yk ) , k = 0, . . . , M − 1 , (11.3.1.6)

defines a single-step method (SSM) for the autonomous IVP ẏ = f(y), y(0) = y0 on the interval
[0, T ].

☞ In a sense, a single step method defined through its associated discrete evolution does not ap-
proximate a concrete initial value problem, but tries to approximate an ODE in the form of its
evolution operator.

In C++ a discrete evolution operator can be incarnated by a functor type offering an evaluation operator
State operator()(double h, const State &y) const;

see § 11.2.0.1 for the State data type.
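Building on this interface, a generic single-step-method driver realizing the recursion (11.3.1.6) can be written once and reused for any discrete evolution. A sketch (function and template names are illustrative):

  #include <vector>

  // Generic single-step method (11.3.1.6): given a discrete evolution Psi offering
  // operator()(double h, const State &y), an initial state y0 and a temporal mesh,
  // generate the sequence y_0, ..., y_M.
  template <typename DiscEvl, typename State>
  std::vector<State> singleStepMethod(DiscEvl &&Psi, const State &y0,
                                      const std::vector<double> &mesh) {
    std::vector<State> y{y0};
    for (std::size_t k = 0; k + 1 < mesh.size(); ++k) {
      y.push_back(Psi(mesh[k + 1] - mesh[k], y.back()));  // y_{k+1} = Psi(t_{k+1} - t_k, y_k)
    }
    return y;
  }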


Remark 11.3.1.7 (Discrete evolutions for non-autonomous ODEs) The concept of single step method
according to Def. 11.3.1.5 can be generalized to non-autonomous ODEs, which leads to recursions of the
form:

yk+1 := Ψ(tk , tk+1 , yk ) , k = 0, . . . , M − 1 ,

for a discrete evolution operator Ψ defined on I × I × D. y

§11.3.1.8 (Consistent single step methods) Now we state a first quantification of the goal that the
“discrete evolution should be an approximation of the evolution operator”: Ψ ≈ Φ, cf. (11.3.1.3). We want
the discrete evolution Ψ to inherit key properties of the evolution operator Φ. One such property is

$$\frac{d}{dt}\Phi^t y\Big|_{t=0} = f(y) \quad \forall y \in D . \tag{11.3.1.9}$$

Compliance of Ψ with (11.3.1.9) is expressed through the property of consistency, which, roughly speak-
ing, demands that a viable discrete evolution operator methods is structurally similar to that for the explicit
Euler method (11.2.1.5).

Consistent discrete evolution


The discrete evolution Ψ defining a single step method according to Def. 11.3.1.5 and (11.3.1.6) for
the autonomous ODE ẏ = f(y) must be of the form

$$\Psi^h y = y + h\,\psi(h, y) \quad\text{with}\quad
\begin{cases}
\psi : I \times D \to \mathbb{R}^N \ \text{continuous,} \\
\psi(0, y) = f(y) .
\end{cases} \tag{11.3.1.11}$$

 
Differentiating h 7→ Ψh y relying on the product rule confirms that (11.3.1.9) remains true for Ψ instead
of Φ.


Definition 11.3.1.12. Consistent single step methods

A single step method according to Def. 11.3.1.5 based on a discrete evolution of the form (11.3.1.11)
is called consistent with the ODE ẏ = f(y).
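For example, the explicit Euler method (11.2.1.5) is consistent in this sense, since its discrete evolution already has the form (11.3.1.11):

$$\Psi^h y = y + h\, f(y) \quad\Rightarrow\quad \psi(h, y) = f(y)\ \text{(continuous)} , \qquad \psi(0, y) = f(y) .$$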

EXAMPLE 11.3.1.13 (Consistency of implicit midpoint method) The discrete evolution Ψ and, hence,
the function ψ = ψ( h, y) for the implicit midpoint method are defined only implicitly, of course. Thus,
consistency cannot immediately be seen from a formula for ψ.

We examine consistency of the implicit midpoint method for the autonomous ODE ẏ = f(y). The corre-
sponding discrete evolution Ψ is defined by:

Ψ^h y = y + h f( ½(y + Ψ^h y) ) ,   h ∈ R, |h| "sufficiently small",  y ∈ D .    (11.3.1.14)

Assume that
• the right hand side function f : D ⊂ R N → R N is locally Lipschitz continuous, f ∈ C0 ( D ),
• and that |h| is “sufficiently small” to guarantee the existence of a solution Ψ^h y of (11.3.1.14) as
explained in Rem. 11.2.2.3.
Then we infer from the implicit function theorem Thm. 11.2.2.4 that the solution Ψh y of (11.3.1.14) will con-
tinuously depend on h: h 7→ Ψh y ∈ C0 (]−δ, δ[, R N ) for small δ > 0. Knowing this, we plug (11.3.1.14)
into itself and obtain
Ψ^h y = y + h f( ½(y + Ψ^h y) ) = y + h f( y + ½ h f(½(y + Ψ^h y)) )    [using (11.3.1.14) twice],

that is, Ψ^h y = y + h ψ(h, y) with ψ(h, y) := f( y + ½ h f(½(y + Ψ^h y)) ) .

We repeat that, by the implicit function theorem Thm. 11.2.2.4, Ψ^h y depends continuously on h and y.
This means that ψ( h, y) has the desired properties, in particular ψ(0, y) = f(y) is clear. y

Remark 11.3.1.15 (Notation for single step methods) Many authors specify a single step method by
writing down the first step for a general stepsize h

y1 = (implicit) expression in y0 , h and f ,

for instance, for the implicit midpoint rule

y_1 = y_0 + h f( ½(y_0 + y_1) ) .

Actually, this fixes the underlying discrete evolution. Also this course will sometimes adopt this practice. y

§11.3.1.16 (Output of single step methods) Here we resume and continue the discussion of
Rem. 11.2.1.8 for general single step methods according to Def. 11.3.1.5. Assuming unique solvability
of the systems of equations faced in each step of an implicit method, every single step method based on
a mesh M = {0 = t0 < t1 < · · · < t M := T } produces a finite sequence (y0 , y1 , . . . , y M ) of states,
where the first agrees with the initial state y0 .

We expect that the states provide a pointwise approximation of the solution trajectory t → y(t):

yk ≈ y(tk ) , k = 1, . . . , M .


Thus task (I) from § 11.2.0.1, computing an approximation for y( T ), is again easy: output y M as an
approximation of y( T ).

Task (II) from § 11.2.0.1, computing the solution trajectory, requires interpolation of the data points (tk , yk )
using some of the techniques presented in Chapter 5. The natural option is M-piecewise polynomial
interpolation, generalizing the polygonal approximation (11.2.1.9) used in Section 11.2.

Note that from the ODE ẏ = f(y) the derivatives ẏh (tk ) = f(yk ) are available without any further
approximation. This facilitates cubic Hermite interpolation (→ Def. 5.3.3.1), which yields

y_h ∈ C¹([0, T]):   y_h|_[t_{k−1}, t_k] ∈ P_3 ,   y_h(t_k) = y_k ,   (dy_h/dt)(t_k) = f(y_k) .
Summing up, an approximate trajectory t 7→ yh (t) is built in two stages:
(i) Compute sequence (yk )k by running the single step method.
(ii) Post-process the obtained sequence, usually by applying interpolation, to get yh .
y
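Stage (i) above can be coded generically. The following sketch (a hypothetical helper, not one of the lecture codes) runs a single-step method, given through a discrete-evolution functor as above, on a prescribed temporal mesh and returns the sequence (t_k, y_k):

  #include <utility>
  #include <vector>

  // Run a single-step method y_{k+1} = Psi(t_{k+1}-t_k, y_k), cf. (11.3.1.6),
  // on the mesh M = {t_0 < t_1 < ... < t_M}; returns the pairs (t_k, y_k).
  template <class DiscEvolOp, class State>
  std::vector<std::pair<double, State>> singlestep(DiscEvolOp &&Psi, const State &y0,
                                                   const std::vector<double> &mesh) {
    std::vector<std::pair<double, State>> states;
    states.reserve(mesh.size());
    State y = y0;
    states.push_back({mesh.front(), y});
    for (std::size_t k = 1; k < mesh.size(); ++k) {
      y = Psi(mesh[k] - mesh[k - 1], y); // one step of size h_k = t_k - t_{k-1}
      states.push_back({mesh[k], y});
    }
    return states;
  }

The post-processing of stage (ii), e.g. piecewise cubic Hermite interpolation of the pairs (t_k, y_k) with slopes f(y_k), can then be applied to the returned data.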
Review question(s) 11.3.1.17 (General single-step methods)
(Q11.3.1.17.A) Explain the concepts
• evolution operator and
• discrete evolution operator
in connection with the numerical integration of initial-value problems for the ODE ẏ = f(y),
f : D ⊂ R N 7→ R N .
(Q11.3.1.17.B) [Single-step methods and numerical quadrature] There is a connection between numer-
ical integration (the design and analysis of numerical methods for the solution of initial-value problems
for ODEs) and numerical quadrature (study of numerical methods for the evaluation of integrals).
• Explain, how a class of single-step methods for the solution of scalar initial-value problems

ẏ = f (t, y) , y(t0 ) = y0 ∈ R ,
can be used for the approximate evaluation of integrals ∫_a^b ϕ(τ) dτ ,  ϕ : [a, b] → R.
• If the considered single-step methods are of order p, what does this mean for the induced quadra-
ture method.
• Which quadrature formula does the implicit midpoint method yield?
(Q11.3.1.17.C) [Adjoint single-step method] Let a single-step method for the autonomous ODE
ẏ = f(y), f : D ⊂ R N → R N be defined by its discrete evolution operator Ψ : I × D 7→ D. Then
the adjoint single-step method is spawned by the discrete evolution operator Ψ̃ : I × D ↦ D defined according to

Ψ̃^h y := ( Ψ^{−h} )^{−1} y ,   y ∈ D,   h ∈ R sufficiently small .

What is the adjoint of the explicit Euler method?


(Q11.3.1.17.D) We have seen three simple single-step methods for the autonomous ODE ẏ = f(y),
f : D ⊂ R N → R N , here defined by describing the first step y0 → y1 with stepsize h ∈ R (“sufficiently
small”):


• The explicit Euler method:

y1 = y0 + hf(y0 ) .

• The implicit Euler method:

y1 : y1 = y0 + hf(y1 ) .

• The implicit midpoint method:

y_1 :   y_1 = y_0 + h f( ½(y_0 + y_1) ) .

For which methods does the associated discrete evolution operator Ψ : [−δ, δ] × D → D, δ > 0 suffi-
ciently small, satisfy

Ψh ◦ Ψ−h = Id ∀h ∈ [0, δ] ? (11.3.1.18)

Try to find a simple (scalar) counterexample, if you think that a method does not have property
(11.3.1.18).

11.3.2 (Asymptotic) Convergence of Single-Step Methods

Video tutorial for Section 11.3.2:(Asymptotic) Convergence of Single-Step Methods: (20


minutes) Download link, tablet notes

Of course, the accuracy of the solution sequence (yk )k obtained by a particular single-step method (→
Def. 11.3.1.5) is a central concern. This motivates studying the dependence of suitable norms of the
so-called discretization error on the choice of the temporal mesh M := {0 = t0 < t1 < · · · < t M = T }.

§11.3.2.1 (Discretization error of single step methods) Approximation errors in numerical integration
are also called discretization errors, cf. Rem. 11.3.1.4.
Depending on the objective of numerical integration as stated in § 11.2.0.1 different (norms of) discretiza-
tion errors are of interest:
(I) If only the solution at final time T is sought, the relevant norm of the discretization error is

ǫ M := ky( T ) − y M k ,

where k·k is some vector norm on R N .


(II) If we want to approximate the solution trajectory for (11.1.3.18) the discretization error is the function

t 7→ e(t) , e(t) := y(t) − yh (t) ,

where t 7→ yh (t) is the approximate trajectory obtained by post-processing, see § 11.3.1.16. In


this case accuracy of the method is gauged by looking at norms of the function e, see § 5.2.4.4 for
examples.


(III) Between (I) and (II) is the pointwise discretization error, which is the sequence (a so-called grid
function)

e : M → D , ek := y(tk ) − yk , k = 0, . . . , M . (11.3.2.2)

In this case one usually examines the maximum error in the mesh points

‖(e_k)‖_∞ := max_{k∈{1,...,N}} ‖e_k‖ ,

where k·k is a suitable vector norm on R N , customarily the Euclidean vector norm.
y

§11.3.2.3 (Asymptotic convergence of single step methods) Once the discrete evolution Ψ associated
with the ODE ẏ = f(y) is specified, the single step method according to Def. 11.3.1.5 is fixed:

yk+1 := Ψ(tk+1 − tk , yk ) , k = 0, . . . , M − 1 , (11.3.1.6)

The only way to control the accuracy of the solution y N or t 7→ yh (t) is through the selection of the mesh
M = {0 = t0 < t1 < · · · < t N = T }.

Hence we study convergence of single step methods for families of meshes {Mℓ } and track the decay of
(a norm) of the discretization error (→ § 11.3.2.1) as a function of the number M := ♯M of mesh points.
In other words, we examine h-convergence. Convergence through mesh refinement is discussed for
piecewise polynomial interpolation in Section 6.6.1 and for composite numerical quadrature in Section 7.5.

When investigating asymptotic convergence of single step methods we often resort to families of equidis-
tant meshes of [0, T ]:

M_M := { t_k := (k/M)·T : k = 0, . . . , M } .    (11.3.2.4)

We also call this the use of uniform timesteps of size h := T/M. y

EXPERIMENT 11.3.2.5 (Speed of convergence of polygonal methods)


The setting for this experiment is as follows:
✦ We consider the following IVP for the logistic ODE, see Ex. 11.1.2.1

ẏ = λy(1 − y) , y(0) = 0.01 .

✦ We apply explicit and implicit Euler methods (11.2.1.5)/(11.2.2.2) with uniform timestep h = 1/M,
M ∈ {5, 10, 20, 40, 80, 160, 320, 640}.
✦ Monitored: Error at final time E( h) := |y(1) − y M |
We are mainly interested in the qualitative nature of the asymptotic convergence as h → 0 in the sense
of the types of convergence introduced in Def. 6.2.2.7 with N there replaced with h−1 . Abbreviating some
error norm with EN = EN ( h), recall the classification of asymptotic convergence from Def. 6.2.2.7:

∃ p > 0:   E_N(h) ≤ C h^p   ∀h > 0  :  algebraic convergence, with order/rate p > 0 ,
∃ 0 < q < 1:   E_N(h) ≤ C q^{1/h}   ∀h > 0  :  exponential convergence ,

with a constant C > 0 independent of h.

[Fig. 414 (left, explicit Euler method) and Fig. 415 (right, implicit Euler method): error at final time (Euclidean norm) vs. timestep h, doubly logarithmic, for λ ∈ {1, 3, 6, 9}, together with the reference line O(h).]

O(M⁻¹) = O(h): algebraic convergence with order/rate 1 in both cases for h → 0

This matches our expectations, because, as we see from (11.2.1.7) and (11.2.2.1), both Euler methods
can be introduced via an approximation of ẏ by a one-sided difference quotient, which offers an O( h)
approximation of the derivative as h → 0.
However, polygonal approximation methods can do better: we study the convergence of the implicit midpoint method (11.2.3.3) in the above setting and observe algebraic convergence O(h²), that is, with order/rate 2 for h → 0.

[Fig. 416: error at final time (Euclidean norm) vs. timestep h, doubly logarithmic, for the implicit midpoint method with λ ∈ {1, 2, 5, 10}, together with the reference line O(h²).]

Also this is expected, because symmetric difference quotients of width h offer an O(h²)-approximation of the derivative for h → 0.

Parlance: Based on the observed rate of algebraic convergence, the two Euler methods are said to “con-
verge with first order”, whereas the implicit midpoint method is called “second-order convergent”.
y
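A study like the one above can be reproduced with a few lines of code. The following self-contained sketch (not the code that generated Figs. 414–416) monitors the error at final time T = 1 for the explicit Euler method applied to the logistic IVP, whose exact solution y(t) = y_0/(y_0 + (1 − y_0) e^{−λt}) is available; the printed rates should approach 1.

  #include <cmath>
  #include <cstdio>

  int main() {
    const double lambda = 3.0, y0 = 0.01, T = 1.0;
    // exact solution of the logistic IVP
    auto yexact = [&](double t) { return y0 / (y0 + (1.0 - y0) * std::exp(-lambda * t)); };
    double errold = 0.0;
    for (int M = 5; M <= 640; M *= 2) { // uniform timestep h = T/M
      const double h = T / M;
      double y = y0;
      for (int k = 0; k < M; ++k) y += h * lambda * y * (1.0 - y); // explicit Euler step
      const double err = std::abs(y - yexact(T)); // error at final time
      std::printf("M = %4d, error = %.3e, rate = %.2f\n", M, err,
                  (errold > 0.0) ? std::log2(errold / err) : 0.0);
      errold = err;
    }
    return 0;
  }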

The observations made for polygonal timestepping methods reflect a general pattern:

Algebraic convergence of single step methods

Consider the numerical integration of an initial value problem

ẏ = f(t, y) , y(t0 ) = y0 , (11.1.3.2)

with sufficiently smooth right hand side function f : I × D → R N .

Then conventional single step methods (→ Def. 11.3.1.5) will enjoy asymptotic algebraic convergence in the meshwidth, more precisely, see [DR08, Thm. 11.25]:

there is a p ∈ N such that the sequence (y_k)_k generated by the single step method
for ẏ = f(t, y) on a mesh M := {t_0 < t_1 < · · · < t_M = T} satisfies

max_k ‖y_k − y(t_k)‖ ≤ C h^p   for   h := max_{k=1,...,M} |t_k − t_{k−1}| → 0 ,    (11.3.2.7)

with C > 0 independent of M.

Definition 11.3.2.8. Order of a single step method

The maximal integer p ∈ N for which (11.3.2.7) holds for a single step method when applied to an
ODE with (sufficiently) smooth right hand side, is called the order of the method.

As in the case of quadrature rules (→ Def. 7.4.1.1) their order is the principal intrinsic indicator for the
“quality” of a single step method.

§11.3.2.9 (Convergence analysis for the explicit Euler method [Han02, Ch. 74]) We consider the
simplest single-step method, namely the explicit Euler method (11.2.1.5) on a mesh M := {0 = t0 <
t1 < · · · < t M = T } for a generic autonomous IVP

ẏ = f(y) , y(0) = y0 ∈ D ,

with sufficiently smooth and (globally ) Lipschitz continuous f : D ⊂ R N → R N , that is,

∃ L > 0: kf(y) − f(z)k ≤ Lky − zk ∀y, z ∈ D , (11.3.2.10)

cf. Def. 11.1.3.13, and C1 exact solution t 7→ y(t). Throughout we assume that solutions of ẏ = f(y) are
defined on [0, T ] for all initial states y0 ∈ D.

Recall the recursion defining the explicit Euler method

y_{k+1} = y_k + h_k f(y_k) ,   h_k := t_{k+1} − t_k ,   k = 0, . . . , M − 1 .    (11.2.1.5)
In numerical analysis one studies the error sequence  e_k := y_k − y(t_k) .

[Figure: the exact solution trajectory t ↦ y(t) vs. the Euler polygon through the computed states y_{k−1}, y_k, y_{k+1}, y_{k+2} over the mesh points t_{k−1}, t_k, t_{k+1}, t_{k+2}; arrows =̂ discrete evolution Ψ^{t_{k+1}−t_k}, • =̂ y(t_k), • =̂ y_k.]

The approach to estimate kek k follows a fundamental policy that comprises three key steps. To explain
them we rely on the abstract concepts of the
• evolution operator Φ associated with the ODE ẏ = f(y) (→ Def. 11.1.4.3) and
• discrete evolution operator Ψ defining the explicit Euler single step method, see Def. 11.3.1.5:


(11.2.1.5) ⇒ Ψh y = y + hf(y) . (11.3.2.11)

We argue that in this context abstraction pays off, because it helps elucidate a general technique for the
convergence analysis of single step methods.

➀ Abstract splitting of error:


y k +1
propagated error
Fundamental error splitting:
e k +1
e k +1 = Ψ h k y k − Φ h k y ( t k )
Ψhk (y(tk ))
= Ψ hk y k − Ψ hk y ( t k ) yk
| {z }
propagated error (11.3.2.12)
+ Ψ hk y ( t k ) − Φ hk y ( t k ) . ek y ( t k +1 )
| {z }
one-step error one-step error
y(tk )

Fig. 417
tk t k +1

A generic one-step error expressed through continuous and discrete evolutions reads:

τ(h, y) := Ψ^h y − Φ^h y .    (11.3.2.13)

[Fig. 418: geometric visualisation of the one-step error for the explicit Euler method (11.2.1.5), cf. Fig. 410: the solution trajectory through (t_k, y) together with Φ^h y and Ψ^h y, h := t_{k+1} − t_k.]

➁ Estimate for one-step error τ ( hk , y(tk )):

Geometric considerations: distance of a smooth curve and its tangent shrinks as the square of the distance
to the intersection point (curve locally looks like a parabola in the ξ − η coordinate system, see Fig. 420).
[Fig. 419/420: the one-step error τ(h, y(t_k)) as the gap between Φ^h y(t_k) and Ψ^h y(t_k), shown in the (t, y)-plane and in local ξ–η coordinates aligned with the tangent, in which the solution curve locally looks like a parabola.]

The geometric considerations can be made rigorous by analysis: recall Taylor's formula for a function y ∈ C^{K+1} [Str09, Satz 5.5.1]:

y(t + h) − y(t) = Σ_{j=1}^{K} (h^j / j!) y^{(j)}(t) + ∫_t^{t+h} ((t + h − τ)^K / K!) y^{(K+1)}(τ) dτ ,    (11.3.2.14)

where the remainder integral equals ( y^{(K+1)}(ξ) / (K+1)! ) h^{K+1}
for some ξ ∈ [t, t + h]. We conclude that, if y ∈ C2 ([0, T ]), which is ensured for smooth f, see
Lemma 11.1.1.3, then

y(t_{k+1}) − y(t_k) = ẏ(t_k) h_k + ½ ÿ(ξ_k) h_k² = f(y(t_k)) h_k + ½ ÿ(ξ_k) h_k² ,

for some t_k ≤ ξ_k ≤ t_{k+1}. This leads to an expression for the one-step error from (11.3.2.13):

τ(h_k, y(t_k)) = Ψ^{h_k} y(t_k) − y(t_{k+1})
             = y(t_k) + h_k f(y(t_k)) − ( y(t_k) + f(y(t_k)) h_k + ½ ÿ(ξ_k) h_k² )    [by (11.3.2.11)]    (11.3.2.15)
             = −½ ÿ(ξ_k) h_k² .

Sloppily speaking, we observe τ ( hk , y(tk )) = O( h2k ) uniformly for hk → 0.

➂ Estimate for the propagated error from (11.3.2.12)

‖Ψ^{h_k} y_k − Ψ^{h_k} y(t_k)‖ = ‖y_k + h_k f(y_k) − y(t_k) − h_k f(y(t_k))‖
                             ≤ (1 + L h_k) ‖y_k − y(t_k)‖   [by (11.3.2.10)] .    (11.3.2.16)

Thus we obtain a recursion for the error norms ε_k := ‖e_k‖ by simply applying the triangle inequality:

ε_{k+1} ≤ (1 + h_k L) ε_k + ρ_k ,   ρ_k := ½ h_k² max_{t_k ≤ τ ≤ t_{k+1}} ‖ÿ(τ)‖ .    (11.3.2.17)

Taking into account ε_0 = 0, this leads to

ε_k ≤ Σ_{l=1}^{k} ( Π_{j=1}^{l−1} (1 + L h_j) ) ρ_l ,   k = 1, . . . , N .    (11.3.2.18)

Use the elementary estimate (1 + L h_j) ≤ exp(L h_j) (by convexity of the exponential function):

(11.3.2.18)   ⇒   ε_k ≤ Σ_{l=1}^{k} ( Π_{j=1}^{l−1} exp(L h_j) ) ρ_l = Σ_{l=1}^{k} exp( L Σ_{j=1}^{l−1} h_j ) ρ_l .

Note that Σ_{j=1}^{l−1} h_j ≤ T for the final time T and conclude

ε_k ≤ exp(LT) Σ_{l=1}^{k} ρ_l ≤ exp(LT) max_k (ρ_k / h_k) Σ_{l=1}^{k} h_l ≤ T exp(LT) max_{l=1,...,k} h_l · max_{t_0 ≤ τ ≤ t_k} ‖ÿ(τ)‖ .

‖y_k − y(t_k)‖ ≤ T exp(LT) max_{l=1,...,k} h_l · max_{t_0 ≤ τ ≤ t_k} ‖ÿ(τ)‖ .    (11.3.2.19)

We can summarize the insight gleaned through this theoretical analysis as follows:


Total error arises from accumulation of propagated one-step errors!

From (11.3.2.19) we can conclude


✦ an error bound = O(h), h := max_l h_l   (➤ first-order algebraic convergence),

✦ and that the error bound grows exponentially with the length T of the integration interval.
y

§11.3.2.20 (One-step error and order of a single step method) In the analysis of the global discretiza-
tion error of the explicit Euler method in § 11.3.2.9 a one-step error of size O( h2k ) led to a total error of
O( h) through the effect of error accumulation over M ≈ h−1 steps. This relationship remains valid for
almost all single step methods [DB02, Theorem 4.10]:

Order of algebraic convergence of single-step methods

Consider an IVP (11.1.3.2) with solution t 7→ y(t) and a single step method defined by the
discrete evolution Ψ (→ Def. 11.3.1.5). If the one-step error along the solution trajectory satisfies
(Φ is the evolution map associated with the ODE, see Def. 11.1.4.3)

‖Ψ^h y(t) − Φ^h y(t)‖ ≤ C h^{p+1}   ∀h sufficiently small, t ∈ [0, T ] ,    (11.3.2.22)

for some p ∈ N and C > 0, then, usually,

max_k ‖y_k − y(t_k)‖ ≤ C h_M^p ,

with C > 0 independent of the temporal mesh M: The (pointwise) discretization error converges
algebraically with order/rate p.

A rigorous statement as a theorem would involve some particular assumptions on Ψ, which we do not
want to give here. These assumptions are satisfied, for instance, for all the methods presented in the
sequel. You may refer to [DB02, Sect. 4.1] for further information.

In fact, it is remarkable that a local condition like (11.3.2.22) permits us to make a quantitative prediction
of global convergence. This close relationship has made researchers introduce “order” also as a property
of discrete evolutions.

Definition 11.3.2.23. Order of a discrete evolution operator

Let Ψ : I × D 7→ R N be a discrete evolution for the autonomous ODE ẏ = f(y) (with associated
evolution operator Φ : I × D 7→ R N → Def. 11.1.4.3). The largest integer q ∈ N0 such that

∀y ∈ D ∃τ0 > 0: kΨτ y − Φτ yk ≤ C (y)τ q+1 ∀|τ | ≤ τ0 (11.3.2.24)

is called the order of the discrete evolution. .

This notion of “order of a discrete evolution" allows a concise summary:

A single-step method (SSM, Def. 11.3.1.5) based on a discrete evolution Ψ satisfies:

Ψ of order q ∈ N   ⟹   the SSM converges algebraically with order q.
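The order of a discrete evolution can also be guessed numerically by monitoring the decay of the one-step error for τ → 0. A minimal sketch for the explicit Euler discrete evolution applied to the scalar logistic ODE, whose exact evolution Φ^τ is known in closed form (illustrative only, not a lecture code); the printed rates should approach q + 1 = 2:

  #include <cmath>
  #include <cstdio>

  int main() {
    const double lambda = 2.0, y = 0.3; // fixed state y
    auto f = [&](double z) { return lambda * z * (1.0 - z); }; // logistic right-hand side
    // exact evolution of the logistic ODE: Phi^tau y
    auto Phi = [&](double tau, double z) {
      return z / (z + (1.0 - z) * std::exp(-lambda * tau));
    };
    double errold = 0.0;
    for (int k = 2; k <= 10; ++k) {
      const double tau = std::pow(2.0, -k);
      const double psi = y + tau * f(y);              // explicit Euler: Psi^tau y
      const double err = std::abs(psi - Phi(tau, y)); // one-step error
      std::printf("tau = %.3e, one-step error = %.3e, rate = %.2f\n", tau, err,
                  (errold > 0.0) ? std::log2(errold / err) : 0.0);
      errold = err;
    }
    return 0;
  }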


EXAMPLE 11.3.2.25 (Orders of finite-difference single-step methods) Let us determine orders of the
discrete evolutions for the three simple single-step methods introduced in Section 11.2, here listed with
their corresponding discrete evolution operators Ψ (→ § 11.3.1.1) when applied to an autonomous ODE
ẏ = f(y): for y0 ∈ D ⊂ R N ,

explicit (forward) Euler method (11.2.1.5):    Ψ^τ y_0 := y_0 + τ f(y_0) ,    (11.3.2.26)
implicit (backward) Euler method (11.2.2.2):   Ψ^τ y_0 := w :  w = y_0 + τ f(w) ,    (11.3.2.27)
implicit midpoint method (11.2.3.3):           Ψ^τ y_0 := w :  w = y_0 + τ f( ½(y_0 + w) ) .    (11.3.2.28)

The computation of their orders will rely on a fundamental technique for establishing (11.3.2.24) based on
Taylor expansion, which asserts that for a function g ∈ C m+1 (]t0 − δ, t0 + δ[, R N ), δ > 0,
g(t_0 + τ) = Σ_{k=0}^{m} (1/k!) g^{(k)}(t_0) τ^k + O(τ^{m+1})   for τ → 0 .    (11.3.2.29)

Of course the arguments hinge on the smoothness of the vectorfield f = f(y), which will ensure smooth-
ness of solutions of the associated ODE ẏ = f(y). Thus, we make the following simplifying assumption:

Assumption 11.3.2.30. Smoothness of right-hand side vectorfield

The vectorfield y 7→ f(y) is C ∞ on R N

Let Φ = Φ(t, y) denote the evolution operator (→ Def. 11.1.4.3) induced by ẏ = f(y), which, by defini-
tion, satisfies

(∂Φ/∂t)(t, y_0) = f(Φ^t y_0)   ∀y_0 ∈ D, t ∈ J(y_0) .    (11.1.4.4)
Setting v(τ ) := Φτ y0 , which is a solution of the initial-value problem ẏ = f(y), y(0) = y0 , we find for
small τ , appealing to the one-dimensional chain rule and (11.1.4.4),

(dv/dτ)(τ) = f(v(τ)) ,   (d²v/dτ²)(τ) = (∂f/∂y)(v(τ)) · (dv/dτ)(τ) .    (11.3.2.31)

This yields the following truncated Taylor expansion

Φ^τ y_0 = v(τ) = v(0) + τ (dv/dτ)(0) + ½ τ² (d²v/dτ²)(0) + O(τ³)
         = y_0 + τ f(y_0) + ½ τ² D f(y_0) f(y_0) + O(τ³)    (11.3.2.32)

for τ → 0. Note that the derivative D f(y0 ) is an N × N Jacobi matrix. Explicit expressions for the
remainder term involve second derivatives of f.
➊ For the explicit Euler method (11.3.2.26) we immediately have from (11.3.2.32)

Ψτ y0 − Φτ y0 = y0 + τf(y0 ) − y0 − τf(y0 ) + O(τ 2 ) = O(τ 2 ) for τ → 0 .

The explicit Euler method is of order 1.


➋ It is not as straightforward for the implicit Euler method

Ψτ y0 := w(τ ): w(τ ) = y0 + τf(w(τ )) . (11.3.2.27)


First, we plug (11.3.2.27) into itself

w(τ ) = y0 + τf(w(τ )) = y0 + τf(y0 + τf(w(τ ))) ,

and then use the truncated Taylor expansion of f around y0

f(y0 + v) = f(y0 ) + D f(y0 )v + O(kvk2 ) for v → 0 . (11.3.2.33)

This gives

w(τ ) = y0 + τ (f(y0 ) + τ D f(y0 )f(w(τ ))) + O(τ 3 ) for τ → 0 .

Since Ψτ y0 = w(τ ), matching terms with (11.3.2.32) we obtain

Ψτ y0 − Φτ y0 = τ 2 D f(y0 )f(w(τ )) + O(τ 3 ) = O(τ 2 ) for τ → 0 .

Thanks to the smoothness of f the remainder terms will depend continuously on y0 .

The implicit Euler method has order 1.


➌ For the implicit midpoint rule

Ψ^τ y_0 := w(τ) :   w(τ) = y_0 + τ f( ½(y_0 + w(τ)) ) ,    (11.3.2.28)

we follow the same idea and consider

w(τ) = y_0 + τ f( ½(y_0 + w(τ)) ) = y_0 + τ f( y_0 + ½ τ f(½(y_0 + w(τ))) )
      = y_0 + τ f( y_0 + ½ τ f(y_0 + O(τ)) )   for τ → 0 .

Then we resort to the truncated Taylor expansion (11.3.2.33) and get for τ → 0

w(τ) = y_0 + τ ( f(y_0) + D f(y_0) ½ τ f(y_0 + O(τ)) ) + O(τ³)
     = y_0 + τ ( f(y_0) + D f(y_0) ½ τ (f(y_0) + O(τ)) ) + O(τ³) .

Matching with (11.3.2.32) shows w(τ ) − Φτ y0 = O(τ 3 ) where the “O” just comprises continuous
higher order derivatives of f.

The implicit midpoint method is an order-2 method.


Hardly surprising, these analytic results match the orders of algebraic convergence observed in
Exp. 11.3.2.5. y

Review question(s) 11.3.2.34 (Asymptotic convergence of single-step methods)


(Q11.3.2.34.A) We consider an autonomous ODE ẏ = f(y) with smooth f : D ⊂ R^N → R^N. Explain why the one-step error

τ(h, y) = Ψ^h y − Φ^h y ,   y ∈ D ,   h “sufficiently small” ,

for a consistent single-step method defined by the discrete evolution operator Ψ satisfies

∀y ∈ D: τ (h, y) = O(h) for h → 0 .


Definition 11.3.1.12. Consistent single step methods

A single step method according to Def. 11.3.1.5 based on a discrete evolution of the form

Ψ^h y = y + h ψ(h, y)   with   ψ : I × D → R^N continuous,  ψ(0, y) = f(y) .    (11.3.1.11)

is called consistent with the ODE ẏ = f(y).

(Q11.3.2.34.B) Let t ∈ I 7→ y(t), I ⊂ R an interval containing 0, denote the solution of the au-
tonomous IVP

ẏ = f(y) , y(0) = y0 .

Assume that f is continuously differentiable.


Use the chain rule to express ẏ(t∗ ) and ÿ(t∗ ) by means of f and its Jacobian.
(Q11.3.2.34.C) Based on the answer to Question (Q11.3.2.34.B), determine the order of a single-step
method for the autonomous ODE ẏ = f(y), f : D ⊂ R N → R N smooth, whose discrete evolution
operator is given by

Ψ^h y := y + h f(y) + ½ h² D f(y) f(y) ,

where D f(y) ∈ R N,N is the Jacobian of f in y ∈ D.


11.4 Explicit Runge-Kutta Single-Step Methods (RKSSMs)

Video tutorial for Section 11.4: Explicit Runge-Kutta Single-Step Methods (RKSSMs): (27
minutes) Download link, tablet notes

So far we only know first and second order methods from 11.2: the explicit and implicit Euler method
(11.2.1.5) and (11.2.2.2), respectively, are of first order, the implicit midpoint rule of second order. We
observed this in Exp. 11.3.2.5 and it can be proved rigorously for all three methods adapting the arguments
of § 11.3.2.9.

Thus, barring the impact of roundoff, the low-order polygonal approximation methods are guaranteed to
achieve any prescribed accuracy provided that the mesh is fine enough. Why should we need any other
timestepping schemes?

Remark 11.4.0.1 (Rationale for high-order single step methods cf. [DR08, Sect. 11.5.3]) We argue
that the use of higher-order timestepping methods is highly advisable for the sake of efficiency. The
reasoning is very similar to that of Rem. 7.4.3.12, when we considered numerical quadrature. The reader
is advised to study that remark again.

As we saw in § 11.3.2.3 error bounds for single step methods for the solution of IVPs will inevitably feature
unknown constants “C > 0”. Thus they do not give useful information about the discretization error for
a concrete IVP and mesh. Hence, it is too ambitious to ask how many timesteps are needed so that
ky( T ) − y N k stays below a prescribed bound.
However, an easier question can be answered by asymptotic estimates like (11.3.2.7), and this question reads:


What extra computational effort buys a prescribed reduction of the error ?


The usual concept of “computational effort” for single step methods (→ Def. 11.3.1.5) is as follows:

Computational effort  ∼  total number of f-evaluations for approximately solving the IVP,
                      ∼  number of timesteps, if the evaluation of the discrete evolution Ψ^h (→ Def. 11.3.1.5) requires a fixed number of f-evaluations,
                      ∼  h⁻¹, in the case of uniform timestep size h > 0 (equidistant mesh (11.3.2.4)).

Now, let us consider a single step method of order p ∈ N, employed with a uniform timestep hold .
We focus on the maximal discretization error in the mesh points, see § 11.3.2.1. We make the crucial
assumption that the asymptotic error bounds are sharp:

err( h) ≈ Ch p for small meshwidth h > 0 , (11.4.0.2)

with a “generic constant” C > 0 independent of the mesh.

Goal:   err(h_new) / err(h_old)  =!  1/ρ   for a reduction factor ρ > 1 .

(11.3.2.7)   ⇒   h_new^p / h_old^p  =!  1/ρ   ⇔   h_new = ρ^{−1/p} h_old .    (11.4.0.3)

For a single step method of order p ∈ N we conclude the behavior:

reduce the error by a factor ρ > 1   ⟷   increase the effort by a factor ρ^{1/p}

[Fig. 421: plots of ρ^{1/p} vs. ρ for p = 1, 2, 3, 4.]

☞ The larger the order p, the less effort is needed for a prescribed reduction of the error!

We remark that another (minor) rationale for using higher-order methods is to curb impact of roundoff
errors (→ Section 1.5.3) accumulating during timestepping [DR08, Sect. 11.5.3]. y

§11.4.0.4 (Bootstrap construction of explicit single step methods) Now we will build a class of meth-
ods that are explicit and achieve orders p > 2. The starting point is a simple integral equation satisfied by
any solution t ↦ y(t) of an initial-value problem for the general ODE ẏ = f(t, y):

IVP:   ẏ(t) = f(t, y(t)) ,  y(t_0) = y_0    ⇒    y(t_1) = y_0 + ∫_{t_0}^{t_1} f(τ, y(τ)) dτ .

Idea: approximate the integral by means of an s-point quadrature formula (→ Section 7.2), defined on the reference interval [0, 1], with nodes c_1, . . . , c_s and weights b_1, . . . , b_s:

y(t_1) ≈ y_1 = y_0 + h Σ_{i=1}^{s} b_i f( t_0 + c_i h, y(t_0 + c_i h) ) ,   h := t_1 − t_0 .    (11.4.0.5)

Obtain these values by bootstrapping


“Bootstrapping” = use the same idea in a simpler version to get y(t0 + ci h), noting that these values
can be replaced by other approximations obtained by methods already constructed (this approach will be
elucidated in the next example).

What error can we afford in the approximation of y(t0 + ci h) (under the assumption that f is Lipschitz
continuous)? We take the cue from the considerations in § 11.3.2.9.

Goal: aim for a one-step error bound   y(t_1) − y_1 = O(h^{p+1})

Note that there is a factor h in front of the quadrature sum in (11.4.0.5). Thus, our goal can already be
achieved, if only
y(t0 + ci h) is approximated up to an error O( h p ),
again, because in (11.4.0.5) a factor of size h multiplies f(t_0 + c_i h, y(t_0 + c_i h)).

This is accomplished by a less accurate discrete evolution than the one we are about to build. Thus,
we can construct discrete evolutions of higher and higher order, in turns, starting with the explicit Euler
method. All these methods will be explicit, that is, y1 can be computed directly from point values of f. y

EXAMPLE 11.4.0.6 (Simple Runge-Kutta methods by quadrature & bootstrapping) Now we apply the bootstrapping idea outlined above. We write k_ℓ ∈ R^N for the approximations of y(t_0 + c_i h).
• Quadrature formula = trapezoidal rule (7.3.0.5):

Q(f) = ½ ( f(0) + f(1) )   ↔   s = 2 :  c_1 = 0, c_2 = 1 ,  b_1 = b_2 = ½ ,    (11.4.0.7)

and y(t_1) approximated by an explicit Euler step (11.2.1.5):

k_1 = f(t_0, y_0) ,   k_2 = f(t_0 + h, y_0 + h k_1) ,   y_1 = y_0 + (h/2)(k_1 + k_2) .    (11.4.0.8)

(11.4.0.8) = explicit trapezoidal method (for numerical integration of ODEs).


• Quadrature formula → simplest Gauss quadrature formula = midpoint rule (→ Ex. 7.3.0.3) & y(½(t_1 + t_0)) approximated by an explicit Euler step (11.2.1.5):

k_1 = f(t_0, y_0) ,   k_2 = f(t_0 + h/2, y_0 + (h/2) k_1) ,   y_1 = y_0 + h k_2 .    (11.4.0.9)

(11.4.0.9) = explicit midpoint method (for numerical integration of ODEs) [DR08, Alg. 11.18].
y
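A minimal sketch of the two discrete evolutions (11.4.0.8) and (11.4.0.9) for an autonomous ODE; the function names are illustrative only, and State = Eigen::VectorXd as before:

  #include <Eigen/Dense>

  using State = Eigen::VectorXd;

  // one step of the explicit trapezoidal method (11.4.0.8)
  template <class Functor>
  State expTrapezoidalStep(Functor &&f, const State &y0, double h) {
    const State k1 = f(y0);
    const State k2 = f(y0 + h * k1);
    return y0 + 0.5 * h * (k1 + k2);
  }

  // one step of the explicit midpoint method (11.4.0.9)
  template <class Functor>
  State expMidpointStep(Functor &&f, const State &y0, double h) {
    const State k1 = f(y0);
    const State k2 = f(y0 + 0.5 * h * k1);
    return y0 + h * k2;
  }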

EXAMPLE 11.4.0.10 (Convergence of simple Runge-Kutta methods) We perform an empiric study of


the order of the explicit single step methods constructed in Ex. 11.4.0.6.


✦ IVP: ẏ = 10y(1 − y) (scalar logistic ODE (11.1.2.2)), initial value y(0) = 0.01, final time T = 1,
✦ Explicit single step methods, uniform timestep h.
[Fig. 422: approximate solutions y_h(j/10), j = 1, . . . , 10, for the explicit Euler, explicit trapezoidal, and explicit midpoint methods together with the exact solution y(t). Fig. 423: error |y_h(1) − y(1)| at final time vs. stepsize h, doubly logarithmic, with reference line O(h²).]


Observation: obvious algebraic convergence in meshwidth h with integer rates/orders:
explicit trapezoidal method (11.4.0.8) → order 2
explicit midpoint method (11.4.0.9) → order 2

This is what one expects from the considerations in Ex. 11.4.0.6. y

The formulas that we have obtained follow a general pattern:

Definition 11.4.0.11. Explicit Runge-Kutta single-step method

For b_i, a_ij ∈ R, c_i := Σ_{j=1}^{i−1} a_ij, i, j = 1, . . . , s, s ∈ N, an s-stage explicit Runge-Kutta single step method (RK-SSM) for the ODE ẏ = f(t, y), f : Ω → R^N, is defined by (y_0 ∈ D)

k_i := f( t_0 + c_i h, y_0 + h Σ_{j=1}^{i−1} a_ij k_j ) ,  i = 1, . . . , s ,      y_1 := y_0 + h Σ_{i=1}^{s} b_i k_i .

The vectors ki ∈ R N , i = 1, . . . , s, are called increments, h > 0 is the size of the timestep.

Recall Rem. 11.3.1.15 to understand how the discrete evolution for an explicit Runge-Kutta method is
specified in this definition by giving the formulas for the first step. This is a convention widely adopted in
the literature about numerical methods for ODEs. Of course, the increments ki have to be computed anew
in each timestep.

The implementation of an s-stage explicit Runge-Kutta single step method according to Def. 11.4.0.11 is
straightforward: The increments ki ∈ R N are computed successively, starting from k1 = f(t0 + c1 h, y0 ).

Only s f-evaluations and AXPY operations (→ Section 1.3.2) are required to compute the next state
vector from the current.
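The following sketch (illustrative, not one of the lecture codes) spells out a single step of a general s-stage explicit RK-SSM for an autonomous ODE, with the coefficients passed as a strictly lower triangular Eigen matrix A and a weight vector b:

  #include <Eigen/Dense>
  #include <vector>

  using State = Eigen::VectorXd;

  // One step y0 -> y1 of an s-stage explicit RK-SSM (Def. 11.4.0.11), autonomous case:
  // k_i = f(y0 + h*sum_{j<i} a_ij*k_j),  y1 = y0 + h*sum_i b_i*k_i.
  template <class Functor>
  State explicitRKStep(Functor &&f, const Eigen::MatrixXd &A,
                       const Eigen::VectorXd &b, const State &y0, double h) {
    const int s = static_cast<int>(b.size()); // number of stages
    std::vector<State> k;                     // increments k_1, ..., k_s
    k.reserve(s);
    for (int i = 0; i < s; ++i) {
      State z = y0;
      for (int j = 0; j < i; ++j) z += h * A(i, j) * k[j]; // A strictly lower triangular
      k.push_back(f(z)); // i-th increment
    }
    State y1 = y0;
    for (int i = 0; i < s; ++i) y1 += h * b(i) * k[i];
    return y1;
  }

Note that, as stated above, only s evaluations of f and a few AXPY operations are performed per step.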

In books and research articles a particular way to write down the coefficients characterizing RK-SSMs is
widely used:


Butcher scheme notation for explicit RK-SSM

Shorthand notation for (explicit) Runge-Kutta methods [DR08, (11.75)]: the Butcher scheme

    c | A          c_1 |  0     ···            0
   ---+-----  :=   c_2 | a_21    0             ⋮
      | b^T         ⋮  |  ⋮      ⋱     ⋱       ⋮
                   c_s | a_s1   ···  a_s,s−1   0
                   ----+------------------------
                       | b_1    ···  b_{s−1}  b_s        (11.4.0.13)

(Note that A is a strictly lower triangular s × s-matrix.)

Now we restrict ourselves to the case of an autonomous ODE ẏ = f(y). Matching Def. 11.4.0.11 and
Def. 11.3.1.5, we see that the discrete evolution induced by an explicit Runge-Kutta single-step method is

Ψ^h y = y + h Σ_{i=1}^{s} b_i k_i ,   h ∈ R ,  y ∈ D ,    (11.4.0.14)

where the increments k_i are defined by the increment equations

k_i := f( y + h Σ_{j=1}^{i−1} a_ij k_j ) .

In line with (11.3.1.11), this discrete evolution can be written as

Ψ^h y = y + h ψ(h, y) ,   ψ(h, y) = Σ_{i=1}^{s} b_i k_i .

Is this discrete evolution consistent in the sense of § 11.3.1.8, that is, does ψ(0, y) = f(y) hold? If h = 0, the increment equations yield

h = 0   ⇒   k_1 = · · · = k_s = f(y)   ⇒   ψ(0, y) = Σ_{i=1}^{s} b_i f(y) .

Corollary 11.4.0.15. Consistent Runge-Kutta single step methods

A Runge-Kutta single step method according to Def. 11.4.0.11 is consistent (→ Def. 11.3.1.12) with
the ODE ẏ = f(t, y), if and only if
Σ_{i=1}^{s} b_i = 1 .

Remark 11.4.0.16 (RK-SSM and quadrature rules) Note that in Def. 11.4.0.11 the coefficients c_i and b_i, i ∈ {1, . . . , s}, can be regarded as nodes and weights of a quadrature formula (→ Def. 7.2.0.1) on [0, 1]: apply the explicit Runge-Kutta single step method to the “ODE” ẏ = f(t), f ∈ C⁰([0, 1]), on [0, 1] with timestep h = 1 and initial value y(0) = 0, with exact solution

ẏ(t) = f(t) ,  y(0) = 0   ⇒   y(t) = ∫_0^t f(τ) dτ .

Then the formulas of Def. 11.4.0.11 reduce to

y_1 = 0 + Σ_{i=1}^{s} b_i f(c_i) ≈ ∫_0^1 f(τ) dτ .

Recall that the quadrature rule with these weights and nodes c j will have order ≥ 1 (→ Def. 7.4.1.1), if
the weights add up to 1! y

EXAMPLE 11.4.0.17 (Butcher schemes for some explicit RK-SSM [DR08, Sect. 11.6.1]) The fol-
lowing explicit Runge-Kutta single step methods are often mentioned in literature.

• Explicit Euler method (11.2.1.5):
    0 | 0
   ---+---
      | 1
  ➣ order = 1

• Explicit trapezoidal method (11.4.0.8):
    0 | 0    0
    1 | 1    0
   ---+----------
      | 1/2  1/2
  ➣ order = 2

• Explicit midpoint method (11.4.0.9):
    0   | 0    0
    1/2 | 1/2  0
   -----+--------
        | 0    1
  ➣ order = 2

• Classical 4th-order RK-SSM:
    0   | 0    0    0    0
    1/2 | 1/2  0    0    0
    1/2 | 0    1/2  0    0
    1   | 0    0    1    0
   -----+--------------------
        | 1/6  2/6  2/6  1/6
  ➣ order = 4

• Kutta's 3/8-method:
    0   | 0     0    0    0
    1/3 | 1/3   0    0    0
    2/3 | −1/3  1    0    0
    1   | 1    −1    1    0
   -----+---------------------
        | 1/8   3/8  3/8  1/8
  ➣ order = 4

Hosts of (explicit) Runge-Kutta methods can be found in the literature, see for example the Wikipedia page.
They are stated in the form of Butcher schemes (11.4.0.13) most of the time. y
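As an illustration, the Butcher data of the classical 4th-order RK-SSM from the table above could be set up and combined with the generic step function sketched after Def. 11.4.0.11 (that helper and the function name below are assumptions of this sketch, not lecture code):

  #include <Eigen/Dense>

  // Butcher data of the classical 4th-order RK-SSM
  inline void classicalRK4(Eigen::MatrixXd &A, Eigen::VectorXd &b) {
    A = Eigen::MatrixXd::Zero(4, 4);
    A(1, 0) = 0.5; // c_2 = 1/2
    A(2, 1) = 0.5; // c_3 = 1/2
    A(3, 2) = 1.0; // c_4 = 1
    b.resize(4);
    b << 1.0 / 6, 2.0 / 6, 2.0 / 6, 1.0 / 6;
  }

With these data a single step then reads y1 = explicitRKStep(f, A, b, y0, h).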

Remark 11.4.0.18 (Construction of higher order Runge-Kutta single step methods) Runge-Kutta sin-
gle step methods of order p > 2 are not found by bootstrapping as in Ex. 11.4.0.6, because the resulting
methods would have quite a lot of stages compared to their order.

Rather one derives order conditions yielding large non-linear systems of equations for the coefficients
aij and bi in Def. 11.4.0.11, see [DB02, Sect .4.2.3] and [HLW06, Ch. III]. This approach is similar to
the construction of a Gauss quadrature rule in Ex. 7.4.2.2. Unfortunately, the systems of equations are


very difficult to solve and no universal recipe is available. Nevertheless, through massive use of symbolic
computation, explicit Runge-Kutta methods of order up to 19 have been constructed in this way. y

Remark 11.4.0.19 (“Butcher barriers” for explicit RK-SSM) The following table gives lower bounds for
the number of stages needed to achieve order p for an explicit Runge-Kutta method.
order p 1 2 3 4 5 6 7 8 ≥9
minimal no. s of stages 1 2 3 4 6 7 9 11 ≥ p+3
No general formula has been discovered. What is known is that for explicit Runge-Kutta single step
methods according to Def. 11.4.0.11
order p ≤ number s of stages of RK-SSM
y

Supplementary literature. Runge-Kutta methods are presented in every textbook covering

numerical integration: [DR08, Sect. 11.6], [Han02, Ch. 76], [QSS00, Sect. 11.8].

Review question(s) 11.4.0.20 (Explicit Runge-Kutta single-step methods)


(Q11.4.0.20.A) How many parameters describe a consistent 2-stage explicit Runge-Kutta method for the
autonomous ODE ẏ = f(y)?

Definition 11.4.0.11. Explicit Runge-Kutta method

For b_i, a_ij ∈ R, c_i := Σ_{j=1}^{i−1} a_ij, i, j = 1, . . . , s, s ∈ N, an s-stage explicit Runge-Kutta single step method (RK-SSM) for the ODE ẏ = f(t, y), f : Ω → R^N, is defined by (y_0 ∈ D)

k_i := f( t_0 + c_i h, y_0 + h Σ_{j=1}^{i−1} a_ij k_j ) ,  i = 1, . . . , s ,      y_1 := y_0 + h Σ_{i=1}^{s} b_i k_i .

The vectors ki ∈ R N , i = 1, . . . , s, are called increments, h > 0 is the size of the timestep.

(Q11.4.0.20.B) Recall that by “autonomization” the initial value problem

ẏ = f(t, y) , y(t0 ) = y0 , (11.4.0.21)

with f : I × D → R N can be converted into the equivalent IVP for the extended state

z = [z_1, . . . , z_N, z_{N+1}]^⊤ := [y; t] ∈ R^{N+1} :

ż = g(z) ,   g(z) := [ f(z_{N+1}, [z_1, . . . , z_N]^⊤) ; 1 ] ,   z(0) = [ y_0 ; t_0 ] .    (11.4.0.22)

Let us apply the same 2-stage explicit Runge-Kutta method to (11.4.0.21) and (11.4.0.22). When will
both approaches produce the same sequence of states yk ∈ D?
(Q11.4.0.20.C) Formulate a generic 2-stage explicit Runge-Kutta method for the autonomous second-
order ODE ÿ = f(y), f : D ⊂ R N → R N .

Hint. Apply a standard 2-stage explicit Runge-Kutta method after transformation to an equivalent first-
order ODE.


11.5 Adaptive Stepsize Control

Video tutorial for Section 11.5: Adaptive Stepsize Control: (32 minutes) Download link,
tablet notes

Section 7.6, in the context of numerical quadrature, teaches an a-posteriori way to adjust the mesh under-
lying a composite quadrature rule to the integrand: During the computation we estimate the local quadra-
ture error by comparing the approximations obtained by using quadrature formulas of different order. The
same policy for adapting the integration mesh is very popular in the context of numerical integration, too.
Since the size hk := tk+1 − tk of the cells of the temporal mesh is also called the timestep size, this kind
of a-posteriori mesh adaptation is also known as stepsize control.

11.5.1 The Need for Timestep Adaptation


EXAMPLE 11.5.1.1 (Oregonator reaction) Chemical reaction kinetics is a field where ODE based mod-
els are very common. This example presents a famous reaction with extremely abrupt dynamics. Refer
to [Han02, Ch. 62] for more information about the ODE-based modelling of kinetics of chemical reactions.

This is a special case of an “oscillating” Zhabotinski-Belousov reaction [Gra02]:

BrO3− + Br− 7→ HBrO2


HBrO2 + Br− 7→ Org
BrO3− + HBrO2 7→ 2 HBrO2 + Ce(IV) (11.5.1.2)
2 HBrO2 7→ Org
Ce(IV) 7→ Br−

By the laws of reaction kinetics of physical chemistry from (11.5.1.2) we can extract the following (system
of) ordinary differential equation(s) for the concentrations of the different compounds:

y_1 := c(BrO_3^−) :   ẏ_1 = −k_1 y_1 y_2 − k_3 y_1 y_3 ,
y_2 := c(Br^−) :       ẏ_2 = −k_1 y_1 y_2 − k_2 y_2 y_3 + k_5 y_5 ,
y_3 := c(HBrO_2) :     ẏ_3 = k_1 y_1 y_2 − k_2 y_2 y_3 + k_3 y_1 y_3 − 2 k_4 y_3² ,    (11.5.1.3)
y_4 := c(Org) :        ẏ_4 = k_2 y_2 y_3 + k_4 y_3² ,
y_5 := c(Ce(IV)) :     ẏ_5 = k_3 y_1 y_3 − k_5 y_5 ,

with (non-dimensionalized) reaction constants:

k_1 = 1.34 ,  k_2 = 1.6 · 10⁹ ,  k_3 = 8.0 · 10³ ,  k_4 = 4.0 · 10⁷ ,  k_5 = 1.0 .
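For numerical experiments the right-hand side f of (11.5.1.3) can be wrapped in a functor. A sketch with State = Eigen::VectorXd and the ordering y = [y_1, . . . , y_5] as above (the struct name is illustrative):

  #include <Eigen/Dense>

  // Right-hand side f(y) of the Oregonator system (11.5.1.3)
  struct OregonatorRHS {
    double k1 = 1.34, k2 = 1.6e9, k3 = 8.0e3, k4 = 4.0e7, k5 = 1.0;
    Eigen::VectorXd operator()(const Eigen::VectorXd &y) const {
      Eigen::VectorXd dy(5);
      dy(0) = -k1 * y(0) * y(1) - k3 * y(0) * y(2);
      dy(1) = -k1 * y(0) * y(1) - k2 * y(1) * y(2) + k5 * y(4);
      dy(2) = k1 * y(0) * y(1) - k2 * y(1) * y(2) + k3 * y(0) * y(2) - 2.0 * k4 * y(2) * y(2);
      dy(3) = k2 * y(1) * y(2) + k4 * y(2) * y(2);
      dy(4) = k3 * y(0) * y(2) - k5 * y(4);
      return dy;
    }
  };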

periodic chemical reaction ➽ Video 1, Video 2


These are results from highly accurate simulations with initial state y_1(0) = 0.06, y_2(0) = 0.33 · 10⁻⁶, y_3(0) = 0.501 · 10⁻¹⁰, y_4(0) = 0.03, y_5(0) = 0.24 · 10⁻⁷:

[Fig. 424: concentration of Br⁻ vs. time t, logarithmic scale. Fig. 425: concentration of HBrO₂ vs. time t, logarithmic scale.]

We observe a strongly non-uniform behavior of the solution in time.

This is very common with evolutions arising from practical models (circuit models, chemical reaction mod-
els, mechanical systems)
y

EXAMPLE 11.5.1.4 (Blow-up)


We return to the “explosion ODE” of Ex. 11.1.3.20 and consider the scalar autonomous IVP:

ẏ = y² ,  y(0) = y_0 > 0 ,   with solution   y(t) = y_0 / (1 − y_0 t) ,  t < 1/y_0 .

As we have seen, a solution exists only for finite time and then suffers a blow-up, that is, lim_{t → 1/y_0} y(t) = ∞ :  J(y_0) = ]−∞, 1/y_0[ !

[Fig. 426: blow-up solutions y(t) for the initial values y_0 = 1, y_0 = 0.5, y_0 = 2.]

How to choose the temporal mesh {t_0 < t_1 < · · · < t_{N−1} < t_N} for a single step method in case J(y_0) is not known, or, even worse, if it is not clear a priori that a blow-up will happen?
Just imagine: what will result from equidistant explicit Euler integration (11.2.1.5) applied to the above
IVP?

[Fig. 427: solution by Ode45 for y_0 = 1, y_0 = 0.5, y_0 = 2; markers indicate the computed states y_k.]

A preview: There are single-step methods that can detect and resolve blow-ups! The simulations were conducted with the numerical integrator Ode45 introduced in § 11.5.3.3. We observe that Ode45 manages to reduce the stepsize more and more as it approaches the singularity of the solution! How can it accomplish this feat?
y

11.5.2 Local-in-Time Stepsize Control


We identify as key challenge (discussed for autonomous ODEs below):
How to choose a good temporal mesh {0 = t0 < t1 < · · · < t M−1 < t M }
for a given single step method applied to a concrete IVP?
What does “good” mean ?
Be efficient! Be reliable!

Stepsize adaptation for single step methods

Objective:   M as small as possible   &   max_{k=1,...,M} ‖y(t_k) − y_k‖ < TOL   or   ‖y(T) − y_M‖ < TOL ,   TOL = tolerance

Policy:   Try to curb/balance the one-step error by
  ✦ adjusting the current stepsize h_k ,
  ✦ predicting a suitable next timestep h_{k+1}
  (local-in-time stepsize control).

Tool:   Local-in-time one-step error estimator (a posteriori, based on y_k, h_{k−1})

Why do we embrace local-in-time timestep control (based on estimating only the one-step error)? One
could raise a serious objection: If a small time-local error in a single timestep leads to large error
kyk − y(tk )k at later times, then local-in-time timestep control is powerless about it and will not even
notice!
Nevertheless, local-in-time timestep control is used almost exclusively,
☞ because we do not want to discard past timesteps, which could amount to tremendous waste of
computational resources,
☞ because it is inexpensive and it works for many practical problems,
☞ because there is no reliable method that can deliver guaranteed accuracy for general IVP.


§11.5.2.2 (Local-in-time error estimation) We “recycle” heuristics already employed for adaptive quadra-
ture, see Section 7.6, § 7.6.0.10. There we tried to get an idea of the local quadrature error by comparing
two approximations of different order. Now we pursue a similar idea over a single timestep.
Idea: Estimation of the one-step error.
Compare the results of two discrete evolutions Ψ^h, Ψ̃^h of different order over the current timestep h:

If Order(Ψ̃) > Order(Ψ), then we expect (for small h!)

Φ^h y_k − Ψ^h y_k   ≈   EST_k := Ψ̃^h y_k − Ψ^h y_k .    (11.5.2.3)
(one-step error)   (heuristics for concrete h > 0)   (a computable quantity!)


y

§11.5.2.4 ((Crude) local timestep control) We take for granted the availability of a local error estimate
ESTk that we have computed for a current stepsize h. We specify target values ATOL > 0, RTOL > 0 of
absolute and relative tolerances to be met by the local error and implement the following policy:
Compare   EST_k ↔ ATOL   (absolute tolerance)
and        EST_k ↔ RTOL · ‖y_k‖   (relative tolerance)      ➣   reject/accept current step .    (11.5.2.5)
Both tolerances RTOL > 0 and ATOL > 0 have to be supplied by the user of the adaptive algorithm. The
absolute tolerance is usually chosen significantly smaller than the relative tolerance and merely serves as
a safeguard against non-termination in case yk ≈ 0. For a similar use of absolute and relative tolerances
see Section 8.2.3, which deals with termination criteria for iterations, in particular (8.2.3.3).

☞ Simple accept/reject algorithm:


ESTk < max{ATOL, kyk kRTOL}: Accept current step:
• Advance by one timestep (stepsize h),
• use larger stepsize (αh with some α > 1) for next step (∗)
ESTk > max{ATOL, kyk kRTOL}: Reject current step:
• Repeat current timestep
• with a smaller stepsize < h, e.g., ½ h

The rationale behind the adjustment of the timestep size in (∗) is the following: if the current stepsize
guarantees sufficiently small one-step error, then it might be possible to obtain a still acceptable one-
step error with a larger timestep, which would enhance efficiency (fewer timesteps for total numerical
integration). This should be tried, since timestep control will usually provide a safeguard against undue
loss of accuracy.
The following C++ code implements a wrapper function odeintadapt() for a general adaptive single-
step method according to the policy outlined above. The arguments are meant to pass the following
information:
• Psilow, Psihigh: functors providing discrete evolution operators of different order for the autonomous ODE, with the call signature Psi(h, y) of the functor type shown above, expecting a stepsize as first argument and a state (usually a column vector) as second,


• T: final time T > 0,


• y0: initial state y0 ,
• h0: stepsize h0 for the first timestep
• reltol, abstol: relative and absolute tolerances, see (11.5.2.5),
• hmin: minimal stepsize, timestepping terminates when stepsize control hk < hmin , which is rele-
vant for detecting blow-ups or collapse of the solution.

C++ code 11.5.2.6: Simple local stepsize control for single step methods ➺ GITLAB
2  // Auxiliary function: default norm for an EIGEN vector type
3  template <class State>
4  double _norm(const State &y) { return y.norm(); }
5
6  // Adaptive numerical integrator based on local-in-time stepsize control
7  template <class DiscEvolOp, class State,
8            class NormFunc = decltype(_norm<State>)>
9  std::vector<std::pair<double, State>> odeintadapt(
10     DiscEvolOp &&Psilow, DiscEvolOp &&Psihigh, const State &y0, double T,
11     double h0, double reltol, double abstol, double hmin,
12     NormFunc &norm = _norm<State>) {
13   double t = 0;  // initial time t0 = 0
14   State y = y0;  // current state
15   double h = h0; // timestep to start with
16   // vector of times/computed states: (tk, yk)k
17   std::vector<std::pair<double, State>> states;
18   states.push_back({t, y}); // initial time and state
19   // Main timestepping loop
20   while ((states.back().first < T) && (h >= hmin)) {
21     State yh = Psihigh(h, y);   // high-order discrete evolution Ψ̃^h
22     State yH = Psilow(h, y);    // low-order discrete evolution Ψ^h
23     double est = norm(yH - yh); // local error estimate ESTk
24     if (est < std::max(reltol * norm(y), abstol)) { // step accepted
25       y = yh;                     // use high-order approximation
26       t = t + std::min(T - t, h); // next time tk
27       states.push_back({t, y});   // store approximate state
28       h = 1.1 * h;                // try with increased stepsize
29     } else {     // step rejected
30       h = h / 2; // try with half the stepsize
31     }
32     // Numerical integration has ground to a halt!
33     if (h < hmin) {
34       std::cerr << "Warning: Failure at t=" << states.back().first
35                 << ". Unable to meet integration tolerances without reducing "
36                 << "the step size below the smallest value allowed (" << hmin
37                 << ") at time t = " << t << "." << std::endl;
38     }
39   } // end main loop
40   return states; // ok thanks to return value optimization (RVO)
41 }

Comments on Code 11.5.2.6:


• line 20: check whether final time is reached or timestepping has ground to a halt (hk < hmin ).
• line 21, 22: advance state by low and high order integrator.
• line 23: compute norm of estimated error, see (11.5.2.3).


• line 24: make comparison (11.5.2.5) to decide whether to accept or reject local step.
• line 27, 28: step accepted, update state and current time and suggest 1.1 times the current stepsize
for next step.
• line 30 step rejected, try again with half the stepsize.
• Return value is a vector of pairs consisting of
– times t ↔ temporal mesh t_0 < t_1 < t_2 < . . . < t_N ≤ T, where t_N < T indicates premature termination (collapse, blow-up),
– states y ↔ sequence (y_k)_{k=0}^{N}.
y

Remark 11.5.2.7 (Choice of norm) In Code 11.5.2.6 the norm underlying timestep control is passed
through a functor. This is related to the important fact that that norm has to be chosen in a problem-
dependent way. For instance, in the case of systems of ODEs different components of the state vector y
may correspond to different physical quantities. Hence, if we used the Euclidean norm k·k = k·k2 , the
choice of physical units might have a strong impact on the selection of timesteps, which is clearly not
desirable and destroys the scale-invariance of the algorithm, cf. Rem. 2.3.3.11. y

Remark 11.5.2.8 (Estimation of “wrong” error?) We face the same conundrum as in the case of adap-
tive numerical quadrature, see Rem. 7.6.0.17:

By the heuristic considerations, see (11.5.2.3), it seems that EST_k measures the one-step error for the low-order method Ψ and that we should use y_{k+1} = Ψ^{h_k} y_k, if the timestep is accepted.

However, it would be foolish not to use the better value y_{k+1} = Ψ̃^{h_k} y_k, since it is available for free. This is
justified by control theoretic arguments [DB02, Sect. 5.2]. y

EXPERIMENT 11.5.2.9 (Simple adaptive stepsize control) We test adaptive timestepping routine from
Code 11.5.2.6 for a scalar IVP and compare the estimated local error and true local error.
✦ IVP for the ODE ẏ = cos²(αy), α > 0, with solution y(t) = arctan(α(t − c))/α for y(0) ∈ ]−π/(2α), π/(2α)[
✦ Simple adaptive timestepping based on explicit Euler (11.2.1.5) and explicit trapezoidal rule
(11.4.0.8)
[Fig. 428: adaptive timestepping with rtol = 0.01, atol = 10⁻⁴, α = 20: exact solution y(t), computed states y_k, and rejected steps. Fig. 429: true error |y(t_k) − y_k| and estimated error EST_k vs. time t.]

Statistics: 66 timesteps, 131 rejected timesteps

Observations:
☞ Adaptive timestepping well resolves local features of solution y(t) at t = 1
☞ Estimated error (an estimate for the one-step error) and true error are not related! To understand
this recall Rem. 11.5.2.8.
y

EXPERIMENT 11.5.2.10 (Gain through adaptivity → Exp. 11.5.2.9) In this experiment we want to
explore whether adaptive timestepping is worth while, as regards reduction of computational effort without
sacrificing accuracy.
We retain the simple adaptive timestepping from previous experiment Exp. 11.5.2.9 and also study the
same IVP.
New: initial state y(0) = 0!
Now we examine the dependence of the maximal discretization error in mesh points on the computational
effort. The latter is proportional to the number of timesteps.
[Fig. 430: solutions (y_k)_k obtained by simple adaptive timestepping (α = 40) for rtol ∈ {0.4, 0.2, 0.1, 0.05, 0.025, 0.0125, 0.00625}. Fig. 431: maximal error max_k |y(t_k) − y_k| vs. number N of timesteps for uniform and adaptive timesteps (error vs. computational effort).]
Observations:
☞ Adaptive timestepping achieves much better accuracy for a fixed computational effort.
y

EXPERIMENT 11.5.2.11 (“Failure” of adaptive timestepping → Exp. 11.5.2.10)


Same ODE and simple adaptive timestepping as in previous experiment Exp. 11.5.2.10.
ẏ = cos²(αy)   ⇒   y(t) = arctan(α(t − c))/α ,   y(0) ∈ ]−π/(2α), π/(2α)[ ,

for α = 40.

Now: initial state y(0) = −0.0386 ≈ −π/(2α), as in Exp. 11.5.2.9

[Fig. 432: solutions (y_k)_k for different values of rtol (α = 40). Fig. 433: maximal error max_k |y(t_k) − y_k| vs. number N of timesteps for uniform and adaptive timesteps (error vs. computational effort).]
Observations:
☞ Adaptive timestepping leads to larger errors at the same computational cost as uniform timestep-
ping!

Explanation: the position of the steep step of the solution depends sensitively on the initial value if y(0) ≈ −π/(2α):

y(t) = arctan( α(t − c) )/α   with   c = −tan(α y_0)/α ,   i.e., the steep step is located at t ≈ −tan(α y_0)/α .
Hence, small local errors in the initial timesteps will lead to large errors at around time t ≈ 1. The stepsize
control is mistaken in condoning these small one-step errors in the first few steps and, therefore, incurs
huge errors later.

However, the perspective of backward error analysis (→ § 1.5.5.18) rehabilitates adaptive stepsize control
in this case: it gives us a numerical solution that is very close to the exact solution of the ODE with slightly
perturbed initial state y0 . y

§11.5.2.12 (Refined local stepsize control → [DR08, Sect. 11.7]) The above algorithm (Code 11.5.2.6)
is simple, but the rule for increasing/shrinking of the timestep “squanders” the information contained in the comparison of EST_k with TOL:

More ambitious goal!   When EST_k > TOL : stepsize adjustment — better h_k = ?
                       When EST_k < TOL : stepsize prediction — good h_{k+1} = ?

Assumption: At our disposal are two discrete evolutions:

✦ Ψ with order(Ψ) = p   (➙ “low-order” single step method),
✦ Ψ̃ with order(Ψ̃) > p   (➙ “higher-order” single step method).
These are the same building blocks as for the simple adaptive strategy employed in Code 11.5.2.6 (passed
as arguments Psilow, Psihigh there).
According to (11.3.2.22) we can expect the following asymptotic decay of the one-step errors for hk → 0:

    Ψ^{hk} y(tk) − Φ^{hk} y(tk) = c·hk^{p+1} + O(hk^{p+2}) ,
                                                                  (11.5.2.13)
    Ψ̃^{hk} y(tk) − Φ^{hk} y(tk) = O(hk^{p+2}) ,


with some (unknown) constant c > 0. Why hk^{p+1}? Remember the estimate (11.3.2.15) from the error
analysis of the explicit Euler method: there we also found O(hk²) for the one-step error of a single step
method of order 1.

Heuristic reasoning: the timestep hk is small ➣ “higher-order terms” O(hk^{p+2}) can be ignored:

    Ψ^{hk} y(tk) − Φ^{hk} y(tk) ≐ c·hk^{p+1} ,
                                                 ⇒   ESTk ≐ c·hk^{p+1} .    (11.5.2.14)
    Ψ̃^{hk} y(tk) − Φ^{hk} y(tk) ≐ 0 ,

✎ notation: ≐ means equality up to higher-order terms in hk

    ESTk ≐ c·hk^{p+1}   ⇒   c ≐ ESTk / hk^{p+1} .    (11.5.2.15)

ESTk is available in the algorithm, see (11.5.2.3).

For the sake of accuracy (which demands “ESTk < TOL”) and efficiency (which favors “>”) we aim for

    ESTk = TOL := max{ATOL, ‖yk‖·RTOL}   (target) .    (11.5.2.16)

What timestep h∗ can actually achieve (11.5.2.16), if we “believe” (heuristics!) in (11.5.2.14) (and, there-
fore, in (11.5.2.15))?

    (11.5.2.15) & (11.5.2.16)   ⇒   TOL = (ESTk / hk^{p+1}) · h∗^{p+1} .    (11.5.2.17)

    “Optimal timestep” (stepsize prediction):    h∗ = hk · (TOL/ESTk)^{1/(p+1)} .    (11.5.2.18)

The proposed timestep size h∗ will be used in both cases:


(Rejection of current timestep): In case ESTk > TOL ➣ repeat step with stepsize h∗ .

(Acceptance of current timestep): If ESTk ≤ TOL ➣ use h∗ as stepsize for the next step.

C++ code 11.5.2.19: Refined local stepsize control for single step methods ➺ GITLAB
2   // Auxiliary function: default norm for an EIGEN vector type
3   template <class State>
4   double defaultnorm(const State &y) { return y.norm(); }
5   // Auxiliary struct to hold user options
6   struct Odeintssctrl_options {
7     double T;       // terminal time
8     double h0;      // initial time step
9     double reltol;  // norm-relative error tolerance
10    double abstol;  // absolute error tolerance
11    double hmin;    // smallest allowed time step
12  } __attribute__((aligned(64)));
13  // Adaptive single-step integrator
14  template <class DiscEvolOp, class State,
15            class NormFunc = decltype(defaultnorm<State>)>
16  std::vector<std::pair<double, State>> odeintssctrl(
17      DiscEvolOp &&Psilow, unsigned int p, DiscEvolOp &&Psihigh, const State &y0,
18      const Odeintssctrl_options &opt,
19      NormFunc &norm = defaultnorm<State>) {
20    double t = 0;       // initial time t0 = 0
21    State y = y0;       // current state, initialized here
22    double h = opt.h0;  // timestep to start with
23    // Array for returning pairs of times/states (tk, yk)k
24    std::vector<std::pair<double, State>> states;
25    states.push_back({t, y});  // initial time and state
26    // Main timestepping loop
27    while ((states.back().first < opt.T) && (h >= opt.hmin)) {
28      State yh = Psihigh(h, y);  // high-order discrete evolution Ψ̃^h
29      State yH = Psilow(h, y);   // low-order discrete evolution Ψ^h
30      const double est = norm(yH - yh);  // ↔ ESTk
31      const double tol = std::max(opt.reltol * norm(y), opt.abstol);  // effective tolerance
32      // Optimal stepsize according to (11.5.2.18)
33      if (est < tol) {  // step accepted
34        // store next approximate state
35        states.push_back({t = t + std::min(opt.T - t, h), y = yh});
36      }
37      // New timestep size according to (11.5.2.18)
38      h *= std::max(0.5, std::min(2., 0.9 * std::pow(tol / est, 1. / (p + 1))));
39      if (h < opt.hmin) {
40        std::cerr << "Warning: Failure at t=" << states.back().first
41                  << ". Unable to meet integration tolerances without reducing the step"
42                  << " size below the smallest value allowed (" << opt.hmin
43                  << ") at time t = " << t << "." << std::endl;
44      }
45    }  // end main loop
46    return states;
47  }

Comments on Code 11.5.2.19 (see the comments on Code 11.5.2.6 for more explanations):
• Input arguments as for Code 11.5.2.6, except for p ≙ order of the lower-order discrete evolution.
• line 38: compute a presumably better local stepsize according to (11.5.2.18),
• line 33: decide whether to repeat the step or to advance,
• line 35: extend the output array only if the current step has not been rejected.
A possible invocation of odeintssctrl() is sketched below.
y
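The following minimal sketch invokes odeintssctrl() from Code 11.5.2.19 for the logistic ODE (11.1.2.2),
using the explicit Euler method (order p = 1) as Psilow and the explicit trapezoidal rule as Psihigh. It
assumes that Code 11.5.2.19 is available through an included header (the header name is hypothetical);
both discrete evolutions are wrapped in std::function so that they share one type, as required by the
single template parameter DiscEvolOp.

#include <cmath>
#include <functional>
#include <iostream>
// #include "odeintssctrl.hpp"  // hypothetical header providing Code 11.5.2.19

int main() {
  // Logistic ODE y' = f(y), cf. (11.1.2.2)
  auto f = [](double y) { return 5.0 * y * (1.0 - y); };
  // Low-order discrete evolution: one explicit Euler step (order p = 1)
  std::function<double(double, double)> PsiLow = [f](double h, double y) {
    return y + h * f(y);
  };
  // Higher-order discrete evolution: one explicit trapezoidal step (order 2)
  std::function<double(double, double)> PsiHigh = [f](double h, double y) {
    const double k1 = f(y);
    const double k2 = f(y + h * k1);
    return y + 0.5 * h * (k1 + k2);
  };
  // Scalar state: supply a norm explicitly (the default norm calls .norm())
  auto nrm = [](const double &x) { return std::fabs(x); };
  Odeintssctrl_options opt{};
  opt.T = 1.0;        // final time
  opt.h0 = 0.1;       // initial timestep
  opt.reltol = 1e-3;  // relative tolerance
  opt.abstol = 1e-6;  // absolute tolerance
  opt.hmin = 1e-8;    // minimal admissible timestep
  const auto states = odeintssctrl(PsiLow, 1, PsiHigh, 0.2, opt, nrm);
  for (const auto &ty : states) {
    std::cout << "t = " << ty.first << ", y = " << ty.second << std::endl;
  }
  return 0;
}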

11.5.3 Embedded Runge-Kutta Methods


For higher-order RK-SSMs with a considerable number of stages, computing two different sets of increments
(→ Def. 11.4.0.11) for two methods of different order, just for the sake of local-in-time stepsize control,
would mean undue effort. This makes the following embedding idea attractive:

Embedding idea for RK-SSM

Use two RK-SSMs based on the same increments, that is, built with the same coefficients aij , but
different weights bi , see Def. 11.4.0.11 for the formulas, and different orders p and p + 1.


Augmented Butcher scheme for embedded explicit Runge-Kutta methods
(the lower-order scheme has the weights b̂i):

     c | A           c1 | a11 ··· a1s
     --+-----         ⋮ |  ⋮        ⋮
       | b^T    :=   cs | as1 ··· ass
       | b̂^T         ---+---------------
                         | b1  ···  bs
                         | b̂1  ···  b̂s
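The following generic sketch shows how a single step of such an embedded pair can be realized: the
increments are computed only once and then combined with the two different weight vectors b and b̂;
the norm of the difference of the two resulting states can serve as ESTk. The function name and interface
are illustrative and not part of the course codes.

#include <Eigen/Dense>
#include <utility>

// One step of an embedded explicit RK pair that shares its increments:
// A is strictly lower triangular, b and bhat are the two weight vectors.
// Generic sketch for an autonomous ODE y' = f(y).
template <class Functor>
std::pair<Eigen::VectorXd, Eigen::VectorXd> embeddedRKStep(
    Functor &&f, const Eigen::MatrixXd &A, const Eigen::VectorXd &b,
    const Eigen::VectorXd &bhat, const Eigen::VectorXd &y0, double h) {
  const int s = static_cast<int>(A.rows());   // number of stages
  const int n = static_cast<int>(y0.size());  // dimension of the state space
  Eigen::MatrixXd K(n, s);                    // increments k_1,...,k_s as columns
  for (int i = 0; i < s; ++i) {
    Eigen::VectorXd y = y0;
    for (int j = 0; j < i; ++j) {
      y += h * A(i, j) * K.col(j);
    }
    K.col(i) = f(y);  // each increment is computed only once
  }
  const Eigen::VectorXd y1 = y0 + h * K * b;        // step with weights b
  const Eigen::VectorXd y1hat = y0 + h * K * bhat;  // companion step, weights bhat
  return {y1, y1hat};  // norm(y1 - y1hat) yields the error estimate EST_k
}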

EXAMPLE 11.5.3.2 (Commonly used embedded explicit Runge-Kutta methods) The following two
embedded RK-SSM, presented in the form of their extended Butcher schemes, provide single step meth-
ods of orders 4 & 5 [HNW93, Tables 4.1 & 4.2].

    0   |                                    0   |
    1/3 | 1/3                                1/2 | 1/2
    1/3 | 1/6   1/6                          1/2 | 0     1/2
    1/2 | 1/8   0     3/8                    1   | 0     0     1
    1   | 1/2   0    -3/2    2               3/4 | 5/32  7/32  13/32  -1/32
    ----+------------------------------      ----+-------------------------------
    y1  | 1/6   0     0      2/3   1/6       y1  | 1/6   1/3   1/3    1/6     0
    ŷ1  | 1/10  0     3/10   2/5   1/5       ŷ1  | -1/2  7/3   7/3    13/6   -16/3

    Merson’s embedded RK-SSM                 Fehlberg’s embedded RK-SSM


y

§11.5.3.3 (E IGEN-compatible adaptive explicit embedded Runge-Kutta integrator) An implementation


of an explicit embedded Runge-Kutta single-step method with adaptive stepsize control for solving an
autonomous IVP is provided by the utility class Ode45 ➺ GITLAB: The class name already indicates the
orders of the pair of single step methods used for adaptive stepsize control:

    Ψ ≙ RK-method of order 4 ,    Ψ̃ ≙ RK-method of order 5    ➣  “Ode45” .

Refer to the class implementation ➺ GITLAB and to the data members Ode45::_mA, Ode45::_vb4, and
Ode45::_vb5 for the underlying coefficients A, b, and b̂, as introduced in Section 11.5.3.
The class is templated with two type parameters:

template <class StateType,
          class RhsType = std::function<StateType(const StateType &)>>
class Ode45 { ... };

(i) StateType: type for vectors in the state space V, e.g., a fixed-size vector type of EIGEN:
    Eigen::Matrix<double,N,1>, where N is an integer constant, see § 11.2.0.1.
(ii) RhsType: a functor type, see Section 0.3.3, for the right-hand-side function f; it must match
    StateType; a default type is provided.
The functor for the right hand side f : D ⊂ V → V of the ODE ẏ = f(y) is specified as an argument of
the constructor. The single-step numerical integrator is invoked by the templated method


template <class NormFunc = decltype(_norm<StateType>)>
std::vector<std::pair<StateType, double>>
solve(const StateType &y0, double T, const NormFunc &norm = _norm<StateType>);

The following arguments have to be supplied:


1. y0: the initial value y0
2. T: the final time T , initial time t0 = 0 is assumed, because the class can deal with autonomous
ODEs only, recall § 11.1.3.3.
3. norm: a functor returning a suitable norm for a state vector. Defaults to E IGEN’s maximum vector
norm.
The method returns a vector of 2-tuples (yk , tk ) (note the order!), k = 0, . . . , N , of temporal mesh points
tk , t0 = 0, t N = T , see § 11.2.0.2, and approximate states yk ≈ y(tk ), where t 7→ y(t) stands for the
exact solution of the initial value problem.

The arguments of solve() are not sufficient to control the behavior of the adaptive integrator. In addition,
one can set the data members of its options data structure to configure an instance ode45obj of
Ode45:
    ode45obj.options().<option_you_want_to_set> = <value>;

In particular, the key data fields for adaptive timestepping are


• rtol: relative tolerance for error control (default is 10^−6),
• atol: absolute tolerance for error control (default is 10^−8).
For complete information about all control parameters and their default values see the Ode45 class defi-
nition ➺ GITLAB.
The following self-explanatory code snippet uses the numerical integrator class Ode45 for solving a scalar
autonomous ODE.

C++ code 11.5.3.4: Invocation of adaptive embedded Runge-Kutta-Fehlberg integrator ➺ GITLAB

int main(int /*argc*/, char** /*argv*/) {
  // Types to be used for a scalar ODE with state space R
  using StateType = double;
  using RhsType = std::function<StateType(StateType)>;
  // Logistic differential equation (11.1.2.2)
  const RhsType f = [](StateType y) { return 5 * y * (1 - y); };
  const StateType y0 = 0.2;  // Initial value
  // Exact solution of IVP, see (11.1.2.3)
  auto y = [y0](double t) { return y0 / (y0 + (1 - y0) * std::exp(-5 * t)); };
  // State space R, simple modulus supplies norm
  auto normFunc = [](StateType x) { return std::fabs(x); };

  // Invoke explicit Runge-Kutta method with stepsize control provided
  // by a dedicated class implemented in ode45.h
  Ode45<StateType, RhsType> integrator(f);
  // Do the timestepping with adaptive strategy of Section 11.5
  const std::vector<std::pair<StateType, double>> states =
      integrator.solve(y0, 1.0, normFunc);
  // Output information accumulated during numerical integration
  integrator.options().do_statistics = true;
  integrator.print();
  // Output norm of discretization error in nodes of adaptively generated
  // temporal mesh
  std::cout << "Errors at points of temporal mesh:" << std::endl;
  for (auto state : states) {
    std::cout << "t = " << state.second << ", y = " << state.first
              << ", |err| = " << fabs(state.first - y(state.second))
              << std::endl;
  }
  return 0;
}

y
Remark 11.5.3.5 (Tolerances and accuracy) As we have learned in § 11.5.3.3 for objects of the class
Ode45 tolerances for the refined local stepsize control of § 11.5.2.12 can be specified by setting the
member variables options.rtol and options.atol.
The possibility to pass tolerances to numerical integrators based on adaptive timestepping may tempt
the user into believing that they allow to control the accuracy of the solutions. However, as is clear from
§ 11.5.2.12, these tolerances are solely applied to local error estimates and, inherently, have nothing to
do with global discretization errors, see Exp. 11.5.2.9.

No global error control through local-in-time adaptive timestepping

The absolute/relative tolerances imposed for local-in-time adaptive timestepping do not allow us to
predict the accuracy of the solution!
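The following small experiment makes this concrete (a sketch, assuming the course's Ode45 utility class
from ode45.h is available, as in Code 11.5.3.4): it integrates the logistic IVP for several values of rtol
and compares the final-time error with the prescribed tolerance; the two numbers need not be close.

#include <cmath>
#include <iostream>
// #include "ode45.h"  // course utility class, cf. § 11.5.3.3

int main() {
  auto f = [](double y) { return 5.0 * y * (1.0 - y); };  // logistic ODE (11.1.2.2)
  const double y0 = 0.2;
  auto yex = [y0](double t) {  // exact solution, see (11.1.2.3)
    return y0 / (y0 + (1.0 - y0) * std::exp(-5.0 * t));
  };
  auto nrm = [](double x) { return std::fabs(x); };
  for (double rtol : {1e-2, 1e-4, 1e-6}) {
    Ode45<double> integrator(f);
    integrator.options().rtol = rtol;
    integrator.options().atol = 1e-10;
    const auto states = integrator.solve(y0, 1.0, nrm);
    const double yT = states.back().first;  // pairs are (state, time)
    const double T = states.back().second;
    std::cout << "rtol = " << rtol
              << " -> error at final time: " << std::fabs(yT - yex(T)) << std::endl;
  }
  return 0;
}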

EXAMPLE 11.5.3.7 (Adaptive timestepping for mechanical problem) We test the effect of adaptive
stepsize control for the equations of motion describing the planar movement of a point mass in a
conservative force field x ∈ R² ↦ F(x) ∈ R². Let t ↦ y(t) ∈ R² be the trajectory of the point mass (in
the plane).

From Newton’s law (acceleration = force):    ÿ = F(y) := −2y/‖y‖₂² .    (11.5.3.8)

As in Rem. 11.1.3.6 we can convert the second-order ODE (11.5.3.8) into an equivalent first-order ODE by
introducing the velocity v := ẏ as an extra solution component:

    (11.5.3.8)   ⇒   [ẏ; v̇] = [ v ; −2y/‖y‖₂² ] .    (11.5.3.9)

The following initial values are used in the experiment:

    y(0) := [−1, 0]^⊤ ,    v(0) := [0.1, −0.1]^⊤ .

Adaptive numerical integration with the adaptive integrator Ode45 (→ § 11.5.3.3), using the following
settings (a possible C++ setup is sketched right after this list):
➊ options.rtol = 0.001, options.atol = 1.0E-5,
➋ options.rtol = 0.01, options.atol = 1.0E-3,
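A possible C++ setup for this experiment (a sketch, assuming the course's Ode45 class from § 11.5.3.3;
the state is x = [y, v] ∈ R⁴, and the final time T = 4 matches the plotted time span):

#include <Eigen/Dense>
// #include "ode45.h"  // course utility class, cf. § 11.5.3.3

int main() {
  using StateType = Eigen::Vector4d;  // x = [y1, y2, v1, v2]
  // Right-hand side of the first-order system (11.5.3.9)
  auto f = [](const StateType &x) -> StateType {
    const Eigen::Vector2d y = x.head<2>();
    const Eigen::Vector2d v = x.tail<2>();
    StateType dx;
    dx.head<2>() = v;
    dx.tail<2>() = -2.0 * y / y.squaredNorm();
    return dx;
  };
  StateType x0;
  x0 << -1.0, 0.0, 0.1, -0.1;  // initial position y(0) and velocity v(0)
  Ode45<StateType> integrator(f);
  integrator.options().rtol = 1e-3;  // setting ➊; relax to 1e-2/1e-3 for setting ➋
  integrator.options().atol = 1e-5;
  const auto states = integrator.solve(x0, 4.0);  // integrate up to T = 4
  return 0;
}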

[Figures: solution components yi(t), vi(t) — exact vs. Ode45 approximation — together with the selected
 timestep sizes, for the two tolerance settings abstol = 1e-5, reltol = 0.001 and
 abstol = 1e-3, reltol = 0.01]
[Figures: trajectories in the (y1, y2)-plane, exact orbit vs. Ode45 approximation, for
 reltol = 0.001, abstol = 1e-5 (left) and reltol = 0.01, abstol = 1e-3 (right)]

Observations:
☞ Fast changes in solution components captured by adaptive approach through very small timesteps.
☞ Completely wrong solution, if the tolerances are relaxed only slightly.

In this example we face a rather sensitive dependence of the trajectories on initial and intermediate
states. Small perturbations at one instant in time can have a massive impact on the solution at later
times. Local stepsize control is powerless to prevent this. y

Review question(s) 11.5.3.10 (Adaptive timestep control)


(Q11.5.3.10.A) Explain how the blow-up of solutions of an initial-value problem can be captured by a
single-step numerical integrator with adaptive stepsize control.
(Q11.5.3.10.B) Code 11.5.2.19 contains the line
h *= std::max(0.5, std::min(2., 0.9 * std::pow(tol / est, 1. / (p + 1))));

Explain all parts and variables occurring in this expression.


Bibliography

[Ama83] H. Amann. Gewöhnliche Differentialgleichungen. 1st. Berlin: Walter de Gruyter, 1983 (cit. on pp. 759, 760, 766).
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg: Springer, 2008 (cit. on pp. 760, 764–767, 773, 785, 791–793, 795–797, 805).
[Dea80] M.A.B. Deakin. “Applied catastrophe theory in the social and biological sciences”. In: Bulletin of Mathematical Biology 42.5 (1980), pp. 647–679 (cit. on p. 761).
[DB02] P. Deuflhard and F. Bornemann. Scientific Computing with Ordinary Differential Equations. 2nd ed. Vol. 42. Texts in Applied Mathematics. New York: Springer, 2002 (cit. on pp. 756, 768, 788, 796, 803).
[Gra02] C.R. Gray. “An analysis of the Belousov-Zhabotinski reaction”. In: Rose-Hulman Undergraduate Math Journal 3.1 (2002) (cit. on p. 798).
[HLW06] E. Hairer, C. Lubich, and G. Wanner. Geometric numerical integration. 2nd ed. Vol. 31. Springer Series in Computational Mathematics. Heidelberg: Springer, 2006 (cit. on pp. 756, 760, 796).
[HNW93] E. Hairer, S.P. Norsett, and G. Wanner. Solving Ordinary Differential Equations I. Nonstiff Problems. 2nd ed. Berlin, Heidelberg, New York: Springer-Verlag, 1993 (cit. on pp. 756, 808).
[HW11] E. Hairer and G. Wanner. Solving Ordinary Differential Equations II. Stiff and Differential-Algebraic Problems. Vol. 14. Springer Series in Computational Mathematics. Berlin: Springer-Verlag, 2011 (cit. on p. 756).
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 759, 760, 763, 766, 773, 785, 797, 798).
[Het00] Herbert W. Hethcote. “The mathematics of infectious diseases”. In: SIAM Rev. 42.4 (2000), pp. 599–653. DOI: 10.1137/S0036144500371907 (cit. on p. 762).
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on p. 759).
[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Mathematics. New York: Springer, 2000 (cit. on pp. 767, 779, 797).
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on pp. 756, 759, 764–766, 775, 787).
Chapter 12

Single-Step Methods for Stiff Initial-Value Problems

Explicit Runge-Kutta methods with stepsize control (→ Section 11.5) seem to be able to provide approx-
imate solutions for any IVP with good accuracy provided that tolerances are set appropriately. Does this
mean that everything is settled about numerical integration?

EXAMPLE 12.0.0.1 (Explicit adaptive RK-SSM for stiff IVP) In this example we will witness the near
failure of a high-order adaptive explicit Runge-Kutta method for a simple scalar autonomous ODE.

IVP considered:    ẏ = λy²(1 − y) ,   λ := 500 ,   y(0) = 1/100 .    (12.0.0.2)

This is a logistic ODE as introduced in Ex. 11.1.2.1. We try to solve it by means of an explicit adaptive
embedded Runge-Kutta-Fehlberg method (→ Section 11.5.3) using the embedded Runge-Kutta single-
step method offered by Ode45 as explained in § 11.5.3.3 (Preprocessor switch MATLABCOEFF activated).

C++ code 12.0.0.3: Solving (12.0.0.2) with Ode45 numerical integrator ➺ GITLAB

// Types to be used for a scalar ODE with state space R
using StateType = double;
using RhsType = std::function<StateType(StateType)>;
// Logistic differential equation (11.1.2.2)
const double lambda = 500.0;
const RhsType f = [lambda](StateType y) { return lambda * y * y * (1 - y); };
const StateType y0 = 0.01;  // Initial value, will create a STIFF IVP
// State space R, simple modulus supplies norm
const auto normFunc = [](StateType x) { return fabs(x); };

// Invoke explicit Runge-Kutta method with stepsize control
Ode45<StateType, RhsType> integrator(f);
// Set rather loose tolerances
integrator.options().rtol = 0.1;
integrator.options().atol = 0.001;
integrator.options().min_dt = 1E-18;
const std::vector<std::pair<StateType, double>> states =
    integrator.solve(y0, 1.0, normFunc);
// Output information accumulated during numerical integration
integrator.options().do_statistics = true;
integrator.print();


Statistics of the integrator run:
    − number of steps          : 183
    − number of rejected steps : 185
    − function calls           : 1302

[Fig. 434: exact solution y(t) of (12.0.0.2) and Ode45 approximation over [0, 1];
 Fig. 435: stepsizes selected by Ode45 over the integration interval]

Stepsize control of Ode45 running amok!
The solution is virtually constant for t > 0.2 and, nevertheless, the integrator uses tiny timesteps
until the end of the integration interval. Why this crazy behavior?
y
Contents
12.1 Model Problem Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 815
12.2 Stiff Initial-Value Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 829
12.3 Implicit Runge-Kutta Single-Step Methods . . . . . . . . . . . . . . . . . . . . . . 835
12.3.1 The Implicit Euler Method for Stiff IVPs . . . . . . . . . . . . . . . . . . . . . 835
12.3.2 Collocation Single-Step Methods . . . . . . . . . . . . . . . . . . . . . . . . . 836
12.3.3 General Implicit Runge-Kutta Single-Step Methods (RK-SSMs) . . . . . . . 840
12.3.4 Model Problem Analysis for Implicit Runge-Kutta Single-Step Methods (IRK-SSMs) . 842
12.4 Semi-Implicit Runge-Kutta Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 850
12.5 Splitting Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 853

Supplementary literature. [DR08, Sect. 11.9]

12.1 Model Problem Analysis

Video tutorial for Section 12.1: Model Problem Analysis: (40 minutes) Download link,
tablet notes

Fortunately, full insight into the observations made in Ex. 12.0.0.1 can already be gleaned from a scalar
linear model problem that is extremely easy to analyze.
EXPERIMENT 12.1.0.1 (Adaptive explicit RK-SSM for scalar linear decay ODE) To rule out that what
we observed in Ex. 12.0.0.1 might have been a quirk of the IVP (12.0.0.2) we conduct the same investiga-
tions for the simple linear, scalar ( N = 1), autonomous IVP

ẏ = λy , λ := −80 , y(0) = 1 . (12.1.0.2)


We use the adaptive integrator Ode45 (→ § 11.5.3.3) to solve (12.1.0.2) with the same parameters as in
Code 11.5.3.4. Statistics of the integrator run:
    − number of steps          : 33
    − number of rejected steps : 32
    − function calls           : 231

[Fig. 436: exact solution y(t) of (12.1.0.2) and Ode45 approximation over [0, 1];
 Fig. 437: stepsizes selected by Ode45]

Observation: Though y(t) ≈ 0 for t > 0.1, the integrator keeps on using “unreasonably small” timesteps
even then. y
In this section we will discover a simple explanation for the startling behavior of the adaptive timestepping
Ode45 in Ex. 12.0.0.1.

EXAMPLE 12.1.0.3 (Blow-up of explicit Euler method) The simplest explicit RK-SSM is the explicit
Euler method, see Section 11.2.1. We know that it should converge like O( h) for meshwidth h → 0. In
this example we will see that this may be true only for sufficiently small h, which may be extremely small.

✦ We consider the IVP for the scalar linear decay ODE:

ẏ = f (y) := λy , λ ≪ 0 , y (0) = 1 .

✦ We apply the explicit Euler method (11.2.1.5) with uniform timestep h = 1/M, M ∈
{5, 10, 20, 40, 80, 160, 320, 640}.
[Fig. 438: error of the explicit Euler method at final time T = 1 (Euclidean norm) versus timestep h,
 for λ ∈ {−10, −30, −60, −90}, log-log scale with O(h) reference line — blow-up of yk for large
 timestep h when λ ≪ 0]
[Fig. 439: exact solution y(t) and explicit Euler polygon for a timestep h that is large compared
 to |λ|^{−1}]
Explanation: From Fig. 439 we draw the geometric conclusion that, if h is “large in comparison with |λ|^{−1}”,
then the approximations yk grossly miss the stationary point y = 0 due to overshooting.


This leads to a sequence (yk )k with exponentially increasing oscillations.

✦ Now we look at an IVP for the logistic ODE, see Ex. 11.1.2.1:

ẏ = f (y) := λy(1 − y) , y(0) = 0.01 .

✦ As before, we apply the explicit Euler method (11.2.1.5) with uniform timestep h = 1/M, M ∈
{5, 10, 20, 40, 80, 160, 320, 640}.
[Fig. 440: error of the explicit Euler method at final time T = 1 (Euclidean norm) versus timestep h,
 for λ ∈ {10, 30, 60, 90}, log-log scale — blow-up of yk for large timestep h and large λ]
[Fig. 441: λ = 90: exact solution y(t) and explicit Euler polygon for a large timestep]

For large timesteps h we also observe oscillatory blow-up of the sequence (yk )k .
Deeper analysis:

For y ≈ 1: f (y) ≈ λ(1 − y) ➣ If y(t0 ) ≈ 1, then the solution of the IVP will behave like the solution
of ẏ = λ(1 − y), which is a linear ODE. Similarly, z(t) := 1 − y(t) will behave like the solution of the
“decay equation” ż = −λz. Thus, around the stationary point y = 1 the explicit Euler method behaves
like it did for ẏ = λy in the vicinity of the stationary point y = 0; it grossly overshoots. y

§12.1.0.4 (Linear model problem analysis: explicit Euler method) The phenomenon observed in the
two previous examples is accessible to a remarkably simple rigorous analysis: Motivated by the consider-
ations in Ex. 12.1.0.3 we study the explicit Euler method (11.2.1.5) for the

linear model problem: ẏ = λy , y(0) = y0 , with λ ≪ 0 , (12.1.0.5)

which has an exponentially decaying exact solution

y(t) = y0 exp(λt) → 0 for t → ∞ .

Recall the recursion for the explicit Euler with uniform timestep h > 0 method for (12.1.0.5):

(11.2.1.5) for f (y) = λy: yk+1 = yk (1 + λh) . (12.1.0.6)

We easily get a closed-form expression for the approximations yk:

    yk = y0 (1 + λh)^k   ⇒   |yk| → 0 ,  if λh > −2 (qualitatively correct) ,
                             |yk| → ∞ ,  if λh < −2 (qualitatively wrong) .


Observed: stability-induced timestep constraint

Only if |λ| h < 2 we obtain a decaying solution by the explicit Euler method!
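This threshold is easy to observe numerically; the following self-contained sketch applies the explicit
Euler recursion (12.1.0.6) to ẏ = λy with λ = −80 (as in (12.1.0.2)) for a few uniform timesteps h and
prints the final value: for |λh| > 2 the iterates explode, for |λh| < 2 they decay.

#include <cmath>
#include <iostream>

int main() {
  const double lambda = -80.0;  // decay rate, cf. (12.1.0.2)
  const double T = 1.0;
  for (int M : {10, 20, 40, 80, 160}) {
    const double h = T / M;
    double y = 1.0;  // initial value y(0) = 1
    for (int k = 0; k < M; ++k) {
      y *= (1.0 + h * lambda);  // explicit Euler recursion (12.1.0.6)
    }
    std::cout << "h = " << h << ", |lambda*h| = " << std::abs(lambda) * h
              << ", y_M = " << y << ", exact y(T) = " << std::exp(lambda * T)
              << std::endl;
  }
  return 0;
}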

Could it be that the timestep control is desperately trying to enforce the qualitatively correct behavior of
the numerical solution in Ex. 12.1.0.3? Let us examine how the simple stepsize control of Code 11.5.2.6
fares for model problem (12.1.0.5):

EXPERIMENT 12.1.0.8 (Simple adaptive timestepping for fast decay) In this example we let a trans-
parent adaptive timestepping scheme struggle with “overshooting”:
✦ “Linear model problem IVP”: ẏ = λy, y(0) = 1, λ = −100
✦ Simple adaptive timestepping method as in Exp. 11.5.2.9, see Code 11.5.2.6. Timestep control
based on the pair of 1st-order explicit Euler method and 2nd-order explicit trapezoidal method.
[Fig. 442: decay equation, rtol = 0.01, atol = 0.0001, λ = −100: exact solution y(t), approximations yk,
 and rejected timesteps]
[Fig. 443: true error |y(tk) − yk| and estimated error ESTk over the integration interval]

Observation: in fact, stepsize control enforces small timesteps even if y(t) ≈ 0 and persistently triggers
rejections of timesteps. This is necessary to prevent overshooting in the Euler method, which contributes
to the estimate of the one-step error.

We see the purpose of stepsize control thwarted, because after only a very short time the solution is
almost zero and then, in fact, large timesteps should be chosen. y

Are these observations a particular “flaw” of the explicit Euler method? Let us study the behavior of another
simple explicit Runge-Kutta method applied to the linear model problem.
EXAMPLE 12.1.0.9 (Explicit trapezoidal method for decay equation → [DR08, Ex. 11.29])
The explicit trapezoidal method is a 2-stage explicit Runge-Kutta method, whose Butcher scheme is given
in Ex. 11.4.0.17 and which was derived in Ex. 11.4.0.6. We state its recursion for the ODE ẏ = f(t, y) in
terms of the first step y0 → y1:

    k1 = f(t0, y0) ,   k2 = f(t0 + h, y0 + h·k1) ,   y1 = y0 + (h/2)·(k1 + k2) .    (11.4.0.8)

Apply it to the model problem (12.1.0.5), that is, the scalar autonomous ODE with right-hand-side function
f(y) = λy, λ < 0:

    k1 = λy0 ,   k2 = λ(y0 + h·k1)   ⇒   y1 = (1 + λh + ½(λh)²)·y0 =: S(λh)·y0 .    (12.1.0.10)


The sequence of approximations generated by the explicit trapezoidal rule can thus be expressed in
closed form as

    yk = S(hλ)^k y0 ,   k = 0, . . . , N .    (12.1.0.11)

Clearly, blow-up can be avoided only if |S(hλ)| ≤ 1:

    |S(hλ)| < 1   ⇔   −2 < hλ < 0 .

Hence, a qualitatively correct decay behavior of (yk)k is obtained only under the timestep constraint

    h ≤ |2/λ| .    (12.1.0.12)

[Fig. 444: graph of the stability function of the explicit trapezoidal method for −3 ≤ z ≤ 0]
y

§12.1.0.13 (Model problem analysis for general explicit Runge-Kutta single step methods) We
generalize the approach taken in Ex. 12.1.0.9 and apply an explicit s-stage Runge-Kutta method (→
Def. 11.4.0.11), encoded by a Butcher scheme c | A, b^T with A ∈ R^{s,s} strictly lower-triangular, to the
autonomous scalar linear ODE (12.1.0.5) (ẏ = λy). We write down the equations for the increments and
y1 from Def. 11.4.0.11 for f(y) := λy and then convert the resulting system of equations into matrix form:

    ki = λ(y0 + h ∑_{j=1}^{i−1} aij kj) , i = 1, . . . , s ,        [ I − zA   0 ] [ k  ]        [ 1 ]
                                                               ⇒   [ −z b^⊤   1 ] [ y1 ]  =  y0 [ 1 ] ,    (12.1.0.14)
    y1 = y0 + h ∑_{i=1}^{s} bi ki ,

where k ∈ R^s denotes the scaled vector [k1, . . . , ks]^⊤/λ of increments, and z := λh. Next we apply
block Gaussian elimination (→ Rem. 2.3.1.11) to solve for y1 and obtain

    y1 = S(z) y0   with   S(z) := 1 + z b^T (I − zA)^{−1} 1 ,   1 = [1, . . . , 1]^⊤ .    (12.1.0.15)

Alternatively, we can express y1 through determinants appealing to Cramer’s rule,

    y1 = y0 · det[ I − zA  1 ; −z b^⊤  1 ] / det[ I − zA  0 ; −z b^⊤  1 ]   ⇒   S(z) = det(I − zA + z·1·b^T) ,    (12.1.0.16)

and note that A is a strictly lower-triangular matrix, which means that det(I − zA) = 1. Thus we have
proved the following theorem.


Theorem 12.1.0.17. Stability function of some explicit Runge-Kutta methods → [Han02,
Thm. 77.2], [QSS00, Sect. 11.8.4]

The discrete evolution Ψ^h_λ of an explicit s-stage Runge-Kutta single step method (→ Def. 11.4.0.11)
with Butcher scheme c | A, b^T (see (11.4.0.13)) for the ODE ẏ = λy amounts to a multiplication with
the number S(λh):

    Ψ^h_λ = S(λh)   ⇔   y1 = S(λh) y0 ,

where S is the stability function (SF)

    S(z) := 1 + z b^T (I − zA)^{−1} 1 = det(I − zA + z·1·b^T) ,   1 := [1, . . . , 1]^⊤ ∈ R^s .    (12.1.0.18)

EXAMPLE 12.1.0.19 (Stability functions of explicit Runge-Kutta single step methods) From
Thm. 12.1.0.17 and their Butcher schemes we can instantly compute the stability functions of explicit
RK-SSMs. We do this for a few methods whose Butcher schemes were listed in Ex. 11.4.0.17.

• Explicit Euler method (11.2.1.5):          0 | 0
                                             --+---       ➣  S(z) = 1 + z .
                                               | 1

• Explicit trapezoidal method (11.4.0.8):    0 | 0    0
                                             1 | 1    0
                                             --+--------  ➣  S(z) = 1 + z + ½ z² .
                                               | 1/2  1/2

• Classical RK4 method:                      0   | 0    0    0    0
                                             1/2 | 1/2  0    0    0
                                             1/2 | 0    1/2  0    0
                                             1   | 0    0    1    0
                                             ----+------------------  ➣  S(z) = 1 + z + ½ z² + ⅙ z³ + (1/24) z⁴ .
                                                 | 1/6  2/6  2/6  1/6
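The determinant formula (12.1.0.18) is also convenient for computing stability functions numerically. The
following small sketch (using EIGEN) evaluates S(z) = det(I − zA + z·1·bᵀ) for the classical RK4
Butcher data and compares the result with the truncated exponential series found above; the function
and variable names are only illustrative.

#include <Eigen/Dense>
#include <complex>
#include <iostream>

// Stability function S(z) = det(I - z*A + z*1*b^T) of an explicit RK-SSM, cf. (12.1.0.18)
std::complex<double> stabfn(const Eigen::MatrixXd &A, const Eigen::VectorXd &b,
                            std::complex<double> z) {
  const int s = static_cast<int>(A.rows());
  const Eigen::MatrixXcd M =
      Eigen::MatrixXcd::Identity(s, s) - z * A.cast<std::complex<double>>() +
      z * Eigen::VectorXcd::Ones(s) * b.cast<std::complex<double>>().transpose();
  return M.determinant();
}

int main() {
  // Butcher data of the classical RK4 method
  Eigen::MatrixXd A = Eigen::MatrixXd::Zero(4, 4);
  A(1, 0) = 0.5; A(2, 1) = 0.5; A(3, 2) = 1.0;
  Eigen::VectorXd b(4);
  b << 1. / 6, 1. / 3, 1. / 3, 1. / 6;
  const std::complex<double> z(-1.5, 0.5);
  const std::complex<double> S = stabfn(A, b, z);
  // Truncated exponential series 1 + z + z^2/2 + z^3/6 + z^4/24
  const std::complex<double> P =
      1. + z + z * z / 2. + z * z * z / 6. + z * z * z * z / 24.;
  std::cout << "S(z) = " << S << ", truncated exp series = " << P << std::endl;
  return 0;
}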

These examples confirm an immediate consequence of the determinant formula for the stability function
S ( z ).

Corollary 12.1.0.20. Polynomial stability function of explicit RK-SSM

For a consistent (→ Def. 11.3.1.12) s-stage explicit Runge-Kutta single step method according to
Def. 11.4.0.11 the stability function S defined by (12.1.0.56) is a non-constant polynomial of degree
≤ s, that is, S ∈ Ps .

Remark 12.1.0.21 (Stability function and exponential function) Let us compare the two evolution
operators:
• Φ ≙ evolution operator (→ Def. 11.1.4.3) for ẏ = λy,
• Ψ ≙ discrete evolution operator (→ § 11.3.1.1) for an s-stage Runge-Kutta single step method:

    Φ^h y = e^{λh} y   ←→   Ψ^h y = S(λh) y .

In light of the fact that Ψ is supposed to be an approximation of Φ, Ψ ≈ Φ, see (11.3.1.3), we expect that

    S(z) ≈ exp(z)   for small |z| .    (12.1.0.22)

A more precise statement is made by the following lemma:

Lemma 12.1.0.23. Stability function as approximation of exp for small arguments

Let S denote the stability function of an s-stage explicit Runge-Kutta single step method of order
q ∈ N. Then

    |S(z) − exp(z)| = O(|z|^{q+1})   for |z| → 0 .    (12.1.0.24)

This means that the lowest q + 1 coefficients of S(z) must be equal to the first coefficients of the
exponential series:

    S(z) = ∑_{j=0}^{q} (1/j!) z^j + z^{q+1} p(z)   with some p ∈ P_{s−q−1} .

In order to match the first q terms of the exponential series, we need at least S ∈ Pq , which entails a
minimum of q stages.

Corollary 12.1.0.25. Stages limit order of explicit RK-SSM

An explicit s-stage RK-SSM has maximal order q ≤ s.

§12.1.0.26 (Stability induced timestep constraint) In § 12.1.0.13 we established that for the sequence
(yk)_{k=0}^{∞} produced by an explicit Runge-Kutta single step method applied to the linear scalar model
ODE ẏ = λy, λ ∈ R, with uniform timestep h > 0 holds

    yk+1 = S(λh) yk   ⇒   yk = S(λh)^k y0 .

    (yk)_{k=0}^{∞} non-increasing            ⇔   |S(λh)| ≤ 1 ,
                                                                  (12.1.0.27)
    (yk)_{k=0}^{∞} exponentially increasing  ⇔   |S(λh)| > 1 ,

where S = S(z) is the stability function of the RK-SSM as defined in (12.1.0.56).

Invariably, polynomials tend to infinity in modulus for large (in modulus) arguments:

    ∀ S ∈ P_s , S ≠ const :   |S(z)| → ∞   for |z| → ∞ .    (12.1.0.28)

So, for any λ ≠ 0 there will be a threshold hmax > 0 so that |yk| → ∞ whenever h > hmax.

So, for any λ 6= 0 there will be a threshold hmax > 0 so that |yk | → ∞ as | h| > hmax .

Reversing the argument we arrive at a timestep constraint, as already observed for the explicit Euler
methods in § 12.1.0.4.


Only if one ensures that |λh| is sufficiently small can one avoid exponentially increasing approxi-
mations yk (qualitatively wrong for λ < 0) when applying an explicit RK-SSM to the model problem
(12.1.0.5) with uniform timestep h > 0.

For λ ≪ 0 this stability-induced timestep constraint may force h to be much smaller than required by
the demands on accuracy: in this case timestepping becomes inefficient. y

Remark 12.1.0.29 (Stepsize control detects instability) Ex. 12.0.0.1, Exp. 12.1.0.8 send the message
that local-in-time stepsize control as discussed in Section 11.5 selects timesteps that avoid blow-up, with
a hefty price tag however in terms of computational cost and poor accuracy. y

Objection: simple linear scalar IVP (12.1.0.5) may be an oddity rather than a model problem: the weakness
of explicit Runge-Kutta methods discussed above may be just a peculiar response to an unusual situation.
Let us extend our investigations to systems of linear ODEs of dimension N > 1 of the state space.

§12.1.0.30 (Systems of linear ordinary differential equations, § 11.1.1.8 revisited) A generic linear
ordinary differential equation with constant coefficients on the state space R^N has the form

    ẏ = My   with a matrix M ∈ R^{N,N} .    (12.1.0.31)

As explained in [NS02, Sect. 8.1], (12.1.0.31) can be solved by diagonalization: If we can find a regular
matrix V ∈ C^{N,N} such that

    MV = VD   with the diagonal matrix D = diag(λ1, . . . , λN) ∈ C^{N,N} ,    (12.1.0.32)

then the global solutions of (12.1.0.31) are given by

    y(t) = V · diag(exp(λ1 t), . . . , exp(λN t)) · V^{−1} y0 ,   y0 ∈ R^N .    (12.1.0.33)

The columns of V are a basis of eigenvectors of M, and the λj ∈ C, j = 1, . . . , N, are the associated
eigenvalues of M, see Def. 9.1.0.1.

The idea behind diagonalization is the transformation of (12.1.0.31) into N decoupled scalar linear
ODEs: with z(t) := V^{−1} y(t),

                                    ż1 = λ1 z1 ,
    ẏ = My   ⟶   ż = Dz   ↔          ⋮                 since M = VDV^{−1} .
                                    żN = λN zN ,

The formula (12.1.0.33) can be generalized to

    y(t) = exp(Mt) y0   with the matrix exponential   exp(B) := ∑_{k=0}^{∞} (1/k!) B^k ,  B ∈ C^{N,N} .    (12.1.0.34)
y
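The representation formula (12.1.0.33) can be evaluated directly with EIGEN's eigensolvers. The following
sketch does this for a small 2×2 example (the matrix is the one from the RLC-circuit example below,
with purely illustrative values for α and β):

#include <Eigen/Dense>
#include <Eigen/Eigenvalues>
#include <complex>
#include <iostream>

int main() {
  const double alpha = 10.0, beta = 1000.0;  // illustrative coefficients
  Eigen::Matrix2d M;
  M << 0.0, 1.0, -beta, -alpha;
  const Eigen::Vector2d y0(1.0, 0.0);
  // Diagonalization M = V D V^{-1}, cf. (12.1.0.32)
  const Eigen::EigenSolver<Eigen::Matrix2d> es(M);
  const Eigen::Vector2cd lambda = es.eigenvalues();
  const Eigen::Matrix2cd V = es.eigenvectors();
  // Evaluate the solution formula (12.1.0.33) at time t
  const double t = 0.1;
  const Eigen::Vector2cd z0 = V.inverse() * y0.cast<std::complex<double>>();  // V^{-1} y0
  Eigen::Vector2cd zt;
  for (int i = 0; i < 2; ++i) {
    zt(i) = std::exp(lambda(i) * t) * z0(i);  // decoupled scalar solutions
  }
  const Eigen::Vector2cd yt = V * zt;  // back to the original variables
  std::cout << "y(" << t << ") ~ " << yt.real().transpose() << std::endl;
  return 0;
}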
EXAMPLE 12.1.0.35 (Transient simulation of RLC-circuit) This example demonstrates the diagonal-
ization of a linear system of ODEs.


Consider the circuit from Ex. 11.1.2.11 (an RLC circuit driven by a time-dependent source voltage Us(t),
see Fig. 445). Transient nodal analysis (→ Ex. 2.1.0.3) leads to the second-order linear ODE

    ü + αu̇ + βu = g(t) ,

with coefficients α := (RC)^{−1}, β := (LC)^{−1}, and g(t) = α·U̇s(t).

[Fig. 445: circuit diagram with resistor R, inductor L, capacitor C, source voltage Us(t), and nodal
 voltage u(t)]

We transform it into a linear first-order ODE as in Rem. 11.1.3.6 by introducing v := u̇ as an additional
solution component:

    ẏ := [ u̇ ]   [  0    1 ] [ u ]   [  0   ]                 with β ≫ α ≫ 1
         [ v̇ ] = [ −β   −α ] [ v ] + [ g(t) ] =: f(t, y) ,     in usual settings.

We integrate IVPs for this ODE by means of the adaptive integrator Ode45 from § 11.5.3.3.
[Fig. 446: computed u(t) and v(t)/100 over t ∈ [0, 6] for the RLC circuit]

Parameters: R = 100 Ω, L = 1 H, C = 1 µF, Us(t) = 1 V · sin(t), u(0) = v(0) = 0 (“switch on”).
Ode45 statistics: 17897 successful steps, 1090 failed attempts, 113923 function evaluations.

Inefficient: way more timesteps than required for resolving the smooth solution, cf. the remark at the
end of § 12.1.0.26.

Maybe the time-dependent right-hand side due to the time-harmonic excitation severely affects Ode45?
Let us try a constant exciting voltage:
[Fig. 447: computed u(t) and v(t)/100 over t ∈ [0, 6] for constant source voltage]

Parameters: R = 100 Ω, L = 1 H, C = 1 µF, Us(t) = 1 V, u(0) = v(0) = 0 (“switch on”).
Ode45 statistics: 17901 successful steps, 1210 failed attempts, 114667 function evaluations.

Tiny timesteps despite a virtually constant solution!

We make the same observation as in Ex. 12.0.0.1 and Exp. 12.1.0.8: the local-in-time stepsize control of
Ode45 (→ Section 11.5) enforces extremely small timesteps although the solution is almost constant
except near t = 0.

To understand the structure of the solutions for this transient circuit example, let us apply the diagonaliza-
tion technique from § 12.1.0.30 to the linear ODE
 
    ẏ = [  0    1 ] y =: My ,   y(0) = y0 ∈ R² .    (12.1.0.36)
        [ −β   −α ]

Above we face the situation β ≫ ¼α² ≫ 1.

We can obtain the general solution of ẏ = My, M ∈ R^{2,2}, by diagonalization of M (if possible):

    MV = M(v1, v2) = (v1, v2) [ λ1  0  ]
                              [ 0   λ2 ] ,    (12.1.0.37)

where v1, v2 ∈ C² \ {0} are the eigenvectors of M and λ1, λ2 are the eigenvalues of M, see Def. 9.1.0.1.
The latter are the roots of the characteristic polynomial t ↦ χ(t) := t² + αt + β in C, and we find

    λ_{1/2} = ½ (−α ± D) ,   D := { √(α² − 4β) ,    if α² ≥ 4β ,
                                  { ı·√(4β − α²) ,  if α² < 4β .

Note that in the setting of the experiment the eigenvalues have a large (in modulus) negative real part
and a non-vanishing imaginary part.

Then we transform ẏ = My into decoupled scalar linear ODEs: with z(t) := V^{−1} y(t),

    ẏ = My   ⇔   V^{−1} ẏ = V^{−1}MV (V^{−1} y)   ⇔   ż = [ λ1  0  ] z .    (12.1.0.38)
                                                           [ 0   λ2 ]

This yields the general solution of the ODE ẏ = My, see also [Str09, Sect. 5.6]:

    y(t) = A·v1·exp(λ1 t) + B·v2·exp(λ2 t) ,   A, B ∈ R .    (12.1.0.39)

Note: t ↦ exp(λi t) is the general solution of the ODE żi = λi zi . y

§12.1.0.40 (“Diagonalization” of explicit Euler method) Recall the discrete evolution of the explicit
Euler method (11.2.1.5) for the linear ODE ẏ = My, M ∈ R N,N :

Ψh y = y + hMy ↔ yk+1 = yk + hMyk .

As in § 12.1.0.30 we assume that M can be diagonalized, that is (12.1.0.32) holds: V−1 MV = D with a
diagonal matrix D ∈ C N,N containing the eigenvalues of M on its diagonal. Next, apply the decoupling
by diagonalization idea to the recursion of the explicit Euler method.

With zk := V^{−1} yk :

    V^{−1} yk+1 = V^{−1} yk + h V^{−1}MV (V^{−1} yk)   ⇔   (zk+1)_i = (zk)_i + h·λi·(zk)_i ,    (12.1.0.41)
                                                            (≙ an explicit Euler step for żi = λi zi)

with i ∈ {1, . . . , N}. This gives us a crucial insight:



The explicit Euler method generates uniformly bounded solution sequences (yk)_{k=0}^{∞} for ẏ = My with
diagonalizable matrix M ∈ R N,N with eigenvalues λ1 , . . . , λ N , if and only if it generates uniformly
bounded sequences for all the scalar ODEs ż = λi z, i = 1, . . . , N .
y


So far we conducted the model problem analysis under the premise λ < 0.

However, in Ex. 12.1.0.35 we face λ_{1/2} = ½(−α ± ı·√(4β − α²)) (complex eigenvalues!). Let us now
examine how the explicit Euler method and even general explicit RK-methods respond to them.

Remark 12.1.0.42 (Explicit Euler method for damped oscillations) Consider linear model IVP
(12.1.0.5) for λ ∈ C:

Re λ < 0 ⇒ exponentially decaying solution y(t) = y0 exp(λt) ,

because | exp(λt)| = exp(Re λ · t).

The model problem analysis from Ex. 12.1.0.3 and Ex. 12.1.0.9 can be extended verbatim to the case
λ ∈ C. It yields the following insight for the explicit Euler method and λ ∈ C:
The sequence generated by the explicit Euler method (11.2.1.5) for the model problem (12.1.0.5) satisfies

    yk+1 = yk (1 + hλ)   ➣   lim_{k→∞} yk = 0   ⇔   |1 + hλ| < 1 .    (12.1.0.6)

➣ timestep constraint to obtain a decaying (discrete) solution!


1.5

0.5

✁ { z ∈ C : |1 + z | < 1}
Im z

0
The green region of the complex plane marks values
for λh, for which the explicit Euler method will pro-
−0.5
duce exponentially decaying solutions.

−1

−1.5
−2.5 −2 −1.5 −1 −0.5 0 0.5 1
Fig. 448 Re z
Now we can conjecture what happens in Ex. 12.1.0.35: the eigenvalues λ_{1/2} = −½α ± ı·√(β − ¼α²)
of M have a very large (in modulus) negative real part. Since the integrator Ode45 can be expected to
behave as if it integrated ż = λ2 z, it faces a severe timestep constraint if exponential blow-up is to be
avoided, see Ex. 12.1.0.3. Thus stepsize control must resort to tiny timesteps. y
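This reasoning can be quantified with a two-line computation: for the explicit Euler method and λ ∈ C
with Re λ < 0, the condition |1 + hλ| < 1 is equivalent to h < −2·Re λ / |λ|². The following sketch
evaluates this bound for an eigenvalue with large negative real part and large imaginary part (the
numbers are purely illustrative):

#include <complex>
#include <iostream>

int main() {
  // Illustrative eigenvalue with large negative real part and non-zero imaginary part
  const std::complex<double> lambda(-5000.0, 30000.0);
  // |1 + h*lambda| < 1  <=>  1 + 2*h*Re(lambda) + h^2*|lambda|^2 < 1
  //                     <=>  h < -2*Re(lambda)/|lambda|^2   (for Re(lambda) < 0)
  const double hmax = -2.0 * lambda.real() / std::norm(lambda);  // std::norm = squared modulus
  std::cout << "explicit Euler decays only for h < " << hmax << std::endl;
  return 0;
}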

§12.1.0.43 (Extended model problem analysis for explicit Runge-Kutta single step methods) Recall
the definition of a generic explicit RK-SSM for the ODE ẏ = f(t, y):

Definition 11.4.0.11. Explicit Runge-Kutta method

For bi, aij ∈ R, ci := ∑_{j=1}^{i−1} aij, i, j = 1, . . . , s, s ∈ N, an s-stage explicit Runge-Kutta single step
method (RK-SSM) for the ODE ẏ = f(t, y), f : Ω → R^N, is defined by (y0 ∈ D)

    ki := f(t0 + ci h, y0 + h ∑_{j=1}^{i−1} aij kj) ,  i = 1, . . . , s ,    y1 := y0 + h ∑_{i=1}^{s} bi ki .

The vectors ki ∈ R^N, i = 1, . . . , s, are called increments; h > 0 is the size of the timestep.


We apply such an explicit s-stage RK-SSM, described by the Butcher scheme c | A, b^T, to the autonomous
linear ODE ẏ = My, M ∈ C^{N,N}, and obtain (for the first step with timestep size h > 0)

    kℓ = M (y0 + h ∑_{j=1}^{ℓ−1} aℓj kj) ,  ℓ = 1, . . . , s ,    y1 = y0 + h ∑_{ℓ=1}^{s} bℓ kℓ .    (12.1.0.44)

Now assume that M can be diagonalized, that is, (12.1.0.32) holds: V^{−1}MV = D with a diagonal
matrix D ∈ C^{N,N} containing the eigenvalues λ1, . . . , λN ∈ C of M on its diagonal. Then apply the
substitutions

    k̂ℓ := V^{−1} kℓ ,  ℓ = 1, . . . , s ,      ŷk := V^{−1} yk ,  k = 0, 1 ,

to (12.1.0.44), which yields

    k̂ℓ = D (ŷ0 + h ∑_{j=1}^{ℓ−1} aℓj k̂j) ,  ℓ = 1, . . . , s ,    ŷ1 = ŷ0 + h ∑_{ℓ=1}^{s} bℓ k̂ℓ ,    (12.1.0.45)

which is equivalent to the N decoupled scalar recursions

    (k̂ℓ)_i = λi ((ŷ0)_i + h ∑_{j=1}^{ℓ−1} aℓj (k̂j)_i) ,    (ŷ1)_i = (ŷ0)_i + h ∑_{ℓ=1}^{s} bℓ (k̂ℓ)_i ,   i = 1, . . . , N .    (12.1.0.46)

We infer that, if (yk)k is the sequence produced by an explicit RK-SSM applied to ẏ = My, then

    yk = V [ yk^{[1]}, . . . , yk^{[N]} ]^⊤ ,

where (yk^{[i]})_k is the sequence generated by the same RK-SSM with the same sequence of timesteps
for the scalar IVP ẏ = λi y, y(0) = (V^{−1} y0)_i .

The RK-SSM generates uniformly bounded solution sequences (yk)_{k=0}^{∞} for the ODE ẏ = My with
diagonalizable matrix M ∈ R^{N,N} with eigenvalues λ1, . . . , λN, if and only if it generates uniformly
bounded sequences for all the scalar ODEs ż = λi z, i = 1, . . . , N .

Stability analysis: reduction to scalar case

Understanding the behavior of RK-SSM for autonomous scalar linear ODEs ẏ = λy with λ ∈ C is
enough to predict their behavior for general autonomous linear systems of ODEs.

From the considerations of § 12.1.0.26 we deduce the following fundamental result.

Theorem 12.1.0.48. (Absolute) stability of explicit RK-SSM for linear systems of ODEs

The sequence (yk )k of approximations generated by an explicit RK-SSM (→ Def. 11.4.0.11) with
stability function S (defined in (12.1.0.56)) applied to the linear autonomous ODE ẏ = My, M ∈
C N,N , with uniform timestep h > 0 decays exponentially for every initial state y0 ∈ C N , if and only
if |S(λi h)| < 1 for all eigenvalues λi of M.


Please note that

Re λi < 0 ∀i ∈ {1, . . . , N } =⇒ ky(t)k → 0 for t → ∞ ,

for any solution of ẏ = My. This is obvious from the representation formula (12.1.0.33). y

§12.1.0.49 (Region of (absolute) stability of explicit RK-SSM) We consider an explicit Runge-Kutta


single step method with stability function S for the model linear scalar IVP ẏ = λy, y(0) = y0 , λ ∈ C.
From Thm. 12.1.0.17 we learn that for uniform stepsize h > 0 we have yk = S(λh)k y0 and conclude
that

yk → 0 for k → ∞ ⇔ |S(λh)| < 1 . (12.1.0.50)

Hence, the modulus |S(λh)| tells us for which combinations of λ and stepsize h we achieve exponential
decay yk → 0 for k → ∞, which is the desirable behavior of the approximations for Re λ < 0.

Definition 12.1.0.51. Region of (absolute) stability

Let the discrete evolution Ψ for a single step method applied to the scalar linear ODE ẏ = λy,
λ ∈ C, be of the form

Ψh y = S(z)y , y ∈ C, h > 0 with z := hλ (12.1.0.52)

and a function S : C → C. Then the region of (absolute) stability of the single step method is
given by

SΨ := {z ∈ C: |S(z)| < 1} ⊂ C .

Of course, by Thm. 12.1.0.17, in the case of explicit RK-SSM the function S will coincide with their
stability function from (12.1.0.56).

We can easily combine the statement of Thm. 12.1.0.48 with the concept of a region of stability and
conclude that an explicit RK-SSM will generate exponentially decaying solutions for the linear ODE ẏ =
My, M ∈ C N,N , for every initial state y0 ∈ C N , if and only if λi h ∈ SΨ for all eigenvalues λi of M.

Adopting the arguments of § 12.1.0.26 we conclude from Cor. 12.1.0.20 that


✦ the regions of (absolute) stability of explicit RK-SSM are bounded,
✦ a timestep constraint depending on the eigenvalues of M is necessary to guarantee exponentially
decaying RK-solutions for ẏ = My.
y

EXAMPLE 12.1.0.53 (Regions of stability of some explicit RK-SSM) The figures referenced below depict
the bounded regions of stability SΨ ⊂ C for some RK-SSM from Ex. 11.4.0.17.

[Figures: regions of stability SΨ in the complex plane for the explicit Euler method (11.2.1.5), the
 explicit trapezoidal method, and the classical RK4 method]

In general we have for a consistent RK-SSM (→ Def. 11.3.1.12) that its stability function satisfies S(z) =
1 + z + O(z²) for z → 0. Therefore, SΨ ≠ ∅ and the imaginary axis is tangent to SΨ at z = 0. y

Review question(s) 12.1.0.54 (Model problem analysis)


(Q12.1.0.54.A) We consider the autonomous linear ODE

    ẏ(t) = My(t)   with   M := [ 0  −1 ]
                                [ 1   0 ]                 (12.1.0.55)

on the state space R².
(i) Sketch the right-hand-side vectorfield y 7→ f(y) for this ODE.
(ii) Compute the two-parameter family of its solutions.
t
  
− sin t are solutions of (12.1.0.55).
Hint. The functions y(t) := cos
sin t and y ( t ) : = cos t
(iii) Based on geometric reasoning, predict how the explicit Euler method will perform when applied to
(12.1.0.55).
(Q12.1.0.54.B) Explain why explicit Runge-Kutta single step methods (RK-SSMs) are well suited for the
autonomous scalar linear ODE ẏ = λy (“growth ODE”) provided that λ ≥ 0.
(Q12.1.0.54.C) What can you say about the stability function of a 3-stage explicit Runge-Kutta single step
method of order 3?
(Q12.1.0.54.D) Compute the stability function for the “generic” second-order two-stage Runge-Kutta
single step method, whose Butcher scheme is

    0 | 0            0
    α | α            0       ,    α ∈ R \ {0} .
    --+-------------------
      | 1 − 1/(2α)   1/(2α)

Theorem 12.1.0.17. Stability function of some explicit Runge-Kutta methods

The discrete evolution Ψ^h_λ of an explicit s-stage Runge-Kutta single step method (→
Def. 11.4.0.11) with Butcher scheme c | A, b^T (see (11.4.0.13)) for the ODE ẏ = λy amounts
to a multiplication with the number S(λh):

    Ψ^h_λ = S(λh)   ⇔   y1 = S(λh) y0 ,

where S is the stability function (SF)

    S(z) := 1 + z b^T (I − zA)^{−1} 1 = det(I − zA + z·1·b^T) ,   1 := [1, . . . , 1]^⊤ ∈ R^s .    (12.1.0.56)


(Q12.1.0.54.E) Compute the stability function for the three-stage Runge-Kutta single step method, defined
through the Butcher scheme

    0   | 0    0    0
    1/3 | 1/3  0    0
    2/3 | 0    2/3  0
    ----+--------------
        | 1/4  0    3/4

(Q12.1.0.54.F) What is the stability-induced timestep constraint for the classical 4-stage explicit Runge-
Kutta single step method of order 4, when applied to the ODE

    ẏ(t) = My(t)   with   M := [ 0  −1 ]
                                [ 1   0 ]  ?

[Figure: stability domain SΨ of the classical 4-stage Runge-Kutta single step method of order 4 in
 the complex plane]

Hint. You may use that

    ( (1/√2) [ 1   1 ] )^H  [ 0  −1 ]  ( (1/√2) [ 1   1 ] )   =   [ −ı   0 ]
    (        [ ı  −ı ] )    [ 1   0 ]  (        [ ı  −ı ] )       [  0   ı ] .

Supplementary literature. Related to this section are [Han02, Ch. 77] and [QSS00,

Sect. 11.3.3].

12.2 Stiff Initial-Value Problems


Video tutorial for Section 12.2: Stiff Initial-Value Problems: (24 minutes) Download link,
tablet notes

This section will reveal that the behavior observed in Ex. 12.0.0.1 and Ex. 12.1.0.3 is typical for a large
class of problems and that the model problem (12.1.0.5) really represents a “generic case”. This justifies
the attention paid to linear model problem analysis in Section 12.1.

EXAMPLE 12.2.0.1 (Kinetics of chemical reactions → [Han02, Ch. 62]) In Ex. 11.5.1.1 we already
saw an ODE model for the dynamics of a chemical reaction. Now we study an abstract reaction.
    reaction:   A + B  ⇌  C  (forward rate k1, backward rate k2: fast reaction) ,
                A + C  ⇌  D  (forward rate k3, backward rate k4: slow reaction) ,      (12.2.0.2)


assuming vastly different reaction constants: k1, k2 ≫ k3, k4 .

If cA(0) > cB(0) ➢ the second reaction determines the overall long-term reaction dynamics.

Mathematical model: non-linear ODE involving the concentrations y(t) = [cA(t), cB(t), cC(t), cD(t)]^⊤:

        d  [ cA ]             [ −k1 cA cB + k2 cC − k3 cA cC + k4 cD ]
    ẏ := -- [ cB ]  = f(y) :=  [ −k1 cA cB + k2 cC                    ]
        dt [ cC ]             [  k1 cA cB − k2 cC − k3 cA cC + k4 cD ]      (12.2.0.3)
           [ cD ]             [  k3 cA cC − k4 cD                    ] .

Concrete choice of parameters: t0 = 0, T = 1, k1 = 10⁴, k2 = 10³, k3 = 10, k4 = 1, initial value
y0 = [1, 1, 10, 0]^⊤.
[Fig. 449: concentrations cA(t), cC(t) and their adaptive approximations cA,k, cC,k (ode45) over
 t ∈ [0, 1]; Fig. 450: timestep sizes selected by the adaptive explicit integrator]

Observations: After a fast initial transient phase, the solution shows only slow dynamics. Nevertheless,
the explicit adaptive integrator used for this simulation insists on using a tiny timestep. It behaves very
much like Ode45 in Ex. 12.0.0.1. y

EXAMPLE 12.2.0.4 (Strongly attractive limit cycle) We consider the non-linear autonomous ODE
ẏ = f(y) with

    f(y) := [ 0  −1 ] y + λ (1 − ‖y‖₂²) y ,    (12.2.0.5)
            [ 1   0 ]

on the state space D = R² \ {0}.

For λ = 0, the initial value problem ẏ = f(y), y(0) = [cos ϕ, sin ϕ]^⊤, ϕ ∈ R, has the solution

    y(t) = [ cos(t − ϕ) ]
           [ sin(t − ϕ) ] ,   t ∈ R .    (12.2.0.6)

For this solution we have ‖y(t)‖₂ = 1 for all times.

(12.2.0.6) provides a solution even for λ ≠ 0, if ‖y(0)‖₂ = 1, because in this case the term
λ(1 − ‖y‖₂²) y will never become non-zero on the solution trajectory.

[Fig. 451: vector field f for λ = 1;  Fig. 452: solution trajectories for λ = 10]


 
We study the response of Ode45 introduced in § 11.5.3.3 to different choices of λ, with initial state
y0 = [1, 0]^⊤. According to the above considerations this initial state should completely “hide the impact
of λ from our view”.
[Fig. 453: solution components y1,k, y2,k and timestep sizes for λ = 1000 — many (3794) steps;
 Fig. 454: the same for λ = 0 — accurate solution with few steps]


Confusing observation: we have ‖y0‖₂ = 1, which implies ‖y(t)‖₂ = 1 for all t!
Thus, the term of the right-hand side that is multiplied by λ always vanishes on the exact solution
trajectory, which stays on the unit circle.
Nevertheless, Ode45 is forced to use tiny timesteps by the mere presence of this term! y

We want to find criteria that allow us to predict the massive problems haunting explicit single step methods
in the case of the non-linear IVPs of Ex. 12.0.0.1, Ex. 12.2.0.1, and Ex. 12.2.0.4. Recall that for linear IVPs
of the form ẏ = My, y(0) = y0, the model problem analysis of Section 12.1 tells us that, given knowledge
of the region of stability of the timestepping scheme, the eigenvalues of the matrix M ∈ C^{N,N} provide full
information about the timestep constraint we are going to face. Refer to Thm. 12.1.0.48 and § 12.1.0.49.

The ODEs we saw in Ex. 12.2.0.1 and Ex. 12.2.0.4 are non-linear . Yet, the entire stability analysis of
Section 12.1 was based on linear ODEs. Thus, we need to extend the stability analysis to non-linear
ODEs.
We start with a “phenomenological notion”, just a keyword to refer to the kind of difficulties presented by
the IVPs of Ex. 12.0.0.1, Ex. 12.2.0.1, Exp. 12.1.0.8, and Ex. 12.2.0.4.


Notion 12.2.0.7. Stiff IVP


An initial value problem is called stiff, if stability imposes much tighter timestep constraints on explicit
single step methods than the accuracy requirements.

§12.2.0.8 (Linearization of ODEs) Linear ODEs, though very special, are highly relevant as “local model”
for general ODEs: We consider a general autonomous ODE

ẏ = f(y) , f : D ⊂ R N → R N .

As usual, we assume f to be C2 -smooth and that it enjoys local Lipschitz continuity (→ Def. 11.1.3.13) on
D so that unique solvability of IVPs is guaranteed by Thm. 11.1.3.17.
We fix a state y∗ ∈ D, D the state space, write t 7→ y(t) for the solution with y(0) = y∗ . We set
z(t) = y(t) − y∗ , which satisfies

z(0) = 0 , ż = f(y∗ + z) = f(y∗ ) + D f(y∗ )z + R(y∗ , z) , with k R(y∗ , z)k = O(kzk2 ) .

This is obtained by Taylor expansion of f at y∗ , see [Str09, Satz 7.5.2]. Hence, in a neighborhood of a
state y∗ on a solution trajectory t 7→ y(t), the deviation z(t) = y(t) − y∗ satisfies

ż ≈ f(y∗ ) + D f(y∗ )z . (12.2.0.9)

The short-time evolution of y with y(0) = y∗ is approximately governed by the affine-linear ODE

ẏ = M(y − y∗ ) + b , M := D f(y∗ ) ∈ R N,N , b := f(y∗ ) ∈ R N . (12.2.0.10)


In the scalar case we have come across this linearization already in Ex. 12.1.0.3. y
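In practice the Jacobian D f(y∗) needed for this linearization is often not available analytically; a crude
finite-difference approximation, as in the following sketch (the function name and the choice of the
increment are illustrative), is usually sufficient to get an idea of the relevant spectrum:

#include <Eigen/Dense>

// One-sided difference-quotient approximation of the Jacobian D f(y*) of
// f: R^n -> R^n; a rough sketch, adequate for inspecting the spectrum that
// governs the linearization (12.2.0.10).
template <class Functor>
Eigen::MatrixXd numjac(Functor &&f, const Eigen::VectorXd &ystar,
                       double h = 1e-7) {
  const Eigen::VectorXd f0 = f(ystar);
  const int n = static_cast<int>(ystar.size());
  Eigen::MatrixXd J(f0.size(), n);
  for (int j = 0; j < n; ++j) {
    Eigen::VectorXd yp = ystar;
    yp(j) += h;                   // perturb the j-th component
    J.col(j) = (f(yp) - f0) / h;  // j-th column of the Jacobian
  }
  return J;
}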

§12.2.0.11 (Linearization of explicit Runge-Kutta single step methods) We consider one step of
a general s-stage RK-SSM according to Def. 11.4.0.11 for the autonomous ODE ẏ = f(y), with smooth
right hand side function f : D ⊂ R N → R N :
i −1 s
ki = f(y0 + h ∑ aij k j ) , i = 1, . . . , s , y1 = y0 + h ∑ bi ki .
j =1 i =1

We perform linearization at y∗ := y0 and ignore all terms at least quadratic in the timestep size h (this
is indicated by the ≈ symbol):
    ki ≈ f(y∗) + D f(y∗) h ∑_{j=1}^{i−1} aij kj ,  i = 1, . . . , s ,    y1 = y0 + h ∑_{i=1}^{s} bi ki .

The defining equations for the same RK-SSM applied to

    ż = Mz + b ,   M := D f(y∗) ∈ R^{N,N} ,   b := f(y∗) ,

which agrees with (12.2.0.10) after the substitution z(t) = y(t) − y∗, are

    ki ≈ b + M h ∑_{j=1}^{i−1} aij kj ,  i = 1, . . . , s ,    y1 = y0 + h ∑_{i=1}^{s} bi ki .

We find that for small timesteps

the discrete evolution of the RK-SSM for ẏ = f(y) in the state y∗ is close to the discrete
evolution of the same RK-SSM applied to the linearization (12.2.0.10) of the ODE in y∗ .


By straightforward manipulations of the defining equations of an explicit RK-SSM we find that, if


• (yk )k is the sequence of states generated by the RK-SSM applied to the affine-linear ODE ẏ =
M(y − y0 ) + b, M ∈ C N,N regular,
• (wk )k is the sequence of states generated by the same RK-SSM applied to the linear ODE ẇ =
Mw and w0 := M−1 b, then

w_k = y_k − y₀ + M^{−1} b .

➣ The analysis of the behavior of an RK-SSM for an autonomous affine-linear ODE can be reduced to
understanding its behavior for an autonomous linear ODE with the same matrix.

Combined with the insights from § 12.1.0.43 this means that

the behavior of an explicit Runge-Kutta single-step method applied to ẏ = f(y) close to the
state y∗ is determined by the eigenvalues of the Jacobian D f(y∗ ).

In particular, if D f(y∗ ) has at least one eigenvalue whose modulus is large, then an exponential drift-off
of the approximate states yk away from y∗ can only be avoided for sufficiently small timestep, again a
timestep constraint.

How to distinguish stiff initial value problems

An initial value problem for an autonomous ODE ẏ = f(y) will probably be stiff, if, for substantial
periods of time,

min{Re λ : λ ∈ σ(D f(y(t)))} ≪ 0 ,   (12.2.0.13)
and   max{Re λ : λ ∈ σ(D f(y(t)))} ≲ 0 ,   (12.2.0.14)

where t 7→ y(t) is the solution trajectory and σ (M) is the spectrum of the matrix M, see
Def. 9.1.0.1.

The condition (12.2.0.14) has to be read as “the real parts of all eigenvalues are below a bound with small
modulus”. If this is not the case, then the exact solution will experience blow-up. It will change drastically
over very short periods of time and small timesteps will be required anyway in order to resolve this. y

EXAMPLE 12.2.0.15 (Predicting stiffness of non-linear IVPs)


➊ We consider the IVP from Ex. 12.0.0.1:

IVP considered:   ẏ = f(y) := λ y²(1 − y) ,   λ := 500 ,   y(0) = 1/100 .

We find

f′(y) = λ(2y − 3y²)   ⇒   f′(1) = −λ .

Hence, in case λ ≫ 1 as in Fig. 435, we face a stiff problem close to the stationary state y = 1.
The observations made in Fig. 435 exactly match this prediction.
➋ The solution of the IVP from Ex. 12.2.0.4,

ẏ = f(y) := [0, −1; 1, 0] y + λ (1 − ∥y∥₂²) y ,   ∥y₀∥₂ = 1 ,   (12.2.0.5)


satisfies ∥y(t)∥₂ = 1 for all times. Using the product rule (8.5.1.17) of multi-dimensional differential
calculus, we find

D f(y) = [0, −1; 1, 0] + λ (−2 y yᵀ + (1 − ∥y∥₂²) I) ,
σ(D f(y)) = { −λ − √(λ² − 1) , −λ + √(λ² − 1) } ,   if ∥y∥₂ = 1 .

Thus, for λ ≫ 1, D f(y(t)) will always have an eigenvalue with large negative real part, whereas
the other eigenvalue is close to zero: the IVP is stiff.
y
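The spectral criterion can also be checked numerically along a trajectory. The following self-contained C++/EIGEN sketch (not taken from the lecture codes; all names are chosen for illustration) evaluates σ(D f(y)) for the limit-cycle ODE of Ex. 12.2.0.4 at a state on the unit circle and confirms the eigenvalue pattern derived above.

#include <Eigen/Dense>
#include <iostream>

// Jacobian of f(y) = [0 -1; 1 0]*y + lambda*(1 - |y|^2)*y, cf. Ex. 12.2.0.4
Eigen::Matrix2d Jacobian(const Eigen::Vector2d &y, double lambda) {
  Eigen::Matrix2d J;
  J << 0.0, -1.0, 1.0, 0.0;                                   // rotation part
  J += lambda * ((1.0 - y.squaredNorm()) * Eigen::Matrix2d::Identity()
                 - 2.0 * y * y.transpose());                  // derivative of nonlinear term
  return J;
}

int main() {
  const double lambda = 1000.0;
  const Eigen::Vector2d y(1.0, 0.0);                          // state on the unit circle
  const Eigen::EigenSolver<Eigen::Matrix2d> es(Jacobian(y, lambda));
  std::cout << "sigma(Df(y)) = " << es.eigenvalues().transpose() << std::endl;
  // Expected: one eigenvalue ~ -2*lambda, the other close to 0  ->  stiff IVP
  return 0;
}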

Remark 12.2.0.16 (Characteristics of stiff IVPs) Often one can already tell from the expected behavior
of the solution of an IVP, which is often clear from the modeling context, that one has to brace for stiffness.

Typical features of stiff IVPs:


✦ Presence of fast transients in the solution, see Ex. 12.1.0.3, Ex. 12.1.0.35,
✦ Occurrence of strongly attractive fixed points/limit cycles, see Ex. 12.2.0.4

Review question(s) 12.2.0.17 (Stiff Initial-Value Problems)


(Q12.2.0.17.A) The following rule of thumb was given:

How to distinguish stiff initial value problems

An initial value problem for an autonomous ODE ẏ = f(y) will probably be stiff, if, for substantial
periods of time,

min{Re λ : λ ∈ σ(D f(y(t)))} ≪ 0 ,   (12.2.0.13)
and   max{Re λ : λ ∈ σ(D f(y(t)))} ≲ 0 ,   (12.2.0.14)

where t 7→ y(t) is the solution trajectory and σ (M) is the spectrum of the matrix M, see
Def. 9.1.0.1.

Explain why the condition (12.2.0.14) has been included.


(Q12.2.0.17.B) [Damped pendulum] The non-dimensional equation for a damped pendulum is

ẅ = − sin w − λẇ .

When is this second-order ODE stiff in a neighborhood of the rest state w = 0, ẇ = 0?


(Q12.2.0.17.C) [Transient circuit ODE] In Ex. 12.1.0.35 we found the non-dimensional non-autonomous
ODE describing the transient behavior of an RLC-circuit:
      
ẏ := [u̇ ; v̇] = [0, 1; −β, −α] [u ; v] − [0 ; g(t)] =: f(t, y) ,   β ≫ ¼ α² ≫ 1 ,   (12.2.0.19)

where t 7→ g(t) is a given smooth excitation. Find out, whether initial-value problems for (12.2.0.19)
whose solution satisfies
 
y(t) ≈ [ −β^{−1} g(t) ; 0 ]


are stiff.
Hint. The eigenvalues of the matrix [0, 1; −β, −α] are λ₁ = ½(−α + ı√(4β − α²)) and λ₂ = ½(−α − ı√(4β − α²)).

Supplementary literature. [QSS00, Sect. 11.10]

12.3 Implicit Runge-Kutta Single-Step Methods

Video tutorial for Section 12.3: Implicit Runge-Kutta Single-Step Methods: (50 minutes)
Download link, tablet notes

Explicit Runge-Kutta single step methods cannot escape tight timestep constraints for stiff IVPs that may
render them inefficient, see § 12.1.0.49. In this section we are going to augment the class of Runge-Kutta
methods by timestepping schemes that can cope well with stiff IVPs.

12.3.1 The Implicit Euler Method for Stiff IVPs


EXPERIMENT 12.3.1.1 (Euler methods for stiff decay IVP) We revisit the setting of Ex. 12.1.0.3 and
again consider Euler methods for the decay IVP

ẏ = λy , y(0) = 1 , λ < 0 .

We apply both the explicit Euler method (11.2.1.5) and the implicit Euler method (11.2.2.2) with uniform
timesteps h = 1/N , N ∈ {5, 10, 20, 40, 80, 160, 320, 640} and monitor the error at final time T = 1 for
different values of λ.
[Fig. 455: explicit Euler method (11.2.1.5) for the scalar model problem — error at final time T = 1 (Euclidean norm) vs. timestep h (log-log) for λ = −10, −30, −60, −90, with an O(h) reference line.]
[Fig. 456: implicit Euler method (11.2.2.2) for the scalar model problem — error at final time T = 1 (Euclidean norm) vs. timestep h (log-log) for the same values of λ.]

λ large (in modulus): blow-up of y_k for large timestep h (explicit Euler)   vs.   stable for all timesteps h > 0 (implicit Euler)!

We observe onset of convergence of the implicit Euler method already for large timesteps h. y

§12.3.1.2 (Linear model problem analysis: implicit Euler method) We follow the considerations of
§ 12.1.0.4 and consider the implicit Euler method (11.2.2.2) for the

linear model problem: ẏ = λy , y(0) = y0 , with Re λ ≪ 0 , (12.1.0.5)


with exponentially decaying (maybe oscillatory for Im λ ≠ 0) exact solution

y(t) = y0 exp(λt) → 0 for t → ∞ .

The recursion of the implicit Euler method for (12.1.0.5) is defined by

(11.2.2.2) for f (y) = λy ⇒ yk+1 = yk + λhyk+1 , k ∈ N0 . (12.3.1.3)


generated sequence   y_k := (1/(1 − λh))^k · y₀ .   (12.3.1.4)

Re λ < 0   ⇒   lim_{k→∞} y_k = 0   for all h > 0 !   (12.3.1.5)

Without any timestep constraint we obtain the qualitatively correct behavior of (yk )k for Re λ < 0 and any
h > 0!

As in § 12.1.0.40 this analysis can be extended to linear systems of ODEs ẏ = My, M ∈ C N,N , by
means of diagonalization.
As in § 12.1.0.30 and § 12.1.0.40 we assume that M can be diagonalized, that is (12.1.0.32) holds:
V−1 MV = D with a regular matrix V ∈ C N,N and a diagonal matrix D ∈ C N,N containing the eigenval-
ues λ1 , . . . , λ N of M on its diagonal. Next, apply the decoupling by diagonalization idea to the recursion
of the implicit Euler method.

With z_k := V^{−1} y_k:   V^{−1} y_{k+1} = V^{−1} y_k + h (V^{−1}MV) (V^{−1} y_{k+1})   ⇔   (z_{k+1})_i = (1/(1 − λ_i h)) (z_k)_i ,   (12.3.1.6)

i.e., since V^{−1}MV = D, each component performs an implicit Euler step for ż_i = λ_i z_i.

Crucial insight:

For any timestep, the implicit Euler method generates exponentially decaying solution sequences
(y_k)_{k=0}^∞ for ẏ = My with diagonalizable matrix M ∈ ℝ^{N,N} with eigenvalues λ₁, …, λ_N, if
Re λ_i < 0 for all i = 1, …, N.

Thus we expect that the implicit Euler method will not face stability induced timestep constraints for stiff
problems (→ Notion 12.2.0.7). y
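The contrast between the two Euler methods can be made tangible with a few lines of code. A minimal C++ sketch (illustrative only, not a lecture code) applies both methods to the scalar model problem (12.1.0.5) with a timestep far outside the stability region of the explicit method:

#include <cmath>
#include <iostream>

int main() {
  const double lambda = -100.0, h = 0.1, T = 1.0;   // note: |lambda*h| = 10 >> 2
  double ye = 1.0, yi = 1.0;                        // explicit / implicit Euler states
  for (double t = 0.0; t < T; t += h) {
    ye = ye + h * lambda * ye;      // explicit Euler: y_{k+1} = (1 + lambda*h) y_k
    yi = yi / (1.0 - lambda * h);   // implicit Euler: y_{k+1} = y_k / (1 - lambda*h)
  }
  std::cout << "exact: " << std::exp(lambda * T)
            << ", explicit Euler: " << ye
            << ", implicit Euler: " << yi << std::endl;
  // Explicit Euler blows up (|1 + lambda*h| = 9 > 1), implicit Euler decays for any h > 0.
  return 0;
}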

12.3.2 Collocation Single-Step Methods


Unfortunately the implicit Euler method is of first order only, see Exp. 11.3.2.5. This section presents an
algorithm for designing higher order single step methods generalizing the implicit Euler method.

Setting: We consider the general ordinary differential equation ẏ = f(t, y), f : I × D → R N locally
Lipschitz continuous, which guarantees the local existence of unique solutions of initial value problems,
see Thm. 11.1.3.17.

We define the single step method through specifying the first step y₀ = y(t₀) → y₁ ≈ y(t₁), where
y₀ ∈ D is the initial state at initial time t₀ ∈ I, cf. Rem. 11.3.1.15. We assume that the exact solution
trajectory t ↦ y(t) exists on [t₀, t₁]. Its use as a timestepping scheme on a temporal mesh (→ § 11.2.0.2)
in the sense of Def. 11.3.1.5 is then straightforward.

§12.3.2.1 (Collocation approach)


Abstract collocation idea


Collocation is a paradigm for the discretization (→ Rem. 11.3.1.4) of differential equations:
(I) Write the discrete solution uh , a function, as linear combination of N ∈ N sufficiently smooth
(basis) functions ➣ N unknown coefficients.
(II) Demand that uh satisfies the differential equation at N points/times ➣ N equations.

We apply this policy to the differential equation ẏ = f(t, y) on [t0 , t1 ]:


Idea:   ➊ Approximate t ↦ y(t), t ∈ [t₀, t₁], by a function t ↦ y_h(t) ∈ V,
where V is an N·(s + 1)-dimensional trial space comprising functions [t₀, t₁] → ℝ^N, cf. Item (I).

➋ Fix yh ∈ V by imposing collocation conditions

y h ( t0 ) = y0 ,
(12.3.2.3)
ẏh (τj ) = f(τj , yh (τj )) , j = 1, . . . , s ,

for collocation points t0 ≤ τ1 < . . . < τs ≤ t1 → Item (II).

➌ Choose y1 := yh (t1 ).
y

§12.3.2.4 (Polynomial collocation) Existence of the function yh : [t0 , t1 ] → R N satisfying (12.3.2.3) and
the possibility to compute it efficiently will crucially depend on the choice of the trial space V .
Our choice (the “standard option”): (componentwise) polynomial trial space V = (P_s)^N.

Recalling dim P_s = s + 1 from Thm. 5.2.1.2, we see that with this choice the number N·(s + 1) of
collocation conditions matches the dimension of the trial space V.

Now we want to derive a concrete representation for the polynomial yh . We draw on concepts introduced
in Section 5.2.2. We define the collocation points as

τj := t0 + c j h , j = 1, . . . , s , for 0 ≤ c1 < c2 < . . . < cs ≤ 1 , h := t1 − t0 .


Let {L_j}_{j=1}^s ⊂ P_{s−1} denote the set of Lagrange polynomials of degree s − 1 associated with the node
set {c_j}_{j=1}^s, see (5.2.2.4). They satisfy L_j(c_i) = δ_ij, i, j = 1, …, s, and form a basis of P_{s−1}.

In each of its N components, the derivative ẏ_h is a polynomial of degree s − 1: ẏ_h ∈ (P_{s−1})^N. Hence, it
has the following representation, compare (5.2.2.6):

ẏ_h(t₀ + ξh) = ∑_{j=1}^{s} ẏ_h(t₀ + c_j h) L_j(ξ) ,   0 ≤ ξ ≤ 1 .   (12.3.2.5)

As τ_j = t₀ + c_j h, the collocation conditions (12.3.2.3) make it possible to replace ẏ_h(t₀ + c_j h) with an expres-


sion in the right hand side function f:

ẏ_h(t₀ + ξh) = ∑_{j=1}^{s} k_j L_j(ξ)   with “coefficients”   k_j := f(t₀ + c_j h, y_h(t₀ + c_j h)) .

Next we integrate and use yh (t0 ) = y0


y_h(t₀ + ξh) = y₀ + h ∑_{j=1}^{s} k_j ∫₀^ξ L_j(ζ) dζ .

This yields the following formulas for the computation of y1 , which characterize the s-stage collocation
single step method induced by the (normalized) collocation points c j ∈ [0, 1], j = 1, . . . , s.

k_i = f(t₀ + c_i h, y₀ + h ∑_{j=1}^{s} a_{ij} k_j) , i = 1, …, s ,   where   a_{ij} := ∫₀^{c_i} L_j(τ) dτ ,
y₁ := y_h(t₁) = y₀ + h ∑_{i=1}^{s} b_i k_i ,   where   b_i := ∫₀^{1} L_i(τ) dτ .   (12.3.2.6)

Note that, since arbitrary y0 ∈ D, t0 , t1 ∈ I were admitted, this defines a discrete evolution Ψ : I × I ×
D → R N by Ψt0 ,t1 y0 := yh (t1 ). y
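The weights a_ij and b_i of (12.3.2.6) are determined by the normalized collocation points alone, because the Lagrange polynomials can be integrated exactly. A possible C++/EIGEN sketch (illustrative; the function name collocationCoeffs is not part of the lecture codes) expands each L_j in the monomial basis via a Vandermonde system, which is unproblematic for the small values of s used in practice, and integrates term by term:

#include <Eigen/Dense>
#include <cmath>
#include <iostream>

// Compute the Butcher coefficients A (s x s) and b (s) of the collocation single-step
// method (12.3.2.6) from normalized collocation points 0 <= c_1 < ... < c_s <= 1.
void collocationCoeffs(const Eigen::VectorXd &c, Eigen::MatrixXd &A, Eigen::VectorXd &b) {
  const int s = c.size();
  // Vandermonde matrix V(i,l) = c_i^l; the monomial coefficients of L_j are the
  // columns of C = V^{-1}, because L_j(c_i) = delta_ij means V*C = I.
  Eigen::MatrixXd V(s, s);
  for (int i = 0; i < s; ++i)
    for (int l = 0; l < s; ++l) V(i, l) = std::pow(c(i), l);
  const Eigen::MatrixXd C = V.inverse();
  A.resize(s, s);
  b.resize(s);
  for (int j = 0; j < s; ++j) {
    b(j) = 0.0;
    for (int l = 0; l < s; ++l) b(j) += C(l, j) / (l + 1);          // b_j = int_0^1 L_j
    for (int i = 0; i < s; ++i) {
      A(i, j) = 0.0;
      for (int l = 0; l < s; ++l)
        A(i, j) += C(l, j) * std::pow(c(i), l + 1) / (l + 1);       // a_ij = int_0^{c_i} L_j
    }
  }
}

int main() {
  // The two Gauss points in [0,1] should reproduce the 2-stage Gauss collocation scheme
  Eigen::VectorXd c(2);
  c << 0.5 - std::sqrt(3.0) / 6.0, 0.5 + std::sqrt(3.0) / 6.0;
  Eigen::MatrixXd A;
  Eigen::VectorXd b;
  collocationCoeffs(c, A, b);
  std::cout << "A =\n" << A << "\nb = " << b.transpose() << std::endl;
  return 0;
}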

Remark 12.3.2.7 (Implicit nature of collocation single step methods) Note that (12.3.2.6) represents
a generically non-linear system of s · N equations for the s · N components of the vectors ki , i = 1, . . . , s.
Usually, it will not be possible to obtain the increments ki ∈ R N by a fixed number of evaluations of f. For
this reason the single step methods defined by (12.3.2.6) are called implicit.

With similar arguments as in Rem. 11.2.2.3 one can prove that for sufficiently small |t1 − t0 | a unique set
of solution vectors k1 , . . . , ks can be found. y

§12.3.2.8 (Collocation single step methods and quadrature) Clearly, in the case N = 1, f (t, y) =
f (t), y0 = 0 the computation of y1 boils down to the evaluation of a quadrature formula on [t0 , t1 ], because
from (12.3.2.6) we get
y₁ = h ∑_{i=1}^{s} b_i f(t₀ + c_i h) ,   b_i := ∫₀^{1} L_i(τ) dτ ,   (12.3.2.9)

which is a polynomial quadrature formula (7.3.0.2) on [0, 1] with nodes c j transformed to [t0 , t1 ] according
to (7.2.0.5). y

EXPERIMENT 12.3.2.10 (Empiric Convergence of collocation single step methods) We consider the
initial value problem for the scalar logistic ODE

ẏ = λy(1 − y) , y(0) = 0.01 , λ = 100 ,

which is mildly stiff, over the time interval [0, 1]

We perform numerical integration by timestepping with uniform timestep h based on a collocation single


step method (12.3.2.6).

➊ Equidistant collocation points, c_j = j/(s + 1), j = 1, …, s.

We observe algebraic convergence with the empiric rates
s = 1: p = 1.96,   s = 2: p = 2.03,   s = 3: p = 4.00,   s = 4: p = 4.04.

[Fig. 457: error max_k |y_h(t_k) − y(t_k)| vs. timestep h (log-log) for s = 1, 2, 3, 4.]

In this case we conclude the following (empiric) order (→ Def. 11.3.2.8) of the collocation single step
method:
(empiric) order = s for even s ,   s + 1 for odd s .

Next, we recall from § 7.4.2.15 an exceptional set of quadrature points, the Gauss points, provided by the
zeros of the L2 ([−1, 1])-orthogonal Legendre polynomials, see Fig. 269.
➋ Gauss points in [0, 1] as normalized collocation points c_j, j = 1, …, s.

We observe algebraic convergence with the empiric rates
s = 1: p = 1.96,   s = 2: p = 4.01,   s = 3: p = 6.00,   s = 4: p = 8.02.

[Fig. 458: error max_k |y_h(t_k) − y(t_k)| vs. timestep h (log-log) for s = 1, 2, 3, 4.]

Obviously, the (empiric) order (→ Def. 11.3.2.8) of the Gauss collocation single step method is

(empiric) order = 2s .

Note that the 1-stage Gauss collocation single step method is the implicit midpoint method from Sec-
tion 11.2.3. y

§12.3.2.11 (Order of collocation single step method) What we have observed in Exp. 12.3.2.10 reflects


a fundamental result on collocation single step methods as defined in (12.3.2.6).

Theorem 12.3.2.12. Order of collocation single step method [DB02, Satz 6.40]

Provided that f ∈ C p ( I × D ), the order (→ Def. 11.3.2.8) of an s-stage collocation single step
method according to (12.3.2.6) agrees with the order (→ Def. 7.4.1.1) of the quadrature formula on
[0, 1] with nodes c j and weights b j , j = 1, . . . , s.

This also explains the surprisingly high order of the Gauss collocation single-step methods: for s-point
Gauss-Legendre quadrature, the family of quadrature rules based on the Gauss points as nodes, the
order 2s was derived in Section 7.4.2.

➣ By Thm. 7.4.2.11 the s-stage Gauss collocation single step method whose nodes c j are chosen as the
s Gauss points on [0, 1] is of order 2s.
y

12.3.3 General Implicit Runge-Kutta Single-Step Methods (RK-SSMs)


The notations in (12.3.2.6) have deliberately been chosen to allude to Def. 11.4.0.11. To capture (12.3.2.6)
it suffices to let the sum in the formula for the increments in that definition run up to s.

Definition 12.3.3.1. General Runge-Kutta single step method (cf. Def. 11.4.0.11)

For bi , aij ∈ R, ci := ∑sj=1 aij , i, j = 1, . . . , s, s ∈ N, a single step of size h > 0 of an s-stage


Runge-Kutta single step method (RK-SSM) for the IVP (11.1.3.2) is defined by
k_i := f(t₀ + c_i h, y₀ + h ∑_{j=1}^{s} a_{ij} k_j) , i = 1, …, s ,   y₁ := y₀ + h ∑_{i=1}^{s} b_i k_i .

As before, the vectors ki ∈ R N are called increments.

Note that the computation of the increments ki may now require the solution of (non-linear) systems of
equations of size s · N . In this case we speak about an “implicit” method, cf. Rem. 12.3.2.7.

The Butcher scheme notation introduced in (11.4.0.13) can easily be adapted to the case of general
RK-SSMs by dropping the requirement that the Butcher matrix be strictly lower triangular.

General Butcher scheme notation for RK-SSM

Shorthand notation for Runge-Kutta methods — the Butcher scheme:

    c | A        c₁ | a₁₁ … a₁ₛ
    --+--   :=    ⋮ |  ⋮     ⋮
      | bᵀ       cₛ | aₛ₁ … aₛₛ
                 ---+----------
                    | b₁  …  bₛ        (12.3.3.3)

Note that now, in contrast to (11.4.0.13), A can be a general s × s-matrix.


Summary: terminology for Runge-Kutta single step methods:

A strictly lower triangular matrix ➤ explicit Runge-Kutta method, Def. 11.4.0.11
A lower triangular matrix ➤ diagonally-implicit Runge-Kutta method (DIRK)

Many of the techniques and much of the theory discussed for explicit RK-SSMs carry over to general
(implicit) Runge-Kutta single step methods:
• Sufficient condition for consistence from Cor. 11.4.0.15
• Algebraic convergence for meshwidth h → 0 and the related concept of order (→ Def. 11.3.2.8)
• Embedded methods and algorithms for adaptive stepsize control from Section 11.5

§12.3.3.4 (Butcher schemes for Gauss collocation RK-SSMs) As in (11.4.0.13) we can arrange the
coefficients of Gauss collocation single-step methods in the form of a Butcher scheme and get

1 1
for s = 1: 2 2 , (12.3.3.5a)
1
1
√ √
2 − 61 √3 1
4 √
1 1
4 − 6 3
1
for s = 2: 2 + 61 3 1 1
4 + 6 3
1
4
, (12.3.3.5b)
1 1
2 2
1 1
√ 5 2 1
√ 5 1

2 − 10 15 36 √ 9 − 15 15 36 − 30 √15
1 5 1 2 5 1
2√ 36 + 24 √15 9√ − 24 15
for s = 3: 1 1 5 1 2 1
36
5 . (12.3.3.5c)
2 + 10 15 36 + 30 15 9 + 15 15 36
5 4 5
18 9 18
y

Remark 12.3.3.6 (Stage form equations for increments) In Def. 12.3.3.1 instead of the increments we
can consider as unknowns the so-called stages
g_i := h ∑_{j=1}^{s} a_{ij} k_j ∈ ℝ^N , i = 1, …, s   ⇔   k_i = f(t₀ + c_i h, y₀ + g_i) .   (12.3.3.7)

This leads to the equivalent defining equations in “stage form” for an implicit RK-SSM
k_i := f(t₀ + c_i h, y₀ + h ∑_{j=1}^{s} a_{ij} k_j) , i = 1, …, s ,

g_i = h ∑_{j=1}^{s} a_{ij} f(t₀ + c_j h, y₀ + g_j) , i = 1, …, s ,   y₁ = y₀ + h ∑_{i=1}^{s} b_i f(t₀ + c_i h, y₀ + g_i) .   (12.3.3.8)

In terms of implementation there is no difference: Also the stage equations (12.3.3.8) are usually solved
by means of Newton’s method, see next remark. y

Remark 12.3.3.9 (Solving the stage equations for implicit RK-SSMs) We reformulate the increment
equations in stage form (12.3.3.8) as a non-linear system of equations in standard form F (x) = 0.


Unknowns are the total s·N components of the stage vectors g_i ∈ ℝ^N, i = 1, …, s, as defined in (12.3.3.7).
With g := [g₁, …, g_s]ᵀ ∈ ℝ^{s·N} and g_i = h ∑_{j=1}^{s} a_{ij} f(t₀ + c_j h, y₀ + g_j), the stage equations read

F(g) := g − h (A ⊗ I_N) [ f(t₀ + c₁h, y₀ + g₁) ; … ; f(t₀ + c_s h, y₀ + g_s) ] = 0 ,

where I N is the N × N identity matrix and ⊗ designates the Kronecker product introduced in Def. 1.4.3.7.

We compute an approximate solution of F (g) = 0 iteratively by means of the simplified Newton method
presented in Rem. 8.5.1.43. This is a Newton method with “frozen Jacobian”. As g → 0 for h → 0, we
choose zero as initial guess:

g^{(k+1)} = g^{(k)} − D F(0)^{−1} F(g^{(k)}) ,   k = 0, 1, 2, … ,   g^{(0)} = 0 .   (12.3.3.10)

with the Jacobian


 
D F(0) = [ I − h a₁₁ ∂f/∂y(t₀, y₀)   ⋯   −h a₁ₛ ∂f/∂y(t₀, y₀) ;   ⋮  ⋱  ⋮ ;   −h aₛ₁ ∂f/∂y(t₀, y₀)   ⋯   I − h aₛₛ ∂f/∂y(t₀, y₀) ] ∈ ℝ^{sN,sN} ,   (12.3.3.11)

that is, D F(0) = I_{sN} − h A ⊗ ∂f/∂y(t₀, y₀).

Obviously, D F (0) → I for h → 0. Thus, D F (0) will be regular for sufficiently small h.

In each step of the simplified Newton method we have to solve a linear system of equations with coefficient
matrix D F (0). If s · N is large, an efficient implementation has to reuse the LU-decomposition of D F (0),
see Code 8.5.1.44 and Rem. 2.5.0.10. y
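The following C++/EIGEN sketch (illustrative only; the function name, the fixed number of Newton steps, and the tiny demo in main are chosen for demonstration and are not taken from the lecture codes) implements one step of an implicit RK-SSM for an autonomous ODE along these lines: the frozen Jacobian D F(0) is assembled once, LU-factorized once, and the factorization is reused in every simplified Newton iteration.

#include <Eigen/Dense>
#include <functional>
#include <iostream>

using Eigen::MatrixXd;
using Eigen::VectorXd;

// One step y0 -> y1 of an s-stage implicit RK-SSM with Butcher matrix A and weights b
// for the autonomous ODE y' = f(y). The stage equations F(g) = 0 from Rem. 12.3.3.9 are
// solved approximately by the simplified Newton method (12.3.3.10) with the frozen
// Jacobian DF(0) = I_{sN} - h * kron(A, Df(y0)).
VectorXd implicitRKStep(const std::function<VectorXd(const VectorXd &)> &f,
                        const std::function<MatrixXd(const VectorXd &)> &Df,
                        const MatrixXd &A, const VectorXd &b,
                        const VectorXd &y0, double h, int newton_steps = 3) {
  const int s = A.rows(), N = y0.size();
  const MatrixXd J = Df(y0);
  MatrixXd DF = MatrixXd::Identity(s * N, s * N);
  for (int i = 0; i < s; ++i)
    for (int j = 0; j < s; ++j) DF.block(i * N, j * N, N, N) -= h * A(i, j) * J;
  const Eigen::PartialPivLU<MatrixXd> lu(DF);    // LU factorization computed only once
  VectorXd g = VectorXd::Zero(s * N);            // initial guess g^(0) = 0
  for (int it = 0; it < newton_steps; ++it) {
    VectorXd fg(s * N);                          // stacked vector [f(y0+g_1);...;f(y0+g_s)]
    for (int i = 0; i < s; ++i) fg.segment(i * N, N) = f(y0 + g.segment(i * N, N));
    VectorXd F = g;                              // F(g) = g - h * kron(A, I_N) * fg
    for (int i = 0; i < s; ++i)
      for (int j = 0; j < s; ++j) F.segment(i * N, N) -= h * A(i, j) * fg.segment(j * N, N);
    g -= lu.solve(F);                            // simplified Newton update (12.3.3.10)
  }
  VectorXd y1 = y0;                              // y1 = y0 + h * sum_i b_i f(y0 + g_i)
  for (int i = 0; i < s; ++i) y1 += h * b(i) * f(y0 + g.segment(i * N, N));
  return y1;
}

int main() {
  // Demo: implicit midpoint rule (s = 1, A = [1/2], b = [1]) for the logistic ODE y' = 5y(1-y)
  auto f  = [](const VectorXd &y) -> VectorXd { return VectorXd::Constant(1, 5.0 * y(0) * (1.0 - y(0))); };
  auto Df = [](const VectorXd &y) -> MatrixXd { return MatrixXd::Constant(1, 1, 5.0 * (1.0 - 2.0 * y(0))); };
  MatrixXd A(1, 1); A << 0.5;
  VectorXd b(1);    b << 1.0;
  VectorXd y = VectorXd::Constant(1, 0.1);
  for (int k = 0; k < 20; ++k) y = implicitRKStep(f, Df, A, b, y, 0.05);
  std::cout << "y(1) ~ " << y(0) << std::endl;
  return 0;
}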

12.3.4 Model Problem Analysis for Implicit Runge-Kutta Single-Step Methods (IRK-SSMs)
Model problem analysis for general Runge-Kutta single step methods (→ Def. 12.3.3.1) runs parallel to
that for explicit RK-methods as elaborated in Section 12.1, § 12.1.0.13. Familiarity with the techniques
and results of this section is assumed. The reader is asked to recall the concept of stability function from
Thm. 12.1.0.17, the diagonalization technique from § 12.1.0.43, and the definition of region of (absolute)
stability from Def. 12.1.0.51.
We apply the implicit RK-SSM according to Def. 12.3.3.1 to the autonomous linear scalar ODE ẏ = λy,
λ ∈ C, and utterly parallel to the considerations in § 12.1.0.13, (12.1.0.14) we obtain
k_i = λ(y₀ + h ∑_{j=1}^{s} a_{ij} k_j) , i = 1, …, s ,
y₁ = y₀ + h ∑_{i=1}^{s} b_i k_i
      ⇒   [ I − zA , 0 ; −z bᵀ , 1 ] [ k ; y₁ ] = y₀ [ 𝟙 ; 1 ] ,   (12.3.4.1)

where k := [k₁, …, kₛ]ᵀ/λ denotes the scaled vector of increments, 𝟙 := [1, …, 1]ᵀ ∈ ℝ^s, and z := λh. As in § 12.1.0.13 we


can eliminate the increments and obtain an expression for y1 :

y1 = S(z)y0 with S(z) := 1 + zb T (I − zA)−1 1 . (12.3.4.2)


Alternatively, Cramer’s rule supplies a formula for y1 in terms of determinants:


 
y₁ = y₀ · det([ I − zA , 𝟙 ; −z bᵀ , 1 ]) / det([ I − zA , 0 ; −z bᵀ , 1 ])   ⇒   S(z) = det(I − zA + z 𝟙 bᵀ) / det(I − zA) .   (12.3.4.3)

The next theorem summarizes these findings:

Theorem 12.3.4.4. Stability function of Runge-Kutta methods, cf. Thm. 12.1.0.17

The discrete evolution Ψ_λ^h of an s-stage Runge-Kutta single step method (→ Def. 12.3.3.1) with
Butcher scheme (c | A ; bᵀ) (see (12.3.3.3)) for the ODE ẏ = λy is given by multiplication with the
stability function

S(z) := 1 + z bᵀ (I − zA)^{−1} 𝟙 = det(I − zA + z 𝟙 bᵀ) / det(I − zA) ,   z := λh ,   𝟙 = [1, …, 1]ᵀ ∈ ℝ^s .

EXAMPLE 12.3.4.5 (Regions of stability for simple implicit RK-SSM) We determine the Butcher
schemes (12.3.3.3) for simple implicit RK-SSM and apply the formula from Thm. 12.3.4.4 to compute
their stability functions.

• Implicit Euler method:
      1 | 1
      --+--
        | 1
  ➣   S(z) = 1/(1 − z) .

• Implicit midpoint method:
      1/2 | 1/2
      ----+----
          |  1
  ➣   S(z) = (1 + ½ z)/(1 − ½ z) .

Their regions of stability SΨ as defined in Def. 12.1.0.51,

SΨ := {z ∈ C: |S(z)| < 1} ⊂ C ,

can easily be found from the respective stability functions:


[Fig. 459: region of stability S_Ψ of the implicit Euler method (11.2.2.2) in the complex plane (axes Re z, Im z).]
[Fig. 460: region of stability S_Ψ of the implicit midpoint method (11.2.3.3) in the complex plane (axes Re z, Im z).]

We see that in both cases |S(z)| < 1, if Re z < 0. y
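The determinant formula of Thm. 12.3.4.4 is easy to evaluate numerically. A small C++/EIGEN sketch (illustrative only, not a lecture code) computes S(z) for a given Butcher scheme and checks it against the closed-form stability functions just derived:

#include <Eigen/Dense>
#include <complex>
#include <iostream>

using Eigen::MatrixXcd;
using Eigen::MatrixXd;
using Eigen::VectorXcd;
using Eigen::VectorXd;

// Stability function S(z) = det(I - z*A + z*1*b^T) / det(I - z*A), cf. Thm. 12.3.4.4
std::complex<double> stabFn(const MatrixXd &A, const VectorXd &b, std::complex<double> z) {
  const int s = A.rows();
  const MatrixXcd I = MatrixXcd::Identity(s, s);
  const MatrixXcd Ac = A.cast<std::complex<double>>();
  const VectorXcd bc = b.cast<std::complex<double>>();
  const VectorXcd ones = VectorXcd::Ones(s);
  const MatrixXcd num = I - z * Ac + z * ones * bc.transpose();
  const MatrixXcd den = I - z * Ac;
  return num.determinant() / den.determinant();
}

int main() {
  MatrixXd A_ie(1, 1), A_im(1, 1);
  VectorXd b(1); b << 1.0;
  A_ie << 1.0;   // implicit Euler
  A_im << 0.5;   // implicit midpoint
  const std::complex<double> z(-3.0, 2.0);
  std::cout << "implicit Euler   : " << stabFn(A_ie, b, z)
            << " vs 1/(1-z) = " << 1.0 / (1.0 - z) << '\n'
            << "implicit midpoint: " << stabFn(A_im, b, z)
            << " vs (1+z/2)/(1-z/2) = " << (1.0 + 0.5 * z) / (1.0 - 0.5 * z) << std::endl;
  return 0;
}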

From the determinant formula for the stability function S(z) we can conclude a generalization of
Cor. 12.1.0.20.

Corollary 12.3.4.6. Rational stability function of general RK-SSM

For a consistent (→ Def. 11.3.1.12) s-stage general Runge-Kutta single step method according to
Def. 12.3.3.1 the stability function S is a non-constant rational function of the form S(z) = P(z)/Q(z)
with polynomials P ∈ P_s, Q ∈ P_s.

Of course, a rational function z ↦ S(z) can satisfy lim_{|z|→∞} |S(z)| < 1, as we have seen in Ex. 12.3.4.5.
As a consequence, the region of stability for implicit RK-SSM need not be bounded.
§12.3.4.7 (A-stability) A general RK-SSM with stability function S applied to the scalar linear IVP ẏ = λy,
y(0) = y0 ∈ C, λ ∈ C, with uniform timestep h > 0 will yield the sequence (yk )∞ k=0 defined by

yk = S(z)k y0 , z = λh . (12.3.4.8)

Hence, the next property of a RK-SSM guarantees that the sequence of approximations decays exponen-
tially whenever the exact solution of the model problem IVP (12.1.0.5) does so.

Definition 12.3.4.9. A-stability of a Runge-Kutta single step method

A Runge-Kutta single step method with stability function S is A-stable, if

C − := {z ∈ C: Re z < 0} ⊂ SΨ . (SΨ =
ˆ region of stability Def. 12.1.0.51)

From Ex. 12.3.4.5 we conclude that both the implicit Euler method and the implicit midpoint method are


A-stable.

A-stable Runge-Kutta single step methods will not be affected by stability induced timestep constraints
when applied to stiff IVP (→ Notion 12.2.0.7).

§12.3.4.10 (“Ideal” region of stability) In order to reproduce the qualitative behavior of the exact solution,
a single step method when applied to the scalar linear IVP ẏ = λy, y(0) = y0 ∈ C, λ ∈ C, with uniform
timestep h > 0,

• should yield an exponentially decaying sequence (y_k)_{k=0}^∞, whenever Re λ < 0,
• should produce an exponentially increasing sequence (y_k)_{k=0}^∞, whenever Re λ > 0.

Thus, in light of (12.3.4.8), we agree that the

“ideal” region of stability is S_Ψ = ℂ⁻ .   (12.3.4.11)

Are there RK-SSMs that can boast of an ideal region of stability?

Regions of stability of Gauss collocation single step methods, see Exp. 12.3.2.10:

[Fig. 461–463: level lines of |S(z)| in the complex plane (axes Re z, Im z) for Gauss collocation single step methods — implicit midpoint method (s = 1), s = 2 (order 4), and s = 4 (order 8); level lines shown for |S(z)| ∈ {0.4, 0.7, 0.9, 1, 1.1, 1.5}.]

Level lines for |S(z)| for Gauss collocation methods

Theorem 12.3.4.12. Region of stability of Gauss collocation single step methods [DB02,
Satz 6.44]

s-stage Gauss collocation single step methods defined by (12.3.2.6) with the nodes cs given by the
s Gauss points on [0, 1], feature the “ideal” stability domain:

SΨ = C − . (12.3.4.11)

In particular, all Gauss collocation single step methods are A-stable.

EXPERIMENT 12.3.4.13 (Implicit RK-SSMs for stiff IVP) We consider the stiff IVP

ẏ = −λy + β sin(2πt) , λ = 106 , β = 106 , y(0) = 1 ,


whose solution essentially is the smooth function t 7→ sin(2πt). Applying the criteria (12.2.0.13) and
(12.2.0.14) we immediately see that this IVP is extremely stiff.

We solve it with different implicit RK-SSMs on [0, 1] with the large uniform timestep h = 1/20.
[Fig. 464: numerical solutions on [0, 1] produced by the implicit Euler method and by Gauss collocation RK-SSMs with s = 1, …, 4, together with the exact solution y(t).]
[Fig. 465: stability functions of these methods on the negative real axis, Re S(z) for z ∈ [−1000, 0], compared with exp(z).]

Solutions by RK-SSMs (left);   stability functions on ℝ⁻ (right)

We observe that Gauss collocation RK-SSMs incur a huge discretization error, whereas the simple implicit
Euler method provides a perfect approximation!

Explanation: The stability functions for Gauss collocation RK-SSMs satisfy

lim |S(z)| = 1 .
|z|→∞

Hence, when they are applied to ẏ = λy with extremely large (in modulus) λ < 0, they will produce
sequences that decay only very slowly or even oscillate, which misses the very rapid decay of the ex-
act solution. The stability function for the implicit Euler method is S(z) = (1 − z)^{−1} and satisfies
lim_{|z|→∞} S(z) = 0, which means a fast exponential decay of the y_k. y

§12.3.4.14 (L-stability) In light of what we learned in the previous experiment we can now state what we
expect from the stability function of a Runge-Kutta method that is suitable for stiff IVP (→ Notion 12.2.0.7):

Definition 12.3.4.15. L-stable Runge-Kutta method → [Han02, Ch. 77]


A Runge-Kutta method (→ Def. 12.3.3.1) is L-stable/asymptotically stable, if its stability function (→
Thm. 12.3.4.4) satisfies

(i ) Re z < 0 ⇒ |S(z)| < 1 , (12.3.4.16)


(ii ) lim S(z) = 0 . (12.3.4.17)
Re z→−∞

Remember:   L-stable :⇔ A-stable & “S(−∞) = 0” y

Remark 12.3.4.18 (Necessary condition for L-stability of Runge-Kutta methods)


Consider a Runge-Kutta single step method (→ Def. 12.3.3.1) described by the Butcher scheme (c | A ; bᵀ).
Assume that A ∈ R s,s is regular, which can be fulfilled only for an implicit RK-SSM.


For a rational function S(z) = P(z)/Q(z) the limit for |z| → ∞ exists and can easily be expressed through the
leading coefficients of the polynomials P and Q:

Thm. 12.3.4.4   ⇒   S(−∞) = 1 − bᵀ A^{−1} 𝟙 .   (12.3.4.19)

If bᵀ coincides with a row of A, i.e., bᵀ = (A)_{j,:} for some j   ⇒   S(−∞) = 0 .   (12.3.4.20)

Butcher scheme (12.3.3.3) for L-stable RK-methods, see Def. 12.3.4.15:

    c | A        c₁       | a₁₁         ⋯  a₁ₛ
    --+--   :=    ⋮       |  ⋮              ⋮
      | bᵀ      c_{s−1}   | a_{s−1,1}   ⋯  a_{s−1,s}
                 1        | b₁          ⋯  bₛ
                 ---------+-------------------------
                          | b₁          ⋯  bₛ

A closer look at the coefficient formulas of (12.3.2.6) reveals that the algebraic condition (12.3.4.20) will
automatically be satisfied for a collocation single step method with c_s = 1! y
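The limit (12.3.4.19) can be evaluated with a single linear solve. A short C++/EIGEN sketch (illustrative only; not a lecture code) checks it for the implicit Euler method, the implicit midpoint method, and the 2-stage Radau method of the next example:

#include <Eigen/Dense>
#include <iostream>

// Limit of the stability function, S(-infinity) = 1 - b^T * A^{-1} * 1, cf. (12.3.4.19);
// S(-infinity) = 0 is the extra requirement for L-stability on top of A-stability.
double stabLimit(const Eigen::MatrixXd &A, const Eigen::VectorXd &b) {
  const Eigen::VectorXd ones = Eigen::VectorXd::Ones(A.rows());
  return 1.0 - b.dot(A.lu().solve(ones));
}

int main() {
  Eigen::MatrixXd A_ie(1, 1), A_im(1, 1), A_radau(2, 2);
  Eigen::VectorXd b1(1), b_radau(2);
  A_ie << 1.0;  A_im << 0.5;  b1 << 1.0;
  // 2-stage Radau RK-SSM of order 3, cf. Ex. 12.3.4.21
  A_radau << 5.0 / 12.0, -1.0 / 12.0, 3.0 / 4.0, 1.0 / 4.0;
  b_radau << 3.0 / 4.0, 1.0 / 4.0;
  std::cout << "implicit Euler   : S(-inf) = " << stabLimit(A_ie, b1) << '\n'          // 0 -> L-stable
            << "implicit midpoint: S(-inf) = " << stabLimit(A_im, b1) << '\n'          // -1 -> not L-stable
            << "2-stage Radau    : S(-inf) = " << stabLimit(A_radau, b_radau) << std::endl;  // 0 -> L-stable
  return 0;
}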

EXAMPLE 12.3.4.21 (L-stable implicit Runge-Kutta methods) There is a family of s-point quadra-
ture formulas on [0, 1] with a node located in 1 and (maximal) order 2s − 1: Gauss-Radau formulas.
They induce the L-stable Gauss-Radau collocation single step methods of order 2s − 1 according to
Thm. 12.3.2.12.

Implicit Euler method (s = 1):
    1 | 1
    --+--
      | 1

Radau RK-SSM, order 3 (s = 2):
    1/3 | 5/12   −1/12
    1   | 3/4     1/4
    ----+-------------
        | 3/4     1/4

Radau RK-SSM, order 5 (s = 3):
    (4−√6)/10 | (88−7√6)/360       (296−169√6)/1800   (−2+3√6)/225
    (4+√6)/10 | (296+169√6)/1800   (88+7√6)/360       (−2−3√6)/225
    1         | (16−√6)/36         (16+√6)/36          1/9
    ----------+---------------------------------------------------
              | (16−√6)/36         (16+√6)/36          1/9


The stability functions of s-stage Gauss-Radau collocation SSMs are rational functions of the form

S(z) = P(z)/Q(z) ,   P ∈ P_{s−1} , Q ∈ P_s .

Beware that also “S(∞) = 0”, which means that Gauss-Radau methods, when applied to problems with
fast exponential blow-up, may produce a spurious decaying solution.

[Fig. 466: Re S(z) for z ∈ [−2, 6] for s-stage Radau methods, s = 2, …, 5, compared with exp(z).]

Level lines of stability functions of s-stage Gauss-Radau collocation SSMs:


[Fig. 467–469: level lines of |S(z)| in the complex plane (axes Re z, Im z) for s-stage Gauss-Radau collocation SSMs, s = 2, 3, 4.]


Further information about Radau-Runge-Kutta single step methods can be found in [Han02, Ch. 79]. y

EXPERIMENT 12.3.4.22 (Gauss-Radau collocation SSM for stiff IVP) We revisit the stiff IVP from
Ex. 12.0.0.1

ẏ(t) = λ y²(1 − y) ,   λ = 500 ,   y(0) = 1/100 .

We compare the sequences generated by 1-stage and 2-stage Gauss collocation and Gauss-Radau col-
location SSMs, respectively (uniform timestep).
[Two panels (equidistant mesh, h = 0.016667): numerical solutions on [0, 1] for the 1- and 2-stage Gauss collocation SSMs (left) and for the 1- and 2-stage Radau SSMs (right), each together with the exact solution y(t).]

The 2nd-order Gauss collocation SSM (implicit midpoint method) suffers from spurious oscillations when
homing in on the stable stationary state y = 1. The explanation from Exp. 12.3.4.13 also applies to this
example.
The fourth-order Gauss method is already so accurate that potential overshoots when approaching y = 1
are damped fast enough. y

Review question(s) 12.3.4.23 (Implicit Runge-Kutta single-step methods)


(Q12.3.4.23.A) The stability function for the implicit Euler method for the ODE ẏ = f(t, y),

y_{k+1} = y_k + h_k f(t_{k+1}, y_{k+1}) ,   h_k := t_{k+1} − t_k ,

is S(z) = 1/(1 − z).
When will one observe a totally wrong qualitative behavior of the sequence (yk ) of states generated by
the implicit Euler method applied to the scalar growth ODE ẏ = λy, λ > 0?


(Q12.3.4.23.B) Gauss collocation RK-SSMs possess the ideal stability domain


S_Ψ = ℂ⁻ := {z ∈ ℂ : Re z < 0}. Argue why necessarily lim_{z→∞} S(z) = ±1.

Hint. What kind of function is the stability function for an implicit RK-SSM?
(Q12.3.4.23.C) We apply a general Runge-Kutta single-step method to the autonomous affine-linear ODE
ẏ = My + b, M ∈ R N,N , b ∈ R N , N ∈ N. Describe the linear system of equations that has to be
solved in every timestep.

The definition of a general Runge-Kutta single-step method applied to the ODE is ẏ = f(t, y) is as
follows:

Definition 12.3.3.1. Runge-Kutta single-step method

For bi , aij ∈ R, ci := ∑sj=1 aij , i, j = 1, . . . , s, s ∈ N, a single step of size h > 0 of an s-stage


Runge-Kutta single step method (RK-SSM) for the IVP (11.1.3.2) is defined by
k_i := f(t₀ + c_i h, y₀ + h ∑_{j=1}^{s} a_{ij} k_j) , i = 1, …, s ,   y₁ := y₀ + h ∑_{i=1}^{s} b_i k_i .

As before, the vectors ki ∈ R N are called increments.

(Q12.3.4.23.D) Show that the single-step methods arising from the polynomial collocation approach with
s ∈ N collocation points will always be consistent.

Hint. The general formulas for a single-step method constructed via the polynomial collocation approach
with normalized collocation points c1 , c2 , . . . , cs are

k_i = f(t₀ + c_i h, y₀ + h ∑_{j=1}^{s} a_{ij} k_j) ,   where   a_{ij} := ∫₀^{c_i} L_j(τ) dτ ,
y₁ := y_h(t₁) = y₀ + h ∑_{i=1}^{s} b_i k_i ,   where   b_i := ∫₀^{1} L_i(τ) dτ ,   (12.3.2.6)
where { L1 , . . . , Ls } ⊂ Ps−1 are the Lagrange polynomials associated with the node set {c1 , c2 , . . . , cs }
on [0, 1].
Also remember that a single-step method for the ODE ẏ = f(y) is consistent, if and only if, its associ-
ated discrete evolution is of the form
Ψ^h y = y + h ψ(h, y)   with   ψ : I × D → ℝ^N continuous and ψ(0, y) = f(y) .   (11.3.1.11)
(Q12.3.4.23.E) Let (c | A ; bᵀ) be the Butcher scheme for an s-stage collocation single-step method with
c_s = 1. Show that

(A)_{s,:} = bᵀ ,

which, for an A-stable method, is a sufficient condition for L-stability.

Supplementary literature. [DR08, Sect. 11.6.2], [QSS00, Sect. 11.8.3]


12.4 Semi-Implicit Runge-Kutta Methods

Video tutorial for Section 12.4: Semi-Implicit Runge-Kutta Methods: (13 minutes)
Download link, tablet notes

From Section 12.3.3 recall the formulas for general/implicit Runge-Kutta single-step methods for the ODE
ẏ = f(t, y):
Definition 12.3.3.1. General Runge-Kutta single-step method

For bi , aij ∈ R, ci := ∑sj=1 aij , i, j = 1, . . . , s, s ∈ N, an s-stage Runge-Kutta single step method


(RK-SSM) for the IVP (11.1.3.2) is defined by
k_i := f(t₀ + c_i h, y₀ + h ∑_{j=1}^{s} a_{ij} k_j) , i = 1, …, s ,   y₁ := y₀ + h ∑_{i=1}^{s} b_i k_i .

As before, the vectors ki ∈ R N are called increments.

The equations fixing the increments ki ∈ R N , i = 1, . . . , s, for an s-stage implicit RK-method


constitute a (Non-)linear system of equations with s · N unknowns.

Several expensive (Newton) iterations needed to find ki ?

Remember that we compute approximate solutions anyway, and the increments are weighted with the
stepsize h ≪ 1, see Def. 12.3.3.1. So there is no point in determining them with high accuracy!

Idea: Use only a fixed small number of Newton steps to solve for the ki , i = 1, . . . , s.

Extreme case: use only a single Newton step! Let’s try.

EXAMPLE 12.4.0.1 (Semi-implicit Euler single-step method) We apply the above idea to the implicit
Euler method introduced in Section 11.2.2. For the sake of simplicity we consider the autonomous ODE
ẏ = f(y), f : D ⊂ R N → R N .
The recursion for the implicit Euler method with (local) stepsize h > 0 is

yk+1 : yk+1 = yk + hf(yk+1 ) . (11.2.2.2)

We recast it as a non-linear system of N equations in “standard form F (x) = 0”:

yk+1 = yk + hf(yk+1 ) ⇔ F (yk+1 ) := yk+1 − hf(yk+1 ) − yk = 0 .

A single Newton step (8.5.1.6) applied to F (y) = 0 with the natural initial guess yk yields

y_{k+1} = y_k − D F(y_k)^{−1} F(y_k) = y_k + (I − h D f(y_k))^{−1} h f(y_k) .   (12.4.0.2)

This defines the recursion for the semi-implicit Euler method.

Note that for a linear ODE with f(y) = My, M ∈ R N,N , we recover the original implicit Euler method! y
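A single step of the semi-implicit Euler method amounts to one linear solve per timestep. A minimal C++/EIGEN sketch of (12.4.0.2) (illustrative only; the function name and the small demo are not part of the lecture codes):

#include <Eigen/Dense>
#include <functional>
#include <iostream>

using Eigen::MatrixXd;
using Eigen::VectorXd;

// One step of the semi-implicit Euler method (12.4.0.2):
//   y_{k+1} = y_k + (I - h*Df(y_k))^{-1} * h * f(y_k)
VectorXd semiImplicitEulerStep(const std::function<VectorXd(const VectorXd &)> &f,
                               const std::function<MatrixXd(const VectorXd &)> &Df,
                               const VectorXd &y, double h) {
  const int N = y.size();
  const MatrixXd M = MatrixXd::Identity(N, N) - h * Df(y);
  return y + M.lu().solve(h * f(y));   // one linear solve instead of a full Newton iteration
}

int main() {
  // Demo: logistic ODE y' = lambda*y*(1-y), cf. Exp. 12.4.0.3
  const double lambda = 5.0, h = 0.02;
  auto f  = [lambda](const VectorXd &y) -> VectorXd {
    return VectorXd::Constant(1, lambda * y(0) * (1.0 - y(0)));
  };
  auto Df = [lambda](const VectorXd &y) -> MatrixXd {
    return MatrixXd::Constant(1, 1, lambda * (1.0 - 2.0 * y(0)));
  };
  VectorXd y = VectorXd::Constant(1, 0.1);
  for (int k = 0; k < 50; ++k) y = semiImplicitEulerStep(f, Df, y, h);
  std::cout << "y(1) ~ " << y(0) << std::endl;
  return 0;
}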

EXPERIMENT 12.4.0.3 (Empiric convergence of semi-implicit Euler single-step method)


✦ We consider an initial value problem for the logistic ODE, see Ex. 11.1.2.1:   ẏ = λy(1 − y), y(0) = 0.1, λ = 5.
✦ We run the implicit Euler method (11.2.2.2) and the semi-implicit Euler method (12.4.0.2) with uniform timestep h = 1/n,
  n ∈ {5, 8, 11, 17, 25, 38, 57, 85, 128, 192, 288, 432, 649, 973, 1460, 2189, 3284, 4926, 7389}.
✦ Measured error: err = max_{j=1,…,n} |y_j − y(t_j)|.

[Fig. 470: error vs. timestep h (log-log) for the implicit Euler and the semi-implicit Euler method, with an O(h) reference line; logistic ODE, y₀ = 0.1, λ = 5.]

We observe that the approximate solution of the defining equation for y_{k+1} by a single Newton step
preserves the first-order convergence of the implicit Euler method: the semi-implicit Euler method
(12.4.0.2) also appears to be of first order. y

EXPERIMENT 12.4.0.4 (Convergence of semi-implicit midpoint method) Again, we tackle the IVP
from Exp. 12.4.0.3.
✦ Now: the implicit midpoint method (11.2.3.3) with uniform timesteps h = 1/n as above, and approximate
  computation of y_{k+1} by a single Newton step with initial guess y_k.
✦ Measured error: err = max_{j=1,…,n} |y_j − y(t_j)|.

[Fig. 471: error vs. timestep h (log-log) for the implicit midpoint rule and its semi-implicit variant, with an O(h²) reference line; logistic ODE, y₀ = 0.1, λ = 5.]

We still observe second-order convergence! y

Try: Use linearized increment equations for implicit RK-SSM

k_i := f(y₀ + h ∑_{j=1}^{s} a_{ij} k_j) , i = 1, …, s      ?→      k_i = f(y₀) + h D f(y₀) ( ∑_{j=1}^{s} a_{ij} k_j ) , i = 1, …, s .   (12.4.0.5)


The good news is that all results about stability derived from model problem analysis (→ Section 12.1)
remain valid despite linearization of the increment equations:
Linearization does nothing for linear ODEs ➢ stability function (→ Thm. 12.3.4.4) not affected!

The bad news is that the preservation of the order observed in Exp. 12.4.0.3 will no longer hold in the
general case.

EXPERIMENT 12.4.0.6 (Convergence of naive semi-implicit Radau method)


✦ We consider an IVP for the logistic ODE from Ex. 11.1.2.1:
ẏ = λy(1 − y) , y(0) = 0.1 , λ = 5 .
✦ 2-stage Radau RK-SSM with Butcher scheme

      1/3 | 5/12   −1/12
      1   | 3/4     1/4                     (12.4.0.7)
      ----+-------------
          | 3/4     1/4

  order = 3, see Ex. 12.3.4.21.
✦ Increments from the linearized equations (12.4.0.5).
✦ We monitor the error err = max_{j=1,…,n} |y_j − y(t_j)|.

[Fig. 472: error vs. timestep h (log-log) for the 2-stage Radau RK-SSM and its naive semi-implicit variant, with O(h³) and O(h²) reference lines; logistic ODE, y₀ = 0.1, λ = 5.]

Loss of order due to linearization ! y

§12.4.0.8 (Rosenbrock-Wanner methods) We have just seen that the simple linearization according to
(12.4.0.5) will degrade the order of implicit RK-SSMs and leads to a substantial loss of accuracy. This is
not an option.

Yet, the idea behind (12.4.0.5) has been refined. One does not start from a known RK-SSM, but introduces
general coefficients for structurally linear increment equations.

Class of s-stage semi-implicit (linearly implicit) Runge-Kutta methods (Rosenbrock-Wanner (ROW)


methods):

(I − h a_ii J) k_i = f(y₀ + h ∑_{j=1}^{i−1} (a_{ij} + d_{ij}) k_j) − h J ∑_{j=1}^{i−1} d_{ij} k_j ,   J = D f(y₀) ,   i = 1, …, s ,
y₁ := y₀ + h ∑_{j=1}^{s} b_j k_j .   (12.4.0.9)

Then the coefficients aij , dij , and bi are determined from order conditions by solving large non-linear
systems of equations.

In each step s linear systems with coefficient matrices I − haii J have to be solved. For methods used in
practice one often demands that aii = γ for all i = 1, . . . , s. As a consequence, we have to solve s linear


systems with the same coefficient matrix I − hγJ ∈ R N,N , which permits us to reuse LU-factorizations,
see Rem. 2.5.0.10. y
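A generic ROW step can be organized so that the LU factorization of I − hγJ is computed only once. The following C++/EIGEN sketch follows (12.4.0.9) literally; it is illustrative only — the coefficient arrays a_ij, d_ij, γ, b_i are method-specific, must be taken from the literature (e.g. [Ran15]), and are not specified here. As a sanity check, with s = 1, a₁₁ = γ = 1, b₁ = 1 the scheme reduces to the semi-implicit Euler method (12.4.0.2).

#include <Eigen/Dense>
#include <functional>
#include <iostream>
#include <vector>

using Eigen::MatrixXd;
using Eigen::VectorXd;

// One step of a generic s-stage Rosenbrock-Wanner method (12.4.0.9) with constant
// diagonal a_ii = gamma: every stage shares the matrix I - h*gamma*J, so its LU
// factorization is computed only once per step (cf. Rem. 2.5.0.10). Only the strictly
// lower triangular parts of the coefficient arrays A (a_ij) and D (d_ij) are used.
VectorXd rosenbrockStep(const std::function<VectorXd(const VectorXd &)> &f,
                        const std::function<MatrixXd(const VectorXd &)> &Df,
                        const MatrixXd &A, const MatrixXd &D, double gamma,
                        const VectorXd &b, const VectorXd &y0, double h) {
  const int s = b.size(), N = y0.size();
  const MatrixXd J = Df(y0);
  const Eigen::PartialPivLU<MatrixXd> lu(MatrixXd::Identity(N, N) - h * gamma * J);
  std::vector<VectorXd> k(s);
  for (int i = 0; i < s; ++i) {
    VectorXd u = y0, v = VectorXd::Zero(N);
    for (int j = 0; j < i; ++j) {                // sums over previous increments only
      u += h * (A(i, j) + D(i, j)) * k[j];
      v += D(i, j) * k[j];
    }
    k[i] = lu.solve(f(u) - h * J * v);           // (I - h*gamma*J) k_i = f(u) - h*J*v
  }
  VectorXd y1 = y0;
  for (int j = 0; j < s; ++j) y1 += h * b(j) * k[j];
  return y1;
}

int main() {
  // Sanity check: s = 1, a_11 = gamma = 1, b_1 = 1 reproduces the semi-implicit
  // Euler method (12.4.0.2); logistic ODE as in Exp. 12.4.0.3.
  auto f  = [](const VectorXd &y) -> VectorXd { return VectorXd::Constant(1, 5.0 * y(0) * (1.0 - y(0))); };
  auto Df = [](const VectorXd &y) -> MatrixXd { return MatrixXd::Constant(1, 1, 5.0 * (1.0 - 2.0 * y(0))); };
  const MatrixXd A = MatrixXd::Constant(1, 1, 1.0), D = MatrixXd::Zero(1, 1);
  const VectorXd b = VectorXd::Ones(1);
  VectorXd y = VectorXd::Constant(1, 0.1);
  for (int k = 0; k < 50; ++k) y = rosenbrockStep(f, Df, A, D, 1.0, b, y, 0.02);
  std::cout << "y(1) ~ " << y(0) << std::endl;
  return 0;
}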

Supplementary literature. A related discussion can be found in [Han02, Ch. 80].

Review question(s) 12.4.0.10 (Semi-implicit Runge-Kutta single-step methods)


(Q12.4.0.10.A) [Semi-implicit midpoint method] The implicit midpoint single-step method applied to
the autonomous ODE ẏ = f(y) and with timestep h leads to the recursion

y_{k+1} :   y_{k+1} = y_k + h f(½ (y_k + y_{k+1})) .

Derive the defining equation of the semi-implicit variant, which arises from solving the defining equation
for yk+1 by a single Newton step with initial guess yk .
(Q12.4.0.10.B) [Stability function of ROW-SSM] A Rosenbrock-Wanner (ROW) single-step method for
the autonomous ODE ẏ = f(y) can be defined by

(I − h a_ii J) k_i = f(y₀ + h ∑_{j=1}^{i−1} (a_{ij} + d_{ij}) k_j) − h J ∑_{j=1}^{i−1} d_{ij} k_j ,   J = D f(y₀) ,   i = 1, …, s ,
y₁ := y₀ + h ∑_{j=1}^{s} b_j k_j .   (12.4.0.9)

Derive its stability functions for s = 2.


12.5 Splitting Methods

Video tutorial for Section 12.5: Splitting Methods: (21 minutes) Download link, tablet notes

§12.5.0.1 (Splitting idea: composition of partial evolutions) Many relevant ordinary differential equa-
tions feature a right hand side function that is the sum of two (or more) terms. Consider an autonomous
IVP with a right hand side function that can be split in an additive fashion:

ẏ = f(y) + g(y) , y(0) = y0 , (12.5.0.2)

with f : D ⊂ R N 7→ R N , g : D ⊂ R N 7→ R N “sufficiently smooth”, locally Lipschitz continuous (→


Def. 11.1.3.13).

Let us introduce the evolution operators (→ Def. 11.1.4.3) for both summands:

(Continuous) evolution maps:   Φ_f^t ↔ ODE ẏ = f(y) ,   Φ_g^t ↔ ODE ẏ = g(y) .

Temporarily we assume that both Φtf , Φtg are available in the form of analytic formulas or highly accurate


approximations.

Idea: Build single step methods (→ Def. 11.3.1.5) based on the following
discrete evolutions

Lie-Trotter splitting:   Ψ^h = Φ_g^h ∘ Φ_f^h ,   (12.5.0.3)
Strang splitting:   Ψ^h = Φ_f^{h/2} ∘ Φ_g^h ∘ Φ_f^{h/2} .   (12.5.0.4)

These splittings are easily remembered in graphical form:


[Fig. 473: Lie-Trotter splitting (12.5.0.3) — starting from y₀, apply Φ_f^h and then Φ_g^h to obtain y₁ = Ψ^h y₀.]
[Fig. 474: Strang splitting (12.5.0.4) — starting from y₀, apply Φ_f^{h/2}, then Φ_g^h, then Φ_f^{h/2} to obtain y₁ = Ψ^h y₀.]

Note that over many timesteps the Strang splitting approach is not more expensive than Lie-Trotter split-
ting, because the actual implementation of (12.5.0.4) should be done as follows:

y_{1/2} := Φ_f^{h/2} y₀ ,   y₁ := Φ_g^h y_{1/2} ,
y_{3/2} := Φ_f^h y₁ ,   y₂ := Φ_g^h y_{3/2} ,
y_{5/2} := Φ_f^h y₂ ,   y₃ := Φ_g^h y_{5/2} ,
… ,

because Φ_f^{h/2} ∘ Φ_f^{h/2} = Φ_f^h. This means that a Strang splitting SSM differs from a Lie-Trotter splitting SSM
in the first and the last step only. y
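Given the two partial evolutions as callable objects, both splittings (12.5.0.3), (12.5.0.4) are one-liners. A minimal scalar C++ sketch (illustrative only; the toy ODE ẏ = −y + 1 and all names are chosen for demonstration, and the Strang step is written naively without merging consecutive half-steps):

#include <cmath>
#include <functional>
#include <iostream>

// Generic scalar evolution operator: (h, y) -> Phi^h(y)
using Evolution = std::function<double(double, double)>;

// One step of the Lie-Trotter splitting (12.5.0.3): Psi^h = Phi_g^h o Phi_f^h
double lieTrotterStep(const Evolution &Phi_f, const Evolution &Phi_g, double h, double y) {
  return Phi_g(h, Phi_f(h, y));
}

// One step of the Strang splitting (12.5.0.4): Psi^h = Phi_f^{h/2} o Phi_g^h o Phi_f^{h/2}
double strangStep(const Evolution &Phi_f, const Evolution &Phi_g, double h, double y) {
  return Phi_f(0.5 * h, Phi_g(h, Phi_f(0.5 * h, y)));
}

int main() {
  // Toy splitting y' = -y + 1 with f(y) = -y, g(y) = 1, y(0) = 0, T = 1:
  // Phi_f^t y = exp(-t)*y,  Phi_g^t y = y + t,  exact solution y(T) = 1 - exp(-T).
  Evolution Phi_f = [](double t, double y) { return std::exp(-t) * y; };
  Evolution Phi_g = [](double t, double y) { return y + t; };
  const int n = 100;
  const double h = 1.0 / n, yex = 1.0 - std::exp(-1.0);
  double yLT = 0.0, yST = 0.0;
  for (int k = 0; k < n; ++k) {
    yLT = lieTrotterStep(Phi_f, Phi_g, h, yLT);
    yST = strangStep(Phi_f, Phi_g, h, yST);
  }
  std::cout << "errors: Lie-Trotter " << std::abs(yLT - yex)
            << ", Strang " << std::abs(yST - yex) << std::endl;   // O(h) vs O(h^2)
  return 0;
}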

EXPERIMENT 12.5.0.5 (Convergence of simple splitting methods) We consider the following IVP
whose right hand side function is the sum of two functions for which the ODEs can be solved analytically:

ẏ = λy(1 − y) + √(1 − y²) ,   y(0) = 0 ,   with   f(y) := λy(1 − y) ,   g(y) := √(1 − y²) .

Φ_f^t y = 1 / (1 + (y^{−1} − 1) e^{−λt}) ,   t > 0 , y ∈ ]0, 1]   (logistic ODE (11.1.2.2)) ,
Φ_g^t y = sin(t + arcsin(y)) if t + arcsin(y) < π/2 , and Φ_g^t y = 1 otherwise ,   t > 0 , y ∈ [0, 1] .


Numerical experiment: For T = 1, λ = 1, we compare the two splitting methods for uniform timesteps with a
very accurate reference solution obtained in MATLAB by

f=@(t,x) lambda*x*(1-x)+sqrt(1-x^2);
options=odeset('reltol',1.0e-10,'abstol',1.0e-12);
[t,yex]=ode45(f,[0,1],y0,options);

[Fig. 475: error |y(T) − y_h(T)| at final time T = 1 vs. timestep h (log-log) for Lie-Trotter splitting and Strang splitting, with O(h) and O(h²) reference lines.]

We observe algebraic convergence of the two splitting methods, order 1 for (12.5.0.3), order 2 for (12.5.0.4).
y

The observation made in Exp. 12.5.0.5 reflects a general truth:

Theorem 12.5.0.6. Order of simple splitting methods

The single step methods defined by (12.5.0.3) or (12.5.0.4) are of order (→ Def. 11.3.2.8) 1 and 2, respectively.

§12.5.0.7 (Inexact splitting methods) Of course, the assumption that ẏ = f(y) and ẏ = g(y) can be
solved exactly will hardly ever be met. However, it should be clear that a “sufficiently accurate” approxima-
tion of the evolution maps Φ_g^h and Φ_f^h is all we need.

Idea: In (12.5.0.3)/(12.5.0.4) replace the exact evolutions Φ_g^h, Φ_f^h by discrete evolutions Ψ_g^h, Ψ_f^h.

EXPERIMENT 12.5.0.8 (Convergence of inexact simple splitting methods) Again we consider the
IVP of Exp. 12.5.0.5 and inexact splitting methods based on different single step methods for the two ODE
corresponding to the summands.


LTS-Eul: explicit Euler method (11.2.1.5) → Ψ_g^h, Ψ_f^h + Lie-Trotter splitting (12.5.0.3)
SS-Eul: explicit Euler method (11.2.1.5) → Ψ_g^h, Ψ_f^h + Strang splitting (12.5.0.4)
SS-EuEI: Strang splitting (12.5.0.4): explicit Euler method (11.2.1.5) ∘ exact evolution Φ_g^h ∘ implicit Euler method (11.2.2.2)
LTS-EMP: explicit midpoint method (11.2.3.3) → Ψ_g^h, Ψ_f^h + Lie-Trotter splitting (12.5.0.3)
SS-EMP: explicit midpoint method (11.4.0.9) → Ψ_g^h, Ψ_f^h + Strang splitting (12.5.0.4)

[Fig. 476: error |y(T) − y_h(T)| at final time T = 1 vs. timestep h (log-log) for these five inexact splitting methods.]

☞ The order of splitting methods may be (but need not be) limited by the order of the SSMs used to approximate Φ_f^h, Φ_g^h.
y

§12.5.0.9 (Application of splitting methods) In the following situation the use of splitting methods seems
advisable:

“Splittable” ODEs

ẏ = f(y) + g(y) “difficult” (e.g., stiff → Section 12.2) :   ẏ = f(y) → stiff, but with an analytic solution ;
ẏ = g(y) → “easy”, amenable to explicit integration.

EXPERIMENT 12.5.0.11 (Splitting off stiff components) Recall Ex. 12.0.0.1 and the IVP studied there:

IVP:   ẏ = λy(1 − y) + α sin(y) ,   λ = 100 ,   α = 1 ,   y(0) = 10⁻⁴ ,   where α sin(y) is a small perturbation.

[Fig. 477: solution computed by ode45 together with the local timestep sizes it selects, see Ex. 12.0.0.1.]
[Fig. 478: solutions (y_k) produced by the inexact splitting methods LT-Eulex (h = 0.04 and h = 0.02) and ST-MPRexpl (h = 0.05) on [0, 1], together with y(t).]

Total number of timesteps:   ode45: 152 ;   LT-Eulex, h = 0.04: 25 ;   LT-Eulex, h = 0.02: 50 ;   ST-MPRexpl, h = 0.05: 20.

Details of the methods:


LT-Eulex: ẏ = λy(1 − y) → exact evolution, ẏ = α sin y → expl. Euler (11.2.1.5) & Lie-Trotter
splitting (12.5.0.3)
ST-MPRexpl: ẏ = λy(1 − y) → exact evolution, ẏ = α sin y → expl. midpoint rule (11.4.0.9) & Strang
splitting (12.5.0.4)

We observe that this splitting scheme can cope well with the stiffness of the problem, because the stiff
term on the right hand side is integrated exactly. y

EXAMPLE 12.5.0.12 (Splitting linear and decoupled terms) In the numerical treatment of partial differ-
ential equation one commonly encounters ODEs of the form
 
ẏ = f(y) := −Ay + [g(y₁), …, g(y_N)]ᵀ ,   A = Aᵀ ∈ ℝ^{N,N} positive definite (→ Def. 1.1.2.6) ,   (12.5.0.13)

with state space D = R N , where λmin (A) ≈ 1, λmax (A) ≈ N 2 , and the derivative of g : R → R is
bounded. Then IVPs for (12.5.0.13) will be stiff, since the Jacobian
 
D f(y) = −A + diag(g′(y₁), …, g′(y_N)) ∈ ℝ^{N,N}

will have eigenvalues “close to zero” and others that are large (in modulus) and negative. Hence, D f(y)
will satisfy the criteria (12.2.0.13) and (12.2.0.14) for any state y ∈ R N .

The natural splitting is


 
f(y) = ℓ(y) + q(y)   with   ℓ(y) := −Ay ,   q(y) := [g(y₁), …, g(y_N)]ᵀ .

• For the linear ODE ẏ = ℓ(y) we have to use an L-stable (→ Def. 12.3.4.15) single step method,
for instance a second-order implicit Runge-Kutta method. Its increments can be obtained by solving
a linear system of equations, whose coefficient matrix will be the same for every step, if uniform
timesteps are used.
• The ODE ẏ = q(y) boils down to decoupled scalar ODEs ẏ j = g(y j ), j = 1, . . . , N . For them we
can use an inexpensive explicit RK-SSM like the explicit trapezoidal method (11.4.0.8). According
to our assumptions on g these ODEs are not haunted by stiffness.
y
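A possible realization of such a splitting is sketched below in C++/EIGEN (illustrative only; for brevity it combines an implicit Euler substep for the linear part with an explicit Euler substep for the decoupled part via Lie-Trotter splitting, so it is only first order, whereas the example above asks for second-order substeps; a dense matrix and dense Cholesky factorization stand in for the sparse data structures one would use in practice). The point it illustrates: for uniform timesteps the matrix I + hA is factorized once and the factorization is reused in every step.

#include <Eigen/Dense>
#include <cmath>
#include <iostream>

using Eigen::MatrixXd;
using Eigen::VectorXd;

int main() {
  // Lie-Trotter splitting for y' = -A*y + [g(y_j)]_j as in (12.5.0.13):
  // linear part by the implicit Euler method (L-stable), decoupled part by explicit Euler.
  const int N = 100;
  const double h = 0.01, T = 1.0;
  // Stiff SPD model matrix: scaled 1D finite-difference Laplacian,
  // smallest eigenvalue O(1), largest O(N^2)
  MatrixXd A = MatrixXd::Zero(N, N);
  const double c = (N + 1.0) * (N + 1.0);
  for (int i = 0; i < N; ++i) {
    A(i, i) = 2.0 * c;
    if (i > 0) A(i, i - 1) = -c;
    if (i < N - 1) A(i, i + 1) = -c;
  }
  auto g = [](double y) { return std::sin(y); };      // scalar nonlinearity, bounded g'
  // Factorize I + h*A once; the factorization is reused in every (uniform) timestep.
  const Eigen::LLT<MatrixXd> llt(MatrixXd::Identity(N, N) + h * A);
  VectorXd y = VectorXd::Constant(N, 1.0);
  for (double t = 0.0; t < T; t += h) {
    y = llt.solve(y);                                 // implicit Euler substep for y' = -A*y
    for (int j = 0; j < N; ++j) y(j) += h * g(y(j));  // explicit Euler substep, decoupled
  }
  std::cout << "|y(T)| = " << y.norm() << std::endl;
  return 0;
}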

Review question(s) 12.5.0.14 (Splitting single-step methods)


(Q12.5.0.14.A) The ODE describing the motion of a mathematical pendulum is
       
[ẇ ; v̇] = [v ; −sin w] = [v ; 0] + [0 ; −sin w] ,   (12.5.0.15)

with state space R2 . Derive formulas for the Strang-splitting single step method applied to the math-
ematical pendulum equation and using the additive decomposition of the right-hand-side vectorfield
suggested in (12.5.0.15). Distinguish the initial step, a regular step, and the final step.
Hint. What is the analytic solution of the ODE [ẋ ; ẏ] = [ϕ(y) ; 0] for an arbitrary function ϕ : ℝ → ℝ?
0


(Q12.5.0.14.B) Elaborate the extension of the Strang splitting single step method to the ODE

ẏ = f(y) + g(y) + r(y) , f, g, r : D ⊂ R N → R N .

Develop formulas relying on the exact evolutions Φhf , Φhg , and Φrh for the ODEs ẏ = f(y), ẏ = g(y),
and ẏ = r(y).
(Q12.5.0.14.C) For a symmetric positive definite matrix A ∈ R N,N consider the autonomous ODE on
state space R N :
ẏ = −Ay + [sin(π y_j)]_{j=1}^{N} ,   y = [y_j]_{j=1}^{N} .   (12.5.0.16)

We know that the smallest eigenvalue λmin (A) of A is 1 and the largest λmax (A) can be as big as 109 .
(i) Based on the Strang splitting, propose an efficient second-order single-step method for (12.5.0.16).
(ii) What is the computational effort for every regular timestep of your method, asymptotically for
N → ∞?

Supplementary literature. [MQ02] offers a comprehensive review of splitting methods.

Bibliography

[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Heidelberg:
Springer, 2008 (cit. on pp. 815, 818, 849).
[DB02] P. Deuflhard and F. Bornemann. Scientific Computing with Ordinary Differential Equations.
2nd ed. Vol. 42. Texts in Applied Mathematics. New York: Springer, 2002 (cit. on pp. 840, 845).
[Han02] M. Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen
Rechnens. Mathematische Leitfäden. Stuttgart: B.G. Teubner, 2002 (cit. on pp. 820, 829, 846,
848, 853).
[MQ02] R.I. McLachlan and G.R.W. Quispel. “Splitting methods”. In: Acta Numerica 11 (2002) (cit. on
p. 858).
[NS02] K. Nipp and D. Stoffer. Lineare Algebra. 5th ed. Zürich: vdf Hochschulverlag, 2002 (cit. on
p. 822).
[QSS00] A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics. Vol. 37. Texts in Applied Math-
ematics. New York: Springer, 2000 (cit. on pp. 820, 829, 835, 849).
[Ran15] Joachim Rang. “Improved traditional Rosenbrock-Wanner methods for stiff
ODEs and DAEs”. In: J. Comput. Appl. Math. 286 (2015), pp. 128–144. DOI:
10.1016/j.cam.2015.03.010.
[Str09] M. Struwe. Analysis für Informatiker. Lecture notes, ETH Zürich. 2009 (cit. on pp. 824, 832).

Main Index: Terms and Keywords

LU -decomposition a posteriori, 583


existence, 156 a priori, 583
L2 -inner product, 563 Adding EPS to 1, 100
h-convergence, 542 affine invariance, 640
p-convergence, 544 AGM, 603
(Asymptotic) complexity, 83 Aitken-Neville scheme, 397
(Semi-)inner product, 510 algebraic convergence, 479, 482, 783
(Size of) best approximaton error, 474 algebraic dependence, 84
BLAS algebraically equivalent, 107
axpy, 79 aliasing, 530
C++ alternation theorem, 521
move semantics, 40 Analyticity of a complex valued function, 490
E IGEN: triangularView, 87 approximation
M ATLAB: cumsum, 88 uniform, 468
P YTHON: reshape, 68 approximation error, 468
3-term recursion arrow matrix, 195
for Chebychev polynomials, 496 Ass: “Axiom” of roundoff analysis, 99
for Legendre polynomials, 567 Ass: Analyticity of interpoland, 491
3-term recusion Ass: Global solutions, 768
orthogonal polynomials, 516 Ass: Sampling in a period, 451
5-points-star-operator, 366 Ass: Self-adjointness of multiplication operator,
515
a posteriori Ass: Sharpness of error bounds, 480
adaptive quadrature, 583 Ass: Smoothness of right-hand side vectorfield,
a posteriori adaptive, 495 789
a posteriori error bound, 608 asymptotic complexity, 83
a posteriori termination, 606 sharp bounds, 83
a priori asymptotic rate of linear convergence, 615
adaptive quadrature, 583 audio signal, 304
a priori termination, 606 augmented normal equations, 299
A-inner product, 729 autonomization, 765
A-orthogonal, 738 Autonomous ODE, 760
A-stability of a Runge-Kutta single step method, autonomous ODE, 757
844 AXPY operation, 740
A-stable single step method, 844 axpy operation, 79
Absolute and relative error, 98
absolute error, 98 B-splines, 448
absolute tolerance, 606, 638, 801 back substitution, 137
adaptive backward error analysis, 122
a posteriori, 495 backward substitution, 149
adaptive multigrid quadrature, 585 Bandbreite
adaptive quadrature, 583 Zeilen-, 200


banded matrix, 199 CCS format, 181


bandwidth, 199 cell
lower, 199 of a mesh, 540
minimizing, 204 CG
upper, 199 convergence, 745
barycentric interpolation formula, 394 preconditioned, 747
basis termination criterion, 740
cosine, 369 CG = conjugate gradient method, 736
orthonormal, 681 CG algorithm, 740
sine, 363 chain rule, 641
trigonometric, 322 1D, 50
Belousov-Zhabotinsky reaction, 798 channel, 304
bending energy, 432 Characteristic parameters of IEEE floating point
Bernstein approximant, 472 numbers, 97
Bernstein polynomials, 446, 472 characteristic polynomial, 680
Besetzungsmuster, 205 Chebychev expansion, 505
best approximation Chebychev nodes, 498
uniform, 521 Chebychev polynomials, 496, 743
best approximation error, 474, 513 3-term recursion, 496
Bezier curve, 443 Chebychev-interpolation, 494
bicg, 752 chemical reaction kinetics, 829
BiCGStab, 752 Cholesky decomposition
bisection, 618 costs, 210
BLAS, 76 circuit simulation
block LU-decomposition, 150 transient, 763
block matrix multiplication, 76 circulant matrix, 317
blow-up, 799 Classical Runge-Kutta method
blurring operator, 342 Butcher scheme, 796
Boundary edge, 186 Clenshaw algorithm, 506
Broyden cluster analysis, 292
quasi-Newton method, 659 coefficient matrix, 127
Broyden-Verfahren coil, 128
convergence monitor, 660 collocation, 836
Butcher scheme, 795, 840 collocation conditions, 837
cache miss, 69 collocation points, 837
cache thrashing, 69 collocation single step methods, 836
CAD, 440 column
cancellation, 102, 104 of a matrix, 55
capacitance, 128 column major matrix format, 65
capacitor, 128 column sum norm, 120
cardinal column transformation, 75
spline, 434 combinatorial graph Laplacian, 188
Cardinal basis, 385 Complete cubic spline interpolant, 546
cardinal basis, 385, 389 complexity
cardinal basis function, 434 asymptotic, 83
cardinal interpolant, 434, 554 linear, 86
Cauchy product of SVD, 270
of power series, 313 composite quadrature formulas, 575
Cauchy-Schwarz inequality, 485 Compressed Column Storage (CCS), 183
Causal channel/filter, 306 compressed row storage, 180
causal filter, 304 Compressed Row Storage (CRS), 183




computational cost convolution


Gaussian elimination, 139 discrete, 310, 312
computational costs discrete periodic, 315
LU-decomposition, 148 of sequences, 309, 313
QR-decomposition, 254 Convolution of sequences, 309
Computational effort, 82 Corollary: “Optimality” of CG iterates, 743
computational effort, 82, 633 Corollary: Best approximant by orthogonal
eigenvalue computation, 683 projection, 512
concave Corollary: Composition of orthogonal
data, 414 transformations, 240
function, 415 Corollary: Consistent Runge-Kutta single step
Condition (number) of a matrix, 133 methods, 795
condition number Corollary: Continuous local Lagrange
of a matrix, 133 interpolants, 541
spectral, 735 Corollary: Dimension of P2n T , 452

conjugate gradient method, 736 Corollary: Euclidean matrix norm and


consistency eigenvalues, 121
of iterative methods, 598 Corollary: Invariance of order under affine
fixed point iteration, 610 transformation, 560
Consistency of fixed point iterations, 610 Corollary: Lagrange interpolation as linear
Consistency of iterative methods, 598 mapping, 390
Consistent single step methods, 780 Corollary: ONB representation of best
constant approximant, 513
Lebesgue, 500 Corollary: Periodicity of Fourier transforms, 347
constitutive relations, 128, 381 Corollary: Piecewise polynomials Lagrange
constrained least squares, 297 interpolation operator, 541
Contractive mapping, 613 Corollary: Polynomial stability function of explicit
control points, 440 RK-SSM, 820
Convergence, 597 Corollary: Principal axis transformation, 681
convergence Corollary: Rational stability function of explicit
algebraic, 479, 482, 783 RK-SSM, 844
asymptotic, 630 Corollary: Smoothness of cubic Hermite
exponential, 479, 482, 485, 501, 783 polynomial interpolant, 418
global, 599 Corollary: Solvability of implicit Euler recursion,
iterative method, 597 776
linear, 600 Corollary: Stages limit order of explicit RK-SSM,
linear in Gauss-Newton method, 673 821
local, 599 Corollary: Uniqueness of least squares
numerical quadrature, 554 solutions, 224
quadratic, 603 Corollary: Uniqueness of QR-factorization, 240
rate, 600 Correct rounding, 98
convergence monitor, 660 cosine
of Broyden method, 660 basis, 369
convex transform, 369
data, 414 cosine matrix, 369
function, 415 cosine transform, 369
Convex combination, 443 costs
convex combination, 443 Cholesky decomposition, 210
Convex hull, 444 Crout’s algorithm, 147
Convex/concave data, 414 CRS, 180
convex/concave function, 415 CRS format




diagonal, 181 discrete periodic convolution, 315


cubic complexity, 84 two-dimensional, 339
cubic Hermite interpolation, 392, 418, 419 discretization
Cubic Hermite polynomial interpolant, 418 of a differential equation, 778
cubic spline interpolation discretization error, 782
error estimates, 546 discriminant formula, 103, 108
Cubic-spline interpolant, 427 divided differences, 405
curve, 441 domain specific language (DSL), 58
cyclic permutation, 198 domain of definition, 594
dot product, 70
damped Newton method, 654
double nodes, 392
damping factor, 655
double precision, 96
data fitting, 459, 665
DSL: domain specific language, 58
linear, 460
polynomial, 462 economical singular value decomposition, 266
data interpolation, 379 efficiency, 633
deblurring, 341 Eigen, 58
definite, 119 arrays, 61
dense matrix, 178 data types, 58
derivative initialisation, 60
in vector spaces, 641 sparse matrices, 183
Derivative of functions between vector spaces, eigen
641 accessing matrix entries, 60
descent methods, 729 Eigen: LDLT(), 210
destructor, 42 Eigen: LLT(), 210
DFT, 319, 324 eigenspace, 680
two-dimensional, 337 eigenvalue, 680
Diagonal dominance, 207 generalized, 682
diagonal matrix, 56 eigenvalue problem
diagonalization generalized, 682
for solving linear ODEs, 822 eigenvalues and eigenvectors, 680
of a matrix, 681 eigenvector, 680
diagonalization of local translation invariant generalized, 682
linear operators, 366 electric circuit, 127, 593
Diagonally dominant matrix, 207 resonant frequencies, 677
diagonally implicit Runge-Kutta method, 841 element
difference quotient, 104 of a matrix, 55
backward, 774 elementary arithmetic operations, 94, 99
forward, 774 elimination matrix, 145
symmetric, 776 embedded Runge-Kutta methods, 807
difference scheme, 774 Energy norm, 729
differential, 641 entry
dilation, 451 of a matrix, 55
direct power method, 690 envelope
DIRK-SSM, 841 matrix, 200
discrete L2 -inner product, 515, 518 Equation
Discrete convolution, 312 non-linear, 594
discrete convolution, 310, 312 equidistant mesh, 540
discrete evolution, 778 equidistribution principle
discrete Fourier transform, 319, 324 for quadrature error, 584
two-dimensional, 338 equivalence
Discrete periodic convolution, 315 of norms, 119




Equivalence of norms, 600 consistency, 610


ergodicity, 689 Newton’s method, 649
error floating point number, 95
absolute, 98 floating point numbers, 94, 95
relative, 98 forward elimination, 136
error estimator forward substitution, 149
a posteriori, 608 Fourier
error indicator matrix, 323
for extrapolation, 402 Fourier coefficient, 351
Euler method discrete, 332
explicit, 773 Fourier coefficients, 529
implicit, 775 Fourier modes, 530
implicit, stability function, 843 Fourier series, 347, 529
semi implicit, 850 Fourier transform, 347
Euler polygon, 773 discrete, 319, 324
Euler’s formula, 451 fractional order of convergence, 630
Euler’s iteration, 628 frequency domain, 128, 329
evolution operator, 768 frequency filtering, 328
Evolution operator/mapping, 768 Frobenius norm, 279
expansion full-rank condition, 224
asymptotic, 400 function
explicit Euler method, 773 concave, 415
Butcher scheme, 796 convex, 415
explicit midpoint rule function object, 43
Butcher scheme, 796 function representation, 382
for ODEs, 793
Gauss collocation single step method, 839
explicit Runge-Kutta method, 794
Gauss Quadrature, 559
Explicit Runge-Kutta single-step method, 794
Gauss-Legendre quadrature formulas, 567
explicit trapezoidal rule
Gauss-Newton method, 670
Butcher scheme, 796
Gauss-Radau quadrature formulas, 847
exponential convergence, 501
Gauss-Seidel preconditioner, 748
extended normal equations, 233
Gaussian elimination, 135
extended state space
block version, 142
of an ODE, 764
by rank-1 modifications, 142
extrapolation, 399
for non-square matrices, 140
Families of sparse matrices, 179 general least squares problem, 228
fast Fourier transform, 355 generalization error, 463
FFT, 355 Generalized condition number of a matrix, 229
Fill-in, 195 Generalized Lagrange polynomials, 392
fill-in, 195 Generalized solution of a linear system of
filter equations, 227
high pass, 333 Givens rotation, 243, 261
low pass, 333 Givens-Rotation, 248
Finding out EPS in C++, 99, 100 global solution
Finite channel/filter, 305 of an IVP, 767
finite filter, 304 GMRES, 752
Fitted polynomial, 519 Golub-Welsch algorithm, 568
fixed point, 610 gradient, 642, 731
fixed point form, 610 Gradient and Hessian, 642
fixed point interation, 609 Gram-Schmidt
fixed point iteration Orthonormalisierung, 721




Gram-Schmidt orthogonalisation, 91, 237 inductor, 128


Gram-Schmidt orthogonalization, 513, 721, 739 inexact splitting methods, 855
Gram-Schmidt orthonormalization, 711 inf, 96
graph partitioning, 700 infinity, 96
grid, 540 initial guess, 597, 609
grid cell, 540 initial value problem
grid function, 366 stiff, 832
grid interval, 540 initial value problem (IVP), 764
inner product
Halley’s iteration, 628 A-, 729
harmonic mean, 422 intermediate value theorem, 618
hat function, 384 interpolant
heartbeat model, 761 piecewise linear, 416
Hermite integral formula, 492 interpolation
Hermite interpolation barycentric formula, 394
cubic, 392 Chebychev, 494
Hermitian matrix, 57 complete cubic spline, 429
Hermitian/symmetric matrices, 57 cubic Hermite, 419
Hessenberg matrix, 258 Hermite, 392
Hessian, 57 Lagrange, 389
Hessian matrix, 642 natural cubic spline, 429
high pass filter, 333 periodic cubic spline, 429
Hilbert matrix, 107 piecewise linear, 384
homogeneous, 119 spline cubic, 427
Hooke’s law, 706 spline cubic, locality, 434
Horner scheme, 388 spline shape preserving, 435
for Bezier polynomials, 446 trigonometric, 451
Householder matrix, 241 interpolation operator, 386
Householder reflection, 241 interpolation problem, 381
interpolation scheme, 381
I/O-complexity, 83
inverse interpolation, 631
identity matrix, 56
inverse iteration, 701
IEEE standard 754, 96
preconditioned, 702
ill conditioned, 133
inverse matrix, 130
ill-conditioned problem, 123
Invertible matrix, 130
image segmentation, 692
invertible matrix, 130
image space, 217
iteration, 596
Image space and kernel of a matrix, 130
Halley’s, 628
implicit differentiation, 646
Euler’s, 628
implicit Euler method, 775
quadratical inverse interpolation, 628
implicit function theorem, 775
iteration function, 597, 609
implicit midpoint method, 776
iterative method, 596
Impulse response, 306
convergence, 597
impulse response, 306
IVP, 764
of a filter, 304
in place, 147, 148 Jacobi preconditioner, 748
in situ, 142, 148 Jacobian, 614, 638, 641
increment equations
linearized, 851 kernel, 217
increments kinetics
Runge-Kutta, 794, 840 of chemical reaction, 829
inductance, 128 Kirchhoff (current) law, 128




knots Lemma: Decay of Fourier coefficients, 532


spline, 425 Lemma: Diagonal dominance and definiteness,
Konvergenz 209
Algebraische, Quadratur, 572 Lemma: Diagonalization of circulant matrices,
Kronecker product, 89 323
Kronecker symbol, 55 Lemma: Equivalence of Gaussian elimination
Krylov space, 737 and LU-factorization, 159
for Ritz projection, 717 Lemma: Error representation for polynomial
Lagrange interpolation, 484
L-stable, 846
Lemma: Exact quadrature by equidistant
L-stable Runge-Kutta method, 846
trapezoidal rule, 581
Lagrange function, 298
Lemma: Existence of LU -decomposition, 146
Lagrange interpolation approximation scheme,
Lemma: Existence of LU-factorization with
478
pivoting, 156
Lagrange multiplier, 298
Lemma: Formula for Euclidean norm of a
Lagrangian, 298
Hermitian matrix, 121
Lagrangian (interpolation polynomial)
Lemma: Fourier coefficients of derivatives, 531
approximation scheme, 478
Lemma: Gerschgorin circle theorem, 681
Lagrangian multiplier, 298
lambda function, 33, 43 Lemma: Group of regular diagonal/triangular
Landau symbol, 83, 479 matrices, 74
Landau-O, 83 Lemma: Higher order local convergence of fixed
Lapack, 140 point iterations, 616
leading coefficient Lemma: Interpolation error estimates for
of polynomial, 387 exponentially decaying Fourier
Least squares coefficients, 537
with linear constraint, 297 Lemma: Kernel and range of (Hermitian)
least squares transposed matrices, 223
total, 296 Lemma: LU-factorization of diagonally dominant
least squares problem, 225 matrices, 207
Least squares solution, 218 Lemma: Ncut and Rayleigh quotient (→
least-squares problem [SM00, Sect. 2]), 695
non-linear, 666 Lemma: Necessary conditions for s.p.d., 57
least-squares solution Lemma: Perturbation lemma, 132
non-linear, 666 Lemma: Positivity of Gauss-Legendre
Lebesgue quadrature weights, 570
constant, 500 Lemma: Principal branch of the square root, 538
Lebesgue constant, 411, 486 Lemma: Properties of cosine matrix, 369
Legendre polynomials, 566 Lemma: Properties of Fourier matrices, 323
Lemma: rk ⊥ Uk , 736 Lemma: Properties of the sine matrix, 363
Lemma: Absolute conditioning of polynomial Lemma: Quadrature error estimates for
interpolation, 411 Cr -integrands, 571
Lemma: Affine pullbacks preserve polynomials, Lemma: Quadrature formulas from linear
475 interpolation schemes, 554
Lemma: Bases for Krylov spaces in CG, 739 Lemma: Residue formula for quotients, 491
Lemma: Basis property of Bernstein Lemma: S.p.d. LSE and quadratic minimization
polynomials, 446 problem, 729
Lemma: Cholesky decomposition, 209 Lemma: Sherman-Morrison-Woodbury formula,
Lemma: Criterion for local Liptschitz continuity, 174
766 Lemma: Similarity and spectrum → [Gut09,
Lemma: Cubic convergence of modified Newton Thm. 9.7], [DR08, Lemma 7.6], [NS02,
methods, 627 Thm. 7.2], 681




Lemma: Smoothness of solutions of ODEs, 758 of interpolation, 433


Lemma: Space of solutions of linear ODEs, 758 logistic differential equation, 759
Lemma: Stability function as approximation of Lotka-Volterra ODE, 760
exp for small arguments, 821 low pass filter, 333
Lemma: Sufficient condition for linear lower triangular matrix, 56
convergence of fixed point iteration, 615 LU-decomposition
Lemma: Sufficient condition for local linear blocked, 150
convergence of fixed point iteration, 614 computational costs, 148
Lemma: SVD and Euclidean matrix norm, 276 envelope aware, 201
Lemma: SVD and rank of a matrix → [NS02, existence, 146
Cor. 9.7], 267 in place, 148
Lemma: Taylor expansion of inverse distance LU-decomposition/LU-factorization, 145
function, 707 LU-factorization
Lemma: Theory of Arnoldi process, 722 envelope aware, 201
Lemma: Transformation of norms under affine of sparse matrices, 193
pullbacks, 476 with pivoting, 154
Lemma: Tridiagonal Ritz projection from CG
machine number, 95
residuals, 719
exponent, 95
Lemma: Unique solvability of linear least
machine numbers, 94, 96
squares fitting problem, 462
distribution, 96
Lemma: Uniqueness of orthonormal
extremal, 95
polynomials, 516
Machine numbers/floating point numbers, 95
Lemma: Zeros of Legendre polynomials, 567
machine precision, 99
Levinson algorithm, 374
mantissa, 95
Lie-Trotter splitting, 854 Markov chain, 372, 685
limit cycle, 830 stationary distribution, 685
limiter, 421 mass matrix, 707
line search, 730 Matrix
Linear channel/filter, 306 adjoint, 56
linear complexity, 84, 86 Hermitian, 681
Linear convergence, 600 Hermitian transposed, 56
linear correlation, 288 normal, 681
linear data fitting, 460 skew-Hermitian, 681
linear electric circuit, 127 transposed, 56
linear filter, 304 unitary, 681
Linear first-order ODE, 758 matrix
Linear interpolation operator, 386 banded, 199
linear ODE, 758 condition number, 133
linear operator, 386 dense, 178
diagonalization, 366 diagonal, 56
linear ordinary differential equation, 679 envelope, 200
linear regression, 84, 214 Fourier, 323
linear system of equations Hermitian, 57
multiple right hand sides, 141 Hessian, 642
linear system of equations, 127 lower triangular, 56
Lipschitz continuos function, 765, 766 normalized, 56
Lloyd-Max algorithm, 292 orthogonal, 236
Local and global convergence, 599 positive definite, 57
local Lagrange interpolation, 541 positive semi-definite, 57
local linearization, 638 rank, 130
locality sine, 363




sparse, 178 of an interpolation node, 392


storage formats, 65
NaN, 96
structurally symmetric, 203
Ncut, 693
symmetric, 57
nested
tridiagonal, 199
subspaces, 736
unitary, 236
nested spaces, 472
upper triangular, 56
Newton
matrix block, 55
basis, 404
matrix compression, 278
damping, 655
Matrix envelope, 200
damping factor, 655
matrix exponential, 822
monotonicity test, 655, 658
matrix factorization, 143, 144
simplified method, 648
Matrix norm, 120
Newton correction, 638
matrix norm, 120
simplified, 653
column sums, 120
Newton iteration, 637
row sums, 120
numerical Differentiation, 648
matrix storage
termination criterion, 652
envelope oriented, 203
Newton method
maximum likelihood, 219 1D, 620
member function, 31 damped, 654
mesh, 540 local quadratic convergence, 649
equidistant, 540 modified, 627
in time, 779 Newton’s law of motion, 707
temporal, 772 nodal analysis, 127, 593
mesh width, 540 transient, 763
mesh adaptation, 585 nodal polynomial, 495
mesh refinement, 585 nodal potentials, 128
Method node
Quasi-Newton, 658 double, 392
method, 31 for interpolation, 389
midpoint method in electric circuit, 127
implicit, stability function, 843 multiple, 391
midpoint rule, 557, 793 multiplicity, 392
Milne rule, 558 of a mesh, 540
min-max theorem, 696 quadrature, 552
minimal residual methods, 751 nodes, 389
model function, 620 Chebychev, 498
model reduction, 469 Chebychev nodes, 500
Modellfunktionsverfahren, 620 for interpolation, 389
modification techniques, 257 non-linear least-squares problem, 666
modified Newton method, 627 Non-linear least-squares solution, 666
monomial, 387 non-linear least-squares solution, 666
monomial representation non-normalized numbers, 96
of a polynomial, 387 Norm, 119
monotonic data, 414 norm, 119
Moore-Penrose pseudoinverse, 228 L1 , 410
move semantics, 40 L2 , 410
multi-point methods, 620, 628 ∞-, 119
multiplicity 1-, 119
geometric, 680 energy-, 729
of a spline knot, 448 Euclidean, 119




Frobenius norm, 279 orthonormal basis, 513, 681


of matrix, 120 Orthonormal polynomials, 516
Sobolev semi-, 486 overfitting, 463
supremum, 410 overflow, 96, 101
normal equations, 220, 512 overloading
augmented, 299 of functions, 30
extended, 233 of operators, 31
with constraint, 299
page rank, 685
normalization, 690
stochastic simulation, 685
Normalized cut, 693
parameter estimation, 214
normalized lower triangular matrix, 145
parameterization, 441
normalized triangular matrix, 56
PARDISO, 192
not a number, 96
partial pivoting, 153, 155
nullspace, 217
pattern
Nullstellenbestimmung
of a matrix, 72
Modellfunktionsverfahren, 620
PCA, 284
numerical Differentiation
PCG, 747
Newton iteration, 648
Peano
numerical rank, 274
Theorem of, 766
Numerical differentiation
penalization, 697
roundoff, 106
penalty parameter, 698
numerical differentiation, 104
periodic
numerical quadrature, 550
function, 451
numerical rank, 270
periodic sequence, 314
ODE, 764 periodic signal, 314
autonomous, 757 Periodic time-discrete signal, 314
linear, 758 permutation, 156
scalar, 758, 772 Permutation matrix, 156
ODE, right-hand-side function, 757 permutation matrix, 156, 262
Ohmic resistor, 128 perturbation lemma, 132
one-point methods, 620 Petrov-Galerkin condition, 752
one-step error, 786 phase space
order of an ODE, 764
of a discrete evolution, 788 phenomenological model, 761
of an ODE, 765 Picard-Lindelöf
of quadrature formula, 560 Theorem of, 766
Order of a discrete evolution operator, 788 Piecewise cubic Hermite interpolant (with exact
Order of a quadrature rule, 560 slopes), 544
Order of a single step method, 785 PINVIT, 702
order of convergence, 602 pivot, 136–138
fractional, 630 pivot row, 136, 138
ordinary differential equation pivoting, 152
linear, 679 partial, 153
ordinary differential equation (ODE), 764 Planar curve, 441
oregonator, 798 Planar triangulation, 185
orthogonal complement, 268 point spread function, 341
orthogonal matrix, 236 polar decomposition, 268
orthogonal polynomials, 566 polynomial
orthogonal projection, 512 characteristic, 680
Orthogonality, 510 generalized Lagrange, 392
Orthonormal basis, 513 Lagrange, 389




polynomial curve, 441 quadratic inverse interpolation, 632


polynomial fitting, 462 quadratical inverse interpolation, 628
polynomial interpolation quadrature
existence and uniqueness, 390 adaptive, 583
generalized, 391 polynomial formulas, 556
polynomial space, 387 quadrature formula
positive definite order, 560
criteria, 57 Quadrature formula/quadrature rule, 552
matrix, 57 quadrature node, 552
potentials quadrature numerical, 550
nodal, 128 quadrature point, 552
power series, 488 quadrature weight, 552
power spectrum quasi-linear system, 644
of a signal, 333 Quasi-Newton method, 658, 659
preconditioned CG method, 747
Radau RK-method
preconditioned inverse iteration, 702
order 3, 847
Preconditioner, 746
order 5, 847
preconditioner, 746
Rader’s algorithm, 359
preconditioning, 745
radiative heat transfer, 315
predator-prey model, 760
radius of convergence
principal, 758
of a power series, 489
principal axis, 292
range, 217
principal axis transformation, 733
rank
principal component, 289
column rank, 130
principal component analysis (PCA), 284
computation, 270
principal minor, 149
numerical, 270, 274
principal orthogonal decompostion (POD), 290
of a matrix, 130
problem
row rank, 130
ill conditioned, 133
Rank of a matrix, 130
ill-conditioned, 123
rank-1 modification, 142, 257
sensitivity, 131
rank-1-matrix, 86
well conditioned, 133
rank-1-modification, 174, 659
procedural form, 550
rate
product rule, 642
of algebraic convergence, 479, 482, 783
1D, 50
of convergence, 600
propagated error, 786
Rayleigh quotient, 691, 696
pullback, 475, 553
Rayleigh quotient iteration, 702
Punkt
Real-analytic functions, 488
stationär, 761
Region of (absolute) stability, 827
power method
region of absolute stability, 827
direct, 690
regular matrix, 130
Python, 64
Regular refinemnent of a planar triangulation,
QR algorithm, 682 188
QR-algorithm with shift, 683 relative error, 98
QR-decomposition, 93, 238 relative tolerance, 606, 638, 801
computational costs, 254 rem:Fspec, 323
QR-factorization, QR-decomposition, 242 Residual, 158
quadratic complexity, 84 residual quantity, 703
quadratic convergence, 616 Ricati differential equation, 772
quadratic eigenvalue problem, 677 Riccati differential equation, 773
quadratic functional, 729 Riemann sum, 350




right hand side shape


of an ODE, 764 preservation, 417
right hand side vector, 127 preserving spline interpolation, 435
rigid body mode, 708 Sherman-Morrison-Woodbury formula, 174
Ritz projection, 713, 717 shifted inverse iteration, 701
Ritz value, 714 signal
Ritz vector, 714 periodic, 314
root of unity, 322 time-discrete, 303
roots of unity, 580 similarity
rounding, 98 of matrices, 681
rounding up, 98 similarity function
roundoff for image segmentation, 693
for numerical differentiation, 106 similarity transformations, 681
row similary transformation
of a matrix, 55 unitary, 682
row major matrix format, 65 Simpson rule, 558
ROW methods, 852 sine
row sum norm, 120 basis, 363
row transformation, 75, 136, 144 matrix, 363
Runge’s example, 409 transform, 363
Runge-Kutta Sine transform, 363
increments, 794, 840 single precicion, 96
Runge-Kutta method, 794, 840 single step method
L-stable, 846 A-stability, 844
Runge-Kutta methods Single-step method, 779
embedded, 807 single-step method, 779
semi-implicit, 850 singular value decomposition, 264, 265
stability function, 820, 843 Singular value decomposition (SVD), 265
Runge-Kutta single-step method, 840 SIR model, 762
saddle point problem, 298 slopes
matrix form, 299 for cubic Hermite interpolation, 418
scalar ODE, 772 Smoothed triangulation, 187
scaling Solution of an ordinary differential equation, 757
of a matrix, 74 Space of trigonometric polynomials, 451
scaling invariance, 101 Sparse matrix, 178, 179
scheme sparse matrix, 178
Horner, 388 COO format, 179
Schur LU-factorization, 193
Komplement, 150 triplet format, 179
Schur complement, 150, 171, 174 sparse matrix storage formats, 179
scientific notation, 95 spectral condition number, 735
secant condition, 659 spectral partitioning, 700
secant method, 629, 632, 658 spectral radius, 680
segmentation spectrum, 680
of an image, 692 of a matrix, 732
semi-implicit Euler method, 850 spline, 425
seminorm, 486 cardinal, 434
sensitive dependence, 123 complete cubic, 429
sensitivity cubic, 427
of a problem, 131 cubic, locality, 434
of polynomial interpolation, 409 knots, 425




natural cubic, 429 tangent field, 772


periodic cubic, 429 Taylor expansion, 50, 616
physical, 432 Taylor polynomial, 471
shape preserving interpolation, 435 Taylor series, 490
Splines, 425 Taylor’s formula, 471
splitting template, 32
Lie-Trotter, 854 tensor product, 70
Strang, 854 tent function, 384
splitting methods, 853 Teopltiz matrices, 371
inexact, 855 termination criterion, 605
spy, 73 ideal, 606
stability Newton iteration, 652
region of, 827 residual based, 606
stability function Theorem: → [Han02, Thm. 25.4], 702
of explicit Runge-Kutta methods, 820 Theorem: L∞ polynomial best approximation
of Runge-Kutta methods, 843 estimate, 474
stable Theorem: (Absolute) stability of explicit RK-SSM
algorithm, 122 for linear systems of ODEs, 826
numerically, 122 Theorem: 2D convolution theorem, 340
Stable algorithm, 122 Theorem: 3-term recursion for Chebychev
stages, 841 polynomials, 496
state space Theorem: 3-term recursion for orthogonal
of an ODE, 764 polynomials, 517
stationary distribution, 685 Theorem: Courant-Fischer min-max theorem
steepest descent, 730 → [GV89, Thm. 8.1.2], 696
Stiff IVP, 832 Theorem:
stiffness matrix, 707 Variation -diminishing property of Bezier curves
stochastic matrix, 686 445
stochastic simulation of page rank, 685 Theorem: Banach’s fixed point theorem, 613
stopping rule, 605 Theorem: Basis property of B-splines, 449
Strang splitting, 854 Theorem: Bernstein basis representation of
Strassen’s algorithm, 85 Bezier curves, 446
Structurally symmetric matrix, 203 Theorem: Best low rank approximation, 279
structurally symmetric matrix, 203 Theorem: Bezier curves stay in convex hull, 445
sub-matrix, 55 Theorem: Bound for spectral radius, 680
sub-multiplicative, 120 Theorem: Cauchy integral theorem, 535
subspace correction, 736 Theorem: Chebychev alternation theorem, 521
subspace iteration Theorem: Commuting matrices have the same
for direct power method, 715 eigenvectors, 321
subspaces Theorem: Composition and products of analytic
nested, 736 functions, 493
SuperLU, 192 Theorem: Conditioning of LSEs, 132
surrogate function, 469 Theorem: Consistency and convergence, 598
SVD, 264, 265 Theorem: Convergence of approximation by
symmetric matrix, 57 cubic Hermite interpolation, 545
Symmetric positive definite (s.p.d.) matrices, 57 Theorem: Convergence of CG method, 743
symmetry Theorem: Convergence of direct power method
structural, 203 → [DR08, Thm. 25.1], 692
system matrix, 127 Theorem: Convergence of gradient
system of equations method/steepest descent, 735
linear, 127 Theorem: Convolution of sequences commutes,




310 Theorem: Mean square (semi-)norm/Inner


Theorem: Convolution theorem, 327, 352 product (semi-)norm, 511
Theorem: Cost for solving triangular systems, Theorem: Minimax property of the Chebychev
166 polynomials, 497
Theorem: Cost of Gaussian elimination, 166 Theorem: Monotonicity preservation of limited
Theorem: Criteria for invertibility of matrix, 130 cubic Hermite interpolation, 423
Theorem: Dimension of space of polynomials, Theorem: Obtaining least squares solutions by
387 solving normal equations, 221
Theorem: Dimension of spline space, 426 Theorem: Optimality of natural cubic spline
Theorem: Divergent polynomial interpolants, 481 interpolant, 432
Theorem: Elementary properties of B-splines, Theorem: Order of collocation single step
449 method, 840
Theorem: Envelope and fill-in, 200 Theorem: Order of simple splitting methods, 855
Theorem: Equivalence of all norms on finite Theorem: Positivity of Clenshaw-Curtis weights,
dimensional vector spaces, 601 559
Theorem: Existence & uniqueness of Theorem: Preservation of Euclidean norm, 236
generalized Lagrange interpolation Theorem: Property of linear, monotonicity
polynomials, 392 preserving interpolation into C1 , 423
Theorem: Existence & uniqueness of Lagrange Theorem: Pseudoinverse and SVD, 274
interpolation polynomial, 390 Theorem: QR-decomposition, 239
Theorem: Existence of n-point quadrature Theorem: QR-decomposition “preserves
formulas of order 2n, 565 bandwidth”, 248
Theorem: Existence of least squares solutions, Theorem: Quadrature error estimate for
219 quadrature rules with positive weights,
Theorem: Exponential convergence of 571
trigonometric interpolation for analytic Theorem: Rayleigh quotient, 696
interpolands, 537 Theorem: Region of stability of Gauss
Theorem: Exponential decay of Fourier collocation single step methods, 845
coefficients of analytic functions, 535 Theorem: Representation of interpolation error,
Theorem: Finite-smoothness L2 -error estimate 483
for trigonometric interpolation, 532 Theorem: Residue theorem, 490
Theorem: Formula for generalized solution, 228 Theorem: Schur’s lemma, 681
Theorem: Gaussian elimination for s.p.d. Theorem: Sensitivity of full-rank linear least
matrices, 208 squares problem, 230
Theorem: Gram-Schmidt orthonormalization, Theorem: Series of analytic functions, 539
514 Theorem: Singular value decomposition (SVD),
Theorem: Implicit function theorem, 775 264
Theorem: Isometry property of the Fourier Theorem: Solution of POD problem, 291
transform, 353 Theorem: Span property of G.S. vectors, 237
Theorem: Kernel and range of A⊤ A, 223 Theorem: Stability function of general
Theorem: Least squares solution of data fitting Runge-Kutta methods, 843
problem, 461 Theorem: Stability function of some explicit
Theorem: Local quadratic convergence of Runge-Kutta methods, 820
Newton’s method, 651 Theorem: Stability of Gaussian elimination with
Theorem: Local shape preservation by partial pivoting, 160
piecewise linear interpolation, 417 Theorem: Stability of Householder QR [Hig02,
Theorem: Maximal order of n-point quadrature Thm. 19.4], 250
rule, 562 Theorem: Sufficient order conditions for
Theorem: Mean square norm best quadrature rules, 560
approximation through normal Theorem: Taylor’s formula, 616
equations, 512 Theorem: Theorem of Peano & Picard-Lindelöf




[Ama83, Satz II(7.6)], [Str09, Satz 6.5.1], trigonometric transformations, 363


[DR08, Thm. 11.10], [Han02, triplet format, 179
Thm. 73.1], 766 truss structure
Theorem: Uniform approximation by vibrations, 705
polynomials, 472 trust region method, 674
time domain, 329 Types of asymptotic convergence of
Time-invariant channel/filter, 305 approximation schemes, 479
time-invariant filter, 304 Types of matrices, 56
timestep (size), 774
timestep constraint, 821 UMFPACK, 192
timestepping, 773 underflow, 96, 101
Toeplitz matrix, 373 uniform approximation, 468
Toeplitz solvers uniform best approximation, 521
fast algorithms, 376 Uniform convergence
tolerance, 606 of Fourier series, 348
absolute, 801 unit vector, 55
absoute, 606, 638 Unitary and orthogonal matrices, 236
for adaptive timestepping for ODEs, 800 unitary matrix, 236
for termination, 606 unitary similary transformation, 682
realtive, 801 upper Hessenberg matrix, 722
relative, 606, 638 upper triangular matrix, 56, 136, 145
total least squares, 296
trajectory, 761 Vandermonde matrix, 391
transform variation-diminishing property, 445
cosine, 369 variational calculus, 432
fast Fourier, 355 vector field, 764
sine, 363 vectorization
transformation matrix, 76 of a matrix, 67
trapezoidal rule, 557, 580, 793 Vieta’s formula, 108
for ODEs, 793
WAV file format, 304
trend, 284
Weddle rule, 558
trial space
weight
for collocation, 837
quadrature, 552
triangle inequality, 119
weight function, 515
triangular linear systems, 170
weighted L2 -inner product, 515
triangulation, 185
well conditioned, 133
tridiagonal matrix, 199
trigonometric basis, 322 Young’s modulus, 706
trigonometric interpolation, 451, 526
Trigonometric polynomial, 352 Zerlegung
trigonometric polynomial, 352 LU, 149
trigonometric polynomials, 451, 526 zero padding, 313, 318, 374, 456



List of Symbols

(A)i,j = ˆ reference to entry aij of matrix A, 55 k·k =ˆ norm on vector space, 119
(A)k:l,r:s =ˆ reference to submatrix of A Pk , 387
spanning rows k, . . . , l and columns Ψh y =ˆ discrete evolution for autonomous ODE,
r, . . . , s, 55 778
ˆ i-th component of vector x, 54
( x )i = R(A) = ˆ image/range space of a matrix, 130,
( xk ) ∗n (yk ) = ˆ discrete periodic convolution, 315 217
0
C (I) = ˆ space of continuous functions I → R, (·, ·)V = ˆ inner product on vector space V , 510
410 Sd,M , 425
C1 ([ a, b]) = ˆ space of continuously differentiable A† = ˆ Moore-Penrose pseudoinverse of A, 228
functions [ a, b] 7→ R, 418 A =⊤ ˆ transposed matrix, 56
J ( t0 , y0 ) =ˆ maximal domain of definition of a I= ˆ identity matrix, 56
solution of an IVP, 766 h∗x= ˆ discrete convolution of two vectors, 312
O= ˆ zero matrix, 56 x ∗n y = ˆ discrete periodic convolution of vectors,
O(·)= ˆ Landau symbol, 83 315

V = ˆ orthogonal complement of a subspace, ˆ complex conjugation, 56
z̄ =
223 C − := {z ∈ C: Re z < 0}, 844
E= ˆ expected value of a random variable, 372 K= ˆ generic field of numbers, either R or C, 54
T n,n
Pn = ˆ space of trigonometric polynomials of K∗ = ˆ set of invertible n × n matrices, 131
degree n, 451 M= ˆ set of machine numbers, 94
Rk (m, n) = ˆ set of rank-k matrices, 279 δij =ˆ Kronecker symbol, 389
DFTn = ˆ discrete Fourier transform of length n, δij =ˆ Kronecker symbol, called “Kronecker delta”
324 by Wikipedia, 55
DΦ = ˆ Jacobian of Φ : D 7→ R n at x ∈ D, 614 δij =ˆ Kronecker symbol, called “Kronecker delta”
Dy f = ˆ Derivative of f w.r.t. y (Jacobian), 766 by Wikipedia, 306
EPS = ˆ machine precision, 99 ℓ ∞ (Z ) = ˆ space of bounded bi-infinite
EigAλ = ˆ eigenspace of A for eigenvalue λ, 680 sequences, 305 √
IT =ˆ Lagrange polynomial interpolation operator ı= ˆ imaginary unit, “ı := −1”, 128
based on node set T , 393 κ (A) = ˆ spectral condition number, 735
RA = ˆ range/column space of matrix A, 267 λT = ˆ Lebesgue constant for Lagrange
N (A) = ˆ kernel/nullspace of a matrix, 130, 217 interpolation on node set T , 411
NA = ˆ nullspace of matrix A, 267 λmax = ˆ largest eigenvalue (in modulus), 735
Kl (A, z) = ˆ Krylov subspace, 737 λmin = ˆ smallest eigenvalue (in modulus), 735
LT = ˆ Lagrangian (interpolation polynomial) ( xk ) ∗ ( hk ) =ˆ convolution of two sequences, 309
approximation scheme on node set T , 1 = [1, . . . , 1]⊤ , 820, 828
478 Ncut(X ) = ˆ normalized cut of subset of
kAx − bk2 → min = ˆ minimize kAx − bk2 , weighted graph, 693
226 argmin = ˆ (global) minimizer of a functional, 730
kAk2F , 279 cond(A), 133
kxk A = ˆ energy norm induced by s.p.d. matrix cut(X ) = ˆ cut of subset of weighted graph, 693
A, 729 ˆ square diagonal matrix, 56
diag(d1 , . . . , dn ) =
k·k = ˆ Euclidean norm of a vector ∈ K n , 91 distk·k ( x, V ) = ˆ distance of an element of a


normed vector space from set V , 474 ⋆, 99


e
env(A), 200 S1 = ˆ unit circle in the complex plane, 452
lsq(A, b) =ˆ set of least squares solutions of m(A), 199
ax = b, 218 fbj = ˆ j-th Fourier coefficient of periodic function
nnz, 178 f , 351
rank(A) = ˆ rank of matrix A, 130 (
f = k ) ˆ k-th derivative of function
sgn =ˆ sign function, sgn(0) := 0 , 244 f : I ⊂ R → K, 471
sgn =ˆ sign function, 421 (
f = k ) ˆ k derivative of f , 115
vec(A) = ˆ vectorization of a matrix, 67 m(A), 199
weight(X ) = ˆ connectivity of subset of weighted ˆ divided difference, 405
y [ ti , . . . , ti +k ] =
graph, 693 f= ˆ right hand side of an ODE, 764
=ˆ complex conjugation, 510 k x k1 , 119
m(A), 199 k x k2 , 119
ρ(A) = ˆ spectral radius of A ∈ K n,n , 680 k x k∞ , 119
ρA (u) =ˆ Rayleigh quotient, 691 ˙= ˆ Derivative w.r.t. time t, 757
σ(A) = ˆ spectrum of matrix A, 680
σ (M) = ˆ spectrum of matrix M, 732 TOL tolerance, 800



Examples and Remarks

LU-decomposition of sparse matrices, 194 differential equations, 679


L2 -error estimates for polynomial interpolation, Convergence of PINVIT , 704
485 Convergence of subspace variant of direct power
h-adaptive numerical quadrature, 588 method , 715
p-convergence of piecewise polynomial Direct power method , 691
interpolation, 544 Eigenvalue computation with Arnoldi process ,
2
L ()0, 1[]: Natural setting for trigonometric 724
interpolation, 529 Impact of roundoff on Lanczos process , 720
’auto’ in E IGEN codes, 62 Lagrange polynomials for uniformly spaced
(Nearly) singular LSE in shifted inverse iteration, nodes, 389
701 Lanczos process for eigenvalue computation ,
(Relative) point locations from distances, 216 720
Ex. 2.3.3.6 cnt’d, 157 Page rank algorithm , 685
Ex. 5.4.4.5 cnt’d, 439 Power iteration with Ritz projection , 715
General non-linear systems of equations, 594 qr based orthogonalization , 712
Complex step differentiation [LM67], 116 Rayleigh quotient iteration , 702
L2 ([−1, 1])-orthogonal polynomials → Resonances of linear electrical circuits , 677
[Han02, Bsp. 33.2], 517 Ritz projections onto Krylov space , 717
BLAS calling conventions, 80 Runtimes of eig , 684
E IGEN-based code: debug mode and release Stabilty of Arnoldi process , 723
mode, 63 Subspace power iteration with orthogonal
E IGEN in use, 63 projection , 710
‘Partial LU -decompositions” of principal minors, Vibrations of a truss structure , 705
149
“Annihilating” orthogonal transformations in 2D, A broader view of cancellation, 117
241 A data type designed for interpolation problems,
“Behind the scenes” of MyVector arithmetic, 46 383
“Butcher barriers” for explicit RK-SSM, 797 A function that is not locally Lipschitz continuous,
“Explosion equation”: finite-time blow-up, 767 766
“Failure” of adaptive timestepping, 804 A parameterized curve, 441
“Fast” matrix multiplication, 85 A posteriori error bound for linearly convergent
“Full-rank condition”, 667 iteration, 608
“Squeezed” DFT of a periodically truncated A posteriori termination criterion for plain CG,
signal, 345 740
“auto” considered harmful, 37 A priori and a posteriori choice of optimal
“Overfitting”, 463 interpolation nodes, 495
B = B H s.p.d. mit Cholesky-Zerlegung, 682 A special quasi-linear system of equations, 645
L-stable implicit Runge-Kutta methods, 847 Access to LU-factors in E IGEN, 170
2-norm from eigenvalues, 735 Accessing matrix data as a vector, 65
3-Term recursion for Legendre polynomials, 567 Adaptive explicit RK-SSM for scalar linear decay
ODE, 815
Analytic solution of homogeneous linear ordinary Adaptive quadrature in P YTHON, 589


Adaptive timestepping for mechanical problem, Characteristics of stiff IVPs, 834


810 Chebychev interpolation errors, 501
Adding EPS to 1, 100 Chebychev interpolation of analytic function, 504
Affine invariance of Newton method, 640 Chebychev interpolation of analytic functions,
Algorithm for cluster analysis, 292 502
Analytic functions everywhere, 489 Chebychev nodes on arbitrary interval, 498
Angles in a triangulation, 215 Chebychev representation of built-in functions,
Application of modified Newton methods, 628 509
Approximate computaton of Fourier coefficients, Chebychev vs equidistant nodes, 499
581 Choice of norm, 803
Approximation by discrete polynomial fitting, 519 Class PolyEval, 407
Arnoldi process Ritz projection, 722 Classification from measured data, 284
Asymptotic behavior of Lagrange interpolation Clenshaw-Curtis quadrature rules, 559
error, 478 Commonly used embedded explicit Runge-Kutta
Asymptotic complexity of Householder methods, 808
QR-factorization, 251 Communicating special properties of system
Asymptotic methods for the computation of matrices in E IGEN, 166
Gauss-Legendre quadrature rules, 569 Commutativity of discrete convolution, 312
Auxiliary construction for shape preserving Composite quadrature and piecewise polynomial
quadratic spline interpolation, 437 interpolation, 577
Average-based pecewise cubic Hermite Composite quadrature rules vs. global
interpolation, 419 quadrature rules, 579
Computation of nullspace and image space of
Bad behavior of global polynomial interpolants,
matrices, 271
415
Computational effort for eigenvalue
Banach’s fixed point theorem, 613
computations, 683
Bernstein approximants, 473
Computing Gauss nodes and weights, 568
Block Gaussian elimination, 142
Computing the zeros of a quadratic polynomial,
Block LU-factorization, 150
103
Blow-up, 799
Conditioning and relative error, 162
Blow-up of explicit Euler method, 816
Conditioning of conventional row
Blow-up solutions of vibration equations, 705
transformations, 249, 250
Bound for asymptotic rate of linear convergence,
Conditioning of normal equations, 232
615
Conditioning of the extended normal equations,
Breakdown of associativity, 99
233
Broyden method for a large non-linear system,
Connetion with linear least squares problems
664
Chapter 3, 513
Broyden’s quasi-Newton method: convergence,
Consistency of implicit midpoint method, 780
660
Consistent right hand side vectors are highly
Butcher scheme for some explicit RK-SSM, 796
improbable, 217
Calling BLAS routines from C/C++, 80 Constitutive relations from measurements, 381
Cancellation during the computation of relative Construction of higher order Runge-Kutta single
errors, 107 step methods, 796
Cancellation in decimal system, 104 Contiguous arrays in C++, 38
Cancellation in Gram-Schmidt orthogonalisation, Control of a robotic arm, 430
107 Convergence monitors, 660
Cancellation when evaluating difference Convergence of CG as iterative solver, 742
quotients, 104 Convergence of Fourier sums, 348
Cancellation: roundoff error analysis, 107 Convergence of global quadrature rules, 572
Cardinal shape preserving quadratic spline, 439 Convergence of gradient method, 733
CG convergence and spectrum, 745 Convergence of Hermite interpolation, 545




Convergence of Hermite interpolation with exact Different choices for consistent iteration
slopes, 544 functions (III), 616
Convergence of inexact simple splitting methods, Different meanings of “convergence”, 479
855 Differentiating and integrating splines, 426
Convergence of Krylov subspace methods for Discrete evolutions for non-autonomous ODEs,
non-symmetric system matrix, 753 779
Convergence of naive semi-implicit Radau Discretization, 778
method, 852 Distribution of machine numbers, 96
Convergence of Newton’s method for matrix Divided differences and derivatives, 408
inversion, 650
Convergence of Newton’s method in 2D, 649 Economical vs. full QR-decomposition, 257
Convergence of quadratic inverse interpolation, Efficiency of FFT for different backend
632 implementations, 361
Convergence of Remez algorithm, 524 Efficiency of FFT-based solver, 368
Convergence of secant method, 630 Efficiency of iterative methods, 636
Convergence of semi-implicit midpoint method, Efficient associative matrix multiplication, 86
851 Eigenvectors of circulant matrices, 319
Convergence of simple Runge-Kutta methods, Eigenvectors of commuting matrices, 320
793 Empiric Convergence of collocation single step
Convergence of simple splitting methods, 854 methods, 838
Convergence rates for CG method, 744 Empiric convergence of equidistant trapezoidal
Convergence theory for PCG, 748 rule, 579
Convex least squares functional, 225 Empiric convergence of semi-implicit Euler
Convolution of causal sequences, 313 single-step method, 850
Cosine transforms for compression, 371 Envelope of a matrix, 200
Curve design based on B-splines, 450 Envelope oriented matrix storage, 203
Curves from interpolation, 442 Error of polynomial interpolation, 484
Error representation for generalized Lagrangian
Damped Broyden method, 661 interpolation, 484
Damped Newton method, 657 Estimation of “wrong quadrature error”?, 588
Data points confined to a subspace, 289 Estimation of “wrong” error?, 803
Deblurring by DFT, 341 Euler methods for stiff decay IVP, 835
Decay conditions for bi-infinite signals, 348 Evolution operator for Lotka-Volterra ODE, 769
Decay of cardinal basis functions for natural Explicit adaptive RK-SSM for stiff IVP, 814
cubic spline interpolation, 434 Explicit Euler method as a difference scheme,
Decimal floating point numbers, 95 774
Denoising by frequency filtering, 333 Explicit Euler method for damped oscillations,
Derivative of a bilinear form, 642 825
Derivative of matrix inversion, 646 Explicit representation of error of polynomial
Derivative of Euclidean norm, 643 interpolation, 483
Derivative-based local linear approximation of Explicit trapzoidal rule for decay equation, 818
functions, 637 Exploiting trigonometric identities to avoid
Detecting linear convergence, 601 cancellation, 109
Detecting order p > 1 of convergence, 603 Extracting triplets from Eigen::SparseMatrix,
Detecting periodicity in data, 330 184
Determining the domain of analyticity, 493 Extremal numbers in M, 95
Diagonalization of local translation invariant
linear grid operators, 366 Failure of damped Newton method, 657
diagonally dominant matrices from nodal Failure of Krylov iterative solvers, 753
analysis, 206 Fast Toeplitz solvers, 376
Different choices for consistent fixed point Feasibility of implicit Euler timestepping, 775
iterations (II), 611 FFT algorithm by matrix factorization, 357




FFT based on general factorization, 359 Impact of roundoff errors on CG, 741
FFT for prime vector length, 359 Implicit differentiation of F, 622
Filtering in Fourier domain, 352 Implicit nature of collocation single step
Fit of hyperplanes, 276 methods, 838
Fixed points in 1D, 612 Implicit RK-SSMs for stiff IVP, 845
Fractional order of convergence of secant Importance of numerical quadrature, 550
method, 630 In-situ LU-decomposition, 148
Frequency identification with DFT, 329 Inequalities between vector norms, 119
Frobenius’ derivation of the Hermite integral Initial guess for power iteration, 692
formula, 492 Initialization of sparse matrices in Eigen, 184
From higher order ODEs to first order systems, Inner products on spaces Pm of polynomials,
765 515
Full-rank condition, 224 Input errors and roundoff errors, 97
Instability of multiplication with inverse, 164
Gain through adaptivity, 804
interpolation
Gauss-Newton versus Newton, 672
piecewise cubic monotonicity preserving,
Gauss-Radau collocation SSM for stiff IVP, 848
422
Gaussian elimination, 136
shape preserving quadratic spline, 439
Gaussian elimination and LU-factorization, 144
Interpolation and approximation: enabling
Gaussian elimination for non-square matrices,
technologies, 470
140
Interpolation error estimates and the Lebesgue
Gaussian elimination via rank-1 modifications,
constant, 486
142
Interpolation error: trigonometric interpolation,
Gaussian elimination with pivoting for
528
3 × 3-matrix, 153
Interpolation of vector-valued data, 380
Generalization of data, 380
Intersection of lines in 2D, 134
Generalized bisection methods, 620
Generalized eigenvalue problems and Cholesky Justification of Ritz projection by min-max
factorization, 682 theorem, 714
Generalized Lagrange polynomials for Hermite
Kinetics of chemical reactions, 829
Interpolation, 392
Krylov methods for complex s.p.d. system
Generalized polynomial interpolation, 391
matrices, 729
Gibbs phenomenon, 528
Krylov subspace methods for generalized EVP,
Gradient method in 2D, 732
725
Gram-Schmidt orthogonalization of polynomials,
564 Least squares data fitting, 665
Gram-Schmidt orthonormalization based on Lebesgue Constant for Chebychev nodes, 500
MyVector implementation, 46 Lebesgue constant for equidistant nodes, 411
Group property of autonomous evolutions, 769 Linear Filtering, 334
Growth with limited resources, 759 Linear parameter estimation = linear data fitting,
461
Halley’s iteration, 624
Linear parameter estimation in 1D, 214
Heartbeat model, 761
linear regression, 218
Heating production in electrical circuits, 551
Linear regression for stationary Markov chains,
Hesse matrix of least squares functional, 225
372
Hidden summation, 87
Linear regression: Parameter estimation for a
Hump function, 488
linear model, 214
Image compression, 282 Linear system for quadrature weights, 560
Image segmentation, 692 Linear systems with arrow matrices, 171
Impact of choice of norm, 600 Linearly convergent iteration, 601
Impact of matrix data access patterns on Local approximation by piecewise polynomials,
runtime, 68 540




Local convergence of Newton’s method, 654 Non-linear electric circuit, 593


local convergence of the secant method, 631 Normal equations for some examples from
Loss of sparsity when forming normal equations, Section 3.0.1, 222
233 Normal equations from gradient, 223
LSE: key components of mathematical models in Notation for single step methods, 780
many fields, 127 Numerical Differentiation for computation of
LU-decomposition of flipped “arrow matrix”, 196 Jacobian, 648
Numerical stability and sensitive dependence on
Machine precision for IEEE standard, 99 data, 123
Magnetization curves, 414 Numerical summation of Fourier series, 349
Many choices for consistent fixed point NumPy command reshape, 68
iterations, 610
Many sequential solutions of LSE, 168 Orders of finite-difference single-step methods,
Mathematical functions in a numerical code, 382 789
Matrix inversion by means of Newton’s method, Orders of simple polynomial quadrature
647 formulas, 561
Matrix norm associated with ∞-norm and Oregonator reaction, 798
1-norm, 120 Origin of the term “Spline”, 432
Matrix representation of interpolation operator, Oscillating polynomial interpolant, 409
391 Output of explicit Euler method, 774
Meaning of full-rank condition for linear models, Overflow and underflow, 101
224
Meaningful “O-bounds” for complexity, 83 Parameter identification for linear time-invariant
Measuring the angles of a triangle, 215 filters, 371
Midpoint rule, 557 PCA for data classification , 288
Min-max theorem, 696 PCA of stock prices, 286
Minimality property of Broyden’s Perceived smoothness of cubic splines, 426
rank-1-modification, 659 Piecewise cubic interpolation schemes, 430
Model reduction in circuit simulation, 469 Piecewise linear interpolation, 384
Modified Horner scheme for evaluation of Piecewise polynomial interpolation, 542
Bezier polynomials, 446 Piecewise quadratic interpolation, 417
Monitoring convergence for Broyden’s Pivoting and numerical stability, 152
quasi-Newton method, 661 Pivoting destroys sparsity, 198
Monomial representation, 387 Polynomial fitting, 462
Multi-dimensional data interpolation, 381 Polynomial interpolation vs. polynomial fitting,
Multidimensional fixed point iteration, 615 463
Multiplication of Kronecker product with vector, Polynomial planar curves, 441
89 Poor shape-fidelity of high-degree Bezier curves,
Multiplying matrices in E IGEN, 77 445
Multiplying triangular matrices, 74 Power iteration, 689
Preconditioned Newton method, 626
Necessary condition for L-stability, 846 Predator-prey model, 760
Necessity of iterative approximation, 596 Predicting stiffness of non-linear IVPs, 833
Newton method and minimization of quadratic Principal axis of a point cloud, 291
functional, 670 Pseudoinverse and SVD, 274
Newton method in 1D, 621
Newton’s iteration; computational effort and QR-Algorithm, 682
termination, 652 QR-based solution of banded LSE, 255
Newton-Cotes formulas, 557 QR-based solution of linear systems of
Nodal analysis of linear electric circuit, 127 equations, 254
Non-linear cubic Hermite interpolation, 422 QR-decomposition of “fat” matrices, 242
Non-linear data fitting (II), 672 QR-decomposition of banded matrices, 248




Quadratic convergence, 603 S.p.d. Hessians, 57


Quadratic functional in 2D, 729 Sacrificing numerical stability for efficiency, 173
Quadratur Sampled audio signals, 303
Gauss-Legendre Ordnung 4, 563 Scaling a matrix, 74
Quadrature errors for composite quadrature Semi-implicit Euler single-step method, 850
rules, 578 Sensitivity of linear mappings, 131
Shape preservation of cubic spline interpolation,
Radiative heat transfer, 315 433
Rank defect in linear least squares problems, Shape preserving quadratic spline interpolation,
225 439
Rationale for adaptive quadrature, 584 Shifted inverse iteration, 701
Rationale for high-order single step methods, Significance of smoothness of interpoland, 484
791 Simple adaptive stepsize control, 803
Rationale for partial pivoting policy, 155 Simple adaptive timestepping for fast decay, 818
Rationale for using LU-decomposition in Simple composite polynomial quadrature rules,
algorithms, 149 575
Recursive LU-factorization, 148 Simple preconditioners, 748
Reducing bandwidth by row/column Simple Runge-Kutta methods by quadrature &
permutations, 204 boostrapping, 793
Reducing fill-in by reordering, 204 Simplified Newton method, 647
Reduction of discrete convolution to periodic Sine transform via DFT of half length, 364
convolution, 318 SIR model, 762
Regions of stability for simple implicit RK-SSM, Small residuals by Gaussian elimination, 163
843 Smoothing of a triangulation, 185
Regions of stability of some explicit RK-SSM, Solving the stage equations for implicit
827 RK-SSMs, 841
Relative error and number of correct digits, 98 Sound filtering by DFT, 333
Relevance of asymptotic complexity, 84 Sparse LU -factors, 195
Removing a singularity by transformation, 572 Sparse elimination for arrow matrix, 190
Reshaping matrices in E IGEN, 67 Sparse LSE in circuit modelling, 179
Residual based termination of Newton’s method, Sparse matrices from the discretization of linear
653 partial differential equations, 179
Resistance to currents map, 175 Special cases in IEEE standard, 96
Restarted GMRES, 752 Spectrum of Fourier matrix, 323
Reuse of LU-decomposition in inverse power Speed of convergence of polygonal methods,
iteration, 169 783
RK-SSM and quadrature rules, 795 spline
Roundoff effects in normal equations, 232 interpolants, approx. complete cubic, 546
Row and column transformations, 75 Splitting linear and decoupled terms, 857
Row swapping commutes with forward Splitting off stiff components, 856
elimination, 157 Square root function, 488
Row-wise & column-wise view of matrix product, Square root iteration as a Newton iteration, 621
71 Square root of a s.p.d. matrix, 745
Runge’s example, 480, 485 Stability by small random perturbations, 161
Runtime comparison for computation of Stability function and exponential function, 820
coefficient of trigonometric interpolation Stability functions of explicit Runge-Kutta single
polynomials, 456 step methods, 820
Runtime of Gaussian elimination, 139 Stable discriminant formula, 108
Runtimes of DFT implementations, 354 Stable implementation of Householder
Runtimes of elementary linear algebra reflections, 243
operations in E IGEN, 85 Stable orthonormalization by QR-decomposition,




Stable solution of LSE by means of QR-decomposition, 256
Stage form equations for increments, 841
Standard EIGEN lu() operator versus triangularView(), 167
Stepsize control detects instability, 822
STOP, when stationary in M, 607
Storing orthogonal transformations, 246
Strongly attractive limit cycle, 830
Subspace power methods, 716
Summation of exponential series, 113
SVD and additive rank-1 decomposition, 266
SVD-based computation of the rank of a matrix, 270
Switching to equivalent formulas to avoid cancellation, 110
Tables of quadrature rules, 554
Tangent field and solution curves, 772
Taylor approximation, 471
Termination criterion for contractive fixed point iteration, 617
Termination criterion for direct power iteration, 692
Termination criterion in pcg, 750
Termination of PCG, 749
Testing = 0.0 in Code 3.3.3.17, 246
Testing equality with zero, 100
Testing stability of matrix×vector multiplication, 122
The “matrix×vector-multiplication problem”, 119
The case of finite signals and filters, 310
The inverse matrix and solution of a LSE, 131
The message of asymptotic estimates, 573
The zoo of sparse matrix formats, 180
Timing polynomial evaluations, 398
Timing sparse elimination for the combinatorial graph Laplacian, 191
Tolerances and accuracy, 810
Trade cancellation for approximation, 115
Transformation of quadrature rules, 553
Transient circuit simulation, 763
Transient simulation of RLC-circuit, 822
Trend analysis, 284
Tridiagonal preconditioning, 748
Trigonometric interpolation of 1-periodic analytic functions, 534
Two-dimensional DFT in PYTHON, 338
Understanding the structure of product matrices, 71
Uniqueness of SVD, 267
Unitary similarity transformation to tridiagonal form, 683
Unstable Gram-Schmidt orthonormalization, 92
Using Intel Math Kernel Library (Intel MKL) from EIGEN, 81
Vandermonde matrix, 391
Vectorisation of a matrix, 67
Visualization of explicit Euler method, 773
Visualization: local affine linear approximation for n = 2, 638
Visualization: superposition of impulse responses, 308
Why using K = C?, 321
Wilkinson’s counterexample, 160



Abbreviations and Acronyms

CAD ≙ computer aided design, 440
CHI ≙ cubic Hermite interpolation, 418
CM ≙ circulant matrix, 317
CSI ≙ cubic spline interpolation, 426
DCONV ≙ discrete convolution, 312
EV ≙ eigenvector, 322
FRC ≙ full-rank condition, 224
GE ≙ Gaussian elimination, 135
HHM ≙ Householder matrix, 241
IC ≙ interpolation conditions, 380
IR ≙ impulse response, 306
IRK-SSM ≙ implicit Runge-Kutta single-step method, 842
IVP ≙ initial-value problem, 757
KCL ≙ Kirchhoff Current Law, 128
LIP ≙ Lagrange polynomial interpolation, 389
LPI ≙ Lagrangian (interpolation polynomial) approximation scheme, 478
LSE ≙ Linear System of Equations, 127
LT-FIR ≙ finite, linear, time-invariant, causal filter/channel, 306
N.I. ≙ Newton iteration, 621
NEQ ≙ normal equations, 221
NMT ≙ natural monotonicity test, 655, 658
OD ≙ overdetermined (for LSE), 213
ODE ≙ ordinary differential equation, 757
PCA ≙ principal component analysis, 284
PCONV ≙ discrete periodic convolution (of vectors), 315
POD ≙ principal orthogonal decomposition, 290
PP ≙ piecewise polynomial, 541
PPLIP ≙ piecewise polynomial Lagrange interpolation, 541
PSF ≙ point-spread function, 341
QR ≙ quadrature rule, 552
r.h.s. ≙ right-hand side, 757
SF ≙ stability function, 820, 828
SSM ≙ single-step method, 779
SVD ≙ singular value decomposition, 264
TI ≙ time-invariant, 305

German terms

L2-inner product = L2-Skalarprodukt, 563
back substitution = Rücksubstitution, 137
bandwidth = Bandbreite, 199
cancellation = Auslöschung, 102
capacitance = Kapazität, 128
capacitor = Kondensator, 128
circulant = zirkulant, 317
circulant matrix = zirkulante Matrix, 317
coil = Spule, 128
column (of a matrix) = (Matrix)spalte, 55
column transformation = Spaltenumformung, 75
column vector = Spaltenvektor, 54
composite quadrature formulas = zusammengesetzte Quadraturformeln, 575
computational effort = Rechenaufwand, 82
consistency = Konsistenz, 598
constitutive relation = Kennlinie, 381
constitutive relations = Bauelementgleichungen, 128
constrained least squares = Ausgleichsproblem mit Nebenbedingungen, 297
convergence = Konvergenz, 597
convolution = Faltung, 310
damped Newton method = gedämpftes Newton-Verfahren, 654
deblurring = Entrauschen, 341
dense matrix = vollbesetzte Matrix, 178
descent methods = Abstiegsverfahren, 729
difference scheme = Differenzenverfahren, 774
discrete convolution = diskrete Faltung, 312
divided differences = dividierte Differenzen, 405
dot product = (Euklidisches) Skalarprodukt, 70
eigenspace = Eigenraum, 680
eigenvalue = Eigenwert, 680
electric circuit = elektrischer Schaltkreis/Netzwerk, 127
energy norm = Energienorm, 729
envelope = Hülle, 200
extended normal equations = erweiterte Normalengleichungen, 233
family of functions = Funktionenschar, 758
field of numbers = Zahlenkörper, 54
fill-in = “fill-in”, 195
fixed point iteration = Fixpunktiteration, 609
floating point number = Gleitpunktzahl, 95
floating point numbers = Gleitkommazahlen, 95
forward elimination = Vorwärtselimination, 136
Fourier series = Fourierreihe, 347
frequency domain = Frequenzbereich, 329
Gaussian elimination = Gausselimination, 135
Hessian = Hesse-Matrix, 57
high pass filter = Hochpass, 333
identity matrix = Einheitsmatrix, 56
image segmentation = Bildsegmentierung, 692
impulse response = Impulsantwort, 306
in situ = am Ort, 148
in situ = an Ort und Stelle, 142
initial guess = Anfangsnäherung, 597
initial-value problem (IVP) = Anfangswertproblem, 757
intermediate value theorem = Zwischenwertsatz, 618
inverse iteration = inverse Iteration, 692
Kirchhoff (current) law = Kirchhoffsche Knotenregel, 128
knot = Knoten, 425
Krylov space = Krylovraum, 737
least squares = Methode der kleinsten Quadrate, 213
line search = Minimierung in eine Richtung, 730
linear system of equations = lineares Gleichungssystem, 127
low pass filter = Tiefpassfilter, 333
lower triangular matrix = untere Dreiecksmatrix, 56
LU-factorization = LR-Zerlegung, 143
machine number = Maschinenzahl, 95
machine numbers = Maschinenzahlen, 94
mass matrix = Massenmatrix, 707
matrix factorization = Matrixzerlegung, 143
mesh width = Gitterweite, 540
mesh/grid = Gitter, 540
multiplicity = Vielfachheit, 680
nested = geschachtelt, 472
nodal analysis = Knotenanalyse, 127
node = Knoten, 127
normal equations = Normalengleichungen, 220
order of convergence = Konvergenzordnung, 602
ordinary differential equation (ODE) = gewöhnliche Differentialgleichung, 757
overdetermined = überbestimmt, 213
overflow = Überlauf, 101
partial pivoting = Spaltenpivotsuche, 155
pattern (of a matrix) = Besetzungsmuster, 74
power method = Potenzmethode, 685
preconditioning = Vorkonditionierung, 745
primitive = Stammfunktion, 50, 758
principal axis transformation = Hauptachsentransformation, 681
principal component analysis = Hauptkomponentenanalyse, 284
quadratic functional = quadratisches Funktional, 729
quadrature node = Quadraturknoten, 552
quadrature weight = Quadraturgewicht, 552
resistor = Widerstand, 128
right hand side vector = rechte-Seite-Vektor, 127
rounding = Rundung, 98
roundoff error = Rundungsfehler, 94
row (of a matrix) = (Matrix)zeile, 55
row transformation = Zeilenumformung, 75
row vector = Zeilenvektor, 54
saddle point problem = Sattelpunktproblem, 298
sampling = Abtasten, 303
scaling = Skalierung, 74
singular value decomposition = Singulärwertzerlegung, 264
sparse matrix = dünnbesetzte Matrix, 178
speed of convergence = Konvergenzgeschwindigkeit, 600
steepest descent = steilster Abstieg, 730
stiffness matrix = Steifigkeitsmatrix, 707
storage format = Speicherformat, 65
subspace correction = Unterraumkorrektur, 736
system matrix = Koeffizientenmatrix, 127
tensor product = Tensorprodukt, 70
tent/hat function = Hutfunktion, 384
termination criterion = Abbruchbedingung, 605
time domain = Zeitbereich, 329
total least squares = totales Ausgleichsproblem, 296
transpose = transponieren/Transponierte, 54
truss = Stabwerk, 705
underflow = Unterlauf, 101
unit vector = Einheitsvektor, 55
upper triangular matrix = obere Dreiecksmatrix, 56
variational calculus = Variationsrechnung, 432
zero padding = Ergänzen durch Null, 456

