
Modern Computer Arithmetic

Richard P. Brent and Paul Zimmermann

Version 0.1
Contents

1 Integer Arithmetic
  1.1 Representation and Notations
  1.2 Addition and Subtraction
  1.3 Multiplication
    1.3.1 Naive Multiplication
    1.3.2 Karatsuba's Algorithm
    1.3.3 Toom-Cook Multiplication
    1.3.4 Fast Fourier Transform
    1.3.5 Unbalanced Multiplication
    1.3.6 Squaring
    1.3.7 Multiplication by a constant
  1.4 Division
    1.4.1 Naive Division
    1.4.2 Divisor Preconditioning
    1.4.3 Divide and Conquer Division
    1.4.4 Newton's Division
    1.4.5 Exact Division
    1.4.6 Only Quotient or Remainder Wanted
    1.4.7 Division by a Constant
    1.4.8 Hensel's Division
  1.5 Roots
    1.5.1 Square Root
    1.5.2 k-th Root
    1.5.3 Exact Root
  1.6 Gcd
    1.6.1 Naive Gcd
    1.6.2 Extended Gcd
    1.6.3 Divide and Conquer Gcd
  1.7 Conversion
    1.7.1 Quadratic Algorithms
    1.7.2 Subquadratic Algorithms
  1.8 Notes and further references
  1.9 Exercises

2 Modular Arithmetic and Finite Fields
  2.1 Representation
    2.1.1 Classical Representations
    2.1.2 Montgomery's Representation
    2.1.3 MSB vs LSB Algorithms
    2.1.4 Residue Number System
    2.1.5 Link with polynomials
  2.2 Multiplication
    2.2.1 Barrett's Algorithm
    2.2.2 Montgomery's Algorithm
    2.2.3 Special Moduli
  2.3 Division/Inversion
    2.3.1 Several Inversions at Once
  2.4 Exponentiation
  2.5 Conversion
  2.6 Finite Fields
  2.7 Applications of FFT
  2.8 Exercises
  2.9 Notes and further references

3 Floating-Point Arithmetic
  3.1 Introduction
    3.1.1 Representation
    3.1.2 Precision vs Accuracy
    3.1.3 Link to Integers
    3.1.4 Error analysis
    3.1.5 Rounding
    3.1.6 Strategies
  3.2 Addition/Subtraction/Comparison
    3.2.1 Floating-Point Addition
    3.2.2 Leading Zero Detection
    3.2.3 Floating-Point Subtraction
  3.3 Multiplication, Division, Algebraic Functions
    3.3.1 Multiplication
    3.3.2 Reciprocal
    3.3.3 Division
    3.3.4 Square Root
  3.4 Conversion
    3.4.1 Floating-Point Output
    3.4.2 Floating-Point Input
  3.5 Exercises
  3.6 Notes and further references

4 Newton's Method and Function Evaluation
  4.1 Introduction
  4.2 Newton's method
    4.2.1 Newton's method via linearisation
    4.2.2 Newton's method for inverse roots
    4.2.3 Newton's method for reciprocals
    4.2.4 Newton's method for inverse square roots
    4.2.5 Newton's method for power series
    4.2.6 Newton's method for exp and log
  4.3 Argument Reduction
  4.4 Power Series
  4.5 Asymptotic Expansions
  4.6 Continued Fractions
  4.7 Recurrence relations
  4.8 Arithmetic-Geometric Mean
  4.9 Binary Splitting
  4.10 Holonomic Functions
  4.11 Contour integration
  4.12 Constants
  4.13 Summary of Best-known Methods
  4.14 Notes and further references
  4.15 Exercises
Notation

β              the word base (usually 2^32 or 2^64)
n              an integer
sign(n)        +1 if n > 0, −1 if n < 0, and 0 if n = 0
r := a mod b   integer remainder (0 ≤ r < b)
q := a div b   integer quotient (0 ≤ a − qb < b)
ν(n)           the 2-valuation of n, i.e. the exponent of the largest power of two
               dividing n, with ν(0) = ∞
log, ln        the natural logarithm
log_2, lg      the base-2 logarithm
[a, b]         the column vector (a, b)^t
[a, b; c, d]   the 2 × 2 matrix with first row (a, b) and second row (c, d)
Z/nZ           the ring of residues modulo n
C^n            the set of (real or complex) functions with n continuous derivatives
               in the region of interest
z̄              the conjugate of the complex number z
|z|            the Euclidean norm of the complex number z
ord(A)         for a power series A(z) = a_0 + a_1 z + · · ·, ord(A) = min{j : a_j ≠ 0}
               (note the special case ord(0) = +∞)
C              the set of complex numbers
N              the set of natural numbers (nonnegative integers)
Q              the set of rational numbers
R              the set of real numbers
Z              the set of integers
Chapter 1

Integer Arithmetic

In this chapter our main topic is integer arithmetic. However, we shall see that many algorithms for polynomial arithmetic are similar to the corresponding algorithms for integer arithmetic, but simpler due to the lack of carries in polynomial arithmetic. Consider for example addition: the sum of two polynomials of degree n always has degree at most n, whereas the sum of two n-digit integers may have n + 1 digits. Thus we often describe algorithms for polynomials as an aid to understanding the corresponding algorithms for integers.

1.1 Representation and Notations


We consider in this chapter algorithms working on integers. We shall distinguish between the logical — or mathematical — representation of an integer, and its physical representation on a computer.
Several physical representations are possible. We consider here only the most common one, namely a dense representation in a fixed integral base. Choose a base β > 1. (In case of ambiguity, β will be called the internal base.) A positive integer A is represented by the length n and the digits a_i of its base-β expansion:

A = a_{n−1} β^{n−1} + · · · + a_1 β + a_0,

where 0 ≤ a_i ≤ β − 1, and a_{n−1} is sometimes assumed to be non-zero.


Since the base β is usually fixed in a given program, it does not need to be represented. Thus only the length n and the integers (a_i)_{0≤i<n} are effectively stored. Some common choices for β are 2^32 on a 32-bit computer, or 2^64 on a 64-bit machine; other possible choices are respectively 10^9 and 10^19 for a decimal representation, or 2^53 when using double precision floating-point registers. Most algorithms from this chapter work in any base; the exceptions are explicitly mentioned.
We assume that the sign is stored separately from the absolute value. Zero is an important special case; to simplify the algorithms we assume that n = 0 if A = 0, and in most cases we assume that this case is treated separately.
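As a concrete illustration, here is a minimal Python sketch of this dense representation (Python integers stand in for arbitrary-precision values; the helper names are ours):

def to_base(A, beta=2**32):
    """Digits of A >= 0 in base beta, least significant first; the integer 0
    is represented by the empty list (length n = 0)."""
    digits = []
    while A > 0:
        A, r = divmod(A, beta)
        digits.append(r)
    return digits

def from_base(digits, beta=2**32):
    """Recover A = sum(a_i * beta^i) from its digit list."""
    A = 0
    for d in reversed(digits):
        A = A * beta + d
    return A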
Except when explicitly mentioned, we assume that all operations are off-line, i.e. all inputs (resp. outputs) are completely known at the beginning (resp. end) of the algorithm. Different models include lazy or on-line algorithms, and relaxed algorithms [53].

1.2 Addition and Subtraction


As an explanatory example, here is an algorithm for integer addition:

 

1   Algorithm IntegerAddition.
2   Input: A = Σ_{i=0}^{n−1} a_i β^i, B = Σ_{i=0}^{n−1} b_i β^i
3   Output: C := Σ_{i=0}^{n−1} c_i β^i and 0 ≤ d ≤ 1 such that A + B = dβ^n + C
4   d ← 0
5   for i from 0 to n − 1 do
6       s ← a_i + b_i + d
7       c_i ← s mod β
8       d ← s div β
9   Return C, d.




Let M be the number of different values taken by the data type representing the coefficients a_i, b_i. (Clearly β ≤ M, but the equality does not necessarily hold, e.g. β = 10^9 and M = 2^32.) At step 6, the value of s can be as large as 2β − 1, which is not representable if β = M. Several workarounds are possible: either use a machine instruction that gives the possible carry of a_i + b_i; or use the fact that, if a carry occurs in a_i + b_i, then the computed sum — if performed modulo M — equals t := a_i + b_i − M < a_i, thus comparing t and a_i will determine if a carry occurred. A third solution is to keep one extra bit, taking β = ⌊M/2⌋.
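Here is Algorithm IntegerAddition transcribed into Python on little-endian digit lists; Python integers do not overflow, so the carry falls out directly, and the comments indicate where the comparison trick above would apply with fixed-size words:

def integer_addition(a, b, beta=2**32):
    """Add two n-word numbers given as little-endian digit lists.
    Returns (c, d) with A + B = d*beta^n + C, as in Algorithm IntegerAddition."""
    n = len(a)
    c, d = [0] * n, 0
    for i in range(n):
        s = a[i] + b[i] + d      # step 6: s can be as large as 2*beta - 1
        c[i] = s % beta          # step 7
        d = s // beta            # step 8: carry out, 0 or 1
        # with fixed-size words (beta = M), one would compute t = (a[i] + b[i]) mod M
        # and detect the carry by testing t < a[i]
    return c, d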

The subtraction code is very similar. Step 6 simply becomes s ← a_i − b_i + d, where d ∈ {0, −1} is the borrow of the subtraction, and −β ≤ s < β (recall that mod gives a nonnegative remainder). The other steps are unchanged.
Addition and subtraction of n-word integers cost O(n), which is negligible compared to the multiplication cost. However, it is worth trying to reduce the constant factor in front of this O(n) cost; indeed, we shall see in §1.3 that "fast" multiplication algorithms are obtained by replacing multiplications by additions (usually more additions than the multiplications that they replace). Thus, the faster the additions are, the smaller the thresholds for changing over to the "fast" algorithms will be.

1.3 Multiplication
A nice application of large integer multiplication is the Kronecker/Schönhage trick. Assume we want to multiply two polynomials A(x) and B(x) with non-negative integer coefficients. Assume both polynomials have degree less than n, and coefficients bounded by B. Now take a power X = β^k of the base β that is larger than nB², and multiply the integers a = A(X) and b = B(X) obtained by evaluating A and B at x = X. If C(x) = A(x)B(x) = Σ c_i x^i, we clearly have C(X) = Σ c_i X^i. Now since the c_i are bounded by nB² < X, the coefficients c_i can be retrieved by simply "reading" blocks of k words in C(X).
Conversely, suppose you want to multiply two integers a = Σ_{0≤i<n} a_i β^i and b = Σ_{0≤j<n} b_j β^j. Multiply the polynomials A(x) = Σ_{0≤i<n} a_i x^i and B(x) = Σ_{0≤j<n} b_j x^j, obtaining a polynomial C(x), then evaluate C(x) at x = β to obtain ab. Note that the coefficients of C(x) may be larger than β; in fact they may be of order nβ². These examples demonstrate the analogy between operations on polynomials and integers, and also show the limits of the analogy.
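The first direction of the trick fits in a few lines of Python (a sketch; coefficient lists are low degree first, and the block-size choice below is one simple possibility):

def kronecker_polymul(A, B):
    """Multiply two polynomials with nonnegative integer coefficients
    (little-endian coefficient lists) via one big-integer product."""
    assert A and B
    n = max(len(A), len(B))
    bound = max(max(A), max(B))
    # block size in bits: 2^k must exceed n * bound^2, the largest
    # possible coefficient of the product
    k = (n * bound * bound).bit_length() + 1
    a = sum(c << (k * i) for i, c in enumerate(A))
    b = sum(c << (k * i) for i, c in enumerate(B))
    c = a * b
    mask = (1 << k) - 1
    return [(c >> (k * i)) & mask for i in range(len(A) + len(B) - 1)]

# example: (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2
assert kronecker_polymul([1, 2], [3, 4]) == [3, 10, 8]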

1.3.1 Naive Multiplication

1   Algorithm BasecaseMultiply.
2   Input: A = Σ_{i=0}^{m−1} a_i β^i, B = Σ_{j=0}^{n−1} b_j β^j
3   Output: C = AB := Σ_{k=0}^{m+n−1} c_k β^k
4   C ← A · b_0
5   for j from 1 to n − 1 do
6       C ← C + β^j (A · b_j)
7   Return C.

Theorem 1.3.1 Algorithm BasecaseMultiply correctly computes the product AB, and uses Θ(mn) word operations.

Remark. The multiplication by β^j at step 6 is trivial with the chosen dense representation: it simply consists of a shift by j words towards the most significant words. The main operation in Algorithm BasecaseMultiply is the computation of A · b_j at step 6, which is accumulated into C. Since all fast algorithms rely on multiplication, the most important operation to optimize in multiple-precision software is the multiplication of an array of m words by one word, with accumulation of the result in another array of m or m + 1 words.
Since multiplication with accumulation usually makes extensive use of the pipeline, it is also best to give it arrays that are as long as possible, which means that A rather than B should be the operand of larger size.
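A direct Python transcription of BasecaseMultiply on digit lists, where the inner loop is precisely the m-word by one-word multiplication with accumulation discussed above (a sketch; production code writes this loop in assembly):

def basecase_multiply(a, b, beta=2**32):
    """Schoolbook product of little-endian digit lists a (m words) and b (n words)."""
    m, n = len(a), len(b)
    c = [0] * (m + n)
    for j in range(n):
        carry = 0
        for i in range(m):                 # accumulate beta^j * (A * b_j), step 6
            t = c[i + j] + a[i] * b[j] + carry
            c[i + j] = t % beta
            carry = t // beta
        c[j + m] = carry                   # the (m+1)-st word of this row
    return c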

1.3.2 Karatsuba’s Algorithm


In the following, n_0 ≥ 2 denotes the threshold between naive multiplication and Karatsuba's algorithm, which is used for n_0-word and larger inputs (see Ex. 1.9.2).

 

1   Algorithm KaratsubaMultiply.
2   Input: A = Σ_{i=0}^{n−1} a_i β^i, B = Σ_{j=0}^{n−1} b_j β^j
3   Output: C = AB := Σ_{k=0}^{2n−1} c_k β^k
4   if n < n_0 then return BasecaseMultiply(A, B)
5   k ← ⌈n/2⌉
6   (A_0, B_0) := (A, B) mod β^k, (A_1, B_1) := (A, B) div β^k
7   s_A ← sign(A_0 − A_1), s_B ← sign(B_0 − B_1)
8   C_0 ← KaratsubaMultiply(A_0, B_0)
9   C_1 ← KaratsubaMultiply(A_1, B_1)
10  C_2 ← KaratsubaMultiply(|A_0 − A_1|, |B_0 − B_1|)
11  Return C := C_0 + (C_0 + C_1 − s_A s_B C_2) β^k + C_1 β^{2k}.





Theorem 1.3.2 Algorithm KaratsubaMultiply correctly computes the product AB, using K(n) = O(n^α) word multiplications, with α = log_2 3 ≈ 1.585.

Proof Since s_A |A_0 − A_1| = A_0 − A_1, and similarly for B, s_A s_B |A_0 − A_1||B_0 − B_1| = (A_0 − A_1)(B_0 − B_1), thus C = A_0 B_0 + (A_0 B_1 + A_1 B_0)β^k + A_1 B_1 β^{2k}.
Since A_0 and B_0 have (at most) ⌈n/2⌉ words, and |A_0 − A_1| and |B_0 − B_1|, and A_1 and B_1 have ⌊n/2⌋ words, the number K(n) of word multiplications satisfies the recurrence K(n) = n² for n < n_0, and K(n) = 2K(⌈n/2⌉) + K(⌊n/2⌋) for n ≥ n_0. Assume 2^{l−1} n_0 ≤ n ≤ 2^l n_0 with l ≥ 1; then K(n) is the sum of three K(j) values with j ≤ 2^{l−1} n_0, ..., thus of 3^l values K(j) with j ≤ n_0. Thus K(n) ≤ 3^l max(K(n_0), (n_0 − 1)²), which gives K(n) ≤ C n^α with C = 3^{1−log_2 n_0} max(K(n_0), (n_0 − 1)²).

This variant of Karatsuba’s algorithm is known as the subtractive version.


Different variants of Karatsuba’s algorithm exist. Another classical one is the
additive version, which uses A0 + A1 and B0 + B1 instead of |A0 − A1 | and
|B0 − B1 |. However, the subtractive version is more convenient for integer
arithmetic, since it avoids the possible carries in A0 + A1 and B0 + B1 , which
require either an extra word in those sums, or extra additions.
The “Karatsuba threshold” n0 can vary from 10 to 100 words depending
on the processor, and the relative efficiency of the word multiplication and
addition.
The efficiency of an implementation of Karatsuba’s algorithm depends
heavily on memory usage. It is quite important not to allocate memory for
the intermediate results |A0 − A1 |, |B0 − B1 |, C0 , C1 , and C2 at each step
(however modern compilers are quite good at optimising code and removing
unnecessary memory references). One possible solution is to allow a large
temporary storage of m words, that will be used both for those intermediate
results and for the recursive calls. It can be shown that an auxiliary space
of m = 2n words is sufficient (see Ex. 1.9.3).
Since the third product C_2 is used only once, it may be faster to have two auxiliary routines KaratsubaAddmul and KaratsubaSubmul that accumulate their results, calling themselves recursively, together with KaratsubaMultiply (see Ex. 1.9.5).
The above version uses ∼ 4n additions (or subtractions): 2 × (n/2) to compute |A_0 − A_1| and |B_0 − B_1|, then n to add C_0 and C_1, again n to add or subtract C_2, and n to add (C_0 + C_1 − s_A s_B C_2)β^k to C_0 + C_1 β^{2k}. An improved scheme uses only ∼ (7/2)n additions (see Ex. 1.9.4).
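A minimal Python sketch of the subtractive version follows, splitting at whole w-bit words and falling back on the builtin product below the threshold n_0 (the memory-management issues above do not arise, since Python allocates freely):

def karatsuba(a, b, n0=32, w=32):
    """Subtractive Karatsuba on nonnegative Python ints, split at k whole w-bit words."""
    n = max(-(-a.bit_length() // w), -(-b.bit_length() // w))
    if n < n0:
        return a * b                      # stands in for BasecaseMultiply
    k = (n + 1) // 2                      # k = ceil(n/2), step 5
    mask = (1 << (w * k)) - 1
    a0, a1 = a & mask, a >> (w * k)
    b0, b1 = b & mask, b >> (w * k)
    c0 = karatsuba(a0, b0, n0, w)
    c1 = karatsuba(a1, b1, n0, w)
    c2 = karatsuba(abs(a0 - a1), abs(b0 - b1), n0, w)
    if (a0 >= a1) == (b0 >= b1):          # s_A * s_B = +1: subtract C_2
        mid = c0 + c1 - c2
    else:                                 # s_A * s_B = -1: add C_2
        mid = c0 + c1 + c2
    return c0 + (mid << (w * k)) + (c1 << (2 * w * k))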

Most fast multiplication algorithms can be viewed as evaluation/interpolation algorithms, from a polynomial point of view. Karatsuba's algorithm regards the inputs as polynomials A_0 + A_1 t and B_0 + B_1 t evaluated at t = β^k; since their product C(t) is of degree 2, Lagrange's interpolation theorem says that it is sufficient to evaluate it at three points. The subtractive version evaluates C(t) at t = 0, −1, ∞, whereas the additive version uses t = 0, +1, ∞.¹

1.3.3 Toom-Cook Multiplication


The above idea readily generalizes to what is known as Toom-Cook r-way multiplication. Write the inputs as A_0 + · · · + A_{r−1} t^{r−1} and B_0 + · · · + B_{r−1} t^{r−1}, with t = β^k and k = ⌈n/r⌉. Since their product C(t) is of degree 2r − 2, it suffices to evaluate it at 2r − 1 distinct points to be able to recover C(t), and in particular C(β^k).
Most books, for example [46], when describing subquadratic multiplication algorithms, only describe Karatsuba and FFT-based algorithms. Nevertheless, the Toom-Cook algorithm is quite interesting in practice.
Toom-Cook r-way reduces one n-word product to 2r − 1 products of ⌈n/r⌉ words. This gives an asymptotic complexity of O(n^ν) with ν = log(2r − 1)/log r.
However, the constant in the big-Oh depends strongly on the evaluation and interpolation formulæ, which in turn depend on the chosen points. One possibility is to take −(r − 1), ..., −1, 0, 1, ..., (r − 1) as evaluation points.
The case r = 2 corresponds to Karatsuba's algorithm (§1.3.2). The case r = 3 is known as Toom-Cook 3-way; sometimes people simply say "Toom-Cook algorithm" for r = 3. The following algorithm uses evaluation points 0, 1, −1, 2, ∞, and tries to optimize the evaluation and interpolation formulæ.
The divisions at step 11 are exact²: if β is a power of two, the division by 6 can be done by a division by 2 — which consists of a single shift — followed by a division by 3 (§1.4.7).
We refer the reader interested in higher order Toom-Cook implementa-
tions to [57], which considers the 4- and 5-way variants, and also squaring.
Toom-Cook r-way has to invert a (2r − 1) × (2r − 1) Vandermonde matrix whose parameters are the evaluation points; if one chooses consecutive integer points, the determinant of that matrix contains all primes up to 2r − 2. This
¹ Evaluating C(t) at ∞ means computing the product A_1 B_1 of the leading coefficients.
² An exact division can be performed from the least significant bits, which is usually more efficient: see §1.4.5.

 

1   Algorithm ToomCook3.
2   Input: two integers 0 ≤ A, B < β^n.
3   Output: AB := c_0 + c_1 β^k + c_2 β^{2k} + c_3 β^{3k} + c_4 β^{4k} with k = ⌈n/3⌉.
4   if n < 3 then return KaratsubaMultiply(A, B)
5   write A = a_0 + a_1 t + a_2 t², B = b_0 + b_1 t + b_2 t² with t = β^k.
6   v_0 ← ToomCook3(a_0, b_0)
7   v_1 ← ToomCook3(a_{02} + a_1, b_{02} + b_1) where a_{02} ← a_0 + a_2, b_{02} ← b_0 + b_2
8   v_{−1} ← ToomCook3(a_{02} − a_1, b_{02} − b_1)
9   v_2 ← ToomCook3(a_0 + 2a_1 + 4a_2, b_0 + 2b_1 + 4b_2)
10  v_∞ ← ToomCook3(a_2, b_2)
11  t_1 ← (3v_0 + 2v_{−1} + v_2)/6 − 2v_∞, t_2 ← (v_1 + v_{−1})/2
12  c_0 ← v_0, c_1 ← v_1 − t_1, c_2 ← t_2 − v_0 − v_∞, c_3 ← t_1 − t_2, c_4 ← v_∞




proves that the division by 3 cannot be avoided for Toom-Cook 3-way (see
Ex. 1.9.8).
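As an illustration, here is a Python sketch of ToomCook3 on integers, with bit-blocks of k bits playing the role of β^k, the builtin product standing in for the base case, and signs factored out so that the possibly negative argument a_{02} − a_1 is handled; the divisions by 6 and 2 from step 11 are exact by construction:

def toom3(a, b):
    """Toom-Cook 3-way sketch, evaluation points 0, 1, -1, 2, infinity."""
    if a < 0 or b < 0:                    # factor out signs before splitting
        s = (a < 0) != (b < 0)
        p = toom3(abs(a), abs(b))
        return -p if s else p
    if a.bit_length() < 3000 or b.bit_length() < 3000:
        return a * b                      # base case (KaratsubaMultiply in the text)
    k = (max(a.bit_length(), b.bit_length()) + 2) // 3
    mask = (1 << k) - 1
    a0, a1, a2 = a & mask, (a >> k) & mask, a >> (2 * k)
    b0, b1, b2 = b & mask, (b >> k) & mask, b >> (2 * k)
    a02, b02 = a0 + a2, b0 + b2
    v0   = toom3(a0, b0)
    v1   = toom3(a02 + a1, b02 + b1)
    vm1  = toom3(a02 - a1, b02 - b1)      # may be negative, handled above
    v2   = toom3(a0 + 2 * a1 + 4 * a2, b0 + 2 * b1 + 4 * b2)
    vinf = toom3(a2, b2)
    t1 = (3 * v0 + 2 * vm1 + v2) // 6 - 2 * vinf   # exact division by 6
    t2 = (v1 + vm1) // 2                            # exact division by 2
    c0, c1, c2, c3, c4 = v0, v1 - t1, t2 - v0 - vinf, t1 - t2, vinf
    return c0 + (c1 << k) + (c2 << (2 * k)) + (c3 << (3 * k)) + (c4 << (4 * k))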

1.3.4 Fast Fourier Transform


Most subquadratic multiplication algorithms can be seen as evaluation-interpolation algorithms. They mainly differ in the number of evaluation points, and the values of those points. However, the evaluation and interpolation formulæ become intricate in Toom-Cook r-way for large r. The Fast Fourier Transform (FFT) is a way to perform evaluation and interpolation efficiently for some special values of r. This explains why multiplication algorithms of best asymptotic complexity are based on the Fast Fourier Transform (FFT).
There are different flavours of FFT multiplication, depending on the ring where the operations are made. The asymptotically best algorithm, due to Schönhage and Strassen [47], with a complexity of O(n log n log log n), works in Z/(2^n + 1)Z; since it is based on modular computations, we describe it in Chapter 2.
Another commonly used method is to work with floating-point complex numbers [32, Section 4.3.3.C]; one drawback is that, due to the inexact nature of floating-point computations, a careful error analysis is required to guarantee the correctness of the implementation. We refer to Chapter 3 for a description of this method.

1.3.5 Unbalanced Multiplication


How does one efficiently multiply integers of different sizes with a subquadratic algorithm? This case is important in practice but is rarely considered in the literature. Assume the larger operand has size m, and the smaller has size n, with m ≥ n.
When m is an entire multiple of n, say m = kn, a trivial strategy is to cut the larger operand into k pieces, giving M(kn, n) = kM(n). However, this is not always the best strategy; see Ex. 1.9.9.
When m is not an entire multiple of n, different strategies are possible. Consider for example Karatsuba multiplication, and let K(m, n) be the number of word-products for an m × n product. Take for example m = 5, n = 3. A natural idea is to pad the smaller operand to the size of the larger one. However, there are several ways to perform this padding; with the Karatsuba cut represented by a double bar:

a4 a3 ‖ a2 a1 a0     a4 a3 ‖ a2 a1 a0     a4 a3 ‖ a2 a1 a0
         b2 b1 b0        b2 ‖ b1 b0          b2 b1 ‖ b0
      A × B             A × (βB)            A × (β²B)

The first strategy leads to two products of size 3, i.e. 2K(3, 3); the second one to K(2, 1) + K(3, 2) + K(3, 3); and the third one to K(2, 2) + K(3, 1) + K(3, 3); which give respectively 14, 15, 13 word products.
However, whenever m/2 ≤ n ≤ m, any such "padding strategy" will require K(⌈m/2⌉, ⌈m/2⌉) for the product of the differences of the low and high parts of the operands, due to a "wrap-around" effect when subtracting the parts from the smaller operand; this will ultimately lead to an O(m^α) cost. The "odd-even strategy" (Ex. 1.9.10) avoids this wrap-around. For example, we get K(3, 2) = 5 with the odd-even strategy, against K(3, 2) = 6 for the classical one.
As with the classical strategy, there are several ways of padding with the odd-even strategy. Consider again m = 5, n = 3, and write A := a_4 x⁴ + a_3 x³ + a_2 x² + a_1 x + a_0 = x A_1(x²) + A_0(x²), with A_1(x) = a_3 x + a_1 and A_0(x) = a_4 x² + a_2 x + a_0; and B := b_2 x² + b_1 x + b_0 = x B_1(x²) + B_0(x²), with B_1(x) = b_1, B_0(x) = b_2 x + b_0. Without padding, we write AB = x²(A_1 B_1)(x²) + x((A_0 + A_1)(B_0 + B_1) − A_1 B_1 − A_0 B_0)(x²) + (A_0 B_0)(x²), which gives K(5, 3) = K(2, 1) + 2K(3, 2) = 12. With padding, we consider xB = x B_1′(x²) + B_0′(x²), with B_1′(x) = b_2 x + b_0 and B_0′ = b_1 x. This gives K(2, 2) = 3 for A_1 B_1′, K(3, 2) = 5 for (A_0 + A_1)(B_0′ + B_1′), and K(3, 1) = 3 for A_0 B_0′ — taking into account the fact that B_0′ has only one non-zero coefficient — thus a total of 11 only.
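To make the odd-even strategy concrete, here is a Python sketch on polynomial coefficient lists (our own helper names; integer coefficients, low degree first). On inputs of sizes (5, 3) it performs K(2, 1) + 2K(3, 2) = 12 word products, i.e. the unpadded scheme above:

def polyadd(a, b):
    """Coefficient-wise sum; operands may have different lengths."""
    if len(a) < len(b):
        a, b = b, a
    return [x + (b[i] if i < len(b) else 0) for i, x in enumerate(a)]

def oddeven_mul(a, b):
    """Odd-even Karatsuba: A(x) = A0(x^2) + x*A1(x^2), similarly for B."""
    if len(a) == 1:
        return [a[0] * x for x in b]
    if len(b) == 1:
        return [x * b[0] for x in a]
    a0, a1 = a[0::2], a[1::2]
    b0, b1 = b[0::2], b[1::2]
    c00 = oddeven_mul(a0, b0)                       # A0*B0
    c11 = oddeven_mul(a1, b1)                       # A1*B1
    mid = oddeven_mul(polyadd(a0, a1), polyadd(b0, b1))
    mid = [m - (c00[i] if i < len(c00) else 0)
             - (c11[i] if i < len(c11) else 0)
           for i, m in enumerate(mid)]              # A0*B1 + A1*B0
    while mid and mid[-1] == 0:                     # drop a cancelled top term
        mid.pop()
    c = [0] * (len(a) + len(b) - 1)
    for i, v in enumerate(c00):
        c[2 * i] += v                               # even part
    for i, v in enumerate(mid):
        c[2 * i + 1] += v                           # odd part
    for i, v in enumerate(c11):
        c[2 * i + 2] += v                           # even part, shifted by x^2
    return c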

1.3.6 Squaring
In many applications, a significant proportion of the multiplications have both operands equal. Hence it is worth tuning a special squaring implementation as much as the implementation of multiplication itself, bearing in mind that the best possible speedup is two (see Ex. 1.9.11).
For naive multiplication, Algorithm BasecaseMultiply (§1.3.1) can be modified to obtain a theoretical speedup of two, since only half of the products a_i b_j need to be computed.
Subquadratic algorithms like Karatsuba and Toom-Cook r-way can be specialized for squaring too. However, the speedup obtained is less than two, and the threshold obtained is larger than the corresponding multiplication threshold (see Ex. 1.9.11).

1.3.7 Multiplication by a constant


It often happens that one integer is used in several consecutive multiplications, or is fixed for a complete calculation. If that constant is small, i.e. less than the base β, not much speedup can be obtained compared to the usual product. We thus consider here a "large" constant.
When using evaluation-interpolation algorithms, like Karatsuba or Toom-Cook (see §1.3.2-1.3.3), one may store the results of the evaluation for that fixed multiplicand. If one assumes that an interpolation is as expensive as one evaluation, this may give a speedup of up to 3/2.
Special-purpose algorithms exist too. These algorithms differ from classical multiplication algorithms because they take into account the value of the given constant, and not only its size in bits or digits. They also differ in the model of complexity used. For example, Bernstein's algorithm [6], which is used by several compilers to compute addresses in data structure records, considers as basic operation (x, y) → 2^i x ± y, with a cost assumed to be independent of the integer i.

For example, Bernstein's algorithm computes 20061x in five steps:

x_1 := 31x    = 2^5 x − x
x_2 := 93x    = 2^1 x_1 + x_1
x_3 := 743x   = 2^3 x_2 − x
x_4 := 6687x  = 2^3 x_3 + x_3
20061x        = 2^1 x_4 + x_4.
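Such a chain translates directly into shifts and additions; the following Python lines check it (a toy verification, not Bernstein's algorithm itself):

def times_20061(x):
    """Addition chain for 20061x: five steps of the form 2^i * u +/- v."""
    x1 = (x << 5) - x       # x1 = 31x
    x2 = (x1 << 1) + x1     # x2 = 93x
    x3 = (x2 << 3) - x      # x3 = 743x
    x4 = (x3 << 3) + x3     # x4 = 6687x
    return (x4 << 1) + x4   # 20061x

assert all(times_20061(x) == 20061 * x for x in range(1000))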

We refer the reader to [34] for a comparison of different algorithms for the
problem of multiplication by an integer constant.

1.4 Division
Division is the next operation to consider after multiplication. Optimizing division is almost as important as optimizing multiplication, since division is usually more expensive, and thus the speedup obtained on division will be more effective. (On the other hand, one usually performs more multiplications than divisions.) One strategy is to avoid divisions when possible, or to replace them by multiplications. An example is when the same divisor is used for several consecutive operations; one can then precompute its inverse (see §2.2.1).
We distinguish several kinds of division: full division computes both quotient and remainder, while in some cases only the quotient (for example when dividing two floating-point mantissas) or the remainder (when multiplying two residues modulo n) is needed. Finally we discuss exact division — when the remainder is known to be zero — and the problem of dividing by a constant.

1.4.1 Naive Division


We say that B := Σ_{j=0}^{n−1} b_j β^j is normalized when its most significant word b_{n−1} is greater than or equal to half of the base β. (If this is not the case, compute A′ = 2^k A and B′ = 2^k B so that B′ is normalized, then divide A′ by B′, giving A′ = Q′B′ + R′; the quotient and remainder of the division of A by B are respectively Q := Q′ and R := R′/2^k, the latter division being exact.)

Theorem 1.4.1 Algorithm BasecaseDivRem correctly computes the quotient and remainder of the division of A by a normalized B, in O(nm) word operations.

 

1   Algorithm BasecaseDivRem.
2   Input: A = Σ_{i=0}^{n+m−1} a_i β^i, B = Σ_{j=0}^{n−1} b_j β^j, B normalized
3   Output: quotient Q and remainder R of A divided by B.
4   if A ≥ β^m B then q_m ← 1, A ← A − β^m B else q_m ← 0
5   for j from m − 1 downto 0 do
6       q_j* ← ⌊(a_{n+j} β + a_{n+j−1}) / b_{n−1}⌋
7       q_j ← min(q_j*, β − 1)
8       A ← A − q_j β^j B
9       while A < 0 do
10          q_j ← q_j − 1
11          A ← A + β^j B
12  Return Q = Σ_{j=0}^{m} q_j β^j, R = A.




(Note: in the above algorithm, a_i denotes the current value of the i-th word of A, after the possible changes at steps 8 and 11.)

Proof First prove that the invariant A < β^{j+1} B holds at step 5. This holds trivially for j = m − 1: B being normalized, A < 2β^m B initially.
First consider the case q_j = q_j*: then q_j b_{n−1} ≥ a_{n+j} β + a_{n+j−1} − b_{n−1} + 1, thus

A − q_j β^j B ≤ (b_{n−1} − 1)β^{n+j−1} + (A mod β^{n+j−1}),

which ensures that the new a_{n+j} vanishes, and a_{n+j−1} < b_{n−1}, thus A < β^j B after step 8. Now A may become negative after step 8, but since q_j b_{n−1} ≤ a_{n+j} β + a_{n+j−1}:

A − q_j β^j B > (a_{n+j} β + a_{n+j−1})β^{n+j−1} − q_j (b_{n−1} β^{n−1} + β^{n−1})β^j ≥ −q_j β^{n+j−1}.

Therefore A − q_j β^j B + 2β^j B ≥ (2b_{n−1} − q_j)β^{n+j−1} > 0, which proves that the while-loop at steps 9-11 is performed at most twice [32, Theorem 4.3.1.B]. When the while-loop is entered, A may increase only by β^j B at a time, hence A < β^j B at exit.
In the case q_j ≠ q_j*, i.e. q_j* ≥ β, we have before the while-loop: A < β^{j+1} B − (β − 1)β^j B = β^j B, thus the invariant holds. If the while-loop is entered, the same reasoning as above holds.
We conclude that when the for-loop ends, 0 ≤ A < B holds, and since (Σ_{i=j}^{m} q_i β^i)B + A is invariant through the algorithm, the quotient Q and remainder R are correct.

The most expensive step is step 8, which costs O(n) operations for q_j B — the multiplication by β^j is simply a word-shift — thus the total cost is O(nm).

Here is an example of Algorithm BasecaseDivRem for the inputs A = 766970544842443844 and B = 862664913, with β = 1000:

j    A                          q_j    A − q_j B β^j        after correction
2    766 970 544 842 443 844    889    61 437 185 443 844   no change
1    61 437 185 443 844         071    187 976 620 844      no change
0    187 976 620 844            218    −84 330 190          778 334 723

which gives quotient Q = 889071217 and remainder R = 778334723.


Remark 1: Algorithm BasecaseDivRem simplifies when A < β^m B: remove step 4, and change m into m − 1 in the return value Q. However, the more general form we give is more convenient for a computer implementation, and will be used below.
Remark 2: a possible variant when q_j* ≥ β is to let q_j = β; then A − q_j β^j B at step 8 reduces to a single subtraction of B shifted by j + 1 words. However, in this case the while-loop will be performed at least once, which corresponds to the identity A − (β − 1)β^j B = A − β^{j+1} B + β^j B.
Remark 3: if instead of having B normalized, i.e. b_{n−1} ≥ β/2, we have b_{n−1} ≥ β/k, there can be up to k iterations of the while-loop (and step 4 has to be modified accordingly).
Remark 4: a drawback of Algorithm BasecaseDivRem is that the test A < 0 at line 9 is true with non-negligible probability, and will therefore defeat the branch prediction of modern processors. A workaround is to compute a more accurate partial quotient, and therefore decrease to almost zero the proportion of corrections (see Ex. 1.9.14).
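For concreteness, here is a Python sketch of BasecaseDivRem with β = 2^w, operating on Python integers (the helper word extracts a base-β digit of the current value of A):

def basecase_divrem(A, B, n, m, w=32):
    """Schoolbook division: B has n words of w bits and is normalized
    (top word >= beta/2); A has at most n+m words. Returns (Q, R)."""
    beta = 1 << w
    def word(x, i):                       # i-th base-beta word of x
        return (x >> (w * i)) & (beta - 1)
    Q = 0
    if A >= B << (w * m):                 # step 4
        Q, A = 1, A - (B << (w * m))
    for j in range(m - 1, -1, -1):
        q = (word(A, n + j) * beta + word(A, n + j - 1)) // word(B, n - 1)  # step 6
        q = min(q, beta - 1)              # step 7
        A -= (q * B) << (w * j)           # step 8
        while A < 0:                      # steps 9-11: at most two corrections
            q -= 1
            A += B << (w * j)
        Q = (Q << w) + q
    return Q, A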

1.4.2 Divisor Preconditioning


It sometimes happens that the quotient selection — step 6 of Algorithm BasecaseDivRem — is quite expensive compared to the total cost, especially for small sizes. Indeed, some processors do not have a machine instruction for the division of two words by one word; one way to compute q_j* is then to precompute a one-word approximation of the inverse of b_{n−1}, and to multiply it by a_{n+j} β + a_{n+j−1}.

Svoboda's algorithm [50] makes the quotient selection trivial, after preconditioning the divisor. The main idea is that if b_{n−1} equals the base β, then the quotient selection is easy, since it suffices to take q_j* = a_{n+j}. (In addition, the condition of step 7 is then always fulfilled.)

 

1   Algorithm SvobodaDivision.
2   Input: A = Σ_{i=0}^{n+m−1} a_i β^i, B = Σ_{j=0}^{n−1} b_j β^j normalized, A < β^m B
3   Output: quotient Q and remainder R of A divided by B.
4   k ← ⌈β^{n+1}/B⌉
5   B′ ← kB = β^{n+1} + Σ_{j=0}^{n−1} b_j′ β^j
6   for j from m − 1 downto 1 do
7       q_j ← a_{n+j}
8       A ← A − q_j β^{j−1} B′
9       if A < 0 then
10          q_j ← q_j − 1
11          A ← A + β^{j−1} B′
12  Q′ = Σ_{j=1}^{m−1} q_j β^{j−1}, R′ = A
13  (q_0, R) ← (R′ div B, R′ mod B)
14  Return Q = q_0 + kQ′, R.




Remarks: at step 8, the most significant word a_{n+j} β^{n+j} automatically cancels with q_j β^{j−1} β^{n+1}; one can thus subtract only the product of q_j by the lower part Σ_{j=0}^{n−1} b_j′ β^j of B′. The division at step 13 can be performed with BasecaseDivRem; it gives a single word since A has n + 1 words.
With the example of the previous section, Svoboda's algorithm gives k = 1160, B′ = 1000691299080, and:

j    A                          q_j    A − q_j B′ β^{j−1}    after correction
2    766 970 544 842 443 844    766    441 009 747 163 844   no change
1    441 009 747 163 844        441    −295 115 730 436      705 575 568 644

We thus get Q′ = 766440 and R′ = 705575568644. The final division gives R′ = 817B + 778334723, thus we finally get Q = 1160 · 766440 + 817 = 889071217, and R = 778334723.
Svoboda's algorithm is especially interesting when only the remainder is needed, since one then avoids the post-normalization Q = q_0 + kQ′.
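A Python sketch of SvobodaDivision, reusing basecase_divrem above for the final one-word quotient (the function names are ours):

def svoboda_divrem(A, B, n, m, w=32):
    """A < beta^m * B, B normalized with n words. Preconditioning k*B puts a
    one in the word above the top, making the quotient-word guess trivial."""
    beta = 1 << w
    k = -(-(beta ** (n + 1)) // B)        # step 4: ceil(beta^(n+1) / B)
    Bp = k * B                            # beta^(n+1) <= Bp < beta^(n+1) + B
    Qp = 0
    for j in range(m - 1, 0, -1):
        q = (A >> (w * (n + j))) & (beta - 1)    # step 7: q_j = a_{n+j}
        A -= (q * Bp) << (w * (j - 1))           # step 8
        if A < 0:                                # steps 9-11: one correction
            q -= 1
            A += Bp << (w * (j - 1))
        Qp = Qp * beta + q                       # accumulates Q' = sum q_j beta^(j-1)
    q0, R = basecase_divrem(A, B, n, 1, w)       # step 13
    return q0 + k * Qp, R                        # step 14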

1.4.3 Divide and Conquer Division


The base-case division determines the quotient word by word. A natural idea is to try to obtain several words at a time, for example by replacing the quotient selection step in Algorithm BasecaseDivRem by:

q_j* ← ⌊(a_{n+j} β³ + a_{n+j−1} β² + a_{n+j−2} β + a_{n+j−3}) / (b_{n−1} β + b_{n−2})⌋.

Since q_j* now has two words, one can use fast multiplication algorithms (§1.3) to speed up the computation of q_j B at step 8 of Algorithm BasecaseDivRem.
More generally, the most significant half of the quotient — say Q_1, of k words — depends mainly on the k most significant words of the dividend and divisor. Once a good approximation to Q_1 is known, fast multiplication algorithms can be used to compute the partial remainder A − Q_1 B. The second idea of the divide and conquer division algorithm below is to compute the corresponding remainder together with the partial quotient, in such a way that one only has to subtract the product of the partial quotient by the low part of the divisor.

 

1   Algorithm RecursiveDivRem.
2   Input: A = Σ_{i=0}^{n+m−1} a_i β^i, B = Σ_{j=0}^{n−1} b_j β^j, B normalized, n ≥ m
3   Output: quotient Q and remainder R of A divided by B.
4   if m < 2 then return BasecaseDivRem(A, B)
5   k ← ⌊m/2⌋, B_1 ← B div β^k, B_0 ← B mod β^k
6   (Q_1, R_1) ← RecursiveDivRem(A div β^{2k}, B_1)
7   A′ ← R_1 β^{2k} + (A mod β^{2k}) − Q_1 β^k B_0
8   while A′ < 0 do Q_1 ← Q_1 − 1, A′ ← A′ + β^k B
9   (Q_0, R_0) ← RecursiveDivRem(A′ div β^k, B_1)
10  A″ ← R_0 β^k + (A′ mod β^k) − Q_0 B_0
11  while A″ < 0 do Q_0 ← Q_0 − 1, A″ ← A″ + B
12  Return Q := Q_1 β^k + Q_0, R := A″.




Theorem 1.4.2 Algorithm RecursiveDivRem is correct, and uses D(m, n) operations, where D(2m, n) = 2D(m, n − m) + 2M(m) + O(n). In particular D(n) := D(n, n) satisfies D(2n) = 2D(n) + 2M(n) + O(n), which gives D(n) ∼ (1/(2^{α−1} − 1)) M(n) for M(n) ∼ n^α, α > 1.

Proof We first check the assumption for the recursive calls: B_1 is normalized, since it has the same most significant word as B.
After step 6, we have A = (Q_1 B_1 + R_1)β^{2k} + (A mod β^{2k}), thus after step 7: A′ = A − Q_1 β^k B, which still holds after step 8. After step 9, we have A′ = (Q_0 B_1 + R_0)β^k + (A′ mod β^k), thus after step 10: A″ = A′ − Q_0 B, which still holds after step 11. At step 12 we thus have A = QB + R.
A div β^{2k} has m + n − 2k words, while B_1 has n − k words, thus 0 ≤ Q_1 < 2β^{m−k} and 0 ≤ R_1 < B_1 < β^{n−k}. Thus at step 7, −2β^{m+k} < A′ < β^k B. Since B is normalized, the while-loop at step 8 is performed at most four times. At step 9 we have 0 ≤ A′ < β^k B, thus A′ div β^k has at most n words. It follows that 0 ≤ Q_0 < 2β^k and 0 ≤ R_0 < B_1 < β^{n−k}. Hence at step 10, −2β^{2k} < A″ < B, and after at most four iterations at step 11, we have 0 ≤ A″ < B.

A graphical view of Algorithm RecursiveDivRem in the case m = n (a dividend of 2n words) is given in Fig. 1.1, which represents the multiplication Q · B: one first computes the lower left corner in D(n/2), secondly the lower right corner in M(n/2), thirdly the upper left corner in D(n/2), and finally the upper right corner in M(n/2).

[Figure 1.1 tiles the product Q · B with two M(n/2) blocks, four M(n/4) blocks, and eight M(n/8) blocks, with the quotient Q and the divisor B along the sides.]

Figure 1.1: Divide and conquer division: a graphical view (most significant parts at the lower left corner).

Remark 1: we may replace the condition m < 2 at step 4 by m < T for any integer T ≥ 2. In practice, T may be in the range 50 to 200 words.
Remark 2: we cannot require A < β^m B here, since this condition may not be satisfied in the recursive calls. Consider for example A = 5517, B = 56 with β = 10: the first recursive call will divide 55 by 5, which requires a two-digit quotient. Even A ≤ β^m B is not recursively fulfilled; consider A = 55170000 with B = 5517: the first recursive call will divide 5517 by 55. The weakest possible condition is that the n most significant words of A do not exceed those of B, i.e. A < β^m (B + 1). In that case, the quotient is bounded by β^m + ⌊(β^m − 1)/B⌋, which yields β^m + 1 in the case n = m (compare Ex. 1.9.13). See also Ex. 1.9.15.
Remark 3: Theorem 1.4.2 gives D(n) ∼ 2M(n) for Karatsuba multiplication, and D(n) ∼ 2.63M(n) for Toom-Cook 3-way. In the FFT range, see Ex. 1.9.16.
Remark 4: the same idea as in Ex. 1.9.14 applies: to decrease the probability that the estimated quotients Q_1 and Q_0 are too large, use one extra word of the truncated dividend and divisors in the recursive calls to RecursiveDivRem.

Large dividend
The condition n ≥ m in Algorithm RecursiveDivRem means that the dividend A is at most twice as large as the divisor B.
When A is more than twice as large as B (m > n with the above notation), the best strategy (see Ex. 1.9.17) is to get n words of the quotient at a time (this simply reduces to the base-case algorithm, replacing β by β^n).
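A Python sketch of RecursiveDivRem, reusing the basecase_divrem sketch from §1.4.1 (again β = 2^w; the size bookkeeping follows the proof above):

def recursive_divrem(A, B, n, m, w=32):
    """Divide and conquer division: B has n words and is normalized,
    A < beta^m * B, and n >= m. Returns (Q, R)."""
    if m < 2:
        return basecase_divrem(A, B, n, m, w)     # step 4
    k = m // 2                                    # step 5
    B1, B0 = B >> (w * k), B & ((1 << (w * k)) - 1)
    Q1, R1 = recursive_divrem(A >> (2 * w * k), B1, n - k, m - k, w)  # step 6
    A1 = ((R1 << (2 * w * k)) + (A & ((1 << (2 * w * k)) - 1))
          - ((Q1 * B0) << (w * k)))               # step 7
    while A1 < 0:                                 # step 8: at most four times
        Q1 -= 1
        A1 += B << (w * k)
    Q0, R0 = recursive_divrem(A1 >> (w * k), B1, n - k, k, w)         # step 9
    A2 = (R0 << (w * k)) + (A1 & ((1 << (w * k)) - 1)) - Q0 * B0      # step 10
    while A2 < 0:                                 # step 11
        Q0 -= 1
        A2 += B
    return (Q1 << (w * k)) + Q0, A2               # step 12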

1.4.4 Newton’s Division


Newton’s iteration gives the division algorithm with best asymptotic com-
plexity. We refer here to Ch. 4. The p-adic version of Newton’s method, also
called Hensel lifting, is used below for the exact division.

Theorem 1.4.3 Algorithm InvRem is correct.

Proof At step 6, by induction β^{2h} = B_h X_h + R_h with 0 ≤ R_h < B_h. Thus we have

β^{2n} = (B_h X_h + R_h)β^{2l} = (B X_h + Y)β^l = BX + R.

The condition 0 ≤ R < B is ensured thanks to the while loops at the end of the algorithm.

 

1   Algorithm UnbalancedDivision.
2   Input: A = Σ_{i=0}^{n+m−1} a_i β^i, B = Σ_{j=0}^{n−1} b_j β^j.
3   Output: quotient Q and remainder R of A divided by B.
4   Assumptions: m > n, B normalized.
5   Q ← 0
6   while m > n do
7       (q, r) ← RecursiveDivRem(A div β^{m−n}, B)
8       Q ← Qβ^n + q
9       A ← rβ^{m−n} + (A mod β^{m−n})
10      m ← m − n
11  (q, r) ← RecursiveDivRem(A, B)
12  Return Q := Qβ^m + q, R := r.





1.4.5 Exact Division


A division is exact when the remainder is zero. This happens for example
when normalizing a fraction a/b: one divides both a and b by their greatest
common divisor, and both divisions are exact. If the remainder is known a
priori to be zero, this information is useful to speed up the computation of
the quotient. Two strategies are possible:

• use classical division algorithms (most significant bits first), without computing the lower part of the remainder. Here, one has to take care of rounding errors, in order to guarantee the correctness of the final result;

• or start from the least significant bits first. Indeed, if the quotient is known to be less than β^n, computing a/b mod β^n will reveal it.

In both strategies, subquadratic algorithms can be used too. We describe here the least significant bit algorithm, using Hensel lifting — which can be seen as a p-adic version of Newton's method.
Remark: Algorithm ExactDivision uses the Karp-Markstein trick: lines 4-7 compute 1/B mod β^{⌈n/2⌉}, while the last two lines incorporate the dividend to obtain A/B mod β^n.

 

1   Algorithm InvRem.
2   Input: a positive integer B, (1/2)β^n ≤ B < β^n.
3   Output: X, R such that β^{2n} = BX + R, 0 ≤ R < B.
4   if n = 1 then return NaiveInvRem(B).
5   h ← ⌈n/2⌉, l ← ⌊n/2⌋, write B = B_h β^l + B_l
6   (X_h, R_h) ← InvRem(B_h)
7   Y ← R_h β^l − B_l X_h
8   while Y < 0 do { X_h ← X_h − 1, Y ← Y + B }
9   X_l ← ⌊X_h Y / β^{2h}⌋
10  R ← Y β^l − B X_l
11  while R < 0 do { X_l ← X_l − 1, R ← R + B }
12  while R ≥ B do { X_l ← X_l + 1, R ← R − B }
13  Return X_h β^l + X_l, R.





 

1   Algorithm ExactDivision.
2   Input: A = Σ_{i=0}^{n−1} a_i β^i, B = Σ_{j=0}^{n−1} b_j β^j
3   Output: quotient Q = A/B mod β^n
4   C ← 1/b_0 mod β
5   for i from ⌈log_2 n⌉ − 1 downto 1 do
6       k ← ⌈n/2^i⌉
7       C ← C + C(1 − BC) mod β^k
8   Q ← AC mod β^k
9   Q ← Q + C(A − BQ) mod β^n




Note that the middle product (§3.3) can be used in lines 7 and 9, to speed up the computation of 1 − BC and A − BQ respectively.
Finally, another gain is obtained by using both strategies simultaneously: compute the most significant n/2 bits of the quotient using the first strategy, and the least significant n/2 bits using the second one. Since an exact division of size n is replaced by two exact divisions of size n/2, this gives a speedup of up to 2 for quadratic algorithms (see Ex. 1.9.19).
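Here is a Python sketch of ExactDivision for β = 2^w, which requires B odd so that b_0 is invertible mod β; pow(x, -1, m) (Python 3.8+) provides the modular inverse:

def exact_division(A, B, n, w=32):
    """Computes Q = A/B mod beta^n; this is the exact quotient when B
    divides A and the quotient has at most n words. B must be odd."""
    beta = 1 << w
    C = pow(B % beta, -1, beta)                        # line 4: 1/b_0 mod beta
    k = 1
    for i in range((n - 1).bit_length() - 1, 0, -1):   # lines 5-7
        k = -(-n // (1 << i))                          # k = ceil(n / 2^i)
        C = (C + C * (1 - B * C)) % (1 << (w * k))     # Hensel lifting step
    Q = (A * C) % (1 << (w * k))                       # line 8
    Q = (Q + C * (A - B * Q)) % (1 << (w * n))         # line 9: Karp-Markstein
    return Q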

1.4.6 Only Quotient or Remainder Wanted


When both the quotient and remainder of a division are needed, it is better to compute them simultaneously. This may seem to be a trivial statement; nevertheless some high-level languages provide both div and mod, but no instruction to compute both quotient and remainder.
Once the quotient is known, the remainder can be recovered by a single
multiplication as a − qb; on the other hand, when the remainder is known,
the quotient can be recovered by an exact division as (a − r)/b (§1.4.5).
However, it often happens that only one of the quotient and remainder
is needed. For example, the division of two floating-point numbers reduces
to the quotient of their fractions (see Ch. 3). Conversely, the multiplication
of two numbers modulo n reduces to the remainder of their product after
division by n (see Ch. 2). In such cases, one may wonder if faster algorithms
exist.
For a dividend of 2n words and a divisor of n words, a significant speedup — up to two for quadratic algorithms — can be obtained when only the quotient is needed, since one does not need to update the low part of the current remainder (line 8 of Algorithm BasecaseDivRem).
Surprisingly, it seems difficult to get a similar speedup when only the remainder is required. One possibility would be to use Svoboda's algorithm, but this requires some precomputation, so it is only useful when several divisions are performed with the same divisor. The idea is the following: precompute a multiple B_1 of B having 3n/2 words, whose n/2 most significant words form the value β^{n/2}. Then reducing A mod B_1 requires a single n/2 × n multiplication. Once A is reduced to A_1 of 3n/2 words by Svoboda's algorithm in 2M(n/2), use RecursiveDivRem on A_1 and B, which costs D(n/2) + M(n/2). The total cost is thus 3M(n/2) + D(n/2) — instead of 2M(n/2) + 2D(n/2) for a full division with RecursiveDivRem — i.e. (5/3)M(n) for Karatsuba and 2.04M(n) for Toom-Cook 3-way, and better still in the FFT range as soon as D(n) > 3M(n).

1.4.7 Division by a Constant


As for multiplication, division by a constant c is an important special case. It arises for example in Toom-Cook multiplication, where one has to perform an exact division by 3 (§1.3.3). We assume here that we want to divide a multiprecision number by a one-word constant. One could of course use a classical division algorithm (§1.4.1). The following algorithm performs instead a modular division:

A + bβ^n = cQ,

where the “carry” b will be zero when the division is exact.



 

1   Algorithm ConstantDivide.
2   Input: A = Σ_{i=0}^{n−1} a_i β^i, 0 ≤ c < β.
3   Output: Q = Σ_{i=0}^{n−1} q_i β^i and 0 ≤ b < c such that A + bβ^n = cQ
4   d ← 1/c mod β
5   b ← 0
6   for i from 0 to n − 1 do
7       if b ≤ a_i then (x, b′) ← (a_i − b, 0)
8       else (x, b′) ← (a_i − b + β, 1)
9       q_i ← dx mod β
10      b″ ← (q_i c − x)/β
11      b ← b′ + b″
12  Return Σ_{i=0}^{n−1} q_i β^i, b.




Theorem 1.4.4 The output of Algorithm ConstantDivide satisfies A + bβ^n = cQ.

Proof We show that after step i, 0 ≤ i < n, we have A_i + bβ^{i+1} = cQ_i, where A_i := Σ_{j=0}^{i} a_j β^j and Q_i := Σ_{j=0}^{i} q_j β^j. For i = 0, this is a_0 + bβ = cq_0, which is exactly line 10: since q_0 = a_0/c mod β, q_0 c − a_0 is divisible by β. Assume now that A_{i−1} + bβ^i = cQ_{i−1} holds for 1 ≤ i < n. We have a_i − b + b′β = x, then x + b″β = cq_i, thus A_i + (b′ + b″)β^{i+1} = A_{i−1} + β^i(a_i + b′β + b″β) = cQ_{i−1} − bβ^i + β^i(x + b − b′β + b′β + b″β) = cQ_{i−1} + β^i(x + b″β) = cQ_i.

Remark: at line 10, since 0 ≤ x < β, b″ can also be obtained as ⌊q_i c/β⌋.
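A direct Python transcription of ConstantDivide (a sketch; c must be prime to β — e.g. c = 3, as needed in Toom-Cook 3-way):

def constant_divide(a, c, w=32):
    """Returns (q, b) with A + b*beta^n = c*Q, for a little-endian word
    list a and a one-word constant c prime to beta."""
    beta = 1 << w
    d = pow(c, -1, beta)                  # line 4: 1/c mod beta
    q, b = [], 0
    for ai in a:
        x, b1 = ai - b, 0                 # lines 7-8
        if x < 0:
            x, b1 = x + beta, 1
        qi = (d * x) % beta               # line 9
        b2 = (qi * c - x) // beta         # line 10: exact, as qi*c = x mod beta
        q.append(qi)
        b = b1 + b2                       # line 11
    return q, b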

1.4.8 Hensel’s Division


Classical division consists in cancelling the most significant part of the dividend by a multiple of the divisor, while Hensel's division cancels the least significant part (Fig. 1.2). Given a dividend A of 2n words and a divisor B of n words, the classical or MSB (most significant bit) division computes a quotient Q and a remainder R such that A = QB + R, while Hensel's or LSB (least significant bit) division computes an LSB-quotient Q′ and an LSB-remainder R′ such that A = Q′B + R′β^n. While the MSB division requires the most significant bit of B to be set, the LSB division requires B to be prime to the word base β, i.e. the least significant bit of B to be set for β a power of two.

Figure 1.2: Classical/MSB division (left) vs Hensel/LSB division (right): the product QB cancels the high part of A, leaving the remainder R, while Q′B cancels the low part, leaving R′.

The LSB-quotient is uniquely defined by Q′ = A/B mod β^n, with 0 ≤ Q′ < β^n. This in turn uniquely defines the LSB-remainder R′ = (A − Q′B)β^{−n}, with −B < R′ < β^n.
Most MSB-division variants (naive, with preconditioning, divide and conquer, Newton's iteration) have an LSB counterpart. For example, the preconditioning consists in using a multiple of the divisor such that kB ≡ 1 mod β, and Newton's iteration is called Hensel lifting in the LSB case. The exact division algorithm described at the end of §1.4.5 uses both MSB- and LSB-division simultaneously. One important difference is that LSB-division does not need any correction step, since the carries go in the direction opposite to the cancelled bits.
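For β = 2^w and B odd, Hensel's division can be written in a few lines of Python (a sketch using a full modular inversion, where a real implementation would use the subquadratic algorithms of this chapter):

def hensel_divrem(A, B, n, w=32):
    """LSB division: B odd with n words, A with at most 2n words.
    Returns (Qp, Rp) with A = Qp*B + Rp*beta^n and 0 <= Qp < beta^n."""
    mod = 1 << (w * n)
    Qp = (A * pow(B, -1, mod)) % mod       # Q' = A/B mod beta^n
    Rp = (A - Qp * B) >> (w * n)           # exact shift: A - Q'*B is divisible by beta^n
    return Qp, Rp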

1.5 Roots
1.5.1 Square Root
The "paper and pencil" method once taught at school to extract square roots is very similar to the "paper and pencil" division. It decomposes an integer m in the form s² + r, taking two digits of m at a time, and finding one digit of s at a time. It is based on the following idea: if m = s² + r is the current decomposition, then taking two more digits of the root-end gives a decomposition of the form 100m + r′ = 100s² + 100r + r′, with 0 ≤ r′ < 100. Since (10s + t)² = 100s² + 20st + t², a good approximation to the next digit t will be found by dividing 10r by 2s.

The following algorithm generalizes this idea to a power β^l of the internal base close to m^{1/4}: one obtains a divide and conquer algorithm, which is in fact an error-free variant of Newton's method (cf. Ch. 4):

 

1   Algorithm SqrtRem.
2   Input: m = a_{n−1} β^{n−1} + · · · + a_1 β + a_0 with a_{n−1} ≠ 0
3   Output: (s, r) such that s² ≤ m = s² + r < (s + 1)²
4   l ← ⌊(n − 1)/4⌋
5   if l = 0 then return BasecaseSqrtRem(m)
6   write m = a_3 β^{3l} + a_2 β^{2l} + a_1 β^l + a_0 with 0 ≤ a_2, a_1, a_0 < β^l
7   (s′, r′) ← SqrtRem(a_3 β^l + a_2)
8   (q, u) ← DivRem(r′ β^l + a_1, 2s′)
9   s ← s′ β^l + q
10  r ← uβ^l + a_0 − q²
11  if r < 0 then
12      r ← r + 2s − 1
13      s ← s − 1
14  Return (s, r)




Theorem 1.5.1 Algorithm SqrtRem correctly returns the integer square root s and remainder r of the input m, and has complexity R(2n) ∼ R(n) + D(n) + S(n), where D(n) and S(n) are the complexities of the division with remainder and of squaring, respectively. This gives R(n) ∼ (1/2)n² with naive multiplication, R(n) ∼ (4/3)K(n) with Karatsuba's multiplication, and R(n) ∼ (31/6)M(n) with FFT multiplication, assuming S(n) ∼ (2/3)M(n).
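Here is a Python sketch of SqrtRem with β = 2, i.e. cutting at bit boundaries rather than word boundaries, with a naive loop as BasecaseSqrtRem; the final branch mirrors steps 11-13:

def sqrtrem(m):
    """Returns (s, r) with s*s <= m = s*s + r < (s+1)**2."""
    if m < 16:
        s = 0
        while (s + 1) * (s + 1) <= m:
            s += 1
        return s, m - s * s
    l = (m.bit_length() - 1) // 4              # block size in bits
    a0 = m & ((1 << l) - 1)
    a1 = (m >> l) & ((1 << l) - 1)
    sp, rp = sqrtrem(m >> (2 * l))             # root of the two high blocks
    q, u = divmod((rp << l) + a1, 2 * sp)
    s = (sp << l) + q
    r = (u << l) + a0 - q * q
    if r < 0:                                  # single correction, as in SqrtRem
        r += 2 * s - 1
        s -= 1
    return s, r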

1.5.2 k-th Root


The above idea for the integer square root can be generalized to any power: if the current decomposition is n = n′β^k + n″β^{k−1} + n‴, first compute a k-th root of n′, say n′ = s′^k + r′, then divide r′β + n″ by ks′^{k−1} to get an approximation of the next root digit t, and correct it if needed. Unfortunately the computation of the remainder, which is easy for the square root, involves O(k) terms for the k-th root, and this method may become slower than recomputing (s′β + t)^k directly.

Cube Root.
We illustrate with the case k = 3 (the cube root), where BasecaseCbrtRem is a naive algorithm that should deal with inputs of up to 6 words.

1   Algorithm CbrtRem.
2   Input: 0 ≤ n = n_{d−1} β^{d−1} + · · · + n_1 β + n_0 with 0 ≤ n_i < β
3   Output: (s, r) such that s³ ≤ n = s³ + r < (s + 1)³
4   l ← ⌊(d − 1)/6⌋
5   if l = 0 then return BasecaseCbrtRem(n)
6   write n as n′b³ + a_2 b² + a_1 b + a_0 where b := β^l
7   (s′, r′) ← CbrtRem(n′)
8   (q, u) ← DivRem(br′ + a_2, 3s′²)
9   r ← b²u + ba_1 + a_0 − q²(3s′b + q)
10  s ← bs′ + q
11  while r < 0 do
12      r ← r + 1 − 3s + 3s²
13      s ← s − 1
14  Return (s, r).




Exact Newton Iteration for k-th Root.

Theorem 1.5.2 Algorithm RootRem is correct.

Proof We prove by induction on n that the returned values s and r satisfy n = s^k + r and s^k ≤ n < (s + 1)^k. With the notation of the algorithm, we have s = s′β + q, thus t = s^k and r = n − t, which proves that n = s^k + r. Assuming the while-loop exits, t ≤ n, thus r ≥ 0 and s^k ≤ n. It only remains to prove that n < (s + 1)^k.
If the while-loop is entered, this means that the previous value of t was larger than n, i.e. (s + 1)^k > n. It thus suffices to prove that n < (s′β + q + 1)^k for the initial value of q. We first have (r′β + n_1)/(k s′^{k−1}) < q + 1, thus r′β + n_1 ≤ (q + 1) k s′^{k−1} − 1. This gives

n = n_2 β^k + n_1 β^{k−1} + n_0 = (s′^k + r′)β^k + n_1 β^{k−1} + n_0
  = s′^k β^k + (r′β + n_1)β^{k−1} + n_0 ≤ s′^k β^k + (q + 1) k s′^{k−1} β^{k−1} − β^{k−1} + n_0
  < s′^k β^k + (q + 1) k s′^{k−1} β^{k−1} ≤ (s′β + q + 1)^k.

 

1   Algorithm RootRem.
2   Input: n ≥ 0
3   Output: (s, r) such that s^k ≤ n = s^k + r < (s + 1)^k
4   if n < N, use a naive algorithm
5   choose a base β such that n ≥ β^{2k}
6   write n = n_2 β^k + n_1 β^{k−1} + n_0 with n_2 ≥ β^k, 0 ≤ n_1 < β, 0 ≤ n_0 < β^{k−1}
7   (s′, r′) ← RootRem(n_2)
8   q ← ⌊(r′β + n_1)/(k s′^{k−1})⌋
9   t ← (s′β + q)^k
10  while t > n do
11      q ← q − 1
12      t ← (s′β + q)^k
13  Return (s′β + q, n − t).




However, the above result is not fully satisfactory, since we have no bound on the number of iterations of the while-loop in Algorithm RootRem. The following lemma shows how to choose β at step 5 to ensure that the while-loop is performed at most once, so that it can be replaced by an if-test, as in Algorithm SqrtRem.

Lemma 1.5.1 If s′ ≥ kβ at step 7 of Algorithm RootRem, then at most one correction is necessary.

Proof Let q′ be the final value of q at step 13, and q the value at step 8. By hypothesis we have n = n_2 β^k + n_1 β^{k−1} + n_0 < (s′β + q′ + 1)^k, thus we deduce:

q ≤ (r′β + n_1)/(k s′^{k−1}) = ((n_2 − s′^k)β^k + n_1 β^{k−1})/(k s′^{k−1} β^{k−1})
  < ((s′β + q′ + 1)^k − (s′β)^k)/(k s′^{k−1} β^{k−1})
  = q′ + 1 + ((q′ + 1)²/(k s′β)) Σ_{i=0}^{k−2} (k choose i+2) ((q′ + 1)/(s′β))^i.

It can be shown that Σ_{i=0}^{k−2} (k choose i+2) x^i = ((1 + x)^k − 1 − kx)/x² < (e − 2)k² for x ≤ 1/k. We thus conclude that for (q′ + 1)/(s′β) ≤ 1/k — which is true since s′ ≥ kβ > k and q′ < β — we have q < q′ + 1 + (e − 2)kβ/s′ ≤ q′ + e − 1. Since both q and q′ are integers, and e − 1 < 2, it follows that q ≤ q′ + 1.

This lemma shows that it suffices to choose β slightly smaller — by log k bits, or by one word — to ensure that there is at most one correction. In practice, especially for large operands, one may want to take a few more bits in s′ with respect to β. Indeed, if we take s′ ≥ 2^g kβ, then assuming (r′β + n_1)/(k s′^{k−1}) is uniformly distributed in [q′, q′ + 1 + (e − 2)kβ/s′], the probability of a correction is less than (e − 2)2^{−g}. With g = 6, for example, this is about 1%.

1.5.3 Exact Root

When a k-th root is known to be exact, there is of course no need to compute the final remainder exactly in the root algorithms shown above, which saves some computation time. However, one has to check that the remainder is sufficiently small that the computed root is correct.
When a root is known to be exact, one may also try to compute it starting from the least significant bits, as for exact division. Indeed, if s^k = n, then s^k = n mod β^l for any integer l. However, in the case of exact division, the equation a = qb mod β^l has only one solution q as soon as b is prime to β. Here, the equation s^k = n mod β^l may have several solutions, so the lifting process is not unique. For example, x² = 1 mod 2³ has four solutions.
Suppose we have s^k = n mod β^l, and we want to lift to β^{l+1}. We want (s + tβ^l)^k = n + n′β^l mod β^{l+1}, where 0 ≤ t, n′ < β. Expanding the left-hand side, this gives k t s^{k−1} = n′ + (n − s^k)/β^l mod β. This equation has a unique solution t when k is prime to β. For example, we can extract cube roots in this way for β a power of two. When k is prime to β, we can also compute the root simultaneously from the most significant and least significant ends, as for the exact division.

Unknown exponent.

Assume now that one wants to check whether a given integer n is an exact power, without knowing the corresponding exponent. For example, many factorization algorithms fail when given an exact power, so this case has to be checked first. The following algorithm detects exact powers, and returns the largest exponent. To detect non-k-th powers early at step 5, one may use modular algorithms when k is prime to the base β (see above).

 

1   Algorithm IsPower.
2   Input: a positive integer n.
3   Output: k if n is an exact k-th power, false otherwise.
4   for k from ⌊log_2 n⌋ downto 2 do
5       if n is a k-th power, return k
6   Return false.
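A Python sketch of IsPower; for illustration the k-th root is computed by plain bisection rather than by RootRem:

def kth_root(n, k):
    """Floor of the k-th root of n >= 0, by bisection."""
    lo, hi = 0, 1 << (-(-n.bit_length() // k))   # hi**k > n
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid ** k <= n:
            lo = mid
        else:
            hi = mid
    return lo

def is_power(n):
    """Largest k >= 2 such that n is an exact k-th power, else False."""
    for k in range(n.bit_length(), 1, -1):       # k <= log2(n)
        s = kth_root(n, k)
        if s ** k == n:
            return k
    return False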




1.6 Gcd
There are many algorithms computing gcds in the literature. We can distin-
guish between the following (non-exclusive) types:

• left-to-right versus right-to-left algorithms: in the former the actions


depend on the most significant bits, while in the latter the actions
depend on the least significant bits;

• naive algorithms: these O(n²) algorithms consider one word of each


operand at a time, trying to guess from them the first quotients; we
count in this class algorithms considering double-size words, namely
Lehmer’s algorithm and Sorenson’s k-ary reduction in the left-to-right
and right-to-left cases respectively; algorithms not in that class consider
a number of words that depends on the input size n, and are often
subquadratic;

• subtraction-only algorithms: these algorithms trade divisions for sub-


tractions, at the cost of more iterations;

• plain versus extended algorithms: the former just compute the gcd of
the inputs, while the latter express the gcd as a linear combination of
the inputs.

1.6.1 Naive Gcd


We do not give Euclid’s algorithm here: it can be found in many textbooks,
e.g. Knuth [32], and we don’t recommend it in its simplest form, except for
testing purposes. Indeed, it is one of the slowest ways to compute a gcd,
except for very small inputs.

Double-Digit Gcd. A first improvement comes from the following sim-


ple remark due to Lehmer: the first quotients in Euclid’s algorithm usually
can be determined from the two most significant words of the inputs. This
avoids expensive divisions that give small quotients most of the time (see
Knuth [32, §4.5.3]). Consider for example a = 427,419,669,081 and
b = 321,110,693,270 with 3-digit words. The first quotients are 1, 3, 48, . . .
Now if we consider the most significant words, namely 427 and 321, we get
the quotients 1, 3, 35, . . . If we stop after the first two quotients, we see
that we can replace the initial inputs by a − b and −3a + 4b, which gives
106,308,975,811 and 2,183,765,837.
Lehmer's algorithm determines cofactors from the most significant words
of the input integers. Those cofactors usually have size only half a word.
The DoubleDigitGcd algorithm — which should be called “double-word”
instead — uses the two most significant words, which gives cofactors t, u, v, w
of one full word. This is optimal for the computation of the four products ta,
ub, va, wb. With the above example, if we consider 427,419 and 321,110, we
find that the first five quotients agree, so we can replace a, b by −148a + 197b
and 441a − 587b, i.e. 695,550,202 and 97,115,231.

 

1  Algorithm DoubleDigitGcd.
2  Input: a := a_{n−1}β^{n−1} + · · · + a₀, b := b_{m−1}β^{m−1} + · · · + b₀.
3  Output: gcd(a, b).
4  if b = 0 then return a
5  if m < 2 then return BasecaseGcd(a, b)
6  if a < b or n > m then return DoubleDigitGcd(b, a mod b)
7  (t, u, v, w) ← HalfBezout(a_{n−1}β + a_{n−2}, b_{n−1}β + b_{n−2})
8  Return DoubleDigitGcd(|ta + ub|, |va + wb|).




Note: in DoubleDigitGcd, we assume a and b are the current values of
a_{n−1}β^{n−1} + · · · + a₀ and b_{m−1}β^{m−1} + · · · + b₀, with n and m updated after
each step, so that a_{n−1} ≠ 0 and b_{m−1} ≠ 0. The subroutine HalfBezout takes
as input two 2-word integers, performs Euclid's algorithm until the smallest
remainder fits in one word, and returns the corresponding matrix (t, u; v, w).

Binary Gcd. A better algorithm than Euclid's, still of complexity O(n²),
is the binary algorithm. It differs from Euclid's algorithm in two ways: it
considers least significant bits first, and it avoids expensive divisions, which
most of the time give a small quotient.

 

1   Algorithm BinaryGcd.
2   Input: a, b > 0.
3   Output: gcd(a, b).
4   i ← 0
5   while a mod 2 = b mod 2 = 0 do
6       (i, a, b) ← (i + 1, a/2, b/2)
7   while a mod 2 = 0 do
8       a ← a/2
9   while b mod 2 = 0 do
10      b ← b/2
11  while a ≠ b do
12      (a, b) ← (|a − b|, min(a, b))
13      repeat a ← a/2 until a mod 2 ≠ 0
14  Return 2^i · a.
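The binary algorithm translates almost verbatim into Python; the sketch below works bit by bit, whereas production code would of course work word by word.

def binary_gcd(a, b):
    # Algorithm BinaryGcd, for a, b > 0
    i = 0
    while a % 2 == 0 and b % 2 == 0:     # extract common powers of two
        i, a, b = i + 1, a // 2, b // 2
    while a % 2 == 0:
        a //= 2
    while b % 2 == 0:
        b //= 2
    while a != b:                        # both odd here, so a - b is even
        a, b = abs(a - b), min(a, b)
        while a % 2 == 0:                # repeat a <- a/2 until a is odd
            a //= 2
    return (1 << i) * a

from math import gcd
assert all(binary_gcd(a, b) == gcd(a, b)
           for a in range(1, 50) for b in range(1, 50))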




Sorenson’s k-ary reduction

The binary algorithm is based on the fact that if a and b are both odd, then
a − b is even, and we can remove a factor of two since 2 does not divide
gcd(a, b). Sorenson’s k-ary reduction is a generalization of that idea: given
a and b odd, we try to find small integers u, v such that ua − vb is divisible
by a large power of two.

Theorem 1.6.1 [55] If a, b > 0 and m > 1 with gcd(a, m) = gcd(b, m) = 1,
there exist u, v with 0 < |u|, v < √m such that ua ≡ vb mod m.

The following algorithm, ReducedRatMod, finds such a pair (u, v): it is a


simple variation of the extended Euclidean algorithm; indeed, the u_i are
denominators from the continued fraction expansion of c/m.
When m is a prime power, the inversion 1/b mod m at line 4 can be
performed efficiently using Hensel lifting (§2.3), otherwise by an extended
gcd algorithm (§1.6.2).

 

1   Algorithm ReducedRatMod.
2   Input: a, b > 0, m > 1 with gcd(a, m) = gcd(b, m) = 1
3   Output: (u, v) such that 0 < |u|, v < √m and ua ≡ vb mod m
4   c ← a/b mod m
5   (u₁, v₁) ← (0, m)
6   (u₂, v₂) ← (1, c)
7   while v₂ ≥ √m do
8       q ← ⌊v₁/v₂⌋
9       (u₁, u₂) ← (u₂, u₁ − qu₂)
10      (v₁, v₂) ← (v₂, v₁ − qv₂)
11  return (u₂, v₂).
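In Python, Algorithm ReducedRatMod reads as follows; the built-in pow(b, -1, m) (Python 3.8 or later) plays the role of the inversion at line 4.

def reduced_rat_mod(a, b, m):
    # returns (u, v) with 0 < |u|, v < sqrt(m) and u*a = v*b mod m
    c = a * pow(b, -1, m) % m
    u1, v1 = 0, m
    u2, v2 = 1, c
    while v2 * v2 >= m:                  # while v2 >= sqrt(m)
        q = v1 // v2
        u1, u2 = u2, u1 - q * u2
        v1, v2 = v2, v1 - q * v2
    return u2, v2

u, v = reduced_rat_mod(17, 5, 101)       # gives (-3, 10)
assert (u * 17 - v * 5) % 101 == 0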




1.6.2 Extended Gcd


Algorithm ExtendedGcd (Table 1.1) solves the extended greatest common
divisor problem: given two integers a and b, it computes their gcd g, and
also two integers u and v (called Bézout coefficients or sometimes cofactors
or multipliers) such that g = ua + vb. If a0 and b0 are the input numbers,

 

1   Input: integers a and b.
2   Output: integers (g, u, v) such that g = gcd(a, b) = ua + vb.
3   (u, w) ← (1, 0)
4   (v, x) ← (0, 1)
5   while b ≠ 0 do
6       (q, r) ← DivRem(a, b)
7       (a, b) ← (b, r)
8       (u, w) ← (w, u − qw)
9       (v, x) ← (x, v − qx)
10  Return (a, u, v).




Table 1.1: Algorithm ExtendedGcd.

and a, b the current values, the following invariants hold: a = ua0 + vb0 , and
b = wa0 + xb0 .
An important special case is modular inversion (see Ch. 2): given an
integer n, one wants to compute 1/a mod n for a prime to n. One then
simply runs algorithm ExtendedGcd with input a and b = n: this yields u

and v with ua + vn = 1, thus 1/a = u mod n. But since v is not needed here,
we can simply avoid computing v and x, by removing lines 4 and 9.
In practice, it may be interesting to compute only u in the general case
too. Indeed, the cofactor v can be recovered afterwards by v = (g − ua)/b;
this division is exact (see §1.4.5).
All known algorithms for subquadratic gcd rely on an extended gcd sub-
routine, so we refer to §1.6.3 for subquadratic extended gcd.
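A direct Python version of Algorithm ExtendedGcd, together with the modular-inversion variant that drops the cofactor v, is sketched below.

def extended_gcd(a, b):
    # returns (g, u, v) with g = gcd(a, b) = u*a + v*b
    u, w = 1, 0
    v, x = 0, 1
    while b != 0:
        q, r = divmod(a, b)
        a, b = b, r
        u, w = w, u - q * w
        v, x = x, v - q * x
    return a, u, v

def inverse_mod(a, n):
    # 1/a mod n for a prime to n; the v-sequence is not needed here
    g, u, _ = extended_gcd(a, n)
    if g != 1:
        raise ValueError("a is not invertible modulo n")
    return u % n

assert inverse_mod(7, 100) * 7 % 100 == 1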

1.6.3 Divide and Conquer Gcd


Designing a subquadratic integer gcd algorithm that is both mathematically
correct and efficient in practice appears to be quite a challenging problem.
A first remark is that, starting from n-bit inputs, there are O(n) terms in
the remainder sequence r₀ = a, r₁ = b, . . . , r_{i+1} = r_{i−1} mod r_i, . . . , and the
size of r_i decreases linearly with i. Thus computing all the partial remainders
r_i leads to a quadratic cost, and a fast algorithm should avoid this. However,
the partial quotients q_i = r_{i−1} div r_i are usually small, and their total size
is O(n), so computing them is less expensive.
The main idea is thus to compute the partial quotients without computing
the partial remainders. This can be seen as a generalization of the
DoubleDigitGcd algorithm: instead of considering a fixed base β, adjust it
so that the inputs have four “big words”. The cofactor matrix returned by
the HalfBezout subroutine will then reduce the input size to about 3n/4. A
second call with the remaining two most significant “big words” of the new
remainders will reduce their size to half the input size. This gives rise to the
HalfGcd algorithm:
Let H(n) be the complexity of HalfGcd for inputs of n bits: a₁ and b₁ have
n/2 bits, thus the coefficients of S and a₂, b₂ have n/4 bits. Thus a′, b′ have
3n/4 bits, a′₁, b′₁ have n/2 bits, a′₀, b′₀ have n/4 bits, the coefficients of T and
a′₂, b′₂ have n/4 bits, and a″, b″ have n/2 bits. We have H(n) ∼ 2H(n/2) +
4M(n/4, n/2) + 4M(n/4) + 8M(n/4), i.e. H(n) ∼ 2H(n/2) + 20M(n/4). If
we do not need the final matrix S · T, then we have H*(n) ∼ H(n) − 8M(n/4).
For the plain gcd, which simply calls HalfGcd until b is sufficiently small to
call a naive algorithm, the corresponding cost G(n) satisfies G(n) = H*(n) +
G(n/2).
An application of the half gcd per se is the integer reconstruction problem.
Assume one wants to compute a rational p/q where p and q are known to
be bounded by some constant c. Instead of computing with rationals, one may

 

1   Algorithm HalfGcd.
2   Input: a ≥ b > 0
3   Output: a 2 × 2 matrix R and a′, b′ such that (a′, b′)ᵗ = R · (a, b)ᵗ
4   n ← nbits(a), k ← ⌊n/2⌋
5   a =: a₁2^k + a₀, b =: b₁2^k + b₀
6   S, a₂, b₂ ← HalfGcd(a₁, b₁)
7   a′ ← a₂2^k + S₁₁a₀ + S₁₂b₀
8   b′ ← b₂2^k + S₂₁a₀ + S₂₂b₀
9   l ← ⌊k/2⌋
10  a′ =: a′₁2^l + a′₀, b′ =: b′₁2^l + b′₀
11  T, a′₂, b′₂ ← HalfGcd(a′₁, b′₁)
12  a″ ← a′₂2^l + T₁₁a′₀ + T₁₂b′₀
13  b″ ← b′₂2^l + T₂₁a′₀ + T₂₂b′₀
14  Return S · T, a″, b″.




         naive   Karatsuba   Toom-Cook   FFT
H(n)     2.5     6.67        9.52        5 log₂ n
H*(n)    2.0     5.78        8.48        5 log₂ n
G(n)     2.67    8.67        13.29       10 log₂ n

Table 1.2: Cost of HalfGcd, with — H(n) — and without — H*(n) — the
cofactor matrix, and plain gcd — G(n) —, in terms of the multiplication cost
M(n), for naive multiplication, Karatsuba, Toom-Cook and FFT.

perform all computations modulo some integer n > c². Hence one will end
up with p/q ≡ m mod n, and the problem is now to find the unknown p and q
from the known integer m. To do this, one starts an extended gcd from m
and n, and one stops as soon as the current a and u are smaller than c: since
we have a = um + vn, this gives m ≡ a/u mod n. This is exactly what is
called a half-gcd; a subquadratic version is given in §1.6.3.
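The following Python sketch implements this (quadratic) reconstruction; sign conventions differ slightly from the discussion above, and the names are ours.

def rational_reconstruct(m, n, c):
    # recover (p, q) with p/q = m mod n, assuming |p|, q <= c exists
    a, b = n, m
    u0, u1 = 0, 1                 # invariant: b = u1*m (mod n)
    while b > c:
        q = a // b
        a, b = b, a - q * b
        u0, u1 = u1, u0 - q * u1
    return b, u1                  # p = b, q = u1 (up to sign)

m = 2 * pow(3, -1, 101) % 101     # 2/3 mod 101 = 68
p, q = rational_reconstruct(m, 101, 10)
assert (p - q * m) % 101 == 0 and (p, q) == (2, 3)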

Subquadratic binary gcd

The binary gcd can also be made fast: see Table 1.3. The idea is to mimic
the left-to-right version, by defining an appropriate right-to-left division (Al-
gorithm BinaryDivide).

 

1   Algorithm BinaryHalfGcd.
2   Input: P, Q ∈ Z with 0 = ν(P) < ν(Q), and k ∈ N
3   Output: a 2 × 2 integer matrix R, j ∈ N, and P′, Q′ such that
4           (P′, Q′)ᵗ = 2^{−j} R · (P, Q)ᵗ with ν(P′) ≤ k < ν(Q′)
5   m ← ν(Q), d ← ⌊k/2⌋
6   if k < m then return R = Id, j = 0, P′ = P, Q′ = Q
7   decompose P into P₁2^{2d+1} + P₀, same for Q
8   R, j₁, P′₀, Q′₀ ← BinaryHalfGcd(P₀, Q₀, d)
9   P′ ← (R₁,₁P₁ + R₁,₂Q₁)2^{2d+1−2j₁} + P′₀
10  Q′ ← (R₂,₁P₁ + R₂,₂Q₁)2^{2d+1−2j₁} + Q′₀
11  m ← ν(Q′), if k < j₁ + m then return R, j₁, P′, Q′
12  q ← BinaryDivide(P′, Q′)
13  P′ ← P′ + q2^{−m}Q′, d′ ← k − (j₁ + m)
14  (P′, Q′) ← (2^{−m}P′, 2^{−m}Q′)
15  decompose P′ into P₃2^{2d′+1} + P₂, same for Q′
16  S, j₂, P′₂, Q′₂ ← BinaryHalfGcd(P₂, Q₂, d′)
17  (P″, Q″) ← ([S₁,₁P₃ + S₁,₂Q₃]2^{2d′+1−2j₂} + P′₂, [S₂,₁P₃ + S₂,₂Q₃]2^{2d′+1−2j₂} + Q′₂)
18  Return S · [0, 2^m; 2^m, q] · R, j₁ + m + j₂, Q″, P″.

19
20  Algorithm BinaryDivide.
21  Input: P, Q ∈ Z with 0 = ν(P) < ν(Q) = j
22  Output: |q| < 2^j such that ν(Q) < ν(P + q2^{−j}Q)
23  Q′ ← 2^{−j}Q
24  q ← −P/Q′ mod 2^{j+1}
25  if q < 2^j then return q else return q − 2^{j+1}

Table 1.3: A subquadratic binary gcd algorithm.
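Algorithm BinaryDivide is short enough to give a Python sketch (ν is the 2-adic valuation):

def nu(x):
    # 2-adic valuation of x != 0
    return (x & -x).bit_length() - 1

def binary_divide(P, Q):
    # returns |q| < 2^j, j = nu(Q), such that nu(P + q*2^(-j)*Q) > nu(Q)
    j = nu(Q)
    Qp = Q >> j
    q = (-P * pow(Qp, -1, 1 << (j + 1))) % (1 << (j + 1))
    return q if q < (1 << j) else q - (1 << (j + 1))

P, Q = 5, 12                      # nu(5) = 0 < nu(12) = 2
q = binary_divide(P, Q)
assert nu(P + q * (Q >> nu(Q))) > nu(Q)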

1.7 Conversion
Since computers usually work with binary numbers, and humans prefer decimal
representations, input/output base conversions are needed. In a typical
computation, there are only a few conversions, compared to the total number
of operations, so optimizing conversions is less important. However, when
working with huge numbers, naïve conversion algorithms — such as those found
in several software packages — may slow down the whole computation.
In this section we consider that numbers are represented internally in base
β — think of 2 or a power of 2 — and externally in base B — for example 10

or a power of 10. When both bases are commensurable, i.e. both are powers
of a common integer, like 8 and 16, conversions of n-digit numbers can be
performed in O(n) operations. We therefore assume that β and B are not
commensurable from now on.
One may think that since input and output are symmetric by exchanging
bases β and B, only one algorithm is needed. Unfortunately, this is not true,
since computations are done in base β only.

1.7.1 Quadratic Algorithms


The following two algorithms respectively read and print n-word integers,
both with a complexity of O(n2 ).

 

1   Algorithm IntegerInput.
2   Input: a string S = s_{m−1} . . . s₁s₀ of digits in base B
3   Output: the value A of the integer represented by S
4   A ← 0
5   for i from m − 1 downto 0 do
6       A ← BA + val(s_i)
7   Return A.





 

1   Algorithm IntegerOutput.
2   Input: A = Σ_{i=0}^{n−1} a_iβ^i
3   Output: a string S of characters, representing A in base B
4   m ← 0
5   while A ≠ 0
6       s_m ← char(A mod B)
7       A ← A div B
8       m ← m + 1
9   Return S = s_{m−1} . . . s₁s₀.
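Both quadratic routines are straightforward in Python: Horner evaluation for input, repeated division for output. A sketch:

def integer_input(s, B=10):
    # Algorithm IntegerInput: most significant digit first
    a = 0
    for c in s:
        a = B * a + int(c, B)
    return a

def integer_output(a, B=10):
    # Algorithm IntegerOutput
    alphabet = "0123456789abcdefghijklmnopqrstuvwxyz"
    s = ""
    while a != 0:
        a, r = divmod(a, B)
        s = alphabet[r] + s
    return s or "0"

assert integer_output(integer_input("123456789")) == "123456789"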




1.7.2 Subquadratic Algorithms


Fast conversions routines are obtained using a “divide and conquer” strategy.
For integer input, if the given string decomposes as S = Shi || Slo where Slo

has k digits in base B, then

Input(S, B) = Input(S_lo, B) + B^k Input(S_hi, B),

where Input(S, B) is the value obtained when reading the string S in the
external base B. The following algorithm shows a possible way to implement
this:

 

1   Algorithm IntegerInput.
2   Input: a string S = s_{m−1} . . . s₁s₀ of digits in base B
3   Output: the value A of the integer represented by S
4   l ← [val(s₀), val(s₁), . . . , val(s_{m−1})]
5   (b, k) ← (B, m)
6   while k > 1 do
7       if k even then l ← [l₁ + bl₂, l₃ + bl₄, . . . , l_{k−1} + bl_k]
8       else l ← [l₁ + bl₂, l₃ + bl₄, . . . , l_k]
9       (b, k) ← (b², ⌈k/2⌉)
10  Return l₁.




If the output A has n words, Algorithm IntegerInput has complexity
O(M(n) log n), more precisely ∼ (1/2)M(n/2) log₂ n for n a power of two (see
Ex. 1.9.20).
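The pairwise-combination loop is concise in Python; the sketch below follows the algorithm above, with the large balanced multiplications of the last passes dominating the cost.

def fast_integer_input(s, B=10):
    # subquadratic IntegerInput: combine digits pairwise, squaring the base
    l = [int(c, B) for c in reversed(s)]   # least significant digit first
    b, k = B, len(l)
    while k > 1:
        last = [l[-1]] if k % 2 else []
        l = [l[i] + b * l[i + 1] for i in range(0, k - 1, 2)] + last
        b, k = b * b, (k + 1) // 2
    return l[0]

assert fast_integer_input("123456789") == 123456789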
For integer output, a similar algorithm can be designed, replacing
multiplications by divisions. Namely, if A = A_lo + B^k A_hi, then

Output(A, B) = Output(Ahi , B) || Output(Alo , B),

where Output(A, B) is the string resulting from the printing of the integer A
in the external base B, S1 || S0 denotes the concatenation of S1 and S0 , and
it is assumed that Output(Alo , B) has k digits, after possibly adding leading
zeros.
If the input A has n words, algorithm IntegerOutput has complexity
O(M(n) log n), more precisely ∼ (1/2)D(n/2) log₂ n for n a power of two, where
D(n/2) is the cost of dividing an n-word integer by an n/2-word integer.
Depending on the cost ratio between multiplication and division, integer
output may thus be 2 to 5 times slower than integer input; see however
Ex. 1.9.21.

 

1   Algorithm IntegerOutput.
2   Input: A = Σ_{i=0}^{n−1} a_iβ^i
3   Output: a string S of characters, representing A in base B
4   if A < B then char(A)
5   else
6       find k such that B^{2k−2} ≤ A < B^{2k}
7       (Q, R) ← DivRem(A, B^k)
8       IntegerOutput(Q) || IntegerOutput(R)
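A Python sketch of the recursive output routine follows; the base-case threshold is arbitrary here, and integer_output is the quadratic routine of §1.7.1 sketched earlier.

def fast_integer_output(A, B=10):
    if A < B ** 10:                        # small: quadratic base case
        return integer_output(A, B)
    k = 1
    while B ** (2 * k) <= A:               # find k with B^(2k-2) <= A < B^(2k)
        k += 1
    Q, R = divmod(A, B ** k)
    low = fast_integer_output(R, B)
    # pad the low part with leading zeros up to k digits
    return fast_integer_output(Q, B) + "0" * (k - len(low)) + low

assert fast_integer_output(10 ** 30 + 7) == str(10 ** 30 + 7)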




1.8 Notes and further references


Very little is known about the average complexity of Karatsuba's algorithm.
What is clear is that no simple asymptotic equivalent can be obtained, since
the ratio K(n)/n^α does not converge. See Ex. 1.9.1.
A very good description of Toom-Cook algorithms can be found in [20,
Section 9.5.1], in particular how to symbolically generate the evaluation and
interpolation formulæ.
The exact division algorithm starting from least significant bits is due to
Jebelean [27], who also invented, with Krandick, the “bidirectional” algorithm
[33]. The Karp-Markstein trick to speed up Newton's iteration (or Hensel
lifting over p-adic numbers) is described in [29]. The “recursive division” in
§1.4.3 is from [17], although previous but not-so-detailed ideas can be found
in [37] or [26].
The square root algorithm in §1.5.1 was proven in [7].
The binary gcd was analysed by Brent [10, 13], Knuth [31, 32] and
Vallée [52]. The double-digit gcd (which should be called double-word gcd
instead) is due to Jebelean [28]. Sorenson’s k-ary reduction is due to Soren-
son [48], and was improved and implemented in GNU MP by Weber, who
also invented algorithm ReducedRatMod [55]. The first subquadratic gcd al-
gorithm was published by Knuth [30], but his analysis was suboptimal —
he gave O(n(log n)^5 (log log n)) —, and the correct complexity was given by
Schönhage [44]: some people thus call it the Knuth-Schönhage algorithm. A
description in the polynomial case can be found in [2], and a detailed but
incorrect one in the integer case in [56]. The subquadratic binary gcd given
here is due to Stehlé and Zimmermann [49].

1.9 Exercises
Exercise 1.9.1 [Hanrot] Prove that the number K(n) of word products in Karatsuba's
algorithm as defined in Th. 1.3.2 is non-decreasing for n₀ = 2 (caution:
this is no longer true with a larger threshold, for example with n₀ = 8 we have
K(7) = 49 whereas K(8) = 48). Plot the graph of K(n)/n^{log₂ 3} with a logarithmic
scale for n, for 2^7 ≤ n ≤ 2^{10}, and find experimentally where the maximum appears.

Exercise 1.9.2 [Ryde] Assume the basecase multiply costs M(n) = an² + bn, and
that Karatsuba's algorithm costs K(n) = 3K(n/2) + cn. Show that dividing a by
two increases the Karatsuba threshold n₀ by a factor of two, and that on the
contrary decreasing b and c decreases n₀.

Exercise 1.9.3 [Maeder [35]] Show that an auxiliary memory of 2n + 2⌊log₂ n⌋ − 2
words is enough to implement Karatsuba's algorithm in-place.

Exercise 1.9.4 [Quercia, McLaughlin] Show that Algorithm KaratsubaMultiply
can be implemented with only ∼ (7/2)n additions/subtractions. [Hint: decompose
C₀, C₁ and C₂ in two parts.]

Exercise 1.9.5 Design an in-place version of Algorithm KaratsubaMultiply (see


Ex. 1.9.3) that accumulates the result in c0 , . . . , c2n−1 , and returns a carry bit.

Exercise 1.9.6 [Vuillemin [54]] Design a program or circuit to compute a 3 ×


2 product in 4 multiplications. Then use it to perform a 6 × 6 product in 16
multiplications. How does this compare asymptotically with Karatsuba and Toom-
Cook 3-way?

Exercise 1.9.7 [Weimerskirch, Paar] Extend the Karatsuba trick to compute an
n × n product in n(n+1)/2 multiplications and (5n−2)(n−1)/2 additions/subtractions.
For which n does this win?

Exercise 1.9.8 Prove that if 5 integer evaluation points are used for Toom-Cook
3-way, the division by 3 cannot be avoided. Does this remain true if only 4 integer
points are used together with ∞?

Exercise 1.9.9 For multiplication of two numbers of size kn and n, with k > 1
integer, show that the trivial strategy which performs k multiplications n × n is
not always the best possible.

Exercise 1.9.10 [Hanrot] In Karatsuba's algorithm, instead of splitting the
operands into high and low parts, one can split them into odd and even parts.
Considering the inputs as polynomials A(β) and B(β), this corresponds to writing
A(t) = A₀(t²) + tA₁(t²). This is known as the “odd-even” scheme [25]. Design
an algorithm UnbalancedKaratsuba using that scheme. Show that its complexity
satisfies K(m, n) = 2K(⌈m/2⌉, ⌈n/2⌉) + K(⌊m/2⌋, ⌊n/2⌋).

Exercise 1.9.11 [Karatsuba, Zuras [57]] Assuming the multiplication has superlinear
cost, show that the speedup of squaring with respect to multiplication cannot
exceed 2.
Now we go from a multiplication algorithm of cost cn^α to Toom-Cook r-way;
get an expression for the threshold n₀, assuming the Toom-Cook cost has a
second-order term kn. See how this threshold evolves when c is replaced by another
constant, in particular show that this threshold increases for squaring (c′ < c).
Assuming Toom-Cook r-way has cost ln^β for multiplication, and l′n^β for squaring,
obtain a closed-form expression for the ratio l′/l, in terms of c, c′, α, β.

Exercise 1.9.12 [Thomé, Quercia] Multiplication and the middle product are
just special cases of linear forms programs: consider two sets of inputs a₁, . . . , a_n
and b₁, . . . , b_m, and a set of outputs c₁, . . . , c_k that are sums of products a_ib_j.
For such a given problem, what is the least number of multiplications required? As
an example, can we compute x = au + cw, y = av + bw, z = bu + cv in fewer than 6
multiplications? Same question for x = au − cw, y = av − bw, z = bu − cv.

Exercise 1.9.13 In algorithm BasecaseDivRem (§1.4.1), prove that q*_j ≤ β + 1.
Can this bound be reached? In the case q*_j ≥ β, prove that the while-loop at steps
9–11 is executed at most once.
Prove that the same holds for Svoboda's algorithm, i.e. that A ≥ 0 after step 11.

Exercise 1.9.14 [Granlund, Möller] In algorithm BasecaseDivRem, estimate
the probability that A < 0 is true at line 9, assuming the remainder r_j from
the division of a_{n+j}β + a_{n+j−1} by b_{n−1} is uniformly distributed in [0, b_{n−1} − 1],
A mod β^{n+j−1} is uniformly distributed in [0, β^{n+j−1} − 1], and B mod β^{n−1} is
uniformly distributed in [0, β^{n−1} − 1]. Then replace the computation of q*_j by a
division of the three most significant words of A by the two most significant words
of B. Prove the algorithm is still correct; what is the maximal number of corrections,
and the probability that A < 0 holds?

Exercise 1.9.15 In Algorithm RecursiveDivRem, find inputs that require 1,
2, 3 or 4 corrections [hint: consider β = 2]. Prove that when n = m and A <
β^m(B + 1), at most two corrections occur.

Exercise 1.9.16 Find the asymptotic complexity of Algorithm RecursiveDi-


vRem in the FFT range.

Exercise 1.9.17 Consider the division of A of kn words by B of n words, with


integer k ≥ 3, and the alternate strategy which consists in extending the divisor
with zeroes so that it has half the size of the dividend. Show this is always slower
than Algorithm UnbalancedDivision [assuming the division has superlinear cost].

Exercise 1.9.18 An important special case of division is when the divisor is of
the form b^k. This is useful for example for the output routine (§1.7). Can one
design a fast algorithm for that case?

Exercise 1.9.19 Design an algorithm that performs an exact division of a 4n-bit
integer by a 2n-bit integer, with a quotient of 2n bits, using the idea from the last
paragraph of §1.4.5. Prove that your algorithm is correct.

Exercise 1.9.20 Find the asymptotic complexity T(n) of Algorithm IntegerInput
for n = 2^k (§1.7.2), and show that, for general n, it is within a factor of two of
T(n) [Hint: consider the binary expansion of n]. Design another subquadratic
algorithm that works top-down: is it faster?

Exercise 1.9.21 Show that asymptotically, the output routine can be made as
fast as the input routine IntegerInput. [Hint: use Bernstein’s scaled remain-
der tree and the middle product.] Experiment with it on your favorite multiple-
precision software.

Exercise 1.9.22 If the internal base β and the external one B share a common
divisor — as in the case β = 2^l and B = 10 — show how one can exploit this to
speed up the subquadratic input and output routines.

Exercise 1.9.23 Assume you are given two n-digit integers in base 10, but you
have fast arithmetic in base 2 only. Can you multiply them in O(M (n))?
Chapter 2

The FFT, Modular Arithmetic


and Finite Fields

2.1 Representation
2.1.1 Classical Representations
Non-negative, symmetric

2.1.2 Montgomery’s Representation


2.1.3 MSB vs LSB Algorithms
Many classical (most significant bits first) algorithms have a p-adic (least
significant bit first) equivalent:

classical (MSB) p-adic (LSB)


Euclidean division Montgomery reduction
Svoboda’s algorithm Montgomery-Svoboda
Euclidean gcd 2-adic gcd
Newton’s iteration Hensel lifting

2.1.4 Residue Number System


CRT, parallel/distributed algorithms.


2.1.5 Link with polynomials


Modular arithmetic on polynomials.

2.2 Multiplication
2.2.1 Barrett’s Algorithm
Barrett’s algorithm [3] is interesting when many divisions have to be made
with the same divisor; this is in particular the case when one performs compu-
tations modulo a fixed integer. The idea is to precompute an approximation
of the inverse of the divisor. In this way, an approximation of the quotient is
obtained with just one multiplication, and the corresponding remainder after
a second one. A small number of corrections suffice to convert those
approximations into exact values.

 

1   Algorithm BarrettDivRem.
2   Input: integers A, B with 0 ≤ A < 2^n B, 2^{n−1} < B < 2^n.
3   Output: quotient Q and remainder R of A divided by B.
4   I ← ⌊2^{2n}/B⌋   [precomputation]
5   Q′ ← ⌊A₁I/2^n⌋ where A = A₁2^n + A₀ with 0 ≤ A₀ < 2^n
6   R′ ← A − Q′B
7   while R′ ≥ B do
8       (Q′, R′) ← (Q′ + 1, R′ − B)
9   Return (Q′, R′).




Theorem 2.2.1 Algorithm BarrettDivRem is correct.

Proof Since 2^{n−1} < B < 2^n, we have 2^n < 2^{2n}/B < 2^{n+1}, thus 2^n ≤ I <
2^{n+1}. We have Q′ ≤ A₁I/2^n ≤ (A₁/2^n)(2^{2n}/B) = A₁2^n/B ≤ A/B. This ensures
that R′ is nonnegative. Now I > 2^{2n}/B − 1, which gives

BI > 2^{2n} − B.    (2.1)

Similarly, Q′ > A₁I/2^n − 1 gives

2^n Q′ > A₁I − 2^n.    (2.2)

This gives 2^n Q′B > A₁IB − 2^n B > A₁(2^{2n} − B) − 2^n B = 2^n A − 2^n A₀ − B(2^n +
A₁) > 2^n A − 4 · 2^n B, since A₀ < 2^n < 2B and A₁ ≤ B. We thus conclude that
A < (Q′ + 4)B, thus at most 3 corrections are needed.

The bound of 3 corrections is tight: it is obtained for A = 1980, B = 36,
n = 6. Indeed, we have then I = 113, A₁ = 30, Q′ = 52, R′ = 108 = 3B.
Remark: the multiplications at steps 5 and 6 may be replaced by short
products, that of step 5 by a high short product, and that of step 6 by a low
short product.
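A Python sketch of Barrett's division; I is computed once per divisor, so each subsequent division costs two multiplications plus at most three corrections.

def barrett_divrem(A, B, n, I):
    # requires 0 <= A < 2^n * B, 2^(n-1) < B < 2^n, I = floor(2^(2n)/B)
    Q = ((A >> n) * I) >> n                # A1 = A >> n
    R = A - Q * B
    while R >= B:                          # at most 3 corrections
        Q, R = Q + 1, R - B
    return Q, R

n, B = 6, 36
I = (1 << (2 * n)) // B                    # precomputation: I = 113
assert barrett_divrem(1980, B, n, I) == divmod(1980, B)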

2.2.2 Montgomery’s Algorithm



 

1   Algorithm REDC.
2   Input: 0 ≤ C < β^{2n}, N, µ ← −N^{−1} mod β
3   Output: 0 ≤ R < β^n such that R = Cβ^{−n} mod N
4   Assume C decomposes into Σ_{i=0}^{2n−1} c_iβ^i, where the digits c_i evolve during the loop
5   for i from 0 to n − 1 do
6       q ← µc_i mod β
7       C ← C + qNβ^i
8   R ← Cβ^{−n}
9   if R ≥ β^n then return R − N else return R.




Theorem 2.2.2 Algorithm REDC is correct.

Proof Assume that for a given i, we have c₀ = . . . = c_{i−1} = 0 when entering
step 6. Since q = −c_i/N mod β, we have C + qNβ^i ≡ 0 mod β^{i+1} at the
next step, so c_i becomes 0. Thus, when one exits the for-loop, C is a multiple
of β^n, and R is well defined at step 8.
Still at step 8, we have C < β^{2n} + (β − 1)N(1 + β + · · · + β^{n−1}) =
β^{2n} + N(β^n − 1), thus R < β^n + N, and R − N < β^n.

Remark: a subquadratic version of REDC is obtained by taking n = 1,


and considering β as a “big base”. This is exactly the 2-adic counterpart
of Barrett’s subquadratic algorithm: step 6 can be replaced by a low short
product, and step 7 by a high short product.
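A word-by-word Python sketch of REDC; the word size w is a parameter of our own, and the output is only guaranteed to be < β^n (it may still need a final reduction mod N).

def redc(C, N, n, w=64):
    # R = C * beta^(-n) mod N with beta = 2^w, for 0 <= C < beta^(2n), N odd
    beta = 1 << w
    mu = (-pow(N, -1, beta)) % beta        # mu = -1/N mod beta
    for i in range(n):
        ci = (C >> (w * i)) & (beta - 1)   # current digit i of C
        q = (mu * ci) & (beta - 1)
        C += (q * N) << (w * i)            # zeroes out digit i of C
    R = C >> (w * n)
    return R - N if R >= (1 << (w * n)) else R

N, n, w = 1000003, 2, 16                   # beta^n = 2^32 > N, N odd
C = 0x123456789abcde                       # some C < beta^(2n) = 2^64
assert redc(C, N, n, w) % N == C * pow(1 << (w * n), -1, N) % N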

2.2.3 Special Moduli


β^n ± 1, Schönhage-Strassen FFT (including the extra primitive root 2^{3n/4} − 2^{n/4}
modulo 2^n + 1).
Multiplication modulo a ± b where a, b are highly composite.

2.3 Division/Inversion
Link to extended GCD (Ch. 1) or Fermat (cf MCA).
Describe here Hensel lifting for inversion mod p^k (link with division by a
constant in §1.4.7). Cite the paper of Shanks-Vuillemin for division mod β^n.

2.3.1 Several Inversions at Once


A modular inversion, which reduces to an extended gcd (§1.6.2), is usually
much more expensive than a multiplication. This is true not only in the
FFT range, where a gcd takes time M (n) log n, but also for smaller numbers.
When several inversions are to be performed modulo the same number, the
following algorithm is usually faster:

 

1   Algorithm MultipleInversion.
2   Input: residues x₁, . . . , x_k modulo n
3   Output: y₁ = 1/x₁, . . . , y_k = 1/x_k modulo n
4   z₁ ← x₁
5   for i from 2 to k do
6       z_i ← z_{i−1}x_i mod n
7   q ← 1/z_k mod n
8   for i from k downto 2 do
9       y_i ← qz_{i−1} mod n
10      q ← qx_i mod n
11  y₁ ← q




Proof We have z_i = x₁x₂ . . . x_i mod n, thus at the beginning of iteration i of
the second loop, q = (x₁ . . . x_i)^{−1} mod n, which indeed gives y_i = 1/x_i mod n.

This algorithm uses only one modular inversion, and 3(k − 1) modular multiplications.
It is thus faster when an inversion is at least 3 times as expensive as a
product. Fig. 2.1 shows a recursive variant of that algorithm, with the same
number of modular multiplications: one for each internal node when going up
the (product) tree, and two for each internal node when going down the
(remainder) tree.
A dual case is when the number to invert is invariant, and we want to
compute 1/x mod n1 , . . . , 1/x mod nk . A similar algorithm works as follows:
first compute N = n1 . . . nk using a product tree like in Fig. 2.1. Then
compute 1/x mod N , and go down the tree, while reducing the residue at
[Tree diagram omitted: a remainder tree with root 1/(x₁x₂x₃x₄), internal
nodes 1/(x₁x₂) and 1/(x₃x₄), and leaves 1/x₁, 1/x₂, 1/x₃, 1/x₄.]

Figure 2.1: A recursive variant of Algorithm MultipleInversion.

each node. The main difference is that here, the residues grow while going
up the tree, thus even if it performs only one modular inversion, this method
might be slower for large k.
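In Python, the iterative variant reads as follows (one inversion, 3(k − 1) multiplications; pow(z, -1, n) needs Python 3.8 or later):

def multiple_inversion(xs, n):
    # returns [1/x mod n for x in xs], all xs invertible mod n
    k = len(xs)
    z = [0] * (k + 1)
    z[1] = xs[0]
    for i in range(2, k + 1):              # prefix products z_i = x_1...x_i
        z[i] = z[i - 1] * xs[i - 1] % n
    q = pow(z[k], -1, n)                   # the single modular inversion
    ys = [0] * k
    for i in range(k, 1, -1):
        ys[i - 1] = q * z[i - 1] % n       # y_i = 1/x_i
        q = q * xs[i - 1] % n              # q becomes 1/(x_1...x_{i-1})
    ys[0] = q
    return ys

n, xs = 101, [3, 7, 10, 99]
assert all(x * y % n == 1 for x, y in zip(xs, multiple_inversion(xs, n)))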

2.4 Exponentiation
Link to HAC, Ch. 14.

2.5 Conversion
integer from/to modular (CRT, FFT), 3-primes variant of FFT.

2.6 Finite Fields


FFT in finite fields. Generalization of above, trinomials, ...

2.7 Applications of FFT


Applications and other variants of FFT.

2.8 Exercises
Exercise 2.8.1 Assume you have an FFT algorithm computing products modulo
2^n + 1. Prove that, with some preconditioning, you can perform a division of a
2n-bit integer by an n-bit integer as fast as 1.5 multiplications of n bits by n bits.

2.9 Notes and further references


Applications: Pollard ρ, ECM, roots, ...
Algorithm MultipleInversion is due to Montgomery [38].
Chapter 3

Floating-Point Arithmetic

3.1 Introduction
3.1.1 Representation
mantissa, exponent, sign, position of the point
IEEE 754/854: special values (infinities, NaN), signed zero, rounding
modes (±∞, to zero, to nearest, away).
Binary vs decimal representation.
Implicit vs explicit leading bit.
Links to other possible representations. In her PhD [36], Valérie Ménissier-
Morain discusses three different representations for real numbers (Ch. V):
continued fractions, redundant representation, and the classical non-redundant
representation. She also considers the theory of computable reals, their rep-
resentation by B-adic numbers, and the computation of algebraic or tran-
scendental functions (Ch. III).

3.1.2 Precision vs Accuracy


global precision vs one precision for each variable

3.1.3 Link to Integers


Using f-p numbers for integer word operations, and for integer arbitrary-
precision (expansions: cf Priest [42]). Also use of (complex) floating-point


numbers for FFT multiplication (cf Knuth vol 2, and error analysis in Colin
Percival’s paper [41]).

3.1.4 Error analysis


absolute vs relative vs ulp error
forward vs backward error analysis
Theorem 3.1.1 Consider a binary floating-point system in precision n. Let
u be the rounding to nearest of some real x; then the following inequalities
hold:

|u − x| ≤ (1/2) ulp(u),    |u − x| ≤ 2^{−n}|u|,    |u − x| ≤ 2^{−n}|x|.
Proof Without loss of generality, we can assume u and x positive. The first
inequality follows from the definition of rounding to nearest, and the second
one comes from ulp(u) ≤ 2^{1−n}u. For the last one, we distinguish two cases:
if u ≤ x, it follows from the second one. If x < u, then if x and u are in
the same binade — 2^{e−1} ≤ x < u < 2^e — then (1/2) ulp(u) = 2^{e−1−n} ≤ 2^{−n}x.
The only remaining case is 2^{e−1} ≤ x < u = 2^e. Since the floating-point
number preceding 2^e is 2^e(1 − 2^{−n}), and x was rounded to nearest, we have
|u − x| ≤ 2^{e−1−n} too.

3.1.5 Rounding
Assume we want to correctly round to n bits a real number whose binary
expansion is 0.1b₁ . . . b_n b_{n+1} . . . It is enough to know the value of r = b_{n+1}
— called the round bit — and that of the sticky bit s, which is 0 when
b_{n+2}b_{n+3} . . . is identically zero, and 1 otherwise. The following table shows
how to correctly round from r, s and the given rounding mode; rounding
to ±∞ is converted to rounding to zero or away, according to the sign of
the number.
r s zero nearest away
0 0 0 0 0
0 1 0 0 1
1 0 0 0 or 1 1
1 1 0 1 1

However, in general we don’t have an infinite expansion, but a finite


approximation y of an unknown real value x. The problem is the following:
given the approximation y, and a bound on the error |y − x|, is it possible to
determine the correct rounding of x?

 

1   Algorithm RoundingPossible.
2   Input: a f-p number y = 0.y₁ . . . y_m, y₁ = 1, a precision n ≤ m,
3          an error bound ε = 2^{−k}, a rounding mode ◦
4   Output: true iff ◦_n(x) can be determined for all x with |y − x| ≤ ε
5   If ◦ is to nearest, then n ← n + 1
6   If k ≤ n then return false
7   If ◦ is to nearest and y_n = y_{n+1} then return true
8   If y_{n+1} = y_{n+2} = . . . = y_k then return false
9   Return true.




Proof Since rounding is monotonic, it is possible to determine ◦(x) exactly
when ◦(y − 2^{−k}) = ◦(y + 2^{−k}), or in other words when the interval
[y − 2^{−k}, y + 2^{−k}] contains no rounding boundary. The rounding boundaries
for rounding to nearest in precision n are those for directed rounding in
precision n + 1.
If k ≤ n, then the error on y may change the significand, so it is not
possible to round correctly. In the case of rounding to nearest, if the round bit
and the following bit are equal — thus 00 or 11 — and the error comes after
the round bit, it is possible to round correctly. Otherwise it is possible only
when y_{n+1}, y_{n+2}, . . . , y_k are not all identical.

The Double Rounding Problem

This problem does not happen with all rounding modes (Ex. 3.5.1).

3.1.6 Strategies
To determine the correct rounding of f(x) with n bits of precision, the best
strategy is usually to first compute an approximation y of f(x) with a working
precision of m = n + k bits, with k relatively small. Several strategies are
possible when this first approximation y is not accurate enough, or is too close
to a rounding boundary.

3.2 Addition/Subtraction/Comparison
Leading Zero Anticipation and Detection.
Sterbenz Theorem.
Unlike the integer operations, floating-point addition and subtraction are
more difficult to implement, for two reasons:

• scaling due to the exponents requires shifting the mantissas before
adding or subtracting them. In theory one could perform all operations
using integer operations only, but this would require huge integers, for
example when adding 1 and 2^{−1000};

• since carries propagate from right to left, one may have to look at
arbitrarily low-order bits to guarantee correct rounding.

We distinguish here the “addition”, where both operands have the same
sign — zero operands are treated apart —, and the “subtraction”, where the
operands have different signs.

3.2.1 Floating-Point Addition


The following algorithm adds two binary floating-point numbers b and c.
More precisely, it computes the correct rounding of b + c, with respect to
the given rounding mode ◦. For the sake of simplicity, we assume b and
c are positive, b ≥ c > 0, 2n−1 ≤ b < 2n and 2m−1 ≤ c < 2m . We also
assume that the rounding mode is either to nearest, towards zero or away
from zero (rounding to ±∞ reduces to rounding towards zero or away from
zero, depending on the sign of the operands).

 

1   Algorithm FPadd.
2   Input: b and c two floating-point numbers, a precision n,
3          and a rounding mode ◦.
4   Output: a f-p number a · 2^e of precision n, equal to ◦(b + c).
5   Split b into b_h + b_l where b_h contains the n most significant
6          bits of b.
7   Split c into c_h + c_l where c_h contains the max(0, m) most
8          significant bits of c.
9   a_h ← b_h + c_h, e ← 0
10  (c, r, s) ← b_l + c_l
11  a ← a_h + c + round(◦, r, s)
12  if a ≥ 2^n then
13      a ← round2(◦, a mod 2, t)
14      e ← e + 1
15  if a = 2^n then (a, e) ← (a/2, e + 1)
16  Return (a, e).




The values of round(◦, r, s) and round2(◦, a mod 2, t) are given in Figure 3.1.
At step 10, the notation (c, r, s) ← bl + cl means that c is the carry bit of
bl + cl , r the round bit, and s the sticky bit. For rounding to nearest, t is
a ternary value, which is respectively positive, zero, or negative when a is
larger than, equal to, or smaller than the exact sum b + c.

◦        r    s      round(◦, r, s)                         t
zero     any  any    0
away     r    s      0 if r = s = 0, 1 otherwise
nearest  0    any    0                                      −s
nearest  1    0      0/1 (even rounding)                    −1/1
nearest  1    1      1                                      1

◦        a mod 2  t     round2(◦, a mod 2, t)
any      0        any   a/2
zero     1              (a − 1)/2
away     1              (a + 1)/2
nearest  1        0     (a − 1)/2 if even, (a + 1)/2 otherwise
nearest  1        ±1    (a − t)/2

Figure 3.1: Rounding rules for addition.

Theorem 3.2.1 Algorithm FPadd is correct.


Proof With the assumptions made, b_h and c_h are the integer parts of b and c,
b_l and c_l their fractional parts. Since b ≥ c, we have c_h ≤ b_h and 2^{n−1} ≤ b_h ≤
2^n − 1, thus 2^{n−1} ≤ a_h ≤ 2^{n+1} − 2, and at step 11, 2^{n−1} ≤ a ≤ 2^{n+1}. If a < 2^n,
a is the correct rounding of b + c. Otherwise, we face the “double rounding”
problem: rounding a down to n bits will give the correct result, except when
a is odd and we round to nearest. In that case, we need to know if the
first rounding was exact, and if not, in which direction it was rounded; this
is represented by the ternary value t. After the second rounding, we have
2^{n−1} ≤ a ≤ 2^n.

We may notice that the exponent e_a of the result lies between e_b, the
exponent of b, and e_b + 2. Thus no underflow can happen in an addition. The
case e_a = e_b + 2 can happen only when the destination precision is smaller
than that of the operands.

3.2.2 Leading Zero Detection


3.2.3 Floating-Point Subtraction

3.3 Multiplication, Division, Algebraic Func-


tions
Link to chapter 1.
Bounds, extension to different precisions.
Short multiplication/division.
Middle product (cf Newton). Application to argument reduction: Payne
and Hanek method.

3.3.1 Multiplication
0
The exact product of two floating-point numbers m · 2e and m0 · 2e is (mm0 ) ·
0
2e+e . Therefore, if no underflow or overflow occurs, the problem reduces to
the multiplication of the significands m and m0 .

 

1   Algorithm FPmultiply.
2   Input: x = m · β^e, x′ = m′ · β^{e′}, a precision n, a rounding mode ◦
3   Output: ◦(xx′) rounded to precision n
4   e″ ← e + e′
5   m″ ← ◦(mm′) to precision n
6   Return m″ · β^{e″}.




The product at step 5 is a short product, i.e. a product whose most significant
part only is wanted. In the quadratic range, it can be computed in about
half the time of a full product. In the Karatsuba and Toom-Cook ranges,
Mulders' algorithm can gain 10% to 20%; however, due to carries, using this
algorithm for floating-point computations seems tricky. Lastly, in the FFT
range, no better algorithm is known than computing the full product mm′.

Hence our advice is to perform a full product of m and m′, after
possibly truncating them to n + k digits if they have more than n digits. This
also makes it easier to compute additional digits in case the first rounding
fails.
How long can a run of zeros or ones be after the first n bits? The answer
is: as long as the longest mantissa, minus n bits. Indeed, take an arbitrary
output m″ with k ones after the round bit, and consider an arbitrary m of
n bits. Then compute n + k bits of m″/m, and let m′ be the corresponding
mantissa. Since m″/m agrees with m′ up to n + k bits, m″ agrees with
mm′ up to n + k bits. Even if both mantissas have at most n bits, we can have
a run of n − 1 identical bits: take for example m = 2^n − 1 and m′ = 2^{n−1} + 1.

Error analysis of the short product. Consider two n-word normalized


mantissae A and B that we multiply using a short product algorithm:

 

1   Algorithm ShortProduct.
2   Input: A = Σ_{i=0}^{n−1} a_iβ^i, B = Σ_{i=0}^{n−1} b_iβ^i
3   Output: an approximation of AB div β^n
4   if n ≤ n₀ then return FullProduct(A, B)
5   choose k ≥ n/2, l ← n − k
6   C₁ ← FullProduct(A div β^l, B div β^l, k)
7   C₂ ← ShortProduct(A mod β^l, B div β^k, l)
8   C₃ ← ShortProduct(A div β^k, B mod β^l, l)
9   Return C₁ div β^{k−l} + (C₂ + C₃) div β^l.




Theorem 3.3.1 The value C′ returned by Algorithm ShortProduct differs
from the exact short product C = AB div β^n by at most n − 1, more precisely

C′ ≤ C ≤ C′ + (n − 1).

Proof We first prove by induction that the algorithm computes all products
a_ib_j with i + j ≥ n − 1. For n ≤ n₀ this is trivial since all products are
computed. Now assume n > n₀, and consider a product a_ib_j with i + j ≥ n − 1.
Three cases can occur:

1. i, j ≥ l: then a_ib_j is computed in C₁;

2. i < l, which implies j ≥ k since i + j ≥ n − 1; the index of a_i in
A mod β^l is i′ = i, and that of b_j in B div β^k is j′ = j − k, thus
i′ + j′ = i + j − k ≥ (n − 1) − k = l − 1, which proves that a_ib_j is
computed in C₂;

3. similarly, j < l implies i ≥ k, and a_ib_j is computed in C₃.

The case i, j < l cannot occur, since we would have i + j ≤ n − 2 (remember
l ≤ n/2). The neglected part with respect to the full product AB is thus at
most

$$\sum_{i+j \le n-2} a_i b_j \beta^{i+j} \le \sum_{i+j \le n-2} (\beta-1)^2 \beta^{i+j} = (n-1)\beta^n - n\beta^{n-1} + 1 \le (n-1)\beta^n.$$

The product D′ — before division by β^n — computed by the algorithm thus
satisfies D′ ≤ AB ≤ D′ + (n − 1)β^n, from which the theorem follows, after
division by β^n.

Question: is the upper bound C′ + (n − 1) attained? Can the theorem be
improved?
Remark 1: if one of the operands was truncated before applying Algorithm
ShortProduct, simply add one unit to the upper bound. Indeed, the truncated
part is less than 1, thus its product by the other operand is bounded
by β^n.
Remark 2: assuming β is a power of two, if A and B are normalized,
i.e. β/2 ≤ a_{n−1}, b_{n−1} < β, then C′ ≥ β^n/4, and the error is bounded by
2(n − 1) ulps.
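The recursion is easy to experiment with in Python on digit lists. In the sketch below, one way to realize it, the β^l division of the recombination is folded into the recursive calls, which directly return approximations of the shifted products.

def short_product(A, B, beta=10, n0=2):
    # little-endian digit lists, len(A) == len(B) == n; approximates
    # (value(A) * value(B)) div beta^n from below, error at most n - 1
    n = len(A)
    va = sum(a * beta ** i for i, a in enumerate(A))
    vb = sum(b * beta ** i for i, b in enumerate(B))
    if n <= n0:
        return va * vb // beta ** n
    k = (n + 1) // 2                             # k >= n/2, l = n - k
    l = n - k
    C1 = (va // beta ** l) * (vb // beta ** l)   # full k-word x k-word product
    C2 = short_product(A[:l], B[k:], beta, n0)   # A mod beta^l, B div beta^k
    C3 = short_product(A[k:], B[:l], beta, n0)   # A div beta^k, B mod beta^l
    return C1 // beta ** (k - l) + C2 + C3

A = [9, 9, 9, 9]                                 # 9999 in base 10
C, Cp = 9999 * 9999 // 10 ** 4, short_product(A, A, 10, 1)
assert Cp <= C <= Cp + 3                         # Theorem 3.3.1 with n = 4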

Integer Multiplication via Complex Floating-Point FFT


To multiply n-bit integers, the algorithms using the Fast Fourier Transform
— FFT for short — belong to two classes: those using number-theoretical
properties, and those based on complex floating-point computations. The
latter, while not achieving the best known asymptotic complexity of the
former, O(n log n log log n), have good practical behaviour, because they
exploit the efficiency of the floating-point hardware. The drawback of the
complex FFT is that, being based on floating-point computations, it requires
a rigorous error analysis. However, in some contexts where errors are not
dramatic, for example in the context of integer factorization, one may accept
a small probability of error if this speeds up the computation.
Up to very recently, all rigorous error analyses of complex FFT gave very
pessimistic bounds. The following theorem from Percival [41] changes this
situation:

n   bits/float   m-bit mult.      n   bits/float   m-bit mult.
1   25           25               11  18           18432
2   24           48               12  17           34816
3   23           92               13  17           69632
4   22           176              14  16           131072
5   22           352              15  16           262144
6   21           672              16  15           491520
7   20           1280             17  15           983040
8   20           2560             18  14           1835008
9   19           4864             19  14           3670016
10  19           9728             20  13           6815744

Table 3.1: Maximal number of bits per floating-point number, and maximal
m for a plain m × m bit integer product, for a given FFT size 2^n, with signed
coefficients, and 53-bit floating-point mantissae.

Theorem 3.3.2 [41] The FFT allows computation of the cyclic convolution
z = x * y of two vectors of length N = 2^n of complex values such that

|z′ − z|_∞ < |x| · |y| · ((1 + ε)^{3n}(1 + ε√5)^{3n+1}(1 + β)^{3n} − 1),    (3.1)

where | · | and | · |_∞ denote the Euclidean and infinity norms respectively, ε is
such that |(a ± b)′ − (a ± b)| < ε|a ± b| and |(ab)′ − (ab)| < ε|ab| for all machine
floats a, b, β > |(w^k)′ − (w^k)| for 0 ≤ k < N, w = e^{2πi/N}, and (·)′ refers to the
computed (stored) value of · for each expression.
The proof given in Percival's paper [41, p. 387] is incorrect, but we have a
correct proof (see Ex. 3.5.3). For the double-precision format of IEEE 754,
with rounding to nearest, we have ε = 2^{−53}, and if the w^k are correctly
rounded, we can take β = ε/√2. For a fixed FFT size N = 2^n, Eq. (3.1)
enables one to compute a bound B on the coefficients of x and y such that
|z′ − z|_∞ < 1/2, which allows the coefficients of z′ to be rounded uniquely to
integers (Tab. 3.1).

3.3.2 Reciprocal
The following algorithm computes a floating-point inverse in 2M (n), when
considering multiplication as a black-box.


 

1   Algorithm Invert.
2   Input: 1 ≤ A ≤ 2, a p-bit f-p number x with 0 ≤ 1/A − x ≤ 2^{1−p}
3   Output: a p′-bit f-p number x′, with p′ = 2p − 5 and 0 ≤ 1/A − x′ ≤ 2^{1−p′}
4   v ← ◦(A)      [2p bits, rounded up]
5   w ← vx        [3p bits, exact]
6   if w ≥ 1 then return x
7   y ← ◦(1 − w)  [p bits, towards zero]
8   z ← ◦(xy)     [p bits, towards zero]
9   x′ ← ◦(x + z) [p′ bits, towards zero]




Theorem 3.3.3 Algorithm Invert is correct, and yields an inversion algorithm
of complexity 2M(n).

Proof First assume there are no roundoff errors. Newton's iteration for the
inverse of A is x_{k+1} = x_k + x_k(1 − Ax_k). If we let ε_k := x_k − 1/A, we have
ε_{k+1} = −Aε_k². This shows that if x₀ ≤ 1/A, then all the x_j are less than or
equal to 1/A.
Now consider the rounding errors. The hypothesis 0 ≤ 1/A − x ≤ 2^{1−p} can
be written 0 ≤ 1 − Ax ≤ 2^{2−p} since A ≤ 2.
If w ≥ 1 at step 6, we have Ax ≤ 1 ≤ vx; since v − A ≤ 2^{1−2p}, this shows
that 1 − Ax ≤ vx − Ax ≤ 2^{1−2p}, thus 1/A − x ≤ 2^{1−2p} ≤ 2^{1−p′}. Otherwise,
we can write v = A + ε_{1−2p}, where ε_i denotes a positive quantity less than 2^i.
Similarly, y = 1 − w − ε_{2−2p}, z = xy − ε′_{2−2p}, and x′ = x + z − ε_{5−2p}. This gives
x′ = x + x(1 − Ax) − ε_{1−2p}x² − ε_{2−2p}x − ε′_{2−2p} − ε_{5−2p}. Since the difference
between 1/A and x + x(1 − Ax) is bounded by A(x − 1/A)² ≤ 2^{3−2p}, we get
|x′ − 1/A| ≤ 2^{3−2p} + 2^{1−2p} + 2 · 2^{2−2p} + 2^{5−2p} = 50 · 2^{−2p} ≤ 2^{6−2p} = 2^{1−p′}.

The complexity bound of 2M(n) is obtained by using a “negacyclic convolution”
for w = vx, assuming we use a multiplication algorithm that computes
mod 2^n ± 1. Indeed, the product vx has 3p bits, but we know the p most
significant bits give 1, so it can be obtained by a multiplication mod 2^{2p} ± 1.
If we keep the FFT transform of x from step 5 to step 8, we can save M(n)/3
— assuming the term-to-term products have negligible cost —, which gives
(5/3)M(n), as noticed by Bernstein, who also proposes a “messy” algorithm in
(3/2)M(n).
Remark: Schönhage's algorithm in 1.5M(n) is better [45].
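The quadratic convergence of the underlying iteration is easy to observe in exact rational arithmetic; the rounding bookkeeping of Algorithm Invert is omitted in this sketch.

from fractions import Fraction

def newton_inverse_step(A, x):
    # one step of x <- x + x(1 - A*x); the error e = x - 1/A maps to -A*e^2
    return x + x * (1 - A * x)

A, x = Fraction(3, 2), Fraction(1, 2)      # initial error 1/6
for _ in range(4):
    x = newton_inverse_step(A, x)
    print(float(abs(x - 1 / A)))           # 1/24, ~2.6e-3, ~1.0e-5, ~1.7e-10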

3.3.3 Division
Theorem 3.3.4 Assume we divide an m-bit floating-point number by an n-bit
floating-point number, with m < 2n. Then the (infinite) binary expansion of
the quotient, if not exact, can have at most n consecutive zeros after its first
n bits.

Proof Without loss of generality, we can assume that the n-th significand
bit of the quotient q is 1, and similarly for the divisor d. If the
quotient has more than n consecutive zeros, we can write it q = q₁ + 2^{−n}q₀,
with q₁ an n-bit integer and either q₀ = 0 if the division is exact, or an infinite
expansion 0 < q₀ < 1. Thus qd = q₁d + 2^{−n}q₀d, where q₁d is an integer of
2n − 1 or 2n bits, and 0 < 2^{−n}q₀d < 1. This implies that qd has at least 2n
bits.

The best known constant is (5/2)M(n). (Bernstein gets 7/3 or even 13/6, but
with special assumptions.)

Algorithm Divide(h, f, 2n).
1. Compute g₀ = Invert(f, n)     [2M(n)]
2. q₀ = hg₀ truncated to n bits  [M(n)]
3. e = MP(q₀, f)                 [M(n)]
4. q = q₀ − g₀e                  [M(n)]

The total cost is therefore 5M(n) for precision 2n, or (5/2)M(n) for precision n.
As for the reciprocal, if we cache FFT transforms, we get (5/3)M(n) for
step 1, and a further gain of M(n)/3 by saving the transform of g₀ between
steps 2 and 4, which gives (25/12)M(n) = 2.0833 . . . M(n).

Lemma 3.3.1 Let A and B be two positive integers, and β ≥ 2 a positive
integer. Let Q = ⌊A/B⌋, A₁ = ⌊A/β⌋, B₁ = ⌊B/β⌋, Q₁ = ⌊A₁/B₁⌋.
Assuming Q₁ ≤ 2B₁, then

Q ≤ Q₁ ≤ Q + 2.

Proof Let A₁ = Q₁B₁ + R₁. We have A = A₁β + A₀, B = B₁β + B₀, thus

$$\frac{A}{B} = \frac{A_1\beta + A_0}{B_1\beta + B_0} \le \frac{A_1\beta + A_0}{B_1\beta} = Q_1 + \frac{R_1\beta + A_0}{B_1\beta}.$$

Since R₁ < B₁ and A₀ < β, R₁β + A₀ < B₁β, thus A/B < Q₁ + 1. Taking
the floor of each side proves, since Q₁ is an integer, that Q ≤ Q₁.
For the second inequality,

$$\frac{A}{B} \ge \frac{A_1\beta}{B_1\beta + (\beta-1)} = \frac{(Q_1 B_1 + R_1)\beta}{B_1\beta + (\beta-1)} = Q_1 + \frac{R_1\beta - Q_1(\beta-1)}{B_1\beta + (\beta-1)} \ge Q_1 - \frac{Q_1}{B_1},$$

and Q₁/B₁ ≤ 2 by hypothesis, thus A/B ≥ Q₁ − 2 and Q ≥ Q₁ − 2.

This lemma is useful when replacing β by β^n.



 

1   Algorithm ShortDivision.
2   Input: A = Σ_{i=0}^{2n−1} a_iβ^i, B = Σ_{i=0}^{n−1} b_iβ^i with β/2 ≤ b_{n−1} < β
3   Output: an approximation of A/B
4   if n ≤ n₀ then return ExactQuotient(A, B)
5   choose k ≥ n/2, l ← n − k
6   (A₁, A₀) ← (A div β^{2l}, A mod β^{2l})
7   (B₁, B₀) ← (B div β^l, B mod β^l)
8   (Q₁, R₁) ← DivRem(A₁, B₁)
9   A′ ← R₁β^{2l} + A₀ − Q₁B₀β^l
10  Q₀ ← ShortDivision(A′ div β^k, B div β^k)
11  Return Q₁β^l + Q₀.




Theorem 3.3.5 The approximate quotient Q′ returned by ShortDivision
satisfies

Q ≤ Q′ ≤ Q + 2 log₂ n,

where Q = A/B is the exact quotient.

Proof If n ≤ n₀, Q = Q′ so the statement holds. Assume n > n₀. We have
A = A₁β^{2l} + A₀ and B = B₁β^l + B₀, thus, since A₁ = Q₁B₁ + R₁, A =
(Q₁B₁ + R₁)β^{2l} + A₀ = A′ + Q₁Bβ^l. Write A′ = A′₁β^k + A′₀, and B = B′₁β^k + B′₀.
We have seen (Chapter 1) that the exact quotient of A′ div β^k by B div β^k is
greater than or equal to that of A′ by B, thus by induction Q₀ ≥ A′/B. Since
A/B = Q₁β^l + A′/B, this proves that Q′ ≥ Q.
Now by induction Q₀ ≤ A′₁/B′₁ + 2 log₂ l, and A′₁/B′₁ ≤ A′/B + 2 (see
Chapter 1), so Q₀ ≤ A′/B + 2 log₂ n, and Q′ ≤ A/B + 2 log₂ n.

[Diagram omitted: the two recursion trees of divide-and-conquer short division,
with nodes M(n/2), M(n/4), M(n/8) on the left and M*(n/2), M*(n/4), M*(n/8)
on the right.]

Figure 3.2: Divide and conquer short division: a graphical view. Left: with
plain multiplication; right: with short multiplication. See also Fig. 1.1.

Barrett’s division
Assume we want to divide a by b of n bits, assuming the quotient has exactly
n bits. Barrett's algorithm is as follows:

0. Precompute the inverse i of b on n bits [nearest]
1. q ← ◦(ai) [nearest]
2. r ← a − bq.

Lemma 3.3.2 At step 2, we have |a − bq| ≤ (3/2)|b|.

Proof We can assume without loss of generality that a is an integer < 2^{2n},
and that b is an integer with 2^{n−1} ≤ b < 2^n. We have i = 1/b + ε with
|ε| ≤ (1/2) ulp(1/b) ≤ 2^{−2n}, and q = ai + ε′ with |ε′| ≤ (1/2) ulp(q) ≤ 1/2
since q < 2^n. Thus q = a(1/b + ε) + ε′ = a/b + aε + ε′, and |bq − a| =
|b||aε + ε′| ≤ (3/2)|b|.

As a consequence, after at most one correction, q is correct (for rounding to


nearest).
Remark: if a < 2^{2n−1}, then the bound becomes |a − bq| ≤ |b|, thus r is
exact.

3.3.4 Square Root


The following algorithm assumes an integer significand m, and a directed
rounding mode.

 

1   Algorithm FPSqrt.
2   Input: x = m · 2^e, a target precision n, a rounding mode ◦
3   Output: y = ◦_n(√x)
4   If e is odd, (m′, f) ← (2m, e − 1), else (m′, f) ← (m, e)
5   Write m′ =: m₁2^{2k} + m₀, m₁ having 2n or 2n − 1 bits, 0 ≤ m₀ < 2^{2k}
6   (s, r) ← SqrtRem(m₁)
7   If rounding is to zero or down, or if r = m₀ = 0, return s · 2^{k+f/2},
8   else return (s + 1) · 2^{k+f/2}.




Theorem 3.3.6 Algorithm FPSqrt returns the correctly-rounded square root
of x.

Proof Since m₁ has 2n or 2n − 1 bits, s has exactly n bits, and we have
x ≥ s²2^{2k+f}, thus √x ≥ s2^{k+f/2}. On the other hand, SqrtRem ensures that
r ≤ 2s, thus x2^{−f} = (s² + r)2^{2k} + m₀ < (s² + r + 1)2^{2k} ≤ (s + 1)²2^{2k}. Since
y := s · 2^{k+f/2} and y⁺ := (s + 1) · 2^{k+f/2} are two consecutive n-bit floating-point
numbers, this concludes the proof.
Note: in the case s = 2^n − 1, s + 1 = 2^n is still representable with n bits, and
y⁺ lies in the upper binade.
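A Python sketch of FPSqrt for directed rounding, with math.isqrt playing the role of SqrtRem; the scaling of small inputs (k < 0) is our own addition.

from math import isqrt

def fp_sqrt(m, e, n, round_down=True):
    # square root of x = m * 2^e with an n-bit significand, as (s, g)
    # meaning s * 2^g; directed rounding only
    if e % 2:                                   # make the exponent even
        m, e = 2 * m, e - 1
    k = (m.bit_length() - (2 * n - 1)) // 2     # m1 = m >> 2k has 2n or 2n-1 bits
    if k >= 0:
        m1, m0 = m >> (2 * k), m & ((1 << (2 * k)) - 1)
    else:
        m1, m0 = m << (-2 * k), 0               # small input: scale up instead
    s = isqrt(m1)                               # s then has exactly n bits
    exact = (s * s == m1) and (m0 == 0)
    if round_down or exact:
        return s, k + e // 2
    return s + 1, k + e // 2                    # round away from zero

s, g = fp_sqrt(2, 0, 10)                        # sqrt(2) rounded down to 10 bits
assert (s, g) == (724, -9)                      # 724/512 = 1.414...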
To compute a square root on 2n bits, the asymptotically best known
method is the following:

Algorithm SquareRoot(h, 2n).
1. Compute g₀ = h^{−1/2} on n bits.
2. f₀ = hg₀ truncated to n bits   [M(n)]
3. e = h − f₀²                    [M(n)/2]
4. f = f₀ + (g₀/2) e              [M(n)]

Since the n most significant bits of f₀² are known to match those of h in step
3, we can do a transform mod x^n − 1. The inverse square root g₀ is computed
via Newton's iteration:

Algorithm InverseSquareRoot(h, 2n).
1. g₀ = InverseSquareRoot(h, n)
2. g = g₀ + (1/2)(g₀ − hg₀³)

At step 2, hg₀³ has 5n bits, and we want only bits n to 2n — the low n bits
are known to match those of g₀ —, thus we can compute hg₀³ mod x^{4n} − 1,
which costs 2M(n).
The total cost for SquareRoot(2n) is thus 4.5M(n), i.e. 2.25M(n) for
SquareRoot(n).

3.4 Conversion
Cf Chapter 1.

3.4.1 Floating-Point Output


Consider the problem of printing a floating-point number, represented in-
ternally in base b, in another base B. We distinguish here two kinds of
floating-point output:

• fixed-format output, where the output precision is given by the user,


and we want the output value to be correctly rounded according to the
given rounding mode. This is the usual method when values are to be
used by humans, for example to fill a table of results. In that case the
input and output precision may be very different: for example one may
want to print thousand digits of 1/3, which needs only one digit in base
b = 3. Conversely, one may want to print only a few digits of a number
accurate to thousand bits.

• free-format output, where we want that the output value, when read
with correct rounding according to the given rounding mode, gives back
the initial number. Here the number of printed digits may depend on
the input number. This is useful when storing data in a file, while
guaranteeing that reading it back will produce exactly the same internal
numbers, or for exchanging data between different programs.

In other words, if we denote by x the number we want to print, and X


the printed value, the fixed-format output requires |x − X| < ulp(X), and
the free-format output requires |x − X| < ulp(x) (we consider here directed
rounding).

 

1   Algorithm PrintFixed.
2   Input: integers f, e, p, b, B, P, with b^{p−1} ≤ |f| < b^p, and a rounding mode ◦.
3   Output: F, E such that B^{P−1} ≤ |F| < B^P, and X = F · B^{E−P} is the
4           closest floating-point number to x = f · b^{e−p} according to ◦.
5   λ ← ◦(log b / log B)
6   E ← 1 + ⌊(e − 1)λ⌋
7   q ← ⌈P/λ⌉
8   y ← xB^{P−E} computed with precision q
9   If one cannot round y to an integer, increase q and go to 8.
10  F ← Integer(y, ◦).
11  If |F| ≥ B^P then E ← E + 1 and go to 8.
12  Return F, E.




We assume here that we have precomputed values of λ_B = ◦(log b / log B).
Assuming the input exponent e is bounded, it is possible — see Ex. 3.5.6 — to
choose those values precise enough so that

E = 1 + ⌊(e − 1) (log b / log B)⌋.    (3.2)

Theorem 3.4.1 Algorithm PrintFixed is correct.

Proof First assume that the algorithm terminates. Eq. (3.2) implies B^{E−1} ≤
b^{e−1}, thus |x|B^{P−E} ≥ B^{P−1}, which implies that |F| ≥ B^{P−1} at step 10. Thus
B^{P−1} ≤ |F| < B^P is fulfilled. Now, printing x gives F · B^a iff printing xB^k
gives F · B^{a+k} for any integer k. Thus it suffices to check that printing xB^{P−E}
would give F, which is clear by construction.
The algorithm terminates because, at step 8, xB^{P−E}, if not an integer,
cannot be arbitrarily close to an integer. If P − E ≥ 0, let k be the number
of bits of B^{P−E}; then xB^{P−E} can be represented exactly on p + k bits. If
P − E < 0, let g = B^{E−P}, of k bits. Assume f/g = n + ε with n an integer;
then f − gn = εg. If ε is not zero, εg is a non-zero integer, thus |ε| ≥ 1/g ≥ 2^{−k}.
The case |F| ≥ B^P at step 11 can occur for two reasons. Either
xB^{P−E} ≥ B^P, in which case so is its rounding; or xB^{P−E} < B^P, but its rounding
equals B^P (this can only happen for rounding away from zero or to nearest).
In the former case we still have xB^{P−E} ≥ B^{P−1} at the next pass through step 8,
while in the latter case the rounded value F will equal B^{P−1} and the algorithm
will terminate.

Now return to the free-format output. For a directed rounding mode
(resp. rounding to nearest), we want |x − X| < ulp(x) (resp. |x − X| ≤
(1/2) ulp(x)), knowing that |x − X| < ulp(X) (resp. |x − X| ≤ (1/2) ulp(X)). It is
easy to see that a sufficient condition is that ulp(X) ≤ ulp(x), or equivalently
B^{E−P} ≤ b^{e−p}. In summary, we have

    b^{e−1} ≤ x < b^e ,    B^{E−1} ≤ X < B^E .

Since x < b^e, and X is the rounding of x, we must have B^{E−1} ≤ b^e. It follows
that B^{E−P} ≤ b^e B^{1−P}, and the sufficient condition becomes:

    P ≥ 1 + p · (log b / log B).

For example, with p = 53, b = 2, B = 10, this gives P ≥ 17. As a conse-


quence, if a double-precision floating-point number is printed with at least
17 significant digits, it can be read back without any discrepancy, assuming
input and output are performed with correct rounding to nearest.
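In Python this bound reads as follows (min_digits is our name; it simply
computes the smallest integer P satisfying the inequality above):

    from math import ceil, log

    def min_digits(p, b=2, B=10):
        # smallest integer P with P >= 1 + p*log(b)/log(B)
        return 1 + ceil(p * log(b) / log(B))

    print(min_digits(24))   # 9  digits for single precision
    print(min_digits(53))   # 17 digits for double precision

The two printed values anticipate the figures quoted in §3.4.2 below.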

3.4.2 Floating-Point Input


Assume we have a floating-point number x with a mantissa of n digits in
base β, that we convert to x′ with n′ digits in base β′, and then back to x″
with n digits in base β (both conversions with rounding to nearest). How
large must n′ be, with respect to n, β, β′, so that x = x″ always holds? For
β = 2, β′ = 10, we should have n′ ≥ 9 for single precision (n = 24), and
n′ ≥ 17 in double precision (n = 53).
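The double-precision figure is easy to check experimentally; the following
sketch assumes Python floats are IEEE 754 doubles (n = 53) and that Python's
float/string conversions are correctly rounded:

    import random

    random.seed(1)
    failures_16 = 0
    for _ in range(100000):
        x = random.uniform(-1.0, 1.0)
        assert float(f"{x:.16e}") == x     # 17 significant digits always round-trip
        if float(f"{x:.15e}") != x:        # 16 digits sometimes do not
            failures_16 += 1
    print(failures_16, "of 100000 values fail to round-trip with 16 digits")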

3.5 Exercises
Exercise 3.5.1 (Kidder, Boldo) The "rounding to odd" mode is defined as fol-
lows: in case the exact value is not representable, it rounds to the unique ad-
jacent number with an odd mantissa (assuming a binary representation). Prove that
if y = round(x, p + k, odd) and z = round(y, p, nearest-even), and k > 1, then
z = round(x, p, nearest-even), i.e. the double-rounding problem does not happen.

Exercise 3.5.2 Adapt Mulders’ short product algorithm [39] to floating-point


numbers. In case the first rounding fails, can you compute additional digits without
starting again from scratch?

Exercise 3.5.3 (Percival) One computes the product of two complex floating-
point numbers z_0 = a_0 + ib_0 and z_1 = a_1 + ib_1 in the following way: x_a = ◦(a_0 a_1),
x_b = ◦(b_0 b_1), y_a = ◦(a_0 b_1), y_b = ◦(a_1 b_0), z = ◦(x_a − x_b) + ◦(y_a + y_b) · i. All
computations being done in precision n, with rounding to nearest, compute an
error bound of the form |z − z_0 z_1| ≤ c 2^{−n} |z_0 z_1|. What is the best possible c?

Exercise 3.5.4 (Enge) Design an algorithm that correctly rounds the product
of two complex floating-point numbers with 3 multiplications only. [Hint: assume
all operands and the result have n-bit significand.]

Exercise 3.5.5 Prove that for any n-bit floating-point numbers (x, y) ≠ (0, 0),
and if all computations are correctly rounded, with the same rounding mode, the
result of x/√(x² + y²) lies in [−1, 1], except in some special case.

Exercise 3.5.6 Show that the computation of E in Algorithm PrintFixed is
correct as long as there is no integer n such that |n log B / ((e − 1) log b) − 1| < ε,
where ε is the relative precision when computing λ: λ = (log b / log B)(1 + θ)
with |θ| ≤ ε. For a fixed range of exponents −e_max ≤ e ≤ e_max, deduce a working
precision ε. Application: for b = 2, and e_max = 2^31, compute the required
precision for 3 ≤ B ≤ 36.

Exercise 3.5.7 (Lefèvre) The IEEE 754 standard requires binary to decimal
conversions to be correctly rounded in the range m · 10^n for |m| ≤ 10^17 − 1 and
|n| ≤ 27 in double precision. Find the hardest-to-print double-precision number
in that range (for rounding to nearest, for example), write a C program that
outputs double-precision numbers in that range, and compare it to the sprintf
function of your system.

Exercise 3.5.8 Same question as the above, for the decimal to binary conversion,
and the atof function.

3.6 Notes and further references


The link between usual multiplication and the middle product using trilinear
forms was already mentioned by Victor Pan in [40] for the multiplication of
two complex numbers: “The duality technique enables us to extend any suc-
cessful bilinear algorithms to two new ones for the new problems, sometimes
quite different from the original problem ...”
Interval arithmetic (endpoint vs middle/error representation).
Complex arithmetic (pointers).
Fixed-point arithmetic.
Chapter 4

Newton’s Method and


Unrestricted Algorithms for
Elementary and Special
Function Evaluation

4.1 Introduction

This chapter is concerned with algorithms for computing elementary and spe-
cial functions, although the methods apply more generally. First we consider
Newton’s method, which is useful for computing inverse functions. For exam-
ple, if we have an algorithm for computing y = log x, then Newton’s method
can be used to compute x = exp y (see §4.2.6). However, Newton’s method
has many other applications. We already mentioned in Chapter 1 that New-
ton’s method is useful for computing reciprocals, and hence for division. We
consider this in more detail in §4.2.3.
After considering Newton’s method, we go on to consider various meth-
ods for computing elementary and special functions. These methods in-
clude power series (§4.4), asymptotic expansions (§4.5), continued fractions
(§4.6), recurrence relations (§4.7), the arithmetic-geometric mean (§4.8), bi-
nary splitting (§4.9), and contour integration (§4.11).


4.2 Newton’s method


4.2.1 Newton’s method via linearisation
Recall that a function f of a real variable is said to have a zero ζ if f(ζ) = 0.
Similarly for functions of several real (or complex) variables. If f is differen-
tiable in a neighbourhood of ζ, and f′(ζ) ≠ 0, then ζ is said to be a simple
zero. In the case of several variables, ζ is a simple zero if the Jacobian matrix
evaluated at ζ is nonsingular, see [reference-to-be-added].
Newton's method for approximating a simple zero ζ of f is based on the
idea of making successive linear approximations to f(x) in a neighbourhood
of ζ. Suppose that x_0 is an initial approximation, and that f(x) has two
continuous derivatives in the region of interest. From Taylor's theorem,

    f(x_0) = f(ζ) + (x_0 − ζ)f′(ζ) + ((x_0 − ζ)²/2) f″(ξ)

for some point ξ in an interval including {x_0, ζ}. Since f(ζ) = 0, we see that

    x_1 = x_0 − f(x_0)/f′(x_0)

is an approximation to ζ, and

    x_1 − ζ = O(|x_0 − ζ|²) .

Provided x0 is sufficiently close to ζ, we will have

|x1 − ζ| ≤ |x0 − ζ|/2 < 1 .

This motivates the definition of Newton's method as the iteration

    x_{n+1} = x_n − f_n/f′_n ,   n = 0, 1, . . . ,

where we have abbreviated f_n = f(x_n) and f′_n = f′(x_n). Provided |x_0 − ζ| is
sufficiently small, we expect x_n to converge to ζ and the order of convergence
will be at least 2, that is

    |e_{n+1}| ≤ K|e_n|²

for some constant K independent of n, where e_n = x_n − ζ is the error after
n iterations.

A more careful analysis shows that

    e_{n+1} = (f″(ζ)/(2f′(ζ))) e_n² + O(e_n³) ,

provided f ∈ C³ near ζ. Thus, the order of convergence is exactly 2 if
f″(ζ) ≠ 0 and e_0 is sufficiently small but nonzero. (Such an iteration is also
said to be quadratically convergent.)

4.2.2 Newton’s method for inverse roots


Consider applying Newton's method to the function

    f(x) = y − x^{−m} ,

where m is a positive integer constant, and (for the moment) y is a nonzero
constant. Since f′(x) = mx^{−(m+1)}, Newton's iteration simplifies to

    x_{j+1} = x_j + x_j(1 − x_j^m y)/m .    (4.1)

This iteration converges to ζ = y^{−1/m} provided the initial approximation x_0
is sufficiently close to ζ. It is perhaps surprising that (4.1) does not involve
divisions, except for a division by the integer constant m. Thus, we can
easily compute reciprocals (the case m = 1) and inverse square roots (the
case m = 2) by Newton's method. These cases are sufficiently important
that we discuss them separately in the following subsections.
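A minimal sketch of iteration (4.1) in double precision (inverse_root is our
name, not from the text):

    def inverse_root(y, m, x0, steps=6):
        # x_{j+1} = x_j + x_j*(1 - x_j^m * y)/m converges to y^(-1/m);
        # the only division is by the integer constant m.
        x = x0
        for _ in range(steps):
            x += x * (1 - x ** m * y) / m
        return x

    print(inverse_root(2.0, 1, 0.4))   # 0.5             = 1/2      (m = 1)
    print(inverse_root(2.0, 2, 0.7))   # 0.7071067811... = 2^(-1/2) (m = 2)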

4.2.3 Newton’s method for reciprocals


Taking m = 1 in (4.1), we obtain the iteration

    x_{j+1} = x_j + x_j(1 − x_j y)    (4.2)

which we expect to converge to 1/y provided x_0 is a sufficiently good approx-
imation. To see what "sufficiently good" means, define

    u_j = 1 − x_j y .

Note that u_j → 0 if and only if x_j → 1/y. Multiplying each side of (4.2) by y,
we get

    1 − u_{j+1} = (1 − u_j)(1 + u_j) ,

which simplifies to

    u_{j+1} = u_j² .    (4.3)

Thus

    u_j = u_0^{2^j} .    (4.4)
We see that the iteration converges if and only if |u0 | < 1, which (for real x0
and y) is equivalent to the condition x0 y ∈ (0, 2). Second-order convergence
is reflected in the double exponential on the right-hand-side of (4.4).
The iteration (4.2) is sometimes implemented in hardware to compute
reciprocals of floating-point numbers, see for example [29]. The sign and
exponent of the floating-point number are easily handled, so we can assume
that y ∈ [0.5, 1.0). The initial approximation x0 is found by table lookup,
where the table is indexed by the first few bits of y. Since the order of
convergence is two, the number of correct bits approximately doubles at
each iteration. Thus, we can predict in advance how many iterations are
required. Of course, this assumes that the table is initialised correctly. In
the case of the infamous Pentium bug [24], this was not the case, and the
reciprocal was occasionally inaccurate!
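The squaring of the error term in (4.3)-(4.4) is easy to observe with exact
rational arithmetic; a small sketch:

    from fractions import Fraction

    y = Fraction(3, 2)               # compute 1/y = 2/3
    x = Fraction(1, 2)               # initial approximation (x0*y lies in (0, 2))
    for j in range(5):
        print(j, float(1 - x * y))   # u_j = 0.25, 0.0625, 0.00390625, ... = u_0^(2^j)
        x += x * (1 - x * y)         # iteration (4.2): division-free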

4.2.4 Newton’s method for inverse square roots


Taking m = 2 in (4.1), we obtain the iteration

    x_{j+1} = x_j + x_j(1 − x_j² y)/2 ,    (4.5)

which we expect to converge to y^{−1/2} provided x_0 is a sufficiently good ap-
proximation.
If we want to compute y^{1/2}, we can do this in one multiplication after
first computing y^{−1/2}, since

    y^{1/2} = y × y^{−1/2} .

This method does not involve any divisions (except by 2). In contrast, if we
apply Newton's method to the function f(x) = x² − y, we obtain Heron's
iteration (Heron of Alexandria, circa 10-75 AD)

    x_{j+1} = (1/2)(x_j + y/x_j)    (4.6)

for the square root of y. This requires a division by xj at iteration j, so it is


essentially different from the iteration (4.5). Although both iterations have
second-order convergence, we expect (4.5) to be more efficient (we consider
such issues in more detail below).
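A sketch of the division-free square root, iteration (4.5) followed by one
multiplication (sqrt_via_inverse is our name):

    def sqrt_via_inverse(y, x0, steps=6):
        x = x0
        for _ in range(steps):
            x += x * (1 - x * x * y) / 2   # (4.5): converges to y^(-1/2)
        return y * x                       # y^(1/2) = y * y^(-1/2)

    print(sqrt_via_inverse(2.0, 0.7))      # 1.414213562...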

4.2.5 Newton’s method for power series


Newton's method can be applied to functions of power series as well as to
functions of a real or complex variable. For simplicity we consider power
series of the form

    A(z) = a_0 + a_1 z + a_2 z² + · · ·

where a_i ∈ R (or any field of characteristic zero) and ord A = 0, i.e. a_0 ≠ 0.
For example, if we replace y in (4.2) by 1 − z, and take initial approxi-
mation x_0 = 1, we obtain a quadratically-convergent iteration for the power
series

    (1 − z)^{−1} = Σ_{n≥0} z^n .

In the case of power series, "quadratically convergent" means that ord(e_j) →
+∞ like 2^j. In our example, u_0 = 1 − x_0 y = z, so u_j = z^{2^j} and

    x_j = (1 − u_j)/(1 − z) = 1/(1 − z) + O(z^{2^j}) .

Another example: if we replace y in (4.5) by 1 − 4z, and take initial
approximation x_0 = 1, we obtain a quadratically-convergent iteration for the
power series

    (1 − 4z)^{−1/2} = Σ_{n≥0} (2n choose n) z^n .
Some operations on power series have no analogue for integers. For ex-
ample, given a power series A(z) = Σ_{j≥0} a_j z^j, we can define the formal
derivative

    A′(z) = Σ_{j>0} j a_j z^{j−1} = a_1 + 2a_2 z + 3a_3 z² + · · · ,

and the integral

    Σ_{j≥0} (a_j/(j + 1)) z^{j+1} ,

but there is no useful analogue for multiple-precision integers

    Σ_{j=0}^{n} a_j β^j .

For more on Newton’s method for power series, we refer to [9, 14, 16, 32].
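The following sketch applies iteration (4.2) to truncated power series, with
the number of known coefficients doubling at each step; mul_trunc and
inverse_series are our names, and the quadratic-time truncated product
stands in for the fast multiplication a real implementation would use:

    def mul_trunc(a, b, n):
        # product of two coefficient lists, truncated to n coefficients
        c = [0] * n
        for i, ai in enumerate(a[:n]):
            for j, bj in enumerate(b[:n - i]):
                c[i + j] += ai * bj
        return c

    def inverse_series(a, n):
        # first n coefficients of 1/A(z), assuming a[0] = 1
        x = [1]                                  # correct to order 1
        k = 1
        while k < n:
            k = min(2 * k, n)
            u = [-c for c in mul_trunc(x, a, k)]
            u[0] += 1                            # u = 1 - x*A
            xu = mul_trunc(x, u, k)              # iteration (4.2): x <- x + x*u
            x = [(x[i] if i < len(x) else 0) + xu[i] for i in range(k)]
        return x

    # 1/(1 - z) = 1 + z + z^2 + ...
    print(inverse_series([1, -1, 0, 0, 0, 0, 0, 0], 8))   # [1, 1, 1, 1, 1, 1, 1, 1]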

4.2.6 Newton’s method for exp and log

Newton's method to evaluate e^h yields the iteration f ← f_0 + f_0(h − log f_0).


 

1  Algorithm Improve-Exp.
2  Input: h, n, f_0 an n-bit approximation to exp(h).
3  Output: f := a 2n-bit approximation to exp(h).
4  g ← log f_0   [computed to 2n-bit accuracy]
5  e ← h − g
6  f ← f_0 + f_0 e
7  Return f.




Since the computation of g = log f_0 has the same complexity as division,
via the formula g′ = f′/f, step 4 costs 5M(n), and step 6 costs M(n),
which gives a total cost of 6M(n).
However, some computations in f′/f can be cached: 1/f was already
computed to n/2 bits at the previous iteration, so we only need to update it
to n bits; q = f′/f was already computed to n bits at the previous iteration,
so we don't need to compute it again. In summary, the computation of log f_0
at step 4 reduces to 3M(n): the update of g_0 = 1/f to n bits with cost M(n),
and the update q ← q + g_0(f′ − fq) for f′/f, with cost 2M(n). The total cost
thus reduces to 4M(n).
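A toy version of Improve-Exp, using Python's decimal module (Decimal.ln
plays the role of the high-precision logarithm; this sketch ignores the caching
discussed above, and improve_exp is our name):

    from decimal import Decimal, getcontext

    def improve_exp(h, f0, prec):
        # one Newton step f <- f0 + f0*(h - log f0) at 'prec' digits
        getcontext().prec = prec
        g = Decimal(f0).ln()                # g = log f0
        return f0 + f0 * (Decimal(h) - g)

    f = Decimal("2.71828")                  # exp(1) to about 6 digits
    for digits in (12, 24, 48):             # each step roughly doubles the accuracy
        f = improve_exp("1", f, digits)
    print(f)                                # 2.718281828459045235360287471...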

4.3 Argument Reduction

4.4 Power Series


If f(x) is analytic in a neighbourhood of some point c, an obvious method
to consider for the evaluation of f(x) is summation of the Taylor series

    f(x) = Σ_{j=0}^{k−1} (x − c)^j f^{(j)}(c)/j! + R_k(x, c) .

As a simple but instructive example we consider the evaluation of exp(x)
for |x| ≤ 1, using

    exp(x) = Σ_{j=0}^{k−1} x^j/j! + R_k(x) ,    (4.7)

where |R_k(x)| ≤ e/k! .

Using Stirling's approximation for k!, we see that k ≥ K(n) ∼ n/log₂ n
is sufficient to ensure that |R_k(x)| = O(2^{−n}). Thus the time required is
O(nM(n)/ln n).

In practice it is convenient to sum the series in the forward direction
(j = 0, 1, . . . , k − 1). The terms T_j = x^j/j! and partial sums

    S_j = Σ_{i=0}^{j} T_i

may be generated by the recurrence T_j = x × T_{j−1}/j, S_j = S_{j−1} + T_j, and
the summation terminated when |T_k| < 2^{−n}. Thus, it is not necessary to
estimate k in advance, as it would be if the series were summed by Horner's
rule in the backward direction (j = k − 1, k − 2, . . . , 0).
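A sketch of this forward summation with the decimal module (the target
accuracy is n decimal digits here, with a few guard digits as discussed below;
exp_series is our name):

    from decimal import Decimal, getcontext

    def exp_series(x, n):
        # sum (4.7) forward, stopping once |T_j| < 10^(-n)
        getcontext().prec = n + 5           # guard digits
        x = Decimal(x)
        T = S = Decimal(1)                  # T_0 = S_0 = 1
        j = 0
        while abs(T) > Decimal(10) ** (-n):
            j += 1
            T = x * T / j                   # T_j = x * T_{j-1} / j
            S += T
        return S

    print(exp_series("1", 50))              # e to about 50 digits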

We now consider the effect of rounding errors, under the assumption that
floating-point operations satisfy

    fl(x op y) = (x op y)(1 + δ) ,

where |δ| ≤ ε and "op" = "+", "−", "×" or "/". Here ε ≤ β^{1−t} is the
"machine precision". Let T̂_j be the computed value of T_j, etc. Thus

    |T̂_j − T_j| / |T_j| ≤ 2jε + O(ε²)

and

    |Ŝ_k − S_k| ≤ keε + Σ_{j=1}^{k} 2jε|T_j| + O(ε²)
               ≤ (k + 2)eε + O(ε²) = O(nε) .

Thus, to get |Ŝ_k − S_k| = O(2^{−n}) it is sufficient that ε = O(2^{−n}/n), i.e. we
need to work with about log_β n guard digits. This is not a significant over-
head if (as we assume) the number of digits may vary dynamically. The
slightly better error bound obtainable for backward summation is thus of no
importance.

In practice it is inefficient to keep ε fixed. We can profitably reduce the
working precision when computing T_k from T_{k−1} if |T_{k−1}| ≪ 1, without sig-
nificantly increasing the error bound.

It is instructive to consider the effect of relaxing our restriction that


|x| ≤ 1. First suppose that x is large and positive. Since |Tj | > |Tj−1 |
when j < |x|, it is clear that the number of terms required in the sum (4.7)
is at least of order |x|. Thus, the method is slow for large |x| (see §4.3 for
faster methods in this case).

If |x| is large and x is negative, the situation is even worse. From Stirling's
approximation we have

    max_{j≥0} |T_j| ≃ exp|x| / √(2π|x|) ,

but the result is exp(−|x|), so about 2|x|/ln β guard digits are required
to compensate for Lehmer's "catastrophic cancellation" [21]. Since exp(x) =
1/exp(−x), this problem may easily be avoided, but the corresponding prob-
lem is not always so easily avoided for other analytic functions.

In the following sections we generally ignore the effect of rounding errors,


but the results obtained above are typical. For an example of an extremely
detailed error analysis of an unrestricted algorithm, see [19].

To conclude this section we give a less trivial example where power series
expansions are useful. To compute the error function

    erf(x) = 2π^{−1/2} ∫_0^x e^{−u²} du ,

we may use the series

    erf(x) = 2π^{−1/2} Σ_{j≥0} (−1)^j x^{2j+1} / (j! (2j + 1))    (4.8)

or

    erf(x) = 2π^{−1/2} exp(−x²) Σ_{j≥0} 2^j x^{2j+1} / (1·3·5 · · · (2j + 1)) .    (4.9)

The series (4.9) is preferable to (4.8) for moderate |x| because it involves
no cancellation. For large |x| neither series is satisfactory, because Ω(x²)
terms are required, and it is preferable to use the asymptotic expansion or
continued fraction for erfc(x) = 1 − erf(x): see §§4.5-4.6.
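A sketch comparing (4.8) and (4.9) in ordinary double precision: the alter-
nating series (4.8) cancels catastrophically for moderate x, while (4.9) has
only positive terms (function names are ours; 120 terms are more than enough
at x = 5):

    import math

    def erf_alternating(x, terms=120):           # series (4.8)
        s, t = 0.0, x                             # t = (-1)^j x^(2j+1) / j!
        for j in range(terms):
            s += t / (2 * j + 1)
            t *= -x * x / (j + 1)
        return 2 / math.sqrt(math.pi) * s

    def erf_positive(x, terms=120):               # series (4.9)
        s, t = 0.0, x                             # t = 2^j x^(2j+1) / (1*3*...*(2j+1))
        for j in range(terms):
            s += t
            t *= 2 * x * x / (2 * j + 3)
        return 2 / math.sqrt(math.pi) * math.exp(-x * x) * s

    print(erf_alternating(5.0) - math.erf(5.0))   # roughly 1e-8: cancellation
    print(erf_positive(5.0) - math.erf(5.0))      # roughly 1e-16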

4.5 Asymptotic Expansions


Example: Ei, erf (cut-off point with power series).

4.6 Continued Fractions


Examples: Ei, erf, Bessel functions

4.7 Recurrence relations


Linear and/or nonlinear recurrence relations (an example of nonlinear is con-
sidered in §4.8).
Ex: Bessel functions

4.8 Arithmetic-Geometric Mean


[Another nonlinear recurrence, important enough to treat separately.]
References Brent [9, 12], Borweins’ book [8], Salamin, Hakmem, Bernstein
(http://cr.yp.to/papers.html#logagm), etc.
Quadratic convergence but lacks self-correcting property of Newton’s method.
Take care over sign for complex case (or avoid it if possible).
log → exp → sin, cos → inverse sin, cos, tan
(Landen transformations (reference [12] and exercise), more efficient meth-
ods [9].)
Can improve constants in [9, 12] by using faster square root algorithm,
see Chapter 3.
Ex: π (using Lagrange identity), log 2, elliptic integrals and functions.
Reference: Borwein’s book [8], book π unleashed by J. Arndt and Ch.
Haenel (http://www.maa.org/reviews/piunleashed.html)?

4.9 Binary Splitting


Cf [23] and the CLN implementation.
Used together with argument reduction (e.g. Brent’s algorithm for exp).
The following is an extension of Brent's algorithm to sin and cos (Theorem
6.2 of [11] gives an O(M(n) log² n) bound for sin, but the method is not
detailed; it seems the author had in mind computing exp z for complex z):

Input: floating-point |x| < 1/2 in precision p
Output: approximations of sin x and cos x to precision p
1. Write x = Σ_{i=0}^{k} r_i · 2^{−2^{i+1}}, where r_i is an integer of at most 2^i bits
2. Let x_j = Σ_{i=j}^{k} r_i · 2^{−2^{i+1}}, and y_i = r_i · 2^{−2^{i+1}}, thus x_j = y_j + x_{j+1}
3. Compute sin y_i and cos y_i using binary splitting
4. Reconstruct sin x_j and cos x_j using

    sin x_j = sin y_j cos x_{j+1} + cos y_j sin x_{j+1}
    cos x_j = cos y_j cos x_{j+1} − sin y_j sin x_{j+1}
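The splitting and the reconstruction step can be checked in ordinary double
precision, with math.sin and math.cos standing in for the binary-splitting
evaluations of step 3:

    import math

    x = 0.4321                          # |x| < 1/2
    chunks, rest = [], x
    for i in range(4):                  # step 1: r_i has at most 2^i bits
        w = 2 ** (2 ** (i + 1))
        y = math.floor(rest * w) / w    # y_i = r_i * 2^(-2^(i+1))
        chunks.append(y)
        rest -= y

    s, c = math.sin(rest), math.cos(rest)     # tiny tail left by the 4 bursts
    for y in reversed(chunks):                # step 4: angle addition
        s, c = (math.sin(y) * c + math.cos(y) * s,
                math.cos(y) * c - math.sin(y) * s)
    print(s - math.sin(x), c - math.cos(x))   # both about 1e-16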

4.10 Holonomic Functions


We describe here the “bit-burst” algorithm invented by the Chudnovsky
brothers [18]:

Theorem 4.10.1 If f is holonomic (or D-finite), and assuming f(0) = 0,
then f(x) can be computed to an accuracy of n bits, for any n-bit floating-
point number x, in time O(M(n) log³ n).

A function f(x) is said to be holonomic if and only if it satisfies a linear
differential equation with polynomial coefficients in x. Equivalently, the Taylor
coefficients u_n of f satisfy a linear recurrence with polynomial coefficients in n.
For example, the exp, log, sin and cos functions are holonomic, but tan is not.
Note: the condition f(0) = 0 is just a technical condition to simplify the
proof of the theorem; f(0) can be any value that can be computed to n bits
in time O(M(n) log³ n).

Proof. Without loss of generality, we assume 0 ≤ x < 1; the binary expan-
sion of x can then be written x = 0.b_1 b_2 . . . b_n. Define r_1 = 0.b_1, r_2 = 0.0b_2b_3,
r_3 = 0.000b_4b_5b_6b_7: r_1 consists of the first bit of the binary expansion of x, r_2
of the next two bits, r_3 of the next four bits, and so on. We thus have
x = r_1 + r_2 + · · · + r_k where 2^{k−1} ≤ n < 2^k.
Define x_i = r_1 + · · · + r_i. The idea of the algorithm is to transfer the Taylor
series of f from x_i to x_{i+1}, which, since f is holonomic, reduces to converting
the recurrence. We define f_0(t) = f(t), f_1(t) = f_0(r_1 + t), f_2(t) = f_1(r_2 + t), . . . ,
f_i(t) = f_{i−1}(r_i + t) for i ≤ k. We have f_i(t) = f(x_i + t), and f_k(t) = f(x + t)
since x_k = x. Thus we are looking for f_k(0) = f(x).
Let f_i*(t) = f_i(t) − f_i(0) be the non-constant part of the Taylor expansion
of f_i. We have f_i*(r_{i+1}) = f_i(r_{i+1}) − f_i(0) = f_{i+1}(0) − f_i(0) since f_{i+1}(t) =
f_i(r_{i+1} + t). Thus f_0*(r_1) + · · · + f_{k−1}*(r_k) = (f_1(0) − f_0(0)) + · · · + (f_k(0) −
f_{k−1}(0)) = f_k(0) − f_0(0) = f(x) − f(0). With the assumption f(0) = 0 this
yields:

    f(x) = f_0*(r_1) + · · · + f_i*(r_{i+1}) + · · · + f_{k−1}*(r_k).

It suffices to show that each term f_i*(r_{i+1}) can be evaluated to n bits in
O(M(n) log² n) to conclude the proof.
Now r_{i+1} is a rational whose numerator has at most 2^i bits, and whose
value is less than ≈ 2^{−2^i}. Thus to evaluate f_i*(r_{i+1}) to n bits, n/2^i terms of
the Taylor expansion of f_i*(t) are enough.

We now use the fact that f is holonomic. Assume f satisfies the following
linear differential equation with polynomial coefficients:

    c_m(t) f^{(m)}(t) + · · · + c_1(t) f′(t) + c_0(t) f(t) = 0.

Substituting x_i + t for t, we obtain a differential equation for f_i:

    c_m(x_i + t) f_i^{(m)}(t) + · · · + c_1(x_i + t) f_i′(t) + c_0(x_i + t) f_i(t) = 0.

From this latter equation we deduce a linear recurrence for the Taylor coef-
ficients of f_i(t), of the same order as that for f(t). The coefficients in the
recurrence for f_i(t) have O(2^i) bits, since x_i = r_1 + · · · + r_i has O(2^i) bits.
It follows that the kth Taylor coefficient of f_i(t) has size O(k(2^i + log k))
[the k log k term comes from the polynomials in k in the recurrence]. Since
k is at most n/2^i, this is O(n log n).
However we don't want to evaluate the kth Taylor coefficient u_k of f_i(t),
but the partial sum

    s_k = Σ_{j=1}^{k} u_j r_{i+1}^j .

Noticing that u_j = (s_j − s_{j−1})/r_{i+1}^j, and substituting that value in the recur-
rence for (u_j), say of order l, we obtain a recurrence of order l + 1 for (s_k).
Putting this latter recurrence in matrix form S_k = M(k)S_{k−1}, where S_k is the
vector (s_k, s_{k−1}, . . . , s_{k−l}), we obtain S_k = M(k)M(k − 1) · · · M(l)S_{l−1}, where
the matrix product M(k)M(k−1) · · · M(l) can be evaluated in O(M(n) log² n)
using binary splitting.
We illustrate the above theorem with the arc-tangent function, which
satisfies the differential equation

    f′(t)(1 + t²) = 1.

This equation evaluates at x_i + t into f_i′(t)(1 + (x_i + t)²) = 1, which gives the
recurrence

    (1 + x_i²) k u_k + 2x_i (k − 1) u_{k−1} + (k − 2) u_{k−2} = 0

for k ≥ 2. This recurrence translates to

    (1 + x_i²) k v_k + 2x_i r_{i+1} (k − 1) v_{k−1} + r_{i+1}² (k − 2) v_{k−2} = 0

for v_k = u_k r_{i+1}^k, and to

    (1 + x_i²) k (s_k − s_{k−1}) + 2x_i r_{i+1} (k − 1)(s_{k−1} − s_{k−2}) + r_{i+1}² (k − 2)(s_{k−2} − s_{k−3}) = 0

for s_k = Σ_{j=1}^{k} v_j.
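The following sketch evaluates arctan by this decomposition with exact ratio-
nals, using the recurrence above to produce the Taylor coefficients u_k of
f_i(t) = arctan(x_i + t); the binary-splitting matrix product that makes the
evaluation fast is omitted, and the names are ours:

    from fractions import Fraction
    import math

    def atan_increment(xi, r, terms):
        # f_i*(r) = arctan(x_i + r) - arctan(x_i), summing 'terms' Taylor terms
        d = 1 + xi * xi
        u1 = Fraction(1) / d            # u_1 = 1/(1 + x_i^2)
        u2 = -xi / d ** 2               # u_2 = -x_i/(1 + x_i^2)^2
        s = u1 * r + u2 * r ** 2
        ukm2, ukm1, rk = u1, u2, r ** 2
        for k in range(3, terms):       # recurrence for u_k, k >= 3
            uk = -(2 * xi * (k - 1) * ukm1 + (k - 2) * ukm2) / (d * k)
            rk *= r
            s += uk * rk
            ukm2, ukm1 = ukm1, uk
        return s

    x = Fraction(5, 16)
    chunks = [Fraction(1, 4), Fraction(1, 16)]   # bit bursts: x = r_1 + r_2
    xi, total = Fraction(0), Fraction(0)
    for r in chunks:
        total += atan_increment(xi, r, 30)
        xi += r
    print(float(total) - math.atan(5 / 16))      # about 1e-16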

4.11 Contour integration


Ex: z/(e^z − 1) + z/2
+ 2

4.12 Constants
Ex: exp, π, γ [15], Gamma, Psi, ζ, ζ(1/2 + it), ... Cf http://cr.yp.to/1987/
bernstein.html for π and e. Cf also [22].

4.13 Summary of Best-known Methods


Table giving for each function the best-known method (for asymptotic com-
plexity, perhaps with a comment for the corresponding region), together with
the corresponding complexity.

4.14 Notes and further references


If you want to know more about holonomic or D-finite functions, see for
example [43].
Of course Abramowitz & Stegun [1]. Book of Nico Temme [51]?
Cf http://remote.science.uva.nl/~thk/specfun/compalg.html and
http://numbers.computation.free.fr/Constants/constants.html
For history, see Bernstein [4, 5].

4.15 Exercises
Exercise 4.15.1 If A(z) = Σ_{j≥0} a_j z^j is a formal power series over R with a_0 = 1,
show that log(A(z)) can be computed with error O(z^n) in time O(M(n)), where
M(n) is the time required to multiply two polynomials of degree n − 1. (A smooth-
ness condition on the growth of M(n) as a function of n may be required.)
Hint: (d/dz) log(A(z)) = A′(z)/A(z).
Does a similar result hold for n-bit numbers if z is replaced by 1/2?

Exercise 4.15.2 (Brent) Assuming we can compute n bits of log x in O(M(n) log n),
and of exp x in O(M(n) log² n), show how to compute exp x in O(M(n) log n), with
almost the same constant as the logarithm.
Appendix: Implementations
and Pointers

A few words about implementations? Pointers to mailing-lists (perhaps only


on the web page or CD, with errata)

Bibliography

[1] Milton Abramowitz and Irene A. Stegun. Handbook of Mathematical Func-


tions. Dover, 1973.

[2] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of


Computer Algorithms. Addison-Wesley, 1974.

[3] Paul Barrett. Implementing the Rivest Shamir and Adleman public key en-
cryption algorithm on a standard digital signal processor. In A. M. Odlyzko,
editor, Advances in Cryptology, Proceedings of Crypto’86, volume 263 of Lec-
ture Notes in Computer Science, pages 311–323. Springer-Verlag, 1987.

[4] Dan J. Bernstein. Computing logarithm intervals with the arithmetic-


geometric-mean iteration. http://cr.yp.to/arith.html#logagm, 2003. 8
pages.

[5] Dan J. Bernstein. Removing redundancy in high-precision Newton iteration.


http://cr.yp.to/fastnewton.html, 2004. 13 pages.

[6] R. Bernstein. Multiplication by integer constants. Software, Practice and


Experience, 16(7):641–652, 1986.

[7] Yves Bertot, Nicolas Magaud, and Paul Zimmermann. A proof of GMP square
root. Journal of Automated Reasoning, 29:225–252, 2002. Special Issue on
Automating and Mechanising Mathematics: In honour of N.G. de Bruijn.

[8] J. M. Borwein and P. B. Borwein. Pi and the AGM: A Study in Analytic


Number Theory and Computational Complexity. Wiley, 1998.

[9] Richard P. Brent. Multiple-precision zero-finding methods and the complexity


of elementary function evaluation. In J. F. Traub, editor, Analytic Computa-
tional Complexity, pages 151–176, New York, 1975. Academic Press. http:
//www.comlab.ox.ac.uk/oucl/work/richard.brent/pub/pub028.html.


[10] Richard P. Brent. Analysis of the binary Euclidean algorithm. In J. F. Traub,


editor, New Directions and Recent Results in Algorithms and Complexity,
pages 321–355. Academic Press, New York, 1976. http://www.comlab.ox.
ac.uk/oucl/work/richard.brent/pub/pub037.html.
[11] Richard P. Brent. The complexity of multiple-precision arithmetic. In R. S.
Anderssen and Richard P. Brent, editors, The Complexity of Computational
Problem Solving, pages 126–165. University of Queensland Press, 1976. http:
//www.comlab.ox.ac.uk/oucl/work/richard.brent/pub/pub032.html.
[12] Richard P. Brent. Fast multiple-precision evaluation of elementary functions.
Journal of the ACM, 23(2):242–251, 1976. http://www.comlab.ox.ac.uk/
oucl/work/richard.brent/pub/pub034.html.
[13] Richard P. Brent. Twenty years’ analysis of the binary Euclidean algorithm.
In A. W. Roscoe J. Davies and J. Woodcock, editors, Millenial Perspectives
in Computer Science, pages 41–53. Palgrave, New York, 2000. http://www.
comlab.ox.ac.uk/oucl/work/richard.brent/pub/pub183.html.
[14] Richard P. Brent and H. T. Kung. Fast algorithms for manipulating formal
power series. Journal of the ACM, 25(2):581–595, 1978. http://www.comlab.
ox.ac.uk/oucl/work/richard.brent/pub/pub045.html.
[15] Richard P. Brent and Edwin M. McMillan. Some new algorithms for
high-precision computation of Euler’s constant. Mathematics of Compu-
tation, 34(149):305–312, 1980. http://www.comlab.ox.ac.uk/oucl/work/
richard.brent/pub/pub049.html.
[16] Richard P. Brent and Joseph F. Traub. On the complexity of composition
and generalized composition of power series. SIAM Journal on Computing,
9:54–66, 1980. http://www.comlab.ox.ac.uk/oucl/work/richard.brent/
pub/pub050.html.
[17] Christoph Burnikel and Joachim Ziegler. Fast recursive division. Research
Report MPI-I-98-1-022, MPI Saarbrücken, October 1998. http://data.
mpi-sb.mpg.de/internet/reports.nsf/NumberView/1998-1-022.
[18] D. V. Chudnovsky and G. V. Chudnovsky. Computer algebra in the service
of mathematical physics and number theory. In Computers in Mathemat-
ics (Stanford, CA, 1986), volume 125 of Lecture Notes in Pure and Applied
Mathematics, pages 109–232, New York, 1990. Dekker.
[19] C. W. Clenshaw and F. W. J. Olver. An unrestricted algorithm for the
exponential function. SIAM J. Numerical Analysis, 17:310–331, 1980.

[20] Richard E. Crandall and Carl Pomerance. Prime Numbers: A Computational


Perspective. Springer-Verlag, 2001.

[21] George E. Forsythe. Pitfalls in computation, or why a math book isn’t enough.
Amer. Math. Monthly, 77:931–956, 1970.

[22] X. Gourdon and P. Sebah. Numbers, constants and computation. http:


//numbers.computation.free.fr/Constants/constants.html.

[23] B. Haible and T. Papanikolaou. Fast multiprecision evaluation of series of ra-


tional numbers. Technical Report TI-7/97, Darmstadt University of Technol-
ogy, 1997. http://www.informatik.th-darmstadt.de/TI/Mitarbeiter/
papanik/.

[24] Tom R. Halfhill. The truth behind the Pentium bug. Byte, March 1995.

[25] Guillaume Hanrot and Paul Zimmermann. A long note on Mulders’ short
product. Journal of Symbolic Computation, 2003. To appear.

[26] T. Jebelean. Practical integer division with Karatsuba complexity. In W. W.


Küchlin, editor, Proc. ISSAC’97, pages 339–341, Maui, Hawaii, 1997.

[27] Tudor Jebelean. An algorithm for exact division. Journal of Symbolic Com-
putation, 15:169–180, 1993.

[28] Tudor Jebelean. A double-digit Lehmer-Euclid algorithm for finding the GCD
of long integers. Journal of Symbolic Computation, 19:145–157, 1995.

[29] Alan H. Karp and Peter Markstein. High-precision division and square root.
ACM Transactions on Mathematical Software, 23(4):561–589, 1997.

[30] D. Knuth. The analysis of algorithms. In Actes du Congrès International des


Mathématiciens de 1970, volume 3, pages 269–274, Paris, 1971. Gauthiers-
Villars.

[31] Donald E. Knuth. The Art of Computer Programming, volume 2 : Seminu-


merical Algorithms. Addison-Wesley, second edition, 1981.

[32] Donald E. Knuth. The Art of Computer Programming, volume 2 :


Seminumerical Algorithms. Addison-Wesley, third edition, 1998. http:
//www-cs-staff.stanford.edu/~knuth/taocp.html.

[33] Werner Krandick and Tudor Jebelean. Bidirectional exact integer division.
Journal of Symbolic Computation, 21(4–6):441–456, 1996.

[34] V. Lefèvre. Multiplication by an integer constant. Research Report


RR-4192, INRIA, May 2001. ftp://ftp.inria.fr/INRIA/publication/
publi-ps-gz/RR/RR-4192.ps.gz.

[35] Roman Maeder. Storage allocation for the Karatsuba integer multiplication
algorithm. DISCO, 1993. preprint.

[36] Valérie Ménissier-Morain. Arithmétique exacte, conception, algorithmique


et performances d’une implémentation informatique en précision arbitraire.
PhD thesis, University of Paris 7, 1994. ftp.inria.fr/INRIA/Projects/
cristal/Valerie.Menissier/these94.ps.gz.

[37] R. Moenck and A. Borodin. Fast modular transforms via division. In Pro-
ceedings of the 13th Annual IEEE Symposium on Switching and Automata
Theory, pages 90–96, October 1972.

[38] P. L. Montgomery. Speeding the Pollard and elliptic curve methods of factor-
ization. Mathematics of Computation, 48(177):243–264, 1987.

[39] T. Mulders. On short multiplications and divisions. Applicable Algebra in


Engineering, Communication and Computing, 11(1):69–88, 2000.

[40] Victor Pan. How to Multiply Matrices Faster, volume 179 of Lecture Notes in
Computer Science. Springer-Verlag, 1984.

[41] Colin Percival. Rapid multiplication modulo the sum and difference of highly
composite numbers. Mathematics of Computation, 72(241):387–395, 2003.

[42] Douglas M. Priest. Algorithms for arbitrary precision floating point arith-
metic. In Peter Kornerup and David Matula, editors, Proceedings of the 10th
Symposium on Computer Arithmetic, pages 132–144, Grenoble, France, 1991.
IEEE Computer Society Press. http://www.cs.cmu.edu/~quake-papers/
related/Priest.ps.

[43] B. Salvy and P. Zimmermann. Gfun: A Maple package for the manipulation
of generating and holonomic functions in one variable. ACM Transactions on
Mathematical Software, 20(2):163–177, June 1994.

[44] Arnold Schönhage. Schnelle Berechnung von Kettenbruchentwicklungen. Acta


Informatica, 1:139–144, 1971.

[45] Arnold Schönhage. Variations on computing reciprocals of power series. In-


formation Processing Letters, 74:41–46, 2000.

[46] Arnold Schönhage, A. F. W. Grotefeld, and E. Vetter. Fast Algorithms, A


Multitape Turing Machine Implementation. BI-Wissenschaftsverlag, 1994.

[47] Arnold Schönhage and Volker Strassen. Schnelle Multiplikation großer Zahlen.
Computing, 7:281–292, 1971.

[48] Jonathan P. Sorenson. Two fast GCD algorithms. Journal of Algorithms,


16:110–144, 1994.

[49] Damien Stehlé and Paul Zimmermann. A binary recursive gcd algorithm. In
Proceedings of the Algorithmic Number Theory Symposium (ANTS VI), 2004.

[50] A. Svoboda. An algorithm for division. Information Processing Machines,


9:25–34, 1963.

[51] Nico M. Temme. Special Functions: An Introduction to the Classical Func-


tions of Mathematical Physics. Wiley, John and Sons, Inc., 1996. http:
//www.addall.com/Browse/Detail/0471113131.html.

[52] Brigitte Vallée. Dynamics of the binary Euclidean algorithm: Functional


analysis and operators. Algorithmica, 22:660–685, 1998.

[53] Joris van der Hoeven. Relax, but don’t be too lazy. Journal of Symbolic Com-
putation, 34(6):479–542, 2002. http://www.math.u-psud.fr/~vdhoeven.

[54] Jean Vuillemin. private communication, January 2004.

[55] Kenneth Weber. The accelerated integer GCD algorithm. ACM Transactions
on Mathematical Software, 21(1):111–122, 1995.

[56] C. K. Yap. Fundamental Problems in Algorithmic Algebra. Oxford University


Press, 2000.

[57] Dan Zuras. More on squaring and multiplying large integers. IEEE Transac-
tions on Computers, 43(8):899–908, 1994.
