Modern Computer Arithmetic
Version 0.1
Contents
1 Integer Arithmetic
1.1 Representation and Notations
1.2 Addition and Subtraction
1.3 Multiplication
1.3.1 Naive Multiplication
1.3.2 Karatsuba's Algorithm
1.3.3 Toom-Cook Multiplication
1.3.4 Fast Fourier Transform
1.3.5 Unbalanced Multiplication
1.3.6 Squaring
1.3.7 Multiplication by a constant
1.4 Division
1.4.1 Naive Division
1.4.2 Divisor Preconditioning
1.4.3 Divide and Conquer Division
1.4.4 Newton's Division
1.4.5 Exact Division
1.4.6 Only Quotient or Remainder Wanted
1.4.7 Division by a Constant
1.4.8 Hensel's Division
1.5 Roots
1.5.1 Square Root
1.5.2 k-th Root
1.5.3 Exact Root
1.6 Gcd
1.6.1 Naive Gcd
1.6.2 Extended Gcd
3 Floating-Point Arithmetic
3.1 Introduction
3.1.1 Representation
3.1.2 Precision vs Accuracy
3.1.3 Link to Integers
3.1.4 Error analysis
3.1.5 Rounding
3.1.6 Strategies
3.2 Addition/Subtraction/Comparison
3.2.1 Floating-Point Addition
3.2.2 Leading Zero Detection
Chapter 1
Integer Arithmetic
A = a_{n−1} β^{n−1} + · · · + a_1 β + a_0,
represented. Thus only the length n and the integers (a_i)_{0≤i<n} are effectively stored. Some common choices for β are 2^32 on a 32-bit computer, or 2^64 on a 64-bit machine; other possible choices are respectively 10^9 and 10^19 for a decimal representation, or 2^53 when using double-precision floating-point registers. Most algorithms from this chapter work in any base; the exceptions are explicitly mentioned.
We assume that the sign is stored separately from the absolute value. Zero is an important special case; to simplify the algorithms we assume that n = 0 if A = 0, and in most cases we assume that this case is treated separately.
Except when explicitly mentioned, we assume that all operations are off-
line, i.e. all inputs (resp. outputs) are completely known at the beginning
(resp. end) of the algorithm. Different models include lazy or on-line algo-
rithms, and relaxed algorithms [53].
Let M be the number of different values taken by the data type representing the coefficients a_i, b_i. (Clearly β ≤ M, but equality does not necessarily hold, e.g. β = 10^9 and M = 2^32.) At step 6, the value of s can be as large as 2β − 1, which is not representable if β = M. Several workarounds are possible: either use a machine instruction that gives the possible carry of a_i + b_i; or use the fact that, if a carry occurs in a_i + b_i, then the computed sum — if performed modulo M — equals t := a_i + b_i − M < a_i, thus comparing t and a_i will determine whether a carry occurred. A third solution is to keep one extra bit, taking β = ⌊M/2⌋.
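The comparison trick is easy to simulate in software. The following Python sketch (the function name add_words and the word size M are illustrative choices, not from the text) adds two little-endian word vectors, detecting each carry by comparing the wrapped sum against one operand:

def add_words(a, b, M=2**64):
    """Add little-endian word vectors with words reduced mod M, detecting
    carries by the comparison trick: if a_i + b_i wraps modulo M, then
    (a_i + b_i) mod M < a_i."""
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    out, carry = [], 0
    for ai, bi in zip(a, b):
        t = (ai + bi) % M          # what a machine add would return
        c1 = t < ai                # wrap-around in a_i + b_i?
        s = (t + carry) % M        # absorb the incoming carry
        c2 = s < t                 # wrap-around when adding the carry?
        out.append(s)
        carry = int(c1 or c2)      # at most one of the two can occur
    if carry:
        out.append(carry)
    return out

For example, add_words([2**64 - 1, 5], [1, 7]) returns [0, 13], the carry from the low word having propagated into the high word.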
1.3 Multiplication
A nice application of large integer multiplication is the Kronecker/Schönhage trick. Assume we want to multiply two polynomials A(x) and B(x) with non-negative integer coefficients. Assume both polynomials have degree less than n, and coefficients bounded by B. Now take a power X = β^k of the base β that is larger than nB², and multiply the integers a = A(X) and b = B(X) obtained by evaluating A and B at x = X. If C(x) = A(x)B(x) = Σ c_i x^i, we clearly have C(X) = Σ c_i X^i. Now since the c_i are bounded by nB² < X, the coefficients c_i can be retrieved by simply “reading” blocks of k words in C(X).
Conversely, suppose you want to multiply two integers a = Σ_{0≤i<n} a_i β^i and b = Σ_{0≤j<n} b_j β^j. Multiply the polynomials A(x) = Σ_{0≤i<n} a_i x^i and B(x) = Σ_{0≤j<n} b_j x^j, obtaining a polynomial C(x), then evaluate C(x) at x = β to obtain ab. Note that the coefficients of C(x) may be larger than β, in fact they may be of order nβ². These examples demonstrate the analogy between operations on polynomials and integers, and also show the limits of the analogy.
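As a concrete illustration, here is a small Python sketch of the first direction of the trick (the packing base 2^k and the function name are our illustrative choices; the text does not prescribe an implementation):

def kronecker_poly_mul(A, B):
    # Multiply polynomials with non-negative integer coefficients, given
    # as lists (least significant coefficient first), by packing them into
    # integers: one big product, then unpack blocks of k bits.
    n = max(len(A), len(B))
    bound = n * max(max(A), max(B)) ** 2 + 1   # product coefficients < n*B^2
    k = bound.bit_length()                     # so X = 2^k > n*B^2
    a = sum(c << (k * i) for i, c in enumerate(A))
    b = sum(c << (k * i) for i, c in enumerate(B))
    c = a * b
    mask = (1 << k) - 1
    return [(c >> (k * i)) & mask for i in range(len(A) + len(B) - 1)]

For example, kronecker_poly_mul([1, 2], [3, 4]) returns [3, 10, 8], the coefficients of (1 + 2x)(3 + 4x).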
proves that the division by 3 cannot be avoided for Toom-Cook 3-way (see
Ex. 1.9.8).
a4 a3 a2 a1 a0     a4 a3 a2 a1 a0     a4 a3 a2 a1 a0
      b2 b1 b0        b2 b1 b0           b2 b1 b0
     A × B            A × (βB)           A × (β² B)

The first strategy leads to two products of size 3, i.e. 2K(3, 3), the second one to K(2, 1) + K(3, 2) + K(3, 3), and the third one to K(2, 2) + K(3, 1) + K(3, 3), which give respectively 14, 15, 13 word products.
However, whenever m/2 ≤ n ≤ m, any such “padding strategy” will require K(⌈m/2⌉, ⌈m/2⌉) for the product of the differences of the low and high parts of the operands, due to a “wrap-around” effect when subtracting the parts from the smaller operand; this will ultimately lead to an O(m^α) cost. The “odd-even strategy” (Ex. 1.9.10) avoids this wrap-around. For example, we get K(3, 2) = 5 with the odd-even strategy, against K(3, 2) = 6 for the classical one.
As with the classical strategy, there are several ways of padding with the odd-even strategy. Consider again m = 5, n = 3, and write A := a_4x^4 + a_3x^3 + a_2x^2 + a_1x + a_0 = xA_1(x²) + A_0(x²), with A_1(x) = a_3x + a_1, and A_0(x) = a_4x² + a_2x + a_0; and B := b_2x² + b_1x + b_0 = xB_1(x²) + B_0(x²), with B_1(x) = b_1, B_0(x) = b_2x + b_0. Without padding, we write AB = x²(A_1B_1)(x²) + x((A_0 + A_1)(B_0 + B_1) − A_1B_1 − A_0B_0)(x²) + (A_0B_0)(x²), which gives K(5, 3) = K(2, 1) + 2K(3, 2) = 12. With padding, we consider xB = xB_1′(x²) + B_0′(x²), with B_1′(x) = b_2x + b_0, B_0′ = b_1x. This gives K(2, 2) = 3 for A_1B_1′, K(3, 2) = 5 for (A_0 + A_1)(B_0′ + B_1′), and K(3, 1) = 3 for A_0B_0′ — taking into account the fact that B_0′ has only one non-zero coefficient — thus a total of 11 only.
1.3.6 Squaring
In many applications, a significant proportion of the multiplications have both operands equal. Hence it is worth tuning a special squaring implementation as much as the implementation of multiplication itself, bearing in mind that the best possible speedup is two (see Ex. 1.9.11).
For naive multiplication, Algorithm BasecaseMultiply (§1.3.1) can be modified to obtain a theoretical speedup of two, since only half of the products a_i b_j need to be computed.
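A possible realization of this idea in Python (a sketch; names and the word base are our illustrative choices) computes each cross product a_i a_j with i < j once, doubles it, and adds the squares a_i² on the diagonal:

def basecase_sqr(a, beta=2**64):
    # Square a little-endian word vector: n(n-1)/2 cross products
    # (doubled) plus n diagonal squares -- about half a full product.
    n = len(a)
    c = [0] * (2 * n)
    for i in range(n):
        for j in range(i + 1, n):
            c[i + j] += a[i] * a[j]
    c = [2 * x for x in c]             # double the cross products
    for i in range(n):
        c[2 * i] += a[i] * a[i]        # add the diagonal terms
    carry = 0                          # finally normalize the words
    for k in range(2 * n):
        carry, c[k] = divmod(c[k] + carry, beta)
    return c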
Subquadratic algorithms like Karatsuba and Toom-Cook r-way can be
specialized for squaring too. However, the speedup obtained is less than two,
and the threshold obtained is larger than the corresponding multiplication
threshold (see Ex. 1.9.11).
x_1 := 31x = 2^5 x − x
x_2 := 93x = 2^1 x_1 + x_1
x_3 := 743x = 2^3 x_2 − x
x_4 := 6687x = 2^3 x_3 + x_3
20061x = 2^1 x_4 + x_4.
We refer the reader to [34] for a comparison of different algorithms for the
problem of multiplication by an integer constant.
1.4 Division
Division is the next operation to consider after multiplication. Optimizing division is almost as important as optimizing multiplication, since division is usually more expensive, thus the speedup obtained on division will be more effective. (On the other hand, one usually performs more multiplications than divisions.) One strategy is to avoid divisions when possible, or to replace them with multiplications. An example is when the same divisor is used for several consecutive operations; one can then precompute its inverse (see §2.2.1).
We distinguish several kinds of division: full division computes both quo-
tient and remainder, while in some cases only the quotient (for example
when dividing two floating-point mantissas) or remainder (when dividing two
residues modulo n) is needed. Finally we discuss exact division — when the
remainder is known to be zero — and the problem of dividing by a constant.
Proof. First prove that the invariant A < β^{j+1} B holds at step 5. This holds trivially for j = m − 1: B being normalized, A < 2β^m B initially.
First consider the case q_j = q_j^*: then q_j b_{n−1} ≥ a_{n+j} β + a_{n+j−1} − b_{n−1} + 1, thus
A − q_j β^j B ≤ (b_{n−1} − 1)β^{n+j−1} + (A mod β^{n+j−1}),
which ensures that the new a_{n+j} vanishes, and a_{n+j−1} < b_{n−1}, thus A < β^j B after step 8. Now A may become negative after step 8, but since q_j b_{n−1} ≤ a_{n+j} β + a_{n+j−1}:
The most expensive step is step 8, which costs O(n) operations for q_j B — the multiplication by β^j is simply a word-shift — thus the total cost is O(nm).
j | A | q_j | A − q_j B β^j | after correction
2 | 766 970 544 842 443 844 | 889 | 61 437 185 443 844 | no change
1 | 61 437 185 443 844 | 071 | 187 976 620 844 | no change
0 | 187 976 620 844 | 218 | −84 330 190 778 | 334 723
Svoboda's algorithm [50] makes the quotient selection trivial, after preconditioning the divisor. The main idea is that if b_{n−1} equals the base β, then the quotient selection is easy, since it suffices to take q_j^* = a_{n+j}. (In addition, the condition of step 7 is then always fulfilled.)
1 Algorithm SvobodaDivision.
2 Input: A = Σ_{i=0}^{n+m−1} a_i β^i, B = Σ_{j=0}^{n−1} b_j β^j normalized, A < β^m B
3 Output: quotient Q and remainder R of A divided by B.
4 k ← ⌈β^{n+1}/B⌉
5 B′ ← kB = β^{n+1} + Σ_{j=0}^{n−1} b′_j β^j
6 for j from m − 1 downto 1 do
7   q_j ← a_{n+j}
8   A ← A − q_j β^{j−1} B′
9   if A < 0 then
10    q_j ← q_j − 1
11    A ← A + β^{j−1} B′
12 Q′ = Σ_{j=1}^{m−1} q_j β^j, R′ = A
13 (q_0, R) ← (R′ div B, R′ mod B)
14 Return Q = q_0 + kQ′, R.
j | A | q_j | A − q_j B′ β^j | after correction
2 | 766 970 544 842 443 844 | 766 | 441 009 747 163 844 | no change
1 | 441 009 747 163 844 | 441 | −295 115 730 436 705 | 575 568 644
6 (Q_1, R_1) ← RecursiveDivRem(A div β^{2k}, B_1)
7 A′ ← R_1 β^{2k} + (A mod β^{2k}) − Q_1 β^k B_0
8 while A′ < 0 do Q_1 ← Q_1 − 1, A′ ← A′ + β^k B
9 (Q_0, R_0) ← RecursiveDivRem(A′ div β^k, B_1)
10 A″ ← R_0 β^k + (A′ mod β^k) − Q_0 B_0
11 while A″ < 0 do Q_0 ← Q_0 − 1, A″ ← A″ + B
12 Return Q := Q_1 β^k + Q_0, R := A″.
Proof. We first check the assumption for the recursive calls: B_1 is normalized since it has the same most significant word as B.
After step 6, we have A = (Q_1 B_1 + R_1)β^{2k} + (A mod β^{2k}), thus after step 7: A′ = A − Q_1 β^k B, which still holds after step 8. After step 9, we have A′ = (Q_0 B_1 + R_0)β^k + (A′ mod β^k), thus after step 10: A″ = A′ − Q_0 B, which still holds after step 11. At step 12 we thus have A = QB + R.
A div β^{2k} has m + n − 2k words, while B_1 has n − k words, thus 0 ≤ Q_1 < 2β^{m−k} and 0 ≤ R_1 < B_1 < β^{n−k}. Thus at step 7, −2β^{m+k} < A′ < β^k B. Since B is normalized, the while-loop at step 8 is performed at most four times. At step 9 we have 0 ≤ A′ < β^k B, thus A′ div β^k has at most n words. It follows that 0 ≤ Q_0 < 2β^k and 0 ≤ R_0 < B_1 < β^{n−k}. Hence at step 10, −2β^{2k} < A″ < B, and after at most four iterations at step 11, we have 0 ≤ A″ < B.
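The recursion is easy to experiment with in Python. In this sketch (ours, under the stated assumptions), built-in bignum division stands in for the base case BasecaseDivRem, and beta is an arbitrary word base:

def nwords(x, beta):
    # number of base-beta words of x (at least 1)
    n = 0
    while x:
        x //= beta
        n += 1
    return max(n, 1)

def recursive_divrem(A, B, beta=2**64):
    # Returns (Q, R) with A = Q*B + R and 0 <= R < B; B has n words,
    # the quotient has about m words, and we assume n >= m.
    n = nwords(B, beta)
    m = nwords(A, beta) - n
    if m < 2:
        return divmod(A, B)                 # base case: builtin division
    k = m // 2
    B1, B0 = divmod(B, beta**k)
    Q1, R1 = recursive_divrem(A // beta**(2 * k), B1, beta)
    A1 = R1 * beta**(2 * k) + A % beta**(2 * k) - Q1 * beta**k * B0
    while A1 < 0:                           # at most a few corrections
        Q1 -= 1
        A1 += beta**k * B
    Q0, R0 = recursive_divrem(A1 // beta**k, B1, beta)
    A2 = R0 * beta**k + A1 % beta**k - Q0 * B0
    while A2 < 0:
        Q0 -= 1
        A2 += B
    return Q1 * beta**k + Q0, A2

The correction loops mirror steps 8 and 11 above; the exactness of the result does not depend on B being normalized, only the number of corrections does.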
[Figure 1.1 here shows the multiplications performed by the recursion: one block M(n/2), two blocks M(n/4) and four blocks M(n/8) at successive levels, arranged along the quotient Q and the divisor B.]
Figure 1.1: Divide and conquer division: a graphical view (most significant parts at the lower left corner).
Large dividend
The condition n ≥ m in Algorithm RecursiveDivRem means that the dividend A is at most twice as large as the divisor B.
When A is more than twice as large as B (m > n with the above notations), the best strategy (see Ex. 1.9.17) is to get n words of the quotient at a time (this simply reduces to the base-case algorithm, replacing β by β^n). The condition 0 ≤ R < B is ensured thanks to the while-loops at the end of the algorithm.
• or start from the least significant bits first. Indeed, if the quotient is known to be less than β^n, computing a/b mod β^n will reveal it.
A/B mod β^n. Note that the middle product (§3.3) can be used in lines 7 and 9, to speed up the computation of 1 − BC and A − BQ respectively.
Finally, another gain is obtained by using both strategies simultaneously: compute the most significant n/2 bits of the quotient using the first strategy, and the least significant n/2 bits using the second one. Since an exact division of size n is replaced by two exact divisions of size n/2, this gives a speedup of up to two for quadratic algorithms (see Ex. 1.9.19).
nevertheless some high-level languages provide both div and mod, but no
instruction to compute both quotient and remainder.
Once the quotient is known, the remainder can be recovered by a single
multiplication as a − qb; on the other hand, when the remainder is known,
the quotient can be recovered by an exact division as (a − r)/b (§1.4.5).
However, it often happens that only one of the quotient and remainder
is needed. For example, the division of two floating-point numbers reduces
to the quotient of their fractions (see Ch. 3). Conversely, the multiplication
of two numbers modulo n reduces to the remainder of their product after
division by n (see Ch. 2). In such cases, one may wonder if faster algorithms
exist.
For a dividend of 2n words and a divisor of n words, a significant speedup — up to two for quadratic algorithms — can be obtained when only the quotient is needed, since one does not need to update the low n bits of the current remainder (line 8 of Algorithm BasecaseDivRem).
Surprisingly, it seems difficult to get a similar speedup when only the remainder is required. One possibility would be to use Svoboda's algorithm; however, this requires some precomputation, so it is only useful when several divisions are performed with the same divisor. The idea is the following: precompute a multiple B_1 of B, having 3n/2 words, the n/2 most significant words being β^{n/2}. Then reducing A mod B_1 reduces to a single (n/2) × n multiplication. Once A is reduced into A_1 of 3n/2 words by Svoboda's algorithm in 2M(n/2), use RecursiveDivRem on A_1 and B, which costs D(n/2) + M(n/2). The total cost is thus 3M(n/2) + D(n/2) — instead of 2M(n/2) + 2D(n/2) for a full division with RecursiveDivRem — i.e. (5/3)M(n) for Karatsuba, 2.04M(n) for Toom-Cook 3-way, and better for the FFT as soon as D(n) exceeds 3M(n).
Remark: at line 10, since 0 ≤ x < β, b″ can also be obtained as ⌊q_i c/β⌋.
[Figure: classical (MSB) division computes Q, QB and R = A − QB from the most significant bits of A and B; Hensel (LSB) division computes Q′, Q′B and R′ from the least significant bits.]
prime to the word base β, i.e. the least significant bit of B has to be set when β is a power of two.
The LSB-quotient is uniquely defined by Q′ = A/B mod β^n, with 0 ≤ Q′ < β^n. This in turn uniquely defines the LSB-remainder R′ = (A − Q′B)β^{−n}, with −B < R′ < β^n.
Most MSB-division variants (naive, with preconditioning, divide and con-
quer, Newton’s iteration) have their LSB-counterpart. For example the pre-
conditioning consists in using a multiple of the divisor such that kB ≡
1 mod β, and Newton’s iteration is called Hensel lifting in the LSB case. The
exact division algorithm described at the end of §1.4.5 uses both MSB- and
LSB-division simultaneously. One important difference is that LSB-division
does not need any correction step, since the carries go in the direction oppo-
site to the cancelled bits.
1.5 Roots
1.5.1 Square Root
The “paper and pencil” method once taught at school to extract square roots is very similar to “paper and pencil” division. It decomposes an integer m in the form s² + r, taking two digits of m at a time, and finding one digit of s at a time. It is based on the following idea: if m = s² + r is the current decomposition, then taking two more digits of the root-end, we have a decomposition of the form 100m + r′ = 100s² + 100r + r′ with 0 ≤ r′ < 100. Since (10s + t)² = 100s² + 20st + t², a good approximation of the next digit t will be found by dividing 10r by 2s.
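A direct Python transcription of the schoolbook method (a sketch; the digit selection below uses the exact identity (10s + t)² = 100s² + t(20s + t) rather than the approximation 10r/2s):

def sqrt_rem(m):
    # Schoolbook square root: returns (s, r) with s*s + r == m < (s+1)**2,
    # processing the decimal digits of m two at a time.
    ds = str(m)
    if len(ds) % 2:
        ds = "0" + ds
    s = r = 0
    for i in range(0, len(ds), 2):
        rd = 100 * r + int(ds[i:i + 2])          # bring down two digits
        t = 9
        while t * (20 * s + t) > rd:             # largest t with t(20s+t) <= rd
            t -= 1
        s, r = 10 * s + t, rd - t * (20 * s + t)
    return s, r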
Cube Root.
We illustrate with the case k = 3 (the cube root), where BasecaseCbrtRem
is a naive algorithm that should deal with inputs of up to 6 words.
1 Algorithm CbrtRem.
2 Input: 0 ≤ n = n_{d−1} β^{d−1} + · · · + n_1 β + n_0 with 0 ≤ n_i < β
3 Output: (s, r) such that s³ ≤ n = s³ + r < (s + 1)³
4 l ← ⌊(d − 1)/6⌋
5 if l = 0 then return BasecaseCbrtRem(n)
6 write n as n′b³ + a_2 b² + a_1 b + a_0 where b := β^l
7 (s′, r′) ← CbrtRem(n′)
8 (q, u) ← DivRem(br′ + a_2, 3s′²)
9 r ← b²u + ba_1 + a_0 − q²(3s′b + q)
10 s ← bs′ + q
11 while r < 0 do
12   r ← r + 1 − 3s + 3s²
13   s ← s − 1
14 Return (s, r).
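A Python transcription of this recursion (a sketch; the base case uses a floating-point seed corrected exactly, standing in for BasecaseCbrtRem):

def cbrt_rem(n, beta=256):
    # Returns (s, r) with s^3 <= n = s^3 + r < (s+1)^3.
    d, t = 0, n
    while t:                        # d = number of base-beta words of n
        t //= beta
        d += 1
    l = (d - 1) // 6 if d else 0
    if l == 0:
        s = round(n ** (1 / 3)) if n else 0
        while s ** 3 > n:           # correct the floating-point seed
            s -= 1
        while (s + 1) ** 3 <= n:
            s += 1
        return s, n - s ** 3
    b = beta ** l
    n1, rest = divmod(n, b ** 3)    # n = n1*b^3 + a2*b^2 + a1*b + a0
    a2, rest = divmod(rest, b ** 2)
    a1, a0 = divmod(rest, b)
    s1, r1 = cbrt_rem(n1, beta)
    q, u = divmod(b * r1 + a2, 3 * s1 * s1)
    r = b * b * u + b * a1 + a0 - q * q * (3 * s1 * b + q)
    s = b * s1 + q
    while r < 0:                    # the correction loop of steps 11-13
        r += 3 * s * s - 3 * s + 1
        s -= 1
    return s, r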
9 t ← (s′β + q)^k
10 while t > n do
11   q ← q − 1
12   t ← (s′β + q)^k
13 Return (s′β + q, n − t).
However, the above result is not very satisfactory, since we have no bound on the number of iterations of the while-loop in Algorithm RootRem. The following lemma shows how to choose β at step 5 to ensure that the while-loop is performed only once, and thus can be replaced by an if-test, as in Algorithm SqrtRem.
Proof. Let q′ be the final value of q at step 13, and q the value at step 8. By hypothesis we have n = n_2 β^k + n_1 β^{k−1} + n_0 < (s′β + q′ + 1)^k, thus we deduce:
Unknown exponent.
Assume now that one wants to check whether a given integer n is an exact power, without knowing the corresponding exponent. For example, many factorization algorithms fail when given an exact power, therefore this case has to be checked first. The following algorithm detects exact powers, and returns the largest exponent. To detect non-k-th powers early at step 5, one may use modular algorithms when k is prime to the base β (see above).
1 Algorithm IsPower.
2 Input: a positive integer n.
3 Output: k if n is an exact k-th power, false otherwise.
4 for k from ⌊log_2 n⌋ downto 2 do
5   if n is a k-th power, return k
6 Return false.
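Algorithm IsPower can be tested with any integer k-th root subroutine; in the Python sketch below (ours), kth_root by bisection is a simple stand-in for RootRem, and we assume n ≥ 2:

def kth_root(n, k):
    # Integer k-th root by bisection: largest s with s**k <= n.
    lo, hi = 0, 1 << (n.bit_length() // k + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** k <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

def is_power(n):
    # Largest k >= 2 such that n is an exact k-th power, else False.
    for k in range(n.bit_length() - 1, 1, -1):
        s = kth_root(n, k)
        if s ** k == n:
            return k
    return False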
1.6 Gcd
There are many algorithms for computing gcds in the literature. We can distinguish between the following (non-exclusive) types:
• plain versus extended algorithms: the former just compute the gcd of
the inputs, while the latter express the gcd as a linear combination of
the inputs.
Binary Gcd. A better algorithm than Euclid's, still with O(n²) complexity, is the binary algorithm. It differs from Euclid's algorithm in two ways: firstly it considers the least significant bits first, and secondly it avoids expensive divisions, which most of the time give a small quotient.
1 Algorithm BinaryGcd.
2 Input: a, b > 0.
3 Output: gcd(a, b).
4 i ← 0
5 while a mod 2 = b mod 2 = 0 do
6   (i, a, b) ← (i + 1, a/2, b/2)
7 while a mod 2 = 0 do
8   a ← a/2
9 while b mod 2 = 0 do
10   b ← b/2
11 while a ≠ b do
12   (a, b) ← (|a − b|, min(a, b))
13   repeat a ← a/2 until a mod 2 ≠ 0
14 Return 2^i · a.
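A direct transcription in Python (a sketch):

def binary_gcd(a, b):
    assert a > 0 and b > 0
    i = 0
    while a % 2 == 0 and b % 2 == 0:   # strip common factors of two
        a, b, i = a // 2, b // 2, i + 1
    while a % 2 == 0:
        a //= 2
    while b % 2 == 0:
        b //= 2
    while a != b:                      # both odd here, so a - b is even
        a, b = abs(a - b), min(a, b)
        while a % 2 == 0:
            a //= 2
    return 2 ** i * a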
The binary algorithm is based on the fact that if a and b are both odd, then
a − b is even, and we can remove a factor of two since 2 does not divide
gcd(a, b). Sorenson’s k-ary reduction is a generalization of that idea: given
a and b odd, we try to find small integers u, v such that ua − vb is divisible
by a large power of two.
and a, b the current values, the following invariants hold: a = ua_0 + vb_0, and b = wa_0 + xb_0.
An important special case is modular inversion (see Ch. 2): given an integer n, one wants to compute 1/a mod n for a prime to n. One then simply runs Algorithm ExtendedGcd with inputs a and b = n: this yields u and v with ua + vn = 1, thus 1/a = u mod n. Since v is not needed here, we can simply avoid computing v and x, by removing lines 4 and 9.
In practice, it may be interesting to compute only u in the general case too. Indeed, the cofactor v can be recovered afterwards by v = (g − ua)/b; this division is exact (see §1.4.5).
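In code, dropping the cofactors v, x amounts to carrying a single cofactor sequence through Euclid's loop. A Python sketch (the name modinv is ours):

def modinv(a, n):
    # Extended Euclid computing only the cofactor u of a,
    # so that u*a == 1 (mod n) when gcd(a, n) = 1.
    r0, r1 = a, n
    u0, u1 = 1, 0
    while r1 != 0:
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        u0, u1 = u1, u0 - q * u1
    if r0 != 1:
        raise ValueError("a is not invertible modulo n")
    return u0 % n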
All known algorithms for subquadratic gcd rely on an extended gcd sub-
routine, so we refer to §1.6.3 for subquadratic extended gcd.
Table 1.2: Cost of HalfGcd, with — H(n) — and without — H ∗ (n) — the
cofactor matrix, and plain gcd — G(n) —, in terms of the multiplication cost
M (n), for naive multiplication, Karatsuba, Toom-Cook and FFT.
perform all computations modulo some integer n > c². Hence one ends up with pq ≡ m mod n, and the problem is now to find the unknown p and q from the known integer m. To do this, one starts an extended gcd from m and n, and stops as soon as the current a and u are smaller than c: since we have a = um + vn, this gives m ≡ a/u mod n. This is exactly what is called a half-gcd; a subquadratic version is given in §1.6.3.
The binary gcd can also be made fast: see Table 1.3. The idea is to mimic
the left-to-right version, by defining an appropriate right-to-left division (Al-
gorithm BinaryDivide).
1 Algorithm BinaryHalfGcd.
2 Input: P, Q ∈ Z with 0 = ν(P) < ν(Q), and k ∈ N
3 Output: a 2 × 2 integer matrix R, j ∈ N, and P′, Q′ such that
4   [P′, Q′]^t = 2^{−j} R · [P, Q]^t with ν(P′) ≤ k < ν(Q′)
5 m ← ν(Q), d ← ⌊k/2⌋
6 if k < m then return R = Id, j = 0, P′ = P, Q′ = Q
7 decompose P into P_1 2^{2d+1} + P_0, same for Q
8 (R, j_1, P_0′, Q_0′) ← BinaryHalfGcd(P_0, Q_0, d)
9 P′ ← (R_{1,1} P_1 + R_{1,2} Q_1) 2^{2d+1−2j_1} + P_0′
10 Q′ ← (R_{2,1} P_1 + R_{2,2} Q_1) 2^{2d+1−2j_1} + Q_0′
11 m ← ν(Q′), if k < j_1 + m then return R, j_1, P′, Q′
12 q ← BinaryDivide(P′, Q′)
13 P′ ← P′ + q 2^{−m} Q′, d′ ← k − (j_1 + m)
14 (P′, Q′) ← (2^{−m} P′, 2^{−m} Q′)
15 decompose P′ into P_3 2^{2d′+1} + P_2, same for Q′
16 (S, j_2, P_2′, Q_2′) ← BinaryHalfGcd(P_2, Q_2, d′)
17 (P″, Q″) ← ([S_{1,1} P_3 + S_{1,2} Q_3] 2^{2d′+1−2j_2} + P_2′, [S_{2,1} P_3 + S_{2,2} Q_3] 2^{2d′+1−2j_2} + Q_2′)
18 Return S · [0, 2^m; 2^m, q] · R, j_1 + m + j_2, Q″, P″.

20 Algorithm BinaryDivide.
21 Input: P, Q ∈ Z with 0 = ν(P) < ν(Q) = j
22 Output: |q| < 2^j such that ν(Q) < ν(P + q 2^{−j} Q)
23 Q′ ← 2^{−j} Q
24 q ← −P/Q′ mod 2^{j+1}
25 if q < 2^j then return q else return q − 2^{j+1}
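Algorithm BinaryDivide is short enough to transcribe directly. A Python sketch (helper names are ours; pow(·, −1, m) computes the inverse modulo m, and ν denotes the 2-adic valuation):

def nu(x):
    # 2-adic valuation: index of the least significant set bit
    return (x & -x).bit_length() - 1

def binary_divide(P, Q):
    # Return q with |q| < 2^j, j = nu(Q), such that nu(P + q*2^(-j)*Q) > j.
    # Assumes nu(P) = 0 < nu(Q).
    j = nu(Q)
    Qp = Q >> j                        # Q' = 2^(-j) Q, odd
    m = 1 << (j + 1)
    q = (-P * pow(Qp, -1, m)) % m      # q = -P/Q' mod 2^(j+1)
    return q if q < (1 << j) else q - m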
1.7 Conversion
Since computers usually work with binary numbers, and humans prefer decimal representations, input/output base conversions are needed. In a typical computation, there will be only a few conversions, compared to the total number of operations, thus optimizing conversions is less important. However, when working with huge numbers, naïve conversion algorithms — like those found in several software packages — may slow down the whole computation.
In this section we consider that numbers are represented internally in base
β — think of 2 or a power of 2 — and externally in base B — for example 10
or a power of 10. When both bases are commensurable, i.e. both are powers
of a common integer, like 8 and 16, conversions of n-digit numbers can be
performed in O(n) operations. We therefore assume that β and B are not
commensurable from now on.
One may think that since input and output are symmetric by exchanging
bases β and B, only one algorithm is needed. Unfortunately, this is not true,
since computations are done in base β only.
1 Algorithm IntegerOutput.
2 Input: A = Σ_{i=0}^{n−1} a_i β^i
3 Output: a string S of characters, representing A in base B
4 m ← 0
5 while A ≠ 0 do
6   s_m ← char(A mod B)
7   A ← A div B
8   m ← m + 1
9 Return S = s_{m−1} . . . s_1 s_0.
where Input(S, B) is the value obtained when reading the string S in the external base B. The following algorithm shows a possible way to implement that:
1 Algorithm IntegerInput.
2 Input: a string S = s_{m−1} . . . s_1 s_0 of digits in base B
3 Output: the value A of the integer represented by S
4 l ← [val(s_0), val(s_1), . . . , val(s_{m−1})]
5 (b, k) ← (B, m)
6 while k > 1 do
7   if k even then l ← [l_1 + bl_2, l_3 + bl_4, . . . , l_{k−1} + bl_k]
8   else l ← [l_1 + bl_2, l_3 + bl_4, . . . , l_k]
9   (b, k) ← (b², ⌈k/2⌉)
10 Return l_1.
If the output A has n words, algorithm IntegerInput has complexity O(M(n) log n), more precisely ∼ (1/2)M(n/2) log_2 n for n a power of two (see Ex. 1.9.20).
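In Python, where the combining multiplications below are bignum products, the algorithm reads (a sketch):

def integer_input(S, B=10):
    # l[0] is the least significant digit val(s_0)
    l = [int(ch) for ch in reversed(S)]
    b = B
    while len(l) > 1:
        # combine digits pairwise; the base squares at each pass
        nxt = [l[i] + b * l[i + 1] for i in range(0, len(l) - 1, 2)]
        if len(l) % 2 == 1:
            nxt.append(l[-1])
        l, b = nxt, b * b
    return l[0]

For example, integer_input("12345") combines [5, 4, 3, 2, 1] into [45, 23, 1], then [2345, 1], then 12345.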
For integer output, a similar algorithm can be designed, replacing multiplications by divisions. Namely, if A = A_lo + B^k A_hi, then
Output(A, B) = Output(A_hi, B) || Output(A_lo, B),
where Output(A, B) is the string resulting from printing the integer A in the external base B, S_1 || S_0 denotes the concatenation of S_1 and S_0, and it is assumed that Output(A_lo, B) has k digits, after possibly adding leading zeros.
If the input A has n words, algorithm IntegerOutput has complexity O(M(n) log n), more precisely ∼ (1/2)D(n/2) log_2 n for n a power of two, where D(n/2) is the cost of dividing an n-word integer by an n/2-word integer. Depending on the cost ratio between multiplication and division, integer output may thus be 2 to 5 times slower than integer input; see however Ex. 1.9.21.
1 Algorithm IntegerOutput.
2 Input: A = Σ_{i=0}^{n−1} a_i β^i
3 Output: a string S of characters, representing A in base B
4 if A < B then char(A)
5 else
6   find k such that B^{2k−2} ≤ A < B^{2k}
7   (Q, R) ← DivRem(A, B^k)
8   IntegerOutput(Q) || IntegerOutput(R)
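A Python sketch of this recursive output routine (ours), padding the low part R with leading zeros to k digits as required:

def integer_output(A, B=10):
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    if A < B:
        return digits[A]
    k = 1
    while B ** (2 * k) <= A:    # find k with B^(2k-2) <= A < B^(2k)
        k += 1
    Q, R = divmod(A, B ** k)
    return integer_output(Q, B) + integer_output(R, B).rjust(k, "0")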
1.9 Exercises
Exercise 1.9.1 [Hanrot] Prove that the number K(n) of word products in Karatsuba's algorithm as defined in Th. 1.3.2 is non-decreasing for n_0 = 2 (caution: this is no longer true with a larger threshold, for example with n_0 = 8 we have K(7) = 49 whereas K(8) = 48). Plot the graph of K(n)/n^{log_2 3} with a logarithmic scale for n, for 2^7 ≤ n ≤ 2^{10}, and find experimentally where the maximum appears.
Exercise 1.9.2 [Ryde] Assume the basecase multiply costs M(n) = an² + bn, and that Karatsuba's algorithm costs K(n) = 3K(n/2) + cn. Show that dividing a by two increases the Karatsuba threshold n_0 by a factor of two, and that on the contrary decreasing b and c decreases n_0.
Exercise 1.9.3 [Maeder [35]] Show that an auxiliary memory of 2n + 2⌊log_2 n⌋ − 2 words is enough to implement Karatsuba's algorithm in-place.
Exercise 1.9.8 Prove that if 5 integer evaluation points are used for Toom-Cook
3-way, the division by 3 cannot be avoided. Does this remain true if only 4 integer
points are used together with ∞?
Exercise 1.9.9 For multiplication of two numbers of size kn and n, with k > 1
integer, show that the trivial strategy which performs k multiplications n × n is
not always the best possible.
Exercise 1.9.11 [Karatsuba, Zuras [57]] Assuming multiplication has superlinear cost, show that the speedup of squaring with respect to multiplication cannot exceed 2.
Now suppose we go from a multiplication algorithm of cost cn^α to Toom-Cook r-way; derive an expression for the threshold n_0, assuming the Toom-Cook cost has a second-order term kn. See how this threshold evolves when c is replaced by another constant; in particular show that the threshold increases for squaring (c′ < c). Assuming Toom-Cook r-way has cost ln^β for multiplication, and l′n^β for squaring, obtain a closed-form expression for the ratio l′/l, in terms of c, c′, α, β.
Exercise 1.9.12 [Thomé, Quercia] Multiplication and the middle product are just special cases of linear-forms programs: consider two sets of inputs a_1, . . . , a_n and b_1, . . . , b_m, and a set of outputs c_1, . . . , c_k that are sums of products a_i b_j. For such a given problem, what is the least number of multiplies required? As an example, can we compute x = au + cw, y = av + bw, z = bu + cv in fewer than 6 multiplies? Same question for x = au − cw, y = av − bw, z = bu − cv.
Exercise 1.9.21 Show that asymptotically, the output routine can be made as
fast as the input routine IntegerInput. [Hint: use Bernstein’s scaled remain-
der tree and the middle product.] Experiment with it on your favorite multiple-
precision software.
Exercise 1.9.22 If the internal base β and the external one B share a common divisor — as in the case β = 2^l and B = 10 — show how one can exploit this to speed up the subquadratic input and output routines.
Exercise 1.9.23 Assume you are given two n-digit integers in base 10, but you
have fast arithmetic in base 2 only. Can you multiply them in O(M (n))?
Chapter 2
2.1 Representation
2.1.1 Classical Representations
Non-negative, symmetric
2.2 Multiplication
2.2.1 Barrett’s Algorithm
Barrett’s algorithm [3] is interesting when many divisions have to be made
with the same divisor; this is in particular the case when one performs compu-
tations modulo a fixed integer. The idea is to precompute an approximation
of the divisor inverse. In such a way, an approximation of the quotient is
obtained with just one multiplication, and the corresponding remainder af-
ter a second one. A small number of corrections suffice to convert those
approximations into exact values.
1 Algorithm BarrettDivRem.
2 Input: integers A, B with 0 ≤ A < 2^n B, 2^{n−1} < B < 2^n.
3 Output: quotient Q and remainder R of A divided by B.
4 I ← ⌊2^{2n}/B⌋ [precomputation]
5 Q′ ← ⌊A_1 I/2^n⌋ where A = A_1 2^n + A_0 with 0 ≤ A_0 < 2^n
6 R′ ← A − Q′B
7 while R′ ≥ B do
8   (Q′, R′) ← (Q′ + 1, R′ − B)
9 Return (Q′, R′).
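In Python (a sketch; when the preconditions on A and B hold, the correction loop runs only a few times):

def barrett_divrem(A, B, n, I):
    # I = floor(2**(2*n) / B), precomputed once per divisor B.
    # Requires 0 <= A < 2**n * B and 2**(n-1) < B < 2**n.
    A1 = A >> n                     # A = A1*2^n + A0
    Q = (A1 * I) >> n               # approximate quotient
    R = A - Q * B
    while R >= B:                   # a small number of corrections
        Q, R = Q + 1, R - B
    return Q, R

The precomputation is simply n = B.bit_length() and I = (1 << (2 * n)) // B.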
2.3 Division/Inversion
Link to extended GCD (Ch. 1) or Fermat (cf. MCA).
Describe here Hensel lifting for inversion mod p^k (link with division by a constant in §1.4.7). Cite the paper of Shanks-Vuillemin for division mod β^n.
This algorithm uses only one modular inversion, and 3(k − 1) modular multiplications. It is thus faster when an inversion is 3 times (or more) as expensive as a product. Fig. 2.1 shows a recursive variant of that algorithm, with the same number of modular multiplications: one for each internal node when going up the (product) tree, and two for each internal node when going down the (remainder) tree.
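The linear (non-recursive) version of the trick is easy to state in code. The Python sketch below (batch_invert is our name; pow(x, −1, n) computes a modular inverse) uses one inversion and about 3(k − 1) modular multiplications:

def batch_invert(xs, n):
    # Prefix products: prefix[i] = x_1 * ... * x_i mod n
    k = len(xs)
    prefix = [1] * (k + 1)
    for i, x in enumerate(xs):
        prefix[i + 1] = prefix[i] * x % n
    inv = pow(prefix[k], -1, n)       # the single modular inversion
    out = [0] * k
    for i in reversed(range(k)):
        out[i] = prefix[i] * inv % n  # 1/x_i = (x_1...x_{i-1}) * 1/(x_1...x_i)
        inv = inv * xs[i] % n         # now inv = 1/(x_1...x_{i-1})
    return out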
A dual case is when the number to invert is invariant, and we want to
compute 1/x mod n1 , . . . , 1/x mod nk . A similar algorithm works as follows:
first compute N = n1 . . . nk using a product tree like in Fig. 2.1. Then
compute 1/x mod N , and go down the tree, while reducing the residue at
[Figure 2.1: the remainder tree with root 1/(x_1x_2x_3x_4), internal nodes 1/(x_1x_2) and 1/(x_3x_4), and leaves 1/x_1, 1/x_2, 1/x_3, 1/x_4: the products are formed going up, the inverses recovered going down.]
each node. The main difference is that here, the residues grow while going
up the tree, thus even if it performs only one modular inversion, this method
might be slower for large k.
2.4 Exponentiation
Link to HAC, Ch. 14.
2.5 Conversion
integer from/to modular (CRT, FFT), 3-primes variant of FFT.
2.8 Exercises
Exercise 2.8.1 Assume you have an FFT algorithm computing products modulo 2^n + 1. Prove that, with some preconditioning, you can perform a division of a 2n-bit integer by an n-bit integer as fast as 1.5 multiplications of n bits by n bits.
Chapter 3
Floating-Point Arithmetic
3.1 Introduction
3.1.1 Representation
mantissa, exponent, sign, position of the point
IEEE 754/854: special values (infinities, NaN), signed zero, rounding
modes (±∞, to zero, to nearest, away).
Binary vs decimal representation.
Implicit vs explicit leading bit.
Links to other possible representations. In her PhD [36], Valérie Ménissier-Morain discusses three different representations for real numbers (Ch. V): continued fractions, redundant representations, and the classical non-redundant representation. She also considers the theory of computable reals, their representation by B-adic numbers, and the computation of algebraic or transcendental functions (Ch. III).
numbers for FFT multiplication (cf Knuth vol 2, and error analysis in Colin
Percival’s paper [41]).
3.1.5 Rounding
Assume we want to correctly round to n bits a real number whose binary expansion is 0.1b_1 . . . b_n b_{n+1} . . . It is enough to know the value of r = b_{n+1} — called the round bit — and that of the sticky bit s, which is 0 when b_{n+2}b_{n+3} . . . is identically zero, and 1 otherwise. The following table shows how to correctly round given r, s, and the rounding mode; rounding to ±∞ is converted to rounding to zero or away, according to the sign of the number.
r | s | zero | nearest | away
0 | 0 | 0 | 0 | 0
0 | 1 | 0 | 0 | 1
1 | 0 | 0 | 0 or 1 | 1
1 | 1 | 0 | 1 | 1
This problem does not happen with all rounding modes (Ex. 3.5.1).
3.1.6 Strategies
To determine the correct rounding of f(x) with n bits of precision, the best strategy is usually to first compute an approximation y of f(x) with a working precision of m = n + k bits, with k relatively small. Several strategies are possible when this first approximation y is not accurate enough, or too close to a rounding boundary.
3.2 Addition/Subtraction/Comparison
Leading Zero Anticipation and Detection.
Sterbenz Theorem.
Unlike their integer counterparts, floating-point addition and subtraction are more difficult to implement, for two reasons:
• the carries being propagated from right to left, one may have to look at arbitrarily low-order bits to guarantee correct rounding.
We distinguish here “addition”, where both operands have the same sign — zero operands are treated separately — and “subtraction”, where the operands have different signs.
12 if a ≥ 2^n then
13   a ← round2(◦, a mod 2, t)
14   e ← e + 1
15 if a = 2^n then (a, e) ← (a/2, e + 1)
16 Return (a, e).
The values of round(◦, r, s) and round2(◦, a mod 2, t) are given in Table 3.1. At step 10, the notation (c, r, s) ← b_l + c_l means that c is the carry bit of b_l + c_l, r the round bit, and s the sticky bit. For rounding to nearest, t is a ternary value, which is respectively positive, zero, or negative when a is larger than, equal to, or smaller than the exact sum b + c.
◦ | r | s | round(◦, r, s) | t
zero | any | any | 0 |
away | r | s | 0 if r = s = 0, 1 otherwise |
nearest | 0 | any | 0 | −s
nearest | 1 | 0 | 0/1 (even rounding) | −1/1
nearest | 1 | 1 | 1 | 1

◦ | a mod 2 | t | round2(◦, a mod 2, t)
any | 0 | any | a/2
zero | 1 | | (a − 1)/2
away | 1 | | (a + 1)/2
nearest | 1 | 0 | (a − 1)/2 if even, (a + 1)/2 otherwise
nearest | 1 | ±1 | (a − t)/2
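For concreteness, the first table can be transcribed as follows (a Python sketch; the mode names are ours, and ties round to even):

def round_increment(mode, a, r, s):
    # Amount (0 or 1) to add to the truncated significand a,
    # following the round(mode, r, s) table above.
    if mode == "zero":
        return 0
    if mode == "away":
        return 0 if (r == 0 and s == 0) else 1
    # mode == "nearest"
    if r == 0:
        return 0
    if s == 1:
        return 1
    return a & 1   # halfway case: round so that the last bit becomes 0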
We may notice that the exponent e_a of the result lies between e_b, the exponent of b, and e_b + 2. Thus no underflow can happen in an addition. The case e_a = e_b + 2 can happen only when the destination precision is less than that of the operands.
3.3.1 Multiplication
The exact product of two floating-point numbers m · 2^e and m′ · 2^{e′} is (mm′) · 2^{e+e′}. Therefore, if no underflow or overflow occurs, the problem reduces to the multiplication of the significands m and m′.
1 Algorithm FPmultiply.
2 Input: x = m · β^e, x′ = m′ · β^{e′}, a precision n, a rounding mode ◦
3 Output: ◦(xx′) rounded to precision n
4 e″ ← e + e′
5 m″ ← ◦(mm′) to precision n
6 Return m″ · β^{e″}.
The product at step 5 is a short product, i.e. a product of which only the most significant part is wanted. In the quadratic range, it can be computed in about half the time of a full product. In the Karatsuba and Toom-Cook ranges, Mulders' algorithm can gain 10% to 20%; however, due to carries, using this algorithm for floating-point computations seems tricky. Lastly, in the FFT range, no algorithm is known that is better than computing the full product mm′.
C′ ≤ C ≤ C′ + (n − 1).
1. i, j ≥ l, then a_i b_j is computed in C_1;
Table 3.1: Maximal number of bits per floating-point number, and maximal m for a plain m × m bit integer product, for a given FFT size 2^n, with signed coefficients, and 53-bit floating-point mantissae.
Theorem 3.3.2 [41] The FFT allows computation of the cyclic convolution z = x ∗ y of two vectors of length N = 2^n of complex values such that
|z′ − z|_∞ < |x| · |y| · ((1 + ε)^{3n} (1 + ε√5)^{3n+1} (1 + β)^{3n} − 1),   (3.1)
3.3.2 Reciprocal
The following algorithm computes a floating-point inverse in 2M(n), when considering multiplication as a black box.
1 Algorithm Invert.
2 Input: 1 ≤ A ≤ 2, a p-bit floating-point number x with 0 ≤ 1/A − x ≤ 2^{1−p}
3 Output: a p′-bit floating-point number x′ with p′ = 2p − 5, 0 ≤ 1/A − x′ ≤ 2^{1−p′}
4 v ← ◦(A) [2p bits, rounded up]
5 w ← vx [3p bits, exact]
6 if w ≥ 1 then return x
7 y ← ◦(1 − w) [p bits, towards zero]
8 z ← ◦(xy) [p bits, towards zero]
9 x′ ← ◦(x + z) [p′ bits, towards zero]
Proof. First assume that we have no roundoff error. Newton's iteration for the inverse of A is x_{k+1} = x_k + x_k(1 − Ax_k). If we denote ε_k := x_k − 1/A, we have ε_{k+1} = −Aε_k². This shows that if x_0 ≤ 1/A, then all x_j are less than or equal to 1/A.
Now consider rounding errors. The hypothesis 0 ≤ 1/A − x ≤ 2^{1−p} can be written 0 ≤ 1 − Ax ≤ 2^{2−p} since A ≤ 2.
If w ≥ 1 at step 6, we have Ax ≤ 1 ≤ vx; since v − A ≤ 2^{1−2p}, this shows that 1 − Ax ≤ vx − Ax ≤ 2^{1−2p}, thus 1/A − x ≤ 2^{1−2p} ≤ 2^{1−p′}. Otherwise, we can write v = A + ε_{1−2p}, where ε_i denotes a positive quantity less than 2^i. Similarly, y = 1 − w − ε_{2−2p}, z = xy − ε′_{2−2p}, and x′ = x + z − ε_{5−2p}. This gives x′ = x + x(1 − Ax) − ε_{1−2p}x² − ε_{2−2p}x − ε′_{2−2p} − ε_{5−2p}. Since the difference between 1/A and x + x(1 − Ax) is bounded by A(x − 1/A)² ≤ 2^{3−2p}, we get
|x′ − 1/A| ≤ 2^{3−2p} + 2^{1−2p} + 2 · 2^{2−2p} + 2^{5−2p} = 50 · 2^{−2p} ≤ 2^{6−2p} = 2^{1−p′}.
If we keep the FFT transform of x from step 5 to step 8, we can save (1/3)M(n) — assuming the term-to-term products have negligible cost — which gives (5/3)M(n), as noticed by Bernstein, who also proposes a “messy” algorithm in (3/2)M(n).
Remark: Schönhage's algorithm in 1.5M(n) is better [45].
3.3.3 Division
Theorem 3.3.4 Assume we divide an m-bit floating-point number by an n-bit floating-point number, with m < 2n. Then the (infinite) binary expansion of the quotient, if not exact, can have at most n consecutive zeros after its first n bits.
Proof. Without loss of generality, we can assume that the n-th significand bit of the quotient q is 1, and similarly for the divisor d. If the quotient has more than n consecutive zeros, we can write it q = q_1 + 2^{−n}q_0 with q_1 an n-bit integer and either q_0 = 0 if the division is exact, or an infinite expansion 0 < q_0 < 1. Thus qd = q_1d + 2^{−n}q_0d, where q_1d is an integer of 2n − 1 or 2n bits, and 0 < 2^{−n}q_0d < 1. This implies that qd has at least 2n bits.
The best known constant is (5/2)M(n). (Bernstein gets 7/3 or even 13/6, but with special assumptions.)
The total cost is therefore 5M(n) for precision 2n, or (5/2)M(n) for precision n. As for the reciprocal, if we cache FFT transforms, we get (5/3)M(n) for step 1, and a further gain of (1/3)M(n) by saving the transform of g_0 between steps 2 and 4, which gives (25/12)M(n) = 2.0833...M(n).
Q ≤ Q_1 ≤ Q + 2.
Let A_1 = Q_1B_1 + R_1. We have A = A_1β + A_0, B = B_1β + B_0, thus
A/B = (A_1β + A_0)/(B_1β + B_0) ≤ (A_1β + A_0)/(B_1β) = Q_1 + (R_1β + A_0)/(B_1β).

Q ≤ Q′ ≤ Q + 2 log_2 n.
[Figure 3.2 here contrasts the two recursion trees: on the left, plain multiplications M(n/2), M(n/4), M(n/8), . . .; on the right, the top multiplication at each level is replaced by a short product M*(n/2), M*(n/4), M*(n/8), . . . .]
Figure 3.2: Divide and conquer short division: a graphical view. Left: with plain multiplication; right: with short multiplication. See also Fig. 1.1.
Barrett's division
Assume we want to divide a by b of n bits, assuming the quotient has exactly n bits. Barrett's algorithm is as follows:
Proof. We can assume without loss of generality that a is an integer < 2^{2n}, and that b is an integer with 2^{n−1} ≤ b < 2^n. We have i = 1/b + ε with |ε| ≤ (1/2)ulp(1/b) ≤ 2^{−2n}. And q = ai + ε′ with |ε′| ≤ (1/2)ulp(q) ≤ 1/2 since q < 2^n. Thus q = a(1/b + ε) + ε′ = a/b + aε + ε′, and |bq − a| = |b||aε + ε′| ≤ (3/2)|b|.
At step 2, hg_0³ has 5n bits, and we want only bits n to 2n — the low n bits are known to match those of g_0 — thus we can compute hg_0³ mod x^{4n} − 1, which costs 2M(n).
The total cost for SquareRoot(2n) is thus 4.5M(n), and thus 2.25M(n) for SquareRoot(n).
3.4 Conversion
Cf Chapter 1.
• free-format output, where we want the output value, when read with correct rounding according to the given rounding mode, to give back the initial number. Here the number of printed digits may depend on the input number. This is useful when storing data in a file, while guaranteeing that reading it back will produce exactly the same internal numbers, or for exchanging data between different programs.
E = 1 + ⌊(e − 1) (log b / log B)⌋.   (3.2)
Proof. First assume that the algorithm terminates. Eq. (3.2) implies B^{E−1} ≤ b^{e−1}, thus |x|B^{P−E} ≥ B^{P−1}, which implies that |F| ≥ B^{P−1} at step 10. Thus B^{P−1} ≤ |F| < B^P is fulfilled. Now, printing x gives F · B^a iff printing xB^k gives F · B^{a+k} for any integer k. Thus it suffices to check that printing xB^{P−E} would give F, which is clear by construction.
The algorithm terminates because xB^{P−E}, if not an integer, cannot be arbitrarily near an integer. If P − E ≥ 0, let k be the number of bits of B^{P−E}; then xB^{P−E} can be represented exactly on p + k bits. If P − E < 0, let g = B^{E−P}, of k bits. Assume f/g = n + ε with n an integer; then f − gn = gε. If ε is not zero, gε is a non-zero integer, thus |ε| ≥ 1/g ≥ 2^{−k}.
The case |F| ≥ B^P at step 11 can appear for two reasons: either xB^{P−E} ≥ B^P, and thus its rounding too; or xB^{P−E} < B^P, but its rounding equals B^P (this can only happen for rounding away from zero or to nearest). In the former case we still have xB^{P−E} ≥ B^{P−1} at the next pass through step 8, while in the latter case the rounded value F will equal B^{P−1} and the algorithm will terminate.
(resp. (1/2)ulp(x)) knowing that |x − X| < ulp(X) (resp. |x − X| ≤ (1/2)ulp(X)). It is easy to see that a sufficient condition is that ulp(X) ≤ ulp(x), or equivalently B^{E−P} ≤ b^{e−p}. In summary, we have
P ≥ 1 + p (log b / log B).
3.5 Exercises
Exercise 3.5.1 (Kidder, Boldo) The “rounding to odd” mode is defined as follows: in case the exact value is not representable, it rounds to the unique adjacent number with an odd mantissa (assuming a binary representation). Prove that if y = round(x, p + k, odd) and z = round(y, p, nearest even), and k > 1, then z = round(x, p, nearest even), i.e. the double-rounding problem does not happen.
Exercise 3.5.3 (Percival) One computes the product of two complex floating
point numbers z0 = a0 + ib0 and z1 = a1 + ib1 in the following way: xa = ◦(a0 a1 ),
xb = ◦(b0 b1 ), ya = ◦(a0 b1 ), yb = ◦(a1 b0 ), z = ◦(xa − xb ) + ◦(ya + yb ) · i. All
computations being done in precision n, with rounding to nearest, compute an
error bound of the form |z − z0 z1 | ≤ c2−n |z0 z1 |. What is the best c?
Exercise 3.5.4 (Enge) Design an algorithm that correctly rounds the product
of two complex floating-point numbers with 3 multiplications only. [Hint: assume
all operands and the result have n-bit significand.]
Exercise 3.5.5 Prove that for any n-bit floating-point numbers (x, y) ≠ (0, 0), and if all computations are correctly rounded, with the same rounding mode, the result of x/√(x² + y²) lies in [−1, 1], except in some special case.
Exercise 3.5.7 (Lefèvre) The IEEE 754 standard requires binary to decimal conversions to be correctly rounded in the range m · 10^n for |m| ≤ 10^{17} − 1 and |n| ≤ 27 in double precision. Find the hardest-to-print double-precision number in that range — for rounding to nearest, for example — write a C program that outputs double-precision numbers in that range, and compare it to the sprintf function of your system.
Exercise 3.5.8 Same question as the above, for the decimal to binary conversion,
and the atof function.
Chapter 4
4.1 Introduction
This chapter is concerned with algorithms for computing elementary and spe-
cial functions, although the methods apply more generally. First we consider
Newton’s method, which is useful for computing inverse functions. For exam-
ple, if we have an algorithm for computing y = log x, then Newton’s method
can be used to compute x = exp y (see §4.2.6). However, Newton’s method
has many other applications. We already mentioned in Chapter 1 that New-
ton’s method is useful for computing reciprocals, and hence for division. We
consider this in more detail in §4.2.3.
After considering Newton’s method, we go on to consider various meth-
ods for computing elementary and special functions. These methods in-
clude power series (§4.4), asymptotic expansions (§4.5), continued fractions
(§4.6), recurrence relations (§4.7), the arithmetic-geometric mean (§4.8), bi-
nary splitting (§4.9), and contour integration (§4.11).
f(x_0) = f(ζ) + (x_0 − ζ)f′(ζ) + ((x_0 − ζ)²/2) f″(ξ)
for some point ξ in an interval including {x_0, ζ}. Since f(ζ) = 0, we see that x_1 := x_0 − f(x_0)/f′(x_0) is an approximation to ζ, and
x_1 − ζ = O(|x_0 − ζ|²).
x_{n+1} = x_n − f_n/f′_n, n = 0, 1, . . . ,
f(x) = y − x^{−m},
x_{j+1} = x_j + x_j(1 − x_j^m y)/m.   (4.1)
x_{j+1} = x_j + x_j(1 − x_j y)   (4.2)
u_j = 1 − x_j y,
which simplifies to
u_{j+1} = u_j².   (4.3)
Thus
u_j = (u_0)^{2^j}.   (4.4)
We see that the iteration converges if and only if |u0 | < 1, which (for real x0
and y) is equivalent to the condition x0 y ∈ (0, 2). Second-order convergence
is reflected in the double exponential on the right-hand-side of (4.4).
The iteration (4.2) is sometimes implemented in hardware to compute
reciprocals of floating-point numbers, see for example [29]. The sign and
exponent of the floating-point number are easily handled, so we can assume
that y ∈ [0.5, 1.0). The initial approximation x0 is found by table lookup,
where the table is indexed by the first few bits of y. Since the order of
convergence is two, the number of correct bits approximately doubles at
each iteration. Thus, we can predict in advance how many iterations are
required. Of course, this assumes that the table is initialised correctly. In
the case of the infamous Pentium bug [24], this was not the case, and the
reciprocal was occasionally inaccurate!
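For illustration, iteration (4.2) in plain Python floats (a sketch; the crude seed x_0 = 1 gives u_0 = 1 − y ∈ (0, 1/2], so convergence is guaranteed, although a table-lookup seed as described above would need fewer iterations):

def reciprocal(y, iters=6):
    # Approximate 1/y for y in [0.5, 1.0) by x <- x + x*(1 - x*y);
    # the error u = 1 - x*y squares at every step.
    assert 0.5 <= y < 1.0
    x = 1.0
    for _ in range(iters):
        x += x * (1.0 - x * y)
    return x

After six iterations the error u is at most (1/2)^{64}, well beyond double precision.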
y^{1/2} = y × y^{−1/2}.
This method does not involve any divisions (except by 2). In contrast, if we apply Newton's method to the function f(x) = x² − y, we obtain Heron's¹ iteration
x_{j+1} = (1/2)(x_j + y/x_j).   (4.6)
¹ Heron of Alexandria, circa 10–75 AD.
x_j = (1 − u_j)/(1 − z) = 1/(1 − z) + O(z^{2^j}).
Another example: if we replace y in (4.5) by 1 − 4z, and take initial approximation x_0 = 1, we obtain a quadratically-convergent iteration for the power series
(1 − 4z)^{−1/2} = Σ_{n=0}^{∞} (2n choose n) z^n.
Some operations on power series have no analogue for integers. For example, given a power series A(z) = Σ_{j≥0} a_j z^j, we can define the formal derivative
A′(z) = Σ_{j>0} j a_j z^{j−1} = a_1 + 2a_2 z + 3a_3 z² + · · · ,
Σ_{j=0}^{n} a_j β^j.
For more on Newton’s method for power series, we refer to [9, 14, 16, 32].
1 Algorithm Improve-Exp.
2 Input: h, n, f_0 an n-bit approximation to exp(h).
3 Output: f := a 2n-bit approximation to exp(h).
4 g ← log f_0 [computed to 2n-bit accuracy]
5 e ← h − g
6 f ← f_0 + f_0 e
7 Return f.
Since the computation of g = log f has the same complexity as division, via the formula g′ = f′/f, Step 2 costs 5M(n), and Step 4 costs M(n), which gives a total cost of 6M(n).
However, some computations in f′/f can be cached: 1/f was already computed to n/2 bits at the previous iteration, so we only need to update it to n bits; q = f′/f was already computed to n bits at the previous iteration, so we don't need to compute it again. In summary, the computation of log f_0 at step 2 reduces to 3M(n): the update of g_0 = 1/f to n bits with cost M(n), and the update q ← q + g_0(f′ − fq) for f′/f, with cost 2M(n). The total cost thus reduces to 4M(n).
f(x) = Σ_{j=0}^{k−1} (x − c)^j f^{(j)}(c)/j! + R_k(x, c).
S_j = Σ_{i=0}^{j} T_i
We now consider the effect of rounding errors, under the assumption that floating-point operations satisfy
fl(x op y) = (x op y)(1 + δ),
where |δ| ≤ ε and “op” = “+”, “−”, “×” or “/”. Here ε ≤ β^{1−t} is the “machine precision”. Let T̂_j be the computed value of T_j, etc. Thus
and
|Ŝ_k − S_k| ≤ keε + Σ_{j=1}^{k} 2jε|T_j| + O(ε²)
If |x| is large and x is negative, the situation is even worse. From Stirling's approximation we have
max_{j≥0} |T_j| ≃ exp|x| / √(2π|x|),
but the result is exp(−|x|), so about 2|x|/ln β guard digits are required to compensate for Lehmer's “catastrophic cancellation” [21]. Since exp(x) = 1/exp(−x), this problem may easily be avoided, but the corresponding problem is not always so easily avoided for other analytic functions.
To conclude this section we give a less trivial example where power series expansions are useful. To compute the error function
erf(x) = 2π^{−1/2} ∫_0^x e^{−u²} du,
or
erf(x) = 2π^{−1/2} exp(−x²) Σ_{j=0}^{∞} 2^j x^{2j+1} / (1 · 3 · 5 · · · (2j + 1)).   (4.9)
The series (4.9) is preferable to (4.8) for moderate |x| because it involves
no cancellation. For large |x| neither series is satisfactory, because Ω(x2 )
terms are required, and it is preferable to use the asymptotic expansion or
continued fraction for erfc(x) = 1 − erf(x): see §§4.5–4.6.
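A Python sketch of evaluating (4.9) in double precision (the stopping rule is our crude choice, adequate for moderate |x| only, per the discussion above):

import math

def erf_series(x):
    # erf(x) = 2/sqrt(pi) * exp(-x^2) * sum_j 2^j x^(2j+1) / (1*3*...*(2j+1))
    s, t, j = 0.0, x, 0
    while abs(t) > 1e-18 * (abs(s) + 1):
        s += t
        j += 1
        t *= 2 * x * x / (2 * j + 1)   # term ratio: 2x^2/(2j+1)
    return 2 / math.sqrt(math.pi) * math.exp(-x * x) * s

Note that all terms of the sum have the sign of x, so there is indeed no cancellation; compare against math.erf.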
We now use the fact that f is holonomic. Assume f satisfies the following linear differential equation with polynomial coefficients:
c_m(t)f^{(m)}(t) + · · · + c_1(t)f′(t) + c_0(t)f(t) = 0.
Substituting x_i + t for x, we obtain a differential equation for f_i:
c_m(x_i + t)f_i^{(m)}(t) + · · · + c_1(x_i + t)f_i′(t) + c_0(x_i + t)f_i(t) = 0.
From this latter equation we deduce a linear recurrence for the Taylor coefficients of f_i(t), of the same order as that for f(t). The coefficients in the recurrence for f_i(t) have O(2^i) bits, since x_i = r_1 + · · · + r_i has O(2^i) bits. It follows that the k-th Taylor coefficient of f_i(t) has size O(k(2^i + log k)) [the k log k term comes from the polynomials in k in the recurrence]. Since k goes to n/2^i at most, this is O(n log n).
However we don't want to evaluate the k-th Taylor coefficient u_k of f_i(t), but the series
s_k = Σ_{j=1}^{k} u_j r_{i+1}^j.
Noticing that u_j = (s_j − s_{j−1})/r_{i+1}^j, and substituting that value in the recurrence for (u_j), say of order l, we obtain a recurrence of order l + 1 for (s_k). Putting this latter recurrence in matrix form S_k = M(k)S_{k−1}, where S_k is the vector (s_k, s_{k−1}, . . . , s_{k−l}), we obtain
S_k = M(k)M(k − 1) · · · M(l)S_{l−1},
where the matrix product M(k)M(k − 1) · · · M(l) can be evaluated in O(M(n) log² n) using binary splitting.
We illustrate the above theorem with the arc-tangent function, which satisfies the differential equation
f′(t)(1 + t²) = 1.
This equation evaluates at x_i + t into f_i′(t)(1 + (x_i + t)²) = 1, which gives the recurrence
(1 + x_i²)k u_k + 2x_i(k − 1)u_{k−1} + (k − 2)u_{k−2} = 0.
This recurrence translates to
(1 + x_i²)k v_k + 2x_i r_{i+1}(k − 1)v_{k−1} + r_{i+1}²(k − 2)v_{k−2} = 0
for v_k = u_k r_{i+1}^k, and to
(1 + x_i²)k(s_k − s_{k−1}) + 2x_i r_{i+1}(k − 1)(s_{k−1} − s_{k−2}) + r_{i+1}²(k − 2)(s_{k−2} − s_{k−3}) = 0
for s_k = Σ_{j=1}^{k} v_j.
4.12 Constants
Ex: exp, π, γ [15], Gamma, Psi, ζ, ζ(1/2 + it), ... Cf. http://cr.yp.to/1987/bernstein.html for π and e. Cf. also [22].
4.15 Exercises
Exercise 4.15.1 If A(z) = Σ_{j≥0} a_j z^j is a formal power series over R with a_0 = 1, show that log(A(z)) can be computed with error O(z^n) in time O(M(n)), where M(n) is the time required to multiply two polynomials of degree n − 1. (A smoothness condition on the growth of M(n) as a function of n may be required.) Hint: (d/dz) log(A(z)) = A′(z)/A(z).
Does a similar result hold for n-bit numbers if z is replaced by 1/2?
Exercise 4.15.2 (Brent) Assuming we can compute n bits of log x in O(M(n) log n), and of exp x in O(M(n) log² n), show how to compute exp x in O(M(n) log n), with almost the same constant as the logarithm.
Appendix: Implementations
and Pointers
Bibliography
[3] Paul Barrett. Implementing the Rivest Shamir and Adleman public key en-
cryption algorithm on a standard digital signal processor. In A. M. Odlyzko,
editor, Advances in Cryptology, Proceedings of Crypto’86, volume 263 of Lec-
ture Notes in Computer Science, pages 311–323. Springer-Verlag, 1987.
[7] Yves Bertot, Nicolas Magaud, and Paul Zimmermann. A proof of GMP square
root. Journal of Automated Reasoning, 29:225–252, 2002. Special Issue on
Automating and Mechanising Mathematics: In honour of N.G. de Bruijn.
[21] George E. Forsythe. Pitfalls in computation, or why a math book isn’t enough.
Amer. Math. Monthly, 77:931–956, 1970.
[24] Tom R. Halfhill. The truth behind the Pentium bug. Byte, March 1995.
[25] Guillaume Hanrot and Paul Zimmermann. A long note on Mulders’ short
product. Journal of Symbolic Computation, 2003. To appear.
[27] Tudor Jebelean. An algorithm for exact division. Journal of Symbolic Com-
putation, 15:169–180, 1993.
[28] Tudor Jebelean. A double-digit Lehmer-Euclid algorithm for finding the GCD
of long integers. Journal of Symbolic Computation, 19:145–157, 1995.
[29] Alan H. Karp and Peter Markstein. High-precision division and square root.
ACM Transactions on Mathematical Software, 23(4):561–589, 1997.
[33] Werner Krandick and Tudor Jebelean. Bidirectional exact integer division.
Journal of Symbolic Computation, 21(4–6):441–456, 1996.
[35] Roman Maeder. Storage allocation for the Karatsuba integer multiplication algorithm. DISCO, 1993. Preprint.
[37] R. Moenck and A. Borodin. Fast modular transforms via division. In Pro-
ceedings of the 13th Annual IEEE Symposium on Switching and Automata
Theory, pages 90–96, October 1972.
[38] P. L. Montgomery. Speeding the Pollard and elliptic curve methods of factor-
ization. Mathematics of Computation, 48(177):243–264, 1987.
[40] Victor Pan. How to Multiply Matrices Faster, volume 179 of Lecture Notes in
Computer Science. Springer-Verlag, 1984.
[41] Colin Percival. Rapid multiplication modulo the sum and difference of highly
composite numbers. Mathematics of Computation, 72(241):387–395, 2003.
[42] Douglas M. Priest. Algorithms for arbitrary precision floating point arith-
metic. In Peter Kornerup and David Matula, editors, Proceedings of the 10th
Symposium on Computer Arithmetic, pages 132–144, Grenoble, France, 1991.
IEEE Computer Society Press. http://www.cs.cmu.edu/~quake-papers/
related/Priest.ps.
[43] B. Salvy and P. Zimmermann. Gfun: A Maple package for the manipulation
of generating and holonomic functions in one variable. ACM Transactions on
Mathematical Software, 20(2):163–177, June 1994.
[47] Arnold Schönhage and Volker Strassen. Schnelle Multiplikation großer Zahlen.
Computing, 7:281–292, 1971.
[49] Damien Stehlé and Paul Zimmermann. A binary recursive gcd algorithm. In
Proceedings of the Algorithmic Number Theory Symposium (ANTS VI), 2004.
[53] Joris van der Hoeven. Relax, but don’t be too lazy. Journal of Symbolic Com-
putation, 34(6):479–542, 2002. http://www.math.u-psud.fr/~vdhoeven.
[55] Kenneth Weber. The accelerated integer GCD algorithm. ACM Transactions
on Mathematical Software, 21(1):111–122, 1995.
[57] Dan Zuras. More on squaring and multiplying large integers. IEEE Transac-
tions on Computers, 43(8):899–908, 1994.