RESEARCH CENTRE FOR INTEGRATED MICROSYSTEMS - UNIVERSITY OF WINDSOR
Multipliers, Algorithms, and Hardware Designs
Mahzad Azarmehr Supervisor: Dr. M. Ahmadi
Spring 2008
Outline
Survey Objectives
Basic Multiplication Schemes
  Shift/Add Multiplication Algorithm
  Basic Hardware Multiplier
High-Radix Multipliers
  Multiplication of Signed Numbers
  Radix-4 Multiplication
  Modified Booth's Recoding
Tree and Array Multipliers
  Using Carry-Save Adders
  Full Tree Multipliers
  High-Radix Multipliers
  Alternative Reduction Trees
  Tree Multipliers for Signed Numbers
  Divide and Conquer Design
  Array Multipliers
  Additive Multiply Modules
  Pipelined Tree and Array Multipliers
  Bit-Serial Multipliers
  Modular Multipliers
  Squaring
Variation in Multipliers
Conclusion
Survey Objectives
Multiplication is a heavily used arithmetic operation that figures prominently in signal processing and scientific applications
Multiplication is hardware intensive, and the main criteria of interest are higher speed, lower cost, and less VLSI area
The main concern in classic multiplication, often realized by k cycles of shifting and adding, is to speed up the underlying multi-operand addition of partial products
In this survey, a variety of multiplication algorithms and hardware designs are discussed
Shift/Add Multiplication Algorithm
With the following notation:
a  Multiplicand  $a_{k-1}a_{k-2}\cdots a_1a_0$
x  Multiplier    $x_{k-1}x_{k-2}\cdots x_1x_0$
p  Product       $p_{2k-1}p_{2k-2}\cdots p_1p_0$
Each row corresponds to the product of the multiplicand and a single bit of the multiplier. Each term is either 0 or a
Binary multiplication reduces to adding a set of numbers, each of which is 0 or a shifted version of the multiplicand a
Shift/Add Multiplication Algorithm
Sequential multiplication can be done by keeping a cumulative partial product (initialized to 0) and successively adding to it the properly shifted terms $x_j a$:
$p^{(j+1)} = (p^{(j)} + x_j a 2^k)\, 2^{-1}$, with $p^{(0)} = 0$
Instead of shifting successive terms to the left for alignment, the cumulative partial product is shifted by one bit to the right in each cycle
The product will have a total shift of k bits to the right, so we pre-multiply a by $2^k$ to offset this effect
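The right-shift recurrence can be traced in software. The following is a minimal sketch for unsigned k-bit operands; the function name and the use of Python integers in place of registers are assumptions made for illustration.

```python
def shift_add_multiply(a: int, x: int, k: int) -> int:
    """Multiply two unsigned k-bit integers with the right-shift recurrence."""
    assert 0 <= a < 2**k and 0 <= x < 2**k
    p = 0                  # cumulative partial product, p(0) = 0
    a_aligned = a << k     # pre-multiply a by 2^k to offset the k right shifts
    for j in range(k):
        x_j = (x >> j) & 1                  # next multiplier bit, LSB first
        p = (p + x_j * a_aligned) >> 1      # add x_j * a * 2^k, then shift right
    return p
```

For example, `shift_add_multiply(10, 11, 4)` follows four add-and-shift cycles and returns 110.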
Basic Hardware Multiplier
x and p are stored in shift registers
The next bit of x is used to select 0 or a for addition
Shifting can be performed by connecting the i-th sum output to bit k+i-1 of the partial product register and the adder's carry-out to bit 2k-1
x and the lower half of p can share the same register
Multiplication of Signed Numbers
In signed-magnitude numbers, the product's sign should be computed separately by XORing the operand signs
In 2's-complement representation:
  Negative multiplicand: the same routine is used, with sign-extended values
  Negative multiplier: the term $x_{k-1}a$ should be subtracted rather than added in the last cycle
In practice, the required subtraction is performed by adding the 2's-complement of the multiplicand, or by adding its 1's-complement and inserting a carry-in of 1 into the adder
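The rule for a negative multiplier can be checked with a short sketch. It assumes operands are passed as ordinary signed Python integers within the k-bit 2's-complement range; Python's arbitrary-precision integers stand in for sign extension, and the function name is hypothetical.

```python
def twos_complement_multiply(a: int, x: int, k: int) -> int:
    """Multiply signed integers a and x, both in [-2^(k-1), 2^(k-1))."""
    lo, hi = -(1 << (k - 1)), 1 << (k - 1)
    assert lo <= a < hi and lo <= x < hi
    x_bits = x & ((1 << k) - 1)    # k-bit 2's-complement pattern of x
    a_aligned = a << k             # Python ints behave as if sign-extended
    p = 0
    for j in range(k):
        x_j = (x_bits >> j) & 1
        if j == k - 1:
            term = -x_j * a_aligned    # sign bit has negative weight: subtract
        else:
            term = x_j * a_aligned
        p = (p + term) >> 1            # arithmetic right shift preserves the sign
    return p
```

With k = 4, `twos_complement_multiply(-5, -3, 4)` subtracts the multiplicand in the last cycle and returns 15.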
Multiplication of Signed Numbers
[Figure: examples of 2's-complement multiplications]
Multiplication Using Booth's Recoding
The more 1s there are in x, the slower the multiplication
In Booth's recoding, every sequence of 1s is replaced with a sequence of 0s, a -1 at the least significant end, and an addition of 1 in the next higher position:
$2^j + 2^{j-1} + \cdots + 2^{i+1} + 2^i = 2^{j+1} - 2^i$
x_i   x_{i-1}   y_i   Explanation
0     0          0    No string of 1s in sight
0     1          1    End of string of 1s
1     0         -1    Beginning of string of 1s
1     1          0    Continuation of string of 1s
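The recoding table is equivalent to $y_i = x_{i-1} - x_i$ with $x_{-1} = 0$. The sketch below applies that rule to a k-bit 2's-complement multiplier; the function name and the LSB-first digit list are illustrative choices, not part of the original slides.

```python
def booth_recode(x: int, k: int) -> list:
    """Return radix-2 Booth digits y_0..y_{k-1} (LSB first) for a k-bit 2's-complement x."""
    bits = [(x >> i) & 1 for i in range(k)]
    digits = []
    prev = 0                              # x_{-1} is taken as 0
    for i in range(k):
        digits.append(prev - bits[i])     # y_i = x_{i-1} - x_i, as in the table
        prev = bits[i]
    return digits
```

The digits carry the signed value of x, so summing `d * 2**i` over the returned list recovers x.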
High-Radix Multipliers
These multiplication schemes handle more than one bit of the multiplier in each cycle
A higher representation radix leads to fewer digits. Thus, a digit-at-a-time multiplication algorithm requires fewer cycles as we move to higher radices, which means fewer partial products
The reduction in the number of cycles, along with the use of recoding and carry-save adders, leads to significant gains in speed over basic multipliers
Radix-4 Multipliers
Based on the two least significant bits of the multiplier, a precomputed multiple of a (0, a, 2a, or 3a) is added
Alternatively, rather than adding 3a, add -a and send a carry of 1 into the next radix-4 digit of the multiplier (since 3a = 4a - a)
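The carry-based alternative to a precomputed 3a can be sketched as follows. This is an illustrative model for an unsigned multiplier of even width k; the function name and the handling of a final outgoing carry are assumptions.

```python
def radix4_multiply(a: int, x: int, k: int) -> int:
    """Unsigned radix-4 multiply; k (even) is the multiplier width in bits."""
    assert k % 2 == 0 and 0 <= x < 2**k and a >= 0
    product, carry = 0, 0
    for d in range(k // 2):
        digit = ((x >> (2 * d)) & 3) + carry        # radix-4 digit plus carry: 0..4
        if digit >= 3:
            multiple, carry = (digit - 4) * a, 1    # 3 -> -a, 4 -> 0, carry 1 upward
        else:
            multiple, carry = digit * a, 0          # 0, a, or 2a (a simple shift)
        product += multiple << (2 * d)
    return product + ((carry * a) << k)             # a leftover carry has weight 4^(k/2)
```

Only 0, ±a, and 2a are ever selected, so no 3a multiple has to be formed.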
Modified Booth's Recoding
If radix-4 multiplication is performed with the recoded multiplier, only the multiples ±a and ±2a will be required, all of which are easily obtained by shifting and/or complementation
x_{i+1}  x_i  x_{i-1}   y_{i+1}  y_i   Explanation
0        0    0          0        0    No string of 1s in sight
0        0    1          0        1    End of a string of 1s
0        1    0          1       -1    Isolated 1 in x
0        1    1          1        0    End of a string of 1s
1        0    0         -1        0    Beginning of a string of 1s
1        0    1         -1        1    End one string, begin new string
1        1    0          0       -1    Beginning of a string of 1s
1        1    1          0        0    Continuation of string of 1s
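Combining each digit pair of the table into one radix-4 digit $2y_{i+1} + y_i$ gives $-2x_{i+1} + x_i + x_{i-1}$, a value in {-2, -1, 0, 1, 2}. The sketch below uses that closed form; the function names and the assumption of an even width k are illustrative.

```python
def booth_radix4_digits(x: int, k: int) -> list:
    """Radix-4 Booth digits (LSB first) for a k-bit 2's-complement x, k even."""
    assert k % 2 == 0

    def bit(i: int) -> int:
        return (x >> i) & 1 if i >= 0 else 0    # x_{-1} is taken as 0

    # One digit per bit pair: 2*y_{i+1} + y_i = -2*x_{i+1} + x_i + x_{i-1}
    return [-2 * bit(i + 1) + bit(i) + bit(i - 1) for i in range(0, k, 2)]


def booth_radix4_multiply(a: int, x: int, k: int) -> int:
    """Signed multiply using only the multiples 0, +-a, +-2a of the multiplicand."""
    return sum((z * a) << (2 * d) for d, z in enumerate(booth_radix4_digits(x, k)))
```

A k-bit multiplier yields k/2 digits, so the number of add cycles is halved relative to radix 2.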
Radix-4 Multipliers
Booth's recoding is fully parallel and carry-free
Each recoded digit is represented by three signals:
non0: 1 bit to distinguish 0 from nonzero digits
neg: 1 bit to show the sign of a nonzero digit
two: 1 bit to show the magnitude of a nonzero digit
Using Carry-Save Adders
Carry-save adders (CSAs) can be used to reduce the number of addition cycles as well as to make each cycle faster
A row of binary full adders (FAs) is used as a mechanism to reduce three numbers to two numbers, rather than finding a single sum
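The three-to-two reduction can be expressed with bitwise operations, one full adder per bit position. This is a word-level sketch for nonnegative integers; the function name is an assumption.

```python
def carry_save_add(x: int, y: int, z: int):
    """Reduce three nonnegative integers to a (sum, carry) pair with the same total."""
    s = x ^ y ^ z                                # per-bit full-adder sum outputs
    c = ((x & y) | (x & z) | (y & z)) << 1       # per-bit carry outputs, moved up one position
    return s, c
```

No carry propagates between bit positions, so the delay is that of a single full adder regardless of word width.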
Wallace and Dadda trees
Wallace's strategy is to combine the partial product bits at the earliest opportunity, which leads to the fastest possible design
With Dadda's method, combining takes place as late as possible, which usually leads to a simpler CSA tree and a wider CPA
Using Carry-Save Adders
A carry-save adder tree can reduce n binary numbers to two numbers having the same sum in O(log n) levels
As an example, this CSA tree reduces seven k-bit operands to two (k+2)-bit operands
Not all the operands necessarily have the same alignment
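The logarithmic depth can be observed with a small word-level model. It groups whole operands in threes at each level, which is a simplification assumed here for illustration; it does not reproduce the bit-level allocation of any particular tree design.

```python
def csa_tree_reduce(operands):
    """Reduce nonnegative ints to at most two ints with the same sum; return (values, levels)."""
    values, levels = list(operands), 0
    while len(values) > 2:
        nxt = []
        for i in range(0, len(values) - 2, 3):              # one CSA per group of three
            x, y, z = values[i:i + 3]
            nxt += [x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1]
        nxt += values[len(values) - len(values) % 3:]       # leftovers pass to the next level
        values, levels = nxt, levels + 1
    return values, levels
```

Seven operands shrink as 7, 5, 4, 3, 2, so this model needs four CSA levels.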
Using Carry-Save Adders
Radix-4 multiplication without Booth's recoding can be implemented by using a CSA to handle the 3a multiple
The drawback is that the add time is slightly increased, since the CSA overhead is paid in every cycle, regardless of whether 3a is actually needed
Using Carry-Save Adders
A CSA can be put to better use for reducing the addition time by keeping the cumulative partial product in stored-carry form
As the three values that form the next cumulative partial product are added, one bit of the final product is obtained and shifted into the lower half of the register. This eliminates the need for carry propagation in all but the final addition
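A software model of the stored-carry scheme is sketched below for unsigned k-bit operands. The variable names (`s`, `c`, `low`) and the register organization are assumptions for illustration; only the final line performs a carry-propagate addition.

```python
def csa_shift_add_multiply(a: int, x: int, k: int) -> int:
    """Unsigned k-bit multiply with the upper partial product kept as a (sum, carry) pair."""
    assert 0 <= a < 2**k and 0 <= x < 2**k
    s = c = 0      # stored-carry form of the upper part of the partial product
    low = 0        # finished product bits (lower half of the register)
    for j in range(k):
        m = a if (x >> j) & 1 else 0                    # select 0 or a
        s, c = s ^ c ^ m, ((s & c) | (s & m) | (c & m)) << 1
        low |= (s & 1) << j      # c has a 0 in its LSB, so s's LSB is a final product bit
        s, c = s >> 1, c >> 1    # shift both components right by one bit
    return ((s + c) << k) | low  # single carry-propagate addition at the end
```

Each cycle costs one full-adder delay instead of a full k-bit carry-propagate addition.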
Using Carry-Save Adders
The previous CSA-based design can be combined with radix-4 Booth's recoding to reduce the number of cycles by 50%, while also making each cycle considerably faster
Using Carry-Save Adders
In the Booth recoding logic and multiple selection circuit, the sign of each multiple must be incorporated in the multiple itself, rather than as a signal that controls addition/subtraction
This configuration can be used for high-radix and parallel multipliers
Using Carry-Save Adders
This is another way to accommodate the required 3a multiple
Four numbers (the sum and carry components of the cumulative partial product, $x_i a$, and $2x_{i+1}a$) need to be combined, thus necessitating a two-level CSA tree
High-Radix Multipliers
Now, it is an easy step to visualize a higher-radix multiplier:
In radix-$2^b$ multiplication with Booth's recoding, we have to reduce b/2 multiples to 2 using a (b/2+2)-input CSA tree whose other two inputs are taken by the carry-save partial product
Without Booth's recoding, a (b+2)-input CSA tree would be needed
Tree and Array Multipliers
Tree, or fully parallel, multipliers constitute limiting cases of high-radix multipliers (radix $2^k$)
With a high-performance CSA tree followed by a fast adder, logarithmic-time multiplication becomes possible
The resulting multipliers are expensive, but justifiable for applications in which multiplication speed is critical
One-sided CSA trees lead to much slower, but highly regular, structures known as array multipliers, which offer higher pipelined throughput than tree multipliers and significantly lower chip area
Full-Tree Multipliers
In full-tree multipliers, all k multiples of the multiplicand are produced at once and a k-input CSA tree is used
All the multiples are combined in one pass; the tree does not require feedback links, making pipelining quite feasible
Reduction Tree
A logarithmic-depth reduction tree based on CSAs has an irregular structure that makes its design and layout quite difficult
Additionally, connections and signal paths of varying lengths lead to logic hazards and signal skew that have implications for both performance and power consumption
Compared to a generic CSA tree, the only modification required is the relative shifting of the operands to be added
Alternative Reduction Trees
A slice of an (n; 2) counter, when suitably replicated, can perform the function of the reduction tree
Using counters assures us that all outputs are produced after the same number of full-adder delays
The structure can be replicated to form an n-input reduction tree of desired width. Such balanced-delay trees are quite suitable for VLSI implementation of parallel multipliers
Alternative Reduction Trees
Another alternative is using a module that reduces four numbers to two as the basic building block
Partial product reduction trees can then be structured as binary trees that possess a recursive structure, making them more regular and easier to lay out
Tree multipliers for signed numbers
In multiplying 2's-complement numbers directly, partial products are signed numbers
To avoid having to deal with negatively weighted bits, an efficient method is offered by Baugh and Wooley, based on the identity:
$-x_0 = \bar{x}_0 - 1$
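Applying the identity to every negatively weighted bit turns the multiplication into a sum of nonnegative terms. The sketch below follows the commonly used modified Baugh-Wooley arrangement, which is an assumption here since the slides only state the identity: bit products involving exactly one sign bit are complemented, and two constant 1s are added at positions k and 2k-1.

```python
def baugh_wooley_multiply(a: int, x: int, k: int) -> int:
    """Multiply k-bit 2's-complement patterns a and x; return the 2k-bit product pattern."""
    total = (1 << k) + (1 << (2 * k - 1))           # correction constants from the "- 1" terms
    for i in range(k):
        for j in range(k):
            bit = ((a >> i) & 1) & ((x >> j) & 1)
            if (i == k - 1) != (j == k - 1):        # exactly one sign bit involved
                bit ^= 1                            # complemented bit replaces negative weight
            total += bit << (i + j)
    return total & ((1 << (2 * k)) - 1)             # keep 2k bits (arithmetic mod 2^(2k))
```

All terms are now positively weighted, so an ordinary unsigned reduction tree or array can add them.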
Array Multipliers
A tree multiplier with a one-sided reduction tree and a ripple-carry final adder is called an array multiplier
An array multiplier is very regular in its structure and uses only short wires that go from one FA to adjacent FAs
It has a very simple and efficient layout in VLSI and can be easily and efficiently pipelined
Array Multipliers
Sum outputs are connected diagonally, while the carry outputs are linked vertically, except in the last row, where they are chained from right to left
The Baugh and Wooley method can be easily applied to an array multiplier for 2's-complement multiplication
Pipelined Tree and Array Multipliers
The $x_i$ inputs are delayed through the insertion of latches in their paths, and the product emerges with a latency of 2k-1 cycles
The FA blocks used are assumed to have output latches for both sum and carry
The final ripple-carry adder has been pipelined as well
Divide and Conquer Design
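A 2b × 2b multiplier can be synthesized using b × b multipliers
Although there are four partial products, only three values need to be added, because the high × high and low × low products occupy non-overlapping bit positions and can be concatenated
The 2b × 2b multiplication has been reduced to four b × b multiplications and a three-operand addition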
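The decomposition can be sketched as follows for unsigned operands. The names `a_h`, `a_l`, `x_h`, and `x_l` for the upper and lower b-bit halves are introduced here for illustration, and Python's `*` stands in for the b × b multiplier blocks.

```python
def multiply_2b_by_2b(a: int, x: int, b: int) -> int:
    """Unsigned 2b x 2b multiply built from four b x b products."""
    assert 0 <= a < 4**b and 0 <= x < 4**b
    mask = (1 << b) - 1
    a_h, a_l = a >> b, a & mask
    x_h, x_l = x >> b, x & mask
    hh, hl, lh, ll = a_h * x_h, a_h * x_l, a_l * x_h, a_l * x_l   # four b x b products
    outer = (hh << (2 * b)) | ll    # hh and ll occupy disjoint bits: one operand, no adder
    return outer + (hl << b) + (lh << b)                          # three-operand addition
```

The bitwise OR that forms `outer` is what reduces four partial products to three addends.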
Divide and Conquer Design
For 2b × 2b multiplication, one can use b-bit adders exclusively to accumulate the partial products
Additive Multiply Modules (AMMs)
In certain computations, multiplications are commonly followed by additions. In such cases, implementing a multiply-add unit to compute p = ax + y might be cost effective
Furthermore, AMMs can be used as building blocks for multipliers
In a b × c AMM, the largest possible result still fits in b + c bits: $(2^b-1)(2^c-1) + (2^b-1) + (2^c-1) = 2^{b+c} - 1$
The cost of a 4 × 2 AMM is less than the combined costs of a 4 × 2 multiplier and a 4-bit adder
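The bound on the AMM result follows by expanding the product of the two maximal operands and adding the two maximal additive inputs (one b-bit and one c-bit value, as the terms of the identity suggest):

```latex
\begin{aligned}
(2^b-1)(2^c-1) + (2^b-1) + (2^c-1)
  &= \left(2^{b+c} - 2^b - 2^c + 1\right) + 2^b + 2^c - 2 \\
  &= 2^{b+c} - 1
\end{aligned}
```

Since $2^{b+c} - 1$ is the largest (b+c)-bit unsigned number, the additive inputs never cause an overflow beyond the width of the product itself.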
Inputs marked with an asterisk carry 0s
Bit-Serial Multipliers
Bit-serial arithmetic is attractive in view of its smaller pin count, reduced wire length, and lower floor space requirements in VLSI
The compactness of the design may allow a bit-serial multiplier to run at a high enough clock rate to make it competitive, with regard to speed, with much more complex designs
Bit-Serial Multipliers
For a latency-free multiplier, the relationship between the output and the inputs is written in the form of a recurrence:
$a^{(0)} = a_0$, $a^{(1)} = (a_1 a_0)_2$, ..., $a^{(i)} = 2^i a_i + a^{(i-1)}$ (and similarly for $x^{(i)}$)
$p^{(i)} = 2^{-(i+1)} a^{(i)} x^{(i)}$
$2p^{(i)} = p^{(i-1)} + a_i x^{(i-1)} + x_i a^{(i-1)} + 2^i a_i x_i$
A (5; 3) counter can be used as an adder if $p^{(i-1)}$ is stored in double-carry-save form
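The recurrence can be checked numerically with exact rational arithmetic. The sketch assumes unsigned operands fed LSB first and takes $a^{(-1)} = x^{(-1)} = p^{(-1)} = 0$ as starting values; these conventions and the function name are assumptions.

```python
from fractions import Fraction


def bit_serial_products(a: int, x: int, k: int) -> list:
    """Return [p(0), ..., p(k-1)] produced by the recurrence, as exact fractions."""
    a_prev = x_prev = 0      # a(-1) and x(-1)
    p = Fraction(0)          # p(-1)
    history = []
    for i in range(k):
        a_i, x_i = (a >> i) & 1, (x >> i) & 1
        # 2 p(i) = p(i-1) + a_i x(i-1) + x_i a(i-1) + 2^i a_i x_i
        p = (p + a_i * x_prev + x_i * a_prev + ((a_i * x_i) << i)) / 2
        a_prev += a_i << i   # a(i) = 2^i a_i + a(i-1)
        x_prev += x_i << i   # x(i) = 2^i x_i + x(i-1)
        history.append(p)
    return history
```

After k steps, `history[-1] * 2**k` equals the full product a·x.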
Modular Multipliers
A modular multiplier is one that produces the product of two (unsigned) integers modulo some fixed constant m. The two special cases $m = 2^b$ and $m = 2^b - 1$ are simpler to deal with
If the partial products are accumulated through carry-save addition:
for $m = 2^b$, the output carry in position b-1 is ignored
for $m = 2^b - 1$, the carry out of position b-1 is combined with bits in column 0
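The $m = 2^b - 1$ case can be sketched with an end-around carry. For simplicity this model accumulates with ordinary addition rather than carry-save addition, which is a deliberate substitution; the wrap-around of the carry into column 0 is the same idea. The function name is hypothetical.

```python
def multiply_mod_2b_minus_1(a: int, x: int, b: int) -> int:
    """Compute (a * x) mod (2^b - 1) for b-bit unsigned a and x."""
    m = (1 << b) - 1
    assert 0 <= a <= m and 0 <= x <= m
    acc = 0
    for j in range(b):
        if (x >> j) & 1:
            # a * 2^j mod m is a j-bit left rotation of a within b bits
            rotated = ((a << j) | (a >> (b - j))) & m
            acc += rotated
            acc = (acc & m) + (acc >> b)    # end-around carry: weight 2^b folds onto 2^0
    return 0 if acc == m else acc           # the all-1s pattern is a second encoding of 0
```

For $m = 2^b$ the corresponding step would simply be `acc &= (1 << b) - 1`, dropping the carry out of position b-1.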
Modular Multipliers
Similar techniques can be used to handle modular multiplication in the general case
As an example, a modulo-13 multiplier can be designed by using the identities:
16 = 3 mod 13, with 3 = 2 + 1
32 = 6 mod 13, with 6 = 4 + 2
64 = 12 mod 13, with 12 = 8 + 4
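The identities let every partial product bit of weight 16, 32, or 64 be replaced by two bits in the low four columns. The sketch below applies that folding to a 4 × 4 bit-product array; the `FOLD` table name and the final `% 13` correction step are assumptions for illustration, standing in for whatever small final-reduction logic a hardware design would use.

```python
# Weights above 8 are replaced by pairs of low-order weights with the same residue mod 13
FOLD = {16: (2, 1), 32: (4, 2), 64: (8, 4)}


def multiply_mod_13(a: int, x: int) -> int:
    """Compute (a * x) mod 13 for 4-bit unsigned a and x by folding high-weight bits."""
    assert 0 <= a < 16 and 0 <= x < 16
    total = 0
    for i in range(4):
        for j in range(4):
            if (a >> i) & 1 and (x >> j) & 1:
                weight = 1 << (i + j)                     # weights 1 through 64
                total += sum(FOLD.get(weight, (weight,)))  # fold 16, 32, 64 downward
    return total % 13    # final correction of the small accumulated value
```

After folding, all bits to be added lie in columns 0 through 3, so the reduction tree stays four columns wide.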
Squaring
Any standard or modular multiplier can be used for computing $p = x^2$ if both inputs are connected to x
A special-purpose k-bit squarer, if built in hardware, will be significantly lower in cost and delay than a k × k multiplier, because of the simplifications:
$x_i x_i = x_i$
$x_i x_j + x_j x_i = 2 x_i x_j$
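The two simplifications roughly halve the number of bit products to be added. A minimal sketch for an unsigned k-bit input, with an assumed function name:

```python
def square(x: int, k: int) -> int:
    """Square an unsigned k-bit integer using the folded partial product array."""
    assert 0 <= x < 2**k
    bits = [(x >> i) & 1 for i in range(k)]
    total = 0
    for i in range(k):
        total += bits[i] << (2 * i)                       # x_i x_i = x_i, weight 2^(2i)
        for j in range(i + 1, k):
            total += (bits[i] & bits[j]) << (i + j + 1)   # x_i x_j + x_j x_i = 2 x_i x_j
    return total
```

Only k diagonal bits and k(k-1)/2 off-diagonal AND terms are needed, compared with k² AND terms in a general k × k multiplier.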
Conclusion
The classic shift/add multiplication schemes and their implementations have been examined
There are two ways to speed up the underlying multi-operand addition: reducing the number of operands leads to high-radix multipliers, and devising hardware multi-operand adders that minimize the latency and/or maximize the throughput leads to tree and array multipliers
Cost, VLSI area, and pin limitations favor bit-serial designs, while the desire to use available building blocks leads to designs based on Additive Multiply Modules (AMMs)
Finally, the special case of squaring was of interest, as it leads to considerable simplification
Questions and Comments