Low- and Ultra Low-Power Arithmetic Units: Design and Comparison
Milena Vratonjic, Bart R. Zeydel, Vojin G. Oklobdzija
Advanced Computer Systems Engineering Laboratory (ACSEL)
University of California, Davis, CA 95616
(milena, zeydel, vojin)@acsel-lab.com
Abstract

Design guidelines for low- and ultra low-power arithmetic units are presented. We analyze structures for addition in the energy-delay space to determine the most suitable for these regions of operation. This paper demonstrates that the use of more complex high-performance structures combined with scaling of the supply voltage outperforms traditional low-power oriented designs in the low- and ultra low-power domain.

1. Introduction

Integrated circuit designs for wireless computations and sensor networks are required to operate under ultra-low energy requirements. In this design space it is not difficult to find designs which meet the required performance; however, the challenge lies in finding the design that uses the least energy for a given performance.

Low-power designs have often been compared based on area or total gate count. However, gate count does not account for the impact of transistor sizing and supply-voltage scaling on energy and delay. We chose to use adders as a test case to determine if there exists a direct relationship between gate count and the best structures for low-power operation.

Recently, two adders have been proposed for low-power and high-speed operation [5,6]. These approaches focus on either improving the energy-delay characteristics of traditional low-power adders through the use of more complex circuits, or simplifying high-performance adders to reduce energy. We analyze these adders in comparison with traditional adders to determine which of these structures are the best for low- and ultra low-power operation.

The paper is organized as follows: in Section 2 we discuss prior work and our approach for design and comparison of arithmetic units; in Section 3 energy-efficient adder designs and design considerations are introduced; Section 4 describes the simulation setup; Section 5 presents the comparison of these adders in the low- and ultra low-power domain; and Section 6 concludes the paper.

2. Analysis

Different arithmetic algorithms have been proposed in order to improve computational efficiency in terms of speed, area, and regularity of structures. In low-power applications, evaluating the energy efficiency of the algorithm is crucial. Research on low-power adders lacks a framework for analyzing and quantifying the energy ramifications of different algorithm choices and their implementations.

2.1. Prior Work

Delay estimation of designs was initially based on the number of logic levels. The notion of fan-in and fan-out considerations for the delay was formulated in [7] and later expanded into a comprehensive method known as “Logical Effort” [16]. In [7], different adder implementations are compared in terms of their delay and complexity. Complexity of the adder structure is expressed by the number of equivalent 2-input gates. It was shown in [7] that a structure optimized around the critical path, the Variable Block Adder, could reach the speed of a Carry-Lookahead Adder (CLA) with a low complexity comparable to that of the Ripple-Carry Adder. The gate count and complexity of the adder implementations considered in this paper are shown in Table 1.

Table 1. Gate Count and Complexity of Adders

Adder Type (32-bit)   Gate Count   Complexity
RCA                   161          208
CSK [13]              197          245
VBA [7]               209          254
CSA [12,15]           248          423
CLSK [5]              272          338
KS [14]               404          461

Neither gate count nor complexity can be used as a figure of merit for energy efficiency because they do not account for the impact of switching activity, gate sizing, parasitics and wiring on energy. Approaches for determining the best adder through the use of more complex figures of merit have been suggested in [9,10].

To evaluate the design, the following figure of merit for “efficiency” was used in [9]:

$$\mathrm{Efficiency} = \frac{10{,}000}{D \times T} \qquad \text{(Eq. 1)}$$

where D was the worst-case delay and T corresponded to the average number of gate transitions per addition. Based on this metric, it was concluded that a carry-lookahead topology was the best. Later analysis of low-power adders examined the use of transistor sizing [10]. Measurements for the worst-case delay were extracted from simulation results. Power measurements were obtained from simulations where all the adders had the same clock frequency, which was chosen to accommodate the slowest adder, ripple-carry. Since frequency is kept constant for all the adders, the metric employed for comparison is a scaled form of energy-delay product (EDP). These comparisons are made at only a single operating point, and thus do not capture the varying energy-delay characteristics of each adder.

2.2. Energy-Delay Analysis

It is necessary to perform analysis of arithmetic units for the various design points in the energy-delay space [1,2,3] in order to provide a comprehensive comparison over possible design ranges. For low- and ultra low-power operation, energy is the primary objective. Using linear delay and energy models for gates, it can be shown that the minimal energy through gate sizing is obtained when minimal gate sizes are used. Further energy reduction is achieved through supply-voltage scaling, which can be used to create the energy-delay characteristics of the adder for the low- and ultra low-power regions of operation.

3. Adder Designs for Low-Power

In the following section, adder designs for low-power addition are introduced with considerations for energy-efficient implementations.

3.1. Variable Block Adder (VBA)

The Carry-Skip Adder (CSK) was initially proposed to improve the speed of a Ripple-Carry Adder (RCA) with only a minimal overhead in number of gates [13]. The speed of CSK was improved in the Variable Block Adder (VBA) [8], where the block sizes were varied to optimally balance the delay between the ripple and carry paths. This improvement in speed requires only a small increase in number of gates (Table 1).

3.2. Carry-Select Adder (CSA)

In CSA [12], each block is evaluated conditionally. When the carry-in to a block becomes available, it selects the carry-out and sum bits of the block. The critical path of CSA is either the ripple-carry path in the largest block or the worst-case carry-select path. The optimal block sizing is chosen such that the delays of the ripple and carry-select paths are balanced [15]. The addition can be performed in fewer stages than VBA; however, this comes at the expense of increased branching and more logic gates required for conditional computation.

3.3. Carry-Lookahead-Skip Adder (CLSK)

A modification of VBA was introduced in [5] to further minimize delay by using carry-lookahead logic within the blocks. Group sizing is chosen to balance the delay of each path within the adder. This balancing of delay is intended to reduce power consumption by eliminating spurious glitch transitions that occur when the delays of the paths are unequal. A 32-bit implementation has seven logic levels, each with complex CMOS gates. The results in [5] indicated speed comparable to high-performance 32-bit adders. However, it was not shown how CLSK compared to these adders when energy is taken into account.

3.4. Sparse Carry-Lookahead Adder (SCL)

A sparse carry-tree adder architecture proposed in [6] reduces carry-tree density through the use of 4-bit conditional sum computation [11]. Carry signals are generated for every fourth bit (C3, C7, …, C23, and C27). This is in contrast to the Kogge-Stone adder architecture (KS) [14], which generates a carry signal for every bit. Our implementation of the SCL adder is shown in Fig. 1. Compared to [6], our design has a shorter wire length on the critical path and less branching in the carry tree.
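
The sparse-carry idea described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the circuit of [6] or of Fig. 1: the function names `sparse_carries` and `sparse_carry_add` are hypothetical, the block carries are computed serially here whereas the hardware uses a parallel carry tree, and the width is assumed to be a multiple of the block size. For a 32-bit adder with 4-bit blocks it produces the seven carries out of bits 3, 7, …, 27, then selects between two precomputed 4-bit conditional sums per block.

```python
def sparse_carries(a: int, b: int, width: int = 32, block: int = 4) -> list:
    """Return the carry into each block (entry k is the carry into bit k*block).

    Behavioral model only: block carries are chained serially here, while a
    real sparse carry tree computes them in parallel. Assumes width is a
    multiple of block.
    """
    block_mask = (1 << block) - 1
    carries = [0]  # carry into the least significant block is 0
    for lo in range(0, width - block, block):
        xa = (a >> lo) & block_mask
        xb = (b >> lo) & block_mask
        group_generate = (xa + xb) >> block            # block produces a carry by itself
        group_propagate = int((xa ^ xb) == block_mask)  # block passes an incoming carry
        carries.append(group_generate | (group_propagate & carries[-1]))
    return carries


def sparse_carry_add(a: int, b: int, width: int = 32, block: int = 4) -> int:
    """Add a and b modulo 2**width using per-block conditional sums."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    block_mask = (1 << block) - 1
    result = 0
    for k, carry_in in enumerate(sparse_carries(a, b, width, block)):
        lo = k * block
        xa = (a >> lo) & block_mask
        xb = (b >> lo) & block_mask
        sum_if_0 = (xa + xb) & block_mask      # conditional sum for carry-in = 0
        sum_if_1 = (xa + xb + 1) & block_mask  # conditional sum for carry-in = 1
        result |= (sum_if_1 if carry_in else sum_if_0) << lo
    return result
```

In this model only one carry per 4-bit block is needed to finish the addition, which is the property that lets the sparse tree be less dense than a Kogge-Stone tree that produces a carry for every bit.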
Fig. 1. Sparse Carry-Lookahead adder (SCL)

4. Simulation Conditions

All simulations are performed in a 130nm CMOS technology, with 1.2V nominal supply, at 27ºC, under typical process conditions. The simulation test bench is shown in Fig. 2. Average energy is measured in H-SPICE on a set of 500 random input test vectors. The delay of each adder is obtained from simulation of the critical path vectors.

Fig. 2. Simulation test bench

Each output is loaded with a 10µm equivalent gate capacitance. Wire lengths are estimated assuming a 4µm bit pitch and are included in the analysis and simulation. Each adder is sized for minimal energy. Using these gate sizes, the supply voltage (Vdd) is swept from 1.2V to 0.6V in 50mV decrements to obtain the range of operation for each adder in the energy-delay space.

5. Results

A comparison of the adders at the nominal supply voltage of 1.2V is shown in Table 2. From this table it appears that VBA is the best design at 1ns, while RCA uses the least amount of energy at 2.1ns. However, it is unclear if these statements will hold if the adders are designed to operate at the same delay.

Table 2. Characteristics of the adders at 1.2V

Adder Type (32-bit)   Delay [ns]   Av. Energy [pJ]   EDP [pJ/GHz]   Gate Count
RCA                   2.1          1.1               2.31           161
VBA                   0.98         1.38              1.35           197
CSA                   1            1.78              1.78           209
CLSK                  0.94         2.63              2.47           248
SCL                   0.62         1.78              1.1            315
KS                    0.65         2.04              1.3            404

We use the results of the adders at nominal voltage from Table 2 to define the low- and ultra low-power regions of operation. The low-power region is defined as the range of delay between VBA and RCA at nominal voltage (0.98ns to 2.1ns). The ultra low-power region is defined as the region of operation where delay exceeds that of RCA at nominal voltage (>2.1ns), as indicated in Figs. 3 and 4.

Simulation results for the energy and delay of the adders described in Section 3 are presented in Fig. 3.

Fig. 3. Low-power comparison of 32-bit adders with Vdd scaling

In the low-power design region, the use of high-performance schemes, such as SCL, combined with reduced supply voltage always yields lower energy. At the target frequency of 1GHz, the SCL adder can operate at 0.85V and consumes 42% less energy than VBA, which must operate at nominal voltage to achieve the same speed. As the supply voltage is reduced, the energy saving of the SCL structure versus VBA is reduced. At the intersection of the low-power and ultra low-power regions, which occurs at 2.1ns, the energy saving of SCL compared to VBA is 25%.
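
As a quick consistency check on Table 2, the EDP column can be recomputed from the delay and energy columns, and the region boundaries defined above can be applied to a given delay. The sketch below is illustrative only: the names `TABLE_2`, `edp` and `region` are hypothetical, and the label used for delays faster than 0.98ns is an assumption, since the text does not name that range.

```python
# Nominal-voltage (1.2V) values copied from Table 2: (delay in ns, average energy in pJ).
TABLE_2 = {
    "RCA": (2.1, 1.1),
    "VBA": (0.98, 1.38),
    "CSA": (1.0, 1.78),
    "CLSK": (0.94, 2.63),
    "SCL": (0.62, 1.78),
    "KS": (0.65, 2.04),
}


def edp(delay_ns: float, energy_pj: float) -> float:
    """Energy-delay product in pJ*ns (numerically equal to pJ/GHz)."""
    return delay_ns * energy_pj


def region(delay_ns: float, low_ns: float = 0.98, ultra_ns: float = 2.1) -> str:
    """Classify a delay using the region boundaries taken from Table 2."""
    if delay_ns > ultra_ns:
        return "ultra low-power"
    if delay_ns >= low_ns:
        return "low-power"
    # Assumed label: the paper does not name the range below 0.98ns.
    return "below low-power range"


if __name__ == "__main__":
    for name, (delay_ns, energy_pj) in TABLE_2.items():
        print(f"{name:5s} EDP = {edp(delay_ns, energy_pj):.2f} pJ/GHz, "
              f"nominal-voltage region: {region(delay_ns)}")
```

A single-voltage EDP ranking of this kind is exactly the sort of one-point comparison that the Vdd sweep is meant to go beyond, so it should be read as a check on the table rather than as a design metric.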
The results in Fig. 3 also demonstrate that CLSK, which was designed to reduce power while improving performance, is inefficient in all regions of its operation. It provides only a 5% delay improvement over VBA at the expense of twice the energy consumption. Among traditional low-power designs, the CSA structure demonstrates similar performance to VBA; however, conditional computation led to an energy increase of 30% compared to VBA.

Simulation results for the ultra low-power region of the adders are shown in Fig. 4. At the lowest operating voltage (0.6V), VBA is two times faster than RCA and requires only a 25% increase in energy. At nominal voltage, there is no justification for using the simple RCA structure: it consumes 2.5x more energy than the SCL adder operating at 0.6V, and thus provides no energy savings at all.

Fig. 4. Ultra low-power comparison of 32-bit adders with Vdd scaling

6. Conclusion

In this paper, we provide an approach for the design and comparison of 32-bit adders for low- and ultra low-power applications. The energy-delay space results demonstrate that, when designing for low power, a comparison of designs at a single voltage or a comparison based on gate count is insufficient for determining the optimal structures.

We have demonstrated that the use of high-performance structures combined with supply-voltage scaling results in reduced energy compared to traditional designs for low-power and ultra low-power operation. This finding is contrary to common belief.

7. Acknowledgements

This work has been supported by SRC Research Grant No. 931.001, California MICRO 04-068, Fujitsu Ltd. and Intel Corp.

8. References

[1] V. Zyuban, P. Strenski, “Unified Methodology for Resolving Power-Performance Tradeoffs at the Micro-Architectural and Circuit Levels”, ISLPED, 2002.
[2] H. Q. Dao, B. R. Zeydel, V. G. Oklobdzija, “Energy Minimization Method for Optimal Energy-Delay Extraction”, ESSCIRC, 2003.
[3] V. G. Oklobdzija, et al., “Comparison of High-Performance VLSI Adders in Energy-Delay Space”, IEEE Trans. on VLSI Systems, Vol. 13, No. 6, pp. 754-758, June 2005.
[4] M. Vratonjic, B. R. Zeydel, H. Q. Dao, V. G. Oklobdzija, “Low-Power Aspects of Different Adder Topologies”, 37th Annual Asilomar Conference, November 2003.
[5] K. Chirca, M. Schulte, et al., “A Static Low-Power, High-Performance 32-bit Carry Skip Adder”, Euromicro Symposium on Digital System Design, pp. 615-619, 2004.
[6] S. Mathew, et al., “A 4GHz 130nm Address Generation Unit with 32-bit Sparse-Tree Adder Core”, IEEE JSSC, Vol. 38, No. 5, pp. 689-695, May 2003.
[7] V. G. Oklobdzija, E. R. Barnes, “On Implementing Addition in VLSI Technology”, IEEE Journal of Parallel and Distributed Computing, No. 5, pp. 716-728, 1988.
[8] V. G. Oklobdzija, E. R. Barnes, “Some Optimal Schemes for ALU Implementation in VLSI Technology”, Proceedings of the Symposium on Comp. Arithmetic, pp. 2-8, June 1985.
[9] T. K. Callaway, E. E. Swartzlander, Jr., “Optimizing Arithmetic Elements for Signal Processing”, VLSI Signal Processing, V, pp. 91-100, IEEE Special Publications, 1992.
[10] C. Nagendra, M. J. Irwin, R. M. Owens, “Area-Time-Power Tradeoffs in Parallel Adders”, IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, Vol. 43, No. 10, pp. 689-702, October 1996.
[11] J. Sklansky, “Conditional-Sum Addition Logic”, IRE Trans. on El. Comp., Vol. EC-9, No. 2, pp. 226-231, 1960.
[12] O. J. Bedrij, “Carry-Select Adder”, IRE Trans. on El. Computers, Vol. EC-11, pp. 340-346, 1962.
[13] M. Lehman, N. Burla, “Skip Techniques for High-Speed Carry-Propagation in Binary Arithmetic Units”, IRE Trans. on El. Comp., Vol. EC-10, pp. 691-698, December 1961.
[14] P. M. Kogge, H. S. Stone, “A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations”, IEEE Trans. on Computers, Vol. C-22, No. 8, pp. 786-793, August 1973.
[15] A. Tyagi, “A Reduced-Area Scheme for Carry-Select Adders”, IEEE Trans. on Computers, Vol. 42, No. 10, pp. 1163-1170, October 1993.
[16] D. Harris, R. F. Sproull, I. E. Sutherland, “Logical Effort: Designing Fast CMOS Circuits”, M. Kaufmann, 1999.
[17] V. G. Oklobdzija, “High-Performance System Design: Circuits and Logic”, IEEE Press, 1999.