JugglePAC: a Pipelined Accumulation Circuit

Ahmad Houraniah, H. Fatih Ugurdag (Senior Member, IEEE), Furkan Aydin (Member, IEEE)

arXiv:2310.01336v2 [cs.AR] 16 Sep 2024

A. Houraniah is with the Dept. of Computer Science, Ozyegin University, Istanbul 34794, Turkey (email: [email protected]).
H. F. Ugurdag is with the Dept. of Electrical and Electronics Engineering, Ozyegin University, Istanbul 34794, Turkey.
F. Aydin is with the Dept. of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27606, USA.

Abstract—Reducing a set of numbers to a single value is a fundamental operation in applications such as signal processing, data compression, scientific computing, and neural networks. Accumulation, which involves summing a dataset to obtain a single result, is crucial for these tasks. Due to hardware constraints, large vectors or matrices often cannot be fully stored in memory and must be read sequentially, one item per clock cycle. For high-speed inputs, such as rapidly arriving floating-point numbers, pipelined adders are necessary to maintain performance. However, pipelining introduces multiple intermediate sums and requires delays between back-to-back datasets unless their processing is overlapped. In this paper, we present JugglePAC, a novel accumulation circuit designed to address these challenges. JugglePAC operates quickly, is area-efficient, and features a fully pipelined design. It effectively manages back-to-back variable-length datasets while consistently producing results in the correct input order. Compared to the state-of-the-art, JugglePAC achieves higher throughput and lower area complexity, offering significant improvements in performance and efficiency.

Index Terms—Fully pipelined reduction circuits, floating-point number accumulation, field-programmable gate arrays, computer arithmetic.

TABLE I
Accumulation schedule for SimplePAC versus JugglePAC.
(in1/in2 are the operands fed to the adder in a given cycle; out is the sum emerging from the 3-cycle adder pipeline in that cycle; a0,3 denotes a0+a3 and a0:3 denotes a0+a1+a2+a3.)

         SimplePAC                 |          JugglePAC
Input   in1       in2   out        | Input   in1    in2    out
a0      a0        0                | a0
a1      a1        0                | a1      a0     a1
a2      a2        0                | a2
a3      a0        a3    a0         | a3      a2     a3
a4      a1        a4    a1         | a4                    a0,1
a5      a2        a5    a2         | a5      a4     a5
                        a0,3       | b0      a0,1   a2,3   a2,3
        a0,3      a1,4  a1,4       | b1      b0     b1
Stall                   a2,5       | b2                    a4,5
Stall                              | b3      b2     b3     a0:3
        a0,1,3,4  a2,5  a0,1,3,4   | b4      a4,5   a0:3   b0,1
b0      b0        0                | b5      b4     b5
b1      b1        0                | b6                    b2,3
b2      b2        0     a0:5       | b7      b6     b7     a0:5

I. Introduction

In the realm of modern computing, the ability to efficiently reduce a set of numbers to a single value is fundamental to a wide variety of computational applications. This process, known as reduction, is crucial for tasks such as signal processing [1], [2], data compression [3], [4], scientific computing [5], [6], and neural networks [7], [8]. Among the various types of reduction operations, accumulation, which involves summing a dataset to produce a single result, is one of the most essential. As data complexity and scale increase, high-performance accumulation methods that handle large datasets efficiently become increasingly important.

Accumulation operations can be applied to both floating-point and integer data. For integer data, the use of a 3:2 compressor simplifies the process by reducing latency, making integer accumulation straightforward. However, floating-point accumulation presents more complex challenges, necessitating pipelined adders to manage high data input rates. Pipelining, while essential for maintaining throughput in high-speed applications, introduces design complexities such as managing multiple intermediate sums and potential delays between consecutive datasets.

Table I depicts an example of a floating-point adder-based accumulator with a pipeline latency of 3 clock cycles, while new input values are fed in every cycle. Using a simple accumulation schedule (SimplePAC), we end up with 3 subsums, 3 cycles after all data inputs are fed to the adder. The subsums appear as a0,3, a1,4, and a2,5 in the SimplePAC out column of Table I. The simplest approach to accumulation would be not to allow a new dataset (denoted b) until the last addition of the current dataset has been fed into the adder pipeline, i.e., to introduce stalls between consecutive datasets. As a result, SimplePAC does not accept back-to-back datasets. Using a more elaborate schedule, we can alternate between additions from different sets to maintain a fully pipelined accumulation circuit. Such a schedule is shown in the JugglePAC columns of Table I, which start accumulating the consecutive dataset without requiring any stalls.

This work presents JugglePAC, a novel fully pipelined accumulation circuit that optimizes both area and timing using a single floating-point adder. We implement JugglePAC and evaluate its performance across multiple FPGAs, benchmarking it against state-of-the-art solutions. The major contributions of this work are:

1) We propose a fully pipelined accumulation circuit, namely JugglePAC, which efficiently handles high-speed, back-to-back variable-length datasets.
2) JugglePAC demonstrates improvements over existing solutions by achieving higher frequencies and reducing area complexity.
3) Our work introduces a dynamic scheduling mechanism for variable-length datasets, addressing the challenges associated with inefficient control logic in existing designs.
4) We implement JugglePAC on two different target FPGAs, specifically the Xilinx XC2VP30 and XC5VLX110T, showing consistent improvements in both area and timing over the state-of-the-art.
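The SimplePAC schedule is easy to reproduce in software. The following behavioral sketch (illustrative Python, not RTL; the function name and default latency are ours) feeds one value per cycle into a latency-p adder and adds each arriving value to whatever emerges from the pipeline. For the six-element dataset of Table I, it leaves the three interleaved subsums that still require a final reduction:

```python
from collections import deque

def simplepac_subsums(data, latency=3):
    """Behavioral sketch of SimplePAC (not RTL): one value arrives per
    cycle and is added to whatever emerges from the adder pipeline."""
    pipe = deque([None] * latency)   # the adder's pipeline stages
    for x in data:
        out = pipe.popleft()         # result emerging this cycle (or None)
        pipe.append(x if out is None else out + x)
    return [s for s in pipe if s is not None]

# With a0..a5 = 1..6, three partial sums (a0+a3, a1+a4, a2+a5) remain:
print(simplepac_subsums([1, 2, 3, 4, 5, 6]))   # prints: [5, 7, 9]
```

Reducing those leftover subsums through the same pipeline is exactly what forces the stalls shown in the SimplePAC half of Table I.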
II. Related Work

Recent work on floating-point accumulation circuits has focused on optimizing area, performance, and complexity. Early designs, such as Luo and Martonosi [9], used carry-save arithmetic and delayed adders for high-performance accumulation but were not fully pipelined, leading to performance bottlenecks due to the required stalls.

Vangal [10] improved on this by introducing a pipelined structure for single-precision floating-point multiply-accumulate operations, yet managing the pipeline complexity remained a challenge. He et al. [11] proposed a group alignment algorithm to improve accuracy but struggled with scalability and efficiency for variable-length datasets.

Nagar and Bakos [12] developed a double-precision accumulator with a coalescing reduction circuit, reducing complexity but limited by its FPGA-specific design. Zhuo et al. [13] introduced several designs, including the Fully Compacted Binary Tree (FCBT) and Dual Striped Adder (DSA), which managed multiple input sets but faced issues with buffer management and clock speed.

Huang and Andrews [14] proposed modular, fully pipelined architectures capable of handling arbitrary dataset sizes, though their designs faced challenges with underutilized pipelined adders and extensive buffering.

More recent designs [15]–[19] aimed at balancing area and timing performance. The design of [19] notably improved the area-timing product but required multiple BRAMs, impacting area efficiency.

In contrast, JugglePAC offers a novel approach with a fully pipelined architecture that simplifies control logic and handles variable-length datasets efficiently. Its dynamic scheduling mechanism ensures high frequency and low area complexity, outperforming previous designs in scalability and adaptability across different FPGA architectures.

III. JugglePAC

JugglePAC is a novel floating-point accumulation circuit designed to optimize performance and area. This section describes the microarchitecture of JugglePAC and its inter-dataset behavior, as well as key challenges such as the ability to mix variable-length datasets.

A. JugglePAC Microarchitecture

The JugglePAC architecture is built around a floating-point adder, which plays a crucial role in the accumulation process. The fundamental concept of JugglePAC involves scheduling the additions of serially arriving data to achieve efficient accumulation. Given that data typically arrives serially, the adder operates with a throughput of 1, performing input-pair additions every 2 cycles. This setup ensures that the adder is utilized 50% of the time by serial inputs, while producing results at least every 2 cycles. Consequently, a single adder is sufficient to keep pace with the input rate when additions are scheduled effectively. JugglePAC employs a state machine with two distinct states to manage the addition process. In the first state, the inputs are directly added. In the second state, the design handles the addition of any available pair of subsums, as illustrated in Fig. 1. The state machine alternates between these two states every cycle, with an exception for datasets of odd length, where the state machine remains in state 1 for an additional cycle.

To enhance performance for high-throughput applications, JugglePAC is designed to handle back-to-back inputs, thereby avoiding data pile-ups. This approach introduces additional complexity, as it necessitates the simultaneous processing of subsums from different datasets. To manage this, each subsum is assigned a unique label, an integer that increments with each new dataset (color-coded green in Fig. 1). The labeling is maintained using a shift register with a latency equal to that of the adder, as depicted in Fig. 1 and color-coded purple. The Matching Shift Register block matches subsums from the same dataset and distinguishes between those from different datasets. An additional pipeline stage, represented by dashes in the figure, is added before the inputs of the adder to increase the throughput.

Scheduling additions of subsums requires efficient control logic. The Pair Identifier (PI) block, shown in Fig. 1 and color-coded yellow, is responsible for managing this process. The PI receives the results from the adder along with their labels and schedules the additions accordingly. It uses a register for each label, storing incoming subsums and identifying pairs for addition. When a pair is identified, the PI schedules the addition and clears the corresponding register. This control logic allows JugglePAC to juggle between additions from different datasets while minimizing area requirements. The number of registers in the PI depends on the label size, with L representing the maximum number of labels. To handle situations where the adder may not always be available for scheduled additions, JugglePAC employs a FIFO buffer to temporarily store ready-to-add subsum pairs along with their labels. The FIFO is read every 2 cycles and has a maximum depth of ⌈log(p)⌉, where p is the adder's latency. The FIFO is built from registers, which keeps its data management efficient.

The state machine within JugglePAC manages the accumulation process while maintaining a low area and timing complexity. It alternates between adding serial input data and processing data from the FIFO. A sample scheduling scenario using the JugglePAC approach is illustrated in Table I.
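A cycle-level software model makes the two-state schedule concrete. The following behavioral sketch (illustrative Python, not RTL) models a latency-p adder pipeline, a Pair Identifier holding one unmatched subsum, and a FIFO of ready pairs. It deliberately simplifies the real design: labels are omitted, a single dataset is accumulated, and the FIFO is read whenever a pair is available rather than strictly every 2 cycles:

```python
from collections import deque

def jugglepac_sum(data, latency=3):
    """Behavioral sketch of JugglePAC for one dataset (not RTL).

    One value arrives per cycle. On the cycle that completes an input
    pair, the pair is fed to the adder (state 0); otherwise a ready
    pair of subsums is fed from the FIFO (state 1)."""
    if len(data) < 2:
        return (data[0] if data else 0), 0
    total_adds = len(data) - 1       # a reduction of n values needs n-1 adds
    pipe = deque([None] * latency)   # the adder's pipeline stages
    fifo = deque()                   # ready-to-add subsum pairs
    pending = None                   # Pair Identifier: one unmatched subsum
    inputs = deque(data)
    in_buf = None                    # first value of an input pair
    done = cycle = 0

    def stash(s):                    # pair a finished subsum, or hold it
        nonlocal pending
        if pending is None:
            pending = s
        else:
            fifo.append((pending, s))
            pending = None

    while True:
        cycle += 1
        out = pipe.popleft()         # result emerging from the adder
        if out is not None:
            done += 1
            if done == total_adds:   # last merge is the final sum
                return out, cycle
            stash(out)
        fed = None
        if inputs:
            x = inputs.popleft()     # one serial value per cycle
            if in_buf is None:
                in_buf = x
                if fifo:             # state 1: merge two subsums
                    a, b = fifo.popleft()
                    fed = a + b
            else:                    # state 0: add the completed pair
                fed, in_buf = in_buf + x, None
        else:
            if in_buf is not None:   # odd-length tail becomes a subsum
                stash(in_buf)
                in_buf = None
            if fifo:
                a, b = fifo.popleft()
                fed = a + b
        pipe.append(fed)
```

For the six-element dataset of Table I, `jugglepac_sum([1, 2, 3, 4, 5, 6])` returns the correct total without ever stalling the input stream; the same holds for odd-length datasets via the tail-promotion path.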
Table I demonstrates how the accumulation of dataset b0:N begins before the results of dataset a0:5 are completed. The simple control logic for the state machine and pair identification allows JugglePAC to maintain a low area complexity and a short critical path, outperforming the state-of-the-art.

For output identification, JugglePAC uses a counter to track the number of additions performed. The counter increments with each addition from state 1 and decrements with each addition from state 0. The system skips incrementing on the first addition of state 1, ensuring the counter returns to zero after each operation. The counter reaches a maximum value of ⌈(p+2)/2⌉, where p is the adder's latency, and returns to zero upon completing the final addition. Separate counters are used for each simultaneous accumulation, with L representing the maximum number of counters. The output identifier is shown in Fig. 1 and color-coded red.
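The two sizing quantities used by the design, the output-identifier counter's maximum ⌈(p+2)/2⌉ and the FIFO depth ⌈log(p)⌉, are cheap to check numerically. In the sketch below the latency value is hypothetical (deep FPGA floating-point adders commonly have on the order of 10 to 14 stages), and the logarithm is assumed to be base 2:

```python
import math

p = 14                                   # hypothetical adder pipeline latency
counter_max = math.ceil((p + 2) / 2)     # peak output-identifier counter value
fifo_depth = math.ceil(math.log2(p))     # FIFO slots, assuming a base-2 log
print(counter_max, fifo_depth)           # prints: 8 4
```

Both quantities grow slowly with p, which is why the control logic stays small even for deeply pipelined adders.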
Fig. 1. JugglePAC microarchitecture features a floating-point adder and key components including a state machine for managing additions, a Matching Shift Register for labeling subsums, and a Pair Identifier block for scheduling. A FIFO buffer ensures efficient data handling and prevents pile-ups, while an Output Identifier tracks the number of additions performed. This design optimizes throughput and minimizes area complexity.

B. Inter-Dataset Behavior

The JugglePAC architecture allows for the parallel processing of datasets, with the label size determining the maximum number of datasets that can be handled simultaneously. A smaller label size reduces area complexity but introduces a minimum dataset length requirement. When the number of datasets exceeds L, the circuit may mix accumulations from different datasets, leading to incorrect results. Therefore, the minimum dataset length is identified through comprehensive testing with variable-length datasets.

Increasing the label size enhances the circuit's capability to handle more datasets but also increases the logical resources required for the PI and output identification modules. This tradeoff between area complexity and minimum dataset length is summarized in Section IV, which shows the area complexity increasing with larger label sizes.
JugglePAC's dynamic accumulation approach results in latency that depends on the current and previous dataset lengths. For label sizes less than 3, the circuit maintains consistent latency, ensuring that the results are produced in input order. However, with larger label sizes, the minimum dataset length can lead to results being output in an order that deviates from the input sequence. Additional control logic can be used to reorder results based on the label or to output the label itself, allowing the system to identify the dataset to which each result belongs. Setting the minimum dataset length to 19 ensures that JugglePAC consistently produces ordered results. This design achieves a balance between high frequency and low area complexity, surpassing existing state-of-the-art solutions.

IV. Implementation Results

We evaluated the JugglePAC floating-point accumulation circuit against existing designs, focusing on area complexity, latency, throughput, and frequency. Table II summarizes these metrics for JugglePAC and other designs.

JugglePAC demonstrates a significant reduction in area complexity, using fewer slices than designs like MFPA and AeMFPA [14] and FAAC [15], with up to 71% less slice usage. Additionally, JugglePAC operates without BRAMs, unlike FCBT and DSA [13], contributing to its lower area complexity.

In terms of latency, JugglePAC performs competitively. For a label size of 2 and a minimum dataset length of 22, JugglePAC achieves a latency of approximately 1.077 µs, comparable to or better than most previous designs such as DSA and SSA [13]. Its throughput is also high, outperforming designs like DB [19] and BTTP [20], especially with larger datasets.

JugglePAC operates at a frequency of 208 MHz, exceeding many previous designs like FPACC [16] and FCBT [13]. It achieves the lowest "Slices×µs" score, reflecting superior efficiency in balancing area and performance.
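The "Slices×µs" column of Table II is simply the slice count multiplied by the worst-case latency in microseconds (cycle count divided by clock frequency). A quick bookkeeping check using the numbers from two of the JugglePAC rows:

```python
# Slice counts, frequencies (MHz), and worst-case cycle counts taken
# from the JugglePAC rows of Table II.
rows = [
    ("JugglePAC on XC2VP30 (label size 1)", 1439, 208, 220),
    ("JugglePAC on XC5VLX110T (label size 2)", 578, 334, 224),
]
for name, slices, f_mhz, cycles in rows:
    latency_us = cycles / f_mhz          # MHz = cycles per microsecond
    area_time = slices * latency_us      # the "Slices x us" metric
    print(f"{name}: {latency_us:.3f} us, {area_time:.0f} slices*us")
```

This reproduces the 1,522 and 388 area-time entries reported in Table II.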
TABLE II
Comparison with previously proposed accumulation circuits.
(Rows without an FPGA entry share the FPGA of the nearest entry above.)

Design        Label  Min. Dataset  Adders  Slices  BRAMs  Frequency  Total Latency       Slices×µs  FPGA
              Size   Length                               (MHz)      cycles   µs
MFPA [14]     -      -             4       4,991   2      207        198      0.957      4,776      XC2VP30
AeMFPA [14]   -      -             2       3,130   14     204        198      0.970      3,036
Ae2MFPA [14]  -      -             2       3,737   2      144        198      1.370      5,120
FAAC [15]     -      -             3       6,252   0      199        176      1.086      6,790
FCBT [13]     -      -             2       2,859   10     170        ≤475     ≤2.794     7,988
DSA [13]      -      -             2       2,215   3      142        232      1.634      3,619
SSA [13]      -      -             1       1,804   6      165        ≤520     ≤3.152     5,686
DB [19]       -      -             1       1,749   6      188        ≤199     ≤1.058     1,850
JugglePAC     1      74            1       1,439   0      208        ≤220     ≤1.058     1,522
JugglePAC     2      22            1       1,796   0      208        ≤224     ≤1.077     1,934
JugglePAC     3      10            1       2,343   0      208        ≤224     ≤1.077     2,523
FPACC [16]    -      -             -       683     -      247        -        -          -          VC5VSX50T
BTTP [20]     -      -             1       648     9.5    305        -        -          -          XC5VLX110T
JugglePAC     2      22            1       578     0      334        ≤224     ≤0.671     388
The full pipelining of JugglePAC enhances its performance and resource utilization, making it effective in high-speed operations. However, using a single floating-point adder might limit performance in cases requiring multiple adders. JugglePAC's minimum dataset length varies with label size, ensuring accurate accumulation but potentially limiting flexibility. Future work could explore the incorporation of multiple adders to address these limitations and enhance performance.

Overall, JugglePAC represents a significant advancement in floating-point accumulation circuits, with strong performance across key metrics. The results highlight its potential for various applications and provide a basis for future research and optimization.

V. Conclusion

Accumulation is a fundamental operation that appears in many types of computational workloads. In the case of high-throughput floating-point accumulation, complexities arise due to pipelining, especially when the system must handle consecutive datasets of varying lengths while producing results in the input order. Existing solutions often introduce significant overheads, resulting in a reduced clock frequency or increased area complexity. In this work, we have introduced JugglePAC, a novel fully pipelined reduction circuit designed to overcome these challenges. JugglePAC leverages a single floating-point adder and yet efficiently manages accumulation tasks. Implemented and evaluated across multiple FPGAs, JugglePAC consistently delivers superior results in area and timing simultaneously, while previous works deliver either competitive area or competitive timing, but not both at the same time.

References

[1] J. Smith, "Signal processing for large datasets," IEEE Trans. on Signal Processing, vol. 68, pp. 1234–1245, 2020.
[2] A. White and R. Miller, "Advanced techniques in signal processing for next-generation communication systems," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6453–6457.
[3] M. Johnson and S. Thompson, "Data compression techniques for big data applications," Journal of Data Sci., vol. 17, pp. 89–103, 2019.
[4] T. Miller and K. Davis, "Modern data compression: Algorithms and implementations," IEEE Trans. on Data Compression, vol. 10, pp. 234–245, 2020.
[5] C. Lee, "Scientific computing and its challenges in the era of big data," Computational Sci. Review, vol. 7, pp. 45–60, 2018.
[6] K. Thompson and B. Carter, "Computational methods in sci. and engineering: Advances and applications," Computational Sci. Review, vol. 35, 2022.
[7] A. Brown, "Neural networks and their applications in data reduction," Journal of Artif. Intel. Research, vol. 9, pp. 245–260, 2017.
[8] S. Davis and E. Wright, "Advances in neural networks for image recognition," in Proc. of the Conf. on Neural Information Processing Systems (NeurIPS), 2021, pp. 2456–2466.
[9] Z. Luo and M. Martonosi, "Accelerating pipelined integer and floating-point accumulations in configurable hardware with delayed addition techniques," IEEE Trans. on Computers, vol. 49, pp. 208–218, 2000.
[10] S. Vangal, Y. Hoskote, N. Borkar, and A. Alvandpour, "A 6.2-GFlops floating-point multiply-accumulator with conditional normalization," IEEE Journal of Solid-St. Circuits, vol. 41, pp. 2314–2323, 2006.
[11] C. He, G. Qin, M. Lu, and W. Zhao, "Group-alignment based accurate floating-point summation on FPGAs," in Proc. Int. Conf. Eng. Reconfig. Sys. and Algorithms (ERSA), vol. 6, 2006, pp. 136–142.
[12] K. K. Nagar and J. D. Bakos, "A high-performance double precision accumulator," in Int. Conf. on Field-Programmable Technology (FPT), 2009, pp. 500–503.
[13] L. Zhuo, G. R. Morris, and V. K. Prasanna, "High-performance reduction circuits using deeply pipelined operators on FPGAs," IEEE Trans. on Parallel and Dist. Sys., vol. 18, pp. 1377–1392, 2007.
[14] M. Huang and D. Andrews, "Modular design of fully pipelined reduction circuits on FPGAs," IEEE Trans. on Parallel and Dist. Sys., vol. 24, pp. 1818–1826, 2013.
[15] S. Sun and J. Zambreno, "A floating-point accumulator for FPGA-based high performance computing applications," in Proc. Int. Conf. on Field-Programmable Technology (FPT), 2009, pp. 493–499.
[16] T. Ould-Bachir and J.-P. David, "Performing floating-point accumulation on a modern FPGA in single and double precision," in Proc. IEEE Ann. Int. Symp. on Field-Programmable Custom Computing Machines (FCCM), 2010, pp. 105–108.
[17] Y. G. Tai, C. T. D. Lo, and K. Psarris, "An improved reduction algorithm with deeply pipelined operators," in Proc. IEEE Int. Conf. on Systems, Man and Cybernetics (SMC), 2009, pp. 3060–3065.
[18] ——, "Multiple data set reduction on FPGAs," in Proc. Int. Conf. on Field-Programmable Technology (FPT), 2010, pp. 45–52.
[19] ——, "Accelerating matrix operations with improved deeply pipelined vector reduction," IEEE Trans. on Parallel and Dist. Sys., vol. 23, pp. 202–210, 2012.
[20] L. Tang, Z. Huang, G. Cai, Y. Zheng, and J. Chen, "A novel reduction circuit based on binary tree path partition on FPGAs," Algorithms, vol. 14, p. 30, 2021.