BRAMAC: Compute-in-BRAM Architectures for Multiply-Accumulate on FPGAs
Yuzong Chen and Mohamed S. Abdelfattah
Department of Electrical and Computer Engineering, Cornell University
{yc2367, mohamed}@cornell.edu

arXiv:2304.03974v1 [cs.AR] 8 Apr 2023

Abstract—Deep neural network (DNN) inference using reduced integer precision has been shown to achieve significant improvements in memory utilization and compute throughput with little or no accuracy loss compared to full-precision floating-point. Modern FPGA-based DNN inference relies heavily on the on-chip block RAM (BRAM) for model storage and the digital signal processing (DSP) unit for implementing the multiply-accumulate (MAC) operation, a fundamental DNN primitive. In this paper, we enhance the existing BRAM to also compute MAC by proposing BRAMAC (Compute-in-BRAM Architectures for Multiply-Accumulate). BRAMAC supports 2's complement 2- to 8-bit MAC in a small dummy BRAM array using a hybrid bit-serial & bit-parallel data flow. Unlike previous compute-in-BRAM architectures, BRAMAC allows read/write access to the main BRAM array while computing in the dummy BRAM array, enabling both persistent and tiling-based DNN inference. We explore two BRAMAC variants: BRAMAC-2SA (with 2 synchronous dummy arrays) and BRAMAC-1DA (with 1 double-pumped dummy array). BRAMAC-2SA/BRAMAC-1DA can boost the peak MAC throughput of a large Arria-10 FPGA by 2.6×/2.1×, 2.3×/2.0×, and 1.9×/1.7× for 2-bit, 4-bit, and 8-bit precisions, respectively, at the cost of a 6.8%/3.4% increase in the FPGA core area. By adding BRAMAC-2SA/BRAMAC-1DA to a state-of-the-art tiling-based DNN accelerator, an average speedup of 2.05×/1.7× and 1.33×/1.52× can be achieved for AlexNet and ResNet-34, respectively, across different model precisions. Our code is available at: https://github.com/abdelfattah-lab/BRAMAC.
I. INTRODUCTION

Deep neural networks (DNNs) have become ubiquitous in many important fields such as computer vision, speech recognition, and natural language processing. However, a well-trained DNN model for complicated tasks has a huge model size, ranging from several hundreds of megabytes (e.g., AlexNet classifying ImageNet) to several hundreds of gigabytes (e.g., GPT-3 producing human-like text) [1]–[3]. Accordingly, many researchers have been exploring reduced numerical precisions to represent DNN model weights and activations, especially during inference, where reduced-precision arithmetic incurs little or no accuracy loss compared to full-precision floating-point (FP) [4], [5]. This low-precision property allows better utilization of on-chip memory and computation resources for improved performance. For example, Nvidia GPUs can obtain a 4-8× inference speedup using INT8 precision compared to FP32 precision [6], and an additional 1.6× speedup using INT4 precision compared to INT8 [7].

Meanwhile, FPGAs are becoming an increasingly popular platform for DNN acceleration due to their hardware programmability that enables customized datapaths and numerical bit-widths suitable for low-precision inference [8]–[11]. FPGA-based DNN accelerators heavily rely on block random access memory (BRAM) for model storage and digital signal processing (DSP) units for implementing multiply-accumulate (MAC)—the fundamental primitive in DNNs. Nevertheless, most FPGA vendors' DSP blocks do not natively support precisions lower than 18 bits, making them sub-optimal for implementing low-precision MAC [12]–[14]. For DNNs to better utilize the FPGA's on-chip resources, researchers have proposed novel DSP architectures for low-precision MAC [15], [16]. More recently, some works have proposed to add compute capability inside BRAMs and enable them to perform various Boolean and arithmetic operations [17], [18]. This computing in-memory (CIM) approach does not sacrifice the performance of existing logic resources on the FPGA but rather complements them to further boost the FPGA's computing throughput. In addition, CIM can reduce the routing associated with data movement between memory and logic units, hence saving energy and area. This is especially true in DNN accelerators where model parameters and activations are frequently transferred between BRAMs and DSPs to perform massive computations.

In this paper, we further enhance the FPGA's compatibility with low-precision DNNs by proposing BRAMAC, an efficient compute-in-BRAM architecture for multiply-accumulate. Unlike previous CIM architectures that compute directly on the main BRAM array [17], [18], BRAMAC first copies the data from the main BRAM array to an additional, separate memory array and then computes on this "dummy" array, which is a true dual-port BRAM with the same number of columns as the main BRAM array but only 7 rows. This 7-row dummy array can be accessed fast with low power consumption due to a much smaller parasitic load on its bitlines compared to the main BRAM array, which typically has >100 physical rows. Furthermore, the dummy array allows BRAMAC to function like a normal BRAM even during CIM operations—the main BRAM array's read and write ports are available for use by the application logic. Finally, BRAMAC is optimized for DNN MAC operations by performing shared-input multiplication and in-place accumulation. We enumerate our contributions below:

1) We propose new peripheral circuits that enable BRAMAC to compute two MACs (or one MAC2), P = (W1 I1 + W2 I2), simultaneously using a hybrid bit-serial & bit-parallel dataflow.
2) We propose two BRAMAC variants with different area-throughput trade-offs: BRAMAC with 2 synchronous dummy arrays (2SA) and BRAMAC with one double-pumped dummy array (1DA).
3) We design an embedded finite-state machine (eFSM) to free up the main BRAM ports during MAC2 computation and to allow simultaneous main BRAM access, thus enabling efficient tiling-based DNN acceleration.
4) We quantify the benefits of employing BRAMAC in a tiled FPGA DNN accelerator, which achieves up to 2.04× and 1.52× performance improvements for AlexNet and ResNet-34, respectively, over the baseline accelerator without BRAMAC.

Fig. 1: Top-level block diagram of BRAMAC modified from Intel's M20K BRAM. New circuit blocks are orange-shaded.

II. RELATED WORK

In this section, we discuss previous work that targeted efficient MAC implementation on FPGAs, including logic block, DSP, and BRAM enhancements.

A. Logic Block with Fast Arithmetic

To efficiently implement arithmetic operations in soft logic, modern FPGAs contain hardened adder circuitry in their logic blocks (LBs) [19]. These adders range from simple ripple-carry adders to more complex variants such as carry-bypass adders and carry-lookahead adders. In order to reduce the carry propagation delay, dedicated routing is used to propagate carry signals between different LBs. Inspired by the superior efficiency of adopting low precision in DNNs, recent research started to investigate adding more hardened arithmetic in LBs. For example, Boutros et al. [20] proposed three LB architectural enhancements to improve the performance of MAC implemented in soft logic. Their most promising proposal increases the MAC density by 1.7× while simultaneously improving the MAC speed.
B. Low-Precision DSP Architectures

Modern commercial FPGAs include DSP blocks that implement efficient multiplication with additional features such as pre-addition and accumulation commonly used in signal processing applications [19]. Nevertheless, most FPGA vendors' DSP multipliers have a minimum precision of 18-bit, making them less competitive in accelerating low-precision DNNs. To address this limitation, researchers have proposed new DSP architectures to support low-precision MAC. Boutros et al. [15] introduced an enhanced Intel DSP (eDSP) that supports four 9-bit or eight 4-bit multiplications without using additional routing ports. Rasoulinezhad et al. [16] presented a modified Xilinx DSP, called PIR-DSP, that can carry out six 9-bit, twelve 4-bit, or twenty-four 2-bit multiplications. Regarding industry DSP trends, the recent Xilinx Versal and Intel Agilex devices added support for 8-bit multiplication in their DSP blocks [21], [22]. In addition, Intel's latest Stratix-10 NX device added a new DSP block (called an AI tensor block) that contains 30 INT8 multipliers and can also be configured as 60 INT4 multipliers [23].

C. Computing In-BRAM

With the emergence of CIM to overcome the von-Neumann bottleneck [24], some FPGA researchers suggest augmenting existing BRAM architectures with compute capability. Wang et al. [17] proposed a compute-capable BRAM (CCB) that uses bit-serial arithmetic to enable a high degree of computation parallelism. However, the circuit implementation of CCB requires an additional voltage supply to mitigate the read-disturb issue associated with activating two word-lines from one BRAM port, which is challenging to implement in practice. Arora et al. [18] later designed a new compute-in-BRAM architecture called CoMeFa to overcome some limitations of CCB. CoMeFa also relies on bit-serial arithmetic but exploits the dual-port nature of BRAM to read out two operands from two ports, respectively, instead of activating two word-lines from one port, thus eliminating the read-disturb issue.

Both CCB and CoMeFa require a transposed data layout for bit-serial computation, i.e., each word occupies one column and multiple rows instead of one row and multiple columns as in a conventional data layout. However, transposing data is expensive in both latency and additional hardware cost (e.g., a swizzle module in CoMeFa) for online execution. Furthermore, these two BRAM architectures compute directly on the main BRAM array and receive the CIM instruction through a BRAM write port—this prevents tiling. As a result, these two works are limited to accelerating only persistent-style DNN inference, where the model weights are transposed offline and remain persistent in the on-chip memory. Different from CCB and CoMeFa, BRAMAC adopts a hybrid bit-serial & bit-parallel MAC dataflow that eliminates the requirement of a transposed data layout. In addition, BRAMAC doesn't compute on the main BRAM array, which is typically large and therefore slow and power-hungry. Rather, it copies the main BRAM's data to a special, separate dummy BRAM array for computation. This dummy array has only 7 rows and therefore can be accessed much faster compared to the main BRAM array. It can also free up the read and write ports of the main BRAM during CIM to allow tiling-based DNN acceleration.

III. BRAMAC ARCHITECTURE AND DATAFLOW

A. Overall Architecture
Fig. 1 shows the top-level block diagram of BRAMAC modified from Intel's M20K BRAM [25], with the added circuit blocks orange-shaded. The routing interface (i.e., input and output crossbar) of BRAMAC is the same as that of M20K. The main BRAM array's dimension is 128-row × 160-column, i.e., 20 kb memory capacity. The 4:1 column multiplexing feature of M20K is preserved. One additional SRAM cell is added to select one of the two operation modes of BRAMAC:

1) MEM: In this memory mode, the behavior of BRAMAC is identical to that of a conventional M20K. The input crossbar sends the address and data to portA and portB. For memory reads, the two addresses are decoded by the row and column decoders. The 40-bit BRAM output data from the sense amplifiers is sent to the output crossbar. For memory writes, the data is sent to the write drivers for updating the main BRAM.

2) CIM: This is the compute mode where BRAMAC can compute MAC2, P = (W1 I1 + W2 I2), using 2-bit, 4-bit, or 8-bit operand precision. The two groups of operands, (W1, W2) and (I1, I2), can be thought of as weights and inputs of a DNN in the remainder of this paper, respectively. At a high level, BRAMAC computes MAC2 by keeping weights inside BRAMAC while streaming inputs from outside.

The main BRAM is automatically configured as a simple dual-port memory with a maximum data width of 40-bit and a depth of 512 to maximize the read/write throughput. A special address (0xfff) is reserved and compared with the portA address, and if equal, the 40-bit portA data is treated as a CIM instruction. The CIM instruction contains two addresses for reading two 40-bit data from the main BRAM, respectively. Each 40-bit data is a vector that contains multiple low-precision W1/W2 elements. The configurable sign-extension mux sign-extends the 40-bit vectors to 160-bit before copying them to a dummy BRAM array, which is a 7-row × 160-column true dual-port BRAM without the column multiplexing feature. The CIM instruction also contains the two inputs, I1 and I2, and several control signals that are sent to an eFSM to trigger and control the MAC2 operation. The precision-configurable adder can read two 160-bit vectors from the dummy array, perform a single-instruction-multiple-data (SIMD) add, and write the sum back to the dummy array. Since the dummy array has the same number of columns as the main BRAM array, it can read out 40-bit data similar to the main BRAM. A 2-to-1 mux is added to select the data between the main BRAM and the dummy array.
B. Hybrid Bit-Serial & Bit-Parallel MAC Dataflow

Algorithm 1: Hybrid Bit-Serial & Bit-Parallel MAC2
Require: All numbers are integers in 2's complement
Input:   W ∈ Z^2, I ∈ Z^2, precision n ≥ 2
Output:  P ∈ Z
 1  Initialization P = 0
 2  for i = (n − 1) downto 0 do
 3      psum = W1 * I1[i] + W2 * I2[i]
 4      if i == (n − 1) then
 5          P = P + inv(psum) + 1
 6          P = P << 1
 7      else if i ≠ 0 then
 8          P = P + psum
 9          P = P << 1
10      else
11          P = P + psum
12  return P

BRAMAC computes 2's complement MAC2 by adopting a hybrid bit-serial & bit-parallel dataflow [26] as described in Algorithm 1. The for-loop in lines 2-11 iterates through the two inputs bit-by-bit. Each iteration involves multiplying the entire W1 and W2 by a single bit from I1 and I2, respectively, followed by a bit-parallel addition to obtain the partial sum (psum) as shown in line 3. If the current input bit is the most-significant bit (MSB), then psum is subtracted from P (line 5) since the MSB is negative in 2's complement representation. If the current input bit is not the least-significant bit (LSB), then P also needs to be shifted left by 1 bit after adding psum (lines 6, 9).
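To make the dataflow concrete, the following is a short Python reference model of Algorithm 1. This is our illustrative sketch, not the BRAMAC hardware or its released code; as a check, it reproduces the two displayed lanes of the 4-bit example in Fig. 4 (Section IV-A).

def mac2(w1, w2, i1, i2, n):
    """Reference model of Algorithm 1: P = W1*I1 + W2*I2 for n-bit
    2's complement operands, processing input bits serially (MSB
    first) while the weights are handled in parallel."""
    p = 0                                  # line 1: initialize P
    for i in range(n - 1, -1, -1):         # line 2: i = (n-1) downto 0
        # line 3: the input-bit pair selects 0, W1, W2, or (W1 + W2)
        psum = w1 * ((i1 >> i) & 1) + w2 * ((i2 >> i) & 1)
        if i == n - 1:
            p = p + ~psum + 1              # line 5: subtract, since the MSB is negative
            p <<= 1                        # line 6
        elif i != 0:
            p = p + psum                   # line 8
            p <<= 1                        # line 9
        else:
            p = p + psum                   # line 11
    return p                               # line 12

# The two displayed lanes of the 4-bit example in Fig. 4:
assert mac2(-5, 7, -7, 3, n=4) == 56
assert mac2(4, -3, -7, 3, n=4) == -37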
The hybrid bit-serial & bit-parallel MAC2 algorithm is efficient for computing matrix-vector multiplication (MVM), where the nth vector element is multiplied by all elements of the nth matrix column. To exploit this input-sharing in BRAMAC, two inputs are packed into the CIM instruction that is sent to BRAMAC, then multiplied by all elements of the corresponding two matrix columns copied to the dummy array, respectively. Copying a matrix column requires the weight matrix to be transposed so that matrix columns correspond to a BRAM row. This can be easily done offline for DNNs. Fig. 2 illustrates an example of using MAC2 to compute MVM where the matrix dimension is 8×6. For the first MAC2, the first and second matrix columns are copied from the main BRAM to the dummy array. Two vector elements I1 and I2 are streamed to BRAMAC through the CIM instruction and multiplied by all 8 elements of the first and second matrix columns to obtain 8 partial sums. For large matrices, the number of matrix elements that can be loaded to the dummy array depends on the MAC precision. Since the two read ports of the main BRAM have a total data width of 80-bit, they can copy ten 8-bit, twenty 4-bit, or forty 2-bit weights to a dummy array for one MAC2, providing a parallelism of 10, 20, or 40 MACs, respectively.

Fig. 2: Example of MAC2 to compute matrix-vector multiplication.
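For illustration, the MVM decomposition of Fig. 2 can be expressed in terms of the mac2 model above. This is again our sketch (mvm_via_mac2 is a hypothetical helper name), and we assume an even number of matrix columns for simplicity.

def mvm_via_mac2(W, x, n):
    """Map an R x C matrix-vector product onto MAC2 operations: each
    MAC2 pass consumes two matrix columns (held in the dummy array)
    and two vector elements (streamed through CIM instructions)."""
    R, C = len(W), len(x)
    y = [0] * R
    for c in range(0, C, 2):   # one MAC2 pass per pair of columns
        for r in range(R):     # in hardware, up to LANES rows run in parallel
            y[r] += mac2(W[r][c], W[r][c + 1], x[c], x[c + 1], n)
    return y

# Lanes per MAC2 set by the 80-bit weight copy (Section III-B):
LANES = {8: 10, 4: 20, 2: 40}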
C. Circuit Design to Support MAC2

We now describe the new circuit blocks in BRAMAC to support MAC2. These circuit blocks are shown in Fig. 3, including a dual-port "dummy" BRAM array (Fig. 3(a)), a configurable sign-extension mux (Fig. 3(b)), and a 160-bit SIMD adder implemented using 1-bit full adders with read/write circuits (Fig. 3(c)).

Fig. 3: BRAMAC circuit blocks for computing MAC2: (a) dual-port dummy BRAM array, (b) configurable sign-extension mux (one out of five identical blocks displayed), (c) 1-bit full adder with read/write circuits.

1) Dual-Port Dummy BRAM Array: The dual-port dummy BRAM array is 7-row × 160-column without column multiplexing, as shown in Fig. 3(a). Its SRAM cell is identical to that used in the main BRAM. Each column contains two sense amplifiers and two write drivers to allow true dual-port access. Its 1st row is hard-coded to always store 0. The 2nd and 3rd rows store the W1 and W2 vectors, respectively, that are copied from the main BRAM array. The 4th row stores a (W1 + W2) vector. The 5th Inverter row is used to store the temporary inverted psum required by the binary subtraction (line 5 of Algorithm 1). The 6th row stores the MAC2 result P. The 7th row is a wide Accumulator to accumulate multiple MAC2 results that form a large dot product.

The read and write operations of the dummy array are controlled by address and enable signals (blue signals in Fig. 3(a)) sent from the eFSM, as described in Section III-A2. The access to the 1st – 4th rows during MAC2 is managed by both the decoder logic and a 2-to-4 demux. The 2-bit selection signal of the demux comes from the current two processing bits of the two inputs I1 and I2, respectively. This allows calculating psum (line 3 of Algorithm 1) using a look-up table [27]. If {I2[i], I1[i]} is 2'b00, then the 1st zero row will be read out and added to the 6th row P. If {I2[i], I1[i]} is 2'b11, then the 4th row (W1 + W2) will be read out and added to P. If {I2[i], I1[i]} is 2'b01 or 2'b10, then the 2nd row W1 or the 3rd row W2 will be read out and added to P.

Since the dummy array copies data from the main BRAM array for computation, a coherency issue may arise where the main BRAM is being updated while the dummy array is still computing using the stale data. We leave it for the programmer/compiler to explicitly ensure memory coherency, similar to the explicit handling of the read-during-write behavior of Intel's BRAM [28].

2) Configurable Sign-Extension Mux: Although not reflected in Algorithm 1, the W1 and W2 vectors from the main BRAM need to be sign-extended before being copied to the dummy array in order to prevent overflow during MAC2. To support this, two configurable sign-extension muxes are added between the main BRAM and the dummy array. Each mux has five identical blocks, one of which is shown in Fig. 3(b). Since the main BRAM has a data width of 40 bits, it can copy five 8-bit, ten 4-bit, or twenty 2-bit elements to the dummy array simultaneously. Each of the five identical mux blocks can sign-extend one 8-bit element to one 32-bit element (blue crosses in Fig. 3(b)), two 4-bit elements to two 16-bit elements (green crosses in Fig. 3(b)), or four 2-bit elements to four 8-bit elements (red crosses in Fig. 3(b)). Moreover, since a 2/4/8-bit MAC2 only requires a maximum bit-width of 5/9/17 bits to store the result, the proposed sign-extension mux provides a higher bit-width than required by MAC2. This allows multiple sequential MAC2 results to be accumulated by adding the 6th row (that stores the MAC2 result P) and the 7th row (that stores the Accumulator) of the dummy array.
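A functional sketch of what one 40-bit word undergoes in the sign-extension mux (our software model of a purely combinational block; the function name is hypothetical):

def sign_extend_word(word40, prec):
    """Model of the configurable sign-extension mux: unpack a 40-bit
    word of prec-bit elements (prec in {2, 4, 8}) and sign-extend each
    to 4*prec bits, producing the 160-bit word written to the dummy
    array (8->32, 4->16, and 2->8 bit extension as in Fig. 3(b))."""
    assert prec in (2, 4, 8)
    out = 0
    for k in range(40 // prec):
        el = (word40 >> (k * prec)) & ((1 << prec) - 1)
        if el & (1 << (prec - 1)):     # negative element: extend with 1s
            el -= 1 << prec
        out |= (el & ((1 << (4 * prec)) - 1)) << (k * 4 * prec)
    return out                          # 160 bits in total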
3) Bit-Parallel SIMD Adder with Read/Write Circuits: The 160-bit SIMD adder in BRAMAC is designed using the conventional 1-bit full adder, as shown in Fig. 3(c). It supports bit-parallel SIMD addition by configuring itself as twenty 8-bit adders, ten 16-bit adders, or five 32-bit adders for 2-bit, 4-bit, and 8-bit MAC2, respectively, giving a worst-case delay equal to a 32-bit addition. The two operands A and B of the SIMD adder come from two sense amplifiers, SA1 and SA2, that compare the voltage differentials of two bit-line pairs, (BLA, BLbA) and (BLB, BLbB). To support the addition followed by a 1-bit shift-left operation (required in lines 6 and 9 of Algorithm 1), a write-back mux M1 before the write driver WD1 is used to select either the sum S from the current full adder or the sum from the right full adder SRight. M1 can also select ramA to copy the first data W1 from the main BRAM. Similarly, a write-back mux M2 before the write driver WD2 is used to select between three signals: B-bar to perform inverting, ramB to copy the second data W2 from the main BRAM, and 1'b0 to initialize either P (line 1) or the Accumulator. Both M1 and M2 are controlled by the eFSM.

IV. BRAMAC VARIANTS

A. BRAMAC with Two Synchronous Dummy Arrays (2SA)

This variant, called BRAMAC-2SA, has two synchronous dummy arrays that share the same clock domain as the main BRAM. In this architecture, each dummy array is fed by one port of the main BRAM during weight copy. Since BRAMAC intrinsically supports multiplying the same input with many weights, as discussed in Section III-B, this variant adopts an input-sharing approach to balance the data reuse. Specifically, in each MAC2 iteration, the two dummy arrays copy the same weights but process different inputs. The first dummy array receives two inputs I1, I2 and calculates W1 I1 + W2 I2, while the second dummy array receives another two inputs I3, I4 and calculates W1 I3 + W2 I4.
An example 4-bit MAC2 operation for one dummy array of BRAMAC-2SA is illustrated in Fig. 4. Note that we are displaying 2 out of 10 lanes with 10-bit sign-extension due to space limitation (instead of the 16-bit sign-extension described in Section III-C2). In Cycle 1 and Cycle 2, W1 and W2 are sign-extended and copied to the dummy array. During these two cycles, the two inputs for each dummy array are also sent to BRAMAC-2SA through the CIM instruction and latched for further processing. In Cycle 3, W1 and W2 are read out and added. The sum is written back to the 4th row to store (W1 + W2). Simultaneously, the 6th row P can also be initialized to zero. In Cycle 4, the MSB of the two inputs is streamed to the dummy array. The selected row W1 is inverted to prepare for the binary subtraction. In Cycle 5, Inverter is added to P. The sum is shifted left by 1 bit and written back to P. The input streaming continues to Cycle 8, where the LSB of the two inputs is processed and the correct MAC2 result P is obtained. In Cycle 9, P is added to the 7th Accumulator row. Then it can be initialized for the subsequent MAC2.

Fig. 4: Example operation of one dummy array in BRAMAC-2SA for 4-bit MAC2. We are displaying 2 out of 10 lanes with 10-bit sign extension instead of 16 bits (due to space limitation).

The above example indicates that BRAMAC-2SA can complete a 4-bit MAC2 using 9 cycles. However, during the write-back phase of the last two cycles, i.e., Cycle 8 and Cycle 9, the current two weights W1 and W2 are no longer needed in the dummy array since the current MAC2 result P is already obtained at the bit-parallel adder's output. As a result, these two cycles can also be used to copy the next two weights W3 and W4, respectively, as illustrated in Fig. 5(a). Therefore, the 4-bit MAC2 in BRAMAC-2SA only requires 7 cycles to complete. This pipelining can also be applied to 2-bit and 8-bit MAC2. The only difference between 2-bit, 4-bit, and 8-bit MAC2 is the number of cycles spent for processing every input bit, as described in lines 2-11 of Algorithm 1. Thus, 2-bit and 8-bit MAC2 take 5 and 11 cycles to complete, respectively.

Fig. 5: Pipeline diagram of 4-bit MAC2 in (a) BRAMAC-2SA and (b) BRAMAC-1DA.

B. BRAMAC with One Double-Pumped Dummy Array (1DA)

This variant, called BRAMAC-1DA, has only one dummy array to reduce the area overhead. Using one dummy array degrades the MAC throughput by 2× compared to BRAMAC-2SA; however, we propose to double-pump the dummy array with a 2× main BRAM clock frequency. Memory multi-pumping is a commonly used technique in FPGA design to improve the system throughput [29], [30]. The double-pumped dummy array doesn't add any additional area overhead compared to a synchronous dummy array. Rather, it only requires a separate clock routing during compilation.

Because the main BRAM and the dummy array only interact during weight copy, synchronization between them can be easily handled. Fig. 5(b) shows the pipeline diagram of 4-bit MAC2 for BRAMAC-1DA. In Cycle 1, the main BRAM reads out two weights W1 and W2. In the first half of Cycle 2, the dummy array copies W1 and W2 using its two write ports. Then the dummy array can compute the MAC2 using the same operation flow as BRAMAC-2SA, except that every cycle in BRAMAC-2SA is now half a cycle in BRAMAC-1DA. Similar to the pipelining optimization for BRAMAC-2SA, the main BRAM can start to read the next two weights W3 and W4 in Cycle 5 while the dummy array is computing. As a result, the 4-bit MAC2 can be completed using 4 cycles. This pipelining can also be applied to 2-bit and 8-bit MAC2. Hence, 2-bit and 8-bit MAC2 take 3 and 6 cycles to complete, respectively.

C. Embedded FSM to Free Up BRAM Ports

Since the dummy array's behavior is deterministic for computing MAC2, we propose to control it using an eFSM. This eFSM receives a CIM instruction to trigger the MAC2 computation and control the dummy array's read/write access. The CIM instruction is only required when the main BRAM needs to send data to the dummy array (indicated by the red boxes in Fig. 5). As a result, the main BRAM is busy for 2 cycles in BRAMAC-2SA and 1 cycle in BRAMAC-1DA. When the main BRAM is idle, it can perform normal read operations to feed LBs/DSPs or write operations to load the next tile of weights from off-chip DRAM, allowing tiling-based DNN acceleration. This is different from CCB and CoMeFa, whose BRAM ports are always busy during CIM.
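The pipelined per-MAC2 latencies quoted above (5/7/11 cycles for BRAMAC-2SA and 3/4/6 cycles for BRAMAC-1DA at 2/4/8-bit precision) fit a simple closed form. The formulas below are our own fit to those stated numbers, not an expression given by the design:

import math

def mac2_cycles(n, variant):
    """Pipelined MAC2 latency in main-BRAM cycles for n-bit precision."""
    if variant == "2SA":
        return n + 3                    # n input-bit cycles plus fixed overhead
    if variant == "1DA":                # double-pumped: two dummy-array
        return math.ceil((n + 3) / 2)   # operations per main-BRAM cycle
    raise ValueError(variant)

assert [mac2_cycles(n, "2SA") for n in (2, 4, 8)] == [5, 7, 11]
assert [mac2_cycles(n, "1DA") for n in (2, 4, 8)] == [3, 4, 6]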
Fig. 6(a) and (b) show the proposed CIM instruction format for BRAMAC-2SA and BRAMAC-1DA, respectively. For BRAMAC-2SA, bramRow and bramCol are combined to form one BRAM address during each copy operation. On the other hand, BRAMAC-1DA needs to receive two BRAM addresses at the same time. This is achieved by using two BRAM row addresses bramRow_1 and bramRow_2 with a shared column address bramCol.

Fig. 6: CIM instruction format for (a) BRAMAC-2SA: prec (bits 0-1), inType (2), reset (3), start (4), copy (5), w1_w2 (6), done (7), bramRow (8-14), bramCol (15-16), input_1 (17-24), input_2 (25-32); and (b) BRAMAC-1DA: prec (0-1), inType (2), reset (3), start (4), copy (5), done (6), bramRow_1 (7-13), bramRow_2 (14-20), bramCol (21-22), input_1 (23-30), input_2 (31-38).
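As an illustration of the Fig. 6(a) layout, a hypothetical packer for the BRAMAC-2SA instruction word (field widths read off the figure; the hardware consumes this as the 40-bit portA write data):

def pack_cim_2sa(prec, in_type, reset, start, copy, w1_w2, done,
                 bram_row, bram_col, input_1, input_2):
    """Pack the BRAMAC-2SA CIM instruction fields of Fig. 6(a),
    LSB first: prec(2) inType(1) reset(1) start(1) copy(1) w1_w2(1)
    done(1) bramRow(7) bramCol(2) input_1(8) input_2(8) = 33 bits."""
    fields = [(prec, 2), (in_type, 1), (reset, 1), (start, 1), (copy, 1),
              (w1_w2, 1), (done, 1), (bram_row, 7), (bram_col, 2),
              (input_1, 8), (input_2, 8)]
    word = pos = 0
    for value, width in fields:
        assert 0 <= value < (1 << width), "field out of range"
        word |= value << pos
        pos += width
    return word    # fits easily in the 40-bit portA data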
The two BRAMAC variants share some common control signals. The 2-bit prec specifies one of the three supported MAC2 precisions. The inType is used to indicate whether the two inputs are signed or unsigned. If the inputs are unsigned, then the inverting cycle can be skipped to improve performance. The reset resets the dummy array to the initial state and writes zero to its accumulator. When start is enabled, BRAMAC is triggered to perform MAC2. The copy tells BRAMAC to copy the data read from the main BRAM to the dummy array, and an additional w1_w2 signal is needed for BRAMAC-2SA to indicate whether the currently copied data is W1 or W2. These two signals also allow the efficient pipelining optimization in Fig. 5, where the weight copy of the next MAC2 can be overlapped with computing the current MAC2. The done indicates whether to read out the dummy array's accumulator. When it's enabled, the bramCol is used to select 40-bit data from the dummy array's accumulator row every cycle. As a result, between every two dot products, the main BRAM needs to be busy for 8 and 4 cycles to read out the accumulator in BRAMAC-2SA and BRAMAC-1DA, respectively. However, as the dummy array's accumulator has a size of 8/16/32-bit for 2/4/8-bit MAC precisions, it can process a maximum dot product size of 16/256/2048 before being read out to amortize this cost.
V. CIRCUIT-LEVEL EVALUATION

A. Tools and Baseline FPGA

We use COFFE [31], an automatic FPGA transistor sizing tool, to model and optimize the area and delay of all BRAMAC components except for the eFSM, which is implemented in SystemVerilog to verify its functionality. We use Synopsys Design Compiler with TSMC 28-nm technology to synthesize the eFSM and obtain its area, which is 137 µm² for BRAMAC-2SA and 81 µm² for BRAMAC-1DA after scaling to 22-nm. We get the area of an M20K block from COFFE by interpolating between 16 kb and 32 kb BRAMs. For delay estimation, COFFE runs HSPICE simulations using the 22 nm Predictive Technology Model [32].

For the baseline FPGA in the remainder of this paper, we use an Arria-10 GX900 device [33] at the fastest speed grade (10AX090H1F34E1SG), whose resource information is shown in Table I. The Arria-10 device family is fabricated using 20-nm technology, similar to COFFE's simulation setup. The area ratio for each resource type is estimated based on the area model in [34]. The proposed BRAMAC architecture enhances the baseline FPGA by replacing all M20K blocks with either BRAMAC-2SA or BRAMAC-1DA.

TABLE I: Resource Counts and Area Ratio of the Baseline Arria 10 GX900 FPGA

Resource             Count   Area Ratio
Logic Blocks (LBs)   33920   70.4%
DSP Units            1518    9.5%
BRAMs (M20K)         2423    20.1%

B. Design Choice for Adder

As the SIMD adder in BRAMAC has a worst-case delay of a 32-bit addition during 8-bit MAC2, a ripple-carry adder (RCA) can significantly increase the critical path delay of the dummy array and become the frequency bottleneck of BRAMAC. Hence, we also explore two variants of fast adders [35]: a Carry Lookahead Adder (CLA) with a 4-bit carry lookahead generator using a mirror implementation, and a Carry Bypass Adder (CBA) with a 4-bit Manchester carry chain using dynamic logic. We use COFFE to automatically size the carry-out generator, the carry lookahead generator, and the Manchester carry chain to obtain the best area-delay trade-off for RCA, CLA, and CBA, respectively.

Fig. 7: Comparison between RCA, CBA, and CLA: (a) delay vs. precision, (b) area and power at 32-bit precision.

Fig. 7 illustrates the performance, area, and power of the three adders RCA, CBA, and CLA based on COFFE simulations. As shown in Fig. 7(a), the performance gap between RCA and the two fast adders CBA/CLA becomes larger as the adder precision increases. At the highest adder precision, i.e., 32-bit accumulation during 8-bit MAC, RCA has a delay of 393.6 ps, which is 2.8× slower than CBA (139.6 ps) and 2.5× slower than CLA (157.6 ps). As illustrated in Fig. 7(b), all three adders have similar areas, but CBA has the highest power consumption of 50.2 µW, which is 4.44× and 2.86× higher than RCA (11.3 µW) and CLA (17.6 µW), respectively. This is because CBA employs the dynamic Manchester carry chain, which is faster but more power-hungry than static CMOS logic. Overall, CLA has the best trade-off between delay, area, and power. Hence, we adopt CLA in BRAMAC for the remainder of our evaluation.
C. BRAMAC Area and Frequency

Fig. 8(a) illustrates the area breakdown of BRAMAC's dummy array. The total area of a dummy array is 975.6 µm², which represents an area increase of 16.9% compared to the baseline M20K. Since M20K constitutes 20.1% of the area of the baseline FPGA, this area overhead is equivalent to only a 3.4% increase in the FPGA core area. Note that we ignore the area overhead of the eFSM in our later evaluation because COFFE's area model doesn't include any BRAM control logic and some M20K components such as error correction circuits [25]. Given that the eFSMs of BRAMAC-2SA/BRAMAC-1DA are equivalent to only 1.4%/2.4% of the baseline M20K area, it's expected that the area overhead of BRAMAC doesn't change compared to the baseline M20K when a more accurate area model is adopted.

Fig. 8: (a) Area and (b) delay breakdown of the dummy array.

Fig. 8(b) shows the critical path delay breakdown of BRAMAC's dummy array. With only 7 rows, the dummy array's bitline parasitic load is significantly reduced compared to the main BRAM. As a result, it can precharge and discharge much faster, giving a critical path delay of less than 1 ns. This suggests that the dummy array itself is able to run at a maximum frequency (Fmax) of 1 GHz, independent from the M20K, whose Fmax is 730 MHz in Arria-10 [28]. For BRAMAC-1DA, this limits the Fmax of M20K to 500 MHz in CIM mode. While this is less than the typical BRAM Fmax, realistic FPGA delays are usually constrained by soft logic and routing, and it is unlikely that a design on Arria-10 will achieve a frequency higher than 500 MHz. For BRAMAC-2SA, the critical path occurs during the weight copy, where the write-back phase can only start after reading out data from the main BRAM. Hence, the Fmax of BRAMAC-2SA is dependent on M20K. Specifically, the dummy array's write driver has a delay of 165 ps, leading to a 1.1× lower Fmax compared to the baseline M20K.
D. Comparison with Other MAC Architectures on FPGA

We compare BRAMAC with other state-of-the-art architectures for MAC on FPGA, including eDSP [15], PIR-DSP [16], CCB [17], and CoMeFa [18]. All architectures use the same baseline Arria-10 FPGA as described in Section V-A. Each architecture replaces the corresponding FPGA block in the baseline with its proposed new block. The key features of each studied architecture are summarized in Table II.

TABLE II: Key Features of BRAMAC and Prior State-of-the-art MAC Architectures for FPGA

                                      eDSP [15]  PIR-DSP [16]  CCB [17]   CoMeFa-D [18]  CoMeFa-A [18]  BRAMAC-2SA  BRAMAC-1DA
Modified FPGA Block                   DSP        DSP           BRAM       BRAM           BRAM           BRAM        BRAM
Supported MAC Precision (-bit)        4, 8       2, 4, 8       Arbitrary  Arbitrary      Arbitrary      2, 4, 8     2, 4, 8
Area Overhead (Block)                 12%        28%           16.8%      25.4%          8.1%           33.8%       16.9%
Area Overhead (Core)                  1.1%       2.7%          3.4%       5.1%           1.6%           6.8%        3.4%
Clock Period Overhead (Block)         0%         30%           60%        25%            150%           10%         46%
# MACs in Parallel / Latency¹  2-bit  8 / 1      24 / 1        160 / 16   160 / 16       160 / 16       80 / 5      40 / 3
                               4-bit  8 / 1      12 / 1        160 / 42   160 / 42       160 / 42       40 / 7      20 / 4
                               8-bit  4 / 1      6 / 1         160 / 113  160 / 113      160 / 113      20 / 11     10 / 6
Design Complexity                     Very Low   Very Low      High       Low            Medium         Low         Medium

¹ For DSP architectures, the accumulator size for each MAC precision is the same as that in the baseline DSP. For BRAM architectures, the accumulator sizes for 2-bit, 4-bit, and 8-bit MACs are 8-bit, 16-bit, and 27-bit, respectively. The MAC latency is reported based on unsigned multiplication for CCB and CoMeFa, and 2's complement multiplication for BRAMAC.

Due to bit-serial arithmetic, CCB and CoMeFa have the highest flexibility in the supported precision. However, their proposed bit-serial algorithms for fixed-point multiplication only work for unsigned numbers, while eDSP, PIR-DSP, and BRAMAC can support 2's complement MAC. Although BRAMAC-2SA has the highest area overhead, it achieves the highest frequency compared to the other BRAM architectures. The two DSP architectures have the lowest design complexity as they can be implemented in a digital CAD flow, while BRAM design typically involves analog components and manual layout effort [31]. Among all BRAM architectures, CCB has the highest design complexity as it needs an extra voltage supply. CoMeFa-A and BRAMAC-1DA have medium design complexity since they require novel timing design techniques—sense amplifier cycling and a double-pumped clock, respectively.

VI. APPLICATION-LEVEL EVALUATION

A. Peak MAC Throughput Comparison

We compare the peak MAC throughput of the baseline FPGA with those of enhanced FPGAs that employ BRAMAC and the other MAC architectures studied in Section V-D. We consider three MAC precisions: 2-bit multiply (with an 8-bit accumulator), 4-bit multiply (with a 16-bit accumulator), and 8-bit multiply (with a 27-bit accumulator). The peak MAC throughput of each resource type is determined as follows:

(1) LB: We synthesize, place, and route one MAC unit using only LBs in Quartus to obtain its Fmax and resource utilization. We then follow the same methodology as [17], [18] to calculate the total MAC throughput by optimistically assuming that all LBs can be used at the same Fmax.

(2) DSP: The Arria-10 DSP has two 18×19 multipliers, each of which can implement one 8-bit MAC, two 4-bit MACs, or four 2-bit MACs using the DSP packing described in [36]. We run Quartus to generate a DSP in m18x18_sumof2 mode and find its Fmax to be 549 MHz. We use the same Fmax for eDSP but a 1.3× lower Fmax for PIR-DSP based on its reported Fmax.

(3) BRAM: We use Quartus to generate the baseline M20K in simple dual-port mode and find its Fmax to be 645 MHz. BRAMAC-2SA and BRAMAC-1DA would run at 586 MHz (1.1× lower) and 500 MHz, respectively, while CCB, CoMeFa-D, and CoMeFa-A would run 1.6×, 1.25×, and 2.5× slower, respectively, based on their reported Fmax degradation.
Fig. 9: Peak MAC throughput of different architectures for various MAC precisions: (a) 2-bit, (b) 4-bit, (c) 8-bit.

Fig. 9 shows the peak MAC throughput breakdown in TeraMACs/sec for the different architectures and MAC precisions. Compared to the baseline Arria-10 device, BRAMAC-2SA/BRAMAC-1DA can improve the peak throughput by 2.6×/2.1×, 2.3×/2.0×, and 1.9×/1.7× for 2-bit, 4-bit, and 8-bit MAC, respectively. Although CCB and CoMeFa can compute 160 MACs in parallel, they suffer from long-latency bit-serial arithmetic, leading to lower throughput than BRAMAC. Compared to the low-precision DSP architectures, BRAMAC-2SA can deliver higher MAC throughput across all precisions, while BRAMAC-1DA's throughput is only slightly lower than PIR-DSP for 8-bit MAC. Note that BRAMAC is an enhanced BRAM architecture and therefore doesn't preclude the use of eDSP or PIR-DSP on the same FPGA. The combination of BRAMAC and eDSP/PIR-DSP can further boost an FPGA's MAC throughput.
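The per-resource peak numbers follow from blocks × (parallel MACs ÷ latency) × Fmax. A rough illustration for the 8-bit case, using the Table I block counts and Table II entries (an approximation we add here, not the paper's exact tabulation):

def peak_tput(blocks, macs, latency, fmax):
    """Peak MAC throughput in MACs/sec for one resource type."""
    return blocks * macs / latency * fmax

dsp = peak_tput(1518, 2, 1, 549e6)     # baseline DSPs: ~1.7 TMAC/s
sa2 = peak_tput(2423, 20, 11, 586e6)   # added by BRAMAC-2SA: ~2.6 TMAC/s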

B. BRAM Utilization Efficiency for DNN Model Storage

Since BRAMAC computes MAC in a separate dummy array that is fully decoupled from the main BRAM, it can store a DNN model efficiently. Fig. 10 compares the BRAM utilization efficiency between BRAMAC, CCB, and CoMeFa for storing DNN models with different precisions from 2- to 8-bit. Here, utilization efficiency is defined as the effective capacity ratio of a BRAM that can be used to store weights. A higher utilization efficiency can store the DNN model using fewer BRAM blocks, saving both area and power consumption. For CCB, we examine two variants, CCB-Pack-2 and CCB-Pack-4, that map 2 and 4 sequential bit-serial MACs to the same BRAM column, respectively.

Fig. 10: Comparison of BRAM utilization efficiency for DNN model storage at different precisions.

BRAMAC can achieve 100% utilization for 2-bit, 4-bit, and 8-bit precisions. Other precisions can be stored in BRAMAC with lower efficiency by sign-extending them to 4-bit or 8-bit. Despite this, BRAMAC still achieves the highest average BRAM utilization efficiency, which is 1.3× and 1.1× better compared to CCB and CoMeFa, respectively. This is because CCB and CoMeFa use extra BRAM space to store temporary products and partial sums, while BRAMAC stores temporary results only in the dummy array. For CCB, a higher packing factor computes more sequential MACs before a slow in-memory reduction, giving higher performance at the cost of more BRAM usage to save a copy of the input vector. On the other hand, CoMeFa offers a one-operand-outside-RAM mode that streams the input vector, avoiding a copy to BRAM, which improves utilization efficiency compared to CCB.
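The storage-efficiency rule for BRAMAC described above can be summarized as follows (our formulation of the stated behavior):

def bramac_storage_efficiency(prec):
    """Fraction of BRAM capacity usable for weights: natively supported
    precisions (2/4/8-bit) reach 100%; other precisions are sign-extended
    to the next supported width."""
    for native in (2, 4, 8):
        if prec <= native:
            return prec / native
    raise ValueError("precisions above 8 bits are not supported")

assert bramac_storage_efficiency(4) == 1.0
assert bramac_storage_efficiency(5) == 5 / 8   # a 5-bit model stored as 8-bit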
C. Performance Improvement over CCB and CoMeFa

We use general matrix-vector multiplication (GEMV) to benchmark and compare the application performance of BRAMAC, CCB, and CoMeFa. We choose BRAMAC-1DA for this experiment because it has a similar BRAM area and frequency overhead as CCB/CoMeFa. We assume that there is only one BRAM block available to perform the computation. This approach captures the performance of an architecture normalized to BRAM utilization. We consider both persistent and non-persistent (tiling-based) computations, which exclude and include the cycles needed for loading the matrix data into the single BRAM block, respectively. Since the data mapping and computation flow of the three studied architectures are deterministic, we use a detailed analytical model to map a given GEMV workload to each architecture and count the number of cycles required. In addition to the latency of MAC, our analytical model accounts for the latency associated with copying the input vector and reading out the accumulation results in each architecture.

Fig. 11: Speedup (based on cycles) of BRAMAC-1DA over CCB/CoMeFa for GEMV with different matrix sizes, precisions, and computation styles.

Fig. 11 illustrates the speedup of BRAMAC-1DA over CCB and CoMeFa when performing GEMV with different matrix sizes, precisions (2-bit, 4-bit, 8-bit), and computation styles (persistent and non-persistent). Overall, BRAMAC-1DA achieves up to 3.3×/2.8×/2.4× (and 4.1×/3.4×/2.8×) speedups for 2/4/8-bit persistent (and non-persistent) GEMV. At the same precision, BRAMAC-1DA achieves higher speedup for non-persistent computation thanks to its eFSM, which allows loading the next matrix tile while computing on the current tile. Regarding different precisions, the speedup of BRAMAC-1DA decreases as the precision increases. This is because a higher precision directly reduces the computation parallelism of BRAMAC-1DA by 2×, and it takes more cycles to process more input bits. On the other hand, CCB/CoMeFa only sacrifice latency but not parallelism at higher precision. Nevertheless, BRAMAC-1DA still achieves better performance in all cases due to its overall MAC throughput improvement over CCB/CoMeFa, as discussed in Section VI-A. Note that CCB/CoMeFa's bit-serial algorithms for fixed-point multiplication only support unsigned numbers. It's expected that they would require much higher latency when supporting 2's complement MAC.
BRAMAC to Intel’s Deep Learning Accelerator (DLA) [9],
Section VI-A. Note that CCB/CoMeFa’s bit-serial algorithms [10] and develop a cycle-accurate simulator to model DLA
for fixed-point multiplication only support unsigned numbers. in both the baseline FPGA and the enhanced FPGA with
It’s expected that they require much higher latency when BRAMAC (which we call DLA-BRAMAC). The original
supporting 2’s complement MAC. DLA is designed to accelerate convolutional neural networks
Along the matrix row size, the speedup of BRAMAC- (CNNs) as shown in Fig. 12(a). It has a processing element
1DA is mainly affected by the vectorization efficiency, and (PE) array organized in a 1D systolic structure, a stream buffer
this effect is more pronounced at a lower precision. For to store input and output features, and a filter cache to store
example, consider the 2-bit persistent case in Fig. 11(a), where weights. It can be parameterized by Cvec, Qvec, and Kvec
BRAMAC-1DA can compute 20 outputs simultaneously. If the which represent the computation parallelism per cycle in input
matrix row size is 64, i.e., the first column in Fig. 11(a), then depth, output width dimension, and output depth, respectively
at least 4 iterations are required to compute an output vector as illustrated in Fig. 12(b). For DLA-BRAMAC, the stream
of size 64, with only 64/80 = 80% useful computation in buffer can send different input features to the PE array and the
BRAMAC-1DA. On the other hand, if the matrix row size BRAMAC-based filter cache simultaneously as shown in Fig.
is 160, i.e., the fourth column in Fig. 11(a), then the output 12(c). In this way, BRAMAC can complement the PE array
vector divides perfectly into 8 iterations at 100% efficiency, to calculate different outputs along the Qvec dimension.
thus giving better speedup as indicated by the darker color of
the fourth column compared to the first column. Similar trends Similar to the approach used in the original DLA [9], we
exist in 4-bit and 8-bit cases but are less pronounced. conduct design space exploration to find the optimal DLA
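This efficiency argument generalizes as follows (our helper, reproducing the 64/80 and 160/160 cases above):

import math

def vectorization_efficiency(rows, lanes):
    """Useful-computation fraction when `rows` outputs are tiled onto
    `lanes` parallel MAC lanes (Section VI-C)."""
    return rows / (math.ceil(rows / lanes) * lanes)

assert vectorization_efficiency(64, 20) == 0.8    # first column of Fig. 11(a)
assert vectorization_efficiency(160, 20) == 1.0   # fourth column of Fig. 11(a)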
Along the matrix column size, the speedup of BRAMAC-1DA is determined by not only the vectorization efficiency but also the achievable packing factor of CCB/CoMeFa. For example, consider the 8-bit non-persistent case in Fig. 11(f). If the matrix column size is 480, i.e., the top row in Fig. 11(f), then CCB/CoMeFa can perform 3 sequential MACs on the same BRAM column before a slow in-memory reduction, amortizing the reduction's latency cost. On the other hand, if the matrix column size is 128, i.e., the bottom row in Fig. 11(f), then a reduction is necessary for CCB/CoMeFa after every bit-serial MAC, resulting in much longer latency. On the contrary, BRAMAC's dummy array doesn't require a special reduction operation. Rather, it performs in-place accumulation at the end of every MAC2.
D. Case Study: Employing BRAMAC in Intel's DLA

To demonstrate the feasibility of BRAMAC for tiling-based DNN inference with non-persistent weight storage, we employ BRAMAC in Intel's Deep Learning Accelerator (DLA) [9], [10] and develop a cycle-accurate simulator to model DLA on both the baseline FPGA and the enhanced FPGA with BRAMAC (which we call DLA-BRAMAC). The original DLA is designed to accelerate convolutional neural networks (CNNs), as shown in Fig. 12(a). It has a processing element (PE) array organized in a 1D systolic structure, a stream buffer to store input and output features, and a filter cache to store weights. It can be parameterized by Cvec, Qvec, and Kvec, which represent the computation parallelism per cycle in the input depth, output width, and output depth dimensions, respectively, as illustrated in Fig. 12(b). For DLA-BRAMAC, the stream buffer can send different input features to the PE array and the BRAMAC-based filter cache simultaneously, as shown in Fig. 12(c). In this way, BRAMAC can complement the PE array to calculate different outputs along the Qvec dimension.

Fig. 12: DLA's (a) architecture and (b) computation parallelism across different axes for CNNs. (c) The architecture of DLA-BRAMAC (with one PE shown).
Similar to the approach used in the original DLA [9], we conduct design space exploration to find the optimal DLA and DLA-BRAMAC configurations (i.e., Cvec, Qvec, and Kvec) for two popular CNN models: AlexNet and ResNet-34. Our analytical model is set to optimize the target function perf × (perf/area) to balance performance and area cost. It assumes that all multipliers are implemented using DSPs, and each DSP can pack one 8-bit, two 4-bit, or four 2-bit multiplications using the DSP-packing technique in [36]. For the area modeling, we use the DLA area model from [9] to estimate the number of DSPs and BRAMs required for a specific configuration. We ignore the number of ALMs in our area modeling since they are mainly used to implement non-compute-intensive operations and are expected to be similar in DLA and DLA-BRAMAC. To evaluate the performance, our cycle-accurate simulator accounts for the latency associated with the MAC2 computation and the dummy array's accumulator readout. Note that BRAMAC's eFSM can effectively pipeline adjacent MAC2 operations to hide the latency of the weight copy, except for the first MAC2 of every CNN layer, where an additional 2 cycles are required to start the initial weight copy. However, this overhead is negligible given that each CNN layer takes thousands of cycles to complete.
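The selection criterion can be written as a one-line score. The candidate list below is hypothetical; in the actual flow, perf comes from the cycle-accurate simulator and area from the DLA area model [9]:

from collections import namedtuple

Config = namedtuple("Config", "qvec cvec kvec perf area")

def dse_score(cfg):
    """Target function from Section VI-D: perf * (perf / area)."""
    return cfg.perf ** 2 / cfg.area

candidates = [Config(2, 16, 96, perf=1.0, area=1.0),   # hypothetical points
              Config(3, 16, 32, perf=1.2, area=1.6)]
best = max(candidates, key=dse_score)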

TABLE III: Optimal Configurations of DLA and DLA-BRAMAC for AlexNet and ResNet-34

Model      Precision  DLA Config¹   DSPs  BRAMs  DLA-BRAMAC-2SA Config²  DSPs  BRAMs  DLA-BRAMAC-1DA Config²  DSPs  BRAMs
AlexNet    2-bit      (2, 16, 96)   1152  352    (1+2, 24, 140)          1260  1128   (2+2, 16, 100)          1200  816
           4-bit      (3, 16, 32)   1152  544    (1+2, 16, 100)          1200  1600   (1+1, 12, 130)          1170  1080
           8-bit      (3, 12, 24)   1296  868    (2+2, 10, 50)           1500  1740   (1+1, 8, 100)           1200  1664
ResNet-34  2-bit      (4, 12, 72)   1296  792    (1+2, 16, 140)          840   832    (2+2, 22, 80)           1320  924
           4-bit      (3, 8, 64)    1152  736    (2+2, 12, 70)           1260  972    (1+1, 16, 90)           1080  1056
           8-bit      (3, 4, 64)    1152  1452   (2+2, 6, 65)            1170  1530   (1+1, 12, 65)           1170  1788

¹ The configuration value for DLA has the form (Qvec, Cvec, Kvec).
² The configuration value for DLA-BRAMAC has the form (Qvec1+Qvec2, Cvec, Kvec), where Qvec1 and Qvec2 are the numbers of output features computed by the DSPs and BRAMAC, respectively.

Fig. 13: Comparison between DLA and DLA-BRAMAC for accelerating AlexNet and ResNet at different precisions: (a) performance, (b) utilized DSP-plus-BRAM area, (c) performance per area.

Table III summarizes the optimal configuration for each (accelerator, model, precision) case. The performance and utilized DSP-plus-BRAM area of DLA-BRAMAC, normalized to those of DLA, are shown in Fig. 13. The utilized DSP-plus-BRAM area is calculated based on the area overhead of BRAMAC and the area model from [34]. On average, compared to the baseline DLA for AlexNet, employing BRAMAC-2SA/BRAMAC-1DA achieves a 2.05×/1.7× speedup at the cost of 2.01×/1.52× DSP-plus-BRAM area, giving 1.01×/1.12× performance gains per utilized area. For ResNet-34, employing BRAMAC-2SA/BRAMAC-1DA achieves a lower speedup of 1.33×/1.52× on average at the cost of 1.2×/1.22× DSP-plus-BRAM area, which corresponds to 1.11×/1.25× performance gains per utilized area. The larger DSP-plus-BRAM area is mainly attributed to more BRAM usage for computation and BRAMAC's area overhead.
VII. C ONCLUSION
BRAMAC and the area model from [34]. On average, com-
pared to the baseline DLA for AlexNet, employing BRAMAC- This paper proposes BRAMAC, a compute-in-BRAM ar-
2SA/BRAMAC-1DA achieves 2.05×/1.7× speedup at the cost chitecture for MAC on FPGAs. To the best of our knowl-
of 2.01×/1.52× DSP-plus-BRAM area, giving 1.01×/1.12× edge, BRAMAC is the first compute-in-BRAM architecture
performance gains per utilized area. For ResNet-34, employing that: (1) adopts a hybrid bit-serial & bit-parallel dataflow
BRAMAC-2SA/BRAMAC-1DA achieves a lower speedup of to support variable-precision MAC using 2’s complement
1.33×/1.52× on average at the cost of 1.2×/1.22× DSP-plus- representation, (2) computes in a separate dummy array
BRAM area, which corresponds to 1.11×/1.25× performance which improves the main BRAM array’s utilization efficiency,
gains per utilized area. The larger DSP-plus-BRAM area is (3) employs an embedded finite-state machine to free up
mainly attributed to more BRAM usage for computation and the main BRAM ports during in-memory computation. The
BRAMAC’s area overhead. two proposed variants, BRAMAC-2SA/BRAMAC-1DA, boost
In general, BRAMAC-2SA and BRAMAC-1DA achieve the peak MAC throughput of a large Arria 10 FPGA by
higher speedup for AlexNet compared to ResNet-34 as shown 2.6×/2.1×, 2.3×/2.0×, and 1.9×/1.7× for 2-bit, 4-bit, and
in Fig. 13(a). This is because that BRAMAC is better at 8-bit precisions, respectively at the cost of 6.8%/3.4% in-
supporting a higher Kvec that allows the same input fea- crease in FPGA core area. BRAMAC also improves the
ture to be multiplied by many kernels. The early and most BRAM utilization efficiency by 1.3× and 1.1× compared
compute-intensive residual blocks of ResNet-34 only have to two recent compute-in-BRAM architectures, CCB and
an output channel depth of 64, while the first convolution CoMeFa, respectively while significantly outperforming both
layer of AlexNet has an output channel depth of 96. The architectures on matrix-vector multiplications. Combining
latter gives more freedom for DLA-BRAMAC to optimize its BRAMAC-2SA/BRAMAC-1DA with Intel’s DLA, a tiling-
configuration with high vectorization efficiency. However, a based DNN accelerator, an average speedup of 2.05×/1.7×
higher speedup for AlexNet comes with a larger utilized area and 1.33×/1.52× can be achieved for AlexNet and ResNet-
as illustrated in Fig. 13(b). Comparing the two BRAMAC 34, respectively. With its ability to support both persistent and
variants, BRAMAC-2SA has a lower performance gain per tiling-based DNN acceleration, BRAMAC has the potential
utilized area for all model-precision combinations as observed to be a highly practical and valuable addition to future AI-
from Fig. 13(c). Although the MAC throughput of BRAMAC- optimized FPGAs.
REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems, 2012.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A Large-Scale Hierarchical Image Database," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
[3] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, "Language Models are Few-Shot Learners," in Advances in Neural Information Processing Systems, 2020, pp. 1877–1901.
[4] M. Nagel, M. Fournarakis, R. A. Amjad, Y. Bondarenko, M. van Baalen, and T. Blankevoort, "A White Paper on Neural Network Quantization," arXiv:2106.08295, 2021.
[5] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. G. Howard, H. Adam, and D. Kalenichenko, "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference," in Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704–2713.
[6] H. Wu, "Low Precision Inference on GPU," 2019. [Online]. Available: https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9659-inference-at-reduced-precision-on-gpus.pdf
[7] Nvidia, "INT4 Precision for AI Inference," 2019. [Online]. Available: https://developer.nvidia.com/blog/int4-for-ai-inference/
[8] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, L. Adams, M. Ghandi, S. Heil, P. Patel, A. Sapek, G. Weisz, L. Woods, S. Lanka, S. K. Reinhardt, A. M. Caulfield, E. S. Chung, and D. Burger, "A Configurable Cloud-Scale DNN Processor for Real-Time AI," ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 1–14, 2018.
[9] U. Aydonat, S. O'Connell, D. Capalija, A. C. Ling, and G. R. Chiu, "An OpenCL Deep Learning Accelerator on Arria 10," ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2017.
[10] M. S. Abdelfattah, D. Han, A. Bitar, R. DiCecco, S. O'Connell, N. Shanker, J. Chu, I. Prins, J. Fender, A. C. Ling, and G. R. Chiu, "DLA: Compiler and FPGA Overlay for Neural Network Inference Acceleration," 28th International Conference on Field Programmable Logic and Applications (FPL), pp. 411–4117, 2018.
[11] Intel, "Intel Stratix 10 NX FPGA Overview," 2020. [Online]. Available: https://www.intel.com/content/www/us/en/products/details/fpga/stratix/10/nx.html
[12] Xilinx, "UltraScale Architecture DSP Slice User Guide (UG579 v1.11)," 2021. [Online]. Available: https://docs.xilinx.com/v/u/en-US/ug579-ultrascale-dsp
[13] Intel, "Intel Stratix 10 Variable Precision DSP Blocks User Guide (UG-S10-DSP)," 2021. [Online]. Available: https://www.intel.com/programmable/technical-pdfs/683832.pdf
[14] Achronix, "Speedcore eFPGAs." [Online]. Available: https://www.achronix.com/sites/default/files/docs/Speedcore_eFPGA_Product_Brief_PB028.pdf
[15] A. Boutros, S. Yazdanshenas, and V. Betz, "Embracing Diversity: Enhanced DSP Blocks for Low-Precision Deep Learning on FPGAs," 28th International Conference on Field Programmable Logic and Applications (FPL), pp. 35–42, 2018.
[16] S. Rasoulinezhad, H. Zhou, L. Wang, and P. H. W. Leong, "PIR-DSP: An FPGA DSP Block Architecture for Multi-precision Deep Neural Networks," IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 35–44, 2019.
[17] X. Wang, V. Goyal, J. Yu, V. Bertacco, A. Boutros, E. Nurvitadhi, C. Augustine, R. R. Iyer, and R. Das, "Compute-Capable Block RAMs for Efficient Deep Learning Acceleration on FPGAs," IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 88–96, 2021.
[18] A. Arora, T. Anand, A. Borda, R. Sehgal, B. Hanindhito, J. Kulkarni, and L. K. John, "CoMeFa: Compute-in-Memory Blocks for FPGAs," IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 1–9, 2022.
[19] A. Boutros and V. Betz, "FPGA Architecture: Principles and Progression," IEEE Circuits and Systems Magazine, vol. 21, pp. 4–29, 2021.
[20] M. Eldafrawy, A. Boutros, S. Yazdanshenas, and V. Betz, "FPGA Logic Block Architectures for Efficient Deep Learning Inference," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 13, pp. 1–34, 2020.
[21] Xilinx, "DSP58 Architecture," 2022. [Online]. Available: https://docs.xilinx.com/r/en-US/am004-versal-dsp-engine/DSP58-Architecture
[22] Intel, "Intel Agilex Variable Precision DSP Blocks User Guide," 2021. [Online]. Available: https://www.intel.com/programmable/technical-pdfs/683037.pdf
[23] M. Langhammer, E. Nurvitadhi, S. Gribok, and B. M. Pasca, "Stratix 10 NX Architecture," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 15, pp. 1–32, 2022.
[24] M. Horowitz, "Computing's Energy Problem (and what we can do about it)," IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14, 2014.
[25] D. M. Lewis, D. Cashman, M. Chan, J. Chromczak, G. Lai, A. Lee, T. Vanderhoek, and H. Yu, "Architectural Enhancements in Stratix V," in ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), 2013.
[26] P. Judd, J. Albericio, and A. Moshovos, "Stripes: Bit-Serial Deep Neural Network Computing," 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–12, 2016.
[27] C.-F. Lee, C. Lu, C.-E. Lee, H. Mori, H. Fujiwara, Y.-C. Shih, T.-L. Chou, Y. D. Chih, and T.-Y. J. Chang, "A 12nm 121-TOPS/W 41.6-TOPS/mm2 All Digital Full Precision SRAM-based Compute-in-Memory with Configurable Bit-width For AI Edge Applications," IEEE Symposium on VLSI Technology and Circuits, pp. 24–25, 2022.
[28] Intel, "Intel Arria 10 Core Fabric and General Purpose I/Os Handbook," 2022. [Online]. Available: https://www.intel.com/programmable/technical-pdfs/683461.pdf
[29] J. Choi, K. Nam, A. Canis, J. H. Anderson, S. D. Brown, and T. S. Czajkowski, "Impact of Cache Architecture and Interface on Performance and Area of FPGA-Based Processor/Parallel-Accelerator Systems," IEEE 20th International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 17–24, 2012.
[30] R. Shi, Y. Ding, X. Wei, H. Li, H. Liu, H. K.-H. So, and C. Ding, "FTDL: A Tailored FPGA-Overlay for Deep Learning with High Scalability," 57th ACM/IEEE Design Automation Conference (DAC), pp. 1–6, 2020.
[31] S. Yazdanshenas, K. Tatsumura, and V. Betz, "Don't Forget the Memory: Automatic Block RAM Modelling, Optimization, and Architecture Exploration," ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 115–124, 2017.
[32] Arizona State University, "Predictive Technology Model," 2012. [Online]. Available: http://ptm.asu.edu/
[33] Intel, "Arria 10 Device Overview," 2022. [Online]. Available: https://www.intel.com/programmable/technical-pdfs/683332.pdf
[34] R. Rashid, J. G. Steffan, and V. Betz, "Comparing Performance, Productivity and Scalability of the TILT Overlay Processor to OpenCL HLS," International Conference on Field-Programmable Technology (FPT), pp. 20–27, 2014.
[35] University of California, Berkeley, "EE241, Lecture 18: Adders," 2003. [Online]. Available: http://bwrcs.eecs.berkeley.edu/Classes/icdesign/ee241_s03/Lectures/lecture18-adders.pdf
[36] J. Sommer, M. A. Özkan, O. Keszocze, and J. Teich, "DSP-Packing: Squeezing Low-Precision Arithmetic into FPGA DSP Blocks," arXiv:2203.11028, 2022.
[37] A. Vaswani, N. M. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is All You Need," arXiv:1706.03762, 2017.
